文章目录

2. Unified Detection
- 2.1. Network Design
- 2.2. Training
- 2.3. Inference
- 2.4. Limitations of YOLO

2. Unified Detection

We unify the separate components of object detection into a single neural network. Our network uses features from the entire image to predict each bounding box. It also predicts all bounding boxes across all classes for an image simultaneously. This means our network reasons globally about the full image and all the objects in the image. The YOLO design enables end-to-end training and real-time speeds while maintaining high average precision.

我们将目标检测的独立组件统一到一个单一的神经网络中。我们的网络使用来自整个图像的特征来预测每个边界框，并同时预测图像的所有类别，这意味着我们的网络对所有目标做全局处理。YOLO 设计支持端到端训练和实时速度，同时保持较高的平均精度。

Our system divides the input image into an $S \times S$ grid. If the center of an object falls into a grid cell, that grid cell is responsible for detecting that object.

我们的系统把图像分割为 $S \times S$ 的网格，如果对象的中心落入某网格单元中，则该网格单元负责检测该对象。

Each grid cell predicts $B$ bounding boxes and confidence scores for those boxes. These confidence scores reflect how confident the model is that the box contains an object and also how accurate it thinks the box is that it predicts. Formally we define confidence as $Pr(Object) \times IOU_{pred}^{truth}$. If no object exists in that cell, the confidence scores should be zero. Otherwise we want the confidence score to equal the intersection over union (IOU) between the predicted box and the ground truth.

每个网格单元预测 $B$ 个边界框和它们的置信度。这些置信度表示了模型对边界框包含对象的置信度，以及它所认为的预测框的准确度。我们将置信度定义为 $Pr(Object) \times IOU_{pred}^{truth}$。如果该单元格中不存在对象，则置信度得分应为零。否则，我们希望置信度得分等于预测框和真实值的交并比（IOU）。
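The confidence target described above can be illustrated with a small sketch. It assumes boxes are given as corner coordinates `(x_min, y_min, x_max, y_max)`; the function names `iou` and `confidence_target` are hypothetical helpers for illustration, not code from the paper.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes (x_min, y_min, x_max, y_max)."""
    # Corners of the overlapping rectangle, if any.
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def confidence_target(pred_box, truth_box, has_object):
    """Target confidence: Pr(Object) * IOU, with Pr(Object) taken as 0 or 1."""
    return iou(pred_box, truth_box) if has_object else 0.0
```

With this definition, a cell without an object always has a target of zero, while a cell with an object is pushed toward the actual overlap of its predicted box.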

Each bounding box consists of 5 predictions: $x, y, w, h$, and confidence. The $(x, y)$ coordinates represent the center of the box relative to the bounds of the grid cell. The width and height are predicted relative to the whole image. Finally the confidence prediction represents the IOU between the predicted box and any ground truth box.

每个边界框由 5 个部分组成：$x, y, w, h$ 和置信度。$(x, y)$ 坐标表示边界框中心相对于网格单元边界的位置。宽度和高度是相对于整个图像预测的。最后，置信度预测表示预测框和任何真实框之间的 IOU。
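As a rough illustration of this parametrization, the sketch below converts a ground-truth box given in pixels into the cell-relative $(x, y)$ and image-relative $(w, h)$ values described above. The function name `encode_box` and its argument layout are assumptions made for this example.

```python
def encode_box(cx, cy, bw, bh, img_w, img_h, S=7):
    """Encode a box (center cx, cy and size bw, bh, all in pixels).

    Returns (row, col, (x, y, w, h)) where (row, col) is the grid cell
    containing the box center, x and y are offsets inside that cell in
    [0, 1), and w and h are fractions of the image size.
    """
    # Position of the center measured in grid-cell units.
    gx = cx / img_w * S
    gy = cy / img_h * S
    # Clamp so a center exactly on the right/bottom edge stays in the grid.
    col = min(int(gx), S - 1)
    row = min(int(gy), S - 1)
    x = gx - col
    y = gy - row
    w = bw / img_w
    h = bh / img_h
    return row, col, (x, y, w, h)
```

For a 448 × 448 image, a box centered at pixel (224, 112) falls in column 3, row 1 of a 7 × 7 grid, halfway across that cell horizontally.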

Each grid cell also predicts $C$ conditional class probabilities, $Pr(Class_i | Object)$. These probabilities are conditioned on the grid cell containing an object. We only predict one set of class probabilities per grid cell, regardless of the number of boxes $B$.

每个网格单元还负责预测 $C$ 个条件类概率 $Pr(Class_i | Object)$。这些概率以网格单元包含对象为条件。无论框的数量 $B$ 是多少，我们对每个网格单元只预测一组类概率。

At test time we multiply the conditional class probabilities and the individual box confidence predictions,

在测试时，我们将条件类概率和单个框置信度预测相乘，

$$
Pr(Class_i | Object) \times Pr(Object) \times IOU_{pred}^{truth} = Pr(Class_i) \times IOU_{pred}^{truth}
$$

which gives us class-specific confidence scores for each box. These scores encode both the probability of that class appearing in the box and how well the predicted box fits the object.

这为我们提供了每个框特定于类的置信度分数。这些分数编码了该类出现在框中的概率以及预测的框与对象的匹配程度。
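A minimal sketch of this test-time multiplication for a single grid cell follows. The function name `class_specific_scores` and the plain-list representation are assumptions for illustration only.

```python
def class_specific_scores(class_probs, box_confidences):
    """Combine one cell's predictions into per-box, per-class scores.

    class_probs: the C conditional probabilities Pr(Class_i | Object).
    box_confidences: the B box confidences for the same cell.
    Returns a B x C nested list of class-specific confidence scores.
    """
    return [[p * conf for p in class_probs] for conf in box_confidences]


# One cell with C = 3 classes and B = 2 boxes.
scores = class_specific_scores([0.5, 0.25, 0.25], [0.5, 0.0])
```

Because the class probabilities are shared by all boxes in a cell, the two rows of `scores` differ only by the box confidence factor.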

Figure 2: The Model. Our system models detection as a regression problem. It divides the image into an $S \times S$ grid and for each grid cell predicts $B$ bounding boxes, confidence for those boxes, and $C$ class probabilities. These predictions are encoded as an $S \times S \times (B \times 5 + C)$ tensor.
译：
图2：模型。我们的系统把检测建模为回归问题。它把图像分割为 $S \times S$ 的网格，每个网格单元预测 $B$ 个边界框、这些边界框的置信度，以及 $C$ 个类别概率。这些预测被编码为 $S \times S \times (B \times 5 + C)$ 的张量。

For evaluating YOLO on PASCAL VOC, we use S = 7, B = 2. PASCAL VOC has 20 labelled classes so C = 20. Our final prediction is a 7 × 7 × 30 tensor.

为了在 PASCAL VOC 上评估 YOLO，我们使用 S = 7，B = 2。PASCAL VOC 有 20 个标记类别，因此 C = 20。我们的最终预测是 7 × 7 × 30 张量。
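The arithmetic behind the 7 × 7 × 30 figure can be checked directly. The helper name `prediction_shape` is hypothetical.

```python
def prediction_shape(S, B, C):
    """Shape of the output tensor: S x S cells, each with B*5 + C values."""
    return (S, S, B * 5 + C)


shape = prediction_shape(S=7, B=2, C=20)   # PASCAL VOC settings
num_values = shape[0] * shape[1] * shape[2]
num_boxes = 7 * 7 * 2                      # B boxes for each of the S*S cells
```

Each cell contributes 2 × 5 box values plus 20 class probabilities, giving 30 values per cell and 98 boxes per image.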

2.1. Network Design

We implement this model as a convolutional neural network and evaluate it on the PASCAL VOC detection dataset [9]. The initial convolutional layers of the network extract features from the image while the fully connected layers predict the output probabilities and coordinates.

我们将此模型实现为卷积神经网络，并在 PASCAL VOC 检测数据集 [9] 上对其进行评估。网络的初始卷积层从图像中提取特征，而全连接层预测输出概率和坐标。

Our network architecture is inspired by the GoogLeNet model for image classification [33]. Our network has 24 convolutional layers followed by 2 fully connected layers. Instead of the inception modules used by GoogLeNet, we simply use 1 × 1 reduction layers followed by 3 × 3 convolutional layers, similar to Lin et al. [22]. The full network is shown in Figure 3.

我们的网络架构受到用于图像分类的 GoogLeNet 模型的启发 [33]。我们的网络有 24 个卷积层，后续跟随 2 个全连接层。我们不使用 GoogLeNet 使用的初始模块，而是简单地使用 1 × 1 缩减层和 3 × 3 卷积层，类似于 Lin 等人 [22]。完整的网络如图 3 所示。

We also train a fast version of YOLO designed to push the boundaries of fast object detection. Fast YOLO uses a neural network with fewer convolutional layers (9 instead of 24) and fewer filters in those layers. Other than the size of the network, all training and testing parameters are the same between YOLO and Fast YOLO.

我们还训练了一个快速版本的 YOLO，旨在突破快速目标检测的界限。 Fast YOLO 使用具有较少卷积层（9 个而不是 24 个）和这些层中的过滤器较少的神经网络。除了网络的大小之外，YOLO 和 Fast YOLO 的所有训练和测试参数都是相同的。

Figure 3: The Architecture. Our detection network has 24 convolutional layers followed by 2 fully connected layers. Alternating 1 × 1 convolutional layers reduce the features space from preceding layers. We pretrain the convolutional layers on the ImageNet classification task at half the resolution (224 × 224 input image) and then double the resolution for detection.
译：
图3：结构。我们的检测网络有 24 个卷积层，后接 2 个全连接层。交替出现的 $1 \times 1$ 卷积层用于缩减前面各层的特征空间。我们在 ImageNet 分类任务上以一半的分辨率（224 × 224 输入图像）预训练卷积层，然后将分辨率加倍用于检测。

The final output of our network is the 7 × 7 × 30 tensor of predictions.

我们网络的最终输出是 7 × 7 × 30 的预测张量。

2.2. Training

We pretrain our convolutional layers on the ImageNet 1000-class competition dataset [29]. For pretraining we use the first 20 convolutional layers from Figure 3 followed by an average-pooling layer and a fully connected layer. We train this network for approximately a week and achieve a single crop top-5 accuracy of 88% on the ImageNet 2012 validation set, comparable to the GoogLeNet models in Caffe’s Model Zoo [24].

我们在 ImageNet 1000 级竞赛数据集 [29] 上预训练我们的卷积层。对于预训练，我们使用图 3 中的前 20 个卷积层，然后是平均池化层和全连接层。我们对该网络进行了大约一周的训练，并在 ImageNet 2012 验证集上实现了 88% 的单次裁剪 top-5 准确率，与 Caffe 的 Model Zoo [24] 中的 GoogLeNet 模型相当。

We then convert the model to perform detection. Ren et al. show that adding both convolutional and connected layers to pretrained networks can improve performance [28]. Following their example, we add four convolutional layers and two fully connected layers with randomly initialized weights. Detection often requires fine-grained visual information so we increase the input resolution of the network from 224 × 224 to 448 × 448.

然后我们转换模型以执行检测。 Ren等人表明将卷积层和连接层添加到预训练网络可以提高性能 [28]。按照他们的例子，我们添加了四个卷积层和两个具有随机初始化权重的全连接层。检测通常需要细粒度的视觉信息，因此我们将网络的输入分辨率从 224 × 224 增加到 448 × 448。

Our final layer predicts both class probabilities and bounding box coordinates. We normalize the bounding box width and height by the image width and height so that they fall between 0 and 1. We parametrize the bounding box x and y coordinates to be offsets of a particular grid cell location so they are also bounded between 0 and 1.

我们的最后一层预测类别概率和边界框坐标。我们通过图像的宽度和高度对边界框的宽度和高度进行归一化，使它们落在 0 和 1 之间。我们将边界框的 x 和 y 坐标参数化为特定网格单元位置的偏移量，因此它们也被限制在 0 和 1 之间。

We use a linear activation function for the final layer and all other layers use the following leaky rectified linear activation:

我们对最后一层使用线性激活函数，所有其他层使用 Leaky ReLu 作为激活函数：

$$
\phi(x) = \begin{cases} x, & \text{if } x > 0 \\ 0.1x, & \text{otherwise} \end{cases}
$$
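For a single scalar input, this activation can be written as a short function. The name `leaky_relu` and the `negative_slope` parameter are naming choices for this sketch; the slope value 0.1 comes from the formula above.

```python
def leaky_relu(x, negative_slope=0.1):
    """Leaky rectified linear activation: x if x > 0, else 0.1 * x."""
    return x if x > 0 else negative_slope * x
```

Unlike a plain ReLU, negative inputs keep a small non-zero output, so they still pass a gradient during training.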

We optimize for sum-squared error in the output of our model. We use sum-squared error because it is easy to optimize, however it does not perfectly align with our goal of maximizing average precision. It weights localization error equally with classification error which may not be ideal. Also, in every image many grid cells do not contain any object. This pushes the “confidence” scores of those cells towards zero, often overpowering the gradient from cells that do contain objects. This can lead to model instability, causing training to diverge early on.

我们针对模型输出中的平方和误差进行优化。我们使用平方和误差是因为它很容易优化，但是它并不完全符合我们最大化平均精度的目标。它将定位误差与分类误差同等加权，这可能并不理想。此外，在每张图像中，许多网格单元不包含任何对象，这会把这些单元的“置信度”分数推向零，往往压倒了来自包含对象的单元的梯度。这可能导致模型不稳定，使训练在早期就发散。

To remedy this, we increase the loss from bounding box coordinate predictions and decrease the loss from confidence predictions for boxes that don’t contain objects. We use two parameters, $\lambda_{coord}$ and $\lambda_{noobj}$ to accomplish this. We set $\lambda_{coord} = 5$ and $\lambda_{noobj} = 0.5$.

为了解决这个问题，我们增加了边界框坐标预测的损失，并减少了不包含对象的框的置信度预测的损失。我们使用两个参数 $\lambda_{coord}$ 和 $\lambda_{noobj}$ 来实现这一点。我们设置 $\lambda_{coord} = 5$ 和 $\lambda_{noobj} = 0.5$。

Sum-squared error also equally weights errors in large boxes and small boxes. Our error metric should reflect that small deviations in large boxes matter less than in small boxes. To partially address this we predict the square root of the bounding box width and height instead of the width and height directly.

平方和误差还对大框和小框中的误差同等加权。我们的误差度量应该反映出大框中的小偏差不如小框中的小偏差重要。为了部分解决这个问题，我们预测边界框宽度和高度的平方根，而不是直接预测宽度和高度。
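A small numeric sketch shows why the square root helps. The widths used here (0.80 and 0.10 of the image, each over-predicted by 0.05) are made-up illustrative values, and the helper names are hypothetical.

```python
import math


def squared_error(pred, truth):
    """Plain squared error on the raw value."""
    return (pred - truth) ** 2


def sqrt_squared_error(pred, truth):
    """Squared error on the square roots, as used for w and h."""
    return (math.sqrt(pred) - math.sqrt(truth)) ** 2


# The same absolute deviation (0.05) on a large box and on a small box.
large_plain = squared_error(0.85, 0.80)
small_plain = squared_error(0.15, 0.10)
large_sqrt = sqrt_squared_error(0.85, 0.80)
small_sqrt = sqrt_squared_error(0.15, 0.10)
```

On raw widths both deviations cost the same, whereas on square roots the small box is penalized more than the large one, which is the direction the text asks for. The correction is only partial, as the text notes.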

YOLO predicts multiple bounding boxes per grid cell. At training time we only want one bounding box predictor to be responsible for each object. We assign one predictor to be “responsible” for predicting an object based on which prediction has the highest current IOU with the ground truth. This leads to specialization between the bounding box predictors. Each predictor gets better at predicting certain sizes, aspect ratios, or classes of object, improving overall recall.

YOLO 预测每个网格单元有多个边界框。在训练时，我们只希望一个边界框预测器负责单个对象。而使用哪个预测器来“负责”预测的唯一条件是根据它是否具有与真实值最高的IOU。这使得边界框预测器专业化，每个预测器在预测特定大小、纵横比或对象类别方面会变得更好，从而提高整体召回率。

During training we optimize the following, multi-part loss function:

在训练期间，我们优化了以下多部分损失函数：

$$
\begin{aligned}
&\lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbf{1}_{ij}^{obj} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
&+ \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbf{1}_{ij}^{obj} \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
&+ \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbf{1}_{ij}^{obj} (C_i - \hat{C}_i)^2 \\
&+ \lambda_{noobj} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbf{1}_{ij}^{noobj} (C_i - \hat{C}_i)^2 \\
&+ \sum_{i=0}^{S^2} \mathbf{1}_{i}^{obj} \sum_{c \in classes} (p_i(c) - \hat{p}_i(c))^2
\end{aligned}
$$

where $\mathbf{1}_{i}^{obj}$ denotes if object appears in cell $i$ and $\mathbf{1}_{ij}^{obj}$ denotes that the $j$th bounding box predictor in cell $i$ is “responsible” for that prediction.

其中 $\mathbf{1}_{i}^{obj}$ 表示对象是否出现在单元格 $i$ 中，$\mathbf{1}_{ij}^{obj}$ 表示单元格 $i$ 中的第 $j$ 个边界框预测器对该预测“负责”。

Note that the loss function only penalizes classification error if an object is present in that grid cell (hence the conditional class probability discussed earlier). It also only penalizes bounding box coordinate error if that predictor is “responsible” for the ground truth box (i.e. has the highest IOU of any predictor in that grid cell).

请注意，如果该网格单元中存在对象，则损失函数只会惩罚分类错误（因此是前面讨论的条件类概率）。如果该预测器对真实框“负责”（即具有该网格单元中任何预测器的最高 IOU），它也只会惩罚边界框坐标误差。
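To make the structure of the loss concrete, here is a sketch of the contribution of a single grid cell, written in plain Python rather than a tensor library. The function name `cell_loss`, the dictionary layout of `target`, and the assumption that predicted widths and heights are non-negative are all choices made for this example; the paper does not give an implementation.

```python
import math

LAMBDA_COORD = 5.0   # weight for coordinate errors (value from the text)
LAMBDA_NOOBJ = 0.5   # weight for confidence errors of non-responsible boxes


def cell_loss(pred_boxes, pred_class_probs, target=None):
    """Loss contribution of one grid cell.

    pred_boxes: list of B tuples (x, y, w, h, confidence), with w, h >= 0.
    pred_class_probs: list of C predicted class probabilities.
    target: None if no object is centered in this cell, otherwise a dict
        with keys "box" (x, y, w, h), "confidence" (target confidence of
        the responsible predictor), "class_probs" (C target values) and
        "responsible" (index of the predictor with the highest IOU).
    """
    responsible = target["responsible"] if target is not None else None
    loss = 0.0
    for j, (x, y, w, h, conf) in enumerate(pred_boxes):
        if j == responsible:
            tx, ty, tw, th = target["box"]
            # Center coordinates and square-rooted size, weighted by lambda_coord.
            loss += LAMBDA_COORD * ((x - tx) ** 2 + (y - ty) ** 2)
            loss += LAMBDA_COORD * ((math.sqrt(w) - math.sqrt(tw)) ** 2
                                    + (math.sqrt(h) - math.sqrt(th)) ** 2)
            # Confidence of the responsible predictor.
            loss += (conf - target["confidence"]) ** 2
        else:
            # Non-responsible predictors are pushed toward zero confidence.
            loss += LAMBDA_NOOBJ * conf ** 2
    if target is not None:
        # Classification error only counts when the cell contains an object.
        loss += sum((p - t) ** 2
                    for p, t in zip(pred_class_probs, target["class_probs"]))
    return loss
```

Summing `cell_loss` over all $S^2$ cells would give the full loss under these assumptions. An empty cell contributes only the down-weighted confidence terms.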

We train the network for about 135 epochs on the training and validation data sets from PASCAL VOC 2007 and 2012. When testing on 2012 we also include the VOC 2007 test data for training. Throughout training we use a batch size of 64, a momentum of 0.9 and a decay of 0.0005.

我们在来自 PASCAL VOC 2007 和 2012 的训练和验证数据集上对网络进行了大约 135 个epochs的训练。在对 2012 测试时，我们还包括 VOC 2007 测试数据进行训练。在整个训练过程中，我们使用 64 的批大小、0.9 的动量和 0.0005 的衰减。

Our learning rate schedule is as follows: For the first epochs we slowly raise the learning rate from $10^{-3}$ to $10^{-2}$. If we start at a high learning rate our model often diverges due to unstable gradients. We continue training with $10^{-2}$ for 75 epochs, then $10^{-3}$ for 30 epochs, and finally $10^{-4}$ for 30 epochs.

我们的学习率计划如下：在最初的几个 epoch 中，我们慢慢地将学习率从 $10^{-3}$ 提高到 $10^{-2}$。如果我们以高学习率开始，我们的模型通常会因梯度不稳定而发散。我们继续用 $10^{-2}$ 训练 75 个 epoch，然后用 $10^{-3}$ 训练 30 个 epoch，最后用 $10^{-4}$ 训练 30 个 epoch。
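The schedule can be sketched as a function of the epoch. The text does not say how many epochs the warm-up lasts or what shape the ramp has, so the `warmup_epochs` parameter, its default of 1, and the linear ramp are assumptions of this example, as is counting the 75 epochs from the end of the warm-up.

```python
def learning_rate(epoch, warmup_epochs=1.0):
    """Learning rate at a given (possibly fractional) epoch.

    Assumed linear warm-up from 1e-3 to 1e-2, then 75 epochs at 1e-2,
    30 epochs at 1e-3, and 1e-4 afterwards.
    """
    if epoch < warmup_epochs:
        fraction = epoch / warmup_epochs
        return 1e-3 + fraction * (1e-2 - 1e-3)
    elapsed = epoch - warmup_epochs
    if elapsed < 75:
        return 1e-2
    if elapsed < 75 + 30:
        return 1e-3
    return 1e-4
```

The three constant phases add up to the roughly 135 epochs of training mentioned in the text.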

To avoid overfitting we use dropout and extensive data augmentation. A dropout layer with rate = 0.5 after the first connected layer prevents co-adaptation between layers [18]. For data augmentation we introduce random scaling and translations of up to 20% of the original image size. We also randomly adjust the exposure and saturation of the image by up to a factor of 1.5 in the HSV color space.

为了避免过度拟合，我们使用 dropout 和广泛的数据增强。在第一个连接层之后具有 rate = 0.5 的 dropout 层可防止层之间的协同适应 [18]。对于数据增强，我们引入了高达原始图像大小 20% 的随机缩放和平移。我们还在 HSV 色彩空间中随机调整图像的曝光度和饱和度，最高可达 1.5 倍。

2.3. Inference

Just like in training, predicting detections for a test image only requires one network evaluation. On PASCAL VOC the network predicts 98 bounding boxes per image and class probabilities for each box. YOLO is extremely fast at test time since it only requires a single network evaluation, unlike classifier-based methods.

就像在训练中一样，预测测试图像的检测只需要一次网络评估。在 PASCAL VOC 上，网络预测每个图像的 98 个边界框和每个框的类别概率。 YOLO 在测试时非常快，因为它只需要一次网络评估，这与基于分类器的方法不同。

The grid design enforces spatial diversity in the bounding box predictions. Often it is clear which grid cell an object falls into and the network only predicts one box for each object. However, some large objects or objects near the border of multiple cells can be well localized by multiple cells. Non-maximal suppression can be used to fix these multiple detections. While not critical to performance as it is for R-CNN or DPM, non-maximal suppression adds 2-3% in mAP.

网格设计在边界框预测中强制执行空间多样性。通常很清楚一个对象属于哪个网格单元，并且网络只为每个对象预测一个框。然而，一些大的物体或靠近多个网格边界的物体可以被多个网格很好地定位。非极大值抑制可用于修复这些多重检测。虽然非极大值抑制对 YOLO 性能的影响不像对 R-CNN 或 DPM 那样关键，但它能使 mAP 提高 2-3%。
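A common greedy form of non-maximal suppression is sketched below for the detections of a single class. The text does not specify the variant or the overlap threshold, so the greedy procedure, the `iou_threshold` default of 0.5, and the `(score, box)` tuple layout are assumptions of this example.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes (x_min, y_min, x_max, y_max)."""
    iw = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    ih = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = iw * ih
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(detections, iou_threshold=0.5):
    """Greedy NMS over a list of (score, box) pairs for one class.

    Repeatedly keeps the highest-scoring detection and discards the
    remaining ones that overlap it by more than iou_threshold.
    """
    remaining = sorted(detections, key=lambda d: d[0], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)
        kept.append(best)
        remaining = [d for d in remaining
                     if iou(best[1], d[1]) <= iou_threshold]
    return kept


detections = [
    (0.9, (0, 0, 10, 10)),
    (0.8, (1, 1, 11, 11)),    # overlaps the first box heavily
    (0.7, (20, 20, 30, 30)),  # separate object
]
kept = non_max_suppression(detections)
```

Here the second detection is dropped as a duplicate of the first, while the third survives because it does not overlap either.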

2.4. Limitations of YOLO

YOLO imposes strong spatial constraints on bounding box predictions since each grid cell only predicts two boxes and can only have one class. This spatial constraint limits the number of nearby objects that our model can predict. Our model struggles with small objects that appear in groups, such as flocks of birds.

YOLO 对边界框预测施加了很强的空间约束，因为每个网格单元只预测两个框并且只能有一个类。这种空间约束限制了我们的模型可以预测的附近物体的数量。我们的模型在处理成群出现的小物体时遇到了困难，例如成群的鸟。

Since our model learns to predict bounding boxes from data, it struggles to generalize to objects in new or unusual aspect ratios or configurations. Our model also uses relatively coarse features for predicting bounding boxes since our architecture has multiple downsampling layers from the input image.

由于我们的模型学习从数据中预测边界框，因此它很难泛化到具有新的或不寻常的纵横比或配置的对象。我们的模型还使用相对粗糙的特征来预测边界框，因为我们的架构具有来自输入图像的多个下采样层。

Finally, while we train on a loss function that approximates detection performance, our loss function treats errors the same in small bounding boxes versus large bounding boxes. A small error in a large box is generally benign but a small error in a small box has a much greater effect on IOU. Our main source of error is incorrect localizations.

最后，虽然我们训练所用的损失函数近似于检测性能，但它对小边界框和大边界框中的误差一视同仁。大框的小错误通常是良性的，但小框的小错误对 IOU 的影响要大得多。我们的主要错误来源是不正确的定位。

论文研读 —— 4. You Only Look Once Unified, Real-Time Object Detection (2/3)