YOLOv1 Paper: Translation and Commentary

《You Only Look Once: Unified, Real-Time Object Detection》

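This 2015 paper on object detection is the starting point of the YOLO series and laid the theoretical groundwork for the later versions. YOLO is a popular object detection algorithm that is fast and structurally simple; other well-known detectors include R-CNN, Faster R-CNN, and SSD. Object detection has made major progress in recent years, and the popular algorithms fall into two groups. The first is the R-CNN family based on region proposals (R-CNN, Fast R-CNN, Faster R-CNN). These are two-stage methods: they first generate region proposals with a heuristic (selective search) or a CNN (RPN), and then run classification and regression on those proposals. The second group consists of one-stage methods such as YOLO and SSD, which use a single CNN to predict object classes and locations directly. The first group is more accurate but slower; the second is faster but less accurate. This post covers YOLOv1, whose full title is You Only Look Once: Unified, Real-Time Object Detection. The title sums up the method well: "You Only Look Once" means a single CNN pass is enough, "Unified" means one framework that predicts end to end, and "Real-Time" reflects its speed. Below, each English passage is followed by a translation and, where useful, commentary. Most of the translation comes from Google Translate; the more technical passages are translated freely based on my own understanding.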

Abstract:

We present YOLO, a new approach to object detection. Prior work on object detection repurposes classifiers to perform detection. Instead, we frame object detection as a regression problem to spatially separated bounding boxes and associated class probabilities. A single neural network predicts bounding boxes and class probabilities directly from full images in one evaluation. Since the whole detection pipeline is a single network, it can be optimized end-to-end directly on detection performance.

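Translation: YOLO is a new approach to object detection. Earlier work generally recasts detection as a classification task. YOLO instead treats detection as a regression problem over spatially separated bounding boxes and their class probabilities. A single network predicts bounding boxes and class probabilities directly from the full image in one evaluation. Because the whole detection pipeline is one network, it can be optimized end to end directly on detection performance.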

Our unified architecture is extremely fast. Our base YOLO model processes images in real-time at 45 frames per second. A smaller version of the network, Fast YOLO, processes an astounding 155 frames per second while still achieving double the mAP of other real-time detectors. Compared to state-of-the-art detection systems, YOLO makes more localization errors but is less likely to predict false positives on background. Finally, YOLO learns very general representations of objects. It outperforms other detection methods, including DPM and R-CNN, when generalizing from natural images to other domains like artwork.

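Translation: YOLO is extremely fast. The base model processes images in real time at 45 frames per second. The smaller Fast YOLO reaches 155 frames per second while still achieving twice the mAP of other real-time detectors. Compared with state-of-the-art detection systems, YOLO makes more localization errors but is less likely to predict false positives on background (objects that are not there). YOLO also learns more general representations of objects than detectors such as DPM and R-CNN, so it transfers better from natural images to other domains such as artwork.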

1. Introduction

Humans glance at an image and instantly know what objects are in the image, where they are, and how they interact. The human visual system is fast and accurate, allowing us to perform complex tasks like driving with little conscious thought. Fast, accurate algorithms for object detection would allow computers to drive cars without specialized sensors, enable assistive devices to convey real-time scene information to human users, and unlock the potential for general purpose, responsive robotic systems.

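Translation: Humans glance at an image and immediately know what objects it contains, where they are, and how they interact. The human visual system is fast and accurate, which lets us perform complex tasks such as driving with little conscious thought. Fast, accurate object detection algorithms would let computers drive cars without specialized sensors, let assistive devices convey real-time scene information to human users, and unlock the potential of general-purpose, responsive robotic systems.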

Current detection systems repurpose classifiers to perform detection. To detect an object, these systems take a classifier for that object and evaluate it at various locations and scales in a test image. Systems like deformable parts models (DPM) use a sliding window approach where the classifier is run at evenly spaced locations over the entire image.

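Translation: Current mainstream detection systems use classifiers to perform detection. To detect a class of object, they take a classifier for that class and evaluate it at various locations and scales in the test image. Systems such as DPM use a sliding-window approach in which the classifier runs at evenly spaced locations over the whole image.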

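More recent approaches like R-CNN use region proposal methods to first generate potential bounding boxes in an image and then run a classifier on these proposed boxes. After classification, post processing is used to refine the bounding boxes, eliminate duplicate detections, and rescore the boxes based on other objects in the scene. These complex pipelines are slow and hard to optimize because each individual component must be trained separately.

Translation: More recent approaches such as R-CNN use region proposal methods: they first generate candidate bounding boxes in the image and then run a classifier on the proposed boxes. After classification, post-processing refines the boxes, removes duplicate detections, and rescores boxes based on other objects in the scene. These complex pipelines are slow and hard to optimize because each component must be trained separately.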

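Commentary: Sliding-window detection is conceptually simple: it turns detection into image classification. Windows of different sizes and aspect ratios slide over the whole image with a fixed stride, and each window's region is classified (in effect, learning whether the region is a target object or background), as shown in the figure below.
In practice the size of the target is unknown, so windows of many sizes and aspect ratios are needed, together with a suitable stride. This produces a very large number of sub-regions, each of which must go through the classifier, so the computational cost is high and the classifier has to stay simple to keep the speed acceptable. One remedy is to reduce the number of sub-regions to classify, which is what R-CNN (region-proposal-based CNN) does: it uses selective search to find the sub-regions most likely to contain an object (region proposals). This can be seen as a heuristic filter that discards most sub-regions and so improves efficiency. Since this post is not about R-CNN, I will not go into further detail.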

We reframe object detection as a single regression problem, straight from image pixels to bounding box coordinates and class probabilities. Using our system, you only look once (YOLO) at an image to predict what objects are present and where they are.

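Translation: YOLO (you only look once) treats object detection as a regression problem, going directly from the pixels of the whole image to bounding box coordinates, the confidence that each box contains an object, and class probabilities. With YOLO, an image only needs to be looked at once to predict which objects are present and where they are.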

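YOLO is refreshingly simple: see Figure 1. A single convolutional network simultaneously predicts multiple bounding boxes and class probabilities for those boxes. YOLO trains on full images and directly optimizes detection performance. This unified model has several benefits over traditional methods of object detection.

Translation: YOLO is refreshingly simple (see Figure 1). A single convolutional network simultaneously predicts multiple bounding boxes and the class probabilities for those boxes. YOLO trains on full images and directly optimizes detection performance. This unified model has several advantages over traditional object detection methods.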

First, YOLO is extremely fast. Since we frame detection as a regression problem we don’t need a complex pipeline. We simply run our neural network on a new image at test time to predict detections. Our base network runs at 45 frames per second with no batch processing on a Titan X GPU and a fast version runs at more than 150 fps. This means we can process streaming video in real-time with less than 25 milliseconds of latency. Furthermore, YOLO achieves more than twice the mean average precision of other real-time systems. For a demo of our system running in real-time on a webcam please see our project webpage: http://pjreddie.com/yolo/.

Translation: First, YOLO is extremely fast. Because detection is framed as a regression problem, no complex pipeline is needed: at test time the network is simply run on a new image to predict detections. On a Titan X GPU with no batch processing, the base YOLO network runs at 45 frames per second, and the fast version runs at more than 150 fps. This means YOLO can process streaming video in real time with less than 25 milliseconds of latency. In addition, YOLO achieves more than twice the mean average precision of other real-time systems. For a demo of the system running in real time on a webcam, see the project webpage: http://pjreddie.com/yolo/.

Second, YOLO reasons globally about the image when making predictions. Unlike sliding window and region proposal-based techniques, YOLO sees the entire image during training and test time so it implicitly encodes contextual information about classes as well as their appearance. Fast R-CNN, a top detection method, mistakes background patches in an image for objects because it can’t see the larger context. YOLO makes less than half the number of background errors compared to Fast R-CNN.

Translation: Second, YOLO reasons globally about the image when making predictions. Unlike sliding-window and region-proposal-based techniques, YOLO sees the entire image during both training and testing, so it implicitly encodes contextual information about classes as well as their appearance. Fast R-CNN, a top detection method, mistakes background patches for objects because it cannot see the larger context. YOLO makes fewer than half as many background errors as Fast R-CNN.

Third, YOLO learns generalizable representations of objects. When trained on natural images and tested on artwork, YOLO outperforms top detection methods like DPM and R-CNN by a wide margin. Since YOLO is highly generalizable it is less likely to break down when applied to new domains or unexpected inputs.

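Translation: Third, YOLO learns generalizable representations of objects. When trained on natural images and tested on artwork, YOLO outperforms top detection methods such as DPM and R-CNN by a wide margin. Because it generalizes well, it is less likely to break down when applied to new domains or unexpected inputs.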

YOLO still lags behind state-of-the-art detection systems in accuracy. While it can quickly identify objects in images it struggles to precisely localize some objects, especially small ones. We examine these tradeoffs further in our experiments. All of our training and testing code is open source. A variety of pretrained models are also available to download.

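Translation: YOLO still lags behind state-of-the-art detection systems in accuracy. Although it can quickly identify objects in images, it struggles to localize some objects precisely, especially small ones. The paper examines these tradeoffs further in its experiments. All training and testing code is open source, and a variety of pretrained models are available for download.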

2. Unified Detection

We unify the separate components of object detection into a single neural network. Our network uses features from the entire image to predict each bounding box. It also predicts all bounding boxes across all classes for an image simultaneously. This means our network reasons globally about the full image and all the objects in the image. The YOLO design enables end-to-end training and realtime speeds while maintaining high average precision.

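Translation: YOLO unifies the separate components of object detection into a single neural network. The network uses features from the entire image to predict each bounding box, and it predicts all bounding boxes across all classes for an image simultaneously. This means the network reasons globally about the full image and all the objects in it. The YOLO design enables end-to-end training and real-time speed while maintaining high average precision.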

Our system divides the input image into an S × S grid. If the center of an object falls into a grid cell, that grid cell is responsible for detecting that object.

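Translation: The system divides the input image into an S × S grid. If the center of an object falls into a grid cell, that grid cell is responsible for detecting the object.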

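Each grid cell predicts B bounding boxes and confidence scores for those boxes. These confidence scores reflect how confident the model is that the box contains an object and also how accurate it thinks the box is that it predicts. Formally we define confidence as $\Pr(\text{Object}) * \text{IOU}^{\text{truth}}_{\text{pred}}$. If no object exists in that cell, the confidence scores should be zero. Otherwise we want the confidence score to equal the intersection over union (IOU) between the predicted box and the ground truth.

Translation: Each grid cell predicts B bounding boxes and a confidence score for each box. These scores reflect how confident the model is that the box contains an object and how accurate it thinks the predicted box is. Formally, confidence is defined as:

$$\text{confidence} = \Pr(\text{Object}) * \text{IOU}^{\text{truth}}_{\text{pred}}$$

Pr(Object) indicates whether the grid cell (note: the grid cell, not the bounding box) contains an object. If no object exists in the cell, Pr(Object) is 0 and the confidence score should be zero. Otherwise Pr(Object) is 1 and the confidence score equals the IOU between the predicted box and the ground-truth box (a value in [0, 1]; the larger the value, the greater the overlap and the higher the score).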
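
As a rough illustration of the confidence definition above, the following Python sketch computes the IOU of two boxes and the resulting confidence target. The function names `iou` and `confidence_target` and the (center x, center y, width, height) box format are assumptions made for this example; they are not taken from the paper's code.

```python
def iou(box_a, box_b):
    """IOU of two boxes given as (center_x, center_y, width, height) in the same units."""
    ax1, ay1 = box_a[0] - box_a[2] / 2, box_a[1] - box_a[3] / 2
    ax2, ay2 = box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2
    bx1, by1 = box_b[0] - box_b[2] / 2, box_b[1] - box_b[3] / 2
    bx2, by2 = box_b[0] + box_b[2] / 2, box_b[1] + box_b[3] / 2
    # Width and height of the overlap rectangle, clamped at 0 when the boxes are disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0


def confidence_target(cell_has_object, pred_box, truth_box):
    """Pr(Object) * IOU: 0 for an empty cell, otherwise the IOU with the ground truth."""
    return iou(pred_box, truth_box) if cell_has_object else 0.0
```

Two identical boxes give an IOU of 1, disjoint boxes give 0, and an empty cell always gets a target of 0 regardless of the predicted box.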

Each bounding box consists of 5 predictions: x, y, w, h, and confidence. The (x, y) coordinates represent the center of the box relative to the bounds of the grid cell. The width and height are predicted relative to the whole image. Finally the confidence prediction represents the IOU between the predicted box and any ground truth box.

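Translation: Each bounding box consists of 5 predictions: x, y, w, h, and confidence. The (x, y) coordinates give the center of the box relative to the bounds of the grid cell. The width and height (w, h) are predicted relative to the whole image. Finally, the confidence prediction represents the IOU between the predicted box and any ground-truth box.
Commentary: During training, w and h are normalized to [0, 1] by the image width and height; x and y are the offsets of the box center from the position of the current cell, also normalized to [0, 1].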
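
The normalization described above can be sketched as follows. This is a minimal example under stated assumptions: the function name `encode_box`, the pixel-based input format, and the clamping of centers that lie exactly on the right or bottom image edge are all choices made for illustration.

```python
def encode_box(cx, cy, w, h, img_w, img_h, S=7):
    """Map a ground-truth box in pixels (center cx, cy; size w, h) to
    (row, col, x, y, w, h), where x, y are offsets inside the responsible
    cell and w, h are fractions of the image size."""
    # Position of the box center measured in grid cells.
    gx = cx / img_w * S
    gy = cy / img_h * S
    # Index of the cell containing the center (clamped for centers on the far edge).
    col = min(int(gx), S - 1)
    row = min(int(gy), S - 1)
    return row, col, gx - col, gy - row, w / img_w, h / img_h
```

For a 448 × 448 image and S = 7, a box centered at (224, 224) lands in row 3, column 3 with cell offsets (0.5, 0.5).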

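Each grid cell also predicts C conditional class probabilities, $\Pr(\text{Class}_i \mid \text{Object})$. These probabilities are conditioned on the grid cell containing an object. We only predict one set of class probabilities per grid cell, regardless of the number of boxes B.

Translation: Each grid cell also predicts C conditional class probabilities:

$$\Pr(\text{Class}_i \mid \text{Object})$$

These probabilities are conditioned on the grid cell containing an object. Regardless of the number of boxes B, only one set of class probabilities is predicted per grid cell.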

At test time we multiply the conditional class probabilities and the individual box confidence predictions,

$$\Pr(\text{Class}_i \mid \text{Object}) * \Pr(\text{Object}) * \text{IOU}^{\text{truth}}_{\text{pred}} = \Pr(\text{Class}_i) * \text{IOU}^{\text{truth}}_{\text{pred}} \tag{1}$$

which gives us class-specific confidence scores for each box. These scores encode both the probability of that class appearing in the box and how well the predicted box fits the object.

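Translation: At test time, the conditional class probabilities are multiplied by the confidence of each predicted box:

$$\Pr(\text{Class}_i \mid \text{Object}) * \Pr(\text{Object}) * \text{IOU}^{\text{truth}}_{\text{pred}} = \Pr(\text{Class}_i) * \text{IOU}^{\text{truth}}_{\text{pred}}$$

This gives a class-specific confidence score for each bounding box. The product encodes both the probability of the predicted class appearing in the box and whether the box contains an object and how accurate its coordinates are. (If the cell contains no object, the product should simply be 0.)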

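Commentary: When YOLO is applied to PASCAL VOC, the paper uses S = 7, so the image is divided into 7 × 7 = 49 grid cells. Each cell predicts B = 2 boxes (each with 5 values: x, y, w, h, confidence), and C = 20 (PASCAL VOC has 20 classes). The final prediction is therefore a 7 × 7 × 30 tensor, that is, S × S × (B × 5 + C).
Notes:
1. Because the output layers are fully connected, a trained YOLO model only accepts test images at the same input resolution as the training images.
2. Although each cell can predict B bounding boxes, only the box with the highest IOU is selected as the detection output, so each cell predicts at most one object. When objects are small relative to the image, for example a herd of animals or a flock of birds, a single cell can contain several objects but only one of them can be detected. This is a weakness of the YOLO method.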

2.1. Network Design

We implement this model as a convolutional neural network and evaluate it on the PASCAL VOC detection dataset. The initial convolutional layers of the network extract features from the image while the fully connected layers predict the output probabilities and coordinates.

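Translation: The authors implement the model as a convolutional neural network and evaluate it on the PASCAL VOC detection dataset. The initial convolutional layers extract features from the image, and the fully connected layers predict the output probabilities and coordinates.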

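Our network architecture is inspired by the GoogLeNet model for image classification. Our network has 24 convolutional layers followed by 2 fully connected layers. Instead of the inception modules used by GoogLeNet, we simply use 1 × 1 reduction layers followed by 3 × 3 convolutional layers, similar to Lin et al. The full network is shown in Figure 3.

Translation: The YOLO architecture is inspired by the GoogLeNet classification network. It has 24 convolutional layers followed by 2 fully connected layers. Instead of inception modules, YOLO simply uses 1 × 1 reduction layers (which here integrate information across channels) followed by 3 × 3 convolutional layers. The final output is a 7 × 7 × 30 tensor of predictions. The network is shown in the figure below: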

We also train a fast version of YOLO designed to push the boundaries of fast object detection. Fast YOLO uses a neural network with fewer convolutional layers (9 instead of 24) and fewer filters in those layers. Other than the size of the network, all training and testing parameters are the same between YOLO and Fast YOLO.

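Translation: The authors also train a fast version of YOLO designed to push the boundaries of fast object detection. Fast YOLO uses a network with fewer convolutional layers (9 instead of 24) and fewer filters in those layers. Apart from network size, all training and testing parameters are the same for YOLO and Fast YOLO.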

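Commentary: The image is divided into 7 × 7 grid cells, and the cell that contains the center of an object is responsible for predicting that object. In the figure, the center of the dog (red box) falls in the cell at row 5, column 2, so that cell is responsible for predicting the dog.

The last layer outputs a 7 × 7 × 30 tensor. Each 1 × 1 × 30 slice corresponds to one of the 7 × 7 cells of the input image and holds the class predictions and the bounding box coordinate predictions. Roughly speaking, the grid cell is responsible for class information and the bounding boxes are responsible for coordinates (and partly for class information, since confidence can also be regarded as class information).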

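Each cell (the 1 × 1 × 30 slice for that cell of the input image) predicts the coordinates (x, y, w, h) of 2 bounding boxes (solid yellow boxes in the figure). The center (x, y) is normalized to 0-1 relative to the cell, and w, h are normalized to 0-1 by the image width and height. Besides regressing its own position, each bounding box also predicts a confidence value, which combines two things: the confidence that the box contains an object, and how accurate the box is:

$$\text{confidence} = \Pr(\text{Object}) * \text{IOU}^{\text{truth}}_{\text{pred}}$$

If a ground-truth box (a manually labeled object) falls in the grid cell, the first factor is 1; otherwise it is 0. The second factor is the IOU between the predicted bounding box and the ground-truth box. Each bounding box thus predicts 5 values (x, y, w, h, confidence), so 2 bounding boxes give 10 values, the first 10 entries of the 1 × 1 × 30 vector. Each cell also predicts class information; the paper uses 20 classes.
With a 7 × 7 grid where each cell predicts 2 bounding boxes and 20 class probabilities, the output is 7 × 7 × (5 × 2 + 20). [General formula: with S × S cells, each predicting B bounding boxes and C class probabilities, the output is an S × S × (5 × B + C) tensor. Note that class information is per grid cell, while confidence is per bounding box.]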
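
The size formula and the per-cell layout described above can be checked with a short Python sketch. The helper names `output_size` and `split_cell` are hypothetical, and the layout (box values first, class probabilities last) follows the description in this post; an actual implementation such as Darknet may order the values differently.

```python
def output_size(S, B, C):
    """Number of values in the S x S x (5*B + C) output tensor."""
    return S * S * (B * 5 + C)


def split_cell(cell_vector, B=2, C=20):
    """Split one 1 x 1 x (5*B + C) cell vector into B boxes and C class probabilities.

    Assumed layout: B groups of (x, y, w, h, confidence), then C class probabilities.
    """
    boxes = [cell_vector[5 * b: 5 * b + 5] for b in range(B)]
    class_probs = cell_vector[5 * B: 5 * B + C]
    return boxes, class_probs
```

With the PASCAL VOC settings S = 7, B = 2, C = 20, `output_size` gives 7 × 7 × 30 = 1470 values.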

2.2. Training

We pretrain our convolutional layers on the ImageNet 1000-class competition dataset. For pretraining we use the first 20 convolutional layers from Figure 3 followed by a average-pooling layer and a fully connected layer. We train this network for approximately a week and achieve a single crop top-5 accuracy of 88% on the ImageNet 2012 validation set, comparable to the GoogLeNet models in Caffe’s Model Zoo. We use the Darknet framework for all training and inference.

Translation: The authors pretrain the convolutional layers on the ImageNet 1000-class competition dataset. The pretraining network is the first 20 convolutional layers from Figure 3 followed by an average-pooling layer and a fully connected layer (with a 224 × 224 input at this stage). This network is trained for about a week and reaches a single-crop top-5 accuracy of 88% on the ImageNet 2012 validation set, comparable to the GoogLeNet models in Caffe's Model Zoo. The authors use the Darknet framework for all training and inference.

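We then convert the model to perform detection. Ren et al. show that adding both convolutional and connected layers to pretrained networks can improve performance. Following their example, we add four convolutional layers and two fully connected layers with randomly initialized weights. Detection often requires fine-grained visual information so we increase the input resolution of the network from 224 × 224 to 448 × 448.

Translation: Training the detection network: the model is then converted to perform detection. Ren et al. ("Object detection networks on convolutional feature maps") show that adding both convolutional and fully connected layers to a pretrained network can improve performance. Following their example, the authors add four convolutional layers and two fully connected layers with randomly initialized weights. Detection often requires fine-grained visual information, so the input resolution is increased from 224 × 224 to 448 × 448.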

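Our final layer predicts both class probabilities and bounding box coordinates. We normalize the bounding box width and height by the image width and height so that they fall between 0 and 1. We parametrize the bounding box x and y coordinates to be offsets of a particular grid cell location so they are also bounded between 0 and 1. We use a linear activation function for the final layer and all other layers use the following leaky rectified linear activation:

$$\phi(x) = \begin{cases} x, & \text{if } x > 0 \\ 0.1x, & \text{otherwise} \end{cases} \tag{2}$$

Translation: The final layer predicts both class probabilities and bounding box coordinates. The box width and height are normalized by the image width and height so that they fall between 0 and 1. The box x and y coordinates are parametrized as offsets from a particular grid cell location, so they are also bounded between 0 and 1. The final layer uses a linear activation function, and all other layers use the leaky rectified linear activation given above.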
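
For reference, the leaky rectified linear activation used in the paper is φ(x) = x for x > 0 and 0.1x otherwise. A one-function Python sketch (the name `leaky_relu` and the `slope` parameter are illustrative choices):

```python
def leaky_relu(x, slope=0.1):
    """Leaky rectified linear activation: identity for positive inputs,
    a small linear slope (0.1 in the paper) for the rest."""
    return x if x > 0 else slope * x
```

Unlike a plain ReLU, negative inputs keep a small non-zero gradient.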

We optimize for sum-squared error in the output of our model. We use sum-squared error because it is easy to optimize, however it does not perfectly align with our goal of maximizing average precision. It weights localization error equally with classification error which may not be ideal. Also, in every image many grid cells do not contain any object. This pushes the “confidence” scores of those cells towards zero, often overpowering the gradient from cells that do contain objects. This can lead to model instability, causing training to diverge early on.

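Translation: The authors optimize for sum-squared error in the model output. Sum-squared error is used because it is easy to optimize, but it does not align perfectly with the goal of maximizing average precision. It weights localization error equally with classification error, which may not be ideal. In addition, many grid cells in every image contain no object. This pushes the confidence scores of those cells towards zero, often overpowering the gradient from cells that do contain objects. This can make the model unstable and cause training to diverge early on.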

To remedy this, we increase the loss from bounding box coordinate predictions and decrease the loss from confidence predictions for boxes that don’t contain objects. We use two parameters, λcoord and λnoobj to accomplish this. We set λcoord = 5 and λnoobj = .5.

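Translation: To remedy this, the loss from bounding box coordinate predictions is increased and the loss from confidence predictions for boxes that contain no object is decreased. Two parameters, λcoord and λnoobj, accomplish this, with λcoord = 5 and λnoobj = 0.5.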

Sum-squared error also equally weights errors in large boxes and small boxes. Our error metric should reflect that small deviations in large boxes matter less than in small boxes. To partially address this we predict the square root of the bounding box width and height instead of the width and height directly.

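Translation: Sum-squared error also weights errors in large boxes and small boxes equally. The error metric should reflect that a small deviation matters less in a large box than in a small box. To partially address this, the network predicts the square root of the bounding box width and height instead of the width and height directly.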

YOLO predicts multiple bounding boxes per grid cell. At training time we only want one bounding box predictor to be responsible for each object. We assign one predictor to be “responsible” for predicting an object based on which prediction has the highest current IOU with the ground truth. This leads to specialization between the bounding box predictors. Each predictor gets better at predicting certain sizes, aspect ratios, or classes of object, improving overall recall.

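Translation: YOLO predicts multiple bounding boxes per grid cell. At training time we want only one bounding box predictor to be responsible for each object (one ground-truth box, one bounding box). Specifically, the predicted box with the highest IOU with the ground-truth box is made responsible for predicting that object. This leads to specialization among the bounding box predictors: each predictor becomes better at certain sizes, aspect ratios, or classes of object, which improves overall recall.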

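During training we optimize the following, multi-part loss function:

$$
\begin{aligned}
&\lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
&+ \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
&+ \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left( C_i - \hat{C}_i \right)^2 \\
&+ \lambda_{\text{noobj}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{noobj}} \left( C_i - \hat{C}_i \right)^2 \\
&+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{\text{obj}} \sum_{c \in \text{classes}} \left( p_i(c) - \hat{p}_i(c) \right)^2
\end{aligned}
\tag{3}
$$

Term 1 is the error in the box center coordinates; $\mathbb{1}_{ij}^{\text{obj}}$ indicates that cell $i$ contains an object and that the $j$-th bounding box of that cell is responsible for predicting it.
Term 2 is the error in the box width and height.
Term 3 is the confidence error for boxes that contain an object.
Term 4 is the confidence error for boxes that do not contain an object.
Term 5 is the classification error for cells that contain an object; $\mathbb{1}_{i}^{\text{obj}}$ indicates that cell $i$ contains an object, so the class term depends only on whether the cell contains an object.
A note on the target value of the confidence: if there is no object, $\Pr(\text{Object}) = 0$ and so $C_i = 0$.
If there is an object, $\Pr(\text{Object}) = 1$, and the target is the actual $\text{IOU}^{\text{truth}}_{\text{pred}}$ computed between the ground-truth box and the predicted box.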

Note that the loss function only penalizes classification error if an object is present in that grid cell (hence the conditional class probability discussed earlier). It also only penalizes bounding box coordinate error if that predictor is “responsible” for the ground truth box (i.e. has the highest IOU of any predictor in that grid cell).
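Translation: The loss function only penalizes classification error if an object is present in that grid cell (hence the conditional class probability discussed earlier). It only penalizes bounding box coordinate error if the predictor is "responsible" for the ground-truth box (the labeled box giving the object's position), meaning it has the highest IOU with the ground truth among all predictors in that grid cell.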
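
To make the five parts of the loss concrete, here is a minimal pure-Python sketch for a single image. It is an illustration under stated assumptions, not the Darknet implementation: the data layout (`pred_boxes`, `pred_classes`, `truths`), the function names, at most one ground-truth box per cell, and non-negative predicted widths and heights are all assumptions of this example.

```python
import math


def iou_xywh(a, b):
    """IOU of two (center_x, center_y, w, h) boxes expressed in the same units."""
    iw = max(0.0, min(a[0] + a[2] / 2, b[0] + b[2] / 2) - max(a[0] - a[2] / 2, b[0] - b[2] / 2))
    ih = max(0.0, min(a[1] + a[3] / 2, b[1] + b[3] / 2) - max(a[1] - a[3] / 2, b[1] - b[3] / 2))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def yolo_loss(pred_boxes, pred_classes, truths, S=7, l_coord=5.0, l_noobj=0.5):
    """pred_boxes[i][j] = (x, y, w, h, conf) for box j of cell i;
    pred_classes[i] = list of class probabilities for cell i;
    truths[i] = None, or (x, y, w, h, class_index) for the object in cell i."""
    loss = 0.0
    for i, truth in enumerate(truths):
        boxes = pred_boxes[i]
        if truth is None:
            # Term 4: every box of an empty cell is pushed towards confidence 0.
            loss += l_noobj * sum(b[4] ** 2 for b in boxes)
            continue
        tx, ty, tw, th, tc = truth
        # x, y are cell offsets; dividing by S expresses them in image units like w, h.
        t_img = (tx / S, ty / S, tw, th)
        ious = [iou_xywh((b[0] / S, b[1] / S, b[2], b[3]), t_img) for b in boxes]
        # The predictor with the highest IOU is "responsible" for the object.
        j = max(range(len(boxes)), key=lambda k: ious[k])
        for k, b in enumerate(boxes):
            if k == j:
                loss += l_coord * ((b[0] - tx) ** 2 + (b[1] - ty) ** 2)        # term 1
                loss += l_coord * ((math.sqrt(b[2]) - math.sqrt(tw)) ** 2
                                   + (math.sqrt(b[3]) - math.sqrt(th)) ** 2)  # term 2
                loss += (b[4] - ious[k]) ** 2                                  # term 3
            else:
                loss += l_noobj * b[4] ** 2                                    # term 4
        # Term 5: classification error, only for cells that contain an object.
        loss += sum((p - (1.0 if c == tc else 0.0)) ** 2
                    for c, p in enumerate(pred_classes[i]))
    return loss
```

In this sketch the confidence target of the responsible box is its current IOU with the ground truth, and the target of every other box is 0, as described in the notes on the loss above.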

We train the network for about 135 epochs on the training and validation data sets from PASCAL VOC 2007 and 2012. When testing on 2012 we also include the VOC 2007 test data for training. Throughout training we use a batch size of 64, a momentum of 0.9 and a decay of 0.0005.
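Translation: The network is trained for about 135 epochs on the training and validation sets of PASCAL VOC 2007 and 2012. When testing on VOC 2012, the VOC 2007 test data is also used for training. Throughout training the batch size is 64, the momentum is 0.9, and the decay is 0.0005.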

Our learning rate schedule is as follows: For the first epochs we slowly raise the learning rate from $10^{-3}$ to $10^{-2}$. If we start at a high learning rate our model often diverges due to unstable gradients. We continue training with $10^{-2}$ for 75 epochs, then $10^{-3}$ for 30 epochs, and finally $10^{-4}$ for 30 epochs.

Translation: The learning rate schedule is as follows. During the first epochs the learning rate is raised slowly from $10^{-3}$ to $10^{-2}$. Starting at a high learning rate often makes the model diverge because of unstable gradients. Training then continues at $10^{-2}$ for 75 epochs, at $10^{-3}$ for 30 epochs, and finally at $10^{-4}$ for 30 epochs.
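
The schedule can be written as a small step function. This is only a sketch: the paper does not say how many "first epochs" the warm-up lasts or what shape the ramp has, so `warmup_epochs=5`, the linear ramp, and counting the warm-up inside the first 75 epochs are assumptions made here.

```python
def learning_rate(epoch, warmup_epochs=5):
    """Learning rate for a 0-based epoch index over 135 epochs in total.

    warmup_epochs and the linear ramp are assumed values, not from the paper.
    """
    if epoch < warmup_epochs:
        # Slowly raise the rate from 1e-3 towards 1e-2.
        return 1e-3 + (1e-2 - 1e-3) * epoch / warmup_epochs
    if epoch < 75:
        return 1e-2   # first 75 epochs
    if epoch < 105:
        return 1e-3   # next 30 epochs
    return 1e-4       # final 30 epochs
```

The three plateaus add up to 75 + 30 + 30 = 135 epochs, matching the total training length quoted above.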

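To avoid overfitting we use dropout and extensive data augmentation. A dropout layer with rate = .5 after the first connected layer prevents co-adaptation between layers. For data augmentation we introduce random scaling and translations of up to 20% of the original image size. We also randomly adjust the exposure and saturation of the image by up to a factor of 1.5 in the HSV color space.

Translation: To avoid overfitting, dropout and extensive data augmentation are used. A dropout layer with rate 0.5 after the first fully connected layer prevents co-adaptation between layers. For data augmentation, random scaling and translations of up to 20% of the original image size are introduced. The exposure and saturation of the image are also randomly adjusted by up to a factor of 1.5 in the HSV color space.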

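Commentary: loss function design (figure)

The loss function is designed to balance three things: coordinates (x, y, w, h), confidence, and classification. Using plain sum-squared error for everything has these drawbacks:
a) Treating the 8-dimensional localization error and the 20-dimensional classification error as equally important is clearly unreasonable.
b) If a cell contains no object (and an image has many such cells), the confidence of the boxes in that cell is pushed towards 0. Compared with the few cells that do contain objects, this effect is overpowering and can make the network unstable or even diverge.
The remedies are:
(1) Give more weight to the 8-dimensional coordinate prediction by assigning these loss terms a larger loss weight, 5 in the PASCAL VOC training (blue box in the figure above).
(2) Give the confidence loss of boxes without an object a small loss weight, 0.5 in the PASCAL VOC training (orange box in the figure above).
(3) The confidence loss of boxes with an object (red box in the figure above) and the class loss (purple box in the figure above) keep the normal loss weight of 1.
(4) For boxes of different sizes, the same deviation is less tolerable for a small box than for a large one, yet sum-squared error gives the same loss for the same offset. To mitigate this, the authors use a small trick: they take the square roots of the box width and height in place of the raw width and height. As the figure below shows, a small box has a small value on the horizontal axis, so an offset maps to a larger loss on the y-axis (green in the figure below) than for a big box (red in the figure below).
For example, take a large box of width 16 and a small box of width 4. An offset of 4 takes the large box to 20, but it takes the small box to 8, twice its width. The offset is the same, yet the small box is clearly affected more, and the square root makes this difference visible. The square root of the large width is 4 and of the small width is 2. The change for the large box is √20 − √16 = 2√5 − 4 ≈ 0.4721.
The change for the small box is √8 − √4 = 2√2 − 2 ≈ 0.8284.
The small box changes more than the large one, so the difference is reflected better; this is only an illustrative example.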
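
The arithmetic in this example is easy to reproduce. The helper name `sqrt_shift` is made up for this sketch:

```python
import math


def sqrt_shift(width, offset):
    """Change in sqrt(width) when the width grows by `offset`."""
    return math.sqrt(width + offset) - math.sqrt(width)


large = sqrt_shift(16, 4)  # sqrt(20) - sqrt(16)
small = sqrt_shift(4, 4)   # sqrt(8) - sqrt(4)
print(round(large, 4), round(small, 4))  # prints 0.4721 0.8284
```

The same offset of 4 produces a larger change in square-root space for the small box, which is the effect the square-root parametrization is meant to capture.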

2.3. Inference

Just like in training, predicting detections for a test image only requires one network evaluation. On PASCAL VOC the network predicts 98 bounding boxes per image and class probabilities for each box. YOLO is extremely fast at test time since it only requires a single network evaluation, unlike classifier-based methods.

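Translation: As in training, predicting detections for a test image requires only one network evaluation. On PASCAL VOC the network predicts 98 bounding boxes per image and class probabilities for each box. Unlike classifier-based methods, YOLO needs only a single network evaluation, so it is extremely fast at test time.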

The grid design enforces spatial diversity in the bounding box predictions. Often it is clear which grid cell an object falls in to and the network only predicts one box for each object. However, some large objects or objects near the border of multiple cells can be well localized by multiple cells. Non-maximal suppression can be used to fix these multiple detections. While not critical to performance as it is for R-CNN or DPM, non-maximal suppression adds 2-3% in mAP.

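Translation: The grid design enforces spatial diversity in the bounding box predictions. It is often clear which grid cell an object falls into, and the network predicts only one box per object. However, large objects or objects near the border of several cells may be localized by multiple cells. Non-maximal suppression (NMS) can remove these duplicate detections and raises the final mAP by 2-3%, although it is less critical here than it is for DPM or R-CNN.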

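Commentary: At test time, running the trained model on an image amounts to predicting bounding boxes and final scores. This differs from training: training compares predictions with labels and optimizes the loss so that predicted boxes move towards the ground-truth boxes, with the goal of finding the best weights. At test time there are no ground-truth labels; the weights learned in training are used to predict the best bounding boxes.
Given an input image, the network divides it into an S × S grid exactly as in training. The class information predicted by each cell is multiplied by the confidence predicted for each bounding box, which gives the class-specific confidence score of each bounding box, that is, the probability that the box contains a particular class combined with how well the box overlaps the object. See below: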

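A 448 × 448 × 3 image goes through the network and yields a 7 × 7 × 30 output tensor. VOC has 20 object classes, so the 7 × 7 × 30 tensor is made up of 7 × 7 × (2 × (4 + 1) + 20) values and contains 7 × 7 × 2 confidences. Because each grid cell predicts only one object, each cell has one set of 20 class probabilities. By the formula in the figure, the score of each of a cell's 2 predicted boxes for one class is that box's confidence multiplied by the cell's probability for that class, so per class the scores form a 2 × 1 tensor, and over all classes they form 2 × 20 values per cell. With 7 × 7 cells there are 7 × 7 × 2 = 98 predicted boxes, giving 98 vectors of size 20 × 1. The product accounts for both the class probabilities of each predicted box and the confidence of the box (how well it fits the object). This is repeated for every grid cell; note that the confidence of a cell's predicted boxes is multiplied only with that cell's own class probabilities. The result is shown in the figure below.

The yellow bars in the figure are the resulting 98 × 20 × 1 tensor. Each bar corresponds to one predicted box and holds the 20 scores obtained by multiplying the cell's class predictions by the box's confidence.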
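
The multiplication described above can be sketched in a few lines of Python. The function name `class_scores` is hypothetical, and the per-cell layout (box values first, then class probabilities) follows the description in this post rather than any particular implementation.

```python
def class_scores(cells, B=2, C=20):
    """cells: list of S*S vectors laid out as B * (x, y, w, h, conf) followed by
    C class probabilities. Returns one list of C class-specific scores per box."""
    scores = []
    for cell in cells:
        probs = cell[5 * B: 5 * B + C]
        for b in range(B):
            conf = cell[5 * b + 4]
            # A box's confidence is multiplied only with its own cell's class probabilities.
            scores.append([conf * p for p in probs])
    return scores
```

For 7 × 7 = 49 cells with B = 2 and C = 20, the result has 98 entries of 20 scores each.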

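The scores are then grouped by class into 20 classes. The remaining steps are as follows (using the first class, assumed to be "dog", as the example):

A threshold is applied to the scores of each class: any score below the threshold is set to 0.

The scores of each class are then sorted in descending order; only the dog class is shown here.

After sorting, note that a large object may cover several grid cells, but an object should be handled by only one predicted box in one cell, so the box with the highest score has to be chosen for that object. The dog in the example figure covers many cells. This is where NMS comes in.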
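Non-maximal suppression: suppression is an iterate, traverse, and eliminate process.
1. Sort all boxes by score and select the highest-scoring box.
2. Traverse the remaining boxes and delete any box whose overlap (IOU) with the current highest-scoring box exceeds a given threshold.
3. Pick the highest-scoring box among the unprocessed boxes and repeat.

In other words, take the box with the highest score in a class and compare it with the remaining boxes of that class in descending score order. If the IOU of two boxes exceeds the threshold, they overlap too much and are covering the same object, so the lower-scoring box has its score for that class set to 0. Every box is compared in this way, and the boxes with non-zero scores are output. All other classes go through the same NMS procedure.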
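
The procedure above can be sketched for a single class as follows. The names `iou_corners` and `nms_one_class`, the corner-based box format (x1, y1, x2, y2), and the default threshold of 0.5 are assumptions of this example.

```python
def iou_corners(a, b):
    """IOU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms_one_class(boxes, scores, iou_threshold=0.5):
    """Return a copy of `scores` in which suppressed boxes have score 0."""
    scores = list(scores)
    # Visit boxes from the highest score to the lowest.
    order = sorted(range(len(boxes)), key=lambda k: scores[k], reverse=True)
    for pos, i in enumerate(order):
        if scores[i] == 0:
            continue  # already suppressed, or removed earlier by the score threshold
        for j in order[pos + 1:]:
            # A lower-scoring box that overlaps too much covers the same object.
            if scores[j] > 0 and iou_corners(boxes[i], boxes[j]) > iou_threshold:
                scores[j] = 0.0
    return scores
```

Running this once per class on the 98 boxes leaves non-zero scores only for boxes that were not suppressed by a higher-scoring box of the same class.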

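Compute the IOU of the two boxes. If it is > 0.5, the boxes overlap too much and the second one's score is set to 0. Then compare the third box with the first, the fourth with the first, and so on to the end. Next take the following non-zero box after the first and repeat until all boxes are processed, then move to the second row (the second class).
To get an intuitive feel for non-maximal suppression:

In the figure, the highest dog score is 0.8 (box 47), so that box is selected as the top-scoring box and the other boxes are compared against it.

The box with the second-highest score, bb20, is compared first. Its IOU with the top box exceeds the threshold, so its score for this class is set to 0. The other boxes are handled the same way:

until all boxes have been traversed. Non-maximal suppression (NMS) is not specific to YOLO; it is used by most detection algorithms. It mainly solves the problem of one object being detected multiple times.
The final output is as follows:

Last step: the 20 class scores of each remaining box still need to be ranked, because one box can predict only one object. This is needed when two or more objects fall in the same grid cell and, after NMS, a single predicted box holds the highest score for more than one class, so that one box would predict several objects. In that case the scores are compared and the box takes the class with the higher score.
There is also the case where two or more objects of the same class fall in the same grid cell; then only one bounding box is produced for them.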
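
The last step, keeping only the best class for each surviving box, might look like the sketch below. The function name `pick_classes` and the output format are illustrative assumptions.

```python
def pick_classes(scores_per_box):
    """scores_per_box: one list of per-class scores per box, after thresholding and NMS.
    Returns (box_index, class_index, score) for every box with a non-zero best score."""
    detections = []
    for box_index, scores in enumerate(scores_per_box):
        # One box predicts one object, so keep only its highest-scoring class.
        best_class = max(range(len(scores)), key=lambda c: scores[c])
        if scores[best_class] > 0:
            detections.append((box_index, best_class, scores[best_class]))
    return detections
```

Boxes whose scores were all set to 0 by the threshold or by NMS produce no detection.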

2.4. Limitations of YOLO

YOLO imposes strong spatial constraints on bounding box predictions since each grid cell only predicts two boxes and can only have one class. This spatial constraint limits the number of nearby objects that our model can predict. Our model struggles with small objects that appear in groups, such as flocks of birds.

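Translation: YOLO performs poorly on objects that are very close together (adjacent objects whose centers fall in the same cell) and on groups of small objects, because each grid cell predicts only two boxes and can belong to only one class.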

Since our model learns to predict bounding boxes from data, it struggles to generalize to objects in new or unusual aspect ratios or configurations. Our model also uses relatively coarse features for predicting bounding boxes since our architecture has multiple downsampling layers from the input image.

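Translation: Because the model learns to predict bounding boxes from data, it struggles to generalize to objects with new or unusual aspect ratios or configurations. The model also uses relatively coarse features to predict bounding boxes, since the architecture has multiple downsampling layers from the input image.

Commentary: On test images, generalization is weak when objects of a known class appear with unusual aspect ratios or in other unusual configurations.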

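Finally, while we train on a loss function that approximates detection performance, our loss function treats errors the same in small bounding boxes versus large bounding boxes. A small error in a large box is generally benign but a small error in a small box has a much greater effect on IOU. Our main source of error is incorrect localizations.

Translation: Finally, although training uses a loss function that approximates detection performance, the loss treats errors in small bounding boxes the same as errors in large ones. A small error in a large box is generally benign, but a small error in a small box has a much greater effect on IOU. The main source of error is incorrect localization.

Commentary: Because of the loss function, localization error is the main factor limiting detection quality, and the handling of large versus small objects in particular still needs improvement.

The comparisons in the rest of the paper are not repeated here. Truly understanding the algorithm requires reading it alongside the code.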