YOLOv2论文名称: YOLO9000: Better, Faster, Stronger

YOLOv2论文下载地址：https://arxiv.org/pdf/1612.08242.pdf

声明：论文翻译仅用来学习，转载请注明出处

YOLO9000: Better, Faster, Stronger

YOLO9000：更好、更快、更强

Abstract

We introduce YOLO9000, a state-of-the-art, real-time object detection system that can detect over 9000 object categories. First we propose various improvements to the YOLO detection method, both novel and drawn from prior work. The improved model, YOLOv2, is state-of-the-art on standard detection tasks like PASCAL VOC and COCO. Using a novel, multi-scale training method the same YOLOv2 model can run at varying sizes, offering an easy tradeoff between speed and accuracy. At 67 FPS, YOLOv2 gets 76.8 mAP on VOC 2007. At 40 FPS, YOLOv2 gets 78.6 mAP , outperforming state-of-the-art methods like Faster RCNN with ResNet and SSD while still running significantly faster. Finally we propose a method to jointly train on object detection and classification. Using this method we train YOLO9000 simultaneously on the COCO detection dataset and the ImageNet classification dataset. Our joint training allows YOLO9000 to predict detections for object classes that don’t have labelled detection data. We validate our approach on the ImageNet detection task. YOLO9000 gets 19.7 mAP on the ImageNet detection validation set despite only having detection data for 44 of the 200 classes. On the 156 classes not in COCO, YOLO9000 gets 16.0 mAP. But YOLO can detect more than just 200 classes; it predicts detections for more than 9000 different object categories. And it still runs in real-time.

摘要

我们介绍了YOLO9000，一个最先进的实时目标检测系统，可以检测超过9000个目标类别。首先，我们提出了对YOLO检测方法的各种改进，这些改进既是新的，也是来自先前的工作。改进后的模型YOLOv2在标准检测任务上是最先进的，如PASCAL VOC和COCO。使用一种新的、多尺度的训练方法，同一个YOLOv2模型可以在不同的规模下运行，在速度和准确性之间提供了一个简单的权衡。在67FPS时，YOLOv2在VOC 2007上得到76.8mAP。在40 FPS时，YOLOv2得到78.6 mAP，超过了最先进的方法，如带有ResNet和SSD的Faster R-CNN，同时运行速度仍然很高。最后，我们提出了一种联合训练目标检测和分类的方法。使用这种方法，我们在COCO检测数据集和ImageNet分类数据集上同时训练YOLO9000。我们的联合训练使YOLO9000能够预测没有标记检测数据的目标类别的检测情况。我们在ImageNet检测任务上验证了我们的方法。尽管只有200个类中的44个有检测数据，YOLO9000在ImageNet检测验证集上得到了19.7的mAP。在COCO上没有的的156个类中，YOLO9000得到了16.0 mAP。但YOLO能检测的不仅仅是200个类；它能预测9000多个不同目标类别的检测。而且它仍然是实时运行的。

1. Introduction

1. 引言

General purpose object detection should be fast, accurate, and able to recognize a wide variety of objects. Since the introduction of neural networks, detection frameworks have become increasingly fast and accurate. However, most detection methods are still constrained to a small set of objects.

Current object detection datasets are limited compared to datasets for other tasks like classification and tagging. The most common detection datasets contain thousands to hundreds of thousands of images with dozens to hundreds of tags [3] [10] [2]. Classification datasets have millions of images with tens or hundreds of thousands of categories [20] [2].

We would like detection to scale to level of object classification. However, labelling images for detection is far more expensive than labelling for classification or tagging (tags are often user-supplied for free).Thus we are unlikely to see detection datasets on the same scale as classification datasets in the near future.

通用的目标检测应该是快速、准确的，并且能够识别各种各样的目标。自从引入神经网络以来，检测框架已经变得越来越快和准确。然而，大多数检测方法仍然被限制在一小部分目标上。

与分类和标记等其他任务的数据集相比，当前的目标检测数据集是有限的。最常见的检测数据集包含几千到几十万张图像，有几十到几百个标签[3] [10] [2]。分类数据集有数以百万计的图像，有数万或数十万个类别[20] [2]。

我们希望检测能够达到目标分类的水平。然而，为检测而给图像贴标签比为分类或标记贴标签要昂贵得多（标签通常是给用户免费提供的）。因此，我们不太可能在不久的将来看到与分类数据集相同规模的检测数据集。

Figure 1: YOLO9000. YOLO9000 can detect a wide variety of object classes in real-time.

图1：YOLO9000。YOLO9000可以实时检测各种目标类别。

We propose a new method to harness the large amount of classification data we already have and use it to expand the scope of current detection systems. Our method uses a hierarchical view of object classification that allows us to combine distinct datasets together.

We also propose a joint training algorithm that allows us to train object detectors on both detection and classification data. Our method leverages labeled detection images to learn to precisely localize objects while it uses classification images to increase its vocabulary and robustness.

Using this method we train YOLO9000, a real-time object detector that can detect over 9000 different object categories. First we improve upon the base YOLO detection system to produce YOLOv2, a state-of-the-art, real-time detector. Then we use our dataset combination method and joint training algorithm to train a model on more than 9000 classes from ImageNet as well as detection data from COCO.

All of our code and pre-trained models are available online at http://pjreddie.com/yolo9000/.

我们提出了一种新的方法来利用我们已经拥有的大量分类数据，并利用它来扩大当前检测系统的范围。我们的方法使用目标分类的分层观点，使我们能够将不同的数据集结合在一起。

我们还提出了一种联合训练算法，使我们能够在检测和分类数据上训练目标检测器。我们的方法利用标记的检测图像来学习精确定位目标，同时使用分类图像来增加其词汇量和鲁棒性。

使用这种方法，我们训练了YOLO9000，一个实时的目标检测器，可以检测超过9000个不同的物体类别。首先，我们在基础YOLO检测系统的基础上进行改进，以产生YOLOv2，一个最先进的实时检测器。然后，我们使用我们的数据集组合方法和联合训练算法，在ImageNet的9000多个类别以及COCO的检测数据上训练一个模型。

我们所有的代码和预训练的模型都可以在线获得：http://pjreddie.com/yolo9000/。

2. Better

2. 更好

YOLO suffers from a variety of shortcomings relative to state-of-the-art detection systems. Error analysis of YOLO compared to Fast R-CNN shows that YOLO makes a significant number of localization errors. Furthermore, YOLO has relatively low recall compared to region proposal-based methods. Thus we focus mainly on improving recall and localization while maintaining classification accuracy.

Computer vision generally trends towards larger, deeper networks [6] [18] [17]. Better performance often hinges on training larger networks or ensembling multiple models together. However, with YOLOv2 we want a more accurate detector that is still fast. Instead of scaling up our network, we simplify the network and then make the representation easier to learn. We pool a variety of ideas from past work with our own novel concepts to improve YOLO’s performance. A summary of results can be found in Table 2.

相比于最先进的检测系统，YOLO存在着各种缺陷。与Faster R-CNN相比，对YOLO的错误分析表明，YOLO出现了大量的定位错误。此外，与基于区域建议的方法相比，YOLO的召回率相对较低。因此，我们主要关注的是在保持分类精度的同时，提高召回率和定位准确度。

计算机视觉通常趋向于更大、更深的网络[6] [18] [17]。更好的性能往往取决于训练更大的网络或将多个模型集合在一起。然而，在YOLOv2中，我们希望有一个更准确的检测器，但仍然是快速的。我们没有扩大我们的网络，而是简化了网络，然后让表征更容易学习。我们将过去工作中的各种想法与我们自己的新概念结合起来，以提高YOLO的性能。在表2中可以看到结果的总结。

Table 2: The path from YOLO to YOLOv2. Most of the listed design decisions lead to significant increases in mAP. Two exceptions are switching to a fully convolutional network with anchor boxes and using the new network. Switching to the anchor box style approach increased recall without changing mAP while using the new network cut computation by 33%.

**表2：从YOLO到YOLOv2的路径。**大多数列出的设计决定都导致了mAP的显著增加。两个例外是改用带有锚框的全卷积网络和使用新网络。改用锚框式的方法增加了召回率而不改变mAP，而使用新的网络使计算量减少了33%。

Batch Normalization. Batch normalization leads to significant improvements in convergence while eliminating the need for other forms of regularization [7]. By adding batch normalization on all of the convolutional layers in YOLO we get more than 2% improvement in mAP . Batch normalization also helps regularize the model. With batch normalization we can remove dropout from the model without overfitting.

批量归一化 批量归一化导致收敛性的显著改善，同时消除了对其他形式的规范化的需求[7]。通过在YOLO的所有卷积层上添加批量归一化，我们在mAP上得到了超过2%的改善。批量规范化也有助于规范化模型。有了批归一化，我们可以在不过拟合的情况下去除模型中的dropout。

High Resolution Classifier. All state-of-the-art detection methods use classifier pre-trained on ImageNet [16]. Starting with AlexNet most classifiers operate on input images smaller than 256 × 256 [8]. The original YOLO trains the classifier network at 224 × 224 and increases the resolution to 448 for detection. This means the network has to simultaneously switch to learning object detection and adjust to the new input resolution.

For YOLOv2 we first fine tune the classification network at the full 448 × 448 resolution for 10 epochs on ImageNet. This gives the network time to adjust its filters to work better on higher resolution input. We then fine tune the resulting network on detection. This high resolution classification network gives us an increase of almost 4% mAP.

高分辨率分类器 所有最先进的检测方法都使用在ImageNet上预训练的分类器[16]。从AlexNet开始，大多数分类器在小于256×256的输入图像上运行[8]。最初的YOLO在224×224的情况下训练分类器网络，并将分辨率提高到448以进行检测训练。这意味着网络在切换到检测学习时还必须调整到新的输入分辨率。

对于YOLOv2，我们首先在ImageNet上以448×448的完整分辨率对分类网络进行微调，并进行10个epoch。这让网络有时间调整其滤波器，以便在更高的分辨率输入下更好地工作。然后，我们再对检测网络的结果微调。这个高分辨率的分类网络使我们的mAP增加了近4%。

Convolutional With Anchor Boxes. YOLO predicts the coordinates of bounding boxes directly using fully connected layers on top of the convolutional feature extractor. Instead of predicting coordinates directly Faster R-CNN predicts bounding boxes using hand-picked priors [15]. Using only convolutional layers the region proposal network (RPN) in Faster R-CNN predicts offsets and confidences for anchor boxes. Since the prediction layer is convolutional, the RPN predicts these offsets at every location in a feature map. Predicting offsets instead of coordinates simplifies the problem and makes it easier for the network to learn.

We remove the fully connected layers from YOLO and use anchor boxes to predict bounding boxes. First we eliminate one pooling layer to make the output of the network’s convolutional layers higher resolution. We also shrink the network to operate on 416 input images instead of 448×448. We do this because we want an odd number of locations in our feature map so there is a single center cell. Objects, especially large objects, tend to occupy the center of the image so it’s good to have a single location right at the center to predict these objects instead of four locations that are all nearby. YOLO’s convolutional layers downsample the image by a factor of 32 so by using an input image of 416 we get an output feature map of 13 × 13.

带有锚框的卷积 YOLO直接使用卷积特征提取器顶部的全连接层来预测边界框的坐标。Faster R-CNN不直接预测坐标，而是使用手工挑选的先验因素来预测边界框[15]。Faster R-CNN中的区域生成网络（RPN）只使用卷积层来预测锚框的偏移量和置信度。由于预测层是卷积，RPN预测了特征图中每个位置的偏移量。预测偏移量而不是坐标可以简化问题，使网络更容易学习。

我们从YOLO中移除全连接层，并使用锚框来预测边界框。首先，我们消除了一个池化层，使网络卷积层的输出具有更高的分辨率。我们还缩小了网络，使其在分辨率为416×416的输入图像上运行，而不是448×448。我们这样做是因为我们希望在我们的特征图中有奇数个位置，以便只有一个中心单元。目标，尤其是大型目标，往往会占据图像的中心位置，所以在中心位置有一个单一的位置来预测这些目标是很好的，而不是在中心附近的四个位置。YOLO的卷积层对图像进行了32倍的降样，所以通过使用416的输入图像，我们得到了一个13×13的输出特征图。

When we move to anchor boxes we also decouple the class prediction mechanism from the spatial location and instead predict class and objectness for every anchor box. Following YOLO, the objectness prediction still predicts the IOU of the ground truth and the proposed box and the class predictions predict the conditional probability of that class given that there is an object.

Using anchor boxes we get a small decrease in accuracy. YOLO only predicts 98 boxes per image but with anchor boxes our model predicts more than a thousand. Without anchor boxes our intermediate model gets 69.5 mAP with a recall of 81%. With anchor boxes our model gets 69.2 mAP with a recall of 88%. Even though the mAP decreases, the increase in recall means that our model has more room to improve.

引入锚框后，我们将类别预测机制与空间位置分开处理，单独预测每个锚框的类和目标。和原来的YOLO一样，目标预测仍然预测先验框和真实框的IOU，而类别预测则预测在有目标存在下，该类别的条件概率。

使用锚框，我们得到的准确率会有小幅下降。YOLO每张图片只预测了98个框，但使用锚框后，我们的模型预测了超过一千个框。在没有锚框的情况下，我们的中间模型mAP为69.5，召回率为81%。有了锚框，我们的模型mAP为69.2，召回率为88%。即使mAP下降了，平均召回率的增加意味着我们的模型有更大的改进空间。

Dimension Clusters. We encounter two issues with anchor boxes when using them with YOLO. The first is that the box dimensions are hand picked. The network can learn to adjust the boxes appropriately but if we pick better priors for the network to start with we can make it easier for the network to learn to predict good detections.

Instead of choosing priors by hand, we run k-means clustering on the training set bounding boxes to automatically find good priors. If we use standard k-means with Euclidean distance larger boxes generate more error than smaller boxes. However, what we really want are priors that lead to good IOU scores, which is independent of the size of the box. Thus for our distance metric we use:

维度集群 在与YOLO一起使用锚框时，我们遇到了两个问题。第一个问题是，框的尺寸是手工挑选的。网络可以学习适当地调整框，但是如果我们为网络挑选更好的先验锚框来开始，我们可以使网络更容易学习预测好的检测结果。

我们在训练集的边界框上运行k-means聚类，以自动找到好的先验参数，而不是手工选择先验参数。如果我们使用标准的k-means和欧氏距离，大的框比小的框产生更多的误差。然而，我们真正想要的是能获得好的IOU分数的先验锚框，这与框的大小无关。因此，对于距离度量，我们使用：
d(box ,centroid )=1−IOU⁡(box ,centroid )\begin{equation} d(\text { box }, \text { centroid })=1-\operatorname{IOU}(\text { box }, \text { centroid }) \end{equation} d( box , centroid )=1−IOU( box , centroid )
We run k-means for various values of k and plot the average IOU with closest centroid, see Figure 2. We choose k = 5 as a good tradeoff between model complexity and high recall. The cluster centroids are significantly different than hand-picked anchor boxes. There are fewer short, wide boxes and more tall, thin boxes.

我们对不同的k值运行k-means，并绘制出最接近中心点的平均IOU，见图2。我们选择k = 5作为模型复杂性和高召回率之间的良好权衡。聚类中心点与手工挑选的锚框有明显不同。短而宽的框较少，高而薄的框较多。

Figure 2: Clustering box dimensions on VOC and COCO. We run k-means clustering on the dimensions of bounding boxes to get good priors for our model. The left image shows the average IOU we get with various choices for k. We find that k = 5 gives a good tradeoff for recall vs. complexity of the model. The right image shows the relative centroids for VOC and COCO. Both sets of priors favor thinner, taller boxes while COCO has greater variation in size than VOC.

**图2：VOC和COCO上的聚类框尺寸。**我们对边界框的尺寸进行k-means聚类，以便为我们的模型获得良好的先验锚框。左图显示了我们在不同的k选择下得到的平均IOU。我们发现k=5为召回率与模型的复杂性提供了良好的权衡。右图显示了VOC和COCO的相对中心点。两组先验都倾向于更薄、更高的盒子，而COCO的大小变化比VOC更大。

We compare the average IOU to closest prior of our clustering strategy and the hand-picked anchor boxes in Table 1. At only 5 priors the centroids perform similarly to 9 anchor boxes with an average IOU of 61.0 compared to 60.9. If we use 9 centroids we see a much higher average IOU. This indicates that using k-means to generate our bounding box starts the model off with a better representation and makes the task easier to learn.

我们在表1中比较了我们的聚类策略和手工挑选的锚框的平均IOU与最接近的先验。在只有5个先验的情况下，中心点的表现与9个锚框相似，平均IOU分别为61.0，60.9。如果我们使用9个中心点，我们会看到一个高得多的平均IOU。这表明，使用k-means来生成我们的边界框，使模型开始有一个更好的表示，并使任务更容易学习。

Table 1: Average IOU of boxes to closest priors on VOC 2007. The average IOU of objects on VOC 2007 to their closest, unmodified prior using different generation methods. Clustering gives much better results than using hand-picked priors.

**表1：VOC 2007上的框与最接近的先验的平均IOU。**使用不同的生成方法，VOC 2007上的物体与它们最接近的、未修改的先验物的平均IOU。聚类的结果比使用手工挑选的先验要好得多。

Direct location prediction. When using anchor boxes with YOLO we encounter a second issue: model instability, especially during early iterations. Most of the instability comes from predicting the (x, y) locations for the box. In region proposal networks the network predicts values tx and ty and the (x, y) center coordinates are calculated as:

**直接的位置预测。**当YOLO使用锚框时，我们遇到了第二个问题：模型的不稳定性，特别是在早期迭代中。大部分的不稳定性来自于对框的（x，y）位置的预测。在区域生成网络中，网络预测值tx和ty，（x，y）中心坐标的计算方法是：
x=(tx∗wa)−xay=(ty∗ha)−ya\begin{equation} \begin{aligned} &x=\left(t_{x} * w_{a}\right)-x_{a} \\ &y=\left(t_{y} * h_{a}\right)-y_{a} \end{aligned} \end{equation} x=(tx∗wa)−xay=(ty∗ha)−ya
For example, a prediction of tx = 1 would shift the box to the right by the width of the anchor box, a prediction of tx = −1 would shift it to the left by the same amount.

This formulation is unconstrained so any anchor box can end up at any point in the image, regardless of what location predicted the box. With random initialization the model takes a long time to stabilize to predicting sensible offsets.

Instead of predicting offsets we follow the approach of YOLO and predict location coordinates relative to the location of the grid cell. This bounds the ground truth to fall between 0 and 1. We use a logistic activation to constrain the network’s predictions to fall in this range.

例如，tx=1的预测会将框向右移动，移动的宽度为锚框的宽度，tx=-1的预测会将框向左移动相同的长度。

这个公式是不受限制的，所以任何锚框都可以在图像中的任何一点结束，而不管这个框是在哪个位置预测的。在随机初始化的情况下，模型需要很长时间才能稳定地预测出合理的偏移量。

我们不预测偏移量，而是遵循YOLO的方法，预测相对于网格单元位置的坐标。这使得真实值的界限在0到1之间。我们使用逻辑激活来约束网络的预测，使其落在0~1这个范围内。

The network predicts 5 bounding boxes at each cell in the output feature map. The network predicts 5 coordinates for each bounding box, tx, ty, tw, th, and to. If the cell is offset from the top left corner of the image by (cx, cy) and the bounding box prior has width and height pw, ph, then the predictions correspond to:

该网络在输出特征图中的每个单元预测了5个边界框。该网络为每个边界框预测了5个坐标，即tx、ty、tw、th和to。如果单元格与图像左上角的偏移量为（cx，cy），且先验框的宽度和高度为pw，ph，则预测值对应于：
bx=σ(tx)+cxby=σ(ty)+cybw=pwetwbh=phethPr⁡(object )∗IOU⁡(b,object )=σ(to)\begin{equation} \begin{aligned} b_{x} &=\sigma\left(t_{x}\right)+c_{x} \\ b_{y} &=\sigma\left(t_{y}\right)+c_{y} \\ b_{w} &=p_{w} e^{t_{w}} \\ b_{h} &=p_{h} e^{t_{h}} \\ \operatorname{Pr}(\text { object }) * \operatorname{IOU}(b, \text { object }) &=\sigma\left(t_{o}\right) \end{aligned} \end{equation} bxbybwbhPr( object )∗IOU(b, object )=σ(tx)+cx=σ(ty)+cy=pwetw=pheth=σ(to)
Since we constrain the location prediction the parametrization is easier to learn, making the network more stable. Using dimension clusters along with directly predicting the bounding box center location improves YOLO by almost 5% over the version with anchor boxes.

由于我们限制了位置预测，参数化更容易学习，使网络更稳定。使用维度聚类以及直接预测边界框中心位置，比起使用锚框的版本，YOLO提高了近5%。

Figure 3: Bounding boxes with dimension priors and location prediction. We predict the width and height of the box as offsets from cluster centroids. We predict the center coordinates of the box relative to the location of filter application using a sigmoid function.

**图3：具有维度先验和位置预测的边界框。**我们预测框的宽度和高度，作为集群中心点的偏移量。我们用一个sigmoid函数预测框的中心坐标相对于过滤器应用的位置。

**Fine-Grained Features. ** This modified YOLO predicts detections on a 13 × 13 feature map. While this is sufficient for large objects, it may benefit from finer grained features for localizing smaller objects. Faster R-CNN and SSD both run their proposal networks at various feature maps in the network to get a range of resolutions. We take a different approach, simply adding a passthrough layer that brings features from an earlier layer at 26 × 26 resolution.

The passthrough layer concatenates the higher resolution features with the low resolution features by stacking adjacent features into different channels instead of spatial locations, similar to the identity mappings in ResNet. This turns the 26 × 26 × 512 feature map into a 13 × 13 × 2048 feature map, which can be concatenated with the original features. Our detector runs on top of this expanded feature map so that it has access to fine grained features. This gives a modest 1% performance increase.

细粒度的特征这个修改后的YOLO在13×13的特征图上预测探测结果。虽然这对大型物体来说是足够的，但它可能会受益于更细粒度的特征来定位较小的物体。Faster R-CNN和SSD都在网络中的各种特征图上运行他们的网络，以获得多个分辨率。我们采取了一种不同的方法，只需要增加一个直通层，从早期的层中提取26×26分辨率的特征。

直通层通过将相邻的特征堆叠到不同的通道而不是空间位置上，将高分辨率的特征与低分辨率的特征串联起来，类似于ResNet中的恒等映射。这种细粒度的特征。这就把26×26×512的特征图变成了13×13×2048的特征图，它可以与原始特征连接起来。我们的检测器在这个扩展的特征图之上运行，这样它就可以访问细粒度的特征。这使性能有了1%的适度提高。

Multi-Scale Training. The original YOLO uses an input resolution of 448 × 448. With the addition of anchor boxes we changed the resolution to 416×416. However, since our model only uses convolutional and pooling layers it can be resized on the fly. We want YOLOv2 to be robust to running on images of different sizes so we train this into the model.

Instead of fixing the input image size we change the network every few iterations. Every 10 batches our network randomly chooses a new image dimension size. Since our model downsamples by a factor of 32, we pull from the following multiples of 32: {320, 352, …, 608}. Thus the smallest option is 320 × 320 and the largest is 608 × 608. We resize the network to that dimension and continue training.

多尺度的训练原始的YOLO使用448×448的输入分辨率。通过添加锚框，我们将分辨率改为416×416。然而，由于我们的模型只使用卷积层和池化层，因此可以实时调整大小。我们希望YOLOv2能够鲁棒地运行在不同尺寸的图像上，所以我们将多尺度训练应用到模型中。

我们不需要修改输入图像的大小，而是每隔几个迭代就改变网络。每10个批次，我们的网络就会随机选择一个新的图像尺寸。由于我们的模型缩减了32倍，我们从以下32的倍数中抽取：{320, 352, …, 608}。因此，最小的选项是320 × 320，最大的是608 × 608。我们将调整网络的尺寸，然后继续训练。

This regime forces the network to learn to predict well across a variety of input dimensions. This means the same network can predict detections at different resolutions. The network runs faster at smaller sizes so YOLOv2 offers an easy tradeoff between speed and accuracy.

At low resolutions YOLOv2 operates as a cheap, fairly accurate detector. At 288 × 288 it runs at more than 90 FPS with mAP almost as good as Fast R-CNN. This makes it ideal for smaller GPUs, high framerate video, or multiple video streams.

At high resolution YOLOv2 is a state-of-the-art detector with 78.6 mAP on VOC 2007 while still operating above real-time speeds. See Table 3 for a comparison of YOLOv2 with other frameworks on VOC 2007. Figure 4

这种制度迫使网络学会在各种输入维度上进行良好的预测。这意味着同一个网络可以预测不同分辨率下的检测结果。网络在较小的尺寸下运行得更快，因此YOLOv2在速度和准确性之间提供了一个简单的权衡。

在低分辨率下，YOLOv2作为一个廉价、相当准确的检测器运行。在288×288时，它以超过90 FPS的速度运行，其mAP几乎与Faster R-CNN一样好。这使它成为较小的GPU、高帧率视频或多个视频流的理想选择。

在高分辨率下，YOLOv2是一个最先进的检测器，在VOC 2007上的mAP为78.6，而运行速度仍高于实时速度。YOLOv2与其他框架在VOC 2007上的比较见表3。图4

Table 3: Detection frameworks on PASCAL VOC 2007. YOLOv2 is faster and more accurate than prior detection methods. It can also run at different resolutions for an easy tradeoff between speed and accuracy. Each YOLOv2 entry is actually the same trained model with the same weights, just evaluated at a different size. All timing information is on a Geforce GTX Titan X (original, not Pascal model).

表3：PASCAL VOC 2007上的检测框架YOLOv2比之前的检测方法更快、更准确。它还可以在不同的分辨率下运行，便于在速度和准确性之间进行权衡。每个YOLOv2条目实际上都是相同的训练过的模型，具有相同的权重，只是在不同的尺寸下进行评估。所有计时信息都是在Geforce GTX Titan X（原始的，不是Pascal模型）上进行的。

Figure 4: Accuracy and speed on VOC 2007.

图4：VOC 2007的准确度和速度。

Further Experiments. We train YOLOv2 for detection on VOC 2012. Table 4 shows the comparative performance of YOLOv2 versus other state-of-the-art detection systems. YOLOv2 achieves 73.4 mAP while running far faster than competing methods. We also train on COCO and compare to other methods in Table 5. On the VOC metric (IOU = .5) YOLOv2 gets 44.0 mAP , comparable to SSD and Faster R-CNN.

进一步的实验我们训练YOLOv2对VOC 2012进行检测。表4显示了YOLOv2与其他最先进的检测系统的性能比较。YOLOv2实现了73.4 mAP，同时运行速度远远超过比较的方法。我们还对COCO进行了训练，并在表5中与其他方法进行了比较。在VOC指标（IOU = 0.5）上，YOLOv2得到44.0 mAP，与SSD和Faster R-CNN相当。

Table 4: PASCAL VOC2012 test detection results. YOLOv2 performs on par with state-of-the-art detectors like Faster R-CNN with ResNet and SSD512 and is 2 − 10× faster.

**表4：PASCAL VOC2012测试检测结果。**YOLOv2与采用ResNet和SSD512的Faster R-CNN等先进检测器性能相当，速度提高2至10倍。

Table 5: Results on COCO test-dev2015. Table adapted from [11]

表5：COCO test-dev2015的结果。表格改编自[11]。

3. Faster

3. 更快

We want detection to be accurate but we also want it to be fast. Most applications for detection, like robotics or selfdriving cars, rely on low latency predictions. In order to maximize performance we design YOLOv2 to be fast from the ground up.

我们希望检测是准确的，但我们也希望它是快速的。大多数检测的应用，如机器人或自动驾驶汽车，都依赖于低延迟的预测。为了最大限度地提高性能，我们在设计YOLOv2时从头到尾都是快速的。

Most detection frameworks rely on VGG-16 as the base feature extractor [17]. VGG-16 is a powerful, accurate classification network but it is needlessly complex. The convolutional layers of VGG-16 require 30.69 billion floating point operations for a single pass over a single image at 224 × 224 resolution.

The YOLO framework uses a custom network based on the Googlenet architecture [19]. This network is faster than VGG-16, only using 8.52 billion operations for a forward pass. However, it’s accuracy is slightly worse than VGG16. For single-crop, top-5 accuracy at 224 × 224, YOLO’s custom model gets 88.0% ImageNet compared to 90.0% for VGG-16.

大多数检测框架依靠VGG-16作为基础特征提取器[17]。VGG-16是一个强大的、准确的分类网络，但它是不必要的复杂。VGG-16的卷积层需要306.9亿次浮点运算来处理一张224×224分辨率的图像。

YOLO框架使用一个基于Googlenet架构的定制网络[19]。这个网络比VGG-16更快，一个前向通道只用了85.2亿次运算。然而，它的准确性比VGG16略差。对于224×224的单张图像，前5名的准确率，YOLO在ImageNet上的自定义模型精度为88.0%，而VGG-16为90.0%。

Darknet-19. We propose a new classification model to be used as the base of YOLOv2. Our model builds off of prior work on network design as well as common knowledge in the field. Similar to the VGG models we use mostly 3 × 3 filters and double the number of channels after every pooling step [17]. Following the work on Network in Network (NIN) we use global average pooling to make predictions as well as 1 × 1 filters to compress the feature representation between 3 × 3 convolutions [9]. We use batch normalization to stabilize training, speed up convergence, and regularize the model [7].

Our final model, called Darknet-19, has 19 convolutional layers and 5 maxpooling layers. For a full description see Table 6. Darknet-19 only requires 5.58 billion operations to process an image yet achieves 72.9% top-1 accuracy and 91.2% top-5 accuracy on ImageNet.

Darknet-19我们提出一个新的分类模型，作为YOLOv2的基础。我们的模型建立在先前的网络设计工作以及该领域的常识之上。与VGG模型类似，我们主要使用3×3的过滤器，并在每个池化步骤后将通道的数量增加一倍[17]。按照网络中的网络（NIN）的工作，我们使用全局平均池来进行预测，以及使用1×1滤波器来压缩3×3卷积之间的特征表示[9]。我们使用批量归一化来稳定训练，加速收敛，并使模型正规化[7]。

我们的最终模型，称为Darknet-19，有19个卷积层和5个maxpooling层。完整的描述见表6。Darknet-19只需要55.8亿次操作来处理一幅图像，却在ImageNet上达到了72.9%的最高准确率和91.2%的top-5准确率。

Table 6: Darknet-19.

Training for classification. We train the network on the standard ImageNet 1000 class classification dataset for 160 epochs using stochastic gradient descent with a starting learning rate of 0.1, polynomial rate decay with a power of 4, weight decay of 0.0005 and momentum of 0.9 using the Darknet neural network framework [13]. During training we use standard data augmentation tricks including random crops, rotations, and hue, saturation, and exposure shifts.

As discussed above, after our initial training on images at 224 × 224 we fine tune our network at a larger size, 448. For this fine tuning we train with the above parameters but for only 10 epochs and starting at a learning rate of 10−3. At this higher resolution our network achieves a top-1 accuracy of 76.5% and a top-5 accuracy of 93.3%.

分类的训练 我们使用随机梯度下降法在标准的ImageNet 1000类分类数据集上训练网络160次，使用Darknet神经网络框架[13]，起始学习率为0.1，多项式速率衰减为4次方，权重衰减为0.0005，动量为0.9。在训练过程中，我们使用标准的数据增强技巧，包括随机作物、旋转、色调、饱和度和曝光度的转变。

如上所述，在对224×224的图像进行初始训练后，我们在更大的尺寸（448）上对我们的网络进行微调。在这种微调中，我们用上述参数进行训练，但只用了10个epoch，并以10-3的学习率开始。在这个更高的分辨率下，我们的网络达到了76.5%的最高准确率和93.3%的Top-5准确率。

Training for detection. We modify this network for detection by removing the last convolutional layer and instead adding on three 3 × 3 convolutional layers with 1024 filters each followed by a final 1 × 1 convolutional layer with the number of outputs we need for detection. For VOC we predict 5 boxes with 5 coordinates each and 20 classes per box so 125 filters. We also add a passthrough layer from the final 3 × 3 × 512 layer to the second to last convolutional layer so that our model can use fine grain features.

We train the network for 160 epochs with a starting learning rate of 10−3, dividing it by 10 at 60 and 90 epochs.We use a weight decay of 0.0005 and momentum of 0.9. We use a similar data augmentation to YOLO and SSD with random crops, color shifting, etc. We use the same training strategy on COCO and VOC.

检测的训练 我们对这个网络进行了修改，去掉了最后一个卷积层，而是增加了三个3×3的卷积层，每个卷积层有1024个过滤器，然后是最后一个1×1的卷积层，输出的数量是我们检测所需的。对于VOC，我们预测5个框的5个坐标，每个框有20个类别，所以有125个过滤器。我们还从最后的3×3×512层向第二个卷积层添加了一个直通层，以便我们的模型可以使用细粒度的特征。

我们用10-3的起始学习率训练网络160个epoch，在60和90个epoch时除以10。我们使用0.0005的权重衰减和0.9的动量。我们使用与YOLO和SSD类似的数据增强，包括随机裁剪、颜色转换等。我们在COCO和VOC上使用同样的训练策略。

3. Stronger

3. 更强

We propose a mechanism for jointly training on classification and detection data. Our method uses images labelled for detection to learn detection-specific information like bounding box coordinate prediction and objectness as well as how to classify common objects. It uses images with only class labels to expand the number of categories it can detect.

During training we mix images from both detection and classification datasets. When our network sees an image labelled for detection we can backpropagate based on the full YOLOv2 loss function. When it sees a classification image we only backpropagate loss from the classificationspecific parts of the architecture.

我们提出了一种对分类和检测数据进行联合训练的机制。我们的方法使用标记为检测的图像来学习特定的检测信息，如边界框坐标预测和目标类，以及如何对普通目标进行分类。它使用只有类别标签的图像来扩大它可以检测的类别的数量。

在训练过程中，我们混合了来自检测和分类数据集的图像。当我们的网络看到被标记为检测的图像时，我们可以根据完整的YOLOv2损失函数进行反向传播。当它看到一个分类图像时，我们只从架构的分类特定部分反向传播损失。

This approach presents a few challenges. Detection datasets have only common objects and general labels, like “dog” or “boat”. Classification datasets have a much wider and deeper range of labels. ImageNet has more than a hundred breeds of dog, including “Norfolk terrier”, “Yorkshire terrier”, and “Bedlington terrier”. If we want to train on both datasets we need a coherent way to merge these labels.

Most approaches to classification use a softmax layer across all the possible categories to compute the final probability distribution. Using a softmax assumes the classes are mutually exclusive. This presents problems for combining datasets, for example you would not want to combine ImageNet and COCO using this model because the classes “Norfolk terrier” and “dog” are not mutually exclusive.

We could instead use a multi-label model to combine the datasets which does not assume mutual exclusion. This approach ignores all the structure we do know about the data, for example that all of the COCO classes are mutually exclusive.

这种方法带来了一些挑战。检测数据集只有常见的物体和一般的标签，如 "狗 "或 “船”。分类数据集有更广泛和更深入的标签范围。ImageNet有一百多个狗的品种，包括 “诺福克梗”、"约克夏梗 "和 “贝灵顿梗”。如果我们想在这两个数据集上进行训练，我们需要一个连贯的方法来合并这些标签。

大多数分类方法在所有可能的类别中使用softmax层来计算最终的概率分布。使用softmax时，假定这些类别是相互排斥的。这给合并数据集带来了问题，例如，你不会想用这个模型来合并ImageNet和COCO，因为 "诺福克梗 "和 "狗 "这两个类别并不相互排斥。

我们可以使用一个多标签模型来结合数据集，而这个模型并不假定相互排斥。这种方法忽略了我们所知道的关于数据的所有结构，例如，所有的COCO类都是互斥的。

Hierarchical classification. ImageNet labels are pulled from WordNet, a language database that structures concepts and how they relate [12]. In WordNet, “Norfolk terrier” and “Yorkshire terrier” are both hyponyms of “terrier” which is a type of “hunting dog”, which is a type of “dog”, which is a “canine”, etc. Most approaches to classification assume a flat structure to the labels however for combining datasets, structure is exactly what we need.

WordNet is structured as a directed graph, not a tree, because language is complex. For example a “dog” is both a type of “canine” and a type of “domestic animal” which are both synsets in WordNet. Instead of using the full graph structure, we simplify the problem by building a hierarchical tree from the concepts in ImageNet.

分层分类 ImageNet的标签是从WordNet中提取的，WordNet是一个语言数据库，用于构造概念和它们之间的关系[12]。在WordNet中，"Norfolk terrier "和 "Yorkshire terrier "都是 "terrier "的外来语，而 "terrier "是 "猎狗 "的一种，是 "狗 "的一种，是 "犬类 "的一种等等。大多数分类方法都假定标签有一个平面结构，然而对于结合数据集来说，结构正是我们所需要的。

WordNet的结构是一个有向图，而不是一棵树，因为语言是复杂的。例如，"狗 "既是 "犬类 "的一种类型，也是 "家畜 "的一种类型，它们都是WordNet中的主题词。我们没有使用完整的图结构，而是通过从ImageNet中的概念建立一棵分层的树来简化这个问题。

To build this tree we examine the visual nouns in ImageNet and look at their paths through the WordNet graph to the root node, in this case “physical object”. Many synsets only have one path through the graph so first we add all of those paths to our tree. Then we iteratively examine the concepts we have left and add the paths that grow the tree by as little as possible. So if a concept has two paths to the root and one path would add three edges to our tree and the other would only add one edge, we choose the shorter path.

为了建立这棵树，我们检查了ImageNet中的视觉名词，并查看了它们通过WordNet图到根节点的路径，在这个例子中是 “物理对象”。许多同义词在图中只有一条路径，因此我们首先将所有这些路径添加到我们的树上。然后，我们反复检查我们剩下的概念，并添加路径，使树的增长尽可能少。因此，如果一个概念有两条通往根的路径，其中一条路径会给我们的树增加三条边，而另一条只增加一条边，我们就选择较短的路径。

The final result is WordTree, a hierarchical model of visual concepts. To perform classification with WordTree we predict conditional probabilities at every node for the probability of each hyponym of that synset given that synset. For example, at the “terrier” node we predict:

最后的结果是WordTree，一个视觉概念的分层模型。为了用WordTree进行分类，我们在每个节点上预测条件概率，即在给定的同义词中，每个同义词的概率。例如，在 "terrier "节点，我们预测：
Pr⁡(Norfolkterrier∣terrier)Pr⁡(Yorkshireterrier∣terrier)Pr⁡(Bedlingtonterrier∣terrier)...\operatorname{Pr} (Norfolk terrier|terrier)\\ \operatorname{Pr} (Yorkshire terrier|terrier)\\ \operatorname{Pr} (Bedlington terrier|terrier)\\ ... Pr(Norfolkterrier∣terrier)Pr(Yorkshireterrier∣terrier)Pr(Bedlingtonterrier∣terrier)...
If we want to compute the absolute probability for a particular node we simply follow the path through the tree to the root node and multiply to conditional probabilities. So if we want to know if a picture is of a Norfolk terrier we compute:

如果我们想计算一个特定节点的绝对概率，我们只需沿着树的路径到根节点，然后乘以条件概率。因此，如果我们想知道一张图片是否是诺福克猎犬，我们就计算一下:
Pr⁡(Norfolk terrier )=Pr⁡(Norfolk terrier|terrier )Pr⁡(terrier ∣hunting dog )…Pr⁡(mammal ∣Pr⁡(animal )Pr⁡(animal ∣physical object )\begin{equation} \begin{gathered} \operatorname{Pr}(\text { Norfolk terrier })=\operatorname{Pr}(\text { Norfolk terrier|terrier }) \\ \operatorname{Pr}(\text { terrier } \mid \text { hunting dog }) \\ \ldots \\ \operatorname{Pr}(\text { mammal } \mid \operatorname{Pr}(\text { animal }) \\ \operatorname{Pr}(\text { animal } \mid \text { physical object }) \end{gathered} \end{equation} Pr( Norfolk terrier )=Pr( Norfolk terrier|terrier )Pr( terrier ∣ hunting dog )…Pr( mammal ∣Pr( animal )Pr( animal ∣ physical object )

For classification purposes we assume that the the image contains an object: Pr(physical object) = 1.

To validate this approach we train the Darknet-19 model on WordTree built using the 1000 class ImageNet. To build WordTree1k we add in all of the intermediate nodes which expands the label space from 1000 to 1369. During training we propagate ground truth labels up the tree so that if an image is labelled as a “Norfolk terrier” it also gets labelled as a “dog” and a “mammal”, etc. To compute the conditional probabilities our model predicts a vector of 1369 values and we compute the softmax over all sysnsets that are hyponyms of the same concept, see Figure 5.

为了分类的目的，我们假设该图像包含一个物体。Pr(物理对象) = 1。

为了验证这种方法，我们在使用1000类ImageNet建立的WordTree上训练Darknet-19模型。为了建立WordTree1k，我们加入了所有的中间节点，将标签空间从1000扩大到1369。在训练过程中，我们在树上传播基础事实标签，这样，如果一张图片被标记为 “诺福克梗”，它也会被标记为 "狗 "和 “哺乳动物”，等等。为了计算条件概率，我们的模型预测了一个由1369个值组成的向量，我们计算了所有作为同一概念的假名的系统集的softmax，见图5。

Figure 5: Prediction on ImageNet vs WordTree. Most ImageNet models use one large softmax to predict a probability distribution. Using WordTree we perform multiple softmax operations over co-hyponyms.

图5：ImageNet上的预测与WordTree上的预测 大多数ImageNet模型使用一个大的softmax来预测一个概率分布。使用WordTree，我们对共同语气词进行了多次softmax操作。

Using the same training parameters as before, our hierarchical Darknet-19 achieves 71.9% top-1 accuracy and 90.4% top-5 accuracy. Despite adding 369 additional concepts and having our network predict a tree structure our accuracy only drops marginally. Performing classification in this manner also has some benefits. Performance degrades gracefully on new or unknown object categories. For example, if the network sees a picture of a dog but is uncertain what type of dog it is, it will still predict “dog” with high confidence but have lower confidences spread out among the hyponyms.

This formulation also works for detection. Now, instead of assuming every image has an object, we use YOLOv2’s objectness predictor to give us the value of P r(physical object). The detector predicts a bounding box and the tree of probabilities. We traverse the tree down, taking the highest confidence path at every split until we reach some threshold and we predict that object class.

使用与之前相同的训练参数，我们的分层式Darknet-19达到了71.9%的top-1准确率和90.4%的top-5准确率。尽管增加了369个额外的概念，并让我们的网络预测树状结构，但我们的准确率只下降了一点。以这种方式进行分类也有一些好处。在新的或未知的对象类别上，性能会优雅地下降。例如，如果网络看到一张狗的照片，但不确定它是什么类型的狗，它仍然会以高置信度预测 “狗”，但在假名中分布的置信度会降低。

这种表述也适用于检测。现在，我们不是假设每张图片都有一个物体，而是使用YOLOv2的物体性预测器来给我们提供Pr（物理物体）的值。检测器会预测出一个边界框和概率树。我们向下遍历这棵树，在每一个分叉处采取最高的置信度路径，直到我们达到某个阈值，我们就可以预测那个物体类别。

Dataset combination with WordTree. We can use WordTree to combine multiple datasets together in a sensible fashion. We simply map the categories in the datasets to synsets in the tree. Figure 6 shows an example of using WordTree to combine the labels from ImageNet and COCO. WordNet is extremely diverse so we can use this technique with most datasets.

**用WordTree组合数据集。**我们可以使用WordTree以合理的方式将多个数据集组合在一起。我们只需将数据集中的类别映射到树上的同位素。图6显示了一个使用WordTree来结合ImageNet和COCO的标签的例子。WordNet是非常多样化的，所以我们可以将这种技术用于大多数数据集。

Figure 6: Combining datasets using WordTree hierarchy. Using the WordNet concept graph we build a hierarchical tree of visual concepts.Then we can merge datasets together by mapping the classes in the dataset to synsets in the tree. This is a simplified view of WordTree for illustration purposes.

图6：使用WordTree层次结构结合数据集 使用WordNet概念图，我们建立了一个视觉概念的分层树。然后，我们可以通过将数据集中的类映射到树上的概念集来将数据集合并在一起。这是WordTree的一个简化视图，用于说明问题。

Joint classification and detection. Now that we can combine datasets using WordTree we can train our joint model on classification and detection. We want to train an extremely large scale detector so we create our combined dataset using the COCO detection dataset and the top 9000 classes from the full ImageNet release. We also need to evaluate our method so we add in any classes from the ImageNet detection challenge that were not already included. The corresponding WordTree for this dataset has 9418 classes. ImageNet is a much larger dataset so we balance the dataset by oversampling COCO so that ImageNet is only larger by a factor of 4:1.

Using this dataset we train YOLO9000. We use the base YOLOv2 architecture but only 3 priors instead of 5 to limit the output size. When our network sees a detection image we backpropagate loss as normal. For classification loss, we only backpropagate loss at or above the corresponding level of the label. For example, if the label is “dog” we do assign any error to predictions further down in the tree, “German Shepherd” versus “Golden Retriever”, because we do not have that information.

联合分类和检测 现在我们可以使用WordTree结合数据集，我们可以训练分类和检测的联合模型。我们想训练一个极大规模的检测器，所以我们使用COCO检测数据集和ImageNet完整版本中的前9000个类来创建我们的联合数据集。我们还需要评估我们的方法，所以我们加入了ImageNet检测挑战中尚未包括的任何类别。这个数据集的相应WordTree有9418个类。ImageNet是一个更大的数据集，所以我们通过对COCO的过度采样来平衡数据集，使ImageNet只比它大4:1。

使用这个数据集，我们训练YOLO9000。我们使用基本的YOLOv2架构，但只有3个先验因素，而不是5个，以限制输出大小。当我们的网络看到一个检测图像时，我们像平常一样反向传播损失。对于分类损失，我们只在标签的相应级别或以上反向传播损失。例如，如果标签是 “狗”，我们不给树上更远的预测分配任何错误，"德国牧羊犬 "与 “金毛猎犬”，因为我们没有这些信息。

When it sees a classification image we only backpropagate classification loss. To do this we simply find the bounding box that predicts the highest probability for that class and we compute the loss on just its predicted tree. We also assume that the predicted box overlaps what would be the ground truth label by at least .3 IOU and we backpropagate objectness loss based on this assumption.

Using this joint training, YOLO9000 learns to find objects in images using the detection data in COCO and it learns to classify a wide variety of these objects using data from ImageNet.

当它看到一个分类图像时，我们只反向传播分类损失。要做到这一点，我们只需找到预测该类的最高概率的边界框，并计算其预测树上的损失。我们还假设预测框与地面真实标签至少有0.3 IOU的重叠，我们根据这一假设反向传播对象性损失。

通过这种联合训练，YOLO9000学会了使用COCO中的检测数据来寻找图像中的物体，并学会了使用ImageNet中的数据对这些物体进行分类。

We evaluate YOLO9000 on the ImageNet detection task. The detection task for ImageNet shares on 44 object categories with COCO which means that YOLO9000 has only seen classification data for the majority of the test images, not detection data. YOLO9000 gets 19.7 mAP overall with 16.0 mAP on the disjoint 156 object classes that it has never seen any labelled detection data for. This mAP is higher than results achieved by DPM but YOLO9000 is trained on different datasets with only partial supervision [4]. It also is simultaneously detecting 9000 other object categories, all in real-time.

When we analyze YOLO9000’s performance on ImageNet we see it learns new species of animals well but struggles with learning categories like clothing and equipment. New animals are easier to learn because the objectness predictions generalize well from the animals in COCO. Conversely, COCO does not have bounding box label for any type of clothing, only for person, so YOLO9000 struggles to model categories like “sunglasses” or “swimming trunks”.

我们在ImageNet检测任务上评估了YOLO9000。ImageNet的检测任务与COCO共享44个对象类别，这意味着YOLO9000只看到了大多数测试图像的分类数据，而不是检测数据。YOLO9000总体上得到了19.7的mAP，在它从未见过任何标记的检测数据的156个不相干的对象类别上得到了16.0的mAP。这个mAP比DPM取得的结果要高，但是YOLO9000是在不同的数据集上训练的，只有部分监督[4]。它还同时检测了9000个其他物体类别，而且都是实时的。

当我们分析YOLO9000在ImageNet上的表现时，我们看到它能很好地学习新的动物物种，但在学习服装和设备等类别时却很困难。新的动物更容易学习，因为对象性预测可以很好地从COCO中的动物中概括出来。相反，COCO没有任何类型的衣服的边界框标签，只有人的标签，所以YOLO9000在为 "太阳镜 "或 "游泳裤 "等类别建模时很吃力。

Table 7: YOLO9000 Best and Worst Classes on ImageNet. The classes with the highest and lowest AP from the 156 weakly supervised classes. YOLO9000 learns good models for a variety of animals but struggles with new classes like clothing or equipment.

表7：YOLO9000在ImageNet上最好和最差的类 在156个弱监督类中，AP最高和最低的类。YOLO9000为各种动物学习了很好的模型，但对于像服装或设备这样的新类却很困难。

5. Conclusion

5. 结论

We introduce YOLOv2 and YOLO9000, real-time detection systems. YOLOv2 is state-of-the-art and faster than other detection systems across a variety of detection datasets. Furthermore, it can be run at a variety of image sizes to provide a smooth tradeoff between speed and accuracy.

YOLO9000 is a real-time framework for detection more than 9000 object categories by jointly optimizing detection and classification. We use WordTree to combine data from various sources and our joint optimization technique to train simultaneously on ImageNet and COCO. YOLO9000 is a strong step towards closing the dataset size gap between detection and classification.

我们介绍了YOLOv2和YOLO9000，实时检测系统。YOLOv2是最先进的，在各种检测数据集上比其他检测系统快。此外，它可以在各种图像尺寸下运行，在速度和准确性之间提供平稳的权衡。

YOLO9000是一个实时框架，通过联合优化检测和分类来检测9000多个物体类别。我们使用WordTree来结合各种来源的数据和我们的联合优化技术，在ImageNet和COCO上同时训练。YOLO9000是朝着缩小检测和分类之间的数据集大小差距迈出的有力一步。

Many of our techniques generalize outside of object detection. Our WordTree representation of ImageNet offers a richer, more detailed output space for image classification. Dataset combination using hierarchical classification would be useful in the classification and segmentation domains. Training techniques like multi-scale training could provide benefit across a variety of visual tasks.

For future work we hope to use similar techniques for weakly supervised image segmentation. We also plan to improve our detection results using more powerful matching strategies for assigning weak labels to classification data during training. Computer vision is blessed with an enormous amount of labelled data. We will continue looking for ways to bring different sources and structures of data together to make stronger models of the visual world.

我们的许多技术可以在目标检测之外进行推广。我们对ImageNet的WordTree表示为图像分类提供了一个更丰富、更详细的输出空间。使用分层分类的数据集组合在分类和分割领域将是有用的。像多尺度训练这样的训练技术可以在各种视觉任务中提供好处。

对于未来的工作，我们希望将类似的技术用于弱监督的图像分割。我们还计划在训练过程中使用更强大的匹配策略为分类数据分配弱标签来提高我们的检测结果。计算机视觉有着得天独厚的大量标记数据。我们将继续寻找方法，将不同来源和结构的数据结合起来，为视觉世界建立更强大的模型。

References

[1] S. Bell, C. L. Zitnick, K. Bala, and R. Girshick. Insideoutside net: Detecting objects in context with skip pooling and recurrent neural networks. arXiv preprint arXiv:1512.04143, 2015. 6

[2] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. FeiFei. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248–255. IEEE, 2009. 1

[3] M. Everingham, L. V an Gool, C. K. Williams, J. Winn, and A. Zisserman. The pascal visual object classes (voc) challenge. International journal of computer vision, 88(2):303– 338, 2010. 1

[4] P . F. Felzenszwalb, R. B. Girshick, and D. McAllester. Discriminatively trained deformable part models, release 4.

http://people.cs.uchicago.edu/ pff/latent-release4/. 8

[5] R. B. Girshick. Fast R-CNN. CoRR, abs/1504.08083, 2015. 4, 5, 6

[6] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015. 2, 4, 5

[7] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015. 2, 5

[8] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012. 2

[9] M. Lin, Q. Chen, and S. Yan. Network in network. arXiv preprint arXiv:1312.4400, 2013. 5

[10] T.-Y . Lin, M. Maire, S. Belongie, J. Hays, P . Perona, D. Ramanan, P . Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014. 1, 6

[11] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, and S. E. Reed. SSD: single shot multibox detector. CoRR, abs/1512.02325, 2015. 4, 5, 6

[12] G. A. Miller, R. Beckwith, C. Fellbaum, D. Gross, and K. J. Miller. Introduction to wordnet: An on-line lexical database.

International journal of lexicography, 3(4):235–244, 1990. 6

[13] J. Redmon. Darknet: Open source neural networks in c. http://pjreddie.com/darknet/, 2013–2016. 5

[14] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. Y ou only look once: Unified, real-time object detection. arXiv preprint arXiv:1506.02640, 2015. 4, 5

[15] S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. arXiv preprint arXiv:1506.01497, 2015. 2, 3, 4, 5, 6

[16] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 2015. 2

[17] K. Simonyan and A. Zisserman. V ery deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014. 2, 5

[18] C. Szegedy, S. Ioffe, and V . V anhoucke. Inception-v4, inception-resnet and the impact of residual connections on learning. CoRR, abs/1602.07261, 2016. 2

[19] C. Szegedy, W. Liu, Y . Jia, P . Sermanet, S. Reed, D. Anguelov, D. Erhan, V . V anhoucke, and A. Rabinovich. Going deeper with convolutions. CoRR, abs/1409.4842, 2014. 5

[20] B. Thomee, D. A. Shamma, G. Friedland, B. Elizalde, K. Ni, D. Poland, D. Borth, and L.-J. Li. Yfcc100m: The new data in multimedia research. Communications of the ACM, 59(2):64–73, 2016. 1

YOLOv2论文中英文对照翻译相关推荐

YOLOv3论文中英文对照翻译
YOLOv3论文名称: YOLOv3: An Incremental Improvement YOLOv3论文下载地址:https://arxiv.org/pdf/1804.02767.pdf 声明: ...
YOLOv1论文中英文对照翻译
YOLOv1论文(附论文下载超链接):You Only Look Once: Unified, Real-Time Object Detection 声明:论文翻译仅用来学习,转载请注明出处 You ...
AI论文系列-经典论文[原文、中文翻译、中英文对照翻译]
AI论文系列-经典论文[原文.中文翻译.中英文对照翻译] @[TOC](AI论文系列-经典论文[原文.中文翻译.中英文对照翻译]) 1. CV系列 2. NLP系列 3. GNA系列 1. CV系列 ...
宇宙最强，meltdown论文中英文对照版(三)
本文由郭健郭大侠翻译,将分为三次连载完成,这是第三部分.郭大侠是蜗窝科技(http://www.wowotech.net/)的创始人,倡导"慢下来,享受技术"的健康理念,侠之大者, ...
英语作文计算机主板,(完整版)电脑主板bios英文版的中英文对照翻译.pdf
电脑主板 BIOS 英文版的中英文对照翻译让你的电脑 BIOS 知识迅速提高滴. Time/System Time 时间 / 系统时间 Date/System Date 日期/ 系统日期 Level ...
android 错误中英互译,安卓手机Recovery模式刷机情况下的中英文对照翻译
recovery ,用关机键音量 /- (依机型不同而不同,不过有些机型可能没有刷入recovery,可自行刷入.)即可进入recovery界面,在这个界面你可以直接用sd 卡上的zip格式的ro ...
20210828每周分享（第二期）-中英文对照翻译插件、笔记软件
1.资讯集中发现个非常好用的资讯集中网站,每天早上到公司后,都会打开看看.不用把时间浪费在刷各种APP上,每天稍微了解下时事就行. 网址:http://make.mk/ 2.中英文对照翻译插 ...
基于Django和翻译API实现web版的中英文对照翻译（一）
笔者经常需要翻译一些英文文档,但是试用了一些商业软件之后,一来觉得满足不了自己的翻译习惯,二来也是觉得对于个人来说,使用需要的收费的东西总是会有些顾忌. 一番了解之后,决定选用搜狗翻译/有道翻译官方提 ...
计算机辅助外文文献,计算机辅助设计建筑CAD论文中英文对照资料外文翻译文献.doc...
中英文对照资料外文翻译文献 Computer-aided design (CAD) ??? Computer-aided design (CAD) is the use of a wide range ...
CVPR 2021 Authors Guidelines 投稿须知中英文对照翻译
怕存在一些被忽视的流程, 因此本文对CVPR2021的Authors Guidelines:http://cvpr2021.thecvf.com/node/33 进行全文对照翻译. 目录 AUTHO ...

YOLOv2论文中英文对照翻译