DL - YoloV3: Translation and Notes on the YOLOv3 Paper "YOLOv3: An Incremental Improvement"

Contents

YOLOv3 Paper: Translation and Notes

Abstract

1. Introduction

2. The Deal


Paper: https://arxiv.org/pdf/1804.02767.pdf

YOLOv3 Paper: Translation and Notes

Abstract

We present some updates to YOLO! We made a bunch of little design changes to make it better. We also trained this new network that's pretty swell. It's a little bigger than last time but more accurate. It's still fast though, don't worry. At 320 × 320 YOLOv3 runs in 22 ms at 28.2 mAP, as accurate as SSD but three times faster. When we look at the old .5 IOU mAP detection metric YOLOv3 is quite good. It achieves 57.9 AP50 in 51 ms on a Titan X, compared to 57.5 AP50 in 198 ms by RetinaNet, similar performance but 3.8× faster. As always, all the code is online at https://pjreddie.com/yolo/.

1. Introduction

Sometimes you just kinda phone it in for a year, you know? I didn't do a whole lot of research this year. Spent a lot of time on Twitter. Played around with GANs a little. I had a little momentum left over from last year [12] [1]; I managed to make some improvements to YOLO. But, honestly, nothing like super interesting, just a bunch of small changes that make it better. I also helped out with other people's research a little. Actually, that's what brings us here today. We have a camera-ready deadline [4] and we need to cite some of the random updates I made to YOLO but we don't have a source. So get ready for a TECH REPORT! The great thing about tech reports is that they don't need intros, y'all know why we're here. So the end of this introduction will signpost for the rest of the paper. First we'll tell you what the deal is with YOLOv3. Then we'll tell you how we do. We'll also tell you about some things we tried that didn't work. Finally we'll contemplate what this all means.

2. The Deal

So here's the deal with YOLOv3: We mostly took good ideas from other people. We also trained a new classifier network that's better than the other ones. We'll just take you through the whole system from scratch so you can understand it all.

Figure 1. We adapt this figure from the Focal Loss paper [9]. YOLOv3 runs significantly faster than other detection methods with comparable performance. Times from either an M40 or Titan X, they are basically the same GPU.

2.1. Bounding Box Prediction

Following YOLO9000 our system predicts bounding boxes using dimension clusters as anchor boxes [15]. The network predicts 4 coordinates for each bounding box, tx, ty, tw, th. If the cell is offset from the top left corner of the image by (cx, cy) and the bounding box prior has width and height pw, ph, then the predictions correspond to:
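bx = σ(tx) + cx
by = σ(ty) + cy
bw = pw · e^tw
bh = ph · e^th

where σ is the logistic (sigmoid) function, which keeps the predicted box center inside its grid cell.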

During training we use sum of squared error loss. If the ground truth for some coordinate prediction is t̂* our gradient is the ground truth value (computed from the ground truth box) minus our prediction: t̂* − t*. This ground truth value can be easily computed by inverting the equations above.
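A minimal NumPy sketch of this decode/encode step, with illustrative function names (this is not the official Darknet code):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def inverse_sigmoid(x, eps=1e-7):
    x = np.clip(x, eps, 1.0 - eps)
    return np.log(x / (1.0 - x))

def decode_box(t, cell_xy, prior_wh):
    """Map raw predictions (tx, ty, tw, th) to a box (bx, by, bw, bh)."""
    tx, ty, tw, th = t
    cx, cy = cell_xy          # offset of the cell from the top-left corner
    pw, ph = prior_wh         # width/height of the bounding box prior
    bx = sigmoid(tx) + cx
    by = sigmoid(ty) + cy
    bw = pw * np.exp(tw)
    bh = ph * np.exp(th)
    return bx, by, bw, bh

def encode_box(b, cell_xy, prior_wh):
    """Invert the equations to get the regression targets t̂ for a ground-truth box."""
    bx, by, bw, bh = b
    cx, cy = cell_xy
    pw, ph = prior_wh
    tx_hat = inverse_sigmoid(bx - cx)
    ty_hat = inverse_sigmoid(by - cy)
    tw_hat = np.log(bw / pw)
    th_hat = np.log(bh / ph)
    return tx_hat, ty_hat, tw_hat, th_hat
```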

Figure 2. Bounding boxes with dimension priors and location prediction. We predict the width and height of the box as offsets from cluster centroids. We predict the center coordinates of the box relative to the location of filter application using a sigmoid function. This figure blatantly self-plagiarized from [15].

YOLOv3 predicts an objectness score for each bounding box using logistic regression. This should be 1 if the bounding box prior overlaps a ground truth object by more than any other bounding box prior. If the bounding box prior is not the best but does overlap a ground truth object by more than some threshold we ignore the prediction, following [17]. We use the threshold of 0.5. Unlike [17] our system only assigns one bounding box prior for each ground truth object. If a bounding box prior is not assigned to a ground truth object it incurs no loss for coordinate or class predictions, only objectness.
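A simplified sketch of this assignment rule, comparing a ground-truth box against the priors by width/height IoU (the helper names are illustrative):

```python
import numpy as np

def iou_wh(wh1, wh2):
    """IoU of two boxes compared as if centered at the same point (width/height only)."""
    inter = min(wh1[0], wh2[0]) * min(wh1[1], wh2[1])
    union = wh1[0] * wh1[1] + wh2[0] * wh2[1] - inter
    return inter / union

def assign_priors(gt_wh, prior_whs, ignore_thresh=0.5):
    """Per-prior labels: 1 = responsible, -1 = ignored, 0 = negative (objectness loss only)."""
    ious = np.array([iou_wh(gt_wh, p) for p in prior_whs])
    labels = np.zeros(len(prior_whs), dtype=int)
    labels[ious > ignore_thresh] = -1      # overlaps a ground truth, but is not the best: ignore
    labels[int(np.argmax(ious))] = 1       # the single best prior is responsible for this object
    return labels
```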

2.2. Class Prediction

Each box predicts the classes the bounding box may contain using multilabel classification. We do not use a softmax as we have found it is unnecessary for good performance, instead we simply use independent logistic classifiers. During training we use binary cross-entropy loss for the class predictions. This formulation helps when we move to more complex domains like the Open Images Dataset [7]. In this dataset there are many overlapping labels (i.e. Woman and Person). Using a softmax imposes the assumption that each box has exactly one class which is often not the case. A multilabel approach better models the data.
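A small NumPy sketch of the multilabel classification loss described above (independent sigmoids plus binary cross-entropy; illustrative, not the Darknet implementation):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def multilabel_class_loss(class_logits, class_targets, eps=1e-7):
    """Independent logistic classifiers with binary cross-entropy, instead of a softmax.

    class_logits:  (num_boxes, num_classes) raw scores
    class_targets: (num_boxes, num_classes) 0/1 labels; a box may have several 1s
                   (e.g. both "Woman" and "Person" in Open Images).
    """
    p = np.clip(sigmoid(class_logits), eps, 1.0 - eps)
    bce = -(class_targets * np.log(p) + (1.0 - class_targets) * np.log(1.0 - p))
    return bce.sum()
```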

2.3. Predictions Across Scales

YOLOv3 predicts boxes at 3 different scales. Our system extracts features from those scales using a similar concept to feature pyramid networks [8]. From our base feature extractor we add several convolutional layers. The last of these predicts a 3-d tensor encoding bounding box, objectness, and class predictions. In our experiments with COCO [10] we predict 3 boxes at each scale so the tensor is N × N × [3 ∗ (4 + 1 + 80)] for the 4 bounding box offsets, 1 objectness prediction, and 80 class predictions.
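For example, for a 13 × 13 COCO scale the raw output can be split into its per-prior components as follows (the channel layout shown here is the conventional one and is assumed for illustration):

```python
import numpy as np

# Shape check for one COCO scale: N x N x [3 * (4 + 1 + 80)]
N, num_priors, num_classes = 13, 3, 80
raw_output = np.random.randn(N, N, num_priors * (4 + 1 + num_classes))

# Split into per-prior predictions: box offsets, objectness, class scores
preds = raw_output.reshape(N, N, num_priors, 4 + 1 + num_classes)
box_offsets  = preds[..., 0:4]    # tx, ty, tw, th
objectness   = preds[..., 4]      # raw objectness score (before sigmoid)
class_scores = preds[..., 5:]     # 80 independent class logits

print(raw_output.shape)   # (13, 13, 255)
print(box_offsets.shape)  # (13, 13, 3, 4)
```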

Next we take the feature map from 2 layers previous and upsample it by 2×. We also take a feature map from earlier in the network and merge it with our upsampled features using concatenation. This method allows us to get more meaningful semantic information from the upsampled features and finer-grained information from the earlier feature map. We then add a few more convolutional layers to process this combined feature map, and eventually predict a similar tensor, although now twice the size. We perform the same design one more time to predict boxes for the final scale. Thus our predictions for the 3rd scale benefit from all the prior computation as well as fine-grained features from early on in the network.

We still use k-means clustering to determine our bounding box priors. We just sort of chose 9 clusters and 3 scales arbitrarily and then divide up the clusters evenly across scales. On the COCO dataset the 9 clusters were: (10 × 13), (16 × 30), (33 × 23), (30 × 61), (62 × 45), (59 × 119), (116 × 90), (156 × 198), (373 × 326).
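A tiny sketch of dividing the 9 COCO priors evenly across the 3 scales (which group of priors goes to which scale follows common practice and is an assumption here, not something the paper spells out):

```python
# The 9 COCO priors from k-means, split evenly across the 3 prediction scales.
coco_priors = [(10, 13), (16, 30), (33, 23),       # finest scale (large feature map, small objects)
               (30, 61), (62, 45), (59, 119),      # middle scale
               (116, 90), (156, 198), (373, 326)]  # coarsest scale (small feature map, large objects)

priors_per_scale = [coco_priors[i:i + 3] for i in range(0, 9, 3)]
for scale, priors in enumerate(priors_per_scale):
    print(f"scale {scale}: {priors}")
```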

2.4. Feature Extractor

We use a new network for performing feature extraction. Our new network is a hybrid approach between the network used in YOLOv2, Darknet-19, and that newfangled residual network stuff. Our network uses successive 3 × 3 and 1 × 1 convolutional layers but now has some shortcut connections as well and is significantly larger. It has 53 convolutional layers so we call it.... wait for it..... Darknet-53!
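A PyTorch sketch of one such residual unit, using the 1 × 1 then 3 × 3 convolution pattern with a shortcut connection (the official implementation is the Darknet C framework; this block is only illustrative):

```python
import torch
import torch.nn as nn

class DarknetResidual(nn.Module):
    """One Darknet-53-style residual unit: a 1x1 conv halves the channels,
    a 3x3 conv restores them, and a shortcut adds the input back."""
    def __init__(self, channels):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(channels, channels // 2, kernel_size=1, bias=False),
            nn.BatchNorm2d(channels // 2),
            nn.LeakyReLU(0.1),
            nn.Conv2d(channels // 2, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.LeakyReLU(0.1),
        )

    def forward(self, x):
        return x + self.block(x)

x = torch.randn(1, 256, 52, 52)
print(DarknetResidual(256)(x).shape)  # torch.Size([1, 256, 52, 52])
```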

This new network is much more powerful than Darknet-19 but still more efficient than ResNet-101 or ResNet-152. Here are some ImageNet results:

Table 2. Comparison of backbones. Accuracy, billions of operations, billion floating point operations per second, and FPS for various networks.

Each network is trained with identical settings and tested at 256 × 256, single crop accuracy. Run times are measured on a Titan X at 256 × 256. Thus Darknet-53 performs on par with state-of-the-art classifiers but with fewer floating point operations and more speed. Darknet-53 is better than ResNet-101 and 1.5× faster. Darknet-53 has similar performance to ResNet-152 and is 2× faster. Darknet-53 also achieves the highest measured floating point operations per second. This means the network structure better utilizes the GPU, making it more efficient to evaluate and thus faster. That's mostly because ResNets have just way too many layers and aren't very efficient.

2.5. Training

We still train on full images with no hard negative mining or any of that stuff. We use multi-scale training, lots of data augmentation, batch normalization, all the standard stuff. We use the Darknet neural network framework for training and testing [14].
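A rough sketch of a multi-scale training schedule (the 320 to 608 size range in steps of 32 and the resize interval follow common Darknet configurations and are assumptions here, not a quote from the paper):

```python
import random

# Multi-scale training: every few iterations, pick a new square input size.
SIZES = list(range(320, 608 + 1, 32))   # 320, 352, ..., 608

def training_sizes(num_iterations, resize_every=10):
    size = 416
    for it in range(num_iterations):
        if it % resize_every == 0:
            size = random.choice(SIZES)
        yield it, size

for it, size in training_sizes(40, resize_every=10):
    if it % 10 == 0:
        print(f"iteration {it}: input size {size}x{size}")
```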
