Contents

3. Comparison to Other Detection Systems
4. Experiments
- 4.1. Comparison to Other Real-Time Systems
- 4.2. VOC 2007 Error Analysis
- 4.3. Combining Fast R-CNN and YOLO
- 4.4. VOC 2012 Results
- 4.5. Generalizability: Person Detection in Artwork
5. Real-Time Detection In The Wild
6. Conclusion
References

3. Comparison to Other Detection Systems

Object detection is a core problem in computer vision. Detection pipelines generally start by extracting a set of robust features from input images (Haar [25], SIFT [23], HOG [4], convolutional features [6]). Then, classifiers [35, 21, 13, 10] or localizers [1, 31] are used to identify objects in the feature space. These classifiers or localizers are run either in sliding window fashion over the whole image or on some subset of regions in the image [34, 15, 38]. We compare the YOLO detection system to several top detection frameworks, highlighting key similarities and differences.

Deformable parts models. Deformable parts models (DPM) use a sliding window approach to object detection [10]. DPM uses a disjoint pipeline to extract static features, classify regions, predict bounding boxes for high scoring regions, etc. Our system replaces all of these disparate parts with a single convolutional neural network. The network performs feature extraction, bounding box prediction, non-maximal suppression, and contextual reasoning all concurrently. Instead of static features, the network trains the features inline and optimizes them for the detection task. Our unified architecture leads to a faster, more accurate model than DPM.

R-CNN. R-CNN and its variants use region proposals instead of sliding windows to find objects in images. Selective Search [34] generates potential bounding boxes, a convolutional network extracts features, an SVM scores the boxes, a linear model adjusts the bounding boxes, and non-max suppression eliminates duplicate detections. Each stage of this complex pipeline must be precisely tuned independently and the resulting system is very slow, taking more than 40 seconds per image at test time [14].
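
The last stage of that pipeline, non-max suppression, can be sketched in a few lines. This is a minimal greedy version, assuming boxes are `(x1, y1, x2, y2)` tuples; the helper names `iou` and `non_max_suppression` and the 0.5 threshold are illustrative choices, not taken from the R-CNN code.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: visit boxes from highest to lowest score and keep a box
    only if it does not overlap an already kept box by more than the threshold.
    Returns the indices of the kept boxes, best score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = non_max_suppression(boxes, scores)  # the second box duplicates the first
```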

YOLO shares some similarities with R-CNN. Each grid cell proposes potential bounding boxes and scores those boxes using convolutional features. However, our system puts spatial constraints on the grid cell proposals which helps mitigate multiple detections of the same object. Our system also proposes far fewer bounding boxes, only 98 per image compared to about 2000 from Selective Search. Finally, our system combines these individual components into a single, jointly optimized model.
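
The figure of 98 boxes follows from the grid layout. The sketch below assumes the configuration described in the earlier part of the paper for PASCAL VOC (a 7 × 7 grid with 2 boxes per cell); those two values are not stated in this section, and the function name is illustrative.

```python
def boxes_per_image(grid_size, boxes_per_cell):
    """Number of boxes predicted when every grid cell proposes a fixed number of boxes."""
    return grid_size * grid_size * boxes_per_cell


S = 7  # assumed grid size (from the paper's earlier section, not this one)
B = 2  # assumed boxes per grid cell (same caveat)

yolo_boxes = boxes_per_image(S, B)
selective_search_boxes = 2000  # approximate figure quoted in the text
reduction = selective_search_boxes / yolo_boxes
print(yolo_boxes, round(reduction, 1))
```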

Other Fast Detectors. Fast and Faster R-CNN focus on speeding up the R-CNN framework by sharing computation and using neural networks to propose regions instead of Selective Search [14] [27]. While they offer speed and accuracy improvements over R-CNN, both still fall short of real-time performance.

Many research efforts focus on speeding up the DPM pipeline [30] [37] [5]. They speed up HOG computation, use cascades, and push computation to GPUs. However, only 30Hz DPM [30] actually runs in real-time.

Instead of trying to optimize individual components of a large detection pipeline, YOLO throws out the pipeline entirely and is fast by design.

Detectors for single classes like faces or people can be highly optimized since they have to deal with much less variation [36]. YOLO is a general purpose detector that learns to detect a variety of objects simultaneously.

Deep MultiBox. Unlike R-CNN, Szegedy et al. train a convolutional neural network to predict regions of interest [8] instead of using Selective Search. MultiBox can also perform single object detection by replacing the confidence prediction with a single class prediction. However, MultiBox cannot perform general object detection and is still just a piece in a larger detection pipeline, requiring further image patch classification. Both YOLO and MultiBox use a convolutional network to predict bounding boxes in an image but YOLO is a complete detection system.

OverFeat. Sermanet et al. train a convolutional neural network to perform localization and adapt that localizer to perform detection [31]. OverFeat efficiently performs sliding window detection but it is still a disjoint system. OverFeat optimizes for localization, not detection performance. Like DPM, the localizer only sees local information when making a prediction. OverFeat cannot reason about global context and thus requires significant post-processing to produce coherent detections.

MultiGrasp. Our work is similar in design to work on grasp detection by Redmon et al. [26]. Our grid approach to bounding box prediction is based on the MultiGrasp system for regression to grasps. However, grasp detection is a much simpler task than object detection. MultiGrasp only needs to predict a single graspable region for an image containing one object. It doesn’t have to estimate the size, location, or boundaries of the object or predict its class, only find a region suitable for grasping. YOLO predicts both bounding boxes and class probabilities for multiple objects of multiple classes in an image.

4. Experiments

First we compare YOLO with other real-time detection systems on PASCAL VOC 2007. To understand the differences between YOLO and R-CNN variants we explore the errors on VOC 2007 made by YOLO and Fast R-CNN, one of the highest performing versions of R-CNN [14]. Based on the different error profiles we show that YOLO can be used to rescore Fast R-CNN detections and reduce the errors from background false positives, giving a significant performance boost. We also present VOC 2012 results and compare mAP to current state-of-the-art methods. Finally, we show that YOLO generalizes to new domains better than other detectors on two artwork datasets.

4.1. Comparison to Other Real-Time Systems

Many research efforts in object detection focus on making standard detection pipelines fast [5] [37] [30] [14] [17] [27]. However, only Sadeghi et al. actually produce a detection system that runs in real-time (30 frames per second or better) [30]. We compare YOLO to their GPU implementation of DPM which runs either at 30Hz or 100Hz. While the other efforts don’t reach the real-time milestone, we also compare their relative mAP and speed to examine the accuracy-performance tradeoffs available in object detection systems.

Fast YOLO is the fastest object detection method on PASCAL; as far as we know, it is the fastest extant object detector. With 52.7% mAP, it is more than twice as accurate as prior work on real-time detection. YOLO pushes mAP to 63.4% while still maintaining real-time performance.

We also train YOLO using VGG-16. This model is more accurate but also significantly slower than YOLO. It is useful for comparison to other detection systems that rely on VGG-16 but since it is slower than real-time the rest of the paper focuses on our faster models.

Fastest DPM effectively speeds up DPM without sacrificing much mAP but it still misses real-time performance by a factor of 2 [37]. It also is limited by DPM’s relatively low accuracy on detection compared to neural network approaches.

R-CNN minus R replaces Selective Search with static bounding box proposals [20]. While it is much faster than R-CNN, it still falls short of real-time and takes a significant accuracy hit from not having good proposals.

Table 1: Real-Time Systems on PASCAL VOC 2007. Comparing the performance and speed of fast detectors. Fast YOLO is the fastest detector on record for PASCAL VOC detection and is still twice as accurate as any other real-time detector. YOLO is 10 mAP more accurate than the fast version while still well above real-time in speed.

Fast R-CNN speeds up the classification stage of R-CNN but it still relies on selective search which can take around 2 seconds per image to generate bounding box proposals. Thus it has high mAP but at 0.5 fps it is still far from real-time.

The recent Faster R-CNN replaces selective search with a neural network to propose bounding boxes, similar to Szegedy et al. [8]. In our tests, their most accurate model achieves 7 fps while a smaller, less accurate one runs at 18 fps. The VGG-16 version of Faster R-CNN is 10 mAP higher but is also 6 times slower than YOLO. The Zeiler-Fergus Faster R-CNN is only 2.5 times slower than YOLO but is also less accurate.
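
The speed ratios quoted above can be checked with simple arithmetic. The sketch below assumes YOLO runs at 45 fps, a figure taken from the paper's Table 1 rather than from this section; the helper names are illustrative.

```python
def seconds_per_image(fps):
    """Latency implied by a frame rate."""
    return 1.0 / fps


def slowdown(reference_fps, other_fps):
    """How many times slower `other` is than `reference`."""
    return reference_fps / other_fps


YOLO_FPS = 45.0          # assumption: from the paper's Table 1, not stated in this section
FASTER_RCNN_VGG_FPS = 7.0
FASTER_RCNN_ZF_FPS = 18.0
FAST_RCNN_FPS = 0.5

print(round(slowdown(YOLO_FPS, FASTER_RCNN_VGG_FPS), 1))  # roughly 6 times slower
print(slowdown(YOLO_FPS, FASTER_RCNN_ZF_FPS))             # 2.5 times slower
print(seconds_per_image(FAST_RCNN_FPS))                   # about 2 seconds per image
```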

4.2. VOC 2007 Error Analysis

To further examine the differences between YOLO and state-of-the-art detectors, we look at a detailed breakdown of results on VOC 2007. We compare YOLO to Fast R-CNN since Fast R-CNN is one of the highest performing detectors on PASCAL and its detections are publicly available.

We use the methodology and tools of Hoiem et al. [19]. For each category at test time we look at the top N predictions for that category. Each prediction is either correct or it is classified based on the type of error:

Correct: correct class and IOU > .5
Localization: correct class, .1 < IOU < .5
Similar: class is similar, IOU > .1
Other: class is wrong, IOU > .1
Background: IOU < .1 for any object
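
The five categories above can be expressed as a small decision function. This is a sketch under stated assumptions, not the Hoiem et al. tool: boxes are `(x1, y1, x2, y2)` tuples, the `similar_classes` mapping is hypothetical, duplicate detections are not handled, and an IOU of exactly .5 (which the list leaves undefined) is treated as a localization error.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def classify_detection(pred_class, pred_box, ground_truths, similar_classes):
    """Label one prediction as Correct, Localization, Similar, Other, or Background.

    ground_truths: list of (class_name, box) pairs for the image.
    similar_classes: hypothetical dict mapping a class to the set of classes
    considered similar to it.
    """
    best_same = best_similar = best_other = 0.0
    for gt_class, gt_box in ground_truths:
        overlap = iou(pred_box, gt_box)
        if gt_class == pred_class:
            best_same = max(best_same, overlap)
        elif gt_class in similar_classes.get(pred_class, set()):
            best_similar = max(best_similar, overlap)
        else:
            best_other = max(best_other, overlap)

    if best_same > 0.5:
        return "Correct"
    if best_same > 0.1:
        return "Localization"
    if best_similar > 0.1:
        return "Similar"
    if best_other > 0.1:
        return "Other"
    return "Background"


truth = [("cat", (0, 0, 10, 10))]
similar = {"dog": {"cat"}}  # illustrative grouping only
label = classify_detection("cat", (5, 0, 15, 10), truth, similar)  # IOU is 1/3
```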

Figure 4: Error Analysis: Fast R-CNN vs. YOLO. These charts show the percentage of localization and background errors in the top N detections for various categories (N = # objects in that category).

Figure 4 shows the breakdown of each error type averaged across all 20 classes.

YOLO struggles to localize objects correctly. Localization errors account for more of YOLO’s errors than all other sources combined. Fast R-CNN makes far fewer localization errors but far more background errors. 13.6% of its top detections are false positives that don’t contain any objects. Fast R-CNN is almost 3x more likely to predict background detections than YOLO.

4.3. Combining Fast R-CNN and YOLO

YOLO makes far fewer background mistakes than Fast R-CNN. By using YOLO to eliminate background detections from Fast R-CNN we get a significant boost in performance. For every bounding box that R-CNN predicts we check to see if YOLO predicts a similar box. If it does, we give that prediction a boost based on the probability predicted by YOLO and the overlap between the two boxes.
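
The text does not give the exact rescoring formula, so the following is only one plausible reading. It assumes detections are `(class_name, box, score)` tuples, that a "similar box" means same class with IOU at or above a chosen `min_iou`, and that the boost is the YOLO probability multiplied by the overlap; all three are assumptions, and the function name is hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def rescore_with_yolo(frcnn_dets, yolo_dets, min_iou=0.5):
    """Boost each Fast R-CNN detection that a YOLO detection agrees with.

    Assumed boost: YOLO probability times overlap, using the best matching
    YOLO box. Unmatched detections keep their score, so likely background
    false positives fall in the ranking relative to confirmed detections.
    """
    rescored = []
    for cls, box, score in frcnn_dets:
        boost = 0.0
        for y_cls, y_box, y_prob in yolo_dets:
            if y_cls != cls:
                continue
            overlap = iou(box, y_box)
            if overlap >= min_iou:
                boost = max(boost, y_prob * overlap)
        rescored.append((cls, box, score + boost))
    return rescored


frcnn = [("person", (0, 0, 10, 10), 0.6), ("person", (50, 50, 60, 60), 0.6)]
yolo = [("person", (0, 0, 10, 10), 0.5)]
result = rescore_with_yolo(frcnn, yolo)
```

The resulting values are ranking scores rather than probabilities, so they may exceed 1; only their relative order matters when computing average precision.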

The best Fast R-CNN model achieves a mAP of 71.8% on the VOC 2007 test set. When combined with YOLO, its mAP increases by 3.2% to 75.0%. We also tried combining the top Fast R-CNN model with several other versions of Fast R-CNN. Those ensembles produced small increases in mAP between 0.3% and 0.6%; see Table 2 for details.

Table 2: Model combination experiments on VOC 2007. We examine the effect of combining various models with the best version of Fast R-CNN. Other versions of Fast R-CNN provide only a small benefit while YOLO provides a significant performance boost.

Table 3: PASCAL VOC 2012 Leaderboard. YOLO compared with the full comp4 (outside data allowed) public leaderboard as of November 6th, 2015. Mean average precision and per-class average precision are shown for a variety of detection methods. YOLO is the only real-time detector. Fast R-CNN + YOLO is the fourth highest scoring method, with a 2.3% boost over Fast R-CNN.

The boost from YOLO is not simply a byproduct of model ensembling since there is little benefit from combining different versions of Fast R-CNN. Rather, it is precisely because YOLO makes different kinds of mistakes at test time that it is so effective at boosting Fast R-CNN’s performance.

Unfortunately, this combination doesn’t benefit from the speed of YOLO since we run each model separately and then combine the results. However, since YOLO is so fast it doesn’t add any significant computational time compared to Fast R-CNN.

4.4. VOC 2012 Results

On the VOC 2012 test set, YOLO scores 57.9% mAP. This is lower than the current state of the art, closer to the original R-CNN using VGG-16, see Table 3. Our system struggles with small objects compared to its closest competitors. On categories like bottle, sheep, and tv/monitor YOLO scores 8-10% lower than R-CNN or Feature Edit. However, on other categories like cat and train YOLO achieves higher performance.

Our combined Fast R-CNN + YOLO model is one of the highest performing detection methods. Fast R-CNN gets a 2.3% improvement from the combination with YOLO, boosting it 5 spots up on the public leaderboard.

4.5. Generalizability: Person Detection in Artwork

Academic datasets for object detection draw the training and testing data from the same distribution. In real-world applications it is hard to predict all possible use cases and the test data can diverge from what the system has seen before [3]. We compare YOLO to other detection systems on the Picasso Dataset [12] and the People-Art Dataset [3], two datasets for testing person detection on artwork.

Figure 5 shows comparative performance between YOLO and other detection methods. For reference, we give VOC 2007 detection AP on person where all models are trained only on VOC 2007 data. On Picasso, models are trained on VOC 2012, while on People-Art they are trained on VOC 2010.

R-CNN has high AP on VOC 2007. However, R-CNN drops off considerably when applied to artwork. R-CNN uses Selective Search for bounding box proposals which is tuned for natural images. The classifier step in R-CNN only sees small regions and needs good proposals.

DPM maintains its AP well when applied to artwork. Prior work theorizes that DPM performs well because it has strong spatial models of the shape and layout of objects. Though DPM doesn’t degrade as much as R-CNN, it starts from a lower AP.

YOLO has good performance on VOC 2007 and its AP degrades less than other methods when applied to artwork. Like DPM, YOLO models the size and shape of objects, as well as relationships between objects and where objects commonly appear. Artwork and natural images are very different on a pixel level but they are similar in terms of the size and shape of objects, thus YOLO can still predict good bounding boxes and detections.

5. Real-Time Detection In The Wild

YOLO is a fast, accurate object detector, making it ideal for computer vision applications. We connect YOLO to a webcam and verify that it maintains real-time performance, including the time to fetch images from the camera and display the detections.
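
An end-to-end check of this kind can be sketched as a timing loop around three stages. The callables `fetch_frame`, `detect`, and `display` are placeholders for a camera read, a detector forward pass, and a drawing step; none of them comes from the YOLO source code, and the stubs below do no real work.

```python
import time


def measure_fps(fetch_frame, detect, display, num_frames=100):
    """Average frames per second over the full fetch, detect, and display loop."""
    start = time.perf_counter()
    for _ in range(num_frames):
        frame = fetch_frame()
        detections = detect(frame)
        display(frame, detections)
    elapsed = time.perf_counter() - start
    # Guard against a zero reading on very coarse clocks.
    return num_frames / elapsed if elapsed > 0 else float("inf")


# Stub stages standing in for a webcam, a detector, and a window.
calls = {"fetch": 0, "detect": 0, "display": 0}


def fake_fetch():
    calls["fetch"] += 1
    return "frame"


def fake_detect(frame):
    calls["detect"] += 1
    return []


def fake_display(frame, detections):
    calls["display"] += 1


fps = measure_fps(fake_fetch, fake_detect, fake_display, num_frames=10)
is_real_time = fps >= 30  # the section's threshold: 30 frames per second or better
```

Timing the whole loop rather than the detector alone matters because camera reads and drawing can dominate once the model itself is fast.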

Figure 5: Generalization results on Picasso and People-Art datasets.

Figure 6: Qualitative Results. YOLO running on sample artwork and natural images from the internet. It is mostly accurate although it does think one person is an airplane.

The resulting system is interactive and engaging. While YOLO processes images individually, when attached to a webcam it functions like a tracking system, detecting objects as they move around and change in appearance. A demo of the system and the source code can be found on our project website: http://pjreddie.com/yolo/.

6. Conclusion

We introduce YOLO, a unified model for object detection. Our model is simple to construct and can be trained directly on full images. Unlike classifier-based approaches, YOLO is trained on a loss function that directly corresponds to detection performance and the entire model is trained jointly.

Fast YOLO is the fastest general-purpose object detector in the literature and YOLO pushes the state-of-the-art in real-time object detection. YOLO also generalizes well to new domains making it ideal for applications that rely on fast, robust object detection.

Acknowledgements: This work is partially supported by ONR N00014-13-1-0720, NSF IIS-1338054, and The Allen Distinguished Investigator Award.

References

[1] M. B. Blaschko and C. H. Lampert. Learning to localize objects with structured output regression. In Computer Vision–ECCV 2008, pages 2–15. Springer, 2008.
[2] L. Bourdev and J. Malik. Poselets: Body part detectors trained using 3d human pose annotations. In International Conference on Computer Vision (ICCV), 2009.
[3] H. Cai, Q. Wu, T. Corradi, and P. Hall. The cross-depiction problem: Computer vision algorithms for recognising objects in artwork and in photographs. arXiv preprint arXiv:1505.00110, 2015.
[4] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, volume 1, pages 886–893. IEEE, 2005.
[5] T. Dean, M. Ruzon, M. Segal, J. Shlens, S. Vijayanarasimhan, J. Yagnik, et al. Fast, accurate detection of 100,000 object classes on a single machine. In Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pages 1814–1821. IEEE, 2013.
[6] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. Decaf: A deep convolutional activation feature for generic visual recognition. arXiv preprint arXiv:1310.1531, 2013.
[7] J. Dong, Q. Chen, S. Yan, and A. Yuille. Towards unified object detection and semantic segmentation. In Computer Vision–ECCV 2014, pages 299–314. Springer, 2014.
[8] D. Erhan, C. Szegedy, A. Toshev, and D. Anguelov. Scalable object detection using deep neural networks. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pages 2155–2162. IEEE, 2014.
[9] M. Everingham, S. M. A. Eslami, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The pascal visual object classes challenge: A retrospective. International Journal of Computer Vision, 111(1):98–136, Jan. 2015.
[10] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9):1627–1645, 2010.
[11] S. Gidaris and N. Komodakis. Object detection via a multi-region & semantic segmentation-aware CNN model. CoRR, abs/1505.01749, 2015.
[12] S. Ginosar, D. Haas, T. Brown, and J. Malik. Detecting people in cubist art. In Computer Vision-ECCV 2014 Workshops, pages 101–116. Springer, 2014.
[13] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pages 580–587. IEEE, 2014.
[14] R. B. Girshick. Fast R-CNN. CoRR, abs/1504.08083, 2015.
[15] S. Gould, T. Gao, and D. Koller. Region-based segmentation and object detection. In Advances in neural information processing systems, pages 655–663, 2009.
[16] B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik. Simultaneous detection and segmentation. In Computer Vision–ECCV 2014, pages 297–312. Springer, 2014.
[17] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. arXiv preprint arXiv:1406.4729, 2014.
[18] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.
[19] D. Hoiem, Y. Chodpathumwan, and Q. Dai. Diagnosing error in object detectors. In Computer Vision–ECCV 2012, pages 340–353. Springer, 2012.
[20] K. Lenc and A. Vedaldi. R-cnn minus r. arXiv preprint arXiv:1506.06981, 2015.
[21] R. Lienhart and J. Maydt. An extended set of haar-like features for rapid object detection. In Image Processing. 2002. Proceedings. 2002 International Conference on, volume 1, pages I–900. IEEE, 2002.
[22] M. Lin, Q. Chen, and S. Yan. Network in network. CoRR, abs/1312.4400, 2013.
[23] D. G. Lowe. Object recognition from local scale-invariant features. In Computer vision, 1999. The proceedings of the seventh IEEE international conference on, volume 2, pages 1150–1157. IEEE, 1999.
[24] D. Mishkin. Models accuracy on imagenet 2012 val. https://github.com/BVLC/caffe/wiki/Models-accuracy-on-ImageNet-2012-val. Accessed: 2015-10-2.
[25] C. P. Papageorgiou, M. Oren, and T. Poggio. A general framework for object detection. In Computer vision, 1998. sixth international conference on, pages 555–562. IEEE, 1998.
[26] J. Redmon and A. Angelova. Real-time grasp detection using convolutional neural networks. CoRR, abs/1412.3128, 2014.
[27] S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. arXiv preprint arXiv:1506.01497, 2015.
[28] S. Ren, K. He, R. B. Girshick, X. Zhang, and J. Sun. Object detection networks on convolutional feature maps. CoRR, abs/1504.06066, 2015.
[29] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 2015.
[30] M. A. Sadeghi and D. Forsyth. 30hz object detection with dpm v5. In Computer Vision–ECCV 2014, pages 65–79. Springer, 2014.
[31] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. Overfeat: Integrated recognition, localization and detection using convolutional networks. CoRR, abs/1312.6229, 2013.
[32] Z. Shen and X. Xue. Do more dropouts in pool5 feature maps for better object detection. arXiv preprint arXiv:1409.6911, 2014.
[33] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. CoRR, abs/1409.4842, 2014.
[34] J. R. Uijlings, K. E. van de Sande, T. Gevers, and A. W. Smeulders. Selective search for object recognition. International journal of computer vision, 104(2):154–171, 2013.
[35] P. Viola and M. Jones. Robust real-time object detection. International Journal of Computer Vision, 4:34–47, 2001.
[36] P. Viola and M. J. Jones. Robust real-time face detection. International journal of computer vision, 57(2):137–154, 2004.
[37] J. Yan, Z. Lei, L. Wen, and S. Z. Li. The fastest deformable part model for object detection. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pages 2497–2504. IEEE, 2014.
[38] C. L. Zitnick and P. Dollár. Edge boxes: Locating object proposals from edges. In Computer Vision–ECCV 2014, pages 391–405. Springer, 2014.
