Table of Contents

  • Authors and Publication
    • Authors
    • Bibtex
  • Abstract
  • 1. Introduction
  • 2. ConvNet Configurations
    • 2.1. Architecture
    • 2.2. Configurations
    • 2.3. Discussion

Authors and Publication

Authors

Karen Simonyan / Visual Geometry Group, Department of Engineering Science, University of Oxford
Andrew Zisserman / Visual Geometry Group, Department of Engineering Science, University of Oxford

Bibtex

@article{simonyan2014very,
  title={Very Deep Convolutional Networks for Large-Scale Image Recognition},
  author={Simonyan, Karen and Zisserman, Andrew},
  journal={arXiv preprint arXiv:1409.1556},
  year={2014}
}

Abstract

In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting. Our main contribution is a thorough evaluation of networks of increasing depth using an architecture with very small (3 × 3) convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16–19 weight layers.

These findings were the basis of our ImageNet Challenge 2014 submission, where our team secured the first and the second places in the localisation and classification tracks respectively. We also show that our representations generalise well to other datasets, where they achieve state-of-the-art results. We have made our two best-performing ConvNet models publicly available to facilitate further research on the use of deep visual representations in computer vision.

1. Introduction

Convolutional networks (ConvNets) have recently enjoyed a great success in large-scale image and video recognition (Krizhevsky et al., 2012; Zeiler & Fergus, 2013; Sermanet et al., 2014; Simonyan & Zisserman, 2014) which has become possible due to the large public image repositories, such as ImageNet (Deng et al., 2009), and high-performance computing systems, such as GPUs or large-scale distributed clusters (Dean et al., 2012).

In particular, an important role in the advance of deep visual recognition architectures has been played by the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) (Russakovsky et al., 2014), which has served as a testbed for a few generations of large-scale image classification systems, from high-dimensional shallow feature encodings (Perronnin et al., 2010) (the winner of ILSVRC-2011) to deep ConvNets (Krizhevsky et al., 2012) (the winner of ILSVRC-2012).

With ConvNets becoming more of a commodity in the computer vision field, a number of attempts have been made to improve the original architecture of Krizhevsky et al. (2012) in a bid to achieve better accuracy. For instance, the best-performing submissions to the ILSVRC-2013 (Zeiler & Fergus, 2013; Sermanet et al., 2014) utilised smaller receptive window size and smaller stride of the first convolutional layer.

Another line of improvements dealt with training and testing the networks densely over the whole image and over multiple scales (Sermanet et al., 2014; Howard, 2014). In this paper, we address another important aspect of ConvNet architecture design – its depth. To this end, we fix other parameters of the architecture, and steadily increase the depth of the network by adding more convolutional layers, which is feasible due to the use of very small (3 × 3) convolution filters in all layers.

As a result, we come up with significantly more accurate ConvNet architectures, which not only achieve the state-of-the-art accuracy on ILSVRC classification and localisation tasks, but are also applicable to other image recognition datasets, where they achieve excellent performance even when used as part of relatively simple pipelines (e.g. deep features classified by a linear SVM without fine-tuning). We have released our two best-performing models [1] to facilitate further research.

The rest of the paper is organised as follows. In Sect. 2, we describe our ConvNet configurations. The details of the image classification training and evaluation are then presented in Sect. 3, and the configurations are compared on the ILSVRC classification task in Sect. 4. Sect. 5 concludes the paper.

For completeness, we also describe and assess our ILSVRC-2014 object localisation system in Appendix A, and discuss the generalisation of very deep features to other datasets in Appendix B. Finally, Appendix C contains the list of major paper revisions.

2. ConvNet Configurations

To measure the improvement brought by the increased ConvNet depth in a fair setting, all our ConvNet layer configurations are designed using the same principles, inspired by Ciresan et al. (2011); Krizhevsky et al. (2012). In this section, we first describe a generic layout of our ConvNet configurations (Sect. 2.1) and then detail the specific configurations used in the evaluation (Sect. 2.2). Our design choices are then discussed and compared to the prior art in Sect. 2.3.

2.1. Architecture

During training, the input to our ConvNets is a fixed-size 224 × 224 RGB image. The only preprocessing we do is subtracting the mean RGB value, computed on the training set, from each pixel. The image is passed through a stack of convolutional (conv.) layers, where we use filters with a very small receptive field: 3 × 3 (which is the smallest size to capture the notion of left/right, up/down, center).

In one of the configurations we also utilise 1 × 1 convolution filters, which can be seen as a linear transformation of the input channels (followed by non-linearity). The convolution stride is fixed to 1 pixel; the spatial padding of conv. layer input is such that the spatial resolution is preserved after convolution, i.e. the padding is 1 pixel for 3 × 3 conv. layers.

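The padding rule above (1-pixel padding for 3 × 3 layers, stride 1) can be checked with the standard convolution output-size formula. This is a minimal Python sketch, not code from the paper; `conv_out` is a hypothetical helper:

```python
def conv_out(size, kernel, stride=1, pad=0):
    """Spatial output size of a convolution (floor division, as in most frameworks)."""
    return (size + 2 * pad - kernel) // stride + 1

# A 3x3 conv with stride 1 and 1-pixel padding preserves the resolution:
assert conv_out(224, kernel=3, stride=1, pad=1) == 224
# A 1x1 conv preserves it without any padding:
assert conv_out(224, kernel=1, stride=1, pad=0) == 224
# Without padding, each 3x3 conv would shave a pixel off each border:
assert conv_out(224, kernel=3, stride=1, pad=0) == 222
```

The same formula explains why very deep stacks of unpadded 3 × 3 layers would be impractical: the feature map would shrink by two pixels per layer.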

Spatial pooling is carried out by five max-pooling layers, which follow some of the conv. layers (not all the conv. layers are followed by max-pooling). Max-pooling is performed over a 2 × 2 pixel window, with stride 2.

A stack of convolutional layers (which has a different depth in different architectures) is followed by three Fully-Connected (FC) layers: the first two have 4096 channels each, the third performs 1000-way ILSVRC classification and thus contains 1000 channels (one for each class). The final layer is the soft-max layer. The configuration of the fully connected layers is the same in all networks.

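Putting the pieces of Sect. 2.1 together, the spatial resolutions and fully-connected sizes can be traced in a few lines. This is a sketch assuming the generic layout above; the 512-channel width of the deepest conv. stage is stated later in Sect. 2.2:

```python
# Five 2x2/stride-2 max-pools halve the 224x224 input five times.
size = 224
for _ in range(5):      # five max-pooling layers
    size //= 2          # 2x2 window, stride 2
print(size)             # 7

# The first FC layer therefore sees a 7x7x512 volume:
flattened = size * size * 512
print(flattened)        # 25088

# Two 4096-channel FC layers, then the 1000-way classification layer:
fc_sizes = [flattened, 4096, 4096, 1000]
```

This also shows why the fully-connected configuration can stay identical across networks A–E: every configuration ends with the same 7 × 7 × 512 conv. output.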

All hidden layers are equipped with the rectification (ReLU (Krizhevsky et al., 2012)) non-linearity. We note that none of our networks (except for one) contain Local Response Normalisation (LRN) normalisation (Krizhevsky et al., 2012): as will be shown in Sect. 4, such normalisation does not improve the performance on the ILSVRC dataset, but leads to increased memory consumption and computation time. Where applicable, the parameters for the LRN layer are those of (Krizhevsky et al., 2012).

2.2. Configurations

The ConvNet configurations, evaluated in this paper, are outlined in Table 1, one per column. In the following we will refer to the nets by their names (A–E). All configurations follow the generic design presented in Sect. 2.1, and differ only in the depth: from 11 weight layers in the network A (8 conv. and 3 FC layers) to 19 weight layers in the network E (16 conv. and 3 FC layers). The width of conv. layers (the number of channels) is rather small, starting from 64 in the first layer and then increasing by a factor of 2 after each max-pooling layer, until it reaches 512.

In Table 2 we report the number of parameters for each configuration. In spite of a large depth, the number of weights in our nets is not greater than the number of weights in a more shallow net with larger conv. layer widths and receptive fields (144M weights in (Sermanet et al., 2014)).

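As a sanity check on Table 2, the parameter count of configuration A can be reproduced from the rules in the text (widths start at 64 and double after each max-pool, up to 512; 8 conv. and 3 FC layers). The exact per-stage layer list below is an assumption, since Table 1 itself is not reproduced in this post:

```python
# Assumed layer list for configuration A ("M" = 2x2/stride-2 max-pool):
config_a = [64, "M", 128, "M", 256, 256, "M", 512, 512, "M", 512, 512, "M"]

def count_params(config, in_ch=3, input_size=224, fc=(4096, 4096, 1000)):
    total, size = 0, input_size
    for v in config:
        if v == "M":
            size //= 2                       # max-pool halves the resolution
        else:
            total += 3 * 3 * in_ch * v + v   # 3x3 conv weights + biases
            in_ch = v
    prev = size * size * in_ch               # flattened conv output (7*7*512)
    for width in fc:
        total += prev * width + width        # FC weights + biases
        prev = width
    return total

print(count_params(config_a))  # 132863336, i.e. ~133M, matching Table 2
```

Note that the first FC layer alone contributes about 103M of the ~133M parameters, which is why depth can grow with only a modest increase in total weights.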

2.3. Discussion

Our ConvNet configurations are quite different from the ones used in the top-performing entries of the ILSVRC-2012 (Krizhevsky et al., 2012) and ILSVRC-2013 competitions (Zeiler & Fergus, 2013; Sermanet et al., 2014).

Rather than using relatively large receptive fields in the first conv. layers (e.g. 11 × 11 with stride 4 in (Krizhevsky et al., 2012), or 7 × 7 with stride 2 in (Zeiler & Fergus, 2013; Sermanet et al., 2014)), we use very small 3 × 3 receptive fields throughout the whole net, which are convolved with the input at every pixel (with stride 1). It is easy to see that a stack of two 3 × 3 conv. layers (without spatial pooling in between) has an effective receptive field of 5 × 5;

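The effective receptive field of stacked stride-1 layers grows by (kernel − 1) per layer, which is why two 3 × 3 layers see 5 × 5 and three see 7 × 7. A hypothetical sketch of that recurrence (not code from the paper):

```python
def stacked_rf(n_layers, kernel=3, stride=1):
    """Effective receptive field of n stacked conv. layers, no pooling in between."""
    rf, jump = 1, 1
    for _ in range(n_layers):
        rf += (kernel - 1) * jump
        jump *= stride   # jump stays 1 for stride-1 stacks
    return rf

assert stacked_rf(1) == 3   # one 3x3 layer sees a 3x3 region
assert stacked_rf(2) == 5   # two see 5x5
assert stacked_rf(3) == 7   # three see 7x7
```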


Table 1: ConvNet configurations (shown in columns). The depth of the configurations increases from the left (A) to the right (E), as more layers are added (the added layers are shown in bold). The convolutional layer parameters are denoted as “conv⟨receptive field size⟩-⟨number of channels⟩”. The ReLU activation function is not shown for brevity.

Table 2: Number of parameters (in millions).

three such layers have a 7 × 7 effective receptive field. So what have we gained by using, for instance, a stack of three 3 × 3 conv. layers instead of a single 7 × 7 layer? First, we incorporate three non-linear rectification layers instead of a single one, which makes the decision function more discriminative. Second, we decrease the number of parameters: assuming that both the input and the output of a three-layer 3 × 3 convolution stack has C channels, the stack is parametrised by 3(3²C²) = 27C² weights; at the same time, a single 7 × 7 conv. layer would require 7²C² = 49C² parameters, i.e. 81% more.

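The 27C² vs. 49C² comparison can be checked directly (the choice of C below is arbitrary, since the ratio is independent of it):

```python
C = 512  # example channel count; any C gives the same ratio
stack_3x3 = 3 * (3 * 3 * C * C)   # three 3x3 layers: 27 C^2 weights
single_7x7 = 7 * 7 * C * C        # one 7x7 layer:   49 C^2 weights
assert stack_3x3 == 27 * C**2 and single_7x7 == 49 * C**2

print(round((single_7x7 / stack_3x3 - 1) * 100))  # 81 (% more parameters)
```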

This can be seen as imposing a regularisation on the 7 × 7 conv. filters, forcing them to have a decomposition through the 3 × 3 filters (with non-linearity injected in between). The incorporation of 1 × 1 conv. layers (configuration C, Table 1) is a way to increase the non-linearity of the decision function without affecting the receptive fields of the conv. layers.

Even though in our case the 1 × 1 convolution is essentially a linear projection onto the space of the same dimensionality (the number of input and output channels is the same), an additional non-linearity is introduced by the rectification function. It should be noted that 1 × 1 conv. layers have recently been utilised in the “Network in Network” architecture of Lin et al. (2014).

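"A linear transformation of the input channels" means a 1 × 1 convolution applies the same C_out × C_in matrix independently at every spatial position. A toy pure-Python sketch of this (hypothetical helper, not framework code; real implementations vectorise it):

```python
def conv1x1(feature_map, weights):
    """feature_map: H x W x C_in nested lists; weights: C_out x C_in matrix."""
    return [[[sum(w * px[i] for i, w in enumerate(row)) for row in weights]
             for px in line]
            for line in feature_map]

fmap = [[[1.0, 2.0]]]            # 1x1 spatial map with 2 input channels
w = [[1.0, 0.0], [1.0, 1.0]]     # 2 output channels
print(conv1x1(fmap, w))          # [[[1.0, 3.0]]]
```

With equal input and output channel counts this is a projection onto a space of the same dimensionality; the extra non-linearity comes only from the ReLU applied afterwards.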

Small-size convolution filters have been previously used by Ciresan et al. (2011), but their nets are significantly less deep than ours, and they did not evaluate on the large-scale ILSVRC dataset. Goodfellow et al. (2014) applied deep ConvNets (11 weight layers) to the task of street number recognition, and showed that the increased depth led to better performance.

GoogLeNet (Szegedy et al., 2014), a top-performing entry of the ILSVRC-2014 classification task, was developed independently of our work, but is similar in that it is based on very deep ConvNets (22 weight layers) and small convolution filters (apart from 3 × 3, they also use 1 × 1 and 5 × 5 convolutions).

Their network topology is, however, more complex than ours, and the spatial resolution of the feature maps is reduced more aggressively in the first layers to decrease the amount of computation. As will be shown in Sect. 4.5, our model is outperforming that of Szegedy et al. (2014) in terms of the single-network classification accuracy.


  1. http://www.robots.ox.ac.uk/~vgg/research/very_deep/
