Aggregated Residual Transformations for Deep Neural Networks
Authors: Saining Xie¹, Ross Girshick², Piotr Dollár², Zhuowen Tu¹, Kaiming He²
¹ UC San Diego  ² Facebook AI Research
Paper: https://openaccess.thecvf.com/content_cvpr_2017/papers/Xie_Aggregated_Residual_Transformations_CVPR_2017_paper.pdf
Code: https://github.com/facebookresearch/ResNeXt

Contents

  • Abstract
  • 1. Introduction
  • 2. Related Work
  • 3. Method
    • 3.1. Template
    • 3.2. Revisiting Simple Neurons
    • 3.3. Aggregated Transformations
    • 3.4. Model Capacity
  • 4. Implementation Details
  • 5. Experiments
    • 5.1. Experiments on ImageNet-1K
    • 5.4. Experiments on COCO Object Detection

Abstract

We present a simple, highly modularized network architecture for image classification. Our network is constructed by repeating a building block that aggregates a set of transformations with the same topology. Our simple design results in a homogeneous, multi-branch architecture that has only a few hyper-parameters to set. This strategy exposes a new dimension, which we call “cardinality” (the size of the set of transformations), as an essential factor in addition to the dimensions of depth and width. On the ImageNet-1K dataset, we empirically show that even under the restricted condition of maintaining complexity, increasing cardinality is able to improve classification accuracy. Moreover, increasing cardinality is more effective than going deeper or wider when we increase the capacity. Our models, named ResNeXt, are the foundations of our entry to the ILSVRC 2016 classification task in which we secured 2nd place. We further investigate ResNeXt on an ImageNet-5K set and the COCO detection set, also showing better results than its ResNet counterpart. The code and models are publicly available online.

1. Introduction

Research on visual recognition is undergoing a transition from “feature engineering” to “network engineering” [25, 24, 44, 34, 36, 38, 14]. In contrast to traditional hand-designed features (e.g., SIFT [29] and HOG [5]), features learned by neural networks from large-scale data [33] require minimal human involvement during training, and can be transferred to a variety of recognition tasks [7, 10, 28]. Nevertheless, human effort has been shifted to designing better network architectures for learning representations.
  Designing architectures becomes increasingly difficult with the growing number of hyper-parameters (width, i.e., the number of channels in a layer; filter sizes; strides; etc.), especially when there are many layers. The VGG-nets [36] exhibit a simple yet effective strategy of constructing very deep networks: stacking building blocks of the same shape. This strategy is inherited by ResNets [14] which stack modules of the same topology. This simple rule reduces the free choices of hyper-parameters, and depth is exposed as an essential dimension in neural networks. Moreover, we argue that the simplicity of this rule may reduce the risk of over-adapting the hyper-parameters to a specific dataset. The robustness of VGG-nets and ResNets has been proven by various visual recognition tasks [7, 10, 9, 28, 31, 14] and by non-visual tasks involving speech [42, 30] and language [4, 41, 20].
  Unlike VGG-nets, the family of Inception models [38, 17, 39, 37] have demonstrated that carefully designed topologies are able to achieve compelling accuracy with low theoretical complexity. The Inception models have evolved over time [38, 39], but an important common property is a split-transform-merge strategy. In an Inception module, the input is split into a few lower-dimensional embeddings (by 1×1 convolutions), transformed by a set of specialized filters (3×3, 5×5, etc.), and merged by concatenation. It can be shown that the solution space of this architecture is a strict subspace of the solution space of a single large layer (e.g., 5×5) operating on a high-dimensional embedding. The split-transform-merge behavior of Inception modules is expected to approach the representational power of large and dense layers, but at a considerably lower computational complexity.
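To make the split-transform-merge pattern concrete, here is a minimal schematic sketch (PyTorch is assumed purely for illustration, and the module below is a simplified stand-in rather than any particular Inception variant): 1×1 convolutions produce the low-dimensional embeddings, specialized filters transform them, and the branches are merged by concatenation.

```python
import torch
import torch.nn as nn

# Schematic split-transform-merge module (illustrative sketch, not a specific Inception block):
# 1x1 convolutions split the input into low-dimensional embeddings, specialized filters
# transform them, and the branch outputs are merged by concatenation.
class SplitTransformMerge(nn.Module):
    def __init__(self, in_channels=256, embed_dim=64):
        super().__init__()
        self.branch3 = nn.Sequential(                        # 1x1 embedding, then 3x3 transform
            nn.Conv2d(in_channels, embed_dim, 1),
            nn.Conv2d(embed_dim, embed_dim, 3, padding=1))
        self.branch5 = nn.Sequential(                        # 1x1 embedding, then 5x5 transform
            nn.Conv2d(in_channels, embed_dim, 1),
            nn.Conv2d(embed_dim, embed_dim, 5, padding=2))

    def forward(self, x):
        return torch.cat([self.branch3(x), self.branch5(x)], dim=1)  # merge by concatenation

print(SplitTransformMerge()(torch.randn(1, 256, 14, 14)).shape)  # torch.Size([1, 128, 14, 14])
```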
  Despite good accuracy, the realization of Inception models has been accompanied with a series of complicating factors — the filter numbers and sizes are tailored for each individual transformation, and the modules are customized stage-by-stage. Although careful combinations of these components yield excellent neural network recipes, it is in general unclear how to adapt the Inception architectures to new datasets/tasks, especially when there are many factors and hyper-parameters to be designed.
  In this paper, we present a simple architecture which adopts VGG/ResNets’ strategy of repeating layers, while exploiting the split-transform-merge strategy in an easy, extensible way. A module in our network performs a set of transformations, each on a low-dimensional embedding, whose outputs are aggregated by summation. We pursue a simple realization of this idea — the transformations to be aggregated are all of the same topology (e.g., Fig. 1 (right)). This design allows us to extend to any large number of transformations without specialized designs.

Interestingly, under this simplified situation we show that our model has two other equivalent forms (Fig. 3). The reformulation in Fig. 3(b) appears similar to the Inception-ResNet module [37] in that it concatenates multiple paths; but our module differs from all existing Inception modules in that all our paths share the same topology and thus the number of paths can be easily isolated as a factor to be investigated. In a more succinct reformulation, our module can be reshaped by Krizhevsky et al.’s grouped convolutions [24] (Fig. 3(c)), which, however, had been developed as an engineering compromise.

  We empirically demonstrate that our aggregated transformations outperform the original ResNet module, even under the restricted condition of maintaining computational complexity and model size — e.g., Fig. 1(right) is designed to keep the FLOPs complexity and number of parameters of Fig. 1(left). We emphasize that while it is relatively easy to increase accuracy by increasing capacity (going deeper or wider), methods that increase accuracy while maintaining (or reducing) complexity are rare in the literature.
  Our method indicates that cardinality (the size of the set of transformations) is a concrete, measurable dimension that is of central importance, in addition to the dimensions of width and depth. Experiments demonstrate that increasing cardinality is a more effective way of gaining accuracy than going deeper or wider, especially when depth and width start to give diminishing returns for existing models.
  Our neural networks, named ResNeXt (suggesting the next dimension), outperform ResNet-101/152 [14], ResNet-200 [15], Inception-v3 [39], and Inception-ResNet-v2 [37] on the ImageNet classification dataset. In particular, a 101-layer ResNeXt is able to achieve better accuracy than ResNet-200 [15] but has only 50% complexity. Moreover, ResNeXt exhibits considerably simpler designs than all Inception models. ResNeXt was the foundation of our submission to the ILSVRC 2016 classification task, in which we secured second place. This paper further evaluates ResNeXt on a larger ImageNet-5K set and the COCO object detection dataset [27], showing consistently better accuracy than its ResNet counterparts. We expect that ResNeXt will also generalize well to other visual (and non-visual) recognition tasks.

2. Related Work

Multi-branch convolutional networks. The Inception models [38, 17, 39, 37] are successful multi-branch architectures where each branch is carefully customized. ResNets [14] can be thought of as two-branch networks where one branch is the identity mapping. Deep neural decision forests [22] are tree-patterned multi-branch networks with learned splitting functions.
  Grouped convolutions. The use of grouped convolutions dates back to the AlexNet paper [24], if not earlier. The motivation given by Krizhevsky et al. [24] is for distributing the model over two GPUs. Grouped convolutions are supported by Caffe [19], Torch [3], and other libraries, mainly for compatibility of AlexNet. To the best of our knowledge, there has been little evidence on exploiting grouped convolutions to improve accuracy. A special case of grouped convolutions is channel-wise convolutions in which the number of groups is equal to the number of channels. Channel-wise convolutions are part of the separable convolutions in [35].
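As a small illustration of this family of operations (a PyTorch sketch written for this article, not code from any of the cited libraries), a grouped convolution divides its input channels into groups, and the channel-wise special case sets the number of groups equal to the number of channels:

```python
import torch
import torch.nn as nn

# Grouped convolution and its channel-wise (depthwise) special case.
x = torch.randn(1, 8, 16, 16)
grouped = nn.Conv2d(8, 8, kernel_size=3, padding=1, groups=4)       # 4 groups of 2 channels each
channel_wise = nn.Conv2d(8, 8, kernel_size=3, padding=1, groups=8)  # #groups == #channels
print(grouped(x).shape, channel_wise(x).shape)  # both: torch.Size([1, 8, 16, 16])
```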
  Compressing convolutional networks. Decomposition (at spatial [6, 18] and/or channel [6, 21, 16] level) is a widely adopted technique to reduce redundancy of deep convolutional networks and accelerate/compress them. Ioannou et al. [16] present a “root”-patterned network for reducing computation, and branches in the root are realized by grouped convolutions. These methods [6, 18, 21, 16] have shown elegant compromise of accuracy with lower complexity and smaller model sizes. Instead of compression, our method is an architecture that empirically shows stronger representational power.
  Ensembling. Averaging a set of independently trained networks is an effective solution to improving accuracy [24], widely adopted in recognition competitions [33]. Veit et al. [40] interpret a single ResNet as an ensemble of shallower networks, which results from ResNet’s additive behaviors [15]. Our method harnesses additions to aggregate a set of transformations. But we argue that it is imprecise to view our method as ensembling, because the members to be aggregated are trained jointly, not independently.

3. Method

3.1. Template

We adopt a highly modularized design following VGG/ResNets. Our network consists of a stack of residual blocks. These blocks have the same topology, and are subject to two simple rules inspired by VGG/ResNets: (i) if producing spatial maps of the same size, the blocks share the same hyper-parameters (width and filter sizes), and (ii) each time when the spatial map is downsampled by a factor of 2, the width of the blocks is multiplied by a factor of 2. The second rule ensures that the computational complexity, in terms of FLOPs (floating-point operations, in # of multiply-adds), is roughly the same for all blocks.
  With these two rules, we only need to design a template module, and all modules in a network can be determined accordingly. So these two rules greatly narrow down the design space and allow us to focus on a few key factors. The networks constructed by these rules are in Table 1.
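A rough sketch of how the two rules determine every block from a single template (plain Python; the helper is hypothetical, and the 3-4-6-3 stage layout assumes the ResNet-50/ResNeXt-50 template of Table 1):

```python
# Hypothetical helper illustrating the two template rules:
# (i) blocks producing maps of the same size share width and filter sizes;
# (ii) the width is doubled each time the spatial map is downsampled by 2.
def stage_plan(base_width=64, blocks_per_stage=(3, 4, 6, 3)):
    plan, width = [], base_width
    for stage_idx, num_blocks in enumerate(blocks_per_stage):
        plan.append({
            "blocks": num_blocks,
            "width": width,                               # rule (i): shared within a stage
            "first_stride": 1 if stage_idx == 0 else 2,   # downsample when entering later stages
        })
        width *= 2                                        # rule (ii): double width after downsampling
    return plan

print(stage_plan())  # widths 64, 128, 256, 512 across the four stages
```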

3.2. Revisiting Simple Neurons

The simplest neurons in artificial neural networks perform inner product (weighted sum), which is the elementary transformation done by fully-connected and convolutional layers. Inner product can be thought of as a form of aggregating transformation:
$$\sum_{i=1}^{D} w_i x_i \qquad (1)$$
where $x = [x_1, x_2, ..., x_D]$ is a D-channel input vector to the neuron and $w_i$ is a filter’s weight for the i-th channel. This operation (usually including some output nonlinearity) is referred to as a “neuron”. See Fig. 2.
  The above operation can be recast as a combination of splitting, transforming, and aggregating. (i) Splitting: the vector $x$ is sliced as a low-dimensional embedding, and in the above, it is a single-dimension subspace $x_i$. (ii) Transforming: the low-dimensional representation is transformed, and in the above, it is simply scaled: $w_i x_i$. (iii) Aggregating: the transformations in all embeddings are aggregated by $\sum_{i=1}^{D}$.
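A toy numerical check of this split-transform-aggregate reading of Eqn.(1) (NumPy is assumed here only for illustration):

```python
import numpy as np

# The inner product w.x viewed as split -> transform -> aggregate over D scalar embeddings.
D = 5
x = np.random.randn(D)   # D-channel input vector
w = np.random.randn(D)   # filter weights

# Split: each x_i is a one-dimensional embedding; Transform: scale by w_i; Aggregate: sum over i.
aggregated = sum(w[i] * x[i] for i in range(D))

assert np.isclose(aggregated, np.dot(w, x))  # identical to the ordinary inner product
```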

3.3. Aggregated Transformations

Given the above analysis of a simple neuron, we consider replacing the elementary transformation ($w_i x_i$) with a more generic function, which in itself can also be a network. In contrast to “Network-in-Network” [26] that turns out to increase the dimension of depth, we show that our “Network-in-Neuron” expands along a new dimension.
  Formally, we present aggregated transformations as:
$$\mathcal{F}(x) = \sum_{i=1}^{C} \mathcal{T}_i(x), \qquad (2)$$
where $\mathcal{T}_i(x)$ can be an arbitrary function. Analogous to a simple neuron, $\mathcal{T}_i$ should project $x$ into an (optionally low-dimensional) embedding and then transform it.
  In Eqn.(2), C is the size of the set of transformations to be aggregated. We refer to C as cardinality [2]. In Eqn.(2) C is in a position similar to D in Eqn.(1), but C need not equal D and can be an arbitrary number. While the dimension of width is related to the number of simple transformations (inner product), we argue that the dimension of cardinality controls the number of more complex transformations. We show by experiments that cardinality is an essential dimension and can be more effective than the dimensions of width and depth.
  In this paper, we consider a simple way of designing the transformation functions: all $\mathcal{T}_i$’s have the same topology. This extends the VGG-style strategy of repeating layers of the same shape, which is helpful for isolating a few factors and extending to any large number of transformations. We set the individual transformation $\mathcal{T}_i$ to be the bottleneck-shaped architecture [14], as illustrated in Fig. 1 (right). In this case, the first 1×1 layer in each $\mathcal{T}_i$ produces the low-dimensional embedding.
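A compact sketch of such a module with C same-topology bottleneck paths aggregated by summation, in the spirit of Fig. 3(a) (PyTorch is assumed, and the class name and default sizes are illustrative choices mirroring Fig. 1 (right)):

```python
import torch.nn as nn

# Aggregated transformations with C identical bottleneck paths, summed and added to the shortcut.
class AggregatedBottleneck(nn.Module):
    def __init__(self, in_channels=256, cardinality=32, bottleneck_width=4):
        super().__init__()
        d = bottleneck_width
        # Each path: 1x1 reduce to d channels -> 3x3 transform -> 1x1 expand back.
        self.paths = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_channels, d, 1, bias=False),
                nn.BatchNorm2d(d), nn.ReLU(inplace=True),
                nn.Conv2d(d, d, 3, padding=1, bias=False),
                nn.BatchNorm2d(d), nn.ReLU(inplace=True),
                nn.Conv2d(d, in_channels, 1, bias=False),
                nn.BatchNorm2d(in_channels),
            ) for _ in range(cardinality)
        ])
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = sum(path(x) for path in self.paths)  # aggregate the C transformations by summation
        return self.relu(out + x)                  # add the identity shortcut (residual function)
```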
  The aggregated transformation in Eqn.(2) serves as the residual function [14] (Fig. 1 right):
$$y = x + \sum_{i=1}^{C} \mathcal{T}_i(x), \qquad (3)$$
where y is the output.
  Relation to Inception-ResNet. Some tensor manipulations show that the module in Fig. 1(right) (also shown in Fig. 3(a)) is equivalent to Fig. 3(b). Fig. 3(b) appears similar to the Inception-ResNet [37] block in that it involves branching and concatenating in the residual function. But unlike all Inception or Inception-ResNet modules, we share the same topology among the multiple paths. Our module requires minimal extra effort designing each path.
  Relation to Grouped Convolutions. The above module becomes more succinct using the notation of grouped convolutions [24]. This reformulation is illustrated in Fig. 3(c). All the low-dimensional embeddings (the first 1×1 layers) can be replaced by a single, wider layer (e.g., 1×1, 128-d in Fig. 3(c)). Splitting is essentially done by the grouped convolutional layer when it divides its input channels into groups. The grouped convolutional layer in Fig. 3(c) performs 32 groups of convolutions whose input and output channels are 4-dimensional. The grouped convolutional layer concatenates them as the outputs of the layer. The block in Fig. 3(c) looks like the original bottleneck residual block in Fig. 1(left), except that Fig. 3(c) is a wider but sparsely connected module.
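The same computation in the grouped-convolution form of Fig. 3(c) can be sketched as follows (again an assumed PyTorch implementation with the 32×4d defaults; the class name is hypothetical). The 32 four-dimensional paths collapse into a single 1×1 reduce, a grouped 3×3, and a 1×1 expand; setting groups equal to the cardinality is what makes the 3×3 layer wide but sparsely connected.

```python
import torch.nn as nn

# Grouped-convolution form of the block (Fig. 3(c)): 1x1 -> grouped 3x3 -> 1x1, plus shortcut.
class ResNeXtBottleneck(nn.Module):
    def __init__(self, in_channels=256, cardinality=32, bottleneck_width=4):
        super().__init__()
        width = cardinality * bottleneck_width            # e.g., 32 * 4 = 128-d
        self.conv_reduce = nn.Conv2d(in_channels, width, 1, bias=False)
        self.bn_reduce = nn.BatchNorm2d(width)
        # groups=cardinality splits the 128 channels into 32 groups of 4 input/output channels.
        self.conv_group = nn.Conv2d(width, width, 3, padding=1, groups=cardinality, bias=False)
        self.bn_group = nn.BatchNorm2d(width)
        self.conv_expand = nn.Conv2d(width, in_channels, 1, bias=False)
        self.bn_expand = nn.BatchNorm2d(in_channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn_reduce(self.conv_reduce(x)))
        out = self.relu(self.bn_group(self.conv_group(out)))
        out = self.bn_expand(self.conv_expand(out))
        return self.relu(out + x)                         # ReLU after adding the shortcut [14]
```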
  We note that the reformulations produce nontrivial topologies only when the block has depth ≥3. If the block has depth = 2 (e.g., the basic block in [14]), the reformulations lead to trivially a wide, dense module. See the illustration in Fig. 4.
  Discussion. We note that although we present reformulations that exhibit concatenation (Fig. 3(b)) or grouped convolutions (Fig. 3(c)), such reformulations are not always applicable for the general form of Eqn.(3), e.g., if the transformation $\mathcal{T}_i$ takes arbitrary forms and is heterogeneous. We choose to use homogeneous forms in this paper because they are simpler and extensible. Under this simplified case, grouped convolutions in the form of Fig. 3(c) are helpful for easing implementation.

3.4. Model Capacity

Our experiments in the next section will show that our models improve accuracy when maintaining the model complexity and number of parameters. This is not only interesting in practice, but more importantly, the complexity and number of parameters represent inherent capacity of models and thus are often investigated as fundamental properties of deep networks [8].
  When we evaluate different cardinalities C while preserving complexity, we want to minimize the modification of other hyper-parameters. We choose to adjust the width of the bottleneck (e.g., 4-d in Fig 1(right)), because it can be isolated from the input and output of the block. This strategy introduces no change to other hyper-parameters (depth or input/output width of blocks), so is helpful for us to focus on the impact of cardinality.
  In Fig. 1(left), the original ResNet bottleneck block [14] has 256 · 64 + 3 · 3 · 64 · 64 + 64 · 256 ≈ 70k parameters and proportional FLOPs (on the same feature map size). With bottleneck width d, our template in Fig. 1(right) has:
$$C \cdot (256 \cdot d + 3 \cdot 3 \cdot d \cdot d + d \cdot 256) \qquad (4)$$
parameters and proportional FLOPs. When C = 32 and d = 4, Eqn.(4) ≈ 70k. Table 2 shows the relationship between cardinality C and bottleneck width d.
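A quick arithmetic check of Eqn.(4) against the ≈70k figure quoted above (a small hypothetical Python helper, written only to verify the counts):

```python
# Parameter count of the template block of Fig. 1 (right), per Eqn.(4).
def resnext_block_params(C, d, io_width=256):
    return C * (io_width * d + 3 * 3 * d * d + d * io_width)

resnet_bottleneck = 256 * 64 + 3 * 3 * 64 * 64 + 64 * 256   # Fig. 1 (left): 69,632 ≈ 70k
print(resnet_bottleneck)                # 69632
print(resnext_block_params(32, 4))      # 70144 ≈ 70k, so C=32, d=4 preserves complexity
print(resnext_block_params(1, 64))      # 69632: the C=1, d=64 case recovers the ResNet block
```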

  Because we adopt the two rules in Sec. 3.1, the above approximate equality is valid between a ResNet bottleneck block and our ResNeXt on all stages (except for the subsampling layers where the feature map size changes). Table 1 compares the original ResNet-50 and our ResNeXt-50 that is of similar capacity. We note that the complexity can only be preserved approximately, but the difference of the complexity is minor and does not bias our results.

4. Implementation Details

Our implementation follows [14] and the publicly available code of fb.resnet.torch [11]. On the ImageNet dataset, the input image is 224×224 randomly cropped from a resized image using the scale and aspect ratio augmentation of [38] implemented by [11]. The shortcuts are identity connections except for those increasing dimensions which are projections (type B in [14]). Downsampling of conv3, 4, and 5 is done by stride-2 convolutions in the 3×3 layer of the first block in each stage, as suggested in [11]. We use SGD with a mini-batch size of 256 on 8 GPUs (32 per GPU). The weight decay is 0.0001 and the momentum is 0.9. We start from a learning rate of 0.1, and divide it by 10 for three times using the schedule in [11]. We adopt the weight initialization of [13]. In all ablation comparisons, we evaluate the error on the single 224×224 center crop from an image whose shorter side is 256.
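For orientation, the optimization settings above map onto a few lines of framework code. The sketch below assumes PyTorch/torchvision as a stand-in for the original fb.resnet.torch [11] setup, and the learning-rate milestones are assumptions, since the exact epochs are not stated here.

```python
import torch
import torchvision

# Sketch of the training recipe described above (assumed PyTorch/torchvision stand-in).
model = torchvision.models.resnext50_32x4d()        # ResNeXt-50 (32x4d) stand-in
optimizer = torch.optim.SGD(model.parameters(),
                            lr=0.1,                 # initial learning rate
                            momentum=0.9,
                            weight_decay=1e-4)
# The learning rate is divided by 10 three times; the milestones below are assumed values.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[30, 60, 90], gamma=0.1)
```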
  Our models are realized by the form of Fig. 3(c). We perform batch normalization (BN) [17] right after the convolutions in Fig. 3(c). ReLU is performed right after each BN, except for the output of the block where ReLU is performed after the adding to the shortcut, following [14].
  We note that the three forms in Fig. 3 are strictly equivalent, when BN and ReLU are appropriately addressed as mentioned above. We have trained all three forms and obtained the same results. We choose to implement by Fig. 3(c) because it is more succinct and faster than the other two forms.

5. Experiments

5.1. Experiments on ImageNet-1K

We conduct ablation experiments on the 1000-class ImageNet classification task [33]. We follow [14] to construct 50-layer and 101-layer residual networks. We simply replace all blocks in ResNet-50/101 with our blocks.
  Notations. Because we adopt the two rules in Sec. 3.1, it is sufficient for us to refer to an architecture by the template. For example, Table 1 shows a ResNeXt-50 constructed by a template with cardinality = 32 and bottleneck width = 4d (Fig. 3). This network is denoted as ResNeXt-50 (32×4d) for simplicity. We note that the input/output width of the template is fixed as 256-d (Fig. 3), and all widths are doubled each time when the feature map is subsampled (see Table 1).
  Cardinality vs. Width. We first evaluate the trade-off between cardinality C and bottleneck width, under preserved complexity as listed in Table 2. Table 3 shows the results and Fig. 5 shows the curves of error vs. epochs. Comparing with ResNet-50 (Table 3 top and Fig. 5 left), the 32×4d ResNeXt-50 has a validation error of 22.2%, which is 1.7% lower than the ResNet baseline’s 23.9%. With cardinality C increasing from 1 to 32 while keeping complexity, the error rate keeps reducing. Furthermore, the 32×4d ResNeXt also has a much lower training error than the ResNet counterpart, suggesting that the gains are not from regularization but from stronger representations.

  Similar trends are observed in the case of ResNet-101 (Fig. 5 right, Table 3 bottom), where the 32×4d ResNeXt-101 outperforms the ResNet-101 counterpart by 0.8%. Although this improvement of validation error is smaller than that of the 50-layer case, the improvement of training error is still big (20% for ResNet-101 and 16% for 32×4d ResNeXt-101, Fig. 5 right). In fact, more training data will enlarge the gap of validation error, as we show on an ImageNet-5K set in the next subsection.
  Table 3 also suggests that with complexity preserved, increasing cardinality at the price of reducing width starts to show saturating accuracy when the bottleneck width is small. We argue that it is not worthwhile to keep reducing width in such a trade-off. So we adopt a bottleneck width no smaller than 4d in the following.
  Increasing Cardinality vs. Deeper/Wider. Next we investigate increasing complexity by increasing cardinality C or increasing depth or width. The following comparison can also be viewed as with reference to 2× FLOPs of the ResNet-101 baseline. We compare the following variants that have ∼15 billion FLOPs. (i) Going deeper to 200 layers. We adopt the ResNet-200 [15] implemented in [11]. (ii) Going wider by increasing the bottleneck width. (iii) Increasing cardinality by doubling C.
  Table 4 shows that increasing complexity by 2× consistently reduces error vs. the ResNet-101 baseline (22.0%). But the improvement is small when going deeper (ResNet-200, by 0.3%) or wider (wider ResNet-101, by 0.7%).

On the contrary, increasing cardinality C shows much better results than going deeper or wider. The 2×64d ResNeXt-101 (i.e., doubling C on 1×64d ResNet-101 baseline and keeping the width) reduces the top-1 error by 1.3% to 20.7%. The 64×4d ResNeXt-101 (i.e., doubling C on 32×4d ResNeXt-101 and keeping the width) reduces the top-1 error to 20.4%.
  We also note that the 32×4d ResNeXt-101 (21.2%) performs better than the deeper ResNet-200 and the wider ResNet-101, even though it has only ∼50% complexity. This again shows that cardinality is a more effective dimension than the dimensions of depth and width.
  Residual connections. The following table shows the effects of the residual (shortcut) connections:
    Top-1 error (%) on ImageNet-1K:
                                     ResNet-50    ResNeXt-50 (32×4d)
      with residual connections        23.9              22.2
      without residual connections     31.2              26.1
  Removing shortcuts from the ResNeXt-50 increases the error by 3.9 points to 26.1%. Removing shortcuts from its ResNet-50 counterpart is much worse (31.2%). These comparisons suggest that the residual connections are helpful for optimization, whereas aggregated transformations are stronger representations, as shown by the fact that they perform consistently better than their counterparts with or without residual connections.
  Performance. For simplicity we use Torch’s built-in grouped convolution implementation, without special optimization. We note that this implementation was brute-force and not parallelization-friendly. On 8 GPUs of NVIDIA M40, training 32×4d ResNeXt-101 in Table 3 takes 0.95s per mini-batch, vs. 0.70s of ResNet-101 baseline that has similar FLOPs. We argue that this is a reasonable overhead. We expect carefully engineered lower-level implementation (e.g., in CUDA) will reduce this overhead. We also expect that the inference time on CPUs will present less overhead. Training the 2×complexity model (64×4d ResNeXt-101) takes 1.7s per mini-batch and 10 days total on 8 GPUs.
  Comparisons with state-of-the-art results. Table 5 shows more results of single-crop testing on the ImageNet validation set. In addition to testing a 224×224 crop, we also evaluate a 320×320 crop following [15]. Our results compare favorably with ResNet, Inception-v3/v4, and Inception-ResNet-v2, achieving a single-crop top-5 error rate of 4.4%. In addition, our architecture design is much simpler than all Inception models, and requires considerably fewer hyper-parameters to be set by hand.

ResNeXt is the foundation of our entries to the ILSVRC 2016 classification task, in which we achieved 2nd place. We note that many models (including ours) start to get saturated on this dataset after using multi-scale and/or multicrop testing. We had a single-model top-1/top-5 error rates of 17.7%/3.7% using the multi-scale dense testing in [14], on par with Inception-ResNet-v2’s single-model results of 17.8%/3.7% that adopts multi-scale, multi-crop testing. We had an ensemble result of 3.03% top-5 error on the test set, on par with the winner’s 2.99% and Inception-v4/InceptionResNet-v2’s 3.08% [37].

5.4. Experiments on COCO Object Detection

Next we evaluate the generalizability on the COCO object detection set [27]. We train the models on the 80k training set plus a 35k val subset and evaluate on a 5k val subset (called minival), following [1]. We evaluate the COCO-style Average Precision (AP) as well as AP@IoU=0.5 [27]. We adopt the basic Faster R-CNN [32] and follow [14] to plug ResNet/ResNeXt into it. The models are pre-trained on ImageNet-1K and fine-tuned on the detection set. Implementation details are in the appendix.
  Table 8 shows the comparisons. On the 50-layer baseline, ResNeXt improves AP@0.5 by 2.1% and AP by 1.0%, without increasing complexity. ResNeXt shows smaller improvements on the 101-layer baseline. We conjecture that more training data will lead to a larger gap, as observed on the ImageNet-5K set.
  It is also worth noting that recently ResNeXt has been adopted in Mask R-CNN [12] that achieves state-of-the-art results on COCO instance segmentation and object detection tasks.
