
Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth.




Deep networks naturally integrate low/mid/high level features [50] and classifiers in an end-to-end multi-layer fashion, and the “levels” of features can be enriched by the number of stacked layers (depth). Recent evidence [41, 44] reveals that network depth is of crucial importance, and the leading results [41, 44, 13, 16] on the challenging ImageNet dataset [36] all exploit “very deep” [41] models, with a depth of sixteen [41] to thirty [16]. Many other non-trivial visual recognition tasks [8, 12, 7, 32, 27] have also greatly benefited from very deep models.


Driven by the significance of depth, a question arises: Is learning better networks as easy as stacking more layers? An obstacle to answering this question was the notorious problem of vanishing/exploding gradients [1, 9], which hamper convergence from the beginning. This problem, however, has been largely addressed by normalized initialization [23, 9, 37, 13] and intermediate normalization layers [16], which enable networks with tens of layers to start converging for stochastic gradient descent (SGD) with backpropagation [22].


When deeper networks are able to start converging, a degradation problem has been exposed: with the network depth increasing, accuracy gets saturated (which might be unsurprising) and then degrades rapidly. Unexpectedly, such degradation is not caused by overfitting, and adding more layers to a suitably deep model leads to higher training error, as reported in [11, 42] and thoroughly verified by our experiments. Fig. 1 shows a typical example.


The degradation (of training accuracy) indicates that not all systems are similarly easy to optimize. Let us consider a shallower architecture and its deeper counterpart that adds more layers onto it. There exists a solution by construction to the deeper model: the added layers are identity mapping, and the other layers are copied from the learned shallower model. The existence of this constructed solution indicates that a deeper model should produce no higher training error than its shallower counterpart.
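
The construction argument can be checked with a toy sketch in plain Python. The layer functions and the names `identity`, `run`, `shallow_layers`, and `deeper_layers` are hypothetical stand-ins chosen for illustration, not anything defined in the paper. The point is only that appending identity layers to a copied shallow model cannot change its outputs, and therefore cannot raise its training error.

```python
# Toy illustration of the "solution by construction" argument.
# The layers are arbitrary hand-written functions standing in for learned
# layers; they are assumptions made purely for illustration.

def identity(x):
    """An added layer that performs identity mapping."""
    return x


def run(layers, x):
    """Apply a stack of layers to the input, one after another."""
    for layer in layers:
        x = layer(x)
    return x


# A "learned" shallow model: an affine layer followed by a ReLU.
shallow_layers = [lambda x: 2.0 * x + 1.0, lambda x: max(0.0, x)]

# The deeper counterpart: copy the shallow layers, then append identity layers.
deeper_layers = shallow_layers + [identity] * 10

inputs = [-3.0, -0.5, 0.0, 1.5, 4.0]
shallow_out = [run(shallow_layers, x) for x in inputs]
deeper_out = [run(deeper_layers, x) for x in inputs]

# Identical outputs on every input imply identical training error.
print(shallow_out == deeper_out)
```

The degradation problem is that solvers fail to find a solution this good when the added layers are ordinary nonlinear layers rather than hand-placed identities.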



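In practice, however, vanishing or exploding gradients used to make it impossible to go deeper. These problems can be handled with careful initialization and intermediate normalization layers (for example Xavier initialization and Kaiming initialization; Kaiming He is citing himself here =_=).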



In this paper, we address the degradation problem by introducing a deep residual learning framework. Instead of hoping each few stacked layers directly fit a desired underlying mapping, we explicitly let these layers fit a residual mapping. Formally, denoting the desired underlying mapping as H(x), we let the stacked nonlinear layers fit another mapping of F(x) := H(x)−x. The original mapping is recast into F(x)+x. We hypothesize that it is easier to optimize the residual mapping than to optimize the original, unreferenced mapping. To the extreme, if an identity mapping were optimal, it would be easier to push the residual to zero than to fit an identity mapping by a stack of nonlinear layers.

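This paper introduces a deep residual learning framework to solve the degradation problem. Denote the desired underlying mapping as H(x); the stacked nonlinear layers are made to fit F(x), where F(x) = H(x) - x, so the original mapping is recast as F(x) + x. The hypothesis is that the residual mapping is easier to optimize than the original mapping. In the extreme case where an identity mapping is optimal, pushing the residual toward 0 is easier than fitting the identity with a stack of nonlinear layers.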

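From the formula, the underlying mapping is H(x) = F(x) + x: in effect, the next layers are meant to learn something beyond what the previous layer already learned.
This additional content is F(x), which the authors call the residual mapping.
Here x is what the previous layer has learned. If the current layers learn nothing, i.e. F(x) = 0, the block can at least pass the previous layer's output through unchanged, so H(x) = x.
This case is called an identity mapping.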

The formulation of F(x)+x can be realized by feedforward neural networks with “shortcut connections” (Fig. 2). Shortcut connections [2, 34, 49] are those skipping one or more layers. In our case, the shortcut connections simply perform identity mapping, and their outputs are added to the outputs of the stacked layers (Fig. 2). Identity shortcut connections add neither extra parameter nor computational complexity. The entire network can still be trained end-to-end by SGD with backpropagation, and can be easily implemented using common libraries (e.g., Caffe [19]) without modifying the solvers.

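The forward computation F(x) + x can be expressed with "shortcut connections" that skip layers and perform identity mapping, as shown in Fig. 2. The whole network can still be trained end to end by SGD with backpropagation.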




The authors' idea of learning residuals draws on Fisher Vector and VLAD, classical machine learning methods that are widely used in computer vision.

shortcut connections

There are also highway networks, which add gates that control how much information flows through the shortcut and the residual branch.

Evidently, coming up with ideas requires reading a lot of papers =_=


The authors do seem rather long-winded here =_=


Let us consider H(x) as an underlying mapping to be fit by a few stacked layers (not necessarily the entire net), with x denoting the inputs to the first of these layers. If one hypothesizes that multiple nonlinear layers can asymptotically approximate complicated functions, then it is equivalent to hypothesize that they can asymptotically approximate the residual functions, i.e., H(x) − x (assuming that the input and output are of the same dimensions). So rather than expect stacked layers to approximate H(x), we explicitly let these layers approximate a residual function F(x) := H(x) − x. The original function thus becomes F(x)+x. Although both forms should be able to asymptotically approximate the desired functions (as hypothesized), the ease of learning might be different.


With the residual learning reformulation, if identity mappings are optimal, the solvers may simply drive the weights of the multiple nonlinear layers toward zero to approach identity mappings.


In real cases, it is unlikely that identity mappings are optimal, but our reformulation may help to precondition the problem. If the optimal function is closer to an identity mapping than to a zero mapping, it should be easier for the solver to find the perturbations with reference to an identity mapping, than to learn the function as a new one.




$$y = F(x, \{W_i\}) + x \tag{1}$$

The function F represents the residual mapping to be learned. In Fig. 2 above it has two layers:

$$F = W_2 \sigma(W_1 x)$$

where σ denotes ReLU and the biases are omitted to simplify notation.

The operation F + x is carried out by a shortcut connection and element-wise addition.

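The shortcut in Eqn.(1) adds neither extra parameters nor computational complexity, which makes it attractive in practice. It also allows a fair comparison: plain and residual networks have the same number of parameters, depth, width, and computational cost (except for the negligible element-wise addition).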
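
As a rough illustration of Eqn.(1), the block below evaluates y = F(x, {W_i}) + x with F = W_2 σ(W_1 x) on plain Python lists. The helper names (`matvec`, `relu`, `residual_block`) and all weight values are made up for this sketch. It also shows the extreme case discussed earlier: with all weights at zero, the block reduces to an identity mapping.

```python
# Minimal sketch of y = F(x, {W_i}) + x with F = W_2 * relu(W_1 * x).
# Plain lists stand in for tensors; every weight below is a made-up value.

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def relu(v):
    """Element-wise ReLU, the nonlinearity written as sigma in the text."""
    return [max(0.0, vi) for vi in v]


def residual_block(x, W1, W2):
    f = matvec(W2, relu(matvec(W1, x)))        # residual mapping F(x)
    return [fi + xi for fi, xi in zip(f, x)]   # element-wise shortcut addition


x = [1.0, -2.0, 3.0]
W1 = [[0.5, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, -1.0]]
W2 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
y = residual_block(x, W1, W2)

# Driving all weights to zero makes F(x) = 0, so the block outputs x itself.
zeros = [[0.0] * 3 for _ in range(3)]
y_identity = residual_block(x, zeros, zeros)

print(y)
print(y_identity)
```

Note that the shortcut itself carries no weights: only `W1` and `W2` are parameters, exactly as in a plain two-layer stack.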

The dimensions of x and F must be equal in Eqn.(1). If this is not the case (e.g., when changing the input/output channels), we can perform a linear projection $W_s$ by the shortcut connections to match the dimensions:

$$y = F(x, \{W_i\}) + W_s x \tag{2}$$

In other words, x and F must have the same dimensions; when they differ, a linear projection $W_s$ on the shortcut connection matches them.

We can also use a square matrix $W_s$ in Eqn.(1). But we will show by experiments that the identity mapping is sufficient for addressing the degradation problem and is economical, and thus $W_s$ is only used when matching dimensions.

In short: the identity mapping is enough and cheaper, so $W_s$ appears only where dimensions must be matched.
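
A small sketch of the projection shortcut in Eqn.(2), again on plain Python lists. The function name `projection_block` and all weights are hypothetical values picked so the arithmetic is easy to follow. Here F maps a 2-d input to a 3-d output, so a 3×2 matrix `Ws` is needed before the addition.

```python
# Sketch of y = F(x, {W_i}) + W_s x when F changes the dimension (2 -> 3).
# All weights are made-up values for illustration.

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def relu(v):
    return [max(0.0, vi) for vi in v]


def projection_block(x, W1, W2, Ws):
    f = matvec(W2, relu(matvec(W1, x)))   # F(x): one entry per row of W2
    shortcut = matvec(Ws, x)              # W_s x: projected to the size of F(x)
    return [fi + si for fi, si in zip(f, shortcut)]


x = [1.0, 2.0]                                                 # 2-d input
W1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]                      # 2 -> 3
W2 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]       # 3 -> 3
Ws = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]                      # 3x2 projection

y = projection_block(x, W1, W2, Ws)
print(y)
```

Unlike the identity shortcut, `Ws` is an extra weight matrix, which is why the text reserves it for the places where dimensions change.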

The form of the residual function F is flexible. Experiments in this paper involve a function F that has two or three layers (Fig. 5), while more layers are possible. But if F has only a single layer, Eqn.(1) is similar to a linear layer: $y = W_1 x + x$, for which we have not observed advantages.

With a single layer, $y = W_1 x + x$ gains nothing from the shortcut.

We also note that although the above notations are about fully-connected layers for simplicity, they are applicable to convolutional layers. The function $F(x, \{W_i\})$ can represent multiple convolutional layers. The element-wise addition is performed on two feature maps, channel by channel.

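Note that the formulas above are written for fully connected layers, but they apply to convolutional layers as well: $F(x, \{W_i\})$ can represent multiple convolutional layers, and the element-wise addition is performed channel by channel on the feature maps.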

My head is spinning =_=



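For the same output feature map size, the layers have the same number of filters. When the feature map size is halved, the number of filters is doubled to preserve the per-layer time complexity, and the downsampling is performed directly by a convolution with a stride of 2. The network ends with an average pooling layer and a fully connected layer with softmax.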

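Note that the plain network above is not ResNet, so do not mix them up. It exists as a baseline for comparison with ResNet, since the authors want the comparison to be fair =_=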


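Shortcut connections are inserted into the plain network above.
Identity shortcuts can be used directly when the input and output have the same dimensions (solid lines in Fig. 3).
When the dimensions increase, there are two options (dotted lines in Fig. 3):
(A) Pad with zeros to increase the dimensions. This option introduces no extra parameters.
(B) Use the projection shortcut of Eqn.(2) to match the dimensions (done by 1×1 convolutions).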
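
Option (A) can be sketched on nested Python lists, with a feature map stored as `[channel][row][column]`. The function names and the choice to append the zero channels at the end are assumptions of this sketch; the paper only states that extra zero entries are padded and that shortcuts crossing two feature map sizes use a stride of 2.

```python
# Sketch of option (A): a parameter-free shortcut that subsamples spatially
# with stride 2 and pads the new channels with zeros.
# Assumption: the zero channels are appended after the existing ones.

def zero_pad_shortcut(x, out_channels, stride=2):
    """x is a feature map stored as [channel][row][column]."""
    sub = [[row[::stride] for row in channel[::stride]] for channel in x]
    height, width = len(sub[0]), len(sub[0][0])
    extra = out_channels - len(sub)
    padding = [[[0.0] * width for _ in range(height)] for _ in range(extra)]
    return sub + padding


def add_feature_maps(a, b):
    """Element-wise addition of two feature maps, channel by channel."""
    return [[[p + q for p, q in zip(ra, rb)] for ra, rb in zip(ca, cb)]
            for ca, cb in zip(a, b)]


# Input: 2 channels of size 4x4.
x = [
    [[float(4 * r + c) for c in range(4)] for r in range(4)],
    [[1.0] * 4 for _ in range(4)],
]

shortcut = zero_pad_shortcut(x, out_channels=4)

# Stand-in for the output of the residual branch F: 4 channels of size 2x2.
f_out = [[[0.5, 0.5], [0.5, 0.5]] for _ in range(4)]
y = add_feature_maps(f_out, shortcut)

print(len(y), len(y[0]), len(y[0][0]))
```

In the padded channels the shortcut contributes only zeros, so the output there equals the residual branch alone.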



identity vs projection shortcuts

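Identity shortcuts add no extra parameters. The paper compares three options: (A) zero-padding shortcuts are used for increasing dimensions, so all shortcuts are parameter-free; (B) projection shortcuts are used for increasing dimensions, and the other shortcuts are identity; (C) all shortcuts are projections.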

All three options outperform the plain network, and among them C > B > A. The paper's explanation for B beating A:
We argue that this is because the zero-padded dimensions in A indeed have no residual learning.
That is, the authors argue that the zero-padded part learns no residual.

C is marginally better than B, and we attribute this to the extra parameters introduced by many (thirteen) projection shortcuts.

C is slightly better than B; the authors attribute this to the extra parameters of C's many projection shortcuts.

But the small differences among A/B/C indicate that projection shortcuts are not essential for addressing the degradation problem. So we do not use option C in the rest of this paper, to reduce memory/time complexity and model sizes. Identity shortcuts are particularly important for not increasing the complexity of the bottleneck architectures that are introduced below.

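The small differences among A, B, and C show that projection shortcuts are not essential for solving the degradation problem.
To limit computational complexity and model size, the rest of the paper does not use option C. Identity shortcuts matter most for keeping the bottleneck architectures cheap.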

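The bottleneck structure appears here for the first time. Reading it together with the figure above and the discussion of projection shortcuts: the bottleneck is a design for the residual branch F rather than a kind of shortcut. As explained below, a bottleneck block could use either shortcut type, and the paper keeps the identity shortcut instead of replacing it with a projection.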

Deeper Bottleneck Architectures

Because of concerns on the training time that we can afford, we modify the building block as a bottleneck design. For each residual function F, we use a stack of 3 layers instead of 2 (Fig. 5). The three layers are 1×1, 3×3, and 1×1 convolutions, where the 1×1 layers are responsible for reducing and then increasing (restoring) dimensions, leaving the 3×3 layer a bottleneck with smaller input/output dimensions. Fig. 5 shows an example, where both designs have similar time complexity.


The parameter-free identity shortcuts are particularly important for the bottleneck architectures. If the identity shortcut in Fig. 5 (right) is replaced with projection, one can show that the time complexity and model size are doubled, as the shortcut is connected to the two high-dimensional ends. So identity shortcuts lead to more efficient models for the bottleneck designs.

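The parameter-free identity shortcut is even more important for the bottleneck structure. If the identity shortcut is replaced with a projection, the time complexity and model size double, because the shortcut connects the two high-dimensional ends. Identity shortcuts therefore make bottleneck designs more efficient.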
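
The "similar complexity" and "doubled" claims can be checked with a little arithmetic on weight counts. The sketch below assumes the channel widths of Fig. 5 (a 64-d basic block and a 256-d bottleneck with a 64-channel middle layer), counts convolution weights only, and ignores biases and batch normalization; `conv_params` is a hypothetical helper.

```python
# Rough weight counts for the two block designs of Fig. 5.
# Assumptions: convolution weights only, no bias, no batch-norm parameters.

def conv_params(c_in, c_out, k):
    """Number of weights in a k x k convolution without bias."""
    return c_in * c_out * k * k


# Basic block on 64-d features: two 3x3 convolutions.
basic = 2 * conv_params(64, 64, 3)

# Bottleneck block on 256-d features: 1x1 (reduce), 3x3, 1x1 (restore).
bottleneck = (conv_params(256, 64, 1)
              + conv_params(64, 64, 3)
              + conv_params(64, 256, 1))

# A 1x1 projection shortcut between the two 256-d ends of the bottleneck.
projection = conv_params(256, 256, 1)
ratio = (bottleneck + projection) / bottleneck

print(basic, bottleneck, projection, round(ratio, 2))
```

Under these assumptions the two designs hold roughly the same number of weights, while a projection shortcut alone is nearly as large as the whole bottleneck branch, which is why it roughly doubles the block.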


```python
import torch
import torch.nn as nn


class BasicBlock(nn.Module):
    """Two-layer residual block (3x3, 3x3), used by resnet34 below."""
    expansion = 1

    def __init__(self, in_channel, out_channel, stride=1, downsample=None, **kwargs):
        super(BasicBlock, self).__init__()
        self.conv1 = nn.Conv2d(in_channels=in_channel, out_channels=out_channel,
                               kernel_size=3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_channel)
        self.relu = nn.ReLU()
        self.conv2 = nn.Conv2d(in_channels=out_channel, out_channels=out_channel,
                               kernel_size=3, stride=1, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channel)
        self.downsample = downsample

    def forward(self, x):
        # Shortcut branch: identity, or a projection when shapes differ.
        identity = x
        if self.downsample is not None:
            identity = self.downsample(x)

        # Residual branch F(x).
        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)

        out = self.conv2(out)
        out = self.bn2(out)

        # F(x) + x, followed by the final ReLU.
        out += identity
        out = self.relu(out)

        return out
```


```python
class Bottleneck(nn.Module):
    """
    Note: in the original paper, on the main branch of the dashed (downsampling)
    residual structure, the first 1x1 convolution has stride 2 and the 3x3
    convolution has stride 1. In the official PyTorch implementation, the first
    1x1 convolution has stride 1 and the 3x3 convolution has stride 2, which
    improves top-1 accuracy by roughly 0.5%.
    See ResNet v1.5: https://ngc.nvidia.com/catalog/model-scripts/nvidia:resnet_50_v1_5_for_pytorch
    """
    expansion = 4

    def __init__(self, in_channel, out_channel, stride=1, downsample=None,
                 groups=1, width_per_group=64):
        super(Bottleneck, self).__init__()

        width = int(out_channel * (width_per_group / 64.)) * groups

        self.conv1 = nn.Conv2d(in_channels=in_channel, out_channels=width,
                               kernel_size=1, stride=1, bias=False)  # squeeze channels
        self.bn1 = nn.BatchNorm2d(width)
        # -----------------------------------------
        self.conv2 = nn.Conv2d(in_channels=width, out_channels=width, groups=groups,
                               kernel_size=3, stride=stride, bias=False, padding=1)
        self.bn2 = nn.BatchNorm2d(width)
        # -----------------------------------------
        self.conv3 = nn.Conv2d(in_channels=width, out_channels=out_channel * self.expansion,
                               kernel_size=1, stride=1, bias=False)  # unsqueeze channels
        self.bn3 = nn.BatchNorm2d(out_channel * self.expansion)
        self.relu = nn.ReLU(inplace=True)
        self.downsample = downsample

    def forward(self, x):
        identity = x
        if self.downsample is not None:
            identity = self.downsample(x)

        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)

        out = self.conv2(out)
        out = self.bn2(out)
        out = self.relu(out)

        # conv3 is the 1x1 convolution that restores the channel count
        out = self.conv3(out)
        out = self.bn3(out)

        out += identity
        out = self.relu(out)

        return out
```

The code above is the bottleneck block. conv1 has kernel size 1 and first compresses the channels, conv2 then performs the 3×3 convolution, and conv3 expands the channels again.

```python
class ResNet(nn.Module):

    def __init__(self,
                 block,
                 blocks_num,
                 num_classes=1000,
                 include_top=True,
                 groups=1,
                 width_per_group=64):
        super(ResNet, self).__init__()
        self.include_top = include_top
        self.in_channel = 64

        self.groups = groups
        self.width_per_group = width_per_group

        self.conv1 = nn.Conv2d(3, self.in_channel, kernel_size=7, stride=2,
                               padding=3, bias=False)
        self.bn1 = nn.BatchNorm2d(self.in_channel)
        self.relu = nn.ReLU(inplace=True)
        self.maxpool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
        # layer1 keeps stride 1; blocks_num[0] is 3 for the models below
        self.layer1 = self._make_layer(block, 64, blocks_num[0])
        self.layer2 = self._make_layer(block, 128, blocks_num[1], stride=2)
        self.layer3 = self._make_layer(block, 256, blocks_num[2], stride=2)
        self.layer4 = self._make_layer(block, 512, blocks_num[3], stride=2)
        if self.include_top:
            self.avgpool = nn.AdaptiveAvgPool2d((1, 1))  # output size = (1, 1)
            self.fc = nn.Linear(512 * block.expansion, num_classes)

        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu')

    def _make_layer(self, block, channel, block_num, stride=1):
        downsample = None
        # With BasicBlock (expansion = 1), in_channel starts at 64 and only
        # layer1 outputs 64 channels, so layer1 skips this branch.
        # With Bottleneck (expansion = 4), the first block of every layer
        # builds a downsample (projection) branch.
        if stride != 1 or self.in_channel != channel * block.expansion:
            downsample = nn.Sequential(
                nn.Conv2d(self.in_channel, channel * block.expansion,
                          kernel_size=1, stride=stride, bias=False),
                nn.BatchNorm2d(channel * block.expansion))

        layers = []
        layers.append(block(self.in_channel,
                            channel,
                            downsample=downsample,
                            stride=stride,
                            groups=self.groups,
                            width_per_group=self.width_per_group))
        self.in_channel = channel * block.expansion

        for _ in range(1, block_num):
            layers.append(block(self.in_channel,
                                channel,
                                groups=self.groups,
                                width_per_group=self.width_per_group))

        return nn.Sequential(*layers)

    def forward(self, x):
        x = self.conv1(x)
        x = self.bn1(x)
        x = self.relu(x)
        x = self.maxpool(x)

        x = self.layer1(x)
        x = self.layer2(x)
        x = self.layer3(x)
        x = self.layer4(x)

        if self.include_top:
            x = self.avgpool(x)
            x = torch.flatten(x, 1)
            x = self.fc(x)

        return x


def resnet34(num_classes=1000, include_top=True):
    # https://download.pytorch.org/models/resnet34-333f7ec4.pth
    return ResNet(BasicBlock, [3, 4, 6, 3], num_classes=num_classes, include_top=include_top)


def resnet50(num_classes=1000, include_top=True):
    # https://download.pytorch.org/models/resnet50-19c8e357.pth
    return ResNet(Bottleneck, [3, 4, 6, 3], num_classes=num_classes, include_top=include_top)


def resnet101(num_classes=1000, include_top=True):
    # https://download.pytorch.org/models/resnet101-5d3b4d8f.pth
    return ResNet(Bottleneck, [3, 4, 23, 3], num_classes=num_classes, include_top=include_top)


def resnext50_32x4d(num_classes=1000, include_top=True):
    # https://download.pytorch.org/models/resnext50_32x4d-7cdf4587.pth
    groups = 32
    width_per_group = 4
    return ResNet(Bottleneck, [3, 4, 6, 3],
                  num_classes=num_classes,
                  include_top=include_top,
                  groups=groups,
                  width_per_group=width_per_group)


def resnext101_32x8d(num_classes=1000, include_top=True):
    # https://download.pytorch.org/models/resnext101_32x8d-8ba56ff5.pth
    groups = 32
    width_per_group = 8
    return ResNet(Bottleneck, [3, 4, 23, 3],
                  num_classes=num_classes,
                  include_top=include_top,
                  groups=groups,
                  width_per_group=width_per_group)
```
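
To see where the names resnet34, resnet50, and resnet101 come from, and how the spatial size shrinks through the network, the following plain-Python sketch re-derives both from the hyperparameters in the code above. It does not import PyTorch; `conv_out`, `feature_map_sizes`, and `depth` are hypothetical helpers, and a 224×224 input is assumed.

```python
# Bookkeeping sketch for the ResNet definition above (no PyTorch required).
# Assumption: a 224x224 input image.

def conv_out(size, kernel, stride, padding):
    """Output size of a convolution or pooling layer along one spatial axis."""
    return (size + 2 * padding - kernel) // stride + 1


def feature_map_sizes(input_size=224):
    sizes = {}
    s = conv_out(input_size, kernel=7, stride=2, padding=3)   # conv1
    sizes["conv1"] = s
    s = conv_out(s, kernel=3, stride=2, padding=1)            # maxpool
    sizes["maxpool"] = s
    # Only the strided 3x3 convolution of each stage changes the spatial size.
    for name, stride in [("layer1", 1), ("layer2", 2), ("layer3", 2), ("layer4", 2)]:
        s = conv_out(s, kernel=3, stride=stride, padding=1)
        sizes[name] = s
    return sizes


def depth(blocks_num, convs_per_block):
    """Weighted layers: all block convolutions, plus conv1 and the final fc."""
    return sum(blocks_num) * convs_per_block + 2


sizes = feature_map_sizes()
print(sizes)
print(depth([3, 4, 6, 3], 2), depth([3, 4, 6, 3], 3), depth([3, 4, 23, 3], 3))
```

The depth count covers only the convolutions on the residual branches plus the first convolution and the fully connected layer; the 1×1 convolutions in the downsample branches are not counted.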




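To summarize the design of ResNet: it is built from blocks with identity shortcuts, in a basic form and a bottleneck form, and both make the network learn residuals. The bottleneck block first compresses the channels, does its computation there, and then expands the channels again, which reduces computation and model size.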



FLOPS (all uppercase) stands for floating point operations per second. It measures computing speed and is a hardware performance metric.

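FLOPs (lowercase s, marking the plural) stands for floating point operations. It measures the amount of computation and can be used to gauge the complexity of an algorithm or model.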


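Notation: $C_i$ = input channels, $K$ = kernel size, $H \times W$ = output feature map size, $C_o$ = output channels.
MAC denotes a multiply-accumulate operation: y = ax + b.

FLOPs of a convolutional layer (without bias), where each output element needs $C_i K^2$ multiplications and $C_i K^2 - 1$ additions:

$$(2 \times C_i \times K^2 - 1) \times H \times W \times C_o$$

FLOPs of a fully connected layer (without bias), with $I$ input and $O$ output neurons:

$$(2 \times I - 1) \times O$$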
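
The two formulas translate directly into code. The sketch below applies them to two layers of the network defined earlier, assuming a 224×224 input so that the first 7×7 convolution produces a 112×112 output, and a 512-d feature vector feeding the 1000-way classifier as in the BasicBlock variant. The function names are hypothetical, and bias terms are ignored, as in the formulas.

```python
# FLOPs of single layers, following the two bias-free formulas above.

def conv_flops(c_in, k, h, w, c_out):
    """(2 * C_i * K^2 - 1) * H * W * C_o for a K x K convolution."""
    return (2 * c_in * k * k - 1) * h * w * c_out


def fc_flops(n_in, n_out):
    """(2 * I - 1) * O for a fully connected layer."""
    return (2 * n_in - 1) * n_out


# First convolution: 3 -> 64 channels, 7x7 kernel, 112x112 output (assumed input 224x224).
conv1 = conv_flops(c_in=3, k=7, h=112, w=112, c_out=64)

# Final classifier of a BasicBlock network: 512 -> 1000.
fc = fc_flops(512, 1000)

print(conv1, fc)
```

Even this single early convolution costs a couple of hundred times more than the classifier, which is typical: convolutions on large feature maps dominate the FLOPs of the model.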

