He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep Residual Learning for Image Recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). doi:10.1109/cvpr.2016.90


  1. Recent evidence [40, 43] reveals that network depth is of crucial importance, and the leading results [40, 43, 12, 16] on the challenging ImageNet dataset [35] all exploit “very deep” [40] models, with a depth of sixteen [40] to thirty [16].
  2. Driven by the significance of depth, a question arises: Is learning better networks as easy as stacking more layers? An obstacle to answering this question was the notorious problem of vanishing/exploding gradients [14, 1, 8], which hamper convergence from the beginning. This problem, however, has been largely addressed by normalized initialization [23, 8, 36, 12] and intermediate normalization layers [16], which enable networks with tens of layers to start converging for stochastic gradient descent (SGD) with backpropagation [22].
    When deeper networks are able to start converging, a degradation problem has been exposed: with the network depth increasing, accuracy gets saturated (which might be unsurprising) and then degrades rapidly. Unexpectedly, such degradation is not caused by overfitting, and adding more layers to a suitably deep model leads to higher training error, as reported in [10, 41] and thoroughly verified by our experiments. Fig. 1 shows a typical example.
  3. Formally, denoting the desired underlying mapping as H(x), we let the stacked nonlinear layers fit another mapping of F(x) := H(x)−x. The original mapping is recast into F(x)+x. We hypothesize that it is easier to optimize the residual mapping than to optimize the original, unreferenced mapping. To the extreme, if an identity mapping were optimal, it would be easier to push the residual to zero than to fit an identity mapping by a stack of nonlinear layers.
    The formulation of F(x)+x can be realized by feedforward neural networks with “shortcut connections” (Fig. 2). Shortcut connections [2, 33, 48] are those skipping one or more layers. In our case, the shortcut connections simply perform identity mapping, and their outputs are added to the outputs of the stacked layers (Fig. 2). Identity shortcut connections add neither extra parameter nor computational complexity. The entire network can still be trained end-to-end by SGD with backpropagation, and can be easily implemented using common libraries (e.g., Caffe [19]) without modifying the solvers.
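    Concretely: suppose a layer's input is x and its desired output is H(x). If we pass the input x straight through to the output as an initial result, we effectively have a shallower network, which is easier to train. The part that this shallower network has not learned can then be fit by the deeper stacked layers F(x), which makes training easier. The target those layers end up fitting is F(x) = H(x) − x, and this is the residual structure.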
  4. Based on the above plain network, we insert shortcut connections (Fig. 3, right) which turn the network into its counterpart residual version. The identity shortcuts (Eqn.(1)) can be directly used when the input and output are of the same dimensions (solid line shortcuts in Fig. 3). When the dimensions increase (dotted line shortcuts in Fig. 3), we consider two options: (A) The shortcut still performs identity mapping, with extra zero entries padded for increasing dimensions. This option introduces no extra parameter; (B) The projection shortcut in Eqn.(2) is used to match dimensions (done by 1×1 convolutions). For both options, when the shortcuts go across feature maps of two sizes, they are performed with a stride of 2.
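    Residual connections come in two kinds, drawn as solid and dashed lines in Fig. 3. Solid line: the input and output dimensions are the same. Dashed line: the dimensions differ, which can be handled in two ways: (A) pad with zeros, or (B) use a projection matrix (a 1×1 convolution) to convert the input to the same dimensions.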
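The residual formulation and the two shortcut options can be illustrated on plain vectors. This is a minimal sketch, not the paper's code: fully-connected layers stand in for convolutions, biases and the final ReLU after the addition are omitted, and the helper names (`residual_branch`, `zero_pad_shortcut`, `projection_shortcut`) and the matrix `Ws` are made up for illustration.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def relu(v):
    return [max(0.0, a) for a in v]

def add(a, b):
    return [p + q for p, q in zip(a, b)]

def residual_branch(x, W1, W2):
    """F(x) = W2 * relu(W1 * x): a two-layer residual function."""
    return matvec(W2, relu(matvec(W1, x)))

def zero_pad_shortcut(x, out_dim):
    """Option A: identity shortcut, zero-padded up to out_dim (no parameters)."""
    return list(x) + [0.0] * (out_dim - len(x))

def projection_shortcut(x, Ws):
    """Option B: linear projection Ws * x to match dimensions (adds parameters)."""
    return matvec(Ws, x)

x = [1.0, -2.0, 3.0]

# Same dimensions: y = F(x) + x. With all weights at zero, F(x) = 0, so y == x.
zeros3 = [[0.0] * 3 for _ in range(3)]
y_identity = add(residual_branch(x, zeros3, zeros3), x)

# Increased dimensions (3 -> 4); the residual branch is kept at zero for clarity.
W1 = [[0.0] * 3 for _ in range(4)]  # 4x3
W2 = [[0.0] * 4 for _ in range(4)]  # 4x4
f = residual_branch(x, W1, W2)
y_pad = add(f, zero_pad_shortcut(x, 4))

# Arbitrary illustrative projection matrix (4x3), not taken from any source.
Ws = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.5, 0.5, 0.5]]
y_proj = add(f, projection_shortcut(x, Ws))

print(y_identity, y_pad, y_proj)
```

With the residual branch at zero, the same-dimension unit simply returns its input, which is the "push the residual to zero" case from the quoted text. Option A needs no weights, while option B needs the extra matrix `Ws`.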




from torch import nn
import torch.nn.functional as F


def conv3x3(in_channel, out_channel, stride=1):
    # 3x3 convolution with padding 1 and no bias
    return nn.Conv2d(in_channel, out_channel, 3, stride=stride, padding=1, bias=False)


class residual_block(nn.Module):
    def __init__(self, in_channel, out_channel, same_shape=True):
        super(residual_block, self).__init__()
        self.same_shape = same_shape
        # stride 1 keeps the output size unchanged; stride 2 halves it
        stride = 1 if self.same_shape else 2

        self.conv1 = conv3x3(in_channel, out_channel, stride=stride)  # first convolution
        self.bn1 = nn.BatchNorm2d(out_channel)

        self.conv2 = conv3x3(out_channel, out_channel)  # second convolution
        self.bn2 = nn.BatchNorm2d(out_channel)

        if not self.same_shape:
            # when the output shape changes, a 1x1 convolution reshapes the input x
            self.conv3 = nn.Conv2d(in_channel, out_channel, 1, stride=stride)

    def forward(self, x):
        out = self.conv1(x)
        out = F.relu(self.bn1(out), inplace=True)
        out = self.conv2(out)
        # note: ReLU is also applied here, before the addition;
        # the paper applies the second ReLU only after the addition
        out = F.relu(self.bn2(out), inplace=True)

        if not self.same_shape:
            # project x with the 1x1 convolution so that its shape matches out
            x = self.conv3(x)
        # add the input x to the output out to form the new feature
        return F.relu(x + out, inplace=True)
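To make the "no extra parameter" claim about identity shortcuts concrete, the sketch below counts the learnable parameters of the block defined above using plain arithmetic. It assumes PyTorch defaults: the 1×1 shortcut convolution has a bias, and each `BatchNorm2d` has one weight and one bias per channel. The helper names (`conv_params`, `bn_params`, `count_block_params`) are hypothetical.

```python
def conv_params(in_ch, out_ch, k, bias):
    """Weights of a k x k convolution, plus one bias per output channel if enabled."""
    return in_ch * out_ch * k * k + (out_ch if bias else 0)

def bn_params(ch):
    """BatchNorm2d with default affine=True: one scale and one shift per channel."""
    return 2 * ch

def count_block_params(in_ch, out_ch, same_shape=True):
    # main path: two bias-free 3x3 convolutions, each followed by batch norm
    total = conv_params(in_ch, out_ch, 3, bias=False) + bn_params(out_ch)
    total += conv_params(out_ch, out_ch, 3, bias=False) + bn_params(out_ch)
    if not same_shape:
        # projection shortcut: a 1x1 convolution with the default bias
        total += conv_params(in_ch, out_ch, 1, bias=True)
    return total

identity_block = count_block_params(64, 64)                       # identity shortcut
projection_block = count_block_params(64, 128, same_shape=False)  # 1x1 projection shortcut
shortcut_only = conv_params(64, 128, 1, bias=True)

print(identity_block, projection_block, shortcut_only)
```

Under these assumptions, the identity-shortcut block's count comes entirely from its two convolutions and batch norms, while the projection block carries an additional 8,320 parameters in its shortcut.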




class resnet(nn.Module):
    def __init__(self, in_channel, num_classes, verbose=False):
        super(resnet, self).__init__()
        self.verbose = verbose  # whether to print each block's output shape

        self.block1 = nn.Conv2d(in_channel, 64, 7, 2)

        self.block2 = nn.Sequential(
            nn.MaxPool2d(3, 2),
            residual_block(64, 64),
            residual_block(64, 64)
        )

        self.block3 = nn.Sequential(
            residual_block(64, 128, False),  # input and output shapes differ
            residual_block(128, 128)
        )

        self.block4 = nn.Sequential(
            residual_block(128, 256, False),
            residual_block(256, 256)
        )

        self.block5 = nn.Sequential(
            residual_block(256, 512, False),
            residual_block(512, 512),
            nn.AvgPool2d(3)
        )

        # Linear(512, ...) requires a 1x1 feature map after pooling,
        # which holds for a 96x96 input, for example
        self.classifier = nn.Linear(512, num_classes)

    def forward(self, x):
        x = self.block1(x)
        if self.verbose:  # if True, print the shape of this block's output
            print('block 1 output: {}'.format(x.shape))
        x = self.block2(x)
        if self.verbose:
            print('block 2 output: {}'.format(x.shape))
        x = self.block3(x)
        if self.verbose:
            print('block 3 output: {}'.format(x.shape))
        x = self.block4(x)
        if self.verbose:
            print('block 4 output: {}'.format(x.shape))
        x = self.block5(x)
        if self.verbose:
            print('block 5 output: {}'.format(x.shape))
        x = x.view(x.shape[0], -1)  # flatten to (batch, features)
        x = self.classifier(x)
        return x
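The spatial sizes that the `verbose` flag would print can be worked out by hand with the standard output-size formula for convolution and pooling layers. The sketch below traces the network above without needing PyTorch. The 96×96 input size is an assumption (the notes do not state one), and `conv_out` and `trace_resnet` are hypothetical helper names.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output side length of a convolution or pooling layer (floor mode)."""
    return (size + 2 * padding - kernel) // stride + 1

def trace_resnet(size):
    """Return (block name, channels, side length) after each block of the network."""
    shapes = []
    size = conv_out(size, 7, 2)  # block1: 7x7 conv, stride 2, no padding
    shapes.append(("block1", 64, size))
    size = conv_out(size, 3, 2)  # block2: 3x3 max pool, stride 2; its residual blocks keep the size
    shapes.append(("block2", 64, size))
    for name, channels in (("block3", 128), ("block4", 256), ("block5", 512)):
        main = conv_out(size, 3, 2, 1)   # 3x3 conv, stride 2, padding 1
        shortcut = conv_out(size, 1, 2)  # 1x1 conv, stride 2
        assert main == shortcut, "main path and shortcut must agree before the addition"
        size = main
        if name == "block5":
            size = conv_out(size, 3, 3)  # AvgPool2d(3): stride defaults to the kernel size
        shapes.append((name, channels, size))
    return shapes

shapes = trace_resnet(96)  # assumed input resolution
for name, channels, side in shapes:
    print("{} output: {} x {} x {}".format(name, channels, side, side))
```

For the assumed 96×96 input, the last block ends at a 1×1 map with 512 channels, which is what `nn.Linear(512, num_classes)` expects after flattening. Other input sizes may leave a different number of features and would need the pooling or classifier adjusted.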


