1、Abstract

We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. (The paper proposes a residual learning framework that makes it possible to train networks far deeper than those used before.)

We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. (Experiments show that residual networks are easier to optimize and gain accuracy from considerably increased depth.)

In short, the abstract says that depth matters a great deal for neural networks, and residual learning lets the depth grow while still yielding good results.

2、Introduction

Deep networks naturally integrate low/mid/high-level features and classifiers in an end-to-end multi-layer fashion. (In an end-to-end, multi-layer fashion, deep networks integrate low-, mid-, and high-level features together with the classifier.)

Once the importance of depth was recognized, everyone pushed toward ever deeper networks, and that is where the problems started.

Driven by the significance of depth, a question arises: Is learning better networks as easy as stacking more layers? An obstacle to answering this question was the notorious problem of vanishing/exploding gradients, which hamper convergence from the beginning. This problem has been largely addressed by normalized initialization and intermediate normalization layers. (As depth became important, a question arose: is learning better networks as easy as stacking more layers? An obstacle to answering it is the notorious vanishing/exploding gradient problem, which hampers convergence from the start; this problem, however, has been largely addressed by normalized initialization and intermediate normalization layers such as BN.)

When deeper networks are able to start converging, a degradation problem has been exposed: with the network depth increasing, accuracy gets saturated (which might be unsurprising) and then degrades rapidly. (Once BN lets deeper networks start converging, another problem is exposed: as the depth increases, accuracy saturates and then degrades rapidly. Moreover, this degradation is not caused by overfitting.)

The degradation (of training accuracy) indicates that not all systems are similarly easy to optimize. (The degradation of training accuracy shows that not all systems are similarly easy to optimize.) In other words, the degradation is an optimization problem.

There exists a solution by construction to the deeper model: the added layers are identity mapping, and the other layers are copied from the learned shallower model. (For the deeper model there is a constructed solution: build the added layers as identity mappings and copy the other layers from the learned shallower model.) This construction implies that the deeper model should do no worse than the shallower one, yet in practice current solvers fail to find solutions that are as good as or better than it.

In this paper, we address the degradation problem by introducing a deep residual learning framework. Formally, denoting the desired underlying mapping as H(x), we let the stacked nonlinear layers fit another mapping of F(x) := H(x) - x. The original mapping is recast into F(x) + x. If an identity mapping were optimal, it would be easier to push the residual to zero than to fit an identity mapping by a stack of nonlinear layers. (This paper addresses degradation with a deep residual learning framework: denote the desired underlying mapping as H(x), let the stacked nonlinear layers fit F(x) := H(x) - x, and recast the original mapping as F(x) + x. If the identity mapping is optimal, pushing the residual F(x) toward zero is easier than fitting an identity mapping with a stack of nonlinear layers.)

Shortcut connections are those skipping one or more layers. In our case, the shortcut connections simply perform identity mapping, and their outputs are added to the outputs of the stacked layers. Identity shortcut connections add neither extra parameter nor computational complexity. (A shortcut connection skips one or more layers; here it simply performs identity mapping, and its output is added to the output of the stacked layers. Identity shortcuts add no extra parameters and no extra computational complexity.)
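To make the F(x) + x structure concrete, here is a minimal sketch of a residual block, written with fully-connected layers to match the paper's notation. PyTorch and the ResidualBlock name are my own assumptions for illustration, not code from the paper.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Minimal residual block: the stacked layers learn F(x), and the
    identity shortcut adds x back, so the block outputs F(x) + x."""
    def __init__(self, dim):
        super().__init__()
        self.fc1 = nn.Linear(dim, dim)
        self.fc2 = nn.Linear(dim, dim)
        self.relu = nn.ReLU()

    def forward(self, x):
        f = self.fc2(self.relu(self.fc1(x)))  # F(x): two stacked nonlinear layers
        return self.relu(f + x)               # identity shortcut adds the input back

block = ResidualBlock(64)
y = block(torch.randn(8, 64))  # same shape in and out, so the addition is well defined
```

If the optimal mapping is close to identity, the optimizer only has to drive the weights of the stacked layers toward zero, which is the intuition stated in the paper.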

Advantages of the residual structure:

Our extremely deep residual nets are easy to optimize, but the counterpart "plain" nets (that simply stack layers) exhibit higher training error when the depth increases; Our deep residual nets can easily enjoy accuracy gains from greatly increased depth, producing results substantially better than previous networks. (Our extremely deep residual nets are easy to optimize, whereas plain nets that simply stack layers show higher training error as the depth grows; the residual nets also gain accuracy from greatly increased depth, producing results substantially better than previous networks.)

The paper then gives the ImageNet and CIFAR-10 examples: the deep residual nets have lower complexity than VGG yet lower test error, and the idea generalizes across tasks.

3、Deep Residual Learning

If one hypothesizes that multiple nonlinear layers can asymptotically approximate complicated functions, then it is equivalent to hypothesize that they can asymptotically approximate the residual functions. (Hypothesis: if multiple nonlinear layers can asymptotically approximate complicated functions, then it is equivalent to say they can asymptotically approximate the residual functions.)

If the optimal function is closer to an identity mapping than to a zero mapping, it should be easier for the solver to find the perturbations with reference to an identity mapping, than to learn the function as a new one. (If the optimal function is closer to an identity mapping than to a zero mapping, it is easier for the solver to find perturbations with reference to the identity mapping than to learn the function as a new one.)

The building block is defined as y = F(x, {W_i}) + x (Eqn. 1), where F(x, {W_i}) is the residual mapping to be learned and x is carried by the identity shortcut.

The dimensions of x and F must be equal in Eqn. (1). If this is not the case (e.g., when changing the input/output channels), we can perform a linear projection W_s by the shortcut connections to match the dimensions:

In Eqn. (1) the dimensions of x and F must match; when they do not, a linear projection W_s is applied on the shortcut connection to match the dimensions, giving Eqn. (2): y = F(x, {W_i}) + W_s x.

Also, the identity mapping alone is sufficient to address the degradation problem, but if F has only a single layer the residual block brings no advantage (it reduces to a linear layer plus the input). We also note that although the above notations are about fully-connected layers for simplicity, they are applicable to convolutional layers. The function F(x, {W_i}) can represent multiple convolutional layers. The element-wise addition is performed on two feature maps, channel by channel. (This applies not only to fully-connected layers but also to convolutional layers: F can represent multiple convolutional layers, and the addition is performed element-wise between the two feature maps, channel by channel.)
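Below is a sketch of the convolutional version, including the 1*1 projection that plays the role of W_s in Eqn. (2) when the dimensions change. The ConvResidualBlock name and the PyTorch layer choices are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class ConvResidualBlock(nn.Module):
    """Convolutional residual block. When the channel count or spatial size
    changes, a 1x1 convolution acts as the projection W_s of Eqn. (2) so the
    element-wise, channel-by-channel addition stays well defined."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)
        if stride != 1 or in_ch != out_ch:
            # projection shortcut: 1x1 conv (W_s) matches channels and stride
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_ch),
            )
        else:
            self.shortcut = nn.Identity()  # plain identity: no extra parameters

    def forward(self, x):
        f = self.bn2(self.conv2(self.relu(self.bn1(self.conv1(x)))))  # F(x, {W_i})
        return self.relu(f + self.shortcut(x))  # feature maps added channel by channel

y = ConvResidualBlock(64, 128, stride=2)(torch.randn(1, 64, 56, 56))
```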

4、Network Architectures

Plain network:

The plain baseline is mainly inspired by the VGG networks. The convolutional layers mostly use 3*3 filters and follow two simple rules: 1) layers with the same output feature-map size have the same number of filters; 2) if the feature-map size is halved, the number of filters is doubled so as to preserve the time complexity per layer. A small sketch of such a stage follows.
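A minimal sketch of one "plain" stage following the two rules above; make_plain_stage is a hypothetical helper of mine, and PyTorch is assumed.

```python
import torch.nn as nn

def make_plain_stage(in_ch, out_ch, num_layers):
    """One stage of the plain network: 3x3 convs, the same filter count for
    the same feature-map size, and a stride-2 first conv that halves the map
    when the caller doubles the filters (out_ch = 2 * in_ch)."""
    layers = []
    for i in range(num_layers):
        stride = 2 if (i == 0 and out_ch != in_ch) else 1  # halve the map when widening
        layers += [
            nn.Conv2d(in_ch if i == 0 else out_ch, out_ch, 3,
                      stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        ]
    return nn.Sequential(*layers)

# e.g. successive stages use 64 -> 128 -> 256 filters, doubling as the map halves
stage2 = make_plain_stage(64, 128, num_layers=4)
```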

Residual network:

On top of the plain network above, shortcut connections are inserted to turn it into the residual version. When the input and output have the same dimensions, identity shortcuts are used directly; when the dimensions increase, there are two options:

(A) The shortcut still performs identity mapping, with zeros padded for the increased dimensions; this adds no extra parameters (a sketch of this option appears right after this list);

(B) The projection shortcut of Eqn. (2) is used to match dimensions (done by 1*1 convolutions).
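One common way to realize option (A), assuming NCHW tensors and a downsampling stride of 2; option_a_shortcut is a hypothetical helper, not code from the paper.

```python
import torch
import torch.nn.functional as F

def option_a_shortcut(x, out_channels):
    """Option (A) shortcut: keep the identity mapping, downsample spatially by
    stride 2, and zero-pad the extra channels -- no new parameters introduced."""
    x = x[:, :, ::2, ::2]             # spatial downsampling (stride 2), NCHW layout
    extra = out_channels - x.size(1)  # number of zero channels to append
    # F.pad on a 4-D tensor pads (W_left, W_right, H_left, H_right, C_front, C_back)
    return F.pad(x, (0, 0, 0, 0, 0, extra))

y = option_a_shortcut(torch.randn(1, 64, 56, 56), out_channels=128)  # -> (1, 128, 28, 28)
```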

Implementation details of the residual networks: omitted here.

5、Experiments

The paper evaluates residual networks on two benchmarks to verify their properties and effectiveness.

First, for ImageNet classification, the ImageNet 2012 classification dataset is used to test plain and residual networks at two depths (18 and 34 layers). The results show that for the plain network, both training and validation error increase when the depth grows to 34 layers, whereas for the residual network both errors drop as layers are added.

The top-1 error also shows that ResNet resolves the degradation problem brought by the extra depth and, compared with the plain network, lowers the top-1 error. At 18 layers the two architectures reach similar accuracy, but the 18-layer ResNet converges faster; so when the network is not overly deep, SGD converges faster on ResNet.

Projection shortcuts are not essential for addressing the degradation problem. Identity shortcuts are particularly important for not increasing the complexity of the bottleneck architectures that are introduced below. (Projection shortcuts are slightly better than identity shortcuts, but the gain is small and they add parameters and complexity; weighing the overall architectural complexity, identity shortcuts are kept, especially in the bottleneck designs.)

Increasing the depth to 50, 101, and 152 layers gives even more accurate results, still without the degradation problem.

There is also a point left for future study: the 34-layer plain net keeps a higher training error in later iterations even though BN is used and gradients do not vanish; the solver still reaches reasonable accuracy, so the conjecture is simply that these deep plain nets have a low convergence rate.

Tests on CIFAR-10:

The networks used are similar to those for ImageNet, except that for the 110-layer net a smaller learning rate is used to warm up the early phase of training. Analyzing the layer responses: the residual functions might be generally closer to zero than the non-residual functions. When there are more layers, an individual layer of ResNet tends to modify the signal less.

Beyond 1000 layers there is still no optimization difficulty and the training result remains good, but the test error is higher: with that many layers the model overfits. Stronger regularization could likely yield better results in the future.
