Paper notes: "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift"

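Before reading the paper myself, I first got the gist of it from an expert's write-up online. I'm a bit embarrassed that my English isn't good enough...

The expert's post is at https://blog.csdn.net/happynear/article/details/44238541. What follows mixes a little of my own understanding with that author's summary. It is a record of my learning, and everyone is welcome to discuss.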

First, here is the expert's walkthrough of the paper, roughly organized as follows:

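In short: while a DNN is being trained, the input distribution of every layer is affected by the output of the layers before it. As a result the learning rate cannot be raised and the initialization is hard to choose. This is internal covariate shift (the explanation of covariate shift that I looked up is in a figure further down). The paper then analyzes how to normalize the input distribution of each layer without hurting training and backpropagation too much.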

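A short digression

The post starts with data preprocessing, and its author explains how preprocessing helps speed up convergence. The usual progression is mean subtraction -> z-score -> whitening. Each step makes randomly initialized weights more effective at separating the data and lowers the chance of overfitting. Of course, this is all background knowledge and has little to do with the paper itself.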
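To make the three preprocessing steps concrete, here is a minimal NumPy sketch. The function names (`center`, `zscore`, `pca_whiten`) and the toy data are my own choices for illustration and do not come from the paper or the referenced post.

```python
import numpy as np

def center(X):
    """Subtract the per-feature mean (rows are samples, columns are features)."""
    return X - X.mean(axis=0)

def zscore(X, eps=1e-8):
    """Center, then scale every feature to (approximately) unit variance."""
    Xc = center(X)
    return Xc / (Xc.std(axis=0) + eps)

def pca_whiten(X, eps=1e-8):
    """Center, rotate onto the principal axes, and rescale each axis to unit variance."""
    Xc = center(X)
    cov = Xc.T @ Xc / Xc.shape[0]
    eigvals, eigvecs = np.linalg.eigh(cov)
    return (Xc @ eigvecs) / np.sqrt(eigvals + eps)

rng = np.random.default_rng(0)
# Toy data: 3 correlated features with non-zero means (purely illustrative).
A = np.array([[2.0, 0.0, 0.0], [1.0, 1.0, 0.0], [0.5, 0.3, 0.2]])
X = rng.normal(size=(1000, 3)) @ A + np.array([5.0, -2.0, 1.0])

X_white = pca_whiten(X)
# After whitening, the covariance matrix is (numerically) the identity.
cov_white = X_white.T @ X_white / X_white.shape[0]
```

The z-score only fixes each feature's mean and variance, while whitening also removes the correlations between features, which is why it needs the full covariance matrix.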

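One of these is the Z-score, also called the standard score: for a value x it is (x - mean) / standard deviation. There is also the T-score. Neither concept is difficult.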

Being lazy, I'll use one more figure, which explains what the covariate shift problem addressed by this paper is:

Main content:

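The paper then uses a stochastic gradient descent example to show that "it is advantageous for the distribution of x to remain fixed over time". It then turns to the vanishing gradient problem of the sigmoid function: the larger the absolute value of x, the smaller the gradient and the slower the learning, and as the parameters change during training x easily moves into the saturated regime, which slows convergence. This is one reason rectified linear units (ReLU) were introduced.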
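The saturation effect is easy to see numerically. The following is a small sketch of my own (not taken from the paper) that evaluates the sigmoid derivative g'(x) = g(x)(1 - g(x)) at a few arbitrarily chosen inputs.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    """Derivative of the sigmoid: g'(x) = g(x) * (1 - g(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

xs = np.array([0.0, 2.0, 5.0, 10.0])  # sample inputs, chosen only for illustration
grads = sigmoid_grad(xs)
for x, g in zip(xs, grads):
    print(f"x = {x:5.1f}   g'(x) = {g:.6f}")
```

The derivative peaks at 0.25 for x = 0 and shrinks quickly as |x| grows, so a unit whose input drifts far from zero passes back almost no gradient.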

If we could ensure that the distribution of nonlinearity inputs remains more stable as the network trains, then the optimizer would be less likely to get stuck in the saturated regime, and the training would accelerate.

So the paper introduces a new mechanism, Batch Normalization, which takes a step towards reducing internal covariate shift, and in doing so dramatically accelerates the training of deep neural nets.

Reducing Internal Covariate Shift

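First, internal covariate shift refers to the change in the distribution of network activations caused by changes in the network parameters during training. (We define Internal Covariate Shift as the change in the distribution of network activations due to the change in network parameters during training.) In other words, the input distribution of each layer keeps changing as the parameters change. If the inputs are whitened first (as in PCA whitening, which removes redundancy in the input), training converges faster.

At the same time, normalization interacts with gradient descent, and the algorithm design has to take this into account. To that end, the goal is that for any parameter values the network always produces activations with the desired distribution. After analyzing the gradient descent optimization, the paper concludes that full whitening is too expensive to apply to every layer during training, and the other approaches it considers do not work either, so it uses a simplified form of whitening instead.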

Normalization via Mini-Batch Statistics

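The first is that instead of whitening the features in layer inputs and outputs jointly, we will normalize each scalar feature independently, by making it have zero mean and unit variance. In other words, joint whitening of the input features is replaced by normalizing each scalar feature on its own: subtract its mean and scale it to unit variance. Computing these statistics over the whole training set does not fit stochastic gradient optimization, so the second simplification is: since we use mini-batches in stochastic gradient training, each mini-batch produces estimates of the mean and variance of each activation.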

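This gives Algorithm 1, x(k) --> x̂(k) --> y(k): each x(k) in the mini-batch goes through the simplified normalization (subtract the mini-batch mean, divide by the mini-batch standard deviation) to give x̂(k). Normalizing alone could change what the layer can represent, for example by confining the inputs of a sigmoid to its near-linear region, so a linear transform is applied afterwards, y(k) = γ(k) x̂(k) + β(k), which is the y_i formula. γ and β are learned parameters. Setting γ(k) = sqrt(Var[x(k)]) and β(k) = E[x(k)] would recover the original activations, so the transform can represent the identity if that is the best thing to do.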
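The forward pass described above can be sketched in a few lines of NumPy. This is only a training-mode illustration under my own assumptions: the function name `batchnorm_forward`, the (m, d) input layout, and the value of `eps` are not taken from the paper.

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """Training-mode batch normalization for a mini-batch x of shape (m, d)."""
    mu = x.mean(axis=0)                     # mini-batch mean, one per feature
    var = x.var(axis=0)                     # mini-batch (biased) variance
    x_hat = (x - mu) / np.sqrt(var + eps)   # normalize
    y = gamma * x_hat + beta                # scale and shift
    return y, x_hat, mu, var

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(32, 4))  # toy activations

# gamma = 1, beta = 0: plain normalization to zero mean and unit variance.
y, x_hat, mu, var = batchnorm_forward(x, np.ones(4), np.zeros(4))

# gamma = sqrt(Var[x] + eps), beta = E[x]: the transform undoes the normalization.
y_identity, _, _, _ = batchnorm_forward(x, np.sqrt(var + 1e-5), mu)
```

With the small constant `eps` inside the square root, the exact identity needs sqrt(Var[x] + eps) rather than sqrt(Var[x]); the difference is negligible when the variance is not tiny.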

The partial derivatives with respect to each quantity are then back-propagated to update the parameters:
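For reference, these are the chain-rule gradients as I understand them from the paper, written in LaTeX. Here ℓ is the loss, m is the mini-batch size, μ_B and σ_B² are the mini-batch mean and variance, and ε is the small constant added for numerical stability.

```latex
\begin{align*}
\frac{\partial \ell}{\partial \hat{x}_i} &= \frac{\partial \ell}{\partial y_i} \cdot \gamma \\
\frac{\partial \ell}{\partial \sigma_B^2} &= \sum_{i=1}^{m} \frac{\partial \ell}{\partial \hat{x}_i} \cdot (x_i - \mu_B) \cdot \left(-\tfrac{1}{2}\right) \left(\sigma_B^2 + \epsilon\right)^{-3/2} \\
\frac{\partial \ell}{\partial \mu_B} &= \left( \sum_{i=1}^{m} \frac{\partial \ell}{\partial \hat{x}_i} \cdot \frac{-1}{\sqrt{\sigma_B^2 + \epsilon}} \right) + \frac{\partial \ell}{\partial \sigma_B^2} \cdot \frac{\sum_{i=1}^{m} -2 (x_i - \mu_B)}{m} \\
\frac{\partial \ell}{\partial x_i} &= \frac{\partial \ell}{\partial \hat{x}_i} \cdot \frac{1}{\sqrt{\sigma_B^2 + \epsilon}} + \frac{\partial \ell}{\partial \sigma_B^2} \cdot \frac{2 (x_i - \mu_B)}{m} + \frac{\partial \ell}{\partial \mu_B} \cdot \frac{1}{m} \\
\frac{\partial \ell}{\partial \gamma} &= \sum_{i=1}^{m} \frac{\partial \ell}{\partial y_i} \cdot \hat{x}_i \\
\frac{\partial \ell}{\partial \beta} &= \sum_{i=1}^{m} \frac{\partial \ell}{\partial y_i}
\end{align*}
```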

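??? To be honest, I still have questions about where these derivatives come from. For example, why does only x_i have partial derivatives running through all three of the expressions above, while the other two quantities are only passed through the third? I don't understand this yet, so please tell me if you know.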
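One possible way to look at this question: x_i influences the loss along three paths, directly through x̂_i, through the mini-batch mean (which is an average over all x_i), and through the mini-batch variance (likewise computed from all x_i). So the gradient with respect to x_i sums three terms, whereas γ and β only appear in the final scale-and-shift step. The sketch below is my own NumPy illustration, with hypothetical function names, that implements this chain rule and compares it with a finite-difference estimate.

```python
import numpy as np

def bn_forward(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta, (x, x_hat, mu, var, gamma, eps)

def bn_backward(dy, cache):
    """Chain rule through y = gamma * x_hat + beta, x_hat, the variance and the mean."""
    x, x_hat, mu, var, gamma, eps = cache
    m = x.shape[0]
    inv_std = 1.0 / np.sqrt(var + eps)
    dx_hat = dy * gamma
    dvar = np.sum(dx_hat * (x - mu), axis=0) * -0.5 * inv_std**3
    dmu = np.sum(-dx_hat * inv_std, axis=0) + dvar * np.mean(-2.0 * (x - mu), axis=0)
    # Three paths into x_i: via x_hat directly, via the variance, via the mean.
    dx = dx_hat * inv_std + dvar * 2.0 * (x - mu) / m + dmu / m
    dgamma = np.sum(dy * x_hat, axis=0)
    dbeta = np.sum(dy, axis=0)
    return dx, dgamma, dbeta

def numerical_grad(f, a, h=1e-6):
    """Central finite differences of the scalar function f() with respect to array a."""
    g = np.zeros_like(a)
    for idx in np.ndindex(*a.shape):
        old = a[idx]
        a[idx] = old + h
        fp = f()
        a[idx] = old - h
        fm = f()
        a[idx] = old
        g[idx] = (fp - fm) / (2 * h)
    return g

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 3))
gamma = rng.normal(size=3)
beta = rng.normal(size=3)
w = rng.normal(size=(8, 3))  # toy loss: sum(w * y), so dloss/dy = w

def loss():
    y, _ = bn_forward(x, gamma, beta)
    return np.sum(w * y)

_, cache = bn_forward(x, gamma, beta)
dx, dgamma, dbeta = bn_backward(w, cache)

dx_num = numerical_grad(loss, x)
dgamma_num = numerical_grad(loss, gamma)
dbeta_num = numerical_grad(loss, beta)
```

If the three-path reading is right, the analytic and numerical gradients should agree to within finite-difference error. Dropping the mean or variance term from `dx` makes the check fail.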

In addition, regarding the mini-batch size in the convolutional case: for a mini-batch of size m and feature maps of size p × q, the effective mini-batch size is m' = m · pq.

Batch-Normalized Convolutional Networks

Likewise, when normalization is applied to convolutional networks, the standard convolutional layer is modified accordingly.

The new algorithm:

For convolutional layers, we additionally want the normalization to obey the convolutional property, so that different elements of the same feature map, at different locations, are normalized in the same way.
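In code, obeying the convolutional property amounts to computing one mean and one variance per feature map, pooled over both the batch and the spatial positions. Below is a minimal NumPy sketch; the (m, C, p, q) array layout, the function name `batchnorm_conv_forward`, and the toy shapes are my own assumptions.

```python
import numpy as np

def batchnorm_conv_forward(x, gamma, beta, eps=1e-5):
    """x has shape (m, C, p, q); one (gamma, beta) pair per feature map (channel)."""
    m, C, p, q = x.shape
    # Statistics are pooled over the batch and all spatial locations of a feature map.
    mu = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    x_hat = (x - mu) / np.sqrt(var + eps)
    y = gamma.reshape(1, C, 1, 1) * x_hat + beta.reshape(1, C, 1, 1)
    effective_batch = m * p * q  # m' = m * p * q values feed each mean and variance
    return y, effective_batch

rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=3.0, size=(4, 2, 5, 6))  # m=4, C=2, p=5, q=6
y, m_eff = batchnorm_conv_forward(x, np.ones(2), np.zeros(2))
```

For these toy shapes each channel's statistics are estimated from 4 · 5 · 6 = 120 values, and only one γ and one β are learned per feature map rather than one per activation.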

Advantages:

1. It reduces the dependence of gradients on the scale of the parameters or of their initial values; this allows us to use much higher learning rates.

2. It reduces the need for Dropout.

3. It makes it possible to use saturating nonlinearities by preventing the network from getting stuck in the saturated modes.

4. It prevents small changes in layer parameters from amplifying as the data propagates through a deep network.
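The first advantage rests on a scale-invariance property: for a scalar a, normalizing (aW)u gives the same result as normalizing Wu, so the scale of the weights does not change the layer output. The sketch below checks this numerically. The helper `bn` (normalization only, without γ and β), the shapes, and the value of a are my own choices for illustration.

```python
import numpy as np

def bn(x, eps=1e-8):
    """Normalization step only (no gamma/beta), per feature over the mini-batch."""
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

rng = np.random.default_rng(0)
u = rng.normal(size=(64, 4))   # layer inputs for a mini-batch of 64
W = rng.normal(size=(4, 3))    # layer weights
a = 10.0                       # arbitrary scale factor

out = bn(u @ W)
out_scaled = bn(u @ (a * W))
max_diff = np.abs(out - out_scaled).max()
```

The two outputs differ only through the tiny `eps` term, which is why a larger weight scale does not blow up the activations passed on to the next layer.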

Further Study: We further conjecture that Batch Normalization may lead the layer Jacobians to have singular values close to 1, which is known to be beneficial for training.