Xavier Initialization: An In-Depth Reading (with Source Code)

The paper is Understanding the difficulty of training deep feedforward neural networks.

A translation I found good is [Deep Learning] Notes: Understanding the difficulty of training deep feedforward neural networks.

Some good commentary articles: Understanding the difficulty of training deep feedforward neural networks.

The paper is a classic. Its first author is Xavier Glorot, and Xavier initialization is named after him.

The source code I refer to is the TensorFlow version: the API is variance_scaling_initializer, and the source is in initializers.py.

0 Abstract

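The paper credits two things for the recent gains in deep learning: new initialization and new training mechanisms. That initialization is given this much weight really struck me: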

All these experimental results were obtained with new initialization or training mechanisms.

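Yet the existing approach, plain random initialization, performs poorly. The paper's main aim is to understand why, and from that to point toward better algorithms: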

Our objective here is to understand better why standard gradient descent from random initialization is doing so poorly with deep neural networks, to better understand these recent relative successes and help design better algorithms in the future.

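The paper first examines the effect of the non-linear activation function and finds that the sigmoid, because of its mean value, is ill-suited to deep networks: it can drive the upper layers, especially the top hidden layer, into saturation: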

We find that the logistic sigmoid activation is unsuited for deep networks with random initialization because of its mean value, which can drive especially the top hidden layer into saturation.

It finds that saturated units can move out of saturation by themselves:

Surprisingly, we find that saturated units can move out of saturation by themselves, albeit slowly, and explaining the plateaus sometimes seen when training neural networks.

It finds that an activation function that saturates less is often beneficial:

We find that a new non-linearity that saturates less can often be beneficial.

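It suggests that training may be harder when the singular values of each layer's Jacobian are far from 1 (in either direction, not only much larger than 1):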

Finally, we study how activations and gradients vary across layers and during training, with the idea that training may be more difficult when the singular values of the Jacobian associated with each layer are far from 1.

Based on these observations, the paper proposes a new initialization scheme.

1 Deep Neural Networks

Deep networks aim to learn higher-level features from lower-level ones:

Deep learning methods aim at learning feature hierarchies with features from higher levels of the hierarchy formed by the composition of lower level features.

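The paper does not focus on unsupervised pre-training or semi-supervised criteria; instead it looks at what goes wrong in plain multi-layer networks: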

So here instead of focusing on what unsupervised pre-training or semi-supervised criteria bring to deep architectures, we focus on analyzing what may be going wrong with good old (but deep) multi-layer neural networks.

The analysis mainly tracks how activations and gradients change across layers and over the course of training:

Our analysis is driven by investigative experiments to monitor activations (watching for saturation of hidden units) and gradients, across layers and across training iterations.

It also evaluates the effect of the activation function and of the initialization procedure:

We also evaluate the effects on these of choices of activation function (with the idea that it might affect saturation) and initialization procedure (since unsupervised pre-training is a particular form of initialization and it has a drastic impact).

2 Experiment Setting and Datasets

2.1 Online Learning on an Infinite Dataset: Shapeset-3*2

The paper explains why the online setting is attractive:

The online setting is also interesting because it focuses on the optimization issues rather than on the small-sample regularization effects.

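Online learning is about the optimization problem rather than about regularization effects on a small sample. My own reading: in the small-sample setting, you are given a fixed set of 2-D points and asked for the line that minimizes the distance to them (this can be solved in closed form, but suppose we train instead). You start from an initial line, feed in one point, adjust the line to reduce its distance to that point, and continue until the dataset is used up. Online learning repeats the same process, except that the points never run out, and an unlimited supply of points can approximate the true distribution arbitrarily well. The difference between the two is whether the network ever gets to see the whole dataset.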

It then introduces the dataset, Shapeset-3*2, which it describes as:

Shapeset-3*2 contains images of 1 or 2 two-dimensional objects, each taken from 3 shape categories (triangle, parallelogram, ellipse), and placed with random shape parameters (relative lengths and/or angles), scaling, rotation, translation and grey-scale.

The top part of Figure 1 shows some sample images.

Here is why the task in Figure 1 has nine possible classes:

The task is to predict the objects present (e.g. triangle+ellipse, parallelogram+parallelogram, triangle alone, etc.) without having to distinguish between the foreground shape and the background shape when they overlap. This therefore defines nine configuration classes.

The nine classes are:

1. A single object, 3 cases: triangle alone, parallelogram alone, ellipse alone

2. Two different shapes chosen from the three, $C_3^2 = 3$ cases: triangle+parallelogram, triangle+ellipse, parallelogram+ellipse

3. Two objects of the same shape, 3 cases: triangle+triangle, parallelogram+parallelogram, ellipse+ellipse
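The count above can be reproduced by enumeration. This is a minimal sketch; the names SHAPES and configuration_classes are chosen here for illustration and do not come from the paper.

```python
from itertools import combinations_with_replacement

SHAPES = ("triangle", "parallelogram", "ellipse")

def configuration_classes(shapes=SHAPES):
    """Return every unordered choice of 1 or 2 shapes (repeats allowed)."""
    singles = [(s,) for s in shapes]
    # Unordered pairs, so triangle+ellipse and ellipse+triangle count once.
    pairs = list(combinations_with_replacement(shapes, 2))
    return singles + pairs

classes = configuration_classes()
for c in classes:
    print("+".join(c))
print(len(classes))
```

The pairs are unordered because the task does not distinguish the foreground shape from the background shape.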

2.2 Finite Datasets

There are three datasets: MNIST, CIFAR-10, and Small-ImageNet.

2.3 Experimental Setting

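The section gives the basic network settings and introduces three activation functions: the sigmoid, the hyperbolic tangent, and the softsign. The figure in this section plots all three, where

$$\mathrm{softsign}(x) = \frac{x}{1 + |x|}.$$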
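The three activation functions can be written down directly. This is a minimal sketch using only Python's standard library; the softsign definition x / (1 + |x|) is the one given in the paper, and the function names are chosen here for illustration.

```python
import math

def sigmoid(x):
    """Logistic sigmoid, with outputs in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def softsign(x):
    """Softsign, with outputs in (-1, 1)."""
    return x / (1.0 + abs(x))

for x in (-4.0, -1.0, 0.0, 1.0, 4.0):
    print(f"x={x:+.1f}  sigmoid={sigmoid(x):.4f}  "
          f"tanh={math.tanh(x):.4f}  softsign={softsign(x):.4f}")
```

At x = 0 the sigmoid outputs 0.5 while tanh and softsign output 0, which is the non-zero mean the paper refers to.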

3 Effect of Activation Functions and Saturation During Training

This section consists mostly of experiments on the three activation functions. The paper judges them against two failure modes:

1.excessive saturation of activation function (then gradients will not propagate well)

2.overly linear units (they will not compute something interesting)

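Take the first point. Backpropagation multiplies by the derivative of the activation function, so a saturated unit, whose derivative is close to 0, makes the gradient vanish. See the CSDN blog post "Derivation of the backpropagation formulas" (HappyRocking's column). In the usual notation, with $z^l = w^l a^{l-1} + b^l$, $a^l = \sigma(z^l)$ and $\delta^l_j = \partial C / \partial z^l_j$, the partial derivatives of the cost $C$ with respect to $w$ and $b$ are:

$$\frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \, \delta^l_j, \qquad \frac{\partial C}{\partial b^l_j} = \delta^l_j$$

In addition:

$$\delta^l = \left( (w^{l+1})^{T} \delta^{l+1} \right) \odot \sigma'(z^l)$$

These two formulas show that the gradients obtained by backpropagation scale with the derivative of the activation function. Saturation means that derivative is close to 0, which is harmful.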

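As for the second point, I recently studied ResNet; see my article "An In-Depth Reading of ResNet V1 (with Source Code)". Section 4.2 there explains that without an activation function a two-layer network is equivalent to a one-layer network, because a composition of linear functions is still linear. A neural network is meant to fit a non-linear function, and the non-linearity normally comes from the activation function, so units that are too linear contribute little.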
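The collapse of two linear layers into one can be checked numerically. The sketch below uses made-up weights and hypothetical helper names (matvec, matmul); it only illustrates that W2(W1 x) equals (W2 W1) x when no activation sits between the layers.

```python
def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def matmul(a, b):
    """Multiply matrix a by matrix b (both lists of rows)."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

W1 = [[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]]   # 2 inputs -> 3 hidden units
W2 = [[1.0, 0.0, 2.0], [-1.0, 1.0, 0.5]]     # 3 hidden units -> 2 outputs
x = [0.5, -2.0]

two_layers = matvec(W2, matvec(W1, x))       # layer by layer, no activation
one_layer = matvec(matmul(W2, W1), x)        # single merged weight matrix
print(two_layers, one_layer)
```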

3.1 Experiments with the sigmoid

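The sigmoid has a non-zero mean, and this mean is tied to the singular values of the Hessian (the reference is yet another very old paper, which I have not had time to read). This makes training relatively slow: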

The sigmoid non-linearity has been already shown to slow down learning because of its non-zero mean that induces important singular values in the Hessian.

Now an explanation of Figure 2.

The paper defines activation values as follows:

activation values: output of the sigmoid

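The sigmoid is a function, so its output changes with its input, and here the input is the unit's pre-activation $z = w^{T}x + b$. The paper says that throughout training it keeps evaluating on a fixed test set of 300 examples; $z$ is the value that reaches this unit when a test example is fed through, so the activation value is $f(z)$, where $f$ is the sigmoid.

Each layer has one thousand hidden units, and each unit has its own activation value, so for every layer there is a mean and a standard deviation taken over the units and the 300 examples. That is what Figure 2 plots.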

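What I did not understand at first is which layer the caption's "top hidden layer" refers to. By the description it is Layer 4, but I would have expected the top hidden layer to be Layer 1. I now lean toward Layer 4, because the body text has this sentence:

We see that very quickly at the beginning, all the sigmoid activation values of the last hidden layer are pushed to their lower saturation value of 0.

So "the last hidden layer" and "top hidden layer" both refer to Layer 4.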

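The paper then analyzes Figure 2. For a long time the mean activation of the first three layers stays around 0.5, while Layer 4 stays around 0, i.e. in the saturation regime. When Layer 4 begins to leave saturation, the first three layers start to saturate and then stabilize.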

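The paper's explanation: with random initialization, the final softmax layer $\mathrm{softmax}(b + Wh)$ initially relies more on its bias $b$ than on the output $h$ of the top hidden layer (Layer 4). The gradient updates therefore push $Wh$ toward 0, which can be achieved by pushing $h$ toward 0: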

The logistic layer output softmax(b+Wh) might initially rely more on its biases b (which are learned very quickly) than on the top hidden activations h derived from the input image (because h would vary in ways that are not predictive of y, maybe correlated mostly with other and possibly more dominant variations of x).
Thus the error gradient would tend to push Wh towards 0, which can be achieved by pushing h towards 0.

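My reading is that the gradient passed back to W carries a factor h, while the gradient for b carries a factor 1. If h is very small, the gradient of W is very small too, so W learns much more slowly than b.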
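This reading can be checked on a tiny example. The sketch below assumes a cross-entropy loss on softmax(b + Wh), for which the gradient with respect to b is p - y and the gradient with respect to W is the outer product of (p - y) and h. The weights, sizes, and the helper name output_layer_grads are made up for illustration.

```python
import math

def softmax(logits):
    m = max(logits)                      # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def output_layer_grads(W, b, h, target):
    """Gradients of -log softmax(b + W h)[target] with respect to b and W."""
    logits = [bk + sum(w * hj for w, hj in zip(row, h)) for row, bk in zip(W, b)]
    p = softmax(logits)
    err = [pk - (1.0 if k == target else 0.0) for k, pk in enumerate(p)]
    grad_b = err                                     # factor 1
    grad_W = [[ek * hj for hj in h] for ek in err]   # factor h
    return grad_b, grad_W

W = [[0.1, -0.2], [0.0, 0.3], [-0.1, 0.1]]
b = [0.0, 0.0, 0.0]
for h in ([0.5, 0.5], [0.01, 0.01]):
    gb, gW = output_layer_grads(W, b, h, target=0)
    print(h, max(abs(g) for g in gb), max(abs(g) for row in gW for g in row))
```

When h shrinks, the gradient for b keeps roughly the same size while the gradient for W shrinks with it.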

This makes it hard for gradients to flow backward, so the lower layers struggle to learn:

However, pushing the sigmoid outputs to 0 would bring them into a saturation regime which would prevent gradients to flow backward and prevent the lower layers from learning features.

Because a sigmoid output of 0 lies in the saturation regime, training is very slow at the beginning:

Eventually but slowly, the lower layers move toward more useful features and the top hidden layer then moves out of the saturation regime.

3.2 Experiments with Hyperbolic tangent

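I do not fully understand the "98 percentiles" in Figure 3; for now I take it to mean that 98% of the values lie between the upper and lower markers.

The paper observes that the markers of Layer 1 are the first to approach 1, meaning that layer has saturated (why?), followed by Layer 2, and so on.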

3.3 Experiments with the Softsign

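The softsign is similar to the hyperbolic tangent but differs in how it saturates, because it approaches its asymptotes $-1$ and $1$ polynomially rather than exponentially (that is, more gradually). Readers who are interested can compare the curves in the figure from Section 2.3.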
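The difference in how fast the two functions approach their upper asymptote can be seen by printing the remaining gap to 1. This is a small sketch; gap_to_one is a helper name introduced here. For softsign the gap is exactly 1 / (1 + x) for x > 0, while for tanh it shrinks roughly like 2 * exp(-2x).

```python
import math

def softsign(x):
    return x / (1.0 + abs(x))

def gap_to_one(f, x):
    """Distance between f(x) and the upper asymptote 1."""
    return 1.0 - f(x)

for x in (1.0, 2.0, 5.0, 10.0):
    print(f"x={x:4.1f}  tanh gap={gap_to_one(math.tanh, x):.2e}  "
          f"softsign gap={gap_to_one(softsign, x):.2e}")
```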

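Figure 4 compares the distributions of activation values of the two activation functions at the end of training.

With the hyperbolic tangent, most activation values lie either in the saturation regime (near 1 or -1) or in the linear regime (near 0). With the softsign, most lie in the intervals (-0.8, -0.6) or (0.6, 0.8), a region with substantial non-linearity where gradients still propagate well. This suggests the softsign is the better of the two.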

4 Studying Gradients and their Propagation

4.1 Effect of the Cost Function

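Writing $y$ for the predicted probability of the correct class, the cross-entropy cost is $-\log y$ and the quadratic cost is $(1 - y)^2$. They differ as $y \to 0$: the cross-entropy cost tends to $+\infty$, while the quadratic cost tends to only 1. My own understanding is that a larger loss on badly wrong predictions helps convergence.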
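The contrast is easy to see numerically. The sketch below assumes y is the predicted probability of the correct class, with cross-entropy cost -log(y) and quadratic cost (1 - y)^2; the function names are illustrative.

```python
import math

def cross_entropy(y):
    """Cost -log(y) for predicted probability y of the correct class."""
    return -math.log(y)

def quadratic(y):
    """Cost (1 - y)^2 for predicted probability y of the correct class."""
    return (1.0 - y) ** 2

for y in (0.9, 0.5, 0.1, 1e-3, 1e-6):
    print(f"y={y:<8g} cross-entropy={cross_entropy(y):8.4f}  quadratic={quadratic(y):.4f}")
```

As y approaches 0 the quadratic cost levels off just below 1, while the cross-entropy cost keeps growing.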

The authors construct a very simple network (the network function I imagine is

, with the label always equal to 0, so that the loss is 0 only when

and

):

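The plot shows that the cross-entropy cost has fewer plateaus. My picture is this: the loss surface is a hillside, and weight updates are a ball rolling over it until it drops into a pit, which is a local optimum. On a plateau the ball has no slope to keep it rolling, so the more plateaus there are, the worse training goes.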

4.2 Gradients at initialization

The section first describes the vanishing-gradient problem: gradients become smaller and smaller as they are propagated backward:

He studied networks with linear activation at each layer, finding that the variance of the back-propagated gradients decreases as we go backwards in the network.

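It then gives two backpropagation formulas, where $s^i$ is the pre-activation vector of layer $i$ and $z^{i+1} = f(s^i)$:

$$\frac{\partial Cost}{\partial s^i_k} = f'(s^i_k)\, W^{i+1}_{k,\bullet}\, \frac{\partial Cost}{\partial s^{i+1}}$$

$$\frac{\partial Cost}{\partial w^i_{l,k}} = z^i_l\, \frac{\partial Cost}{\partial s^i_k}$$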

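The formulas that follow are more involved. The derivation in the article "Notes on Understanding the difficulty of training deep feedforward neural networks" is quite good and worth consulting, so I will not repeat it here.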

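Some of these formulas are derived in that article and others are not; I will work through the remaining ones myself when I have time. Equation 7 in the paper follows from Equation 3.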

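The paper then states two conditions, one for forward propagation and one for back-propagation. Note that $n_i$ and $n_{i+1}$ are the sizes of two adjacent layers:

$$\forall i,\quad n_i \,\mathrm{Var}[W^i] = 1$$

$$\forall i,\quad n_{i+1} \,\mathrm{Var}[W^i] = 1$$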
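The effect of the forward condition can be illustrated with a small simulation of a purely linear network. This is a sketch under stated assumptions: all layers have the same width n, activations are linear, and the "standard" initialization is the paper's heuristic W ~ U[-1/sqrt(n), 1/sqrt(n)], which gives n * Var[W] = 1/3. The function name layer_variances and the sizes are chosen here for illustration.

```python
import math
import random

def layer_variances(n, depth, limit, rng):
    """Push one random vector through `depth` linear layers of width n.

    Weights are drawn from U[-limit, limit]. Returns the mean squared
    activation of the input and of each layer's output.
    """
    z = [rng.gauss(0.0, 1.0) for _ in range(n)]
    variances = [sum(v * v for v in z) / n]
    for _ in range(depth):
        z_next = []
        for _ in range(n):
            column = [rng.uniform(-limit, limit) for _ in range(n)]
            z_next.append(sum(zi * w for zi, w in zip(z, column)))
        z = z_next
        variances.append(sum(v * v for v in z) / n)
    return variances

n, depth = 300, 4
rng = random.Random(0)
standard = layer_variances(n, depth, 1.0 / math.sqrt(n), rng)          # n * Var[W] = 1/3
normalized = layer_variances(n, depth, math.sqrt(6.0 / (2 * n)), rng)  # n * Var[W] = 1

for i, (s, m) in enumerate(zip(standard, normalized)):
    print(f"layer {i}: standard={s:.4f}  normalized={m:.4f}")
```

With equal widths the normalized limit sqrt(6 / (n + n)) satisfies both conditions at once, so the activation scale stays roughly constant, while under the standard heuristic it shrinks by about a factor of 3 per layer.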

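For the step from the compromise

$$\mathrm{Var}[W^i] = \frac{2}{n_i + n_{i+1}}$$

to the normalized initialization

$$W \sim U\left[-\frac{\sqrt{6}}{\sqrt{n_j + n_{j+1}}},\ \frac{\sqrt{6}}{\sqrt{n_j + n_{j+1}}}\right],$$

see the article "Deep Learning: the Xavier initialization method" (shuzfan's column, CSDN blog).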
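The step between the variance target and the uniform bound is a short calculation. This sketch assumes only the standard variance of a zero-mean uniform distribution; the symbol $a$ for the half-width of the interval is introduced here for illustration and does not appear in the paper.

```latex
% Variance of a zero-mean uniform distribution on [-a, a]
\mathrm{Var}[W] = \frac{1}{2a}\int_{-a}^{a} w^2 \, dw = \frac{a^2}{3}

% Match it to the target variance 2 / (n_j + n_{j+1})
\frac{a^2}{3} = \frac{2}{n_j + n_{j+1}}
\quad\Longrightarrow\quad
a = \sqrt{\frac{6}{n_j + n_{j+1}}} = \frac{\sqrt{6}}{\sqrt{n_j + n_{j+1}}}
```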

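In the corresponding code, the function _initializer() first computes fan_in and fan_out, the numbers of input and output units. If mode == 'FAN_AVG', then n = (fan_in + fan_out) / 2.0. The uniform flag says whether to sample from a uniform distribution; if it is set: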

limit = math.sqrt(3.0 * factor / n)
return random_ops.random_uniform(shape, -limit, limit, dtype, seed=seed)

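With factor = 1 and n = (fan_in + fan_out) / 2, this limit equals $\sqrt{6}/\sqrt{n_j + n_{j+1}}$, which is exactly the paper's formula $W \sim U\left[-\sqrt{6}/\sqrt{n_j + n_{j+1}},\ \sqrt{6}/\sqrt{n_j + n_{j+1}}\right]$.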

If uniform is False, the code instead does:

trunc_stddev = math.sqrt(1.3 * factor / n)
return random_ops.truncated_normal(shape, 0.0, trunc_stddev, dtype, seed=seed)
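To see what the two branches produce, here is a pure-Python sketch that mimics them without TensorFlow. It is not the library code: variance_scaling_sample and empirical_variance are names invented here, only the FAN_AVG mode is covered, and the truncated normal is emulated by redrawing any sample that falls more than two standard deviations from the mean, which is how TensorFlow documents truncated_normal.

```python
import math
import random

def variance_scaling_sample(fan_in, fan_out, factor=1.0, uniform=True, rng=random):
    """Draw one weight with FAN_AVG scaling (illustrative, not the TensorFlow code)."""
    n = (fan_in + fan_out) / 2.0
    if uniform:
        limit = math.sqrt(3.0 * factor / n)
        return rng.uniform(-limit, limit)
    trunc_stddev = math.sqrt(1.3 * factor / n)
    while True:  # redraw until the sample lies within two standard deviations
        w = rng.gauss(0.0, trunc_stddev)
        if abs(w) <= 2.0 * trunc_stddev:
            return w

def empirical_variance(samples):
    mean = sum(samples) / len(samples)
    return sum((s - mean) ** 2 for s in samples) / len(samples)

rng = random.Random(0)
fan_in, fan_out = 300, 100
target = 2.0 / (fan_in + fan_out)   # the variance the paper's compromise asks for
for uniform in (True, False):
    ws = [variance_scaling_sample(fan_in, fan_out, uniform=uniform, rng=rng)
          for _ in range(50000)]
    print(f"uniform={uniform}: variance={empirical_variance(ws):.5f}  target={target:.5f}")
```

Both branches land close to the same target variance, 2 / (fan_in + fan_out).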

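The documentation of variance_scaling_initializer says that both uniform=True and uniform=False give Xavier initialization, which puzzled me. A plausible reading is that both branches aim for the same weight variance, factor / n: a uniform distribution on [-limit, limit] has variance limit^2 / 3 = factor / n, and the 1.3 multiplier roughly compensates for the variance that truncating the normal distribution removes.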

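One more point to consider: most networks today use ReLU, so the assumption $f'(s^i_k) \approx 1$ behind the derivation does not necessarily hold. The effect of this remains to be studied.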

The paper then shows the resulting plots (three in total; only one is reproduced here).

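Later the paper also discusses the singular values of the Jacobian (see "[Linear Algebra] An intuitive account of singular values and how they differ from eigenvalues" and "Singular value decomposition and its applications"):

$$J^i = \frac{\partial z^{i+1}}{\partial z^i}$$

The paper relates the average singular value to the ratio of activation variance between consecutive layers; the closer this ratio is to 1, the better the flow: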

When consecutive layers have the same dimension, the average singular value corresponds to the average ratio of infinitesimal volumes mapped from $z^i$ to $z^{i+1}$, as well as to the ratio of average activation variance going from $z^i$ to $z^{i+1}$.

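With the normalized initialization the average singular value is about 0.8, against about 0.5 for the standard initialization, so the normalized initialization is comparatively better for propagating gradients backward.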

4.3 Back-propagated Gradients During Learning

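The paper also points out that once training is under way, the theoretical analysis can no longer rest on simple variance calculations: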

In particular, we cannot use simple variance calculations in our theoretical analysis because the weights values are not anymore independent of the activation values and the linearity hypothesis is also violated.

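Figure 7 gives an intuitive picture of the gradient decay.

The top half of Figure 7 (standard initialization) shows the back-propagated gradients getting smaller as they travel down from Layer 5 to Layer 1, although this decay becomes less and less pronounced as training proceeds. With the normalized initialization proposed in the paper, there is no such decay.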

The paper also observes that, with standard initialization, the gradients with respect to the weights show no such decay:

What was initially really surprising is that even when the back-propagated gradients become smaller (standard initialization), the variance of the weights gradients is roughly constant across layers, as shown on Figure 8.

5 Error Curves and Conclusions

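Next, error rates are used to check the benefits of the strategies above.

The experimental results are shown in the paper's error-curve figures.

The following conclusions can be drawn:

Classical networks with sigmoid or hyperbolic tangent units and standard initialization have difficulty converging;
softsign is more robust to the initialization method than tanh, possibly because of its gentler non-linearity;
for tanh, normalized initialization is very helpful, possibly because the activations and gradients propagated from layer to layer are preserved (no decay in either the forward or the backward pass);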

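The paper also discusses some other methods (hard to follow; I will revisit them when I get the chance) and reaches these conclusions:

Monitoring activations and gradients across layers helps in understanding why deep networks are difficult to train;
the sigmoid should be avoided where possible, because it saturates the top hidden layer too early and yields poor learning dynamics;
keeping activations and gradients flowing well from layer to layer is very useful;
many of the experimental observations remain unexplained;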
