ch06 - Regularization and Normalization in PyTorch

0.Introduction
1.Weight decay and dropout
- 1.1.Regularization
- 1.2.Dropout
2.Normalization
- 2.1.Batch Normalization
- 2.2.Batch Normalization in PyTorch
  - 2.2.1.nn.BatchNorm1d()
  - 2.2.2.nn.BatchNorm2d()
  - 2.2.3.nn.BatchNorm3d()
- 2.3.Other common normalization methods
- 2.4.Layer Normalization(LN)
- 2.5.Instance Normalization(IN)
- 2.6.Group Normalization
- 2.7.Summary

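0.Introduction

1.Weight decay and dropout

This section introduces regularization and the bias-variance decomposition, as well as the L2 regularization term in PyTorch, known as weight decay.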

1.1.Regularization

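Regularization can be understood as a family of strategies for reducing variance. In machine learning, the expected generalization error can be decomposed into bias, variance and noise: error = bias² + variance + noise.

Bias measures how far the expected prediction of the learning algorithm deviates from the true result; it characterizes the fitting ability of the algorithm itself.
Variance measures how much the learned model's performance changes when a training set of the same size changes; it characterizes the effect of data perturbation.
Noise is the lower bound on the expected generalization error that any learning algorithm can achieve on the task.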

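For example:

Upper, dark red curve: validation loss
Lower, orange curve: training loss

As the figure shows, noise is usually ignored. Bias is then roughly the gap between the training error and the ideal error, and variance is the gap between the validation error and the training error.

A one-variable regression model that fits the training set very well but performs poorly on the test set is a typical case of overfitting, that is, of excessive variance.

Regularization is a way to deal with excessive variance.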
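The trade-off can be checked numerically. The sketch below is only an illustration under assumed settings (a true slope of 1.5, noise standard deviation 0.5, ten fixed inputs, and a penalty `lam * w**2`; none of these come from the figure above). It fits a one-parameter line many times on freshly sampled noisy data and estimates bias² and variance of the prediction at one point, with and without an L2 penalty.

```python
import random

random.seed(0)

TRUE_W = 1.5          # slope of the assumed true function y = 1.5 * x
NOISE_STD = 0.5       # standard deviation of the label noise
XS = [-1 + 2 * i / 9 for i in range(10)]  # 10 fixed inputs in [-1, 1]
X0 = 1.0              # point at which the prediction is evaluated


def fit_slope(ys, lam):
    """Least-squares slope through the origin with an L2 penalty lam * w**2."""
    sxy = sum(x * y for x, y in zip(XS, ys))
    sxx = sum(x * x for x in XS)
    return sxy / (sxx + lam)


def bias2_and_variance(lam, trials=2000):
    """Estimate squared bias and variance of the prediction at X0."""
    preds = []
    for _ in range(trials):
        ys = [TRUE_W * x + random.gauss(0.0, NOISE_STD) for x in XS]
        preds.append(fit_slope(ys, lam) * X0)
    mean_pred = sum(preds) / trials
    variance = sum((p - mean_pred) ** 2 for p in preds) / trials
    bias2 = (mean_pred - TRUE_W * X0) ** 2
    return bias2, variance


bias2_plain, var_plain = bias2_and_variance(lam=0.0)
bias2_l2, var_l2 = bias2_and_variance(lam=4.0)
print("no penalty: bias^2=%.4f variance=%.4f" % (bias2_plain, var_plain))
print("L2 penalty: bias^2=%.4f variance=%.4f" % (bias2_l2, var_l2))
```

The penalized fit should show a clearly smaller variance at the price of a larger bias, which is the sense in which regularization "reduces variance".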

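In the figure below, the colored curves are the contours of the cost: points of the same color have the same cost. L1 is on the left and L2 on the right; the black square and the black circle are the contours of the respective regularization terms.

There are two common regularization terms, L1 and L2. The L2 regularization term is also called weight decay.

In the L2 objective $Obj = Cost + \frac{\lambda}{2}\sum_{i} w_i^2$, the cost can be any loss function, $\lambda$ is a hyperparameter, and the factor 1/2 cancels the coefficient 2 produced by differentiation.

$\lambda$ is usually taken in (0, 1), so the factor $(1-\lambda)$ shrinks $w_i$ at every update. This is why the method is called weight decay.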

Without a regularization term: $Obj = Loss$, and $w_{i+1} = w_{i} - \frac{\partial Obj}{\partial w_{i}} = w_{i} - \frac{\partial Loss}{\partial w_{i}}$.
With the L2 regularization term: $Obj = Loss + \frac{\lambda}{2} \sum_{i}^{N} w_{i}^{2}$, and $w_{i+1} = w_{i} - \frac{\partial Obj}{\partial w_{i}} = w_{i} - \left(\frac{\partial Loss}{\partial w_{i}} + \lambda w_{i}\right) = w_{i}(1-\lambda) - \frac{\partial Loss}{\partial w_{i}}$, where $0 < \lambda < 1$, so the weights decay. (The learning rate is omitted in these formulas, i.e. taken to be 1.)

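In PyTorch, the L2 regularization term is implemented in the optimizer: the weight_decay argument passed when constructing the optimizer corresponds to $\lambda$ in the formula.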
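As a quick sanity check of this correspondence, the following sketch (with made-up data and a hypothetical single weight tensor) takes one plain SGD step in two ways: once with `weight_decay=lam` in the optimizer, and once with the penalty $\frac{\lambda}{2}\sum w^2$ added to the loss by hand. For plain SGD the two updates should coincide; other optimizers may treat weight decay differently.

```python
import torch

torch.manual_seed(0)
lam, lr = 0.01, 0.1
x, y = torch.randn(8, 3), torch.randn(8, 1)   # toy data, for illustration only

# Two copies of the same initial weights.
w_a = torch.randn(3, 1, requires_grad=True)
w_b = w_a.detach().clone().requires_grad_(True)

# (a) L2 term handled by the optimizer through weight_decay.
opt_a = torch.optim.SGD([w_a], lr=lr, weight_decay=lam)
loss_a = ((x @ w_a - y) ** 2).mean()
opt_a.zero_grad()
loss_a.backward()
opt_a.step()

# (b) L2 term added to the loss by hand, optimizer without weight decay.
opt_b = torch.optim.SGD([w_b], lr=lr)
loss_b = ((x @ w_b - y) ** 2).mean() + lam / 2 * (w_b ** 2).sum()
opt_b.zero_grad()
loss_b.backward()
opt_b.step()

print(torch.allclose(w_a, w_b, atol=1e-6))
```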

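The code below compares an optimizer without weight decay and one with weight_decay = 0.01 on a linear-regression dataset. The model is a fully connected network with three hidden layers, and TensorBoard is used to visualize how the weights of each layer change. The code is as follows: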

import torch
import torch.nn as nn
import matplotlib.pyplot as plt
from common_tools import set_seed  # set_seed comes from the author's local common_tools module (not shown)
from tensorboardX import SummaryWriter

set_seed(1)  # set the random seed
n_hidden = 200
max_iter = 2000
disp_interval = 200
lr_init = 0.01

# ============================ step 1/5 data ============================
def gen_data(num_data=10, x_range=(-1, 1)):
    w = 1.5
    train_x = torch.linspace(*x_range, num_data).unsqueeze_(1)
    train_y = w*train_x + torch.normal(0, 0.5, size=train_x.size())
    test_x = torch.linspace(*x_range, num_data).unsqueeze_(1)
    test_y = w*test_x + torch.normal(0, 0.3, size=test_x.size())
    return train_x, train_y, test_x, test_y

train_x, train_y, test_x, test_y = gen_data(x_range=(-1, 1))

# ============================ step 2/5 model ============================
class MLP(nn.Module):
    def __init__(self, neural_num):
        super(MLP, self).__init__()
        self.linears = nn.Sequential(
            nn.Linear(1, neural_num),
            nn.ReLU(inplace=True),
            nn.Linear(neural_num, neural_num),
            nn.ReLU(inplace=True),
            nn.Linear(neural_num, neural_num),
            nn.ReLU(inplace=True),
            nn.Linear(neural_num, 1),
        )

    def forward(self, x):
        return self.linears(x)

net_normal = MLP(neural_num=n_hidden)
net_weight_decay = MLP(neural_num=n_hidden)

# ============================ step 3/5 optimizer ============================
optim_normal = torch.optim.SGD(net_normal.parameters(), lr=lr_init, momentum=0.9)
optim_wdecay = torch.optim.SGD(net_weight_decay.parameters(), lr=lr_init, momentum=0.9, weight_decay=1e-2)

# ============================ step 4/5 loss function ============================
loss_func = torch.nn.MSELoss()

# ============================ step 5/5 training loop ============================
writer = SummaryWriter(comment='_test_tensorboard', filename_suffix="12345678")
for epoch in range(max_iter):

    # forward
    pred_normal, pred_wdecay = net_normal(train_x), net_weight_decay(train_x)
    loss_normal, loss_wdecay = loss_func(pred_normal, train_y), loss_func(pred_wdecay, train_y)

    optim_normal.zero_grad()
    optim_wdecay.zero_grad()

    loss_normal.backward()
    loss_wdecay.backward()

    optim_normal.step()
    optim_wdecay.step()

    if (epoch+1) % disp_interval == 0:

        # visualization: log gradients and weights of every layer
        for name, layer in net_normal.named_parameters():
            writer.add_histogram(name + '_grad_normal', layer.grad, epoch)
            writer.add_histogram(name + '_data_normal', layer, epoch)

        for name, layer in net_weight_decay.named_parameters():
            writer.add_histogram(name + '_grad_weight_decay', layer.grad, epoch)
            writer.add_histogram(name + '_data_weight_decay', layer, epoch)

        test_pred_normal, test_pred_wdecay = net_normal(test_x), net_weight_decay(test_x)

        # plotting
        plt.scatter(train_x.data.numpy(), train_y.data.numpy(), c='blue', s=50, alpha=0.3, label='train')
        plt.scatter(test_x.data.numpy(), test_y.data.numpy(), c='red', s=50, alpha=0.3, label='test')
        plt.plot(test_x.data.numpy(), test_pred_normal.data.numpy(), 'r-', lw=3, label='no weight decay')
        plt.plot(test_x.data.numpy(), test_pred_wdecay.data.numpy(), 'b--', lw=3, label='weight decay')
        plt.text(-0.25, -1.5, 'no weight decay loss={:.6f}'.format(loss_normal.item()), fontdict={'size': 15, 'color': 'red'})
        plt.text(-0.25, -2, 'weight decay loss={:.6f}'.format(loss_wdecay.item()), fontdict={'size': 15, 'color': 'red'})

        plt.ylim((-2.5, 2.5))
        plt.legend(loc='upper left')
        plt.title("Epoch: {}".format(epoch+1))
        plt.show()
        plt.close()

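After 2000 epochs of training, the two models look as follows:

The model trained with weight decay has a higher loss on the training set, but its curve is smoother and it generalizes better.

Below is the TensorBoard analysis. First look at how the weights change without weight decay. The weights of the first layer change as follows:

Without the L2 term, the weight distribution barely changes and stays between -1 and 1 (the axis on the right is the iteration count, 2000 in total). With the L2 term, the weights gradually shrink.

With weight decay, the weight distribution moves closer and closer to zero as the number of training iterations increases. This is the effect of L2 regularization: it constrains the weights to stay near 0.

The weights of the second layer change as follows:

Implementation of weight decay in the optimizer. Because weight decay is a parameter of the optimizer, the gradient with weight decay applied is computed when optim_wdecay.step() is called. The relevant code of the SGD optimizer is: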

def step(self, closure=None):
    """Performs a single optimization step.

    Arguments:
        closure (callable, optional): A closure that reevaluates the model
            and returns the loss.
    """
    loss = None
    if closure is not None:
        loss = closure()

    for group in self.param_groups:
        weight_decay = group['weight_decay']
        momentum = group['momentum']
        dampening = group['dampening']
        nesterov = group['nesterov']

        for p in group['params']:
            if p.grad is None:
                continue
            d_p = p.grad.data
            if weight_decay != 0:
                d_p.add_(weight_decay, p.data)
            # ......... (elided in the original excerpt)
            p.data.add_(-group['lr'], d_p)

Here d_p is the computed gradient. If weight_decay is not 0, it is updated as $d\_p = d\_p + weight\_decay \times p.data$, which corresponds to the term $\left(\frac{\partial Loss}{\partial w_{i}} + \lambda w_{i}\right)$ in the formula. The last line updates the weights using this gradient.
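The two lines that matter in the excerpt can be reproduced with plain numbers. The sketch below uses a hypothetical helper `sgd_step` (not part of PyTorch) for a single scalar weight, leaves out momentum, and includes the learning rate that the formulas in this section omit. It shows that the update can equivalently be read as shrinking the weight by a factor $(1 - lr \cdot \lambda)$ and then following the loss gradient.

```python
def sgd_step(w, grad, lr, weight_decay):
    """One plain SGD step (no momentum) with L2 weight decay."""
    d_p = grad + weight_decay * w      # d_p = dLoss/dw + lambda * w
    return w - lr * d_p                # w <- w - lr * d_p


w, grad, lr, lam = 2.0, 0.5, 0.1, 0.01   # illustrative values

stepped = sgd_step(w, grad, lr, lam)
# The same update written as "shrink the weight, then move along the loss gradient".
shrunk = w * (1 - lr * lam) - lr * grad

print(stepped, shrunk)
```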

1.2.Dropout

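Dropout: during the training of a deep network, units of the network are temporarily dropped with a given probability. Note the word temporarily: with stochastic gradient descent, because units are dropped at random, every mini-batch effectively trains a different network.

Dropout changes the scale of the data. In the original formulation of dropout, with dropout_prob = 0.3 the data scale during training becomes 70% of the original. At test time, after model.eval() has been called, dropout is switched off, so all weights have to be multiplied by (1-dropout_prob) to bring the data scale down to 70% as well. (PyTorch avoids this test-time step by rescaling during training instead, as discussed at the end of this section.)
Dropout: random deactivation
Change of data scale: at test time, all weights are multiplied by 1-drop_prob
drop_prob = 0.3, 1-drop_prob = 0.7
Random: dropout probability
Deactivation: weight = 0

The Dropout layer in PyTorch is shown below. It is usually placed in front of the layer whose inputs it should drop: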

torch.nn.Dropout(p=0.5, inplace=False)

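Parameters:

p: the probability that an element is dropped, also called the deactivation probability

Note that in this scheme all weights must be multiplied by 1-drop_prob at test time, so that the data scale is the same during training and testing. For example, suppose the layer before a fully connected layer has 100 neurons. For one neuron of the fully connected layer there are then 100 weights w_i multiplied by 100 inputs x_i. If every product w_i x_i has scale 1, the neuron has scale 100. With dropout, 30% of the inputs are deactivated, so the scale drops from 100 to 70 during training. At test time no neuron is deactivated, so every weight is multiplied by 1-drop_prob, i.e. 70%, which makes the scale at test time 70 as well.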
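The 100-versus-70 argument can be simulated directly. The sketch below is a pure-Python illustration with unit weights and unit inputs (the helper names are hypothetical). It compares the scheme described above, which rescales the weights at test time, with the "inverted" variant that rescales the kept activations during training, which is what PyTorch's nn.Dropout does.

```python
import random

random.seed(0)
N, P_DROP, TRIALS = 100, 0.3, 2000   # 100 inputs; all weights and inputs equal 1


def neuron_output(train, inverted):
    """Weighted sum of N unit inputs with unit weights under dropout."""
    keep = 1.0 - P_DROP
    if not train:
        # Original scheme: weights are multiplied by keep at test time.
        # Inverted scheme: nothing has to be done at test time.
        return N * (1.0 if inverted else keep)
    scale = 1.0 / keep if inverted else 1.0
    return sum(scale for _ in range(N) if random.random() < keep)


def mean_train_output(inverted):
    """Average training-mode output over many random dropout masks."""
    return sum(neuron_output(True, inverted) for _ in range(TRIALS)) / TRIALS


orig_train = mean_train_output(inverted=False)
inv_train = mean_train_output(inverted=True)
print("original: train ~ %.1f, test = %.1f" % (orig_train, neuron_output(False, False)))
print("inverted: train ~ %.1f, test = %.1f" % (inv_train, neuron_output(False, True)))
```

In both variants the training and test scales agree on average (about 70 in the original scheme, about 100 in the inverted one); they differ only in where the rescaling is paid for.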

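In PyTorch, dropout is implemented by the nn.Dropout layer. nn.Dropout inherits from _DropoutNd, which in turn inherits from Module.

The experiment below again uses the linear-regression example. Both networks are fully connected networks with three hidden layers, with a dropout layer in front of every linear layer after the first. One network uses a dropout probability of 0, the other 0.5, and TensorBoard is used to visualize how the weights of each layer change. The code is as follows: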

import torch
import torch.nn as nn
import matplotlib.pyplot as plt
from common_tools import set_seed  # set_seed comes from the author's local common_tools module (not shown)
from tensorboardX import SummaryWriter

set_seed(1)  # set the random seed
n_hidden = 200
max_iter = 2000
disp_interval = 400
lr_init = 0.01

# ============================ step 1/5 data ============================
def gen_data(num_data=10, x_range=(-1, 1)):
    w = 1.5
    train_x = torch.linspace(*x_range, num_data).unsqueeze_(1)
    train_y = w*train_x + torch.normal(0, 0.5, size=train_x.size())
    test_x = torch.linspace(*x_range, num_data).unsqueeze_(1)
    test_y = w*test_x + torch.normal(0, 0.3, size=test_x.size())
    return train_x, train_y, test_x, test_y

train_x, train_y, test_x, test_y = gen_data(x_range=(-1, 1))

# ============================ step 2/5 model ============================
class MLP(nn.Module):
    def __init__(self, neural_num, d_prob=0.5):
        super(MLP, self).__init__()
        self.linears = nn.Sequential(
            nn.Linear(1, neural_num),
            nn.ReLU(inplace=True),

            nn.Dropout(d_prob),
            nn.Linear(neural_num, neural_num),
            nn.ReLU(inplace=True),

            nn.Dropout(d_prob),
            nn.Linear(neural_num, neural_num),
            nn.ReLU(inplace=True),

            nn.Dropout(d_prob),
            nn.Linear(neural_num, 1),
        )

    def forward(self, x):
        return self.linears(x)

net_prob_0 = MLP(neural_num=n_hidden, d_prob=0.)
net_prob_05 = MLP(neural_num=n_hidden, d_prob=0.5)

# ============================ step 3/5 optimizer ============================
optim_normal = torch.optim.SGD(net_prob_0.parameters(), lr=lr_init, momentum=0.9)
optim_reglar = torch.optim.SGD(net_prob_05.parameters(), lr=lr_init, momentum=0.9)

# ============================ step 4/5 loss function ============================
loss_func = torch.nn.MSELoss()

# ============================ step 5/5 training loop ============================
writer = SummaryWriter(comment='_test_tensorboard', filename_suffix="12345678")
for epoch in range(max_iter):

    pred_normal, pred_reglar = net_prob_0(train_x), net_prob_05(train_x)
    loss_normal, loss_reglar = loss_func(pred_normal, train_y), loss_func(pred_reglar, train_y)

    optim_normal.zero_grad()
    optim_reglar.zero_grad()

    loss_normal.backward()
    loss_reglar.backward()

    optim_normal.step()
    optim_reglar.step()

    if (epoch+1) % disp_interval == 0:

        # switch to eval mode so that dropout is disabled for the test predictions
        net_prob_0.eval()
        net_prob_05.eval()

        # visualization: log gradients and weights of every layer
        for name, layer in net_prob_0.named_parameters():
            writer.add_histogram(name + '_grad_normal', layer.grad, epoch)
            writer.add_histogram(name + '_data_normal', layer, epoch)

        for name, layer in net_prob_05.named_parameters():
            writer.add_histogram(name + '_grad_regularization', layer.grad, epoch)
            writer.add_histogram(name + '_data_regularization', layer, epoch)

        test_pred_prob_0, test_pred_prob_05 = net_prob_0(test_x), net_prob_05(test_x)

        # plotting
        plt.scatter(train_x.data.numpy(), train_y.data.numpy(), c='blue', s=50, alpha=0.3, label='train')
        plt.scatter(test_x.data.numpy(), test_y.data.numpy(), c='red', s=50, alpha=0.3, label='test')
        plt.plot(test_x.data.numpy(), test_pred_prob_0.data.numpy(), 'r-', lw=3, label='d_prob_0')
        plt.plot(test_x.data.numpy(), test_pred_prob_05.data.numpy(), 'b--', lw=3, label='d_prob_05')
        plt.text(-0.25, -1.5, 'd_prob_0 loss={:.8f}'.format(loss_normal.item()), fontdict={'size': 15, 'color': 'red'})
        plt.text(-0.25, -2, 'd_prob_05 loss={:.6f}'.format(loss_reglar.item()), fontdict={'size': 15, 'color': 'red'})

        plt.ylim((-2.5, 2.5))
        plt.legend(loc='upper left')
        plt.title("Epoch: {}".format(epoch+1))
        plt.show()
        plt.close()

        # switch back to training mode
        net_prob_0.train()
        net_prob_05.train()

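After 2000 iterations, the fitted curves are as follows:

We use TensorBoard to look at the weights of the third layer. For dropout = 0 and dropout = 0.5 they change as follows:

With dropout, the weights are more concentrated around 0, so the neurons do not depend too strongly on one another.

model.eval() and model.train()

Some layers behave differently in training and in testing. The dropout layer is one of them: during training it is active and changes the scale of the data, so to keep the scale unchanged PyTorch divides the retained activations by 1-p. During testing the dropout layer is switched off. Therefore call model.eval() before testing, which sets the training attribute of every layer to False, and call model.train() before training, which sets it to True.

The following compares the output of a dropout layer in eval mode and in train mode.

First build a single fully connected layer with 10000 inputs and 1 output, all weights set to 1 and dropout set to 0.5. The input is a vector of ones. The code below evaluates the network in train mode and in eval mode: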

import torch
import torch.nn as nn

class Net(nn.Module):
    def __init__(self, neural_num, d_prob=0.5):
        super(Net, self).__init__()
        self.linears = nn.Sequential(
            nn.Dropout(d_prob),
            nn.Linear(neural_num, 1, bias=False),
            nn.ReLU(inplace=True)
        )

    def forward(self, x):
        return self.linears(x)

input_num = 10000
x = torch.ones((input_num, ), dtype=torch.float32)

net = Net(input_num, d_prob=0.5)
net.linears[1].weight.detach().fill_(1.)

net.train()
y = net(x)
print("output in training mode", y)

net.eval()
y = net(x)
print("output in eval mode", y)

Output:

output in training mode tensor([10030.], grad_fn=<ReluBackward0>)
output in eval mode tensor([10000.], grad_fn=<ReluBackward0>)

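During training, since the dropout probability is 0.5, the output would in theory be about 5000. However, in training mode the dropout layer divides the retained activations by 1-p = 0.5, i.e. multiplies them by 2, so the output in train mode is a number close to 10000 (the fluctuation comes from the randomness of dropout). In eval mode dropout is switched off, so the output is exactly 10000. Because the rescaling is done during training, nothing has to be rescaled at test time, which speeds up inference.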
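The rescaling can also be seen element by element. The following minimal sketch, assuming a standalone nn.Dropout layer with p = 0.5 and an input of ones, checks that every training-mode output is either 0 (dropped) or 2 (kept and scaled by 1/(1-p)), and that the layer is the identity in eval mode.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)
x = torch.ones(1000)

drop.train()                      # sets drop.training = True
y_train = drop(x)

drop.eval()                       # sets drop.training = False
y_eval = drop(x)

# In train mode each element is either dropped (0) or scaled by 1 / (1 - p) = 2.
print(sorted(set(y_train.tolist())))
# In eval mode the layer passes its input through unchanged.
print(torch.equal(y_eval, x))
```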

2.Normalization

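This section introduces the concept of Batch Normalization and the 1d/2d/3d Batch Normalization implementations in PyTorch.

2.1.Batch Normalization

Batch Normalization normalizes a batch of data. "Batch" refers to a batch of data, usually a mini-batch; "normalization" means that the processed data has zero mean and unit variance.

The advantages of batch normalization are:

A larger learning rate can be used, which speeds up convergence
Weight initialization no longer has to be designed carefully
Dropout can be omitted or made smaller
L2 regularization can be omitted or the weight decay made smaller
LRN (local response normalization) is no longer needed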

Suppose the input mini-batch is $\mathcal{B}=\left\{x_{1 \dots m}\right\}$ and the learnable parameters of Batch Normalization are $\gamma$ and $\beta$. The steps are:

Mini-batch mean: $\mu_{\mathcal{B}} \leftarrow \frac{1}{m} \sum_{i=1}^{m} x_{i}$
Mini-batch variance: $\sigma_{\mathcal{B}}^{2} \leftarrow \frac{1}{m} \sum_{i=1}^{m}\left(x_{i}-\mu_{\mathcal{B}}\right)^{2}$
Normalization: $\widehat{x}_{i} \leftarrow \frac{x_{i}-\mu_{\mathcal{B}}}{\sqrt{\sigma_{\mathcal{B}}^{2}+\epsilon}}$, where $\epsilon$ is a small constant that prevents the denominator from being 0
Affine transform (scale and shift): $y_{i} \leftarrow \gamma \widehat{x}_{i}+\beta \equiv \mathrm{BN}_{\gamma, \beta}\left(x_{i}\right)$. This step increases the capacity of the model: it lets the model decide for itself whether to normalize the data and to what degree. If $\gamma=\sqrt{\sigma_{\mathcal{B}}^{2}+\epsilon}$ and $\beta=\mu_{\mathcal{B}}$, the layer implements the identity mapping.
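The four steps translate almost line by line into code. The sketch below is a pure-Python illustration for a mini-batch of scalars (the function `batch_norm` is a hypothetical helper, not the PyTorch implementation). It also checks the identity-mapping remark: choosing $\gamma=\sqrt{\sigma_{\mathcal{B}}^{2}+\epsilon}$ and $\beta=\mu_{\mathcal{B}}$ gives back the input.

```python
import math


def batch_norm(xs, gamma, beta, eps=1e-5):
    """Apply the four batch-norm steps to a mini-batch of scalars."""
    m = len(xs)
    mu = sum(xs) / m                                        # step 1: batch mean
    var = sum((x - mu) ** 2 for x in xs) / m                # step 2: batch variance (biased)
    x_hat = [(x - mu) / math.sqrt(var + eps) for x in xs]   # step 3: normalize
    y = [gamma * v + beta for v in x_hat]                   # step 4: scale and shift
    return y, mu, var


batch = [1.0, 2.0, 4.0, 9.0]   # illustrative values
eps = 1e-5

# gamma = 1, beta = 0: the output has zero mean and (almost exactly) unit variance.
y, mu, var = batch_norm(batch, gamma=1.0, beta=0.0, eps=eps)
print(y)

# gamma = sqrt(var + eps), beta = mu: the layer reproduces its input.
identity, _, _ = batch_norm(batch, gamma=math.sqrt(var + eps), beta=mu, eps=eps)
print(identity)
```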

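BN was originally proposed to address Internal Covariate Shift (ICS): the distribution of each layer's inputs keeps changing during training, and when the scale of the data grows or shrinks as it passes through many layers, deep networks suffer from exploding or vanishing gradients. As a side effect it also brings the advantages listed above (one "can use" and four "can do without").

Reference: Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

During training the data passes through many layers. If its scale changes during forward propagation, gradients may explode or vanish, which makes the model hard to train.

The Batch Normalization layer is usually placed right before the activation function.

The code below prints the standard deviation of the output of every layer of a network. Without weight initialization (and without bn), the scale of the data becomes smaller and smaller.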

import torch
import numpy as np
import torch.nn as nn
from common_tools import set_seed  # set_seed comes from the author's local common_tools module (not shown)

set_seed(1)  # set the random seed

class MLP(nn.Module):
    def __init__(self, neural_num, layers=100):
        super(MLP, self).__init__()
        self.linears = nn.ModuleList([nn.Linear(neural_num, neural_num, bias=False) for i in range(layers)])
        self.bns = nn.ModuleList([nn.BatchNorm1d(neural_num) for i in range(layers)])
        self.neural_num = neural_num

    def forward(self, x):
        for (i, linear), bn in zip(enumerate(self.linears), self.bns):
            x = linear(x)
            # x = bn(x)    # uncomment to enable batch normalization
            x = torch.relu(x)

            if torch.isnan(x.std()):
                print("output is nan in {} layers".format(i))
                break

            print("layers:{}, std:{}".format(i, x.std().item()))

        return x

    def initialize(self):
        for m in self.modules():
            if isinstance(m, nn.Linear):
                # method 1
                # nn.init.normal_(m.weight.data, std=1)    # normal: mean=0, std=1

                # method 2 kaiming
                nn.init.kaiming_normal_(m.weight.data)

neural_nums = 256
layer_nums = 100
batch_size = 16

net = MLP(neural_nums, layer_nums)
# net.initialize()    # uncomment to use Kaiming initialization

inputs = torch.randn((batch_size, neural_nums))  # normal: mean=0, std=1

output = net(inputs)
print(output)

Output (the values shown here match the run with x = bn(x) enabled and without net.initialize(), which is discussed below):

layers:0, std:0.5751240849494934
...
layers:97, std:0.5823422074317932
layers:98, std:0.5775457620620728
layers:99, std:0.5755460262298584
tensor([[0.0000, 0.0000, 0.0000,  ..., 0.4919, 1.3902, 0.0000],
        [0.0000, 0.1692, 0.6125,  ..., 0.0000, 0.6123, 0.0710],
        [0.0000, 0.3392, 0.6378,  ..., 0.3588, 0.0000, 0.3263],
        ...,
        [0.0000, 0.5380, 0.0000,  ..., 0.0000, 1.0141, 1.3558],
        [0.0000, 0.0000, 0.0000,  ..., 0.0000, 0.0000, 0.6721],
        [0.0000, 0.0000, 0.0000,  ..., 2.3726, 0.0000, 0.0000]],
       grad_fn=<ReluBackward0>)

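With nn.init.kaiming_normal_() initialization, the standard deviation of the data stays within [0.6, 0.9].

If we do not initialize the weights but put a bn layer before every activation function, the standard deviation stays within [0.58, 0.59]. Batch Normalization therefore removes the need for carefully designed weight initialization.

Next, using the LeNet from the RMB binary-classification experiment as an example, we add bn layers and compare the training process of the network with and without bn layers.

Without any specific weight initialization:

Without bn layers but with Kaiming weight initialization, the training process is as follows:

The training loss spikes in the middle of training, so training is not very stable.

LeNet with bn layers is defined as follows: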

import torch.nn as nn
import torch.nn.functional as F

class LeNet_bn(nn.Module):
    def __init__(self, classes):
        super(LeNet_bn, self).__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.bn1 = nn.BatchNorm2d(num_features=6)

        self.conv2 = nn.Conv2d(6, 16, 5)
        self.bn2 = nn.BatchNorm2d(num_features=16)

        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.bn3 = nn.BatchNorm1d(num_features=120)

        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, classes)

    def forward(self, x):
        # conv -> bn -> relu -> pool, twice
        out = self.conv1(x)
        out = self.bn1(out)
        out = F.relu(out)
        out = F.max_pool2d(out, 2)

        out = self.conv2(out)
        out = self.bn2(out)
        out = F.relu(out)
        out = F.max_pool2d(out, 2)

        # flatten, then fc -> bn -> relu
        out = out.view(out.size(0), -1)

        out = self.fc1(out)
        out = self.bn3(out)
        out = F.relu(out)

        out = F.relu(self.fc2(out))
        out = self.fc3(out)
        return out

With bn layers and without Kaiming initialization, the training process is as follows:

The training loss still spikes during training, but only up to about 0.4, so training is much more stable.

2.2.Batch Normalization in PyTorch

PyTorch provides 3 Batch Normalization classes:

nn.BatchNorm1d(): the input has shape $B \times C \times 1D\_feature$
nn.BatchNorm2d(): the input has shape $B \times C \times 2D\_feature$
nn.BatchNorm3d(): the input has shape $B \times C \times 3D\_feature$

Taking nn.BatchNorm1d() as an example:

torch.nn.BatchNorm1d(num_features, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)

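Parameters:

num_features: the number of features of one sample; this is the most important parameter
eps: the correction term added to the denominator during normalization
momentum: the momentum of the exponential moving average used to estimate the running mean and variance
affine: whether to apply the affine transform; True by default
track_running_stats: whether to keep running estimates of the mean and variance. With True (the default), the running estimates are updated from every mini-batch in training mode and used as fixed statistics in eval mode. With False, no running estimates are kept and the batch statistics are used in both modes

Main attributes:

running_mean: the mean
running_var: the variance
weight: $\gamma$ in the affine transform
bias: $\beta$ in the affine transform

During training, each mini-batch is normalized with its own mean and variance, while running_mean and running_var are updated as exponential moving averages, so they take into account not only the current mini-batch but also the previous ones.

During testing, the mean and variance are fixed to the current running statistics.

Every bn layer computes the 4 attributes above per feature dimension; see the examples below.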
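The difference between the two modes can be made concrete with a tiny example. The sketch below assumes a single feature and a batch of three values (1, 2, 3), both chosen purely for illustration. In training mode the output is normalized with the batch statistics and the running estimates are updated once; in eval mode the same input is normalized with the running estimates instead. Note that PyTorch updates running_var with the unbiased batch variance.

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm1d(num_features=1, momentum=0.3)
x = torch.tensor([[1.0], [2.0], [3.0]])      # batch mean 2, unbiased variance 1

bn.train()
y_train = bn(x)                               # normalized with the batch statistics
print(bn.running_mean, bn.running_var)        # running estimates after one update

bn.eval()
y_eval = bn(x)                                # normalized with the running estimates
print(y_train.flatten(), y_eval.flatten())
```

Here running_mean becomes 0.7 × 0 + 0.3 × 2 = 0.6 and running_var becomes 0.7 × 1 + 0.3 × 1 = 1.0, so the eval-mode output is approximately x - 0.6 rather than a zero-mean vector.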

2.2.1.nn.BatchNorm1d()

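The input has shape $B \times C \times 1D\_feature$. In the example below the data has shape (3, 5, 1): a mini-batch of 3 samples, each with 5 features, each feature of dimension 1. Thus 5 means and variances are computed, one per feature. momentum is set to 0.3, and the running mean and variance are initialized to 0 and 1. The same mini-batch is fed in twice.

The data is shown in the figure below:

nn.BatchNorm1d input = B x number of features x 1d feature (B is the number of samples in the batch)
As shown in the figure, the number of samples is 3 (B = 3) and the number of features is 5, so the input is 3×5×1

The code is as follows: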

import torch
import torch.nn as nn

batch_size = 3
num_features = 5
momentum = 0.3

features_shape = (1,)

feature_map = torch.ones(features_shape)                                                    # 1D
feature_maps = torch.stack([feature_map*(i+1) for i in range(num_features)], dim=0)         # 2D
feature_maps_bs = torch.stack([feature_maps for i in range(batch_size)], dim=0)             # 3D
print("input data:\n{} shape is {}".format(feature_maps_bs, feature_maps_bs.shape))

bn = nn.BatchNorm1d(num_features=num_features, momentum=momentum)

# hand-computed running statistics of the second feature (batch mean 2, batch variance 0)
running_mean, running_var = 0, 1
mean_t, var_t = 2, 0
for i in range(2):
    outputs = bn(feature_maps_bs)

    print("\niteration:{}, running mean: {} ".format(i, bn.running_mean))
    print("iteration:{}, running var:{} ".format(i, bn.running_var))

    running_mean = (1 - momentum) * running_mean + momentum * mean_t
    running_var = (1 - momentum) * running_var + momentum * var_t

    print("iteration:{}, running mean of the second feature: {} ".format(i, running_mean))
    print("iteration:{}, running var of the second feature:{}".format(i, running_var))

Output:

input data:
tensor([[[1.],
         [2.],
         [3.],
         [4.],
         [5.]],
        [[1.],
         [2.],
         [3.],
         [4.],
         [5.]],
        [[1.],
         [2.],
         [3.],
         [4.],
         [5.]]]) shape is torch.Size([3, 5, 1])
iteration:0, running mean: tensor([0.3000, 0.6000, 0.9000, 1.2000, 1.5000])
iteration:0, running var:tensor([0.7000, 0.7000, 0.7000, 0.7000, 0.7000])
iteration:0, running mean of the second feature: 0.6
iteration:0, running var of the second feature:0.7
iteration:1, running mean: tensor([0.5100, 1.0200, 1.5300, 2.0400, 2.5500])
iteration:1, running var:tensor([0.4900, 0.4900, 0.4900, 0.4900, 0.4900])
iteration:1, running mean of the second feature: 1.02
iteration:1, running var of the second feature:0.48999999999999994

Although the two mini-batches are identical, the running mean and variance of the bn layer differ between the two iterations. Take the mean of the second feature, whose values are all 2, as an example.

First update of the running mean: $running\_mean = (1-momentum) \times pre\_running\_mean + momentum \times mean\_t = (1-0.3) \times 0 + 0.3 \times 2 = 0.6$
Second update of the running mean: $running\_mean = (1-momentum) \times pre\_running\_mean + momentum \times mean\_t = (1-0.3) \times 0.6 + 0.3 \times 2 = 1.02$

Before the network performs any forward pass, the attributes of the bn layer can be inspected at a breakpoint:

The input data above is the same as in the figure: 1, 2, 3, 4, 5. What matters is the computation of the mean and variance below.
The running mean of the first feature is clearly not 1. Why? The mean of 1, 1, 1 should be 1, but momentum = 0.3 is applied.
According to the formula: running_mean = (1-momentum) * pre_running_mean + momentum * mean_t
In the first iteration there is no previous running mean (pre_running_mean), so an initial value is used, which is 0 by default.
The current mean is mean_t = (1+1+1)/3 = 1, hence running_mean = (1-0.3)x0 + 0.3x1 = 0.3;
similarly, the second feature (three 2s) gives 0.6,
and the third feature (three 3s) gives 0.9.

In the second iteration: momentum = 0.3 and mean_t = (1+1+1)/3 = 1 (the input of the first feature has not changed),
pre_running_mean = 0.3,
running_mean = (1-0.3)x0.3 + 0.3x1 = 0.51.
Similarly, the second feature (three 2s) gives (1-0.3)x0.6 + 0.3x2 = 1.02,
and the third feature (three 3s) gives (1-0.3)x0.9 + 0.3x3 = 1.53.
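The walkthrough above can be replayed for all five features at once. The sketch below is plain Python with a hypothetical helper `update_running`; it applies the moving-average formula twice to the per-feature batch means 1 to 5, starting from 0, and should reproduce the running_mean values printed by the bn layer.

```python
def update_running(prev, batch_stat, momentum):
    """Exponential moving average used for the running statistics."""
    return (1 - momentum) * prev + momentum * batch_stat


momentum = 0.3
batch_means = [1.0, 2.0, 3.0, 4.0, 5.0]   # per-feature mean of each mini-batch
running = [0.0] * 5                        # running_mean starts at 0

history = []
for step in range(2):                      # the same mini-batch is fed twice
    running = [update_running(r, m, momentum) for r, m in zip(running, batch_means)]
    history.append([round(r, 4) for r in running])
    print("iteration", step, history[-1])
```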

2.2.2.nn.BatchNorm2d()

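The input has shape $B \times C \times 2D\_feature$. In the example below the data has shape (3, 3, 2, 2): a mini-batch of 3 samples, each with 3 features, each feature of dimension $2 \times 2$. Thus 3 means and variances are computed, one per feature. momentum is set to 0.3, and the running mean and variance are initialized to 0 and 1. The same mini-batch is fed in twice.

The data is shown in the figure below:

As shown in the figure, the number of samples is 3 (B = 3), the number of features is 3 and each feature is 2×2, so the input is 3×3×2×2

The code is as follows: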

import torch
import torch.nn as nn

batch_size = 3
num_features = 3
momentum = 0.3

features_shape = (2, 2)

feature_map = torch.ones(features_shape)                                                    # 2D
feature_maps = torch.stack([feature_map*(i+1) for i in range(num_features)], dim=0)         # 3D
feature_maps_bs = torch.stack([feature_maps for i in range(batch_size)], dim=0)             # 4D

# print("input data:\n{} shape is {}".format(feature_maps_bs, feature_maps_bs.shape))

bn = nn.BatchNorm2d(num_features=num_features, momentum=momentum)

running_mean, running_var = 0, 1

for i in range(2):
    outputs = bn(feature_maps_bs)

    print("\niter:{}, running_mean: {}".format(i, bn.running_mean))
    print("iter:{}, running_var: {}".format(i, bn.running_var))

    print("iter:{}, weight: {}".format(i, bn.weight.data.numpy()))
    print("iter:{}, bias: {}".format(i, bn.bias.data.numpy()))

Output:

iter:0, running_mean: tensor([0.3000, 0.6000, 0.9000])
iter:0, running_var: tensor([0.7000, 0.7000, 0.7000])
iter:0, weight: [1. 1. 1.]
iter:0, bias: [0. 0. 0.]
iter:1, running_mean: tensor([0.5100, 1.0200, 1.5300])
iter:1, running_var: tensor([0.4900, 0.4900, 0.4900])
iter:1, weight: [1. 1. 1.]
iter:1, bias: [0. 0. 0.]

Since num_features is 3, each of the four attributes has shape (3,).

2.2.3.nn.BatchNorm3d()

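The input has shape $B \times C \times 3D\_feature$. In the example below the data has shape (3, 3, 2, 2, 3): a mini-batch of 3 samples, each with 3 features, each feature of dimension $2 \times 2 \times 3$. Thus 3 means and variances are computed, one per feature. momentum is set to 0.3, and the running mean and variance are initialized to 0 and 1. The same mini-batch is fed in twice.

The data is shown in the figure below:

nn.BatchNorm3d input = B x number of features x 3d feature (B is the number of samples in the batch)
As shown in the figure, the number of samples is 3 (B = 3), the number of features is 3 and each feature is 2×2×3, so the input is 3×3×2×2×3

The code is as follows: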

import torch
import torch.nn as nn

batch_size = 3
num_features = 3
momentum = 0.3

features_shape = (2, 2, 3)

feature = torch.ones(features_shape)                                                # 3D
feature_map = torch.stack([feature * (i + 1) for i in range(num_features)], dim=0)  # 4D
feature_maps = torch.stack([feature_map for i in range(batch_size)], dim=0)         # 5D

# print("input data:\n{} shape is {}".format(feature_maps, feature_maps.shape))

bn = nn.BatchNorm3d(num_features=num_features, momentum=momentum)

running_mean, running_var = 0, 1

for i in range(2):
    outputs = bn(feature_maps)

    print("\niter:{}, running_mean.shape: {}".format(i, bn.running_mean.shape))
    print("iter:{}, running_var.shape: {}".format(i, bn.running_var.shape))

    print("iter:{}, weight.shape: {}".format(i, bn.weight.shape))
    print("iter:{}, bias.shape: {}".format(i, bn.bias.shape))

Output:

iter:0, running_mean.shape: torch.Size([3])
iter:0, running_var.shape: torch.Size([3])
iter:0, weight.shape: torch.Size([3])
iter:0, bias.shape: torch.Size([3])
iter:1, running_mean.shape: torch.Size([3])
iter:1, running_var.shape: torch.Size([3])
iter:1, weight.shape: torch.Size([3])
iter:1, bias.shape: torch.Size([3])

2.3.Other common normalization methods

PyTorch offers four common normalization methods:

Batch Normalization (BN)
Layer Normalization (LN)
Instance Normalization (IN)
Group Normalization (GN)

2.4.Layer Normalization(LN)

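Motivation: Batch Normalization is not suitable for networks with variable-length inputs, such as RNNs
Idea: compute the mean and variance over the features of a layer, separately for each sample
Notes:
- There are no running_mean and running_var any more
- $\gamma$ and $\beta$ are element-wise, one per element of normalized_shape

torch.nn.LayerNorm(normalized_shape, eps=1e-05, elementwise_affine=True)
Parameters:
- normalized_shape: the shape of the features to normalize over (the trailing dimensions of the input); it can be $C \times H \times W$, $H \times W$ or $W$
- eps: the correction term added to the denominator during normalization
- elementwise_affine: whether to apply an element-wise affine transform

In the code below the input has shape $B \times C \times feature$, here (8, 2, 3, 4): a mini-batch of 8 samples, each with 2 features, each feature of dimension $3 \times 4$. With normalized_shape = [2, 3, 4], 8 means and variances are computed, one per sample.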

import torch
import torch.nn as nn

batch_size = 8
num_features = 2

features_shape = (3, 4)

feature_map = torch.ones(features_shape)  # 2D
feature_maps = torch.stack([feature_map * (i + 1) for i in range(num_features)], dim=0)  # 3D
feature_maps_bs = torch.stack([feature_maps for i in range(batch_size)], dim=0)  # 4D

# feature_maps_bs shape is [8, 2, 3, 4],  B * C * H * W
# ln = nn.LayerNorm(feature_maps_bs.size()[1:], elementwise_affine=True)
# ln = nn.LayerNorm(feature_maps_bs.size()[1:], elementwise_affine=False)
# ln = nn.LayerNorm([6, 3, 4])  # fails in the forward pass: does not match the trailing dims [2, 3, 4]
ln = nn.LayerNorm([2, 3, 4])

output = ln(feature_maps_bs)

print("Layer Normalization")
print(ln.weight.shape)
print(feature_maps_bs[0, ...])
print(output[0, ...])

Output:

Layer Normalization
torch.Size([2, 3, 4])
tensor([[[1., 1., 1., 1.],
         [1., 1., 1., 1.],
         [1., 1., 1., 1.]],
        [[2., 2., 2., 2.],
         [2., 2., 2., 2.],
         [2., 2., 2., 2.]]])
tensor([[[-1.0000, -1.0000, -1.0000, -1.0000],
         [-1.0000, -1.0000, -1.0000, -1.0000],
         [-1.0000, -1.0000, -1.0000, -1.0000]],
        [[ 1.0000,  1.0000,  1.0000,  1.0000],
         [ 1.0000,  1.0000,  1.0000,  1.0000],
         [ 1.0000,  1.0000,  1.0000,  1.0000]]], grad_fn=<SelectBackward>)

normalized_shape can also be set to (3, 4) or (4), in which case the statistics are computed over the last two dimensions or over the last dimension only.
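To see what different choices of normalized_shape mean, the sketch below compares nn.LayerNorm with a hand-written normalization over the corresponding trailing dimensions. The random input and the helper `manual_norm` are assumptions made for illustration; the layers use their default affine parameters (weight 1, bias 0), so the outputs should match.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(8, 2, 3, 4)                       # B x C x H x W


def manual_norm(t, dims, eps=1e-5):
    """Normalize t over the given dims using the biased variance."""
    mean = t.mean(dim=dims, keepdim=True)
    var = t.var(dim=dims, unbiased=False, keepdim=True)
    return (t - mean) / torch.sqrt(var + eps)


# normalized_shape = [C, H, W]: one mean/variance per sample.
ln_chw = nn.LayerNorm([2, 3, 4])
# normalized_shape = W: one mean/variance per row of length W.
ln_w = nn.LayerNorm(4)

print(torch.allclose(ln_chw(x), manual_norm(x, (1, 2, 3)), atol=1e-5))
print(torch.allclose(ln_w(x), manual_norm(x, (3,)), atol=1e-5))
print(ln_chw.weight.shape, ln_w.weight.shape)
```

The shapes of weight follow normalized_shape, which is why $\gamma$ and $\beta$ are element-wise rather than per channel.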

2.5.Instance Normalization(IN)

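Motivation: Batch Normalization is not suitable for image generation. The images in a mini-batch have different styles, so the data in the batch cannot all be treated as one class when normalizing.
Idea: compute the mean and variance per channel of each instance, i.e. one mean and variance per feature map.

In image-generation tasks (see the figure below), each image in a batch has its own style, so averaging features across different images is clearly a bad idea.

How BN computes the mean and variance:

How IN computes the mean and variance:
In the figure, the horizontal axis is the batch size and the vertical axis is the number of features. Instance Normalization computes statistics for every feature of every sample (here each feature is a 2×2 feature map). One sample has 3 feature maps and there are 3 samples, so 9 computations are needed in total.

The implementation of Instance Normalization is very similar to that of Batch Normalization: there is a base class and 1d, 2d and 3d variants. The base class is _InstanceNorm, and nn.InstanceNorm1d, nn.InstanceNorm2d and nn.InstanceNorm3d all inherit from it.
Taking InstanceNorm1d as an example, the definition is:

torch.nn.InstanceNorm1d(num_features, eps=1e-05, momentum=0.1, affine=False, track_running_stats=False)
Parameters:
- num_features: the number of features of one sample; this is the most important parameter
- eps: the correction term added to the denominator
- momentum: the momentum of the exponential moving average used to estimate the running mean and variance
- affine: whether to apply the affine transform
- track_running_stats: whether to keep running estimates of the mean and variance. With False (the default), the statistics of the input itself are used in both training and eval mode; with True, running estimates are updated during training and used in eval mode

In the code below the input has shape $B \times C \times 2D\_feature$, here (3, 3, 2, 2): a mini-batch of 3 samples, each with 3 features, each feature of dimension $2 \times 2$. Thus $3 \times 3$ means and variances are computed, one per feature of each sample, as shown in the figure below:

The code: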

import torch
import torch.nn as nn

batch_size = 3
num_features = 3
momentum = 0.3

features_shape = (2, 2)

feature_map = torch.ones(features_shape)    # 2D
feature_maps = torch.stack([feature_map * (i + 1) for i in range(num_features)], dim=0)  # 3D
feature_maps_bs = torch.stack([feature_maps for i in range(batch_size)], dim=0)  # 4D

print("Instance Normalization")
print("input data:\n{} shape is {}".format(feature_maps_bs, feature_maps_bs.shape))

instance_n = nn.InstanceNorm2d(num_features=num_features, momentum=momentum)

for i in range(1):
    outputs = instance_n(feature_maps_bs)

    print(outputs)

Output:

Instance Normalization
input data:
tensor([[[[1., 1.],
          [1., 1.]],
         [[2., 2.],
          [2., 2.]],
         [[3., 3.],
          [3., 3.]]],
        [[[1., 1.],
          [1., 1.]],
         [[2., 2.],
          [2., 2.]],
         [[3., 3.],
          [3., 3.]]],
        [[[1., 1.],
          [1., 1.]],
         [[2., 2.],
          [2., 2.]],
         [[3., 3.],
          [3., 3.]]]]) shape is torch.Size([3, 3, 2, 2])
tensor([[[[0., 0.],
          [0., 0.]],
         [[0., 0.],
          [0., 0.]],
         [[0., 0.],
          [0., 0.]]],
        [[[0., 0.],
          [0., 0.]],
         [[0., 0.],
          [0., 0.]],
         [[0., 0.],
          [0., 0.]]],
        [[[0., 0.],
          [0., 0.]],
         [[0., 0.],
          [0., 0.]],
         [[0., 0.],
          [0., 0.]]]])
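The all-zero output is a consequence of the test data, not of Instance Normalization itself: every feature map in the example is constant, so each value equals the mean of its own map. The sketch below is a pure-Python illustration with a hypothetical helper `instance_norm`; it normalizes one constant map and one non-constant map (the latter is made up for contrast) to show the difference.

```python
import math


def instance_norm(feature_map, eps=1e-5):
    """Normalize one feature map (a list of rows) with its own statistics."""
    values = [v for row in feature_map for v in row]
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [[(v - mean) / math.sqrt(var + eps) for v in row] for row in feature_map]


constant_map = [[2.0, 2.0], [2.0, 2.0]]     # like the maps in the example above
varied_map = [[1.0, 3.0], [1.0, 3.0]]       # hypothetical non-constant map

print(instance_norm(constant_map))          # every entry equals its mean, so all zeros
print(instance_norm(varied_map))            # entries close to -1 and +1
```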

2.6.Group Normalization

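Motivation: with small batches, the statistics estimated by Batch Normalization are inaccurate. This typically happens with very large models, where the batch size has to be small.
Idea: when there is not enough data, make up for it with channels. The features of each sample are split into groups, and the mean and variance are computed per group. It can be seen as Layer Normalization with feature grouping added.
Notes:
- There are no running_mean and running_var any more
- $\gamma$ and $\beta$ are per channel

Note: in the figure below, no computation is done along the x axis (batch); grouping and computation happen only along the y axis (features).
As the figure shows, the batch size is 2, and statistics computed across the batch would be very inaccurate. Instead the features are split into groups, for example 2 groups (features 1 and 2 in one group, features 3 and 4 in the other), and the mean and variance are computed per group.

The definition is:

torch.nn.GroupNorm(num_groups, num_channels, eps=1e-05, affine=True)
Parameters:
- num_groups: the number of groups, usually a power of two ($2^n$) such as 2, 4, 8, 16 or 32
- num_channels: the number of features (channels). num_channels must be divisible by num_groups. For example, with 256 channels and 4 groups, each group has 64 channels
- eps: the correction term added to the denominator
- affine: whether to apply the affine transform

In the code below the input has shape $B \times C \times 2D\_feature$, here (2, 4, 2, 2): a mini-batch of 2 samples, each with 4 features, each feature of dimension $2 \times 2$. With num_groups = 2, each group contains $4 \div 2 = 2$ features, and $2 \times 2 = 4$ means and variances are computed, one per group of each sample.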

import torch
import torch.nn as nn

batch_size = 2
num_features = 4
num_groups = 2

features_shape = (2, 2)

feature_map = torch.ones(features_shape)    # 2D
feature_maps = torch.stack([feature_map * (i + 1) for i in range(num_features)], dim=0)  # 3D
feature_maps_bs = torch.stack([feature_maps * (i + 1) for i in range(batch_size)], dim=0)  # 4D

gn = nn.GroupNorm(num_groups, num_features)
outputs = gn(feature_maps_bs)

print("Group Normalization")
print(gn.weight.shape)
print(outputs[0])

Output:

Group Normalization
torch.Size([4])
tensor([[[-1.0000, -1.0000],
         [-1.0000, -1.0000]],
        [[ 1.0000,  1.0000],
         [ 1.0000,  1.0000]],
        [[-1.0000, -1.0000],
         [-1.0000, -1.0000]],
        [[ 1.0000,  1.0000],
         [ 1.0000,  1.0000]]], grad_fn=<SelectBackward>)

Note: the number of channels must be divisible by the number of groups, otherwise an error is raised.
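The pattern -1, 1, -1, 1 in the output follows directly from the grouping. The sketch below is a pure-Python illustration for a single sample, using a hypothetical helper `group_norm` without the affine transform. It normalizes the four constant channels 1, 2, 3, 4 in two groups and should reproduce those values; it also mimics the divisibility check.

```python
import math


def group_norm(channels, num_groups, eps=1e-5):
    """Group normalization for one sample; channels is a list of flattened feature maps."""
    if len(channels) % num_groups != 0:
        raise ValueError("number of channels must be divisible by num_groups")
    size = len(channels) // num_groups
    out = []
    for g in range(num_groups):
        group = channels[g * size:(g + 1) * size]
        values = [v for ch in group for v in ch]
        mean = sum(values) / len(values)                      # one mean per group
        var = sum((v - mean) ** 2 for v in values) / len(values)
        out.extend([[(v - mean) / math.sqrt(var + eps) for v in ch] for ch in group])
    return out


# First sample of the example: 4 channels of 2x2 maps filled with 1, 2, 3, 4.
sample = [[float(c)] * 4 for c in (1, 2, 3, 4)]
normalized = group_norm(sample, num_groups=2)
print([round(ch[0], 4) for ch in normalized])
```

Channels 1 and 2 share the mean 1.5 and channels 3 and 4 share the mean 3.5, so within each group the lower channel maps to about -1 and the higher one to about +1.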

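2.7.Summary

BN, LN, IN and GN were all proposed to overcome Internal Covariate Shift (ICS). They differ in the dimensions over which the mean and variance are computed.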
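One compact way to remember the difference is to count how many (mean, variance) pairs each method computes for an input of shape B × C × H × W. The helper below is a hypothetical summary function, assuming LN normalizes over all of C, H and W; the calls use the sizes from the examples in this chapter.

```python
def num_statistics(method, B, C, G=1):
    """Number of (mean, variance) pairs computed for a B x C x H x W input."""
    counts = {
        "BN": C,        # one per channel, shared across the batch
        "LN": B,        # one per sample (normalizing over C, H, W)
        "IN": B * C,    # one per sample and channel
        "GN": B * G,    # one per sample and group of channels
    }
    return counts[method]


print(num_statistics("BN", B=3, C=3))        # BatchNorm2d example
print(num_statistics("LN", B=8, C=2))        # LayerNorm example
print(num_statistics("IN", B=3, C=3))        # InstanceNorm2d example
print(num_statistics("GN", B=2, C=4, G=2))   # GroupNorm example
```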