Table of Contents

  • Regularization
    • Overview
    • Model without regularization
    • L2 regularization
    • Dropout
  • Batch Normalization: Background
    • Background
    • Internal covariate shift
  • Batch Normalization procedure
    • Backpropagation (BP)
    • Training
    • Testing
    • Benefits
    • Code completion

Regularization

Overview

  • This post gives a detailed introduction to Dropout and Batch Normalization
  • It is mainly based on the content of a deep learning lab assignment
  • The first half deals with regularization and analyzes the results obtained with different regularizers during the experiment; readers who are not interested in the lab details can skip that part

Model without regularization

  • Results

    • Training accuracy: 0.9478
    • Test accuracy: 0.915
    • The test accuracy is lower than the training accuracy

L2 regularization

  • Method: add the penalty term $\underbrace{\frac{1}{m} \frac{\lambda}{2} \sum_{l} \sum_{k} \sum_{j} W_{k, j}^{[l] 2}}_{\text{L2 regularization cost}}$ to the cost function

    • This keeps the model from becoming overly complex and overfitting
  • Code completion (a generalized, loop-based version is sketched at the end of this section)

    • Add the regularization term

      ### START CODE HERE ### (approx. 1 line)
      L2_regularization_cost = (1 / m) * (lambd / 2) * (
          np.sum(np.square(W1)) + np.sum(np.square(W2)) + np.sum(np.square(W3)))
      ### END CODE HERE ###
      
    • Backward propagation: adjust the gradients

      # GRADED FUNCTION: backward_propagation_with_regularization

      def backward_propagation_with_regularization(X, Y, cache, lambd):
          """
          Implements the backward propagation of our baseline model to which we added an L2 regularization.

          Arguments:
          X -- input dataset, of shape (input size, number of examples)
          Y -- "true" labels vector, of shape (output size, number of examples)
          cache -- cache output from forward_propagation()
          lambd -- regularization hyperparameter, scalar

          Returns:
          gradients -- A dictionary with the gradients with respect to each parameter, activation and pre-activation variables
          """
          m = X.shape[1]
          (Z1, A1, W1, b1, Z2, A2, W2, b2, Z3, A3, W3, b3) = cache

          dZ3 = A3 - Y
          ### START CODE HERE ### (approx. 1 line)
          dW3 = 1. / m * np.dot(dZ3, A2.T) + lambd / m * W3
          ### END CODE HERE ###
          db3 = 1. / m * np.sum(dZ3, axis=1, keepdims=True)

          dA2 = np.dot(W3.T, dZ3)
          dZ2 = np.multiply(dA2, np.int64(A2 > 0))
          ### START CODE HERE ### (approx. 1 line)
          dW2 = 1. / m * np.dot(dZ2, A1.T) + lambd / m * W2
          ### END CODE HERE ###
          db2 = 1. / m * np.sum(dZ2, axis=1, keepdims=True)

          dA1 = np.dot(W2.T, dZ2)
          dZ1 = np.multiply(dA1, np.int64(A1 > 0))
          ### START CODE HERE ### (approx. 1 line)
          dW1 = 1. / m * np.dot(dZ1, X.T) + lambd / m * W1
          ### END CODE HERE ###
          db1 = 1. / m * np.sum(dZ1, axis=1, keepdims=True)

          gradients = {"dZ3": dZ3, "dW3": dW3, "db3": db3, "dA2": dA2,
                       "dZ2": dZ2, "dW2": dW2, "db2": db2, "dA1": dA1,
                       "dZ1": dZ1, "dW1": dW1, "db1": db1}

          return gradients
      
  • Results:

    • Training accuracy: 0.9383
    • Test accuracy: 0.93
    • The decision boundary is noticeably smoother and convergence is faster
      • This is because preventing overfitting keeps the model from fitting every detail of the training set
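
  • A hedged aside (not part of the assignment's graded code): the hard-coded three-layer sum above generalizes to any depth with a loop. The `parameters` dict and the `W1`, `W2`, ... key names are assumed conventions here.

      import numpy as np

      def l2_regularization_cost(parameters, lambd, m):
          """L2 cost term: (1/m) * (lambd/2) * sum of squared weights over all layers.

          Assumes the weight matrices are stored under keys "W1", "W2", ... in
          `parameters`; only the weights (not the biases) are penalized.
          """
          num_layers = len([key for key in parameters if key.startswith("W")])
          squared_sum = sum(np.sum(np.square(parameters["W" + str(l)]))
                            for l in range(1, num_layers + 1))
          return (1. / m) * (lambd / 2.) * squared_sum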

Dropout

  • Method: in every training iteration, randomly "drop" some neurons

    • Dropping a neuron does not mean it is actually removed; rather, its activation (the input it passes to the next layer) is set to 0. A dropped neuron contributes nothing to the loss in that iteration, so its parameters are not updated then, which limits its ability to memorize the training set and thus prevents overfitting
    • At the same time, the activations of the neurons that are kept are divided by keep_prob (inverted dropout), so that their expected value, and hence the cost, does not shrink
    • The idea behind dropout is that each training iteration effectively trains a different model whose neurons are a subset of the original network's neurons
    • A neuron cannot become overly sensitive to one particular upstream neuron, because that neuron may be switched off at any time
    • Dropout is applied only during training; it is not used when making predictions (a test-time sketch is given at the end of this section)
  • Forward propagation code completion

      ### START CODE HERE ### (approx. 4 lines)   # Steps 1-4 below correspond to the Steps 1-4 described above.
      D1 = np.random.rand(A1.shape[0], A1.shape[1])
      D1 = (D1 < keep_prob)
      A1 = np.multiply(D1, A1)
      A1 = A1 / keep_prob
      ### END CODE HERE ###
      Z2 = np.dot(W2, A1) + b2
      A2 = relu(Z2)
      ### START CODE HERE ### (approx. 4 lines)
      D2 = np.random.rand(A2.shape[0], A2.shape[1])
      D2 = (D2 < keep_prob)
      A2 = np.multiply(D2, A2)
      A2 = A2 / keep_prob
      ### END CODE HERE ###
    
  • Backward propagation code completion

      dZ3 = A3 - Y
      dW3 = 1./m * np.dot(dZ3, A2.T)
      db3 = 1./m * np.sum(dZ3, axis=1, keepdims=True)
      dA2 = np.dot(W3.T, dZ3)
      ### START CODE HERE ### (≈ 2 lines of code)
      dA2 = np.multiply(dA2, D2)
      dA2 = dA2 / keep_prob
      ### END CODE HERE ###
      dZ2 = np.multiply(dA2, np.int64(A2 > 0))
      dW2 = 1./m * np.dot(dZ2, A1.T)
      db2 = 1./m * np.sum(dZ2, axis=1, keepdims=True)
      dA1 = np.dot(W2.T, dZ2)
      ### START CODE HERE ### (≈ 2 lines of code)
      dA1 = np.multiply(dA1, D1)
      dA1 = dA1 / keep_prob
      ### END CODE HERE ###
    
  • Results

    • Training accuracy: 0.9289

    • Test accuracy: 0.95
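
  • As noted above, dropout is switched off at prediction time. Below is only a minimal sketch of the corresponding test-time forward pass for the same three-layer network; the parameter names follow the assignment's convention, the relu/sigmoid helpers are re-defined here for completeness, and no masks or keep_prob scaling appear.

      import numpy as np

      def relu(z):
          return np.maximum(0, z)

      def sigmoid(z):
          return 1 / (1 + np.exp(-z))

      def forward_propagation_test_time(X, parameters):
          """Plain forward pass used for predictions: no dropout masks, no rescaling."""
          W1, b1 = parameters["W1"], parameters["b1"]
          W2, b2 = parameters["W2"], parameters["b2"]
          W3, b3 = parameters["W3"], parameters["b3"]

          A1 = relu(np.dot(W1, X) + b1)      # no D1 mask, no division by keep_prob
          A2 = relu(np.dot(W2, A1) + b2)     # no D2 mask either
          A3 = sigmoid(np.dot(W3, A2) + b3)  # output layer
          return A3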

Batch Normalization: Background

Background

  • Data with a consistent scale and distribution makes it easier for a model to learn the underlying patterns; the same holds for the inputs to the hidden layers
  • The purpose of a neural network is to take a batch of data and, from the distribution of that batch, predict the distribution of the real data
  • An important assumption in machine learning is the i.i.d. (independent and identically distributed) assumption: the training data and the test data are drawn from the same distribution. This is a basic guarantee that a model trained on the training data will also perform well on the test set

Internal covariate shift

Definition: a deep neural network stacks many layers, and every parameter update in a layer changes the input distribution of the layers above it. Stacked layer upon layer, these shifts compound, so the input distribution of the higher layers can change drastically and they must constantly re-adapt to the parameter updates of the lower layers. To train such a model we have to set the learning rate and the weight initialization very carefully and use parameter-update strategies that are as cautious as possible.
Google named this phenomenon Internal Covariate Shift (ICS).

Problems it causes:

  • The upper layers must keep adapting to a new input distribution, which slows down learning.
  • The outputs of the lower layers may drift toward larger or smaller values, pushing the upper layers into the saturated region of their activation functions and stopping learning prematurely.
  • Every layer's update affects the other layers, so each layer's update strategy has to be chosen as carefully as possible.

If the distribution of the data can be kept roughly unchanged as it passes through each layer, that is, if *the input to every layer keeps a distribution close to that of the test data*, then the whole network can be trained efficiently.

Batch Normalization procedure

  • A method that standardizes the data at every layer. For a detailed introduction, see the CSDN post 李宏毅深度学习笔记:Batch Normalization by qyhaill

  • It implicitly assumes that the distributions of different batches are roughly the same. Small differences can be regarded as noise that adds robustness to the model, but large variations between batches actually make the model harder to train

  • Batch Normalization essentially inserts an extra step between each layer's linear output and its activation function; the rest of training is the same as for an ordinary neural network

    $$\hat{x}_{i}=\frac{x_{i}-\mu_{B}}{\sqrt{\sigma_{B}^{2}+\epsilon}}, \qquad y_{i}=\gamma \cdot \hat{x}_{i}+\beta$$

    • For every feature, compute the mean and variance over all examples in the batch and standardize the data feature by feature; $\gamma$ and $\beta$ then scale and shift the result
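
    • A minimal NumPy sketch of this training-time transform (the names mirror the assignment code further below; it is an illustration, not the graded code):

        import numpy as np

        def batchnorm_transform(x, gamma, beta, eps=1e-5):
            """x has shape (N, D): N examples, D features; statistics are per feature."""
            mu_B = np.mean(x, axis=0)                  # mean of each feature over the batch
            var_B = np.var(x, axis=0)                  # variance of each feature over the batch
            x_hat = (x - mu_B) / np.sqrt(var_B + eps)  # feature-wise standardization
            return gamma * x_hat + beta                # learnable scale and shift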

Backpropagation (BP)

  • Overall gradient formula

    $$\frac{\partial l}{\partial x_{i}}=\frac{\partial l}{\partial \hat{x}_{i}} \cdot \frac{\partial \hat{x}_{i}}{\partial x_{i}}+\frac{\partial l}{\partial \sigma_{B}^{2}} \cdot \frac{\partial \sigma_{B}^{2}}{\partial x_{i}}+\frac{\partial l}{\partial \mu_{B}} \cdot \frac{\partial \mu_{B}}{\partial x_{i}}$$

  • The two simple parameter gradients

    $$\frac{\partial l}{\partial \gamma}=\sum_{i}^{N} \frac{\partial l}{\partial y_{i}} \cdot \frac{\partial y_{i}}{\partial \gamma}=\sum_{i}^{N} \frac{\partial l}{\partial y_{i}} \cdot \hat{x}_{i}$$
    $$\frac{\partial l}{\partial \beta}=\sum_{i}^{N} \frac{\partial l}{\partial y_{i}} \cdot \frac{\partial y_{i}}{\partial \beta}=\sum_{i}^{N} \frac{\partial l}{\partial y_{i}}$$

  • Derivation of the first term

    $$\frac{\partial l}{\partial \hat{x}_{i}} \cdot \frac{\partial \hat{x}_{i}}{\partial x_{i}}=\frac{\partial l}{\partial y_{i}} \cdot \gamma \cdot\left(\sigma_{B}^{2}+\epsilon\right)^{-\frac{1}{2}}$$

  • Derivation of the second term

    • First factor of the second term

      $$\begin{aligned} \frac{\partial l}{\partial \sigma_{B}^{2}} &=\sum_{i}^{N} \frac{\partial l}{\partial \hat{x}_{i}} \cdot \frac{\partial \hat{x}_{i}}{\partial \sigma_{B}^{2}} \\ &=\sum_{i}^{N} \frac{\partial l}{\partial y_{i}} \cdot \gamma \cdot\left(x_{i}-\mu_{B}\right) \cdot\left(-\frac{1}{2}\right) \cdot\left(\sigma_{B}^{2}+\epsilon\right)^{-\frac{3}{2}} \\ &=-\frac{\gamma \cdot\left(\sigma_{B}^{2}+\epsilon\right)^{-\frac{3}{2}}}{2} \sum_{i}^{N} \frac{\partial l}{\partial y_{i}} \cdot\left(x_{i}-\mu_{B}\right) \end{aligned}$$

    • The full second term, i.e. the product of the two factors

      $$\begin{aligned} \frac{\partial l}{\partial \sigma_{B}^{2}} \cdot \frac{\partial \sigma_{B}^{2}}{\partial x_{i}} &=\frac{\partial l}{\partial \sigma_{B}^{2}} \cdot \frac{2\left(x_{i}-\mu_{B}\right)}{N} \\ &=-\frac{\gamma \cdot\left(\sigma_{B}^{2}+\epsilon\right)^{-\frac{3}{2}}}{2}\left(\sum_{j}^{N} \frac{\partial l}{\partial y_{j}} \cdot\left(x_{j}-\mu_{B}\right)\right) \cdot \frac{2\left(x_{i}-\mu_{B}\right)}{N} \\ &=-\frac{\gamma \cdot\left(\sigma_{B}^{2}+\epsilon\right)^{-\frac{3}{2}}}{N}\left(\sum_{j}^{N} \frac{\partial l}{\partial y_{j}} \cdot\left(x_{j}-\mu_{B}\right)\right) \cdot\left(x_{i}-\mu_{B}\right) \\ &=\frac{\gamma \cdot\left(\sigma_{B}^{2}+\epsilon\right)^{-\frac{1}{2}}}{N}\left(\sum_{j}^{N} \frac{\partial l}{\partial y_{j}} \cdot\left(x_{j}-\mu_{B}\right)\right) \cdot\left(x_{i}-\mu_{B}\right) \cdot\left(-\left(\sigma_{B}^{2}+\epsilon\right)^{-1}\right) \end{aligned}$$

      • Note that $x_{j}-\mu_{B}=\hat{x}_{j} \sqrt{\sigma_{B}^{2}+\epsilon}$
      • Look separately at the factor $\left(\sum_{j}^{N} \frac{\partial l}{\partial y_{j}} \cdot\left(x_{j}-\mu_{B}\right)\right) \cdot \frac{x_{i}-\mu_{B}}{\sigma_{B}^{2}+\epsilon}$
      • It can be rewritten as $\left(\sum_{j}^{N} \frac{\partial l}{\partial y_{j}} \cdot \hat{x}_{j} \sqrt{\sigma_{B}^{2}+\epsilon}\right) \cdot \frac{x_{i}-\mu_{B}}{\sigma_{B}^{2}+\epsilon}$
        $=\left(\sum_{j}^{N} \frac{\partial l}{\partial y_{j}} \cdot \hat{x}_{j}\right) \cdot \frac{x_{i}-\mu_{B}}{\sqrt{\sigma_{B}^{2}+\epsilon}}$
        $=\frac{\partial l}{\partial \gamma} \cdot \frac{x_{i}-\mu_{B}}{\sqrt{\sigma_{B}^{2}+\epsilon}}$
        $=\frac{\partial l}{\partial \gamma} \cdot \hat{x}_{i}$
      • Altogether, keeping the remaining $-1$ factor from the last line of the chain above: $\frac{\partial l}{\partial \sigma_{B}^{2}} \cdot \frac{\partial \sigma_{B}^{2}}{\partial x_{i}}=-\frac{\gamma \cdot\left(\sigma_{B}^{2}+\epsilon\right)^{-\frac{1}{2}}}{N} \cdot \frac{\partial l}{\partial \gamma} \cdot \hat{x}_{i}$ (the minus sign is required for consistency with the final formula below)
  • Derivation of the third term

    • First factor of the third term

      $$\begin{aligned} \frac{\partial l}{\partial \mu_{B}} &=\left[\sum_{i}^{N} \frac{\partial l}{\partial \hat{x}_{i}} \cdot \frac{\partial \hat{x}_{i}}{\partial \mu_{B}}\right]+\left[\frac{\partial l}{\partial \sigma_{B}^{2}} \cdot \frac{\partial \sigma_{B}^{2}}{\partial \mu_{B}}\right] \\ &=\left[\sum_{i}^{N} \frac{\partial l}{\partial y_{i}} \cdot \gamma \cdot \frac{-1}{\sqrt{\sigma_{B}^{2}+\epsilon}}\right]+\left[\frac{\partial l}{\partial \sigma_{B}^{2}} \cdot \frac{1}{N} \sum_{i}^{N}-2\left(x_{i}-\mu_{B}\right)\right] \\ &=-\gamma \cdot\left(\sigma_{B}^{2}+\epsilon\right)^{-\frac{1}{2}}\left(\sum_{i}^{N} \frac{\partial l}{\partial y_{i}}\right)-\frac{\partial l}{\partial \sigma_{B}^{2}} \cdot \frac{2}{N}\left(\sum_{i}^{N}\left(x_{i}-\mu_{B}\right)\right) \\ &=-\gamma \cdot\left(\sigma_{B}^{2}+\epsilon\right)^{-\frac{1}{2}}\left(\sum_{i}^{N} \frac{\partial l}{\partial y_{i}}\right) \end{aligned}$$

      • Note that $\sum_{i}^{N}\left(x_{i}-\mu_{B}\right)=0$, so the last term of the second-to-last line above drops out
    • The full third term, i.e. the product of the two factors

      $$\frac{\partial l}{\partial \mu_{B}} \cdot \frac{\partial \mu_{B}}{\partial x_{i}}=-\gamma \cdot\left(\sigma_{B}^{2}+\epsilon\right)^{-\frac{1}{2}}\left(\sum_{i}^{N} \frac{\partial l}{\partial y_{i}}\right) \cdot \frac{1}{N}$$

  • Summing the three terms above gives the final result:

    $$\frac{\partial l}{\partial x_{i}}=\gamma \cdot\left(\sigma_{B}^{2}+\epsilon\right)^{-\frac{1}{2}}\left[\frac{\partial l}{\partial y_{i}}-\frac{1}{N} \cdot \frac{\partial l}{\partial \gamma} \cdot \hat{x}_{i}-\frac{1}{N} \cdot \sum_{j=1}^{N} \frac{\partial l}{\partial y_{j}}\right]$$

The derivatives used in the BP pass are summarized as follows (with batch size $m$, as in the Batch Normalization paper):

$$\frac{\partial \ell}{\partial \hat{x}_{i}}=\frac{\partial \ell}{\partial y_{i}} \cdot \gamma$$
$$\frac{\partial \ell}{\partial \sigma_{B}^{2}}=\sum_{i=1}^{m} \frac{\partial \ell}{\partial \hat{x}_{i}} \cdot\left(x_{i}-\mu_{B}\right) \cdot \frac{-1}{2}\left(\sigma_{B}^{2}+\epsilon\right)^{-3 / 2}$$
$$\frac{\partial \ell}{\partial \mu_{B}}=\left(\sum_{i=1}^{m} \frac{\partial \ell}{\partial \hat{x}_{i}} \cdot \frac{-1}{\sqrt{\sigma_{B}^{2}+\epsilon}}\right)+\frac{\partial \ell}{\partial \sigma_{B}^{2}} \cdot \frac{\sum_{i=1}^{m}-2\left(x_{i}-\mu_{B}\right)}{m}$$
$$\frac{\partial \ell}{\partial x_{i}}=\frac{\partial \ell}{\partial \hat{x}_{i}} \cdot \frac{1}{\sqrt{\sigma_{B}^{2}+\epsilon}}+\frac{\partial \ell}{\partial \sigma_{B}^{2}} \cdot \frac{2\left(x_{i}-\mu_{B}\right)}{m}+\frac{\partial \ell}{\partial \mu_{B}} \cdot \frac{1}{m}$$
$$\frac{\partial \ell}{\partial \gamma}=\sum_{i=1}^{m} \frac{\partial \ell}{\partial y_{i}} \cdot \hat{x}_{i}$$
$$\frac{\partial \ell}{\partial \beta}=\sum_{i=1}^{m} \frac{\partial \ell}{\partial y_{i}}$$
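
As a hedged illustration (not the graded solution), the six formulas above translate almost line for line into NumPy; here `dout` stands for $\partial \ell / \partial y$ with shape (N, D), matching the assignment code further below.

    import numpy as np

    def bn_backward_from_formulas(dout, x, gamma, eps=1e-5):
        """Direct transcription of the summarized gradients (dout = dl/dy, shape (N, D))."""
        N = x.shape[0]
        mu = np.mean(x, axis=0)
        var = np.var(x, axis=0)
        x_hat = (x - mu) / np.sqrt(var + eps)

        dx_hat = dout * gamma                                                  # dl/dx_hat
        dvar = np.sum(dx_hat * (x - mu) * (-0.5) * (var + eps) ** (-1.5), axis=0)
        dmu = np.sum(-dx_hat / np.sqrt(var + eps), axis=0) \
            + dvar * np.sum(-2.0 * (x - mu), axis=0) / N
        dx = dx_hat / np.sqrt(var + eps) + dvar * 2.0 * (x - mu) / N + dmu / N
        dgamma = np.sum(dout * x_hat, axis=0)
        dbeta = np.sum(dout, axis=0)
        return dx, dgamma, dbeta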

Training

  • The batch size should be reasonably large, because the batch statistics are really an estimate of the mean and variance of the whole training set (a small illustration follows this list)
  • The mean and variance cannot be treated as constants: they are functions of $X$ and $W$, so during BP they also contribute to the updates of $W$
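
  • A small, hedged illustration of the batch-size point (synthetic data, NumPy only; the exact numbers vary from run to run): the batch mean and variance deviate less from the full training-set statistics as the batch grows.

      import numpy as np

      rng = np.random.default_rng(0)
      data = rng.normal(loc=2.0, scale=3.0, size=100000)   # stand-in for the whole training set

      for batch_size in (8, 64, 512, 4096):
          batch = rng.choice(data, size=batch_size, replace=False)
          print(f"batch_size={batch_size:5d}  "
                f"|mean error|={abs(batch.mean() - data.mean()):.4f}  "
                f"|var error|={abs(batch.var() - data.var()):.4f}")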

Testing

  • After training has finished, the mean and variance used at test time can be estimated

    • by keeping the mean and variance computed at every iteration during training,
    • and then taking a weighted combination of the means and variances recorded over these updates

  • The means and variances from early iterations differ considerably from those of the fully trained model, so later values receive larger weights: as training progresses the model, and hence its statistics, becomes more accurate (the exponentially decaying running average sketched below has exactly this effect)
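
  • A minimal sketch of such a weighted combination, mirroring the momentum-based running average used in the code below (the per-batch statistics here are made-up stand-ins, purely illustrative):

      import numpy as np

      momentum = 0.9
      running_mean = np.zeros(3)
      running_var = np.zeros(3)

      rng = np.random.default_rng(1)
      for step in range(200):
          # Stand-ins for the per-batch statistics recorded during training;
          # they drift as the model's weights change.
          sample_mean = step / 200.0 + rng.normal(0.0, 0.05, size=3)
          sample_var = 1.0 + step / 400.0 + rng.normal(0.0, 0.05, size=3)
          # Each update scales the old value by `momentum`, so early batches are
          # down-weighted exponentially and recent batches dominate the estimate.
          running_mean = momentum * running_mean + (1 - momentum) * sample_mean
          running_var = momentum * running_var + (1 - momentum) * sample_var

      print(running_mean, running_var)   # used at test time instead of batch statistics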

Benefits

  • It alleviates the Internal Covariate Shift problem, so a larger learning rate can be used and training is faster

  • The values stay close to 0, which reduces how often the data falls into the saturated region of the activation function

    • Falling into the saturated region makes the model too insensitive to changes in the data and weakens its ability to learn
  • The choice of parameter initialization has less influence

  • It acts as a form of regularization and can counteract overfitting to some extent

    • After all, the mean and variance are estimated from a batch rather than from all of the data, so they are noisy estimates

Code completion

  • forward

    def batchnorm_forward(x, gamma, beta, bn_param):
        """
        Forward pass for batch normalization.

        During training the sample mean and (uncorrected) sample variance are
        computed from minibatch statistics and used to normalize the incoming data.
        During training we also keep an exponentially decaying running mean of the
        mean and variance of each feature, and these averages are used to normalize
        data at test-time.

        At each timestep we update the running averages for mean and variance using
        an exponential decay based on the momentum parameter:

        running_mean = momentum * running_mean + (1 - momentum) * sample_mean
        running_var = momentum * running_var + (1 - momentum) * sample_var

        Note that the batch normalization paper suggests a different test-time
        behavior: they compute sample mean and variance for each feature using a
        large number of training images rather than using a running average. For
        this implementation we have chosen to use running averages instead since
        they do not require an additional estimation step; the torch7
        implementation of batch normalization also uses running averages.

        Input:
        - x: Data of shape (N, D)
        - gamma: Scale parameter of shape (D,)
        - beta: Shift parameter of shape (D,)
        - bn_param: Dictionary with the following keys:
          - mode: 'train' or 'test'; required
          - eps: Constant for numeric stability
          - momentum: Constant for running mean / variance.
          - running_mean: Array of shape (D,) giving running mean of features
          - running_var: Array of shape (D,) giving running variance of features

        Returns a tuple of:
        - out: of shape (N, D)
        - cache: A tuple of values needed in the backward pass
        """
        mode = bn_param['mode']
        eps = bn_param.get('eps', 1e-5)
        momentum = bn_param.get('momentum', 0.9)

        N, D = x.shape
        running_mean = bn_param.get('running_mean', np.zeros(D, dtype=x.dtype))
        running_var = bn_param.get('running_var', np.zeros(D, dtype=x.dtype))

        out, cache = None, None
        if mode == 'train':
            #######################################################################
            # TODO: Implement the training-time forward pass for batch norm.      #
            # Use minibatch statistics to compute the mean and variance, use      #
            # these statistics to normalize the incoming data, and scale and      #
            # shift the normalized data using gamma and beta.                     #
            #                                                                     #
            # You should store the output in the variable out. Any intermediates  #
            # that you need for the backward pass should be stored in the cache   #
            # variable.                                                           #
            #                                                                     #
            # You should also use your computed sample mean and variance together #
            # with the momentum variable to update the running mean and running   #
            # variance, storing your result in the running_mean and running_var   #
            # variables.                                                          #
            #                                                                     #
            # Note that though you should be keeping track of the running         #
            # variance, you should normalize the data based on the standard       #
            # deviation (square root of variance) instead!                        #
            # Referencing the original paper (https://arxiv.org/abs/1502.03167)   #
            # might prove to be helpful.                                          #
            #######################################################################
            # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

            sample_mean = np.mean(x, axis=0)  # each column is one feature, so take the mean per feature
            sample_var = np.var(x, axis=0)    # likewise, the per-feature variance
            x_norm = (x - sample_mean) / np.sqrt(sample_var + eps)
            out = gamma * x_norm + beta
            cache = (x, sample_mean, sample_var, x_norm, gamma, beta, eps)

            # Update and store the running statistics for normalizing test data later.
            running_mean = momentum * running_mean + (1 - momentum) * sample_mean
            running_var = momentum * running_var + (1 - momentum) * sample_var

            # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
            #######################################################################
            #                           END OF YOUR CODE                          #
            #######################################################################
        elif mode == 'test':
            #######################################################################
            # TODO: Implement the test-time forward pass for batch normalization. #
            # Use the running mean and variance to normalize the incoming data,   #
            # then scale and shift the normalized data using gamma and beta.      #
            # Store the result in the out variable.                               #
            #######################################################################
            # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

            # Normalize the test data with the stored running statistics
            # (eps goes inside the square root, matching the training branch).
            x_std = (x - bn_param['running_mean']) / np.sqrt(bn_param['running_var'] + eps)
            out = gamma * x_std + beta

            # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
            #######################################################################
            #                          END OF YOUR CODE                           #
            #######################################################################
        else:
            raise ValueError('Invalid forward batchnorm mode "%s"' % mode)

        # Store the updated running means back into bn_param
        bn_param['running_mean'] = running_mean
        bn_param['running_var'] = running_var

        return out, cache
  • backward

    def batchnorm_backward(dout, cache):
        """
        Backward pass for batch normalization.

        For this implementation, you should write out a computation graph for
        batch normalization on paper and propagate gradients backward through
        intermediate nodes.

        Inputs:
        - dout: Upstream derivatives, of shape (N, D)
        - cache: Variable of intermediates from batchnorm_forward.

        Returns a tuple of:
        - dx: Gradient with respect to inputs x, of shape (N, D)
        - dgamma: Gradient with respect to scale parameter gamma, of shape (D,)
        - dbeta: Gradient with respect to shift parameter beta, of shape (D,)
        """
        dx, dgamma, dbeta = None, None, None
        ###########################################################################
        # TODO: Implement the backward pass for batch normalization. Store the    #
        # results in the dx, dgamma, and dbeta variables.                         #
        # Referencing the original paper (https://arxiv.org/abs/1502.03167)       #
        # might prove to be helpful.                                              #
        ###########################################################################
        # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

        N, D = dout.shape
        # Unpack the values stored in the cache.
        x, sample_mean, sample_var, x_norm, gamma, beta, eps = cache

        # This reproduces the derivation above, computing every intermediate gradient.
        dx_norm = dout * gamma  # from out = gamma * x_norm + beta
        dsample_var = np.sum(dx_norm * (-0.5 * x_norm / (sample_var + eps)), axis=0)
        dsample_mean = np.sum(-dx_norm / np.sqrt(sample_var + eps), axis=0) + \
            dsample_var * np.sum(-2.0 / N * (x - sample_mean), axis=0)
        dx1 = dx_norm / np.sqrt(sample_var + eps)
        dx2 = dsample_var * (2.0 / N) * (x - sample_mean)  # from sample_var
        dx3 = dsample_mean * (1.0 / N)                     # from sample_mean
        dx = dx1 + dx2 + dx3
        dgamma = np.sum(dout * x_norm, axis=0)
        dbeta = np.sum(dout, axis=0)

        # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
        ###########################################################################
        #                             END OF YOUR CODE                            #
        ###########################################################################
        return dx, dgamma, dbeta
    
  • backward_alt

    Without keeping the intermediate variables, the result is obtained directly from $\frac{\partial l}{\partial x_{i}}=\gamma \cdot\left(\sigma_{B}^{2}+\epsilon\right)^{-\frac{1}{2}}\left[\frac{\partial l}{\partial y_{i}}-\frac{1}{N} \cdot \frac{\partial l}{\partial \gamma} \cdot \hat{x}_{i}-\frac{1}{N} \cdot \sum_{j=1}^{N} \frac{\partial l}{\partial y_{j}}\right]$ (a quick consistency check against batchnorm_backward is sketched after the code)

    def batchnorm_backward_alt(dout, cache):
        """
        Alternative backward pass for batch normalization.

        For this implementation you should work out the derivatives for the batch
        normalizaton backward pass on paper and simplify as much as possible. You
        should be able to derive a simple expression for the backward pass.
        See the jupyter notebook for more hints.

        Note: This implementation should expect to receive the same cache variable
        as batchnorm_backward, but might not use all of the values in the cache.

        Inputs / outputs: Same as batchnorm_backward
        """
        dx, dgamma, dbeta = None, None, None
        ###########################################################################
        # TODO: Implement the backward pass for batch normalization. Store the    #
        # results in the dx, dgamma, and dbeta variables.                         #
        #                                                                         #
        # After computing the gradient with respect to the centered inputs, you   #
        # should be able to compute gradients with respect to the inputs in a     #
        # single statement; our implementation fits on a single 80-character line.#
        ###########################################################################
        # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

        x, sample_mean, sample_var, x_norm, gamma, beta, eps = cache
        dgamma = np.sum(dout * x_norm, axis=0)
        dbeta = np.sum(dout, axis=0)
        std = (sample_var + eps) ** (-0.5)
        dx = std * gamma * (dout - dgamma * x_norm / x.shape[0] - np.mean(dout, axis=0))

        # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
        ###########################################################################
        #                             END OF YOUR CODE                            #
        ###########################################################################
        return dx, dgamma, dbeta
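
  • A quick, hedged consistency check between the two backward passes, assuming batchnorm_forward, batchnorm_backward and batchnorm_backward_alt are defined as above (the random inputs are only for illustration):

      import numpy as np

      def rel_error(a, b):
          return np.max(np.abs(a - b) / np.maximum(1e-8, np.abs(a) + np.abs(b)))

      np.random.seed(0)
      N, D = 100, 5
      x = 3 * np.random.randn(N, D) + 10
      gamma, beta = np.random.randn(D), np.random.randn(D)
      dout = np.random.randn(N, D)

      _, cache = batchnorm_forward(x, gamma, beta, {'mode': 'train'})
      dx1, dgamma1, dbeta1 = batchnorm_backward(dout, cache)
      dx2, dgamma2, dbeta2 = batchnorm_backward_alt(dout, cache)

      print('dx difference:    ', rel_error(dx1, dx2))        # expected to be at floating-point noise level
      print('dgamma difference:', rel_error(dgamma1, dgamma2))
      print('dbeta difference: ', rel_error(dbeta1, dbeta2))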
    
