Table of Contents

  • Regularization
    • Overview
    • Model without regularization
    • L2 regularization
    • Dropout
  • Batch Normalization: Background
    • Background
    • Internal covariate shift
  • Batch Normalization procedure
    • Backpropagation (BP)
    • Training
    • Testing
    • Benefits
    • Code completion

Regularization

Overview

  • This post gives a detailed introduction to Dropout and Batch Normalization
  • It is mainly based on the content of a deep learning lab assignment
  • The first half deals with regularization and analyzes the results obtained with different regularizers during the experiment; readers who are not interested in the lab details can skip that part

Model without regularization

  • Results

    • Training accuracy: 0.9478
    • Test accuracy: 0.915
    • The test accuracy is lower than the training accuracy

L2 regularization

  • Method: add the penalty term $\underbrace{\frac{1}{m} \frac{\lambda}{2} \sum_{l} \sum_{k} \sum_{j} W_{k, j}^{[l] 2}}_{\text{L2 regularization cost}}$ to the cost function

    • This keeps the model from becoming overly complex and overfitting
  • Code completion (a generalized, loop-based version is sketched at the end of this section)

    • Add the regularization term

      ### START CODE HERE ### (approx. 1 line)
      L2_regularization_cost = (1 / m) * (lambd / 2) * (
          np.sum(np.square(W1)) + np.sum(np.square(W2)) + np.sum(np.square(W3)))
      ### END CODE HERE ###
      
    • Backward propagation: adjust the gradients

      # GRADED FUNCTION: backward_propagation_with_regularization

      def backward_propagation_with_regularization(X, Y, cache, lambd):
          """
          Implements the backward propagation of our baseline model to which we added an L2 regularization.

          Arguments:
          X -- input dataset, of shape (input size, number of examples)
          Y -- "true" labels vector, of shape (output size, number of examples)
          cache -- cache output from forward_propagation()
          lambd -- regularization hyperparameter, scalar

          Returns:
          gradients -- A dictionary with the gradients with respect to each parameter, activation and pre-activation variables
          """
          m = X.shape[1]
          (Z1, A1, W1, b1, Z2, A2, W2, b2, Z3, A3, W3, b3) = cache

          dZ3 = A3 - Y
          ### START CODE HERE ### (approx. 1 line)
          dW3 = 1. / m * np.dot(dZ3, A2.T) + lambd / m * W3
          ### END CODE HERE ###
          db3 = 1. / m * np.sum(dZ3, axis=1, keepdims=True)

          dA2 = np.dot(W3.T, dZ3)
          dZ2 = np.multiply(dA2, np.int64(A2 > 0))
          ### START CODE HERE ### (approx. 1 line)
          dW2 = 1. / m * np.dot(dZ2, A1.T) + lambd / m * W2
          ### END CODE HERE ###
          db2 = 1. / m * np.sum(dZ2, axis=1, keepdims=True)

          dA1 = np.dot(W2.T, dZ2)
          dZ1 = np.multiply(dA1, np.int64(A1 > 0))
          ### START CODE HERE ### (approx. 1 line)
          dW1 = 1. / m * np.dot(dZ1, X.T) + lambd / m * W1
          ### END CODE HERE ###
          db1 = 1. / m * np.sum(dZ1, axis=1, keepdims=True)

          gradients = {"dZ3": dZ3, "dW3": dW3, "db3": db3, "dA2": dA2,
                       "dZ2": dZ2, "dW2": dW2, "db2": db2, "dA1": dA1,
                       "dZ1": dZ1, "dW1": dW1, "db1": db1}

          return gradients
      
  • Results:

    • Training accuracy: 0.9383
    • Test accuracy: 0.93
    • The decision boundary is noticeably smoother and convergence is faster
      • This is because preventing overfitting keeps the model from fitting every detail of the training set
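
  • A hedged aside (not part of the assignment's graded code): the hard-coded three-layer sum above generalizes to any depth with a loop. The `parameters` dict and the `W1`, `W2`, ... key names are assumed conventions here.

      import numpy as np

      def l2_regularization_cost(parameters, lambd, m):
          """L2 cost term: (1/m) * (lambd/2) * sum of squared weights over all layers.

          Assumes the weight matrices are stored under keys "W1", "W2", ... in
          `parameters`; only the weights (not the biases) are penalized.
          """
          num_layers = len([key for key in parameters if key.startswith("W")])
          squared_sum = sum(np.sum(np.square(parameters["W" + str(l)]))
                            for l in range(1, num_layers + 1))
          return (1. / m) * (lambd / 2.) * squared_sum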

Dropout

  • Method: in every training iteration, randomly "drop" some neurons

    • Dropping a neuron does not mean it is actually removed; rather, its activation (the input it passes to the next layer) is set to 0. A dropped neuron contributes nothing to the loss in that iteration, so its parameters are not updated then, which limits its ability to memorize the training set and thus prevents overfitting
    • At the same time, the activations of the neurons that are kept are divided by keep_prob (inverted dropout), so that their expected value, and hence the cost, does not shrink
    • The idea behind dropout is that each training iteration effectively trains a different model whose neurons are a subset of the original network's neurons
    • A neuron cannot become overly sensitive to one particular upstream neuron, because that neuron may be switched off at any time
    • Dropout is applied only during training; it is not used when making predictions (a test-time sketch is given at the end of this section)
  • Forward propagation code completion

      ### START CODE HERE ### (approx. 4 lines)   # Steps 1-4 below correspond to the Steps 1-4 described above.
      D1 = np.random.rand(A1.shape[0], A1.shape[1])
      D1 = (D1 < keep_prob)
      A1 = np.multiply(D1, A1)
      A1 = A1 / keep_prob
      ### END CODE HERE ###
      Z2 = np.dot(W2, A1) + b2
      A2 = relu(Z2)
      ### START CODE HERE ### (approx. 4 lines)
      D2 = np.random.rand(A2.shape[0], A2.shape[1])
      D2 = (D2 < keep_prob)
      A2 = np.multiply(D2, A2)
      A2 = A2 / keep_prob
      ### END CODE HERE ###
    
  • Backward propagation code completion

      dZ3 = A3 - Y
      dW3 = 1./m * np.dot(dZ3, A2.T)
      db3 = 1./m * np.sum(dZ3, axis=1, keepdims=True)
      dA2 = np.dot(W3.T, dZ3)
      ### START CODE HERE ### (≈ 2 lines of code)
      dA2 = np.multiply(dA2, D2)
      dA2 = dA2 / keep_prob
      ### END CODE HERE ###
      dZ2 = np.multiply(dA2, np.int64(A2 > 0))
      dW2 = 1./m * np.dot(dZ2, A1.T)
      db2 = 1./m * np.sum(dZ2, axis=1, keepdims=True)
      dA1 = np.dot(W2.T, dZ2)
      ### START CODE HERE ### (≈ 2 lines of code)
      dA1 = np.multiply(dA1, D1)
      dA1 = dA1 / keep_prob
      ### END CODE HERE ###
    
  • Results

    • Training accuracy: 0.9289

    • Test accuracy: 0.95
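
  • As noted above, dropout is switched off at prediction time. Below is only a minimal sketch of the corresponding test-time forward pass for the same three-layer network; the parameter names follow the assignment's convention, the relu/sigmoid helpers are re-defined here for completeness, and no masks or keep_prob scaling appear.

      import numpy as np

      def relu(z):
          return np.maximum(0, z)

      def sigmoid(z):
          return 1 / (1 + np.exp(-z))

      def forward_propagation_test_time(X, parameters):
          """Plain forward pass used for predictions: no dropout masks, no rescaling."""
          W1, b1 = parameters["W1"], parameters["b1"]
          W2, b2 = parameters["W2"], parameters["b2"]
          W3, b3 = parameters["W3"], parameters["b3"]

          A1 = relu(np.dot(W1, X) + b1)      # no D1 mask, no division by keep_prob
          A2 = relu(np.dot(W2, A1) + b2)     # no D2 mask either
          A3 = sigmoid(np.dot(W3, A2) + b3)  # output layer
          return A3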

Batch Normalization: Background

Background

  • Data with a consistent scale and distribution makes it easier for a model to learn the underlying patterns; the same holds for the inputs to the hidden layers
  • The purpose of a neural network is to take a batch of data and, from the distribution of that batch, predict the distribution of the real data
  • An important assumption in machine learning is the i.i.d. (independent and identically distributed) assumption: the training data and the test data are drawn from the same distribution. This is a basic guarantee that a model trained on the training data will also perform well on the test set

Internal covariate shift

Definition: a deep neural network stacks many layers, and every parameter update in a layer changes the input distribution of the layers above it. Stacked layer upon layer, these shifts compound, so the input distribution of the higher layers can change drastically and they must constantly re-adapt to the parameter updates of the lower layers. To train such a model we have to set the learning rate and the weight initialization very carefully and use parameter-update strategies that are as cautious as possible.
Google named this phenomenon Internal Covariate Shift (ICS).

Problems it causes:

  • The upper layers must keep adapting to a new input distribution, which slows down learning.
  • The outputs of the lower layers may drift toward larger or smaller values, pushing the upper layers into the saturated region of their activation functions and stopping learning prematurely.
  • Every layer's update affects the other layers, so each layer's update strategy has to be chosen as carefully as possible.

If the distribution of the data can be kept roughly unchanged as it passes through each layer, that is, if *the input to every layer keeps a distribution close to that of the test data*, then the whole network can be trained efficiently.

Batch Normalization procedure

  • A method that standardizes the data at every layer. For a detailed introduction, see the CSDN post 李宏毅深度学习笔记:Batch Normalization by qyhaill

  • It implicitly assumes that the distributions of different batches are roughly the same. Small differences can be regarded as noise that adds robustness to the model, but large variations between batches actually make the model harder to train

  • Batch Normalization essentially inserts an extra step between each layer's linear output and its activation function; the rest of training is the same as for an ordinary neural network

    $$\hat{x}_{i}=\frac{x_{i}-\mu_{B}}{\sqrt{\sigma_{B}^{2}+\epsilon}}, \qquad y_{i}=\gamma \cdot \hat{x}_{i}+\beta$$

    • For every feature, compute the mean and variance over all examples in the batch and standardize the data feature by feature; $\gamma$ and $\beta$ then scale and shift the result
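
    • A minimal NumPy sketch of this training-time transform (the names mirror the assignment code further below; it is an illustration, not the graded code):

        import numpy as np

        def batchnorm_transform(x, gamma, beta, eps=1e-5):
            """x has shape (N, D): N examples, D features; statistics are per feature."""
            mu_B = np.mean(x, axis=0)                  # mean of each feature over the batch
            var_B = np.var(x, axis=0)                  # variance of each feature over the batch
            x_hat = (x - mu_B) / np.sqrt(var_B + eps)  # feature-wise standardization
            return gamma * x_hat + beta                # learnable scale and shift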

Backpropagation (BP)

  • Overall gradient formula

    $$\frac{\partial l}{\partial x_{i}}=\frac{\partial l}{\partial \hat{x}_{i}} \cdot \frac{\partial \hat{x}_{i}}{\partial x_{i}}+\frac{\partial l}{\partial \sigma_{B}^{2}} \cdot \frac{\partial \sigma_{B}^{2}}{\partial x_{i}}+\frac{\partial l}{\partial \mu_{B}} \cdot \frac{\partial \mu_{B}}{\partial x_{i}}$$

  • The two simple parameter gradients

    $$\frac{\partial l}{\partial \gamma}=\sum_{i}^{N} \frac{\partial l}{\partial y_{i}} \cdot \frac{\partial y_{i}}{\partial \gamma}=\sum_{i}^{N} \frac{\partial l}{\partial y_{i}} \cdot \hat{x}_{i}$$
    $$\frac{\partial l}{\partial \beta}=\sum_{i}^{N} \frac{\partial l}{\partial y_{i}} \cdot \frac{\partial y_{i}}{\partial \beta}=\sum_{i}^{N} \frac{\partial l}{\partial y_{i}}$$

  • Derivation of the first term

    $$\frac{\partial l}{\partial \hat{x}_{i}} \cdot \frac{\partial \hat{x}_{i}}{\partial x_{i}}=\frac{\partial l}{\partial y_{i}} \cdot \gamma \cdot\left(\sigma_{B}^{2}+\epsilon\right)^{-\frac{1}{2}}$$

  • Derivation of the second term

    • First factor of the second term

      $$\begin{aligned} \frac{\partial l}{\partial \sigma_{B}^{2}} &=\sum_{i}^{N} \frac{\partial l}{\partial \hat{x}_{i}} \cdot \frac{\partial \hat{x}_{i}}{\partial \sigma_{B}^{2}} \\ &=\sum_{i}^{N} \frac{\partial l}{\partial y_{i}} \cdot \gamma \cdot\left(x_{i}-\mu_{B}\right) \cdot\left(-\frac{1}{2}\right) \cdot\left(\sigma_{B}^{2}+\epsilon\right)^{-\frac{3}{2}} \\ &=-\frac{\gamma \cdot\left(\sigma_{B}^{2}+\epsilon\right)^{-\frac{3}{2}}}{2} \sum_{i}^{N} \frac{\partial l}{\partial y_{i}} \cdot\left(x_{i}-\mu_{B}\right) \end{aligned}$$

    • The full second term, i.e. the product of the two factors

      $$\begin{aligned} \frac{\partial l}{\partial \sigma_{B}^{2}} \cdot \frac{\partial \sigma_{B}^{2}}{\partial x_{i}} &=\frac{\partial l}{\partial \sigma_{B}^{2}} \cdot \frac{2\left(x_{i}-\mu_{B}\right)}{N} \\ &=-\frac{\gamma \cdot\left(\sigma_{B}^{2}+\epsilon\right)^{-\frac{3}{2}}}{2}\left(\sum_{j}^{N} \frac{\partial l}{\partial y_{j}} \cdot\left(x_{j}-\mu_{B}\right)\right) \cdot \frac{2\left(x_{i}-\mu_{B}\right)}{N} \\ &=-\frac{\gamma \cdot\left(\sigma_{B}^{2}+\epsilon\right)^{-\frac{3}{2}}}{N}\left(\sum_{j}^{N} \frac{\partial l}{\partial y_{j}} \cdot\left(x_{j}-\mu_{B}\right)\right) \cdot\left(x_{i}-\mu_{B}\right) \\ &=\frac{\gamma \cdot\left(\sigma_{B}^{2}+\epsilon\right)^{-\frac{1}{2}}}{N}\left(\sum_{j}^{N} \frac{\partial l}{\partial y_{j}} \cdot\left(x_{j}-\mu_{B}\right)\right) \cdot\left(x_{i}-\mu_{B}\right) \cdot\left(-\left(\sigma_{B}^{2}+\epsilon\right)^{-1}\right) \end{aligned}$$

      • Note that $x_{j}-\mu_{B}=\hat{x}_{j} \sqrt{\sigma_{B}^{2}+\epsilon}$
      • Look separately at the factor $\left(\sum_{j}^{N} \frac{\partial l}{\partial y_{j}} \cdot\left(x_{j}-\mu_{B}\right)\right) \cdot \frac{x_{i}-\mu_{B}}{\sigma_{B}^{2}+\epsilon}$
      • It can be rewritten as $\left(\sum_{j}^{N} \frac{\partial l}{\partial y_{j}} \cdot \hat{x}_{j} \sqrt{\sigma_{B}^{2}+\epsilon}\right) \cdot \frac{x_{i}-\mu_{B}}{\sigma_{B}^{2}+\epsilon}$
        $=\left(\sum_{j}^{N} \frac{\partial l}{\partial y_{j}} \cdot \hat{x}_{j}\right) \cdot \frac{x_{i}-\mu_{B}}{\sqrt{\sigma_{B}^{2}+\epsilon}}$
        $=\frac{\partial l}{\partial \gamma} \cdot \frac{x_{i}-\mu_{B}}{\sqrt{\sigma_{B}^{2}+\epsilon}}$
        $=\frac{\partial l}{\partial \gamma} \cdot \hat{x}_{i}$
      • Altogether, keeping the remaining $-1$ factor from the last line of the chain above: $\frac{\partial l}{\partial \sigma_{B}^{2}} \cdot \frac{\partial \sigma_{B}^{2}}{\partial x_{i}}=-\frac{\gamma \cdot\left(\sigma_{B}^{2}+\epsilon\right)^{-\frac{1}{2}}}{N} \cdot \frac{\partial l}{\partial \gamma} \cdot \hat{x}_{i}$ (the minus sign is required for consistency with the final formula below)
  • Derivation of the third term

    • First factor of the third term

      $$\begin{aligned} \frac{\partial l}{\partial \mu_{B}} &=\left[\sum_{i}^{N} \frac{\partial l}{\partial \hat{x}_{i}} \cdot \frac{\partial \hat{x}_{i}}{\partial \mu_{B}}\right]+\left[\frac{\partial l}{\partial \sigma_{B}^{2}} \cdot \frac{\partial \sigma_{B}^{2}}{\partial \mu_{B}}\right] \\ &=\left[\sum_{i}^{N} \frac{\partial l}{\partial y_{i}} \cdot \gamma \cdot \frac{-1}{\sqrt{\sigma_{B}^{2}+\epsilon}}\right]+\left[\frac{\partial l}{\partial \sigma_{B}^{2}} \cdot \frac{1}{N} \sum_{i}^{N}-2\left(x_{i}-\mu_{B}\right)\right] \\ &=-\gamma \cdot\left(\sigma_{B}^{2}+\epsilon\right)^{-\frac{1}{2}}\left(\sum_{i}^{N} \frac{\partial l}{\partial y_{i}}\right)-\frac{\partial l}{\partial \sigma_{B}^{2}} \cdot \frac{2}{N}\left(\sum_{i}^{N}\left(x_{i}-\mu_{B}\right)\right) \\ &=-\gamma \cdot\left(\sigma_{B}^{2}+\epsilon\right)^{-\frac{1}{2}}\left(\sum_{i}^{N} \frac{\partial l}{\partial y_{i}}\right) \end{aligned}$$

      • Note that $\sum_{i}^{N}\left(x_{i}-\mu_{B}\right)=0$, so the last term of the second-to-last line above drops out
    • The full third term, i.e. the product of the two factors

      $$\frac{\partial l}{\partial \mu_{B}} \cdot \frac{\partial \mu_{B}}{\partial x_{i}}=-\gamma \cdot\left(\sigma_{B}^{2}+\epsilon\right)^{-\frac{1}{2}}\left(\sum_{i}^{N} \frac{\partial l}{\partial y_{i}}\right) \cdot \frac{1}{N}$$

  • Summing the three terms above gives the final result:

    $$\frac{\partial l}{\partial x_{i}}=\gamma \cdot\left(\sigma_{B}^{2}+\epsilon\right)^{-\frac{1}{2}}\left[\frac{\partial l}{\partial y_{i}}-\frac{1}{N} \cdot \frac{\partial l}{\partial \gamma} \cdot \hat{x}_{i}-\frac{1}{N} \cdot \sum_{j=1}^{N} \frac{\partial l}{\partial y_{j}}\right]$$

The derivatives used in the BP pass are summarized as follows (with batch size $m$, as in the Batch Normalization paper):

$$\frac{\partial \ell}{\partial \hat{x}_{i}}=\frac{\partial \ell}{\partial y_{i}} \cdot \gamma$$
$$\frac{\partial \ell}{\partial \sigma_{B}^{2}}=\sum_{i=1}^{m} \frac{\partial \ell}{\partial \hat{x}_{i}} \cdot\left(x_{i}-\mu_{B}\right) \cdot \frac{-1}{2}\left(\sigma_{B}^{2}+\epsilon\right)^{-3 / 2}$$
$$\frac{\partial \ell}{\partial \mu_{B}}=\left(\sum_{i=1}^{m} \frac{\partial \ell}{\partial \hat{x}_{i}} \cdot \frac{-1}{\sqrt{\sigma_{B}^{2}+\epsilon}}\right)+\frac{\partial \ell}{\partial \sigma_{B}^{2}} \cdot \frac{\sum_{i=1}^{m}-2\left(x_{i}-\mu_{B}\right)}{m}$$
$$\frac{\partial \ell}{\partial x_{i}}=\frac{\partial \ell}{\partial \hat{x}_{i}} \cdot \frac{1}{\sqrt{\sigma_{B}^{2}+\epsilon}}+\frac{\partial \ell}{\partial \sigma_{B}^{2}} \cdot \frac{2\left(x_{i}-\mu_{B}\right)}{m}+\frac{\partial \ell}{\partial \mu_{B}} \cdot \frac{1}{m}$$
$$\frac{\partial \ell}{\partial \gamma}=\sum_{i=1}^{m} \frac{\partial \ell}{\partial y_{i}} \cdot \hat{x}_{i}$$
$$\frac{\partial \ell}{\partial \beta}=\sum_{i=1}^{m} \frac{\partial \ell}{\partial y_{i}}$$
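
As a hedged illustration (not the graded solution), the six formulas above translate almost line for line into NumPy; here `dout` stands for $\partial \ell / \partial y$ with shape (N, D), matching the assignment code further below.

    import numpy as np

    def bn_backward_from_formulas(dout, x, gamma, eps=1e-5):
        """Direct transcription of the summarized gradients (dout = dl/dy, shape (N, D))."""
        N = x.shape[0]
        mu = np.mean(x, axis=0)
        var = np.var(x, axis=0)
        x_hat = (x - mu) / np.sqrt(var + eps)

        dx_hat = dout * gamma                                                  # dl/dx_hat
        dvar = np.sum(dx_hat * (x - mu) * (-0.5) * (var + eps) ** (-1.5), axis=0)
        dmu = np.sum(-dx_hat / np.sqrt(var + eps), axis=0) \
            + dvar * np.sum(-2.0 * (x - mu), axis=0) / N
        dx = dx_hat / np.sqrt(var + eps) + dvar * 2.0 * (x - mu) / N + dmu / N
        dgamma = np.sum(dout * x_hat, axis=0)
        dbeta = np.sum(dout, axis=0)
        return dx, dgamma, dbeta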

Training

  • The batch size should be reasonably large, because the batch statistics are really an estimate of the mean and variance of the whole training set (a small illustration follows this list)
  • The mean and variance cannot be treated as constants: they are functions of $X$ and $W$, so during BP they also contribute to the updates of $W$
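
  • A small, hedged illustration of the batch-size point (synthetic data, NumPy only; the exact numbers vary from run to run): the batch mean and variance deviate less from the full training-set statistics as the batch grows.

      import numpy as np

      rng = np.random.default_rng(0)
      data = rng.normal(loc=2.0, scale=3.0, size=100000)   # stand-in for the whole training set

      for batch_size in (8, 64, 512, 4096):
          batch = rng.choice(data, size=batch_size, replace=False)
          print(f"batch_size={batch_size:5d}  "
                f"|mean error|={abs(batch.mean() - data.mean()):.4f}  "
                f"|var error|={abs(batch.var() - data.var()):.4f}")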

Testing

  • After training has finished, the mean and variance used at test time can be estimated

    • by keeping the mean and variance computed at every iteration during training,
    • and then taking a weighted combination of the means and variances recorded over these updates

  • The means and variances from early iterations differ considerably from those of the fully trained model, so later values receive larger weights: as training progresses the model, and hence its statistics, becomes more accurate (the exponentially decaying running average sketched below has exactly this effect)
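
  • A minimal sketch of such a weighted combination, mirroring the momentum-based running average used in the code below (the per-batch statistics here are made-up stand-ins, purely illustrative):

      import numpy as np

      momentum = 0.9
      running_mean = np.zeros(3)
      running_var = np.zeros(3)

      rng = np.random.default_rng(1)
      for step in range(200):
          # Stand-ins for the per-batch statistics recorded during training;
          # they drift as the model's weights change.
          sample_mean = step / 200.0 + rng.normal(0.0, 0.05, size=3)
          sample_var = 1.0 + step / 400.0 + rng.normal(0.0, 0.05, size=3)
          # Each update scales the old value by `momentum`, so early batches are
          # down-weighted exponentially and recent batches dominate the estimate.
          running_mean = momentum * running_mean + (1 - momentum) * sample_mean
          running_var = momentum * running_var + (1 - momentum) * sample_var

      print(running_mean, running_var)   # used at test time instead of batch statistics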

Benefits

  • It alleviates the Internal Covariate Shift problem, so a larger learning rate can be used and training is faster

  • The values stay close to 0, which reduces how often the data falls into the saturated region of the activation function

    • Falling into the saturated region makes the model too insensitive to changes in the data and weakens its ability to learn
  • The choice of parameter initialization has less influence

  • It acts as a form of regularization and can counteract overfitting to some extent

    • After all, the mean and variance are estimated from a batch rather than from all of the data, so they are noisy estimates

Code completion

  • forward

    def batchnorm_forward(x, gamma, beta, bn_param):
        """
        Forward pass for batch normalization.

        During training the sample mean and (uncorrected) sample variance are
        computed from minibatch statistics and used to normalize the incoming data.
        During training we also keep an exponentially decaying running mean of the
        mean and variance of each feature, and these averages are used to normalize
        data at test-time.

        At each timestep we update the running averages for mean and variance using
        an exponential decay based on the momentum parameter:

        running_mean = momentum * running_mean + (1 - momentum) * sample_mean
        running_var = momentum * running_var + (1 - momentum) * sample_var

        Note that the batch normalization paper suggests a different test-time
        behavior: they compute sample mean and variance for each feature using a
        large number of training images rather than using a running average. For
        this implementation we have chosen to use running averages instead since
        they do not require an additional estimation step; the torch7
        implementation of batch normalization also uses running averages.

        Input:
        - x: Data of shape (N, D)
        - gamma: Scale parameter of shape (D,)
        - beta: Shift parameter of shape (D,)
        - bn_param: Dictionary with the following keys:
          - mode: 'train' or 'test'; required
          - eps: Constant for numeric stability
          - momentum: Constant for running mean / variance.
          - running_mean: Array of shape (D,) giving running mean of features
          - running_var: Array of shape (D,) giving running variance of features

        Returns a tuple of:
        - out: of shape (N, D)
        - cache: A tuple of values needed in the backward pass
        """
        mode = bn_param['mode']
        eps = bn_param.get('eps', 1e-5)
        momentum = bn_param.get('momentum', 0.9)

        N, D = x.shape
        running_mean = bn_param.get('running_mean', np.zeros(D, dtype=x.dtype))
        running_var = bn_param.get('running_var', np.zeros(D, dtype=x.dtype))

        out, cache = None, None
        if mode == 'train':
            #######################################################################
            # TODO: Implement the training-time forward pass for batch norm.      #
            # Use minibatch statistics to compute the mean and variance, use      #
            # these statistics to normalize the incoming data, and scale and      #
            # shift the normalized data using gamma and beta.                     #
            #                                                                     #
            # You should store the output in the variable out. Any intermediates  #
            # that you need for the backward pass should be stored in the cache   #
            # variable.                                                           #
            #                                                                     #
            # You should also use your computed sample mean and variance together #
            # with the momentum variable to update the running mean and running   #
            # variance, storing your result in the running_mean and running_var   #
            # variables.                                                          #
            #                                                                     #
            # Note that though you should be keeping track of the running         #
            # variance, you should normalize the data based on the standard       #
            # deviation (square root of variance) instead!                        #
            # Referencing the original paper (https://arxiv.org/abs/1502.03167)   #
            # might prove to be helpful.                                          #
            #######################################################################
            # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

            sample_mean = np.mean(x, axis=0)  # each column is one feature, so take the mean per feature
            sample_var = np.var(x, axis=0)    # likewise, the per-feature variance
            x_norm = (x - sample_mean) / np.sqrt(sample_var + eps)
            out = gamma * x_norm + beta
            cache = (x, sample_mean, sample_var, x_norm, gamma, beta, eps)

            # Update and store the running statistics for normalizing test data later.
            running_mean = momentum * running_mean + (1 - momentum) * sample_mean
            running_var = momentum * running_var + (1 - momentum) * sample_var

            # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
            #######################################################################
            #                           END OF YOUR CODE                          #
            #######################################################################
        elif mode == 'test':
            #######################################################################
            # TODO: Implement the test-time forward pass for batch normalization. #
            # Use the running mean and variance to normalize the incoming data,   #
            # then scale and shift the normalized data using gamma and beta.      #
            # Store the result in the out variable.                               #
            #######################################################################
            # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

            # Normalize the test data with the stored running statistics
            # (eps goes inside the square root, matching the training branch).
            x_std = (x - bn_param['running_mean']) / np.sqrt(bn_param['running_var'] + eps)
            out = gamma * x_std + beta

            # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
            #######################################################################
            #                          END OF YOUR CODE                           #
            #######################################################################
        else:
            raise ValueError('Invalid forward batchnorm mode "%s"' % mode)

        # Store the updated running means back into bn_param
        bn_param['running_mean'] = running_mean
        bn_param['running_var'] = running_var

        return out, cache
  • backward

    def batchnorm_backward(dout, cache):
        """
        Backward pass for batch normalization.

        For this implementation, you should write out a computation graph for
        batch normalization on paper and propagate gradients backward through
        intermediate nodes.

        Inputs:
        - dout: Upstream derivatives, of shape (N, D)
        - cache: Variable of intermediates from batchnorm_forward.

        Returns a tuple of:
        - dx: Gradient with respect to inputs x, of shape (N, D)
        - dgamma: Gradient with respect to scale parameter gamma, of shape (D,)
        - dbeta: Gradient with respect to shift parameter beta, of shape (D,)
        """
        dx, dgamma, dbeta = None, None, None
        ###########################################################################
        # TODO: Implement the backward pass for batch normalization. Store the    #
        # results in the dx, dgamma, and dbeta variables.                         #
        # Referencing the original paper (https://arxiv.org/abs/1502.03167)       #
        # might prove to be helpful.                                              #
        ###########################################################################
        # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

        N, D = dout.shape
        # Unpack the values stored in the cache.
        x, sample_mean, sample_var, x_norm, gamma, beta, eps = cache

        # This reproduces the derivation above, computing every intermediate gradient.
        dx_norm = dout * gamma  # from out = gamma * x_norm + beta
        dsample_var = np.sum(dx_norm * (-0.5 * x_norm / (sample_var + eps)), axis=0)
        dsample_mean = np.sum(-dx_norm / np.sqrt(sample_var + eps), axis=0) + \
            dsample_var * np.sum(-2.0 / N * (x - sample_mean), axis=0)
        dx1 = dx_norm / np.sqrt(sample_var + eps)
        dx2 = dsample_var * (2.0 / N) * (x - sample_mean)  # from sample_var
        dx3 = dsample_mean * (1.0 / N)                     # from sample_mean
        dx = dx1 + dx2 + dx3
        dgamma = np.sum(dout * x_norm, axis=0)
        dbeta = np.sum(dout, axis=0)

        # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
        ###########################################################################
        #                             END OF YOUR CODE                            #
        ###########################################################################
        return dx, dgamma, dbeta
    
  • backward_alt

    Without keeping the intermediate variables, the result is obtained directly from $\frac{\partial l}{\partial x_{i}}=\gamma \cdot\left(\sigma_{B}^{2}+\epsilon\right)^{-\frac{1}{2}}\left[\frac{\partial l}{\partial y_{i}}-\frac{1}{N} \cdot \frac{\partial l}{\partial \gamma} \cdot \hat{x}_{i}-\frac{1}{N} \cdot \sum_{j=1}^{N} \frac{\partial l}{\partial y_{j}}\right]$ (a quick consistency check against batchnorm_backward is sketched after the code)

    def batchnorm_backward_alt(dout, cache):
        """
        Alternative backward pass for batch normalization.

        For this implementation you should work out the derivatives for the batch
        normalizaton backward pass on paper and simplify as much as possible. You
        should be able to derive a simple expression for the backward pass.
        See the jupyter notebook for more hints.

        Note: This implementation should expect to receive the same cache variable
        as batchnorm_backward, but might not use all of the values in the cache.

        Inputs / outputs: Same as batchnorm_backward
        """
        dx, dgamma, dbeta = None, None, None
        ###########################################################################
        # TODO: Implement the backward pass for batch normalization. Store the    #
        # results in the dx, dgamma, and dbeta variables.                         #
        #                                                                         #
        # After computing the gradient with respect to the centered inputs, you   #
        # should be able to compute gradients with respect to the inputs in a     #
        # single statement; our implementation fits on a single 80-character line.#
        ###########################################################################
        # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

        x, sample_mean, sample_var, x_norm, gamma, beta, eps = cache
        dgamma = np.sum(dout * x_norm, axis=0)
        dbeta = np.sum(dout, axis=0)
        std = (sample_var + eps) ** (-0.5)
        dx = std * gamma * (dout - dgamma * x_norm / x.shape[0] - np.mean(dout, axis=0))

        # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
        ###########################################################################
        #                             END OF YOUR CODE                            #
        ###########################################################################
        return dx, dgamma, dbeta
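
  • A quick, hedged consistency check between the two backward passes, assuming batchnorm_forward, batchnorm_backward and batchnorm_backward_alt are defined as above (the random inputs are only for illustration):

      import numpy as np

      def rel_error(a, b):
          return np.max(np.abs(a - b) / np.maximum(1e-8, np.abs(a) + np.abs(b)))

      np.random.seed(0)
      N, D = 100, 5
      x = 3 * np.random.randn(N, D) + 10
      gamma, beta = np.random.randn(D), np.random.randn(D)
      dout = np.random.randn(N, D)

      _, cache = batchnorm_forward(x, gamma, beta, {'mode': 'train'})
      dx1, dgamma1, dbeta1 = batchnorm_backward(dout, cache)
      dx2, dgamma2, dbeta2 = batchnorm_backward_alt(dout, cache)

      print('dx difference:    ', rel_error(dx1, dx2))        # expected to be at floating-point noise level
      print('dgamma difference:', rel_error(dgamma1, dgamma2))
      print('dbeta difference: ', rel_error(dbeta1, dbeta2))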
    
