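L2 Regularization Explained, with Its Gradient Derivation in Backpropagation

Abstract

This article explains L2 regularization, derives its gradient in backpropagation, and verifies the result with TensorFlow and PyTorch.

Main Text

1. How L2 Regularization Works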

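Suppose a neural network has a parameter matrix $W_{m\times n}$. During training the network produces a loss value (a scalar $e_0$), and the loss after adding an L2 regularization term on $W$ is $e$. Given that the gradient of $e_0$ with respect to $W$ is $\nabla {e_0}_{(W)}$, find the gradient of $e$ with respect to $W$.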

From the problem statement:
$$e = e_0+r\\ \;\\ r = \frac{\lambda}{2}\sum_{i=1}^{m}\sum_{j=1}^{n}w_{ij}^2$$
where $\lambda$ is the decay coefficient of the regularization term and $w_{ij}$ are the elements of the matrix $W$.

Gradient derivation: since $\frac{\partial r}{\partial w_{ij}} = \lambda w_{ij}$ for every element, $\frac{dr}{dW} = \lambda W$, and therefore
$$\frac{de}{dW}=\frac{de_0}{dW} + \frac{dr}{dW}=\nabla {e_0}_{(W)} +\lambda W$$
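The regularizer's gradient $\lambda W$ can be sanity-checked numerically. The following is a minimal sketch using only the Python standard library; the matrix size, the value of `lam`, and the step `eps` are illustrative assumptions, not values from the article. It compares a central finite difference of $r$ against the analytic value $\lambda w_{ij}$ for every element.

```python
import random

random.seed(0)
m, n = 3, 4
lam = 0.9  # regularization coefficient (lambda); illustrative value
W = [[random.random() for _ in range(n)] for _ in range(m)]

def reg(W):
    """r = (lambda / 2) * sum of squared entries of W."""
    return 0.5 * lam * sum(w * w for row in W for w in row)

eps = 1e-6
max_err = 0.0
for i in range(m):
    for j in range(n):
        orig = W[i][j]
        W[i][j] = orig + eps
        r_plus = reg(W)
        W[i][j] = orig - eps
        r_minus = reg(W)
        W[i][j] = orig  # restore the entry
        numeric = (r_plus - r_minus) / (2 * eps)  # central difference
        analytic = lam * orig                     # claimed gradient entry
        max_err = max(max_err, abs(numeric - analytic))

print("max abs error:", max_err)
```

Because $r$ is quadratic in each $w_{ij}$, the central difference is exact up to floating-point rounding, so the reported error should be tiny.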
Substituting into the gradient descent update, before adding the regularization term:
$$W^{(i+1)} = W^{(i)}-\eta\nabla {e_0}_{(W)}$$
After adding the L2 regularization term:
$$W^{(i+1)} = W^{(i)}-\eta(\nabla {e_0}_{(W)}+\lambda W^{(i)})= (1-\eta\lambda)W^{(i)}-\eta\nabla {e_0}_{(W)}$$
We can see that $W^{(i)}$ is now multiplied by the coefficient $(1-\eta\lambda)$, which is less than 1 whenever $0 < \eta\lambda < 1$.

For this reason, L2 regularization is also called the L2 penalty or weight decay.

2. Implementation
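To make the "weight decay" reading concrete, here is a small sketch on a single scalar weight, using only plain Python. The values of `eta`, `lam`, `w`, and the data gradient `g0` are hypothetical and chosen only for illustration. It checks that a gradient step on the regularized loss equals "shrink, then step", and that with a zero data gradient the weight decays geometrically by the factor $(1-\eta\lambda)$ per step.

```python
eta, lam = 0.1, 0.9  # learning rate and decay coefficient; illustrative values
w = 2.0              # a single weight, standing in for any entry of W
g0 = 0.5             # hypothetical gradient of e0 with respect to w

# Form 1: gradient step on the regularized loss e = e0 + (lam / 2) * w^2
step_on_e = w - eta * (g0 + lam * w)
# Form 2: shrink the weight by (1 - eta * lam), then take the unregularized step
decay_form = (1 - eta * lam) * w - eta * g0

# With a zero data gradient, repeated updates shrink the weight geometrically
k = 10
w_k = w
for _ in range(k):
    w_k = w_k - eta * (0.0 + lam * w_k)
closed_form = (1 - eta * lam) ** k * w

print(step_on_e, decay_form)
print(w_k, closed_form)
```

The two printed pairs should agree up to floating-point rounding.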

2.1 TensorFlow

import numpy as np
import tensorflow as tf

# TF 1.x API; TF 2.x executes eagerly by default, so this call can be omitted there
tf.enable_eager_execution()
np.random.seed(123)
np.set_printoptions(8, suppress=True)

x_numpy = np.random.random((3, 4)).astype(np.double)
x_tensor = tf.Variable(x_numpy)
w_numpy = np.random.random((4, 5)).astype(np.double)
w_tensor = tf.Variable(w_numpy)

weight_decay = 0.9

with tf.GradientTape(persistent=True) as tape:
    # Unregularized loss
    loss = tf.reduce_sum(tf.matmul(x_tensor, w_tensor))
    # The penalty here is weight_decay * sum(w^2), without the 1/2 factor,
    # so its gradient is 2 * weight_decay * w
    loss2 = loss + tf.reduce_sum(tf.square(w_tensor)) * weight_decay

grad = tape.gradient(loss, w_tensor).numpy()
grad2 = tape.gradient(loss2, w_tensor).numpy()

print("check_grad")
print(grad + 2 * w_numpy * weight_decay)  # hand-derived gradient
print(grad2)                              # gradient computed by TensorFlow

"""
check_grad
[[ 2.6863001   2.00429027  2.61334972  3.22526179  2.22535517]
 [ 1.41717647  2.05815579  2.05865297  2.24328504  2.63034054]
 [ 2.85481325  2.65063599  2.85119176  2.13211971  2.20201325]
 [ 2.37606803  2.4938795   3.10095124  2.13098311  2.74585633]]
[[ 2.6863001   2.00429027  2.61334972  3.22526179  2.22535517]
 [ 1.41717647  2.05815579  2.05865297  2.24328504  2.63034054]
 [ 2.85481325  2.65063599  2.85119176  2.13211971  2.20201325]
 [ 2.37606803  2.4938795   3.10095124  2.13098311  2.74585633]]
"""

2.2 PyTorch

PyTorch implements L2 regularization in torch.optim.SGD with the following update:
$$W^{(i+1)} = W^{(i)}-\eta G^{(i)}\\ \;\\ G^{(i)} =\frac{dL}{dW^{(i)}} + \lambda W^{(i)}$$
Here $\frac{dL}{dW^{(i)}}$ is the gradient before L2 regularization, $\lambda$ is the weight decay coefficient, and $G^{(i)}$ is the gradient used for the update.

Verification code:

import torch
import numpy as np

np.random.seed(123)
np.set_printoptions(8, suppress=True)

x_numpy = np.random.random((3, 4)).astype(np.double)
x_torch = torch.tensor(x_numpy, requires_grad=True)
x_torch2 = torch.tensor(x_numpy, requires_grad=True)

w_numpy = np.random.random((4, 5)).astype(np.double)
w_torch = torch.tensor(w_numpy, requires_grad=True)
w_torch2 = torch.tensor(w_numpy, requires_grad=True)

lr = 0.1
weight_decay = 0.9
sgd = torch.optim.SGD([w_torch], lr=lr, weight_decay=0)               # no regularization
sgd2 = torch.optim.SGD([w_torch2], lr=lr, weight_decay=weight_decay)  # with weight decay

y_torch = torch.matmul(x_torch, w_torch)
y_torch2 = torch.matmul(x_torch2, w_torch2)

loss = y_torch.sum()
loss2 = y_torch2.sum()

sgd.zero_grad()
sgd2.zero_grad()

loss.backward()
loss2.backward()

sgd.step()
sgd2.step()

# This check relies on step() adding weight_decay * w to .grad in place, as older
# PyTorch releases did; newer releases may leave .grad untouched, in which case
# w_grad2 holds only the raw gradient.
w_grad = w_torch.grad.data.numpy()
w_grad2 = w_torch2.grad.data.numpy()

print("check_grad")
print(w_grad + weight_decay * w_numpy)  # hand-derived update gradient G
print(w_grad2)                          # gradient stored by the optimizer with weight decay

"""
check_grad
[[ 2.29158508  1.95058016  2.25510989  2.56106592  2.06111261]
 [ 1.25926989  1.57975955  1.58000814  1.67232418  1.86585193]
 [ 2.20280346  2.10071483  2.20099271  1.84145669  1.87640346]
 [ 2.17063112  2.22953686  2.53307273  2.04808866  2.35552527]]
[[ 2.29158508  1.95058016  2.25510989  2.56106592  2.06111261]
 [ 1.25926989  1.57975955  1.58000814  1.67232418  1.86585193]
 [ 2.20280346  2.10071483  2.20099271  1.84145669  1.87640346]
 [ 2.17063112  2.22953686  2.53307273  2.04808866  2.35552527]]
"""
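Another way to check the same rule is to compare the updated weights rather than the stored gradient, which does not depend on whether the optimizer modifies `.grad` in place. The sketch below is an assumption-laden illustration rather than the article's own verification: it uses `torch.rand` with a fixed seed instead of the NumPy data above, and plain SGD without momentum. It checks that after one step the weights equal $(1-\eta\lambda)W^{(i)}-\eta\frac{dL}{dW^{(i)}}$.

```python
import torch

torch.manual_seed(0)
lr, weight_decay = 0.1, 0.9  # same illustrative values as in the code above

x = torch.rand(3, 4, dtype=torch.float64)
w = torch.rand(4, 5, dtype=torch.float64, requires_grad=True)
w0 = w.detach().clone()  # weights before the update

sgd = torch.optim.SGD([w], lr=lr, weight_decay=weight_decay)
sgd.zero_grad()
torch.matmul(x, w).sum().backward()
raw_grad = w.grad.detach().clone()  # copy before step(), in case step() edits .grad
sgd.step()

# Weight-decay form of the update: (1 - lr * lambda) * W - lr * dL/dW
expected = (1 - lr * weight_decay) * w0 - lr * raw_grad
matches = torch.allclose(w.detach(), expected)
print(matches)
```

Copying the gradient before calling `step()` keeps the comparison valid for both in-place and out-of-place handling of the decay term.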