导包和处理数据

BatchNorm

forward

backward

训练BatchNorm并显示结果

Batch Normalization 和初始化

Batch Normalization 和 Batch Size

Layer Normalization

Layer Normalization 和 Batch Size

卷积层的batch norm--spatial batchnorm

Spatial Group Normalization

dropout

Regularization Experiment

Solver和网络

solver

网络

网络用到的辅助函数（各层）

损失函数

The authors of [1] hypothesize that the shifting distribution of features inside deep neural networks may make training deep networks more difficult. To overcome this problem, they propose to insert into the network layers that normalize batches. At training time, such a layer uses a minibatch of data to estimate the mean and standard deviation of each feature. These estimated means and standard deviations are then used to center and normalize the features of the minibatch. A running average of these means and standard deviations is kept during training, and at test time these running averages are used to center and normalize features.

It is possible that this normalization strategy could reduce the representational power of the network, since it may sometimes be optimal for certain layers to have features that are not zero-mean or unit variance. To this end, the batch normalization layer includes learnable shift and scale parameters for each feature dimension.

Sergey Ioffe and Christian Szegedy, "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift", ICML 2015.

导包和处理数据

# Setup cell.
import time
import numpy as np
import matplotlib.pyplot as plt
from cs231n.data_utils import get_CIFAR10_data
from cs231n.gradient_check import eval_numerical_gradient, eval_numerical_gradient_array%matplotlib inline
plt.rcParams["figure.figsize"] = (10.0, 8.0)  # Set default size of plots.
plt.rcParams["image.interpolation"] = "nearest"
plt.rcParams["image.cmap"] = "gray"%load_ext autoreload
%autoreload 2def rel_error(x, y):"""Returns relative error."""return np.max(np.abs(x - y) / (np.maximum(1e-8, np.abs(x) + np.abs(y))))def print_mean_std(x,axis=0):print(f"  means: {x.mean(axis=axis)}")print(f"  stds:  {x.std(axis=axis)}\n")# Load the (preprocessed) CIFAR-10 data.
data = get_CIFAR10_data()
for k, v in list(data.items()):print(f"{k}: {v.shape}")

BatchNorm

forward

The outputs from the last layer of the network should not be normalized.

def batchnorm_forward(x, gamma, beta, bn_param):"""During training the sample mean and (uncorrected) sample variance arecomputed from minibatch statistics and used to normalize the incoming data.During training we also keep an exponentially decaying running mean of themean and variance of each feature, and these averages are used to normalizedata at test-time.At each timestep we update the running averages for mean and variance usingan exponential decay based on the momentum parameter:running_mean = momentum * running_mean + (1 - momentum) * sample_meanrunning_var = momentum * running_var + (1 - momentum) * sample_varNote that the batch normalization paper suggests a different test-timebehavior: they compute sample mean and variance for each feature using alarge number of training images rather than using a running average. Forthis implementation we have chosen to use running averages instead sincethey do not require an additional estimation step; the torch7implementation of batch normalization also uses running averages.Input:- x: Data of shape (N, D)- gamma: Scale parameter of shape (D,)- beta: Shift paremeter of shape (D,)- bn_param: Dictionary with the following keys:- mode: 'train' or 'test'; required- eps: Constant for numeric stability- momentum: Constant for running mean / variance.- running_mean: Array of shape (D,) giving running mean of features- running_var Array of shape (D,) giving running variance of featuresReturns a tuple of:- out: of shape (N, D)- cache: A tuple of values needed in the backward pass"""mode = bn_param["mode"]eps = bn_param.get("eps", 1e-5)momentum = bn_param.get("momentum", 0.9)N, D = x.shaperunning_mean = bn_param.get("running_mean", np.zeros(D, dtype=x.dtype))running_var = bn_param.get("running_var", np.zeros(D, dtype=x.dtype))out, cache = None, Noneif mode == "train":x_mean = np.mean(x, axis = 0)x_var = np.var(x, axis = 0) + epsx_norm = (x - x_mean) / np.sqrt(x_var)out = x_norm * gamma  + beta  #  (N, D)cache = (x, x_norm, gamma, x_mean, x_var)# Store the updated running means back into bn_parambn_param["running_mean"] = momentum * running_mean + (1 - momentum) * x_meanbn_param["running_var"] = momentum * running_var + (1 - momentum) * x_var        elif mode == "test":x_norm = (x - running_mean) / np.sqrt(running_var + eps) out = gamma * x_norm + betaelse:raise ValueError('Invalid forward batchnorm mode "%s"' % mode)return out, cache

backward

可以参考：

【cs231n】Batchnorm及其反向传播_JoeYF_的博客-CSDN博客_batchnorm反向传播

Batch Normalization的反向传播解说_macan_dct的博客-CSDN博客_batchnorm反向传播

Understanding the backward pass through Batch Normalization Layer

理解Batch Normalization（批量归一化）_坚硬果壳_的博客-CSDN博客

步骤9：

步骤8：

步骤7：

步骤6：

步骤5：

步骤4：

步骤3：

步骤2：

步骤1：

步骤0：

复杂版

def batchnorm_backward(dout, cache):"""  Inputs:- dout: Upstream derivatives, of shape (N, D)- cache: Variable of intermediates from batchnorm_forward.Returns a tuple of:- dx: Gradient with respect to inputs x, of shape (N, D)- dgamma: Gradient with respect to scale parameter gamma, of shape (D,)- dbeta: Gradient with respect to shift parameter beta, of shape (D,)"""dx, dgamma, dbeta = None, None, Nonex, x_norm, gamma, mean, var = cacheN, D = dout.shapedbeta = np.sum(dout, axis = 0)dgamma = np.sum(dout * x_norm, axis = 0)  #  (N, D) * (N, D)# 注意求dgamma的时候没有使用矩阵乘法。在FC层的反向传播需要中矩阵乘法#可能因为正向传播的时候就没有使用真正的矩阵乘法，用的是(N, D) * (D,)dx_norm = dout * gamma  # (N, D) * (D,) x_norm是归一处理后的x      x_mu = x - meanstd = np.sqrt(var)  # var已经包含了epsivar = 1 / std # (D,)   # 求x_norm的式子中，分子分母中都含有x_mu，先求对这个x_mu的导数# 将std的倒数视作一个乘数dxmu1 = dx_norm * ivar  # (N, D) * (D,) divar = np.sum(dx_norm * x_mu, axis = 0)  # dsqrtvar = - 1/ (std ** 2) * divar  # 倒数对根号项的导数 根号项正是stddvar = 0.5 / std * dsqrtvar  # 根号项对var(根号项中非eps项)的导数dsquare = 1/N * np.ones((N, D)) * dvar  #求var时对（N,D）矩阵进行了求列平均操作dxmu2 = 2 * x_mu * dsquare # 平方项对x_mu求导dx1 = dxmu1 + dxmu2#下面求x_mu对x的导数，注意平均值mu也是x的函数，所以导数是一个和# 先对mu求导，然后mu对x求导    dmu = -1 * np.sum(dx1, axis = 0)  # (D,)  广播的反向传播dx2 = 1/N * np.ones((N, D)) * dmu  # 求mu时采用了求平均的操作dx = dx1 + dx2return dx, dgamma, dbeta

快速版

def batchnorm_backward_alt(dout, cache):"""derive a simple expression for the backward pass.   """dx, dgamma, dbeta = None, None, Nonex, x_norm, gamma, mean, var = cachestd = np.sqrt(var)dgamma = np.sum(dout * x_norm, axis = 0)  #  (N, D) * (N, D)dbeta = np.sum(dout, axis = 0)N = 1.0 * x.shape[0]dfdu = dout * gamma# 将x_norm拆成x，mu，var三项，分别对x求导，然后加起来dfdv = np.sum(dfdu * (x - mean) * -0.5 * var ** -1.5, axis = 0) # f对var求导   dfdw = np.sum(dfdu * -1 / std, axis = 0) + \dfdv * np.sum(-2/N * (x - mean), axis = 0)  # f对mu求导，包括var项对mu求导dx = dfdu / std + dfdv * 2/N * (x - mean) + dfdw / N  # 三项分别对x求导并求和'''# 也可以用下面方式算，看不懂N = dout.shape[0]         dfdz = dout * gamma                          # 下面用 z 指代 x_norm  [NxD]dfdz_sum = np.sum(dfdz,axis=0)                                      #[1xD]dx = dfdz - dfdz_sum/N - np.sum(dfdz * x_norm,axis=0) * x_norm/N    #[NxD]dx /= std'''return dx, dgamma, dbeta

训练BatchNorm并显示结果

训练

np.random.seed(231)# Try training a very deep net with batchnorm.
hidden_dims = [100, 100, 100, 100, 100]num_train = 1000
small_data = {'X_train': data['X_train'][:num_train],'y_train': data['y_train'][:num_train],'X_val': data['X_val'],'y_val': data['y_val'],
}weight_scale = 2e-2
bn_model = FullyConnectedNet(hidden_dims, weight_scale=weight_scale, normalization='batchnorm')
model = FullyConnectedNet(hidden_dims, weight_scale=weight_scale, normalization=None)print('Solver with batch norm:')
bn_solver = Solver(bn_model, small_data,num_epochs=10, batch_size=50,update_rule='adam',optim_config={'learning_rate': 1e-3,},verbose=True,print_every=20)
bn_solver.train()print('\nSolver without batch norm:')
solver = Solver(model, small_data,num_epochs=10, batch_size=50,update_rule='adam',optim_config={'learning_rate': 1e-3,},verbose=True, print_every=20)
solver.train()

Solver with batch norm:
(Iteration 1 / 200) loss: 2.340975
(Epoch 0 / 10) train acc: 0.107000; val_acc: 0.115000
(Epoch 1 / 10) train acc: 0.314000; val_acc: 0.266000
……
(Epoch 9 / 10) train acc: 0.767000; val_acc: 0.318000
(Iteration 181 / 200) loss: 0.888016
(Epoch 10 / 10) train acc: 0.808000; val_acc: 0.333000Solver without batch norm:
(Iteration 1 / 200) loss: 2.302332
(Epoch 0 / 10) train acc: 0.129000; val_acc: 0.131000
……
(Iteration 161 / 200) loss: 1.034116
(Epoch 9 / 10) train acc: 0.654000; val_acc: 0.342000
(Iteration 181 / 200) loss: 0.905795
(Epoch 10 / 10) train acc: 0.714000; val_acc: 0.331000

显示结果

def plot_training_history(title, label, baseline, bn_solvers, plot_fn, bl_marker='.', bn_marker='.', labels=None):"""utility function for plotting training history"""plt.title(title)plt.xlabel(label)bn_plots = [plot_fn(bn_solver) for bn_solver in bn_solvers]bl_plot = plot_fn(baseline)num_bn = len(bn_plots)for i in range(num_bn):label='with_norm'if labels is not None:label += str(labels[i])plt.plot(bn_plots[i], bn_marker, label=label)label='baseline'if labels is not None:label += str(labels[0])plt.plot(bl_plot, bl_marker, label=label)plt.legend(loc='lower center', ncol=num_bn+1) plt.subplot(3, 1, 1)
plot_training_history('Training loss','Iteration', solver, [bn_solver], \lambda x: x.loss_history, bl_marker='o', bn_marker='o')
plt.subplot(3, 1, 2)
plot_training_history('Training accuracy','Epoch', solver, [bn_solver], \lambda x: x.train_acc_history, bl_marker='-o', bn_marker='-o')
plt.subplot(3, 1, 3)
plot_training_history('Validation accuracy','Epoch', solver, [bn_solver], \lambda x: x.val_acc_history, bl_marker='-o', bn_marker='-o')plt.gcf().set_size_inches(15, 15)
plt.show()

You should find that using batch normalization helps the network to converge much faster.但是在测试结果上差别不大

Batch Normalization 和初始化

We will now run a small experiment to study the interaction of batch normalization and weight initialization.

The first cell will train eight-layer networks both with and without batch normalization using different scales for weight initialization. The second layer will plot training accuracy, validation set accuracy, and training loss as a function of the weight initialization scale.

np.random.seed(231)# Try training a very deep net with batchnorm.
hidden_dims = [50, 50, 50, 50, 50, 50, 50]
num_train = 1000
small_data = {'X_train': data['X_train'][:num_train],'y_train': data['y_train'][:num_train],'X_val': data['X_val'],'y_val': data['y_val'],
}bn_solvers_ws = {}
solvers_ws = {}
weight_scales = np.logspace(-4, 0, num=20)
for i, weight_scale in enumerate(weight_scales):print('Running weight scale %d / %d' % (i + 1, len(weight_scales)))bn_model = FullyConnectedNet(hidden_dims, weight_scale=weight_scale, normalization='batchnorm')model = FullyConnectedNet(hidden_dims, weight_scale=weight_scale, normalization=None)bn_solver = Solver(bn_model, small_data,num_epochs=10, batch_size=50,update_rule='adam',optim_config={'learning_rate': 1e-3,},verbose=False, print_every=200)bn_solver.train()bn_solvers_ws[weight_scale] = bn_solversolver = Solver(model, small_data,num_epochs=10, batch_size=50,update_rule='adam',optim_config={'learning_rate': 1e-3,},verbose=False, print_every=200)solver.train()solvers_ws[weight_scale] = solver

打印结果

# Plot results of weight scale experiment.
best_train_accs, bn_best_train_accs = [], []
best_val_accs, bn_best_val_accs = [], []
final_train_loss, bn_final_train_loss = [], []for ws in weight_scales:best_train_accs.append(max(solvers_ws[ws].train_acc_history))bn_best_train_accs.append(max(bn_solvers_ws[ws].train_acc_history))best_val_accs.append(max(solvers_ws[ws].val_acc_history))bn_best_val_accs.append(max(bn_solvers_ws[ws].val_acc_history))final_train_loss.append(np.mean(solvers_ws[ws].loss_history[-100:]))bn_final_train_loss.append(np.mean(bn_solvers_ws[ws].loss_history[-100:]))plt.subplot(3, 1, 1)
plt.title('Best val accuracy vs. weight initialization scale')
plt.xlabel('Weight initialization scale')
plt.ylabel('Best val accuracy')
plt.semilogx(weight_scales, best_val_accs, '-o', label='baseline')
plt.semilogx(weight_scales, bn_best_val_accs, '-o', label='batchnorm')
plt.legend(ncol=2, loc='lower right')plt.subplot(3, 1, 2)
plt.title('Best train accuracy vs. weight initialization scale')
plt.xlabel('Weight initialization scale')
plt.ylabel('Best training accuracy')
plt.semilogx(weight_scales, best_train_accs, '-o', label='baseline')
plt.semilogx(weight_scales, bn_best_train_accs, '-o', label='batchnorm')
plt.legend()plt.subplot(3, 1, 3)
plt.title('Final training loss vs. weight initialization scale')
plt.xlabel('Weight initialization scale')
plt.ylabel('Final training loss')
plt.semilogx(weight_scales, final_train_loss, '-o', label='baseline')
plt.semilogx(weight_scales, bn_final_train_loss, '-o', label='batchnorm')
plt.legend()
plt.gca().set_ylim(1.0, 3.5)plt.gcf().set_size_inches(15, 15)
plt.show()

可以发现使用了Batch Normalization 之后，模型对初始化的鲁棒性更强了

Batch Normalization 和 Batch Size

We will now run a small experiment to study the interaction of batch normalization and batch size.

The first cell will train 6-layer networks both with and without batch normalization using different batch sizes. The second layer will plot training accuracy and validation set accuracy over time.

def run_batchsize_experiments(normalization_mode):np.random.seed(231)# Try training a very deep net with batchnorm.hidden_dims = [100, 100, 100, 100, 100]num_train = 1000small_data = {'X_train': data['X_train'][:num_train],'y_train': data['y_train'][:num_train],'X_val': data['X_val'],'y_val': data['y_val'],}n_epochs=10weight_scale = 2e-2batch_sizes = [8,16,32]lr = 10**(-3.5)solver_bsize = batch_sizes[0]print('No normalization: batch size = ',solver_bsize)model = FullyConnectedNet(hidden_dims, weight_scale=weight_scale, normalization=None)solver = Solver(model, small_data,num_epochs=n_epochs, batch_size=solver_bsize,update_rule='adam',optim_config={'learning_rate': lr,},verbose=False)solver.train()bn_solvers = []for i in range(len(batch_sizes)):b_size=batch_sizes[i]print('Normalization: batch size = ',b_size)bn_model = FullyConnectedNet(hidden_dims, weight_scale=weight_scale, normalization=normalization_mode)bn_solver = Solver(bn_model, small_data,num_epochs=n_epochs, batch_size=b_size,update_rule='adam',optim_config={'learning_rate': lr,},verbose=False)bn_solver.train()bn_solvers.append(bn_solver)return bn_solvers, solver, batch_sizesbn_solvers_bsize, solver_bsize, batch_sizes = run_batchsize_experiments('batchnorm')

画出结果

plt.subplot(2, 1, 1)
plot_training_history('Training accuracy (Batch Normalization)','Epoch', solver_bsize, bn_solvers_bsize, \lambda x: x.train_acc_history, bl_marker='-^', bn_marker='-o', labels=batch_sizes)
plt.subplot(2, 1, 2)
plot_training_history('Validation accuracy (Batch Normalization)','Epoch', solver_bsize, bn_solvers_bsize, \lambda x: x.val_acc_history, bl_marker='-^', bn_marker='-o', labels=batch_sizes)plt.gcf().set_size_inches(15, 10)
plt.show()

batch size大，训练效果好。 batch size小的话可能还不如不用。batch size对测试结果影响不大

Layer Normalization

Batch normalization has proved to be effective in making networks easier to train, but the dependency on batch size makes it less useful in complex networks which have a cap on the input batch size due to hardware limitations.

Layer Normalization：Instead of normalizing over the batch, we normalize over the features. In other words, when using Layer Normalization, each feature vector corresponding to a single datapoint is normalized based on the sum of all terms within that feature vector.

Ba, Jimmy Lei, Jamie Ryan Kiros, and Geoffrey E. Hinton. "Layer Normalization." stat 1050 (2016): 21.
For layer normalization, we do not keep track of the moving moments, and the testing phase is identical to the training phase, where the mean and variance are directly calculated per datapoint.

def layernorm_forward(x, gamma, beta, ln_param):"""Note that in contrast to batch normalization, the behavior during train and test-time forlayer normalization are identical, and we do not need to keep track of running averagesof any sort.Input:- x: Data of shape (N, D)- gamma: Scale parameter of shape (D,)- beta: Shift paremeter of shape (D,)- ln_param: Dictionary with the following keys:- eps: Constant for numeric stabilityReturns a tuple of:- out: of shape (N, D)- cache: A tuple of values needed in the backward pass"""out, cache = None, Noneeps = ln_param.get("eps", 1e-5)x_mean = np.mean(x, axis = 1, keepdims = True)x_var = np.var(x, axis = 1, keepdims = True) + epsx_std = np.sqrt(x_var)x_norm = (x - x_mean) / x_stdout = x_norm *gamma + betacache = (gamma, x, x_norm, x_mean, x_var)return out, cachedef layernorm_backward(dout, cache):gamma,x, x_norm,  mean, var = cachestd = np.sqrt(var)dgamma = np.sum(dout * x_norm, axis = 0)dbeta = np.sum(dout, axis = 0)D = 1.0 * x.shape[1]   dfdu = dout * gammadfdv = np.sum(dfdu * (x - mean) * -0.5 * var ** -1.5, axis = 1)dfdw = np.sum(dfdu * -1 / std, axis = 1) + dfdv * np.sum(-2/D * (x - mean), axis = 1)# dfdv 和 dfdw.shape的形状都是（N,） dfdv.reshape(-1, 1)形状是（N，1）dx = dfdu / std + dfdv.reshape(-1, 1) * 2/D * (x - mean) + dfdw.reshape(-1, 1) / Dreturn dx, dgamma, dbeta

Layer Normalization 和 Batch Size

ln_solvers_bsize, solver_bsize, batch_sizes = run_batchsize_experiments('layernorm')plt.subplot(2, 1, 1)
plot_training_history('Training accuracy (Layer Normalization)','Epoch', solver_bsize, ln_solvers_bsize, \lambda x: x.train_acc_history, bl_marker='-^', bn_marker='-o', labels=batch_sizes)
plt.subplot(2, 1, 2)
plot_training_history('Validation accuracy (Layer Normalization)','Epoch', solver_bsize, ln_solvers_bsize, \lambda x: x.val_acc_history, bl_marker='-^', bn_marker='-o', labels=batch_sizes)plt.gcf().set_size_inches(15, 10)
plt.show()

Compared to the previous experiment, you should see a markedly smaller influence of batch size on the training history!

卷积层的batch norm--spatial batchnorm

If the feature map was produced using convolutions, then we expect every feature channel's statistics e.g. mean, variance to be relatively consistent both between different images, and different locations within the same image -- after all, every feature channel is produced by the same convolutional filter! Therefore, spatial batch normalization computes a mean and variance for each of the C feature channels by computing statistics over the minibatch dimension N as well the spatial dimensions H and W.

正向传播

def spatial_batchnorm_forward(x, gamma, beta, bn_param):"""Inputs:- x: Input data of shape (N, C, H, W)- gamma: Scale parameter, of shape (C,)- beta: Shift parameter, of shape (C,)- bn_param: Dictionary with the following keys:- mode: 'train' or 'test'; required- eps: Constant for numeric stability- momentum: Constant for running mean / variance. momentum=0 means thatold information is discarded completely at every time step, whilemomentum=1 means that new information is never incorporated. Thedefault of momentum=0.9 should work well in most situations.- running_mean: Array of shape (D,) giving running mean of features- running_var Array of shape (D,) giving running variance of featuresReturns a tuple of:- out: Output data, of shape (N, C, H, W)- cache: Values needed for the backward pass"""out, cache = None, NoneN, C, H, W = x.shapex_trans = x.transpose(0,2,3,1).reshape(N*H*W, C)  # 对每一个C进行归一化out1, cache = batchnorm_forward(x_trans, gamma, beta, bn_param)out = out1.reshape(N, H, W, C).transpose(0,3,1,2)  return out, cache

反向传播

def spatial_batchnorm_backward(dout, cache):"""Computes the backward pass for spatial batch normalization.Inputs:- dout: Upstream derivatives, of shape (N, C, H, W)- cache: Values from the forward passReturns a tuple of:- dx: Gradient with respect to inputs, of shape (N, C, H, W)- dgamma: Gradient with respect to scale parameter, of shape (C,)- dbeta: Gradient with respect to shift parameter, of shape (C,)"""N, C, H, W = dout.shapedout_trans = dout.transpose(0,2,3,1).reshape(N*H*W, C)dx, dgamma, dbeta = batchnorm_backward_alt(dout_trans, cache)dx = dx.reshape(N, H, W, C).transpose(0,3,1,2)return dx, dgamma, dbeta

Spatial Group Normalization

as the authors of [2] observed, Layer Normalization does not perform as well as Batch Normalization when used with Convolutional Layers:

With fully connected layers, all the hidden units in a layer tend to make similar contributions to the final prediction, and re-centering and rescaling the summed inputs to a layer works well. However, the assumption of similar contributions is no longer true for convolutional neural networks. The large number of the hidden units whose receptive fields lie near the boundary of the image are rarely turned on and thus have very different statistics from the rest of the hidden units within the same layer.

The authors of [3] propose an intermediary technique. In contrast to Layer Normalization, where you normalize over the entire feature per-datapoint, they suggest a consistent splitting of each per-datapoint feature into G groups and a per-group per-datapoint normalization instead.

Even though an assumption of equal contribution is still being made within each group, the authors hypothesize that this is not as problematic, as innate grouping arises within features for visual recognition. One example they use to illustrate this is that many high-performance handcrafted features in traditional computer vision have terms that are explicitly grouped together. Take for example Histogram of Oriented Gradients [4] -- after computing histograms per spatially local block, each per-block histogram is normalized before being concatenated together to form the final feature vector.

[2] Ba, Jimmy Lei, Jamie Ryan Kiros, and Geoffrey E. Hinton. "Layer Normalization." stat 1050 (2016): 21.

[3] Wu, Yuxin, and Kaiming He. "Group Normalization." arXiv preprint arXiv:1803.08494 (2018).

[4] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In Computer Vision and Pattern Recognition (CVPR), 2005.

正向传播

def spatial_groupnorm_forward(x, gamma, beta, G, gn_param):"""In contrast to layer normalization, group normalization splits each entry in the data into Gcontiguous pieces, which it then normalizes independently. Per-feature shifting and scalingare then applied to the data, in a manner identical to that of batch normalization and layernormalization.Inputs:- x: Input data of shape (N, C, H, W)- gamma: Scale parameter, of shape (1, C, 1, 1)- beta: Shift parameter, of shape (1, C, 1, 1)- G: Integer mumber of groups to split into, should be a divisor of C- gn_param: Dictionary with the following keys:- eps: Constant for numeric stabilityReturns a tuple of:- out: Output data, of shape (N, C, H, W)- cache: Values needed for the backward pass"""out, cache = None, Noneeps = gn_param.get("eps", 1e-5)N, C, H, W = x.shapex_trans = x.reshape(N, G, C//G, H, W)mean = np.mean(x_trans, axis = (2,3,4), keepdims = True)var = np.var(x_trans, axis = (2,3,4), keepdims = True) + epsstd = np.sqrt(var)x_norm = (x_trans - mean) / stdx_norm = x_norm.reshape(x.shape)out = x_norm * gamma + betacache = (x, x_norm, gamma, mean, var, std, G)return out, cache

反向传播

def spatial_groupnorm_backward(dout, cache):"""Inputs:- dout: Upstream derivatives, of shape (N, C, H, W)- cache: Values from the forward passReturns a tuple of:- dx: Gradient with respect to inputs, of shape (N, C, H, W)- dgamma: Gradient with respect to scale parameter, of shape (1, C, 1, 1)- dbeta: Gradient with respect to shift parameter, of shape (1, C, 1, 1)"""dx, dgamma, dbeta = None, None, Nonex, x_norm, gamma, mean, var, std, G = cacheN, C, H, W = x.shapedgamma = np.sum(dout * x_norm, axis = (0,2,3))[None, :, None, None]dbeta = np.sum(dout, axis = (0,2,3))[None, :, None, None]x_trans = x.reshape(N, G, C//G, H, W)M = C // G * H * Wdfdu = (dout * gamma).reshape(N, G, C//G, H, W)dfdv = np.sum(dfdu * (x_trans - mean) * -0.5 * var ** -1.5, axis = (2,3,4))dfdw = np.sum(dfdu * -1 / std, axis = (2,3,4)) + dfdv * np.sum(-2/N * (x_trans - mean), axis = (2,3,4))dx = dfdu / std + dfdv.reshape(N, G, 1, 1, 1) * 2/M * (x_trans - mean) + dfdw.reshape(N, G, 1, 1, 1) / Mdx = dx.reshape(x.shape)return dx, dgamma, dbeta

dropout

正向传播

def dropout_forward(x, dropout_param):"""Note that this is different from the vanilla version of dropout.Here, p is the probability of keeping a neuron output, as opposed tothe probability of dropping a neuron output.Inputs:- x: Input data, of any shape- dropout_param: A dictionary with the following keys:- p: Dropout parameter. We keep each neuron output with probability p.- mode: 'test' or 'train'. If the mode is train, then perform dropout;if the mode is test, then just return the input.Outputs:- out: Array of the same shape as x.- cache: tuple (dropout_param, mask). In training mode, mask is the dropoutmask that was used to multiply the input; in test mode, mask is None."""p, mode = dropout_param["p"], dropout_param["mode"]if "seed" in dropout_param:np.random.seed(dropout_param["seed"])mask = Noneout = Noneif mode == "train":mask = (np.random.rand(*x.shape) < p )/ p # 这地方不能用np.random.randn()out = x * mask elif mode == "test":out = xcache = (dropout_param, mask)out = out.astype(x.dtype, copy=False)return out, cache

反向传播

def dropout_backward(dout, cache):"""Inputs:- dout: Upstream derivatives, of any shape- cache: (dropout_param, mask) from dropout_forward."""dropout_param, mask = cachemode = dropout_param["mode"]dx = Noneif mode == "train":        dx = dout * maskelif mode == "test":dx = doutreturn dx

Regularization Experiment

As an experiment, we will train a pair of two-layer networks on 500 training examples: one will use no dropout, and one will use a keep probability of 0.25. We will then visualize the training and validation accuracies of the two networks over time.

# Train two identical nets, one with dropout and one without.
np.random.seed(231)
num_train = 500
small_data = {'X_train': data['X_train'][:num_train],'y_train': data['y_train'][:num_train],'X_val': data['X_val'],'y_val': data['y_val'],
}solvers = {}
dropout_choices = [1, 0.25]
for dropout_keep_ratio in dropout_choices:model = FullyConnectedNet([500],dropout_keep_ratio=dropout_keep_ratio)print(dropout_keep_ratio)solver = Solver(model,small_data,num_epochs=25,batch_size=100,update_rule='adam',optim_config={'learning_rate': 5e-4,},verbose=True,print_every=100)solver.train()solvers[dropout_keep_ratio] = solverprint()

显示结果

# Plot train and validation accuracies of the two models.
train_accs = []
val_accs = []
for dropout_keep_ratio in dropout_choices:solver = solvers[dropout_keep_ratio]train_accs.append(solver.train_acc_history[-1])val_accs.append(solver.val_acc_history[-1])plt.subplot(3, 1, 1)
for dropout_keep_ratio in dropout_choices:plt.plot(solvers[dropout_keep_ratio].train_acc_history, 'o', label='%.2f dropout_keep_ratio' % dropout_keep_ratio)
plt.title('Train accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend(ncol=2, loc='lower right')plt.subplot(3, 1, 2)
for dropout_keep_ratio in dropout_choices:plt.plot(solvers[dropout_keep_ratio].val_acc_history, 'o', label='%.2f dropout_keep_ratio' % dropout_keep_ratio)
plt.title('Val accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend(ncol=2, loc='lower right')plt.gcf().set_size_inches(15, 15)
plt.show()

Solver和网络

solver

from __future__ import print_function, division
from future import standard_library
from cs231n import optimstandard_library.install_aliases()
import os
import pickle as pickleclass Solver(object):  def __init__(self, model, data, **kwargs):self.model = modelself.X_train = data['X_train']self.y_train = data["y_train"]self.X_val = data["X_val"]self.y_val = data["y_val"]# Unpack keyword argumentsself.update_rule = kwargs.pop("update_rule", "sgd")self.optim_config = kwargs.pop("optim_config", {})self.lr_decay = kwargs.pop("lr_decay", 1.0)self.batch_size = kwargs.pop("batch_size", 100)self.num_epochs = kwargs.pop("num_epochs", 10)self.num_train_samples = kwargs.pop("num_train_samples", 1000)self.num_val_samples = kwargs.pop("num_val_samples", None)self.checkpoint_name = kwargs.pop("checkpoint_name", None)self.print_every = kwargs.pop("print_every", 10)self.verbose = kwargs.pop("verbose", True)# Throw an error if there are extra keyword argumentsif len(kwargs) > 0:extra = ", ".join('"%s"' % k for k in list(kwargs.keys()))raise ValueError("Unrecognized arguments %s" % extra)# Make sure the update rule exists, then replace the string name with the actual functionif not hasattr(optim, self.update_rule):raise ValueError('Invalid update_rule "%s"' % self.update_rule)self.update_rule = getattr(optim, self.update_rule)# hasattr() 函数用于判断对象是否包含对应的属性。 在optim包中找对应的update_rule self._reset()def _reset(self):"""Set up some book-keeping variables for optimization. Don't call this manually."""# Set up some variables for book-keepingself.epoch = 0self.best_val_acc = 0self.best_params = {}self.loss_history = []self.train_acc_history = []self.val_acc_history = []# Make a deep copy of the optim_config for each parameterself.optim_configs = {}for p in self.model.params:  # model.params是一个字典d = {k: v for k, v in self.optim_config.items()} # optim_config是字典self.optim_configs[p] = d   # optim_configs每一个元素岂不是一样？def _step(self):# Make a minibatch of training datanum_train = self.X_train.shape[0]batch_mask = np.random.choice(num_train, self.batch_size)X_batch = self.X_train[batch_mask]y_batch = self.y_train[batch_mask]# Compute loss and gradientloss, grads = self.model.loss(X_batch, y_batch)self.loss_history.append(loss)# Perform a parameter updatefor p, w in self.model.params.items():  # model.params是一个字典dw = grads[p]  # dw只是一个名字，实际上可以是任何参数config = self.optim_configs[p]  # 比如learning ratenext_w, next_config = self.update_rule(w, dw, config)self.model.params[p] = next_wself.optim_configs[p] = next_configdef _save_checkpoint(self):if self.checkpoint_name is None:returncheckpoint = {"model": self.model,"update_rule": self.update_rule,"lr_decay": self.lr_decay,"optim_config": self.optim_config,"batch_size": self.batch_size,"num_train_samples": self.num_train_samples,"num_val_samples": self.num_val_samples,"epoch": self.epoch,"loss_history": self.loss_history,"train_acc_history": self.train_acc_history,"val_acc_history": self.val_acc_history,}filename = "%s_epoch_%d.pkl" % (self.checkpoint_name, self.epoch)if self.verbose:print('Saving checkpoint to "%s"' % filename)with open(filename, "wb") as f:pickle.dump(checkpoint, f)def check_accuracy(self, X, y, num_samples=None, batch_size=100):"""Check accuracy of the model on the provided data.Inputs:- X: Array of data, of shape (N, d_1, ..., d_k)- y: Array of labels, of shape (N,)- num_samples: If not None, subsample the data and only test the modelon num_samples datapoints.- batch_size: Split X and y into batches of this size to avoid usingtoo much memory.Returns:- acc: Scalar giving the fraction of instances that were correctlyclassified by the model."""# Maybe subsample the dataN = X.shape[0]if num_samples is not None and N > num_samples:mask = np.random.choice(N, num_samples)N = num_samplesX = X[mask]y = y[mask]# Compute predictions in batchesnum_batches = N // batch_sizeif N % batch_size != 0:num_batches += 1y_pred = []for i in range(num_batches):start = i * batch_sizeend = (i + 1) * batch_sizescores = self.model.loss(X[start:end])  # model.loss中y==none返回的是scorey_pred.append(np.argmax(scores, axis=1))y_pred = np.hstack(y_pred)acc = np.mean(y_pred == y)return accdef train(self):"""Run optimization to train the model."""num_train = self.X_train.shape[0]iterations_per_epoch = max(num_train // self.batch_size, 1)num_iterations = self.num_epochs * iterations_per_epochfor t in range(num_iterations):self._step()# Maybe print training lossif self.verbose and t % self.print_every == 0:print("(Iteration %d / %d) loss: %f"% (t + 1, num_iterations, self.loss_history[-1]))# At the end of every epoch, increment the epoch counter and decay# the learning rate.epoch_end = (t + 1) % iterations_per_epoch == 0if epoch_end:self.epoch += 1for k in self.optim_configs:self.optim_configs[k]["learning_rate"] *= self.lr_decay# Check train and val accuracy on the first iteration, the last# iteration, and at the end of each epoch.first_it = t == 0last_it = t == num_iterations - 1if first_it or last_it or epoch_end:train_acc = self.check_accuracy(self.X_train, self.y_train, num_samples=self.num_train_samples)val_acc = self.check_accuracy(self.X_val, self.y_val, num_samples=self.num_val_samples)self.train_acc_history.append(train_acc)self.val_acc_history.append(val_acc)self._save_checkpoint()if self.verbose:print("(Epoch %d / %d) train acc: %f; val_acc: %f"% (self.epoch, self.num_epochs, train_acc, val_acc))# Keep track of the best modelif val_acc > self.best_val_acc:self.best_val_acc = val_accself.best_params = {}for k, v in self.model.params.items():self.best_params[k] = v.copy()# At the end of training swap the best params into the modelself.model.params = self.best_params

网络

import numpy as np# from cs231n.layers import *
# from cs231n.layer_utils import *class FullyConnectedNet(object):def __init__(self,hidden_dims,input_dim=3 * 32 * 32,num_classes=10,dropout_keep_ratio=1,normalization=None,reg=0.0,weight_scale=1e-2,dtype=np.float32,seed=None,):self.normalization = normalizationself.use_dropout = dropout_keep_ratio != 1self.reg = regself.num_layers = 1 + len(hidden_dims)self.dtype = dtypeself.params = {}dims = np.hstack((input_dim, hidden_dims, num_classes))for i in range(self.num_layers): W = np.random.randn(dims[i], dims[i+1]) * weight_scale            b = np.zeros(dims[i+1])self.params['W' + str(i+1)] = Wself.params['b' + str(i+1)] = bif self.normalization != None:for i in range(self.num_layers - 1):  # 最后一层的输出不用做batchnormself.params['gamma' + str(i+1)] = np.ones(dims[i+1])self.params['beta' + str(i+1)] = np.zeros(dims[i+1])# When using dropout we need to pass a dropout_param dictionary to each# dropout layer so that the layer knows the dropout probability and the mode# (train / test). You can pass the same dropout_param to each dropout layer.self.dropout_param = {}if self.use_dropout:self.dropout_param = {"mode": "train", "p": dropout_keep_ratio}if seed is not None:self.dropout_param["seed"] = seed# With batch normalization we need to keep track of running means and# variances, so we need to pass a special bn_param object to each batch# normalization layer. You should pass self.bn_params[0] to the forward pass# of the first batch normalization layer, self.bn_params[1] to the forward# pass of the second batch normalization layer, etc.self.bn_params = []if self.normalization == "batchnorm":self.bn_params = [{"mode": "train"} for i in range(self.num_layers - 1)]if self.normalization == "layernorm":self.bn_params = [{} for i in range(self.num_layers - 1)]# Cast all parameters to the correct datatype.for k, v in self.params.items():self.params[k] = v.astype(dtype)def loss(self, X, y=None):        X = X.astype(self.dtype)mode = "test" if y is None else "train"# Set train/test mode for batchnorm params and dropout param since they# behave differently during training and testing.if self.use_dropout:self.dropout_param["mode"] = modeif self.normalization == "batchnorm":for bn_param in self.bn_params:bn_param["mode"] = mode  # bn_param是字典scores = None    # For a network with L layers, the architecture will be ##{affine - [batch/layer norm] - relu - [dropout]} x (L - 1)-affine -softmax#caches = []drop_caches = []x = Xdrop_cache = Nonegamma, beta, bn_params = None, None, Nonefor i in range(self.num_layers - 1):W = self.params['W' + str(i+1)]b = self.params['b' + str(i+1)]if self.normalization != None:                gamma = self.params['gamma'+ str(i+1)]beta = self.params['beta' + str(i+1)]bn_params = self.bn_params[i]# print(b.shape)#print(bn_params)x, cache = affine_norm_relu_forward(x, W, b, gamma, beta, bn_params, self.normalization, self.use_dropout, self.dropout_param)if self.use_dropout:x, drop_cache = dropout_forward(x, self.dropout_param)caches.append(cache)drop_caches.append(drop_cache)W = self.params['W' + str(self.num_layers)]b = self.params['b' + str(self.num_layers)]scores, cache = affine_forward(x, W, b)caches.append(cache)# If test mode return early.if mode == "test":return scoresloss, grads = 0.0, {}loss, softmax_grad = softmax_loss(scores, y)for i in range(self.num_layers):W = self.params['W'+str(i+1)]loss += 0.5 * self.reg *(W ** 2).sum()dout, dw, db = affine_backward(softmax_grad, caches[self.num_layers - 1])grads['W' + str(self.num_layers)] = dw + self.reg * self.params['W'+str(self.num_layers)]grads['b' + str(self.num_layers)] = dbfor i in range(self.num_layers - 2, -1, -1):if self.use_dropout:dout = dropout_backward(dout, drop_caches[i]) # 倒数第二层，drop_cache和caches对应的元素下标是num_layers - 2dx, dw, db, dgamma, dbeta = affine_norm_relu_backward(dout, caches[i], self.normalization, self.use_dropout)grads['W' + str(i+1)] = dw + self.reg * self.params['W'+str(i+1)]grads['b' + str(i + 1)] = dbif self.normalization != None:grads['gamma'+str(i+1)] = dgammagrads['beta' +str(i+1)] = dbetadout = dxreturn loss, grads

网络用到的辅助函数（各层）

#  layers.pydef affine_forward(x, w, b):    x_vector = x.reshape(x.shape[0], -1) out = x_vector.dot(w) + b # 上面第一项形状是(N, M)，b的形状是(M,)，广播机制要求x_vector的最后一维和b一致cache = (x, w, b)return out, cachedef affine_backward(dout, cache):x, w, b = cachedx = dout.dot(w.T).reshape(x.shape) # (N, M) @ (M, D)x_vector = x.reshape(x.shape[0], -1)dw = x_vector.T.dot(dout)  # (D, N) @ (N, M)    # db = np.dot(dout.T, np.ones(x.shape[0])) # dout.T：(M, N) 相当于每一行求和db = dout.sum(axis=0) # 这么写也对   为什么是求和不是求平均？return dx, dw, dbdef relu_forward(x):out = np.maximum(0,x)cache = xreturn out, cachedef relu_backward(dout, cache):dx, x = None, cachedx = xdx[dx < 0] = 0  # x是否会被改变？dx[dx > 0] = 1dx *= doutreturn dxdef affine_norm_relu_forward(x, W, b, gamma, beta, bn_params, normalization,use_dropout, dropout_param):    out, fc_cache = affine_forward(x, W, b)norm_cache, drop_cache = None, Noneif normalization == 'batchnorm':out, norm_cache = batchnorm_forward(out, gamma, beta, bn_params)elif normalization == 'layernorm':out, norm_cache = layernorm_forward(out, gamma, beta, bn_params)out, relu_cache = relu_forward(out)if use_dropout:out, drop_cache = dropout_forward(out, dropout_param)cache = (fc_cache, norm_cache, relu_cache, drop_cache)return out, cachedef affine_norm_relu_backward(dout, cache, normalization, use_dropout):fc_cache, norm_cache, relu_cache, drop_cache = cacheif use_dropout:dout = dropout_backward(dout, drop_cache)dout = relu_backward(dout, relu_cache)dgamma, dbeta = None, None # 必须要有这一行，因为return总要返回值if normalization == 'batchnorm':dout, dgamma, dbeta = batchnorm_backward_alt(dout, norm_cache)elif normalization == 'layernorm':dout, dgamma, dbeta = layernorm_backward(dout, norm_cache)dx, dw, db = affine_backward(dout, fc_cache)return dx, dw, db, dgamma, dbeta

损失函数

def softmax_loss(x, y):    loss, dx = None, Nonenum_train = x.shape[0]scores = x - np.max(x, axis = 1).reshape(-1, 1)normalized_scores = np.exp(scores) / np.sum(np.exp(scores), axis = 1).reshape(-1,1)loss = -np.sum(np.log(normalized_scores[np.arange(num_train), y]))loss /= num_trainnormalized_scores[np.arange(num_train), y] -= 1dx = normalized_scores / num_trainreturn (loss, dx)

Batch Normalization和Dropout相关推荐

Batch Normalization 与Dropout 的冲突
BN或Dropout单独使用能加速训练速度并且避免过拟合但是倘若一起使用,会产生负面效果. BN在某些情况下会削弱Dropout的效果对此,BN与Dropout最好不要一起用,若一定要一起用,有2 ...
dropout+Batch Normalization理解
Dropout理解: 在没有dropout时,正向传播如下: 加入dropout后: 测试时,需要每个权值乘以P: Dropout官方源码: #dropout函数实现 def dropout(x, ...
深度学习总结：用pytorch做dropout和Batch Normalization时需要注意的地方，用tensorflow做dropout和BN时需要注意的地方,
用pytorch做dropout和BN时需要注意的地方 pytorch做dropout: 就是train的时候使用dropout,训练的时候不使用dropout, pytorch里面是通过net.ev ...
偏差与方差、L1正则化、L2正则化、dropout正则化、神经网络调优、批标准化Batch Normalization(BN层)、Early Stopping、数据增强
日萌社人工智能AI:Keras PyTorch MXNet TensorFlow PaddlePaddle 深度学习实战(不定时更新) 3.2 深度学习正则化 3.2.1 偏差与方差 3.2.1.1 ...
深度神经网络中的Batch Normalization介绍及实现
之前在经典网络DenseNet介绍_fengbingchun的博客-CSDN博客_densenet中介绍DenseNet时,网络中会有BN层,即Batch Normalization,在每个Dense ...
Batch Normalization应该放在ReLU非线性激活层的前面还是后面？
点击上方"小白学视觉",选择加"星标"或"置顶" 重磅干货,第一时间送达编辑:CVDaily 转载自:计算机视觉Daily https: ...
原理解释｜直觉与实现：Batch Normalization
https://www.toutiao.com/a6707566287964340747/ 作者:Harrison Jansma编译:ronghuaiyang 在本文中,我会回顾一下batch nor ...
Batch Normalization——加速深度神经网络收敛利器
https://www.toutiao.com/a6703399604613808648/ Batch Normalization Batch Normalization 提出自<Batch N ...
【深度学习】深入理解Batch Normalization批标准化
这几天面试经常被问到BN层的原理,虽然回答上来了,但还是感觉答得不是很好,今天仔细研究了一下Batch Normalization的原理,以下为参考网上几篇文章总结得出. Batch Normaliz ...

Batch Normalization和Dropout

导包和处理数据

BatchNorm

forward

backward

训练BatchNorm并显示结果

Batch Normalization 和初始化

Batch Normalization 和 Batch Size

Layer Normalization

Layer Normalization 和 Batch Size

卷积层的batch norm--spatial batchnorm

Spatial Group Normalization

dropout

Regularization Experiment

Solver和网络

solver

网络

网络用到的辅助函数（各层）

损失函数

Batch Normalization和Dropout相关推荐

最新文章

热门文章