Disclaimer

1) This article is for academic exchange only, not for commercial use, so the specific references for each part are not matched up in detail. If any part inadvertently infringes on your interests, please accept my apologies and contact me to have it removed.
2) My knowledge is limited; if anything in this article is inaccurate, please point it out so we can improve together. Thank you.
3) This is a first version; errors will still need to be corrected and material added or removed, so your feedback is very welcome. If everyone shares a little, together we can add a few bricks to the advancement of scientific research.

Table of Contents

  • Disclaimer
    • 0. Preface
    • 1. Steps to Build a Neural Network
    • 2. Deep Neural Networks
    • 3. Parameter Initialization
    • 4. Activation Functions
    • 5. Forward Propagation
    • 6. Computing the Loss
    • 7. Backward Propagation
    • 8. Parameter Updates
    • 9. Wrapping Up the Build Process (Optional)
    • Appendix_1. Data Loading
    • Appendix_2. Learning Rate Scheduling
  • References

0. Preface

In an earlier post in the hand-written-code series, 深度学习之手撕神经网络代码(基于numpy), I built a perceptron and a neural network with one hidden layer, walked through the basic structure and propagation mechanics of a neural network, and showed how to write one from scratch. But one reason neural networks and deep learning work so well is that they have many hidden layers and deep architectures. A while back a reader asked me to write a numpy-based DNN; I never got around to filling that gap, so let's do it today.

1. Steps to Build a Neural Network

Do you still remember the steps for building a neural network (深度学习之手撕神经网络代码(基于numpy))? Roughly six:

  • Define the network architecture
  • Initialize the parameters
  • Iterate and optimize
  • Compute the loss
  • Backpropagate
  • Update the parameters

More concisely, there are three parts: build the network, assign the parameters, and loop over the computation.

  • First, decide what network architecture to build (大话卷积神经网络CNN(干货满满)), e.g. the classic AlexNet, VGGNet, and so on;
  • Then initialize the weights w and biases b (深度学习入门笔记(十二):权重初始化), e.g. Xavier initialization, He initialization, and so on;
  • Finally, iterate the computation (深度学习之手撕神经网络代码(基于numpy)), i.e. forward propagation, backward propagation, and so on.

2. Deep Neural Networks

Some earlier posts on the theory of deep neural networks:

  • 深度学习入门笔记(七):深层神经网络
  • 深度学习入门笔记(八):深层网络的原理

If you need to brush up, read those first so nothing later is confusing. In short: the lower layers of the network extract features (followed, in convolutional networks, by convolution and pooling), the neurons apply activations and dropout, and that gives the forward pass; the loss function is then computed and backpropagation adjusts the parameters, iterating the optimization.

Take simple handwritten digit recognition as an example; the figure illustrates the whole pipeline:

  • End to end, with no hand-crafted intermediate steps: the pixel values themselves are the features, flattened into a vector, passed through the deep neural network, and the output is a probability for each one-hot encoded class.

  • The network consists of an input layer, hidden layers, and an output layer. Forward propagation computes each layer from the previous one (convolution and pooling in a CNN, a linear map plus activation here) and passes the result on; backward propagation does the same thing in reverse.

  • Given an image, its pixels are extracted and fed into the trained (GPU-accelerated) neural network, and the output is the classification result.

3. Parameter Initialization

The sizes of all the layers of the deep network are collected in the list layer_dims, so however many layers there are, you never have to write them out one by one, which is more convenient and flexible:

def initialize_parameters_deep(layer_dims):
    # random seed
    np.random.seed(3)
    parameters = {}
    # number of layers in the network
    L = len(layer_dims)
    for l in range(1, L):
        parameters['W' + str(l)] = np.random.randn(layer_dims[l], layer_dims[l-1]) * 0.01
        parameters['b' + str(l)] = np.zeros((layer_dims[l], 1))
        assert(parameters['W' + str(l)].shape == (layer_dims[l], layer_dims[l-1]))
        assert(parameters['b' + str(l)].shape == (layer_dims[l], 1))
    return parameters

The code above initializes the weights W with small random numbers and the biases b with zeros. If you prefer a different scheme, numpy implementations of several common initializers are listed below (note that he_normal and glorot_normal depend on a calc_fan helper, sketched after the list):

  • Truncated normal initialization
def truncated_normal(mean, std, out_shape):
    """
    Parameters
    ----------
    mean : float or array_like of floats
        The mean/center of the distribution
    std : float or array_like of floats
        Standard deviation (spread or "width") of the distribution.
    out_shape : int or tuple of ints
        Output shape.  If the given shape is, e.g., ``(m, n, k)``, then
        ``m * n * k`` samples are drawn.

    Returns
    -------
    samples : :py:class:`ndarray <numpy.ndarray>` of shape `out_shape`
        Samples from the truncated normal distribution parameterized by `mean`
        and `std`.
    """
    samples = np.random.normal(loc=mean, scale=std, size=out_shape)
    reject = np.logical_or(samples >= mean + 2 * std, samples <= mean - 2 * std)
    while any(reject.flatten()):
        resamples = np.random.normal(loc=mean, scale=std, size=reject.sum())
        samples[reject] = resamples
        reject = np.logical_or(samples >= mean + 2 * std, samples <= mean - 2 * std)
    return samples
  • He normal initialization
def he_normal(weight_shape):
    """
    Parameters
    ----------
    weight_shape : tuple
        The dimensions of the weight matrix/volume.

    Returns
    -------
    W : :py:class:`ndarray <numpy.ndarray>` of shape `weight_shape`
        The initialized weights.
    """
    fan_in, fan_out = calc_fan(weight_shape)
    std = np.sqrt(2 / fan_in)
    return truncated_normal(0, std, weight_shape)
  • Glorot (Xavier) normal initialization
def glorot_normal(weight_shape, gain=1.0):
    """
    Parameters
    ----------
    weight_shape : tuple
        The dimensions of the weight matrix/volume.

    Returns
    -------
    W : :py:class:`ndarray <numpy.ndarray>` of shape `weight_shape`
        The initialized weights.
    """
    fan_in, fan_out = calc_fan(weight_shape)
    std = gain * np.sqrt(2 / (fan_in + fan_out))
    return truncated_normal(0, std, weight_shape)
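The he_normal and glorot_normal helpers above call calc_fan, which the post does not show. As a reference, here is a minimal sketch of what such a helper can look like (following the numpy-ml convention that 2D shapes are dense weight matrices and 3D/4D shapes are convolution volumes); treat it as an illustrative assumption rather than the exact library code.

def calc_fan(weight_shape):
    # fan_in / fan_out for a weight matrix (2D) or a convolution volume (3D/4D)
    if len(weight_shape) == 2:
        fan_in, fan_out = weight_shape
    elif len(weight_shape) in (3, 4):
        in_ch, out_ch = weight_shape[-2:]
        kernel_size = np.prod(weight_shape[:-2])
        fan_in, fan_out = in_ch * kernel_size, out_ch * kernel_size
    else:
        raise ValueError("Unrecognized weight dimension: {}".format(weight_shape))
    return fan_in, fan_out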

Any of these three initializers can be called directly; the walkthrough below sticks with the simplest scheme.

A small example

Suppose a deep network with an input layer of size 3, one hidden layer of size 3, and an output layer of size 3. Call the parameter initialization function with the argument [3,3,3] and print the parameters it returns:

parameters = initialize_parameters_deep([3,3,3])
print("W1 = " + str(parameters["W1"]))
print("b1 = " + str(parameters["b1"]))
print("W2 = " + str(parameters["W2"]))
print("b2 = " + str(parameters["b2"]))

4. Activation Functions

Besides the ever-popular sigmoid activation there is ReLU, plus Leaky ReLU, ELU, SELU, Softplus, and so on.

Wikipedia has a table listing these functions:

Here are numpy implementations of several common activation functions:

from abc import ABC, abstractmethod
import numpy as np


class ActivationBase(ABC):
    def __init__(self, **kwargs):
        super().__init__()

    def __call__(self, z):
        if z.ndim == 1:
            z = z.reshape(1, -1)
        return self.fn(z)

    @abstractmethod
    def fn(self, z):
        raise NotImplementedError

    @abstractmethod
    def grad(self, x, **kwargs):
        raise NotImplementedError


class Sigmoid(ActivationBase):
    def __init__(self):
        """A logistic sigmoid activation function."""
        super().__init__()

    def __str__(self):
        return "Sigmoid"

    def fn(self, z):
        """Evaluate the logistic sigmoid, sigma, on the elements of input `z`."""
        return 1 / (1 + np.exp(-z))

    def grad(self, x):
        """Evaluate the first derivative of the logistic sigmoid on the elements of `x`."""
        fn_x = self.fn(x)
        return fn_x * (1 - fn_x)

    def grad2(self, x):
        """Evaluate the second derivative of the logistic sigmoid on the elements of `x`."""
        fn_x = self.fn(x)
        return fn_x * (1 - fn_x) * (1 - 2 * fn_x)


class ReLU(ActivationBase):
    """A rectified linear activation function."""

    def __init__(self):
        super().__init__()

    def __str__(self):
        return "ReLU"

    def fn(self, z):
        """Evaluate the ReLU function on the elements of input `z`."""
        return np.clip(z, 0, np.inf)

    def grad(self, x):
        """Evaluate the first derivative of the ReLU function on the elements of input `x`."""
        return (x > 0).astype(int)

    def grad2(self, x):
        """Evaluate the second derivative of the ReLU function on the elements of input `x`."""
        return np.zeros_like(x)


class LeakyReLU(ActivationBase):
    """'Leaky' version of a rectified linear unit (ReLU)."""

    def __init__(self, alpha=0.3):
        self.alpha = alpha
        super().__init__()

    def __str__(self):
        return "Leaky ReLU(alpha={})".format(self.alpha)

    def fn(self, z):
        """Evaluate the leaky ReLU function on the elements of input `z`."""
        _z = z.copy()
        _z[z < 0] = _z[z < 0] * self.alpha
        return _z

    def grad(self, x):
        """Evaluate the first derivative of the leaky ReLU function on the elements of input `x`."""
        out = np.ones_like(x)
        out[x < 0] *= self.alpha
        return out

    def grad2(self, x):
        """Evaluate the second derivative of the leaky ReLU function on the elements of input `x`."""
        return np.zeros_like(x)


class ELU(ActivationBase):
    def __init__(self, alpha=1.0):
        """
        An exponential linear unit (ELU).

        Parameters
        ----------
        alpha : float
            Slope of negative segment. Default is 1.
        """
        self.alpha = alpha
        super().__init__()

    def __str__(self):
        return "ELU(alpha={})".format(self.alpha)

    def fn(self, z):
        """Evaluate the ELU activation on the elements of input `z`."""
        # z if z > 0 else alpha * (e^z - 1)
        return np.where(z > 0, z, self.alpha * (np.exp(z) - 1))

    def grad(self, x):
        """Evaluate the first derivative of the ELU activation on the elements of input `x`."""
        # 1 if x > 0 else alpha * e^(x)
        return np.where(x > 0, np.ones_like(x), self.alpha * np.exp(x))

    def grad2(self, x):
        """Evaluate the second derivative of the ELU activation on the elements of input `x`."""
        # 0 if x > 0 else alpha * e^(x)
        return np.where(x >= 0, np.zeros_like(x), self.alpha * np.exp(x))


class SELU(ActivationBase):
    """A scaled exponential linear unit (SELU)."""

    def __init__(self):
        self.alpha = 1.6732632423543772848170429916717
        self.scale = 1.0507009873554804934193349852946
        self.elu = ELU(alpha=self.alpha)
        super().__init__()

    def __str__(self):
        return "SELU"

    def fn(self, z):
        """Evaluate the SELU activation on the elements of input `z`."""
        return self.scale * self.elu.fn(z)

    def grad(self, x):
        """Evaluate the first derivative of the SELU activation on the elements of input `x`."""
        return np.where(x >= 0, np.ones_like(x) * self.scale, np.exp(x) * self.alpha * self.scale)

    def grad2(self, x):
        """Evaluate the second derivative of the SELU activation on the elements of input `x`."""
        return np.where(x > 0, np.zeros_like(x), np.exp(x) * self.alpha * self.scale)


class SoftPlus(ActivationBase):
    def __init__(self):
        """A softplus activation function."""
        super().__init__()

    def __str__(self):
        return "SoftPlus"

    def fn(self, z):
        """Evaluate the softplus activation on the elements of input `z`."""
        return np.log(np.exp(z) + 1)

    def grad(self, x):
        """Evaluate the first derivative of the softplus activation on the elements of input `x`."""
        exp_x = np.exp(x)
        return exp_x / (exp_x + 1)

    def grad2(self, x):
        """Evaluate the second derivative of the softplus activation on the elements of input `x`."""
        exp_x = np.exp(x)
        return exp_x / ((exp_x + 1) ** 2)
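For reference, here is a small hypothetical usage example of the classes above (the base class reshapes 1D inputs into a row vector before applying fn):

z = np.array([[-2.0, -0.5, 0.0, 0.5, 2.0]])

for act in [Sigmoid(), ReLU(), LeakyReLU(alpha=0.1), ELU(), SELU(), SoftPlus()]:
    print(act)            # human-readable name, e.g. "Leaky ReLU(alpha=0.1)"
    print(act(z))         # forward pass via ActivationBase.__call__ -> fn
    print(act.grad(z))    # elementwise first derivative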

5. Forward Propagation

Here we use only the sigmoid and relu activations for forward propagation; the code is as follows:

def linear_activation_forward(A_prev, W, b, activation):
    if activation == "sigmoid":
        Z, linear_cache = linear_forward(A_prev, W, b)
        A, activation_cache = sigmoid(Z)
    elif activation == "relu":
        Z, linear_cache = linear_forward(A_prev, W, b)
        A, activation_cache = relu(Z)

    assert (A.shape == (W.shape[0], A_prev.shape[1]))
    cache = (linear_cache, activation_cache)
    return A, cache

A_prev is the output of the previous forward step, W and b are the weights and biases, and the activation argument selects which activation function is applied in between. To swap in a different activation, just replace it here.
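linear_activation_forward relies on three helpers that the post does not list: linear_forward, sigmoid, and relu. A minimal sketch consistent with how they are called (each activation returns its output plus Z as the activation cache) would be:

def linear_forward(A, W, b):
    # Z = W·A + b, caching the inputs for the backward pass
    Z = np.dot(W, A) + b
    cache = (A, W, b)
    return Z, cache

def sigmoid(Z):
    A = 1 / (1 + np.exp(-Z))
    return A, Z          # Z is kept as the activation cache

def relu(Z):
    A = np.maximum(0, Z)
    return A, Z          # Z is kept as the activation cache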

The forward pass through a single layer looks like this:

The full implementation:

def L_model_forward(X, parameters):
    caches = []
    A = X
    # number of layers in the network
    L = len(parameters) // 2
    # implement [LINEAR -> RELU] * (L-1)
    for l in range(1, L):
        A_prev = A
        A, cache = linear_activation_forward(A_prev, parameters["W" + str(l)], parameters["b" + str(l)], "relu")
        caches.append(cache)
    # implement LINEAR -> SIGMOID
    AL, cache = linear_activation_forward(A, parameters["W" + str(L)], parameters["b" + str(L)], "sigmoid")
    caches.append(cache)
    assert(AL.shape == (1, X.shape[1]))
    return AL, caches

6. Computing the Loss

Once forward propagation has produced an output, we use it to compute the value of the loss function.

from abc import ABC, abstractmethod
import numpy as np
import re
import numbers
from functools import partial

# NOTE: SGD, AdaGrad, RMSProp, Adam, OptimizerBase and the he_uniform /
# glorot_uniform initializers referenced below are defined elsewhere in numpy-ml.


def is_binary(x):
    """Return True if array `x` consists only of binary values"""
    msg = "Matrix must be binary"
    assert np.array_equal(x, x.astype(bool)), msg
    return True


def is_stochastic(X):
    """True if `X` contains probabilities that sum to 1 along the columns"""
    msg = "Array should be stochastic along the columns"
    assert len(X[X < 0]) == len(X[X > 1]) == 0, msg
    assert np.allclose(np.sum(X, axis=1), np.ones(X.shape[0])), msg
    return True


class OptimizerInitializer(object):
    def __init__(self, param=None):
        """
        A class for initializing optimizers. Valid inputs are:
            (a) __str__ representations of `OptimizerBase` instances
            (b) `OptimizerBase` instances
            (c) Parameter dicts (e.g., as produced via the `summary` method in
                `LayerBase` instances)

        If `param` is `None`, return the SGD optimizer with default parameters.
        """
        self.param = param

    def __call__(self):
        param = self.param
        if param is None:
            opt = SGD()
        elif isinstance(param, OptimizerBase):
            opt = param
        elif isinstance(param, str):
            opt = self.init_from_str()
        elif isinstance(param, dict):
            opt = self.init_from_dict()
        return opt

    def init_from_str(self):
        r = r"([a-zA-Z]*)=([^,)]*)"
        opt_str = self.param.lower()
        kwargs = dict([(i, eval(j)) for (i, j) in re.findall(r, opt_str)])
        if "sgd" in opt_str:
            optimizer = SGD(**kwargs)
        elif "adagrad" in opt_str:
            optimizer = AdaGrad(**kwargs)
        elif "rmsprop" in opt_str:
            optimizer = RMSProp(**kwargs)
        elif "adam" in opt_str:
            optimizer = Adam(**kwargs)
        else:
            raise NotImplementedError("{}".format(opt_str))
        return optimizer

    def init_from_dict(self):
        O = self.param
        cc = O["cache"] if "cache" in O else None
        op = O["hyperparameters"] if "hyperparameters" in O else None

        if op is None:
            raise ValueError("Must have `hyperparameters` key: {}".format(O))

        if op and op["id"] == "SGD":
            optimizer = SGD().set_params(op, cc)
        elif op and op["id"] == "RMSProp":
            optimizer = RMSProp().set_params(op, cc)
        elif op and op["id"] == "AdaGrad":
            optimizer = AdaGrad().set_params(op, cc)
        elif op and op["id"] == "Adam":
            optimizer = Adam().set_params(op, cc)
        elif op:
            raise NotImplementedError("{}".format(op["id"]))
        return optimizer


class WeightInitializer(object):
    def __init__(self, act_fn_str, mode="glorot_uniform"):
        """
        A factory for weight initializers.

        Parameters
        ----------
        act_fn_str : str
            The string representation for the layer activation function
        mode : str (default: 'glorot_uniform')
            The weight initialization strategy. Valid entries are
            {"he_normal", "he_uniform", "glorot_normal", "glorot_uniform",
            "std_normal", "trunc_normal"}
        """
        if mode not in [
            "he_normal",
            "he_uniform",
            "glorot_normal",
            "glorot_uniform",
            "std_normal",
            "trunc_normal",
        ]:
            raise ValueError("Unrecognized initialization mode: {}".format(mode))

        self.mode = mode
        self.act_fn = act_fn_str

        if mode == "glorot_uniform":
            self._fn = glorot_uniform
        elif mode == "glorot_normal":
            self._fn = glorot_normal
        elif mode == "he_uniform":
            self._fn = he_uniform
        elif mode == "he_normal":
            self._fn = he_normal
        elif mode == "std_normal":
            self._fn = np.random.randn
        elif mode == "trunc_normal":
            self._fn = partial(truncated_normal, mean=0, std=1)

    def __call__(self, weight_shape):
        if "glorot" in self.mode:
            gain = self._calc_glorot_gain()
            W = self._fn(weight_shape, gain)
        elif self.mode == "std_normal":
            W = self._fn(*weight_shape)
        else:
            W = self._fn(weight_shape)
        return W

    def _calc_glorot_gain(self):
        """
        Values from:
        https://pytorch.org/docs/stable/nn.html?#torch.nn.init.calculate_gain
        """
        gain = 1.0
        act_str = self.act_fn.lower()
        if act_str == "tanh":
            gain = 5.0 / 3.0
        elif act_str == "relu":
            gain = np.sqrt(2)
        elif "leaky relu" in act_str:
            r = r"leaky relu\(alpha=(.*)\)"
            alpha = re.match(r, act_str).groups()[0]
            gain = np.sqrt(2 / (1 + float(alpha) ** 2))
        return gain


class ObjectiveBase(ABC):
    def __init__(self):
        super().__init__()

    @abstractmethod
    def loss(self, y_true, y_pred):
        pass

    @abstractmethod
    def grad(self, y_true, y_pred, **kwargs):
        pass


class SquaredError(ObjectiveBase):
    def __init__(self):
        """A squared-error / `L2` loss."""
        super().__init__()

    def __call__(self, y, y_pred):
        return self.loss(y, y_pred)

    def __str__(self):
        return "SquaredError"

    @staticmethod
    def loss(y, y_pred):
        """
        Compute the squared error between `y` and `y_pred`.

        Parameters
        ----------
        y : :py:class:`ndarray <numpy.ndarray>` of shape (n, m)
            Ground truth values for each of `n` examples
        y_pred : :py:class:`ndarray <numpy.ndarray>` of shape (n, m)
            Predictions for the `n` examples in the batch.

        Returns
        -------
        loss : float
            The sum of the squared error across dimensions and examples.
        """
        return 0.5 * np.linalg.norm(y_pred - y) ** 2

    @staticmethod
    def grad(y, y_pred, z, act_fn):
        """
        Gradient of the squared error loss with respect to the pre-nonlinearity
        input, `z`.

        Parameters
        ----------
        y : :py:class:`ndarray <numpy.ndarray>` of shape (n, m)
            Ground truth values for each of `n` examples.
        y_pred : :py:class:`ndarray <numpy.ndarray>` of shape (n, m)
            Predictions for the `n` examples in the batch.
        act_fn : :doc:`Activation <numpy_ml.neural_nets.activations>` object
            The activation function for the output layer of the network.

        Returns
        -------
        grad : :py:class:`ndarray <numpy.ndarray>` of shape (n, m)
            The gradient of the squared error loss with respect to `z`.
        """
        return (y_pred - y) * act_fn.grad(z)


class CrossEntropy(ObjectiveBase):
    def __init__(self):
        """A cross-entropy loss."""
        super().__init__()

    def __call__(self, y, y_pred):
        return self.loss(y, y_pred)

    def __str__(self):
        return "CrossEntropy"

    @staticmethod
    def loss(y, y_pred):
        """
        Compute the cross-entropy (log) loss.

        Parameters
        ----------
        y : :py:class:`ndarray <numpy.ndarray>` of shape (n, m)
            Class labels (one-hot with `m` possible classes) for each of `n`
            examples.
        y_pred : :py:class:`ndarray <numpy.ndarray>` of shape (n, m)
            Probabilities of each of `m` classes for the `n` examples in the
            batch.

        Returns
        -------
        loss : float
            The sum of the cross-entropy across classes and examples.
        """
        is_binary(y)
        is_stochastic(y_pred)

        # prevent taking the log of 0
        eps = np.finfo(float).eps

        # each example is associated with a single class; sum the negative log
        # probability of the correct label over all samples in the batch.
        # observe that we are taking advantage of the fact that y is one-hot
        # encoded
        cross_entropy = -np.sum(y * np.log(y_pred + eps))
        return cross_entropy

    @staticmethod
    def grad(y, y_pred):
        """
        Compute the gradient of the cross entropy loss with regard to the
        softmax input, `z`.

        Parameters
        ----------
        y : :py:class:`ndarray <numpy.ndarray>` of shape `(n, m)`
            A one-hot encoding of the true class labels. Each row constitutes a
            training example, and each column is a different class.
        y_pred : :py:class:`ndarray <numpy.ndarray>` of shape `(n, m)`
            The network predictions for the probability of each of `m` class
            labels on each of `n` examples in a batch.

        Returns
        -------
        grad : :py:class:`ndarray <numpy.ndarray>` of shape (n, m)
            The gradient of the cross-entropy loss with respect to the *input*
            to the softmax function.
        """
        is_binary(y)
        is_stochastic(y_pred)

        # derivative of xe wrt z is y_pred - y_true, hence we can just
        # subtract 1 from the probability of the correct class labels
        grad = y_pred - y

        # [optional] scale the gradients by the number of examples in the batch
        # n, m = y.shape
        # grad /= n
        return grad
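A quick sanity check of the CrossEntropy class above, with hypothetical one-hot labels (rows are examples) and row-stochastic predictions:

y      = np.array([[1, 0, 0],
                   [0, 1, 0]])
y_pred = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.8, 0.1]])

xe = CrossEntropy()
print(xe(y, y_pred))        # -(log 0.7 + log 0.8) ≈ 0.58
print(xe.grad(y, y_pred))   # y_pred - y, the gradient w.r.t. the softmax input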

Here we use the simplest form, the binary cross-entropy built from its two terms, as the cost function:

def compute_cost(AL, Y):
    m = Y.shape[1]
    # Compute loss from AL and Y.
    cost = -np.sum(np.multiply(Y, np.log(AL)) + np.multiply(1 - Y, np.log(1 - AL))) / m
    cost = np.squeeze(cost)
    assert(cost.shape == ())
    return cost
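As a small, hypothetical check of compute_cost with a single output unit and three examples:

AL = np.array([[0.8, 0.1, 0.6]])   # predicted probabilities
Y  = np.array([[1,   0,   1  ]])   # ground-truth labels
print(compute_cost(AL, Y))         # ≈ 0.2798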

7. Backward Propagation

The key to backpropagation is applying the chain rule. Fortunately we carefully derived the derivative of the cross-entropy last time, and there is also an earlier post, 深度学习100问之深入理解Back Propagation(反向传播); if you have read those, hand-writing the backward pass should no longer be a problem.


So the linear part of the backward pass is:

def linear_backward(dZ, cache):
    A_prev, W, b = cache
    m = A_prev.shape[1]

    dW = np.dot(dZ, A_prev.T) / m
    db = np.sum(dZ, axis=1, keepdims=True) / m
    dA_prev = np.dot(W.T, dZ)

    assert (dA_prev.shape == A_prev.shape)
    assert (dW.shape == W.shape)
    assert (db.shape == b.shape)
    return dA_prev, dW, db

Next, the linear + activation backward function:

def linear_activation_backward(dA, cache, activation):
    linear_cache, activation_cache = cache
    if activation == "relu":
        dZ = relu_backward(dA, activation_cache)
        dA_prev, dW, db = linear_backward(dZ, linear_cache)
    elif activation == "sigmoid":
        dZ = sigmoid_backward(dA, activation_cache)
        dA_prev, dW, db = linear_backward(dZ, linear_cache)
    return dA_prev, dW, db
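As before, relu_backward and sigmoid_backward are helpers the post does not list. A minimal sketch consistent with the caches produced in the forward pass (where the activation cache is Z) would be:

def relu_backward(dA, activation_cache):
    Z = activation_cache
    dZ = np.array(dA, copy=True)
    dZ[Z <= 0] = 0               # the ReLU gradient is 0 where Z <= 0, 1 elsewhere
    return dZ

def sigmoid_backward(dA, activation_cache):
    Z = activation_cache
    s = 1 / (1 + np.exp(-Z))
    return dA * s * (1 - s)      # chain rule through sigma'(Z) = sigma(Z)(1 - sigma(Z))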

Putting these together gives the full backward propagation function:

def L_model_backward(AL, Y, caches):
    grads = {}
    L = len(caches)  # number of layers
    m = AL.shape[1]
    Y = Y.reshape(AL.shape)  # after this line, Y is the same shape as AL

    # initialize the backpropagation
    dAL = - (np.divide(Y, AL) - np.divide(1 - Y, 1 - AL))

    current_cache = caches[L - 1]
    grads["dA" + str(L)], grads["dW" + str(L)], grads["db" + str(L)] = linear_activation_backward(dAL, current_cache, "sigmoid")

    for l in reversed(range(L - 1)):
        current_cache = caches[l]
        dA_prev_temp, dW_temp, db_temp = linear_activation_backward(grads["dA" + str(l + 2)], current_cache, "relu")
        grads["dA" + str(l + 1)] = dA_prev_temp
        grads["dW" + str(l + 1)] = dW_temp
        grads["db" + str(l + 1)] = db_temp
    return grads

I strongly recommend working through backpropagation carefully, especially the derivative of the cross-entropy, which is composite-function differentiation at its finest; once you can derive it yourself, there is basically nothing left to trip over.

8. Parameter Updates

After backpropagation comes the parameter update; the function is:

def update_parameters(parameters, grads, learning_rate):
    # number of layers in the neural network
    L = len(parameters) // 2
    # Update rule for each parameter. Use a for loop.
    for l in range(L):
        parameters["W" + str(l + 1)] = parameters["W" + str(l + 1)] - learning_rate * grads["dW" + str(l + 1)]
        parameters["b" + str(l + 1)] = parameters["b" + str(l + 1)] - learning_rate * grads["db" + str(l + 1)]
    return parameters

9. Wrapping Up the Build Process (Optional)

At this point the whole DNN has been assembled. As in the previous post, you will probably want to wrap these functions into a single routine, like this:

import matplotlib.pyplot as plt


def L_layer_model(X, Y, layers_dims, learning_rate=0.0075, num_iterations=3000, print_cost=False):
    np.random.seed(1)
    costs = []

    # parameter initialization
    parameters = initialize_parameters_deep(layers_dims)

    # optimization loop
    for i in range(0, num_iterations):
        # forward propagation: [LINEAR -> RELU]*(L-1) -> LINEAR -> SIGMOID
        AL, caches = L_model_forward(X, parameters)
        # compute the cost
        cost = compute_cost(AL, Y)
        # backward propagation
        grads = L_model_backward(AL, Y, caches)
        # parameter update
        parameters = update_parameters(parameters, grads, learning_rate)
        # record and print the cost every 100 iterations
        if print_cost and i % 100 == 0:
            print("Cost after iteration %i: %f" % (i, cost))
            costs.append(cost)

    # plot the cost curve
    plt.plot(np.squeeze(costs))
    plt.ylabel('cost')
    plt.xlabel('iterations (per hundreds)')
    plt.title("Learning rate = " + str(learning_rate))
    plt.show()

    return parameters

With that, a deep neural network has been built from start to finish. The original code comes from Andrew Ng's deep learning course and can be found on GitHub; if you really cannot find it, leave a comment.
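To give an idea of how everything plugs together, here is a hypothetical call for a binary classifier on flattened 64×64×3 images. train_x of shape (12288, m) and train_y of shape (1, m) are placeholders you would load yourself, and the layer sizes are just an example:

layers_dims = [12288, 20, 7, 5, 1]   # input layer, three hidden layers, one output unit
parameters = L_layer_model(train_x, train_y, layers_dims,
                           learning_rate=0.0075, num_iterations=2500, print_cost=True)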


--- manual divider ---


Below is numpy code for some other building blocks:

Appendix_1. Data Loading

Minibatching both speeds up training and injects extra randomness into the optimization process.

Blue is minibatch, purple is full batch:

def minibatch(X, batchsize=256, shuffle=True):
    """
    Compute the minibatch indices for a training dataset.

    Parameters
    ----------
    X : :py:class:`ndarray <numpy.ndarray>` of shape `(N, *)`
        The dataset to divide into minibatches. Assumes the first dimension
        represents the number of training examples.
    batchsize : int
        The desired size of each minibatch. Note, however, that if ``X.shape[0] %
        batchsize > 0`` then the final batch will contain fewer than batchsize
        entries. Default is 256.
    shuffle : bool
        Whether to shuffle the entries in the dataset before dividing into
        minibatches. Default is True.

    Returns
    -------
    mb_generator : generator
        A generator which yields the indices into X for each batch
    n_batches : int
        The number of batches
    """
    N = X.shape[0]
    ix = np.arange(N)
    n_batches = int(np.ceil(N / batchsize))

    if shuffle:
        np.random.shuffle(ix)

    def mb_generator():
        for i in range(n_batches):
            yield ix[i * batchsize : (i + 1) * batchsize]

    return mb_generator(), n_batches
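A small, hypothetical example of how the index generator is consumed during training:

X = np.random.randn(1000, 64)                     # 1000 examples, 64 features
batch_ixs, n_batches = minibatch(X, batchsize=256, shuffle=True)
for ix in batch_ixs:
    X_batch = X[ix]                               # shape (256, 64); the last batch is (232, 64)
    # ... forward / backward / update on X_batch ...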

Appendix_2. Learning Rate Scheduling

The learning rate controls how fast the iterations descend. If it is set too small, convergence becomes very slow; if it is set too large, the gradient steps may oscillate back and forth around the minimum and may even fail to converge at all.

The effect of learning rate decay looks like this:

As the figure above shows, with a fixed learning rate the iterates end up oscillating in a fairly large region around the optimum once training converges, whereas with learning rate decay they settle into a much smaller region around the optimum.

from copy import deepcopy
from abc import ABC, abstractmethod

import numpy as np
from math import erf


def gaussian_cdf(x, mean, var):
    """
    Compute the probability that a random draw from a 1D Gaussian with mean
    `mean` and variance `var` is less than or equal to `x`.
    """
    eps = np.finfo(float).eps
    x_scaled = (x - mean) / np.sqrt(var + eps)
    return (1 + erf(x_scaled / np.sqrt(2))) / 2


class SchedulerBase(ABC):
    def __init__(self):
        """Abstract base class for all Scheduler objects."""
        self.hyperparameters = {}

    def __call__(self, step=None, cur_loss=None):
        return self.learning_rate(step=step, cur_loss=cur_loss)

    def copy(self):
        """Return a copy of the current object."""
        return deepcopy(self)

    def set_params(self, hparam_dict):
        """Set the scheduler hyperparameters from a dictionary."""
        if hparam_dict is not None:
            for k, v in hparam_dict.items():
                if k in self.hyperparameters:
                    self.hyperparameters[k] = v

    @abstractmethod
    def learning_rate(self, step=None):
        raise NotImplementedError


class ConstantScheduler(SchedulerBase):
    def __init__(self, lr=0.01, **kwargs):
        """
        Returns a fixed learning rate, regardless of the current step.

        Parameters
        ----------
        lr : float
            The learning rate. Default is 0.01
        """
        super().__init__()
        self.lr = lr
        self.hyperparameters = {"id": "ConstantScheduler", "lr": self.lr}

    def __str__(self):
        return "ConstantScheduler(lr={})".format(self.lr)

    def learning_rate(self, **kwargs):
        """
        Return the current learning rate.

        Returns
        -------
        lr : float
            The learning rate
        """
        return self.lr


class ExponentialScheduler(SchedulerBase):
    def __init__(self, initial_lr=0.01, stage_length=500, staircase=False, decay=0.1, **kwargs):
        """
        An exponential learning rate scheduler.

        Parameters
        ----------
        initial_lr : float
            The learning rate at the first step. Default is 0.01.
        stage_length : int
            The length of each stage, in steps. Default is 500.
        staircase : bool
            If True, only adjusts the learning rate at the stage transitions,
            producing a step-like decay schedule. If False, adjusts the
            learning rate after each step, creating a smooth decay schedule.
            Default is False.
        decay : float
            The amount to decay the learning rate at each new stage. Default is
            0.1.
        """
        super().__init__()
        self.decay = decay
        self.staircase = staircase
        self.initial_lr = initial_lr
        self.stage_length = stage_length
        self.hyperparameters = {
            "id": "StepScheduler",
            "decay": self.decay,
            "staircase": self.staircase,
            "initial_lr": self.initial_lr,
            "stage_length": self.stage_length,
        }

    def __str__(self):
        return "ExponentialScheduler(initial_lr={}, stage_length={}, staircase={}, decay={})".format(
            self.initial_lr, self.stage_length, self.staircase, self.decay
        )

    def learning_rate(self, step, **kwargs):
        """
        Return the current learning rate as a function of `step`.

        Parameters
        ----------
        step : int
            The current step number.

        Returns
        -------
        lr : float
            The learning rate for the current step.
        """
        cur_stage = step / self.stage_length
        if self.staircase:
            cur_stage = np.floor(cur_stage)
        return self.initial_lr * self.decay ** cur_stage
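A quick, hypothetical example of how these schedulers behave:

sched = ExponentialScheduler(initial_lr=0.01, stage_length=500, staircase=True, decay=0.5)
for step in (0, 499, 500, 1500):
    print(step, sched(step))    # 0.01, 0.01, 0.005, 0.00125

const = ConstantScheduler(lr=0.01)
print(const(step=42))           # always 0.01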

If you want more resources, feel free to follow @我是管小亮, text-formatting OCD to the MAX~

Reply 【福利】 to get the goodies I have prepared for you, covering C++, the four programming fundamentals, NLP, deep learning, and more.

For more articles (and banter), follow the WeChat official account 「程序员管小亮」~

References

  • https://www.coursera.org/learn/machine-learning
  • https://www.deeplearning.ai/
  • 深度学习之手撕神经网络代码(基于numpy)
  • 深度神经网络原理与实践
  • 深度学习笔记3:手动搭建深度神经网络(DNN)
  • https://github.com/ddbourgin/numpy-ml
