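Li Li: Automatic Gradient Computation, Implementing a Multi-Layer Neural Network with Automatic Differentiation

This series is written for deep learning practitioners. It uses one interesting, concrete task, Image Caption Generation, to introduce deep learning in an accessible way. The series covers many popular deep learning models, such as CNN, RNN/LSTM, and Attention. This is part 6.

Author: Li Li
He currently works at 环信 (Easemob), an instant-messaging cloud platform and omni-channel intelligent customer service platform, where he works on intelligent customer service and chatbots and focuses on using deep learning to improve chatbot performance.

Related articles:
Li Li: Understanding Deep Learning through Image Caption Generation (part I)
Li Li: Understanding Deep Learning through Image Caption Generation (part II)
Li Li: Understanding Deep Learning through Image Caption Generation (part III)
Li Li: Automatic Gradient Computation, Another View of the Backpropagation Algorithm
Li Li: Automatic Gradient Computation, the cs231n Notes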

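How common deep learning frameworks and tools are used

So far we have covered four ways to compute gradients:

Manual derivation
Numerical differentiation
Symbolic differentiation
Automatic differentiation

A framework or tool clearly cannot rely on manual derivation, and numerical differentiation is too slow to be used for anything but gradient checking. That leaves symbolic and automatic differentiation, and the mainstream frameworks use automatic differentiation.

[Note: Theano describes itself as doing Symbolic Differentiation, but it does not mean symbolic differentiation in the mathematical sense. Interested readers can consult the linked reference.]

Looking more closely, deep learning frameworks fall into two groups:

1. Frameworks in which the user builds the computation graph from basic functions (also called operations, or ops);
2. Frameworks in which the user can only use higher-level functions.

The boundary between the two is blurry. Which functions count as basic and which as high-level? How many functions must be provided to express every neural network?

These questions are hard to settle, but most frameworks are extensible. TensorFlow, for example, lets you define a custom op: if a function is missing, you can implement it yourself. Theano supports custom ops as well.

Some other frameworks and tools are less flexible, but they all work in essentially the same way. We define a computation graph by some means (code or a configuration file), declare which nodes are variables (trainable) and which are constants (or values supplied per batch, such as a placeholder in TensorFlow), and specify the loss function. The framework then automatically computes the derivative of the loss with respect to every trainable parameter. Most frameworks also package the common gradient descent variants, so we only need to set a few parameters such as the batch size and the learning rate.

Some frameworks, such as Theano, do not go that far and only compute gradients for us. Such tools are more "low-level": they demand more of the user but are also more flexible, which suits users who care about the details and those who need to "invent" their own network structures. Many people in academia like Theano. Caffe, Torch, and Keras, by contrast, are more "high-level" tools. With them I only need to define the CNN or DNN layers one by one: how many hidden units a layer has, which activation function it uses, whether to apply dropout, and which loss function to use. Everything else is taken care of.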

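Implementing a multi-layer neural network with automatic differentiation

This amounts to completing part of CS231n Assignment 2.

Environment

Read the assignment's instructions on required software carefully. Below are the installation commands for my environment (Ubuntu 14.04 LTS).

1. Download and unpack

Download the Assignment 2 archive from the course site and unpack it.

2. Install virtualenv and dependencies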

cd assignment2
sudo pip install virtualenv      # This may already be installed
virtualenv .env                  # Create a virtual environment
source .env/bin/activate         # Activate the virtual environment
pip install -r requirements.txt  # Install dependencies
# Work on the assignment for a while ...
deactivate                       # Exit the virtual environment

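virtualenv creates a virtual Python environment that is isolated from the system environment, and installing packages into it does not require root. Remember to run source .env/bin/activate before each working session!

3. Download the data

cd cs231n/datasets
./get_datasets.sh

4. Build the Cython extension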

python setup.py build_ext --inplace

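5. Start the IPython notebook

(.env) lili@lili-desktop:~/cs231n/assignment2$ ipython notebook

A browser window should open at http://localhost:8888/tree. Open FullyConnectedNets.ipynb.

If you have never used the IPython notebook, read an introduction to it first. Make sure you understand the basics, such as how to run a cell, before reading on.

The assignment

The cs231n assignment is to implement a fully connected neural network whose number of layers is configurable.

We decompose the network into basic Layers. [Note: a Layer here is not what we previously called a layer. Previously a layer meant one fully connected layer of the network; here a Layer can be thought of as a Gate, a function, or an Op.] Every Layer supports a forward and a backward computation. We compose these basic Layers into a complex neural network, run the forward and backprop computation of the whole network, and then train the parameters and make predictions.

Many of the Layers we implement therefore have the following structure: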

def layer_forward(x, w):
    """ Receive inputs x and weights w """
    # Do some computations ...
    z = ...    # ... some intermediate value
    # Do some more computations ...
    out = ...  # the output
    cache = (x, w, z, out)  # Values we need to compute gradients
    return out, cache

The forward function does its computation, stores the inputs, the output, and any intermediate results in cache (they are needed in the backward pass), and returns the output together with cache.

def layer_backward(dout, cache):
    """
    Receive derivative of loss with respect to outputs and cache,
    and compute derivative with respect to inputs.
    """
    # Unpack cache values
    x, w, z, out = cache
    # Use values in cache to compute derivatives
    dx = ...  # Derivative of loss with respect to x
    dw = ...  # Derivative of loss with respect to w
    return dx, dw

The backward function receives cache and dout, the gradient passed back from the following layer.
It first unpacks the inputs, the output, and the intermediate values from cache. It then computes the local gradient with respect to each variable and multiplies it by the incoming dout to obtain the final dLoss/dx and dLoss/dw, which it returns.
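To make the forward/backward pattern concrete, here is a minimal sketch of one complete Layer: an element-wise multiply gate. The names scale_forward and scale_backward are made up for this illustration and are not part of the course code.

```python
import numpy as np

def scale_forward(x, w):
    """Forward pass of an element-wise multiply gate: out = x * w."""
    out = x * w
    cache = (x, w)  # both inputs are needed to compute the gradients
    return out, cache

def scale_backward(dout, cache):
    """Backward pass: local gradient times the upstream gradient dout."""
    x, w = cache
    dx = dout * w  # d(x*w)/dx = w
    dw = dout * x  # d(x*w)/dw = x
    return dx, dw

x = np.array([1.0, 2.0, 3.0])
w = np.array([0.5, -1.0, 2.0])
out, cache = scale_forward(x, w)
# Pretend the loss is sum(out), so the upstream gradient is all ones.
dx, dw = scale_backward(np.ones_like(out), cache)
```

With an all-ones upstream gradient, dx equals w and dw equals x, which is exactly the local gradient of a product.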

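cell-1

Click this cell and choose Run from the Cell menu. If there is no output, congratulations, your environment works. If you see import errors or similar, the environment or dependencies were not installed correctly; search for the error message to resolve it.

This cell imports some dependencies and defines a rel_error function:

def rel_error(x, y):
    """ returns relative error """
    return np.max(np.abs(x - y) / (np.maximum(1e-8, np.abs(x) + np.abs(y))))

This function computes the relative error between two ndarrays (two scalars, two vectors, or two matrices). It is mainly used for gradient checking, that is, comparing the numerical gradient with the gradient we computed. A small relative error suggests that our gradient is probably correct. [A large error means the gradient is definitely wrong, but a small error does not guarantee that it is right. It is like unit testing: passing the tests does not prove there are no bugs, but failing them proves there is one.]

The computation is simple: take the absolute value of x-y and divide by the sum of the absolute values of x and y, then take the largest such ratio over all elements. Because of finite machine precision the denominator could be 0, so a maximum is applied: if the sum is below 1e-8 the denominator is 1e-8, otherwise it is |x| + |y|. [In numerical code, always think about overflow and underflow: values that underflow to 0 and values that overflow to infinity.]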
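As a quick illustration of how rel_error behaves, the sketch below applies it to a nearly identical pair, a clearly different pair, and a pair of all-zero arrays. The test arrays are made up for this example.

```python
import numpy as np

def rel_error(x, y):
    """Largest element-wise relative error between x and y."""
    return np.max(np.abs(x - y) / (np.maximum(1e-8, np.abs(x) + np.abs(y))))

a = np.array([1.0, 2.0, 3.0])
close = rel_error(a, a + 1e-10)             # tiny perturbation
far = rel_error(a, a * 1.1)                 # every element is 10% off
zeros = rel_error(np.zeros(3), np.zeros(3)) # denominator is clamped to 1e-8
```

The clamp in the denominator is what keeps the all-zero case from producing a division by zero.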

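cell-2

This cell loads the CIFAR-10 data. To learn about the data format, see the first few cells of knn.ipynb in assignment 1.

Note: assignment 1 is available from the course site.

It is installed the same way as assignment 2. However, pip install -r requirements.txt may report that pillow-3.0 already exists. Open requirements.txt; it lists two versions of pillow. Delete one of them.

Below are my results from running the first few cells of assignment 1. You can try it yourself, and if you have time it is best to do assignment 1 on your own.

[Figure: output of the first few cells of assignment 1]

Below is my result from running this cell:

[Figure: output of cell-2]

CIFAR-10 contains images from 10 classes: 'plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck'. There are 60,000 labeled images in total, 50,000 for training and 10,000 for testing. Here we use 49,000 of the 50,000 training images for actual training and 1,000 for validation.

cell-3

Next we open cs231n/layers.py and implement the affine_forward function in it.

First, look at the code the course already provides before we write anything: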

def affine_forward(x, w, b):
    """
    Computes the forward pass for an affine (fully-connected) layer.

    The input x has shape (N, d_1, ..., d_k) and contains a minibatch of N
    examples, where each example x[i] has shape (d_1, ..., d_k). We will
    reshape each input into a vector of dimension D = d_1 * ... * d_k, and
    then transform it to an output vector of dimension M.

    Inputs:
    - x: A numpy array containing input data, of shape (N, d_1, ..., d_k)
    - w: A numpy array of weights, of shape (D, M)
    - b: A numpy array of biases, of shape (M,)

    Returns a tuple of:
    - out: output, of shape (N, M)
    - cache: (x, w, b)
    """
    out = None
    #############################################################################
    # TODO: Implement the affine forward pass. Store the result in out. You     #
    # will need to reshape the input into rows.                                 #
    #############################################################################
    pass
    #############################################################################
    #                             END OF YOUR CODE                              #
    #############################################################################
    cache = (x, w, b)
    return out, cache

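Read the docstring under the function signature carefully.

This function computes the forward pass of an affine (fully connected) layer. "Affine transformation" sounds very mathematical, but all we need to know is that it is a linear map plus an offset. For details, see Wikipedia.

In the one-dimensional special case, y=Ax+b is an affine transformation. In the multi-dimensional case, A becomes a matrix and b becomes a vector.

[Note: "fully connected" here refers to the affine transformation only and does not include the nonlinear activation. Some literature uses "fully connected layer" for a layer that includes the activation.]

Input parameter x

x is a numpy ndarray of shape (N, d_1, ..., d_k).

N is the batch size. For efficiency we usually compute the forward and backward pass for a whole batch at once. Why is the number of remaining dimensions variable? It is a convenience, because the size of a CNN's filters is not fixed. We can simply flatten this multi-dimensional tensor into a one-dimensional vector, because a fully connected layer does not take the spatial position of its inputs into account. [A CNN does consider spatial relationships, which is why it works better for image processing. We will cover this in detail when we introduce CNNs.] We denote the dimension of the flattened vector by D:

D = d_1 * ... * d_k

Input parameter w

w is a numpy ndarray of shape (D, M). This is easy to understand: a fully connected layer with D input neurons and M output neurons has a (D, M) weight matrix. [Note: readers who remember the earlier code will recall that we had it the other way around, with w as an M*D matrix. Either convention works; it is only a matter of habit, although one requires a transpose where the other does not. The one thing to remember is that the shapes must satisfy the rules of matrix multiplication!]

Input parameter b

b is a numpy ndarray of shape (M,). It is an M-dimensional vector, the bias.

Output out

The output out is an ndarray of shape (N, M).

Output cache

cache stores this layer's inputs and intermediate variables. Here cache = (x, w, b), a tuple holding x, w, and b, which is used in the backward pass.

Implementing affine_forward

Now that the inputs and outputs are clear, we need to implement the function. The course code tells us to compute out at the indicated place (where the pass statement is).

As mentioned above, the affine function is out=Wx+b. What needs attention is the shapes in the matrix multiplication. First we turn x from a high-dimensional tensor into a 2-D matrix.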

N = x.shape[0]
x_temp = x.reshape(N,-1)

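This uses ndarray's reshape function. You can look up the documentation online or read it directly in Python, which is more convenient: start ipython, import numpy, and use ? to view the help: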

$ ipython
import numpy as np
np.reshape?

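We could write code to compute

D = d_1 * ... * d_k

ourselves, but numpy's reshape offers a shortcut: set one dimension to -1 and let numpy infer it. We know the first dimension is N and the remaining dimensions are flattened into one vector, so we pass -1 for the second dimension.

Of course we can also compute D entirely by ourselves; as an exercise, modify the code to do so. [Many readers will first think of a for loop, but in numpy and similar tools such as MATLAB, loops should be avoided where possible, because the built-in vectorized functions are optimized.]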
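One possible way to do this exercise without a for loop is np.prod over the trailing dimensions. This is only a sketch; the example shape (2, 3, 4, 5) is arbitrary.

```python
import numpy as np

x = np.zeros((2, 3, 4, 5))      # N=2 examples, each of shape (3, 4, 5)
N = x.shape[0]

# Compute D = d_1 * ... * d_k explicitly, without a Python loop.
D = int(np.prod(x.shape[1:]))

flat_inferred = x.reshape(N, -1)  # let numpy infer the second dimension
flat_explicit = x.reshape(N, D)   # pass the computed D instead
```

Both calls produce the same (N, D) matrix; -1 simply saves us from computing D.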

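Next comes out=Wx+b.

The output out is N×M, W is D×M, and x_temp is N×D, so the only valid product is x_temp·W. The computation of out is therefore:

out = x_temp.dot(w) + b

Wait a moment! b is an M-dimensional vector and x_temp.dot(w) is N×M. How can these two ndarrays be added? The trick is numpy's broadcasting. If this is unfamiliar, read the numpy documentation on broadcasting.

If we processed a single example instead of a batch, N would be 1 and the addition would be straightforward. Now we compute W·x for N training examples at once, but b is the same for all of them (W and b do not change across the N computations). Without broadcasting we would have to copy b into an N×M matrix, which wastes memory.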
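The sketch below compares broadcasting with the explicit copy described above. The sizes and values are made up for illustration; np.tile stands in for "copy b into an N×M matrix".

```python
import numpy as np

N, D, M = 4, 3, 2
x_temp = np.arange(N * D, dtype=float).reshape(N, D)
w = np.ones((D, M))
b = np.array([10.0, 20.0])          # shape (M,)

# Broadcasting: b is added to every row of the (N, M) product.
out_broadcast = x_temp.dot(w) + b

# Explicit copy: build an (N, M) matrix whose rows are all b.
out_tiled = x_temp.dot(w) + np.tile(b, (N, 1))
```

Both results are identical, but the broadcast version never materializes the N copies of b.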

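That completes the function. Simple, isn't it?

How do we know the function is correct? One great thing about CS231n is that every step comes with checking code. After writing the function we can run this cell to test it.

The comment in the cell says that if the relative error is below 1e-9, the code is fine (at least it passes the unit test). Congratulations! You have correctly implemented the first function!

cell-4

The second function to implement is affine_backward, which computes the gradients in the backward direction.

Input dout

dLoss/dout passed back from the layer above (the following layer). Its shape is the same as out: (N, M).

Input cache

The cache we saved. It is a tuple containing:
x: the input, of shape (N, d_1, ..., d_k)
w: the weight matrix, of shape (D, M)
b: the bias, of shape (M,)

Output

The function returns a tuple:
dx: dLoss/dx, of shape (N, d_1, ..., d_k)
dw: dLoss/dw, of shape (D, M)
db: dLoss/db, of shape (M,)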

def affine_backward(dout, cache):
    x, w, b = cache
    dx, dw, db = None, None, None
    #############################################################################
    # TODO: Implement the affine backward pass.                                 #
    #############################################################################
    db = np.sum(dout, axis=0)
    x_temp = x.reshape(x.shape[0], -1)
    dw = x_temp.T.dot(dout)
    dx = dout.dot(w.T).reshape(x.shape)
    #############################################################################
    #                             END OF YOUR CODE                              #
    #############################################################################
    return dx, dw, db

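I have already filled in the code; now let us analyze why it is correct.

First we compute dw and dx.

By the chain rule:

$$\frac{\partial Loss}{\partial w} = \frac{\partial Loss}{\partial out} \cdot \frac{\partial out}{\partial w}$$

dout is (N,M), x_temp is (N,D), and dw is (D,M), so the only valid product is:

$$dw = x\_temp^{T} \cdot dout$$

So the code is:

dw = x_temp.T.dot(dout)

dx is obtained in the same way. The only difference is that the result is the flattened dx, which must be reshaped into a tensor with the same shape as x:

dx = dout.dot(w.T).reshape(x.shape)

Finally db. If the batch size were 1, it would simply be db=dout. But dout now holds the gradients of N training examples, so they must be summed. This is done with np.sum (a for loop would also work, but it is slower and more verbose): db = np.sum(dout, axis=0)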

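In [10]: dout=np.array([[1,2,3],[4,5,6]])

In [11]: dout
Out[11]:
array([[1, 2, 3],
       [4, 5, 6]])

In [12]: dout.sum(axis=0)
Out[12]: array([5, 7, 9])

The session above is an example of sum along axis 0. Make sure you understand how db is computed.

With the implementation done, we test it with cell-4. The course code already contains the unit test; we only need to run cell-4.

It is also worth noting that this cell uses the function eval_numerical_gradient_array from cs231n/gradient_check.py. The same file also contains eval_numerical_gradient. Both compute numerical gradients, which the cell then compares against our analytic gradients. Interested readers can study that code in detail.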
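The idea behind such a check can be sketched as follows. This is not the course's implementation: numeric_grad_array is a hypothetical helper written for this illustration, using a central difference and the upstream gradient dout, and the affine functions are the ones derived above.

```python
import numpy as np

def affine_forward(x, w, b):
    out = x.reshape(x.shape[0], -1).dot(w) + b
    return out, (x, w, b)

def affine_backward(dout, cache):
    x, w, b = cache
    x_temp = x.reshape(x.shape[0], -1)
    dx = dout.dot(w.T).reshape(x.shape)
    dw = x_temp.T.dot(dout)
    db = np.sum(dout, axis=0)
    return dx, dw, db

def numeric_grad_array(f, x, dout, h=1e-5):
    """Hypothetical helper: central-difference gradient of sum(f(x) * dout)."""
    grad = np.zeros_like(x)
    for idx in np.ndindex(*x.shape):
        old = x[idx]
        x[idx] = old + h
        pos = f(x).copy()
        x[idx] = old - h
        neg = f(x).copy()
        x[idx] = old                      # restore the original value
        grad[idx] = np.sum((pos - neg) * dout) / (2 * h)
    return grad

rng = np.random.RandomState(0)
x = rng.randn(5, 2, 3)                    # N=5, D=2*3=6
w = rng.randn(6, 4)                       # M=4
b = rng.randn(4)
dout = rng.randn(5, 4)

_, cache = affine_forward(x, w, b)
dx, dw, db = affine_backward(dout, cache)
dx_num = numeric_grad_array(lambda v: affine_forward(v, w, b)[0], x, dout)
dw_num = numeric_grad_array(lambda v: affine_forward(x, v, b)[0], w, dout)
db_num = numeric_grad_array(lambda v: affine_forward(x, w, v)[0], b, dout)
```

If the analytic and numerical gradients agree to within a small tolerance, the backward pass is very likely correct.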

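cell-5

This cell implements the forward pass of ReLU.

The code is a single line:

out = np.maximum(0, x)

Note the difference between numpy's maximum and max. maximum takes two arguments and returns the element-wise larger of the two, which is the mathematical max(x,y). max finds the largest value inside one ndarray (optionally along a given axis).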
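A small sketch of the difference, using a made-up 2×2 array:

```python
import numpy as np

x = np.array([[-1.0, 2.0],
              [3.0, -4.0]])

relu_out = np.maximum(0, x)   # element-wise max(0, x): same shape as x
largest = np.max(x)           # single largest value in the whole array
row_max = np.max(x, axis=1)   # largest value in each row
```

np.maximum keeps the shape of its input, which is why it is the right choice for ReLU.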

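cell-6

This cell implements the backward pass of ReLU.

It is also a single line of code:

dx = (x >= 0) * dout

Where does it come from? Recall the partial derivative of max(x,y) from earlier:

$$\frac{\partial \max(x,y)}{\partial x} = \begin{cases} 1 & x \ge y \\ 0 & x < y \end{cases}$$

Setting y to 0 gives

$$\frac{\partial \max(x,0)}{\partial x} = \begin{cases} 1 & x \ge 0 \\ 0 & x < 0 \end{cases}$$

What does x>=0 return? Let us test it:

In [15]: x=np.array([1,-1])

In [16]: x>=0
Out[16]: array([ True, False], dtype=bool)

It returns a bool array. What happens when a bool array is multiplied by a double array?

In [15]: x=np.array([1,-1])

In [16]: x>=0
Out[16]: array([ True, False], dtype=bool)

In [17]: y=np.array([2.0,3.0])

In [18]: (x>=0)*y
Out[18]: array([ 2.,  0.])

As we can see, True is converted to 1 and False to 0.

So x>=0 in numpy is the mathematical indicator function:

$$\mathbb{1}(x \ge 0) = \begin{cases} 1 & x \ge 0 \\ 0 & x < 0 \end{cases}$$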

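cell-7

This cell covers affine_relu_forward and affine_relu_backward. [They are already implemented; we only need to read the code.] One layer of a neural network needs both an affine layer and a ReLU layer, so gluing them together makes them more convenient to use.

The code is in cs231n/layer_utils.py: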

def affine_relu_forward(x, w, b):
    """
    Convenience layer that performs an affine transform followed by a ReLU

    Inputs:
    - x: Input to the affine layer
    - w, b: Weights for the affine layer

    Returns a tuple of:
    - out: Output from the ReLU
    - cache: Object to give to the backward pass
    """
    a, fc_cache = affine_forward(x, w, b)
    out, relu_cache = relu_forward(a)
    cache = (fc_cache, relu_cache)
    return out, cache


def affine_relu_backward(dout, cache):
    """
    Backward pass for the affine-relu convenience layer
    """
    fc_cache, relu_cache = cache
    da = relu_backward(dout, relu_cache)
    dx, dw, db = affine_backward(da, fc_cache)
    return dx, dw, db

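cell-8

svm-loss and softmax-loss.

The course provides this code, because these functions were supposed to be written in assignment 1. Since we skipped assignment 1, we still need to understand the code. We will not go into the SVM loss; interested readers can consult the course notes. We will briefly explain the softmax loss, because it is very common in neural networks. For a detailed treatment, see the linked reference.

First, a clarification: strictly speaking there is no loss function called "softmax loss". The term means applying a softmax function at the output layer and then using the cross-entropy loss.

The softmax function

Put simply, the softmax function maps a vector to another vector whose elements are all greater than 0 (and, given the next condition, less than 1) and sum to 1. A further requirement is that the mapping is "monotonic": the order of any two numbers is preserved by the mapping.

Given only this description, how would you implement softmax? First the values must become positive. There are many ways to do that, and the exponential function is the most obvious one. First:

$$e^{x} > 0$$

Moreover, if x1>x2, then:

$$e^{x_1} > e^{x_2}$$

How do we make them sum to 1? That is easy too: divide by their sum. This gives the softmax function:

$$softmax(x)_i = \frac{e^{x_i}}{\sum_{j} e^{x_j}}$$

Because every element of this vector lies between 0 and 1 and the elements sum to 1, we can interpret the output of a K-class classifier as the classifier's "probabilities".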
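The three properties can be checked with a direct, naive implementation of the formula. This is only a sketch for small inputs: it does not guard against overflow, which the article addresses later in the line-by-line walkthrough of softmax_loss.

```python
import numpy as np

def softmax(z):
    """Naive softmax of a 1-D vector (no overflow protection)."""
    e = np.exp(z)
    return e / np.sum(e)

z = np.array([1.0, 3.0, 2.0])
p = softmax(z)

all_in_unit_interval = bool(np.all((p > 0) & (p < 1)))
total = np.sum(p)
order_preserved = bool(np.array_equal(np.argsort(p), np.argsort(z)))
```

Every element lies strictly between 0 and 1, the elements sum to 1, and the ranking of the inputs is preserved.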

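The cross-entropy loss function

The true class is one of 1,2,…,K. We can represent it in one-hot form; for example, if the class is 2, we write [0, 1, 0, …, 0].

We can then use cross-entropy to measure the "distance" between the true distribution p=[0,1,0…0] and the distribution q output by the model:

$$H(p, q) = -\sum_{k} p_k \log q_k$$

For details, see the linked reference.

The smaller the distance, the smaller the loss.

Because exactly one element of p is 1 and the rest are 0, we only need -log q at the index where p is 1.

An example: suppose K=5, the true class is 2, and the classifier outputs [0.1, 0.7, 0.1, 0.1, 0]. The loss is -log0.7. If the classifier outputs [0.3, 0.7, 0, 0, 0], the loss is still -log0.7. The loss depends only on the probability assigned to the true class, which is a reasonable choice. The larger the second element, the higher the probability the classifier assigns to class 2, so the log is larger (at most log1=0, meaning no loss), -log is smaller, and the loss is smaller!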
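The example above can be reproduced in a few lines. The helper cross_entropy is written for this illustration; it skips terms where p is 0 so that log(0) in q is never evaluated for them.

```python
import numpy as np

def cross_entropy(p, q):
    """H(p, q) = -sum_k p_k * log(q_k), skipping terms where p_k is 0."""
    mask = p > 0
    return -np.sum(p[mask] * np.log(q[mask]))

p = np.array([0.0, 1.0, 0.0, 0.0, 0.0])     # one-hot: true class is the 2nd
q1 = np.array([0.1, 0.7, 0.1, 0.1, 0.0])
q2 = np.array([0.3, 0.7, 0.0, 0.0, 0.0])

loss1 = cross_entropy(p, q1)
loss2 = cross_entropy(p, q2)
```

Both losses equal -log 0.7, roughly 0.357, because only the probability of the true class enters the sum.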

The softmax loss code

def softmax_loss(x, y):
    """
    Computes the loss and gradient for softmax classification.

    Inputs:
    - x: Input data, of shape (N, C) where x[i, j] is the score for the jth class
      for the ith input.
    - y: Vector of labels, of shape (N,) where y[i] is the label for x[i] and
      0 <= y[i] < C

    Returns a tuple of:
    - loss: Scalar giving the loss
    - dx: Gradient of the loss with respect to x
    """
    probs = np.exp(x - np.max(x, axis=1, keepdims=True))
    probs /= np.sum(probs, axis=1, keepdims=True)
    N = x.shape[0]
    loss = -np.sum(np.log(probs[np.arange(N), y])) / N
    dx = probs.copy()
    dx[np.arange(N), y] -= 1
    dx /= N
    return loss, dx

First, the input parameters:
x is the input data of shape (N,C), where N is the batch size and C is the number of classes.
y holds the labels and has shape (N,). Each y[i] lies in [0,C) and gives the correct class.

Outputs:
loss, the scalar loss
dx, dLoss/dx
The function is only a few lines long, but there is a lot packed into it. Let us go through the code line by line:

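Line 1

probs = np.exp(x - np.max(x, axis=1, keepdims=True))

Computing the softmax formula above directly can overflow in practice.
For example, with input x=[1000000, 1000000], exp(1000000) overflows. What can we do?
Consider an example with x=[1,2,3] and x=[101,102,103]. We compute the first component of softmax for each:

$$\frac{e^{1}}{e^{1}+e^{2}+e^{3}}$$

$$\frac{e^{101}}{e^{101}+e^{102}+e^{103}}$$

If we divide the numerator and denominator of the second expression by exp(100), its value does not change, and we find that the two values are the same!

So one property of the softmax function is that it depends only on the "relative" sizes of the elements of the input vector. To prevent overflow, we can subtract the largest element and then compute exp. We discussed the difference between np.max and np.maximum earlier: np.max finds the largest number in x. Because we process a whole batch of N examples at once, we need the maximum along axis=1. Here are some examples of max; study them carefully: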

In [21]: x=np.array([[1,3,5],[2,1,1]])

In [22]: np.max(x)
Out[22]: 5

In [23]: np.max(x, axis=1)
Out[23]: array([5, 2])

In [24]: np.max(x, axis=1, keepdims=True)
Out[24]:
array([[5],
       [2]])

In [25]: np.max(x, axis=1).shape
Out[25]: (2,)

In [26]: np.max(x, axis=1, keepdims=True).shape
Out[26]: (2, 1)

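Note what keepdims does. Taking the maximum of a vector yields a scalar, so one dimension is lost. Likewise, taking the max of a matrix along one axis yields a vector. keepdims means "keep the dimensions". x has shape (N,C) and np.max(x,axis=1,keepdims=True) has shape (N,1), so x-np.max(x,axis=1,keepdims=True) is a valid subtraction under the broadcasting rules. With keepdims=False, the (N,) vector would be aligned with the last axis of the (N,C) matrix, which is an error unless N happens to equal C, and in that case the result would silently be wrong.
Finally, np.exp computes exp of every element of this (N,C) matrix (it is a universal function).

Line 2

probs /= np.sum(probs, axis=1, keepdims=True)

This normalizes each row. It uses np.sum, which, like max, needs keepdims for the same reason.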
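The effect of these two lines can be checked on the inputs discussed above. The wrapper softmax_rows is named for this sketch only; its body is the two lines just explained.

```python
import numpy as np

def softmax_rows(x):
    """Row-wise softmax with the max-subtraction trick for stability."""
    shifted = x - np.max(x, axis=1, keepdims=True)
    probs = np.exp(shifted)
    return probs / np.sum(probs, axis=1, keepdims=True)

small = softmax_rows(np.array([[1.0, 2.0, 3.0]]))
shifted_by_100 = softmax_rows(np.array([[101.0, 102.0, 103.0]]))
huge = softmax_rows(np.array([[1000000.0, 1000000.0]]))
```

The first two results are equal, and the input that would overflow a naive exp yields the finite probabilities 0.5 and 0.5.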

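Lines 3 and 4

N = x.shape[0]
loss = -np.sum(np.log(probs[np.arange(N), y])) / N

First we need to understand probs[np.arange(N), y]:

In [39]: x=np.array([[1,3,5],[2,1,1]])

In [40]: probs = np.exp(x - np.max(x, axis=1, keepdims=True))

In [41]: probs /= np.sum(probs, axis=1, keepdims=True)

In [42]: N = x.shape[0]

In [43]: y=np.array([1,0])

In [44]: probs[np.arange(N), y]
Out[44]: array([ 0.11731043,  0.57611688])

As explained above, we need the log of the probability at the index of the "true" class. The true class index is y. In the example above, x holds two training examples and y holds the corresponding correct class indices 1 and 0. So we need row 0, column 1 and row 1, column 0; we take -log of each and add them up. We could write a for loop, but numpy's ndarray offers a convenient way to select such elements, namely indexing with integer arrays. np.arange is like Python's built-in range but returns an ndarray, here the N numbers (0, 1, …, N-1). probs[np.arange(N), y] then indexes with these two one-dimensional arrays and returns a one-dimensional array, equivalent to [probs[0,1], probs[1,0]].

Next, log is applied to every element of this array, the results are summed and negated, and the total is divided by N to obtain the average loss.
Study this line of code carefully. If you are not familiar with ndarray indexing, see the numpy indexing documentation.

Next we compute dLoss/dx. The formula is somewhat involved, so I will derive it in detail below. Readers with time should work through the derivation themselves, step by step. To keep the formulas short, we write p for probs in the code: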

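With

$$p_i = \frac{e^{x_i}}{\sum_{k} e^{x_k}}, \qquad Loss = -\sum_{k} y_k \log p_k$$

we first compute

$$\frac{\partial p_i}{\partial x_j}$$

There are two cases. The first case is i=j. Recall the quotient rule:

$$\left(\frac{f}{g}\right)' = \frac{f'g - fg'}{g^2}$$

$$\frac{\partial p_i}{\partial x_i} = \frac{e^{x_i}\sum_{k} e^{x_k} - e^{x_i}e^{x_i}}{\left(\sum_{k} e^{x_k}\right)^2} = p_i(1 - p_i)$$

The second case is i≠j:

$$\frac{\partial p_i}{\partial x_j} = \frac{0 - e^{x_i}e^{x_j}}{\left(\sum_{k} e^{x_k}\right)^2} = -p_i p_j$$

Next we compute

$$\frac{\partial Loss}{\partial x_i} = -\sum_{k} y_k \frac{1}{p_k}\frac{\partial p_k}{\partial x_i}$$

We split the summation index k into two cases, k=i and k≠i, and substitute the formulas above:

$$\frac{\partial Loss}{\partial x_i} = -y_i(1 - p_i) + \sum_{k \ne i} y_k p_i = p_i\sum_{k} y_k - y_i = p_i - y_i$$

The last step uses ∑_k y_k = 1.

The derivation is somewhat involved, but the result is easy to remember: the gradient of softmax plus cross-entropy is the model's prediction p minus the label y.

Now let us see how the code implements this!

dx = probs.copy()
dx[np.arange(N), y] -= 1
dx /= N

We need to implement probs - y. In the formula, y is a one-hot vector, whereas here y is an index (ignoring the batch dimension N). So the code first copies probs into a new variable dx. [Personally I think modifying probs in place would also be fine.] Since the one-hot y is 1 only at the index of the label, we write dx[np.arange(N), y] -= 1, and then divide by N to get the averaged dx.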
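The result p - y can be verified numerically. The sketch below reuses the softmax_loss code from the article and compares its dx against a central-difference estimate; the random inputs and the step size h are choices made for this illustration.

```python
import numpy as np

def softmax_loss(x, y):
    probs = np.exp(x - np.max(x, axis=1, keepdims=True))
    probs /= np.sum(probs, axis=1, keepdims=True)
    N = x.shape[0]
    loss = -np.sum(np.log(probs[np.arange(N), y])) / N
    dx = probs.copy()
    dx[np.arange(N), y] -= 1
    dx /= N
    return loss, dx

rng = np.random.RandomState(1)
x = rng.randn(3, 4)              # N=3 examples, C=4 classes
y = np.array([0, 3, 1])
loss, dx = softmax_loss(x, y)

# Central-difference estimate of dLoss/dx, one element at a time.
h = 1e-5
dx_num = np.zeros_like(x)
for idx in np.ndindex(*x.shape):
    old = x[idx]
    x[idx] = old + h
    loss_plus, _ = softmax_loss(x, y)
    x[idx] = old - h
    loss_minus, _ = softmax_loss(x, y)
    x[idx] = old                 # restore the original value
    dx_num[idx] = (loss_plus - loss_minus) / (2 * h)
```

Each row of dx sums to zero, because the probabilities in a row sum to 1 and exactly one entry has 1 subtracted.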

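cell-9

Next we assemble these Layers into a complete multi-layer neural network. Open cs231n/classifiers/fc_net.py; we now complete the TwoLayerNet class.
The code is given directly below, with the explanation in the comments. Additional explanation follows after the code.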

class TwoLayerNet(object):
    """
    A two-layer fully connected neural network with ReLU activation and
    softmax loss. We assume an input dimension of D, H hidden units, and
    a C-dimensional output.

    The architecture is affine - relu - affine - softmax.

    Note: this class does not implement gradient descent; instead, a separate
    Solver object is used to optimize the parameters.

    The learnable (trainable) parameters of the model are stored in the dict
    self.params; keys are parameter names, values are numpy ndarrays.
    """

    def __init__(self, input_dim=3*32*32, hidden_dim=100, num_classes=10,
                 weight_scale=1e-3, reg=0.0):
        """
        Initialize a new network.

        Inputs:
        - input_dim: An integer, the size of the input vector; default 3*32*32
          (CIFAR-10 data).
        - hidden_dim: An integer, the number of hidden units; default 100.
        - num_classes: An integer, the number of output classes; default 10
          (the number of CIFAR-10 classes).
        - weight_scale: A scalar, the standard deviation used for random
          initialization of the weights; default 1e-3.
        - reg: A scalar, the L2 regularization strength.
        """
        self.params = {}
        self.reg = reg
        ############################################################################
        # TODO: Initialize the weights and biases of the two-layer net. Weights
        # are drawn from a Gaussian with mean 0 and standard deviation
        # weight_scale; biases are initialized to 0. All weights and biases are
        # stored in self.params: the first layer uses the keys 'W1' and 'b1',
        # the second layer uses 'W2' and 'b2'.
        ############################################################################
        # np.random.normal generates a matrix of the given shape with standard
        # deviation weight_scale
        self.params['W1'] = np.random.normal(0, weight_scale, (input_dim, hidden_dim))
        self.params['b1'] = np.zeros(hidden_dim)
        self.params['W2'] = np.random.normal(0, weight_scale, (hidden_dim, num_classes))
        self.params['b2'] = np.zeros(num_classes)
        ############################################################################
        #                             END OF YOUR CODE
        ############################################################################

    def loss(self, X, y=None):
        """
        Compute loss and gradient for a minibatch of data.

        Inputs:
        - X: ndarray of shape (N, d_1, ..., d_k)
        - y: Array of labels, of shape (N,). y[i] is the label for X[i] and
          takes values in {0, 1, ..., C-1}.

        Returns:
        If y is None, run a test-time forward pass. [Note: at test time the
        final softmax is not needed, because we only want to pick a class and
        softmax is monotonic: argmax softmax[x1,x2] = argmax [x1,x2].] Return:
        - scores: Array of shape (N, C) giving classification scores, where
          scores[i, c] is the score of class c for X[i].

        If y is not None, run a training-time forward and backward pass and
        return a tuple of:
        - loss: A scalar giving the loss
        - grads: A dict with the same keys as self.params, mapping each
          parameter name to its gradient.
        """
        scores = None
        ############################################################################
        # TODO: Implement the forward pass for the two-layer net, computing
        # the scores.
        ############################################################################
        affine_relu_out, affine_relu_cache = affine_relu_forward(
            X, self.params['W1'], self.params['b1'])
        affine2_out, affine2_cache = affine_forward(
            affine_relu_out, self.params['W2'], self.params['b2'])
        scores = affine2_out
        ############################################################################
        #                             END OF YOUR CODE
        ############################################################################

        # If y is None then we are in test mode so just return scores
        if y is None:
            return scores

        loss, grads = 0, {}
        ############################################################################
        # TODO: Implement the backward pass for the two-layer net. Store the loss
        # in the loss variable and gradients in the grads dictionary. Compute data
        # loss using softmax, and make sure that grads[k] holds the gradients for
        # self.params[k]. Don't forget to add L2 regularization!
        #
        # NOTE: To ensure that your implementation matches ours and you pass the
        # automated tests, make sure that your L2 regularization includes a factor
        # of 0.5 to simplify the expression for the gradient.
        ############################################################################
        loss, dscores = softmax_loss(scores, y)
        loss += 0.5 * self.reg * (np.sum(self.params['W1'] * self.params['W1'])
                                  + np.sum(self.params['W2'] * self.params['W2']))
        affine2_dx, affine2_dw, affine2_db = affine_backward(dscores, affine2_cache)
        grads['W2'] = affine2_dw + self.reg * self.params['W2']
        grads['b2'] = affine2_db
        affine1_dx, affine1_dw, affine1_db = affine_relu_backward(
            affine2_dx, affine_relu_cache)
        grads['W1'] = affine1_dw + self.reg * self.params['W1']
        grads['b1'] = affine1_db
        ############################################################################
        #                             END OF YOUR CODE
        ############################################################################

        return loss, grads

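Read the code carefully; it is quite simple. We have not introduced L2 regularization before, so here is a brief introduction. For details, see the course notes.

Its purpose is to prevent overfitting, so the following term is added to the loss function:

$$\frac{1}{2}\lambda \sum_{w} w^2$$

In the code, λ is the self.reg parameter, which gives this line:

loss += 0.5 * self.reg * (np.sum(self.params['W1'] * self.params['W1']) + np.sum(self.params['W2'] * self.params['W2']))

Likewise, when computing the gradient of each weight matrix we must add λw:

grads['W2'] = affine2_dw + self.reg * self.params['W2']

Note that the biases are not included in the regularization term.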
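A small numeric sketch of the regularization term and its gradient, with a made-up weight matrix and reg value:

```python
import numpy as np

reg = 0.1
W = np.array([[1.0, -2.0],
              [0.5, 0.0]])

# Penalty added to the loss: 0.5 * reg * sum of squared weights.
reg_loss = 0.5 * reg * np.sum(W * W)

# Its gradient with respect to W: the 0.5 cancels the 2 from d(w^2)/dw.
reg_grad = reg * W
```

The factor 0.5 is there precisely so that the gradient is reg * W rather than 2 * reg * W.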

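Next we run this cell and check that the relative error is small enough.

cell-10

The Solver. The forward and backward code is done, so what remains is the (minibatch) gradient descent logic.

Open cs231n/solver.py. The course has already implemented it; we need to understand the code and then use it. The Solver code is fairly long (and, like the rest of the course code, written for Python 2). If you do not want to read all of it, at least read the docstring at the top to learn how it is meant to be used.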

import numpy as np

from cs231n import optim


class Solver(object):
    """
    A Solver encapsulates all the logic necessary for training classification
    models. The Solver performs stochastic gradient descent using different
    update rules defined in optim.py.

    The solver accepts both training and validation data and labels so it can
    periodically check classification accuracy on both training and validation
    data to watch out for overfitting.

    To train a model, you will first construct a Solver instance, passing the
    model, dataset, and various options (learning rate, batch size, etc) to the
    constructor. You will then call the train() method to run the optimization
    procedure and train the model.

    After the train() method returns, model.params will contain the parameters
    that performed best on the validation set over the course of training.
    In addition, the instance variable solver.loss_history will contain a list
    of all losses encountered during training and the instance variables
    solver.train_acc_history and solver.val_acc_history will be lists containing
    the accuracies of the model on the training and validation set at each epoch.

    Example usage might look something like this:

    data = {
      'X_train': # training data
      'y_train': # training labels
      'X_val': # validation data
      'y_val': # validation labels
    }
    model = MyAwesomeModel(hidden_size=100, reg=10)
    solver = Solver(model, data,
                    update_rule='sgd',
                    optim_config={
                      'learning_rate': 1e-3,
                    },
                    lr_decay=0.95,
                    num_epochs=10, batch_size=100,
                    print_every=100)
    solver.train()

    A Solver works on a model object that must conform to the following API:

    - model.params must be a dictionary mapping string parameter names to numpy
      arrays containing parameter values.

    - model.loss(X, y) must be a function that computes training-time loss and
      gradients (y is not None), and test-time classification scores
      (y is None), with the following inputs and outputs:

      Inputs:
      - X: Array giving a minibatch of input data of shape (N, d_1, ..., d_k)
      - y: Array of labels, of shape (N,) giving labels for X where y[i] is the
        label for X[i].

      Returns:
      If y is None, run a test-time forward pass and return:
      - scores: Array of shape (N, C) giving classification scores for X where
        scores[i, c] gives the score of class c for X[i].

      If y is not None, run a training time forward and backward pass and return
      a tuple of:
      - loss: Scalar giving the loss
      - grads: Dictionary with the same keys as self.params mapping parameter
        names to gradients of the loss with respect to those parameters.
    """

    def __init__(self, model, data, **kwargs):
        """
        Construct a new Solver instance.

        Required arguments:
        - model: A model object conforming to the API described above
        - data: A dictionary of training and validation data with the following:
          'X_train': Array of shape (N_train, d_1, ..., d_k) giving training images
          'X_val': Array of shape (N_val, d_1, ..., d_k) giving validation images
          'y_train': Array of shape (N_train,) giving labels for training images
          'y_val': Array of shape (N_val,) giving labels for validation images

        Optional arguments:
        - update_rule: A string giving the name of an update rule in optim.py.
          Default is 'sgd'.
        - optim_config: A dictionary containing hyperparameters that will be
          passed to the chosen update rule. Each update rule requires different
          hyperparameters (see optim.py) but all update rules require a
          'learning_rate' parameter so that should always be present.
        - lr_decay: A scalar for learning rate decay; after each epoch the learning
          rate is multiplied by this value, so it gets smaller and smaller.
        - batch_size: Size of minibatches used to compute loss and gradient during
          training.
        - num_epochs: The number of epochs to run for during training.
        - print_every: Integer; training losses will be printed every print_every
          iterations.
        - verbose: Boolean; if set to false then no output will be printed during
          training.
        """
        self.model = model
        self.X_train = data['X_train']
        self.y_train = data['y_train']
        self.X_val = data['X_val']
        self.y_val = data['y_val']

        # Unpack keyword arguments
        self.update_rule = kwargs.pop('update_rule', 'sgd')
        self.optim_config = kwargs.pop('optim_config', {})
        self.lr_decay = kwargs.pop('lr_decay', 1.0)
        self.batch_size = kwargs.pop('batch_size', 100)
        self.num_epochs = kwargs.pop('num_epochs', 10)

        self.print_every = kwargs.pop('print_every', 10)
        self.verbose = kwargs.pop('verbose', True)

        # Throw an error if there are extra keyword arguments
        if len(kwargs) > 0:
            extra = ', '.join('"%s"' % k for k in kwargs.keys())
            raise ValueError('Unrecognized arguments %s' % extra)

        # Make sure the update rule exists, then replace the string
        # name with the actual function
        if not hasattr(optim, self.update_rule):
            raise ValueError('Invalid update_rule "%s"' % self.update_rule)
        self.update_rule = getattr(optim, self.update_rule)

        self._reset()

    def _reset(self):
        """
        Set up some book-keeping variables for optimization. Don't call this
        manually.
        """
        # Set up some variables for book-keeping
        self.epoch = 0
        self.best_val_acc = 0
        self.best_params = {}
        self.loss_history = []
        self.train_acc_history = []
        self.val_acc_history = []

        # Make a deep copy of the optim_config for each parameter
        self.optim_configs = {}
        for p in self.model.params:
            d = {k: v for k, v in self.optim_config.iteritems()}
            self.optim_configs[p] = d

    def _step(self):
        """
        Make a single gradient update. This is called by train() and should not
        be called manually.
        """
        # Make a minibatch of training data
        num_train = self.X_train.shape[0]
        batch_mask = np.random.choice(num_train, self.batch_size)
        X_batch = self.X_train[batch_mask]
        y_batch = self.y_train[batch_mask]

        # Compute loss and gradient
        loss, grads = self.model.loss(X_batch, y_batch)
        self.loss_history.append(loss)

        # Perform a parameter update
        for p, w in self.model.params.iteritems():
            dw = grads[p]
            config = self.optim_configs[p]
            next_w, next_config = self.update_rule(w, dw, config)
            self.model.params[p] = next_w
            self.optim_configs[p] = next_config

    def check_accuracy(self, X, y, num_samples=None, batch_size=100):
        """
        Check accuracy of the model on the provided data.

        Inputs:
        - X: Array of data, of shape (N, d_1, ..., d_k)
        - y: Array of labels, of shape (N,)
        - num_samples: If not None, subsample the data and only test the model
          on num_samples datapoints.
        - batch_size: Split X and y into batches of this size to avoid using too
          much memory.

        Returns:
        - acc: Scalar giving the fraction of instances that were correctly
          classified by the model.
        """
        # Maybe subsample the data
        N = X.shape[0]
        if num_samples is not None and N > num_samples:
            mask = np.random.choice(N, num_samples)
            N = num_samples
            X = X[mask]
            y = y[mask]

        # Compute predictions in batches
        num_batches = N / batch_size
        if N % batch_size != 0:
            num_batches += 1
        y_pred = []
        for i in xrange(num_batches):
            start = i * batch_size
            end = (i + 1) * batch_size
            scores = self.model.loss(X[start:end])
            y_pred.append(np.argmax(scores, axis=1))
        y_pred = np.hstack(y_pred)
        acc = np.mean(y_pred == y)

        return acc

    def train(self):
        """
        Run optimization to train the model.
        """
        num_train = self.X_train.shape[0]
        iterations_per_epoch = max(num_train / self.batch_size, 1)
        num_iterations = self.num_epochs * iterations_per_epoch

        for t in xrange(num_iterations):
            self._step()

            # Maybe print training loss
            if self.verbose and t % self.print_every == 0:
                print '(Iteration %d / %d) loss: %f' % (
                    t + 1, num_iterations, self.loss_history[-1])

            # At the end of every epoch, increment the epoch counter and decay the
            # learning rate.
            epoch_end = (t + 1) % iterations_per_epoch == 0
            if epoch_end:
                self.epoch += 1
                for k in self.optim_configs:
                    self.optim_configs[k]['learning_rate'] *= self.lr_decay

            # Check train and val accuracy on the first iteration, the last
            # iteration, and at the end of each epoch.
            first_it = (t == 0)
            # The listing tested t == num_iterations + 1, which can never be
            # true inside this loop; the last iteration is num_iterations - 1.
            last_it = (t == num_iterations - 1)
            if first_it or last_it or epoch_end:
                train_acc = self.check_accuracy(self.X_train, self.y_train,
                                                num_samples=1000)
                val_acc = self.check_accuracy(self.X_val, self.y_val)
                self.train_acc_history.append(train_acc)
                self.val_acc_history.append(val_acc)

                if self.verbose:
                    print '(Epoch %d / %d) train acc: %f; val_acc: %f' % (
                        self.epoch, self.num_epochs, train_acc, val_acc)

                # Keep track of the best model
                if val_acc > self.best_val_acc:
                    self.best_val_acc = val_acc
                    self.best_params = {}
                    for k, v in self.model.params.iteritems():
                        self.best_params[k] = v.copy()

        # At the end of training swap the best params into the model
        self.model.params = self.best_params

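Note: the SGD implemented here differs slightly from the earlier one. Suppose there are 1,000 training examples and the minibatch size is 100, so one epoch has 10 iterations of 100 examples each. The earlier code guaranteed that those 10 iterations pass over all 1,000 examples, using each exactly once. This code instead randomly samples 100 examples in each of the 10 iterations, so it cannot guarantee that every example is used once: some examples may not be used at all, while others are used several times.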
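The difference between the two sampling schemes can be seen in a short experiment. This is a sketch with a fixed seed: the shuffled-epoch variant is one plausible reading of "the earlier code", not a copy of it, while the random variant mirrors the np.random.choice call in _step.

```python
import numpy as np

rng = np.random.RandomState(0)
num_train, batch_size = 1000, 100
iters = num_train // batch_size           # 10 iterations per epoch

# Scheme A: shuffle once, then cut into consecutive minibatches.
perm = rng.permutation(num_train)
epoch_batches = [perm[i * batch_size:(i + 1) * batch_size] for i in range(iters)]
seen_epoch = np.unique(np.concatenate(epoch_batches))

# Scheme B: draw every minibatch independently, with replacement.
random_batches = [rng.choice(num_train, batch_size) for _ in range(iters)]
seen_random = np.unique(np.concatenate(random_batches))
```

Scheme A touches all 1,000 examples exactly once, whereas scheme B typically touches only about 63% of them (roughly 1 - 1/e) in 10 iterations.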

In addition, the parameter update is encapsulated in optim.py. The contract between Solver and optim.py is: next_w, next_config = self.update_rule(w, dw, config). Both of these points can be seen in the _step function:

def _step(self):
    """
    Make a single gradient update. This is called by train() and should not
    be called manually.
    """
    # Make a minibatch of training data
    num_train = self.X_train.shape[0]
    batch_mask = np.random.choice(num_train, self.batch_size)
    X_batch = self.X_train[batch_mask]
    y_batch = self.y_train[batch_mask]

    # Compute loss and gradient
    loss, grads = self.model.loss(X_batch, y_batch)
    self.loss_history.append(loss)

    # Perform a parameter update
    for p, w in self.model.params.iteritems():
        dw = grads[p]
        config = self.optim_configs[p]
        next_w, next_config = self.update_rule(w, dw, config)
        self.model.params[p] = next_w
        self.optim_configs[p] = next_config

The update contract is documented in more detail in optim.py itself; I will not repeat it here, so please read the file.
Let us look at the default sgd implementation:

def sgd(w, dw, config=None):
    """
    Performs vanilla stochastic gradient descent.

    config format:
    - learning_rate: Scalar learning rate.
    """
    if config is None: config = {}
    config.setdefault('learning_rate', 1e-2)

    w -= config['learning_rate'] * dw
    return w, config

The core is a single line: w -= config['learning_rate'] * dw
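To see the update rule at work outside the Solver, the sketch below applies the same sgd function to a toy objective f(w) = sum(w**2), whose gradient is 2w. The objective, starting point, and learning rate are chosen only for this illustration.

```python
import numpy as np

def sgd(w, dw, config=None):
    """Vanilla SGD step, following the (w, dw, config) contract above."""
    if config is None:
        config = {}
    config.setdefault('learning_rate', 1e-2)
    w -= config['learning_rate'] * dw
    return w, config

w = np.array([4.0, -2.0])
config = {'learning_rate': 0.1}
for _ in range(100):
    dw = 2 * w                      # gradient of f(w) = sum(w**2)
    w, config = sgd(w, dw, config)
```

Each step multiplies w by 0.8, so after 100 steps w is very close to the minimizer at the origin.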

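After reading solver.py and optim.py, we use them together with the TwoLayerNet from before to train a two-layer neural network. The requirement is a validation accuracy above 50%.
Here is the code:

model = TwoLayerNet()
solver = None

##############################################################################
# TODO: Use a Solver instance to train a TwoLayerNet that achieves at least  #
# 50% accuracy on the validation set.                                        #
##############################################################################
for k, v in data.iteritems():
    print '%s: ' % k, v.shape

model = TwoLayerNet(hidden_dim=100, reg=1e-02)
solver = Solver(model, data,
                update_rule='sgd',
                optim_config={
                  'learning_rate': 1e-03,
                },
                lr_decay=0.95,
                num_epochs=10, batch_size=100,
                print_every=49000)
solver.train()
##############################################################################
#                             END OF YOUR CODE                               #
##############################################################################

The output is shown below. [Because of randomness, your results may differ from mine.]

[Figure: training output of cell-10]

Readers may ask why learning_rate=1e-3 and reg=1e-2 are used, since other values do not seem to reach 50% validation accuracy. This is one of the tricks of training neural networks. Nobody knows good hyperparameters in advance; they can only be found by repeated experimentation. Interested readers can consult the linked references, and there are many collections of training tricks online that you can search for and study. Later assignments also cover optimization algorithms that converge faster than SGD, such as RMSProp and Adam; see the linked reference if you are interested. From here on I will only give the code and leave the details for readers to work out.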

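cell-11

This part requires no implementation; just run it. Readers who are not familiar with Python may want to read the code to learn how to plot.

cell-12

Implement FullyConnectedNet, which generalizes the two-layer network to a fully connected network with an arbitrary number of layers. It is very similar to the two-layer case, so I will not go into detail; please read the code yourself. The only thing to watch is that the first and last layers need special handling.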

class FullyConnectedNet(object):
    """
    A fully-connected neural network with an arbitrary number of hidden layers,
    ReLU nonlinearities, and a softmax loss function. This will also implement
    dropout and batch normalization as options. For a network with L layers,
    the architecture will be

    {affine - [batch norm] - relu - [dropout]} x (L - 1) - affine - softmax

    where batch normalization and dropout are optional, and the {...} block is
    repeated L - 1 times.

    Similar to the TwoLayerNet above, learnable parameters are stored in the
    self.params dictionary and will be learned using the Solver class.
    """

    def __init__(self, hidden_dims, input_dim=3*32*32, num_classes=10,
                 dropout=0, use_batchnorm=False, reg=0.0,
                 weight_scale=1e-2, dtype=np.float32, seed=None):
        """
        Initialize a new FullyConnectedNet.

        Inputs:
        - hidden_dims: A list of integers giving the size of each hidden layer.
        - input_dim: An integer giving the size of the input.
        - num_classes: An integer giving the number of classes to classify.
        - dropout: Scalar between 0 and 1 giving dropout strength. If dropout=0
          then the network should not use dropout at all.
        - use_batchnorm: Whether or not the network should use batch normalization.
        - reg: Scalar giving L2 regularization strength.
        - weight_scale: Scalar giving the standard deviation for random
          initialization of the weights.
        - dtype: A numpy datatype object; all computations will be performed using
          this datatype. float32 is faster but less accurate, so you should use
          float64 for numeric gradient checking.
        - seed: If not None, then pass this random seed to the dropout layers. This
          will make the dropout layers deterministic so we can gradient check the
          model.
        """
        self.use_batchnorm = use_batchnorm
        self.use_dropout = dropout > 0
        self.reg = reg
        self.num_layers = 1 + len(hidden_dims)
        self.dtype = dtype
        self.params = {}

        ############################################################################
        # TODO: Initialize the parameters of the network, storing all values in
        # the self.params dictionary. Store weights and biases for the first layer
        # in W1 and b1; for the second layer use W2 and b2, etc. Weights should be
        # initialized from a normal distribution with standard deviation equal to
        # weight_scale and biases should be initialized to zero.
        #
        # When using batch normalization, store scale and shift parameters for the
        # first layer in gamma1 and beta1; for the second layer use gamma2 and
        # beta2, etc. Scale parameters should be initialized to one and shift
        # parameters should be initialized to zero.
        ############################################################################
        pass
        ############################################################################
        #                             END OF YOUR CODE
        ############################################################################

        # When using dropout we need to pass a dropout_param dictionary to each
        # dropout layer so that the layer knows the dropout probability and the mode
        # (train / test). You can pass the same dropout_param to each dropout layer.
        self.dropout_param = {}
        if self.use_dropout:
            self.dropout_param = {'mode': 'train', 'p': dropout}
            if seed is not None:
                self.dropout_param['seed'] = seed

        # With batch normalization we need to keep track of running means and
        # variances, so we need to pass a special bn_param object to each batch
        # normalization layer. You should pass self.bn_params[0] to the forward pass
        # of the first batch normalization layer, self.bn_params[1] to the forward
        # pass of the second batch normalization layer, etc.
        self.bn_params = []
        if self.use_batchnorm:
            self.bn_params = [{'mode': 'train'} for i in range(self.num_layers - 1)]

        # Cast all parameters to the correct datatype
        for k, v in self.params.items():
            self.params[k] = v.astype(dtype)

    def loss(self, X, y=None):
        """
        Compute loss and gradient for the fully-connected net.

        Input / output: Same as TwoLayerNet above.
        """
        X = X.astype(self.dtype)
        mode = 'test' if y is None else 'train'

        # Set train/test mode for batchnorm params and dropout param since they
        # behave differently during training and testing.
        if self.dropout_param is not None:
            self.dropout_param['mode'] = mode
        if self.use_batchnorm:
            for bn_param in self.bn_params:
                # The key must be the string 'mode'; indexing with the variable
                # mode would add a 'train' / 'test' key and leave 'mode' unchanged.
                bn_param['mode'] = mode

        scores = None
        ############################################################################
        # TODO: Implement the forward pass for the fully-connected net, computing
        # the class scores for X and storing them in the scores variable.
        #
        # When using dropout, you'll need to pass self.dropout_param to each
        # dropout forward pass.
        #
        # When using batch normalization, you'll need to pass self.bn_params[0] to
        # the forward pass for the first batch normalization layer, pass
        # self.bn_params[1] to the forward pass for the second batch normalization
        # layer, etc.
        ############################################################################
        pass
        ############################################################################
        #                             END OF YOUR CODE
        ############################################################################

        # If test mode return early
        if mode == 'test':
            return scores

        loss, grads = 0.0, {}
        ############################################################################
        # TODO: Implement the backward pass for the fully-connected net. Store the
        # loss in the loss variable and gradients in the grads dictionary. Compute
        # data loss using softmax, and make sure that grads[k] holds the gradients
        # for self.params[k]. Don't forget to add L2 regularization!
        #
        # When using batch normalization, you don't need to regularize the scale
        # and shift parameters.
        #
        # NOTE: To ensure that your implementation matches ours and you pass the
        # automated tests, make sure that your L2 regularization includes a factor
        # of 0.5 to simplify the expression for the gradient.
        ############################################################################
        pass
        ############################################################################
        #                             END OF YOUR CODE
        ############################################################################

        return loss, grads
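As a rough illustration of the "first and last layers are special" remark, here is a minimal standalone sketch of how the parameters and the forward pass could be laid out for an arbitrary number of layers. It is not the author's solution and not the assignment's reference code: the helper names init_fc_params and fc_forward are hypothetical, batch normalization and dropout are left out, and plain NumPy is used instead of the assignment's layer functions.

```python
import numpy as np

def init_fc_params(input_dim, hidden_dims, num_classes, weight_scale=1e-2, seed=None):
    """Build W1, b1, ..., WL, bL for an L-layer fully connected net (hypothetical helper)."""
    rng = np.random.RandomState(seed)
    # The first layer starts from input_dim, the last layer ends at num_classes.
    dims = [input_dim] + list(hidden_dims) + [num_classes]
    params = {}
    for i in range(len(dims) - 1):
        params['W%d' % (i + 1)] = weight_scale * rng.randn(dims[i], dims[i + 1])
        params['b%d' % (i + 1)] = np.zeros(dims[i + 1])
    return params

def fc_forward(X, params):
    """Compute class scores with {affine - relu} x (L - 1) - affine (hypothetical helper)."""
    num_layers = len(params) // 2
    out = X.reshape(X.shape[0], -1)      # flatten each input example
    for i in range(1, num_layers + 1):
        out = out.dot(params['W%d' % i]) + params['b%d' % i]
        if i < num_layers:               # no ReLU after the last affine layer
            out = np.maximum(0, out)
    return out

params = init_fc_params(3 * 32 * 32, [100, 50], 10, seed=0)
scores = fc_forward(np.zeros((4, 3, 32, 32)), params)
print(scores.shape)
```

The backward pass would walk the same layers in reverse order, which is why caching each layer's intermediate values during the forward pass is the usual design.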

cell-13

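Overfit 50 training examples with a two-layer neural network
This requires tuning the hyperparameters. The only real method is to try many values, preferably with a script.
The parameters I used are: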
weight_scale = 5e-2
learning_rate = 1e-3
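A script for "trying many values" could start from something like the sketch below, which draws candidate values on a log scale. The search ranges and the helper name sample_log_uniform are assumptions made for illustration, not values from the assignment; each candidate would then be passed to the model and Solver.

```python
import numpy as np

def sample_log_uniform(low, high, rng):
    """Draw one value whose log10 is uniform on [log10(low), log10(high)]."""
    return 10 ** rng.uniform(np.log10(low), np.log10(high))

rng = np.random.RandomState(0)
# Assumed search ranges; widen or narrow them after looking at the first results.
candidates = [
    {'weight_scale': sample_log_uniform(1e-3, 1e-1, rng),
     'learning_rate': sample_log_uniform(1e-4, 1e-2, rng)}
    for _ in range(5)
]
for c in candidates:
    print('weight_scale=%.2e learning_rate=%.2e' % (c['weight_scale'], c['learning_rate']))
```

Sampling on a log scale is a common choice here because both parameters matter by order of magnitude rather than by absolute difference.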

cell-14

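Overfit 50 training examples with a 5-layer neural network. You will find that good parameters are harder to find than for the two-layer network.
The parameters I used: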
weight_scale = 5e-2
learning_rate = 5e-3

cell-15

sgd_momentum
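Copy the following code into the TODO section:

# Update the velocity, then step along it. Writing w + v creates a new
# array instead of modifying w in place.
v = config['momentum'] * v - config['learning_rate'] * dw
next_w = w + v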
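To see the update in isolation, here is a self-contained sketch that wraps it in a function and runs it on a toy quadratic. The function signature mirrors the (w, dw, config) style of the snippet above, but the default values and the toy objective are assumptions made for this example.

```python
import numpy as np

def sgd_momentum(w, dw, config=None):
    """One SGD-with-momentum step; the velocity is kept in config['velocity']."""
    config = {} if config is None else config
    config.setdefault('learning_rate', 1e-2)
    config.setdefault('momentum', 0.9)
    v = config.get('velocity', np.zeros_like(w))
    v = config['momentum'] * v - config['learning_rate'] * dw
    next_w = w + v
    config['velocity'] = v
    return next_w, config

# Toy problem: minimize f(w) = 0.5 * ||w||^2, whose gradient is w itself.
w = np.array([1.0, -2.0])
config = None
for _ in range(200):
    w, config = sgd_momentum(w, w, config)
print(np.abs(w).max())   # should be close to 0 after 200 steps
```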

Below is a comparison of the convergence speed of sgd and sgd_momentum:

cell-16

rmsprop and adam

# rmsprop
config['cache'] = config['decay_rate'] * config['cache'] + (1 - config['decay_rate']) * (dx * dx)
next_x = x - config['learning_rate'] * dx / (np.sqrt(config['cache']) + config['epsilon'])

# adam
config['t'] += 1
config['m'] = config['beta1'] * config['m'] + (1 - config['beta1']) * dx
config['v'] = config['beta2'] * config['v'] + (1 - config['beta2']) * (dx ** 2)
mb = config['m'] / (1 - config['beta1'] ** config['t'])  # bias-corrected first moment
vb = config['v'] / (1 - config['beta2'] ** config['t'])  # bias-corrected second moment
next_x = x - config['learning_rate'] * mb / (np.sqrt(vb) + config['epsilon'])
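The two update rules can also be written as standalone functions, which makes it easy to check a single step by hand. This is a sketch under assumed default hyperparameters (the setdefault values below are common choices, not taken from the article); on the very first step both rules move each coordinate by roughly a fixed amount in the direction opposite to the gradient's sign.

```python
import numpy as np

def rmsprop(x, dx, config=None):
    """One RMSProp step using a moving average of squared gradients."""
    config = {} if config is None else config
    config.setdefault('learning_rate', 1e-2)
    config.setdefault('decay_rate', 0.99)
    config.setdefault('epsilon', 1e-8)
    config.setdefault('cache', np.zeros_like(x))
    config['cache'] = (config['decay_rate'] * config['cache']
                       + (1 - config['decay_rate']) * dx ** 2)
    next_x = x - config['learning_rate'] * dx / (np.sqrt(config['cache']) + config['epsilon'])
    return next_x, config

def adam(x, dx, config=None):
    """One Adam step with bias-corrected first and second moment estimates."""
    config = {} if config is None else config
    config.setdefault('learning_rate', 1e-3)
    config.setdefault('beta1', 0.9)
    config.setdefault('beta2', 0.999)
    config.setdefault('epsilon', 1e-8)
    config.setdefault('m', np.zeros_like(x))
    config.setdefault('v', np.zeros_like(x))
    config.setdefault('t', 0)
    config['t'] += 1
    config['m'] = config['beta1'] * config['m'] + (1 - config['beta1']) * dx
    config['v'] = config['beta2'] * config['v'] + (1 - config['beta2']) * dx ** 2
    mb = config['m'] / (1 - config['beta1'] ** config['t'])
    vb = config['v'] / (1 - config['beta2'] ** config['t'])
    next_x = x - config['learning_rate'] * mb / (np.sqrt(vb) + config['epsilon'])
    return next_x, config

x = np.array([1.0, -2.0])
dx = np.array([0.5, -3.0])
print(rmsprop(x, dx)[0])
print(adam(x, dx)[0])
```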

Comparison of convergence speed:

cell-17

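Train a 5-layer fully connected network on CIFAR-10; the goal is a model that reaches at least 50% accuracy on the validation set. You will find that getting all the parameters right is quite tricky. Don't spend too much time here, though: our focus later is on CNNs, and the tuning time is better spent there. If you are not yet familiar with neural networks, however, tuning a bit more is a good way to build intuition. Write a Python script to search for the best parameters, and if you have enough compute, run several scripts at once to search in parallel. Another trick: if the loss or val_acc has stopped changing, you can stop that run early.

I won't list the best parameters here; try for yourself to find hyperparameters that give a val_acc above 0.5.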

X_val = data['X_val']
y_val = data['y_val']
X_test = data['X_test']
y_test = data['y_test']

lr = 1e-03  # needs tuning
ws = 1e-02  # needs tuning

model = FullyConnectedNet([100, 100, 100, 100],
                          weight_scale=ws, dtype=np.float64,
                          use_batchnorm=False, reg=1e-2)
solver = Solver(model, data,
                print_every=100, num_epochs=10, batch_size=25,
                update_rule='adam',
                optim_config={'learning_rate': lr},
                lr_decay=0.95,  # needs tuning
                verbose=True)
solver.train()

plt.subplot(2, 1, 1)
plt.plot(solver.loss_history)
plt.title('Loss history')
plt.xlabel('Iteration')
plt.ylabel('Loss')

plt.subplot(2, 1, 2)
plt.plot(solver.train_acc_history, label='train')
plt.plot(solver.val_acc_history, label='val')
plt.title('Classification accuracy history')
plt.xlabel('Epoch')
plt.ylabel('Classification accuracy')
plt.show()

best_model = model
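The "stop early when val_acc no longer changes" trick mentioned above can be expressed as a small check on the accuracy history. The sketch below is one possible rule, with a hypothetical function name and assumed patience and min_delta values; how it is wired into the training loop is left open, since it only needs a list of per-epoch validation accuracies such as solver.val_acc_history.

```python
def should_stop(val_acc_history, patience=3, min_delta=1e-3):
    """Return True if the last `patience` epochs did not beat the earlier best
    validation accuracy by at least min_delta (hypothetical helper)."""
    if len(val_acc_history) <= patience:
        return False                      # not enough history to judge yet
    best_before = max(val_acc_history[:-patience])
    best_recent = max(val_acc_history[-patience:])
    return best_recent < best_before + min_delta

plateau = [0.30, 0.38, 0.41, 0.41, 0.409, 0.41]
improving = [0.30, 0.35, 0.40, 0.45, 0.50]
print(should_stop(plateau))     # accuracy has stalled for three epochs
print(should_stop(improving))   # accuracy is still rising
```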

This series is also being serialized on the CSDN artificial intelligence WeChat public account, AI_Thinker.