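Weight initialization

When we create our neural networks, we have to make choices for the initial weights and biases. So far, we've been choosing them according to the prescription discussed in Chapter 1. Just to remind you, that prescription was to choose both the weights and biases using independent Gaussian random variables, normalized to have mean $$0$$ and standard deviation $$1$$. This approach has worked well, but it is quite ad hoc, so it's worth looking for a better way of setting our initial weights and biases, one that helps our networks learn faster.

It turns out that we can do quite a bit better than initializing with normalized Gaussians. To see why, suppose we're working with a network with a large number of input neurons, say $$1000$$. And suppose we've used normalized Gaussians to initialize the weights connecting to the first hidden layer. For now I'm going to concentrate on the weights connecting the input neurons to a single neuron in that first hidden layer, and ignore the rest of the network.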

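To keep things simple, suppose we're training with an input $$x$$ in which half the input neurons are set to $$0$$ and the other half are set to $$1$$. The argument that follows applies more generally, but you'll get the gist from this special case. Consider the weighted sum $$z=\sum_j w_j x_j + b$$ of inputs to our hidden neuron. $$500$$ terms in this sum vanish, because the corresponding input $$x_j$$ is $$0$$. So $$z$$ is a sum of $$501$$ normalized Gaussian random variables: the $$500$$ remaining weight terms and the $$1$$ extra bias term. Thus $$z$$ is itself Gaussian with mean $$0$$ and standard deviation $$\sqrt{501}\approx 22.4$$. That is, $$z$$ has a very broad Gaussian distribution, not sharply peaked at all.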
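
The claim that $$z$$ has standard deviation $$\sqrt{501}$$ can be checked numerically. The following is only a sketch: the function name `sample_weighted_inputs`, the number of trials, and the seed are assumptions made for illustration and are not part of network2.py.

```python
import numpy as np

def sample_weighted_inputs(n_in=1000, n_active=500, weight_std=1.0,
                           n_trials=5000, seed=0):
    """Draw samples of z = sum_j w_j x_j + b for a single hidden neuron.

    The input x has n_active entries equal to 1 and the rest equal to 0.
    Weights are drawn from N(0, weight_std^2) and the bias from N(0, 1).
    (Hypothetical helper, for illustration only.)
    """
    rng = np.random.default_rng(seed)
    x = np.zeros(n_in)
    x[:n_active] = 1.0
    w = rng.normal(0.0, weight_std, size=(n_trials, n_in))
    b = rng.normal(0.0, 1.0, size=n_trials)
    return w @ x + b  # one value of z per trial

z = sample_weighted_inputs()
print("empirical standard deviation of z:", z.std())
print("predicted value sqrt(501):        ", np.sqrt(501))
```

With a few thousand trials the empirical value should land close to $$22.4$$, although the exact figure depends on the seed.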

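In particular, such a broad distribution means that $$|z|$$ will often be quite large, i.e., either $$z\gg 1$$ or $$z\ll -1$$. If that's the case then the output $$\sigma(z)$$ from the hidden neuron will be very close to either $$1$$ or $$0$$. That means our hidden neuron will have saturated. And when that happens, making small changes in the weights will make only minuscule changes in the activation of our hidden neuron. That minuscule change in the activation will, in turn, barely affect the rest of the neurons in the network, and we'll see a correspondingly minuscule change in the cost function. As a result, those weights will learn very slowly when we use the gradient descent algorithm. It's similar to the problem we discussed earlier in this chapter, in which output neurons that saturated on the wrong value caused learning to slow down. We addressed that earlier problem with a clever choice of cost function. Unfortunately, while that helped with saturated output neurons, it does nothing at all for the problem with saturated hidden neurons.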
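
To make the saturation argument concrete, here is a small sketch that evaluates the sigmoid and its derivative at a few values of $$z$$. The sample values of $$z$$ are arbitrary choices for illustration; $$22.4$$ is simply one standard deviation of the distribution discussed above.

```python
import numpy as np

def sigmoid(z):
    """The sigmoid function."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    """Derivative of the sigmoid function."""
    return sigmoid(z) * (1 - sigmoid(z))

# The derivative controls how strongly a small change in z (and hence in
# the weights) changes the neuron's activation.
for z in [0.0, 2.0, 10.0, 22.4]:
    print("z = %5.1f   sigma(z) = %.10f   sigma'(z) = %.3e"
          % (z, sigmoid(z), sigmoid_prime(z)))
```

The derivative is largest at $$z=0$$ and shrinks roughly exponentially as $$|z|$$ grows, which is why a neuron whose weighted input is typically tens of units from zero responds so weakly to weight changes.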

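I've been talking about the weights input to the first hidden layer. Of course, similar arguments apply to later hidden layers as well: if the weights in later hidden layers are initialized using normalized Gaussians, then activations will often be very close to $$0$$ or $$1$$, and learning will proceed very slowly.

Is there some way we can choose better initializations, so that we don't get this kind of saturation, and so avoid a learning slowdown? Suppose we have a neuron with $$n_{in}$$ input weights. Then we shall initialize those weights as Gaussian random variables with mean $$0$$ and standard deviation $$1/\sqrt{n_{in}}$$. That is, we'll squash the Gaussians down, making it less likely that our neuron will saturate. We'll continue to choose the bias as a Gaussian with mean $$0$$ and standard deviation $$1$$, for reasons I'll return to in a moment. With these choices, the weighted sum $$z=\sum_j w_j x_j + b$$ will again be a Gaussian random variable with mean $$0$$, but it will be much more sharply peaked than before. Suppose, as we did earlier, that $$500$$ of the inputs are $$0$$ and $$500$$ are $$1$$. Then it's easy to show (see the exercise below) that $$z$$ has a Gaussian distribution with mean $$0$$ and standard deviation $$\sqrt{3/2} \approx 1.22$$. This is much more sharply peaked than before, so much so that a graph of it understates the difference, since the vertical axis has to be rescaled to allow a comparison with the earlier distribution.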

Such a neuron is much less likely to saturate, and correspondingly much less likely to have problems with a learning slowdown.
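
The difference between the two initializations can also be seen by simulation. The sketch below compares the spread of $$z$$ and the fraction of samples with $$|z| > 5$$ under both schemes. The helper name `saturation_stats`, the trial count, and the threshold of $$5$$ (used here as a rough, arbitrary marker of saturation) are all assumptions for illustration.

```python
import numpy as np

def saturation_stats(weight_std, n_in=1000, n_active=500,
                     n_trials=5000, threshold=5.0, seed=0):
    """Return (std of z, fraction of samples with |z| > threshold).

    z = sum_j w_j x_j + b, where x has n_active ones and the rest zeros,
    weights are N(0, weight_std^2) and the bias is N(0, 1).
    (Hypothetical helper; the threshold is an arbitrary choice.)
    """
    rng = np.random.default_rng(seed)
    x = np.zeros(n_in)
    x[:n_active] = 1.0
    w = rng.normal(0.0, weight_std, size=(n_trials, n_in))
    b = rng.normal(0.0, 1.0, size=n_trials)
    z = w @ x + b
    return z.std(), np.mean(np.abs(z) > threshold)

# Old scheme: standard deviation 1 for every weight.
old_std, old_frac = saturation_stats(weight_std=1.0)
# New scheme: standard deviation 1/sqrt(n_in) for every weight.
new_std, new_frac = saturation_stats(weight_std=1.0 / np.sqrt(1000))

print("old: std(z) = %.2f, fraction with |z| > 5 = %.3f" % (old_std, old_frac))
print("new: std(z) = %.2f, fraction with |z| > 5 = %.5f" % (new_std, new_frac))
```

Under the old scheme most samples fall beyond the threshold, while under the new scheme almost none do, which matches the standard deviations of about $$22.4$$ and $$1.22$$ derived in the text.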

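Exercise

Verify that the standard deviation of $$z=\sum_j w_j x_j + b$$ in the paragraph above is $$\sqrt{3/2}$$. It may help to know that: (a) the variance of a sum of independent random variables is the sum of the variances of the individual random variables; and (b) the variance is the square of the standard deviation.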

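I stated above that we'll continue to initialize the biases as before, as Gaussian random variables with mean $$0$$ and standard deviation $$1$$. This is okay, because it doesn't make it much more likely that our neurons will saturate. In fact, it doesn't much matter how we initialize the biases, provided we avoid the problem with saturation. Some people go so far as to initialize all the biases to $$0$$, and rely on gradient descent to learn appropriate biases. But since it's unlikely to make much difference, we'll continue with the same initialization procedure as before.

Let's compare the results for both our old and new approaches to weight initialization, using the MNIST digit classification task. As before, we'll use $$30$$ hidden neurons, a mini-batch size of $$10$$, a regularization parameter $$\lambda=5.0$$, and the cross-entropy cost function. We will decrease the learning rate from $$\eta=0.5$$ to $$0.1$$, since that makes the results a little more easily visible in the graphs. We first train using the old method of weight initialization: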

>>> import mnist_loader
>>> training_data, validation_data, test_data = \
... mnist_loader.load_data_wrapper()
>>> import network2
>>> net = network2.Network([784, 30, 10], cost=network2.CrossEntropyCost)
>>> net.large_weight_initializer()
>>> net.SGD(training_data, 30, 10, 0.1, lmbda = 5.0,
... evaluation_data=validation_data,
... monitor_evaluation_accuracy=True)

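We can also train using the new approach to weight initialization. This is actually even easier, since network2's default way of initializing the weights is this new approach. That means we can omit the net.large_weight_initializer() call above: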

>>> net = network2.Network([784, 30, 10], cost=network2.CrossEntropyCost)
>>> net.SGD(training_data, 30, 10, 0.1, lmbda = 5.0,
... evaluation_data=validation_data,
... monitor_evaluation_accuracy=True)

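Plotting the results, we obtain:

In both cases, we end up with a classification accuracy of about 96 percent. The final classification accuracy is almost exactly the same in the two cases. But the new initialization technique gets us there much faster. At the end of the first epoch of training, the old approach to weight initialization has a classification accuracy under 87 percent, while the new approach is already at almost 93 percent. What appears to be going on is that our new approach to weight initialization starts us off in a much better regime, which lets us get good results much more quickly. The same phenomenon is also seen if we plot results with $$100$$ hidden neurons: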

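In this case, the two curves don't quite meet. However, my experiments suggest that with just a few more epochs of training (not shown) the accuracies become almost exactly the same. So on the basis of these experiments it looks as though the improved weight initialization only speeds up learning and doesn't change the final performance of our networks. However, in Chapter 4 we'll see examples of neural networks where the long-run behaviour is significantly better with the $$1/\sqrt{n_{in}}$$ weight initialization. Thus it's not only the speed of learning which is improved; sometimes the final performance is improved as well.

The $$1/\sqrt{n_{in}}$$ approach to weight initialization helps improve the way our neural networks learn. Other techniques for weight initialization have also been proposed, many building on this basic idea. I won't review the other approaches here, since $$1/\sqrt{n_{in}}$$ works well enough for our purposes. If you're interested in looking further, I recommend the discussion on pages 14 and 15 of a 2012 paper by Yoshua Bengio, as well as the references therein.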

Practical Recommendations for Gradient-Based Training of Deep Architectures, by Yoshua Bengio (2012).

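Problem

Connecting regularization and the improved method of weight initialization. L2 regularization sometimes automatically gives us something similar to the new approach to weight initialization. Suppose we are using the old approach to weight initialization. Sketch a heuristic argument that: (1) supposing $$\lambda$$ is not too small, the first epochs of training will be dominated almost entirely by weight decay; (2) provided $$\eta\lambda \ll n$$, the weights will decay by a factor of $$\exp(-\eta\lambda/m)$$ per epoch (with $$m$$ the mini-batch size); and (3) supposing $$\lambda$$ is not too large, the weight decay will tail off when the weights are down to a size around $$1/\sqrt{n}$$, where $$n$$ is the total number of weights in the network. Argue that these conditions are all satisfied in the examples given in this section.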

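Handwriting recognition revisited: the code

Let's implement the ideas we've discussed in this chapter. We'll develop a new program, network2.py, which is an improved version of the program network.py we developed in Chapter 1. If you haven't looked at network.py in a while then you may find it helpful to reread the earlier discussion of that code. It's only $$74$$ lines of code, and is easily understood.

As was the case in network.py, the star of network2.py is the Network class, which we use to represent our neural networks. We initialize an instance of Network with a list of sizes for the respective layers in the network, and a choice for the cost to use, defaulting to the cross-entropy: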

class Network(object):

    def __init__(self, sizes, cost=CrossEntropyCost):
        self.num_layers = len(sizes)
        self.sizes = sizes
        self.default_weight_initializer()
        self.cost=cost

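The first couple of lines of the __init__ method are the same as in network.py, and are pretty self-explanatory. But the next two lines are new, and we need to understand what they're doing in detail.

Let's start by examining the default_weight_initializer method. This makes use of our new and improved approach to weight initialization. As we've seen, in that approach the weights input to a neuron are initialized as Gaussian random variables with mean $$0$$ and standard deviation $$1/\sqrt{n}$$, where $$n$$ is the number of input connections to that neuron. The biases are initialized as Gaussian random variables with mean $$0$$ and standard deviation $$1$$. Here's the code: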

def default_weight_initializer(self):
    self.biases = [np.random.randn(y, 1) for y in self.sizes[1:]]
    self.weights = [np.random.randn(y, x)/np.sqrt(x)
                    for x, y in zip(self.sizes[:-1], self.sizes[1:])]

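To understand the code, it may help to recall that np is the Numpy library for doing linear algebra. We'll import Numpy at the beginning of our program. Also, notice that we don't initialize any biases for the first layer of neurons. We avoid doing this because the first layer is an input layer, and so any biases would not be used. We did exactly the same thing in network.py.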
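
As a quick illustration of what this initializer produces, the sketch below applies the same two list comprehensions to the layer sizes [784, 30, 10] used in the experiments above, outside of any class. The fixed seed is an assumption added only to make the run repeatable; network2.py does not set one.

```python
import numpy as np

np.random.seed(0)  # assumption: fixed seed, only for repeatability here

sizes = [784, 30, 10]

# Same expressions as in default_weight_initializer, written as plain
# variables rather than attributes of a Network instance.
biases = [np.random.randn(y, 1) for y in sizes[1:]]
weights = [np.random.randn(y, x)/np.sqrt(x)
           for x, y in zip(sizes[:-1], sizes[1:])]

for w, b in zip(weights, biases):
    n_in = w.shape[1]  # number of input connections per neuron
    print("weights %s  biases %s  empirical std %.4f  1/sqrt(n_in) %.4f"
          % (w.shape, b.shape, w.std(), 1.0/np.sqrt(n_in)))
```

Each weight matrix has one row per neuron in the receiving layer and one column per neuron in the sending layer, and there is no bias vector for the input layer.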

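Complementing the default_weight_initializer we'll also include a large_weight_initializer method. This method initializes the weights and biases using the old approach from Chapter 1. The code is, of course, only a tiny bit different from default_weight_initializer: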

def large_weight_initializer(self):
    self.biases = [np.random.randn(y, 1) for y in self.sizes[1:]]
    self.weights = [np.random.randn(y, x)
                    for x, y in zip(self.sizes[:-1], self.sizes[1:])]

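I've included the large_weight_initializer method mostly as a convenience, to make it easier to compare the results in this chapter with those in Chapter 1. I can't think of many practical situations where I would recommend using it.

The second new thing in Network's __init__ method is that we now initialize a cost attribute. To understand how that works, let's look at the class we use to represent the cross-entropy cost: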

class CrossEntropyCost(object):

    @staticmethod
    def fn(a, y):
        return np.sum(np.nan_to_num(-y*np.log(a)-(1-y)*np.log(1-a)))

    @staticmethod
    def delta(z, a, y):
        return (a-y)

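Let's break this down. The first thing to observe is that even though the cross-entropy is, mathematically speaking, a function, we've implemented it as a Python class, not a Python function. Why have I made that choice? The reason is that the cost plays two different roles in our network. The obvious role is that it's a measure of how well an output activation, $$a$$, matches the desired output, $$y$$. This role is captured by the CrossEntropyCost.fn method. (Note, by the way, that the np.nan_to_num call inside CrossEntropyCost.fn ensures that Numpy deals correctly with the log of numbers very close to $$0$$.) But there's also a second way the cost function enters our network. Recall from Chapter 2 that when running the backpropagation algorithm we need to compute the network's output error, $$\delta^L$$. The form of the output error depends on the choice of cost function: different cost function, different form for the output error. For the cross-entropy the output error is, as we saw in Equation (66):

$$\delta^L = a^L - y$$

For this reason we define a second method, CrossEntropyCost.delta, whose purpose is to tell our network how to compute the output error. And then we bundle these two methods up into a single class containing everything our networks need to know about the cost function.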
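
The role of np.nan_to_num can be seen in a small experiment. The sketch below re-implements the body of CrossEntropyCost.fn as a standalone function (the name `cross_entropy` and the sample vectors are assumptions for illustration) and evaluates it on outputs that exactly match their targets, a case in which the raw expression evaluates to nan.

```python
import numpy as np

def cross_entropy(a, y):
    """Same expression as CrossEntropyCost.fn, as a standalone function."""
    return np.sum(np.nan_to_num(-y*np.log(a)-(1-y)*np.log(1-a)))

# Outputs that exactly match the targets: the raw expression contains
# terms like 0 * log(0), which Numpy evaluates to nan.
a_exact = np.array([[1.0], [0.0]])
y_exact = np.array([[1.0], [0.0]])

with np.errstate(divide="ignore", invalid="ignore"):  # silence the warnings
    raw = -y_exact*np.log(a_exact)-(1-y_exact)*np.log(1-a_exact)
    cost_exact = cross_entropy(a_exact, y_exact)

print("raw terms:", raw.ravel())          # nan in both slots
print("cost with nan_to_num:", cost_exact)

# A more typical case, where no special handling is needed.
a_typical = np.array([[0.9], [0.1]])
cost_typical = cross_entropy(a_typical, y_exact)
print("cost for a near-miss output:", cost_typical)
```

Without the np.nan_to_num call the sum in the first case would itself be nan, and that value would propagate into any total cost computed from it.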

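In a similar way, network2.py also contains a class to represent the quadratic cost function. This is included for comparison with the results of Chapter 1, since going forward we'll mostly use the cross-entropy. The code is just below. The QuadraticCost.fn method is a straightforward computation of the quadratic cost associated with the actual output, $$a$$, and the desired output, $$y$$. The value returned by QuadraticCost.delta is the output error for the quadratic cost.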

class QuadraticCost(object):

    @staticmethod
    def fn(a, y):
        return 0.5*np.linalg.norm(a-y)**2

    @staticmethod
    def delta(z, a, y):
        return (a-y) * sigmoid_prime(z)
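
To see why the two delta methods behave so differently, consider an output neuron that has saturated on the wrong value. The sketch below copies the two delta expressions into standalone functions (the function names and the sample value $$z = 6$$ are assumptions for illustration) and compares the resulting output errors.

```python
import numpy as np

def sigmoid(z):
    """The sigmoid function."""
    return 1.0/(1.0+np.exp(-z))

def sigmoid_prime(z):
    """Derivative of the sigmoid function."""
    return sigmoid(z)*(1-sigmoid(z))

def quadratic_delta(z, a, y):
    """Same expression as QuadraticCost.delta."""
    return (a-y) * sigmoid_prime(z)

def cross_entropy_delta(z, a, y):
    """Same expression as CrossEntropyCost.delta (z is unused)."""
    return (a-y)

# An output neuron saturated near 1 while the desired output is 0.
z = np.array([[6.0]])
a = sigmoid(z)
y = np.array([[0.0]])

print("activation:          ", a.item())
print("quadratic delta:     ", quadratic_delta(z, a, y).item())
print("cross-entropy delta: ", cross_entropy_delta(z, a, y).item())
```

The quadratic error is suppressed by the small factor sigmoid_prime(z), while the cross-entropy error stays close to the full difference between the activation and the desired output.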

We've now understood the main differences between network2.py and network.py. It's all pretty simple stuff. There are a number of smaller changes, which I'll discuss below, including the implementation of L2 regularization. Before getting to that, let's look at the complete code for network2.py. You don't need to read all the code in detail, but it is worth understanding the broad structure, and in particular reading the documentation strings, so you understand what each piece of the program is doing. Of course, you're also welcome to delve as deeply as you wish! If you get lost, you may wish to continue reading the prose below, and return to the code later. Anyway, here's the code:

"""network2.py
~~~~~~~~~~~~~~

An improved version of network.py, implementing the stochastic
gradient descent learning algorithm for a feedforward neural network.
Improvements include the addition of the cross-entropy cost function,
regularization, and better initialization of network weights.  Note
that I have focused on making the code simple, easily readable, and
easily modifiable.  It is not optimized, and omits many desirable
features.

"""

#### Libraries
# Standard library
import json
import random
import sys

# Third-party libraries
import numpy as np


#### Define the quadratic and cross-entropy cost functions

class QuadraticCost(object):

    @staticmethod
    def fn(a, y):
        """Return the cost associated with an output ``a`` and desired output
        ``y``.

        """
        return 0.5*np.linalg.norm(a-y)**2

    @staticmethod
    def delta(z, a, y):
        """Return the error delta from the output layer."""
        return (a-y) * sigmoid_prime(z)


class CrossEntropyCost(object):

    @staticmethod
    def fn(a, y):
        """Return the cost associated with an output ``a`` and desired output
        ``y``.  Note that np.nan_to_num is used to ensure numerical
        stability.  In particular, if both ``a`` and ``y`` have a 1.0
        in the same slot, then the expression (1-y)*np.log(1-a)
        returns nan.  The np.nan_to_num ensures that that is converted
        to the correct value (0.0).

        """
        return np.sum(np.nan_to_num(-y*np.log(a)-(1-y)*np.log(1-a)))

    @staticmethod
    def delta(z, a, y):
        """Return the error delta from the output layer.  Note that the
        parameter ``z`` is not used by the method.  It is included in
        the method's parameters in order to make the interface
        consistent with the delta method for other cost classes.

        """
        return (a-y)


#### Main Network class
class Network(object):

    def __init__(self, sizes, cost=CrossEntropyCost):
        """The list ``sizes`` contains the number of neurons in the respective
        layers of the network.  For example, if the list was [2, 3, 1]
        then it would be a three-layer network, with the first layer
        containing 2 neurons, the second layer 3 neurons, and the
        third layer 1 neuron.  The biases and weights for the network
        are initialized randomly, using
        ``self.default_weight_initializer`` (see docstring for that
        method).

        """
        self.num_layers = len(sizes)
        self.sizes = sizes
        self.default_weight_initializer()
        self.cost=cost

    def default_weight_initializer(self):
        """Initialize each weight using a Gaussian distribution with mean 0
        and standard deviation 1 over the square root of the number of
        weights connecting to the same neuron.  Initialize the biases
        using a Gaussian distribution with mean 0 and standard
        deviation 1.

        Note that the first layer is assumed to be an input layer, and
        by convention we won't set any biases for those neurons, since
        biases are only ever used in computing the outputs from later
        layers.

        """
        self.biases = [np.random.randn(y, 1) for y in self.sizes[1:]]
        self.weights = [np.random.randn(y, x)/np.sqrt(x)
                        for x, y in zip(self.sizes[:-1], self.sizes[1:])]

    def large_weight_initializer(self):
        """Initialize the weights using a Gaussian distribution with mean 0
        and standard deviation 1.  Initialize the biases using a
        Gaussian distribution with mean 0 and standard deviation 1.

        Note that the first layer is assumed to be an input layer, and
        by convention we won't set any biases for those neurons, since
        biases are only ever used in computing the outputs from later
        layers.

        This weight and bias initializer uses the same approach as in
        Chapter 1, and is included for purposes of comparison.  It
        will usually be better to use the default weight initializer
        instead.

        """
        self.biases = [np.random.randn(y, 1) for y in self.sizes[1:]]
        self.weights = [np.random.randn(y, x)
                        for x, y in zip(self.sizes[:-1], self.sizes[1:])]

    def feedforward(self, a):
        """Return the output of the network if ``a`` is input."""
        for b, w in zip(self.biases, self.weights):
            a = sigmoid(np.dot(w, a)+b)
        return a

    def SGD(self, training_data, epochs, mini_batch_size, eta,
            lmbda = 0.0,
            evaluation_data=None,
            monitor_evaluation_cost=False,
            monitor_evaluation_accuracy=False,
            monitor_training_cost=False,
            monitor_training_accuracy=False):
        """Train the neural network using mini-batch stochastic gradient
        descent.  The ``training_data`` is a list of tuples ``(x, y)``
        representing the training inputs and the desired outputs.  The
        other non-optional parameters are self-explanatory, as is the
        regularization parameter ``lmbda``.  The method also accepts
        ``evaluation_data``, usually either the validation or test
        data.  We can monitor the cost and accuracy on either the
        evaluation data or the training data, by setting the
        appropriate flags.  The method returns a tuple containing four
        lists: the (per-epoch) costs on the evaluation data, the
        accuracies on the evaluation data, the costs on the training
        data, and the accuracies on the training data.  All values are
        evaluated at the end of each training epoch.  So, for example,
        if we train for 30 epochs, then the first element of the tuple
        will be a 30-element list containing the cost on the
        evaluation data at the end of each epoch. Note that the lists
        are empty if the corresponding flag is not set.

        """
        if evaluation_data: n_data = len(evaluation_data)
        n = len(training_data)
        evaluation_cost, evaluation_accuracy = [], []
        training_cost, training_accuracy = [], []
        for j in xrange(epochs):
            random.shuffle(training_data)
            mini_batches = [
                training_data[k:k+mini_batch_size]
                for k in xrange(0, n, mini_batch_size)]
            for mini_batch in mini_batches:
                self.update_mini_batch(
                    mini_batch, eta, lmbda, len(training_data))
            print "Epoch %s training complete" % j
            if monitor_training_cost:
                cost = self.total_cost(training_data, lmbda)
                training_cost.append(cost)
                print "Cost on training data: {}".format(cost)
            if monitor_training_accuracy:
                accuracy = self.accuracy(training_data, convert=True)
                training_accuracy.append(accuracy)
                print "Accuracy on training data: {} / {}".format(
                    accuracy, n)
            if monitor_evaluation_cost:
                cost = self.total_cost(evaluation_data, lmbda, convert=True)
                evaluation_cost.append(cost)
                print "Cost on evaluation data: {}".format(cost)
            if monitor_evaluation_accuracy:
                accuracy = self.accuracy(evaluation_data)
                evaluation_accuracy.append(accuracy)
                print "Accuracy on evaluation data: {} / {}".format(
                    self.accuracy(evaluation_data), n_data)
            print
        return evaluation_cost, evaluation_accuracy, \
            training_cost, training_accuracy

    def update_mini_batch(self, mini_batch, eta, lmbda, n):
        """Update the network's weights and biases by applying gradient
        descent using backpropagation to a single mini batch.  The
        ``mini_batch`` is a list of tuples ``(x, y)``, ``eta`` is the
        learning rate, ``lmbda`` is the regularization parameter, and
        ``n`` is the total size of the training data set.

        """
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        for x, y in mini_batch:
            delta_nabla_b, delta_nabla_w = self.backprop(x, y)
            nabla_b = [nb+dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]
            nabla_w = [nw+dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]
        self.weights = [(1-eta*(lmbda/n))*w-(eta/len(mini_batch))*nw
                        for w, nw in zip(self.weights, nabla_w)]
        self.biases = [b-(eta/len(mini_batch))*nb
                       for b, nb in zip(self.biases, nabla_b)]

    def backprop(self, x, y):
        """Return a tuple ``(nabla_b, nabla_w)`` representing the
        gradient for the cost function C_x.  ``nabla_b`` and
        ``nabla_w`` are layer-by-layer lists of numpy arrays, similar
        to ``self.biases`` and ``self.weights``."""
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        # feedforward
        activation = x
        activations = [x] # list to store all the activations, layer by layer
        zs = [] # list to store all the z vectors, layer by layer
        for b, w in zip(self.biases, self.weights):
            z = np.dot(w, activation)+b
            zs.append(z)
            activation = sigmoid(z)
            activations.append(activation)
        # backward pass
        delta = (self.cost).delta(zs[-1], activations[-1], y)
        nabla_b[-1] = delta
        nabla_w[-1] = np.dot(delta, activations[-2].transpose())
        # Note that the variable l in the loop below is used a little
        # differently to the notation in Chapter 2 of the book.  Here,
        # l = 1 means the last layer of neurons, l = 2 is the
        # second-last layer, and so on.  It's a renumbering of the
        # scheme in the book, used here to take advantage of the fact
        # that Python can use negative indices in lists.
        for l in xrange(2, self.num_layers):
            z = zs[-l]
            sp = sigmoid_prime(z)
            delta = np.dot(self.weights[-l+1].transpose(), delta) * sp
            nabla_b[-l] = delta
            nabla_w[-l] = np.dot(delta, activations[-l-1].transpose())
        return (nabla_b, nabla_w)

    def accuracy(self, data, convert=False):
        """Return the number of inputs in ``data`` for which the neural
        network outputs the correct result. The neural network's
        output is assumed to be the index of whichever neuron in the
        final layer has the highest activation.

        The flag ``convert`` should be set to False if the data set is
        validation or test data (the usual case), and to True if the
        data set is the training data. The need for this flag arises
        due to differences in the way the results ``y`` are
        represented in the different data sets.  In particular, it
        flags whether we need to convert between the different
        representations.  It may seem strange to use different
        representations for the different data sets.  Why not use the
        same representation for all three data sets?  It's done for
        efficiency reasons -- the program usually evaluates the cost
        on the training data and the accuracy on other data sets.
        These are different types of computations, and using different
        representations speeds things up.  More details on the
        representations can be found in
        mnist_loader.load_data_wrapper.

        """
        if convert:
            results = [(np.argmax(self.feedforward(x)), np.argmax(y))
                       for (x, y) in data]
        else:
            results = [(np.argmax(self.feedforward(x)), y)
                        for (x, y) in data]
        return sum(int(x == y) for (x, y) in results)

    def total_cost(self, data, lmbda, convert=False):
        """Return the total cost for the data set ``data``.  The flag
        ``convert`` should be set to False if the data set is the
        training data (the usual case), and to True if the data set is
        the validation or test data.  See comments on the similar (but
        reversed) convention for the ``accuracy`` method, above.
        """
        cost = 0.0
        for x, y in data:
            a = self.feedforward(x)
            if convert: y = vectorized_result(y)
            cost += self.cost.fn(a, y)/len(data)
        cost += 0.5*(lmbda/len(data))*sum(
            np.linalg.norm(w)**2 for w in self.weights)
        return cost

    def save(self, filename):
        """Save the neural network to the file ``filename``."""
        data = {"sizes": self.sizes,
                "weights": [w.tolist() for w in self.weights],
                "biases": [b.tolist() for b in self.biases],
                "cost": str(self.cost.__name__)}
        f = open(filename, "w")
        json.dump(data, f)
        f.close()

#### Loading a Network
def load(filename):
    """Load a neural network from the file ``filename``.  Returns an
    instance of Network.

    """
    f = open(filename, "r")
    data = json.load(f)
    f.close()
    cost = getattr(sys.modules[__name__], data["cost"])
    net = Network(data["sizes"], cost=cost)
    net.weights = [np.array(w) for w in data["weights"]]
    net.biases = [np.array(b) for b in data["biases"]]
    return net

#### Miscellaneous functions
def vectorized_result(j):
    """Return a 10-dimensional unit vector with a 1.0 in the j'th position
    and zeroes elsewhere.  This is used to convert a digit (0...9)
    into a corresponding desired output from the neural network.

    """
    e = np.zeros((10, 1))
    e[j] = 1.0
    return e

def sigmoid(z):
    """The sigmoid function."""
    return 1.0/(1.0+np.exp(-z))

def sigmoid_prime(z):
    """Derivative of the sigmoid function."""
    return sigmoid(z)*(1-sigmoid(z))

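One of the more interesting changes in the code is to include L2 regularization. Although this is a major conceptual change, it's very simple to implement. For the most part it just involves passing the parameter lmbda to various methods, notably the Network.SGD method. The real work is done in a single line of the program, the fourth-last line of the Network.update_mini_batch method. That's where we modify the gradient descent update rule to include weight decay. But although the modification is tiny, it has a big impact on results!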
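
To isolate that one line, here is a sketch of the regularized update applied to a single weight matrix. The function name `regularized_update` and the numerical values are assumptions for illustration; the expression itself is the one used in Network.update_mini_batch.

```python
import numpy as np

def regularized_update(w, nabla_w, eta, lmbda, n, mini_batch_size):
    """Apply one gradient descent step with L2 weight decay.

    w               -- current weight matrix
    nabla_w         -- gradient summed over the mini-batch
    eta             -- learning rate
    lmbda           -- regularization parameter
    n               -- total size of the training set
    mini_batch_size -- number of examples in the mini-batch
    """
    # The factor (1 - eta*lmbda/n) shrinks every weight slightly on each
    # step; this is the weight decay contributed by L2 regularization.
    return (1-eta*(lmbda/n))*w - (eta/mini_batch_size)*nabla_w

w = np.ones((2, 3))
zero_gradient = np.zeros((2, 3))  # isolates the effect of weight decay

w_new = regularized_update(w, zero_gradient, eta=0.5, lmbda=5.0,
                           n=50000.0, mini_batch_size=10)
print(w_new[0, 0])  # slightly less than 1
```

With a zero gradient the only change is the multiplicative shrink, so every entry becomes $$1 - \eta\lambda/n$$. Note that n is passed as a float here, so the division behaves the same way regardless of Python version.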

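This is, by the way, common when implementing new techniques in neural networks. We've spent thousands of words discussing regularization. It's conceptually quite subtle and difficult to understand. And yet it was trivial to add to our program! It occurs surprisingly often that sophisticated techniques can be implemented with small changes to code.

Another small but important change to our code is the addition of several optional flags to the stochastic gradient descent method, Network.SGD. These flags make it possible to monitor the cost and accuracy. The flags are False by default, but they've been turned on in our examples in order to monitor our Network's performance. Also, network2.py's Network.SGD method returns a four-element tuple representing the results of the monitoring. We can use this as follows:

>>> evaluation_cost, evaluation_accuracy, \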
... training_cost, training_accuracy = net.SGD(training_data, 30, 10, 0.5,
... lmbda = 5.0,
... evaluation_data=validation_data,
... monitor_evaluation_accuracy=True,
... monitor_evaluation_cost=True,
... monitor_training_accuracy=True,
... monitor_training_cost=True)

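So, for example, evaluation_cost will be a $$30$$-element list containing the cost on the validation data at the end of each epoch. This sort of information is extremely useful in understanding a network's behaviour. It can, for example, be used to draw graphs showing how the network learns over time. Indeed, that's exactly how I produced the performance graphs shown earlier. Note, however, that if any of the monitoring flags are not set, then the corresponding element in the tuple will be an empty list.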
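
As one example of how these lists might be used, the sketch below picks out the epoch with the highest accuracy. The helper `best_epoch` is hypothetical, and the accuracy counts in the example are made-up placeholder values, not results from any run described here.

```python
def best_epoch(evaluation_accuracy, n_data):
    """Return (epoch index, accuracy as a fraction) for the best epoch.

    evaluation_accuracy -- list of per-epoch counts of correct
                           classifications, in the form returned as the
                           second element of the Network.SGD tuple
    n_data              -- number of examples in the evaluation data
    (Hypothetical helper, not part of network2.py.)
    """
    best = max(range(len(evaluation_accuracy)),
               key=lambda j: evaluation_accuracy[j])
    return best, evaluation_accuracy[best] / float(n_data)

# Placeholder counts for a three-epoch run on 10,000 examples.
placeholder_accuracy = [9000, 9300, 9250]
epoch, fraction = best_epoch(placeholder_accuracy, 10000)
print("best epoch: %d, accuracy: %.2f%%" % (epoch, 100 * fraction))
```

The same pattern applies to the cost lists, using min in place of max.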

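Other additions to the code include a Network.save method, to save Network objects to disk, and a function to load them back in again later. Note that the saving and loading is done using JSON, not Python's pickle or cPickle modules, which are the usual way we save and load objects to and from disk in Python. To understand why I've used JSON, imagine that at some time in the future we decided to change our Network class to allow neurons other than sigmoid neurons. To implement that change we'd most likely change the attributes defined in the Network.__init__ method. If we had simply pickled the objects, that would cause our load function to fail. Using JSON to do the serialization explicitly makes it easy to ensure that old Networks will still load.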
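
The essence of this save and load scheme is a round trip through plain lists. The sketch below serializes to a string instead of a file and works on bare arrays rather than a Network instance; the function names `network_to_json` and `network_from_json` and the tiny layer sizes are assumptions for illustration.

```python
import json
import numpy as np

def network_to_json(sizes, weights, biases, cost_name):
    """Serialize network parameters to a JSON string.

    Numpy arrays are not JSON-serializable, so each one is first
    converted to nested Python lists with tolist().
    """
    data = {"sizes": sizes,
            "weights": [w.tolist() for w in weights],
            "biases": [b.tolist() for b in biases],
            "cost": cost_name}
    return json.dumps(data)

def network_from_json(text):
    """Rebuild the parameters from a JSON string."""
    data = json.loads(text)
    weights = [np.array(w) for w in data["weights"]]
    biases = [np.array(b) for b in data["biases"]]
    return data["sizes"], weights, biases, data["cost"]

# A tiny network, initialized in the style of default_weight_initializer.
sizes = [3, 2, 1]
rng = np.random.default_rng(0)
weights = [rng.standard_normal((y, x))/np.sqrt(x)
           for x, y in zip(sizes[:-1], sizes[1:])]
biases = [rng.standard_normal((y, 1)) for y in sizes[1:]]

text = network_to_json(sizes, weights, biases, "CrossEntropyCost")
sizes2, weights2, biases2, cost_name2 = network_from_json(text)
print(sizes2, cost_name2, [w.shape for w in weights2])
```

Because the stored data is just a dictionary of named fields, a later version of the loading code can choose how to interpret or supplement those fields, which is the flexibility the paragraph above describes.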

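There are other minor changes in the code for network2.py, but they're all simple variations on network.py. The net result is to expand our program from $$74$$ lines to $$152$$ lines.

Problems

Modify the code above to implement L1 regularization, and use L1 regularization to classify MNIST digits using a network with $$30$$ hidden neurons. Can you find a regularization parameter that enables you to do better than running unregularized?
Take a look at the Network.cost_derivative method in network.py. That method was written for the quadratic cost. How would you rewrite the method for the cross-entropy cost? Can you think of a problem that might arise in the cross-entropy version? In network2.py we've eliminated the Network.cost_derivative method entirely, instead incorporating its functionality into the CrossEntropyCost.delta method. How does this solve the problem you've just identified?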

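By Not_GOD (Jianshu author)
Original link: http://www.jianshu.com/p/c43198f5dae1
Copyright belongs to the author. To reprint, please contact the author for permission and credit "Jianshu author" Not_GOD.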
