原文链接:http://www.deeplearningbook.org

文章目录

  • 6.2.1 Cost Function 代价函数
    • 6.2.1.1 Learning Conditional Distributions with Maximum Likelihood 基于最大似然来学习条件分布
    • 6.2.1.2 Learning Conditional Statistics 学习条件统计量

6.2.1 Cost Function 代价函数

  An important aspect of the design of a deep neural network is the choice of the cost function. Fortunately, the cost functions for neural networks are more or less the same as those for other parametric models, such as linear models.

深度神经网络设计的一个重要方面是代价函数的选择。幸运的是,神经网络的代价函数与线性模型等其他参数模型的代价函数大致相同。

 In most cases, our parametric model defines a distribution $p({\bf y} \mid {\bf x}; {\bm \theta})$ and we simply use the principle of maximum likelihood. This means we use the cross-entropy between the training data and the model's predictions as the cost function.

在大多数情况下,我们的参数模型定义了分布 $p({\bf y} \mid {\bf x}; {\bm \theta})$,我们只需使用最大似然原理。这意味着我们使用训练数据和模型预测之间的交叉熵作为代价函数。
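
下面给出一段示意性的小例子(非原书内容,其中 `softmax`、`W`、`b` 等名称均为本文为说明而假设的),演示"以负对数似然(即经验分布与模型分布之间的交叉熵)作为代价函数"在一个最简单的 softmax 分类模型上的样子:

```python
import numpy as np

# 示意性代码(非原书内容):以负对数似然(交叉熵)作为代价函数的最小例子。
# 假设 p_model(y|x) 由一个 softmax 线性分类器给出,W、b 为假设的参数名。

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)          # 数值稳定:先减去每行的最大值
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def nll_cost(W, b, X, y):
    """对 J(theta) = -E_{x,y~p̂_data} log p_model(y|x) 的经验估计(对训练样本取平均)。"""
    probs = softmax(X @ W + b)                    # p_model(y|x),形状为 (样本数, 类别数)
    log_lik = np.log(probs[np.arange(len(y)), y])
    return -log_lik.mean()

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))                       # 8 个样本,3 维输入
y = rng.integers(0, 4, size=8)                    # 4 类标签
W = rng.normal(size=(3, 4)); b = np.zeros(4)
print(nll_cost(W, b, X, y))
```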

 Sometimes, we take a simpler approach, where rather than predicting a complete probability distribution over $\bf y$, we merely predict some statistic of $\bf y$ conditioned on $\bf x$. Specialized loss functions enable us to train a predictor of these estimates.

有时,我们采取一种更简单的方法,不是预测 $\bf y$ 的完整概率分布,而是仅仅预测给定 $\bf x$ 条件下 $\bf y$ 的某个统计量。专门的损失函数使我们能够训练这些估计的预测器。

 The total cost function used to train a neural network will often combine one of the primary cost functions described here with a regularization term. We have already seen some simple examples of regularization applied to linear models in section 5.2.2. The weight decay approach used for linear models is also directly applicable to deep neural networks and is among the most popular regularization strategies. More advanced regularization strategies for neural networks are described in chapter 7.

用于训练神经网络的总的代价函数通常将这里描述的基本代价函数与正则项结合。我们已经在5.2.2节中看到了应用于线性模型的正则化的一些简单例子。用于线性模型的权重衰减方法也直接适用于深层神经网络,并且是最流行的正则化策略之一。第7章将介绍更高级的神经网络正则化策略。
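
作为一个简单的示意(非原书内容),总代价可以写成"基本代价 + 权重衰减项"的形式。下面的代码以均方误差作为基本代价,`lambda_` 为本文假设的正则化系数名称:

```python
import numpy as np

# 示意性代码(非原书内容):总代价 = 基本代价(这里用均方误差示意)+ L2 权重衰减正则项。
def total_cost(W, X, y, lambda_=1e-2):
    pred = X @ W                                                # 一个线性模型的预测,仅作示意
    primary = 0.5 * np.mean(np.sum((y - pred) ** 2, axis=1))   # 基本代价:均方误差
    weight_decay = lambda_ * np.sum(W ** 2)                    # 权重衰减项
    return primary + weight_decay

rng = np.random.default_rng(0)
X = rng.normal(size=(16, 3)); y = rng.normal(size=(16, 2))
W = rng.normal(size=(3, 2))
print(total_cost(W, X, y))
```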

6.2.1.1 Learning Conditional Distributions with Maximum Likelihood 基于最大似然来学习条件分布

  Most modern neural networks are trained using maximum likelihood. This means that the cost function is simply the negative log-likelihood, equivalently described as the cross-entropy between the training data and the model distribution. This cost function is given by

大多数现代神经网络都是用最大似然训练的。这意味着代价函数就是负的对数似然,等价地描述为训练数据和模型分布之间的交叉熵。这个代价函数由下式给出
$$J({\bm \theta}) = -\mathbb{E}_{{\bf x},{\bf y} \sim \hat{p}_{\rm data}} \log p_{\rm model}({\bf y} \mid {\bf x})$$

 The specific form of the cost function changes from model to model, depending on the specific form of $\log p_{\rm model}$. The expansion of the above equation typically yields some terms that do not depend on the model parameters and may be discarded. For example, as we saw in section 5.5.1, if $p_{\rm model}({\bf y} \mid {\bf x}) = \mathcal{N}({\bf y}; f({\bf x}; {\bm \theta}), {\bf I})$, then we recover the mean squared error cost,

代价函数的具体形式随模型不同而变化,取决于 $\log p_{\rm model}$ 的具体形式。展开上述方程,通常会得到一些与模型参数无关、可以舍去的项。正如我们在5.5.1节中所看到的,如果 $p_{\rm model}({\bf y} \mid {\bf x}) = \mathcal{N}({\bf y}; f({\bf x}; {\bm \theta}), {\bf I})$,则我们重新得到了均方误差代价,
$$J({\bm \theta}) = \frac{1}{2}\,\mathbb{E}_{{\bf x},{\bf y} \sim \hat{p}_{\rm data}} \|{\bf y} - f({\bf x}; {\bm \theta})\|^2 + {\rm const}$$

up to a scaling factor of $\frac{1}{2}$ and a term that does not depend on ${\bm \theta}$. The discarded constant is based on the variance of the Gaussian distribution, which in this case we chose not to parametrize. Previously, we saw that the equivalence between maximum likelihood estimation with an output distribution and minimization of mean squared error holds for a linear model, but in fact, the equivalence holds regardless of the $f({\bf x}; {\bm \theta})$ used to predict the mean of the Gaussian.

二者至多相差一个比例系数 $\frac{1}{2}$ 和一个不依赖于 ${\bm \theta}$ 的项。舍去的常数项来自高斯分布的方差,在这种情况下我们选择不对其参数化。此前我们已经看到,带有某个输出分布的最大似然估计与均方误差最小化之间的等价性对线性模型成立,但实际上,不管用于预测高斯分布均值的 $f({\bf x}; {\bm \theta})$ 是什么,该等价性均成立。
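
下面用一小段示意性代码(非原书内容)数值验证这一点:当 $p_{\rm model}({\bf y} \mid {\bf x}) = \mathcal{N}({\bf y}; f({\bf x}; {\bm \theta}), {\bf I})$ 时,负对数似然与 $\frac{1}{2}\|{\bf y} - f({\bf x}; {\bm \theta})\|^2$ 只相差一个与参数无关的常数 $\frac{d}{2}\log(2\pi)$:

```python
import numpy as np

# 示意性代码(非原书内容):验证单位协方差高斯的负对数似然 = 1/2 均方误差 + 常数。
rng = np.random.default_rng(0)
d = 5
y = rng.normal(size=d)        # 一个训练目标 y
f_x = rng.normal(size=d)      # 模型输出 f(x; theta),这里用随机向量示意

# N(y; f_x, I) 的负对数密度
nll = 0.5 * np.sum((y - f_x) ** 2) + 0.5 * d * np.log(2 * np.pi)
half_mse = 0.5 * np.sum((y - f_x) ** 2)

print(nll - half_mse)               # 恒为 d/2 * log(2*pi),与 theta 无关
print(0.5 * d * np.log(2 * np.pi))  # 两者相同
```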

 An advantage of this approach of deriving the cost function from maximum likelihood is that it removes the burden of designing cost functions for each model. Specifying a model $p({\bf y} \mid {\bf x})$ automatically determines a cost function $\log p({\bf y} \mid {\bf x})$.

这种从最大似然推导代价函数的方法的一个优点在于,它减轻了为每个模型设计代价函数的负担。明确指定了模型 $p({\bf y} \mid {\bf x})$ 也就自动确定了代价函数 $\log p({\bf y} \mid {\bf x})$。

 One recurring theme throughout neural network design is that the gradient of the cost function must be large and predictable enough to serve as a good guide for the learning algorithm. Functions that saturate (become very flat) undermine this objective because they make the gradient become very small. In many cases this happens because the activation functions used to produce the output of the hidden units or the output units saturate. The negative log-likelihood helps to avoid this problem for many models. Several output units involve an exp function that can saturate when its argument is very negative. The log function in the negative log-likelihood cost function undoes the exp of some output units. We will discuss the interaction between the cost function and the choice of output unit in section 6.2.2.

贯穿神经网络设计始终的一个反复出现的主题是,代价函数的梯度必须足够大且足够可预测,才能为学习算法提供良好的指导。饱和(变得非常平坦)的函数不利于这一目标,因为它们会使梯度变得非常小。在许多情况下,这是因为用于产生隐藏单元或输出单元输出的激活函数发生了饱和。负对数似然有助于许多模型避免这个问题。若干输出单元都包含一个 exp 函数,当它的参数为绝对值很大的负数时就会饱和。负对数似然代价函数中的 log 函数抵消了某些输出单元中的 exp。我们将在6.2.2节中讨论代价函数与输出单元选择之间的相互作用。
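
以 sigmoid 输出单元为例,下面这段示意性代码(非原书内容)演示"log 抵消 exp"的效果:正类的负对数似然 $-\log\sigma(z)$ 等于 softplus$(-z)=\log(1+e^{-z})$,当 $z$ 是绝对值很大的负数时它近似于 $-z$,几乎是线性的,因而梯度不会消失:

```python
import numpy as np

# 示意性代码(非原书内容):-log(sigmoid(z)) == softplus(-z),log 抵消了 sigmoid 中的 exp。
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softplus(z):
    # 数值稳定写法:log(1 + e^z) = max(z, 0) + log1p(e^{-|z|})
    return np.maximum(z, 0.0) + np.log1p(np.exp(-np.abs(z)))

z = np.array([-30.0, -5.0, 0.0, 5.0])
neg_log_lik = -np.log(sigmoid(z))   # 目标 y=1 时的负对数似然(z 极端负时直接算会下溢)
print(neg_log_lik)
print(softplus(-z))                 # 与上面一致;z 很负时近似于 -z,梯度约为 -1 而不是 0
```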

 One unusual property of the cross-entropy cost used to perform maximum likelihood estimation is that it usually does not have a minimum value when applied to the models commonly used in practice. For discrete output variables, most models are parametrized in such a way that they cannot represent a probability of zero or one, but can come arbitrarily close to doing so. Logistic regression is an example of such a model. For real-valued output variables, if the model can control the density of the output distribution (for example, by learning the variance parameter of a Gaussian output distribution) then it becomes possible to assign extremely high density to the correct training set outputs, resulting in cross-entropy approaching negative infinity. Regularization techniques described in chapter 7 provide several different ways of modifying the learning problem so that the model cannot reap unlimited reward in this way.

用于执行最大似然估计的交叉熵代价的一个不寻常的性质是,应用于实践中常见的模型时,它通常没有最小值。对于离散型输出变量,大多数模型的参数化方式使其无法表示恰好为0或1的概率,但可以任意地接近。Logistic回归就是这样的一个例子。对于实值输出变量,如果模型可以控制输出分布的密度(例如,通过学习高斯输出分布的方差参数),那么它就有可能对正确的训练集输出赋予极高的密度,导致交叉熵趋向负无穷。第7章描述的正则化技术提供了多种不同的方法来修改学习问题,使得模型不能以这种方式获得无限的回报。
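
下面的示意性代码(非原书内容)演示实值输出情形下的这种"无限回报":如果模型把高斯输出分布的均值恰好放在某个训练目标上,并不断缩小方差,负对数似然就会无下界地下降:

```python
import numpy as np

# 示意性代码(非原书内容):高斯输出分布把均值对准训练目标并缩小方差时,
# 负对数似然 0.5*log(2*pi*sigma^2) + (y - mu)^2 / (2*sigma^2) 可以无限变小。
y = 1.3          # 某个训练目标
mu = 1.3         # 模型恰好预测到该目标
for sigma in [1.0, 0.1, 1e-3, 1e-6]:
    nll = 0.5 * np.log(2 * np.pi * sigma ** 2) + (y - mu) ** 2 / (2 * sigma ** 2)
    print(sigma, nll)     # sigma 越小,负对数似然越低(趋向负无穷)
```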

6.2.1.2 Learning Conditional Statistics 学习条件统计量

  Instead of learning a full probability distribution $p({\bf y} \mid {\bf x}; {\bm \theta})$, we often want to learn just one conditional statistic of $\bf y$ given $\bf x$.

  For example, we may have a predictor $f({\bf x}; {\bm \theta})$ that we wish to employ to predict the mean of $\bf y$.

 If we use a sufficiently powerful neural network, we can think of the neural network as being able to represent any function $f$ from a wide class of functions, with this class being limited only by features such as continuity and boundedness rather than by having a specific parametric form. From this point of view, we can view the cost function as being a functional rather than just a function. A functional is a mapping from functions to real numbers. We can thus think of learning as choosing a function rather than merely choosing a set of parameters. We can design our cost functional to have its minimum occur at some specific function we desire. For example, we can design the cost functional to have its minimum lie on the function that maps $\bf x$ to the expected value of $\bf y$ given $\bf x$. Solving an optimization problem with respect to a function requires a mathematical tool called calculus of variations, described in section 19.4.2. It is not necessary to understand calculus of variations to understand the content of this chapter. At the moment, it is only necessary to understand that calculus of variations may be used to derive the following two results.

如果我们使用一个足够强大的神经网络,我们可以认为这个神经网络能够表示一大类函数中的任意函数 $f$,这个类仅受连续性和有界性等特征的限制,而不具有特定的参数形式。从这个角度来看,我们可以把代价函数看作一个泛函,而不仅仅是一个函数。泛函是从函数到实数的映射。因此,我们可以把学习看作是选择一个函数,而不仅仅是选择一组参数。我们可以设计代价泛函,使它的最小值出现在我们期望的某个特定函数上。例如,我们可以把代价泛函设计成,它的最小值处在将 $\bf x$ 映射到给定 $\bf x$ 时 $\bf y$ 的期望值的那个函数上。求解关于函数的优化问题需要一个称为变分法的数学工具,将在19.4.2节中介绍。理解本章的内容并不需要掌握变分法。目前,只需要知道变分法可以用来推导出下面两个结果。

Our first result derived using calculus of variations is that solving the optimization problem

利用变分法导出的第一个结果是求解最优化问题
$$f^* = \arg\min_f \mathbb{E}_{{\bf x},{\bf y} \sim p_{\rm data}} \|{\bf y} - f({\bf x})\|^2$$

yields 得到

$$f^*({\bf x}) = \mathbb{E}_{{\bf y} \sim p_{\rm data}({\bf y} \mid {\bf x})}[{\bf y}]$$

so long as this function lies within the class we optimize over. In other words, if we could train on infinitely many samples from the true data-generating distribution, minimizing the mean squared error cost function would give a function that predicts the mean of $\bf y$ for each value of $\bf x$.

只要这个函数在我们所优化的函数类之内。换言之,如果我们能够用来自真实的数据生成分布的无限多个样本进行训练,最小化均方误差代价函数将得到一个函数,它对每个 $\bf x$ 的取值预测出 $\bf y$ 的均值。
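
下面用一段示意性的模拟代码(非原书内容)说明这个结果:固定某个 $x$,从 $p_{\rm data}(y \mid x)$ 采样(这里假设的数据生成过程是 $y=\sin(x)+\text{噪声}$),在候选预测值上逐点最小化平方误差,得到的最优预测会逼近条件均值 $\mathbb{E}[y \mid x]$:

```python
import numpy as np

# 示意性代码(非原书内容):在固定的 x 处,使 E[(y - c)^2] 最小的常数 c 就是 E[y|x]。
rng = np.random.default_rng(0)
x = 0.7
y_samples = np.sin(x) + rng.normal(scale=0.5, size=50_000)   # 假设的数据生成过程

# 在一组候选预测值上数值搜索平方误差的最小点
candidates = np.linspace(-1.0, 2.5, 701)
mse = np.array([np.mean((y_samples - c) ** 2) for c in candidates])
best = candidates[np.argmin(mse)]

print(best)                # 平方误差的最小点
print(y_samples.mean())    # E[y|x] 的蒙特卡洛估计,约等于 sin(0.7) ≈ 0.644
```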

Different cost functions give different statistics. A second result derived using calculus of variations is that

不同的代价函数给出不同的统计量。利用变分法导出的第二个结果是
$$f^* = \arg\min_f \mathbb{E}_{{\bf x},{\bf y} \sim p_{\rm data}} \|{\bf y} - f({\bf x})\|_1$$

yields a function that predicts the median value of y for each x, as long as such a function may be described by the family of functions we optimize over. This cost function is commonly called mean absolute error.

得到一个对每个 $\bf x$ 预测 $\bf y$ 的中位数的函数,只要这样的函数可以由我们所优化的函数族来描述。这种代价函数通常称为平均绝对误差(mean absolute error)。
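
与上一段类似,下面的示意性代码(非原书内容)用一个偏斜的条件分布(这里假设为指数分布噪声)验证:逐点最小化绝对误差得到的预测值逼近条件中位数,而不是条件均值:

```python
import numpy as np

# 示意性代码(非原书内容):在固定的 x 处,使 E[|y - c|] 最小的常数 c 是 y 的条件中位数。
rng = np.random.default_rng(0)
y_samples = rng.exponential(scale=1.0, size=50_000)   # 假设的偏斜条件分布 p(y|x)

candidates = np.linspace(0.0, 3.0, 601)
mae = np.array([np.mean(np.abs(y_samples - c)) for c in candidates])
best = candidates[np.argmin(mae)]

print(best)                        # 绝对误差的最小点,接近 ln(2) ≈ 0.693
print(np.median(y_samples))        # 条件中位数
print(y_samples.mean())            # 条件均值约为 1,与中位数明显不同
```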

Unfortunately, mean squared error and mean absolute error often lead to poor results when used with gradient-based optimization. Some output units that saturate produce very small gradients when combined with these cost functions. This is one reason that the cross-entropy cost function is more popular than mean squared error or mean absolute error, even when it is not necessary to estimate an entire distribution $p({\bf y} \mid {\bf x})$.

不幸的是,均方误差和平均绝对误差在与基于梯度的优化方法一起使用时,往往会导致较差的结果。一些会饱和的输出单元在与这些代价函数结合时,会产生非常小的梯度。这就是交叉熵代价函数比均方误差或平均绝对误差更受欢迎的原因之一,即使在不需要估计整个分布 $p({\bf y} \mid {\bf x})$ 的时候也是如此。
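
最后用一段示意性代码(非原书内容)比较 sigmoid 输出单元饱和时,均方误差与交叉熵(负对数似然)两种代价对预激活值 $z$ 的梯度:前者随饱和程度指数级消失,后者保持在常数量级:

```python
import numpy as np

# 示意性代码(非原书内容):目标 y=1,sigmoid 输出单元,比较两种代价对预激活 z 的梯度。
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

y = 1.0
for z in [-2.0, -5.0, -10.0, -20.0]:
    y_hat = sigmoid(z)
    grad_mse = -2.0 * (y - y_hat) * y_hat * (1.0 - y_hat)   # d/dz (y - sigmoid(z))^2
    grad_ce = y_hat - y                                     # d/dz 交叉熵(负对数似然)
    print(z, grad_mse, grad_ce)   # z 越负,MSE 梯度指数级趋于 0,交叉熵梯度保持约 -1
```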
