原文链接:http://www.deeplearningbook.org

文章目录

  • 6.2.1 Cost Function 代价函数
    • 6.2.1.1 Learning Conditional Distributions with Maximum Likelihood 基于最大似然来学习条件分布
    • 6.2.1.2 Learning Conditional Statistics 学习条件统计量

6.2.1 Cost Function 代价函数

  An important aspect of the design of a deep neural network is the choice of the cost function. Fortunately, the cost functions for neural networks are more or less the same as those for other parametric models, such as linear models.

深度神经网络设计的一个重要方面是代价函数的选择。幸运的是,神经网络的代价函数与线性模型等其他参数模型的代价函数大致相同。

 In most cases, our parametric model defines a distribution $p({\bf y} \mid {\bf x}; {\bm \theta})$ and we simply use the principle of maximum likelihood. This means we use the cross-entropy between the training data and the model's predictions as the cost function.

在大多数情况下,我们的参数模型定义了分布 $p({\bf y} \mid {\bf x}; {\bm \theta})$,我们只需使用最大似然原理。这意味着我们使用训练数据和模型预测之间的交叉熵作为代价函数。
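
下面给出一段示意性的小例子(非原书内容,其中 `softmax`、`W`、`b` 等名称均为本文为说明而假设的),演示"以负对数似然(即经验分布与模型分布之间的交叉熵)作为代价函数"在一个最简单的 softmax 分类模型上的样子:

```python
import numpy as np

# 示意性代码(非原书内容):以负对数似然(交叉熵)作为代价函数的最小例子。
# 假设 p_model(y|x) 由一个 softmax 线性分类器给出,W、b 为假设的参数名。

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)          # 数值稳定:先减去每行的最大值
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def nll_cost(W, b, X, y):
    """对 J(theta) = -E_{x,y~p̂_data} log p_model(y|x) 的经验估计(对训练样本取平均)。"""
    probs = softmax(X @ W + b)                    # p_model(y|x),形状为 (样本数, 类别数)
    log_lik = np.log(probs[np.arange(len(y)), y])
    return -log_lik.mean()

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))                       # 8 个样本,3 维输入
y = rng.integers(0, 4, size=8)                    # 4 类标签
W = rng.normal(size=(3, 4)); b = np.zeros(4)
print(nll_cost(W, b, X, y))
```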

 Sometimes, we take a simpler approach, where rather than predicting a complete probability distribution over $\bf y$, we merely predict some statistic of $\bf y$ conditioned on $\bf x$. Specialized loss functions enable us to train a predictor of these estimates.

有时,我们采取一种更简单的方法,不是预测 $\bf y$ 的完整概率分布,而是仅仅预测给定 $\bf x$ 条件下 $\bf y$ 的某个统计量。专门的损失函数使我们能够训练这些估计的预测器。

 The total cost function used to train a neural network will often combine one of the primary cost functions described here with a regularization term. We have already seen some simple examples of regularization applied to linear models in section 5.2.2. The weight decay approach used for linear models is also directly applicable to deep neural networks and is among the most popular regularization strategies. More advanced regularization strategies for neural networks are described in chapter 7.

用于训练神经网络的总的代价函数通常将这里描述的基本代价函数与正则项结合。我们已经在5.2.2节中看到了应用于线性模型的正则化的一些简单例子。用于线性模型的权重衰减方法也直接适用于深层神经网络,并且是最流行的正则化策略之一。第7章将介绍更高级的神经网络正则化策略。
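
作为一个简单的示意(非原书内容),总代价可以写成"基本代价 + 权重衰减项"的形式。下面的代码以均方误差作为基本代价,`lambda_` 为本文假设的正则化系数名称:

```python
import numpy as np

# 示意性代码(非原书内容):总代价 = 基本代价(这里用均方误差示意)+ L2 权重衰减正则项。
def total_cost(W, X, y, lambda_=1e-2):
    pred = X @ W                                                # 一个线性模型的预测,仅作示意
    primary = 0.5 * np.mean(np.sum((y - pred) ** 2, axis=1))   # 基本代价:均方误差
    weight_decay = lambda_ * np.sum(W ** 2)                    # 权重衰减项
    return primary + weight_decay

rng = np.random.default_rng(0)
X = rng.normal(size=(16, 3)); y = rng.normal(size=(16, 2))
W = rng.normal(size=(3, 2))
print(total_cost(W, X, y))
```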

6.2.1.1 Learning Conditional Distributions with Maximum Likelihood 基于最大似然来学习条件分布

  Most modern neural networks are trained using maximum likelihood. This means that the cost function is simply the negative log-likelihood, equivalently described as the cross-entropy between the training data and the model distribution. This cost function is given by

大多数现代神经网络都是用最大似然训练的。这意味着代价函数就是负的对数似然,等价地描述为训练数据和模型分布之间的交叉熵。这个代价函数由下式给出
$$J({\bm \theta}) = -\mathbb{E}_{{\bf x},{\bf y} \sim \hat{p}_{\rm data}} \log p_{\rm model}({\bf y} \mid {\bf x})$$

 The specific form of the cost function changes from model to model, depending on the specific form of $\log p_{\rm model}$. The expansion of the above equation typically yields some terms that do not depend on the model parameters and may be discarded. For example, as we saw in section 5.5.1, if $p_{\rm model}({\bf y} \mid {\bf x}) = \mathcal{N}({\bf y}; f({\bf x}; {\bm \theta}), {\bf I})$, then we recover the mean squared error cost,

代价函数的具体形式随模型不同而变化,取决于 $\log p_{\rm model}$ 的具体形式。展开上述方程,通常会得到一些与模型参数无关、可以舍去的项。正如我们在5.5.1节中所看到的,如果 $p_{\rm model}({\bf y} \mid {\bf x}) = \mathcal{N}({\bf y}; f({\bf x}; {\bm \theta}), {\bf I})$,则我们重新得到了均方误差代价,
$$J({\bm \theta}) = \frac{1}{2}\,\mathbb{E}_{{\bf x},{\bf y} \sim \hat{p}_{\rm data}} \|{\bf y} - f({\bf x}; {\bm \theta})\|^2 + {\rm const}$$

up to a scaling factor of $\frac{1}{2}$ and a term that does not depend on ${\bm \theta}$. The discarded constant is based on the variance of the Gaussian distribution, which in this case we chose not to parametrize. Previously, we saw that the equivalence between maximum likelihood estimation with an output distribution and minimization of mean squared error holds for a linear model, but in fact, the equivalence holds regardless of the $f({\bf x}; {\bm \theta})$ used to predict the mean of the Gaussian.

二者至多相差一个比例系数 $\frac{1}{2}$ 和一个不依赖于 ${\bm \theta}$ 的项。舍去的常数项来自高斯分布的方差,在这种情况下我们选择不对其参数化。此前我们已经看到,带有某个输出分布的最大似然估计与均方误差最小化之间的等价性对线性模型成立,但实际上,不管用于预测高斯分布均值的 $f({\bf x}; {\bm \theta})$ 是什么,该等价性均成立。
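
下面用一小段示意性代码(非原书内容)数值验证这一点:当 $p_{\rm model}({\bf y} \mid {\bf x}) = \mathcal{N}({\bf y}; f({\bf x}; {\bm \theta}), {\bf I})$ 时,负对数似然与 $\frac{1}{2}\|{\bf y} - f({\bf x}; {\bm \theta})\|^2$ 只相差一个与参数无关的常数 $\frac{d}{2}\log(2\pi)$:

```python
import numpy as np

# 示意性代码(非原书内容):验证单位协方差高斯的负对数似然 = 1/2 均方误差 + 常数。
rng = np.random.default_rng(0)
d = 5
y = rng.normal(size=d)        # 一个训练目标 y
f_x = rng.normal(size=d)      # 模型输出 f(x; theta),这里用随机向量示意

# N(y; f_x, I) 的负对数密度
nll = 0.5 * np.sum((y - f_x) ** 2) + 0.5 * d * np.log(2 * np.pi)
half_mse = 0.5 * np.sum((y - f_x) ** 2)

print(nll - half_mse)               # 恒为 d/2 * log(2*pi),与 theta 无关
print(0.5 * d * np.log(2 * np.pi))  # 两者相同
```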

 An advantage of this approach of deriving the cost function from maximum likelihood is that it removes the burden of designing cost functions for each model. Specifying a model $p({\bf y} \mid {\bf x})$ automatically determines a cost function $\log p({\bf y} \mid {\bf x})$.

这种从最大似然推导代价函数的方法的一个优点在于,它减轻了为每个模型设计代价函数的负担。明确指定了模型 $p({\bf y} \mid {\bf x})$ 也就自动确定了代价函数 $\log p({\bf y} \mid {\bf x})$。

 One recurring theme throughout neural network design is that the gradient of the cost function must be large and predictable enough to serve as a good guide for the learning algorithm. Functions that saturate (become very flat) undermine this objective because they make the gradient become very small. In many cases this happens because the activation functions used to produce the output of the hidden units or the output units saturate. The negative log-likelihood helps to avoid this problem for many models. Several output units involve an exp function that can saturate when its argument is very negative. The log function in the negative log-likelihood cost function undoes the exp of some output units. We will discuss the interaction between the cost function and the choice of output unit in section 6.2.2.

贯穿神经网络设计始终的一个反复出现的主题是,代价函数的梯度必须足够大且足够可预测,才能为学习算法提供良好的指导。饱和(变得非常平坦)的函数不利于这一目标,因为它们会使梯度变得非常小。在许多情况下,这是因为用于产生隐藏单元或输出单元输出的激活函数发生了饱和。负对数似然有助于许多模型避免这个问题。若干输出单元都包含一个 exp 函数,当它的参数为绝对值很大的负数时就会饱和。负对数似然代价函数中的 log 函数抵消了某些输出单元中的 exp。我们将在6.2.2节中讨论代价函数与输出单元选择之间的相互作用。
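
以 sigmoid 输出单元为例,下面这段示意性代码(非原书内容)演示"log 抵消 exp"的效果:正类的负对数似然 $-\log\sigma(z)$ 等于 softplus$(-z)=\log(1+e^{-z})$,当 $z$ 是绝对值很大的负数时它近似于 $-z$,几乎是线性的,因而梯度不会消失:

```python
import numpy as np

# 示意性代码(非原书内容):-log(sigmoid(z)) == softplus(-z),log 抵消了 sigmoid 中的 exp。
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softplus(z):
    # 数值稳定写法:log(1 + e^z) = max(z, 0) + log1p(e^{-|z|})
    return np.maximum(z, 0.0) + np.log1p(np.exp(-np.abs(z)))

z = np.array([-30.0, -5.0, 0.0, 5.0])
neg_log_lik = -np.log(sigmoid(z))   # 目标 y=1 时的负对数似然(z 极端负时直接算会下溢)
print(neg_log_lik)
print(softplus(-z))                 # 与上面一致;z 很负时近似于 -z,梯度约为 -1 而不是 0
```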

 One unusual property of the cross-entropy cost used to perform maximum likelihood estimation is that it usually does not have a minimum value when applied to the models commonly used in practice. For discrete output variables, most models are parametrized in such a way that they cannot represent a probability of zero or one, but can come arbitrarily close to doing so. Logistic regression is an example of such a model. For real-valued output variables, if the model can control the density of the output distribution (for example, by learning the variance parameter of a Gaussian output distribution) then it becomes possible to assign extremely high density to the correct training set outputs, resulting in cross-entropy approaching negative infinity. Regularization techniques described in chapter 7 provide several different ways of modifying the learning problem so that the model cannot reap unlimited reward in this way.

用于执行最大似然估计的交叉熵代价的一个不寻常的性质是,应用于实践中常见的模型时,它通常没有最小值。对于离散型输出变量,大多数模型的参数化方式使其无法表示恰好为0或1的概率,但可以任意地接近。Logistic回归就是这样的一个例子。对于实值输出变量,如果模型可以控制输出分布的密度(例如,通过学习高斯输出分布的方差参数),那么它就有可能对正确的训练集输出赋予极高的密度,导致交叉熵趋向负无穷。第7章描述的正则化技术提供了多种不同的方法来修改学习问题,使得模型不能以这种方式获得无限的回报。
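
下面的示意性代码(非原书内容)演示实值输出情形下的这种"无限回报":如果模型把高斯输出分布的均值恰好放在某个训练目标上,并不断缩小方差,负对数似然就会无下界地下降:

```python
import numpy as np

# 示意性代码(非原书内容):高斯输出分布把均值对准训练目标并缩小方差时,
# 负对数似然 0.5*log(2*pi*sigma^2) + (y - mu)^2 / (2*sigma^2) 可以无限变小。
y = 1.3          # 某个训练目标
mu = 1.3         # 模型恰好预测到该目标
for sigma in [1.0, 0.1, 1e-3, 1e-6]:
    nll = 0.5 * np.log(2 * np.pi * sigma ** 2) + (y - mu) ** 2 / (2 * sigma ** 2)
    print(sigma, nll)     # sigma 越小,负对数似然越低(趋向负无穷)
```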

6.2.1.2 Learning Conditional Statistics 学习条件统计量

  Instead of learning a full probability distribution $p({\bf y} \mid {\bf x}; {\bm \theta})$, we often want to learn just one conditional statistic of $\bf y$ given $\bf x$.

  For example, we may have a predictor $f({\bf x}; {\bm \theta})$ that we wish to employ to predict the mean of $\bf y$.

 If we use a sufficiently powerful neural network, we can think of the neural network as being able to represent any function $f$ from a wide class of functions, with this class being limited only by features such as continuity and boundedness rather than by having a specific parametric form. From this point of view, we can view the cost function as being a functional rather than just a function. A functional is a mapping from functions to real numbers. We can thus think of learning as choosing a function rather than merely choosing a set of parameters. We can design our cost functional to have its minimum occur at some specific function we desire. For example, we can design the cost functional to have its minimum lie on the function that maps $\bf x$ to the expected value of $\bf y$ given $\bf x$. Solving an optimization problem with respect to a function requires a mathematical tool called calculus of variations, described in section 19.4.2. It is not necessary to understand calculus of variations to understand the content of this chapter. At the moment, it is only necessary to understand that calculus of variations may be used to derive the following two results.

如果我们使用一个足够强大的神经网络,我们可以认为这个神经网络能够表示一大类函数中的任意函数 $f$,这个类仅受连续性和有界性等特征的限制,而不具有特定的参数形式。从这个角度来看,我们可以把代价函数看作一个泛函,而不仅仅是一个函数。泛函是从函数到实数的映射。因此,我们可以把学习看作是选择一个函数,而不仅仅是选择一组参数。我们可以设计代价泛函,使它的最小值出现在我们期望的某个特定函数上。例如,我们可以把代价泛函设计成,它的最小值处在将 $\bf x$ 映射到给定 $\bf x$ 时 $\bf y$ 的期望值的那个函数上。求解关于函数的优化问题需要一个称为变分法的数学工具,将在19.4.2节中介绍。理解本章的内容并不需要掌握变分法。目前,只需要知道变分法可以用来推导出下面两个结果。

Our first result derived using calculus of variations is that solving the optimization problem

利用变分法导出的第一个结果是求解最优化问题
$$f^* = \arg\min_f \mathbb{E}_{{\bf x},{\bf y} \sim p_{\rm data}} \|{\bf y} - f({\bf x})\|^2$$

yields 得到

$$f^*({\bf x}) = \mathbb{E}_{{\bf y} \sim p_{\rm data}({\bf y} \mid {\bf x})}[{\bf y}]$$

so long as this function lies within the class we optimize over. In other words, if we could train on infinitely many samples from the true data-generating distribution, minimizing the mean squared error cost function would give a function that predicts the mean of $\bf y$ for each value of $\bf x$.

只要这个函数在我们所优化的函数类之内。换言之,如果我们能够用来自真实的数据生成分布的无限多个样本进行训练,最小化均方误差代价函数将得到一个函数,它对每个 $\bf x$ 的取值预测出 $\bf y$ 的均值。
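
下面用一段示意性的模拟代码(非原书内容)说明这个结果:固定某个 $x$,从 $p_{\rm data}(y \mid x)$ 采样(这里假设的数据生成过程是 $y=\sin(x)+\text{噪声}$),在候选预测值上逐点最小化平方误差,得到的最优预测会逼近条件均值 $\mathbb{E}[y \mid x]$:

```python
import numpy as np

# 示意性代码(非原书内容):在固定的 x 处,使 E[(y - c)^2] 最小的常数 c 就是 E[y|x]。
rng = np.random.default_rng(0)
x = 0.7
y_samples = np.sin(x) + rng.normal(scale=0.5, size=50_000)   # 假设的数据生成过程

# 在一组候选预测值上数值搜索平方误差的最小点
candidates = np.linspace(-1.0, 2.5, 701)
mse = np.array([np.mean((y_samples - c) ** 2) for c in candidates])
best = candidates[np.argmin(mse)]

print(best)                # 平方误差的最小点
print(y_samples.mean())    # E[y|x] 的蒙特卡洛估计,约等于 sin(0.7) ≈ 0.644
```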

Different cost functions give different statistics. A second result derived using calculus of variations is that

不同的代价函数给出不同的统计量。利用变分法导出的第二个结果是
$$f^* = \arg\min_f \mathbb{E}_{{\bf x},{\bf y} \sim p_{\rm data}} \|{\bf y} - f({\bf x})\|_1$$

yields a function that predicts the median value of y for each x, as long as such a function may be described by the family of functions we optimize over. This cost function is commonly called mean absolute error.

得到一个对每个 $\bf x$ 预测 $\bf y$ 的中位数的函数,只要这样的函数可以由我们所优化的函数族来描述。这种代价函数通常称为平均绝对误差(mean absolute error)。
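
与上一段类似,下面的示意性代码(非原书内容)用一个偏斜的条件分布(这里假设为指数分布噪声)验证:逐点最小化绝对误差得到的预测值逼近条件中位数,而不是条件均值:

```python
import numpy as np

# 示意性代码(非原书内容):在固定的 x 处,使 E[|y - c|] 最小的常数 c 是 y 的条件中位数。
rng = np.random.default_rng(0)
y_samples = rng.exponential(scale=1.0, size=50_000)   # 假设的偏斜条件分布 p(y|x)

candidates = np.linspace(0.0, 3.0, 601)
mae = np.array([np.mean(np.abs(y_samples - c)) for c in candidates])
best = candidates[np.argmin(mae)]

print(best)                        # 绝对误差的最小点,接近 ln(2) ≈ 0.693
print(np.median(y_samples))        # 条件中位数
print(y_samples.mean())            # 条件均值约为 1,与中位数明显不同
```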

Unfortunately, mean squared error and mean absolute error often lead to poor results when used with gradient-based optimization. Some output units that saturate produce very small gradients when combined with these cost functions. This is one reason that the cross-entropy cost function is more popular than mean squared error or mean absolute error, even when it is not necessary to estimate an entire distribution $p({\bf y} \mid {\bf x})$.

不幸的是,均方误差和平均绝对误差在与基于梯度的优化方法一起使用时,往往会导致较差的结果。一些会饱和的输出单元在与这些代价函数结合时,会产生非常小的梯度。这就是交叉熵代价函数比均方误差或平均绝对误差更受欢迎的原因之一,即使在不需要估计整个分布 $p({\bf y} \mid {\bf x})$ 的时候也是如此。
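
最后用一段示意性代码(非原书内容)比较 sigmoid 输出单元饱和时,均方误差与交叉熵(负对数似然)两种代价对预激活值 $z$ 的梯度:前者随饱和程度指数级消失,后者保持在常数量级:

```python
import numpy as np

# 示意性代码(非原书内容):目标 y=1,sigmoid 输出单元,比较两种代价对预激活 z 的梯度。
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

y = 1.0
for z in [-2.0, -5.0, -10.0, -20.0]:
    y_hat = sigmoid(z)
    grad_mse = -2.0 * (y - y_hat) * y_hat * (1.0 - y_hat)   # d/dz (y - sigmoid(z))^2
    grad_ce = y_hat - y                                     # d/dz 交叉熵(负对数似然)
    print(z, grad_mse, grad_ce)   # z 越负,MSE 梯度指数级趋于 0,交叉熵梯度保持约 -1
```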
