CONTENTS

  • Deep feedforward networks, also often called feedforward neural networks, or multilayer perceptrons (MLPs), are the quintessential deep learning models. The goal of a feedforward network is to approximate some function $f^*$. For example, for a classifier, $y = f^*(\boldsymbol{x})$ maps an input $\boldsymbol{x}$ to a category $y$. A feedforward network defines a mapping $\boldsymbol{y} = f(\boldsymbol{x}; \boldsymbol{\theta})$ and learns the value of the parameters $\boldsymbol{\theta}$ that result in the best function approximation.

  • These models are called feedforward because information flows through the function being evaluated from $\boldsymbol{x}$, through the intermediate computations used to define $f$, and finally to the output $\boldsymbol{y}$. There are no feedback connections in which outputs of the model are fed back into itself. When feedforward neural networks are extended to include feedback connections, they are called recurrent neural networks, presented later.

  • Feedforward networks are of extreme importance to machine learning practitioners. They form the basis of many important commercial applications. For example, the convolutional networks used for object recognition from photos are a specialized kind of feedforward network. Feedforward networks are a conceptual stepping stone on the path to recurrent networks, which power many natural language applications.

  • Feedforward neural networks are called networks because they are typically represented by composing together many different functions. The model is associated with a directed acyclic graph describing how the functions are composed together. For example, we might have three functions $f^{(1)}$, $f^{(2)}$, and $f^{(3)}$ connected in a chain, to form $f(\boldsymbol{x}) = f^{(3)}(f^{(2)}(f^{(1)}(\boldsymbol{x})))$. These chain structures are the most commonly used structures of neural networks. In this case, $f^{(1)}$ is called the first layer of the network, $f^{(2)}$ is called the second layer, and so on. The overall length of the chain gives the depth of the model. It is from this terminology that the name “deep learning” arises. The final layer of a feedforward network is called the output layer. During neural network training, we drive $f(\boldsymbol{x})$ to match $f^*(\boldsymbol{x})$. The training data provides us with noisy, approximate examples of $f^*(\boldsymbol{x})$ evaluated at different training points. Each example $\boldsymbol{x}$ is accompanied by a label $y \approx f^*(\boldsymbol{x})$. The training examples specify directly what the output layer must do at each point $\boldsymbol{x}$; it must produce a value that is close to $y$. The behavior of the other layers is not directly specified by the training data. The learning algorithm must decide how to use those layers to produce the desired output, but the training data does not say what each individual layer should do. Instead, the learning algorithm must decide how to use these layers to best implement an approximation of $f^*$. Because the training data does not show the desired output for each of these layers, these layers are called hidden layers.

  • Finally, these networks are called neural because they are loosely inspired by neuroscience. Each hidden layer of the network is typically vector-valued. The dimensionality of these hidden layers determines the width of the model. Each element of the vector may be interpreted as playing a role analogous to a neuron. Rather than thinking of the layer as representing a single vector-to-vector function, we can also think of the layer as consisting of many units that act in parallel, each representing a vector-to-scalar function. Each unit resembles a neuron in the sense that it receives input from many other units and computes its own activation value. The idea of using many layers of vector-valued representation is drawn from neuroscience.

  • One way to understand feedforward networks is to begin with linear models and consider how to overcome their limitations. Linear models, such as logistic regression and linear regression, are appealing because they may be fit efficiently and reliably, either in closed form or with convex optimization. Linear models also have the obvious defect that the model capacity is limited to linear functions, so the model cannot understand the interaction between any two input variables. To extend linear models to represent nonlinear functions of $\boldsymbol{x}$, we can apply the linear model not to $\boldsymbol{x}$ itself but to a transformed input $\phi(\boldsymbol{x})$, where $\phi$ is a nonlinear transformation. Equivalently, we can apply the kernel trick described in section 5.7.2 to obtain a nonlinear learning algorithm based on implicitly applying the $\phi$ mapping. We can think of $\phi$ as providing a set of features describing $\boldsymbol{x}$, or as providing a new representation for $\boldsymbol{x}$.

  • The question is then how to choose the mapping $\phi$:

  1. One option is to use a very generic $\phi$, such as the infinite-dimensional $\phi$ that is implicitly used by kernel machines based on the RBF kernel. If $\phi(\boldsymbol{x})$ is of high enough dimension, we can always have enough capacity to fit the training set, but generalization to the test set often remains poor. Very generic feature mappings are usually based only on the principle of local smoothness and do not encode enough prior information to solve advanced problems.
  2. Another option is to manually engineer $\phi$. Until the advent of deep learning, this was the dominant approach. This approach requires decades of human effort for each separate task, with practitioners specializing in different domains such as speech recognition or computer vision, and with little transfer between domains.
  3. The strategy of deep learning is to learn $\phi$. In this approach, we have a model $y = f(\boldsymbol{x}; \boldsymbol{\theta}, \boldsymbol{w}) = \phi(\boldsymbol{x}; \boldsymbol{\theta})^{\top}\boldsymbol{w}$. We now have parameters $\boldsymbol{\theta}$ that we use to learn $\phi$ from a broad class of functions, and parameters $\boldsymbol{w}$ that map from $\phi(\boldsymbol{x})$ to the desired output. This is an example of a deep feedforward network, with $\phi$ defining a hidden layer. This approach is the only one of the three that gives up on the convexity of the training problem, but the benefits outweigh the harms. In this approach, we parametrize the representation as $\phi(\boldsymbol{x}; \boldsymbol{\theta})$ and use the optimization algorithm to find the $\boldsymbol{\theta}$ that corresponds to a good representation. If we wish, this approach can capture the benefit of the first approach by being highly generic: we do so by using a very broad family $\phi(\boldsymbol{x}; \boldsymbol{\theta})$. This approach can also capture the benefit of the second approach. Human practitioners can encode their knowledge to help generalization by designing families $\phi(\boldsymbol{x}; \boldsymbol{\theta})$ that they expect will perform well. The advantage is that the human designer only needs to find the right general function family rather than finding precisely the right function.
  • This general principle of improving models by learning features extends beyond the feedforward networks described in this chapter. It is a recurring theme of deep learning that applies to all of the kinds of models described throughout this book. Feedforward networks are the application of this principle to learning deterministic mappings from $\boldsymbol{x}$ to $\boldsymbol{y}$ that lack feedback connections. Other models presented later will apply these principles to learning stochastic mappings, learning functions with feedback, and learning probability distributions over a single vector.

Example: Learning XOR

  • To make the idea of a feedforward network more concrete, we begin with an example of a fully functioning feedforward network on a very simple task: learning the XOR function.

  • The XOR function (“exclusive or”) is an operation on two binary values, $x_1$ and $x_2$. When exactly one of these binary values is equal to 1, the XOR function returns 1. Otherwise, it returns 0. The XOR function provides the target function $y = f^*(\boldsymbol{x})$ that we want to learn. Our model provides a function $y = f(\boldsymbol{x}; \boldsymbol{\theta})$ and our learning algorithm will adapt the parameters $\boldsymbol{\theta}$ to make $f$ as similar as possible to $f^*$.

  • In this simple example, we will not be concerned with statistical generalization. We want our network to perform correctly on the four points $\mathbb{X} = \{[0,0]^{\top}, [0,1]^{\top}, [1,0]^{\top}, [1,1]^{\top}\}$. We will train the network on all four of these points. The only challenge is to fit the training set.

  • We can treat this problem as a regression problem and use a mean squared error loss function. We choose this loss function to simplify the math for this example as much as possible. In practical applications, MSE is usually not an appropriate cost function for modeling binary data. More appropriate approaches are described later.
    Evaluated on our whole training set, the MSE loss function is
    $$J(\boldsymbol{\theta}) = \frac{1}{4} \sum_{\boldsymbol{x} \in \mathbb{X}} \left( f^*(\boldsymbol{x}) - f(\boldsymbol{x}; \boldsymbol{\theta}) \right)^2$$
    Now we must choose the form of our model, $f(\boldsymbol{x}; \boldsymbol{\theta})$. Suppose that we choose a linear model, with $\boldsymbol{\theta}$ consisting of $\boldsymbol{w}$ and $b$. Our model is defined to be
    $$f(\boldsymbol{x}; \boldsymbol{w}, b) = \boldsymbol{x}^{\top}\boldsymbol{w} + b$$
    We can minimize $J(\boldsymbol{\theta})$ in closed form with respect to $\boldsymbol{w}$ and $b$ using the normal equations.

  • After solving the normal equations, we obtain $\boldsymbol{w} = \mathbf{0}$ and $b = \frac{1}{2}$. The linear model simply outputs $0.5$ everywhere. Why does this happen? Figure above shows how a linear model is not able to represent the XOR function. One way to solve this problem is to use a model that learns a different feature space in which a linear model is able to represent the solution.
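  • As a quick numerical check (a sketch in numpy, using a least-squares solver rather than inverting the normal equations explicitly), fitting the linear model to the four XOR points does indeed return the constant-$0.5$ predictor:

```python
import numpy as np

X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])  # the four XOR inputs
y = np.array([0., 1., 1., 0.])                           # XOR targets

# Append a column of ones so the bias b is fitted along with w
A = np.hstack([X, np.ones((4, 1))])
params, *_ = np.linalg.lstsq(A, y, rcond=None)
w, b = params[:2], params[2]
print(w, b)        # w is approximately [0, 0] and b is approximately 0.5
print(A @ params)  # the model outputs 0.5 for every input
```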

  • Specifically, we will introduce a very simple feedforward network with one hidden layer containing two hidden units. See figure above for an illustration of this model. This feedforward network has a vector of hidden units $\boldsymbol{h}$ that are computed by a function $f^{(1)}(\boldsymbol{x}; \boldsymbol{W}, \boldsymbol{c})$. The values of these hidden units are then used as the input for a second layer. The second layer is the output layer of the network. The output layer is still just a linear regression model, but now it is applied to $\boldsymbol{h}$ rather than to $\boldsymbol{x}$. The network now contains two functions chained together: $\boldsymbol{h} = f^{(1)}(\boldsymbol{x}; \boldsymbol{W}, \boldsymbol{c})$ and $y = f^{(2)}(\boldsymbol{h}; \boldsymbol{w}, b)$, with the complete model being $f(\boldsymbol{x}; \boldsymbol{W}, \boldsymbol{c}, \boldsymbol{w}, b) = f^{(2)}(f^{(1)}(\boldsymbol{x}))$.
    What function should $f^{(1)}$ compute? Linear models have served us well so far, and it may be tempting to make $f^{(1)}$ be linear as well.
    Unfortunately, if $f^{(1)}$ were linear, then the feedforward network as a whole would remain a linear function of its input. Ignoring the intercept terms for the moment, suppose $f^{(1)}(\boldsymbol{x}) = \boldsymbol{W}^{\top}\boldsymbol{x}$ and $f^{(2)}(\boldsymbol{h}) = \boldsymbol{h}^{\top}\boldsymbol{w}$. Then $f(\boldsymbol{x}) = \boldsymbol{w}^{\top}\boldsymbol{W}^{\top}\boldsymbol{x}$. We could represent this function as $f(\boldsymbol{x}) = \boldsymbol{x}^{\top}\boldsymbol{w}'$ where $\boldsymbol{w}' = \boldsymbol{W}\boldsymbol{w}$.

  • Clearly, we must use a nonlinear function to describe the features.

  • Most neural networks do so using an affine transformation controlled by learned parameters, followed by a fixed, nonlinear function called an activation function.

  • We use that strategy here, by defining $\boldsymbol{h} = g(\boldsymbol{W}^{\top}\boldsymbol{x} + \boldsymbol{c})$, where $\boldsymbol{W}$ provides the weights of a linear transformation and $\boldsymbol{c}$ the biases. Previously, to describe a linear regression model, we used a vector of weights and a scalar bias parameter to describe an affine transformation from an input vector to an output scalar. Now, we describe an affine transformation from a vector $\boldsymbol{x}$ to a vector $\boldsymbol{h}$, so an entire vector of bias parameters is needed. The activation function $g$ is typically chosen to be a function that is applied element-wise, with $h_i = g(\boldsymbol{x}^{\top}\boldsymbol{W}_{:,i} + c_i)$. In modern neural networks, the default recommendation is to use the rectified linear unit or ReLU (Jarrett et al., 2009; Nair and Hinton, 2010; Glorot et al., 2011a), defined by the activation function $g(z) = \max\{0, z\}$, depicted in figure 6.3.
    We can now specify our complete network as
    $$f(\boldsymbol{x}; \boldsymbol{W}, \boldsymbol{c}, \boldsymbol{w}, b) = \boldsymbol{w}^{\top} \max\{0, \boldsymbol{W}^{\top}\boldsymbol{x} + \boldsymbol{c}\} + b$$
    We can now specify a solution to the XOR problem. Let
    $$\boldsymbol{W} = \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix}, \qquad \boldsymbol{c} = \begin{bmatrix} 0 \\ -1 \end{bmatrix}, \qquad \boldsymbol{w} = \begin{bmatrix} 1 \\ -2 \end{bmatrix},$$
    and $b = 0$.

  • We can now walk through the way that the model processes a batch of inputs. Let $\boldsymbol{X}$ be the design matrix containing all four points in the binary input space, with one example per row:
    $$\boldsymbol{X} = \begin{bmatrix} 0 & 0 \\ 0 & 1 \\ 1 & 0 \\ 1 & 1 \end{bmatrix}$$
    The first step in the neural network is to multiply the input matrix by the first layer’s weight matrix:
    $$\boldsymbol{X}\boldsymbol{W} = \begin{bmatrix} 0 & 0 \\ 1 & 1 \\ 1 & 1 \\ 2 & 2 \end{bmatrix}$$
    Next, we add the bias vector $\boldsymbol{c}$, to obtain
    $$\begin{bmatrix} 0 & -1 \\ 1 & 0 \\ 1 & 0 \\ 2 & 1 \end{bmatrix}$$

  • In this space, all of the examples lie along a line with slope 1. As we move along this line, the output needs to begin at 0, then rise to 1, then drop back down to 0. A linear model cannot implement such a function. To finish computing the value of $\boldsymbol{h}$ for each example, we apply the rectified linear transformation (ReLU):
    $$\begin{bmatrix} 0 & 0 \\ 1 & 0 \\ 1 & 0 \\ 2 & 1 \end{bmatrix}$$
    This transformation has changed the relationship between the examples. They no longer lie on a single line. As shown in figure 6.1, they now lie in a space where a linear model can solve the problem. We finish by multiplying by the weight vector $\boldsymbol{w}$:
    $$\begin{bmatrix} 0 \\ 1 \\ 1 \\ 0 \end{bmatrix}$$
    The neural network has obtained the correct answer for every example in the batch.
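  • The same batch computation can be reproduced in a few lines of numpy (a minimal sketch; the parameter values are exactly the ones specified above):

```python
import numpy as np

# Hand-specified parameters from the text
W = np.array([[1., 1.], [1., 1.]])
c = np.array([0., -1.])
w = np.array([1., -2.])
b = 0.

X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])  # design matrix

H = np.maximum(0, X @ W + c)  # hidden layer: affine transformation + ReLU
y_hat = H @ w + b             # linear output layer
print(H)      # [[0 0] [1 0] [1 0] [2 1]]
print(y_hat)  # [0. 1. 1. 0.] -- the XOR of each row of X
```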

  • In this example, we simply specified the solution, then showed that it obtained zero error. In a real situation, there might be billions of model parameters and billions of training examples, so one cannot simply guess the solution as we did here.
    Instead, a gradient-based optimization algorithm can find parameters that produce very little error. The solution we described to the XOR problem is at a global minimum of the loss function, so gradient descent could converge to this point. There are other equivalent solutions to the XOR problem that gradient descent could also find. The convergence point of gradient descent depends on the initial values of the parameters. In practice, gradient descent would usually not find clean, easily understood, integer-valued solutions like the one we presented here.
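  • The sketch below (an illustrative assumption, not the book’s procedure) trains the same two-hidden-unit ReLU network on the four XOR points with full-batch gradient descent on the MSE loss. Depending on the random initialization, it may converge to one of the equivalent solutions or get stuck (for example, with a dead ReLU unit); it will generally not recover the clean integer-valued parameters shown above.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([0., 1., 1., 0.])

# Small random weights, zero biases
W = rng.normal(scale=0.5, size=(2, 2)); c = np.zeros(2)
w = rng.normal(scale=0.5, size=2);      b = 0.0
lr = 0.1

for step in range(5000):
    Z = X @ W + c                        # pre-activations
    H = np.maximum(0, Z)                 # ReLU hidden layer
    y_hat = H @ w + b                    # linear output
    g_yhat = 2 * (y_hat - y) / len(y)    # dJ/dy_hat for the MSE loss
    g_w = H.T @ g_yhat;  g_b = g_yhat.sum()
    g_Z = np.outer(g_yhat, w) * (Z > 0)  # backprop through the ReLU
    g_W = X.T @ g_Z;     g_c = g_Z.sum(axis=0)
    W -= lr * g_W; c -= lr * g_c; w -= lr * g_w; b -= lr * g_b

print(np.round(y_hat, 3))  # ideally close to [0, 1, 1, 0]; some seeds get stuck
```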

Gradient-Based Learning

  • Designing and training a neural network is not much different from training any other machine learning model with gradient descent. In section 5.10, we described how to build a machine learning algorithm by specifying an optimization procedure, a cost function, and a model family.
  • The largest difference between the linear models we have seen so far and neural networks is that the nonlinearity of a neural network causes most interesting loss functions to become non-convex.
    This means that neural networks are usually trained by using iterative, gradient-based optimizers that merely drive the cost function to a very low value, rather than the linear equation solvers used to train linear regression models or the convex optimization algorithms with global convergence guarantees used to train logistic regression or SVMs.
  • Convex optimization converges starting from any initial parameters (in theory—in practice it is very robust but can encounter numerical problems). Stochastic gradient descent applied to non-convex loss functions has no such convergence guarantee, and is sensitive to the values of the initial parameters.
  • For feedforward neural networks, it is important to initialize all weights to small random values. The biases may be initialized to zero or to small positive values. The iterative gradient-based optimization algorithms used to train feedforward networks and almost all other deep models will be described in detail in chapter 8, with parameter initialization in particular discussed later. For the moment, it suffices to understand that the training algorithm is almost always based on using the gradient to descend the cost function in one way or another. The specific algorithms are improvements and refinements on the ideas of gradient descent, introduced in section 4.3, and, more specifically, are most often improvements of the stochastic gradient descent algorithm, introduced in section 5.9.
  • We can, of course, train models such as linear regression and support vector machines with gradient descent too, and in fact this is common when the training set is extremely large. From this point of view, training a neural network is not much different from training any other model. Computing the gradient is slightly more complicated for a neural network, but can still be done efficiently and exactly. Section 6.5 will describe how to obtain the gradient using the back-propagation algorithm and modern generalizations of the back-propagation algorithm. As with other machine learning models, to apply gradient-based learning we must choose a cost function, and we must choose how to represent the output of the model. We now revisit these design considerations with special emphasis on the neural networks scenario.

Cost Functions

  • An important aspect of the design of a deep neural network is the choice of the cost function. Fortunately, the cost functions for neural networks are more or less the same as those for other parametric models, such as linear models. In most cases, our parametric model defines a distribution $p(\boldsymbol{y} \mid \boldsymbol{x}; \boldsymbol{\theta})$ and we simply use the principle of maximum likelihood. This means we use the cross-entropy between the training data and the model’s predictions as the cost function.
  • Sometimes, we take a simpler approach, where rather than predicting a complete probability distribution over $\boldsymbol{y}$, we merely predict some statistic of $\boldsymbol{y}$ conditioned on $\boldsymbol{x}$. Specialized loss functions allow us to train a predictor of these estimates. The total cost function used to train a neural network will often combine one of the primary cost functions described here with a regularization term.
  • We have already seen some simple examples of regularization applied to linear models in section 5.2.2. The weight decay approach used for linear models is also directly applicable to deep neural networks and is among the most popular regularization strategies. More advanced regularization strategies for neural networks will be described in chapter 7.

Learning Conditional Distributions with Maximum Likelihood

  • Most modern neural networks are trained using maximum likelihood. This means that the cost function is simply the negative log-likelihood, equivalently described as the cross-entropy between the training data and the model distribution. This cost function is given by
    $$J(\boldsymbol{\theta}) = -\mathbb{E}_{\mathbf{x}, \mathbf{y} \sim \hat{p}_{\text{data}}} \log p_{\text{model}}(\boldsymbol{y} \mid \boldsymbol{x})$$

  • The specific form of the cost function changes from model to model, depending on the specific form of $\log p_{\text{model}}$. The expansion of the above equation typically yields some terms that do not depend on the model parameters and may be discarded.
    For example, as we saw in section 5.5.1, if $p_{\text{model}}(\boldsymbol{y} \mid \boldsymbol{x}) = \mathcal{N}(\boldsymbol{y}; f(\boldsymbol{x}; \boldsymbol{\theta}), \boldsymbol{I})$, then we recover the mean squared error cost,
    $$J(\boldsymbol{\theta}) = \frac{1}{2} \mathbb{E}_{\mathbf{x}, \mathbf{y} \sim \hat{p}_{\text{data}}} \|\boldsymbol{y} - f(\boldsymbol{x}; \boldsymbol{\theta})\|^{2} + \text{const},$$
    up to a scaling factor of $\frac{1}{2}$ and a term that does not depend on $\boldsymbol{\theta}$. The discarded constant is based on the variance of the Gaussian distribution, which in this case we chose not to parametrize. Previously, we saw that the equivalence between maximum likelihood estimation with an output distribution and minimization of mean squared error holds for a linear model, but in fact, the equivalence holds regardless of the $f(\boldsymbol{x}; \boldsymbol{\theta})$ used to predict the mean of the Gaussian.
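  • As a short check of this correspondence (a derivation sketch, assuming a fixed identity covariance and a $d$-dimensional $\boldsymbol{y}$), the negative log-density of the Gaussian above is
    $$-\log \mathcal{N}(\boldsymbol{y}; f(\boldsymbol{x}; \boldsymbol{\theta}), \boldsymbol{I}) = \frac{1}{2}\|\boldsymbol{y} - f(\boldsymbol{x}; \boldsymbol{\theta})\|^{2} + \frac{d}{2}\log(2\pi),$$
    and taking the expectation over the empirical distribution gives exactly the cost above; only the first term depends on $\boldsymbol{\theta}$.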

  • An advantage of this approach of deriving the cost function from maximum likelihood is that it removes the burden of designing cost functions for each model. Specifying a model $p(\boldsymbol{y} \mid \boldsymbol{x})$ automatically determines a cost function $-\log p(\boldsymbol{y} \mid \boldsymbol{x})$.
    One recurring theme throughout neural network design is that the gradient of the cost function must be large and predictable enough to serve as a good guide for the learning algorithm. Functions that saturate (become very flat) undermine this objective because they make the gradient become very small.
    In many cases this happens because the activation functions used to produce the output of the hidden units or the output units saturate.
    The negative log-likelihood helps to avoid this problem for many models. Many output units involve an exp function that can saturate when its argument is very negative. The log function in the negative log-likelihood cost function undoes the exp of some output units. We will discuss the interaction between the cost function and the choice of output unit later.

  • One unusual property of the cross-entropy cost used to perform maximum likelihood estimation is that it usually does not have a minimum value when applied to the models commonly used in practice. For discrete output variables, most models are parametrized in such a way that they cannot represent a probability of zero or one, but can come arbitrarily close to doing so.

  • Logistic regression is an example of such a model. For real-valued output variables, if the model can control the density of the output distribution (for example, by learning the variance parameter of a Gaussian output distribution) then it becomes possible to assign extremely high density to the correct training set outputs, resulting in cross-entropy approaching negative infinity.

Learning Conditional Statistics

  • Instead of learning a full probability distribution $p(\boldsymbol{y} \mid \boldsymbol{x}; \boldsymbol{\theta})$, we often want to learn just one conditional statistic of $\boldsymbol{y}$ given $\boldsymbol{x}$.

  • For example, we may have a predictor $f(\boldsymbol{x}; \boldsymbol{\theta})$ that we wish to use to predict the mean of $\boldsymbol{y}$.

  • If we use a sufficiently powerful neural network, we can think of the neural network as being able to represent any function $f$ from a wide class of functions, with this class being limited only by features such as continuity and boundedness rather than by having a specific parametric form.

  • From this point of view, we can view the cost function as being a functional rather than just a function. A functional is a mapping from functions to real numbers. We can thus think of learning as choosing a function rather than merely choosing a set of parameters. We can design our cost functional to have its minimum occur at some specific function we desire.

  • For example, we can design the cost functional to have its minimum lie on the function that maps $\boldsymbol{x}$ to the expected value of $\boldsymbol{y}$ given $\boldsymbol{x}$. Solving an optimization problem with respect to a function requires a mathematical tool called calculus of variations, described in section 19.4.2. It is not necessary to understand calculus of variations to understand the content of this chapter. At the moment, it is only necessary to understand that calculus of variations may be used to derive the following two results.

  • Our first result derived using calculus of variations is that solving the optimization problem
    $$f^{*} = \underset{f}{\arg\min}\; \mathbb{E}_{\mathbf{x}, \mathbf{y} \sim p_{\text{data}}} \|\boldsymbol{y} - f(\boldsymbol{x})\|^{2}$$
    yields
    $$f^{*}(\boldsymbol{x}) = \mathbb{E}_{\mathbf{y} \sim p_{\text{data}}(\boldsymbol{y} \mid \boldsymbol{x})}[\boldsymbol{y}]$$

  • so long as this function lies within the class we optimize over. In other words, if we could train on infinitely many samples from the true data generating distribution, minimizing the mean squared error cost function gives a function that predicts the mean of $\boldsymbol{y}$ for each value of $\boldsymbol{x}$.
    Different cost functions give different statistics.
    A second result derived using calculus of variations is that
    $$f^{*} = \underset{f}{\arg\min}\; \mathbb{E}_{\mathbf{x}, \mathbf{y} \sim p_{\text{data}}} \|\boldsymbol{y} - f(\boldsymbol{x})\|$$
    yields a function that predicts the median value of $\boldsymbol{y}$ for each $\boldsymbol{x}$, so long as such a function may be described by the family of functions we optimize over. This cost function is commonly called mean absolute error.
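  • A small numerical illustration of these two results (a sketch, restricted to constant predictors over a single hypothetical sample rather than full functions of $\boldsymbol{x}$): the constant that minimizes squared error is the sample mean, while the constant that minimizes absolute error is the sample median.

```python
import numpy as np

y = np.array([0., 0., 1., 2., 10.])   # a small, skewed sample
cs = np.linspace(-5., 15., 20001)     # candidate constant predictions

mse = ((y[None, :] - cs[:, None]) ** 2).mean(axis=1)
mae = np.abs(y[None, :] - cs[:, None]).mean(axis=1)

print(cs[mse.argmin()], y.mean())     # ~2.6 -> the mean minimizes MSE
print(cs[mae.argmin()], np.median(y)) # ~1.0 -> the median minimizes MAE
```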

  • Unfortunately, mean squared error and mean absolute error often lead to poor results when used with gradient-based optimization. Some output units that saturate produce very small gradients when combined with these cost functions. This is one reason that the cross-entropy cost function is more popular than mean squared error or mean absolute error, even when it is not necessary to estimate an entire distribution $p(\boldsymbol{y} \mid \boldsymbol{x})$.

Output Units

  • Any kind of neural network unit that may be used as an output can also be used as a hidden unit. Here, we focus on the use of these units as outputs of the model, but in principle they can be used internally as well.

  • Throughout this section, we suppose that the feedforward network provides a set of hidden features defined by $\boldsymbol{h} = f(\boldsymbol{x}; \boldsymbol{\theta})$. The role of the output layer is then to provide some additional transformation from the features to complete the task that the network must perform.

Linear Units for Gaussian Output Distributions

  • One simple kind of output unit is an output unit based on an affine transformation with no nonlinearity. These are often just called linear units.

  • Given features $\boldsymbol{h}$, a layer of linear output units produces a vector $\hat{\boldsymbol{y}} = \boldsymbol{W}^{\top}\boldsymbol{h} + \boldsymbol{b}$. Linear output layers are often used to produce the mean of a conditional Gaussian distribution:
    $$p(\boldsymbol{y} \mid \boldsymbol{x}) = \mathcal{N}(\boldsymbol{y}; \hat{\boldsymbol{y}}, \boldsymbol{I})$$
    Maximizing the log-likelihood is then equivalent to minimizing the mean squared error.

  • The maximum likelihood framework makes it straightforward to learn the covariance of the Gaussian too, or to make the covariance of the Gaussian be a function of the input. However, the covariance must be constrained to be a positive definite matrix for all inputs. It is difficult to satisfy such constraints with a linear output layer, so typically other output units are used to parametrize the covariance. Approaches to modeling the covariance are described shortly, in section 6.2.2.4. Because linear units do not saturate, they pose little difficulty for gradient-based optimization algorithms and may be used with a wide variety of optimization algorithms.

Sigmoid Units for Bernoulli Output Distributions

  • Many tasks require predicting the value of a binary variable $y$. Classification problems with two classes can be cast in this form.

  • The maximum-likelihood approach is to define a Bernoulli distribution over $y$ conditioned on $\boldsymbol{x}$.

  • A Bernoulli distribution is defined by just a single number. The neural net needs to predict only $P(y = 1 \mid \boldsymbol{x})$. For this number to be a valid probability, it must lie in the interval $[0, 1]$.

  • Satisfying this constraint requires some careful design effort. Suppose we were to use a linear unit, and threshold its value to obtain a valid probability:
    $$P(y = 1 \mid \boldsymbol{x}) = \max\{0, \min\{1, \boldsymbol{w}^{\top}\boldsymbol{h} + b\}\}.$$
    This would indeed define a valid conditional distribution, but we would not be able to train it very effectively with gradient descent. Any time that $\boldsymbol{w}^{\top}\boldsymbol{h} + b$ strayed outside the unit interval, the gradient of the output of the model with respect to its parameters would be $\mathbf{0}$. A gradient of $\mathbf{0}$ is typically problematic because the learning algorithm no longer has a guide for how to improve the corresponding parameters.

  • Instead, it is better to use a different approach that ensures there is always a strong gradient whenever the model has the wrong answer. This approach is based on using sigmoid output units combined with maximum likelihood. A sigmoid output unit is defined by
    $$\hat{y} = \sigma(\boldsymbol{w}^{\top}\boldsymbol{h} + b)$$
    where $\sigma$ is the logistic sigmoid function described in section 3.10. We can think of the sigmoid output unit as having two components.
    First, it uses a linear layer to compute $z = \boldsymbol{w}^{\top}\boldsymbol{h} + b$.
    Next, it uses the sigmoid activation function to convert $z$ into a probability.

  • We omit the dependence on $\boldsymbol{x}$ for the moment to discuss how to define a probability distribution over $y$ using the value $z$. The sigmoid can be motivated by constructing an unnormalized probability distribution $\tilde{P}(y)$, which does not sum to 1.
    We can then divide by an appropriate constant to obtain a valid probability distribution.

  • If we begin with the assumption that the unnormalized log probabilities are linear in $y$ and $z$, we can exponentiate to obtain the unnormalized probabilities. We then normalize to see that this yields a Bernoulli distribution controlled by a sigmoidal transformation of $z$:
    $$\begin{aligned} \log \tilde{P}(y) &= yz \\ \tilde{P}(y) &= \exp(yz) \\ P(y) &= \frac{\exp(yz)}{\sum_{y'=0}^{1} \exp(y'z)} \\ P(y) &= \sigma((2y - 1)z) \end{aligned}$$
    (Since $y$ takes only the values 0 and 1, the last line is just the two cases written as a single expression.)
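  • To spell that last step out (a short check, plugging in the two possible values of $y$):
    $$P(y = 1) = \frac{e^{z}}{e^{0} + e^{z}} = \sigma(z), \qquad P(y = 0) = \frac{e^{0}}{e^{0} + e^{z}} = \sigma(-z) = 1 - \sigma(z),$$
    which are exactly $\sigma((2y - 1)z)$ evaluated at $y = 1$ and $y = 0$.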

  • Probability distributions based on exponentiation and normalization are common throughout the statistical modeling literature. The $z$ variable defining such a distribution over binary variables is called a logit.

  • This approach to predicting the probabilities in log-space is natural to use with maximum likelihood learning. Because the cost function used with maximum likelihood is $-\log P(y \mid \boldsymbol{x})$, the log in the cost function undoes the exp of the sigmoid.

  • Without this effect, the saturation of the sigmoid could prevent gradient-based learning from making good progress. The loss function for maximum likelihood learning of a Bernoulli parametrized by a sigmoid is
    $$J(\boldsymbol{\theta}) = -\log P(y \mid \boldsymbol{x}) = -\log \sigma((2y - 1)z) = \zeta((1 - 2y)z)$$

  • This derivation makes use of some properties from section 3.10. By rewriting the loss in terms of the softplus function, we can see that it saturates only when $(1 - 2y)z$ is very negative. Saturation thus occurs only when the model already has the right answer: when $y = 1$ and $z$ is very positive, or $y = 0$ and $z$ is very negative. When $z$ has the wrong sign, the argument to the softplus function, $(1 - 2y)z$, may be simplified to $|z|$. As $|z|$ becomes large while $z$ has the wrong sign, the softplus function asymptotes toward simply returning its argument $|z|$. The derivative with respect to $z$ asymptotes to $\operatorname{sign}(z)$, so, in the limit of extremely incorrect $z$, the softplus function does not shrink the gradient at all. This property is very useful because it means that gradient-based learning can act to quickly correct a mistaken $z$.

  • When we use other loss functions, such as mean squared error, the loss can saturate anytime $\sigma(z)$ saturates. The sigmoid activation function saturates to 0 when $z$ becomes very negative and saturates to 1 when $z$ becomes very positive. The gradient can shrink too small to be useful for learning whenever this happens, whether the model has the correct answer or the incorrect answer. For this reason, maximum likelihood is almost always the preferred approach to training sigmoid output units.

  • Analytically, the logarithm of the sigmoid is always defined and finite, because the sigmoid returns values restricted to the open interval $(0, 1)$, rather than using the entire closed interval of valid probabilities $[0, 1]$.

  • In software implementations, to avoid numerical problems, it is best to write the negative log-likelihood as a function of $z$, rather than as a function of $\hat{y} = \sigma(z)$. If the sigmoid function underflows to zero, then taking the logarithm of $\hat{y}$ yields negative infinity. (The two formulations are mathematically identical; the difference only shows up in floating-point arithmetic, as the sketch below illustrates.)
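  • A minimal sketch of the two formulations (assuming numpy; the softplus-based version is the $\zeta((1 - 2y)z)$ form from above):

```python
import numpy as np

def softplus(x):
    # zeta(x) = log(1 + exp(x)), computed without overflow for large |x|
    return np.maximum(x, 0) + np.log1p(np.exp(-np.abs(x)))

def nll_from_logit(z, y):
    # Stable negative log-likelihood written directly in terms of z
    return softplus((1 - 2 * y) * z)

def nll_naive(z, y):
    # Naive version via yhat = sigmoid(z); breaks down for extreme z
    yhat = 1.0 / (1.0 + np.exp(-z))
    return -(y * np.log(yhat) + (1 - y) * np.log(1 - yhat))

z = np.array([-800.0, 0.0, 800.0])
y = np.array([1.0, 1.0, 1.0])
print(nll_from_logit(z, y))  # [800.  0.693  0.] -- finite everywhere
print(nll_naive(z, y))       # inf / nan entries plus runtime warnings
```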

Softmax Units for Multinoulli Output Distributions

  • Any time we wish to represent a probability distribution over a discrete variable with $n$ possible values, we may use the softmax function. This can be seen as a generalization of the sigmoid function, which was used to represent a probability distribution over a binary variable.

  • Softmax functions are most often used as the output of a classifier, to represent the probability distribution over $n$ different classes. More rarely, softmax functions can be used inside the model itself, if we wish the model to choose between one of $n$ different options for some internal variable. In the case of binary variables, we wished to produce a single number
    $$\hat{y} = P(y = 1 \mid \boldsymbol{x})$$

  • Because this number needed to lie between 0 and 1, and because we wanted the logarithm of the number to be well-behaved for gradient-based optimization of the log-likelihood, we chose to instead predict a number $z = \log \tilde{P}(y = 1 \mid \boldsymbol{x})$.
    Exponentiating and normalizing gave us a Bernoulli distribution controlled by the sigmoid function.

  • To generalize to the case of a discrete variable with $n$ values, we now need to produce a vector $\hat{\boldsymbol{y}}$, with $\hat{y}_i = P(y = i \mid \boldsymbol{x})$. We require not only that each element $\hat{y}_i$ be between 0 and 1, but also that the entire vector sums to 1 so that it represents a valid probability distribution. The same approach that worked for the Bernoulli distribution generalizes to the multinoulli distribution.

  • First, a linear layer predicts unnormalized log probabilities:
    $$\boldsymbol{z} = \boldsymbol{W}^{\top}\boldsymbol{h} + \boldsymbol{b}$$
    where $z_i = \log \tilde{P}(y = i \mid \boldsymbol{x})$. The softmax function can then exponentiate and normalize $\boldsymbol{z}$ to obtain the desired $\hat{\boldsymbol{y}}$. Formally, the softmax function is given by
    $$\operatorname{softmax}(\boldsymbol{z})_i = \frac{\exp(z_i)}{\sum_j \exp(z_j)}$$
    As with the logistic sigmoid, the use of the exp function works very well when training the softmax to output a target value $y$ using maximum log-likelihood.

  • In this case, we wish to maximize $\log P(y = i; \boldsymbol{z}) = \log \operatorname{softmax}(\boldsymbol{z})_i$. Defining the softmax in terms of exp is natural because the $\log$ in the log-likelihood can undo the exp of the softmax:
    $$\log \operatorname{softmax}(\boldsymbol{z})_i = z_i - \log \sum_j \exp(z_j)$$
    The first term of the equation above shows that the input $z_i$ always has a direct contribution to the cost function. Because this term cannot saturate, we know that learning can proceed, even if the contribution of $z_i$ to the second term of the equation above becomes very small.
    When maximizing the log-likelihood, the first term encourages $z_i$ to be pushed up, while the second term encourages all of $\boldsymbol{z}$ to be pushed down.
    To gain some intuition for the second term, $\log \sum_j \exp(z_j)$, observe that this term can be roughly approximated by $\max_j z_j$. This approximation is based on the idea that $\exp(z_k)$ is insignificant for any $z_k$ that is noticeably less than $\max_j z_j$.
    The intuition we can gain from this approximation is that the negative log-likelihood cost function always strongly penalizes the most active incorrect prediction. If the correct answer already has the largest input to the softmax, then the $-z_i$ term and the $\log \sum_j \exp(z_j) \approx \max_j z_j = z_i$ terms will roughly cancel.
    This example will then contribute little to the overall training cost, which will be dominated by other examples that are not yet correctly classified.
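  • A small sketch of this behavior (assuming numpy; the first logit plays the role of the correct class):

```python
import numpy as np

def log_softmax(z):
    # log softmax(z)_i = z_i - log(sum_j exp(z_j)), computed after shifting by the max
    z = z - z.max()
    return z - np.log(np.exp(z).sum())

# Correct class already has by far the largest logit: the loss is nearly zero
z_right = np.array([10.0, -2.0, -3.0])
print(-log_softmax(z_right)[0])   # nearly 0 -- contributes little to the cost

# The most active prediction is wrong: the loss is roughly z_wrong - z_correct
z_wrong = np.array([-5.0, 12.0, 0.0])
print(-log_softmax(z_wrong)[0])   # ~ 17 -- strongly penalized
```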

  • So far we have discussed only a single example. Overall, unregularized maximum likelihood will drive the model to learn parameters that drive the softmax to predict the fraction of counts of each outcome observed in the training set:
    $$\operatorname{softmax}(\boldsymbol{z}(\boldsymbol{x}; \boldsymbol{\theta}))_i \approx \frac{\sum_{j=1}^{m} \mathbf{1}_{y^{(j)} = i,\, \boldsymbol{x}^{(j)} = \boldsymbol{x}}}{\sum_{j=1}^{m} \mathbf{1}_{\boldsymbol{x}^{(j)} = \boldsymbol{x}}}$$
    Because maximum likelihood is a consistent estimator, this is guaranteed to happen so long as the model family is capable of representing the training distribution. In practice, limited model capacity and imperfect optimization will mean that the model is only able to approximate these fractions.

  • Many objective functions other than the log-likelihood do not work as well with the softmax function. Specifically, objective functions that do not use a log to undo the exp of the softmax fail to learn when the argument to the exp becomes very negative, causing the gradient to vanish.
    In particular, squared error is a poor loss function for softmax units, and can fail to train the model to change its output, even when the model makes highly confident incorrect predictions. To understand why these other loss functions can fail, we need to examine the softmax function itself.

  • Like the sigmoid, the softmax activation can saturate. The sigmoid function has a single output that saturates when its input is extremely negative or extremely positive. In the case of the softmax, there are multiple output values. These output values can saturate when the differences between input values become extreme. When the softmax saturates, many cost functions based on the softmax also saturate, unless they are able to invert the saturating activating function.

  • To see that the softmax function responds to the difference between its inputs, observe that the softmax output is invariant to adding the same scalar to all of its inputs:
    $$\operatorname{softmax}(\boldsymbol{z}) = \operatorname{softmax}(\boldsymbol{z} + c)$$
    Using this property, we can derive a numerically stable variant of the softmax:
    $$\operatorname{softmax}(\boldsymbol{z}) = \operatorname{softmax}\left(\boldsymbol{z} - \max_i z_i\right)$$

  • The reformulated version allows us to evaluate softmax with only small numerical errors even when $\boldsymbol{z}$ contains extremely large or extremely negative numbers. Examining the numerically stable variant, we see that the softmax function is driven by the amount that its arguments deviate from $\max_i z_i$.
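  • A minimal sketch of the stable variant (assuming numpy):

```python
import numpy as np

def softmax_naive(z):
    e = np.exp(z)            # overflows for large positive logits
    return e / e.sum()

def softmax_stable(z):
    e = np.exp(z - z.max())  # softmax(z) = softmax(z - max_i z_i)
    return e / e.sum()

z = np.array([1000.0, 1001.0, 1002.0])
print(softmax_naive(z))   # [nan nan nan] plus overflow warnings
print(softmax_stable(z))  # [0.090 0.245 0.665] -- same distribution, no overflow
```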

  • An output $\operatorname{softmax}(\boldsymbol{z})_i$ saturates to 1 when the corresponding input is maximal ($z_i = \max_i z_i$) and $z_i$ is much greater than all of the other inputs. The output $\operatorname{softmax}(\boldsymbol{z})_i$ can also saturate to 0 when $z_i$ is not maximal and the maximum is much greater. This is a generalization of the way that sigmoid units saturate, and can cause similar difficulties for learning if the loss function is not designed to compensate for it.

  • The argument $\boldsymbol{z}$ to the softmax function can be produced in two different ways.
    The most common is simply to have an earlier layer of the neural network output every element of $\boldsymbol{z}$, as described above using the linear layer $\boldsymbol{z} = \boldsymbol{W}^{\top}\boldsymbol{h} + \boldsymbol{b}$. While straightforward, this approach actually overparametrizes the distribution. The constraint that the $n$ outputs must sum to 1 means that only $n - 1$ parameters are necessary; the probability of the $n$-th value may be obtained by subtracting the first $n - 1$ probabilities from 1. We can thus impose a requirement that one element of $\boldsymbol{z}$ be fixed.
    For example, we can require that $z_n = 0$. Indeed, this is exactly what the sigmoid unit does. Defining $P(y = 1 \mid \boldsymbol{x}) = \sigma(z)$ is equivalent to defining $P(y = 1 \mid \boldsymbol{x}) = \operatorname{softmax}(\boldsymbol{z})_1$ with a two-dimensional $\boldsymbol{z}$ and $z_1 = 0$.
    Both the $n - 1$ argument and the $n$ argument approaches to the softmax can describe the same set of probability distributions, but have different learning dynamics. In practice, there is rarely much difference between using the overparametrized version or the restricted version, and it is simpler to implement the overparametrized version.

  • The name “softmax” can be somewhat confusing. The function is more closely related to the arg max function than the max function. The term “soft” derives from the fact that the softmax function is continuous and differentiable. The arg max function, with its result represented as a one-hot vector, is not continuous or differentiable. The softmax function thus provides a “softened” version of the arg max. The corresponding soft version of the maximum function is $\operatorname{softmax}(\boldsymbol{z})^{\top}\boldsymbol{z}$. It would perhaps be better to call the softmax function “softargmax,” but the current name is an entrenched convention.

Other Output Types

  • The linear, sigmoid, and softmax output units described above are the most common. Neural networks can generalize to almost any kind of output layer that we wish. The principle of maximum likelihood provides a guide for how to design a good cost function for nearly any kind of output layer.

  • In general, if we define a conditional distribution $p(\boldsymbol{y} \mid \boldsymbol{x}; \boldsymbol{\theta})$, the principle of maximum likelihood suggests we use $-\log p(\boldsymbol{y} \mid \boldsymbol{x}; \boldsymbol{\theta})$ as our cost function.

  • In general, we can think of the neural network as representing a function $f(\boldsymbol{x}; \boldsymbol{\theta})$. The outputs of this function are not direct predictions of the value $\boldsymbol{y}$. Instead, $f(\boldsymbol{x}; \boldsymbol{\theta}) = \boldsymbol{\omega}$ provides the parameters for a distribution over $\boldsymbol{y}$. Our loss function can then be interpreted as $-\log p(\mathbf{y}; \boldsymbol{\omega}(\boldsymbol{x}))$.

  • For example, we may wish to learn the variance of a conditional Gaussian for $\mathbf{y}$ given $\mathbf{x}$. In the simple case, where the variance $\sigma^2$ is a constant, there is a closed form expression because the maximum likelihood estimator of variance is simply the empirical mean (that is, the sample mean) of the squared difference between observations $\mathbf{y}$ and their expected value.
    A computationally more expensive approach that does not require writing special-case code is to simply include the variance as one of the properties of the distribution $p(\mathbf{y} \mid \boldsymbol{x})$ that is controlled by $\boldsymbol{\omega} = f(\boldsymbol{x}; \boldsymbol{\theta})$. The negative log-likelihood $-\log p(\boldsymbol{y}; \boldsymbol{\omega}(\boldsymbol{x}))$ will then provide a cost function with the appropriate terms necessary to make our optimization procedure incrementally learn the variance.
    In the simple case where the standard deviation does not depend on the input, we can make a new parameter in the network that is copied directly into $\boldsymbol{\omega}$. This new parameter might be $\sigma$ itself, or could be a parameter $v$ representing $\sigma^2$, or it could be a parameter $\beta$ representing $\frac{1}{\sigma^2}$, depending on how we choose to parametrize the distribution. We may wish our model to predict a different amount of variance in $\mathbf{y}$ for different values of $\mathbf{x}$. This is called a heteroscedastic model. In the heteroscedastic case, we simply make the specification of the variance be one of the values output by $f(\mathbf{x}; \boldsymbol{\theta})$. A typical way to do this is to formulate the Gaussian distribution using precision, rather than variance, as in
    $$\mathcal{N}(x; \mu, \beta^{-1}) = \sqrt{\frac{\beta}{2\pi}} \exp\left(-\frac{1}{2}\beta(x - \mu)^2\right).$$
    In the multivariate case it is most common to use a diagonal precision matrix $\operatorname{diag}(\boldsymbol{\beta})$.

  • This formulation works well with gradient descent because the formula for the log-likelihood of the Gaussian distribution parametrized by $\boldsymbol{\beta}$ involves only multiplication by $\beta_i$ and addition of $\log \beta_i$. The gradient of multiplication, addition, and logarithm operations is well-behaved.

  • By comparison, if we parametrized the output in terms of variance, we would need to use division. The division function becomes arbitrarily steep near zero. While large gradients can help learning, arbitrarily large gradients usually result in instability. If we parametrized the output in terms of standard deviation, the log-likelihood would still involve division, and would also involve squaring. The gradient through the squaring operation can vanish near zero, making it difficult to learn parameters that are squared.

  • Regardless of whether we use standard deviation, variance, or precision, we must ensure that the covariance matrix of the Gaussian is positive definite. Because the eigenvalues of the precision matrix are the reciprocals of the eigenvalues of the covariance matrix, this is equivalent to ensuring that the precision matrix is positive definite. If we use a diagonal matrix, or a scalar times the diagonal matrix, then the only condition we need to enforce on the output of the model is positivity. If we suppose that $\boldsymbol{a}$ is the raw activation of the model used to determine the diagonal precision, we can use the softplus function to obtain a positive precision vector: $\boldsymbol{\beta} = \zeta(\boldsymbol{a})$. This same strategy applies equally if using variance or standard deviation rather than precision, or if using a scalar times identity rather than a diagonal matrix.
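  • A minimal sketch of this constraint (assuming numpy; $\boldsymbol{a}$ stands for whatever unconstrained activation the network happens to emit):

```python
import numpy as np

def softplus(a):
    # zeta(a) = log(1 + exp(a)), computed stably for large |a|
    return np.maximum(a, 0) + np.log1p(np.exp(-np.abs(a)))

a = np.array([-5.0, 0.0, 3.0])  # raw, unconstrained activations
beta = softplus(a)              # diagonal precision: strictly positive for any real a
sigma2 = 1.0 / beta             # the corresponding per-dimension variances
print(beta, sigma2)
```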

  • It is rare to learn a covariance or precision matrix with richer structure than diagonal. If the covariance is full and conditional, then a parametrization must be chosen that guarantees positive-definiteness of the predicted covariance matrix.
    This can be achieved by writing $\boldsymbol{\Sigma}(\boldsymbol{x}) = \boldsymbol{B}(\boldsymbol{x})\boldsymbol{B}^{\top}(\boldsymbol{x})$, where $\boldsymbol{B}$ is an unconstrained square matrix.
    One practical issue if the matrix is full rank is that computing the likelihood is expensive, with a $d \times d$ matrix requiring $O(d^3)$ computation for the determinant and inverse of $\boldsymbol{\Sigma}(\boldsymbol{x})$ (or equivalently, and more commonly done, its eigendecomposition or that of $\boldsymbol{B}(\boldsymbol{x})$).

  • We often want to perform multimodal regression, that is, to predict real values that come from a conditional distribution $p(\boldsymbol{y} \mid \boldsymbol{x})$ that can have several different peaks in $\boldsymbol{y}$ space for the same value of $\boldsymbol{x}$.
    In this case, a Gaussian mixture is a natural representation for the output. Neural networks with Gaussian mixtures as their output are often called mixture density networks. A Gaussian mixture output with $n$ components is defined by the conditional probability distribution
    $$p(\boldsymbol{y} \mid \boldsymbol{x}) = \sum_{i=1}^{n} p(\mathrm{c} = i \mid \boldsymbol{x})\, \mathcal{N}\left(\boldsymbol{y}; \boldsymbol{\mu}^{(i)}(\boldsymbol{x}), \boldsymbol{\Sigma}^{(i)}(\boldsymbol{x})\right)$$

  • The neural network must have three outputs: a vector defining $p(\mathrm{c} = i \mid \boldsymbol{x})$, a matrix providing $\boldsymbol{\mu}^{(i)}(\boldsymbol{x})$ for all $i$, and a tensor providing $\boldsymbol{\Sigma}^{(i)}(\boldsymbol{x})$ for all $i$. These outputs must satisfy different constraints:

  1. Mixture components $p(\mathrm{c} = i \mid \boldsymbol{x})$: these form a multinoulli distribution over the $n$ different components associated with the latent variable $\mathrm{c}$, and can typically be obtained by a softmax over an $n$-dimensional vector, to guarantee that these outputs are positive and sum to 1.
  2. Means $\boldsymbol{\mu}^{(i)}(\boldsymbol{x})$: these indicate the center or mean associated with the $i$-th Gaussian component, and are unconstrained (typically with no nonlinearity at all for these output units). If $\mathbf{y}$ is a $d$-vector, then the network must output an $n \times d$ matrix containing all $n$ of these $d$-dimensional vectors. Learning these means with maximum likelihood is slightly more complicated than learning the means of a distribution with only one output mode. We only want to update the mean for the component that actually produced the observation. In practice, we do not know which component produced each observation. The expression for the negative log-likelihood naturally weights each example’s contribution to the loss for each component by the probability that the component produced the example.
  3. Covariances $\boldsymbol{\Sigma}^{(i)}(\boldsymbol{x})$: these specify the covariance matrix for each component $i$. As when learning a single Gaussian component, we typically use a diagonal matrix to avoid needing to compute determinants. As with learning the means of the mixture, maximum likelihood is complicated by needing to assign partial responsibility for each point to each mixture component. Gradient descent will automatically follow the correct process if given the correct specification of the negative log-likelihood under the mixture model.
  • It has been reported that gradient-based optimization of conditional Gaussian mixtures (on the output of neural networks) can be unreliable, in part because one gets divisions (by the variance) which can be numerically unstable (when some variance gets to be small for a particular example, yielding very large gradients). One solution is to clip gradients, while another is to scale the gradients heuristically.
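  • A minimal sketch of the mixture negative log-likelihood for a single target (assumed shapes and names; working with log-weights, log-variances, and a log-sum-exp keeps the computation away from the unstable divisions mentioned above):

```python
import numpy as np

def mdn_nll(y, log_pi, mu, log_var):
    """Negative log-likelihood of one target y under a diagonal Gaussian mixture.

    y: (d,) target; log_pi: (n,) log mixture weights; mu: (n, d) component means;
    log_var: (n, d) log of the diagonal variances.
    """
    var = np.exp(log_var)
    # log N(y; mu_i, diag(var_i)) for each component i
    log_comp = -0.5 * (((y - mu) ** 2) / var + log_var + np.log(2 * np.pi)).sum(axis=1)
    joint = log_pi + log_comp          # log [ pi_i * N_i(y) ]
    m = joint.max()                    # log-sum-exp for numerical stability
    return -(m + np.log(np.exp(joint - m).sum()))

# Hypothetical two-component, one-dimensional example
y = np.array([0.9])
log_pi = np.log(np.array([0.3, 0.7]))
mu = np.array([[0.0], [1.0]])
log_var = np.log(np.array([[0.25], [0.25]]))
print(mdn_nll(y, log_pi, mu, log_var))
```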

  • Gaussian mixture outputs are particularly effective in generative models of speech or movements of physical objects. The mixture density strategy gives a way for the network to represent multiple output modes and to control the variance of its output, which is crucial for obtaining a high degree of quality in these real-valued domains. An example of a mixture density network is shown in figure below.

  • In general, we may wish to continue to model larger vectors $\boldsymbol{y}$ containing more variables, and to impose richer and richer structures on these output variables. For example, we may wish for our neural network to output a sequence of characters that forms a sentence. In these cases, we may continue to use the principle of maximum likelihood applied to our model $p(\boldsymbol{y}; \boldsymbol{\omega}(\boldsymbol{x}))$, but the model we use to describe $\boldsymbol{y}$ becomes complex enough to be beyond the scope of this chapter.

WORKS

  • A single hidden layer is already enough to represent (not necessarily learn) any function to arbitrary accuracy, so why do we still use deep networks with more hidden layers?
  • Work through the exercises at https://developers.google.com/machine-learning/crash-course/introduction-to-neural-networks/playground-exercises
