
Back-Propagation and Other Differentiation Algorithms

  • When we use a feedforward neural network to accept an input $\boldsymbol{x}$ and produce an output $\hat{\boldsymbol{y}}$, information flows forward through the network. The input $\boldsymbol{x}$ provides the initial information that then propagates up to the hidden units at each layer and finally produces $\hat{\boldsymbol{y}}$. This is called forward propagation. During training, forward propagation can continue onward until it produces a scalar cost $J(\boldsymbol{\theta})$. The back-propagation algorithm (Rumelhart et al., 1986a), often simply called backprop, allows the information from the cost to then flow backwards through the network in order to compute the gradient.
  • Computing an analytical expression for the gradient is straightforward, but numerically evaluating such an expression can be computationally expensive. The back-propagation algorithm does so using a simple and inexpensive procedure.
  • The term back-propagation is often misunderstood as meaning the whole learning algorithm for multi-layer neural networks. Actually, back-propagation refers only to the method for computing the gradient, while another algorithm, such as stochastic gradient descent, is used to perform learning using this gradient. Furthermore, back-propagation is often misunderstood as being specific to multilayer neural networks, but in principle it can compute derivatives of any function (for some functions, the correct response is to report that the derivative of the function is undefined).
  • Specifically, we will describe how to compute the gradient $\nabla_{\boldsymbol{x}} f(\boldsymbol{x}, \boldsymbol{y})$ for an arbitrary function $f$, where $\boldsymbol{x}$ is a set of variables whose derivatives are desired, and $\boldsymbol{y}$ is an additional set of variables that are inputs to the function but whose derivatives are not required. In learning algorithms, the gradient we most often require is the gradient of the cost function with respect to the parameters, $\nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta})$. Many machine learning tasks involve computing other derivatives, either as part of the learning process or to analyze the learned model. The back-propagation algorithm can be applied to these tasks as well, and is not restricted to computing the gradient of the cost function with respect to the parameters. The idea of computing derivatives by propagating information through a network is very general, and can be used to compute values such as the Jacobian of a function $f$ with multiple outputs. We restrict our description here to the most commonly used case, where $f$ has a single output.

Computational Graphs

  • So far we have discussed neural networks with a relatively informal graph language. To describe the back-propagation algorithm more precisely, it is helpful to have a more precise computational graph language.

  • To formalize our graphs, we also need to introduce the idea of an operation. An operation is a simple function of one or more variables. Our graph language is accompanied by a set of allowable operations. Functions more complicated than the operations in this set may be described by composing many operations together.

  • Without loss of generality, we define an operation to return only a single output variable. This does not lose generality because the output variable can have multiple entries, such as a vector. Software implementations of back-propagation usually support operations with multiple outputs, but we avoid this case in our description because it introduces many extra details that are not important to conceptual understanding.

  • If a variable $y$ is computed by applying an operation to a variable $x$, then we draw a directed edge from $x$ to $y$. We sometimes annotate the output node with the name of the operation applied, and other times omit this label when the operation is clear from context. Examples of computational graphs are shown in the figure below.
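As an illustration of the graph language, here is a minimal sketch (with hypothetical class and variable names, not part of the book) of how a computational graph can be stored as a data structure: each variable records the operation that produced it and its parent variables, which correspond to the directed edges described above.

```python
# Minimal sketch of a computational graph node; names are illustrative only.
class Variable:
    def __init__(self, name, op=None, parents=()):
        self.name = name              # label used when drawing the graph
        self.op = op                  # name of the operation that computed it
        self.parents = list(parents)  # variables it was computed from

# Build the graph for y = matmul(x, w) + b.
x = Variable("x")
w = Variable("w")
b = Variable("b")
u = Variable("u", op="matmul", parents=[x, w])
y = Variable("y", op="add", parents=[u, b])

# Walking the parent pointers recovers the directed edges of the graph.
for var in (u, y):
    for p in var.parents:
        print(f"{p.name} -> {var.name}  (via {var.op})")
```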

Chain Rule of Calculus

  • The chain rule of calculus (not to be confused with the chain rule of probability) is used to compute the derivatives of functions formed by composing other functions whose derivatives are known. Back-propagation is an algorithm that computes the chain rule, with a specific order of operations that is highly efficient.

  • Let $x$ be a real number, and let $f$ and $g$ both be functions mapping from a real number to a real number. Suppose that $y = g(x)$ and $z = f(g(x)) = f(y)$. Then the chain rule states that
    $$\frac{dz}{dx} = \frac{dz}{dy}\,\frac{dy}{dx}$$
    We can generalize this beyond the scalar case. Suppose that $\boldsymbol{x} \in \mathbb{R}^m$, $\boldsymbol{y} \in \mathbb{R}^n$, $g$ maps from $\mathbb{R}^m$ to $\mathbb{R}^n$, and $f$ maps from $\mathbb{R}^n$ to $\mathbb{R}$. If $\boldsymbol{y} = g(\boldsymbol{x})$ and $z = f(\boldsymbol{y})$, then
    $$\frac{\partial z}{\partial x_i} = \sum_j \frac{\partial z}{\partial y_j}\,\frac{\partial y_j}{\partial x_i}$$
    In vector notation, this may be equivalently written as
    $$\nabla_{\boldsymbol{x}} z = \left(\frac{\partial \boldsymbol{y}}{\partial \boldsymbol{x}}\right)^{\top} \nabla_{\boldsymbol{y}} z$$
    where $\frac{\partial \boldsymbol{y}}{\partial \boldsymbol{x}}$ is the $n \times m$ Jacobian matrix of $g$. From this we see that the gradient of a variable $\boldsymbol{x}$ can be obtained by multiplying a Jacobian matrix $\frac{\partial \boldsymbol{y}}{\partial \boldsymbol{x}}$ by a gradient $\nabla_{\boldsymbol{y}} z$. The back-propagation algorithm consists of performing such a Jacobian-gradient product for each operation in the graph. Usually we apply the back-propagation algorithm not merely to vectors, but to tensors of arbitrary dimensionality. Conceptually, this is exactly the same as back-propagation with vectors. The only difference is how the numbers are arranged in a grid to form a tensor.
    We could imagine flattening each tensor into a vector before we run back-propagation, computing a vector-valued gradient, and then reshaping the gradient back into a tensor. In this rearranged view, back-propagation is still just multiplying Jacobians by gradients.

  • To denote the gradient of a value $z$ with respect to a tensor $\mathbf{X}$, we write $\nabla_{\mathbf{X}} z$, just as if $\mathbf{X}$ were a vector. The indices into $\mathbf{X}$ now have multiple coordinates: for example, a 3-D tensor is indexed by three coordinates. We can abstract this away by using a single variable $i$ to represent the complete tuple of indices. For all possible index tuples $i$, $(\nabla_{\mathbf{X}} z)_i$ gives $\frac{\partial z}{\partial X_i}$. This is exactly the same as how, for all possible integer indices $i$ into a vector, $(\nabla_{\boldsymbol{x}} z)_i$ gives $\frac{\partial z}{\partial x_i}$. Using this notation, we can write the chain rule as it applies to tensors. If $\mathbf{Y} = g(\mathbf{X})$ and $z = f(\mathbf{Y})$, then
    $$\nabla_{\mathbf{X}} z = \sum_j \left(\nabla_{\mathbf{X}} Y_j\right) \frac{\partial z}{\partial Y_j}$$
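The following short numpy check (an illustrative function of my choosing, not one from the text) verifies the vector form of the chain rule, $\nabla_{\boldsymbol{x}} z = \left(\frac{\partial \boldsymbol{y}}{\partial \boldsymbol{x}}\right)^{\top}\nabla_{\boldsymbol{y}} z$, against a finite-difference estimate.

```python
# Chain-rule check: grad_x z = J^T grad_y z for z = f(g(x)).
import numpy as np

def g(x):                      # g: R^3 -> R^2
    return np.array([x[0] * x[1], np.sin(x[2])])

def f(y):                      # f: R^2 -> R
    return y[0] ** 2 + 3.0 * y[1]

x = np.array([1.0, 2.0, 0.5])
y = g(x)

# Jacobian of g at x and gradient of f at y, written out by hand.
J = np.array([[x[1], x[0], 0.0],
              [0.0,  0.0,  np.cos(x[2])]])
grad_y = np.array([2.0 * y[0], 3.0])

grad_x = J.T @ grad_y          # Jacobian-transpose times gradient

# Compare against a central finite-difference estimate of grad_x f(g(x)).
eps = 1e-6
numeric = np.array([(f(g(x + eps * e)) - f(g(x - eps * e))) / (2 * eps)
                    for e in np.eye(3)])
print(np.max(np.abs(grad_x - numeric)))   # close to zero
```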

Recursively Applying the Chain Rule to Obtain Backprop

  • Using the chain rule, it is straightforward to write down an algebraic expression for the gradient of a scalar with respect to any node in the computational graph that produced that scalar. However, actually evaluating that expression in a computer introduces some extra considerations.

  • Specifically, many subexpressions may be repeated several times within the overall expression for the gradient. Any procedure that computes the gradient will need to choose whether to store these subexpressions or to recompute them several times.
    An example of how these repeated subexpressions arise is given in the figure below. In some cases, computing the same subexpression twice would simply be wasteful. For complicated graphs, there can be exponentially many of these wasted computations, making a naive implementation of the chain rule infeasible. In other cases, computing the same subexpression twice could be a valid way to reduce memory consumption at the cost of higher runtime.

  • We begin with a version of the back-propagation algorithm that specifies the actual gradient computation directly (algorithm 6.2, along with algorithm 6.1 for the associated forward computation), in the order in which it will actually be done, following the recursive application of the chain rule. One could either directly perform these computations or view the description of the algorithm as a symbolic specification of the computational graph for computing the back-propagation. However, this formulation does not make explicit the manipulation and construction of the symbolic graph that performs the gradient computation.

  • First consider a computational graph describing how to compute a single scalar $u^{(n)}$ (say the loss on a training example). This scalar is the quantity whose gradient we want to obtain, with respect to the $n_i$ input nodes $u^{(1)}$ to $u^{(n_i)}$. In other words, we wish to compute $\frac{\partial u^{(n)}}{\partial u^{(i)}}$ for all $i \in \{1, 2, \ldots, n_i\}$. In the application of back-propagation to computing gradients for gradient descent over parameters, $u^{(n)}$ will be the cost associated with an example or a minibatch, while $u^{(1)}$ to $u^{(n_i)}$ correspond to the parameters of the model.

  • We will assume that the nodes of the graph have been ordered in such a way that we can compute their output one after the other, starting at $u^{(n_i+1)}$ and going up to $u^{(n)}$. As defined in algorithm 6.1, each node $u^{(i)}$ is associated with an operation $f^{(i)}$ and is computed by evaluating the function
    $$u^{(i)} = f^{(i)}\left(\mathbb{A}^{(i)}\right)$$
    where $\mathbb{A}^{(i)}$ is the set of all nodes that are parents of $u^{(i)}$. That algorithm specifies the forward propagation computation, which we could put in a graph $\mathcal{G}$. To perform back-propagation, we can construct a computational graph that depends on $\mathcal{G}$ and adds to it an extra set of nodes. These form a subgraph $\mathcal{B}$ with one node per node of $\mathcal{G}$. Computation in $\mathcal{B}$ proceeds in exactly the reverse of the order of computation in $\mathcal{G}$, and each node of $\mathcal{B}$ computes the derivative $\frac{\partial u^{(n)}}{\partial u^{(i)}}$ associated with the forward graph node $u^{(i)}$. This is done using the chain rule with respect to the scalar output $u^{(n)}$:
    $$\frac{\partial u^{(n)}}{\partial u^{(j)}} = \sum_{i:\, j \in Pa\left(u^{(i)}\right)} \frac{\partial u^{(n)}}{\partial u^{(i)}}\,\frac{\partial u^{(i)}}{\partial u^{(j)}}$$

  • as specified by algorithm 6.2. The subgraph $\mathcal{B}$ contains exactly one edge for each edge from node $u^{(j)}$ to node $u^{(i)}$ of $\mathcal{G}$. The edge from $u^{(j)}$ to $u^{(i)}$ is associated with the computation of $\frac{\partial u^{(i)}}{\partial u^{(j)}}$. In addition, a dot product is performed for each node, between the gradient already computed with respect to the nodes $u^{(i)}$ that are children of $u^{(j)}$ and the vector containing the partial derivatives $\frac{\partial u^{(i)}}{\partial u^{(j)}}$ for the same children nodes $u^{(i)}$. To summarize, the amount of computation required to perform back-propagation scales linearly with the number of edges in $\mathcal{G}$, where the computation for each edge corresponds to computing a partial derivative (of one node with respect to one of its parents) as well as performing one multiplication and one addition. Below, we generalize this analysis to tensor-valued nodes, which is just a way to group multiple scalar values in the same node and enable more efficient implementations.
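The table-filling procedure described above can be sketched in a few lines of Python. This is a simplified illustration in the spirit of algorithms 6.1 and 6.2, not a verbatim transcription; the example graph and node names are hypothetical.

```python
# Forward propagation and the reverse-order table-filling backward pass on a
# small scalar computational graph: u3 = u1*u2, u4 = exp(u3), u5 = u3 + u4.
import math

# Each node: (operation, list of parent names, local-partial-derivative fn).
graph = {
    "u3": (lambda a, b: a * b, ["u1", "u2"],
           lambda a, b: [b, a]),               # d(a*b)/da = b, d(a*b)/db = a
    "u4": (lambda a: math.exp(a), ["u3"],
           lambda a: [math.exp(a)]),
    "u5": (lambda a, b: a + b, ["u3", "u4"],
           lambda a, b: [1.0, 1.0]),
}
order = ["u3", "u4", "u5"]        # topological order of non-input nodes

def forward(inputs):
    values = dict(inputs)
    for name in order:
        f, parents, _ = graph[name]
        values[name] = f(*(values[p] for p in parents))
    return values

def backward(values, output="u5"):
    grad_table = {output: 1.0}    # slot per node for d(output)/d(node)
    for name in reversed(order):
        f, parents, dlocal = graph[name]
        locals_ = dlocal(*(values[p] for p in parents))
        for p, d in zip(parents, locals_):
            # Gradients arriving from different children are summed.
            grad_table[p] = grad_table.get(p, 0.0) + grad_table[name] * d
    return grad_table

vals = forward({"u1": 2.0, "u2": 3.0})
grads = backward(vals)
print(grads["u1"], grads["u2"])   # d u5 / d u1 and d u5 / d u2
```

The `grad_table` dictionary plays the role of the table of $\frac{\partial u^{(n)}}{\partial u^{(i)}}$ values: each edge is visited once, with one multiplication and one addition per edge.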

  • The back-propagation algorithm is designed to reduce the number of common subexpressions without regard to memory. Specifically, it performs on the order of one Jacobian product per node in the graph. This can be seen from the fact that backprop (algorithm 6.2) visits each edge from node $u^{(j)}$ to node $u^{(i)}$ of the graph exactly once in order to obtain the associated partial derivative $\frac{\partial u^{(i)}}{\partial u^{(j)}}$.

  • Back-propagation thus avoids the exponential explosion in repeated subexpressions. However, other algorithms may be able to avoid more subexpressions by performing simplifications on the computational graph, or may be able to conserve memory by recomputing rather than storing some subexpressions. We will revisit these ideas after describing the back-propagation algorithm itself.

Back-Propagation Computation in Fully-Connected MLP

  • To clarify the above definition of the back-propagation computation, let us consider the specific graph associated with a fully-connected multilayer perceptron (MLP).
  • Algorithm 6.3 first shows the forward propagation, which maps parameters to the supervised loss $L(\hat{\boldsymbol{y}}, \boldsymbol{y})$ associated with a single (input, target) training example $(\boldsymbol{x}, \boldsymbol{y})$, with $\hat{\boldsymbol{y}}$ the output of the neural network when $\boldsymbol{x}$ is provided as input. Algorithm 6.4 then shows the corresponding computation to be done for applying the back-propagation algorithm to this graph. Algorithms 6.3 and 6.4 are demonstrations that are chosen to be simple and straightforward to understand. However, they are specialized to one specific problem. Modern software implementations are based on the generalized form of back-propagation described in the section below, which can accommodate any computational graph by explicitly manipulating a data structure for representing symbolic computation.

Symbol-to-Symbol Derivatives

  • Algebraic expressions and computational graphs both operate on symbols, or variables that do not have specific values. These algebraic and graph-based representations are called symbolic representations. When we actually use or train a neural network, we must assign specific values to these symbols. We replace a symbolic input to the network $\boldsymbol{x}$ with a specific numeric value, such as $[1.2, 3.765, -1.8]^{\top}$.

  • Some approaches to back-propagation take a computational graph and a set of numerical values for the inputs to the graph, then return a set of numerical values describing the gradient at those input values. We call this approach “symbol-to-number” differentiation. This is the approach used by libraries such as Torch and Caffe.

  • Another approach is to take a computational graph and add additional nodes to the graph that provide a symbolic description of the desired derivatives. This is the approach taken by Theano and TensorFlow (Abadi et al., 2015). An example of how this approach works is illustrated in the figure above. The primary advantage of this approach is that the derivatives are described in the same language as the original expression. Because the derivatives are just another computational graph, it is possible to run back-propagation again, differentiating the derivatives in order to obtain higher derivatives. Computation of higher-order derivatives is described in section 6.5.10.

  • We will use the latter approach and describe the back-propagation algorithm in terms of constructing a computational graph for the derivatives. Any subset of the graph may then be evaluated using specific numerical values at a later time. This allows us to avoid specifying exactly when each operation should be computed. Instead, a generic graph evaluation engine can evaluate every node as soon as its parents' values are available. The description of the symbol-to-symbol approach subsumes the symbol-to-number approach. The symbol-to-number approach can be understood as performing exactly the same computations as are done in the graph built by the symbol-to-symbol approach. The key difference is that the symbol-to-number approach does not expose the graph.
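As a loose analogy for the symbol-to-symbol idea (using SymPy's symbolic calculus rather than the graph machinery described in this section), the derivative of an expression is itself another symbolic expression, so it can be differentiated again or evaluated with numeric values later.

```python
# Symbolic differentiation produces another expression, not a number.
import sympy as sp

x = sp.symbols('x')
expr = sp.exp(x) * sp.sin(x)

dexpr = sp.diff(expr, x)        # first derivative: a new symbolic expression
d2expr = sp.diff(dexpr, x)      # differentiate the derivative (higher order)

print(dexpr)                    # exp(x)*sin(x) + exp(x)*cos(x)
print(d2expr.subs(x, 0.5).evalf())  # plug in a numeric value later
```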

General Back-Propagation

  • The back-propagation algorithm is very simple.

  • To compute the gradient of some scalar $z$ with respect to one of its ancestors $\boldsymbol{x}$ in the graph, we begin by observing that the gradient with respect to $z$ itself is given by $\frac{dz}{dz} = 1$.

  • We can then compute the gradient with respect to each parent of $z$ in the graph by multiplying the current gradient by the Jacobian of the operation that produced $z$. We continue multiplying by Jacobians, traveling backwards through the graph in this way, until we reach $\boldsymbol{x}$.
    For any node that may be reached by going backwards from $z$ through two or more paths, we simply sum the gradients arriving from the different paths at that node.

  • More formally, each node in the graph $\mathcal{G}$ corresponds to a variable. To achieve maximum generality, we describe this variable as being a tensor $\mathbf{V}$. Tensors can in general have any number of dimensions; they subsume scalars, vectors, and matrices.

  • We assume that each variable $\mathbf{V}$ is associated with the following subroutines:

  1. get_operation($\mathbf{V}$): This returns the operation that computes $\mathbf{V}$, represented by the edges coming into $\mathbf{V}$ in the computational graph. For example, there may be a Python or C++ class representing the matrix multiplication operation, together with its get_operation function. Suppose we have a variable that is created by matrix multiplication, $\boldsymbol{C} = \boldsymbol{A}\boldsymbol{B}$. Then get_operation($\boldsymbol{C}$) returns a pointer to an instance of the corresponding C++ class.
  2. get_consumers($\mathbf{V}$, $\mathcal{G}$): This returns the list of variables that are children of $\mathbf{V}$ in the computational graph $\mathcal{G}$.
  3. get_inputs($\mathbf{V}$, $\mathcal{G}$): This returns the list of variables that are parents of $\mathbf{V}$ in the computational graph $\mathcal{G}$.
  • Each operation op is also associated with a bprop operation. This bprop operation can compute a Jacobian-vector product as described by the equation $\nabla_{\mathbf{X}} z = \sum_j \left(\nabla_{\mathbf{X}} Y_j\right) \frac{\partial z}{\partial Y_j}$. This is how the back-propagation algorithm is able to achieve great generality. Each operation is responsible for knowing how to back-propagate through the edges in the graph that it participates in.
    For example, we might use a matrix multiplication operation to create a variable $\boldsymbol{C} = \boldsymbol{A}\boldsymbol{B}$. Suppose that the gradient of a scalar $z$ with respect to $\boldsymbol{C}$ is given by $\boldsymbol{G}$. The matrix multiplication operation is responsible for defining two back-propagation rules, one for each of its input arguments.
    If we call the bprop method to request the gradient with respect to $\boldsymbol{A}$ given that the gradient on the output is $\boldsymbol{G}$, then the bprop method of the matrix multiplication operation must state that the gradient with respect to $\boldsymbol{A}$ is given by $\boldsymbol{G}\boldsymbol{B}^{\top}$.
    Likewise, if we call the bprop method to request the gradient with respect to $\boldsymbol{B}$, then the matrix operation is responsible for implementing the bprop method and specifying that the desired gradient is given by $\boldsymbol{A}^{\top}\boldsymbol{G}$.
    The back-propagation algorithm itself does not need to know any differentiation rules. It only needs to call each operation's bprop rules with the right arguments. Formally, op.bprop(inputs, $\mathbf{X}$, $\mathbf{G}$) must return
    $$\sum_i \left(\nabla_{\mathbf{X}}\, \text{op.f(inputs)}_i\right) G_i$$
    which is just an implementation of the chain rule as expressed in the tensor chain-rule equation above. Here, inputs is a list of inputs that are supplied to the operation, op.f is the mathematical function that the operation implements, $\mathbf{X}$ is the input whose gradient we wish to compute, and $\mathbf{G}$ is the gradient on the output of the operation.
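A minimal sketch of such an operation-specific bprop method is shown below, using numpy and hypothetical class and method names; it implements only the two matrix-multiplication rules discussed above.

```python
# An operation that knows how to back-propagate through its own edges.
import numpy as np

class MatMul:
    @staticmethod
    def f(A, B):
        return A @ B

    @staticmethod
    def bprop(inputs, X, G):
        # G is the gradient of some scalar z with respect to the output C = A B.
        A, B = inputs
        if X is A:                 # gradient w.r.t. the first argument
            return G @ B.T
        if X is B:                 # gradient w.r.t. the second argument
            return A.T @ G
        raise ValueError("X is not an input of this operation")

# Check the rules on z = sum(A @ B), for which dz/dC is a matrix of ones.
A = np.random.randn(3, 4)
B = np.random.randn(4, 2)
G = np.ones((3, 2))
dA = MatMul.bprop((A, B), A, G)    # equals G @ B.T
dB = MatMul.bprop((A, B), B, G)    # equals A.T @ G
print(dA.shape, dB.shape)          # (3, 4) (4, 2)
```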

  • The op.bprop method should always pretend that all of its inputs are distinct from each other, even if they are not. For example, if the mul operator is passed two copies of $x$ to compute $x^2$, the op.bprop method should still return $x$ as the derivative with respect to both inputs. The back-propagation algorithm will later add both of these arguments together to obtain $2x$, which is the correct total derivative on $x$.

  • Software implementations of back-propagation usually provide both the operations and their bprop methods, so that users of deep learning software libraries are able to back-propagate through graphs built using common operations like matrix multiplication, exponents, logarithms, and so on. Software engineers who build a new implementation of back-propagation, or advanced users who need to add their own operation to an existing library, must usually derive the op.bprop method for any new operations manually.
    The back-propagation algorithm is formally described in algorithm 6.5.

  • In section 6.5.2, we explained that back-propagation was developed in order to avoid computing the same subexpression in the chain rule multiple times. The naive algorithm could have exponential runtime due to these repeated subexpressions. Now that we have specified the back-propagation algorithm, we can understand its computational cost. If we assume that each operation evaluation has roughly the same cost, then we may analyze the computational cost in terms of the number of operations executed. Keep in mind here that we refer to an operation as the fundamental unit of our computational graph, which might actually consist of very many arithmetic operations (for example, we might have a graph that treats matrix multiplication as a single operation).

  • Computing a gradient in a graph with $n$ nodes will never execute more than $O(n^2)$ operations or store the output of more than $O(n^2)$ operations. Here we are counting operations in the computational graph, not individual operations executed by the underlying hardware, so it is important to remember that the runtime of each operation may be highly variable.
    For example, multiplying two matrices that each contain millions of entries might correspond to a single operation in the graph. We can see that computing the gradient requires at most $O(n^2)$ operations because the forward propagation stage will at worst execute all $n$ nodes in the original graph (depending on which values we want to compute, we may not need to execute the entire graph). The back-propagation algorithm adds one Jacobian-vector product, which should be expressed with $O(1)$ nodes, per edge in the original graph.
    Because the computational graph is a directed acyclic graph, it has at most $O(n^2)$ edges. For the kinds of graphs that are commonly used in practice, the situation is even better. Most neural network cost functions are roughly chain-structured, causing back-propagation to have $O(n)$ cost. This is far better than the naive approach, which might need to execute exponentially many nodes. This potentially exponential cost can be seen by expanding the recursive chain rule $\frac{\partial u^{(n)}}{\partial u^{(j)}} = \sum_{i:\, j \in Pa(u^{(i)})} \frac{\partial u^{(n)}}{\partial u^{(i)}}\,\frac{\partial u^{(i)}}{\partial u^{(j)}}$ non-recursively, as a sum over all paths from node $j$ to node $n$ of the product of the partial derivatives along each path:
    $$\frac{\partial u^{(n)}}{\partial u^{(j)}} = \sum_{\substack{\text{paths } (u^{(\pi_1)}, u^{(\pi_2)}, \ldots, u^{(\pi_t)}) \\ \pi_1 = j,\ \pi_t = n}} \;\prod_{k=2}^{t} \frac{\partial u^{(\pi_k)}}{\partial u^{(\pi_{k-1})}}$$

  • Since the number of paths from node $j$ to node $n$ can grow exponentially in the length of these paths, the number of terms in the above sum, which is the number of such paths, can grow exponentially with the depth of the forward propagation graph. This large cost would be incurred because the same computation for $\frac{\partial u^{(i)}}{\partial u^{(j)}}$ would be redone many times.

  • To avoid such recomputation, we can think of back-propagation as a table-filling algorithm that takes advantage of storing intermediate results $\frac{\partial u^{(n)}}{\partial u^{(i)}}$. Each node in the graph has a corresponding slot in a table to store the gradient for that node. By filling in these table entries in order, back-propagation avoids repeating many common subexpressions. This table-filling strategy is sometimes called dynamic programming.

Example: Back-Propagation for MLP Training

  • As an example, we walk through the back-propagation algorithm as it is used to train a multilayer perceptron.

  • Here we develop a very simple multilayer perceptron with a single hidden layer. To train this model, we will use minibatch stochastic gradient descent. The back-propagation algorithm is used to compute the gradient of the cost on a single minibatch. Specifically, we use a minibatch of examples from the training set formatted as a design matrix $\boldsymbol{X}$ and a vector of associated class labels $\boldsymbol{y}$. The network computes a layer of hidden features $\boldsymbol{H} = \max\{0, \boldsymbol{X}\boldsymbol{W}^{(1)}\}$. To simplify the presentation we do not use biases in this model. We assume that our graph language includes a relu operation that can compute $\max\{0, \boldsymbol{Z}\}$ elementwise. The predictions of the unnormalized log probabilities over classes are then given by $\boldsymbol{H}\boldsymbol{W}^{(2)}$. We assume that our graph language includes a cross_entropy operation that computes the cross-entropy between the targets $\boldsymbol{y}$ and the probability distribution defined by these unnormalized log probabilities. The resulting cross-entropy defines the cost $J_{\mathrm{MLE}}$. Minimizing this cross-entropy performs maximum likelihood estimation of the classifier. However, to make this example more realistic, we also include a regularization term. The total cost
    $$J = J_{\mathrm{MLE}} + \lambda\left(\sum_{i,j}\left(W_{i,j}^{(1)}\right)^2 + \sum_{i,j}\left(W_{i,j}^{(2)}\right)^2\right)$$
    consists of the cross-entropy and a weight decay term with coefficient $\lambda$. The computational graph is illustrated in figure 6.11.

  • The computational graph for the gradient of this example is large enough that it would be tedious to draw or to read. This demonstrates one of the benefits of the back-propagation algorithm, which is that it can automatically generate gradients that would be straightforward but tedious for a software engineer to derive manually.

  • We can roughly trace out the behavior of the back-propagation algorithm by looking at the forward propagation graph in the figure above. To train, we wish to compute both $\nabla_{\boldsymbol{W}^{(1)}} J$ and $\nabla_{\boldsymbol{W}^{(2)}} J$. There are two different paths leading backward from $J$ to the weights: one through the cross-entropy cost, and one through the weight decay cost. The weight decay cost is relatively simple; it will always contribute $2\lambda\boldsymbol{W}^{(i)}$ to the gradient on $\boldsymbol{W}^{(i)}$.
    The other path, through the cross-entropy cost, is slightly more complicated. Let $\boldsymbol{G}$ be the gradient on the unnormalized log probabilities $\boldsymbol{U}^{(2)}$ provided by the cross_entropy operation. The back-propagation algorithm now needs to explore two different branches. On the shorter branch, it adds $\boldsymbol{H}^{\top}\boldsymbol{G}$ to the gradient on $\boldsymbol{W}^{(2)}$, using the back-propagation rule for the second argument to the matrix multiplication operation. The other branch corresponds to the longer chain descending further along the network. First, the back-propagation algorithm computes $\nabla_{\boldsymbol{H}} J = \boldsymbol{G}\boldsymbol{W}^{(2)\top}$ using the back-propagation rule for the first argument to the matrix multiplication operation. Next, the relu operation uses its back-propagation rule to zero out the components of the gradient corresponding to entries of $\boldsymbol{U}^{(1)}$ that were less than 0. Let the result be called $\boldsymbol{G}'$. The last step of the back-propagation algorithm is to use the back-propagation rule for the second argument of the matmul operation to add $\boldsymbol{X}^{\top}\boldsymbol{G}'$ to the gradient on $\boldsymbol{W}^{(1)}$.
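The following numpy sketch traces exactly this forward and backward computation for the one-hidden-layer MLP with weight decay. The shapes, random initialization, and variable names are illustrative assumptions rather than the book's algorithms.

```python
# Forward and backward pass for H = relu(X W1), U2 = H W2, cross-entropy + weight decay.
import numpy as np

rng = np.random.default_rng(0)
m, n_in, n_h, n_out = 8, 5, 4, 3          # minibatch size and layer widths
X = rng.normal(size=(m, n_in))
y = rng.integers(0, n_out, size=m)        # integer class labels
W1 = rng.normal(scale=0.1, size=(n_in, n_h))
W2 = rng.normal(scale=0.1, size=(n_h, n_out))
lam = 1e-3

# Forward propagation
U1 = X @ W1
H = np.maximum(0, U1)                     # relu
U2 = H @ W2                               # unnormalized log probabilities
P = np.exp(U2 - U2.max(axis=1, keepdims=True))
P /= P.sum(axis=1, keepdims=True)         # softmax
J_mle = -np.log(P[np.arange(m), y]).mean()
J = J_mle + lam * ((W1 ** 2).sum() + (W2 ** 2).sum())

# Backward propagation
G = P.copy()
G[np.arange(m), y] -= 1
G /= m                                    # gradient of J_mle w.r.t. U2
dW2 = H.T @ G + 2 * lam * W2              # shorter branch plus weight decay
dH = G @ W2.T                             # back through the second matmul
Gp = dH * (U1 > 0)                        # relu bprop zeroes entries where U1 < 0
dW1 = X.T @ Gp + 2 * lam * W1             # longer branch plus weight decay
print(J, dW1.shape, dW2.shape)
```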

  • After these gradients have been computed, it is the responsibility of the gradient descent algorithm, or another optimization algorithm, to use these gradients to update the parameters.

  • For the MLP, the computational cost is dominated by the cost of matrix multiplication. During the forward propagation stage, we multiply by each weight matrix, resulting in $O(w)$ multiply-adds, where $w$ is the number of weights. During the backward propagation stage, we multiply by the transpose of each weight matrix, which has the same computational cost. The main memory cost of the algorithm is that we need to store the input to the nonlinearity of the hidden layer. This value is stored from the time it is computed until the backward pass has returned to the same point. The memory cost is thus $O(m n_h)$, where $m$ is the number of examples in the minibatch and $n_h$ is the number of hidden units.

Complications

  • Our description of the back-propagation algorithm here is simpler than the implementations actually used in practice.

  • As noted above, we have restricted the definition of an operation to be a function that returns a single tensor. Most software implementations need to support operations that can return more than one tensor.
    For example, if we wish to compute both the maximum value in a tensor and the index of that value, it is best to compute both in a single pass through memory, so it is most efficient to implement this procedure as a single operation with two outputs.

  • We have not described how to control the memory consumption of backpropagation. Back-propagation often involves summation of many tensors together. In the naive approach, each of these tensors would be computed separately, then all of them would be added in a second step. The naive approach has an overly high memory bottleneck that can be avoided by maintaining a single buffer and adding each value to that buffer as it is computed.

  • Real-world implementations of back-propagation also need to handle various data types, such as 32-bit floating point, 64-bit floating point, and integer values. The policy for handling each of these types takes special care to design.

  • Some operations have undefined gradients, and it is important to track these cases and determine whether the gradient requested by the user is undefined. Various other technicalities make real-world differentiation more complicated. These technicalities are not insurmountable, and this chapter has described the key intellectual tools needed to compute derivatives, but it is important to be aware that many more subtleties exist.

Differentiation outside the Deep Learning Community

  • The deep learning community has been somewhat isolated from the broader computer science community and has largely developed its own cultural attitudes concerning how to perform differentiation.

  • More generally, the field of automatic differentiation is concerned with how to compute derivatives algorithmically. The back-propagation algorithm described here is only one approach to automatic differentiation. It is a special case of a broader class of techniques called reverse mode accumulation. Other approaches evaluate the subexpressions of the chain rule in different orders.

  • In general, determining the order of evaluation that results in the lowest computational cost is a difficult problem. Finding the optimal sequence of operations to compute the gradient is NP-complete, in the sense that it may require simplifying algebraic expressions into their least expensive form.

  • For example, suppose we have variables $p_1, p_2, \ldots, p_n$ representing probabilities and variables $z_1, z_2, \ldots, z_n$ representing unnormalized log probabilities. Suppose we define
    $$q_i = \frac{\exp(z_i)}{\sum_i \exp(z_i)}$$
    where we build the softmax function out of exponentiation, summation and division operations, and construct a cross-entropy loss $J = -\sum_i p_i \log q_i$.
    A human mathematician can observe that the derivative of $J$ with respect to $z_i$ takes a very simple form: $q_i - p_i$. The back-propagation algorithm is not capable of simplifying the gradient this way; instead it will explicitly propagate gradients through all of the logarithm and exponentiation operations in the original graph. Some software libraries such as Theano are able to perform some kinds of algebraic substitution to improve over the graph proposed by the pure back-propagation algorithm.
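A quick numerical check (illustrative, not from the text) confirms that the simplified gradient $q_i - p_i$ matches a finite-difference estimate of $\partial J / \partial z_i$ for the softmax and cross-entropy construction above.

```python
# Verify dJ/dz_i = q_i - p_i numerically for J = -sum_i p_i log q_i.
import numpy as np

rng = np.random.default_rng(1)
z = rng.normal(size=5)
p = rng.random(5)
p /= p.sum()                       # a valid probability vector

def J(z):
    q = np.exp(z) / np.exp(z).sum()
    return -(p * np.log(q)).sum()

q = np.exp(z) / np.exp(z).sum()
simplified = q - p                 # the hand-derived gradient

eps = 1e-6
numeric = np.array([(J(z + eps * e) - J(z - eps * e)) / (2 * eps)
                    for e in np.eye(5)])
print(np.max(np.abs(simplified - numeric)))   # close to zero
```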

  • When the forward graph $\mathcal{G}$ has a single output node and each partial derivative $\frac{\partial u^{(i)}}{\partial u^{(j)}}$ can be computed with a constant amount of computation, back-propagation guarantees that the number of computations for the gradient computation is of the same order as the number of computations for the forward computation: this can be seen in algorithm 6.2 because each local partial derivative $\frac{\partial u^{(i)}}{\partial u^{(j)}}$ needs to be computed only once, along with an associated multiplication and addition for the recursive chain-rule formulation. The overall computation is therefore $O(\#\text{edges})$. However, it can potentially be reduced by simplifying the computational graph constructed by back-propagation, and this is an NP-complete task. Implementations such as Theano and TensorFlow use heuristics based on matching known simplification patterns in order to iteratively attempt to simplify the graph.
    We defined back-propagation only for the computation of the gradient of a scalar output, but back-propagation can be extended to compute a Jacobian (either of $k$ different scalar nodes in the graph, or of a tensor-valued node containing $k$ values). A naive implementation may then need $k$ times more computation: for each scalar internal node in the original forward graph, the naive implementation computes $k$ gradients instead of a single gradient. When the number of outputs of the graph is larger than the number of inputs, it is sometimes preferable to use another form of automatic differentiation called forward mode accumulation. Forward mode computation has been proposed for obtaining real-time computation of gradients in recurrent networks, for example (Williams and Zipser, 1989). It also avoids the need to store the values and gradients for the whole graph, trading off computational efficiency for memory.
    The relationship between forward mode and backward mode is analogous to the relationship between left-multiplying and right-multiplying a sequence of matrices, such as $\boldsymbol{A}\boldsymbol{B}\boldsymbol{C}\boldsymbol{D}$, where the matrices can be thought of as Jacobian matrices. For example, if $\boldsymbol{D}$ is a column vector while $\boldsymbol{A}$ has many rows, this corresponds to a graph with a single output and many inputs, and starting the multiplications from the end and going backwards requires only matrix-vector products. This corresponds to the backward mode. Instead, starting to multiply from the left would involve a series of matrix-matrix products, which makes the whole computation much more expensive. However, if $\boldsymbol{A}$ has fewer rows than $\boldsymbol{D}$ has columns, it is cheaper to run the multiplications left-to-right, corresponding to the forward mode. A cost comparison for the two orderings is sketched below.
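The matrix-chain analogy can be made concrete by counting multiply-add operations for the two orderings. The sizes below are hypothetical; the point is only that with a single output (a thin final Jacobian), multiplying from the right keeps every intermediate result a vector.

```python
# Count multiply-adds for evaluating A B C D right-to-left vs. left-to-right.
shapes = [(1_000, 100), (100, 100), (100, 100), (100, 1)]  # A, B, C, D

def matmul_flops(shape_a, shape_b):
    # Multiply-add count for an (m x k) times (k x n) product.
    m, k = shape_a
    k2, n = shape_b
    assert k == k2
    return m * k * n

def chain_flops(shapes, left_to_right):
    order = shapes if left_to_right else shapes[::-1]
    acc, total = order[0], 0
    for s in order[1:]:
        a, b = (acc, s) if left_to_right else (s, acc)
        total += matmul_flops(a, b)
        acc = (a[0], b[1])         # shape of the running product
    return total

print("reverse mode (right-to-left):", chain_flops(shapes, False))
print("forward mode (left-to-right):", chain_flops(shapes, True))
```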

  • In many communities outside of machine learning, it is more common to implement differentiation software that acts directly on traditional programming language code, such as Python or C code, and automatically generates programs that differentiate functions written in these languages. In the deep learning community, computational graphs are usually represented by explicit data structures created by specialized libraries. The specialized approach has the drawback of requiring the library developer to define the bprop methods for every operation and limiting the user of the library to only those operations that have been defined. However, the specialized approach also has the benefit of allowing customized back-propagation rules to be developed for each operation, allowing the developer to improve speed or stability in non-obvious ways that an automatic procedure would presumably be unable to replicate.

  • Back-propagation is therefore not the only way or the optimal way of computing the gradient, but it is a very practical method that continues to serve the deep learning community very well. In the future, differentiation technology for deep networks may improve as deep learning practitioners become more aware of advances in the broader field of automatic differentiation.

Higher-Order Derivatives

  • Some software frameworks support the use of higher-order derivatives. Among the deep learning software frameworks, this includes at least Theano and TensorFlow. These libraries use the same kind of data structure to describe the expressions for derivatives as they use to describe the original function being differentiated. This means that the symbolic differentiation machinery can be applied to derivatives. In the context of deep learning, it is rare to compute a single second derivative of a scalar function. Instead, we are usually interested in properties of the Hessian matrix. If we have a function $f: \mathbb{R}^n \rightarrow \mathbb{R}$, then the Hessian matrix is of size $n \times n$. In typical deep learning applications, $n$ will be the number of parameters in the model, which could easily number in the billions. The entire Hessian matrix is thus infeasible to even represent.

  • Instead of explicitly computing the Hessian, the typical deep learning approach is to use Krylov methods. Krylov methods are a set of iterative techniques for performing various operations like approximately inverting a matrix or finding approximations to its eigenvectors or eigenvalues, without using any operation other than matrix-vector products.

  • In order to use Krylov methods on the Hessian, we only need to be able to compute the product between the Hessian matrix $\boldsymbol{H}$ and an arbitrary vector $\boldsymbol{v}$. A straightforward technique (Christianson, 1992) for doing so is to compute
    $$\boldsymbol{H}\boldsymbol{v} = \nabla_{\boldsymbol{x}}\left[\left(\nabla_{\boldsymbol{x}} f(\boldsymbol{x})\right)^{\top} \boldsymbol{v}\right]$$

  • Both of the gradient computations in this expression may be computed automatically by the appropriate software library. Note that the outer gradient expression takes the gradient of a function of the inner gradient expression.

  • If $\boldsymbol{v}$ is itself a vector produced by a computational graph, it is important to specify that the automatic differentiation software should not differentiate through the graph that produced $\boldsymbol{v}$.

  • While computing the full Hessian is usually not advisable, it is possible to do so using Hessian-vector products. One simply computes $\boldsymbol{H}\boldsymbol{e}^{(i)}$ for all $i = 1, \ldots, n$, where $\boldsymbol{e}^{(i)}$ is the one-hot vector with $e_i^{(i)} = 1$ and all other entries equal to 0.
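A small numerical illustration of the Hessian-vector product identity, using an assumed quadratic test function rather than a deep-learning library: for $f(\boldsymbol{x}) = \frac{1}{2}\boldsymbol{x}^{\top}\boldsymbol{A}\boldsymbol{x}$ with symmetric $\boldsymbol{A}$, the gradient is $\boldsymbol{A}\boldsymbol{x}$ and the Hessian is $\boldsymbol{A}$, so differentiating $(\nabla_{\boldsymbol{x}} f)^{\top}\boldsymbol{v}$ should recover $\boldsymbol{A}\boldsymbol{v}$.

```python
# Check H v = grad_x [ grad_f(x)^T v ] on a quadratic, via finite differences.
import numpy as np

rng = np.random.default_rng(2)
n = 6
A = rng.normal(size=(n, n))
A = 0.5 * (A + A.T)                 # make A symmetric so the Hessian is A
x = rng.normal(size=n)
v = rng.normal(size=n)

def grad_f(x):
    return A @ x                    # gradient of 0.5 x^T A x

# Finite-difference approximation of grad_x [ grad_f(x)^T v ], i.e. H v.
eps = 1e-6
hv_numeric = np.array([
    (grad_f(x + eps * e) @ v - grad_f(x - eps * e) @ v) / (2 * eps)
    for e in np.eye(n)
])

print(np.max(np.abs(hv_numeric - A @ v)))   # close to zero
```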

Historical Notes

  • Feedforward networks can be seen as efficient nonlinear function approximators based on using gradient descent to minimize the error in a function approximation.
    From this point of view, the modern feedforward network is the culmination of centuries of progress on the general function approximation task.

  • The chain rule that underlies the back-propagation algorithm was invented in the 17th century. Calculus and algebra have long been used to solve optimization problems in closed form, but gradient descent was not introduced as a technique for iteratively approximating the solution to optimization problems until the 19th century.

  • Beginning in the 1940s, these function approximation techniques were used to motivate machine learning models such as the perceptron. However, the earliest models were based on linear models. Critics including Marvin Minsky pointed out several of the flaws of the linear model family, such as its inability to learn the XOR function, which led to a backlash against the entire neural network approach.

  • Learning nonlinear functions required the development of a multilayer perceptron and a means of computing the gradient through such a model. Efficient applications of the chain rule based on dynamic programming began to appear in the 1960s and 1970s, mostly for control applications but also for sensitivity analysis. Werbos (1981) proposed applying these techniques to training artificial neural networks. The idea was finally developed in practice after being independently rediscovered in different ways. The book Parallel Distributed Processing presented the results of some of the first successful experiments with back-propagation in a chapter that contributed greatly to the popularization of back-propagation and initiated a very active period of research in multilayer neural networks. However, the ideas put forward by the authors of that book, and in particular by Rumelhart and Hinton, go far beyond back-propagation. They include crucial ideas about the possible computational implementation of several central aspects of cognition and learning, which came under the name of "connectionism" because of the importance this school of thought places on the connections between neurons as the locus of learning and memory. In particular, these ideas include the notion of distributed representation.

  • Following the success of back-propagation, neural network research gained popularity and reached a peak in the early 1990s. Afterwards, other machine learning techniques became more popular until the modern deep learning renaissance that began in 2006.

  • The core ideas behind modern feedforward networks have not changed substantially since the 1980s. The same back-propagation algorithm and the same approaches to gradient descent are still in use. Most of the improvement in neural network performance from 1986 to 2015 can be attributed to two factors. First, larger datasets have reduced the degree to which statistical generalization is a challenge for neural networks. Second, neural networks have become much larger, due to more powerful computers and better software infrastructure. However, a small number of algorithmic changes have improved the performance of neural networks noticeably.

  • One of these algorithmic changes was the replacement of mean squared error with the cross-entropy family of loss functions. Mean squared error was popular in the 1980s and 1990s, but was gradually replaced by cross-entropy losses and the principle of maximum likelihood as ideas spread between the statistics community and the machine learning community. The use of cross-entropy losses greatly improved the performance of models with sigmoid and softmax outputs, which had previously suffered from saturation and slow learning when using the mean squared error loss.

  • The other major algorithmic change that has greatly improved the performance of feedforward networks was the replacement of sigmoid hidden units with piecewise linear hidden units, such as rectified linear units. Rectification using the $\max\{0, z\}$ function was introduced in early neural network models and dates back at least as far as the Cognitron and Neocognitron. These early models did not use rectified linear units, but instead applied rectification to nonlinear functions. Despite the early popularity of rectification, rectification was largely replaced by sigmoids in the 1980s, perhaps because sigmoids perform better when neural networks are very small. As of the early 2000s, rectified linear units were avoided due to a somewhat superstitious belief that activation functions with non-differentiable points must be avoided. This began to change in about 2009. Jarrett et al. (2009) observed that "using a rectifying nonlinearity is the single most important factor in improving the performance of a recognition system" among several different factors of neural network architecture design.

  • For small datasets, Jarrett et al. (2009) observed that using rectifying nonlinearities is even more important than learning the weights of the hidden layers. Random weights are sufficient to propagate useful information through a rectified linear network, allowing the classifier layer at the top to learn how to map different feature vectors to class identities.

  • When more data is available, learning begins to extract enough useful knowledge to exceed the performance of randomly chosen parameters. Glorot et al. (2011a) showed that learning is far easier in deep rectified linear networks than in deep networks that have curvature or two-sided saturation in their activation functions.
    Rectified linear units are also of historical interest because they show that neuroscience has continued to have an influence on the development of deep learning algorithms. Glorot et al. (2011a) motivate rectified linear units from biological considerations. The half-rectifying nonlinearity was intended to capture these properties of biological neurons:

  1. For some inputs, biological neurons are completely inactive.
  2. For some inputs, a biological neuron’s output is proportional to its input.
  3. Most of the time, biological neurons operate in the regime where they are inactive (i.e., they should have sparse activations).
  • When the modern resurgence of deep learning began in 2006, feedforward networks continued to have a bad reputation. From about 2006 to 2012, it was widely believed that feedforward networks would not perform well unless they were assisted by other models, such as probabilistic models. Today, it is known that with the right resources and engineering practices, feedforward networks perform very well.
  • Today, gradient-based learning in feedforward networks is used as a tool to develop probabilistic models, such as the variational autoencoder and generative adversarial networks, described in chapter 20. Rather than being viewed as an unreliable technology that must be supported by other techniques, gradient-based learning in feedforward networks has been viewed since 2012 as a powerful technology that may be applied to many other machine learning tasks. In 2006, the community used unsupervised learning to support supervised learning, and now, ironically, it is more common to use supervised learning to support unsupervised learning. Feedforward networks continue to have unfulfilled potential.
  • In the future, we expect they will be applied to many more tasks, and that advances in optimization algorithms and model design will improve their performance even further. This chapter has primarily described the neural network family of models. In the subsequent chapters, we turn to how to use these models - how to regularize and train them.
