

L = total number of layers in the network

s_l = number of units in layer l (not counting the bias unit)

K = number of output units (classes)




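  • Columns of Θ(l) = number of nodes in the current layer (including the bias unit)
  • Rows of Θ(l) = number of nodes in the next layer (excluding the bias unit)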
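
The row/column rule can be checked mechanically. The Python sketch below derives the shape of every weight matrix from a list of layer sizes that excludes bias units. The helper name `theta_shapes` is hypothetical and is not part of the course code.

```python
def theta_shapes(layer_sizes):
    """Return the (rows, cols) shape of each weight matrix.

    layer_sizes lists the unit count of every layer, bias units excluded.
    """
    shapes = []
    for s_current, s_next in zip(layer_sizes, layer_sizes[1:]):
        # rows: units in the next layer; cols: units in this layer plus the bias unit
        shapes.append((s_next, s_current + 1))
    return shapes


# A 400-25-10 network (the sizes are an illustrative assumption).
print(theta_shapes([400, 25, 10]))  # prints [(25, 401), (10, 26)]
```

For three hidden-to-output transitions with sizes 10, 10, 10, 1 the same helper gives 10x11, 10x11 and 1x11.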










thetaVector = [ Theta1(:); Theta2(:); Theta3(:); ]
deltaVector = [ D1(:); D2(:); D3(:) ]


Theta1 = reshape(thetaVector(1:110),10,11)
Theta2 = reshape(thetaVector(111:220),10,11)
Theta3 = reshape(thetaVector(221:231),1,11)
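
The unroll and reshape steps above are inverses of each other as long as the slice boundaries match the element counts. The following Python sketch illustrates the round trip on small matrices stored as lists of rows. The helpers `unroll` and `reshape` are hypothetical stand-ins written to mimic Octave's column-major `A(:)` and `reshape`; they are not library functions.

```python
def unroll(matrix):
    """Flatten a list-of-rows matrix column by column, like Octave's A(:)."""
    rows, cols = len(matrix), len(matrix[0])
    return [matrix[r][c] for c in range(cols) for r in range(rows)]


def reshape(vector, rows, cols):
    """Rebuild a rows x cols matrix from a column-major vector."""
    return [[vector[c * rows + r] for c in range(cols)] for r in range(rows)]


theta1 = [[1, 2, 3], [4, 5, 6]]   # 2 x 3
theta2 = [[7, 8, 9]]              # 1 x 3
theta_vector = unroll(theta1) + unroll(theta2)

# Slice boundaries follow from the element counts: 2*3 = 6, then 1*3 = 3.
restored1 = reshape(theta_vector[:6], 2, 3)
restored2 = reshape(theta_vector[6:9], 1, 3)
print(restored1 == theta1 and restored2 == theta2)
```

Note that Python slices are 0-based and exclude the end index, whereas the Octave ranges above are 1-based and inclusive.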


Gradient Checking



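When ε is small enough (for example ε = 10^-4), the two-sided difference (J(θ + ε) - J(θ - ε)) / (2ε) gives a close approximation of the derivative.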


epsilon = 1e-4;
for i = 1:n,
  thetaPlus = theta;
  thetaPlus(i) += epsilon;
  thetaMinus = theta;
  thetaMinus(i) -= epsilon;
  gradApprox(i) = (J(thetaPlus) - J(thetaMinus))/(2*epsilon);
end;
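
To see the check in action without a full network, the Python sketch below applies the same two-sided difference to a toy cost whose exact gradient is known. The cost `toy_cost` (J(θ) = Σ θ_i³) and the function name `numerical_gradient` are assumptions made for illustration only; in practice J would be the network cost and the exact gradient would come from backpropagation.

```python
def numerical_gradient(cost, theta, epsilon=1e-4):
    """Approximate each partial derivative of cost at theta by a two-sided difference."""
    grad = []
    for i in range(len(theta)):
        theta_plus = list(theta)
        theta_plus[i] += epsilon
        theta_minus = list(theta)
        theta_minus[i] -= epsilon
        grad.append((cost(theta_plus) - cost(theta_minus)) / (2 * epsilon))
    return grad


def toy_cost(theta):
    # Hypothetical cost standing in for the neural network cost function.
    return sum(t ** 3 for t in theta)


theta = [1.0, -2.0, 0.5]
approx = numerical_gradient(toy_cost, theta)
exact = [3 * t ** 2 for t in theta]   # analytic gradient of the toy cost
max_diff = max(abs(a - e) for a, e in zip(approx, exact))
print(max_diff)   # should be tiny compared with the gradient values
```

Because every component needs two extra evaluations of the cost, this loop is far slower than backpropagation, which is why the check is used only for verification.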







If the dimensions of Theta1 are 10x11, Theta2 10x11 and Theta3 1x11:
Theta1 = rand(10,11) * (2 * INIT_EPSILON) - INIT_EPSILON;
Theta2 = rand(10,11) * (2 * INIT_EPSILON) - INIT_EPSILON;
Theta3 = rand(1,11) * (2 * INIT_EPSILON) - INIT_EPSILON;
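
The lines above draw every weight uniformly from [-INIT_EPSILON, INIT_EPSILON], so that no two hidden units start with identical weights. A minimal Python sketch of the same idea follows. The function name `rand_initialize_weights` and the default of 0.12 are assumptions for illustration (0.12 is the value that appears in the exercise code later in these notes).

```python
import random


def rand_initialize_weights(l_in, l_out, init_epsilon=0.12):
    """Return an l_out x (l_in + 1) matrix with entries in [-init_epsilon, init_epsilon].

    The extra column holds the weights for the bias unit.
    """
    return [
        [random.uniform(-init_epsilon, init_epsilon) for _ in range(l_in + 1)]
        for _ in range(l_out)
    ]


theta1 = rand_initialize_weights(10, 10)   # 10 x 11
theta3 = rand_initialize_weights(10, 1)    # 1 x 11
print(len(theta1), len(theta1[0]), len(theta3), len(theta3[0]))  # prints 10 11 1 11
```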



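  • Number of input nodes: the number of feature dimensions of x
  • Number of output nodes: in a multi-class classification problem, equal to the number of classes
  • Number of nodes per hidden layer: more nodes usually work better, but the computation cost grows, so this is a trade-off
  • Default: one hidden layer; if there is more than one hidden layer, use the same number of nodes in each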


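  1. Randomly initialize the weights (θ).
  2. Run forward propagation to compute h_θ(x).
  3. Compute the cost function.
  4. Run backpropagation to compute the partial derivatives.
  5. Use gradient checking to confirm that backpropagation is correct, then turn gradient checking off.
  6. Use gradient descent or another optimization algorithm to minimize the cost function and find the best θ.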


for i = 1:m,
   Perform forward propagation and backpropagation using example (x(i),y(i))
   (Get activations a(l) and delta terms d(l) for l = 2,...,L)
end



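The dimensions of θ are (number of new features) x (number of old features, plus one column for the bias unit), because the job of θ is to map the old features onto the new ones.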

% Theta1 has size 25 x 401
% Theta2 has size 10 x 26



function [J grad] = nnCostFunction(nn_params, ...
                                   input_layer_size, ...
                                   hidden_layer_size, ...
                                   num_labels, ...
                                   X, y, lambda)
%NNCOSTFUNCTION Implements the neural network cost function for a two layer
%neural network which performs classification
%   [J grad] = NNCOSTFUNCTION(nn_params, input_layer_size, hidden_layer_size, ...
%   num_labels, X, y, lambda) computes the cost and gradient of the neural network. The
%   parameters for the neural network are "unrolled" into the vector
%   nn_params and need to be converted back into the weight matrices.
%   The returned parameter grad should be a "unrolled" vector of the
%   partial derivatives of the neural network.
%

% Reshape nn_params back into the parameters Theta1 and Theta2, the weight matrices
% for our 2 layer neural network
Theta1 = reshape(nn_params(1:hidden_layer_size * (input_layer_size + 1)), ...
                 hidden_layer_size, (input_layer_size + 1));

Theta2 = reshape(nn_params((1 + (hidden_layer_size * (input_layer_size + 1))):end), ...
                 num_labels, (hidden_layer_size + 1));

% Setup some useful variables
m = size(X, 1);

% You need to return the following variables correctly
J = 0;
Theta1_grad = zeros(size(Theta1));
Theta2_grad = zeros(size(Theta2));

% ====================== YOUR CODE HERE ======================
% Instructions: You should complete the code by working through the
%               following parts.
% Part 1: Feedforward the neural network and return the cost in the
%         variable J. After implementing Part 1, you can verify that your
%         cost function computation is correct by verifying the cost
%         computed in ex4.m
% Part 2: Implement the backpropagation algorithm to compute the gradients
%         Theta1_grad and Theta2_grad. You should return the partial derivatives of
%         the cost function with respect to Theta1 and Theta2 in Theta1_grad and
%         Theta2_grad, respectively. After implementing Part 2, you can check
%         that your implementation is correct by running checkNNGradients
%         Note: The vector y passed into the function is a vector of labels
%               containing values from 1..K. You need to map this vector into a
%               binary vector of 1's and 0's to be used with the neural network
%               cost function.
%         Hint: We recommend implementing backpropagation using a for-loop
%               over the training examples if you are implementing it for the
%               first time.
% Part 3: Implement regularization with the cost function and gradients.
%         Hint: You can implement this around the code for
%               backpropagation. That is, you can compute the gradients for
%               the regularization separately and then add them to Theta1_grad
%               and Theta2_grad from Part 2.
%

% Y = zeros(m, num_labels);  % m x num_labels == 5000 x 10
% for i = 1:m,
%     Y(i, y(i)) = 1;
% end
Y = (1:num_labels)==y;  % m x num_labels == 5000 x 10

a1 = [ones(m, 1) X];  % 5000 x 401
z2 = a1 * Theta1';  % m x hidden_layer_size == 5000 x 25
a2 = sigmoid(z2);  % m x hidden_layer_size == 5000 x 25
a2 = [ones(m,1), a2]; % 5000 x 26

z3 = a2 * Theta2';  % m x num_labels == 5000 x 10
a3 = sigmoid(z3);  % m x num_labels == 5000 x 10
h = a3;  % m x num_labels == 5000 x 10

% calculate penalty (bias columns are excluded from regularization)
p = sum(sum(Theta1(:, 2:end).^2, 2))+sum(sum(Theta2(:, 2:end).^2, 2));

% calculate J
J = sum(sum((-Y).*log(h) - (1-Y).*log(1-h), 2))/m + lambda*p/(2*m);  % scalar

% calculate sigmas (the error terms of each layer)
sigma3 = a3 - Y;  % 5000 x 10
sigma2 = (sigma3*Theta2).*sigmoidGradient([ones(size(z2, 1), 1) z2]);  % 5000 x 26
sigma2 = sigma2(:, 2:end);    % 5000 x 25

% accumulate gradients
delta_1 = (sigma2'*a1);  % 25 x 401
delta_2 = (sigma3'*a2);  % 10 x 26

% calculate regularized gradient
p1 = (lambda/m)*[zeros(size(Theta1, 1), 1) Theta1(:, 2:end)];
p2 = (lambda/m)*[zeros(size(Theta2, 1), 1) Theta2(:, 2:end)];
Theta1_grad = delta_1./m + p1;  % 25 x 401
Theta2_grad = delta_2./m + p2;  % 10 x 26

% -------------------------------------------------------------

% =========================================================================

% Unroll gradients
grad = [Theta1_grad(:) ; Theta2_grad(:)];

end
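
Two steps of the function above are easy to get wrong: mapping the labels 1..K to one-hot rows, and summing the cost over both examples and classes. The Python sketch below isolates those two steps on plain lists. The names `one_hot` and `cross_entropy_cost` are hypothetical, and regularization is deliberately left out.

```python
import math


def one_hot(labels, num_labels):
    """Map labels in 1..num_labels to rows of 0s and 1s."""
    return [[1 if k == y else 0 for k in range(1, num_labels + 1)] for y in labels]


def cross_entropy_cost(H, Y):
    """Unregularized cost: sum over examples and classes, divided by m."""
    m = len(Y)
    total = 0.0
    for h_row, y_row in zip(H, Y):
        for h, y in zip(h_row, y_row):
            total += -y * math.log(h) - (1 - y) * math.log(1 - h)
    return total / m


Y = one_hot([1, 3], 3)
print(Y)  # prints [[1, 0, 0], [0, 0, 1]]

# Hypothetical network outputs for the two examples (values strictly between 0 and 1).
H = [[0.9, 0.1, 0.1], [0.2, 0.2, 0.7]]
print(cross_entropy_cost(H, Y))
```

The cost shrinks as each row of H approaches the matching row of Y, which is a quick sanity check before moving on to the gradients.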


function g = sigmoidGradient(z)
%SIGMOIDGRADIENT returns the gradient of the sigmoid function
%evaluated at z
%   g = SIGMOIDGRADIENT(z) computes the gradient of the sigmoid function
%   evaluated at z. This should work regardless if z is a matrix or a
%   vector. In particular, if z is a vector or matrix, you should return
%   the gradient for each element.

g = zeros(size(z));

% ====================== YOUR CODE HERE ======================
% Instructions: Compute the gradient of the sigmoid function evaluated at
%               each value of z (z can be a matrix, vector or scalar).

g = sigmoid(z).*(1-sigmoid(z));

% =============================================================

end
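
The identity g'(z) = g(z)(1 - g(z)) used above can be verified numerically. The Python sketch below does so for scalars, comparing the closed form with a two-sided difference. The function names are assumptions for illustration, and this scalar version makes no attempt to handle very negative inputs, where `math.exp(-z)` overflows.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_gradient(z):
    """Derivative of the sigmoid, written in terms of the sigmoid itself."""
    s = sigmoid(z)
    return s * (1.0 - s)


print(sigmoid_gradient(0.0))  # prints 0.25, the maximum of the derivative

# Compare with a two-sided numerical difference at a few points.
eps = 1e-4
for z in (-2.0, 0.0, 1.5):
    numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
    assert abs(numeric - sigmoid_gradient(z)) < 1e-8
```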


function W = randInitializeWeights(L_in, L_out)
%RANDINITIALIZEWEIGHTS Randomly initialize the weights of a layer with L_in
%incoming connections and L_out outgoing connections
%   W = RANDINITIALIZEWEIGHTS(L_in, L_out) randomly initializes the weights
%   of a layer with L_in incoming connections and L_out outgoing
%   connections.
%   Note that W should be set to a matrix of size(L_out, 1 + L_in) as
%   the first column of W handles the "bias" terms
%

% You need to return the following variables correctly
W = zeros(L_out, 1 + L_in);

% ====================== YOUR CODE HERE ======================
% Instructions: Initialize W randomly so that we break the symmetry while
%               training the neural network.
% Note: The first column of W corresponds to the parameters for the bias unit
%

epsilon_init = 0.12;
W = rand(L_out, 1 + L_in) * 2 * epsilon_init - epsilon_init;

% =========================================================================

end

