Gradient Descent for a One-Hidden-Layer Neural Network

  • Problem description
  • Answers to questions

Problem description

This second programming assignment is to train a one-hidden-layer neural network with one-dimensional input, one-dimensional output, and m=1 or 2 nodes in the hidden layer:

to fit the Runge function on the given eleven grid points


  • set $\sigma(x)=\frac{1}{1+e^{-x}}$ if m=2 is used
  • set $\sigma(x)=e^{-x^2}$ if m=1 is used

Note that the variables to optimize are $c_i,\:w_i,\:b_i$.
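The setup above can be sketched in a few lines of NumPy. The model formula and the grid are not stated in the text, so everything here is an assumption made for illustration: the network is taken to be $f(x)=\sum_{i=1}^{m} c_i\,\sigma(w_i x+b_i)$, the Runge function is taken to be $\frac{1}{1+25x^2}$ on eleven equally spaced points in $[-1,1]$, and the objective is taken to be half the sum of squared residuals. The names `model`, `objective`, `X`, and `Y` are hypothetical.

```python
import numpy as np

def runge(x):
    # Assumption: the classical Runge function 1 / (1 + 25 x^2).
    return 1.0 / (1.0 + 25.0 * x**2)

def sigmoid(z):
    # Activation used in the post for m=2.
    return 1.0 / (1.0 + np.exp(-z))

def gaussian(z):
    # Activation used in the post for m=1.
    return np.exp(-z**2)

# Assumption: eleven equally spaced grid points on [-1, 1].
X = np.linspace(-1.0, 1.0, 11)
Y = runge(X)

def model(params, x, sigma):
    """Evaluate sum_i c_i * sigma(w_i * x + b_i).

    params has shape (3, m); its rows are c, w, b.
    """
    c, w, b = params
    # np.outer(x, w) + b is an (n_points, m) array of pre-activations.
    return sigma(np.outer(x, w) + b) @ c

def objective(params, x, y, sigma):
    """Assumed objective: 0.5 * sum of squared residuals on the grid."""
    r = model(params, x, sigma) - y
    return 0.5 * np.sum(r**2)

# Example: one Gaussian node with c=1, w=5, b=0.
print(objective(np.array([[1.0], [5.0], [0.0]]), X, Y, gaussian))
```

Storing the parameters as a `(3, m)` array keeps the same code usable for both m=1 and m=2.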

Apply gradient descent to this example with a fixed step size, test the convergence rate of your gradient descent, and identify the constant $\gamma$ in Convergence Theorem 2 for Gradient Descent.

  • Is this a convex optimization problem? Why?
  • The plots of the objective function and its gradient magnitude
  • What is the convergence rate? Verify it by numerical examples.
  • What happens if you use different initial points?

Answers to questions

(1) This is not a convex optimization problem. To check this, we fix two of the three variables w, b, c and examine the objective along the remaining variable. For m=2, Figures 1-3 are three-dimensional plots of the objective along the w direction (b and c fixed), the b direction (w and c fixed), and the c direction (w and b fixed), respectively.

Figure 1

Figure 2

Figure 3

We can see that Figures 1 and 2 are nonconvex, while Figure 3 may be convex. A convex function must be convex along every direction, so nonconvexity in the w and b directions is already enough to conclude that the objective is nonconvex.
For m=1, we again fix two variables and examine the third. Figures 4-6 show the objective along the w direction (b and c fixed), the b direction (w and c fixed), and the c direction (w and b fixed).

Figure 4        Figure 5        Figure 6
We can also see that the function is nonconvex in the w and b directions and convex in the c direction, so overall it is still a nonconvex function. The problem is therefore nonconvex for both m=1 and m=2.
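The slice argument can also be checked numerically instead of by eye. The sketch below is not the method used in the post (which relies on plots). It evaluates central second differences of a one-dimensional slice: a convex slice can never produce a negative one. The grid, the Runge target, the least-squares objective, and the fixed values c=1, w=1, b=0 are all assumptions chosen for illustration.

```python
import numpy as np

X = np.linspace(-1.0, 1.0, 11)   # assumed grid of eleven points
Y = 1.0 / (1.0 + 25.0 * X**2)    # assumed Runge target

def objective_m1(c, w, b):
    """Assumed m=1 objective with sigma(z) = exp(-z^2)."""
    r = c * np.exp(-(w * X + b) ** 2) - Y
    return 0.5 * np.sum(r**2)

def second_differences(f, grid):
    """Central second differences f(t-h) - 2 f(t) + f(t+h) on a uniform grid."""
    v = np.array([f(t) for t in grid])
    return v[:-2] - 2.0 * v[1:-1] + v[2:]

grid = np.linspace(-3.0, 3.0, 601)

# Slice along w with c=1, b=0 fixed, and along c with w=1, b=0 fixed.
d2_w = second_differences(lambda w: objective_m1(1.0, w, 0.0), grid)
d2_c = second_differences(lambda c: objective_m1(c, 1.0, 0.0), grid)

print("min second difference along w:", d2_w.min())  # negative: nonconvex slice
print("min second difference along c:", d2_c.min())  # positive: convex slice
```

The c slice is convex for any fixed w and b because the objective is quadratic in c. A single negative second difference along w is enough to rule out convexity of the full objective.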
Figures 7-8 show how the objective function and the gradient magnitude change when m=2. I used a step size of 0.05 and ran 10000 iterations.

Figure 7        Figure 8
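A run like the one plotted above can be reproduced with a short fixed-step gradient descent loop that records the objective and the gradient norm at every iteration. This is only a sketch under assumptions: the least-squares objective, the grid, the Runge target, and the random initial point are not taken from the post, and `loss_and_grad` and `gradient_descent` are hypothetical names. Only the step size 0.05 and the 10000 iterations come from the text.

```python
import numpy as np

X = np.linspace(-1.0, 1.0, 11)   # assumed grid of eleven points
Y = 1.0 / (1.0 + 25.0 * X**2)    # assumed Runge target

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_and_grad(params):
    """Assumed objective 0.5 * sum(r^2) and its gradient; params rows are c, w, b."""
    c, w, b = params
    z = np.outer(X, w) + b           # pre-activations, shape (11, m)
    s = sigmoid(z)
    r = s @ c - Y                    # residuals, shape (11,)
    ds = s * (1.0 - s)               # derivative of the sigmoid
    grad_c = s.T @ r
    grad_w = c * ((ds * X[:, None]).T @ r)
    grad_b = c * (ds.T @ r)
    return 0.5 * np.sum(r**2), np.stack([grad_c, grad_w, grad_b])

def gradient_descent(params0, step, n_iter):
    """Fixed-step gradient descent that logs the loss and the gradient norm."""
    params = np.array(params0, dtype=float)
    losses, grad_norms = [], []
    for _ in range(n_iter):
        loss, grad = loss_and_grad(params)
        losses.append(loss)
        grad_norms.append(np.linalg.norm(grad))
        params = params - step * grad
    return params, np.array(losses), np.array(grad_norms)

# Assumption: a random initial point; the post does not state the one it used.
rng = np.random.default_rng(0)
params0 = rng.normal(size=(3, 2))
params, losses, grad_norms = gradient_descent(params0, step=0.05, n_iter=10000)
print(losses[0], losses[-1], grad_norms[-1])
```

Plotting `losses` and `grad_norms` against the iteration index on a logarithmic vertical axis gives curves of the kind described for the objective and gradient figures. Rerunning with `step=0.005` or `step=0.0005` gives the step-size comparison.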
If the step size is changed to 0.005, the objective function and gradient magnitude change as shown in Figures 9-10:

Figure 9        Figure 10

If the step size is changed to 0.0005, the objective function and gradient magnitude change as shown in Figures 11-12:

Figure 11        Figure 12

From the plots above, the best step size for m=2 may be around 0.05.
Figures 13-14 show how the objective function and the gradient magnitude change when m=1. I used a step size of 0.05 and ran 10000 iterations.

Figure 13        Figure 14
If the step size is changed to 0.005, the objective function and gradient magnitude change as shown in Figures 15-16:

Figure 15        Figure 16
If the step size is changed to 0.0005, the objective function and gradient magnitude change as shown in Figures 17-18:

Figure 17        Figure 18
From the plots above, the best step size for m=1 may be around 0.0005.

For m=2, $\gamma$ is estimated by fitting $\log(error_k)$ against k. The figures below show the result.
Figures 19-21 plot, against the number of iterations, the error, the $\frac{1}{k}$ rate, and the $\gamma^k$ rate, respectively. From these we can see that the convergence rate is $\frac{1}{k}$ and $\gamma=0.999998$.

Figure 19        Figure 20        Figure 21
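The fit of $\log(error_k)$ against k can be sketched as an ordinary least-squares line fit: if $error_k \approx C\gamma^k$, the slope of the line is $\log\gamma$. The same idea on a log-log scale estimates the exponent of a $\frac{1}{k^p}$ rate. The post does not say how $error_k$ is defined, so the functions below simply take any positive error sequence. The function names and the synthetic sequences are illustrative assumptions, not the data behind the figures.

```python
import numpy as np

def fit_geometric_rate(errors):
    """Fit log(error_k) = log(C) + k * log(gamma) by least squares; return (gamma, C)."""
    errors = np.asarray(errors, dtype=float)
    k = np.arange(len(errors))
    slope, intercept = np.polyfit(k, np.log(errors), 1)
    return np.exp(slope), np.exp(intercept)

def fit_power_rate(errors):
    """Fit log(error_k) = log(C) - p * log(k), with errors[0] taken as k=1; return (p, C)."""
    errors = np.asarray(errors, dtype=float)
    k = np.arange(1, len(errors) + 1)
    slope, intercept = np.polyfit(np.log(k), np.log(errors), 1)
    return -slope, np.exp(intercept)

# Synthetic sequences with known rates, used only to check the fits.
k = np.arange(2000)
gamma_hat, _ = fit_geometric_rate(3.0 * 0.995**k)
p_hat, _ = fit_power_rate(2.0 / np.arange(1, 2001))
print(gamma_hat, p_hat)  # close to 0.995 and 1.0
```

A fitted $\gamma$ very close to 1, as reported here, means the geometric model predicts almost no decay per iteration, so comparing it with the power-law fit is a reasonable sanity check.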

For m=1, $\gamma$ is estimated by fitting $\log(error_k)$ against k. The figures below show the result.
Figures 22-24 plot, against the number of iterations, the error, the $\frac{1}{k}$ rate, and the $\gamma^k$ rate, respectively.
From these we can see that the convergence rate is $\frac{1}{k}$ and $\gamma=0.999679$.

Figure 22        Figure 23        Figure 24

The choice of initial point has a strong influence on gradient descent. Some initial points make the descent very slow, so that the iteration fails to converge within the iteration budget or even diverges. The following figures show gradient descent runs that fail to converge because of the initial point, for m=2 and m=1.

Figure 25        Figure 26

Figure 27        Figure 28
Different initial points may also lead to different local minima. For m=2 with a step size of 0.05 and different initial points, Figure 33 reaches its minimum value faster than Figure 29 and Figure 31, which suggests that different local minima are reached.

Figure 29        Figure 30

Figure 31        Figure 32

Figure 33        Figure 34
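The sensitivity to the initial point can be explored with a small multi-start experiment. The sketch below uses the m=1 Gaussian network with step size 0.0005, since that setting is simple to keep stable. The objective, the grid, the four starting points, and the 5000-iteration budget are assumptions chosen for illustration and are not the runs shown in the figures.

```python
import numpy as np

X = np.linspace(-1.0, 1.0, 11)   # assumed grid of eleven points
Y = 1.0 / (1.0 + 25.0 * X**2)    # assumed Runge target

def loss_and_grad(p):
    """Assumed m=1 objective 0.5 * sum((c * exp(-(w x + b)^2) - y)^2) and its gradient."""
    c, w, b = p
    z = w * X + b
    g = np.exp(-z**2)
    r = c * g - Y
    dg = -2.0 * z * g            # derivative of exp(-z^2) with respect to z
    grad = np.array([np.sum(r * g), np.sum(r * c * dg * X), np.sum(r * c * dg)])
    return 0.5 * np.sum(r**2), grad

def run(p0, step=0.0005, n_iter=5000):
    """Run fixed-step gradient descent; return (initial loss, final loss, final gradient norm)."""
    p = np.array(p0, dtype=float)
    first, _ = loss_and_grad(p)
    for _ in range(n_iter):
        grad = loss_and_grad(p)[1]
        p -= step * grad
    last, grad = loss_and_grad(p)
    return first, last, np.linalg.norm(grad)

# Hypothetical starting points (c, w, b). The last one places the Gaussian bump
# far from the data, so the gradient is tiny and progress is very slow.
starts = [(1.0, 1.0, 0.0), (1.0, 5.0, 0.0), (0.5, -3.0, 0.5), (1.0, 1.0, 3.0)]
results = [run(p0) for p0 in starts]
for p0, (first, last, gnorm) in zip(starts, results):
    print(p0, first, last, gnorm)
```

Comparing the final losses across starts shows which runs stalled on a flat region and which settled near a minimum. A run with a small gradient norm but a large loss is a candidate for a poor local minimum or a plateau.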

