The goal is to find the $X$ that solves $\underset{X}{\min} f(X)$.

We use the gradient descent algorithm to obtain the minimum of the function.
Let $y = f(x)$.
Initialization: $x = x_0$, $y_0 = f(x_0)$, step size $\alpha$, convergence precision $\epsilon$.

The $i$-th iterative formula can be expressed as:
$$x_i = x_{i-1}-\alpha \nabla f(x_{i-1})$$

Example: find the minimum of the function $f(x) = x^2 - 3x + 2$.

Let $x_0 = 0$, step size $\alpha = 0.1$, and convergence precision $\epsilon = 10^{-4}$.
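With $\nabla f(x) = 2x - 3$, the first few iterations work out as:

$$\begin{aligned}
x_1 &= x_0 - \alpha(2x_0 - 3) = 0 - 0.1\times(-3) = 0.3\\
x_2 &= 0.3 - 0.1\times(2\times 0.3 - 3) = 0.3 + 0.24 = 0.54\\
x_3 &= 0.54 - 0.1\times(2\times 0.54 - 3) = 0.54 + 0.192 = 0.732
\end{aligned}$$

The MATLAB script below runs these iterations and plots each point: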

f = @(x) x.^2 - 3*x + 2;      % objective function
hold on
for x = 0:0.001:3             % plot the function curve on [0, 3]
    plot(x, f(x), 'k-');
end
x = 0;                        % starting point x0
y0 = f(x);
plot(x, y0, 'ro-');
alpha = 0.1;                  % step size
epsilon = 10^(-4);            % convergence precision
gnorm = inf;
while (gnorm > epsilon)
    x = x - alpha*(2*x - 3);  % gradient step: f'(x) = 2x - 3
    y = f(x);
    gnorm = abs(y - y0);      % change in function value
    plot(x, y, 'ro');
    y0 = y;
end
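For reference, the minimizer is where $f'(x) = 2x - 3 = 0$, so the loop should stop near $x = 1.5$; a quick analytic check:

x_star = 3/2;                        % solves f'(x) = 2x - 3 = 0
f_min  = x_star^2 - 3*x_star + 2     % minimum value, -0.25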

Now let's move on to the multi-variable case. Say we have $m$ samples, each with $n$ features. $X$ is expressed as:

$$X = \begin{bmatrix} x_1^T\\ x_2^T\\ \vdots \\ x_m^T \end{bmatrix}$$
where

$$x_i = \begin{bmatrix} x_{i1}\\ x_{i2}\\ \vdots \\ x_{in} \end{bmatrix}$$
Then $X$ can be written entry-wise as:

$$X = \begin{bmatrix} x_{11}&x_{12}&\cdots&x_{1n}\\ x_{21}&x_{22}&\cdots&x_{2n}\\ \vdots&\vdots&\ddots&\vdots \\ x_{m1}&x_{m2}&\cdots&x_{mn} \end{bmatrix}$$

Assume the hypothesis for each sample $x_i$ is $\displaystyle h(x_i)=\sum_{j = 1}^n a_j x_{ij}=x_i^T a$.
Here,

$$a = \begin{bmatrix} a_{1}\\ a_{2}\\ \vdots \\ a_{n} \end{bmatrix}$$

is the unknown vector we need to solve for.

With the target values collected in a vector $y = [y_1, y_2, \dots, y_m]^T$, the residual vector is:

$$Xa - y = \begin{bmatrix} h(x_1)-y_1\\ h(x_2)-y_2\\ \vdots \\ h(x_m)-y_m \end{bmatrix}$$

Now the objective function is $\underset{a}{\min} f(a)=\frac{1}{2}(Xa-y)^T(Xa-y)$.
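As a minimal MATLAB sketch (assuming a data matrix X and a target vector y are already defined in the workspace), the objective and the gradient that the derivation below arrives at can be written as:

f     = @(a) 0.5*(X*a - y)'*(X*a - y);   % objective 1/2 (Xa - y)'(Xa - y)
gradf = @(a) X'*(X*a - y);               % its gradient, X'Xa - X'y (derived below)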

Before the derivation, I would like to introduce some facts:
(1) $tr(AB) = tr(BA)$
(2) $tr(ABC) = tr(BCA) = tr(CAB)$
(3) $tr(A) = tr(A^T)$
(4) If $a \in \mathbb{R}$, then $tr(a) = a$
(5) $\nabla_A tr(AB) = B^T$
(6) $\nabla_A tr(ABA^TC) = CAB + C^TAB^T$
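Facts (1)-(3) are easy to sanity-check numerically on random matrices (a throwaway check, not part of the derivation):

A = randn(3,4); B = randn(4,3); C = randn(3,3);
abs(trace(A*B)   - trace(B*A))      % fact (1): zero up to round-off
abs(trace(A*B*C) - trace(B*C*A))    % fact (2): zero up to round-off
abs(trace(C)     - trace(C'))       % fact (3): exactly zero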

To obtain the critical points of $f(a)$, we take the derivative of $f(a)$ with respect to $a$ and set it to zero.

Setting $\nabla_a f(a) = 0$:

$$\begin{aligned}
\nabla_a f(a) &= \nabla_a \tfrac{1}{2}(Xa-y)^T(Xa-y)\\
&=\tfrac{1}{2}\nabla_a \left(a^TX^TXa-a^TX^Ty-y^TXa+y^Ty\right)\\
&=\tfrac{1}{2}\nabla_a\, tr\!\left(a^TX^TXa-a^TX^Ty-y^TXa+y^Ty\right) \quad\text{(the trace of a scalar is the scalar itself, fact (4))}\\
&=\tfrac{1}{2}\left(\nabla_a tr(a^TX^TXa)-\nabla_a tr(a^TX^Ty) -\nabla_a tr(y^TXa)+\nabla_a tr(y^Ty)\right)\\
&=\tfrac{1}{2}\left(\nabla_a tr(a^TX^TXa)-\nabla_a tr(y^TXa)-\nabla_a tr(y^TXa)+\nabla_a tr(y^Ty)\right) \quad\text{(fact (3))}\\
&=\tfrac{1}{2}\left(\nabla_a tr(a^TX^TXa)-2X^Ty\right) \quad\text{(fact (5), and } \nabla_a tr(y^Ty)=0\text{)}\\
&= \tfrac{1}{2}\left(\nabla_a tr(aa^TX^TX)-2X^Ty\right) \quad\text{(fact (2))}\\
&= \tfrac{1}{2}\left(\nabla_a tr(aIa^TX^TX)-2X^Ty\right)\\
&= X^TXa-X^Ty = 0 \quad\text{(fact (6) with } B=I,\ C=X^TX\text{)}
\end{aligned}$$

We can then easily solve for $a$ (assuming $X^TX$ is invertible):
$$a=(X^TX)^{-1}X^Ty$$
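As a minimal sketch on made-up synthetic data (all names below are hypothetical), the closed-form solution can be cross-checked against MATLAB's built-in least-squares solver:

m = 50;
X = [ones(m,1) randn(m,2)];        % m samples, n = 3 features (including an intercept column)
a_true = [2; -1; 0.5];
y = X*a_true + 0.01*randn(m,1);    % noisy targets
a_normal = (X'*X) \ (X'*y);        % normal equation; backslash avoids forming the explicit inverse
a_ls = X \ y;                      % MATLAB's built-in least squares, for comparison
disp([a_true a_normal a_ls])       % the three columns should be nearly identical

Instead of the closed-form solution, gradient descent can also be applied directly. The script below (by James T. Allison, per its header comments) demonstrates the method on a two-variable quadratic $f(x_1,x_2) = x_1^2 + x_1x_2 + 3x_2^2$, first with a contour plot and then, in a second version, with a 3-D surface.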

function [xopt,fopt,niter,gnorm,dx] = grad_descent(varargin)
% grad_descent.m demonstrates how the gradient descent method can be used
% to solve a simple unconstrained optimization problem. Taking large step
% sizes can lead to algorithm instability. The variable alpha below
% specifies the fixed step size. Increasing alpha above 0.32 results in
% instability of the algorithm. An alternative approach would involve a
% variable step size determined through line search.
%
% This example was used originally for an optimization demonstration in ME
% 149, Engineering System Design Optimization, a graduate course taught at
% Tufts University in the Mechanical Engineering Department. A
% corresponding video is available at:
%
% http://www.youtube.com/watch?v=cY1YGQQbrpQ
%
% Author: James T. Allison, Assistant Professor, University of Illinois at
% Urbana-Champaign
% Date: 3/4/12

if nargin==0
    % define starting point
    x0 = [3 3]';
elseif nargin==1
    % if a single input argument is provided, it is a user-defined starting
    % point.
    x0 = varargin{1};
else
    error('Incorrect number of input arguments.')
end

% termination tolerance
tol = 1e-6;

% maximum number of allowed iterations
maxiter = 1000;

% minimum allowed perturbation
dxmin = 1e-6;

% step size (0.33 causes instability, 0.2 quite accurate)
alpha = 0.1;

% initialize gradient norm, optimization vector, iteration counter, perturbation
gnorm = inf; x = x0; niter = 0; dx = inf;

% define the objective function:
f = @(x1,x2) x1.^2 + x1.*x2 + 3*x2.^2;

% plot objective function contours for visualization:
figure(1); clf; ezcontour(f,[-5 5 -5 5]); axis equal; hold on

% redefine objective function syntax for use with optimization:
f2 = @(x) f(x(1),x(2));

% gradient descent algorithm:
while and(gnorm>=tol, and(niter <= maxiter, dx >= dxmin))
    % calculate gradient:
    g = grad(x);
    gnorm = norm(g);
    % take step:
    xnew = x - alpha*g;
    % check step
    if ~isfinite(xnew)
        display(['Number of iterations: ' num2str(niter)])
        error('x is inf or NaN')
    end
    % plot current point
    plot([x(1) xnew(1)],[x(2) xnew(2)],'ko-')
    refresh
    % update termination metrics
    niter = niter + 1;
    dx = norm(xnew-x);
    x = xnew;
end
xopt = x;
fopt = f2(xopt);
niter = niter - 1;

% define the gradient of the objective
function g = grad(x)
g = [2*x(1) + x(2)
     x(1) + 6*x(2)];
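Saved as grad_descent.m, the function can be called with its default starting point or with a user-supplied one, for example:

[xopt, fopt, niter] = grad_descent();          % default starting point [3 3]'
[xopt, fopt, niter] = grad_descent([-4 2]');   % user-defined starting point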

The second version replaces the contour plot with a 3-D surface (meshc over a meshgrid) and additionally traces the iterates on the surface with plot3:

function [xopt,fopt,niter,gnorm,dx] = grad_descent(varargin)

if nargin==0
    % define starting point
    x0 = [3 3]';
elseif nargin==1
    % if a single input argument is provided, it is a user-defined starting
    % point.
    x0 = varargin{1};
else
    error('Incorrect number of input arguments.')
end

% termination tolerance
tol = 1e-6;

% maximum number of allowed iterations
maxiter = 1000;

% minimum allowed perturbation
dxmin = 1e-6;

% step size (0.33 causes instability, 0.2 quite accurate)
alpha = 0.1;

% initialize gradient norm, optimization vector, iteration counter, perturbation
gnorm = inf; x = x0; niter = 0; dx = inf;

% define the objective function:
f = @(x1,x2) x1.^2 + x1.*x2 + 3*x2.^2;
m = -5:0.1:5;
[X,Y] = meshgrid(m);
Z = f(X,Y);

% plot the objective function surface for visualization:
figure(1); clf; meshc(X,Y,Z); hold on

% redefine objective function syntax for use with optimization:
f2 = @(x) f(x(1),x(2));

% gradient descent algorithm:
while and(gnorm>=tol, and(niter <= maxiter, dx >= dxmin))
    % calculate gradient:
    g = grad(x);
    gnorm = norm(g);
    % take step:
    xnew = x - alpha*g;
    % check step
    if ~isfinite(xnew)
        display(['Number of iterations: ' num2str(niter)])
        error('x is inf or NaN')
    end
    % plot current point
    plot([x(1) xnew(1)],[x(2) xnew(2)],'ko-')
    plot3([x(1) xnew(1)],[x(2) xnew(2)],[f(x(1),x(2)) f(xnew(1),xnew(2))],'r+-');
    refresh
    % update termination metrics
    niter = niter + 1;
    dx = norm(xnew-x);
    x = xnew;
end
xopt = x;
fopt = f2(xopt);
niter = niter - 1;

% define the gradient of the objective
function g = grad(x)
g = [2*x(1) + x(2)
     x(1) + 6*x(2)];
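Both versions should converge to the analytic minimizer $(0, 0)$ of this positive-definite quadratic, which can be confirmed independently:

g_at_origin = [2*0 + 0; 0 + 6*0]   % the gradient vanishes at (0, 0)
H = [2 1; 1 6];                    % Hessian of x1^2 + x1*x2 + 3*x2^2
eig(H)                             % both eigenvalues positive, so (0, 0) is a minimum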
