[Translated from: Gradient Descent Optimization With Nadam From Scratch]

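[Note: I am a big fan of Jason Brownlee PhD's articles, so in my spare time I translate them and work through the examples. This is a record of that work, and I hope it helps anyone who needs it!]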







This tutorial is divided into three parts; they are:









x(t) = x(t-1) - step * f'(x(t-1))
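As a minimal sketch of the update rule above, consider the one-dimensional function f(x) = x^2, whose derivative is f'(x) = 2x. The starting point of 1.0, the step size of 0.1, and the helper name `gradient_descent` are assumptions chosen purely for illustration.

```python
# minimal sketch of plain gradient descent on f(x) = x^2 (illustrative values)
def f_prime(x):
    # derivative of f(x) = x^2
    return 2.0 * x

def gradient_descent(x0, step_size, n_iter):
    # repeatedly apply x(t) = x(t-1) - step_size * f'(x(t-1))
    x = x0
    for _ in range(n_iter):
        x = x - step_size * f_prime(x)
    return x

x_final = gradient_descent(x0=1.0, step_size=0.1, n_iter=50)
print(x_final)
```

With these values each iteration multiplies x by 0.8, so the point shrinks steadily toward the minimum at 0.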





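Nesterov-accelerated Adaptive Moment Estimation, or Nadam, is an extension of the Adaptive Moment Estimation (Adam) optimization algorithm that adds Nesterov's Accelerated Gradient (NAG), also called Nesterov momentum, which is an improved type of momentum. More broadly, Nadam is an extension of the gradient descent optimization algorithm. The algorithm was described by Timothy Dozat in the 2016 paper "Incorporating Nesterov Momentum into Adam", although a version of the paper was written in 2015 as a Stanford project report with the same name.

Momentum adds an exponentially decaying moving average of the gradient (the first moment) to the gradient descent algorithm. This smooths out noisy objective functions and improves convergence. Adam is an extension of gradient descent that adds both the first and second moments of the gradient and automatically adapts a learning rate for each parameter being optimized. NAG is an extension of momentum in which the update is performed using the gradient at the projected update to the parameter rather than at the actual current value. In some cases this slows the search down as the optimum is approached, rather than overshooting it.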
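To illustrate the smoothing effect of momentum, the following sketch computes an exponentially decaying moving average of a noisy gradient signal. The constant "true" gradient of 1.0, the noise level, and the decay factor of 0.9 are assumptions made for this example only.

```python
# sketch: an exponentially decaying moving average smooths a noisy gradient
from numpy import std
from numpy.random import default_rng

rng = default_rng(1)
# noisy observations of a "true" gradient of 1.0 (illustrative)
grads = 1.0 + rng.normal(0.0, 0.5, size=1000)

mu = 0.9  # decay factor (assumed value)
m = 0.0
averaged = []
for g in grads:
    # m(t) = mu * m(t-1) + (1 - mu) * g(t)
    m = mu * m + (1.0 - mu) * g
    averaged.append(m)

# skip a warm-up period, then compare the spread of raw and averaged signals
print('raw std: %.3f' % std(grads[100:]))
print('smoothed std: %.3f' % std(averaged[100:]))
```

The averaged signal should vary noticeably less than the raw gradients, which is the sense in which momentum dampens a noisy objective.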


m = 0
n = 0
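The algorithm runs iteratively over time t, starting at t = 1, and each iteration calculates a new set of parameter values x, e.g. going from x(t-1) to x(t). The algorithm is perhaps easiest to understand by focusing on the update of a single parameter, which generalizes to updating all parameters via vector operations. First, the gradient (partial derivatives) is calculated for the current time step.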

g(t)= f'(x(t-1))

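Next, the first moment is updated using the gradient and a hyperparameter "mu".

m(t) = mu * m(t-1) + (1 - mu) * g(t)

Then the second moment is updated using the "nu" hyperparameter.

n(t) = nu * n(t-1) + (1 - nu) * g(t)^2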


mhat = (mu * m(t) / (1 - mu)) + ((1 - mu) * g(t) / (1 - mu))


nhat = nu * n(t) / (1 - nu)


x(t) = x(t-1) - alpha / (sqrt(nhat) + eps) * mhat
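The equations above can be traced for a single parameter. The sketch below applies one Nadam update to f(x) = x^2, starting from x = 1.0 with m = n = 0 and using alpha = 0.02, mu = 0.8 and nu = 0.999. The starting point and the helper name `nadam_step` are assumptions made for illustration.

```python
# sketch: a single Nadam update for one parameter, following the equations above
from math import sqrt

def nadam_step(x, m, n, alpha, mu, nu, eps=1e-8):
    # gradient of f(x) = x^2 at the current point
    g = 2.0 * x
    # first and second moment updates
    m = mu * m + (1.0 - mu) * g
    n = nu * n + (1.0 - nu) * g**2
    # Nesterov-style first moment estimate and scaled second moment
    mhat = (mu * m / (1.0 - mu)) + ((1.0 - mu) * g / (1.0 - mu))
    nhat = nu * n / (1.0 - nu)
    # parameter update
    x = x - alpha / (sqrt(nhat) + eps) * mhat
    return x, m, n

x, m, n = nadam_step(x=1.0, m=0.0, n=0.0, alpha=0.02, mu=0.8, nu=0.999)
print(x, m, n)
```

Working through the arithmetic: g = 2.0, so m = 0.4 and n = 0.004. That gives mhat = 1.6 + 2.0 = 3.6 and nhat = 3.996, and the step is roughly 0.02 * 3.6 / 2.0 = 0.036.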









# objective function
def objective(x, y):
    return x**2.0 + y**2.0


# 3d plot of the test function
from numpy import arange
from numpy import meshgrid
from matplotlib import pyplot

# objective function
def objective(x, y):
    return x**2.0 + y**2.0

# define range for input
r_min, r_max = -1.0, 1.0
# sample input range uniformly at 0.1 increments
xaxis = arange(r_min, r_max, 0.1)
yaxis = arange(r_min, r_max, 0.1)
# create a mesh from the axis
x, y = meshgrid(xaxis, yaxis)
# compute targets
results = objective(x, y)
# create a surface plot with the jet color scheme
figure = pyplot.figure()
# add_subplot replaces figure.gca(projection='3d'), which newer Matplotlib versions reject
axis = figure.add_subplot(projection='3d')
axis.plot_surface(x, y, results, cmap='jet')
# show the plot
pyplot.show()

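Running the example creates a three-dimensional surface plot of the objective function. We can see the familiar bowl shape with the global minimum at f(0, 0) = 0.

We can also create a two-dimensional plot of the function. This will be helpful later when we want to plot the progress of the search. The example below creates a contour plot of the objective function.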

# contour plot of the test function
from numpy import asarray
from numpy import arange
from numpy import meshgrid
from matplotlib import pyplot

# objective function
def objective(x, y):
    return x**2.0 + y**2.0

# define range for input
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
# sample input range uniformly at 0.1 increments
xaxis = arange(bounds[0,0], bounds[0,1], 0.1)
yaxis = arange(bounds[1,0], bounds[1,1], 0.1)
# create a mesh from the axis
x, y = meshgrid(xaxis, yaxis)
# compute targets
results = objective(x, y)
# create a filled contour plot with 50 levels and jet color scheme
pyplot.contourf(x, y, results, levels=50, cmap='jet')
# show the plot
pyplot.show()

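Running the example creates a two-dimensional contour plot of the objective function. We can see the bowl shape compressed to contours shown with a color gradient. We will use this plot to show the specific points explored during the search.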




The derivative of x^2 is x * 2 in each dimension.

f(x)= x ^ 2
f'(x)= x * 2


# derivative of objective function
def derivative(x, y):
    return asarray([x * 2.0, y * 2.0])
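One way to gain confidence in an analytical derivative is to compare it with a central finite-difference approximation. The sketch below does this at an arbitrary point. The helper `numerical_derivative`, the test point, and the step size h are assumptions for illustration.

```python
# sketch: check the analytical derivative against a finite-difference estimate
from numpy import asarray, allclose

def objective(x, y):
    return x**2.0 + y**2.0

def derivative(x, y):
    return asarray([x * 2.0, y * 2.0])

def numerical_derivative(x, y, h=1e-6):
    # central differences in each dimension (hypothetical helper)
    dx = (objective(x + h, y) - objective(x - h, y)) / (2.0 * h)
    dy = (objective(x, y + h) - objective(x, y - h)) / (2.0 * h)
    return asarray([dx, dy])

point = (0.3, -0.7)  # arbitrary test point
analytical = derivative(*point)
numerical = numerical_derivative(*point)
print(analytical, numerical)
print(allclose(analytical, numerical, atol=1e-6))
```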


# generate an initial point
x = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
score = objective(x[0], x[1])


# initialize decaying moving averages
m = [0.0 for _ in range(bounds.shape[0])]
n = [0.0 for _ in range(bounds.shape[0])]

We then run a fixed number of iterations of the algorithm, defined by the "n_iter" hyperparameter.

# run iterations of gradient descent
for t in range(n_iter):
    ...


# calculate gradient g(t)
g = derivative(x[0], x[1])

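Next, we need to perform the Nadam update calculations. For readability, we will perform these calculations one variable at a time using an imperative programming style. In practice, I recommend using NumPy vector operations for efficiency.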

# build a solution one variable at a time
for i in range(x.shape[0]):
    ...


# m(t) = mu * m(t-1) + (1 - mu) * g(t)
m[i] = mu * m[i] + (1.0 - mu) * g[i]


# n(t) = nu * n(t-1) + (1 - nu) * g(t)^2
n[i] = nu * n[i] + (1.0 - nu) * g[i]**2


# mhat = (mu * m(t) / (1 - mu)) + ((1 - mu) * g(t) / (1 - mu))
mhat = (mu * m[i] / (1.0 - mu)) + ((1 - mu) * g[i] / (1.0 - mu))


# nhat = nu * n(t) / (1 - nu)
nhat = nu * n[i] / (1.0 - nu)


# x(t) = x(t-1) - alpha / (sqrt(nhat) + eps) * mhat
x[i] = x[i] - alpha / (sqrt(nhat) + eps) * mhat
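The per-variable steps can also be written with NumPy vector operations, as recommended for efficiency. The following is one possible sketch, not the original author's code: `nadam_vectorized` is a hypothetical name, and the starting point is passed in explicitly (an illustrative value) rather than sampled at random.

```python
# sketch: vectorized Nadam using NumPy arrays instead of a per-variable loop
from numpy import asarray, sqrt, zeros_like

def objective(x, y):
    return x**2.0 + y**2.0

def derivative(x, y):
    return asarray([x * 2.0, y * 2.0])

def nadam_vectorized(derivative, x0, n_iter, alpha, mu, nu, eps=1e-8):
    x = asarray(x0, dtype=float).copy()
    # moving averages of the gradient and the squared gradient
    m = zeros_like(x)
    n = zeros_like(x)
    for _ in range(n_iter):
        g = derivative(x[0], x[1])
        # all variables are updated at once, element-wise
        m = mu * m + (1.0 - mu) * g
        n = nu * n + (1.0 - nu) * g**2
        mhat = (mu * m / (1.0 - mu)) + ((1.0 - mu) * g / (1.0 - mu))
        nhat = nu * n / (1.0 - nu)
        x = x - alpha / (sqrt(nhat) + eps) * mhat
    return x

start = [-0.166, 0.441]  # illustrative starting point
best = nadam_vectorized(derivative, start, n_iter=50, alpha=0.02, mu=0.8, nu=0.999)
print(best, objective(best[0], best[1]))
```

Because every operation is element-wise, this should produce the same trajectory as the loop-based version for the same starting point.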


# evaluate candidate point
score = objective(x[0], x[1])
# report progress
print('>%d f(%s) = %.5f' % (t, x, score))


# gradient descent algorithm with nadam
def nadam(objective, derivative, bounds, n_iter, alpha, mu, nu, eps=1e-8):
    # generate an initial point
    x = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
    score = objective(x[0], x[1])
    # initialize decaying moving averages
    m = [0.0 for _ in range(bounds.shape[0])]
    n = [0.0 for _ in range(bounds.shape[0])]
    # run the gradient descent
    for t in range(n_iter):
        # calculate gradient g(t)
        g = derivative(x[0], x[1])
        # build a solution one variable at a time
        for i in range(bounds.shape[0]):
            # m(t) = mu * m(t-1) + (1 - mu) * g(t)
            m[i] = mu * m[i] + (1.0 - mu) * g[i]
            # n(t) = nu * n(t-1) + (1 - nu) * g(t)^2
            n[i] = nu * n[i] + (1.0 - nu) * g[i]**2
            # mhat = (mu * m(t) / (1 - mu)) + ((1 - mu) * g(t) / (1 - mu))
            mhat = (mu * m[i] / (1.0 - mu)) + ((1 - mu) * g[i] / (1.0 - mu))
            # nhat = nu * n(t) / (1 - nu)
            nhat = nu * n[i] / (1.0 - nu)
            # x(t) = x(t-1) - alpha / (sqrt(nhat) + eps) * mhat
            x[i] = x[i] - alpha / (sqrt(nhat) + eps) * mhat
        # evaluate candidate point
        score = objective(x[0], x[1])
        # report progress
        print('>%d f(%s) = %.5f' % (t, x, score))
    return [x, score]


# seed the pseudo random number generator
seed(1)
# define range for input
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
# define the total iterations
n_iter = 50
# steps size
alpha = 0.02
# factor for average gradient
mu = 0.8
# factor for average squared gradient
nu = 0.999
# perform the gradient descent search with nadam
best, score = nadam(objective, derivative, bounds, n_iter, alpha, mu, nu)


# summarize the result
print('f(%s) = %f' % (best, score))


# gradient descent optimization with nadam for a two-dimensional test function
from math import sqrt
from numpy import asarray
from numpy.random import rand
from numpy.random import seed

# objective function
def objective(x, y):
    return x**2.0 + y**2.0

# derivative of objective function
def derivative(x, y):
    return asarray([x * 2.0, y * 2.0])

# gradient descent algorithm with nadam
def nadam(objective, derivative, bounds, n_iter, alpha, mu, nu, eps=1e-8):
    # generate an initial point
    x = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
    score = objective(x[0], x[1])
    # initialize decaying moving averages
    m = [0.0 for _ in range(bounds.shape[0])]
    n = [0.0 for _ in range(bounds.shape[0])]
    # run the gradient descent
    for t in range(n_iter):
        # calculate gradient g(t)
        g = derivative(x[0], x[1])
        # build a solution one variable at a time
        for i in range(bounds.shape[0]):
            # m(t) = mu * m(t-1) + (1 - mu) * g(t)
            m[i] = mu * m[i] + (1.0 - mu) * g[i]
            # n(t) = nu * n(t-1) + (1 - nu) * g(t)^2
            n[i] = nu * n[i] + (1.0 - nu) * g[i]**2
            # mhat = (mu * m(t) / (1 - mu)) + ((1 - mu) * g(t) / (1 - mu))
            mhat = (mu * m[i] / (1.0 - mu)) + ((1 - mu) * g[i] / (1.0 - mu))
            # nhat = nu * n(t) / (1 - nu)
            nhat = nu * n[i] / (1.0 - nu)
            # x(t) = x(t-1) - alpha / (sqrt(nhat) + eps) * mhat
            x[i] = x[i] - alpha / (sqrt(nhat) + eps) * mhat
        # evaluate candidate point
        score = objective(x[0], x[1])
        # report progress
        print('>%d f(%s) = %.5f' % (t, x, score))
    return [x, score]

# seed the pseudo random number generator
seed(1)
# define range for input
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
# define the total iterations
n_iter = 50
# steps size
alpha = 0.02
# factor for average gradient
mu = 0.8
# factor for average squared gradient
nu = 0.999
# perform the gradient descent search with nadam
best, score = nadam(objective, derivative, bounds, n_iter, alpha, mu, nu)
# summarize the result
print('f(%s) = %f' % (best, score))


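Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.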
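One way to follow this advice is to repeat the search with several different seeds and summarize the final scores. The sketch below uses a quiet variant of the search with no progress printing. The function name `nadam_quiet`, the seed values, and the number of repeats are assumptions for illustration.

```python
# sketch: repeat the Nadam search with different seeds and average the final score
from math import sqrt
from numpy import asarray, mean, std
from numpy.random import rand, seed

def objective(x, y):
    return x**2.0 + y**2.0

def derivative(x, y):
    return asarray([x * 2.0, y * 2.0])

def nadam_quiet(bounds, n_iter, alpha, mu, nu, eps=1e-8):
    # same update rule as the tutorial's nadam(), without progress output
    x = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
    m = [0.0 for _ in range(bounds.shape[0])]
    n = [0.0 for _ in range(bounds.shape[0])]
    for _ in range(n_iter):
        g = derivative(x[0], x[1])
        for i in range(bounds.shape[0]):
            m[i] = mu * m[i] + (1.0 - mu) * g[i]
            n[i] = nu * n[i] + (1.0 - nu) * g[i]**2
            mhat = (mu * m[i] / (1.0 - mu)) + ((1.0 - mu) * g[i] / (1.0 - mu))
            nhat = nu * n[i] / (1.0 - nu)
            x[i] = x[i] - alpha / (sqrt(nhat) + eps) * mhat
    # return only the final objective value
    return objective(x[0], x[1])

bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
scores = []
for s in range(10):
    # a different seed gives a different random starting point
    seed(s)
    scores.append(nadam_quiet(bounds, n_iter=50, alpha=0.02, mu=0.8, nu=0.999))
print('mean=%.6f std=%.6f' % (mean(scores), std(scores)))
```

Reporting the mean and standard deviation over several runs gives a fairer picture of the configuration than any single run.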


>40 f([ 5.07445337e-05 -3.32910019e-03]) = 0.00001
>41 f([-1.84325171e-05 -3.00939427e-03]) = 0.00001
>42 f([-6.78814472e-05 -2.69839367e-03]) = 0.00001
>43 f([-9.88339249e-05 -2.40042096e-03]) = 0.00001
>44 f([-0.00011368 -0.00211861]) = 0.00000
>45 f([-0.00011547 -0.00185511]) = 0.00000
>46 f([-0.0001075 -0.00161122]) = 0.00000
>47 f([-9.29922627e-05 -1.38760991e-03]) = 0.00000
>48 f([-7.48258406e-05 -1.18436586e-03]) = 0.00000
>49 f([-5.54299505e-05 -1.00116899e-03]) = 0.00000
f([-5.54299505e-05 -1.00116899e-03]) = 0.000001



# gradient descent algorithm with nadam
def nadam(objective, derivative, bounds, n_iter, alpha, mu, nu, eps=1e-8):
    solutions = list()
    # generate an initial point
    x = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
    score = objective(x[0], x[1])
    # initialize decaying moving averages
    m = [0.0 for _ in range(bounds.shape[0])]
    n = [0.0 for _ in range(bounds.shape[0])]
    # run the gradient descent
    for t in range(n_iter):
        # calculate gradient g(t)
        g = derivative(x[0], x[1])
        # build a solution one variable at a time
        for i in range(bounds.shape[0]):
            # m(t) = mu * m(t-1) + (1 - mu) * g(t)
            m[i] = mu * m[i] + (1.0 - mu) * g[i]
            # n(t) = nu * n(t-1) + (1 - nu) * g(t)^2
            n[i] = nu * n[i] + (1.0 - nu) * g[i]**2
            # mhat = (mu * m(t) / (1 - mu)) + ((1 - mu) * g(t) / (1 - mu))
            mhat = (mu * m[i] / (1.0 - mu)) + ((1 - mu) * g[i] / (1.0 - mu))
            # nhat = nu * n(t) / (1 - nu)
            nhat = nu * n[i] / (1.0 - nu)
            # x(t) = x(t-1) - alpha / (sqrt(nhat) + eps) * mhat
            x[i] = x[i] - alpha / (sqrt(nhat) + eps) * mhat
        # evaluate candidate point
        score = objective(x[0], x[1])
        # store solution
        solutions.append(x.copy())
        # report progress
        print('>%d f(%s) = %.5f' % (t, x, score))
    return solutions


# seed the pseudo random number generator
seed(1)
# define range for input
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
# define the total iterations
n_iter = 50
# steps size
alpha = 0.02
# factor for average gradient
mu = 0.8
# factor for average squared gradient
nu = 0.999
# perform the gradient descent search with nadam
solutions = nadam(objective, derivative, bounds, n_iter, alpha, mu, nu)


# sample input range uniformly at 0.1 increments
xaxis = arange(bounds[0,0], bounds[0,1], 0.1)
yaxis = arange(bounds[1,0], bounds[1,1], 0.1)
# create a mesh from the axis
x, y = meshgrid(xaxis, yaxis)
# compute targets
results = objective(x, y)
# create a filled contour plot with 50 levels and jet color scheme
pyplot.contourf(x, y, results, levels=50, cmap='jet')


# plot the solutions as white dots connected by a line
solutions = asarray(solutions)
pyplot.plot(solutions[:, 0], solutions[:, 1], '.-', color='w')


# example of plotting the nadam search on a contour plot of the test function
from math import sqrt
from numpy import asarray
from numpy import arange
from numpy.random import rand
from numpy.random import seed
from numpy import meshgrid
from matplotlib import pyplot

# objective function
def objective(x, y):
    return x**2.0 + y**2.0

# derivative of objective function
def derivative(x, y):
    return asarray([x * 2.0, y * 2.0])

# gradient descent algorithm with nadam
def nadam(objective, derivative, bounds, n_iter, alpha, mu, nu, eps=1e-8):
    solutions = list()
    # generate an initial point
    x = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
    score = objective(x[0], x[1])
    # initialize decaying moving averages
    m = [0.0 for _ in range(bounds.shape[0])]
    n = [0.0 for _ in range(bounds.shape[0])]
    # run the gradient descent
    for t in range(n_iter):
        # calculate gradient g(t)
        g = derivative(x[0], x[1])
        # build a solution one variable at a time
        for i in range(bounds.shape[0]):
            # m(t) = mu * m(t-1) + (1 - mu) * g(t)
            m[i] = mu * m[i] + (1.0 - mu) * g[i]
            # n(t) = nu * n(t-1) + (1 - nu) * g(t)^2
            n[i] = nu * n[i] + (1.0 - nu) * g[i]**2
            # mhat = (mu * m(t) / (1 - mu)) + ((1 - mu) * g(t) / (1 - mu))
            mhat = (mu * m[i] / (1.0 - mu)) + ((1 - mu) * g[i] / (1.0 - mu))
            # nhat = nu * n(t) / (1 - nu)
            nhat = nu * n[i] / (1.0 - nu)
            # x(t) = x(t-1) - alpha / (sqrt(nhat) + eps) * mhat
            x[i] = x[i] - alpha / (sqrt(nhat) + eps) * mhat
        # evaluate candidate point
        score = objective(x[0], x[1])
        # store solution
        solutions.append(x.copy())
        # report progress
        print('>%d f(%s) = %.5f' % (t, x, score))
    return solutions

# seed the pseudo random number generator
seed(1)
# define range for input
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
# define the total iterations
n_iter = 50
# steps size
alpha = 0.02
# factor for average gradient
mu = 0.8
# factor for average squared gradient
nu = 0.999
# perform the gradient descent search with nadam
solutions = nadam(objective, derivative, bounds, n_iter, alpha, mu, nu)
# sample input range uniformly at 0.1 increments
xaxis = arange(bounds[0,0], bounds[0,1], 0.1)
yaxis = arange(bounds[1,0], bounds[1,1], 0.1)
# create a mesh from the axis
x, y = meshgrid(xaxis, yaxis)
# compute targets
results = objective(x, y)
# create a filled contour plot with 50 levels and jet color scheme
pyplot.contourf(x, y, results, levels=50, cmap='jet')
# plot the solutions as white dots connected by a line
solutions = asarray(solutions)
pyplot.plot(solutions[:, 0], solutions[:, 1], '.-', color='w')
# show the plot
pyplot.show()



