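[Translated from: Gradient Descent With Adadelta from Scratch]

[Note: I really enjoy the articles by Jason Brownlee PhD, so in my spare time I translate them and work through the examples. This post is a record of that practice, and I hope it helps anyone who needs it.]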

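Gradient descent is an optimization algorithm that follows the negative gradient of an objective function in order to locate the minimum of the function.

A limitation of gradient descent is that it uses the same step size (learning rate) for every input variable. AdaGrad and RMSProp are extensions of gradient descent that add an adaptive learning rate for each parameter of the objective function.

Adadelta can be viewed as a further extension of gradient descent that builds on AdaGrad and RMSProp and changes the calculation of the custom step size so that the units are consistent, which in turn removes the need for an initial learning rate hyperparameter.

In this tutorial, you will discover how to develop gradient descent with the Adadelta optimization algorithm from scratch. After completing this tutorial, you will know:

Gradient descent is an optimization algorithm that uses the gradient of the objective function to navigate the search space.
Gradient descent can be updated to use an automatically adaptive step size for each input variable by using a decaying average of partial derivatives, an approach called Adadelta.
How to implement the Adadelta optimization algorithm from scratch, apply it to an objective function, and evaluate the results.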

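Tutorial Overview

This tutorial is divided into three parts; they are:

Gradient Descent
Adadelta Algorithm
Gradient Descent With Adadelta
    Two-Dimensional Test Problem
    Gradient Descent Optimization With Adadelta
    Visualization of Adadelta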

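Gradient Descent

Gradient descent is an optimization algorithm. It is technically referred to as a first-order optimization algorithm because it explicitly makes use of the first-order derivative of the objective function.

The first-order derivative, or simply the "derivative," is the rate of change or slope of the objective function at a specific point, that is, for a specific input. If the objective function takes multiple input variables, it is referred to as a multivariate function and the input variables can be thought of as a vector. In turn, the derivative of a multivariate objective function may also be taken as a vector and is generally referred to as the gradient.

Gradient: First-order derivative of a multivariate objective function.

For a specific input, the derivative or gradient points in the direction of the steepest ascent of the objective function. Gradient descent refers to a minimization optimization algorithm that follows the negative of the gradient downhill to locate the minimum of the function.

The gradient descent algorithm requires an objective function that is being optimized and the derivative function for that objective function. The objective function f() returns a score for a given set of inputs, and the derivative function f'() gives the derivative of the objective function for a given set of inputs.

The gradient descent algorithm also requires a starting point (x) in the problem, such as a randomly selected point in the input space.

Assuming we are minimizing the objective function, the derivative is then calculated and a step is taken in the input space that is expected to result in a downhill movement in the objective function. A downhill movement is made by first calculating how far to move in the input space, calculated as the step size (called alpha or the learning rate) multiplied by the gradient. This value is then subtracted from the current point, ensuring we move against the gradient, or down the objective function.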

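x = x - step_size * f'(x)

The steeper the objective function at a given point, the larger the magnitude of the gradient and, in turn, the larger the step taken in the search space. The size of the step taken is scaled using a step size hyperparameter.

Step Size (alpha): Hyperparameter that controls how far to move in the search space against the gradient on each iteration of the algorithm.

If the step size is too small, the movement in the search space will be small and the search will take a long time. If the step size is too large, the search may bounce around the search space and skip over the optima.

Now that we are familiar with the gradient descent optimization algorithm, let's take a look at Adadelta.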
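
As a minimal sketch of the update rule described above, the loop below applies plain gradient descent with one fixed step size to the one-dimensional function f(x) = x^2. The function name gradient_descent, the starting point 1.0, the step size 0.1, and the 50 iterations are illustrative assumptions, not values from the original article.

```python
# minimal sketch of plain gradient descent with one fixed step size
# (function names and hyperparameter values are illustrative assumptions)

def objective(x):
    # one-dimensional test function f(x) = x^2
    return x ** 2.0

def derivative(x):
    # derivative f'(x) = 2 * x
    return 2.0 * x

def gradient_descent(derivative, x, step_size, n_iter):
    # repeatedly step against the gradient: x = x - step_size * f'(x)
    for _ in range(n_iter):
        x = x - step_size * derivative(x)
    return x

x_final = gradient_descent(derivative, 1.0, 0.1, 50)
print('x = %.6f, f(x) = %.6f' % (x_final, objective(x_final)))
```

With this step size each update multiplies x by 0.8, so the point shrinks steadily toward the minimum at 0.0. Every variable would share the same step size here, which is the limitation that the adaptive methods below address.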

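Adadelta Algorithm

Adadelta (or "ADADELTA") is an extension of the gradient descent optimization algorithm. The algorithm was described in the 2012 paper by Matthew Zeiler titled "ADADELTA: An Adaptive Learning Rate Method."

Adadelta is designed to accelerate the optimization process, e.g. decrease the number of function evaluations required to reach the optima, or to improve the capability of the optimization algorithm, e.g. result in a better final result. It is best understood as an extension of the AdaGrad and RMSProp algorithms.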

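AdaGrad is an extension of gradient descent that calculates a step size (learning rate) for each parameter of the objective function each time an update is made. The step size is calculated by first summing the squared partial derivatives for the parameter seen so far during the search, then dividing the initial step size hyperparameter by the square root of that sum.

The calculation of the custom step size for one parameter with AdaGrad is as follows:

cust_step_size(t+1) = step_size / (1e-8 + sqrt(s(t)))

Where cust_step_size(t+1) is the calculated step size for an input variable for a given point during the search, step_size is the initial step size, sqrt() is the square root operation, and s(t) is the sum of the squared partial derivatives for the input variable seen during the search so far (including the current iteration).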
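
The AdaGrad step size formula above can be sketched for a single variable as follows. The helper name adagrad_step_size and the example gradient history are hypothetical and chosen only to make the arithmetic easy to follow.

```python
# sketch of the AdaGrad custom step size for one variable
# (helper name and example values are hypothetical)
from math import sqrt

def adagrad_step_size(step_size, partial_derivatives, eps=1e-8):
    # s(t): sum of the squared partial derivatives seen so far
    s = sum(g ** 2.0 for g in partial_derivatives)
    # divide the initial step size by the square root of the sum
    return step_size / (eps + sqrt(s))

# assumed history of partial derivatives for one variable
history = [3.0, 4.0]
print('%.6f' % adagrad_step_size(0.1, history))
# one more observed gradient makes the sum larger and the step smaller
print('%.6f' % adagrad_step_size(0.1, history + [1.0]))
```

Because the sum can only grow, the custom step size can only shrink over the course of the search, which motivates the decaying average used by RMSProp.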

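RMSProp can be thought of as an extension of AdaGrad in that it uses a decaying average, or moving average, of the squared partial derivatives instead of the sum in the calculation of the step size for each parameter. This is achieved by adding a new hyperparameter "rho" that acts like momentum for the partial derivatives.

The calculation of the decaying moving average of the squared partial derivative for one parameter is as follows:

s(t+1) = (s(t) * rho) + (f'(x(t))^2 * (1.0-rho))

Where s(t+1) is the decaying moving average of the squared partial derivative for one parameter for the current iteration of the algorithm, s(t) is the decaying moving average of the squared partial derivative for the previous iteration, f'(x(t))^2 is the squared partial derivative for the current parameter, and rho is a hyperparameter that typically has a value of 0.9, like momentum.

Adadelta is a further extension of RMSProp designed to improve the convergence of the algorithm and to remove the need for a manually specified initial learning rate.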
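
The decaying moving average used by RMSProp, which Adadelta reuses, can be sketched as a one-line update. The helper name decayed_sq_grad and the constant gradient of 2.0 are assumptions made for illustration.

```python
# sketch of the decaying moving average of the squared partial derivative
# (helper name and the constant gradient are illustrative assumptions)

def decayed_sq_grad(s_prev, grad, rho=0.9):
    # s(t+1) = (s(t) * rho) + (f'(x(t))^2 * (1.0 - rho))
    return (s_prev * rho) + (grad ** 2.0 * (1.0 - rho))

s = 0.0
for t in range(3):
    s = decayed_sq_grad(s, 2.0)
    print('t=%d s=%.3f' % (t, s))
```

With a constant gradient the average drifts toward the squared gradient (4.0 in this sketch) rather than growing without bound the way the AdaGrad sum does.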

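As with RMSProp, a decaying moving average of the squared partial derivative is calculated for each parameter. The key difference is in the calculation of the step size for a parameter, which uses the decaying average of the delta, or change in the parameter. This choice of numerator was made to ensure that both parts of the calculation have the same units.

First, the custom step size is calculated as the square root of the decaying moving average of the squared change divided by the square root of the decaying moving average of the squared partial derivatives.

cust_step_size(t+1) = (ep + sqrt(delta(t))) / (ep + sqrt(s(t)))

Where cust_step_size(t+1) is the custom step size for a parameter for a given update, ep is a hyperparameter that is added to the numerator and denominator to avoid a divide-by-zero error, delta(t) is the decaying moving average of the squared change to the parameter (calculated in the last iteration), and s(t) is the decaying moving average of the squared partial derivative (calculated in the current iteration).

The ep hyperparameter is set to a small value such as 1e-3 or 1e-8. In addition to avoiding a divide-by-zero error, it also helps with the first step of the algorithm, when the decaying moving average of the squared change and the decaying moving average of the squared gradient are zero.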
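
To illustrate why ep matters on the first step, the sketch below evaluates the custom step size when delta(t) is still 0.0. The helper name first_step_size and the gradient value of 2.0 are assumptions for illustration only.

```python
# sketch of the custom step size on the very first iteration
# (helper name and example gradient are hypothetical)
from math import sqrt

def first_step_size(grad, rho, ep):
    # s(t) after one update starting from 0.0
    s = grad ** 2.0 * (1.0 - rho)
    # delta(t) is still 0.0, so only ep remains in the numerator
    return (ep + sqrt(0.0)) / (ep + sqrt(s))

for ep in [1e-3, 1e-8]:
    print('ep=%g step=%.3e' % (ep, first_step_size(2.0, 0.9, ep)))
```

The first step is roughly proportional to ep, so a larger ep gets the search moving sooner. If ep were 0.0 the numerator would be zero, the change would be zero, and delta(t) would never leave zero.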

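Next, the change to the parameter is calculated as the custom step size multiplied by the partial derivative.

change(t+1) = cust_step_size(t+1) * f'(x(t))

Next, the decaying average of the squared change to the parameter is updated.

delta(t+1) = (delta(t) * rho) + (change(t+1)^2 * (1.0-rho))

Where delta(t+1) is the decaying average of the squared change to the variable to be used in the next iteration, change(t+1) was calculated in the step before, and rho is a hyperparameter that acts like momentum and has a value like 0.9.

Finally, the new value for the variable is calculated using the change.

x(t+1) = x(t) - change(t+1)

This process is then repeated for each variable of the objective function, and the entire process is repeated to navigate the search space for a fixed number of algorithm iterations.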
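
The four equations above can be combined into a single update for one variable. This is a minimal sketch under the assumption that the objective is f(x) = x^2 (so the gradient is 2 * x); the function name adadelta_update and the default values are illustrative.

```python
# sketch of one Adadelta update for a single variable
# (function name, defaults, and the test function are assumptions)
from math import sqrt

def adadelta_update(x, grad, s, delta, rho=0.9, ep=1e-3):
    # s(t): decaying average of the squared partial derivative
    s = (s * rho) + (grad ** 2.0 * (1.0 - rho))
    # custom step size from the previous delta and the current s
    step = (ep + sqrt(delta)) / (ep + sqrt(s))
    # change to apply to the variable
    change = step * grad
    # delta(t+1): decaying average of the squared change
    delta = (delta * rho) + (change ** 2.0 * (1.0 - rho))
    # new value of the variable and the updated averages
    return x - change, s, delta

x, s, delta = 1.0, 0.0, 0.0
for t in range(3):
    x, s, delta = adadelta_update(x, 2.0 * x, s, delta)
    print('t=%d x=%.5f' % (t, x))
```

The function returns the updated averages alongside the new value so that the caller can carry them into the next iteration.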

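Now that we are familiar with the Adadelta algorithm, let's explore how we might implement it and evaluate its performance.

Gradient Descent With Adadelta

In this section, we will explore how to implement the gradient descent optimization algorithm with Adadelta.

Two-Dimensional Test Problem

First, let's define an optimization function. We will use a simple two-dimensional function that squares the input of each dimension and define the range of valid inputs from -1.0 to 1.0. The objective() function below implements this function.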

# objective function
def objective(x, y):
    return x**2.0 + y**2.0

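We can create a three-dimensional plot of the function to get a feeling for the curvature of the response surface. The complete example of plotting the objective function is listed below.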

# 3d plot of the test function
from numpy import arange
from numpy import meshgrid
from matplotlib import pyplot

# objective function
def objective(x, y):
    return x**2.0 + y**2.0

# define range for input
r_min, r_max = -1.0, 1.0
# sample input range uniformly at 0.1 increments
xaxis = arange(r_min, r_max, 0.1)
yaxis = arange(r_min, r_max, 0.1)
# create a mesh from the axis
x, y = meshgrid(xaxis, yaxis)
# compute targets
results = objective(x, y)
# create a surface plot with the jet color scheme
figure = pyplot.figure()
axis = figure.add_subplot(projection='3d')
axis.plot_surface(x, y, results, cmap='jet')
# show the plot
pyplot.show()

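Running the example creates a three-dimensional surface plot of the objective function. We can see the familiar bowl shape with the global minimum at f(0, 0) = 0.

We can also create a two-dimensional plot of the function. This will be helpful later when we want to plot the progress of the search. The example below creates a contour plot of the objective function.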

# contour plot of the test function
from numpy import asarray
from numpy import arange
from numpy import meshgrid
from matplotlib import pyplot

# objective function
def objective(x, y):
    return x**2.0 + y**2.0

# define range for input
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
# sample input range uniformly at 0.1 increments
xaxis = arange(bounds[0,0], bounds[0,1], 0.1)
yaxis = arange(bounds[1,0], bounds[1,1], 0.1)
# create a mesh from the axis
x, y = meshgrid(xaxis, yaxis)
# compute targets
results = objective(x, y)
# create a filled contour plot with 50 levels and jet color scheme
pyplot.contourf(x, y, results, levels=50, cmap='jet')
# show the plot
pyplot.show()

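Running the example creates a two-dimensional contour plot of the objective function. We can see the bowl shape compressed to contours shown with a color gradient. We will use this plot to show the specific points explored during the progress of the search.

Now that we have a test objective function, let's look at how we might implement the Adadelta optimization algorithm.

Gradient Descent Optimization With Adadelta

We can apply gradient descent with Adadelta to the test problem. First, we need a function that calculates the derivative of the objective function.

f(x) = x^2
f'(x) = x * 2

The derivative of x^2 is x * 2 in each dimension. The derivative() function below implements this.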

# derivative of objective function
def derivative(x, y):
    return asarray([x * 2.0, y * 2.0])
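
One hedged way to sanity-check a hand-written derivative like this is to compare it against a central finite-difference estimate. The helper numerical_gradient, the step h, and the test point below are assumptions added for illustration and are not part of the original tutorial.

```python
# sketch: compare the analytic gradient with a finite-difference estimate
# (numerical_gradient, h, and the test point are illustrative assumptions)
from numpy import asarray
from numpy import allclose

# objective function
def objective(x, y):
    return x**2.0 + y**2.0

# derivative of objective function
def derivative(x, y):
    return asarray([x * 2.0, y * 2.0])

def numerical_gradient(f, x, y, h=1e-6):
    # central difference in each dimension
    dx = (f(x + h, y) - f(x - h, y)) / (2.0 * h)
    dy = (f(x, y + h) - f(x, y - h)) / (2.0 * h)
    return asarray([dx, dy])

point = (0.3, -0.7)
analytic = derivative(*point)
numeric = numerical_gradient(objective, *point)
print(analytic, numeric, allclose(analytic, numeric, atol=1e-6))
```

If the two vectors disagree by more than a small tolerance, the analytic derivative is the first place to look for a mistake.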

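Next, we can implement gradient descent optimization. First, we can select a random point within the bounds of the problem as a starting point for the search. This assumes we have an array that defines the bounds of the search with one row for each dimension, where the first column defines the minimum and the second column defines the maximum of the dimension.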

# generate an initial point
solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])

Next, we need to initialize the decaying averages of the squared partial derivatives and the squared changes to 0.0 for each dimension.

# list of the average square gradients for each variable
sq_grad_avg = [0.0 for _ in range(bounds.shape[0])]
# list of the average parameter updates
sq_para_avg = [0.0 for _ in range(bounds.shape[0])]

We can then enumerate a fixed number of iterations of the search optimization algorithm, defined by the "n_iter" hyperparameter.

# run the gradient descent
for it in range(n_iter):

The first step is to calculate the gradient for the current solution using the derivative() function.

# calculate gradient
gradient = derivative(solution[0], solution[1])

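We then need to calculate the square of each partial derivative and update the decaying moving average of the squared partial derivatives using the "rho" hyperparameter.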

# update the average of the squared partial derivatives
for i in range(gradient.shape[0]):
    # calculate the squared gradient
    sg = gradient[i]**2.0
    # update the moving average of the squared gradient
    sq_grad_avg[i] = (sq_grad_avg[i] * rho) + (sg * (1.0-rho))

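We can then use the decaying moving average of the squared partial derivatives and the gradient to calculate the step size for the next point. We will do this one variable at a time.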

# build solution
new_solution = list()
for i in range(solution.shape[0]):

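First, we will calculate the custom step size for this variable on this iteration using the decaying moving averages of the squared changes and the squared partial derivatives, as well as the "ep" hyperparameter.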

# calculate the step size for this variable
alpha = (ep + sqrt(sq_para_avg[i])) / (ep + sqrt(sq_grad_avg[i]))

Next, we can use the custom step size and the partial derivative to calculate the change to the variable.

# calculate the change
change = alpha * gradient[i]

We can then use the change to update the decaying moving average of the squared change using the "rho" hyperparameter.

# update the moving average of squared parameter changes
sq_para_avg[i] = (sq_para_avg[i] * rho) + (change**2.0 * (1.0-rho))

Finally, we can change the variable and store the result before moving on to the next variable.

# calculate the new position in this variable
value = solution[i] - change
# store this variable
new_solution.append(value)

This new solution can then be evaluated using the objective() function, and the performance of the search can be reported.

# evaluate candidate point
solution = asarray(new_solution)
solution_eval = objective(solution[0], solution[1])
# report progress
print('>%d f(%s) = %.5f' % (it, solution, solution_eval))

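And that's it.

We can tie all of this together into a function named adadelta() that takes the names of the objective and derivative functions, an array with the bounds of the domain, and hyperparameter values for the total number of algorithm iterations and rho, and returns the final solution and its evaluation. The ep hyperparameter can also be passed as an argument, although it has a default value of 1e-3. The complete function is listed below.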

# gradient descent algorithm with adadelta
def adadelta(objective, derivative, bounds, n_iter, rho, ep=1e-3):
    # generate an initial point
    solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
    # list of the average square gradients for each variable
    sq_grad_avg = [0.0 for _ in range(bounds.shape[0])]
    # list of the average parameter updates
    sq_para_avg = [0.0 for _ in range(bounds.shape[0])]
    # run the gradient descent
    for it in range(n_iter):
        # calculate gradient
        gradient = derivative(solution[0], solution[1])
        # update the average of the squared partial derivatives
        for i in range(gradient.shape[0]):
            # calculate the squared gradient
            sg = gradient[i]**2.0
            # update the moving average of the squared gradient
            sq_grad_avg[i] = (sq_grad_avg[i] * rho) + (sg * (1.0-rho))
        # build a solution one variable at a time
        new_solution = list()
        for i in range(solution.shape[0]):
            # calculate the step size for this variable
            alpha = (ep + sqrt(sq_para_avg[i])) / (ep + sqrt(sq_grad_avg[i]))
            # calculate the change
            change = alpha * gradient[i]
            # update the moving average of squared parameter changes
            sq_para_avg[i] = (sq_para_avg[i] * rho) + (change**2.0 * (1.0-rho))
            # calculate the new position in this variable
            value = solution[i] - change
            # store this variable
            new_solution.append(value)
        # evaluate candidate point
        solution = asarray(new_solution)
        solution_eval = objective(solution[0], solution[1])
        # report progress
        print('>%d f(%s) = %.5f' % (it, solution, solution_eval))
    return [solution, solution_eval]

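Note: we have intentionally used lists and an imperative coding style instead of vectorized operations for readability. Feel free to adapt the implementation to a vectorized implementation with NumPy arrays for better performance.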
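
One possible vectorized adaptation is sketched below. It applies the same update rule to whole NumPy arrays instead of looping over each variable, and it omits the progress printing. The name adadelta_vectorized is hypothetical; the hyperparameter values match those used later in this tutorial.

```python
# sketch of a vectorized Adadelta search using NumPy arrays
# (adadelta_vectorized is a hypothetical name, not from the original article)
from numpy import asarray
from numpy import sqrt
from numpy import zeros
from numpy.random import rand
from numpy.random import seed

# objective function
def objective(x, y):
    return x**2.0 + y**2.0

# derivative of objective function
def derivative(x, y):
    return asarray([x * 2.0, y * 2.0])

def adadelta_vectorized(objective, derivative, bounds, n_iter, rho, ep=1e-3):
    # generate an initial point
    solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
    # decaying averages of squared gradients and squared changes
    sq_grad_avg = zeros(bounds.shape[0])
    sq_para_avg = zeros(bounds.shape[0])
    for _ in range(n_iter):
        gradient = derivative(solution[0], solution[1])
        # update the average of squared gradients for all variables at once
        sq_grad_avg = (sq_grad_avg * rho) + (gradient**2.0 * (1.0 - rho))
        # per-variable step sizes and changes
        alpha = (ep + sqrt(sq_para_avg)) / (ep + sqrt(sq_grad_avg))
        change = alpha * gradient
        # update the average of squared changes
        sq_para_avg = (sq_para_avg * rho) + (change**2.0 * (1.0 - rho))
        solution = solution - change
    return [solution, objective(solution[0], solution[1])]

seed(1)
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
best, score = adadelta_vectorized(objective, derivative, bounds, 120, 0.99)
print('f(%s) = %f' % (best, score))
```

Because each variable is updated independently, the array form performs the same arithmetic per variable as the loop-based version.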

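We can then define our hyperparameters and call the adadelta() function to optimize our test objective function.

In this case, we will use 120 iterations of the algorithm and a value of 0.99 for the rho hyperparameter, chosen after a little trial and error.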

# seed the pseudo random number generator
seed(1)
# define range for input
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
# define the total iterations
n_iter = 120
# momentum for adadelta
rho = 0.99
# perform the gradient descent search with adadelta
best, score = adadelta(objective, derivative, bounds, n_iter, rho)
print('Done!')
print('f(%s) = %f' % (best, score))

Tying all of this together, the complete example of gradient descent optimization with Adadelta is listed below.

# gradient descent optimization with adadelta for a two-dimensional test function
from math import sqrt
from numpy import asarray
from numpy.random import rand
from numpy.random import seed

# objective function
def objective(x, y):
    return x**2.0 + y**2.0

# derivative of objective function
def derivative(x, y):
    return asarray([x * 2.0, y * 2.0])

# gradient descent algorithm with adadelta
def adadelta(objective, derivative, bounds, n_iter, rho, ep=1e-3):
    # generate an initial point
    solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
    # list of the average square gradients for each variable
    sq_grad_avg = [0.0 for _ in range(bounds.shape[0])]
    # list of the average parameter updates
    sq_para_avg = [0.0 for _ in range(bounds.shape[0])]
    # run the gradient descent
    for it in range(n_iter):
        # calculate gradient
        gradient = derivative(solution[0], solution[1])
        # update the average of the squared partial derivatives
        for i in range(gradient.shape[0]):
            # calculate the squared gradient
            sg = gradient[i]**2.0
            # update the moving average of the squared gradient
            sq_grad_avg[i] = (sq_grad_avg[i] * rho) + (sg * (1.0-rho))
        # build a solution one variable at a time
        new_solution = list()
        for i in range(solution.shape[0]):
            # calculate the step size for this variable
            alpha = (ep + sqrt(sq_para_avg[i])) / (ep + sqrt(sq_grad_avg[i]))
            # calculate the change
            change = alpha * gradient[i]
            # update the moving average of squared parameter changes
            sq_para_avg[i] = (sq_para_avg[i] * rho) + (change**2.0 * (1.0-rho))
            # calculate the new position in this variable
            value = solution[i] - change
            # store this variable
            new_solution.append(value)
        # evaluate candidate point
        solution = asarray(new_solution)
        solution_eval = objective(solution[0], solution[1])
        # report progress
        print('>%d f(%s) = %.5f' % (it, solution, solution_eval))
    return [solution, solution_eval]

# seed the pseudo random number generator
seed(1)
# define range for input
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
# define the total iterations
n_iter = 120
# momentum for adadelta
rho = 0.99
# perform the gradient descent search with adadelta
best, score = adadelta(objective, derivative, bounds, n_iter, rho)
print('Done!')
print('f(%s) = %f' % (best, score))

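Running the example applies the Adadelta optimization algorithm to our test problem and reports the performance of the search for each iteration of the algorithm.

Note: your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

In this case, we can see that a near-optimal solution was found after about 105 iterations of the search, with input values near 0.0 and 0.0, evaluating to 0.0 at the five decimal places reported.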

>100 f([-1.45142626e-07 2.71163181e-03]) = 0.00001
>101 f([-1.24898699e-07 2.56875692e-03]) = 0.00001
>102 f([-1.07454197e-07 2.43328237e-03]) = 0.00001
>103 f([-9.24253035e-08 2.30483111e-03]) = 0.00001
>104 f([-7.94803792e-08 2.18304501e-03]) = 0.00000
>105 f([-6.83329263e-08 2.06758392e-03]) = 0.00000
>106 f([-5.87354975e-08 1.95812477e-03]) = 0.00000
>107 f([-5.04744185e-08 1.85436071e-03]) = 0.00000
>108 f([-4.33652179e-08 1.75600036e-03]) = 0.00000
>109 f([-3.72486699e-08 1.66276699e-03]) = 0.00000
>110 f([-3.19873691e-08 1.57439783e-03]) = 0.00000
>111 f([-2.74627662e-08 1.49064334e-03]) = 0.00000
>112 f([-2.3572602e-08 1.4112666e-03]) = 0.00000
>113 f([-2.02286891e-08 1.33604264e-03]) = 0.00000
>114 f([-1.73549914e-08 1.26475787e-03]) = 0.00000
>115 f([-1.48859650e-08 1.19720951e-03]) = 0.00000
>116 f([-1.27651224e-08 1.13320504e-03]) = 0.00000
>117 f([-1.09437923e-08 1.07256172e-03]) = 0.00000
>118 f([-9.38004754e-09 1.01510604e-03]) = 0.00000
>119 f([-8.03777865e-09 9.60673346e-04]) = 0.00000
Done!
f([-8.03777865e-09 9.60673346e-04]) = 0.000001
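
Because the starting point is drawn at random, one way to follow the advice about repeating the run is to repeat the search with several seeds and average the final scores. The sketch below is an assumption-laden illustration: the helper adadelta_quiet is hypothetical, it repeats the tutorial's update rule without the progress printing, and the choice of seeds 1 to 5 is arbitrary.

```python
# sketch: repeat the Adadelta search with several seeds and average the scores
# (adadelta_quiet and the choice of seeds are illustrative assumptions)
from math import sqrt
from numpy import asarray
from numpy import mean
from numpy.random import rand
from numpy.random import seed

# objective function
def objective(x, y):
    return x**2.0 + y**2.0

# derivative of objective function
def derivative(x, y):
    return asarray([x * 2.0, y * 2.0])

def adadelta_quiet(bounds, n_iter, rho, ep=1e-3):
    # same update rule as the tutorial, without printing each iteration
    solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
    sq_grad_avg = [0.0 for _ in range(bounds.shape[0])]
    sq_para_avg = [0.0 for _ in range(bounds.shape[0])]
    for _ in range(n_iter):
        gradient = derivative(solution[0], solution[1])
        new_solution = list()
        for i in range(solution.shape[0]):
            sq_grad_avg[i] = (sq_grad_avg[i] * rho) + (gradient[i]**2.0 * (1.0-rho))
            alpha = (ep + sqrt(sq_para_avg[i])) / (ep + sqrt(sq_grad_avg[i]))
            change = alpha * gradient[i]
            sq_para_avg[i] = (sq_para_avg[i] * rho) + (change**2.0 * (1.0-rho))
            new_solution.append(solution[i] - change)
        solution = asarray(new_solution)
    return objective(solution[0], solution[1])

bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
scores = list()
for s in range(1, 6):
    # a different seed gives a different random starting point
    seed(s)
    scores.append(adadelta_quiet(bounds, 120, 0.99))
print('mean final score over %d runs: %.6f' % (len(scores), mean(scores)))
```

Comparing the individual scores as well as the mean gives a rough sense of how sensitive the final result is to the starting point for a fixed number of iterations.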

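Visualization of Adadelta

We can plot the progress of the Adadelta search on a contour plot of the domain. This can provide an intuition for the progress of the search over the iterations of the algorithm. We must update the adadelta() function to maintain a list of all solutions found during the search, then return this list at the end of the search. The updated version of the function with these changes is listed below.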

# gradient descent algorithm with adadelta
def adadelta(objective, derivative, bounds, n_iter, rho, ep=1e-3):
    # track all solutions
    solutions = list()
    # generate an initial point
    solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
    # list of the average square gradients for each variable
    sq_grad_avg = [0.0 for _ in range(bounds.shape[0])]
    # list of the average parameter updates
    sq_para_avg = [0.0 for _ in range(bounds.shape[0])]
    # run the gradient descent
    for it in range(n_iter):
        # calculate gradient
        gradient = derivative(solution[0], solution[1])
        # update the average of the squared partial derivatives
        for i in range(gradient.shape[0]):
            # calculate the squared gradient
            sg = gradient[i]**2.0
            # update the moving average of the squared gradient
            sq_grad_avg[i] = (sq_grad_avg[i] * rho) + (sg * (1.0-rho))
        # build solution
        new_solution = list()
        for i in range(solution.shape[0]):
            # calculate the step size for this variable
            alpha = (ep + sqrt(sq_para_avg[i])) / (ep + sqrt(sq_grad_avg[i]))
            # calculate the change
            change = alpha * gradient[i]
            # update the moving average of squared parameter changes
            sq_para_avg[i] = (sq_para_avg[i] * rho) + (change**2.0 * (1.0-rho))
            # calculate the new position in this variable
            value = solution[i] - change
            # store this variable
            new_solution.append(value)
        # store the new solution
        solution = asarray(new_solution)
        solutions.append(solution)
        # evaluate candidate point
        solution_eval = objective(solution[0], solution[1])
        # report progress
        print('>%d f(%s) = %.5f' % (it, solution, solution_eval))
    return solutions

We can then execute the search as before, and this time retrieve the list of solutions instead of the best final solution.

# seed the pseudo random number generator
seed(1)
# define range for input
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
# define the total iterations
n_iter = 120
# rho for adadelta
rho = 0.99
# perform the gradient descent search with adadelta
solutions = adadelta(objective, derivative, bounds, n_iter, rho)

We can then create a contour plot of the objective function, as before.

# sample input range uniformly at 0.1 increments
xaxis = arange(bounds[0,0], bounds[0,1], 0.1)
yaxis = arange(bounds[1,0], bounds[1,1], 0.1)
# create a mesh from the axis
x, y = meshgrid(xaxis, yaxis)
# compute targets
results = objective(x, y)
# create a filled contour plot with 50 levels and jet color scheme
pyplot.contourf(x, y, results, levels=50, cmap='jet')

Finally, we can plot each solution found during the search as a white dot connected by a line.

# plot the solutions as white dots connected by a line
solutions = asarray(solutions)
pyplot.plot(solutions[:, 0], solutions[:, 1], '.-', color='w')

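Tying this all together, the complete example of performing the Adadelta optimization on the test problem and plotting the results on a contour plot is listed below.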

# example of plotting the adadelta search on a contour plot of the test function
from math import sqrt
from numpy import asarray
from numpy import arange
from numpy.random import rand
from numpy.random import seed
from numpy import meshgrid
from matplotlib import pyplot

# objective function
def objective(x, y):
    return x**2.0 + y**2.0

# derivative of objective function
def derivative(x, y):
    return asarray([x * 2.0, y * 2.0])

# gradient descent algorithm with adadelta
def adadelta(objective, derivative, bounds, n_iter, rho, ep=1e-3):
    # track all solutions
    solutions = list()
    # generate an initial point
    solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
    # list of the average square gradients for each variable
    sq_grad_avg = [0.0 for _ in range(bounds.shape[0])]
    # list of the average parameter updates
    sq_para_avg = [0.0 for _ in range(bounds.shape[0])]
    # run the gradient descent
    for it in range(n_iter):
        # calculate gradient
        gradient = derivative(solution[0], solution[1])
        # update the average of the squared partial derivatives
        for i in range(gradient.shape[0]):
            # calculate the squared gradient
            sg = gradient[i]**2.0
            # update the moving average of the squared gradient
            sq_grad_avg[i] = (sq_grad_avg[i] * rho) + (sg * (1.0-rho))
        # build solution
        new_solution = list()
        for i in range(solution.shape[0]):
            # calculate the step size for this variable
            alpha = (ep + sqrt(sq_para_avg[i])) / (ep + sqrt(sq_grad_avg[i]))
            # calculate the change
            change = alpha * gradient[i]
            # update the moving average of squared parameter changes
            sq_para_avg[i] = (sq_para_avg[i] * rho) + (change**2.0 * (1.0-rho))
            # calculate the new position in this variable
            value = solution[i] - change
            # store this variable
            new_solution.append(value)
        # store the new solution
        solution = asarray(new_solution)
        solutions.append(solution)
        # evaluate candidate point
        solution_eval = objective(solution[0], solution[1])
        # report progress
        print('>%d f(%s) = %.5f' % (it, solution, solution_eval))
    return solutions

# seed the pseudo random number generator
seed(1)
# define range for input
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
# define the total iterations
n_iter = 120
# rho for adadelta
rho = 0.99
# perform the gradient descent search with adadelta
solutions = adadelta(objective, derivative, bounds, n_iter, rho)
# sample input range uniformly at 0.1 increments
xaxis = arange(bounds[0,0], bounds[0,1], 0.1)
yaxis = arange(bounds[1,0], bounds[1,1], 0.1)
# create a mesh from the axis
x, y = meshgrid(xaxis, yaxis)
# compute targets
results = objective(x, y)
# create a filled contour plot with 50 levels and jet color scheme
pyplot.contourf(x, y, results, levels=50, cmap='jet')
# plot the solutions as white dots connected by a line
solutions = asarray(solutions)
pyplot.plot(solutions[:, 0], solutions[:, 1], '.-', color='w')
# show the plot
pyplot.show()

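Running the example performs the search as before, except that in this case the contour plot of the objective function is created.

In this case, we can see that a white dot is shown for each solution found during the search, starting away from the optima and progressively getting closer to the optima at the center of the plot.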
