What Is the Gradient Descent Algorithm and How Does It Work?

Gradient descent is an optimization algorithm that helps us train neural networks and many other machine learning models. This article explores how the algorithm actually works, its variants, and its significance in the real world.

A Brief Introduction

Gradient descent is one of the most popular algorithms to perform optimization and by far the most common way to optimize neural networks. At the same time, every state-of-the-art deep learning library contains implementations of various algorithms to optimize gradient descent (see, for example, the documentation of Lasagne, Caffe, and Keras).

The reason we're talking about it here is not merely theoretical. The gradient descent algorithm is much more than it seems. It is used time and again by ML practitioners, data scientists, and students to optimize their models.

Gradient descent is a way to minimize an objective function parameterized by a model's parameters by updating the parameters in the opposite direction of the gradient of the objective function with respect to the parameters. The learning rate $\alpha$ determines the size of the steps we take to reach a (local) minimum. In other words, we follow the direction of the slope of the surface created by the objective function downhill until we reach a valley.
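To make this concrete, here is a minimal sketch (not from the original article) of gradient descent on the bowl-shaped function $J(\theta) = \theta_1^2 + \theta_2^2$, whose gradient is $(2\theta_1, 2\theta_2)$; the starting point and learning rate are illustrative assumptions.

```python
import numpy as np

def J(theta):
    # Bowl-shaped objective: minimum at theta = (0, 0).
    return float(theta @ theta)

def grad_J(theta):
    # Gradient of J with respect to theta.
    return 2.0 * theta

theta = np.array([4.0, -3.0])   # arbitrary starting point on the "surface"
alpha = 0.1                     # learning rate: size of each downhill step

for step in range(100):
    theta = theta - alpha * grad_J(theta)   # move opposite the gradient

print(theta, J(theta))   # both approach 0 as we reach the "valley"
```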

Now that you've gotten a basic insight into the algorithm, let's dig deeper into it in this post. We will define and cover some important aspects: how it works, worked examples, its variants, and a final conclusion to tie it all together.

What Exactly Is Gradient Descent?


Gradient descent is an optimization algorithm used to find the values of the parameters (coefficients) of a function (f) that minimize a cost function (cost).

Gradient descent is best used when the parameters cannot be calculated analytically (e.g. using linear algebra) and must be searched for by an optimization algorithm.


Gradient descent is a first-order iterative optimization algorithm for finding a local minimum of a differentiable function. To find a local minimum of a function using gradient descent, we take steps proportional to the negative of the gradient (or approximate gradient) of the function at the current point. But if we instead take steps proportional to the positive of the gradient, we approach a local maximum of that function; the procedure is then known as gradient ascent. Gradient descent was originally proposed by Cauchy in 1847.


Gradient descent is also known as steepest descent; but gradient descent should not be confused with the method of steepest descent for approximating integrals.



Gradient Descent Variants

There are three variants of gradient descent, which differ in how much data we use to compute the gradient of the objective function. Depending on the amount of data, we make a trade-off between the accuracy of the parameter update and the time it takes to perform an update.


  1. Batch Gradient Descent: This variant processes all the training examples for each iteration of gradient descent. If the number of training examples is large, batch gradient descent becomes computationally very expensive, so it is not preferred in that case; instead, we use stochastic gradient descent or mini-batch gradient descent.

  2. Stochastic Gradient Descent: This variant processes one training example per iteration, so the parameters are updated after each single example. This makes it much faster than batch gradient descent. However, when the number of training examples is large, processing only one example per iteration means the number of iterations becomes very large, which adds overhead.

  3. Mini-Batch Gradient Descent: This variant typically works faster than both batch gradient descent and stochastic gradient descent. Here, b examples (where b < m) are processed per iteration. So even if the number of training examples is large, they are processed in batches of b examples in one go. It therefore scales to large training sets while requiring fewer iterations than stochastic gradient descent. (A minimal sketch contrasting how the three variants slice the data follows this list.)
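The three variants differ only in how many examples feed each parameter update. The sketch below is an assumed helper (the names `iterate_batches` and `gradient` are illustrative, not from the article) showing how the batch size selects the variant:

```python
import numpy as np

def iterate_batches(X, y, batch_size, rng):
    """Yield (X_batch, y_batch) pairs; the batch size picks the GD variant."""
    m = X.shape[0]
    order = rng.permutation(m)                # shuffle once per epoch
    for start in range(0, m, batch_size):
        idx = order[start:start + batch_size]
        yield X[idx], y[idx]

# batch_size = m      -> batch gradient descent (one update per epoch)
# batch_size = 1      -> stochastic gradient descent (m updates per epoch)
# batch_size = b < m  -> mini-batch gradient descent (m / b updates per epoch)
#
# The training loop is the same in all three cases, e.g.:
#   for X_b, y_b in iterate_batches(X, y, batch_size, rng):
#       theta = theta - alpha * gradient(theta, X_b, y_b)
```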

Gradient Descent Procedure

The procedure starts off with initial values for the coefficient or coefficients of the function. These could be 0.0 or a small random value.

coefficient = 0.0

The cost of the coefficients is evaluated by plugging them into the function and calculating the cost.

cost = f(coefficient)

or

cost = evaluate(f(coefficient))

The derivative of the cost is calculated. The derivative is a concept from calculus and refers to the slope of the function at a given point. We need to know the slope so that we know the direction (sign) in which to move the coefficient values in order to get a lower cost on the next iteration.

delta = derivative(cost)

Now that we know from the derivative which direction is downhill, we can update the coefficient values. A learning rate parameter (alpha) must be specified that controls how much the coefficients can change on each update.

coefficient = coefficient - (alpha * delta)

This process is repeated until the cost of the coefficients (cost) is 0.0 or close enough to zero to be good enough.

You can see how simple gradient descent is. It does require you to know the gradient of your cost function or of the function you are optimizing, but besides that, it's very straightforward. Next we will see the math behind it and how we can use it in machine learning algorithms.
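Putting the procedure above into runnable form, here is a small, hypothetical sketch for a single coefficient; the cost function, its derivative, the starting value, and the learning rate are illustrative assumptions, not part of the original article.

```python
# The procedure above, written out for a single coefficient (illustrative sketch).
def cost(coefficient):
    return (coefficient - 3.0) ** 2          # assumed cost, minimum at coefficient = 3.0

def derivative(coefficient):
    return 2.0 * (coefficient - 3.0)         # slope of the cost at the current coefficient

coefficient = 0.0    # initial value
alpha = 0.1          # learning rate
for _ in range(200):
    delta = derivative(coefficient)
    coefficient = coefficient - alpha * delta
    if cost(coefficient) < 1e-12:            # close enough to zero to be good enough
        break

print(coefficient)   # approaches 3.0, the minimum of the assumed cost
```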

The Math Behind It

Suppose we have the following given:

Hypothesis: $h_\theta(x) = \theta^T x = \theta_0 x_0 + \theta_1 x_1 + \dots + \theta_n x_n$

Parameters: $\theta_0, \theta_1, \theta_2, \dots, \theta_n$

Cost function: $J(\theta) = J(\theta_0, \theta_1, \theta_2, \dots, \theta_n)$

Consider the gradient descent algorithm, which starts with some initial $\theta$ and repeatedly performs the update:

$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$

(This update is simultaneously performed for all values of $j = 0, \dots, n$.) Here, $\alpha$ is called the learning rate. This is a very natural algorithm that repeatedly takes a step in the direction of steepest decrease of $J$.

We derived the LMS rule for the case in which there was only a single training example. There are two ways to modify this method for a training set of more than one example. The first is to replace it with the following algorithm:

    Repeat until convergence {
        $\theta_j := \theta_j - \alpha \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$   (for every $j$)
    }

The reader can easily verify that the quantity in the summation in the update rule above is just $\partial J(\theta)/\partial \theta_j$ (for the original definition of $J$). So, this is simply gradient descent on the original cost function $J$. This method looks at every example in the entire training set on every step, and is called batch gradient descent. Note that, while gradient descent can be susceptible to local minima in general, the optimization problem we have posed here for linear regression has only one global optimum and no other local optima; thus gradient descent always converges (assuming the learning rate $\alpha$ is not too large) to the global minimum. Indeed, $J$ is a convex quadratic function.
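For completeness, here is a short worked check, using the squared-error cost implied above, that the summed quantity really is $\partial J(\theta)/\partial \theta_j$:

$$J(\theta) = \frac{1}{2} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$$

$$\frac{\partial J(\theta)}{\partial \theta_j} = \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) \frac{\partial h_\theta(x^{(i)})}{\partial \theta_j} = \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)},$$

since $h_\theta(x) = \theta^T x$ implies $\partial h_\theta(x)/\partial \theta_j = x_j$.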

How to Calculate Gradient Descent


**Variables used:** Let m be the number of training examples and let n be the number of features.

Note: if b == m, then mini batch gradient descent will behave similarly to batch gradient descent.


**Algorithm for batch gradient descent:** Let $h_\theta(x)$ be the hypothesis for linear regression, and let $\sum$ denote the sum over all training examples from $i = 1$ to $m$. Then the cost function is given by:

$J_{train}(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$

    Repeat {
        $\theta_j := \theta_j - \frac{\alpha}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$   (for every $j = 0, \dots, n$)
    }

where $x_j^{(i)}$ represents the $j$th feature of the $i$th training example and $\alpha$ is the learning rate. So if $m$ is very large (e.g. 5 million training examples), it can take hours or even days to converge to the global minimum. That is why, for large datasets, batch gradient descent is not recommended, as it slows down learning.
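As an illustration of the batch update above, here is a small, self-contained Python sketch; the synthetic data, iteration count, and learning rate are assumptions made for the example:

```python
import numpy as np

# Synthetic linear-regression data: y = 4 + 3*x plus a little noise (illustrative).
rng = np.random.default_rng(0)
m = 100
x = rng.uniform(0, 2, size=m)
y = 4.0 + 3.0 * x + rng.normal(0, 0.1, size=m)

X = np.column_stack([np.ones(m), x])   # x_0 = 1 (intercept term), x_1 = x
theta = np.zeros(2)                    # parameters theta_0, theta_1
alpha = 0.1                            # learning rate

for _ in range(1000):
    errors = X @ theta - y                   # h_theta(x^(i)) - y^(i) for every example
    gradient = (X.T @ errors) / m            # (1/m) * sum(errors * x_j^(i)) for each j
    theta = theta - alpha * gradient         # one update uses all m examples

cost = ((X @ theta - y) ** 2).sum() / (2 * m)   # J_train(theta) after training
print(theta, cost)   # theta should approach [4.0, 3.0]
```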

**Algorithm for stochastic gradient descent:**


In this algorithm, we repeatedly run through the training set, and each time we encounter a training example, we update the parameters according to the gradient of the error with respect to that single training example only. This algorithm is called stochastic gradient descent (also incremental gradient descent).


  1. Randomly shuffle the data set so that the parameters can be trained evenly for each type of data.
  2. As mentioned above, it takes into consideration one example per iteration.

Hence, let $(x^{(i)}, y^{(i)})$ be a training example. Then:

$\mathrm{Cost}(\theta, (x^{(i)}, y^{(i)})) = \frac{1}{2} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$

$J_{train}(\theta) = \frac{1}{m} \sum_{i=1}^{m} \mathrm{Cost}(\theta, (x^{(i)}, y^{(i)}))$

    Repeat {
        For i = 1 to m {
            $\theta_j := \theta_j - \alpha \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$   (for every $j = 0, \dots, n$)
        }
    }
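A corresponding stochastic-gradient-descent sketch (again with assumed synthetic data and hyperparameters) updates the parameters once per training example:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 100
x = rng.uniform(0, 2, size=m)
y = 4.0 + 3.0 * x + rng.normal(0, 0.1, size=m)

X = np.column_stack([np.ones(m), x])
theta = np.zeros(2)
alpha = 0.01                       # smaller learning rate: per-example updates are noisy

for epoch in range(50):
    order = rng.permutation(m)     # 1) randomly shuffle the data set
    for i in order:                # 2) one example per parameter update
        error = X[i] @ theta - y[i]
        theta = theta - alpha * error * X[i]

print(theta)   # oscillates near [4.0, 3.0] rather than converging exactly
```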

**Algorithm for mini-batch gradient descent:** Let b be the number of examples in one batch, where b < m. Assume b = 10 and m = 100.

Note: We can adjust the batch size. It is generally kept as a power of 2, because some hardware, such as GPUs, achieves better run times with common batch sizes that are powers of 2.

    Repeat {
        For i = 1, 11, 21, ..., 91 {
            $\theta_j := \theta_j - \frac{\alpha}{b} \sum_{k=i}^{i+9} \left( h_\theta(x^{(k)}) - y^{(k)} \right) x_j^{(k)}$   (for every $j = 0, \dots, n$)
        }
    }
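And a mini-batch sketch with b = 10, using the same kind of assumed synthetic data, repeated here so the snippet stands alone:

```python
import numpy as np

rng = np.random.default_rng(0)
m, b = 100, 10                       # m training examples, batches of size b
x = rng.uniform(0, 2, size=m)
y = 4.0 + 3.0 * x + rng.normal(0, 0.1, size=m)

X = np.column_stack([np.ones(m), x])
theta = np.zeros(2)
alpha = 0.05

for epoch in range(200):
    order = rng.permutation(m)
    for start in range(0, m, b):                 # i = 1, 11, 21, ..., 91 in the text
        idx = order[start:start + b]
        errors = X[idx] @ theta - y[idx]
        theta = theta - (alpha / b) * (X[idx].T @ errors)   # average gradient over the batch

print(theta)   # approaches [4.0, 3.0]
```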

Choosing the Best α

  • For sufficiently small α, J(θ) should decrease on every iteration.
  • But if α is too small, gradient descent can be slow to converge.
  • If α is too large, J(θ) may not decrease on every iteration and may not converge.

To choose α, try values such as ..., 0.001, 0.01, 0.1, 1, ... and so on.
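One practical way to apply this advice is to run a few iterations with each candidate α and keep the one whose cost decreases fastest. The sketch below uses a toy objective (an assumption for illustration, not the article's model):

```python
# Try a range of learning rates on a toy problem and check that J decreases.
def cost(theta):
    return theta ** 2          # toy objective J(theta) = theta^2

def gradient(theta):
    return 2.0 * theta

for alpha in [0.001, 0.01, 0.1, 1.0]:
    theta = 5.0                # same starting point for every candidate
    costs = []
    for _ in range(20):
        theta = theta - alpha * gradient(theta)
        costs.append(cost(theta))
    decreased = all(later < earlier for earlier, later in zip(costs, costs[1:]))
    print(f"alpha={alpha}: final J={costs[-1]:.4g}, decreased every iteration: {decreased}")
```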

Batch vs. Stochastic Gradient Descent

Batch gradient descent has to scan through the entire training set before taking a single step, a costly operation if m is large, whereas stochastic gradient descent can start making progress right away and continues to make progress with each example it looks at. Often, stochastic gradient descent gets θ "close" to the minimum much faster than batch gradient descent. (Note, however, that it may never "converge" to the minimum, and the parameters θ will keep oscillating around the minimum of J(θ); in practice, though, most of the values near the minimum are reasonably good approximations to the true minimum.) For these reasons, particularly when the training set is large, stochastic gradient descent is often preferred over batch gradient descent.

Some Real-Life Examples and Intuition


  • Think of a large bowl like what you would eat cereal out of or store fruit in. This bowl is a plot of the cost function (f). A random position on the surface of the bowl is the cost of the current values of the coefficients (cost). The bottom of the bowl is the cost of the best set of coefficients, the minimum of the function. The goal is to continue to try different values for the coefficients, evaluate their cost and select new coefficients that have a slightly better (lower) cost. Repeating this process enough times will lead to the bottom of the bowl and you will know the values of the coefficients that result in the minimum cost.
  • The basic intuition behind gradient descent can be illustrated by a hypothetical scenario. A person is stuck in the mountains and is trying to get down (i.e. trying to find the global minimum). There is heavy fog such that visibility is extremely low. Therefore, the path down the mountain is not visible, so they must use local information to find the minimum. They can use the method of gradient descent, which involves looking at the steepness of the hill at their current position, then proceeding in the direction with the steepest descent (i.e. downhill). If they were trying to find the top of the mountain (i.e. the maximum), then they would proceed in the direction of steepest ascent (i.e. uphill). Using this method, they would eventually find their way down the mountain or possibly get stuck in some hole (i.e. local minimum or saddle point), like a mountain lake. However, assume also that the steepness of the hill is not immediately obvious with simple observation, but rather it requires a sophisticated instrument to measure, which the person happens to have at the moment. It takes quite some time to measure the steepness of the hill with the instrument, thus they should minimize their use of the instrument if they wanted to get down the mountain before sunset. The difficulty then is choosing the frequency at which they should measure the steepness of the hill so not to go off track. In this analogy, the person represents the algorithm, and the path taken down the mountain represents the sequence of parameter settings that the algorithm will explore. The steepness of the hill represents the slope of the error surface at that point. The instrument used to measure steepness is differentiation (the slope of the error surface can be calculated by taking the derivative of the squared error function at that point). The direction they choose to travel in aligns with the gradient of the error surface at that point. The amount of time they travel before taking another measurement is the learning rate of the algorithm.


Tips and Reminders Before Practicing It


This section lists some tips and tricks for getting the most out of the gradient descent algorithm for machine learning.


  • Plot Cost versus Time: Collect and plot the cost values calculated by the algorithm each iteration. The expectation for a well performing gradient descent run is a decrease in cost each iteration. If it does not decrease, try reducing your learning rate.


  • Learning Rate: The learning rate value is a small real value such as 0.1, 0.001 or 0.0001. Try different values for your problem and see which works best.


  • Rescale Inputs: The algorithm will reach the minimum cost faster if the shape of the cost function is not skewed and distorted. You can achieve this by rescaling all of the input variables (X) to the same range, such as [0, 1] or [-1, 1] (a short sketch after this list shows one way to do this).

  • Few Passes: Stochastic gradient descent often does not need more than 1-to-10 passes through the training dataset to converge on good or good enough coefficients.


  • Plot Mean Cost: The updates for each training data set instance can result in a noisy plot of cost over time when using stochastic gradient descent. Taking the average over 10, 100, or 1000 updates can give you a better idea of the learning trend for the algorithm.

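As a minimal illustration of the rescaling tip, the following sketch maps each input column to the [0, 1] range (min-max scaling); the array contents are made up for the example:

```python
import numpy as np

# Made-up feature matrix: two input variables on very different scales.
X = np.array([[1.0, 2000.0],
              [2.0, 3000.0],
              [3.0, 5000.0],
              [4.0, 8000.0]])

X_min = X.min(axis=0)
X_max = X.max(axis=0)
X_scaled = (X - X_min) / (X_max - X_min)   # min-max rescaling to [0, 1] per column

print(X_scaled)   # every column now spans the range [0, 1]
```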

Convergence trends in the different variants of gradient descent:

In the case of batch gradient descent, the algorithm follows a straight path towards the minimum. If the cost function is convex, it converges to the global minimum, and if the cost function is not convex, it converges to a local minimum. Here the learning rate is typically held constant.

In the case of stochastic gradient descent and mini-batch gradient descent, the algorithm does not converge exactly but keeps fluctuating around the global minimum. Therefore, to make it converge, we have to slowly decrease the learning rate. The convergence of stochastic gradient descent is also much noisier, since each iteration processes only one training example.
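One common way to "slowly decrease the learning rate" is a decay schedule. The sketch below, with assumed constants that are not from the article, shrinks α as the iteration count t grows:

```python
# Illustrative learning-rate decay schedule for stochastic gradient descent.
alpha0 = 0.1     # assumed initial learning rate
decay = 0.01     # assumed decay constant

def learning_rate(t):
    """Learning rate at iteration t, shrinking as training proceeds."""
    return alpha0 / (1.0 + decay * t)

for t in [0, 100, 1000, 10000]:
    print(t, learning_rate(t))
```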

Closing and a Final Conclusion


In this post you discovered gradient descent for machine learning. You learned that:


  • Optimization is a big part of machine learning.
  • Gradient descent is a simple optimization procedure that you can use with many machine learning algorithms.
  • Batch gradient descent refers to calculating the derivative from all training data before calculating an update.
  • Stochastic gradient descent refers to calculating the derivative from each training data instance and calculating the update immediately.

Do you have any questions about gradient descent for machine learning or this post? Leave a comment and ask your question and I will do my best to answer it.


Sources for the Above Article


Introduction to Gradient Descent Algorithm (along with variants) in Machine Learning


Gradient Descent For Machine Learning — Machine Learning Mastery


Translated from: https://medium.com/swlh/gradient-descent-algorithm-3d3ba3823fd4
