co-authored with Apurva Pathak

Experimenting with Gradient Descent Optimizers

Welcome to another instalment in our Deep Learning Experiments series, where we run experiments to evaluate commonly-held assumptions about training neural networks. Our goal is to better understand the different design choices that affect model training and evaluation. To do so, we come up with questions about each design choice and then run experiments to answer them.

In this article, we seek to better understand the impact of using different optimizers:

  • How do different optimizers perform in practice?
  • How sensitive is each optimizer to parameter choices such as learning rate or momentum?
  • How quickly does each optimizer converge?
  • How much of a performance difference does choosing a good optimizer make?

To answer these questions, we evaluate the following optimizers:

  • Stochastic gradient descent (SGD)
  • SGD with momentum
  • SGD with Nesterov momentum
  • RMSprop
  • Adam
  • Adagrad
  • Cyclic learning rate

How are the experiments set up?

We train a neural net using different optimizers and compare their performance. The code for these experiments can be found on GitHub.

  • Dataset: we use the Cats and Dogs dataset, which consists of 23,262 images of cats and dogs, split about 50/50 between the two classes. Since the images are differently sized, we resize them all to the same size. We use 20% of the dataset as validation data (dev set) and the rest as training data.
  • Evaluation metric: we use the binary cross-entropy loss on the validation data as our primary metric to measure model performance.
Figure 1: Sample images from Cats and Dogs dataset
  • Base model: we also define a base model that is inspired by VGG16, where we apply (convolution -> max-pool -> ReLU -> batch-norm -> dropout) operations repeatedly. Then, we flatten the output volume and feed it into two fully-connected layers (dense -> ReLU -> batch-norm) with 256 units each, and dropout after the first FC layer. Finally, we feed the result into a one-neuron layer with a sigmoid activation, resulting in an output between 0 and 1 that tells us whether the model predicts a cat (0) or dog (1).
Figure 2: Base model architecture (created with NN SVG)
  • Training: we use a batch size of 32 and the default weight initialization (Glorot uniform). The default optimizer is SGD with a learning rate of 0.01. We train until the validation loss fails to improve over 50 iterations. A rough code sketch of this setup follows below.
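
The exact code lives in the linked GitHub repo; the sketch below (tf.keras) shows roughly how the base model and training setup described above fit together. The input size, number of convolutional blocks, per-block filter counts, and the dropout rate inside the conv blocks are illustrative assumptions rather than the repository's exact values; the 0.6 dropout on the first FC layer, batch size 32, SGD at 0.01, and the 50-epoch early-stopping patience follow the description in this article.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def conv_block(x, filters, dropout=0.2):
    # convolution -> max-pool -> ReLU -> batch-norm -> dropout
    x = layers.Conv2D(filters, 3, padding="same")(x)
    x = layers.MaxPooling2D()(x)
    x = layers.ReLU()(x)
    x = layers.BatchNormalization()(x)
    return layers.Dropout(dropout)(x)

inputs = layers.Input(shape=(224, 224, 3))          # all images resized to one size (224x224 assumed)
x = inputs
for filters in (32, 64, 128, 256):                  # repeated VGG16-style blocks (depths assumed)
    x = conv_block(x, filters)
x = layers.Flatten()(x)
x = layers.Dense(256)(x)                            # first FC layer: dense -> ReLU -> batch-norm
x = layers.ReLU()(x)
x = layers.BatchNormalization()(x)
x = layers.Dropout(0.6)(x)                          # dropout after the first FC layer
x = layers.Dense(256)(x)                            # second FC layer
x = layers.ReLU()(x)
x = layers.BatchNormalization()(x)
outputs = layers.Dense(1, activation="sigmoid")(x)  # 0 = cat, 1 = dog

model = models.Model(inputs, outputs)               # Dense/Conv2D default to Glorot uniform init
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.01),
              loss="binary_crossentropy")

# Stop when the validation loss fails to improve for 50 epochs.
early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=50,
                                              restore_best_weights=True)
# model.fit(train_ds, validation_data=val_ds, epochs=1000,
#           batch_size=32, callbacks=[early_stop])
```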

Stochastic Gradient Descent

We first start off with vanilla stochastic gradient descent. This is defined by the following update equation:

Figure 3: SGD update equation
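
The equation image is not reproduced here; a standard way to write the vanilla SGD update it describes (with α the learning rate, which the figure may denote differently) is:

$$ w \leftarrow w - \alpha \cdot dw $$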

where w is the weight vector and dw is the gradient of the loss function with respect to the weights. This update rule takes a step in the direction of greatest decrease in the loss function, helping us find a set of weights that minimizes the loss. Note that in pure SGD, the update is applied per example, but more commonly it is computed on a batch of examples (called a mini-batch).

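
To make the mini-batch variant concrete, here is a self-contained toy example in plain NumPy (a least-squares loss on synthetic data, not the Cats and Dogs model) that applies exactly this update on mini-batches of 32 examples:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                      # synthetic inputs
true_w = np.array([1.0, -2.0, 0.5, 3.0, -1.0])
y = X @ true_w + 0.1 * rng.normal(size=1000)        # noisy targets

w = np.zeros(5)
lr, batch_size = 0.01, 32
for step in range(2000):
    idx = rng.integers(0, len(X), size=batch_size)  # sample a mini-batch
    Xb, yb = X[idx], y[idx]
    dw = 2 * Xb.T @ (Xb @ w - yb) / batch_size      # gradient of the mean squared error w.r.t. w
    w = w - lr * dw                                 # the SGD update from Figure 3
```
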
How does learning rate affect SGD?

First, we explore how learning rate affects SGD. It is well known that choosing a learning rate that is too low will cause the model to converge slowly, whereas a learning rate that is too high may cause it to not converge at all.

Figure 4: Effect of learning rate on convergence (from Jeremy Jordan’s website)

To verify this experimentally, we vary the learning rate along a log scale between 0.001 and 0.1. Let’s first plot the training losses.

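The sweep itself can be generated with a log-spaced grid; a sketch, assuming five points (which is consistent with the rates that appear in the plots, such as 0.001, 0.0316, and 0.1):

```python
import numpy as np
import tensorflow as tf

# Five learning rates evenly spaced on a log scale between 1e-3 and 1e-1:
# [0.001, 0.00316, 0.01, 0.0316, 0.1]
for lr in np.logspace(-3, -1, num=5):
    optimizer = tf.keras.optimizers.SGD(learning_rate=lr)
    # ...recompile the base model with this optimizer and train as before
```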

Figure 5: Training loss curves for SGD with different learning rates

We indeed observe that performance is optimal when the learning rate is neither too small nor too large (the red line). Initially, increasing the learning rate speeds up convergence, but after learning rate 0.0316, convergence actually becomes slower. This may be because taking a larger step may actually overshoot the minimum loss, as illustrated in figure 4, resulting in a higher loss.

Let’s now plot the validation losses.

Figure 6: Validation loss curves for SGD with different learning rates

We observe that validation performance suffers when we pick a learning rate that is either too small or too big. Too small (e.g. 0.001) and the validation loss does not decrease at all, or does so very slowly. Too large (e.g. 0.1) and the validation loss does not attain as low a minimum as it could with a smaller learning rate.

Let’s now plot the best training and validation loss attained by each learning rate*:

Figure 7: Minimum training and validation losses for SGD at different learning rates

The data above confirm the ‘Goldilocks’ theory of picking a learning rate that is neither too small nor too large, since the best learning rate (3.2e-2) is in the middle of the range of values we tried.

*Typically, we would expect the validation loss to be higher than the training loss, since the model has not seen the validation data before. However, we see above that the validation loss is surprisingly sometimes lower than the training loss. This could be due to dropout, since neurons are dropped only at training time and not during evaluation, resulting in better performance during evaluation than during training. The effect may be particularly pronounced when the dropout rate is high, as it is in our model (0.6 dropout on FC layers).

Best SGD validation loss

  • Best validation loss: 0.1899
  • Associated training loss: 0.1945
  • Epochs to converge to minimum: 535
  • Params: learning rate 0.032

SGD takeaways

  • Choosing a good learning rate (not too big, not too small) is critical for ensuring optimal performance on SGD.

Stochastic Gradient Descent with Momentum

Overview

SGD with momentum is a variant of SGD that typically converges more quickly than vanilla SGD. It is commonly defined as follows:

Figure 8: Update equations for SGD with momentum
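
The equations in the figure are not reproduced here; one common formulation (with v the velocity, β the momentum coefficient, and α the learning rate; the figure may use different symbols) is:

$$ v \leftarrow \beta v - \alpha \cdot dw, \qquad w \leftarrow w + v $$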

Deep Learning by Goodfellow et al. explains the physical intuition behind the algorithm [0]:

Formally, the momentum algorithm introduces a variable v that plays the role of velocity — it is the direction and speed at which the parameters move through parameter space. The velocity is set to an exponentially decaying average of the negative gradient.

In other words, the parameters move through the parameter space at a velocity that changes over time. The change in velocity is dictated by two terms:
