co-authored with Apurva Pathak

Experimenting with Gradient Descent Optimizers

Welcome to another instalment in our Deep Learning Experiments series, where we run experiments to evaluate commonly-held assumptions about training neural networks. Our goal is to better understand the different design choices that affect model training and evaluation. To do so, we come up with questions about each design choice and then run experiments to answer them.

In this article, we seek to better understand the impact of using different optimizers:

  • How do different optimizers perform in practice?
  • How sensitive is each optimizer to parameter choices such as learning rate or momentum?
  • How quickly does each optimizer converge?
  • How much of a performance difference does choosing a good optimizer make?

To answer these questions, we evaluate the following optimizers:

  • Stochastic gradient descent (SGD)
  • SGD with momentum
  • SGD with Nesterov momentum
  • RMSprop
  • Adam
  • Adagrad
  • Cyclic learning rate

How are the experiments set up?

We train a neural net using different optimizers and compare their performance. The code for these experiments can be found on GitHub.

  • Dataset: we use the Cats and Dogs dataset, which consists of 23,262 images of cats and dogs, split about 50/50 between the two classes. Since the images are differently sized, we resize them all to the same size. We use 20% of the dataset as validation data (dev set) and the rest as training data.
  • Evaluation metric: we use the binary cross-entropy loss on the validation data as our primary metric to measure model performance.
Figure 1: Sample images from Cats and Dogs dataset
  • Base model: we also define a base model that is inspired by VGG16, where we apply (convolution -> max-pool -> ReLU -> batch-norm -> dropout) operations repeatedly. Then, we flatten the output volume and feed it into two fully-connected layers (dense -> ReLU -> batch-norm) with 256 units each, and dropout after the first FC layer. Finally, we feed the result into a one-neuron layer with a sigmoid activation, resulting in an output between 0 and 1 that tells us whether the model predicts a cat (0) or dog (1).
Figure 2: Base model architecture (created with NN SVG)
  • Training: we use a batch size of 32 and the default weight initialization (Glorot uniform). The default optimizer is SGD with a learning rate of 0.01. We train until the validation loss fails to improve over 50 iterations. A rough code sketch of this setup follows below.
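
The exact code lives in the linked GitHub repo; the sketch below (tf.keras) shows roughly how the base model and training setup described above fit together. The input size, number of convolutional blocks, per-block filter counts, and the dropout rate inside the conv blocks are illustrative assumptions rather than the repository's exact values; the 0.6 dropout on the first FC layer, batch size 32, SGD at 0.01, and the 50-epoch early-stopping patience follow the description in this article.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def conv_block(x, filters, dropout=0.2):
    # convolution -> max-pool -> ReLU -> batch-norm -> dropout
    x = layers.Conv2D(filters, 3, padding="same")(x)
    x = layers.MaxPooling2D()(x)
    x = layers.ReLU()(x)
    x = layers.BatchNormalization()(x)
    return layers.Dropout(dropout)(x)

inputs = layers.Input(shape=(224, 224, 3))          # all images resized to one size (224x224 assumed)
x = inputs
for filters in (32, 64, 128, 256):                  # repeated VGG16-style blocks (depths assumed)
    x = conv_block(x, filters)
x = layers.Flatten()(x)
x = layers.Dense(256)(x)                            # first FC layer: dense -> ReLU -> batch-norm
x = layers.ReLU()(x)
x = layers.BatchNormalization()(x)
x = layers.Dropout(0.6)(x)                          # dropout after the first FC layer
x = layers.Dense(256)(x)                            # second FC layer
x = layers.ReLU()(x)
x = layers.BatchNormalization()(x)
outputs = layers.Dense(1, activation="sigmoid")(x)  # 0 = cat, 1 = dog

model = models.Model(inputs, outputs)               # Dense/Conv2D default to Glorot uniform init
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.01),
              loss="binary_crossentropy")

# Stop when the validation loss fails to improve for 50 epochs.
early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=50,
                                              restore_best_weights=True)
# model.fit(train_ds, validation_data=val_ds, epochs=1000,
#           batch_size=32, callbacks=[early_stop])
```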

Stochastic Gradient Descent

We first start off with vanilla stochastic gradient descent. This is defined by the following update equation:

Figure 3: SGD update equation
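
The equation image is not reproduced here; a standard way to write the vanilla SGD update it describes (with α the learning rate, which the figure may denote differently) is:

$$ w \leftarrow w - \alpha \cdot dw $$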

where w is the weight vector and dw is the gradient of the loss function with respect to the weights. This update rule takes a step in the direction of greatest decrease in the loss function, helping us find a set of weights that minimizes the loss. Note that in pure SGD, the update is applied per example, but more commonly it is computed on a batch of examples (called a mini-batch).

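
To make the mini-batch variant concrete, here is a self-contained toy example in plain NumPy (a least-squares loss on synthetic data, not the Cats and Dogs model) that applies exactly this update on mini-batches of 32 examples:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                      # synthetic inputs
true_w = np.array([1.0, -2.0, 0.5, 3.0, -1.0])
y = X @ true_w + 0.1 * rng.normal(size=1000)        # noisy targets

w = np.zeros(5)
lr, batch_size = 0.01, 32
for step in range(2000):
    idx = rng.integers(0, len(X), size=batch_size)  # sample a mini-batch
    Xb, yb = X[idx], y[idx]
    dw = 2 * Xb.T @ (Xb @ w - yb) / batch_size      # gradient of the mean squared error w.r.t. w
    w = w - lr * dw                                 # the SGD update from Figure 3
```
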
How does learning rate affect SGD?

First, we explore how learning rate affects SGD. It is well known that choosing a learning rate that is too low will cause the model to converge slowly, whereas a learning rate that is too high may cause it to not converge at all.

Figure 4: Effect of learning rate on convergence (from Jeremy Jordan’s website)

To verify this experimentally, we vary the learning rate along a log scale between 0.001 and 0.1. Let’s first plot the training losses.

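The sweep itself can be generated with a log-spaced grid; a sketch, assuming five points (which is consistent with the rates that appear in the plots, such as 0.001, 0.0316, and 0.1):

```python
import numpy as np
import tensorflow as tf

# Five learning rates evenly spaced on a log scale between 1e-3 and 1e-1:
# [0.001, 0.00316, 0.01, 0.0316, 0.1]
for lr in np.logspace(-3, -1, num=5):
    optimizer = tf.keras.optimizers.SGD(learning_rate=lr)
    # ...recompile the base model with this optimizer and train as before
```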

Figure 5: Training loss curves for SGD with different learning rates

We indeed observe that performance is optimal when the learning rate is neither too small nor too large (the red line). Initially, increasing the learning rate speeds up convergence, but after learning rate 0.0316, convergence actually becomes slower. This may be because taking a larger step may actually overshoot the minimum loss, as illustrated in figure 4, resulting in a higher loss.

Let’s now plot the validation losses.

Figure 6: Validation loss curves for SGD with different learning rates

We observe that validation performance suffers when we pick a learning rate that is either too small or too big. Too small (e.g. 0.001) and the validation loss does not decrease at all, or does so very slowly. Too large (e.g. 0.1) and the validation loss does not attain as low a minimum as it could with a smaller learning rate.

Let’s now plot the best training and validation loss attained by each learning rate*:

Figure 7: Minimum training and validation losses for SGD at different learning rates

The data above confirm the ‘Goldilocks’ theory of picking a learning rate that is neither too small nor too large, since the best learning rate (3.2e-2) is in the middle of the range of values we tried.

*Typically, we would expect the validation loss to be higher than the training loss, since the model has not seen the validation data before. However, we see above that the validation loss is surprisingly sometimes lower than the training loss. This could be due to dropout, since neurons are dropped only at training time and not during evaluation, resulting in better performance during evaluation than during training. The effect may be particularly pronounced when the dropout rate is high, as it is in our model (0.6 dropout on FC layers).

Best SGD validation loss

  • Best validation loss: 0.1899
  • Associated training loss: 0.1945
  • Epochs to converge to minimum: 535
  • Params: learning rate 0.032

SGD takeaways

  • Choosing a good learning rate (not too big, not too small) is critical for ensuring optimal performance on SGD.

Stochastic Gradient Descent with Momentum

Overview

SGD with momentum is a variant of SGD that typically converges more quickly than vanilla SGD. It is commonly defined as follows:

Figure 8: Update equations for SGD with momentum
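
The equations in the figure are not reproduced here; one common formulation (with v the velocity, β the momentum coefficient, and α the learning rate; the figure may use different symbols) is:

$$ v \leftarrow \beta v - \alpha \cdot dw, \qquad w \leftarrow w + v $$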

Deep Learning by Goodfellow et al. explains the physical intuition behind the algorithm [0]:

Formally, the momentum algorithm introduces a variable v that plays the role of velocity — it is the direction and speed at which the parameters move through parameter space. The velocity is set to an exponentially decaying average of the negative gradient.

In other words, the parameters move through the parameter space at a velocity that changes over time. The change in velocity is dictated by two terms:
