Paper interpretation: "Adaptive Gradient Methods With Dynamic Bound Of Learning Rate" (AdaBound), a neural network optimization algorithm proposed by Chinese undergraduates

Contents

Highlights

Paper Interpretation

Experimental Results

1. FEEDFORWARD NEURAL NETWORK

2. CONVOLUTIONAL NEURAL NETWORK

3. RECURRENT NEURAL NETWORK

Analysis of Experimental Results


"Adaptive Gradient Methods With Dynamic Bound Of Learning Rate"
Paper: https://openreview.net/pdf?id=Bkg3g2R9FX
Reviews: https://openreview.net/forum?id=Bkg3g2R9FX
GitHub: https://github.com/Luolc/AdaBound

Highlights

1. AdaBound makes rapid initial progress during training.
2. AdaBound is not very sensitive to hyperparameters, which saves a great deal of tuning time.
3. It is well suited to CV and NLP and can be used to build deep learning models for a variety of popular tasks.

We investigate existing adaptive algorithms and find that extremely large or small learning rates can result in the poor convergence behavior. A rigorous proof of non-convergence for ADAM is provided to demonstrate the above problem.

Motivated by the strong generalization ability of SGD, we design a strategy to constrain the learning rates of ADAM and AMSGRAD to avoid a violent oscillation. Our proposed algorithms, ADABOUND and AMSBOUND, which employ dynamic bounds on their learning rates, achieve a smooth transition to SGD. They show the great efficacy on several standard benchmarks while maintaining advantageous properties of adaptive methods such as rapid initial progress and hyperparameter insensitivity.


Paper Interpretation

Adaptive optimization methods such as ADAGRAD, RMSPROP, and ADAM have been proposed to achieve a rapid training process with an element-wise scaling term on the learning rates. Though prevailing, they generalize worse than SGD, and may even fail to converge due to unstable and extreme learning rates. Recent work has proposed algorithms such as AMSGRAD to address this problem, but without achieving much improvement over existing methods. In our paper, we demonstrate that extreme learning rates can lead to poor performance. We provide new variants of ADAM and AMSGRAD, called ADABOUND and AMSBOUND respectively, which employ dynamic bounds on the learning rates to achieve a gradual, smooth transition from adaptive methods to SGD, and we give a theoretical proof of convergence. We further conduct experiments on a variety of popular tasks and models, something that is often lacking in previous work. Experimental results show that the new variants can close the generalization gap between adaptive methods and SGD while maintaining a higher learning speed early in training. Moreover, they can bring significant improvements over their prototypes, especially on complex deep networks. The implementation of the algorithms is available at https://github.com/Luolc/AdaBound.
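
The core mechanism is simple enough to sketch in a few lines. Below is a minimal NumPy illustration (not the authors' reference code) of one AdaBound-style update on a single parameter array: an Adam step whose element-wise step size is clipped into bounds that start wide and gradually tighten around a final SGD-like rate. The bound schedule, the gamma convergence-speed parameter, and the default values shown are illustrative assumptions modeled on the paper's description.

```python
import numpy as np

def adabound_step(param, grad, state, lr=1e-3, final_lr=0.1,
                  betas=(0.9, 0.999), gamma=1e-3, eps=1e-8):
    """One simplified AdaBound-style update for a single parameter array.

    The element-wise Adam step size lr / (sqrt(v_hat) + eps) is clipped into
    [lower(t), upper(t)]; both bounds converge to final_lr as t grows, so the
    update gradually behaves like SGD with learning rate final_lr.
    """
    beta1, beta2 = betas
    state["t"] += 1
    t = state["t"]

    # Adam-style first and second moment estimates with bias correction.
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad ** 2
    m_hat = state["m"] / (1 - beta1 ** t)
    v_hat = state["v"] / (1 - beta2 ** t)

    # Dynamic bounds: wide (Adam-like) early on, tight around final_lr later.
    lower = final_lr * (1 - 1 / (gamma * t + 1))
    upper = final_lr * (1 + 1 / (gamma * t))

    # Clip the per-element step size, then apply the update.
    step_size = np.clip(lr / (np.sqrt(v_hat) + eps), lower, upper)
    return param - step_size * m_hat

# Hypothetical usage: state = {"t": 0, "m": np.zeros_like(w), "v": np.zeros_like(w)},
# then w = adabound_step(w, grad_w, state) once per gradient step.
```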

Experimental Results

In this section, we turn to an empirical study of different models to compare new variants with popular optimization methods including SGD(M), ADAGRAD, ADAM, and AMSGRAD. We focus on three tasks: the MNIST image classification task (Lecun et al., 1998), the CIFAR-10 image classification task (Krizhevsky & Hinton, 2009), and the language modeling task on Penn Treebank (Marcus et al., 1993). We choose them due to their broad importance and availability of their architectures for reproducibility. The setup for each task is detailed in Table 2. We run each experiment three times with the specified initialization method from random starting points. A fixed budget on the number of epochs is assigned for training and the decay strategy is introduced in following parts. We choose the settings that achieve the lowest training loss at the end.


1. FEEDFORWARD NEURAL NETWORK

We train a simple fully connected neural network with one hidden layer for the multiclass classification problem on MNIST dataset. We run 100 epochs and omit the decay scheme for this experiment.
Figure 2 shows the learning curve for each optimization method on both the training and test set. We find that for training, all algorithms can achieve the accuracy approaching 100%. For the test part, SGD performs slightly better than adaptive methods ADAM and AMSGRAD. Our two proposed methods, ADABOUND and AMSBOUND, display slight improvement, but compared with their prototypes there are still visible increases in test accuracy.
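
As a usage sketch, the following PyTorch snippet sets up a one-hidden-layer perceptron of the kind described above with the released adabound package; the AdaBound(params, lr, final_lr) call follows the repository's README, while the hidden width of 256 and the training-step helper are illustrative assumptions (data loading is omitted).

```python
import torch
import torch.nn as nn
import adabound  # pip install adabound (package from the GitHub repository above)

# One-hidden-layer perceptron for flattened 28x28 MNIST digits.
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28 * 28, 256),  # hidden width is an illustrative choice
    nn.ReLU(),
    nn.Linear(256, 10),
)

# lr is the initial Adam-like step size; final_lr is the SGD-like rate that the
# dynamic bounds converge to. Argument names follow the repository's README.
optimizer = adabound.AdaBound(model.parameters(), lr=1e-3, final_lr=0.1)
criterion = nn.CrossEntropyLoss()

def train_step(images: torch.Tensor, labels: torch.Tensor) -> float:
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```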


2. CONVOLUTIONAL NEURAL NETWORK

Using DenseNet-121 (Huang et al., 2017) and ResNet-34 (He et al., 2016), we then consider the task of image classification on the standard CIFAR-10 dataset. In this experiment, we employ the fixed budget of 200 epochs and reduce the learning rates by 10 after 150 epochs.
DenseNet: We first run a DenseNet-121 model on CIFAR-10 and our results are shown in Figure 3. We can see that adaptive methods such as ADAGRAD, ADAM and AMSGRAD appear to perform better than the non-adaptive ones early in training. But by epoch 150 when the learning rates are decayed, SGDM begins to outperform those adaptive methods. As for our methods, ADABOUND and AMSBOUND, they converge as fast as adaptive ones and achieve a bit higher accuracy than SGDM on the test set at the end of training. In addition, compared with their prototypes, their performances are enhanced evidently with approximately 2% improvement in the test accuracy.
ResNet: Results for this experiment are reported in Figure 3. As is expected, the overall performance of each algorithm on ResNet-34 is similar to that on DenseNet-121. ADABOUND and AMSBOUND even surpass SGDM by 1%. Despite the relatively bad generalization ability of adaptive methods, our proposed methods overcome this drawback by allocating bounds for their learning rates and obtain almost the best accuracy on the test set for both DenseNet and ResNet on CIFAR-10.
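
One way to reproduce that decay scheme in PyTorch is a milestone scheduler that divides the learning rate by 10 after epoch 150 within the 200-epoch budget. The train_one_epoch helper and the data loader below are hypothetical placeholders; the same pattern applies whether optimizer is SGDM or an AdaBound instance.

```python
from torch.optim.lr_scheduler import MultiStepLR

# Fixed budget of 200 epochs; divide the learning rate by 10 after epoch 150.
scheduler = MultiStepLR(optimizer, milestones=[150], gamma=0.1)

for epoch in range(200):
    train_one_epoch(model, train_loader, optimizer)  # hypothetical training loop
    scheduler.step()
```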


3. RECURRENT NEURAL NETWORK

Finally, we conduct an experiment on the language modeling task with Long Short-Term Memory (LSTM) network (Hochreiter & Schmidhuber, 1997). From two experiments above, we observe that our methods show much more improvement in deep convolutional neural networks than in perceptrons. Therefore, we suppose that the enhancement is related to the complexity of the architecture and run three models with (L1) 1-layer, (L2) 2-layer and (L3) 3-layer LSTM respectively. We train them on Penn Treebank, running for a fixed budget of 200 epochs. We use perplexity as the metric to evaluate the performance and report results in Figure 4.
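
Perplexity, the metric used here, is simply the exponential of the average per-token cross-entropy (negative log-likelihood in nats). A minimal helper, for reference:

```python
import math

def perplexity(total_nll: float, num_tokens: int) -> float:
    """Exponential of the average per-token negative log-likelihood (in nats)."""
    return math.exp(total_nll / num_tokens)

# Example: a summed cross-entropy of 4605.17 nats over 1000 tokens
# gives exp(4.60517), i.e. a perplexity of about 100.
print(perplexity(4605.17, 1000))
```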

We find that in all models, ADAM has the fastest initial progress but stagnates in worse performance than SGD and our methods. Different from phenomena in previous experiments on the image classification tasks, ADABOUND and AMSBOUND do not display rapid speed at the early training stage but the curves are smoother than that of SGD.


Comparing L1, L2 and L3, we can easily notice a distinct difference of the improvement degree. In L1, the simplest model, our methods perform slightly 1.1% better than ADAM while in L3, the most complex model, they show evident improvement over 2.8% in terms of perplexity. It serves as evidence for the relationship between the model's complexity and the improvement degree.


Analysis of Experimental Results

To investigate the efficacy of our proposed algorithms, we select popular tasks from computer vision and natural language processing. Based on results shown above, it is easy to find that ADAM and AMSGRAD usually perform similarly and the latter does not show much improvement for most cases. Their variants, ADABOUND and AMSBOUND, on the other hand, demonstrate a fast speed of convergence compared with SGD while they also exceed two original methods greatly with respect to test accuracy at the end of training. This phenomenon exactly confirms our view mentioned in Section 3 that both large and small learning rates can influence the convergence.

Besides, we implement our experiments on models with different complexities, consisting of a perceptron, two deep convolutional neural networks and a recurrent neural network. The perceptron used on the MNIST is the simplest and our methods perform slightly better than others. As for DenseNet and ResNet, obvious increases in test accuracy can be observed. We attribute this difference to the complexity of the model. Specifically, for deep CNN models, convolutional and fully connected layers play different parts in the task. Also, different convolutional layers are likely to be responsible for different roles (Lee et al., 2009), which may lead to a distinct variation of gradients of parameters. In other words, extreme learning rates (huge or tiny) may appear more frequently in complex models such as ResNet. As our algorithms are proposed to avoid them, the greater enhancement of performance in complex architectures can be explained intuitively. The higher improvement degree on the LSTM with more layers on the language modeling task is also consistent with the above analysis.


PS: Time was tight, so the blogger's translation is not perfect; if you find any mistakes, please point them out. Thanks!
