A Close Reading of the Original Adadelta Paper

The original Adadelta paper is:
"ADADELTA: An Adaptive Learning Rate Method"

The core of the paper is Section 3, so this post focuses on that section.

Section 3. ADADELTA Method

[...] 1) the continual decay of learning rates throughout training, and 2) the need for a manually selected global learning rate.
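In other words, Adadelta is meant to fix two drawbacks of ADAGRAD:
1. the learning rate keeps decaying throughout training, and 2. a global learning rate has to be chosen by hand.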

In the ADAGRAD method the denominator accumulates the squared gradients from each iteration starting at the beginning of training. Since each term is positive, this accumulated sum continues to grow throughout training, effectively shrinking the learning rate on each dimension. After many iterations, this learning rate will become infinitesimally small.
This passage says that in ADAGRAD the effective learning rate gradually shrinks toward 0 as training goes on.
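The shrinking effect is easy to see numerically. The sketch below is only an illustration: the function name `adagrad_effective_lrs`, the constant gradient of 1.0, and the values `eta=0.1` and `eps=1e-8` are assumptions made for this example, not values taken from the paper.

```python
import math


def adagrad_effective_lrs(grads, eta=0.1, eps=1e-8):
    """Return the effective learning rate eta / sqrt(sum of g^2) at each step."""
    accumulated = 0.0
    rates = []
    for g in grads:
        accumulated += g * g  # squared gradients are non-negative, so this never shrinks
        rates.append(eta / (math.sqrt(accumulated) + eps))
    return rates


# Hypothetical input: a gradient that stays at 1.0 for 10,000 iterations.
rates = adagrad_effective_lrs([1.0] * 10000)
print(rates[0], rates[99], rates[-1])
```

Even though the gradient never changes, the effective learning rate keeps falling, roughly like 1/sqrt(t) in this setup.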

3.1. Idea 1: Accumulate Over Window
Instead of accumulating the sum of squared gradients over all time, we restricted the window of past gradients that are accumulated to be some fixed size $w$ (instead of size $t$ where $t$ is the current iteration as in ADAGRAD). With this windowed accumulation the denominator of ADAGRAD cannot accumulate to infinity and instead becomes a local estimate using recent gradients. This ensures that learning continues to make progress even after many iterations of updates have been done.
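That is, squared gradients are accumulated over a window of size w, rather than over all t previous iterations as in ADAGRAD. Because storing the previous w squared gradients is inefficient, the paper implements the window as an exponentially decaying average with decay constant ρ: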
$$E[g^2]_t=\rho E[g^2]_{t-1}+(1-\rho)g_t^2 \qquad (8)$$
$$RMS[g]_t=\sqrt{E[g^2]_t+\epsilon} \qquad (9)$$
$$\Delta x_t=-\frac{\eta}{RMS[g]_t}g_t \qquad (10)$$

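Substituting (8) into (9), and (9) into (10), gives part of the final pseudocode.
However, $\eta$ in (10) still has to be set by hand, which is what motivates Section 3.2.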
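Equations (8) to (10) chain together into a single update, which can be sketched as below. This is a minimal scalar sketch under stated assumptions: the function name `windowed_step`, the test function f(x) = x^2, and the values `eta=0.01`, `rho=0.95` and `eps=1e-6` are chosen for illustration only. Note that `eta` is still a hand-picked argument here.

```python
import math


def windowed_step(x, grad, avg_sq_grad, eta=0.01, rho=0.95, eps=1e-6):
    """Apply one update following Eqs. (8)-(10); returns (new_x, new_avg_sq_grad)."""
    avg_sq_grad = rho * avg_sq_grad + (1.0 - rho) * grad * grad  # Eq. (8)
    rms_g = math.sqrt(avg_sq_grad + eps)                          # Eq. (9)
    delta_x = -(eta / rms_g) * grad                               # Eq. (10)
    return x + delta_x, avg_sq_grad


# Hypothetical usage: minimize f(x) = x^2, whose gradient is 2x.
x, avg = 5.0, 0.0
for _ in range(2000):
    x, avg = windowed_step(x, grad=2.0 * x, avg_sq_grad=avg)
print(x)
```

Unlike the ADAGRAD accumulator, `avg_sq_grad` can decrease again when recent gradients are small, so the step size does not decay to zero on its own.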

3.2. Idea 2: Correct Units with Hessian Approximation
Newton's method (a second-order method) can be written as:
$$x_{t+1}=x_t-\frac{f'(x_t)}{f''(x_t)}$$
So in Newton's method we can regard $\frac{1}{f''(x_t)}$ as the learning rate.
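To make the "inverse second derivative as learning rate" reading concrete, here is a small sketch. The helper name `newton_step` and the quadratic f(x) = 3(x - 2)^2 are assumptions picked for this example.

```python
def newton_step(x, first_derivative, second_derivative):
    """One Newton update, written so that 1 / f''(x) plays the role of the learning rate."""
    learning_rate = 1.0 / second_derivative(x)
    return x - learning_rate * first_derivative(x)


# Hypothetical example: f(x) = 3 * (x - 2)**2, with its minimum at x = 2.
def f_prime(x):
    return 6.0 * (x - 2.0)


def f_double_prime(x):
    return 6.0


x_next = newton_step(10.0, f_prime, f_double_prime)
print(x_next)
```

For a quadratic, one step lands on the minimum (up to floating-point rounding), with no hand-tuned learning rate involved.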

In Newton's method we have (ignoring the sign, as the paper's units argument does):
$$\Delta x=\frac{\frac{\partial f}{\partial x}}{\frac{\partial^2 f}{\partial x^2}}$$
from which it follows that:
$$\frac{1}{\frac{\partial^2 f}{\partial x^2}}=\frac{\Delta x}{\frac{\partial f}{\partial x}}$$
(Personally I don't think this step adds much; it reads like padding in the paper.)
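As a quick sanity check of this rearrangement, here is a worked example on a quadratic. The function $f(x)=\frac{a}{2}x^2$ with $a>0$ is chosen purely for illustration and does not come from the paper. For $x\neq 0$:

$$f'(x)=ax,\qquad f''(x)=a,\qquad \Delta x=-\frac{f'(x)}{f''(x)}=-x$$
$$\left|\frac{\Delta x}{f'(x)}\right|=\frac{|x|}{a|x|}=\frac{1}{a}=\frac{1}{f''(x)}$$

So, up to sign, the ratio of an update to a gradient recovers the inverse curvature in this example.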

Since the RMS of the previous gradients is already represented in the denominator in Eqn. 10 we considered a measure of the $\Delta x$ quantity in the numerator.
This says that the denominator of Eq. (10) has already been dealt with (which I find redundant; it reads like padding).

$\Delta x_t$ for the current time step is not known, so we assume the curvature is locally smooth and approximate $\Delta x_t$ by compute the exponentially decaying RMS over a window of size w of previous $\Delta x$ to give the ADADELTA method.
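What does this passage mean?
It means:
we likewise use a window over $\Delta x$ to compute a reasonable value. In plain words, the authors simply decided, as a heuristic, to use a root mean square here as well.
That is where the $RMS[\Delta x]_{t-1}$ in the numerator comes from.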

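The final algorithm is given as Algorithm 1 in the paper.

Note:
In that algorithm, steps 4 and 6 feed into step 5, and step 5 feeds into step 7, which completes one update iteration.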
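The update loop that this note refers to can be sketched in Python as follows. This is a minimal scalar sketch, not the paper's reference implementation. The step numbers in the comments assume the numbering of Algorithm 1 in the paper (steps 3 to 7 inside the loop). The function name `adadelta_step`, the test function f(x) = x^2, and the values `rho=0.95` and `eps=1e-6` are assumptions for illustration.

```python
import math


def adadelta_step(x, g, avg_sq_grad, avg_sq_delta, rho=0.95, eps=1e-6):
    """One ADADELTA update for a scalar parameter; returns the new state."""
    # Step 4: accumulate the squared gradient, E[g^2]_t.
    avg_sq_grad = rho * avg_sq_grad + (1.0 - rho) * g * g
    rms_g = math.sqrt(avg_sq_grad + eps)
    # RMS[dx]_{t-1}: built from the update accumulator of the previous iteration.
    rms_delta_prev = math.sqrt(avg_sq_delta + eps)
    # Step 5: compute the update; no global learning rate appears here.
    delta_x = -(rms_delta_prev / rms_g) * g
    # Step 6: accumulate the squared update, E[dx^2]_t.
    avg_sq_delta = rho * avg_sq_delta + (1.0 - rho) * delta_x * delta_x
    # Step 7: apply the update.
    return x + delta_x, avg_sq_grad, avg_sq_delta


# Hypothetical usage: a few iterations on f(x) = x^2, whose gradient is 2x.
x, avg_g, avg_dx = 1.0, 0.0, 0.0  # both accumulators start at zero
for _ in range(50):
    g = 2.0 * x  # Step 3: compute the gradient
    x, avg_g, avg_dx = adadelta_step(x, g, avg_g, avg_dx)
print(x)
```

Because both accumulators start at zero, the first updates are scaled by roughly sqrt(eps), so progress is slow at first. In this sketch `eps` therefore acts as the only knob that sets the initial step size.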
