深度学习中常用的学习率衰减策略及tensorflow实现

引言

（1）分段常数衰减

（2）指数衰减

（3）自然指数衰减

（4）多项式衰减

（5）余弦衰减

（6）线性余弦衰减

（7）噪声线性余弦衰减

（8）倒数衰减

引言

学习率（learning rate,lr）是在神经网络的训练过程中一个很重要的超参数，对神经网络的训练效果与训练时间成本有很大影响。

学习率对训练效果的影响（主要体现在对网络的有效容量/参数搜索空间的影响上）：

学习率过大：导致参数更新步幅过大，迈过了很多候选参数，有可能会越过最优值。因此，从这个意义上讲，过大的学习率会降低模型的有效容量，缩小了神经网络的参数搜索空间。
学习率过小：因为神经网络的优化是一个非凸过程，损失函数曲线/超平面上存在许多局部极小值，鞍点，平滑点等。过小的学习率容易导致参数搜索过程中，使网络参数停留在一个很高的局部极小值上，不能继续搜索更好的、更偏向于全局的局部极小值。从这个意义上讲，学习率过小也会降低模型的有效容量。

学习率对训练时间成本的影响：

学习率过大，会导致参数优化过程中损失函数值震荡（或，在最终的极优值两侧来回摆动），导致网络不能收敛。
学习率过小，除了会导致训练速度慢以外，还容易导致模型停留在一个训练误差很高的局部极小值上，不利于寻找一个更低的（或更偏向于全局的）局部极小值。

在神经网络的训练过程中，常采用的一个策略就是使用学习率更新策略，使学习率随着模型训练的迭代次数逐渐衰减，这样既可以兼顾学习效率又能兼顾后期学习的稳定性：前期通过大学习率快速搜索，找到一个较好的（更倾向于全局最小的）局部区域，后期用较小的学习率在这个局部区域进行收敛。

主要的学习率更新策略有以下几种：

分段常数衰减
指数衰减
自然指数衰减
多项式衰减
余弦衰减
倒数衰减

tensorflow中的学习率衰减方法有：

tf.train.piecewise_constant　分段常数衰减
tf.train.exponential_decay　指数衰减
tf.train.natural_exp_decay　自然指数衰减
tf.train.polynomial_decay　多项式衰减
tf.train.cosine_decay　余弦衰减
tf.train.linear_cosine_decay　线性余弦衰减
tf.train.noisy_linear_cosine_decay　噪声线性余弦衰减
tf.train.inverse_time_decay　倒数衰减
函数返回衰减的学习率

（1）分段常数衰减

分段常数衰减是通过人工指定，在某个迭代区间使用某个学习率。一般是初始学习率较高，后面随着迭代次数增加逐渐降低。

tensorflow中的函数介绍

tf.train.piecewise_constant(x,          # 标量，global_step，当前的迭代次数，boundaries, # 列表，更换学习率的迭代次数边界，如[40000,80000]values,     # 学习率列表，values的长度比boundaries的长度多一个,如[1e-3,1e-4,1e-5]name=None   #
)
"""
如果按照以上的示例，模型训练时的学习率变化表现是：当globalstep<40000时，lr=1e-3,
当40000<globalstep<80000时，lr=1e-4
当80000<globalstep时，lr=1e-5
"""

使用示例：

#!/usr/bin/python
# coding:utf-8# piecewise_constant 阶梯式下降法
import matplotlib.pyplot as plt
import tensorflow as tf
global_step = tf.Variable(0, name='global_step', trainable=False)
boundaries = [10, 20, 30]
learing_rates = [0.1, 0.07, 0.025, 0.0125]
y = []
N = 40
with tf.Session() as sess:sess.run(tf.global_variables_initializer())for global_step in range(N):learing_rate = tf.train.piecewise_constant(global_step, boundaries=boundaries, values=learing_rates)lr = sess.run([learing_rate])y.append(lr[0])x = range(N)
plt.plot(x, y, 'r-', linewidth=2)
plt.title('piecewise_constant')
plt.show()

学习率曲线示意图：

（2）指数衰减

人工设定衰减系数，使训练过程每经历decay_steps次迭代，学习率都乘以一个衰减系数decay_rate，从而达到学习率指数下降的目的。

衰减公式：

$lr_{decayed} = learning\_rate*decay\_rate^{\frac{global_step}{decay\_steps}}$

tensorflow中的函数介绍：

tf.train.exponential_decay(learning_rate, # 初始学习率global_step,   # 当前训练迭代的次数decay_steps,   # 定义衰减周期，跟参数staircase配合，可以在decay_step个训练轮次内保持学习率不变decay_rate,    # 衰减率系数staircase=False, # 定义是否是阶梯型衰减，还是连续衰减，默认是False，即连续衰减（标准的指数型衰减）name=None
)
"""
当starecase=False时，相当于默认decay_steps=1，没执行一次迭代，就执行一次指数衰减。个人认为设置starecase=True较好，因为指数下降实在是太快，每迭代一次就执行一次衰减的话学习率下降太快了。
"""

使用示例：

#!/usr/bin/python
# coding:utf-8
import matplotlib.pyplot as plt
import tensorflow as tf
global_step = tf.Variable(0, name='global_step', trainable=False)y = []
z = []
N = 200
with tf.Session() as sess:sess.run(tf.global_variables_initializer())for global_step in range(N):# 阶梯型衰减learing_rate1 = tf.train.exponential_decay(learning_rate=0.5, global_step=global_step, decay_steps=10, decay_rate=0.9, staircase=True)# 标准指数型衰减learing_rate2 = tf.train.exponential_decay(learning_rate=0.5, global_step=global_step, decay_steps=10, decay_rate=0.9, staircase=False)lr1 = sess.run([learing_rate1])lr2 = sess.run([learing_rate2])y.append(lr1[0])z.append(lr2[0])x = range(N)
fig = plt.figure()
ax = fig.add_subplot(111)
ax.set_ylim([0, 0.55])
plt.plot(x, y, 'r-', linewidth=2)
plt.plot(x, z, 'g-', linewidth=2)
plt.title('exponential_decay')
ax.set_xlabel('step')
ax.set_ylabel('learing rate')
plt.show()

学习率变化曲线如下图所示，

图中红线表示starecase=True,绿线表示starecase=False

（3）自然指数衰减

自然指数衰减是指数衰减的一种特殊情况，学习率也是跟当前的训练轮次指数相关，只不过以 e 为底数。

衰减公式：

$decayed\_learning\_rate = learning\_rate * e^{-decay\_rate * global\_step}$

tensorflow中的函数介绍：

tf.train.natural_exp_decay(learning_rate,    # 初始学习率global_step,      # 全局迭代次数decay_steps,      # 每隔decay_steps执行一次学习率指数衰减decay_rate,       # 标量。衰减系数staircase=False,  # 是否执行阶梯式衰减，默认是False，即连续衰减name=None
)

使用示例：

# coding:utf-8import matplotlib.pyplot as plt
import tensorflow as tfnum_epoch = tf.Variable(0, name='global_step', trainable=False)y = []
z = []
w = []
m = []
N = 200with tf.Session() as sess:sess.run(tf.global_variables_initializer())for num_epoch in range(N):# 阶梯型衰减learing_rate1 = tf.train.natural_exp_decay(learning_rate=0.5, global_step=num_epoch, decay_steps=10, decay_rate=0.9, staircase=True)# 标准指数型衰减learing_rate2 = tf.train.natural_exp_decay(learning_rate=0.5, global_step=num_epoch, decay_steps=10, decay_rate=0.9, staircase=False)# 阶梯型指数衰减learing_rate3 = tf.train.exponential_decay(learning_rate=0.5, global_step=num_epoch, decay_steps=10, decay_rate=0.9, staircase=True)# 标准指数衰减learing_rate4 = tf.train.exponential_decay(learning_rate=0.5, global_step=num_epoch, decay_steps=10, decay_rate=0.9, staircase=False)lr1 = sess.run([learing_rate1])lr2 = sess.run([learing_rate2])lr3 = sess.run([learing_rate3])lr4 = sess.run([learing_rate4])y.append(lr1)z.append(lr2)w.append(lr3)m.append(lr4)x = range(N)
fig = plt.figure()
ax = fig.add_subplot(111)
ax.set_ylim([0, 0.55])plt.plot(x, y, 'r-', linewidth=2)
plt.plot(x, z, 'g-', linewidth=2)
plt.plot(x, w, 'r-', linewidth=2)
plt.plot(x, m, 'g-', linewidth=2)plt.title('natural_exp_decay')
ax.set_xlabel('step')
ax.set_ylabel('learing rate')
plt.show()

学习率曲线示意图：

左下部分的两条曲线是自然指数衰减，右上部分的两条曲线是指数衰减，可见自然指数衰减对学习率的衰减程度要远大于一般的指数衰减，一般用于可以较快收敛的网络，或者是训练数据集比较大的场合。

（4）多项式衰减

多项式衰减是这样一种衰减机制：定义一个初始的学习率，一个最低的学习率。学习率从初始学习率逐渐降低到最低的学习率。当降低到最低学习率后，可以根据参数cycle=True/False设置，选择（1）False,一直保持使用最低学习率，（2）True，把学习率从最低学习率再上升到一个新的较高学习率，再次执行衰减，并循环往复，形成一个反复的升降过程。

衰减公式：

当cycle=False时，学习率计算公式如下：

_global_step_ = min(global_step, decay_steps)
decayed_learning_rate = (learning_rate - end_learning_rate) *(1 - _global_step_ / decay_steps) ^ (power)

+ end_learning_rate

公式理解：由公式可见，当global_step>decay_steps时， _global_step_=decay_steps，==》0的任何次幂等于0

==》decayed_learning_rate=end_learning_rate

当cycle=True时，学习率计算公式如下：

_decay_steps_ = decay_steps * ceil(global_step / decay_steps) # ceil表示向上取整
decayed_learning_rate = (learning_rate - end_learning_rate) *(1 - global_step / _decay_steps_) ^ (power)

+ end_learning_rate

公式理解：由公式可见，随着global_step逐渐增大，_decay_steps_也周期性的增大。

当 global_step > decay_steps 时，global_step / _decay_steps_ 始终大于等于0，小于1，周期性的由小变大。

从而 (1 - global_step / _decay_steps_) 周期性的由大变小，decayed_learning_rate也周期性的由大变小。

当global_step恰好是decay_steps的倍数时，(1 - global_step / _decay_steps_)=0，

decayed_learning_rate = learning_rate

tensorflow中的函数介绍：

tf.train.polynomial_decay(learning_rate, # 标量，初始学习率,如0.1global_step,   # 标量，训练的迭代次数decay_steps,   # end_learning_rate=0.0001, # 标量，最小的终止学习率power=1.0, # 多项式指数，默认是线性的，取值为1cycle=False, # bool变量，当超过decay_steps后是否循环执行name=None
)

使用示例：

# coding:utf-8
import matplotlib.pyplot as plt
import tensorflow as tfy = []
z = []
N = 200global_step = tf.Variable(0, name='global_step', trainable=False)with tf.Session() as sess:sess.run(tf.global_variables_initializer())for global_step in range(N):# cycle=Falselearing_rate1 = tf.train.polynomial_decay(learning_rate=0.1, global_step=global_step, decay_steps=50,end_learning_rate=0.01, power=0.5, cycle=False)# cycle=Truelearing_rate2 = tf.train.polynomial_decay(learning_rate=0.1, global_step=global_step, decay_steps=50,end_learning_rate=0.01, power=0.5, cycle=True)lr1 = sess.run([learing_rate1])lr2 = sess.run([learing_rate2])y.append(lr1)z.append(lr2)x = range(N)
fig = plt.figure()
ax = fig.add_subplot(111)
plt.plot(x, z, 'g-', linewidth=2)
plt.plot(x, y, 'r--', linewidth=2)
plt.title('polynomial_decay')
ax.set_xlabel('step')
ax.set_ylabel('learing rate')
plt.show()

学习率变化曲线如下图所示：

红线表示cycle=False,当迭代次数超过 decay_steps后，保持end_learning_rate不再改变
绿线表示cycle=True,当迭代次数超过decay_steps后，学习率从end_learning_rate上升到一个数值后，再次执行衰减。
多项式衰减中设置学习率可以往复升降的目的是为了防止神经网络后期训练的学习率过小，导致网络参数陷入某个局部最优解出不来，设置学习率升高机制，有可能使网络跳出局部最优解。

（5）余弦衰减

余弦衰减的衰减机制跟余弦函数相关，形状也大体上是余弦形状。

计算公式如下：

_global_step_ = min(global_step, decay_steps)

cosine_decay = 0.5 * (1 + cos(pi * _global_step_ / decay_steps))

decayed = (1 - alpha) * cosine_decay + alpha

decayed_learning_rate = learning_rate * decayed

公式解释：在cosine_decay表达式中，cos部分从cos(0)逐渐降低增加到cos(pi)后保持不变，导致cosine_decay从1沿余弦函数曲线下降到0后维持不变，主要体现为学习率的平滑变化

tensorflow中的函数介绍：

tf.train.cosine_decay(learning_rate, # 初始学习率global_step,   # 全局迭代次数decay_steps,   # 衰减步数，即从初始学习率衰减到最低学习率时的迭代次数alpha=0.0,     # name=None
)

（6）线性余弦衰减

计算公式

global_step = min(global_step, decay_steps)
linear_decay = (decay_steps - global_step) / decay_steps)
cosine_decay = 0.5 * (1 + cos(pi * 2 * num_periods * global_step / decay_steps))
decayed = (alpha + linear_decay) * cosine_decay + beta
decayed_learning_rate = learning_rate * decayed

tensorflow中的函数介绍：

tf.train.linear_cosine_decay(learning_rate,   #  The initial learning rate.global_step,     #  Global step to use for the decay computation.decay_steps,     #  Number of steps to decay over.num_periods=0.5, #  Number of periods in the cosine part of the decay.alpha=0.0,beta=0.001,name=None
)

（7）噪声线性余弦衰减

计算公式：

_global_step_ = min(global_step, decay_steps)
linear_decay = (decay_steps - _global_step_) / decay_steps)
cosine_decay = 0.5 * (1 + cos(pi * 2 * num_periods * _global_step_ / decay_steps))
decayed = (alpha + linear_decay + eps_t) * cosine_decay + beta
decayed_learning_rate = learning_rate * decayed

tensorflow中的函数介绍：

tf.train.noisy_linear_cosine_decay(learning_rate, # The initial learning rate.global_step,   # Global step to use for the decay computation.decay_steps,   # Number of steps to decay over.initial_variance=1.0,  # initial variance for the noise. variance_decay=0.55,   # decay for the noise's variance. See computation above.num_periods=0.5,       # Number of periods in the cosine part of the decay. alpha=0.0,beta=0.001,name=None
)

三种余弦衰减函数的使用示例：

# coding:utf-8
import matplotlib.pyplot as plt
import tensorflow as tfy = []
z = []
w = []
N = 200
global_step = tf.Variable(0, name='global_step', trainable=False)with tf.Session() as sess:sess.run(tf.global_variables_initializer())for global_step in range(N):# 余弦衰减learing_rate1 = tf.train.cosine_decay(learning_rate=0.1, global_step=global_step, decay_steps=50)# 线性余弦衰减learing_rate2 = tf.train.linear_cosine_decay(learning_rate=0.1, global_step=global_step, decay_steps=50,num_periods=0.2, alpha=0.5, beta=0.2)# 噪声线性余弦衰减learing_rate3 = tf.train.noisy_linear_cosine_decay(learning_rate=0.1, global_step=global_step, decay_steps=50,initial_variance=0.01, variance_decay=0.1, num_periods=0.2, alpha=0.5, beta=0.2)lr1 = sess.run([learing_rate1])lr2 = sess.run([learing_rate2])lr3 = sess.run([learing_rate3])y.append(lr1)z.append(lr2)w.append(lr3)x = range(N)
fig = plt.figure()
ax = fig.add_subplot(111)
plt.plot(x, z, 'b-', linewidth=2)
plt.plot(x, y, 'r-', linewidth=2)
plt.plot(x, w, 'g-', linewidth=2)
plt.title('cosine_decay')
ax.set_xlabel('step')
ax.set_ylabel('learing rate')
plt.show()

三种余弦衰减的学习率曲线如下图所示：

红色标准余弦衰减（tf.train.cosine_decay()），学习率从初始曲线过渡到最低学习率;
蓝色线性余弦衰减（tf.train.linear_cosine_decay()），学习率从初始线性过渡到最低学习率;
绿色噪声线性余弦衰减（tf.train.noisy_linear_cosine_decay()），在线性余弦衰减基础上增加了随机噪声;

（8）倒数衰减

倒数衰减指的是一个变量的大小与另一个变量的大小成反比的关系，具体到神经网络中就是学习率的大小跟训练次数有一定的反比关系。

计算公式：

decayed_learning_rate =learning_rate/(1+decay_rate* global_step/decay_step)

tensorflow中的函数介绍：


tf.train.inverse_time_decay(learning_rate,  # 初始学习率global_step,    # 用于衰减计算的全局步数decay_steps,    # 衰减步数decay_rate,     # 衰减率staircase=False,  # 是否应用离散阶梯型衰减（否则为连续型）name=None
)

使用示例：

#!/usr/bin/python
# coding:utf-8import matplotlib.pyplot as plt
import tensorflow as tf
y = []
z = []
N = 200
global_step = tf.Variable(0, name='global_step', trainable=False)with tf.Session() as sess:sess.run(tf.global_variables_initializer())for global_step in range(N):# 阶梯型衰减learing_rate1 = tf.train.inverse_time_decay(learning_rate=0.1, global_step=global_step, decay_steps=20,decay_rate=0.2, staircase=True)# 连续型衰减learing_rate2 = tf.train.inverse_time_decay(learning_rate=0.1, global_step=global_step, decay_steps=20,decay_rate=0.2, staircase=False)lr1 = sess.run([learing_rate1])lr2 = sess.run([learing_rate2])y.append(lr1[0])z.append(lr2[0])x = range(N)
fig = plt.figure()
ax = fig.add_subplot(111)
plt.plot(x, z, 'r-', linewidth=2)
plt.plot(x, y, 'g-', linewidth=2)
plt.title('inverse_time_decay')
ax.set_xlabel('step')
ax.set_ylabel('learing rate')
plt.show()

学习率变化曲线示意图：

倒数衰减不固定最小学习率，迭代次数越多，学习率越小。

参考：学习率衰减方法

tensorflow中常用学习率更新策略