1. Preface

The biggest difference between reinforcement learning and other kinds of learning is that it uses training information that evaluates the actions taken rather than instructs by giving correct actions.

1.1 A k-armed Bandit Problem


In a k-armed bandit problem, each action has an expected reward, called the value of that action. Let $A_t$ be the action selected at time step $t$ and $R_t$ the corresponding reward. The expected reward given that action $a$ is selected, $q_*(a)$, is then:
$$q_*(a)=E[R_t \mid A_t=a]$$

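Selecting the action whose estimated value is currently highest is called taking the greedy action: you are exploiting your current knowledge of the values of the actions. If instead you select one of the nongreedy actions, we say you are exploring. The action that looks best under the current estimates is not necessarily the best in the long run, so balancing exploitation and exploration is very important.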

1.2 Action-value Methods

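We start with the simplest way to estimate values, the sample-average method, in which each estimate is an average of the sample of relevant rewards. Methods that estimate action values and use the estimates to select actions are called action-value methods. The estimate is:
$$Q_t(a)=\frac{\sum_{i=1}^{t-1}R_i\cdot\mathbb{I}_{A_i=a}}{\sum_{i=1}^{t-1}\mathbb{I}_{A_i=a}}$$
$\mathbb{I}_{predicate}$ is a random variable that is 1 if the predicate is true and 0 otherwise. The numerator is the sum of the rewards received when action $a$ was taken during the first $t-1$ time steps, and the denominator is the number of times $a$ was taken. If the denominator is 0, $Q_t(a)$ is defined as some default value, such as 0.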
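The sample-average estimate can be sketched in a few lines of Python. This is only an illustration: the function name `sample_average` and the four-step history are made up for the example, and the default value of 0 follows the convention mentioned above.

```python
def sample_average(actions, rewards, a, default=0.0):
    """Estimate Q_t(a) from the (action, reward) pairs seen so far."""
    total = 0.0   # numerator: sum of rewards received when a was taken
    count = 0     # denominator: number of times a was taken
    for taken, reward in zip(actions, rewards):
        if taken == a:          # indicator I_{A_i = a}
            total += reward
            count += 1
    # Fall back to the default value when a has never been taken.
    return total / count if count > 0 else default

# Hypothetical history of four steps on a 3-armed bandit.
actions = [0, 2, 0, 1]
rewards = [1.0, 0.5, 3.0, -1.0]
print(sample_average(actions, rewards, 0))  # 2.0
```

Recomputing the sums at every step is wasteful for long histories, which is what motivates an incremental update.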

The simplest action selection rule is to select the action with the highest estimated value:
$$A_t=\argmax_a Q_t(a)$$
That is, the chosen action is the one whose estimated value is largest. Greedy selection always exploits, so we add a little randomness to make it explore as well: say, with small probability $\epsilon$, select randomly from among all the actions with equal probability. This is called the $\epsilon$-greedy method.
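A minimal sketch of $\epsilon$-greedy selection, assuming the estimates are held in a plain list. The function name `epsilon_greedy` and the random tie-breaking among equally good actions are choices made for this example, not something the notes prescribe.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Return an action index chosen epsilon-greedily from q_values."""
    if rng.random() < epsilon:
        # Explore: pick uniformly among all actions (the greedy one included).
        return rng.randrange(len(q_values))
    # Exploit: pick an action with the highest estimate, ties broken at random.
    best = max(q_values)
    return rng.choice([a for a, q in enumerate(q_values) if q == best])

rng = random.Random(0)
q = [0.2, 1.5, 0.7]                      # hypothetical estimates
picks = [epsilon_greedy(q, 0.1, rng) for _ in range(1000)]
print(picks.count(1) / 1000)             # fraction of steps on the greedy arm
```

With three actions and $\epsilon=0.1$, the greedy arm should be picked with probability $1-\epsilon+\epsilon/3$, since the random branch can also land on it.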

1.3 The 10-armed Testbed

$\epsilon$-greedy is not always better than greedy. If the reward variances were zero, then the greedy method would know the true value of each action after trying it once. But even in the deterministic case there is a large advantage to exploring if we weaken some of the other assumptions.


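The authors compare a greedy method with two $\epsilon$-greedy methods ($\epsilon=0.01$ and $\epsilon=0.1$), all of which estimate action values with the sample-average method. In the short run the greedy method improves its reward quickly, but it easily gets stuck performing suboptimal actions, so in the long run it does worse than the $\epsilon$-greedy methods.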

The greedy method finds the optimal action only about 33% of the time, whereas the $\epsilon$-greedy methods keep exploring. The $\epsilon=0.01$ method improved more slowly, but eventually would perform better than the $\epsilon=0.1$ method on both performance measures shown in the figure.
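The comparison can be reproduced roughly with a small simulation. The sketch below assumes the testbed setup described in the book (true values drawn from a standard normal distribution, unit-variance reward noise, 1000 steps per task); the function names, seeds, and number of runs are arbitrary choices for illustration, and the numbers it prints will not match the book's figure exactly.

```python
import random

def run_bandit(epsilon, steps=1000, k=10, seed=0):
    """Run one k-armed task; return the fraction of steps on the optimal arm."""
    rng = random.Random(seed)
    q_true = [rng.gauss(0.0, 1.0) for _ in range(k)]   # true values q*(a)
    optimal = max(range(k), key=lambda a: q_true[a])
    q_est = [0.0] * k      # sample-average estimates Q_t(a)
    counts = [0] * k       # number of times each arm was pulled
    hits = 0
    for _ in range(steps):
        if rng.random() < epsilon:
            a = rng.randrange(k)                       # explore
        else:
            best = max(q_est)                          # exploit, random ties
            a = rng.choice([i for i in range(k) if q_est[i] == best])
        reward = rng.gauss(q_true[a], 1.0)             # noisy reward
        counts[a] += 1
        q_est[a] += (reward - q_est[a]) / counts[a]    # running sample average
        hits += (a == optimal)
    return hits / steps

def average_optimal(epsilon, runs=100):
    """Average the optimal-action fraction over independent tasks."""
    return sum(run_bandit(epsilon, seed=s) for s in range(runs)) / runs

for eps in (0.0, 0.01, 0.1):
    print(eps, round(average_optimal(eps), 3))
```

Note that 1000 steps is a short horizon: the eventual advantage of $\epsilon=0.01$ over $\epsilon=0.1$ only shows up over much longer runs.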

1.4 Incremental Implementation





1.5 Tracking a Nonstationary Problem

Equation 2.4 handles stationary problems, that is, problems in which the reward probabilities the bandit assigns to a given action do not change over time.

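For nonstationary problems, however, recent rewards should count more than rewards from long ago. This is done with a constant step-size parameter $\alpha\in(0,1]$:
$$Q_{n+1}=Q_n+\alpha\left[R_n-Q_n\right]=(1-\alpha)^nQ_1+\sum_{i=1}^n\alpha(1-\alpha)^{n-i}R_i$$
This is a weighted average, called the exponential recency-weighted average, because the weights sum to one: $(1-\alpha)^n+\sum_{i=1}^n\alpha(1-\alpha)^{n-i}=1$. The weight $\alpha(1-\alpha)^{n-i}$ given to reward $R_i$ depends on how many steps ago, $n-i$, it was observed, and decays exponentially as that gap grows.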
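To make the weighting concrete, the following sketch applies the constant step-size update $Q_{n+1}=Q_n+\alpha(R_n-Q_n)$ step by step and compares the result with the explicit weighted sum. The function names, the initial estimate $Q_1=0$, and the three rewards are assumptions chosen so the arithmetic is easy to follow.

```python
def constant_step_estimate(rewards, alpha, q1=0.0):
    """Apply Q_{n+1} = Q_n + alpha * (R_n - Q_n) for each reward in turn."""
    q = q1
    for reward in rewards:
        q += alpha * (reward - q)
    return q

def recency_weights(n, alpha):
    """Return the weight on Q_1 and the list of weights on R_1..R_n."""
    w_q1 = (1 - alpha) ** n
    w_rewards = [alpha * (1 - alpha) ** (n - i) for i in range(1, n + 1)]
    return w_q1, w_rewards

alpha, q1 = 0.5, 0.0
rewards = [1.0, 2.0, 4.0]
w_q1, w_r = recency_weights(len(rewards), alpha)
closed_form = w_q1 * q1 + sum(w * r for w, r in zip(w_r, rewards))

print(w_q1, w_r)                                     # 0.125 [0.125, 0.25, 0.5]
print(constant_step_estimate(rewards, alpha, q1))    # 2.625
print(closed_form)                                   # 2.625
```

The most recent reward gets weight 0.5 and each earlier one half as much again, so old rewards fade quickly; a smaller $\alpha$ would spread the weight over a longer history.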

1.6 Optimistic Initial Values



1.7 Upper-Confidence-Bound Action Selection (UCB)

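Choosing which non-greedy action to try is a subtle question. Rather than picking among them indiscriminately, it is better to select among the non-greedy actions according to their potential for actually being optimal, taking into account both how close their estimates are to being maximal and the uncertainties in those estimates:
$$A_t=\argmax_a\left[Q_t(a)+c\sqrt{\frac{\ln t}{N_t(a)}}\right]$$
$c>0$: controls the degree of exploration
$N_t(a)$: the number of times action $a$ has been selected prior to time $t$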

The idea of this upper confidence bound (UCB) action selection is that the square-root term is a measure of the uncertainty or variance in the estimate of $a$'s value.

  1. As $N_t(a)$ increases, the uncertainty term decreases, because $N_t(a)$ appears in the denominator.
  2. Each time an action other than $a$ is selected, $t$ increases but $N_t(a)$ does not; because $t$ appears in the numerator, the uncertainty estimate increases.
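A small sketch of UCB selection, assuming the usual form of the rule, $Q_t(a)+c\sqrt{\ln t/N_t(a)}$. The function name `ucb_select`, the default $c=2$, and the convention of trying any untried action first are assumptions made for this example.

```python
import math

def ucb_select(q_values, counts, t, c=2.0):
    """Pick the action maximizing Q_t(a) + c * sqrt(ln t / N_t(a))."""
    # An action that was never tried has unbounded uncertainty, so try it first.
    for a, n in enumerate(counts):
        if n == 0:
            return a
    scores = [q + c * math.sqrt(math.log(t) / n)
              for q, n in zip(q_values, counts)]
    return max(range(len(scores)), key=lambda a: scores[a])

q = [1.0, 0.8]                          # hypothetical estimates
print(ucb_select(q, [50, 2], t=52))     # 1: the rarely tried arm gets a large bonus
print(ucb_select(q, [26, 26], t=52))    # 0: equal bonuses, the higher estimate wins
```

The first call illustrates point 1 of the list: arm 0 has been tried 50 times, so its bonus has shrunk and the less-sampled arm 1 is chosen despite its lower estimate.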

1.8 Gradient Bandit Algorithms

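In this section we consider learning a numerical preference for each action $a$, which we denote $H_t(a)$. The larger the preference, the more often that action is taken, but the preference has no interpretation in terms of reward. Only the relative preference of one action over another matters. The action probabilities are determined by a soft-max distribution (i.e., Gibbs or Boltzmann distribution):
$$\Pr\{A_t=a\}=\frac{e^{H_t(a)}}{\sum_{b=1}^ke^{H_t(b)}}=\pi_t(a)$$
Here $\pi_t(a)$ is the probability of taking action $a$ at time $t$. Initially all preferences are the same, typically $H_1(a)=0$ for all $a$, so that all actions are equally likely.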
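The point that only relative preferences matter can be checked with a short soft-max sketch. The function name `softmax` and the sample preference values are illustrative; subtracting the maximum before exponentiating is a common numerical-stability trick and does not change the result.

```python
import math

def softmax(preferences):
    """Map preferences H_t(a) to action probabilities pi_t(a)."""
    m = max(preferences)                          # shift for numerical stability
    exps = [math.exp(h - m) for h in preferences]
    total = sum(exps)
    return [e / total for e in exps]

print(softmax([0.0, 0.0, 0.0]))       # equal preferences give a uniform distribution
p = softmax([1.0, 2.0, 3.0])
q = softmax([101.0, 102.0, 103.0])    # same differences between preferences
print(p)
print(q)                              # matches p: adding a constant changes nothing
```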



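After selecting action $A_t$ and receiving reward $R_t$, the preferences are updated by:
$$H_{t+1}(A_t)=H_t(A_t)+\alpha(R_t-\bar{R}_t)(1-\pi_t(A_t))$$
$$H_{t+1}(a)=H_t(a)-\alpha(R_t-\bar{R}_t)\pi_t(a)\quad\text{for all }a\neq A_t$$

  1. $\alpha>0$: the step size
  2. $\bar{R}_t\in\mathbb{R}$: the average of the rewards received before time $t$, which serves as a baseline.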
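One step of the preference update might look like the sketch below. It assumes the standard gradient bandit rule: the selected action's preference moves by $\alpha(R_t-\bar{R}_t)(1-\pi_t(A_t))$ and every other preference moves by $-\alpha(R_t-\bar{R}_t)\pi_t(a)$. The function names and the sample numbers are hypothetical, and the baseline is passed in rather than tracked, to keep the example short.

```python
import math

def softmax(preferences):
    """Map preferences to action probabilities."""
    m = max(preferences)
    exps = [math.exp(h - m) for h in preferences]
    total = sum(exps)
    return [e / total for e in exps]

def gradient_bandit_update(preferences, action, reward, baseline, alpha):
    """Return new preferences after taking `action` and receiving `reward`."""
    pi = softmax(preferences)
    advantage = reward - baseline          # R_t minus the baseline
    new = []
    for a, h in enumerate(preferences):
        if a == action:
            new.append(h + alpha * advantage * (1 - pi[a]))
        else:
            new.append(h - alpha * advantage * pi[a])
    return new

h = [0.0, 0.0, 0.0]
h = gradient_bandit_update(h, action=1, reward=2.0, baseline=0.5, alpha=0.1)
print(h)   # action 1 goes up, the others go down by the same total amount
```

Because only the difference `reward - baseline` enters the update, adding the same constant to both leaves the step unchanged, which is why a baseline that tracks the average reward makes the algorithm insensitive to a shift in all rewards.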

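The accompanying figure compares the reward obtained with and without the baseline. In that experiment the true expected rewards are drawn with a mean of +4 instead of 0. This shifting up of all the rewards has absolutely no effect on the gradient bandit algorithm because of the reward baseline term, which instantaneously adapts to the new level.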

1.8.1 The Theory of Gradient Ascent

$$H_{t+1}(a)=H_t(a)+\alpha\frac{\partial E[R_t]}{\partial H_t(a)}$$


The gradient can then be written as an expectation, by summing over all possible values $x$ of the random variable $A_t$.
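The claim that the gradient bandit update is, in expectation, a gradient ascent step can be checked numerically. The sketch below assumes the soft-max policy and the update term $(R_t-\bar{R}_t)(\mathbb{I}_{A_t=a}-\pi_t(a))$; it computes the expected update by summing over all values $x$ of $A_t$ and compares it with a finite-difference estimate of $\partial E[R_t]/\partial H_t(a)$. All names and numbers are hypothetical.

```python
import math

def softmax(h):
    """Map preferences to action probabilities."""
    m = max(h)
    exps = [math.exp(x - m) for x in h]
    total = sum(exps)
    return [e / total for e in exps]

def expected_reward(h, q_true):
    """E[R_t] = sum over x of pi_t(x) * q*(x)."""
    return sum(p * q for p, q in zip(softmax(h), q_true))

def expected_update(h, q_true, baseline, a):
    """E[(R_t - baseline) * (I{A_t = a} - pi_t(a))], summed over x = A_t."""
    pi = softmax(h)
    return sum(pi[x] * (q_true[x] - baseline) * ((1.0 if x == a else 0.0) - pi[a])
               for x in range(len(h)))

def numeric_gradient(h, q_true, a, eps=1e-6):
    """Central finite difference of E[R_t] with respect to H_t(a)."""
    up = list(h)
    up[a] += eps
    down = list(h)
    down[a] -= eps
    return (expected_reward(up, q_true) - expected_reward(down, q_true)) / (2 * eps)

h = [0.3, -0.2, 1.0]          # hypothetical preferences
q_true = [1.0, 2.0, 0.5]      # hypothetical true action values
for a in range(3):
    print(a, expected_update(h, q_true, 0.7, a), numeric_gradient(h, q_true, a))
```

The two columns should agree to several decimal places for any baseline, because the baseline term is multiplied by $\sum_x\pi_t(x)(\mathbb{I}_{x=a}-\pi_t(a))=0$ and drops out of the expectation.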







1.9 Summary

  1. $\epsilon$-greedy methods choose randomly a small fraction of the time.
  2. UCB methods introduce no randomness, but achieve exploration through the confidence bounds, which favor actions that have been tried less often.
  3. Gradient bandit algorithms estimate not action values, but action preferences, and favor the more preferred actions in a graded, probabilistic manner using a soft-max distribution.

