1. Preface

The most important feature distinguishing reinforcement learning from other types of learning is that it uses training information that evaluates the actions taken rather than instructs by giving correct actions.

1.1 A k-armed Bandit Problem

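Suppose you are repeatedly faced with a choice among k different options. After each choice you receive a numerical reward that depends on the option you selected, and your objective is to maximize the total reward accumulated over some period of time. A concrete example: a gambler walks into a casino and sees a row of slot machines that look identical but pay out with different probabilities. He does not know the payout distribution of any machine, so which machine should he play at each step to maximize his winnings? This is the multi-armed bandit problem.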

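In the k-armed bandit problem, each action has an expected reward, called the value of that action. Denote the action selected at time step $t$ by $A_t$ and the corresponding reward by $R_t$. The value of an arbitrary action $a$, written $q_*(a)$, is the expected reward given that $a$ is selected:
$$q_*(a) = E[R_t \mid A_t = a]$$
If we knew the value of every action, we would simply always select the action with the highest value. Since we do not, we have to estimate the values. Denote the estimated value of action $a$ at time step $t$ by $Q_t(a)$; we want $Q_t(a)$ to be as close as possible to $q_*(a)$.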

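At any time step there is at least one action whose estimated value is greatest; this is the greedy action. Selecting it means you are exploiting your current knowledge of the values of the actions. If instead you select one of the nongreedy actions, we say you are exploring. Exploiting maximizes the expected reward on the current step, but it is not necessarily the best choice in the long run, because exploring can improve the estimates of the other actions. Balancing exploitation and exploration is therefore an important problem.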

1.2 Action-value Methods

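We start with the simplest way to estimate values, the sample-average method, in which each estimate is an average of the sample of relevant rewards. Methods that estimate action values and use those estimates to select actions are collectively called action-value methods. The sample-average estimate is:
$$Q_t(a) = \frac{\sum_{i=1}^{t-1} R_i \cdot \mathbb{I}_{A_i = a}}{\sum_{i=1}^{t-1} \mathbb{I}_{A_i = a}} \tag{2.1}$$
$\mathbb{I}_{predicate}$ is a random variable that is 1 if the predicate is true and 0 otherwise. If the denominator is zero, $Q_t(a)$ is defined as some default value, such as 0. In words, the numerator is the sum of the rewards received when $a$ was taken during the first $t-1$ time steps, and the denominator is the number of times $a$ was taken during those steps.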
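As a minimal sketch of the sample-average estimate, the function below averages the rewards of the steps on which a given action was taken and falls back to a default when the action has never been taken. The function name `sample_average` and the example history are assumptions made for illustration.

```python
def sample_average(actions, rewards, a, default=0.0):
    """Estimate Q_t(a) from the history of the first t-1 steps."""
    # Rewards received on the steps where action `a` was taken.
    relevant = [r for act, r in zip(actions, rewards) if act == a]
    if not relevant:
        # The denominator is zero, so fall back to the default value.
        return default
    return sum(relevant) / len(relevant)


# Hypothetical history: action indices and the rewards they produced.
actions = [0, 1, 0, 2, 0]
rewards = [1.0, 0.5, 2.0, -1.0, 3.0]
q0 = sample_average(actions, rewards, 0)  # average of 1.0, 2.0 and 3.0
q3 = sample_average(actions, rewards, 3)  # action 3 was never taken
```

Storing the full history is only for clarity here; it makes the memory cost grow with the number of steps.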

The simplest action selection rule is to select the action with the highest estimated value, which can be written as:
$$A_t = \operatorname*{arg\,max}_a Q_t(a)$$
That is, we pick the action $a$ whose estimate $Q_t(a)$ is largest. Greedy selection always exploits, so to make it explore as well we add a little randomness: behave greedily most of the time, but with small probability $\epsilon$ select randomly from among all the actions with equal probability. This is called the $\epsilon$-greedy method.
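A short sketch of $\epsilon$-greedy selection, assuming the estimates are held in a plain list indexed by action. The helper name `epsilon_greedy` and the random tie-breaking among equal estimates are illustrative choices, not something the notes prescribe.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick an action index from the estimated values `q_values`."""
    if rng.random() < epsilon:
        # Explore: every action is equally likely, including the greedy one.
        return rng.randrange(len(q_values))
    # Exploit: break ties among maximal estimates at random.
    best = max(q_values)
    candidates = [a for a, q in enumerate(q_values) if q == best]
    return rng.choice(candidates)


rng = random.Random(0)
q = [0.2, 1.5, 0.7]
greedy_pick = epsilon_greedy(q, epsilon=0.0, rng=rng)
picks = [epsilon_greedy(q, epsilon=0.1, rng=rng) for _ in range(1000)]
```

With $\epsilon = 0$ the rule reduces to pure greedy selection; with $\epsilon = 0.1$ the greedy action is still chosen on roughly 93% of the steps here, since a random pick can also land on it.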

1.3 The 10-armed Testbed

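$\epsilon$-greedy is not always better than greedy. If the reward variances were zero, then the greedy method would know the true value of each action after trying it once. But even in the deterministic case there is a large advantage to exploring if we weaken some of the other assumptions.
The testbed is a set of randomly generated 10-armed bandit problems, so each problem has 10 actions. For each problem, the true values $q_*(a)$ are drawn from a standard normal distribution (mean 0, variance 1), and the reward for taking action $a$ is drawn from a normal distribution with mean $q_*(a)$ and variance 1. The reward distributions of one such problem can be visualized as a violin plot.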

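Each run applies a learning method to one of these bandit problems for 1000 time steps. This is repeated for 2000 independent runs, each with a different bandit problem, and the results are averaged.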
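One bandit problem of this kind could be generated as sketched below, assuming the setup of the book's testbed: true values $q_*(a)$ drawn from $N(0, 1)$ and rewards drawn from $N(q_*(a), 1)$. The names `make_bandit` and `pull` are hypothetical.

```python
import random


def make_bandit(k=10, rng=random):
    """Draw true values q*(a) ~ N(0, 1) and return (q_star, pull)."""
    q_star = [rng.gauss(0.0, 1.0) for _ in range(k)]

    def pull(a):
        # The reward for action a is noisy around its true value: N(q*(a), 1).
        return rng.gauss(q_star[a], 1.0)

    return q_star, pull


rng = random.Random(42)
q_star, pull = make_bandit(rng=rng)
optimal_action = max(range(len(q_star)), key=lambda a: q_star[a])
# Averaging many pulls of one arm should approach its true value.
mean_reward = sum(pull(optimal_action) for _ in range(5000)) / 5000
```

A full experiment would create a fresh problem like this for each of the 2000 runs and average the per-step rewards across runs.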

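The authors compare a greedy method with two $\epsilon$-greedy methods ($\epsilon=0.01$ and $\epsilon=0.1$), all of which estimate action values with the sample-average method. In the short term the greedy method improves its reward fastest, but it easily gets stuck performing suboptimal actions, so in the long run it does worse than the $\epsilon$-greedy methods.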

The greedy method found the optimal action only about 33% of the time. The $\epsilon$-greedy methods, by contrast, keep exploring. The $\epsilon=0.01$ method improved more slowly, but eventually would perform better than the $\epsilon=0.1$ method on both performance measures shown in the figure.

1.4 Incremental Implementation

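To compute the estimate $Q_t(a)$ efficiently, concentrate on a single action in equation (2.1). Let $R_i$ denote the reward received after the $i$-th selection of this action, and let $Q_n$ denote the estimate of its value after it has been selected $n-1$ times. Equation (2.1) then simplifies to:
$$Q_n = \frac{R_1 + R_2 + \cdots + R_{n-1}}{n-1}$$

This can be rewritten incrementally, so that no past rewards need to be stored:
$$Q_{n+1} = \frac{1}{n}\sum_{i=1}^{n} R_i = \frac{1}{n}\Big(R_n + (n-1)Q_n\Big) = Q_n + \frac{1}{n}\big[R_n - Q_n\big] \tag{2.3}$$

Equation (2.3) has the general form:
$$NewEstimate \leftarrow OldEstimate + StepSize\,\big[Target - OldEstimate\big] \tag{2.4}$$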

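The $\epsilon$-greedy algorithm with incremental sample averages proceeds as follows:
Initialize, for every action $a$: $Q(a) \leftarrow 0$, $N(a) \leftarrow 0$
Loop forever:
With probability $1-\epsilon$ set $A \leftarrow \operatorname*{arg\,max}_a Q(a)$ (breaking ties randomly), otherwise set $A$ to a random action
$R \leftarrow bandit(A)$
$N(A) \leftarrow N(A) + 1$
$Q(A) \leftarrow Q(A) + \frac{1}{N(A)}\big[R - Q(A)\big]$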
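A minimal Python sketch of such an $\epsilon$-greedy loop, using the incremental update $Q \leftarrow Q + \frac{1}{N}(R - Q)$. The true values in `q_star`, the Gaussian reward noise, and the function name `run_bandit` are assumptions for illustration; ties are broken by taking the first maximal estimate rather than a random one.

```python
import random


def run_bandit(q_star, steps=1000, epsilon=0.1, seed=0):
    """Epsilon-greedy bandit loop with incremental sample averages."""
    rng = random.Random(seed)
    k = len(q_star)
    Q = [0.0] * k  # value estimates
    N = [0] * k    # selection counts
    total = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            a = rng.randrange(k)                   # explore
        else:
            a = max(range(k), key=lambda i: Q[i])  # exploit (first max wins)
        r = rng.gauss(q_star[a], 1.0)              # reward ~ N(q*(a), 1)
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]                  # incremental sample average
        total += r
    return Q, N, total / steps


q_star = [0.0, 0.5, 2.0, -1.0]  # hypothetical true action values
Q, N, avg_reward = run_bandit(q_star, steps=5000)
```

Only the current estimate and count per action are kept, so memory and per-step computation stay constant no matter how long the run is.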

1.5 Tracking a Nonstationary Problem

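The sample-average update of equation (2.3) is appropriate for stationary problems, in which the reward probabilities of the bandit for a given action do not change over time.

For nonstationary problems it makes sense to give more weight to recent rewards than to long-past ones. A common way to do this is to use a constant step-size parameter $\alpha \in (0, 1]$:
$$Q_{n+1} = Q_n + \alpha\big[R_n - Q_n\big] \tag{2.5}$$

Unrolling the recursion, the estimate can be expressed as:
$$Q_{n+1} = (1-\alpha)^n Q_1 + \sum_{i=1}^{n} \alpha(1-\alpha)^{n-i} R_i \tag{2.6}$$

This is a weighted average, because the weights sum to one: $(1-\alpha)^n + \sum_{i=1}^{n}\alpha(1-\alpha)^{n-i} = 1$. The weight $\alpha(1-\alpha)^{n-i}$ given to reward $R_i$ depends on how many steps ago, $n-i$, it was observed, and it decays exponentially as that number grows. For this reason the method is sometimes called an exponential recency-weighted average.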
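The equivalence between the constant step-size recursion $Q_{n+1} = Q_n + \alpha[R_n - Q_n]$ and its expanded weighted-average form can be checked numerically. The sketch below uses a made-up reward sequence and hypothetical helper names.

```python
def constant_alpha_estimate(q1, rewards, alpha):
    """Apply Q_{n+1} = Q_n + alpha * (R_n - Q_n) over a reward sequence."""
    q = q1
    for r in rewards:
        q += alpha * (r - q)
    return q


def recency_weights(n, alpha):
    """Weight on Q_1, then the weights on R_1..R_n in the expanded form."""
    w_q1 = (1 - alpha) ** n
    w_rewards = [alpha * (1 - alpha) ** (n - i) for i in range(1, n + 1)]
    return w_q1, w_rewards


rewards = [1.0, 0.0, 2.0, 4.0]  # illustrative rewards R_1..R_4
alpha, q1 = 0.5, 3.0
incremental = constant_alpha_estimate(q1, rewards, alpha)
w_q1, w_r = recency_weights(len(rewards), alpha)
closed_form = w_q1 * q1 + sum(w * r for w, r in zip(w_r, rewards))
```

Both computations give 2.75 for this input, and the weights add up to 1, with the most recent reward carrying the largest weight.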

1.6 Optimistic Initial Values

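All of these methods depend to some extent on the initial estimates $Q_1(a)$; in statistical terms, they are biased by their initial estimates. For the sample-average method of equation (2.3), the bias disappears once every action has been selected at least once. For methods with constant $\alpha$, the bias is permanent, though its influence decreases over time, as equation (2.6) shows.

In practice this bias is usually not a problem, and it can be used to encourage exploration: with an initial estimate of $+5$, far above any reward the testbed is likely to produce, every early reward is disappointing, so even a greedy learner ends up trying all the actions. The experiment below compares initial estimates of 0 and 5.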
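The effect of the initial estimate on a purely greedy learner can be illustrated with a small sketch. It assumes a constant step size $\alpha = 0.1$, Gaussian rewards with unit variance, and made-up true values; the function name `greedy_run` is hypothetical.

```python
import random


def greedy_run(q_star, q_init, alpha=0.1, steps=200, seed=0):
    """Purely greedy selection with a constant step size; returns tried arms."""
    rng = random.Random(seed)
    k = len(q_star)
    Q = [q_init] * k
    tried = set()
    for _ in range(steps):
        a = max(range(k), key=lambda i: Q[i])  # always take the greedy action
        r = rng.gauss(q_star[a], 1.0)
        Q[a] += alpha * (r - Q[a])             # constant step-size update
        tried.add(a)
    return tried


q_star = [0.1, -0.3, 1.2, 0.4, -0.8]  # hypothetical true action values
tried_optimistic = greedy_run(q_star, q_init=5.0)
tried_neutral = greedy_run(q_star, q_init=0.0)
```

With the optimistic start, each pulled arm's estimate drops below 5 after its first reward, so the learner moves on to an arm it has not tried yet. With a start of 0 there is no such pressure, and some arms may never be sampled.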

1.7 Upper-Confidence-Bound Action Selection (UCB)

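Exploration is needed because the estimates are uncertain, but $\epsilon$-greedy tries nongreedy actions indiscriminately. It would be better to select among the nongreedy actions according to their potential for actually being optimal, taking into account both how close their estimates are to being maximal and the uncertainties in those estimates. One way to do this is:
$$A_t = \operatorname*{arg\,max}_a \left[ Q_t(a) + c\sqrt{\frac{\ln t}{N_t(a)}} \right] \tag{2.10}$$

$c>0$: controls the degree of exploration
$N_t(a)$: the number of times action $a$ has been selected prior to time $t$; if $N_t(a)=0$, then $a$ is considered to be a maximizing action

The idea of this upper confidence bound (UCB) action selection is that the square-root term is a measure of the uncertainty or variance in the estimate of $a$'s value.

Each time $a$ is selected, $N_t(a)$ increases and, as it appears in the denominator, the uncertainty term decreases.
Each time an action other than $a$ is selected, $t$ increases but $N_t(a)$ does not; because $t$ appears in the numerator, the uncertainty estimate increases.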
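A small sketch of UCB selection, scoring each action by $Q_t(a) + c\sqrt{\ln t / N_t(a)}$ and treating any action with $N_t(a) = 0$ as maximizing. The function name `ucb_select`, the default $c = 2$, and the example numbers are assumptions for illustration.

```python
import math


def ucb_select(Q, N, t, c=2.0):
    """Return the action maximizing Q(a) + c * sqrt(ln t / N(a))."""
    for a, n in enumerate(N):
        if n == 0:
            return a  # an untried action counts as a maximizing action
    scores = [q + c * math.sqrt(math.log(t) / n) for q, n in zip(Q, N)]
    return max(range(len(Q)), key=lambda a: scores[a])


Q = [1.0, 0.9, 0.2]  # illustrative value estimates
N = [50, 2, 48]      # illustrative selection counts
choice = ucb_select(Q, N, t=100)
greedy_choice = ucb_select(Q, N, t=100, c=0.0)
```

Action 1 has a slightly lower estimate than action 0 but has been tried only twice, so its uncertainty bonus makes it the UCB choice; with $c = 0$ the rule falls back to the greedy action 0.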

1.8 Gradient Bandit Algorithms

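In this section we consider learning a numerical preference for each action $a$, which we denote $H_t(a)$. The larger the preference, the more often that action is taken, but the preference has no interpretation in terms of reward. Only the relative preference of one action over another matters. The action probabilities are determined by a soft-max distribution (i.e., a Gibbs or Boltzmann distribution):
$$\Pr\{A_t = a\} = \frac{e^{H_t(a)}}{\sum_{b=1}^{k} e^{H_t(b)}} = \pi_t(a) \tag{2.11}$$

Here $\pi_t(a)$ is the probability of taking action $a$ at time $t$. Initially all preferences are the same, e.g. $H_1(a)=0$ for all $a$, so that all actions have an equal probability of being selected.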

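The stochastic gradient ascent update is given by equation (2.12):
$$H_{t+1}(A_t) = H_t(A_t) + \alpha\,(R_t - \bar{R}_t)\,(1 - \pi_t(A_t))$$
$$H_{t+1}(a) = H_t(a) - \alpha\,(R_t - \bar{R}_t)\,\pi_t(a) \quad \text{for all } a \neq A_t \tag{2.12}$$

The first line updates the preference of the action $A_t$ that was actually selected, and the second line updates all the other actions. If the reward is higher than the baseline, the probability of taking $A_t$ in the future increases; if it is lower, the probability decreases.
In these equations:

$\alpha>0$: step size
$\bar{R}_t \in \mathbb{R}$: the average of the rewards received before time $t$, which serves as a baseline against which each reward is compared.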
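A compact sketch of a gradient bandit learner with soft-max action probabilities and a running-average baseline. The update is written in the single-line form $H(b) \leftarrow H(b) + \alpha (R - \bar{R})(\mathbb{I}_{b=A} - \pi(b))$, which covers both the selected and the non-selected actions. The true values, Gaussian rewards, and function names are assumptions for illustration.

```python
import math
import random


def softmax(H):
    """Soft-max (Gibbs) distribution over the preferences H."""
    m = max(H)  # subtract the max for numerical stability
    exps = [math.exp(h - m) for h in H]
    z = sum(exps)
    return [e / z for e in exps]


def gradient_bandit(q_star, alpha=0.1, steps=2000, seed=0):
    rng = random.Random(seed)
    k = len(q_star)
    H = [0.0] * k   # all preferences start at 0
    baseline = 0.0  # average of the rewards seen before the current step
    for t in range(1, steps + 1):
        pi = softmax(H)
        a = rng.choices(range(k), weights=pi)[0]
        r = rng.gauss(q_star[a], 1.0)
        for b in range(k):
            indicator = 1.0 if b == a else 0.0
            H[b] += alpha * (r - baseline) * (indicator - pi[b])
        baseline += (r - baseline) / t  # now the average of R_1..R_t
    return H, softmax(H)


q_star = [0.0, 1.0, 3.0]  # hypothetical true action values
H, pi = gradient_bandit(q_star)
```

Because the update terms sum to zero over all actions, the preferences only shift relative to one another, which matches the point that only relative preferences matter.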

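The figure below compares the reward obtained with and without the baseline, on a variant of the 10-armed testbed in which the true expected rewards are drawn with a mean of +4 instead of 0. This shifting up of all the rewards has absolutely no effect on the gradient bandit algorithm because of the reward baseline term, which instantaneously adapts to the new level.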

1.8.1 Theory of Gradient Ascent

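Exact gradient ascent would update each preference in proportion to its effect on the expected reward:
$$H_{t+1}(a) = H_t(a) + \alpha \frac{\partial E[R_t]}{\partial H_t(a)}$$
where $E[R_t]=\sum_x \pi_t(x) q_*(x)$. This cannot be computed directly, because $q_*(x)$ is unknown.
Substituting $E[R_t]$ into the gradient gives:
$$\frac{\partial E[R_t]}{\partial H_t(a)} = \frac{\partial}{\partial H_t(a)}\left[\sum_x \pi_t(x) q_*(x)\right] = \sum_x q_*(x) \frac{\partial \pi_t(x)}{\partial H_t(a)} = \sum_x \big(q_*(x) - B_t\big) \frac{\partial \pi_t(x)}{\partial H_t(a)}$$

Here $B_t$ is called the baseline and can be any scalar that does not depend on $x$. Including it changes nothing, because the probabilities sum to one and therefore $\sum_x \frac{\partial \pi_t(x)}{\partial H_t(a)} = 0$. Next we multiply each term by $\pi_t(x)/\pi_t(x)$, where $\pi_t(x)$ is defined in equation (2.11) and is the probability of selecting each action:
$$\frac{\partial E[R_t]}{\partial H_t(a)} = \sum_x \pi_t(x) \big(q_*(x) - B_t\big) \frac{\partial \pi_t(x)}{\partial H_t(a)} \Big/ \pi_t(x)$$

This is now in the form of an expectation, summing over all possible values $x$ of the random variable $A_t$:
$$\frac{\partial E[R_t]}{\partial H_t(a)} = E\left[\big(q_*(A_t) - B_t\big) \frac{\partial \pi_t(A_t)}{\partial H_t(a)} \Big/ \pi_t(A_t)\right]$$

Since $E[R_t \mid A_t] = q_*(A_t)$, and choosing $B_t = \bar{R}_t$, we have:
$$\frac{\partial E[R_t]}{\partial H_t(a)} = E\left[\big(R_t - \bar{R}_t\big) \frac{\partial \pi_t(A_t)}{\partial H_t(a)} \Big/ \pi_t(A_t)\right]$$

Assume for now that (this follows from applying the quotient rule to equation (2.11)):
$$\frac{\partial \pi_t(x)}{\partial H_t(a)} = \pi_t(x)\big(\mathbb{I}_{a=x} - \pi_t(a)\big)$$

Then:
$$\frac{\partial E[R_t]}{\partial H_t(a)} = E\left[\big(R_t - \bar{R}_t\big)\, \pi_t(A_t)\big(\mathbb{I}_{a=A_t} - \pi_t(a)\big) \Big/ \pi_t(A_t)\right]$$

which can be written as:
$$\frac{\partial E[R_t]}{\partial H_t(a)} = E\left[\big(R_t - \bar{R}_t\big)\big(\mathbb{I}_{a=A_t} - \pi_t(a)\big)\right]$$

So, replacing the expectation by a single sample at each step:
$$H_{t+1}(a) = H_t(a) + \alpha\,\big(R_t - \bar{R}_t\big)\big(\mathbb{I}_{a=A_t} - \pi_t(a)\big) \quad \text{for all } a$$

which can be written as the two cases of equation (2.12): $H_{t+1}(A_t) = H_t(A_t) + \alpha(R_t - \bar{R}_t)(1 - \pi_t(A_t))$ for the selected action, and $H_{t+1}(a) = H_t(a) - \alpha(R_t - \bar{R}_t)\pi_t(a)$ for all $a \neq A_t$.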
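The derivation relies on the soft-max derivative identity $\frac{\partial \pi_t(x)}{\partial H_t(a)} = \pi_t(x)\big(\mathbb{I}_{a=x} - \pi_t(a)\big)$. As a sanity check, not a proof, the sketch below compares this expression with a central finite difference for arbitrary preference values. The helper names and the values in `H` are made up for illustration.

```python
import math


def softmax(H):
    m = max(H)  # subtract the max for numerical stability
    exps = [math.exp(h - m) for h in H]
    z = sum(exps)
    return [e / z for e in exps]


def analytic(H, x, a):
    """pi(x) * (1[a == x] - pi(a)), the claimed derivative of pi(x) w.r.t. H(a)."""
    pi = softmax(H)
    return pi[x] * ((1.0 if a == x else 0.0) - pi[a])


def numeric(H, x, a, eps=1e-6):
    """Central finite-difference estimate of the same derivative."""
    up = list(H)
    up[a] += eps
    down = list(H)
    down[a] -= eps
    return (softmax(up)[x] - softmax(down)[x]) / (2 * eps)


H = [0.3, -1.2, 0.8]  # arbitrary preferences
max_error = max(abs(analytic(H, x, a) - numeric(H, x, a))
                for x in range(3) for a in range(3))
```

The largest discrepancy over all pairs $(x, a)$ stays at the level of finite-difference rounding error, which is consistent with the identity.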

1.9 Summary

ϵ\epsilonϵ-greedy choose randomly a small fraction of the time
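UCB methods choose deterministically, but achieve exploration through an upper confidence bound that favors actions that have so far been selected less often.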
Gradient bandit algorithms estimate not action values, but action preferences, and favor the more preferred actions in a graded, probabilistic manner using a soft-max distribution

Reinforcement Learning——Chapter 2 Multi-armed Bandits