Hands-On Reinforcement Learning (6): Model-Free Prediction in Reinforcement Learning

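In most reinforcement learning (RL) problems, the model of the environment is unknown, so dynamic programming cannot be applied directly. One option is to learn the MDP first; the article on policy iteration and value iteration in this series describes that approach in detail. But it has serious problems, mostly in the sampling process: if you sample randomly, some states are very hard to reach, and if you sample by following a particular policy, it is hard to recover the true transition probabilities. Once the learned model is wrong, both value iteration and policy iteration will go wrong as well.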

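This leads to model-free reinforcement learning, which interacts with the environment directly and learns the value function or the policy straight from data, without learning a model.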

Model-free Reinforcement Learning

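Model-free reinforcement learning first estimates values (of states or of state-action pairs) from data, that is, the expected cumulative reward. With these estimates in hand, it then performs model-free control, optimizing the current policy so that the value function is maximized.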

So how does model-free RL predict the value function?

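In model-free RL we have no access to the state transition function or the reward function; all we have are some episodes. Previously we used these episodes to learn a model. In model-free methods we use them to learn the value function or the policy directly, without learning the MDP. There are two key steps: 1. estimate the value function; 2. optimize the policy.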

Value Function Estimate

In model-based RL (MDP), the value function is calculated by dynamic programming

$$v_{\pi}(s)=\mathbb{E}_{\pi}\left[R_{t+1}+\gamma v_{\pi}(S_{t+1}) \mid S_{t}=s\right]$$

In model-free methods we do not know the state transitions, so we cannot compute the expectation in the equation above.

Monte-Carlo Methods

Monte-Carlo methods are a broad class of computational algorithms that rely on repeated random sampling to obtain numerical results. For example, they can estimate the area of a circle drawn inside a square, as in the following figure:

Scatter points uniformly over the square, then compute:

$$\text{Circle Surface} = \text{Square Surface} \times \frac{\text{points in circle}}{\text{points in total}}$$
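The formula above can be tried out in a few lines of Python. This is only a minimal sketch: the function name `estimate_circle_area`, the choice of a circle of radius 1 centred in a square of side 2, and the fixed seed are assumptions made for illustration.

```python
import random


def estimate_circle_area(num_points, radius=1.0, seed=0):
    """Estimate the area of a circle inscribed in a square by random sampling."""
    rng = random.Random(seed)
    points_in_circle = 0
    for _ in range(num_points):
        # Draw a point uniformly from the square [-radius, radius]^2.
        x = rng.uniform(-radius, radius)
        y = rng.uniform(-radius, radius)
        if x * x + y * y <= radius * radius:
            points_in_circle += 1
    square_area = (2.0 * radius) ** 2
    # Circle Surface = Square Surface * points in circle / points in total
    return square_area * points_in_circle / num_points


print(estimate_circle_area(100_000))  # should land close to pi for radius 1
```

The more points are scattered, the closer the estimate tends to get to the true area, which is the same principle that Monte-Carlo value estimation relies on.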

Monte-Carlo Value Estimation

Suppose we have many episodes. For each of them we compute the total discounted reward:

$$G_{t}=R_{t+1}+\gamma R_{t+2}+\ldots=\sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1}$$
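For a finite episode, the return at every time step can be computed in a single backward pass, using the fact that each return equals the next reward plus the discounted return one step later. The sketch below assumes the rewards are stored so that `rewards[t]` holds the reward received after step t; the function name is illustrative.

```python
def discounted_returns(rewards, gamma):
    """Return [G_0, G_1, ...] for one finite episode.

    rewards[t] is R_{t+1}, the reward received after acting at time t.
    """
    returns = [0.0] * len(rewards)
    g = 0.0
    # Walk backwards so that each G_t reuses the already computed G_{t+1}.
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns


print(discounted_returns([1.0, 0.0, 2.0], gamma=0.5))  # [1.5, 1.0, 2.0]
```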

The value function is the expected return, which can be written as:

$$V^{\pi}(s) = \mathbb{E}\left[G_{t} \mid s_{t}=s, \pi\right] \approx \frac{1}{N} \sum_{i=1}^{N} G_{t}^{(i)}$$

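This method boils down to two steps: 1. use policy $\pi$ to sample $N$ episodes starting from state $s$; 2. compute the average cumulative reward. This sampling-based approach gets the value in one shot, without computing the state transitions of the MDP.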

In more detail, the idea can be stated as follows:

Sample episodes of policy $\pi$.
Every time-step $t$ that state $s$ is visited in an episode
- Increment counter $N(s) \leftarrow N(s) + 1$
- Increment total return $S(s) \leftarrow S(s) + G_{t}$
- Value is estimated by mean return $V(s) = S(s)/N(s)$
- By the law of large numbers, $V(s) \rightarrow V^{\pi}(s)$ as $N(s) \rightarrow \infty$.
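The steps above can be sketched in Python as follows. The environment is an assumption chosen for illustration and does not come from this article: a five-state random walk that starts in the middle, moves left or right with equal probability, and pays a reward of 1 only when it exits on the right. With no discounting, the true value of state `s` (numbered 0 to 4) is `(s + 1) / 6`, which gives something to compare against.

```python
import random
from collections import defaultdict


def sample_episode(rng, num_states=5):
    """Sample one random-walk episode as a list of (S_t, R_{t+1}) pairs."""
    state = num_states // 2
    episode = []
    while True:
        next_state = state + rng.choice((-1, 1))
        reward = 1.0 if next_state == num_states else 0.0
        episode.append((state, reward))
        if next_state < 0 or next_state >= num_states:
            return episode  # walked off either end: the episode terminates
        state = next_state


def every_visit_mc(num_episodes, gamma=1.0, seed=0):
    """Every-visit Monte-Carlo estimate of V(s) for the random walk."""
    rng = random.Random(seed)
    visit_count = defaultdict(int)     # N(s)
    total_return = defaultdict(float)  # S(s)
    for _ in range(num_episodes):
        g = 0.0
        # Go backwards so G_t is available at every visited time step.
        for state, reward in reversed(sample_episode(rng)):
            g = reward + gamma * g
            visit_count[state] += 1
            total_return[state] += g
    return {s: total_return[s] / visit_count[s] for s in sorted(visit_count)}


for state, value in every_visit_mc(5_000).items():
    print(state, round(value, 3), "true:", round((state + 1) / 6, 3))
```

Because every visit to a state contributes a return, this is the every-visit variant; counting only the first visit in each episode would give the first-visit variant.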

Incremental Monte-Carlo Updates

Update $V(s)$ incrementally after each episode
For each state $S_{t}$ with cumulative return $G_{t}$

$$\begin{array}{l} {N\left(S_{t}\right) \leftarrow N\left(S_{t}\right)+1} \\ {V\left(S_{t}\right) \leftarrow V\left(S_{t}\right)+\frac{1}{N\left(S_{t}\right)}\left(G_{t}-V\left(S_{t}\right)\right)} \end{array}$$
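A short derivation shows why this update is just the sample mean written differently. The symbols $V_{N}$ (the estimate for a fixed state after $N$ returns) and $G^{(N)}$ (the $N$-th return observed for that state) are shorthand introduced for this sketch only.

```latex
\begin{aligned}
V_{N} &= \frac{1}{N}\sum_{i=1}^{N} G^{(i)} \\
      &= \frac{1}{N}\left(G^{(N)} + (N-1)\,V_{N-1}\right) \\
      &= V_{N-1} + \frac{1}{N}\left(G^{(N)} - V_{N-1}\right)
\end{aligned}
```

The last line has the same shape as the update above: old estimate plus a step size of $1/N$ times the difference between the new return and the old estimate.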

For non-stationary problems (i.e. the environment could be varying over time), it can be useful to track a running mean, i.e. forget old episodes

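If the environment's state transitions and reward function keep changing over time, we call the environment non-stationary (the environment itself is of course still stochastic). If those distributions stay fixed, the environment is called stationary. When the distributions do change, we need to forget some of the old episodes. A plain average weights old and new episodes equally, so it can never forget the old ones.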

$$V(S_{t}) \leftarrow V(S_{t}) + \alpha (G_{t} - V(S_{t}))$$
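The difference between the two step sizes is easy to see on a toy sequence. In the sketch below, the returns observed for one state are made up for illustration: 100 returns of 0 followed by 100 returns of 1, as if the environment had changed halfway through. The function names are illustrative too.

```python
def sample_mean(returns):
    """Incremental update with step size 1/N: every return weighs the same."""
    value, count = 0.0, 0
    for g in returns:
        count += 1
        value += (g - value) / count
    return value


def running_mean(returns, alpha):
    """Incremental update with constant step size alpha: old returns fade out."""
    value = 0.0
    for g in returns:
        value += alpha * (g - value)
    return value


# Hypothetical non-stationary data: the expected return jumps from 0 to 1.
returns = [0.0] * 100 + [1.0] * 100
print(sample_mean(returns))               # stays near 0.5, stuck between old and new
print(running_mean(returns, alpha=0.1))   # ends very close to 1.0
```

With a constant $\alpha$, a return observed $k$ updates ago carries weight proportional to $(1-\alpha)^{k}$, which is why the old episodes are gradually forgotten.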

Some characteristics of Monte-Carlo value estimation:

MC methods learn directly from episodes of experience
MC is model-free: no knowledge of MDP transitions / rewards
MC learns from complete episodes: no bootstrapping (discussed later)
MC uses the simplest possible idea: value = mean return
Caveat: can only apply MC to episodic MDPs i.e., all episodes must terminate

Temporal-Difference Learning

TD methods bring in an estimate of the future value function:

$$G_{t}=R_{t+1}+\gamma R_{t+2}+\ldots \approx R_{t+1}+\gamma V(S_{t+1})$$

$$V(S_{t}) \leftarrow V(S_{t}) + \alpha(R_{t+1}+\gamma V(S_{t+1}) - V(S_{t}))$$

TD algorithms have four main characteristics:

TD methods learn directly from episodes of experience
TD is model-free: no knowledge of MDP transitions / rewards
TD learns from incomplete episodes, by bootstrapping
TD updates a guess towards a guess
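The TD update can be sketched on a small example. The environment here is an assumption for illustration and is not taken from this article: a five-state random walk that starts in the middle, steps left or right with equal probability, and pays a reward of 1 only when it exits on the right, so the true undiscounted value of state `s` is `(s + 1) / 6`. The initial guess of 0.5 and the step size are arbitrary choices as well.

```python
import random


def td0_random_walk(num_episodes, alpha=0.01, gamma=1.0, num_states=5, seed=0):
    """Tabular TD(0) prediction for the random walk under its fixed random policy."""
    rng = random.Random(seed)
    V = [0.5] * num_states  # arbitrary initial guess for every state
    for _ in range(num_episodes):
        state = num_states // 2
        while True:
            next_state = state + rng.choice((-1, 1))
            terminal = next_state < 0 or next_state >= num_states
            reward = 1.0 if next_state == num_states else 0.0
            # The value of a terminal state is zero by definition.
            next_value = 0.0 if terminal else V[next_state]
            td_target = reward + gamma * next_value
            td_error = td_target - V[state]
            V[state] += alpha * td_error  # update right after this single step
            if terminal:
                break
            state = next_state
    return V


estimates = td0_random_walk(num_episodes=5_000)
print([round(v, 2) for v in estimates])
print([round((s + 1) / 6, 2) for s in range(5)])  # true values for comparison
```

Note that the update happens inside the episode loop, after every step, which is what "learning from incomplete episodes by bootstrapping" means in practice.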

Monte Carlo vs. Temporal Difference

Monte Carlo and temporal-difference methods compare as follows:

The same goal: learn $V^{\pi}$ from episodes of experience under policy $\pi$.
Incremental every-visit Monte-Carlo
- Update value $V(S_{t})$ toward actual return $G_{t}$.

$$V(S_{t}) \leftarrow V(S_{t}) + \alpha(G_{t}-V(S_{t}))$$

Simplest temporal-difference learning algorithm: TD(0)
- Update value $V(S_{t})$ toward estimated return $R_{t+1} + \gamma V(S_{t+1})$.
- TD target: $R_{t+1} + \gamma V(S_{t+1})$
- TD error: $\delta = R_{t+1} + \gamma V(S_{t+1}) - V(S_{t})$

Advantages and Disadvantages of MC vs. TD

TD can learn before knowing the final outcome
- TD can learn online after every step
- MC must wait until end of episode before return is known
TD can learn without the final outcome
- TD can learn from incomplete sequences
- MC can only learn from complete sequences
- TD works in continuing (non-terminating) environments
- MC only works for episodic (terminating) environments

Bias/Variance Trade-Off

Return $G_{t}$ is an unbiased estimate of $V^{\pi}(S_{t})$.

Sampling under the current policy and averaging the returns gives an unbiased estimate.

TD target $R_{t+1} + \gamma V(S_{t+1})$ is a biased estimate of $V^{\pi}(S_{t})$.

The TD target contains an estimate of the future, $V(S_{t+1})$. If that estimate were exact, the TD target would also be an unbiased estimate; but because $V(S_{t+1})$ is hard to estimate accurately, the TD target is a biased estimate.

TD target is of much lower variance than the return

The variance of the TD target is generally lower than that of the return $G_{t}$. Return $G_{t}$ depends on many random actions, transitions and rewards; the TD target depends on one random action, transition and reward.
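The variance gap can be checked empirically. The sketch below uses an illustrative setup that is not from this article: a five-state random walk with a reward of 1 for exiting on the right, where the true value of state `s` is `(s + 1) / 6`. To isolate variance from bias, the TD target is computed with these true values, an idealised assumption under which both targets have the same mean.

```python
import random
import statistics

NUM_STATES = 5
TRUE_V = [(s + 1) / 6 for s in range(NUM_STATES)]  # known values of the random walk


def sample_targets(num_samples, start=2, gamma=1.0, seed=0):
    """Sample the MC return and the one-step TD target from the same start state."""
    rng = random.Random(seed)
    mc_returns, td_targets = [], []
    for _ in range(num_samples):
        state, g, discount, first_step = start, 0.0, 1.0, True
        while 0 <= state < NUM_STATES:
            next_state = state + rng.choice((-1, 1))
            reward = 1.0 if next_state == NUM_STATES else 0.0
            if first_step:
                # The TD target only looks at the first transition, then bootstraps.
                in_range = 0 <= next_state < NUM_STATES
                next_value = TRUE_V[next_state] if in_range else 0.0
                td_targets.append(reward + gamma * next_value)
                first_step = False
            g += discount * reward  # the MC return accumulates until termination
            discount *= gamma
            state = next_state
        mc_returns.append(g)
    return mc_returns, td_targets


mc, td = sample_targets(10_000)
print("MC return variance:", statistics.pvariance(mc))
print("TD target variance:", statistics.pvariance(td))
```

From the middle state the return is either 0 or 1 depending on the whole random trajectory, while the TD target only varies with the first step, so its spread is much narrower.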

Advantages and Disadvantages of MC vs. TD (2)

MC has high variance, zero bias

MC has good convergence properties (even with function approximation), is not very sensitive to the initial value, and is very simple to understand and use. However, it needs many samples to bring the variance down.

TD has low variance, some bias

TD is usually more efficient than MC. TD converges to $V^{\pi}(S_{t})$, but not always with function approximation. It is also more sensitive to the initial value than MC.

n-step model-free prediction

Because of time constraints, we may skip the n-step prediction section and head directly to model-free control.

Define the n-step return

$$G_{t}^{(n)} = R_{t+1} + \gamma R_{t+2} + \cdots +\gamma^{n-1}R_{t+n} + \gamma^{n} V(S_{t+n})$$

n-step temporal-difference learning

$$V(S_{t}) \leftarrow V(S_{t}) + \alpha(G_{t}^{(n)} - V(S_{t}))$$
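The n-step return can be computed from a recorded episode as in the sketch below. The data layout is an assumption for illustration: `rewards[k]` holds $R_{k+1}$ and `values[k]` holds the current estimate $V(S_{k})$ for each non-terminal state of the episode. If the episode ends before $n$ steps have passed, the sketch simply stops summing and does not bootstrap, which is a common convention rather than something stated in this article.

```python
def n_step_return(rewards, values, t, n, gamma):
    """Compute the n-step return from time t for one recorded episode.

    rewards[k] is R_{k+1}; values[k] is the current estimate V(S_k).
    """
    episode_length = len(rewards)
    end = min(t + n, episode_length)
    g = 0.0
    for k in range(t, end):
        g += gamma ** (k - t) * rewards[k]
    if end < episode_length:
        # S_{t+n} is non-terminal, so bootstrap from its estimated value.
        g += gamma ** n * values[end]
    return g


rewards = [1.0, 2.0, 3.0]
values = [10.0, 20.0, 30.0]
print(n_step_return(rewards, values, t=0, n=1, gamma=0.5))  # 11.0 (the TD target)
print(n_step_return(rewards, values, t=0, n=2, gamma=0.5))  # 9.5
print(n_step_return(rewards, values, t=0, n=3, gamma=0.5))  # 2.75 (the full MC return)
```

With $n = 1$ this reduces to the TD target, and once $n$ reaches the end of the episode it becomes the Monte-Carlo return, so $n$ interpolates between the two methods.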

Once we have the value function, the next step is policy improvement.

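My WeChat official account: 深度学习与先进智能决策 (Deep Learning and Advanced Intelligent Decision-Making)
WeChat official account ID: MultiAgent1024
About the account: it mainly shares research on deep learning, machine game playing, reinforcement learning, and related topics. Follows are welcome, and so is anyone who wants to learn and discuss together.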