Hands-On Reinforcement Learning (8): Value Function Approximation Algorithms in Reinforcement Learning

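Before turning to value function approximation, let's briefly review reinforcement learning algorithms. They fall into two broad classes: model-based methods and model-free methods. The model-based methods are also referred to here as dynamic programming: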

Model-based dynamic programming

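In model-based dynamic programming, the core concepts are value iteration and policy iteration. Value iteration estimates the value of the current state from the immediate reward and the estimated values of future states. Policy iteration iterates mainly by acting greedily; for the greedy step to work it still needs state-value estimates, that is, value updates, but the policy can be improved before those value updates have converged, which lets the algorithm converge somewhat faster. The core formulas are: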

Value iteration: $V(s) = R(s) + \max_{a \in A} \gamma \sum_{s' \in S} P_{sa}(s') V(s')$.
Policy iteration: $\pi(s) = \arg\max_{a \in A} \sum_{s' \in S} P_{sa}(s') V(s')$.
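As a rough illustration of the two formulas above, the following Python sketch runs value iteration on a made-up two-state, two-action MDP and then reads off the greedy policy. The arrays `P` and `R`, the function name `value_iteration`, and the values of `gamma` and `tol` are assumptions chosen for the example, not something given in the text.

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8, max_iters=10_000):
    """P[a, s, s'] plays the role of P_sa(s'); R[s] is the reward of state s."""
    V = np.zeros(R.shape[0])
    for _ in range(max_iters):
        # V(s) = R(s) + max_a gamma * sum_s' P_sa(s') V(s')
        V_new = R + gamma * (P @ V).max(axis=0)
        converged = np.max(np.abs(V_new - V)) < tol
        V = V_new
        if converged:
            break
    # Greedy policy: pi(s) = argmax_a sum_s' P_sa(s') V(s')
    policy = (P @ V).argmax(axis=0)
    return V, policy

# Toy MDP (assumed): action 0 stays in place, action 1 switches state.
P = np.array([
    [[1.0, 0.0], [0.0, 1.0]],  # action 0
    [[0.0, 1.0], [1.0, 0.0]],  # action 1
])
R = np.array([0.0, 1.0])       # only state 1 gives a reward
V, policy = value_iteration(P, R)
print(V, policy)
```

In this toy problem the greedy policy switches out of state 0 and stays in state 1, which is what the fixed point $V(1) = 1 + 0.9\,V(1)$ suggests.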

Model-free reinforcement learning

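Model-free reinforcement learning estimates state values mainly with Monte Carlo (MC) methods and temporal-difference (TD) methods. TD algorithms are further divided into the on-policy SARSA algorithm and the off-policy Q-learning algorithm.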

On-policy MC: $V(s_{t}) \leftarrow V(s_{t}) + \alpha\left(G_{t}-V(s_{t})\right)$.
On-policy TD: $V(s_{t}) \leftarrow V(s_{t}) + \alpha\left(r_{t+1} + \gamma V(s_{t+1})-V(s_{t})\right)$.
On-policy TD SARSA:

$$Q\left(s_{t}, a_{t}\right) \leftarrow Q\left(s_{t}, a_{t}\right)+\alpha\left(r_{t+1}+\gamma Q\left(s_{t+1}, a_{t+1}\right)-Q\left(s_{t}, a_{t}\right)\right)$$

Off-policy TD Q-learning:

$$Q\left(s_{t}, a_{t}\right) \leftarrow Q\left(s_{t}, a_{t}\right)+\alpha\left(r_{t+1}+\gamma \max _{a^{\prime}} Q\left(s_{t+1}, a^{\prime}\right)-Q\left(s_{t}, a_{t}\right)\right)$$
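The only difference between the two TD control updates is the bootstrap term, which a small tabular sketch makes visible. The function names, the table values, and the settings `alpha=0.1`, `gamma=0.9` below are illustrative assumptions.

```python
import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    # On-policy: bootstrap from the action a_next actually taken in s_next.
    Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    # Off-policy: bootstrap from the best action available in s_next.
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])

# Same starting table and same transition for both rules.
Q_sarsa = np.array([[0.0, 0.0], [1.0, 2.0]])
Q_qlearn = Q_sarsa.copy()
sarsa_update(Q_sarsa, s=0, a=0, r=1.0, s_next=1, a_next=0)
q_learning_update(Q_qlearn, s=0, a=0, r=1.0, s_next=1)
print(Q_sarsa[0, 0], Q_qlearn[0, 0])
```

Because the next action in this example is not the greedy one, SARSA bootstraps from $Q(1,0)=1$ while Q-learning bootstraps from $\max_{a'} Q(1,a')=2$, so the two rules produce different updates for the same transition.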

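Handling Large State and Action Spaces

In all previous models, we have created a lookup table to maintain a variable $V(s)$ for each state or $Q(s,a)$ for each state-action pair.

When the state or state-action space is too large, or when the state or action is continuous, such a lookup table cannot be built.

There are two main ways to address this. One is to discretize the large or continuous state or action space into chunks, which usually does not give good control performance. The other, more common, approach is to build a parameterized value function as an approximation.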

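Discretizing a Continuous MDP

For an MDP with continuous states, we can discretize the states and describe the dynamics by transition probabilities between the resulting cells: for example, with what probability taking a given action in a given cell leads to each next cell. The problem can then be written in tabular form. This approach is very tedious.

Note that the state transition probabilities here have to be obtained by integrating over the continuous quantities before they can be discretized.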
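A minimal sketch of this idea, assuming a one-dimensional state in a known range: `discretize` maps a continuous value to a bin index, and `estimate_transitions` builds the transition table from sampled transitions. Counting samples is a swapped-in, empirical stand-in for the integral mentioned above, and all names and numbers here are hypothetical.

```python
import numpy as np

def discretize(x, low, high, n_bins):
    """Map a continuous scalar x in [low, high] to a bin index in 0..n_bins-1."""
    edges = np.linspace(low, high, n_bins + 1)[1:-1]  # interior bin edges only
    return int(np.digitize(x, edges))

def estimate_transitions(transitions, n_states, n_actions):
    """Empirical table P_hat[s, a, s'] from (s, a, s') index triples."""
    counts = np.zeros((n_states, n_actions, n_states))
    for s, a, s_next in transitions:
        counts[s, a, s_next] += 1
    totals = counts.sum(axis=2, keepdims=True)
    # Leave rows for unvisited (s, a) pairs at zero instead of dividing by zero.
    return np.divide(counts, totals, out=np.zeros_like(counts), where=totals > 0)

bins = [discretize(x, -1.0, 1.0, 4) for x in (-0.7, 0.2, 1.0)]
P_hat = estimate_transitions([(0, 0, 1), (0, 0, 1), (0, 0, 0)], n_states=2, n_actions=1)
print(bins, P_hat[0, 0])
```

Even in this tiny case the table has one row per (cell, action) pair, which hints at why the approach becomes tedious as the number of bins or state dimensions grows.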

Bucketize Large Discrete MDP

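For a large discrete state space, we can use domain knowledge to aggregate similar discrete states into buckets. However the states are discretized or aggregated, some error is introduced, which is why value function approximation is the mainstream approach today.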

Parametric Value Function Approximation

Create parametric (thus learnable) functions to approximate the value function

$$\begin{aligned} V_{\theta}(s) & \simeq V^{\pi}(s) \\ Q_{\theta}(s, a) & \simeq Q^{\pi}(s, a) \end{aligned}$$

$\theta$ denotes the parameters of the approximation function, which can be updated by reinforcement learning.

This approach addresses the curse of dimensionality and also generalizes from seen states to unseen states, which is the most powerful and appealing aspect of machine learning as a whole.

Many function approximations

(Generalized) linear model
Neural network
Decision tree
Nearest neighbor
Fourier / wavelet bases

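All of the models above can serve as function approximators, but decision trees and random forests are less flexible here, because in reinforcement learning the parameters are updated frequently. For that reason we often use differentiable functions such as (generalized) linear models and neural networks.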

We assume the model is suitable to be trained for nonstationary, non-iid data

Value Function Approx. by SGD

Goal: find parameter vector $\theta$ minimizing mean-squared error between approximate value function $V_{\theta}(s)$ and true value $V^{\pi}(s)$.

$$J(\theta)=\mathbb{E}_{\pi}\left[\frac{1}{2}\left(V^{\pi}(s)-V_{\theta}(s)\right)^{2}\right]$$

Gradient to minimize the error

$$-\frac{\partial J(\theta)}{\partial \theta}=\mathbb{E}_{\pi}\left[\left(V^{\pi}(s)-V_{\theta}(s)\right) \frac{\partial V_{\theta}(s)}{\partial \theta}\right]$$

Stochastic gradient descent on one sample

$$\begin{aligned} \theta & \leftarrow \theta-\alpha \frac{\partial J(\theta)}{\partial \theta} \\ &=\theta+\alpha\left(V^{\pi}(s)-V_{\theta}(s)\right) \frac{\partial V_{\theta}(s)}{\partial \theta} \end{aligned}$$

Linear Value Function Approximation

A concrete example:

Represent value function by a linear combination of features

$$V_{\theta}(s) = \theta^{T}x(s)$$

Objective function is quadratic in parameters $\theta$

$$J(\theta)=\mathbb{E}_{\pi}\left[\frac{1}{2}\left(V^{\pi}(s)-\theta^{T}x(s)\right)^{2}\right]$$

Thus stochastic gradient descent converges to the global optimum

$$\begin{aligned} \theta & \leftarrow \theta-\alpha \frac{\partial J(\theta)}{\partial \theta} \\ &=\theta+\alpha\left(V^{\pi}(s)-V_{\theta}(s)\right) x(s) \end{aligned}$$
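To see the linear update in action, here is a small sketch that pretends the true values are known and exactly linear in the features. The feature map `features`, the stand-in values `V_true`, the step size, and the number of steps are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, alpha = 5, 0.1

def features(s):
    # Hypothetical features x(s) = [1, s / 4]
    return np.array([1.0, s / (n_states - 1)])

# Stand-in for V^pi(s), assumed known here: V(s) = 1 + s = [1, 4] . x(s)
V_true = 1.0 + np.arange(n_states)

theta = np.zeros(2)
for _ in range(5000):
    s = rng.integers(n_states)           # sample one state
    x = features(s)
    # theta <- theta + alpha * (V^pi(s) - theta^T x(s)) * x(s)
    theta += alpha * (V_true[s] - theta @ x) * x

print(theta)  # should end up close to [1, 4]
```

In practice $V^{\pi}(s)$ is not available, so the target in the update line has to be replaced by something that can be computed from experience.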

So how do we obtain the $V^{\pi}(s)$ that appears in the formula above?

Monte-Carlo with Value Function Approx

Without the true value there is nothing to update toward, so the earlier methods can be reused here to supply a target.

For each data instance $\langle s_{t}, G_{t} \rangle$:

$$\theta \leftarrow \theta +\alpha\left(G_{t}-V_{\theta}(s_{t})\right)x(s_{t})$$

It can be shown that MC evaluation converges at least to a local optimum; in the linear case it converges to a global optimum.
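The MC update can be sketched as a function that processes one finished episode. This version applies the update at every visit and walks the episode backwards so that $G_{t}$ can be accumulated incrementally; the feature map, the episode, and the hyperparameters are hypothetical.

```python
import numpy as np

def features(s, n_states=5):
    # Hypothetical features x(s) = [1, s / (n_states - 1)]
    return np.array([1.0, s / (n_states - 1)])

def mc_update(theta, states, rewards, alpha=0.1, gamma=1.0):
    """Every-visit MC update; states[t] is s_t and rewards[t] is r_{t+1}."""
    G = 0.0
    for s, r in zip(reversed(states), reversed(rewards)):
        G = r + gamma * G                      # return G_t, built from the end
        x = features(s)
        # theta <- theta + alpha * (G_t - V_theta(s_t)) * x(s_t)
        theta = theta + alpha * (G - theta @ x) * x
    return theta

# One made-up episode: states 2 -> 3 -> 4, reward 1 on the final step.
theta = mc_update(np.zeros(2), states=[2, 3, 4], rewards=[0.0, 0.0, 1.0])
print(theta)
```

Note that nothing can be updated until the episode has ended, since $G_{t}$ depends on all later rewards.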

TD Learning with Value Function Approx

The question now is what to substitute for the learning target. In TD learning, for each data instance $\langle s_{t}, r_{t+1} + \gamma V_{\theta}(s_{t+1}) \rangle$:

$$\theta \leftarrow \theta +\alpha\left(r_{t+1}+\gamma V_{\theta}(s_{t+1})-V_{\theta}(s_{t})\right)x(s_{t})$$

Linear TD converges (close) to the global optimum.
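A single linear TD(0) step might look like the sketch below. The `done` flag for terminal transitions, the feature vectors, and the hyperparameters are assumptions added for the example.

```python
import numpy as np

def td0_update(theta, x_s, r, x_next, done, alpha=0.1, gamma=0.9):
    """One linear TD(0) step toward the target r + gamma * V_theta(s')."""
    v_next = 0.0 if done else theta @ x_next   # a terminal state has value 0
    td_error = r + gamma * v_next - theta @ x_s
    # Only x(s_t) multiplies the error; the target is treated as a constant.
    return theta + alpha * td_error * x_s

theta = np.array([0.5, -0.5])
theta = td0_update(theta, x_s=np.array([1.0, 0.0]), r=1.0,
                   x_next=np.array([0.0, 1.0]), done=False)
print(theta)
```

Unlike the MC version, this update needs only one transition, so it can be applied online at every step.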

Action-Value Function Approximation

Approximate the action-value function:

$$Q_{\theta}(s,a) \simeq Q^{\pi}(s,a)$$

Minimize mean squared error:

$$J(\theta) = \mathbb{E}\left[\frac{1}{2}\left(Q^{\pi}(s,a)-Q_{\theta}(s,a)\right)^{2}\right]$$

Stochastic gradient descent on one sample:

$$\begin{aligned} \theta & \leftarrow \theta-\alpha \frac{\partial J(\theta)}{\partial \theta} \\ &=\theta+\alpha\left(Q^{\pi}(s,a)-Q_{\theta}(s,a)\right) \frac{\partial Q_{\theta}(s,a)}{\partial \theta} \end{aligned}$$

Linear Action-Value Function Approx

Represent state-action pair by a feature vector

Here we need to extract features from the state-action pair.

$$x(s, a)=\left[\begin{array}{c} {x_{1}(s, a)} \\ {\vdots} \\ {x_{k}(s, a)} \end{array}\right]$$

Parametric Q function, e.g., the linear case

$$Q_{\theta}(s, a)=\theta^{\top} x(s, a)$$

Stochastic gradient descent update

$$\begin{aligned} \theta & \leftarrow \theta-\alpha \frac{\partial J(\theta)}{\partial \theta} \\ &=\theta+\alpha\left(Q^{\pi}(s,a)-\theta^{\top} x(s, a)\right) x(s,a) \end{aligned}$$
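One common way to build $x(s,a)$ for a discrete action set is to copy the state features into a block reserved for the chosen action. The text does not prescribe this construction; it is used here only as an assumption for the sketch, and the function names and numbers are hypothetical.

```python
import numpy as np

def state_action_features(state_feats, a, n_actions):
    """Hypothetical x(s, a): x(s) placed in the block that belongs to action a."""
    k = state_feats.shape[0]
    x = np.zeros(k * n_actions)
    x[a * k:(a + 1) * k] = state_feats
    return x

def q_value(theta, state_feats, a, n_actions):
    # Q_theta(s, a) = theta^T x(s, a)
    return theta @ state_action_features(state_feats, a, n_actions)

def sgd_step(theta, state_feats, a, target, n_actions, alpha=0.1):
    # theta <- theta + alpha * (target - theta^T x(s, a)) * x(s, a)
    x = state_action_features(state_feats, a, n_actions)
    return theta + alpha * (target - theta @ x) * x

s_feats = np.array([1.0, 2.0])
theta = sgd_step(np.zeros(4), s_feats, a=1, target=5.0, n_actions=2)
print(q_value(theta, s_feats, 0, 2), q_value(theta, s_feats, 1, 2))
```

With this block layout an update for one action leaves the estimates for the other actions untouched, while states with similar features still share what is learned within each action.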

TD Learning with Value Function Approx.

$$\theta \leftarrow \theta + \alpha\left(Q^{\pi}(s,a)-Q_{\theta}(s,a)\right) \frac{\partial Q_{\theta}(s,a)}{\partial \theta}$$

For MC, the target is the return $G_{t}$:

$$\theta \leftarrow \theta + \alpha\left(G_{t}-Q_{\theta}(s,a)\right) \frac{\partial Q_{\theta}(s,a)}{\partial \theta}$$

For TD, the target is $r_{t+1} + \gamma Q_{\theta}(s_{t+1}, a_{t+1})$:

$$\theta \leftarrow \theta + \alpha\left(r_{t+1} + \gamma Q_{\theta}(s_{t+1}, a_{t+1})-Q_{\theta}(s,a)\right) \frac{\partial Q_{\theta}(s,a)}{\partial \theta}$$
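In the linear case the MC and TD updates share the same update rule and differ only in the target that is plugged in, as the following sketch shows. The helper names, feature vectors, and hyperparameters are illustrative assumptions.

```python
import numpy as np

def sarsa_target(theta, r, x_next, done, gamma=0.9):
    # TD target r_{t+1} + gamma * Q_theta(s_{t+1}, a_{t+1}); no bootstrap at the end.
    return r if done else r + gamma * (theta @ x_next)

def q_update(theta, x, target, alpha=0.1):
    # For linear Q, dQ_theta/dtheta = x(s, a).
    return theta + alpha * (target - theta @ x) * x

theta0 = np.array([1.0, 2.0])
x = np.array([1.0, 0.0])        # x(s_t, a_t), made up
x_next = np.array([0.0, 1.0])   # x(s_{t+1}, a_{t+1}), made up

theta_mc = q_update(theta0, x, target=3.0)   # MC: target is an observed return G_t
theta_td = q_update(theta0, x, target=sarsa_target(theta0, 0.5, x_next, done=False))
print(theta_mc, theta_td)
```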

Control with Value Function Approx

In the figure above, the upper bound cannot be reached because the approximate value function cannot converge to the true $Q^{\pi}$.

Policy evaluation: approximate policy evaluation, $Q_{\theta} \simeq Q^{\pi}$.
Policy improvement: $\varepsilon$-greedy policy improvement.
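The improvement step can be sketched as an $\varepsilon$-greedy choice over the current approximate action values. The function name, the example values, and the use of a NumPy random generator are assumptions for illustration.

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))   # explore
    return int(np.argmax(q_values))               # exploit

rng = np.random.default_rng(0)
q_values = np.array([0.1, 0.9, 0.3])   # e.g. Q_theta(s, a) for each action a
greedy_action = epsilon_greedy(q_values, epsilon=0.0, rng=rng)
random_actions = [epsilon_greedy(q_values, epsilon=1.0, rng=rng) for _ in range(20)]
print(greedy_action)
```

In a control loop, `q_values` would be recomputed from $Q_{\theta}$ after each parameter update, so evaluation and improvement alternate step by step.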

NOTE of TD Update

For TD(0), the TD target is

State value

$$\begin{aligned} \theta & \leftarrow \theta+\alpha\left(V^{\pi}\left(s_{t}\right)-V_{\theta}\left(s_{t}\right)\right) \frac{\partial V_{\theta}\left(s_{t}\right)}{\partial \theta} \\ &=\theta+\alpha\left(r_{t+1}+\gamma V_{\theta}\left(s_{t+1}\right)-V_{\theta}\left(s_{t}\right)\right) \frac{\partial V_{\theta}\left(s_{t}\right)}{\partial \theta} \end{aligned}$$

Action value

$$\begin{aligned} \theta & \leftarrow \theta+\alpha\left(Q^{\pi}(s, a)-Q_{\theta}(s, a)\right) \frac{\partial Q_{\theta}(s, a)}{\partial \theta} \\ & =\theta+\alpha\left(r_{t+1}+\gamma Q_{\theta}\left(s_{t+1}, a_{t+1}\right)-Q_{\theta}(s, a)\right) \frac{\partial Q_{\theta}(s, a)}{\partial \theta} \end{aligned}$$

Although $\theta$ is in the TD target, we don't calculate the gradient from the target. Think about why.

There are two ways to understand this:

1. The update rule uses the later estimate to update the earlier one; the later estimate itself does not need to be updated. This is the general idea dictated by the Markov property, and algorithms without value function approximation do the same.
2. With neural networks, the later term could be updated as well, but it is deliberately left alone, because updating it makes the algorithm's updates unstable.
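The difference between ignoring and including the target's dependence on $\theta$ can be made concrete for the linear case. In the sketch below, `semi` is the direction used in the updates above, while `full` is the negative gradient of $\frac{1}{2}\delta^{2}$ with the target differentiated as well (often called the residual-gradient direction). The function name and the numbers are hypothetical.

```python
import numpy as np

def td_directions(theta, x_s, r, x_next, gamma=0.9):
    """Return the TD error and two candidate update directions (linear V)."""
    delta = r + gamma * (theta @ x_next) - theta @ x_s
    semi = delta * x_s                       # target treated as a constant
    full = delta * (x_s - gamma * x_next)    # target differentiated too
    return delta, semi, full

theta = np.array([1.0, 2.0])
delta, semi, full = td_directions(theta, x_s=np.array([1.0, 0.0]), r=0.0,
                                  x_next=np.array([0.0, 1.0]))
print(delta, semi, full)
```

Here the `full` direction also pushes the value of the next state down to shrink the error, whereas `semi` only moves the current state's estimate toward the target, in line with the first point above.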

Deep Q-Network (DQN)

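At the end of 2013, DeepMind combined deep learning and reinforcement learning into a deep reinforcement learning algorithm that estimates the value function directly from pixel states.

[5-Minute Paper] Playing Atari with Deep Reinforcement Learning

Update tricks that look fairly simple often turn out to be important methods in deep reinforcement learning. The reason is that once the neural network becomes complex, many classic algorithms stop working so well. If you really understand deep neural networks and apply them to reinforcement learning, the things that look like mere tricks are often the most important algorithms in deep reinforcement learning.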

Volodymyr Mnih, Koray Kavukcuoglu, David Silver et al. Playing Atari with Deep Reinforcement Learning. NIPS 2013 workshop.
Volodymyr Mnih, Koray Kavukcuoglu, David Silver et al. Human-level control through deep reinforcement learning. Nature 2015.
