[5-Minute Paper] Deterministic Policy Gradient Algorithms

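Contents

What problem does it solve?
Background
What method is used?
What results were achieved?
Publication and author information
Reference links
Further reading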

Paper title: Deterministic Policy Gradient Algorithms

What problem does it solve?

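Stochastic policy methods sample actions at random, so their gradient estimates have high variance and they are sample-inefficient. A deterministic policy is more sample-efficient than a stochastic one, but it cannot explore the environment on its own, so it has to be learned off-policy.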

Background

Earlier methods represent the policy as a distribution over actions, $\pi_{\theta}(a|s)$. The authors instead output a deterministic policy, $a = \mu_{\theta}(s)$.

In the stochastic case, the policy gradient integrates over both state and action spaces, whereas in the deterministic case it only integrates over the state space.
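The practical effect of dropping the integral over actions can be seen in a minimal sketch. It assumes a single fixed state, a hypothetical quadratic action value `q_value`, and a Gaussian stochastic policy whose mean is `theta`; none of these come from the paper. With a quadratic value the two gradients have the same expectation, so only the spread of the estimates differs.

```python
import random
import statistics

def q_value(a):
    """Hypothetical action value for one fixed state; the best action is a = 1."""
    return -(a - 1.0) ** 2

def stochastic_grad_sample(theta, sigma, rng):
    """One score-function sample: grad_theta log pi(a) * Q(a), with a ~ N(theta, sigma^2)."""
    a = rng.gauss(theta, sigma)
    score = (a - theta) / sigma ** 2
    return score * q_value(a)

def deterministic_grad(theta):
    """grad_theta mu * grad_a Q at a = mu_theta = theta; here grad_theta mu = 1."""
    return -2.0 * (theta - 1.0)

rng = random.Random(0)
theta, sigma = 0.0, 0.5
samples = [stochastic_grad_sample(theta, sigma, rng) for _ in range(20000)]
stochastic_mean = sum(samples) / len(samples)
stochastic_std = statistics.pstdev(samples)
exact = deterministic_grad(theta)

print(f"stochastic estimate: mean={stochastic_mean:.2f}, per-sample std={stochastic_std:.2f}")
print(f"deterministic gradient: {exact:.2f}")
```

Each stochastic sample is a noisy estimate that has to be averaged, while the deterministic gradient needs no sampling over actions at all in this toy setting.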

Stochastic Policy Gradient

Earlier work used off-policy stochastic policy methods, where the behaviour policy differs from the target policy, $\beta(a|s) \neq \pi_{\theta}(a|s)$:

$$
\begin{aligned} J_{\beta}\left(\pi_{\theta}\right) &=\int_{\mathcal{S}} \rho^{\beta}(s) V^{\pi}(s) \mathrm{d} s \\ &=\int_{\mathcal{S}} \int_{\mathcal{A}} \rho^{\beta}(s) \pi_{\theta}(a | s) Q^{\pi}(s, a) \mathrm{d} a \mathrm{d} s \end{aligned}
$$

Differentiating the performance objective and applying an approximation gives the off-policy policy gradient (Degris et al., 2012b):

$$
\begin{aligned} \nabla_{\theta} J_{\beta}\left(\pi_{\theta}\right) & \approx \int_{\mathcal{S}} \int_{\mathcal{A}} \rho^{\beta}(s) \nabla_{\theta} \pi_{\theta}(a | s) Q^{\pi}(s, a) \mathrm{d} a \mathrm{d} s \\ &=\mathbb{E}_{s \sim \rho^{\beta}, a \sim \beta}\left[\frac{\pi_{\theta}(a | s)}{\beta_{\theta}(a | s)} \nabla_{\theta} \log \pi_{\theta}(a | s) Q^{\pi}(s, a)\right] \end{aligned}
$$

This approximation drops a term that depends on the action-value gradient $\nabla_{\theta}Q^{\pi}(s,a)$ (Degris et al., 2012b).
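The importance ratio in the expectation above can be illustrated with a small sketch. It assumes a single fixed state, Gaussian target and behaviour policies, and a hypothetical action value `q_value` that does not depend on `theta`; all names and numbers are illustrative and not from the paper. Actions are drawn from the behaviour policy and reweighted by $\pi_{\theta}/\beta$.

```python
import math
import random

def normal_pdf(x, mean, std):
    """Density of N(mean, std^2) at x."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))

def q_value(a):
    """Hypothetical action value for one fixed state; the best action is a = 1."""
    return -(a - 1.0) ** 2

def off_policy_grad_sample(theta, sigma, beta_mean, beta_std, rng):
    """One importance-weighted sample of the off-policy stochastic gradient."""
    a = rng.gauss(beta_mean, beta_std)  # action drawn from the behaviour policy
    ratio = normal_pdf(a, theta, sigma) / normal_pdf(a, beta_mean, beta_std)
    score = (a - theta) / sigma ** 2    # grad_theta log pi_theta(a)
    return ratio * score * q_value(a)

rng = random.Random(1)
theta, sigma = 0.0, 0.5
n = 50000
estimate = sum(
    off_policy_grad_sample(theta, sigma, beta_mean=0.5, beta_std=1.0, rng=rng)
    for _ in range(n)
) / n
# For this quadratic value the gradient of the expected value under pi is -2 * (theta - 1).
on_policy_gradient = -2.0 * (theta - 1.0)

print(f"importance-weighted estimate: {estimate:.2f}")
print(f"on-policy gradient:           {on_policy_gradient:.2f}")
```

The behaviour policy here is deliberately wider than the target policy so that the ratio stays bounded; a narrower behaviour policy would make the estimate much noisier.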

For a deterministic policy $\mu_{\theta}(s)$, the parameters are updated along the gradient of $Q$:

$$
\theta^{k+1}=\theta^{k}+\alpha \mathbb{E}_{s \sim \rho^{\mu^{k}}} \left[\nabla_{\theta} Q^{\mu^{k}}\left(s, \mu_{\theta}(s)\right)\right]
$$

Applying the chain rule:

$$
\theta^{k+1}=\theta^{k}+\alpha \mathbb{E}_{s \sim \rho^{\mu^{k}}} \left[\nabla_{\theta} \mu_{\theta}(s) \nabla_{a}Q^{\mu^{k}}\left(s, a\right) |_{a=\mu_{\theta}(s)} \right]
$$
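The chain-rule step can be checked numerically. The sketch below assumes a hypothetical linear policy `mu` and a fixed quadratic `q`, neither taken from the paper, and compares the product $\nabla_{\theta}\mu_{\theta}(s)\,\nabla_{a}Q(s,a)|_{a=\mu_{\theta}(s)}$ with a finite-difference gradient of $Q(s,\mu_{\theta}(s))$ with respect to $\theta$.

```python
def mu(theta, s):
    """Hypothetical linear deterministic policy: a = theta0 + theta1 * s."""
    return theta[0] + theta[1] * s

def q(s, a):
    """Hypothetical fixed action value; it does not depend on theta."""
    return -(a - 2.0 * s) ** 2

def grad_theta_mu(s):
    """Gradient of mu with respect to (theta0, theta1)."""
    return [1.0, s]

def grad_a_q(s, a):
    """Derivative of q with respect to the action."""
    return -2.0 * (a - 2.0 * s)

def chain_rule_grad(theta, s):
    """grad_theta mu(s) * grad_a Q(s, a) evaluated at a = mu_theta(s)."""
    a = mu(theta, s)
    return [g * grad_a_q(s, a) for g in grad_theta_mu(s)]

def finite_difference_grad(theta, s, eps=1e-6):
    """Central-difference gradient of Q(s, mu_theta(s)) with respect to theta."""
    grads = []
    for i in range(len(theta)):
        up = list(theta)
        up[i] += eps
        down = list(theta)
        down[i] -= eps
        grads.append((q(s, mu(up, s)) - q(s, mu(down, s))) / (2.0 * eps))
    return grads

theta, s = [0.3, -0.7], 1.2
analytic = chain_rule_grad(theta, s)
numeric = finite_difference_grad(theta, s)
print(analytic)
print(numeric)
```

The two gradients agree up to floating-point error, which is all the chain-rule form of the update claims.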

What method is used?

On-Policy Deterministic Actor-Critic

This algorithm remains workable when the environment itself is noisy enough to provide exploration for the agent. The critic is updated with Sarsa, using $Q^{w}(s,a)$ to approximate the true action value $Q^{\mu}$:

$$
\begin{aligned} \delta_{t} &=r_{t}+\gamma Q^{w}\left(s_{t+1}, a_{t+1}\right)-Q^{w}\left(s_{t}, a_{t}\right) \\ w_{t+1} &=w_{t}+\alpha_{w} \delta_{t} \nabla_{w} Q^{w}\left(s_{t}, a_{t}\right) \\ \theta_{t+1} &=\theta_{t}+\left.\alpha_{\theta} \nabla_{\theta} \mu_{\theta}\left(s_{t}\right) \nabla_{a} Q^{w}\left(s_{t}, a_{t}\right)\right|_{a=\mu_{\theta}(s)} \end{aligned}
$$
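A single on-policy update can be written out as a short sketch. It assumes a scalar state and action, a hypothetical linear policy `mu`, and a critic that is linear in hand-picked features `[1, s, a, s*a, a*a]`; these choices are illustrative and not from the paper. Both updates use the critic weights from before the step.

```python
def features(s, a):
    """Hypothetical critic features; Q^w(s, a) is linear in these."""
    return [1.0, s, a, s * a, a * a]

def q_w(w, s, a):
    return sum(wi * fi for wi, fi in zip(w, features(s, a)))

def grad_a_q_w(w, s, a):
    """Derivative of Q^w with respect to the action."""
    return w[2] + w[3] * s + 2.0 * w[4] * a

def mu(theta, s):
    """Hypothetical linear deterministic policy."""
    return theta[0] + theta[1] * s

def on_policy_step(theta, w, s, r, s_next, gamma=0.9, alpha_w=0.1, alpha_theta=0.01):
    """One deterministic actor-critic update with a Sarsa critic."""
    a = mu(theta, s)            # on-policy: the action taken is mu_theta(s)
    a_next = mu(theta, s_next)  # Sarsa: the next action also comes from the current policy
    delta = r + gamma * q_w(w, s_next, a_next) - q_w(w, s, a)
    # For a linear critic, grad_w Q^w(s, a) is just the feature vector.
    new_w = [wi + alpha_w * delta * fi for wi, fi in zip(w, features(s, a))]
    dq_da = grad_a_q_w(w, s, a)
    # grad_theta mu(s) = [1, s] for the linear policy above.
    new_theta = [theta[0] + alpha_theta * dq_da, theta[1] + alpha_theta * s * dq_da]
    return new_theta, new_w, delta

theta0, w0 = [0.0, 1.0], [0.0, 0.0, 1.0, 0.0, 0.0]
new_theta, new_w, delta = on_policy_step(theta0, w0, s=0.5, r=1.0, s_next=-0.5)
print(delta, new_theta, new_w)
```

Because the actions come from the deterministic policy itself, any exploration in this variant has to come from noise in the environment.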

Off-Policy Deterministic Actor-Critic

We modify the performance objective to be the value function of the target policy, averaged over the state distribution of the behaviour policy:

$$
\begin{aligned} J_{\beta}\left(\mu_{\theta}\right) &=\int_{\mathcal{S}} \rho^{\beta}(s) V^{\mu}(s) \mathrm{d} s \\ &=\int_{\mathcal{S}} \rho^{\beta}(s) Q^{\mu}\left(s, \mu_{\theta}(s)\right) \mathrm{d} s \end{aligned}
$$

$$
\begin{aligned} \nabla_{\theta} J_{\beta}\left(\mu_{\theta}\right) & \approx \int_{\mathcal{S}} \rho^{\beta}(s) \nabla_{\theta} \mu_{\theta}(a | s) Q^{\mu}(s, a) \mathrm{d} s \\ &=\mathbb{E}_{s \sim \rho^{\beta}} [\nabla_{\theta} \mu_{\theta}(s) \nabla_{a}Q^{\mu}(s,a)|_{a =\mu_{\theta}(s)}] \end{aligned}
$$

This yields the off-policy deterministic actor-critic (OPDAC) algorithm:

$$
\begin{aligned} \delta_{t} &=r_{t}+\gamma Q^{w}\left(s_{t+1}, \mu_{\theta}\left(s_{t+1}\right)\right)-Q^{w}\left(s_{t}, a_{t}\right) \\ w_{t+1} &=w_{t}+\alpha_{w} \delta_{t} \nabla_{w} Q^{w}\left(s_{t}, a_{t}\right) \\ \theta_{t+1} &=\theta_{t}+\left.\alpha_{\theta} \nabla_{\theta} \mu_{\theta}\left(s_{t}\right) \nabla_{a} Q^{w}\left(s_{t}, a_{t}\right)\right|_{a=\mu_{\theta}(s)} \end{aligned}
$$
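The OPDAC updates can be tried on a toy problem. The sketch below assumes a hypothetical one-step task with reward $-(a - 0.5s)^2$, a uniform random behaviour policy, a linear policy `mu`, and a critic linear in hand-picked quadratic features; none of this setup comes from the paper. The `done` flag is an added convenience that switches off the bootstrap term at the end of an episode.

```python
import random

def features(s, a):
    """Hypothetical critic features; Q^w(s, a) is linear in these."""
    return [1.0, s, a, s * a, a * a, s * s]

def q_w(w, s, a):
    return sum(wi * fi for wi, fi in zip(w, features(s, a)))

def grad_a_q_w(w, s, a):
    """Derivative of Q^w with respect to the action."""
    return w[2] + w[3] * s + 2.0 * w[4] * a

def mu(theta, s):
    """Hypothetical linear deterministic target policy."""
    return theta[0] + theta[1] * s

def opdac_step(theta, w, s, a, r, s_next, done, gamma=0.9, alpha_w=0.1, alpha_theta=0.05):
    """One OPDAC update from a transition generated by any behaviour policy."""
    # Q-learning-style target: bootstrap with the target policy's action mu_theta(s_next).
    bootstrap = 0.0 if done else gamma * q_w(w, s_next, mu(theta, s_next))
    delta = r + bootstrap - q_w(w, s, a)
    new_w = [wi + alpha_w * delta * fi for wi, fi in zip(w, features(s, a))]
    # Actor: critic gradient evaluated at the target policy's own action, not at a.
    dq_da = grad_a_q_w(w, s, mu(theta, s))
    new_theta = [theta[0] + alpha_theta * dq_da, theta[1] + alpha_theta * s * dq_da]
    return new_theta, new_w

rng = random.Random(0)
theta, w = [0.0, 0.0], [0.0] * 6
for _ in range(6000):
    s = rng.uniform(-1.0, 1.0)
    a = rng.uniform(-1.0, 1.0)  # behaviour policy: uniform random, ignores theta
    r = -(a - 0.5 * s) ** 2     # hypothetical reward; the best action is 0.5 * s
    theta, w = opdac_step(theta, w, s, a, r, s_next=0.0, done=True)

print(theta)  # should end up close to [0.0, 0.5]
```

One-step episodes are used so that the critic reduces to a regression on the reward and the toy stays well-behaved; with longer episodes the same `opdac_step` bootstraps through $\mu_{\theta}(s_{t+1})$, and stability with function approximation is no longer guaranteed by this sketch.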

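Unlike stochastic off-policy algorithms, no importance sampling is needed here. Because the policy is deterministic, the gradient no longer integrates over actions, so the actor has no action distribution to reweight, and the critic's target uses $\mu_{\theta}(s_{t+1})$ rather than the behaviour policy's next action.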

What results were achieved?

Publication and author information

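This paper was published at ICML 2014. The first author, David Silver, is a research scientist at Google DeepMind. He did his undergraduate and graduate studies at the University of Cambridge and his PhD at the University of Alberta in Canada, joined DeepMind in 2013, and is one of the creators of AlphaGo and the project's lead.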

Reference links

Reference: Degris, T., White, M., and Sutton, R. S. (2012b). Linear off-policy actor-critic. In 29th International Conference on Machine Learning.

Further reading

Let the true action-value function be $Q^{\pi}(s,a)$, and approximate it with a function $Q^{w}(s,a) \approx Q^{\pi}(s,a)$. In general, substituting the approximation into the policy gradient introduces bias. However, there is no bias (Sutton et al., 1999) if the function approximator is compatible, that is:

1. $Q^{w}(s, a)=\nabla_{\theta} \log \pi_{\theta}(a | s)^{\top} w$ (linear in the "features" $\nabla_{\theta} \log \pi_{\theta}(a | s)$);
2. the parameters $w$ are chosen to minimise the mean-squared error $\varepsilon^{2}(w) = \mathbb{E}_{s \sim \rho^{\pi},a \sim \pi_{\theta}}[(Q^{w}(s,a)-Q^{\pi}(s,a))^{2}]$ (a linear regression problem on these features).

In that case:

$$
\nabla_{\theta} J\left(\pi_{\theta}\right)=\mathbb{E}_{s \sim \rho^{\pi}, a \sim \pi_{\theta}}\left[\nabla_{\theta} \log \pi_{\theta}(a | s) Q^{w}(s, a)\right]
$$

Finally, the paper gives a theorem on linear function approximation for DPG, together with the supporting theory and proofs.
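For orientation, the deterministic counterpart of the compatibility conditions can be sketched as follows. This is written from memory of the paper's compatible function approximation theorem and uses its notation, so it should be checked against the original before being relied on. The last line is one linear critic that satisfies the first condition, with $V^{v}(s)$ a state-only baseline with its own parameters $v$.

```latex
\begin{aligned}
&\text{(1)}\quad \left.\nabla_{a} Q^{w}(s, a)\right|_{a=\mu_{\theta}(s)} = \nabla_{\theta} \mu_{\theta}(s)^{\top} w \\
&\text{(2)}\quad w \text{ minimises } \mathbb{E}\left[\epsilon(s ; \theta, w)^{\top} \epsilon(s ; \theta, w)\right],
\quad \epsilon(s ; \theta, w) = \left.\nabla_{a} Q^{w}(s, a)\right|_{a=\mu_{\theta}(s)} - \left.\nabla_{a} Q^{\mu}(s, a)\right|_{a=\mu_{\theta}(s)} \\
&\text{e.g.}\quad Q^{w}(s, a) = \left(a-\mu_{\theta}(s)\right)^{\top} \nabla_{\theta} \mu_{\theta}(s)^{\top} w + V^{v}(s)
\end{aligned}
```

Compared with the stochastic case, the features are built from $\nabla_{\theta} \mu_{\theta}(s)$ instead of $\nabla_{\theta} \log \pi_{\theta}(a|s)$, and the regression targets the action gradient of $Q$ rather than $Q$ itself.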

Reference: Sutton, R. S., McAllester, D. A., Singh, S. P., and Mansour, Y. (1999). Policy gradient methods for reinforcement learning with function approximation. In Neural Information Processing Systems 12, pages 1057–1063.

I plan to reread this paper when I have time; some of the proofs deserve a closer look.