[ICLR 2020] Imitation Learning via Reinforcement Learning with Sparse Rewards

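Contents

What problem does it solve?
Background
What method is used?
- Soft Q Imitation Learning algorithm
What results were achieved?
Publication and author information
Reference links
Further reading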

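Paper title: SQIL: Imitation Learning via Reinforcement Learning with Sparse Rewards

What problem does it solve?

Imitation learning from high-dimensional state-action spaces is difficult. Classic behavioral cloning (BC) is prone to distribution shift. A more recent and stronger approach is generative adversarial imitation learning (GAIL), which combines inverse reinforcement learning (IRL) with generative adversarial networks and uses adversarial training to learn a reward function. The algorithm proposed in this paper does not need to learn a reward function. The paper as a whole argues that RL with a constant reward can be as effective as RL with a complex learned reward function.

The main contribution is a simple, easy-to-implement imitation learning algorithm for high-dimensional, continuous, dynamic environments that copes well with the distribution shift problem in imitation learning.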

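Background

The problem with imitation learning is distribution shift, and the errors accumulate: once the agent's trajectory drifts away from the expert's, it does not know how to get back to states on the expert's trajectory. A recent method that handles this well is GAIL, whose main advantage is that it encourages long-horizon imitation. Why can GAIL do this? Imitation involves two things. The first is taking the demonstrated action in a given state, which is all that ordinary BC does. The second, which GAIL adds, is that after taking the action the agent should land on the next state of the expert's trajectory. The authors keep both of these properties of GAIL but drop its adversarial training and use a constant reward instead: matching the demonstrated action in a demonstrated state gives reward = +1, and everything else gives reward = 0. In other words, the agent is rewarded only when it takes the demonstrated action in a demonstrated state, so the whole problem becomes an RL problem with sparse rewards.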

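What method is used?

The authors build on soft Q-learning: expert demonstrations are given a reward of 1, and new experiences collected by interacting with the environment are given a reward of 0. Because soft Q-learning is off-policy, it can be trained on whatever data is available. The authors call the resulting algorithm soft Q imitation learning (SQIL).

Soft Q Imitation Learning algorithm

SQIL makes three small modifications to soft Q-learning:

- It initially fills the agent's experience replay buffer with expert demonstrations, whose reward is set to +1.
- New data from the agent's interaction with the environment is also added to the replay buffer, with reward set to 0.
- It balances demonstration experiences and new experiences, 50% each, in every batch sampled from the buffer. The same balancing is also used in GAIL and adversarial IRL.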
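The three modifications above can be sketched as a small replay buffer. This is only a minimal illustration and not the authors' code: the class name `BalancedReplayBuffer`, its methods, and the tuple layout `(s, a, s_next, r)` are assumptions made for this example.

```python
import random


class BalancedReplayBuffer:
    """Hypothetical buffer holding transitions with SQIL's constant rewards."""

    def __init__(self, demo_transitions):
        # (1) Demonstrations are stored up front with reward +1.
        self.demo = [(s, a, s_next, 1.0) for (s, a, s_next) in demo_transitions]
        self.samp = []

    def add_sampled(self, s, a, s_next):
        # (2) Transitions from the agent's own rollouts get reward 0.
        self.samp.append((s, a, s_next, 0.0))

    def sample(self, batch_size, rng=random):
        # (3) Half of each batch comes from demonstrations, half from new experience.
        half = batch_size // 2
        demo_batch = [rng.choice(self.demo) for _ in range(half)]
        # Before any rollout exists, fall back to a demonstration-only batch.
        samp_batch = [rng.choice(self.samp) for _ in range(half)] if self.samp else []
        return demo_batch + samp_batch


buffer = BalancedReplayBuffer([(0, 1, 1), (1, 0, 2)])
buffer.add_sampled(0, 0, 0)
batch = buffer.sample(8, random.Random(0))
```

Sampling with replacement keeps the 50/50 split even when one side of the buffer is much smaller than the other, which is the usual situation early in training.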

The paper gives SQIL as pseudocode, in which $Q_{\theta}$ denotes the soft Q function, $\mathcal{D}_{\text{demo}}$ the demonstrations, and $\delta^{2}$ the squared soft Bellman error. Equation 1 reads:

$$\delta^{2}(\mathcal{D}, r) \triangleq \frac{1}{|\mathcal{D}|} \sum_{\left(s, a, s^{\prime}\right) \in \mathcal{D}}\left(Q_{\boldsymbol{\theta}}(s, a)-\left(r+\gamma \log \left(\sum_{a^{\prime} \in \mathcal{A}} \exp \left(Q_{\boldsymbol{\theta}}\left(s^{\prime}, a^{\prime}\right)\right)\right)\right)\right)^{2}$$

Here the reward $r$ takes only the values 0 and 1. Intuitively, the objective pushes demonstrated actions toward high $Q$ values, while in nearby states the action distribution is not expected to be as peaked and should be more uniform, which is where the connection to entropy comes in.
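To make Equation 1 concrete, here is a minimal sketch of the squared soft Bellman error for a tabular Q function. The tabular representation `q[s][a]`, the function names, and the toy transitions are assumptions for illustration; the paper itself works with a parameterized $Q_{\theta}$.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def soft_bellman_error(q, transitions, r, gamma=0.99):
    """Mean squared soft Bellman error over (s, a, s_next) with constant reward r.

    q is a tabular soft Q function: q[s][a] is a float.
    """
    total = 0.0
    for s, a, s_next in transitions:
        target = r + gamma * logsumexp(q[s_next])
        total += (q[s][a] - target) ** 2
    return total / len(transitions)


# Toy example: two states, two actions, Q initialized to zero.
q = [[0.0, 0.0], [0.0, 0.0]]
demo = [(0, 0, 1)]   # demonstration transition, evaluated with r = 1
samp = [(1, 1, 0)]   # agent transition, evaluated with r = 0
err_demo = soft_bellman_error(q, demo, 1.0, gamma=1.0)
err_samp = soft_bellman_error(q, samp, 0.0, gamma=1.0)
```

With the same Q table, the demonstration transition has the larger error because its target includes the +1 reward, so minimizing the error raises $Q$ on demonstrated state-action pairs.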

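What results were achieved?

Publication and author information

The first author is Siddharth Reddy, a PhD student at the University of California, Berkeley.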

Reference links

Expert demonstrations: https://drive.google.com/drive/folders/1h3H4AY_ZBx08hz-Ct0Nxxus-V1melu1U

Further reading

Maximum entropy model of expert behavior:

SQIL is derived from a maximum entropy model of expert behavior. The policy $\pi$ follows a Boltzmann distribution:

$$\pi(a | s) \triangleq \frac{\exp (Q(s, a))}{\sum_{a^{\prime} \in \mathcal{A}} \exp \left(Q\left(s, a^{\prime}\right)\right)}$$
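As a small illustration of the Boltzmann policy, the sketch below turns a list of Q values for one state into action probabilities. The function name `boltzmann_policy` is an assumption for this example; subtracting the maximum before exponentiating is a standard numerical-stability trick and does not change the result.

```python
import math


def boltzmann_policy(q_values):
    """Return pi(a | s) proportional to exp(Q(s, a)) for one state's Q values."""
    m = max(q_values)
    exps = [math.exp(q - m) for q in q_values]
    z = sum(exps)
    return [e / z for e in exps]


uniform = boltzmann_policy([0.0, 0.0, 0.0])   # equal Q values give a uniform policy
peaked = boltzmann_policy([2.0, 0.0])         # the higher Q value gets more probability
```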

The soft Q values are obtained from the soft Bellman equation:

$$Q(s, a) \triangleq R(s, a)+\gamma \mathbb{E}_{s^{\prime}}\left[\log \left(\sum_{a^{\prime} \in \mathcal{A}} \exp \left(Q\left(s^{\prime}, a^{\prime}\right)\right)\right)\right]$$
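To see what the soft Bellman equation computes, the following sketch iterates it to a fixed point on a tiny MDP. Note the assumption: it uses a known reward table and known transition probabilities purely for illustration, whereas in the imitation setting neither is available. The function name and data layout are hypothetical.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def soft_q_iteration(rewards, transitions, gamma=0.9, iters=200):
    """Repeatedly apply the soft Bellman backup.

    rewards[s][a] is R(s, a); transitions[s][a] is a list of (prob, s_next).
    """
    n_states = len(rewards)
    q = [[0.0 for _ in rewards[s]] for s in range(n_states)]
    for _ in range(iters):
        # Soft value of each state: log-sum-exp over its actions.
        v = [logsumexp(q[s]) for s in range(n_states)]
        q = [
            [
                rewards[s][a] + gamma * sum(p * v[s2] for p, s2 in transitions[s][a])
                for a in range(len(rewards[s]))
            ]
            for s in range(n_states)
        ]
    return q


# One state with two zero-reward actions that both loop back to it.
rewards = [[0.0, 0.0]]
transitions = [[[(1.0, 0)], [(1.0, 0)]]]
q = soft_q_iteration(rewards, transitions, gamma=0.5)
```

In this toy case the fixed point satisfies $Q = \gamma (Q + \log 2)$, so with $\gamma = 0.5$ both entries converge to $\log 2$: even with zero reward, the soft value is positive because of the entropy bonus from having two actions.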

In our imitation learning setting, the rewards and dynamics are unknown. The expert produces a fixed set of demonstrations $\mathcal{D}_{\text{demo}}$ by rolling out its policy $\pi$ in the environment, which yields state transitions $(s, a, s^{\prime}) \in \mathcal{D}_{\text{demo}}$.

Behavioral cloning (BC):

Behavioral cloning fits a parameterized model $\pi_{\theta}$ by minimizing the negative log-likelihood loss:

$$\ell_{\mathrm{BC}}(\boldsymbol{\theta}) \triangleq \sum_{(s, a) \in \mathcal{D}_{\text{demo}}}-\log \pi_{\boldsymbol{\theta}}(a | s)$$

Because the authors parameterize the policy with a soft Q function, the likelihood objective becomes:

$$\ell_{\mathrm{BC}}(\boldsymbol{\theta}) \triangleq \sum_{(s, a) \in \mathcal{D}_{\text{demo}}}-\left(Q_{\boldsymbol{\theta}}(s, a)-\log \left(\sum_{a^{\prime} \in \mathcal{A}} \exp \left(Q_{\boldsymbol{\theta}}\left(s, a^{\prime}\right)\right)\right)\right)$$

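Note that this loss by itself still uses only $(s, a)$ pairs: the log-sum-exp term is the normalizer over actions at the same state $s$. The benefit over plain behavioral cloning comes from writing the policy in this energy-based form in terms of $Q_{\boldsymbol{\theta}}$, because the soft Bellman equation can then bring state transitions into the objective.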
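The BC loss written in terms of Q values can be sketched as follows. The tabular `q[s][a]` layout and the function names are assumptions for illustration only.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def bc_loss(q, demo_pairs):
    """Negative log-likelihood of demonstrated (s, a) pairs under pi(a|s) ∝ exp(Q(s,a))."""
    return sum(-(q[s][a] - logsumexp(q[s])) for s, a in demo_pairs)


demo_pairs = [(0, 0)]
loss_flat = bc_loss([[0.0, 0.0]], demo_pairs)    # uniform policy over two actions
loss_peaked = bc_loss([[3.0, 0.0]], demo_pairs)  # Q favors the demonstrated action
```

Raising the Q value of the demonstrated action relative to the other actions lowers the loss, which is exactly the likelihood-maximizing behavior of BC; no next state appears anywhere in this computation.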

Regularized Behavioral Cloning

SQIL can be viewed as behavioral cloning with a sparsity prior on the implicitly represented rewards.

Sparsity regularization: when the agent encounters a state it has not seen before, $Q_{\theta}$ may output arbitrary values. Piot et al. (2014) address this with a regularization term that places a sparsity prior on the implied rewards.

Bilal Piot, Matthieu Geist, and Olivier Pietquin. Boosted and reward-regularized classification for apprenticeship learning. In Proceedings of the 2014 International Conference on Autonomous Agents and Multi-Agent Systems, pp. 1249–1256. International Foundation for Autonomous Agents and Multiagent Systems, 2014.

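The authors differ from that paper in two ways: they apply the idea to continuous state spaces, and they sample rollouts with the latest imitation policy.

From the soft Bellman equation above, we obtain an expression for the reward: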

$$R_{q}(s, a) \triangleq Q_{\boldsymbol{\theta}}(s, a)-\gamma \mathbb{E}_{s^{\prime}}\left[\log \left(\sum_{a^{\prime} \in \mathcal{A}} \exp \left(Q_{\boldsymbol{\theta}}\left(s^{\prime}, a^{\prime}\right)\right)\right)\right]$$
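The implied reward $R_q$ can be estimated from observed transitions. In the sketch below, the expectation over $s^{\prime}$ is approximated by averaging over a list of observed next states, which is an assumption of this example; the function names and the tabular `q[s][a]` layout are also hypothetical.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def implied_reward(q, s, a, next_states, gamma=0.99):
    """Estimate R_q(s, a) = Q(s, a) - gamma * E_{s'}[log sum_{a'} exp Q(s', a')].

    The expectation is replaced by a sample average over next_states.
    """
    expected_soft_value = sum(logsumexp(q[s2]) for s2 in next_states) / len(next_states)
    return q[s][a] - gamma * expected_soft_value


q = [[1.0, 0.0], [0.0, 0.0]]
r_hat = implied_reward(q, s=0, a=0, next_states=[1], gamma=1.0)
```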

The implied reward $R_{q}$ takes the next state $s^{\prime}$ into account, unlike BC, which only maximizes the action likelihood. The final regularized BC (RBC) objective can be written as:

$$\ell_{\mathrm{RBC}}(\boldsymbol{\theta}) \triangleq \ell_{\mathrm{BC}}(\boldsymbol{\theta})+\lambda \delta^{2}\left(\mathcal{D}_{\text{demo}} \cup \mathcal{D}_{\text{samp}}, 0\right)$$

where $\lambda$ is a hyperparameter and $\delta^{2}$ is the squared soft Bellman error. RBC and SQIL thus arrive at much the same thing by different routes.
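The RBC objective can be assembled from its two parts as in the sketch below. It assumes a tabular `q[s][a]`, treats $\mathcal{D}_{\text{demo}} \cup \mathcal{D}_{\text{samp}}$ as the concatenation of two disjoint lists of `(s, a, s_next)` tuples, and uses hypothetical function names and an arbitrary `lam` value.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def bc_loss(q, demo):
    # Negative log-likelihood of demonstrated actions; the next state is ignored.
    return sum(logsumexp(q[s]) - q[s][a] for s, a, _ in demo)


def soft_bellman_error(q, transitions, r, gamma):
    # Mean squared soft Bellman error with constant reward r.
    return sum(
        (q[s][a] - (r + gamma * logsumexp(q[s2]))) ** 2 for s, a, s2 in transitions
    ) / len(transitions)


def rbc_loss(q, demo, samp, lam=0.5, gamma=0.99):
    """BC loss plus lam times the squared soft Bellman error with r = 0.

    With r = 0 the penalty is the mean squared implied reward, i.e. a
    sparsity prior that pulls implied rewards toward zero.
    """
    return bc_loss(q, demo) + lam * soft_bellman_error(q, demo + samp, 0.0, gamma)


q = [[0.0, 0.0], [0.0, 0.0]]
demo = [(0, 0, 1)]
samp = [(1, 1, 0)]
loss = rbc_loss(q, demo, samp, lam=0.5, gamma=1.0)
```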

Connection Between SQIL and Regularized Behavioral Cloning

$$\nabla_{\boldsymbol{\theta}} \ell_{\mathrm{RBC}}(\boldsymbol{\theta}) \propto \nabla_{\boldsymbol{\theta}}\left(\delta^{2}\left(\mathcal{D}_{\text{demo}}, 1\right)+\lambda_{\text{samp}} \delta^{2}\left(\mathcal{D}_{\text{samp}}, 0\right)+V\left(s_{0}\right)\right)$$

Compared with RBC, SQIL introduces the +1 and 0 rewards, which amounts to strengthening the sparse-reward prior.