Understanding the TRPO Algorithm, from Intuition to Math

  • 1. Paper Ideas and Principles
    • 1.1 The surrogate function
    • 1.2 The objective function
    • 1.3 First-order approximation: the L function
      • 1.3.1 Trick 1: first-order approximation
      • 1.3.2 Importance sampling
      • 1.3.3 Choosing the step size
    • 1.4 Proof of monotonic improvement
    • 1.5 Optimizing the objective function
  • TensorFlow Code in Practice

Preface: Policy gradient methods are powerful and deep, but their Achilles' heel is that the update step size $\alpha$ is hard to choose. When the step size is inappropriate, the updated parameters correspond to a worse policy; when that worse policy is then used to collect samples and learn, the next update becomes worse still, so learning can easily deteriorate and even collapse without learning anything. A suitable step size is therefore crucial for reinforcement learning, yet most current RL algorithms cannot guarantee monotonic improvement. TRPO tackles the step-size problem head-on and derives, mathematically, a monotonic policy-improvement procedure.

Because the mathematical derivation in TRPO is particularly involved, and because PPO (Proximal Policy Optimization, to be covered in a separate post) simplifies the procedure considerably, I always felt my understanding of TRPO was incomplete. This post therefore draws on many excellent technical blogs to finish working through the algorithm properly. My respect and thanks go to those authors!

Trust Region Policy Optimization: TRPO


TRPO (Trust Region Policy Optimization) is a policy optimization algorithm proposed in February 2015 by John Schulman, now a scientist at OpenAI, and published at the 31st ICML. It is a distinctive and original method, and Schulman devoted a separate chapter of his 2016 PhD thesis to it (Chapter 3, pages 18-44). Consider the following question: when solving with gradient descent, a large step easily overshoots an optimum, while a small step converges very slowly, as illustrated in the figure.

In gradient ascent we determine the steepest direction and then step in that direction. But if the learning rate (step size) is too large, the move may take us away from the true reward function, and can even end in disaster. Inside a trust region, we use a variable $\delta$ to limit the search region (the trust region); such a region guarantees that, until it reaches a locally or globally optimal policy, the optimized policy will be better than the current one, as shown in the figure.

1. Paper Ideas and Principles

TRPO builds on the MM (Minorize-Maximization) optimization procedure; in this section we derive TRPO's objective function step by step.

The MM (Minorize-Maximization) procedure

MM is an iterative optimization scheme that exploits surrogate convex (or concave) functions in order to find maxima or minima, depending on whether the desired optimization is a maximization or a minimization. MM itself is not a single algorithm but a description of how to construct optimization algorithms. Its historical roots go back at least to 1970, when Ortega and Rheinboldt were studying line-search methods. The same idea kept reappearing in different forms in different fields, and in 2000 Hunter and Lange proposed "MM" as a general framework. Recent research has applied the method across a wide range of disciplines, such as mathematics, statistics, machine learning, and engineering. The MM procedure is as follows:


See the MM procedure reference for more details.
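As a toy illustration of the minorize-maximize idea (my own sketch, not from the references): if $|f''(x)| \leq L$ everywhere, then $g(x \mid x_k) = f(x_k) + f'(x_k)(x - x_k) - \frac{L}{2}(x - x_k)^2$ is a surrogate that touches $f$ at $x_k$ and lower-bounds it, and repeatedly maximizing $g$ is exactly an MM iteration. A minimal Python sketch, assuming $f(x) = \cos(x)$ so that $L = 1$ is a valid curvature bound:

import numpy as np

def mm_maximize(f, grad_f, L, x0, iters=20):
    """Minorize-Maximization with the quadratic minorizer
    g(x|xk) = f(xk) + f'(xk)(x-xk) - L/2 (x-xk)^2.
    Its maximizer is xk + f'(xk)/L, and f(x_{k+1}) >= f(x_k) is guaranteed."""
    x = x0
    for _ in range(iters):
        x = x + grad_f(x) / L   # argmax of the minorizer around the current point
    return x

# toy example: maximize f(x) = cos(x); |f''| <= 1, so L = 1 works
x_star = mm_maximize(np.cos, lambda x: -np.sin(x), L=1.0, x0=2.0)
print(x_star, np.cos(x_star))   # converges toward x = 0, where cos attains its maximum 1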

1.1 The surrogate function

The goal of RL is to maximize the expected discounted reward. According to policy gradient methods, the parameter update rule is:

$$\theta_{new} = \theta_{old} + \alpha \nabla_{\theta} J$$

The Achilles' heel of policy gradient algorithms is the step size $\alpha$: when it is chosen poorly, the updated parameters correspond to a worse policy, and learning from samples collected with that worse policy makes the next update even worse, so the method can easily deteriorate and eventually collapse. A suitable step size is therefore critical. What counts as a suitable step size? One for which the return does not get worse after the policy update. How do we choose such a step size, or equivalently, how do we find a new policy whose return increases monotonically (or at least does not decrease)? That is exactly the problem TRPO solves. In the figure below, the red curve is the expected discounted return,

where $\eta$ is defined as:

$$\eta(\pi_{\theta}) = \mathbb{E}_{\tau \sim \pi_{\theta}}\left[\sum_{t=0}^{\infty}\gamma^{t} r_{t}\right]$$

i.e., the expectation of the cumulative discounted reward. In the figure, $C$ is a constant, and the KL term in the blue curve $M$ denotes the KL divergence (explained later). At each iteration, the surrogate function $M$ (the blue curve) has the following properties:

  • it is a lower bound on $\eta$
  • it can be used to approximate the discounted reward $\eta$ of the current policy
  • it is easy to optimize (we will approximate the surrogate by a quadratic function)

At each iteration we find the maximizer of $M$ and take it as the new current policy.

We then re-evaluate the lower bound at the new policy and repeat. As this process continues, the policy keeps improving; because the set of possible policies is finite, the current policy will eventually converge to a locally or globally optimal policy.

1.2 The objective function

In reinforcement learning, the value function and the action-value function are the foundation of learning, and the advantage function is a key ingredient of algorithms such as A3C and TRPO. We first define the action-value function, the state-value function, and the advantage function:

$$Q_{\pi}(s_{t}, a_{t}) = \mathbb{E}_{s_{t+1}, a_{t+1}, \dots}\left[\sum_{k=0}^{\infty}\gamma^{k} r(s_{t+k})\right]$$
$$V_{\pi}(s_{t}) = \mathbb{E}_{a_{t}, s_{t+1}, \dots}\left[\sum_{k=0}^{\infty}\gamma^{k} r(s_{t+k})\right]$$
$$A_{\pi}(s, a) = Q_{\pi}(s, a) - V_{\pi}(s)$$
$$\text{where } a_{t} \sim \pi(a_{t}|s_{t}),\ s_{t+1} \sim P(s_{t+1}|s_{t}, a_{t}) \text{ for } t \geq 0$$

To measure the return that a single action brings, the advantage function is widely used.

The expected discounted reward is written as:

$$\eta(\pi) = \mathbb{E}_{s_{0}, a_{0}, \dots}\left[\sum_{t=0}^{\infty}\gamma^{t} r(s_{t})\right], \quad \text{where } s_{0} \sim \rho_{0}(s_{0}),\ a_{t} \sim \pi(a_{t}|s_{t}),\ s_{t+1} \sim P(s_{t+1}|s_{t}, a_{t})$$
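As a quick numerical illustration of the discounted return (my own sketch; the reward sequence is made up), the backward pass below is the same trick implemented by core.discount_cumsum in the SpinningUp code later in this post:

import numpy as np

def discounted_return(rewards, gamma):
    """Return sum_t gamma^t * r_t for one trajectory."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

def rewards_to_go(rewards, gamma):
    """Discounted reward-to-go from every timestep, via a single backward pass."""
    out = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out

rews = [1.0, 0.0, 2.0, 3.0]
print(discounted_return(rews, 0.9))   # 1 + 0.9^2*2 + 0.9^3*3 = 4.807
print(rewards_to_go(rews, 0.9))       # first entry equals the total discounted return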

The return of one policy can be expressed in terms of another policy:

$$\eta(\hat{\pi}) = \eta(\pi) + \mathbb{E}_{s_{0}, a_{0}, \dots \sim \hat{\pi}}\left[\sum_{t=0}^{\infty}\gamma^{t} A_{\pi}(s_{t}, a_{t})\right]$$

$$A_{\pi}(s, a) = Q_{\pi}(s, a) - V_{\pi}(s) = \mathbb{E}_{s' \sim P(s'|s, a)}\left[r(s) + \gamma V_{\pi}(s') - V_{\pi}(s)\right]$$

Here the notation $\mathbb{E}_{s_{0}, a_{0}, \dots \sim \hat{\pi}}[\dots]$ means that actions are sampled as $a_{t} \sim \hat{\pi}(\cdot|s_{t})$; $\eta(\pi)$ is the return of the old policy and $\eta(\hat{\pi})$ that of the new policy, and $\mathbb{E}_{s_{0}, a_{0}, \dots \sim \hat{\pi}}\left[\sum_{t=0}^{\infty}\gamma^{t} A_{\pi}(s_{t}, a_{t})\right]$ is the gap between the returns of the new and old policies — exactly the "advantage" that the advantage function measures. We also define $\rho_{\pi}$ as the discounted visitation frequency:

$$\rho_{\pi}(s) = P(s_{0}=s) + \gamma P(s_{1}=s) + \gamma^{2} P(s_{2}=s) + \dots$$
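To make the discounted visitation frequency concrete, here is a small Monte-Carlo estimate of $\rho_{\pi}$ on a toy two-state Markov chain (my own sketch; the transition matrix P is made up and stands in for the state dynamics induced by a fixed policy):

import numpy as np

# made-up 2-state chain induced by some fixed policy: P[s, s'] = probability of s -> s'
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])
gamma, n_traj, horizon = 0.9, 5000, 200
rho = np.zeros(2)
rng = np.random.default_rng(0)
for _ in range(n_traj):
    s = 0                                   # start state s0 = 0
    for t in range(horizon):
        rho[s] += gamma ** t                # accumulate gamma^t * 1[s_t = s]
        s = rng.choice(2, p=P[s])
rho /= n_traj
print(rho, rho.sum())                       # the entries sum to ~1/(1-gamma) = 10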

The proof of this identity for the gap between the new and old returns is as follows:
$$\begin{aligned}
\mathbb{E}_{\tau \sim \pi'}\left[\sum_{t=0}^{\infty}\gamma^{t} A^{\pi}(s_{t}, a_{t})\right]
&= \mathbb{E}_{\tau \sim \pi'}\left[\sum_{t=0}^{\infty}\gamma^{t}\left(R(s_{t}, a_{t}, s_{t+1}) + \gamma V^{\pi}(s_{t+1}) - V^{\pi}(s_{t})\right)\right] \\
&= \eta(\pi') + \mathbb{E}_{\tau \sim \pi'}\left[\sum_{t=0}^{\infty}\gamma^{t+1} V^{\pi}(s_{t+1}) - \sum_{t=0}^{\infty}\gamma^{t} V^{\pi}(s_{t})\right] \\
&= \eta(\pi') + \mathbb{E}_{\tau \sim \pi'}\left[\sum_{t=1}^{\infty}\gamma^{t} V^{\pi}(s_{t}) - \sum_{t=0}^{\infty}\gamma^{t} V^{\pi}(s_{t})\right] \\
&= \eta(\pi') - \mathbb{E}_{\tau \sim \pi'}\left[V^{\pi}(s_{0})\right] \\
&= \eta(\pi') - \eta(\pi)
\end{aligned}$$

Expanding the identity over states and actions gives:

$$\begin{aligned}
\eta(\hat{\pi}) &= \eta(\pi) + \sum_{t=0}^{\infty}\sum_{s} P(s_{t}=s \mid \hat{\pi})\sum_{a}\hat{\pi}(a|s)\,\gamma^{t} A_{\pi}(s, a) \\
&= \eta(\pi) + \sum_{s}\rho_{\hat{\pi}}(s)\sum_{a}\hat{\pi}(a|s) A_{\pi}(s, a)
\end{aligned}$$
where $\rho_{\hat{\pi}}(s) = \sum_{t=0}^{\infty}\gamma^{t} P(s_{t} = s \mid \hat{\pi})$.


Interpreting the advantage function (illustrated by the tree diagram below):

The value function $V(s)$ can be understood as the sum, over all possible actions, of the action-value function weighted by the probability of taking that action; in plainer terms, $V(s)$ is the probability-weighted average of the action values in that state. The action-value function $Q(s,a)$ is the value of a single action, so $Q_{\pi}(s,a) - V_{\pi}(s)$ measures how the value of this particular action compares to that average. The "advantage" is therefore the advantage of an action's value over the state's value: if the advantage is positive, the action is better than average; if it is negative, the action is worse than average.
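A minimal tabular sketch of that computation (my own example; the Q values and action probabilities are made up):

import numpy as np

# made-up action values Q(s, a) for one state with 3 actions, and the policy's action probabilities
Q_s = np.array([1.0, 2.0, 4.0])
pi_s = np.array([0.2, 0.5, 0.3])

V_s = np.dot(pi_s, Q_s)      # V(s) = sum_a pi(a|s) Q(s,a) = 2.4
A_s = Q_s - V_s              # A(s,a) = Q(s,a) - V(s)
print(V_s, A_s)              # [-1.4, -0.4, 1.6]: only the third action beats the average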


1.3 First-order approximation: the L function

1.3.1 Trick 1: first-order approximation

TRPO's first trick deals with the state distribution: we ignore the change in the state distribution and keep using the distribution induced by the old policy. This is a first-order approximation of the original objective. When the new and old parameters are close, replacing the new state distribution with the old one is reasonable; at the same time, the MM machinery is used to find a function that locally lower-bounds $\eta$ at the current policy. The original objective is approximated by the function $L$:

$$L_{\pi}(\hat{\pi}) = \eta(\pi) + \sum_{s}\rho_{\pi}(s)\sum_{a}\hat{\pi}(a|s) A_{\pi}(s, a)$$

In the second term the actions are generated by the new policy $\hat{\pi}$, but the new policy's parameters $\theta$ are unknown, so it cannot yet be used to generate actions. This is where TRPO's second trick comes in: importance sampling.

1.3.2 Importance sampling

TRPO's second trick handles the action distribution with importance sampling. As discussed below, the difference between the new and old policies is kept very small, so importance sampling works very well here:

$$\sum_{a}\hat{\pi}(a|s) A_{\pi}(s, a) = \mathbb{E}_{a \sim q(a|s)}\left[\frac{\hat{\pi}(a|s)}{q(a|s)} A_{\pi}(s, a)\right]$$
Writing the sum over $s$ as an expectation as well, the L function becomes:

$$L_{\pi}(\hat{\pi}) = \eta(\pi) + \mathbb{E}_{s \sim \rho_{\theta_{old}},\ a \sim \pi_{\theta_{old}}(a|s)}\left[\frac{\hat{\pi}_{\theta}(a|s)}{\pi_{\theta_{old}}(a|s)} A_{\pi_{\theta_{old}}}(s, a)\right]$$

Importance sampling
Importance sampling is a method in statistics for estimating properties of one distribution while drawing samples from a different distribution and reweighting them. It is related to umbrella sampling in computational physics.
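A minimal numerical sketch of this reweighting (my own example; the Gaussian target p and proposal q are made up): we estimate $\mathbb{E}_{x \sim p}[f(x)]$ using samples drawn from q and the ratio $p(x)/q(x)$:

import numpy as np

rng = np.random.default_rng(0)
f = lambda x: x ** 2

# target p = N(1, 1), proposal q = N(0, 2); both densities are known in closed form
p_pdf = lambda x: np.exp(-0.5 * (x - 1) ** 2) / np.sqrt(2 * np.pi)
q_pdf = lambda x: np.exp(-0.5 * (x / 2) ** 2) / (2 * np.sqrt(2 * np.pi))

x = rng.normal(0.0, 2.0, size=200_000)           # samples from q, not from p
is_estimate = np.mean(f(x) * p_pdf(x) / q_pdf(x))
print(is_estimate)                               # ~2.0, since E_p[x^2] = Var + mean^2 = 1 + 1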

1.3.3 Choosing the step size

Compare the following two expressions:

$$\eta(\hat{\pi}) = \eta(\pi) + \sum_{t=0}^{\infty}\sum_{s} P(s_{t}=s \mid \hat{\pi})\sum_{a}\hat{\pi}(a|s)\,\gamma^{t} A_{\pi}(s, a)$$
$$L_{\pi}(\hat{\pi}) = \eta(\pi) + \mathbb{E}_{s \sim \rho_{\theta_{old}},\ a \sim \pi_{\theta_{old}}(a|s)}\left[\frac{\hat{\pi}_{\theta}(a|s)}{\pi_{\theta_{old}}(a|s)} A_{\pi_{\theta_{old}}}(s, a)\right]$$
We see that $\eta(\hat{\pi})$ and $L_{\pi}(\hat{\pi})$ differ only in which distribution is used; both are functions of $\hat{\pi}$, and at $\pi_{\theta_{old}}$ they take the same value (and, as shown below, the same first derivative).

$L$ is one part of the lower-bound function $M$ (the part underlined in red in the figure).

The second term in $M$ is the KL divergence.


KL divergence:
The Kullback-Leibler (KL) divergence was introduced by Solomon Kullback and Richard Leibler in 1951 as a directed measure of the difference between two distributions. For discrete probability distributions $P$ and $Q$ defined on the same probability space, the KL divergence between $P$ and $Q$ is defined as

$$D_{KL}(P \parallel Q) = -\sum_{x \in \mathcal{X}} P(x)\log\frac{Q(x)}{P(x)}$$

In other words, it is the expectation, taken under $P$, of the logarithmic difference between the probabilities $P$ and $Q$. The KL divergence is defined only when, for all $x$, $Q(x) = 0$ implies $P(x) = 0$ (absolute continuity). For continuous random variables $P$ and $Q$, the KL divergence is defined as the integral:

$$D_{KL}(P \parallel Q) = \int_{-\infty}^{\infty} p(x)\log\left(\frac{p(x)}{q(x)}\right)dx$$

See the Wikipedia article for more of the mathematical details.
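A tiny sketch of the discrete definition (my own example distributions):

import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """D_KL(P || Q) = sum_x P(x) * log(P(x) / Q(x)) for discrete distributions."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sum(p * np.log((p + eps) / (q + eps)))

p = [0.1, 0.4, 0.5]
q = [0.3, 0.4, 0.3]
print(kl_divergence(p, q))   # > 0, and note the asymmetry:
print(kl_divergence(q, p))   # a different (also positive) number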


At the current policy, $KL(\theta_{i}, \theta_{i}) = 0$, and the $C \cdot KL$ term can be viewed as an upper bound on the error of $L$. At the policy $\theta_{i}$, one can show that $L$ and $\eta$ agree to first order:

$$L_{\pi_{\theta_{i}}}(\pi_{\theta_{i}}) = \eta(\pi_{\theta_{i}})$$
$$\nabla_{\theta} L_{\pi_{\theta_{i}}}(\pi_{\theta})\big|_{\theta = \theta_{i}} = \nabla_{\theta}\eta(\pi_{\theta})\big|_{\theta = \theta_{i}}$$

By these relations, within a small neighbourhood of $\theta_{i}$ the two objectives are very close and so are their gradients, so we can replace the original objective with $L_{\pi_{\theta_{old}}}$ and take a bounded-size step along the gradient of this approximation to improve the policy.

But how large a step still guarantees monotonic non-decrease? Since the approximation only holds in a small region around the current policy, we add a constraint limiting how much the new policy may differ from the old one. Viewing a policy as a probability distribution, the KL divergence can measure that difference: $D_{KL}^{\max}(\pi, \hat{\pi}) = \max_{s} D_{KL}\big(\pi(\cdot|s)\,\|\,\hat{\pi}(\cdot|s)\big)$. The original paper proves that when $\hat{\pi} = \arg\max_{\pi'} L_{\pi}(\pi')$, the following lower bound holds:

$$\eta(\hat{\pi}) \geq L_{\pi}(\hat{\pi}) - C \cdot D_{KL}^{\max}(\pi, \hat{\pi}), \quad \text{where } C = \frac{2\epsilon\gamma}{(1-\gamma)^{2}}$$

Here $D_{KL}^{\max}(\pi, \tilde{\pi})$ is the (maximum) KL divergence, which limits how much the policy may change between updates. Define

$$M_{i}(\pi) = L_{\pi_{i}}(\pi) - C \cdot D_{KL}^{\max}(\pi_{i}, \pi)$$

Using this definition, we can prove the monotonicity of the policy updates:
$$\eta(\pi_{i+1}) \geq M_{i}(\pi_{i+1}), \qquad \eta(\pi_{i}) = M_{i}(\pi_{i})$$

Subtracting the two relations gives:

$$\eta(\pi_{i+1}) - \eta(\pi_{i}) \geq M_{i}(\pi_{i+1}) - M_{i}(\pi_{i})$$

Next we look at the details of the lower-bound function $M$. A two-page proof in Appendix A of the TRPO paper shows that $\eta$ has a definite lower bound:

$$\eta(\tilde{\pi}) \geq L_{\pi}(\tilde{\pi}) - \frac{4\epsilon\gamma}{(1-\gamma)^{2}}\,\alpha^{2}, \quad \text{where } \alpha = D_{TV}^{\max}(\pi, \tilde{\pi}),\ \epsilon = \max_{s,a}\left|A_{\pi}(s, a)\right|$$

Here $D_{TV}$ is the total variation divergence. Its exact form does not matter much, because we immediately replace it with the KL divergence (to obtain a more convenient lower bound), using

$$D_{TV}(p \,\|\, q)^{2} \leq D_{KL}(p \,\|\, q)$$

so the lower bound can be restated as:
$$\eta(\tilde{\pi}) \geq L_{\pi}(\tilde{\pi}) - C\, D_{KL}^{\max}(\pi, \tilde{\pi}), \quad \text{where } C = \frac{4\epsilon\gamma}{(1-\gamma)^{2}}$$
$$D_{KL}^{\max}(\pi, \tilde{\pi}) = \max_{s} D_{KL}\big(\pi(\cdot|s)\,\|\,\tilde{\pi}(\cdot|s)\big)$$

Note that the notation can be abbreviated as

$$\eta(\theta) := \eta(\pi_{\theta}), \qquad L_{\theta}(\hat{\theta}) := L_{\pi_{\theta}}(\pi_{\hat{\theta}})$$

1.4 Proof of monotonic improvement

The key idea of this construction is that it guarantees monotone ascent of the objective — it is the "money-back guarantee" member of the policy gradient family, so to speak. In short, at least in theory, every policy update is no worse than the previous one. What we need to show here is that the new policy obtained by optimizing $M$ is guaranteed to perform at least as well as the previous policy in terms of $\eta$ (the actual expected return). Since the number of policies is finite, repeatedly updating the policy eventually reaches a local or global optimum. Here is the proof:

Any improvement of $M_{i}(\pi_{i+1})$ over $M_{i}(\pi_{i})$ also yields an improvement of $\eta(\pi_{i+1})$ over $\eta(\pi_{i})$.
$$M_{i}(\pi) = L_{\pi_{i}}(\pi) - C\, D_{KL}^{\max}(\pi_{i}, \pi). \ \text{Then}$$
$$\eta(\pi_{i+1}) \geq M_{i}(\pi_{i+1}), \qquad \eta(\pi_{i}) = M_{i}(\pi_{i}), \ \text{therefore}$$
$$\eta(\pi_{i+1}) - \eta(\pi_{i}) \geq M_{i}(\pi_{i+1}) - M_{i}(\pi_{i})$$

Proof of monotone improvement of the objective during policy iteration

This yields an iterative algorithm that guarantees the new policy always performs at least as well as the current one.


However, it is hard to compute the maximum of the KL divergence over all states, so the requirement is relaxed to using the mean KL divergence instead.

where the $L$ function is:
$$L_{\pi}(\hat{\pi}) = \eta(\pi) + \sum_{s}\rho_{\pi}(s)\sum_{a}\hat{\pi}(a|s) A_{\pi}(s, a)$$

We can estimate this objective with importance sampling, drawing actions from a sampling policy $q$:
$$\underset{\theta}{\text{maximize}}\ \mathbb{E}_{s \sim \rho_{\theta_{old}},\, a \sim q}\left[\frac{\pi_{\theta}(a|s)}{q(a|s)}\hat{A}_{\theta_{old}}(s, a)\right]$$
$$\text{subject to}\ \ \mathbb{E}_{s \sim \rho_{\theta_{old}}}\left[D_{KL}\big(\pi_{\theta_{old}}(\cdot|s)\,\|\,\pi_{\theta}(\cdot|s)\big)\right] \leq \delta$$


Importance sampling
Importance sampling computes the expected value of $f(x)$ where $x$ follows the data distribution $p$, i.e. $\mathbb{E}_{x \sim p}[f(x)]$. With importance sampling we are free not to sample $x$ from $p$: instead we sample from $q$ and recalibrate the result with the probability ratio between $p$ and $q$, computing $\mathbb{E}_{x \sim q}\left[\frac{f(x)p(x)}{q(x)}\right]$. In vanilla policy gradient, the gradient is computed with samples drawn from the current policy itself.

As a result, whenever the policy changes we must collect new samples; old samples cannot be reused, so vanilla PG has poor sample efficiency. With importance sampling, the objective can be rewritten so that samples from the old policy can be used to compute the policy gradient.


Continuing: by Lagrangian duality, the constrained objective can be converted into an objective with a Lagrange-multiplier penalty; the two forms are mathematically equivalent (for an appropriate choice of the multiplier):

$$\underset{\theta}{\text{maximize}}\ \hat{\mathbb{E}}_{t}\left[\frac{\pi_{\theta}(a_{t}|s_{t})}{\pi_{\theta_{old}}(a_{t}|s_{t})}\hat{A}_{t}\right] \quad \text{subject to}\ \hat{\mathbb{E}}_{t}\left[\mathrm{KL}\big[\pi_{\theta_{old}}(\cdot|s_{t}),\ \pi_{\theta}(\cdot|s_{t})\big]\right] \leq \delta$$
or
$$\underset{\theta}{\text{maximize}}\ \hat{\mathbb{E}}_{t}\left[\frac{\pi_{\theta}(a_{t}|s_{t})}{\pi_{\theta_{old}}(a_{t}|s_{t})}\hat{A}_{t}\right] - \beta\,\hat{\mathbb{E}}_{t}\left[\mathrm{KL}\big[\pi_{\theta_{old}}(\cdot|s_{t}),\ \pi_{\theta}(\cdot|s_{t})\big]\right]$$
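A minimal sketch of evaluating the penalized surrogate from a batch of samples (my own illustrative arrays; in the real algorithm the log-probabilities, advantages and KL terms come from the policy network):

import numpy as np

def penalized_surrogate(logp_new, logp_old, adv, kl_per_state, beta):
    """Sample estimate of  E[ratio * A] - beta * E[KL(pi_old || pi_new)]."""
    ratio = np.exp(logp_new - logp_old)          # pi_theta(a|s) / pi_theta_old(a|s)
    return np.mean(ratio * adv) - beta * np.mean(kl_per_state)

# made-up batch of 4 transitions
logp_old = np.log([0.5, 0.2, 0.3, 0.4])
logp_new = np.log([0.6, 0.25, 0.2, 0.4])
adv      = np.array([1.0, -0.5, 2.0, 0.3])
kls      = np.array([0.01, 0.02, 0.04, 0.00])
print(penalized_surrogate(logp_new, logp_old, adv, kls, beta=1.0))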

1.5 Optimizing the objective function

As mentioned earlier, the lower-bound function $M$ should be easy to optimize. In practice we approximate the L function and the mean KL divergence with Taylor series: $L$ is expanded to first order and the KL divergence to second order:
$$L_{\theta_{k}}(\theta) \approx g^{T}(\theta - \theta_{k}), \qquad g \doteq \nabla_{\theta} L_{\theta_{k}}(\theta)\big|_{\theta_{k}}$$

$$\overline{D}_{KL}(\theta \,\|\, \theta_{k}) \approx \frac{1}{2}(\theta - \theta_{k})^{T} H (\theta - \theta_{k}), \qquad H \doteq \nabla_{\theta}^{2}\overline{D}_{KL}(\theta \,\|\, \theta_{k})\big|_{\theta_{k}}$$

Here $g$ is the policy gradient and $H$ is the Fisher information matrix (FIM), which takes the form of a Hessian matrix:

$$H = \nabla^{2} f = \begin{pmatrix}
\frac{\partial^{2} f}{\partial x_{1}^{2}} & \frac{\partial^{2} f}{\partial x_{1}\partial x_{2}} & \dots & \frac{\partial^{2} f}{\partial x_{1}\partial x_{n}} \\
\frac{\partial^{2} f}{\partial x_{2}\partial x_{1}} & \frac{\partial^{2} f}{\partial x_{2}^{2}} & \cdots & \frac{\partial^{2} f}{\partial x_{2}\partial x_{n}} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial^{2} f}{\partial x_{n}\partial x_{1}} & \frac{\partial^{2} f}{\partial x_{n}\partial x_{2}} & \cdots & \frac{\partial^{2} f}{\partial x_{n}^{2}}
\end{pmatrix}$$
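In practice $H$ is never formed explicitly; all that is needed is a Hessian-vector product, which is what core.hessian_vector_product computes in the SpinningUp code below via two rounds of automatic differentiation. A small numpy sketch of the same idea using finite differences of the gradient (my own toy quadratic, so the result can be checked against the exact Hessian):

import numpy as np

def hvp_fd(grad_f, theta, v, eps=1e-5):
    """Hessian-vector product via finite differences of the gradient:
    H v ~ (grad f(theta + eps*v) - grad f(theta - eps*v)) / (2*eps),
    so the n x n matrix H never has to be built."""
    return (grad_f(theta + eps * v) - grad_f(theta - eps * v)) / (2 * eps)

# toy check on f(theta) = 0.5 * theta^T A theta, whose Hessian is exactly A
A = np.array([[2.0, 0.3],
              [0.3, 1.0]])
grad_f = lambda th: A @ th
theta = np.array([0.5, -1.0])
v = np.array([1.0, 2.0])
print(hvp_fd(grad_f, theta, v))   # ~ A @ v = [2.6, 2.3]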
The optimization problem then becomes:

$$\theta_{k+1} = \arg\max_{\theta}\ g^{T}(\theta - \theta_{k}) \quad \text{subject to}\ \frac{1}{2}(\theta - \theta_{k})^{T} H (\theta - \theta_{k}) \leq \delta$$

This quadratic problem can be solved analytically, giving:
$$\theta_{k+1} = \theta_{k} + \sqrt{\frac{2\delta}{g^{T}H^{-1}g}}\, H^{-1}g$$
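A minimal numpy sketch of this update for a problem small enough that $H^{-1}g$ can be obtained with a direct solve (the g, H and delta values are made up):

import numpy as np

def natural_gradient_step(theta, g, H, delta):
    """theta_{k+1} = theta_k + sqrt(2*delta / (g^T H^{-1} g)) * H^{-1} g."""
    Hinv_g = np.linalg.solve(H, g)                  # H^{-1} g without forming the inverse
    step_size = np.sqrt(2 * delta / (g @ Hinv_g))
    return theta + step_size * Hinv_g

theta = np.zeros(2)
g = np.array([1.0, 0.5])                            # policy gradient (made up)
H = np.array([[2.0, 0.0], [0.0, 1.0]])              # Fisher information matrix (made up)
new_theta = natural_gradient_step(theta, g, H, delta=0.01)
d = new_theta - theta
print(new_theta, 0.5 * d @ H @ d)                   # the quadratic KL estimate comes out ~ delta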

The update above is known as the natural policy gradient. The following is the algorithm after combining the MM procedure with the natural gradient update:

TRPO
However, computing the FIM $H$ and especially its inverse is very expensive. TRPO therefore approximates $H^{-1}g$ by solving the following linear system for $x$:
$$x_{k} \approx \hat{H}_{k}^{-1}\hat{g}_{k} \quad \Longleftrightarrow \quad \hat{H}_{k} x_{k} \approx \hat{g}_{k}$$

This means we can solve it with the conjugate gradient method, which is much cheaper than computing the inverse of $H$. Conjugate gradient is similar to gradient descent, but it finds the exact optimum of the quadratic in at most N iterations, where N is the number of parameters in the model.
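A compact numpy version of conjugate gradient (essentially the same routine as the cg helper inside the SpinningUp trpo code below); note that Ax only needs to be a function returning Hessian-vector products, so H itself is never materialized:

import numpy as np

def conjugate_gradient(Ax, b, iters=10, eps=1e-8):
    """Solve A x = b for symmetric positive-definite A, given only the map x -> A x."""
    x = np.zeros_like(b)
    r = b.copy()                       # residual b - A x (x starts at 0)
    p = r.copy()
    r_dot_old = r @ r
    for _ in range(iters):
        z = Ax(p)
        alpha = r_dot_old / (p @ z + eps)
        x += alpha * p
        r -= alpha * z
        r_dot_new = r @ r
        p = r + (r_dot_new / r_dot_old) * p
        r_dot_old = r_dot_new
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
print(conjugate_gradient(lambda v: A @ v, b), np.linalg.solve(A, b))  # the two agree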

Finally, the paper's results on the Atari games are shown in the figure below:

TensorFlow Code in Practice

The code in this section comes from OpenAI's official Spinning Up implementation.

import numpy as np
import tensorflow as tf
import gym
import time
import spinup.algos.trpo.core as core
from spinup.utils.logx import EpochLogger
from spinup.utils.mpi_tf import MpiAdamOptimizer, sync_all_params
from spinup.utils.mpi_tools import mpi_fork, mpi_avg, proc_id, mpi_statistics_scalar, num_procs

EPS = 1e-8

class GAEBuffer:
    """
    A buffer for storing trajectories experienced by a TRPO agent interacting
    with the environment, and using Generalized Advantage Estimation (GAE-Lambda)
    for calculating the advantages of state-action pairs.
    """

    # Initialization: allocate buffers for obs, act, advantages, rewards, returns, values, etc.
    def __init__(self, obs_dim, act_dim, size, info_shapes, gamma=0.99, lam=0.95):
        self.obs_buf = np.zeros(core.combined_shape(size, obs_dim), dtype=np.float32)
        self.act_buf = np.zeros(core.combined_shape(size, act_dim), dtype=np.float32)
        self.adv_buf = np.zeros(size, dtype=np.float32)
        self.rew_buf = np.zeros(size, dtype=np.float32)
        self.ret_buf = np.zeros(size, dtype=np.float32)
        self.val_buf = np.zeros(size, dtype=np.float32)
        self.logp_buf = np.zeros(size, dtype=np.float32)
        self.info_bufs = {k: np.zeros([size] + list(v), dtype=np.float32) for k,v in info_shapes.items()}
        self.sorted_info_keys = core.keys_as_sorted_list(self.info_bufs)
        self.gamma, self.lam = gamma, lam
        self.ptr, self.path_start_idx, self.max_size = 0, 0, size

    # A dedicated store method, similar to the experience replay buffer in DDPG
    def store(self, obs, act, rew, val, logp, info):
        """
        Append one timestep of agent-environment interaction to the buffer.
        """
        assert self.ptr < self.max_size     # buffer has to have room so you can store
        self.obs_buf[self.ptr] = obs
        self.act_buf[self.ptr] = act
        self.rew_buf[self.ptr] = rew
        self.val_buf[self.ptr] = val
        # log-probability of the action under the (old) policy
        self.logp_buf[self.ptr] = logp
        for i, k in enumerate(self.sorted_info_keys):
            self.info_bufs[k][self.ptr] = info[i]
        self.ptr += 1

    def finish_path(self, last_val=0):
        """
        Call this at the end of a trajectory, or when one gets cut off
        by an epoch ending. This looks back in the buffer to where the
        trajectory started, and uses rewards and value estimates from
        the whole trajectory to compute advantage estimates with GAE-Lambda,
        as well as compute the rewards-to-go for each state, to use as
        the targets for the value function.

        The "last_val" argument should be 0 if the trajectory ended
        because the agent reached a terminal state (died), and otherwise
        should be V(s_T), the value function estimated for the last state.
        This allows us to bootstrap the reward-to-go calculation to account
        for timesteps beyond the arbitrary episode horizon (or epoch cutoff).
        """
        path_slice = slice(self.path_start_idx, self.ptr)
        rews = np.append(self.rew_buf[path_slice], last_val)
        vals = np.append(self.val_buf[path_slice], last_val)

        # the next two lines implement GAE-Lambda advantage calculation
        deltas = rews[:-1] + self.gamma * vals[1:] - vals[:-1]
        self.adv_buf[path_slice] = core.discount_cumsum(deltas, self.gamma * self.lam)

        # the next line computes rewards-to-go, to be targets for the value function
        self.ret_buf[path_slice] = core.discount_cumsum(rews, self.gamma)[:-1]

        self.path_start_idx = self.ptr

    def get(self):
        """
        Call this at the end of an epoch to get all of the data from
        the buffer, with advantages appropriately normalized (shifted to have
        mean zero and std one). Also, resets some pointers in the buffer.
        """
        assert self.ptr == self.max_size    # buffer has to be full before you can get
        self.ptr, self.path_start_idx = 0, 0
        # the next two lines implement the advantage normalization trick
        adv_mean, adv_std = mpi_statistics_scalar(self.adv_buf)
        self.adv_buf = (self.adv_buf - adv_mean) / adv_std
        return [self.obs_buf, self.act_buf, self.adv_buf, self.ret_buf,
                self.logp_buf] + core.values_as_sorted_list(self.info_bufs)


"""
Trust Region Policy Optimization
(with support for Natural Policy Gradient)
"""
def trpo(env_fn, actor_critic=core.mlp_actor_critic, ac_kwargs=dict(), seed=0,
         steps_per_epoch=4000, epochs=50, gamma=0.99, delta=0.01, vf_lr=1e-3,
         train_v_iters=80, damping_coeff=0.1, cg_iters=10, backtrack_iters=10,
         backtrack_coeff=0.8, lam=0.97, max_ep_len=1000, logger_kwargs=dict(),
         save_freq=10, algo='trpo'):
    """
    Args:
        env_fn : A function which creates a copy of the environment.
            The environment must satisfy the OpenAI Gym API.

        actor_critic: A function which takes in placeholder symbols
            for state, ``x_ph``, and action, ``a_ph``, and returns the main
            outputs from the agent's Tensorflow computation graph:

            ============  ================  ========================================
            Symbol        Shape             Description
            ============  ================  ========================================
            ``pi``        (batch, act_dim)  | Samples actions from policy given
                                            | states.
            ``logp``      (batch,)          | Gives log probability, according to
                                            | the policy, of taking actions ``a_ph``
                                            | in states ``x_ph``.
            ``logp_pi``   (batch,)          | Gives log probability, according to
                                            | the policy, of the action sampled by
                                            | ``pi``.
            ``info``      N/A               | A dict of any intermediate quantities
                                            | (from calculating the policy or log
                                            | probabilities) which are needed for
                                            | analytically computing KL divergence.
                                            | (eg sufficient statistics of the
                                            | distributions)
            ``info_phs``  N/A               | A dict of placeholders for old values
                                            | of the entries in ``info``.
            ``d_kl``      ()                | A symbol for computing the mean KL
                                            | divergence between the current policy
                                            | (``pi``) and the old policy (as
                                            | specified by the inputs to
                                            | ``info_phs``) over the batch of
                                            | states given in ``x_ph``.
            ``v``         (batch,)          | Gives the value estimate for states
                                            | in ``x_ph``. (Critical: make sure
                                            | to flatten this!)
            ============  ================  ========================================

        ac_kwargs (dict): Any kwargs appropriate for the actor_critic
            function you provided to TRPO.

        seed (int): Seed for random number generators.

        steps_per_epoch (int): Number of steps of interaction (state-action pairs)
            for the agent and the environment in each epoch.

        epochs (int): Number of epochs of interaction (equivalent to
            number of policy updates) to perform.

        gamma (float): Discount factor. (Always between 0 and 1.)

        delta (float): KL-divergence limit for TRPO / NPG update.
            (Should be small for stability. Values like 0.01, 0.05.)

        vf_lr (float): Learning rate for value function optimizer.

        train_v_iters (int): Number of gradient descent steps to take on
            value function per epoch.

        damping_coeff (float): Artifact for numerical stability, should be
            smallish. Adjusts Hessian-vector product calculation:

            .. math:: Hv \\rightarrow (\\alpha I + H)v

            where :math:`\\alpha` is the damping coefficient.
            Probably don't play with this hyperparameter.

        cg_iters (int): Number of iterations of conjugate gradient to perform.
            Increasing this will lead to a more accurate approximation
            to :math:`H^{-1} g`, and possibly slightly-improved performance,
            but at the cost of slowing things down.
            Also probably don't play with this hyperparameter.

        backtrack_iters (int): Maximum number of steps allowed in the
            backtracking line search. Since the line search usually doesn't
            backtrack, and usually only steps back once when it does, this
            hyperparameter doesn't often matter.

        backtrack_coeff (float): How far back to step during backtracking line
            search. (Always between 0 and 1, usually above 0.5.)

        lam (float): Lambda for GAE-Lambda.
            (Always between 0 and 1, close to 1.)

        max_ep_len (int): Maximum length of trajectory / episode / rollout.

        logger_kwargs (dict): Keyword args for EpochLogger.

        save_freq (int): How often (in terms of gap between epochs) to save
            the current policy and value function.

        algo: Either 'trpo' or 'npg': this code supports both, since they are
            almost the same.
    """

    logger = EpochLogger(**logger_kwargs)
    logger.save_config(locals())

    seed += 10000 * proc_id()
    tf.set_random_seed(seed)
    np.random.seed(seed)

    env = env_fn()
    obs_dim = env.observation_space.shape
    act_dim = env.action_space.shape

    # Share information about action space with policy architecture
    ac_kwargs['action_space'] = env.action_space

    # Inputs to computation graph
    x_ph, a_ph = core.placeholders_from_spaces(env.observation_space, env.action_space)
    adv_ph, ret_ph, logp_old_ph = core.placeholders(None, None, None)

    # Main outputs from computation graph, plus placeholders for old pdist (for KL)
    pi, logp, logp_pi, info, info_phs, d_kl, v = actor_critic(x_ph, a_ph, **ac_kwargs)

    # Need all placeholders in *this* order later (to zip with data from buffer)
    all_phs = [x_ph, a_ph, adv_ph, ret_ph, logp_old_ph] + core.values_as_sorted_list(info_phs)

    # Every step, get: action, value, logprob, & info for pdist (for computing kl div)
    get_action_ops = [pi, v, logp_pi] + core.values_as_sorted_list(info)

    # Experience buffer
    local_steps_per_epoch = int(steps_per_epoch / num_procs())
    info_shapes = {k: v.shape.as_list()[1:] for k,v in info_phs.items()}
    buf = GAEBuffer(obs_dim, act_dim, local_steps_per_epoch, info_shapes, gamma, lam)

    # Count variables
    var_counts = tuple(core.count_vars(scope) for scope in ['pi', 'v'])
    logger.log('\nNumber of parameters: \t pi: %d, \t v: %d\n'%var_counts)

    # TRPO losses
    ratio = tf.exp(logp - logp_old_ph)          # pi(a|s) / pi_old(a|s)
    pi_loss = -tf.reduce_mean(ratio * adv_ph)
    v_loss = tf.reduce_mean((ret_ph - v)**2)

    # Optimizer for value function
    train_vf = MpiAdamOptimizer(learning_rate=vf_lr).minimize(v_loss)

    # Symbols needed for CG solver
    pi_params = core.get_vars('pi')
    gradient = core.flat_grad(pi_loss, pi_params)
    v_ph, hvp = core.hessian_vector_product(d_kl, pi_params)
    if damping_coeff > 0:
        hvp += damping_coeff * v_ph

    # Symbols for getting and setting params
    get_pi_params = core.flat_concat(pi_params)
    set_pi_params = core.assign_params_from_flat(v_ph, pi_params)

    sess = tf.Session()
    sess.run(tf.global_variables_initializer())

    # Sync params across processes
    sess.run(sync_all_params())

    # Setup model saving
    logger.setup_tf_saver(sess, inputs={'x': x_ph}, outputs={'pi': pi, 'v': v})

    def cg(Ax, b):
        """
        Conjugate gradient algorithm
        (see https://en.wikipedia.org/wiki/Conjugate_gradient_method)
        """
        x = np.zeros_like(b)
        r = b.copy()  # Note: should be 'b - Ax(x)', but for x=0, Ax(x)=0. Change if doing warm start.
        p = r.copy()
        r_dot_old = np.dot(r,r)
        for _ in range(cg_iters):
            z = Ax(p)
            alpha = r_dot_old / (np.dot(p, z) + EPS)
            x += alpha * p
            r -= alpha * z
            r_dot_new = np.dot(r,r)
            p = r + (r_dot_new / r_dot_old) * p
            r_dot_old = r_dot_new
        return x

    def update():
        # Prepare hessian func, gradient eval
        inputs = {k:v for k,v in zip(all_phs, buf.get())}
        Hx = lambda x : mpi_avg(sess.run(hvp, feed_dict={**inputs, v_ph: x}))
        g, pi_l_old, v_l_old = sess.run([gradient, pi_loss, v_loss], feed_dict=inputs)
        g, pi_l_old = mpi_avg(g), mpi_avg(pi_l_old)

        # Core calculations for TRPO or NPG
        x = cg(Hx, g)
        alpha = np.sqrt(2*delta/(np.dot(x, Hx(x))+EPS))
        old_params = sess.run(get_pi_params)

        def set_and_eval(step):
            sess.run(set_pi_params, feed_dict={v_ph: old_params - alpha * x * step})
            return mpi_avg(sess.run([d_kl, pi_loss], feed_dict=inputs))

        if algo=='npg':
            # npg has no backtracking or hard kl constraint enforcement
            kl, pi_l_new = set_and_eval(step=1.)

        elif algo=='trpo':
            # trpo augments npg with backtracking line search, hard kl
            for j in range(backtrack_iters):
                kl, pi_l_new = set_and_eval(step=backtrack_coeff**j)
                if kl <= delta and pi_l_new <= pi_l_old:
                    logger.log('Accepting new params at step %d of line search.'%j)
                    logger.store(BacktrackIters=j)
                    break

                if j==backtrack_iters-1:
                    logger.log('Line search failed! Keeping old params.')
                    logger.store(BacktrackIters=j)
                    kl, pi_l_new = set_and_eval(step=0.)

        # Value function updates
        for _ in range(train_v_iters):
            sess.run(train_vf, feed_dict=inputs)
        v_l_new = sess.run(v_loss, feed_dict=inputs)

        # Log changes from update
        logger.store(LossPi=pi_l_old, LossV=v_l_old, KL=kl,
                     DeltaLossPi=(pi_l_new - pi_l_old),
                     DeltaLossV=(v_l_new - v_l_old))

    start_time = time.time()
    o, r, d, ep_ret, ep_len = env.reset(), 0, False, 0, 0

    # Main loop: collect experience in env and update/log each epoch
    for epoch in range(epochs):
        for t in range(local_steps_per_epoch):
            agent_outs = sess.run(get_action_ops, feed_dict={x_ph: o.reshape(1,-1)})
            a, v_t, logp_t, info_t = agent_outs[0][0], agent_outs[1], agent_outs[2], agent_outs[3:]

            # save and log
            buf.store(o, a, r, v_t, logp_t, info_t)
            logger.store(VVals=v_t)

            o, r, d, _ = env.step(a)
            ep_ret += r
            ep_len += 1

            terminal = d or (ep_len == max_ep_len)
            if terminal or (t==local_steps_per_epoch-1):
                if not(terminal):
                    print('Warning: trajectory cut off by epoch at %d steps.'%ep_len)
                # if trajectory didn't reach terminal state, bootstrap value target
                last_val = r if d else sess.run(v, feed_dict={x_ph: o.reshape(1,-1)})
                buf.finish_path(last_val)
                if terminal:
                    # only save EpRet / EpLen if trajectory finished
                    logger.store(EpRet=ep_ret, EpLen=ep_len)
                o, r, d, ep_ret, ep_len = env.reset(), 0, False, 0, 0

        # Save model
        if (epoch % save_freq == 0) or (epoch == epochs-1):
            logger.save_state({'env': env}, None)

        # Perform TRPO or NPG update!
        update()

        # Log info about epoch
        logger.log_tabular('Epoch', epoch)
        logger.log_tabular('EpRet', with_min_and_max=True)
        logger.log_tabular('EpLen', average_only=True)
        logger.log_tabular('VVals', with_min_and_max=True)
        logger.log_tabular('TotalEnvInteracts', (epoch+1)*steps_per_epoch)
        logger.log_tabular('LossPi', average_only=True)
        logger.log_tabular('LossV', average_only=True)
        logger.log_tabular('DeltaLossPi', average_only=True)
        logger.log_tabular('DeltaLossV', average_only=True)
        logger.log_tabular('KL', average_only=True)
        if algo=='trpo':
            logger.log_tabular('BacktrackIters', average_only=True)
        logger.log_tabular('Time', time.time()-start_time)
        logger.dump_tabular()

if __name__ == '__main__':
    import argparse
    parser = argparse.ArgumentParser()
    parser.add_argument('--env', type=str, default='HalfCheetah-v2')
    parser.add_argument('--hid', type=int, default=64)
    parser.add_argument('--l', type=int, default=2)
    parser.add_argument('--gamma', type=float, default=0.99)
    parser.add_argument('--seed', '-s', type=int, default=0)
    parser.add_argument('--cpu', type=int, default=4)
    parser.add_argument('--steps', type=int, default=4000)
    parser.add_argument('--epochs', type=int, default=50)
    parser.add_argument('--exp_name', type=str, default='trpo')
    args = parser.parse_args()

    mpi_fork(args.cpu)  # run parallel code with mpi

    from spinup.utils.run_utils import setup_logger_kwargs
    logger_kwargs = setup_logger_kwargs(args.exp_name, args.seed)

    trpo(lambda : gym.make(args.env), actor_critic=core.mlp_actor_critic,
         ac_kwargs=dict(hidden_sizes=[args.hid]*args.l), gamma=args.gamma,
         seed=args.seed, steps_per_epoch=args.steps, epochs=args.epochs,
         logger_kwargs=logger_kwargs)

The core.py module is as follows:

import numpy as np
import tensorflow as tf
import scipy.signal
from gym.spaces import Box, Discrete

EPS = 1e-8

def combined_shape(length, shape=None):
    if shape is None:
        return (length,)
    return (length, shape) if np.isscalar(shape) else (length, *shape)

def keys_as_sorted_list(dict):
    return sorted(list(dict.keys()))

def values_as_sorted_list(dict):
    return [dict[k] for k in keys_as_sorted_list(dict)]

def placeholder(dim=None):
    return tf.placeholder(dtype=tf.float32, shape=combined_shape(None,dim))

def placeholders(*args):
    return [placeholder(dim) for dim in args]

def placeholder_from_space(space):
    if isinstance(space, Box):
        return placeholder(space.shape)
    elif isinstance(space, Discrete):
        return tf.placeholder(dtype=tf.int32, shape=(None,))
    raise NotImplementedError

def placeholders_from_spaces(*args):
    return [placeholder_from_space(space) for space in args]

def mlp(x, hidden_sizes=(32,), activation=tf.tanh, output_activation=None):
    for h in hidden_sizes[:-1]:
        x = tf.layers.dense(x, units=h, activation=activation)
    return tf.layers.dense(x, units=hidden_sizes[-1], activation=output_activation)

def get_vars(scope=''):
    return [x for x in tf.trainable_variables() if scope in x.name]

def count_vars(scope=''):
    v = get_vars(scope)
    return sum([np.prod(var.shape.as_list()) for var in v])

def gaussian_likelihood(x, mu, log_std):
    pre_sum = -0.5 * (((x-mu)/(tf.exp(log_std)+EPS))**2 + 2*log_std + np.log(2*np.pi))
    return tf.reduce_sum(pre_sum, axis=1)

def diagonal_gaussian_kl(mu0, log_std0, mu1, log_std1):
    """
    tf symbol for mean KL divergence between two batches of diagonal gaussian
    distributions, where distributions are specified by means and log stds.
    (https://en.wikipedia.org/wiki/Kullback-Leibler_divergence#Multivariate_normal_distributions)
    """
    var0, var1 = tf.exp(2 * log_std0), tf.exp(2 * log_std1)
    pre_sum = 0.5*(((mu1- mu0)**2 + var0)/(var1 + EPS) - 1) +  log_std1 - log_std0
    all_kls = tf.reduce_sum(pre_sum, axis=1)
    return tf.reduce_mean(all_kls)

def categorical_kl(logp0, logp1):
    """
    tf symbol for mean KL divergence between two batches of categorical
    probability distributions, where the distributions are input as log probs.
    """
    all_kls = tf.reduce_sum(tf.exp(logp1) * (logp1 - logp0), axis=1)
    return tf.reduce_mean(all_kls)

def flat_concat(xs):
    return tf.concat([tf.reshape(x,(-1,)) for x in xs], axis=0)

def flat_grad(f, params):
    return flat_concat(tf.gradients(xs=params, ys=f))

def hessian_vector_product(f, params):
    # for H = grad**2 f, compute Hx
    g = flat_grad(f, params)
    x = tf.placeholder(tf.float32, shape=g.shape)
    return x, flat_grad(tf.reduce_sum(g*x), params)

def assign_params_from_flat(x, params):
    flat_size = lambda p : int(np.prod(p.shape.as_list())) # the 'int' is important for scalars
    splits = tf.split(x, [flat_size(p) for p in params])
    new_params = [tf.reshape(p_new, p.shape) for p, p_new in zip(params, splits)]
    return tf.group([tf.assign(p, p_new) for p, p_new in zip(params, new_params)])

def discount_cumsum(x, discount):
    """
    magic from rllab for computing discounted cumulative sums of vectors.

    input:
        vector x,
        [x0,
         x1,
         x2]

    output:
        [x0 + discount * x1 + discount^2 * x2,
         x1 + discount * x2,
         x2]
    """
    return scipy.signal.lfilter([1], [1, float(-discount)], x[::-1], axis=0)[::-1]

"""
Policies
"""
def mlp_categorical_policy(x, a, hidden_sizes, activation, output_activation, action_space):
    act_dim = action_space.n
    logits = mlp(x, list(hidden_sizes)+[act_dim], activation, None)
    logp_all = tf.nn.log_softmax(logits)
    pi = tf.squeeze(tf.multinomial(logits,1), axis=1)
    logp = tf.reduce_sum(tf.one_hot(a, depth=act_dim) * logp_all, axis=1)
    logp_pi = tf.reduce_sum(tf.one_hot(pi, depth=act_dim) * logp_all, axis=1)

    old_logp_all = placeholder(act_dim)
    d_kl = categorical_kl(logp_all, old_logp_all)

    info = {'logp_all': logp_all}
    info_phs = {'logp_all': old_logp_all}

    return pi, logp, logp_pi, info, info_phs, d_kl

def mlp_gaussian_policy(x, a, hidden_sizes, activation, output_activation, action_space):
    act_dim = a.shape.as_list()[-1]
    mu = mlp(x, list(hidden_sizes)+[act_dim], activation, output_activation)
    log_std = tf.get_variable(name='log_std', initializer=-0.5*np.ones(act_dim, dtype=np.float32))
    std = tf.exp(log_std)
    pi = mu + tf.random_normal(tf.shape(mu)) * std
    logp = gaussian_likelihood(a, mu, log_std)
    logp_pi = gaussian_likelihood(pi, mu, log_std)

    old_mu_ph, old_log_std_ph = placeholders(act_dim, act_dim)
    d_kl = diagonal_gaussian_kl(mu, log_std, old_mu_ph, old_log_std_ph)

    info = {'mu': mu, 'log_std': log_std}
    info_phs = {'mu': old_mu_ph, 'log_std': old_log_std_ph}

    return pi, logp, logp_pi, info, info_phs, d_kl

"""
Actor-Critics
"""
def mlp_actor_critic(x, a, hidden_sizes=(64,64), activation=tf.tanh,
                     output_activation=None, policy=None, action_space=None):

    # default policy builder depends on action space
    if policy is None and isinstance(action_space, Box):
        policy = mlp_gaussian_policy
    elif policy is None and isinstance(action_space, Discrete):
        policy = mlp_categorical_policy

    with tf.variable_scope('pi'):
        policy_outs = policy(x, a, hidden_sizes, activation, output_activation, action_space)
        pi, logp, logp_pi, info, info_phs, d_kl = policy_outs
    with tf.variable_scope('v'):
        v = tf.squeeze(mlp(x, list(hidden_sizes)+[1], activation, None), axis=1)
    return pi, logp, logp_pi, info, info_phs, d_kl, v

References:

[1].《Trust Region Policy Optimization》https://arxiv.org/pdf/1502.05477.pdf

[2]. https://medium.com/@jonathan_hui/rl-the-math-behind-trpo-ppo-d12f6c745f33

[3]. https://ai.yanxishe.com/page/TextTranslation/1419

[4]. https://blog.csdn.net/weixin_37895339/article/details/83044731

[5]. https://en.wikipedia.org/wiki/MM_algorithm

[6]. https://www.youtube.com/watch?v=wSjQ4LVUgoU

[7]. https://zhuanlan.zhihu.com/p/26308073

[8]. https://medium.com/@jonathan_hui/rl-trust-region-policy-optimization-trpo-explained-a6ee04eeeee9

[9]. https://medium.com/@jonathan_hui/rl-natural-policy-gradient-actor-critic-using-kronecker-factored-trust-region-acktr-58f3798a4a93

[10]. KL divergence: https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence

[11]. https://spinningup.readthedocs.io/zh_CN/latest/_modules/spinup/algos/trpo/trpo.html#trpo
