[RL 9] Trust Region Policy Optimization (ICML, 2015)
1 Introduction
- policy optimization categories
- policy iteration (GPI)
- PG (e.g. TRPO)
- derivative-free optimization methods
2 Preliminaries
- Consider an infinite-horizon discounted MDP
- instead of an average-reward one
- objective $\eta(\tilde{\pi})$ (Kakade, 2002, TODO)

$$\eta(\tilde{\pi}) = \eta(\pi) + \mathbb{E}_{s_0, a_0, \cdots \sim \tilde{\pi}}\left[\sum_{t=0}^{\infty} \gamma^{t} A_{\pi}(s_t, a_t)\right] = \eta(\pi) + \sum_{s} \rho_{\tilde{\pi}}(s) \sum_{a} \tilde{\pi}(a \mid s) A_{\pi}(s, a)$$

- improving the expectation on the right-hand side leads to policy improvement
- but $\rho_{\tilde{\pi}}(s)$ is hard to estimate
- so use local approximation instead
- local approximation $L_{\pi}(\tilde{\pi})$ (Kakade, 2002, TODO)

$$L_{\pi}(\tilde{\pi}) = \eta(\pi) + \sum_{s} \rho_{\pi}(s) \sum_{a} \tilde{\pi}(a \mid s) A_{\pi}(s, a)$$

- matches $\eta(\tilde{\pi})$ to first order
- lower bound
- limited form of policy improvement
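In the tabular case the policy-dependent part of the local approximation can be computed directly. A minimal sketch (illustrative only; the state space, advantages, and candidate policy below are made up):

```python
import numpy as np

# Sketch: compute L_pi(pi_new) - eta(pi) = sum_s rho_pi(s) sum_a pi_new(a|s) A_pi(s,a)
# for tabular policies. Note the state visitation rho_pi is that of the OLD
# policy pi, which is what makes L_pi only a local approximation to eta.

def local_approximation_gap(rho_pi, pi_new, advantages):
    """rho_pi: (S,) discounted state visitation under pi;
    pi_new: (S, A) candidate policy; advantages: (S, A) values A_pi(s, a)."""
    return float(np.sum(rho_pi[:, None] * pi_new * advantages))

rho = np.array([0.6, 0.4])                        # visitation under old policy
adv = np.array([[1.0, -1.0], [0.5, -0.5]])        # hypothetical advantages
greedy = np.array([[1.0, 0.0], [1.0, 0.0]])       # all mass on the best action
print(local_approximation_gap(rho, greedy, adv))  # 0.6*1.0 + 0.4*0.5
```

A positive gap means the candidate policy improves the surrogate, which (by the first-order match) locally improves $\eta$ as well.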
3 Monotonic Improvement Guarantee for General Stochastic Policies
- Theorem 1: lower bound for general stochastic policies

$$\eta(\tilde{\pi}) \geq L_{\pi}(\tilde{\pi}) - C D_{\mathrm{KL}}^{\max}(\pi, \tilde{\pi}), \quad \text{where } C = \frac{4 \epsilon \gamma}{(1-\gamma)^2}, \ \epsilon = \max_{s,a} |A_{\pi}(s, a)|$$

- the KL term can be seen as a penalty
- $C$ is a large constant, which leads to a small step size
- Algorithm 1: monotonic policy iteration
- the advantage estimates should be accurate
- can be extended to continuous state and action spaces
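To see why the penalty coefficient forces tiny steps, the constant $C$ from Theorem 1 can be evaluated numerically. A small sketch (the values of $\epsilon$ and $\gamma$ are made up for illustration):

```python
# Sketch of Theorem 1's bound: eta(pi_new) >= L_pi(pi_new) - C * D_KL^max,
# with C = 4 * eps * gamma / (1 - gamma)^2 and eps = max_{s,a} |A_pi(s,a)|.

def penalty_coefficient(eps, gamma):
    return 4.0 * eps * gamma / (1.0 - gamma) ** 2

def lower_bound(L_new, eps, gamma, max_kl):
    """Guaranteed-improvement objective maximized by Algorithm 1."""
    return L_new - penalty_coefficient(eps, gamma) * max_kl

# With gamma close to 1 the (1 - gamma)^2 denominator makes C explode,
# so even a small KL change wipes out the surrogate gain:
print(penalty_coefficient(eps=1.0, gamma=0.99))  # roughly 3.96 / 1e-4, i.e. ~39600
```

This is exactly the motivation for Section 4's move from a penalty to a hard KL constraint: the theoretically justified $C$ is far too conservative in practice.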
4 Optimization of Parameterized Policies
- Approximations
- trust region constraint
- step size in Algorithm 1 could be very small
- change the KL penalty to a constraint
- average KL
- the max operation involves traversing the whole state space, so use the average KL instead

$$\bar{D}_{\mathrm{KL}}^{\rho_{\mathrm{old}}}(\theta_1, \theta_2) := \mathbb{E}_{s \sim \rho_{\mathrm{old}}}\left[D_{\mathrm{KL}}\left(\pi_{\theta_1}(\cdot \mid s) \,\|\, \pi_{\theta_2}(\cdot \mid s)\right)\right]$$
- trust region constraint
- optimization problem so far

$$\begin{aligned} &\underset{\theta}{\operatorname{maximize}} \; \sum_{s} \rho_{\theta_{\mathrm{old}}}(s) \sum_{a} \pi_{\theta}(a \mid s) A_{\theta_{\mathrm{old}}}(s, a) \\ &\text{subject to } \bar{D}_{\mathrm{KL}}^{\rho_{\mathrm{old}}}(\theta_{\mathrm{old}}, \theta) \leq \delta \end{aligned}$$
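The average KL in the constraint is an expectation over states, so it can be estimated from rollouts. A minimal sketch for tabular categorical policies (the tables, sampled states, and $\delta$ below are made up):

```python
import numpy as np

# Sketch: estimate the average-KL constraint from states sampled s ~ rho_old.
# Unlike max_s KL, this needs no traversal of the state space.

def kl_categorical(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions; eps guards log(0)."""
    p, q = np.asarray(p), np.asarray(q)
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

def mean_kl(sampled_states, pi_old, pi_new):
    """sampled_states: state indices drawn from rho_old; pi_*: (S, A) tables."""
    return float(np.mean([kl_categorical(pi_old[s], pi_new[s])
                          for s in sampled_states]))

pi_old = np.array([[0.5, 0.5], [0.9, 0.1]])
pi_new = np.array([[0.6, 0.4], [0.9, 0.1]])
states = [0, 0, 1]                # states visited under the old policy
print(mean_kl(states, pi_old, pi_new) <= 0.05)  # True for delta = 0.05
```

In the real algorithm the states come from trajectories, and the KL is between parametric (e.g. Gaussian) policies rather than lookup tables.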
5 Sample-Based Estimation of the Objective and Constraint
$$\begin{aligned} &\underset{\theta}{\operatorname{maximize}} \; \mathbb{E}_{s \sim \rho_{\theta_{\mathrm{old}}}, a \sim q}\left[\frac{\pi_{\theta}(a \mid s)}{q(a \mid s)} Q_{\theta_{\mathrm{old}}}(s, a)\right] \\ &\text{subject to } \mathbb{E}_{s \sim \rho_{\theta_{\mathrm{old}}}}\left[D_{\mathrm{KL}}\left(\pi_{\theta_{\mathrm{old}}}(\cdot \mid s) \,\|\, \pi_{\theta}(\cdot \mid s)\right)\right] \leq \delta \end{aligned}$$
- $s \sim \rho_{\theta_{\mathrm{old}}}$
- by the definition of expectation
- $a \sim q$
- by importance sampling
- $Q_{\theta_{\mathrm{old}}}(s, a)$
- by the fact that
- $A(s, a) = Q(s, a) - V(s)$
- $\mathbb{E}_{s \sim \rho_{\theta_{\mathrm{old}}}}[V^{\theta_{\mathrm{old}}}(s)]$ does not depend on $\theta$, i.e. it is a constant
- so Monte Carlo can be used to estimate Q (maybe there is no proper MC estimator for the advantage)
- estimation methods
- single path
- Vine
- better estimation
- but needs more calls to the simulator
- limited to systems where states can be restored
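The importance-sampled objective above is straightforward to estimate from samples. A minimal sketch (all probabilities and returns below are made up; here $q = \pi_{\theta_{\mathrm{old}}}$, as in the single-path scheme):

```python
import numpy as np

# Sketch of the sample-based surrogate: with states s ~ rho_old and actions
# a ~ q, the objective is E[ pi_theta(a|s) / q(a|s) * Q_old(s, a) ],
# estimated as an empirical mean over collected (s, a) pairs.

def surrogate_estimate(pi_new_probs, q_probs, q_values):
    """All arrays are per-sample: pi_theta(a_i|s_i), q(a_i|s_i), Q_old(s_i, a_i)."""
    ratios = pi_new_probs / q_probs       # importance weights
    return float(np.mean(ratios * q_values))

pi_probs = np.array([0.6, 0.3, 0.8])   # new policy's prob. of the sampled actions
q_probs  = np.array([0.5, 0.5, 0.5])   # behavior distribution q(a|s)
q_vals   = np.array([1.0, -0.5, 2.0])  # Monte Carlo returns as Q estimates
print(surrogate_estimate(pi_probs, q_probs, q_vals))
```

The estimator is unbiased as long as the action probabilities are computed under the same $q$ that generated the data.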
6 Practical Algorithm
- Update Q using collected trajectories
- Calculate the objective and its constraint
- Solve the constrained problem with the conjugate gradient algorithm TODO
- references
- paper Appendix C
- post with theorem and implementation https://www.telesens.co/2018/06/09/efficiently-computing-the-fisher-vector-product-in-trpo/
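The conjugate gradient step solves $Hx = g$, where $H$ is the Fisher information matrix and $g$ the policy gradient. A minimal sketch (the 2×2 matrix below is a made-up stand-in; in the real algorithm $H$ is never formed explicitly, only Fisher-vector products $Hv$ are computed):

```python
import numpy as np

# Sketch of plain conjugate gradient: solve H x = g given only a function
# hvp(v) that returns H @ v, which is all TRPO needs from the Fisher matrix.

def conjugate_gradient(hvp, g, iters=10, tol=1e-10):
    x = np.zeros_like(g)
    r = g.copy()          # residual g - H x, with x = 0 initially
    p = r.copy()          # search direction
    rs = r @ r
    for _ in range(iters):
        Hp = hvp(p)
        alpha = rs / (p @ Hp)
        x += alpha * p
        r -= alpha * Hp
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

H = np.array([[4.0, 1.0], [1.0, 3.0]])   # stand-in positive-definite Fisher matrix
g = np.array([1.0, 2.0])
x = conjugate_gradient(lambda v: H @ v, g)
print(np.allclose(H @ x, g))  # True
```

Because only `hvp` is called, the same loop scales to networks with millions of parameters, which is the point of the Fisher-vector-product trick in the linked post.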
8 Experiments
- TRPO is related to prior methods (e.g. natural policy gradient) but makes several changes, most notably using a fixed KL divergence constraint rather than a fixed penalty coefficient
- these results provide empirical evidence that constraining the KL divergence is a more robust way to choose step sizes and make fast, consistent progress than using a fixed penalty
- can obtain high-quality locomotion controllers from scratch, which is considered to be a hard problem.
- the method we proposed is scalable and has strong theoretical foundations.