Trust Region Policy Optimization (ICML, 2015)

1 Introduction

  1. policy optimization categories

    1. policy iteration (GPI)
    2. policy gradient (PG) methods (TRPO falls in this category)
    3. derivative-free optimization methods

2 Preliminaries

  1. Consider an infinite-horizon discounted MDP

    1. instead of the average-reward formulation
  2. expected return $\eta(\tilde{\pi})$ of a new policy expressed in terms of the old policy $\pi$ (Kakade & Langford, 2002, TODO)
    $\eta(\tilde{\pi}) = \eta(\pi) + \mathbb{E}_{s_{0}, a_{0}, \cdots \sim \tilde{\pi}}\left[\sum_{t=0}^{\infty} \gamma^{t} A_{\pi}(s_{t}, a_{t})\right] = \eta(\pi) + \sum_{s} \rho_{\tilde{\pi}}(s) \sum_{a} \tilde{\pi}(a \mid s) A_{\pi}(s, a)$

    1. improving the right-hand-side expectation leads to policy improvement
    2. but $\rho_{\tilde{\pi}}(s)$ is hard to estimate, since it depends on the new policy $\tilde{\pi}$
    3. so a local approximation is used instead
  3. local approximation $L_{\pi}(\tilde{\pi})$ of the objective (Kakade & Langford, 2002, TODO)
    $L_{\pi}(\tilde{\pi}) = \eta(\pi) + \sum_{s} \rho_{\pi}(s) \sum_{a} \tilde{\pi}(a \mid s) A_{\pi}(s, a)$

    1. $L_{\pi}$ matches $\eta(\tilde{\pi})$ to first order at the current policy (see the equations after this list)
    2. Kakade & Langford derived a lower bound on $\eta$ from $L_{\pi}$, but only for mixture policies
    3. so it yields only a limited form of policy improvement
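
The first-order match means that $L$ and $\eta$ agree in value and gradient at the current parameters $\theta_{\text{old}}$, as stated in the paper:

```latex
% L and eta coincide to first order at the current policy pi_{theta_old}
L_{\pi_{\theta_{\text{old}}}}(\pi_{\theta_{\text{old}}}) = \eta(\pi_{\theta_{\text{old}}}),
\qquad
\left.\nabla_{\theta} L_{\pi_{\theta_{\text{old}}}}(\pi_{\theta})\right|_{\theta=\theta_{\text{old}}}
= \left.\nabla_{\theta} \eta(\pi_{\theta})\right|_{\theta=\theta_{\text{old}}} .
```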

3 Monotonic Improvement Guarantee for General Stochastic Policies

  1. Theorem 1: lower bound for general stochastic policies
    $\eta(\tilde{\pi}) \geq L_{\pi}(\tilde{\pi}) - C D_{\mathrm{KL}}^{\max}(\pi, \tilde{\pi}), \quad \text{where } C = \frac{4 \epsilon \gamma}{(1-\gamma)^{2}},\ \epsilon = \max_{s, a}\left|A_{\pi}(s, a)\right|$

    • the KL term can be seen as a penalty
    • $C$ is a large constant (for $\gamma$ close to 1), which leads to very small step sizes
  2. Algorithm 1: policy iteration with a monotonic improvement guarantee (see the sketch after this list)
    • the advantage estimates must be accurate for the guarantee to hold
    • can be extended to continuous state and action spaces
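
Below is a minimal sketch of the penalized surrogate that Algorithm 1 maximizes, for tabular policies on a tiny discrete problem; the array names and the normalized state distribution are illustrative assumptions for this example, not the paper's code.

```python
import numpy as np

# Sketch of the penalized surrogate from Theorem 1 / Algorithm 1 for tabular
# policies. pi_old/pi_new are (S, A) arrays of action probabilities, rho_old is
# an (S,) state-visitation weight vector (normalized here for simplicity; the
# paper's rho is unnormalized), adv is the (S, A) advantage table A_pi(s, a).

def kl_max(pi_old, pi_new):
    """Max over states of KL(pi_old(.|s) || pi_new(.|s))."""
    kl_per_state = np.sum(pi_old * (np.log(pi_old) - np.log(pi_new)), axis=1)
    return kl_per_state.max()

def penalized_surrogate(eta_old, rho_old, pi_old, pi_new, adv, gamma):
    """L_pi(pi_new) - C * D_KL^max(pi_old, pi_new): a lower bound on eta(pi_new)."""
    L = eta_old + np.sum(rho_old[:, None] * pi_new * adv)   # local approximation
    eps = np.abs(adv).max()                                  # eps = max_{s,a} |A|
    C = 4 * eps * gamma / (1 - gamma) ** 2                   # penalty coefficient
    return L - C * kl_max(pi_old, pi_new)

# With gamma = 0.99, C is on the order of 4 * eps * 10^4, which is why
# maximizing this penalized objective directly gives tiny policy updates.
rng = np.random.default_rng(0)
S, A, gamma = 4, 3, 0.99
pi_old = rng.dirichlet(np.ones(A), size=S)
pi_new = 0.95 * pi_old + 0.05 * rng.dirichlet(np.ones(A), size=S)
adv = rng.normal(size=(S, A))
rho_old = np.full(S, 1.0 / S)
print(penalized_surrogate(0.0, rho_old, pi_old, pi_new, adv, gamma))
```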

4 Optimization of Parameterized Policies

  1. Approximations

    1. trust region constraint

      • the penalty coefficient in Algorithm 1 makes the step size very small
      • so the KL penalty is replaced with a hard constraint on the KL divergence (a trust region)
    2. average KL
      • the max over states requires traversing the whole state space
      • use the average KL under the old state distribution instead
        $\bar{D}_{\mathrm{KL}}^{\rho_{\text{old}}}(\theta_{1}, \theta_{2}) := \mathbb{E}_{s \sim \rho_{\text{old}}}\left[D_{\mathrm{KL}}\left(\pi_{\theta_{1}}(\cdot \mid s) \,\|\, \pi_{\theta_{2}}(\cdot \mid s)\right)\right]$
  2. the optimization problem so far (see the sketch after this list)
    $\underset{\theta}{\operatorname{maximize}} \sum_{s} \rho_{\theta_{\text{old}}}(s) \sum_{a} \pi_{\theta}(a \mid s) A_{\theta_{\text{old}}}(s, a) \quad \text{subject to } \bar{D}_{\mathrm{KL}}^{\rho_{\text{old}}}\left(\theta_{\text{old}}, \theta\right) \leq \delta$
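
A small sketch of evaluating this constrained problem for tabular policies (continuing the illustrative arrays from the sketch above; not the paper's implementation):

```python
import numpy as np

# Trust-region form: maximize the surrogate subject to an average-KL constraint.
# pi_old/pi_new: (S, A) action probabilities; rho_old: (S,) state weights;
# adv: (S, A) advantages under the old policy; delta: trust-region radius.

def surrogate(rho_old, pi_new, adv):
    """sum_s rho_old(s) sum_a pi_new(a|s) A_old(s, a) (the constant eta(pi_old) is dropped)."""
    return np.sum(rho_old[:, None] * pi_new * adv)

def kl_bar(rho_old, pi_old, pi_new):
    """Average KL under the old state distribution, E_{s~rho_old}[KL(pi_old || pi_new)]."""
    kl_per_state = np.sum(pi_old * (np.log(pi_old) - np.log(pi_new)), axis=1)
    return np.sum(rho_old * kl_per_state)

def feasible(rho_old, pi_old, pi_new, delta=0.01):
    """A candidate update is inside the trust region iff the average KL is at most delta."""
    return kl_bar(rho_old, pi_old, pi_new) <= delta
```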

5 Sample-Based Estimation of the Objective and Constraint

$\underset{\theta}{\operatorname{maximize}}\ \mathbb{E}_{s \sim \rho_{\theta_{\text{old}}}, a \sim q}\left[\frac{\pi_{\theta}(a \mid s)}{q(a \mid s)} Q_{\theta_{\text{old}}}(s, a)\right] \quad \text{subject to } \mathbb{E}_{s \sim \rho_{\theta_{\text{old}}}}\left[D_{\mathrm{KL}}\left(\pi_{\theta_{\text{old}}}(\cdot \mid s) \,\|\, \pi_{\theta}(\cdot \mid s)\right)\right] \leq \delta$

  1. $s \sim \rho_{\theta_{\text{old}}}$

    • the sum over states weighted by $\rho_{\theta_{\text{old}}}$ is rewritten as an expectation, by the definition of expectation
  2. $a \sim q$
    • the sum over actions is rewritten as an expectation under a sampling distribution $q$, via importance sampling
  3. $Q_{\theta_{\text{old}}}(s, a)$
    • replacing $A_{\theta_{\text{old}}}$ with $Q_{\theta_{\text{old}}}$ is valid because

      1. $A(s, a) = Q(s, a) - V(s)$
      2. the $V_{\theta_{\text{old}}}(s)$ term only contributes a constant that does not depend on $\theta$
    • so Monte Carlo rollouts can be used to estimate $Q$ (there may be no equally direct Monte Carlo estimator for the advantage itself); see the sketch after this list
    • estimation methods
      1. single path
      2. Vine
        • gives lower-variance advantage estimates
        • but requires many more calls to the simulator
        • limited to systems where the state can be reset to arbitrary points
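
A minimal single-path-style sketch of the sample-based estimators above, assuming per-sample arrays of old/new log-probabilities, Monte Carlo Q estimates, and per-state KL values are already available (all names here are illustrative):

```python
import numpy as np

# Sample-based estimators from Section 5 with q = pi_old (single-path flavor):
#   logp_old[i], logp_new[i] : log pi_old(a_i|s_i) and log pi_theta(a_i|s_i)
#   q_mc[i]                  : Monte Carlo return, used as an estimate of Q_old(s_i, a_i)
#   kl_per_state[i]          : KL(pi_old(.|s_i) || pi_theta(.|s_i))

def surrogate_loss(logp_new, logp_old, q_mc):
    """Sample mean of (pi_theta(a|s) / pi_old(a|s)) * Q_old(s, a): the importance-weighted objective."""
    ratio = np.exp(logp_new - logp_old)   # importance weights
    return np.mean(ratio * q_mc)

def mean_kl(kl_per_state):
    """Sample estimate of E_{s ~ rho_old}[KL(pi_old(.|s) || pi_theta(.|s))]."""
    return np.mean(kl_per_state)
```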

6 Practical Algorithm

  1. Estimate the Q-values (advantages) from collected trajectories (single path or vine)
  2. Construct the sample-based surrogate objective and the KL constraint
  3. Approximately solve the constrained problem with the conjugate gradient algorithm followed by a line search (TODO; see the sketch after the references below)
    • references

      1. the paper's Appendix C
      2. blog post with the theory and an implementation: https://www.telesens.co/2018/06/09/efficiently-computing-the-fisher-vector-product-in-trpo/
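
The point of these references is that conjugate gradient only needs products $Fv$ with the Fisher matrix, never $F$ itself. A generic sketch follows, with an explicit positive-definite matrix standing in for the Fisher information (in TRPO the product is instead computed as a Hessian-vector product of the average KL, as the references describe):

```python
import numpy as np

def conjugate_gradient(mat_vec, b, iters=10, tol=1e-10):
    """Solve F x = b using only the matrix-vector product mat_vec(v) = F v."""
    x = np.zeros_like(b)
    r = b.copy()                 # residual b - F x (x starts at zero)
    p = r.copy()                 # search direction
    rs_old = r @ r
    for _ in range(iters):
        Fp = mat_vec(p)
        alpha = rs_old / (p @ Fp)
        x += alpha * p
        r -= alpha * Fp
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x

# Illustration with a stand-in Fisher matrix F and a gradient g of the surrogate.
rng = np.random.default_rng(0)
A = rng.normal(size=(5, 5))
F = A @ A.T + 5.0 * np.eye(5)                 # symmetric positive definite
g = rng.normal(size=5)
step_dir = conjugate_gradient(lambda v: F @ v, g)
print(np.allclose(F @ step_dir, g, atol=1e-6))
```

In the paper, the resulting direction is then rescaled so that the quadratic approximation of the KL equals $\delta$, and a backtracking line search on the surrogate ensures the constraint holds and the objective improves.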

8 Experiments

  1. TRPO is related to prior methods (e.g. the natural policy gradient) but makes several changes, most notably using a fixed KL divergence constraint rather than a fixed penalty coefficient.

    • These results provide empirical evidence that constraining the KL divergence is a more robust way to choose step sizes and make fast, consistent progress, compared to using a fixed penalty.
  2. can obtain high-quality locomotion controllers from scratch, which is considered to be a hard problem.
  3. the proposed method is scalable and has a strong theoretical foundation.
