[RL 9] Trust Region Policy Optimization (ICML, 2015)
1 Introduction
- policy optimization categories
- policy iteration (GPI)
- PG (e.g. TRPO)
- derivative-free optimization methods
2 Preliminaries
- Consider an infinite-horizon discounted MDP
- instead of an average-reward one
- objective $\eta(\tilde{\pi})$ (Kakade, 2002, TODO)

$$\eta(\tilde{\pi}) = \eta(\pi) + \mathbb{E}_{s_0, a_0, \cdots \sim \tilde{\pi}}\left[\sum_{t=0}^{\infty} \gamma^{t} A_{\pi}(s_t, a_t)\right] = \eta(\pi) + \sum_{s} \rho_{\tilde{\pi}}(s) \sum_{a} \tilde{\pi}(a \mid s) A_{\pi}(s, a)$$

- improving the expectation on the right-hand side leads to policy improvement
- but $\rho_{\tilde{\pi}}(s)$ is hard to estimate
- so use local approximation instead
- local approximation $L_{\pi}(\tilde{\pi})$ (Kakade, 2002, TODO)

$$L_{\pi}(\tilde{\pi}) = \eta(\pi) + \sum_{s} \rho_{\pi}(s) \sum_{a} \tilde{\pi}(a \mid s) A_{\pi}(s, a)$$

- matches $\eta(\tilde{\pi})$ to first order
- lower bound
- limited form of policy improvement
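In the tabular case the policy-dependent part of the local approximation can be computed directly. A minimal sketch (illustrative only; the state space, advantages, and candidate policy below are made up):

```python
import numpy as np

# Sketch: compute L_pi(pi_new) - eta(pi) = sum_s rho_pi(s) sum_a pi_new(a|s) A_pi(s,a)
# for tabular policies. Note the state visitation rho_pi is that of the OLD
# policy pi, which is what makes L_pi only a local approximation to eta.

def local_approximation_gap(rho_pi, pi_new, advantages):
    """rho_pi: (S,) discounted state visitation under pi;
    pi_new: (S, A) candidate policy; advantages: (S, A) values A_pi(s, a)."""
    return float(np.sum(rho_pi[:, None] * pi_new * advantages))

rho = np.array([0.6, 0.4])                        # visitation under old policy
adv = np.array([[1.0, -1.0], [0.5, -0.5]])        # hypothetical advantages
greedy = np.array([[1.0, 0.0], [1.0, 0.0]])       # all mass on the best action
print(local_approximation_gap(rho, greedy, adv))  # 0.6*1.0 + 0.4*0.5
```

A positive gap means the candidate policy improves the surrogate, which (by the first-order match) locally improves $\eta$ as well.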
3 Monotonic Improvement Guarantee for General Stochastic Policies
- Theorem 1: lower bound for general stochastic policies

$$\eta(\tilde{\pi}) \geq L_{\pi}(\tilde{\pi}) - C D_{\mathrm{KL}}^{\max}(\pi, \tilde{\pi}), \quad \text{where } C = \frac{4 \epsilon \gamma}{(1-\gamma)^2}, \ \epsilon = \max_{s,a} |A_{\pi}(s, a)|$$

- the KL term can be seen as a penalty
- $C$ is a large constant, which leads to a small step size
- Algorithm 1: monotonic policy iteration
- the advantage estimates should be accurate
- can be extended to continuous state and action spaces
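To see why the penalty coefficient forces tiny steps, the constant $C$ from Theorem 1 can be evaluated numerically. A small sketch (the values of $\epsilon$ and $\gamma$ are made up for illustration):

```python
# Sketch of Theorem 1's bound: eta(pi_new) >= L_pi(pi_new) - C * D_KL^max,
# with C = 4 * eps * gamma / (1 - gamma)^2 and eps = max_{s,a} |A_pi(s,a)|.

def penalty_coefficient(eps, gamma):
    return 4.0 * eps * gamma / (1.0 - gamma) ** 2

def lower_bound(L_new, eps, gamma, max_kl):
    """Guaranteed-improvement objective maximized by Algorithm 1."""
    return L_new - penalty_coefficient(eps, gamma) * max_kl

# With gamma close to 1 the (1 - gamma)^2 denominator makes C explode,
# so even a small KL change wipes out the surrogate gain:
print(penalty_coefficient(eps=1.0, gamma=0.99))  # roughly 3.96 / 1e-4, i.e. ~39600
```

This is exactly the motivation for Section 4's move from a penalty to a hard KL constraint: the theoretically justified $C$ is far too conservative in practice.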
4 Optimization of Parameterized Policies
- Approximations
- trust region constraint
- step size in Algorithm 1 could be very small
- change the KL penalty to a constraint
- average KL
- the max operation involves traversing the whole state space, so use the average KL instead

$$\bar{D}_{\mathrm{KL}}^{\rho_{\mathrm{old}}}(\theta_1, \theta_2) := \mathbb{E}_{s \sim \rho_{\mathrm{old}}}\left[D_{\mathrm{KL}}\left(\pi_{\theta_1}(\cdot \mid s) \,\|\, \pi_{\theta_2}(\cdot \mid s)\right)\right]$$
- trust region constraint
- optimization problem so far

$$\begin{aligned} &\underset{\theta}{\operatorname{maximize}} \; \sum_{s} \rho_{\theta_{\mathrm{old}}}(s) \sum_{a} \pi_{\theta}(a \mid s) A_{\theta_{\mathrm{old}}}(s, a) \\ &\text{subject to } \bar{D}_{\mathrm{KL}}^{\rho_{\mathrm{old}}}(\theta_{\mathrm{old}}, \theta) \leq \delta \end{aligned}$$
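The average KL in the constraint is an expectation over states, so it can be estimated from rollouts. A minimal sketch for tabular categorical policies (the tables, sampled states, and $\delta$ below are made up):

```python
import numpy as np

# Sketch: estimate the average-KL constraint from states sampled s ~ rho_old.
# Unlike max_s KL, this needs no traversal of the state space.

def kl_categorical(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions; eps guards log(0)."""
    p, q = np.asarray(p), np.asarray(q)
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

def mean_kl(sampled_states, pi_old, pi_new):
    """sampled_states: state indices drawn from rho_old; pi_*: (S, A) tables."""
    return float(np.mean([kl_categorical(pi_old[s], pi_new[s])
                          for s in sampled_states]))

pi_old = np.array([[0.5, 0.5], [0.9, 0.1]])
pi_new = np.array([[0.6, 0.4], [0.9, 0.1]])
states = [0, 0, 1]                # states visited under the old policy
print(mean_kl(states, pi_old, pi_new) <= 0.05)  # True for delta = 0.05
```

In the real algorithm the states come from trajectories, and the KL is between parametric (e.g. Gaussian) policies rather than lookup tables.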
5 Sample-Based Estimation of the Objective and Constraint
$$\begin{aligned} &\underset{\theta}{\operatorname{maximize}} \; \mathbb{E}_{s \sim \rho_{\theta_{\mathrm{old}}}, a \sim q}\left[\frac{\pi_{\theta}(a \mid s)}{q(a \mid s)} Q_{\theta_{\mathrm{old}}}(s, a)\right] \\ &\text{subject to } \mathbb{E}_{s \sim \rho_{\theta_{\mathrm{old}}}}\left[D_{\mathrm{KL}}\left(\pi_{\theta_{\mathrm{old}}}(\cdot \mid s) \,\|\, \pi_{\theta}(\cdot \mid s)\right)\right] \leq \delta \end{aligned}$$
- $s \sim \rho_{\theta_{\mathrm{old}}}$
- by the definition of expectation
- $a \sim q$
- by importance sampling
- $Q_{\theta_{\mathrm{old}}}(s, a)$
- by the fact that
- $A(s, a) = Q(s, a) - V(s)$
- $\mathbb{E}_{s \sim \rho_{\theta_{\mathrm{old}}}}[V^{\theta_{\mathrm{old}}}(s)]$ does not depend on $\theta$, i.e. it is a constant
- so Monte Carlo can be used to estimate Q (maybe there is no proper MC estimator for the advantage)
- estimation methods
- single path
- Vine
- better estimation
- but needs more calls to the simulator
- limited to systems where states can be restored
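The importance-sampled objective above is straightforward to estimate from samples. A minimal sketch (all probabilities and returns below are made up; here $q = \pi_{\theta_{\mathrm{old}}}$, as in the single-path scheme):

```python
import numpy as np

# Sketch of the sample-based surrogate: with states s ~ rho_old and actions
# a ~ q, the objective is E[ pi_theta(a|s) / q(a|s) * Q_old(s, a) ],
# estimated as an empirical mean over collected (s, a) pairs.

def surrogate_estimate(pi_new_probs, q_probs, q_values):
    """All arrays are per-sample: pi_theta(a_i|s_i), q(a_i|s_i), Q_old(s_i, a_i)."""
    ratios = pi_new_probs / q_probs       # importance weights
    return float(np.mean(ratios * q_values))

pi_probs = np.array([0.6, 0.3, 0.8])   # new policy's prob. of the sampled actions
q_probs  = np.array([0.5, 0.5, 0.5])   # behavior distribution q(a|s)
q_vals   = np.array([1.0, -0.5, 2.0])  # Monte Carlo returns as Q estimates
print(surrogate_estimate(pi_probs, q_probs, q_vals))
```

The estimator is unbiased as long as the action probabilities are computed under the same $q$ that generated the data.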
6 Practical Algorithm
- Update Q using collected trajectories
- Calculate the objective and its constraint
- Solve the constrained problem with the conjugate gradient algorithm TODO
- references
- paper Appendix C
- post with theorem and implementation https://www.telesens.co/2018/06/09/efficiently-computing-the-fisher-vector-product-in-trpo/
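The conjugate gradient step solves $Hx = g$, where $H$ is the Fisher information matrix and $g$ the policy gradient. A minimal sketch (the 2×2 matrix below is a made-up stand-in; in the real algorithm $H$ is never formed explicitly, only Fisher-vector products $Hv$ are computed):

```python
import numpy as np

# Sketch of plain conjugate gradient: solve H x = g given only a function
# hvp(v) that returns H @ v, which is all TRPO needs from the Fisher matrix.

def conjugate_gradient(hvp, g, iters=10, tol=1e-10):
    x = np.zeros_like(g)
    r = g.copy()          # residual g - H x, with x = 0 initially
    p = r.copy()          # search direction
    rs = r @ r
    for _ in range(iters):
        Hp = hvp(p)
        alpha = rs / (p @ Hp)
        x += alpha * p
        r -= alpha * Hp
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

H = np.array([[4.0, 1.0], [1.0, 3.0]])   # stand-in positive-definite Fisher matrix
g = np.array([1.0, 2.0])
x = conjugate_gradient(lambda v: H @ v, g)
print(np.allclose(H @ x, g))  # True
```

Because only `hvp` is called, the same loop scales to networks with millions of parameters, which is the point of the Fisher-vector-product trick in the linked post.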
8 Experiments
- TRPO is related to prior methods (e.g. natural policy gradient) but makes several changes, most notably using a fixed KL divergence constraint rather than a fixed penalty coefficient
- these results provide empirical evidence that constraining the KL divergence is a more robust way to choose step sizes and make fast, consistent progress than using a fixed penalty
- can obtain high-quality locomotion controllers from scratch, which is considered to be a hard problem.
- the method we proposed is scalable and has strong theoretical foundations.