Table of Contents

  • 1. Model-Free RL
    • a. Deep Q-Learning
    • b. Policy Gradients
    • c. Deterministic Policy Gradients
    • d. Distributional RL
    • e. Policy Gradients with Action-Dependent Baselines
    • f. Path-Consistency Learning
    • g. Other Directions for Combining Policy-Learning and Q-Learning
    • h. Evolutionary Algorithms
  • 2. Exploration
    • a. Intrinsic Motivation
    • b. Unsupervised RL
  • 3. Transfer and Multitask RL
  • 4. Hierarchy
  • 5. Memory
  • 6. Model-Based RL
    • a. Model is Learned
    • b. Model is Given
  • 7. Meta-RL
  • 8. Scaling RL
  • 9. RL in the Real World
  • 10. Safety
  • 11. Imitation Learning and Inverse Reinforcement Learning
  • 12. Reproducibility, Analysis, and Critique
  • 13. Bonus: Classic Papers in RL Theory or Review

1. Model-Free RL

a. Deep Q-Learning

[1] Playing Atari with Deep Reinforcement Learning, Mnih et al, 2013. Algorithm: DQN.
[2] Deep Recurrent Q-Learning for Partially Observable MDPs, Hausknecht and Stone, 2015. Algorithm: Deep Recurrent Q-Learning.
[3] Dueling Network Architectures for Deep Reinforcement Learning, Wang et al, 2015. Algorithm: Dueling DQN.
[4] Deep Reinforcement Learning with Double Q-learning, van Hasselt et al, 2015. Algorithm: Double DQN.
[5] Prioritized Experience Replay, Schaul et al, 2015. Algorithm: Prioritized Experience Replay (PER).
[6] Rainbow: Combining Improvements in Deep Reinforcement Learning, Hessel et al, 2017. Algorithm: Rainbow DQN.

b. Policy Gradients

[7] Asynchronous Methods for Deep Reinforcement Learning, Mnih et al, 2016. Algorithm: A3C.
[8] Trust Region Policy Optimization, Schulman et al, 2015. Algorithm: TRPO.
[9] High-Dimensional Continuous Control Using Generalized Advantage Estimation, Schulman et al, 2015. Algorithm: GAE.
[10] Proximal Policy Optimization Algorithms, Schulman et al, 2017. Algorithm: PPO-Clip, PPO-Penalty.
[11] Emergence of Locomotion Behaviours in Rich Environments, Heess et al, 2017. Algorithm: PPO-Penalty.
[12] Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation, Wu et al, 2017. Algorithm: ACKTR.
[13] Sample Efficient Actor-Critic with Experience Replay, Wang et al, 2016. Algorithm: ACER.
[14] Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor, Haarnoja et al, 2018. Algorithm: SAC.

c. Deterministic Policy Gradients

[15] Deterministic Policy Gradient Algorithms, Silver et al, 2014. Algorithm: DPG.
[16] Continuous Control With Deep Reinforcement Learning, Lillicrap et al, 2015. Algorithm: DDPG.
[17] Addressing Function Approximation Error in Actor-Critic Methods, Fujimoto et al, 2018. Algorithm: TD3.

d. Distributional RL

[18] A Distributional Perspective on Reinforcement Learning, Bellemare et al, 2017. Algorithm: C51.
[19] Distributional Reinforcement Learning with Quantile Regression, Dabney et al, 2017. Algorithm: QR-DQN.
[20] Implicit Quantile Networks for Distributional Reinforcement Learning, Dabney et al, 2018. Algorithm: IQN.
[21] Dopamine: A Research Framework for Deep Reinforcement Learning, Castro et al, 2018. Contribution: Introduces Dopamine, a code repository containing implementations of DQN, C51, IQN, and Rainbow.

e. Policy Gradients with Action-Dependent Baselines

[22] Q-Prop: Sample-Efficient Policy Gradient with An Off-Policy Critic, Gu et al, 2016. Algorithm: Q-Prop.
[23] Action-dependent Control Variates for Policy Optimization via Stein’s Identity, Liu et al, 2017. Algorithm: Stein Control Variates.
[24] The Mirage of Action-Dependent Baselines in Reinforcement Learning, Tucker et al, 2018. Contribution: Critiques and reevaluates claims from earlier papers (including Q-Prop and Stein control variates) and finds important methodological errors in them.

f. Path-Consistency Learning

[25] Bridging the Gap Between Value and Policy Based Reinforcement Learning, Nachum et al, 2017. Algorithm: PCL.
[26] Trust-PCL: An Off-Policy Trust Region Method for Continuous Control, Nachum et al, 2017. Algorithm: Trust-PCL.

g. Other Directions for Combining Policy-Learning and Q-Learning

[27] Combining Policy Gradient and Q-learning, O’Donoghue et al, 2016. Algorithm: PGQL.
[28] The Reactor: A Fast and Sample-Efficient Actor-Critic Agent for Reinforcement Learning, Gruslys et al, 2017. Algorithm: Reactor.
[29] Interpolated Policy Gradient: Merging On-Policy and Off-Policy Gradient Estimation for Deep Reinforcement Learning, Gu et al, 2017. Algorithm: IPG.
[30] Equivalence Between Policy Gradients and Soft Q-Learning, Schulman et al, 2017. Contribution: Reveals a theoretical link between these two families of RL algorithms.

h. Evolutionary Algorithms

[31] Evolution Strategies as a Scalable Alternative to Reinforcement Learning, Salimans et al, 2017. Algorithm: ES.

2. Exploration

a. Intrinsic Motivation

[32] VIME: Variational Information Maximizing Exploration, Houthooft et al, 2016. Algorithm: VIME.
[33] Unifying Count-Based Exploration and Intrinsic Motivation, Bellemare et al, 2016. Algorithm: CTS-based Pseudocounts.
[34] Count-Based Exploration with Neural Density Models, Ostrovski et al, 2017. Algorithm: PixelCNN-based Pseudocounts.
[35] #Exploration: A Study of Count-Based Exploration for Deep Reinforcement Learning, Tang et al, 2016. Algorithm: Hash-based Counts.
[36] EX2: Exploration with Exemplar Models for Deep Reinforcement Learning, Fu et al, 2017. Algorithm: EX2.
[37] Curiosity-driven Exploration by Self-supervised Prediction, Pathak et al, 2017. Algorithm: Intrinsic Curiosity Module (ICM).
[38] Large-Scale Study of Curiosity-Driven Learning, Burda et al, 2018. Contribution: Systematic analysis of how surprisal-based intrinsic motivation performs in a wide variety of environments.
[39] Exploration by Random Network Distillation, Burda et al, 2018. Algorithm: RND.

b. Unsupervised RL

[40] Variational Intrinsic Control, Gregor et al, 2016. Algorithm: VIC.
[41] Diversity is All You Need: Learning Skills without a Reward Function, Eysenbach et al, 2018. Algorithm: DIAYN.
[42] Variational Option Discovery Algorithms, Achiam et al, 2018. Algorithm: VALOR.

3. Transfer and Multitask RL

[43] Progressive Neural Networks, Rusu et al, 2016. Algorithm: Progressive Networks.
[44] Universal Value Function Approximators, Schaul et al, 2015. Algorithm: UVFA.
[45] Reinforcement Learning with Unsupervised Auxiliary Tasks, Jaderberg et al, 2016. Algorithm: UNREAL.
[46] The Intentional Unintentional Agent: Learning to Solve Many Continuous Control Tasks Simultaneously, Cabi et al, 2017. Algorithm: IU Agent.
[47] PathNet: Evolution Channels Gradient Descent in Super Neural Networks, Fernando et al, 2017. Algorithm: PathNet.
[48] Mutual Alignment Transfer Learning, Wulfmeier et al, 2017. Algorithm: MATL.
[49] Learning an Embedding Space for Transferable Robot Skills, Hausman et al, 2018.
[50] Hindsight Experience Replay, Andrychowicz et al, 2017. Algorithm: Hindsight Experience Replay (HER).

4. Hierarchy

[51] Strategic Attentive Writer for Learning Macro-Actions, Vezhnevets et al, 2016. Algorithm: STRAW.
[52] FeUdal Networks for Hierarchical Reinforcement Learning, Vezhnevets et al, 2017. Algorithm: Feudal Networks.
[53] Data-Efficient Hierarchical Reinforcement Learning, Nachum et al, 2018. Algorithm: HIRO.

5. Memory

[54] Model-Free Episodic Control, Blundell et al, 2016. Algorithm: MFEC.
[55] Neural Episodic Control, Pritzel et al, 2017. Algorithm: NEC.
[56] Neural Map: Structured Memory for Deep Reinforcement Learning, Parisotto and Salakhutdinov, 2017. Algorithm: Neural Map.
[57] Unsupervised Predictive Memory in a Goal-Directed Agent, Wayne et al, 2018. Algorithm: MERLIN.
[58] Relational Recurrent Neural Networks, Santoro et al, 2018. Algorithm: RMC.

6. Model-Based RL

a. Model is Learned

[59] Imagination-Augmented Agents for Deep Reinforcement Learning, Weber et al, 2017. Algorithm: I2A.
[60] Neural Network Dynamics for Model-Based Deep Reinforcement Learning with Model-Free Fine-Tuning, Nagabandi et al, 2017. Algorithm: MBMF.
[61] Model-Based Value Expansion for Efficient Model-Free Reinforcement Learning, Feinberg et al, 2018. Algorithm: MVE.
[62] Sample-Efficient Reinforcement Learning with Stochastic Ensemble Value Expansion, Buckman et al, 2018. Algorithm: STEVE.
[63] Model-Ensemble Trust-Region Policy Optimization, Kurutach et al, 2018. Algorithm: ME-TRPO.
[64] Model-Based Reinforcement Learning via Meta-Policy Optimization, Clavera et al, 2018. Algorithm: MB-MPO.
[65] Recurrent World Models Facilitate Policy Evolution, Ha and Schmidhuber, 2018.

b. Model is Given

[66] Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm, Silver et al, 2017. Algorithm: AlphaZero.
[67] Thinking Fast and Slow with Deep Learning and Tree Search, Anthony et al, 2017. Algorithm: ExIt.

7. Meta-RL

[68] RL^2: Fast Reinforcement Learning via Slow Reinforcement Learning, Duan et al, 2016. Algorithm: RL^2.
[69] Learning to Reinforcement Learn, Wang et al, 2016.
[70] Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks, Finn et al, 2017. Algorithm: MAML.
[71] A Simple Neural Attentive Meta-Learner, Mishra et al, 2018. Algorithm: SNAIL.

8. Scaling RL

[72] Accelerated Methods for Deep Reinforcement Learning, Stooke and Abbeel, 2018. Contribution: Systematic analysis of parallelization in deep RL across algorithms.
[73] IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures, Espeholt et al, 2018. Algorithm: IMPALA.
[74] Distributed Prioritized Experience Replay, Horgan et al, 2018. Algorithm: Ape-X.
[75] Recurrent Experience Replay in Distributed Reinforcement Learning, Kapturowski et al, 2018. Algorithm: R2D2.
[76] RLlib: Abstractions for Distributed Reinforcement Learning, Liang et al, 2017. Contribution: A scalable library of RL algorithm implementations.

9. RL in the Real World

[77] Benchmarking Reinforcement Learning Algorithms on Real-World Robots, Mahmood et al, 2018.
[78] Learning Dexterous In-Hand Manipulation, OpenAI, 2018.
[79] QT-Opt: Scalable Deep Reinforcement Learning for Vision-Based Robotic Manipulation, Kalashnikov et al, 2018. Algorithm: QT-Opt.
[80] Horizon: Facebook’s Open Source Applied Reinforcement Learning Platform, Gauci et al, 2018.

10. Safety

[81] Concrete Problems in AI Safety, Amodei et al, 2016. Contribution: Establishes a taxonomy of safety problems, serving as an important jumping-off point for future research. We need to solve these!
[82] Deep Reinforcement Learning From Human Preferences, Christiano et al, 2017. Algorithm: LFP.
[83] Constrained Policy Optimization, Achiam et al, 2017. Algorithm: CPO.
[84] Safe Exploration in Continuous Action Spaces, Dalal et al, 2018. Algorithm: DDPG+Safety Layer.
[85] Trial without Error: Towards Safe Reinforcement Learning via Human Intervention, Saunders et al, 2017. Algorithm: HIRL.
[86] Leave No Trace: Learning to Reset for Safe and Autonomous Reinforcement Learning, Eysenbach et al, 2017. Algorithm: Leave No Trace.

11. Imitation Learning and Inverse Reinforcement Learning

[87] Modeling Purposeful Adaptive Behavior with the Principle of Maximum Causal Entropy, Ziebart, 2010. Contributions: Crisp formulation of maximum entropy IRL.
[88] Guided Cost Learning: Deep Inverse Optimal Control via Policy Optimization, Finn et al, 2016. Algorithm: GCL.
[89] Generative Adversarial Imitation Learning, Ho and Ermon, 2016. Algorithm: GAIL.
[90] DeepMimic: Example-Guided Deep Reinforcement Learning of Physics-Based Character Skills, Peng et al, 2018. Algorithm: DeepMimic.
[91] Variational Discriminator Bottleneck: Improving Imitation Learning, Inverse RL, and GANs by Constraining Information Flow, Peng et al, 2018. Algorithm: VAIL.
[92] One-Shot High-Fidelity Imitation: Training Large-Scale Deep Nets with RL, Le Paine et al, 2018. Algorithm: MetaMimic.

12. Reproducibility, Analysis, and Critique

[93] Benchmarking Deep Reinforcement Learning for Continuous Control, Duan et al, 2016. Contribution: rllab.
[94] Reproducibility of Benchmarked Deep Reinforcement Learning Tasks for Continuous Control, Islam et al, 2017.
[95] Deep Reinforcement Learning that Matters, Henderson et al, 2017.
[96] Where Did My Optimum Go?: An Empirical Analysis of Gradient Descent Optimization in Policy Gradient Methods, Henderson et al, 2018.
[97] Are Deep Policy Gradient Algorithms Truly Policy Gradient Algorithms?, Ilyas et al, 2018.
[98] Simple Random Search Provides a Competitive Approach to Reinforcement Learning, Mania et al, 2018.
[99] Benchmarking Model-Based Reinforcement Learning, Wang et al, 2019.

13. Bonus: Classic Papers in RL Theory or Review

[100] Policy Gradient Methods for Reinforcement Learning with Function Approximation, Sutton et al, 2000. Contributions: Established the policy gradient theorem and showed convergence of the policy gradient algorithm for arbitrary policy classes.
[101] An Analysis of Temporal-Difference Learning with Function Approximation, Tsitsiklis and Van Roy, 1997. Contributions: Variety of convergence results and counter-examples for value-learning methods in RL.
[102] Reinforcement Learning of Motor Skills with Policy Gradients, Peters and Schaal, 2008. Contributions: Thorough review of policy gradient methods at the time, many of which are still serviceable descriptions of deep RL methods.
[103] Approximately Optimal Approximate Reinforcement Learning, Kakade and Langford, 2002. Contributions: Early roots for monotonic improvement theory, later leading to theoretical justification for TRPO and other algorithms.
[104] A Natural Policy Gradient, Kakade, 2002. Contributions: Brought natural gradients into RL, later leading to TRPO, ACKTR, and several other methods in deep RL.
[105] Algorithms for Reinforcement Learning, Szepesvari, 2009. Contributions: Unbeatable reference on RL before deep RL, containing foundations and theoretical background.
