Reinforcement Learning: Fundamentals and a Collection of Papers and Resources

Continuously updated...

Books

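1. Reinforcement Learning: An Introduction, by Richard S. Sutton and Andrew G. Barto. Often called the "bible of reinforcement learning", it is essential reading for anyone starting out in the field. A Chinese translation exists for those interested, but the English edition is recommended: it conveys the authors' intent more directly and makes reading papers later on much easier.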

http://incompleteideas.net/book/RLbook2020.pdf

Code (Python implementation): Python implementations of the examples in the book

GitHub - ShangtongZhang/reinforcement-learning-an-introduction: Python Implementation of Reinforcement Learning: An Introduction

Exercises for Reinforcement Learning: An Introduction: solutions to the exercises in the book

reinforcement_learning_an_introduction/exercises.pdf at master · brynhayder/reinforcement_learning_an_introduction · GitHub

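2. 《Tensorflow 深度学习》 (Deep Learning with TensorFlow), by Long Liangqu (龙良曲). A tutorial book for TensorFlow 2.0. Deep reinforcement learning code can be implemented in TensorFlow, and this book works as a reference for learning TensorFlow 2.0. Other deep learning frameworks such as PyTorch, Keras, and MXNet are also very popular; when starting out, pick whichever one suits your situation.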

GitHub - dragen1860/Deep-Learning-with-TensorFlow-book: Open source Deep Learning book, based on TensorFlow 2.0 framework.

Classic Papers

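1. (1999) Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Covers semi-Markov decision processes (semi-MDPs). It can be skipped when first learning reinforcement learning; come back to it as a reference when later papers build on it.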

Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning - ScienceDirect

2. (2013 DeepMind) Playing Atari with Deep Reinforcement Learning. The DeepMind paper that introduced the prototype of DQN, a classic deep reinforcement learning algorithm, and trained it with the experience replay mechanism.

http://arxiv.org/abs/1312.5602
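
As a rough illustration of the experience replay idea mentioned above, here is a minimal sketch of a uniform replay buffer. It is not the paper's implementation; the class name `ReplayBuffer`, its method names, and the tuple layout of a transition are assumptions made for this example.

```python
import random
from collections import deque


class ReplayBuffer:
    """Hypothetical fixed-size buffer of transitions, sampled uniformly."""

    def __init__(self, capacity):
        # A deque with maxlen drops the oldest transition once it is full.
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        # Store one transition (s, a, r, s', done).
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement; mixing old and new
        # transitions reduces the correlation between consecutive samples.
        batch = random.sample(list(self.buffer), batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
```

A typical loop would call `push` after every environment step and start calling `sample` once the buffer holds at least one batch of transitions.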

3. (2014 DeepMind, UCL) Deterministic Policy Gradient Algorithms. Introduces the deterministic policy gradient, one of the foundations of the later DDPG algorithm.

http://proceedings.mlr.press/v32/silver14.pdf

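4. (2015 DeepMind) Human-level control through deep reinforcement learning. The classic DQN algorithm. In addition to experience replay, it adds a fixed target Q-network that is updated periodically. It is an important foundation for many deep reinforcement learning algorithms.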

Human-level control through deep reinforcement learning | Nature
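
To make the fixed target Q-network idea more concrete, here is a small NumPy sketch. It assumes a toy linear Q-function, Q(s, ·) = s @ W, stored in a dict; the function names `td_targets` and `maybe_sync` and the `sync_every` parameter are hypothetical and chosen only for illustration, not taken from the paper.

```python
import copy
import numpy as np


def td_targets(q_target, rewards, next_states, dones, gamma=0.99):
    """Compute y = r + gamma * max_a' Q_target(s', a') for a batch."""
    # Toy linear Q-function (an assumption): rows are states, columns actions.
    next_q = next_states @ q_target["W"]
    max_next_q = next_q.max(axis=1)
    # Terminal transitions (done == 1) do not bootstrap from the next state.
    return rewards + gamma * (1.0 - dones) * max_next_q


def maybe_sync(q_online, q_target, step, sync_every):
    """Return a fresh copy of the online parameters every sync_every steps."""
    if step % sync_every == 0:
        return copy.deepcopy(q_online)
    # Otherwise the target parameters stay frozen.
    return q_target
```

Between synchronizations the targets are computed from frozen parameters, so the regression target does not shift with every gradient step on the online network.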

5. (2015 DeepMind) Continuous control with deep reinforcement learning. Proposes the well-known and widely used DDPG algorithm.

http://arxiv.org/abs/1509.02971

6. (2016 DeepMind) Mastering the game of Go with deep neural networks and tree search. Applies reinforcement learning to the game of Go; the famous AlphaGo is based on this work.

http://www.nature.com/articles/nature16961

7. (2015 UCB) Trust Region Policy Optimization. Proposes the TRPO algorithm, the basis of the later Proximal Policy Optimization (PPO) algorithm.

http://arxiv.org/abs/1502.05477

8. (2017 DeepMind) FeUdal Networks for Hierarchical Reinforcement Learning. Presents the FeUdal Networks architecture for hierarchical reinforcement learning; a useful reference for research in that area.

http://arxiv.org/abs/1703.01161

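In 2017, Google DeepMind and OpenAI each published work on the well-known Proximal Policy Optimization (PPO) algorithm within a short time of each other. Because of its ease of use and good performance, OpenAI adopted PPO as its default reinforcement learning algorithm.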

9. (2017 DeepMind) Emergence of Locomotion Behaviours in Rich Environments. The Google DeepMind paper on Proximal Policy Optimization (PPO).

http://arxiv.org/abs/1707.02286

10. (2017 OpenAI) Proximal Policy Optimization Algorithms. The OpenAI paper on Proximal Policy Optimization (PPO).

http://arxiv.org/abs/1707.06347
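
For readers who want a feel for what PPO optimizes, here is a minimal NumPy sketch of the clipped surrogate objective described in the OpenAI paper. The function name, argument names, and the default `clip_eps=0.2` are choices made for this example; a real implementation would compute this inside an autodiff framework and add value-function and entropy terms.

```python
import numpy as np


def ppo_clip_objective(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Clipped surrogate objective (to be maximized), averaged over a batch."""
    # Probability ratio r = pi_new(a|s) / pi_old(a|s), computed in log space.
    ratio = np.exp(new_log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # The element-wise minimum removes the incentive to move the ratio
    # far outside [1 - clip_eps, 1 + clip_eps].
    return np.minimum(unclipped, clipped).mean()
```

When the new and old policies are identical, the ratio is 1 everywhere and the objective reduces to the mean advantage.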

11. (2017 DeepMind) Rainbow: Combining Improvements in Deep Reinforcement Learning. Proposes the well-known Rainbow algorithm, which combines six improvements to DQN and achieves very strong results.

http://arxiv.org/abs/1710.02298

12. (2017 Google Brain, Google Research) Attention Is All You Need. The paper on the multi-head self-attention mechanism, which proposes the well-known Transformer architecture, now widely used in NLP, CV, and other fields.

https://arxiv.org/abs/1706.03762
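
As a small illustration of the attention mechanism at the core of the Transformer, here is a NumPy sketch of single-head scaled dot-product attention, softmax(Q K^T / sqrt(d_k)) V. It covers only one head on 2-D arrays, with no masking and no learned projections; multi-head attention runs several such heads in parallel on projected inputs. The function name and return signature are assumptions for this example.

```python
import numpy as np


def scaled_dot_product_attention(q, k, v):
    """Single-head attention for q (n, d_k), k (m, d_k), v (m, d_v)."""
    d_k = q.shape[-1]
    # Similarity of every query with every key, scaled by sqrt(d_k).
    scores = q @ k.T / np.sqrt(d_k)
    # Numerically stable softmax over the key axis.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # Each output row is a weighted average of the value rows.
    return weights @ v, weights
```

The function returns both the attended values and the attention weights, which is convenient for inspecting which keys each query attends to.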