[Tensorlayer Series] Deep Reinforcement Learning: Solving FrozenLake with DQN


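Tensorlayer deep reinforcement learning series:
Tensorlayer Deep Reinforcement Learning: Installing Tensorlayer
[Tensorlayer Series] Deep Reinforcement Learning: Introduction to FrozenLake and a Tabular Q-Learning Solution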

Contents

3.1 FrozenLake-v0
3.2 DQN
- 3.2.1 Code
- 3.2.2 Experimental results

3.1 FrozenLake-v0

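For an introduction to the FrozenLake environment, see "[Tensorlayer Series] Deep Reinforcement Learning: Introduction to FrozenLake and a Tabular Q-Learning Solution"; it is not repeated here.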

3.2 DQN

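Input: FrozenLake has 16 states, numbered 0 to 15. The DQN input is a one-hot vector of length 16: for state n, the entry at index n is 1 and all other entries are 0.
Output: one Q-value for each of the four actions (up, down, left, right).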
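
The input and output described above can be sketched with NumPy alone. This is a minimal illustration, not the Tensorlayer model itself: the weight matrix `W` is a hypothetical stand-in for a trained layer, and the action indices in the comment follow Gym's FrozenLake convention as I understand it (0=left, 1=down, 2=right, 3=up). It also shows that, with a one-hot input and no bias term, a single linear layer simply selects one row of its weight matrix, so it behaves like a Q-table.

```python
import numpy as np

N_STATES = 16   # 4x4 grid, states 0..15
N_ACTIONS = 4   # assumed Gym FrozenLake indices: 0=left, 1=down, 2=right, 3=up


def to_one_hot(state, n_classes=N_STATES):
    """Return a float32 vector with a single 1 at index `state`."""
    vec = np.zeros(n_classes, dtype=np.float32)
    vec[state] = 1.0
    return vec


def q_values(weights, state):
    """Q-values of all actions for `state` under a bias-free linear layer."""
    return to_one_hot(state) @ weights


rng = np.random.default_rng(0)
# Hypothetical weights standing in for a trained layer: shape (16, 4).
W = rng.uniform(0.0, 0.01, size=(N_STATES, N_ACTIONS)).astype(np.float32)

state = 5
q = q_values(W, state)
greedy_action = int(np.argmax(q))

# With one-hot input and no bias, the output is exactly row `state` of W.
assert np.allclose(q, W[state])
```

Because of this equivalence, the single-layer network here can represent exactly what a 16x4 Q-table can; the difference lies in how it is trained (gradient steps on a squared error rather than direct table updates).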

3.2.1 Code

"""
Deep Q-Network Q(a, s)
-----------------------
TD Learning, Off-Policy, e-Greedy Exploration (GLIE).
Q(S, A) <- Q(S, A) + alpha * (R + lambda * Q(newS, newA) - Q(S, A))
delta_w = R + lambda * Q(newS, newA)
See David Silver RL Tutorial Lecture 5 - Q-Learning for more details.
Reference
----------
original paper: https://storage.googleapis.com/deepmind-media/dqn/DQNNaturePaper.pdf
EN: https://medium.com/emergent-future/simple-reinforcement-learning-with-tensorflow-part-0-q-learning-with-tables-and-neural-networks-d195264329d0#.5m3361vlw
CN: https://zhuanlan.zhihu.com/p/25710327
Note: Policy Network has been proved to be better than Q-Learning, see tutorial_atari_pong.py
Environment
-----------
# The FrozenLake v0 environment
https://gym.openai.com/envs/FrozenLake-v0
The agent controls the movement of a character in a grid world. Some tiles of
the grid are walkable, and others lead to the agent falling into the water.
Additionally, the movement direction of the agent is uncertain and only partially
depends on the chosen direction. The agent is rewarded for finding a walkable
path to a goal tile.
SFFF       (S: starting point, safe)
FHFH       (F: frozen surface, safe)
FFFH       (H: hole, fall to your doom)
HFFG       (G: goal, where the frisbee is located)
The episode ends when you reach the goal or fall in a hole. You receive a reward
of 1 if you reach the goal, and zero otherwise.
Prerequisites
--------------
tensorflow>=2.0.0a0
tensorlayer>=2.0.0
To run
-------
python tutorial_DQN.py --train/test
"""
import argparse
import os
import time

import gym
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
import tensorlayer as tl

# add arguments in command  --train/test
# Note: both flags default to True, so a plain run trains first and then tests.
parser = argparse.ArgumentParser(description='Train or test neural net motor controller.')
parser.add_argument('--train', dest='train', action='store_true', default=True)
parser.add_argument('--test', dest='test', action='store_true', default=True)
args = parser.parse_args()

tl.logging.set_verbosity(tl.logging.DEBUG)

#####################  hyper parameters  ####################
env_id = 'FrozenLake-v0'
alg_name = 'DQN'
lambd = .99  # decay factor
e = 0.1  # e-Greedy Exploration, the larger the more random
num_episodes = 10000
render = False  # display the game environment
rList = []  # reward of each test episode

##################### DQN ##########################


def to_one_hot(i, n_classes=None):
    a = np.zeros(n_classes, 'uint8')
    a[i] = 1
    return a


## Define Q-network q(a,s) that output the rewards of 4 actions by given state, i.e. Action-Value Function.
# encoding for state: 4x4 grid can be represented by one-hot vector with 16 integers.
def get_model(inputs_shape):
    ni = tl.layers.Input(inputs_shape, name='observation')
    nn = tl.layers.Dense(4, act=None, W_init=tf.random_uniform_initializer(0, 0.01), b_init=None, name='q_a_s')(ni)
    return tl.models.Model(inputs=ni, outputs=nn, name="Q-Network")


def save_ckpt(model):  # save trained weights
    path = os.path.join('model', '_'.join([alg_name, env_id]))
    if not os.path.exists(path):
        os.makedirs(path)
    tl.files.save_weights_to_hdf5(os.path.join(path, 'dqn_model.hdf5'), model)


def load_ckpt(model):  # load trained weights
    path = os.path.join('model', '_'.join([alg_name, env_id]))
    # The original listing called save_weights_to_hdf5 here, which overwrote the file instead of loading it.
    tl.files.load_hdf5_to_weights_in_order(os.path.join(path, 'dqn_model.hdf5'), model)


if __name__ == '__main__':

    qnetwork = get_model([None, 16])
    qnetwork.train()
    train_weights = qnetwork.trainable_weights

    optimizer = tf.optimizers.SGD(learning_rate=0.1)
    env = gym.make(env_id)

    t0 = time.time()
    if args.train:
        all_episode_reward = []
        for i in range(num_episodes):
            ## Reset environment and get first new observation
            s = env.reset()  # observation is state, integer 0 ~ 15
            rAll = 0
            if render: env.render()
            for j in range(99):  # step index, maximum step is 99
                ## Choose an action by greedily (with e chance of random action) from the Q-network
                allQ = qnetwork(np.asarray([to_one_hot(s, 16)], dtype=np.float32)).numpy()
                a = np.argmax(allQ, 1)

                ## e-Greedy Exploration !!! sample random action
                if np.random.rand(1) < e:
                    a[0] = env.action_space.sample()
                ## Get new state and reward from environment
                s1, r, d, _ = env.step(a[0])
                if render: env.render()
                ## Obtain the Q' values by feeding the new state through our network
                Q1 = qnetwork(np.asarray([to_one_hot(s1, 16)], dtype=np.float32)).numpy()

                ## Obtain maxQ' and set our target value for chosen action.
                maxQ1 = np.max(Q1)  # in Q-Learning, policy is greedy, so we use "max" to select the next action.
                targetQ = allQ
                targetQ[0, a[0]] = r + lambd * maxQ1
                ## Train network using target and predicted Q values
                # it is not real target Q value, it is just an estimation,
                # but check the Q-Learning update formula:
                #    Q'(s,a) <- Q(s,a) + alpha(r + lambd * maxQ(s',a') - Q(s, a))
                # minimizing |r + lambd * maxQ(s',a') - Q(s, a)|^2 equals to force Q'(s,a) ≈ Q(s,a)
                with tf.GradientTape() as tape:
                    _qvalues = qnetwork(np.asarray([to_one_hot(s, 16)], dtype=np.float32))
                    _loss = tl.cost.mean_squared_error(targetQ, _qvalues, is_mean=False)
                grad = tape.gradient(_loss, train_weights)
                optimizer.apply_gradients(zip(grad, train_weights))

                rAll += r
                s = s1
                ## Reduce chance of random action if an episode is done.
                if d:
                    e = 1. / ((i / 50) + 10)  # reduce e, GLIE: Greedy in the limit with infinite Exploration
                    break

            ## Note that, the rewards here with random action
            print('Training  | Episode: {}/{}  | Episode Reward: {:.4f} | Running Time: {:.4f}' \
                  .format(i, num_episodes, rAll, time.time() - t0))

            # running average of the episode reward, used for the plot
            if i == 0:
                all_episode_reward.append(rAll)
            else:
                all_episode_reward.append(all_episode_reward[-1] * 0.9 + rAll * 0.1)

        save_ckpt(qnetwork)  # save model
        plt.plot(all_episode_reward)
        if not os.path.exists('image'):
            os.makedirs('image')
        plt.savefig(os.path.join('image', '_'.join([alg_name, env_id])))

    if args.test:
        load_ckpt(qnetwork)  # load model
        for i in range(num_episodes):
            ## Reset environment and get first new observation
            s = env.reset()  # observation is state, integer 0 ~ 15
            rAll = 0
            if render: env.render()
            for j in range(99):  # step index, maximum step is 99
                ## Choose an action greedily from the Q-network
                allQ = qnetwork(np.asarray([to_one_hot(s, 16)], dtype=np.float32)).numpy()
                a = np.argmax(allQ, 1)  # no epsilon, only greedy for testing

                ## Get new state and reward from environment
                s1, r, d, _ = env.step(a[0])
                rAll += r
                s = s1
                if render: env.render()
                ## Stop when the episode is done.
                if d: break

            print('Testing  | Episode: {}/{}  | Episode Reward: {:.4f} | Running Time: {:.4f}' \
                  .format(i, num_episodes, rAll, time.time() - t0))
            rList.append(rAll)
        # fraction of test episodes that reached the goal (reward 1)
        print("Success rate: " + str(sum(rList) / num_episodes * 100) + "%")
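
To make the update inside the training loop easier to follow, here is a NumPy-only sketch of a single Q-learning step and of the exploration schedule. It is an illustration under stated assumptions, not the Tensorlayer code: the weight table `W` replaces the bias-free dense layer, the function names `epsilon_for_episode` and `td_update` are hypothetical, and the loss is assumed to be the squared error on the chosen action only (which is what the target construction in the listing amounts to), so one SGD step changes `W[s, a]` by `2 * learning_rate * td_error`.

```python
import numpy as np

GAMMA = 0.99          # discount factor (`lambd` in the listing)
LEARNING_RATE = 0.1   # SGD step size used in the listing


def epsilon_for_episode(i):
    """Exploration rate set when episode `i` terminates: 1 / (i / 50 + 10)."""
    return 1.0 / ((i / 50) + 10)


def td_update(W, s, a, r, s1):
    """Apply one Q-learning step to a (16, 4) weight table in place.

    The target is r + GAMMA * max_a' Q(s1, a'). The squared-error loss on
    the chosen action has gradient -2 * td_error with respect to W[s, a].
    Like the listing, the bootstrap term is not masked at terminal states.
    """
    target = r + GAMMA * np.max(W[s1])
    td_error = target - W[s, a]
    W[s, a] += LEARNING_RATE * 2.0 * td_error
    return td_error


W = np.zeros((16, 4))
# Hypothetical transition: from state 14, action 2 reaches the goal (state 15).
err = td_update(W, s=14, a=2, r=1.0, s1=15)
```

Under these assumptions the effective step size on the TD error is 0.2 rather than 0.1, because of the factor 2 from differentiating the square. The schedule starts at 0.1 for episode 0 and decays slowly, reaching 0.05 at episode 500.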

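3.2.2 Experimental results

DQN replaces the Q-table with a neural network, so what is saved at the end of training is the network's structure and parameters. Feeding a state into the network is enough to obtain the corresponding move direction.

Training ran for 10,000 episodes. The cumulative reward per episode, plotted as the running average computed in the code, is shown below:

Figure 6. Cumulative reward per episode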
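
The plotted values are not the raw 0/1 episode rewards but a running average with weights 0.9 and 0.1, as in the training loop. A minimal sketch of that smoothing, with a hypothetical function name and made-up input rewards used only for illustration:

```python
def smooth_rewards(episode_rewards):
    """Running average: first value as is, then 0.9 * previous + 0.1 * new."""
    smoothed = []
    for reward in episode_rewards:
        if not smoothed:
            smoothed.append(reward)
        else:
            smoothed.append(smoothed[-1] * 0.9 + reward * 0.1)
    return smoothed


# Made-up rewards for four episodes (1 = reached the goal, 0 = did not).
curve = smooth_rewards([1.0, 0.0, 0.0, 1.0])
```

Since each reward is either 0 or 1, the smoothed value can be read roughly as a recent success rate during training, exploration included.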

The success rates from three test runs are as follows:
