Deep Q Learning

We use gym's CartPole as the environment and use DQN to solve a problem with a discrete action space.

I. Import the required packages and define the hyperparameters

import tensorflow as tf
import numpy as np
import gym
import time
import random
from collections import deque

#####################  hyper parameters  #####################
# Hyper Parameters for DQN
GAMMA = 0.9 # discount factor for target Q
INITIAL_EPSILON = 0.5 # starting value of epsilon
FINAL_EPSILON = 0.01 # final value of epsilon
REPLAY_SIZE = 10000 # experience replay buffer size
BATCH_SIZE = 32 # size of minibatch

II. The DQN constructor

1. Initialize the experience replay buffer;

2. Set the dimensions of the problem's state space and action space;

3. Set epsilon for the ε-greedy policy;

4. Build the Q network used to estimate Q values, and create the training method;

5. Initialize the TensorFlow session.

def __init__(self, env):
    # init experience replay
    self.replay_buffer = deque()
    # init some parameters
    self.time_step = 0
    self.epsilon = INITIAL_EPSILON
    self.state_dim = env.observation_space.shape[0]
    self.action_dim = env.action_space.n

    self.create_Q_network()
    self.create_training_method()

    # Init session
    self.session = tf.InteractiveSession()
    self.session.run(tf.global_variables_initializer())
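As a minimal usage sketch (for illustration only; the full class, named DQN, appears later in this post), the constructor just needs a gym environment. For CartPole it infers state_dim = 4 and action_dim = 2:

import gym

env = gym.make('CartPole-v1')   # same environment used by train.py below
agent = DQN(env)                # builds the Q network and opens a TF session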

III. Build the neural network

Build a three-layer fully connected network: an input layer, one hidden layer with 20 neurons, and an output layer that produces one Q value per action.

def create_Q_network(self):
    # network weights
    W1 = self.weight_variable([self.state_dim, 20])
    b1 = self.bias_variable([20])
    W2 = self.weight_variable([20, self.action_dim])
    b2 = self.bias_variable([self.action_dim])
    # input layer
    self.state_input = tf.placeholder("float", [None, self.state_dim])
    # hidden layers
    h_layer = tf.nn.relu(tf.matmul(self.state_input, W1) + b1)
    # Q Value layer
    self.Q_value = tf.matmul(h_layer, W2) + b2

def weight_variable(self, shape):
    initial = tf.truncated_normal(shape)
    return tf.Variable(initial)

def bias_variable(self, shape):
    initial = tf.constant(0.01, shape=shape)
    return tf.Variable(initial)
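The code above uses the TensorFlow 1.x graph API. Purely as an illustration of the same topology (state input, 20 ReLU units, one linear Q value per action) and not part of the original code, a rough tf.keras sketch could look like this:

import tensorflow as tf

def build_q_network(state_dim, action_dim):
    # Same shape as create_Q_network: state -> 20 ReLU units -> action_dim Q values.
    return tf.keras.Sequential([
        tf.keras.layers.Dense(20, activation='relu', input_shape=(state_dim,)),
        tf.keras.layers.Dense(action_dim)   # linear output, one Q value per action
    ])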

Define the cost function and the optimization method: minimize the difference between the "actual" Q value y (the target) and the Q value estimated by the current network, i.e. push the current network as close as possible to the true Q values.

def create_training_method(self):
    self.action_input = tf.placeholder("float", [None, self.action_dim])  # one-hot representation
    self.y_input = tf.placeholder("float", [None])
    Q_action = tf.reduce_sum(tf.multiply(self.Q_value, self.action_input), reduction_indices=1)
    self.cost = tf.reduce_mean(tf.square(self.y_input - Q_action))
    self.optimizer = tf.train.AdamOptimizer(0.0001).minimize(self.cost)
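The multiply-and-sum with the one-hot action_input simply picks out Q(s, a) for the action that was actually taken in each sample. A small NumPy illustration with made-up numbers:

import numpy as np

Q_value = np.array([[1.0, 2.0],        # Q values for sample 1 (two actions)
                    [3.0, 4.0]])       # Q values for sample 2
action_input = np.array([[0.0, 1.0],   # sample 1 took action 1
                         [1.0, 0.0]])  # sample 2 took action 0
Q_action = np.sum(Q_value * action_input, axis=1)
print(Q_action)  # [2. 3.]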

Sample a random minibatch of size BATCH_SIZE from the buffer and compute y, the target Q value for each (s, a) pair in the batch (bootstrapped from the current network, since this implementation has no separate target network):

if done:
    y_batch.append(reward_batch[i])
else:
    y_batch.append(reward_batch[i] + GAMMA * np.max(Q_value_batch[i]))
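A worked example of the target, with made-up numbers and GAMMA = 0.9:

# non-terminal transition: reward 1, next-state Q estimates [0.4, 0.7]
y = 1 + 0.9 * max(0.4, 0.7)   # = 1.63
# terminal transition (done == True): reward -10, no bootstrap term
y_terminal = -10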

def train_Q_network(self):
    self.time_step += 1
    # Step 1: obtain random minibatch from replay memory
    minibatch = random.sample(self.replay_buffer, BATCH_SIZE)
    state_batch = [data[0] for data in minibatch]
    action_batch = [data[1] for data in minibatch]
    reward_batch = [data[2] for data in minibatch]
    next_state_batch = [data[3] for data in minibatch]

    # Step 2: calculate y
    y_batch = []
    Q_value_batch = self.Q_value.eval(feed_dict={self.state_input: next_state_batch})
    for i in range(0, BATCH_SIZE):
        done = minibatch[i][4]
        if done:
            y_batch.append(reward_batch[i])
        else:
            y_batch.append(reward_batch[i] + GAMMA * np.max(Q_value_batch[i]))

    self.optimizer.run(feed_dict={
        self.y_input: y_batch,
        self.action_input: action_batch,
        self.state_input: state_batch
    })

IV. The agent's interface for perceiving the environment

After each decision, the chosen action is executed, the environment's feedback is received, and (s, a, r, s_, done) is stored in the experience replay buffer. Training starts once the buffer holds more than BATCH_SIZE experiences.

def perceive(self, state, action, reward, next_state, done):
    one_hot_action = np.zeros(self.action_dim)
    one_hot_action[action] = 1
    self.replay_buffer.append((state, one_hot_action, reward, next_state, done))
    if len(self.replay_buffer) > REPLAY_SIZE:
        self.replay_buffer.popleft()
    if len(self.replay_buffer) > BATCH_SIZE:
        self.train_Q_network()
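For CartPole, one stored replay-buffer entry looks roughly like this (made-up values, shown only to illustrate the shapes):

import numpy as np

state      = np.array([0.02, -0.01, 0.03, 0.04])    # 4-dimensional observation
next_state = np.array([0.02,  0.18, 0.03, -0.25])
one_hot_action = np.array([0.0, 1.0])                # action 1 encoded one-hot
entry = (state, one_hot_action, 1.0, next_state, False)  # (s, a, r, s', done)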

V. Decision making (choosing an action)

Actions can be selected in two ways: greedy and ε-greedy.

def egreedy_action(self, state):
    Q_value = self.Q_value.eval(feed_dict={self.state_input: [state]})[0]
    if random.random() <= self.epsilon:
        self.epsilon -= (INITIAL_EPSILON - FINAL_EPSILON) / 10000
        return random.randint(0, self.action_dim - 1)
    else:
        self.epsilon -= (INITIAL_EPSILON - FINAL_EPSILON) / 10000
        return np.argmax(Q_value)

def action(self, state):
    return np.argmax(self.Q_value.eval(feed_dict={self.state_input: [state]})[0])
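Note that epsilon is decremented by (INITIAL_EPSILON - FINAL_EPSILON) / 10000 on every call to egreedy_action (the code does not clamp it at FINAL_EPSILON), so exploration anneals from 0.5 toward 0.01 over roughly 10000 action selections. A quick sanity check:

INITIAL_EPSILON = 0.5
FINAL_EPSILON = 0.01

decay_per_call = (INITIAL_EPSILON - FINAL_EPSILON) / 10000   # 4.9e-05 per action
calls_to_final = (INITIAL_EPSILON - FINAL_EPSILON) / decay_per_call
print(calls_to_final)  # 10000.0 calls to go from epsilon = 0.5 down to 0.01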

Complete agent code:

DQN.py

import tensorflow as tf
import numpy as np
import gym
import time
import random
from collections import deque

#####################  hyper parameters  #####################
# Hyper Parameters for DQN
GAMMA = 0.9 # discount factor for target Q
INITIAL_EPSILON = 0.5 # starting value of epsilon
FINAL_EPSILON = 0.01 # final value of epsilon
REPLAY_SIZE = 10000 # experience replay buffer size
BATCH_SIZE = 32 # size of minibatch

###############################  DQN  ###############################
class DQN():
    # DQN Agent
    def __init__(self, env):
        # init experience replay
        self.replay_buffer = deque()
        # init some parameters
        self.time_step = 0
        self.epsilon = INITIAL_EPSILON
        self.state_dim = env.observation_space.shape[0]
        self.action_dim = env.action_space.n

        self.create_Q_network()
        self.create_training_method()

        # Init session
        self.session = tf.InteractiveSession()
        self.session.run(tf.global_variables_initializer())

    def create_Q_network(self):
        # network weights
        W1 = self.weight_variable([self.state_dim, 20])
        b1 = self.bias_variable([20])
        W2 = self.weight_variable([20, self.action_dim])
        b2 = self.bias_variable([self.action_dim])
        # input layer
        self.state_input = tf.placeholder("float", [None, self.state_dim])
        # hidden layers
        h_layer = tf.nn.relu(tf.matmul(self.state_input, W1) + b1)
        # Q Value layer
        self.Q_value = tf.matmul(h_layer, W2) + b2

    def create_training_method(self):
        self.action_input = tf.placeholder("float", [None, self.action_dim])  # one-hot representation
        self.y_input = tf.placeholder("float", [None])
        Q_action = tf.reduce_sum(tf.multiply(self.Q_value, self.action_input), reduction_indices=1)
        self.cost = tf.reduce_mean(tf.square(self.y_input - Q_action))
        self.optimizer = tf.train.AdamOptimizer(0.0001).minimize(self.cost)

    def perceive(self, state, action, reward, next_state, done):
        one_hot_action = np.zeros(self.action_dim)
        one_hot_action[action] = 1
        self.replay_buffer.append((state, one_hot_action, reward, next_state, done))
        if len(self.replay_buffer) > REPLAY_SIZE:
            self.replay_buffer.popleft()
        if len(self.replay_buffer) > BATCH_SIZE:
            self.train_Q_network()

    def train_Q_network(self):
        self.time_step += 1
        # Step 1: obtain random minibatch from replay memory
        minibatch = random.sample(self.replay_buffer, BATCH_SIZE)
        state_batch = [data[0] for data in minibatch]
        action_batch = [data[1] for data in minibatch]
        reward_batch = [data[2] for data in minibatch]
        next_state_batch = [data[3] for data in minibatch]

        # Step 2: calculate y
        y_batch = []
        Q_value_batch = self.Q_value.eval(feed_dict={self.state_input: next_state_batch})
        for i in range(0, BATCH_SIZE):
            done = minibatch[i][4]
            if done:
                y_batch.append(reward_batch[i])
            else:
                y_batch.append(reward_batch[i] + GAMMA * np.max(Q_value_batch[i]))

        self.optimizer.run(feed_dict={
            self.y_input: y_batch,
            self.action_input: action_batch,
            self.state_input: state_batch
        })

    def egreedy_action(self, state):
        Q_value = self.Q_value.eval(feed_dict={self.state_input: [state]})[0]
        if random.random() <= self.epsilon:
            self.epsilon -= (INITIAL_EPSILON - FINAL_EPSILON) / 10000
            return random.randint(0, self.action_dim - 1)
        else:
            self.epsilon -= (INITIAL_EPSILON - FINAL_EPSILON) / 10000
            return np.argmax(Q_value)

    def action(self, state):
        return np.argmax(self.Q_value.eval(feed_dict={self.state_input: [state]})[0])

    def weight_variable(self, shape):
        initial = tf.truncated_normal(shape)
        return tf.Variable(initial)

    def bias_variable(self, shape):
        initial = tf.constant(0.01, shape=shape)
        return tf.Variable(initial)

Training the agent:

train.py

from DQN import DQN
import gym
import numpy as np
import time

ENV_NAME = 'CartPole-v1'
EPISODE = 3000 # Episode limitation
STEP = 300 # Step limitation in an episode
TEST = 10 # Number of test episodes run every 100 training episodes

def main():
    # initialize OpenAI Gym env and dqn agent
    env = gym.make(ENV_NAME)
    agent = DQN(env)

    for episode in range(EPISODE):
        # initialize task
        state = env.reset()
        # Train
        ep_reward = 0
        for step in range(STEP):
            action = agent.egreedy_action(state)  # e-greedy action for train
            next_state, reward, done, _ = env.step(action)
            # Define reward for agent
            reward = -10 if done else 1
            ep_reward += reward
            agent.perceive(state, action, reward, next_state, done)
            state = next_state
            if done:
                # print('episode complete, reward: ', ep_reward)
                break

        # Test every 100 episodes
        if episode % 100 == 0:
            total_reward = 0
            for i in range(TEST):
                state = env.reset()
                for j in range(STEP):
                    # env.render()
                    action = agent.action(state)  # direct action for test
                    state, reward, done, _ = env.step(action)
                    total_reward += reward
                    if done:
                        break
            ave_reward = total_reward / TEST
            print('episode: ', episode, 'Evaluation Average Reward:', ave_reward)

if __name__ == '__main__':
    main()
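The scripts above assume TensorFlow 1.x and the classic gym API, where env.reset() returns only the observation and env.step() returns four values. As a hedged compatibility note, on gym >= 0.26 the calls look like this instead, so a small adaptation would be needed to run train.py unchanged:

import gym

env = gym.make('CartPole-v1')
state, info = env.reset()                  # newer gym also returns an info dict
obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
done = terminated or truncated             # reconstruct the old-style done flag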

References:

https://www.cnblogs.com/pinard/p/9714655.html

https://github.com/ljpzzz/machinelearning/blob/master/reinforcement-learning/dqn.py

Reposted from: https://www.cnblogs.com/jasonlixuetao/p/10964557.html
