For the source code and environment setup, please like, follow, and bookmark, then leave your QQ in the comment section~~~

I. Core Idea

To address the overestimation problem observed in DQN, the Deep Double Q-Network (DDQN) algorithm was proposed; it applies double Q-learning from reinforcement learning to DQN. In reinforcement learning, double Q-learning was introduced to alleviate, to some extent, the overestimation problem that arises in Q-learning.

The main idea of DDQN is to separate action selection from action evaluation when computing the target value. During updates, two networks learn two sets of weights: the prediction network's weights W and the target network's weights W'. In DQN, both the selection and the evaluation of the action are performed by the target network. In DDQN, when computing the target Q value, the prediction network is used to select the optimal action, and the target network then estimates the Q value of that action. This decouples optimal-action selection from action-value estimation, and different samples are used for the two steps to ensure their independence.
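
To make the decoupling concrete, here is a minimal PyTorch sketch comparing the two target computations (not the exact code from the listing below, though that listing computes the DDQN target the same way); `q_online_next`, `q_target_next`, `rewards`, and `gamma` are illustrative names, and terminal-state handling is omitted:

import torch

# q_online_next: next-state Q-values from the prediction (online) network, shape [batch, n_actions]
# q_target_next: next-state Q-values from the target network, shape [batch, n_actions]

def dqn_target(rewards, q_target_next, gamma=0.99):
    # DQN: the target network both selects and evaluates the greedy action
    return rewards + gamma * q_target_next.max(1)[0]

def ddqn_target(rewards, q_online_next, q_target_next, gamma=0.99):
    # DDQN: the prediction (online) network selects the greedy action ...
    greedy_actions = q_online_next.argmax(dim=1, keepdim=True)
    # ... and the target network evaluates that action's value
    return rewards + gamma * q_target_next.gather(1, greedy_actions).squeeze(1)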

II. Experimental Results and Analysis

The experiments in this section compare the performance of DQN and DDQN on the Asterix game while controlling the other parameters, verifying that DDQN can alleviate DQN's overestimation problem to some extent. DDQN maintains two networks with separate parameters; every 1000 steps, the prediction network's parameters are synchronized to the target network. The experiment uses a replay buffer with a maximum capacity of 1,000,000 transitions, and on each Atari game the DDQN algorithm is trained for 1,000,000 time steps.
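
The setup described above can be summarized as a small configuration sketch; the key names and the Gym environment id below are illustrative assumptions (the partial listing in Section III actually runs on Pong with a smaller buffer), and only the numeric values come from the text:

asterix_ddqn_config = {
    "env_id": "AsterixNoFrameskip-v4",   # assumed Gym id for the Asterix experiment
    "total_timesteps": 1_000_000,        # each Atari game is trained for 1,000,000 time steps
    "buffer_capacity": 1_000_000,        # replay buffer holds at most 1,000,000 transitions
    "target_sync_interval": 1_000,       # copy prediction-network weights to the target network every 1000 steps
}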

The results are shown in the figure below. The final converged return of DDQN is clearly higher than that of DQN. During the experiments we can also observe that DQN easily gets stuck in local optima. The problem lies mainly in the max operation of Q-Learning: when selecting actions, the agent always takes the action with the largest Q value. A true policy, however, does not pick the highest-Q action every time in a given state, because real policies are generally stochastic. Therefore, directly using the maximum Q value over actions as the target tends to make the target value higher than the true value.
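
The upward bias of the max operator is easy to reproduce numerically: even when every true action value is zero, the maximum over noisy estimates is positive on average. A minimal sketch with made-up numbers (not data from the experiment):

import numpy as np

rng = np.random.default_rng(0)
n_actions, n_trials = 5, 10000

# true Q-values are all zero; each estimate only carries zero-mean noise
noisy_q_estimates = rng.normal(loc=0.0, scale=1.0, size=(n_trials, n_actions))

print(noisy_q_estimates.max(axis=1).mean())  # clearly > 0: taking the max overestimates the true value (0)
print(noisy_q_estimates.mean())              # approximately 0: the estimates themselves are unbiased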

To address the overestimation of the value function, DDQN uses different value functions for action selection and action evaluation. The results show that DDQN estimates more accurate Q values and obtains more stable and effective policies on some Atari 2600 games.

III. Code

Part of the source code is shown below.

import gym, random, pickle, os.path, math, glob
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import torch
import torch.optim as optim
import torch.nn as nn
import torch.nn.functional as F
import torch.autograd as autograd
import pdb
from atari_wrappers import make_atari, wrap_deepmind, LazyFrames
from IPython.display import clear_output                # used by plot_training (import missing from the extract)
from torch.utils.tensorboard import SummaryWriter       # used by the training script (import missing from the extract; tensorboardX would also work)

# device used when USE_CUDA is enabled (not defined in the extracted snippet)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


class DQN(nn.Module):
    def __init__(self, in_channels=4, num_actions=5):
        """
        in_channels: number of channels of the input, i.e. the number of most recent frames stacked together as described in the paper
        num_actions: number of action-values to output, one-to-one correspondence to actions in the game.
        """
        super(DQN, self).__init__()
        self.conv1 = nn.Conv2d(in_channels, 32, kernel_size=8, stride=4)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=4, stride=2)
        self.conv3 = nn.Conv2d(64, 64, kernel_size=3, stride=1)
        self.fc4 = nn.Linear(7 * 7 * 64, 512)
        self.fc5 = nn.Linear(512, num_actions)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = F.relu(self.conv2(x))
        x = F.relu(self.conv3(x))
        x = F.relu(self.fc4(x.view(x.size(0), -1)))
        return self.fc5(x)


class Memory_Buffer(object):
    def __init__(self, memory_size=1000):
        self.buffer = []
        self.memory_size = memory_size
        self.next_idx = 0

    def push(self, state, action, reward, next_state, done):
        data = (state, action, reward, next_state, done)
        if len(self.buffer) <= self.memory_size:  # buffer not full
            self.buffer.append(data)
        else:  # buffer is full
            self.buffer[self.next_idx] = data
        self.next_idx = (self.next_idx + 1) % self.memory_size

    def sample(self, batch_size):
        states, actions, rewards, next_states, dones = [], [], [], [], []
        for i in range(batch_size):
            idx = random.randint(0, self.size() - 1)
            data = self.buffer[idx]
            state, action, reward, next_state, done = data
            states.append(state)
            actions.append(action)
            rewards.append(reward)
            next_states.append(next_state)
            dones.append(done)
        return np.concatenate(states), actions, rewards, np.concatenate(next_states), dones

    def size(self):
        return len(self.buffer)


class DDQNAgent:
    def __init__(self, in_channels=1, action_space=[], USE_CUDA=False, memory_size=10000, epsilon=1, lr=1e-4):
        self.epsilon = epsilon
        self.action_space = action_space
        self.memory_buffer = Memory_Buffer(memory_size)
        self.DQN = DQN(in_channels=in_channels, num_actions=action_space.n)
        self.DQN_target = DQN(in_channels=in_channels, num_actions=action_space.n)
        self.DQN_target.load_state_dict(self.DQN.state_dict())
        self.USE_CUDA = USE_CUDA
        if USE_CUDA:
            self.DQN = self.DQN.to(device)
            self.DQN_target = self.DQN_target.to(device)
        self.optimizer = optim.RMSprop(self.DQN.parameters(), lr=lr, eps=0.001, alpha=0.95)

    def observe(self, lazyframe):
        # from LazyFrames to tensor: scale to [0, 1], channel-first, add a batch dimension
        state = torch.from_numpy(lazyframe._force().transpose(2, 0, 1)[None] / 255).float()
        if self.USE_CUDA:
            state = state.to(device)
        return state

    def value(self, state):
        q_values = self.DQN(state)
        return q_values

    def act(self, state, epsilon=None):
        """
        sample actions with epsilon-greedy policy
        recap: with p = epsilon pick random action, else pick action with highest Q(s,a)
        """
        if epsilon is None:
            epsilon = self.epsilon
        q_values = self.value(state).cpu().detach().numpy()
        if random.random() < epsilon:
            action = random.randrange(self.action_space.n)
        else:
            action = q_values.argmax(1)[0]
        return action

    def compute_td_loss(self, states, actions, rewards, next_states, is_done, gamma=0.99):
        """ Compute td loss using torch operations only. Use the formula above. """
        actions = torch.tensor(actions).long()              # shape: [batch_size]
        rewards = torch.tensor(rewards, dtype=torch.float)  # shape: [batch_size]
        is_done = torch.tensor(is_done, dtype=torch.bool)   # shape: [batch_size]; bool so torch.where accepts it as a condition
        if self.USE_CUDA:
            actions = actions.to(device)
            rewards = rewards.to(device)
            is_done = is_done.to(device)

        # get q-values for all actions in current states
        predicted_qvalues = self.DQN(states)

        # select q-values for chosen actions
        predicted_qvalues_for_actions = predicted_qvalues[range(states.shape[0]), actions]

        # compute q-values for all actions in next states
        ## Where DDQN is different from DQN
        predicted_next_qvalues_current = self.DQN(next_states)
        predicted_next_qvalues_target = self.DQN_target(next_states)

        # compute V*(next_states): the prediction network picks the greedy action,
        # the target network evaluates it
        next_state_values = predicted_next_qvalues_target.gather(
            1, torch.max(predicted_next_qvalues_current, 1)[1].unsqueeze(1)).squeeze(1)

        # compute "target q-values" for the loss
        target_qvalues_for_actions = rewards + gamma * next_state_values

        # at the last state we shall use simplified formula: Q(s,a) = r(s,a) since s' doesn't exist
        target_qvalues_for_actions = torch.where(is_done, rewards, target_qvalues_for_actions)

        # mean squared error loss to minimize
        # loss = torch.mean((predicted_qvalues_for_actions -
        #                    target_qvalues_for_actions.detach()) ** 2)
        loss = F.smooth_l1_loss(predicted_qvalues_for_actions, target_qvalues_for_actions.detach())
        return loss

    def sample_from_buffer(self, batch_size):
        states, actions, rewards, next_states, dones = [], [], [], [], []
        for i in range(batch_size):
            idx = random.randint(0, self.memory_buffer.size() - 1)
            data = self.memory_buffer.buffer[idx]
            frame, action, reward, next_frame, done = data
            states.append(self.observe(frame))
            actions.append(action)
            rewards.append(reward)
            next_states.append(self.observe(next_frame))
            dones.append(done)
        return torch.cat(states), actions, rewards, torch.cat(next_states), dones

    def learn_from_experience(self, batch_size):
        if self.memory_buffer.size() > batch_size:
            states, actions, rewards, next_states, dones = self.sample_from_buffer(batch_size)
            td_loss = self.compute_td_loss(states, actions, rewards, next_states, dones)
            self.optimizer.zero_grad()
            td_loss.backward()
            for param in self.DQN.parameters():
                param.grad.data.clamp_(-1, 1)
            self.optimizer.step()
            return td_loss.item()
        else:
            return 0


def moving_average(a, n=3):
    ret = np.cumsum(a, dtype=float)
    ret[n:] = ret[n:] - ret[:-n]
    return ret[n - 1:] / n


def plot_training(frame_idx, rewards, losses):
    clear_output(True)
    plt.figure(figsize=(20, 5))
    plt.subplot(131)
    plt.title('frame %s. reward: %s' % (frame_idx, np.mean(rewards[-100:])))
    plt.plot(moving_average(rewards, 20))
    plt.subplot(132)
    plt.title('loss, average on 100 steps')
    plt.plot(moving_average(losses, 100), linewidth=0.2)
    plt.show()


# if __name__ == '__main__':
# Training DDQN in PongNoFrameskip-v4
env = make_atari('PongNoFrameskip-v4')
env = wrap_deepmind(env, scale=False, frame_stack=True)

gamma = 0.99
epsilon_max = 1
epsilon_min = 0.01
eps_decay = 30000
frames = 1000000
USE_CUDA = True
learning_rate = 2e-4
max_buff = 100000
update_tar_interval = 1000
batch_size = 32
print_interval = 1000
log_interval = 1000
learning_start = 10000
win_reward = 18     # Pong-v4
win_break = True

action_space = env.action_space
action_dim = env.action_space.n
state_dim = env.observation_space.shape[0]
state_channel = env.observation_space.shape[2]
agent = DDQNAgent(in_channels=state_channel, action_space=action_space, USE_CUDA=USE_CUDA, lr=learning_rate)

frame = env.reset()
episode_reward = 0
all_rewards = []
losses = []
episode_num = 0
is_win = False

# tensorboard
summary_writer = SummaryWriter(log_dir="DDQN", comment="good_makeatari")

# e-greedy decay
epsilon_by_frame = lambda frame_idx: epsilon_min + (epsilon_max - epsilon_min) * math.exp(-1. * frame_idx / eps_decay)
# plt.plot([epsilon_by_frame(i) for i in range(10000)])

for i in range(frames):
    epsilon = epsilon_by_frame(i)

    # interact with the environment and store the transition
    # (these lines were commented out in the extract but are required for the loop to run)
    state_tensor = agent.observe(frame)
    action = agent.act(state_tensor, epsilon)
    next_frame, reward, done, _ = env.step(action)
    episode_reward += reward
    agent.memory_buffer.push(frame, action, reward, next_frame, done)
    frame = next_frame

    loss = 0
    if agent.memory_buffer.size() >= learning_start:
        loss = agent.learn_from_experience(batch_size)
    losses.append(loss)

    if i % print_interval == 0:
        print("frames: %5d, reward: %5f, loss: %4f, epsilon: %5f, episode: %4d" %
              (i, np.mean(all_rewards[-10:]), loss, epsilon, episode_num))
        summary_writer.add_scalar("Temporal Difference Loss", loss, i)
        summary_writer.add_scalar("Mean Reward", np.mean(all_rewards[-10:]), i)
        summary_writer.add_scalar("Epsilon", epsilon, i)

    # synchronize the prediction network's parameters to the target network every 1000 steps
    if i % update_tar_interval == 0:
        agent.DQN_target.load_state_dict(agent.DQN.state_dict())

    if done:
        frame = env.reset()
        all_rewards.append(episode_reward)
        episode_reward = 0
        episode_num += 1

    # the extract is truncated at this point: the remaining fragment (the win check based on
    # win_reward / win_break and saving a "...QN_dict.pth.tar" checkpoint) is not recoverable

plot_training(i, all_rewards, losses)

Writing this was not easy; if you found it helpful, please like, follow, and bookmark~~~
