[Reinforcement Learning in Practice 04] A step-by-step tutorial on DQN and Double DQN (2): MountainCar-v0

  • Hands-on: solving the MountainCar problem with Double DQN
    • The MountainCar problem in detail
    • Explanation of the MountainCar source code
      • mountain_car.py
      • MountainCar's state (observation)
      • MountainCar's actions
      • MountainCar's goal
    • Solving MountainCar with DQN: full code walkthrough
      • Defining the neural network $Q(\mathbf{w})$
      • Visualizing the neural network: `tensorboard`
      • TD-learning + experience replay updates
      • Implementing Double DQN
      • Designing the reward, method 1: only the car's position
      • Designing the reward, method 2: both the car's position and its velocity
      • Batch operations
      • Using the Adam optimizer to update the network parameters via gradient descent
    • Complete code for solving MountainCar-v0 with Double DQN
      • Training code
      • Testing code
  • Summary
  • References

Author: 刘兴禄, Ph.D. student, Tsinghua-Berkeley Shenzhen Institute, Tsinghua University

Hands-on: solving the MountainCar problem with Double DQN

The MountainCar problem in detail

MountainCar is one of the environments in the open-source environment package OpenAI gym.

URL: https://gym.openai.com/envs/MountainCar-v0/

The goal of this problem is to control the car's actions so that it climbs the hill on the right, as shown in the figure above.

Explanation of the MountainCar source code

OpenAI gym provides the source code of the MountainCar environment, shown below:

mountain_car.py

"""
http://incompleteideas.net/MountainCar/MountainCar1.cp
permalink: https://perma.cc/6Z2N-PFWC
"""
import math
from typing import Optional

import numpy as np
import pygame
from pygame import gfxdraw

import gym
from gym import spaces
from gym.utils import seeding


class MountainCarEnv(gym.Env):
    """
    The agent (a car) is started at the bottom of a valley. For any given state
    the agent may choose to accelerate to the left, right or cease any acceleration.

    The code is originally based on
    [this code](http://incompleteideas.net/MountainCar/MountainCar1.cp)
    and the environment appeared first in Andrew Moore's PhD Thesis (1990):
    ```
    @TECHREPORT{Moore90efficientmemory-based,
        author = {Andrew William Moore},
        title = {Efficient Memory-based Learning for Robot Control},
        institution = {},
        year = {1990}
    }
    ```

    Observation space is a 2-dim vector, where the 1st element represents the "car position"
    and the 2nd element represents the "car velocity".

    There are 3 discrete deterministic actions:
    - 0: Accelerate to the Left
    - 1: Don't accelerate
    - 2: Accelerate to the Right

    Reward: Reward of 0 is awarded if the agent reached the flag (position = 0.5)
    on top of the mountain. Reward of -1 is awarded if the position of the agent is less than 0.5.

    Starting State: The position of the car is assigned a uniform random value in [-0.6, -0.4].
    The starting velocity of the car is always assigned to 0.

    Episode Termination: The car position is more than 0.5. Episode length is greater than 200.

    ### Arguments
    ```
    gym.make('MountainCar-v0')
    ```

    ### Version History
    * v0: Initial versions release (1.0.0)
    """

    metadata = {"render.modes": ["human", "rgb_array"], "video.frames_per_second": 30}

    def __init__(self, goal_velocity=0):
        self.min_position = -1.2
        self.max_position = 0.6
        self.max_speed = 0.07
        self.goal_position = 0.5
        self.goal_velocity = goal_velocity

        self.force = 0.001
        self.gravity = 0.0025

        self.low = np.array([self.min_position, -self.max_speed], dtype=np.float32)
        self.high = np.array([self.max_position, self.max_speed], dtype=np.float32)

        self.screen = None
        self.isopen = True

        self.action_space = spaces.Discrete(3)
        self.observation_space = spaces.Box(self.low, self.high, dtype=np.float32)

    def step(self, action):
        assert self.action_space.contains(action), f"{action!r} ({type(action)}) invalid"

        position, velocity = self.state
        velocity += (action - 1) * self.force + math.cos(3 * position) * (-self.gravity)
        velocity = np.clip(velocity, -self.max_speed, self.max_speed)
        position += velocity
        position = np.clip(position, self.min_position, self.max_position)
        if position == self.min_position and velocity < 0:
            velocity = 0

        done = bool(position >= self.goal_position and velocity >= self.goal_velocity)
        reward = -1.0

        self.state = (position, velocity)
        return np.array(self.state, dtype=np.float32), reward, done, {}

    def reset(
        self,
        *,
        seed: Optional[int] = None,
        return_info: bool = False,
        options: Optional[dict] = None,
    ):
        super().reset(seed=seed)
        self.state = np.array([self.np_random.uniform(low=-0.6, high=-0.4), 0])
        if not return_info:
            return np.array(self.state, dtype=np.float32)
        else:
            return np.array(self.state, dtype=np.float32), {}

    def _height(self, xs):
        return np.sin(3 * xs) * 0.45 + 0.55

    def render(self, mode="human"):
        screen_width = 600
        screen_height = 400

        world_width = self.max_position - self.min_position
        scale = screen_width / world_width
        carwidth = 40
        carheight = 20
        if self.screen is None:
            pygame.init()
            self.screen = pygame.display.set_mode((screen_width, screen_height))

        self.surf = pygame.Surface((screen_width, screen_height))
        self.surf.fill((255, 255, 255))

        pos = self.state[0]

        xs = np.linspace(self.min_position, self.max_position, 100)
        ys = self._height(xs)
        xys = list(zip((xs - self.min_position) * scale, ys * scale))

        pygame.draw.aalines(self.surf, points=xys, closed=False, color=(0, 0, 0))

        clearance = 10

        l, r, t, b = -carwidth / 2, carwidth / 2, carheight, 0
        coords = []
        for c in [(l, b), (l, t), (r, t), (r, b)]:
            c = pygame.math.Vector2(c).rotate_rad(math.cos(3 * pos))
            coords.append(
                (
                    c[0] + (pos - self.min_position) * scale,
                    c[1] + clearance + self._height(pos) * scale,
                )
            )

        gfxdraw.aapolygon(self.surf, coords, (0, 0, 0))
        gfxdraw.filled_polygon(self.surf, coords, (0, 0, 0))

        for c in [(carwidth / 4, 0), (-carwidth / 4, 0)]:
            c = pygame.math.Vector2(c).rotate_rad(math.cos(3 * pos))
            wheel = (
                int(c[0] + (pos - self.min_position) * scale),
                int(c[1] + clearance + self._height(pos) * scale),
            )

            gfxdraw.aacircle(
                self.surf, wheel[0], wheel[1], int(carheight / 2.5), (128, 128, 128)
            )
            gfxdraw.filled_circle(
                self.surf, wheel[0], wheel[1], int(carheight / 2.5), (128, 128, 128)
            )

        flagx = int((self.goal_position - self.min_position) * scale)
        flagy1 = int(self._height(self.goal_position) * scale)
        flagy2 = flagy1 + 50
        gfxdraw.vline(self.surf, flagx, flagy1, flagy2, (0, 0, 0))

        gfxdraw.aapolygon(
            self.surf,
            [(flagx, flagy2), (flagx, flagy2 - 10), (flagx + 25, flagy2 - 5)],
            (204, 204, 0),
        )
        gfxdraw.filled_polygon(
            self.surf,
            [(flagx, flagy2), (flagx, flagy2 - 10), (flagx + 25, flagy2 - 5)],
            (204, 204, 0),
        )

        self.surf = pygame.transform.flip(self.surf, False, True)
        self.screen.blit(self.surf, (0, 0))
        if mode == "human":
            pygame.display.flip()

        if mode == "rgb_array":
            return np.transpose(
                np.array(pygame.surfarray.pixels3d(self.screen)), axes=(1, 0, 2)
            )
        else:
            return self.isopen

    def get_keys_to_action(self):
        # Control with left and right arrow keys.
        return {(): 1, (276,): 0, (275,): 2, (275, 276): 1}

    def close(self):
        if self.screen is not None:
            pygame.quit()
            self.isopen = False

MountainCar's state (observation)

At any time step, we give the environment an action and the environment returns MountainCar's state (observation). From the source code we can see that at any time MountainCar's state consists of 2 quantities: the car position and the car velocity.

Observation space is a 2-dim vector,
where the 1st element represents the "car position"
and the 2nd element represents the "car velocity".

def __init__(self, goal_velocity=0):
    self.min_position = -1.2
    self.max_position = 0.6
    self.max_speed = 0.07
    self.goal_position = 0.5
    self.goal_velocity = goal_velocity
    self.force = 0.001
    self.gravity = 0.0025

That is:

  1. Car position $x$: range $[-1.2, 0.6]$
  2. Car velocity $v$: range $[-0.07, 0.07]$, i.e. $[-\text{max\_speed}, \text{max\_speed}]$ in the source code (the short check right after this list prints these ranges directly)
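To confirm these ranges, you can query the environment's spaces directly. A minimal check, assuming gym and the classic MountainCar-v0 environment are installed:

import gym

env = gym.make('MountainCar-v0')
print(env.observation_space.low)    # [-1.2  -0.07]  -> minimum position and velocity
print(env.observation_space.high)   # [ 0.6   0.07]  -> maximum position and velocity
print(env.action_space.n)           # 3 discrete actions: 0, 1, 2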

The figure below marks the coordinates of several important points.

MountainCar's actions

MountainCar has only 3 available actions: accelerate to the left, don't accelerate, and accelerate to the right.

    There are 3 discrete deterministic actions:
    - 0: Accelerate to the Left
    - 1: Don't accelerate
    - 2: Accelerate to the Right

MountainCar's goal

At any state $s_t = (x_t, v_t)$, we must decide for the car whether to take

  • $a_t = $ Accelerate to the Left,
  • $a_t = $ Don't accelerate, or
  • $a_t = $ Accelerate to the Right,

so that it climbs to the specified position $0.5$ on the right slope, which is `self.goal_position = 0.5` in the source code:

    def __init__(self, goal_velocity=0):
        self.min_position = -1.2
        self.max_position = 0.6
        self.max_speed = 0.07
        self.goal_position = 0.5
        self.goal_velocity = goal_velocity
        self.force = 0.001
        self.gravity = 0.0025
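To see the termination condition in action, here is a minimal random-policy rollout. This is only a sketch, assuming the same older gym API used in this article (reset() returns just the state, step() returns four values); without env.unwrapped, gym also ends the episode after 200 steps:

import gym

env = gym.make('MountainCar-v0')
state = env.reset()
done = False
while not done:
    action = env.action_space.sample()            # random choice among {0, 1, 2}
    state, reward, done, info = env.step(action)  # the default reward is -1 per step
print('final position:', state[0])                # >= 0.5 only if the car actually reached the goal
env.close()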

Solving MountainCar with DQN: full code walkthrough

Defining the neural network $Q(\mathbf{w})$

  • Input layer: since the car's state has 2 components, $s_t = (x_t, v_t)$, the input layer has 2 neurons.
  • Output layer: since the car has only 3 available actions (accelerate to the left, don't accelerate, or accelerate to the right), the output layer has 3 neurons.
  • Hidden layers: fully connected layers are sufficient; the number of neurons and layers can be adjusted freely.

The code that defines the network structure is as follows.

class Network(nn.Module):
    def __init__(self):
        super(Network, self).__init__()
        self.fc = nn.Sequential(
            nn.Linear(2, 24),
            nn.ReLU(),
            nn.Linear(24, 24),
            nn.ReLU(),
            nn.Linear(24, 3)
        )
        self.MSELoss = nn.MSELoss()
        self.optimizer = torch.optim.Adam(self.parameters(), lr=0.001)

    def forward(self, inputs):
        return self.fc(inputs)
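As a quick sanity check of the architecture (a sketch, assuming torch has been imported and the Network class above is defined): a batch of B states of dimension 2 should produce B rows of 3 Q-values, one per action.

net = Network()
dummy_states = torch.rand(5, 2)    # 5 made-up (position, velocity) pairs
q_values = net(dummy_states)       # forward pass through the 2-24-24-3 network
print(q_values.shape)              # torch.Size([5, 3]): one Q-value per action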

Visualizing the neural network: tensorboard

We use tensorboard to visualize the network structure:

env = gym.envs.make('MountainCar-v0')
env = env.unwrapped
DQN = Network()         # DQN network: the network we train
Target_net = Network()  # Target network
.......
.......
            # visualize the networks with tensorboard
            if graph_added == False:
                writer.add_graph(model=DQN, input_to_model=batch_state)
                writer.add_graph(model=Target_net, input_to_model=batch_state)
                graph_added = True
.......
writer.close()

Then open PyCharm's terminal, cd into the folder that contains "logs_DQN_MountainCar", and run:

tensorboard --logdir=logs_DQN_MountainCar
  • Note: the graph can only be exported after data has actually been fed through the network; merely constructing the network is not enough.

This lets us view the visualized network. After visualization, the network looks as follows. We can see that

  • the input has shape batch_size * state_dim = 1000 * 2,
  • the output has shape batch_size * action_dim = 1000 * 3.

Looking more closely at the structure of the network, as shown in the figure below:

Since this problem is fairly simple, we do not introduce convolutional layers (Conv2d), pooling (maxpool) operations, and so on.

TD-learning + experience replay updates

  • Every transition stored in the replay buffer has the form $(s_t, a_t, r_t, s_{t+1})$, so we can store these transitions in an array or a DataFrame. Note that $s_t$ is a 2-tuple, $s_t = (x_t, v_t)$, so one transition occupies 1 row with 6 columns, namely
    $(\underline{x_t, v_t}, a_t, \underline{x_{t+1}, v_{t+1}}, r_t)$. The replay buffer therefore has the form
    $$
    \begin{aligned}
    &t:   &&(\underline{x_t, v_t}, a_t, \underline{x_{t+1}, v_{t+1}}, r_t) \\
    &t+1: &&(\underline{x_{t+1}, v_{t+1}}, a_{t+1}, \underline{x_{t+2}, v_{t+2}}, r_{t+1}) \\
    &t+2: &&(\underline{x_{t+2}, v_{t+2}}, a_{t+2}, \underline{x_{t+3}, v_{t+3}}, r_{t+2}) \\
    &\cdots
    \end{aligned}
    $$
    In the code, the replay buffer is defined as follows (see the short storage sketch after the code):

replay_buffer = np.zeros((replay_buffer_size, 6))  # initialize the buffer; the columns store s, a, s_, r
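For reference, writing a single transition into this 6-column buffer with a circular index looks like the following. This is a small self-contained sketch; the transition values are made up, but the column layout and the modulo indexing match the full training code later:

import numpy as np

replay_buffer_size = 2000
replay_buffer = np.zeros((replay_buffer_size, 6))
stored_transition_cnt = 0

# one made-up transition (s_t, a_t, s_{t+1}, r_t)
state  = np.array([-0.50, 0.00])   # s_t     = (x_t, v_t)
action = 2                         # a_t     = accelerate to the right
state_ = np.array([-0.49, 0.01])   # s_{t+1} = (x_{t+1}, v_{t+1})
reward = 1.0                       # r_t

idx = stored_transition_cnt % replay_buffer_size   # circular index: overwrite the oldest row
replay_buffer[idx, 0:2] = state
replay_buffer[idx, 2]   = action
replay_buffer[idx, 3:5] = state_
replay_buffer[idx, 5]   = reward
stored_transition_cnt += 1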

The part implementing TD learning is:

# Target_net is used to compute the TD-target
q = DQN(batch_state).gather(1, batch_action)        # Q-values of the chosen actions, predicted by the online network
q_next = Target_net(batch_state_).detach().max(1)[0].reshape(batch_size, 1)  # max_a Q(s_{t+1}, a) from Target_net
q_target = batch_reward + gamma * q_next            # TD-target computed with Target_net
loss = DQN.MSELoss(q, q_target)                     # compute the loss

The corresponding formula is
$$
y_{\text{target}} = r_t + \gamma \max_{a} Q(s_{t+1}, a; \mathbf{w}^{-}),
$$
where $\mathbf{w}^{-}$ denotes the parameters of the target network Target_net.

Implementing Double DQN

In addition, to implement Double DQN we define two networks, DQN and Target_net. Only DQN is trained and updated; after every fixed number of learning steps (update_interval in the code), we load the latest parameters of DQN into Target_net. Target_net is used to compute the TD target, while DQN is used to select the action to take next.

In the code, this is implemented as:

DQN = Network()         # DQN network: the network we train
Target_net = Network()  # Target network
......
......
        if stored_transition_cnt > replay_buffer_size:
            # every update_interval learning steps, load DQN's parameters into Target_net
            if transition_cnt % update_interval == 0:
                Target_net.load_state_dict(DQN.state_dict())
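Note that the TD-target code shown earlier takes the max of Target_net's outputs directly, which is the classic DQN-with-target-network update. The Double DQN target described in the summary (the online network DQN selects the greedy action, Target_net evaluates it) would be computed roughly as follows. This is only a sketch that reuses the variable names from the article's code (batch_state_, batch_reward, gamma), not the code the article actually runs:

with torch.no_grad():
    next_actions = DQN(batch_state_).argmax(dim=1, keepdim=True)   # a* = argmax_a Q(s_{t+1}, a; w), chosen by the online network
    q_next = Target_net(batch_state_).gather(1, next_actions)      # Q(s_{t+1}, a*; w^-), evaluated by the target network
    q_target = batch_reward + gamma * q_next                       # Double DQN TD-target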

In addition, at the beginning we choose random actions with high probability and let DQN choose actions with low probability. As time goes on, we let DQN choose actions with high probability and explore randomly (i.e., choose random actions) with low probability.

if random.randint(0, 100) < 100 * (discount_factor ** transition_cnt):
    # random exploration: a lot of exploration at the beginning, decaying over time
    action = random.randint(0, 2)
else:
    # otherwise we let DQN, the network being trained, choose the action
    output = DQN(torch.Tensor(state)).detach()   # output holds the Q-value estimates of the 3 actions
    action = torch.argmax(output).data.item()    # pick the action with the largest Q-value
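With discount_factor = 0.6, the random-exploration probability 100 * 0.6^transition_cnt decays very quickly once learning starts; and since transition_cnt is only incremented after the buffer is full, the agent explores purely at random while the buffer is being filled. A tiny check (a sketch) makes the decay concrete:

discount_factor = 0.6
for t in [0, 5, 10, 20]:
    print(t, round(100 * discount_factor ** t, 4))   # 100.0, 7.776, 0.6047, 0.0037 (percent)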

Designing the reward, method 1: only the car's position

We design the reward function as
$$
r_t=\begin{cases}
10, & \text{if } x_t \geqslant 0.5 \\
2^{3(x_t+0.5)}, & \text{if } -0.5 < x_t < 0.5 \\
0, & \text{if } x_t \leqslant -0.5
\end{cases}
$$

The main idea is: the farther right the car is, the higher the reward; the reward peaks at 10 when the car reaches position $0.5$. If the car stays in the region $x_t \leqslant -0.5$, we do not want to encourage that, so the reward is 0.

This reward design encourages the car to move right and climb the slope.

In the code:

reward = state_[0] + 0.5
if state_[0] > -0.5:
    # reward = state_[0] + 0.5
    reward = math.pow(2, 3 * (state_[0] + 0.5))
    if state_[0] > 0.5:
        reward = 10
else:
    reward = 0

This approach has a drawback: it converges rather slowly. The reason is that if the goal is to get the car up the hill, we should not only care about its position; a higher speed also helps the car climb the hill sooner. The reward should therefore also take the velocity into account, which motivates the second reward design below.

Designing the reward, method 2: both the car's position and its velocity

The logic of this reward design is:

  1. When the car's position $x < -0.5$: although the position is bad, we still want to encourage the car to build up speed so it can rush to the right. The larger the absolute value of the velocity (note: the absolute value), the larger the reward.
  2. When the car's position $x > -0.5$: the farther right the position, the larger the reward, and the reward should grow sharply with the absolute value of the velocity.

Based on this, we design the second reward function as
$$
r_t=\begin{cases}
1000, & \text{if } x_t \geqslant 0.5 \\
2^{5(x_t+1)} + (100\,|v_t|)^2, & \text{if } -0.5 < x_t < 0.5 \\
0 + 100\,|v_t|, & \text{if } x_t \leqslant -0.5
\end{cases}
$$

In the code:

reward = state_[0] + 0.5
if state_[0] <= -0.5:
    reward = 100 * abs(state_[1])
    # print('velocity:', state_[1])
elif state_[0] > -0.5 and state_[0] < 0.5:
    reward = math.pow(2, 5 * (state_[0] + 1)) + (100 * abs(state_[1])) ** 2
elif state_[0] >= 0.5:
    reward = 1000

Batch operations

We set batch_size = 1000. At every learning step we first take a batch of transitions from the replay buffer and convert it to tensors. The batch starts at a randomly chosen index, which helps break the correlation between samples in the transition sequence (especially adjacent samples). The code is as follows:

index = random.randint(0, replay_buffer_size - batch_size - 1)
batch_state   = torch.Tensor(replay_buffer[index:index + batch_size, 0:2])
batch_action  = torch.Tensor(replay_buffer[index:index + batch_size, 2:3]).long()
batch_state_  = torch.Tensor(replay_buffer[index:index + batch_size, 3:5])
batch_reward  = torch.Tensor(replay_buffer[index:index + batch_size, 5:6])
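Strictly speaking, the code above draws a random starting index and then takes a contiguous slice of batch_size transitions. If you prefer each transition to be sampled independently, a common alternative is to sample row indices without replacement. This is only a sketch, not the code the article uses:

indices = np.random.choice(replay_buffer_size, size=batch_size, replace=False)
batch = replay_buffer[indices]
batch_state  = torch.Tensor(batch[:, 0:2])
batch_action = torch.Tensor(batch[:, 2:3]).long()
batch_state_ = torch.Tensor(batch[:, 3:5])
batch_reward = torch.Tensor(batch[:, 5:6])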

Using the Adam optimizer to update the network parameters via gradient descent

# training: gradient descent updates
# Target_net is used to compute the TD-target
q = DQN(batch_state).gather(1, batch_action)        # Q-values of the chosen actions, predicted by the online network
q_next = Target_net(batch_state_).detach().max(1)[0].reshape(batch_size, 1)  # max_a Q(s_{t+1}, a) from Target_net
q_target = batch_reward + gamma * q_next            # TD-target
loss = DQN.MSELoss(q, q_target)                     # compute the loss
DQN.optimizer.zero_grad()                           # clear the gradients from the previous step
loss.backward()                                     # backpropagate
DQN.optimizer.step()                                # update DQN's parameters
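The loss being minimized here is the mean squared TD error over the batch of size $B = \text{batch\_size}$:
$$
L(\mathbf{w}) = \frac{1}{B}\sum_{i=1}^{B}\Big(Q(s_i, a_i; \mathbf{w}) - y_i^{\text{target}}\Big)^2,
$$
where $y_i^{\text{target}} = r_i + \gamma \max_{a} Q(s_i', a; \mathbf{w}^{-})$ is treated as a constant (hence the .detach() on Target_net's output), so the gradient only flows through $Q(s_i, a_i; \mathbf{w})$.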

Complete code for solving MountainCar-v0 with Double DQN

The code is adapted (with modifications) from: https://www.bilibili.com/video/BV1Ab411w7Yd?t=3359

Training code

  • The trained network is saved as 'DQN_MountainCar-v0.pth'
# install gym: pip install gym matplotlib -i https://pypi.tuna.tsinghua.edu.cn/simple
import math
import random
import torch
import torch.nn as nn
import numpy as np
import gym
from torch.utils.tensorboard import SummaryWriter


class Network(nn.Module):
    def __init__(self):
        super(Network, self).__init__()
        self.fc = nn.Sequential(
            nn.Linear(2, 24),
            nn.ReLU(),
            nn.Linear(24, 24),
            nn.ReLU(),
            nn.Linear(24, 3)
        )
        self.MSELoss = nn.MSELoss()
        self.optimizer = torch.optim.Adam(self.parameters(), lr=0.001)

    def forward(self, inputs):
        return self.fc(inputs)


env = gym.envs.make('MountainCar-v0')
env = env.unwrapped
DQN = Network()         # DQN network: the network we train
Target_net = Network()  # Target network

writer = SummaryWriter("logs_DQN_MountainCar")   # for the tensorboard visualization

stored_transition_cnt = 0  # number of transitions stored so far
replay_buffer_size = 2000  # buffer size
discount_factor = 0.6      # decay factor of the exploration probability
transition_cnt = 0         # total number of learning steps performed
update_interval = 20       # interval (in learning steps) for loading DQN's parameters into Target_net
gamma = 0.9                # discount factor
batch_size = 1000          # batch size
replay_buffer = np.zeros((replay_buffer_size, 6))  # initialize the buffer; the columns store s, a, state_, r
start_learning = False     # flag: has learning started?
Max_epoch = 50000          # number of training episodes
epsilon = 0.1
graph_added = False

for i in range(Max_epoch):
    state = env.reset()  # reset the environment
    while True:
        if random.randint(0, 100) < 100 * (discount_factor ** transition_cnt):
            # random exploration: a lot of exploration at the beginning, decaying over time
            action = random.randint(0, 2)
        else:
            # otherwise we let DQN, the network being trained, choose the action
            output = DQN(torch.Tensor(state)).detach()   # output holds the Q-value estimates of the 3 actions
            action = torch.argmax(output).data.item()    # pick the action with the largest Q-value
        state_, reward, done, info = env.step(action)    # execute the action and get the env's feedback

        # define our own reward (here: method 2, based on both position and velocity)
        reward = state_[0] + 0.5
        if state_[0] <= -0.5:
            reward = 100 * abs(state_[1])
            # print('velocity:', state_[1])
        elif state_[0] > -0.5 and state_[0] < 0.5:
            reward = math.pow(2, 5 * (state_[0] + 1)) + (100 * abs(state_[1])) ** 2
        elif state_[0] >= 0.5:
            reward = 1000

        # store the transition in the replay buffer (circular overwrite)
        replay_buffer[stored_transition_cnt % replay_buffer_size][0:2] = state
        replay_buffer[stored_transition_cnt % replay_buffer_size][2:3] = action
        replay_buffer[stored_transition_cnt % replay_buffer_size][3:5] = state_
        replay_buffer[stored_transition_cnt % replay_buffer_size][5:6] = reward
        stored_transition_cnt += 1

        state = state_

        if stored_transition_cnt > replay_buffer_size:
            # every update_interval learning steps, load DQN's parameters into Target_net
            if transition_cnt % update_interval == 0:
                Target_net.load_state_dict(DQN.state_dict())

            # take a batch from the replay buffer (the starting index is random)
            # and convert it to tensors so it can be fed into the networks
            index = random.randint(0, replay_buffer_size - batch_size - 1)
            batch_state   = torch.Tensor(replay_buffer[index:index + batch_size, 0:2])
            batch_action  = torch.Tensor(replay_buffer[index:index + batch_size, 2:3]).long()
            batch_state_  = torch.Tensor(replay_buffer[index:index + batch_size, 3:5])
            batch_reward  = torch.Tensor(replay_buffer[index:index + batch_size, 5:6])

            # visualize the networks with tensorboard
            if graph_added == False:
                writer.add_graph(model=DQN, input_to_model=batch_state)
                writer.add_graph(model=Target_net, input_to_model=batch_state)
                graph_added = True

            # training: gradient descent updates
            # Target_net is used to compute the TD-target
            q = DQN(batch_state).gather(1, batch_action)      # Q-values of the chosen actions, predicted by the online network
            q_next = Target_net(batch_state_).detach().max(1)[0].reshape(batch_size, 1)  # max_a Q(s_{t+1}, a) from Target_net
            q_target = batch_reward + gamma * q_next          # TD-target
            loss = DQN.MSELoss(q, q_target)                   # compute the loss
            DQN.optimizer.zero_grad()                         # clear the gradients from the previous step
            loss.backward()                                   # backpropagate
            DQN.optimizer.step()                              # update DQN's parameters

            transition_cnt += 1
            if not start_learning:
                print('start learning')
                start_learning = True
                break

        if done:
            break
        env.render()

torch.save(DQN.state_dict(), 'DQN_MountainCar-v0.pth')
writer.close()

After about a minute of training, the car can climb to the goal position, as shown in the figure below.

Testing code

  • We load the trained network 'DQN_MountainCar-v0.pth' and use it for testing.
  • The code is as follows.
# install gym: pip install gym matplotlib -i https://pypi.tuna.tsinghua.edu.cn/simple
import random
import torch
import torch.nn as nn
import numpy as np
import gym
from torch.utils.tensorboard import SummaryWriter


class Network(nn.Module):
    def __init__(self):
        super(Network, self).__init__()
        self.fc = nn.Sequential(
            nn.Linear(2, 24),
            nn.ReLU(),
            nn.Linear(24, 24),
            nn.ReLU(),
            nn.Linear(24, 3)
        )
        self.MSELoss = nn.MSELoss()
        self.optimizer = torch.optim.Adam(self.parameters(), lr=0.001)

    def forward(self, inputs):
        return self.fc(inputs)


env = gym.envs.make('MountainCar-v0')
env = env.unwrapped
DQN = Network()   # DQN network, loaded from the trained weights
DQN.load_state_dict(torch.load('DQN_MountainCar-v0.pth'))

state = env.reset()        # reset the environment
episode_reward_sum = 0     # total reward of this episode
while True:                # one episode (each iteration is one step)
    env.render()                                         # show the animation
    output = DQN.forward(torch.Tensor(state)).detach()   # Q-value estimates of the 3 actions
    action = torch.argmax(output).data.item()            # pick the action with the largest Q-value
    state_, reward, done, info = env.step(action)        # execute the action and get the env's feedback
    state = state_                                       # move on to the next state
    if done:
        print('finished')
        break

Running the code, we can see that the car climbs the hill after a short while, as shown in the figure below.

Summary

  1. DQN can handle problems in which the number of state-action pairs explodes, and it also works when there are only a few state-action pairs.
  2. DQN uses a neural network to approximate the optimal action-value function.
  3. DQN suffers from overestimation. The remedies are: (A) to reduce the overestimation caused by the max operator, we can use the Double DQN method; (B) to reduce the overestimation caused by bootstrapping, we can use a target network to compute the TD target instead of the training network.
  4. In Double DQN, we use the DQN $Q(s, a; \mathbf{w})$ to select the next action, i.e., $a^{*}=\underset{a}{\text{argmax}}\,Q(s_{t+1}, a; \mathbf{w})$, and the target network to compute the TD target, i.e., $y_{\text{target}}=r_t + \gamma \cdot Q(s_{t+1}, a^{*}; \mathbf{w}^{-})$.
  5. To remove the correlation between successive transitions and to avoid wasting experience, we can use experience replay.

These notes are the result of consulting many sources and carefully summarizing and deriving the material. I believe they are quite detailed and beginner-friendly, and I hope they help. If there are any mistakes in this post, please point them out.

References

  1. Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., … & Hassabis, D. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540), 529-533.
  2. Van Hasselt, H., Guez, A., & Silver, D. (2016). Deep reinforcement learning with double Q-learning. Proceedings of the AAAI Conference on Artificial Intelligence, 30(1).
  3. Wang Shusen's lecture videos: https://www.bilibili.com/video/BV1rv41167yx?from=search&seid=18272266068137655483&spm_id_from=333.337.0.0

Author: 刘兴禄, Ph.D. student, Tsinghua-Berkeley Shenzhen Institute, Tsinghua University
