Deep Q-Network (DQN)-II

DEEP REINFORCEMENT LEARNING EXPLAINED — 16

This is the second post devoted to Deep Q-Network (DQN), in the Deep Reinforcement Learning Explained series, in which we will analyse some challenges that appear when we apply Deep Learning to Reinforcement Learning. We will also present in detail the code that solves the OpenAI Gym Pong game using the DQN network introduced in the previous post.


Challenges in Deep Reinforcement Learning

Unfortunately, reinforcement learning is more unstable when neural networks are used to represent the action-values, despite applying the wrappers introduced in the previous section. Training such a network requires a lot of data, but even then, it is not guaranteed to converge on the optimal value function. In fact, there are situations where the network weights can oscillate or diverge, due to the high correlation between actions and states.


In order to solve this, in this section we will introduce two techniques used by the Deep Q-Network:


  • Experience Replay
  • Target Network

There are many more tips and tricks that researchers have discovered to make DQN training more stable and efficient, and we will cover the best of them in future posts in this series.


Experience Replay

We are trying to approximate a complex, nonlinear function, Q(s, a), with a Neural Network. To do this, we must calculate targets using the Bellman equation and then consider that we have a supervised learning problem at hand. However, one of the fundamental requirements for SGD optimization is that the training data is independent and identically distributed, whereas when the Agent interacts with the Environment, the sequence of experience tuples can be highly correlated. A naive Q-learning algorithm that learns from each of these experience tuples in sequential order runs the risk of getting swayed by the effects of this correlation.

We can prevent action values from oscillating or diverging catastrophically by keeping a large buffer of our past experience and sampling training data from it, instead of using only our latest experience. This technique is called a replay buffer or experience buffer. The replay buffer contains a collection of experience tuples (S, A, R, S′). The tuples are gradually added to the buffer as we interact with the Environment. The simplest implementation is a buffer of fixed size, with new data added to the end of the buffer so that it pushes the oldest experience out of it.

The act of sampling a small batch of tuples from the replay buffer in order to learn is known as experience replay. In addition to breaking harmful correlations, experience replay allows us to learn more from individual tuples multiple times, recall rare occurrences, and in general make better use of our experience.


As a summary, the basic idea behind experience replay is to store past experiences and then use a random subset of these experiences to update the Q-network, rather than using just the single most recent experience. In order to store the Agent’s experiences, we use the deque data structure from Python’s built-in collections library. It is basically a list on which you can set a maximum size, so that if you append to it when it is already full, it removes the first item and adds the new item to the end. The experiences themselves are tuples of [observation, action, reward, done flag, next state] that keep the transitions obtained from the environment.

Experience = collections.namedtuple(
    'Experience', field_names=['state', 'action', 'reward', 'done', 'new_state'])

class ExperienceReplay:
    def __init__(self, capacity):
        self.buffer = collections.deque(maxlen=capacity)

    def __len__(self):
        return len(self.buffer)

    def append(self, experience):
        self.buffer.append(experience)

    def sample(self, batch_size):
        indices = np.random.choice(len(self.buffer), batch_size, replace=False)
        states, actions, rewards, dones, next_states = \
            zip(*[self.buffer[idx] for idx in indices])
        return np.array(states), np.array(actions), \
               np.array(rewards, dtype=np.float32), \
               np.array(dones, dtype=np.uint8), \
               np.array(next_states)

Each time the Agent does a step in the Environment, it pushes the transition into the buffer, keeping only a fixed number of steps (in our case, 10k transitions). For training, we randomly sample the batch of transitions from the replay buffer, which allows us to break the correlation between subsequent steps in the environment.


Most of the experience replay buffer code is quite straightforward: it basically exploits the capabilities of the deque class. In the sample() method, we create a list of random indices and then repack the sampled entries into NumPy arrays for more convenient loss calculation.
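As a quick sanity check of how the buffer behaves, here is a minimal usage sketch (not part of the training code; it uses dummy transitions instead of real Atari observations):

# assumes the Experience namedtuple and ExperienceReplay class defined above
import numpy as np

demo_buffer = ExperienceReplay(capacity=5)
for i in range(8):                       # append more items than the capacity
    demo_buffer.append(Experience(state=np.zeros(4), action=i, reward=float(i),
                                  done=False, new_state=np.zeros(4)))

print(len(demo_buffer))                  # 5: the three oldest experiences were pushed out
states, actions, rewards, dones, next_states = demo_buffer.sample(batch_size=3)
print(actions)                           # a random subset of the stored actions, e.g. [7 4 6]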

Target Network

Remember that in Q-Learning, we update a guess with a guess, and this can potentially lead to harmful correlations. The Bellman equation provides us with the value of Q(s, a) via Q(s′, a′). However, the states s and s′ are only one step apart. This makes them very similar, and it is very hard for a Neural Network to distinguish between them.

When we perform an update of our Neural Networks’ parameters to make Q(s, a) closer to the desired result, we can indirectly alter the value produced for Q(s’, a’) and other states nearby. This can make our training very unstable.


To make training more stable, there is a trick, called target network, by which we keep a copy of our neural network and use it for the Q(s’, a’) value in the Bellman equation.


That is, the predicted Q-values of this second Q-network, called the target network, are used to backpropagate through and train the main Q-network. It is important to highlight that the target network’s parameters are not trained, but they are periodically synchronized with the parameters of the main Q-network. The idea is that using the target network’s Q-values to train the main Q-network will improve the stability of the training.

Later, when we present the code of the training loop, we will go into more detail about how to code the initialization and use of this target network.
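As a quick preview, here is a minimal sketch of what keeping and synchronizing such a copy looks like in PyTorch (the tiny fully connected network below is only a placeholder, not the convolutional DQN of the previous post):

import torch
import torch.nn as nn

# stand-in for the DQN class of the previous post (placeholder architecture)
def make_q_net(n_inputs=4, n_actions=6):
    return nn.Sequential(nn.Linear(n_inputs, 64), nn.ReLU(), nn.Linear(64, n_actions))

net = make_q_net()          # main network, the only one updated by SGD
target_net = make_q_net()   # target network, used to compute Q(s', a')

# periodically copy the weights of the main network into the target network
target_net.load_state_dict(net.state_dict())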

Deep Q-Learning Algorithm

There are two main phases that are interleaved in the Deep Q-Learning Algorithm. One is where we sample the environment by performing actions and store the observed experience tuples in a replay memory. The other is where we select a small batch of tuples from this memory, randomly, and learn from that batch using a gradient descent (SGD) update step.

These two phases are not directly dependent on each other and we could perform multiple sampling steps then one learning step, or even multiple learning steps with different random batches. In practice, you won’t be able to run the learning step immediately. You will need to wait till you have enough tuples of experiences in D.


The rest of the algorithm is designed to support these steps. We can summarize the previous explanations with the following pseudocode for the basic DQN algorithm, which will guide our implementation:
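The pseudocode figure is not reproduced here; the following is a rough sketch of the same basic DQN loop (θ are the weights of the main network, θ⁻ those of the target network, D the replay memory):

Initialize replay memory D with fixed capacity
Initialize main network Q with random weights θ
Initialize target network Q̂ with weights θ⁻ = θ

for each step of the environment:
    # sample phase
    with probability ε select a random action a, otherwise a = argmax_a Q(s, a; θ)
    execute a, observe reward r, done flag and next state s'
    store transition (s, a, r, done, s') in D

    # learn phase (once D contains enough transitions)
    sample a random minibatch of transitions (s, a, r, done, s') from D
    y = r                                  if done
    y = r + γ · max_a' Q̂(s', a'; θ⁻)       otherwise
    perform an SGD step on the loss (y − Q(s, a; θ))² with respect to θ
    every C steps copy θ into θ⁻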

In the beginning, we need to create the main network and the target network, and initialize an empty replay memory D. Note that the memory is finite, so we may want to use something like a circular queue that retains the d most recent experience tuples. We also need to initialize the Agent, one of the main components, which interacts with the Environment.

Note that we do not clear out the memory after each episode; this enables us to recall and build batches of experiences from across episodes.

Coding the Training Loop

Hyperparameters and execution time

Before going into the code, we should mention that DeepMind’s Nature paper contained a table with all the details of the hyperparameters used to train its model on all 49 Atari games used for evaluation. DeepMind kept all those parameters the same across games, but trained an individual model for every game. The team’s intention was to show that the method is robust enough to solve lots of games with varying complexity, action space, reward structure, and other details using a single model architecture and set of hyperparameters.

However, our goal in this post is to solve just the Pong game, a quite simple and straightforward game in comparison to the other games in the Atari test set, so the hyperparameters in the paper are not the most suitable for a didactic post like this one. For this reason, we decided to use more personalized parameter values for our Pong Environment that converge to a mean score of 19.0 in a reasonable wall time, depending on the GPU type that Colab assigns to our execution (about a couple of hours at most). Remember that we can check which type of GPU has been assigned to our runtime environment with the command !nvidia-smi.

Let’s start introducing the code in more detail. The entire code of this post can be found on GitHub (and can be run as a Colab google notebook using this link). We skip the import details of the packages, since they are quite straightforward, and focus on the explanation of the hyperparameters:

DEFAULT_ENV_NAME = "PongNoFrameskip-v4"
MEAN_REWARD_BOUND = 19.0

gamma = 0.99
batch_size = 32
replay_size = 10000
learning_rate = 1e-4
sync_target_frames = 1000
replay_start_size = 10000

eps_start = 1.0
eps_decay = .999985
eps_min = 0.02

DEFAULT_ENV_NAME identifies the Environment to train on and MEAN_REWARD_BOUND is the reward boundary at which to stop training. We will consider that the game has converged when our agent reaches an average of 19 games won (out of 21) over the last 100 games. The remaining parameters indicate:

  • gamma is the discount factor
  • batch_size is the minibatch size
  • learning_rate is the learning rate
  • replay_size is the replay buffer size (maximum number of experiences stored in replay memory)
  • sync_target_frames indicates how frequently we sync model weights from the main DQN network to the target DQN network (how many frames in between syncing)
  • replay_start_size is the count of frames (experiences) to add to the replay buffer before starting training

Finally, the hyperparameters related to the epsilon decay schedule are the same as in the previous post:

eps_start = 1.0
eps_decay = .999985
eps_min = 0.02
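To get a feel for this schedule (epsilon is multiplied by eps_decay once per frame in the training loop below), here is a quick back-of-the-envelope check, not part of the training code:

import math

eps_start, eps_decay, eps_min = 1.0, .999985, 0.02
frames_to_floor = math.log(eps_min / eps_start) / math.log(eps_decay)
print(int(frames_to_floor))               # ~260k frames until epsilon hits the 0.02 floor
print(eps_start * eps_decay ** 150_000)   # epsilon is already around 0.1 after 150k frames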

Agent

One of the main components we need is an Agent, which interacts with the Environment and saves the results of the interaction into the experience replay buffer. The Agent class that we will design saves the result of the interaction with the Environment directly into the experience replay buffer, performing the three steps of the sample phase indicated in the previous pseudocode:

First of all, during the Agent’s initialization, we need to store references to the Environment and to the experience replay buffer D, passed as the exp_buffer argument when the Agent object is created:

class Agent:
    def __init__(self, env, exp_buffer):
        self.env = env
        self.exp_buffer = exp_buffer
        self._reset()

    def _reset(self):
        self.state = self.env.reset()
        self.total_reward = 0.0

In order to perform the Agent’s steps in the Environment and store the results in the experience replay memory, we suggest the following code:

    def play_step(self, net, epsilon=0.0, device="cpu"):
        done_reward = None

        if np.random.random() < epsilon:
            action = self.env.action_space.sample()
        else:
            state_a = np.array([self.state], copy=False)
            state_v = torch.tensor(state_a).to(device)
            q_vals_v = net(state_v)
            _, act_v = torch.max(q_vals_v, dim=1)
            action = int(act_v.item())

The play_step method uses an ϵ-greedy(Q) policy to select actions at every time step. In other words, with probability epsilon (passed as an argument), we take a random action; otherwise, we use the past model to obtain the Q-values for all possible actions and choose the best one.

After obtaining the action, the method performs the step in the Environment to get the next observation: new_state, reward and is_done:

        new_state, reward, is_done, _ = self.env.step(action)
        self.total_reward += reward

Finally, the method stores the observation in the experience replay buffer and then handles the end-of-episode situation:

        exp = Experience(self.state, action, reward, is_done, new_state)
        self.exp_buffer.append(exp)
        self.state = new_state

        if is_done:
            done_reward = self.total_reward
            self._reset()

        return done_reward

The result of the function is the total accumulated reward if we have reached the end of the episode with this step, or None if not.


Main Loop

In the initialization part, we create our environment with all required wrappers applied, the main DQN neural network that we are going to train, and our target network with the same architecture. We also create the experience replay buffer of the required size and pass it to the agent. The last things we do before the training loop are to create an optimizer, a buffer for full episode rewards, a counter of frames and a variable to track the best mean reward reached (because every time the mean reward beats the record, we will save the model in a file):


env = make_env(DEFAULT_ENV_NAME)

net = DQN(env.observation_space.shape, env.action_space.n).to(device)
target_net = DQN(env.observation_space.shape, env.action_space.n).to(device)

buffer = ExperienceReplay(replay_size)
agent = Agent(env, buffer)
epsilon = eps_start

optimizer = optim.Adam(net.parameters(), lr=learning_rate)
total_rewards = []
frame_idx = 0
best_mean_reward = None

At the beginning of the training loop, we count the number of iterations completed and update epsilon as we introduced in the previous post. Next, the Agent makes a single step in the Environment (using as arguments the current neural network and value for epsilon). Remember that this function returns a non-None result only if this step is the final step in the episode. In this case, we report the progress in the console (count of episodes played, mean reward for the last 100 episodes and the current value of epsilon):


while True:
    frame_idx += 1
    epsilon = max(epsilon * eps_decay, eps_min)

    reward = agent.play_step(net, epsilon, device=device)
    if reward is not None:
        total_rewards.append(reward)
        mean_reward = np.mean(total_rewards[-100:])
        print("%d: %d games, mean reward %.3f, (epsilon %.2f)" %
              (frame_idx, len(total_rewards), mean_reward, epsilon))

Afterwards, every time our mean reward for the last 100 episodes reaches a new maximum, we report it in the console and save the current model parameters in a file. Also, if this mean reward exceeds the specified MEAN_REWARD_BOUND (19.0 in our case), we stop training. The third if helps us ensure that our experience replay buffer is large enough for training:

        if best_mean_reward is None or best_mean_reward < mean_reward:
            torch.save(net.state_dict(), DEFAULT_ENV_NAME + "-best.dat")
            best_mean_reward = mean_reward
            if best_mean_reward is not None:
                print("Best mean reward updated %.3f" % (best_mean_reward))

        if mean_reward > MEAN_REWARD_BOUND:
            print("Solved in %d frames!" % frame_idx)
            break

    if len(buffer) < replay_start_size:
        continue

Learn phase

Now we will start to describe the part of the code from the main loop that refers to the phase where the network learns (a portion of the previous pseudocode):

The whole code that we wrote for implementing this part is as follows:


    batch = buffer.sample(batch_size)
    states, actions, rewards, dones, next_states = batch

    states_v = torch.tensor(states).to(device)
    next_states_v = torch.tensor(next_states).to(device)
    actions_v = torch.tensor(actions).to(device)
    rewards_v = torch.tensor(rewards).to(device)
    done_mask = torch.ByteTensor(dones).to(device)

    state_action_values = net(states_v).gather(
        1, actions_v.unsqueeze(-1)).squeeze(-1)

    next_state_values = target_net(next_states_v).max(1)[0]
    next_state_values[done_mask] = 0.0
    next_state_values = next_state_values.detach()

    expected_state_action_values = next_state_values * gamma + rewards_v

    loss_t = nn.MSELoss()(state_action_values, expected_state_action_values)

    optimizer.zero_grad()
    loss_t.backward()
    optimizer.step()

    if frame_idx % sync_target_frames == 0:
        target_net.load_state_dict(net.state_dict())

We are going to dissect it to facilitate its description since it is probably the most complex part to understand.


The first thing to do is to sample a random mini-batch of transitions from the replay memory:

batch = buffer.sample(batch_size)
states, actions, rewards, dones, next_states = batch

Next, the code wraps the individual NumPy arrays with batch data in PyTorch tensors and copies them to the GPU (we are assuming that the CUDA device is specified in the arguments):

states_v = torch.tensor(states).to(device)
next_states_v = torch.tensor(next_states).to(device)
actions_v = torch.tensor(actions).to(device)
rewards_v = torch.tensor(rewards).to(device)
done_mask = torch.ByteTensor(dones).to(device)

This code is inspired by the code of Maxim Lapan. It is written in a form that maximally exploits the capabilities of the GPU by processing all batch samples in parallel with vector operations, but explained step by step it can be understood without problems.

Then, we pass the observations to the first model and extract the specific Q-values for the taken actions using the gather() tensor operation. The first argument to this function call is the dimension index that we want to perform the gathering on. In this case, it is equal to 1, because it corresponds to the action dimension:

state_action_values = net(states_v).gather(1, actions_v.unsqueeze(-1)).squeeze(-1)

The second argument is a tensor of indices of the elements to be chosen. Here the code is a bit more complex to explain. Let’s try! Maxim Lapan suggests using the functions unsqueeze() and squeeze(). Because the index should have the same number of dimensions as the data we are processing (2D in our case), we apply unsqueeze() to actions_v (which is 1D) to compute the index argument for the gather() function. Finally, to remove the extra dimension we have created, we use the squeeze() function. Let’s illustrate what gather() does on a simple example with a batch of four entries and four actions:
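The original illustration is not reproduced here, so here is a minimal sketch with made-up Q-values for a batch of four states and four actions:

import torch

q_vals = torch.tensor([[0.1, 0.2, 0.3, 0.4],     # Q-values predicted for state 0
                       [1.0, 2.0, 3.0, 4.0],     # ... state 1
                       [0.5, 0.4, 0.3, 0.2],     # ... state 2
                       [9.0, 8.0, 7.0, 6.0]])    # ... state 3
actions = torch.tensor([3, 0, 1, 2])             # action actually taken in each state

chosen = q_vals.gather(1, actions.unsqueeze(-1)).squeeze(-1)
print(chosen)   # Q-values of the taken actions: 0.4, 1.0, 0.4 and 7.0

Row by row, gather() picks out the Q-value of the action that was actually taken, which is exactly the vector we compare against the targets in the loss.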

Note that the result of gather() applied to tensors is a differentiable operation that will keep all gradients with respect to the final loss value.


Now that we have calculated the state-action values for every transition in the batch, we need to calculate the target “y” for every transition too. Both vectors are the ones we will use in the loss function. To do this, remember that we must use the target network.

In the following code, we apply the target network to our next state observations and calculate the maximum Q-value along the same action dimension, 1:


next_state_values = target_net(next_states_v).max(1)[0]

Function max() returns both the maximum values and the indices of those values (so it calculates both max and argmax). Because in this case we are interested only in the values, we take the first entry of the result.
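As a small illustration with made-up values of how max(1) separates values and indices:

import torch

q_next = torch.tensor([[1.0, 5.0, 3.0],
                       [2.0, 0.5, 0.1]])
values, indices = q_next.max(1)
print(values)    # the maximum Q-value of each row (5.0 and 2.0), what we keep with [0]
print(indices)   # the argmax of each row (1 and 0), which we do not need here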

Remember that if the transition in the batch is from the last step in the episode, then our value of the action doesn’t have a discounted reward of the next state, as there is no next state from which to gather the reward:


next_state_values[done_mask] = 0.0

Although we cannot go into detail, it is important to highlight that the calculation of the next state value by the target neural network shouldn’t affect gradients. To achieve this, we use the detach() function of the PyTorch tensor, which makes a copy of it without connection to the parent’s operation, to prevent gradients from flowing into the target network’s graph:

next_state_values = next_state_values.detach()

Now we can calculate the Bellman approximation value for the vector of targets (“y”), that is, the vector of the expected state-action values for every transition in the batch:

expected_state_action_values = next_state_values * gamma + rewards_v

We have all the information required to calculate the mean squared error loss:


loss_t = nn.MSELoss()(state_action_values, expected_state_action_values)

The next piece of the training loop updates the main neural network using the SGD algorithm by minimizing the loss:


optimizer.zero_grad()
loss_t.backward()
optimizer.step()

Finally, the last piece of code syncs the parameters from our main DQN network to the target DQN network every sync_target_frames frames:

if frame_idx % sync_target_frames == 0:
    target_net.load_state_dict(net.state_dict())

And that is all the code for the main loop!

What is next?

This is the second of three posts devoted to presenting the basics of Deep Q-Network (DQN), in which we have presented the algorithm in detail. In the next post, we will talk about the performance of the algorithm and also show how we can use it.

Translated from: https://towardsdatascience.com/deep-q-network-dqn-ii-b6bf911b6b2c
