Explained Simply: How DeepMind taught AI to play video games

by Aman Agarwal



Google’s DeepMind is one of the world’s foremost AI research teams. They’re most famous for creating the AlphaGo player that beat South Korean Go champion Lee Sedol in 2016.


The key technology used to create the Go playing AI was Deep Reinforcement Learning.


Let’s go back 4 years, to when DeepMind first built an AI which could play Atari games from the 70s. Games like Breakout, Pong and Space Invaders. It was this research that led to AlphaGo, and to DeepMind being acquired by Google.


Today we’re going to take that original research paper and break it down paragraph by paragraph. This will make it more approachable for people who have only just started learning about Reinforcement Learning, and for those who don’t use English as their first language (which makes it VERY difficult to read such papers).

Here’s the original paper if you want to try reading it:


Some quick notes (in case you don’t make it to the bottom of this 20-minute article)

The explanations here are by two people:


  1. Me, a self-driving car engineer

  2. Qiang Lu, a PhD candidate and researcher at the University of Denver


We hope that our work will save you a lot of time and effort if you were to study this on your own.


And if you’re more comfortable reading Chinese, here is an *unofficial* translation of this essay.

  1. We love your claps, but love your comments even more. Unload whatever is on your mind — feelings, suggestions, corrections, or criticism — into the comment box!
  2. I intend to write many more articles like this, and am looking for more collaborators. If you’d seriously like to contribute, please leave a comment.

Let’s get started

We want to make a robot learn how to play an Atari game by itself, using Reinforcement Learning.


The robot in our case is a convolutional neural network.


This is almost real end-to-end deep learning, because our robot gets inputs in the same way as a human player — it directly sees the image on the screen and the reward/change in points after each move, and that is all the information it needs to make a decision.


And what does the robot output? Well, ideally we want the robot to pick the action which it thinks promises the most reward in future. But instead of directly choosing the action, here we let it assign “values” to each of the 18 possible joystick actions. So put simply, the value V for any action A represents the robot’s expectation of the future reward it will get if it performs that action A.


So in essence, this neural network is a value function. It takes as input the screen state and change in reward, and it outputs the different values associated with each possible action. So that you can pick the action with the highest value, or pick any other action based on how you program the overall player.

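To make the value-function idea concrete, here is a minimal sketch (not from the paper) of how a player might use such a network to choose a move. The `q_network` stand-in and the 18-action count are illustrative assumptions based on the description above.

```python
import numpy as np

NUM_ACTIONS = 18  # all possible joystick actions, as described above

def q_network(state):
    """Stand-in for the convolutional neural network: it maps a screen state
    to one estimated value per joystick action (random here, learned in reality)."""
    return np.random.randn(NUM_ACTIONS)

def pick_action(state):
    values = q_network(state)       # one value per possible action
    return int(np.argmax(values))   # e.g. pick the highest-valued action
```

In the real agent the values come from the trained network, and (as discussed later) the chosen action is not always the greedy one.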

Say you have the game screen, and you want to tell a neural network what’s on the screen. One way would be to directly feed the image into the neural network; we don’t process the inputs in any other way. The other would be to create a summary of what’s happening on the screen in a numerical format, and then feed that into the neural network. The former is being referred to here as “high dimensional sensory inputs” and the latter is “hand crafted feature representations.”


Read the explanation of the Abstract above. Then this paragraph is self-explanatory.

Deep Learning methods don’t work easily with reinforcement learning like they do in supervised/unsupervised learning. Most DL applications have involved huge training datasets with accurate samples and labels. Or in unsupervised learning, the target cost function is still quite convenient to work with.


But in RL, there’s a catch — as you know, RL involves rewards which could be delayed many time steps into the future (for example, it takes several moves to capture the opponent’s queen in chess, and each of those moves doesn’t return the same immediate reward as the final move, EVEN IF one of those moves might be more important than the final move).

The rewards could also be noisy — for instance, sometimes the points for a particular move are slightly random and not easily predictable!


Moreover, in DL usually the input samples are assumed to be unrelated to each other. For example in an image recognition network, the training data will have huge numbers of randomly organized and unrelated images. But while learning how to play a game, usually the strategy of moves doesn’t just depend on the current state of the screen but also some previous states and moves. It’s not simple to assume no correlation.


Now, wait a second. Why is it important for our training data samples to not be correlated to each other? Say you had 5 samples of animal images, and you wanted to learn to classify them into “cat” and “not cat”. If one of those images is of a cat, does it affect the likelihood of another image also being a cat? No. But in a video game, one frame of the screen is definitely related to the next frame. And to the next frame. And so on. If it takes 10 frames for a laser beam to destroy your spaceship, I’m pretty sure the 9th frame is a pretty good indication that the 10th frame is going to be painful. You don’t want to treat the two frames just a few milliseconds apart as totally separate experiences while learning, because they obviously carry valuable information about each other. They’re part of the same experience — that of a laser beam striking your spaceship.


Even the training data itself keeps changing in nature as the robot learns new strategies, making it harder to train on. What does that mean? For example, say you’re a noob chess player. You start out with some noob strategies when you play your first chess game, i.e. keep moving forward, capture a pawn at the first chance you get, etc. As you keep learning those behaviors and feel happy taking pawns, those moves act like your current training set.

Now one day you try out a different strategy — sacrificing one of your own bishops to save your queen and take the opponent’s rook. Bam! You realize this is amazing. You’ve added this new trick to your training set, which you would never have learned if you had just kept practicing your previous noob strategy.

This is what it means to have a non-stationary data distribution, which doesn’t really happen in supervised/unsupervised learning.


So given these challenges, how do you even train the neural network in such a situation?


In this paper we show how we overcame the above mentioned problems AND we directly used raw video/image data. Which means we’re awesome. One specific trick that is worth mentioning: “Experience Replay”. This solves the challenge of ‘data correlation’ and ‘non-stationary data distributions’ (see previous paragraph to understand what these mean).


We record all our experiences — again using the chess analogy, each experience looks like [current board status, move I tried, reward I got, new board status] — into a memory. Then while training, we pick up randomly distributed batches of experiences which aren’t related to each other. In each batch, different experiences could be associated with different strategies as well — because all the previous experiences and strategies are now jumbled together! Make sense?


This makes the training data samples more random and un-correlated, and it also makes it feel more stationary to the neural network because every new batch is already full of random strategy experiences.

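Here is a rough sketch of what such an experience-replay memory could look like in code; the capacity, batch size, and field names are assumptions for illustration, not values taken from the paper.

```python
import random
from collections import deque

class ReplayMemory:
    """Stores experiences and hands back random, shuffled mini-batches."""

    def __init__(self, capacity=100_000):      # capacity is an assumed number
        self.buffer = deque(maxlen=capacity)   # oldest experiences fall off the end

    def store(self, state, action, reward, next_state):
        # one experience: [current status, move I tried, reward I got, new status]
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size=32):
        # drawing at random breaks the correlation between consecutive frames
        return random.sample(self.buffer, batch_size)
```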

Much of it is self-explanatory. The key here is that the exact same neural network architecture and hyperparameters (learning rate, etc) were used for each different game. It’s not like we used a bigger network for space invaders and a smaller network for ping-pong. We did train the networks from scratch for each new game, but the network design itself was the same. That’s pretty awesome right?


The first couple of sentences are self-explanatory. Saying that ‘E’ is stochastic means that the environment is not always predictable (which is true in games, right? Anything could happen at any time).

It also repeats that the neural network does NOT get any information about the internal state of the game. For example, we don’t tell it things like “there’s a monster at this position who is firing at you and moving in this direction, your spaceship is present here and moving there, etc”. We simply give it the image and let the convolutional network figure out by itself where the monster is, and where the player is, and who is shooting where etc. This is to make the robot train in a more human-like way.


Perceptual Aliasing: Meaning that two different states/places can be perceived as the same. For example, in a building, it is nearly impossible to determine a location solely with the visual information, because all the corridors may look the same.


Perceptual aliasing is a problem. In Atari games the state of the game doesn’t change much from one millisecond to the next, nor is a human being capable of making a decision every millisecond. So when we take video input at 60 frames per second, and treat each frame as a separate state, most of the states in our training data will look exactly the same! It’s better to keep a longer horizon for what a “state” looks like, spanning, say, at least 4 to 5 frames. Multiple consecutive frames also contain valuable information about each other — for example, a still photograph of two cars a foot away from each other is very ambiguous — is one car about to crash into the other? Or are they about to move away from each other after coming this close? You don’t know. But if you take 4 frames from the video and see them one after the other, you now know how the cars are moving and can guess whether they’re going to crash or not. We call this a sequence of consecutive frames, and use one sequence as a state.

Moreover, when a human moves the joystick it usually stays in the same position for several milliseconds, so that is incorporated into this state. The same action is continued in each of the frames. Each sequence (which includes several frames and the same action between them) is an individual state, and this state still fulfils a Markov Decision Process (MDP).

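A minimal sketch of treating a short sequence of consecutive frames as one state. The four-frame window matches the description above; the shapes and helper names are illustrative.

```python
from collections import deque
import numpy as np

FRAMES_PER_STATE = 4   # a "state" spans several consecutive frames

frame_window = deque(maxlen=FRAMES_PER_STATE)

def observe(preprocessed_frame):
    """Add the newest frame and return the stacked state (or None until we have 4)."""
    frame_window.append(preprocessed_frame)
    if len(frame_window) < FRAMES_PER_STATE:
        return None
    return np.stack(list(frame_window), axis=0)   # shape: (4, height, width)
```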

If you’ve read up on RL, you’d know what MDPs are and what they mean! MDPs are the core assumption in RL.


Now, to understand this part you should really do some background study into Reinforcement Learning and Q-Learning first. It’s very important. You should understand what the Bellman equation does, what discounted future rewards are etc. But let me try to give a really simple overview of Q-learning.


Remember what I said about “value function” earlier? Scroll up to the Abstract and read it.


Now, let’s say you had a table which had a row for ALL the possible states (s) of the game, and the columns represented all the possible joystick moves (a). Each cell in the row represents the maximum total future value possible if you take that particular action and play your best from then on. This means you now have a “cheat sheet” of what to expect from any action at any state! The values of these cells are called the Q-star values (Q*(s,a)). For any state s, if you take action a, the maximum total future value is Q*(s,a) as seen in that table.

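As a toy illustration of the “cheat sheet” idea, a Q-table can literally be a lookup keyed by (state, action). The states, actions, and numbers below are made up; the point is that this only stays feasible while the number of states is tiny, which is exactly why the paper will approximate the table with a neural network instead.

```python
# Hypothetical toy Q-table: one entry per (state, action) pair.
q_table = {
    ("state_1", "LEFT"): 4.0,
    ("state_1", "RIGHT"): 7.5,   # the best action in state_1, per this table
    ("state_2", "LEFT"): 1.2,
    ("state_2", "RIGHT"): 0.3,
}

def best_action(state, actions=("LEFT", "RIGHT")):
    # Q*(s, a): total future reward expected if we take a in s and play optimally after
    return max(actions, key=lambda a: q_table[(state, a)])
```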

In the last line, pi is ‘policy’. Policy is simply the strategy about what action to pick when you are in a particular state.


Now, if you think about it, say you are in state S1. You see the Q* value for all possible actions in the table (explained a few paragraphs above), and choose A1 because its Q value is the highest. You get an immediate reward R1, and the game moves into a different state S2. For S2, the maximum future reward comes from taking (say) action A2, as seen in the table.

Now, the initial Q value Q*(S1,A1) is the max value you could get if you played optimally from then on, right? This means, that Q*(S1, A1) should be equal to the sum of the reward R1 AND the max future value of the next state Q*(S2,A2)! Does this make sense? But hey we want to reduce the influence of the next state, so we multiply it by a number gamma which is between 0 and 1. This is called discounting Q*(S2,A2).


Therefore Q*(S1,A1) = R1 + [gamma x Q*(S2,A2)]

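Plugging made-up numbers into that equation makes it concrete:

```python
gamma = 0.9          # discount factor, somewhere between 0 and 1
r1 = 2.0             # immediate reward for taking action A1 in state S1 (made up)
best_q_next = 10.0   # max over actions of Q*(S2, a), i.e. Q*(S2, A2) (made up)

q_s1_a1 = r1 + gamma * best_q_next   # Q*(S1, A1) = 2 + 0.9 * 10 = 11.0
```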

Look at the previous equation again. We’re assuming that for any state, and for any future action, we *know* the optimal value function already, and can use it to pick the best action at the current state (because by iterating over all the possible Q values we can literally look ahead into the future). But of course, such a Q-function doesn’t really exist in the real world! The best we can do is *approximate* the Q function by another function, and update that approx function little by little by testing it in the real world again and again. This approx function can be a simple linear polynomial, but we can even use non-linear functions. So we choose to use a neural network to be our “approximate Q function.”


Now you know why we’re reading this paper in the first place — DeepMind uses a neural network to approximate a Q function, and then they let the computer play ATARI games using the network to help predict the best moves. With time, as the computer gets a better and better idea of how the rewards work, it can tweak its neural network (by adjusting the weights), so that it becomes a better and better approximation of that “real” Q function! And by the time that approximation is good enough, voila we realize it can actually make better predictions than humans.

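Here is a heavily simplified sketch of that “tweak the weights” step, written with PyTorch purely for illustration (the paper does not ship code, and the function and variable names here are assumptions). It takes a `q_net` that maps a batch of states to per-action Q-values plus a batch sampled from replay memory, and minimizes the squared error between the predicted Q-value and the Bellman target described above.

```python
import torch
import torch.nn.functional as F

def training_step(q_net, optimizer, batch, gamma=0.99):
    """One gradient update on a batch sampled from replay memory (simplified sketch)."""
    states, actions, rewards, next_states = batch   # tensors stacked per field

    # Q(s, a) predicted by the current network, for the actions actually taken
    predicted = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # Bellman target: r + gamma * max_a' Q(s', a'), with no gradient through the target
    with torch.no_grad():
        target = rewards + gamma * q_net(next_states).max(dim=1).values

    loss = F.mse_loss(predicted, target)   # squared error between prediction and target
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```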

Now, leaving aside some of the mathematical mumbo jumbo above (it’s hard for me too!), know that Q-learning is a *model-free* approach. When you say “model-free” RL, it means that your agent doesn’t need to explicitly learn the rules or physics of the game. In model-based RL, those rules and physics are defined in terms of a ‘Transition Matrix’ which calculates the next state given a current state and action, and a ‘Reward Function’ which computes the reward given a current state and action. In our case these two things are too complex to calculate, and if you think about it, we don’t really need them! In our “model free” approach, we simply care about learning the Q value function by trial and error, because we assume that a good Q function will inherently have to follow the rules and physics of the game.

Our approach is also *off-policy*, and not *on-policy*. The difference here is more subtle because in this paper they’ve followed a hybrid of sorts. Assume you’re at state s and have several actions to choose from. We have an approximate Q value function, so we calculate what will be the Q value for each of those actions. Now, while choosing the action, you have two options. The “common sense” option is to simply choose the action that has the highest Q value right? Yes, and this is what’s called a “greedy” strategy. You always pick the action which seems best to you *right now, given your current understanding of the game* — in other words, given your current approximate of the Q function — which means, given your current strategy. But there lies the problem — when you start out, you don’t really have a good Q function approximator right? And even if you have a somewhat good strategy, you still want your AI to check out other possible strategies and see where they lead. This is why a “greedy” strategy isn’t always the best when you’re learning. While learning, you don’t want to just keep trying what you believe will work — you want to try other things which seem less likely, just so you can get experience. And that’s the difference between on-policy (greedy) and off-policy (not greedy).


Why did I say we use a hybrid of sorts? Because we vary the approach based on how much we’ve learned. We vary the probability with which the agent will pick the greedy action. How do we vary that? We pick greedy actions with a probability of (1-e), where e is a variable that represents how random the choice is. So e=1 means the choice is completely random, and e=0 means that we always pick the greedy action. Makes sense? At first, when the network just begins learning, we pick e to be very close to 1, because we want the AI to explore as many strategies as possible. As time goes by and the AI learns more and more, we reduce the value of e towards 0 so that the AI stays on a particular strategy.

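A small sketch of that e-greedy schedule follows. The start value, end value, and annealing length are assumptions for illustration rather than the paper’s exact settings.

```python
import random

def epsilon_by_step(step, eps_start=1.0, eps_end=0.1, anneal_steps=1_000_000):
    """Linearly anneal e from fully random toward mostly greedy (assumed schedule)."""
    fraction = min(step / anneal_steps, 1.0)
    return eps_start + fraction * (eps_end - eps_start)

def choose_action(q_values, step):
    epsilon = epsilon_by_step(step)
    if random.random() < epsilon:                 # explore: random joystick action
        return random.randrange(len(q_values))
    return max(range(len(q_values)),              # exploit: greedy (highest-Q) action
               key=lambda a: q_values[a])
```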

Qiang: Backgammon has long been one of the most popular games for scientists to test their artificial intelligence and machine learning algorithms on. Reference [24] used a model-free algorithm to achieve a superhuman level of play. Model-free means there is no explicit model linking the algorithm’s input (the board position) to its output (the best play strategy found).

Q-learning, where ‘Q’ stands for ‘quality’, uses a Q function which represents the maximum discounted future reward when we perform a certain action in a certain state and then follow the optimal policy (play strategy) from that point on. The difference in reference [24] is that they approximate the Q-value with a multi-layer perceptron (MLP). In their MLP, a single hidden layer sits between the input layer and the output layer.

Qiang: Later attempts to apply similar methods to other games were unsuccessful, which made people lose faith in the TD-Gammon approach. They attributed TD-Gammon’s success at backgammon to the stochasticity of the dice rolls.

Go a few paragraphs back, where we saw what kinds of functions can be used to approximate our theoretically perfect Q-function. Apparently, linear functions are better suited to the task than non-linear functions like neural networks, because they make it easier for the approximator to ‘converge’ (i.e. the weights adjust in such a way that the network makes more accurate predictions, instead of becoming more random).

Qiang: Recently, combining deep learning with reinforcement learning has become an active research interest again. The environment, the value function, and the policy have all been estimated with deep learning algorithms. In the meantime, the divergence problem has been partially addressed by gradient temporal-difference methods. However, as mentioned in the paper, these methods only work with a nonlinear function approximator when evaluating a fixed policy; they have not yet been extended to full nonlinear control.

Qiang: NFQ is the prior work most similar to the approach in this paper. The main idea of NFQ is to use RPROP (resilient backpropagation) to update the parameters of the Q-network so as to optimise the sequence of loss functions in Equation 2. The disadvantage of NFQ is that its batch updates introduce a computational cost proportional to the size of the data set.

This paper uses stochastic gradient updates, which are computationally efficient. NFQ was also applied to simple tasks rather than directly to visual input, whereas this paper’s algorithm learns end-to-end from the raw images. Another earlier paper on Q-learning likewise used a low-dimensional state instead of raw visual inputs, which again highlights this paper’s advantage.

Qiang: This paragraph introduces several prior uses of the Atari 2600 emulator. The first paper that used the Atari 2600 emulator as a reinforcement learning platform applied standard reinforcement learning algorithms with linear function approximation and generic visual features. Using a larger number of features and projecting the features to a lower-dimensional space improved the results. Separately, the HyperNEAT evolutionary architecture evolved a neural network for each game; the network represents the play strategy, and the trained networks were able to exploit design flaws in some games.

But as mentioned in the paper, this paper’s algorithm learns the strategy for seven Atari 2600 games without adjustment of the architecture. This is the big advantage of the algorithm in this paper.


Self explanatory?


TD Gammon was an on-policy approach, and it directly used the experiences (s1, a1, r1, s2) to train the network (without experience replay etc).


Now we come to the specific improvements made over TD Gammon. The first of these is experience replay, which has already been talked about previously. The ‘phi’ function does image preprocessing etc, so the state of the game is stored as the final preprocessed form (more on this in the next section).


These are the concrete advantages of using experience replay (this paragraph continues on the next page). Firstly, just like in regular deep learning, where each data sample can be reused multiple times to update weights, we can use the same experience multiple times while training. This is more efficient use of data.


Second and third are very related. Because each state is very closely related to the next state (as it is while playing a video game), training the weights on each consecutive state will lead to the program only following one single way of playing the game. You predict a move based on the Q function, you make that move, and you update the weights so that the next time you will again be likely to make a similar move (say, moving left). But by breaking this pattern and drawing randomly from past experiences, you can avoid these feedback loops.

Now, it’s good to draw random samples from experience replay, but sometimes in a game there are important transitions that you would like the agent to learn about. This is a limitation of the current approach in this paper. A suggestion given is to pick important transitions with a greater probability while using experience replay. Or something like that.


(Beyond this point, everything is based on the theory covered in the previous sections so a lot of it is just technical details)


Most of this is self-explanatory. The state S is preprocessed to include 4 consecutive frames, all preprocessed into grayscale and resized and cropped to 84x84 squares. I think this is because given that the game runs at over 24 frames per second, and humans can’t react so fast as to make a move in each single frame, it makes sense to consider 4 consecutive frames as being in the same state.


While designing the network architecture, you could make it a Q function which takes both S1 and A1 as input and outputs the Q-value for that combination. But this means that you’d have to run this network for each of the 18 possible joystick actions at each step, and compare the output of all 18 runs. Instead, you can simply have an architecture where you use S1 as the only input and have 18 outputs, each corresponding to the Q-value for a given joystick action. It’s much more efficient to compare the Q-values in this way!

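A sketch of that single-pass design in PyTorch: the stacked 4-frame state goes in, and one Q-value per joystick action comes out of a single forward pass. The layer sizes are loosely patterned on the convolutional network the paper describes, but treat them as illustrative rather than an exact reproduction.

```python
import torch.nn as nn

class DQN(nn.Module):
    """Sketch of the Q-network: stacked frames in, one Q-value per action out."""

    def __init__(self, num_actions=18, frames=4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(frames, 16, kernel_size=8, stride=4),  # 4x84x84 -> 16x20x20
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2),      # -> 32x9x9
            nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 9 * 9, 256),
            nn.ReLU(),
            nn.Linear(256, num_actions),  # 18 outputs: one Q-value per joystick action
        )

    def forward(self, x):
        return self.head(self.features(x))
```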

Self explanatory :)


Oooh. The first half is self-explanatory. The second half tells us one very important thing about this experiment: the nature of the rewards fed to the agent was modified. So, *any* positive reward was input as +1, negative rewards were input as -1, and no change was input as 0. This is of course very different from how real games work — rewards are always changing, and some accomplishments have higher rewards than others. But it’s impressive that in spite of this, the agent performed better than humans in some games!

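That reward transformation is essentially a one-liner; something like this sketch, assuming the raw change in score arrives as a number:

```python
def clip_reward(score_change):
    """Collapse any change in game score to +1, -1, or 0, as described above."""
    if score_change > 0:
        return 1
    if score_change < 0:
        return -1
    return 0
```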

We’ve already talked about e-greedy (in Section 2), and experience replay. This is about the specific details of their implementation.


More detail about why they use a stack of 4 video frames instead of using a single frame for each state.


This is about the evaluation metric you use while training. Usually in supervised learning you have something like validation accuracy, but here you don’t have any validation set etc to compare with. So what other things can we use to check whether our network is training towards a point or are the weights just dancing around? Hmmm, let’s think about that. The purpose of this whole paper is to create an AI agent which gets a high score on the game, so why not just use the total score as our evaluation metric? And we can play several games and get the average overall score. Well, turns out that using this metric doesn’t work well in practice, as it happens to be very noisy.


Let’s think about some other metric? Well, another thing we’re doing in this experiment is to find a ‘policy’ which the AI will follow to ensure the highest score (it’s off policy learning, as explained previously). And the Q-value at any particular moment represents the total reward expected by the AI in future. So if the AI finds a great policy, then the Q-value for that policy will be higher, right? Let’s see if we can use the Q-value itself as our evaluation metric. And voila, it seems to be more stable than just the averaged total reward. Now, as you can see there’s no theoretical explanation for this, and it was just an idea which happened to work. (Actually that’s what happens in deep learning all the time. Some things just work, and other things which seem common sense just don’t. Another example of this is Dropout, which is a batshit crazy technique but works amazingly).

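A sketch of that evaluation metric: gather a fixed set of held-out states once before training, then periodically report the average of the maximum Q-value the current network assigns to them. The helper and variable names here are hypothetical.

```python
import torch

def average_max_q(q_net, held_out_states):
    """Average of max_a Q(s, a) over a fixed batch of states gathered before training."""
    with torch.no_grad():
        q_values = q_net(held_out_states)              # shape: (num_states, num_actions)
        return q_values.max(dim=1).values.mean().item()
```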

This should be self explanatory. It shows how the value function changes in different moves of the game.


Here we compare the paper’s results with prior work in this field. “Sarsa” refers to [s1, a1, r, s2, a2]. It is an on-policy learning algorithm (as opposed to our off-policy approach). The difference isn’t easy to grasp at first; here’s a good explanation.

And another.


The rest of this paragraph is quite easy to read.


For this paragraph and everything beyond… ‘Look and marvel at how much better their approach performs!’


Oh and if you’re someone who needs help explaining your science or technology work to non-technical people for any reason (whether it’s marketing or education), I could really help you. Drop me a message on Twitter: @mngrwl


Translated from: https://www.freecodecamp.org/news/explained-simply-how-deepmind-taught-ai-to-play-video-games-9eb5f38c89ee/
