Reinforcement Learning - Part 5

FAU Lecture Notes on Deep Learning

These are the lecture notes for FAU’s YouTube Lecture “Deep Learning”. This is a full transcript of the lecture video & matching slides. We hope you enjoy this as much as the videos. Of course, this transcript was created with deep learning techniques largely automatically and only minor manual modifications were performed. Try it yourself! If you spot mistakes, please let us know!


Navigation

Previous Lecture / Watch this Video / Top Level / Next Lecture


Breakout is pretty hard to learn. Image created using gifify. Source: YouTube

Welcome back to deep learning! Today, we want to talk about deep reinforcement learning. So, I have a couple of slides for you. Of course, we want to build on the concepts that we’ve seen in reinforcement learning, but we talk about deep Q-learning today.


Image under CC BY 4.0 from the Deep Learning Lecture.

One of the very well-known examples is human-level control through deep reinforcement learning. This work [4] was done by Google DeepMind. They showed that a neural network is able to play Atari games. So, the idea here is to directly learn the action-value function using a deep network. The input is essentially a stack of subsequent video frames from the game, and this is processed by a deep network. It produces the best next action. So, the idea is to use this deep reinforcement learning framework to learn the best next controller movements. They use convolutional layers for the frame processing and then fully connected layers for the final decision-making.


Image under CC BY 4.0 from the Deep Learning Lecture.

Here, you see the main idea of the architecture. So, there are these convolutional layers and ReLUs. The input frames are processed by these. Then, you go into fully connected layers and again fully connected layers. Finally, you directly produce the output, and you can see that in Atari games this is a very limited set: you can do nothing, move in one of eight directions, press the fire button, or combine one of the eight directions with the fire button. That's all of the different things that you can do. So, it's a limited domain and you can then train your system with that.

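As a rough sketch of what such a Q-network might look like, here is a minimal PyTorch version following the commonly reported DQN layout. This is my own illustration, not the exact code from [4]; the layer sizes and the 84×84 input resolution are assumptions.

```python
import torch
import torch.nn as nn

class AtariQNetwork(nn.Module):
    """Sketch of a DQN-style action-value network: a stack of game frames in,
    one Q-value per discrete action out (up to 18 actions on Atari)."""

    def __init__(self, n_frames: int = 4, n_actions: int = 18):
        super().__init__()
        # Convolutional layers with ReLUs process the stacked frames.
        self.features = nn.Sequential(
            nn.Conv2d(n_frames, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        # Fully connected layers do the final decision-making.
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),   # 7x7 feature map assumes 84x84 input frames
            nn.Linear(512, n_actions),
        )

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        return self.head(self.features(frames))

# Usage: pick the greedy action for one state (4 stacked 84x84 frames).
q_net = AtariQNetwork()
state = torch.zeros(1, 4, 84, 84)
best_action = q_net(state).argmax(dim=1)
```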

Image under CC BY 4.0 from the Deep Learning Lecture.

Well, it’s a deep network that directly applies Q-learning. The state of the game is essentially the current plus the three previous frames as an image stack. So, you have a rather fuzzy way of incorporating memory and state. Then, you have 18 outputs that are associated with the different actions, and each output estimates the action value for the given input. You don’t have labels and a cost function in the usual sense; instead, you update the network with respect to maximizing the future reward. There’s a reward of +1 when the game score increases and a reward of -1 when the game score decreases. Otherwise, it’s zero. They use an ε-greedy policy with ε decreasing to a low value during the training. They use a semi-gradient form of Q-learning to update the network weights w, and again they use mini-batches to accumulate the weight updates.

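To illustrate the ε-greedy policy with a decaying ε, here is a small sketch reusing the q_net from the snippet above. The linear decay schedule and its constants are made up for illustration and are not necessarily the ones used in [4].

```python
import random
import torch

def epsilon_by_step(step, eps_start=1.0, eps_end=0.1, decay_steps=1_000_000):
    """Linearly anneal epsilon from eps_start down to eps_end over decay_steps."""
    fraction = min(step / decay_steps, 1.0)
    return eps_start + fraction * (eps_end - eps_start)

def select_action(q_net, state, step, n_actions=18):
    """With probability epsilon explore a random action, otherwise exploit argmax Q."""
    if random.random() < epsilon_by_step(step):
        return random.randrange(n_actions)
    with torch.no_grad():
        return int(q_net(state).argmax(dim=1).item())
```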

Image under CC BY 4.0 from the Deep Learning Lecture.

So, they have this target network and it’s updated using the following rule (see slide). You can see that this is very close to what we have seen in the previous video. Again, you have the weights and you update them with respect to the rewards. Now, the problem is, of course, that the γ-discounted maximum over the q function is again a function of the weights. So, the maximization now depends on the very weights that you’re trying to update. Your target changes simultaneously with the weights that you want to learn, and this can actually lead to oscillations or divergence of the weights. So, this is not very good. To solve the problem, they introduce a second, so-called target network. Every C steps, they generate it by copying the weights of the action-value network to a duplicate network and then keep them fixed. So, you use the output q̄ of the target network as the target to stabilize the maximization. You don’t use q̂, the function that you’re trying to learn, but q̄, which is a fixed version that you keep for a couple of iterations.

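Since the slide with the update rule is not reproduced in this transcript, the following is my sketch of how such a semi-gradient update against a fixed target network can be written in PyTorch. It reuses the q_net from the earlier snippet; the MSE loss, the discount factor, and the batch layout are assumptions, not the exact choices of [4]. The target y = r + γ · max_a q̄(s', a) is treated as a constant, and every C steps the online weights are copied into the target network.

```python
import copy
import torch
import torch.nn.functional as F

# q_net is the online network q_hat; target_net is the periodically frozen copy q_bar.
target_net = copy.deepcopy(q_net)

def dqn_update(q_net, target_net, optimizer, batch, gamma=0.99):
    """One semi-gradient Q-learning step against the fixed target network.
    batch = (states, actions, rewards, next_states, dones) as tensors."""
    states, actions, rewards, next_states, dones = batch
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():                           # targets are constants, no gradient flows
        max_next_q = target_net(next_states).max(dim=1).values
        targets = rewards + gamma * max_next_q * (1.0 - dones)
    loss = F.mse_loss(q_sa, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

def sync_target(q_net, target_net):
    """Every C steps: copy the online weights into the target network and keep them fixed."""
    target_net.load_state_dict(q_net.state_dict())
```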

Image under CC BY 4.0 from the Deep Learning Lecture.

Another trick they have been using is experience replay. Here, the idea is to reduce the correlation between the updates. So, after performing an action a_t on the current image stack and receiving the reward, you add this transition to the replay memory. You accumulate experiences in this replay memory and then you update the network with samples drawn randomly from this memory instead of taking the most recent ones. This way, you can stabilize training and at the same time not focus too much on one particular situation of the game. You try to keep in mind all of the different situations of the game, and this removes the dependence on the current weights and increases the stability. I have a small example for you.

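A replay memory of this kind can be sketched in a few lines; the capacity and batch size below are arbitrary placeholder values.

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-size buffer of (state, action, reward, next_state, done) transitions."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # old experiences fall out automatically

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # Draw transitions uniformly at random to break the correlation
        # between consecutive updates.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```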

240 hours of training helps. Image created using gifify. Source: YouTube

So, this is the Atari Breakout game, and you can see that the agent, in the beginning, is not performing very well. If you train it over several iterations, you can see that the game is played better. So, the system learns how to follow the ball with the paddle and is then able to reflect it. If you keep iterating, you could argue that at some point the reinforcement learning system also figures out the weaknesses of the game. In particular, one situation where you can score a really large number of points is if you manage to bring the ball behind the bricks and then have it bounce around there. It will be reflected by the boundaries and not by the paddle, and it will generate a large score. So, one could claim that the system has learned a good strategy: it tries to kick out only the bricks on the left-hand side and then get the ball into the region behind the other bricks.


Fast-forward of the game Lee Sedol vs. AlphaGo. Image created using gifify. Source: YouTube

Of course, we need to talk about AlphaGo in this video. We want to look into some of the details of how it’s actually implemented. You already heard about this one. It’s from the paper “Mastering the game of Go with deep neural networks and tree search”.


Image under CC BY 4.0 from the Deep Learning Lecture.

So, we already discussed that Go is a much harder problem than chess because it really has a large number of possible moves and also a large number of possible states that can potentially emerge. The idea is that black plays against white for control over the board. It has simple rules but an extremely high number of possible moves and situations. Achieving the performance of human players was thought to be years away because of the high numerical complexity of the problem. So, we could brute-force chess, but for Go people thought it would be impossible until we have much, much faster computers, orders of magnitude faster. Still, they could show that they can really beat human Go experts with the system. Now, Go is a perfect information game. There is no hidden information and no chance. So theoretically, we could construct a full game tree and traverse it with minimax to find the best moves. The problem is the high number of legal moves. In chess, you have approximately 35 per position; in Go, there are around 250 different moves that you can choose from at each step.


Image under CC BY 4.0 from the Deep Learning Lecture.

Also, a game may involve many moves, approximately a hundred and fifty. This means that exhaustive search is completely infeasible. Well, a search tree can, of course, be pruned if you have an accurate evaluation function. For chess, if you remember Deep Blue, this evaluation was already extremely complex and based on massive human input. For Go, the state of the art in 2002 was: “No simple yet reasonable evaluation will ever be found for Go.” Well, in 2016 and 2017, AlphaGo beat Lee Sedol and Ke Jie, two of the world’s strongest players. So, there is a way of tackling this game.

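To get a feeling for why exhaustive search is hopeless, here is a rough back-of-envelope estimate using the numbers from above (about 250 legal moves per position and about 150 moves per game). The chess game length of 80 moves is the commonly quoted Shannon-style figure and not from the lecture.

```python
import math

# Rough game-tree size: branching factor b raised to the game length d.
go_exponent = 150 * math.log10(250)   # ~ 360, i.e. on the order of 10^360 positions
chess_exponent = 80 * math.log10(35)  # ~ 123, i.e. on the order of 10^123 positions
print(f"Go ~ 10^{go_exponent:.0f}, chess ~ 10^{chess_exponent:.0f}")
```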

Image under CC BY 4.0 from the Deep Learning Lecture.

There were several very good ideas in this paper. It was developed by Silver et al., also at DeepMind, and it is a combination of multiple methods. They use, of course, deep neural networks. Then, they use Monte Carlo Tree Search, and they combine supervised learning and reinforcement learning. The first improvement compared to a full tree search was the Monte Carlo Tree Search. They use the networks to support an efficient search through the tree.


Image under CC BY 4.0 from the Deep Learning Lecture.

So, what’s Monte Carlo Tree Search? Well, you expand your tree by looking into different possible future moves, and you look into the moves that produce very valuable states. You expand these valuable states over a couple of moves into the future. Then, you also look at the value of the resulting states. So, you only look into a couple of valuable states and then expand over and over again for a couple of moves. Finally, you can find a situation where you probably have a much larger state value. So, you try to look a bit into the future and follow moves that are likely to produce a higher state value.


Image under CC BY 4.0 from the Deep Learning Lecture.

So, you start from the root node, which is the current state. Then, you iteratively extend the search tree to find the best future state. Here’s the algorithm: you start at the root and traverse with the tree policy to a leaf node. Then, you expand and add one or more child nodes to the current leaf, probably the ones that have valuable states. Next, you simulate episodes from the current node or the added child node, with actions chosen according to your rollout policy. So, you also need a policy in order to expand here. Then, you back up and propagate the received rewards backward through the tree. This allows you to find future states that have a large state value. You repeat this for a certain amount of time. Lastly, you stop and choose the action at the root node according to the accumulated statistics. For the next move, you start again with a new root node according to the action that your opponent has actually taken.

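To make these four steps concrete, here is a compact sketch of plain UCT Monte Carlo Tree Search with random rollouts. This is the generic algorithm, not the network-guided variant AlphaGo uses, and the game interface (legal_moves, play, is_terminal, result_for, player_just_moved) is a hypothetical one for illustration.

```python
import math
import random

class Node:
    """One search-tree node. `value` is accumulated from the point of view of
    the player who just moved into this node (negamax convention)."""
    def __init__(self, state, parent=None, move=None):
        self.state, self.parent, self.move = state, parent, move
        self.children = []
        self.visits = 0
        self.value = 0.0
        self.untried_moves = list(state.legal_moves())   # hypothetical game API

def uct_select(node, c=1.4):
    # Tree policy: trade off exploitation (mean value) against exploration
    # (rarely visited children receive a bonus).
    return max(node.children,
               key=lambda ch: ch.value / ch.visits
                              + c * math.sqrt(math.log(node.visits) / ch.visits))

def mcts(root_state, n_iterations=1000):
    root = Node(root_state)
    for _ in range(n_iterations):
        node = root
        # 1) Selection: walk down with the tree policy until a node with untried moves.
        while not node.untried_moves and node.children:
            node = uct_select(node)
        # 2) Expansion: add one child for a so far untried move.
        if node.untried_moves:
            move = node.untried_moves.pop()
            child = Node(node.state.play(move), parent=node, move=move)
            node.children.append(child)
            node = child
        # 3) Simulation: play out the episode with the rollout policy
        #    (here simply uniform random moves).
        state = node.state
        while not state.is_terminal():
            state = state.play(random.choice(state.legal_moves()))
        # 4) Backup: propagate the result through the tree, flipping the sign
        #    at each level because the two players alternate.
        reward = state.result_for(node.state.player_just_moved)
        while node is not None:
            node.visits += 1
            node.value += reward
            reward = -reward
            node = node.parent
    # Finally, choose the root action with the best accumulated statistics.
    return max(root.children, key=lambda ch: ch.visits).move
```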

Image under CC BY 4.0 from the Deep Learning Lecture.

So, the tree policy governs to what extent successful paths are reused and how frequently they are revisited. This is a typical exploration/exploitation trade-off. Well, the main problem here is, of course, that plain Monte Carlo Tree Search is not accurate enough for Go. The idea in AlphaGo was to control the tree expansion with a neural network to find promising actions and then to improve the value estimation with a neural network. This is much more efficient in terms of expansion and evaluation than searching the full tree, and this is what makes Go playable for the machine.

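As a hedged sketch of how the networks plug into the tree policy: instead of the plain UCT bonus in the sketch above, AlphaGo-style selection weights the exploration term with the policy network's prior probability for each move. It assumes each child node additionally stores a `prior` filled in by the policy network; the constant and normalization below are assumptions, not the paper's exact values.

```python
import math

def puct_select(node, c_puct=1.0):
    """Select the child maximizing Q(s,a) + u(s,a), where the exploration bonus u
    is scaled by the policy network's prior probability for that move."""
    total_visits = sum(ch.visits for ch in node.children)
    def score(ch):
        q = ch.value / ch.visits if ch.visits > 0 else 0.0
        u = c_puct * ch.prior * math.sqrt(total_visits) / (1 + ch.visits)
        return q + u
    return max(node.children, key=score)
```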

Image under CC BY 4.0 from the Deep Learning Lecture.

How do they use these deep neural networks? They have three different networks. They have a policy network that suggests the next move at a leaf node for the expansion. Then, they have a value network that looks at the current board situation and essentially computes the chances of winning. Lastly, they have a rollout policy network that guides the action selection during the rollouts. All of these networks are deep convolutional networks, and the input is the current board position plus additional pre-computed features.


Image under CC BY 4.0 from the Deep Learning Lecture.

So, here’s the policy network. It has 13 convolutional layers and one output for each point on the Go board. Then, a huge database of human expert moves, 30 million of them, was available. They start with supervised learning and train the network to predict the next move in human expert play. Then, they also train this network with reinforcement learning by playing against older versions of itself, with a reward for winning the game. Playing against the older versions, of course, avoids correlation and instability. If you look at the training time, it was three weeks on 50 GPUs for the supervised part and one day for the reinforcement learning. So, there is actually quite a bit of supervised learning involved here and not so much reinforcement learning.


Image under CC BY 4.0 from the Deep Learning Lecture.

Then, there’s the value network. It has the same architecture as the policy network, but just one output node. The goal here is to predict the probability of winning the game. They train again on self-play games from the reinforcement learning stage and use Monte Carlo policy evaluation on 30 million positions from these games. Training time was one week on 50 GPUs.


Image under CC BY 4.0 from the Deep Learning Lecture.

Then, they have the rollout policy network that is used to select the moves during the rollouts. Of course, the problem here is that the inference time of the full policy network is comparatively high, and the solution was to train a simpler, linear network on a subset of the data that provides actions very quickly. This led to a speed-up of approximately a factor of one thousand compared to the policy network. So, if you work with this rollout policy network, you have a slimmer network, but it’s much faster. Hence, you can run more simulations and collect more experience. That is why they use this rollout policy network.


Image under CC BY 4.0 from the Deep Learning Lecture.

Now, there was quite a bit of supervised learning involved here. So, let’s have a look at AlphaGo Zero. AlphaGo Zero doesn’t need human play anymore. The idea here is that you train solely with reinforcement learning and self-play. It has a simpler Monte Carlo Tree Search and no rollout policy network within it. Also, for the self-play games, they introduced multi-task learning: the policy and value networks share the initial layers. This then led to [3], and the extensions are also able to play chess and shogi. So, it’s not just a system that can solve Go; with this, you can also play chess and shogi at an expert level. Okay, this sums up what we’ve been doing in reinforcement learning. Of course, we could look at many other things here. However, there is just not enough time.

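As a sketch of this multi-task idea, a shared convolutional trunk can feed both a policy head and a value head. The layer sizes, channel counts, and the number of input planes below are placeholders of my own choosing, not the ones from [3].

```python
import torch
import torch.nn as nn

class PolicyValueNet(nn.Module):
    """Shared convolutional trunk with two heads: a policy over board points
    (plus pass) and a scalar value estimating the probability of winning."""

    def __init__(self, board_size=19, channels=64, in_planes=17):
        super().__init__()
        self.trunk = nn.Sequential(                      # layers shared by both tasks
            nn.Conv2d(in_planes, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
        )
        n_points = board_size * board_size
        self.policy_head = nn.Sequential(
            nn.Conv2d(channels, 2, 1), nn.ReLU(), nn.Flatten(),
            nn.Linear(2 * n_points, n_points + 1),       # one logit per point plus pass
        )
        self.value_head = nn.Sequential(
            nn.Conv2d(channels, 1, 1), nn.ReLU(), nn.Flatten(),
            nn.Linear(n_points, 64), nn.ReLU(),
            nn.Linear(64, 1), nn.Tanh(),                 # value in [-1, 1]
        )

    def forward(self, board):
        x = self.trunk(board)
        return self.policy_head(x), self.value_head(x)

# Usage: one forward pass yields both the move probabilities and the value estimate.
net = PolicyValueNet()
policy_logits, value = net(torch.zeros(1, 17, 19, 19))
```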

Image under CC BY 4.0 from the Deep Learning Lecture.

Next time in deep learning, we want to talk about algorithms that don’t even need rewards. So, completely unsupervised training, and we also want to learn how to benefit from adversaries. We will see that there’s a very cool concept out there called generative adversarial networks, which is able to generate all kinds of different images. It is also a very cool concept that we’ll talk about in one of the next videos. Then, we look into extensions for performing image processing tasks. So, we move more and more towards the applications.


Image under CC BY 4.0 from the Deep Learning Lecture.

Well, some comprehensive questions: What is a policy? What are value functions? Explain the exploitation versus exploration dilemma, and so on. If you’re interested in reinforcement learning, I can definitely recommend having a look at the book Reinforcement Learning by Richard Sutton. It’s really a great book, and you will learn in high detail about all the things that we could only scratch the surface of in these videos. So, you see that you can go much deeper into all of the details of reinforcement learning and also deep reinforcement learning. There’s actually much more to say about this at this point, but we can only remain at this level for the time being. Well, I also brought you the link and put it into the video description. So please enjoy this book; it’s very good and, of course, we have plenty of further references.


So, thank you very much for listening, and I hope that you can now understand at least a bit of what is happening in reinforcement learning and deep reinforcement learning, and what the main ideas are in order to perform learning of games. Thank you very much for watching this video and I hope to see you in the next one. Bye-bye!


If you liked this post, you can find more essays here, more educational material on Machine Learning here, or have a look at our Deep Learning Lecture. I would also appreciate a follow on YouTube, Twitter, Facebook, or LinkedIn in case you want to be informed about more essays, videos, and research in the future. This article is released under the Creative Commons 4.0 Attribution License and can be reprinted and modified if referenced. If you are interested in generating transcripts from video lectures try AutoBlog.


Links

Link to Sutton’s Reinforcement Learning in its 2018 draft, including deep Q-learning and AlphaGo details


Translated from: https://towardsdatascience.com/reinforcement-learning-part-5-70d10e0ca3d9
