Reinforcement Learning: Part 4

FAU Lecture Notes on Deep Learning

These are the lecture notes for FAU's YouTube Lecture "Deep Learning". This is a full transcript of the lecture video and the matching slides. We hope you enjoy this as much as the videos. Of course, this transcript was created largely automatically with deep learning techniques, and only minor manual modifications were performed. Try it yourself! If you spot mistakes, please let us know!

Sonic the Hedgehog has also been studied with reinforcement learning. Image created using gifify. Source: YouTube.

Welcome back to deep learning! Today we want to discuss a couple of reinforcement learning approaches other than the policy iteration concept that you've seen in the previous video. So let's have a look at what I've got for you today. We will look at other solution methods.

Image under CC BY 4.0 from the Deep Learning Lecture.

The policy and value iteration that we discussed earlier require a policy that is updated during learning to obtain better approximations of the optimal state-value function. These are called on-policy algorithms because they need a policy, and this policy is being updated. Additionally, we assumed that the state transition and the reward are known, so the probability density functions that produce the new states and the new rewards are known. If they are not, you can't apply the previous concept. This is very important, and of course there are methods that relax this requirement. These methods mostly differ in how they perform the policy evaluation. So, let's look at a couple of those alternatives.

The first one that I want to show you is based on Monte Carlo techniques. This applies only to episodic tasks. Here, the idea is off-policy: you learn the optimal state value by following an arbitrary policy. It doesn't matter which policy you use, and it could even be multiple policies. Of course, you still have the exploration/exploitation dilemma, so you want to choose policies that really visit all of the states. You don't need information about the dynamics of the environment because you can simply run many episodes and try to reach all of the possible states. You generate an episode using some policy. Then, you loop backward over this episode and accumulate the future reward. Because you have played the game until the end, you can go backward in time over the episode and accumulate the rewards that have been obtained. If a state was not yet visited earlier in the episode, you append the accumulated return to a list for that state, and you use this list to compute the update for the state-value function: the new state value is simply the average over the list for that specific state. This allows you to update your state values, and you can iterate this in order to approach the optimal state-value function.
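The backward loop over one episode can be sketched in a few lines of Python. This is only a minimal sketch under assumptions that are not part of the lecture: the toy corridor task in `generate_episode`, the `random_policy`, and the discount factor `gamma` are hypothetical and chosen for illustration. Note that the sketch estimates the state values of the policy that generated the episodes.

```python
import random
from collections import defaultdict

def generate_episode(policy, n_states=5, max_steps=100, rng=random):
    """Hypothetical toy task: a corridor with states 0..n_states-1.
    The agent starts in the middle; reaching the right end gives reward 1."""
    state = n_states // 2
    episode = []                               # list of (state, reward) pairs
    for _ in range(max_steps):
        action = policy(state, rng)            # -1 (left) or +1 (right)
        next_state = state + action
        reward = 1.0 if next_state == n_states - 1 else 0.0
        episode.append((state, reward))
        state = next_state
        if state in (0, n_states - 1):         # both ends are terminal
            break
    return episode

def random_policy(state, rng):
    """Arbitrary behaviour policy (assumption): move left or right at random."""
    return rng.choice((-1, 1))

def first_visit_mc(policy, n_episodes=5000, gamma=1.0, seed=0):
    rng = random.Random(seed)
    returns = defaultdict(list)                # state -> list of observed returns
    values = {}
    for _ in range(n_episodes):
        episode = generate_episode(policy, rng=rng)
        states = [s for s, _ in episode]
        g = 0.0
        # Loop backward over the episode and accumulate the future reward.
        for t in range(len(episode) - 1, -1, -1):
            state, reward = episode[t]
            g = gamma * g + reward
            if state not in states[:t]:        # first visit within this episode
                returns[state].append(g)
                values[state] = sum(returns[state]) / len(returns[state])
    return values
```

Calling `first_visit_mc(random_policy)` returns a dictionary that maps each non-terminal state to its estimated value. For the random walk on this corridor, the estimates should lie close to 0.25, 0.5, and 0.75 for states 1, 2, and 3.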

Now, another concept is temporal difference learning. This is an on-policy method. Again, it does not need information about the dynamics of the environment. Here, the scheme is that you loop and follow a certain policy. You use an action from the policy to observe the reward and the new state. Then you update your state-value function: the new value of the old state is its previous value plus α, which weights the influence of the new observation, times the temporal difference, that is, the new reward plus the discounted state value of the new state minus the value of the old state. In symbols, V(s) ← V(s) + α [r + γ V(s') - V(s)], where γ is the discount factor. This way, you can generate updates, and this actually converges to the optimal solution. A variant of this estimates the action-value function and is known as SARSA.
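The temporal difference update can be illustrated with a short Python sketch. The corridor environment in `step`, the `random_policy`, and the values of `alpha` and `gamma` are assumptions made for this example only and do not come from the lecture.

```python
import random

N_STATES = 5                        # hypothetical corridor: states 0..4
TERMINALS = (0, N_STATES - 1)       # both ends terminate the episode

def step(state, action):
    """Toy dynamics (assumption): reward 1 only when the right end is reached."""
    next_state = state + action
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward, next_state in TERMINALS

def random_policy(state, rng):
    """Policy to be evaluated (assumption): move left or right at random."""
    return rng.choice((-1, 1))

def td0(policy, n_episodes=5000, alpha=0.01, gamma=1.0, seed=0):
    rng = random.Random(seed)
    values = [0.0] * N_STATES       # terminal values stay at 0
    for _ in range(n_episodes):
        state, done = N_STATES // 2, False
        while not done:
            action = policy(state, rng)
            next_state, reward, done = step(state, action)
            # New reward plus the discounted value of the new state ...
            td_target = reward + gamma * values[next_state]
            # ... minus the value of the old state, weighted by alpha.
            values[state] += alpha * (td_target - values[state])
            state = next_state
    return values
```

In contrast to the Monte Carlo sketch, the update happens after every single step, so the episode does not have to be finished before learning starts. With a constant `alpha`, the estimates keep fluctuating slightly around the values of the evaluated policy.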

Q learning is an off-policy method. It's a temporal difference type of method, and it also does not require information about the dynamics of the environment. Here, the idea is that you loop and follow a policy derived from your action-value function. For example, you could use an ε-greedy type of approach. Then, you use the action from the policy to observe your reward and your new state. Next, you update your action-value function: the new action value is the previous action value plus a weighting factor times the observed reward plus the discounted maximum action value that you currently know for the new state, minus the previous action value. In symbols, Q(s, a) ← Q(s, a) + α [r + γ max over a' of Q(s', a') - Q(s, a)]. So it's again a kind of temporal difference that you are using here in order to update your action-value function.
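A tabular version of this update fits into a short Python sketch. As before, the corridor task, the helper `epsilon_greedy`, and the hyperparameters `alpha`, `gamma`, and `epsilon` are hypothetical choices for illustration, not values from the lecture.

```python
import random

N_STATES = 5                        # hypothetical corridor: states 0..4
ACTIONS = (-1, +1)                  # index 0 = left, index 1 = right

def epsilon_greedy(q_row, epsilon, rng):
    """Random action with probability epsilon, otherwise a greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))
    best = max(q_row)
    # Break ties at random so that early episodes still explore.
    return rng.choice([a for a, q in enumerate(q_row) if q == best])

def q_learning(n_episodes=2000, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]    # q[state][action index]
    for _ in range(n_episodes):
        state, done = N_STATES // 2, False
        while not done:
            action = epsilon_greedy(q[state], epsilon, rng)
            next_state = state + ACTIONS[action]
            reward = 1.0 if next_state == N_STATES - 1 else 0.0
            done = next_state in (0, N_STATES - 1)
            # Off-policy target: best known action value in the new state,
            # regardless of which action the behaviour policy takes next.
            td_target = reward + gamma * max(q[next_state])
            q[state][action] += alpha * (td_target - q[state][action])
            state = next_state
    return q
```

The `max` in the target is what makes the method off-policy: the agent behaves ε-greedily but learns about the greedy policy. In this toy task, the learned greedy action in every non-terminal state should be "right".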

Well, if you have universal function approximators, what about just parameterizing your policy with weights w and some loss function? This is known as the policy gradient. This instance is called REINFORCE. You generate an episode using your policy and your weights. Then, you go forward in your episode from time 0 to time T - 1, where T is the length of the episode. At each step, you can compute the gradient with respect to the weights, and you use this gradient to update your weights, in a very similar way to what we have previously seen in our learning approaches. Using the gradient over the policy gives you an idea of how to update the weights, again with a learning rate. We are really close to our machine learning ideas from earlier now.
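To make the weight update concrete, here is a minimal REINFORCE sketch in plain Python. It assumes a softmax policy with one logit per state and action as the weights w, the same hypothetical corridor task as a stand-in environment, and illustrative values for `lr` and `gamma`. It also uses the variant of the update that scales each step by `gamma ** t`. None of these choices are taken from the lecture.

```python
import math
import random

N_STATES = 5                        # hypothetical corridor: states 0..4
ACTIONS = (-1, +1)                  # index 0 = left, index 1 = right

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def run_episode(weights, rng, max_steps=200):
    """Sample one episode; returns a list of (state, action index, reward)."""
    state, trajectory = N_STATES // 2, []
    for _ in range(max_steps):
        probs = softmax(weights[state])
        action = 0 if rng.random() < probs[0] else 1
        next_state = state + ACTIONS[action]
        reward = 1.0 if next_state == N_STATES - 1 else 0.0
        trajectory.append((state, action, reward))
        state = next_state
        if state in (0, N_STATES - 1):
            break
    return trajectory

def reinforce(n_episodes=3000, lr=0.1, gamma=0.9, seed=0):
    rng = random.Random(seed)
    weights = [[0.0, 0.0] for _ in range(N_STATES)]   # policy logits w
    for _ in range(n_episodes):
        trajectory = run_episode(weights, rng)
        # Returns G_t, computed backward once the episode is complete.
        returns, g = [0.0] * len(trajectory), 0.0
        for t in range(len(trajectory) - 1, -1, -1):
            g = gamma * g + trajectory[t][2]
            returns[t] = g
        # Go forward from t = 0 to T - 1 and follow the policy gradient.
        for t, (state, action, _) in enumerate(trajectory):
            probs = softmax(weights[state])
            for b in range(len(ACTIONS)):
                # Gradient of ln pi(action | state) w.r.t. the logit of b.
                grad_log_pi = (1.0 if b == action else 0.0) - probs[b]
                weights[state][b] += lr * (gamma ** t) * returns[t] * grad_log_pi
    return weights
```

After training, `softmax(weights[s])` gives the action probabilities in state `s`. Because only episodes that reach the rewarding right end produce non-zero updates, the policy should gradually shift its probability mass toward moving right.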

This is why we talk about deep Q learning in the next video, which is a kind of deep learning version of reinforcement learning. So, I hope you liked this video. You've now seen other options for how you can determine the optimal state-value and action-value functions. We have seen that there are many different ideas that no longer require exact knowledge of how future states and future rewards are generated. With these ideas, you can also do reinforcement learning, in particular with the idea of the policy gradient. We've seen that this is very much compatible with what we've seen earlier in this class regarding our machine learning and deep learning methods. We will talk about exactly this idea in the next video. So thank you very much for listening and see you in the next video. Bye-bye!

Sonic is still a challenge for today’s reinforcement learning methods. Image created using gifify. Source: YouTube

If you liked this post, you can find more essays here, more educational material on Machine Learning here, or have a look at our Deep Learning Lecture. I would also appreciate a follow on YouTube, Twitter, Facebook, or LinkedIn in case you want to be informed about more essays, videos, and research in the future. This article is released under the Creative Commons 4.0 Attribution License and can be reprinted and modified if referenced. If you are interested in generating transcripts from video lectures, try AutoBlog.

Links

Link to Sutton's Reinforcement Learning in its 2018 draft, including Deep Q learning and Alpha Go details

Translated from: https://towardsdatascience.com/reinforcement-learning-part-4-3c51edd8c4bf
