The Fundamentals of Reinforcement Learning

Reinforcement learning is probably the scientific approach that most closely resembles the way humans learn about things. Every day we, as learners, learn by interacting with our environment: we find out what to do in certain situations, what the consequences of our actions are, and so on.

When we were babies, we didn’t know that touching a hot kettle would hurt our hand. However, once we learned how the environment responded to our action, i.e. that touching the hot kettle hurts, we learned not to touch a hot kettle. This illustrates the fundamental idea behind reinforcement learning.

Reinforcement learning is learning what to do in order to maximize a numerical reward. This means that the learner should discover, by trying them, which actions yield the highest reward in the long run.

In this article, I want to discuss the fundamentals behind reinforcement learning, which include: the Markov decision process, policies, value functions, the Bellman equation, and, of course, dynamic programming.

Markov Decision Process

The Markov decision process is the fundamental problem that we try to solve in reinforcement learning. But what is the definition of a Markov decision process?

A Markov Decision Process, or MDP, is a formulation of the sequential interaction between an agent and its environment.

Here, the learner and decision maker is called the agent, and the thing it interacts with is called the environment. In an MDP, the agent makes certain decisions or actions, and the environment responds by giving the agent a new situation or state and an immediate reward.

Agent-environment interaction in Markov Decision Process

In reinforcement learning, the main goal of the agent is to make decisions or actions that maximize the total amount of reward it receives from the environment in the long run.

Let’s say we want to train a robot to play the game of chess. Each time the robot wins a game, the reward is +1, and if it loses a game, the reward is -1. In another example, if we want to train a robot to escape from a maze, the reward decreases by 1 for every time step that passes before the escape.

The reward in reinforcement learning is how you communicate to the agent what you want it to achieve, not how you want it achieved.

Now the question is: how do we compute the cumulative amount of reward that the agent has gathered after a sequence of actions? The mathematical formulation of the cumulative reward is defined as follows.
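
The formulation referred to here is not shown in this text; the standard definition of the return, consistent with the description in the next paragraph, is:

G_t = R_{t+1} + R_{t+2} + R_{t+3} + … + R_T

where T is the final time step of the episode.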

Above, R is the reward at each step of the sequence of actions made by the agent, and G is the cumulative reward, or expected return. The goal of the agent in reinforcement learning is to maximize this expected return G.

Discounted Expected Return

However, the equation above only applies when we have an episodic MDP problem, meaning that the sequence of agent-environment interactions is episodic, or finite. What if we have a situation where the interaction between the agent and the environment is continuing and infinite?

Suppose we have a problem where the agent works like an air conditioner: its task is to adjust the temperature given certain situations or states. In this problem:

  • States: the current temperature, the number of people in the room, and the time of day.

  • Action: Increase or decrease the room temperature.

  • Reward: -1 if a person in the room needs to manually adjust the temperature and 0 otherwise.

In order to avoid negative rewards, the agent needs to learn and interact with the environment continuously, meaning that there is no end to the MDP sequence. To handle this kind of continuing task, we can use the discounted expected return.
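
The discounted version of the return is not shown in this text either; its standard form, which the following paragraphs describe, is:

G_t = R_{t+1} + γ R_{t+2} + γ² R_{t+3} + … = Σ_{k=0}^{∞} γ^k R_{t+k+1}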

In the equation above, γ is the discount rate, and its value should be within the range 0 ≤ γ ≤ 1.

The intuition behind this discount rate is that a reward the agent receives at an earlier step is worth more than the same reward received several steps later. This assumption also makes sense in real life: 1 Euro today is worth more than 1 Euro several years in the future because of inflation.

If γ = 0, the agent is short-sighted: it only cares about the immediate reward of its action in the next step.

If γ is closer to 1, the agent is far-sighted: it puts more and more weight on future rewards.

With the discounted return, as long as the reward is bounded and γ < 1, the expected return is no longer infinite, even for a continuing task.
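
As a quick sanity check (a worked example, not taken from the original article): if the agent receives a constant reward of 1 at every step and γ = 0.9, then G = 1 + 0.9 + 0.9² + … = 1 / (1 − 0.9) = 10, a finite number, whereas the undiscounted sum would diverge.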

Policy, Value Function, and Bellman Equation

Now we know that the goal of an agent in reinforcement learning is to maximize the cumulative reward. In order to do so, the agent needs to choose which action to take in a given state such that it gets a high cumulative reward. The probability of an agent choosing a certain action in a given state is called the policy.

A policy is the probability of an agent selecting an action A in a given state S.

In reinforcement learning, a policy is normally represented by π. This means that π(A|S) is the probability of the agent choosing action A given that it is in state S.

Now, if an agent is in state S and it follows policy π, then its expected return is called the state-value function for policy π. Thus, the state-value function is normally denoted as v_π(s).

Similar to the state-value function, if an agent is in state S and it determines its next action based on policy π, then its expected return given that it takes action A is called the action-value function for policy π. Thus, the action-value function is normally denoted as q_π(s, a).
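
Written out formally, with the standard notation assumed above, the two value functions are:

v_π(s) = E_π[ G_t | S_t = s ]
q_π(s, a) = E_π[ G_t | S_t = s, A_t = a ]

i.e. the expected return when starting from state s (and, for q_π, also taking action a) and following policy π afterwards.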

In some sense, the value function and the reward have some similarities. However, a reward refers to what is good in an immediate sense, while a value function refers to what is good in the long run. Thus, a state might have a low immediate reward but a high value, because it is regularly followed by other states that yield high rewards.

To compute the value function, the Bellman equation is commonly applied.

In reinforcement learning, the Bellman equation works by relating the value function of the current state to the values of future states.

Mathematically, the Bellman equation can be written as follows.
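
The equation itself is not reproduced in this text; in the standard notation used above, the Bellman equation for the state-value function reads:

v_π(s) = Σ_a π(a|s) Σ_{s', r} p(s', r | s, a) [ r + γ v_π(s') ]

where p(s', r | s, a) is the environment dynamics: the probability of landing in state s' with reward r after taking action a in state s.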

As you can see from the equation above, what the Bellman equation expresses is an average over all possible next states and future rewards in any given state, weighted by the environment dynamics p.

To make it easier to understand how the Bellman equation works intuitively, let’s relate it to an everyday situation.

Let’s say that two months ago, you learned how to ride a bike for the first time. One day while riding, the bike lost its balance when you pulled the brake on a surface covered with sand, and you slipped and got injured. This means that you got a negative reward from this experience.

One week later, you rode the bike again. When you rode over a sandy surface, you slowed down. You did this because you knew that if the bike lost its balance, something bad would happen, even though this time you didn’t actually experience it: the value of the “riding on sand” state already reflects the bad outcomes that tend to follow it, which is exactly the relation the Bellman equation captures.

Optimal Policy and Optimal Value Function

Whenever we try to solve a reinforcement learning task, we want the agent to choose actions that maximize the cumulative reward. To achieve this, the agent should follow a policy that maximizes the value function we have just discussed. The policy that maximizes the value function in all of the states is called the optimal policy, and it is normally denoted as π*.

To understand the optimal policy better, let’s take a look at the following illustration.

Optimal policy definition

As shown in the illustration above, we can say that π' is an optimal policy compared to π because the value function at any given state under policy π' is as good as or better than under policy π.

If we have an optimal policy, then we can actually rewrite the Bellman equation into the following:

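If we follow the optimal policy, the policy-weighted average over actions can be replaced by a maximum over actions; in the standard notation used above, the derivation ends with:

v_*(s) = max_a q_*(s, a)
       = max_a Σ_{s', r} p(s', r | s, a) [ r + γ v_*(s') ]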

The final equation above is called the Bellman optimality equation. Note that in its final form there is no reference to any specific policy π. The Bellman optimality equation basically tells us that the value of a state under an optimal policy must equal the expected return of the best action from that state.

From the equation above, it is straightforward to find the optimal state-value function once we know the optimal policy. However, in real life, we often don’t know what the optimal policy is.

To find the optimal policy, the dynamic programming algorithm is normally applied. With dynamic programming, the state-value function of each state will be evaluated iteratively until we find the optimal policy.

Dynamic Programming to Find the Optimal Policy

Now let’s dive into the theory behind dynamic programming for finding the optimal policy. At its core, the dynamic programming algorithm uses the Bellman equation iteratively to do two things:

  • Policy evaluation
  • Policy improvement

Policy evaluation is the step that evaluates how good a given policy is. In this step, the state-value function for an arbitrary policy π is computed. We have seen that the Bellman equation actually helps us to compute the state-value function through a system of linear equations, as follows:
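
The system of equations referred to above is not reproduced here; dynamic programming solves it by turning the Bellman equation into the standard update rule:

v_{k+1}(s) = Σ_a π(a|s) Σ_{s', r} p(s', r | s, a) [ r + γ v_k(s') ]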

With dynamic programming, the state-value function is approximated iteratively based on the Bellman equation until the value function converges in every state. The converged approximation can then be taken as the value function v_π of the given policy π.

Iterative policy evaluation

After we have found the value function v_π of a given policy, we need to improve the policy. Recall that a policy is optimal if and only if its value function is as good as or better than that of any other policy in every state. With policy improvement, a new policy that is as good as or strictly better in any given state can be generated:
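
The improvement step referred to in the next paragraph is the greedy update; in the standard notation used above it is:

π'(s) = argmax_a Σ_{s', r} p(s', r | s, a) [ r + γ v_π(s') ]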

Notice that in the above equation, we use the v_π that we computed in the policy evaluation step to improve the policy. If the policy doesn’t improve after we apply this update, it means that we have found the optimal policy.

Overall, these two steps, policy evaluation and policy improvement, are done iteratively using dynamic programming. First, under any given policy, the corresponding value function is computed. Then, the policy is improved. With the improved policy, the next value function is computed, and so on. If a policy no longer improves compared to the previous iteration, it means that we have found the optimal policy for our problem.

Iterative policy evaluation and policy improvement in dynamic programming

Implementation of Dynamic Programming to Find the Optimal Policy

Now that we know the theory behind dynamic programming and the optimal policy, let’s implement it in code with a simple use case.

Suppose that we want to manage the increasing demand for parking space in a city. To do so, we control the price of the parking system depending on the city’s preference. In general, the city council takes the view that the more parking space is being used, the higher the social welfare. However, the city council also prefers that at least one spot be left unoccupied for emergency use.

We can define the use case above as a Markov Decision Process (MDP) with:

  • State: the number of occupied parking spaces.

  • Action: the parking price.

  • Reward: city’s preference for the situation.

For this example, let’s assume that there are ten parking spots and four different price ranges. This means that we have eleven states (10 plus 1, because there can be a situation where no parking space is occupied) and four actions.

To find the optimal policy for this use case, we can use dynamic programming with the Bellman optimality equation. First, we evaluate the policy, and then we improve it. We do these two steps iteratively until the result converges.

As a first step, let’s define a function to compute the Bellman optimality equation as shown below.

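The original code is not included in this text, so the sketch below is a reconstruction under explicit assumptions: the transition model, the reward values, and all names (NUM_SPOTS, q_values, and so on) are hypothetical placeholders rather than the article’s actual implementation. The q_values function evaluates the bracketed term of the Bellman (optimality) equation, i.e. the expected return of each price action from a given state:

```python
import numpy as np

# Hypothetical parking MDP: 11 states (0..10 occupied spots), 4 price levels.
NUM_SPOTS = 10
STATES = np.arange(NUM_SPOTS + 1)   # number of occupied spots
ACTIONS = np.arange(4)              # price levels, 0 = cheapest
GAMMA = 0.9                         # discount rate

def reward(state):
    """City preference: more occupancy is better, but a full lot is penalized."""
    return float(state) if state < NUM_SPOTS else NUM_SPOTS / 2.0

def transitions(state, action):
    """Toy dynamics (an assumption, not the article's model): higher prices make
    it more likely that occupancy drops at the next step.
    Returns a list of (next_state, probability) pairs."""
    p_up = max(0.1, 0.8 - 0.2 * action)   # probability occupancy increases
    p_down = 1.0 - p_up                   # probability occupancy decreases
    up = min(state + 1, NUM_SPOTS)
    down = max(state - 1, 0)
    return [(up, p_up), (down, p_down)]

def q_values(V, state):
    """Bellman backup for every action in `state`:
    sum over next states of p(s'|s,a) * (reward(s') + gamma * V(s'))."""
    return [sum(prob * (reward(next_s) + GAMMA * V[next_s])
                for next_s, prob in transitions(state, action))
            for action in ACTIONS]
```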

The function above evaluates the right-hand side of the Bellman optimality equation at any given state: taking the maximum over the returned action values gives the state’s value under the best action.

Next, let’s define a function to improve the policy. We can improve the policy by greedifying it: we transform the policy so that it assigns probability 1 to the action that maximizes the value function at a given state.
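
Continuing the same hypothetical sketch (again, not the article’s actual code), greedifying the policy at a single state can look like this, reusing q_values from the previous block:

```python
def greedify(V, pi, state):
    """Make the policy deterministic in `state`: probability 1 on the action
    with the highest one-step lookahead value, 0 on all the others."""
    best_action = int(np.argmax(q_values(V, state)))
    pi[state] = np.zeros(len(ACTIONS))
    pi[state][best_action] = 1.0
```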

Finally, we can wrap the policy evaluation and policy improvement into one function.

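One possible wrapper, again building on the hypothetical helpers above rather than the article’s code, alternates policy evaluation sweeps with greedy policy improvement until the policy stops changing:

```python
def policy_iteration(theta=1e-6):
    """Policy iteration: evaluate the current policy until the value function
    converges, then greedify the policy; stop when the policy is stable."""
    V = np.zeros(len(STATES))
    pi = np.ones((len(STATES), len(ACTIONS))) / len(ACTIONS)  # start uniform

    policy_stable = False
    while not policy_stable:
        # Policy evaluation: sweep all states until V stops changing under pi.
        while True:
            delta = 0.0
            for s in STATES:
                v_old = V[s]
                V[s] = float(np.dot(pi[s], q_values(V, s)))
                delta = max(delta, abs(v_old - V[s]))
            if delta < theta:
                break
        # Policy improvement: make pi greedy with respect to V.
        old_pi = pi.copy()
        for s in STATES:
            greedify(V, pi, s)
        policy_stable = np.array_equal(old_pi, pi)
    return V, pi

V, pi = policy_iteration()
print(V)                      # state values for 0..10 occupied spots
print(np.argmax(pi, axis=1))  # chosen price level in each state
```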

Now, if we run the complete procedure, we obtain a value function and a policy for the parking problem. The original article reports the following result.

From that result, we can see that the value function increases as the number of occupied parking spaces increases, except when all of the parking spaces are occupied. This is fully expected given the city council’s preference described in the use case.

The city council takes the view that the more the parking space is used, the higher the social welfare, and it prefers to have at least one parking space left unoccupied. Thus, the more parking spots are occupied, the higher the value function, except in the last (fully occupied) state.

Also note that when the parking occupancy is high (states nine and ten), the chosen action changes from the lowest price level to the highest price level in order to avoid full occupancy.

Translated from: https://towardsdatascience.com/the-fundamentals-of-reinforcement-learning-177dd8626042
