Applying Reinforcement Learning to Combinatorial Optimization Problems

by Sterling Osborne, PhD Researcher

How to apply Reinforcement Learning to real life planning problems

Recently, I have published some examples where I have created Reinforcement Learning models for some real life problems. For example, using Reinforcement Learning for Meal Planning based on a Set Budget and Personal Preferences.

Reinforcement Learning can be used in this way for a variety of planning problems including travel plans, budget planning and business strategy. The two advantages of using RL are that it takes into account the probability of outcomes and allows us to control parts of the environment. Therefore, I decided to write a simple example so others may consider how they could start using it to solve some of their day-to-day or work problems.

What is Reinforcement Learning?

Reinforcement Learning (RL) is the process of testing which actions are best for each state of an environment, essentially by trial and error. The model starts with a random policy, and each time an action is taken a value (known as a reward) is fed back to the model. This continues until an end goal is reached, e.g. you win or lose the game, at which point that run (or episode) ends and the game resets.

As the model goes through more and more episodes, it begins to learn which actions are more likely to lead us to a positive outcome. Therefore, it finds the best action to take in any given state; this mapping from states to actions is known as the optimal policy.

Many of the RL applications online train models on a game or virtual environment where the model is able to interact with the environment repeatedly. For example, you let the model play a simulation of tic-tac-toe over and over so that it observes success and failure of trying different moves.

In real life, it is likely we do not have access to train our model in this way. For example, a recommendation system in online shopping needs a person’s feedback to tell us whether it has succeeded or not, and this is limited in its availability based on how many users interact with the shopping site.

Instead, we may have sample data that shows shopping trends over a time period that we can use to create estimated probabilities. Using these, we can create what is known as a Partially Observed Markov Decision Process (POMDP) as a way to generalise the underlying probability distribution.

Partially Observed Markov Decision Processes (POMDPs)

Markov Decision Processes (MDPs) provide a framework for modeling decision making in situations where outcomes are partly random and partly under the control of a decision maker. The key feature of MDPs is that they follow the Markov Property; all future states are independent of the past given the present. In other words, the probability of moving into the next state is only dependent on the current state.

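As a small aside, the Markov Property can be written compactly. The statement below is the standard form (it is not taken from the original article), with S_t denoting the state at step t:

```latex
% Markov Property: the next state depends only on the current state.
P(S_{t+1} = s' \mid S_t = s_t, S_{t-1} = s_{t-1}, \dots, S_0 = s_0)
  = P(S_{t+1} = s' \mid S_t = s_t)
```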

POMDPs work similarly, except that they are a generalisation of MDPs. In short, this means the model cannot simply interact with the environment but is instead given a set probability distribution based on what we have observed. More info can be found here. We could use value iteration methods on our POMDP, but instead I've decided to use Monte Carlo Learning in this example.

Example Environment

Imagine you are back at school (or perhaps still are) and are in a classroom. The teacher has a strict policy on paper waste and requires that any piece of scrap paper must be passed to him at the front of the classroom, where he will place the waste into the bin (trash can).

However, some students in the class care little for the teacher’s rules and would rather save themselves the trouble of passing the paper round the classroom. Instead, these troublesome individuals may choose to throw the scrap paper into the bin from a distance. Now this angers the teacher and those that do this are punished.

This introduces a very basic action-reward concept, and we have an example classroom environment as shown in the following diagram.

Our aim is to find the best instructions for each person so that the paper reaches the teacher and is placed into the bin, rather than being thrown into it from a distance.

States and Actions

In our environment, each person can be considered a state, and they have a variety of actions they can take with the scrap paper. They may choose to pass it to an adjacent classmate, hold onto it, or throw it into the bin. We can therefore map our environment to a more standard grid layout as shown below.

This is purposefully designed so that each person, or state, has four actions: up, down, left or right, and each action will have a varied 'real life' outcome based on who took it. An action that points into a wall (including the black block in the middle) means that the person holds onto the paper. In some cases this means an action is duplicated, but that is not an issue in our example.

For example, person A’s actions result in:

  • Up = Throw into bin
  • Down = Hold onto paper
  • Left = Pass to person B
  • Right = Hold onto paper
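As a small illustrative sketch only (the actions and outcomes are taken from the list above, but the dictionary layout and labels are my own assumptions), person A's actions could be encoded as follows:

```python
# Hypothetical encoding of person A's actions in the classroom example.
# Outcomes: "bin" = thrown into the bin, "hold" = keeps the paper,
# "B" = passed to person B.
actions_for_A = {
    "up": "bin",      # Up = throw into bin
    "down": "hold",   # Down = hold onto paper
    "left": "B",      # Left = pass to person B
    "right": "hold",  # Right = hold onto paper
}
```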

Probabilistic Environment

For now, the decision maker that partly controls the environment is us. We will tell each person which action they should take. This is known as the policy.

The first challenge I face in my learning is understanding that the environment is likely probabilistic and what this means. In a probabilistic environment, when we instruct a state to take an action under our policy, there is only a probability that the instruction is actually followed. In other words, if we tell person A to pass the paper to person B, they can decide not to follow the instructed action in our policy and instead throw the scrap paper into the bin.

Another example: if we are recommending online shopping products, there is no guarantee that the person will view each one.

Observed Transitional Probabilities

To find the observed transitional probabilities, we need to collect some sample data about how the environment acts. Before we collect information, we first introduce an initial policy. To start the process, I have randomly chosen one that looks as though it would lead to a positive outcome.

Now we observe the actions each person takes given this policy. In other words, say we sat at the back of the classroom and simply watched the class, and observed the following results for person A:

We see that a piece of paper passed through this person 20 times; 6 times they kept hold of it, 8 times they passed it to person B and another 6 times they threw it in the trash. This means that under our initial policy, the probability of this person keeping hold of it or throwing it in the trash is 6/20 = 0.3 each, and likewise 8/20 = 0.4 for passing it to person B. We can observe the rest of the class to collect the following sample data:
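The sample-data table itself was a figure in the original article, but as a minimal sketch of the calculation just described (person A's counts come from the text; the function and key names are my own assumptions):

```python
def estimate_transition_probs(counts):
    """Convert observed outcome counts for one state into estimated probabilities."""
    total = sum(counts.values())
    return {outcome: n / total for outcome, n in counts.items()}

# Observed outcomes for person A under the initial policy (from the sample data).
person_a_counts = {"hold": 6, "pass_to_B": 8, "throw_in_bin": 6}

print(estimate_transition_probs(person_a_counts))
# {'hold': 0.3, 'pass_to_B': 0.4, 'throw_in_bin': 0.3}
```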

Likewise, we can calculate the probabilities for every state, giving the following matrix, which we can use to simulate experience. The accuracy of this model will depend greatly on whether the probabilities are true representations of the whole environment. In other words, we need to make sure we have a sample that is large and rich enough in data.

Multi-Armed Bandits, Episodes, Rewards, Return and Discount Rate

So we have our transition probabilities estimated from the sample data under a POMDP. The next step, before we introduce any models, is to introduce rewards. So far, we have only discussed the outcome of the final step: either the paper gets placed in the bin by the teacher and nets a positive reward, or it gets thrown by A or M and nets a negative reward. This final reward that ends the episode is known as the Terminal Reward.

But there is also a third outcome that is less than ideal: the paper continually gets passed around and never reaches the bin (or takes far longer than we would like to do so). In summary, we have three final outcomes:

  • Paper gets placed in bin by teacher and nets a positive terminal reward
  • Paper gets thrown in bin by a student and nets a negative terminal reward
  • Paper gets continually passed around room or gets stuck on students for a longer period of time than we would like

To avoid the paper being thrown in the bin we provide this with a large, negative reward, say -1, and because the teacher is pleased with it being placed in the bin this nets a large positive reward, +1. To avoid the outcome where it continually gets passed around the room, we set the reward for all other actions to be a small, negative value, say -0.04.

If we set this to a positive number or zero, then the model may let the paper go round and round, as it would be better to keep gaining small positives than to risk getting close to the negative outcome. This number is also very small because each episode only collects a single terminal reward, yet it could take many steps to end the episode, and we need to ensure that, if the paper is placed in the bin, the positive outcome is not cancelled out.
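A minimal sketch of this reward scheme (the values +1, -1 and -0.04 come from the text; the state names and function are illustrative assumptions):

```python
STEP_REWARD = -0.04    # small negative reward for every non-terminal action
BIN_BY_TEACHER = 1.0   # terminal reward when the teacher places the paper in the bin
BIN_BY_STUDENT = -1.0  # terminal reward when a student throws the paper in the bin

def reward(state, action, next_state):
    """Return the reward for a single transition in the classroom example."""
    if next_state == "bin_via_teacher":
        return BIN_BY_TEACHER
    if next_state == "bin_via_throw":
        return BIN_BY_STUDENT
    return STEP_REWARD
```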

Please note: the rewards are always relative to one another and I have chosen arbitrary figures, but these can be changed if the results are not as desired.

Although we have inadvertently discussed episodes in the example, we have yet to formally define them. An episode is simply the sequence of actions a piece of paper takes through the classroom until it reaches the bin, which is the terminal state and ends the episode. In other examples, such as playing tic-tac-toe, this would be the end of a game where you win or lose.

The paper could in theory start at any state, and this is why we need enough episodes to ensure that every state and action is tested enough that our outcome is not being driven by invalid results. However, on the flip side, the more episodes we introduce the longer the computation time will be, and, depending on the scale of the environment, we may not have an unlimited amount of resources to do this.

This is known as the Multi-Armed Bandit problem: with finite time (or other resources), we need to ensure that we test each state-action pair enough that the actions selected in our policy are, in fact, the optimal ones. In other words, we need to validate that actions that have led us to good outcomes in the past did so not by sheer luck but because they are in fact the correct choice, and likewise for the actions that appear poor. In our example this may seem simple given how few states we have, but imagine if we increased the scale and how this becomes more and more of an issue.

The overall goal of our RL model is to select the actions that maximise the expected cumulative reward, known as the return. In other words, the return is simply the total reward obtained over the episode. A simple way to calculate this would be to add up all the rewards, including the terminal reward, in each episode.

A more rigorous approach is to consider the first steps to be more important than later ones in the episode by applying a discount factor, gamma, in the following formula:

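The formula was shown as an image in the original article; based on the description that follows, it is the standard discounted return, which can be written as:

```latex
% Discounted return: future rewards are weighted down by one factor of gamma per step.
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots
    = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}
```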

In other words, we sum all the rewards but weigh down later steps by a factor of gamma to the power of how many steps it took to reach them.

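A minimal sketch of that calculation in code, using the article's step reward of -0.04 and terminal reward of +1, with gamma = 0.5 for illustration (the three-step episode shown is hypothetical):

```python
def discounted_return(rewards, gamma):
    """Sum the rewards of an episode, discounting step k by gamma**k."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# Hypothetical episode: two passes (-0.04 each), then the teacher bins the paper (+1).
print(discounted_return([-0.04, -0.04, 1.0], gamma=0.5))
# ≈ 0.19, i.e. -0.04 + 0.5 * -0.04 + 0.25 * 1
```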

If we think about our example, using a discounted return becomes even clearer to imagine: the teacher will reward (or punish) everyone who was involved in the episode, but will scale this based on how far they were from the final outcome.

For example, if the paper passed from A to B to M, who threw it in the bin, M should be punished most, then B for passing it to him, and lastly person A, who is still involved in the final outcome but less so than M or B. This also emphasises that the more steps it takes to get from a starting state to the bin, the less that state will be rewarded or punished, though it will accumulate more of the small negative step rewards along the way.

Applying a Model to our Example

As our example environment is small, we can apply each model, show some of the calculations performed manually, and illustrate the impact of changing parameters.

For any algorithm, we first need to initialise the state value function, V(s), and I have decided to set each of these values to 0, as shown below.

Next, we let the model simulate experience in the environment based on our observed probability distribution. The model starts a piece of paper in a random state, and the outcome of each action under our policy is sampled from our observed probabilities. So, for example, say the first three simulated episodes are the following:
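The three episodes themselves were shown as a figure; as a minimal sketch of how such episodes could be sampled (person A's row uses the 0.3 / 0.4 / 0.3 figures estimated earlier, while the other rows and the state names are made-up placeholders, not the article's full observed matrix):

```python
import random

# Illustrative slice of an observed transition table: for each state, the possible
# outcomes of following the policy and their estimated probabilities.
transitions = {
    "A": [("A", 0.3), ("B", 0.4), ("bin_via_throw", 0.3)],
    "B": [("B", 0.2), ("teacher", 0.8)],          # placeholder values
    "teacher": [("bin_via_teacher", 1.0)],        # placeholder values
}
TERMINAL = {"bin_via_teacher", "bin_via_throw"}

def simulate_episode(start_state):
    """Sample one episode (the list of visited states) under the current policy."""
    state, path = start_state, [start_state]
    while state not in TERMINAL:
        outcomes, probs = zip(*transitions[state])
        state = random.choices(outcomes, weights=probs)[0]
        path.append(state)
    return path

print(simulate_episode("A"))
```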

With these episodes, we can calculate our first few updates to our state value function using each of the three models given. For now, we pick arbitrary alpha and gamma values of 0.5 to make our hand calculations simpler. We will show later the impact these parameters have on results.

First, we apply Temporal Difference 0 (TD(0)), the simplest of our models, and the first three value updates are as follows:

So how have these been calculated? Well, because our example is small, we can show the calculations by hand.
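As a sketch of the update rule being applied here (the TD(0) rule itself is standard; the particular transition shown is hypothetical and uses the article's alpha = 0.5, gamma = 0.5 and terminal reward of +1):

```python
def td0_update(v, state, next_state, reward, alpha=0.5, gamma=0.5):
    """Apply one TD(0) update: V(s) <- V(s) + alpha * (r + gamma * V(s') - V(s))."""
    v[state] += alpha * (reward + gamma * v[next_state] - v[state])

# State values are initialised to 0 for every person and the bin.
V = {s: 0.0 for s in ["A", "B", "C", "D", "G", "M", "teacher", "bin"]}

# Hypothetical step: the teacher places the paper in the bin and earns +1.
td0_update(V, "teacher", "bin", reward=1.0)
print(V["teacher"])  # 0.5, i.e. 0 + 0.5 * (1 + 0.5 * 0 - 0)
```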

So what can we observe at this early stage? Firstly, using TD(0) appears unfair to some states, for example person D who, at this stage, has gained nothing from the paper reaching the bin in two of the three episodes. Their update has only been affected by the value of the next state, but this emphasises how the positive and negative rewards propagate outwards from the corner towards the other states.

As we take more episodes, the positive and negative terminal rewards will spread out further and further across all states. This is shown roughly in the diagram below, where we can see that the two episodes that resulted in a positive outcome impact the values of the Teacher and G states, whereas the single negative episode has punished person M.

To show this, we can try more episodes. If we repeat the same three paths already given we produce the following state value function:

(Please note, we have repeated these three episodes for simplicity in this example but the actual model would have episodes where the outcomes are based on the observed transition probability function.)

The diagram above shows the terminal rewards propagating outwards from the top right corner towards the other states. From this, we may decide to update our policy, as it is clear that the negative terminal reward passes through person M and therefore B and C are impacted negatively. Based on V27, we may therefore decide to update our policy by selecting, for each state, the action that moves towards the next best state value, as shown in the figure below.

There are two causes for concern in this example. The first is that person A's best action is to throw the paper into the bin and net a negative reward. This is because none of the episodes have visited this person, which emphasises the multi-armed bandit problem. In this small example there are very few states, so it would require many episodes to visit them all, but we need to ensure this is done.

The reason this action appears better for this person is that neither of the terminal states has a value; rather, the positive and negative outcomes are captured in the terminal rewards. We could then, if our situation required it, initialise V0 with values for the terminal states based on those outcomes.

Secondly, the state value of person M is flipping back and forth between -0.03 and -0.51 (approx.) after the episodes and we need to address why this is happening. This is caused by our learning rate, alpha. For now, we have only introduced our parameters (the learning rate alpha and discount rate gamma) but have not explained in detail how they will impact results.

A large learning rate may cause the results to oscillate, but conversely it should not be so small that it takes forever to converge. This is shown further in the figure below, which plots the total V(s) for every episode: we can clearly see that, although there is a general increasing trend, it oscillates back and forth between episodes. Another good explanation of the learning rate is as follows:

“In the game of golf when the ball is far away from the hole, the player hits it very hard to get as close as possible to the hole. Later when he reaches the flagged area, he chooses a different stick to get accurate short shot.

So it’s not that he won’t be able to put the ball in the hole without choosing the short shot stick, he may send the ball ahead of the target two or three times. But it would be best if he plays optimally and uses the right amount of power to reach the hole.”

Learning rate of a Q learning agent: "The question is how the learning rate influences the convergence rate and convergence itself. If the learning rate is…" (stackoverflow.com)

There are some complex methods for establishing the optimal learning rate for a problem but, as with any machine learning algorithm, if the environment is simple enough you can iterate over different values until convergence is reached. This is also known as stochastic gradient descent. In a recent RL project, I demonstrated the impact of reducing alpha using an animated visual, and this is shown below. It demonstrates the oscillation when alpha is large and how this becomes smoothed as alpha is reduced.

Likewise, our discount rate must be a number between 0 and 1; oftentimes it is taken to be close to 0.9. The discount factor tells us how important future rewards are: a value close to 1 indicates that they will be considered important, whereas moving it towards 0 makes the model consider future steps less and less.
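A quick illustration of this weighting (the gamma values here are arbitrary examples): each step further into the future is multiplied by one more factor of gamma, so a gamma near 1 keeps future rewards relevant while a gamma near 0 discards them almost immediately.

```python
for gamma in (0.9, 0.1):
    # Weight applied to the reward k steps into the future.
    weights = [round(gamma ** k, 3) for k in range(5)]
    print(gamma, weights)
# 0.9 [1.0, 0.9, 0.81, 0.729, 0.656]
# 0.1 [1.0, 0.1, 0.01, 0.001, 0.0]
```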

With both of these in mind, we can change both alpha from 0.5 to 0.2 and gamma from 0.5 to 0.9 and we achieve the following results:

Because our learning rate is now much smaller, the model takes longer to learn and the values are generally smaller. The most noticeable is the Teacher, which is clearly the best state. The trade-off for this increased computation time is that our value for M is no longer oscillating to the degree it was before. We can now see this in the diagram below, which shows the sum of V(s) under our updated parameters. Although it is not perfectly smooth, the total V(s) increases at a much smoother rate than before and appears to converge as we would like, but requires approximately 75 episodes to do so.

Changing the Goal Outcome

Another crucial advantage of RL that we haven’t mentioned in too much detail is that we have some control over the environment. Currently, the rewards are based on what we decided would be best to get the model to reach the positive outcome in as few steps as possible.

However, say the teacher changed and the new one didn't mind the students throwing the paper into the bin, so long as it reached it. Then we can change our negative reward to reflect this, and the optimal policy will change.

This is particularly useful for business solutions. For example, say you are planning a strategy and know that certain transitions are less desired than others, then this can be taken into account and changed at will.

Conclusion

We have now created a simple Reinforcement Learning model from observed data. There are many things that could be improved or taken further, including using a more complex model, but this should be a good introduction for those who wish to try applying it to their own real-life problems.

I hope you enjoyed reading this article, if you have any questions please feel free to comment below.

Thanks

Sterling

Translated from: https://www.freecodecamp.org/news/how-to-apply-reinforcement-learning-to-real-life-planning-problems-90f8fa3dc0c5/
