Diving deeper into Reinforcement Learning with Q-Learning

by Thomas Simonini

This article is part of the Deep Reinforcement Learning Course with Tensorflow. Check the syllabus here.

Today we’ll learn about Q-Learning. Q-Learning is a value-based Reinforcement Learning algorithm.

This article is the second part of a free series of blog posts about Deep Reinforcement Learning. For more information and more resources, check out the syllabus of the course. See the first article here.

In this article you’ll learn:

What Q-Learning is
How to implement it with Numpy

The big picture: the Knight and the Princess

Let’s say you’re a knight and you need to save the princess trapped in the castle shown on the map above.

You can move one tile at a time. The enemy can’t move, but if you land on the same tile as the enemy, you die. Your goal is to reach the castle by the fastest route possible. This can be evaluated using a “points scoring” system.

You lose 1 point at each step (losing points at each step pushes our agent to be fast).
If you touch an enemy, you lose 100 points, and the episode ends.
If you reach the castle, you win and get +100 points.
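
The scoring rules above can be sketched as a small reward function. This is only an illustration: the tile labels ("empty", "enemy", "castle") and the function name are hypothetical, and it assumes that the terminal reward replaces the step penalty on the final move, which the article does not spell out.

```python
# Reward values taken from the scoring rules above.
STEP_REWARD = -1
ENEMY_REWARD = -100
CASTLE_REWARD = 100

def score_move(tile):
    """Return (reward, episode_over) for landing on a tile.

    `tile` is a hypothetical label: "enemy", "castle", or "empty".
    Assumption: on a terminal tile the terminal reward replaces the step penalty.
    """
    if tile == "enemy":
        return ENEMY_REWARD, True
    if tile == "castle":
        return CASTLE_REWARD, True
    return STEP_REWARD, False

print(score_move("empty"))   # (-1, False)
print(score_move("enemy"))   # (-100, True)
print(score_move("castle"))  # (100, True)
```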

The question is: how do you create an agent that will be able to do that?

Here’s a first strategy. Let’s say our agent tries to go to each tile, and then colors each tile: green for “safe,” and red if not.

Then, we can tell our agent to take only green tiles.

But the problem is that this isn’t really helpful. We don’t know which tile is best when green tiles are adjacent to each other. So our agent can fall into an infinite loop while trying to find the castle!

Introducing the Q-table

Here’s a second strategy: create a table where we’ll calculate the maximum expected future reward, for each action at each state.

Thanks to that, we’ll know what’s the best action to take for each state.

Each state (tile) allows four possible actions. These are moving left, right, up, or down.

In terms of computation, we can transform this grid into a table.

This is called a Q-table (“Q” for “quality” of the action). The columns will be the four actions (left, right, up, down). The rows will be the states. The value of each cell will be the maximum expected future reward for that given state and action.

Each Q-table score will be the maximum expected future reward that I’ll get if I take that action at that state with the best policy given.

Why do we say “with the policy given?” It’s because we don’t implement a policy. Instead, we just improve our Q-table to always choose the best action.

Think of this Q-table as a game “cheat sheet.” Thanks to that, we know for each state (each line in the Q-table) what’s the best action to take, by finding the highest score in that line.

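Reading the "cheat sheet" can be sketched in a few lines of Numpy. The Q-values below are made up purely to illustrate the lookup, and the helper name `best_action` is hypothetical; the column order (left, right, up, down) follows the description above.

```python
import numpy as np

ACTIONS = ["left", "right", "up", "down"]  # column order of the Q-table

# Made-up Q-values for three states, only to illustrate the lookup.
q_table = np.array([
    [0.0, 2.5, -1.0, 0.3],
    [1.2, 0.1, 0.0, 4.0],
    [-3.0, -0.5, 0.7, 0.2],
])

def best_action(q_table, state):
    """Return the column index with the highest score in the state's row."""
    return int(np.argmax(q_table[state]))

for state in range(q_table.shape[0]):
    print(state, ACTIONS[best_action(q_table, state)])
# 0 right
# 1 down
# 2 up
```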

Yeah! We solved the castle problem! But wait… How do we calculate the values for each element of the Q table?

To learn each value of this Q-table, we’ll use the Q learning algorithm.

Q-learning algorithm: learning the Action Value Function

The Action Value Function (or “Q-function”) takes two inputs: “state” and “action.” It returns the expected future reward of that action at that state.

We can see this Q function as a reader that scrolls through the Q-table to find the line associated with our state, and the column associated with our action. It returns the Q value from the matching cell. This is the “expected future reward.”

But before we explore the environment, the Q-table gives the same arbitrary fixed value (most of the time 0). As we explore the environment, the Q-table will give us a better and better approximation by iteratively updating Q(s,a) using the Bellman Equation (see below!).

The Q-learning algorithm process

Step 1: Initialize Q-values

We build a Q-table with m columns (m = number of actions) and n rows (n = number of states). We initialize the values at 0.

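In Numpy, Step 1 might look like the sketch below. The sizes are assumptions for illustration (a 4x4 grid gives 16 states, with the four moves as actions); the article does not fix them.

```python
import numpy as np

n_states = 16   # assumed: one state per tile of a hypothetical 4x4 grid
n_actions = 4   # left, right, up, down

# n rows (states) by m columns (actions), all initialized to 0.
q_table = np.zeros((n_states, n_actions))
print(q_table.shape)  # (16, 4)
```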

Step 2: For life (or until learning is stopped)

Steps 3 to 5 are repeated until we reach a maximum number of episodes (specified by the user) or until we manually stop the training.

Step 3: Choose an action

Choose an action a in the current state s based on the current Q-value estimates.

But…what action can we take in the beginning, if every Q-value equals zero?

That’s where the exploration/exploitation trade-off that we spoke about in the last article will be important.

The idea is that in the beginning, we’ll use the epsilon greedy strategy:

We specify an exploration rate “epsilon,” which we set to 1 in the beginning. This is the fraction of steps that we take at random. In the beginning, this rate must be at its highest value, because we don’t know anything about the values in the Q-table. This means we need to do a lot of exploration by randomly choosing our actions.
We generate a random number. If this number > epsilon, then we do “exploitation” (this means we use what we already know to select the best action at each step). Otherwise, we do exploration.
The idea is that we must have a big epsilon at the beginning of the training of the Q-function. Then, we reduce it progressively as the agent becomes more confident at estimating Q-values.
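
The epsilon greedy strategy described above can be sketched as follows. The function names are hypothetical, and the exponential decay schedule with its parameter values is an assumption: the text only says that epsilon is reduced progressively, so any decreasing schedule would fit.

```python
import numpy as np

def choose_action(q_table, state, epsilon, rng):
    """Epsilon greedy: exploit if the random draw exceeds epsilon, else explore."""
    if rng.random() > epsilon:
        return int(np.argmax(q_table[state]))    # exploitation: best known action
    return int(rng.integers(q_table.shape[1]))   # exploration: random action

def decay_epsilon(episode, min_epsilon=0.01, max_epsilon=1.0, decay_rate=0.01):
    """Assumed schedule: exponential decay from max_epsilon toward min_epsilon."""
    return min_epsilon + (max_epsilon - min_epsilon) * np.exp(-decay_rate * episode)

rng = np.random.default_rng(seed=0)
q_table = np.zeros((16, 4))
print(choose_action(q_table, state=0, epsilon=1.0, rng=rng))  # always a random action
print(decay_epsilon(0), decay_epsilon(500))                   # starts near 1, shrinks toward 0.01
```

With epsilon at 1 every action is random; as epsilon shrinks, the agent relies more and more on the Q-table.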

Steps 4–5: Evaluate!

Take the action a and observe the outcome state s’ and reward r. Now update the function Q(s,a).

We take the action a that we chose in step 3, and then performing this action returns us a new state s’ and a reward r (as we saw in the Reinforcement Learning process in the first article).

Then, to update Q(s,a) we use the Bellman equation:

The idea here is to update our Q(state, action) like this:

New Q value = Current Q value + lr * [Reward + discount_rate * (highest Q value among possible actions from the new state s’) - Current Q value]
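
The update rule above translates almost word for word into Numpy. This is a minimal sketch: the function name `q_update` is hypothetical, while `lr` and `discount_rate` are the names used in the formula. The numbers in the usage example are made up.

```python
import numpy as np

def q_update(q_table, state, action, reward, new_state, lr, discount_rate):
    """Apply one Q-learning update in place and return the new Q value."""
    best_next = np.max(q_table[new_state])        # highest Q value from the new state
    target = reward + discount_rate * best_next   # Reward + discount_rate * max Q(s', a')
    q_table[state, action] += lr * (target - q_table[state, action])
    return float(q_table[state, action])

# Made-up example: two states, two actions.
q_table = np.zeros((2, 2))
q_table[1, 1] = 2.0
new_q = q_update(q_table, state=0, action=1, reward=1.0, new_state=1,
                 lr=0.5, discount_rate=0.9)
print(round(new_q, 3))  # 1.4, i.e. 0 + 0.5 * (1 + 0.9 * 2 - 0)
```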

Let’s take an example:

One cheese = +1
Two cheese = +2
Big pile of cheese = +10 (end of the episode)
If you eat rat poison = -10 (end of the episode)

Step 1: We initialize our Q-table

Step 3: Choose an action

From the starting position, you can choose between going right or down. Because we have a big epsilon rate (since we don’t know anything about the environment yet), we choose randomly. For example… move right.

We found a piece of cheese (+1), and we can now update the Q-value of being at start and going right. We do this by using the Bellman equation.

Steps 4–5: Update the Q-function

First, we calculate the change in Q value, ΔQ(start, right).
Then we add ΔQ(start, right), multiplied by a learning rate, to the initial Q value.

Think of the learning rate as a measure of how quickly the agent abandons the former value for the new one. If the learning rate is 1, the new estimate completely replaces the old Q-value.

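To make the two steps concrete, here is the arithmetic for the first move with assumed numbers. The learning rate of 0.1 and the discount rate of 0.9 are assumptions for illustration; the reward of +1 and the all-zero Q-table come from the example above.

```python
lr = 0.1             # assumed learning rate
discount_rate = 0.9  # assumed discount rate
reward = 1           # we found one piece of cheese
current_q = 0.0      # Q(start, right) before the update
max_next_q = 0.0     # best Q value from the new tile (the table is still all zeros)

# First: the change in Q value.
delta_q = reward + discount_rate * max_next_q - current_q
# Then: add it, scaled by the learning rate, to the initial Q value.
new_q = current_q + lr * delta_q
print(delta_q, new_q)  # 1.0 0.1
```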

Good! We’ve just updated our first Q value. Now we need to do that again and again until the learning is stopped.

Implement a Q-learning algorithm

We made a video where we implement a Q-learning agent that learns to play Taxi-v2 with Numpy.

Now that we know how it works, we’ll implement the Q-learning algorithm step by step. Each part of the code is explained directly in the Jupyter notebook below.

You can access it in the Deep Reinforcement Learning Course repo.

Or you can access it directly on Google Colaboratory:

Q* Learning with Frozen Lake (colab.research.google.com)

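For readers who want to see the five steps together without opening the notebook, here is a minimal self-contained sketch. It is not the notebook's code: instead of Taxi-v2 or Frozen Lake it uses a hypothetical 4x4 grid loosely modeled on the knight example, and every hyperparameter value, the decay schedule, and the helper names (`step`, `train`, `greedy_path`) are assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical 4x4 map: S = start, E = enemy, C = castle, . = empty tile.
GRID = ["S...",
        ".E..",
        "...E",
        "E..C"]
N_ROWS, N_COLS = 4, 4
MOVES = [(0, -1), (0, 1), (-1, 0), (1, 0)]  # left, right, up, down

def step(state, action):
    """Apply an action and return (new_state, reward, done)."""
    row, col = divmod(state, N_COLS)
    d_row, d_col = MOVES[action]
    row = min(max(row + d_row, 0), N_ROWS - 1)  # walking into a wall keeps us in place
    col = min(max(col + d_col, 0), N_COLS - 1)
    new_state = row * N_COLS + col
    tile = GRID[row][col]
    if tile == "E":
        return new_state, -100, True
    if tile == "C":
        return new_state, 100, True
    return new_state, -1, False

def train(total_episodes=2000, max_steps=50, lr=0.8, discount_rate=0.95,
          min_epsilon=0.01, max_epsilon=1.0, decay_rate=0.005, seed=0):
    rng = np.random.default_rng(seed)
    q_table = np.zeros((N_ROWS * N_COLS, len(MOVES)))    # Step 1: initialize Q-values
    epsilon = max_epsilon
    for episode in range(total_episodes):                # Step 2: repeat steps 3 to 5
        state = 0
        for _ in range(max_steps):
            if rng.random() > epsilon:                   # Step 3: epsilon greedy choice
                action = int(np.argmax(q_table[state]))
            else:
                action = int(rng.integers(len(MOVES)))
            new_state, reward, done = step(state, action)  # Step 4: act and observe
            q_table[state, action] += lr * (               # Step 5: update Q(s,a)
                reward + discount_rate * np.max(q_table[new_state])
                - q_table[state, action])
            state = new_state
            if done:
                break
        # Assumed schedule: exponential decay of the exploration rate.
        epsilon = min_epsilon + (max_epsilon - min_epsilon) * np.exp(-decay_rate * episode)
    return q_table

def greedy_path(q_table, max_steps=50):
    """Follow the best action in each state, starting from the start tile."""
    state, path = 0, [0]
    for _ in range(max_steps):
        state, _, done = step(state, int(np.argmax(q_table[state])))
        path.append(state)
        if done:
            break
    return path

q_table = train()
print(greedy_path(q_table))  # list of visited states, starting at 0
```

With these settings the greedy path should end on the castle tile (state 15). Swapping in a real environment mostly means replacing `step` with that environment's own step function.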

A recap…

Q-learning is a value-based Reinforcement Learning algorithm that is used to find the optimal action-selection policy using a Q-function.
It evaluates which action to take based on an action-value function that determines the value of being in a certain state and taking a certain action at that state.
Goal: maximize the value function Q (expected future reward given a state and action).
The Q-table helps us find the best action for each state.
We maximize the expected reward by selecting the best of all possible actions.
The Q comes from the quality of a certain action in a certain state.
Function Q(state, action) → returns the expected future reward of that action at that state.
This function can be estimated using Q-learning, which iteratively updates Q(s,a) using the Bellman equation.
Before we explore the environment, the Q-table gives the same arbitrary fixed value; as we explore the environment, Q gives us a better and better approximation.

That’s all! Don’t forget to implement each part of the code by yourself — it’s really important to try to modify the code I gave you.

Try to add epochs, change the learning rate, and use a harder environment (such as Frozen-lake with 8x8 tiles). Have fun!

Next time we’ll work on Deep Q-learning, one of the biggest breakthroughs in Deep Reinforcement Learning in 2015. And we’ll train an agent that plays Doom and kills enemies!

If you liked my article, please click the clap button below as many times as you liked the article so other people will see it here on Medium. And don’t forget to follow me!

If you have any thoughts, comments, questions, feel free to comment below or send me an email: hello@simoninithomas.com, or tweet me @ThomasSimonini.

Keep learning, stay awesome!

Deep Reinforcement Learning Course with Tensorflow

Syllabus

Video version

Part 1: An introduction to Reinforcement Learning

Part 2: Diving deeper into Reinforcement Learning with Q-Learning

Part 3: An introduction to Deep Q-Learning: let’s play Doom

Part 3+: Improvements in Deep Q Learning: Dueling Double DQN, Prioritized Experience Replay, and fixed Q-targets

Part 4: An introduction to Policy Gradients with Doom and Cartpole

Part 5: An intro to Advantage Actor Critic methods: let’s play Sonic the Hedgehog!

Part 6: Proximal Policy Optimization (PPO) with Sonic the Hedgehog 2 and 3

Part 7: Curiosity-Driven Learning made easy Part I

Translated from: https://www.freecodecamp.org/news/diving-deeper-into-reinforcement-learning-with-q-learning-c18d0db58efe/