Reinforcement Learning: Using Q-Learning to Drive a Taxi!

After more than two months without publishing, I'm back! Now I want to share with you my latest experiences studying Reinforcement Learning and solving some problems.

The first algorithm for any newbie in Reinforcement Learning is usually Q-Learning. Why? Because it's a very simple algorithm, easy to understand and powerful for many problems!

In this post, we'll build together an agent to play the Taxi-v3 game from OpenAI Gym using just numpy and a few lines of code. After this article, you'll be able to apply Q-Learning to solve other problems in different environments.

But first, we need to understand: what is Reinforcement Learning?

A Short Summary of Reinforcement Learning

The core idea of Reinforcement Learning is an interaction loop made of the following parts:

Agent: Think of the agent as our model. It is responsible for making the magic happen, like playing Pacman like a professional.
Environment: The environment is where the magic happens. In this example it will be the Taxi-v3 game.
Reward: The feedback given by the environment to say whether the action taken by the agent was good or bad. The reward can be positive or negative.
Action: The action taken by the agent.
State: The current situation of the agent in the environment, such as being low on life, out of ammunition, or facing a wall.

The main goal of the agent is to take actions that maximize its future reward. So the flow is:

Take an action;
Receive feedback from the environment;
Receive the new state;
Take a new action;
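The four steps above can be sketched as a plain Python loop. This is only an illustration: `GuessEnv` is a hypothetical toy environment invented for this example (it is not part of Gym), and the agent here acts purely at random.

```python
import random


class GuessEnv:
    """Hypothetical toy environment: the agent guesses a hidden bit each round."""

    def __init__(self, rounds=5, seed=0):
        self.rounds = rounds
        self.rng = random.Random(seed)

    def reset(self):
        self.remaining = self.rounds
        self.hidden = self.rng.randint(0, 1)
        return self.remaining              # the state is just the rounds left

    def step(self, action):
        reward = 1 if action == self.hidden else -1   # feedback: good or bad
        self.remaining -= 1
        self.hidden = self.rng.randint(0, 1)
        done = self.remaining == 0
        return self.remaining, reward, done


env = GuessEnv()
agent_rng = random.Random(1)
state = env.reset()
done = False
total_reward = 0
steps = 0
while not done:
    action = agent_rng.randint(0, 1)         # 1. take an action (random here)
    state, reward, done = env.step(action)   # 2. receive feedback, 3. receive the new state
    total_reward += reward
    steps += 1                               # 4. loop back and take a new action
print("steps:", steps, "total reward:", total_reward)
```

A learning agent would replace the random choice with a rule that uses the rewards seen so far, which is what Q-Learning does.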

Our agent has 2 ways to make a decision in a given situation: Exploration and Exploitation. In Exploration, the agent takes random actions, which is useful for learning about the environment. In Exploitation, the agent takes actions based on what it already knows.
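A common way to mix the two modes is an epsilon-greedy rule, and the training code later in this article follows the same idea. The sketch below assumes the agent already has one estimated value per action; `choose_action` and the example values are hypothetical names and numbers used only for illustration.

```python
import random


def choose_action(action_values, epsilon, rng=random):
    """Return an action index: random with probability epsilon, otherwise the best known."""
    if rng.random() < epsilon:
        return rng.randrange(len(action_values))                          # Exploration
    return max(range(len(action_values)), key=lambda a: action_values[a])  # Exploitation


action_values = [0.0, 2.5, -1.0]      # hypothetical estimates for three actions

greedy = choose_action(action_values, epsilon=0.0)    # never explores
rng = random.Random(0)
samples = [choose_action(action_values, epsilon=1.0, rng=rng) for _ in range(200)]  # always explores
print("greedy action:", greedy)
print("distinct random actions:", sorted(set(samples)))
```

With epsilon at 0 the agent always picks the action with the highest estimate; with epsilon at 1 it ignores its estimates completely.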

The amazing video that accompanies this post shows Reinforcement Learning in practice, with 4 agents playing hide and seek. Don't forget to check it out!

Now you know what Reinforcement Learning is and why it's such an amazing field of Artificial Intelligence!

Let's see how Q-Learning works.

Q-Learning Summary

As I said before, Q-Learning is a very easy algorithm to understand and highly recommended for beginners in Reinforcement Learning, because it's powerful and can be implemented in a few lines of code.

Basically, in Q-Learning we create a table of states and actions, called the Q-Table. This table helps our agent take the best action at each moment. It has one row per state and one column per action, and each cell holds the estimated value of taking that action in that state.

In the beginning, we fill this table with 0 in every cell. The idea is to let the agent explore the environment by taking random actions, and then use the rewards received from these actions to populate the table. This is the Exploration.

After that, we start the Exploitation, where the agent uses the table to take the actions that maximize its future reward. During Exploitation the Q-Table keeps being updated, and its values depend on the state: a good action in one state will not necessarily be a good action in another.

To choose the action that maximizes the future reward, the agent looks up the row of the current state s in the Q-Table and picks the action with the largest value: a = argmax_a Q(s, a).

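After that, our agent receives a reward from the environment, which can be negative or positive, and we use the formula below to update our Q-Table:

Q(s, a) := Q(s, a) + lr * [R(s, a) + gamma * max_a' Q(s', a') - Q(s, a)]

Here lr is the learning rate, gamma is the discount factor, and s' is the new state reached after taking action a in state s.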
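To make the update rule concrete, here is a small worked example in plain Python. It assumes the standard Q-Learning update written in the comment of the training code further down; the 2-state, 2-action table and the rewards are made up for illustration, and `q_update` is a hypothetical helper name.

```python
def q_update(q_table, state, action, reward, new_state, lr, gamma):
    """Apply one Q-Learning update in place and return the updated value."""
    best_next = max(q_table[new_state])             # max over a' of Q(s', a')
    td_target = reward + gamma * best_next          # R(s, a) + gamma * max Q(s', a')
    q_table[state][action] += lr * (td_target - q_table[state][action])
    return q_table[state][action]


# Hypothetical table: 2 states x 2 actions, all zeros at the start.
q_table = [[0.0, 0.0], [0.0, 0.0]]

# A step that costs 1 point: 0 + 0.81 * (-1 + 0.96 * 0 - 0), roughly -0.81
first = q_update(q_table, state=0, action=1, reward=-1, new_state=1, lr=0.81, gamma=0.96)

# A step that earns 20 points: 0 + 0.81 * (20 + 0.96 * 0 - 0), roughly 16.2
second = q_update(q_table, state=1, action=0, reward=20, new_state=0, lr=0.81, gamma=0.96)

print(first, second)
```

Note that in the second call the best value in state 0 is still 0, because the only non-zero entry there is negative.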

This is how the Q-Learning algorithm works. Remember the flow: initialize the Q-Table, choose an action, take it, observe the reward and the new state, update the Q-Table, and repeat.
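Before moving to Taxi-v3, the whole flow can be tried on a problem small enough to check by hand. The sketch below is not the article's Taxi code: it assumes a hypothetical 5-cell corridor where the agent starts in cell 0 and must reach cell 4, with -1 per move and +20 at the goal, and it uses only the Python standard library. The hyperparameter values are borrowed from the ones used later in this article.

```python
import math
import random

N_STATES = 5        # corridor cells 0..4; cell 4 is the goal (hypothetical toy task)
ACTIONS = (0, 1)    # 0 = move left, 1 = move right


def step(state, action):
    """Move in the corridor: -1 per move, +20 for reaching the goal."""
    new_state = max(0, state - 1) if action == 0 else state + 1
    done = new_state == N_STATES - 1
    return new_state, (20 if done else -1), done


def train(episodes=500, max_steps=100, lr=0.81, gamma=0.96,
          min_epsilon=0.01, max_epsilon=1.0, decay_rate=0.01, seed=0):
    rng = random.Random(seed)
    q_table = [[0.0 for _ in ACTIONS] for _ in range(N_STATES)]   # start with zeros
    epsilon = max_epsilon
    for episode in range(episodes):
        state = 0
        for _ in range(max_steps):
            if rng.random() > epsilon:     # Exploitation: best known action
                action = max(ACTIONS, key=lambda a: q_table[state][a])
            else:                          # Exploration: random action
                action = rng.choice(ACTIONS)
            new_state, reward, done = step(state, action)
            best_next = max(q_table[new_state])
            q_table[state][action] += lr * (reward + gamma * best_next - q_table[state][action])
            state = new_state
            if done:
                break
        # Explore less and less as training goes on.
        epsilon = min_epsilon + (max_epsilon - min_epsilon) * math.exp(-decay_rate * episode)
    return q_table


q_table = train()
# Greedy action for every non-goal cell, read straight from the learned table.
policy = [max(ACTIONS, key=lambda a: q_table[s][a]) for s in range(N_STATES - 1)]
print(policy)
```

In this corridor the best behavior is to always move right, so the learned table should rank action 1 above action 0 in every non-goal cell.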

Understanding the Environment

First of all, we need to understand the problem and how our environment works. Let's do that.

In Taxi-v3, we have 4 locations and a taxi, which will be our agent. The task is to pick up a passenger at one location and drop them off at another. Our agent receives +20 points for a successful drop-off and loses 1 point for every timestep it takes. Beyond that, it loses 10 points for illegal pickup and drop-off actions, for example trying to pick up a passenger at an invalid location.
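As a quick sanity check on these numbers, the total score of an episode can be computed by hand. The sketch below assumes that every step yields exactly one of the three rewards (so the -1 is not added on top of a -10 or a +20 step), which is how the Gym documentation describes Taxi; `episode_return` is a hypothetical helper name.

```python
STEP_PENALTY = -1        # every ordinary timestep
ILLEGAL_PENALTY = -10    # illegal pickup or drop-off
DROPOFF_REWARD = 20      # successful drop-off


def episode_return(steps, illegal_actions, delivered):
    """Total reward of one episode, assuming one reward per step."""
    ordinary = steps - illegal_actions - (1 if delivered else 0)
    return (ordinary * STEP_PENALTY
            + illegal_actions * ILLEGAL_PENALTY
            + (DROPOFF_REWARD if delivered else 0))


# 13 steps, one illegal pickup, passenger delivered: 11 * (-1) - 10 + 20 = -1
print(episode_return(steps=13, illegal_actions=1, delivered=True))

# 100 ordinary steps without ever delivering the passenger: -100
print(episode_return(steps=100, illegal_actions=0, delivered=False))
```

This shows why a well-trained agent ends with a small positive score: it has to finish in fewer than about 20 moves and avoid illegal actions entirely.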

The video in the original post shows how our environment works.

Now that we know our environment, we're ready to build our initial Q-Table (with all 0).

Building our Initial Q-Table

First, we need to import our libraries and start our environment to visualize some information. We'll do that with the code below:

    import gym
    import numpy as np
    import random

    env = gym.make('Taxi-v3')
    env.render()

Now, we need to build our Q-Table. As I said before, the Q-Table starts with zero for all the values. We will build a Q-Table with shape (number of possible states) x (number of actions). Let's do this:

    # Values for Q Table:
    action_size = env.action_space.n
    print('Action Space: ', action_size)
    state_size = env.observation_space.n
    print('State Size: ', state_size)

    # Build Q Table:
    q_table = np.zeros((state_size, action_size))
    q_table
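For Taxi-v3, the Gym documentation describes 500 discrete states (25 taxi positions x 5 passenger locations x 4 destinations) and 6 actions, so the table above should come out as 500 x 6. The sketch below is a hypothetical re-implementation of that state numbering in plain Python, written only to show where the 500 comes from; it is not Gym's own code, and `encode` and `decode` are illustrative names.

```python
ROWS, COLS = 5, 5       # taxi grid
PASSENGER_LOCS = 5      # 4 marked locations + "inside the taxi"
DESTINATIONS = 4
N_ACTIONS = 6           # south, north, east, west, pickup, dropoff


def encode(taxi_row, taxi_col, passenger_loc, destination):
    """Pack the four components of a situation into a single state number."""
    return ((taxi_row * COLS + taxi_col) * PASSENGER_LOCS + passenger_loc) * DESTINATIONS + destination


def decode(state):
    """Unpack a state number back into its four components."""
    state, destination = divmod(state, DESTINATIONS)
    state, passenger_loc = divmod(state, PASSENGER_LOCS)
    taxi_row, taxi_col = divmod(state, COLS)
    return taxi_row, taxi_col, passenger_loc, destination


n_states = ROWS * COLS * PASSENGER_LOCS * DESTINATIONS
q_table = [[0.0] * N_ACTIONS for _ in range(n_states)]   # list-of-lists stand-in for np.zeros
print(n_states, len(q_table), len(q_table[0]))
print(encode(3, 1, 2, 0), decode(328))
```

Each row of the Q-Table therefore corresponds to one full situation (where the taxi is, where the passenger is, and where they want to go), not just to a position on the grid.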

Our Q-Table is now initialized with 0, so we can start to implement our Q-Learning algorithm!

Implementing Q-Learning

First, we need to set the hyperparameters for the algorithm. I really encourage you to experiment and find the best parameters yourself. I'm going to use these values:

    # Hyper params:
    total_ep = 1500
    total_test_ep = 100
    max_steps = 100
    lr = 0.81
    gamma = 0.96

    # Exploration Params:
    epsilon = 0.9
    max_epsilon = 1.0
    min_epsilon = 0.01
    decay_rate = 0.01

The second step is to use a loop to iterate over the episodes (I set 1500), and reset the environment at the start of every episode.

Epsilon is our way of balancing actions between Exploration and Exploitation. The idea is that as epsilon decreases, our agent takes more Exploitation actions, and vice versa.

An episode starts when you reset the environment and ends when your agent reaches a terminal state, such as losing the game.

    for episode in range(total_ep):
      # Reset Environment:
      state = env.reset()
      step = 0
      done = False

And inside this loop, we'll create another loop to control the actions of our agent:

      for step in range(max_steps):
        # Choose an action a in the current world state (s)
        # First we draw a random number
        exp_exp_tradeoff = random.uniform(0, 1)

        # If this number is greater than epsilon -> exploitation
        # (taking the biggest Q-value for the current state):
        if exp_exp_tradeoff > epsilon:
          action = np.argmax(q_table[state, :])
        # Else, make a random choice (exploration):
        else:
          action = env.action_space.sample()

        # Take the action (a) and observe the outcome state (s') and the reward (r)
        new_state, reward, done, info = env.step(action)

        # Update Q(s,a) := Q(s,a) + lr [R(s,a) + gamma * max Q(s',a') - Q(s,a)]
        q_table[state, action] = q_table[state, action] + lr * (
            reward + gamma * np.max(q_table[new_state, :]) - q_table[state, action])

        # Our new state:
        state = new_state

        # If done is True, finish the episode:
        if done:
          break

Every step is commented in the code, but I'll explain it. First we create a loop to control our actions (steps) inside the environment. Then we draw a random number. If this number is greater than epsilon, our agent takes the action with the biggest value in the Q-Table by using:

    np.argmax(q_table[state, :])

If the random number is less than or equal to epsilon, our agent takes a random action by using:

    env.action_space.sample()

After that, we update our Q-values using our formula. After updating the values, we check whether the episode has finished. If it has, we move on to the next episode and reduce epsilon by using:

      # Reduce epsilon (because we need less and less exploration):
      epsilon = min_epsilon + (max_epsilon - min_epsilon) * np.exp(-decay_rate * episode)
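To get a feeling for how fast exploration fades, the decay formula can be evaluated on its own. This sketch reuses the exploration parameters set earlier in this article and replaces `np.exp` with `math.exp` so it runs without numpy; `epsilon_at` is a hypothetical helper name.

```python
import math

min_epsilon, max_epsilon, decay_rate = 0.01, 1.0, 0.01


def epsilon_at(episode):
    """Exploration rate used after the given episode index."""
    return min_epsilon + (max_epsilon - min_epsilon) * math.exp(-decay_rate * episode)


for episode in (0, 100, 500, 1499):
    print(episode, round(epsilon_at(episode), 3))
```

Epsilon starts at 1.0, falls to roughly 0.37 after 100 episodes, and is very close to min_epsilon by the end of the 1500 training episodes, so most of the exploration happens early.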

After running all the code, our Q-Table is ready to be used by our agent. Now we can put what it has learned to the test!

First, we reset our environment to make sure our agent starts from the beginning. Besides that, we create a list to store the total reward of each episode so we can look at it at the end:

    env.reset()
    rewards = []

After that, we create a loop with a structure very similar to the previous one. In this loop we iterate over the test episodes (100).

    for episode in range(total_test_ep):
      state = env.reset()
      step = 0
      done = False
      total_rewards = 0
      print('=========================')
      print('EPISODE: ', episode)

As you can see, we start the total reward of each episode at 0. Now, we need to create another loop to control the actions of the agent inside our environment:

      for step in range(max_steps):
        env.render()
        # Take the action based on the Q Table:
        action = np.argmax(q_table[state, :])
        new_state, reward, done, info = env.step(action)
        total_rewards += reward

        # If episode finishes:
        if done:
          rewards.append(total_rewards)
          print('Score: ', total_rewards)
          break
        state = new_state

Again, this loop is very similar to the previous one. Here, we render our environment to see our agent in action. But unlike before, our agent uses only Exploitation to take actions: in every state, it takes the action with the largest value in the Q-Table.

When an episode finishes, we append its total_rewards to the rewards list and print the score of that episode.

After finishing all 100 test episodes, we close the environment and print our score over time.
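The article does not show the code for this last step. Closing the environment is presumably a call to `env.close()`, and the score over time can be computed as the average of the collected rewards. The sketch below is an assumption about what that looks like, with a hypothetical `summarize` helper and made-up scores.

```python
def summarize(rewards, total_test_ep):
    """Average score per test episode; unfinished episodes add nothing to the sum."""
    return sum(rewards) / total_test_ep


rewards = [8, 11, 5, 9]     # hypothetical scores of four finished test episodes
score = summarize(rewards, total_test_ep=4)
print('Score over time: ', score)
```

Because the test loop only appends to rewards when an episode finishes, dividing by the number of test episodes rather than by len(rewards) penalizes episodes that hit max_steps without delivering the passenger.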

Congratulations, you trained an agent to play Taxi-v3 using Reinforcement Learning and Q-Learning!

You can access the notebook with the full code of this article here.

I hope you have managed to understand how Q-Learning works and how to apply it to a problem!

This was my first article about Reinforcement Learning, so your feedback is very important for improving the next ones =)

For now, this is all!

See you next time!

Connect with Me on Social Networks

✅ Linkedin: https://www.linkedin.com/in/gabriel-mayer-779b5a162/

✅ GitHub: https://github.com/gabrielmayers

✅ Instagram: https://www.instagram.com/gabrielmayerl/

Translated from: https://medium.com/analytics-vidhya/reinforcement-learning-using-q-learning-to-drive-a-taxi-5720f7cf38df
