Dynamic Programming to Artificial Intelligence: Q-Learning

A failure is not always a mistake, it may simply be the best one can do under the circumstances. The real mistake is to stop trying. — B. F. Skinner

Reinforcement learning models are beating human players in games around the world, and huge international companies are investing millions in reinforcement learning. Reinforcement learning is so powerful because it requires neither a pre-collected data set nor labels. It could be a technique that leads to artificial general intelligence.

Supervised and Unsupervised Learning

As a summary, in supervised learning a model learns to map inputs to outputs using predefined, labeled data. In unsupervised learning, a model learns to cluster and group unlabeled data.

Reinforcement Learning

In reinforcement learning, however, the model receives no data set or explicit guidance; instead, it learns through trial and error.

Reinforcement learning is an area of machine learning defined by how a model (called an agent in reinforcement learning) behaves in an environment so as to maximize a given reward. A close real-world analogy is a wild animal trying to find food in its ecosystem. In this example, the animal is the agent, the ecosystem is the environment, and the food is the reward.

Reinforcement learning is frequently used in the domain of game playing, where there is no immediate way to label how “good” an action was, since we would need to consider all future outcomes.

Markov Decision Processes

The Markov Decision Process is the most fundamental concept of reinforcement learning. There are a few components in an MDP that interact with each other:

  • Agent — the model
  • Environment — the overall situation
  • State — the situation at a specific time
  • Action — how the agent acts
  • Reward — feedback from the environment

MDP Notation

(Figure from Sutton, R. S. and Barto, A. G., Introduction to Reinforcement Learning)

To restate what was previously discussed in more mathematically formal terms, some notation must be defined.

  • t represents the current time step
  • S is the set of all possible states, with S_t being the state at time t
  • A is the set of all possible actions, with A_t being the action performed at time t
  • R is the set of all possible rewards, with R_t being the reward received after performing A_(t-1)
  • T is the final time step (the episode ends when a terminal condition is reached or when t exceeds a set limit)

The process can be written as:

  1. The agent receives a state S_t
  2. The agent performs an action A_t based on S_t
  3. The agent receives a reward R_(t+1)
  4. The environment transitions into a new state S_(t+1)
  5. The cycle repeats for t+1
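
As a minimal sketch of this loop, the toy two-state environment and purely random agent below are invented for illustration only (they are not part of the original article); the comments map each line back to the five steps above.

import random

def step(state, action):
    # toy environment: reward 1 if the action "matches" the state, else 0
    reward = 1 if action == state else 0
    new_state = random.randint(0, 1)         # environment moves to a random next state
    return new_state, reward

state = random.randint(0, 1)                 # 1. the agent receives a state S_t
for t in range(10):                          # run for T = 10 time steps
    action = random.randint(0, 1)            # 2. the agent performs an action A_t
    new_state, reward = step(state, action)  # 3. the agent receives a reward R_(t+1)
    state = new_state                        # 4. the environment transitions to S_(t+1)
                                             # 5. the cycle repeats for t+1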

Expected Discounted Return (Making Long-Term Decisions)

We discussed that in order for an agent to play a game well, it would need to take future rewards into consideration. This can be described as:

G(t) = R_(t+1) + R_(t+2) +… + R_(T), where G(t) is the sum of the rewards the agent expects after time t.

However, if T is infinite, then in order to make G(t) converge to a single number, we define the discount rate γ to be a number smaller than 1, and define:

G(t) = R_(t+1) + γR_(t+2) + γ²R_(t+3) + …

This can also be written as:

G(t) = R_(t+1) + γG(t+1)

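As a small self-contained check of these two formulas, the reward values below are made up purely for illustration; the code computes G(t) directly and then verifies the recursive form G(t) = R_(t+1) + γG(t+1).

gamma = 0.9
rewards = [1, 0, 2, 5]             # illustrative values of R_(t+1), R_(t+2), R_(t+3), R_(t+4)

def discounted_return(rewards, gamma):
    # G(t) = R_(t+1) + gamma*R_(t+2) + gamma^2*R_(t+3) + ...
    return sum(gamma ** k * r for k, r in enumerate(rewards))

G_t = discounted_return(rewards, gamma)             # 1 + 0.9*0 + 0.81*2 + 0.729*5 = 6.265
G_t_plus_1 = discounted_return(rewards[1:], gamma)  # G(t+1) over the remaining rewards
print(abs(G_t - (rewards[0] + gamma * G_t_plus_1)) < 1e-9)   # recursive form holds: True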

Value and Quality (Q-Learning is Quality-Learning)

A policy describes how an agent will act in any state it finds itself in; an agent is said to follow a policy. Value and quality functions describe how “good” it is for an agent to be in a given state, or to be in a state and perform a given action.

Specifically, the value function v_p(s) is the expected discounted return when starting in state s and following a policy p. The quality function q_p(s, a) is the expected discounted return when starting in state s, performing action a, and then following policy p.

v_p(s) = E[G(t) | S_t = s]

q_p(s, a) = E[G(t) | S_t = s, A_t = a]

A policy is better than or equal to another policy if it has a greater or equal expected discounted return for every state. The optimal value and quality functions v* and q* are those of the best possible policy.
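
Since v*(s) = max_a q*(s, a), both the optimal value function and the optimal (greedy) policy can be read directly from a q-table. The small table below uses invented numbers purely for illustration.

import numpy as np

q_star = np.array([[1.0, 2.5],    # made-up q*-values for 3 states and 2 actions
                   [0.3, 0.1],
                   [4.0, 4.2]])

v_star = q_star.max(axis=1)       # v*(s) = max_a q*(s, a)   ->  [2.5, 0.3, 4.2]
policy = q_star.argmax(axis=1)    # best action in each state ->  [1, 0, 1]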

Bellman Equation for Q*

The Bellman Equation is another extremely important concept; it turns q-learning into dynamic programming combined with a gradient-descent-like idea.

It states that, when following the best policy, the q-value of a state and action, q*(s, a), is the reward received for performing a in s plus the maximum expected discounted return achievable after performing a in s, multiplied by the discount rate.

q*(s_t, a_t) = R_(t+1) + γ max_a q*(s_(t+1), a)

The quality of the best action is equal to the immediate reward plus the quality of the best action at the next time step, multiplied by the discount rate.

Once we find q*, we can read the best policy off of it directly; q-learning is the technique we use to find q*.
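
As a concrete sketch with invented numbers, a single application of this equation looks like the following; the q-learning algorithm in the next section repeats such backups, blended with a learning rate, until the q-table stops changing.

import numpy as np

gamma = 0.9
reward = 1.0                           # R_(t+1), observed after taking a_t in s_t (made up)
q_next = np.array([0.5, 2.0, 1.0])     # current estimates of q*(s_(t+1), a) (made up)

# q*(s_t, a_t) = R_(t+1) + gamma * max_a q*(s_(t+1), a)
q_target = reward + gamma * q_next.max()   # 1.0 + 0.9 * 2.0 = 2.8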

Q-Learning

Q-learning is a technique that attempts to maximize the expected reward over all time steps by finding the best q-function. In other words, the objective of q-learning is the same as that of dynamic programming, but with a discount rate applied to future rewards.

In q-learning, a table of all possible state-action pairs is created, and the algorithm iteratively updates the values in the table using the Bellman equation until the optimal q-values are found.

We define a learning rate, a number between 0 and 1 describing how much of the old q-value we keep and how much of the new estimate we blend in at each iteration.
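
Writing the learning rate as α, each iteration blends the old q-value with the new Bellman target:

Q(S_t, A_t) ← (1 − α) Q(S_t, A_t) + α (R_(t+1) + γ max_a Q(S_(t+1), a))

With α = 0 the old value is kept untouched; with α = 1 it is replaced entirely by the new target.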

The process can be described with the following pseudocode:

import numpy as np

# assumes state_size, action_size, max_t, learning_rate, gamma, and the environment helpers
# step() and game_over() are defined elsewhere, and that `state` holds the initial state
Q = np.zeros((state_size, action_size))             # q-table: one row per state, one column per action
for i in range(max_t):
    action = np.argmax(Q[state, :])                 # pick the action with the highest q-value
    new_state, reward = step(action)                # environment returns the next state and reward
    # Bellman update, blended with the old value according to the learning rate
    Q[state, action] = Q[state, action] * (1 - learning_rate) + \
        (reward + gamma * np.max(Q[new_state, :])) * learning_rate
    state = new_state
    if game_over(state):
        break

Exploration and Exploitation

In the beginning, we do not know anything about our environment, so we want to prioritize exploring and gathering information, even if it means we do not get as much reward as possible.

Later, we want to increase our high score and prioritize finding ways to get more rewards by exploiting the q-table.

To do this, we introduce a variable epsilon, controlled by hyperparameters, that determines when to explore and when to exploit. Specifically, when a randomly generated number is higher than epsilon, we exploit; otherwise, we explore.

The new code is as follows:

import random
import numpy as np

Q = np.zeros((state_size, action_size))
epsilon = 1                                         # start by exploring almost always
for _ in range(batches):
    # (the environment's starting state would be reset here at the top of each episode)
    for i in range(max_t):
        if random.uniform(0, 1) > epsilon:
            action = np.argmax(Q[state, :])         # exploit: best known action
        else:
            # explore: random action (assumes possible_actions returns the number of actions)
            action = np.random.randint(possible_actions(state))
        new_state, reward = time_step(action)
        Q[state, action] = Q[state, action] * (1 - learning_rate) + \
            (reward + gamma * np.max(Q[new_state, :])) * learning_rate
        state = new_state
        if game_over(state):
            break
    epsilon *= epsilon_decay_rate                   # explore less as training progresses

Summary

  • Reinforcement learning focuses on a situation where an agent receives no data set, and instead learns from the actions it takes and the rewards it receives from the environment.
  • The Markov Decision Process is a control process that models the decision making of an agent placed in an environment.
  • The Bellman Equation describes a property of the best policy that turns the problem into a modified form of dynamic programming.
  • The agent prioritizes exploring in the beginning, but eventually transitions to exploiting.

Translated from: https://medium.com/swlh/dynamic-programming-to-artificial-intelligence-q-learning-51a189fc0441
