TensorLayer Deep Reinforcement Learning series:

1. TensorLayer Deep Reinforcement Learning: Installing TensorLayer

Contents

  • 2.4 The gym reinforcement learning environments
    • 2.4.1 Installation
    • 2.4.2 FrozenLake-v0
      • 2.4.2.1 Description
      • 2.4.2.2 Code
  • 2.5 Reinforcement learning algorithms
    • 2.5.1 Tabular Q-learning
      • 2.5.1.1 Code
      • 2.5.1.2 Experimental results

2.4 The gym reinforcement learning environments

  This part gives an overview of the environments provided by gym and how they behave.

2.4.1 Installation

  The installation of gym and the related setup notes are covered in an earlier article (https://mp.weixin.qq.com/s?__biz=MzU1OTkwNzk4NQ==&mid=2247484108&idx=1&sn=0c9ff7488185c6287fbe56a3fa24a286&chksm=fc115732cb66de24dab450f458cc39effea9ffe4441010d5d3e00078badcdf132a54eb5388ba&token=366879770&lang=zh_CN#rd) and are not repeated here. Once gym is installed, the registered environments can be listed as sketched below.
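  A minimal sketch, assuming a classic gym 0.x release (where gym.envs.registry.all() returns the registered environment specifications); it simply enumerates the available environment ids:

from gym import envs

# List every registered environment id (classic gym 0.x registry API).
all_ids = sorted(spec.id for spec in envs.registry.all())
print(len(all_ids), "environments registered")
print([env_id for env_id in all_ids if "FrozenLake" in env_id])  # e.g. FrozenLake-v0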

2.4.2 FrozenLake-v0

2.4.2.1 Description

  FrozenLake-v0 is a 4x4 grid of tiles; each tile is either the start tile, the goal tile, a frozen tile, or a hole. The goal is for the agent to learn to walk from the start tile to the goal tile without stepping into a hole. The agent can choose to move up, down, left, or right, but a gust of wind may blow it onto a tile it did not intend. Under these conditions a policy that is perfect at every single step is impossible, yet learning to avoid the holes and reach the goal is certainly achievable.
Put more plainly: winter has arrived, and while you and your friends were tossing a frisbee in the park, you threw it out onto the middle of a lake. The water is mostly frozen, but the ice has melted in a few places, leaving holes. If you step into one of those holes, you fall into the freezing water. Since there is no spare frisbee, you have to cross the lake and retrieve it. However, the ice is slippery, so you will not always move in the direction you intend.

  The ice surface is described by the following grid:

SFFF (S: starting point, safe)
FHFH (F: frozen surface, safe)
FFFH (H: hole, fall to your doom)
HFFG (G: goal, where the frisbee is located)

  The episode ends when you reach the goal or fall into a hole. You receive a reward of 1 if you reach the goal, and 0 otherwise.
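  Before looking at the environment's source code, here is a minimal interaction sketch, assuming the classic gym API (reset/step/render); the random policy is only a placeholder and is not part of the original tutorial:

import gym

env = gym.make("FrozenLake-v0")
s = env.reset()                      # initial state: an integer index into the 4x4 grid
done = False
total_reward = 0.0
while not done:
    a = env.action_space.sample()    # placeholder policy: pick a random action
    s, r, done, info = env.step(a)   # returns next state, reward, terminal flag, extra info
    total_reward += r
env.render()                         # print the final board, current tile highlighted
print("episode reward:", total_reward)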

2.4.2.2 Code

import sys
from contextlib import closing

import numpy as np
from six import StringIO, b

from gym import utils
from gym.envs.toy_text import discrete

LEFT = 0
DOWN = 1
RIGHT = 2
UP = 3

MAPS = {
    "4x4": [
        "SFFF",
        "FHFH",
        "FFFH",
        "HFFG"
    ],
    "8x8": [
        "SFFFFFFF",
        "FFFFFFFF",
        "FFFHFFFF",
        "FFFFFHFF",
        "FFFHFFFF",
        "FHHFFFHF",
        "FHFFHFHF",
        "FFFHFFFG"
    ],
}


def generate_random_map(size=8, p=0.8):
    """Generates a random valid map (one that has a path from start to goal)
    :param size: size of each side of the grid
    :param p: probability that a tile is frozen
    """
    valid = False

    # DFS to check that it's a valid path.
    def is_valid(res):
        frontier, discovered = [], set()
        frontier.append((0, 0))
        while frontier:
            r, c = frontier.pop()
            if not (r, c) in discovered:
                discovered.add((r, c))
                directions = [(1, 0), (0, 1), (-1, 0), (0, -1)]
                for x, y in directions:
                    r_new = r + x
                    c_new = c + y
                    if r_new < 0 or r_new >= size or c_new < 0 or c_new >= size:
                        continue
                    if res[r_new][c_new] == 'G':
                        return True
                    if (res[r_new][c_new] not in '#H'):
                        frontier.append((r_new, c_new))
        return False

    while not valid:
        p = min(1, p)
        res = np.random.choice(['F', 'H'], (size, size), p=[p, 1 - p])
        res[0][0] = 'S'
        res[-1][-1] = 'G'
        valid = is_valid(res)
    return ["".join(x) for x in res]


class FrozenLakeEnv(discrete.DiscreteEnv):
    """
    Winter is here. You and your friends were tossing around a frisbee at the park
    when you made a wild throw that left the frisbee out in the middle of the lake.
    The water is mostly frozen, but there are a few holes where the ice has melted.
    If you step into one of those holes, you'll fall into the freezing water.
    At this time, there's an international frisbee shortage, so it's absolutely
    imperative that you navigate across the lake and retrieve the disc.
    However, the ice is slippery, so you won't always move in the direction you intend.
    The surface is described using a grid like the following

        SFFF
        FHFH
        FFFH
        HFFG

    S : starting point, safe
    F : frozen surface, safe
    H : hole, fall to your doom
    G : goal, where the frisbee is located

    The episode ends when you reach the goal or fall in a hole.
    You receive a reward of 1 if you reach the goal, and zero otherwise.
    """

    metadata = {'render.modes': ['human', 'ansi']}

    def __init__(self, desc=None, map_name="4x4", is_slippery=True):
        if desc is None and map_name is None:
            desc = generate_random_map()
        elif desc is None:
            desc = MAPS[map_name]
        self.desc = desc = np.asarray(desc, dtype='c')
        self.nrow, self.ncol = nrow, ncol = desc.shape
        self.reward_range = (0, 1)

        nA = 4
        nS = nrow * ncol

        isd = np.array(desc == b'S').astype('float64').ravel()
        isd /= isd.sum()

        P = {s: {a: [] for a in range(nA)} for s in range(nS)}

        def to_s(row, col):
            return row * ncol + col

        def inc(row, col, a):
            if a == LEFT:
                col = max(col - 1, 0)
            elif a == DOWN:
                row = min(row + 1, nrow - 1)
            elif a == RIGHT:
                col = min(col + 1, ncol - 1)
            elif a == UP:
                row = max(row - 1, 0)
            return (row, col)

        for row in range(nrow):
            for col in range(ncol):
                s = to_s(row, col)
                for a in range(4):
                    li = P[s][a]
                    letter = desc[row, col]
                    if letter in b'GH':
                        li.append((1.0, s, 0, True))
                    else:
                        if is_slippery:
                            for b in [(a - 1) % 4, a, (a + 1) % 4]:
                                newrow, newcol = inc(row, col, b)
                                newstate = to_s(newrow, newcol)
                                newletter = desc[newrow, newcol]
                                done = bytes(newletter) in b'GH'
                                rew = float(newletter == b'G')
                                li.append((1.0 / 3.0, newstate, rew, done))
                        else:
                            newrow, newcol = inc(row, col, a)
                            newstate = to_s(newrow, newcol)
                            newletter = desc[newrow, newcol]
                            done = bytes(newletter) in b'GH'
                            rew = float(newletter == b'G')
                            li.append((1.0, newstate, rew, done))

        super(FrozenLakeEnv, self).__init__(nS, nA, P, isd)

    def render(self, mode='human'):
        outfile = StringIO() if mode == 'ansi' else sys.stdout

        row, col = self.s // self.ncol, self.s % self.ncol
        desc = self.desc.tolist()
        desc = [[c.decode('utf-8') for c in line] for line in desc]
        desc[row][col] = utils.colorize(desc[row][col], "red", highlight=True)
        if self.lastaction is not None:
            outfile.write("  ({})\n".format(["Left", "Down", "Right", "Up"][self.lastaction]))
        else:
            outfile.write("\n")
        outfile.write("\n".join(''.join(line) for line in desc) + "\n")

        if mode != 'human':
            with closing(outfile):
                return outfile.getvalue()

2.5 Reinforcement learning algorithms

2.5.1 Tabular Q-learning

2.5.1.1 Code

  The theory behind tabular Q-learning is not repeated here; see the earlier article "Chapter 5: Model-Free Prediction and Control with Temporal-Difference Learning and Q-Learning (Part 1)" (https://mp.weixin.qq.com/s?__biz=MzU1OTkwNzk4NQ==&mid=2247484656&idx=1&sn=a0804ea632ff65b4f629dca5d4d23574&chksm=fc11510ecb66d818d8b91b7043254d5fe807fe123be4271caaca27282425f6b52952790f661e&token=366879770&lang=zh_CN#rd). In short, after each transition (s, a, r, s') the table entry Q(s, a) is moved a step toward the TD target r + γ · max_a' Q(s', a').
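  As a minimal, self-contained sketch of that update (separate from the full tutorial script below; the function name and signature are illustrative only), the core of tabular Q-learning is a single NumPy line:

import numpy as np

def q_update(Q, s, a, r, s_next, done, lr=0.85, gamma=0.99):
    # One tabular Q-learning step: move Q[s, a] toward the TD target.
    target = r if done else r + gamma * np.max(Q[s_next, :])
    Q[s, a] += lr * (target - Q[s, a])
    return Q

The tutorial script below performs the same update inline and adds decaying random noise to the greedy action selection so that the agent keeps exploring in early episodes.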

"""Q-Table learning algorithm.
Non deep learning - TD Learning, Off-Policy, e-Greedy Exploration
Q(S, A) <- Q(S, A) + alpha _ (R + lambda _ Q(newS, newA) - Q(S, A))
See David Silver RL Tutorial Lecture 5 - Q-Learning for more details.
For Q-Network, see tutorial_frozenlake_q_network.py
EN: https://medium.com/emergent-future/simple-reinforcement-learning-with-tensorflow-part-0-q-learning-with-tables-and-neural-networks-d195264329d0#.5m3361vlw
CN: https://zhuanlan.zhihu.com/p/25710327
tensorflow==2.0.0a0
tensorlayer==2.0.0
"""import argparse
import os
import timeimport gym
import matplotlib.pyplot as plt
import numpy as npparser = argparse.ArgumentParser()
parser.add_argument('--train', dest='train', action='store_true', default=True)
parser.add_argument('--test', dest='test', action='store_true', default=True)parser.add_argument(
'--save_path', default=None, help='folder to save if mode == train else model path,'
'qnet will be saved once target net update'
)
parser.add_argument('--seed', help='random seed', type=int, default=0)
parser.add_argument('--env_id', default='FrozenLake-v0')
args = parser.parse_args()## Load the environmentalg_name = 'Qlearning'
env_id = args.env_id
env = gym.make(env_id)
render = True # display the game environment##================= Implement Q-Table learning algorithm =====================#### Initialize table with all zerosQ = np.zeros([env.observation_space.n, env.action_space.n])## Set learning parameterslr = .85 # alpha, if use value function approximation, we can ignore it
lambd = .99 # decay factor
num_episodes = 10000
t0 = time.time()if args.train:
all*episode_reward = []
for i in range(num_episodes): ## Reset environment and get first new observation
s = env.reset()
rAll = 0 ## The Q-Table learning algorithm
for j in range(99):
if render: env.render() ## Choose an action by greedily (with noise) picking from Q table
a = np.argmax(Q[s, :] + np.random.randn(1, env.action_space.n) \* (1. / (i + 1))) ## Get new state and reward from environment
s1, r, d, * = env.step(a) ## Update Q-Table with new knowledge
Q[s, a] = Q[s, a] + lr _ (r + lambd _ np.max(Q[s1, :]) - Q[s, a])
rAll += r
s = s1
if d is True:
break
print(
'Training | Episode: {}/{} | Episode Reward: {:.4f} | Running Time: {:.4f}'.format(
i + 1, num_episodes, rAll,
time.time() - t0
)
)
if i == 0:
all_episode_reward.append(rAll)
else:
all_episode_reward.append(all_episode_reward[-1] _ 0.9 + rAll _ 0.1)# savepath = os.path.join('model', '_'.join([alg_name, env_id]))if not os.path.exists(path):os.makedirs(path)np.save(os.path.join(path, 'Q_table.npy'), Q)plt.plot(all_episode_reward)if not os.path.exists('image'):os.makedirs('image')plt.savefig(os.path.join('image', '_'.join([alg_name, env_id])))# print("Final Q-Table Values:/n %s" % Q)if args.test:
path = os.path.join('model', '_'.join([alg_name, env_id]))
Q = np.load(os.path.join(path, 'Q_table.npy'))
for i in range(num_episodes): ## Reset environment and get first new observation
s = env.reset()
rAll = 0 ## The Q-Table learning algorithm
for j in range(99): ## Choose an action by greedily (with noise) picking from Q table
a = np.argmax(Q[s, :]) ## Get new state and reward from environment
s1, r, d, _ = env.step(a) ## Update Q-Table with new knowledge
rAll += r
s = s1
if d is True:
break
print(
'Testing | Episode: {}/{} | Episode Reward: {:.4f} | Running Time: {:.4f}'.format(
i + 1, num_episodes, rAll,
time.time() - t0
)
)
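  Note that both --train and --test default to True, so running the script first trains for 10,000 episodes, saves model/Qlearning_FrozenLake-v0/Q_table.npy and a reward curve under image/, and then replays the greedy policy loaded from that file. Because render = True, every training step also prints the board to the console; setting render = False speeds training up considerably.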

2.5.1.2 Experimental results

  The final learned Q-table is shown below; rows correspond to the 16 states and columns to the actions Left, Down, Right, Up (a short sketch for reading the greedy policy off this table follows it):

[[6.20622965e-01 8.84762425e-03 3.09373823e-03 6.55067399e-03]
 [6.49198039e-04 3.04069914e-04 8.78667903e-04 5.91638052e-01]
 [1.92065690e-03 4.33985167e-01 3.49151873e-03 1.97126703e-03]
 [2.70187111e-03 0.00000000e+00 0.00000000e+00 4.35444853e-01]
 [6.34931610e-01 1.09286085e-04 1.86982907e-03 2.76783612e-04]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
 [6.48093009e-07 1.13896350e-04 1.65719637e-01 1.90614063e-05]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
 [0.00000000e+00 3.84251979e-03 1.48921362e-03 7.46942896e-01]
 [0.00000000e+00 8.03386378e-01 6.92688383e-04 0.00000000e+00]
 [8.40889312e-01 9.86082253e-06 1.25967676e-04 6.83892296e-05]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
 [0.00000000e+00 0.00000000e+00 9.61587991e-01 6.98637543e-03]
 [0.00000000e+00 9.99905944e-01 0.00000000e+00 0.00000000e+00]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]]
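  To interpret the table, take the arg-max action in every state and reshape the result onto the 4x4 board. A minimal sketch, reusing the save path and action encoding from the code above:

import os
import numpy as np

Q = np.load(os.path.join('model', 'Qlearning_FrozenLake-v0', 'Q_table.npy'))
actions = np.array(['L', 'D', 'R', 'U'])             # LEFT=0, DOWN=1, RIGHT=2, UP=3
greedy_policy = actions[np.argmax(Q, axis=1)].reshape(4, 4)
print(greedy_policy)                                 # one preferred action per tile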

  The cumulative reward curve over episodes is shown below:

Figure 4: Cumulative reward per episode

  The success rates from three test runs are as follows:

Figure 5: Test success rate
