Reinforcement Learning with Sarsa and Q-learning: Temporal-Difference Control under On-Policy and Off-Policy Strategies
1. On-policy & off-policy
All learning control methods face a dilemma: they want to learn action values conditional on subsequent behavior being optimal, yet they need to behave non-optimally in order to explore all actions (so as to find the optimal ones). How can an agent learn the optimal policy while behaving according to an exploratory policy?

The first approach is on-policy learning. It is really a compromise: rather than finding the truly optimal policy, it learns action values for a policy that is near-optimal but still explores.

The other approach is off-policy learning, which simply uses two policies: one that is learned about and eventually becomes the optimal policy, and a second, more exploratory one that generates the agent's behavior samples. The policy being learned about is called the target policy; the policy used to generate behavior is called the behavior policy.

On-policy methods are simpler and converge more easily; off-policy methods are more flexible and can handle tasks that on-policy methods cannot.

For example: suppose a robot wants to bomb a country but has only two bombs. Which two cities are the best targets? This can only be answered off-policy, because a bomb, once dropped, is gone; the robot can only probe "what if I struck there?" in imagination, learning from experience generated by different behavior. On-policy methods lack this kind of imagination, because the policy making decisions and the policy generating the behavior samples are one and the same.
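The distinction is easiest to see in the one-step TD update target. The sketch below uses a made-up 2-state, 2-action Q table and a hypothetical transition (all numbers are illustrative, not taken from the cliff-walking example later in this post):

```python
import numpy as np

# Hypothetical Q table for illustration: Q[state, action]
Q = np.array([[1.0, 2.0],
              [0.5, 3.0]])
alpha, gamma = 0.5, 0.9
s, a, r, s_next = 0, 0, -1.0, 1

# On-policy (Sarsa) target: uses the action the behavior policy
# actually takes at the next state (suppose it explored a_next = 0)
a_next = 0
sarsa_target = r + gamma * Q[s_next, a_next]    # -1 + 0.9*0.5 = -0.55

# Off-policy (Q-learning) target: uses the greedy action at the next
# state, regardless of what the behavior policy will actually do
qlearn_target = r + gamma * np.max(Q[s_next])   # -1 + 0.9*3.0 = 1.7

Q_sarsa = Q[s, a] + alpha * (sarsa_target - Q[s, a])    # 0.225
Q_qlearn = Q[s, a] + alpha * (qlearn_target - Q[s, a])  # 1.35
```

The two methods disagree exactly when exploration makes the behavior policy deviate from the greedy policy, which is what makes Q-learning off-policy.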
2. Monte Carlo on-policy and off-policy control
3. Sarsa
Sarsa is an on-policy instance of TD (temporal-difference) control.
```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import hsv_to_rgb

# ENV --
WORLD_HEIGHT = 4
WORLD_WIDTH = 12
NUM_STATE = WORLD_WIDTH * WORLD_HEIGHT

# ACTION -- (RIGHT/LEFT codes fixed to match step() and the ACTIONS labels)
UP = 0
DOWN = 1
RIGHT = 2
LEFT = 3
NUM_ACTIONS = 4
ACTION = [UP, DOWN, RIGHT, LEFT]
ACTIONS = ['U', 'D', 'R', 'L']

# STATE --
START = (3, 0)
END = (3, 11)


def change_range(values, vmin=0, vmax=1):
    # Rescale an array linearly into [vmin, vmax]
    start_zero = values - np.min(values)
    return (start_zero / (np.max(start_zero) + 1e-7)) * (vmax - vmin) + vmin


class ENVIRONMENT:
    terrain_color = dict(normal=[127/360, 0, 96/100],
                         objective=[26/360, 100/100, 100/100],
                         cliff=[247/360, 92/100, 70/100],
                         player=[344/360, 93/100, 100/100])

    def __init__(self):
        self.player = None
        self.num_steps = 0
        self.CreateEnv()
        self.DrawEnv()

    def CreateEnv(self, inital_grid=None):
        # Create the cliff-walking grid world, laid out as:
        #     col 0 ............. col 11
        #  0   x x x x x x x x x x x x
        #  1   x x x x x x x x x x x x
        #  2   x x x x x x x x x x x x
        #  3   S o o o o o o o o o o E   (o = cliff)
        self.grid = self.terrain_color['normal'] * np.ones((WORLD_HEIGHT, WORLD_WIDTH, 3))
        self.grid[-1, 1:11] = self.terrain_color['cliff']
        self.grid[-1, -1] = self.terrain_color['objective']

    def DrawEnv(self):
        self.fig, self.ax = plt.subplots(figsize=(WORLD_WIDTH, WORLD_HEIGHT))
        self.ax.grid(which='minor')
        self.q_texts = [self.ax.text(i % WORLD_WIDTH, i // WORLD_WIDTH, '0',
                                     fontsize=11, verticalalignment='center',
                                     horizontalalignment='center')
                        for i in range(WORLD_WIDTH * WORLD_HEIGHT)]
        self.im = self.ax.imshow(hsv_to_rgb(self.grid), cmap='terrain',
                                 interpolation='nearest', vmin=0, vmax=1)
        self.ax.set_xticks(np.arange(WORLD_WIDTH))
        self.ax.set_xticks(np.arange(WORLD_WIDTH) - 0.5, minor=True)
        self.ax.set_yticks(np.arange(WORLD_HEIGHT))
        self.ax.set_yticks(np.arange(WORLD_HEIGHT) - 0.5, minor=True)
        # plt.show()

    def step(self, action):
        # Possible actions (moves that would leave the grid are ignored)
        if action == UP and self.player[0] > 0:
            self.player = (self.player[0] - 1, self.player[1])
        if action == DOWN and self.player[0] < WORLD_HEIGHT - 1:
            self.player = (self.player[0] + 1, self.player[1])
        if action == RIGHT and self.player[1] < WORLD_WIDTH - 1:
            self.player = (self.player[0], self.player[1] + 1)
        if action == LEFT and self.player[1] > 0:
            self.player = (self.player[0], self.player[1] - 1)
        self.num_steps += 1
        # Rewards: normally -1 and the game carries on; walking into the
        # cliff ends the episode with -100; reaching the goal ends it with 0.
        reward = -1
        done = False
        if self.player[0] == WORLD_HEIGHT - 1 and 0 < self.player[1] < WORLD_WIDTH - 1:
            reward = -100
            done = True
        elif self.player == END:
            reward = 0
            done = True
        return self.player, reward, done

    def reset(self):
        # Use a tuple so that grid[self.player] indexes a single cell
        self.player = (START[0], START[1])
        self.num_steps = 0
        return self.player

    def RenderEnv(self, q_values, action=None, max_q=False, colorize_q=False):
        assert self.player is not None, 'You first need to call .reset()'
        if colorize_q:
            # Color each cell by the value of its best action
            grid = self.terrain_color['normal'] * np.ones((WORLD_HEIGHT, WORLD_WIDTH, 3))
            values = change_range(np.max(q_values, -1)).reshape(WORLD_HEIGHT, WORLD_WIDTH)
            grid[:, :, 1] = values
            grid[-1, 1:11] = self.terrain_color['cliff']
            grid[-1, -1] = self.terrain_color['objective']
        else:
            grid = self.grid.copy()
        # Render the player's cell
        grid[self.player] = self.terrain_color['player']
        self.im.set_data(hsv_to_rgb(grid))
        if q_values is not None:
            for i, text in enumerate(self.q_texts):
                txt = ""
                for a in range(len(ACTIONS)):
                    txt += ACTIONS[a] + ":" + str(
                        round(q_values[i // WORLD_WIDTH, i % WORLD_WIDTH, a], 2)) + '\n'
                text.set_text(txt)
        # Show the chosen action in the title
        if action is not None:
            self.ax.set_title(action, color='r', weight='bold', fontsize=32)
        plt.pause(0.1)


def egreedy_policy(q_values, state, epsilon=0.1):
    # With probability epsilon take a uniformly random action; otherwise
    # break ties randomly among the greedy actions.
    if np.random.binomial(1, epsilon) == 1:
        return np.random.choice(ACTION)
    values_ = q_values[state[0], state[1], :]
    return np.random.choice([a for a, v in enumerate(values_) if v == np.max(values_)])


def sarsa(env, episodes=500, render=True, epsilon=0.1, learning_rate=0.5, gamma=0.9):
    q_values_sarsa = np.zeros((WORLD_HEIGHT, WORLD_WIDTH, NUM_ACTIONS))
    ep_rewards = []
    for _ in range(episodes):
        state = env.reset()
        done = False
        reward_sum = 0
        action = egreedy_policy(q_values_sarsa, state, epsilon)
        while not done:
            next_state, reward, done = env.step(action)
            next_action = egreedy_policy(q_values_sarsa, next_state, epsilon)
            # Sarsa update: the target uses Q(s', a') for the action a'
            # actually chosen by the epsilon-greedy behavior policy.
            q_values_sarsa[state[0], state[1], action] += learning_rate * (
                reward + gamma * q_values_sarsa[next_state[0], next_state[1], next_action]
                - q_values_sarsa[state[0], state[1], action])
            state = next_state
            action = next_action
            # Record episode returns for comparison; the algorithm itself
            # does not need this.
            reward_sum += reward
            if render:
                env.RenderEnv(q_values_sarsa, action=ACTIONS[action], colorize_q=True)
        ep_rewards.append(reward_sum)
    return ep_rewards, q_values_sarsa


def play(q_values):
    # Simulate the environment greedily using the learned Q values
    env = ENVIRONMENT()
    state = env.reset()
    done = False
    while not done:
        action = egreedy_policy(q_values, state, 0.0)
        state, R, done = env.step(action)
        env.RenderEnv(q_values=q_values, action=ACTIONS[action], colorize_q=True)


env = ENVIRONMENT()
sarsa_rewards, q_values_sarsa = sarsa(env, episodes=500, render=False,
                                      epsilon=0.1, learning_rate=1, gamma=0.9)
play(q_values_sarsa)
```
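A common refinement is expected Sarsa: instead of the sampled Q(s′, a′) in the Sarsa target, it uses the expectation of Q(s′, ·) under the ε-greedy policy, which removes the variance introduced by sampling a′. Below is a minimal sketch of that target (the function name and numbers are my own, for illustration), assuming the same action distribution `egreedy_policy` implements: uniform random with probability ε, otherwise uniform over the greedy actions.

```python
import numpy as np

def expected_sarsa_target(q_row, reward, gamma=0.9, epsilon=0.1):
    """Expected-Sarsa TD target: average Q over the epsilon-greedy action
    distribution at the next state instead of sampling one next action.
    q_row holds the Q values of all actions at the next state."""
    n = len(q_row)
    greedy = np.flatnonzero(q_row == q_row.max())
    probs = np.full(n, epsilon / n)               # exploration mass, spread uniformly
    probs[greedy] += (1 - epsilon) / len(greedy)  # greedy mass, split among ties
    return reward + gamma * np.dot(probs, q_row)

# Tiny illustration with made-up numbers
print(expected_sarsa_target(np.array([1.0, 3.0, 2.0, 0.0]), reward=-1.0))  # ≈ 1.565
```

With ε = 0 the expectation collapses onto the greedy action and the target coincides with Q-learning's; with the sampled action instead of the expectation it is ordinary Sarsa.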
4. Q-learning
Q-learning is an off-policy instance of TD control: it learns about the greedy target policy while following the ε-greedy behavior policy.
```python
# Complete Q-learning code for the cliff-walking example.
# The constants, ENVIRONMENT class, egreedy_policy and play are identical
# to the Sarsa listing above and are reused here rather than repeated.

def Qlearning(env, episodes=500, render=True, epsilon=0.1, learning_rate=0.5, gamma=0.9):
    q_values = np.zeros((WORLD_HEIGHT, WORLD_WIDTH, NUM_ACTIONS))
    ep_rewards = []
    for _ in range(episodes):
        state = env.reset()
        done = False
        reward_sum = 0
        while not done:
            action = egreedy_policy(q_values, state, epsilon)
            next_state, reward, done = env.step(action)
            # Q-learning update: the target uses the greedy (max) action at
            # the next state, regardless of what the behavior policy does next.
            q_values[state[0], state[1], action] += learning_rate * (
                reward + gamma * np.max(q_values[next_state[0], next_state[1], :])
                - q_values[state[0], state[1], action])
            state = next_state
            # Record episode returns for comparison; the algorithm itself
            # does not need this.
            reward_sum += reward
            if render:
                env.RenderEnv(q_values, action=ACTIONS[action], colorize_q=True)
        ep_rewards.append(reward_sum)
    return ep_rewards, q_values


env = ENVIRONMENT()
q_learning_rewards, q_values = Qlearning(env, episodes=500, render=False,
                                         epsilon=0.1, learning_rate=1, gamma=0.9)
play(q_values)
```
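On cliff walking the two methods famously learn different behavior: Q-learning's greedy target drives it along the cliff edge (shortest path, but risky while still exploring ε-greedily), whereas on-policy Sarsa accounts for its own exploration and learns a safer route, so its online per-episode returns are usually higher. One way to see this is to smooth and plot the episode returns from the two runs above; the `smooth` helper and window size below are my own additions, not part of the original code:

```python
import numpy as np
import matplotlib.pyplot as plt

def smooth(rewards, window=20):
    # Running mean over a fixed window, for readable learning curves
    kernel = np.ones(window) / window
    return np.convolve(np.asarray(rewards, dtype=float), kernel, mode='valid')

def compare_rewards(sarsa_rewards, q_learning_rewards, window=20):
    plt.plot(smooth(sarsa_rewards, window), label='Sarsa')
    plt.plot(smooth(q_learning_rewards, window), label='Q-learning')
    plt.xlabel('Episode')
    plt.ylabel('Sum of rewards per episode (smoothed)')
    plt.legend()
    plt.show()

# e.g. compare_rewards(sarsa_rewards, q_learning_rewards) after the two runs above
```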