1. On-policy & off-policy

All learning control methods face a dilemma: they want the learned action values to make the agent's subsequent behavior optimal, but in order to search over all actions (so as to find the optimal ones) they have to take non-optimal, exploratory actions. How can the agent learn the optimal policy while still following an exploratory policy to select its actions?

The first approach is on-policy learning. It is really a compromise: rather than finding the truly optimal policy, it learns the action values of a policy that is near-optimal but still explores.

The other approach is off-policy learning, which simply uses two policies: one that is being learned and eventually becomes the optimal policy, and another, more exploratory one that generates the agent's behavior samples. The policy being learned is called the target policy; the policy that generates the behavior samples is called the behavior policy.

On-policy methods are simpler and easier to get to converge, but off-policy methods are closer to "learning by imagination" and can handle problems that on-policy methods cannot.

For example, suppose a robot wants to bomb a country but has only two bombs: which two cities should it hit? This can only be learned off-policy, because once a bomb is dropped it is gone; the robot can only "imagine", from experience generated some other way, which targets would be better. In on-policy learning the policy that makes decisions and the policy that generates the behavior samples are one and the same, so it has no room for this kind of imagination.
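In TD terms (the Sarsa and Q-learning sections below), the difference comes down to which action value the update bootstraps from. A minimal sketch, where q is a NumPy array of shape (num_states, num_actions) and state, action, reward, next_state, next_action, alpha and gamma are placeholder names rather than variables from the listings below:

import numpy as np

def sarsa_backup(q, state, action, reward, next_state, next_action, alpha, gamma):
    # on-policy: bootstrap from the action the behavior policy actually chose next
    target = reward + gamma * q[next_state, next_action]
    q[state, action] += alpha * (target - q[state, action])

def q_learning_backup(q, state, action, reward, next_state, alpha, gamma):
    # off-policy: bootstrap from the greedy (target-policy) action,
    # regardless of what the behavior policy does next
    target = reward + gamma * np.max(q[next_state])
    q[state, action] += alpha * (target - q[state, action])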

2. Monte Carlo on- and off-policy control
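As a point of reference before the TD methods: Monte Carlo control updates Q(S, A) toward the complete return observed after visiting (S, A), instead of a bootstrapped one-step target. A minimal on-policy, first-visit sketch; the episode, returns and q structures here are illustrative assumptions, not code from this post:

import numpy as np
from collections import defaultdict

def first_visit_mc_update(q, returns, episode, gamma=0.9):
    # 'episode' is a list of (state, action, reward) tuples generated by an
    # epsilon-greedy policy derived from q; 'returns' maps (state, action) to
    # the list of sampled returns seen so far (e.g. a defaultdict(list)).
    first_visit = {}
    for t, (s, a, _) in enumerate(episode):
        first_visit.setdefault((s, a), t)
    g = 0.0
    # walk the episode backwards so g is the return that followed (s, a)
    for t in reversed(range(len(episode))):
        s, a, r = episode[t]
        g = gamma * g + r
        if first_visit[(s, a)] == t:
            returns[(s, a)].append(g)
            q[(s, a)] = np.mean(returns[(s, a)])
    return q

The off-policy variant keeps the same return targets but reweights them with importance-sampling ratios between the target policy and the behavior policy.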

3. Sarsa

Sarsa is an on-policy example of a TD (temporal-difference) control algorithm.
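Concretely, after taking action A in state S, observing reward R and the next state S', and then choosing the next action A' with the same epsilon-greedy policy, Sarsa updates

    Q(S, A) <- Q(S, A) + alpha * [ R + gamma * Q(S', A') - Q(S, A) ]

which is the line marked "plain Sarsa update" in the code below; the name comes from the quintuple (S, A, R, S', A') used in the update.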

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.colors import hsv_to_rgb

# ENV --
WORLD_HEIGHT = 4
WORLD_WIDTH = 12
NUM_STATE = WORLD_WIDTH * WORLD_HEIGHT

# ACTION --
# note: action 2 moves right and action 3 moves left in step(), matching ACTIONS below
UP = 0
DOWN = 1
RIGHT = 2
LEFT = 3
NUM_ACTIONS = 4
ACTION = [UP, DOWN, RIGHT, LEFT]
ACTIONS = ['U', 'D', 'R', 'L']

# STATE --
START = (3, 0)
END = (3, 11)


def change_range(values, vmin=0, vmax=1):
    start_zero = values - np.min(values)
    return (start_zero / (np.max(start_zero) + 1e-7)) * (vmax - vmin) + vmin


class ENVIRONMENT:
    terrain_color = dict(normal=[127/360, 0, 96/100],
                         objective=[26/360, 100/100, 100/100],
                         cliff=[247/360, 92/100, 70/100],
                         player=[344/360, 93/100, 100/100])

    def __init__(self):
        self.player = None
        self.num_steps = 0
        self.CreateEnv()
        self.DrawEnv()

    def CreateEnv(self, inital_grid=None):
        # create the cliff-walking grid world, laid out as:
        #      0                     11
        # 0    x x x x x x x x x x x x
        # 1    x x x x x x x x x x x x
        # 2    x x x x x x x x x x x x
        # 3    S o o o o o o o o o o E
        self.grid = self.terrain_color['normal'] * np.ones((WORLD_HEIGHT, WORLD_WIDTH, 3))
        self.grid[-1, 1:11] = self.terrain_color['cliff']
        self.grid[-1, -1] = self.terrain_color['objective']

    def DrawEnv(self):
        self.fig, self.ax = plt.subplots(figsize=(WORLD_WIDTH, WORLD_HEIGHT))
        self.ax.grid(which='minor')
        self.q_texts = [self.ax.text(i % WORLD_WIDTH, i // WORLD_WIDTH, '0',
                                     fontsize=11, verticalalignment='center',
                                     horizontalalignment='center')
                        for i in range(WORLD_WIDTH * WORLD_HEIGHT)]
        self.im = self.ax.imshow(hsv_to_rgb(self.grid), cmap='terrain',
                                 interpolation='nearest', vmin=0, vmax=1)
        self.ax.set_xticks(np.arange(WORLD_WIDTH))
        self.ax.set_xticks(np.arange(WORLD_WIDTH) - 0.5, minor=True)
        self.ax.set_yticks(np.arange(WORLD_HEIGHT))
        self.ax.set_yticks(np.arange(WORLD_HEIGHT) - 0.5, minor=True)
        # plt.show()

    def step(self, action):
        # Possible actions
        if action == 0 and self.player[0] > 0:
            self.player = (self.player[0] - 1, self.player[1])
        if action == 1 and self.player[0] < 3:
            self.player = (self.player[0] + 1, self.player[1])
        if action == 2 and self.player[1] < 11:
            self.player = (self.player[0], self.player[1] + 1)
        if action == 3 and self.player[1] > 0:
            self.player = (self.player[0], self.player[1] - 1)
        self.num_steps = self.num_steps + 1

        # Rewards
        # common situation: reward = -1 and the game carries on
        reward = -1
        done = False
        # walking onto the cliff ends the game with reward = -100
        # reaching the destination ends the game with reward = 0
        if self.player[0] == WORLD_HEIGHT - 1 and 0 < self.player[1] < WORLD_WIDTH - 1:
            reward = -100
            done = True
        elif self.player[0] == END[0] and self.player[1] == END[1]:
            reward = 0
            done = True
        return self.player, reward, done

    def reset(self):
        self.player = (START[0], START[1])
        self.num_steps = 0
        return self.player

    def RenderEnv(self, q_values, action=None, max_q=False, colorize_q=False):
        assert self.player is not None, 'You first need to call .reset()'
        if colorize_q:
            # color each cell by its maximal Q value
            grid = self.terrain_color['normal'] * np.ones((WORLD_HEIGHT, WORLD_WIDTH, 3))
            values = change_range(np.max(q_values, -1)).reshape(WORLD_HEIGHT, WORLD_WIDTH)
            grid[:, :, 1] = values
            grid[-1, 1:11] = self.terrain_color['cliff']
            grid[-1, -1] = self.terrain_color['objective']
        else:
            grid = self.grid.copy()
        # render the player cell
        grid[self.player] = self.terrain_color['player']
        self.im.set_data(hsv_to_rgb(grid))
        if q_values is not None:
            # print the four action values inside every cell
            for i, text in enumerate(self.q_texts):
                txt = ""
                for a in range(len(ACTIONS)):
                    txt += str(ACTIONS[a]) + ":" + str(
                        round(q_values[i // WORLD_WIDTH, i % WORLD_WIDTH, a], 2)) + '\n'
                text.set_text(txt)
        # show the action in the title
        if action is not None:
            self.ax.set_title(action, color='r', weight='bold', fontsize=32)
        plt.pause(0.1)


def egreedy_policy(q_values, state, epsilon=0.1):
    # with probability epsilon pick a random action, otherwise a greedy one
    if np.random.binomial(1, epsilon) == 1:
        return np.random.choice(ACTION)
    else:
        values_ = q_values[state[0], state[1], :]
        return np.random.choice([action_ for action_, value_ in enumerate(values_)
                                 if value_ == np.max(values_)])


def sarsa(env, episodes=500, render=True, epsilon=0.1, learning_rate=0.5, gamma=0.9):
    q_values_sarsa = np.zeros((WORLD_HEIGHT, WORLD_WIDTH, NUM_ACTIONS))
    ep_rewards = []
    # sarsa begin...
    for _ in range(episodes):
        state = env.reset()
        done = False
        reward_sum = 0
        action = egreedy_policy(q_values_sarsa, state, epsilon)
        while not done:
            next_state, reward, done = env.step(action)
            next_action = egreedy_policy(q_values_sarsa, next_state, epsilon)
            # plain Sarsa update: bootstrap from the action actually chosen next
            q_values_sarsa[state[0], state[1], action] += learning_rate * (
                reward
                + gamma * q_values_sarsa[next_state[0], next_state[1], next_action]
                - q_values_sarsa[state[0], state[1], action])
            # Expected Sarsa alternative: bootstrap from the expectation over the
            # epsilon-greedy policy at next_state instead of the sampled next_action, e.g.
            # next_q = q_values_sarsa[next_state[0], next_state[1], :]
            # expected_q = (epsilon / NUM_ACTIONS) * np.sum(next_q) + (1 - epsilon) * np.max(next_q)
            # q_values_sarsa[state[0], state[1], action] += learning_rate * (
            #     reward + gamma * expected_q - q_values_sarsa[state[0], state[1], action])
            state = next_state
            action = next_action
            # record the rewards for later comparison; not required by the algorithm itself
            reward_sum += reward
            if render:
                env.RenderEnv(q_values_sarsa, action=ACTIONS[action], colorize_q=True)
        ep_rewards.append(reward_sum)
    # sarsa end...
    return ep_rewards, q_values_sarsa


def play(q_values):
    # replay the environment greedily using the learned Q values
    env = ENVIRONMENT()
    state = env.reset()
    done = False
    while not done:
        # select the greedy action (epsilon = 0)
        action = egreedy_policy(q_values, state, 0.0)
        # take the action
        state_, R, done = env.step(action)
        # update the state
        state = state_
        env.RenderEnv(q_values=q_values, action=ACTIONS[action], colorize_q=True)


env = ENVIRONMENT()
sarsa_rewards, q_values_sarsa = sarsa(env, episodes=500, render=False, epsilon=0.1, learning_rate=1, gamma=0.9)
play(q_values_sarsa)

4. Q-learning

Q-learning is an off-policy example of a TD control algorithm.
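Its update differs from Sarsa's only in the bootstrap target: it backs up from the best action value in the next state, no matter which action the epsilon-greedy behavior policy actually takes there:

    Q(S, A) <- Q(S, A) + alpha * [ R + gamma * max_a Q(S', a) - Q(S, A) ]

Because the learned (target) policy is greedy while the behavior policy keeps exploring, the two policies differ, which is what makes Q-learning off-policy.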

# Cliff-walking example: Q-learning code #
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.colors import hsv_to_rgb

# The ENV / ACTION / STATE constants, change_range(), the ENVIRONMENT class,
# egreedy_policy() and play() are identical to the Sarsa listing above and are
# not repeated here.


def Qlearning(env, episodes=500, render=True, epsilon=0.1, learning_rate=0.5, gamma=0.9):
    q_values = np.zeros((WORLD_HEIGHT, WORLD_WIDTH, NUM_ACTIONS))
    ep_rewards = []
    # Qlearning begin...
    for _ in range(episodes):
        state = env.reset()
        done = False
        reward_sum = 0
        while not done:
            # behave epsilon-greedily...
            action = egreedy_policy(q_values, state, epsilon)
            next_state, reward, done = env.step(action)
            # ...but bootstrap from the greedy (max) action value: off-policy
            q_values[state[0], state[1], action] += learning_rate * (
                reward
                + gamma * np.max(q_values[next_state[0], next_state[1], :])
                - q_values[state[0], state[1], action])
            state = next_state
            # record the rewards for later comparison; not required by the algorithm itself
            reward_sum += reward
            if render:
                env.RenderEnv(q_values, action=ACTIONS[action], colorize_q=True)
        ep_rewards.append(reward_sum)
    # Qlearning end...
    return ep_rewards, q_values


env = ENVIRONMENT()
q_learning_rewards, q_values = Qlearning(env, episodes=500, render=False, epsilon=0.1, learning_rate=1, gamma=0.9)
play(q_values)
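If both listings are run in the same session (so that sarsa_rewards and q_learning_rewards both exist), the two methods can be compared on the cliff-walking task. A minimal plotting sketch; the 10-episode smoothing window is an arbitrary choice:

import numpy as np
import matplotlib.pyplot as plt

def smooth(xs, k=10):
    # moving average over k episodes, to damp the exploration noise
    return np.convolve(xs, np.ones(k) / k, mode='valid')

plt.figure()
plt.plot(smooth(sarsa_rewards), label='Sarsa')
plt.plot(smooth(q_learning_rewards), label='Q-learning')
plt.xlabel('Episode')
plt.ylabel('Sum of rewards per episode (smoothed)')
plt.legend()
plt.show()

The textbook result for this task (Sutton & Barto, Example 6.6) is that with epsilon-greedy behavior Sarsa learns the longer, safer path away from the cliff and earns higher online reward, while Q-learning learns the optimal path along the cliff edge but occasionally falls off during training.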
