These are my reinforcement learning notes, based mainly on the following material:

  • Reinforcement Learning: An Introduction
  • All code comes from GitHub
  • Exercise solutions reference GitHub

Contents

  • Windy Gridworld
    • Code
      • Environment
      • Sarsa
      • Visualization
  • Windy Gridworld with King’s Moves

Windy Gridworld

Shown below is a standard gridworld, with start and goal states, but with one difference: there is a crosswind running upward through the middle of the grid. The actions are the standard four (up, down, right, and left), but in the middle region the resultant next states are shifted upward by a "wind," the strength of which varies from column to column. The strength of the wind is given below each column, in number of cells shifted upward.


This is an undiscounted episodic task, with constant rewards of −1 until the goal state is reached.

The graph below shows the results of applying $\varepsilon$-greedy Sarsa to this task, with $\varepsilon = 0.1$, $\alpha = 0.5$, and the initial values $Q(s, a) = 0$ for all $s, a$.
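For reference, the one-step Sarsa update applied at every transition is

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma\, Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \right],$$

with $\gamma = 1$ here, since the task is undiscounted; this is exactly the update implemented in the episode() function in the code below.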

The increasing slope of the graph shows that the goal was reached more quickly over time. By 8000 time steps, the greedy policy was long since optimal (a trajectory from it is shown in the graph above); continued $\varepsilon$-greedy exploration kept the average episode length at about 17 steps, two more than the minimum of 15.

Note that Monte Carlo methods cannot easily be used here because termination is not guaranteed for all policies. If a policy was ever found that caused the agent to stay in the same state, then the next episode would never end. Online learning methods such as Sarsa do not have this problem because they quickly learn during the episode that such policies are poor, and switch to something else.

Code

#######################################################################
# Copyright (C)                                                       #
# 2016-2018 Shangtong Zhang(zhangshangtong.cpp@gmail.com)             #
# 2016 Kenta Shimada(hyperkentakun@gmail.com)                         #
# Permission given to modify the code as long as you keep this        #
# declaration at the top                                              #
#######################################################################
import numpy as np
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt

Environment

# world height
WORLD_HEIGHT = 7

# world width
WORLD_WIDTH = 10

# wind strength for each column
WIND = [0, 0, 0, 1, 1, 1, 2, 2, 1, 0]

# possible actions
ACTION_UP = 0
ACTION_DOWN = 1
ACTION_LEFT = 2
ACTION_RIGHT = 3

# probability for exploration
EPSILON = 0.1

# Sarsa step size
ALPHA = 0.5

# reward for each step
REWARD = -1.0

START = [3, 0]
GOAL = [3, 7]
ACTIONS = [ACTION_UP, ACTION_DOWN, ACTION_LEFT, ACTION_RIGHT]

# return the new state s'
def step(state, action):
    i, j = state
    if action == ACTION_UP:
        return [max(i - 1 - WIND[j], 0), j]
    elif action == ACTION_DOWN:
        return [max(min(i + 1 - WIND[j], WORLD_HEIGHT - 1), 0), j]
    elif action == ACTION_LEFT:
        return [max(i - WIND[j], 0), max(j - 1, 0)]
    elif action == ACTION_RIGHT:
        return [max(i - WIND[j], 0), min(j + 1, WORLD_WIDTH - 1)]
    else:
        assert False

Sarsa

# play for an episode
def episode(q_value):
    # track the total time steps in this episode
    time = 0

    # initialize state
    state = START

    # choose an action based on epsilon-greedy algorithm
    if np.random.binomial(1, EPSILON) == 1:
        action = np.random.choice(ACTIONS)
    else:
        values_ = q_value[state[0], state[1], :]
        action = np.random.choice([action_ for action_, value_ in enumerate(values_) if value_ == np.max(values_)])

    # keep going until get to the goal state
    while state != GOAL:
        next_state = step(state, action)
        if np.random.binomial(1, EPSILON) == 1:
            next_action = np.random.choice(ACTIONS)
        else:
            values_ = q_value[next_state[0], next_state[1], :]
            next_action = np.random.choice([action_ for action_, value_ in enumerate(values_) if value_ == np.max(values_)])

        # Sarsa update
        q_value[state[0], state[1], action] += \
            ALPHA * (REWARD + q_value[next_state[0], next_state[1], next_action] -
                     q_value[state[0], state[1], action])
        state = next_state
        action = next_action
        time += 1
    return time

Visualization

def figure_6_3():
    q_value = np.zeros((WORLD_HEIGHT, WORLD_WIDTH, len(ACTIONS)))
    episode_limit = 500

    steps = []
    ep = 0
    while ep < episode_limit:
        steps.append(episode(q_value))
        # time = episode(q_value)
        # episodes.extend([ep] * time)
        ep += 1

    steps = np.add.accumulate(steps)

    plt.plot(steps, np.arange(1, len(steps) + 1))
    plt.xlabel('Time steps')
    plt.ylabel('Episodes')

    plt.savefig('../images/figure_6_3.png')
    plt.close()

    # display the optimal policy (greedy)
    optimal_policy = []
    for i in range(0, WORLD_HEIGHT):
        optimal_policy.append([])
        for j in range(0, WORLD_WIDTH):
            if [i, j] == GOAL:
                optimal_policy[-1].append('G')
                continue
            bestAction = np.argmax(q_value[i, j, :])
            if bestAction == ACTION_UP:
                optimal_policy[-1].append('U')
            elif bestAction == ACTION_DOWN:
                optimal_policy[-1].append('D')
            elif bestAction == ACTION_LEFT:
                optimal_policy[-1].append('L')
            elif bestAction == ACTION_RIGHT:
                optimal_policy[-1].append('R')
    print('Optimal policy is:')
    for row in optimal_policy:
        print(row)
    print('Wind strength for each column:\n{}'.format([str(w) for w in WIND]))
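To reproduce the plot and the printed policy below, it should be enough to call the function as a script (a minimal sketch; the original repository invokes it from its own entry point):

if __name__ == '__main__':
    figure_6_3()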

output:

Optimal policy is:
['R', 'R', 'R', 'U', 'R', 'R', 'R', 'R', 'R', 'D']
['L', 'R', 'R', 'R', 'R', 'R', 'R', 'R', 'R', 'D']
['R', 'R', 'U', 'R', 'R', 'R', 'D', 'R', 'R', 'D']
['R', 'R', 'R', 'R', 'R', 'R', 'R', 'G', 'R', 'D']
['R', 'R', 'R', 'R', 'R', 'R', 'U', 'D', 'L', 'L']
['R', 'R', 'R', 'R', 'R', 'U', 'U', 'L', 'D', 'D']
['R', 'R', 'R', 'R', 'U', 'U', 'U', 'U', 'U', 'L']
Wind strength for each column:
['0', '0', '0', '1', '1', '1', '2', '2', '1', '0']

Windy Gridworld with King’s Moves

Re-solve the windy gridworld assuming eight possible actions, including the diagonal moves. How much better can you do with the extra actions? Can you do even better by including a ninth action that causes no movement at all other than that caused by the wind?

Adjust the code above as follows:

# possible actions
ACTION_UP = 0
ACTION_DOWN = 1
ACTION_LEFT = 2
ACTION_RIGHT = 3
ACTION_LEFT_UP = 4
ACTION_LEFT_DOWN = 5
ACTION_RIGHT_UP = 6
ACTION_RIGHT_DOWN = 7
ACTION_STOP = 8

ACTIONS = [ACTION_UP, ACTION_DOWN, ACTION_LEFT, ACTION_RIGHT,
           ACTION_LEFT_UP, ACTION_LEFT_DOWN, ACTION_RIGHT_UP, ACTION_RIGHT_DOWN,
           ACTION_STOP]

def step(state, action):
    i, j = state

    def in_boundary(x, y):
        return [max(min(x, WORLD_HEIGHT - 1), 0), max(min(y, WORLD_WIDTH - 1), 0)]

    if action == ACTION_UP:
        return in_boundary(i - 1 - WIND[j], j)
    elif action == ACTION_DOWN:
        return in_boundary(i + 1 - WIND[j], j)
    elif action == ACTION_LEFT:
        return in_boundary(i - WIND[j], j - 1)
    elif action == ACTION_RIGHT:
        return in_boundary(i - WIND[j], j + 1)
    elif action == ACTION_LEFT_UP:
        return in_boundary(i - 1 - WIND[j], j - 1)
    elif action == ACTION_LEFT_DOWN:
        return in_boundary(i + 1 - WIND[j], j - 1)
    elif action == ACTION_RIGHT_UP:
        return in_boundary(i - 1 - WIND[j], j + 1)
    elif action == ACTION_RIGHT_DOWN:
        return in_boundary(i + 1 - WIND[j], j + 1)
    elif action == ACTION_STOP:
        return in_boundary(i - WIND[j], j)

output:

Optimal policy is:
['R', 'LU', 'D', 'D', 'R', 'R', 'R', 'RD', 'R', 'D']
['L', 'LD', 'LD', 'LD', 'D', 'S', 'RD', 'RD', 'R', 'D']
['LD', 'R', 'LD', 'L', 'LU', 'RD', 'RU', 'R', 'S', 'LD']
['RD', 'D', 'U', 'D', 'RU', 'RD', 'RD', 'G', 'LD', 'LD']
['LD', 'D', 'S', 'RD', 'RD', 'RD', 'RD', 'D', 'L', 'D']
['LD', 'RD', 'LD', 'U', 'R', 'RD', 'R', 'S', 'L', 'LD']
['LD', 'LU', 'RD', 'RD', 'RD', 'RD', 'RU', 'U', 'U', 'L']
Wind strength for each column:
['0', '0', '0', '1', '1', '1', '2', '2', '1', '0']

The average episode length is reduced to about 7 steps.
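The rest of the training code can be reused unchanged, since the Q-table is sized from len(ACTIONS) and episode() only selects actions through ACTIONS. Below is a hypothetical comparison sketch for the two questions in the exercise (not in the original post; note that the greedy-policy printout in figure_6_3 would also need extra branches to label the diagonal and stop actions, as in the 'LU', 'RD', 'S' output above):

# hypothetical comparison sketch: train with and without the ninth "stop" action
KINGS_MOVES = [ACTION_UP, ACTION_DOWN, ACTION_LEFT, ACTION_RIGHT,
               ACTION_LEFT_UP, ACTION_LEFT_DOWN, ACTION_RIGHT_UP, ACTION_RIGHT_DOWN]

ACTIONS = KINGS_MOVES                   # eight actions: king's moves only
figure_6_3()

ACTIONS = KINGS_MOVES + [ACTION_STOP]   # nine actions: add the "no move" action
figure_6_3()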
