These are my reinforcement learning notes, based mainly on the following material:

  • Reinforcement Learning: An Introduction
  • All code comes from GitHub
  • Exercise solutions reference GitHub

Contents

  • Windy Gridworld
    • Code
      • Environment
      • Sarsa
      • Visualization
  • Windy Gridworld with King’s Moves

Windy Gridworld

Shown below is a standard gridworld, with start and goal states, but with one difference: there is a crosswind running upward through the middle of the grid. The actions are the standard four (up, down, right, and left), but in the middle region the resultant next states are shifted upward by a "wind," the strength of which varies from column to column. The strength of the wind is given below each column, in number of cells shifted upward.


This is an undiscounted episodic task, with constant rewards of −1 until the goal state is reached.

The graph below shows the results of applying $\varepsilon$-greedy Sarsa to this task, with $\varepsilon = 0.1$, $\alpha = 0.5$, and the initial values $Q(s, a) = 0$ for all $s, a$.
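For reference, the one-step Sarsa update applied at every transition is

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma\, Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \right],$$

with $\gamma = 1$ here, since the task is undiscounted; this is exactly the update implemented in the episode() function in the code below.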

The increasing slope of the graph shows that the goal was reached more quickly over time. By 8000 time steps, the greedy policy was long since optimal (a trajectory from it is shown in the graph above); continued $\varepsilon$-greedy exploration kept the average episode length at about 17 steps, two more than the minimum of 15.

Note that Monte Carlo methods cannot easily be used here because termination is not guaranteed for all policies. If a policy was ever found that caused the agent to stay in the same state, then the next episode would never end. Online learning methods such as Sarsa do not have this problem because they quickly learn during the episode that such policies are poor, and switch to something else.

Code

#######################################################################
# Copyright (C)                                                       #
# 2016-2018 Shangtong Zhang(zhangshangtong.cpp@gmail.com)             #
# 2016 Kenta Shimada(hyperkentakun@gmail.com)                         #
# Permission given to modify the code as long as you keep this        #
# declaration at the top                                              #
#######################################################################
import numpy as np
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt

Environment

# world height
WORLD_HEIGHT = 7

# world width
WORLD_WIDTH = 10

# wind strength for each column
WIND = [0, 0, 0, 1, 1, 1, 2, 2, 1, 0]

# possible actions
ACTION_UP = 0
ACTION_DOWN = 1
ACTION_LEFT = 2
ACTION_RIGHT = 3

# probability for exploration
EPSILON = 0.1

# Sarsa step size
ALPHA = 0.5

# reward for each step
REWARD = -1.0

START = [3, 0]
GOAL = [3, 7]
ACTIONS = [ACTION_UP, ACTION_DOWN, ACTION_LEFT, ACTION_RIGHT]

# return the new state s'
def step(state, action):
    i, j = state
    if action == ACTION_UP:
        return [max(i - 1 - WIND[j], 0), j]
    elif action == ACTION_DOWN:
        return [max(min(i + 1 - WIND[j], WORLD_HEIGHT - 1), 0), j]
    elif action == ACTION_LEFT:
        return [max(i - WIND[j], 0), max(j - 1, 0)]
    elif action == ACTION_RIGHT:
        return [max(i - WIND[j], 0), min(j + 1, WORLD_WIDTH - 1)]
    else:
        assert False

Sarsa

# play for an episode
def episode(q_value):
    # track the total time steps in this episode
    time = 0

    # initialize state
    state = START

    # choose an action based on epsilon-greedy algorithm
    if np.random.binomial(1, EPSILON) == 1:
        action = np.random.choice(ACTIONS)
    else:
        values_ = q_value[state[0], state[1], :]
        action = np.random.choice([action_ for action_, value_ in enumerate(values_) if value_ == np.max(values_)])

    # keep going until get to the goal state
    while state != GOAL:
        next_state = step(state, action)
        if np.random.binomial(1, EPSILON) == 1:
            next_action = np.random.choice(ACTIONS)
        else:
            values_ = q_value[next_state[0], next_state[1], :]
            next_action = np.random.choice([action_ for action_, value_ in enumerate(values_) if value_ == np.max(values_)])

        # Sarsa update
        q_value[state[0], state[1], action] += \
            ALPHA * (REWARD + q_value[next_state[0], next_state[1], next_action] -
                     q_value[state[0], state[1], action])
        state = next_state
        action = next_action
        time += 1
    return time

Visualization

def figure_6_3():
    q_value = np.zeros((WORLD_HEIGHT, WORLD_WIDTH, len(ACTIONS)))
    episode_limit = 500

    steps = []
    ep = 0
    while ep < episode_limit:
        steps.append(episode(q_value))
        # time = episode(q_value)
        # episodes.extend([ep] * time)
        ep += 1

    steps = np.add.accumulate(steps)

    plt.plot(steps, np.arange(1, len(steps) + 1))
    plt.xlabel('Time steps')
    plt.ylabel('Episodes')

    plt.savefig('../images/figure_6_3.png')
    plt.close()

    # display the optimal policy (greedy)
    optimal_policy = []
    for i in range(0, WORLD_HEIGHT):
        optimal_policy.append([])
        for j in range(0, WORLD_WIDTH):
            if [i, j] == GOAL:
                optimal_policy[-1].append('G')
                continue
            bestAction = np.argmax(q_value[i, j, :])
            if bestAction == ACTION_UP:
                optimal_policy[-1].append('U')
            elif bestAction == ACTION_DOWN:
                optimal_policy[-1].append('D')
            elif bestAction == ACTION_LEFT:
                optimal_policy[-1].append('L')
            elif bestAction == ACTION_RIGHT:
                optimal_policy[-1].append('R')
    print('Optimal policy is:')
    for row in optimal_policy:
        print(row)
    print('Wind strength for each column:\n{}'.format([str(w) for w in WIND]))
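To reproduce the plot and the printed policy below, it should be enough to call the function as a script (a minimal sketch; the original repository invokes it from its own entry point):

if __name__ == '__main__':
    figure_6_3()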

output:

Optimal policy is:
['R', 'R', 'R', 'U', 'R', 'R', 'R', 'R', 'R', 'D']
['L', 'R', 'R', 'R', 'R', 'R', 'R', 'R', 'R', 'D']
['R', 'R', 'U', 'R', 'R', 'R', 'D', 'R', 'R', 'D']
['R', 'R', 'R', 'R', 'R', 'R', 'R', 'G', 'R', 'D']
['R', 'R', 'R', 'R', 'R', 'R', 'U', 'D', 'L', 'L']
['R', 'R', 'R', 'R', 'R', 'U', 'U', 'L', 'D', 'D']
['R', 'R', 'R', 'R', 'U', 'U', 'U', 'U', 'U', 'L']
Wind strength for each column:
['0', '0', '0', '1', '1', '1', '2', '2', '1', '0']

Windy Gridworld with King’s Moves

Re-solve the windy gridworld assuming eight possible actions, including the diagonal moves. How much better can you do with the extra actions? Can you do even better by including a ninth action that causes no movement at all other than that caused by the wind?

Adjust the code above as follows:

# possible actions
ACTION_UP = 0
ACTION_DOWN = 1
ACTION_LEFT = 2
ACTION_RIGHT = 3
ACTION_LEFT_UP = 4
ACTION_LEFT_DOWN = 5
ACTION_RIGHT_UP = 6
ACTION_RIGHT_DOWN = 7
ACTION_STOP = 8

ACTIONS = [ACTION_UP, ACTION_DOWN, ACTION_LEFT, ACTION_RIGHT,
           ACTION_LEFT_UP, ACTION_LEFT_DOWN, ACTION_RIGHT_UP, ACTION_RIGHT_DOWN,
           ACTION_STOP]

def step(state, action):
    i, j = state

    def in_boundary(x, y):
        return [max(min(x, WORLD_HEIGHT - 1), 0), max(min(y, WORLD_WIDTH - 1), 0)]

    if action == ACTION_UP:
        return in_boundary(i - 1 - WIND[j], j)
    elif action == ACTION_DOWN:
        return in_boundary(i + 1 - WIND[j], j)
    elif action == ACTION_LEFT:
        return in_boundary(i - WIND[j], j - 1)
    elif action == ACTION_RIGHT:
        return in_boundary(i - WIND[j], j + 1)
    elif action == ACTION_LEFT_UP:
        return in_boundary(i - 1 - WIND[j], j - 1)
    elif action == ACTION_LEFT_DOWN:
        return in_boundary(i + 1 - WIND[j], j - 1)
    elif action == ACTION_RIGHT_UP:
        return in_boundary(i - 1 - WIND[j], j + 1)
    elif action == ACTION_RIGHT_DOWN:
        return in_boundary(i + 1 - WIND[j], j + 1)
    elif action == ACTION_STOP:
        return in_boundary(i - WIND[j], j)

output:

Optimal policy is:
['R', 'LU', 'D', 'D', 'R', 'R', 'R', 'RD', 'R', 'D']
['L', 'LD', 'LD', 'LD', 'D', 'S', 'RD', 'RD', 'R', 'D']
['LD', 'R', 'LD', 'L', 'LU', 'RD', 'RU', 'R', 'S', 'LD']
['RD', 'D', 'U', 'D', 'RU', 'RD', 'RD', 'G', 'LD', 'LD']
['LD', 'D', 'S', 'RD', 'RD', 'RD', 'RD', 'D', 'L', 'D']
['LD', 'RD', 'LD', 'U', 'R', 'RD', 'R', 'S', 'L', 'LD']
['LD', 'LU', 'RD', 'RD', 'RD', 'RD', 'RU', 'U', 'U', 'L']
Wind strength for each column:
['0', '0', '0', '1', '1', '1', '2', '2', '1', '0']

The average episode length is reduced to about 7 steps.
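The rest of the training code can be reused unchanged, since the Q-table is sized from len(ACTIONS) and episode() only selects actions through ACTIONS. Below is a hypothetical comparison sketch for the two questions in the exercise (not in the original post; note that the greedy-policy printout in figure_6_3 would also need extra branches to label the diagonal and stop actions, as in the 'LU', 'RD', 'S' output above):

# hypothetical comparison sketch: train with and without the ninth "stop" action
KINGS_MOVES = [ACTION_UP, ACTION_DOWN, ACTION_LEFT, ACTION_RIGHT,
               ACTION_LEFT_UP, ACTION_LEFT_DOWN, ACTION_RIGHT_UP, ACTION_RIGHT_DOWN]

ACTIONS = KINGS_MOVES                   # eight actions: king's moves only
figure_6_3()

ACTIONS = KINGS_MOVES + [ACTION_STOP]   # nine actions: add the "no move" action
figure_6_3()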
