These notes on reinforcement learning are based mainly on the following material:

  • Reinforcement Learning: An Introduction
  • All of the code comes from GitHub
  • Exercise answers reference GitHub

目录

  • GridWorld
  • Code
    • Settings
    • Environment
    • Visualization
    • Question (a)
    • Question (b)
  • Exercise 3.24

GridWorld

Figure 3.2 (left) shows a rectangular gridworld representation of a simple finite MDP. The cells of the grid correspond to the states of the environment. At each cell, four actions are possible: north, south, east, and west, which deterministically cause the agent to move one cell in the respective direction on the grid. Actions that would take the agent off the grid leave its location unchanged, but also result in a reward of −1. Other actions result in a reward of 0, except those that move the agent out of the special states A and B. From state A, all four actions yield a reward of +10 and take the agent to A'. From state B, all actions yield a reward of +5 and take the agent to B'.


(a) Find the state-value function for the equiprobable random policy.
(i.e., find $v_\pi$ under the equiprobable random policy)
(b) Solve the Bellman equation for $v_*$ for the simple grid task.

Code

#######################################################################
# Copyright (C)                                                       #
# 2016-2018 Shangtong Zhang(zhangshangtong.cpp@gmail.com)             #
# 2016 Kenta Shimada(hyperkentakun@gmail.com)                         #
# Permission given to modify the code as long as you keep this        #
# declaration at the top                                              #
#######################################################################
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.table import Table

matplotlib.use('Agg')   # select backend: AGG (non-interactive backend, capable of writing to a file)

Settings

WORLD_SIZE = 5
A_POS = [0, 1]
A_PRIME_POS = [4, 1]
B_POS = [0, 3]
B_PRIME_POS = [2, 3]
DISCOUNT = 0.9

# left, up, right, down
ACTIONS = [np.array([0, -1]),
           np.array([-1, 0]),
           np.array([0, 1]),
           np.array([1, 0])]
ACTIONS_FIGS = ['←', '↑', '→', '↓']

ACTION_PROB = 0.25      # probability of choosing each action (1 / 4)

Environment

def step(state, action):
    if state == A_POS:
        return A_PRIME_POS, 10
    if state == B_POS:
        return B_PRIME_POS, 5

    next_state = (np.array(state) + action).tolist()
    x, y = next_state
    if x < 0 or x >= WORLD_SIZE or y < 0 or y >= WORLD_SIZE:
        reward = -1.0
        next_state = state
    else:
        reward = 0
    return next_state, reward
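A quick sanity check of step() against the dynamics described above (this snippet is not part of the original script, just a small illustration):

# From A, every action yields +10 and teleports the agent to A'.
print(step(A_POS, ACTIONS[0]))          # ([4, 1], 10)
# Stepping off the grid leaves the state unchanged and costs -1.
print(step([0, 0], np.array([-1, 0])))  # ([0, 0], -1.0)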

Visualization

# I haven't looked into matplotlib.table in detail yet; for now the focus is on the algorithms.
# It is enough to know that this function takes a 2-D numpy array representing v(s) and draws the grid, labelling each cell with its v(s).
def draw_image(image):
    fig, ax = plt.subplots()
    ax.set_axis_off()
    tb = Table(ax, bbox=[0, 0, 1, 1])

    nrows, ncols = image.shape
    width, height = 1.0 / ncols, 1.0 / nrows

    # Add cells
    for (i, j), val in np.ndenumerate(image):
        # add state labels
        if [i, j] == A_POS:
            val = str(val) + " (A)"
        if [i, j] == A_PRIME_POS:
            val = str(val) + " (A')"
        if [i, j] == B_POS:
            val = str(val) + " (B)"
        if [i, j] == B_PRIME_POS:
            val = str(val) + " (B')"
        tb.add_cell(i, j, width, height, text=val,
                    loc='center', facecolor='white')

    # Row and column labels...
    for i in range(len(image)):
        tb.add_cell(i, -1, width, height, text=i+1, loc='right',
                    edgecolor='none', facecolor='none')
        tb.add_cell(-1, i, width, height/2, text=i+1, loc='center',
                    edgecolor='none', facecolor='none')

    ax.add_table(tb)

# This function draws arrows in the grid to visualize the optimal policy that is finally obtained.
def draw_policy(optimal_values):
    fig, ax = plt.subplots()
    ax.set_axis_off()
    tb = Table(ax, bbox=[0, 0, 1, 1])

    nrows, ncols = optimal_values.shape
    width, height = 1.0 / ncols, 1.0 / nrows

    # Add cells
    for (i, j), val in np.ndenumerate(optimal_values):
        next_vals = []
        for action in ACTIONS:
            next_state, _ = step([i, j], action)
            next_vals.append(optimal_values[next_state[0], next_state[1]])
        best_actions = np.where(next_vals == np.max(next_vals))[0]
        val = ''
        for ba in best_actions:
            val += ACTIONS_FIGS[ba]
        # add state labels
        if [i, j] == A_POS:
            val = str(val) + " (A)"
        if [i, j] == A_PRIME_POS:
            val = str(val) + " (A')"
        if [i, j] == B_POS:
            val = str(val) + " (B)"
        if [i, j] == B_PRIME_POS:
            val = str(val) + " (B')"
        tb.add_cell(i, j, width, height, text=val,
                    loc='center', facecolor='white')

    # Row and column labels...
    for i in range(len(optimal_values)):
        tb.add_cell(i, -1, width, height, text=i+1, loc='right',
                    edgecolor='none', facecolor='none')
        tb.add_cell(-1, i, width, height/2, text=i+1, loc='center',
                    edgecolor='none', facecolor='none')

    ax.add_table(tb)

Question (a)

  • The method below applies the Bellman expectation equation repeatedly until the value function converges (the update rule it applies is written out after the code).
def figure_3_2():
    value = np.zeros((WORLD_SIZE, WORLD_SIZE))
    while True:
        # keep iteration until convergence
        new_value = np.zeros_like(value)
        for i in range(WORLD_SIZE):
            for j in range(WORLD_SIZE):
                for action in ACTIONS:
                    (next_i, next_j), reward = step([i, j], action)
                    # bellman equation
                    new_value[i, j] += ACTION_PROB * (reward + DISCOUNT * value[next_i, next_j])
        if np.sum(np.abs(value - new_value)) < 1e-4:
            draw_image(np.round(new_value, decimals=2))
            plt.savefig('../images/figure_3_2.png')
            plt.close()
            break
        value = new_value
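For reference, the update applied in the innermost loop is the Bellman expectation equation for the equiprobable random policy, written for this gridworld where each $(s, a)$ leads to a single deterministic $s'$ and $r$:

$v_{k+1}(s) = \sum_a \pi(a \mid s)\,\bigl[r(s,a) + \gamma\, v_k(s')\bigr] = \frac{1}{4}\sum_a \bigl[r(s,a) + \gamma\, v_k(s')\bigr]$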
  • The method below instead solves the system of Bellman equations directly (one linear equation per state); the rearrangement that fills the coefficient matrix is written out after the code.
def figure_3_2_linear_system():
    '''
    Here we solve the linear system of equations to find the exact solution.
    We do this by filling the coefficients for each of the states with their respective right side constant.
    '''
    A = -1 * np.eye(WORLD_SIZE * WORLD_SIZE)
    b = np.zeros(WORLD_SIZE * WORLD_SIZE)
    for i in range(WORLD_SIZE):
        for j in range(WORLD_SIZE):
            s = [i, j]  # current state
            index_s = np.ravel_multi_index(s, (WORLD_SIZE, WORLD_SIZE))  # flatten the 2-D coordinate into a 1-D index
            for a in ACTIONS:
                s_, r = step(s, a)
                index_s_ = np.ravel_multi_index(s_, (WORLD_SIZE, WORLD_SIZE))

                A[index_s, index_s_] += ACTION_PROB * DISCOUNT
                b[index_s] -= ACTION_PROB * r

    x = np.linalg.solve(A, b)
    draw_image(np.round(x.reshape(WORLD_SIZE, WORLD_SIZE), decimals=2))
    plt.savefig('../images/figure_3_2_linear_system.png')
    plt.close()
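To see why A and b are filled this way: for every state $s$, the Bellman equation $v_\pi(s) = \sum_a \pi(a \mid s)\,[r(s,a) + \gamma v_\pi(s')]$ can be rearranged into the linear form

$-v_\pi(s) + \sum_a \pi(a \mid s)\,\gamma\, v_\pi(s') = -\sum_a \pi(a \mid s)\, r(s,a)$

so each row of A starts at $-1$ on its diagonal, every action adds $\pi(a \mid s)\,\gamma = 0.25\,\gamma$ to the column of its successor state, and b collects $-0.25\, r$ on the right-hand side. Solving $A v = b$ with np.linalg.solve then gives the exact $v_\pi$.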

Question (b)


Because each action $a$ in this problem leads to a single deterministic $s'$ and $r$, the Bellman optimality equation simplifies to:
$v_*(s) = \max_a \left[ r + \gamma\, v_*(s') \right]$

  • The code below also iterates (value iteration) until the solution converges; a minimal driver that runs all three functions is sketched after it.
def figure_3_5():
    value = np.zeros((WORLD_SIZE, WORLD_SIZE))
    while True:
        # keep iteration until convergence
        new_value = np.zeros_like(value)
        for i in range(WORLD_SIZE):
            for j in range(WORLD_SIZE):
                values = []
                for action in ACTIONS:
                    (next_i, next_j), reward = step([i, j], action)
                    # value iteration
                    values.append(reward + DISCOUNT * value[next_i, next_j])
                new_value[i, j] = np.max(values)
        if np.sum(np.abs(new_value - value)) < 1e-4:
            draw_image(np.round(new_value, decimals=2))
            plt.savefig('../images/figure_3_5.png')
            plt.close()
            draw_policy(new_value)
            plt.savefig('../images/figure_3_5_policy.png')
            plt.close()
            break
        value = new_value
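To actually generate the figures, the three functions need to be called; a minimal driver along the following lines works (the __main__ guard is my addition, and it assumes the ../images/ output directory already exists):

if __name__ == '__main__':
    figure_3_2()                # v_pi by iterative policy evaluation
    figure_3_2_linear_system()  # v_pi by solving the linear system exactly
    figure_3_5()                # v_* and the optimal policy by value iteration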

Exercise 3.24

Figure 3.5 gives the optimal value of the best state of the gridworld as 24.4. Use your knowledge of the optimal policy and (3.8) to express this value symbolically, and then to compute it to three decimal places.

ANSWER

The best solution after reaching A is to return to A as quickly as possible after being moved to A'. That cycle takes 5 time steps, so the reward sequence from A is +10 followed by four 0s, repeating.
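Writing out the discounted return of this reward sequence gives a geometric series with ratio $\gamma^5$:

$v_*(A) = 10 + \gamma^5 \cdot 10 + \gamma^{10} \cdot 10 + \cdots = \frac{10}{1-\gamma^5}$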


That is, the theoretical answer is $\frac{10}{1-\gamma^5}$.
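Plugging in $\gamma = 0.9$ gives the three-decimal value the exercise asks for; a quick standalone check:

gamma = 0.9
exact = 10 / (1 - gamma ** 5)   # 10 / (1 - 0.59049)
print(round(exact, 3))          # 24.419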
