Gym usage analysis

Gym environment usage in the Actor-Critic (AC) CartPole example:

import gym
env = gym.make('CartPole-v0')
env.seed(1)  # reproducible
env = env.unwrapped
N_F = env.observation_space.shape[0]
N_A = env.action_space.n
s = env.reset()
s_, r, done, info = env.step(a)
env.render()
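
For reference, a complete interaction loop under this old gym API looks like the following sketch; the random action is only a stand-in for a policy (the variable a in the snippet above is otherwise undefined):

import gym

env = gym.make('CartPole-v0')
env.seed(1)                           # reproducible
env = env.unwrapped                   # strip the TimeLimit wrapper
N_F = env.observation_space.shape[0]  # number of state features (4 for CartPole)
N_A = env.action_space.n              # number of discrete actions (2 for CartPole)

s = env.reset()
for t in range(200):
    env.render()
    a = env.action_space.sample()     # random action; a learned policy would go here
    s_, r, done, info = env.step(a)
    s = s_
    if done:
        break
env.close()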

Code walkthrough: https://www.colabug.com/2019/0821/6655535/

Robot maze example: https://blog.csdn.net/extremebingo/article/details/80867486

Importing into the local library: https://blog.csdn.net/u011254180/article/details/88221426

To add a custom environment, there is no need to modify Gym's source code; creating a Python module is enough. The directory structure is explained as follows.

In the constructor, first define the type and shape of action_space (containing all actions the agent can take in this environment), and similarly define observation_space (containing all the kinds of data the agent receives in one observation of this environment).
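
As a minimal sketch of that idea (the class name MyMazeEnv and the space bounds here are illustrative placeholders; it uses the standard gym.spaces types, whereas the GridEnv1 code later in this post rolls its own structures):

import gym
import numpy as np
from gym import spaces


class MyMazeEnv(gym.Env):
    """Hypothetical skeleton of a custom environment (old gym API)."""
    metadata = {'render.modes': ['human']}

    def __init__(self):
        # four discrete actions: north, east, south, west
        self.action_space = spaces.Discrete(4)
        # one integer state in [0, 15], exposed as a 1-D Box observation
        self.observation_space = spaces.Box(low=0, high=15, shape=(1,), dtype=np.float32)
        self.state = None

    def reset(self):
        self.state = np.array([0], dtype=np.float32)
        return self.state

    def step(self, action):
        reward, done, info = 0.0, False, {}
        # ...transition logic and reward computation go here...
        return self.state, reward, done, info

    def render(self, mode='human'):
        pass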

Implementing a maze-walking robot with the AC algorithm

Compared with the figure above, the modified directory structure omits README and gym/setup.py; it mainly follows the blog post https://blog.csdn.net/extremebingo/article/details/80867486

  1. Create a new folder user under D:\python\Python3_7\Lib\site-packages\gym\envs to hold custom reinforcement learning environments

  2. Create the new environment file grid_mdp_v1.py

    Create the GridEnv1 class (see the code below for details). Its main settings are:

    """
    Description:
        A robot moves in a maze that contains an obstruction, a gem and fire.
        The goal is to reach the gem while avoiding the fire.
    Observation:
        Num   Observation   Min   Max
        0     state         0     15
    Actions:
        {0: 'n', 1: 'e', 2: 's', 3: 'w'}
    Reward:
        Fire    ->  -20
        Gem     ->   20
        Invalid ->   -1
        Valid   ->    0
        No gem until MAX STEPS  ->  -50
    Starting State:
        Random position except 5, 11, 12, 15 (obstruction, fire, gem).
    Episode Termination:
        The robot touches fire or the gem, or reaches MAX STEPS. [11, 12, 15]
    """
    

    The full code is given below; it slightly modifies the original blog post's code, fixing a few small bugs and adapting the interface to the AC algorithm:

import logging
import random
import gym
import numpy as np
from gym.utils import seeding


class GridEnv1(gym.Env):
    metadata = {
        'render.modes': ['human', 'rgb_array'],
        'video.frames_per_second': 2
    }

    def __init__(self):
        self.states = range(0, 16)              # state space 0~15
        self.observation_space = np.array([1])  # a single integer state, wrapped in an array
        self.x = [150, 250, 350, 450] * 4       # screen coordinates of each grid cell
        self.y = [450] * 4 + [350] * 4 + [250] * 4 + [150] * 4

        self.terminate_states = dict()          # terminal states stored as a dict
        self.terminate_states[11] = 1
        self.terminate_states[12] = 1
        self.terminate_states[15] = 1

        self.action_space = {0: 'n', 1: 'e', 2: 's', 3: 'w'}

        self.rewards = dict()                   # rewards stored as a dict
        self.rewards['8_s'] = -20.0
        self.rewards['13_w'] = -20.0
        self.rewards['7_s'] = -20.0
        self.rewards['10_e'] = -20.0
        self.rewards['14_e'] = 100.0

        self.t = dict()                         # state transitions stored as a dict
        self.t['0_s'] = 4
        self.t['0_e'] = 1
        self.t['1_e'] = 2
        self.t['1_w'] = 0
        self.t['2_w'] = 1
        self.t['2_e'] = 3
        self.t['2_s'] = 6
        self.t['3_w'] = 2
        self.t['3_s'] = 7
        self.t['4_n'] = 0
        self.t['4_s'] = 8
        self.t['6_n'] = 2
        self.t['6_s'] = 10
        self.t['6_e'] = 7
        self.t['7_w'] = 6
        self.t['7_n'] = 3
        self.t['7_s'] = 11
        self.t['8_n'] = 4
        self.t['8_e'] = 9
        self.t['8_s'] = 12
        self.t['9_w'] = 8
        self.t['9_e'] = 10
        self.t['9_s'] = 13
        self.t['10_w'] = 9
        self.t['10_n'] = 6
        self.t['10_e'] = 11
        self.t['10_s'] = 14
        self.t['13_n'] = 9
        self.t['13_e'] = 14
        self.t['13_w'] = 12
        self.t['14_n'] = 10
        self.t['14_e'] = 15
        self.t['14_w'] = 13

        self.gamma = 0.8                        # discount factor
        self.viewer = None
        self.state = None

    def _seed(self, seed=None):
        self.np_random, seed = seeding.np_random(seed)
        return [seed]

    def getTerminal(self):
        return self.terminate_states

    def getGamma(self):
        return self.gamma

    def getStates(self):
        return self.states

    def getAction(self):
        return self.action_space

    def getTerminate_states(self):
        return self.terminate_states

    def setAction(self, s):
        self.state = s

    def step(self, action):
        # current state of the system
        state = self.state
        if state in self.terminate_states:
            next_state = state
            r = 0
            is_terminal = True
        else:
            key = "%d_%s" % (state, self.action_space[action])  # build the dict key from state and action
            # state transition
            if key in self.t:
                next_state = self.t[key]
            else:
                next_state = state
            self.state = next_state
            is_terminal = False
            if next_state in self.terminate_states:
                is_terminal = True
            if key not in self.rewards:
                r = 0
            else:
                r = self.rewards[key]
        if type(next_state) != type(np.array([])):
            next_state = np.array([next_state])
        return next_state, r, is_terminal, {}

    def reset(self):
        self.close()
        while True:
            self.state = self.states[int(random.random() * len(self.states))]
            if self.state is None or self.state in (5, 11, 12, 15):
                continue
            else:
                break
        return np.array([self.state])  # return initial state

    def render(self, mode='human'):
        from gym.envs.classic_control import rendering
        screen_width = 600
        screen_height = 600
        if self.viewer is None:
            self.viewer = rendering.Viewer(screen_width, screen_height)
            # draw the grid lines
            self.line1 = rendering.Line((100, 100), (500, 100))
            self.line2 = rendering.Line((100, 200), (500, 200))
            self.line3 = rendering.Line((100, 300), (500, 300))
            self.line4 = rendering.Line((100, 400), (500, 400))
            self.line5 = rendering.Line((100, 500), (500, 500))
            self.line6 = rendering.Line((100, 100), (100, 500))
            self.line7 = rendering.Line((200, 100), (200, 500))
            self.line8 = rendering.Line((300, 100), (300, 500))
            self.line9 = rendering.Line((400, 100), (400, 500))
            self.line10 = rendering.Line((500, 100), (500, 500))
            # create the stone pillar (obstruction)
            self.shizhu = rendering.make_circle(40)
            self.circletrans = rendering.Transform(translation=(250, 350))
            self.shizhu.add_attr(self.circletrans)
            self.shizhu.set_color(0.8, 0.6, 0.4)
            # create the first fire pit (note: translation's y coordinate counts from the bottom up!)
            self.fire1 = rendering.make_circle(40)
            self.circletrans = rendering.Transform(translation=(450, 250))
            self.fire1.add_attr(self.circletrans)
            self.fire1.set_color(1, 0, 0)
            # create the second fire pit
            self.fire2 = rendering.make_circle(40)
            self.circletrans = rendering.Transform(translation=(150, 150))
            self.fire2.add_attr(self.circletrans)
            self.fire2.set_color(1, 0, 0)
            # create the gem
            self.diamond = rendering.make_circle(40)
            self.circletrans = rendering.Transform(translation=(450, 150))
            self.diamond.add_attr(self.circletrans)
            self.diamond.set_color(0, 0, 1)
            # create the robot
            self.robot = rendering.make_circle(30)
            self.robotrans = rendering.Transform()
            self.robot.add_attr(self.robotrans)
            self.robot.set_color(0, 1, 0)
            # grid lines are black; add every geometry to the viewer
            for line in (self.line1, self.line2, self.line3, self.line4, self.line5,
                         self.line6, self.line7, self.line8, self.line9, self.line10):
                line.set_color(0, 0, 0)
                self.viewer.add_geom(line)
            self.viewer.add_geom(self.shizhu)
            self.viewer.add_geom(self.fire1)
            self.viewer.add_geom(self.fire2)
            self.viewer.add_geom(self.diamond)
            self.viewer.add_geom(self.robot)
        if self.state is None:
            return None
        self.robotrans.set_translation(self.x[self.state], self.y[self.state])
        return self.viewer.render(return_rgb_array=(mode == 'rgb_array'))

    def close(self):
        if self.viewer:
            self.viewer.close()
            self.viewer = None

  3. Create a new __init__.py in the user folder

    from gym.envs.user.grid_mdp_v1 import GridEnv1
    
  4. Register the environment in the __init__.py of envs

    register(
        id='GridWorld-v1',
        entry_point='gym.envs.user:GridEnv1',
        max_episode_steps=200,
        reward_threshold=100.0,
    )
    
  5. Create a separate file to test it; it runs (a random-rollout sketch is given after this list)

    import gym
    env = gym.make('GridWorld-v1')
    env.reset()
    env.render()
    env.close()
    
  6. Plug it into the AC framework, replacing the gym environment with GridWorld-v1; after debugging the interface it runs successfully. Robot: green, obstruction: brown, fire pits: red, gem: blue
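
Here is the random-rollout sketch mentioned in step 5; the episode and step counts are arbitrary choices for a quick sanity check, not values from the original post:

import random
import gym

env = gym.make('GridWorld-v1')

for episode in range(3):              # a few short random episodes
    s = env.reset()
    for t in range(20):
        env.render()
        # GridEnv1 keeps its actions in a plain dict {0: 'n', 1: 'e', 2: 's', 3: 'w'},
        # so sample an index directly instead of calling action_space.sample()
        a = random.randrange(len(env.action_space))
        s_, r, done, info = env.step(a)
        s = s_
        if done:
            break
env.close()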


The code below is essentially Morvan Zhou's GitHub example with the environment swapped out and the reward scheme slightly modified:

"""
Actor-Critic using TD-error as the Advantage, Reinforcement Learning.
View more on my tutorial page: https://morvanzhou.github.io/tutorials/
"""

import numpy as np
import tensorflow as tf
import gym

np.random.seed(2)
tf.set_random_seed(2)  # reproducible

# Superparameters
OUTPUT_GRAPH = True
MAX_EPISODE = 300
DISPLAY_REWARD_THRESHOLD = 200  # renders environment if total episode reward is greater than this threshold
MAX_EP_STEPS = 200   # maximum time step in one episode
RENDER = False  # rendering wastes time
GAMMA = 0.9     # reward discount in TD error
LR_A = 0.001    # learning rate for actor
LR_C = 0.01     # learning rate for critic

env = gym.make('GridWorld-v1')
env.seed(1)  # reproducible
env = env.unwrapped

N_F = env.observation_space.shape[0]
N_A = len(env.action_space)


class Actor(object):
    def __init__(self, sess, n_features, n_actions, lr=0.001):
        self.sess = sess

        self.s = tf.placeholder(tf.float32, [1, n_features], "state")
        self.a = tf.placeholder(tf.int32, None, "act")
        self.td_error = tf.placeholder(tf.float32, None, "td_error")  # TD_error

        with tf.variable_scope('Actor'):
            l1 = tf.layers.dense(
                inputs=self.s,
                units=20,    # number of hidden units
                activation=tf.nn.relu,
                kernel_initializer=tf.random_normal_initializer(0., .1),    # weights
                bias_initializer=tf.constant_initializer(0.1),  # biases
                name='l1'
            )

            self.acts_prob = tf.layers.dense(
                inputs=l1,
                units=n_actions,    # output units
                activation=tf.nn.softmax,   # get action probabilities
                kernel_initializer=tf.random_normal_initializer(0., .1),  # weights
                bias_initializer=tf.constant_initializer(0.1),  # biases
                name='acts_prob'
            )

        with tf.variable_scope('exp_v'):
            log_prob = tf.log(self.acts_prob[0, self.a])
            self.exp_v = tf.reduce_mean(log_prob * self.td_error)  # advantage (TD_error) guided loss

        with tf.variable_scope('train'):
            self.train_op = tf.train.AdamOptimizer(lr).minimize(-self.exp_v)  # minimize(-exp_v) = maximize(exp_v)

    def learn(self, s, a, td):
        s = s[np.newaxis, :]
        feed_dict = {self.s: s, self.a: a, self.td_error: td}
        _, exp_v = self.sess.run([self.train_op, self.exp_v], feed_dict)
        return exp_v

    def choose_action(self, s):
        s = s[np.newaxis, :]
        probs = self.sess.run(self.acts_prob, {self.s: s})   # get probabilities for all actions
        return np.random.choice(np.arange(probs.shape[1]), p=probs.ravel())   # return an int


class Critic(object):
    def __init__(self, sess, n_features, lr=0.01):
        self.sess = sess

        self.s = tf.placeholder(tf.float32, [1, n_features], "state")
        self.v_ = tf.placeholder(tf.float32, [1, 1], "v_next")
        self.r = tf.placeholder(tf.float32, None, 'r')

        with tf.variable_scope('Critic'):
            l1 = tf.layers.dense(
                inputs=self.s,
                units=20,  # number of hidden units
                activation=tf.nn.relu,  # None
                # have to be linear to make sure the convergence of actor.
                # But linear approximator seems hardly learns the correct Q.
                kernel_initializer=tf.random_normal_initializer(0., .1),  # weights
                bias_initializer=tf.constant_initializer(0.1),  # biases
                name='l1'
            )

            self.v = tf.layers.dense(
                inputs=l1,
                units=1,  # output units
                activation=None,
                kernel_initializer=tf.random_normal_initializer(0., .1),  # weights
                bias_initializer=tf.constant_initializer(0.1),  # biases
                name='V'
            )

        with tf.variable_scope('squared_TD_error'):
            self.td_error = self.r + GAMMA * self.v_ - self.v
            self.loss = tf.square(self.td_error)    # TD_error = (r+gamma*V_next) - V_eval
        with tf.variable_scope('train'):
            self.train_op = tf.train.AdamOptimizer(lr).minimize(self.loss)

    def learn(self, s, r, s_):
        s, s_ = s[np.newaxis, :], s_[np.newaxis, :]

        v_ = self.sess.run(self.v, {self.s: s_})
        td_error, _ = self.sess.run([self.td_error, self.train_op],
                                    {self.s: s, self.v_: v_, self.r: r})
        return td_error


sess = tf.Session()

actor = Actor(sess, n_features=N_F, n_actions=N_A, lr=LR_A)
critic = Critic(sess, n_features=N_F, lr=LR_C)     # we need a good teacher, so the teacher should learn faster than the actor

sess.run(tf.global_variables_initializer())

if OUTPUT_GRAPH:
    tf.summary.FileWriter("logs/", sess.graph)

for i_episode in range(MAX_EPISODE):
    s = env.reset()  # initial state (as a 1-element array)
    t = 0
    track_r = []
    track = []
    while True:
        if RENDER: env.render()

        a = actor.choose_action(s)

        s_, r, done, info = env.step(a)

        if s == s_ or s == 5:   # take invalid action or move toward obstruction
            r = r - 1

        track_r.append(r)
        track.append(s)

        td_error = critic.learn(s, r, s_)  # gradient = grad[r + gamma * V(s_) - V(s)]
        actor.learn(s, a, td_error)        # true_gradient = grad[logPi(s,a) * td_error]

        s = s_
        t += 1

        if done or t >= MAX_EP_STEPS:
            if s != 15:
                r = -100
            track.append(s)
            ep_rs_sum = sum(track_r)

            if 'running_reward' not in globals():
                running_reward = ep_rs_sum
            else:
                running_reward = running_reward * 0.95 + ep_rs_sum * 0.05
            if running_reward > DISPLAY_REWARD_THRESHOLD: RENDER = True  # rendering
            print("episode:", i_episode, "  reward:", int(running_reward))
            print("states: ", end='')
            for s in track:
                print(s[0], ' ', end='')
            print('\n-----------')
            break

However, whether because of limitations of the AC algorithm or because of problems with the program or the reward settings, the robot ends up running back and forth between the first and second cells. After nearly 200 episodes it still had not found the gem, and every episode's path looked like the figure below:


Thinking about it afterwards: even under the same rules, a human would know that the way to score higher is to find the gem, so the problem is certainly not the rules; the algorithm is simply not smart enough.
