A First Look at Reinforcement Learning

  • 1. Code
    • 1.1 Import dependencies
    • 1.2 Set hyperparameters
    • 1.3 Build the Model, Algorithm, Agent architecture
    • 1.4 Algorithm
    • 1.5 Agent
    • 1.6 Training && Test
    • 1.7 Create the environment and agent, start training, save the model
  • 2. Code (to be supplemented)
  • 3. Reflections

1. Code

1.1 Import dependencies

import os
import gym
import numpy as np
import paddle.fluid as fluid
import parl
from parl import layers
from parl.utils import logger

1.2 Set hyperparameters

######################################################################
# 1. Set the learning rate; try increasing or decreasing it and observe the effect
######################################################################
LEARNING_RATE = 1e-3

1.3 Build the Model, Algorithm, Agent architecture

class Model(parl.Model):
    def __init__(self, act_dim):
        ##############################################################
        # 2. Following the course demo, configure the model structure
        ##############################################################

    def forward(self, obs):  # can be called directly: model = Model(5); model(obs)
        ##############################################################
        # 3. Following the course demo, assemble the policy network
        ##############################################################
        return out
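The two numbered blanks above are left over from the course template. A minimal completion, assuming the simple two-layer policy network used in the PARL PolicyGradient demo (the hidden size and activations are my assumptions, not taken from the original post), could look like this:

class Model(parl.Model):
    def __init__(self, act_dim):
        hid_size = act_dim * 10  # assumed hidden-layer size
        # hidden layer with tanh, output layer with softmax over the discrete actions
        self.fc1 = layers.fc(size=hid_size, act='tanh')
        self.fc2 = layers.fc(size=act_dim, act='softmax')

    def forward(self, obs):  # model(obs) returns a probability distribution over actions
        out = self.fc1(obs)
        out = self.fc2(out)
        return out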

1.4 Algorithm

from parl.algorithms import PolicyGradient  # import the PolicyGradient algorithm directly from the parl library; no need to reimplement it
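For reference, the objective this algorithm optimizes is the standard REINFORCE loss. The NumPy sketch below only illustrates that loss; it is not PARL's actual implementation:

import numpy as np

def reinforce_loss(act_prob, actions, returns):
    """Illustrative REINFORCE objective: minimize -E[log pi(a_t|s_t) * G_t].

    act_prob: (N, act_dim) action probabilities from the policy network
    actions:  (N,) sampled action indices
    returns:  (N,) (normalized) reward-to-go G_t for each step
    """
    log_prob = np.log(act_prob[np.arange(len(actions)), actions] + 1e-8)
    # steps that led to a higher return get their log-probability pushed up
    return -np.mean(log_prob * returns)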

1.5 Agent

class Agent(parl.Agent):
    def __init__(self, algorithm, obs_dim, act_dim):
        self.obs_dim = obs_dim
        self.act_dim = act_dim
        super(Agent, self).__init__(algorithm)

    def build_program(self):
        self.pred_program = fluid.Program()
        self.learn_program = fluid.Program()

        with fluid.program_guard(self.pred_program):  # computation graph for predicting actions; define input/output variables
            obs = layers.data(name='obs', shape=[self.obs_dim], dtype='float32')
            self.act_prob = self.alg.predict(obs)

        with fluid.program_guard(self.learn_program):  # computation graph for updating the policy network; define input/output variables
            obs = layers.data(name='obs', shape=[self.obs_dim], dtype='float32')
            act = layers.data(name='act', shape=[1], dtype='int64')
            reward = layers.data(name='reward', shape=[], dtype='float32')
            self.cost = self.alg.learn(obs, act, reward)

    def sample(self, obs):
        obs = np.expand_dims(obs, axis=0)  # add a batch dimension
        act_prob = self.fluid_executor.run(
            self.pred_program,
            feed={'obs': obs.astype('float32')},
            fetch_list=[self.act_prob])[0]
        act_prob = np.squeeze(act_prob, axis=0)  # remove the batch dimension
        act = np.random.choice(range(self.act_dim), p=act_prob)  # sample an action according to its probability
        return act

    def predict(self, obs):
        obs = np.expand_dims(obs, axis=0)
        act_prob = self.fluid_executor.run(
            self.pred_program,
            feed={'obs': obs.astype('float32')},
            fetch_list=[self.act_prob])[0]
        act_prob = np.squeeze(act_prob, axis=0)
        act = np.argmax(act_prob)  # pick the action with the highest probability
        return act

    def learn(self, obs, act, reward):
        act = np.expand_dims(act, axis=-1)
        feed = {
            'obs': obs.astype('float32'),
            'act': act.astype('int64'),
            'reward': reward.astype('float32')
        }
        cost = self.fluid_executor.run(
            self.learn_program, feed=feed, fetch_list=[self.cost])[0]
        return cost
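In short, sample() draws a stochastic action for exploration during training, while predict() takes the greedy action for evaluation. A hypothetical usage (variable names assumed):

action = agent.sample(obs)   # stochastic, used inside run_episode during training
action = agent.predict(obs)  # greedy argmax, used inside evaluate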

1.6 Training && Test

def run_episode(env, agent):
    obs_list, action_list, reward_list = [], [], []
    obs = env.reset()
    while True:
        obs = preprocess(obs)  # from shape (210, 160, 3) to (6400,)
        obs_list.append(obs)
        action = agent.sample(obs)  # sample an action from the policy
        action_list.append(action)

        obs, reward, done, info = env.step(action)
        reward_list.append(reward)

        if done:
            break
    return obs_list, action_list, reward_list


# Evaluate the agent: run 5 episodes and take the average reward
def evaluate(env, agent, render=False):
    eval_reward = []
    for i in range(5):
        obs = env.reset()
        episode_reward = 0
        while True:
            obs = preprocess(obs)  # from shape (210, 160, 3) to (6400,)
            action = agent.predict(obs)  # take the best (greedy) action
            obs, reward, isOver, _ = env.step(action)
            episode_reward += reward
            if render:
                env.render()
            if isOver:
                break
        eval_reward.append(episode_reward)
    return np.mean(eval_reward)

1.7 Create the environment and agent, start training, save the model

# Pong frame preprocessing
def preprocess(image):
    """Preprocess a 210x160x3 uint8 frame into a 6400-dim (80x80) 1-D float vector."""
    image = image[35:195]  # crop
    image = image[::2, ::2, 0]  # downsample by a factor of 2
    image[image == 144] = 0  # erase background (background type 1)
    image[image == 109] = 0  # erase background (background type 2)
    image[image != 0] = 1  # binarize: everything except black becomes white
    return image.astype(np.float).ravel()


# Given the per-step reward list of one episode, compute the return G_t of every step
def calc_reward_to_go(reward_list, gamma=0.99):
    """Calculate the discounted reward-to-go."""
    reward_arr = np.array(reward_list)
    for i in range(len(reward_arr) - 2, -1, -1):
        # G_t = r_t + gamma * r_{t+1} + ... = r_t + gamma * G_{t+1}
        reward_arr[i] += gamma * reward_arr[i + 1]
    # normalize episode rewards
    reward_arr -= np.mean(reward_arr)
    reward_arr /= np.std(reward_arr)
    return reward_arr


# Create the environment
env = gym.make('Pong-v0')
obs_dim = 80 * 80
act_dim = env.action_space.n
logger.info('obs_dim {}, act_dim {}'.format(obs_dim, act_dim))
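To make calc_reward_to_go concrete, here is a tiny worked example with hypothetical rewards (float inputs, as produced by env.step, so the in-place update is not truncated to integers):

# calc_reward_to_go([0.0, 0.0, 1.0], gamma=0.99)
#   backward pass: G_2 = 1.0, G_1 = 0.99, G_0 = 0.99 * 0.99 = 0.9801
#   the raw returns [0.9801, 0.99, 1.0] are then shifted to zero mean and
#   scaled to unit standard deviation before being fed to agent.learn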
# Build the agent with the parl framework
######################################################################
# 4. Following the class demo, build the agent by nesting Model, PolicyGradient, and Agent
######################################################################
model =
alg =
agent =    # (one possible way to fill in these three lines is sketched after this block)

# Load a previously saved model
# if os.path.exists('./model.ckpt'):
#     agent.restore('./model.ckpt')

for i in range(1000):
    obs_list, action_list, reward_list = run_episode(env, agent)
    # if i % 10 == 0:
    #     logger.info("Train Episode {}, Reward Sum {}.".format(i, sum(reward_list)))

    batch_obs = np.array(obs_list)
    batch_action = np.array(action_list)
    batch_reward = calc_reward_to_go(reward_list)

    agent.learn(batch_obs, batch_action, batch_reward)
    if (i + 1) % 100 == 0:
        total_reward = evaluate(env, agent, render=False)
        logger.info('Episode {}, Test reward: {}'.format(i + 1, total_reward))

# save the parameters to ./model.ckpt
agent.save('./model.ckpt')
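A possible way to fill in the three blanks for step 4, assuming the constructor signatures used in the PARL PolicyGradient demo (treat the exact argument names as my assumption, not as the post's confirmed answer):

model = Model(act_dim=act_dim)
alg = PolicyGradient(model, lr=LEARNING_RATE)
agent = Agent(alg, obs_dim=obs_dim, act_dim=act_dim)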

2. Code (to be supplemented)

This is the last assignment, so for now I am just posting the code.

import os
import numpy as np
import parl
from parl import layers
from paddle import fluid
from parl.utils import logger
from parl.utils import action_mapping  # map the network output into the actual action value range
from parl.utils import ReplayMemory    # experience replay

from rlschool import make_env  # create the quadrotor environment with RLSchool

######################################################################
# 1. Set the learning rates; try increasing or decreasing them and observe the effect
######################################################################
ACTOR_LR = 0.0002   # learning rate of the Actor network
CRITIC_LR = 0.001   # learning rate of the Critic network

GAMMA = 0.99        # reward discount factor, usually between 0.9 and 0.999
TAU = 0.001         # soft-update coefficient for syncing target_model with model
MEMORY_SIZE = 1e6   # size of the replay memory; larger values use more RAM
MEMORY_WARMUP_SIZE = 1e4      # pre-fill the replay memory with some experience before sampling batches for the agent to learn from
REWARD_SCALE = 0.01       # reward scaling factor
BATCH_SIZE = 256          # number of samples per learn step, drawn at random from the replay memory
TRAIN_TOTAL_STEPS = 1e6   # total number of training steps
TEST_EVERY_STEPS = 1e4    # evaluate the algorithm every N steps; each evaluation averages the reward over 5 episodes


class ActorModel(parl.Model):
    def __init__(self, act_dim):
        ##############################################################
        # 2. Configure the model structure
        ##############################################################
        hid_size = 100
        self.fc1 = layers.fc(size=hid_size, act='relu')
        self.fc2 = layers.fc(size=act_dim, act='tanh')

    def policy(self, obs):
        ##############################################################
        # 3. Assemble the policy network
        ##############################################################
        hid = self.fc1(obs)
        means = self.fc2(hid)
        return means


class CriticModel(parl.Model):
    def __init__(self):
        ##############################################################
        # 4. Configure the model structure
        ##############################################################
        hid_size = 100
        self.fc1 = layers.fc(size=hid_size, act='relu')
        self.fc2 = layers.fc(size=1, act=None)

    def value(self, obs, act):
        # input: state and action; output: the corresponding Q(s, a)
        ##############################################################
        # 5. Assemble the Q network
        ##############################################################
        concat = layers.concat([obs, act], axis=1)
        hid = self.fc1(concat)
        Q = self.fc2(hid)
        Q = layers.squeeze(Q, axes=[1])
        return Q


class QuadrotorModel(parl.Model):
    def __init__(self, act_dim):
        self.actor_model = ActorModel(act_dim)
        self.critic_model = CriticModel()

    def policy(self, obs):
        return self.actor_model.policy(obs)

    def value(self, obs, act):
        return self.critic_model.value(obs, act)

    def get_actor_params(self):
        return self.actor_model.parameters()


from parl.algorithms import DDPG


class QuadrotorAgent(parl.Agent):
    def __init__(self, algorithm, obs_dim, act_dim=4):
        assert isinstance(obs_dim, int)
        assert isinstance(act_dim, int)
        self.obs_dim = obs_dim
        self.act_dim = act_dim
        super(QuadrotorAgent, self).__init__(algorithm)

        # Note: at the very beginning, fully synchronize the parameters of target_model and model
        self.alg.sync_target(decay=0)

    def build_program(self):
        self.pred_program = fluid.Program()
        self.learn_program = fluid.Program()

        with fluid.program_guard(self.pred_program):
            obs = layers.data(name='obs', shape=[self.obs_dim], dtype='float32')
            self.pred_act = self.alg.predict(obs)

        with fluid.program_guard(self.learn_program):
            obs = layers.data(name='obs', shape=[self.obs_dim], dtype='float32')
            act = layers.data(name='act', shape=[self.act_dim], dtype='float32')
            reward = layers.data(name='reward', shape=[], dtype='float32')
            next_obs = layers.data(name='next_obs', shape=[self.obs_dim], dtype='float32')
            terminal = layers.data(name='terminal', shape=[], dtype='bool')
            _, self.critic_cost = self.alg.learn(obs, act, reward, next_obs, terminal)

    def predict(self, obs):
        obs = np.expand_dims(obs, axis=0)
        act = self.fluid_executor.run(
            self.pred_program, feed={'obs': obs},
            fetch_list=[self.pred_act])[0]
        return act

    def learn(self, obs, act, reward, next_obs, terminal):
        feed = {
            'obs': obs,
            'act': act,
            'reward': reward,
            'next_obs': next_obs,
            'terminal': terminal
        }
        critic_cost = self.fluid_executor.run(
            self.learn_program, feed=feed, fetch_list=[self.critic_cost])[0]
        self.alg.sync_target()
        return critic_cost


def run_episode(env, agent, rpm):
    obs = env.reset()
    total_reward, steps = 0, 0
    while True:
        steps += 1
        batch_obs = np.expand_dims(obs, axis=0)
        action = agent.predict(batch_obs.astype('float32'))
        action = np.squeeze(action)

        # Add exploration noise to the output action and clip it to [-1.0, 1.0]
        action = np.clip(np.random.normal(action, 1.0), -1.0, 1.0)
        # Map the action into the actual action value range; action_mapping is imported from parl.utils
        action = action_mapping(action, env.action_space.low[0],
                                env.action_space.high[0])

        next_obs, reward, done, info = env.step(action)
        rpm.append(obs, action, REWARD_SCALE * reward, next_obs, done)

        if rpm.size() > MEMORY_WARMUP_SIZE:
            batch_obs, batch_action, batch_reward, batch_next_obs, \
                batch_terminal = rpm.sample_batch(BATCH_SIZE)
            critic_cost = agent.learn(batch_obs, batch_action, batch_reward,
                                      batch_next_obs, batch_terminal)

        obs = next_obs
        total_reward += reward

        if done:
            break
    return total_reward, steps
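The exploration noise above is added in the network's native [-1, 1] range and only afterwards rescaled to the environment's action bounds. A sketch of what that rescaling does (a plain linear mapping; the details of parl.utils.action_mapping are my assumption, see its source for the real implementation):

def linear_action_mapping(action, low, high):
    """Rescale an action from [-1.0, 1.0] to [low, high] (illustrative only)."""
    return low + (action + 1.0) * 0.5 * (high - low)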
# Evaluate the agent: run 5 episodes and average the total reward
def evaluate(env, agent):
    eval_reward = []
    for i in range(5):
        obs = env.reset()
        total_reward, steps = 0, 0
        while True:
            batch_obs = np.expand_dims(obs, axis=0)
            action = agent.predict(batch_obs.astype('float32'))
            action = np.squeeze(action)
            action = action_mapping(action, env.action_space.low[0],
                                    env.action_space.high[0])

            next_obs, reward, done, info = env.step(action)

            obs = next_obs
            total_reward += reward
            steps += 1

            if done:
                break
        eval_reward.append(total_reward)
    return np.mean(eval_reward)
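Looking back at the TAU hyperparameter and the agent's sync_target() calls: DDPG keeps target copies of the actor and critic that slowly track the online networks through a soft (Polyak) update. The sketch below only illustrates the idea and is not PARL's actual sync_target implementation:

def soft_update(target_params, params, tau=0.001):
    """Move each target parameter a small step (tau) toward its online counterpart."""
    return [(1.0 - tau) * t + tau * p for t, p in zip(target_params, params)]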
# Create the quadrotor environment
env = make_env("Quadrotor", task="hovering_control")
env.reset()
obs_dim = env.observation_space.shape[0]
act_dim = env.action_space.shape[0]

# Build the agent with the parl framework
######################################################################
# 6. Build the agent by nesting QuadrotorModel, DDPG, and QuadrotorAgent
######################################################################
model = QuadrotorModel(act_dim)
algorithm = DDPG(model, gamma=GAMMA, tau=TAU, actor_lr=ACTOR_LR, critic_lr=CRITIC_LR)
agent = QuadrotorAgent(algorithm, obs_dim, act_dim)

# parl also provides a built-in ReplayMemory for DDPG; it can be imported directly from parl.utils
rpm = ReplayMemory(int(MEMORY_SIZE), obs_dim, act_dim)

# Start training
test_flag = 0
total_steps = 0
while total_steps < TRAIN_TOTAL_STEPS:
    train_reward, steps = run_episode(env, agent, rpm)
    total_steps += steps
    # logger.info('Steps: {} Reward: {}'.format(total_steps, train_reward))  # print the training reward

    if total_steps // TEST_EVERY_STEPS >= test_flag:  # evaluate the model every TEST_EVERY_STEPS steps
        while total_steps // TEST_EVERY_STEPS >= test_flag:
            test_flag += 1

        evaluate_reward = evaluate(env, agent)
        logger.info('Steps {}, Test reward: {}'.format(
            total_steps, evaluate_reward))  # print the evaluation reward

        # save the model after each evaluation, named by the number of training steps so far
        ckpt = 'model_dir/steps_{}.ckpt'.format(total_steps)
        agent.save(ckpt)

######################################################################
# 7. Pick the checkpoint from your best evaluation for the final assessment
######################################################################
ckpt = 'model_dir/steps_??????.ckpt'  # set ckpt to the checkpoint file saved at your best evaluation
agent.restore(ckpt)
evaluate_reward = evaluate(env, agent)
logger.info('Evaluate reward: {}'.format(evaluate_reward))  # print the evaluation reward

3. Reflections

parl turned out to be quite easy to use; I will consider it for future reinforcement learning work.
