Coding PPO from Scratch with PyTorch (Part 2/4)

Welcome to Part 2 of our series, where we shall start coding Proximal Policy Optimization (PPO) from scratch with PyTorch. If you haven’t read Part 1, please do so first.


Note that going forward, I will be posting code screenshots rather than GitHub gists because I don’t want you to just copy-paste code (you can just go to the main repository for that). Instead, you are encouraged to follow along with this tutorial while coding manually in another window.


We will be following the PPO-clip variant, using the pseudocode found in OpenAI’s Spinning Up docs and an Actor-Critic framework. Here’s a picture of the pseudocode:


Pseudocode of PPO on OpenAI’s Spinning Up doc.

Initial Thoughts: Only 8 steps? Nice. Since this is pseudocode for a learning algorithm, it might be wise to first design the way our code will flow. This pseudocode looks like it can all fit in one function; we’ll call it learn. It appears that we will need to write subroutines for many steps (e.g. Step 3 basically wants us to roll out a bunch of simulations; in that case, we can define something like rollout later), so it’s best to encapsulate everything into a class PPO. This way, to train on an environment, we can first create a PPO object, then simply call learn.


First, let’s set up our PPO class in a file called ppo.py:

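The original shows this setup as a screenshot, which isn’t reproduced here. A minimal sketch of what the skeleton might look like at this point (we’ll flesh out both methods as we go):

class PPO:
  def __init__(self):
    pass

  def learn(self):
    pass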

Cool, pat on the back. Let’s look at Step 1:


Step 1

Here’s where we’ll initialize our actor and critic networks. This means we’ll either need to import a neural network module or write our own. Let’s do the latter; we’ll do something similar to PyTorch’s tutorial on creating a neural network with torch.nn. We’ll create a very basic Feed Forward Neural Network. If you’re not comfortable with neural networks, watch this series.


Let’s set up our neural network module in a new file network.py:


import torch
from torch import nn
import torch.nn.functional as F
import numpy as np

class FeedForwardNN(nn.Module):
  def __init__(self):
    super(FeedForwardNN, self).__init__()

We’ll need to define our neural network layers now. We can use a few basic nn.Linear layers, nothing too fancy. We need to define the input and output dimensions, so let’s add some parameters to __init__ to capture that.


def __init__(self, in_dim, out_dim):
  super(FeedForwardNN, self).__init__()

  self.layer1 = nn.Linear(in_dim, 64)
  self.layer2 = nn.Linear(64, 64)
  self.layer3 = nn.Linear(64, out_dim)

Note that I chose 64 arbitrarily; it doesn’t matter too much. Our __init__ is done; now we can define a forward function to do a forward pass on our network. We can use ReLU for activation (again picked arbitrarily). Since we’re planning on using this network module to define our actor and critic, and both will take in an observation and return either an action or a value, we’ll set observation as a parameter. One thing to note is that the input to our network must be a tensor, so we should convert our observation to a tensor first in case it’s passed in as a numpy array.


def forward(self, obs):
  # Convert observation to tensor if it's a numpy array
  if isinstance(obs, np.ndarray):
    obs = torch.tensor(obs, dtype=torch.float)

  activation1 = F.relu(self.layer1(obs))
  activation2 = F.relu(self.layer2(activation1))
  output = self.layer3(activation2)

  return output

We are now done with defining our network module; we are ready to define our actor and critic networks. Here’s what network.py should look like:


Complete network.py code.
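The screenshot isn’t reproduced here; putting the snippets above together, network.py should read roughly as follows:

import torch
from torch import nn
import torch.nn.functional as F
import numpy as np

class FeedForwardNN(nn.Module):
  def __init__(self, in_dim, out_dim):
    super(FeedForwardNN, self).__init__()

    self.layer1 = nn.Linear(in_dim, 64)
    self.layer2 = nn.Linear(64, 64)
    self.layer3 = nn.Linear(64, out_dim)

  def forward(self, obs):
    # Convert observation to tensor if it's a numpy array
    if isinstance(obs, np.ndarray):
      obs = torch.tensor(obs, dtype=torch.float)

    activation1 = F.relu(self.layer1(obs))
    activation2 = F.relu(self.layer2(activation1))
    output = self.layer3(activation2)

    return output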

Back to ppo.py; we should be ready to do Step 1 really easily now and define our initial policy (actor) parameters and value function (critic) parameters.


from network import FeedForwardNN

self.actor = FeedForwardNN(

Uh oh, roadblock. We don’t have any information on input or output size, which depends on the environment. Since we’ll need access to that environment in many subroutines as well, let’s just add it as an instance variable in our PPO __init__.


def __init__(self, env):
  # Extract environment information
  self.env = env
  self.obs_dim = env.observation_space.shape[0]
  self.act_dim = env.action_space.shape[0]

Eh, we’ll need our actor and critic networks later too, so let’s define them as instance variables in __init__ as well.


# ALG STEP 1
# Initialize actor and critic networks
self.actor = FeedForwardNN(self.obs_dim, self.act_dim)
self.critic = FeedForwardNN(self.obs_dim, 1)

And we’re done with step 1! Officially done with 1/8 of PPO. Here’s the code so far:

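The “code so far” screenshot isn’t reproduced here; assembling the snippets above, ppo.py currently looks roughly like this:

from network import FeedForwardNN

class PPO:
  def __init__(self, env):
    # Extract environment information
    self.env = env
    self.obs_dim = env.observation_space.shape[0]
    self.act_dim = env.action_space.shape[0]

    # ALG STEP 1
    # Initialize actor and critic networks
    self.actor = FeedForwardNN(self.obs_dim, self.act_dim)
    self.critic = FeedForwardNN(self.obs_dim, 1)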

Onto Step 2 now.


Step 2

Easy. They want us to define a for loop to learn for some number of iterations. Now we could loop by iterations, but we also know that Stable Baselines PPO2 makes you specify how many timesteps to train in total when calling learn. Let’s follow that design. This way, instead of counting off to infinite iterations, we can specify how many timesteps to train before we stop.


def learn(self, total_timesteps):
  t_so_far = 0 # Timesteps simulated so far

  while t_so_far < total_timesteps:              # ALG STEP 2
    # Increment t_so_far somewhere below

Step 2, done. Here’s the code so far:


Step 3:


Step 3

Our first mini-challenge. We need to collect data from a set of episodes by running our current actor policy. Sure, sounds like a rollout to me. We can call our data collected in each rollout a batch. Now what data do we need? Let’s take a little look ahead in our pseudocode.


Pseudocode of PPO on OpenAI’s Spinning Up doc.

Looks like we’ll need observations per timestep, since I see sₜ in steps 6 and 7. We’ll also need actions per timestep with aₜ in steps 6 and 7, action probabilities with π_θ (aₜ | sₜ) in step 6, and rewards-to-go with Rₜ in steps 4 and 7. Oh, and don’t forget that in order to increment t_so_far in learn, we’ll need to know how many timesteps are simulated per batch; let’s return the lengths of each episode run in our batch (not summing yet, because the lengths can also be used for logging average episodic length later. You can also just sum the episodic lengths before returning; it doesn’t really matter).


We’ll also have to figure out how many timesteps to run per batch; sounds like a hyperparameter to me. We’ll first create a function _init_hyperparameters to define some default hyperparameters, and call the function from our __init__.


def __init__(self, env):
  ...
  self._init_hyperparameters()

def _init_hyperparameters(self):
  # Default values for hyperparameters, will need to change later.
  self.timesteps_per_batch = 4800            # timesteps per batch
  self.max_timesteps_per_episode = 1600      # timesteps per episode

Next, let’s create a rollout function to collect our data.


def rollout(self):
  # Batch data
  batch_obs = []             # batch observations
  batch_acts = []            # batch actions
  batch_log_probs = []       # log probs of each action
  batch_rews = []            # batch rewards
  batch_rtgs = []            # batch rewards-to-go
  batch_lens = []            # episodic lengths in batch

In our batch, we’ll be running episodes until we hit self.timesteps_per_batch timesteps; in the process, we shall collect observations, actions, log probabilities of those actions, rewards, rewards-to-go, and lengths of each episode. We’ll need these for our PPO algorithm later. The respective shapes of each list will be:


  • observations: (number of timesteps per batch, dimension of observation)
  • actions: (number of timesteps per batch, dimension of action)
  • log probabilities: (number of timesteps per batch)
  • rewards: (number of episodes, number of timesteps per episode)
  • rewards-to-go: (number of timesteps per batch)
  • batch lengths: (number of episodes)

For why we keep track of log probabilities instead of raw action probabilities, here is a resource that explains it, and here is another. TL;DR: it makes gradient ascent easier behind the scenes. Let’s write our generic gym rollout for one episode:


obs = self.env.reset()
done = False

for ep_t in range(self.max_timesteps_per_episode):
  action = self.env.action_space.sample()
  obs, rew, done, _ = self.env.step(action)

  if done:
    break

A few things we need to change. We’re not sampling a random action, but querying our actor network. We need to collect observations, actions, log probs, episodic rewards, and episodic lengths. We need to stop once we hit self.timesteps_per_batch timesteps. Let’s do that now, assuming we have some get_action function to help us query an action and its log prob.


# Number of timesteps run so far this batch
t = 0

while t < self.timesteps_per_batch:
  # Rewards this episode
  ep_rews = []

  obs = self.env.reset()
  done = False

  for ep_t in range(self.max_timesteps_per_episode):
    # Increment timesteps ran this batch so far
    t += 1

    # Collect observation
    batch_obs.append(obs)

    action, log_prob = self.get_action(obs)
    obs, rew, done, _ = self.env.step(action)

    # Collect reward, action, and log prob
    ep_rews.append(rew)
    batch_acts.append(action)
    batch_log_probs.append(log_prob)

    if done:
      break

  # Collect episodic length and rewards
  batch_lens.append(ep_t + 1) # plus 1 because timestep starts at 0
  batch_rews.append(ep_rews)

Okay, so we need a get_action. Let’s go ahead and write that. During training, we’ll need a way to “explore” actions; we’ll use something called a “Multivariate Normal Distribution”. The idea is to have the actor network output a “mean” action on a forward pass, then create a covariance matrix with some standard deviation along the diagonal. Then, we can use this mean and stddev to generate a Multivariate Normal Distribution using PyTorch’s distributions, and then sample an action close to our mean. We’ll also extract the log probability of that action in the distribution. If you’re uncomfortable with Multivariate Normal Distributions, here’s a great lecture by Andrew Ng on it.


Note: actions will be deterministic when testing, meaning that the “mean” action will be our actual action during testing. However, during training we need an exploratory factor, which this distribution can help us with.

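As a concrete (hypothetical) illustration of that note — this helper is not part of the tutorial’s code, and the name is ours — an evaluation-time query could skip the distribution entirely and just act on the actor’s mean output:

# Hypothetical evaluation-time helper (not from the original repo):
# at test time, skip sampling and act on the actor's mean output directly.
def get_deterministic_action(self, obs):
  mean = self.actor(obs)
  return mean.detach().numpy()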

from torch.distributions import MultivariateNormal

def __init__(self, env):
  ...
  # Create our variable for the matrix.
  # Note that I chose 0.5 for stdev arbitrarily.
  self.cov_var = torch.full(size=(self.act_dim,), fill_value=0.5)

  # Create the covariance matrix
  self.cov_mat = torch.diag(self.cov_var)

def get_action(self, obs):
  # Query the actor network for a mean action.
  # Same thing as calling self.actor.forward(obs)
  mean = self.actor(obs)

  # Create our Multivariate Normal Distribution
  dist = MultivariateNormal(mean, self.cov_mat)

  # Sample an action from the distribution and get its log prob
  action = dist.sample()
  log_prob = dist.log_prob(action)

  # Return the sampled action and the log prob of that action
  # Note that I'm calling detach() since the action and log_prob
  # are tensors with computation graphs, so I want to get rid
  # of the graph and just convert the action to numpy array.
  # log prob as tensor is fine. Our computation graph will
  # start later down the line.
  return action.detach().numpy(), log_prob.detach()
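As a quick aside on why returning the log prob pays off later: Step 6 of the pseudocode (implemented in Part 3) needs the ratio π_θ(aₜ | sₜ) / π_θk(aₜ | sₜ), and with log probs that ratio is just the exponential of a difference. A tiny sketch with made-up numbers:

import torch

# Made-up log probabilities, purely for illustration
old_log_prob = torch.tensor(-2.31)   # log prob of the action under the old policy
new_log_prob = torch.tensor(-2.05)   # log prob of the same action under the current policy

# Equivalent to dividing the raw probabilities, but numerically friendlier
ratio = torch.exp(new_log_prob - old_log_prob)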

Finally, back in our rollout function, we should convert our batch_obs, batch_acts, batch_log_probs, and batch_rtgs to tensors since we’ll need them in that form later to draw our computation graphs. Assume that we have a function compute_rtgs that will compute the rewards-to-go of the batch rewards. Funnily enough, finding the rewards-to-go is Step 4 in our algorithm:


Step 4
# Reshape data as tensors in the shape specified before returning
batch_obs = torch.tensor(batch_obs, dtype=torch.float)
batch_acts = torch.tensor(batch_acts, dtype=torch.float)
batch_log_probs = torch.tensor(batch_log_probs, dtype=torch.float)

# ALG STEP #4
batch_rtgs = self.compute_rtgs(batch_rews)

# Return the batch data
return batch_obs, batch_acts, batch_log_probs, batch_rtgs, batch_lens

Now let’s figure out how to calculate rewards-to-go. Typically, when calculating the rewards-to-go on a set of rewards from a single episode, you start from the end, keep a variable to track the running sum of rewards, multiply that variable by a discount factor (gamma) at each timestep, add the immediate reward to it, and append the result to some reward-to-go array. In case you’re fuzzy on how to calculate the reward-to-go, or return, given some observation, here’s the formula.


Reward-to-go formula.
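The formula image isn’t reproduced here; written out (matching the description just below), the reward-to-go from timestep k is:

G(s_k) = \sum_{i=k}^{T} \gamma^{i-k} R(s_i)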

where G is the reward-to-go function, sₖ is our observation at timestep k, T is the number of timesteps in the episode, γ is the discount factor, and R(sᵢ) is the reward given some observation sᵢ.

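For a concrete (made-up) example, take a single three-step episode with rewards [1, 2, 3] and γ = 0.95. Working backwards:

G(s₃) = 3
G(s₂) = 2 + 0.95 · 3    = 4.85
G(s₁) = 1 + 0.95 · 4.85 = 5.6075

so the rewards-to-go for that episode are [5.6075, 4.85, 3.0].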

We’ll apply this exact same workflow, except on multiple episodes (to keep the order consistent, we’ll need to iterate the episodes backward too).


def compute_rtgs(self, batch_rews):
  # The rewards-to-go (rtg) per episode per batch to return.
  # The shape will be (num timesteps per batch)
  batch_rtgs = []

  # Iterate through each episode backwards to maintain same order
  # in batch_rtgs
  for ep_rews in reversed(batch_rews):
    discounted_reward = 0 # The discounted reward so far

    for rew in reversed(ep_rews):
      discounted_reward = rew + discounted_reward * self.gamma
      batch_rtgs.insert(0, discounted_reward)

  # Convert the rewards-to-go into a tensor
  batch_rtgs = torch.tensor(batch_rtgs, dtype=torch.float)

  return batch_rtgs

def _init_hyperparameters(self):
  ...
  self.gamma = 0.95

Finally, let’s call our rollout function in learn.


def learn(self, total_timesteps):
  ...
  while t_so_far < total_timesteps:
    # ALG STEP 3
    batch_obs, batch_acts, batch_log_probs, batch_rtgs, batch_lens = self.rollout()

And there we go! We’re done with Steps 3 and 4, and halfway done with our PPO implementation. Here’s the code so far:


__init__, learn
rollout
get_action, compute_rtgs, _init_hyperparameters

Congratulations! We are already halfway through implementing a bare-bones PPO, and have finished the majority of the code. In Part 3, we will finish up the PPO implementation.


If you have any questions up to this point, don’t hesitate to leave a comment or reach out to me at eyyu@ucsd.edu. Otherwise, see you in Part 3!


Original article: https://medium.com/@eyyu/coding-ppo-from-scratch-with-pytorch-part-2-4-f9d8b8aa938a
