Simultaneous Continuous/Discrete Hyperparameter Tuning with Policy Gradients

Overview

In my previous article, I showed how to build policy gradients from scratch in Python, and we used them to tune discrete hyperparameters for machine learning models. (If you haven’t read it yet, I’d recommend starting there.) Now we’ll build on that progress and extend policy gradients to optimize continuous parameters as well. By the end of this article, we’ll have a full-fledged method for simultaneously tuning discrete and continuous hyperparameters.


Review of Policy Gradients

From last time, recall that policy gradients optimize the following cost function for tuning hyperparameters:

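Written out, and assuming the usual expected-reward form (the symbols are the ones defined in the following paragraph; the exact notation of the previous article may differ), that cost function is:

```latex
J(\theta) = \mathbb{E}_{a \sim p(a \mid \theta)}\left[ r(a) \right]
```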

where a is the set of hyperparameters chosen for a particular experiment, and theta represents all trainable parameters for our PG model. Then, p denotes the probability of selecting action a, and r is the “reward” received for that action. We then showed that:

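In standard notation this is the score-function (REINFORCE) identity. Assuming the cost function is the expected reward $J(\theta) = \mathbb{E}[r(a)]$, it reads:

```latex
\nabla_\theta J(\theta) = \mathbb{E}_{a \sim p(a \mid \theta)}\left[ r(a) \, \nabla_\theta \log p(a \mid \theta) \right]
```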

The above equation tells us how to update our PG model, given a set of actions and their observed rewards. For discrete hyperparameters, we directly updated the relative log-probabilities (logits) for each possible action:

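As a reminder of what that looks like, here is a minimal sketch of a logit-based actor. The class name `DiscreteActor`, its method names, and the learning rate are illustrative assumptions, not necessarily the code from the previous article.

```python
import math
import random


class DiscreteActor:
    """Hypothetical sketch: one logit per allowed value of a discrete hyperparameter."""

    def __init__(self, values, lr=0.1, seed=None):
        self.values = list(values)
        self.logits = [0.0] * len(self.values)
        self.lr = lr  # assumed step size, not tuned
        self.rng = random.Random(seed)

    def probs(self):
        # Softmax over the logits, shifted by the max for numerical stability.
        top = max(self.logits)
        exps = [math.exp(z - top) for z in self.logits]
        total = sum(exps)
        return [e / total for e in exps]

    def sample(self):
        # Returns the index of the sampled value.
        return self.rng.choices(range(len(self.values)), weights=self.probs())[0]

    def update(self, index, reward):
        # d log p(index) / d logit_j = 1[j == index] - p_j
        p = self.probs()
        for j in range(len(self.logits)):
            grad = (1.0 if j == index else 0.0) - p[j]
            self.logits[j] += self.lr * reward * grad
```

A positive reward raises the probability of the chosen index and lowers the others; a negative reward does the opposite.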

This approach will not work for continuous hyperparameters, because a continuous variable has infinitely many possible outcomes and we cannot store a log-probability for each one! We need a new method for generating continuous random variables and their relative log-probabilities.


Extending to Continuous Hyperparameters

In the field of reinforcement learning, continuous variables are commonly modeled using Gaussian distributions. The idea is pretty straightforward: our model predicts the mean and standard deviation of a Gaussian distribution, and we gather actions/predictions by sampling from it with a random number generator.

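A minimal sketch of the sampling side of such an actor follows. The attribute names `mu` and `log_sigma` come from the text; the constructor arguments and the use of the standard-library `random` module are assumptions made for illustration.

```python
import math
import random


class GaussianActor:
    """Hypothetical sketch: a Gaussian policy over one continuous hyperparameter."""

    def __init__(self, mu=0.0, log_sigma=0.0, seed=None):
        self.mu = mu                # mean of the Gaussian
        self.log_sigma = log_sigma  # log of the standard deviation
        self.rng = random.Random(seed)

    def sample(self):
        # Draw one action from N(mu, sigma^2), with sigma = exp(log_sigma).
        return self.rng.gauss(self.mu, math.exp(self.log_sigma))
```

Storing the log of the standard deviation keeps sigma positive no matter how the parameter is updated.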

Again, we’ve chosen the simplest possible model: just store the mean and log-standard deviation for each parameter. To understand the update rules for self.mu and self.log_sigma above, let’s compute some gradients. The probability density function for a Gaussian distribution is:

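For reference, with mean $\mu$, standard deviation $\sigma$, and sampled action $a$ (symbol names chosen here to match self.mu and self.log_sigma), the standard form is:

```latex
p(a) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left( -\frac{(a - \mu)^2}{2\sigma^2} \right)
```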

Or equivalently, written in terms of the log-standard deviation:

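Substituting $\sigma = e^{\log\sigma}$ into the Gaussian density gives one way to write this (with $a$ the sampled action and $\mu$ the mean):

```latex
p(a) = \frac{1}{\sqrt{2\pi}} \, e^{-\log\sigma} \exp\left( -\tfrac{1}{2} (a - \mu)^2 \, e^{-2\log\sigma} \right)
```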

To obtain the log-probability, simply take the natural logarithm,

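Carrying that logarithm through the Gaussian density, with $\log\sigma$ treated as the free parameter, should give:

```latex
\log p(a) = -\log\sigma - \tfrac{1}{2}\log(2\pi) - \tfrac{1}{2} (a - \mu)^2 \, e^{-2\log\sigma}
```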

where we’ve used the following mathematical identities to help simplify the expression:

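The exact identities shown in the original are not recoverable here, but simplifying a Gaussian log-density needs only facts of this kind:

```latex
\log(xy) = \log x + \log y, \qquad \log\left(e^{x}\right) = x, \qquad \frac{1}{\sigma} = e^{-\log\sigma}
```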

After all that work, computing gradients from here isn’t too difficult. Take the partial derivatives with respect to mu and log-sigma (or just use WolframAlpha), and we find:

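Differentiating the Gaussian log-density $\log p(a) = -\log\sigma - \tfrac{1}{2}\log(2\pi) - \tfrac{1}{2}(a-\mu)^2 e^{-2\log\sigma}$ term by term gives:

```latex
\frac{\partial \log p(a)}{\partial \mu} = (a - \mu) \, e^{-2\log\sigma} = \frac{a - \mu}{\sigma^2},
\qquad
\frac{\partial \log p(a)}{\partial (\log\sigma)} = (a - \mu)^2 \, e^{-2\log\sigma} - 1 = \frac{(a - \mu)^2}{\sigma^2} - 1
```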

This explains how parameters are updated inside of GaussianActor… almost. There’s a practical issue with our formulation, despite all of the nice mathematics: the mean of our GaussianActor changes very slowly compared to the standard deviation. Essentially, this is because self.log_sigma is in log-space, so an additive update to self.log_sigma multiplies the standard deviation by an exponential factor. We can manually increase the learning rate for self.mu to help it keep up. Multiplying it by a factor of 1000 tends to work well in practice.

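Put together, an update step with the boosted learning rate might look like the sketch below. The gradient formulas are the standard ones for a Gaussian log-density; the base learning rate `lr`, the parameter name `mu_lr_scale`, and the method signature are assumptions for illustration.

```python
import math
import random


class GaussianActor:
    """Hypothetical sketch: Gaussian policy with a larger step size for the mean."""

    def __init__(self, mu=0.0, log_sigma=0.0, lr=1e-3, mu_lr_scale=1000.0, seed=None):
        self.mu = mu
        self.log_sigma = log_sigma
        self.lr = lr                    # assumed base learning rate
        self.mu_lr_scale = mu_lr_scale  # the factor of 1000 from the text
        self.rng = random.Random(seed)

    def sample(self):
        return self.rng.gauss(self.mu, math.exp(self.log_sigma))

    def update(self, action, reward):
        # Gradients of log p(action) with respect to mu and log_sigma.
        inv_var = math.exp(-2.0 * self.log_sigma)
        grad_mu = (action - self.mu) * inv_var
        grad_log_sigma = (action - self.mu) ** 2 * inv_var - 1.0
        # Gradient ascent on reward * log p; only the mean gets the boosted step.
        self.mu += self.lr * self.mu_lr_scale * reward * grad_mu
        self.log_sigma += self.lr * reward * grad_log_sigma
```

Setting `mu_lr_scale=1.0` recovers the plain update, which makes it easy to compare the two behaviors.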

This is, in my opinion, an ugly band-aid for an otherwise elegant solution. But it really does work. I don’t recommend using this hack for most applications of policy gradients, because it degrades the stability of our model during training. In our case, efficiency is much more important than stability, because we’re limited to a fixed number of experiments. And unlike most applications, we won’t be re-using the trained model for anything: we just use it to estimate hyperparameters, and then discard the model altogether.


Toy Problem: Continuous Hyperparameters

Let’s apply what we’ve learned. We can make up a set of hyperparameters, ground truth values, and an objective function. This time, we’ll use mean squared error (MSE) as the objective, rather than mean absolute error (MAE). The choice of objective function is not critical, and any sensible option should work, though some are more efficient than others. MSE and MAE are generally safe choices; empirically, MSE gave the best results for these experiments.

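A setup along these lines would do. The parameter names, ranges, and ground truth values below are made up for illustration; only the three-parameter count, the "type" key, and the MSE objective follow the text.

```python
# Hypothetical search space: three continuous hyperparameters on [0, 100].
hyperparams = {
    "alpha": {"type": "continuous", "min": 0.0, "max": 100.0},
    "beta": {"type": "continuous", "min": 0.0, "max": 100.0},
    "gamma": {"type": "continuous", "min": 0.0, "max": 100.0},
}

# Hypothetical ground truth values the optimizer should recover.
ground_truth = {"alpha": 12.5, "beta": 50.0, "gamma": 87.5}


def mean_squared_error(guess, truth=ground_truth):
    """Objective to minimize: average squared distance to the ground truth."""
    return sum((guess[name] - truth[name]) ** 2 for name in truth) / len(truth)
```

In a real tuning run, the objective would train a model with the sampled hyperparameters and return its validation error instead.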

Notice that we added the “type” key to each parameter dictionary. This will prove helpful later on, when we incorporate discrete hyperparameters as well. Let’s also set up a function to create actors for each parameter type.

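One way such a factory could dispatch on the "type" key is sketched below. The two actor classes are stripped-down stand-ins, and the "values", "min", and "max" keys plus the initial spread of a quarter of the range are assumptions.

```python
import math


class DiscreteActor:
    """Stand-in: one logit per allowed value."""

    def __init__(self, values):
        self.values = list(values)
        self.logits = [0.0] * len(self.values)


class GaussianActor:
    """Stand-in: starts at the middle of the range with a wide spread."""

    def __init__(self, low, high):
        self.mu = 0.5 * (low + high)
        self.log_sigma = math.log(0.25 * (high - low))  # assumed initial spread


def make_actor(spec):
    """Create the actor that matches one hyperparameter dictionary."""
    if spec["type"] == "discrete":
        return DiscreteActor(spec["values"])
    if spec["type"] == "continuous":
        return GaussianActor(spec["min"], spec["max"])
    raise ValueError(f"unknown hyperparameter type: {spec['type']!r}")


def make_actors(space):
    """Create one actor per named hyperparameter."""
    return {name: make_actor(spec) for name, spec in space.items()}
```

Keeping the dispatch in one place means mixed search spaces need no changes elsewhere in the training loop.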

Finally, we can show the complete policy gradients code. Only a small change is needed to allow for continuous parameters.

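For readers without the original listing at hand, here is a compact, continuous-only sketch of such a loop. It is not the author’s code: the batch size, step sizes, initial spread, and the standardization of rewards within each batch are all untuned assumptions, so accuracy will differ from the figures reported in this article.

```python
import math
import random


class GaussianActor:
    """Hypothetical Gaussian policy over one continuous hyperparameter."""

    def __init__(self, low, high, rng, lr=0.05, mu_lr_scale=1000.0):
        self.mu = 0.5 * (low + high)                    # start mid-range
        self.log_sigma = math.log(0.25 * (high - low))  # assumed initial spread
        self.rng, self.lr, self.mu_lr_scale = rng, lr, mu_lr_scale

    def sample(self):
        return self.rng.gauss(self.mu, math.exp(self.log_sigma))

    def update(self, actions, weights):
        # Average the weighted log-probability gradients over the batch,
        # then take one ascent step (the mean uses the boosted step size).
        inv_var = math.exp(-2.0 * self.log_sigma)
        n = len(actions)
        grad_mu = sum(w * (a - self.mu) * inv_var
                      for a, w in zip(actions, weights)) / n
        grad_log_sigma = sum(w * ((a - self.mu) ** 2 * inv_var - 1.0)
                             for a, w in zip(actions, weights)) / n
        self.mu += self.lr * self.mu_lr_scale * grad_mu
        self.log_sigma += self.lr * grad_log_sigma


def policy_gradient_search(space, objective, budget=250, batch_size=10, seed=0):
    """Return the final mean of each actor after spending the evaluation budget."""
    rng = random.Random(seed)
    actors = {name: GaussianActor(spec["min"], spec["max"], rng)
              for name, spec in space.items()}
    for _ in range(budget // batch_size):
        batch = [{name: actor.sample() for name, actor in actors.items()}
                 for _ in range(batch_size)]
        rewards = [-objective(params) for params in batch]  # lower error, higher reward
        mean = sum(rewards) / batch_size
        std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / batch_size) or 1.0
        weights = [(r - mean) / std for r in rewards]       # standardized rewards
        for name, actor in actors.items():
            actor.update([params[name] for params in batch], weights)
    return {name: actor.mu for name, actor in actors.items()}
```

The objective is called exactly `budget` times when `budget` is a multiple of `batch_size`, which keeps comparisons against other search methods fair.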

To measure the performance of our algorithm, let’s run 1000 or so unique experiments and observe how accurate our predictions are. Again, we’ll use a budget of 250, which is the maximum number of times that new hyperparameters can be evaluated. This allows an apples-to-apples comparison with our previous implementation, which handled discrete hyperparameters only.


Average MAE: 2.155

How does this compare against a random search? To achieve the same MAE, you’d need to guess within ±4.31 of each ground truth value (a uniform guess inside that window has an expected absolute error of 2.155). There’s roughly a 0.0862³ = 6.41E-4 probability of randomly selecting such a set of parameters. On average, it would take about 1561 random experiments (1 / 6.41E-4) to match the accuracy of policy gradients! With our budget of 250, PG is roughly 6.25x more efficient than random search. That’s lower efficiency than we saw for discrete parameters (>10x), but still pretty respectable.

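The arithmetic can be checked with a few lines of Python. The function name is hypothetical, and the range of 100 units per parameter is an assumption inferred from the quoted figure 0.0862 = 8.62 / 100.

```python
def random_search_efficiency(mae, num_params, budget, span=100.0):
    """Estimate how many uniform random guesses would match a given MAE.

    Assumes each parameter is drawn uniformly from a range `span` units wide.
    """
    half_width = 2.0 * mae                 # uniform error in +/-w has MAE w / 2
    p_single = 2.0 * half_width / span     # chance one parameter lands in the window
    p_all = p_single ** num_params         # all parameters must land at once
    expected_trials = 1.0 / p_all          # mean of a geometric distribution
    return p_all, expected_trials, expected_trials / budget


p_all, trials, efficiency = random_search_efficiency(mae=2.155, num_params=3, budget=250)
print(f"{p_all:.3g} {trials:.0f} {efficiency:.2f}")
```

The three printed numbers correspond to the probability, the expected number of random experiments, and the efficiency ratio discussed above.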

Another Toy Problem: Mixed Hyperparameters

At this point, we have confirmed that policy gradients work for tuning continuous and discrete hyperparameters separately. Let’s put those pieces together and see how simultaneous parameter tuning works. No additional changes are needed: just update the hyperparameters and re-run the experiments. We’ll also increase the budget to 1000, since we’re optimizing a larger set of parameters.


Average MAE: 2.088

Pause for a brief second, and appreciate the significance of that result. Using only 1000 experiments, policy gradients achieve a mean error of just 2.1% across 6 separate hyperparameters. There’s roughly a 0.0835⁶ = 3.39E-7 probability of randomly drawing a similar set of parameters, so random search would need about 2.95 million experiments on average, which means PG is now nearly 3000x more efficient than random search!


The benefit of using policy gradients sharply increases as the size of our search space grows. That’s because we’re concurrently (not sequentially) tuning our hyperparameters. If we had simply tuned the discrete parameters first, and then tuned the continuous parameters, PG would have been >60x more efficient than random search. (Just multiply the relative efficiencies of discrete and continuous searches.) Instead, we’re able to use every experiment in our budget to optimize all hyperparameters simultaneously, and efficiency skyrockets.

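To make the comparison concrete, the two efficiency figures can be reproduced from the numbers quoted in this article. The variable names are illustrative, and the discrete-only efficiency is taken as exactly 10x, the lower bound of the quoted ">10x".

```python
# Concurrent tuning: six hyperparameters, budget of 1000 experiments.
p_match = 0.0835 ** 6                      # chance a random draw is as accurate
expected_random_trials = 1.0 / p_match     # about 2.95 million experiments
concurrent_efficiency = expected_random_trials / 1000

# Sequential tuning: multiply the separate discrete and continuous efficiencies.
discrete_efficiency = 10.0                 # lower bound quoted for discrete-only
continuous_efficiency = 6.25               # figure quoted for continuous-only
sequential_efficiency = discrete_efficiency * continuous_efficiency

print(f"concurrent: {concurrent_efficiency:.0f}x, sequential: {sequential_efficiency:.1f}x")
```

The gap between the two results is the payoff of letting every experiment inform all six actors at once.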

Takeaways

I’m constantly amazed by the power and flexibility of policy gradients. In a few hundred lines of Python, we’ve created an efficient, production-ready optimizer for discrete and continuous hyperparameters. It works equally well with any machine learning algorithm (neural networks, decision trees, k-nearest neighbors, etc.), because external gradients are not needed for optimization. If you need faster performance, each batch of experiments can also be parallelized using built-in Python modules like multiprocessing or concurrent.futures. I avoided doing that here for simplicity.

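As a rough sketch of that idea, a batch of hyperparameter sets can be evaluated concurrently with `concurrent.futures`. The helper name `evaluate_batch` and the worker count are assumptions; a thread pool is used here for simplicity, and `ProcessPoolExecutor` offers the same `map` interface for CPU-bound objectives (it requires the objective to be picklable).

```python
from concurrent.futures import ThreadPoolExecutor


def evaluate_batch(objective, batch, max_workers=4):
    """Evaluate every hyperparameter set in `batch` concurrently.

    Results come back in the same order as `batch`, so they can be
    paired with the sampled actions when updating the actors.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(objective, batch))
```

Because policy gradients only need the rewards for a finished batch, the update step itself stays unchanged.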

After this demo, I hope you’re as enamored with policy gradients as I am. For more information on PG, I strongly recommend OpenAI Spinning Up, which contains a plethora of background material, research papers, and open-source software relating to reinforcement learning.


Translated from: https://towardsdatascience.com/simultaneous-continuous-discrete-hyperparameter-tuning-with-policy-gradients-4531d226d6e2
