An introduction to high-dimensional hyper-parameter tuning

by Thalles Silva

Best practices for optimizing ML models

If you have ever struggled with tuning Machine Learning (ML) models, you are reading the right piece.

Hyper-parameter tuning refers to the problem of finding an optimal set of hyper-parameter values for a learning algorithm.

Usually, the process of choosing these values is a time-consuming task.

Even for simple algorithms like Linear Regression, finding the best set of hyper-parameters can be tough. With Deep Learning, things get even worse.

Some of the hyper-parameters to tune when optimizing neural nets (NNs) include:

learning rate
momentum
regularization
dropout probability
batch normalization

In this short piece, we talk about the best practices for optimizing ML models. These practices come in handy mainly when the number of parameters to tune exceeds two or three.

The problem with Grid Search

Grid Search is usually a good choice when we have a small number of parameters to optimize. For two or even three different parameters, it might be the way to go.

For each hyper-parameter, we define a set of candidate values to explore.

Then, the idea is to exhaustively try every possible combination of the values of the individual parameters.

For each combination, we train and evaluate a different model.

In the end, we keep the one with the smallest generalization error.

The main problem with Grid Search is that it is an exponential time algorithm. Its cost grows exponentially with the number of parameters.

In other words, if we need to optimize p parameters and each one takes at most v values, it requires O(vᵖ) training runs.

Also, Grid Search is not as effective in exploring the hyper-parameter space as we may think.

Consider a setup with four hyper-parameters and four candidate values for each. Using this setup, we are going to train a total of 4⁴ = 256 different models. Note that if we decide to add one more parameter with four candidate values, the number of experiments would increase to 4⁵ = 1024.

However, this setup only explores four different values for each hyper-parameter. That is, we train 256 models to explore only four values of the learning rate, four of the regularization strength, and so on.
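As a minimal sketch of such a setup, the grid can be enumerated with `itertools.product`. Only `learning_rate_search` and its values appear in this article; the other three lists, their names, and their values are hypothetical and chosen purely for illustration.

```python
from itertools import product

# Candidate values for four hyper-parameters, four values each.
# Only learning_rate_search comes from the article; the rest are hypothetical.
learning_rate_search = [0.1, 0.01, 0.001, 0.0001]
regularization_search = [0.1, 0.01, 0.001, 0.0001]
momentum_search = [0.5, 0.9, 0.95, 0.99]
dropout_search = [0.2, 0.3, 0.4, 0.5]

# Every combination of the candidate values is one model to train.
grid = list(product(learning_rate_search,
                    regularization_search,
                    momentum_search,
                    dropout_search))

num_models = len(grid)
# Count how many different learning rates the whole grid actually tries.
distinct_learning_rates = len({combo[0] for combo in grid})

print(num_models, distinct_learning_rates)  # prints: 256 4
```

Each tuple in `grid` would be passed to a training routine. The count makes the imbalance visible: 256 training runs, yet only four learning rates are ever tried.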

Besides, Grid Search usually requires repeated trials. Take the following learning_rate_search values as an example.

learning_rate_search = [0.1, 0.01, 0.001, 0.0001]

Suppose that after our first run (256 model trials), we get the best model with a learning rate value of 0.01.

In this situation, we should try to refine our search values by “zooming in” on the grid around 0.01 in the hope of finding an even better value.

To do this, we could set up a new Grid Search and redefine the learning rate search range, such as:

learning_rate_search = [0.006, 0.008, 0.01, 0.04, 0.06]

But what if the best model has a learning rate of 0.0001?

Since this value is at the very edge of our initial search range, we should shift the values and try again with a different set like:

learning_rate_search = [0.0001, 0.00006, 0.00002]

And then possibly refine the range again after finding a good candidate.
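The two cases above (zooming in around an interior winner, shifting when the winner sits on the edge) can be sketched as a small helper. This is an illustrative heuristic, not a standard routine: the function name `next_search_values` is hypothetical, the zoom and lower-edge multipliers are read off the example lists in this article, and the upper-edge multipliers are an assumption.

```python
def next_search_values(values, best):
    """Suggest the next candidate list for one hyper-parameter.

    Illustrative heuristic only; the multipliers mirror the example lists
    in the text, except for the upper-edge case, which is an assumption.
    """
    ordered = sorted(values)
    if best == ordered[0]:
        # Best value is on the lower edge: shift the range downward.
        multipliers = [1.0, 0.6, 0.2]
    elif best == ordered[-1]:
        # Best value is on the upper edge: shift the range upward (assumed).
        multipliers = [1.0, 2.0, 6.0]
    else:
        # Best value is in the interior: zoom in around it.
        multipliers = [0.6, 0.8, 1.0, 4.0, 6.0]
    return [best * m for m in multipliers]


initial = [0.1, 0.01, 0.001, 0.0001]
zoomed = next_search_values(initial, best=0.01)     # around 0.006 ... 0.06
shifted = next_search_values(initial, best=0.0001)  # 0.0001 and below
```

Every call to a helper like this implies another full Grid Search run, which is where the cost adds up.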

All these details only emphasize how time-consuming hyper-parameter search can be.

A better approach: Random Search

How about choosing our hyper-parameter candidate values at random? As unintuitive as it might seem, this idea is almost always better than Grid Search.

A little bit of intuition

Note that some of the hyper-parameters are more important than others.

The learning rate and the momentum factor, for example, are usually more worth tuning than the others.

However, with the above exception, it is hard to know which ones play major roles in the optimization process. In fact, I would argue that the importance of each parameter might change for different model architectures and datasets.

Suppose we are optimizing over two hyper-parameters: the learning rate and the regularization strength. Also, assume that only the learning rate matters for the problem.

In the case of Grid Search with three candidate values per parameter, we are going to run nine different experiments, but only try three candidates for the learning rate.

Now, take a look at what happens if we sample the candidates uniformly at random. In this scenario, the same nine experiments explore nine different values for each parameter.

If you are not yet convinced, suppose we are optimizing over three hyper-parameters. For example, the learning rate, the regularization strength, and momentum.

For Grid Search with five candidate values per parameter, we would be running 5³ = 125 training runs, but only exploring five different values of each parameter.

On the other hand, with Random Search, the same 125 runs would explore 125 different values of each.
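The counting argument can be checked with a short sketch. The candidate values are placeholders (positions in the unit interval rather than real learning rates or momenta), and the helper name `distinct_per_parameter` is hypothetical.

```python
import random
from itertools import product

rng = random.Random(0)  # fixed seed so the run is repeatable

# Grid Search: five placeholder candidates for each of three hyper-parameters.
candidates = [0.1, 0.3, 0.5, 0.7, 0.9]
grid_trials = list(product(candidates, repeat=3))

# Random Search with the same budget: every trial draws fresh values.
random_trials = [(rng.random(), rng.random(), rng.random())
                 for _ in range(len(grid_trials))]


def distinct_per_parameter(trials):
    """Count how many different values each of the three parameters takes."""
    return [len({trial[i] for trial in trials}) for i in range(3)]


grid_distinct = distinct_per_parameter(grid_trials)
random_distinct = distinct_per_parameter(random_trials)
print(len(grid_trials), grid_distinct, random_distinct)
```

Both strategies spend 125 training runs. The grid tries five values per parameter, while the random draws give a different value for each parameter on essentially every run. If only one of the three parameters matters, the grid has effectively tested it five times and the random search 125 times.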

How to do it

If we want to try values for the learning rate, say within the range of 0.0001 to 0.1, we draw an exponent uniformly at random between -4 and -1 and raise 10 to that power.

Note that we are sampling values from a uniform distribution on a log scale.

You can think of the values -1 and -4 (for the learning rate) as the exponents in the interval [10⁻⁴, 10⁻¹].
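A minimal sketch of this log-scale sampling, using only Python's standard `random` module. The variable names and the choice of 25 samples are assumptions made for illustration.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable

NUM_SAMPLES = 25  # illustrative sample count

# Draw the exponent uniformly in [-4, -1], then map it back to a learning rate.
exponents = [rng.uniform(-4, -1) for _ in range(NUM_SAMPLES)]
learning_rates = [10 ** e for e in exponents]

for lr in sorted(learning_rates)[:5]:
    print(f"{lr:.6f}")
```

Because the exponent is what is uniformly distributed, each decade (0.0001 to 0.001, 0.001 to 0.01, 0.01 to 0.1) receives roughly a third of the samples on average.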

If we do not use a log scale, the samples will not be spread evenly across the orders of magnitude in the given range. In other words, you should not attempt to sample the values directly from a uniform distribution over the raw range, for example uniformly between 0.0004 and 0.1.

In this situation, most of the values would not be sampled from a ‘valid’ region. Actually, considering the 25 learning rate samples in this example, 72% of the values fall in the interval [0.02, 0.1].

Moreover, 88% of the sampled values come from the interval [0.01, 0.1]. That is, only 12% of the learning rate candidates, 3 values, are sampled from the interval [0.0004, 0.01]. Do not do that.

In the graphic, we sample 25 random values from the range [0.0004, 0.1]. The plot in the top left shows the original values.

In the top right, notice that 72% of the sampled values are in the interval [0.02, 0.1]. 88% of the values lie within the range [0.01, 0.1].

The bottom plot shows the distribution of values. Only 12% of the values are in the interval [0.0004, 0.01]. To solve this problem, sample the values from a uniform distribution on a log scale.
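The imbalance can be reproduced with a short sketch that compares sampling on the raw scale with sampling on a log scale over the same range [0.0004, 0.1]. The sample size of 10,000 and all names are assumptions; the article's own figure used 25 samples, so its percentages differ slightly from the expected shares computed here.

```python
import math
import random

LOW, HIGH = 0.0004, 0.1   # range used in the text
CUT = 0.01                # boundary of the top decade

# Expected share of samples landing in [CUT, HIGH] under each scheme.
linear_share = (HIGH - CUT) / (HIGH - LOW)
log_share = (math.log10(HIGH) - math.log10(CUT)) / (math.log10(HIGH) - math.log10(LOW))

rng = random.Random(0)
n = 10_000
linear_samples = [rng.uniform(LOW, HIGH) for _ in range(n)]
log_samples = [10 ** rng.uniform(math.log10(LOW), math.log10(HIGH)) for _ in range(n)]


def share_above(samples, cut):
    """Fraction of samples greater than or equal to cut."""
    return sum(s >= cut for s in samples) / len(samples)


linear_observed = share_above(linear_samples, CUT)
log_observed = share_above(log_samples, CUT)
print(f"raw scale: expected {linear_share:.3f}, observed {linear_observed:.3f}")
print(f"log scale: expected {log_share:.3f}, observed {log_observed:.3f}")
```

On the raw scale about 90% of the samples are expected to land in [0.01, 0.1], in line with the 88% seen in the 25-sample example. On a log scale that share drops to about 42%, leaving the majority of the candidates for the smaller learning rates.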

A similar behavior would happen with the regularization parameter.

Also, note that, as with Grid Search, you need to consider the two cases we mentioned above.

If the best candidate falls very near the edge, your range might be off and should be shifted and re-sampled. Also, after choosing the first good candidates, try re-sampling from a finer range of values.
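One way to express the shift-or-refine rule for log-scale sampling is to adjust the exponent range before re-sampling. The sketch below is a heuristic under stated assumptions: the function name `next_exponent_range`, the 10% edge margin, the half-width shift, and the 50% zoom are all hypothetical choices, not values from the article.

```python
import random


def next_exponent_range(best_exp, low_exp, high_exp, edge_fraction=0.1, zoom=0.5):
    """Return a new (low, high) exponent range for log-uniform re-sampling.

    Heuristic sketch: the edge margin, shift size, and zoom factor are
    assumptions chosen for illustration.
    """
    width = high_exp - low_exp
    margin = edge_fraction * width
    if best_exp - low_exp < margin:
        # Best candidate is near the lower edge: shift the window down.
        return low_exp - width / 2, high_exp - width / 2
    if high_exp - best_exp < margin:
        # Best candidate is near the upper edge: shift the window up.
        return low_exp + width / 2, high_exp + width / 2
    # Best candidate is in the interior: zoom in around it.
    half = zoom * width / 2
    return best_exp - half, best_exp + half


rng = random.Random(0)
# Suppose the best learning rate so far was 10 ** -2.0 within [-4, -1].
new_low, new_high = next_exponent_range(-2.0, -4, -1)
refined_learning_rates = [10 ** rng.uniform(new_low, new_high) for _ in range(10)]
```

Working in exponent space keeps the re-sampled candidates log-uniform, so the refinement step does not reintroduce the raw-scale bias described above.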

In conclusion, these are the key takeaways.

If you have more than two or three hyper-parameters to tune, prefer Random Search. It is faster and easier to implement, and it converges faster than Grid Search.
Use an appropriate scale to pick your values. Sample from a uniform distribution in log space. This will allow you to sample values evenly distributed across the parameter ranges.
Regardless of Random or Grid Search, pay attention to the candidates you choose. Make sure the parameter ranges are properly set and refine the best candidates if possible.

Thanks for reading! For more cool stuff on Deep Learning, check out some of my previous articles:

How to train your own FaceID ConvNet using TensorFlow Eager execution (medium.freecodecamp.org)
Machine Learning 101: An Intuitive Introduction to Gradient Descent (towardsdatascience.com)

Original article: https://www.freecodecamp.org/news/an-introduction-to-high-dimensional-hyper-parameter-tuning-df5c0106e5a4/
