
by Thalles Silva

An introduction to high-dimensional hyper-parameter tuning

Best practices for optimizing ML models

If you have ever struggled with tuning Machine Learning (ML) models, you are reading the right piece.


Hyper-parameter tuning refers to the problem of finding an optimal set of parameter values for a learning algorithm.


Usually, the process of choosing these values is a time-consuming task.


Even for simple algorithms like Linear Regression, finding the best set for the hyper-parameters can be tough. With Deep Learning, things get even worse.

Some of the parameters to tune when optimizing neural nets (NNs) include:


  • learning rate
  • momentum
  • regularization
  • dropout probability
  • batch normalization

In this short piece, we talk about the best practices for optimizing ML models. These practices come in handy mainly when the number of parameters to tune exceeds two or three.

The problem with Grid Search

Grid Search is usually a good choice when we have a small number of parameters to optimize. For two or even three different parameters, it might be the way to go.

For each hyper-parameter, we define a set of candidate values to explore.


Then, the idea is to exhaustively try every possible combination of the values of the individual parameters.


For each combination, we train and evaluate a different model.


In the end, we keep the one with the smallest generalization error.
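The procedure above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the article's original listing: only `learning_rate_search` appears in the text, while the other search lists, their values, and `train_and_evaluate` are hypothetical, and a toy error function stands in for real training.

```python
import itertools

# Four candidate values per hyper-parameter.
learning_rate_search = [0.1, 0.01, 0.001, 0.0001]
# Hypothetical lists, with values chosen only for illustration.
momentum_search = [0.5, 0.9, 0.95, 0.99]
regularization_search = [0.1, 0.01, 0.001, 0.0001]
dropout_search = [0.2, 0.3, 0.4, 0.5]


def train_and_evaluate(learning_rate, momentum, regularization, dropout):
    """Hypothetical stand-in: a real version would train a model and
    return its validation error. Here we return a toy error instead."""
    return (abs(learning_rate - 0.01) + abs(momentum - 0.9)
            + regularization + abs(dropout - 0.3))


# Every possible combination of the candidate values.
grid = list(itertools.product(learning_rate_search, momentum_search,
                              regularization_search, dropout_search))

# Train and evaluate one model per combination, keep the lowest error.
best_error, best_params = min(
    (train_and_evaluate(*params), params) for params in grid
)

print("number of models trained:", len(grid))
print("best combination:", best_params)
```

With four hyper-parameters and four values each, the grid holds 4 × 4 × 4 × 4 = 256 combinations, so 256 models are trained.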


The main problem with Grid Search is that it is an exponential time algorithm. Its cost grows exponentially with the number of parameters.

In other words, if we need to optimize p parameters and each one takes at most v values, it runs in O(vᵖ) time.
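To make the growth concrete, the short sketch below counts the training runs for a fixed number of values per parameter. The function name `grid_search_cost` is a hypothetical helper introduced only for this illustration.

```python
def grid_search_cost(num_values, num_params):
    """Number of models Grid Search trains when each of `num_params`
    hyper-parameters takes `num_values` candidate values."""
    return num_values ** num_params


# With four values per parameter, each extra parameter multiplies the cost by four.
for p in range(1, 7):
    print(f"{p} parameter(s): {grid_search_cost(4, p)} models")
```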

Also, Grid Search is not as effective in exploring the hyper-parameter space as we may think.


Take a look at the code above again. Using this setup, we are going to train a total of 256 different models. Note that if we decide to add one more parameter, the number of experiments would increase to 1024.

However, this setup only explores four different values for each hyper-parameter. That is, we train 256 models to explore only four values of the learning rate, four of the regularization strength, and so on.

Besides, Grid Search usually requires repetitive trials. Take the learning_rate_search values from the code above as an example.

learning_rate_search = [0.1, 0.01, 0.001, 0.0001]

Suppose that after our first run (256 model trials), we get the best model with a learning rate value of 0.01.


In this situation, we should try to refine our search values by “zooming in” on the grid around 0.01 in the hope of finding an even better value.


To do this, we could set up a new Grid Search and redefine the learning rate search range, for example:


learning_rate_search = [0.006, 0.008, 0.01, 0.04, 0.06]

But what if the best model had a learning rate of 0.0001?


Since this value is at the very edge of our initial search range, we should shift the values and try again with a different set like:


learning_rate_search = [0.0001, 0.00006, 0.00002]

And possibly try to refine the range after finding a good candidate.
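The two cases above (zooming in around an interior winner, shifting the range when the winner sits on the edge) could be automated along the following lines. This is one possible approach, not the author's procedure: `refine_search_range` is a hypothetical helper, it spaces the new candidates evenly on a log scale rather than reproducing the hand-picked lists above, and the one-decade shift is an arbitrary choice.

```python
import math


def refine_search_range(candidates, best, num=5):
    """Hypothetical helper: propose `num` new candidates around `best`.

    If `best` is an interior value, zoom in halfway (in log space) towards
    each neighbour. If it sits on an edge, shift the range one decade
    beyond that edge.
    """
    ordered = sorted(candidates)
    i = ordered.index(best)
    log_best = math.log10(best)
    if i == 0:
        low, high = log_best - 1.0, log_best      # shift towards smaller values
    elif i == len(ordered) - 1:
        low, high = log_best, log_best + 1.0      # shift towards larger values
    else:
        low = (math.log10(ordered[i - 1]) + log_best) / 2
        high = (math.log10(ordered[i + 1]) + log_best) / 2
    step = (high - low) / (num - 1)
    return [10 ** (low + k * step) for k in range(num)]


learning_rate_search = [0.1, 0.01, 0.001, 0.0001]
print(refine_search_range(learning_rate_search, 0.01))    # zoom in around 0.01
print(refine_search_range(learning_rate_search, 0.0001))  # shift below 0.0001
```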


All these details only emphasize how time-consuming hyper-parameter search can be.


A better approach: Random Search

How about choosing our hyper-parameter candidate values at random? As unintuitive as it might seem, this idea is almost always better than Grid Search.

A little bit of intuition

Note that some of the hyper-parameters are more important than others.


The learning rate and the momentum factor, for example, are more worth tuning than all others.


However, with the above exception, it is hard to know which ones play major roles in the optimization process. In fact, I would argue that the importance of each parameter might change for different model architectures and datasets.

Suppose we are optimizing over two hyper-parameters — the learning rate and the regularization strength. Also, take into consideration that only the learning rate matters for the problem.

In the case of Grid Search, we are going to run nine different experiments, but only try three candidates for the learning rate.


Now, take a look at what happens if we sample the candidates uniformly at random. In this scenario, we are actually exploring nine different values for each parameter.

If you are not yet convinced, suppose we are optimizing over three hyper-parameters. For example, the learning rate, the regularization strength, and momentum.

For Grid Search, we would be running 125 training runs, but only exploring five different values of each parameter.


On the other hand, with Random Search, we would be exploring 125 different values of each.
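The counting argument can be checked with a small sketch. The candidate values and sampling ranges below are assumptions chosen for illustration; only the counts (125 runs, 5 versus 125 distinct values) come from the text.

```python
import itertools
import random

rng = random.Random(0)

# Grid Search: five hypothetical candidate values for each of three parameters.
learning_rates = [0.1, 0.03, 0.01, 0.003, 0.001]
regularizations = [0.1, 0.01, 0.001, 0.0001, 0.00001]
momentums = [0.5, 0.8, 0.9, 0.95, 0.99]
grid_runs = list(itertools.product(learning_rates, regularizations, momentums))

# Random Search: the same budget of runs, each parameter drawn independently.
random_runs = [
    (10 ** rng.uniform(-4, -1),   # learning rate, log scale
     10 ** rng.uniform(-5, -1),   # regularization strength, log scale
     rng.uniform(0.5, 0.99))      # momentum, linear scale
    for _ in range(len(grid_runs))
]

distinct_grid = len({run[0] for run in grid_runs})
distinct_random = len({run[0] for run in random_runs})
print("runs:", len(grid_runs))
print("distinct learning rates, grid:", distinct_grid)
print("distinct learning rates, random:", distinct_random)
```

For the same number of training runs, every random trial tests a new value of the parameter that matters, whereas the grid keeps revisiting the same five.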


How to do it

If we want to try values for the learning rate, say within the range of 0.1 to 0.0001, we do:
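A minimal sketch of this sampling, using Python's standard `random` module: draw an exponent uniformly between -4 and -1, then raise 10 to that power. The function name `sample_learning_rate` and its parameters are hypothetical.

```python
import random

rng = random.Random(0)


def sample_learning_rate(rng, low_exp=-4, high_exp=-1):
    """Draw a learning rate between 10**low_exp and 10**high_exp,
    uniformly distributed on a log scale (hypothetical helper)."""
    exponent = rng.uniform(low_exp, high_exp)
    return 10 ** exponent


learning_rates = [sample_learning_rate(rng) for _ in range(5)]
print(learning_rates)
```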


Note that we are sampling values from a uniform distribution on a log scale.


You can think of the values -1 and -4 (for the learning rate) as the exponents in the interval [10⁻¹, 10⁻⁴], that is, learning rates between 0.1 and 0.0001.


If we do not use a log-scale, the sampling will not be uniform within the given range. In other words, you should not attempt to sample values like:
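The pattern to avoid looks roughly like the sketch below, which draws 25 learning rates uniformly on a linear scale from [0.0004, 0.1]. The variable names are illustrative, and the exact share printed depends on the random seed.

```python
import random

rng = random.Random(0)

# Linear-scale sampling: the pattern to avoid for parameters like the learning rate.
linear_samples = [rng.uniform(0.0004, 0.1) for _ in range(25)]

# Share of samples that land in the top decade [0.01, 0.1].
share_large = sum(lr >= 0.01 for lr in linear_samples) / len(linear_samples)
print(f"{share_large:.0%} of the samples are >= 0.01")
```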

In this situation, most of the values would not be sampled from a ‘valid’ region. Actually, considering the learning rate samples in this example, 72% of the values would fall in the interval [0.02, 0.1].

Moreover, 88% of the sampled values would come from the interval [0.01, 0.1]. That is, only 12% of the learning rate candidates, 3 values, would be sampled from the interval [0.0004, 0.01]. Do not do that.

In the graphic below, we are sampling 25 random values from the range [0.1,0.0004]. The plot in the top left shows the original values.

In the top right, notice that 72% of the sampled values are in the interval [0.02, 0.1]. 88% of the values lie within the range [0.01, 0.1].

The bottom plot shows the distribution of values. Only 12% of the values are in the interval [0.0004, 0.01]. To solve this problem, sample the values from a uniform distribution on a log scale.
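The difference between the two scales can be measured with a larger sample. In the sketch below (variable names are illustrative), the share of values below 0.01 should come out near 10% for linear sampling, since (0.01 − 0.0004) / (0.1 − 0.0004) ≈ 0.096, and near 58% for log-scale sampling over the same range.

```python
import math
import random

rng = random.Random(0)
n = 100_000
low, high = 0.0004, 0.1

# Uniform on a linear scale.
linear = [rng.uniform(low, high) for _ in range(n)]

# Uniform on a log scale: sample the exponent, then exponentiate.
log_scale = [10 ** rng.uniform(math.log10(low), math.log10(high)) for _ in range(n)]

frac_linear_small = sum(x < 0.01 for x in linear) / n
frac_log_small = sum(x < 0.01 for x in log_scale) / n
print(f"linear scale, share below 0.01: {frac_linear_small:.1%}")
print(f"log scale, share below 0.01:    {frac_log_small:.1%}")
```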

A similar behavior would happen with the regularization parameter.
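Putting the pieces together, a Random Search over the learning rate and the regularization strength might look like the sketch below. Everything here is assumed for illustration: the ranges, the number of trials, and `validation_error`, a toy function standing in for training and evaluating a real model.

```python
import math
import random


def validation_error(learning_rate, regularization):
    """Hypothetical stand-in for training a model and measuring its
    validation error. The toy surface depends mostly on the learning rate."""
    return ((math.log10(learning_rate) + 2) ** 2
            + 0.1 * (math.log10(regularization) + 3) ** 2)


rng = random.Random(0)
trials = []
for _ in range(50):
    # Both parameters are sampled uniformly on a log scale.
    learning_rate = 10 ** rng.uniform(-4, -1)
    regularization = 10 ** rng.uniform(-5, -1)
    error = validation_error(learning_rate, regularization)
    trials.append((error, learning_rate, regularization))

# Keep the trial with the lowest validation error.
best_error, best_lr, best_reg = min(trials)
print(f"best error {best_error:.4f} at lr={best_lr:.5f}, reg={best_reg:.6f}")
```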


Also, note that, as with Grid Search, you need to consider the two cases we mentioned above.

If the best candidate falls very near the edge, your range might be off and should be shifted and re-sampled. Also, after choosing the first good candidates, try re-sampling to a finer range of values.

In conclusion, these are the key takeaways.


  • If you have more than two or three hyper-parameters to tune, prefer Random Search. It is faster/easier to implement and converges faster than Grid Search.
  • Use an appropriate scale to pick your values. Sample from a uniform distribution in a log-space. This will allow you to sample values equally distributed across the parameter ranges.
  • Regardless of Random or Grid Search, pay attention to the candidates you choose. Make sure the parameter ranges are properly set and refine the best candidates if possible.

Thanks for reading! For more cool stuff on Deep Learning, check out some of my previous articles:

  • How to train your own FaceID ConvNet using TensorFlow Eager execution (medium.freecodecamp.org)
  • Machine Learning 101: An Intuitive Introduction to Gradient Descent (towardsdatascience.com)

Translated from: https://www.freecodecamp.org/news/an-introduction-to-high-dimensional-hyper-parameter-tuning-df5c0106e5a4/


