http://blog.csdn.net/pipisorry/article/details/52268947

Scikit-learn: Parallel Parameter Tuning with Grid Search

Grid Search: Searching for estimator parameters

scikit-learn provides pipeline (for chaining estimators) and grid_search (for searching for the best parameters), which together support parallel parameter tuning.

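Take text classification with scikit-learn as an example. How many words should the vectorizer keep? Preprocessing filters out words whose document frequency is above max_df, so what should max_df be? Should the TfidfTransformer use tf only, or tf plus idf? How many iterations should the classifier run, and what learning rate should it use? "Loop over the values and try them one by one": that is the basic job grid search does.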

某小皮

Tuning a model's hyper-parameters

Hyper-parameters are parameters that are not directly learnt within estimators. In scikit-learn they are passed as arguments to the constructor of the estimator classes.

It is possible and recommended to search the hyper-parameter space for the best cross-validation score (see "Cross-validation: evaluating estimator performance").

Any parameter provided when constructing an estimator may be optimized in this manner. Specifically, to find the names and current values for all parameters for a given estimator, use:

estimator.get_params()
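As a minimal sketch of that call, the snippet below lists the parameters of an SVC. The choice of SVC and the values C=10 and C=100 are only illustrative assumptions, not something the original post prescribes.

```python
from sklearn.svm import SVC

# An estimator built with two explicitly chosen hyper-parameters.
estimator = SVC(C=10, kernel="rbf")

# get_params() returns a dict: constructor argument name -> current value.
params = estimator.get_params()
for name in sorted(params):
    print(f"{name} = {params[name]!r}")

# set_params() is the counterpart a search uses to try a new candidate value.
estimator.set_params(C=100)
```

The keys printed here are exactly the names that can appear in a parameter grid.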

A search consists of:

an estimator (regressor or classifier such as sklearn.svm.SVC());
a parameter space;
a method for searching or sampling candidates;
a cross-validation scheme; and
a score function.

GridSearchCV exhaustively considers all parameter combinations, while RandomizedSearchCV can sample a given number of candidates from a parameter space with a specified distribution.

Exhaustive grid search: GridSearchCV

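Grid Search: for each parameter, pick a few values to try, then traverse every combination of those values like a grid. The advantage is that it is simple and brute-force, and if the whole grid can be traversed the result is fairly reliable. The drawback is that it is very time-consuming. For models such as neural networks in particular, usually only a few parameter combinations can be tried.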

param_grid = [
    {'C': [1, 10, 100, 1000], 'kernel': ['linear']},
    {'C': [1, 10, 100, 1000], 'gamma': [0.001, 0.0001], 'kernel': ['rbf']},
]
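To see how such a grid is consumed, here is a small sketch that feeds it to GridSearchCV. The iris dataset, the SVC estimator and cv=5 are assumptions chosen only to keep the example short.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Two sub-grids: a linear kernel (4 candidates) and an RBF kernel (4 * 2 candidates).
param_grid = [
    {"C": [1, 10, 100, 1000], "kernel": ["linear"]},
    {"C": [1, 10, 100, 1000], "gamma": [0.001, 0.0001], "kernel": ["rbf"]},
]

# Every candidate is scored with 5-fold cross-validation.
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)

n_candidates = len(search.cv_results_["params"])
print("candidates tried:", n_candidates)
print("best parameters:", search.best_params_)
print("best CV score:", search.best_score_)
```

Because the grid is a list of dicts, the gamma values are only combined with the RBF kernel, which keeps pointless combinations out of the search.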

Best-practice example: see "Nested versus non-nested cross-validation" for an example of Grid Search within a cross-validation loop on the iris dataset.
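The idea of a grid search inside a cross-validation loop can be sketched roughly as follows. This is not the code of the referenced example; the grid values, the 4-fold splits and the random_state are assumptions made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
param_grid = {"C": [1, 10, 100], "gamma": [0.01, 0.1]}

inner_cv = KFold(n_splits=4, shuffle=True, random_state=0)
outer_cv = KFold(n_splits=4, shuffle=True, random_state=0)

# Non-nested: the score that picked the parameters is also the reported score,
# so it tends to be optimistic.
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=inner_cv)
search.fit(X, y)
non_nested_score = search.best_score_

# Nested: the whole search is re-run inside each outer training fold and
# evaluated on the outer test fold, which the search never saw.
nested_scores = cross_val_score(search, X, y, cv=outer_cv)
nested_score = nested_scores.mean()

print("non-nested:", non_nested_score)
print("nested:", nested_score)
```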

Randomized parameter optimization: RandomizedSearchCV

Random Search: first define the candidate parameter values as in Grid Search, then randomly pick a combination from them for each training run.

sklearn.model_selection.RandomizedSearchCV(estimator, param_distributions, n_iter=10, scoring=None, fit_params=None, n_jobs=1, iid=True, refit=True, cv=None, verbose=0, pre_dispatch='2*n_jobs', random_state=None, error_score='raise', return_train_score=True)

Randomized search has two main benefits over an exhaustive search:

A budget can be chosen independent of the number of parameters and possible values.
Adding parameters that do not influence the performance does not decrease efficiency.

{'C': scipy.stats.expon(scale=100),
 'gamma': scipy.stats.expon(scale=.1),
 'kernel': ['rbf'],
 'class_weight': ['balanced', None]}

In principle, any function can be passed that provides a rvs (random variate sample) method to sample a value.
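A short sketch of how these distributions might be passed to RandomizedSearchCV is given below. The iris dataset, n_iter=20, cv=5 and random_state=0 are assumptions for illustration only.

```python
import scipy.stats
from sklearn.datasets import load_iris
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Continuous parameters get a distribution (anything with an rvs method);
# discrete parameters get a list that is sampled uniformly.
param_distributions = {
    "C": scipy.stats.expon(scale=100),
    "gamma": scipy.stats.expon(scale=0.1),
    "kernel": ["rbf"],
    "class_weight": ["balanced", None],
}

# n_iter is the budget: exactly 20 candidates are drawn, whatever the space size.
search = RandomizedSearchCV(
    SVC(), param_distributions, n_iter=20, cv=5, random_state=0
)
search.fit(X, y)

print("candidates tried:", len(search.cv_results_["params"]))
print("best parameters:", search.best_params_)
```

The budget here is fixed by n_iter, so adding one more parameter to the dict does not increase the number of fits.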

Example: "Comparing randomized search and grid search for hyperparameter estimation" compares the usage and efficiency of randomized search and grid search.

Tips for parameter search

Specifying an objective metric

[Evaluation metrics and methods for machine learning models]
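By default the search uses the estimator's own score method. As a hedged sketch, a different objective can be given through the scoring argument, either as a predefined name or as a scorer built with make_scorer. The macro-averaged F1 metric, the dataset and the grid below are assumptions picked for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
param_grid = {"C": [0.1, 1, 10]}

# Option 1: a predefined scorer name.
by_name = GridSearchCV(SVC(kernel="linear"), param_grid, scoring="f1_macro", cv=5)
by_name.fit(X, y)

# Option 2: wrap a metric function, fixing its extra arguments.
macro_f1 = make_scorer(f1_score, average="macro")
by_callable = GridSearchCV(SVC(kernel="linear"), param_grid, scoring=macro_f1, cv=5)
by_callable.fit(X, y)

print(by_name.best_params_, by_name.best_score_)
print(by_callable.best_params_, by_callable.best_score_)
```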

Composite estimators and parameter spaces

An estimator class must provide these methods: get_params, set_params(**params), fit(x, y), predict(new_samples) and score(x, y_true). Some of them (get_params and set_params) can be inherited directly from sklearn.base.BaseEstimator.
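To make that interface concrete, here is a toy sketch of a custom estimator that a grid search can tune. ThresholdClassifier is a hypothetical class invented for this example, and the synthetic data is an assumption; get_params and set_params come from BaseEstimator and score comes from ClassifierMixin.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.model_selection import GridSearchCV


class ThresholdClassifier(ClassifierMixin, BaseEstimator):
    """Hypothetical toy model: predict 1 when the first feature exceeds `threshold`."""

    def __init__(self, threshold=0.0):
        # Constructor arguments are stored unchanged so get_params/set_params work.
        self.threshold = threshold

    def fit(self, X, y):
        # Nothing is learnt from the data; only the class labels are recorded.
        self.classes_ = np.unique(y)
        return self

    def predict(self, X):
        return (np.asarray(X)[:, 0] > self.threshold).astype(int)


# Synthetic data whose true decision boundary sits at 1.0.
rng = np.random.RandomState(0)
X = rng.uniform(-1, 3, size=(200, 1))
y = (X[:, 0] > 1.0).astype(int)

search = GridSearchCV(ThresholdClassifier(), {"threshold": [0.0, 1.0, 2.0]}, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```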

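Use the Pipeline approach (see "Pipeline: chaining estimators"). The parameters of each step are addressed in the grid as <step name>__<parameter name>.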
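The text classification questions from the introduction (max_df, tf versus tf-idf, classifier settings) could be searched jointly with a pipeline along these lines. This is only a sketch: the tiny corpus, its labels, the step names and the grid values are all made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus: label 0 = pets, label 1 = finance.
docs = [
    "the cat sat on the mat", "dogs and cats are pets",
    "my dog chased the cat", "kittens purr and puppies bark",
    "the puppy fetched a ball", "a cat sleeps all day",
    "stocks fell on monday", "the market rallied today",
    "investors bought more shares", "bond yields rose sharply",
    "the bank raised interest rates", "shares and bonds traded lower",
]
labels = [0] * 6 + [1] * 6

pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", SGDClassifier(max_iter=1000, tol=1e-3, random_state=0)),
])

# Keys follow the <step name>__<parameter name> convention.
param_grid = {
    "vect__max_df": [0.5, 1.0],        # drop words that are too frequent
    "tfidf__use_idf": [True, False],   # tf-idf versus plain tf
    "clf__alpha": [1e-4, 1e-2],        # regularization strength
}

search = GridSearchCV(pipeline, param_grid, cv=3)
search.fit(docs, labels)
print(search.best_params_)
```

Searching over the whole pipeline means the vectorizer is refit inside every fold, so the preprocessing choices are evaluated together with the classifier.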

Model selection: development and evaluation

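Use a validation set (that is, a development set) for model selection: the development set is what gets fed to the GridSearchCV instance. The final model should then be evaluated on a separate evaluation set that the search never saw.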
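A minimal sketch of that split, assuming the digits dataset, a 50/50 split and a small SVC grid (none of which come from the original post):

```python
from sklearn.datasets import load_digits
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)

# Development set: used for the search. Evaluation set: untouched until the end.
X_dev, X_eval, y_dev, y_eval = train_test_split(
    X, y, test_size=0.5, random_state=0
)

search = GridSearchCV(
    SVC(kernel="rbf"), {"C": [1, 10], "gamma": [0.001, 0.0001]}, cv=3
)
search.fit(X_dev, y_dev)

# With refit=True (the default) predict() uses the best model refit on X_dev.
eval_accuracy = accuracy_score(y_eval, search.predict(X_eval))
print("selected:", search.best_params_)
print("held-out accuracy:", eval_accuracy)
```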

Parallelism

GridSearchCV and RandomizedSearchCV evaluate each parameter setting independently, so both can run in parallel by using the keyword n_jobs=-1.

Robustness to failure

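If the model raises an error for some parameter setting, the whole grid search fails. This can be avoided by setting error_score=0 (or =np.nan): a failed fit then issues a warning and sets the score for that fold to 0 (or NaN), and the search continues.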
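The sketch below shows this behaviour with a grid that deliberately contains an invalid combination. It relies on LinearSVC rejecting penalty='l1' together with loss='hinge'; the dataset, grid and cv=3 are assumptions for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)

# penalty='l1' with loss='hinge' is not supported by LinearSVC, so fitting
# that candidate raises an error on every fold.
param_grid = {"penalty": ["l1", "l2"], "loss": ["hinge", "squared_hinge"]}

# error_score=0.0: a failed fit emits a warning and scores 0 instead of
# aborting the whole search.
search = GridSearchCV(LinearSVC(max_iter=10000), param_grid, cv=3, error_score=0.0)
search.fit(X, y)

results = search.cv_results_
for params, score in zip(results["params"], results["mean_test_score"]):
    print(params, round(float(score), 3))
print("best:", search.best_params_)
```

Using 0 rather than NaN keeps the failed candidates at the bottom of the ranking regardless of how a given release orders NaN scores.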

Choosing which parameters to tune

[Machine learning model selection: choosing the parameters to tune]

GridSearch flowchart


Alternatives to brute force parameter search

Model-specific cross-validation

...

Information criteria

Some models can offer an information-theoretic closed-form formula of the optimal estimate of the regularization parameter by computing a single regularization path (instead of several when using cross-validation).

Here is the list of models benefitting from the Akaike Information Criterion (AIC) or the Bayesian Information Criterion (BIC) for automated model selection:

linear_model.LassoLarsIC([criterion, ...]) Lasso model fit with Lars using BIC or AIC for model selection

See PRML for reference.
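As a small illustration of criterion-based selection, the sketch below fits LassoLarsIC once per criterion and reads off the chosen regularization strength. The diabetes dataset is an assumption used only because it ships with scikit-learn.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LassoLarsIC

X, y = load_diabetes(return_X_y=True)

# Each fit computes one regularization path and picks the alpha that
# minimizes the chosen criterion; no cross-validation loop is involved.
aic_model = LassoLarsIC(criterion="aic").fit(X, y)
bic_model = LassoLarsIC(criterion="bic").fit(X, y)

print("alpha chosen by AIC:", aic_model.alpha_)
print("alpha chosen by BIC:", bic_model.alpha_)
print("non-zero coefficients (BIC):", int((bic_model.coef_ != 0).sum()))
```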

Out of Bag Estimates

Ensemble methods based on bagging train each model on a sample of the data, so part of the training set is left out for each model. This left out portion can be used to estimate the generalization error without having to rely on a separate validation set. This estimate comes "for free" as no additional data is needed and can be used for model selection.

`ensemble.RandomForestClassifier`([...])	A random forest classifier.
`ensemble.RandomForestRegressor`([...])	A random forest regressor.
`ensemble.ExtraTreesClassifier`([...])	An extra-trees classifier.
`ensemble.ExtraTreesRegressor`([n_estimators, ...])	An extra-trees regressor.
`ensemble.GradientBoostingClassifier`([loss, ...])	Gradient Boosting for classification.
`ensemble.GradientBoostingRegressor`([loss, ...])	Gradient Boosting for regression.
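A rough sketch of model selection with the out-of-bag estimate follows, using a random forest. The iris dataset, the candidate max_features values and n_estimators=100 are assumptions for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# Compare candidate settings by their out-of-bag accuracy; no separate
# validation split and no cross-validation loop is needed.
oob_scores = {}
for max_features in (1, 2, 4):
    forest = RandomForestClassifier(
        n_estimators=100,
        max_features=max_features,
        oob_score=True,      # score each sample with trees that did not see it
        random_state=0,
    )
    forest.fit(X, y)
    oob_scores[max_features] = forest.oob_score_

best_max_features = max(oob_scores, key=oob_scores.get)
print(oob_scores)
print("selected max_features:", best_max_features)
```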

Bayesian Optimization

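Bayesian optimization takes the results already observed for earlier parameter settings into account when choosing the next ones, so it saves time. Compared with grid search, it is like a sports car next to an old ox. For the underlying principle, see the paper "Practical Bayesian Optimization of Machine Learning Algorithms". Two Python libraries that implement Bayesian tuning and can be used out of the box are also recommended here: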

jaberg/hyperopt: relatively simple.
fmfn/BayesianOptimization: more complex; supports parallel tuning.

GridSearch example

[Auto-scaling scikit-learn with Apache Spark]


ref: [3.2. Tuning the hyper-parameters of an estimator]

[Parallel parameter tuning in Python: scikit-learn grid_search]

[Parameter estimation using grid search with cross-validation]

