GridSearchCV and RandomizedSearchCV

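1). GridSearchCV is a hyperparameter search utility in sklearn. Its purpose is automatic tuning: the user supplies the candidate values of the parameters to tune, and GridSearchCV returns the best combination for the model. It works by grid search, that is, exhaustive search: it loops over every combination of the candidate values, and the best-performing combination is the result.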

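The principle is like finding the maximum of an array. (Why is it called grid search? Take a model with two parameters, where parameter a has 3 candidate values and parameter b has 4. Listing every combination gives a 3 × 4 table in which each cell is one grid point; the loop visits and evaluates every cell, hence the name grid search.)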
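The 3 × 4 table can be enumerated directly. The sketch below uses sklearn's ParameterGrid; the parameter names "a" and "b" and their values are placeholders taken from the example above, not parameters of any real model.

```python
from sklearn.model_selection import ParameterGrid

# Hypothetical parameters "a" (3 candidates) and "b" (4 candidates).
grid = ParameterGrid({"a": [1, 2, 3], "b": [10, 20, 30, 40]})

# Each element is one cell of the 3 x 4 table, expressed as a dict.
combinations = list(grid)
print(len(combinations))
for params in combinations[:3]:
    print(params)
```

A grid search fits and scores the model once per cell (and once per fold when cross-validation is used), which is why the cost grows quickly as parameters are added.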

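CV stands for cross-validation, which reduces the influence of a lucky or unlucky data split; it is a common technique in experiments generally.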
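As a minimal illustration of the cross-validation part on its own, the sketch below scores one fixed SVC on the iris data with 3 folds. The choice of dataset, estimator and fold count is an assumption made for the example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# One score per fold; the mean is what a grid search compares across candidates.
scores = cross_val_score(SVC(kernel="rbf", C=1), X, y, cv=3)
print(scores, scores.mean())
```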

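GridSearchCV has one drawback: it is only practical for small datasets or for models with few parameters, because the number of combinations is the product of the number of candidates per parameter. When there are many parameters or the dataset is large, a faster approach is needed, for example coordinate descent. It is a greedy algorithm: tune the parameter that currently has the largest effect on the model until it is optimal, then tune the next most influential parameter, and so on until all parameters have been tuned. The drawback is that it may reach a local optimum rather than the global one, but it saves so much time and effort that it is worth trying; the result can be improved further afterwards with bagging.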
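The greedy procedure described above could look roughly like the following sketch. The function name coordinate_search, the candidate values and the starting point are all hypothetical, and the parameters are visited in dictionary order rather than ranked by influence, which is a simplification.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC


def coordinate_search(estimator_cls, candidates, start, X, y, cv=3, sweeps=2):
    """Greedy coordinate-wise tuning: change one parameter at a time."""
    best = dict(start)
    best_score = cross_val_score(estimator_cls(**best), X, y, cv=cv).mean()
    for _ in range(sweeps):
        for name, values in candidates.items():
            for value in values:
                # Vary only `name`; every other parameter keeps its current best value.
                trial = {**best, name: value}
                score = cross_val_score(estimator_cls(**trial), X, y, cv=cv).mean()
                if score > best_score:
                    best, best_score = trial, score
    return best, best_score


X, y = load_iris(return_X_y=True)
candidates = {"C": [0.1, 1.0, 10.0], "gamma": [0.01, 0.1, 1.0]}
best_params, best_score = coordinate_search(
    SVC, candidates, {"C": 1.0, "gamma": 0.1}, X, y
)
print(best_params, best_score)
```

Each sweep evaluates the sum of the candidate counts (3 + 3 here) instead of their product (3 × 3), which is where the time saving comes from; the price is that the result may be only a local optimum.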

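2). Another option is RandomizedSearchCV, which samples a few dozen to a few hundred random points in the hyperparameter space; some of them are likely to have a relatively low loss. This is faster than searching a thinned-out (sparse) grid, and experiments have shown that random search gives slightly better results than a sparse grid.

RandomizedSearchCV is used in much the same way as the GridSearchCV class, but instead of trying every possible combination it evaluates a fixed number of random combinations, picking a random value for each hyperparameter in every iteration. This has two advantages:

If you let the random search run for, say, 1000 iterations, it explores up to 1000 different values of each hyperparameter (rather than only the few values per hyperparameter that a grid search covers).
You can control the computational cost of the search simply by setting the number of iterations.

The interface of RandomizedSearchCV is essentially the same as that of GridSearchCV, but it replaces the exhaustive grid with random sampling from the parameter space. For a parameter that takes continuous values, RandomizedSearchCV can treat it as a distribution and sample from it, which grid search cannot do. Its search power depends on the n_iter parameter.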
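A minimal sketch of sampling continuous parameters from distributions, assuming an RBF SVC on the iris data. The ranges, n_iter=20 and random_state=0 are illustrative choices, not values from the original post.

```python
from scipy.stats import loguniform
from sklearn.datasets import load_iris
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Continuous parameters are given as distributions instead of fixed lists.
param_distributions = {
    "C": loguniform(1e-1, 1e1),
    "gamma": loguniform(1e-5, 1e1),
}

search = RandomizedSearchCV(
    SVC(kernel="rbf"),
    param_distributions,
    n_iter=20,        # number of sampled combinations, i.e. the compute budget
    cv=3,
    random_state=0,   # fixed seed so the sampled points are reproducible
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

Doubling n_iter doubles the number of fits, regardless of how many parameters are being searched.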

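Using GridSearchCV in a custom function

I wanted to use automatic parameter search in my own method so that I could write fewer nested loops. After a long search online, though, most examples apply it to estimators that already follow the sklearn interface, such as xgboost and svm. These are already wrapped and convenient to use. For example, to tune the σ and C parameters of an SVM (in sklearn the kernel width σ is controlled through gamma), we can simply do this: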

from sklearn.datasets import load_iris
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import svm
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer, accuracy_score

iris_data = load_iris()
X_trainval, X_test, y_trainval, y_test = train_test_split(iris_data.data, iris_data.target, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, random_state=1)

clf = svm.SVC(kernel='rbf', C=1)

# Two hyperparameters are searched here, gamma and C, with 3 candidate values each,
# which gives 9 combinations in total.
# numpy.linspace creates an arithmetic sequence, numpy.logspace a geometric one.
# In logspace the start and end points are exponents of 10.
# For example, logspace(-2, 1, 4) is the 4-element geometric sequence from 10^-2 to 10^1.
# The keys of `parameters` must match the estimator's parameter names.
# A prefix is needed only when the estimator is a step in a Pipeline: the step name
# followed by <two> underscores, e.g. 'svc__C'. The plain SVC used here takes 'gamma' and 'C'.
parameters = {'gamma': np.logspace(-5, 5, 3), 'C': np.logspace(-1, 1, 3)}
scoring_fnc = make_scorer(accuracy_score)

# GridSearchCV parameters:
# 1. estimator: estimator object.
# 2. param_grid: dict or list of dictionaries
# 3. verbose: controls the verbosity: the higher, the more messages.
# 4. refit: default=True, refit the best estimator with the entire dataset
# 5. cv: int or cross-validation generator; here 3-fold cross-validation
gs = GridSearchCV(clf, parameters, scoring=scoring_fnc, verbose=2, refit=True, cv=3)

# Run the grid search in a single thread
gs.fit(X_train, y_train)
print(gs.best_params_, gs.best_score_)

# Finally, report the accuracy of the best model on the test set
print('the accuracy of best model in test set is', gs.score(X_test, y_test))
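The double-underscore prefix mentioned in the comments applies when the estimator is a step inside a Pipeline. The following is a minimal sketch of that case; the step names "scaler" and "svc" and the candidate ranges are choices made for this example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# The step name "svc" becomes the prefix of the parameter keys below.
pipe = Pipeline([("scaler", StandardScaler()), ("svc", SVC(kernel="rbf"))])

param_grid = {
    "svc__gamma": np.logspace(-3, 1, 3),  # <step name>__<parameter name>
    "svc__C": np.logspace(-1, 1, 3),
}

gs = GridSearchCV(pipe, param_grid, cv=3)
gs.fit(X, y)
print(gs.best_params_)
```

With a bare estimator the keys are the plain parameter names; a prefixed key such as "svc__C" passed to a bare SVC is rejected as an invalid parameter.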

GridSearchCV parameters

So how do we use GridSearchCV with our own method? Let us first look at the GridSearchCV code above and go through its parameters in detail.

class sklearn.model_selection.GridSearchCV(estimator, param_grid, scoring=None,
fit_params=None, n_jobs=None, iid='warn', refit=True, cv='warn', verbose=0,
pre_dispatch='2*n_jobs', error_score='raise-deprecating', return_train_score='warn')

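estimator: the classifier to use, constructed with every parameter except the ones being searched. The estimator must either provide a score method or be accompanied by a scoring argument, e.g. estimator = RandomForestClassifier(min_samples_split=100, min_samples_leaf=20, max_depth=8, max_features='sqrt', random_state=10)
param_grid: the candidate values of the parameters to optimize, given as a dict or a list of dicts, e.g. param_test1 = {'n_estimators': range(10, 71, 10)} and then param_grid = param_test1
scoring = None: the model evaluation criterion. With the default None, the estimator's own score method is used. Otherwise pass a string naming a metric, such as scoring = 'roc_auc' (the appropriate metric depends on the model), or a callable with the signature scorer(estimator, X, y).
fit_params = None: extra parameters passed to the estimator's fit method. This argument exists only in older sklearn releases.
n_jobs = None: the number of parallel jobs. An int gives the number of jobs, -1 uses as many as there are CPU cores, and the default behaves like 1.
iid = True: when True (the historical default), the data is assumed to be identically distributed across folds, and the error estimate is aggregated over all samples rather than averaged across folds. This argument exists only in older sklearn releases.
refit = True: default True. After the search finishes, the estimator is fitted once more on the whole dataset passed to fit (training and validation folds together) using the best parameters found by cross-validation, and this refitted model is the one used for final evaluation.
cv = None: the cross-validation setting. With the default None, 3-fold cross-validation is used in the release this signature comes from (newer releases default to 5-fold). It can be an int giving the number of folds, or a generator that yields train/test splits.
verbose = 0: the verbosity of the log. 0: no output during training, 1: occasional output, >1: output for every sub-model.
pre_dispatch = '2*n_jobs': controls how many jobs are dispatched during parallel execution. When n_jobs is greater than 1, the data is copied for each dispatched job, which can cause out-of-memory errors; setting pre_dispatch limits the number of jobs dispatched in advance, so the data is copied at most pre_dispatch times.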

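Common GridSearchCV methods

grid.fit(): runs the grid search
best_params_: the parameter combination that achieved the best result
best_score_: the best score observed during the search
cv_results_: the cross-validation results for each parameter combination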

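GridSearchCV attributes

cv_results_ : dict of numpy (masked) ndarrays
A dict with keys as column headers and values as columns, which can be imported into a DataFrame. Note that the "params" key stores the list of parameter settings for all candidates.
best_estimator_ : estimator
The estimator chosen by the search, i.e. the one that gave the highest score (or the smallest loss, if specified) on the held-out data. Not available if refit = False.
best_score_ : float, the mean cross-validated score of best_estimator_
best_params_ : dict, the parameter setting that gave the best results on the held-out data
best_index_ : int, the index (into the cv_results_ arrays) of the best candidate parameter setting
The dict at search.cv_results_['params'][search.best_index_] gives the parameter setting of the best model, the one with the highest mean score (search.best_score_).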
scorer_ : function
Scorer function used on the held out data to choose the best parameters for the model.
n_splits_ : int
The number of cross-validation splits (folds/iterations).
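To make these attributes concrete, the sketch below loads cv_results_ into a pandas DataFrame and looks up the best row through best_index_. The SVC estimator and the small grid are assumptions made for the example.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
search = GridSearchCV(SVC(), {"C": [0.1, 1, 10], "gamma": [0.01, 0.1]}, cv=3)
search.fit(X, y)

# One row per candidate parameter setting.
results = pd.DataFrame(search.cv_results_)
columns = ["params", "mean_test_score", "std_test_score", "rank_test_score"]
print(results[columns].sort_values("rank_test_score"))

# best_index_ points at the row of the winning candidate.
best_row = results.loc[search.best_index_]
print(best_row["params"], best_row["mean_test_score"])
print(search.n_splits_)
```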

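Using GridSearchCV with a custom function

First, there are some rules to know:

All arguments of __init__ must have default values, so that the classifier can be created just by typing MyClassifier().
Do not validate the input parameters in the __init__ method! That belongs in fit().
Every argument of __init__ should be stored as an attribute of the created object under the same name.
Do not take data as an argument here! It should be passed to fit().
All estimators must have get_params and set_params functions. They are inherited when you subclass BaseEstimator, and in that case it is best not to override them, to avoid mistakes.
All of the fitting work should be done in fit(). Start by checking the parameters, including the ones being optimized, then process the input data. Any new attribute created in fit() must have a name ending with an underscore "_", for example self.fitted_. For compatibility with the common scikit-learn interface, fit() returns self, i.e. it ends with return self.
For GridSearch to work, we must provide a score() method. Why? GridSearch needs to decide whether one model is better than another; it looks directly at the result of score() and treats larger as better, so the evaluation metric must be a single number.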
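Because the search always maximizes the score, an error-type metric has to be negated before it is used. The sketch below shows one way to do that with make_scorer; the Ridge regressor, the diabetes dataset and the alpha candidates are assumptions chosen only to have a loss to minimize.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.metrics import make_scorer, mean_squared_error
from sklearn.model_selection import GridSearchCV

X, y = load_diabetes(return_X_y=True)

# greater_is_better=False makes the scorer return the negated error,
# so "larger is better" still holds from the search's point of view.
neg_mse = make_scorer(mean_squared_error, greater_is_better=False)

search = GridSearchCV(
    Ridge(), {"alpha": [0.01, 0.1, 1.0, 10.0]}, scoring=neg_mse, cv=3
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

The same idea applies to a hand-written score() method: if the natural metric is a loss, return its negative.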

Here is an example.

from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.metrics import make_scorer
from sklearn.model_selection import GridSearchCV


class MeanClassifier(BaseEstimator, ClassifierMixin):
    """An example of classifier"""

    def __init__(self, intValue=0, stringParam="defaultValue", otherParam=None):
        """Called when initializing the classifier"""
        self.intValue = intValue
        self.stringParam = stringParam
        # Parameters must be stored under the same name as the __init__ argument.
        # Storing otherParam as self.differentParam would be WRONG:
        # get_params() could no longer find it.
        self.otherParam = otherParam

    def fit(self, X, y=None):
        """This should fit classifier. All the "work" should be done here."""
        self.threshold_ = (sum(X) / len(X)) + self.intValue  # mean + intValue
        return self

    def _meaning(self, x):
        # returns True/False according to fitted classifier
        # notice underscore on the beginning
        return True if x >= self.threshold_ else False

    def predict(self, X, y=None):
        try:
            getattr(self, "threshold_")
        except AttributeError:
            raise RuntimeError("You must train classifier before predicting data!")
        return [self._meaning(x) for x in X]

    def score(self, X, y=None):
        # counts number of values bigger than mean
        return sum(self.predict(X))


X_train = [i for i in range(0, 100, 5)]
X_test = [i + 3 for i in range(20)]
tuned_params = {"intValue": [-10, -1, 0, 1, 10]}

gs = GridSearchCV(MeanClassifier(), tuned_params)

# for some reason I have to pass y with same shape
# otherwise gridsearch throws an error. Not sure why.
grid_result = gs.fit(X_train, y=[1 for i in range(20)])

print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
params = grid_result.cv_results_['params']
for mean, param in zip(means, params):
    print("%f  with:   %r" % (mean, param))

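In short, put your own classifier's work in fit() and return self at the end. If you do not inherit from BaseEstimator, the class must also define score(), get_params() and set_params() itself. Here is the fixed pattern for writing get_params() and set_params() by hand (for using GridSearchCV without inheritance):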

def get_params(self, deep=False):
    params = {'alpha': self.alpha, 'num_iters': self.num_iters}
    return params

def set_params(self, **parameters):
    for parameter, value in parameters.items():
        setattr(self, parameter, value)
    return self

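These can be copied as they are; only the params dictionary needs to be adapted, and nothing else should be changed. Define score() according to the evaluation metric of your own function, and remember that it must return a number.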
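To see why these two methods are enough, the sketch below runs sklearn's clone on a class that does not inherit from BaseEstimator. PlainEstimator is a hypothetical class written for this illustration; clone followed by set_params mirrors, in simplified form, what the search does to prepare each candidate.

```python
from sklearn.base import clone


class PlainEstimator:
    """Hypothetical estimator that does not inherit from BaseEstimator."""

    def __init__(self, alpha=0.01, num_iters=100):
        self.alpha = alpha
        self.num_iters = num_iters

    def get_params(self, deep=False):
        return {"alpha": self.alpha, "num_iters": self.num_iters}

    def set_params(self, **parameters):
        for parameter, value in parameters.items():
            setattr(self, parameter, value)
        return self


original = PlainEstimator(alpha=0.5)

# clone() builds a fresh instance from get_params(); the search then
# applies one candidate setting to it with set_params().
cloned = clone(original)
cloned.set_params(num_iters=10)

print(cloned.get_params())
print(original.get_params())
```

The original object is left untouched, which is why every __init__ argument has to be recoverable through get_params under its own name.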
Finally, here is the code I use in my own function:

import sys
from src.parameters_optimization_svdd import SVDD
#from src.single_svdd import SVDD
from src.visualize import Visualization as draw
import pandas as pd
import numpy as np
import src.tool as to
import cross_validation as cr
import matplotlib.pyplot as plt
from matplotlib.pyplot import MultipleLocator
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split

import warnings
warnings.filterwarnings("ignore")
sys.path.append("..")


class GC_SVDD():
    """SVDD wrapper exposing the interface GridSearchCV needs."""

    def __init__(self, alpha=0.01, positive_penalty=0.1, negative_penalty=0.1):
        self.alpha = alpha
        self.positive_penalty = positive_penalty
        self.negative_penalty = negative_penalty

    def fit(self, train_data, train_label):
        """
        SVDD based on granular computing
        :return:
        """
        # hold out 10% of the training fold to measure accuracy
        t_data, te_data, t_label, te_label = train_test_split(train_data, train_label, test_size=0.1, random_state=1)
        # gauss width
        temp_gauss_width = self.alpha
        # set SVDD parameters
        parameters = {"positive penalty": self.positive_penalty,
                      "negative penalty": self.negative_penalty,
                      "kernel": {"type": 'gauss', "width": temp_gauss_width},
                      "option": {"display": 'on'}}
        # construct an SVDD model
        svdd = SVDD(parameters)
        # train SVDD model
        para_list, _ = svdd.train(t_data, t_label)
        _, self.accuracy_ = svdd.test(te_data, te_label, para_list)
        return self

    def score(self, test_data, test_label):
        # returns the accuracy measured on the hold-out split inside fit();
        # the test_data passed in by GridSearchCV is not used
        return self.accuracy_

    def get_params(self, deep=False):
        params = {'alpha': self.alpha,
                  'positive_penalty': self.positive_penalty,
                  'negative_penalty': self.negative_penalty}
        return params

    def set_params(self, **parameters):
        for parameter, value in parameters.items():
            setattr(self, parameter, value)
        return self


if __name__ == '__main__':
    data_mat_list = ['iris.mat.csv']
    for i in range(len(data_mat_list)):
        np.random.seed(1)
        temp_path = r".\\data\\sortData\\"
        temp_data_path = temp_path + data_mat_list[i]
        data = pd.read_csv(temp_data_path, header=None)
        trainData, trainLabel = to.load_datasets_excel(data)
        temp_k = 10
        temp_train_index, temp_test_index = cr.cross_validation(trainData.shape[0], temp_k)
        clf = GC_SVDD()
        parameters = {'alpha': np.linspace(1, 10, 3),
                      'negative_penalty': np.linspace(0.1, 1, 2),
                      'positive_penalty': np.linspace(0.1, 1, 2)}
        gs = GridSearchCV(clf, parameters, cv=10)
        grid_result = gs.fit(trainData, trainLabel)
        print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
        # means = grid_result.cv_results_['mean_test_score']
        # params = grid_result.cv_results_['params']
        # for mean, param in zip(means, params):
        #     print("%f  with:   %r" % (mean, param))

