Machine Learning: Cross-Validation and Grid Search for Hyperparameter Tuning
Grid search is generally used to find good hyperparameter values, while cross-validation is used to assess how well the trained model fits. The relevant sklearn APIs are covered below.
(1) The first step for cross-validation: splitting the data into train/validation/test sets
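As a minimal sketch of this first step (the two-stage split and the variable names X_trainval/X_valid are my own, chosen to match the later sections), train_test_split can be called twice:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
# First split off the test set (default test_size=0.25),
# then carve a validation set out of the remainder.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    iris.data, iris.target, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(
    X_trainval, y_trainval, random_state=1)
print(X_train.shape, X_valid.shape, X_test.shape)  # (84, 4) (28, 4) (38, 4)
```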
A.) No splitting strategy specified: call cross_val_score directly and let it evaluate with its default splitting, as follows
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()
logreg = LogisticRegression()
scores = cross_val_score(logreg, iris.data, iris.target, cv=5)
# cv defaulted to 3 in older sklearn versions (5 since 0.22); when no splitter
# object is given, cross-validation is performed directly on the data passed in.
scores
# Output:
# array([ 1.        ,  0.96666667,  0.93333333,  0.9       ,  1.        ])
B.) K-fold cross-validation: KFold
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

iris = load_iris()
logreg = LogisticRegression()
kfold = KFold(n_splits=5)
cross_val_score(logreg, iris.data, iris.target, cv=kfold)
# Output:
# array([ 1.        ,  0.93333333,  0.43333333,  0.96666667,  0.43333333])
# The poor folds appear because the iris samples are ordered by class, so
# unshuffled folds can be dominated by classes the model barely trained on.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

iris = load_iris()
logreg = LogisticRegression()
kfold = KFold(n_splits=3, shuffle=True, random_state=0)
# shuffle=True randomly permutes the samples before making the k-fold split
cross_val_score(logreg, iris.data, iris.target, cv=kfold)
# Output:
# array([ 0.9 ,  0.96,  0.96])
C.) Leave-one-out cross-validation: LeaveOneOut (rarely used in industrial practice)
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

iris = load_iris()
logreg = LogisticRegression()
loo = LeaveOneOut()
scores = cross_val_score(logreg, iris.data, iris.target, cv=loo)
# The cv parameter determines the cross-validation splitting strategy;
# here it is an object used as a cross-validation generator.
print("number of cv iterations: ", len(scores))
print("mean accuracy: ", scores.mean())
# Output:
# number of cv iterations:  150
# mean accuracy:  0.953333333333
D.) Shuffled-split cross-validation: ShuffleSplit
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score

iris = load_iris()
logreg = LogisticRegression()
# Random splits do not guarantee that all folds will be different,
# although this is still very likely for sizeable datasets.
shuffle_split = ShuffleSplit(test_size=.5, train_size=.5, n_splits=10, random_state=0)
cross_val_score(logreg, iris.data, iris.target, cv=shuffle_split)
# Output:
# array([ 0.84      ,  0.93333333,  0.90666667,  1.        ,  0.90666667,
#         0.93333333,  0.94666667,  1.        ,  0.90666667,  0.88      ])
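ShuffleSplit also has a stratified counterpart that the original post does not cover; as a sketch (my own example), StratifiedShuffleSplit keeps the class proportions of the full dataset in every random split:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedShuffleSplit, cross_val_score

iris = load_iris()
logreg = LogisticRegression(max_iter=1000)
# Same 10 random 50/50 splits as above, but each split preserves
# the 1/3-1/3-1/3 class balance of iris.
strat_split = StratifiedShuffleSplit(n_splits=10, test_size=0.5, random_state=0)
scores = cross_val_score(logreg, iris.data, iris.target, cv=strat_split)
print(scores.mean())
```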
E.) Group-wise cross-validation: GroupKFold
The same group will not appear in two different folds (the number of distinct groups has to be at least equal to the number of folds).
import numpy as np
from sklearn.model_selection import GroupKFold

X = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
y = np.array([1, 2, 3, 4])
groups = np.array([0, 0, 2, 2])
gkf = GroupKFold(n_splits=2)
for train_index, test_index in gkf.split(X, y, groups):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    print('X_train:', X_train)
    print('X_test:', X_test)
    print('y_train:', y_train)
    print('y_test:', y_test)
Output:
TRAIN: [0 1] TEST: [2 3]
X_train: [[1 2]
 [3 4]]
X_test: [[5 6]
 [7 8]]
y_train: [1 2]
y_test: [3 4]
TRAIN: [2 3] TEST: [0 1]
X_train: [[5 6]
 [7 8]]
X_test: [[1 2]
 [3 4]]
y_train: [3 4]
y_test: [1 2]
F.) Stratified splitting by class label: StratifiedKFold
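The post gives no code for this section, so here is a minimal sketch (my own example): StratifiedKFold keeps the class proportions of the full dataset in every fold, which is why the unshuffled-KFold problem shown in section B does not occur here.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

iris = load_iris()
logreg = LogisticRegression(max_iter=1000)
# Each fold contains roughly equal numbers of all 3 iris classes,
# even though the samples in iris.data are ordered by class.
skf = StratifiedKFold(n_splits=5)
scores = cross_val_score(logreg, iris.data, iris.target, cv=skf)
print(scores)
```

This is also the splitter cross_val_score uses by default when the estimator is a classifier and cv is given as an integer.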
(2.) Hyperparameter tuning, i.e. grid search with cross-validation
a.) The simplest grid search: two nested for loops
# naive grid search implementation
from sklearn.datasets import load_iris
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=0)
print("Size of training set: %d  size of test set: %d" % (X_train.shape[0], X_test.shape[0]))
best_score = 0
for gamma in [0.001, 0.01, 0.1, 1, 10, 100]:
    for C in [0.001, 0.01, 0.1, 1, 10, 100]:
        # for each combination of parameters, train an SVC
        svm = SVC(gamma=gamma, C=C)
        svm.fit(X_train, y_train)
        # evaluate the SVC on the test set
        score = svm.score(X_test, y_test)
        # if we got a better score, store the score and parameters
        if score > best_score:
            best_score = score
            best_parameters = {'C': C, 'gamma': gamma}
print("best score: ", best_score)
print("best parameters: ", best_parameters)
# Output:
# Size of training set: 112  size of test set: 38
# best score:  0.973684210526
# best parameters:  {'gamma': 0.001, 'C': 100}
Split off part of the training set as a validation set for evaluating candidate parameters, so the test set is never used for model selection and we avoid overfitting to it:
from sklearn.datasets import load_iris
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

iris = load_iris()
# full data -> train+validation set and test set
X_trainval, X_test, y_trainval, y_test = train_test_split(iris.data, iris.target, random_state=0)
# train+validation set -> training set and validation set
X_train, X_valid, y_train, y_valid = train_test_split(X_trainval, y_trainval, random_state=1)
print("Size of training set: %d  size of validation set: %d  size of test set: %d"
      % (X_train.shape[0], X_valid.shape[0], X_test.shape[0]))
best_score = 0
for gamma in [0.001, 0.01, 0.1, 1, 10, 100]:
    for C in [0.001, 0.01, 0.1, 1, 10, 100]:
        svm = SVC(gamma=gamma, C=C)
        svm.fit(X_train, y_train)
        # evaluate on the validation set, not the test set
        score = svm.score(X_valid, y_valid)
        if score > best_score:
            best_score = score
            best_parameters = {'C': C, 'gamma': gamma}
# rebuild a model on the combined training and validation set
svm = SVC(**best_parameters)
svm.fit(X_trainval, y_trainval)
print("best score on validation set: ", best_score)
print("best parameters: ", best_parameters)
print("test set score with best parameters: ", svm.score(X_test, y_test))
b.) Grid search with cross-validation nested inside
import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score, train_test_split

iris = load_iris()
# full data -> train+validation set and test set
X_trainval, X_test, y_trainval, y_test = train_test_split(iris.data, iris.target, random_state=0)
# train+validation set -> training set and validation set
X_train, X_valid, y_train, y_valid = train_test_split(X_trainval, y_trainval, random_state=1)
best_score = 0
for gamma in [0.001, 0.01, 0.1, 1, 10, 100]:
    for C in [0.001, 0.01, 0.1, 1, 10, 100]:
        svm = SVC(gamma=gamma, C=C)
        # cross-validate on the combined training and validation data
        scores = cross_val_score(svm, X_trainval, y_trainval, cv=5)
        # compute mean cross-validation accuracy
        score = np.mean(scores)
        if score > best_score:
            best_score = score
            best_parameters = {'C': C, 'gamma': gamma}
print('Best parameter combination from the for-loop grid search with cross_val_score:', best_parameters)
# rebuild a model on the combined training and validation set
svmf = SVC(**best_parameters)
svmf.fit(X_trainval, y_trainval)
print('Best estimator (grid search with CV), score on the train+validation set without cross-validation:', svmf.score(X_trainval, y_trainval))
scores = cross_val_score(svmf, X_trainval, y_trainval, cv=5)
print('Best estimator (grid search with CV), mean cross-validation score on the train+validation set:', np.mean(scores))
print('Best estimator (grid search with CV), score on the test set:', svmf.score(X_test, y_test))
Output:
Best parameter combination from the for-loop grid search with cross_val_score: {'gamma': 0.01, 'C': 100}
Best estimator (grid search with CV), score on the train+validation set without cross-validation: 0.982142857143
Best estimator (grid search with CV), mean cross-validation score on the train+validation set: 0.972689629211
Best estimator (grid search with CV), score on the test set: 0.973684210526
c.) Build a parameter dictionary and let GridSearchCV replace the double for loop
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

iris = load_iris()
param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100],
              'gamma': [0.001, 0.01, 0.1, 1, 10, 100]}
X_trainvalid, X_test, y_trainvalid, y_test = train_test_split(iris.data, iris.target, random_state=0)  # default test_size=0.25
grid_search = GridSearchCV(SVC(), param_grid, cv=5)  # grid search + cross-validation
grid_search.fit(X_trainvalid, y_trainvalid)
print('Best parameter combination found by GridSearchCV:', grid_search.best_params_)
print('GridSearchCV best estimator, score on the train+validation set without cross-validation:', grid_search.score(X_trainvalid, y_trainvalid))
print('GridSearchCV best estimator, mean cross-validated score on the train+validation set (best_score_):', grid_search.best_score_)
# print('BEST_ESTIMATOR:', grid_search.best_estimator_)  # the estimator that achieved the best score
print('GridSearchCV best estimator, score on the test set:', grid_search.score(X_test, y_test))
Output:
Best parameter combination found by GridSearchCV: {'gamma': 0.01, 'C': 100}
GridSearchCV best estimator, score on the train+validation set without cross-validation: 0.982142857143
GridSearchCV best estimator, mean cross-validated score on the train+validation set (best_score_): 0.973214285714
GridSearchCV best estimator, score on the test set: 0.973684210526
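Beyond best_params_ and best_score_, GridSearchCV records the per-combination scores in its cv_results_ attribute; a sketch of inspecting them with pandas (the column selection is my own):

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

iris = load_iris()
X_trainvalid, X_test, y_trainvalid, y_test = train_test_split(
    iris.data, iris.target, random_state=0)
param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100],
              'gamma': [0.001, 0.01, 0.1, 1, 10, 100]}
grid_search = GridSearchCV(SVC(), param_grid, cv=5)
grid_search.fit(X_trainvalid, y_trainvalid)
# One row per parameter combination, 36 rows for this 6x6 grid.
results = pd.DataFrame(grid_search.cv_results_)
cols = ['param_C', 'param_gamma', 'mean_test_score', 'rank_test_score']
print(results[cols].sort_values('rank_test_score').head())
```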
d.) Nested cross-validation: parameter dictionary + cross_val_score
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

iris = load_iris()
param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100],
              'gamma': [0.001, 0.01, 0.1, 1, 10, 100]}
# The outer cross_val_score evaluates the whole tuning procedure: for each
# outer split, GridSearchCV runs its own inner 5-fold search over param_grid.
scores = cross_val_score(GridSearchCV(SVC(), param_grid, cv=5), iris.data, iris.target, cv=5)
print("Cross-validation scores: ", scores)
print("Mean cross-validation score: ", scores.mean())
# Output:
# Cross-validation scores:  [ 0.96666667  1.          0.96666667  0.96666667  1.        ]
# Mean cross-validation score:  0.98
import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import SVC
from sklearn.model_selection import ParameterGrid, StratifiedKFold

def nested_cv(X, y, inner_cv, outer_cv, Classifier, parameter_grid):
    outer_scores = []
    # for each split of the data in the outer cross-validation
    # (split method returns indices)
    for training_samples, test_samples in outer_cv.split(X, y):
        # find best parameters using the inner cross-validation
        best_params = {}
        best_score = -np.inf
        # iterate over parameter combinations
        for parameters in parameter_grid:
            # accumulate scores over inner splits
            cv_scores = []
            # iterate over inner cross-validation
            for inner_train, inner_test in inner_cv.split(X[training_samples], y[training_samples]):
                # build classifier given parameters and inner training data
                clf = Classifier(**parameters)
                clf.fit(X[inner_train], y[inner_train])
                # evaluate on inner test set
                score = clf.score(X[inner_test], y[inner_test])
                cv_scores.append(score)
            # compute mean score over inner folds
            mean_score = np.mean(cv_scores)
            if mean_score > best_score:
                # if better than so far, remember parameters
                best_score = mean_score
                best_params = parameters
        # build classifier on best parameters using outer training set
        clf = Classifier(**best_params)
        clf.fit(X[training_samples], y[training_samples])
        # evaluate on the outer test set
        outer_scores.append(clf.score(X[test_samples], y[test_samples]))
    return outer_scores

param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100],
              'gamma': [0.001, 0.01, 0.1, 1, 10, 100]}
# http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.ParameterGrid.html#sklearn.model_selection.ParameterGrid
# http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html#sklearn.model_selection.StratifiedKFold
# ParameterGrid iterates over the combinations in the parameter dictionary;
# StratifiedKFold splits the data while preserving class proportions.
iris = load_iris()
nested_cv(iris.data, iris.target, StratifiedKFold(5), StratifiedKFold(5), SVC, ParameterGrid(param_grid))
# Output:
# [0.96666666666666667, 1.0, 0.96666666666666667, 0.96666666666666667, 1.0]