Model selection: choosing estimators and their parameters

Score, and cross-validated scores
Cross-validation generators
Grid-search and cross-validated estimators
- Grid-search
- - Nested cross-validation
- Cross-validated estimators
References

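I needed to help a friend with an assignment and had never studied this before, so I simply went through the official tutorial once and took notes that I can look up later.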

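Score, and cross-validated scores

Every estimator has a score method that judges the quality of the fit, or of the predictions on new data. The method returns the model's score, and naturally: Bigger is better.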

from sklearn import datasets, svm
X_digits, y_digits = datasets.load_digits(return_X_y=True)
svc = svm.SVC(C=1, kernel='linear')
svc.fit(X_digits[:-100], y_digits[:-100]).score(X_digits[-100:], y_digits[-100:])
>>>0.98
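
To make it concrete what the number returned by score means: for classifiers, score returns the mean accuracy on the data it is given. The sketch below checks this against accuracy_score from sklearn.metrics. The variable names are my own, and the split is the same last-100-samples split used above.

```python
from sklearn import datasets, svm
from sklearn.metrics import accuracy_score

X_digits, y_digits = datasets.load_digits(return_X_y=True)

# Same hand-picked split as above: train on all but the last 100 samples.
X_train, y_train = X_digits[:-100], y_digits[:-100]
X_test, y_test = X_digits[-100:], y_digits[-100:]

svc = svm.SVC(C=1, kernel="linear").fit(X_train, y_train)

# Score from the estimator's own method.
score_from_method = svc.score(X_test, y_test)
# The same quantity computed by hand: the fraction of correct predictions.
score_from_metric = accuracy_score(y_test, svc.predict(X_test))

print(score_from_method, score_from_metric)
```

For regressors, score returns the R^2 coefficient instead, so "bigger is better" holds there too.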

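A score measured on one hand-picked training set and validation set does not necessarily show how good the model really is. The model may just happen to do well on that particular split and do worse on another one. To get a better measure, we can split the data into several folds and score the model on each fold in turn, training on the remaining folds each time.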

import numpy as np
X_folds = np.array_split(X_digits, 3)
y_folds = np.array_split(y_digits, 3)
scores = list()
for k in range(3):
    # We use 'list' to copy, in order to 'pop' later on
    X_train = list(X_folds)
    X_test = X_train.pop(k)
    X_train = np.concatenate(X_train)
    y_train = list(y_folds)
    y_test = y_train.pop(k)
    y_train = np.concatenate(y_train)
    scores.append(svc.fit(X_train, y_train).score(X_test, y_test))
print(scores)
>>>[0.934..., 0.956..., 0.939...]

This is called KFold cross-validation.
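
The manual loop above can be wrapped into a small helper. This is only a sketch: manual_kfold_scores is a hypothetical name of mine, not a scikit-learn function. It splits indices rather than the arrays themselves, and it uses clone so that the estimator passed in is left untouched.

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.base import clone


def manual_kfold_scores(estimator, X, y, n_folds=3):
    """Score a fresh copy of `estimator` on each of n_folds contiguous folds."""
    index_folds = np.array_split(np.arange(len(X)), n_folds)
    scores = []
    for k in range(n_folds):
        test_idx = index_folds[k]
        # Every fold except the k-th one is used for training.
        train_idx = np.concatenate(
            [fold for i, fold in enumerate(index_folds) if i != k]
        )
        model = clone(estimator).fit(X[train_idx], y[train_idx])
        scores.append(model.score(X[test_idx], y[test_idx]))
    return np.array(scores)


X_digits, y_digits = datasets.load_digits(return_X_y=True)
fold_scores = manual_kfold_scores(svm.SVC(C=1, kernel="linear"), X_digits, y_digits)
print("mean %.3f, std %.3f" % (fold_scores.mean(), fold_scores.std()))
```

Reporting the mean together with the spread across folds gives a better picture than any single split.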

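Cross-validation generators

The cross-validation generators in sklearn all have a split method, which generates the indices of the training set and test set samples for you automatically.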

from sklearn.model_selection import KFold, cross_val_score
X = ["a", "a", "a", "b", "b", "c", "c", "c", "c", "c"]
k_fold = KFold(n_splits=5)
for train_indices, test_indices in k_fold.split(X):
    print('Train: %s | test: %s' % (train_indices, test_indices))
Train: [2 3 4 5 6 7 8 9] | test: [0 1]
Train: [0 1 4 5 6 7 8 9] | test: [2 3]
Train: [0 1 2 3 6 7 8 9] | test: [4 5]
Train: [0 1 2 3 4 5 8 9] | test: [6 7]
Train: [0 1 2 3 4 5 6 7] | test: [8 9]

With that, computing the cross-validation scores is easy:

[svc.fit(X_digits[train], y_digits[train]).score(X_digits[test], y_digits[test])
 for train, test in k_fold.split(X_digits)]
[0.963..., 0.922..., 0.963..., 0.963..., 0.930...]

Of course, they can also be computed directly:

cross_val_score(svc, X_digits, y_digits, cv=k_fold, n_jobs=-1)
>>> array([0.96388889, 0.92222222, 0.9637883 , 0.9637883 , 0.93036212])

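Here n_jobs=-1 means that the computation is dispatched across all the CPUs of the machine.

If you want to dig deeper into the other model evaluation tools, they all live in the metrics module.
That said, a scoring method can be chosen simply by name, since it has all been wrapped up for you. Just use the scoring parameter.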

cross_val_score(svc, X_digits, y_digits, cv=k_fold,scoring='precision_macro')
>>> array([0.96578289, 0.92708922, 0.96681476, 0.96362897, 0.93192644])
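
If several named scores are wanted at once, cross_validate from sklearn.model_selection accepts a list of scorer names. The sketch below is my own addition, not part of the tutorial section above. The three scorer names are built-in scikit-learn scoring strings, and the 5-fold KFold matches the k_fold object used earlier.

```python
from sklearn import datasets, svm
from sklearn.model_selection import KFold, cross_validate

X_digits, y_digits = datasets.load_digits(return_X_y=True)
svc = svm.SVC(C=1, kernel="linear")

metric_names = ["accuracy", "precision_macro", "recall_macro"]
results = cross_validate(
    svc, X_digits, y_digits, cv=KFold(n_splits=5), scoring=metric_names
)

# Each metric shows up under the key "test_<name>", with one value per fold.
for name in metric_names:
    fold_values = results["test_" + name]
    print("%s: mean %.3f over %d folds" % (name, fold_values.mean(), len(fold_values)))
```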

The official tutorial shows many more cross-validation generators to play with.
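
As a rough illustration of how a few of these generators differ, the sketch below prints the test indices produced by KFold, StratifiedKFold and ShuffleSplit. The toy data (12 samples, two balanced classes) and the choice of these three splitters are my own assumptions for the example.

```python
import numpy as np
from sklearn.model_selection import KFold, ShuffleSplit, StratifiedKFold

# Toy data: 12 samples, the first six in class 0 and the last six in class 1.
X = np.arange(12).reshape(-1, 1)
y = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1])

splitters = {
    "KFold": KFold(n_splits=3),
    "StratifiedKFold": StratifiedKFold(n_splits=3),
    "ShuffleSplit": ShuffleSplit(n_splits=3, test_size=4, random_state=0),
}

test_sets = {}
for name, splitter in splitters.items():
    # All splitters share the same split(X, y) interface.
    test_sets[name] = [test.tolist() for _, test in splitter.split(X, y)]
    print(name)
    for fold in test_sets[name]:
        print("  test indices:", fold, "labels:", y[fold].tolist())
```

With sorted labels like these, plain KFold yields test folds dominated by a single class, while StratifiedKFold keeps both classes balanced in every fold.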

Then there is a small exercise script to play with:

print(__doc__)

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn import datasets, svm

X, y = datasets.load_digits(return_X_y=True)

svc = svm.SVC(kernel='linear')
C_s = np.logspace(-10, 0, 10)

scores = list()
scores_std = list()
for C in C_s:
    svc.C = C
    this_scores = cross_val_score(svc, X, y, n_jobs=1)
    scores.append(np.mean(this_scores))
    scores_std.append(np.std(this_scores))

# Do the plotting
import matplotlib.pyplot as plt
plt.figure()
plt.semilogx(C_s, scores)
plt.semilogx(C_s, np.array(scores) + np.array(scores_std), 'b--')
plt.semilogx(C_s, np.array(scores) - np.array(scores_std), 'b--')
locs, labels = plt.yticks()
plt.yticks(locs, list(map(lambda x: "%g" % x, locs)))
plt.ylabel('CV score')
plt.xlabel('Parameter C')
plt.ylim(0, 1.1)
plt.show()

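The resulting figure plots the CV score against the parameter C, with dashed lines one standard deviation above and below.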

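Grid-search and cross-validated estimators

Grid-search

In short, grid-search finds, during training, the hyperparameter values that give the highest cross-validation score. Very convenient: you only provide the data and the estimator object, plus the grid of values to try.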

>>> from sklearn.model_selection import GridSearchCV, cross_val_score
>>> Cs = np.logspace(-6, -1, 10)
>>> clf = GridSearchCV(estimator=svc, param_grid=dict(C=Cs),
...                    n_jobs=-1)
>>> clf.fit(X_digits[:1000], y_digits[:1000])
GridSearchCV(cv=None,...
>>> clf.best_score_
0.925...
>>> clf.best_estimator_.C
0.0077...
>>> # Prediction performance on test set is not as good as on train set
>>> clf.score(X_digits[1000:], y_digits[1000:])
0.943...

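The default number of folds in GridSearchCV depends on the version: 3-fold cross-validation in older scikit-learn releases, 5-fold from version 0.22 on. Also, if you pass in a classifier rather than a regressor, it uses stratified K-fold cross-validation.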
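
Since the default depends on the version, one way to make a result reproducible is to pass the cv argument explicitly. The sketch below is my own: it pins a stratified 3-fold split. The grid of C values is the one from the example above; everything else is an assumption made for illustration.

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV, StratifiedKFold

X_digits, y_digits = datasets.load_digits(return_X_y=True)
Cs = np.logspace(-6, -1, 10)

# Spell out the splitter instead of relying on the version-dependent default.
cv = StratifiedKFold(n_splits=3)
clf = GridSearchCV(svm.SVC(kernel="linear"), param_grid={"C": Cs}, cv=cv)
clf.fit(X_digits[:1000], y_digits[:1000])

print(clf.best_params_, clf.best_score_)

# cv_results_ holds one "split<i>_test_score" column per fold.
n_fold_columns = sum(
    key.startswith("split") and key.endswith("_test_score")
    for key in clf.cv_results_
)
print("folds used:", n_fold_columns)
```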

Nested cross-validation

cross_val_score(clf, X_digits, y_digits)
array([0.938..., 0.963..., 0.944...])

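Essentially there are two loops. The inner one, run by GridSearchCV, loops over the parameter values and picks the one with the highest cross-validation score. The outer one, run by cross_val_score, measures how the tuned estimator performs on data that played no part in picking the parameter. The resulting scores are unbiased estimates of the prediction score on new data. That sentence is the interesting part: it does not mean the model was fitted perfectly on the training set. It means each outer test fold was held out from the parameter selection, so the scores are not inflated by the tuning.
Warning: You cannot nest objects with parallel computing (n_jobs different than 1).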
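
To see the two loops side by side, here is a minimal sketch that writes the outer loop out by hand. It is not the tutorial's code: the 3-fold KFold outer split and the shortened grid of four C values are my own assumptions, chosen to keep the run short, and n_jobs is left at its default in line with the warning above.

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV, KFold

X, y = datasets.load_digits(return_X_y=True)
param_grid = {"C": np.logspace(-4, -1, 4)}  # assumed, shortened grid

outer_cv = KFold(n_splits=3)
outer_scores, chosen_Cs = [], []
for train_idx, test_idx in outer_cv.split(X):
    # Inner loop: choose C using only the outer training part.
    search = GridSearchCV(svm.SVC(kernel="linear"), param_grid, cv=3)
    search.fit(X[train_idx], y[train_idx])
    chosen_Cs.append(search.best_params_["C"])
    # Outer loop: score the refitted best model on data the search never saw.
    outer_scores.append(search.score(X[test_idx], y[test_idx]))

print("chosen C per outer fold:", chosen_Cs)
print("outer scores:", outer_scores)
```

The C picked by the inner search may differ from one outer fold to the next. The outer scores estimate the performance of the whole "tune, then fit" procedure rather than of one fixed C.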

Cross-validated estimators

Tuning parameters can actually be quite efficient, because for certain estimators, scikit-learn exposes cross-validated estimators that set their parameter automatically by cross-validation. With these estimators the hyperparameter gets set for you.

>>> from sklearn import linear_model, datasets
>>> lasso = linear_model.LassoCV()
>>> X_diabetes, y_diabetes = datasets.load_diabetes(return_X_y=True)
>>> lasso.fit(X_diabetes, y_diabetes)
LassoCV()
>>> # The estimator chose automatically its lambda:
>>> lasso.alpha_
0.00375...

These cross-validated estimators are named like the models they correspond to, with "CV" appended to the name.
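
The naming pattern carries over to other linear models. As a sketch, here is RidgeCV, the cross-validated counterpart of Ridge, on the same diabetes data. The grid of candidate alphas is an arbitrary choice of mine, not a recommended setting.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import RidgeCV

X_diabetes, y_diabetes = datasets.load_diabetes(return_X_y=True)

# Candidate regularization strengths (an assumed grid, for illustration only).
alphas = np.logspace(-3, 1, 20)

ridge = RidgeCV(alphas=alphas).fit(X_diabetes, y_diabetes)

# Like LassoCV, the fitted estimator exposes the selected value as alpha_.
print(ridge.alpha_)
```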
Below is one of the official exercise scripts:

print(__doc__)

import numpy as np
import matplotlib.pyplot as plt

from sklearn import datasets
from sklearn.linear_model import LassoCV
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV

X, y = datasets.load_diabetes(return_X_y=True)
X = X[:150]
y = y[:150]

lasso = Lasso(random_state=0, max_iter=10000)
alphas = np.logspace(-4, -0.5, 30)

tuned_parameters = [{'alpha': alphas}]
n_folds = 5

clf = GridSearchCV(lasso, tuned_parameters, cv=n_folds, refit=False)
clf.fit(X, y)
scores = clf.cv_results_['mean_test_score']
scores_std = clf.cv_results_['std_test_score']
plt.figure().set_size_inches(8, 6)
plt.semilogx(alphas, scores)

# plot error lines showing +/- std. errors of the scores
std_error = scores_std / np.sqrt(n_folds)

plt.semilogx(alphas, scores + std_error, 'b--')
plt.semilogx(alphas, scores - std_error, 'b--')

# alpha=0.2 controls the translucency of the fill color
plt.fill_between(alphas, scores + std_error, scores - std_error, alpha=0.2)

plt.ylabel('CV score +/- std error')
plt.xlabel('alpha')
plt.axhline(np.max(scores), linestyle='--', color='.5')
plt.xlim([alphas[0], alphas[-1]])

# #############################################################################
# Bonus: how much can you trust the selection of alpha?

# To answer this question we use the LassoCV object that sets its alpha
# parameter automatically from the data by internal cross-validation (i.e. it
# performs cross-validation on the training data it receives).
# We use external cross-validation to see how much the automatically obtained
# alphas differ across different cross-validation folds.
lasso_cv = LassoCV(alphas=alphas, random_state=0, max_iter=10000)
k_fold = KFold(3)

print("Answer to the bonus question:",
      "how much can you trust the selection of alpha?")
print()
print("Alpha parameters maximising the generalization score on different")
print("subsets of the data:")
for k, (train, test) in enumerate(k_fold.split(X, y)):
    lasso_cv.fit(X[train], y[train])
    print("[fold {0}] alpha: {1:.5f}, score: {2:.5f}".
          format(k, lasso_cv.alpha_, lasso_cv.score(X[test], y[test])))
print()
print("Answer: Not very much since we obtained different alphas for different")
print("subsets of the data and moreover, the scores for these alphas differ")
print("quite substantially.")

plt.show()

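The resulting figure plots the CV score against alpha, with a shaded band of +/- one standard error around it.

References

The official scikit-learn tutorial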
