Python Machine Learning 2nd Edition by Sebastian Raschka, Packt Publishing Ltd. 2017

Code Repository: https://github.com/rasbt/python-machine-learning-book-2nd-edition

Code License: MIT License

Table of Contents

  • Learning with ensembles
  • Combining classifiers via majority vote
    • Implementing a simple majority vote classifier
    • Using the majority voting principle to make predictions
  • Evaluating and tuning the ensemble classifier
  • Bagging -- Building an ensemble of classifiers from bootstrap samples
    • Bagging in a nutshell
    • Applying bagging to classify samples in the Wine dataset
  • Leveraging weak learners via adaptive boosting
    • How boosting works
    • Applying AdaBoost using scikit-learn
  • Summary

Note that the optional watermark extension is a small IPython notebook plugin that I developed to make the code reproducible. You can just skip the following line(s).

%load_ext watermark
%watermark -a "Sebastian Raschka" -u -d -v -p numpy,pandas,matplotlib,scipy,sklearn
Sebastian Raschka
last updated: 2017-07-22 CPython 3.6.1
IPython 6.0.0
numpy 1.13.1
pandas 0.20.2
matplotlib 2.0.2
scipy 0.19.1
sklearn 0.19b2

The use of watermark is optional. You can install this IPython extension via “pip install watermark”. For more information, please see: https://github.com/rasbt/watermark.

Chapter 7. Combining Different Models for Ensemble Learning

In the previous chapter, we focused on best practices for tuning and evaluating different classification models. In this chapter, we will build on those techniques and explore different ways of constructing ensembles of classifiers, which often have better predictive performance than any of their individual members. We will learn how to do the following:

  • Make predictions based on majority voting
  • Use bagging to reduce overfitting by drawing random combinations of the training set with repetition
  • Apply boosting to build powerful models from weak learners that learn from their mistakes

Learning with ensembles

The goal of ensemble methods is to combine different classifiers into a meta-classifier that has better generalization performance than each individual classifier alone. For example, assuming that we collected predictions from 10 experts, ensemble methods would allow us to strategically combine those 10 predictions to come up with a prediction that is more accurate and robust than the prediction of any single expert. As we will see later in this chapter, there are several different approaches for creating an ensemble of classifiers. In this section, we will build a basic intuition for how ensembles work and why they are typically recognized for yielding good generalization performance.

In this chapter, we will focus on the most popular ensemble methods, which use the majority voting principle. Majority voting simply means that we select the class label that has been predicted by the majority of classifiers, that is, the label that received more than 50 percent of the votes. Strictly speaking, the term majority vote refers to binary class settings only. However, it is easy to generalize the majority voting principle to multi-class settings, which is known as plurality voting. Here, we select the class label that received the most votes (the mode). The following figure shows an ensemble of 10 classifiers, where each unique symbol (triangle, square, and circle) represents a unique class label.
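In equation form (this compact notation does not appear in the reblogged text, but it is the standard way to express plurality voting for classifiers C1, ..., Cm), the predicted label is simply the mode of the individual predictions:

$$\hat{y} = \operatorname{mode}\{C_1(\boldsymbol{x}),\, C_2(\boldsymbol{x}),\, \dots,\, C_m(\boldsymbol{x})\}$$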

Using the training set, we start by training m different classifiers (C1, C2, ..., Cm). Depending on the technique, the ensemble can be built from different classification algorithms, for example, decision trees, support vector machines, logistic regression classifiers, and so on. Alternatively, we can use the same base classification algorithm and fit different subsets of the training set. One prominent example of this approach is the random forest algorithm, which combines different decision tree classifiers. The following figure illustrates the concept of a general ensemble approach using majority voting.

from IPython.display import Image
import warnings
warnings.filterwarnings("ignore")
%matplotlib inline


Image(filename='images/07_01.png', width=500)


Image(filename='images/07_02.png', width=500)

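The next code cell quantifies why this works. Under the simplifying assumption that all n base classifiers are independent and have the same error rate ε, the probability that the majority vote of the ensemble is wrong follows a binomial distribution:

$$\varepsilon_{\text{ensemble}} = P(y \geq k) = \sum_{k=\lceil n/2 \rceil}^{n} \binom{n}{k}\, \varepsilon^{k}\,(1-\varepsilon)^{\,n-k}$$

For n = 11 base classifiers with ε = 0.25, this evaluates to roughly 0.034, which is exactly what the ensemble_error function below returns. Keep in mind that the independence assumption is an idealization; base classifiers with correlated errors benefit less from ensembling.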

# from scipy.misc import comb  # deprecated; use scipy.special instead
from scipy.special import comb
import math


def ensemble_error(n_classifier, error):
    k_start = int(math.ceil(n_classifier / 2.))
    probs = [comb(n_classifier, k) * error**k * (1 - error)**(n_classifier - k)
             for k in range(k_start, n_classifier + 1)]
    return sum(probs)

ensemble_error(n_classifier=11, error=0.25)
0.03432750701904297
import numpy as np

error_range = np.arange(0.0, 1.01, 0.01)
ens_errors = [ensemble_error(n_classifier=11, error=error)
              for error in error_range]

import matplotlib.pyplot as plt

plt.plot(error_range, ens_errors,
         label='Ensemble error',
         linewidth=2)
plt.plot(error_range, error_range,
         linestyle='--',
         label='Base error',
         linewidth=2)
plt.xlabel('Base error')
plt.ylabel('Base/Ensemble error')
plt.legend(loc='upper left')
plt.grid(alpha=0.5)
#plt.savefig('images/07_03.png', dpi=300)
plt.show()

[Plot output (output_18_0.png): ensemble error versus base error]

Combining classifiers via majority vote

Implementing a simple majority vote classifier

import numpy as np

np.argmax(np.bincount([0, 0, 1],
                      weights=[0.2, 0.2, 0.6]))
1
ex = np.array([[0.9, 0.1],
               [0.8, 0.2],
               [0.4, 0.6]])

p = np.average(ex,
               axis=0,
               weights=[0.2, 0.2, 0.6])
p
array([0.58, 0.42])
np.argmax(p)
0
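The two NumPy snippets above correspond to the weighted versions of hard and soft voting that the MajorityVoteClassifier below implements. Using the book's notation, with weights w_j, the characteristic function χ_A, and p_ij the predicted probability of classifier C_j for class label i:

$$\hat{y} = \arg\max_i \sum_{j=1}^{m} w_j\, \chi_A\!\big(C_j(\boldsymbol{x}) = i\big) \qquad \text{(classlabel vote)}$$

$$\hat{y} = \arg\max_i \sum_{j=1}^{m} w_j\, p_{ij} \qquad \text{(probability vote)}$$

With weights [0.2, 0.2, 0.6], the hard vote returns class 1 while the probability-based vote returns class 0; the two rules can disagree, which is why the class docstring recommends the probability vote only for well-calibrated classifiers.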
from sklearn.base import BaseEstimator
from sklearn.base import ClassifierMixin
from sklearn.preprocessing import LabelEncoder
from sklearn.externals import six
from sklearn.base import clone
from sklearn.pipeline import _name_estimators
import numpy as np
import operator


class MajorityVoteClassifier(BaseEstimator, ClassifierMixin):
    """ A majority vote ensemble classifier

    Parameters
    ----------
    classifiers : array-like, shape = [n_classifiers]
        Different classifiers for the ensemble

    vote : str, {'classlabel', 'probability'} (default='classlabel')
        If 'classlabel' the prediction is based on the argmax of
        class labels. Else if 'probability', the argmax of
        the sum of probabilities is used to predict the class label
        (recommended for calibrated classifiers).

    weights : array-like, shape = [n_classifiers], optional (default=None)
        If a list of `int` or `float` values are provided, the classifiers
        are weighted by importance; Uses uniform weights if `weights=None`.

    """
    def __init__(self, classifiers, vote='classlabel', weights=None):
        self.classifiers = classifiers
        self.named_classifiers = {key: value for key, value
                                  in _name_estimators(classifiers)}
        self.vote = vote
        self.weights = weights

    def fit(self, X, y):
        """ Fit classifiers.

        Parameters
        ----------
        X : {array-like, sparse matrix}, shape = [n_samples, n_features]
            Matrix of training samples.

        y : array-like, shape = [n_samples]
            Vector of target class labels.

        Returns
        -------
        self : object

        """
        if self.vote not in ('probability', 'classlabel'):
            raise ValueError("vote must be 'probability' or 'classlabel'"
                             "; got (vote=%r)"
                             % self.vote)

        if self.weights and len(self.weights) != len(self.classifiers):
            raise ValueError('Number of classifiers and weights must be equal'
                             '; got %d weights, %d classifiers'
                             % (len(self.weights), len(self.classifiers)))

        # Use LabelEncoder to ensure class labels start with 0, which
        # is important for np.argmax call in self.predict
        self.lablenc_ = LabelEncoder()
        self.lablenc_.fit(y)
        self.classes_ = self.lablenc_.classes_
        self.classifiers_ = []
        for clf in self.classifiers:
            fitted_clf = clone(clf).fit(X, self.lablenc_.transform(y))
            self.classifiers_.append(fitted_clf)
        return self

    def predict(self, X):
        """ Predict class labels for X.

        Parameters
        ----------
        X : {array-like, sparse matrix}, shape = [n_samples, n_features]
            Matrix of training samples.

        Returns
        ----------
        maj_vote : array-like, shape = [n_samples]
            Predicted class labels.

        """
        if self.vote == 'probability':
            maj_vote = np.argmax(self.predict_proba(X), axis=1)
        else:  # 'classlabel' vote

            # Collect results from clf.predict calls
            predictions = np.asarray([clf.predict(X)
                                      for clf in self.classifiers_]).T

            maj_vote = np.apply_along_axis(
                lambda x: np.argmax(
                    np.bincount(x, weights=self.weights)),
                axis=1,
                arr=predictions)
        maj_vote = self.lablenc_.inverse_transform(maj_vote)
        return maj_vote

    def predict_proba(self, X):
        """ Predict class probabilities for X.

        Parameters
        ----------
        X : {array-like, sparse matrix}, shape = [n_samples, n_features]
            Training vectors, where n_samples is the number of samples and
            n_features is the number of features.

        Returns
        ----------
        avg_proba : array-like, shape = [n_samples, n_classes]
            Weighted average probability for each class per sample.

        """
        probas = np.asarray([clf.predict_proba(X)
                             for clf in self.classifiers_])
        avg_proba = np.average(probas, axis=0, weights=self.weights)
        return avg_proba

    def get_params(self, deep=True):
        """ Get classifier parameter names for GridSearch"""
        if not deep:
            return super(MajorityVoteClassifier, self).get_params(deep=False)
        else:
            out = self.named_classifiers.copy()
            for name, step in six.iteritems(self.named_classifiers):
                for key, value in six.iteritems(step.get_params(deep=True)):
                    out['%s__%s' % (name, key)] = value
            return out

Using the majority voting principle to make predictions

from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X, y = iris.data[50:, [1, 2]], iris.target[50:]
le = LabelEncoder()
y = le.fit_transform(y)

X_train, X_test, y_train, y_test =\
       train_test_split(X, y,
                        test_size=0.5,
                        random_state=1,
                        stratify=y)
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

clf1 = LogisticRegression(penalty='l2',
                          C=0.001,
                          random_state=1)

clf2 = DecisionTreeClassifier(max_depth=1,
                              criterion='entropy',
                              random_state=0)

clf3 = KNeighborsClassifier(n_neighbors=1,
                            p=2,
                            metric='minkowski')

pipe1 = Pipeline([['sc', StandardScaler()],
                  ['clf', clf1]])
pipe3 = Pipeline([['sc', StandardScaler()],
                  ['clf', clf3]])

clf_labels = ['Logistic regression', 'Decision tree', 'KNN']

print('10-fold cross validation:\n')
for clf, label in zip([pipe1, clf2, pipe3], clf_labels):
    scores = cross_val_score(estimator=clf,
                             X=X_train,
                             y=y_train,
                             cv=10,
                             scoring='roc_auc')
    print("ROC AUC: %0.2f (+/- %0.2f) [%s]"
          % (scores.mean(), scores.std(), label))
10-fold cross validation:

ROC AUC: 0.87 (+/- 0.17) [Logistic regression]
ROC AUC: 0.89 (+/- 0.16) [Decision tree]
ROC AUC: 0.88 (+/- 0.15) [KNN]
# Majority Rule (hard) Voting

mv_clf = MajorityVoteClassifier(classifiers=[pipe1, clf2, pipe3])

clf_labels += ['Majority voting']
all_clf = [pipe1, clf2, pipe3, mv_clf]

for clf, label in zip(all_clf, clf_labels):
    scores = cross_val_score(estimator=clf,
                             X=X_train,
                             y=y_train,
                             cv=10,
                             scoring='roc_auc')
    print("ROC AUC: %0.2f (+/- %0.2f) [%s]"
          % (scores.mean(), scores.std(), label))
ROC AUC: 0.87 (+/- 0.17) [Logistic regression]
ROC AUC: 0.89 (+/- 0.16) [Decision tree]
ROC AUC: 0.88 (+/- 0.15) [KNN]
ROC AUC: 0.94 (+/- 0.13) [Majority voting]

Evaluating and tuning the ensemble classifier

from sklearn.metrics import roc_curve
from sklearn.metrics import auc

colors = ['black', 'orange', 'blue', 'green']
linestyles = [':', '--', '-.', '-']

for clf, label, clr, ls \
        in zip(all_clf, clf_labels, colors, linestyles):

    # assuming the label of the positive class is 1
    y_pred = clf.fit(X_train,
                     y_train).predict_proba(X_test)[:, 1]
    fpr, tpr, thresholds = roc_curve(y_true=y_test,
                                     y_score=y_pred)
    roc_auc = auc(x=fpr, y=tpr)
    plt.plot(fpr, tpr,
             color=clr,
             linestyle=ls,
             label='%s (auc = %0.2f)' % (label, roc_auc))

plt.legend(loc='lower right')
plt.plot([0, 1], [0, 1],
         linestyle='--',
         color='gray',
         linewidth=2)

plt.xlim([-0.1, 1.1])
plt.ylim([-0.1, 1.1])
plt.grid(alpha=0.5)
plt.xlabel('False positive rate (FPR)')
plt.ylabel('True positive rate (TPR)')

#plt.savefig('images/07_04', dpi=300)
plt.show()

[Plot output (output_33_0.png): ROC curves for the individual classifiers and the majority vote ensemble]

sc = StandardScaler()
X_train_std = sc.fit_transform(X_train)

from itertools import product

all_clf = [pipe1, clf2, pipe3, mv_clf]

x_min = X_train_std[:, 0].min() - 1
x_max = X_train_std[:, 0].max() + 1
y_min = X_train_std[:, 1].min() - 1
y_max = X_train_std[:, 1].max() + 1

xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1),
                     np.arange(y_min, y_max, 0.1))

f, axarr = plt.subplots(nrows=2, ncols=2,
                        sharex='col',
                        sharey='row',
                        figsize=(7, 5))

for idx, clf, tt in zip(product([0, 1], [0, 1]),
                        all_clf, clf_labels):
    clf.fit(X_train_std, y_train)

    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)

    axarr[idx[0], idx[1]].contourf(xx, yy, Z, alpha=0.3)

    axarr[idx[0], idx[1]].scatter(X_train_std[y_train==0, 0],
                                  X_train_std[y_train==0, 1],
                                  c='blue',
                                  marker='^',
                                  s=50)
    axarr[idx[0], idx[1]].scatter(X_train_std[y_train==1, 0],
                                  X_train_std[y_train==1, 1],
                                  c='green',
                                  marker='o',
                                  s=50)
    axarr[idx[0], idx[1]].set_title(tt)

plt.text(-3.5, -5.,
         s='Sepal width [standardized]',
         ha='center', va='center', fontsize=12)
plt.text(-12.5, 4.5,
         s='Petal length [standardized]',
         ha='center', va='center',
         fontsize=12, rotation=90)

#plt.savefig('images/07_05', dpi=300)
plt.show()

[Plot output (output_35_0.png): decision regions of the individual classifiers and the majority vote ensemble]

mv_clf.get_params()
{'pipeline-1': Pipeline(memory=None,steps=[('sc',StandardScaler(copy=True, with_mean=True, with_std=True)),['clf',LogisticRegression(C=0.001, class_weight=None, dual=False,fit_intercept=True, intercept_scaling=1,l1_ratio=None, max_iter=100,multi_class='warn', n_jobs=None,penalty='l2', random_state=1, solver='warn',tol=0.0001, verbose=0, warm_start=False)]],verbose=False),'decisiontreeclassifier': DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=1,max_features=None, max_leaf_nodes=None,min_impurity_decrease=0.0, min_impurity_split=None,min_samples_leaf=1, min_samples_split=2,min_weight_fraction_leaf=0.0, presort=False,random_state=0, splitter='best'),'pipeline-2': Pipeline(memory=None,steps=[('sc',StandardScaler(copy=True, with_mean=True, with_std=True)),['clf',KNeighborsClassifier(algorithm='auto', leaf_size=30,metric='minkowski', metric_params=None,n_jobs=None, n_neighbors=1, p=2,weights='uniform')]],verbose=False),'pipeline-1__memory': None,'pipeline-1__steps': [('sc',StandardScaler(copy=True, with_mean=True, with_std=True)),['clf',LogisticRegression(C=0.001, class_weight=None, dual=False, fit_intercept=True,intercept_scaling=1, l1_ratio=None, max_iter=100,multi_class='warn', n_jobs=None, penalty='l2',random_state=1, solver='warn', tol=0.0001, verbose=0,warm_start=False)]],'pipeline-1__verbose': False,'pipeline-1__sc': StandardScaler(copy=True, with_mean=True, with_std=True),'pipeline-1__clf': LogisticRegression(C=0.001, class_weight=None, dual=False, fit_intercept=True,intercept_scaling=1, l1_ratio=None, max_iter=100,multi_class='warn', n_jobs=None, penalty='l2',random_state=1, solver='warn', tol=0.0001, verbose=0,warm_start=False),'pipeline-1__sc__copy': True,'pipeline-1__sc__with_mean': True,'pipeline-1__sc__with_std': True,'pipeline-1__clf__C': 0.001,'pipeline-1__clf__class_weight': None,'pipeline-1__clf__dual': False,'pipeline-1__clf__fit_intercept': True,'pipeline-1__clf__intercept_scaling': 1,'pipeline-1__clf__l1_ratio': None,'pipeline-1__clf__max_iter': 100,'pipeline-1__clf__multi_class': 'warn','pipeline-1__clf__n_jobs': None,'pipeline-1__clf__penalty': 'l2','pipeline-1__clf__random_state': 1,'pipeline-1__clf__solver': 'warn','pipeline-1__clf__tol': 0.0001,'pipeline-1__clf__verbose': 0,'pipeline-1__clf__warm_start': False,'decisiontreeclassifier__class_weight': None,'decisiontreeclassifier__criterion': 'entropy','decisiontreeclassifier__max_depth': 1,'decisiontreeclassifier__max_features': None,'decisiontreeclassifier__max_leaf_nodes': None,'decisiontreeclassifier__min_impurity_decrease': 0.0,'decisiontreeclassifier__min_impurity_split': None,'decisiontreeclassifier__min_samples_leaf': 1,'decisiontreeclassifier__min_samples_split': 2,'decisiontreeclassifier__min_weight_fraction_leaf': 0.0,'decisiontreeclassifier__presort': False,'decisiontreeclassifier__random_state': 0,'decisiontreeclassifier__splitter': 'best','pipeline-2__memory': None,'pipeline-2__steps': [('sc',StandardScaler(copy=True, with_mean=True, with_std=True)),['clf',KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',metric_params=None, n_jobs=None, n_neighbors=1, p=2,weights='uniform')]],'pipeline-2__verbose': False,'pipeline-2__sc': StandardScaler(copy=True, with_mean=True, with_std=True),'pipeline-2__clf': KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',metric_params=None, n_jobs=None, n_neighbors=1, p=2,weights='uniform'),'pipeline-2__sc__copy': True,'pipeline-2__sc__with_mean': True,'pipeline-2__sc__with_std': 
True,'pipeline-2__clf__algorithm': 'auto','pipeline-2__clf__leaf_size': 30,'pipeline-2__clf__metric': 'minkowski','pipeline-2__clf__metric_params': None,'pipeline-2__clf__n_jobs': None,'pipeline-2__clf__n_neighbors': 1,'pipeline-2__clf__p': 2,'pipeline-2__clf__weights': 'uniform'}
from sklearn.model_selection import GridSearchCV

params = {'decisiontreeclassifier__max_depth': [1, 2],
          'pipeline-1__clf__C': [0.001, 0.1, 100.0]}

grid = GridSearchCV(estimator=mv_clf,
                    param_grid=params,
                    cv=10,
                    scoring='roc_auc')
grid.fit(X_train, y_train)

for r, _ in enumerate(grid.cv_results_['mean_test_score']):
    print("%0.3f +/- %0.2f %r"
          % (grid.cv_results_['mean_test_score'][r],
             grid.cv_results_['std_test_score'][r] / 2.0,
             grid.cv_results_['params'][r]))
0.933 +/- 0.07 {'decisiontreeclassifier__max_depth': 1, 'pipeline-1__clf__C': 0.001}
0.947 +/- 0.07 {'decisiontreeclassifier__max_depth': 1, 'pipeline-1__clf__C': 0.1}
0.973 +/- 0.04 {'decisiontreeclassifier__max_depth': 1, 'pipeline-1__clf__C': 100.0}
0.947 +/- 0.07 {'decisiontreeclassifier__max_depth': 2, 'pipeline-1__clf__C': 0.001}
0.947 +/- 0.07 {'decisiontreeclassifier__max_depth': 2, 'pipeline-1__clf__C': 0.1}
0.973 +/- 0.04 {'decisiontreeclassifier__max_depth': 2, 'pipeline-1__clf__C': 100.0}
print('Best parameters: %s' % grid.best_params_)
print('Accuracy: %.2f' % grid.best_score_)
Best parameters: {'decisiontreeclassifier__max_depth': 1, 'pipeline-1__clf__C': 100.0}
Accuracy: 0.97

Note
By default, refit is set to True in GridSearchCV (that is, GridSearchCV(..., refit=True)), which means that we can use the fitted GridSearchCV estimator to make predictions via the predict method, for example:

grid = GridSearchCV(estimator=mv_clf, param_grid=params, cv=10, scoring='roc_auc')
grid.fit(X_train, y_train)
y_pred = grid.predict(X_test)

In addition, the “best” estimator can directly be accessed via the best_estimator_ attribute.

grid.best_estimator_.classifiers
[Pipeline(memory=None,steps=[('sc',StandardScaler(copy=True, with_mean=True, with_std=True)),['clf',LogisticRegression(C=100.0, class_weight=None, dual=False,fit_intercept=True, intercept_scaling=1,l1_ratio=None, max_iter=100,multi_class='warn', n_jobs=None,penalty='l2', random_state=1, solver='warn',tol=0.0001, verbose=0, warm_start=False)]],verbose=False),DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=1,max_features=None, max_leaf_nodes=None,min_impurity_decrease=0.0, min_impurity_split=None,min_samples_leaf=1, min_samples_split=2,min_weight_fraction_leaf=0.0, presort=False,random_state=0, splitter='best'),Pipeline(memory=None,steps=[('sc',StandardScaler(copy=True, with_mean=True, with_std=True)),['clf',KNeighborsClassifier(algorithm='auto', leaf_size=30,metric='minkowski', metric_params=None,n_jobs=None, n_neighbors=1, p=2,weights='uniform')]],verbose=False)]
mv_clf = grid.best_estimator_
mv_clf.set_params(**grid.best_estimator_.get_params())
MajorityVoteClassifier(classifiers=[Pipeline(memory=None,steps=[('sc',StandardScaler(copy=True,with_mean=True,with_std=True)),('clf',LogisticRegression(C=100.0,class_weight=None,dual=False,fit_intercept=True,intercept_scaling=1,l1_ratio=None,max_iter=100,multi_class='warn',n_jobs=None,penalty='l2',random_state=1,solver='warn',tol=0.0001,verbose=0,w...min_weight_fraction_leaf=0.0,presort=False,random_state=0,splitter='best'),Pipeline(memory=None,steps=[('sc',StandardScaler(copy=True,with_mean=True,with_std=True)),('clf',KNeighborsClassifier(algorithm='auto',leaf_size=30,metric='minkowski',metric_params=None,n_jobs=None,n_neighbors=1,p=2,weights='uniform'))],verbose=False)],vote='classlabel', weights=None)
mv_clf
MajorityVoteClassifier(classifiers=[Pipeline(memory=None,steps=[('sc',StandardScaler(copy=True,with_mean=True,with_std=True)),('clf',LogisticRegression(C=100.0,class_weight=None,dual=False,fit_intercept=True,intercept_scaling=1,l1_ratio=None,max_iter=100,multi_class='warn',n_jobs=None,penalty='l2',random_state=1,solver='warn',tol=0.0001,verbose=0,w...min_weight_fraction_leaf=0.0,presort=False,random_state=0,splitter='best'),Pipeline(memory=None,steps=[('sc',StandardScaler(copy=True,with_mean=True,with_std=True)),('clf',KNeighborsClassifier(algorithm='auto',leaf_size=30,metric='minkowski',metric_params=None,n_jobs=None,n_neighbors=1,p=2,weights='uniform'))],verbose=False)],vote='classlabel', weights=None)

Bagging – Building an ensemble of classifiers from bootstrap samples

Image(filename='./images/07_06.png', width=500)


Bagging in a nutshell

Image(filename='./images/07_07.png', width=400)

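To make the bootstrapping step shown in the figure concrete, here is a minimal sketch (not part of the original notebook; the ten-element toy array is made up for illustration) that draws bootstrap samples with np.random.choice, that is, sampling with replacement from the training set:

import numpy as np

rng = np.random.RandomState(1)
X_toy = np.arange(10)  # stand-in for the indices of 10 training samples

for bagging_round in range(2):
    # each bagging round draws a sample of the same size, with replacement
    boot_idx = rng.choice(len(X_toy), size=len(X_toy), replace=True)
    # samples that were not drawn in this round ("out-of-bag" samples)
    oob_idx = np.setdiff1d(np.arange(len(X_toy)), boot_idx)
    print('Round %d bootstrap sample:  %s' % (bagging_round + 1, X_toy[boot_idx]))
    print('Round %d out-of-bag samples: %s' % (bagging_round + 1, X_toy[oob_idx]))

In the BaggingClassifier used below, each of the n_estimators decision trees is fit on one such bootstrap sample, and the individual predictions are then aggregated into the final prediction.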

Applying bagging to classify samples in the Wine dataset

import pandas as pd

df_wine = pd.read_csv('https://archive.ics.uci.edu/ml/'
                      'machine-learning-databases/wine/wine.data',
                      header=None)

df_wine.columns = ['Class label', 'Alcohol', 'Malic acid', 'Ash',
                   'Alcalinity of ash', 'Magnesium', 'Total phenols',
                   'Flavanoids', 'Nonflavanoid phenols', 'Proanthocyanins',
                   'Color intensity', 'Hue', 'OD280/OD315 of diluted wines',
                   'Proline']

# if the Wine dataset is temporarily unavailable from the
# UCI machine learning repository, un-comment the following line
# of code to load the dataset from a local path:

# df_wine = pd.read_csv('wine.data', header=None)

# drop 1 class
df_wine = df_wine[df_wine['Class label'] != 1]

y = df_wine['Class label'].values
X = df_wine[['Alcohol', 'OD280/OD315 of diluted wines']].values

from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

le = LabelEncoder()
y = le.fit_transform(y)

X_train, X_test, y_train, y_test =\
            train_test_split(X, y,
                             test_size=0.2,
                             random_state=1,
                             stratify=y)

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(criterion='entropy',
                              max_depth=None,
                              random_state=1)

bag = BaggingClassifier(base_estimator=tree,
                        n_estimators=500,
                        max_samples=1.0,
                        max_features=1.0,
                        bootstrap=True,
                        bootstrap_features=False,
                        n_jobs=1,
                        random_state=1)

from sklearn.metrics import accuracy_score

tree = tree.fit(X_train, y_train)
y_train_pred = tree.predict(X_train)
y_test_pred = tree.predict(X_test)

tree_train = accuracy_score(y_train, y_train_pred)
tree_test = accuracy_score(y_test, y_test_pred)
print('Decision tree train/test accuracies %.3f/%.3f'
      % (tree_train, tree_test))

bag = bag.fit(X_train, y_train)
y_train_pred = bag.predict(X_train)
y_test_pred = bag.predict(X_test)

bag_train = accuracy_score(y_train, y_train_pred)
bag_test = accuracy_score(y_test, y_test_pred)
print('Bagging train/test accuracies %.3f/%.3f'
      % (bag_train, bag_test))
Decision tree train/test accuracies 1.000/0.833
Bagging train/test accuracies 1.000/0.917
import numpy as np
import matplotlib.pyplot as plt

x_min = X_train[:, 0].min() - 1
x_max = X_train[:, 0].max() + 1
y_min = X_train[:, 1].min() - 1
y_max = X_train[:, 1].max() + 1

xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1),
                     np.arange(y_min, y_max, 0.1))

f, axarr = plt.subplots(nrows=1, ncols=2,
                        sharex='col',
                        sharey='row',
                        figsize=(8, 3))

for idx, clf, tt in zip([0, 1],
                        [tree, bag],
                        ['Decision tree', 'Bagging']):
    clf.fit(X_train, y_train)

    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)

    axarr[idx].contourf(xx, yy, Z, alpha=0.3)
    axarr[idx].scatter(X_train[y_train == 0, 0],
                       X_train[y_train == 0, 1],
                       c='blue', marker='^')
    axarr[idx].scatter(X_train[y_train == 1, 0],
                       X_train[y_train == 1, 1],
                       c='green', marker='o')
    axarr[idx].set_title(tt)

axarr[0].set_ylabel('Alcohol', fontsize=12)

plt.text(10.2, -0.5,
         s='OD280/OD315 of diluted wines',
         ha='center', va='center', fontsize=12)

plt.tight_layout()
#plt.savefig('images/07_08.png', dpi=300, bbox_inches='tight')
plt.show()

[Plot output (output_54_0.png): decision regions of the single decision tree and the bagging ensemble]

Leveraging weak learners via adaptive boosting

How boosting works

Image(filename='images/07_09.png', width=400)


Image(filename='images/07_10.png', width=500)

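The figures above can be backed by a short numeric sketch of one boosting round (illustrative NumPy with made-up toy labels, not part of the original notebook and not scikit-learn's internal code): compute the weighted error rate ε of a weak learner, derive its coefficient α, and re-weight the training samples so that the next learner focuses on the previously misclassified ones.

import numpy as np

y    = np.array([ 1,  1,  1, -1, -1, -1,  1,  1,  1, -1])  # toy class labels
yhat = np.array([ 1,  1,  1, -1, -1, -1, -1, -1, -1, -1])  # predictions of one weak learner
w    = np.full(10, 0.1)                                     # initial uniform sample weights

correct = (y == yhat)
epsilon = np.sum(w[~correct])                    # weighted error rate -> 0.3
alpha = 0.5 * np.log((1. - epsilon) / epsilon)   # coefficient of this weak learner -> ~0.424

# boost the weights of misclassified samples, shrink the weights of correct ones
w = np.where(correct, w * np.exp(-alpha), w * np.exp(alpha))
w /= np.sum(w)  # renormalize so the weights sum to 1

print('epsilon = %.3f, alpha = %.3f' % (epsilon, alpha))
print('updated weights:', np.round(w, 3))

After many such rounds, the final prediction is a weighted majority vote of all weak learners, each weighted by its α. The AdaBoostClassifier call in the next section performs 500 rounds of this procedure with learning_rate=0.1.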

Applying AdaBoost using scikit-learn

from sklearn.ensemble import AdaBoostClassifier

tree = DecisionTreeClassifier(criterion='entropy',
                              max_depth=1,
                              random_state=1)

ada = AdaBoostClassifier(base_estimator=tree,
                         n_estimators=500,
                         learning_rate=0.1,
                         random_state=1)

tree = tree.fit(X_train, y_train)
y_train_pred = tree.predict(X_train)
y_test_pred = tree.predict(X_test)

tree_train = accuracy_score(y_train, y_train_pred)
tree_test = accuracy_score(y_test, y_test_pred)
print('Decision tree train/test accuracies %.3f/%.3f'
      % (tree_train, tree_test))

ada = ada.fit(X_train, y_train)
y_train_pred = ada.predict(X_train)
y_test_pred = ada.predict(X_test)

ada_train = accuracy_score(y_train, y_train_pred)
ada_test = accuracy_score(y_test, y_test_pred)
print('AdaBoost train/test accuracies %.3f/%.3f'
      % (ada_train, ada_test))
Decision tree train/test accuracies 0.916/0.875
AdaBoost train/test accuracies 1.000/0.917
x_min, x_max = X_train[:, 0].min() - 1, X_train[:, 0].max() + 1
y_min, y_max = X_train[:, 1].min() - 1, X_train[:, 1].max() + 1

xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1),
                     np.arange(y_min, y_max, 0.1))

f, axarr = plt.subplots(1, 2,
                        sharex='col',
                        sharey='row',
                        figsize=(8, 3))

for idx, clf, tt in zip([0, 1],
                        [tree, ada],
                        ['Decision tree', 'AdaBoost']):
    clf.fit(X_train, y_train)

    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)

    axarr[idx].contourf(xx, yy, Z, alpha=0.3)
    axarr[idx].scatter(X_train[y_train == 0, 0],
                       X_train[y_train == 0, 1],
                       c='blue', marker='^')
    axarr[idx].scatter(X_train[y_train == 1, 0],
                       X_train[y_train == 1, 1],
                       c='green', marker='o')
    axarr[idx].set_title(tt)

axarr[0].set_ylabel('Alcohol', fontsize=12)

plt.text(10.2, -0.5,
         s='OD280/OD315 of diluted wines',
         ha='center', va='center', fontsize=12)

plt.tight_layout()
#plt.savefig('images/07_11.png', dpi=300, bbox_inches='tight')
plt.show()

[Plot output (output_63_0.png): decision regions of the single decision tree stump and the AdaBoost ensemble]

Summary


Readers may ignore the next cell.

! python ../.convert_notebook_to_script.py --input ch07.ipynb --output ch07.py
python: can't open file '../.convert_notebook_to_script.py': [Errno 2] No such file or directory
