Ensemble Learning: Bagging

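1 How Bagging Works

Goal: classify a set of circles and squares.

[A single straight line cannot easily separate the circles from the squares.]

Procedure:

1. Sample different datasets

2. Train the classifiers

3. Take an equal-weight vote to obtain the final result

4. Summary of the main procedure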
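
The three steps above can be sketched by hand as follows. This is a minimal illustration, not the method of any particular library: the `make_moons` toy data stands in for the circles and squares, and the number of trees, the tree depth, and all variable names are assumptions chosen for the example.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier

# Toy two-class data that a single straight line cannot separate
# (an assumption, standing in for the circles and squares).
X, y = make_moons(n_samples=300, noise=0.25, random_state=0)
rng = np.random.default_rng(0)

# Steps 1 and 2: draw a bootstrap sample for each tree and train on it.
n_trees = 5
trees = []
for _ in range(n_trees):
    idx = rng.integers(0, len(X), size=len(X))  # row indices drawn with replacement
    tree = DecisionTreeClassifier(max_depth=3, random_state=0)
    trees.append(tree.fit(X[idx], y[idx]))

# Step 3: every tree gets one vote; the majority label wins.
votes = np.stack([tree.predict(X) for tree in trees])        # shape (n_trees, n_samples)
bagged_pred = (votes.sum(axis=0) > n_trees / 2).astype(int)  # labels are 0/1, n_trees is odd

print("bagged training accuracy:", (bagged_pred == y).mean())
```

An odd number of trees is used so that a two-class vote can never tie.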

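2 Building a Random Forest

In machine learning, a random forest is a classifier made up of many decision trees; its output class is the mode of the classes output by the individual trees.

Random forest = Bagging + decision trees

For example, if you train 5 trees and 4 of them predict True while 1 predicts False, the final vote is True.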
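
The vote in this example can be written in a few lines. The helper name `majority_vote` is hypothetical and exists only for this illustration.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the most common label among the trees' predictions."""
    return Counter(predictions).most_common(1)[0][0]

tree_outputs = [True, True, False, True, True]  # 4 trees say True, 1 says False
print(majority_vote(tree_outputs))  # True
```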

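Key steps in building a random forest (N is the number of training samples, M is the number of features):

1) Draw one sample at random with replacement and repeat N times (so the same sample may appear more than once)

2) Randomly select m features, with m << M, and build the decision tree from them (in scikit-learn, the m candidate features are redrawn at every split rather than once per tree)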
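
The two sampling steps can be sketched with NumPy index arrays. The sizes, the seed, and the choice m = sqrt(M) are assumptions for illustration, and the feature subset is drawn once here for simplicity.

```python
import numpy as np

rng = np.random.default_rng(22)

N, M = 8, 9            # N samples, M features (toy sizes)
m = int(np.sqrt(M))    # a common choice for m; here m = 3

# Step 1: draw N row indices with replacement, so repeats are possible.
row_idx = rng.integers(0, N, size=N)

# Step 2: draw m distinct feature indices out of M.
feature_idx = rng.choice(M, size=m, replace=False)

print("rows:", row_idx)
print("features:", feature_idx)
```

Indexing a data matrix as `X[row_idx][:, feature_idx]` would then give the training view for one tree.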

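Questions to consider
- 1. Why sample the training set randomly?
  
  Without random sampling, every tree would be trained on the same data, and the trees would end up giving exactly the same classification results.
- 2. Why sample with replacement? [So that every sample has the same probability of being drawn on each draw]
  
  If sampling were done without replacement, the trees' training sets would all be different and share no samples, so each tree would be "biased" and completely "one-sided" (this wording may be imprecise); in other words, the trained trees would differ greatly from one another. The final prediction of a random forest, however, depends on a vote among many trees (weak classifiers), and that vote works best when the trees are trained on overlapping samples of the same data.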
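
A small simulation shows how much of the training set one bootstrap sample actually covers. When N rows are drawn with replacement from N rows, a given row appears at least once with probability 1 - (1 - 1/N)^N, which approaches 1 - 1/e (about 0.632) as N grows. The value of N and the seed below are arbitrary choices.

```python
import random

random.seed(0)
N = 10_000

# One bootstrap sample: N draws with replacement from N row indices.
sample = [random.randrange(N) for _ in range(N)]

unique_fraction = len(set(sample)) / N
expected = 1 - (1 - 1 / N) ** N  # probability that a given row is drawn at least once

print(f"distinct rows in the sample: {unique_fraction:.3f}")
print(f"expected fraction:           {expected:.3f}")
```

Each tree therefore sees roughly two thirds of the distinct samples, so different trees overlap substantially while still being trained on different data.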

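3 Random Forest API

sklearn.ensemble.RandomForestClassifier(n_estimators=100, criterion='gini', max_depth=None, bootstrap=True, random_state=None, min_samples_split=2)
- n_estimators: integer, optional (default = 100; it was 10 before scikit-learn 0.22). Number of trees in the forest. Values worth trying: 120, 200, 300, 500, 800, 1200
- criterion: string, optional (default = "gini"). Function used to measure the quality of a split
- max_depth: integer or None, optional (default = None). Maximum depth of each tree. Values worth trying: 5, 8, 15, 25, 30
- max_features: number of features considered when looking for the best split ("auto" in older versions, "sqrt" in current ones)
  - If "auto", then max_features=sqrt(n_features) (deprecated in scikit-learn 1.1 and removed in 1.3; use "sqrt" instead).
  - If "sqrt", then max_features=sqrt(n_features) (same as "auto").
  - If "log2", then max_features=log2(n_features).
  - If None, then max_features=n_features.
- bootstrap: boolean, optional (default = True). Whether to sample with replacement when building each tree
- min_samples_split: minimum number of samples required to split a node
- min_samples_leaf: minimum number of samples required at a leaf node

Hyperparameters to tune: n_estimators, max_depth, min_samples_split, min_samples_leaf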
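
As a quick illustration of these parameters, the sketch below sets each of them explicitly and fits the forest on synthetic data. The data from `make_classification` and the specific values are assumptions made only so the snippet runs on its own.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, used only so the snippet is self-contained.
X, y = make_classification(n_samples=200, n_features=9, random_state=22)

rf = RandomForestClassifier(
    n_estimators=120,       # number of trees
    criterion="gini",       # split quality measure
    max_depth=5,            # cap on the depth of every tree
    max_features="sqrt",    # features considered at each split
    bootstrap=True,         # sample rows with replacement
    min_samples_split=2,
    min_samples_leaf=1,
    random_state=22,
)
rf.fit(X, y)

print(len(rf.estimators_))                          # number of fitted trees
print(max(t.get_depth() for t in rf.estimators_))   # never exceeds max_depth
```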

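4 Random Forest Prediction Example

Instantiate the random forest

# Random forest used for prediction
rf = RandomForestClassifier()

Define the candidate values for the hyperparameters

param = {"n_estimators": [120, 200, 300, 500, 800, 1200], "max_depth": [5, 8, 15, 25, 30]}

Run a grid search with GridSearchCV

# Hyperparameter tuning
gc = GridSearchCV(rf, param_grid=param, cv=2)
gc.fit(x_train, y_train)
print("Random forest accuracy:", gc.score(x_test, y_test))

Note

The random forest construction process

The tree depth, the number of trees, and similar settings need hyperparameter tuning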

Sample code:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction import DictVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, roc_auc_score

# 1. Load the data
titan = pd.read_csv('./titanic.csv')

# 2. Basic data processing
# 2.1 Select the feature columns and the target column
x = titan[['Pclass', 'Age', 'Sex']].copy()  # copy so that filling Age does not trigger pandas' chained-assignment warning
y = titan['Survived']
# 2.2 Handle missing values: fill missing ages with the mean age
x['Age'] = x['Age'].fillna(x['Age'].mean())
# x = pd.get_dummies(x)
# 2.3 Split the dataset
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=22, test_size=0.2)

# 3. Feature engineering: dictionary feature extraction one-hot encodes the categorical feature
transfer = DictVectorizer(sparse=False)
x_train = transfer.fit_transform(x_train.to_dict(orient="records"))
x_test = transfer.transform(x_test.to_dict(orient="records"))  # reuse the mapping fitted on the training set

# 4. Machine learning: random forest
rf = RandomForestClassifier(n_estimators=100, criterion='gini', max_depth=5)
# Train the model
rf.fit(x_train, y_train)

# 5. Evaluate the model: accuracy
score = rf.score(x_test, y_test)
print('Accuracy:', score)
# Precision, recall, f1-score (true labels first, predictions second)
y_predict = rf.predict(x_test)
res = classification_report(y_test, y_predict)
print('Precision, recall, f1-score:\n', res)
# AUC, computed from the predicted probability of the positive class (Survived = 1)
auc = roc_auc_score(y_test, rf.predict_proba(x_test)[:, 1])
print('auc:', auc)


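Sample code for a grid search with GridSearchCV:

import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_extraction import DictVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, roc_auc_score

# 1. Load the data
titan = pd.read_csv('./titanic.csv')

# 2. Basic data processing
# 2.1 Select the feature columns and the target column
x = titan[['Pclass', 'Age', 'Sex']].copy()  # copy so that filling Age does not trigger pandas' chained-assignment warning
y = titan['Survived']
# 2.2 Handle missing values: fill missing ages with the mean age
x['Age'] = x['Age'].fillna(x['Age'].mean())
# x = pd.get_dummies(x)
# 2.3 Split the dataset
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=22, test_size=0.2)

# 3. Feature engineering: dictionary feature extraction one-hot encodes the categorical feature
transfer = DictVectorizer(sparse=False)
x_train = transfer.fit_transform(x_train.to_dict(orient="records"))
x_test = transfer.transform(x_test.to_dict(orient="records"))  # reuse the mapping fitted on the training set

# 4. Machine learning: random forest
rf = RandomForestClassifier()
# Candidate values for the hyperparameters
param = {"n_estimators": [120, 200, 300, 500, 800, 1200], "max_depth": [5, 8, 15, 25, 30]}
# Hyperparameter tuning: 60 fits in total,
# len(n_estimators) * len(max_depth) * cv = 6 * 5 * 2, plus one final refit on the full training set
gc = GridSearchCV(rf, param_grid=param, cv=2)
gc.fit(x_train, y_train)

# 5. Evaluate the model: accuracy
score = gc.score(x_test, y_test)
print('Random forest accuracy:', score)
# Precision, recall, f1-score (true labels first, predictions second)
y_predict = gc.predict(x_test)
res = classification_report(y_test, y_predict)
print('Precision, recall, f1-score:\n', res)
# AUC, computed from the predicted probability of the positive class (Survived = 1)
auc = roc_auc_score(y_test, gc.predict_proba(x_test)[:, 1])
print('auc:', auc)
# Best hyperparameters found by the search
print('Best hyperparameters:', gc.best_params_)
# Best model found by the search
print('Best model:', gc.best_estimator_)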


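5 Advantages of Bagging Ensembles

Bagging + decision tree / linear regression / logistic regression / deep learning … = a bagging ensemble learning method

Ensemble methods built this way:

can typically improve generalization accuracy by about 2% over the base algorithm
are simple, convenient, and general-purpose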
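
To illustrate that bagging is not tied to decision trees, the sketch below wraps logistic regression in scikit-learn's `BaggingClassifier`. The synthetic data, the number of estimators, and the seeds are assumptions for illustration, and the snippet makes no claim about how much accuracy improves.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data, used only so the snippet is self-contained.
X, y = make_classification(n_samples=500, n_features=10, random_state=22)
x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=22, test_size=0.2)

# The base estimator is passed positionally because its keyword name
# differs between scikit-learn versions.
bag = BaggingClassifier(LogisticRegression(max_iter=1000), n_estimators=10, random_state=22)
bag.fit(x_train, y_train)

print("bagged logistic regression accuracy:", bag.score(x_test, y_test))
```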
