Bagging vs. Boosting: Can't Tell Them Apart? Let's Take a Closer Look

Still unsure what Bagging is? Still wondering where exactly Bagging and Boosting differ? Still looking for how to use Bagging in practice? Then read on!

Bagging

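When the same learning algorithm is trained on several different training sets drawn from the same distribution, the resulting models can differ considerably from one another; that is, the model has high variance. To address this, the outputs of multiple models can be combined: by averaging for regression problems, and by majority vote for classification problems. Bagging (Bootstrap Aggregating) is a widely used statistical learning method built on this idea, and the base learners it combines can be any kind of weak learner.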

1. Bootstrap Sampling

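To combine the results of N weak classifiers (decision trees), we need N training sets. In practice, collecting N separate training sets is usually unrealistic. Bootstrap sampling offers an effective workaround: random sampling with replacement from the single training set that is available.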

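This sidesteps the problem that obtaining N original datasets from the same distribution is unrealistic. To an acceptable degree, bootstrap sampling can be regarded as not harming model accuracy (measured by variance), so it can stand in for N distinct original datasets. Some studies further suggest that sampling with replacement is not critical to model performance, and that random sampling without replacement (drawing subsamples smaller than the training set) can be used instead.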
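A minimal sketch of what a bootstrap sample looks like, using a toy vector rather than real training data (the seed and the sizes are arbitrary choices for illustration). It also checks the well-known fact that a bootstrap sample of size n contains, on average, about 1 - 1/e (roughly 63.2%) of the distinct original observations.

```r
set.seed(1)          # arbitrary seed, only for reproducibility
x <- 1:10            # a toy "training set" of 10 observations

# One bootstrap sample: same size as x, drawn with replacement,
# so some observations repeat and others are left out.
boot <- sample(x, size = length(x), replace = TRUE)
print(sort(boot))

# For large n, the share of distinct observations that appear in a
# bootstrap sample approaches 1 - 1/e; the rest are "out of bag".
n <- 10000
frac_unique <- length(unique(sample.int(n, n, replace = TRUE))) / n
print(frac_unique)
```

The observations left out of a given bootstrap sample are often used as a built-in validation set for the model trained on that sample.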

2. The Bagging Algorithm

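Draw N bootstrap samples from the training set to obtain N training subsets, build a decision tree on each subset with the same algorithm, and take the majority vote (classification) or the average (regression) of the N trees as the final result. The Bagging algorithm thus involves two main steps: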

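- Averaging, whose purpose is to reduce variance. If Z1, Z2, ..., Zn are independent and each has variance sigma**2, then their mean has variance (sigma**2)/n. Bagging therefore reduces model variance noticeably, although the trees in a bag are trained on overlapping data and are correlated, so the reduction in practice is smaller than the full factor of n.

- Bootstrap sampling. To carry out the averaging above we need multiple training sets, but finding N different training sets in practice is unrealistic, so N training sets are obtained by bootstrap sampling with replacement from the same training set.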
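The variance claim in the averaging step can be checked with a short simulation. This is only a sketch under the assumption of independent normal draws; the values of sigma, n and the number of repetitions are arbitrary illustration choices.

```r
set.seed(2)
sigma <- 2      # standard deviation of each Z_i, so sigma^2 = 4
n     <- 25     # number of values averaged
reps  <- 20000  # number of simulated averages

# Each row holds n independent draws with variance sigma^2.
z <- matrix(rnorm(reps * n, mean = 0, sd = sigma), nrow = reps)

var_single <- var(z[, 1])        # should be close to sigma^2 = 4
var_mean   <- var(rowMeans(z))   # should be close to sigma^2 / n = 0.16
print(c(single = var_single, mean = var_mean))
```

When the averaged quantities have pairwise correlation rho, the variance of the mean is rho * sigma^2 + (1 - rho) * sigma^2 / n, which is why correlated trees gain less from averaging than independent ones would.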

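Compared with a single decision tree, Bagging reduces variance, but because it averages the results of many trees, it sacrifices the interpretability of the model.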

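Bagging algorithm steps:

1) Draw N bootstrap samples (with replacement) from the training set.

2) Fit a decision tree to each bootstrap sample with the same algorithm.

3) Combine the N trees: majority vote for classification, average for regression.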
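The procedure described above (N bootstrap samples, one tree per sample, aggregated predictions) can be sketched in a few lines for the regression case. The helper names `bag_fit` and `bag_predict` are hypothetical, introduced only for this illustration, and the sketch uses rpart with its default settings on the built-in airquality data.

```r
library(rpart)

# Hypothetical helper: fit n_models trees, each on its own bootstrap sample.
bag_fit <- function(formula, data, n_models = 50) {
  lapply(seq_len(n_models), function(i) {
    idx <- sample.int(nrow(data), nrow(data), replace = TRUE)
    rpart(formula, data = data[idx, , drop = FALSE])
  })
}

# Hypothetical helper: average the trees' predictions (regression case).
bag_predict <- function(models, newdata) {
  preds <- do.call(cbind, lapply(models, predict, newdata = newdata))
  rowMeans(preds)
}

set.seed(3)
d      <- na.omit(airquality[, c("Ozone", "Wind")])
models <- bag_fit(Ozone ~ Wind, d, n_models = 25)
pred   <- bag_predict(models, d)
head(pred)
```

For classification, the averaging step would be replaced by a majority vote over the predicted classes.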

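3. Characteristics of Bagging

If the base classifier is unstable (for example, decision trees or neural networks), bagging helps reduce the error caused by random fluctuations in the training data.

If the base classifier is stable (that is, robust to small changes in the training set), the error of the ensemble is caused mainly by the bias of the base classifier, and bagging may not noticeably improve its performance.

Because every sample has the same probability of being selected, bagging does not focus on any particular instance in the training set. As a result, bagging is less prone to overfitting on noisy data.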
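To make the stable/unstable distinction concrete, the sketch below refits a regression tree (typically unstable) and a linear model (typically stable) on bootstrap samples of the built-in airquality data and measures how much their predictions move from fit to fit. The grid of Wind values, the seed and the number of refits are arbitrary assumptions; one would usually expect the tree's spread to be the larger of the two, which is the variance that bagging averages away.

```r
library(rpart)

set.seed(4)
d    <- na.omit(airquality[, c("Ozone", "Wind")])
grid <- data.frame(Wind = seq(5, 15, by = 1))   # points at which to compare predictions

# Refit a learner on B bootstrap samples; return a matrix of predictions
# with one row per grid point and one column per bootstrap fit.
boot_preds <- function(fit_fun, B = 100) {
  sapply(seq_len(B), function(b) {
    idx <- sample.int(nrow(d), nrow(d), replace = TRUE)
    as.numeric(predict(fit_fun(d[idx, ]), newdata = grid))
  })
}

tree_preds <- boot_preds(function(dat) rpart(Ozone ~ Wind, data = dat))
lm_preds   <- boot_preds(function(dat) lm(Ozone ~ Wind, data = dat))

# Average spread of the predictions across bootstrap fits.
sd_tree <- mean(apply(tree_preds, 1, sd))
sd_lm   <- mean(apply(lm_preds, 1, sd))
print(c(tree = sd_tree, lm = sd_lm))
```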

4. Differences Between Bagging and Boosting

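1) Sample selection:

Bagging: each training set is drawn with replacement from the original set, and the training sets of different rounds are independent of one another.

Boosting: the training set stays the same in every round; only the weight of each example changes, and the weights are adjusted according to the classification results of the previous round.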

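2) Example weights:

Bagging: uniform sampling; every example has equal weight.

Boosting: example weights are adjusted continually according to the errors; examples misclassified in the previous round receive larger weights.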

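3) Prediction functions:

Bagging: all prediction functions have equal weight.

Boosting: each weak classifier has its own weight, and classifiers with smaller classification error receive larger weights.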

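4) Parallel computation:

Bagging: the prediction functions can be built in parallel.

Boosting: the prediction functions can only be built sequentially, because each model's parameters depend on the results of the previous round.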
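The weighting differences in points 2) and 3) can be illustrated with one round of a boosting update. The sketch below assumes the classic discrete AdaBoost rule, which the article does not name explicitly; the labels and predictions are made-up toy values.

```r
# Toy labels in {-1, +1} and the predictions of one weak classifier.
y    <- c(1,  1, -1, -1, 1)
pred <- c(1, -1, -1, -1, 1)   # the second example is misclassified

# Boosting starts from uniform weights (bagging keeps them uniform throughout).
w <- rep(1 / length(y), length(y))

err   <- sum(w[pred != y])            # weighted error rate of this classifier
alpha <- 0.5 * log((1 - err) / err)   # classifier weight: larger when err is smaller

# Raise the weights of misclassified examples, lower the others, renormalise.
w_new <- w * exp(-alpha * y * pred)
w_new <- w_new / sum(w_new)
print(round(w_new, 3))
```

Because `w_new` depends on the predictions of the current classifier, the next classifier cannot be trained until this one is finished, which is the reason boosting is sequential while bagging parallelises.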

Head spinning after all that text? Let's walk through some concrete Bagging case studies.

Bagging Regression Case Study


require(data.table)
library(rpart)
require(ggplot2)
set.seed(456)

# The airquality dataset that ships with R.
# Goal: estimate the relationship between wind speed (Wind) and ozone level (Ozone).
bagging_data = data.table(airquality)
head(bagging_data)
##    Ozone Solar.R Wind Temp Month Day
## 1:    41     190  7.4   67     5   1
## 2:    36     118  8.0   72     5   2
## 3:    12     149 12.6   74     5   3
## 4:    18     313 11.5   62     5   4
## 5:    NA      NA 14.3   56     5   5
## 6:    28      NA 14.9   66     5   6

# The scatter plot suggests the relationship may not be linear,
# so a regression tree may be a good fit.
ggplot(bagging_data, aes(Wind, Ozone)) + geom_point() + ggtitle("Ozone vs wind speed")

data_test = na.omit(bagging_data[, .(Ozone, Wind)])

## Split the data: 80% training set, 20% test set
train_index = sample.int(nrow(data_test), size = round(nrow(data_test) * 0.8), replace = FALSE)
data_test[train_index, train := TRUE][-train_index, train := FALSE]

## Fit a single regression tree on the full training set
no_bag_model = rpart(Ozone ~ Wind, data_test[train_index], control = rpart.control(minsplit = 6))
result_no_bag = predict(no_bag_model, bagging_data)

opar <- par(mfrow = c(1, 2))   # keep the old graphics settings so they can be restored
library(rpart.plot)
rpart.plot(no_bag_model, type = 2, branch = 1, fallen.leaves = TRUE, cex = 0.8)

# The red line is the prediction of the single tree
gg = ggplot(data_test, aes(Wind, Ozone)) + geom_point(aes(color = train))
data_no_bag = data.table(Wind = bagging_data$Wind, Ozone = result_no_bag)
gg = gg + geom_line(data = data_no_bag[order(Wind)], aes(x = Wind, y = Ozone), color = 'red')
gg

## Bagging
n_model = 100      ## train 100 trees
bagged_models = list()
for (i in 1:n_model) {
  # bootstrap sample of the training rows (with replacement)
  new_sample = sample(train_index, size = length(train_index), replace = TRUE)
  bagged_models = c(bagged_models,
                    list(rpart(Ozone ~ Wind, data_test[new_sample],
                               control = rpart.control(minsplit = 6))))
}

## Prediction: running average over the trees
bagged_result = NULL
i = 0
for (from_bag_model in bagged_models) {
  if (is.null(bagged_result))
    bagged_result = predict(from_bag_model, bagging_data)
  else
    bagged_result = (i * bagged_result + predict(from_bag_model, bagging_data)) / (i + 1)
  i = i + 1
}

## The green line is the prediction of the bagged model
require(ggplot2)
gg = ggplot(data_test, aes(Wind, Ozone)) + geom_point(aes(color = train))
data_bagged = data.table(Wind = bagging_data$Wind, Ozone = bagged_result)
gg = gg + geom_line(data = data_bagged[order(Wind)], aes(x = Wind, y = Ozone), color = 'green')
gg

# Each gray line is the model trained on one bootstrap sample.
gg = ggplot(data_test, aes(Wind, Ozone)) + geom_point(aes(color = train))
for (tree_model in bagged_models[1:100]) {
  prediction = predict(tree_model, bagging_data)
  data_plot = data.table(Wind = bagging_data$Wind, Ozone = prediction)
  gg = gg + geom_line(data = data_plot[order(Wind)], aes(x = Wind, y = Ozone), alpha = 0.2)
}
gg

## Compare the single tree (red) with bagging (green)
gg = ggplot(data_test, aes(Wind, Ozone)) + geom_point(aes(color = train))
data_bagged = data.table(Wind = bagging_data$Wind, Ozone = bagged_result)
gg = gg + geom_line(data = data_bagged[order(Wind)], aes(x = Wind, y = Ozone), color = 'green')
data_no_bag = data.table(Wind = bagging_data$Wind, Ozone = result_no_bag)
gg = gg + geom_line(data = data_no_bag[order(Wind)], aes(x = Wind, y = Ozone), color = 'red')
gg

Bagging Classification Case Study

library(adabag)   # bagging() implements the Bagging algorithm
library(ggplot2)
target.url <- ...   # URL of the data file; not preserved in the source
data <- ...         # data read from target.url; the call is not preserved in the source
summary(data)
head(data)
##       V1     V2     V3     V4     V5     V6     V7     V8     V9    V10    V11
## 1 0.0200 0.0371 0.0428 0.0207 0.0954 0.0986 0.1539 0.1601 0.3109 0.2111 0.1609
## 2 0.0453 0.0523 0.0843 0.0689 0.1183 0.2583 0.2156 0.3481 0.3337 0.2872 0.4918
## 3 0.0262 0.0582 0.1099 0.1083 0.0974 0.2280 0.2431 0.3771 0.5598 0.6194 0.6333
## 4 0.0100 0.0171 0.0623 0.0205 0.0205 0.0368 0.1098 0.1276 0.0598 0.1264 0.0881
## 5 0.0762 0.0666 0.0481 0.0394 0.0590 0.0649 0.1209 0.2467 0.3564 0.4459 0.4152
## 6 0.0286 0.0453 0.0277 0.0174 0.0384 0.0990 0.1201 0.1833 0.2105 0.3039 0.2988
##      V12    V13    V14    V15    V16    V17    V18    V19    V20    V21    V22
## 1 0.1582 0.2238 0.0645 0.0660 0.2273 0.3100 0.2999 0.5078 0.4797 0.5783 0.5071
## 2 0.6552 0.6919 0.7797 0.7464 0.9444 1.0000 0.8874 0.8024 0.7818 0.5212 0.4052
## 3 0.7060 0.5544 0.5320 0.6479 0.6931 0.6759 0.7551 0.8929 0.8619 0.7974 0.6737
## 4 0.1992 0.0184 0.2261 0.1729 0.2131 0.0693 0.2281 0.4060 0.3973 0.2741 0.3690
## 5 0.3952 0.4256 0.4135 0.4528 0.5326 0.7306 0.6193 0.2032 0.4636 0.4148 0.4292
## 6 0.4250 0.6343 0.8198 1.0000 0.9988 0.9508 0.9025 0.7234 0.5122 0.2074 0.3985
##      V23    V24    V25    V26    V27    V28    V29    V30    V31    V32    V33
## 1 0.4328 0.5550 0.6711 0.6415 0.7104 0.8080 0.6791 0.3857 0.1307 0.2604 0.5121
## 2 0.3957 0.3914 0.3250 0.3200 0.3271 0.2767 0.4423 0.2028 0.3788 0.2947 0.1984
## 3 0.4293 0.3648 0.5331 0.2413 0.5070 0.8533 0.6036 0.8514 0.8512 0.5045 0.1862
## 4 0.5556 0.4846 0.3140 0.5334 0.5256 0.2520 0.2090 0.3559 0.6260 0.7340 0.6120
## 5 0.5730 0.5399 0.3161 0.2285 0.6995 1.0000 0.7262 0.4724 0.5103 0.5459 0.2881
## 6 0.5890 0.2872 0.2043 0.5782 0.5389 0.3750 0.3411 0.5067 0.5580 0.4778 0.3299
##      V34    V35    V36    V37    V38    V39    V40    V41    V42    V43    V44
## 1 0.7547 0.8537 0.8507 0.6692 0.6097 0.4943 0.2744 0.0510 0.2834 0.2825 0.4256
## 2 0.2341 0.1306 0.4182 0.3835 0.1057 0.1840 0.1970 0.1674 0.0583 0.1401 0.1628
## 3 0.2709 0.4232 0.3043 0.6116 0.6756 0.5375 0.4719 0.4647 0.2587 0.2129 0.2222
## 4 0.3497 0.3953 0.3012 0.5408 0.8814 0.9857 0.9167 0.6121 0.5006 0.3210 0.3202
## 5 0.0981 0.1951 0.4181 0.4604 0.3217 0.2828 0.2430 0.1979 0.2444 0.1847 0.0841
## 6 0.2198 0.1407 0.2856 0.3807 0.4158 0.4054 0.3296 0.2707 0.2650 0.0723 0.1238
##      V45    V46    V47    V48    V49    V50    V51    V52    V53    V54    V55
## 1 0.2641 0.1386 0.1051 0.1343 0.0383 0.0324 0.0232 0.0027 0.0065 0.0159 0.0072
## 2 0.0621 0.0203 0.0530 0.0742 0.0409 0.0061 0.0125 0.0084 0.0089 0.0048 0.0094
## 3 0.2111 0.0176 0.1348 0.0744 0.0130 0.0106 0.0033 0.0232 0.0166 0.0095 0.0180
## 4 0.4295 0.3654 0.2655 0.1576 0.0681 0.0294 0.0241 0.0121 0.0036 0.0150 0.0085
## 5 0.0692 0.0528 0.0357 0.0085 0.0230 0.0046 0.0156 0.0031 0.0054 0.0105 0.0110
## 6 0.1192 0.1089 0.0623 0.0494 0.0264 0.0081 0.0104 0.0045 0.0014 0.0038 0.0013
##      V56    V57    V58    V59    V60 V61
## 1 0.0167 0.0180 0.0084 0.0090 0.0032   R
## 2 0.0191 0.0140 0.0049 0.0052 0.0044   R
## 3 0.0244 0.0316 0.0164 0.0095 0.0078   R
## 4 0.0073 0.0050 0.0044 0.0040 0.0117   R
## 5 0.0015 0.0072 0.0048 0.0107 0.0094   R
## 6 0.0089 0.0057 0.0027 0.0051 0.0062   R

# Split the data into a training set (70%) and a test set (30%)
set.seed(210)
index <- ...              # row indices of the training sample; expression not preserved in the source
train <- data[index, ]    # (assumed) rows selected for training
test  <- data[-index, ]   # (assumed) remaining rows used for testing

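The bagging() function is trained on the training set. The number of base classifiers starts at 1 and is increased by one in each loop iteration until it reaches 20.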

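# Lines marked (assumed) are reconstructed from the surrounding comments;
# the source preserves only the left-hand side of those assignments.
error <- numeric(20)                                           # (assumed)
for (i in 1:20) {
  data.bagging <- bagging(V61 ~ ., data = train, mfinal = i)   # (assumed)
  # mfinal: number of boosting iterations or number of trees to use, default 100
  # control: passed as control = rpart.control()
  # rpart.control accepts the following parameters:
  # minsplit: integer, default 20; minimum number of observations in a node for a split to be attempted
  # maxdepth: maximum depth of the tree
  # xval: number of cross-validations, default 10 (ten-fold cross-validation)
  # maxcompete: number of competitor splits retained in the output for each node, default 4
  # cp: complexity parameter, default 0.01
  # predict.bagging: predict the classes of the test set
  data.predbagging <- predict.bagging(data.bagging, newdata = test)   # (assumed)
  # prediction error
  error[i] <- data.predbagging$error
}
# data.bagging

# Error as the number of base classifiers grows
error <- data.frame(n = 1:20, error = error)                   # (assumed)
p <- ggplot(error, aes(x = n, y = error)) +                    # (assumed)
  geom_line(colour = "red", linetype = "dashed", size = 1) +
  xlab("the number of basic classifiers") +
  geom_point(size = 3, shape = 18) +
  theme_bw() +
  theme(panel.grid = element_blank()) +
  theme(axis.title = element_text(face = "bold"))
# The plot shows that, although the error fluctuates, it tends to decrease
# as base classifiers are added and gradually settles around 0.22.
p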

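# Confusion matrix of predicted versus true classes
data.predbagging$confusion
##                Observed Class
## Predicted Class  M  R
##               M 30  4
##               R  6 23
# $votes: for each observation, the number of trees that assign it to each class
# $prob: matrix of posterior probabilities (degree of support) of each class for each observation
# $class: predicted class
# Accuracy = (30+23)/63 = 0.8413

# ROC curve
library(pROC)
decisiontree_roc_bag <- roc(test$V61, data.predbagging$prob[, 2])   # (assumed call; the source keeps only the trailing "2])")
## Setting levels: control = M, case = R
## Setting direction: controls < cases
plot(decisiontree_roc_bag, print.auc = TRUE,
     auc.polygon = TRUE, grid = c(0.1, 0.2),
     grid.col = c("green", "red"),
     max.auc.polygon = TRUE,
     auc.polygon.col = "skyblue",
     print.thres = TRUE, main = 'ROC curve')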
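As a quick check of the accuracy figure quoted for the bagging confusion matrix, the calculation can be reproduced by hand. The matrix below is typed in from the printed output rather than taken from a fitted model, so this is only an arithmetic illustration.

```r
# Confusion matrix copied from the printed output: rows are predicted
# classes, columns are observed classes (matrix() fills column by column).
conf <- matrix(c(30, 6, 4, 23), nrow = 2,
               dimnames = list(Predicted = c("M", "R"),
                               Observed  = c("M", "R")))

accuracy   <- sum(diag(conf)) / sum(conf)   # (30 + 23) / 63
error_rate <- 1 - accuracy                  # share of misclassified test cases
print(round(c(accuracy = accuracy, error = error_rate), 4))
```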

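# Change to mfinal = 10
bagging_g <- bagging(V61 ~ ., data = train, mfinal = 10)       # (assumed; the source keeps only "10)")
predbagging <- predict.bagging(bagging_g, newdata = test)      # (assumed)
g_roc_bag <- roc(test$V61, predbagging$prob[, 2])              # (assumed; the source keeps only "2])")
## Setting levels: control = M, case = R
## Setting direction: controls < cases
plot(g_roc_bag, print.auc = TRUE,
     auc.polygon = TRUE, grid = c(0.1, 0.2),
     grid.col = c("green", "red"),
     max.auc.polygon = TRUE,
     auc.polygon.col = "skyblue",
     print.thres = TRUE, main = 'ROC curve')
# The AUC drops from 0.917 to 0.867. Judging by AUC in this run, more base
# classifiers give better results, but training also becomes slower.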

################## random forest ##################
# Lines marked (assumed) are reconstructed; the source preserves only the
# left-hand side of those assignments.
library(caret)
library(randomForest)
set.seed(7)
cvcontrol <- trainControl(method = "repeatedcv", number = 10, repeats = 1)       # (assumed; matches the resampling summary below)
train.rf <- train(V61 ~ ., data = train, method = "rf", trControl = cvcontrol)   # (assumed)
train.rf
## Random Forest
##
## 145 samples
##  60 predictor
##   2 classes: 'M', 'R'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times)
## Summary of sample sizes: 130, 130, 130, 131, 130, 131, ...
## Resampling results across tuning parameters:
##
##   mtry  Accuracy   Kappa
##    2    0.8142857  0.6267624
##   31    0.7661905  0.5291793
##   60    0.7600000  0.5159617
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.

rf.classTrain <- predict(train.rf, type = "raw")                                 # (assumed)
head(rf.classTrain)
## [1] R M M R R R
## Levels: M R

# Confusion matrix on the training data.
# Note: confusionMatrix() expects the predictions first and the reference second;
# the arguments are swapped here, so the rows labelled "Prediction" hold the true classes.
confusionMatrix(train$V61, rf.classTrain)
## Confusion Matrix and Statistics
##
##           Reference
## Prediction  M  R
##          M 75  0
##          R  0 70
##
##                Accuracy : 1
##                  95% CI : (0.9749, 1)
##     No Information Rate : 0.5172
##     P-Value [Acc > NIR] : < 2.2e-16
##
##                   Kappa : 1
##
##  Mcnemar's Test P-Value : NA
##
##        'Positive' Class : M
# No training example is misclassified.

# Apply the model to the test data
rf.classTest <- predict(train.rf, newdata = test, type = "raw")                  # (assumed)
head(rf.classTest)
## [1] M M R R R M
## Levels: M R

# Confusion matrix on the test data
confusionMatrix(test$V61, rf.classTest)
## Confusion Matrix and Statistics
##
##           Reference
## Prediction  M  R
##          M 33  3
##          R  3 24
##
##                Accuracy : 0.9048
##                  95% CI : (0.8041, 0.9642)
##     No Information Rate : 0.5714
##     P-Value [Acc > NIR] : 6.819e-09
##
##                   Kappa : 0.8056
##
##  Mcnemar's Test P-Value : 1
##
##             Sensitivity : 0.9167
##             Specificity : 0.8889
##          Pos Pred Value : 0.9167
##          Neg Pred Value : 0.8889
##              Prevalence : 0.5714
##          Detection Rate : 0.5238
##    Detection Prevalence : 0.5714
##       Balanced Accuracy : 0.9028
##
##        'Positive' Class : M
# Accuracy is 0.9048, better than bagging.

rf.probs = predict(train.rf, newdata = test, type = "prob")   # predicted probabilities on the test data
head(rf.probs)
##        M     R
## 3  0.584 0.416
## 8  0.608 0.392
## 10 0.278 0.722
## 11 0.108 0.892
## 15 0.422 0.578
## 18 0.506 0.494

rocCurve.rf <- roc(test$V61, rf.probs[, "R"])                                    # (assumed)
## Setting levels: control = M, case = R
## Setting direction: controls < cases
plot(rocCurve.rf, col = c(1),
     print.auc = TRUE,
     auc.polygon = TRUE, grid = c(0.1, 0.2),
     grid.col = c("green", "red"),
     max.auc.polygon = TRUE,
     auc.polygon.col = "skyblue",
     print.thres = TRUE, main = 'ROC curve')

auc(rocCurve.rf)  # AUC = 0.9491, better than bagging
## Area under the curve: 0.9491

Original authors: 杨成意, 吕润倩, 段妍, 皮常鹏, 张若岩, 郑俊男

Layout: 郑俊男

Advisor: 林红梅
