Random Forest

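Contents

Random Forest
- Basic concepts
- Random selection of data
- Random selection of features
- RandomForestClassifier parameters
- RandomForestClassifier attributes
- Confusion matrix
- Cross-validation
- Grid search for the best parameters
- Viewing the distribution of positive and negative samples per feature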

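Random Forest

Given some familiarity with decision trees and ensemble learning, the strategy behind random forests is easy to follow: sample both the training examples and the features, train many decision trees on those samples, and combine the trees into an ensemble.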

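Basic concepts

From Baidu Baike

Each tree is built according to the following algorithm:

Let N be the number of training examples (samples) and M the number of features.
Choose the number of input features m used to determine the decision at a node of the tree; m should be much smaller than M.
Draw N samples with replacement from the N training examples to form a training set (bootstrap sampling), and use the examples that were not drawn to make predictions and estimate the error.
At each node, randomly select m features; the decision at that node is based only on these features. Compute the best split using these m features.
Each tree is grown fully without pruning (pruning might otherwise be applied after building an ordinary tree classifier).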
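
The per-tree procedure above can be sketched in a few lines. This is a minimal illustration rather than a reference implementation: the data are synthetic (make_classification), the helper name grow_tree and the values m = 4 and 25 trees are assumptions chosen for the example, and the per-node choice of m random features is delegated to scikit-learn's DecisionTreeClassifier through max_features.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data stands in for the N x M training set (an assumption for illustration).
X, y = make_classification(n_samples=300, n_features=16, n_informative=6, random_state=0)
N, M = X.shape
m = 4  # features tried at each node, chosen much smaller than M

rng = np.random.default_rng(0)


def grow_tree(X, y, m, rng):
    """Grow one unpruned tree on a bootstrap sample and score it on the rows it never saw."""
    n = len(X)
    boot_idx = rng.integers(0, n, size=n)      # draw N times with replacement
    oob_mask = np.ones(n, dtype=bool)
    oob_mask[boot_idx] = False                 # rows never drawn are out-of-bag
    tree = DecisionTreeClassifier(
        max_features=m,                        # m random candidate features per node
        random_state=int(rng.integers(0, 2**31 - 1)),
    )
    tree.fit(X[boot_idx], y[boot_idx])         # no max_depth, so the tree grows fully
    oob_error = 1.0 - tree.score(X[oob_mask], y[oob_mask])
    return tree, oob_mask, oob_error


trees = [grow_tree(X, y, m, rng) for _ in range(25)]
mean_oob_error = float(np.mean([err for _, _, err in trees]))
print(f"mean out-of-bag error of the single trees: {mean_oob_error:.3f}")
```

Each tree's out-of-bag error is measured only on rows that tree never trained on, which is why it can serve as an error estimate without a separate hold-out set.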

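Related concepts

Split: while training a decision tree, the training set is repeatedly divided into two subsets; each such division is called a split.
Feature: in a classification problem, the data fed to the classifier are called features. In a stock rise/fall prediction problem, for example, the features could be the previous day's trading volume and closing price.
Candidate features: when building a decision tree, features are chosen from the full feature set in some order. The candidate features are those that have not yet been chosen before the current step. For example, if the full feature set is ABCDE, the candidate features at step one are ABCDE; if step one chooses C, the candidate features at step two are ABDE.
Split feature: following the definition of candidate features, the feature chosen at each step is the split feature; in the example above, the split feature at step one is C. These features are called split features because they divide the dataset into disjoint parts.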

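Random selection of data

First, sub-datasets are built by sampling with replacement from the original dataset; each sub-dataset has the same size as the original. The same element can appear in different sub-datasets, and it can also appear more than once within a single sub-dataset. Second, each sub-dataset is used to build one sub-tree, so every sub-tree is trained on its own sub-dataset and produces its own output. Finally, when a new data point needs to be classified, the random forest output is obtained by a vote over the sub-trees' predictions. As an illustration, if a random forest has 3 sub-trees, 2 of which predict class A and 1 of which predicts class B, the random forest predicts class A.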
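
The voting step in the three-tree example can be written as a tiny sketch. The function name majority_vote is hypothetical and the predictions are hard-coded to match the example.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the class predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]


# Two trees say A, one says B, as in the example above.
forest_output = majority_vote(["A", "A", "B"])
print(forest_output)
```

This is hard voting. scikit-learn's RandomForestClassifier instead averages the trees' predicted class probabilities and returns the class with the highest mean, which usually agrees with the majority vote but can differ when trees are uncertain.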

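If the training set D has n samples and we sample n times at random with replacement, the resulting subset D1 also has n samples, but some of them are duplicates. The probability that a given sample is never drawn in n draws is (1 - 1/n)^n, and
\lim_{n \to +\infty} \left(1 - \frac{1}{n}\right)^{n} = \frac{1}{e} \approx 0.368
so with sampling with replacement about 63.2% of the samples in D appear in D1 and the remaining 36.8% do not. That 36.8% is commonly used as a validation set to estimate generalization performance, the so-called out-of-bag estimate.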
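
A quick numerical check of the 36.8% figure, as a sketch: the helper oob_fraction, the sample size n = 10,000 and the 20 repetitions are arbitrary choices made for this illustration.

```python
import numpy as np


def oob_fraction(n, rng):
    """Fraction of n samples that never appear in one bootstrap sample of size n."""
    drawn = rng.integers(0, n, size=n)
    return 1.0 - np.unique(drawn).size / n


rng = np.random.default_rng(42)
n = 10_000
analytic = (1 - 1 / n) ** n                                   # finite-n probability of never being drawn
simulated = float(np.mean([oob_fraction(n, rng) for _ in range(20)]))
print(f"(1 - 1/n)^n = {analytic:.4f}, simulated = {simulated:.4f}, 1/e = {np.exp(-1):.4f}")
```

Both the closed-form value and the simulated fraction should land close to 1/e, which is why roughly a third of the training set is available as out-of-bag data for every tree.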

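Random selection of features

As with the random selection of data, each split in a sub-tree of a random forest does not consider all candidate features. Instead, a subset of the candidate features is chosen at random, and the best feature is then picked from that subset. This makes the trees in the forest differ from one another, which increases the diversity of the ensemble and so improves classification performance.
In the original figure, blue squares represent all features that can be chosen (the candidate features) and yellow squares are the split features. On the left is the feature selection process of a single decision tree, which completes a split by picking the best split feature among all candidate features (recall the ID3, C4.5 and CART algorithms). Below it is the feature selection process of a sub-tree in a random forest.

If the total number of features is m and k features are selected at each split, the commonly recommended value is k = \log_2 m. In short, k features are picked at random from the m features and the best split is chosen among them.

For the random selection of samples and features, see: 【机器学习－西瓜书】八、Bagging；随机森林（RF） (Machine Learning, the "Watermelon Book", part 8: Bagging; Random Forest)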
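
Drawing the candidate subset for one split might look like the sketch below. The helper name candidate_features is hypothetical, and rounding log2(m) down (with a floor of 1) is an assumption, since the text does not say how non-integer values are handled.

```python
import math

import numpy as np


def candidate_features(n_features, rng):
    """Pick k = floor(log2(n_features)) distinct feature indices for one split (at least 1)."""
    k = max(1, int(math.log2(n_features)))
    return rng.choice(n_features, size=k, replace=False)


rng = np.random.default_rng(0)
subset = candidate_features(16, rng)   # 16 features, so k = 4
print(sorted(subset.tolist()))
```

A fresh subset is drawn at every split, so two splits in the same tree generally see different candidate features.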

import warnings
import pandas as pd
import numpy as np
import scipy                 # imported by the original post; not used in the code below
import seaborn as sns
import xgboost as xgb        # only referenced in a commented-out line below
import tensorflow as tf      # imported by the original post; not used in the code below
import matplotlib.pyplot as plt
from IPython.display import display   # display() is called below
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn import metrics
# from sklearn.utils import class_weight
from sklearn.model_selection import train_test_split
# from sklearn.metrics import roc_auc_score, precision_recall_curve

pd.set_option('display.max_columns', 1000)
pd.set_option('display.max_rows', 1000)
pd.set_option('display.width', 100)

# Ignore warnings such as version-incompatibility notices
warnings.filterwarnings("ignore")

# max_features='sqrt' is what older releases called 'auto' for classifiers;
# min_impurity_split is omitted because scikit-learn 1.0 removed it.
clf = RandomForestClassifier(
    n_estimators=10, criterion='gini', max_depth=None,
    min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0,
    max_features='sqrt', max_leaf_nodes=None, min_impurity_decrease=0.0,
    bootstrap=True, oob_score=True, n_jobs=1, random_state=None,
    verbose=0, warm_start=False, class_weight=None)

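Reposted from: sklearn随机森林分类类RandomForestClassifier

RandomForestClassifier parameters

n_estimators : integer, optional (default=10; 100 since scikit-learn 0.22). The number of (decision) trees in the forest.
criterion : string, optional (default="gini"). The function used to measure the quality of a split. Supported criteria are "gini" for Gini impurity and "entropy" for information gain.
max_features : int, float, string or None, optional (default="auto")

The number of features to consider when looking for the best split:

If int, max_features features are considered at each split.

If float, max_features is a fraction, and int(max_features * n_features) features are considered at each split.

If "auto", max_features=sqrt(n_features), the square root of n_features (newer scikit-learn releases spell this "sqrt").

If "log2", max_features=log2(n_features).

If None, max_features=n_features.

Note: the search for a split does not stop until at least one valid partition of the node samples is found, even if that requires inspecting more than max_features features.

max_depth : integer or None, optional (default=None). The maximum depth of a tree. If None, nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_split samples.
min_samples_split : int, float, optional (default=2)

The minimum number of samples required to split an internal node:

If int, min_samples_split is that minimum number.

If float, min_samples_split is a fraction, and ceil(min_samples_split * n_samples) is the minimum number of samples for each split.

min_samples_leaf : int, float, optional (default=1)

The minimum number of samples required at a leaf node:

If int, min_samples_leaf is that minimum number.

If float, min_samples_leaf is a fraction, and ceil(min_samples_leaf * n_samples) is the minimum number of samples for each node.

max_leaf_nodes : int or None, optional (default=None). Grow trees with at most max_leaf_nodes leaves in best-first fashion, where the best nodes are those with the largest relative reduction in impurity. If None, the number of leaf nodes is unlimited.
bootstrap : boolean, optional (default=True). Whether bootstrap samples (sampling with replacement) are used when building the trees.
oob_score : bool (default=False). Whether to use out-of-bag samples to estimate the generalization accuracy.
n_jobs : integer, optional (default=1). The number of jobs to run in parallel for fit and predict. If -1, the number of jobs is set to the number of cores.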
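
The int/float/string rules for max_features, min_samples_split and min_samples_leaf can be made concrete with a small sketch. The helpers resolve_max_features and resolve_min_samples are hypothetical; they only mirror the rules as stated in the list above and are not scikit-learn's internal code, which may round or clamp differently between versions. The numbers 8 and 1501 are the feature and training-row counts used later in this post.

```python
import math


def resolve_max_features(max_features, n_features):
    """Translate a max_features setting into a feature count, following the rules listed above."""
    if max_features is None:
        return n_features
    if max_features in ("auto", "sqrt"):
        return max(1, int(math.sqrt(n_features)))
    if max_features == "log2":
        return max(1, int(math.log2(n_features)))
    if isinstance(max_features, float):
        return max(1, int(max_features * n_features))   # fraction of all features
    return int(max_features)                            # plain integer count


def resolve_min_samples(value, n_samples):
    """Turn an int or fractional min_samples_* setting into an absolute sample count."""
    if isinstance(value, float):
        return math.ceil(value * n_samples)
    return value


n_features, n_samples = 8, 1501
for setting in (3, 0.5, "sqrt", "log2", None):
    print(f"max_features={setting!r} -> {resolve_max_features(setting, n_features)} features")
print("min_samples_split=0.01 ->", resolve_min_samples(0.01, n_samples), "samples")
```

Written this way it is easy to see that a float setting scales with the dataset, while an int setting stays fixed no matter how many rows or columns are supplied.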

traindata_path = u'D:/01_Project/99_test/ML/titanic/train.csv'
testdata_path = u'D:/01_Project/99_test/ML/titanic/test.csv'
testresult_path = u'D:/01_Project/99_test/ML/titanic/gender_submission.csv'
df_train = pd.read_csv(traindata_path)
df_test = pd.read_csv(testdata_path)
df_test['Survived'] = pd.read_csv(testresult_path)['Survived']
data_original = pd.concat([df_train,df_test],sort=False)
# df_test = df_test[df_train.columns]
# display (df_train.head(5))
# data_original.drop('Name',axis=1,inplace=True)
# data_original.dropna(inplace=True)
display (data_original.head(5))

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	373450	8.0500	NaN	S

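PassengerId => passenger ID
Pclass => passenger class (1st/2nd/3rd class cabin)
Name => passenger name
Sex => sex
Age => age
SibSp => number of siblings/spouses aboard
Parch => number of parents/children aboard
Ticket => ticket information
Fare => fare
Cabin => cabin
Embarked => port of embarkation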

data_original.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1309 entries, 0 to 417
Data columns (total 12 columns):
PassengerId    1309 non-null int64
Survived       1309 non-null int64
Pclass         1309 non-null int64
Name           1309 non-null object
Sex            1309 non-null object
Age            1046 non-null float64
SibSp          1309 non-null int64
Parch          1309 non-null int64
Ticket         1309 non-null object
Fare           1308 non-null float64
Cabin          295 non-null object
Embarked       1307 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 132.9+ KB

# Encode Sex as 0 (male) / 1 (female); assigning the result back replaces the column
data_original['Sex'] = data_original['Sex'].map({'male': 0, 'female': 1})

# Fill missing Embarked values backward first, then forward for any that remain
data_original['Embarked'] = data_original['Embarked'].bfill().ffill()
dummies = pd.get_dummies(data_original['Embarked'],prefix='Embarked')
dummies.head()

	Embarked_C	Embarked_S
0	0	1
1	1	0
2	0	1
3	0	1
4	0	1

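# data_original still carries the index of each CSV (0..890 and 0..417), so labels repeat.
# Joining on that index pairs every row with every same-labelled row, which inflates the
# 1309 rows to 2145 and attaches dummy values from other passengers; the outputs recorded
# below come from that index-based join. Resetting both indexes first keeps one row per passenger.
data_original = data_original.reset_index(drop=True).join(dummies.reset_index(drop=True))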
data_original.head()

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked	Embarked_C	Embarked_Q	Embarked_S
0	1	0	3	Braund, Mr. Owen Harris	0	22.0	1	A/5 21171	7.2500	NaN	S	0	0	1
0	1	0	3	Braund, Mr. Owen Harris	0	22.0	1	A/5 21171	7.2500	NaN	S	0	1	0
0	892	0	3	Kelly, Mr. James	0	34.5	0	330911	7.8292	NaN	Q	0	0	1
0	892	0	3	Kelly, Mr. James	0	34.5	0	330911	7.8292	NaN	Q	0	1	0
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	1	38.0	1	PC 17599	71.2833	C85	C	1	0	0

print (data_original['Sex'].value_counts())
data_original['Embarked'].value_counts()

0    1367
1     778
Name: Sex, dtype: int64
S    1482
C     453
Q     210
Name: Embarked, dtype: int64
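
The value counts above add up to 2145 rows although the combined data has 1309 entries. A likely cause is that the frame was concatenated without resetting its index, so an index-based join matches every row with every other row carrying the same label. The toy frames below are an assumption made purely to reproduce that effect on a small scale.

```python
import pandas as pd

# Two tiny frames concatenated without ignore_index, so index label 0 appears twice.
left = pd.concat([pd.DataFrame({"Embarked": ["S", "C"]}),
                  pd.DataFrame({"Embarked": ["Q"]})])
dummies = pd.get_dummies(left["Embarked"], prefix="Embarked")

inflated = left.join(dummies)   # join on the duplicated index labels
clean = left.reset_index(drop=True).join(dummies.reset_index(drop=True))

print(list(left.index))
print(len(left), len(inflated), len(clean))
```

With a unique index the join is one-to-one, so each row keeps exactly its own dummy columns; passing ignore_index=True to pd.concat in the first place has the same effect.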

print (data_original['Embarked'].unique())
feature = ['Pclass','Age','SibSp','Parch','Sex']+['Embarked_'+i for i in data_original['Embarked'].unique()]
print (feature)
# Fill the remaining missing values (for example Age) with the column mean
for column in feature:
    data_original[column] = data_original[column].fillna(data_original[column].mean())
x_train, x_test, y_train, y_test = train_test_split(data_original[feature], data_original['Survived'], random_state=1, train_size=0.7)
# x_train, x_test, y_train, y_test = train_test_split(data_original, data_original['Survived'], random_state=1, train_size=0.7)
display(x_train.shape)
display(x_test.shape)
display(y_train.shape)
display(y_test.shape)

['S' 'Q' 'C']
['Pclass', 'Age', 'SibSp', 'Parch', 'Sex', 'Embarked_S', 'Embarked_Q', 'Embarked_C']
(1501, 8)
(644, 8)
(1501,)
(644,)

clf.fit(x_train, y_train)

RandomForestClassifier(n_estimators=10, n_jobs=1, oob_score=True)

RandomForestClassifier attributes

estimators_ : list of DecisionTreeClassifier. The collection of fitted sub-estimators.

clf.estimators_

[DecisionTreeClassifier(max_features='auto', random_state=1495518603),
 DecisionTreeClassifier(max_features='auto', random_state=170010899),
 DecisionTreeClassifier(max_features='auto', random_state=2053236039),
 DecisionTreeClassifier(max_features='auto', random_state=1004379910),
 DecisionTreeClassifier(max_features='auto', random_state=2052542410),
 DecisionTreeClassifier(max_features='auto', random_state=834032305),
 DecisionTreeClassifier(max_features='auto', random_state=413200844),
 DecisionTreeClassifier(max_features='auto', random_state=801999364),
 DecisionTreeClassifier(max_features='auto', random_state=1345507579),
 DecisionTreeClassifier(max_features='auto', random_state=1667197337)]

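classes_ : array of shape = [n_classes] or a list of such arrays. The class labels (single-output problem), or a list of arrays of class labels (multi-output problem).

clf.classes_

array([0, 1], dtype=int64)

n_classes_ : int or list. The number of classes (single-output problem), or a list containing the number of classes for each output (multi-output problem).

clf.n_classes_

n_features_ : int. The number of features when fit is performed (newer scikit-learn releases expose this as n_features_in_ and have removed n_features_).

clf.n_features_

n_outputs_ : int. The number of outputs when fit is performed.

clf.n_outputs_

feature_importances_ : array of shape = [n_features]. The feature importances (the higher the value, the more important the feature).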

clf.feature_importances_
dict_importance = dict(zip(feature,clf.feature_importances_))
# display (dict_importance)
df_feature_importance = pd.DataFrame()
df_feature_importance['features'] = feature
df_feature_importance['importances'] = clf.feature_importances_
df_feature_importance = df_feature_importance.sort_values('importances',ascending=False)
df_feature_importance

	features	importances
4	Sex	0.561840
1	Age	0.249461
0	Pclass	0.061210
2	SibSp	0.054826
3	Parch	0.047580
5	Embarked_S	0.008941
6	Embarked_Q	0.008869
7	Embarked_C	0.007272
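
To see where these numbers come from, the sketch below checks on synthetic data (an assumption; it does not use the Titanic frame) that the forest-level importances sum to 1 and match the average of the per-tree importances, which holds as long as every tree contains at least one split.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=8, n_informative=4, random_state=0)
forest = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, y)

# Each fitted tree exposes its own normalized impurity-based importances.
per_tree = np.array([tree.feature_importances_ for tree in forest.estimators_])
averaged = per_tree.mean(axis=0)

print(np.round(forest.feature_importances_, 3))
print(np.round(averaged, 3))
```

Because the values are normalized to sum to 1, an importance such as 0.56 for Sex reads as a share of the total impurity reduction, not as an effect size.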

oob_score_ : float. The score of the training dataset obtained using an out-of-bag estimate.

clf.oob_score_

0.8500999333777481

oob_decision_function_ : array of shape = [n_samples, n_classes]. The decision function computed with the out-of-bag estimate on the training set. If n_estimators is small, a data point may never be left out of a bootstrap sample; in that case oob_decision_function_ may contain NaN.

clf.oob_decision_function_

array([[0.89891015, 0.10108985],[0.75      , 0.25      ],[0.        , 1.        ],...,[0.17672414, 0.82327586],[1.        , 0.        ],[1.        , 0.        ]])
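
The two out-of-bag attributes are closely related. The sketch below, on synthetic data (an assumption), takes the most probable class in each row of oob_decision_function_ and compares the resulting accuracy with oob_score_; 50 trees are used so that every sample is out-of-bag for at least one tree and no row is NaN.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=8, n_informative=4, random_state=0)
forest = RandomForestClassifier(n_estimators=50, oob_score=True, random_state=0).fit(X, y)

oob_proba = forest.oob_decision_function_            # shape (n_samples, n_classes)
has_oob_vote = ~np.isnan(oob_proba).any(axis=1)      # False where a sample was never out-of-bag
oob_pred = forest.classes_[np.argmax(oob_proba, axis=1)]
manual_oob_accuracy = float(np.mean(oob_pred == y))

print(manual_oob_accuracy, forest.oob_score_)
```

With only a handful of trees, has_oob_vote can contain False entries, which is the NaN case described above.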

Confusion matrix

pred_y_test = clf.predict(x_test)
# m = metrics.confusion_matrix(y_test, pred_y_test)
# display (m)
# For binary labels, ravel() flattens the 2x2 matrix in the order tn, fp, fn, tp
tn, fp, fn, tp = metrics.confusion_matrix(y_test, pred_y_test).ravel()
print ('matrix    label1   label0')
print ('predict1  {:<6d}   {:<6d}'.format(int(tp), int(fp)))
print ('predict0  {:<6d}   {:<6d}'.format(int(fn), int(tn)))

matrix    label1   label0
predict1  179      34
predict0  51       380
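
The usual summary metrics follow directly from these four counts. The sketch below plugs in the numbers printed above; the variable names are illustrative.

```python
# Counts read off the matrix above: rows are predictions, columns are true labels.
tp, fp = 179, 34
fn, tn = 51, 380

accuracy = (tp + tn) / (tp + fp + fn + tn)    # share of all predictions that are right
precision = tp / (tp + fp)                    # share of predicted positives that are right
recall = tp / (tp + fn)                       # share of true positives that were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```

These are the same four quantities that the cross-validation section estimates with scoring='accuracy', 'precision', 'recall' and 'f1'.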

Cross-validation

Model scores under cross-validation

score_x = x_train
score_y = y_train

# Accuracy
scores = cross_val_score(clf, score_x, score_y, cv=5, scoring='accuracy')
print('Cross-validated accuracy: '+str(scores.mean()))

Cross-validated accuracy: 0.8640974529346623

# Precision
scores = cross_val_score(clf, score_x, score_y, cv=5, scoring='precision')
print('Cross-validated precision: '+str(scores.mean()))

Cross-validated precision: 0.837064112977926

# Recall
scores = cross_val_score(clf, score_x, score_y, cv=5, scoring='recall')
print('Cross-validated recall: '+str(scores.mean()))

Cross-validated recall: 0.7876161919040481

# f1_score
scores = cross_val_score(clf, score_x, score_y, cv=5, scoring='f1')
print('Cross-validated f1_score: '+str(scores.mean()))

Cross-validated f1_score: 0.8135730532581953
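
Each cross_val_score call above refits the forest on every fold, so the four metrics come from four separate sets of fits. One alternative, sketched here on synthetic data (an assumption) with a fixed random_state, is cross_validate, which accepts several scorers and evaluates them on the same fitted models.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

X, y = make_classification(n_samples=300, n_features=8, n_informative=4, random_state=0)
forest = RandomForestClassifier(n_estimators=10, random_state=0)

metric_names = ["accuracy", "precision", "recall", "f1"]
cv_results = cross_validate(forest, X, y, cv=5, scoring=metric_names)

# cross_validate stores each metric's per-fold scores under the key "test_<name>".
mean_scores = {name: float(cv_results[f"test_{name}"].mean()) for name in metric_names}
for name, value in mean_scores.items():
    print(f"{name}: {value:.4f}")
```

Because all four metrics are computed on identical folds and identical models, they are directly comparable, which is not guaranteed when an unseeded forest is refit for each metric.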

Grid search for the best parameters

param_grid = [
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
]
clf = RandomForestClassifier()
# clf = xgb.XGBClassifier()
# With 0/1 labels, neg_mean_squared_error equals accuracy - 1,
# so it ranks the candidates exactly as scoring='accuracy' would.
grid_search = GridSearchCV(clf, param_grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(x_train, y_train)
print (grid_search.best_params_)
print (grid_search.best_estimator_)

{'max_features': 8, 'n_estimators': 30}
RandomForestClassifier(max_features=8, n_estimators=30)
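
best_params_ only reports the winner. To see how close the other candidates were, the full results table can be inspected through cv_results_. The sketch below uses synthetic data, a smaller grid and scoring='accuracy', all of which are assumptions made to keep the example quick.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=200, n_features=8, n_informative=4, random_state=0)
param_grid = [
    {"n_estimators": [3, 10], "max_features": [2, 4]},
    {"bootstrap": [False], "n_estimators": [3], "max_features": [2, 3]},
]
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3, scoring="accuracy")
search.fit(X, y)

# One row per parameter combination, best-ranked first.
results = pd.DataFrame(search.cv_results_)[["params", "mean_test_score", "rank_test_score"]]
results = results.sort_values("rank_test_score").reset_index(drop=True)
print(results.to_string(index=False))
```

When several rows score within a fold's noise of each other, the cheaper configuration (fewer trees, fewer features per split) is often a reasonable choice over the nominal best.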

Viewing the distribution of positive and negative samples per feature

def KdePlot(df, label, factor, flag=None, positive=1):
    import seaborn as sns
    import matplotlib.pyplot as plt
    # Set up the kernel density plot
    plt.figure(figsize=(20, 10))
    sns.set(style='white')
    if positive == 0:
        df[factor] = np.abs(df[factor])
    else:
        pass
    if flag == 'log':
        x0 = np.log(df[df[label] == 0][factor] + 1)
        x1 = np.log(df[df[label] == 1][factor] + 1)
    else:
        x0 = df[df[label] == 0][factor]
        x1 = df[df[label] == 1][factor]
    # sns.distplot has been deprecated since seaborn 0.11; it is kept here as in the original post
    sns.distplot(x0,
                 color='blue',
                 kde=True,    # draw the density curve
                 hist=True,   # draw the histogram
                 # rug=True,  # rug plot
                 kde_kws={'shade': True, 'color': 'green', 'facecolor': 'green', 'label': 'label_0'},
                 rug_kws={'color': 'green', 'height': 0.1, 'alpha': 0.1})
    plt.xlabel('%s' % factor, fontsize=40)
    plt.ylabel('label_0', fontsize=30)
    plt.xticks(fontsize=30)
    plt.yticks(fontsize=30)
    plt.legend(loc='upper left', fontsize=30)
    plt.twinx()   # second y-axis so the two classes can use different scales
    sns.distplot(x1,
                 color='orange',
                 kde=True,    # draw the density curve
                 hist=True,   # draw the histogram
                 # rug=True,  # rug plot
                 kde_kws={'shade': True, 'color': 'red', 'facecolor': 'red', 'label': 'label_1'},
                 rug_kws={'color': 'red', 'height': 0.1, 'alpha': 0.2})
    #     plt.xlabel('%s'%factor,fontsize=40)
    plt.ylabel('label_1', fontsize=30)
    plt.xticks(fontsize=30)
    plt.yticks(fontsize=30)
    plt.legend(loc='upper right', fontsize=30)
    plt.show()

for factor in df_feature_importance['features'].values:
    KdePlot(data_original, 'Survived', factor)
