Data Mining Methods and Practice with R (3)

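A decision tree is built for two purposes: exploration and prediction. For exploration, the data used to grow the tree is the training data; once the tree is grown, it can be used to explore the information hidden in the data. For prediction, the rules derived from the tree can be applied to future data. Because what matters is how well the model classifies future data, a tree built on training data should be checked against test data to measure its robustness and classification performance. After a series of validation steps, the best classification rules are obtained and used to predict future data.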

1. Decision tree construction theory

Building a decision tree involves four steps: data preparation, tree growing, tree pruning, and rule extraction.

1.1 Data preparation

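The data analyzed by a decision tree contain two kinds of variables: the target variable, which is determined by the problem, and the attribute variables, which are chosen according to the background and context of the problem and serve as splitting variables. How easily the splitting variables can be understood and interpreted determines how useful the results of the analysis are. Attributes fall into four types: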

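(1) Binary attributes: the test condition produces two outcomes.

(2) Nominal attributes: the number of outcomes corresponds to the number of distinct attribute values; for example, blood type has four categories: A, B, AB, and O.

(3) Ordinal attributes: these can produce binary or multiway splits. Values may be grouped, but the groups must respect the order of the attribute values. For example, age can be divided into young, middle-aged, and elderly.

(4) Continuous attributes: the test condition can be written as x < a or x >= a. The tree must consider every possible split point a and choose the best one.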

Once the data have been collected, they are divided into a training set and a test set. The split can be done as follows:

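Data splitting divides the data into a training set, a test set, and a validation set. The training set is used to build the model, the test set is used to assess whether the model is overly complex and how well it generalizes, and the validation set is used to measure the quality of the model, for example by classification error rate or mean squared error. A good model should still fit unseen data well. If the error on the test data keeps growing as the model becomes more complex, the model is overfitting.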

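There are different conventions for the split ratio, but every subset should be representative of the original data. One approach is to draw 80% of the data as the training set for building the model and use the remaining 20% to check its validity. Another is k-fold cross-validation: the data are first divided into k equal parts; each time, k-1 parts are used to train the model and the remaining part is used to test it. This is repeated k times, so that every record serves in both the training set and the test set, and the averaged result represents the validity of the model. The method suits small samples and covers the whole data set effectively, but it is computationally expensive.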
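The two splitting schemes just described can be sketched in R as follows. This is only an illustration, not code from the analysis in this article: it uses the built-in iris data, and the seed, k = 5, and all variable names are assumptions.

```r
library(rpart)  # recommended package shipped with standard R installations

set.seed(1)     # arbitrary seed, only for reproducibility
n <- nrow(iris)

# (a) Hold-out split: 80% of the rows for training, the remaining 20% for testing
train_idx <- sample(n, size = round(0.8 * n))
train_set <- iris[train_idx, ]
test_set  <- iris[-train_idx, ]

# (b) k-fold cross-validation: every row is assigned to exactly one of k folds
k <- 5
fold_id  <- sample(rep(1:k, length.out = n))
cv_error <- numeric(k)
for (i in 1:k) {
  fit  <- rpart(Species ~ ., data = iris[fold_id != i, ])     # train on k-1 folds
  pred <- predict(fit, iris[fold_id == i, ], type = "class")  # test on the held-out fold
  cv_error[i] <- mean(pred != iris$Species[fold_id == i])     # misclassification rate
}
mean_cv_error <- mean(cv_error)  # average over the k folds
```

The loop refits the model k times, which is where the extra computing time of k-fold cross-validation comes from.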

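If a decision tree has a very low error rate on the training data but a very high error rate on the test set, the model is overfitting and cannot be used to predict other data. After a tree has been built, it should therefore be pruned appropriately according to its classification performance on the test data, which improves the correctness of its classifications or predictions and avoids overfitting.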

1.2 Splitting criteria

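The splitting criterion determines the size of the tree, both its width and its depth. Common criteria include information gain, the Gini index, the chi-square statistic, the gain ratio, and, for continuous targets, variance reduction.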

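Suppose the training data set has k classes C1, C2, ..., Ck, and attribute A takes l distinct values A1, A2, ..., Al. The counts can be arranged in the following contingency table: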

Attribute value	Class
	C1	C2	…	Ck	Total
A1	x11	x12	…	x1k	x1.
A2	x21	x22	…	x2k	x2.
…	…	…	…	…	…
Al	xl1	xl2	…	xlk	xl.
Total	x.1	x.2	…	x.k	N

(1) Information gain

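Information gain measures the amount of information under different conditions, based on the likelihood or probability of the different outcomes.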

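Let x.j be the number of records in class Cj and N the total number of records in the data set, so that the probability of class Cj is pj = x.j/N. From information theory, the information of each class is -log2(pj), so the total information Info(D) carried by classes C1, C2, ..., Ck is: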

Info(D)= - (x.1/N)*log2(x.1/N) - (x.2/N)*log2(x.2/N) - … - (x.k/N)*log2(x.k/N)

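Info(D) is also called entropy and is commonly used to measure how dispersed the data are. When all k classes are equally likely, Info(D) reaches its maximum, log2(k), which equals 1 in the two-class case; this is when the classification is most uncertain.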
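A minimal sketch of Info(D) in R, to make the behavior at the extremes concrete. The function name entropy and the example counts are assumptions made for illustration.

```r
# Entropy (in bits) of a vector of class counts; classes with zero count contribute 0
entropy <- function(counts) {
  p <- counts / sum(counts)
  p <- p[p > 0]
  -sum(p * log2(p))
}

h_two  <- entropy(c(50, 50))          # two equally likely classes: log2(2) = 1
h_four <- entropy(c(25, 25, 25, 25))  # four equally likely classes: log2(4) = 2
h_pure <- entropy(c(100, 0))          # a pure node carries no uncertainty: 0
```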

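Suppose data set D is split on attribute A, producing l subsets Di, where xi. is the total number of records with attribute value Ai and xij is the number of records with attribute value Ai that belong to class Cj. The information Info(Ai) under attribute value Ai is then:

Info(Ai) = - (xi1/xi.)*log2(xi1/xi.) - (xi2/xi.)*log2(xi2/xi.) - … - (xik/xi.)*log2(xik/xi.)

The information of attribute A weights these values by the number of records under each attribute value: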

InfoA(D)= (x1./N)*Info(A1) + (x2./N)*Info(A2) + … + (xl./N)*Info(Al)

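Information gain is the total information of the original data minus the total information after the split, Gain(A) = Info(D) - InfoA(D), and expresses how much attribute A contributes as a splitting attribute. The same calculation gives the contribution of every other candidate attribute, and comparing them identifies the attribute with the best information gain.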
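The calculation can be sketched in R directly from a contingency table laid out as above (rows are attribute values, columns are classes). The counts below are hypothetical, and the function names are assumptions made for illustration.

```r
# Entropy (in bits) of a vector of class counts
entropy <- function(counts) {
  p <- counts / sum(counts)
  p <- p[p > 0]
  -sum(p * log2(p))
}

# Hypothetical counts: rows = attribute values A1..A3, columns = classes C1, C2
x <- matrix(c(2, 3,
              4, 0,
              3, 2),
            ncol = 2, byrow = TRUE,
            dimnames = list(A = c("A1", "A2", "A3"), Class = c("C1", "C2")))

info_gain <- function(x) {
  N       <- sum(x)
  info_D  <- entropy(colSums(x))             # Info(D), from the column totals x.j
  info_Ai <- apply(x, 1, entropy)            # Info(Ai), one value per row
  info_A  <- sum(rowSums(x) / N * info_Ai)   # InfoA(D), weighted by xi./N
  info_D - info_A
}

gain <- info_gain(x)   # about 0.247 for these counts
```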

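(2) Gini index

The Gini index measures the impurity of a data set with respect to all classes:

Gini(D) = 1 - p1^2 - p2^2 - … - pk^2, where pj = x.j/N

The impurity Gini(Ai) of the subset under each attribute value Ai is:

Gini(Ai) = 1 - (xi1/xi.)^2 - (xi2/xi.)^2 - … - (xik/xi.)^2

The total impurity of the data after splitting on attribute A is: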

GiniA(D)= (x1./N)*Gini(A1) + (x2./N)*Gini(A2) + … + (xl./N)*Gini(Al)

The contribution of attribute A to the reduction in impurity is:

deltaGini(A) = Gini(D) - GiniA(D)
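As a small worked sketch, the Gini formulas can be evaluated in R on a contingency table of hypothetical counts (rows are attribute values, columns are classes). The function names are assumptions made for illustration.

```r
# Gini impurity of a vector of class counts
gini <- function(counts) {
  p <- counts / sum(counts)
  1 - sum(p^2)
}

# Hypothetical counts: rows = attribute values A1..A3, columns = classes C1, C2
x <- matrix(c(2, 3,
              4, 0,
              3, 2),
            ncol = 2, byrow = TRUE)

gini_D     <- gini(colSums(x))                            # Gini(D)
gini_Ai    <- apply(x, 1, gini)                           # Gini(Ai) for each row
gini_A     <- sum(rowSums(x) / sum(x) * gini_Ai)          # GiniA(D)
delta_gini <- gini_D - gini_A                             # about 0.116 for these counts
```

The second row is pure (all records in one class), so its Gini(Ai) is 0 and it contributes nothing to GiniA(D).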

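(3) Chi-square statistic

The chi-square statistic uses a contingency table to measure the dependence between two variables, here a candidate attribute and the target. The larger the sample chi-square value, the stronger the dependence between the two variables.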
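A minimal sketch of the statistic in R, computed by hand from observed and expected counts and checked against the base function chisq.test. The counts are hypothetical and the variable names are assumptions.

```r
# Hypothetical counts: rows = attribute values, columns = classes
x <- matrix(c(20, 30,
              40, 10,
              30, 20),
            ncol = 2, byrow = TRUE)

# Expected counts under independence: (row total * column total) / N
expected <- outer(rowSums(x), colSums(x)) / sum(x)

# Pearson chi-square statistic: sum over cells of (observed - expected)^2 / expected
chi_sq <- sum((x - expected)^2 / expected)

# The same statistic from base R
chi_sq_base <- unname(chisq.test(x)$statistic)
```

When several attributes are compared, the one with the strongest dependence on the target is the preferred split.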

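(4) Gain ratio

The gain ratio takes into account the information carried by the candidate attribute itself. The best splitting attribute is found by computing the ratio of the information gain to the information of the splitting attribute.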
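A sketch of the ratio in R, assuming the usual C4.5 definition in which the attribute's own information (often called split information) is the entropy of the row totals xi./N. The counts are hypothetical and the function names are assumptions.

```r
# Entropy (in bits) of a vector of counts
entropy <- function(counts) {
  p <- counts / sum(counts)
  p <- p[p > 0]
  -sum(p * log2(p))
}

# Hypothetical counts: rows = attribute values, columns = classes
x <- matrix(c(2, 3,
              4, 0,
              3, 2),
            ncol = 2, byrow = TRUE)

gain       <- entropy(colSums(x)) -
              sum(rowSums(x) / sum(x) * apply(x, 1, entropy))  # information gain
split_info <- entropy(rowSums(x))   # information of the attribute itself
gain_ratio <- gain / split_info     # about 0.156 for these counts
```

Dividing by the split information penalizes attributes with many distinct values, which would otherwise tend to show a large information gain.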

(5) Variance reduction

When the target variable is continuous, variance reduction can be used as the splitting criterion.
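The text gives no formula for this criterion, so the following R sketch assumes a common definition: the variance of the target in the parent node minus the size-weighted variances in the two child nodes of a split x < a. The data and function names are hypothetical. The sketch also scans all candidate split points, as described for continuous attributes in section 1.1.

```r
# Population variance (mean squared deviation from the mean)
pvar <- function(y) mean((y - mean(y))^2)

# Variance reduction achieved by splitting on x < a versus x >= a
variance_reduction <- function(x, y, a) {
  left  <- y[x <  a]
  right <- y[x >= a]
  pvar(y) - length(left)  / length(y) * pvar(left) -
            length(right) / length(y) * pvar(right)
}

# Hypothetical data: a continuous attribute x and a continuous target y
x <- 1:6
y <- c(1, 2, 3, 10, 11, 12)

# Candidate split points: midpoints between consecutive distinct values of x
xs         <- sort(unique(x))
candidates <- (xs[-length(xs)] + xs[-1]) / 2
reductions <- sapply(candidates, function(a) variance_reduction(x, y, a))
best_split <- candidates[which.max(reductions)]   # 3.5 for these data
```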

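1.3 Tree pruning

A decision tree can be pruned before or after it is fully grown (pre-pruning and post-pruning). Pre-pruning is applied while the tree is growing: thresholds that stop growth are set in advance, and a node is not split further when the evaluation measure of the split fails to reach the threshold, for example by requiring an information gain greater than 0.1 or a sufficient number of samples in a node. Pre-pruning is efficient, but it may prune too much. Post-pruning, which removes branches after the tree has been grown, is less efficient but is quite effective against overfitting.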
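In rpart, both styles can be sketched as follows: growth thresholds are passed through rpart.control, and prune cuts back an already grown tree. The iris data, the threshold values, and the helper n_leaves are assumptions chosen only for illustration.

```r
library(rpart)

# Pre-pruning: thresholds set before fitting stop the tree from growing further
pre_pruned <- rpart(Species ~ ., data = iris, method = "class",
                    control = rpart.control(minsplit = 20,  # min cases in a node to try a split
                                            cp = 0.1,       # min improvement a split must bring
                                            maxdepth = 3))  # max depth of the tree

# Post-pruning: grow a large tree first, then cut it back
full_tree   <- rpart(Species ~ ., data = iris, method = "class",
                     control = rpart.control(cp = 0, minsplit = 2))
post_pruned <- prune(full_tree, cp = 0.1)

# Hypothetical helper: number of terminal nodes (leaves) in a fitted tree
n_leaves <- function(tree) sum(tree$frame$var == "<leaf>")
leaf_counts <- c(full = n_leaves(full_tree),
                 post = n_leaves(post_pruned),
                 pre  = n_leaves(pre_pruned))
```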

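1.4 Rule extraction

Once the tree has been grown and pruned, it can be used to extract the information hidden in the data: each path from the root to a leaf can be read as a classification rule.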
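One possible way to list the rules of an rpart tree is sketched below. It relies on path.rpart, which returns the split conditions on the path from the root to each requested node; the iris data, the IF/THEN wording, and the variable names are assumptions made for illustration.

```r
library(rpart)

fit <- rpart(Species ~ ., data = iris, method = "class")

frame      <- fit$frame
is_leaf    <- frame$var == "<leaf>"
leaf_ids   <- as.numeric(rownames(frame)[is_leaf])        # node numbers of the leaves
leaf_class <- attr(fit, "ylevels")[frame$yval[is_leaf]]   # predicted class at each leaf

# Conditions from the root to each leaf; the first element of every path is "root"
paths <- path.rpart(fit, nodes = leaf_ids, print.it = FALSE)

# One IF ... THEN ... rule per leaf
rules <- mapply(function(p, cls) {
  paste("IF", paste(p[-1], collapse = " AND "), "THEN", cls)
}, paths, leaf_class)

print(unname(rules))
```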

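2. Decision tree algorithms

Algorithm		CART	C4.5/C5.0	CHAID
Data types handled		Discrete, continuous	Discrete, continuous	Discrete
Splitting of continuous data		Binary only	Unrestricted	Not supported
Splitting criterion	Categorical dependent variable	Gini index	Gain ratio	Chi-square test
Splitting criterion	Continuous dependent variable	Variance reduction	Variance reduction	Chi-square test or F test (after conversion to a categorical variable)
Splitting method	Categorical independent variable	Binary split	Multiway split	Multiway split
Splitting method	Continuous independent variable	Binary split	Binary split	Multiway split (after conversion to a categorical variable)
Pruning method		Cost-complexity pruning	Error-based pruning	None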

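3. Model evaluation

The classification and prediction performance of a decision tree model can be evaluated from two angles: (1) objectively, by comparing candidate trees on the test data, for example by classification error rate; (2) because the rules extracted depend on the problem, after the objective evaluation a domain expert usually still needs to choose the most suitable tree in light of the problem background.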
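The objective part of the evaluation usually starts from a confusion matrix. The following minimal R sketch uses ten hypothetical labels rather than results from this article; the variable names are assumptions.

```r
# Hypothetical true and predicted labels for ten test cases
actual    <- factor(c("No", "No", "No", "No", "No", "No", "Yes", "Yes", "Yes", "Yes"),
                    levels = c("No", "Yes"))
predicted <- factor(c("No", "No", "No", "No", "No", "Yes", "Yes", "Yes", "No", "No"),
                    levels = c("No", "Yes"))

# Confusion matrix: rows = true class, columns = predicted class
cm <- table(Type = actual, predict = predicted)

accuracy   <- sum(diag(cm)) / sum(cm)   # share of correctly classified cases
error_rate <- 1 - accuracy              # classification error rate
per_class  <- diag(cm) / rowSums(cm)    # share of each true class classified correctly
```

Looking at per_class alongside the overall accuracy helps spot a model that does well only on the majority class.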

4. Decision tree applications

4.1 CART decision tree

Load the packages and the data set

> library(rpart)
> library(MASS)
> data("Pima.tr")
> str(Pima.tr)
'data.frame': 200 obs. of  8 variables:
 $ npreg: int  5 7 5 0 0 5 3 1 3 2 ...
 $ glu  : int  86 195 77 165 107 97 83 193 142 128 ...
 $ bp   : int  68 70 82 76 60 76 58 50 80 78 ...
 $ skin : int  28 33 41 43 25 27 31 16 15 37 ...
 $ bmi  : num  30.2 25.1 35.8 47.9 26.4 35.6 34.3 25.9 32.4 43.3 ...
 $ ped  : num  0.364 0.163 0.156 0.259 0.133 ...
 $ age  : int  24 55 35 26 23 52 25 24 63 31 ...
 $ type : Factor w/ 2 levels "No","Yes": 1 2 1 1 1 2 1 1 1 2 ...

The Pima data are already split into two parts: Pima.tr is the training set and Pima.te is the test set.

> # First build the tree without pruning, by setting the complexity parameter cp to 0
> cart_tree1 = rpart(type~., Pima.tr, control = rpart.control(cp = 0))
> summary(cart_tree1)
Call:
rpart(formula = type ~ ., data = Pima.tr, control = rpart.control(cp = 0))
  n= 200

          CP nsplit rel error    xerror       xstd
1 0.22058824      0 1.0000000 1.0000000 0.09851844
2 0.16176471      1 0.7794118 0.9852941 0.09816108
3 0.07352941      2 0.6176471 0.8235294 0.09337946
4 0.05882353      3 0.5441176 0.7941176 0.09233140
5 0.01470588      4 0.4852941 0.6176471 0.08470895
6 0.00000000      7 0.4411765 0.7500000 0.09064718

Node number 1: 200 observations,    complexity param=0.2205882
  predicted class=No   expected loss=0.34  P(node) =1
    class counts:   132    68
   probabilities: 0.660 0.340
  left son=2 (109 obs) right son=3 (91 obs)
  Primary splits:
      glu   < 123.5  to the left,  improve=19.624700, (0 missing)
      age   < 28.5   to the left,  improve=15.016410, (0 missing)
      npreg < 6.5    to the left,  improve=10.465630, (0 missing)
      bmi   < 27.35  to the left,  improve= 9.727105, (0 missing)
      skin  < 22.5   to the left,  improve= 8.201159, (0 missing)
  Surrogate splits:
      age   < 30.5   to the left,  agree=0.685, adj=0.308, (0 split)
      bp    < 77     to the left,  agree=0.650, adj=0.231, (0 split)
      npreg < 6.5    to the left,  agree=0.640, adj=0.209, (0 split)
      skin  < 32.5   to the left,  agree=0.635, adj=0.198, (0 split)
      bmi   < 30.85  to the left,  agree=0.575, adj=0.066, (0 split)

Node number 2: 109 observations,    complexity param=0.01470588
  predicted class=No   expected loss=0.1376147  P(node) =0.545
    class counts:    94    15
   probabilities: 0.862 0.138
  left son=4 (74 obs) right son=5 (35 obs)
  Primary splits:
      age   < 28.5   to the left,  improve=3.2182780, (0 missing)
      npreg < 6.5    to the left,  improve=2.4578310, (0 missing)
      bmi   < 33.5   to the left,  improve=1.6403660, (0 missing)
      bp    < 59     to the left,  improve=0.9851960, (0 missing)
      skin  < 24     to the left,  improve=0.8342926, (0 missing)
  Surrogate splits:
      npreg < 4.5    to the left,  agree=0.798, adj=0.371, (0 split)
      bp    < 77     to the left,  agree=0.734, adj=0.171, (0 split)
      skin  < 36.5   to the left,  agree=0.725, adj=0.143, (0 split)
      bmi   < 38.85  to the left,  agree=0.716, adj=0.114, (0 split)
      glu   < 66     to the right, agree=0.688, adj=0.029, (0 split)

Node number 3: 91 observations,    complexity param=0.1617647
  predicted class=Yes  expected loss=0.4175824  P(node) =0.455
    class counts:    38    53
   probabilities: 0.418 0.582
  left son=6 (35 obs) right son=7 (56 obs)
  Primary splits:
      ped  < 0.3095 to the left,  improve=6.528022, (0 missing)
      bmi  < 28.65  to the left,  improve=6.473260, (0 missing)
      skin < 19.5   to the left,  improve=4.778504, (0 missing)
      glu  < 166    to the left,  improve=4.104532, (0 missing)
      age  < 39.5   to the left,  improve=3.607390, (0 missing)
  Surrogate splits:
      glu   < 126.5  to the left,  agree=0.670, adj=0.143, (0 split)
      bp    < 93     to the right, agree=0.659, adj=0.114, (0 split)
      bmi   < 27.45  to the left,  agree=0.659, adj=0.114, (0 split)
      npreg < 9.5    to the right, agree=0.648, adj=0.086, (0 split)
      skin  < 20.5   to the left,  agree=0.637, adj=0.057, (0 split)

Node number 4: 74 observations
  predicted class=No   expected loss=0.05405405  P(node) =0.37
    class counts:    70     4
   probabilities: 0.946 0.054

Node number 5: 35 observations,    complexity param=0.01470588
  predicted class=No   expected loss=0.3142857  P(node) =0.175
    class counts:    24    11
   probabilities: 0.686 0.314
  left son=10 (9 obs) right son=11 (26 obs)
  Primary splits:
      glu  < 90     to the left,  improve=2.3934070, (0 missing)
      bmi  < 33.4   to the left,  improve=1.3714290, (0 missing)
      bp   < 68     to the right, improve=0.9657143, (0 missing)
      ped  < 0.334  to the left,  improve=0.9475564, (0 missing)
      skin < 39.5   to the right, improve=0.7958592, (0 missing)
  Surrogate splits:
      ped < 0.1795 to the left,  agree=0.8, adj=0.222, (0 split)

Node number 6: 35 observations,    complexity param=0.05882353
  predicted class=No   expected loss=0.3428571  P(node) =0.175
    class counts:    23    12
   probabilities: 0.657 0.343
  left son=12 (27 obs) right son=13 (8 obs)
  Primary splits:
      glu   < 166    to the left,  improve=3.438095, (0 missing)
      ped   < 0.2545 to the right, improve=1.651429, (0 missing)
      skin  < 25.5   to the left,  improve=1.651429, (0 missing)
      npreg < 3.5    to the left,  improve=1.078618, (0 missing)
      bp    < 73     to the right, improve=1.078618, (0 missing)
  Surrogate splits:
      bp < 94.5   to the left,  agree=0.8, adj=0.125, (0 split)

Node number 7: 56 observations,    complexity param=0.07352941
  predicted class=Yes  expected loss=0.2678571  P(node) =0.28
    class counts:    15    41
   probabilities: 0.268 0.732
  left son=14 (11 obs) right son=15 (45 obs)
  Primary splits:
      bmi   < 28.65  to the left,  improve=5.778427, (0 missing)
      age   < 39.5   to the left,  improve=3.259524, (0 missing)
      npreg < 6.5    to the left,  improve=2.133215, (0 missing)
      ped   < 0.8295 to the left,  improve=1.746894, (0 missing)
      skin  < 22     to the left,  improve=1.474490, (0 missing)
  Surrogate splits:
      skin < 19.5   to the left,  agree=0.839, adj=0.182, (0 split)

Node number 10: 9 observations
  predicted class=No   expected loss=0  P(node) =0.045
    class counts:     9     0
   probabilities: 1.000 0.000

Node number 11: 26 observations,    complexity param=0.01470588
  predicted class=No   expected loss=0.4230769  P(node) =0.13
    class counts:    15    11
   probabilities: 0.577 0.423
  left son=22 (19 obs) right son=23 (7 obs)
  Primary splits:
      bp    < 68     to the right, improve=1.6246390, (0 missing)
      bmi   < 33.4   to the left,  improve=1.6173080, (0 missing)
      npreg < 6.5    to the left,  improve=0.9423077, (0 missing)
      skin  < 39.5   to the right, improve=0.6923077, (0 missing)
      ped   < 0.334  to the left,  improve=0.4923077, (0 missing)
  Surrogate splits:
      glu < 94.5   to the right, agree=0.808, adj=0.286, (0 split)
      ped < 0.2105 to the right, agree=0.808, adj=0.286, (0 split)

Node number 12: 27 observations
  predicted class=No   expected loss=0.2222222  P(node) =0.135
    class counts:    21     6
   probabilities: 0.778 0.222

Node number 13: 8 observations
  predicted class=Yes  expected loss=0.25  P(node) =0.04
    class counts:     2     6
   probabilities: 0.250 0.750

Node number 14: 11 observations
  predicted class=No   expected loss=0.2727273  P(node) =0.055
    class counts:     8     3
   probabilities: 0.727 0.273

Node number 15: 45 observations
  predicted class=Yes  expected loss=0.1555556  P(node) =0.225
    class counts:     7    38
   probabilities: 0.156 0.844

Node number 22: 19 observations
  predicted class=No   expected loss=0.3157895  P(node) =0.095
    class counts:    13     6
   probabilities: 0.684 0.316

Node number 23: 7 observations
  predicted class=Yes  expected loss=0.2857143  P(node) =0.035
    class counts:     2     5
   probabilities: 0.286 0.714
> par(xpd = TRUE); plot(cart_tree1); text(cart_tree1)

> # Predict on the test set and compute the prediction accuracy
> pre_cart_tree1 = predict(cart_tree1, Pima.te, type = "class")
> matrix1 = table(Type = Pima.te$type, predict = pre_cart_tree1)
> matrix1
     predict
Type   No Yes
  No  223   0
  Yes 109   0
> accuracy_tree1 = sum(diag(matrix1))/sum(matrix1)
> accuracy_tree1
[1] 0.6716867
> # Prune the fitted tree, setting cp to 0.03
> cart_tree2 = prune(cart_tree1, cp = 0.03)
> par(xpd = TRUE); plot(cart_tree2); text(cart_tree2)

> # Predict on the test set with the pruned model and compute the prediction accuracy
> pre_cart_tree2 = predict(cart_tree2, Pima.te, type = "class")
> matrix2 = table(Type = Pima.te$type, predict = pre_cart_tree2)
> matrix2
     predict
Type   No Yes
  No  223   0
  Yes 109   0
> accuracy_tree2 = sum(diag(matrix2))/sum(matrix2)
> accuracy_tree2
[1] 0.6716867
> # Prune the tree further, setting cp to 0.1
> cart_tree3 = prune(cart_tree2, cp = 0.1)
> par(xpd = TRUE); plot(cart_tree3); text(cart_tree3)

> # Predict on the test set with the pruned model and compute the prediction accuracy
> pre_cart_tree3 = predict(cart_tree3, Pima.te, type = "class")
> matrix3 = table(Type = Pima.te$type, predict = pre_cart_tree3)
> matrix3
     predict
Type   No Yes
  No  223   0
  Yes 109   0
> accuracy_tree3 = sum(diag(matrix3))/sum(matrix3)
> accuracy_tree3
[1] 0.6716867

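In the output above, all three trees reach the same test accuracy, 0.6717: every test case is predicted as No, so the accuracy simply equals the share of No cases in Pima.te (223/332). On this output, pruning with cp = 0.03 or cp = 0.1 therefore costs no accuracy while greatly simplifying the model, but the comparison cannot show that one cp value is more accurate than another.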
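Rather than trying cp values by hand, one common approach is to read cp from the tree's complexity table and prune at the row with the smallest cross-validated error (xerror); in the table printed by summary(cart_tree1) above, that is row 5, with CP = 0.0147 and xerror = 0.6176. The sketch below refits the tree so that it is self-contained; the seed and the names cart_full, best_cp, and cart_best are assumptions, and because xerror comes from random cross-validation folds, the selected row can differ between runs.

```r
library(rpart)
library(MASS)   # provides the Pima.tr data

set.seed(123)   # arbitrary seed; xerror depends on random cross-validation folds
cart_full <- rpart(type ~ ., data = Pima.tr, control = rpart.control(cp = 0))

cp_table <- cart_full$cptable                       # columns: CP, nsplit, rel error, xerror, xstd
best_row <- which.min(cp_table[, "xerror"])         # row with the lowest cross-validated error
best_cp  <- cp_table[best_row, "CP"]

cart_best <- prune(cart_full, cp = best_cp)         # prune back to that complexity
printcp(cart_full)                                  # prints the same complexity table
```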

4.2 C5.0 decision tree

> # C5.0 decision tree analysis
> library(C50)
> library(MASS)
> data("Pima.tr")
> str(Pima.tr)
'data.frame': 200 obs. of  8 variables:
 $ npreg: int  5 7 5 0 0 5 3 1 3 2 ...
 $ glu  : int  86 195 77 165 107 97 83 193 142 128 ...
 $ bp   : int  68 70 82 76 60 76 58 50 80 78 ...
 $ skin : int  28 33 41 43 25 27 31 16 15 37 ...
 $ bmi  : num  30.2 25.1 35.8 47.9 26.4 35.6 34.3 25.9 32.4 43.3 ...
 $ ped  : num  0.364 0.163 0.156 0.259 0.133 ...
 $ age  : int  24 55 35 26 23 52 25 24 63 31 ...
 $ type : Factor w/ 2 levels "No","Yes": 1 2 1 1 1 2 1 1 1 2 ...
> C50_tree2 = C5.0(type~., Pima.tr, control=C5.0Control(noGlobalPruning = TRUE)) # do not prune the tree
> summary(C50_tree2)

Call:
C5.0.formula(formula = type ~ ., data = Pima.tr, control = C5.0Control(noGlobalPruning = TRUE))

C5.0 [Release 2.07 GPL Edition]      Sat Sep 16 12:12:54 2017
-------------------------------

Class specified by attribute `outcome'

Read 200 cases (8 attributes) from undefined.data

Decision tree:

glu <= 123: No (109/15)
glu > 123:
:...bmi > 28.6:
    :...ped <= 0.344: No (29/12)
    :   ped > 0.344: Yes (41/5)
    bmi <= 28.6:
    :...age <= 32: No (11)
        age > 32:
        :...bp > 80: No (3)
            bp <= 80:
            :...ped <= 0.162: No (2)
                ped > 0.162: Yes (5)

Evaluation on training data (200 cases):

        Decision Tree
      ----------------
      Size      Errors

         7   32(16.0%)   <<

       (a)   (b)    <-classified as
      ----  ----
       127     5    (a): class No
        27    41    (b): class Yes

    Attribute usage:

    100.00%  glu
     45.50%  bmi
     38.50%  ped
     10.50%  age
      5.00%  bp

Time: 0.0 secs
> plot(C50_tree2)

> C50_tree3 = C5.0(type~., Pima.tr, control=C5.0Control(noGlobalPruning = FALSE)) # prune the tree
> summary(C50_tree3)

Call:
C5.0.formula(formula = type ~ ., data = Pima.tr, control = C5.0Control(noGlobalPruning = FALSE))

C5.0 [Release 2.07 GPL Edition]     Sat Sep 16 12:14:14 2017
-------------------------------

Class specified by attribute `outcome'

Read 200 cases (8 attributes) from undefined.data

Decision tree:

glu <= 123: No (109/15)
glu > 123:
:...bmi <= 28.6: No (21/5)
    bmi > 28.6:
    :...ped <= 0.344: No (29/12)
        ped > 0.344: Yes (41/5)

Evaluation on training data (200 cases):

        Decision Tree
      ----------------
      Size      Errors

         4   37(18.5%)   <<

       (a)   (b)    <-classified as
      ----  ----
       127     5    (a): class No
        32    36    (b): class Yes

    Attribute usage:

    100.00%  glu
     45.50%  bmi
     35.00%  ped

Time: 0.0 secs
> plot(C50_tree3)

> pre_C50_Cla2 = predict(C50_tree2, Pima.te, type = "class")
> matrix2 = table(Type = Pima.te$type, predict = pre_C50_Cla2)
> matrix2
     predict
Type   No Yes
  No  193  30
  Yes  58  51
> accuracy_tree2 = sum(diag(matrix2))/sum(matrix2)
> accuracy_tree2
[1] 0.7349398
> pre_C50_Cla3 = predict(C50_tree3, Pima.te, type = "class")
> matrix3 = table(Type = Pima.te$type, predict = pre_C50_Cla3)
> matrix3
     predict
Type   No Yes
  No  195  28
  Yes  60  49
> accuracy_tree3 = sum(diag(matrix3))/sum(matrix3)
> accuracy_tree3
[1] 0.7349398

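Pruning makes no difference to test accuracy here (both trees score 0.7349), but the pruned model, with 4 leaves instead of 7, is clearly easier to interpret.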

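4.3 CHAID decision tree

> # CHAID decision tree analysis
> # CHAID can only handle discrete attributes, so every continuous variable in the data must first be converted to a discrete one; post-pruning does not need to be considered.
> install.packages("CHAID") # if the package cannot be found, download it from https://r-forge.r-project.org/R/?group_id=343 and install it
> library(CHAID)
> library(MASS)
> # Load the training and test data sets
> data("Pima.tr")
> data("Pima.te")
> # Merge the two data sets
> Pima = rbind(Pima.tr, Pima.te)
> # Discretize the data and print the resulting interval labels for each attribute
> level_name = NULL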
> for(i in 1:7)
+ {
+   Pima[,i] = cut(Pima[,i], breaks = 3, ordered_result = TRUE, include.lowest = TRUE)
+   level_name <- rbind(level_name, levels(Pima[,i]))
+ }
> level_name = data.frame(level_name)
> row.names(level_name) = colnames(Pima)[1:7]
> colnames(level_name) = paste("L",1:3,sep="")
> level_name
                  L1           L2          L3
npreg  [-0.017,5.67]  (5.67,11.3]   (11.3,17]
glu       [55.9,104]    (104,151]   (151,199]
bp       [23.9,52.7]  (52.7,81.3]  (81.3,110]
skin     [6.91,37.7]  (37.7,68.3] (68.3,99.1]
bmi      [18.2,34.5]  (34.5,50.8] (50.8,67.1]
ped   [0.0827,0.863] (0.863,1.64] (1.64,2.42]
age        [20.9,41]      (41,61]   (61,81.1]
> # Use the first 200 rows as the training set and the remaining 332 rows as the test set
> Pima.tr = Pima[1:200,]
> Pima.te = Pima[201:nrow(Pima),]
> CHAID_tree = chaid(type~., Pima.tr)
> CHAID_tree

Model formula:
type ~ npreg + glu + bp + skin + bmi + ped + age

Fitted party:
[1] root
|   [2] glu in [55.9,104]
|   |   [3] age in [20.9,41]: No (n = 50, err = 6.0%)
|   |   [4] age in (41,61], (61,81.1]: No (n = 10, err = 40.0%)
|   [5] glu in (104,151]
|   |   [6] age in [20.9,41]: No (n = 86, err = 27.9%)
|   |   [7] age in (41,61], (61,81.1]: Yes (n = 15, err = 26.7%)
|   [8] glu in (151,199]: Yes (n = 39, err = 33.3%)

Number of inner nodes:    3
Number of terminal nodes: 5
> plot(CHAID_tree)

> # Predict on the test set and compute the prediction accuracy
> pre_CHAID_tree = predict(CHAID_tree, Pima.te)
> matrix = table(Type = Pima.te$type, predict = pre_CHAID_tree)
> matrix
     predict
Type   No Yes
  No  199  24
  Yes  47  62
> accuracy_tree = sum(diag(matrix))/sum(matrix)
> accuracy_tree
[1] 0.7861446