Study notes, for reference only; corrections welcome.


Regression Trees and Rule-Based Models

Regression Model Trees

One limitation of simple regression trees is that each terminal node uses the average of the training set outcomes in that node for prediction. As a consequence, these models may not do a good job predicting samples whose true outcomes are extremely high or low.

One approach to dealing with this issue is to use a different estimator in the terminal nodes.

Here we focus on the model tree approach described in Quinlan (1992) called M5, which is similar to regression trees except:

  • the splitting criterion is different;
  • the terminal nodes predict the outcome using a linear model (rather than the simple average);
  • the prediction for a new sample is usually a combination of the predictions from several different models along the same path through the tree.

Like simple regression trees, the initial split is found using an exhaustive search over the predictors and training set samples, but, unlike those models, the expected reduction in the node's error rate is used. Let $S$ denote the entire set of data and let $S_1, S_2, \ldots, S_P$ represent the $P$ subsets of the data after splitting. The split criterion would be:

$$\text{reduction} = \mathrm{SD}(S) - \sum_{i=1}^{P} \frac{n_i}{n} \times \mathrm{SD}(S_i) \tag{1}$$

where $\mathrm{SD}$ is the standard deviation and $n_i$ is the number of samples in partition $i$.

This metric determines if the total variation in the splits, weighted by sample size, is lower than in the presplit data.
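As a minimal sketch (my own illustration, not Quinlan's implementation), Eq. 1 can be computed directly with NumPy; `sd_reduction` and the toy data below are hypothetical names:

```python
import numpy as np

def sd_reduction(y, partitions):
    """Eq. 1: reduction in standard deviation from splitting the node's
    outcomes y into candidate partitions S_1, ..., S_P."""
    n = len(y)
    weighted_sd = sum(len(s) / n * np.std(s) for s in partitions)
    return np.std(y) - weighted_sd

# A split separating low outcomes from high ones yields a large positive
# reduction; a useless "split" into the whole set yields zero.
y = np.array([1.0, 1.2, 0.8, 9.0, 9.5, 8.5])
print(sd_reduction(y, [y[:3], y[3:]]))  # positive: the split removes most variation
print(sd_reduction(y, [y]))             # 0.0
```

The search evaluates this quantity for every candidate split and keeps the one with the largest value.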

The split that is associated with the largest reduction in error is chosen, and a linear model is created within the partitions using the split variable in the model.

For subsequent splitting iterations, this process is repeated:

an initial split is determined and a linear model is created for the partition using the current split variable and all others that preceded it.

The error associated with each linear model is used in place of $\mathrm{SD}(S)$ in Eq. 1 to determine the expected reduction in the error rate for the next split.

The tree growing process continues along the branches of the tree until there are no further improvements in the error rate or there are not enough samples to continue the process. Once the tree is fully grown, there is a linear model for every node in the tree.

Once the complete set of linear models has been created, each undergoes a simplification procedure to potentially drop some of the terms. For a given model, an adjusted error rate is computed. First, the absolute differences between the observed and predicted data are calculated, then multiplied by a term that penalizes models with a large number of parameters:

$$\text{Adjusted Error Rate} = \frac{n^* + p}{n^* - p} \sum_{i=1}^{n^*} \left| y_i - \hat{y}_i \right|$$

where $n^*$ is the number of training set data points that were used to build the model and $p$ is the number of parameters.
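Plugging numbers into this formula makes the penalty concrete. The function below is my own sketch, not library code:

```python
import numpy as np

def adjusted_error_rate(y, y_hat, p):
    """Sum of absolute errors, inflated by (n* + p) / (n* - p)."""
    n_star = len(y)
    return (n_star + p) / (n_star - p) * np.sum(np.abs(y - y_hat))

y = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([3.1, 4.8, 7.2, 9.1])
# The same total absolute error (0.6) is penalized more heavily as p grows:
print(adjusted_error_rate(y, y_hat, p=1))  # (5/3) * 0.6 ≈ 1.0
print(adjusted_error_rate(y, y_hat, p=2))  # (6/2) * 0.6 ≈ 1.8
```

Note that as $p$ approaches $n^*$ the multiplier blows up, so models with nearly as many parameters as training points are strongly discouraged.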

Each model term is dropped in turn and the adjusted error rate is computed. A term is removed from the model if dropping it does not make the adjusted error rate worse. In some cases, the linear model may be simplified to having only an intercept. This procedure is independently applied to each linear model.
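This simplification step can be sketched as a greedy backward elimination. Everything here is hypothetical: `fit_and_score` stands in for refitting the linear model on the listed terms and returning its adjusted error rate, and the toy scoring function only illustrates the mechanics.

```python
def simplify(terms, fit_and_score):
    """Greedily drop any term whose removal lowers the adjusted error rate."""
    best_terms, best_score = list(terms), fit_and_score(terms)
    improved = True
    while improved and best_terms:
        improved = False
        for t in list(best_terms):
            candidate = [u for u in best_terms if u != t]
            score = fit_and_score(candidate)
            if score < best_score:
                best_terms, best_score = candidate, score
                improved = True
    return best_terms  # may end up empty: an intercept-only model

# Toy score: x1 and x2 carry signal, "noise" only adds penalty.
def fit_and_score(ts):
    err = 0.6
    err += 1.0 if "x1" not in ts else 0.0
    err += 1.0 if "x2" not in ts else 0.0
    err += 0.2 if "noise" in ts else 0.0
    return err

print(simplify(["x1", "x2", "noise"], fit_and_score))  # ['x1', 'x2']
```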

Model trees also incorporate a type of smoothing to decrease the potential for over-fitting. The technique is based on the recursive shrinking methodology of Hastie and Pregibon (1990).

When predicting, the new sample goes down the appropriate path of the tree and, moving from the bottom up, the linear models along that path are combined.

The predictions from the child node and its parent node are combined to produce an updated parent-node prediction:

$$\hat{y}_{(p)}' = \frac{n_{(k)}\,\hat{y}_{(k)} + c\,\hat{y}_{(p)}}{n_{(k)} + c}$$

where $\hat{y}_{(k)}$ is the prediction from the child node, $n_{(k)}$ is the number of training set samples in the child node, $\hat{y}_{(p)}$ is the prediction from the parent node, and $c$ is a constant with a default value of 15.

Once this combined prediction is computed, $\hat{y}_{(p)}'$ is treated as the new child-node prediction and is combined with the prediction of its own parent in the same way, and so on up the tree.
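Walking this rule up one path of the tree can be sketched as follows (my own illustration; `smooth` and the sample sizes are not from the book):

```python
def smooth(leaf_pred, path, c=15.0):
    """Combine predictions from a leaf up toward the root.
    path: (child_n, parent_pred) pairs ordered from the leaf upward, where
    child_n is the training-set size of the child node at that step."""
    y = leaf_pred
    for child_n, parent_pred in path:
        y = (child_n * y + c * parent_pred) / (child_n + c)  # smoothing rule
    return y

# A small leaf (n = 5) is pulled strongly toward its parent's prediction,
# since the default c = 15 outweighs it:
print(smooth(10.0, [(5, 8.0)]))  # (5*10 + 15*8) / 20 = 8.5
```

Because $c$ defaults to 15, leaves built on few samples are shrunk heavily toward their parents, which is how the smoothing damps over-fitting in small partitions.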

Smoothing the models has the effect of minimizing collinearity issues. Removing the correlated predictors would produce a model with fewer inconsistencies that is more interpretable. However, there is a measurable drop in performance from using that strategy.
