Page 11
$R(d)=\frac{1}{N}\sum_{n=1}^{N}X(d(x_n)\neq j_n)\qquad(1.8)$
where $X(\cdot)$ is the indicator function.
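As a quick check of (1.8), a minimal NumPy sketch (the label arrays are hypothetical):

```python
import numpy as np

def resubstitution_risk(y_true, y_pred):
    """R(d) = (1/N) * sum_n X(d(x_n) != j_n), eq. (1.8)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.mean(y_pred != y_true)  # mean of the 0/1 indicator

# hypothetical labels: d misclassifies 2 of 5 cases, so R(d) = 0.4
print(resubstitution_risk([0, 1, 1, 0, 1], [0, 1, 0, 0, 0]))
```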

For large datasets, split the learning sample into two parts:
$L_1$: used to train the classifier
$L_2$: used to estimate its risk
$R^{ts}(d)=\frac{1}{N_2}\sum_{(x_n,j_n)\in L_2}X(d(x_n)\neq j_n)\qquad(1.9)$

For smaller datasets, V-fold cross-validation is used instead:
$L-L_v$: used to train the classifier $d^{(v)}(x)$
$R^{ts}(d^{(v)})=\frac{1}{N_v}\sum_{(x_n,j_n)\in L_v}X(d^{(v)}(x_n)\neq j_n)$
(using all the data except the $v$th fold)


$R^{cv}(d)=\frac{1}{V}\sum_{v=1}^{V}R^{ts}(d^{(v)})$
(using all the data)
A concrete worked example of this formula is on page 85.
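A minimal sketch of the V-fold estimate above, assuming a hypothetical `fit(X, y)` that trains a classifier and returns a predict callable, with `X` and `y` NumPy arrays:

```python
import numpy as np

def cv_risk(X, y, fit, V=10, seed=0):
    """R^cv(d) = (1/V) * sum_v R^ts(d^(v)) over V random folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), V)
    risks = []
    for v in range(V):
        train = np.concatenate([folds[u] for u in range(V) if u != v])
        d_v = fit(X[train], y[train])                 # d^(v): trained on L - L_v
        risks.append(np.mean(d_v(X[folds[v]]) != y[folds[v]]))  # R^ts(d^(v))
    return np.mean(risks)
```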

2 Introduction to Tree Classification

$p(j,t)=\pi(j)\cdot\frac{N_j(t)}{N_j}\qquad(2.2)$
$p(t)=\sum_j p(j,t)\qquad(2.3)$
$p(j|t)=\frac{p(j,t)}{p(t)}\qquad(2.4)$
$\tilde{T}$: the current set of terminal nodes
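A sketch of (2.2)-(2.4) from counts, with all numbers hypothetical:

```python
import numpy as np

# hypothetical 3-class counts at one node t
N_j   = np.array([100, 50, 50])      # N_j: class sizes in the learning sample
N_jt  = np.array([30, 5, 10])        # N_j(t): class-j cases reaching node t
prior = np.array([0.5, 0.25, 0.25])  # pi(j)

p_jt        = prior * N_jt / N_j     # p(j,t), eq. (2.2)
p_t         = p_jt.sum()             # p(t),   eq. (2.3)
p_j_given_t = p_jt / p_t             # p(j|t), eq. (2.4)
```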

Page 35
$R(T)=\sum_{t\in\tilde{T}}R(t)$

$r(t)=\min_i\sum_j C(i|j)\,p(j|t)$

$R(T)=\sum_{t\in\tilde{T}}r(t)\cdot p(t)=\sum_{t\in\tilde{T}}R(t)$
(the probability that a case falls into node $t$, times the probability that it is misclassified there)
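A sketch of $r(t)$ under a cost matrix, where `C[i, j]` is taken to be $C(i|j)$, the cost of classifying a class-$j$ case as class $i$:

```python
import numpy as np

def node_risk(C, p_j_given_t):
    """r(t) = min_i sum_j C(i|j) p(j|t); returns r(t) and the minimizing class i."""
    expected_cost = C @ p_j_given_t          # entry i: sum_j C(i|j) p(j|t)
    i_star = int(np.argmin(expected_cost))
    return expected_cost[i_star], i_star

# with unit costs off the diagonal, r(t) reduces to 1 - max_j p(j|t)
C = 1.0 - np.eye(3)
r_t, label = node_risk(C, np.array([0.6, 0.3, 0.1]))  # r(t) = 0.4, class 0
```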

Page 36
Proposition 2.14
$R(t)\geq R(t_L)+R(t_R)$

3.3 MINIMAL COST-COMPLEXITY PRUNING

$R_{\alpha}(T)=R(T)+\alpha|\tilde{T}|$

DEFINITION:
For each value of $\alpha$, find the $T(\alpha)\leq T_{max}$ which minimizes $R_{\alpha}(T)$:
$R_{\alpha}(T(\alpha))=\min_{T\leq T_{max}}R_{\alpha}(T)$

complexity: $|\tilde{T}|$
$T_t$: any branch of $T_1$
$R(T_t)=\sum_{t'\in\tilde{T}_t}R(t')$
$R(t)>R(T_t)$
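These two quantities drive weakest-link pruning: since $R(t)>R(T_t)$, the critical value $g(t)=\frac{R(t)-R(T_t)}{|\tilde{T}_t|-1}$ is the $\alpha$ at which collapsing branch $T_t$ into node $t$ stops costing anything. A one-line sketch with hypothetical numbers:

```python
def weakest_link_alpha(R_t, R_Tt, n_leaves):
    """g(t) = (R(t) - R(T_t)) / (|T_t~| - 1): the alpha at which
    R_alpha(t) equals R_alpha(T_t), so pruning the branch becomes free."""
    return (R_t - R_Tt) / (n_leaves - 1)

# hypothetical branch: R(t) = 0.20, R(T_t) = 0.12, 5 terminal nodes
alpha_k = weakest_link_alpha(0.20, 0.12, 5)  # 0.02
```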

Page 73
$Q^{*}(i|j)=P(d(X)=i\mid Y=j)$
$Q^{*}(i|j)$ is the probability that a case in class $j$ is classified into class $i$ by $d$ (the misclassification probability). Define:

$R^{*}(j)=\sum_i C(i|j)\cdot Q^{*}(i|j)$
so that $R^{*}(j)$ is the expected cost of misclassification for class $j$ items.
Define:
$R^{*}(d)=\sum_j R^{*}(j)\cdot\pi(j)$
as the expected misclassification cost for the classifier $d$.


3.4.1 Test Sample Estimates
Page 74
Basic estimate:
$Q^{ts}(i|j)=\frac{N_{ij}^{(2)}}{N_j^{(2)}}$
That is, $Q^{*}(i|j)$ is estimated as the proportion of the test-sample class $j$ cases that the tree $T$ classifies as $i$ (set $Q^{ts}(i|j)=0$ if $N_j^{(2)}=0$).

$R^{ts}(j)=\sum_i C(i|j)\cdot Q^{ts}(i|j)$
(the expected misclassification cost for class $j$; here $C(i|j)$ is the cost of classifying a class $j$ case as class $i$)

$R^{ts}(T)=\sum_j R^{ts}(j)\cdot\pi(j)\qquad(3.13)$
If the priors are data estimated, use $L_2$ to estimate them as
$\pi(j)=\frac{N_j^{(2)}}{N^{(2)}}$ (the proportion of class $j$ cases in the test sample); in this case, (3.13) simplifies to
$R^{ts}(T)=\frac{1}{N^{(2)}}\sum_{i,j}C(i|j)\,N_{ij}^{(2)}$
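A sketch of the simplified form above, assuming a hypothetical test-sample count matrix `N2[i, j]` = $N_{ij}^{(2)}$ (class-$j$ cases that $T$ classifies as $i$):

```python
import numpy as np

def test_sample_risk(C, N2):
    """R^ts(T) = (1/N^(2)) * sum_{i,j} C(i|j) N_ij^(2),
    valid when the priors are estimated from the test sample."""
    return (C * N2).sum() / N2.sum()

C  = 1.0 - np.eye(2)            # unit misclassification costs
N2 = np.array([[40, 10],        # N2[i, j]: class-j cases labeled i
               [ 5, 45]])
print(test_sample_risk(C, N2))  # 15 errors out of 100 cases = 0.15
```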


Page 75
The test sample estimates can be used to select the right sized tree $T_{k_0}$ by the rule
$R^{ts}(T_{k_0})=\min_k R^{ts}(T_k)$

Page 75
3.4.2 Cross-Validation Estimates
3.4.3 Standard Errors and the 1 SE Rule

Page 78
$SE(R^{ts}(T))=\sqrt{\frac{R^{ts}(T)\,(1-R^{ts}(T))}{N_2}}$

$\hat{R}(T_k)=R^{ts}(T_k)$ or $R^{cv}(T_k)$
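A sketch of the binomial standard error for the unit-cost case, with hypothetical inputs:

```python
import math

def se_binomial(R_ts, N2):
    """SE(R^ts(T)) = sqrt(R^ts * (1 - R^ts) / N2)."""
    return math.sqrt(R_ts * (1.0 - R_ts) / N2)

print(se_binomial(0.15, 100))  # about 0.036
```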

Page 152
$R(T)=\sum_{t\in\tilde{T}}r(t)\cdot p(t)$
($r(t)$ is the probability that a case reaching node $t$ is misclassified; $p(t)$ is the probability that a case reaches node $t$)

$R^{cv}(d)=\frac{1}{N}\sum_{v=1}^{V}\sum_{(x_n,y_n)\in L_v}\big(y_n-d^{(v)}(x_n)\big)^2\qquad(8.7)$
(the book's double sum is correct: the outer sum runs over the $V$ folds and the inner one over the cases in each fold, so the average is over all $N$ cases; the same form reappears for $R^{cv}(T_k)$ on page 234)
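A sketch of (8.7) with its double sum, again assuming a hypothetical `fit(X, y)` that returns a predict callable:

```python
import numpy as np

def cv_mse(X, y, fit, V=10, seed=0):
    """R^cv(d) = (1/N) sum_v sum_{(x_n,y_n) in L_v} (y_n - d^(v)(x_n))^2."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    sse = 0.0
    for fold in np.array_split(idx, V):
        train = np.setdiff1d(idx, fold)
        d_v = fit(X[train], y[train])             # d^(v): built without fold v
        sse += np.sum((y[fold] - d_v(X[fold])) ** 2)
    return sse / len(y)
```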

Page 225
$RE^{ts}(d)=\frac{R^{ts}(d)}{R^{ts}(\overline{y})}$
$RE^{cv}(d)=\frac{R^{cv}(d)}{R(\overline{y})}$
$\rho^2=1-RE(d)$
(this is the analogue of the familiar $R^2$ statistic: one minus the ratio of the model's MSE to that of predicting the mean)
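A sketch of the relative error, assuming hypothetical test arrays; the baseline $R^{ts}(\overline{y})$ uses the learning-sample mean as a constant predictor:

```python
import numpy as np

def relative_error(y_test, pred_test, y_train_mean):
    """RE^ts(d) = R^ts(d) / R^ts(ybar); 1 - RE is the R^2-style statistic."""
    mse_d    = np.mean((np.asarray(y_test) - np.asarray(pred_test)) ** 2)
    mse_mean = np.mean((np.asarray(y_test) - y_train_mean) ** 2)
    return mse_d / mse_mean
```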

$E(Y_1-d(X_1))^4\approx\frac{1}{N_2}\sum_{n=1}^{N_2}(Y_n-d(X_n))^4$

$E(Y_1-d(X_1))^2\approx\frac{1}{N_2}\sum_{n=1}^{N_2}(Y_n-d(X_n))^2=R^{ts}$

The following is my own derivation for page 226 (write $N=N_2$ and let $\mathrm{Var}$ denote variance):

$SE(R^{ts})=\sqrt{\mathrm{Var}(R^{ts})}$
$=\sqrt{\mathrm{Var}\Big(\frac{1}{N}\sum_{n=1}^{N}(Y_n-d(X_n))^2\Big)}$
$=\frac{1}{N}\sqrt{\mathrm{Var}\Big[\sum_{n=1}^{N}(Y_n-d(X_n))^2\Big]}$
$=\frac{1}{N}\sqrt{\sum_{n=1}^{N}\mathrm{Var}\big[(Y_n-d(X_n))^2\big]}$ (the test cases are independent)
$=\frac{1}{N}\sqrt{\sum_{n=1}^{N}\Big\{E(Y_n-d(X_n))^4-E^2\big[(Y_n-d(X_n))^2\big]\Big\}}$
$=\frac{1}{\sqrt{N}}\sqrt{E(Y_1-d(X_1))^4-E^2\big[(Y_1-d(X_1))^2\big]}$ (the test cases are identically distributed)
$\approx\frac{1}{\sqrt{N}}\sqrt{\Big[\frac{1}{N}\sum_{n=1}^{N}(Y_n-d(X_n))^4\Big]-(R^{ts})^2}$

where
$E\big[(Y_n-d(X_n))^2\big]\approx\frac{1}{N}\sum_{n=1}^{N}(Y_n-d(X_n))^2=R^{ts}$
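The last line of the derivation translates directly into code; a sketch with hypothetical test residuals:

```python
import numpy as np

def se_regression_risk(y_test, pred_test):
    """SE(R^ts) ~ (1/sqrt(N)) * sqrt(mean(e^4) - (R^ts)^2), e = y - d(x)."""
    e2 = (np.asarray(y_test) - np.asarray(pred_test)) ** 2
    R_ts = e2.mean()                   # R^ts = mean squared error
    return np.sqrt((np.mean(e2 ** 2) - R_ts ** 2) / len(e2))
```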

Page 230
$\overline{y}(t)=\frac{1}{N(t)}\sum_{x_n\in t}y_n$
$R(T)=\frac{1}{N}\sum_{t\in\tilde{T}}\sum_{x_n\in t}(y_n-\overline{y}(t))^2$

$R(t)=\frac{1}{N}\sum_{x_n\in t}(y_n-\overline{y}(t))^2$

$R(T)=\sum_{t\in\tilde{T}}R(t)$

Page 232
$s^2(t)=\frac{1}{N(t)}\sum_{x_n\in t}(y_n-\overline{y}(t))^2$

$R(T)=\sum_{t\in\tilde{T}}s^2(t)\cdot p(t)$
From the above we can deduce that
$p(t)=\frac{N(t)}{N}$
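A sketch of $R(T)=\sum_t s^2(t)\,p(t)$, assuming hypothetical arrays `y` and `leaf` (the terminal node each case falls into):

```python
import numpy as np

def regression_tree_risk(y, leaf):
    """R(T) = sum over leaves of s^2(t) * p(t), with p(t) = N(t)/N;
    equals (1/N) * sum_t sum_{x_n in t} (y_n - ybar(t))^2."""
    y, leaf = np.asarray(y, dtype=float), np.asarray(leaf)
    total = 0.0
    for t in np.unique(leaf):
        y_t = y[leaf == t]
        s2 = np.mean((y_t - y_t.mean()) ** 2)  # s^2(t): within-node variance
        total += s2 * len(y_t) / len(y)        # s^2(t) * p(t)
    return total
```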

Page 233
$R_{\alpha}(T)=R(T)+\alpha|\tilde{T}|$
Now minimal error-complexity pruning is done exactly as minimal cost-complexity pruning in classification.
The result is a decreasing sequence of trees:
$T_1>T_2>\cdots>\{t_1\}$
with $T_1\leq T_{max}$, and a corresponding increasing sequence of $\alpha$ values:
$0\leq\alpha_1\leq\alpha_2\leq\cdots$
such that for $\alpha\in[\alpha_k,\alpha_{k+1})$, $T_k$ is the smallest subtree of $T_{max}$ minimizing $R_{\alpha}(T)$.

Page 234
$R^{ts}(T_k)=\frac{1}{N_2}\sum_{(x_n,y_n)\in L_2}(y_n-d_k(x_n))^2$
$R^{cv}(T_k)=\frac{1}{N}\sum_{v=1}^{V}\sum_{(x_n,y_n)\in L_v}\big(y_n-d_k^{(v)}(x_n)\big)^2$
$\alpha_k'=\sqrt{\alpha_k\,\alpha_{k+1}}$
$RE^{cv}(T_k)=\frac{R^{cv}(T_k)}{R(\overline{y})}$
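If scikit-learn is available, its cost-complexity pruning path can be combined with the geometric-mean rule $\alpha_k'=\sqrt{\alpha_k\alpha_{k+1}}$; a sketch, not the book's own procedure:

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)
path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X, y)
# evaluate each pruned tree at the geometric mean of consecutive alphas
alpha_primes = np.sqrt(path.ccp_alphas[:-1] * path.ccp_alphas[1:])
cv_mse = [-cross_val_score(
              DecisionTreeRegressor(ccp_alpha=a, random_state=0), X, y,
              cv=10, scoring="neg_mean_squared_error").mean()
          for a in alpha_primes]
best_alpha = alpha_primes[int(np.argmin(cv_mse))]
```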

Page 237
The selected $T_k$ was the smallest tree such that
$R^{cv}(T_k)\leq R^{cv}(T_{k_0})+SE$
where
$R^{cv}(T_{k_0})=\min_k R^{cv}(T_k)$
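A sketch of the 1 SE rule over a hypothetical pruning sequence (arrays ordered from the largest tree to the root):

```python
import numpy as np

def one_se_rule(R_cv, SE, n_leaves):
    """Pick the smallest tree whose CV risk is within one SE of the minimum."""
    R_cv, SE, n_leaves = map(np.asarray, (R_cv, SE, n_leaves))
    k0 = int(np.argmin(R_cv))                      # R^cv(T_k0) = min_k R^cv(T_k)
    ok = np.flatnonzero(R_cv <= R_cv[k0] + SE[k0])
    return ok[np.argmin(n_leaves[ok])]

# hypothetical sequence: picks the 5-leaf tree (0.26 <= 0.24 + 0.02)
k = one_se_rule(R_cv=[0.30, 0.25, 0.24, 0.26, 0.35],
                SE=[0.02] * 5, n_leaves=[20, 12, 8, 5, 1])
```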

Page 303
11.4 Test Samples
An obvious way to cure the overoptimistic tendency of the empirical estimate $R(d_k)$ of $R^{*}(d_k)$ is to base the estimate of $R^{*}(d_k)$ on data not used to construct $d_k$.
There are two ways to evaluate the SE:
the model 1 version (page 303, lower part)
the model 2 version (page 304, before "To evaluate the preceding formulas efficiently")

Page 306
11.5 Cross-Validation
The use of test samples to estimate the risk of tree-structured procedures requires that one set of sample data be used to construct the procedure and a disjoint set be used to evaluate it.
When the combined set of available data contains a thousand or more cases, this is a reasonable approach. But if only a few hundred cases or fewer are available in total, it can be inefficient in its use of the available data; cross-validation is then preferable.

Page 309
11.6 Final Tree Selection
Test samples or cross-validation can be used to select a particular procedure $d_k=d_{T_k}$ from among the candidates $d_k$.

In summary, we can use test samples or cross-validation to select the best tree:

$\text{select the best tree}=\begin{cases}\text{test samples with minimum MSE}\\ \text{test samples with the 1 SE rule}\\ \text{cross-validation}\end{cases}$

where

$\text{test samples with the 1 SE rule}=\begin{cases}\text{model 1 version}\\ \text{model 2 version}\end{cases}$

Selecting the best pruned tree with cross-validation has some defects; they are discussed at:
https://blog.csdn.net/appleyuchi/article/details/84957220
