This example starts from the LDA linear classification algorithm and works up to a decision tree classifier. It is a good walkthrough for beginners to refer to.

Many decision tree write-ups online come without a worked example: just a pile of code with no indication of how the parameters are passed. Here we simply use the decision tree routines in the toolbox; whenever something is unclear, a quick help on the function is enough.

Classification

Suppose you have a data set containing observations with measurements on different variables (called predictors) and their known class labels. If you obtain predictor values for new observations, could you determine to which classes those observations probably belong? This is the problem of classification. This demo illustrates how to perform some classification algorithms in MATLAB® using Statistics Toolbox™ by applying them to Fisher's iris data.


Fisher's Iris Data

Fisher's iris data consists of measurements on the sepal length, sepal width, petal length, and petal width for 150 iris specimens. There are 50 specimens from each of three species. Load the data and see how the sepal measurements differ between species. You can use the two columns containing sepal measurements.

load fisheriris

gscatter(meas(:,1), meas(:,2), species,'rgb','osd');

xlabel('Sepal length');

ylabel('Sepal width');

N = size(meas,1);

Suppose you measure a sepal and petal from an iris, and you need to determine its species on the basis of those measurements. One approach to solving this problem is known as discriminant analysis.

Linear and Quadratic Discriminant Analysis

The classify function can perform classification using different types of discriminant analysis. First classify the data using the default linear discriminant analysis (LDA).

ldaClass = classify(meas(:,1:2),meas(:,1:2),species);

The observations with known class labels are usually called the training data. Now compute the resubstitution error, which is the misclassification error (the proportion of misclassified observations) on the training set.

bad = ~strcmp(ldaClass,species);

ldaResubErr = sum(bad) / N

ldaResubErr =

0.2000

You can also compute the confusion matrix on the training set. A confusion matrix contains information about known class labels and predicted class labels. Generally speaking, the (i,j) element in the confusion matrix is the number of samples whose known class label is class i and whose predicted class is j. The diagonal elements represent correctly classified observations.

[ldaResubCM,grpOrder] = confusionmat(species,ldaClass)

ldaResubCM =

49 1 0

0 36 14

0 15 35

grpOrder =

'setosa'

'versicolor'

'virginica'
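As a quick consistency check (a small sketch, not part of the original demo), the off-diagonal entries of the confusion matrix are exactly the misclassified observations, so dividing their sum by N reproduces the resubstitution error:

% off-diagonal entries of the confusion matrix count the misclassified observations
nMisclassified = sum(ldaResubCM(:)) - trace(ldaResubCM)   % 1 + 14 + 15 = 30

nMisclassified / N                                         % 0.2000, same as ldaResubErr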

Of the 150 training observations, 20% or 30 observations are misclassified by the linear discriminant function. You can see which ones they are by drawing X through the misclassified points.

hold on;

plot(meas(bad,1), meas(bad,2), 'kx');

hold off;

The function has separated the plane into regions divided by lines, and assigned different regions to different species. One way to visualize these regions is to create a grid of (x,y) values and apply the classification function to that grid.

[x,y] = meshgrid(4:.1:8,2:.1:4.5);

x = x(:);

y = y(:);

j = classify([x y],meas(:,1:2),species);

gscatter(x,y,j,'grb','sod')

For some data sets, the regions for the various classes are not well separated by lines. When that is the case, linear discriminant analysis is not appropriate. Instead, you can try quadratic discriminant analysis (QDA) on this data.

Compute the resubstitution error for quadratic discriminant analysis.

qdaClass = classify(meas(:,1:2),meas(:,1:2),species,'quadratic');

bad = ~strcmp(qdaClass,species);

qdaResubErr = sum(bad) / N

qdaResubErr =

0.2000

You have computed the resubstitution error. Usually people are more interested in the test error (also referred to as generalization error), which is the expected prediction error on an independent set. In fact, the resubstitution error will likely under-estimate the test error.

In this case you don't have another labeled data set, but you can simulate one by doing cross-validation. A stratified 10-fold cross-validation is a popular choice for estimating the test error on classification algorithms. It randomly divides the training set into 10 disjoint subsets. Each subset has roughly equal size and roughly the same class proportions as in the training set. Remove one subset, train the classification model using the other nine subsets, and use the trained model to classify the removed subset. You could repeat this by removing each of the ten subsets one at a time.

Because cross-validation randomly divides data, its outcome depends on the initial random seed. To reproduce the exact results in this demo, execute the following command:

rng(0,'twister');

First use cvpartition to generate 10 disjoint stratified subsets.

cp = cvpartition(species,'k',10)

cp =

K-fold cross validation partition

N: 150

NumTestSets: 10

TrainSize: 135 135 135 135 135 135 135 135 135 135

TestSize: 15 15 15 15 15 15 15 15 15 15

The crossval function can estimate the misclassification error for both LDA and QDA using the given data partition cp.
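Before calling crossval, it may help to see the loop it automates. The following is a minimal hand-written sketch of the procedure described above, using the training and test methods of the cp partition (not part of the original demo):

% hand-rolled 10-fold cross-validation for LDA on the sepal measurements
foldErr = zeros(cp.NumTestSets,1);
for i = 1:cp.NumTestSets
    trIdx = training(cp,i);                 % logical index of the nine training folds
    teIdx = test(cp,i);                     % logical index of the held-out fold
    pred  = classify(meas(teIdx,1:2), meas(trIdx,1:2), species(trIdx));
    foldErr(i) = mean(~strcmp(pred, species(teIdx)));
end
mean(foldErr)                               % average misclassification rate over the ten folds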

Estimate the true test error for LDA using 10-fold stratified cross-validation.

ldaClassFun = @(xtrain,ytrain,xtest)(classify(xtest,xtrain,ytrain));

ldaCVErr = crossval('mcr',meas(:,1:2),species,'predfun', ...
    ldaClassFun,'partition',cp)

ldaCVErr =

0.2000

The LDA cross-validation error has the same value as the LDA resubstitution error on this data.

Estimate the true test error for QDA using 10-fold stratified cross-validation.

qdaClassFun = @(xtrain,ytrain,xtest)(classify(xtest,xtrain,ytrain, ...
    'quadratic'));

qdaCVErr = crossval('mcr',meas(:,1:2),species,'predfun', ...
    qdaClassFun,'partition',cp)

qdaCVErr =

0.2200

QDA has a slightly larger cross-validation error than LDA. This shows that a simpler model may achieve performance comparable to, or better than, a more complicated model.

Naive Bayes Classifier

The classify function has two other types, 'diagLinear' and 'diagQuadratic'. They are similar to 'linear' and 'quadratic', but with diagonal covariance matrix estimates. These diagonal choices are specific examples of a Naive Bayes classifier, because they assume the variables are conditionally independent given the class label. Naive Bayes classifiers are among the most popular classifiers. While the assumption of class-conditional independence between variables is not true in general, Naive Bayes classifiers have been found to work well in practice on many data sets.
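As a minimal sketch of the diagonal options mentioned above (not part of the original demo), you can pass 'diagLinear' directly to classify and compute its resubstitution error the same way as before:

% Naive Bayes with Gaussian densities and a pooled diagonal covariance estimate
dlClass = classify(meas(:,1:2), meas(:,1:2), species, 'diagLinear');

dlResubErr = sum(~strcmp(dlClass, species)) / N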

The NaiveBayes class can be used to create a more general type of Naive Bayes classifier.

First model each variable in each class using a Gaussian distribution. You can compute the resubstitution error and the cross-validation error.

nbGau = NaiveBayes.fit(meas(:,1:2), species);

nbGauClass = nbGau.predict(meas(:,1:2));

bad = ~strcmp(nbGauClass,species);

nbGauResubErr = sum(bad) / N

nbGauClassFun = @(xtrain,ytrain,xtest) ...
    (predict(NaiveBayes.fit(xtrain,ytrain), xtest));

nbGauCVErr = crossval('mcr',meas(:,1:2),species, ...
    'predfun', nbGauClassFun,'partition',cp)

nbGauResubErr =

0.2200

nbGauCVErr =

0.2200

So far you have assumed the variables from each class have a multivariate normal distribution. Often that is a reasonable assumption, but sometimes you may not be willing to make that assumption or you may see clearly that it is not valid. Now try to model each variable in each class using a kernel density estimation, which is a more flexible nonparametric technique.

nbKD = NaiveBayes.fit(meas(:,1:2), species,'dist','kernel');

nbKDClass = nbKD.predict(meas(:,1:2));

bad = ~strcmp(nbKDClass,species);

nbKDResubErr = sum(bad) / N

nbKDClassFun = @(xtrain,ytrain,xtest) ...
    (predict(NaiveBayes.fit(xtrain,ytrain,'dist','kernel'),xtest));

nbKDCVErr = crossval('mcr',meas(:,1:2),species, ...
    'predfun', nbKDClassFun,'partition',cp)

nbKDResubErr =

0.1800

nbKDCVErr =

0.2333

For this data set, the Naive Bayes classifier with kernel density estimation gets smaller resubstitution error and cross-validation error than the Naive Bayes classifier with a Gaussian distribution.

Decision Tree

Another classification algorithm is based on a decision tree. A decision tree is a set of simple rules, such as "if the sepal length is less than 5.45, classify the specimen as setosa." Decision trees are also nonparametric because they do not require any assumptions about the distribution of the variables in each class.

The classregtree class creates a decision tree. Create a decision tree for the iris data and see how well it classifies the irises into species.

t = classregtree(meas(:,1:2), species,'names',{'SL' 'SW' });

It's interesting to see how the decision tree method divides the plane. Use the same technique as above to visualize the regions assigned to each species.

[grpname,node] = t.eval([x y]);

gscatter(x,y,grpname,'grb','sod')

Another way to visualize the decision tree is to draw a diagram of the decision rule and class assignments.

view(t);

This cluttered-looking tree uses a series of rules of the form "SL < 5.45" to classify each specimen into one of 19 terminal nodes. To determine the species assignment for an observation, start at the top node and apply the rule. If the point satisfies the rule you take the left path, and if not you take the right path. Ultimately you reach a terminal node that assigns the observation to one of the three species.
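As a small illustration (the measurement values below are made up, not taken from the data set), you can follow these rules for a single specimen by passing one row of sepal measurements to the tree:

% classify one hypothetical specimen: sepal length 5.0, sepal width 3.5
grpname = t.eval([5.0 3.5])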

Compute the resubstitution error and the cross-validation error for the decision tree.

dtclass = t.eval(meas(:,1:2));

bad = ~strcmp(dtclass,species);

dtResubErr = sum(bad) / N

dtClassFun = @(xtrain,ytrain,xtest)(eval(classregtree(xtrain,ytrain),xtest));

dtCVErr = crossval('mcr',meas(:,1:2),species, ...
    'predfun', dtClassFun,'partition',cp)

dtResubErr =

0.1333

dtCVErr =

0.2933

For the decision tree algorithm, the cross-validation error estimate is significantly larger than the resubstitution error. This shows that the generated tree overfits the training set. In other words, this is a tree that classifies the original training set well, but the structure of the tree is sensitive to this particular training set so that its performance on new data is likely to degrade. It is often possible to find a simpler tree that performs better than a more complex tree on new data.

Try pruning the tree. First compute the resubstitution error for various subsets of the original tree. Then compute the cross-validation error for these sub-trees. A graph shows that the resubstitution error is overly optimistic. It always decreases as the tree size grows, but beyond a certain point, increasing the tree size increases the cross-validation error rate.

resubcost = test(t,'resub');

[cost,secost,ntermnodes,bestlevel] = test(t,'cross',meas(:,1:2),species);

plot(ntermnodes,cost,'b-', ntermnodes,resubcost,'r--')

figure(gcf);

xlabel('Number of terminal nodes');

ylabel('Cost (misclassification error)')

legend('Cross-validation','Resubstitution')

Which tree should you choose? A simple rule would be to choose the tree with the smallest cross-validation error. While this may be satisfactory, you might prefer to use a simpler tree if it is roughly as good as a more complex tree. For this example, take the simplest tree that is within one standard error of the minimum. That's the default rule used by the classregtree/test method.

You can show this on the graph by computing a cutoff value that is equal to the minimum cost plus one standard error. The "best" level computed by the classregtree/test method is the smallest tree under this cutoff. (Note that bestlevel=0 corresponds to the unpruned tree, so you have to add 1 to use it as an index into the vector outputs from classregtree/test.)

[mincost,minloc] = min(cost);

cutoff = mincost + secost(minloc);

hold on

plot([0 20], [cutoff cutoff], 'k:')

plot(ntermnodes(bestlevel+1), cost(bestlevel+1), 'mo')

legend('Cross-validation','Resubstitution','Min + 1 std. err.','Best choice')

hold off

Finally, you can look at the pruned tree and compute the estimated misclassification error for it.

pt = prune(t,bestlevel);

view(pt)

cost(bestlevel+1)

ans =

0.2467

Conclusions

This demonstration shows how to perform classification in MATLAB using Statistics Toolbox functions.

This demonstration is not meant to be an ideal analysis of the Fisher iris data. In fact, using the petal measurements instead of, or in addition to, the sepal measurements may lead to better classification. Also, this demonstration is not meant to compare the strengths and weaknesses of different classification algorithms. You may find it instructive to perform the analysis on other data sets and compare different algorithms. There are also Statistics Toolbox functions that implement other classification algorithms. For example, you can use TreeBagger to perform bootstrap aggregation for an ensemble of decision trees.
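A hedged sketch of the TreeBagger idea mentioned above (the ensemble size of 50 is an arbitrary illustrative choice, not from this demo):

% grow a bagged ensemble of 50 classification trees on the sepal measurements
b = TreeBagger(50, meas(:,1:2), species);

bagClass = predict(b, meas(:,1:2));               % predicted labels as a cell array

bagResubErr = sum(~strcmp(bagClass, species)) / N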
