【Machine Learning in Action --3】Decision Trees: the ID3 Algorithm

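1. Basic concepts

There are many kinds of decision tree algorithms, including CART, ID3, and C4.5. CART is based on Gini impurity and is not covered in detail here. ID3 and C4.5 are both based on information entropy: ID3 selects features by information gain and C4.5 by gain ratio, so they often, but not always, produce the same tree. This post focuses on ID3. We start with the definition of information entropy.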

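p(a_i): the probability that event a_i occurs

I(a_i) = -log₂(p(a_i)): the uncertainty of event a_i, called the self-information of a_i

H = sum(p(a_i)*I(a_i)): the average information of source S, i.e. the information entropy

Gain = BaseEntropy - newEntropy: the information gain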

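Decision tree learning is a top-down recursive procedure. The basic idea is to use information entropy as the measure and build the tree along which entropy falls fastest, until the entropy at every leaf is zero, at which point all instances in a leaf belong to the same class. At each node, ID3 picks the feature that maximizes the information gain Gain. Suppose the labels of the original problem are positive and negative, with p and n examples respectively. The entropy of the original problem is then

H = -(p/(p+n))*log₂(p/(p+n)) - (n/(p+n))*log₂(n/(p+n))

Splitting on a feature produces one subset per feature value. newEntropy is the sum of the subsets' entropies, each weighted by the fraction of examples that fall into that subset, taken over the N distinct values of the feature. For example, a feature with values {rain, sunny} has N = 2.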
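To make the two-class entropy concrete, here is a minimal Python 3 sketch. The helper name binary_entropy is introduced only for this illustration and is not part of trees.py; it assumes p and n are non-negative counts and treats 0*log₂(0) as 0.

```python
from math import log2

def binary_entropy(p, n):
    """Entropy in bits of a set with p positive and n negative examples."""
    total = p + n
    entropy = 0.0
    for count in (p, n):
        if count:  # skip empty classes: 0 * log2(0) is taken to be 0
            frac = count / total
            entropy -= frac * log2(frac)
    return entropy

print(binary_entropy(2, 3))  # the fish data used later: 2 'yes', 3 'no'
print(binary_entropy(5, 0))  # a pure set has zero entropy
print(binary_entropy(4, 4))  # an evenly mixed two-class set has entropy 1 bit
```

The first call gives roughly 0.971, the same value that calcShannonEnt reports for the sample data set later in this post.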

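A known problem with ID3: a feature with many distinct values (continuous numeric features in particular) splits the data into many small, "purer" subsets, so its information gain is larger and the tree tends to pick it as the root. The result is a very wide but shallow tree, which is a poor way to partition the data. C4.5 addresses this by maximizing the gain ratio instead, that is, Gain divided by the split information of the feature:

SplitInfo = -sum((|S_i|/|S|)*log(|S_i|/|S|)), summed over the N values of the feature, where S_i is the subset of S taking the i-th value

GainRatio = Gain / SplitInfo

The logarithm is base 2.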
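The effect of dividing by the split information can be seen on a toy example. The sketch below is Python 3 and is not part of trees.py; the function names entropy and gain_ratio and the four-row data set are made up for illustration. Column 0 is an ID-like feature with a distinct value per row, column 1 is a genuinely informative two-valued feature.

```python
from collections import Counter
from math import log2

def entropy(values):
    """Shannon entropy (bits) of a list of discrete values."""
    total = len(values)
    return sum(-(c / total) * log2(c / total) for c in Counter(values).values())

def gain_ratio(rows, axis):
    """Return (information gain, gain ratio) for column `axis`; last column is the label."""
    labels = [row[-1] for row in rows]
    column = [row[axis] for row in rows]
    new_entropy = 0.0
    for value, count in Counter(column).items():
        subset = [row[-1] for row in rows if row[axis] == value]
        new_entropy += (count / len(rows)) * entropy(subset)
    gain = entropy(labels) - new_entropy
    split_info = entropy(column)  # entropy of the feature's own value distribution
    ratio = gain / split_info if split_info > 0 else 0.0
    return gain, ratio

rows = [[0, 'a', 'yes'], [1, 'a', 'yes'], [2, 'b', 'no'], [3, 'b', 'no']]
print(gain_ratio(rows, 0))  # ID-like column: high gain, but penalized by split info
print(gain_ratio(rows, 1))  # two-valued column: same gain, higher gain ratio
```

Both columns reach the same information gain, so plain ID3 cannot tell them apart, while the gain ratio prefers the two-valued column.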

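2. Pros and cons of decision trees

Pros: low computational cost, results that are easy to interpret, insensitivity to missing values, and the ability to handle irrelevant features

Cons: prone to overfitting

Works with: numeric and nominal values

3. Python implementation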

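The code below is explained using the following data.

Data set 1 contains 5 marine animals with two features: whether the animal can survive without coming to the surface, and whether it has flippers. The animals fall into two classes: fish and not fish.

	Can survive without surfacing?	Has flippers?	Is a fish?
1	Yes	Yes	Yes
2	Yes	Yes	Yes
3	Yes	No	No
4	No	Yes	No
5	No	Yes	No

The data set as encoded in the code (1 for yes, 0 for no) is shown next. Note that the third row is stored as [0, 1, 'no'], which differs from row 3 of the table above; the entropy, split, and tree-building outputs below correspond to these encoded values.

	Feature[0] (no surfacing)	Feature[1] (flippers)	Feature[-1] (fish)
dataSet[0]	1	1	yes
dataSet[1]	1	1	yes
dataSet[2]	0	1	no
dataSet[3]	0	1	no
dataSet[4]	0	1	no

Create a file named trees.py; all of the code below goes into this file.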

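(1) Compute the information entropy

# -*- coding: utf-8 -*-
from math import log

# Compute the Shannon entropy of a given data set
def calcShannonEnt(dataSet):
    numEntries=len(dataSet)  # total number of instances
    labelCounts={}  # dictionary of class counts; the keys are the values in the last column
    for featVec in dataSet:   # featVec is one instance: feature values followed by the class label
        currentLabel=featVec[-1]    # the class label is the last element of the instance
        # If this label has not been seen yet, add it to the dictionary with a count of 0
        if currentLabel not in labelCounts.keys():
            labelCounts[currentLabel]=0
        # Increment the count for this label
        labelCounts[currentLabel]+=1
    shannonEnt=0.0
    for key in labelCounts:
        prob=float(labelCounts[key])/numEntries # relative frequency of this class
        shannonEnt -=prob*log(prob,2)
    return shannonEnt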

(2) Create the data set

# Create the data set
def createDataSet():
    dataSet=[[1,1,'yes'],[1,1,'yes'],[0,1,'no'],[0,1,'no'],[0,1,'no']]
    labels=['no surfacing','flippers']
    return dataSet,labels

Enter the following commands at the Python prompt:

>>> import trees
>>> reload(trees)
<module 'trees' from 'E:\python excise\trees.pyc'>
>>> myDat,labels=trees.createDataSet()
>>> myDat
[[1, 1, 'yes'], [1, 1, 'yes'], [0, 1, 'no'], [0, 1, 'no'], [0, 1, 'no']]
>>> trees.calcShannonEnt(myDat)
0.9709505944546686

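The higher the entropy, the more mixed the data. To see how the entropy changes as classes are added, we introduce a third class named maybe and compute the entropy again: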

>>> myDat[0][-1]='maybe'
>>> myDat
[[1, 1, 'maybe'], [1, 1, 'yes'], [0, 1, 'no'], [0, 1, 'no'], [0, 1, 'no']]
>>> trees.calcShannonEnt(myDat)
1.3709505944546687
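The jump from about 0.971 to about 1.371 can be checked by hand, and it also helps to compare each value with its upper bound: with k classes the entropy can be at most log₂(k), reached when all classes are equally frequent. The sketch below is Python 3 and standalone; label_entropy is a name made up for this illustration.

```python
from collections import Counter
from math import log2

def label_entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    total = len(labels)
    return sum(-(c / total) * log2(c / total) for c in Counter(labels).values())

two_classes = ['yes', 'yes', 'no', 'no', 'no']
three_classes = ['maybe', 'yes', 'no', 'no', 'no']
five_classes = ['a', 'b', 'c', 'd', 'e']  # every instance in its own class

for labels in (two_classes, three_classes, five_classes):
    k = len(set(labels))
    # columns: number of classes, entropy, upper bound log2(k)
    print(k, round(label_entropy(labels), 4), round(log2(k), 4))
```

Only the last list reaches its bound, because it is the only one in which every class is equally frequent.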

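Once we can compute the entropy, we can split the data set so as to maximize the information gain.

(3) Split the data set

We compute the entropy of the result of splitting on each feature in turn, and then decide which feature gives the best split.

# Split the data set on a given feature
# dataSet: the data set to split; axis: index of the feature to split on; value: the feature value to select
def splitDataSet(dataSet,axis,value):
    retDataSet=[]   # build a new list so that the original dataSet is not modified
    for featVec in dataSet:
        if featVec[axis]==value:
            reducedFeatVec=featVec[:axis]   # the columns before the split feature
            reducedFeatVec.extend(featVec[axis+1:])  # the columns after the split feature
            retDataSet.append(reducedFeatVec) # reducedFeatVec is the row without the split-feature column
    return retDataSet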

>>> reload(trees)
<module 'trees' from 'E:\python excise\trees.pyc'>
>>> myDat,labels=trees.createDataSet()
>>> myDat
[[1, 1, 'yes'], [1, 1, 'yes'], [0, 1, 'no'], [0, 1, 'no'], [0, 1, 'no']]
>>> trees.splitDataSet(myDat,0,1)
[[1, 'yes'], [1, 'yes']]
>>> trees.splitDataSet(myDat,0,0)
[[1, 'no'], [1, 'no'], [1, 'no']]

(4) Choose the best feature to split on

# Choose the best way to split the data set
def chooseBestFeatureToSplit(dataSet):
    numFeatures=len(dataSet[0])-1        # exclude the class-label column
    baseEntropy=calcShannonEnt(dataSet)   # entropy of the whole data set before splitting
    bestInfoGain=0.0
    bestFeature=-1  # -1 means no feature has produced a positive gain so far
    # for each feature i: compute the new entropy, then the gain
    for i in range(numFeatures):    # loop over all features
        featList=[example[i] for example in dataSet]  # all values taken by feature i, one list per feature
        uniqueVals=set(featList)  # a set is like a list, except that every value appears only once
        newEntropy=0.0
        for value in uniqueVals:
            subDataSet=splitDataSet(dataSet,i,value)  # the subset for this value
            prob=len(subDataSet)/float(len(dataSet))
            newEntropy+=prob*calcShannonEnt(subDataSet) # weighted entropy of the split data
        infoGain=baseEntropy-newEntropy
        if(infoGain>bestInfoGain):
            bestInfoGain=infoGain
            bestFeature=i
    return bestFeature

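Note: the data set must meet the following two requirements:

<1> The data must be a list of lists, and every inner list must have the same length.

<2> The last column of the data, that is, the last element of each instance, is the class label of that instance.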

>>> reload(trees)
<module 'trees' from 'E:\python excise\trees.pyc'>
>>> myDat,labels=trees.createDataSet()
>>> trees.chooseBestFeatureToSplit(myDat)
0
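To see why feature 0 is chosen, it helps to print the information gain of every feature rather than only the index of the winner. The following is a standalone Python 3 sketch, not part of trees.py; info_gains and entropy are names made up for this illustration, and the data is the encoded data set from createDataSet.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    total = len(labels)
    return sum(-(c / total) * log2(c / total) for c in Counter(labels).values())

def info_gains(rows):
    """Information gain of every feature column; the last column is the class label."""
    base = entropy([row[-1] for row in rows])
    gains = []
    for axis in range(len(rows[0]) - 1):
        new_entropy = 0.0
        for value in set(row[axis] for row in rows):
            subset = [row[-1] for row in rows if row[axis] == value]
            new_entropy += len(subset) / len(rows) * entropy(subset)
        gains.append(base - new_entropy)
    return gains

data = [[1, 1, 'yes'], [1, 1, 'yes'], [0, 1, 'no'], [0, 1, 'no'], [0, 1, 'no']]
gains = info_gains(data)
print(gains)  # gain for 'no surfacing', then gain for 'flippers'
```

In this encoded data set every row has flippers = 1, so splitting on it changes nothing and its gain is zero, while splitting on no surfacing yields two pure subsets and recovers the full base entropy of about 0.971.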

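(5) Build the tree

The tree is stored as a Python dictionary; the function returns that dictionary, myTree.

# Build the tree
def createTree(dataSet,labels):
    classList=[example[-1] for example in dataSet]
    if classList.count(classList[0])==len(classList):  # stop splitting when all instances have the same class
        return classList[0]
    if len(dataSet[0])==1: # only the label column is left, so every feature has been used
        return majorityCnt(classList)
    bestFeat=chooseBestFeatureToSplit(dataSet)
    bestFeatLabel=labels[bestFeat]
    myTree={bestFeatLabel:{}}
    del(labels[bestFeat])  # remove the used feature; note that this modifies the caller's labels list
    featValues=[example[bestFeat] for example in dataSet]  # all values taken by the chosen feature
    uniqueVals=set(featValues)
    for value in uniqueVals:
        subLabels=labels[:]  # copy, so that recursive calls do not share one list
        myTree[bestFeatLabel][value]=createTree(splitDataSet(dataSet,bestFeat,value),subLabels)
    return myTree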

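The recursion stops in exactly two cases: either all instances in the current subset have the same class label, or all features have been used up, in which case the most frequent class is returned.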

>>> reload(trees)
<module 'trees' from 'E:\python excise\trees.pyc'>
>>> myDat,labels=trees.createDataSet()
>>> myTree=trees.createTree(myDat,labels)
>>> myTree
{'no surfacing': {0: 'no', 1: 'yes'}}
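For readers on Python 3, the whole build step can be condensed into one standalone sketch. This is not the code from trees.py: the names entropy, split, best_feature, and build_tree are made up for this illustration, and it makes two deliberate changes that should be read as assumptions. It never modifies the caller's labels list, and when no feature yields a positive gain it falls back to a majority vote instead of indexing labels with -1.

```python
from collections import Counter
from math import log2

def entropy(rows):
    """Shannon entropy (bits) of the class labels in the last column."""
    total = len(rows)
    counts = Counter(row[-1] for row in rows)
    return sum(-(c / total) * log2(c / total) for c in counts.values())

def split(rows, axis, value):
    """Rows whose column `axis` equals `value`, with that column removed."""
    return [row[:axis] + row[axis + 1:] for row in rows if row[axis] == value]

def best_feature(rows):
    """Index of the feature with the largest positive gain, or -1 if none."""
    base, best_gain, best_axis = entropy(rows), 0.0, -1
    for axis in range(len(rows[0]) - 1):
        values = set(row[axis] for row in rows)
        new_entropy = sum(
            len(sub) / len(rows) * entropy(sub)
            for sub in (split(rows, axis, v) for v in values)
        )
        if base - new_entropy > best_gain:
            best_gain, best_axis = base - new_entropy, axis
    return best_axis

def build_tree(rows, labels):
    classes = [row[-1] for row in rows]
    if classes.count(classes[0]) == len(classes):
        return classes[0]                             # pure node: stop
    axis = best_feature(rows)
    if axis == -1:                                    # no features left, or no useful split
        return Counter(classes).most_common(1)[0][0]  # majority vote
    remaining = labels[:axis] + labels[axis + 1:]     # new list; caller's labels untouched
    return {labels[axis]: {
        value: build_tree(split(rows, axis, value), remaining)
        for value in set(row[axis] for row in rows)
    }}

data = [[1, 1, 'yes'], [1, 1, 'yes'], [0, 1, 'no'], [0, 1, 'no'], [0, 1, 'no']]
labels = ['no surfacing', 'flippers']
tree = build_tree(data, labels)
print(tree)
print(labels)  # still holds both feature names
```

Because labels is left intact, the same list can be passed straight to a classification function afterwards without calling createDataSet again.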

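If the data set has run out of features but the class labels in a leaf are still not unique, we have to decide how to label that leaf. In this case we use a majority vote: the leaf is assigned the class that most of its instances belong to. The code is as follows:

import operator

def majorityCnt(classList):
    classCount={}
    for vote in classList:
        if vote not in classCount.keys():
            classCount[vote]=0
        classCount[vote]+=1
    # items() works on both Python 2 and Python 3 (iteritems() exists only on Python 2)
    sortedClassCount=sorted(classCount.items(),key=operator.itemgetter(1),reverse=True)
    return sortedClassCount[0][0]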

(6) Classify with the decision tree

# Test the algorithm: classify with the decision tree
def classify(inputTree,featLabels,testVec):
    firstStr=list(inputTree.keys())[0]  # list() keeps this working on Python 3, where keys() is a view
    secondDict=inputTree[firstStr]
    featIndex=featLabels.index(firstStr)
    for key in secondDict.keys():
        if testVec[featIndex]==key:
            if type(secondDict[key]).__name__=='dict':
                classLabel=classify(secondDict[key],featLabels,testVec)
            else:
                classLabel=secondDict[key]
    return classLabel

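In the session below, myTree is the two-level tree {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}, which is the first tree returned by retrieveTree in treePlotter.py further down, not the single-split tree built from the encoded data set above.

>>> import trees
>>> myDat,labels=trees.createDataSet()
>>> labels
['no surfacing', 'flippers']
>>> trees.classify(myTree,labels,[1,0])
'no'
>>> trees.classify(myTree,labels,[1,1])
'yes'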

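Note that recursion is the key idea here: each nested dictionary is classified by the same function that classifies the whole tree.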
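The same walk can also be written as a loop. The Python 3 sketch below is an alternative for illustration, not the classify function from trees.py; the name classify_safe and its default parameter are assumptions. It differs in one respect: when the test vector contains a feature value that has no branch in the tree, it returns default, whereas classify would reach its return statement with classLabel never assigned and raise an UnboundLocalError.

```python
def classify_safe(tree, feat_labels, test_vec, default=None):
    """Walk a nested-dict tree; return `default` if a feature value has no branch."""
    while isinstance(tree, dict):
        feature = next(iter(tree))            # the single key is the feature name
        branches = tree[feature]
        value = test_vec[feat_labels.index(feature)]
        if value not in branches:
            return default                    # value never seen during training
        tree = branches[value]                # descend one level
    return tree                               # a non-dict node is a class label

tree = {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
labels = ['no surfacing', 'flippers']
print(classify_safe(tree, labels, [1, 0]))   # no
print(classify_safe(tree, labels, [1, 1]))   # yes
print(classify_safe(tree, labels, [2, 1]))   # None
```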

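(7) Storing the decision tree

Building a decision tree is time-consuming. To save computation, it is better to reuse an already built tree every time we classify. We can do this with Python's pickle module, which serializes an object so that it can be saved to disk and read back when needed.

# Using the algorithm: storing the decision tree
def storeTree(inputTree,filename):
    import pickle
    fw=open(filename,'wb')  # binary mode: required for pickle on Python 3, also valid on Python 2
    pickle.dump(inputTree,fw)
    fw.close()
def grabTree(filename):
    import pickle
    fr=open(filename,'rb')
    tree=pickle.load(fr)
    fr.close()  # close the file once the tree has been loaded
    return tree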

>>> reload(trees)
<module 'trees' from 'trees.py'>
>>> trees.storeTree(myTree,'classifierStorage.txt')
>>> trees.grabTree('classifierStorage.txt')
{'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}

After this call, classifierStorage.txt contains the pickled (serialized) form of the tree.
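A quick way to convince yourself that nothing is lost in storage is to write a tree, read it back, and compare. The sketch below is standalone Python 3; store_tree and grab_tree are illustrative names that mirror storeTree and grabTree, and the temporary directory and the .pkl file name are assumptions made so that the example does not touch your working directory.

```python
import os
import pickle
import tempfile

def store_tree(tree, filename):
    with open(filename, 'wb') as fw:   # pickle writes bytes, so use binary mode
        pickle.dump(tree, fw)

def grab_tree(filename):
    with open(filename, 'rb') as fr:
        return pickle.load(fr)

tree = {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
path = os.path.join(tempfile.mkdtemp(), 'classifierStorage.pkl')

store_tree(tree, path)
restored = grab_tree(path)
print(restored == tree)   # True: same structure and values
print(restored is tree)   # False: a new object was built from the file
```

The with statement closes the file even if pickling fails. Keep in mind that pickle.load should only be used on files you trust, since unpickling can execute arbitrary code.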

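Supplement:

Plotting the decision tree built above with Matplotlib annotations

Matplotlib provides a very useful annotation tool, annotations, which adds text notes to a plot. Annotations are usually used to explain the data being shown.

Create a file named treePlotter.py; all of the code below goes into this file.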

#!/usr/bin/python
# -*- coding: utf-8 -*-
import matplotlib.pyplot as plt

# Define the text box and arrow styles
decisionNode=dict(boxstyle="sawtooth",fc="0.8")
leafNode=dict(boxstyle="round4",fc="0.8")
arrow_args=dict(arrowstyle="<-")

# Draw an annotated node with an arrow from its parent
def plotNode(nodeTxt,centerPt,parentPt,nodeType):
    createPlot.ax1.annotate(nodeTxt,xy=parentPt,xycoords='axes fraction',xytext=centerPt,textcoords='axes fraction',va="center",ha="center",bbox=nodeType,arrowprops=arrow_args)

# Demo version that draws two sample nodes; it is replaced by createPlot(inTree) defined further down
def createPlot():
    fig=plt.figure(1,facecolor='white')
    fig.clf()
    createPlot.ax1=plt.subplot(111,frameon=False)
    plotNode('decision node',(0.5,0.1),(0.1,0.5),decisionNode)
    plotNode('leaf node',(0.8,0.1),(0.3,0.8),leafNode)
    plt.show()

# Get the number of leaf nodes and the depth of the tree
def getNumLeafs(myTree):
    numLeafs=0
    firstStr=list(myTree.keys())[0]  # list() keeps this working on Python 3
    secondDict=myTree[firstStr]
    for key in secondDict.keys():
        if type(secondDict[key]).__name__=='dict':
            numLeafs += getNumLeafs(secondDict[key])
        else:
            numLeafs +=1
    return numLeafs

def getTreeDepth(myTree):
    maxDepth=0
    firstStr=list(myTree.keys())[0]
    secondDict=myTree[firstStr]
    for key in secondDict.keys():
        if type(secondDict[key]).__name__=='dict':
            thisDepth=1+getTreeDepth(secondDict[key])
        else:
            thisDepth=1
        if thisDepth>maxDepth:
            maxDepth=thisDepth
    return maxDepth

# Return one of two prebuilt trees for testing
def retrieveTree(i):
    listOfTrees=[{'no surfacing':{0:'no',1:{'flippers':{0:'no',1:'yes'}}}},
                 {'no surfacing':{0:'no',1:{'flippers':{0:{'head':{0:'no',1:'yes'}},1:'no'}}}}]
    return listOfTrees[i]

# Write the branch value midway between the parent and child nodes
def plotMidText(cntrPt,parentPt,txtString):
    xMid=(parentPt[0]-cntrPt[0])/2.0+cntrPt[0]
    yMid=(parentPt[1]-cntrPt[1])/2.0+cntrPt[1]
    createPlot.ax1.text(xMid,yMid,txtString)

# Compute the width and height of the subtree, then draw it recursively
def plotTree(myTree,parentPt,nodeTxt):
    numLeafs=getNumLeafs(myTree)
    depth=getTreeDepth(myTree)
    firstStr=list(myTree.keys())[0]
    cntrPt=(plotTree.xOff+(1.0+float(numLeafs))/2.0/plotTree.totalW,plotTree.yOff)
    plotMidText(cntrPt,parentPt,nodeTxt)   # label the edge midway between parent and child
    plotNode(firstStr,cntrPt,parentPt,decisionNode)
    secondDict=myTree[firstStr]
    plotTree.yOff=plotTree.yOff-1.0/plotTree.totalD  # move down one level
    for key in secondDict.keys():
        if type(secondDict[key]).__name__=='dict':
            plotTree(secondDict[key],cntrPt,str(key))
        else:
            plotTree.xOff=plotTree.xOff+1.0/plotTree.totalW
            plotNode(secondDict[key],(plotTree.xOff,plotTree.yOff),cntrPt,leafNode)
            plotMidText((plotTree.xOff,plotTree.yOff),cntrPt,str(key))
    plotTree.yOff=plotTree.yOff+1.0/plotTree.totalD  # move back up after drawing the children

def createPlot(inTree):
    fig=plt.figure(1,facecolor='white')
    fig.clf()
    axprops=dict(xticks=[],yticks=[])
    createPlot.ax1=plt.subplot(111,frameon=False,**axprops)
    plotTree.totalW=float(getNumLeafs(inTree))
    plotTree.totalD=float(getTreeDepth(inTree))
    plotTree.xOff=-0.5/plotTree.totalW
    plotTree.yOff=1.0
    plotTree(inTree,(0.5,1.0),'')
    plt.show()

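Note on classify: the list method index used there, featLabels.index(firstStr), returns the index of the first element of the list that equals firstStr.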
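The layout in plotTree depends entirely on two numbers, the leaf count and the depth, so it is worth checking them without opening a plot window. The sketch below is standalone Python 3 and does not need Matplotlib; count_leaves, tree_depth, and render are names made up for this illustration, and the text outline is a rough substitute for the plot rather than a reproduction of it.

```python
def count_leaves(tree):
    """Number of leaf nodes; any non-dict value counts as one leaf."""
    if not isinstance(tree, dict):
        return 1
    branches = next(iter(tree.values()))
    return sum(count_leaves(child) for child in branches.values())

def tree_depth(tree):
    """Number of decision levels on the longest path from the root to a leaf."""
    if not isinstance(tree, dict):
        return 0
    branches = next(iter(tree.values()))
    return 1 + max(tree_depth(child) for child in branches.values())

def render(tree, indent=0):
    """Indented text outline of the tree, returned as a list of lines."""
    pad = '  ' * indent
    if not isinstance(tree, dict):
        return [pad + '-> ' + str(tree)]
    feature = next(iter(tree))
    lines = []
    for value, child in tree[feature].items():
        lines.append('%s%s = %s' % (pad, feature, value))
        lines.extend(render(child, indent + 1))
    return lines

tree = {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
print(count_leaves(tree), tree_depth(tree))   # 3 2
print('\n'.join(render(tree)))
```

For the first tree returned by retrieveTree this gives 3 leaves and a depth of 2, which are the values createPlot stores in plotTree.totalW and plotTree.totalD.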
