决策树-ID3与C4.5

决策树是利用可视化的树结构实现数据的分类或回归，现实中分类比如银行根据收入、职业、婚姻状况、年龄等属性了解客户的贷款是否能按期偿还，其分类为“是”或“否”，再比如过安检时，根据旅客脸部数据分类为是否是通缉人员；回归则是为了得到具体数值，比如根据年龄、身高、性别得到目标数据的体重。决策树是一种监督学习过程，利用已经标记的数据生成树模型。依照决策树发展历程先后有ID3，C4.5和CART（Classification and Regression Trees）,其中ID3、C4.5的提出者都是一个人，C4.5是对ID3的改进，ID3、C4.5其核心理论基础是之前介绍的信息熵，CART是一个能同时实现分类和回归的二叉树模型，本篇将重点介绍C4.5。

一、ID3

ID3和C4.5都只能解决分类问题，ID3是利用信息熵熵实现的决策树，关于熵的知识请看信息熵、交叉熵、相对熵原理与softmax函数的应用。ID3算法是首先计算样本总体信息熵，然后依次计算每个属性的条件熵，条件熵越小说明该属性对于降低分类是有益的，可以作为分类的结点，或者用信息熵减去条件熵得到信息增益，信息增益越大的属性作为决策树的结点，条件熵小和信息增益大表达是同一个意思，具体计算过程如下：

ID3优先选择值较多属性作为分类结点，这带来一些其他的问题：比如属性值不是离散型而是连续型的数值时，如果将连续值做离散化处理后会造成该属性的值非常多，造成在ID3的算法会优先选择数值型属性进行分类，因此，ID3只能处理属性值为离散型的分类。实践中很少利用ID3进行分类，本篇不做详细介绍。虽然ID3没有太大的应用价值，但可以'抛砖引玉'通过ID3的引出c4.5,c4.5改进了ID3算法，解决了ID3算法不足，c4.5能解决属性值是连续型的分类问题。

二、C4.5

ID3喜欢属性值多样化的属性作为决策树分裂结点，结合之前信息熵的知识可以了解到，属性值多的属性其本身的熵也大，如果将信息增益除以属性本身的熵，将属性的熵作为一个惩罚性因子，就可以抵消因为属性值多而带来的选择优势。C4.5正是基于这种思维获得很好的效果，只是在ID3基础上稍微变动了一下，让C4.5长期以来雄霸数据挖掘算法排行榜第一。

2.1 C4.5树分类与剪枝

下面是一个利用C4.5实现鸢尾花数据分类例子，这个数据集在介绍PCA时使用过，鸢尾花数据可以分为三类，分别是山鸢尾，北美鸢尾，变色鸢尾，鸢尾花具有4个属性，分别是花萼的长度、花萼的宽度、花瓣的长度，花瓣的宽度，可以发现这个例子中属性都是连续的数值型。C4.5的实现不算复杂，但是需要注意决策树在生成后一定要做剪枝以避免‘过拟合’现象，这里打上粗体表明重要性，没有经过剪枝的决策树是没有实用意义的。

==treePlot.py== 显示C4.5决策树工具类

import matplotlib.pyplot as pltplt.rcParams['font.sans-serif']=['SimHei']
plt.rcParams['axes.unicode_minus'] = False
decisionNode = dict(boxstyle='sawtooth', fc='0.8')
leafNode = dict(boxstyle='round4', fc='0.8')
arrow_args = dict(arrowstyle='<-')
def plotMidText(cntrPt, parentPt, txtString):xMid = (parentPt[0] - cntrPt[0]) / 2.0 + cntrPt[0]yMid = (parentPt[1] - cntrPt[1]) / 2.0 + cntrPt[1]createPlot.ax1.text(xMid, yMid, txtString, va="center", ha="center", rotation=30)def plotNode(nodeTxt, centerPt, parentPt, nodeType):createPlot.ax1.annotate(nodeTxt, xy=parentPt, xycoords='axes fraction',xytext=centerPt, textcoords='axes fraction',va='center', ha='center', bbox=nodeType,arrowprops=arrow_args)# 获取叶子节点数目和树的层数
def getNumLeafs(myTree):numLeafs = 0firstStr =list(myTree.keys())[0]secondDict = myTree[firstStr]for key in secondDict.keys():if (type(secondDict[key]).__name__ == 'dict'):numLeafs += getNumLeafs(secondDict[key])else:numLeafs += 1return numLeafsdef getTreeDepth(myTree):maxDepth = 0firstStr = list(myTree.keys())[0]secondDict = myTree[firstStr]for key in secondDict.keys():if (type(secondDict[key]).__name__ == 'dict'):thisDepth = 1 + getTreeDepth(secondDict[key])else:thisDepth = 1if thisDepth > maxDepth: maxDepth = thisDepthreturn maxDepth
def plotTree(myTree, parentPt, nodeTxt):  # if the first key tells you what feat was split onnumLeafs = getNumLeafs(myTree)  # this determines the x width of this treedepth = getTreeDepth(myTree)firstStr =  list(myTree.keys()) [0]  # the text label for this node should be thiscntrPt = (plotTree.xOff + (1.0 + float(numLeafs)) / 2.0 / plotTree.totalW, plotTree.yOff)plotMidText(cntrPt, parentPt, nodeTxt)plotNode(firstStr, cntrPt, parentPt, decisionNode)secondDict = myTree[firstStr]plotTree.yOff = plotTree.yOff - 1.0 / plotTree.totalDfor key in secondDict.keys():if type(secondDict[key]).__name__ == 'dict':  # test to see if the nodes are dictonaires, if not they are leaf nodesplotTree(secondDict[key], cntrPt, str(key))  # recursionelse:  # it's a leaf node print the leaf nodeplotTree.xOff = plotTree.xOff + 1.0 / plotTree.totalWplotNode(secondDict[key], (plotTree.xOff, plotTree.yOff), cntrPt, leafNode)plotMidText((plotTree.xOff, plotTree.yOff), cntrPt, str(key))plotTree.yOff = plotTree.yOff + 1.0 / plotTree.totalDdef createPlot(inTree):fig = plt.figure(1, facecolor='white')# fig.title("c4.5",size=14)fig.clf()axprops = dict(xticks=[], yticks=[])createPlot.ax1 = plt.subplot(111, frameon=False, **axprops)  # no ticks#createPlot.ax1.set_title("c4.5", size=24)# createPlot.ax1 = plt.subplot(111, frameon=False) #ticks for demo puropsesplotTree.totalW = float(getNumLeafs(inTree))plotTree.totalD = float(getTreeDepth(inTree))plotTree.xOff = -0.5 / plotTree.totalW;plotTree.yOff = 1.0;plotTree(inTree, (0.5, 1.0), '')plt.show()

==c45.py== c4.5决策树实现代码：

# coding=utf-8
import math
import operator
from sklearn import datasets
from  treePlot import *#加载数据集
def createDataSet_iris():iris = datasets.load_iris()dataSet = []for var in iris.data:dataSet.append(list(var))targets = iris.targetfor index, var in enumerate(targets):dataSet[index].append(var)labels = ['花萼长度', '花萼宽度', '花瓣长度', '花瓣宽度']return dataSet,labels##计算给定数据集的信息熵
def calcShannonEnt(dataSet):numEntries = len(dataSet)labelCounts = {}for featVec in dataSet:currentLabel = featVec[-1]if currentLabel not in labelCounts.keys():  # 为所有可能分类创建字典labelCounts[currentLabel] = 0labelCounts[currentLabel] += 1shannonEnt = 0.0for key in labelCounts:prob = float(labelCounts[key]) / numEntriesshannonEnt -= prob * math.log(prob, 2)  # 以2为底数求对数return shannonEnt# 依据特征划分数据集  axis代表第几个特征  value代表该特征所对应的值  返回的是划分后的数据集
def splitDataSet(dataSet, axis, value):retDataSet = []for featVec in dataSet:if featVec[axis] == value:reducedFeatVec = featVec[:axis]reducedFeatVec.extend(featVec[axis + 1:])retDataSet.append(reducedFeatVec)return retDataSet# 选择信息增益比最大属性作为分裂节点
def chooseBestFeatureToSplit(dataSet):numFeatures = len(dataSet[0]) - 1  # 特征个数baseEntropy = calcShannonEnt(dataSet)bestInfoGainrate = 0.0bestFeature = -1for i in range(numFeatures):  # 遍历特征 第i个featureSet = set([example[i] for example in dataSet])  # 第i个特征取值集合newEntropy = 0.0splitinfo = 0.0for value in featureSet:subDataSet = splitDataSet(dataSet, i, value)prob = len(subDataSet) / float(len(dataSet))newEntropy += prob * calcShannonEnt(subDataSet)  # 该特征划分所对应的entropysplitinfo -= prob * math.log(prob, 2)if not splitinfo:splitinfo = -0.99 * math.log(0.99, 2) - 0.01 * math.log(0.01, 2)infoGain = baseEntropy - newEntropyinfoGainrate = float(infoGain) / float(splitinfo)#信息增益比if infoGainrate > bestInfoGainrate:bestInfoGainrate = infoGainratebestFeature = ireturn bestFeature# 创建树的函数代码   python中用字典类型来存储树的结构 返回的结果是myTree-字典
def createTree(dataSet, labels):classList = [example[-1] for example in dataSet]# 类别完全相同则停止继续划分  返回类标签-叶子节点if classList.count(classList[0]) == len(classList):return classList[0]if len(dataSet[0]) == 1:return majorityCnt(classList)  # 遍历完所有的特征时返回出现次数最多的bestFeat = chooseBestFeatureToSplit(dataSet)bestFeatLabel = labels[bestFeat]myTree = {bestFeatLabel: {}}del (labels[bestFeat])featValues = [example[bestFeat] for example in dataSet]  # 得到的列表包含所有的属性值uniqueVals = set(featValues)for value in uniqueVals:subLabels = labels[:]myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value), subLabels)return myTree# 多数表决的方法决定叶子节点的分类 ----  当所有的特征全部用完时仍属于多类
def majorityCnt(classList):classCount = {}for vote in classList:if vote not in classCount.keys():classCount[vote] = 0;classCount[vote] += 1# 排序函数 operator中的sortedClassCount = sorted(classCount.iteritems(), key=operator.itemgetter(1), reverse=True)return sortedClassCount[0][0]# 自底向上剪枝
def prune_downtoup(inputTree, dataSet, featLabels, count):# global numfirstStr =  list(inputTree.keys())[0]secondDict = inputTree[firstStr]featIndex = featLabels.index(firstStr)for key in secondDict.keys():  # 走到最深的非叶子结点if type(secondDict[key]).__name__ == 'dict':tempcount = []  # 本将的记录rightcount = 0wrongcount = 0tempfeatLabels = featLabels[:]subDataSet = splitDataSet(dataSet, featIndex, key)tempfeatLabels.remove(firstStr)getCount(secondDict[key], subDataSet, tempfeatLabels, tempcount)tempnum = 0.0wrongnum = 0.0old = 0.0# 标准误差standwrong = 0.0for var in tempcount:tempnum += var[0] + var[1]wrongnum += var[1]old = float(wrongnum + 0.5 * len(tempcount)) / float(tempnum)standwrong = math.sqrt(tempnum * old * (1 - old))# 假如剪枝new = float(wrongnum + 0.5) / float(tempnum)if tempnum*new <= tempnum*old + standwrong :  # 要确定新叶子结点的类别# 误判率最低的叶子节点的类为新叶子结点的类# 在count的每一个列表类型的元素里再加一个标记类别的元素。wrongtemp = 1.0newtype = -1for var in tempcount:if float(var[1] + 0.5) / float(var[0] + var[1]) < wrongtemp:wrongtemp = float(var[1] + 0.5) / float(var[0] + var[1])newtype = var[-1]secondDict[key] = str(newtype)tempcount = []  # 这个相当复杂，因为如果发生剪枝，才会将它置空，如果不发生剪枝，那么应该保持原来的叶子结点的结构for var in tempcount:count.append(var)for key in secondDict.keys():if type(secondDict[key]).__name__ == 'dict':continuerightcount = 0wrongcount = 0subDataSet = splitDataSet(dataSet, featIndex, key)for eachdata in subDataSet:if str(eachdata[-1]) == str(secondDict[key]):rightcount += 1else:wrongcount += 1count.append([rightcount, wrongcount, secondDict[key]])  # 最后一个为该叶子结点的类别#计算任意子树正确率
def getCount(inputTree, dataSet, featLabels, count):# global numfirstStr =  list(inputTree.keys())[0]secondDict = inputTree[firstStr]featIndex = featLabels.index(firstStr)# count=[]for key in secondDict.keys():rightcount = 0wrongcount = 0tempfeatLabels = featLabels[:]subDataSet = splitDataSet(dataSet, featIndex, key)tempfeatLabels.remove(firstStr)if type(secondDict[key]).__name__ == 'dict':# 如果是子树结点，递归调用getCount(secondDict[key], subDataSet, tempfeatLabels, count)else:for eachdata in subDataSet:if str(eachdata[-1]) == str(secondDict[key]):rightcount += 1else:wrongcount += 1count.append([rightcount, wrongcount, secondDict[key]])# num+=rightcount+wrongcount# 自顶向下剪枝
def prune_uptodown(inputTree, dataSet, featLabels):firstStr =   list(inputTree.keys()) [0]secondDict = inputTree[firstStr]featIndex = featLabels.index(firstStr)for key in secondDict.keys():if type(secondDict[key]).__name__ == 'dict':#如果是子树则计算该结点错误率tempfeatLabels = featLabels[:]subDataSet = splitDataSet(dataSet, featIndex, key)tempfeatLabels.remove(firstStr)tempcount = []#tempcount保存了所有子树结点正确与错误的个数、以及该子树对应分裂属性getCount(secondDict[key], subDataSet, tempfeatLabels, tempcount)tempnum,wrongnum,standwrong = 0.0,0.0,0.0for var in tempcount:tempnum += var[0] + var[1]wrongnum += var[1]treeErr = float(wrongnum + 0.5 * len(tempcount)) / float(tempnum)standwrong = math.sqrt(tempnum * treeErr * (1 - treeErr))#方差# 如果用叶结点代替子树结点nodeErr = float(wrongnum + 0.5) / float(tempnum)#判断条件对应公式(2.4)if tempnum*nodeErr <= tempnum*treeErr + standwrong :  # 要确定新叶子结点的类别# 误判率最低的叶子节点的类为新叶子结点的类# 在count的每一个列表类型的元素里再加一个标记类别的元素。#print(key,old,new)wrongtemp = 1.0newtype = -1for var in tempcount:if float(var[1] + 0.5) / float(var[0] + var[1]) < wrongtemp:wrongtemp = float(var[1] + 0.5) / float(var[0] + var[1])newtype = var[-1]secondDict[key] = str(newtype)def getsortNum(mapfeatures,num):if  num<mapfeatures[0]  or num > mapfeatures[len(mapfeatures) - 1]  :return   numfor i in range(len(mapfeatures) - 1):if (num > mapfeatures[i] and num <= mapfeatures[i + 1]):return mapfeatures[i + 1]#处理连续型属性
def handleContinuousNumber(dataset,index ):features=set(data[index] for data in dataset[:])sortfeatures=sorted( features)mapfeatures=[]for i in range(len(sortfeatures)-1):mapfeatures.append(  round ((sortfeatures[i]+sortfeatures[i+1])/2 ,2)  )for i in range(len(dataset)):dataset[i][index]=getsortNum(mapfeatures,dataset[i][index])if __name__ == '__main__':global numnum = 0dataset, features = createDataSet_iris()for a in range(len(features)):handleContinuousNumber(dataset[:], a)#1features4uptodown= features.copy()tree = createTree(dataset, features)#2createPlot(tree)#自顶向下剪枝prune_uptodown(tree, dataset, features4uptodown )#3createPlot(tree)#自低向上剪枝'''features4downtoup = features.copy()count = [] cutBranch_downtoup(tree, dataset, features4downtoup, count) createPlot(tree)  '''

程序说明：

1#:handleContinuousNumber(dataset[:], a) ：处理连续数值型属性，加载数据后排序，将排序好数组相邻两个元素取平均值，新生成的序列将属性值划分为149个区间（原来有150个数据），依次将4个属性做离散化处理。

2# tree = createTree(dataset, features)根据公式(1.5)生成决策树，实现此功能的函数是chooseBestFeatureToSplit，chooseBestFeatureToSplit函数是利用信息增益比选择最佳的属性作为决策树的节点。

3#prune_uptodown(tree, dataset, features4uptodown ),这里采用悲观算法实现从顶向下剪枝，没有剪枝前决策树如下图:

剪枝后发现，根据花瓣宽度属性就可以实现分类，如下图所示：

2.2 C4.5剪枝

余下文章请访问链接 C4.5剪枝

决策树-ID3与C4.5相关推荐

决策树ID3与C4.5
决策树ID3与C4.5 参考极客时间<数据分析实战45讲>17章决策树的生成决策树的生成会经历两个阶段:构造和剪枝构造构造就是生成一颗完整的决策树.简单来说,构造的 ...
机器学习算法 04 —— 决策树(ID3、C4.5、CART，剪枝，特征提取，回归决策树)
文章目录系列文章决策树 1 决策树算法简介 2 决策树分类的原理 2.1 信息熵 2.2 决策树划分依据-信息增益(ID3) 2.3 决策树划分依据-信息增益率(C4.5) 2.4 决策树划分依据 ...
决策树ID3、C4.5、CART、随机森林、bagging、boosting、Adaboost、GBDT、xgboost
决策树 1.决策树学习算法主要由三部分构成: 特征选择决策树生成决策树的剪枝特征选择而我们应该基于什么准则来判定一个特征的分类能力呢? 这时候,需要引入一个概念:信息增益.,下面是公式(其中x ...
决策树ID3、C4.5、CART算法
参考[机器学习]决策树(上)--ID3.C4.5.CART(非常详细) - 阿泽的文章 - 知乎 https://zhuanlan.zhihu.com/p/85731206 算法的目的都是为了构建决策 ...
决策树ID3，C4.5，CART算法及实现
文章目录一.决策树 1.画法 2.决策树的剪枝 3.挑西瓜决策树 3.1利用信息增益选择最优划分属性 3.2python代码实现二.sk-learn库对西瓜数据集,分别进行ID3.C4.5和CAR ...
决策树ID3、C4.5和CART算法例子详解
决策树决策树是附加概率结果的一个树状的决策图,是直观的运用统计概率分析的图法.机器学习中决策树是一个预测模型,它表示对象属性和对象值之间的一种映射,树中的每一个节点表示对象属性的判断条件,其分支表示 ...
决策树ID3、C4.5这两种算法的特点？为什么使用C4.5要好？
ID3.C4.5这两种算法的特点?为什么使用C4.5要好? 参考:链接 [参考文章] [1]. <机器学习> p73–p79 见附件书本 [2]. C4.5算法详解 https://blo ...
决策树——ID3和C4.5
决策树(decision tree)是一个树结构(可以是二叉树或非二叉树).其每个非叶节点表示一个特征属性上的测试,每个分支代表这个特征属性在某个值域上的输出,而每个叶节点存放一个类别.使用决策树进行 ...
决策树ID3和C4.5算法Python实现源码
决策树的概念其实不难理解,下面一张图是某女生相亲时用到的决策树: 基本上可以理解为:一堆数据,附带若干属性,每一条记录最后都有一个分类(见或者不见),然后根据每种属性可以进行划分(比如年龄是>3 ...

决策树-ID3与C4.5

决策树-ID3与C4.5相关推荐

最新文章

热门文章