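[Machine Learning in Action] Decision trees: predicting contact lens types

0. Collecting the data

The dataset used here is the lenses.txt file that accompanies Machine Learning in Action. Its contents are as follows: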

young   myope   no  reduced no lenses
young   myope   no  normal  soft
young   myope   yes reduced no lenses
young   myope   yes normal  hard
young   hyper   no  reduced no lenses
young   hyper   no  normal  soft
young   hyper   yes reduced no lenses
young   hyper   yes normal  hard
pre myope   no  reduced no lenses
pre myope   no  normal  soft
pre myope   yes reduced no lenses
pre myope   yes normal  hard
pre hyper   no  reduced no lenses
pre hyper   no  normal  soft
pre hyper   yes reduced no lenses
pre hyper   yes normal  no lenses
presbyopic  myope   no  reduced no lenses
presbyopic  myope   no  normal  no lenses
presbyopic  myope   yes reduced no lenses
presbyopic  myope   yes normal  hard
presbyopic  hyper   no  reduced no lenses
presbyopic  hyper   no  normal  soft
presbyopic  hyper   yes reduced no lenses
presbyopic  hyper   yes normal  no lenses

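The first four columns are, in order, age, prescript, astigmatic, and tearRate; the last column is the contact lens type, which is the class we want to predict.

1. Preparing the data: parsing tab-delimited lines

The columns in our data file are separated by Tab characters, so the first step is to split each line into its fields.

The code is shown below: strip() removes the leading and trailing whitespace of each line, including the newline, and split('\t') breaks the line apart wherever a '\t' (that is, a Tab) occurs.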

with open('lenses.txt') as fr:  # open the dataset file
    lenses = [inst.strip().split('\t') for inst in fr]  # parse the tab-delimited lines

Because lenses.txt does not name its columns, I keep the column names in the lensesLabels variable.

lensesLabels = ['age', 'prescript', 'astigmatic', 'tearRate']
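
As a quick check of the parsing step, here is a minimal sketch that runs the same strip() and split('\t') logic on the first two rows of lenses.txt, inlined as a string so the example does not need the file. The names sample, rows, and first are only illustrative.

```python
import io

# The first two rows of lenses.txt, inlined with explicit tab characters.
sample = io.StringIO(
    "young\tmyope\tno\treduced\tno lenses\n"
    "young\tmyope\tno\tnormal\tsoft\n"
)

rows = [line.strip().split('\t') for line in sample]
labels = ['age', 'prescript', 'astigmatic', 'tearRate']

# Pair each feature value with its column name; zip stops after the four
# labels, so the class in the last field is left out of the dictionary.
first = dict(zip(labels, rows[0]))
print(first)
print(rows[0][-1])  # the class label of the first row
```

Note that the class "no lenses" contains a space, which is why the lines have to be split on Tab rather than on arbitrary whitespace.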

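With the data ready, we can start building the decision tree.

2. Building the decision tree

Decision tree: a tree structure in which each internal node represents a test on an attribute, each branch represents one outcome of that test, and each leaf node represents a class.

Pros: low computational cost, results that are easy to interpret, tolerance of missing intermediate values, and the ability to handle irrelevant features.

Cons: prone to overfitting.

Works with: numeric and nominal values.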

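2.1 Information gain

The guiding principle for splitting a dataset is to make disordered data more ordered. The change in information before and after a split is called the information gain; here we use Shannon entropy to measure it.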

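If the items to be classified can fall into several classes, the information of the symbol $x_i$ is defined as: $l(x_i) = -\log_2 p(x_i)$

where $p(x_i)$ is the probability of choosing that class.

To compute the entropy, we need the expected value of the information over all possible values of all classes, which is given by the following formula (where $n$ is the number of classes):
$H = -\sum_{i=1}^{n} p(x_i) \log_2 p(x_i)$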

from math import log

# Compute the Shannon entropy of a dataset
def calcShannonEnt(dataSet):
    numEntries = len(dataSet)  # total number of instances in the dataset
    labelCounts = {}
    for featVec in dataSet:
        currentLabel = featVec[-1]  # the last element of a row is its class label
        if currentLabel not in labelCounts:
            labelCounts[currentLabel] = 0  # first occurrence: start the count at 0
        labelCounts[currentLabel] += 1
    shannonEnt = 0.0  # accumulates the Shannon entropy
    for key in labelCounts:
        prob = float(labelCounts[key]) / numEntries  # p(x_i)
        shannonEnt -= prob * log(prob, 2)  # add -p(x_i) * log2 p(x_i)
    return shannonEnt

The higher the entropy, the more mixed the data, so computing the Shannon entropy gives us a criterion for splitting the dataset.
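
To make the formula concrete, here is a small sketch that evaluates it for the class column of lenses.txt. The counts (15 "no lenses", 5 "soft", 4 "hard") were tallied by hand from the listing above, and the helper name entropy is only illustrative; it works on label counts rather than on rows.

```python
from collections import Counter
from math import log2

# Class counts for the 24 rows of lenses.txt, tallied from the listing.
class_counts = Counter({'no lenses': 15, 'soft': 5, 'hard': 4})

def entropy(counts):
    """Shannon entropy (in bits) of a mapping from label to count."""
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())

h_full = entropy(class_counts)                 # mixed labels: about 1.326 bits
h_pure = entropy(Counter({'no lenses': 12}))   # a single class: zero entropy
print(round(h_full, 4))
```

A subset that contains only one class has entropy 0, which is the state every split tries to move towards.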

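2.2 Splitting the dataset

First, we pull out the rows that share a given value of the chosen feature, removing that feature's column from each of them.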

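# Arguments: the dataset to split, the index of the feature to split on,
# and the feature value to select
def splitDataSet(dataSet, axis, value):
    retDataSet = []  # build a new list so the input is left untouched
    for featVec in dataSet:
        if featVec[axis] == value:
            reducedFeatVec = featVec[:axis]  # fields before the chosen feature
            reducedFeatVec.extend(featVec[axis + 1:])  # fields after the chosen feature
            retDataSet.append(reducedFeatVec)  # the row without the chosen feature
    return retDataSet  # the matching rows with the chosen feature removed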
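
The effect of this step is easiest to see on a few rows. The sketch below restates the same idea as a list comprehension under the hypothetical name split_rows (it is not part of the original code) and applies it to the first four rows of lenses.txt, selecting tearRate == 'normal' (column index 3).

```python
def split_rows(rows, axis, value):
    """Keep the rows whose column `axis` equals `value`, dropping that column."""
    return [row[:axis] + row[axis + 1:] for row in rows if row[axis] == value]

# The first four rows of lenses.txt.
rows = [
    ['young', 'myope', 'no', 'reduced', 'no lenses'],
    ['young', 'myope', 'no', 'normal', 'soft'],
    ['young', 'myope', 'yes', 'reduced', 'no lenses'],
    ['young', 'myope', 'yes', 'normal', 'hard'],
]

normal = split_rows(rows, 3, 'normal')
print(normal)
# [['young', 'myope', 'no', 'soft'], ['young', 'myope', 'yes', 'hard']]
```

The two surviving rows no longer carry the tearRate column, so a later split cannot pick the same feature again.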

Next, we compute the Shannon entropy that results from splitting on each feature in turn and pick the feature that gives the best split.

# Choose the best feature to split the dataset on
def chooseBestFeatureToSplit(dataSet):
    numFeatures = len(dataSet[0]) - 1  # number of features (the last column is the class)
    baseEntropy = calcShannonEnt(dataSet)  # entropy of the class labels before any split
    bestInfoGain = 0.0
    bestFeature = -1
    for i in range(numFeatures):
        featList = [example[i] for example in dataSet]  # the i-th feature of every row
        uniqueVals = set(featList)  # the distinct values of that feature
        newEntropy = 0.0
        for value in uniqueVals:  # entropy after splitting on feature i
            subDataSet = splitDataSet(dataSet, i, value)  # rows where feature i == value
            prob = len(subDataSet) / float(len(dataSet))  # fraction of rows in this subset
            newEntropy += prob * calcShannonEnt(subDataSet)  # weighted sum of subset entropies
        infoGain = baseEntropy - newEntropy
        if infoGain > bestInfoGain:  # keep the largest gain, i.e. the lowest entropy after the split
            bestInfoGain = infoGain
            bestFeature = i
    return bestFeature

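At this point we can find the best split for the current data. A decision tree does not split just once, though; it keeps splitting level by level, so the next step is to build the tree recursively.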
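
The selection rule can be checked by hand on a tiny example. The following is a self-contained sketch on made-up toy data (two binary features followed by a class label); the names entropy, info_gain, and toy are illustrative and not part of the original code.

```python
from collections import Counter
from math import log2

def entropy(rows):
    """Shannon entropy of the class labels (last column) of the rows."""
    counts = Counter(row[-1] for row in rows)
    return -sum((c / len(rows)) * log2(c / len(rows)) for c in counts.values())

def info_gain(rows, axis):
    """Entropy before the split minus the weighted entropy after it."""
    after = 0.0
    for value in {row[axis] for row in rows}:
        subset = [row for row in rows if row[axis] == value]
        after += len(subset) / len(rows) * entropy(subset)
    return entropy(rows) - after

# Made-up toy data, not taken from lenses.txt.
toy = [
    [1, 1, 'yes'],
    [1, 1, 'yes'],
    [1, 0, 'no'],
    [0, 1, 'no'],
    [0, 1, 'no'],
]

gains = [info_gain(toy, axis) for axis in range(2)]
best = max(range(2), key=lambda axis: gains[axis])
print([round(g, 3) for g in gains], best)  # [0.42, 0.171] 0
```

Feature 0 gives the larger gain because one of its two subsets is already pure, so it would be chosen for the first split.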

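2.3 Building the tree recursively

How it works: start from the original data and split it on the best feature. Because a feature may take more than two values, a split may produce more than two branches. After the first split, the data is passed down each branch to the next node, where it can be split again. This makes recursion a natural way to process the dataset.

The recursion stops when the program has used up every feature available for splitting, or when all instances under a branch belong to the same class. If all instances have the same class, we get a leaf node (terminating block), and any data that reaches a leaf node necessarily belongs to that leaf's class.

If every feature has been used but the class labels under a branch are still mixed, we fall back to a majority vote. The function takes a list of class names and builds a dictionary whose keys are the unique values in classList and whose values are how often each class label occurs; it then uses operator to sort the dictionary items by count and returns the class name that occurs most often.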

import operator

def majorityCnt(classList):
    classCount = {}
    for vote in classList:
        if vote not in classCount:
            classCount[vote] = 0
        classCount[vote] += 1
    # sort the (label, count) pairs by count, largest first
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]  # the class name that occurs most often
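
The same majority vote can also be written with the standard library's collections.Counter. This is an alternative sketch, not the code used in the rest of this post, and the name majority_label is hypothetical.

```python
from collections import Counter

def majority_label(class_list):
    """Return the label that occurs most often in class_list."""
    # most_common(1) returns a one-element list of (label, count) pairs.
    return Counter(class_list).most_common(1)[0][0]

winner = majority_label(['soft', 'hard', 'no lenses', 'no lenses'])
print(winner)  # no lenses
```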

Now we can build the tree. The variable myTree holds nested dictionaries that encode the structure of the tree; once the function returns, the tree is complete.

# Build the tree; takes two arguments: the dataset and the list of feature labels
def creatTree(dataSet, labels):
    classList = [example[-1] for example in dataSet]  # the class label of every row
    if classList.count(classList[0]) == len(classList):  # all rows share one class: stop splitting
        return classList[0]
    if len(dataSet[0]) == 1:  # no features left: return the most common class
        return majorityCnt(classList)
    bestFeat = chooseBestFeatureToSplit(dataSet)  # index of the best feature to split on
    bestFeatLabel = labels[bestFeat]  # name of that feature
    myTree = {bestFeatLabel: {}}
    subLabels = labels[:]  # work on a copy so the caller's label list is not modified
    del subLabels[bestFeat]
    featValues = [example[bestFeat] for example in dataSet]  # every value the chosen feature takes
    uniqueVals = set(featValues)
    for value in uniqueVals:
        myTree[bestFeatLabel][value] = creatTree(splitDataSet(dataSet, bestFeat, value), subLabels)
    return myTree
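
To show how such a nested dictionary can be read, here is a minimal sketch of a function that walks the structure until it reaches a leaf. The function classify and the hand-written toy_tree are assumptions made for illustration: toy_tree only has the same shape as the output of the tree builder and is not the tree actually learned from lenses.txt.

```python
def classify(tree, feature_names, sample):
    """Walk a nested-dict tree until a leaf (a non-dict value) is reached."""
    while isinstance(tree, dict):
        feature = next(iter(tree))   # the single key names the feature tested here
        branches = tree[feature]
        value = sample[feature_names.index(feature)]
        tree = branches[value]       # raises KeyError for a value the tree has not seen
    return tree

# Hypothetical hand-written tree, used only to show the nested-dict shape.
toy_tree = {'tearRate': {'reduced': 'no lenses',
                         'normal': {'astigmatic': {'no': 'soft', 'yes': 'hard'}}}}
names = ['age', 'prescript', 'astigmatic', 'tearRate']

result = classify(toy_tree, names, ['young', 'myope', 'yes', 'normal'])
print(result)  # hard
```

Each dictionary level has exactly one key (the feature tested at that node), and its value maps every feature value either to a class name or to another dictionary of the same form.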

3. Plotting the tree with Matplotlib annotations

This part mostly relies on Matplotlib plotting and has little to do with machine learning, so I will not go through the code in detail.

import matplotlib.pyplot as plt

# Text box and arrow styles
decisionNode = dict(boxstyle="sawtooth", fc="0.8")
leafNode = dict(boxstyle="round4", fc="0.8")
arrow_args = dict(arrowstyle="<-")

# Draw a node with an arrow coming from its parent
def plotNode(nodeTxt, centerPt, parentPt, nodeType):
    createPlot.ax1.annotate(nodeTxt, xy=parentPt, xycoords='axes fraction',
                            xytext=centerPt, textcoords='axes fraction',
                            va="center", ha="center", bbox=nodeType, arrowprops=arrow_args)

# Count the leaf nodes of the tree
def getNumLeafs(myTree):
    numLeafs = 0
    firstStr = list(myTree.keys())[0]
    secondDict = myTree[firstStr]
    for key in secondDict.keys():
        if isinstance(secondDict[key], dict):  # a dict is a subtree, anything else is a leaf
            numLeafs += getNumLeafs(secondDict[key])
        else:
            numLeafs += 1
    return numLeafs

# Compute the depth of the tree
def getTreeDepth(myTree):
    maxDepth = 0
    firstStr = list(myTree.keys())[0]
    secondDict = myTree[firstStr]
    for key in secondDict.keys():
        if isinstance(secondDict[key], dict):
            thisDepth = 1 + getTreeDepth(secondDict[key])
        else:
            thisDepth = 1
        if thisDepth > maxDepth:
            maxDepth = thisDepth
    return maxDepth

# Write the branch label halfway between a parent node and a child node
def plotMidText(cntrPt, parentPt, txtString):
    xMid = (parentPt[0] - cntrPt[0]) / 2.0 + cntrPt[0]
    yMid = (parentPt[1] - cntrPt[1]) / 2.0 + cntrPt[1]
    createPlot.ax1.text(xMid, yMid, txtString, va="center", ha="center", rotation=30)

# Lay out and draw a (sub)tree, using its width and depth to place the nodes
def plotTree(myTree, parentPt, nodeTxt):
    numLeafs = getNumLeafs(myTree)
    depth = getTreeDepth(myTree)
    firstStr = list(myTree.keys())[0]
    cntrPt = (plotTree.xOff + (1.0 + float(numLeafs)) / 2.0 / plotTree.totalW, plotTree.yOff)
    plotMidText(cntrPt, parentPt, nodeTxt)  # label the branch with the feature value that leads here
    plotNode(firstStr, cntrPt, parentPt, decisionNode)
    secondDict = myTree[firstStr]
    plotTree.yOff = plotTree.yOff - 1.0 / plotTree.totalD
    for key in secondDict.keys():
        if isinstance(secondDict[key], dict):  # a dict is a subtree: recurse
            plotTree(secondDict[key], cntrPt, str(key))
        else:  # anything else is a leaf node: draw it
            plotTree.xOff = plotTree.xOff + 1.0 / plotTree.totalW
            plotNode(secondDict[key], (plotTree.xOff, plotTree.yOff), cntrPt, leafNode)
            plotMidText((plotTree.xOff, plotTree.yOff), cntrPt, str(key))
    plotTree.yOff = plotTree.yOff + 1.0 / plotTree.totalD

# This function does the actual drawing; the functions above work out the layout
def createPlot(inTree):
    fig = plt.figure(1, facecolor='white')
    fig.clf()
    axprops = dict(xticks=[], yticks=[])
    createPlot.ax1 = plt.subplot(111, frameon=False, **axprops)  # no ticks
    plotTree.totalW = float(getNumLeafs(inTree))
    plotTree.totalD = float(getTreeDepth(inTree))
    plotTree.xOff = -0.5 / plotTree.totalW
    plotTree.yOff = 1.0
    plotTree(inTree, (0.5, 1.0), '')
    plt.axis('off')  # hide the axes
    plt.show()

4. Using the algorithm

Main program:

if __name__ == "__main__":
    with open('lenses.txt') as fr:  # open the dataset file
        lenses = [inst.strip().split('\t') for inst in fr]  # parse the tab-delimited lines
    lensesLabels = ['age', 'prescript', 'astigmatic', 'tearRate']
    lensesTree = creatTree(lenses, lensesLabels)
    createPlot(lensesTree)

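Running the program produces the result shown in the figure below:

5. Summary

The idea behind this algorithm is not really complicated, but I ran into one difficulty after another while reading the code.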