Machine Learning in Action, ch03: Using a Decision Tree to Predict Contact Lens Type

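General workflow for decision trees
1. Collect data
2. Prepare data: the tree-construction algorithm only works on nominal data, so numeric data must be discretized
3. Analyze data
4. Train the algorithm
5. Test the algorithm
6. Use the algorithm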
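Step 2 says numeric values must be discretized before tree construction can use them. Below is a minimal sketch of one way to do that with fixed cut points and the standard-library `bisect` module. The function name `discretize`, the cut points, and the bin names are hypothetical and do not come from the book.

```python
from bisect import bisect_right


def discretize(value, cut_points, names):
    """Map a numeric value to a nominal bin name.

    cut_points must be sorted ascending, and names needs one more
    entry than cut_points. A value equal to a cut point falls into
    the upper bin.
    """
    if len(names) != len(cut_points) + 1:
        raise ValueError("need exactly len(cut_points) + 1 bin names")
    return names[bisect_right(cut_points, value)]


# Hypothetical example: ages in years turned into three nominal categories.
age_cuts = [30, 50]
age_names = ['young', 'middle', 'senior']
ages = [22, 30, 49.5, 71]
print([discretize(a, age_cuts, age_names) for a in ages])
```

After this step every feature column holds a small set of category names, which is the form that `splitDataSet` and `createTree` expect.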
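Advantages of decision trees
1. The form of the data is very easy to understand
2. Computational complexity is low, the output is easy to understand, the method is insensitive to missing intermediate values, and it can handle irrelevant features
3. The classifier can be stored on disk with the pickle module, which saves computation time
Disadvantages of decision trees
1. They may overfit the training data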
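The problem to solve when constructing a decision tree
1. Which feature of the current dataset is decisive when splitting the data into classes
Splitting the data
1. Use the ID3 algorithm to split the data
2. The guiding principle for splitting a dataset: make disordered data more ordered
3. The disorder of a dataset can be measured with information entropy (Shannon entropy); the higher the entropy, the more mixed the data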
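For class proportions p_i, Shannon entropy is H = -sum(p_i * log2(p_i)), and the information gain of a feature is the base entropy minus the weighted entropy of the subsets produced by splitting on that feature. The sketch below works this out on the fish dataset used later in trees.py. The helper names `entropy` and `info_gain` are hypothetical; they are a compact restatement, not the book's functions.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def info_gain(rows, feature_index):
    """Entropy reduction obtained by splitting rows on one feature column.

    Each row is a list whose last element is the class label.
    """
    base = entropy([row[-1] for row in rows])
    remainder = 0.0
    for value in set(row[feature_index] for row in rows):
        subset = [row[-1] for row in rows if row[feature_index] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder


# The fish dataset returned by createDataSet() in trees.py.
fish = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
print(round(entropy([row[-1] for row in fish]), 4))  # base entropy
print(round(info_gain(fish, 0), 4))                  # gain from 'no surfacing'
print(round(info_gain(fish, 1), 4))                  # gain from 'flippers'
```

The three values come out at roughly 0.971, 0.420 and 0.171, so feature 0 ('no surfacing') gives the larger gain and is the one ID3 splits on first.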
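Testing the algorithm
1. Build the decision tree from the training data
2. Classifying data requires the decision tree and the label vector that was used to build it
3. Compare the test data with the values in the decision tree, recursing until a leaf node is reached
4. Finally, assign the test data the class of that leaf node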
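The steps above can be sketched as a loop over the nested-dictionary tree. This is an illustrative variant, assuming the tree has the shape `{feature_name: {feature_value: subtree_or_label}}`; the name `walk_tree` and the `default` parameter are hypothetical and are not part of trees.py.

```python
def walk_tree(tree, feature_names, sample, default=None):
    """Follow sample down a nested-dict decision tree and return the leaf label.

    Returns default if the sample holds a value that has no matching branch.
    """
    node = tree
    while isinstance(node, dict):
        feature = next(iter(node))      # the single key names the feature tested here
        branches = node[feature]
        value = sample[feature_names.index(feature)]
        if value not in branches:
            return default
        node = branches[value]          # descend: either a subtree or a leaf label
    return node


tree = {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
names = ['no surfacing', 'flippers']
print(walk_tree(tree, names, [1, 0]))
print(walk_tree(tree, names, [1, 1]))
print(walk_tree(tree, names, [3, 1], default='unknown'))
```

The label vector is needed because the tree stores feature names, while a test vector only stores values by position.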

Program

trees.py

from math import log
import operator


# Listing 3-1: compute the Shannon entropy of a given dataset
def calcshannonEnt(dataSet):
    numEntries = len(dataSet)  # total number of instances in the dataset
    labelCounts = {}  # dictionary keyed by the values in the last column (the class labels)
    for featVec in dataSet:
        currentLabel = featVec[-1]
        if currentLabel not in labelCounts.keys():  # if the key is not present yet, add it to the dictionary
            labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1
    shannonEnt = 0.0
    for key in labelCounts:  # each key records how many times that class occurs
        prob = float(labelCounts[key]) / numEntries  # class frequency, used as the class probability
        shannonEnt -= prob * log(prob, 2)  # accumulate the Shannon entropy
    return shannonEnt


# A simple hand-entered fish-identification dataset
def createDataSet():
    dataSet = [[1, 1, 'yes'],
               [1, 1, 'yes'],
               [1, 0, 'no'],
               [0, 1, 'no'],
               [0, 1, 'no']]
    labels = ['no surfacing', 'flippers']
    # change to discrete values
    return dataSet, labels


# Listing 3-2: split the dataset on a given feature
"""
dataSet: the dataset to split
axis: index of the feature to split on
value: the feature value to select
"""
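def splitDataSet(dataSet, axis, value):
    retDataSet = []  # create a new list so the original dataset is left untouched
    for featVec in dataSet:  # visit every record and collect the ones that match
        if featVec[axis] == value:  # keep records whose feature equals the requested value
            reduceFeatVec = featVec[:axis]
            reduceFeatVec.extend(featVec[axis+1:])  # extend: appends the individual elements, dropping column axis
            retDataSet.append(reduceFeatVec)  # append: adds the reduced record as one list element
    return retDataSet


# Listing 3-3: choose the best way to split the dataset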
def chooseBestFeatureToSplit(dataSet):
    numFeatures = len(dataSet[0]) - 1  # number of feature columns (the last column is the class label)
    baseEntropy = calcshannonEnt(dataSet)  # entropy of the whole dataset, kept for comparison with each split
    bestInfoGain = 0.0
    bestFeature = -1
    for i in range(numFeatures):  # iterate over every feature
        featList = [example[i] for example in dataSet]  # all values taken by feature i
        uniqueVals = set(featList)  # a set keeps each distinct value once
        newEntropy = 0.0
        for value in uniqueVals:  # split once per distinct value of this feature
            subDataSet = splitDataSet(dataSet, i, value)
            prob = len(subDataSet) / float(len(dataSet))  # fraction of records in this subset
            newEntropy += prob * calcshannonEnt(subDataSet)  # weighted sum of the subset entropies
        infoGain = baseEntropy - newEntropy  # change in entropy, i.e. the information gain
        if (infoGain > bestInfoGain):  # keep the feature with the largest information gain
            bestInfoGain = infoGain
            bestFeature = i
    return bestFeature


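def majorityCnt(classList):
    classCount = {}  # maps each distinct class label to its count
    for vote in classList:
        if vote not in classCount.keys():  # count how often each class label occurs
            classCount[vote] = 0
        classCount[vote] += 1
    # items() replaces Python 2's iteritems(); sort by count in descending order
    sortedClassCount = sorted(classCount.items(),
                              key=operator.itemgetter(1),
                              reverse=True)
    return sortedClassCount[0][0]  # the most frequent class label


# Listing 3-4: the function that builds the tree
"""
dataSet: the dataset
labels: list of feature names (one for every feature column in the dataset)
"""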
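def createTree(dataSet, labels):
    classList = [example[-1] for example in dataSet]  # all class labels in the dataset
    if classList.count(classList[0]) == len(classList):  # first stop condition: every record has the same class
        return classList[0]  # return that class label directly
    if len(dataSet[0]) == 1:  # second stop condition: all features are used up but the classes are still mixed
        return majorityCnt(classList)  # fall back to the most frequent class
    bestFeat = chooseBestFeatureToSplit(dataSet)
    bestFeatLabel = labels[bestFeat]  # name of the best feature for the current dataset
    myTree = {bestFeatLabel: {}}  # create the tree
    del (labels[bestFeat])  # removes the entry in place, so the caller's labels list is modified too
    featValues = [example[bestFeat] for example in dataSet]  # all values of the chosen feature
    uniqueVals = set(featValues)
    for value in uniqueVals:
        subLabels = labels[:]  # copy the labels so recursive calls do not interfere with each other
        myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value), subLabels)  # recurse on each subset
    return myTree


# Listing 3-8: classification function that uses the decision tree
"""
Classify with the decision tree.
Classification needs the decision tree and the label vector that was used to build it;
the program compares the test data with the values in the tree, recursing until it reaches a leaf node,
and finally assigns the test data the class of that leaf node.
"""
# Storing the tree by feature name raises one problem: the program cannot tell where a feature sits in the dataset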
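def classify(inputTree, featLabels, testVec):
    firstStr = list(inputTree.keys())[0]
    secondDict = inputTree[firstStr]
    featIndex = featLabels.index(firstStr)  # index() finds the first list entry that matches firstStr
    classLabel = None  # stays None if testVec holds a value the tree has no branch for
    for key in secondDict.keys():  # check every branch of this node
        if testVec[featIndex] == key:  # compare the value in testVec with the value on the branch
            if type(secondDict[key]).__name__ == 'dict':  # not a leaf node: recurse into the subtree
                classLabel = classify(secondDict[key], featLabels, testVec)
            else:  # leaf node: return its class label
                classLabel = secondDict[key]
    return classLabel


# Listing 3-9: store the decision tree with the pickle module
"""
Building a decision tree is time-consuming; to save computation time, it is best to reuse an already-built tree each time classification runs.
Storing the classifier on disk, instead of relearning it every time data is classified, is also one of the advantages of decision trees.
"""
def storeTree(inputTree, filename):
    import pickle
    with open(filename, 'wb') as fw:  # the with block closes the file even if dump() fails
        pickle.dump(inputTree, fw)


def grabTree(filename):
    import pickle
    with open(filename, 'rb') as fr:  # the original left this file handle open
        return pickle.load(fr)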
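A quick way to check the store-and-load idea is a round trip through a temporary file. The sketch below is self-contained and does not import trees.py; the file name `classifierStorage.pkl` and the temporary directory are assumptions made for the example.

```python
import os
import pickle
import tempfile

tree = {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'classifierStorage.pkl')  # hypothetical file name
    with open(path, 'wb') as fw:    # pickle writes bytes, so open in binary mode
        pickle.dump(tree, fw)
    with open(path, 'rb') as fr:
        restored = pickle.load(fr)

print(restored == tree)   # same content
print(restored is tree)   # but a new object
```

pickle keeps the integer keys 0 and 1 as integers, which matters because `classify` compares them with the integer values in the test vector. Only unpickle files from a source you trust.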

treePlotter.py

# 3.2 Drawing tree diagrams in Python with Matplotlib annotations
# Listing 3-5: draw tree nodes with text annotations
"""
Uses the annotation tools (annotations) provided by matplotlib
"""
import matplotlib.pyplot as plt

decisionNode = dict(boxstyle="sawtooth", fc="0.8")  # constants that define the text-box and arrow styles
leafNode = dict(boxstyle="round4", fc="0.8")
arrow_args = dict(arrowstyle="<-")


def plotNode(nodeTxt, centerPt, parentPt, nodeType):  # draws an annotation with an arrow; does the actual drawing
    createPlot.ax1.annotate(nodeTxt, xy=parentPt, xycoords='axes fraction',
                            xytext=centerPt, textcoords='axes fraction',
                            va="center", ha="center", bbox=nodeType, arrowprops=arrow_args)
    # the drawing area is defined by the global attribute createPlot.ax1


# def createPlot():
#    fig = plt.figure(1, facecolor='white')
#    fig.clf()
#    createPlot.ax1 = plt.subplot(111, frameon=False) #ticks for demo purposes
#    plotNode('a decision node', (0.5, 0.1), (0.1, 0.5), decisionNode)
#    plotNode('a leaf node', (0.8, 0.1), (0.3, 0.8), leafNode)
#    plt.show()


# Listing 3-6: get the number of leaf nodes and the depth of the tree
def getNumLeafs(myTree):  # get the number of leaf nodes
    numLeafs = 0  # initialize the leaf count
    firstStr = list(myTree.keys())[0]
    secondDict = myTree[firstStr]  # branches of the first split: keys are feature values, values are child nodes
    for key in secondDict.keys():  # starting from the first key, visit every child node
        if type(secondDict[key]).__name__ == 'dict':  # a child that is a dict is itself a decision node
            numLeafs += getNumLeafs(secondDict[key])  # recurse into it
        else:
            numLeafs += 1
    return numLeafs


def getTreeDepth(myTree):  # get the number of levels in the tree
    maxDepth = 0
    firstStr = list(myTree.keys())[0]
    secondDict = myTree[firstStr]
    for key in secondDict.keys():
        if type(secondDict[key]).__name__ == 'dict':  # test to see if the nodes are dictionaries; if not they are leaf nodes
            thisDepth = 1 + getTreeDepth(secondDict[key])  # add 1 to the depth returned by the recursive call
        else:
            thisDepth = 1  # a leaf node ends the recursion
        if thisDepth > maxDepth:
            maxDepth = thisDepth
    return maxDepth


def retrieveTree(i):  # return a pre-stored tree, so tests do not have to rebuild one from data every time
    listOfTrees = [{'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}},
                   {'no surfacing': {0: 'no', 1: {'flippers': {0: {'head': {0: 'no', 1: 'yes'}}, 1: 'no'}}}}]
    return listOfTrees[i]  # return the predefined tree structure


# Listing 3-7: the plotTree function
"""
createPlot() is the main function; it calls plotTree(), which in turn calls the functions introduced above and plotMidText()
"""
def plotMidText(cntrPt, parentPt, txtString):
    xMid = (parentPt[0]-cntrPt[0])/2.0 + cntrPt[0]
    yMid = (parentPt[1]-cntrPt[1])/2.0 + cntrPt[1]
    createPlot.ax1.text(xMid, yMid, txtString, va="center", ha="center", rotation=30)


def plotTree(myTree, parentPt, nodeTxt):  # if the first key tells you what feat was split on
    numLeafs = getNumLeafs(myTree)  # compute the width and height of the tree
    depth = getTreeDepth(myTree)
    firstStr = list(myTree.keys())[0]  # the text label for this node should be this
    cntrPt = (plotTree.xOff + (1.0 + float(numLeafs))/2.0/plotTree.totalW, plotTree.yOff)
    plotMidText(cntrPt, parentPt, nodeTxt)  # put a short text label halfway between the parent and this node
    plotNode(firstStr, cntrPt, parentPt, decisionNode)  # draw this decision node, labeled with its feature
    secondDict = myTree[firstStr]
    plotTree.yOff = plotTree.yOff - 1.0/plotTree.totalD  # decrease the y offset: move down one level
    for key in secondDict.keys():
        if type(secondDict[key]).__name__ == 'dict':  # not a leaf node
            plotTree(secondDict[key], cntrPt, str(key))  # recurse into the subtree
        else:  # leaf node: draw it on the figure
            plotTree.xOff = plotTree.xOff + 1.0/plotTree.totalW
            plotNode(secondDict[key], (plotTree.xOff, plotTree.yOff), cntrPt, leafNode)
            plotMidText((plotTree.xOff, plotTree.yOff), cntrPt, str(key))
    plotTree.yOff = plotTree.yOff + 1.0/plotTree.totalD  # restore the y offset before returning to the parent
# if you do get a dictionary you know it's a tree, and the first element will be another dict


def createPlot(inTree):
    fig = plt.figure(1, facecolor='white')  # create the drawing area and compute the overall size of the diagram
    fig.clf()
    axprops = dict(xticks=[], yticks=[])
    createPlot.ax1 = plt.subplot(111, frameon=False, **axprops)  # no ticks
    # createPlot.ax1 = plt.subplot(111, frameon=False)  # ticks for demo purposes
    plotTree.totalW = float(getNumLeafs(inTree))  # global attribute storing the width of the tree
    plotTree.totalD = float(getTreeDepth(inTree))  # global attribute storing the depth of the tree
    plotTree.xOff = -0.5/plotTree.totalW  # xOff/yOff track the nodes already drawn and where the next one goes
    plotTree.yOff = 1.0
    plotTree(inTree, (0.5, 1.0), '')
    plt.show()
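The layout in `plotTree` depends only on two numbers: the leaf count (width) and the depth. The sketch below recomputes both with `isinstance` checks and derives the leaf x positions that follow from starting `xOff` at `-0.5/totalW` and adding `1.0/totalW` per leaf. The names `count_leaves`, `tree_depth` and `leaf_x_positions` are hypothetical, and the code needs no Matplotlib.

```python
def count_leaves(tree):
    """Number of leaf nodes in a non-empty nested-dict decision tree."""
    branches = next(iter(tree.values()))
    return sum(count_leaves(child) if isinstance(child, dict) else 1
               for child in branches.values())


def tree_depth(tree):
    """Number of decision levels in a non-empty nested-dict decision tree."""
    branches = next(iter(tree.values()))
    return max(1 + tree_depth(child) if isinstance(child, dict) else 1
               for child in branches.values())


def leaf_x_positions(tree):
    """x coordinates (axes fraction) at which the leaves end up, left to right."""
    width = count_leaves(tree)
    return [(i + 0.5) / width for i in range(width)]


tree0 = {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
tree1 = {'no surfacing': {0: 'no', 1: {'flippers': {0: {'head': {0: 'no', 1: 'yes'}}, 1: 'no'}}}}
print(count_leaves(tree0), tree_depth(tree0))
print(count_leaves(tree1), tree_depth(tree1))
print(leaf_x_positions(tree0))
```

Each leaf gets an equal-width slot and sits in its center, and a decision node is placed at the midpoint of the slots covered by its own leaves, which is why the root of a three-leaf tree lands at x = 0.5.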

test_03.py

from importlib import reload
import trees
import treePlotter

# 3.1.1 Information gain
reload(trees)
myDat, labels = trees.createDataSet()
print(myDat)
print(trees.calcshannonEnt(myDat))

# Test how the entropy changes
myDat[0][-1] = 'maybe'
print(myDat)
print(trees.calcshannonEnt(myDat))

# 3.1.2 Splitting the dataset
reload(trees)
myDat, labels = trees.createDataSet()
print("\n")
print(myDat)
print(trees.splitDataSet(myDat, 0, 1))
print(trees.splitDataSet(myDat, 0, 0))

# Choosing the best way to split the dataset
reload(trees)
myDat, labels = trees.createDataSet()
print("\n")
print(myDat)
print(trees.chooseBestFeatureToSplit(myDat))

# 3.1.3 Building the decision tree recursively
reload(trees)
myDat, labels = trees.createDataSet()
myTree = trees.createTree(myDat, labels)
print("\n")
print(myTree)

# 3.2.1 Matplotlib annotations
# treePlotter.createPlot()

# 3.2.2 Building the annotated tree
reload(treePlotter)
print("\n")
print(treePlotter.retrieveTree(1))
myTree = treePlotter.retrieveTree(0)
print(treePlotter.getNumLeafs(myTree))
print(treePlotter.getTreeDepth(myTree))

# The plotTree function
reload(treePlotter)
myTree = treePlotter.retrieveTree(0)
print("\n")
treePlotter.createPlot(myTree)  # createPlot returns None, so there is nothing to print
myTree['no surfacing'][3] = 'maybe'
print("\n")
print(myTree)
treePlotter.createPlot(myTree)

# 3.3.1 Testing the algorithm: classifying with the decision tree
myDat, labels = trees.createDataSet()
print("\n")
print(labels)
myTree = treePlotter.retrieveTree(0)
print(myTree)
print(trees.classify(myTree, labels, [1, 0]))
print(trees.classify(myTree, labels, [1, 1]))

# 3.3.2 Using the algorithm: storing the decision tree
print("\n")
trees.storeTree(myTree, 'classifierStorage.txt')  # storeTree returns None as well
print(trees.grabTree('classifierStorage.txt'))

# 3.4 Using a decision tree to predict contact lens type
with open('lenses.txt') as fr:  # the file must be tab-separated
    lenses = [inst.strip().split('\t') for inst in fr.readlines()]
lensesLabels = ['age', 'prescript', 'astigmatic', 'tearRate']
lensesTree = trees.createTree(lenses, lensesLabels)
print("\n")
print(lensesTree)
treePlotter.createPlot(lensesTree)

D:\down_path_v1\python3.7.0\python.exe F:/code/Machine_Learning_Practices/chapter03/test_03.py
[[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
0.9709505944546686
[[1, 1, 'maybe'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
1.3709505944546687

[[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
[[1, 'yes'], [1, 'yes'], [0, 'no']]
[[1, 'no'], [1, 'no']]

[[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
0

{'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}

Appendix: lenses.txt

young    myope   no  reduced no lenses
young   myope   no  normal  soft
young   myope   yes reduced no lenses
young   myope   yes normal  hard
young   hyper   no  reduced no lenses
young   hyper   no  normal  soft
young   hyper   yes reduced no lenses
young   hyper   yes normal  hard
pre myope   no  reduced no lenses
pre myope   no  normal  soft
pre myope   yes reduced no lenses
pre myope   yes normal  hard
pre hyper   no  reduced no lenses
pre hyper   no  normal  soft
pre hyper   yes reduced no lenses
pre hyper   yes normal  no lenses
presbyopic  myope   no  reduced no lenses
presbyopic  myope   no  normal  no lenses
presbyopic  myope   yes reduced no lenses
presbyopic  myope   yes normal  hard
presbyopic  hyper   no  reduced no lenses
presbyopic  hyper   no  normal  soft
presbyopic  hyper   yes reduced no lenses
presbyopic  hyper   yes normal  no lenses
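In the listing above the field separators show up as runs of spaces, while test_03.py splits each line on a tab character, so text copied from this page will not parse with `split('\t')`. A minimal workaround sketch, assuming every record has four single-word features followed by the class label: split on any whitespace at most four times, so a label such as "no lenses" stays in one piece. The helper name `parse_lens_line` is hypothetical.

```python
from collections import Counter


def parse_lens_line(line):
    """Split one record into 4 feature values plus the class label."""
    fields = line.strip().split(None, 4)  # at most 4 splits, so the label keeps its inner space
    if len(fields) != 5:
        raise ValueError("expected 5 fields, got %d: %r" % (len(fields), line))
    return fields


# Three records copied from the appendix, with spaces instead of tabs.
sample = """young    myope   no  reduced no lenses
young   myope   no  normal  soft
young   myope   yes normal  hard
"""
rows = [parse_lens_line(line) for line in sample.splitlines() if line.strip()]
print(rows[0])
print(Counter(row[-1] for row in rows))
```

The same function also handles tab-separated lines, since `split(None, ...)` treats tabs and spaces alike.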
