Python: implementing a decision tree that selects splits by information entropy

This post walks through my Python implementation of a decision tree that selects split attributes by information entropy, following Chapter 4 (Decision Trees) of Zhou Zhihua's Machine Learning, the "Watermelon Book". Note: the code handles both continuous and discrete attributes, but does not deal with missing values or pruning.
First, a few key pieces of theory from the book:

1. The basic decision-tree learning algorithm

(The book's TreeGenerate pseudocode, Figure 4.2, appears here as an image in the original post.)

Decision-tree learning is clearly a recursive algorithm, and the most important part of any recursion is setting up its return conditions. Here there are three cases that produce a return (a minimal runnable sketch follows the list):

  1. All samples at the current node belong to the same class; no split is needed (i.e. all good melons or all bad melons).
  2. The attribute set is empty, or all samples take the same value on every remaining attribute; no split is possible.
  3. The sample subset at the current node is empty; no split can be made.
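
As a quick illustration, here is a minimal, self-contained sketch of that recursive skeleton. The names are hypothetical and samples are dicts with a 'label' key; the article's real implementation, which works on comma-separated strings instead, follows in section 3:

from collections import Counter

def tree_sketch(samples, attrs):
    labels = [s['label'] for s in samples]
    if len(set(labels)) == 1:                        # case 1: only one class left
        return labels[0]
    if not attrs or all(len({s[a] for s in samples}) == 1 for a in attrs):
        return Counter(labels).most_common(1)[0][0]  # case 2: majority class
    best = attrs[0]  # placeholder choice; section 2 picks it by information gain
    tree = {best: {}}
    for v in {s[best] for s in samples}:               # case 3 (empty subset) cannot
        subset = [s for s in samples if s[best] == v]  # occur with v drawn from samples
        tree[best][v] = tree_sketch(subset, [a for a in attrs if a != best])
    return tree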

2. Information entropy

In the algorithm above, one step is crucial: step 8, "choose the best split attribute a* from A". What do we base this choice on? Here we use the most basic criterion: information entropy.

1. Definition of information entropy

$$\mathrm{Ent}(D) = -\sum_{k=1}^{|\mathcal{Y}|} p_k \log_2 p_k$$

|Y| is the number of classes; for example, with the two classes good melon and bad melon, |Y| = 2.
p_k is the proportion of class-k samples in the whole set, e.g. good / (good + bad).
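
As a worked example: the discrete dataset at the end of this post has 17 samples, 8 good and 9 bad, so

$$\mathrm{Ent}(D) = -\left(\tfrac{8}{17}\log_2\tfrac{8}{17} + \tfrac{9}{17}\log_2\tfrac{9}{17}\right) \approx 0.998.$$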

2. Definition of information gain

(1) Discrete case:

$$\mathrm{Gain}(D, a) = \mathrm{Ent}(D) - \sum_{v=1}^{V} \frac{|D^v|}{|D|}\,\mathrm{Ent}(D^v)$$

The information gain measures the purity improvement obtained by partitioning the sample set on attribute a (such as 纹理): the larger the gain, the larger the improvement.
We pick the attribute with the largest information gain as the split attribute; if several attributes tie, any of them will do.
v ranges over the V possible values of attribute a; for example, 色泽 takes the three values 乌黑, 青绿 and 浅白. To evaluate the formula, we count the good and bad melons under each value separately and substitute into the formulas above.
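
For example, on the discrete dataset below, 色泽 splits the 17 samples into 青绿 (6 samples, 3 good), 乌黑 (6 samples, 4 good) and 浅白 (5 samples, 1 good), so

$$\mathrm{Gain}(D, \text{色泽}) = 0.998 - \left(\tfrac{6}{17}\cdot 1.000 + \tfrac{6}{17}\cdot 0.918 + \tfrac{5}{17}\cdot 0.722\right) \approx 0.109.$$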
(2) Continuous case:

When an attribute takes continuous values (such as 密度), the formula above needs a small modification. We bi-partition the continuous range and look for the best binary split threshold: sort the attribute's n observed values in ascending order and take the n-1 midpoints of adjacent values as the candidate set

$$T_a = \left\{ \frac{a^i + a^{i+1}}{2} \;\middle|\; 1 \le i \le n-1 \right\},$$

then evaluate the information gain at each candidate t and keep the threshold with the largest gain:

$$\mathrm{Gain}(D, a) = \max_{t \in T_a}\left[\, \mathrm{Ent}(D) - \sum_{\lambda \in \{-,+\}} \frac{|D_t^{\lambda}|}{|D|}\, \mathrm{Ent}(D_t^{\lambda}) \right]$$
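
A minimal sketch of the candidate-threshold step, with the 密度 column of the continuous dataset below hard-coded purely for illustration:

# Candidate bi-partition thresholds are the midpoints of adjacent sorted values.
densities = [0.697, 0.744, 0.634, 0.608, 0.556, 0.403, 0.481, 0.437, 0.666,
             0.243, 0.245, 0.343, 0.639, 0.657, 0.36, 0.593, 0.719]
values = sorted(densities)
Ta = [(values[i] + values[i + 1]) / 2 for i in range(len(values) - 1)]
print(Ta)  # n-1 = 16 candidates; each is plugged into the gain formula in turn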

3. Code implementation

The complete code first (Python):

import math


class Attribute():
    def __init__(self, name, id, iscon=0):
        self.name = name
        self.kids = []
        self.id = id
        self.iscon = iscon  # 1: discrete (word values), 0: continuous (numeric values)


# count how many samples take each value of one attribute in SampleArray
def count_sample(SampleArray, index, iscon, T=0):
    attribute = {}
    if len(SampleArray) == 0:
        return -1  # SampleArray is empty
    if iscon == 1:
        for sample in SampleArray:
            samples = sample.split(',')
            if samples[index] not in attribute:
                attribute[samples[index]] = 1
            else:
                attribute[samples[index]] += 1
    else:
        for sample in SampleArray:
            samples = sample.split(',')
            if float(samples[index]) <= T:
                if 'less' not in attribute:
                    attribute['less'] = 1
                else:
                    attribute['less'] += 1
            else:
                if 'more' not in attribute:
                    attribute['more'] = 1
                else:
                    attribute['more'] += 1
    return attribute


# count the number of good and bad samples under each value of one attribute
def count_attribute(SampleArray, index, iscon, T=0):
    attribute = {}
    if len(SampleArray) == 0:
        return -1  # SampleArray is empty
    if iscon == 1:  # discrete
        for sample in SampleArray:
            samples = sample.split(',')
            if str(samples[index] + samples[-1]) not in attribute:
                attribute[samples[index] + samples[-1]] = 1
            else:
                attribute[samples[index] + samples[-1]] += 1
    else:  # continuous
        for sample in SampleArray:
            samples = sample.split(',')
            if float(samples[index]) <= T:
                if str('less' + samples[-1]) not in attribute.keys():
                    attribute['less' + samples[-1]] = 1
                else:
                    attribute['less' + samples[-1]] += 1
            else:
                if str('more' + samples[-1]) not in attribute.keys():
                    attribute['more' + samples[-1]] = 1
                else:
                    attribute['more' + samples[-1]] += 1
    return attribute


def read_file(file_name, SampleArray, AttributeArray):
    with open(file_name) as f:
        contents = f.readline()
        flag = 0
        index = -1
        if "编号" in contents:  # skip the sample-number column
            flag = 1
            index = contents.find(',')
            attributes = contents[index + 1:].split(',')
        else:
            attributes = contents.split(',')
        id = 0
        for a in attributes:
            att = Attribute(a, id)
            id += 1
            AttributeArray.append(att)  # record the attribute
        per_att = []
        for contents in f:
            if flag == 1:
                index = contents.find(',')
                per_att = contents[index + 1:-1].split(',')  # drop the trailing '\n'
            else:
                per_att = contents[:-1].split(',')
            for i in range(len(AttributeArray)):
                if per_att[i] not in AttributeArray[i].kids:
                    AttributeArray[i].kids.append(per_att[i])
                    if per_att[i].isalnum():  # word values mean a discrete attribute
                        AttributeArray[i].iscon = 1
            SampleArray.append(contents[index + 1:].replace('\n', ''))
        del AttributeArray[-1].kids[-1]  # delete the last '' in kids of attributes.
    max_mark = count_sample(SampleArray, -1, 1)
    max_class = max(max_mark, key=max_mark.get)  # the majority class of the whole set
    return max_class


# find the best split attribute for the current node
def find_attribute(SampleArray, AttributeArray):
    entropy_D = 0
    entropy_Dv = 0
    entropy_Dv_total = 0
    max_index = 0
    max_gain = 0
    den = 0
    gains = []
    max_con_middle = 0  # best split threshold among continuous attributes
    max_con_gain = 0
    classes = count_sample(SampleArray, -1, 1)
    total_nums = sum(classes.values())
    for value in classes.values():  # Ent(D)
        p = value / total_nums
        entropy_D += p * math.log(p, 2)
    entropy_D = -entropy_D
    for index in range(len(AttributeArray) - 1):  # skip the class column
        if AttributeArray[index].iscon == 1:  # discrete
            total_kids = count_sample(SampleArray, index, 1)
            per_kid = count_attribute(SampleArray, index, 1)
            for kid in AttributeArray[index].kids:
                for j in AttributeArray[-1].kids:
                    if str(kid + j) not in per_kid.keys():
                        continue  # this value has no sample of class j
                    num = per_kid[str(kid + j)]
                    den = total_kids[kid]
                    p = num / den
                    entropy_Dv += p * math.log(p, 2)
                entropy_Dv_total += (den / total_nums) * entropy_Dv  # weighted Ent(Dv)
                entropy_Dv = 0
            gain = entropy_D + entropy_Dv_total
            entropy_Dv_total = 0
            gains.append(gain)
        elif AttributeArray[index].iscon == 0:  # continuous
            Ta = []  # candidate thresholds: midpoints of adjacent sorted values
            AttributeArray[index].kids.sort(key=float)
            for i in range(len(AttributeArray[index].kids) - 1):
                Ta.append((float(AttributeArray[index].kids[i])
                           + float(AttributeArray[index].kids[i + 1])) / 2)
            for t in Ta:
                total_kids = count_sample(SampleArray, index, 0, t)
                per_kid = count_attribute(SampleArray, index, 0, t)
                for j in AttributeArray[-1].kids:
                    if str('less' + j) not in per_kid.keys():
                        continue
                    num = per_kid['less' + j]
                    den = total_kids['less']
                    p = num / den
                    entropy_Dv += p * math.log(p, 2)
                entropy_Dv_total += (den / total_nums) * entropy_Dv
                entropy_Dv = 0
                for j in AttributeArray[-1].kids:
                    if str('more' + j) not in per_kid.keys():
                        continue
                    num = per_kid['more' + j]
                    den = total_kids['more']
                    p = num / den
                    entropy_Dv += p * math.log(p, 2)
                entropy_Dv_total += (den / total_nums) * entropy_Dv
                entropy_Dv = 0
                con_gain = entropy_D + entropy_Dv_total
                entropy_Dv_total = 0
                if con_gain > max_con_gain:
                    max_con_gain = con_gain
                    max_con_middle = t
            gain = max_con_gain
            gains.append(gain)
        if gain > max_gain:
            max_gain = gain
            max_index = index
    return max_index, max_con_middle  # best attribute index (and threshold if continuous)


treenode = []


# per tree node: [father, father_index, index, kid, result, isleaf]
def tree_generate(SampleArray, AttributeArray, father, father_index, pass_kid, max_class):
    treenode.append([])  # create a new tree node
    index = len(treenode) - 1
    treenode[index].append(father)        # attribute we split on to reach this node
    treenode[index].append(father_index)  # index of the father node
    treenode[index].append(index)
    treenode[index].append(pass_kid)      # the branch value that led to this node

    # case 1: all samples in SampleArray belong to one class
    count = count_sample(SampleArray, -1, 1)
    if len(count) == 1:
        treenode[index].append(max_class)
        treenode[index].append(1)  # leaf
        return

    # case 2: AttributeArray is empty, or all samples take the same attribute values
    all_same = True
    for i in range(len(AttributeArray) - 1):
        if len(count_sample(SampleArray, i, 1)) != 1:
            all_same = False
            break
    if all_same or len(AttributeArray) == 1:  # the class column itself does not count
        treenode[index].append(max_class)
        treenode[index].append(1)  # leaf
        return

    treenode[index].append(0)  # no result yet
    treenode[index].append(0)  # not a leaf

    # case 3: find the best attribute and recurse
    best_index, best_middle = find_attribute(SampleArray, AttributeArray)
    kid_SampleArray = []
    new_index = 0
    if AttributeArray[best_index].iscon == 1:  # discrete attribute
        for kid in AttributeArray[best_index].kids:
            kid_SampleArray.clear()
            for sample in SampleArray:
                samples = sample.split(',')
                if samples[best_index] == kid:
                    kid_SampleArray.append(sample.replace(kid + ',', ''))
            if len(kid_SampleArray) == 0:
                # empty branch: make a leaf labelled with the majority class
                treenode.append([])
                new_index = len(treenode) - 1
                treenode[new_index].append(AttributeArray[best_index].name)
                treenode[new_index].append(index)
                treenode[new_index].append(new_index)
                treenode[new_index].append(kid)
                treenode[new_index].append(max_class)
                treenode[new_index].append(1)  # leaf
                return
            else:
                kid_AttributeArray = list(AttributeArray)
                del kid_AttributeArray[best_index]
                max_class = count_sample(kid_SampleArray, -1, 1)
                max_class = max(max_class, key=max_class.get)
                tree_generate(kid_SampleArray, kid_AttributeArray,
                              AttributeArray[best_index].name, index, kid, max_class)
    else:  # continuous attribute: bi-partition at best_middle
        kid_less_SampleArray = []
        kid_more_SampleArray = []
        for sample in SampleArray:
            samples = sample.split(',')
            if float(samples[best_index]) <= best_middle:
                kid_less_SampleArray.append(sample.replace(samples[best_index] + ',', ''))
            else:
                kid_more_SampleArray.append(sample.replace(samples[best_index] + ',', ''))
        if len(kid_less_SampleArray) == 0:
            treenode.append([])
            new_index = len(treenode) - 1
            treenode[new_index].append(AttributeArray[best_index].name)
            treenode[new_index].append(index)
            treenode[new_index].append(new_index)
            treenode[new_index].append("<=" + str(best_middle))
            treenode[new_index].append(max_class)
            treenode[new_index].append(1)  # leaf
            return
        else:
            kid_AttributeArray = list(AttributeArray)
            del kid_AttributeArray[best_index]
            max_less_class = count_sample(kid_less_SampleArray, -1, 1)
            max_less_class = max(max_less_class, key=max_less_class.get)
            tree_generate(kid_less_SampleArray, kid_AttributeArray,
                          AttributeArray[best_index].name, index,
                          "<=" + str(best_middle), max_less_class)
        if len(kid_more_SampleArray) == 0:
            treenode.append([])
            new_index = len(treenode) - 1
            treenode[new_index].append(AttributeArray[best_index].name)
            treenode[new_index].append(index)
            treenode[new_index].append(new_index)
            treenode[new_index].append(">" + str(best_middle))
            treenode[new_index].append(max_class)
            treenode[new_index].append(1)  # leaf
            return
        else:
            kid_AttributeArray = list(AttributeArray)
            del kid_AttributeArray[best_index]
            max_more_class = count_sample(kid_more_SampleArray, -1, 1)
            max_more_class = max(max_more_class, key=max_more_class.get)
            tree_generate(kid_more_SampleArray, kid_AttributeArray,
                          AttributeArray[best_index].name, index,
                          ">" + str(best_middle), max_more_class)


def main():
    AttributeArray = []  # record attributes
    SampleArray = []     # record samples
    max_class = read_file('data.txt', SampleArray, AttributeArray)
    tree_generate(SampleArray, AttributeArray, -1, -1, -1, max_class)
    print(treenode[1:])


if __name__ == '__main__':
    main()

Sample input:

(Two screenshots in the original post: left, the dataset with continuous attributes; right, the purely discrete one. Both files are listed in full at the end of this post.)

Sample output:

(Screenshot of the printed node list.)

Each printed entry reads: [father, father_index, index, kid, result, isleaf].
I haven't implemented visualization of the tree; I may add it if I find the time. For now the printed output is enough to draw the tree by hand, haha.

3.1 Detailed notes:

1. The Attribute class

class Attribute():
    def __init__(self, name, id, iscon=0):
        self.name = name
        self.kids = []
        self.id = id
        self.iscon = iscon  # 1: discrete (word values), 0: continuous (numeric values)

name records the attribute's name; kids holds the values seen under this attribute; id is not really used and could be dropped; iscon marks whether the attribute is continuous, and is 0 when it is (yes, the logic is a bit backwards, heh).
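
For example (a hypothetical construction, assuming the class above is in scope, just to show the convention):

color = Attribute('色泽', 0)    # discrete column: read_file will later set iscon = 1
density = Attribute('密度', 7)  # continuous column: iscon stays 0
color.kids.extend(['青绿', '乌黑', '浅白'])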

2. The count_sample function

def count_sample(SampleArray, index, iscon, T=0):
    attribute = {}
    if len(SampleArray) == 0:
        return -1  # SampleArray is empty
    if iscon == 1:
        for sample in SampleArray:
            samples = sample.split(',')
            if samples[index] not in attribute:
                attribute[samples[index]] = 1
            else:
                attribute[samples[index]] += 1
    else:
        for sample in SampleArray:
            samples = sample.split(',')
            if float(samples[index]) <= T:
                if 'less' not in attribute:
                    attribute['less'] = 1
                else:
                    attribute['less'] += 1
            else:
                if 'more' not in attribute:
                    attribute['more'] = 1
                else:
                    attribute['more'] += 1
    return attribute

This function counts how many samples take each value of the attribute at the given column index. For a continuous attribute it instead counts how many samples fall at or below the threshold T and how many above. It returns a dict.
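
For example, assuming read_file has already filled SampleArray, I'd expect:

count_sample(SampleArray, -1, 1)        # class column -> {'是': 8, '否': 9} on data.txt
count_sample(SampleArray, 0, 1)         # 色泽 -> {'青绿': 6, '乌黑': 6, '浅白': 5}
# and on data_con.txt, where 密度 is column 6:
count_sample(SampleArray, 6, 0, 0.381)  # -> {'less': 4, 'more': 13}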

3. The count_attribute function

def count_attribute(SampleArray, index, iscon, T=0):
    attribute = {}
    if len(SampleArray) == 0:
        return -1  # SampleArray is empty
    if iscon == 1:  # discrete
        for sample in SampleArray:
            samples = sample.split(',')
            if str(samples[index] + samples[-1]) not in attribute:
                attribute[samples[index] + samples[-1]] = 1
            else:
                attribute[samples[index] + samples[-1]] += 1
    else:  # continuous
        for sample in SampleArray:
            samples = sample.split(',')
            if float(samples[index]) <= T:
                if str('less' + samples[-1]) not in attribute.keys():
                    attribute['less' + samples[-1]] = 1
                else:
                    attribute['less' + samples[-1]] += 1
            else:
                if str('more' + samples[-1]) not in attribute.keys():
                    attribute['more' + samples[-1]] = 1
                else:
                    attribute['more' + samples[-1]] += 1
    return attribute

This function counts, for each value of the given attribute, how many samples of each class take that value: for example, with 色泽 as the attribute, the numbers of good and bad melons among the 青绿, 乌黑 and 浅白 samples respectively. The return value is again a dict.
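
Continuing the example above, on data.txt I'd expect:

count_attribute(SampleArray, 0, 1)
# -> {'青绿是': 3, '乌黑是': 4, '浅白是': 1, '乌黑否': 2, '青绿否': 3, '浅白否': 4}
# keys are value + class; e.g. 3 of the 6 青绿 melons are good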

4. The read_file function

def read_file(file_name, SampleArray, AttributeArray):
    with open(file_name) as f:
        contents = f.readline()
        flag = 0
        index = -1
        if "编号" in contents:  # skip the sample-number column
            flag = 1
            index = contents.find(',')
            attributes = contents[index + 1:].split(',')
        else:
            attributes = contents.split(',')
        id = 0
        for a in attributes:
            att = Attribute(a, id)
            id += 1
            AttributeArray.append(att)  # record the attribute
        per_att = []
        for contents in f:
            if flag == 1:
                index = contents.find(',')
                per_att = contents[index + 1:-1].split(',')  # drop the trailing '\n'
            else:
                per_att = contents[:-1].split(',')
            for i in range(len(AttributeArray)):
                if per_att[i] not in AttributeArray[i].kids:
                    AttributeArray[i].kids.append(per_att[i])
                    if per_att[i].isalnum():  # word values mean a discrete attribute
                        AttributeArray[i].iscon = 1
            SampleArray.append(contents[index + 1:].replace('\n', ''))
        del AttributeArray[-1].kids[-1]  # delete the last '' in kids of attributes.
    max_mark = count_sample(SampleArray, -1, 1)
    max_class = max(max_mark, key=max_mark.get)  # the majority class of the whole set
    return max_class

This function reads the data from the given file. Two things to watch out for:
(1) If the first column of the file is the sample number (编号), it must be ignored.
(2) Each line may end with "\n", which must be stripped.
For convenience, the function also returns the majority class of the whole sample set, which the first call to tree_generate needs.
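
A quick usage example (the output values are what I'd expect on data.txt):

SampleArray, AttributeArray = [], []
max_class = read_file('data.txt', SampleArray, AttributeArray)
print(max_class)       # '否': the majority class (9 bad vs. 8 good)
print(SampleArray[0])  # '青绿,蜷缩,浊响,清晰,凹陷,硬滑,是', with the 编号 column stripped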

5. The find_attribute function

# find the best split attribute for the current node
def find_attribute(SampleArray, AttributeArray):
    entropy_D = 0
    entropy_Dv = 0
    entropy_Dv_total = 0
    max_index = 0
    max_gain = 0
    den = 0
    gains = []
    max_con_middle = 0  # best split threshold among continuous attributes
    max_con_gain = 0
    classes = count_sample(SampleArray, -1, 1)
    total_nums = sum(classes.values())
    for value in classes.values():  # Ent(D)
        p = value / total_nums
        entropy_D += p * math.log(p, 2)
    entropy_D = -entropy_D
    for index in range(len(AttributeArray) - 1):  # skip the class column
        if AttributeArray[index].iscon == 1:  # discrete
            total_kids = count_sample(SampleArray, index, 1)
            per_kid = count_attribute(SampleArray, index, 1)
            for kid in AttributeArray[index].kids:
                for j in AttributeArray[-1].kids:
                    if str(kid + j) not in per_kid.keys():
                        continue  # this value has no sample of class j
                    num = per_kid[str(kid + j)]
                    den = total_kids[kid]
                    p = num / den
                    entropy_Dv += p * math.log(p, 2)
                entropy_Dv_total += (den / total_nums) * entropy_Dv  # weighted Ent(Dv)
                entropy_Dv = 0
            gain = entropy_D + entropy_Dv_total
            entropy_Dv_total = 0
            gains.append(gain)
        elif AttributeArray[index].iscon == 0:  # continuous
            Ta = []  # candidate thresholds: midpoints of adjacent sorted values
            AttributeArray[index].kids.sort(key=float)
            for i in range(len(AttributeArray[index].kids) - 1):
                Ta.append((float(AttributeArray[index].kids[i])
                           + float(AttributeArray[index].kids[i + 1])) / 2)
            for t in Ta:
                total_kids = count_sample(SampleArray, index, 0, t)
                per_kid = count_attribute(SampleArray, index, 0, t)
                for j in AttributeArray[-1].kids:
                    if str('less' + j) not in per_kid.keys():
                        continue
                    num = per_kid['less' + j]
                    den = total_kids['less']
                    p = num / den
                    entropy_Dv += p * math.log(p, 2)
                entropy_Dv_total += (den / total_nums) * entropy_Dv
                entropy_Dv = 0
                for j in AttributeArray[-1].kids:
                    if str('more' + j) not in per_kid.keys():
                        continue
                    num = per_kid['more' + j]
                    den = total_kids['more']
                    p = num / den
                    entropy_Dv += p * math.log(p, 2)
                entropy_Dv_total += (den / total_nums) * entropy_Dv
                entropy_Dv = 0
                con_gain = entropy_D + entropy_Dv_total
                entropy_Dv_total = 0
                if con_gain > max_con_gain:
                    max_con_gain = con_gain
                    max_con_middle = t
            gain = max_con_gain
            gains.append(gain)
        if gain > max_gain:
            max_gain = gain
            max_index = index
    return max_index, max_con_middle  # best attribute index (and threshold if continuous)

This function implements step 8 of the algorithm: choosing the best split attribute by information gain. It handles the discrete and continuous cases separately; each is a direct transcription of the corresponding formula above.
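
For instance, on data.txt the book's own calculation makes 纹理 the best root split, so I'd expect a call like this:

best_index, best_middle = find_attribute(SampleArray, AttributeArray)
print(AttributeArray[best_index].name)  # '纹理' on the discrete data
# best_middle only matters when the chosen attribute is continuous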

6. The tree_generate function

def tree_generate(SampleArray, AttributeArray, father, father_index, pass_kid, max_class):
    treenode.append([])  # create a new tree node
    index = len(treenode) - 1
    treenode[index].append(father)        # attribute we split on to reach this node
    treenode[index].append(father_index)  # index of the father node
    treenode[index].append(index)
    treenode[index].append(pass_kid)      # the branch value that led to this node

    # case 1: all samples in SampleArray belong to one class
    count = count_sample(SampleArray, -1, 1)
    if len(count) == 1:
        treenode[index].append(max_class)
        treenode[index].append(1)  # leaf
        return

    # case 2: AttributeArray is empty, or all samples take the same attribute values
    all_same = True
    for i in range(len(AttributeArray) - 1):
        if len(count_sample(SampleArray, i, 1)) != 1:
            all_same = False
            break
    if all_same or len(AttributeArray) == 1:  # the class column itself does not count
        treenode[index].append(max_class)
        treenode[index].append(1)  # leaf
        return

    treenode[index].append(0)  # no result yet
    treenode[index].append(0)  # not a leaf

    # case 3: find the best attribute and recurse
    best_index, best_middle = find_attribute(SampleArray, AttributeArray)
    kid_SampleArray = []
    new_index = 0
    if AttributeArray[best_index].iscon == 1:  # discrete attribute
        for kid in AttributeArray[best_index].kids:
            kid_SampleArray.clear()
            for sample in SampleArray:
                samples = sample.split(',')
                if samples[best_index] == kid:
                    kid_SampleArray.append(sample.replace(kid + ',', ''))
            if len(kid_SampleArray) == 0:
                # empty branch: make a leaf labelled with the majority class
                treenode.append([])
                new_index = len(treenode) - 1
                treenode[new_index].append(AttributeArray[best_index].name)
                treenode[new_index].append(index)
                treenode[new_index].append(new_index)
                treenode[new_index].append(kid)
                treenode[new_index].append(max_class)
                treenode[new_index].append(1)  # leaf
                return
            else:
                kid_AttributeArray = list(AttributeArray)
                del kid_AttributeArray[best_index]
                max_class = count_sample(kid_SampleArray, -1, 1)
                max_class = max(max_class, key=max_class.get)
                tree_generate(kid_SampleArray, kid_AttributeArray,
                              AttributeArray[best_index].name, index, kid, max_class)
    else:  # continuous attribute: bi-partition at best_middle
        kid_less_SampleArray = []
        kid_more_SampleArray = []
        for sample in SampleArray:
            samples = sample.split(',')
            if float(samples[best_index]) <= best_middle:
                kid_less_SampleArray.append(sample.replace(samples[best_index] + ',', ''))
            else:
                kid_more_SampleArray.append(sample.replace(samples[best_index] + ',', ''))
        if len(kid_less_SampleArray) == 0:
            treenode.append([])
            new_index = len(treenode) - 1
            treenode[new_index].append(AttributeArray[best_index].name)
            treenode[new_index].append(index)
            treenode[new_index].append(new_index)
            treenode[new_index].append("<=" + str(best_middle))
            treenode[new_index].append(max_class)
            treenode[new_index].append(1)  # leaf
            return
        else:
            kid_AttributeArray = list(AttributeArray)
            del kid_AttributeArray[best_index]
            max_less_class = count_sample(kid_less_SampleArray, -1, 1)
            max_less_class = max(max_less_class, key=max_less_class.get)
            tree_generate(kid_less_SampleArray, kid_AttributeArray,
                          AttributeArray[best_index].name, index,
                          "<=" + str(best_middle), max_less_class)
        if len(kid_more_SampleArray) == 0:
            treenode.append([])
            new_index = len(treenode) - 1
            treenode[new_index].append(AttributeArray[best_index].name)
            treenode[new_index].append(index)
            treenode[new_index].append(new_index)
            treenode[new_index].append(">" + str(best_middle))
            treenode[new_index].append(max_class)
            treenode[new_index].append(1)  # leaf
            return
        else:
            kid_AttributeArray = list(AttributeArray)
            del kid_AttributeArray[best_index]
            max_more_class = count_sample(kid_more_SampleArray, -1, 1)
            max_more_class = max(max_more_class, key=max_more_class.get)
            tree_generate(kid_more_SampleArray, kid_AttributeArray,
                          AttributeArray[best_index].name, index,
                          ">" + str(best_middle), max_more_class)

This function follows the decision-tree algorithm directly. On every call it first creates a new node, then checks the first two return conditions; if neither fires, it finds the best split attribute for this recursion level, builds a leaf for any empty branch (the third return condition), and otherwise keeps recursing.
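
Two illustrative entries (hypothetical values, just to show the field order [father, father_index, index, kid, result, isleaf]):

node = ['纹理', 0, 1, '清晰', 0, 0]    # internal node reached via 纹理 == 清晰
leaf = ['根蒂', 1, 2, '蜷缩', '是', 1]  # leaf predicting a good melon (是)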


And that's my Python decision-tree implementation~
Appendix:
data.txt
编号,色泽,根蒂,敲声,纹理,脐部,触感,好瓜
1,青绿,蜷缩,浊响,清晰,凹陷,硬滑,是
2,乌黑,蜷缩,沉闷,清晰,凹陷,硬滑,是
3,乌黑,蜷缩,浊响,清晰,凹陷,硬滑,是
4,青绿,蜷缩,沉闷,清晰,凹陷,硬滑,是
5,浅白,蜷缩,浊响,清晰,凹陷,硬滑,是
6,青绿,稍蜷,浊响,清晰,稍凹,软粘,是
7,乌黑,稍蜷,浊响,稍糊,稍凹,软粘,是
8,乌黑,稍蜷,浊响,清晰,稍凹,硬滑,是
9,乌黑,稍蜷,沉闷,稍糊,稍凹,硬滑,否
10,青绿,硬挺,清脆,清晰,平坦,软粘,否
11,浅白,硬挺,清脆,模糊,平坦,硬滑,否
12,浅白,蜷缩,浊响,模糊,平坦,软粘,否
13,青绿,稍蜷,浊响,稍糊,凹陷,硬滑,否
14,浅白,稍蜷,沉闷,稍糊,凹陷,硬滑,否
15,乌黑,稍蜷,浊响,清晰,稍凹,软粘,否
16,浅白,蜷缩,浊响,模糊,平坦,硬滑,否
17,青绿,蜷缩,沉闷,稍糊,稍凹,硬滑,否

data_con.txt
编号,色泽,根蒂,敲声,纹理,脐部,触感,密度,含糖率,好瓜
1,青绿,蜷缩,浊响,清晰,凹陷,硬滑,0.697,0.46,是
2,乌黑,蜷缩,沉闷,清晰,凹陷,硬滑,0.744,0.376,是
3,乌黑,蜷缩,浊响,清晰,凹陷,硬滑,0.634,0.264,是
4,青绿,蜷缩,沉闷,清晰,凹陷,硬滑,0.608,0.318,是
5,浅白,蜷缩,浊响,清晰,凹陷,硬滑,0.556,0.215,是
6,青绿,稍蜷,浊响,清晰,稍凹,软粘,0.403,0.237,是
7,乌黑,稍蜷,浊响,稍糊,稍凹,软粘,0.481,0.149,是
8,乌黑,稍蜷,浊响,清晰,稍凹,硬滑,0.437,0.211,是
9,乌黑,稍蜷,沉闷,稍糊,稍凹,硬滑,0.666,0.091,否
10,青绿,硬挺,清脆,清晰,平坦,软粘,0.243,0.267,否
11,浅白,硬挺,清脆,模糊,平坦,硬滑,0.245,0.057,否
12,浅白,蜷缩,浊响,模糊,平坦,软粘,0.343,0.099,否
13,青绿,稍蜷,浊响,稍糊,凹陷,硬滑,0.639,0.161,否
14,浅白,稍蜷,沉闷,稍糊,凹陷,硬滑,0.657,0.198,否
15,乌黑,稍蜷,浊响,清晰,稍凹,软粘,0.36,0.37,否
16,浅白,蜷缩,浊响,模糊,平坦,硬滑,0.593,0.042,否
17,青绿,蜷缩,沉闷,稍糊,稍凹,硬滑,0.719,0.103,否
