Python: implementing a decision tree that selects splits by information entropy

This post walks through my Python implementation of a decision tree that selects split attributes by information entropy, following Chapter 4 (Decision Trees) of Zhou Zhihua's Machine Learning, the "Watermelon Book". Note: the code handles both continuous and discrete attributes, but does not deal with missing values or pruning.
First, a few key pieces of theory from the book:

1. The basic decision-tree learning algorithm

(The book's TreeGenerate pseudocode, Figure 4.2, appears here as an image in the original post.)

Decision-tree learning is clearly a recursive algorithm, and the most important part of any recursion is setting up its return conditions. Here there are three cases that produce a return (a minimal runnable sketch follows the list):

  1. All samples at the current node belong to the same class; no split is needed (i.e. all good melons or all bad melons).
  2. The attribute set is empty, or all samples take the same value on every remaining attribute; no split is possible.
  3. The sample subset at the current node is empty; no split can be made.
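
As a quick illustration, here is a minimal, self-contained sketch of that recursive skeleton. The names are hypothetical and samples are dicts with a 'label' key; the article's real implementation, which works on comma-separated strings instead, follows in section 3:

from collections import Counter

def tree_sketch(samples, attrs):
    labels = [s['label'] for s in samples]
    if len(set(labels)) == 1:                        # case 1: only one class left
        return labels[0]
    if not attrs or all(len({s[a] for s in samples}) == 1 for a in attrs):
        return Counter(labels).most_common(1)[0][0]  # case 2: majority class
    best = attrs[0]  # placeholder choice; section 2 picks it by information gain
    tree = {best: {}}
    for v in {s[best] for s in samples}:               # case 3 (empty subset) cannot
        subset = [s for s in samples if s[best] == v]  # occur with v drawn from samples
        tree[best][v] = tree_sketch(subset, [a for a in attrs if a != best])
    return tree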

2. Information entropy

In the algorithm above, one step is crucial: step 8, "choose the best split attribute a* from A". What do we base this choice on? Here we use the most basic criterion: information entropy.

1. Definition of information entropy

$$\mathrm{Ent}(D) = -\sum_{k=1}^{|\mathcal{Y}|} p_k \log_2 p_k$$

|Y| is the number of classes; for example, with the two classes good melon and bad melon, |Y| = 2.
p_k is the proportion of class-k samples in the whole set, e.g. good / (good + bad).
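
As a worked example: the discrete dataset at the end of this post has 17 samples, 8 good and 9 bad, so

$$\mathrm{Ent}(D) = -\left(\tfrac{8}{17}\log_2\tfrac{8}{17} + \tfrac{9}{17}\log_2\tfrac{9}{17}\right) \approx 0.998.$$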

2. Definition of information gain

(1) Discrete case:

$$\mathrm{Gain}(D, a) = \mathrm{Ent}(D) - \sum_{v=1}^{V} \frac{|D^v|}{|D|}\,\mathrm{Ent}(D^v)$$

The information gain measures the purity improvement obtained by partitioning the sample set on attribute a (such as 纹理): the larger the gain, the larger the improvement.
We pick the attribute with the largest information gain as the split attribute; if several attributes tie, any of them will do.
v ranges over the V possible values of attribute a; for example, 色泽 takes the three values 乌黑, 青绿 and 浅白. To evaluate the formula, we count the good and bad melons under each value separately and substitute into the formulas above.
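
For example, on the discrete dataset below, 色泽 splits the 17 samples into 青绿 (6 samples, 3 good), 乌黑 (6 samples, 4 good) and 浅白 (5 samples, 1 good), so

$$\mathrm{Gain}(D, \text{色泽}) = 0.998 - \left(\tfrac{6}{17}\cdot 1.000 + \tfrac{6}{17}\cdot 0.918 + \tfrac{5}{17}\cdot 0.722\right) \approx 0.109.$$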
(2) Continuous case:

When an attribute takes continuous values (such as 密度), the formula above needs a small modification. We bi-partition the continuous range and look for the best binary split threshold: sort the attribute's n observed values in ascending order and take the n-1 midpoints of adjacent values as the candidate set

$$T_a = \left\{ \frac{a^i + a^{i+1}}{2} \;\middle|\; 1 \le i \le n-1 \right\},$$

then evaluate the information gain at each candidate t and keep the threshold with the largest gain:

$$\mathrm{Gain}(D, a) = \max_{t \in T_a}\left[\, \mathrm{Ent}(D) - \sum_{\lambda \in \{-,+\}} \frac{|D_t^{\lambda}|}{|D|}\, \mathrm{Ent}(D_t^{\lambda}) \right]$$
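
A minimal sketch of the candidate-threshold step, with the 密度 column of the continuous dataset below hard-coded purely for illustration:

# Candidate bi-partition thresholds are the midpoints of adjacent sorted values.
densities = [0.697, 0.744, 0.634, 0.608, 0.556, 0.403, 0.481, 0.437, 0.666,
             0.243, 0.245, 0.343, 0.639, 0.657, 0.36, 0.593, 0.719]
values = sorted(densities)
Ta = [(values[i] + values[i + 1]) / 2 for i in range(len(values) - 1)]
print(Ta)  # n-1 = 16 candidates; each is plugged into the gain formula in turn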

3. Code implementation

The complete code first (Python):

import math


class Attribute():
    def __init__(self, name, id, iscon=0):
        self.name = name
        self.kids = []
        self.id = id
        self.iscon = iscon  # 1: discrete (word values), 0: continuous (numeric values)


# count how many samples take each value of one attribute in SampleArray
def count_sample(SampleArray, index, iscon, T=0):
    attribute = {}
    if len(SampleArray) == 0:
        return -1  # SampleArray is empty
    if iscon == 1:
        for sample in SampleArray:
            samples = sample.split(',')
            if samples[index] not in attribute:
                attribute[samples[index]] = 1
            else:
                attribute[samples[index]] += 1
    else:
        for sample in SampleArray:
            samples = sample.split(',')
            if float(samples[index]) <= T:
                if 'less' not in attribute:
                    attribute['less'] = 1
                else:
                    attribute['less'] += 1
            else:
                if 'more' not in attribute:
                    attribute['more'] = 1
                else:
                    attribute['more'] += 1
    return attribute


# count the number of good and bad samples under each value of one attribute
def count_attribute(SampleArray, index, iscon, T=0):
    attribute = {}
    if len(SampleArray) == 0:
        return -1  # SampleArray is empty
    if iscon == 1:  # discrete
        for sample in SampleArray:
            samples = sample.split(',')
            if str(samples[index] + samples[-1]) not in attribute:
                attribute[samples[index] + samples[-1]] = 1
            else:
                attribute[samples[index] + samples[-1]] += 1
    else:  # continuous
        for sample in SampleArray:
            samples = sample.split(',')
            if float(samples[index]) <= T:
                if str('less' + samples[-1]) not in attribute.keys():
                    attribute['less' + samples[-1]] = 1
                else:
                    attribute['less' + samples[-1]] += 1
            else:
                if str('more' + samples[-1]) not in attribute.keys():
                    attribute['more' + samples[-1]] = 1
                else:
                    attribute['more' + samples[-1]] += 1
    return attribute


def read_file(file_name, SampleArray, AttributeArray):
    with open(file_name) as f:
        contents = f.readline()
        flag = 0
        index = -1
        if "编号" in contents:  # skip the sample-number column
            flag = 1
            index = contents.find(',')
            attributes = contents[index + 1:].split(',')
        else:
            attributes = contents.split(',')
        id = 0
        for a in attributes:
            att = Attribute(a, id)
            id += 1
            AttributeArray.append(att)  # record the attribute
        per_att = []
        for contents in f:
            if flag == 1:
                index = contents.find(',')
                per_att = contents[index + 1:-1].split(',')  # drop the trailing '\n'
            else:
                per_att = contents[:-1].split(',')
            for i in range(len(AttributeArray)):
                if per_att[i] not in AttributeArray[i].kids:
                    AttributeArray[i].kids.append(per_att[i])
                    if per_att[i].isalnum():  # word values mean a discrete attribute
                        AttributeArray[i].iscon = 1
            SampleArray.append(contents[index + 1:].replace('\n', ''))
        del AttributeArray[-1].kids[-1]  # delete the last '' in kids of attributes.
    max_mark = count_sample(SampleArray, -1, 1)
    max_class = max(max_mark, key=max_mark.get)  # the majority class of the whole set
    return max_class


# find the best split attribute for the current node
def find_attribute(SampleArray, AttributeArray):
    entropy_D = 0
    entropy_Dv = 0
    entropy_Dv_total = 0
    max_index = 0
    max_gain = 0
    den = 0
    gains = []
    max_con_middle = 0  # best split threshold among continuous attributes
    max_con_gain = 0
    classes = count_sample(SampleArray, -1, 1)
    total_nums = sum(classes.values())
    for value in classes.values():  # Ent(D)
        p = value / total_nums
        entropy_D += p * math.log(p, 2)
    entropy_D = -entropy_D
    for index in range(len(AttributeArray) - 1):  # skip the class column
        if AttributeArray[index].iscon == 1:  # discrete
            total_kids = count_sample(SampleArray, index, 1)
            per_kid = count_attribute(SampleArray, index, 1)
            for kid in AttributeArray[index].kids:
                for j in AttributeArray[-1].kids:
                    if str(kid + j) not in per_kid.keys():
                        continue  # this value has no sample of class j
                    num = per_kid[str(kid + j)]
                    den = total_kids[kid]
                    p = num / den
                    entropy_Dv += p * math.log(p, 2)
                entropy_Dv_total += (den / total_nums) * entropy_Dv  # weighted Ent(Dv)
                entropy_Dv = 0
            gain = entropy_D + entropy_Dv_total
            entropy_Dv_total = 0
            gains.append(gain)
        elif AttributeArray[index].iscon == 0:  # continuous
            Ta = []  # candidate thresholds: midpoints of adjacent sorted values
            AttributeArray[index].kids.sort(key=float)
            for i in range(len(AttributeArray[index].kids) - 1):
                Ta.append((float(AttributeArray[index].kids[i])
                           + float(AttributeArray[index].kids[i + 1])) / 2)
            for t in Ta:
                total_kids = count_sample(SampleArray, index, 0, t)
                per_kid = count_attribute(SampleArray, index, 0, t)
                for j in AttributeArray[-1].kids:
                    if str('less' + j) not in per_kid.keys():
                        continue
                    num = per_kid['less' + j]
                    den = total_kids['less']
                    p = num / den
                    entropy_Dv += p * math.log(p, 2)
                entropy_Dv_total += (den / total_nums) * entropy_Dv
                entropy_Dv = 0
                for j in AttributeArray[-1].kids:
                    if str('more' + j) not in per_kid.keys():
                        continue
                    num = per_kid['more' + j]
                    den = total_kids['more']
                    p = num / den
                    entropy_Dv += p * math.log(p, 2)
                entropy_Dv_total += (den / total_nums) * entropy_Dv
                entropy_Dv = 0
                con_gain = entropy_D + entropy_Dv_total
                entropy_Dv_total = 0
                if con_gain > max_con_gain:
                    max_con_gain = con_gain
                    max_con_middle = t
            gain = max_con_gain
            gains.append(gain)
        if gain > max_gain:
            max_gain = gain
            max_index = index
    return max_index, max_con_middle  # best attribute index (and threshold if continuous)


treenode = []


# per tree node: [father, father_index, index, kid, result, isleaf]
def tree_generate(SampleArray, AttributeArray, father, father_index, pass_kid, max_class):
    treenode.append([])  # create a new tree node
    index = len(treenode) - 1
    treenode[index].append(father)        # attribute we split on to reach this node
    treenode[index].append(father_index)  # index of the father node
    treenode[index].append(index)
    treenode[index].append(pass_kid)      # the branch value that led to this node

    # case 1: all samples in SampleArray belong to one class
    count = count_sample(SampleArray, -1, 1)
    if len(count) == 1:
        treenode[index].append(max_class)
        treenode[index].append(1)  # leaf
        return

    # case 2: AttributeArray is empty, or all samples take the same attribute values
    all_same = True
    for i in range(len(AttributeArray) - 1):
        if len(count_sample(SampleArray, i, 1)) != 1:
            all_same = False
            break
    if all_same or len(AttributeArray) == 1:  # the class column itself does not count
        treenode[index].append(max_class)
        treenode[index].append(1)  # leaf
        return

    treenode[index].append(0)  # no result yet
    treenode[index].append(0)  # not a leaf

    # case 3: find the best attribute and recurse
    best_index, best_middle = find_attribute(SampleArray, AttributeArray)
    kid_SampleArray = []
    new_index = 0
    if AttributeArray[best_index].iscon == 1:  # discrete attribute
        for kid in AttributeArray[best_index].kids:
            kid_SampleArray.clear()
            for sample in SampleArray:
                samples = sample.split(',')
                if samples[best_index] == kid:
                    kid_SampleArray.append(sample.replace(kid + ',', ''))
            if len(kid_SampleArray) == 0:
                # empty branch: make a leaf labelled with the majority class
                treenode.append([])
                new_index = len(treenode) - 1
                treenode[new_index].append(AttributeArray[best_index].name)
                treenode[new_index].append(index)
                treenode[new_index].append(new_index)
                treenode[new_index].append(kid)
                treenode[new_index].append(max_class)
                treenode[new_index].append(1)  # leaf
                return
            else:
                kid_AttributeArray = list(AttributeArray)
                del kid_AttributeArray[best_index]
                max_class = count_sample(kid_SampleArray, -1, 1)
                max_class = max(max_class, key=max_class.get)
                tree_generate(kid_SampleArray, kid_AttributeArray,
                              AttributeArray[best_index].name, index, kid, max_class)
    else:  # continuous attribute: bi-partition at best_middle
        kid_less_SampleArray = []
        kid_more_SampleArray = []
        for sample in SampleArray:
            samples = sample.split(',')
            if float(samples[best_index]) <= best_middle:
                kid_less_SampleArray.append(sample.replace(samples[best_index] + ',', ''))
            else:
                kid_more_SampleArray.append(sample.replace(samples[best_index] + ',', ''))
        if len(kid_less_SampleArray) == 0:
            treenode.append([])
            new_index = len(treenode) - 1
            treenode[new_index].append(AttributeArray[best_index].name)
            treenode[new_index].append(index)
            treenode[new_index].append(new_index)
            treenode[new_index].append("<=" + str(best_middle))
            treenode[new_index].append(max_class)
            treenode[new_index].append(1)  # leaf
            return
        else:
            kid_AttributeArray = list(AttributeArray)
            del kid_AttributeArray[best_index]
            max_less_class = count_sample(kid_less_SampleArray, -1, 1)
            max_less_class = max(max_less_class, key=max_less_class.get)
            tree_generate(kid_less_SampleArray, kid_AttributeArray,
                          AttributeArray[best_index].name, index,
                          "<=" + str(best_middle), max_less_class)
        if len(kid_more_SampleArray) == 0:
            treenode.append([])
            new_index = len(treenode) - 1
            treenode[new_index].append(AttributeArray[best_index].name)
            treenode[new_index].append(index)
            treenode[new_index].append(new_index)
            treenode[new_index].append(">" + str(best_middle))
            treenode[new_index].append(max_class)
            treenode[new_index].append(1)  # leaf
            return
        else:
            kid_AttributeArray = list(AttributeArray)
            del kid_AttributeArray[best_index]
            max_more_class = count_sample(kid_more_SampleArray, -1, 1)
            max_more_class = max(max_more_class, key=max_more_class.get)
            tree_generate(kid_more_SampleArray, kid_AttributeArray,
                          AttributeArray[best_index].name, index,
                          ">" + str(best_middle), max_more_class)


def main():
    AttributeArray = []  # record attributes
    SampleArray = []     # record samples
    max_class = read_file('data.txt', SampleArray, AttributeArray)
    tree_generate(SampleArray, AttributeArray, -1, -1, -1, max_class)
    print(treenode[1:])


if __name__ == '__main__':
    main()

Sample input:

(Two screenshots in the original post: left, the dataset with continuous attributes; right, the purely discrete one. Both files are listed in full at the end of this post.)

Sample output:

(Screenshot of the printed node list.)

Each printed entry reads: [father, father_index, index, kid, result, isleaf].
I haven't implemented visualization of the tree; I may add it if I find the time. For now the printed output is enough to draw the tree by hand, haha.

3.1 Detailed notes:

1. The Attribute class

class Attribute():
    def __init__(self, name, id, iscon=0):
        self.name = name
        self.kids = []
        self.id = id
        self.iscon = iscon  # 1: discrete (word values), 0: continuous (numeric values)

name records the attribute's name; kids holds the values seen under this attribute; id is not really used and could be dropped; iscon marks whether the attribute is continuous, and is 0 when it is (yes, the logic is a bit backwards, heh).
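
For example (a hypothetical construction, assuming the class above is in scope, just to show the convention):

color = Attribute('色泽', 0)    # discrete column: read_file will later set iscon = 1
density = Attribute('密度', 7)  # continuous column: iscon stays 0
color.kids.extend(['青绿', '乌黑', '浅白'])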

2. The count_sample function

def count_sample(SampleArray, index, iscon, T=0):
    attribute = {}
    if len(SampleArray) == 0:
        return -1  # SampleArray is empty
    if iscon == 1:
        for sample in SampleArray:
            samples = sample.split(',')
            if samples[index] not in attribute:
                attribute[samples[index]] = 1
            else:
                attribute[samples[index]] += 1
    else:
        for sample in SampleArray:
            samples = sample.split(',')
            if float(samples[index]) <= T:
                if 'less' not in attribute:
                    attribute['less'] = 1
                else:
                    attribute['less'] += 1
            else:
                if 'more' not in attribute:
                    attribute['more'] = 1
                else:
                    attribute['more'] += 1
    return attribute

This function counts how many samples take each value of the attribute at the given column index. For a continuous attribute it instead counts how many samples fall at or below the threshold T and how many above. It returns a dict.
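
For example, assuming read_file has already filled SampleArray, I'd expect:

count_sample(SampleArray, -1, 1)        # class column -> {'是': 8, '否': 9} on data.txt
count_sample(SampleArray, 0, 1)         # 色泽 -> {'青绿': 6, '乌黑': 6, '浅白': 5}
# and on data_con.txt, where 密度 is column 6:
count_sample(SampleArray, 6, 0, 0.381)  # -> {'less': 4, 'more': 13}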

3. The count_attribute function

def count_attribute(SampleArray, index, iscon, T=0):
    attribute = {}
    if len(SampleArray) == 0:
        return -1  # SampleArray is empty
    if iscon == 1:  # discrete
        for sample in SampleArray:
            samples = sample.split(',')
            if str(samples[index] + samples[-1]) not in attribute:
                attribute[samples[index] + samples[-1]] = 1
            else:
                attribute[samples[index] + samples[-1]] += 1
    else:  # continuous
        for sample in SampleArray:
            samples = sample.split(',')
            if float(samples[index]) <= T:
                if str('less' + samples[-1]) not in attribute.keys():
                    attribute['less' + samples[-1]] = 1
                else:
                    attribute['less' + samples[-1]] += 1
            else:
                if str('more' + samples[-1]) not in attribute.keys():
                    attribute['more' + samples[-1]] = 1
                else:
                    attribute['more' + samples[-1]] += 1
    return attribute

This function counts, for each value of the given attribute, how many samples of each class take that value: for example, with 色泽 as the attribute, the numbers of good and bad melons among the 青绿, 乌黑 and 浅白 samples respectively. The return value is again a dict.
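
Continuing the example above, on data.txt I'd expect:

count_attribute(SampleArray, 0, 1)
# -> {'青绿是': 3, '乌黑是': 4, '浅白是': 1, '乌黑否': 2, '青绿否': 3, '浅白否': 4}
# keys are value + class; e.g. 3 of the 6 青绿 melons are good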

4. The read_file function

def read_file(file_name, SampleArray, AttributeArray):
    with open(file_name) as f:
        contents = f.readline()
        flag = 0
        index = -1
        if "编号" in contents:  # skip the sample-number column
            flag = 1
            index = contents.find(',')
            attributes = contents[index + 1:].split(',')
        else:
            attributes = contents.split(',')
        id = 0
        for a in attributes:
            att = Attribute(a, id)
            id += 1
            AttributeArray.append(att)  # record the attribute
        per_att = []
        for contents in f:
            if flag == 1:
                index = contents.find(',')
                per_att = contents[index + 1:-1].split(',')  # drop the trailing '\n'
            else:
                per_att = contents[:-1].split(',')
            for i in range(len(AttributeArray)):
                if per_att[i] not in AttributeArray[i].kids:
                    AttributeArray[i].kids.append(per_att[i])
                    if per_att[i].isalnum():  # word values mean a discrete attribute
                        AttributeArray[i].iscon = 1
            SampleArray.append(contents[index + 1:].replace('\n', ''))
        del AttributeArray[-1].kids[-1]  # delete the last '' in kids of attributes.
    max_mark = count_sample(SampleArray, -1, 1)
    max_class = max(max_mark, key=max_mark.get)  # the majority class of the whole set
    return max_class

This function reads the data from the given file. Two things to watch out for:
(1) If the first column of the file is the sample number (编号), it must be ignored.
(2) Each line may end with "\n", which must be stripped.
For convenience, the function also returns the majority class of the whole sample set, which the first call to tree_generate needs.
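
A quick usage example (the output values are what I'd expect on data.txt):

SampleArray, AttributeArray = [], []
max_class = read_file('data.txt', SampleArray, AttributeArray)
print(max_class)       # '否': the majority class (9 bad vs. 8 good)
print(SampleArray[0])  # '青绿,蜷缩,浊响,清晰,凹陷,硬滑,是', with the 编号 column stripped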

5. The find_attribute function

# find the best split attribute for the current node
def find_attribute(SampleArray, AttributeArray):
    entropy_D = 0
    entropy_Dv = 0
    entropy_Dv_total = 0
    max_index = 0
    max_gain = 0
    den = 0
    gains = []
    max_con_middle = 0  # best split threshold among continuous attributes
    max_con_gain = 0
    classes = count_sample(SampleArray, -1, 1)
    total_nums = sum(classes.values())
    for value in classes.values():  # Ent(D)
        p = value / total_nums
        entropy_D += p * math.log(p, 2)
    entropy_D = -entropy_D
    for index in range(len(AttributeArray) - 1):  # skip the class column
        if AttributeArray[index].iscon == 1:  # discrete
            total_kids = count_sample(SampleArray, index, 1)
            per_kid = count_attribute(SampleArray, index, 1)
            for kid in AttributeArray[index].kids:
                for j in AttributeArray[-1].kids:
                    if str(kid + j) not in per_kid.keys():
                        continue  # this value has no sample of class j
                    num = per_kid[str(kid + j)]
                    den = total_kids[kid]
                    p = num / den
                    entropy_Dv += p * math.log(p, 2)
                entropy_Dv_total += (den / total_nums) * entropy_Dv  # weighted Ent(Dv)
                entropy_Dv = 0
            gain = entropy_D + entropy_Dv_total
            entropy_Dv_total = 0
            gains.append(gain)
        elif AttributeArray[index].iscon == 0:  # continuous
            Ta = []  # candidate thresholds: midpoints of adjacent sorted values
            AttributeArray[index].kids.sort(key=float)
            for i in range(len(AttributeArray[index].kids) - 1):
                Ta.append((float(AttributeArray[index].kids[i])
                           + float(AttributeArray[index].kids[i + 1])) / 2)
            for t in Ta:
                total_kids = count_sample(SampleArray, index, 0, t)
                per_kid = count_attribute(SampleArray, index, 0, t)
                for j in AttributeArray[-1].kids:
                    if str('less' + j) not in per_kid.keys():
                        continue
                    num = per_kid['less' + j]
                    den = total_kids['less']
                    p = num / den
                    entropy_Dv += p * math.log(p, 2)
                entropy_Dv_total += (den / total_nums) * entropy_Dv
                entropy_Dv = 0
                for j in AttributeArray[-1].kids:
                    if str('more' + j) not in per_kid.keys():
                        continue
                    num = per_kid['more' + j]
                    den = total_kids['more']
                    p = num / den
                    entropy_Dv += p * math.log(p, 2)
                entropy_Dv_total += (den / total_nums) * entropy_Dv
                entropy_Dv = 0
                con_gain = entropy_D + entropy_Dv_total
                entropy_Dv_total = 0
                if con_gain > max_con_gain:
                    max_con_gain = con_gain
                    max_con_middle = t
            gain = max_con_gain
            gains.append(gain)
        if gain > max_gain:
            max_gain = gain
            max_index = index
    return max_index, max_con_middle  # best attribute index (and threshold if continuous)

This function implements step 8 of the algorithm: choosing the best split attribute by information gain. It handles the discrete and continuous cases separately; each is a direct transcription of the corresponding formula above.
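
For instance, on data.txt the book's own calculation makes 纹理 the best root split, so I'd expect a call like this:

best_index, best_middle = find_attribute(SampleArray, AttributeArray)
print(AttributeArray[best_index].name)  # '纹理' on the discrete data
# best_middle only matters when the chosen attribute is continuous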

6. The tree_generate function

def tree_generate(SampleArray, AttributeArray, father, father_index, pass_kid, max_class):
    treenode.append([])  # create a new tree node
    index = len(treenode) - 1
    treenode[index].append(father)        # attribute we split on to reach this node
    treenode[index].append(father_index)  # index of the father node
    treenode[index].append(index)
    treenode[index].append(pass_kid)      # the branch value that led to this node

    # case 1: all samples in SampleArray belong to one class
    count = count_sample(SampleArray, -1, 1)
    if len(count) == 1:
        treenode[index].append(max_class)
        treenode[index].append(1)  # leaf
        return

    # case 2: AttributeArray is empty, or all samples take the same attribute values
    all_same = True
    for i in range(len(AttributeArray) - 1):
        if len(count_sample(SampleArray, i, 1)) != 1:
            all_same = False
            break
    if all_same or len(AttributeArray) == 1:  # the class column itself does not count
        treenode[index].append(max_class)
        treenode[index].append(1)  # leaf
        return

    treenode[index].append(0)  # no result yet
    treenode[index].append(0)  # not a leaf

    # case 3: find the best attribute and recurse
    best_index, best_middle = find_attribute(SampleArray, AttributeArray)
    kid_SampleArray = []
    new_index = 0
    if AttributeArray[best_index].iscon == 1:  # discrete attribute
        for kid in AttributeArray[best_index].kids:
            kid_SampleArray.clear()
            for sample in SampleArray:
                samples = sample.split(',')
                if samples[best_index] == kid:
                    kid_SampleArray.append(sample.replace(kid + ',', ''))
            if len(kid_SampleArray) == 0:
                # empty branch: make a leaf labelled with the majority class
                treenode.append([])
                new_index = len(treenode) - 1
                treenode[new_index].append(AttributeArray[best_index].name)
                treenode[new_index].append(index)
                treenode[new_index].append(new_index)
                treenode[new_index].append(kid)
                treenode[new_index].append(max_class)
                treenode[new_index].append(1)  # leaf
                return
            else:
                kid_AttributeArray = list(AttributeArray)
                del kid_AttributeArray[best_index]
                max_class = count_sample(kid_SampleArray, -1, 1)
                max_class = max(max_class, key=max_class.get)
                tree_generate(kid_SampleArray, kid_AttributeArray,
                              AttributeArray[best_index].name, index, kid, max_class)
    else:  # continuous attribute: bi-partition at best_middle
        kid_less_SampleArray = []
        kid_more_SampleArray = []
        for sample in SampleArray:
            samples = sample.split(',')
            if float(samples[best_index]) <= best_middle:
                kid_less_SampleArray.append(sample.replace(samples[best_index] + ',', ''))
            else:
                kid_more_SampleArray.append(sample.replace(samples[best_index] + ',', ''))
        if len(kid_less_SampleArray) == 0:
            treenode.append([])
            new_index = len(treenode) - 1
            treenode[new_index].append(AttributeArray[best_index].name)
            treenode[new_index].append(index)
            treenode[new_index].append(new_index)
            treenode[new_index].append("<=" + str(best_middle))
            treenode[new_index].append(max_class)
            treenode[new_index].append(1)  # leaf
            return
        else:
            kid_AttributeArray = list(AttributeArray)
            del kid_AttributeArray[best_index]
            max_less_class = count_sample(kid_less_SampleArray, -1, 1)
            max_less_class = max(max_less_class, key=max_less_class.get)
            tree_generate(kid_less_SampleArray, kid_AttributeArray,
                          AttributeArray[best_index].name, index,
                          "<=" + str(best_middle), max_less_class)
        if len(kid_more_SampleArray) == 0:
            treenode.append([])
            new_index = len(treenode) - 1
            treenode[new_index].append(AttributeArray[best_index].name)
            treenode[new_index].append(index)
            treenode[new_index].append(new_index)
            treenode[new_index].append(">" + str(best_middle))
            treenode[new_index].append(max_class)
            treenode[new_index].append(1)  # leaf
            return
        else:
            kid_AttributeArray = list(AttributeArray)
            del kid_AttributeArray[best_index]
            max_more_class = count_sample(kid_more_SampleArray, -1, 1)
            max_more_class = max(max_more_class, key=max_more_class.get)
            tree_generate(kid_more_SampleArray, kid_AttributeArray,
                          AttributeArray[best_index].name, index,
                          ">" + str(best_middle), max_more_class)

This function follows the decision-tree algorithm directly. On every call it first creates a new node, then checks the first two return conditions; if neither fires, it finds the best split attribute for this recursion level, builds a leaf for any empty branch (the third return condition), and otherwise keeps recursing.
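
Two illustrative entries (hypothetical values, just to show the field order [father, father_index, index, kid, result, isleaf]):

node = ['纹理', 0, 1, '清晰', 0, 0]    # internal node reached via 纹理 == 清晰
leaf = ['根蒂', 1, 2, '蜷缩', '是', 1]  # leaf predicting a good melon (是)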


And that's my Python decision-tree implementation~
Appendix:
data.txt
编号,色泽,根蒂,敲声,纹理,脐部,触感,好瓜
1,青绿,蜷缩,浊响,清晰,凹陷,硬滑,是
2,乌黑,蜷缩,沉闷,清晰,凹陷,硬滑,是
3,乌黑,蜷缩,浊响,清晰,凹陷,硬滑,是
4,青绿,蜷缩,沉闷,清晰,凹陷,硬滑,是
5,浅白,蜷缩,浊响,清晰,凹陷,硬滑,是
6,青绿,稍蜷,浊响,清晰,稍凹,软粘,是
7,乌黑,稍蜷,浊响,稍糊,稍凹,软粘,是
8,乌黑,稍蜷,浊响,清晰,稍凹,硬滑,是
9,乌黑,稍蜷,沉闷,稍糊,稍凹,硬滑,否
10,青绿,硬挺,清脆,清晰,平坦,软粘,否
11,浅白,硬挺,清脆,模糊,平坦,硬滑,否
12,浅白,蜷缩,浊响,模糊,平坦,软粘,否
13,青绿,稍蜷,浊响,稍糊,凹陷,硬滑,否
14,浅白,稍蜷,沉闷,稍糊,凹陷,硬滑,否
15,乌黑,稍蜷,浊响,清晰,稍凹,软粘,否
16,浅白,蜷缩,浊响,模糊,平坦,硬滑,否
17,青绿,蜷缩,沉闷,稍糊,稍凹,硬滑,否

data_con.txt
编号,色泽,根蒂,敲声,纹理,脐部,触感,密度,含糖率,好瓜
1,青绿,蜷缩,浊响,清晰,凹陷,硬滑,0.697,0.46,是
2,乌黑,蜷缩,沉闷,清晰,凹陷,硬滑,0.744,0.376,是
3,乌黑,蜷缩,浊响,清晰,凹陷,硬滑,0.634,0.264,是
4,青绿,蜷缩,沉闷,清晰,凹陷,硬滑,0.608,0.318,是
5,浅白,蜷缩,浊响,清晰,凹陷,硬滑,0.556,0.215,是
6,青绿,稍蜷,浊响,清晰,稍凹,软粘,0.403,0.237,是
7,乌黑,稍蜷,浊响,稍糊,稍凹,软粘,0.481,0.149,是
8,乌黑,稍蜷,浊响,清晰,稍凹,硬滑,0.437,0.211,是
9,乌黑,稍蜷,沉闷,稍糊,稍凹,硬滑,0.666,0.091,否
10,青绿,硬挺,清脆,清晰,平坦,软粘,0.243,0.267,否
11,浅白,硬挺,清脆,模糊,平坦,硬滑,0.245,0.057,否
12,浅白,蜷缩,浊响,模糊,平坦,软粘,0.343,0.099,否
13,青绿,稍蜷,浊响,稍糊,凹陷,硬滑,0.639,0.161,否
14,浅白,稍蜷,沉闷,稍糊,凹陷,硬滑,0.657,0.198,否
15,乌黑,稍蜷,浊响,清晰,稍凹,软粘,0.36,0.37,否
16,浅白,蜷缩,浊响,模糊,平坦,硬滑,0.593,0.042,否
17,青绿,蜷缩,沉闷,稍糊,稍凹,硬滑,0.719,0.103,否
