Implementing a Decision Tree in Python

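Preface

This post is a simple Python implementation of a decision tree, without pruning. The tree only handles continuous feature values; support for discrete features may be added later. The dataset is iris.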

1. Import the required packages

import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

2. Load the data

iris = load_iris()
data = pd.DataFrame(iris.data)
data.columns = iris.feature_names
data['target'] = iris.target

3. Implement the Decision Tree

Tree construction stops under either of two conditions:

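If all samples in a node's dataset have the same label, the node becomes a leaf and that label is its prediction.
If a node's dataset has no features left to split on, the node becomes a leaf and the most common label in its dataset is its prediction.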
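
The two stopping conditions can be sketched in isolation. The helper below is hypothetical: the name `leaf_label` and its arguments are made up for illustration and are not part of the implementation in this post. It only shows when a node becomes a leaf and which label it receives.

```python
from collections import Counter

def leaf_label(labels, remaining_features):
    """Return the leaf's label if a stopping condition holds, otherwise None."""
    if len(set(labels)) == 1:      # condition 1: every sample has the same label
        return labels[0]
    if not remaining_features:     # condition 2: no features left to split on
        return Counter(labels).most_common(1)[0][0]  # most common label
    return None                    # neither condition holds, keep splitting

pure = leaf_label([1, 1, 1], [0, 2])        # pure node
exhausted = leaf_label([0, 2, 2], [])       # features used up
undecided = leaf_label([0, 2, 2], [0, 2])   # still needs a split
```

Here `pure` is 1, `exhausted` is 2 (the majority label), and `undecided` is `None`.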

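The feature selection criterion is information gain. In the code this is done by choosing the split with the lowest weighted entropy after splitting, which is equivalent to maximizing information gain because the entropy before the split is the same for every candidate.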
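
As a rough illustration of the criterion, the sketch below computes the information gain of two candidate split points on a tiny made-up dataset. The function names `entropy` and `information_gain` and the toy values are assumptions for this example only; the post's own functions follow below.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Empirical entropy, in bits, of a sequence of labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(values, labels, threshold):
    """Entropy before the split minus the weighted entropy after splitting at threshold."""
    left = [lab for v, lab in zip(values, labels) if v <= threshold]
    right = [lab for v, lab in zip(values, labels) if v > threshold]
    after = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
    return entropy(labels) - after

# Toy data, invented for illustration: one feature, two classes
toy_values = [1.0, 2.0, 3.0, 4.0]
toy_labels = [0, 0, 1, 1]

gain_mid = information_gain(toy_values, toy_labels, 2.5)  # separates the classes completely
gain_low = information_gain(toy_values, toy_labels, 1.5)  # leaves the right side mixed
```

Splitting at 2.5 gives a gain of 1.0 bit, while splitting at 1.5 gives about 0.31 bits, so 2.5 would be chosen.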

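'''
Description: compute the empirical entropy of a dataset
Parameters:
    labels: the labels of the dataset
Returns:
    entropy: the empirical entropy of the dataset
'''
def cal_entropy(labels):
    data_num = len(labels)  # number of samples in the dataset
    labels_num = {}  # number of occurrences of each label
    for label in labels:
        if label not in labels_num:
            labels_num[label] = 1
        else:
            labels_num[label] += 1
    entropy = 0.0  # empirical entropy
    for key in labels_num:
        prob = float(labels_num[key]) / data_num
        entropy -= prob * np.log2(prob)
    return entropy

'''
Description: compute the empirical entropy after splitting on the current feature (continuous feature values)
Parameters:
    data: the continuous feature values
    labels: the labels corresponding to the data
Returns:
    min_entropy: the minimum empirical entropy
    best_partition: the split point that gives the minimum empirical entropy
'''
def entropy_for_continuous(data, labels):
    sorted_data = data.sort_values().reset_index(drop=True)  # sort the values
    # candidate split points: midpoints of adjacent sorted values
    partitions = [(sorted_data[i] + sorted_data[i + 1]) / 2 for i in range(0, len(sorted_data) - 1)]
    entropy_list = []
    # empirical entropy after splitting at each candidate split point
    for partition in partitions:
        smaller_group = data[data <= partition]  # values less than or equal to the split point
        bigger_group = data[data > partition]  # values greater than the split point
        entropy1 = cal_entropy(labels.loc[smaller_group.index])
        entropy2 = cal_entropy(labels.loc[bigger_group.index])
        # weighted empirical entropy after splitting at this point
        entropy = (len(smaller_group) / float(len(data)) * entropy1) + (len(bigger_group) / float(len(data)) * entropy2)
        entropy_list.append(entropy)
    # minimum empirical entropy and the split point that achieves it
    min_entropy = min(entropy_list)
    best_partition = partitions[entropy_list.index(min_entropy)]
    return min_entropy, best_partition

'''
Description: classify one sample
Parameters:
    tree: the decision tree used for classification
    X: the sample to classify (one row of features)
Returns:
    predLabel: the predicted label
'''
def classify(tree, X):
    key = next(iter(tree))  # the decision node
    dictionary = tree[key]  # dictionary holding the branches under this decision
    # a tuple key means a continuous feature: (feature index, split point)
    if isinstance(key, tuple):
        # index by position, since the feature index is an integer and the row is labelled by column name
        if X.iloc[key[0]] <= key[1]:
            value = dictionary[0]  # branch for values less than or equal to the split point
        else:
            value = dictionary[1]  # branch for values greater than the split point
    # a non-tuple key means a discrete feature
    else:
        for i in dictionary.keys():
            if X.iloc[key] == i:
                value = dictionary[i]
                break
    if isinstance(value, dict):
        predLabel = classify(value, X)
    else:
        predLabel = value
    return predLabel

'''
Description: choose the feature that gives the minimum empirical entropy
Parameters:
    dataSet: the training set
    types: the list of feature types of the training set
Returns:
    a tuple of (index, split point), or just the index
'''
def bestLabelIndForSplit(dataSet, types):
    # for each feature, the empirical entropy after splitting; for continuous features,
    # a tuple of (minimum empirical entropy, split point that achieves it)
    child_entropy = []
    for i in range(0, dataSet.shape[1] - 1):
        if types[i] == 'continuous':
            min_entropy, partition = entropy_for_continuous(dataSet.iloc[:, i], dataSet.iloc[:, -1])
            child_entropy.append((min_entropy, partition))
        else:
            continue  # discrete features are not implemented yet
    # collect the entropies (the first element of each tuple) so they can be compared
    temp = [child_entropy[i][0] if isinstance(child_entropy[i], tuple) else child_entropy[i] for i in range(0, len(child_entropy))]
    minInd = np.argmin(temp)
    # if the best feature is continuous, return a tuple of (index, split point)
    if isinstance(child_entropy[minInd], tuple):
        return (minInd, child_entropy[minInd][1])
    # if the best feature is discrete, return only the index
    else:
        return minInd

'''
Description: build the decision tree
Parameters:
    dataSet: the training set
    colIndex: the list of feature indices of the training set
    types: the list of feature types of the training set
Returns:
    tree: the decision tree
'''
def createTree(dataSet, colIndex, types):
    labels = dataSet.iloc[:, -1]
    # if all samples have the same label, return that label
    if len(list(set(labels))) == 1:
        return list(set(labels))[0]
    # if the features are used up, return the most common label in the current dataset
    if colIndex == []:
        labels_list = dataSet.iloc[:, -1].to_list()  # the node's own data, not the global data
        return max(labels_list, key=labels_list.count)
    bestLabelInd = bestLabelIndForSplit(dataSet, types)
    if isinstance(bestLabelInd, tuple):
        partition = bestLabelInd[1]  # best split point
        bestLabelInd = bestLabelInd[0]  # index of the best feature
        smaller_group = dataSet[dataSet.iloc[:, bestLabelInd] <= partition]  # samples at or below the split point
        bigger_group = dataSet[dataSet.iloc[:, bestLabelInd] > partition]  # samples above the split point
        # if the chosen split leaves one side empty it does not separate the samples,
        # so stop here and return the most common label instead of recursing on an empty set
        if len(smaller_group) == 0 or len(bigger_group) == 0:
            labels_list = labels.to_list()
            return max(labels_list, key=labels_list.count)
        # drop the column of the feature that was just used
        used_column = dataSet.columns[bestLabelInd]
        smaller_group = smaller_group.drop(used_column, axis=1)
        bigger_group = bigger_group.drop(used_column, axis=1)
        key = (colIndex[bestLabelInd], partition)
        tree = {key: {}}  # build the internal node
        # remove the used feature from the index list and the type list
        del colIndex[bestLabelInd], types[bestLabelInd]
        colIndex1 = colIndex[:]
        colIndex2 = colIndex[:]
        type1 = types[:]
        type2 = types[:]
        tree[key][0] = createTree(smaller_group, colIndex1, type1)  # subtree for values less than or equal to the split point
        tree[key][1] = createTree(bigger_group, colIndex2, type2)  # subtree for values greater than the split point
        return tree
    else:
        print('Discrete features are not implemented yet!')

'''
Description: compute the accuracy of the model
Parameters:
    pred: the predicted labels
    real: the true labels
Returns:
    accuracy: the accuracy of the model
'''
def accuracy(pred, real):
    pred_list = pred.to_list()
    real_list = real.to_list()
    corrNum = 0  # number of correct predictions
    for i in range(0, len(pred_list)):
        if pred_list[i] == real_list[i]:
            corrNum += 1
    return float(corrNum) / len(pred_list)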
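
The tree built by `createTree` is a nested dict: each internal node has a single key of the form (feature index, split point), and its value maps 0 to the "less than or equal" branch and 1 to the "greater than" branch. The sketch below walks a hand-written tree of that shape. The thresholds are invented, and `predict_one` is a hypothetical iterative variant of `classify` that takes a plain list and only handles continuous features.

```python
# Hand-written tree in the same shape createTree returns; thresholds are invented.
toy_tree = {
    (2, 2.45): {                      # split on feature index 2 at 2.45
        0: 0,                         # feature <= 2.45: leaf with label 0
        1: {(3, 1.75): {0: 1, 1: 2}}  # feature > 2.45: split again on feature index 3
    }
}

def predict_one(tree, sample):
    """Walk a nested-dict tree for one sample given as a sequence of feature values."""
    node = tree
    while isinstance(node, dict):
        (feature_index, threshold), branches = next(iter(node.items()))
        node = branches[0] if sample[feature_index] <= threshold else branches[1]
    return node  # a leaf is a plain label

predictions = [predict_one(toy_tree, sample) for sample in (
    [5.1, 3.5, 1.4, 0.2],
    [6.0, 2.9, 4.5, 1.5],
    [6.5, 3.0, 5.5, 2.0],
)]
```

For these three samples `predictions` is `[0, 1, 2]`.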

4. Test the model

if __name__ == '__main__':
    # list of feature indices of the training set, used as a reference while building the tree
    colIndex = [i for i in range(0, data.shape[1] - 1)]
    # decide whether each column is continuous or discrete
    types = ['continuous' if (dtype == 'int64') or (dtype == 'float64') else 'Discrete' for dtype in data.dtypes]
    x_train, x_test, y_train, y_test = train_test_split(data.iloc[:, 0:-1], data.iloc[:, -1])
    x_train = x_train.copy()  # work on a copy before adding the label column
    x_train['target'] = y_train
    tree = createTree(x_train, colIndex, types)
    pred = x_test.apply(lambda x: classify(tree, x), axis=1)
    print(accuracy(pred, y_test))  # print the accuracy
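
As a sanity check that is not part of the original post, the accuracy can be compared against scikit-learn's `DecisionTreeClassifier` with the entropy criterion. This is only a rough baseline: scikit-learn's tree can reuse a feature at several depths, so the two models are not the same algorithm and the numbers will not match exactly. The `random_state` values are arbitrary and only make the run repeatable.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
# Fixed seed (arbitrary) so the split is the same on every run
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

baseline = DecisionTreeClassifier(criterion="entropy", random_state=0)
baseline.fit(X_train, y_train)
baseline_accuracy = accuracy_score(y_test, baseline.predict(X_test))
print(baseline_accuracy)
```

If the hand-written tree scores far below this baseline on a comparable split, that is a hint to look for a bug rather than proof of one.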
