
How a Decision Tree Works

Pictorially, a decision tree is like a flowchart: each internal (parent) node represents a test on an attribute, and each leaf node represents the final category assigned to the data points that reach it.


Figure 1: Student sample distribution

In the illustration above, a total of 13 students were randomly sampled from a student performance dataset. The scatter plot shows the distribution of the sample over two attributes:

  1. raisedhands: the number of times the student raised a hand in class to ask or answer questions.
  2. visitedResources: the number of times the student visited course content.

Our goal is to manually construct a decision tree that best separates the sample data points into three distinct classes, L, M and H, where:

L = Low performance category

M = Medium (average) performance category

H = High performance category

Option A

One option is to split the data on the attribute visitedResources at the value 70.


This “perfectly” separates the H class from the rest.


Option B

Another option is to split on the same attribute, visitedResources, at the value 41.


No “perfect” separation is achieved for any class.


Option C

A third option is to split on the attribute raisedhands at the value 38.


This “perfectly” separates the L class from the rest.


Options A and C each did a better job, cleanly separating at least one of the classes. Suppose we pick option A; the resultant decision tree will be:

The left branch has only H class students and hence cannot be separated any further. On the right branch, the resultant node has four students in each of the M and L classes.

Remember that this is the current state of our separation exercise.


How can the remaining students (data points) best be separated into their appropriate classes? Yes, you guessed right: draw more lines!

One option is to split on the attribute raisedhands at the value 38.

Again, any number of split lines could be drawn. This option seems to yield a good result, however, so we shall go with it.


The resultant decision tree after the split is shown below:


Clearly, the data points are perfectly separated into the appropriate classes, hence no further logical separation is needed.
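The two splits chosen above can be written down as plain rules. The following is a minimal sketch, not the article's own code: it assumes that students above a threshold fall on the higher-performing side of each split and that boundary values fall on the lower side, since the text gives only the split points (70 and 38). The function name classify_student is hypothetical.

```python
def classify_student(visited_resources: int, raised_hands: int) -> str:
    """Hand-built two-level decision tree from the walkthrough.

    Assumption: values above a split point go to the higher-performing
    side; values equal to the split point go to the lower side.
    """
    # First split (option A): visitedResources at 70 isolates the H class.
    if visited_resources > 70:
        return "H"
    # Second split: raisedhands at 38 separates the remaining M and L students.
    if raised_hands > 38:
        return "M"
    return "L"


print(classify_student(85, 60))
print(classify_student(40, 55))
print(classify_student(20, 10))
```

Each `if` corresponds to one decision node, and each `return` to one leaf, which is why the depth of the tree equals the longest chain of tests a data point can go through.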


Lessons Learnt So Far

  1. In ML parlance, this process of building a decision tree that best classifies a given dataset is referred to as learning.
  2. This learning process is iterative.
  3. Several decision trees of varying levels of prediction accuracy can be derived from the same dataset, depending on the split attribute choices made and the tree depth allowed.

In manually constructing the decision tree, we learnt that the separation lines can be drawn at any point along any of the attributes available in a dataset. The question is: at any given decision node, which of the possible attributes and separation points will do a better job of separating the dataset into the desired or near-desired classes? One instrument for answering this question is the Gini impurity.

Gini Impurity

Suppose we have a new student and we randomly classify this new student into one of the three classes according to the probability distribution of the classes. The Gini impurity measures the likelihood of classifying that new, randomly chosen student incorrectly. Because it is a probability, it is bounded between 0 and 1.

We have a total of 13 students in our sample dataset, and the probabilities of the H, M and L classes are 5/13, 4/13 and 4/13 respectively.

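The formula below is used to calculate the Gini impurity, where C is the number of classes and p_i is the probability of class i:

$$G = 1 - \sum_{i=1}^{C} p_i^2$$

Applied to our three-class example, the formula becomes:

$$G = 1 - \left(p_H^2 + p_M^2 + p_L^2\right)$$

Therefore, the Gini impurity at the root node of the decision tree, before any split, is computed as:

$$G_{root} = 1 - \left[\left(\tfrac{5}{13}\right)^2 + \left(\tfrac{4}{13}\right)^2 + \left(\tfrac{4}{13}\right)^2\right] = \tfrac{112}{169} \approx 0.66$$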
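As a quick check of the root-node value, here is a small sketch that computes the Gini impurity as one minus the sum of squared class proportions, using the class counts given above (5 H, 4 M, 4 L). The function name gini_impurity is chosen for illustration.

```python
from collections import Counter


def gini_impurity(labels):
    """Return 1 - sum(p_i^2) over the class proportions found in `labels`."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((n / total) ** 2 for n in counts.values())


# The 13 sampled students at the root node, before any split.
root = ["H"] * 5 + ["M"] * 4 + ["L"] * 4
print(round(gini_impurity(root), 2))  # 0.66
```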


Recall the split options A and C discussed earlier at the root node. Let us compare the Gini impurities of the two options and see why A was picked as the better split choice.


Option A
Option C

Therefore, the amount of impurity removed by split option A, its Gini gain, is 0.66 - 0.30 = 0.36, while that for split option C is 0.66 - 0.37 = 0.29.

Since 0.36 > 0.29, option A is the better split choice, which justifies the earlier decision to pick A over C.
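The gain for option A can be reproduced from the class counts stated earlier: the H branch holds 5 H students, and the other branch holds 4 M and 4 L students. The sketch below weights each child's impurity by its share of the parent's students; the helper names gini and gini_gain are illustrative. Option C is left out because its per-branch counts are not spelled out in the text.

```python
def gini(counts):
    """Gini impurity from a sequence of per-class counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts)


def gini_gain(parent, children):
    """Parent impurity minus the size-weighted impurity of the children."""
    n = sum(parent)
    weighted = sum(sum(child) / n * gini(child) for child in children)
    return gini(parent) - weighted


# Counts are ordered (H, M, L).
parent = (5, 4, 4)
option_a = [(5, 0, 0), (0, 4, 4)]  # split on visitedResources at 70
print(round(gini_gain(parent, option_a), 2))  # 0.36
```

The pure H branch contributes zero impurity, so the whole weighted impurity of option A comes from the mixed M and L branch: (8/13) x 0.5, roughly 0.31.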

The Gini impurity at a node where all the students belong to a single class, say H, is always zero, meaning no impurity. This implies a perfect classification, so no further split is needed.

Random Forest

We have seen that many decision trees can be generated from the same dataset, and that their performance at correctly predicting unseen examples can vary. Also, using a single tree model (decision tree) can easily lead to over-fitting.

The question becomes: how do we make sure we construct the best-performing tree possible? One answer is to smartly construct many trees and average their predictions to improve predictive accuracy and control over-fitting. This method is called the random forest. It is random because each tree is built not from the whole training dataset but from a random sample of its data points and attributes.
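The two sources of randomness, and the averaging step, can be sketched in a few lines of standard-library Python. This is an illustration of the idea only, not how any particular library implements it: the helper names, the attribute names announcementsViewed and discussion, and the per-tree probabilities are all made up for the example. Note that scikit-learn draws the attribute subset at each split rather than once per tree.

```python
import random


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement (a bootstrap sample)."""
    return [rng.choice(rows) for _ in range(len(rows))]


def random_attributes(attributes, k, rng):
    """Pick k distinct attributes to consider."""
    return rng.sample(attributes, k)


def average_votes(per_tree_probs):
    """Average per-class probabilities across trees; return the top class."""
    classes = per_tree_probs[0].keys()
    n = len(per_tree_probs)
    avg = {c: sum(p[c] for p in per_tree_probs) / n for c in classes}
    return max(avg, key=avg.get), avg


rng = random.Random(0)
rows = list(range(13))  # indices of the 13 sampled students
attributes = ["raisedhands", "visitedResources",
              "announcementsViewed", "discussion"]  # last two are hypothetical names

sample = bootstrap_sample(rows, rng)
subset = random_attributes(attributes, 2, rng)

# Made-up class probabilities from three trees for one student.
label, avg = average_votes([
    {"L": 0.1, "M": 0.6, "H": 0.3},
    {"L": 0.2, "M": 0.5, "H": 0.3},
    {"L": 0.0, "M": 0.3, "H": 0.7},
])
print(label)  # M
```

Because each tree sees a different sample, the trees make different mistakes, and averaging tends to cancel some of them out.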

We shall use the random forest implementation in the scikit-learn Python package to demonstrate how a random forest model can be trained and tested, and how one of the trees that constitute the forest can be visualized.

For this exercise, we shall train a random forest model to predict (classify) the academic performance category (Class) that students belong to, based on their participation in class and learning processes.

In the dataset for this exercise, students' participation is measured by four variables:

  1. Raised hand: how many times the student raised a hand in class to ask or answer questions (numeric: 0–100)
  2. Visited resources: how many times the student visited course content (numeric: 0–100)
  3. Viewing announcements: how many times the student checked the news announcements (numeric: 0–100)
  4. Discussion groups: how many times the student participated in discussion groups (numeric: 0–100)


In the sample extract below, the first four (4) numeric columns correspond to the students' participation measures defined earlier, and the last column, Class, which is categorical, represents the student's performance. A student can be in any one of three (3) classes: Low, Medium or High performance.

Figure-1: Dataset extract: Students participation measures and performance class

Basic data preparation steps:

  1. Load the dataset.
  2. Clean or preprocess the data. All features in this dataset are already in the right format and there are no missing values. In my experience, this is rarely the case in ML projects, as some degree of cleaning or preprocessing is usually required.
  3. Encode the label. This is necessary because the label (Class) in this dataset is categorical.
  4. Split the dataset into train and test sets.

An implementation of all the above steps is shown in the snippet below:
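A minimal sketch of these four steps, assuming pandas and scikit-learn are installed. The column names, the 75/25 split and the randomly generated placeholder rows are assumptions standing in for the real dataset file, which would normally be loaded with pandas.read_csv.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Hypothetical column names for the four participation measures.
FEATURES = ["raisedhands", "visitedResources", "announcementsViewed", "discussion"]


def make_placeholder_data(n=120, seed=0):
    """Step 1 stand-in: random rows shaped like the dataset extract."""
    rng = np.random.default_rng(seed)
    df = pd.DataFrame(rng.integers(0, 101, size=(n, 4)), columns=FEATURES)
    df["Class"] = rng.choice(["L", "M", "H"], size=n)
    return df


def prepare(df, test_size=0.25, seed=42):
    # Step 2: nothing to clean here; just confirm no values are missing.
    if df.isnull().values.any():
        raise ValueError("dataset contains missing values")
    # Step 3: encode the categorical label as integers.
    encoder = LabelEncoder()
    y = encoder.fit_transform(df["Class"])
    X = df[FEATURES].to_numpy()
    # Step 4: hold out a test set, keeping class proportions similar.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=seed, stratify=y
    )
    return X_train, X_test, y_train, y_test, encoder


df = make_placeholder_data()
X_train, X_test, y_train, y_test, encoder = prepare(df)
print(X_train.shape, X_test.shape, list(encoder.classes_))
```

LabelEncoder sorts the class names alphabetically, so the encoded integers do not follow the Low, Medium, High order. That is fine for a classifier, which treats the labels as unordered categories.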


Next, we shall create a RandomForest instance and fit the model (build the trees) to the train set.




  1. n_estimators = the number of trees that make up the forest.
  2. criterion = the method used to pick the best attribute split option for the decision trees. Here, the Gini impurity is used.
  3. max_depth = a cap on the depth of the trees. If no clear classification has been reached at this depth, the model treats all the nodes at that level as leaf nodes. For each leaf node, the data points are assigned the majority class in that node.
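A sketch of creating and fitting a forest with these three parameters, using scikit-learn's RandomForestClassifier. The parameter values and the placeholder data from make_classification are illustrative assumptions, not the settings behind the results reported in this article.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Placeholder data: 4 numeric features and 3 classes, standing in for
# the student participation dataset.
X, y = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)

model = RandomForestClassifier(
    n_estimators=20,   # number of trees in the forest (illustrative value)
    criterion="gini",  # use Gini impurity to choose splits
    max_depth=5,       # cap on the depth of each tree (illustrative value)
    random_state=0,    # make the run repeatable
)
model.fit(X, y)

# Every fitted tree respects the depth cap.
print(len(model.estimators_), max(t.get_depth() for t in model.estimators_))
```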

Note that the best n_estimators and max_depth combination can only be found by experimenting with several combinations. One way to do this is the grid search method.
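One way to run such a search is scikit-learn's GridSearchCV, which fits the model for every combination in a grid and scores each with cross-validation. The grid values, the 3-fold setting and the placeholder data below are assumptions chosen to keep the sketch fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Placeholder data standing in for the student participation dataset.
X, y = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)

# Candidate values to try; a real search would usually cover a wider range.
param_grid = {"n_estimators": [10, 30], "max_depth": [2, 4]}

search = GridSearchCV(
    RandomForestClassifier(criterion="gini", random_state=0),
    param_grid,
    cv=3,                # 3-fold cross-validation per combination
    scoring="accuracy",
)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

The cost grows with the product of the grid sizes, so it helps to start with a coarse grid and refine around the best combination.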

Model Evaluation

While several metrics exist for evaluating models, we shall use one of the most basic, if not the most basic: accuracy.
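Accuracy is the fraction of predictions that match the true labels. The sketch below computes it on both the train and test sets with scikit-learn's accuracy_score. It uses placeholder data and illustrative parameter values, so its numbers will differ from those reported for the student dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Placeholder data standing in for the student participation dataset.
X, y = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

model = RandomForestClassifier(n_estimators=20, max_depth=5, random_state=0)
model.fit(X_train, y_train)

# Compare predictions with the true labels on each set.
train_acc = accuracy_score(y_train, model.predict(X_train))
test_acc = accuracy_score(y_test, model.predict(X_test))
print(f"train: {train_acc:.2%}, test: {test_acc:.2%}")
```

A train accuracy well above the test accuracy is a common sign of over-fitting, which is why both are worth reporting.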


Accuracy on the train set is 72.59% and on the test set 68.55%: it could be better, but it is not a bad benchmark.


Visualizing the Best Tree in the Forest

The best tree in a random forest model can be visualized easily, giving engineers, scientists and business specialists alike some understanding of the model's decision flow.

The snippet below extracts and visualizes the best tree from the model trained above:
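One possible way to do this is sketched here. It assumes that "best" means the individual tree with the highest accuracy on the held-out test set, which is a choice made for this example. It prints the tree as indented text with scikit-learn's export_text instead of drawing a figure. The data, parameter values and feature names are placeholders.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import export_text

# Hypothetical feature names for the four participation measures.
FEATURES = ["raisedhands", "visitedResources", "announcementsViewed", "discussion"]

# Placeholder data standing in for the student participation dataset.
X, y = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)
model = RandomForestClassifier(n_estimators=20, max_depth=3, random_state=0)
model.fit(X_train, y_train)


def tree_accuracy(tree):
    """Test-set accuracy of one tree from the forest.

    Sub-trees predict class indices as floats; the labels here are already
    0, 1, 2, so casting to int makes them directly comparable.
    """
    return float(np.mean(tree.predict(X_test).astype(int) == y_test))


# Assumption: the "best" tree is the one scoring highest on the test set.
best_tree = max(model.estimators_, key=tree_accuracy)
print(export_text(best_tree, feature_names=FEATURES))
```

Each printed line is either a test on one feature or a leaf with its predicted class, so the output reads top to bottom like the flowchart described at the start of the article.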


Decision tree extracted from the random forest.

Conclusion

In this article, we looked at how a decision tree works, saw how attribute split choices are made using the Gini impurity, and learnt how several decision trees are ensembled to make a random forest. Finally, we demonstrated the random forest algorithm by training a model to classify students into academic performance categories based on their participation in class and learning processes.


Thanks for reading.


Translated from: https://towardsdatascience.com/seeing-the-forest-through-the-trees-45deafe1a6f0



