
ID3/C4.5/Gini Index

1 ID3

Select the attribute with the highest information gain.

1.1 Formula

Let p_i be the probability that an arbitrary tuple in D belongs to class C_i, estimated by |C_{i,D}|/|D|.
Expected information needed to classify a tuple in D:
$$Info(D) = -\sum\limits_{i=1}^n{p_i}log_2p_i$$
Information needed (after using A to split D into v partitions) to classify D:
$$Info_A(D) = \sum\limits_{j=1}^v\frac{|D_j|}{|D|}Info(D_j)$$
So, the information gained by branching on attribute A is:
$$Gain(A) = Info(D)-Info_A(D)$$

1.2 Example

age      income  student  credit_rating  buys_computer
<=30     high    no       fair           no
<=30     high    no       excellent      no
31...40  high    no       fair           yes
>40      medium  no       fair           yes
>40      low     yes      fair           yes
>40      low     yes      excellent      no
31...40  low     yes      excellent      yes
<=30     medium  no       fair           no
<=30     low     yes      fair           yes
>40      medium  yes      fair           yes
<=30     medium  yes      excellent      yes
31...40  medium  no       excellent      yes
31...40  high    yes      fair           yes
>40      medium  no       excellent      no

Class P: buys_computer = "yes"
Class N: buys_computer = "no"
Class P contains 9 tuples, while class N contains 5.

So, $$Info(D) = -\frac{9}{14}log_2\frac{9}{14} - \frac{5}{14}log_2\frac{5}{14} = 0.940$$

For the age partition "<=30", the number of class P tuples is 2, while the number of class N tuples is 3.
So, $$Info(D_{<=30}) = -\frac{3}{5}log_2\frac{3}{5} - \frac{2}{5}log_2\frac{2}{5} = 0.971$$
Similarly,
$$Info(D_{31...40}) = 0$$, $$Info(D_{>40}) = 0.971$$
Then, $$Info_{age}(D) = \frac{5}{14}Info(D_{<=30}) + \frac{4}{14}Info(D_{31...40}) + \frac{5}{14}Info(D_{>40}) = 0.694$$
Therefore, $$Gain(age) = Info(D) - Info_{age}(D) = 0.246$$
Similarly,
$$Gain(income) = 0.029$$
$$Gain(Student) = 0.151$$
$$Gain(credit\_rating) = 0.048$$
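The calculation above can be reproduced in a short Python sketch. The `entropy` and `info_gain` helper names are illustrative, and the data literal mirrors the 14-tuple table:

```python
import math

def entropy(labels):
    """Info(D): expected bits needed to classify a tuple drawn from D."""
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

def info_gain(rows, labels, attr):
    """Gain(A) = Info(D) - Info_A(D) for the discrete attribute at index attr."""
    n = len(labels)
    info_a = 0.0
    for v in set(r[attr] for r in rows):
        part = [c for r, c in zip(rows, labels) if r[attr] == v]
        info_a += len(part) / n * entropy(part)
    return entropy(labels) - info_a

# (age, income, student, credit_rating) for the 14 training tuples
rows = [("<=30","high","no","fair"), ("<=30","high","no","excellent"),
        ("31...40","high","no","fair"), (">40","medium","no","fair"),
        (">40","low","yes","fair"), (">40","low","yes","excellent"),
        ("31...40","low","yes","excellent"), ("<=30","medium","no","fair"),
        ("<=30","low","yes","fair"), (">40","medium","yes","fair"),
        ("<=30","medium","yes","excellent"), ("31...40","medium","no","excellent"),
        ("31...40","high","yes","fair"), (">40","medium","no","excellent")]
buys = ["no","no","yes","yes","yes","no","yes","no",
        "yes","yes","yes","yes","yes","no"]

print(f"Info(D) = {entropy(buys):.3f}")            # 0.940
for i, name in enumerate(["age", "income", "student", "credit_rating"]):
    print(f"Gain({name}) = {info_gain(rows, buys, i):.3f}")
```

Note that Gain(age) prints as 0.247 here; the 0.246 in the text comes from rounding the intermediate values to 0.940 and 0.694 before subtracting.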

1.3 Question

What if the attribute's values are continuous? How can we handle them?
1. Find the best split point for A:
   + Sort the values of A in increasing order.
   + Typically, the midpoint between each pair of adjacent values is considered as a possible split point:
     - $(a_i + a_{i+1})/2$ is the midpoint between the values of $a_i$ and $a_{i+1}$.
   + The point with the minimum expected information requirement for A is selected as the split point for A.
2. Split:
   + D1 is the set of tuples in D satisfying A ≤ split-point, and D2 is the set of tuples in D satisfying A > split-point.
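The two steps above can be sketched as follows (`best_split_point` is an illustrative name; the sample ages and labels are made up):

```python
import math

def entropy(labels):
    """Info(D) over a list of class labels."""
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

def best_split_point(values, labels):
    """Sort the continuous values, try the midpoint of each adjacent pair,
    and keep the midpoint minimizing Info_A(D) over the split D1/D2."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_info, best_mid = float("inf"), None
    for (a, _), (b, _) in zip(pairs, pairs[1:]):
        if a == b:
            continue                  # identical values give no new midpoint
        mid = (a + b) / 2             # (a_i + a_{i+1}) / 2
        d1 = [c for v, c in pairs if v <= mid]
        d2 = [c for v, c in pairs if v > mid]
        info = len(d1) / n * entropy(d1) + len(d2) / n * entropy(d2)
        if info < best_info:
            best_info, best_mid = info, mid
    return best_mid

# e.g. a continuous "age" column with hypothetical values:
print(best_split_point([20, 25, 35, 45, 50], ["no", "no", "yes", "yes", "no"]))  # 30.0
```

The split at 30.0 puts both "no" tuples of the low range into D1, which minimizes the expected information requirement among the candidate midpoints.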

2 C4.5

C4.5 is a successor of ID3.

2.1 Formula

$$SplitInfo_A(D) = -\sum\limits_{j=1}^v\frac{|D_j|}{|D|}log_2\frac{|D_j|}{|D|}$$
Then the gain ratio is
$$GainRatio(A) = Gain(A)/SplitInfo_A(D)$$
The attribute with the maximum gain ratio is selected as the splitting attribute.
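As a minimal sketch (the `split_info` name is illustrative), SplitInfo depends only on the partition sizes. For the income attribute in the earlier example, the 14 tuples split into partitions of 4 (high), 6 (medium), and 4 (low):

```python
import math

def split_info(partition_sizes):
    """SplitInfo_A(D) = -sum (|Dj|/|D|) * log2(|Dj|/|D|) over the v partitions."""
    n = sum(partition_sizes)
    return -sum(s / n * math.log2(s / n) for s in partition_sizes if s)

# income: |high| = 4, |medium| = 6, |low| = 4
si = split_info([4, 6, 4])
print(f"SplitInfo_income(D) = {si:.3f}")        # 1.557
print(f"GainRatio(income) = {0.029 / si:.3f}")  # Gain(income) = 0.029 from the ID3 example
```

Dividing by SplitInfo penalizes attributes that split the data into many small partitions, which is exactly the bias of plain information gain.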

3 Gini Index

3.1 Formula

If a data set D contains examples from n classes, the gini index gini(D) is defined as
$$gini(D) = 1 - \sum\limits_{j=1}^np_j^2$$
where p_j is the relative frequency of class j in D.
If data set D is split on attribute A into n partitions D_1, ..., D_n, then
$$gini_A(D) = \sum\limits_{i=1}^n\frac{|D_i|}{|D|}gini(D_i)$$
Reduction in impurity:
$$\Delta gini(A) = gini(D)-gini_A(D)$$
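A short sketch of these formulas (`gini` and `gini_split` are illustrative names), using the 9/5 buys_computer class counts from the ID3 example and, as an assumed binary split, income in {low, medium} (10 tuples: 7 yes, 3 no) versus {high} (4 tuples: 2 yes, 2 no), with the counts read off the table:

```python
def gini(labels):
    """gini(D) = 1 - sum_j p_j^2, with p_j the relative frequency of class j."""
    n = len(labels)
    return 1 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def gini_split(partitions):
    """gini_A(D): weighted sum of the gini indexes of the partitions induced by A."""
    n = sum(len(p) for p in partitions)
    return sum(len(p) / n * gini(p) for p in partitions)

buys = ["yes"] * 9 + ["no"] * 5        # 9 yes, 5 no overall
d1 = ["yes"] * 7 + ["no"] * 3          # income in {low, medium}
d2 = ["yes"] * 2 + ["no"] * 2          # income in {high}

print(f"gini(D) = {gini(buys):.3f}")                      # 0.459
print(f"gini_income(D) = {gini_split([d1, d2]):.3f}")     # 0.443
print(f"Delta gini = {gini(buys) - gini_split([d1, d2]):.3f}")  # 0.016
```

CART evaluates every such binary grouping of an attribute's values and picks the one with the largest reduction in impurity.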

4 Summary

ID3/C4.5 are not suitable for large training sets, because they have to sort and scan the training set repeatedly, which costs more time than other classification algorithms.
The three measures, in general, return good results, but:
1. Information gain:
   - biased towards multivalued attributes
2. Gain ratio:
   - tends to prefer unbalanced splits in which one partition is much smaller than the other
3. Gini index:
   - biased towards multivalued attributes
   - has difficulty when the number of classes is large
   - tends to favor tests that result in equal-sized partitions and purity in both partitions

5 Other Attribute Selection Measures

1. CHAID: a popular decision tree algorithm; its measure is based on the χ² test for independence
2. C-SEP: performs better than information gain and gini index in certain cases
3. G-statistic: has a close approximation to the χ² distribution
4. MDL (Minimal Description Length) principle (i.e., the simplest solution is preferred): the best tree is the one that requires the fewest number of bits to both (1) encode the tree, and (2) encode the exceptions to the tree
5. Multivariate splits (partition based on multiple variable combinations): CART finds multivariate splits based on a linear combination of attributes

Which attribute selection measure is the best? Most give good results; none is significantly superior to the others.

Author: mlhy

Created: 2015-10-06 二 14:39

Emacs 24.5.1 (Org mode 8.2.10)

Reposted from: https://www.cnblogs.com/mlhy/p/4856062.html
