【机器学习基础】xgboost系列丨xgboost建树过程分析及代码实现

前面我们通过对论文中的公式详细解读，一步步推导了XGBoost的优化目标以及建树方法。下面我们就来动手实践，拿真实的数据来手动计算，并且使用python来实现一个简易的XGBoost。

手动计算还原xgboost的过程

XGBoost的建树过程包括下面几个关键步骤。

计算分裂前样本的G,H(每个样本的g,h求和)
贪心枚举每个特征每个值做为分隔条件
计算分隔后左右节点的G_l,H_l,G_r,H_r，算出Gain
更新最大增益Gain_max，更新分隔点。
最终得到最优分隔点。
根据上述过程递归建树直到终止条件。
计算叶节点的权重w

在建完一个树后，再建下棵树时，注意G/H/Gain的计算用到的预测值要进行更新，对之前的所有树的预测值求和可以得到建下棵树时的。

这里我们使用一个简单的UCI数据集：

http://archive.ics.uci.edu/ml/datasets/Container+Crane+Controller+Data+Set
数据集的样本条数只有15条，2个特征。具体如下：

<< 左右滑动查看更多 >>

import pandas as pd
df = pd.read_csv('1.csv',index_col=0)
# df = pd.read_clipboard(index_col=0) 可以直接复制下面这个表格后使用read_clipboard读成DataFrame

对于二分类问题概率是预测值经过Sigmoid函数变换后得到的，默认预测概率为0.5，也就是默认的预测值为0。

<< 左右滑动查看更多 >>

def log_loss_obj(preds, labels):preds = 1.0 / (1.0 + np.exp(-preds))grad = preds - labelshess = preds * (1.0 - preds)return grad, hessbase_predict = np.zeros_like(df.y)
g,h = log_loss_obj(base_predict,df.y.values) # 计算每个样本的g和h
df['g'], df['h'] = g,h

df

	x1	x2	y	g	h
ID
1	1	-5	0	0.5	0.25
2	2	5	0	0.5	0.25
3	3	-2	1	-0.5	0.25
4	1	2	1	-0.5	0.25
5	2	0	1	-0.5	0.25
6	6	-5	1	-0.5	0.25
7	7	5	1	-0.5	0.25
8	6	-2	0	0.5	0.25
9	7	2	0	0.5	0.25
10	6	0	1	-0.5	0.25
11	8	-5	1	-0.5	0.25
12	9	5	1	-0.5	0.25
13	10	-2	0	0.5	0.25
14	8	2	0	0.5	0.25
15	9	0	1	-0.5	0.25

首先对特征x1上的不同的值进行枚举分裂增益。特征一总共有以下几个取值：

sorted(df.x1.unique())
# [1, 2, 3, 6, 7, 8, 9, 10]

对x1的取值进行排序后，最小值为1，显然没有样本的x1特征值<1，以x1划分相当于没有进行划分，因此从x1=2进行划分。

<< 左右滑动查看更多 >>

def split_data(df, split_feature, split_value):left_instance = df[df[split_feature] < split_value]right_instance = df[df[split_feature] >= split_value]return left_instance, right_instanceleft_instance, right_instance = split_data(df,'x1',2)
left_instance.index.tolist(),right_instance.index.tolist()
#  ([1, 4], [2, 3, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15])

可以看到样本1,4被划分到了左侧，其余样本划分到了右侧。对划分后的数据计算G/H，并求出分裂后的增益Gain为0.055727554179566596。

<< 左右滑动查看更多 >>

reg_lambda = 1
G,H = df.g.sum(), df.h.sum() # 分裂前全部样本的G/H
for thresh_values in sorted(df.x1.unique())[1:]:   G_left, H_left = left_instance[['g', 'h']].sum() # 分裂后的G_l,H_lG_right, H_right = right_instance[['g', 'h']].sum() # 分裂后的G_r,H_rGain = G_left**2/(H_left+reg_lambda)+G_right**2 / \(H_right+reg_lambda)-G**2/(H+reg_lambda) # 分裂后的增益计算

对x1每个取值进行划分，得到下面不同划分下的增益值。

	G_left	H_left	G_right	H_right	Gain
2	0.0	0.50	-1.5	3.25	0.055728
3	0.0	1.00	-1.5	2.75	0.126316
6	-0.5	1.25	-1.0	2.50	-0.076859
7	-1.0	2.00	-0.5	1.75	-0.049442
8	-1.0	2.50	-0.5	1.25	-0.076859
9	-1.0	3.00	-0.5	0.75	-0.080827
10	-2.0	3.50	0.5	0.25	0.615205

同样，对于特征x2每个取值的分裂情况：

	G_left	H_left	G_right	H_right	Gain
-2	-0.5	0.75	-1.0	3.00	-0.080827
0	0.0	1.50	-1.5	2.25	0.218623
2	-1.5	2.25	0.0	1.50	0.218623
5	-1.0	3.00	-0.5	0.75	-0.080827

我们可以看到当最大的增益为特征x1=10时，增益为0.615205，因此将x1<10作为第一个分裂条件。

接下来开始进行第2个分裂条件的选择：

<< 左右滑动查看更多 >>

use_instance,_ = split_data(df,'x1',10)

G,H = use_instance.g.sum(), use_instance.h.sum()
for thresh_values in sorted(use_instance.x1.unique())[1:]:   left_instance, right_instance = split_data(use_instance,'x1',thresh_values)G_left, H_left = left_instance[['g', 'h']].sum()G_right, H_right = right_instance[['g', 'h']].sum()Gain = G_left**2/(H_left+reg_lambda)+G_right**2 / \(H_right+reg_lambda)-G**2/(H+reg_lambda)

对于特征x1：

	G_left	H_left	G_right	H_right	Gain
2	0.0	0.50	-2.0	3.00	0.111111
3	0.0	1.00	-2.0	2.50	0.253968
6	-0.5	1.25	-1.5	2.25	-0.085470
7	-1.0	2.00	-1.0	1.50	-0.155556
8	-1.0	2.50	-1.0	1.00	-0.103175
9	-1.0	3.00	-1.0	0.50	0.027778

对于特征x2：

	G_left	H_left	G_right	H_right	Gain
-2	-0.5	0.75	-1.5	2.75	-0.146032
0	-0.5	1.25	-1.5	2.25	-0.085470
2	-2.0	2.00	0.0	1.50	0.444444
5	-1.5	2.75	-0.5	0.75	-0.146032

因此最优的划分为特征x2<2。

通过不断的对上述过程迭代，即可递归地建出第一棵树。

简易版XGBoost实现

我们首先使用XGBoost的开源实现来构建模型，和下面我们的实现做对比。

<< 左右滑动查看更多 >>

import xgboost as xgb
model = xgb.XGBClassifier(n_estimators=2,max_depth=3,learning_rate=0.1,random_state=1,reg_lambda=1,gamma=0,objective='binary:logistic',min_child_weight=0,)
model.fit(df[['x1','x2']],df.y)
xgb.to_graphviz(model,num_trees=0)

可以看到第一棵树的结构如下：

第二棵树的结构如下：

xgb.to_graphviz(model,num_trees=1)

下面我们来用Python实现自己的简易版XGBoost。
首先创建节点类，通过节点的递归构建出一棵树。

<< 左右滑动查看更多 >>

from graphviz import Digraphclass Node(object):"""结点leaf_value ： 记录叶子结点值split_feature ：分裂节点对应的特征名split_value ： 分裂节点对应的特征的值left ： 左子树right ： 右子树"""def __init__(self, leaf_value=None, split_feature=None, split_value=None, left=None, right=None):self.leaf_value = leaf_valueself.split_feature = split_featureself.split_value = split_valueself.left = leftself.right = rightdef show(self):print(f'weight: {self.leaf_value}, split_feature: {self.split_feature}, split_value: {self.split_value}.')def visualize_tree(self):"""使用graphviz递归绘制树"""def add_nodes_edges(self, dot=None):if dot is None:dot = Digraph()dot.node(name=str(self),label=f'{self.split_feature}<{self.split_value}')# Add nodesif self.left:if self.left.leaf_value:dot.node(name=str(self.left),label=f'leaf={self.left.leaf_value:.10f}')else:dot.node(name=str(self.left),label=f'{self.left.split_feature}<{self.left.split_value}')dot.edge(str(self), str(self.left))dot = add_nodes_edges(self.left, dot=dot)if self.right:if self.right.leaf_value:dot.node(name=str(self.right),label=f'leaf={self.right.leaf_value:.10f}')else:dot.node(name=str(self.right),label=f'{self.right.split_feature}<{self.right.split_value}')dot.edge(str(self), str(self.right))dot = add_nodes_edges(self.right, dot=dot)return dotdot = add_nodes_edges(self)return dot

注意这里我们写了一个visualize_tree方法，使用graphviz来绘制建好的树的结构，在XGBoost的开源实现中也是使用类似的方案。该方法不是必须的，但是却对我们了解建树过程有着很好的帮助。同时我们使用show方法将节点的信息也字符形式显示出来。

下面来看一下损失函数和建树过程的实现：

<< 左右滑动查看更多 >>

# 树的结构参数
reg_lambda = 1 # 叶节点权重L2正则系数
min_samples_split = 1 # 分裂所需的最小样本个数
max_depth = 3 # 树的深度# 建树过程参数
learning_rate = 0.1 # 学习率
n_estimators = 2 # 树的个数# log损失函数
def log_loss_obj(preds, labels):# preds是建该树之前模型的输出，对于二分类问题需要的是概率，因此将该值经过Sigmoid转换probs = 1.0 / (1.0 + np.exp(-preds))grad = probs - labelshess = probs * (1.0 - probs)return grad, hessdef build_tree(df, feature_names, depth=1):df = df.copy()df['g'], df['h'] = log_loss_obj(df.y_pred, df.y)G, H = df[['g', 'h']].sum()Gain_max = float('-inf')# 终止条件 当前节点个数小于分裂所需最小样本个数，深度大于max_depth，叶节点只有一类样本无需再分if df.shape[0] > min_samples_split and depth <= max_depth and df.y.nunique() > 1:for feature in feature_names: # 遍历每个特征thresholds = sorted(set(df[feature])) # 特征取值排序for thresh_value in thresholds[1:]: # 遍历每个取值left_instance = df[df[feature] < thresh_value] # 划分到左右节点的样本right_instance = df[df[feature] >= thresh_value]G_left, H_left = left_instance[['g', 'h']].sum()G_right, H_right = right_instance[['g', 'h']].sum()Gain = G_left**2/(H_left+reg_lambda)+G_right**2 / \(H_right+reg_lambda)-G**2/(H+reg_lambda) # 评价划分的增益效果if Gain >= Gain_max:Gain_max = Gainsplit_feature = feature # 最大增益对应的分裂特征split_value = thresh_value # 最大增益对应的分裂值left_data = left_instance # 最大增益对应的分裂后左节点样本right_data = right_instance # 最大增益对应的分裂后右节点样本left = build_tree(left_data, feature_names,  depth+1) # 递归建左子树right = build_tree(right_data, feature_names, depth+1)# 递归建右子树return Node(split_feature=split_feature, split_value=split_value, left=left, right=right) # 返回分裂节点return Node(leaf_value=-G/(H+reg_lambda)*learning_rate) # 分裂终止，返回叶节点权重

对于单棵树的预测，我们使用predict函数，根据树的分裂条件以及当前数据的信息来确认每个样本被分到当前这棵树的哪个叶节点，得到对应叶节点的权重。

<< 左右滑动查看更多 >>

def predict(x, tree):# 递归每个分裂点直到样本对应的叶节点# 终止条件：叶节点if tree.leaf_value is not None:return tree.leaf_valueif x[tree.split_feature] < tree.split_value:return predict(x, tree.left)else:return predict(x, tree.right)

根据已建好的树来更新预测结果，进而确定新建树的结构，不断迭代最终得到全部的树：

<< 左右滑动查看更多 >>

trees = []
y_pred = 0  # 初始预测值
for i in range(n_estimators):df['y_pred'] = y_predtree = build_tree(df, feature_names=['x1', 'x2'])data_weight = df[['x1', 'x2']].apply(predict, tree=tree, axis=1)y_pred += data_weighttrees.append(tree)

对于已建好的树，使用我们前面定义的visualize_tree函数，可以方便的看到树的结构。

第一棵数的结构：

trees[0].visualize_tree()

第二棵树的结构：

trees[1].visualize_tree()

可以看到，我们自己实现的XGBoost与开源的XGBoost得到的树的结构是一致的。

封装

把前面的各部分代码封装成类：

<< 左右滑动查看更多 >>

import numpy as np
import pandas as pd
import xgboost as xgb
from graphviz import Digraphclass Node(object):"""结点leaf_value ： 记录叶子结点值split_feature ：特征isplit_value ： 特征i的值left ： 左子树right ： 右子树"""def __init__(self, leaf_value=None, split_feature=None, split_value=None, left=None, right=None):self.leaf_value = leaf_valueself.split_feature = split_featureself.split_value = split_valueself.left = leftself.right = rightdef show(self):print(f'weight: {self.leaf_value}, split_feature: {self.split_feature}, split_value: {self.split_value}.')def visualize_tree(self):"""递归查找绘制树"""def add_nodes_edges(self, dot=None):if dot is None:dot = Digraph()dot.node(name=str(self),label=f'{self.split_feature}<{self.split_value}')# Add nodesif self.left:if self.left.leaf_value:dot.node(name=str(self.left),label=f'leaf={self.left.leaf_value:.10f}')else:dot.node(name=str(self.left),label=f'{self.left.split_feature}<{self.left.split_value}')dot.edge(str(self), str(self.left))dot = add_nodes_edges(self.left, dot=dot)if self.right:if self.right.leaf_value:dot.node(name=str(self.right),label=f'leaf={self.right.leaf_value:.10f}')else:dot.node(name=str(self.right),label=f'{self.right.split_feature}<{self.right.split_value}')dot.edge(str(self), str(self.right))dot = add_nodes_edges(self.right, dot=dot)return dotdot = add_nodes_edges(self)return dotdef log_loss_obj(preds, labels):preds = 1.0 / (1.0 + np.exp(-preds))grad = preds - labelshess = preds * (1.0 - preds)return grad, hessdef mse_obj(preds, labels):grad = labels-y_predhess = np.ones_like(labels)return grad, hessclass XGB:def __init__(self, n_estimators=2, learning_rate=0.1, max_depth=3, min_samples_split=0, reg_lambda=1, base_score=0.5, loss=log_loss_obj):# 学习控制参数self.n_estimators = n_estimatorsself.learning_rate = learning_rateself.base_score = base_score# 树参数self.max_depth = max_depthself.min_samples_split = min_samples_splitself.loss = lossself.reg_lambda = reg_lambdaself.trees = []self.feature_names = Nonedef sigmoid_array(self, x):return 1 / (1 + np.exp(-x))def _predict(self, x, tree):# 循环终止条件：叶节点if tree.leaf_value is not None:return tree.leaf_valueif x[tree.split_feature] < tree.split_value:return self._predict(x, tree.left)else:return self._predict(x, tree.right)def _build_tree(self, df, depth=1):df = df.copy()df['g'], df['h'] = self.loss(df.y_pred, df.y)G, H = df[['g', 'h']].sum()Gain_max = float('-inf')if df.shape[0] > self.min_samples_split and depth <= self.max_depth and df.y.nunique() > 1:for feature in self.feature_names:thresholds = sorted(set(df[feature]))for thresh_value in thresholds[1:]:left_instance = df[df[feature] < thresh_value]right_instance = df[df[feature] >= thresh_value]G_left, H_left = left_instance[['g', 'h']].sum()G_right, H_right = right_instance[['g', 'h']].sum()Gain = G_left**2/(H_left+self.reg_lambda)+G_right**2 / \(H_right+self.reg_lambda)-G**2/(H+self.reg_lambda)if Gain >= Gain_max:Gain_max = Gainsplit_feature = featuresplit_value = thresh_valueleft_data = left_instanceright_data = right_instance# print(feature,'Gain:',Gain,'G-Left',G_left,'H-left',H_left,'G-Right',G_right,'H-right',H_right,'----',thresh_value)# print(Gain_max,split_feature,split_value)left = self._build_tree(left_data,  depth+1)right = self._build_tree(right_data,  depth+1)return Node(split_feature=split_feature, split_value=split_value, left=left, right=right)return Node(leaf_value=-G/(H+self.reg_lambda)*self.learning_rate)def fit(self, X, y):y_pred = -np.log((1/self.base_score)-1)df = pd.DataFrame(X)df['y'] = yself.feature_names = df.columns.tolist()[:-1]for i in range(self.n_estimators):df['y_pred'] = y_predtree = self._build_tree(df)data_weight = df[['x1', 'x2']].apply(self._predict, tree=tree, axis=1)y_pred += data_weightself.trees.append(tree)def predict(self, X):df = pd.DataFrame(X)y_pred = -np.log((1/self.base_score)-1)for tree in self.trees:df['y_pred'] = y_preddata_weight = df[['x1', 'x2']].apply(self._predict, tree=tree, axis=1)y_pred += data_weightreturn self.sigmoid_array(y_pred)def __repr__(self):return 'XGBClassifier('+', '.join(f'{k}={v}' for k, v in self.__dict__.items() if not k.startswith('_'))+')'

使用前面同样的方法，调用我们自己封装好的代码，可以看到很简洁的实现了跟官方开源实现类似的效果。

df = pd.read_csv('1.csv',index_col=0)
model = XGB()
model.fit(df[['x1', 'x2']], df.y)
model.predict(df[['x1', 'x2']])
model.trees[0].visualize_tree()

当然这里只是一个DEMO，来帮助我们更好的理解论文中的公式以及XGBoost的实现过程，更多的细节优化可以参考官方实现：https://github.com/dmlc/xgboost

参考

https://xgboost.readthedocs.io/en/latest/tutorials/model.html

https://blog.csdn.net/qq_22238533/article/details/79477547

https://github.com/dmlc/xgboost/blob/master/demo/guide-python/custom_objective.py

https://github.com/lxmly/machine-learning/blob/master/decision_tree.py

相关阅读

往期精彩回顾适合初学者入门人工智能的路线及资料下载机器学习及深度学习笔记等资料打印机器学习在线手册深度学习笔记专辑《统计学习方法》的代码复现专辑
AI基础下载机器学习的数学基础专辑
获取本站知识星球优惠券，复制链接直接打开：
https://t.zsxq.com/qFiUFMV
本站qq群704220115。加入微信群请扫码：

【机器学习基础】xgboost系列丨xgboost建树过程分析及代码实现相关推荐

【机器学习】xgboost系列丨xgboost原理及公式推导
建树过程中如何选择使用哪个特征哪个值来进行分裂? 什么时候停止分裂? 如何计算叶节点的权值? 建完了第一棵树之后如何建第二棵树? 为防止过拟合,XGB做了哪些改进树的集成本文主要针对xgboost ...
【机器学习基础】终于把XGBoost总结写出来了！
作者:王茂霖,华中科技大学,Datawhale成员内容概括 XGBoost模型及调参总结 XGBoost原理 XGBoost优势总结 XGBoost参数详解 XGBoost快速使用 XGBoost调 ...
【机器学习基础】XGBoost、LightGBM与CatBoost算法对比与调参
机器学习 Author:louwill Machine Learning Lab 虽然现在深度学习大行其道,但以XGBoost.LightGBM和CatBoost为代表的Boosting算法仍有其广阔 ...
树模型系列之XGBoost算法
文章目录树模型系列之XGBoost算法概要原理损失函数子树 xgboost算法流程总结关键问题缺失值处理(稀疏问题的分裂点查找 Sparsity-aware Split Finding) ...
【基础算法】 GBDT/XGBoost 常见问题
原文作者:Jack Stack 原文:https://zhuanlan.zhihu.com/p/81368182 在非深度学习的机器学习模型中,基于GBDT算法的XGBoost.lightgbm等有着 ...
机器学习-集成学习-提升树-Xgboost
内容整理自:ML-NLP/Machine Learning/3.3 XGBoost at master · NLP-LOVE/ML-NLP · GitHub XGBoost--机器学习(理论+图解+安 ...
[机器学习 ] RandomForest、GBDT、XGBoost、lightGBM 原理与区别
目录随机森林 -- RandomForest GBDT (Gradient Boosting Decision Tree) XGBoost lightGBM 一知识点介绍 RF,GBDT,XGBo ...
资源 | Intel发布AI免费系列课程3部曲：机器学习基础、深度学习基础以及TensorFlow基础
翻译 | AI科技大本营(公众号ID:rgznai100) 校对 | 成龙编辑 | 明明 Intel于近期发布了三门AI系列的免费课程,分别是关于机器学习基础.深度学习基础.TensorFlow基础 ...
机器学习总结（17）-XGBoost
文章目录 lecture17:XGBoost(eXtreme Gradient Boosting) 目录 1. XGBoost的基本信息 2. XGBoost与GBDT的异同点 3. XGBoost的 ...

【机器学习基础】xgboost系列丨xgboost建树过程分析及代码实现

【机器学习基础】xgboost系列丨xgboost建树过程分析及代码实现相关推荐

最新文章

热门文章