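XGBoost Principles and Using XGBoost in Python
For the principles, see: http://www.myexception.cn/operating-system/2084839.html
Translation reposted from: http://blog.csdn.net/zc02051126/article/details/46771793
Using XGBoost in Python
This section introduces the XGBoost Python module and covers the following:
* Building and importing the Python module
* Data interface
* Parameter settings
* Training the model
* Early stopping
* Prediction
A walkthrough Python example for the UCI Mushroom dataset is provided.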
Installation
First install the C++ version of XGBoost, then go to the wrappers folder under the root of the source tree and run the following script to install the Python module:
<code class="language-shell">python setup.py install</code>
After installation, import the XGBoost Python module as follows:
<code class="language-python">import xgboost as xgb</code>
Data interface
XGBoost can load text data in libsvm format, two-dimensional NumPy arrays, and XGBoost binary buffer files. The loaded data is stored in a DMatrix object.
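For readers who have not seen the libsvm text format: each line holds a label followed by `index:value` pairs for the nonzero features. The sketch below writes and re-reads such a file using only the standard library. The toy rows, the helper names `to_libsvm_line` and `parse_libsvm_line`, and the choice of 0-based feature indices are assumptions made for illustration, not part of XGBoost.

```python
import os
import tempfile

# Hypothetical toy data: (label, {feature_index: value}) pairs.
rows = [
    (1, {0: 0.5, 3: 1.2}),
    (0, {1: 2.0}),
    (1, {2: 0.7, 3: 0.1}),
]


def to_libsvm_line(label, features):
    """Format one sample as '<label> <index>:<value> ...' with sorted indices."""
    pairs = " ".join(f"{i}:{v}" for i, v in sorted(features.items()))
    return f"{label} {pairs}"


def parse_libsvm_line(line):
    """Inverse of to_libsvm_line: return (label, {index: value})."""
    label, *pairs = line.split()
    features = {}
    for pair in pairs:
        index, value = pair.split(":")
        features[int(index)] = float(value)
    return int(label), features


# Write the rows to a temporary file, then read them back.
path = os.path.join(tempfile.mkdtemp(), "train.svm.txt")
with open(path, "w") as f:
    for label, features in rows:
        f.write(to_libsvm_line(label, features) + "\n")

with open(path) as f:
    parsed = [parse_libsvm_line(line) for line in f]
```

A file written this way is the kind of text input that a file path passed to `xgb.DMatrix` refers to.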
- To load a libsvm-format text file or a binary buffer file, use:
<code class="language-python">dtrain = xgb.DMatrix('train.svm.txt')
dtest = xgb.DMatrix('test.svm.buffer')</code>
- To load a NumPy array into a DMatrix object, use:
<code class="language-python">import numpy as np

data = np.random.rand(5, 10)  # 5 entities, each contains 10 features
label = np.random.randint(2, size=5)  # binary target
dtrain = xgb.DMatrix(data, label=label)</code>
- To convert scipy.sparse data into a DMatrix, use:
<code class="language-python">import scipy.sparse

# dat holds the nonzero values; row and col hold their row and column indices
csr = scipy.sparse.csr_matrix((dat, (row, col)))
dtrain = xgb.DMatrix(csr)</code>
- To save a DMatrix in XGBoost's binary format, which speeds up loading the next time, use:
<code class="language-python">dtrain = xgb.DMatrix('train.svm.txt')
dtrain.save_binary("train.buffer")</code>
- Missing values in a DMatrix can be handled as follows:
<code class="language-python">dtrain = xgb.DMatrix(data, label=label, missing=-999.0)</code>
- To assign a weight to each sample, use:
<code class="language-python">w = np.random.rand(5, 1)
dtrain = xgb.DMatrix(data, label=label, missing=-999.0, weight=w)</code>
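To make the `missing` argument more concrete: the value passed there is a sentinel that marks entries in the input as absent. The NumPy-only sketch below shows which cells a sentinel of -999.0 would flag. The array contents and the names `MISSING`, `missing_mask` and `cleaned` are hypothetical, and replacing the sentinel with NaN is only a way to visualise the idea, not a description of XGBoost's internals.

```python
import numpy as np

MISSING = -999.0  # the sentinel value used in the examples above

# Hypothetical 2 x 3 feature matrix with two absent entries.
data = np.array([
    [1.0, MISSING, 3.0],
    [MISSING, 5.0, 6.0],
])

# Boolean mask of the cells that carry the sentinel.
missing_mask = data == MISSING

# The same matrix with the sentinel replaced by NaN.
cleaned = np.where(missing_mask, np.nan, data)

# Number of missing entries per feature column.
missing_per_column = missing_mask.sum(axis=0)
```

Checking `missing_per_column` before building the DMatrix is a cheap way to confirm that the sentinel in the data matches the value passed as `missing`.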
Parameter settings
XGBoost stores parameters as key-value pairs, for example:
* Booster (base learner) parameters
<code class="language-python">param = {'bst:max_depth': 2, 'bst:eta': 1, 'silent': 1, 'objective': 'binary:logistic'}
param['nthread'] = 4
plst = list(param.items())  # list() so that += also works on Python 3
plst += [('eval_metric', 'auc')]  # Multiple evals can be handled in this way
plst += [('eval_metric', 'ams@0')]</code>
- You can also define validation sets to monitor the model's performance:
<code class="language-python">evallist = [(dtest, 'eval'), (dtrain, 'train')]</code>
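The parameter example above turns the dictionary into a list of pairs before adding the evaluation metrics. A plausible reason is that a dict can hold only one value per key, while a list of pairs can repeat `eval_metric`. The plain-Python sketch below illustrates that difference; the variable names `metrics` and `as_dict` are introduced here for illustration.

```python
# Same keys and values as the parameter example in the text.
param = {'bst:max_depth': 2, 'bst:eta': 1, 'silent': 1,
         'objective': 'binary:logistic'}
param['nthread'] = 4

# A list of (key, value) pairs may contain the same key more than once.
plst = list(param.items())
plst += [('eval_metric', 'auc')]
plst += [('eval_metric', 'ams@0')]

# Both metrics survive in the list form.
metrics = [value for key, value in plst if key == 'eval_metric']

# Converting back to a dict keeps only the last value for a repeated key.
as_dict = dict(plst)
```

Here `metrics` keeps both entries, while `as_dict['eval_metric']` holds only the second one.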
Training the model
With the parameter list and the data ready, you can train the model.
* Training
<code class="language-python">num_round = 10
bst = xgb.train(plst, dtrain, num_round, evallist)</code>
- Saving the model
After training, you can save the model; you can also inspect its internal structure.
<code class="language-python">bst.save_model('0001.model')</code>
- Dump Model and Feature Map
You can dump the model to a text file and inspect its structure.
<code class="language-python"># dump model
bst.dump_model('dump.raw.txt')
# dump model with feature map
bst.dump_model('dump.raw.txt', 'featmap.txt')</code>
- Loading the model
A saved model can be loaded as follows:
<code class="language-python">bst = xgb.Booster({'nthread': 4})  # init model
bst.load_model("model.bin")  # load model</code>
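Early stopping
If you have evaluation data, you can stop training early, which helps find the optimal number of boosting rounds. Early stopping requires at least one evaluation set in evals. If there is more than one, the last one is used.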
train(..., evals=evals, early_stopping_rounds=10)
The model will train until the validation score stops improving. The validation error needs to decrease at least once every early_stopping_rounds rounds for training to continue.
If early stopping occurs, the model will have two additional fields: bst.best_score and bst.best_iteration. Note that train() will return a model from the last iteration, not the best one.
This works with both metrics to minimize (RMSE, log loss, etc.) and to maximize (MAP, NDCG, AUC).
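The stopping rule described above can be sketched in plain Python. This is a simplified illustration of the rule as stated in the text, not XGBoost's actual implementation; the function name `early_stop`, its `maximize` flag, and the score sequences are hypothetical.

```python
def early_stop(scores, early_stopping_rounds, maximize=False):
    """Walk through per-round validation scores and stop once the score
    has not improved for `early_stopping_rounds` consecutive rounds.

    Returns (best_score, best_iteration, last_iteration).
    """
    best_score = None
    best_iteration = None
    last_iteration = None
    for i, score in enumerate(scores):
        last_iteration = i
        if best_score is None:
            improved = True
        elif maximize:
            improved = score > best_score
        else:
            improved = score < best_score
        if improved:
            best_score, best_iteration = score, i
        elif i - best_iteration >= early_stopping_rounds:
            break  # no improvement for long enough, stop here
    return best_score, best_iteration, last_iteration


# An error metric (lower is better): the best round is 2, training stops at 5.
error_result = early_stop([0.50, 0.40, 0.35, 0.36, 0.37, 0.38, 0.30], 3)

# A metric to maximize, such as AUC: the best round is 1, training stops at 3.
auc_result = early_stop([0.60, 0.70, 0.65, 0.66], 2, maximize=True)
```

In the first call the last round (5) differs from the best round (2), which mirrors the note that the returned model comes from the last iteration rather than the best one.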
Prediction
After training or loading a model and preparing the data, you can start making predictions.
<code class="language-python">data = np.random.rand(7, 10)  # 7 entities, each contains 10 features
dtest = xgb.DMatrix(data, missing=-999.0)
ypred = bst.predict(dtest)</code>
If early stopping is enabled during training, you can predict with the best iteration.
<code class="language-python">ypred = bst.predict(dtest, ntree_limit=bst.best_iteration)</code>
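With the `binary:logistic` objective used in the parameter example, the predictions are probabilities rather than class labels. A minimal NumPy sketch of turning them into labels and an error rate follows; the probability values, the true labels, and the 0.5 threshold are assumptions made for illustration.

```python
import numpy as np

# Hypothetical predicted probabilities and true labels for four samples.
ypred = np.array([0.12, 0.87, 0.50, 0.49])
y_true = np.array([0, 1, 1, 0])

# Predict class 1 when the probability is strictly above the threshold.
threshold = 0.5
labels = (ypred > threshold).astype(int)

# Fraction of samples whose predicted label differs from the true label.
error_rate = float(np.mean(labels != y_true))
```

The threshold is a free choice; 0.5 is only a common default for balanced problems.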