Run in a Kaggle Notebook: with a training set of ~590k rows, training takes less than 40 s.

The data is the IEEE-CIS Fraud Detection competition dataset.

The code is as follows:

# # About this kernel
#
# Before I get started, I just wanted to say: huge props to Inversion! The official starter kernel is **AWESOME**; it's so simple, clean, straightforward, and pragmatic. It certainly saved me a lot of time wrangling with data, so that I can directly start tuning my models (real data scientists will call me lazy, but hey I'm an engineer I just want my stuff to work).
#
# I noticed two tiny problems with it:
# * It takes a lot of RAM to run, which means that if you are using a GPU, it might crash as you try to fill missing values.
# * It takes a while to run (roughly 3500 seconds, which is close to an hour; again, I'm a lazy guy and I don't like waiting).
#
# With this kernel, I bring some small changes:
# * Decrease RAM usage, so that it won't crash when you change it to GPU. I simply changed the point at which unused variables are deleted.
# * Decrease **running time from ~3500s to ~40s** (yes, that's almost 90x faster), at the cost of a slight decrease in score. This is done by adding a single argument.
#
# Again, my changes are super minimal (cause Inversion's kernel was already so awesome), but I hope it will save you some time and trouble (so that you can start working on cool stuff).
#
#
# ### Changelog
#
# **V4**
# * Change some wording
# * Prints XGBoost version
# * Add random state to XGB for reproducibility

# %% [code]
import os
import numpy as np
import pandas as pd

# %% [markdown]
# # Efficient Preprocessing
#
# This preprocessing method is more careful with RAM usage, which avoids crashing the kernel when you switch from CPU to GPU. Otherwise, it is exactly the same procedure as the official starter.

# %% [code]
!pip install datatable

# %% [code]
%%time
# (590540, 433)
# (506691, 432)
# CPU times: user 40.4 s, sys: 11.5 s, total: 51.9 s
# Wall time: 52 s
# import datatable as dt
# train_transaction = dt.fread('../input/train_transaction.csv').to_pandas()
# test_transaction = dt.fread('../input/test_transaction.csv', index_col='TransactionID').to_pandas()
# train_identity = dt.fread('../input/train_identity.csv', index_col='TransactionID').to_pandas()
# test_identity  = dt.fread('../input/test_identity.csv', index_col='TransactionID').to_pandas()
import pandas as pd
train_transaction = pd.read_csv('../input/ieee-fraud-detection/train_transaction.csv', index_col='TransactionID')
test_transaction = pd.read_csv('../input/ieee-fraud-detection/test_transaction.csv', index_col='TransactionID')
train_identity = pd.read_csv('../input/ieee-fraud-detection/train_identity.csv', index_col='TransactionID')
test_identity  = pd.read_csv('../input/ieee-fraud-detection/test_identity.csv', index_col='TransactionID')
sample_submission = pd.read_csv('../input/ieee-fraud-detection/sample_submission.csv', index_col='TransactionID')
train = train_transaction.merge(train_identity, how='left', left_index=True, right_index=True)
test = test_transaction.merge(test_identity, how='left', left_index=True, right_index=True)
# TODO: try datatable to speed up this loading step (see the sketch in the next cell)

# %% [code]
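# A minimal sketch (not part of the original kernel) of the datatable-based loading
# hinted at in the TODO above, assuming the `datatable` package installed earlier.
# Note that dt.fread() has no index_col argument, so the index is set after the
# conversion to pandas; the helper name `fread_indexed` is hypothetical.
import datatable as dt

def fread_indexed(path, index='TransactionID'):
    # Multi-threaded CSV parsing with datatable, then hand off to pandas.
    return dt.fread(path).to_pandas().set_index(index)

# Usage, e.g.:
# train_transaction = fread_indexed('../input/ieee-fraud-detection/train_transaction.csv')

# %% [code]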
import numpy as np
train['Transaction_hour'] = np.floor(train['TransactionDT'] / 3600) % 24
test['Transaction_hour'] = np.floor(test['TransactionDT'] / 3600) % 24
# train_one_column = np.floor(train['TransactionDT'] * 1.0 % 86400) / 3600  # pandas DataFrame -> Series
# test_one_column  = np.floor(test['TransactionDT'] * 1.0 % 86400) / 3600   # pandas DataFrame -> Series
del train["TransactionDT"]
del test["TransactionDT"]
# train["TransactionDT"] = train_one_column  # append as a new column at the end
# test["TransactionDT"] = test_one_column    # append as a new column at the end
train.head(10)
# print(train_one_column)
# df = pd.DataFrame(one_column)
# df.columns.name = 'TransactionDF_hour'
# # Series -> pandas DataFrame
# one_column_dict = {col: df[col].tolist() for col in df.columns}  # pandas DataFrame -> dict
# one_column = dt.Frame(one_column_dict)  # dict -> datatable Frame

# %% [code]
print(train.shape)
print(test.shape)

y_train = train['isFraud'].copy()
del train_transaction, train_identity, test_transaction, test_identity

# Drop target, fill in NaNs
X_train = train.drop('isFraud', axis=1)
X_test = test.copy()
del train, test

X_train = X_train.fillna(-999)
X_test = X_test.fillna(-999)

# %% [code]
%%time
# CPU times: user 49.9 s, sys: 380 ms, total: 50.3 s
# Wall time: 50.8 s

# Label Encoding
from sklearn import preprocessing

for f in X_train.columns:
    if X_train[f].dtype == 'object' or X_test[f].dtype == 'object':
        lbl = preprocessing.LabelEncoder()
        lbl.fit(list(X_train[f].values) + list(X_test[f].values))
        X_train[f] = lbl.transform(list(X_train[f].values))
        X_test[f] = lbl.transform(list(X_test[f].values))

# %% [code]
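# An untested alternative sketch (my suggestion, not part of the original kernel):
# the same shared train/test encoding can be done with pandas categoricals, which
# avoids building Python lists column by column and is usually faster here.
obj_cols = [c for c in X_train.columns
            if X_train[c].dtype == 'object' or X_test[c].dtype == 'object']
for c in obj_cols:
    # Fit the category set on train + test so both get consistent integer codes.
    cats = pd.Categorical(pd.concat([X_train[c], X_test[c]]).astype(str)).categories
    X_train[c] = pd.Categorical(X_train[c].astype(str), categories=cats).codes
    X_test[c] = pd.Categorical(X_test[c].astype(str), categories=cats).codes

# %% [code]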
import gc
gc.collect()

# %% [markdown]
# # Training
#
# To activate GPU usage, simply use `tree_method='gpu_hist'` (it took me an hour to figure out; I wish the XGBoost documentation were clearer about that).

# %% [code]
import xgboost as xgb
clf = xgb.XGBClassifier(
    n_estimators=500,
    max_depth=9,
    learning_rate=0.05,
    subsample=0.9,
    colsample_bytree=0.9,
    missing=-999,
    random_state=2019,
    tree_method='gpu_hist'  # THE MAGICAL PARAMETER
)

# %% [markdown]
# gpu_hist: 0.9355
# gpu_exact: 0.9331

# %% [code]
%time clf.fit(X_train, y_train)

# %% [markdown]
# Some of you must be wondering how we were able to decrease the fitting time by that much. The reason is not only that we are running on a GPU, but also that we are computing a histogram-based approximation of the exact (greedy) split-finding algorithm.
#
# This hurts your score slightly, but in exchange it is much faster.
#
# So why am I not using the CPU with `tree_method='hist'`?
# If you try it yourself, you'll find it takes ~7 min, which is still far from the GPU fitting time.
# Similarly, `tree_method='gpu_exact'` takes ~4 min,
# but it likely yields better accuracy than `gpu_hist` or `hist`.
#
# The [docs on parameters](https://xgboost.readthedocs.io/en/latest/parameter.html) have a section on `tree_method` that goes over the details of each option.
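
# %% [code]
# An optional sketch (not in the original kernel) of how one might reproduce the
# tree_method comparison above on a hold-out split before fitting on the full data.
# The 20% hold-out fraction and the helper name `eval_tree_method` are my own choices;
# the rows of train_transaction.csv are ordered by TransactionDT, so the last rows
# serve as a rough time-based validation set.
from sklearn.metrics import roc_auc_score

def eval_tree_method(method, valid_frac=0.2):
    # Same hyperparameters as the classifier above, only tree_method varies.
    cut = int(len(X_train) * (1 - valid_frac))
    model = xgb.XGBClassifier(n_estimators=500, max_depth=9, learning_rate=0.05,
                              subsample=0.9, colsample_bytree=0.9, missing=-999,
                              random_state=2019, tree_method=method)
    model.fit(X_train.iloc[:cut], y_train.iloc[:cut])
    preds = model.predict_proba(X_train.iloc[cut:])[:, 1]
    return roc_auc_score(y_train.iloc[cut:], preds)

# Example: compare eval_tree_method('gpu_hist') with eval_tree_method('hist')

# %% [code]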
sample_submission['isFraud'] = clf.predict_proba(X_test)[:,1]
sample_submission.to_csv('simple_xgboost-gpu-exact.csv')
