Run in a Kaggle Notebook: with a training set of ~590k rows, training takes less than 40 s.

The data is the IEEE-CIS Fraud Detection competition dataset.

The code is as follows:

# # About this kernel
#
# Before I get started, I just wanted to say: huge props to Inversion! The official starter kernel is **AWESOME**; it's so simple, clean, straightforward, and pragmatic. It certainly saved me a lot of time wrangling with data, so that I can directly start tuning my models (real data scientists will call me lazy, but hey I'm an engineer I just want my stuff to work).
#
# I noticed two tiny problems with it:
# * It takes a lot of RAM to run, which means that if you are using a GPU, it might crash as you try to fill missing values.
# * It takes a while to run (roughly 3500 seconds, which is close to an hour; again, I'm a lazy guy and I don't like waiting).
#
# With this kernel, I bring some small changes:
# * Decrease RAM usage, so that it won't crash when you change it to GPU. I simply changed the point at which unused variables are deleted.
# * Decrease **running time from ~3500s to ~40s** (yes, that's almost 90x faster), at the cost of a slight decrease in score. This is done by adding a single argument.
#
# Again, my changes are super minimal (cause Inversion's kernel was already so awesome), but I hope it will save you some time and trouble (so that you can start working on cool stuff).
#
#
# ### Changelog
#
# **V4**
# * Change some wording
# * Prints XGBoost version
# * Add random state to XGB for reproducibility

# %% [code]
import os
import numpy as np
import pandas as pd

# %% [markdown]
# # Efficient Preprocessing
#
# This preprocessing method is more careful with RAM usage, which avoids crashing the kernel when you switch from CPU to GPU. Otherwise, it is exactly the same procedure as the official starter.

# %% [code]
!pip install datatable

# %% [code]
%%time
# (590540, 433)
# (506691, 432)
# CPU times: user 40.4 s, sys: 11.5 s, total: 51.9 s
# Wall time: 52 s
# import datatable as dt
# train_transaction = dt.fread('../input/train_transaction.csv').to_pandas()
# test_transaction = dt.fread('../input/test_transaction.csv', index_col='TransactionID').to_pandas()
# train_identity = dt.fread('../input/train_identity.csv', index_col='TransactionID').to_pandas()
# test_identity  = dt.fread('../input/test_identity.csv', index_col='TransactionID').to_pandas()
import pandas as pd
train_transaction = pd.read_csv('../input/ieee-fraud-detection/train_transaction.csv', index_col='TransactionID')
test_transaction = pd.read_csv('../input/ieee-fraud-detection/test_transaction.csv', index_col='TransactionID')
train_identity = pd.read_csv('../input/ieee-fraud-detection/train_identity.csv', index_col='TransactionID')
test_identity  = pd.read_csv('../input/ieee-fraud-detection/test_identity.csv', index_col='TransactionID')
sample_submission = pd.read_csv('../input/ieee-fraud-detection/sample_submission.csv', index_col='TransactionID')
train = train_transaction.merge(train_identity, how='left', left_index=True, right_index=True)
test = test_transaction.merge(test_identity, how='left', left_index=True, right_index=True)
# TODO: try datatable to speed up this loading step (see the sketch in the next cell)

# %% [code]
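# A minimal sketch (not part of the original kernel) of the datatable-based loading
# hinted at in the TODO above, assuming the `datatable` package installed earlier.
# Note that dt.fread() has no index_col argument, so the index is set after the
# conversion to pandas; the helper name `fread_indexed` is hypothetical.
import datatable as dt

def fread_indexed(path, index='TransactionID'):
    # Multi-threaded CSV parsing with datatable, then hand off to pandas.
    return dt.fread(path).to_pandas().set_index(index)

# Usage, e.g.:
# train_transaction = fread_indexed('../input/ieee-fraud-detection/train_transaction.csv')

# %% [code]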
import numpy as np
train['Transaction_hour'] = np.floor(train['TransactionDT'] / 3600) % 24
test['Transaction_hour'] = np.floor(test['TransactionDT'] / 3600) % 24
# train_one_column = np.floor(train['TransactionDT'] * 1.0 % 86400) / 3600  # pandas DataFrame -> Series
# test_one_column  = np.floor(test['TransactionDT'] * 1.0 % 86400) / 3600   # pandas DataFrame -> Series
del train["TransactionDT"]
del test["TransactionDT"]
# train["TransactionDT"] = train_one_column  # append as a new column at the end
# test["TransactionDT"] = test_one_column    # append as a new column at the end
train.head(10)
# print(train_one_column)
# df = pd.DataFrame(one_column)
# df.columns.name = 'TransactionDF_hour'
# # Series -> pandas DataFrame
# one_column_dict = {col: df[col].tolist() for col in df.columns}  # pandas DataFrame -> dict
# one_column = dt.Frame(one_column_dict)  # dict -> datatable Frame

# %% [code]
print(train.shape)
print(test.shape)

y_train = train['isFraud'].copy()
del train_transaction, train_identity, test_transaction, test_identity

# Drop target, fill in NaNs
X_train = train.drop('isFraud', axis=1)
X_test = test.copy()
del train, test

X_train = X_train.fillna(-999)
X_test = X_test.fillna(-999)

# %% [code]
%%time
# CPU times: user 49.9 s, sys: 380 ms, total: 50.3 s
# Wall time: 50.8 s

# Label Encoding
from sklearn import preprocessing

for f in X_train.columns:
    if X_train[f].dtype == 'object' or X_test[f].dtype == 'object':
        lbl = preprocessing.LabelEncoder()
        lbl.fit(list(X_train[f].values) + list(X_test[f].values))
        X_train[f] = lbl.transform(list(X_train[f].values))
        X_test[f] = lbl.transform(list(X_test[f].values))

# %% [code]
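# An untested alternative sketch (my suggestion, not part of the original kernel):
# the same shared train/test encoding can be done with pandas categoricals, which
# avoids building Python lists column by column and is usually faster here.
obj_cols = [c for c in X_train.columns
            if X_train[c].dtype == 'object' or X_test[c].dtype == 'object']
for c in obj_cols:
    # Fit the category set on train + test so both get consistent integer codes.
    cats = pd.Categorical(pd.concat([X_train[c], X_test[c]]).astype(str)).categories
    X_train[c] = pd.Categorical(X_train[c].astype(str), categories=cats).codes
    X_test[c] = pd.Categorical(X_test[c].astype(str), categories=cats).codes

# %% [code]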
import gc
gc.collect()

# %% [markdown]
# # Training
#
# To activate GPU usage, simply use `tree_method='gpu_hist'` (it took me an hour to figure out; I wish the XGBoost documentation were clearer about that).

# %% [code]
import xgboost as xgb
clf = xgb.XGBClassifier(
    n_estimators=500,
    max_depth=9,
    learning_rate=0.05,
    subsample=0.9,
    colsample_bytree=0.9,
    missing=-999,
    random_state=2019,
    tree_method='gpu_hist'  # THE MAGICAL PARAMETER
)

# %% [markdown]
# gpu_hist: 0.9355
# gpu_exact: 0.9331

# %% [code]
%time clf.fit(X_train, y_train)

# %% [markdown]
# Some of you must be wondering how we were able to decrease the fitting time by that much. The reason is not only that we are running on a GPU, but also that we are computing a histogram-based approximation of the exact (greedy) split-finding algorithm.
#
# This hurts your score slightly, but in exchange it is much faster.
#
# So why am I not using the CPU with `tree_method='hist'`?
# If you try it yourself, you'll find it takes ~7 min, which is still far from the GPU fitting time.
# Similarly, `tree_method='gpu_exact'` takes ~4 min,
# but it likely yields better accuracy than `gpu_hist` or `hist`.
#
# The [docs on parameters](https://xgboost.readthedocs.io/en/latest/parameter.html) have a section on `tree_method` that goes over the details of each option.
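
# %% [code]
# An optional sketch (not in the original kernel) of how one might reproduce the
# tree_method comparison above on a hold-out split before fitting on the full data.
# The 20% hold-out fraction and the helper name `eval_tree_method` are my own choices;
# the rows of train_transaction.csv are ordered by TransactionDT, so the last rows
# serve as a rough time-based validation set.
from sklearn.metrics import roc_auc_score

def eval_tree_method(method, valid_frac=0.2):
    # Same hyperparameters as the classifier above, only tree_method varies.
    cut = int(len(X_train) * (1 - valid_frac))
    model = xgb.XGBClassifier(n_estimators=500, max_depth=9, learning_rate=0.05,
                              subsample=0.9, colsample_bytree=0.9, missing=-999,
                              random_state=2019, tree_method=method)
    model.fit(X_train.iloc[:cut], y_train.iloc[:cut])
    preds = model.predict_proba(X_train.iloc[cut:])[:, 1]
    return roc_auc_score(y_train.iloc[cut:], preds)

# Example: compare eval_tree_method('gpu_hist') with eval_tree_method('hist')

# %% [code]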
sample_submission['isFraud'] = clf.predict_proba(X_test)[:,1]
sample_submission.to_csv('simple_xgboost-gpu-exact.csv')
