The code is as follows:

# General imports
import numpy as np
import pandas as pd
import os, sys, gc, warnings, random

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

from sklearn.model_selection import train_test_split, KFold
from sklearn.preprocessing import LabelEncoder
from catboost import CatBoostClassifier, Pool, cv
from sklearn.metrics import auc
import shap
from tqdm import tqdm
import math

warnings.filterwarnings('ignore')
SEED = 10

########################### Helpers
#################################################################################
## -------------------
## Seeder
# :seed to make all processes deterministic     # type: int
def seed_everything(seed=0):
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)

## -------------------
## Downcast numeric columns to the narrowest dtype that still holds their values
def reduce_mem_usage(df, verbose=True):
    numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    start_mem = df.memory_usage(deep=True).sum() / 1024**2
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                c_prec = df[col].apply(lambda x: np.finfo(x).precision).max()
                if c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max and c_prec == np.finfo(np.float32).precision:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
    # use deep=True here as well, so that start and end are measured the same way
    end_mem = df.memory_usage(deep=True).sum() / 1024**2
    if verbose:
        print('Mem. usage decreased to {:5.2f} Mb ({:.1f}% reduction)'.format(end_mem, 100 * (start_mem - end_mem) / start_mem))
    return df
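The integer branch of reduce_mem_usage picks the narrowest dtype whose range strictly contains the column's minimum and maximum. Here is a minimal sketch of that check on its own. The helper name smallest_int_dtype is hypothetical and not part of the original notebook, and the fallback to int64 is an assumption added for illustration:

```python
import numpy as np

def smallest_int_dtype(c_min, c_max):
    """Return the narrowest signed integer dtype whose range strictly contains [c_min, c_max].

    Hypothetical helper for illustration; it mirrors the strict comparisons used above.
    """
    for dtype in (np.int8, np.int16, np.int32, np.int64):
        info = np.iinfo(dtype)
        if c_min > info.min and c_max < info.max:
            return dtype
    return np.int64  # assumption: fall back to the widest type

small = smallest_int_dtype(0, 100)            # fits comfortably in int8
edge = smallest_int_dtype(0, 127)             # 127 equals the int8 maximum, so the strict check rejects int8
large = smallest_int_dtype(-40000, 40000)     # outside the int16 range
print(small.__name__, edge.__name__, large.__name__)
```

Because the comparisons are strict, a column whose maximum is exactly 127 is stored as int16 rather than int8. That wastes a little memory on edge cases but never overflows.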
########################### DATA LOAD
#################################################################################
seed_everything(SEED)
print('Load Data')
train_df = pd.read_pickle('../input/ieee-data-minification/train_transaction.pkl')
test_df = pd.read_pickle('../input/ieee-data-minification/test_transaction.pkl')
train_identity = pd.read_pickle('../input/ieee-data-minification/train_identity.pkl')
test_identity = pd.read_pickle('../input/ieee-data-minification/test_identity.pkl')
base_columns = list(train_df) + list(train_identity)
train_df = pd.merge(train_df, train_identity, how='left', on='TransactionID', validate="many_to_one")
test_df = pd.merge(test_df, test_identity, how='left', on='TransactionID', validate="many_to_one")
train_df.drop(["TransactionID", "TransactionDT"], axis=1, inplace=True)
test_df.drop(["TransactionDT"], axis=1, inplace=True)
X = train_df.drop(["isFraud"], axis=1)
y = train_df["isFraud"]
X_Test = test_df.copy()
X_Test.drop(['TransactionID', 'isFraud'], axis=1, inplace=True)  # getting rid of the trans.ID that is in our submission file anyway
# X = reduce_mem_usage(X)
# X_Test = reduce_mem_usage(X_Test)
del train_df, test_df, train_identity, test_identity
gc.collect()

Basic Data Cleaning

print(f"Before dropna, top missing columns:\n{X.isna().sum().sort_values(ascending = False).head(5)}\n")
thresh = 0.80  # maximum share of NA values allowed per column; anything above 80% is too much, though this is only my opinion
X_less_nas = X.dropna(thresh=X.shape[0]*(1-thresh), axis='columns')
cols_dropped = list(set(X.columns)-set(X_less_nas.columns))
X_Test.drop(cols_dropped, axis=1, inplace=True)
# X_less_nas = reduce_mem_usage(X_less_nas)
# X_Test = reduce_mem_usage(X_Test)
print(f"After dropna, top missing columns:\n{X_less_nas.isna().sum().sort_values(ascending = False).head(5)}")
print(f"\nNo. of cols dropped = {len(set(X.columns)-set(X_less_nas.columns))}, or {len(set(X.columns)-set(X_less_nas.columns))/len(X.columns)*100:.2f}% of columns")
del X ; gc.collect()
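Note that the thresh argument of dropna is the minimum number of non-missing values a column needs in order to be kept, which is why the code converts the 80% NA limit into rows * (1 - 0.80). Here is a small sketch on a toy frame. The column names and data are made up, and rounding up with math.ceil is my own choice to avoid floating-point surprises:

```python
import math

import numpy as np
import pandas as pd

# Toy frame: 10 rows, with 0%, 50% and 90% missing values per column.
toy = pd.DataFrame({
    "full": np.arange(10, dtype=float),
    "half": [1.0, np.nan] * 5,
    "sparse": [1.0] + [np.nan] * 9,
})

max_na_share = 0.80  # same meaning as `thresh` in the notebook
# dropna(thresh=...) expects a count of non-NA values, not a share of NA values.
min_non_na = math.ceil(len(toy) * (1 - max_na_share))

kept = toy.dropna(thresh=min_non_na, axis="columns")
dropped = sorted(set(toy.columns) - set(kept.columns))
print(f"need at least {min_non_na} non-NA values; dropped: {dropped}")
```

Only the column with 90% missing values is removed; the half-empty one survives because it is under the 80% limit.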

Let's build a list of the categorical features for CatBoost's API

# according to https://www.kaggle.com/c/ieee-fraud-detection/discussion/101203#latest-607486
Catfeats = ['ProductCD'] + \
    ["card"+f"{i+1}" for i in range(6)] + \
    ["addr"+f"{i+1}" for i in range(2)] + \
    ["P_emaildomain", "R_emaildomain"] + \
    ["M"+f"{i+1}" for i in range(9)] + \
    ["DeviceType", "DeviceInfo"] + \
    ["id_"+f"{i}" for i in range(12, 39)]
# removing columns dropped earlier when we weeded out the empty columns
Catfeats = list(set(Catfeats) - set(cols_dropped))

Let's define our numerical features as well:

Numfeats = list(set(X_less_nas.columns)- set(cols_dropped)-set(Catfeats))
X_less_nas[Catfeats].head()

Seems good :)

According to CatBoost's official tutorial, it's a good idea to replace NaN values with a number far outside the features' distribution:

https://github.com/catboost/tutorials/blob/master/python_tutorial.ipynb

Let's do that:

X_less_nas.fillna(-10000, inplace=True)
X_Test.fillna(-10000, inplace=True)
X_less_nas.head()
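One caveat with a single numeric sentinel: as far as I know, CatBoost expects categorical feature values to be integers or strings, so a categorical column that stays float after fillna may be rejected. Here is a hedged sketch of a per-type fill on toy data. The column names and the toy_cat_cols list are made up for illustration; in the notebook the categorical list would be Catfeats:

```python
import numpy as np
import pandas as pd

SENTINEL = -10000  # value far outside the observed range, as in the notebook

# Toy data; column names are hypothetical.
toy = pd.DataFrame({
    "amount": [10.5, np.nan, 3.0],
    "device": ["mobile", None, "desktop"],
})
toy_cat_cols = ["device"]
toy_num_cols = [c for c in toy.columns if c not in toy_cat_cols]

filled = toy.copy()
# Numeric columns: plain sentinel fill.
filled[toy_num_cols] = filled[toy_num_cols].fillna(SENTINEL)
# Categorical columns: fill first, then cast so that every value is a string.
filled[toy_cat_cols] = filled[toy_cat_cols].fillna(str(SENTINEL)).astype(str)
print(filled)
```

Working on a copy also avoids in-place modification of a frame that was derived from another one.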

Model Fitting

## quick test with AUC
X_tr, X_val, y_tr, y_val = train_test_split(X_less_nas, y, test_size=0.2, random_state=SEED, stratify=y)
cat_params = {
    'loss_function': 'Logloss',
    'custom_loss': ['AUC'],
    'logging_level': 'Silent',
    'task_type': 'GPU',  # train on the GPU
    'early_stopping_rounds': 100
}
simple_model = CatBoostClassifier(**cat_params)
simple_model.fit(
    X_tr, y_tr,
    cat_features=Catfeats,
    eval_set=(X_val, y_val),
    plot=True
);
# cv_params = model.get_params()
# cv_data = cv(
#     Pool( X.iloc[:2000,:5], y[:2000], `=[1]),
#     cv_params,nfold=4,
#     plot=True
# )
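The quick test relies on CatBoost's own AUC tracking during training. To check the number by hand, here is a minimal sketch using scikit-learn's roc_auc_score on made-up labels and scores. In the notebook the scores would presumably come from simple_model.predict_proba(X_val)[:, 1]:

```python
from sklearn.metrics import roc_auc_score

# Made-up validation labels and predicted fraud probabilities.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# AUC is the share of (fraud, non-fraud) pairs in which the fraud case gets the higher score.
val_auc = roc_auc_score(y_true, y_score)
print(f"validation AUC = {val_auc:.3f}")
```

Note that sklearn.metrics.auc, which the imports above pull in, computes the area under an arbitrary curve from x and y points. roc_auc_score is the function that works directly from labels and scores.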

Looks very promising. Let's train on all available data.

I'll do cross-validation later.

# final training on the whole training set
simple_model.fit(
    X_less_nas, y,
    cat_features=Catfeats,
    logging_level='Silent'
);
submission = pd.read_csv('../input/ieee-fraud-detection/sample_submission.csv')
submission['isFraud'] = simple_model.predict_proba(X_Test)[:,1] # you must predict a probability for the isFraud variable
submission.to_csv('simple_model_Catboost.csv', index=False)

Reference:

[1]https://www.kaggle.com/pipboyguy/catboost-and-eda/output
