Competition page
https://www.kaggle.com/c/santander-customer-transaction-prediction

1. Post-competition summary

1.1 Learning from others

1.1.1 List of Fake Samples and Public/Private LB split

https://www.kaggle.com/yag320/list-of-fake-samples-and-public-private-lb-split
First, the test set and training set look statistically very similar, but their unique-value counts differ markedly. The hypothesis is therefore that the test set was generated by sampling from real samples. On this basis one can identify 100,000 fake examples and 100,000 real examples. Assuming further that the sampling was done after the public/private LB split, the real examples can be split into 50,000 public + 50,000 private.
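The detection idea above can be sketched in a few lines: a test row that holds at least one value occurring exactly once in its column must be a real sample, while a row with no column-wise unique values was assembled from other rows. This is a minimal sketch, assuming `test_df` holds the raw variable columns; `find_real_samples` and its signature are my own illustration, not the kernel's code.

```python
import numpy as np
import pandas as pd

def find_real_samples(test_df, features):
    # For each column, flag values that occur exactly once in that column.
    unique_flags = np.zeros((len(test_df), len(features)), dtype=int)
    for i, var in enumerate(features):
        counts = test_df[var].value_counts()
        unique_vals = set(counts[counts == 1].index)
        unique_flags[:, i] = test_df[var].isin(unique_vals).astype(int)
    # A row with at least one column-wise unique value must be real;
    # a row with none was stitched together from values of other rows.
    has_unique = unique_flags.sum(axis=1) > 0
    real_idx = np.where(has_unique)[0]
    fake_idx = np.where(~has_unique)[0]
    return real_idx, fake_idx
```

On the competition data this split comes out to exactly 100,000 real and 100,000 fake test rows, which is what supports the sampling hypothesis.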

1.1.2 giba single model public 0.9245 private 0.9234

https://www.kaggle.com/titericz/giba-single-model-public-0-9245-private-0-9234

>Reverse features
I don't fully understand why negatively correlated features are reversed. (Presumably so that every feature ends up positively correlated with the target: since all 200 features are later stacked into a single column, a consistent orientation lets the model share patterns across features.)

```python
# Reverse features
for var in features:
    if np.corrcoef(train_df['target'], train_df[var])[1][0] < 0:
        train_df[var] = train_df[var] * -1
        test_df[var] = test_df[var] * -1
```

>Feature generation
Each variable generates four features: the raw value, its count, the feature_id, and its rank. (The role of feature_id is not entirely clear to me.) The raw value is also normalized later. The result is a (40000000, 4) matrix, i.e. the 200 per-variable blocks stacked vertically.

```python
def var_to_feat(vr, var_stats, feat_id):
    new_df = pd.DataFrame()
    new_df["var"] = vr.values
    new_df["hist"] = pd.Series(vr).map(var_stats)
    new_df["feature_id"] = feat_id
    new_df["var_rank"] = new_df["var"].rank() / 200000.
    return new_df.values

TARGET = np.array(list(train_df['target'].values) * 200)
TRAIN = []
var_mean = {}
var_var = {}
for var in features:
    tmp = var_to_feat(train_df[var], var_stats[var], int(var[4:]))
    var_mean[var] = np.mean(tmp[:, 0])
    var_var[var] = np.var(tmp[:, 0])
    tmp[:, 0] = (tmp[:, 0] - var_mean[var]) / var_var[var]
    TRAIN.append(tmp)
TRAIN = np.vstack(TRAIN)
del train_df
_ = gc.collect()
print(TRAIN.shape, len(TARGET))
```

>LGBM model
Train with an LGBM model:

```python
model = lgb.LGBMClassifier(**{
    'learning_rate': 0.04,
    'num_leaves': 31,
    'max_bin': 1023,
    'min_child_samples': 1000,
    'reg_alpha': 0.1,
    'reg_lambda': 0.2,
    'feature_fraction': 1.0,
    'bagging_freq': 1,
    'bagging_fraction': 0.85,
    'objective': 'binary',
    'n_jobs': -1,
    'n_estimators': 200,
})
MODELS = []
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=11111)
for fold_, (train_indexes, valid_indexes) in enumerate(skf.split(TRAIN, TARGET)):
    print('Fold:', fold_)
    model = model.fit(TRAIN[train_indexes], TARGET[train_indexes],
                      eval_set=(TRAIN[valid_indexes], TARGET[valid_indexes]),
                      verbose=10,
                      eval_metric='auc',
                      early_stopping_rounds=25,
                      categorical_feature=[2])
    MODELS.append(model)
del TRAIN, TARGET
_ = gc.collect()
```

>Prediction
Apply the same feature engineering to the test data, predict with every model for every feature, then take log(x) - log(1-x) (i.e. the logit) before averaging. Why?

And why the final sub['target'] = sub['target'].rank() / 200000.?
The author's reply: rank or not it produces the same score since the metric is rank based (AUC). I used rank just to normalize to the range [0-1]
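The author's answer can be verified directly: AUC depends only on the ordering of the predictions, and rank is a strictly monotone transform, so the score is unchanged. A minimal check with toy labels and scores (not competition data):

```python
import numpy as np
from scipy.stats import rankdata
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 1, 0, 1, 1, 0])
scores = np.array([0.10, 0.90, 0.40, 0.35, 0.80, 0.05])

# rank() is strictly monotone, so the ordering -- and hence the
# AUC -- is unchanged; only the value range is mapped into (0, 1].
ranked = rankdata(scores) / len(scores)

auc_raw = roc_auc_score(y_true, scores)
auc_ranked = roc_auc_score(y_true, ranked)
```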

```python
ypred = np.zeros((200000, 200))
for feat, var in enumerate(features):
    tmp = var_to_feat(test_df[var], var_stats[var], int(var[4:]))
    tmp[:, 0] = (tmp[:, 0] - var_mean[var]) / var_var[var]
    for model_id in range(10):
        model = MODELS[model_id]
        ypred[:, feat] += model.predict_proba(tmp)[:, 1] / 10.

ypred = np.mean(logit(ypred), axis=1)
sub = test_df[['ID_code']]
sub['target'] = ypred
sub['target'] = sub['target'].rank() / 200000.
sub.to_csv('golden_sub.csv', index=False)
print(sub.head(10))
```
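One possible reading of the logit step (my interpretation, not the author's statement): with (near-)independent features, per-feature probabilities combine naturally in log-odds space, Naive-Bayes style, and the mean of logits is exactly the log of the geometric mean of the per-feature odds. A small sanity check of that identity:

```python
import numpy as np
from scipy.special import logit, expit

# Two per-feature probabilities for the same row (illustrative values).
p = np.array([0.9, 0.6])

mean_logit = logit(p).mean()             # what the kernel averages
odds = p / (1 - p)
geo_mean_odds = np.exp(np.log(odds).mean())

# Averaging logits == taking the log of the geometric mean of the odds,
# which is a different (and here, better-founded) blend than p.mean().
assert np.isclose(mean_logit, np.log(geo_mean_odds))
print(expit(mean_logit), p.mean())
```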

I studied your code some more. This is a brilliant solution !! Reversing some variables and stacking all of them into 4 columns is really ingenious. It simulates ideas from an NN convolution where the model can use patterns it learns from one variable to assist in its pattern detection of another variable. This also prevents LGBM from modeling spurious interactions between variables. But it’s more advanced than a convolution (that uses the same weights for all variables) because you provide column 3 which has the original variable’s number (0-199), so your LGBM can customize its prediction for each variable. Lastly you combine everything back together mathematically accurate by using mean logit. Very very nice. Setting the frequency count as a categorical value is a nice touch which allows LGBM to efficiently divide the different distributions. You maximized the modeling ability of an LGBM duplicating other participants’ success with NNs. I am quite impressed !!

To be continued...
