Competition page
https://www.kaggle.com/c/santander-customer-transaction-prediction

1. Post-competition summary

1.1 Learning from others

1.1.1 List of Fake Samples and Public/Private LB split

https://www.kaggle.com/yag320/list-of-fake-samples-and-public-private-lb-split
First, the test set and training set look statistically very similar, but their unique-value counts differ markedly. The hypothesis is therefore that the test set was generated by sampling from real samples. On this basis one can identify 100,000 fake examples and 100,000 real examples. Assuming further that the sampling was done after the public/private LB split, the real examples can be split into 50,000 public + 50,000 private.
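The detection idea above can be sketched in a few lines: a test row that holds at least one value occurring exactly once in its column must be a real sample, while a row with no column-wise unique values was assembled from other rows. This is a minimal sketch, assuming `test_df` holds the raw variable columns; `find_real_samples` and its signature are my own illustration, not the kernel's code.

```python
import numpy as np
import pandas as pd

def find_real_samples(test_df, features):
    # For each column, flag values that occur exactly once in that column.
    unique_flags = np.zeros((len(test_df), len(features)), dtype=int)
    for i, var in enumerate(features):
        counts = test_df[var].value_counts()
        unique_vals = set(counts[counts == 1].index)
        unique_flags[:, i] = test_df[var].isin(unique_vals).astype(int)
    # A row with at least one column-wise unique value must be real;
    # a row with none was stitched together from values of other rows.
    has_unique = unique_flags.sum(axis=1) > 0
    real_idx = np.where(has_unique)[0]
    fake_idx = np.where(~has_unique)[0]
    return real_idx, fake_idx
```

On the competition data this split comes out to exactly 100,000 real and 100,000 fake test rows, which is what supports the sampling hypothesis.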

1.1.2 giba single model public 0.9245 private 0.9234

https://www.kaggle.com/titericz/giba-single-model-public-0-9245-private-0-9234

>Reverse features
I don't fully understand why negatively correlated features are reversed. (Presumably so that every feature ends up positively correlated with the target: since all 200 features are later stacked into a single column, a consistent orientation lets the model share patterns across features.)

```python
# Reverse features
for var in features:
    if np.corrcoef(train_df['target'], train_df[var])[1][0] < 0:
        train_df[var] = train_df[var] * -1
        test_df[var] = test_df[var] * -1
```

>Feature generation
Each variable generates four features: the raw value, its count, the feature_id, and its rank. (The role of feature_id is not entirely clear to me.) The raw value is also normalized later. The result is a (40000000, 4) matrix, i.e. the 200 per-variable blocks stacked vertically.

```python
def var_to_feat(vr, var_stats, feat_id):
    new_df = pd.DataFrame()
    new_df["var"] = vr.values
    new_df["hist"] = pd.Series(vr).map(var_stats)
    new_df["feature_id"] = feat_id
    new_df["var_rank"] = new_df["var"].rank() / 200000.
    return new_df.values

TARGET = np.array(list(train_df['target'].values) * 200)
TRAIN = []
var_mean = {}
var_var = {}
for var in features:
    tmp = var_to_feat(train_df[var], var_stats[var], int(var[4:]))
    var_mean[var] = np.mean(tmp[:, 0])
    var_var[var] = np.var(tmp[:, 0])
    tmp[:, 0] = (tmp[:, 0] - var_mean[var]) / var_var[var]
    TRAIN.append(tmp)
TRAIN = np.vstack(TRAIN)
del train_df
_ = gc.collect()
print(TRAIN.shape, len(TARGET))
```

>LGBM model
Train with an LGBM model:

```python
model = lgb.LGBMClassifier(**{
    'learning_rate': 0.04,
    'num_leaves': 31,
    'max_bin': 1023,
    'min_child_samples': 1000,
    'reg_alpha': 0.1,
    'reg_lambda': 0.2,
    'feature_fraction': 1.0,
    'bagging_freq': 1,
    'bagging_fraction': 0.85,
    'objective': 'binary',
    'n_jobs': -1,
    'n_estimators': 200,
})
MODELS = []
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=11111)
for fold_, (train_indexes, valid_indexes) in enumerate(skf.split(TRAIN, TARGET)):
    print('Fold:', fold_)
    model = model.fit(TRAIN[train_indexes], TARGET[train_indexes],
                      eval_set=(TRAIN[valid_indexes], TARGET[valid_indexes]),
                      verbose=10,
                      eval_metric='auc',
                      early_stopping_rounds=25,
                      categorical_feature=[2])
    MODELS.append(model)
del TRAIN, TARGET
_ = gc.collect()
```

>Prediction
Apply the same feature engineering to the test data, predict with every model for every feature, then take log(x) - log(1-x) (i.e. the logit) before averaging. Why?

And why the final sub['target'] = sub['target'].rank() / 200000.?
The author's reply: rank or not it produces the same score since the metric is rank based (AUC). I used rank just to normalize to the range [0-1]
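The author's answer can be verified directly: AUC depends only on the ordering of the predictions, and rank is a strictly monotone transform, so the score is unchanged. A minimal check with toy labels and scores (not competition data):

```python
import numpy as np
from scipy.stats import rankdata
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 1, 0, 1, 1, 0])
scores = np.array([0.10, 0.90, 0.40, 0.35, 0.80, 0.05])

# rank() is strictly monotone, so the ordering -- and hence the
# AUC -- is unchanged; only the value range is mapped into (0, 1].
ranked = rankdata(scores) / len(scores)

auc_raw = roc_auc_score(y_true, scores)
auc_ranked = roc_auc_score(y_true, ranked)
```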

```python
ypred = np.zeros((200000, 200))
for feat, var in enumerate(features):
    tmp = var_to_feat(test_df[var], var_stats[var], int(var[4:]))
    tmp[:, 0] = (tmp[:, 0] - var_mean[var]) / var_var[var]
    for model_id in range(10):
        model = MODELS[model_id]
        ypred[:, feat] += model.predict_proba(tmp)[:, 1] / 10.

ypred = np.mean(logit(ypred), axis=1)
sub = test_df[['ID_code']]
sub['target'] = ypred
sub['target'] = sub['target'].rank() / 200000.
sub.to_csv('golden_sub.csv', index=False)
print(sub.head(10))
```
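One possible reading of the logit step (my interpretation, not the author's statement): with (near-)independent features, per-feature probabilities combine naturally in log-odds space, Naive-Bayes style, and the mean of logits is exactly the log of the geometric mean of the per-feature odds. A small sanity check of that identity:

```python
import numpy as np
from scipy.special import logit, expit

# Two per-feature probabilities for the same row (illustrative values).
p = np.array([0.9, 0.6])

mean_logit = logit(p).mean()             # what the kernel averages
odds = p / (1 - p)
geo_mean_odds = np.exp(np.log(odds).mean())

# Averaging logits == taking the log of the geometric mean of the odds,
# which is a different (and here, better-founded) blend than p.mean().
assert np.isclose(mean_logit, np.log(geo_mean_odds))
print(expit(mean_logit), p.mean())
```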

I studied your code some more. This is a brilliant solution !! Reversing some variables and stacking all of them into 4 columns is really ingenious. It simulates ideas from an NN convolution where the model can use patterns it learns from one variable to assist in its pattern detection of another variable. This also prevents LGBM from modeling spurious interactions between variables. But it’s more advanced than a convolution (that uses the same weights for all variables) because you provide column 3 which has the original variable’s number (0-199), so your LGBM can customize its prediction for each variable. Lastly you combine everything back together mathematically accurate by using mean logit. Very very nice. Setting the frequency count as a categorical value is a nice touch which allows LGBM to efficiently divide the different distributions. You maximized the modeling ability of an LGBM duplicating other participants’ success with NNs. I am quite impressed !!

To be continued...
