Introduction to the Santander Value Prediction Challenge

Santander Value Prediction Challenge

In this value prediction challenge, Santander Group provides 4,459 training samples; the test set to be predicted is roughly ten times larger, and the training data has more features (4,991) than samples (clearly wide data). The task is to predict the potential value of customers.
The evaluation metric is given below.
The evaluation metric for this competition is Root Mean Squared Logarithmic Error.

RMSLE is calculated as

\epsilon = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left(\log(p_i + 1) - \log(a_i + 1)\right)^2}

Where:

  • $\epsilon$ is the RMSLE value (score)
  • $n$ is the number of samples
  • $p_i$ is the predicted target for sample $i$
  • $a_i$ is the actual target for sample $i$
  • $\log(x)$ is the natural logarithm of $x$
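As a sanity check, the metric is easy to compute directly with numpy (a minimal sketch; rmsle is an illustrative helper, not part of the competition kit):

import numpy as np

def rmsle(pred, actual):
    """Root Mean Squared Logarithmic Error, as defined above."""
    return np.sqrt(np.mean((np.log1p(pred) - np.log1p(actual)) ** 2))

# toy example with two samples
print(rmsle(np.array([5.0e6, 1.2e6]), np.array([6.0e6, 1.0e6])))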

Submission File

The submission file is a .csv with two columns, ID and target, formatted as follows:

ID,target
000137c73,5944923.322036332
00021489f,5944923.322036332
0004d7953,5944923.322036332
etc.
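The identical target values in this sample suggest a constant benchmark; a minimal sketch of producing such a file (the constant is copied from the sample above, not a recommended prediction):

import pandas as pd

test = pd.read_csv("data/test.csv")
# constant benchmark: every row gets the same value, as in the sample above
submission = pd.DataFrame({"ID": test["ID"], "target": 5944923.322036332})
submission.to_csv("constant_benchmark.csv", index=False)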

Below is a simple exploration of the data.

Python libraries
import warnings
import numpy as np
import pandas as pd
import seaborn as sns
import lightgbm as lgb
import plotly.tools as tls
import plotly.offline as py
import plotly.graph_objs as go
import matplotlib.pyplot as plt
from sklearn import model_selection, metrics

color = sns.color_palette()
%matplotlib inline
py.init_notebook_mode(connected=True)
pd.options.mode.chained_assignment = None
pd.options.display.max_columns = 10
warnings.filterwarnings("ignore", message="numpy.dtype size changed")
warnings.filterwarnings("ignore", message="numpy.ufunc size changed")

Train & test data

train = pd.read_csv("data/train.csv")
test = pd.read_csv("data/test.csv")
print('Test rows and columns : ', test.shape)
print('Train rows and columns :', train.shape)
Test rows and columns :  (49342, 4992)
Train rows and columns : (4459, 4993)

Wide data indeed.

train.head()
ID target 48df886f9 0deb4b6a8 34b15f335 71b203550 137efaa80 fb36b89d9 7e293fbaf 9fc776466
0 000d6aaf2 38000000.0 0.0 0 0.0 0 0 0 0 0
1 000fbd867 600000.0 0.0 0 0.0 0 0 0 0 0
2 0027d6b71 10000000.0 0.0 0 0.0 0 0 0 0 0
3 0028cbf45 2000000.0 0.0 0 0.0 0 0 0 0 0
4 002a68644 14400000.0 0.0 0 0.0 0 0 0 0 0

5 rows × 4993 columns

The column names have been anonymized, so the meaning of the feature columns is unknown.

Target Distribution

plt.figure(figsize=(12, 8))
plt.scatter(range(train.shape[0]), np.sort(train.target.values))
plt.grid()
plt.xlabel('index', fontsize=12)
plt.ylabel('Target', fontsize=12)
plt.title("Target Distribution", fontsize=14)

Target Histogram

plt.figure(figsize=(12,8))
plt.hist(train.target.values, bins=50)
plt.xlabel('Target', fontsize=12)
plt.title("Target Histogram", fontsize=14)

Histogram (log of target)

plt.figure(figsize=(12,8))
sns.distplot( np.log1p(train.target.values), bins=50, kde=False)
plt.xlabel('Target', fontsize=12)
plt.title("Log of Target Histogram", fontsize=14)
plt.show()

Check missing values

Count the missing values per column:

# missing value count per column
mis_value_counts = train.isnull().sum().reset_index()
# rename the columns
mis_value_counts.rename(index=str, columns={'index': 'Feature_name', 0: 'Count'}, inplace=True)
mis_value_counts[mis_value_counts.Count > 0].Count
Series([], Name: Count, dtype: int64)

There are no missing values.

Check column value types

col_type = train.dtypes.reset_index()
col_type.rename(index=str, columns={'index': 'Feature_name', 0: 'Type'}, inplace=True)
col_type.Type.value_counts()
int64      3147
float64    1845
object        1
Name: Type, dtype: int64

The single object-typed column is ID.

Check column values

Count the unique values per column with nunique():

# number of unique elements per column
num_unique = train.nunique().reset_index()
# rename the columns
num_unique.columns = ['Col_name', 'Value_count']
# features that contain only a single value
only_1_value = num_unique[num_unique.Value_count == 1]
num_unique.head()
Col_name Value_count
0 ID 4459
1 target 1413
2 48df886f9 32
3 0deb4b6a8 5
4 34b15f335 29

Feature columns with only a single value (256 of them) can be dropped from the train and test sets.

only_1_value.head()
Col_name Value_count
28 d5308d8bc 1
35 c330f1a67 1
38 eeac16933 1
59 7df8788e8 1
70 5b91580ee 1
only_1_value.shape
(256, 2)

Computing correlation coefficients

There are several correlation coefficients to choose from:

  • pearson
  • kendall
  • spearman

Both Pandas and Scipy support computing them.


Scipy

from scipy.stats import spearmanr
labels = []
values = []
for col in train.columns:
    if col not in ["ID", "target"]:
        labels.append(col)
        values.append(spearmanr(train[col].values, train['target'].values)[0])
corr_df = pd.DataFrame({'col_labels':labels, 'corr_values':values})

Pandas makes this trivially easy (though with 4000+ features it takes quite a while):

corr_df = train.corr(method = 'spearman')['target']

Scipy: spearmanr

from tqdm import tqdm, tqdm_notebook
from scipy.stats import spearmanr
import warnings
warnings.filterwarnings('ignore')
col_names = []
cor_value = []
for col in tqdm(train.columns, ncols=100, leave=True):
    if col not in ['ID', 'target']:
        col_names.append(col)
        cor_value.append(spearmanr(train[col].values, train.target.values)[0])
corrs = pd.DataFrame({'Feature_Name':col_names,'Corr_value':cor_value})
corrs = corrs.sort_values(by = 'Corr_value')
100%|██████████████████████████████████████████████████████████| 4993/4993 [00:08<00:00, 597.96it/s]

Visualizing the correlation values

only_1_value = only_1_value.set_index('Col_name')
only_1_value.head()
Value_count
Col_name
d5308d8bc 1
c330f1a67 1
eeac16933 1
7df8788e8 1
5b91580ee 1
corrs = corrs.set_index('Feature_Name')
corr_df = corrs.loc[list(only_1_value.index)]
# how many of these correlations are NaN
corr_df.Corr_value.isnull().value_counts()
True    256
Name: Corr_value, dtype: int64

For features that take only a single value, the correlation with target is NaN: a constant has zero variance, so the coefficient is undefined.
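This is easy to verify directly (a toy check, assuming a recent SciPy; it may also emit a constant-input warning):

import numpy as np
from scipy.stats import spearmanr

const = np.ones(10)              # a zero-variance "feature"
y = np.arange(10, dtype=float)
rho, _ = spearmanr(const, y)
print(rho)                       # nan: rank correlation is undefined for a constant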


Features with |correlation| > 0.1

# select features whose correlation with target exceeds 0.1 in absolute value
corr_df = corrs[(corrs.Corr_value > 0.1) | (corrs.Corr_value < -0.1)].reset_index()
corr_df = corr_df.set_index('Feature_Name')
corr_df.plot(kind='barh', figsize = (12,15), title='Correlation of variables')

Features with |correlation| > 0.11

corr_df = corrs[(corrs.Corr_value > 0.11) | (corrs.Corr_value < -0.11)].reset_index()
corr_df = corr_df.set_index('Feature_Name')
corr_df.plot(kind='barh', figsize = (10,8), title='Correlation of variables')

Judging by the correlation coefficients alone, there do not seem to be any strongly correlated features.


Selecting important features with models

  • ExtraTreesRegressor
  • LightGBM
# columns with num_unique == 1
useless_feature_names = list(only_1_value.index)
# drop the single-value columns
train_X = train.drop(useless_feature_names + ["ID", "target"], axis=1)
test_X = test.drop(useless_feature_names + ["ID"], axis=1)
# log1p(x) = log(1+x),
train_y = np.log1p(train["target"].values)

np.log1p(x) = np.log(1+x); the model's predictions must later be mapped back with the inverse, np.expm1(x) = np.exp(x) - 1. Training on log1p(target) with an RMSE objective is equivalent to optimizing the competition's RMSLE directly.
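A quick round-trip check of the transform pair (a toy verification, not from the original run):

import numpy as np

x = np.array([0.0, 600000.0, 3.8e7])   # values on the target's scale
y = np.log1p(x)                        # forward transform used for training
assert np.allclose(np.expm1(y), x)     # expm1 exactly inverts log1p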

ExtraTreesRegressor(sklearn)

from sklearn import ensemble

model = ensemble.ExtraTreesRegressor(n_estimators=200, max_depth=20, max_features=0.5,
                                     n_jobs=-1, random_state=50)
model.fit(train_X, train_y)
ExtraTreesRegressor(bootstrap=False, criterion='mse', max_depth=20,max_features=0.5, max_leaf_nodes=None, min_impurity_decrease=0.0,min_impurity_split=None, min_samples_leaf=1, min_samples_split=2,min_weight_fraction_leaf=0.0, n_estimators=200, n_jobs=-1,oob_score=False, random_state=50, verbose=0, warm_start=False)

Feature Importance

# feature importances as a DataFrame
feature_importances = pd.DataFrame({'Feature_name': train_X.columns,
                                    'Importances': model.feature_importances_})
feature_importances = feature_importances.set_index('Feature_name')
# sort ascending so the most important features plot at the top
feature_importances = feature_importances.sort_values(by='Importances')
# per-feature std of importance across the 200 trees, indexed by feature name
# so the error bars align with the sorted importance order
std_feat_importances = pd.Series(
    np.std([tree.feature_importances_ for tree in model.estimators_], axis=0),
    index=train_X.columns)
feature_importances.iloc[-50:].plot(kind='barh', figsize=(10, 20), color='g',
                                    xerr=std_feat_importances[feature_importances.index[-50:]],
                                    align='center',
                                    title='top 50 Feature Importances')

f190486d6 looks like a very important feature; the more important a feature, the more stable its importance tends to be across trees (smaller error bars).


Result1

np.expm1(x) = np.exp(x) - 1

result = model.predict(test_X)
result = np.expm1(result)
submit = pd.DataFrame({'ID':test.ID, 'target':result})
submit.head()
ID target
0 000137c73 1.512680e+06
1 00021489f 1.334733e+06
2 0004d7953 2.424292e+06
3 00056a333 3.894291e+06
4 00056d8eb 1.334733e+06

Make submission file

submit.to_csv('submit3_1.csv',index = False)

Leaderboard result


LightGBM

LightGBM GPU Tutorial

NVIDIA NVS 4200: I couldn't really tell whether the GPU helped, but I tried it on an old ThinkPad just to see.

def run_lgb(train_X, train_y, val_X, val_y, test_X):
    params = {
        "objective": "regression",
        "metric": "rmse",
        "num_leaves": 30,
        "learning_rate": 0.01,
        "bagging_seed": 1884,
        "device": "gpu",
        "gpu_platform_id": 0,
        "gpu_device_id": 0,
        "num_thread": 8
    }
    train_set = lgb.Dataset(train_X, label=train_y)
    val_set = lgb.Dataset(val_X, label=val_y)
    model = lgb.train(params, train_set, 10000, valid_sets=[val_set],
                      early_stopping_rounds=100, verbose_eval=200)
    result = model.predict(test_X, num_iteration=model.best_iteration)
    return result, model
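If your LightGBM build lacks GPU support, lgb.train will raise an error at startup. A minimal fallback (a sketch: drop the GPU-specific keys from params inside run_lgb; everything else stays the same):

# CPU fallback: the same params without the GPU-specific keys
params = {
    "objective": "regression",
    "metric": "rmse",
    "num_leaves": 30,
    "learning_rate": 0.01,
    "bagging_seed": 1884,
    "num_thread": 8,
}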

Train the model with 10-fold cross-validation:

kf = model_selection.KFold(n_splits=10, shuffle=True, random_state=1884)
pred_test_full = 0
# feature importances from the 10 models
f_importances = []
for dev_index, val_index in kf.split(train_X):
    dev_X, val_X = train_X.loc[dev_index, :], train_X.loc[val_index, :]
    dev_y, val_y = train_y[dev_index], train_y[val_index]
    pred_test, model = run_lgb(dev_X, dev_y, val_X, val_y, test_X)
    # accumulate the averaged test predictions
    pred_test_full += pred_test / 10
    # collect this fold's feature importances
    f_importances.append(model.feature_importance())
# map the averaged predictions back to the original target scale
pred_test_full = np.expm1(pred_test_full)
Training until validation scores don't improve for 100 rounds.
[200]   valid_0's rmse: 1.42782
[400]   valid_0's rmse: 1.39864
Early stopping, best iteration is:
[437]   valid_0's rmse: 1.39731
Training until validation scores don't improve for 100 rounds.
[200]   valid_0's rmse: 1.45798
[400]   valid_0's rmse: 1.43229
Early stopping, best iteration is:
[393]   valid_0's rmse: 1.4318
Training until validation scores don't improve for 100 rounds.
[200]   valid_0's rmse: 1.46425
[400]   valid_0's rmse: 1.43454
[600]   valid_0's rmse: 1.43284
Early stopping, best iteration is:
[578]   valid_0's rmse: 1.43227
Training until validation scores don't improve for 100 rounds.
[200]   valid_0's rmse: 1.49087
[400]   valid_0's rmse: 1.47221
Early stopping, best iteration is:
[357]   valid_0's rmse: 1.47059
Training until validation scores don't improve for 100 rounds.
[200]   valid_0's rmse: 1.42226
[400]   valid_0's rmse: 1.39553
[600]   valid_0's rmse: 1.39161
Early stopping, best iteration is:
[685]   valid_0's rmse: 1.39064
Training until validation scores don't improve for 100 rounds.
[200]   valid_0's rmse: 1.46622
[400]   valid_0's rmse: 1.44762
Early stopping, best iteration is:
[461]   valid_0's rmse: 1.44708
Training until validation scores don't improve for 100 rounds.
[200]   valid_0's rmse: 1.50951
[400]   valid_0's rmse: 1.48227
[600]   valid_0's rmse: 1.47924
[800]   valid_0's rmse: 1.47827
Early stopping, best iteration is:
[702]   valid_0's rmse: 1.47819
Training until validation scores don't improve for 100 rounds.
[200]   valid_0's rmse: 1.44264
[400]   valid_0's rmse: 1.41304
Early stopping, best iteration is:
[479]   valid_0's rmse: 1.41138
Training until validation scores don't improve for 100 rounds.
[200]   valid_0's rmse: 1.46255
[400]   valid_0's rmse: 1.43719
Early stopping, best iteration is:
[450]   valid_0's rmse: 1.43618
Training until validation scores don't improve for 100 rounds.
[200]   valid_0's rmse: 1.49027
[400]   valid_0's rmse: 1.47742
Early stopping, best iteration is:
[389]   valid_0's rmse: 1.47729

The validation RMSE (computed on log1p(target), i.e. the RMSLE) lies between 1.39 and 1.47, better than the ExtraTrees result.

# Making a submission file
submit = pd.DataFrame({'ID': test.ID, 'target': pred_test_full})
submit.to_csv('submit3_2.csv', index=False)


LightGBM scores 1.48 on the leaderboard using the raw data. Since this comes straight from the raw features, it can serve as a baseline; there is still plenty of room to improve in later iterations: feature selection, feature engineering, data cleaning, outlier detection, and so on…


Below are the LightGBM feature importances (from the last fold's model).

### Feature Importance
lgb.plot_importance(model, max_num_features=50, height=0.8, figsize=(12, 20))
plt.title("LightGBM - Feature Importance top 50", fontsize=15)
Text(0.5,1,'LightGBM - Feature Importance top 50')


Feature Importance

Feature_Importance = pd.DataFrame({'Feature_name': test_X.columns,
                                   'Importance': model.feature_importance()})
Feature_Importance = Feature_Importance.sort_values(by='Importance')
Feature_Importance.to_csv('FeatureImportance_LGBM.csv', index=False)

Feature Importances & std(Feature Importance)

Top-50 feature importances, with the std of each feature's importance across the 10 models:

# f_importances: feature importances from the 10 folds
# std_importance: per-feature std of importance across the 10 models,
# indexed by feature name so it aligns with the sorted importance table
std_importance = pd.Series(np.std(np.array(f_importances), axis=0), index=test_X.columns)
Feature_Importance = Feature_Importance.set_index('Feature_name')
Feature_Importance.iloc[-50:].plot(kind='barh', figsize=(10, 20),
                                   xerr=std_importance[Feature_Importance.index[-50:]],
                                   align='center',
                                   title='top 50 LightGBM Feature Importances')


Intersection of the top 300 features from LightGBM & ExtraTrees

topx_extra_features = list(feature_importances.index)[-300:]
topx_light_features = list(Feature_Importance.index)[-300:]
interSection = set(topx_extra_features).intersection(set(topx_light_features))
print('LightGBM & ExtraTrees Top 300 features intersection :', len(interSection))
LightGBM & ExtraTrees Top 300 features intersection : 130

Top 100

topx_extra_features = list(feature_importances.index)[-100:]
topx_light_features = list(Feature_Importance.index)[-100:]
interSection = set(topx_extra_features).intersection(set(topx_light_features))
print('LightGBM & ExtraTrees Top 100 features intersection :', len(interSection))
LightGBM & ExtraTrees Top 100 features intersection : 56

Top 200

topx_extra_features = list(feature_importances.index)[-200:]
topx_light_features = list(Feature_Importance.index)[-200:]
interSection = set(topx_extra_features).intersection(set(topx_light_features))
print('LightGBM & ExtraTrees Top 200 features intersection :', len(interSection))
LightGBM & ExtraTrees Top 200 features intersection : 84
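A natural follow-up, not run in the original post, would be to retrain on just the features both models agree on (a sketch reusing interSection from the Top 200 cell above):

# hypothetical follow-up: restrict to the agreed-upon features
common_feats = list(interSection)
train_X_sel = train_X[common_feats]
test_X_sel = test_X[common_feats]
# train_X_sel / test_X_sel could then be fed back into the KFold + run_lgb loop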

            
