Competition Overview

About the competition: it was jointly hosted by Xiamen International Bank and the Data Mining Research Center of Xiamen University, and organized by their joint "数创金融" (Digital Finance) laboratory.

Data download: https://download.csdn.net/download/weixin_35770067/13718841

Data Overview

The data comprises three files: train_x.csv, train_target.csv, and test_x.csv. train_x.csv holds the training-set features and train_target.csv the training-set target variable. To strengthen the model's ability to generalize, the training set is composed of samples from two stages, marked by the field isNew. test_x.csv holds the test-set features, which are identical to those of the training set. The modeling goal is to train a model on the training set and predict on the test set.
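
A quick first look at the files can confirm this layout; a minimal sketch, assuming the three CSVs sit in the working directory:

import pandas as pd

train_x = pd.read_csv("train_x.csv")
train_target = pd.read_csv("train_target.csv")
test_x = pd.read_csv("test_x.csv")
print(train_x.shape, train_target.shape, test_x.shape)
print(train_x["isNew"].value_counts())  # the two sampling stages mentioned above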

Data Field Description

a) Basic user attributes
id, target, certId, gender, age, dist, edu, job, ethnic, highestEdu, certValidBegin, certValidStop

b) Loan-related information
loanProduct, lmt, basicLevel, bankCard, residentAddr, linkRela, setupHour, weekday

c) Credit-report-related information
x_0 through x_78, plus ncloseCreditCard, unpayIndvLoan, unpayOtherLoan, unpayNormalLoan, 5yearBadloan
These fields involve sensitive third-party data, so no further description is provided.

Evaluation

Rankings are determined by the AUC on the test set.
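
For reference, the AUC can be computed with sklearn's roc_auc_score; a toy example with made-up labels and scores:

from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]            # made-up default labels
y_score = [0.1, 0.4, 0.35, 0.8]  # made-up predicted probabilities
print(roc_auc_score(y_true, y_score))  # 0.75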

RF-basemodel(0.75+)

We start by running a RandomForest classifier over all of the data to see the result with default parameters.

# Use all features
# ['id', 'certId', 'loanProduct', 'gender', 'age', 'dist', 'edu', 'job', 'lmt', 'basicLevel', 'x_0', 'x_1', 'x_2', 'x_3', 'x_4', 'x_5', 'x_6', 'x_7', 'x_8', 'x_9', 'x_10', 'x_11', 'x_12', 'x_13', 'x_14', 'x_15', 'x_16', 'x_17', 'x_18', 'x_19', 'x_20', 'x_21', 'x_22', 'x_23', 'x_24', 'x_25', 'x_26', 'x_27', 'x_28', 'x_29', 'x_30', 'x_31', 'x_32', 'x_33', 'x_34', 'x_35', 'x_36', 'x_37', 'x_38', 'x_39', 'x_40', 'x_41', 'x_42', 'x_43', 'x_44', 'x_45', 'x_46', 'x_47', 'x_48', 'x_49', 'x_50', 'x_51', 'x_52', 'x_53', 'x_54', 'x_55', 'x_56', 'x_57', 'x_58', 'x_59', 'x_60', 'x_61', 'x_62', 'x_63', 'x_64', 'x_65', 'x_66', 'x_67', 'x_68', 'x_69', 'x_70', 'x_71', 'x_72', 'x_73', 'x_74', 'x_75', 'x_76', 'x_77', 'x_78', 'x_79', 'certValidBegin', 'certBalidStop', 'bankCard', 'ethnic', 'residentAddr', 'highestEdu', 'linkRela', 'setupHour', 'weekday', 'ncloseCreditCard', 'unpayIndvLoan', 'unpayOtherLoan', 'unpayNormalLoan', '5yearBadloan', 'target']
x_columns = [x for x in train_data.columns if x not in ["target", "id"]]
rf = RandomForestClassifier()

AUC Score (Train): 0.545862
We find the AUC is only 0.545862, which is essentially no better than random guessing.
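
For context, here is a minimal sketch of the harness that produces such an "AUC Score (Train)" number; the merge on id, the fillna(0), and the 80/20 holdout split are my assumptions, not necessarily the exact setup used:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

train_x = pd.read_csv("train_x.csv")
train_target = pd.read_csv("train_target.csv")
train_data = train_x.merge(train_target, on="id").fillna(0)  # assumed join key and NaN handling

x_columns = [x for x in train_data.columns if x not in ["target", "id"]]
X_tr, X_val, y_tr, y_val = train_test_split(
    train_data[x_columns], train_data["target"], test_size=0.2, random_state=10)

rf = RandomForestClassifier()
rf.fit(X_tr, y_tr)
print("AUC Score (Train): %f" % roc_auc_score(y_val, rf.predict_proba(X_val)[:, 1]))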

Let's look at the results after tuning:

rf = RandomForestClassifier(n_estimators=100, random_state=10)

AUC Score (Train): 0.632956

rf = RandomForestClassifier(n_estimators=90, random_state=10)

AUC Score (Train): 0.638696

rf = RandomForestClassifier(n_estimators=80, random_state=10)

AUC Score (Train): 0.633332

rf = RandomForestClassifier(n_estimators=90, max_depth=4, random_state=10)

AUC Score (Train): 0.687838

rf = RandomForestClassifier(n_estimators=90, max_depth=6, random_state=10)

AUC Score (Train): 0.685170

rf = RandomForestClassifier(n_estimators=90, max_depth=8, random_state=10)

AUC Score (Train): 0.653320

rf = RandomForestClassifier(n_estimators=90, max_depth=10, random_state=10)

AUC Score (Train): 0.636410

These simple tuning runs show that the parameters have a large effect on AUC: the worst setting scores 0.545862 and the best reaches 0.687838, a gap of roughly 14 percentage points. This is certainly not optimal yet; there is plenty of room left to tune (a grid-search sketch follows below).
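
Rather than probing values by hand, the search can be automated with sklearn's GridSearchCV; a sketch, reusing X_tr/y_tr from the baseline sketch above (the grid values here are illustrative, not the ones behind the numbers above):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {"n_estimators": [80, 90, 100], "max_depth": [4, 6, 8, 10]}
search = GridSearchCV(RandomForestClassifier(random_state=10),
                      param_grid, scoring="roc_auc", cv=5, n_jobs=-1)
search.fit(X_tr, y_tr)
print(search.best_params_, search.best_score_)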

That covers the initial tuning; next, let's look at the features.

Above we used all of the features. Below we use the random forest's built-in feature_importances_ attribute to pick out the more effective ones. The code is as follows:

importances = rf.feature_importances_
indices = np.argsort(importances)[::-1]
feat_labels = X_train.columns
# Inter-tree variability of the importances
std = np.std([tree.feature_importances_ for tree in rf.estimators_], axis=0)
print("Feature ranking:")
# Print the importance of each feature
for f in range(X_train.shape[1]):
    print("%d. feature no:%d feature name:%s (%f)" % (f + 1, indices[f], feat_labels[indices[f]], importances[indices[f]]))
Feature ranking:
1. feature no:7 feature name:lmt (0.119897)
2. feature no:90 feature name:certBalidStop (0.070063)
3. feature no:91 feature name:bankCard (0.065635)
4. feature no:89 feature name:certValidBegin (0.061998)
5. feature no:93 feature name:residentAddr (0.055272)
6. feature no:0 feature name:certId (0.054448)
7. feature no:4 feature name:dist (0.048813)
8. feature no:8 feature name:basicLevel (0.042018)
9. feature no:97 feature name:weekday (0.040811)
10. feature no:96 feature name:setupHour (0.040214)
11. feature no:54 feature name:x_45 (0.038700)
12. feature no:3 feature name:age (0.031389)
13. feature no:1 feature name:loanProduct (0.028978)
14. feature no:95 feature name:linkRela (0.027006)
15. feature no:100 feature name:unpayOtherLoan (0.026191)
16. feature no:6 feature name:job (0.018915)
17. feature no:29 feature name:x_20 (0.018539)
18. feature no:55 feature name:x_46 (0.016263)
19. feature no:82 feature name:x_73 (0.015427)
20. feature no:42 feature name:x_33 (0.014756)
21. feature no:44 feature name:x_35 (0.009275)
22. feature no:92 feature name:ethnic (0.008969)
23. feature no:34 feature name:x_25 (0.008467)
24. feature no:71 feature name:x_62 (0.008017)
25. feature no:37 feature name:x_28 (0.007177)
26. feature no:2 feature name:gender (0.007070)
27. feature no:76 feature name:x_67 (0.006776)
28. feature no:85 feature name:x_76 (0.006183)
29. feature no:101 feature name:unpayNormalLoan (0.005641)
30. feature no:72 feature name:x_63 (0.005626)
31. feature no:98 feature name:ncloseCreditCard (0.005433)
32. feature no:81 feature name:x_72 (0.005120)
33. feature no:77 feature name:x_68 (0.004969)
34. feature no:43 feature name:x_34 (0.004652)
35. feature no:70 feature name:x_61 (0.004451)
36. feature no:35 feature name:x_26 (0.003792)
37. feature no:63 feature name:x_54 (0.003617)
38. feature no:60 feature name:x_51 (0.003151)
39. feature no:56 feature name:x_47 (0.003083)
40. feature no:25 feature name:x_16 (0.002995)
41. feature no:23 feature name:x_14 (0.002979)
42. feature no:36 feature name:x_27 (0.002700)
43. feature no:32 feature name:x_23 (0.002591)
44. feature no:99 feature name:unpayIndvLoan (0.002557)
45. feature no:80 feature name:x_71 (0.002379)
46. feature no:83 feature name:x_74 (0.002353)
47. feature no:68 feature name:x_59 (0.002294)
48. feature no:84 feature name:x_75 (0.002284)
49. feature no:61 feature name:x_52 (0.001965)
50. feature no:26 feature name:x_17 (0.001933)
51. feature no:10 feature name:x_1 (0.001912)
52. feature no:9 feature name:x_0 (0.001882)
53. feature no:31 feature name:x_22 (0.001662)
54. feature no:52 feature name:x_43 (0.001651)
55. feature no:74 feature name:x_65 (0.001631)
56. feature no:62 feature name:x_53 (0.001578)
57. feature no:13 feature name:x_4 (0.001530)
58. feature no:57 feature name:x_48 (0.001484)
59. feature no:59 feature name:x_50 (0.001357)
60. feature no:11 feature name:x_2 (0.001116)
61. feature no:16 feature name:x_7 (0.000877)
62. feature no:48 feature name:x_39 (0.000832)
63. feature no:102 feature name:5yearBadloan (0.000797)
64. feature no:64 feature name:x_55 (0.000787)
65. feature no:30 feature name:x_21 (0.000786)
66. feature no:47 feature name:x_38 (0.000759)
67. feature no:19 feature name:x_10 (0.000694)
68. feature no:66 feature name:x_57 (0.000653)
69. feature no:50 feature name:x_41 (0.000548)
70. feature no:20 feature name:x_11 (0.000508)
71. feature no:65 feature name:x_56 (0.000500)
72. feature no:17 feature name:x_8 (0.000400)
73. feature no:15 feature name:x_6 (0.000390)
74. feature no:79 feature name:x_70 (0.000378)
75. feature no:94 feature name:highestEdu (0.000355)
76. feature no:75 feature name:x_66 (0.000229)
77. feature no:53 feature name:x_44 (0.000226)
78. feature no:21 feature name:x_12 (0.000183)
79. feature no:58 feature name:x_49 (0.000129)
80. feature no:38 feature name:x_29 (0.000120)
81. feature no:51 feature name:x_42 (0.000112)
82. feature no:73 feature name:x_64 (0.000096)
83. feature no:39 feature name:x_30 (0.000005)
84. feature no:24 feature name:x_15 (0.000000)
85. feature no:40 feature name:x_31 (0.000000)
86. feature no:88 feature name:x_79 (0.000000)
87. feature no:87 feature name:x_78 (0.000000)
88. feature no:86 feature name:x_77 (0.000000)
89. feature no:5 feature name:edu (0.000000)
90. feature no:41 feature name:x_32 (0.000000)
91. feature no:78 feature name:x_69 (0.000000)
92. feature no:45 feature name:x_36 (0.000000)
93. feature no:22 feature name:x_13 (0.000000)
94. feature no:67 feature name:x_58 (0.000000)
95. feature no:12 feature name:x_3 (0.000000)
96. feature no:46 feature name:x_37 (0.000000)
97. feature no:14 feature name:x_5 (0.000000)
98. feature no:33 feature name:x_24 (0.000000)
99. feature no:49 feature name:x_40 (0.000000)
100. feature no:28 feature name:x_19 (0.000000)
101. feature no:18 feature name:x_9 (0.000000)
102. feature no:27 feature name:x_18 (0.000000)
103. feature no:69 feature name:x_60 (0.000000)

Based on the importance scores, we drop the features with zero importance and re-test:

x_columns = ['id', 'certId', 'loanProduct', 'gender', 'age', 'dist', 'job', 'lmt', 'basicLevel', 'x_0', 'x_1', 'x_2', 'x_4', 'x_6', 'x_7', 'x_8', 'x_10', 'x_11', 'x_12', 'x_14', 'x_16', 'x_17', 'x_20', 'x_21', 'x_22', 'x_23', 'x_25', 'x_26', 'x_27', 'x_28', 'x_29', 'x_30', 'x_33', 'x_34', 'x_35', 'x_38', 'x_39', 'x_41', 'x_42', 'x_43', 'x_44', 'x_45', 'x_46', 'x_47', 'x_48', 'x_49', 'x_50', 'x_51', 'x_52', 'x_53', 'x_54', 'x_55', 'x_56', 'x_57', 'x_59','x_61', 'x_62', 'x_63', 'x_64', 'x_65', 'x_66', 'x_67', 'x_68', 'x_70', 'x_71', 'x_72', 'x_73', 'x_74', 'x_75', 'x_76', 'certValidBegin', 'certBalidStop', 'bankCard', 'ethnic', 'residentAddr', 'highestEdu', 'linkRela', 'setupHour', 'weekday', 'ncloseCreditCard', 'unpayIndvLoan', 'unpayOtherLoan', 'unpayNormalLoan', '5yearBadloan']
rf = RandomForestClassifier(n_estimators=90, max_depth=4, random_state=10)

AUC Score (Train): 0.681259
After removing the features the random forest considered unimportant, the result actually got worse.

Proceeding the same way, we print the feature importances again and drop the newly zero-importance features.

x_columns = ['certId', 'loanProduct', 'gender', 'age', 'dist', 'job', 'lmt', 'basicLevel', 'x_1', 'x_2', 'x_4', 'x_6', 'x_8', 'x_12', 'x_14', 'x_16', 'x_17', 'x_20', 'x_21', 'x_23', 'x_25', 'x_26', 'x_27', 'x_28', 'x_29', 'x_30', 'x_33', 'x_34', 'x_35', 'x_39', 'x_41', 'x_43', 'x_44', 'x_45', 'x_46', 'x_47', 'x_49', 'x_50', 'x_51', 'x_52', 'x_53', 'x_54', 'x_55', 'x_57', 'x_61', 'x_62', 'x_63', 'x_64', 'x_65', 'x_66', 'x_67', 'x_68', 'x_70', 'x_71', 'x_72', 'x_73', 'x_74', 'x_75', 'x_76', 'certValidBegin', 'certBalidStop', 'bankCard', 'ethnic', 'residentAddr', 'highestEdu', 'linkRela', 'setupHour', 'weekday', 'ncloseCreditCard', 'unpayIndvLoan', 'unpayOtherLoan', 'unpayNormalLoan', '5yearBadloan']
rf = RandomForestClassifier(n_estimators=90, max_depth=4, random_state=10)

AUC Score (Train): 0.677848
After removing the features rf considered unimportant a second time, the result got worse again.

x_columns = ['id', 'certId', 'loanProduct', 'gender', 'age', 'dist', 'job', 'lmt', 'basicLevel', 'x_0', 'x_1', 'x_2', 'x_6', 'x_8', 'x_12', 'x_14', 'x_16', 'x_20', 'x_22', 'x_23', 'x_25', 'x_26', 'x_27', 'x_28', 'x_33', 'x_34', 'x_35', 'x_41', 'x_42', 'x_43', 'x_44', 'x_45', 'x_46', 'x_47', 'x_48', 'x_49', 'x_50', 'x_51', 'x_52', 'x_53', 'x_54', 'x_55', 'x_56', 'x_57', 'x_61', 'x_62', 'x_63', 'x_64', 'x_65', 'x_66', 'x_67', 'x_68', 'x_70', 'x_71', 'x_72', 'x_73', 'x_74', 'x_75', 'x_76', 'certValidBegin', 'certBalidStop', 'bankCard', 'ethnic', 'residentAddr', 'highestEdu', 'linkRela', 'setupHour', 'weekday', 'ncloseCreditCard', 'unpayIndvLoan', 'unpayOtherLoan', 'unpayNormalLoan', '5yearBadloan']
rf = RandomForestClassifier(n_estimators=90, max_depth=4, random_state=10)

AUC Score (Train): 0.690318
The first two rounds made things worse, but now the AUC has finally improved a little; the effect is rather subtle.
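
This prune-and-retrain loop can also be automated instead of maintaining the column lists by hand; a rough sketch, again reusing X_tr/y_tr from the baseline sketch (the stopping rule is an illustrative choice, not what was done here):

cols = [x for x in train_data.columns if x not in ["target", "id"]]
while True:
    rf = RandomForestClassifier(n_estimators=90, max_depth=4, random_state=10)
    rf.fit(X_tr[cols], y_tr)
    kept = [c for c, imp in zip(cols, rf.feature_importances_) if imp > 0]
    if len(kept) == len(cols):
        break  # no zero-importance features left
    cols = kept
print(len(cols), "features kept")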

For now we take 0.690318 as the best result (it actually is not). There is far more headroom than this; the initial tuning and feature selection end here. Feature engineering, rules, cross-validation, and similar methods would certainly improve the score further; at this stage we only did simple tuning and feature selection.

XGBoost-basemodel(76+)

The previous section used a random-forest base model, which reached 75+ online. This time we try XGBoost, which reaches 76+ online.

First, run the XGBoost classifier with default parameters to see how it does.

# Use all features
# ['id','certId', 'loanProduct', 'gender', 'age', 'dist', 'edu', 'job', 'lmt', 'basicLevel', 'x_0', 'x_1', 'x_2', 'x_3', 'x_4', 'x_5', 'x_6', 'x_7', 'x_8', 'x_9', 'x_10', 'x_11', 'x_12', 'x_13', 'x_14', 'x_15', 'x_16', 'x_17', 'x_18', 'x_19', 'x_20', 'x_21', 'x_22', 'x_23', 'x_24', 'x_25', 'x_26', 'x_27', 'x_28', 'x_29', 'x_30', 'x_31', 'x_32', 'x_33', 'x_34', 'x_35', 'x_36', 'x_37', 'x_38', 'x_39', 'x_40', 'x_41', 'x_42', 'x_43', 'x_44', 'x_45', 'x_46', 'x_47', 'x_48', 'x_49', 'x_50', 'x_51', 'x_52', 'x_53', 'x_54', 'x_55', 'x_56', 'x_57', 'x_58', 'x_59', 'x_60', 'x_61', 'x_62', 'x_63', 'x_64', 'x_65', 'x_66', 'x_67', 'x_68', 'x_69', 'x_70', 'x_71', 'x_72', 'x_73', 'x_74', 'x_75', 'x_76', 'x_77', 'x_78', 'certValidBegin', 'certValidStop', 'bankCard', 'ethnic', 'residentAddr', 'highestEdu', 'linkRela', 'setupHour', 'weekday', 'ncloseCreditCard', 'unpayIndvLoan', 'unpayOtherLoan', 'unpayNormalLoan', '5yearBadloan', 'isNew', 'target']
x_columns = [x for x in train_data.columns if x not in ["target", "id"]]
xgboost = xgb.XGBClassifier()

AUC Score (Train): 0.703644
The random forest's default parameters scored only 0.545862, so the gap between the two is quite large.
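
For completeness, the default-parameter XGBoost number can be reproduced with the same assumed harness as the random-forest sketch earlier (X_tr/X_val/y_tr/y_val as defined there):

import xgboost as xgb
from sklearn.metrics import roc_auc_score

xgboost = xgb.XGBClassifier()
xgboost.fit(X_tr, y_tr)
print("AUC Score (Train): %f" % roc_auc_score(y_val, xgboost.predict_proba(X_val)[:, 1]))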

First, the effect of tuning the parameters:

xgboost = xgb.XGBClassifier(max_depth=6, n_estimators=100)

AUC Score (Train): 0.702864

xgboost = xgb.XGBClassifier(max_depth=6, n_estimators=200)

AUC Score (Train): 0.688059

These quick tuning runs again show that parameters matter a lot for AUC, but nothing beat the defaults, so we give up on further hand-tuning. (Later we switch to grid search, which takes longer to train.)

Next, let's look at how feature selection affects the result.
All the tests above used the full feature set; below we use XGBoost's feature_importances_ attribute to pick more effective features. The code is as follows:

importances = xgboost_model.feature_importances_
indices = np.argsort(importances)[::-1]
feat_labels = X_train.columns
print("Feature ranking:")
for f in range(X_train.shape[1]):
    print("%d. feature no:%d feature name:%s (%f)" % (f + 1, indices[f], feat_labels[indices[f]], importances[indices[f]]))
print(">>>>>", importances)

After pruning features by importance exactly as with the random forest, the AUC did not change, so we submitted the result directly; it reached 76+ online.
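
For reference, a sketch of writing such a submission file, assuming xgboost_model is the fitted model, X_predict/test_data hold the test features, and the expected layout is an (id, target) CSV like the one the stacking code at the end of this post produces:

import pandas as pd

test_prob = xgboost_model.predict_proba(X_predict)[:, 1]
submission = pd.DataFrame({"id": test_data["id"], "target": test_prob})
submission.to_csv("submission.csv", index=False)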

XGBoost-KFold(77+)

In the previous section our XGBoost base model reached 76+ online; this time we try cross-validation with different numbers of folds under XGBoost, reaching 77+ online.

Below are the code for 5-fold, 7-fold, and 8-fold cross-validation and the best result for each.

# Use all features
# ['id','certId', 'loanProduct', 'gender', 'age', 'dist', 'edu', 'job', 'lmt', 'basicLevel', 'x_0', 'x_1', 'x_2', 'x_3', 'x_4', 'x_5', 'x_6', 'x_7', 'x_8', 'x_9', 'x_10', 'x_11', 'x_12', 'x_13', 'x_14', 'x_15', 'x_16', 'x_17', 'x_18', 'x_19', 'x_20', 'x_21', 'x_22', 'x_23', 'x_24', 'x_25', 'x_26', 'x_27', 'x_28', 'x_29', 'x_30', 'x_31', 'x_32', 'x_33', 'x_34', 'x_35', 'x_36', 'x_37', 'x_38', 'x_39', 'x_40', 'x_41', 'x_42', 'x_43', 'x_44', 'x_45', 'x_46', 'x_47', 'x_48', 'x_49', 'x_50', 'x_51', 'x_52', 'x_53', 'x_54', 'x_55', 'x_56', 'x_57', 'x_58', 'x_59', 'x_60', 'x_61', 'x_62', 'x_63', 'x_64', 'x_65', 'x_66', 'x_67', 'x_68', 'x_69', 'x_70', 'x_71', 'x_72', 'x_73', 'x_74', 'x_75', 'x_76', 'x_77', 'x_78', 'certValidBegin', 'certValidStop', 'bankCard', 'ethnic', 'residentAddr', 'highestEdu', 'linkRela', 'setupHour', 'weekday', 'ncloseCreditCard', 'unpayIndvLoan', 'unpayOtherLoan', 'unpayNormalLoan', '5yearBadloan', 'isNew', 'target']
x_columns = [x for x in train_data.columns if x not in ["target", "id"]]
......
n_splits = 7
kf = KFold(n_splits=n_splits, shuffle=True, random_state=1234)
for train_index, test_index in kf.split(X_train):
    xgboost = xgb.XGBClassifier()

5-fold CV AUC Score (Train): 0.7245306571511836
7-fold CV AUC Score (Train): 0.7306788309565827
8-fold CV AUC Score (Train): 0.7511906354858096
The final online score was 77+ in every case; offline, the AUC rises steadily with the number of folds, which is reasonable.
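
The "......" above elides most of the fold loop. One plausible completion, sketched below, scores each fold's model on its held-out part and averages the test-set predictions across folds (the averaging is my assumption about the elided code):

import numpy as np
import xgboost as xgb
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold

n_splits = 7
kf = KFold(n_splits=n_splits, shuffle=True, random_state=1234)
test_pred = np.zeros(len(X_predict))
aucs = []
for train_index, test_index in kf.split(X_train):
    xgboost = xgb.XGBClassifier()
    xgboost.fit(X_train.iloc[train_index], y_train.iloc[train_index])
    fold_prob = xgboost.predict_proba(X_train.iloc[test_index])[:, 1]
    aucs.append(roc_auc_score(y_train.iloc[test_index], fold_prob))
    test_pred += xgboost.predict_proba(X_predict)[:, 1] / n_splits  # average over folds
print("AUC Score (Train):", np.mean(aucs))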

XGBoost-KFold-Feature Engineering

# Concatenate the training and test data
train_test_data = pd.concat([X_train, X_predict], axis=0, ignore_index=True)

# Data conversion: turn the certificate validity timestamps into datetimes
train_test_data['certBeginDt'] = pd.to_datetime(train_test_data["certValidBegin"] * 1000000000) - pd.offsets.DateOffset(years=70)
print("time >>>", train_test_data['certBeginDt'])
train_test_data = train_test_data.drop(['certValidBegin'], axis=1)
train_test_data['certStopDt'] = pd.to_datetime(train_test_data["certValidStop"] * 1000000000) - pd.offsets.DateOffset(years=70)
train_test_data = train_test_data.drop(['certValidStop'], axis=1)

# Feature combination: length of the certificate validity period
train_test_data["certStopDt" + "certBeginDt"] = train_test_data["certStopDt"] - train_test_data["certBeginDt"]
print("train_test_data>>>>>>", train_test_data["certStopDtcertBeginDt"])

print("Binning")
train_test_data["age_bin"] = pd.cut(train_test_data["age"], 20, labels=False)
train_test_data = train_test_data.drop(['age'], axis=1)
train_test_data["dist_bin"] = pd.qcut(train_test_data["dist"], 60, labels=False)
train_test_data = train_test_data.drop(['dist'], axis=1)
train_test_data["lmt_bin"] = pd.qcut(train_test_data["lmt"], 50, labels=False)
train_test_data = train_test_data.drop(['lmt'], axis=1)
train_test_data["setupHour_bin"] = pd.qcut(train_test_data["setupHour"], 10, labels=False)
train_test_data = train_test_data.drop(['setupHour'], axis=1)
train_test_data["certStopDtcertBeginDt_bin"] = pd.cut(train_test_data["certStopDtcertBeginDt"], 30, labels=False)
train_test_data = train_test_data.drop(['certStopDtcertBeginDt'], axis=1)
# 'certValidBegin', 'certValidStop'
train_test_data["certBeginDt_bin"] = pd.cut(train_test_data["certBeginDt"], 30, labels=False)
train_test_data = train_test_data.drop(['certBeginDt'], axis=1)
train_test_data["certStopDt_bin"] = pd.cut(train_test_data["certStopDt"], 30, labels=False)
train_test_data = train_test_data.drop(['certStopDt'], axis=1)

# Split back into train and test; use all features
X_train = train_test_data.iloc[:X_train.shape[0], :]
X_predict = train_test_data.iloc[X_train.shape[0]:, :]

print("One-hot encoding")
train_data = X_train
test_data = X_predict
# Columns available for one-hot encoding: ['id', 'certId', 'loanProduct', 'gender', 'age', 'dist', 'edu', 'job', 'lmt', 'basicLevel', 'x_0', 'x_1', 'x_2', 'x_3', 'x_4', 'x_5', 'x_6', 'x_7', 'x_8', 'x_9', 'x_10', 'x_11', 'x_12', 'x_13', 'x_14', 'x_15', 'x_16', 'x_17', 'x_18', 'x_19', 'x_20', 'x_21', 'x_22', 'x_23', 'x_24', 'x_25', 'x_26', 'x_27', 'x_28', 'x_29', 'x_30', 'x_31', 'x_32', 'x_33', 'x_34', 'x_35', 'x_36', 'x_37', 'x_38', 'x_39', 'x_40', 'x_41', 'x_42', 'x_43', 'x_44', 'x_45', 'x_46', 'x_47', 'x_48', 'x_49', 'x_50', 'x_51', 'x_52', 'x_53', 'x_54', 'x_55', 'x_56', 'x_57', 'x_58', 'x_59', 'x_60', 'x_61', 'x_62', 'x_63', 'x_64', 'x_65', 'x_66', 'x_67', 'x_68', 'x_69', 'x_70', 'x_71', 'x_72', 'x_73', 'x_74', 'x_75', 'x_76', 'x_77', 'x_78', 'certValidBegin', 'certValidStop', 'bankCard', 'ethnic', 'residentAddr', 'highestEdu', 'linkRela', 'setupHour', 'weekday', 'ncloseCreditCard', 'unpayIndvLoan', 'unpayOtherLoan', 'unpayNormalLoan', '5yearBadloan', 'isNew', 'target']
# ["gender", "edu", "job", 'x_0', 'x_1', 'x_2', 'x_3', 'x_4', 'x_5', 'x_6', 'x_7', 'x_8', 'x_9', 'x_10', 'x_11', 'x_12', 'x_13', 'x_14', 'x_15', 'x_16', 'x_17', 'x_18', 'x_19', 'x_20', 'x_21', 'x_22', 'x_23', 'x_24', 'x_25', 'x_26', 'x_27', 'x_28', 'x_29', 'x_30', 'x_31', 'x_32', 'x_33', 'x_34', 'x_35', 'x_36', 'x_37', 'x_38', 'x_39', 'x_40', 'x_41', 'x_42', 'x_43', 'x_44', 'x_45', 'x_46', 'x_47', 'x_48', 'x_49', 'x_50', 'x_51', 'x_52', 'x_53', 'x_54', 'x_55', 'x_56', 'x_57', 'x_58', 'x_59', 'x_60', 'x_61', 'x_62', 'x_63', 'x_64', 'x_65', 'x_66', 'x_67', 'x_68', 'x_69', 'x_70', 'x_71', 'x_72', 'x_73', 'x_74', 'x_75', 'x_76', 'x_77', 'x_78']
# edu
dummy_fea = ["gender", "job", "loanProduct", "basicLevel", "ethnic"]  # 'x_0', 'x_1', 'x_2', 'x_3', 'x_4', 'x_5', 'x_6', 'x_7', 'x_8', 'x_9', 'x_10', 'x_11', 'x_12', 'x_13', 'x_14', 'x_15', 'x_16', 'x_17', 'x_18', 'x_19', 'x_20', 'x_21', 'x_22', 'x_23', 'x_24', 'x_25', 'x_26', 'x_27', 'x_28', 'x_29', 'x_30', 'x_31', 'x_32', 'x_33', 'x_34', 'x_35', 'x_36', 'x_37', 'x_38', 'x_39', 'x_40', 'x_41', 'x_42', 'x_43', 'x_44', 'x_45', 'x_46', 'x_47', 'x_48', 'x_49', 'x_50', 'x_51', 'x_52', 'x_53', 'x_54', 'x_55', 'x_56', 'x_57', 'x_58', 'x_59', 'x_60', 'x_61', 'x_62', 'x_63', 'x_64', 'x_65', 'x_66', 'x_67', 'x_68', 'x_69', 'x_70', 'x_71', 'x_72', 'x_73', 'x_74', 'x_75', 'x_76', 'x_77', 'x_78'
train_test_data = pd.concat([train_data, test_data], axis=0, ignore_index=True)
dummy_df = pd.get_dummies(train_test_data.loc[:, dummy_fea])
dummy_fea_rename_dict = {}
for per_i in dummy_df.columns.values:
    dummy_fea_rename_dict[per_i] = per_i + '_onehot'
print(">>>>>", dummy_fea_rename_dict)
dummy_df = dummy_df.rename(columns=dummy_fea_rename_dict)
train_test_data = pd.concat([train_test_data, dummy_df], axis=1)
column_headers = list(train_test_data.columns.values)
print(column_headers)
train_test_data = train_test_data.drop(dummy_fea, axis=1)
column_headers = list(train_test_data.columns.values)
print(column_headers)
train_train = train_test_data.iloc[:train_data.shape[0], :]
test_test = train_test_data.iloc[train_data.shape[0]:, :]
X_train = train_train
X_predict = test_test

# Cross-validation: same scheme as in the previous section
..........

# Grid search
n_splits = 5
cv_params = {'max_depth': [4, 6, 8, 10], 'min_child_weight': [3, 4, 5, 6], 'scale_pos_weight': [5, 8, 10]}
other_params = {'learning_rate': 0.1, 'n_estimators': 4, 'max_depth': 5, 'min_child_weight': 1, 'seed': 0,
                'subsample': 0.8, 'colsample_bytree': 0.8, 'gamma': 1, 'reg_alpha': 1, 'reg_lambda': 1}
xgboost = xgb.XGBClassifier()
optimized_GBM = GridSearchCV(estimator=xgboost, param_grid=cv_params, scoring='roc_auc', cv=n_splits, verbose=1, n_jobs=4)
xgboost_model = optimized_GBM.fit(X_train, y_train)
y_pp = xgboost_model.predict_proba(X_predict)[:, 1]
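
For reference, the cut/qcut distinction used in the binning above: pd.cut makes equal-width bins over the value range, while pd.qcut makes equal-frequency bins over quantiles. A toy illustration:

import pandas as pd

s = pd.Series([1, 2, 3, 4, 5, 100])
print(pd.cut(s, 3, labels=False).tolist())   # [0, 0, 0, 0, 0, 2]: width-based bins
print(pd.qcut(s, 3, labels=False).tolist())  # [0, 0, 1, 1, 2, 2]: count-based bins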

The improvement turned out to be small. A word of advice for fellow competitors: piling on feature engineering without first analyzing the data can hurt rather than help; the analysis needs to target the data itself.

stacking-KFold(78+)

Below is the stacking code for model ensembling. After tuning and continued optimization, the final online score reached 78+.

# -*- coding: utf-8 -*-
from heamy.dataset import Dataset
from heamy.estimator import Regressor, Classifier
# ModelsPipeline:https://blog.csdn.net/qiqzhang/article/details/85477242 ; https://cloud.tencent.com/developer/article/1463294
from heamy.pipeline import ModelsPipeline
import pandas as pd
import xgboost as xgb
import datetime
from sklearn.metrics import roc_auc_score
# lightgbm installation: https://blog.csdn.net/weixin_41843918/article/details/85047492
# lgb example: https://www.jianshu.com/p/c208cac3496f
import lightgbm as lgb
from sklearn.preprocessing import LabelEncoder,OneHotEncoder
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
import numpy as np
from pandas.core.frame import DataFrame


def xgb_feature(X_train, y_train, X_test, y_test=None):
    other_params = {'learning_rate': 0.125, 'max_depth': 3}
    model = xgb.XGBClassifier(**other_params).fit(X_train, y_train)
    predict = model.predict_proba(X_test)[:, 1]
    # minmin = min(predict)
    # maxmax = max(predict)
    # vfunc = np.vectorize(lambda x: (x - minmin) / (maxmax - minmin))
    # return vfunc(predict)
    return predict


def xgb_feature2(X_train, y_train, X_test, y_test=None):
    # , 'num_boost_round': 12
    other_params = {'learning_rate': 0.1, 'max_depth': 3}
    model = xgb.XGBClassifier(**other_params).fit(X_train, y_train)
    predict = model.predict_proba(X_test)[:, 1]
    return predict


def xgb_feature3(X_train, y_train, X_test, y_test=None):
    # , 'num_boost_round': 20
    other_params = {'learning_rate': 0.13, 'max_depth': 3}
    model = xgb.XGBClassifier(**other_params).fit(X_train, y_train)
    predict = model.predict_proba(X_test)[:, 1]
    return predict


def rf_model(X_train, y_train, X_test, y_test=None):
    # n_estimators = 100
    model = RandomForestClassifier(n_estimators=90, max_depth=4, random_state=10).fit(X_train, y_train)
    predict = model.predict_proba(X_test)[:, 1]
    return predict


def et_model(X_train, y_train, X_test, y_test=None):
    model = ExtraTreesClassifier(max_features='log2', n_estimators=1000, n_jobs=-1).fit(X_train, y_train)
    return model.predict_proba(X_test)[:, 1]


def gbdt_model(X_train, y_train, X_test, y_test=None):
    # n_estimators = 700
    model = GradientBoostingClassifier(learning_rate=0.02, max_features=0.7, n_estimators=100, max_depth=5).fit(X_train, y_train)
    predict = model.predict_proba(X_test)[:, 1]
    return predict


def logistic_model(X_train, y_train, X_test, y_test=None):
    model = LogisticRegression(penalty='l2').fit(X_train, y_train)
    return model.predict_proba(X_test)[:, 1]


def lgb_feature(X_train, y_train, X_test, y_test=None):
    model = lgb.LGBMClassifier(boosting_type='gbdt', min_data_in_leaf=5, max_bin=200, num_leaves=25, learning_rate=0.01).fit(X_train, y_train)
    predict = model.predict_proba(X_test)[:, 1]
    return predict


VALID = False

if __name__ == '__main__':
    if VALID == False:
        ##############################
        train_data = pd.read_csv('data/train_data_target.csv', engine='python')
        # x_columns = [x for x in train_data.columns if x not in ["target", "id"]]
        x_columns = ['certId', 'loanProduct', 'gender', 'age', 'dist', 'edu', 'job', 'lmt', 'basicLevel', 'x_12', 'x_14', 'x_16', 'x_20', 'x_25', 'x_26', 'x_27', 'x_28', 'x_29', 'x_33', 'x_34', 'x_41', 'x_43', 'x_44', 'x_45', 'x_46', 'x_47', 'x_48', 'x_50', 'x_51', 'x_52', 'x_53', 'x_54', 'x_55', 'x_56', 'x_61', 'x_62', 'x_63', 'x_64', 'x_65', 'x_66', 'x_67', 'x_68', 'x_69', 'x_71', 'x_72', 'x_73', 'x_74', 'x_75', 'x_76', 'certValidBegin', 'certValidStop', 'bankCard', 'ethnic', 'residentAddr', 'highestEdu', 'linkRela', 'setupHour', 'weekday', 'ncloseCreditCard', 'unpayIndvLoan', 'unpayOtherLoan', 'unpayNormalLoan', '5yearBadloan', 'isNew']
        train_data.fillna(0, inplace=True)
        test_data = pd.read_csv('data/test.csv', engine='python')
        test_data.fillna(0, inplace=True)
        train_test_data = pd.concat([train_data, test_data], axis=0, ignore_index=True)
        train_test_data.fillna(-888, inplace=True)
        # dummy_fea = ["gender", "edu", "job"]
        dummy_fea = []
        # dummy_df = pd.get_dummies(train_test_data.loc[:, dummy_fea])
        # dummy_fea_rename_dict = {}
        # for per_i in dummy_df.columns.values:
        #     dummy_fea_rename_dict[per_i] = per_i + '_onehot'
        # print(">>>>>", dummy_fea_rename_dict)
        # dummy_df.rename(columns=dummy_fea_rename_dict)
        # train_test_data = pd.concat([train_test_data, dummy_df], axis=1)
        # train_test_data = train_test_data.drop(dummy_fea, axis=1)
        train_train = train_test_data.iloc[:train_data.shape[0], :]
        test_test = train_test_data.iloc[train_data.shape[0]:, :]
        train_train_x = train_train
        test_test_x = test_test
        # heamy dataset
        xgb_dataset = Dataset(X_train=train_train_x, y_train=train_data['target'], X_test=test_test_x, y_test=None, use_cache=False)
        print("---------------------------------------------------------------------------------------")
        print("Building the pipeline: ModelsPipeline(model_xgb, model_xgb2, model_xgb3, model_lgb, model_gbdt)")
        model_xgb = Regressor(dataset=xgb_dataset, estimator=xgb_feature, name='xgb', use_cache=False)
        model_xgb2 = Regressor(dataset=xgb_dataset, estimator=xgb_feature2, name='xgb2', use_cache=False)
        model_xgb3 = Regressor(dataset=xgb_dataset, estimator=xgb_feature3, name='xgb3', use_cache=False)
        model_gbdt = Regressor(dataset=xgb_dataset, estimator=gbdt_model, name='gbdt', use_cache=False)
        model_lgb = Regressor(dataset=xgb_dataset, estimator=lgb_feature, name='lgb', use_cache=False)
        model_rf = Regressor(dataset=xgb_dataset, estimator=rf_model, name='rf', use_cache=False)
        # pipeline = ModelsPipeline(model_xgb, model_xgb2, model_xgb3, model_lgb, model_gbdt, model_rf)
        pipeline = ModelsPipeline(model_xgb, model_xgb2, model_xgb3, model_lgb, model_rf)
        print("---------------------------------------------------------------------------------------")
        print("Stacking: pipeline.stack(k=7, seed=111, add_diff=False, full_test=True)")
        stack_ds = pipeline.stack(k=7, seed=111, add_diff=False, full_test=True)
        # k = 7, model_xgb, model_xgb2, model_xgb3, model_lgb, model_rf: AUC: 0.780043
        print("stack_ds: ", stack_ds)
        print("---------------------------------------------------------------------------------------")
        print("Training the stacker: Regressor(dataset=stack_ds, estimator=LinearRegression, parameters={'fit_intercept': False})")
        stacker = Regressor(dataset=stack_ds, estimator=LinearRegression, parameters={'fit_intercept': False})
        print("---------------------------------------------------------------------------------------")
        print("Predicting:")
        predict_result = stacker.predict()
        id_list = test_data["id"].tolist()
        d = {"id": id_list, "target": predict_result}
        res = DataFrame(d)  # convert the dict into a DataFrame
        print(">>>>", res)
        csv_file = 'stacking_res/res_stacking.csv'
        res.to_csv(csv_file)

Afterwards I did a lot more feature engineering and model fusion, but, perhaps because I do not know financial risk control well enough (and my own skills are limited), the score stopped there.

Summary

Looking back at the attempts above, even without building any features or tuning parameters, ways to improve the score include:

  • Switching to a better model
  • Using cross-validation
  • Using model ensembling

Of course, there are many more options for later improvement: data augmentation (expanding the data, handling class imbalance), data cleaning (outliers, distributions, and so on), feature engineering (feature selection, statistical features, normalization, encoding, binning, and so on), model selection, loss functions, and model ensembling. Trying every one of these is genuinely hard and relies on accumulated experience: what preprocessing a given model needs, what parameters suit a given data size, which feature-selection methods (chi-square, variance, model-based, distribution-based, and so on) to reach for.
In short, it all comes down to regular practice and steady accumulation.
