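iFLYTEK: Telecom Customer Churn Prediction Challenge baseline
Contents
- 1. Inspecting the distribution of each field
- 1.2 Automated data analysis with pandas_profiling
- 2. Training with the baseline parameters
- 3. Feature selection with null importances
- 3.2 Computing the score
- 3.3 Selecting the right features
- 4. Running the baseline end to end
- 4.1 Training with lgb
- 4.2 Training with xgb
- 4.3 Training with cat
- 4.4 Additionally dropping the '平均丢弃数据呼叫数' feature
- 5. Bayesian hyperparameter tuning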
from google.colab import drive
drive.mount('/content/drive')
import os
os.chdir('/content/drive/MyDrive/chinese task/讯飞-电信用户流失')
Mounted at /content/drive
References:
- 《2021科大讯飞-车辆贷款违约预测挑战赛 Top1方案》
- 《数据挖掘-租金预测》
- 《WSDM-爱奇艺:用户留存预测挑战赛 线上0.865》
- 《微信大数据挑战赛 亚军方案–如何用baseline上724+》
- 《用户购买预测比赛第十名方案》
!pip install unzip
!unzip '/content/drive/MyDrive/chinese task/讯飞-电信用户流失/电信客户流失预测挑战赛数据集.zip'
Load the dataset:
import pandas as pd
train = pd.read_csv('./train.csv')
test = pd.read_csv('./test.csv')
train
index | 客户ID | 地理区域 | 是否双频 | 是否翻新机 | 当前手机价格 | 手机网络功能 | 婚姻状况 | 家庭成人人数 | 信息库匹配 | 预计收入 | ... | 客户生命周期内平均月费用 | 客户生命周期内的平均每月使用分钟数 | 客户整个生命周期内的平均每月通话次数 | 过去三个月的平均每月使用分钟数 | 过去三个月的平均每月通话次数 | 过去三个月的平均月费用 | 过去六个月的平均每月使用分钟数 | 过去六个月的平均每月通话次数 | 过去六个月的平均月费用 | 是否流失 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 7 | 0 | -1 | 181 | 0 | 2 | 0 | 0 | 3 | ... | 24 | 286 | 91 | 351 | 121 | 23 | 303 | 101 | 25 | 0 |
1 | 1 | 13 | 1 | 0 | 1399 | 0 | 3 | 0 | 0 | 0 | ... | 44 | 447 | 190 | 483 | 199 | 40 | 488 | 202 | 44 | 1 |
2 | 2 | 14 | 1 | 0 | 927 | 0 | 2 | 4 | 0 | 6 | ... | 48 | 183 | 79 | 271 | 95 | 71 | 209 | 77 | 54 | 0 |
3 | 3 | 1 | 0 | 0 | 232 | 0 | 3 | -1 | 1 | -1 | ... | 42 | 303 | 166 | 473 | 226 | 72 | 446 | 219 | 65 | 1 |
4 | 4 | 0 | -1 | 0 | 699 | 0 | 1 | 2 | 0 | 3 | ... | 36 | 119 | 24 | 88 | 15 | 35 | 106 | 21 | 37 | 1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
149995 | 149995 | 10 | 1 | 0 | 1350 | 0 | 3 | 0 | 0 | 0 | ... | 156 | 474 | 160 | 239 | 80 | 74 | 346 | 122 | 83 | 1 |
149996 | 149996 | 6 | 1 | 0 | 542 | 0 | 3 | -1 | 1 | -1 | ... | 52 | 968 | 208 | 1158 | 257 | 58 | 1307 | 261 | 57 | 0 |
149997 | 149997 | 15 | 1 | 0 | 1300 | 0 | 1 | 2 | 0 | 6 | ... | 39 | 504 | 205 | 544 | 203 | 45 | 531 | 205 | 47 | 1 |
149998 | 149998 | 12 | 1 | 0 | 1399 | 0 | 4 | 1 | 0 | -1 | ... | 91 | 685 | 249 | 233 | 140 | 94 | 432 | 236 | 97 | 1 |
149999 | 149999 | 10 | 1 | 0 | 1049 | 0 | 3 | -1 | 1 | -1 | ... | 37 | 177 | 80 | 147 | 59 | 35 | 167 | 74 | 34 | 0 |
150000 rows × 69 columns
1. Inspecting the distribution of each field
train['是否流失'].value_counts()  # count the positive and negative samples
# Check for missing values
missing_counts = pd.DataFrame(train.isnull().sum())
missing_counts.columns = ['count_null']
missing_counts.describe()
# Print the dtype and the number of unique values of each field
for col in train.columns:
    print(f'{col} \t {train.dtypes[col]} {train[col].nunique()}')
import pandas as pd
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold
import time
from lightgbm import LGBMClassifier
import lightgbm as lgb
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import seaborn as sns
%matplotlib inline
import warnings
warnings.simplefilter('ignore', UserWarning)
import gc
gc.enable()
1.2 Automated data analysis with pandas_profiling
References:
- Official documentation
- 《使用pandas-profiling生成数据的详细报告》
- 《用pandas-profiling做出更好的探索性数据分析》
conda install -c conda-forge pandas-profiling
#!pip install -U pandas-profiling[notebook]
# Restart the kernel after installing
import pandas as pd
import pandas_profiling
data = pd.read_csv('./train.csv')
profile = data.profile_report(title='Pandas Profiling Report')
profile.to_file(output_file="telecom_customers_pandas_profiling.html")
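The Pandas Profiling Report shows:
- Categorical features: '地理区域' (geographic region), '是否双频' (dual band), '是否翻新机' (refurbished handset), '手机网络功能' (handset web capability), '婚姻状况' (marital status), '家庭成人人数' (number of adults in the household), '信息库匹配' (infobase match), '信用卡指示器' (credit card indicator), '新手机用户' (new phone user), '账户消费限额' (account spending limit)
- Features to bin: '预计收入' (estimated income)
- Features with outliers: '家庭中唯一订阅者的数量' (number of unique subscribers in the household), '家庭活跃用户数' (number of active subscribers in the household)
- Uninformative (heavily imbalanced) features: '平均呼叫转移呼叫数' (mean number of call-forwarding calls) and '平均丢弃数据呼叫数' (mean number of dropped data calls), where a single value covers 149797 and 148912 of the 150000 rows respectively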
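The imbalance noted in the last bullet can also be checked without the profiling report. The following is a minimal sketch, not the author's code: `dominant_value_share` and the toy frame are hypothetical names and data, and the 0.99 cut-off is an assumption chosen only for illustration. In the notebook the function would be applied to `train`.

```python
import pandas as pd

def dominant_value_share(df: pd.DataFrame) -> pd.Series:
    """Share of rows taken by the most frequent value of each column."""
    shares = {
        # value_counts sorts in descending order, so the first entry is the mode
        col: df[col].value_counts(normalize=True, dropna=False).iloc[0]
        for col in df.columns
    }
    return pd.Series(shares).sort_values(ascending=False)

# Hypothetical toy data standing in for the training frame
toy = pd.DataFrame({
    "almost_constant": [0] * 995 + [1] * 5,
    "balanced": [0, 1] * 500,
})

shares = dominant_value_share(toy)
# Assumed cut-off: flag columns where one value covers more than 99% of rows
near_constant = shares[shares > 0.99].index.tolist()
print(shares)
print(near_constant)
```

Columns flagged this way are only candidates for removal; whether dropping them helps still has to be confirmed on the validation score.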
# Separate numeric and categorical features
features=list(train.columns)
categorical_features =['地理区域','是否双频','是否翻新机','手机网络功能','婚姻状况','预计收入','家庭成人人数','信息库匹配','信用卡指示器','新手机用户','账户消费限额']
numeric_features =[item for item in features if item not in categorical_features]
numeric_features=[i for i in numeric_features if i not in ['客户ID','是否流失']]
# Categorical features grouped by their number of classes
categorical_features1 =['是否双频','是否翻新机','手机网络功能','信息库匹配','信用卡指示器','新手机用户','账户消费限额']
categorical_features2 =['地理区域','婚姻状况','预计收入','家庭成人人数']
# Handle a few outliers
#train[train['家庭中唯一订阅者的数量'].values > 13]=14
# The Pandas Profiling Report shows that the columns below are heavily imbalanced; print their value counts
# Some outliers are still left unhandled
cols=['家庭中唯一订阅者的数量','家庭活跃用户数','数据超载的平均费用','平均漫游呼叫数','平均丢弃数据呼叫数','平均占线数据调用次数','未应答数据呼叫的平均次数','尝试数据调用的平均数','完成数据调用的平均数','平均三通电话数','平均峰值数据调用次数','非高峰数据呼叫的平均数量','平均呼叫转移呼叫数']
for i in cols:
    print(train[i].value_counts())
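- learning_rate=0.2 gives roc=0.84479; 0.3 gives 0.8379; 0.15 gives 0.84578
- Changing 'num_leaves' from 30 to 45 gives 0.8468
Manual tuning like this makes little difference.
# In each of the following features about 99.5% of the rows share a single value, so they are candidates for removal. [149797,149493,149218,148912]
# roc=0.84479 at lr=0.2
null_clos=['平均呼叫转移呼叫数','平均占线数据调用次数','未应答数据呼叫的平均次数','平均丢弃数据呼叫数']
for i in null_clos:
    del train[i]
    del test[i]
train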
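2. Training with the baseline parameters
《科大讯飞:电信客户流失预测挑战赛baseline》
- All features: 10931 rounds, valid_acc=0.84298
- Null importance with 5000 rounds:
  - Features with split_feats > 0 (43 features): 14402 rounds, valid_acc=0.83887
  - Features with feats > 0 (23 features): 10946 rounds, valid_acc=0.8193
- Null importance with 1000 rounds:
  - Features with split_feats > 0 (66 features): 11817 rounds, valid_acc=0.84417
  - Features with feats > 0 (58 features): 11725 rounds, valid_acc=0.84345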
from sklearn.model_selection import train_test_split
# Split the labelled data into a training part and a held-out validation part
X_train, X_test, y_train, y_test = train_test_split(train.drop(labels=['客户ID','是否流失'], axis=1), train['是否流失'], random_state=10, test_size=0.2)
imp_df = pd.DataFrame()
# Written against the LightGBM 3.x API (silent, verbose_eval, early_stopping_rounds); newer releases use callbacks instead
lgb_train = lgb.Dataset(X_train, y_train, free_raw_data=False, silent=True)
lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train, free_raw_data=False, silent=True)
lgb_params = {'boosting_type': 'gbdt','objective': 'binary','metric': 'auc','min_child_weight': 5,'num_leaves': 2 ** 5,'lambda_l2': 10,'feature_fraction': 0.7,'bagging_fraction': 0.7,'bagging_freq': 10,'learning_rate': 0.2,'seed': 2022,'n_jobs':-1}
# Train for up to num_boost_round rounds (the original note says 5000, and the log below ends at round 4994 without early stopping),
# report the validation AUC every 300 rounds, and stop when it has not improved for 200 rounds
clf = lgb.train(params=lgb_params, train_set=lgb_train, valid_sets=lgb_eval, num_boost_round=50000, verbose_eval=300, early_stopping_rounds=200)
roc = roc_auc_score(y_test, clf.predict(X_test))
y_pred = [1 if x > 0.5 else 0 for x in clf.predict(X_test)]
acc = accuracy_score(y_test, y_pred)
Training until validation scores don't improve for 200 rounds.
[300] valid_0's auc: 0.733101
[600] valid_0's auc: 0.754127
[900] valid_0's auc: 0.766728
[1200] valid_0's auc: 0.777367
[1500] valid_0's auc: 0.78594
[1800] valid_0's auc: 0.792209
[2100] valid_0's auc: 0.798424
[2400] valid_0's auc: 0.80417
[2700] valid_0's auc: 0.808074
[3000] valid_0's auc: 0.811665
[3300] valid_0's auc: 0.814679
[3600] valid_0's auc: 0.817462
[3900] valid_0's auc: 0.820151
[4200] valid_0's auc: 0.822135
[4500] valid_0's auc: 0.824544
[4800] valid_0's auc: 0.825994
Did not meet early stopping. Best iteration is:
[4994] valid_0's auc: 0.826797
roc,acc
(0.8267972007033084, 0.7533)
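The two numbers above measure different things: the AUC (0.8268) depends only on how the predicted scores are ranked, while the accuracy (0.7533) depends on the 0.5 cut-off applied to them. A minimal NumPy sketch on made-up scores, using hypothetical helper names rather than the sklearn functions called in the notebook:

```python
import numpy as np

def auc_from_pairs(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    # All pairwise differences; ties count as half. Only suitable for small arrays.
    diff = pos[:, None] - neg[None, :]
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / (pos.size * neg.size)

def accuracy_at(y_true, y_score, threshold=0.5):
    """Accuracy after turning scores into labels with a fixed threshold."""
    y_pred = (np.asarray(y_score, dtype=float) > threshold).astype(int)
    return float((y_pred == np.asarray(y_true)).mean())

# Made-up labels and scores, for illustration only
y_true = [0, 0, 1, 1]
y_score = [0.10, 0.40, 0.35, 0.80]

auc = auc_from_pairs(y_true, y_score)
acc_default = accuracy_at(y_true, y_score)        # threshold 0.5
acc_strict = accuracy_at(y_true, y_score, 0.9)    # threshold 0.9
# Rescaling the scores keeps their order, so the AUC is unchanged
auc_rescaled = auc_from_pairs(y_true, np.array(y_score) / 5)
print(auc, acc_default, acc_strict, auc_rescaled)
```

Because the competition metric here is AUC, the 0.5 threshold only affects the reported accuracy, not the score being optimised.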
3. Feature selection with null importances
def get_feature_importances(X_train, X_test, y_train, y_test, shuffle, seed=None):
    # Feature names
    train_features = list(X_train.columns)
    # Optionally shuffle the target
    y_train, y_test = y_train.copy(), y_test.copy()
    if shuffle:
        # Here you could as well use a binomial distribution
        y_train, y_test = y_train.copy().sample(frac=1.0), y_test.copy().sample(frac=1.0)
    lgb_train = lgb.Dataset(X_train, y_train, free_raw_data=False, silent=True)
    lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train, free_raw_data=False, silent=True)
    # LightGBM parameters (gbdt boosting, same values as the baseline run)
    lgb_params = {'boosting_type': 'gbdt','objective': 'binary','metric': 'auc','min_child_weight': 5,'num_leaves': 2 ** 5,'lambda_l2': 10,'feature_fraction': 0.7,'bagging_fraction': 0.7,'bagging_freq': 10,'learning_rate': 0.2,'seed': 2022,'n_jobs':-1}
    # Train the model
    clf = lgb.train(params=lgb_params, train_set=lgb_train, valid_sets=lgb_eval, num_boost_round=500, verbose_eval=50, early_stopping_rounds=30)
    # Collect the feature importances
    imp_df = pd.DataFrame()
    imp_df["feature"] = list(train_features)
    imp_df["importance_gain"] = clf.feature_importance(importance_type='gain')
    imp_df["importance_split"] = clf.feature_importance(importance_type='split')
    # Despite its name, trn_score is the AUC on the held-out split (X_test, y_test)
    imp_df['trn_score'] = roc_auc_score(y_test, clf.predict(X_test))
    return imp_df
np.random.seed(123)
# Get the actual feature importances, i.e. without shuffling the target
actual_imp_df = get_feature_importances(X_train, X_test, y_train, y_test, shuffle=False)
actual_imp_df
Training until validation scores don't improve for 20 rounds.
[30] valid_0's auc: 0.695549
[60] valid_0's auc: 0.704629
[90] valid_0's auc: 0.711638
[120] valid_0's auc: 0.715182
[150] valid_0's auc: 0.718961
[180] valid_0's auc: 0.722121
[210] valid_0's auc: 0.725615
[240] valid_0's auc: 0.728251
[270] valid_0's auc: 0.730962
[300] valid_0's auc: 0.733101
[330] valid_0's auc: 0.73578
[360] valid_0's auc: 0.73886
[390] valid_0's auc: 0.741238
[420] valid_0's auc: 0.742486
[450] valid_0's auc: 0.744295
[480] valid_0's auc: 0.746555
Did not meet early stopping. Best iteration is:
[495] valid_0's auc: 0.747792
index | feature | importance_gain | importance_split | trn_score |
---|---|---|---|---|
0 | 地理区域 | 1956.600422 | 313 | 0.747792 |
1 | 是否双频 | 442.401141 | 62 | 0.747792 |
2 | 是否翻新机 | 269.466828 | 26 | 0.747792 |
3 | 当前手机价格 | 3838.696197 | 365 | 0.747792 |
4 | 手机网络功能 | 750.396258 | 51 | 0.747792 |
... | ... | ... | ... | ... |
62 | 过去三个月的平均每月通话次数 | 2540.721027 | 325 | 0.747792 |
63 | 过去三个月的平均月费用 | 2098.813867 | 304 | 0.747792 |
64 | 过去六个月的平均每月使用分钟数 | 2375.925741 | 337 | 0.747792 |
65 | 过去六个月的平均每月通话次数 | 2541.735172 | 346 | 0.747792 |
66 | 过去六个月的平均月费用 | 2103.062207 | 313 | 0.747792 |
67 rows × 4 columns
null_imp_df = pd.DataFrame()
nb_runs = 10
import time
start = time.time()
dsp = ''
for i in range(nb_runs):
    # Get the feature importances for this run (target shuffled)
    imp_df = get_feature_importances(X_train, X_test, y_train, y_test, shuffle=True)
    imp_df['run'] = i + 1
    # Append them to the null-importance table
    null_imp_df = pd.concat([null_imp_df, imp_df], axis=0)
    # Erase the previous progress message
    for l in range(len(dsp)):
        print('\b', end='', flush=True)
    # Display current run and time used
    spent = (time.time() - start) / 60
    dsp = 'Done with %4d of %4d (Spent %5.1f min)' % (i + 1, nb_runs, spent)
    print(dsp, end='', flush=True)
null_imp_df
index | feature | importance_gain | importance_split | trn_score | run |
---|---|---|---|---|---|
0 | 地理区域 | 38.622730 | 5 | 0.505320 | 1 |
1 | 是否双频 | 0.000000 | 0 | 0.505320 | 1 |
2 | 是否翻新机 | 0.000000 | 0 | 0.505320 | 1 |
3 | 当前手机价格 | 30.980300 | 4 | 0.505320 | 1 |
4 | 手机网络功能 | 0.000000 | 0 | 0.505320 | 1 |
... | ... | ... | ... | ... | ... |
62 | 过去三个月的平均每月通话次数 | 109.945481 | 14 | 0.503911 | 10 |
63 | 过去三个月的平均月费用 | 35.344621 | 4 | 0.503911 | 10 |
64 | 过去六个月的平均每月使用分钟数 | 55.200380 | 7 | 0.503911 | 10 |
65 | 过去六个月的平均每月通话次数 | 53.439080 | 6 | 0.503911 | 10 |
66 | 过去六个月的平均月费用 | 47.455200 | 6 | 0.503911 | 10 |
670 rows × 5 columns
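With the target shuffled, the AUC in the table above falls to about 0.50 and the importances shrink, which is the point of the null runs: they show how much importance a feature collects by chance. The idea can be reproduced on synthetic data without LightGBM. This is a sketch under stated assumptions: the importance measure is swapped for the absolute Pearson correlation with the target (not the split or gain importance used in the notebook), and all names and data are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
# Synthetic data: only column 0 carries signal about the target
X = rng.normal(size=(n, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=n) > 0).astype(float)

def abs_corr_importance(X, y):
    """Stand-in importance: |Pearson correlation| of each column with y."""
    return np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])

# Actual importances: target left untouched
actual = abs_corr_importance(X, y)
# Null importances: same measure after permuting the target, repeated 10 times
null = np.array([abs_corr_importance(X, rng.permutation(y)) for _ in range(10)])
# 75th percentile of the null distribution, per feature
threshold = np.percentile(null, 75, axis=0)

for j in range(X.shape[1]):
    print(j, round(actual[j], 3), round(threshold[j], 3), actual[j] > threshold[j])
```

A feature whose actual importance does not clearly exceed its null distribution is a candidate for removal.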
def display_distributions(actual_imp_df_, null_imp_df_, feature_):
    plt.figure(figsize=(13, 6))
    gs = gridspec.GridSpec(1, 2)
    # Plot split importances
    ax = plt.subplot(gs[0, 0])
    a = ax.hist(null_imp_df_.loc[null_imp_df_['feature'] == feature_, 'importance_split'].values, label='Null importances')
    ax.vlines(x=actual_imp_df_.loc[actual_imp_df_['feature'] == feature_, 'importance_split'].mean(), ymin=0, ymax=np.max(a[0]), color='r', linewidth=10, label='Real Target')
    ax.legend()
    ax.set_title('Split Importance of %s' % feature_.upper(), fontweight='bold')
    plt.xlabel('Null Importance (split) Distribution for %s ' % feature_.upper())
    # Plot gain importances
    ax = plt.subplot(gs[0, 1])
    a = ax.hist(null_imp_df_.loc[null_imp_df_['feature'] == feature_, 'importance_gain'].values, label='Null importances')
    ax.vlines(x=actual_imp_df_.loc[actual_imp_df_['feature'] == feature_, 'importance_gain'].mean(), ymin=0, ymax=np.max(a[0]), color='r', linewidth=10, label='Real Target')
    ax.legend()
    ax.set_title('Gain Importance of %s' % feature_.upper(), fontweight='bold')
    plt.xlabel('Null Importance (gain) Distribution for %s ' % feature_.upper())
# Plot the importance distributions of one feature. The original call passed 'DESTINATION_AIRPORT',
# which is not a column of this dataset; '当前设备使用天数' is used here instead.
display_distributions(actual_imp_df_=actual_imp_df, null_imp_df_=null_imp_df, feature_='当前设备使用天数')
[Figure xunfei_16_0.png (image not available): histograms of the null split and gain importances of one feature, with the real-target importance marked as a red line]
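plt.rcParams['font.sans-serif'] = ['SimHei']  # use the SimHei font so Chinese labels render
plt.rcParams['axes.unicode_minus'] = False  # keep the minus sign '-' from rendering as a box in saved figures
sns.set(font='SimHei')
3.2 Computing the score
The score of a feature is the log of its actual importance (target not shuffled) divided by one plus the 75th percentile of its null importances (target shuffled).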
feature_scores = []
for _f in actual_imp_df['feature'].unique():
    f_null_imps_gain = null_imp_df.loc[null_imp_df['feature'] == _f, 'importance_gain'].values
    f_act_imps_gain = actual_imp_df.loc[actual_imp_df['feature'] == _f, 'importance_gain'].mean()
    gain_score = np.log(1e-10 + f_act_imps_gain / (1 + np.percentile(f_null_imps_gain, 75)))  # Avoid divide by zero
    f_null_imps_split = null_imp_df.loc[null_imp_df['feature'] == _f, 'importance_split'].values
    f_act_imps_split = actual_imp_df.loc[actual_imp_df['feature'] == _f, 'importance_split'].mean()
    split_score = np.log(1e-10 + f_act_imps_split / (1 + np.percentile(f_null_imps_split, 75)))  # Avoid divide by zero
    feature_scores.append((_f, split_score, gain_score))
scores_df = pd.DataFrame(feature_scores, columns=['feature', 'split_score', 'gain_score'])
plt.figure(figsize=(16, 16))
gs = gridspec.GridSpec(1, 2)
# Plot Split importances
ax = plt.subplot(gs[0, 0])
sns.barplot(x='split_score', y='feature', data=scores_df.sort_values('split_score', ascending=False).iloc[0:70], ax=ax)
ax.set_title('Feature scores wrt split importances', fontweight='bold', fontsize=14)
# Plot Gain importances
ax = plt.subplot(gs[0, 1])
sns.barplot(x='gain_score', y='feature', data=scores_df.sort_values('gain_score', ascending=False).iloc[0:70], ax=ax)
ax.set_title('Feature scores wrt gain importances', fontweight='bold', fontsize=14)
plt.tight_layout()
null_imp_df.to_csv('null_importances_distribution_rf.csv')
actual_imp_df.to_csv('actual_importances_ditribution_rf.csv')
[Figure xunfei_19_1.png (image not available): bar charts "Feature scores wrt split importances" and "Feature scores wrt gain importances"]
[('当前设备使用天数', 21885.414773210883), ('当月使用分钟数与前三个月平均值的百分比变化', 17307.072956457734), ('每月平均使用分钟数', 12217.853455409408), ('在职总月数', 11940.929380342364), ('客户生命周期内的平均每月使用分钟数', 11776.946275830269), ('客户整个生命周期内的平均每月通话次数', 11571.01933504641), ('已完成语音通话的平均使用分钟数', 10899.402202293277), ('客户生命周期内的总费用', 10882.543393820524), ('当前手机价格', 10766.242197856307), ('使用高峰语音通话的平均不完整分钟数', 10392.122741535306), ('计费调整后的总费用', 10233.600193202496), ('当月费用与前三个月平均值的百分比变化', 10154.000930830836), ('客户生命周期内的总使用分钟数', 9959.518506526947), ('计费调整后的总分钟数', 9880.493449807167), ('客户生命周期内平均月费用', 9879.557141974568), ('客户生命周期内的总通话次数', 9863.276128590107), ('过去六个月的平均每月使用分钟数', 9739.2590110749), ('过去六个月的平均每月通话次数', 9574.12247480452), ('过去三个月的平均每月使用分钟数', 9345.73676533997), ('计费调整后的呼叫总数', 9230.227682426572)]
scores_df.sort_values(by="split_score",ascending=False,inplace=True)
scores_df
index | feature | split_score | gain_score |
---|---|---|---|
17 | 每月平均使用分钟数 | 4.152397 | 4.571279 |
60 | 客户整个生命周期内的平均每月通话次数 | 4.116323 | 4.226021 |
56 | 计费调整后的总分钟数 | 3.992808 | 3.961585 |
52 | 客户生命周期内的总通话次数 | 3.932502 | 4.008059 |
38 | 一分钟内的平均呼入电话数 | 3.832258 | 3.505356 |
... | ... | ... | ... |
35 | 完成数据调用的平均数 | 1.878771 | 2.220746 |
30 | 未应答数据呼叫的平均次数 | 1.791759 | 3.040400 |
28 | 平均占线数据调用次数 | 1.609438 | 2.565711 |
49 | 平均呼叫转移呼叫数 | 0.693147 | 2.221640 |
26 | 平均丢弃数据呼叫数 | -23.025851 | -23.025851 |
67 rows × 3 columns
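The value -23.025851 in the last row is what the score formula returns when the actual importance is zero, since the result is then log(1e-10). Likewise 0.693147 is ln 2, meaning the actual importance is about twice one plus the null 75th percentile. A minimal sketch of the same formula as a standalone function; the function name and the input numbers are illustrative, not taken from the notebook's data:

```python
import numpy as np

def null_importance_score(actual_importance, null_importances):
    """log(1e-10 + actual / (1 + 75th percentile of the null importances))."""
    null_q75 = np.percentile(null_importances, 75)
    # The 1e-10 keeps the log finite when the actual importance is zero
    return np.log(1e-10 + actual_importance / (1 + null_q75))

# A feature that is never used with the real target
never_used = null_importance_score(0.0, [0.0] * 10)
# A feature whose actual importance is twice (1 + null 75th percentile)
ratio_two = null_importance_score(2.0, [0.0, 0.0, 0.0, 0.0])
# A feature that looks no better than its null runs scores below zero
not_better = null_importance_score(5.0, [4.0, 5.0, 6.0, 7.0])

print(never_used)   # log(1e-10)
print(ratio_two)    # log(2)
print(not_better)
```

Scores well above zero indicate that the real target gives the feature far more importance than shuffled targets do.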
# Percentage of shuffled-target runs in which a feature's importance is below its actual importance.
# The original note mentions the 0.25 quantile, while the code takes the 35th percentile of the actual
# importance; with a single actual run per feature that percentile is just the actual value.
correlation_scores = []
for _f in actual_imp_df['feature'].unique():
    f_null_imps = null_imp_df.loc[null_imp_df['feature'] == _f, 'importance_gain'].values
    f_act_imps = actual_imp_df.loc[actual_imp_df['feature'] == _f, 'importance_gain'].values
    gain_score = 100 * (f_null_imps < np.percentile(f_act_imps, 35)).sum() / f_null_imps.size
    f_null_imps = null_imp_df.loc[null_imp_df['feature'] == _f, 'importance_split'].values
    f_act_imps = actual_imp_df.loc[actual_imp_df['feature'] == _f, 'importance_split'].values
    split_score = 100 * (f_null_imps < np.percentile(f_act_imps, 35)).sum() / f_null_imps.size
    correlation_scores.append((_f, split_score, gain_score))
corr_scores_df = pd.DataFrame(correlation_scores, columns=['feature', 'split_score', 'gain_score'])
fig = plt.figure(figsize=(16, 16))
gs = gridspec.GridSpec(1, 2)
# Plot Split importances
ax = plt.subplot(gs[0, 0])
sns.barplot(x='split_score', y='feature', data=corr_scores_df.sort_values('split_score', ascending=False).iloc[0:70], ax=ax)
ax.set_title('Feature scores wrt split importances', fontweight='bold', fontsize=14)
# Plot Gain importances
ax = plt.subplot(gs[0, 1])
sns.barplot(x='gain_score', y='feature', data=corr_scores_df.sort_values('gain_score', ascending=False).iloc[0:70], ax=ax)
ax.set_title('Feature scores wrt gain importances', fontweight='bold', fontsize=14)
plt.tight_layout()
plt.suptitle("Features' split and gain scores", fontweight='bold', fontsize=16)
fig.subplots_adjust(top=0.93)
[Figure xunfei_22_1.png (image not available): bar charts "Features' split and gain scores", split importances on the left and gain importances on the right]
corr_scores_df.sort_values(by="split_score",ascending=False,inplace=True)
corr_scores_df
index | feature | split_score | gain_score |
---|---|---|---|
0 | 地理区域 | 100.0 | 100.0 |
50 | 平均呼叫等待呼叫数 | 100.0 | 100.0 |
36 | 平均客户服务电话次数 | 100.0 | 100.0 |
37 | 使用客户服务电话的平均分钟数 | 100.0 | 100.0 |
38 | 一分钟内的平均呼入电话数 | 100.0 | 100.0 |
... | ... | ... | ... |
29 | 平均未接语音呼叫数 | 100.0 | 100.0 |
30 | 未应答数据呼叫的平均次数 | 100.0 | 100.0 |
31 | 尝试拨打的平均语音呼叫次数 | 100.0 | 100.0 |
66 | 过去六个月的平均月费用 | 100.0 | 100.0 |
26 | 平均丢弃数据呼叫数 | 0.0 | 0.0 |
67 rows × 3 columns
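Each value in this table is the percentage of shuffled-target runs in which the feature's importance stayed strictly below its actual importance. A small sketch of that computation; the function name is hypothetical and the input numbers are illustrative:

```python
import numpy as np

def pct_null_below_actual(actual_importance, null_importances):
    """Percentage of null runs whose importance is strictly below the actual one."""
    null_importances = np.asarray(null_importances, dtype=float)
    return 100.0 * (null_importances < actual_importance).sum() / null_importances.size

# Actual importance far above every null run: score 100
strong = pct_null_below_actual(1956.6, [38.6, 0.0, 12.1, 55.2])
# Actual importance of zero: no null run can be strictly below it, so score 0
unused = pct_null_below_actual(0.0, [0.0, 0.0, 3.2, 0.0])
# Actual importance in the middle of the null runs
middling = pct_null_below_actual(10.0, [2.0, 8.0, 12.0, 20.0])

print(strong, unused, middling)
```

With only 10 null runs the score moves in steps of 10 points, so it separates clearly useless features from the rest but cannot rank the features that all reach 100.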
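3.3 Selecting the right features
corr_scores_df shows that '平均丢弃数据呼叫数' (mean number of dropped data calls) carries no signal, so it can be dropped. After dropping it, the validation AUC in the logs here rises from 0.8268 to 0.8418. The earlier run ended at round 4994 without triggering early stopping while this one ran to round 8768, so part of the gain comes from longer training; at round 4800 the two runs stand at 0.8260 and 0.8290.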
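Choosing which features to keep from a score table can be written as a simple filter. The sketch below rebuilds three rows of the table by hand; the `select_features` helper and the threshold of 50 are assumptions for illustration, not values used in the notebook.

```python
import pandas as pd

# Three rows copied from the score table, for illustration
corr_scores = pd.DataFrame({
    "feature": ["地理区域", "过去六个月的平均月费用", "平均丢弃数据呼叫数"],
    "split_score": [100.0, 100.0, 0.0],
    "gain_score": [100.0, 100.0, 0.0],
})

def select_features(scores: pd.DataFrame, threshold: float, column: str = "gain_score") -> list:
    """Keep the features whose score in `column` reaches the threshold."""
    return scores.loc[scores[column] >= threshold, "feature"].tolist()

kept = select_features(corr_scores, threshold=50)   # assumed cut-off
dropped = sorted(set(corr_scores["feature"]) - set(kept))
print(kept)
print(dropped)
```

The resulting `dropped` list is what would be passed to `train.drop(labels=...)` together with the ID and target columns.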
X_train, X_test, y_train, y_test = train_test_split(train.drop(labels=['客户ID','是否流失','平均丢弃数据呼叫数'], axis=1), train['是否流失'], random_state=10, test_size=0.2)
imp_df = pd.DataFrame()
lgb_train = lgb.Dataset(X_train, y_train, free_raw_data=False, silent=True)
lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train, free_raw_data=False, silent=True)
lgb_params = {'boosting_type': 'gbdt','objective': 'binary','metric': 'auc','min_child_weight': 5,'num_leaves': 2 ** 5,'lambda_l2': 10,'feature_fraction': 0.7,'bagging_fraction': 0.7,'bagging_freq': 10,'learning_rate': 0.2,'seed': 2022,'n_jobs':-1}
# Train for up to 50000 rounds, report the validation AUC every 300 rounds,
# and stop when it has not improved for 200 rounds
clf = lgb.train(params=lgb_params, train_set=lgb_train, valid_sets=lgb_eval, num_boost_round=50000, verbose_eval=300, early_stopping_rounds=200)
roc = roc_auc_score(y_test, clf.predict(X_test))
y_pred = [1 if x > 0.5 else 0 for x in clf.predict(X_test)]
acc = accuracy_score(y_test, y_pred)
Training until validation scores don't improve for 200 rounds.
[300] valid_0's auc: 0.734833
[600] valid_0's auc: 0.753598
[900] valid_0's auc: 0.767934
[1200] valid_0's auc: 0.778701
[1500] valid_0's auc: 0.785552
[1800] valid_0's auc: 0.793379
[2100] valid_0's auc: 0.799713
[2400] valid_0's auc: 0.805404
[2700] valid_0's auc: 0.809381
[3000] valid_0's auc: 0.813516
[3300] valid_0's auc: 0.816289
[3600] valid_0's auc: 0.81927
[3900] valid_0's auc: 0.821682
[4200] valid_0's auc: 0.824342
[4500] valid_0's auc: 0.82676
[4800] valid_0's auc: 0.829004
[5100] valid_0's auc: 0.830592
[5400] valid_0's auc: 0.83205
[5700] valid_0's auc: 0.833626
[6000] valid_0's auc: 0.83478
[6300] valid_0's auc: 0.835981
[6600] valid_0's auc: 0.836975
[6900] valid_0's auc: 0.837994
[7200] valid_0's auc: 0.838715
[7500] valid_0's auc: 0.83963
[7800] valid_0's auc: 0.840372
[8100] valid_0's auc: 0.840644
[8400] valid_0's auc: 0.841068
[8700] valid_0's auc: 0.841685
Early stopping, best iteration is:
[8768] valid_0's auc: 0.841806
pred=clf.predict(X_test,num_iteration=clf.best_iteration)
roc,acc
(0.8418064634121478, 0.7683)
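The "Early stopping, best iteration is" message in the log comes from a patience rule: training stops once the validation score has gone a fixed number of rounds without a new best, and the best round is reported. The following is a plain-Python sketch of that rule on a made-up score sequence, not LightGBM's own implementation:

```python
def best_iteration(scores, patience):
    """Return (best_round, best_score, stopped_round) with 1-based rounds."""
    best_round, best_score = 0, float("-inf")
    for rnd, score in enumerate(scores, start=1):
        if score > best_score:
            # New best validation score: remember where it happened
            best_round, best_score = rnd, score
        elif rnd - best_round >= patience:
            # No improvement for `patience` rounds: stop here
            return best_round, best_score, rnd
    # Ran out of rounds without triggering the rule
    return best_round, best_score, len(scores)

# Made-up validation AUCs per round
scores = [0.70, 0.72, 0.74, 0.73, 0.735, 0.739, 0.75]

print(best_iteration(scores, patience=3))    # stops before the late improvement
print(best_iteration(scores, patience=10))   # runs through all rounds
```

A short patience can stop before a late improvement, as the first call shows, which is one reason the notebook uses 200 rounds with a slowly improving AUC curve.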
4. Running the baseline end to end
Baseline reference: https://mp.weixin.qq.com/s/nLgaGMJByOqRVWnm1UfB3g
!pip install catboost
import pandas as pd
import os
import gc
import lightgbm as lgb
import xgboost as xgb
from catboost import CatBoostRegressor
from sklearn.linear_model import SGDRegressor, LinearRegression, Ridge
from sklearn.preprocessing import MinMaxScaler
from gensim.models import Word2Vec
import math
import numpy as np
from tqdm import tqdm
from sklearn.model_selection import StratifiedKFold, KFold
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score, log_loss
import matplotlib.pyplot as plt
import time
import warnings
warnings.filterwarnings('ignore')
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
data = pd.concat([train, test], axis=0, ignore_index=True)
features = [f for f in data.columns if f not in ['是否流失','客户ID','平均丢弃数据呼叫数']]
train = data[data['是否流失'].notnull()].reset_index(drop=True)
test = data[data['是否流失'].isnull()].reset_index(drop=True)
x_train = train[features]
x_test = test[features]
y_train = train['是否流失']
4.1 Training with lgb
def cv_model(clf, train_x, train_y, test_x, clf_name):
    folds = 5
    seed = 2022
    kf = KFold(n_splits=folds, shuffle=True, random_state=seed)
    train = np.zeros(train_x.shape[0])  # out-of-fold predictions
    test = np.zeros(test_x.shape[0])    # test predictions averaged over folds
    cv_scores = []
    for i, (train_index, valid_index) in enumerate(kf.split(train_x, train_y)):
        print('************************************ {} ************************************'.format(str(i+1)))
        trn_x, trn_y, val_x, val_y = train_x.iloc[train_index], train_y[train_index], train_x.iloc[valid_index], train_y[valid_index]
        if clf_name == "lgb":
            train_matrix = clf.Dataset(trn_x, label=trn_y)
            valid_matrix = clf.Dataset(val_x, label=val_y)
            # Baseline parameters
            params = {'boosting_type': 'gbdt','objective': 'binary','metric': 'auc','num_leaves': 2 ** 5,'lambda_l2': 10,'feature_fraction': 0.7,'bagging_fraction': 0.7,'bagging_freq': 10,'learning_rate': 0.2,'seed': 2022,'n_jobs':-1}
            # Tuned ("optimal") parameters; defined here but not passed to clf.train below
            params2 = {'boosting_type': 'gbdt','objective': 'binary','metric': 'auc','bagging_fraction': 0.8864320989515848,'bagging_freq': 10,'feature_fraction': 0.7719195132945438,'lambda_l1': 4.0642058550131175,'lambda_l2': 0.7571744617226672,'learning_rate': 0.33853400726057015,'max_depth': 10,'min_gain_to_split': 0.47988339149638315,'num_leaves': 48,'seed': 2022,'n_jobs':-1}
            model = clf.train(params, train_matrix, 50000, valid_sets=[train_matrix, valid_matrix], categorical_feature=[], verbose_eval=3000, early_stopping_rounds=200)
            val_pred = model.predict(val_x, num_iteration=model.best_iteration)
            test_pred = model.predict(test_x, num_iteration=model.best_iteration)
            # Print the 20 features with the largest gain importance
            print(list(sorted(zip(features, model.feature_importance("gain")), key=lambda x: x[1], reverse=True))[:20])
        if clf_name == "xgb":
            train_matrix = clf.DMatrix(trn_x, label=trn_y)
            valid_matrix = clf.DMatrix(val_x, label=val_y)
            test_matrix = clf.DMatrix(test_x)
            params = {'booster': 'gbtree','objective': 'binary:logistic','eval_metric': 'auc','gamma': 1,'min_child_weight': 1.5,'max_depth': 5,'lambda': 10,'subsample': 0.7,'colsample_bytree': 0.7,'colsample_bylevel': 0.7,'eta': 0.2,'tree_method': 'exact','seed': 2020,'nthread': 36,"silent": True,}
            watchlist = [(train_matrix, 'train'), (valid_matrix, 'eval')]
            model = clf.train(params, train_matrix, num_boost_round=50000, evals=watchlist, verbose_eval=3000, early_stopping_rounds=200)
            val_pred = model.predict(valid_matrix, ntree_limit=model.best_ntree_limit)
            test_pred = model.predict(test_matrix, ntree_limit=model.best_ntree_limit)
        if clf_name == "cat":
            params = {'learning_rate': 0.2, 'depth': 5, 'l2_leaf_reg': 10, 'bootstrap_type': 'Bernoulli','od_type': 'Iter', 'od_wait': 50, 'random_seed': 11, 'allow_writing_files': False}
            model = clf(iterations=20000, **params)
            model.fit(trn_x, trn_y, eval_set=(val_x, val_y), cat_features=[], use_best_model=True, verbose=3000)
            val_pred = model.predict(val_x)
            test_pred = model.predict(test_x)
        train[valid_index] = val_pred
        # Accumulate the fold average; a plain assignment here would keep only the last fold
        test += test_pred / kf.n_splits
        cv_scores.append(roc_auc_score(val_y, val_pred))
        print(cv_scores)
    print("%s_scotrainre_list:" % clf_name, cv_scores)
    print("%s_score_mean:" % clf_name, np.mean(cv_scores))
    print("%s_score_std:" % clf_name, np.std(cv_scores))
    return train, test
def lgb_model(x_train, y_train, x_test):
    lgb_train, lgb_test = cv_model(lgb, x_train, y_train, x_test, "lgb")
    return lgb_train, lgb_test
def xgb_model(x_train, y_train, x_test):
    xgb_train, xgb_test = cv_model(xgb, x_train, y_train, x_test, "xgb")
    return xgb_train, xgb_test
def cat_model(x_train, y_train, x_test):
    cat_train, cat_test = cv_model(CatBoostRegressor, x_train, y_train, x_test, "cat")
    return cat_train, cat_test
lgb_train, lgb_test = lgb_model(x_train, y_train, x_test)  # per the execution log this takes about 21 minutes
test['是否流失'] = lgb_test
test[['客户ID','是否流失']].to_csv('lgb_base.csv', index=False)  # leaderboard score: 0.825
************************************ 1 ************************************
Training until validation scores don't improve for 200 rounds.
[3000] training's auc: 0.999488 valid_1's auc: 0.811334
Early stopping, best iteration is:
[5163] training's auc: 0.999996 valid_1's auc: 0.8289
[('当前设备使用天数', 21934.934124737978), ('当月使用分钟数与前三个月平均值的百分比变化', 17126.358324214816), ('在职总月数', 12409.957622632384), ('每月平均使用分钟数', 12073.125538095832), ('客户生命周期内的平均每月使用分钟数', 11994.06405813992), ('客户整个生命周期内的平均每月通话次数', 11518.068050682545), ('已完成语音通话的平均使用分钟数', 11292.594955265522), ('当前手机价格', 10964.187494635582), ('客户生命周期内的总费用', 10750.710047110915), ('使用高峰语音通话的平均不完整分钟数', 10274.193908914924), ('客户生命周期内的总使用分钟数', 10260.600554332137), ('当月费用与前三个月平均值的百分比变化', 10164.166730254889), ('计费调整后的总分钟数', 10095.02776375413), ('计费调整后的总费用', 10074.029564589262), ('客户生命周期内的总通话次数', 9900.794713005424), ('客户生命周期内平均月费用', 9874.11763061583), ('平均非高峰语音呼叫数', 9546.732098400593), ('过去六个月的平均每月通话次数', 9531.47578701377), ('过去六个月的平均每月使用分钟数', 9481.577100589871), ('计费调整后的呼叫总数', 9305.693744853139)]
[0.8288996222651557]
************************************ 2 ************************************
Training until validation scores don't improve for 200 rounds.
[3000] training's auc: 0.999472 valid_1's auc: 0.811878
Early stopping, best iteration is:
[4772] training's auc: 0.999971 valid_1's auc: 0.827608
[('当前设备使用天数', 21505.16284123063), ('当月使用分钟数与前三个月平均值的百分比变化', 16946.651323199272), ('每月平均使用分钟数', 12132.766281962395), ('在职总月数', 11971.832910627127), ('客户生命周期内的平均每月使用分钟数', 11526.178689315915), ('客户整个生命周期内的平均每月通话次数', 11283.326876536012), ('当前手机价格', 11003.212880536914), ('客户生命周期内的总费用', 10808.01029574871), ('已完成语音通话的平均使用分钟数', 10684.196997240186), ('使用高峰语音通话的平均不完整分钟数', 10399.707967177033), ('当月费用与前三个月平均值的百分比变化', 10358.123901829123), ('客户生命周期内的总使用分钟数', 10162.593608289957), ('客户生命周期内的总通话次数', 10073.619953781366), ('计费调整后的总费用', 9978.180806919932), ('计费调整后的总分钟数', 9764.853373721242), ('过去三个月的平均每月通话次数', 9391.67290854454), ('过去六个月的平均每月通话次数', 9381.156281203032), ('客户生命周期内平均月费用', 9243.235832542181), ('过去六个月的平均每月使用分钟数', 9032.57935705781), ('计费调整后的呼叫总数', 8945.249050289392)]
[0.8288996222651557, 0.8276084395403329]
************************************ 3 ************************************
Training until validation scores don't improve for 200 rounds.
[3000] training's auc: 0.999494 valid_1's auc: 0.811642
Early stopping, best iteration is:
[4663] training's auc: 0.99999 valid_1's auc: 0.827114
[('当前设备使用天数', 21289.608253866434), ('当月使用分钟数与前三个月平均值的百分比变化', 16997.806541010737), ('在职总月数', 12316.054881855845), ('客户生命周期内的平均每月使用分钟数', 11741.117707148194), ('每月平均使用分钟数', 11664.033028051257), ('已完成语音通话的平均使用分钟数', 11115.561656951904), ('客户整个生命周期内的平均每月通话次数', 10854.345216721296), ('当前手机价格', 10763.63857871294), ('客户生命周期内的总费用', 10621.98585870862), ('当月费用与前三个月平均值的百分比变化', 10375.685174629092), ('计费调整后的总费用', 10232.226524055004), ('客户生命周期内的总使用分钟数', 10052.964914098382), ('使用高峰语音通话的平均不完整分钟数', 9799.514198839664), ('计费调整后的总分钟数', 9735.032970786095), ('客户生命周期内平均月费用', 9637.621711835265), ('客户生命周期内的总通话次数', 9429.328524649143), ('过去六个月的平均每月使用分钟数', 9333.910300150514), ('计费调整后的呼叫总数', 9013.730677694082), ('过去六个月的平均每月通话次数', 8954.436415627599), ('过去六个月的平均月费用', 8829.167943418026)]
[0.8288996222651557, 0.8276084395403329, 0.8271140081312421]
************************************ 4 ************************************
Training until validation scores don't improve for 200 rounds.
[3000] training's auc: 0.999532 valid_1's auc: 0.814214
Early stopping, best iteration is:
[5281] training's auc: 0.999996 valid_1's auc: 0.830897
[('当前设备使用天数', 21271.166813850403), ('当月使用分钟数与前三个月平均值的百分比变化', 17270.63153974712), ('每月平均使用分钟数', 12677.148315995932), ('在职总月数', 12486.456961512566), ('客户生命周期内的平均每月使用分钟数', 11930.549542114139), ('客户整个生命周期内的平均每月通话次数', 11403.163509890437), ('已完成语音通话的平均使用分钟数', 11126.607083335519), ('当前手机价格', 10973.327338501811), ('当月费用与前三个月平均值的百分比变化', 10719.836767598987), ('客户生命周期内的总费用', 10684.931542679667), ('计费调整后的总费用', 10567.041279122233), ('计费调整后的总分钟数', 10477.076363384724), ('客户生命周期内的总使用分钟数', 10404.941493198276), ('客户生命周期内平均月费用', 10015.077973127365), ('使用高峰语音通话的平均不完整分钟数', 9988.746752500534), ('过去六个月的平均每月使用分钟数', 9924.928602397442), ('客户生命周期内的总通话次数', 9658.558003604412), ('平均非高峰语音呼叫数', 9605.689363330603), ('过去六个月的平均每月通话次数', 9560.14350926876), ('计费调整后的呼叫总数', 9525.798342213035)]
[0.8288996222651557, 0.8276084395403329, 0.8271140081312421, 0.8308971625977979]
************************************ 5 ************************************
Training until validation scores don't improve for 200 rounds.
[3000] training's auc: 0.999444 valid_1's auc: 0.8118
Early stopping, best iteration is:
[5148] training's auc: 0.999994 valid_1's auc: 0.829686
[('当前设备使用天数', 21662.356478646398), ('当月使用分钟数与前三个月平均值的百分比变化', 17710.528580009937), ('在职总月数', 12402.68640038371), ('每月平均使用分钟数', 11945.518620952964), ('客户生命周期内的平均每月使用分钟数', 11887.39459644258), ('已完成语音通话的平均使用分钟数', 11309.949122816324), ('客户整个生命周期内的平均每月通话次数', 11231.172733142972), ('客户生命周期内的总费用', 10822.351191923022), ('当前手机价格', 10691.375393077731), ('计费调整后的总费用', 10513.226110234857), ('当月费用与前三个月平均值的百分比变化', 10418.488398104906), ('客户生命周期内的总使用分钟数', 10276.142720848322), ('使用高峰语音通话的平均不完整分钟数', 10242.566086634994), ('计费调整后的总分钟数', 10193.664465650916), ('客户生命周期内的总通话次数', 10117.483586207032), ('客户生命周期内平均月费用', 9943.684495016932), ('过去六个月的平均每月通话次数', 9800.775234118104), ('过去三个月的平均每月通话次数', 9572.030710801482), ('过去六个月的平均每月使用分钟数', 9561.15305377543), ('平均非高峰语音呼叫数', 9292.315245553851)]
[0.8288996222651557, 0.8276084395403329, 0.8271140081312421, 0.8308971625977979, 0.8296855557324957]
lgb_scotrainre_list: [0.8288996222651557, 0.8276084395403329, 0.8271140081312421, 0.8308971625977979, 0.8296855557324957]
lgb_score_mean: 0.8288409576534048
lgb_score_std: 0.0013744978556818929
************************************ 1 ************************************
Training until validation scores don't improve for 200 rounds.
[3000] training's auc: 0.999474 valid_1's auc: 0.811874
Early stopping, best iteration is:
[4935] training's auc: 0.999996 valid_1's auc: 0.827972
[('当前设备使用天数', 21885.414773210883), ('当月使用分钟数与前三个月平均值的百分比变化', 17307.072956457734), ('每月平均使用分钟数', 12217.853455409408), ('在职总月数', 11940.929380342364), ('客户生命周期内的平均每月使用分钟数', 11776.946275830269), ('客户整个生命周期内的平均每月通话次数', 11571.01933504641), ('已完成语音通话的平均使用分钟数', 10899.402202293277), ('客户生命周期内的总费用', 10882.543393820524), ('当前手机价格', 10766.242197856307), ('使用高峰语音通话的平均不完整分钟数', 10392.122741535306), ('计费调整后的总费用', 10233.600193202496), ('当月费用与前三个月平均值的百分比变化', 10154.000930830836), ('客户生命周期内的总使用分钟数', 9959.518506526947), ('计费调整后的总分钟数', 9880.493449807167), ('客户生命周期内平均月费用', 9879.557141974568), ('客户生命周期内的总通话次数', 9863.276128590107), ('过去六个月的平均每月使用分钟数', 9739.2590110749), ('过去六个月的平均每月通话次数', 9574.12247480452), ('过去三个月的平均每月使用分钟数', 9345.73676533997), ('计费调整后的呼叫总数', 9230.227682426572)]
[0.8279715963308298]
************************************ 2 ************************************
Training until validation scores don't improve for 200 rounds.
[3000] training's auc: 0.999427 valid_1's auc: 0.810338
Early stopping, best iteration is:
[4648] training's auc: 0.999965 valid_1's auc: 0.824151
[('当前设备使用天数', 21631.878849938512), ('当月使用分钟数与前三个月平均值的百分比变化', 16730.961754366755), ('在职总月数', 12067.951921060681), ('每月平均使用分钟数', 12002.064660459757), ('客户生命周期内的平均每月使用分钟数', 11514.234459266067), ('客户整个生命周期内的平均每月通话次数', 11378.85348239541), ('已完成语音通话的平均使用分钟数', 10749.901214078069), ('当前手机价格', 10722.060040861368), ('客户生命周期内的总费用', 10603.264658093452), ('当月费用与前三个月平均值的百分比变化', 10405.526055783033), ('使用高峰语音通话的平均不完整分钟数', 10171.211520016193), ('客户生命周期内的总使用分钟数', 10006.355669140816), ('计费调整后的总分钟数', 9942.827439278364), ('客户生命周期内的总通话次数', 9937.020643949509), ('计费调整后的总费用', 9920.474541395903), ('过去六个月的平均每月使用分钟数', 9621.407806247473), ('客户生命周期内平均月费用', 9319.960188627243), ('过去三个月的平均每月通话次数', 9318.490131109953), ('平均月费用', 9294.081347599626), ('过去六个月的平均每月通话次数', 9203.844007015228)]
[0.8279715963308298, 0.8241509252411403]
************************************ 3 ************************************
Training until validation scores don't improve for 200 rounds.
[3000] training's auc: 0.99949 valid_1's auc: 0.810687
Early stopping, best iteration is:
[4731] training's auc: 0.999987 valid_1's auc: 0.825545
[('当前设备使用天数', 21968.028517633677), ('当月使用分钟数与前三个月平均值的百分比变化', 16903.005848184228), ('在职总月数', 12133.818779706955), ('客户生命周期内的平均每月使用分钟数', 11976.253827899694), ('每月平均使用分钟数', 11948.46197539568), ('已完成语音通话的平均使用分钟数', 11421.855388239026), ('客户整个生命周期内的平均每月通话次数', 11262.173004433513), ('当前手机价格', 11005.929363071918), ('客户生命周期内的总费用', 10528.124375209212), ('客户生命周期内的总使用分钟数', 10390.872772306204), ('计费调整后的总费用', 10347.706698387861), ('当月费用与前三个月平均值的百分比变化', 10124.151285156608), ('计费调整后的总分钟数', 9813.354337349534), ('使用高峰语音通话的平均不完整分钟数', 9805.469536915421), ('客户生命周期内平均月费用', 9772.446165367961), ('过去六个月的平均每月使用分钟数', 9544.928655743599), ('计费调整后的呼叫总数', 9390.860902503133), ('过去六个月的平均每月通话次数', 9323.151294022799), ('客户生命周期内的总通话次数', 9320.212619245052), ('过去三个月的平均每月通话次数', 9084.183073118329)]
[0.8279715963308298, 0.8241509252411403, 0.8255446840361296]
************************************ 4 ************************************
Training until validation scores don't improve for 200 rounds.
[3000] training's auc: 0.9995 valid_1's auc: 0.813234
Early stopping, best iteration is:
[5599] training's auc: 0.999997 valid_1's auc: 0.831782
[('当前设备使用天数', 21882.617314189672), ('当月使用分钟数与前三个月平均值的百分比变化', 17574.675364792347), ('每月平均使用分钟数', 12675.68729557097), ('在职总月数', 12567.960791677237), ('客户生命周期内的平均每月使用分钟数', 12466.111717522144), ('客户整个生命周期内的平均每月通话次数', 11556.674870744348), ('已完成语音通话的平均使用分钟数', 11522.147867411375), ('当前手机价格', 11065.775812849402), ('客户生命周期内的总使用分钟数', 10911.875026881695), ('客户生命周期内的总费用', 10715.607445791364), ('使用高峰语音通话的平均不完整分钟数', 10510.982212975621), ('当月费用与前三个月平均值的百分比变化', 10451.965088263154), ('计费调整后的总费用', 10446.603226020932), ('计费调整后的总分钟数', 10408.396666422486), ('过去六个月的平均每月使用分钟数', 10079.377708375454), ('客户生命周期内平均月费用', 10037.817246481776), ('客户生命周期内的总通话次数', 10017.892398029566), ('计费调整后的呼叫总数', 9739.093963235617), ('过去三个月的平均每月通话次数', 9609.546253487468), ('平均非高峰语音呼叫数', 9569.536746695638)]
[0.8279715963308298, 0.8241509252411403, 0.8255446840361296, 0.8317817862344651]
************************************ 5 ************************************
Training until validation scores don't improve for 200 rounds.
[3000] training's auc: 0.999448 valid_1's auc: 0.810144
Early stopping, best iteration is:
[5255] training's auc: 0.999999 valid_1's auc: 0.829245
[('当前设备使用天数', 21498.932903170586), ('当月使用分钟数与前三个月平均值的百分比变化', 17680.002600044012), ('在职总月数', 12638.706078097224), ('客户生命周期内的平均每月使用分钟数', 12569.80523788929), ('每月平均使用分钟数', 12267.705941140652), ('当前手机价格', 11370.256973087788), ('已完成语音通话的平均使用分钟数', 11110.097302675247), ('客户整个生命周期内的平均每月通话次数', 11020.642103403807), ('客户生命周期内的总费用', 10986.333106696606), ('计费调整后的总费用', 10700.256485000253), ('当月费用与前三个月平均值的百分比变化', 10575.144608184695), ('计费调整后的总分钟数', 10401.467713326216), ('使用高峰语音通话的平均不完整分钟数', 10237.447989702225), ('客户生命周期内的总通话次数', 10139.773517146707), ('客户生命周期内平均月费用', 10076.59566681087), ('客户生命周期内的总使用分钟数', 9953.696122318506), ('过去六个月的平均每月使用分钟数', 9595.342250138521), ('平均非高峰语音呼叫数', 9504.704583987594), ('过去六个月的平均每月通话次数', 9500.140991523862), ('计费调整后的呼叫总数', 9425.357908219099)]
[0.8279715963308298, 0.8241509252411403, 0.8255446840361296, 0.8317817862344651, 0.8292446287869105]
lgb_scotrainre_list: [0.8279715963308298, 0.8241509252411403, 0.8255446840361296, 0.8317817862344651, 0.8292446287869105]
lgb_score_mean: 0.827738724125895
lgb_score_std: 0.002696458502502849
4.2 Training with XGBoost
xgb_train,xgb_test=xgb_model(x_train,y_train,x_test)
test['是否流失'] = xgb_test
test[['客户ID','是否流失']].to_csv('xgb_base.csv',index=False)  # took 2h50min, far too slow
************************************ 1 ************************************
[0] train-auc:0.635939 eval-auc:0.634176
Multiple eval metrics have been passed: 'eval-auc' will be used for early stopping.
Will train until eval-auc hasn't improved in 200 rounds.
[3000] train-auc:0.992932 eval-auc:0.788708
[6000] train-auc:0.999906 eval-auc:0.807173
[9000] train-auc:0.999997 eval-auc:0.812868
Stopping. Best iteration:
[9945] train-auc:0.999999 eval-auc:0.814055
[0.8140550495535315]
************************************ 2 ************************************
[0] train-auc:0.636635 eval-auc:0.633894
Multiple eval metrics have been passed: 'eval-auc' will be used for early stopping.
Will train until eval-auc hasn't improved in 200 rounds.
[3000] train-auc:0.992878 eval-auc:0.790387
[6000] train-auc:0.99988 eval-auc:0.807621
Stopping. Best iteration:
[8538] train-auc:0.999991 eval-auc:0.812347
[0.8140550495535315, 0.8123468873894992]
************************************ 3 ************************************
[0] train-auc:0.637058 eval-auc:0.630979
Multiple eval metrics have been passed: 'eval-auc' will be used for early stopping.
Will train until eval-auc hasn't improved in 200 rounds.
[3000] train-auc:0.992874 eval-auc:0.790023
[6000] train-auc:0.999898 eval-auc:0.80827
[9000] train-auc:0.999996 eval-auc:0.813291
Stopping. Best iteration:
[8933] train-auc:0.999996 eval-auc:0.813342
[0.8140550495535315, 0.8123468873894992, 0.8133415339513355]
************************************ 4 ************************************
[0] train-auc:0.635278 eval-auc:0.633351
Multiple eval metrics have been passed: 'eval-auc' will be used for early stopping.
Will train until eval-auc hasn't improved in 200 rounds.
[3000] train-auc:0.993107 eval-auc:0.78905
[6000] train-auc:0.999903 eval-auc:0.808401
Stopping. Best iteration:
[8343] train-auc:0.999993 eval-auc:0.812439
[0.8140550495535315, 0.8123468873894992, 0.8133415339513355, 0.8124389857259089]
************************************ 5 ************************************
[0] train-auc:0.635985 eval-auc:0.633911
Multiple eval metrics have been passed: 'eval-auc' will be used for early stopping.
Will train until eval-auc hasn't improved in 200 rounds.
[3000] train-auc:0.992892 eval-auc:0.788101
[6000] train-auc:0.999904 eval-auc:0.805732
[9000] train-auc:0.999997 eval-auc:0.810194
Stopping. Best iteration:
[10041] train-auc:0.999999 eval-auc:0.811155
[0.8140550495535315, 0.8123468873894992, 0.8133415339513355, 0.8124389857259089, 0.8111551410360852]
xgb_scotrainre_list: [0.8140550495535315, 0.8123468873894992, 0.8133415339513355, 0.8124389857259089, 0.8111551410360852]
xgb_score_mean: 0.8126675195312721
xgb_score_std: 0.000982024071432044
4.3 Training with CatBoost
cat_train,cat_test=cat_model(x_train,y_train,x_test)  # 22 min, about the same as lgb
test['是否流失'] = cat_test
test[['客户ID','是否流失']].to_csv('cat_base.csv',index=False)
************************************ 1 ************************************
0: learn: 0.4955489 test: 0.4954619 best: 0.4954619 (0) total: 233ms remaining: 1h 17m 39s
3000: learn: 0.3769726 test: 0.4483572 best: 0.4483572 (3000) total: 2m 5s remaining: 11m 50s
6000: learn: 0.3209359 test: 0.4391546 best: 0.4391520 (5999) total: 4m remaining: 9m 21s
Stopped by overfitting detector (50 iterations wait)
bestTest = 0.4360869428
bestIteration = 7499
Shrink model to first 7500 iterations.
[0.78868229695141]
************************************ 2 ************************************
0: learn: 0.4953117 test: 0.4954092 best: 0.4954092 (0) total: 39.5ms remaining: 13m 10s
3000: learn: 0.3763302 test: 0.4490481 best: 0.4490378 (2981) total: 1m 46s remaining: 10m 2s
6000: learn: 0.3196365 test: 0.4402621 best: 0.4402621 (6000) total: 3m 38s remaining: 8m 30s
Stopped by overfitting detector (50 iterations wait)
bestTest = 0.4361341716
bestIteration = 8001
Shrink model to first 8002 iterations.
[0.78868229695141, 0.7897985044313038]
************************************ 3 ************************************
0: learn: 0.4954711 test: 0.4955905 best: 0.4955905 (0) total: 38.5ms remaining: 12m 49s
3000: learn: 0.3763265 test: 0.4477431 best: 0.4477431 (3000) total: 1m 49s remaining: 10m 21s
Stopped by overfitting detector (50 iterations wait)
bestTest = 0.4406746798
bestIteration = 5128
Shrink model to first 5129 iterations.
[0.78868229695141, 0.7897985044313038, 0.7788144016087264]
************************************ 4 ************************************
0: learn: 0.4955798 test: 0.4955669 best: 0.4955669 (0) total: 46.1ms remaining: 15m 21s
3000: learn: 0.3768704 test: 0.4486424 best: 0.4486421 (2997) total: 1m 45s remaining: 9m 59s
Stopped by overfitting detector (50 iterations wait)
bestTest = 0.4426386429
bestIteration = 4903
Shrink model to first 4904 iterations.
[0.78868229695141, 0.7897985044313038, 0.7788144016087264, 0.7744056829683829]
************************************ 5 ************************************
0: learn: 0.4955262 test: 0.4956471 best: 0.4956471 (0) total: 38.9ms remaining: 12m 57s
3000: learn: 0.3761659 test: 0.4494234 best: 0.4494234 (3000) total: 1m 47s remaining: 10m 11s
6000: learn: 0.3202277 test: 0.4407377 best: 0.4407330 (5999) total: 3m 31s remaining: 8m 12s
9000: learn: 0.2781913 test: 0.4347233 best: 0.4347168 (8998) total: 5m 14s remaining: 6m 24s
Stopped by overfitting detector (50 iterations wait)
bestTest = 0.4323322625
bestIteration = 10483
Shrink model to first 10484 iterations.
[0.78868229695141, 0.7897985044313038, 0.7788144016087264, 0.7744056829683829, 0.7982800693357867]
cat_scotrainre_list: [0.78868229695141, 0.7897985044313038, 0.7788144016087264, 0.7744056829683829, 0.7982800693357867]
cat_score_mean: 0.785996191059122
cat_score_std: 0.0084674009574612
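To compare the XGBoost and CatBoost runs side by side, the per-fold AUCs printed in the logs above can be summarized in a few lines. This is only a sketch: the helper name `summarize` is hypothetical, and it uses the population standard deviation, which is what NumPy's `np.std` computes by default and which reproduces the `*_score_std` values in the logs.

```python
from statistics import fmean, pstdev

# Per-fold validation AUCs copied from the training logs above.
fold_auc = {
    "xgb": [0.8140550495535315, 0.8123468873894992, 0.8133415339513355,
            0.8124389857259089, 0.8111551410360852],
    "cat": [0.78868229695141, 0.7897985044313038, 0.7788144016087264,
            0.7744056829683829, 0.7982800693357867],
}


def summarize(scores):
    """Return (mean, population std) for a list of fold scores."""
    return fmean(scores), pstdev(scores)


summary = {name: summarize(scores) for name, scores in fold_auc.items()}

# Print models from best to worst mean AUC.
for name, (mean, std) in sorted(summary.items(), key=lambda kv: kv[1][0], reverse=True):
    print(f"{name}: mean={mean:.6f} std={std:.6f}")
```

CatBoost's fold-to-fold spread is roughly an order of magnitude larger than XGBoost's here, which is worth keeping in mind when comparing single-fold numbers.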
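4.4 Also dropping the '平均丢弃数据呼叫数' feature
The score got worse: the mean CV AUC is 0.8275, below the LightGBM runs reported above (0.8288 and 0.8277).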
lgb_train,lgb_test=lgb_model(x_train,y_train,x_test)
************************************ 1 ************************************
Training until validation scores don't improve for 200 rounds.
[3000] training's auc: 0.999535 valid_1's auc: 0.811083
Early stopping, best iteration is:
[5495] training's auc: 0.999999 valid_1's auc: 0.830252
[('当前设备使用天数', 21646.43981860578), ('当月使用分钟数与前三个月平均值的百分比变化', 17622.58995847404), ('在职总月数', 12633.31053687632), ('每月平均使用分钟数', 12317.316355511546), ('客户整个生命周期内的平均每月通话次数', 12213.196875602007), ('客户生命周期内的平均每月使用分钟数', 11988.745236545801), ('已完成语音通话的平均使用分钟数', 11742.254607230425), ('客户生命周期内的总费用', 10961.734202891588), ('客户生命周期内的总使用分钟数', 10739.284949079156), ('当前手机价格', 10717.661178082228), ('使用高峰语音通话的平均不完整分钟数', 10648.361330911517), ('当月费用与前三个月平均值的百分比变化', 10563.12071943283), ('客户生命周期内平均月费用', 10260.813065826893), ('计费调整后的总费用', 10214.983077257872), ('客户生命周期内的总通话次数', 10042.090887442231), ('过去六个月的平均每月使用分钟数', 10030.256944060326), ('计费调整后的总分钟数', 9833.17426289618), ('过去六个月的平均每月通话次数', 9658.642087131739), ('平均非高峰语音呼叫数', 9604.195981651545), ('过去三个月的平均每月使用分钟数', 9474.32663051784)]
[0.8302521387863329]
************************************ 2 ************************************
Training until validation scores don't improve for 200 rounds.
[3000] training's auc: 0.999484 valid_1's auc: 0.812252
Early stopping, best iteration is:
[4761] training's auc: 0.999991 valid_1's auc: 0.827726
[('当前设备使用天数', 20778.929791480303), ('当月使用分钟数与前三个月平均值的百分比变化', 17059.72723968327), ('在职总月数', 12247.527016088367), ('每月平均使用分钟数', 12162.8245485425), ('客户生命周期内的平均每月使用分钟数', 11649.190486937761), ('客户整个生命周期内的平均每月通话次数', 11235.27798551321), ('已完成语音通话的平均使用分钟数', 10887.697177901864), ('客户生命周期内的总使用分钟数', 10537.405863419175), ('客户生命周期内的总费用', 10427.963113591075), ('当前手机价格', 10388.50929298997), ('当月费用与前三个月平均值的百分比变化', 10345.741146698594), ('使用高峰语音通话的平均不完整分钟数', 10325.746990069747), ('计费调整后的总费用', 10308.259309798479), ('客户生命周期内的总通话次数', 9878.29905757308), ('过去六个月的平均每月使用分钟数', 9860.522675991058), ('计费调整后的总分钟数', 9831.829701200128), ('客户生命周期内平均月费用', 9413.955781325698), ('平均月费用', 9256.14368981123), ('过去三个月的平均每月通话次数', 9233.180386424065), ('过去六个月的平均每月通话次数', 9178.422535061836)]
[0.8302521387863329, 0.8277260767493848]
************************************ 3 ************************************
Training until validation scores don't improve for 200 rounds.
[3000] training's auc: 0.999503 valid_1's auc: 0.812223
Early stopping, best iteration is:
[4737] training's auc: 0.999988 valid_1's auc: 0.826507
[('当前设备使用天数', 21187.639979198575), ('当月使用分钟数与前三个月平均值的百分比变化', 17066.7826115638), ('在职总月数', 12178.71656690538), ('每月平均使用分钟数', 11915.060246050358), ('客户生命周期内的平均每月使用分钟数', 11457.53249040246), ('已完成语音通话的平均使用分钟数', 11197.47149656713), ('客户整个生命周期内的平均每月通话次数', 11062.857962206006), ('当前手机价格', 10535.98642912507), ('计费调整后的总费用', 10396.114720955491), ('当月费用与前三个月平均值的百分比变化', 10280.928569793701), ('使用高峰语音通话的平均不完整分钟数', 10159.540036082268), ('过去六个月的平均每月使用分钟数', 10114.058793380857), ('客户生命周期内的总使用分钟数', 10109.089174315333), ('客户生命周期内的总费用', 10081.144412502646), ('计费调整后的总分钟数', 10064.824367910624), ('客户生命周期内的总通话次数', 9710.811524420977), ('过去六个月的平均每月通话次数', 9568.110130429268), ('客户生命周期内平均月费用', 9536.692147105932), ('计费调整后的呼叫总数', 9272.926451265812), ('过去三个月的平均每月通话次数', 9104.1763061136)]
[0.8302521387863329, 0.8277260767493848, 0.8265070441159572]
************************************ 4 ************************************
Training until validation scores don't improve for 200 rounds.
[3000] training's auc: 0.999558 valid_1's auc: 0.812316
Early stopping, best iteration is:
[4955] training's auc: 0.999985 valid_1's auc: 0.82816
[('当前设备使用天数', 20919.606680095196), ('当月使用分钟数与前三个月平均值的百分比变化', 17050.523352131248), ('在职总月数', 12673.502319052815), ('每月平均使用分钟数', 12145.743713662028), ('客户生命周期内的平均每月使用分钟数', 12082.749334529042), ('已完成语音通话的平均使用分钟数', 11270.388913482428), ('客户整个生命周期内的平均每月通话次数', 11032.332806184888), ('客户生命周期内的总费用', 10647.951857417822), ('计费调整后的总费用', 10599.385332718492), ('客户生命周期内的总使用分钟数', 10490.505580991507), ('当前手机价格', 10461.154125005007), ('当月费用与前三个月平均值的百分比变化', 10269.522361278534), ('使用高峰语音通话的平均不完整分钟数', 10231.192073732615), ('客户生命周期内的总通话次数', 9965.85817475617), ('计费调整后的总分钟数', 9773.746473029256), ('客户生命周期内平均月费用', 9764.829889595509), ('过去六个月的平均每月使用分钟数', 9703.316017881036), ('过去六个月的平均每月通话次数', 9595.259186178446), ('平均非高峰语音呼叫数', 9585.856355905533), ('计费调整后的呼叫总数', 9195.526195570827)]
[0.8302521387863329, 0.8277260767493848, 0.8265070441159572, 0.8281604518378232]
************************************ 5 ************************************
Training until validation scores don't improve for 200 rounds.
[3000] training's auc: 0.999494 valid_1's auc: 0.809363
Early stopping, best iteration is:
[4829] training's auc: 0.999983 valid_1's auc: 0.824736
[('当前设备使用天数', 20857.728651717305), ('当月使用分钟数与前三个月平均值的百分比变化', 17141.65538044274), ('在职总月数', 12623.7158523947), ('每月平均使用分钟数', 12155.711411625147), ('客户生命周期内的平均每月使用分钟数', 11755.307457834482), ('客户整个生命周期内的平均每月通话次数', 11121.649592876434), ('客户生命周期内的总费用', 10800.35821519792), ('当前手机价格', 10647.860997959971), ('已完成语音通话的平均使用分钟数', 10567.15585295856), ('客户生命周期内的总使用分钟数', 10455.313509970903), ('计费调整后的总费用', 10241.350874692202), ('当月费用与前三个月平均值的百分比变化', 10177.092842921615), ('客户生命周期内的总通话次数', 10139.20638936758), ('使用高峰语音通话的平均不完整分钟数', 9981.980402067304), ('计费调整后的总分钟数', 9756.786857843399), ('过去六个月的平均每月使用分钟数', 9725.03030230105), ('客户生命周期内平均月费用', 9604.02791416645), ('计费调整后的呼叫总数', 9452.47144331038), ('平均非高峰语音呼叫数', 9228.985016450286), ('过去六个月的平均每月通话次数', 9228.196154907346)]
[0.8302521387863329, 0.8277260767493848, 0.8265070441159572, 0.8281604518378232, 0.8247357260735417]
lgb_scotrainre_list: [0.8302521387863329, 0.8277260767493848, 0.8265070441159572, 0.8281604518378232, 0.8247357260735417]
lgb_score_mean: 0.8274762875126079
lgb_score_std: 0.0018267969533472914
5. Bayesian hyperparameter tuning
#!pip install bayesian-optimization
from bayes_opt import BayesianOptimization
def LGB_bayesian(
        num_leaves,  # int
        bagging_freq,  # int
        learning_rate,
        feature_fraction,
        bagging_fraction,
        lambda_l1,
        lambda_l2,
        min_gain_to_split,
        max_depth):
    # The optimizer proposes floats; LightGBM expects num_leaves and max_depth
    # to be integers, so cast them here.
    num_leaves = int(num_leaves)
    max_depth = int(max_depth)
    assert type(num_leaves) == int
    assert type(max_depth) == int
    # NOTE: bagging_freq is passed through as proposed (not cast), and
    # min_gain_to_split is accepted as an argument but is not included in param.
    param = {
        'num_leaves': num_leaves,
        'learning_rate': learning_rate,
        'bagging_fraction': bagging_fraction,
        'bagging_freq': bagging_freq,
        'feature_fraction': feature_fraction,
        'lambda_l1': lambda_l1,
        'lambda_l2': lambda_l2,
        'max_depth': max_depth,
        'objective': 'binary',
        'boosting_type': 'gbdt',
        'verbose': 1,
        'metric': 'auc',
        'seed': 2022,
        'feature_fraction_seed': 2022,
        'bagging_seed': 2022,
        'drop_seed': 2022,
        'data_random_seed': 2022,
        'is_unbalance': True,
        'boost_from_average': False,
        'save_binary': True,
    }
    # X_train, X_test, y_train, y_test are read from the enclosing (global) scope.
    lgb_train = lgb.Dataset(X_train, y_train, free_raw_data=False, silent=True)
    lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train, free_raw_data=False, silent=True)
    num_round = 10000
    clf = lgb.train(param, lgb_train, num_round, valid_sets=lgb_eval,
                    verbose_eval=500, early_stopping_rounds=200)
    # The optimization target is the AUC on the held-out split.
    roc = roc_auc_score(y_test, clf.predict(X_test, num_iteration=clf.best_iteration))
    return roc
import warnings
from sklearn.model_selection import train_test_split

# Hold out 20% of the training data; LGB_bayesian reads these four variables as globals.
X_train, X_test, y_train, y_test = train_test_split(
    train.drop(labels=['客户ID', '是否流失', '平均丢弃数据呼叫数'], axis=1),
    train['是否流失'],
    random_state=10,
    test_size=0.2)

# Search range for each hyperparameter.
bounds_LGB = {
    'num_leaves': (5, 50),
    'learning_rate': (0.03, 0.5),
    'feature_fraction': (0.1, 1),
    'bagging_fraction': (0.1, 1),
    'bagging_freq': (0, 10),
    'lambda_l1': (0, 5.0),
    'lambda_l2': (0, 10),
    'min_gain_to_split': (0, 1.0),
    'max_depth': (5, 15),
}

LGB_BO = BayesianOptimization(LGB_bayesian, bounds_LGB, random_state=13)
init_points = 5
n_iter = 15
print('-' * 130)
with warnings.catch_warnings():
    warnings.filterwarnings('ignore')
    LGB_BO.maximize(init_points=init_points, n_iter=n_iter, acq='ucb', xi=0.0)
----------------------------------------------------------------------------------------------------------------------------------
| iter | target | baggin... | baggin... | featur... | lambda_l1 | lambda_l2 | learni... | max_depth | min_ga... | num_le... |
-------------------------------------------------------------------------------------------------------------------------------------
Training until validation scores don't improve for 200 rounds.
[500] valid_0's auc: 0.756501
[1000] valid_0's auc: 0.779654
[1500] valid_0's auc: 0.795342
[2000] valid_0's auc: 0.804397
[2500] valid_0's auc: 0.812615
[3000] valid_0's auc: 0.818713
[3500] valid_0's auc: 0.82294
[4000] valid_0's auc: 0.826771
[4500] valid_0's auc: 0.82971
[5000] valid_0's auc: 0.832648
Did not meet early stopping. Best iteration is:
[5000] valid_0's auc: 0.832648
| 1 | 0.8326 | 0.7999 | 2.375 | 0.8419 | 4.829 | 9.726 | 0.2431 | 11.09 | 0.7755 | 33.87 |
Training until validation scores don't improve for 200 rounds.
[500] valid_0's auc: 0.735036
[1000] valid_0's auc: 0.754152
[1500] valid_0's auc: 0.767662
[2000] valid_0's auc: 0.778614
[2500] valid_0's auc: 0.786152
[3000] valid_0's auc: 0.792418
[3500] valid_0's auc: 0.79872
[4000] valid_0's auc: 0.803314
[4500] valid_0's auc: 0.807683
[5000] valid_0's auc: 0.81121
Did not meet early stopping. Best iteration is:
[5000] valid_0's auc: 0.81121
| 2 | 0.8112 | 0.7498 | 0.3504 | 0.3686 | 0.2926 | 8.571 | 0.2052 | 11.8 | 0.2563 | 20.64 |
Training until validation scores don't improve for 200 rounds.
[500] valid_0's auc: 0.749544
[1000] valid_0's auc: 0.774443
[1500] valid_0's auc: 0.78951
[2000] valid_0's auc: 0.800602
[2500] valid_0's auc: 0.80823
[3000] valid_0's auc: 0.814024
[3500] valid_0's auc: 0.81853
[4000] valid_0's auc: 0.821672
[4500] valid_0's auc: 0.823975
[5000] valid_0's auc: 0.826105
Did not meet early stopping. Best iteration is:
[5000] valid_0's auc: 0.826105
| 3 | 0.8261 | 0.1085 | 3.583 | 0.9542 | 1.089 | 3.194 | 0.4614 | 5.319 | 0.06508 | 33.34 |
Training until validation scores don't improve for 200 rounds.
[500] valid_0's auc: 0.776662
[1000] valid_0's auc: 0.804675
[1500] valid_0's auc: 0.817655
[2000] valid_0's auc: 0.826085
[2500] valid_0's auc: 0.831839
[3000] valid_0's auc: 0.835281
Early stopping, best iteration is:
[3179] valid_0's auc: 0.836292
| 4 | 0.8363 | 0.8864 | 0.08716 | 0.7719 | 4.064 | 0.7572 | 0.3385 | 10.09 | 0.4799 | 48.0 |
Training until validation scores don't improve for 200 rounds.
[500] valid_0's auc: 0.751437
[1000] valid_0's auc: 0.777091
[1500] valid_0's auc: 0.793125
[2000] valid_0's auc: 0.805084
[2500] valid_0's auc: 0.812527
[3000] valid_0's auc: 0.81902
[3500] valid_0's auc: 0.823788
[4000] valid_0's auc: 0.827882
[4500] valid_0's auc: 0.831144
[5000] valid_0's auc: 0.834175
Did not meet early stopping. Best iteration is:
[5000] valid_0's auc: 0.834175
| 5 | 0.8342 | 0.1 | 2.47 | 0.741 | 1.623 | 2.77 | 0.3569 | 14.19 | 0.2445 | 25.61 |
Training until validation scores don't improve for 200 rounds.
[500] valid_0's auc: 0.731224
[1000] valid_0's auc: 0.749423
[1500] valid_0's auc: 0.760884
[2000] valid_0's auc: 0.76975
[2500] valid_0's auc: 0.777677
[3000] valid_0's auc: 0.785282
[3500] valid_0's auc: 0.791667
[4000] valid_0's auc: 0.796303
[4500] valid_0's auc: 0.800412
[5000] valid_0's auc: 0.804301
Did not meet early stopping. Best iteration is:
[5000] valid_0's auc: 0.804301
| 6 | 0.8043 | 0.6073 | 1.211 | 0.312 | 0.3293 | 7.263 | 0.2122 | 7.399 | 0.2959 | 19.23 |
Training until validation scores don't improve for 200 rounds.
[500] valid_0's auc: 0.758604
[1000] valid_0's auc: 0.783179
[1500] valid_0's auc: 0.798202
[2000] valid_0's auc: 0.808477
[2500] valid_0's auc: 0.816619
[3000] valid_0's auc: 0.821904
[3500] valid_0's auc: 0.825642
[4000] valid_0's auc: 0.828837
[4500] valid_0's auc: 0.83184
[5000] valid_0's auc: 0.833605
Did not meet early stopping. Best iteration is:
[4998] valid_0's auc: 0.833608
| 7 | 0.8336 | 0.5285 | 0.4363 | 0.5314 | 4.917 | 0.0 | 0.2589 | 15.0 | 0.5876 | 36.33 |
Training until validation scores don't improve for 200 rounds.
[500] valid_0's auc: 0.775361
[1000] valid_0's auc: 0.802306
[1500] valid_0's auc: 0.816144
[2000] valid_0's auc: 0.823617
[2500] valid_0's auc: 0.828743
[3000] valid_0's auc: 0.831524
Early stopping, best iteration is:
[3085] valid_0's auc: 0.831879
| 8 | 0.8319 | 1.0 | 8.511 | 1.0 | 4.544 | 7.213 | 0.4367 | 15.0 | 1.0 | 45.52 |
Training until validation scores don't improve for 200 rounds.
[500] valid_0's auc: 0.759645
[1000] valid_0's auc: 0.786561
[1500] valid_0's auc: 0.802323
[2000] valid_0's auc: 0.81118
[2500] valid_0's auc: 0.817364
[3000] valid_0's auc: 0.821898
[3500] valid_0's auc: 0.824679
Early stopping, best iteration is:
[3739] valid_0's auc: 0.826167
| 9 | 0.8262 | 0.1 | 10.0 | 1.0 | 5.0 | 0.0 | 0.5 | 15.0 | 1.0 | 30.18 |
Training until validation scores don't improve for 200 rounds.
[500] valid_0's auc: 0.708925
[1000] valid_0's auc: 0.721584
[1500] valid_0's auc: 0.729905
[2000] valid_0's auc: 0.736464
[2500] valid_0's auc: 0.741614
[3000] valid_0's auc: 0.746397
[3500] valid_0's auc: 0.750703
[4000] valid_0's auc: 0.754139
[4500] valid_0's auc: 0.757474
[5000] valid_0's auc: 0.760932
Did not meet early stopping. Best iteration is:
[5000] valid_0's auc: 0.760932
| 10 | 0.7609 | 0.1 | 0.0 | 0.1 | 0.0 | 7.904 | 0.03 | 15.0 | 0.0 | 43.18 |
=====================================================================================================================================
print(LGB_BO.max['target'])  # once the optimization has finished, check the best target value found
LGB_BO.max['params']  # and the parameters that produced it
0.8362916622722081
{'bagging_fraction': 0.8864320989515848,'bagging_freq': 0.08715732303784862,'feature_fraction': 0.7719195132945438,'lambda_l1': 4.0642058550131175,'lambda_l2': 0.7571744617226672,'learning_rate': 0.33853400726057015,'max_depth': 10.092622000835181,'min_gain_to_split': 0.47988339149638315,'num_leaves': 48.00083652189798}
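The optimizer reports every parameter as a float, including ones that LightGBM documents as integers (`num_leaves`, `max_depth`, `bagging_freq`). A minimal sketch of converting the reported values before reuse is shown below; the helper `to_lgb_params` and the `INT_PARAMS` tuple are hypothetical names for illustration, and truncating with `int()` mirrors what `LGB_bayesian` does for `num_leaves` and `max_depth`. Casting `bagging_freq` as well is an assumption of this sketch, not something the original function does.

```python
# Best parameters as reported by LGB_BO.max['params'] above (all floats).
best_params = {
    'bagging_fraction': 0.8864320989515848,
    'bagging_freq': 0.08715732303784862,
    'feature_fraction': 0.7719195132945438,
    'lambda_l1': 4.0642058550131175,
    'lambda_l2': 0.7571744617226672,
    'learning_rate': 0.33853400726057015,
    'max_depth': 10.092622000835181,
    'min_gain_to_split': 0.47988339149638315,
    'num_leaves': 48.00083652189798,
}

# Integer-typed LightGBM parameters in this search space (assumed set for the sketch).
INT_PARAMS = ('num_leaves', 'max_depth', 'bagging_freq')


def to_lgb_params(params, int_params=INT_PARAMS):
    """Return a copy of params with integer-typed entries truncated via int()."""
    return {key: int(value) if key in int_params else value
            for key, value in params.items()}


lgb_ready = to_lgb_params(best_params)
print(lgb_ready['num_leaves'], lgb_ready['max_depth'], lgb_ready['bagging_freq'])
```

With these values `bagging_freq` truncates to 0, which LightGBM treats as bagging disabled, so `bagging_fraction` would have no effect under this conversion.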
- The BayesianOptimization library has another neat option: you can probe the LGB_bayesian function at specific points, which is useful if you already have an idea of the best parameters or have taken parameters from another kernel. Here I copy and paste the parameters to probe, as follows.
- By default these points are explored lazily (lazy=True), meaning they are only evaluated the next time maximize is called. So let's call maximize on the LGB_BO object.
LGB_BO.probe(
    params={'bagging_fraction': 0.8864320989515848,
            'bagging_freq': 0.08715732303784862,
            'feature_fraction': 0.7719195132945438,
            'lambda_l1': 4.0642058550131175,
            'lambda_l2': 0.7571744617226672,
            'learning_rate': 0.33853400726057015,
            'max_depth': 10,
            'min_gain_to_split': 0.47988339149638315,
            'num_leaves': 48},
    lazy=True,
)
LGB_BO.maximize(init_points=0, n_iter=0)  # remember no init_points or n_iter
| iter | target | baggin... | baggin... | featur... | lambda_l1 | lambda_l2 | learni... | max_depth | min_ga... | num_le... |
-------------------------------------------------------------------------------------------------------------------------------------
Training until validation scores don't improve for 200 rounds.
[500] valid_0's auc: 0.776662
[1000] valid_0's auc: 0.804675
[1500] valid_0's auc: 0.817655
[2000] valid_0's auc: 0.826085
[2500] valid_0's auc: 0.831839
[3000] valid_0's auc: 0.835281
Early stopping, best iteration is:
[3179] valid_0's auc: 0.836292
| 11 | 0.8363 | 0.8864 | 0.08716 | 0.7719 | 4.064 | 0.7572 | 0.3385 | 10.0 | 0.4799 | 48.0 |
=====================================================================================================================================
The full list of probed parameters and their target values is available through the LGB_BO.res attribute.
for i, res in enumerate(LGB_BO.res):
    print("Iteration {}: \n\t{}".format(i, res))
Iteration 0: {'target': 0.8326475014985013, 'params': {'bagging_fraction': 0.7999321695164382, 'bagging_freq': 2.375412200349123, 'feature_fraction': 0.8418506793952316, 'lambda_l1': 4.8287459902149985, 'lambda_l2': 9.726011139048934, 'learning_rate': 0.2431211462861367, 'max_depth': 11.09042462761278, 'min_gain_to_split': 0.7755265146048467, 'num_leaves': 33.87260051415811}}
Iteration 1: {'target': 0.8112097384468528, 'params': {'bagging_fraction': 0.7498164065652524, 'bagging_freq': 0.35036524101437316, 'feature_fraction': 0.36860452380026143, 'lambda_l1': 0.29256245941037373, 'lambda_l2': 8.57060942587199, 'learning_rate': 0.2052413931011595, 'max_depth': 11.798479515780969, 'min_gain_to_split': 0.2562799493266301, 'num_leaves': 20.64115468186214}}
Iteration 2: {'target': 0.8261050715459326, 'params': {'bagging_fraction': 0.10847149307287247, 'bagging_freq': 3.5833378270496974, 'feature_fraction': 0.9541847635103893, 'lambda_l1': 1.0894950456584445, 'lambda_l2': 3.193913663803646, 'learning_rate': 0.461353021420276, 'max_depth': 5.319036664398947, 'min_gain_to_split': 0.06508453704251449, 'num_leaves': 33.34230495985203}}
Iteration 3: {'target': 0.8362916622722081, 'params': {'bagging_fraction': 0.8864320989515848, 'bagging_freq': 0.08715732303784862, 'feature_fraction': 0.7719195132945438, 'lambda_l1': 4.0642058550131175, 'lambda_l2': 0.7571744617226672, 'learning_rate': 0.33853400726057015, 'max_depth': 10.092622000835181, 'min_gain_to_split': 0.47988339149638315, 'num_leaves': 48.00083652189798}}
Iteration 4: {'target': 0.8341753174329471, 'params': {'bagging_fraction': 0.1000108302125366, 'bagging_freq': 2.4697870099191634, 'feature_fraction': 0.7410094101204252, 'lambda_l1': 1.6229102489335656, 'lambda_l2': 2.769963563838095, 'learning_rate': 0.3568593627001331, 'max_depth': 14.185517481459488, 'min_gain_to_split': 0.2444757021979903, 'num_leaves': 25.613861777454364}}
Iteration 5: {'target': 0.8043011941176942, 'params': {'bagging_fraction': 0.6072612766397498, 'bagging_freq': 1.2108187274978577, 'feature_fraction': 0.31196871185798225, 'lambda_l1': 0.32926780461649763, 'lambda_l2': 7.263184308560325, 'learning_rate': 0.2121743710976095, 'max_depth': 7.398509227738581, 'min_gain_to_split': 0.2958905376418717, 'num_leaves': 19.228154502442003}}
Iteration 6: {'target': 0.833607682808766, 'params': {'bagging_fraction': 0.5284563589983934, 'bagging_freq': 0.4362596810480882, 'feature_fraction': 0.5314451681549669, 'lambda_l1': 4.917464527991873, 'lambda_l2': 0.0, 'learning_rate': 0.258904064526674, 'max_depth': 15.0, 'min_gain_to_split': 0.5875699451215919, 'num_leaves': 36.327854565302374}}
Iteration 7: {'target': 0.8318794311203416, 'params': {'bagging_fraction': 1.0, 'bagging_freq': 8.511485033295257, 'feature_fraction': 1.0, 'lambda_l1': 4.543957420169538, 'lambda_l2': 7.21250815846057, 'learning_rate': 0.4367137368532056, 'max_depth': 15.0, 'min_gain_to_split': 1.0, 'num_leaves': 45.523690581294865}}
Iteration 8: {'target': 0.826167096580232, 'params': {'bagging_fraction': 0.1, 'bagging_freq': 10.0, 'feature_fraction': 1.0, 'lambda_l1': 5.0, 'lambda_l2': 0.0, 'learning_rate': 0.5, 'max_depth': 15.0, 'min_gain_to_split': 1.0, 'num_leaves': 30.184337136649773}}
Iteration 9: {'target': 0.7609321700847423, 'params': {'bagging_fraction': 0.1, 'bagging_freq': 0.0, 'feature_fraction': 0.1, 'lambda_l1': 0.0, 'lambda_l2': 7.904224578705337, 'learning_rate': 0.03, 'max_depth': 15.0, 'min_gain_to_split': 0.0, 'num_leaves': 43.17758999088051}}
Iteration 10: {'target': 0.8362916622722081, 'params': {'bagging_fraction': 0.8864320989515848, 'bagging_freq': 0.08715732303784862, 'feature_fraction': 0.7719195132945438, 'lambda_l1': 4.0642058550131175, 'lambda_l2': 0.7571744617226672, 'learning_rate': 0.33853400726057015, 'max_depth': 10.0, 'min_gain_to_split': 0.47988339149638315, 'num_leaves': 48.0}}
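Since `LGB_BO.res` is a plain list of dicts with a `target` key, ranking the trials takes only standard Python. The sketch below uses a hand-copied, abbreviated version of the listing above (only `target` and `learning_rate` are kept per entry, to stay short); the variable names are illustrative.

```python
# Abbreviated copy of LGB_BO.res from the listing above:
# one dict per iteration, keeping only the target and the learning rate.
res = [
    {'target': 0.8326475014985013, 'params': {'learning_rate': 0.2431211462861367}},
    {'target': 0.8112097384468528, 'params': {'learning_rate': 0.2052413931011595}},
    {'target': 0.8261050715459326, 'params': {'learning_rate': 0.461353021420276}},
    {'target': 0.8362916622722081, 'params': {'learning_rate': 0.33853400726057015}},
    {'target': 0.8341753174329471, 'params': {'learning_rate': 0.3568593627001331}},
    {'target': 0.8043011941176942, 'params': {'learning_rate': 0.2121743710976095}},
    {'target': 0.833607682808766, 'params': {'learning_rate': 0.258904064526674}},
    {'target': 0.8318794311203416, 'params': {'learning_rate': 0.4367137368532056}},
    {'target': 0.826167096580232, 'params': {'learning_rate': 0.5}},
    {'target': 0.7609321700847423, 'params': {'learning_rate': 0.03}},
    {'target': 0.8362916622722081, 'params': {'learning_rate': 0.33853400726057015}},
]

# Sort (iteration index, result) pairs by target, best first.
# sorted() is stable, so tied targets keep their original order.
ranked = sorted(enumerate(res), key=lambda pair: pair[1]['target'], reverse=True)

for idx, entry in ranked[:3]:
    print(f"iteration {idx}: auc={entry['target']:.4f} "
          f"learning_rate={entry['params']['learning_rate']:.4f}")

worst_idx, worst = ranked[-1]
```

In this run the weakest trial is the one at the lower bound of the learning rate (0.03), which never reached early stopping within the round limit.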
Build a model with these parameters.
import gc
import numpy as np
import lightgbm as lgb
from sklearn import metrics
from sklearn.model_selection import StratifiedKFold

params = {
    'bagging_fraction': 0.8864320989515848,
    'bagging_freq': 0.08715732303784862,
    'feature_fraction': 0.7719195132945438,
    'lambda_l1': 4.0642058550131175,
    'lambda_l2': 0.7571744617226672,
    'learning_rate': 0.33853400726057015,
    'max_depth': 10,
    'min_gain_to_split': 0.47988339149638315,
    'num_leaves': 48,
}
# min_data_in_leaf and min_sum_hessian_in_leaf were not part of bounds_LGB, so
# LGB_BO.max['params'] has no such keys; they are left at LightGBM's defaults here.
param_lgb = {
    'num_leaves': int(LGB_BO.max['params']['num_leaves']),  # remember to int here
    'max_bin': 63,
    'learning_rate': LGB_BO.max['params']['learning_rate'],
    'bagging_fraction': 1.0,
    'bagging_freq': 5,
    'feature_fraction': LGB_BO.max['params']['feature_fraction'],
    'lambda_l1': LGB_BO.max['params']['lambda_l1'],
    'lambda_l2': LGB_BO.max['params']['lambda_l2'],
    'min_gain_to_split': LGB_BO.max['params']['min_gain_to_split'],
    'max_depth': int(LGB_BO.max['params']['max_depth']),  # remember to int here
    'save_binary': True,
    'seed': 1337,
    'feature_fraction_seed': 1337,
    'bagging_seed': 1337,
    'drop_seed': 1337,
    'data_random_seed': 1337,
    'objective': 'binary',
    'boosting_type': 'gbdt',
    'verbose': 1,
    'metric': 'auc',
    'is_unbalance': True,
    'boost_from_average': False,
}
# train_df, test_df, predictors (list of feature column names) and target
# (label column name) are assumed to be defined beforehand for this dataset.
nfold = 5
gc.collect()
skf = StratifiedKFold(n_splits=nfold, shuffle=True, random_state=2019)
oof = np.zeros(len(train_df))
predictions = np.zeros((len(test_df), nfold))
i = 1
for train_index, valid_index in skf.split(train_df, train_df[target].values):
    print("\nfold {}".format(i))
    xg_train = lgb.Dataset(train_df.iloc[train_index][predictors].values,
                           label=train_df.iloc[train_index][target].values,
                           feature_name=predictors,
                           free_raw_data=False)
    xg_valid = lgb.Dataset(train_df.iloc[valid_index][predictors].values,
                           label=train_df.iloc[valid_index][target].values,
                           feature_name=predictors,
                           free_raw_data=False)
    clf = lgb.train(param_lgb, xg_train, 5000, valid_sets=[xg_valid],
                    verbose_eval=250, early_stopping_rounds=50)
    # Out-of-fold predictions for the validation rows of this fold.
    oof[valid_index] = clf.predict(train_df.iloc[valid_index][predictors].values,
                                   num_iteration=clf.best_iteration)
    # One column of test predictions per fold.
    predictions[:, i-1] += clf.predict(test_df[predictors], num_iteration=clf.best_iteration)
    i = i + 1
print("\n\nCV AUC: {:<0.2f}".format(metrics.roc_auc_score(train_df[target].values, oof)))
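The loop above stores one column of test predictions per fold. The source does not show how those columns are combined; a common choice, assumed here, is to average them per test row. The sketch uses a tiny made-up matrix in plain Python so that it runs on its own; `average_folds` is a hypothetical helper name. With a NumPy array the equivalent would be `predictions.mean(axis=1)`.

```python
# Toy stand-in for the `predictions` matrix: one row per test sample,
# one column per fold. The numbers are made up for illustration only.
predictions = [
    [0.2, 0.4, 0.3, 0.1, 0.5],
    [0.9, 0.7, 0.8, 0.6, 1.0],
]


def average_folds(pred_matrix):
    """Average the per-fold predictions of each test sample."""
    return [sum(row) / len(row) for row in pred_matrix]


final_pred = average_folds(predictions)
print(final_pred)
```

Averaging keeps the output on the same probability scale as each fold model, which is all a ranking metric such as AUC needs.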
References: the Hyperopt getting-started guide (Hyperopt入门指南) and "Insightful EDA + modeling LGBM hyperopt".
test['是否流失'] = lgb_test
test[['客户ID','是否流失']].to_csv('test_sub.csv',index=False)
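The cross-validation loop stores one column of test predictions per fold in predictions, whereas the submission lines use a single vector, lgb_test. One common way to get from the former to the latter is to average the folds. The sketch below uses toy numbers, and the helper name average_folds is hypothetical; whether lgb_test was produced this way in the original run is an assumption.

```python
import numpy as np


def average_folds(fold_predictions):
    """Collapse an (n_samples, n_folds) matrix into one prediction per sample."""
    return np.asarray(fold_predictions, dtype=float).mean(axis=1)


# Toy stand-in for `predictions`: 2 test rows, 3 folds.
fold_predictions = np.array([
    [0.2, 0.4, 0.6],
    [0.9, 0.8, 0.7],
])

blended = average_folds(fold_predictions)
print(blended.shape)  # one value per test row

# In the notebook this would be used as, for example:
#   test['是否流失'] = average_folds(predictions)
```

Averaging keeps the output on the probability scale, which suits an AUC-style metric since only the ranking of the scores matters.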
Result: after a few rounds of tuning the score barely improved, so I did not submit again.