Contents

  • 一、Checking the distribution of each field
    • 1.2 Automated data profiling with pandas_profiling
  • 二、Training with the baseline parameters
  • 三、Feature selection with Null Importances
    • 3.2 Computing the score
    • 3.3 Selecting the right features
  • 四、Running the baseline
    • 4.1 Training with lgb
    • 4.2 Training with Xgb
    • 4.3 Training with cat
    • 4.4 Additionally dropping the '平均丢弃数据呼叫数' feature
  • 五、Bayesian parameter tuning
from google.colab import drive
drive.mount('/content/drive')
import os
os.chdir('/content/drive/MyDrive/chinese task/讯飞-电信用户流失')
Mounted at /content/drive

References:

  • 《2021科大讯飞-车辆贷款违约预测挑战赛 Top1方案》
  • 《数据挖掘-租金预测》
  • 《WSDM-爱奇艺:用户留存预测挑战赛 线上0.865》
  • 《微信大数据挑战赛 亚军方案–如何用baseline上724+》
  • 《用户购买预测比赛第十名方案》
!apt-get install -y unzip  # hedged fix: unzip is a system utility, not a pip package (Colab usually ships it already)
!unzip '/content/drive/MyDrive/chinese task/讯飞-电信用户流失/电信客户流失预测挑战赛数据集.zip'

Read the dataset:

import pandas as pd
train = pd.read_csv('./train.csv')
test = pd.read_csv('./test.csv')
train
客户ID 地理区域 是否双频 是否翻新机 当前手机价格 手机网络功能 婚姻状况 家庭成人人数 信息库匹配 预计收入 ... 客户生命周期内平均月费用 客户生命周期内的平均每月使用分钟数 客户整个生命周期内的平均每月通话次数 过去三个月的平均每月使用分钟数 过去三个月的平均每月通话次数 过去三个月的平均月费用 过去六个月的平均每月使用分钟数 过去六个月的平均每月通话次数 过去六个月的平均月费用 是否流失
0 0 7 0 -1 181 0 2 0 0 3 ... 24 286 91 351 121 23 303 101 25 0
1 1 13 1 0 1399 0 3 0 0 0 ... 44 447 190 483 199 40 488 202 44 1
2 2 14 1 0 927 0 2 4 0 6 ... 48 183 79 271 95 71 209 77 54 0
3 3 1 0 0 232 0 3 -1 1 -1 ... 42 303 166 473 226 72 446 219 65 1
4 4 0 -1 0 699 0 1 2 0 3 ... 36 119 24 88 15 35 106 21 37 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
149995 149995 10 1 0 1350 0 3 0 0 0 ... 156 474 160 239 80 74 346 122 83 1
149996 149996 6 1 0 542 0 3 -1 1 -1 ... 52 968 208 1158 257 58 1307 261 57 0
149997 149997 15 1 0 1300 0 1 2 0 6 ... 39 504 205 544 203 45 531 205 47 1
149998 149998 12 1 0 1399 0 4 1 0 -1 ... 91 685 249 233 140 94 432 236 97 1
149999 149999 10 1 0 1049 0 3 -1 1 -1 ... 37 177 80 147 59 35 167 74 34 0

150000 rows × 69 columns

一、Checking the distribution of each field

train['是否流失'].value_counts()  # check the positive/negative sample counts

# check for missing values
missing_counts = pd.DataFrame(train.isnull().sum())
missing_counts.columns = ['count_null']
missing_counts.describe()

# check each field's dtype and cardinality
for col in train.columns:
    print(f'{col} \t {train.dtypes[col]} {train[col].nunique()}')
import pandas as pd
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold
import time
from lightgbm import LGBMClassifier
import lightgbm as lgb
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import seaborn as sns
%matplotlib inline
import warnings
warnings.simplefilter('ignore', UserWarning)
import gc
gc.enable()

1.2 Automated data profiling with pandas_profiling

References:

  • Official documentation
  • 《使用pandas-profiling生成数据的详细报告》
  • 《用pandas-profiling做出更好的探索性数据分析》
conda install -c conda-forge pandas-profiling
#!pip install -U pandas-profiling[notebook]
# restart the kernel after installing
import pandas as pd
import pandas_profiling
data = pd.read_csv('./train.csv')
profile = data.profile_report(title='Pandas Profiling Report')
profile.to_file(output_file="telecom_customers_pandas_profiling.html")

Inspecting the Pandas Profiling Report reveals:

  • Categorical features: '地理区域', '是否双频', '是否翻新机', '手机网络功能', '婚姻状况', '家庭成人人数', '信息库匹配', '信用卡指示器', '新手机用户', '账户消费限额'
  • Binned feature: '预计收入'
  • Features with outliers: '家庭中唯一订阅者的数量', '家庭活跃用户数'
  • Useless (heavily imbalanced) features: '平均呼叫转移呼叫数', '平均丢弃数据呼叫数' (dominant-value counts: 149797 and 148912)
# split numeric vs. categorical features
features = list(train.columns)
categorical_features = ['地理区域','是否双频','是否翻新机','手机网络功能','婚姻状况','预计收入','家庭成人人数','信息库匹配','信用卡指示器','新手机用户','账户消费限额']
numeric_features = [item for item in features if item not in categorical_features]
numeric_features = [i for i in numeric_features if i not in ['客户ID','是否流失']]
# split categoricals by cardinality (few vs. many categories)
categorical_features1 = ['是否双频','是否翻新机','手机网络功能','信息库匹配','信用卡指示器','新手机用户','账户消费限额']
categorical_features2 = ['地理区域','婚姻状况','预计收入','家庭成人人数']
# handle a few outliers
#train[train['家庭中唯一订阅者的数量'].values > 13] = 14
# the Pandas Profiling Report shows the following columns are heavily imbalanced; print them to inspect
# (some other outliers are left untreated for now)
cols = ['家庭中唯一订阅者的数量','家庭活跃用户数','数据超载的平均费用','平均漫游呼叫数','平均丢弃数据呼叫数','平均占线数据调用次数','未应答数据呼叫的平均次数','尝试数据调用的平均数','完成数据调用的平均数','平均三通电话数','平均峰值数据调用次数','非高峰数据呼叫的平均数量','平均呼叫转移呼叫数']
for i in cols:
    print(train[i].value_counts())
  • With lr=0.2, roc=0.84479; with lr=0.3, 0.8379; with lr=0.15, 0.84578
  • Changing 'num_leaves' from 30 to 45: 0.8468

Tuning like this barely helps.

# The following features are 99.5% a single value, so consider dropping them. (dominant-value counts: [149797, 149493, 149218, 148912])
# lr=0.2 gives roc=0.84479
null_cols = ['平均呼叫转移呼叫数','平均占线数据调用次数','未应答数据呼叫的平均次数','平均丢弃数据呼叫数']
for i in null_cols:
    del train[i]
    del test[i]
train

二、Training with the baseline parameters

Reference: 《科大讯飞:电信客户流失预测挑战赛baseline》

  1. All features, 10931 rounds: valid_acc = 0.84298
  2. Null importance run for 5000 rounds:
    1. Keeping features with split score > 0 (43 features): 14402 rounds, valid_acc = 0.83887
    2. Keeping features with gain score > 0 (23 features): 10946 rounds, valid_acc = 0.8193
  3. Null importance run for 1000 rounds:
    1. Keeping features with split score > 0 (66 features): 11817 rounds, valid_acc = 0.84417
    2. Keeping features with gain score > 0 (58 features): 11725 rounds, valid_acc = 0.84345

(A sketch of this score-threshold selection is given right below.)
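The subsets above can be reproduced from the scores table built in section 3.2. A minimal sketch, assuming scores_df with the 'feature', 'split_score' and 'gain_score' columns shown later:

```python
# Hedged sketch: keep the features whose null-importance scores are positive.
# scores_df comes from section 3.2; X_train from the split below.
split_feats = scores_df.loc[scores_df['split_score'] > 0, 'feature'].tolist()
gain_feats = scores_df.loc[scores_df['gain_score'] > 0, 'feature'].tolist()
print(len(split_feats), len(gain_feats))  # e.g. 43 and 23 in the 5000-round run
X_train_sel = X_train[split_feats]        # retrain on the selected subset
```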
from sklearn.model_selection import train_test_split
# split into training and validation sets
X_train, X_test, y_train, y_test = train_test_split(train.drop(labels=['客户ID','是否流失'], axis=1),
                                                    train['是否流失'], random_state=10, test_size=0.2)
imp_df = pd.DataFrame()
lgb_train = lgb.Dataset(X_train, y_train, free_raw_data=False, silent=True)
lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train, free_raw_data=False, silent=True)
lgb_params = {'boosting_type': 'gbdt', 'objective': 'binary', 'metric': 'auc',
              'min_child_weight': 5, 'num_leaves': 2 ** 5, 'lambda_l2': 10,
              'feature_fraction': 0.7, 'bagging_fraction': 0.7, 'bagging_freq': 10,
              'learning_rate': 0.2, 'seed': 2022, 'n_jobs': -1}
# train 5000 rounds, report AUC every 300 rounds, stop after 200 rounds without improvement
# (the log below caps at round ~5000, so num_boost_round=5000 here, not 50000)
clf = lgb.train(params=lgb_params, train_set=lgb_train, valid_sets=lgb_eval,
                num_boost_round=5000, verbose_eval=300, early_stopping_rounds=200)
roc = roc_auc_score(y_test, clf.predict(X_test))
y_pred = [1 if x > 0.5 else 0 for x in clf.predict(X_test)]
acc = accuracy_score(y_test, y_pred)
Training until validation scores don't improve for 200 rounds.
[300]   valid_0's auc: 0.733101
[600]   valid_0's auc: 0.754127
[900]   valid_0's auc: 0.766728
[1200]  valid_0's auc: 0.777367
[1500]  valid_0's auc: 0.78594
[1800]  valid_0's auc: 0.792209
[2100]  valid_0's auc: 0.798424
[2400]  valid_0's auc: 0.80417
[2700]  valid_0's auc: 0.808074
[3000]  valid_0's auc: 0.811665
[3300]  valid_0's auc: 0.814679
[3600]  valid_0's auc: 0.817462
[3900]  valid_0's auc: 0.820151
[4200]  valid_0's auc: 0.822135
[4500]  valid_0's auc: 0.824544
[4800]  valid_0's auc: 0.825994
Did not meet early stopping. Best iteration is:
[4994]  valid_0's auc: 0.826797
roc,acc
(0.8267972007033084, 0.7533)

三、Feature selection with Null Importances

def get_feature_importances(X_train, X_test, y_train, y_test, shuffle, seed=None):
    # feature list
    train_features = list(X_train.columns)
    # optionally shuffle the TARGET
    y_train, y_test = y_train.copy(), y_test.copy()
    if shuffle:
        # Here you could as well use a binomial distribution
        y_train, y_test = y_train.copy().sample(frac=1.0), y_test.copy().sample(frac=1.0)
    lgb_train = lgb.Dataset(X_train, y_train, free_raw_data=False, silent=True)
    lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train, free_raw_data=False, silent=True)
    # fit LightGBM (the original null-importance notebook runs it in RF mode, which is faster than sklearn's RandomForest)
    lgb_params = {'boosting_type': 'gbdt', 'objective': 'binary', 'metric': 'auc',
                  'min_child_weight': 5, 'num_leaves': 2 ** 5, 'lambda_l2': 10,
                  'feature_fraction': 0.7, 'bagging_fraction': 0.7, 'bagging_freq': 10,
                  'learning_rate': 0.2, 'seed': 2022, 'n_jobs': -1}
    # train the model (500 rounds; the log below reports every 30 rounds and stops after 20 without improvement)
    clf = lgb.train(params=lgb_params, train_set=lgb_train, valid_sets=lgb_eval,
                    num_boost_round=500, verbose_eval=30, early_stopping_rounds=20)
    # collect feature importances
    imp_df = pd.DataFrame()
    imp_df["feature"] = list(train_features)
    imp_df["importance_gain"] = clf.feature_importance(importance_type='gain')
    imp_df["importance_split"] = clf.feature_importance(importance_type='split')
    imp_df['trn_score'] = roc_auc_score(y_test, clf.predict(X_test))
    return imp_df
np.random.seed(123)
# get the actual feature importances, i.e. without shuffling the target
actual_imp_df = get_feature_importances(X_train, X_test, y_train, y_test, shuffle=False)
actual_imp_df
Training until validation scores don't improve for 20 rounds.
[30]    valid_0's auc: 0.695549
[60]    valid_0's auc: 0.704629
[90]    valid_0's auc: 0.711638
[120]   valid_0's auc: 0.715182
[150]   valid_0's auc: 0.718961
[180]   valid_0's auc: 0.722121
[210]   valid_0's auc: 0.725615
[240]   valid_0's auc: 0.728251
[270]   valid_0's auc: 0.730962
[300]   valid_0's auc: 0.733101
[330]   valid_0's auc: 0.73578
[360]   valid_0's auc: 0.73886
[390]   valid_0's auc: 0.741238
[420]   valid_0's auc: 0.742486
[450]   valid_0's auc: 0.744295
[480]   valid_0's auc: 0.746555
Did not meet early stopping. Best iteration is:
[495]   valid_0's auc: 0.747792
feature importance_gain importance_split trn_score
0 地理区域 1956.600422 313 0.747792
1 是否双频 442.401141 62 0.747792
2 是否翻新机 269.466828 26 0.747792
3 当前手机价格 3838.696197 365 0.747792
4 手机网络功能 750.396258 51 0.747792
... ... ... ... ...
62 过去三个月的平均每月通话次数 2540.721027 325 0.747792
63 过去三个月的平均月费用 2098.813867 304 0.747792
64 过去六个月的平均每月使用分钟数 2375.925741 337 0.747792
65 过去六个月的平均每月通话次数 2541.735172 346 0.747792
66 过去六个月的平均月费用 2103.062207 313 0.747792

67 rows × 4 columns


null_imp_df = pd.DataFrame()
nb_runs = 10
import time
start = time.time()
dsp = ''
for i in range(nb_runs):
    # get the feature importances for this run (with shuffled target)
    imp_df = get_feature_importances(X_train, X_test, y_train, y_test, shuffle=True)
    imp_df['run'] = i + 1
    # append to the accumulated null importances
    null_imp_df = pd.concat([null_imp_df, imp_df], axis=0)
    # erase the previous status line
    for l in range(len(dsp)):
        print('\b', end='', flush=True)
    # display current run and elapsed time
    spent = (time.time() - start) / 60
    dsp = 'Done with %4d of %4d (Spent %5.1f min)' % (i + 1, nb_runs, spent)
    print(dsp, end='', flush=True)
null_imp_df
feature importance_gain importance_split trn_score run
0 地理区域 38.622730 5 0.505320 1
1 是否双频 0.000000 0 0.505320 1
2 是否翻新机 0.000000 0 0.505320 1
3 当前手机价格 30.980300 4 0.505320 1
4 手机网络功能 0.000000 0 0.505320 1
... ... ... ... ... ...
62 过去三个月的平均每月通话次数 109.945481 14 0.503911 10
63 过去三个月的平均月费用 35.344621 4 0.503911 10
64 过去六个月的平均每月使用分钟数 55.200380 7 0.503911 10
65 过去六个月的平均每月通话次数 53.439080 6 0.503911 10
66 过去六个月的平均月费用 47.455200 6 0.503911 10

670 rows × 5 columns


def display_distributions(actual_imp_df_, null_imp_df_, feature_):
    plt.figure(figsize=(13, 6))
    gs = gridspec.GridSpec(1, 2)
    # plot split importances
    ax = plt.subplot(gs[0, 0])
    a = ax.hist(null_imp_df_.loc[null_imp_df_['feature'] == feature_, 'importance_split'].values, label='Null importances')
    ax.vlines(x=actual_imp_df_.loc[actual_imp_df_['feature'] == feature_, 'importance_split'].mean(),
              ymin=0, ymax=np.max(a[0]), color='r', linewidth=10, label='Real Target')
    ax.legend()
    ax.set_title('Split Importance of %s' % feature_.upper(), fontweight='bold')
    plt.xlabel('Null Importance (split) Distribution for %s ' % feature_.upper())
    # plot gain importances
    ax = plt.subplot(gs[0, 1])
    a = ax.hist(null_imp_df_.loc[null_imp_df_['feature'] == feature_, 'importance_gain'].values, label='Null importances')
    ax.vlines(x=actual_imp_df_.loc[actual_imp_df_['feature'] == feature_, 'importance_gain'].mean(),
              ymin=0, ymax=np.max(a[0]), color='r', linewidth=10, label='Real Target')
    ax.legend()
    ax.set_title('Gain Importance of %s' % feature_.upper(), fontweight='bold')
    plt.xlabel('Null Importance (gain) Distribution for %s ' % feature_.upper())

# plot the importance distributions of 'DESTINATION_AIRPORT'
# (a feature name left over from the original flights-dataset notebook; substitute one of this dataset's columns)
display_distributions(actual_imp_df_=actual_imp_df, null_imp_df_=null_imp_df, feature_='DESTINATION_AIRPORT')

[Figure: null importance distributions (split and gain) vs. the actual importance; image unavailable]

plt.rcParams['font.sans-serif'] = ['SimHei']  # use a Chinese font (SimHei)
plt.rcParams['axes.unicode_minus'] = False  # keep the minus sign from rendering as a box when saving figures
sns.set(font='SimHei')

3.2 Computing the score

The score of a feature is the log of its actual (non-shuffled) importance divided by (1 + the 75th percentile of its shuffled null importances).
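In formula form (matching the code below, computed separately for split and gain importances; the $10^{-10}$ term guards the logarithm against zero importances):

$$\mathrm{score}_f = \log\left(10^{-10} + \frac{\mathrm{imp}_f^{\text{actual}}}{1 + P_{75}\left(\mathrm{imp}_f^{\text{null}}\right)}\right)$$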


feature_scores = []
for _f in actual_imp_df['feature'].unique():
    f_null_imps_gain = null_imp_df.loc[null_imp_df['feature'] == _f, 'importance_gain'].values
    f_act_imps_gain = actual_imp_df.loc[actual_imp_df['feature'] == _f, 'importance_gain'].mean()
    gain_score = np.log(1e-10 + f_act_imps_gain / (1 + np.percentile(f_null_imps_gain, 75)))  # avoid divide by zero
    f_null_imps_split = null_imp_df.loc[null_imp_df['feature'] == _f, 'importance_split'].values
    f_act_imps_split = actual_imp_df.loc[actual_imp_df['feature'] == _f, 'importance_split'].mean()
    split_score = np.log(1e-10 + f_act_imps_split / (1 + np.percentile(f_null_imps_split, 75)))  # avoid divide by zero
    feature_scores.append((_f, split_score, gain_score))

scores_df = pd.DataFrame(feature_scores, columns=['feature', 'split_score', 'gain_score'])

plt.figure(figsize=(16, 16))
gs = gridspec.GridSpec(1, 2)
# Plot Split importances
ax = plt.subplot(gs[0, 0])
sns.barplot(x='split_score', y='feature', data=scores_df.sort_values('split_score', ascending=False).iloc[0:70], ax=ax)
ax.set_title('Feature scores wrt split importances', fontweight='bold', fontsize=14)
# Plot Gain importances
ax = plt.subplot(gs[0, 1])
sns.barplot(x='gain_score', y='feature', data=scores_df.sort_values('gain_score', ascending=False).iloc[0:70], ax=ax)
ax.set_title('Feature scores wrt gain importances', fontweight='bold', fontsize=14)
plt.tight_layout()

null_imp_df.to_csv('null_importances_distribution_rf.csv')
actual_imp_df.to_csv('actual_importances_ditribution_rf.csv')

[Figure: feature scores w.r.t. split and gain importances; image unavailable]

[('当前设备使用天数', 21885.414773210883), ('当月使用分钟数与前三个月平均值的百分比变化', 17307.072956457734), ('每月平均使用分钟数', 12217.853455409408), ('在职总月数', 11940.929380342364), ('客户生命周期内的平均每月使用分钟数', 11776.946275830269), ('客户整个生命周期内的平均每月通话次数', 11571.01933504641), ('已完成语音通话的平均使用分钟数', 10899.402202293277), ('客户生命周期内的总费用', 10882.543393820524), ('当前手机价格', 10766.242197856307), ('使用高峰语音通话的平均不完整分钟数', 10392.122741535306), ('计费调整后的总费用', 10233.600193202496), ('当月费用与前三个月平均值的百分比变化', 10154.000930830836), ('客户生命周期内的总使用分钟数', 9959.518506526947), ('计费调整后的总分钟数', 9880.493449807167), ('客户生命周期内平均月费用', 9879.557141974568), ('客户生命周期内的总通话次数', 9863.276128590107), ('过去六个月的平均每月使用分钟数', 9739.2590110749), ('过去六个月的平均每月通话次数', 9574.12247480452), ('过去三个月的平均每月使用分钟数', 9345.73676533997), ('计费调整后的呼叫总数', 9230.227682426572)]

scores_df.sort_values(by="split_score",ascending=False,inplace=True)
scores_df
feature split_score gain_score
17 每月平均使用分钟数 4.152397 4.571279
60 客户整个生命周期内的平均每月通话次数 4.116323 4.226021
56 计费调整后的总分钟数 3.992808 3.961585
52 客户生命周期内的总通话次数 3.932502 4.008059
38 一分钟内的平均呼入电话数 3.832258 3.505356
... ... ... ...
35 完成数据调用的平均数 1.878771 2.220746
30 未应答数据呼叫的平均次数 1.791759 3.040400
28 平均占线数据调用次数 1.609438 2.565711
49 平均呼叫转移呼叫数 0.693147 2.221640
26 平均丢弃数据呼叫数 -23.025851 -23.025851

67 rows × 3 columns

# percentage of shuffled-target runs whose importance falls below a low percentile
# (the 35th here) of the actual-target importance
correlation_scores = []
for _f in actual_imp_df['feature'].unique():
    f_null_imps = null_imp_df.loc[null_imp_df['feature'] == _f, 'importance_gain'].values
    f_act_imps = actual_imp_df.loc[actual_imp_df['feature'] == _f, 'importance_gain'].values
    gain_score = 100 * (f_null_imps < np.percentile(f_act_imps, 35)).sum() / f_null_imps.size
    f_null_imps = null_imp_df.loc[null_imp_df['feature'] == _f, 'importance_split'].values
    f_act_imps = actual_imp_df.loc[actual_imp_df['feature'] == _f, 'importance_split'].values
    split_score = 100 * (f_null_imps < np.percentile(f_act_imps, 35)).sum() / f_null_imps.size
    correlation_scores.append((_f, split_score, gain_score))

corr_scores_df = pd.DataFrame(correlation_scores, columns=['feature', 'split_score', 'gain_score'])

fig = plt.figure(figsize=(16, 16))
gs = gridspec.GridSpec(1, 2)
# Plot Split importances
ax = plt.subplot(gs[0, 0])
sns.barplot(x='split_score', y='feature', data=corr_scores_df.sort_values('split_score', ascending=False).iloc[0:70], ax=ax)
ax.set_title('Feature scores wrt split importances', fontweight='bold', fontsize=14)
# Plot Gain importances
ax = plt.subplot(gs[0, 1])
sns.barplot(x='gain_score', y='feature', data=corr_scores_df.sort_values('gain_score', ascending=False).iloc[0:70], ax=ax)
ax.set_title('Feature scores wrt gain importances', fontweight='bold', fontsize=14)
plt.tight_layout()
plt.suptitle("Features' split and gain scores", fontweight='bold', fontsize=16)
fig.subplots_adjust(top=0.93)

[Figure: features' split and gain correlation scores; image unavailable]

corr_scores_df.sort_values(by="split_score",ascending=False,inplace=True)
corr_scores_df
feature split_score gain_score
0 地理区域 100.0 100.0
50 平均呼叫等待呼叫数 100.0 100.0
36 平均客户服务电话次数 100.0 100.0
37 使用客户服务电话的平均分钟数 100.0 100.0
38 一分钟内的平均呼入电话数 100.0 100.0
... ... ... ...
29 平均未接语音呼叫数 100.0 100.0
30 未应答数据呼叫的平均次数 100.0 100.0
31 尝试拨打的平均语音呼叫次数 100.0 100.0
66 过去六个月的平均月费用 100.0 100.0
26 平均丢弃数据呼叫数 0.0 0.0

67 rows × 3 columns


3.3 Selecting the right features

corr_scores_df shows that 平均丢弃数据呼叫数 is useless and can be dropped. Dropping it did improve the result (see the sketch below for picking such features programmatically).
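A minimal sketch of pulling the useless features out of corr_scores_df automatically, assuming the column layout shown above (in this run only 平均丢弃数据呼叫数 scores 0 on both metrics):

```python
# Hedged sketch: features whose actual importance never beats the null distribution
# (both correlation scores are 0) are candidates for removal.
useless_feats = corr_scores_df.loc[
    (corr_scores_df['split_score'] == 0) & (corr_scores_df['gain_score'] == 0),
    'feature'
].tolist()
print(useless_feats)  # ['平均丢弃数据呼叫数']
```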

X_train, X_test, y_train, y_test = train_test_split(train.drop(labels=['客户ID','是否流失','平均丢弃数据呼叫数'], axis=1),
                                                    train['是否流失'], random_state=10, test_size=0.2)
imp_df = pd.DataFrame()
lgb_train = lgb.Dataset(X_train, y_train, free_raw_data=False, silent=True)
lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train, free_raw_data=False, silent=True)
lgb_params = {'boosting_type': 'gbdt', 'objective': 'binary', 'metric': 'auc',
              'min_child_weight': 5, 'num_leaves': 2 ** 5, 'lambda_l2': 10,
              'feature_fraction': 0.7, 'bagging_fraction': 0.7, 'bagging_freq': 10,
              'learning_rate': 0.2, 'seed': 2022, 'n_jobs': -1}
# train up to 50000 rounds, report AUC every 300 rounds, stop after 200 rounds without improvement
clf = lgb.train(params=lgb_params, train_set=lgb_train, valid_sets=lgb_eval,
                num_boost_round=50000, verbose_eval=300, early_stopping_rounds=200)
roc = roc_auc_score(y_test, clf.predict(X_test))
y_pred = [1 if x > 0.5 else 0 for x in clf.predict(X_test)]
acc = accuracy_score(y_test, y_pred)
Training until validation scores don't improve for 200 rounds.
[300]   valid_0's auc: 0.734833
[600]   valid_0's auc: 0.753598
[900]   valid_0's auc: 0.767934
[1200]  valid_0's auc: 0.778701
[1500]  valid_0's auc: 0.785552
[1800]  valid_0's auc: 0.793379
[2100]  valid_0's auc: 0.799713
[2400]  valid_0's auc: 0.805404
[2700]  valid_0's auc: 0.809381
[3000]  valid_0's auc: 0.813516
[3300]  valid_0's auc: 0.816289
[3600]  valid_0's auc: 0.81927
[3900]  valid_0's auc: 0.821682
[4200]  valid_0's auc: 0.824342
[4500]  valid_0's auc: 0.82676
[4800]  valid_0's auc: 0.829004
[5100]  valid_0's auc: 0.830592
[5400]  valid_0's auc: 0.83205
[5700]  valid_0's auc: 0.833626
[6000]  valid_0's auc: 0.83478
[6300]  valid_0's auc: 0.835981
[6600]  valid_0's auc: 0.836975
[6900]  valid_0's auc: 0.837994
[7200]  valid_0's auc: 0.838715
[7500]  valid_0's auc: 0.83963
[7800]  valid_0's auc: 0.840372
[8100]  valid_0's auc: 0.840644
[8400]  valid_0's auc: 0.841068
[8700]  valid_0's auc: 0.841685
Early stopping, best iteration is:
[8768]  valid_0's auc: 0.841806
pred=clf.predict(X_test,num_iteration=clf.best_iteration)
roc,acc
(0.8418064634121478, 0.7683)

四、Running the baseline

Baseline reference: https://mp.weixin.qq.com/s/nLgaGMJByOqRVWnm1UfB3g

!pip install catboost
import pandas as pd
import os
import gc
import lightgbm as lgb
import xgboost as xgb
from catboost import CatBoostRegressor
from sklearn.linear_model import SGDRegressor, LinearRegression, Ridge
from sklearn.preprocessing import MinMaxScaler
from gensim.models import Word2Vec
import math
import numpy as np
from tqdm import tqdm
from sklearn.model_selection import StratifiedKFold, KFold
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score, log_loss
import matplotlib.pyplot as plt
import time
import warnings
warnings.filterwarnings('ignore')
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
data = pd.concat([train, test], axis=0, ignore_index=True)
features = [f for f in data.columns if f not in ['是否流失','客户ID','平均丢弃数据呼叫数']]
train = data[data['是否流失'].notnull()].reset_index(drop=True)
test = data[data['是否流失'].isnull()].reset_index(drop=True)
x_train = train[features]
x_test = test[features]
y_train = train['是否流失']

4.1 Training with lgb

def cv_model(clf, train_x, train_y, test_x, clf_name):
    folds = 5
    seed = 2022
    kf = KFold(n_splits=folds, shuffle=True, random_state=seed)
    oof = np.zeros(train_x.shape[0])        # out-of-fold predictions on the training set
    test_preds = np.zeros(test_x.shape[0])  # fold-averaged predictions on the test set
    cv_scores = []
    for i, (train_index, valid_index) in enumerate(kf.split(train_x, train_y)):
        print('************************************ {} ************************************'.format(str(i + 1)))
        trn_x, trn_y = train_x.iloc[train_index], train_y[train_index]
        val_x, val_y = train_x.iloc[valid_index], train_y[valid_index]
        if clf_name == "lgb":
            train_matrix = clf.Dataset(trn_x, label=trn_y)
            valid_matrix = clf.Dataset(val_x, label=val_y)
            # baseline parameters
            params = {'boosting_type': 'gbdt', 'objective': 'binary', 'metric': 'auc',
                      'num_leaves': 2 ** 5, 'lambda_l2': 10, 'feature_fraction': 0.7,
                      'bagging_fraction': 0.7, 'bagging_freq': 10, 'learning_rate': 0.2,
                      'seed': 2022, 'n_jobs': -1}
            # best parameters from the Bayesian search (section 五); pass params2 instead of params to use them
            params2 = {'boosting_type': 'gbdt', 'objective': 'binary', 'metric': 'auc',
                       'bagging_fraction': 0.8864320989515848, 'bagging_freq': 10,
                       'feature_fraction': 0.7719195132945438, 'lambda_l1': 4.0642058550131175,
                       'lambda_l2': 0.7571744617226672, 'learning_rate': 0.33853400726057015,
                       'max_depth': 10, 'min_gain_to_split': 0.47988339149638315,
                       'num_leaves': 48, 'seed': 2022, 'n_jobs': -1}
            model = clf.train(params, train_matrix, 50000, valid_sets=[train_matrix, valid_matrix],
                              categorical_feature=[], verbose_eval=3000, early_stopping_rounds=200)
            val_pred = model.predict(val_x, num_iteration=model.best_iteration)
            test_pred = model.predict(test_x, num_iteration=model.best_iteration)
            print(list(sorted(zip(features, model.feature_importance("gain")),
                              key=lambda x: x[1], reverse=True))[:20])
        if clf_name == "xgb":
            train_matrix = clf.DMatrix(trn_x, label=trn_y)
            valid_matrix = clf.DMatrix(val_x, label=val_y)
            test_matrix = clf.DMatrix(test_x)
            params = {'booster': 'gbtree', 'objective': 'binary:logistic', 'eval_metric': 'auc',
                      'gamma': 1, 'min_child_weight': 1.5, 'max_depth': 5, 'lambda': 10,
                      'subsample': 0.7, 'colsample_bytree': 0.7, 'colsample_bylevel': 0.7,
                      'eta': 0.2, 'tree_method': 'exact', 'seed': 2020, 'nthread': 36,
                      'silent': True}
            watchlist = [(train_matrix, 'train'), (valid_matrix, 'eval')]
            model = clf.train(params, train_matrix, num_boost_round=50000, evals=watchlist,
                              verbose_eval=3000, early_stopping_rounds=200)
            val_pred = model.predict(valid_matrix, ntree_limit=model.best_ntree_limit)
            test_pred = model.predict(test_matrix, ntree_limit=model.best_ntree_limit)
        if clf_name == "cat":
            params = {'learning_rate': 0.2, 'depth': 5, 'l2_leaf_reg': 10, 'bootstrap_type': 'Bernoulli',
                      'od_type': 'Iter', 'od_wait': 50, 'random_seed': 11, 'allow_writing_files': False}
            model = clf(iterations=20000, **params)
            model.fit(trn_x, trn_y, eval_set=(val_x, val_y), cat_features=[],
                      use_best_model=True, verbose=3000)
            val_pred = model.predict(val_x)
            test_pred = model.predict(test_x)
        oof[valid_index] = val_pred
        test_preds += test_pred / kf.n_splits  # accumulate the fold average (the original assigned with '=', keeping only the last fold)
        cv_scores.append(roc_auc_score(val_y, val_pred))
        print(cv_scores)
    print("%s_scotrainre_list:" % clf_name, cv_scores)
    print("%s_score_mean:" % clf_name, np.mean(cv_scores))
    print("%s_score_std:" % clf_name, np.std(cv_scores))
    return oof, test_preds

def lgb_model(x_train, y_train, x_test):
    return cv_model(lgb, x_train, y_train, x_test, "lgb")

def xgb_model(x_train, y_train, x_test):
    return cv_model(xgb, x_train, y_train, x_test, "xgb")

def cat_model(x_train, y_train, x_test):
    return cv_model(CatBoostRegressor, x_train, y_train, x_test, "cat")
lgb_train, lgb_test = lgb_model(x_train, y_train, x_test)  # takes about 21 min per run
test['是否流失'] = lgb_test
test[['客户ID','是否流失']].to_csv('lgb_base.csv', index=False)  # submission score: 0.825
************************************ 1 ************************************
Training until validation scores don't improve for 200 rounds.
[3000]  training's auc: 0.999488   valid_1's auc: 0.811334
Early stopping, best iteration is:
[5163]  training's auc: 0.999996   valid_1's auc: 0.8289
[('当前设备使用天数', 21934.934124737978), ('当月使用分钟数与前三个月平均值的百分比变化', 17126.358324214816), ('在职总月数', 12409.957622632384), ('每月平均使用分钟数', 12073.125538095832), ('客户生命周期内的平均每月使用分钟数', 11994.06405813992), ('客户整个生命周期内的平均每月通话次数', 11518.068050682545), ('已完成语音通话的平均使用分钟数', 11292.594955265522), ('当前手机价格', 10964.187494635582), ('客户生命周期内的总费用', 10750.710047110915), ('使用高峰语音通话的平均不完整分钟数', 10274.193908914924), ('客户生命周期内的总使用分钟数', 10260.600554332137), ('当月费用与前三个月平均值的百分比变化', 10164.166730254889), ('计费调整后的总分钟数', 10095.02776375413), ('计费调整后的总费用', 10074.029564589262), ('客户生命周期内的总通话次数', 9900.794713005424), ('客户生命周期内平均月费用', 9874.11763061583), ('平均非高峰语音呼叫数', 9546.732098400593), ('过去六个月的平均每月通话次数', 9531.47578701377), ('过去六个月的平均每月使用分钟数', 9481.577100589871), ('计费调整后的呼叫总数', 9305.693744853139)]
[0.8288996222651557]
************************************ 2 ************************************
Training until validation scores don't improve for 200 rounds.
[3000]  training's auc: 0.999472   valid_1's auc: 0.811878
Early stopping, best iteration is:
[4772]  training's auc: 0.999971   valid_1's auc: 0.827608
[('当前设备使用天数', 21505.16284123063), ('当月使用分钟数与前三个月平均值的百分比变化', 16946.651323199272), ('每月平均使用分钟数', 12132.766281962395), ('在职总月数', 11971.832910627127), ('客户生命周期内的平均每月使用分钟数', 11526.178689315915), ('客户整个生命周期内的平均每月通话次数', 11283.326876536012), ('当前手机价格', 11003.212880536914), ('客户生命周期内的总费用', 10808.01029574871), ('已完成语音通话的平均使用分钟数', 10684.196997240186), ('使用高峰语音通话的平均不完整分钟数', 10399.707967177033), ('当月费用与前三个月平均值的百分比变化', 10358.123901829123), ('客户生命周期内的总使用分钟数', 10162.593608289957), ('客户生命周期内的总通话次数', 10073.619953781366), ('计费调整后的总费用', 9978.180806919932), ('计费调整后的总分钟数', 9764.853373721242), ('过去三个月的平均每月通话次数', 9391.67290854454), ('过去六个月的平均每月通话次数', 9381.156281203032), ('客户生命周期内平均月费用', 9243.235832542181), ('过去六个月的平均每月使用分钟数', 9032.57935705781), ('计费调整后的呼叫总数', 8945.249050289392)]
[0.8288996222651557, 0.8276084395403329]
************************************ 3 ************************************
Training until validation scores don't improve for 200 rounds.
[3000]  training's auc: 0.999494   valid_1's auc: 0.811642
Early stopping, best iteration is:
[4663]  training's auc: 0.99999    valid_1's auc: 0.827114
[('当前设备使用天数', 21289.608253866434), ('当月使用分钟数与前三个月平均值的百分比变化', 16997.806541010737), ('在职总月数', 12316.054881855845), ('客户生命周期内的平均每月使用分钟数', 11741.117707148194), ('每月平均使用分钟数', 11664.033028051257), ('已完成语音通话的平均使用分钟数', 11115.561656951904), ('客户整个生命周期内的平均每月通话次数', 10854.345216721296), ('当前手机价格', 10763.63857871294), ('客户生命周期内的总费用', 10621.98585870862), ('当月费用与前三个月平均值的百分比变化', 10375.685174629092), ('计费调整后的总费用', 10232.226524055004), ('客户生命周期内的总使用分钟数', 10052.964914098382), ('使用高峰语音通话的平均不完整分钟数', 9799.514198839664), ('计费调整后的总分钟数', 9735.032970786095), ('客户生命周期内平均月费用', 9637.621711835265), ('客户生命周期内的总通话次数', 9429.328524649143), ('过去六个月的平均每月使用分钟数', 9333.910300150514), ('计费调整后的呼叫总数', 9013.730677694082), ('过去六个月的平均每月通话次数', 8954.436415627599), ('过去六个月的平均月费用', 8829.167943418026)]
[0.8288996222651557, 0.8276084395403329, 0.8271140081312421]
************************************ 4 ************************************
Training until validation scores don't improve for 200 rounds.
[3000]  training's auc: 0.999532   valid_1's auc: 0.814214
Early stopping, best iteration is:
[5281]  training's auc: 0.999996   valid_1's auc: 0.830897
[('当前设备使用天数', 21271.166813850403), ('当月使用分钟数与前三个月平均值的百分比变化', 17270.63153974712), ('每月平均使用分钟数', 12677.148315995932), ('在职总月数', 12486.456961512566), ('客户生命周期内的平均每月使用分钟数', 11930.549542114139), ('客户整个生命周期内的平均每月通话次数', 11403.163509890437), ('已完成语音通话的平均使用分钟数', 11126.607083335519), ('当前手机价格', 10973.327338501811), ('当月费用与前三个月平均值的百分比变化', 10719.836767598987), ('客户生命周期内的总费用', 10684.931542679667), ('计费调整后的总费用', 10567.041279122233), ('计费调整后的总分钟数', 10477.076363384724), ('客户生命周期内的总使用分钟数', 10404.941493198276), ('客户生命周期内平均月费用', 10015.077973127365), ('使用高峰语音通话的平均不完整分钟数', 9988.746752500534), ('过去六个月的平均每月使用分钟数', 9924.928602397442), ('客户生命周期内的总通话次数', 9658.558003604412), ('平均非高峰语音呼叫数', 9605.689363330603), ('过去六个月的平均每月通话次数', 9560.14350926876), ('计费调整后的呼叫总数', 9525.798342213035)]
[0.8288996222651557, 0.8276084395403329, 0.8271140081312421, 0.8308971625977979]
************************************ 5 ************************************
Training until validation scores don't improve for 200 rounds.
[3000]  training's auc: 0.999444   valid_1's auc: 0.8118
Early stopping, best iteration is:
[5148]  training's auc: 0.999994   valid_1's auc: 0.829686
[('当前设备使用天数', 21662.356478646398), ('当月使用分钟数与前三个月平均值的百分比变化', 17710.528580009937), ('在职总月数', 12402.68640038371), ('每月平均使用分钟数', 11945.518620952964), ('客户生命周期内的平均每月使用分钟数', 11887.39459644258), ('已完成语音通话的平均使用分钟数', 11309.949122816324), ('客户整个生命周期内的平均每月通话次数', 11231.172733142972), ('客户生命周期内的总费用', 10822.351191923022), ('当前手机价格', 10691.375393077731), ('计费调整后的总费用', 10513.226110234857), ('当月费用与前三个月平均值的百分比变化', 10418.488398104906), ('客户生命周期内的总使用分钟数', 10276.142720848322), ('使用高峰语音通话的平均不完整分钟数', 10242.566086634994), ('计费调整后的总分钟数', 10193.664465650916), ('客户生命周期内的总通话次数', 10117.483586207032), ('客户生命周期内平均月费用', 9943.684495016932), ('过去六个月的平均每月通话次数', 9800.775234118104), ('过去三个月的平均每月通话次数', 9572.030710801482), ('过去六个月的平均每月使用分钟数', 9561.15305377543), ('平均非高峰语音呼叫数', 9292.315245553851)]
[0.8288996222651557, 0.8276084395403329, 0.8271140081312421, 0.8308971625977979, 0.8296855557324957]
lgb_scotrainre_list: [0.8288996222651557, 0.8276084395403329, 0.8271140081312421, 0.8308971625977979, 0.8296855557324957]
lgb_score_mean: 0.8288409576534048
lgb_score_std: 0.0013744978556818929"\n************************************ 1 ************************************\nTraining until validation scores don't improve for 200 rounds.\n[3000]\ttraining's auc: 0.999474\tvalid_1's auc: 0.811874\nEarly stopping, best iteration is:\n[4935]\ttraining's auc: 0.999996\tvalid_1's auc: 0.827972\n[('当前设备使用天数', 21885.414773210883), ('当月使用分钟数与前三个月平均值的百分比变化', 17307.072956457734), ('每月平均使用分钟数', 12217.853455409408), ('在职总月数', 11940.929380342364), ('客户生命周期内的平均每月使用分钟数', 11776.946275830269), ('客户整个生命周期内的平均每月通话次数', 11571.01933504641), ('已完成语音通话的平均使用分钟数', 10899.402202293277), ('客户生命周期内的总费用', 10882.543393820524), ('当前手机价格', 10766.242197856307), ('使用高峰语音通话的平均不完整分钟数', 10392.122741535306), ('计费调整后的总费用', 10233.600193202496), ('当月费用与前三个月平均值的百分比变化', 10154.000930830836), ('客户生命周期内的总使用分钟数', 9959.518506526947), ('计费调整后的总分钟数', 9880.493449807167), ('客户生命周期内平均月费用', 9879.557141974568), ('客户生命周期内的总通话次数', 9863.276128590107), ('过去六个月的平均每月使用分钟数', 9739.2590110749), ('过去六个月的平均每月通话次数', 9574.12247480452), ('过去三个月的平均每月使用分钟数', 9345.73676533997), ('计费调整后的呼叫总数', 9230.227682426572)]\n[0.8279715963308298]\n************************************ 2 ************************************\nTraining until validation scores don't improve for 200 rounds.\n[3000]\ttraining's auc: 0.999427\tvalid_1's auc: 0.810338\nEarly stopping, best iteration is:\n[4648]\ttraining's auc: 0.999965\tvalid_1's auc: 0.824151\n[('当前设备使用天数', 21631.878849938512), ('当月使用分钟数与前三个月平均值的百分比变化', 16730.961754366755), ('在职总月数', 12067.951921060681), ('每月平均使用分钟数', 12002.064660459757), ('客户生命周期内的平均每月使用分钟数', 11514.234459266067), ('客户整个生命周期内的平均每月通话次数', 11378.85348239541), ('已完成语音通话的平均使用分钟数', 10749.901214078069), ('当前手机价格', 10722.060040861368), ('客户生命周期内的总费用', 10603.264658093452), ('当月费用与前三个月平均值的百分比变化', 10405.526055783033), ('使用高峰语音通话的平均不完整分钟数', 10171.211520016193), ('客户生命周期内的总使用分钟数', 10006.355669140816), ('计费调整后的总分钟数', 9942.827439278364), ('客户生命周期内的总通话次数', 9937.020643949509), ('计费调整后的总费用', 9920.474541395903), ('过去六个月的平均每月使用分钟数', 9621.407806247473), ('客户生命周期内平均月费用', 9319.960188627243), ('过去三个月的平均每月通话次数', 9318.490131109953), ('平均月费用', 9294.081347599626), ('过去六个月的平均每月通话次数', 9203.844007015228)]\n[0.8279715963308298, 0.8241509252411403]\n************************************ 3 ************************************\nTraining until validation scores don't improve for 200 rounds.\n[3000]\ttraining's auc: 0.99949\tvalid_1's auc: 0.810687\nEarly stopping, best iteration is:\n[4731]\ttraining's auc: 0.999987\tvalid_1's auc: 0.825545\n[('当前设备使用天数', 21968.028517633677), ('当月使用分钟数与前三个月平均值的百分比变化', 16903.005848184228), ('在职总月数', 12133.818779706955), ('客户生命周期内的平均每月使用分钟数', 11976.253827899694), ('每月平均使用分钟数', 11948.46197539568), ('已完成语音通话的平均使用分钟数', 11421.855388239026), ('客户整个生命周期内的平均每月通话次数', 11262.173004433513), ('当前手机价格', 11005.929363071918), ('客户生命周期内的总费用', 10528.124375209212), ('客户生命周期内的总使用分钟数', 10390.872772306204), ('计费调整后的总费用', 10347.706698387861), ('当月费用与前三个月平均值的百分比变化', 10124.151285156608), ('计费调整后的总分钟数', 9813.354337349534), ('使用高峰语音通话的平均不完整分钟数', 9805.469536915421), ('客户生命周期内平均月费用', 9772.446165367961), ('过去六个月的平均每月使用分钟数', 9544.928655743599), ('计费调整后的呼叫总数', 9390.860902503133), ('过去六个月的平均每月通话次数', 9323.151294022799), ('客户生命周期内的总通话次数', 9320.212619245052), ('过去三个月的平均每月通话次数', 9084.183073118329)]\n[0.8279715963308298, 0.8241509252411403, 0.8255446840361296]\n************************************ 4 ************************************\nTraining until validation scores don't improve for 200 rounds.\n[3000]\ttraining's auc: 0.9995\tvalid_1's auc: 0.813234\nEarly 
stopping, best iteration is:\n[5599]\ttraining's auc: 0.999997\tvalid_1's auc: 0.831782\n[('当前设备使用天数', 21882.617314189672), ('当月使用分钟数与前三个月平均值的百分比变化', 17574.675364792347), ('每月平均使用分钟数', 12675.68729557097), ('在职总月数', 12567.960791677237), ('客户生命周期内的平均每月使用分钟数', 12466.111717522144), ('客户整个生命周期内的平均每月通话次数', 11556.674870744348), ('已完成语音通话的平均使用分钟数', 11522.147867411375), ('当前手机价格', 11065.775812849402), ('客户生命周期内的总使用分钟数', 10911.875026881695), ('客户生命周期内的总费用', 10715.607445791364), ('使用高峰语音通话的平均不完整分钟数', 10510.982212975621), ('当月费用与前三个月平均值的百分比变化', 10451.965088263154), ('计费调整后的总费用', 10446.603226020932), ('计费调整后的总分钟数', 10408.396666422486), ('过去六个月的平均每月使用分钟数', 10079.377708375454), ('客户生命周期内平均月费用', 10037.817246481776), ('客户生命周期内的总通话次数', 10017.892398029566), ('计费调整后的呼叫总数', 9739.093963235617), ('过去三个月的平均每月通话次数', 9609.546253487468), ('平均非高峰语音呼叫数', 9569.536746695638)]\n[0.8279715963308298, 0.8241509252411403, 0.8255446840361296, 0.8317817862344651]\n************************************ 5 ************************************\nTraining until validation scores don't improve for 200 rounds.\n[3000]\ttraining's auc: 0.999448\tvalid_1's auc: 0.810144\nEarly stopping, best iteration is:\n[5255]\ttraining's auc: 0.999999\tvalid_1's auc: 0.829245\n[('当前设备使用天数', 21498.932903170586), ('当月使用分钟数与前三个月平均值的百分比变化', 17680.002600044012), ('在职总月数', 12638.706078097224), ('客户生命周期内的平均每月使用分钟数', 12569.80523788929), ('每月平均使用分钟数', 12267.705941140652), ('当前手机价格', 11370.256973087788), ('已完成语音通话的平均使用分钟数', 11110.097302675247), ('客户整个生命周期内的平均每月通话次数', 11020.642103403807), ('客户生命周期内的总费用', 10986.333106696606), ('计费调整后的总费用', 10700.256485000253), ('当月费用与前三个月平均值的百分比变化', 10575.144608184695), ('计费调整后的总分钟数', 10401.467713326216), ('使用高峰语音通话的平均不完整分钟数', 10237.447989702225), ('客户生命周期内的总通话次数', 10139.773517146707), ('客户生命周期内平均月费用', 10076.59566681087), ('客户生命周期内的总使用分钟数', 9953.696122318506), ('过去六个月的平均每月使用分钟数', 9595.342250138521), ('平均非高峰语音呼叫数', 9504.704583987594), ('过去六个月的平均每月通话次数', 9500.140991523862), ('计费调整后的呼叫总数', 9425.357908219099)]\n[0.8279715963308298, 0.8241509252411403, 0.8255446840361296, 0.8317817862344651, 0.8292446287869105]\nlgb_scotrainre_list: [0.8279715963308298, 0.8241509252411403, 0.8255446840361296, 0.8317817862344651, 0.8292446287869105]\nlgb_score_mean: 0.827738724125895\nlgb_score_std: 0.002696458502502849\n"

4.2 Training with Xgb

xgb_train, xgb_test = xgb_model(x_train, y_train, x_test)
test['是否流失'] = xgb_test
test[['客户ID','是否流失']].to_csv('xgb_base.csv', index=False)  # took about 2h50min, much slower than lgb
************************************ 1 ************************************
[0] train-auc:0.635939  eval-auc:0.634176
Multiple eval metrics have been passed: 'eval-auc' will be used for early stopping.
Will train until eval-auc hasn't improved in 200 rounds.
[3000]  train-auc:0.992932  eval-auc:0.788708
[6000]  train-auc:0.999906  eval-auc:0.807173
[9000]  train-auc:0.999997  eval-auc:0.812868
Stopping. Best iteration:
[9945]  train-auc:0.999999  eval-auc:0.814055
[0.8140550495535315]
************************************ 2 ************************************
[0] train-auc:0.636635  eval-auc:0.633894
Multiple eval metrics have been passed: 'eval-auc' will be used for early stopping.
Will train until eval-auc hasn't improved in 200 rounds.
[3000]  train-auc:0.992878  eval-auc:0.790387
[6000]  train-auc:0.99988   eval-auc:0.807621
Stopping. Best iteration:
[8538]  train-auc:0.999991  eval-auc:0.812347
[0.8140550495535315, 0.8123468873894992]
************************************ 3 ************************************
[0] train-auc:0.637058  eval-auc:0.630979
Multiple eval metrics have been passed: 'eval-auc' will be used for early stopping.
Will train until eval-auc hasn't improved in 200 rounds.
[3000]  train-auc:0.992874  eval-auc:0.790023
[6000]  train-auc:0.999898  eval-auc:0.80827
[9000]  train-auc:0.999996  eval-auc:0.813291
Stopping. Best iteration:
[8933]  train-auc:0.999996  eval-auc:0.813342
[0.8140550495535315, 0.8123468873894992, 0.8133415339513355]
************************************ 4 ************************************
[0] train-auc:0.635278  eval-auc:0.633351
Multiple eval metrics have been passed: 'eval-auc' will be used for early stopping.
Will train until eval-auc hasn't improved in 200 rounds.
[3000]  train-auc:0.993107  eval-auc:0.78905
[6000]  train-auc:0.999903  eval-auc:0.808401
Stopping. Best iteration:
[8343]  train-auc:0.999993  eval-auc:0.812439
[0.8140550495535315, 0.8123468873894992, 0.8133415339513355, 0.8124389857259089]
************************************ 5 ************************************
[0] train-auc:0.635985  eval-auc:0.633911
Multiple eval metrics have been passed: 'eval-auc' will be used for early stopping.
Will train until eval-auc hasn't improved in 200 rounds.
[3000]  train-auc:0.992892  eval-auc:0.788101
[6000]  train-auc:0.999904  eval-auc:0.805732
[9000]  train-auc:0.999997  eval-auc:0.810194
Stopping. Best iteration:
[10041] train-auc:0.999999  eval-auc:0.811155
[0.8140550495535315, 0.8123468873894992, 0.8133415339513355, 0.8124389857259089, 0.8111551410360852]
xgb_scotrainre_list: [0.8140550495535315, 0.8123468873894992, 0.8133415339513355, 0.8124389857259089, 0.8111551410360852]
xgb_score_mean: 0.8126675195312721
xgb_score_std: 0.000982024071432044

4.3 Training with cat

cat_train, cat_test = cat_model(x_train, y_train, x_test)  # about 22 min, similar to lgb
test['是否流失'] = cat_test
test[['客户ID','是否流失']].to_csv('cat_base.csv', index=False)
************************************ 1 ************************************
0:  learn: 0.4955489    test: 0.4954619 best: 0.4954619 (0) total: 233ms    remaining: 1h 17m 39s
3000:   learn: 0.3769726    test: 0.4483572 best: 0.4483572 (3000)  total: 2m 5s    remaining: 11m 50s
6000:   learn: 0.3209359    test: 0.4391546 best: 0.4391520 (5999)  total: 4m   remaining: 9m 21s
Stopped by overfitting detector  (50 iterations wait)
bestTest = 0.4360869428
bestIteration = 7499
Shrink model to first 7500 iterations.
[0.78868229695141]
************************************ 2 ************************************
0:  learn: 0.4953117    test: 0.4954092 best: 0.4954092 (0) total: 39.5ms   remaining: 13m 10s
3000:   learn: 0.3763302    test: 0.4490481 best: 0.4490378 (2981)  total: 1m 46s   remaining: 10m 2s
6000:   learn: 0.3196365    test: 0.4402621 best: 0.4402621 (6000)  total: 3m 38s   remaining: 8m 30s
Stopped by overfitting detector  (50 iterations wait)
bestTest = 0.4361341716
bestIteration = 8001
Shrink model to first 8002 iterations.
[0.78868229695141, 0.7897985044313038]
************************************ 3 ************************************
0:  learn: 0.4954711    test: 0.4955905 best: 0.4955905 (0) total: 38.5ms   remaining: 12m 49s
3000:   learn: 0.3763265    test: 0.4477431 best: 0.4477431 (3000)  total: 1m 49s   remaining: 10m 21s
Stopped by overfitting detector  (50 iterations wait)
bestTest = 0.4406746798
bestIteration = 5128
Shrink model to first 5129 iterations.
[0.78868229695141, 0.7897985044313038, 0.7788144016087264]
************************************ 4 ************************************
0:  learn: 0.4955798    test: 0.4955669 best: 0.4955669 (0) total: 46.1ms   remaining: 15m 21s
3000:   learn: 0.3768704    test: 0.4486424 best: 0.4486421 (2997)  total: 1m 45s   remaining: 9m 59s
Stopped by overfitting detector  (50 iterations wait)
bestTest = 0.4426386429
bestIteration = 4903
Shrink model to first 4904 iterations.
[0.78868229695141, 0.7897985044313038, 0.7788144016087264, 0.7744056829683829]
************************************ 5 ************************************
0:  learn: 0.4955262    test: 0.4956471 best: 0.4956471 (0) total: 38.9ms   remaining: 12m 57s
3000:   learn: 0.3761659    test: 0.4494234 best: 0.4494234 (3000)  total: 1m 47s   remaining: 10m 11s
6000:   learn: 0.3202277    test: 0.4407377 best: 0.4407330 (5999)  total: 3m 31s   remaining: 8m 12s
9000:   learn: 0.2781913    test: 0.4347233 best: 0.4347168 (8998)  total: 5m 14s   remaining: 6m 24s
Stopped by overfitting detector  (50 iterations wait)
bestTest = 0.4323322625
bestIteration = 10483
Shrink model to first 10484 iterations.
[0.78868229695141, 0.7897985044313038, 0.7788144016087264, 0.7744056829683829, 0.7982800693357867]
cat_scotrainre_list: [0.78868229695141, 0.7897985044313038, 0.7788144016087264, 0.7744056829683829, 0.7982800693357867]
cat_score_mean: 0.785996191059122
cat_score_std: 0.0084674009574612

4.4 Additionally dropping the '平均丢弃数据呼叫数' feature

The result got worse.

lgb_train, lgb_test = lgb_model(x_train, y_train, x_test)
************************************ 1 ************************************
Training until validation scores don't improve for 200 rounds.
[3000]  training's auc: 0.999535   valid_1's auc: 0.811083
Early stopping, best iteration is:
[5495]  training's auc: 0.999999   valid_1's auc: 0.830252
[('当前设备使用天数', 21646.43981860578), ('当月使用分钟数与前三个月平均值的百分比变化', 17622.58995847404), ('在职总月数', 12633.31053687632), ('每月平均使用分钟数', 12317.316355511546), ('客户整个生命周期内的平均每月通话次数', 12213.196875602007), ('客户生命周期内的平均每月使用分钟数', 11988.745236545801), ('已完成语音通话的平均使用分钟数', 11742.254607230425), ('客户生命周期内的总费用', 10961.734202891588), ('客户生命周期内的总使用分钟数', 10739.284949079156), ('当前手机价格', 10717.661178082228), ('使用高峰语音通话的平均不完整分钟数', 10648.361330911517), ('当月费用与前三个月平均值的百分比变化', 10563.12071943283), ('客户生命周期内平均月费用', 10260.813065826893), ('计费调整后的总费用', 10214.983077257872), ('客户生命周期内的总通话次数', 10042.090887442231), ('过去六个月的平均每月使用分钟数', 10030.256944060326), ('计费调整后的总分钟数', 9833.17426289618), ('过去六个月的平均每月通话次数', 9658.642087131739), ('平均非高峰语音呼叫数', 9604.195981651545), ('过去三个月的平均每月使用分钟数', 9474.32663051784)]
[0.8302521387863329]
************************************ 2 ************************************
Training until validation scores don't improve for 200 rounds.
[3000]  training's auc: 0.999484   valid_1's auc: 0.812252
Early stopping, best iteration is:
[4761]  training's auc: 0.999991   valid_1's auc: 0.827726
[('当前设备使用天数', 20778.929791480303), ('当月使用分钟数与前三个月平均值的百分比变化', 17059.72723968327), ('在职总月数', 12247.527016088367), ('每月平均使用分钟数', 12162.8245485425), ('客户生命周期内的平均每月使用分钟数', 11649.190486937761), ('客户整个生命周期内的平均每月通话次数', 11235.27798551321), ('已完成语音通话的平均使用分钟数', 10887.697177901864), ('客户生命周期内的总使用分钟数', 10537.405863419175), ('客户生命周期内的总费用', 10427.963113591075), ('当前手机价格', 10388.50929298997), ('当月费用与前三个月平均值的百分比变化', 10345.741146698594), ('使用高峰语音通话的平均不完整分钟数', 10325.746990069747), ('计费调整后的总费用', 10308.259309798479), ('客户生命周期内的总通话次数', 9878.29905757308), ('过去六个月的平均每月使用分钟数', 9860.522675991058), ('计费调整后的总分钟数', 9831.829701200128), ('客户生命周期内平均月费用', 9413.955781325698), ('平均月费用', 9256.14368981123), ('过去三个月的平均每月通话次数', 9233.180386424065), ('过去六个月的平均每月通话次数', 9178.422535061836)]
[0.8302521387863329, 0.8277260767493848]
************************************ 3 ************************************
Training until validation scores don't improve for 200 rounds.
[3000]  training's auc: 0.999503   valid_1's auc: 0.812223
Early stopping, best iteration is:
[4737]  training's auc: 0.999988   valid_1's auc: 0.826507
[('当前设备使用天数', 21187.639979198575), ('当月使用分钟数与前三个月平均值的百分比变化', 17066.7826115638), ('在职总月数', 12178.71656690538), ('每月平均使用分钟数', 11915.060246050358), ('客户生命周期内的平均每月使用分钟数', 11457.53249040246), ('已完成语音通话的平均使用分钟数', 11197.47149656713), ('客户整个生命周期内的平均每月通话次数', 11062.857962206006), ('当前手机价格', 10535.98642912507), ('计费调整后的总费用', 10396.114720955491), ('当月费用与前三个月平均值的百分比变化', 10280.928569793701), ('使用高峰语音通话的平均不完整分钟数', 10159.540036082268), ('过去六个月的平均每月使用分钟数', 10114.058793380857), ('客户生命周期内的总使用分钟数', 10109.089174315333), ('客户生命周期内的总费用', 10081.144412502646), ('计费调整后的总分钟数', 10064.824367910624), ('客户生命周期内的总通话次数', 9710.811524420977), ('过去六个月的平均每月通话次数', 9568.110130429268), ('客户生命周期内平均月费用', 9536.692147105932), ('计费调整后的呼叫总数', 9272.926451265812), ('过去三个月的平均每月通话次数', 9104.1763061136)]
[0.8302521387863329, 0.8277260767493848, 0.8265070441159572]
************************************ 4 ************************************
Training until validation scores don't improve for 200 rounds.
[3000]  training's auc: 0.999558   valid_1's auc: 0.812316
Early stopping, best iteration is:
[4955]  training's auc: 0.999985   valid_1's auc: 0.82816
[('当前设备使用天数', 20919.606680095196), ('当月使用分钟数与前三个月平均值的百分比变化', 17050.523352131248), ('在职总月数', 12673.502319052815), ('每月平均使用分钟数', 12145.743713662028), ('客户生命周期内的平均每月使用分钟数', 12082.749334529042), ('已完成语音通话的平均使用分钟数', 11270.388913482428), ('客户整个生命周期内的平均每月通话次数', 11032.332806184888), ('客户生命周期内的总费用', 10647.951857417822), ('计费调整后的总费用', 10599.385332718492), ('客户生命周期内的总使用分钟数', 10490.505580991507), ('当前手机价格', 10461.154125005007), ('当月费用与前三个月平均值的百分比变化', 10269.522361278534), ('使用高峰语音通话的平均不完整分钟数', 10231.192073732615), ('客户生命周期内的总通话次数', 9965.85817475617), ('计费调整后的总分钟数', 9773.746473029256), ('客户生命周期内平均月费用', 9764.829889595509), ('过去六个月的平均每月使用分钟数', 9703.316017881036), ('过去六个月的平均每月通话次数', 9595.259186178446), ('平均非高峰语音呼叫数', 9585.856355905533), ('计费调整后的呼叫总数', 9195.526195570827)]
[0.8302521387863329, 0.8277260767493848, 0.8265070441159572, 0.8281604518378232]
************************************ 5 ************************************
Training until validation scores don't improve for 200 rounds.
[3000]  training's auc: 0.999494   valid_1's auc: 0.809363
Early stopping, best iteration is:
[4829]  training's auc: 0.999983   valid_1's auc: 0.824736
[('当前设备使用天数', 20857.728651717305), ('当月使用分钟数与前三个月平均值的百分比变化', 17141.65538044274), ('在职总月数', 12623.7158523947), ('每月平均使用分钟数', 12155.711411625147), ('客户生命周期内的平均每月使用分钟数', 11755.307457834482), ('客户整个生命周期内的平均每月通话次数', 11121.649592876434), ('客户生命周期内的总费用', 10800.35821519792), ('当前手机价格', 10647.860997959971), ('已完成语音通话的平均使用分钟数', 10567.15585295856), ('客户生命周期内的总使用分钟数', 10455.313509970903), ('计费调整后的总费用', 10241.350874692202), ('当月费用与前三个月平均值的百分比变化', 10177.092842921615), ('客户生命周期内的总通话次数', 10139.20638936758), ('使用高峰语音通话的平均不完整分钟数', 9981.980402067304), ('计费调整后的总分钟数', 9756.786857843399), ('过去六个月的平均每月使用分钟数', 9725.03030230105), ('客户生命周期内平均月费用', 9604.02791416645), ('计费调整后的呼叫总数', 9452.47144331038), ('平均非高峰语音呼叫数', 9228.985016450286), ('过去六个月的平均每月通话次数', 9228.196154907346)]
[0.8302521387863329, 0.8277260767493848, 0.8265070441159572, 0.8281604518378232, 0.8247357260735417]
lgb_scotrainre_list: [0.8302521387863329, 0.8277260767493848, 0.8265070441159572, 0.8281604518378232, 0.8247357260735417]
lgb_score_mean: 0.8274762875126079
lgb_score_std: 0.0018267969533472914

五、Bayesian parameter tuning

#!pip install bayesian-optimization
from bayes_opt import BayesianOptimization
def LGB_bayesian(num_leaves,  # int
                 bagging_freq,  # int
                 learning_rate, feature_fraction, bagging_fraction,
                 lambda_l1, lambda_l2, min_gain_to_split, max_depth):
    # LightGBM expects the next parameters to be integers, so cast them
    num_leaves = int(num_leaves)
    max_depth = int(max_depth)
    assert type(num_leaves) == int
    assert type(max_depth) == int
    param = {'num_leaves': num_leaves,
             'learning_rate': learning_rate,
             'bagging_fraction': bagging_fraction,
             'bagging_freq': bagging_freq,
             'feature_fraction': feature_fraction,
             'lambda_l1': lambda_l1,
             'lambda_l2': lambda_l2,
             'max_depth': max_depth,
             'objective': 'binary',
             'boosting_type': 'gbdt',
             'verbose': 1,
             'metric': 'auc',
             'seed': 2022,
             'feature_fraction_seed': 2022,
             'bagging_seed': 2022,
             'drop_seed': 2022,
             'data_random_seed': 2022,
             'is_unbalance': True,
             'boost_from_average': False,
             'save_binary': True}
    lgb_train = lgb.Dataset(X_train, y_train, free_raw_data=False, silent=True)
    lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train, free_raw_data=False, silent=True)
    num_round = 10000
    clf = lgb.train(param, lgb_train, num_round, valid_sets=lgb_eval,
                    verbose_eval=500, early_stopping_rounds=200)
    roc = roc_auc_score(y_test, clf.predict(X_test, num_iteration=clf.best_iteration))
    return roc
lgb_train = lgb.Dataset(X_train,y_train,free_raw_data=False,silent=True)
lgb_eval = lgb.Dataset(X_test,y_test,reference=lgb_train,free_raw_data=False,silent=True)
bounds_LGB = {'num_leaves': (5, 50),
              'learning_rate': (0.03, 0.5),
              'feature_fraction': (0.1, 1),
              'bagging_fraction': (0.1, 1),
              'bagging_freq': (0, 10),
              'lambda_l1': (0, 5.0),
              'lambda_l2': (0, 10),
              'min_gain_to_split': (0, 1.0),
              'max_depth': (5, 15)}
X_train,X_test,y_train,y_test=train_test_split(train.drop(labels=['客户ID','是否流失','平均丢弃数据呼叫数'],axis=1),train['是否流失'],random_state=10,test_size=0.2)
from bayes_opt import BayesianOptimization
LGB_BO = BayesianOptimization(LGB_bayesian, bounds_LGB, random_state=13)
init_points = 5
n_iter = 15
print('-' * 130)
with warnings.catch_warnings():
    warnings.filterwarnings('ignore')
    LGB_BO.maximize(init_points=init_points, n_iter=n_iter, acq='ucb', xi=0.0)
----------------------------------------------------------------------------------------------------------------------------------
|   iter    |  target   | baggin... | baggin... | featur... | lambda_l1 | lambda_l2 | learni... | max_depth | min_ga... | num_le... |
-------------------------------------------------------------------------------------------------------------------------------------
Training until validation scores don't improve for 200 rounds.
[500]   valid_0's auc: 0.756501
[1000]  valid_0's auc: 0.779654
[1500]  valid_0's auc: 0.795342
[2000]  valid_0's auc: 0.804397
[2500]  valid_0's auc: 0.812615
[3000]  valid_0's auc: 0.818713
[3500]  valid_0's auc: 0.82294
[4000]  valid_0's auc: 0.826771
[4500]  valid_0's auc: 0.82971
[5000]  valid_0's auc: 0.832648
Did not meet early stopping. Best iteration is:
[5000]  valid_0's auc: 0.832648
| 1 | 0.8326 | 0.7999 | 2.375 | 0.8419 | 4.829 | 9.726 | 0.2431 | 11.09 | 0.7755 | 33.87 |
Training until validation scores don't improve for 200 rounds.
[500]   valid_0's auc: 0.735036
[1000]  valid_0's auc: 0.754152
[1500]  valid_0's auc: 0.767662
[2000]  valid_0's auc: 0.778614
[2500]  valid_0's auc: 0.786152
[3000]  valid_0's auc: 0.792418
[3500]  valid_0's auc: 0.79872
[4000]  valid_0's auc: 0.803314
[4500]  valid_0's auc: 0.807683
[5000]  valid_0's auc: 0.81121
Did not meet early stopping. Best iteration is:
[5000]  valid_0's auc: 0.81121
| 2 | 0.8112 | 0.7498 | 0.3504 | 0.3686 | 0.2926 | 8.571 | 0.2052 | 11.8 | 0.2563 | 20.64 |
Training until validation scores don't improve for 200 rounds.
[500]   valid_0's auc: 0.749544
[1000]  valid_0's auc: 0.774443
[1500]  valid_0's auc: 0.78951
[2000]  valid_0's auc: 0.800602
[2500]  valid_0's auc: 0.80823
[3000]  valid_0's auc: 0.814024
[3500]  valid_0's auc: 0.81853
[4000]  valid_0's auc: 0.821672
[4500]  valid_0's auc: 0.823975
[5000]  valid_0's auc: 0.826105
Did not meet early stopping. Best iteration is:
[5000]  valid_0's auc: 0.826105
| 3 | 0.8261 | 0.1085 | 3.583 | 0.9542 | 1.089 | 3.194 | 0.4614 | 5.319 | 0.06508 | 33.34 |
Training until validation scores don't improve for 200 rounds.
[500]   valid_0's auc: 0.776662
[1000]  valid_0's auc: 0.804675
[1500]  valid_0's auc: 0.817655
[2000]  valid_0's auc: 0.826085
[2500]  valid_0's auc: 0.831839
[3000]  valid_0's auc: 0.835281
Early stopping, best iteration is:
[3179]  valid_0's auc: 0.836292
| 4 | 0.8363 | 0.8864 | 0.08716 | 0.7719 | 4.064 | 0.7572 | 0.3385 | 10.09 | 0.4799 | 48.0 |
Training until validation scores don't improve for 200 rounds.
[500]   valid_0's auc: 0.751437
[1000]  valid_0's auc: 0.777091
[1500]  valid_0's auc: 0.793125
[2000]  valid_0's auc: 0.805084
[2500]  valid_0's auc: 0.812527
[3000]  valid_0's auc: 0.81902
[3500]  valid_0's auc: 0.823788
[4000]  valid_0's auc: 0.827882
[4500]  valid_0's auc: 0.831144
[5000]  valid_0's auc: 0.834175
Did not meet early stopping. Best iteration is:
[5000]  valid_0's auc: 0.834175
| 5 | 0.8342 | 0.1 | 2.47 | 0.741 | 1.623 | 2.77 | 0.3569 | 14.19 | 0.2445 | 25.61 |
Training until validation scores don't improve for 200 rounds.
[500]   valid_0's auc: 0.731224
[1000]  valid_0's auc: 0.749423
[1500]  valid_0's auc: 0.760884
[2000]  valid_0's auc: 0.76975
[2500]  valid_0's auc: 0.777677
[3000]  valid_0's auc: 0.785282
[3500]  valid_0's auc: 0.791667
[4000]  valid_0's auc: 0.796303
[4500]  valid_0's auc: 0.800412
[5000]  valid_0's auc: 0.804301
Did not meet early stopping. Best iteration is:
[5000]  valid_0's auc: 0.804301
| 6 | 0.8043 | 0.6073 | 1.211 | 0.312 | 0.3293 | 7.263 | 0.2122 | 7.399 | 0.2959 | 19.23 |
Training until validation scores don't improve for 200 rounds.
[500]   valid_0's auc: 0.758604
[1000]  valid_0's auc: 0.783179
[1500]  valid_0's auc: 0.798202
[2000]  valid_0's auc: 0.808477
[2500]  valid_0's auc: 0.816619
[3000]  valid_0's auc: 0.821904
[3500]  valid_0's auc: 0.825642
[4000]  valid_0's auc: 0.828837
[4500]  valid_0's auc: 0.83184
[5000]  valid_0's auc: 0.833605
Did not meet early stopping. Best iteration is:
[4998]  valid_0's auc: 0.833608
| 7 | 0.8336 | 0.5285 | 0.4363 | 0.5314 | 4.917 | 0.0 | 0.2589 | 15.0 | 0.5876 | 36.33 |
Training until validation scores don't improve for 200 rounds.
[500]   valid_0's auc: 0.775361
[1000]  valid_0's auc: 0.802306
[1500]  valid_0's auc: 0.816144
[2000]  valid_0's auc: 0.823617
[2500]  valid_0's auc: 0.828743
[3000]  valid_0's auc: 0.831524
Early stopping, best iteration is:
[3085]  valid_0's auc: 0.831879
| 8 | 0.8319 | 1.0 | 8.511 | 1.0 | 4.544 | 7.213 | 0.4367 | 15.0 | 1.0 | 45.52 |
Training until validation scores don't improve for 200 rounds.
[500]   valid_0's auc: 0.759645
[1000]  valid_0's auc: 0.786561
[1500]  valid_0's auc: 0.802323
[2000]  valid_0's auc: 0.81118
[2500]  valid_0's auc: 0.817364
[3000]  valid_0's auc: 0.821898
[3500]  valid_0's auc: 0.824679
Early stopping, best iteration is:
[3739]  valid_0's auc: 0.826167
| 9 | 0.8262 | 0.1 | 10.0 | 1.0 | 5.0 | 0.0 | 0.5 | 15.0 | 1.0 | 30.18 |
Training until validation scores don't improve for 200 rounds.
[500]   valid_0's auc: 0.708925
[1000]  valid_0's auc: 0.721584
[1500]  valid_0's auc: 0.729905
[2000]  valid_0's auc: 0.736464
[2500]  valid_0's auc: 0.741614
[3000]  valid_0's auc: 0.746397
[3500]  valid_0's auc: 0.750703
[4000]  valid_0's auc: 0.754139
[4500]  valid_0's auc: 0.757474
[5000]  valid_0's auc: 0.760932
Did not meet early stopping. Best iteration is:
[5000]  valid_0's auc: 0.760932
| 10 | 0.7609 | 0.1 | 0.0 | 0.1 | 0.0 | 7.904 | 0.03 | 15.0 | 0.0 | 43.18 |
=====================================================================================================================================
print(LGB_BO.max['target'])  # after optimization, check the best target value obtained
LGB_BO.max['params']  # and the corresponding parameters
0.8362916622722081
{'bagging_fraction': 0.8864320989515848, 'bagging_freq': 0.08715732303784862, 'feature_fraction': 0.7719195132945438, 'lambda_l1': 4.0642058550131175, 'lambda_l2': 0.7571744617226672, 'learning_rate': 0.33853400726057015, 'max_depth': 10.092622000835181, 'min_gain_to_split': 0.47988339149638315, 'num_leaves': 48.00083652189798}
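Note that bayes_opt returns every parameter as a float. A minimal sketch of turning this into a valid LightGBM parameter dict, assuming the integer-typed parameters are simply rounded (as done by hand for params2 in section 4.1):

```python
# Hedged sketch: cast the integer-valued parameters back and add the fixed
# settings before feeding the result to lgb.train.
best_params = LGB_BO.max['params'].copy()
for k in ('num_leaves', 'max_depth', 'bagging_freq'):
    best_params[k] = int(round(best_params[k]))
best_params.update({'boosting_type': 'gbdt', 'objective': 'binary',
                    'metric': 'auc', 'seed': 2022, 'n_jobs': -1})
```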
  • The BayesianOptimization library has another handy option: you can probe the LGB_bayesian function at specific points. This is useful when you already have a good idea of the best parameters, or when you take parameters from another kernel. Here I copy in the parameters found above and probe them as follows.
  • By default these points are explored lazily (lazy=True), meaning they are only evaluated the next time maximize is called, so we follow the probe with a maximize call on the LGB_BO object.
LGB_BO.probe(
    params={'bagging_fraction': 0.8864320989515848,
            'bagging_freq': 0.08715732303784862,
            'feature_fraction': 0.7719195132945438,
            'lambda_l1': 4.0642058550131175,
            'lambda_l2': 0.7571744617226672,
            'learning_rate': 0.33853400726057015,
            'max_depth': 10,
            'min_gain_to_split': 0.47988339149638315,
            'num_leaves': 48},
    lazy=True,
)
LGB_BO.maximize(init_points=0, n_iter=0)  # no new random points or iterations: only the probed point is evaluated
|   iter    |  target   | baggin... | baggin... | featur... | lambda_l1 | lambda_l2 | learni... | max_depth | min_ga... | num_le... |
-------------------------------------------------------------------------------------------------------------------------------------
Training until validation scores don't improve for 200 rounds.
[500]   valid_0's auc: 0.776662
[1000]  valid_0's auc: 0.804675
[1500]  valid_0's auc: 0.817655
[2000]  valid_0's auc: 0.826085
[2500]  valid_0's auc: 0.831839
[3000]  valid_0's auc: 0.835281
Early stopping, best iteration is:
[3179]  valid_0's auc: 0.836292
|  11       |  0.8363   |  0.8864   |  0.08716  |  0.7719   |  4.064    |  0.7572   |  0.3385   |  10.0     |  0.4799   |  48.0     |
=====================================================================================================================================
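As a side note, because lazy=True only queues a point, several candidate configurations (for example, taken from different kernels) can be queued with repeated probe calls and then evaluated by a single maximize. A sketch, where params_from_kernel_a and params_from_kernel_b are hypothetical parameter dicts:

# Queue several known-good configurations; each probe(lazy=True) appends to the queue.
for candidate in [params_from_kernel_a, params_from_kernel_b]:   # hypothetical dicts
    LGB_BO.probe(params=candidate, lazy=True)
LGB_BO.maximize(init_points=0, n_iter=0)   # evaluates only the queued points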

All evaluated points and their corresponding target values are available through the LGB_BO.res attribute.

for i, res in enumerate(LGB_BO.res):
    print("Iteration {}: \n\t{}".format(i, res))
Iteration 0: {'target': 0.8326475014985013, 'params': {'bagging_fraction': 0.7999321695164382, 'bagging_freq': 2.375412200349123, 'feature_fraction': 0.8418506793952316, 'lambda_l1': 4.8287459902149985, 'lambda_l2': 9.726011139048934, 'learning_rate': 0.2431211462861367, 'max_depth': 11.09042462761278, 'min_gain_to_split': 0.7755265146048467, 'num_leaves': 33.87260051415811}}
Iteration 1: {'target': 0.8112097384468528, 'params': {'bagging_fraction': 0.7498164065652524, 'bagging_freq': 0.35036524101437316, 'feature_fraction': 0.36860452380026143, 'lambda_l1': 0.29256245941037373, 'lambda_l2': 8.57060942587199, 'learning_rate': 0.2052413931011595, 'max_depth': 11.798479515780969, 'min_gain_to_split': 0.2562799493266301, 'num_leaves': 20.64115468186214}}
Iteration 2: {'target': 0.8261050715459326, 'params': {'bagging_fraction': 0.10847149307287247, 'bagging_freq': 3.5833378270496974, 'feature_fraction': 0.9541847635103893, 'lambda_l1': 1.0894950456584445, 'lambda_l2': 3.193913663803646, 'learning_rate': 0.461353021420276, 'max_depth': 5.319036664398947, 'min_gain_to_split': 0.06508453704251449, 'num_leaves': 33.34230495985203}}
Iteration 3: {'target': 0.8362916622722081, 'params': {'bagging_fraction': 0.8864320989515848, 'bagging_freq': 0.08715732303784862, 'feature_fraction': 0.7719195132945438, 'lambda_l1': 4.0642058550131175, 'lambda_l2': 0.7571744617226672, 'learning_rate': 0.33853400726057015, 'max_depth': 10.092622000835181, 'min_gain_to_split': 0.47988339149638315, 'num_leaves': 48.00083652189798}}
Iteration 4: {'target': 0.8341753174329471, 'params': {'bagging_fraction': 0.1000108302125366, 'bagging_freq': 2.4697870099191634, 'feature_fraction': 0.7410094101204252, 'lambda_l1': 1.6229102489335656, 'lambda_l2': 2.769963563838095, 'learning_rate': 0.3568593627001331, 'max_depth': 14.185517481459488, 'min_gain_to_split': 0.2444757021979903, 'num_leaves': 25.613861777454364}}
Iteration 5: {'target': 0.8043011941176942, 'params': {'bagging_fraction': 0.6072612766397498, 'bagging_freq': 1.2108187274978577, 'feature_fraction': 0.31196871185798225, 'lambda_l1': 0.32926780461649763, 'lambda_l2': 7.263184308560325, 'learning_rate': 0.2121743710976095, 'max_depth': 7.398509227738581, 'min_gain_to_split': 0.2958905376418717, 'num_leaves': 19.228154502442003}}
Iteration 6: {'target': 0.833607682808766, 'params': {'bagging_fraction': 0.5284563589983934, 'bagging_freq': 0.4362596810480882, 'feature_fraction': 0.5314451681549669, 'lambda_l1': 4.917464527991873, 'lambda_l2': 0.0, 'learning_rate': 0.258904064526674, 'max_depth': 15.0, 'min_gain_to_split': 0.5875699451215919, 'num_leaves': 36.327854565302374}}
Iteration 7: {'target': 0.8318794311203416, 'params': {'bagging_fraction': 1.0, 'bagging_freq': 8.511485033295257, 'feature_fraction': 1.0, 'lambda_l1': 4.543957420169538, 'lambda_l2': 7.21250815846057, 'learning_rate': 0.4367137368532056, 'max_depth': 15.0, 'min_gain_to_split': 1.0, 'num_leaves': 45.523690581294865}}
Iteration 8: {'target': 0.826167096580232, 'params': {'bagging_fraction': 0.1, 'bagging_freq': 10.0, 'feature_fraction': 1.0, 'lambda_l1': 5.0, 'lambda_l2': 0.0, 'learning_rate': 0.5, 'max_depth': 15.0, 'min_gain_to_split': 1.0, 'num_leaves': 30.184337136649773}}
Iteration 9: {'target': 0.7609321700847423, 'params': {'bagging_fraction': 0.1, 'bagging_freq': 0.0, 'feature_fraction': 0.1, 'lambda_l1': 0.0, 'lambda_l2': 7.904224578705337, 'learning_rate': 0.03, 'max_depth': 15.0, 'min_gain_to_split': 0.0, 'num_leaves': 43.17758999088051}}
Iteration 10: {'target': 0.8362916622722081, 'params': {'bagging_fraction': 0.8864320989515848, 'bagging_freq': 0.08715732303784862, 'feature_fraction': 0.7719195132945438, 'lambda_l1': 4.0642058550131175, 'lambda_l2': 0.7571744617226672, 'learning_rate': 0.33853400726057015, 'max_depth': 10.0, 'min_gain_to_split': 0.47988339149638315, 'num_leaves': 48.0}}
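Since LGB_BO.res is a plain list of dicts, the best probe can also be recovered programmatically instead of being read off the printout; a minimal sketch (not in the original notebook):

# The entry with the highest target is exactly what LGB_BO.max returns.
best = max(LGB_BO.res, key=lambda r: r['target'])
print(best['target'])   # 0.8362916622722081
print(best['params'])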

Now build a model with these parameters.

# Note: the original cell first built an unused `params` dict with the probed
# values; it is dropped here. Also, min_data_in_leaf and min_sum_hessian_in_leaf
# were not part of this search space, so reading them from LGB_BO.max['params']
# would raise a KeyError -- they are pinned to LightGBM's defaults instead
# (assumed values, adjust as needed).
param_lgb = {
    'num_leaves': int(LGB_BO.max['params']['num_leaves']),   # remember to int here
    'max_bin': 63,
    'min_data_in_leaf': 20,                                  # not tuned in this run
    'learning_rate': LGB_BO.max['params']['learning_rate'],
    'min_sum_hessian_in_leaf': 1e-3,                         # not tuned in this run
    'bagging_fraction': 1.0,
    'bagging_freq': 5,
    'feature_fraction': LGB_BO.max['params']['feature_fraction'],
    'lambda_l1': LGB_BO.max['params']['lambda_l1'],
    'lambda_l2': LGB_BO.max['params']['lambda_l2'],
    'min_gain_to_split': LGB_BO.max['params']['min_gain_to_split'],
    'max_depth': int(LGB_BO.max['params']['max_depth']),     # remember to int here
    'save_binary': True,
    'seed': 1337,
    'feature_fraction_seed': 1337,
    'bagging_seed': 1337,
    'drop_seed': 1337,
    'data_random_seed': 1337,
    'objective': 'binary',
    'boosting_type': 'gbdt',
    'verbose': 1,
    'metric': 'auc',
    'is_unbalance': True,
    'boost_from_average': False,
}
import numpy as np
from sklearn import metrics
from sklearn.model_selection import StratifiedKFold   # import was missing in the original cell

# train_df/test_df, predictors and target are the frames and column lists prepared earlier.
nfold = 5
gc.collect()
skf = StratifiedKFold(n_splits=nfold, shuffle=True, random_state=2019)

oof = np.zeros(len(train_df))
predictions = np.zeros((len(test_df), nfold))

i = 1
for train_index, valid_index in skf.split(train_df, train_df.target.values):
    print("\nfold {}".format(i))
    xg_train = lgb.Dataset(train_df.iloc[train_index][predictors].values,
                           label=train_df.iloc[train_index][target].values,
                           feature_name=predictors,
                           free_raw_data=False)
    xg_valid = lgb.Dataset(train_df.iloc[valid_index][predictors].values,
                           label=train_df.iloc[valid_index][target].values,
                           feature_name=predictors,
                           free_raw_data=False)
    clf = lgb.train(param_lgb, xg_train, 5000, valid_sets=[xg_valid],
                    verbose_eval=250, early_stopping_rounds=50)
    oof[valid_index] = clf.predict(train_df.iloc[valid_index][predictors].values,
                                   num_iteration=clf.best_iteration)
    predictions[:, i - 1] += clf.predict(test_df[predictors], num_iteration=clf.best_iteration)
    i = i + 1

print("\n\nCV AUC: {:<0.2f}".format(metrics.roc_auc_score(train_df.target.values, oof)))

See also: a Hyperopt getting-started guide and 《Insightful EDA + modeling LGBM hyperopt》 for an alternative tuning approach.
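Finally, the per-fold test predictions are averaged into a single churn score per customer before writing the submission. The averaging line below is assumed (the original notebook reuses the name lgb_test from the baseline section without showing this step):

# Average the nfold test predictions into one probability per customer.
lgb_test = predictions.mean(axis=1)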

test['是否流失'] = lgb_test
test[['客户ID','是否流失']].to_csv('test_sub.csv',index=False)

The result: a few more rounds of tuning brought only marginal gains, so I did not submit again.
