文章目录

  • 1. Baseline
  • 2. AUC评估要使用predict_proba
    • 2.1 导入工具包
    • 2.2 特征提取
    • 2.3 训练+模型选择
    • 2.4 网格/随机搜索 参数+提交
    • 2.5 测试结果
  • 3. 致谢

新人赛地址

1. Baseline

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelBinarizer
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import FeatureUnion
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_scoretrain = pd.read_csv("./train_set.csv")
test = pd.read_csv("./test_set.csv")
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25317 entries, 0 to 25316
Data columns (total 18 columns):#   Column     Non-Null Count  Dtype
---  ------     --------------  ----- 0   ID         25317 non-null  int64 1   age        25317 non-null  int64 2   job        25317 non-null  object3   marital    25317 non-null  object4   education  25317 non-null  object5   default    25317 non-null  object6   balance    25317 non-null  int64 7   housing    25317 non-null  object8   loan       25317 non-null  object9   contact    25317 non-null  object10  day        25317 non-null  int64 11  month      25317 non-null  object12  duration   25317 non-null  int64 13  campaign   25317 non-null  int64 14  pdays      25317 non-null  int64 15  previous   25317 non-null  int64 16  poutcome   25317 non-null  object17  y          25317 non-null  int64
dtypes: int64(9), object(9)
memory usage: 3.5+ MB
NO 字段名称 数据类型 字段描述
1 ID Int 客户唯一标识
2 age Int 客户年龄
3 job String 客户的职业
4 marital String 婚姻状况
5 education String 受教育水平
6 default String 是否有违约记录
7 balance Int 每年账户的平均余额
8 housing String 是否有住房贷款
9 loan String 是否有个人贷款
10 contact String 与客户联系的沟通方式
11 day Int 最后一次联系的时间(几号)
12 month String 最后一次联系的时间(月份)
13 duration Int 最后一次联系的交流时长
14 campaign Int 在本次活动中,与该客户交流过的次数
15 pdays Int 距离上次活动最后一次联系该客户,过去了多久(999表示没有联系过)
16 previous Int 在本次活动之前,与该客户交流过的次数
17 poutcome String 上一次活动的结果
18 y Int 预测客户是否会订购定期存款业务
  • 相关系数
abs(train.corr()['y']).sort_values(ascending=False)
y           1.000000
ID          0.556627
duration    0.394746
pdays       0.107565
previous    0.088337
campaign    0.075173
balance     0.057564
day         0.031886
age         0.029916
Name: y, dtype: float64
  • 绘制数字特征分布图
s = (train.dtypes == 'object')
object_col = list(s[s].index)
object_col
num_col = list(set(train.columns) - set(object_col))plt.figure(figsize=(25,22))
for (i,col) in enumerate(num_col):plt.subplot(3,3,i+1)sns.distplot(train[col]) # kde=False 可不显示密度线plt.xlabel(col,size=20)
plt.show()

  • 分析下训练集 y 标签的比例
len(train[train['y']==1])/len(train['y'])
0.11695698542481336

只有 11% 的人会购买

  • 新人赛,数据没有缺失的,直接用模型试试效果
X_train = train.drop(['ID','y'], axis=1)
X_test = test.drop(['ID'], axis=1)
y_train = train['y']
def num_cat_splitor(X_train):s = (X_train.dtypes == 'object')object_cols = list(s[s].index)num_cols = list(set(X_train.columns) - set(object_cols))return num_cols, object_cols
num_cols, object_cols = num_cat_splitor(X_train)
# 查看文字变量的种类
for col in object_col:print(col, sorted(train[col].unique()))print(col, sorted(test[col].unique()))
class DataFrameSelector(BaseEstimator, TransformerMixin):def __init__(self, attribute_names):self.attribute_names = attribute_namesdef fit(self, X, y=None):return selfdef transform(self, X):return X[self.attribute_names].valuesnum_pipeline = Pipeline([('selector', DataFrameSelector(num_cols)),#('imputer', SimpleImputer(strategy="median")),('std_scaler', StandardScaler()),])
cat_pipeline = Pipeline([('selector', DataFrameSelector(object_cols)),('cat_encoder', OneHotEncoder(sparse=False,handle_unknown='ignore')),])
full_pipeline = FeatureUnion(transformer_list=[("num_pipeline", num_pipeline),("cat_pipeline", cat_pipeline),])
X_prepared = full_pipeline.fit_transform(X_train)
from sklearn.ensemble import RandomForestClassifierprepare_select_and_predict_pipeline = Pipeline([('preparation', full_pipeline),('forst_reg', RandomForestClassifier(random_state=0))
])
param_grid = [{'forst_reg__n_estimators' : [50,100, 150, 200,250,300,330,350],'forst_reg__max_features':[45,50, 55, 65]
}]grid_search_prep = GridSearchCV(prepare_select_and_predict_pipeline, param_grid, cv=7,scoring='roc_auc', verbose=2, n_jobs=-1)
grid_search_prep.fit(X_train,y_train)
grid_search_prep.best_params_
final_model = grid_search_prep.best_estimator_
y_pred_test = final_model.predict(X_test)
# AUC 评估准则,需要使用 predict_proba,这里错误!!!
result = pd.DataFrame()
result['ID'] = test['ID']
result['pred'] = y_pred_test
result.to_csv('buy_product_pred.csv',index=False)

排名结果

auc 得分:0.72439844

2. AUC评估要使用predict_proba

AUC 指标,在预测时,应该使用概率来预测,上面做法是错误的(未使用概率预测)。

  • 机器学习之分类器性能指标之ROC曲线、AUC值 https://www.cnblogs.com/dlml/p/4403482.html
  • 如何理解机器学习和统计中的AUC? https://www.zhihu.com/question/39840928
  • sklearn.metrics.roc_auc_score 介绍

AUC 评估模型的优点,在模型正负样本比例失衡的情况下,依然可以很好的评估模型

以下重新对代码进行优化

2.1 导入工具包

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['figure.facecolor']=(1,1,1,1) # pycharm 绘图白底,看得清坐标
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelBinarizer
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import FeatureUnion
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_scoretrain = pd.read_csv("./train_set.csv")
test = pd.read_csv("./test_set.csv")

2.2 特征提取

  • 查看文字特征的值
for col in object_col:print(col, sorted(train[col].unique()))print(col, sorted(test[col].unique()))
job ['admin.', 'blue-collar', 'entrepreneur', 'housemaid', 'management', 'retired', 'self-employed', 'services', 'student', 'technician', 'unemployed', 'unknown']
job ['admin.', 'blue-collar', 'entrepreneur', 'housemaid', 'management', 'retired', 'self-employed', 'services', 'student', 'technician', 'unemployed', 'unknown']
marital ['divorced', 'married', 'single']
marital ['divorced', 'married', 'single']
education ['primary', 'secondary', 'tertiary', 'unknown']
education ['primary', 'secondary', 'tertiary', 'unknown']
default ['no', 'yes']
default ['no', 'yes']
housing ['no', 'yes']
housing ['no', 'yes']
loan ['no', 'yes']
loan ['no', 'yes']
contact ['cellular', 'telephone', 'unknown']
contact ['cellular', 'telephone', 'unknown']
month ['apr', 'aug', 'dec', 'feb', 'jan', 'jul', 'jun', 'mar', 'may', 'nov', 'oct', 'sep']
month ['apr', 'aug', 'dec', 'feb', 'jan', 'jul', 'jun', 'mar', 'may', 'nov', 'oct', 'sep']
poutcome ['failure', 'other', 'success', 'unknown']
poutcome ['failure', 'other', 'success', 'unknown']
  • 二值特征转化为 0, 1
# 对 'default','housing','loan' 3列二值(yes,no)特征转为 0,1
def binaryFeature(data):data['default_']=0data['default_'][data['default']=='yes'] = 1data['housing_']=0data['housing_'][data['housing']=='yes'] = 1data['loan_']=0data['loan_'][data['loan']=='yes'] = 1return data.drop(['default','housing','loan'], axis=1)X_train = binaryFeature(train)
X_test = binaryFeature(test)
  • 训练集数据切分,用于本地测试
X_train = X_train.drop(['ID'], axis=1)
X_test = X_test.drop(['ID'], axis=1)# 将训练集拆分一些出来做验证, 分层抽样
from sklearn.model_selection import StratifiedShuffleSplit
splt = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=1)
for train_idx, vaild_idx in splt.split(X_train, X_train['y']):train_part = X_train.loc[train_idx]valid_part = X_train.loc[vaild_idx]# 训练集拆成两部分 本地测试
train_part_y = train_part['y']
valid_part_y = valid_part['y']
train_part = train_part.drop(['y'], axis=1)
valid_part = valid_part.drop(['y'], axis=1)
  • 特征处理管道
def num_cat_splitor(X_train):s = (X_train.dtypes == 'object')object_cols = list(s[s].index)num_cols = list(set(X_train.columns) - set(object_cols))return num_cols, object_colsnum_cols, object_cols = num_cat_splitor(X_train)
num_cols.remove('y')class DataFrameSelector(BaseEstimator, TransformerMixin):def __init__(self, attribute_names):self.attribute_names = attribute_namesdef fit(self, X, y=None):return selfdef transform(self, X):return X[self.attribute_names].valuesnum_pipeline = Pipeline([('selector', DataFrameSelector(num_cols)),
#         ('imputer', SimpleImputer(strategy="median")),
#         ('std_scaler', StandardScaler()),])
cat_pipeline = Pipeline([('selector', DataFrameSelector(object_cols)),('cat_encoder', OneHotEncoder(sparse=False, handle_unknown='ignore')),])
full_pipeline = FeatureUnion(transformer_list=[("num_pipeline", num_pipeline),("cat_pipeline", cat_pipeline),])

2.3 训练+模型选择

# 本地测试,选模型
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_scorerf = RandomForestClassifier()
knn = KNeighborsClassifier()
lr = LogisticRegression()
svc = SVC(probability=True)
gbdt = GradientBoostingClassifier()models = [knn, lr, svc, rf, gbdt]
param_grid_list = [# knn[{'model__n_neighbors' : [5,15,35,50,100],'model__leaf_size' : [10,20,30,40,50]}],# lr[{'model__penalty' : ['l1', 'l2'],'model__C' : [0.2, 0.5, 1, 1.2, 1.5],'model__max_iter' : [10000]}],# svc[{'model__C' : [0.2, 0.5, 1, 1.2],'model__kernel' : ['rbf']}],# rf[{#     'preparation__num_pipeline__imputer__strategy': ['mean', 'median', 'most_frequent'],'model__n_estimators' : [200,250,300,330,350],'model__max_features' : [20,30,40,50],'model__max_depth' : [5,7]}],# gbdt[{'model__learning_rate' : [0.1, 0.5],'model__n_estimators' : [130, 200, 300],'model__max_features' : ['sqrt'],'model__max_depth' : [5,7],'model__min_samples_split' : [500,1000,1200],'model__min_samples_leaf' : [60, 100],'model__subsample' : [0.8, 1]}],
]for i, model in enumerate(models):pipe = Pipeline([('preparation', full_pipeline),('model', model)])grid_search = GridSearchCV(pipe, param_grid_list[i], cv=3,scoring='roc_auc', verbose=2, n_jobs=-1)grid_search.fit(train_part, train_part_y)print(grid_search.best_params_)final_model = grid_search.best_estimator_pred = final_model.predict_proba(valid_part)[:,1] # roc 必须使用概率预测print("auc score: ", roc_auc_score(valid_part_y, pred))
  • 注意 AUC 评分标准 要使用predict_proba方法 !!!
Fitting 3 folds for each of 25 candidates, totalling 75 fits
{'model__leaf_size': 20, 'model__n_neighbors': 50}
auc score:  0.8212256518034133
Fitting 3 folds for each of 10 candidates, totalling 30 fits
{'model__C': 1.2, 'model__max_iter': 10000, 'model__penalty': 'l2'}
auc score:  0.9011510812019533
Fitting 3 folds for each of 4 candidates, totalling 12 fits
{'model__C': 0.2, 'model__kernel': 'rbf'}
auc score:  0.7192431208601267
Fitting 3 folds for each of 40 candidates, totalling 120 fits
{'model__max_depth': 7, 'model__max_features': 20, 'model__n_estimators': 350}
auc score:  0.913398647137746
Fitting 3 folds for each of 144 candidates, totalling 432 fits
{'model__learning_rate': 0.1, 'model__max_depth': 7, 'model__max_features': 'sqrt', 'model__min_samples_leaf': 60, 'model__min_samples_split': 500, 'model__n_estimators': 300, 'model__subsample': 1}
auc score:  0.9299485084368806

可以看见 GBDT 梯度提升下降树模型表现最好

2.4 网格/随机搜索 参数+提交

微调参数列表,使用全部的训练数据训练,使用 RF 和 GBDT 模型 对测试集进行预测

  • 网格搜索
# 全量训练,网格搜索,提交
y_train = X_train['y']
X_train_ = X_train.drop(['y'], axis=1)select_model = [rf, gbdt]
param_grid_list = [# rf[{#     'preparation__num_pipeline__imputer__strategy': ['mean', 'median', 'most_frequent'],'model__n_estimators' : [250,300,350,400],'model__max_features' : [7,8,10,15,20],'model__max_depth' : [7,9,10,11]}],# gbdt[{'model__learning_rate' : [0.03, 0.05, 0.1],'model__n_estimators' : [200, 300, 350],'model__max_features' : ['sqrt'],'model__max_depth' : [7,9,11],'model__min_samples_split' : [300, 400, 500],'model__min_samples_leaf' : [50,60,70],'model__subsample' : [0.8, 1, 1.2]}],
]for i, model in enumerate(select_model):pipe = Pipeline([('preparation', full_pipeline),('model', model)])grid_search = GridSearchCV(pipe, param_grid_list[i], cv=3,scoring='roc_auc', verbose=2, n_jobs=-1)grid_search.fit(X_train_, y_train)print(grid_search.best_params_)final_model = grid_search.best_estimator_pred = final_model.predict_proba(X_test)[:,1] # roc 必须使用概率预测print(model,'\n finished!')result = pd.DataFrame()result['ID'] = test['ID']result['pred'] = predresult.to_csv('{}_pred.csv'.format(i), index=False)
Fitting 3 folds for each of 80 candidates, totalling 240 fits
{'model__max_depth': 11, 'model__max_features': 15, 'model__n_estimators': 400}
RandomForestClassifier() finished!
Fitting 3 folds for each of 729 candidates, totalling 2187 fits
{'model__learning_rate': 0.05, 'model__max_depth': 11, 'model__max_features': 'sqrt',
'model__min_samples_leaf': 50, 'model__min_samples_split': 500,
'model__n_estimators': 300, 'model__subsample': 1}
GradientBoostingClassifier() finished!
  • 随机搜索
# 随机搜索参数
y_train = X_train['y']
X_train_ = X_train.drop(['y'], axis=1)from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint
select_model = [rf, gbdt]
param_distribs = [# rf[{#     'preparation__num_pipeline__imputer__strategy': ['mean', 'median', 'most_frequent'],'model__n_estimators' : randint(low=250, high=500),'model__max_features' : randint(low=10, high=30),'model__max_depth' : randint(low=8, high=20)}],# gbdt[{'model__learning_rate' : np.linspace(0.01, 0.1, 10),'model__n_estimators' : randint(low=250, high=500),'model__max_features' : ['sqrt'],'model__max_depth' : randint(low=8, high=20),'model__min_samples_split' : randint(low=400, high=1000),'model__min_samples_leaf' : randint(low=40, high=80),'model__subsample' : np.linspace(0.5, 1.5, 10)}],
]for i, model in enumerate(select_model):pipe = Pipeline([('preparation', full_pipeline),('model', model)])rand_search = RandomizedSearchCV(pipe, param_distributions=param_distribs[i], cv=3,n_iter=20,scoring='roc_auc', verbose=2, n_jobs=-1)rand_search.fit(X_train_, y_train)print(rand_search.best_params_)final_model = rand_search.best_estimator_pred = final_model.predict_proba(X_test)[:,1] # roc 必须使用概率预测print(model,'\n finished!')result = pd.DataFrame()result['ID'] = test['ID']result['pred'] = predresult.to_csv('{}_pred.csv'.format(i), index=False)
Fitting 3 folds for each of 20 candidates, totalling 60 fits
{'model__max_depth': 18, 'model__max_features': 13, 'model__n_estimators': 481}
RandomForestClassifier() finished!
Fitting 3 folds for each of 20 candidates, totalling 60 fits
{'model__learning_rate': 0.05000000000000001, 'model__max_depth': 15, 'model__max_features': 'sqrt',
'model__min_samples_leaf': 68, 'model__min_samples_split': 905,
'model__n_estimators': 362, 'model__subsample': 0.9444444444444444}
GradientBoostingClassifier() finished!

2.5 测试结果

RF 模型得分:0.9229160811692528


GBDT 模型得分:0.9332932318964199


第二期排名,暂列第8

3. 致谢

感谢徐师兄一直以来的指点和帮助!
欢迎大家一起分享练习心得,一起继续加油!

[Kesci] 预测分析 · 客户购买预测(AUC评估要使用predict_proba)相关推荐

  1. Bank Marketing预测一个客户购买理财产品的成功率

    Bank Marketing预测一个客户购买理财产品的成功率 一.实验目的 熟悉数据预处理的基本方法,包括缺失值填充.数据编码. 熟悉 pandas.scikit 等数据分析库的使用. 熟悉机器学习算 ...

  2. 【编译原理笔记05】语法分析:FIRST集和FOLLOW集的计算,[非]递归的预测分析法,预测分析中的错误处理

    本次笔记内容: 4-4 FIRST集和FOLLOW集 4-5 递归的预测分析法 4-6 非递归的预测分析法 4-7 预测分析法中的错误处理 本节课幻灯片,见于我的 GitHub 仓库:第5讲 语法分析 ...

  3. 4.2.5 预测分析法与预测分析表的构造

    4.2.5 预测分析法与预测分析表的构造 预测分析法也称为 LL ( 1 )分析法.这种分析法是确定的自上而下分析的另一种方法,采用这种方法进行语法分析要求描述语言的文法是 LL ( 1 )文法. 一 ...

  4. 张量是一个序列_张量流中使用gru和bilstm进行预测分析的时间序列预测

    张量是一个序列 Recurrent Neural Networks are designed to handle the complexity of sequence dependence in ti ...

  5. 今天nba预测分析_NBA情报预测分析_NBA足球俱乐部 - 全球体育网

    小前锋这个位置要求比较高,既得需要锋线的意识,还需要后卫的技术,最重要的还得攻守兼备.下面是我心目中的NBA历史前五小前锋:1.皮蓬.2.小皇帝詹姆斯.3.大鸟伯德.4.J博士欧文.5.KD杜兰特.6 ...

  6. 8大预测分析工具比较

    什么是预测分析工具? 预测分析工具融合了人工智能和业务报告.这些工具包括用于从整个企业收集数据的复杂管道,添加统计分析和机器学习层以对未来进行预测,并将这些见解提炼成有用的摘要,以便业务用户可以对此采 ...

  7. 指纹图谱相似度评价软件_基于指纹图谱和网络药理学对当归四逆汤中桂枝的Qmarker预测分析...

    摘  要:目的  基于指纹图谱和网络药理学分析预测当归四逆汤(DSD)中桂枝的质量标志物(Q-marker).方法 建立桂枝水煎液和DSD的指纹图谱,利用中药色谱指纹图谱相似度评价系统软件(2012年 ...

  8. 语法分析:自上而下分析(递归下降分析法+预测分析法)

    语法分析:自上而下分析 目录 语法分析:自上而下分析 知识背景 递归下降分析法 内容一:根据文法生成子程序 内容二:调用文法开始符号所对应的子程序 预测分析法 内容一:构造预测分析表 内容二:预测分析 ...

  9. ML:基于葡萄牙银行机构营销活动数据集(年龄/职业等)利用Pipeline框架(两种类型特征并行处理)+多种模型预测(分层抽样+调参交叉验证评估+网格/随机搜索+推理)客户是否购买该银行的产品二分类案

    ML之pipeline:基于葡萄牙银行机构营销活动数据集(年龄/职业/婚姻/违约等)利用Pipeline框架(两种类型特征并行处理)+多种模型预测(分层抽样+调参交叉验证评估+网格搜索/随机搜索+模型 ...

最新文章

  1. 展望未来:使用 PostCSS 和 cssnext 书写 CSS
  2. tarnado源码解析系列一
  3. poj 2492 A Bug's Life
  4. 优化技巧与理论(part1)
  5. django与easyui使用过程中遇到的问题
  6. 银行家算法:解决多线程死锁问题
  7. 阿里巴巴发布AliOS品牌 重投汽车及IoT领域
  8. MySQL亿级数据数据库优化方案测试-银行交易流水记录的查询
  9. Objective-C的内省(Introspection)小结
  10. Android Folding View(折叠视图、控件)
  11. Git的使用--如何将本地项目上传到Github(两种简单、方便的方法)
  12. FTP命令 上传下载文件
  13. 什么叫组网_混合组网是什么意思
  14. 关于TP3.2.3的反序列化学习
  15. 【光线追踪系列十七】直接光源采样
  16. unity 3d iphone android 通用,在Unity3D中使用iPhone原生UI
  17. 网易邮箱服务器邮箱协议,怎么用网易闪电邮IMAP协议同步网页邮箱?
  18. 网络通信TCP/UDP
  19. 医疗影像MRI相关软件
  20. 超级鹰模拟登录古诗文网站

热门文章

  1. 融云通讯服务器,vue使用融云即时通讯,老是报了发送失败,服务器超时
  2. 石头剪刀布python编程_《python核心编程第二版》练习题——游戏:石头剪刀布
  3. usg6000v 无法ping通_柯美复印机网络打印无响应?无法打印、扫描?原来这里出了问题...
  4. php imap 附件,学习猿地-PHP-imap 使用参考
  5. java breakpoint_java断点
  6. 衰落信道中的平均信噪比和瞬时信噪比
  7. jmeter+Fiddler:通过Fiddler抓包生成jmeter脚本
  8. 随堂小测冲刺.第19天
  9. vue项目创建,redis列表字典操作,django用redis的第二种方法
  10. WebSocket介绍