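4. Using the XGBoost Model

4.1 Data Introduction

Using information about each user, we build a model to predict whether the user will repay the loan on time.

The fields are as follows:

Field	Description
Disbursed	Whether the loan was repaid (target)
Existing_EMI	Existing monthly repayment amount
Loan_Amount_Applied	Loan amount applied for
Loan_Tenure_Applied	Loan term applied for
Monthly_Income	Monthly income
……	……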

4.2 Imports

import pandas as pd
import numpy as np
import xgboost as xgb
from xgboost import XGBClassifier
from sklearn import model_selection, metrics
from sklearn.model_selection import GridSearchCV
import matplotlib.pylab as plt
from matplotlib.pylab import rcParams
rcParams['figure.figsize'] = 12, 4

4.3 Loading the Data

train = pd.read_csv('train_modified.csv')
test = pd.read_csv('test_modified.csv')
# Drop the ID column; it carries no information for modeling
train.drop(labels='ID', axis=1, inplace=True)
test.drop(labels='ID', axis=1, inplace=True)
# Declare the target column and the feature columns
target = 'Disbursed'
cols = [x for x in train.columns if x not in [target]]

4.4 Building the Training Function

def modelfit(model, dtrain, dtest, cols, useTrainCV=True, cv_folds=5, early_stopping_rounds=50):
    # Cross-validate on the training data to choose the number of trees
    if useTrainCV:
        xgb_param = model.get_xgb_params()
        xgb_train = xgb.DMatrix(dtrain[cols].values, label=dtrain[target].values)
        xgb_test = xgb.DMatrix(dtest[cols].values)
        cvresult = xgb.cv(xgb_param, xgb_train,
                          num_boost_round=model.get_params()['n_estimators'],
                          nfold=cv_folds,
                          early_stopping_rounds=early_stopping_rounds,
                          verbose_eval=False)
        model.set_params(n_estimators=cvresult.shape[0])
    # Fit the model (eval_metric in fit() is accepted by older xgboost releases;
    # newer releases expect it in the XGBClassifier constructor instead)
    model.fit(dtrain[cols], dtrain['Disbursed'], eval_metric='auc')
    # Predict on the training set
    y_ = model.predict(dtrain[cols])
    proba_ = model.predict_proba(dtrain[cols])[:, 1]  # probability of the positive class
    # Report results
    print('Model performance:')
    print('Accuracy (train): %.4g' % metrics.accuracy_score(dtrain['Disbursed'], y_))
    print('AUC score (train): %f' % metrics.roc_auc_score(dtrain['Disbursed'], proba_))
    # Feature importance
    feature_imp = pd.Series(model.get_booster().get_fscore()).sort_values(ascending=False)
    feature_imp.plot(kind='bar', title='Feature Importances')
    plt.ylabel('Feature Importance Score')

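What the function does:

Cross-validates on the training data
Updates n_estimators according to the XGBoost cross-validation result
Fits the model
Computes the training accuracy
Computes the training-set AUC (ROC-AUC for binary classification)
Plots the feature importances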
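The n_estimators update relies on early stopping: cross-validation stops once the validation metric has not improved for early_stopping_rounds rounds, and the number of rows in the returned history is then used as the new n_estimators. The sketch below mimics only that selection rule in plain Python. The function name best_num_rounds and the per-round AUC values are made up for illustration and are not part of xgboost.

```python
def best_num_rounds(scores, early_stopping_rounds, maximize=True):
    """Return how many boosting rounds to keep under a simple early-stopping rule.

    scores: one validation score per boosting round (hypothetical values).
    Training stops once `early_stopping_rounds` rounds have passed without
    beating the best score so far; rounds up to the best one are kept.
    """
    best_idx = 0
    for i, s in enumerate(scores):
        improved = s > scores[best_idx] if maximize else s < scores[best_idx]
        if improved:
            best_idx = i
        elif i - best_idx >= early_stopping_rounds:
            break  # no improvement for too long: stop looking
    return best_idx + 1


# Made-up mean validation AUC per round, for illustration only
cv_auc = [0.80, 0.82, 0.83, 0.835, 0.834, 0.833, 0.832, 0.90]
n_rounds = best_num_rounds(cv_auc, early_stopping_rounds=3)
print(n_rounds)  # 4
```

Note that the last value (0.90) is never reached: the rule stops three rounds after the best score at round 4, which is the trade-off early stopping makes between training time and possibly missing a later improvement.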

4.5 Selecting the Best n_estimators via Cross-Validation

xgb1 = XGBClassifier(learning_rate=0.1, use_label_encoder=False, n_estimators=50,
                     max_depth=5, min_child_weight=1, gamma=0,
                     subsample=0.8, colsample_bytree=0.8,
                     objective='binary:logistic', nthread=4,
                     scale_pos_weight=1, reg_alpha=0, verbosity=0)
modelfit(xgb1, train, test, cols)

Model performance:
Accuracy (train): 0.9854
AUC score (train): 0.867372
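The AUC score reported above can be read as the probability that a randomly chosen positive sample gets a higher predicted score than a randomly chosen negative one, with ties counted as one half. A minimal pure-Python sketch of that definition follows; the function name roc_auc and the toy labels and scores are illustrative assumptions, and in practice metrics.roc_auc_score does this far more efficiently.

```python
def roc_auc(y_true, y_score):
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # a tie counts as half
    return wins / (len(pos) * len(neg))


# Toy data, for illustration only
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(y_true, y_score))  # 0.75
```

Because AUC depends only on how the scores rank the samples, it can differ a lot from accuracy, which depends on the 0.5 decision threshold.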

4.6 Parameter Tuning

4.6.1 Parameter Search 1

# Search for the best max_depth and min_child_weight
param_grid = {'max_depth': range(3, 10, 2), 'min_child_weight': range(1, 6, 2)}
model = XGBClassifier(learning_rate=0.1, n_estimators=100, max_depth=5,
                      use_label_encoder=False, min_child_weight=1, gamma=0,
                      subsample=0.8, colsample_bytree=0.8,
                      objective='binary:logistic', nthread=4,
                      scale_pos_weight=1, seed=27, verbosity=0)
gsearch1 = GridSearchCV(estimator=model, param_grid=param_grid, scoring='roc_auc', n_jobs=-1, cv=5)
gsearch1.fit(train[cols], train[target])
print('Best parameters from this search:', gsearch1.best_params_)
print('Best score:', gsearch1.best_score_)
'''
Best parameters from this search: {'max_depth': 5, 'min_child_weight': 3}
Best score: 0.8417229752667561
'''
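To get a feel for how much work this grid search does, the candidate combinations can be enumerated by hand. The sketch below uses itertools.product on the same param_grid; the variable names candidates and names are illustrative, and GridSearchCV performs this enumeration internally.

```python
from itertools import product

# Same grid as in the search above
param_grid = {'max_depth': range(3, 10, 2), 'min_child_weight': range(1, 6, 2)}
cv = 5

# Every combination of one value per parameter
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]

print(len(candidates))        # 12 parameter combinations
print(len(candidates) * cv)   # 60 model fits during cross-validation
print(candidates[0])          # {'max_depth': 3, 'min_child_weight': 1}
print(candidates[-1])         # {'max_depth': 9, 'min_child_weight': 5}
```

With the default refit=True, GridSearchCV trains one more model on the full training set using the best combination, so grids grow expensive quickly as parameters are added. That is one reason to tune parameters in small groups, as this section does.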

4.6.2 Parameter Search 2

%%time
# Tune gamma: the penalty coefficient that sets the minimum loss reduction required to split a node
param_grid = {'gamma': [i / 10.0 for i in range(0, 5)]}
model = XGBClassifier(learning_rate=0.1, n_estimators=100, max_depth=5,
                      use_label_encoder=False, min_child_weight=3, gamma=0,
                      subsample=0.8, colsample_bytree=0.8,
                      objective='binary:logistic', nthread=4,
                      scale_pos_weight=1, seed=27, verbosity=0)
gsearch2 = GridSearchCV(estimator=model, param_grid=param_grid, scoring='roc_auc', n_jobs=4, cv=5)
gsearch2.fit(train[cols], train[target])
print('Best parameters from this search:', gsearch2.best_params_)
print('Best score:', gsearch2.best_score_)
'''
Best parameters from this search: {'gamma': 0.0}
Best score: 0.8417229752667561
Wall time: 37.9 s
'''
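The role of gamma as a minimum loss reduction can be illustrated with the split-gain formula from the XGBoost paper, where G and H are the sums of first- and second-order gradients in each child and lambda is the L2 penalty. The sketch below is a simplified illustration under that formula; the function name split_gain and the toy gradient sums are assumptions, and the library's internal bookkeeping may differ in constant factors.

```python
def split_gain(g_left, h_left, g_right, h_right, reg_lambda=1.0, gamma=0.0):
    """Gain of splitting a node into left/right children, minus the gamma penalty."""
    def score(g, h):
        # Structure score of a leaf holding gradient sum g and hessian sum h
        return g * g / (h + reg_lambda)

    parent = score(g_left + g_right, h_left + h_right)
    return 0.5 * (score(g_left, h_left) + score(g_right, h_right) - parent) - gamma


# Toy gradient/hessian sums, for illustration only
for gamma in [i / 10.0 for i in range(0, 5)]:
    gain = split_gain(-1.0, 2.0, 0.5, 2.0, gamma=gamma)
    decision = 'split' if gain > 0 else 'no split'
    print(gamma, round(gain, 4), decision)
```

For these toy numbers the split is worth about 0.18 before the penalty, so it survives gamma = 0.0 and 0.1 but is rejected from gamma = 0.2 upward. A larger gamma therefore prunes weak splits and yields more conservative trees.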

4.6.3 Parameter Search 3

%%time
# Grid search for the best subsample and colsample_bytree
param_grid = {'subsample': [i / 10.0 for i in range(6, 10)],
              'colsample_bytree': [i / 10.0 for i in range(6, 10)]}
model = XGBClassifier(learning_rate=0.1, n_estimators=100, max_depth=5,
                      use_label_encoder=False, min_child_weight=3, gamma=0,
                      subsample=0.8, colsample_bytree=0.8,
                      objective='binary:logistic', nthread=4,
                      scale_pos_weight=1, seed=1024, verbosity=0)
gsearch3 = GridSearchCV(estimator=model, param_grid=param_grid, scoring='roc_auc', n_jobs=-1, cv=5)
gsearch3.fit(train[cols], train[target])
print('Best parameters from this search:', gsearch3.best_params_)
print('Best score:', gsearch3.best_score_)
'''
Best parameters from this search: {'colsample_bytree': 0.8, 'subsample': 0.8}
Best score: 0.8427440100917339
Wall time: 2min 10s
'''

4.6.4 Parameter Search 4

%%time
# Grid search for the best reg_alpha
param_grid = {'reg_alpha': [1e-5, 1e-2, 0.1, 1, 100]}
model = XGBClassifier(learning_rate=0.1, n_estimators=100, max_depth=5,
                      use_label_encoder=False, min_child_weight=3, gamma=0,
                      subsample=0.8, colsample_bytree=0.8,
                      objective='binary:logistic', nthread=4,
                      scale_pos_weight=1, seed=1024, verbosity=0)
gsearch4 = GridSearchCV(estimator=model, param_grid=param_grid, scoring='roc_auc', n_jobs=-1, cv=5)
gsearch4.fit(train[cols], train[target])
print('Best parameters from this search:', gsearch4.best_params_)
print('Best score:', gsearch4.best_score_)
'''
Best parameters from this search: {'reg_alpha': 1}
Best score: 0.8427616879925267
Wall time: 39.9 s
'''
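reg_alpha is the L1 regularization term on leaf weights. One common way to picture its effect is soft-thresholding: the gradient sum in a leaf is shrunk toward zero by alpha before the leaf weight is computed, and the weight becomes exactly zero if the gradient sum is smaller than alpha. The sketch below illustrates this with toy numbers; the function name leaf_weight and the gradient sums are assumptions for illustration, not a reproduction of the library's code.

```python
def leaf_weight(grad_sum, hess_sum, reg_alpha=0.0, reg_lambda=1.0):
    """Leaf weight with L1 (reg_alpha) soft-thresholding and L2 (reg_lambda) shrinkage."""
    if abs(grad_sum) <= reg_alpha:
        return 0.0  # the L1 penalty removes this leaf's contribution entirely
    # Shrink the gradient sum toward zero by reg_alpha
    shrunk = grad_sum - reg_alpha if grad_sum > 0 else grad_sum + reg_alpha
    return -shrunk / (hess_sum + reg_lambda)


# Same reg_alpha values as the grid above, applied to one toy leaf
for alpha in [1e-5, 1e-2, 0.1, 1, 100]:
    print(alpha, leaf_weight(grad_sum=-3.0, hess_sum=4.0, reg_alpha=alpha))
```

For this toy leaf the weight shrinks from about 0.6 to 0.4 as reg_alpha goes from near zero to 1, and drops to 0.0 at reg_alpha = 100, which shows why very large values tend to underfit.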

The best settings found are:

max_depth = 5
min_child_weight = 3
gamma = 0
subsample = 0.8
colsample_bytree = 0.8
reg_alpha = 1

4.7 Validating the New Parameters

xgb2 = XGBClassifier(learning_rate=0.1, use_label_encoder=False, n_estimators=100,
                     max_depth=5, min_child_weight=3, gamma=0,
                     subsample=0.8, colsample_bytree=0.8, reg_alpha=1,
                     objective='binary:logistic', nthread=4,
                     scale_pos_weight=1, eval_metric=['error', 'auc'], verbosity=0)
modelfit(xgb2, train, test, cols)
'''
Model performance:
Accuracy (train): 0.9854
AUC score (train): 0.879929
'''

4.7.1 Adjusting the Learning Rate

xgb3 = XGBClassifier(learning_rate=0.2, use_label_encoder=False, n_estimators=100,
                     max_depth=5, min_child_weight=3, gamma=0,
                     subsample=0.8, colsample_bytree=0.8, reg_alpha=1,
                     objective='binary:logistic', nthread=4,
                     scale_pos_weight=1, eval_metric=['error', 'auc'], verbosity=0)
modelfit(xgb3, train, test, cols)
'''
Model performance:
Accuracy (train): 0.9855
AUC score (train): 0.907529
'''
