Financial Loan Overdue Model -- 029
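Contents
I. Import libraries
II. Read the data
III. Data cleaning: removing irrelevant and duplicate data
IV. Data cleaning: type conversion
1. Splitting features by data type
2. Handling missing values
3. Handling outliers
4. Encoding discrete features
5. Handling date features
6. Combining features
V. Train/test split
VI. Model building
VII. Model evaluation
VIII. Model tuning
IX. Scores and ROC curves for the classification and ensemble models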
I. Import libraries
# -*- coding:utf-8 -*-
# I. Import libraries
import warnings
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
import xgboost as xgb
import lightgbm as lgb
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.metrics import roc_auc_score

warnings.filterwarnings("ignore")
matplotlib.rcParams['font.sans-serif'] = ['SimHei']  # use the SimHei font so Chinese labels render
matplotlib.rcParams['axes.unicode_minus'] = False  # render the minus sign correctly
np.set_printoptions(precision=5, suppress=True)
pd.set_option('display.max_columns', 10000)
# show all rows
pd.set_option('display.max_rows', 10000)
pd.set_option('display.max_colwidth', 10000)
pd.set_option('display.width', 10000)  # do not wrap lines
Preview of the raw data (the output of print(data.head()) and print(data.shape) in Section II):
Unnamed: 0 custid trade_no bank_card_no low_volume_percent middle_volume_percent take_amount_in_later_12_month_highest trans_amount_increase_rate_lately trans_activity_month trans_activity_day transd_mcc trans_days_interval_filter trans_days_interval regional_mobility student_feature repayment_capability is_high_user number_of_trans_from_2011 first_transaction_time historical_trans_amount historical_trans_day rank_trad_1_month trans_amount_3_month avg_consume_less_12_valid_month abs top_trans_count_last_1_month avg_price_last_12_month avg_price_top_last_12_valid_month reg_preference_for_trad trans_top_time_last_1_month trans_top_time_last_6_month consume_top_time_last_1_month consume_top_time_last_6_month cross_consume_count_last_1_month trans_fail_top_count_enum_last_1_month trans_fail_top_count_enum_last_6_month trans_fail_top_count_enum_last_12_month consume_mini_time_last_1_month max_cumulative_consume_later_1_month max_consume_count_later_6_month railway_consume_count_last_12_month pawns_auctions_trusts_consume_last_1_month pawns_auctions_trusts_consume_last_6_month jewelry_consume_count_last_6_month status source first_transaction_day trans_day_last_12_month id_name apply_score apply_credibility query_org_count query_finance_count query_cash_count query_sum_count latest_query_time latest_one_month_apply latest_three_month_apply latest_six_month_apply loans_score loans_credibility_behavior loans_count loans_settle_count loans_overdue_count loans_org_count_behavior consfin_org_count_behavior loans_cash_count latest_one_month_loan latest_three_month_loan latest_six_month_loan history_suc_fee history_fail_fee latest_one_month_suc latest_one_month_fail loans_long_time loans_latest_time loans_credit_limit loans_credibility_limit loans_org_count_current loans_product_count loans_max_limit loans_avg_limit consfin_credit_limit consfin_credibility consfin_org_count_current consfin_product_count consfin_max_limit consfin_avg_limit latest_query_day loans_latest_day
0 5 2791858 20180507115231274000000023057383 卡号1 0.01 0.99 0 0.90 0.55 0.313 17.0 27.0 26.0 3.0 NaN 19890 0 30.0 20130817.0 149050 151.0 0.40 34030 7.0 3920 0.15 1020 0.55 一线城市 4.0 19.0 4.0 19.0 1.0 1.0 2.0 2.0 5.0 2170 6.0 0.0 1970 18040 0.0 1 xs 1738.0 85.0 蒋红 583.0 79.0 8.0 2.0 6.0 10.0 2018-04-25 2.0 5.0 8.0 552.0 73.0 37.0 34.0 2.0 10.0 1.0 9.0 1.0 1.0 13.0 37.0 7.0 1.0 0.0 341.0 2018-04-19 2200.0 72.0 9.0 10.0 2900.0 1688.0 1200.0 75.0 1.0 2.0 1200.0 1200.0 12.0 18.0
1 10 534047 20180507121002192000000023073000 卡号1 0.02 0.94 2000 1.28 1.00 0.458 19.0 30.0 14.0 4.0 1.0 16970 0 23.0 20160402.0 302910 224.0 0.35 10590 5.0 6950 0.05 1210 0.50 一线城市 13.0 30.0 13.0 30.0 0.0 0.0 3.0 3.0 330.0 2100 9.0 0.0 1820 15680 0.0 0 xs 779.0 84.0 崔向朝 653.0 73.0 7.0 4.0 2.0 8.0 2018-05-03 2.0 6.0 8.0 635.0 76.0 37.0 36.0 0.0 17.0 5.0 12.0 2.0 2.0 8.0 49.0 4.0 2.0 1.0 353.0 2018-05-05 2000.0 74.0 12.0 12.0 3500.0 1758.0 15100.0 80.0 5.0 6.0 22800.0 9360.0 4.0 2.0
2 12 2849787 20180507125159718000000023114911 卡号1 0.04 0.96 0 1.00 1.00 0.114 13.0 68.0 22.0 1.0 NaN 9710 0 9.0 20170617.0 11520 31.0 1.00 5710 5.0 840 0.65 570 0.65 一线城市 0.0 68.0 0.0 68.0 0.0 3.0 6.0 6.0 0.0 0 3.0 0.0 0 0 0.0 1 xs 338.0 95.0 王中云 654.0 76.0 11.0 5.0 5.0 16.0 2018-05-05 5.0 5.0 14.0 633.0 83.0 4.0 2.0 0.0 3.0 1.0 2.0 2.0 2.0 4.0 2.0 2.0 1.0 1.0 157.0 2018-05-01 1500.0 77.0 2.0 2.0 1600.0 1250.0 4200.0 87.0 1.0 1.0 4200.0 4200.0 2.0 6.0
3 13 1809708 20180507121358683000000388283484 卡号1 0.00 0.96 2000 0.13 0.57 0.777 22.0 14.0 6.0 3.0 NaN 6210 0 33.0 20130516.0 491130 360.0 0.15 91690 7.0 46850 0.05 1290 0.45 三线城市 6.0 8.0 6.0 8.0 0.0 1.0 8.0 8.0 31700.0 8140 9.0 0.0 2700 27970 0.0 0 xs 1831.0 82.0 何洋洋 595.0 79.0 12.0 7.0 4.0 22.0 2018-05-05 3.0 16.0 17.0 542.0 75.0 85.0 81.0 4.0 22.0 5.0 17.0 2.0 4.0 34.0 91.0 26.0 2.0 0.0 355.0 2018-05-03 1800.0 74.0 17.0 18.0 3200.0 1541.0 16300.0 80.0 5.0 5.0 30000.0 12180.0 2.0 4.0
4 14 2499829 20180507115448545000000388205844 卡号1 0.01 0.99 0 0.46 1.00 0.175 13.0 66.0 42.0 1.0 NaN 11150 0 12.0 20170312.0 61470 63.0 0.65 9770 6.0 760 1.00 1110 0.50 一线城市 0.0 66.0 0.0 66.0 0.0 3.0 3.0 3.0 0.0 1000 3.0 0.0 0 6410 0.0 1 xs 435.0 88.0 赵洋 541.0 75.0 11.0 3.0 4.0 14.0 2018-04-15 6.0 8.0 9.0 479.0 73.0 37.0 32.0 6.0 12.0 2.0 10.0 0.0 0.0 10.0 36.0 25.0 0.0 0.0 360.0 2018-01-07 1800.0 72.0 10.0 10.0 2300.0 1630.0 8300.0 79.0 2.0 2.0 8400.0 8250.0 22.0 120.0
(4754, 90)
II. Read the data
# II. Read the data
file_path = r"D:\A\AI-master\py-data\overdue.csv"  # raw string, so the backslashes are not read as escape sequences
data = pd.read_csv(file_path, encoding='gbk')
print(data.head())
print(data.shape)  # (4754, 90)
III. Data cleaning: removing irrelevant and duplicate data
# III. Data cleaning: remove irrelevant and duplicate data
## Drop the columns that identify the individual
data.drop(['custid', 'trade_no', 'bank_card_no', 'id_name'], axis=1, inplace=True)
## Drop the columns whose values are all identical
X = data.drop(labels='status', axis=1)
print(X.shape)  # (4754, 85)
L = []
for col in X:
    if len(X[col].unique()) == 1:
        L.append(col)
for col in L:
    X.drop(col, axis=1, inplace=True)
print(X.shape)  # (4754, 84)
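IV. Data cleaning: type conversion
1. Splitting features by data type
Split the columns into three groups: numeric features, non-numeric features, and the label.
How: pandas objects provide a select_dtypes() method that selects the columns of a given data type.
Parameters: include names the dtypes to keep (it is the first argument); exclude names the dtypes to drop.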
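The split by data type can be tried on a small frame first. The sketch below is only an illustration: the toy frame and its column names are made up and are not part of the loan dataset.

```python
import pandas as pd

# Hypothetical toy frame; the column names are illustrative only.
toy = pd.DataFrame({
    "amount": [100.0, 250.5, 80.0],
    "count": [1, 4, 2],
    "city": ["A", "B", "A"],
    "query_date": ["2018-04-25", "2018-05-03", "2018-05-05"],
})

toy_num = toy.select_dtypes(include="number")  # integer and float columns
toy_str = toy.select_dtypes(exclude="number")  # everything else (strings, dates stored as text)

print(list(toy_num.columns))
print(list(toy_str.columns))
```

Dates stored as text land in the non-numeric group, which is why the date columns are handled separately later in this section.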
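2. Handling missing values
Ways to find missing values: count the missing entries per feature, or compute the missing rate.
Analysis: student_feature has the highest missing rate, 63.0627% (above 50%); every other feature is below 10%.
Options for a feature with a high missing rate: EM imputation, multiple imputation.
Both methods are fairly involved, so here the missing values of student_feature are treated as a category of their own and filled with 0. For the other features the usual choices are the mean, the median, or the mode.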
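As a rough illustration of the missing-rate calculation and the two fill strategies described above, here is a minimal sketch on made-up data. The columns `flag` and `score` are hypothetical stand-ins for a mostly-missing feature and a lightly-missing one.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: "flag" is mostly missing, "score" is lightly missing.
toy = pd.DataFrame({
    "flag": [1.0, np.nan, np.nan, np.nan],
    "score": [10.0, 12.0, np.nan, 12.0],
})

# Missing rate per column; the mean of a boolean mask equals count / length.
miss_rate = toy.isnull().mean().sort_values(ascending=False)
print(miss_rate)

filled = toy.copy()
# High missing rate: treat "missing" as its own category and fill with 0.
filled["flag"] = filled["flag"].fillna(0)
# Low missing rate: fill with the mode (the most frequent value, 12.0 here).
filled["score"] = filled["score"].fillna(filled["score"].mode().iloc[0])
print(filled)
```

Swapping `mode().iloc[0]` for `median()` or `mean()` gives the other two simple strategies; which one fits best depends on the feature's distribution.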
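3. Handling outliers
Outliers are capped using the interquartile range (IQR) of the box plot: values above Q3 + 1.5 × IQR or below Q1 - 1.5 × IQR are replaced with those bounds.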
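A small worked example may make the IQR rule concrete. The helper name `iqr_clip` and the sample values are assumptions for illustration; the sketch uses `Series.clip` to cap values at the two bounds.

```python
import pandas as pd

def iqr_clip(s, k=1.5):
    """Cap values outside [Q1 - k*IQR, Q3 + k*IQR] at those bounds."""
    q25, q75 = s.quantile(0.25), s.quantile(0.75)
    iqr = q75 - q25
    return s.clip(lower=q25 - k * iqr, upper=q75 + k * iqr)

s = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])
# Q1 = 2, Q3 = 4, IQR = 2, so the bounds are -1 and 7.
clipped = iqr_clip(s)
print(clipped.tolist())  # [1.0, 2.0, 3.0, 4.0, 7.0]
```

Capping keeps every row, so the shape of the data does not change; only the extreme value 100 is pulled in to the upper bound.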
4. Encoding discrete features
Ordinal encoding: for categories that have a natural order.
One-hot encoding: for categories that have no order.
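The two encodings can be compared side by side. This is a sketch on invented columns (`tier` is ordered, `channel` is not); the mapping dictionary is an assumption chosen for the example.

```python
import pandas as pd

# Hypothetical toy data: "tier" has a natural order, "channel" does not.
toy = pd.DataFrame({
    "tier": ["low", "high", "medium", "low"],
    "channel": ["web", "app", "web", "store"],
})

# Ordinal encoding: map each level to an integer that preserves the order.
tier_order = {"low": 0, "medium": 1, "high": 2}
toy["tier_code"] = toy["tier"].map(tier_order)

# One-hot encoding: one 0/1 column per category, with no implied order.
channel_oh = pd.get_dummies(toy["channel"], prefix="channel", dtype=int)

encoded = pd.concat([toy[["tier_code"]], channel_oh], axis=1)
print(encoded)
```

Using integers for an unordered feature would suggest to a linear model that one category is "larger" than another, which is the reason for one-hot encoding in that case.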
5. Handling date features
6. Combining features
print('1. Splitting features by data type')
X_num = X.select_dtypes(include='number').copy()
X_str = X.select_dtypes(exclude='number').copy()
y = data['status']
print(X.shape, X_num.shape, X_str.shape, y.shape)  # (4754, 84) (4754, 81) (4754, 3) (4754,)

print('2. Handling missing values')
# Use the missing rate (it shows the proportion) and sort in descending order (ascending=False)
X_num_miss = (X_num.isnull().sum() / len(X_num)).sort_values(ascending=False)
print(X_num_miss.head())
print('---------'*5)
X_str_miss = (X_str.isnull().sum() / len(X_str)).sort_values(ascending=False)
print(X_str_miss.head())
'''
Analysis: student_feature has the highest missing rate, 63.0627% (above 50%); every other feature is below 10%.
Options for a feature with a high missing rate: EM imputation, multiple imputation.
Both methods are fairly involved, so here the missing values are treated as a category of their own and filled with 0.
Other features: mean, median, mode...
'''
## Fill student_feature with 0
print('---------'*5)
X_num['student_feature'] = X_num['student_feature'].fillna(0)  # assign back instead of a chained inplace fill
X_num_miss = (X_num.isnull().sum() / len(X_num)).sort_values(ascending=False)
print(X_num_miss.head())
## Other features: impute with the mode
print('---------'*5)
X_num.fillna(X_num.mode().iloc[0, :], inplace=True)
X_str.fillna(X_str.mode().iloc[0, :], inplace=True)
X_num_miss = (X_num.isnull().sum() / len(X_num)).sort_values(ascending=False)
print(X_num_miss.head())

print('3. Handling outliers')
# 3. Outliers: interquartile range (IQR) of the box plot
print('X_num.shape', X_num.shape)
def iqr_outlier(x, thre=1.5):
    q25, q75 = x.quantile(q=[0.25, 0.75])
    iqr = q75 - q25
    top = q75 + thre * iqr
    bottom = q25 - thre * iqr
    # Cap values above top or below bottom at those bounds
    return x.clip(lower=bottom, upper=top)
X_num_cl = pd.DataFrame()
for col in X_num.columns:
    X_num_cl[col] = iqr_outlier(X_num[col])
X_num = X_num_cl
print('X_num.shape', X_num.shape)

print('4. Encoding discrete features')
X_str_oh = pd.get_dummies(X_str['reg_preference_for_trad'], dtype=int)  # dtype=int keeps 0/1 columns on newer pandas
print(X_str_oh.head())

print('5. Handling date features')
print('X_str', X_str['latest_query_time'].head())
X_date = pd.DataFrame()
X_date['latest_query_time_year'] = pd.to_datetime(X_str['latest_query_time']).dt.year
X_date['latest_query_time_month'] = pd.to_datetime(X_str['latest_query_time']).dt.month
X_date['latest_query_time_weekday'] = pd.to_datetime(X_str['latest_query_time']).dt.weekday
X_date['loans_latest_time_year'] = pd.to_datetime(X_str['loans_latest_time']).dt.year
X_date['loans_latest_time_month'] = pd.to_datetime(X_str['loans_latest_time']).dt.month
X_date['loans_latest_time_weekday'] = pd.to_datetime(X_str['loans_latest_time']).dt.weekday
print('X_date', X_date.head())

print('6. Combining features')
X = pd.concat([X_num, X_str_oh, X_date], axis=1, sort=False)
print(X.shape)
1. Splitting features by data type
(4754, 84) (4754, 81) (4754, 3) (4754,)
2. Handling missing values
student_feature 0.630627
cross_consume_count_last_1_month 0.089609
latest_one_month_apply 0.063946
query_finance_count 0.063946
latest_six_month_apply 0.063946
dtype: float64
---------------------------------------------
latest_query_time 0.063946
loans_latest_time 0.062474
reg_preference_for_trad 0.000421
dtype: float64
---------------------------------------------
cross_consume_count_last_1_month 0.089609
latest_three_month_apply 0.063946
query_finance_count 0.063946
latest_six_month_apply 0.063946
query_sum_count 0.063946
dtype: float64
---------------------------------------------
loans_latest_day 0.0
jewelry_consume_count_last_6_month 0.0
abs 0.0
top_trans_count_last_1_month 0.0
avg_price_last_12_month 0.0
dtype: float64
3. Handling outliers
X_num.shape (4754, 81)
X_num.shape (4754, 81)
4. Encoding discrete features
一线城市 三线城市 二线城市 其他城市 境外
0 1 0 0 0 0
1 1 0 0 0 0
2 1 0 0 0 0
3 0 1 0 0 0
4 1 0 0 0 0
5. Handling date features
X_str 0 2018-04-25
1 2018-05-03
2 2018-05-05
3 2018-05-05
4 2018-04-15
Name: latest_query_time, dtype: object
X_date latest_query_time_year latest_query_time_month latest_query_time_weekday loans_latest_time_year loans_latest_time_month loans_latest_time_weekday
0 2018 4 2 2018 4 3
1 2018 5 3 2018 5 5
2 2018 5 5 2018 5 1
3 2018 5 5 2018 5 3
4 2018 4 6 2018 1 6
6. Combining features
(4754, 92)
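V. Train/test split
## Preprocessing: standardization
# X_std = StandardScaler().fit(X)
## Split the dataset (the scaler above is commented out, so despite the X_std_ prefix these arrays are not standardized)
X_std_train, X_std_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2020)
print(X_std_train.shape, X_std_test.shape)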
(3327, 92) (1427, 92)
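The split above is done on unscaled data and without stratification. A minimal sketch of one common alternative follows, under stated assumptions: the data is random toy data, `stratify=` is an addition that the original code does not use, and the scaler is fitted on the training part only so that no test-set statistics leak into training.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: 200 samples, 4 features, 25% positive labels.
rng = np.random.default_rng(0)
X_toy = rng.normal(loc=5.0, scale=3.0, size=(200, 4))
y_toy = np.array([1] * 50 + [0] * 150)

# stratify keeps the class ratio (about) the same in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=2020, stratify=y_toy
)

# Fit the scaler on the training part only, then apply it to both parts.
scaler = StandardScaler().fit(X_tr)
X_tr_std = scaler.transform(X_tr)
X_te_std = scaler.transform(X_te)

print(X_tr_std.shape, X_te_std.shape)
print(y_tr.mean(), y_te.mean())
```

Scale-sensitive models such as logistic regression and SVM tend to benefit most from this step; tree-based models are largely unaffected by it.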
VI. Model building
## Model 1: Logistic Regression
lr = LogisticRegression()
lr.fit(X_std_train, y_train)
## Model 2: Decision Tree
dtc = DecisionTreeClassifier(max_depth=8)
dtc.fit(X_std_train, y_train)
## Model 3: SVM
svm = SVC(kernel='linear', probability=True)
svm.fit(X_std_train, y_train)
## Model 4: Random Forest
rfc = RandomForestClassifier()
rfc.fit(X_std_train, y_train)
## Model 5: XGBoost
xgbc = xgb.XGBClassifier()
xgbc.fit(X_std_train, y_train)
VII. Model evaluation
## Model evaluation
from sklearn.metrics import f1_score

def model_metrics(clf, X_test, y_test):
    # predict returns the predicted labels of the fitted model.
    y_test_pred = clf.predict(X_test)
    # predict_proba returns an n-by-k array: entry (i, j) is the predicted probability
    # that sample i has label j, and each row sums to 1.
    y_test_prob = clf.predict_proba(X_test)[:, 1]
    accuracy = accuracy_score(y_test, y_test_pred)
    print('The accuracy: ', accuracy)
    precision = precision_score(y_test, y_test_pred)
    print('The precision: ', precision)
    recall = recall_score(y_test, y_test_pred)
    print('The recall: ', recall)
    # Fix: the original listing called recall_score here, which is why every
    # 'The F1 score' line in the printed output repeats the recall.
    f1 = f1_score(y_test, y_test_pred)
    print('The F1 score: ', f1)
    print('----------------------------------')
    # auc = roc_auc_score(y_test, y_test_prob)
    # print('The AUC of: ', auc)

model_metrics(lr, X_std_test, y_test)
model_metrics(dtc, X_std_test, y_test)
model_metrics(svm, X_std_test, y_test)
model_metrics(rfc, X_std_test, y_test)
model_metrics(xgbc, X_std_test, y_test)
The accuracy: 0.7624386825508059
The precision: 0.0
The recall: 0.0
The F1 score: 0.0
----------------------------------
The accuracy: 0.7386124737210932
The precision: 0.43656716417910446
The recall: 0.34513274336283184
The F1 score: 0.34513274336283184
----------------------------------
The accuracy: 0.7624386825508059
The precision: 0.5
The recall: 0.0029498525073746312
The F1 score: 0.0029498525073746312
----------------------------------
The accuracy: 0.775052557813595
The precision: 0.5616438356164384
The recall: 0.24188790560471976
The F1 score: 0.24188790560471976
----------------------------------
The accuracy: 0.8044849334267694
The precision: 0.6456310679611651
The recall: 0.39233038348082594
The F1 score: 0.39233038348082594
----------------------------------
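In the results above, each "F1 score" line is identical to the recall line, which suggests the printed value is the recall rather than the F1. F1 is the harmonic mean of precision and recall, so the two normally differ. A small worked example on made-up labels (not taken from the loan data) shows the relationship:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical labels: 4 positives, 6 negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

p = precision_score(y_true, y_pred)  # TP=2, FP=1 -> 2/3
r = recall_score(y_true, y_pred)     # TP=2, FN=2 -> 1/2
f1 = f1_score(y_true, y_pred)        # 2*p*r / (p + r) = 4/7

print(p, r, f1)
```

Here precision is 2/3, recall is 1/2, and F1 is 4/7 (about 0.571), which lies between the two and is pulled toward the smaller one.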
VIII. Model tuning
The six models are tuned with grid search (using five-fold cross-validation during the search) and then evaluated.
import time
from sklearn.ensemble import GradientBoostingClassifier
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV
from sklearn import metrics

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2020)
sc = StandardScaler()
sc.fit(X_train)  # estimate the mean and standard deviation of each feature
X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)

# Grid search with 5-fold cross-validation
def gridsearch(model, parameters):
    grid = GridSearchCV(model, parameters, scoring='accuracy', cv=5)
    grid = grid.fit(X_train_std, y_train)
    if hasattr(model, 'decision_function'):
        y_predict_pro = grid.decision_function(X_test_std)
    else:
        y_predict_pro = grid.predict_proba(X_test_std)[:, 1]
    print('best score:', grid.best_score_)
    print(grid.best_params_)
    print('test score:', grid.score(X_test_std, y_test))
    print('AUC:', metrics.roc_auc_score(y_test, y_predict_pro))

# Logistic regression
print('Logistic regression:')
start = time.time()
parameters = {'C': [0.1, 1, 2, 5], 'penalty': ['l1', 'l2']}
lr = LogisticRegression(solver='liblinear')  # liblinear supports both the l1 and l2 penalties
lr.fit(X_train_std, y_train)
gridsearch(lr, parameters)
end = time.time()
print('Logistic regression took %s:' % (end - start))

# SVM
print('SVM:')
start = time.time()
parameters = {'C': [0.1, 1, 2, 5], 'kernel': ['linear', 'poly', 'rbf']}
svc = SVC()
svc.fit(X_train_std, y_train)
gridsearch(svc, parameters)
end = time.time()
print('SVM took %s:' % (end - start))

# Decision tree
print('Decision tree:')
start = time.time()
# Note: 'auto' was accepted by the scikit-learn version used for this run;
# newer releases no longer accept it, so drop it from the list there.
parameters = {'criterion': ['gini', 'entropy'], 'max_depth': [1, 2, 3, 4, 5, 6], 'splitter': ['best', 'random'], 'max_features': ['log2', 'sqrt', 'auto']}
clf = DecisionTreeClassifier()
clf.fit(X_train_std, y_train)
gridsearch(clf, parameters)
end = time.time()
print('Decision tree took %s:' % (end - start))

# Random forest
print('Random forest:')
start = time.time()
parameters = {'n_estimators': range(1, 200), 'max_features': ['log2', 'sqrt', 'auto']}  # same note on 'auto' as above
rfc = RandomForestClassifier(random_state=2018)
rfc.fit(X_train_std, y_train)
gridsearch(rfc, parameters)
end = time.time()
print('Random forest took %s:' % (end - start))

# GBDT
print('GBDT:')
start = time.time()
parameters = {'n_estimators': range(1, 150, 10), 'learning_rate': np.arange(0.1, 1, 0.1)}
gbdt = GradientBoostingClassifier(random_state=2018)
gbdt.fit(X_train_std, y_train)
gridsearch(gbdt, parameters)
end = time.time()
print('GBDT took %s:' % (end - start))

# XGBoost
print('XGBoost:')
start = time.time()
parameters = {'eta': np.arange(0.1, 0.5, 0.1), 'max_depth': range(1, 6, 1), 'min_child_weight': range(1, 5, 1)}
xgbs = XGBClassifier()
xgbs.fit(X_train_std, y_train)
gridsearch(xgbs, parameters)
end = time.time()
print('XGBoost took %s:' % (end - start))
Logistic regression:
best score: 0.7965133754132853
{'C': 0.1, 'penalty': 'l1'}
test score: 0.8009810791871058
AUC: 0.7964574657296546
Logistic regression took 12.71597933769226:
SVM:
best score: 0.7944093778178539
{'C': 2, 'kernel': 'linear'}
test score: 0.8023826208829713
AUC: 0.7971732387645323
SVM took 124.0969910621643:
Decision tree:
best score: 0.763751127141569
{'criterion': 'gini', 'max_depth': 5, 'max_features': 'auto', 'splitter': 'best'}
test score: 0.7792571829011913
AUC: 0.6941046330036439
Decision tree took 4.393248558044434:
Random forest:
best score: 0.7935076645626691
{'max_features': 'sqrt', 'n_estimators': 51}
test score: 0.7904695164681149
AUC: 0.7712752689571404
Random forest took 3088.8878474235535:
GBDT:
best score: 0.7956116621581004
{'learning_rate': 0.2, 'n_estimators': 21}
test score: 0.7946741415557113
AUC: 0.7899951739545374
GBDT took 504.82539463043213:
XGBoost:
best score: 0.7992185151788398
{'eta': 0.1, 'max_depth': 2, 'min_child_weight': 4}
test score: 0.8037841625788367
AUC: 0.7903137471802881
XGBoost took 450.1137731075287:
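The tuning above standardizes the whole training set once and then cross-validates on it. One variation, sketched here on synthetic data, wraps the scaler and the model in a `Pipeline` so that the scaler is refitted inside every fold. Everything in this block is an assumption for illustration: the data comes from `make_classification`, the grid is deliberately small, and `scoring='roc_auc'` replaces the accuracy score used in the article.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data, not the loan dataset.
X_toy, y_toy = make_classification(
    n_samples=300, n_features=10, n_informative=4, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.3, random_state=0)

# The scaler is part of the estimator, so each CV fold fits it on its own training split.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression()),
])
# Parameters of a pipeline step are addressed as "<step name>__<parameter>".
param_grid = {"clf__C": [0.01, 0.1, 1, 10]}

grid = GridSearchCV(pipe, param_grid, scoring="roc_auc", cv=5)
grid.fit(X_tr, y_tr)

print("best CV AUC:", grid.best_score_)
print(grid.best_params_)
print("test AUC:", grid.score(X_te, y_te))
```

The same pattern extends to the other models by swapping the `clf` step and the grid; keeping the grids small also avoids run times like the roughly 3,000 seconds reported for the random forest search.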
IX. Scores and ROC curves for the classification and ensemble models
## Imports
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import LinearSVC
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
from sklearn.metrics import roc_curve, auc  # evaluation metrics
import pandas as pd
import matplotlib.pyplot as plt  # plotting package

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2018)  # split the data
sc = StandardScaler()  # standardize the features
sc.fit(X_train)
X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)

def score(y_true, y_predict, y_predict_pro):
    acc_score = accuracy_score(y_true, y_predict)
    pre_score = precision_score(y_true, y_predict)
    recall = recall_score(y_true, y_predict)
    F = f1_score(y_true, y_predict)
    auc_score = roc_auc_score(y_true, y_predict_pro)  # AUC value
    # Print the metrics (the original listing computed them without a print call)
    print('accuracy:{},\nprecision:{},\nrecall:{},\nF1-Score:{},\nAuc:{}'.format(acc_score, pre_score, recall, F, auc_score))
    # Plot the ROC curve (uses y_true; the original referenced the global y_test here)
    fpr, tpr, thresholds = roc_curve(y_true, y_predict_pro)
    plt.plot(fpr, tpr, 'b', label='AUC = %0.4f' % auc_score)
    plt.plot([0, 1], [0, 1], 'r--', label='Random guess')
    plt.legend(loc='lower right')
    plt.title('ROC')
    plt.xlabel('false positive rate')
    plt.ylabel('true positive rate')
    plt.show()

# Logistic regression
print('-----------------Logistic regression------------------')
lr = LogisticRegression()
lr.fit(X_train_std, y_train)
lr_predict = lr.predict(X_test_std)
lr_predict_pro = lr.predict_proba(X_test_std)[:, 1]
score(y_test, lr_predict, lr_predict_pro)

# Linear SVM
print('-----------------Linear SVM------------------')
svc = LinearSVC()
svc.fit(X_train_std, y_train)
svc_predict = svc.predict(X_test_std)
svc_predict_pro = svc.decision_function(X_test_std)
score(y_test, svc_predict, svc_predict_pro)

# Decision tree
print('-----------------Decision tree------------------')
clf = DecisionTreeClassifier()
clf.fit(X_train_std, y_train)
clf_predict = clf.predict(X_test_std)
clf_predict_proba = clf.predict_proba(X_test_std)[:, 1]
score(y_test, clf_predict, clf_predict_proba)

# Random forest
print('-----------------Random forest------------------')
rfc = RandomForestClassifier()
rfc.fit(X_train_std, y_train)
rfc_predict = rfc.predict(X_test_std)
rfc_predict_proba = rfc.predict_proba(X_test_std)[:, 1]
score(y_test, rfc_predict, rfc_predict_proba)

# GBDT
print('-----------------GBDT------------------')
gdbt = GradientBoostingClassifier()
gdbt.fit(X_train_std, y_train)
gdbt_predict = gdbt.predict(X_test_std)
gdbt_predict_proba = gdbt.predict_proba(X_test_std)[:, 1]
score(y_test, gdbt_predict, gdbt_predict_proba)

# XGBoost
print('-----------------XGBoost------------------')
xgbs = XGBClassifier()
xgbs.fit(X_train_std, y_train)
xgbs_predict = xgbs.predict(X_test_std)
xgbs_predict_proba = xgbs.predict_proba(X_test_std)[:, 1]
score(y_test, xgbs_predict, xgbs_predict_proba)
-----------------Logistic regression------------------
accuracy:0.7848633496846531,
precision:0.6181818181818182,
recall:0.3788300835654596,
F1-Score:0.4697754749568221,
Auc:0.7743732590529248
-----------------Linear SVM------------------
accuracy:0.7813594954449895,
precision:0.6157635467980296,
recall:0.34818941504178275,
F1-Score:0.4448398576512455,
Auc:0.7752808988764046
-----------------Decision tree------------------
accuracy:0.6734407848633497,
precision:0.36175710594315247,
recall:0.38997214484679665,
F1-Score:0.3753351206434316,
Auc:0.5793493683035481
-----------------Random forest------------------
accuracy:0.7589348283111422,
precision:0.5531914893617021,
recall:0.21727019498607242,
F1-Score:0.312,
Auc:0.6982958279866045
-----------------GBDT------------------
accuracy:0.7778556412053259,
precision:0.5990566037735849,
recall:0.35376044568245124,
F1-Score:0.4448336252189142,
Auc:0.7711443564624998
-----------------XGBoost------------------
accuracy:0.7834618079887876,
precision:0.6225490196078431,
recall:0.35376044568245124,
F1-Score:0.4511545293072824,
Auc:0.7739533452265448
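The AUC values above summarize each ROC curve in one number. To see how the curve and the area relate, here is a tiny example on four hand-picked samples (the labels and scores are illustrative only). It also shows that any ranking score works as input, which is why `decision_function` output can be used for the linear SVM in place of probabilities.

```python
from sklearn.metrics import auc, roc_auc_score, roc_curve

# Four hypothetical samples: true labels and model scores (higher = more likely positive).
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# roc_curve sweeps a threshold over the scores and records (FPR, TPR) at each step.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
area = auc(fpr, tpr)  # trapezoidal area under those points

print(fpr, tpr)
print(area)                            # 0.75
print(roc_auc_score(y_true, y_score))  # same value, computed directly
```

Of the four positive/negative pairs, three are ranked correctly (only the pair 0.35 vs 0.4 is inverted), which gives 3/4 = 0.75.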
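About Me: 小婷儿
● Author: 小婷儿, who focuses on Python, data analysis, data mining, and machine learning, with an emphasis on applying these techniques in practice
● Author's blog: https://blog.csdn.net/u010986753
● The material in this series comes from the author's study notes, with parts compiled from the web; please excuse any infringement or inaccuracy
● All rights reserved. Sharing is welcome; please credit the source when reposting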