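Contents

I. Import libraries
II. Read the data
III. Data cleaning: drop irrelevant and duplicate data
IV. Data cleaning: type conversion
  1. Split by data type
  2. Missing-value handling
  3. Outlier handling
  4. Categorical feature encoding
  5. Date feature handling
  6. Feature combination
V. Train/test split
VI. Model building
VII. Model evaluation
VIII. Model tuning
IX. Scores and ROC curves for the classifiers and ensemble models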

I. Import libraries

# -*- coding: utf-8 -*-
# I. Import libraries
import warnings

import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import lightgbm as lgb
import xgboost as xgb
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

warnings.filterwarnings("ignore")
matplotlib.rcParams['font.sans-serif'] = ['SimHei']   # use the SimHei font so Chinese labels render
matplotlib.rcParams['axes.unicode_minus'] = False     # render minus signs correctly
np.set_printoptions(precision=5, suppress=True)
pd.set_option('display.max_columns', 10000)    # show all columns
pd.set_option('display.max_rows', 10000)       # show all rows
pd.set_option('display.max_colwidth', 10000)
pd.set_option('display.width', 10000)          # do not wrap lines
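The listing below is the output of print(data.head()) and print(data.shape) from the data-reading step: the first five rows of the raw table, followed by its shape.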
   Unnamed: 0   custid                          trade_no bank_card_no  low_volume_percent  middle_volume_percent  take_amount_in_later_12_month_highest  trans_amount_increase_rate_lately  trans_activity_month  trans_activity_day  transd_mcc  trans_days_interval_filter  trans_days_interval  regional_mobility  student_feature  repayment_capability  is_high_user  number_of_trans_from_2011  first_transaction_time  historical_trans_amount  historical_trans_day  rank_trad_1_month  trans_amount_3_month  avg_consume_less_12_valid_month    abs  top_trans_count_last_1_month  avg_price_last_12_month  avg_price_top_last_12_valid_month reg_preference_for_trad  trans_top_time_last_1_month  trans_top_time_last_6_month  consume_top_time_last_1_month  consume_top_time_last_6_month  cross_consume_count_last_1_month  trans_fail_top_count_enum_last_1_month  trans_fail_top_count_enum_last_6_month  trans_fail_top_count_enum_last_12_month  consume_mini_time_last_1_month  max_cumulative_consume_later_1_month  max_consume_count_later_6_month  railway_consume_count_last_12_month  pawns_auctions_trusts_consume_last_1_month  pawns_auctions_trusts_consume_last_6_month  jewelry_consume_count_last_6_month  status source  first_transaction_day  trans_day_last_12_month id_name  apply_score  apply_credibility  query_org_count  query_finance_count  query_cash_count  query_sum_count latest_query_time  latest_one_month_apply  latest_three_month_apply  latest_six_month_apply  loans_score  loans_credibility_behavior  loans_count  loans_settle_count  loans_overdue_count  loans_org_count_behavior  consfin_org_count_behavior  loans_cash_count  latest_one_month_loan  latest_three_month_loan  latest_six_month_loan  history_suc_fee  history_fail_fee  latest_one_month_suc  latest_one_month_fail  loans_long_time loans_latest_time  loans_credit_limit  loans_credibility_limit  loans_org_count_current  loans_product_count  loans_max_limit  loans_avg_limit  consfin_credit_limit  consfin_credibility  
consfin_org_count_current  consfin_product_count  consfin_max_limit  consfin_avg_limit  latest_query_day  loans_latest_day
0           5  2791858  20180507115231274000000023057383          卡号1                0.01                   0.99                                      0                               0.90                  0.55               0.313        17.0                        27.0                 26.0                3.0              NaN                 19890             0                       30.0              20130817.0                   149050                 151.0               0.40                 34030                              7.0   3920                          0.15                     1020                               0.55                    一线城市                          4.0                         19.0                            4.0                           19.0                               1.0                                     1.0                                     2.0                                      2.0                             5.0                                  2170                              6.0                                  0.0                                        1970                                       18040                                 0.0       1     xs                 1738.0                     85.0      蒋红        583.0               79.0              8.0                  2.0               6.0             10.0        2018-04-25                     2.0                       5.0                     8.0        552.0                        73.0         37.0                34.0                  2.0                      10.0                         1.0               9.0                    1.0                      1.0                   13.0             37.0               7.0                   1.0                    0.0            341.0        2018-04-19              2200.0                     72.0                      9.0                 10.0           2900.0           1688.0                1200.0                 75.0                  
      1.0                    2.0             1200.0             1200.0              12.0              18.0
1          10   534047  20180507121002192000000023073000          卡号1                0.02                   0.94                                   2000                               1.28                  1.00               0.458        19.0                        30.0                 14.0                4.0              1.0                 16970             0                       23.0              20160402.0                   302910                 224.0               0.35                 10590                              5.0   6950                          0.05                     1210                               0.50                    一线城市                         13.0                         30.0                           13.0                           30.0                               0.0                                     0.0                                     3.0                                      3.0                           330.0                                  2100                              9.0                                  0.0                                        1820                                       15680                                 0.0       0     xs                  779.0                     84.0     崔向朝        653.0               73.0              7.0                  4.0               2.0              8.0        2018-05-03                     2.0                       6.0                     8.0        635.0                        76.0         37.0                36.0                  0.0                      17.0                         5.0              12.0                    2.0                      2.0                    8.0             49.0               4.0                   2.0                    1.0            353.0        2018-05-05              2000.0                     74.0                     12.0                 12.0           3500.0           1758.0               15100.0                 80.0                  
      5.0                    6.0            22800.0             9360.0               4.0               2.0
2          12  2849787  20180507125159718000000023114911          卡号1                0.04                   0.96                                      0                               1.00                  1.00               0.114        13.0                        68.0                 22.0                1.0              NaN                  9710             0                        9.0              20170617.0                    11520                  31.0               1.00                  5710                              5.0    840                          0.65                      570                               0.65                    一线城市                          0.0                         68.0                            0.0                           68.0                               0.0                                     3.0                                     6.0                                      6.0                             0.0                                     0                              3.0                                  0.0                                           0                                           0                                 0.0       1     xs                  338.0                     95.0     王中云        654.0               76.0             11.0                  5.0               5.0             16.0        2018-05-05                     5.0                       5.0                    14.0        633.0                        83.0          4.0                 2.0                  0.0                       3.0                         1.0               2.0                    2.0                      2.0                    4.0              2.0               2.0                   1.0                    1.0            157.0        2018-05-01              1500.0                     77.0                      2.0                  2.0           1600.0           1250.0                4200.0                 87.0                  
      1.0                    1.0             4200.0             4200.0               2.0               6.0
3          13  1809708  20180507121358683000000388283484          卡号1                0.00                   0.96                                   2000                               0.13                  0.57               0.777        22.0                        14.0                  6.0                3.0              NaN                  6210             0                       33.0              20130516.0                   491130                 360.0               0.15                 91690                              7.0  46850                          0.05                     1290                               0.45                    三线城市                          6.0                          8.0                            6.0                            8.0                               0.0                                     1.0                                     8.0                                      8.0                         31700.0                                  8140                              9.0                                  0.0                                        2700                                       27970                                 0.0       0     xs                 1831.0                     82.0     何洋洋        595.0               79.0             12.0                  7.0               4.0             22.0        2018-05-05                     3.0                      16.0                    17.0        542.0                        75.0         85.0                81.0                  4.0                      22.0                         5.0              17.0                    2.0                      4.0                   34.0             91.0              26.0                   2.0                    0.0            355.0        2018-05-03              1800.0                     74.0                     17.0                 18.0           3200.0           1541.0               16300.0                 80.0                  
      5.0                    5.0            30000.0            12180.0               2.0               4.0
4          14  2499829  20180507115448545000000388205844          卡号1                0.01                   0.99                                      0                               0.46                  1.00               0.175        13.0                        66.0                 42.0                1.0              NaN                 11150             0                       12.0              20170312.0                    61470                  63.0               0.65                  9770                              6.0    760                          1.00                     1110                               0.50                    一线城市                          0.0                         66.0                            0.0                           66.0                               0.0                                     3.0                                     3.0                                      3.0                             0.0                                  1000                              3.0                                  0.0                                           0                                        6410                                 0.0       1     xs                  435.0                     88.0      赵洋        541.0               75.0             11.0                  3.0               4.0             14.0        2018-04-15                     6.0                       8.0                     9.0        479.0                        73.0         37.0                32.0                  6.0                      12.0                         2.0              10.0                    0.0                      0.0                   10.0             36.0              25.0                   0.0                    0.0            360.0        2018-01-07              1800.0                     72.0                     10.0                 10.0           2300.0           1630.0                8300.0                 79.0                  
      2.0                    2.0             8400.0             8250.0              22.0             120.0
(4754, 90)

II. Read the data

# II. Read the data
file_path = r"D:\A\AI-master\py-data\overdue.csv"  # raw string, so the backslashes are not treated as escapes
data = pd.read_csv(file_path, encoding='gbk')
print(data.head())
print(data.shape)  # (4754, 90)

III. Data cleaning: drop irrelevant and duplicate data

# III. Data cleaning: drop irrelevant and duplicate data
## Drop the columns that identify the individual
data.drop(['custid', 'trade_no', 'bank_card_no', 'id_name'], axis=1, inplace=True)

## Drop the columns whose values are all identical
X = data.drop(labels='status', axis=1)
print(X.shape)  # (4754, 85)
constant_cols = [col for col in X.columns if len(X[col].unique()) == 1]
X = X.drop(columns=constant_cols)
print(X.shape)  # (4754, 84)

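IV. Data cleaning: type conversion

1. Split by data type

Split the columns by data type: numeric features, non-numeric features, and the label.
        Usage: a pandas DataFrame has a select_dtypes() method that filters columns by data type.
        Parameters: include keeps the listed dtypes (it is the first positional argument); exclude drops them.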
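As a small illustration of select_dtypes(), here is a sketch on a toy frame. The column names are borrowed from the dataset, but the values (including the "tier-1" labels) are made up for the example.

```python
import pandas as pd

# Toy frame with hypothetical values; not the article's dataset.
df = pd.DataFrame({
    "loans_score": [552.0, 635.0, 633.0],
    "query_org_count": [8, 7, 11],
    "reg_preference_for_trad": ["tier-1", "tier-1", "tier-3"],
    "latest_query_time": ["2018-04-25", "2018-05-03", "2018-05-05"],
})

# include="number" keeps integer and float columns
num_cols = df.select_dtypes(include="number").columns.tolist()
# exclude="number" keeps everything else (here: the two string columns)
str_cols = df.select_dtypes(exclude="number").columns.tolist()

print(num_cols)
print(str_cols)
```

Note that the date column is still a string at this point, so it lands in the non-numeric group and has to be parsed separately.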

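2. Missing-value handling

Ways to detect missing values: the missing count and the missing rate of each feature.

Analysis: student_feature has the highest missing rate, 63.0627% (above 50%); every other feature is below 10%.

  • High-missing-rate features: EM imputation or multiple imputation.
    ==> Both methods are fairly involved, so for now the missing values are treated as a category of their own and filled with 0.

  • Other features: mean, median, mode, ...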
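The two steps above (measure the missing rate, then fill) can be sketched on a toy frame. The values are invented; only the column names echo the dataset, and the mode is used for the low-missing column as one of the options listed.

```python
import numpy as np
import pandas as pd

# Hypothetical values chosen so the missing rates are easy to check by hand.
df = pd.DataFrame({
    "student_feature": [1.0, np.nan, np.nan, np.nan],
    "query_sum_count": [10.0, 8.0, np.nan, 8.0],
})

miss_count = df.isnull().sum()    # number of missing values per column
miss_rate = df.isnull().mean()    # fraction of missing values per column
print(miss_rate.sort_values(ascending=False))

filled = df.copy()
# High missing rate: treat "missing" as its own category and encode it as 0
filled["student_feature"] = filled["student_feature"].fillna(0)
# Low missing rate: fill with the most frequent value (the mode)
mode_value = df["query_sum_count"].mode().iloc[0]
filled["query_sum_count"] = filled["query_sum_count"].fillna(mode_value)
print(filled)
```

Swapping mode() for median() or mean() gives the other simple strategies; which one fits best depends on the feature's distribution.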

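3. Outlier handling

  • Interquartile range (IQR) from the box plot: values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are clipped to those bounds.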
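To make the IQR rule concrete, here is a minimal sketch on five made-up numbers. The helper name iqr_clip is only for this example, and it uses Series.clip rather than boolean assignment.

```python
import pandas as pd

def iqr_clip(x: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values outside [Q1 - k*IQR, Q3 + k*IQR] to those bounds."""
    q25, q75 = x.quantile([0.25, 0.75])
    iqr = q75 - q25
    return x.clip(lower=q25 - k * iqr, upper=q75 + k * iqr)

# Q1 = 2, Q3 = 4, so IQR = 2 and the allowed range is [-1, 7]
s = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])
clipped = iqr_clip(s)
print(clipped.tolist())  # [1.0, 2.0, 3.0, 4.0, 7.0]
```

Clipping keeps the row and only pulls the extreme value back to the fence, which is why the shape of X_num does not change in this step.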

4. Categorical feature encoding

  • Ordinal encoding: for categories that have a natural order

  • One-hot encoding: for categories with no order
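The difference between the two encodings is easiest to see side by side. This is a sketch on hypothetical columns (city_tier and channel are not in the dataset), assuming the tier order given in the mapping.

```python
import pandas as pd

# Hypothetical columns for illustration only.
df = pd.DataFrame({
    "city_tier": ["tier-1", "tier-3", "tier-2", "tier-1"],   # ordered categories
    "channel": ["app", "web", "app", "branch"],              # unordered categories
})

# Ordinal encoding: an explicit mapping keeps the order tier-1 < tier-2 < tier-3
tier_order = {"tier-1": 1, "tier-2": 2, "tier-3": 3}
df["city_tier_code"] = df["city_tier"].map(tier_order)

# One-hot encoding: one 0/1 indicator column per category, no order implied
channel_oh = pd.get_dummies(df["channel"], prefix="channel", dtype=int)

print(df[["city_tier", "city_tier_code"]])
print(channel_oh)
```

The pipeline in this post one-hot encodes reg_preference_for_trad; an ordinal mapping would be an alternative if the city tiers are treated as ordered.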

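5. Date feature handling

  • The year, month and weekday are extracted from latest_query_time and loans_latest_time.

6. Feature combination

  • The numeric, one-hot and date features are concatenated column-wise.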

print('1. Split by data type')
X_num = X.select_dtypes(include='number').copy()
X_str = X.select_dtypes(exclude='number').copy()
y = data['status']
print(X.shape, X_num.shape, X_str.shape, y.shape)  # (4754, 84) (4754, 81) (4754, 3) (4754,)

print('2. Missing-value handling')
# Missing rate per column (shows the proportion), sorted in descending order
X_num_miss = (X_num.isnull().sum() / len(X_num)).sort_values(ascending=False)
print(X_num_miss.head())
print('---------' * 5)
X_str_miss = (X_str.isnull().sum() / len(X_str)).sort_values(ascending=False)
print(X_str_miss.head())
'''
Analysis: student_feature has the highest missing rate, 63.0627% (above 50%); every other feature is below 10%.
High-missing-rate features: EM imputation or multiple imputation.
==> Both methods are fairly involved, so for now the missing values are treated as a category of their own and filled with 0.
Other features: mean, median, mode, ...
'''
## Fill student_feature with 0
print('---------' * 5)
X_num['student_feature'] = X_num['student_feature'].fillna(0)
X_num_miss = (X_num.isnull().sum() / len(X_num)).sort_values(ascending=False)
print(X_num_miss.head())
## Fill the remaining features with the mode
print('---------' * 5)
X_num.fillna(X_num.mode().iloc[0, :], inplace=True)
X_str.fillna(X_str.mode().iloc[0, :], inplace=True)
X_num_miss = (X_num.isnull().sum() / len(X_num)).sort_values(ascending=False)
print(X_num_miss.head())

print('3. Outlier handling')
# Clip outliers using the box-plot interquartile range (IQR)
print('X_num.shape', X_num.shape)
def iqr_outlier(x, thre=1.5):
    q25, q75 = x.quantile(q=[0.25, 0.75])
    iqr = q75 - q25
    top = q75 + thre * iqr
    bottom = q25 - thre * iqr
    # Values above top or below bottom are pulled back to the bound
    return x.clip(lower=bottom, upper=top)
X_num = X_num.apply(iqr_outlier)
print('X_num.shape', X_num.shape)

print('4. Categorical feature encoding')
# reg_preference_for_trad holds city categories: 一线城市 (tier-1 city), 二线城市 (tier-2 city),
# 三线城市 (tier-3 city), 其他城市 (other cities), 境外 (overseas)
X_str_oh = pd.get_dummies(X_str['reg_preference_for_trad'], dtype=int)
print(X_str_oh.head())

print('5. Date feature handling')
print('X_str', X_str['latest_query_time'].head())
X_date = pd.DataFrame()
for col in ['latest_query_time', 'loans_latest_time']:
    dt = pd.to_datetime(X_str[col])
    X_date[col + '_year'] = dt.dt.year
    X_date[col + '_month'] = dt.dt.month
    X_date[col + '_weekday'] = dt.dt.weekday
print('X_date', X_date.head())

print('6. Feature combination')
X = pd.concat([X_num, X_str_oh, X_date], axis=1, sort=False)
print(X.shape)
1. Split by data type
(4754, 84) (4754, 81) (4754, 3) (4754,)
2. Missing-value handling
student_feature                     0.630627
cross_consume_count_last_1_month    0.089609
latest_one_month_apply              0.063946
query_finance_count                 0.063946
latest_six_month_apply              0.063946
dtype: float64
---------------------------------------------
latest_query_time          0.063946
loans_latest_time          0.062474
reg_preference_for_trad    0.000421
dtype: float64
---------------------------------------------
cross_consume_count_last_1_month    0.089609
latest_three_month_apply            0.063946
query_finance_count                 0.063946
latest_six_month_apply              0.063946
query_sum_count                     0.063946
dtype: float64
---------------------------------------------
loans_latest_day                      0.0
jewelry_consume_count_last_6_month    0.0
abs                                   0.0
top_trans_count_last_1_month          0.0
avg_price_last_12_month               0.0
dtype: float64
3. Outlier handling
X_num.shape (4754, 81)
X_num.shape (4754, 81)
4. Categorical feature encoding
   一线城市  三线城市  二线城市  其他城市  境外
0     1     0     0     0   0
1     1     0     0     0   0
2     1     0     0     0   0
3     0     1     0     0   0
4     1     0     0     0   0
5. Date feature handling
X_str 0    2018-04-25
1    2018-05-03
2    2018-05-05
3    2018-05-05
4    2018-04-15
Name: latest_query_time, dtype: object
X_date    latest_query_time_year  latest_query_time_month  latest_query_time_weekday  loans_latest_time_year  loans_latest_time_month  loans_latest_time_weekday
0                    2018                        4                          2                    2018                        4                          3
1                    2018                        5                          3                    2018                        5                          5
2                    2018                        5                          5                    2018                        5                          1
3                    2018                        5                          5                    2018                        5                          3
4                    2018                        4                          6                    2018                        1                          6
6. Feature combination
(4754, 92)

V. Train/test split

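## Preprocessing: standardization (left disabled here, so the split below uses unscaled features)
# X_std = StandardScaler().fit(X)

## Split the dataset
X_std_train, X_std_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2020)
print(X_std_train.shape, X_std_test.shape)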
(3327, 92) (1427, 92)
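If standardization is switched on, one common pattern is to split first and fit the scaler on the training part only, so that no test-set statistics leak into training. The sketch below uses synthetic data (X_demo and y_demo are made up), and stratify is an optional extra that the split above does not use.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced data standing in for the real features and labels
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y_demo = (rng.random(200) < 0.25).astype(int)

# stratify keeps the positive rate similar in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=2020, stratify=y_demo)

# Fit the scaler on the training data only, then apply it to both parts
scaler = StandardScaler().fit(X_tr)
X_tr_std = scaler.transform(X_tr)
X_te_std = scaler.transform(X_te)

print(X_tr_std.shape, X_te_std.shape)
print(y_tr.mean(), y_te.mean())
```

Scale-sensitive models such as logistic regression and SVM usually benefit most from this step; tree-based models are largely unaffected by it.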

VI. Model building

## Model 1: Logistic Regression
lr = LogisticRegression()
lr.fit(X_std_train, y_train)

## Model 2: Decision Tree
dtc = DecisionTreeClassifier(max_depth=8)
dtc.fit(X_std_train, y_train)

## Model 3: SVM
svm = SVC(kernel='linear', probability=True)
svm.fit(X_std_train, y_train)

## Model 4: Random Forest
rfc = RandomForestClassifier()
rfc.fit(X_std_train, y_train)

## Model 5: XGBoost
xgbc = xgb.XGBClassifier()
xgbc.fit(X_std_train, y_train)

VII. Model evaluation

## Model evaluation
from sklearn.metrics import f1_score

def model_metrics(clf, X_test, y_test):
    # predict returns the predicted class labels of the fitted model
    y_test_pred = clf.predict(X_test)
    # predict_proba returns an n-by-k array whose entry (i, j) is the predicted
    # probability that sample i belongs to class j; each row sums to 1
    y_test_prob = clf.predict_proba(X_test)[:, 1]
    accuracy = accuracy_score(y_test, y_test_pred)
    print('The accuracy: ', accuracy)
    precision = precision_score(y_test, y_test_pred)
    print('The precision: ', precision)
    recall = recall_score(y_test, y_test_pred)
    print('The recall: ', recall)
    f1 = f1_score(y_test, y_test_pred)  # harmonic mean of precision and recall
    print('The F1 score: ', f1)
    print('----------------------------------')
    # auc = roc_auc_score(y_test, y_test_prob)
    # print('The AUC of: ', auc)

model_metrics(lr, X_std_test, y_test)
model_metrics(dtc, X_std_test, y_test)
model_metrics(svm, X_std_test, y_test)
model_metrics(rfc, X_std_test, y_test)
model_metrics(xgbc, X_std_test, y_test)
The accuracy:  0.7624386825508059
The precision:  0.0
The recall:  0.0
The F1 score:  0.0
----------------------------------
The accuracy:  0.7386124737210932
The precision:  0.43656716417910446
The recall:  0.34513274336283184
The F1 score:  0.34513274336283184
----------------------------------
The accuracy:  0.7624386825508059
The precision:  0.5
The recall:  0.0029498525073746312
The F1 score:  0.0029498525073746312
----------------------------------
The accuracy:  0.775052557813595
The precision:  0.5616438356164384
The recall:  0.24188790560471976
The F1 score:  0.24188790560471976
----------------------------------
The accuracy:  0.8044849334267694
The precision:  0.6456310679611651
The recall:  0.39233038348082594
The F1 score:  0.39233038348082594
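----------------------------------

Note: the F1 values in this listing equal the recall values because this run computed them with recall_score. Using F1 = 2 * precision * recall / (precision + recall) on the numbers above gives 0 for logistic regression and roughly 0.386 for the decision tree, 0.006 for SVM, 0.338 for random forest and 0.488 for XGBoost.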
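For reference, precision, recall and F1 can all be derived from three confusion-matrix counts. The sketch below uses made-up counts (tp, fp, fn are hypothetical, not taken from the models above) to show that F1 is the harmonic mean of precision and recall and generally differs from both.

```python
def prf1(tp: int, fp: int, fn: int):
    """Return (precision, recall, F1) from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts: 30 true positives, 20 false positives, 70 false negatives
p, r, f1 = prf1(tp=30, fp=20, fn=70)
print(p, r, f1)  # precision 0.6, recall 0.3, F1 about 0.4
```

Because the harmonic mean is dominated by the smaller of the two values, a model with decent precision but low recall still ends up with a low F1.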

VIII. Model tuning

  Six models are tuned by grid search (with 5-fold cross-validation during the search) and then evaluated on the test set.

import time

from sklearn import metrics
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2020)
sc = StandardScaler()
sc.fit(X_train)  # estimate the mean and standard deviation of each feature
X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)

# Grid search with 5-fold cross-validation.
# GridSearchCV clones and fits the estimator itself, so no prior fit() is needed.
def gridsearch(model, parameters):
    grid = GridSearchCV(model, parameters, scoring='accuracy', cv=5)
    grid = grid.fit(X_train_std, y_train)
    if hasattr(model, 'decision_function'):
        y_predict_pro = grid.decision_function(X_test_std)
    else:
        y_predict_pro = grid.predict_proba(X_test_std)[:, 1]
    print('best score:', grid.best_score_)
    print(grid.best_params_)
    print('test score:', grid.score(X_test_std, y_test))
    print('AUC:', metrics.roc_auc_score(y_test, y_predict_pro))

# Logistic regression
print('Logistic regression:')
start = time.time()
parameters = {'C': [0.1, 1, 2, 5], 'penalty': ['l1', 'l2']}
lr = LogisticRegression(solver='liblinear')  # the default lbfgs solver does not support the l1 penalty
gridsearch(lr, parameters)
end = time.time()
print('Logistic regression took %s s' % (end - start))

# SVM
print('SVM:')
start = time.time()
parameters = {'C': [0.1, 1, 2, 5], 'kernel': ['linear', 'poly', 'rbf']}
svc = SVC()
gridsearch(svc, parameters)
end = time.time()
print('SVM took %s s' % (end - start))

# Decision tree
# The original max_features grid also listed 'auto'; for classifiers it meant 'sqrt',
# and it was removed in scikit-learn 1.3, so it is left out here.
print('Decision tree:')
start = time.time()
parameters = {'criterion': ['gini', 'entropy'], 'max_depth': [1, 2, 3, 4, 5, 6],
              'splitter': ['best', 'random'], 'max_features': ['log2', 'sqrt']}
clf = DecisionTreeClassifier()
gridsearch(clf, parameters)
end = time.time()
print('Decision tree took %s s' % (end - start))

# Random forest ('auto' dropped from max_features for the same reason)
print('Random forest:')
start = time.time()
parameters = {'n_estimators': range(1, 200), 'max_features': ['log2', 'sqrt']}
rfc = RandomForestClassifier(random_state=2018)
gridsearch(rfc, parameters)
end = time.time()
print('Random forest took %s s' % (end - start))

# GBDT
print('GBDT:')
start = time.time()
parameters = {'n_estimators': range(1, 150, 10), 'learning_rate': np.arange(0.1, 1, 0.1)}
gbdt = GradientBoostingClassifier(random_state=2018)
gridsearch(gbdt, parameters)
end = time.time()
print('GBDT took %s s' % (end - start))

# XGBoost
print('XGBoost:')
start = time.time()
parameters = {'eta': np.arange(0.1, 0.5, 0.1), 'max_depth': range(1, 6, 1), 'min_child_weight': range(1, 5, 1)}
xgbs = XGBClassifier()
gridsearch(xgbs, parameters)
end = time.time()
print('XGBoost took %s s' % (end - start))
Logistic regression:
best score: 0.7965133754132853
{'C': 0.1, 'penalty': 'l1'}
test score: 0.8009810791871058
AUC: 0.7964574657296546
Logistic regression took 12.71597933769226 s
SVM:
best score: 0.7944093778178539
{'C': 2, 'kernel': 'linear'}
test score: 0.8023826208829713
AUC: 0.7971732387645323
SVM took 124.0969910621643 s
Decision tree:
best score: 0.763751127141569
{'criterion': 'gini', 'max_depth': 5, 'max_features': 'auto', 'splitter': 'best'}
test score: 0.7792571829011913
AUC: 0.6941046330036439
Decision tree took 4.393248558044434 s
Random forest:
best score: 0.7935076645626691
{'max_features': 'sqrt', 'n_estimators': 51}
test score: 0.7904695164681149
AUC: 0.7712752689571404
Random forest took 3088.8878474235535 s
GBDT:
best score: 0.7956116621581004
{'learning_rate': 0.2, 'n_estimators': 21}
test score: 0.7946741415557113
AUC: 0.7899951739545374
GBDT took 504.82539463043213 s
XGBoost:
best score: 0.7992185151788398
{'eta': 0.1, 'max_depth': 2, 'min_child_weight': 4}
test score: 0.8037841625788367
AUC: 0.7903137471802881
XGBoost took 450.1137731075287 s
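A variation worth knowing: when the scaler is placed inside a Pipeline, GridSearchCV refits it within every cross-validation fold, so the validation fold never influences the scaling statistics. The sketch below runs on synthetic data from make_classification and scores by roc_auc instead of accuracy. Both choices are assumptions for the example, not what the run above used.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced binary problem standing in for the loan data
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, weights=[0.75, 0.25], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo)

pipe = Pipeline([
    ("scale", StandardScaler()),                 # refit inside each CV fold
    ("clf", LogisticRegression(max_iter=1000)),
])
# Parameters of a pipeline step are addressed as <step name>__<parameter>
param_grid = {"clf__C": [0.1, 1, 2, 5]}

grid = GridSearchCV(pipe, param_grid, scoring="roc_auc", cv=5)
grid.fit(X_tr, y_tr)

print("best params:", grid.best_params_)
print("best CV AUC:", grid.best_score_)
print("test AUC:", grid.score(X_te, y_te))
```

With imbalanced labels, AUC is often a more informative search criterion than accuracy, since a model that predicts only the majority class can still score high accuracy.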

IX. Scores and ROC curves for the classifiers and ensemble models

## Imports
import matplotlib.pyplot as plt  # plotting
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, auc, f1_score, precision_score,
                             recall_score, roc_auc_score, roc_curve)  # evaluation metrics
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2018)  # split the data
sc = StandardScaler()  # standardize the features
sc.fit(X_train)
X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)

def score(name, y_true, y_predict, y_predict_pro):
    acc_score = accuracy_score(y_true, y_predict)
    pre_score = precision_score(y_true, y_predict)
    recall = recall_score(y_true, y_predict)
    F = f1_score(y_true, y_predict)
    auc_score = roc_auc_score(y_true, y_predict_pro)  # AUC value
    # Print the scores in the format of the output listing below
    print('-----------------%s------------------' % name)
    print('accuracy:%s,\nprecision:%s,\nrecall:%s,\nF1-Score:%s,\nAuc:%s\n'
          % (acc_score, pre_score, recall, F, auc_score))
    # Plot the ROC curve
    fpr, tpr, thresholds = roc_curve(y_true, y_predict_pro)
    plt.plot(fpr, tpr, 'b', label='AUC = %0.4f' % auc_score)
    plt.plot([0, 1], [0, 1], 'r--', label='Random guess')
    plt.legend(loc='lower right')
    plt.title('ROC')
    plt.xlabel('false positive rate')
    plt.ylabel('true positive rate')
    plt.show()

# Logistic regression
lr = LogisticRegression()
lr.fit(X_train_std, y_train)
lr_predict = lr.predict(X_test_std)
lr_predict_pro = lr.predict_proba(X_test_std)[:, 1]
score('Logistic Regression', y_test, lr_predict, lr_predict_pro)

# Linear SVM (no predict_proba, so the decision function is used as the score)
svc = LinearSVC()
svc.fit(X_train_std, y_train)
svc_predict = svc.predict(X_test_std)
svc_predict_pro = svc.decision_function(X_test_std)
score('Linear SVM', y_test, svc_predict, svc_predict_pro)

# Decision tree
clf = DecisionTreeClassifier()
clf.fit(X_train_std, y_train)
clf_predict = clf.predict(X_test_std)
clf_predict_proba = clf.predict_proba(X_test_std)[:, 1]
score('Decision Tree', y_test, clf_predict, clf_predict_proba)

# Random forest
rfc = RandomForestClassifier()
rfc.fit(X_train_std, y_train)
rfc_predict = rfc.predict(X_test_std)
rfc_predict_proba = rfc.predict_proba(X_test_std)[:, 1]
score('Random Forest', y_test, rfc_predict, rfc_predict_proba)

# GBDT
gbdt = GradientBoostingClassifier()
gbdt.fit(X_train_std, y_train)
gbdt_predict = gbdt.predict(X_test_std)
gbdt_predict_proba = gbdt.predict_proba(X_test_std)[:, 1]
score('GBDT', y_test, gbdt_predict, gbdt_predict_proba)

# XGBoost
xgbs = XGBClassifier()
xgbs.fit(X_train_std, y_train)
xgbs_predict = xgbs.predict(X_test_std)
xgbs_predict_proba = xgbs.predict_proba(X_test_std)[:, 1]
score('XGBoost', y_test, xgbs_predict, xgbs_predict_proba)
-----------------Logistic Regression------------------
accuracy:0.7848633496846531,
precision:0.6181818181818182,
recall:0.3788300835654596,
F1-Score:0.4697754749568221,
Auc:0.7743732590529248

-----------------Linear SVM------------------
accuracy:0.7813594954449895,
precision:0.6157635467980296,
recall:0.34818941504178275,
F1-Score:0.4448398576512455,
Auc:0.7752808988764046

-----------------Decision Tree------------------
accuracy:0.6734407848633497,
precision:0.36175710594315247,
recall:0.38997214484679665,
F1-Score:0.3753351206434316,
Auc:0.5793493683035481

-----------------Random Forest------------------
accuracy:0.7589348283111422,
precision:0.5531914893617021,
recall:0.21727019498607242,
F1-Score:0.312,
Auc:0.6982958279866045

-----------------GBDT------------------
accuracy:0.7778556412053259,
precision:0.5990566037735849,
recall:0.35376044568245124,
F1-Score:0.4448336252189142,
Auc:0.7711443564624998

-----------------XGBoost------------------
accuracy:0.7834618079887876,
precision:0.6225490196078431,
recall:0.35376044568245124,
F1-Score:0.4511545293072824,
Auc:0.7739533452265448
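To see what roc_curve and roc_auc_score actually compute, it helps to run them on four hand-checkable points. The labels and scores below are made up for the illustration.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Four samples: two negatives, two positives, with made-up scores
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# Each threshold yields one (false positive rate, true positive rate) point
fpr, tpr, thresholds = roc_curve(y_true, y_score)
auc_value = roc_auc_score(y_true, y_score)

print(fpr)
print(tpr)
print(auc_value)  # 0.75
```

AUC equals the fraction of (positive, negative) pairs in which the positive sample gets the higher score. Here three of the four pairs are ordered correctly (only the positive at 0.35 ranks below the negative at 0.4), hence 0.75. This is also why a decision-function value works as well as a probability: only the ranking matters.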

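About Me: 小婷儿

● Author: 小婷儿, focusing on Python, data analysis, data mining and machine learning, with an emphasis on applying these techniques in practice

● Author's blog: https://blog.csdn.net/u010986753

● The topics in this series come from the author's study notes, partly compiled from the web; apologies for any infringement or inaccuracy

● All rights reserved. Sharing is welcome; please credit the source when reposting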
