Building a Loan-Overdue Prediction Model (6)

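Contents

I. IV (Information Value)
- 1. Overview
- 2. Computing IV
  - (1) WOE
  - (2) IV calculation
II. Implementation
- 0. Modules used
- 1. IV
- 2. Random Forest
- 3. Merging features
- 4. Building the models
- 5. Evaluating the models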

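Dataset (data.csv): https://pan.baidu.com/s/1G1b2QJjYkkk7LDfGorbj5Q
Goal: the dataset contains financial data (not anonymized), and the aim is to predict whether a loan customer will become overdue. The "status" column is the label: 0 means not overdue, 1 means overdue.

Task: select features with IV and with a random forest, then build each of the models (logistic regression, SVM, decision tree, random forest, GBDT, XGBoost and LightGBM) and evaluate them.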

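I. IV (Information Value)

1. Overview

IV (Information Value) measures the predictive power of a variable: the larger a feature's IV, the more strongly that feature separates the two classes of the target.

Applicability: supervised models with a binary target only.

Common rules of thumb for interpreting IV:

If IV is in [0, 0.02] (IV is never negative), the variable has no predictive power.
If IV is in (0.02, 0.1], the variable has weak predictive power.
If IV is in (0.1, +∞), the variable has usable predictive power; in practice, feature selection keeps the variables whose IV is above 0.1.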
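
The bands above can be written as a small lookup. This is only a sketch: the function name `iv_strength` and the label strings are illustrative assumptions, not part of the original post.

```python
def iv_strength(iv):
    """Map an IV value to the rule-of-thumb bands listed above."""
    if iv <= 0.02:
        return "none"
    if iv <= 0.1:
        return "weak"
    return "usable"


# A few illustrative values, one per band
for value in (0.01, 0.05, 0.3):
    print(value, iv_strength(value))
```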


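2. Computing IV

WOE is the basis for computing IV.

(1) WOE

WOE (Weight of Evidence) is an encoding of the original independent variable.

First, split the feature into groups (also called discretization or binning).
Then, for group $i$, compute the WOE with the following formula:
$$WOE_i = \ln\left(\frac{p_{y_i}}{p_{n_i}}\right) = \ln\left(\frac{\#y_i/\#y_T}{\#n_i/\#n_T}\right)$$
Here $p_{y_i}$ is the share of all responders in the sample (in a risk model, the defaulting customers) that fall into this group, and $p_{n_i}$ is the share of all non-responders in the sample that fall into this group. $\#y_i$ is the number of responders in the group, $\#n_i$ is the number of non-responders in the group, $\#y_T$ is the total number of responders in the sample, and $\#n_T$ is the total number of non-responders in the sample.
==> WOE: the difference between "the share of all responders that fall into the current group" and "the share of all non-responders that fall into the current group".
Rearranging the formula:
$$WOE_i = \ln\left(\frac{p_{y_i}}{p_{n_i}}\right) = \ln\left(\frac{\#y_i/\#y_T}{\#n_i/\#n_T}\right) = \ln\left(\frac{\#y_i/\#n_i}{\#y_T/\#n_T}\right)$$
==> WOE: the difference between the responder-to-non-responder ratio inside the current group and the same ratio over the whole sample.
==> The larger the WOE, the more likely the samples in this group are to respond; the smaller (more negative) the WOE, the less likely they are to respond. A WOE of 0 means the group's ratio equals the overall ratio.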
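
A small worked example may make the two forms of the formula concrete. The group names and counts below are made up for illustration; they are not taken from the dataset.

```python
import math

# Hypothetical counts per group: (responders, non-responders)
groups = {"A": (10, 190), "B": (30, 170), "C": (60, 140)}

y_total = sum(y for y, n in groups.values())  # all responders in the sample
n_total = sum(n for y, n in groups.values())  # all non-responders in the sample


def woe(y_i, n_i):
    """WOE_i = ln( (#y_i / #y_T) / (#n_i / #n_T) )."""
    return math.log((y_i / y_total) / (n_i / n_total))


def woe_rearranged(y_i, n_i):
    """Equivalent form: ln( (#y_i / #n_i) / (#y_T / #n_T) )."""
    return math.log((y_i / n_i) / (y_total / n_total))


woe_by_group = {g: woe(y, n) for g, (y, n) in groups.items()}
for g, w in woe_by_group.items():
    print(g, round(w, 4))
```

Group C has a higher responder-to-non-responder ratio than the sample as a whole, so its WOE is positive; groups A and B sit below the overall ratio and get negative values.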

(2) IV calculation

$$IV_i = (p_{y_i} - p_{n_i}) \cdot WOE_i = (p_{y_i} - p_{n_i}) \cdot \ln\left(\frac{p_{y_i}}{p_{n_i}}\right) = \left(\#y_i/\#y_T - \#n_i/\#n_T\right) \ln\left(\frac{\#y_i/\#y_T}{\#n_i/\#n_T}\right)$$
$$IV = \sum_{i=1}^{n} IV_i$$
where $n$ is the number of groups of the feature.
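
The sketch below puts the binning step and the two formulas together on synthetic data. It uses equal-frequency binning via `pd.qcut`, which is one common choice and an assumption here, not necessarily what the original post does; the helper name `iv_with_bins` and the generated data are also illustrative.

```python
import numpy as np
import pandas as pd


def iv_with_bins(feature, target, bins=5):
    """IV of a numeric feature after equal-frequency binning."""
    binned = pd.qcut(feature, q=bins, duplicates="drop")
    stats = target.groupby(binned, observed=True).agg(["count", "sum"])
    responders = stats["sum"]
    non_responders = stats["count"] - stats["sum"]
    p_y = responders / responders.sum()
    p_n = non_responders / non_responders.sum()
    # WOE is undefined for a bin with no responders or no non-responders; skip it
    ok = (p_y > 0) & (p_n > 0)
    woe = np.log(p_y[ok] / p_n[ok])
    return float(((p_y[ok] - p_n[ok]) * woe).sum())


rng = np.random.default_rng(0)
x = pd.Series(rng.normal(size=2000))
# Synthetic label: a larger x makes a response more likely
y = pd.Series((rng.random(2000) < 1 / (1 + np.exp(-2 * x))).astype(int))
noise = pd.Series(rng.normal(size=2000))  # unrelated to y

print("informative feature IV:", round(iv_with_bins(x, y), 4))
print("noise feature IV:", round(iv_with_bins(noise, y), 4))
```

Every term `(p_y - p_n) * ln(p_y / p_n)` is non-negative, so IV cannot drop below 0, and the feature that drives the label should score well above the unrelated one.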

II. Implementation

0. Modules used

import pandas as pd
from pandas import DataFrame as df
from numpy import log
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
import xgboost as xgb
import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, f1_score
from sklearn.metrics import roc_auc_score, recall_score, roc_curve, auc
import matplotlib.pyplot as plt

1. IV

def calcWOE(dataset, col, target):
    ## Group by each distinct value of the feature and count the samples per group
    subdata = df(dataset.groupby(col)[col].count())
    ## Number of responders (target == 1) in each group
    suby = df(dataset.groupby(col)[target].sum())
    ## Join subdata and suby on the group index
    data = df(pd.merge(subdata, suby, how='left', left_index=True, right_index=True))
    ## Totals: all samples (total), responders (b_total), non-responders (g_total)
    b_total = data[target].sum()
    total = data[col].sum()
    g_total = total - b_total
    ## WOE formula
    data["bad"] = data[target] / b_total
    data["good"] = (data[col] - data[target]) / g_total
    data["WOE"] = log(data["bad"] / data["good"])
    return data.loc[:, ["bad", "good", "WOE"]]

def calcIV(dataset):
    ## IV = sum over groups of (bad - good) * WOE
    return ((dataset["bad"] - dataset["good"]) * dataset["WOE"]).sum()

file_name = '1.csv'
data = pd.read_csv(file_name, encoding='gbk')
X = data.drop(labels="status", axis=1)
print(X.shape)
y = data["status"]
col_list = [col for col in data.drop(labels=['Unnamed: 0', 'status'], axis=1)]
data_IV = df()
fea_iv = []
for col in col_list:
    col_WOE = calcWOE(data, col, "status")
    ## Drop groups whose bad, good or WOE value is nan, inf or -inf
    col_WOE = col_WOE[~col_WOE.isin([np.nan, np.inf, -np.inf]).any(axis=1)]
    col_IV = calcIV(col_WOE)
    ## Keep the features whose IV is above 0.1
    if col_IV > 0.1:
        data_IV[col] = [col_IV]
        fea_iv.append(col)
data_IV.to_csv('data_IV.csv', index=False)
print(fea_iv)

Output

['trans_amount_increase_rate_lately', 'trans_activity_day', 'repayment_capability', 'first_transaction_time', 'historical_trans_day', 'rank_trad_1_month', 'trans_amount_3_month', 'abs', 'avg_price_last_12_month', 'trans_fail_top_count_enum_last_1_month', 'trans_fail_top_count_enum_last_6_month', 'trans_fail_top_count_enum_last_12_month', 'max_cumulative_consume_later_1_month', 'pawns_auctions_trusts_consume_last_1_month', 'pawns_auctions_trusts_consume_last_6_month', 'first_transaction_day', 'trans_day_last_12_month', 'apply_score', 'loans_score', 'loans_count', 'loans_overdue_count', 'history_suc_fee', 'history_fail_fee', 'latest_one_month_suc', 'latest_one_month_fail', 'loans_avg_limit', 'consfin_credit_limit', 'consfin_max_limit', 'consfin_avg_limit', 'loans_latest_day']

2. Random Forest

rfc = RandomForestClassifier()
rfc.fit(X, y)
rfc_impc = pd.Series(rfc.feature_importances_, index=X.columns).sort_values(ascending=False)
fea_gini = rfc_impc[:20].index.tolist()
print(fea_gini)

Output

['trans_fail_top_count_enum_last_1_month', 'history_fail_fee', 'loans_score', 'apply_score', 'latest_one_month_fail', 'trans_fail_top_count_enum_last_12_month', 'Unnamed: 0', 'trans_amount_3_month', 'trans_activity_day', 'max_cumulative_consume_later_1_month', 'repayment_capability', 'historical_trans_amount', 'consfin_credit_limit', 'latest_query_day', 'pawns_auctions_trusts_consume_last_6_month', 'first_transaction_time', 'loans_overdue_count', 'history_suc_fee', 'trans_days_interval', 'number_of_trans_from_2011']
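
Because the forest above is built without a fixed seed, the top-20 list can change from run to run. The following sketch shows the same ranking step with `random_state` set, on synthetic data from `make_classification`; the data, the column names `f0` to `f9` and the choice of 100 trees are assumptions for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the real feature matrix
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, n_informative=3, n_redundant=0, random_state=0
)
X_demo = pd.DataFrame(X_demo, columns=[f"f{i}" for i in range(10)])

# A fixed random_state makes the importance ranking repeatable
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_demo, y_demo)

importances = pd.Series(rf.feature_importances_, index=X_demo.columns)
importances = importances.sort_values(ascending=False)
top_k = importances[:5].index.tolist()
print(top_k)
```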

3. Merging features

features = list(set(fea_gini)|set(fea_iv))
X_final = X[features]
print(X_final.shape)

(4754, 35)
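Analysis: feature selection reduces the data from (4754, 92) to (4754, 35), removing a large number of redundant features. Note that the random-forest list includes 'Unnamed: 0', which appears to be a saved row-index column rather than a real feature, because X was built without dropping it.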
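
One side effect of `list(set(...) | set(...))` is that the column order of the merged features is not guaranteed to be the same on every run. If a stable order matters, an order-preserving union is a possible alternative; the helper `ordered_union` below is a hypothetical name, and the short lists reuse a few feature names from the outputs above.

```python
def ordered_union(*feature_lists):
    """Union of several feature lists, keeping first-seen order and dropping duplicates."""
    merged = {}
    for names in feature_lists:
        for name in names:
            merged.setdefault(name, None)  # dicts keep insertion order
    return list(merged)


fea_a = ["loans_score", "apply_score", "history_fail_fee"]
fea_b = ["apply_score", "loans_count"]
print(ordered_union(fea_a, fea_b))
```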

4. Building the models

## Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X_final, y, test_size=0.3, random_state=2019)

## Model 1: Logistic Regression
lr = LogisticRegression()
lr.fit(X_train, y_train)

## Model 2: SVM
svm = SVC(kernel='linear', probability=True)
svm.fit(X_train, y_train)

## Model 3: Decision Tree
dtc = DecisionTreeClassifier(max_depth=8)
dtc.fit(X_train, y_train)

## Model 4: Random Forest
rfc = RandomForestClassifier()
rfc.fit(X_train, y_train)

## Model 5: GBDT
gbdt = GradientBoostingClassifier()
gbdt.fit(X_train, y_train)

## Model 6: XGBoost
xgbc = xgb.XGBClassifier()
xgbc.fit(X_train, y_train)

## Model 7: LightGBM
lgbc = lgb.LGBMClassifier()
lgbc.fit(X_train, y_train)
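
The models above are fitted on unscaled features. Logistic regression and a linear SVM are sensitive to feature scale, so one option (not used in the original post) is to standardize inside a pipeline. This is a minimal sketch on synthetic data; the inflated first column is an assumption that stands in for monetary amounts sitting next to small counts.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_classification(n_samples=600, n_features=8, random_state=0)
X_demo[:, 0] *= 10000  # put one feature on a much larger scale than the rest

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=2019
)

# The scaler is fitted on the training split only, then applied to the test split
scaled_lr = make_pipeline(StandardScaler(), LogisticRegression())
scaled_lr.fit(X_tr, y_tr)
test_accuracy = scaled_lr.score(X_te, y_te)
print("test accuracy:", round(test_accuracy, 4))
```

Keeping the scaler inside the pipeline means the test data never influences the scaling parameters.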

5. Evaluating the models

## Model evaluation
def model_metrics(clf, X_train, X_test, y_train, y_test):
    y_train_pred = clf.predict(X_train)
    y_test_pred = clf.predict(X_test)
    y_train_prob = clf.predict_proba(X_train)[:, 1]
    y_test_prob = clf.predict_proba(X_test)[:, 1]

    # Accuracy
    print('Accuracy: ', end=' ')
    print('Train: ', '%.4f' % accuracy_score(y_train, y_train_pred), end=' ')
    print('Test: ', '%.4f' % accuracy_score(y_test, y_test_pred))

    # Precision
    print('Precision:', end=' ')
    print('Train: ', '%.4f' % precision_score(y_train, y_train_pred), end=' ')
    print('Test: ', '%.4f' % precision_score(y_test, y_test_pred))

    # Recall
    print('Recall:', end=' ')
    print('Train: ', '%.4f' % recall_score(y_train, y_train_pred), end=' ')
    print('Test: ', '%.4f' % recall_score(y_test, y_test_pred))

    # F1 score
    print('f1-score:', end=' ')
    print('Train: ', '%.4f' % f1_score(y_train, y_train_pred), end=' ')
    print('Test: ', '%.4f' % f1_score(y_test, y_test_pred))

    # AUC
    print('auc:', end=' ')
    print('Train: ', '%.4f' % roc_auc_score(y_train, y_train_prob), end=' ')
    print('Test: ', '%.4f' % roc_auc_score(y_test, y_test_prob))

    # ROC curve
    fpr_train, tpr_train, thred_train = roc_curve(y_train, y_train_prob, pos_label=1)
    fpr_test, tpr_test, thred_test = roc_curve(y_test, y_test_prob, pos_label=1)
    label = ['Train - AUC:{:.4f}'.format(auc(fpr_train, tpr_train)),
             'Test - AUC:{:.4f}'.format(auc(fpr_test, tpr_test))]
    plt.plot(fpr_train, tpr_train)
    plt.plot(fpr_test, tpr_test)
    plt.plot([0, 1], [0, 1], 'd--')
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.legend(label, loc=4)
    plt.title('ROC Curve')
    # Show the figure so that each model gets its own plot
    plt.show()

model_metrics(lr, X_train, X_test, y_train, y_test)
model_metrics(svm, X_train, X_test, y_train, y_test)
model_metrics(dtc, X_train, X_test, y_train, y_test)
model_metrics(rfc, X_train, X_test, y_train, y_test)
model_metrics(gbdt, X_train, X_test, y_train, y_test)
model_metrics(xgbc, X_train, X_test, y_train, y_test)
model_metrics(lgbc, X_train, X_test, y_train, y_test)
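
Printing the metrics model by model makes side-by-side comparison awkward. A possible variant collects them into a table instead; the helper `metrics_row` and the four-sample toy data below are hypothetical and only show the shape of the idea.

```python
import pandas as pd
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)


def metrics_row(y_true, y_pred, y_prob):
    """Return the five metrics used above as one dict (one table row)."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
        "auc": roc_auc_score(y_true, y_prob),
    }


# Tiny hand-made example: true labels and predicted probabilities of class 1
y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.6, 0.4, 0.9]
y_pred = [int(p >= 0.5) for p in y_prob]  # threshold at 0.5

row = metrics_row(y_true, y_pred, y_prob)
table = pd.DataFrame({"toy_model": row}).T  # one row per model
print(table)
```

With the real models, one such row per classifier (and per split) could be stacked into a single DataFrame for comparison.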

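Problems encountered:

TypeError: 'list' object is not callable
Cause: a variable named list had been defined earlier, so the built-in list could no longer be called when wrapping the set. Tip: never give an object the same name as a built-in, a keyword, or an imported function.
UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 due to no predicted samples. 'precision', 'predicted', average, warn_for)
This warning appears at prediction time, and for the same model some metrics come out as 0. It is raised when the model predicts no sample as positive, which leaves precision with a zero denominator. The underlying problem has not been resolved here.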
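
The warning can be reproduced on a toy example in which the classifier never predicts the positive class. The sketch assumes a scikit-learn version that supports the `zero_division` parameter of `precision_score`; the labels are made up.

```python
import warnings

from sklearn.metrics import precision_score

y_true = [0, 0, 1, 1]
y_pred = [0, 0, 0, 0]  # the model never predicts class 1

# precision = TP / (TP + FP); here TP + FP = 0, so the value is undefined.
# zero_division=0 returns 0.0 explicitly instead of warning.
with warnings.catch_warnings():
    warnings.simplefilter("error")  # any warning would raise here
    p = precision_score(y_true, y_pred, zero_division=0)

print(p)
```

Setting `zero_division` only silences the warning. A model that predicts a single class still needs attention, for example through feature scaling, class weights, or a different decision threshold.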

Reference:
https://blog.csdn.net/kevin7658/article/details/50780391/