1. Case Background

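As internet applications have become widespread, online lending has become a common way to borrow. The online lending industry, however, suffers from many cases of poorly controlled risk. To standardize the lending process and strengthen risk control, practitioners have begun to use technical means to reduce that risk.
Many financial service providers have accumulated large amounts of customer data in the course of their business, such as personal information and records of the services each customer has used. This article applies data mining to that kind of data: it combines each customer's attributes, takes whether the customer defaulted on past loans as the class label, and trains a model. After tuning, the model classifies new applicants, and its prediction serves as an important basis for deciding whether to grant a loan. The workflow has three steps: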

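Step 1: clean the raw business data to obtain a training set that meets the requirements;
Step 2: choose a suitable classification algorithm and train the model (its parameters can also be tuned);
Step 3: feed a new user's information into the model to obtain a prediction.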
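
The three steps can be sketched end to end on synthetic data. This is a minimal illustration rather than the article's actual code: the columns `fico` and `dti`, the way defaults are generated, and the use of a scikit-learn pipeline are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for cleaned loan data (hypothetical columns).
rng = np.random.default_rng(0)
n = 500
demo = pd.DataFrame({
    "fico": rng.normal(700, 40, n),
    "dti": rng.normal(18, 6, n),
})
# Make default more likely for low FICO scores; purely illustrative.
p_default = 1 / (1 + np.exp((demo["fico"] - 680) / 20))
demo["charged_off"] = (rng.random(n) < p_default).astype(float)
# Blank out a few values so the imputation step has something to do.
demo.loc[rng.choice(n, 25, replace=False), "dti"] = np.nan

# Step 1: cleaned data -> features and label, then a train/test split.
X_demo = demo.drop(columns="charged_off")
y_demo = demo["charged_off"]
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0)

# Step 2: train a classifier; imputation and scaling are part of the model.
clf = make_pipeline(
    SimpleImputer(strategy="mean"),
    StandardScaler(),
    LogisticRegression(solver="liblinear"),
)
clf.fit(X_tr, y_tr)

# Step 3: predict for a new applicant.
new_applicant = pd.DataFrame({"fico": [640.0], "dti": [25.0]})
print(clf.predict(new_applicant), clf.predict_proba(new_applicant))
```

Putting the preprocessing inside the pipeline means the same transformations are applied automatically to any new applicant passed to `predict`.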

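2. Data Processing

The data for this case comes from an online lending company; only part of it is used here. The original data can be downloaded from: https://www.kaggle.com/husainsb/lendingclub-issued-loans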

# Read the data
import numpy as np
import pandas as pd
dforigin = pd.read_csv('LendingClub10k.csv', sep=',')
dforigin.info()
print(dforigin.shape)

2.1 Selecting the Data

First look at the class label field loan_status:

# Show the value counts of the loan_status field
dforigin['loan_status'].value_counts()

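Besides fully paid and charged-off loans there are other categories. This article analyzes only these two outcomes, so we keep just the records that carry one of the two labels:

# Keep the records whose loan_status is 'Fully Paid' or 'Charged Off' and show their counts
df = dforigin.loc[dforigin['loan_status'].isin(['Fully Paid', 'Charged Off'])].copy(deep=True)
print(df['loan_status'].value_counts())
print(df.shape)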

Check the proportion of defaults:

# Show the proportions of fully paid and charged-off samples in df
print(df['loan_status'].value_counts() / df.shape[0])

Then convert the label to a number:

# 'Fully Paid' becomes 0.0 and 'Charged Off' becomes 1.0
# (np.float was removed in NumPy 1.24, so the built-in float is used)
df['loan_status'] = df['loan_status'].apply(lambda s: float(s == 'Charged Off'))
print(df['loan_status'].value_counts())
df.rename(columns={'loan_status':'charged_off'}, inplace=True)

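2.2 Removing Invalid Data

Columns that hold the same value in every row carry no useful information for data mining, so they can be dropped:

# Drop columns that contain only a single distinct value
drop_list = []
for col in df.columns:
    if df[col].nunique() == 1:
        drop_list.append(col)
print(drop_list)
print(df.shape)
df.drop(labels=drop_list, axis=1, inplace=True)
print(df.shape)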

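Some columns are almost entirely missing, which badly hampers the analysis, so they are dropped as well:

# Drop columns in which fewer than 2% of the values are present
drop_list = []
for col in df.columns:
    if df[col].notnull().sum() / df.shape[0] < 0.02:
        drop_list.append(col)
print(drop_list)
df.drop(labels=drop_list, axis=1, inplace=True)
print(df.shape)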
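
The two filters above can also be wrapped in one reusable function. The sketch below is a hypothetical helper, not part of the original code; the name `drop_uninformative`, the `min_filled` parameter, and the toy frame are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def drop_uninformative(frame, min_filled=0.02):
    """Return a copy without constant columns and without columns whose
    share of non-null values is below min_filled."""
    constant = [c for c in frame.columns if frame[c].nunique() == 1]
    sparse = [c for c in frame.columns
              if frame[c].notnull().mean() < min_filled]
    return frame.drop(columns=sorted(set(constant) | set(sparse)))

toy = pd.DataFrame({
    "amount": [1000, 2500, 1800, 3000],
    "policy_code": [1, 1, 1, 1],                 # constant column
    "notes": [np.nan, np.nan, np.nan, np.nan],   # entirely missing column
})
print(drop_uninformative(toy).columns.tolist())
```

Returning a new frame instead of modifying `frame` in place keeps the original data available for comparison.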

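Finally, based on what each field means and on personal judgment, drop the columns that are clearly unrelated to the target field (the class label):

# Drop identifier and free-text columns that are clearly unrelated to the target
df.drop(labels=['id', 'emp_title', 'title', 'last_credit_pull_d', 'earliest_cr_line'], axis=1, inplace=True)
# Drop repayment and recovery fields, which are recorded only after a loan has been issued
df.drop(labels=['collection_recovery_fee', 'debt_settlement_flag', 'last_pymnt_amnt', 'last_pymnt_d', 'recoveries', 'total_pymnt', 'total_pymnt_inv', 'total_rec_int', 'total_rec_late_fee', 'total_rec_prncp'], axis=1, inplace=True)
df.shape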

2.3 Analyzing Attribute Relationships

# Relationship between loan purpose and default
import matplotlib.pyplot as plt
import seaborn as sns
plt.figure(figsize=(12,8))
sns.countplot(y='purpose', hue='charged_off', data=df, orient='h', palette='BuPu')
plt.yticks(size=18)
plt.ylabel('Loan purpose', fontdict={'size':18})
plt.xticks(size=18)
plt.xlabel('Number of loans', fontdict={'size':18})
# plt.savefig("ch18_lc01.jpg",dpi=300,bbox_inches="tight")
plt.show()

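The chart shows that most online loans are taken out for debt consolidation and credit card repayment, and that these purposes also have relatively high default rates. This suggests that such borrowers already carry debt, have been turned away by banks and other traditional institutions, and can only repay that debt by borrowing online.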

# Relationship between credit sub-grade and default
plt.figure(figsize=(12,8))
sns.countplot(y='sub_grade', hue='charged_off', data=df, order=sorted(df['sub_grade'].value_counts().index), orient='h', palette='BuPu')
plt.ylabel('Customer sub-grade', fontdict={'size':18})
plt.xticks(size=18)
plt.xlabel('Number of loans', fontdict={'size':18})
# plt.savefig("ch18_lc02.jpg",dpi=300,bbox_inches="tight")
plt.show()

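The chart shows that the lower a customer's grade, the larger the share of defaulters, and in some grades every borrower defaulted. In short, the lower the grade, the more likely a default.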

# Relationship between loan term and default
plt.figure(figsize=(4,4))
sns.countplot(x='term', hue='charged_off', data=df)
# plt.savefig("ch18_lc03.jpg",dpi=300,bbox_inches="tight")
plt.show()

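Next is the relationship with the loan term. The chart shows that 60-month loans have a higher default rate, which suggests that long-term loans default more often than short-term ones.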
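
A count plot shows how many loans fall into each group, whereas a default rate is the share of charged-off loans within a group. Because charged_off is coded as 0/1, the group mean gives that rate directly. A minimal sketch on made-up records (the toy values are assumptions, not taken from the data set):

```python
import pandas as pd

# Eight hypothetical loans: four per term.
toy = pd.DataFrame({
    "term": [" 36 months"] * 4 + [" 60 months"] * 4,
    "charged_off": [0, 0, 0, 1, 0, 1, 1, 0],
})

# The mean of a 0/1 column is the fraction of ones, i.e. the default rate.
rate_by_term = toy.groupby("term")["charged_off"].mean()
print(rate_by_term)
```

The same `groupby(...).mean()` pattern could be applied to purpose or sub_grade to check the rates that the count plots only hint at.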

# Relationship between FICO score and default
plt.figure(figsize=(12,6))
sns.kdeplot(df['last_fico_range_high'].loc[df['charged_off']==0], gridsize=500, label='charged_off = 0', linewidth=2, linestyle='--')
sns.kdeplot(df['last_fico_range_high'].loc[df['charged_off']==1], gridsize=500, label='charged_off = 1', linewidth=2, linestyle='-')
plt.xlabel('last_fico_range_high score', fontdict={'size':18})
plt.ylabel('Probability density of FICO score', fontdict={'size':18})
plt.legend()  # show the labels of the two curves
# plt.savefig("ch18_lc04.jpg",dpi=300,bbox_inches="tight")
plt.show()

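Last is the relationship between FICO score and default. The chart clearly shows that the FICO scores of defaulting customers are, on average, far lower than those of customers who repaid.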

# Correlation of every numeric attribute with the label
# (text columns are excluded first, since corr() cannot handle them in recent pandas versions)
corr_charged_off = df.select_dtypes(include='number').corr()['charged_off']
corr_charged_off.drop(labels='charged_off', inplace=True)
corr_charged_off = corr_charged_off.sort_values()
plt.figure(figsize=(8,28))
sns.barplot(y=corr_charged_off.index, x=corr_charged_off.values, orient='h')
plt.title("Correlation with 'charged_off'")
plt.xlabel("Correlation coefficient with 'charged_off'")
xmax = np.abs(corr_charged_off).max()
plt.xlim([-xmax, xmax])
# plt.savefig("ch18_lc05.jpg",dpi=300,bbox_inches="tight")
plt.show()

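The correlation of each attribute with the class label can be examined in the same way. For example, the FICO score has a very strong negative correlation with default: the higher the FICO score, the lower the likelihood of default.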

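2.4 Further Data Processing

The data still contains non-numeric columns, which must be converted to numbers before data mining:

# Find the non-numeric (text) columns
# (np.object was removed in NumPy 1.24, so the built-in object is used)
text_cols = []
for col in df.columns:
    if df[col].dtype == object:
        text_cols.append(col)
print(text_cols)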

Then convert these columns to numeric values one at a time:

# Convert the term column
# The values carry an extra leading space, so characters 1-2 hold the number of months
df['term'] = df['term'].apply(lambda s: float(s[1:3]))
print(df['term'].value_counts())

# Convert the disbursement_method column
DisbMethod_dict = {'Cash':0.0, 'DirectPay':1.0}
def DisbMethod_dict_to_float(s):
    return DisbMethod_dict[s]
df['disbursement_method'] = df['disbursement_method'].apply(DisbMethod_dict_to_float)
print(df['disbursement_method'].value_counts())
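
The list of text columns may contain more than term and disbursement_method. One common way to turn an unordered text column into numbers is one-hot encoding. The sketch below uses `pandas.get_dummies` on a made-up frame; it is not the conversion used in this article, and the toy columns and values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical frame with one text column and one numeric column.
toy = pd.DataFrame({
    "purpose": ["debt_consolidation", "credit_card", "car"],
    "loan_amnt": [10000.0, 5000.0, 7500.0],
})

# Replace the text column with one 0/1 indicator column per category.
encoded = pd.get_dummies(toy, columns=["purpose"], dtype=float)
print(encoded.columns.tolist())
print(encoded)
```

One-hot encoding avoids implying an order between categories, at the cost of one extra column per distinct value. For ordered fields such as a grade, a dictionary mapping like the one used for disbursement_method may be the simpler choice.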

Finally, drop the remaining unimportant columns. The resulting data has shape (4952, 89):

#Some comments in this column, impossible to convert to numeric
df.drop(labels=['desc'], axis=1, inplace=True)
df.drop(labels=['issue_d'], axis=1, inplace=True)
df.shape

3. Model Training

With the data cleaned, the next step is model training.

3.1 Splitting Off the Training Set

First, separate the class label from the attributes:

X = df.drop(labels=['charged_off'], axis=1) # Features
y = df['charged_off'] # Target variable

Then split the data into a training set and a test set:

from sklearn.model_selection import train_test_split
random_state = 12 # I chose this randomly, just to make the results fixed
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=random_state)
pd.DataFrame((X_train.notnull().sum() / X_train.shape[0]).sort_values(), columns=['Fraction not null'])

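Finally, fill the missing values in X_train and X_test with the training-set column means, and standardize both sets with a scaler fitted on the training set:

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
# Fill missing values with the column means learned from the training set
imputer = SimpleImputer(missing_values=np.nan, strategy="mean").fit(X_train)
X_train = pd.DataFrame(imputer.transform(X_train), columns=X_train.columns)
X_test  = pd.DataFrame(imputer.transform(X_test),  columns=X_test.columns)
# Standardize with the mean and standard deviation learned from the training set
scaler = StandardScaler().fit(X_train)
X_train = pd.DataFrame(scaler.transform(X_train), columns=X_train.columns)
X_test  = pd.DataFrame(scaler.transform(X_test),  columns=X_test.columns)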

This gives:

print("df.shape:",df.shape)
print("X_train:",X_train.shape)
print("y_train:",y_train.shape)
print("X_test:",X_test.shape)
print("y_test:",y_test.shape)

3.2 Model Training

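Next, train the data mining models. Note that because of random sampling and the mathematical properties of the algorithms, the results may differ slightly from run to run.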
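
The training code in this section calls a helper named my_eval whose definition does not appear in the article. Since the results are discussed in terms of accuracy and AUC, a minimal stand-in might look like the following. This is an assumption about what the helper reports, not the author's implementation.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score

def my_eval(y_true, y_pred):
    """Hypothetical stand-in: print accuracy, AUC, and the confusion
    matrix for hard 0/1 predictions, and return (accuracy, auc)."""
    acc = accuracy_score(y_true, y_pred)
    # With hard labels instead of scores, this AUC equals the mean of
    # the true positive rate and the true negative rate.
    auc = roc_auc_score(y_true, y_pred)
    print("Accuracy =", round(acc, 4))
    print("AUC =", round(auc, 4))
    print(confusion_matrix(y_true, y_pred))
    return acc, auc

# Usage on four made-up labels and predictions.
acc, auc = my_eval([0, 0, 1, 1], [0, 1, 1, 1])
```

AUC is usually computed from predicted probabilities or decision scores; passing hard predictions, as the calls in this section do, gives a coarser value.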
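The first model is logistic regression:

# Logistic regression model
import datetime
from sklearn.linear_model import LogisticRegression
lrmodel = LogisticRegression(solver='liblinear')   # initialize
start = datetime.datetime.now()
lrmodel.fit(X_train, y_train)   # fit the model parameters
end = datetime.datetime.now()
y_lrpred = lrmodel.predict(X_test)
my_eval(y_test, y_lrpred)   # evaluation helper; its definition is not shown in this article
print('Runtime =', end-start)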

The accuracy reaches 91% with AUC = 0.85, and the run time is short.

Next is the random forest model:

# Random forest model
from sklearn.ensemble import RandomForestClassifier
rfmodel = RandomForestClassifier()
start = datetime.datetime.now()
rfmodel.fit(X_train, y_train)
end = datetime.datetime.now()
y_rfpred = rfmodel.predict(X_test)
my_eval(y_test, y_rfpred)
print('Runtime =', end-start)

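The accuracy is again high and the AUC is higher than that of logistic regression, but training takes somewhat longer.

Last is SGDClassifier, which fits a family of linear models, such as a linear SVM or logistic regression, by stochastic gradient descent. The details of its parameters can be looked up online: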

# SGDClassifier model
# ('log_loss' was named 'log' before scikit-learn 1.1; the old name was later removed)
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import matthews_corrcoef, make_scorer
from sklearn.metrics import roc_auc_score
param_grid = [{'loss': ['hinge'], 'alpha': [10.0**k for k in range(-3,4)], 'max_iter': [1000], 'tol': [1e-3],
               'random_state': [random_state], 'class_weight': [None, 'balanced'], 'warm_start': [True]},
              {'loss': ['log_loss'], 'penalty': ['l2', 'l1'], 'alpha': [10.0**k for k in range(-3,4)],
               'max_iter': [1000], 'tol': [1e-3], 'random_state': [random_state], 'warm_start': [True]}]
grid = GridSearchCV(estimator=SGDClassifier(), param_grid=param_grid, scoring=make_scorer(matthews_corrcoef),
                    n_jobs=1, pre_dispatch=1, verbose=1, return_train_score=True)
start = datetime.datetime.now()
grid.fit(X_train, y_train)
end = datetime.datetime.now()
y_SGDCpred = grid.predict(X_test)
my_eval(y_test, y_SGDCpred)
print('Runtime =', end-start)

In the end, its AUC improves considerably, at the cost of a longer training time.
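
After a grid search it is often useful to see which parameter combination won. The sketch below runs a much smaller search on synthetic data and reads the `best_params_` and `best_score_` attributes of GridSearchCV. The data set, the reduced grid, and the `cv=3` setting are assumptions chosen to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import make_scorer, matthews_corrcoef
from sklearn.model_selection import GridSearchCV

# Synthetic binary classification data, standing in for the loan features.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

search = GridSearchCV(
    estimator=SGDClassifier(loss="hinge", max_iter=1000, tol=1e-3, random_state=0),
    param_grid={"alpha": [1e-3, 1e-2, 1e-1]},
    scoring=make_scorer(matthews_corrcoef),  # same scoring metric as in the article
    cv=3,
)
search.fit(X_demo, y_demo)

# The winning parameters and their mean cross-validated score.
print(search.best_params_)
print(round(search.best_score_, 3))
```

The same two attributes are available on the `grid` object trained above.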

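4. Model Prediction

Finally, pick one user record from the test set and predict it with the three trained models. The result shows that all three models predict it correctly.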

# Predict with the three trained models
# Take one record from X_test as a simulated new input
new_input = X_test.iloc[10].values.reshape(1,-1)
print("new_input=", new_input)
new_output = y_test.iloc[10]
print("new_output=", new_output)
# Predict with the logistic regression model
print("Logistic regression prediction=", lrmodel.predict(new_input))
# Predict with the random forest model
print("Random forest prediction=", rfmodel.predict(new_input))
# Predict with the SGDClassifier model
print("SGDClassifier prediction=", grid.predict(new_input))
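
Because the prediction is meant to support a lending decision, a default probability can be more informative than a hard 0/1 label. The sketch below shows the idea with `predict_proba` on synthetic data. The data, the 0.3 cutoff, and the approve/reject rule are assumptions for illustration, not part of the original workflow.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic, imbalanced data: roughly 15% of samples in the positive class.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, weights=[0.85, 0.15], random_state=0)
model = LogisticRegression(solver="liblinear").fit(X_demo, y_demo)

threshold = 0.3  # hypothetical cutoff; a real one would follow from business costs
p_default = model.predict_proba(X_demo[:5])[:, 1]  # probability of class 1
decision = ["reject" if p >= threshold else "approve" for p in p_default]
for p, d in zip(p_default, decision):
    print(round(float(p), 3), d)
```

Lowering the cutoff rejects more applicants and catches more defaults; raising it does the opposite. Models trained with hinge loss do not provide `predict_proba`, so this applies to the logistic regression and random forest models rather than to every SGDClassifier configuration.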

