Kaggle Dataset: Loan Default Prediction

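Some time ago I picked the Loan_Default bank dataset from Kaggle as a data analysis exercise in loan default prediction. The data processing, analysis, and prediction steps are described below.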

1. Dataset

Banks earn a major share of their revenue from lending, but lending is often associated with risk: borrowers may default on their loans. To mitigate this, the banks have decided to use machine learning. They have collected past data on loan borrowers, and we would like to develop a strong ML model to classify whether a new borrower is likely to default.
The dataset is enormous and consists of multiple deterministic factors such as the borrower's income, gender, loan purpose, etc. The dataset is subject to strong multicollinearity and empty values. We are supposed to overcome these factors and build a strong classifier to predict defaulters.

# from https://www.kaggle.com/yasserh/loan-default-dataset
import pandas as pd

data = pd.read_csv("Loan_Default.csv")
data

data[data["Status"] == 1].info()

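The dataset has 34 columns, so the features need to be filtered. The "Status" column is the target variable.
The dataset has 148670 rows, of which 36639 have Status == 1 and the remaining 112031 have Status == 0. The classes are therefore imbalanced, which may make the results unreliable.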

2. Data processing

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV

from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestRegressor

from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.metrics import RocCurveDisplay

# plot_roc_curve was removed in scikit-learn 1.2; keep the old name as an
# alias for its replacement so the call in the results section still works.
plot_roc_curve = RocCurveDisplay.from_estimator

pd.set_option('display.max_columns', None)

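First drop the obviously useless columns (ID, year) and the columns with too many missing values, then drop the rows that still contain missing values. (Dropping rows with missing values directly would remove every Status == 1 row.)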

# drop the columns with too many null values in Status == 1
data = data.drop(["rate_of_interest", "Interest_rate_spread", "Upfront_charges", "property_value", "LTV", "dtir1"], axis=1)

# drop the useless columns.
data = data.drop(["ID", "year"], axis=1)

# drop the rows with null values.
data = data.dropna(axis=0, how="any")
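
The reason for dropping columns before rows can be checked by looking at the share of missing values per column within each class. The following is a minimal sketch on a tiny made-up frame; the values are illustrative and not taken from Loan_Default.csv, and `null_share_by_class` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd


def null_share_by_class(df, target="Status"):
    """Return the fraction of missing values per column, split by target class."""
    features = df.drop(columns=[target])
    return features.isna().groupby(df[target]).mean()


# Toy data: "rate_of_interest" is always missing when Status == 1.
toy = pd.DataFrame({
    "Status": [0, 0, 0, 1, 1],
    "rate_of_interest": [3.5, 4.0, np.nan, np.nan, np.nan],
    "loan_amount": [100, 200, 300, 400, 500],
})

shares = null_share_by_class(toy)
print(shares)
```

A column whose share is close to 1.0 for one class is a candidate for removal, because `dropna` on rows would otherwise wipe out that class.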

Check the distribution of the numeric columns:

# check the distribution of numeric columns.
data.hist(bins=15, figsize=(10,10))
plt.show()

Check the distribution of the non-numeric columns:

# check the distribution of non-numeric columns.
object_list = list(data.columns[data.dtypes == "object"])
fig = plt.figure(figsize=(15,25))
n = 1
for column in object_list:
    # number of rows in each category of this column
    d = pd.DataFrame(data.loc[:, [column, "Status"]]).groupby(column).count()["Status"]
    ax = fig.add_subplot(7, 3, n)
    ax.bar(height=d, x=[i for i in range(len(d))], width=0.5)
    ax.set_xticks([i for i in range(len(d))])
    ax.set_xticklabels(list(d.index))
    ax.set_title(column)
    n += 1
plt.show()

Drop the columns with extremely skewed distributions:

# drop the columns with extreme distributions, which are useless for predicting.
data = data.drop(["Security_Type", "total_units", "construction_type", "open_credit", "Secured_by", "income"], axis=1)

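Convert the non-numeric columns to one-hot encoding and drop the first dummy column of each (its value is implied by the remaining ones, so keeping it would only add a perfectly collinear column):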

# change the object columns into one-hot encoding.
dummy_list = list(data.columns[data.dtypes == "object"])
for i in dummy_list:
    # append the dummy columns (minus the first one), then drop the source column
    data = pd.concat([data, pd.get_dummies(data[i]).iloc[:, 1:]], axis=1)
    data.drop(i, axis=1, inplace=True)
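
pandas can produce a similar encoding in a single call, since `pd.get_dummies` accepts `columns` and `drop_first`. The sketch below runs on made-up data with illustrative column names. Unlike the loop above, this variant prefixes every new column with the name of its source column, so a column called `type2` above would instead get a name of the form `<column>_type2`.

```python
import pandas as pd

toy = pd.DataFrame({
    "loan_type": ["type1", "type2", "type3", "type1"],
    "Gender": ["Male", "Female", "Joint", "Male"],
    "loan_amount": [100, 200, 300, 400],
})

# Encode every non-numeric column and drop the first level of each one.
text_cols = list(toy.select_dtypes(exclude="number").columns)
encoded = pd.get_dummies(toy, columns=text_cols, drop_first=True)
print(list(encoded.columns))
```

The prefixed names avoid collisions when two source columns share a category label.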

Check the dataset information at this point, then look at the correlation coefficients between the features:

# find the correlation between columns.
corr = data.corr()
plt.figure(figsize=(14,14))
sns.heatmap(corr, annot=True, cmap='Reds',square=True, fmt=".1f")
plt.show()

Drop the columns whose correlation coefficient with another column is +1 or -1:

# drop the columns with correlations that are too high.
data = data.drop(["type2"], axis=1)
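
Reading such pairs off a large heatmap is error-prone, so they can also be listed programmatically. This is a minimal sketch on toy data; `perfectly_correlated_pairs` and the tolerance `tol` are assumptions introduced for illustration.

```python
import pandas as pd


def perfectly_correlated_pairs(df, tol=1e-9):
    """List column pairs whose absolute Pearson correlation is 1 (within tol)."""
    corr = df.corr().abs()
    cols = list(corr.columns)
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            if corr.iloc[i, j] >= 1 - tol:
                pairs.append((cols[i], cols[j]))
    return pairs


toy = pd.DataFrame({
    "a": [0, 1, 0, 1, 1],
    "b": [1, 0, 1, 0, 0],  # exact complement of "a", so the correlation is -1
    "c": [3, 1, 4, 1, 5],
})

pairs = perfectly_correlated_pairs(toy)
print(pairs)
```

For each reported pair, one of the two columns can be dropped without losing information.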

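Common ways to deal with imbalanced data:

Collect more data, then select a balanced subset.

Resample: reduce the majority class and increase the minority class (sampling with repetition is allowed).

Generate minority-class samples synthetically.

Split the majority class into several smaller subclasses so that all classes end up with a similar size.

Here I mainly used resampling: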

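# Resample the data to balance the classes: sample twice as many Status == 0 rows as there are Status == 1 rows, and include the Status == 1 rows twice.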
status_1 = data[data["Status"] == 1]
status_0 = data[data["Status"] == 0].sample(len(status_1) * 2)
data = pd.concat([status_0, status_1, status_1], axis=0)
data = data.sample(frac=1)
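
One caveat with duplicating rows before the split: copies of the same Status == 1 row can land in both the training set and the test set, which tends to inflate test scores. A common alternative, sketched here as an assumption rather than as what was done above, is to split first and resample only the training part. `balance_training_set` is a hypothetical helper and the toy frame is made up.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def balance_training_set(train, target="Status", random_state=0):
    """Oversample the minority class (with replacement) up to the majority size."""
    counts = train[target].value_counts()
    majority, minority = counts.idxmax(), counts.idxmin()
    n_extra = int(counts[majority] - counts[minority])
    extra = train[train[target] == minority].sample(
        n=n_extra, replace=True, random_state=random_state
    )
    # Shuffle so that the duplicated rows are not grouped at the end.
    return pd.concat([train, extra]).sample(frac=1, random_state=random_state)


toy = pd.DataFrame({"x": range(100), "Status": [1] * 20 + [0] * 80})
train, test = train_test_split(
    toy, test_size=0.2, stratify=toy["Status"], random_state=0
)
train_balanced = balance_training_set(train)
print(train_balanced["Status"].value_counts())
```

The test set keeps the original class ratio, so its scores reflect how the model would behave on unseen, imbalanced data.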

Split the dataset:

# split the dataset into train set and test set.
X = data.drop("Status", axis=1)
y = data["Status"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15)

Normalization (not needed when a tree-based method is used for training):

# scaler: fit on the training set only, then apply to both sets
scaler = MinMaxScaler()
scaler.fit(X_train)

X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

Dimensionality reduction with PCA:

# PCA principal component analysis
pca = PCA(n_components=10)
pca.fit(X_train)

X_train = pca.transform(X_train)
X_test = pca.transform(X_test)
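
A choice such as `n_components=10` can be sanity-checked against the cumulative explained variance ratio. The sketch below uses a synthetic array with deliberately redundant columns, and the 0.95 threshold is an arbitrary illustrative value, not one used in this analysis.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 15))
# Make the last 10 columns linear combinations of the first 5.
X_demo[:, 5:] = X_demo[:, :5] @ rng.normal(size=(5, 10))

X_scaled = MinMaxScaler().fit_transform(X_demo)
pca_full = PCA().fit(X_scaled)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches the threshold.
n_needed = int(np.searchsorted(cumulative, 0.95) + 1)
print(n_needed, cumulative.round(3))
```

scikit-learn can also pick this number itself when `n_components` is given as a fraction between 0 and 1, for example `PCA(n_components=0.95)`.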

3. Model selection

Decision tree:

# Decision tree
DT = DecisionTreeClassifier()
DT.fit(X_train, y_train)
predict_test = DT.predict(X_test)

print("score for the training set :", DT.score(X_train, y_train))
print("score for the test set :", DT.score(X_test, y_test))
print(classification_report(y_test, predict_test))

labels = ["Status 0", "Status 1"]
M = confusion_matrix(y_test, predict_test)
disp = ConfusionMatrixDisplay(confusion_matrix=M, display_labels=labels)
disp.plot(cmap=plt.cm.YlGn)
plt.show()

Logistic regression:

# Logistic regression
lr = LogisticRegression()
lr.fit(X_train, y_train)
predict_test = lr.predict(X_test)

print("score for the training set :", lr.score(X_train, y_train))
print("score for the test set :", lr.score(X_test, y_test))
print(classification_report(y_test, predict_test))

labels = ["Status 0", "Status 1"]
M = confusion_matrix(y_test, predict_test)
disp = ConfusionMatrixDisplay(confusion_matrix=M, display_labels=labels)
disp.plot(cmap=plt.cm.YlGn)
plt.show()

Multilayer perceptron (neural network):

# Multilayer perceptron with two hidden layers of 20 units each
model = MLPClassifier(hidden_layer_sizes=(20,20), learning_rate_init=0.1)
model.fit(X_train, y_train)
predict_test = model.predict(X_test)

print("score for the training set :", model.score(X_train, y_train))
print("score for the test set :", model.score(X_test, y_test))
print(classification_report(y_test, predict_test))

labels = ["Status 0", "Status 1"]
M = confusion_matrix(y_test, predict_test)
disp = ConfusionMatrixDisplay(confusion_matrix=M, display_labels=labels)
disp.plot(cmap=plt.cm.YlGn)
plt.show()
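
The three blocks above repeat the same fit-and-score steps, so a side-by-side comparison can be collected in a loop. This is a minimal sketch on synthetic data from `make_classification` standing in for the processed loan data; the `models` dictionary, `results` layout, and `random_state` values are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the processed loan data.
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.15, random_state=0
)

models = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "logistic regression": LogisticRegression(max_iter=1000),
}

results = {}
for name, clf in models.items():
    clf.fit(X_tr, y_tr)
    results[name] = {
        "train_acc": clf.score(X_tr, y_tr),
        "test_acc": clf.score(X_te, y_te),
        "test_f1": f1_score(y_te, clf.predict(X_te)),
    }

for name, row in results.items():
    print(name, {k: round(v, 3) for k, v in row.items()})
```

A large gap between `train_acc` and `test_acc` for a model is a hint that it is overfitting.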

After comparing the results, I chose the decision tree model and tuned its hyperparameters:

4. Model optimization

# grid search over the maximum tree depth with 3-fold cross-validation
params = {'max_depth': list(range(40, 180, 10))}
grid_search_cv = GridSearchCV(DecisionTreeClassifier(), params, verbose=1, cv=3, n_jobs=-1)
grid_search_cv.fit(X_train, y_train)

DT = grid_search_cv.best_estimator_
predict_test = DT.predict(X_test)
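
Whether a `max_depth` grid of 40 to 170 is meaningful depends on how deep an unconstrained tree actually grows on this data, which was not reported above. The sketch below, on synthetic data, first measures that depth with `get_depth()` and then searches only up to it; the grid bounds here are an assumption for illustration, not the settings used in this analysis.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=0)

# Depth reached by a tree with no depth limit on the full data. Any max_depth
# at or above this value produces the same tree on this data.
full_depth = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo).get_depth()

params = {"max_depth": list(range(1, full_depth + 1))}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), params, cv=3)
search.fit(X_demo, y_demo)

print(full_depth, search.best_params_, round(search.best_score_, 3))
```

If the best depth sits at the edge of the grid, the grid should be widened; if every candidate scores the same, the limit is not binding.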

5. Visualizing the results

print("score for the training set :", DT.score(X_train, y_train))
print("score for the test set :", DT.score(X_test, y_test))
print(classification_report(y_test, predict_test))

labels = ["Status 0", "Status 1"]
M = confusion_matrix(y_test, predict_test)
disp = ConfusionMatrixDisplay(confusion_matrix=M, display_labels=labels)
disp.plot(cmap=plt.cm.YlGn)
plot_roc_curve(DT, X_test, y_test)
plt.show()

6. Summary

The accuracy on the test set is 0.86556, which is the highest among these models.

The F1 score is 0.87, which indicates a reasonable balance between precision and recall.

AUC = 0.87, which indicates that the model separates the two classes well.

Moreover, reducing risk matters most for banks, so the recall for Status = 1 should be as high as possible, and this model achieves that.
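
For reference, the recall for Status = 1 can be read directly from the confusion matrix. The labels below are made up for illustration and are not the predictions of the model above.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 0, 0, 0, 0, 1, 0])

# Rows are true classes, columns are predicted classes (labels ordered 0, 1).
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

recall_1 = tp / (tp + fn)      # share of actual defaulters that were caught
precision_1 = tp / (tp + fp)   # share of flagged borrowers who really defaulted
print(recall_1, precision_1)
```

A missed defaulter (false negative) lowers recall, while a wrongly flagged borrower (false positive) lowers precision, which is why a bank focused on risk would watch recall first.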
