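1. Competition Task and Data

1.1 Task Background

The task is set in the context of personal credit in financial risk control. Participants must predict, from a loan applicant's data, whether the applicant is likely to default, and use that prediction to decide whether to approve the loan. This is a typical classification problem.

1.2 Task Data

The fields in the dataset have the following meanings: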

2. Exploratory Data Analysis and Preprocessing

2.1 Exploratory Data Analysis

First, import the modules that will be used:

import re
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import OrdinalEncoder

Then read the training dataset and check the data type and missing values of each feature:

df = pd.read_csv('train.csv')
df.info()

Running the code gives the following output:

Data columns (total 47 columns):
 #   Column              Non-Null Count   Dtype
---  ------              --------------   -----
 0   id                  800000 non-null  int64
 1   loanAmnt            800000 non-null  float64
 2   term                800000 non-null  int64
 3   interestRate        800000 non-null  float64
 4   installment         800000 non-null  float64
 5   grade               800000 non-null  object
 6   subGrade            800000 non-null  object
 7   employmentTitle     799999 non-null  float64
 8   employmentLength    753201 non-null  object
 9   homeOwnership       800000 non-null  int64
 10  annualIncome        800000 non-null  float64
 11  verificationStatus  800000 non-null  int64
 12  issueDate           800000 non-null  object
 13  isDefault           800000 non-null  int64
 14  purpose             800000 non-null  int64
 15  postCode            799999 non-null  float64
 16  regionCode          800000 non-null  int64
 17  dti                 799761 non-null  float64
 18  delinquency_2years  800000 non-null  float64
 19  ficoRangeLow        800000 non-null  float64
 20  ficoRangeHigh       800000 non-null  float64
 21  openAcc             800000 non-null  float64
 22  pubRec              800000 non-null  float64
 23  pubRecBankruptcies  799595 non-null  float64
 24  revolBal            800000 non-null  float64
 25  revolUtil           799469 non-null  float64
 26  totalAcc            800000 non-null  float64
 27  initialListStatus   800000 non-null  int64
 28  applicationType     800000 non-null  int64
 29  earliesCreditLine   800000 non-null  object
 30  title               799999 non-null  float64
 31  policyCode          800000 non-null  float64
 32  n0                  759730 non-null  float64
 33  n1                  759730 non-null  float64
 34  n2                  759730 non-null  float64
 35  n3                  759730 non-null  float64
 36  n4                  766761 non-null  float64
 37  n5                  759730 non-null  float64
 38  n6                  759730 non-null  float64
 39  n7                  759730 non-null  float64
 40  n8                  759729 non-null  float64
 41  n9                  759730 non-null  float64
 42  n10                 766761 non-null  float64
 43  n11                 730248 non-null  float64
 44  n12                 759730 non-null  float64
 45  n13                 759730 non-null  float64
 46  n14                 759730 non-null  float64
dtypes: float64(33), int64(9), object(5)
memory usage: 286.9+ MB

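The output shows that features such as employmentLength and n0 to n14 have a considerable number of missing values, and that features such as grade and subGrade are non-numeric. Both issues need to be handled in the later steps.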
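
To put a number on "a considerable number of missing values", the missing rate of a column can be derived from the non-null counts printed above. The sketch below hard-codes a few counts copied from that output; the names non_null_counts and missing_rate are illustrative and not part of the original code.

```python
# Non-null counts copied from the df.info() output above (800000 rows in total).
TOTAL_ROWS = 800000
non_null_counts = {
    "employmentLength": 753201,
    "n0": 759730,
    "n4": 766761,
    "n11": 730248,
    "dti": 799761,
}


def missing_rate(non_null, total=TOTAL_ROWS):
    """Return the fraction of rows whose value is missing."""
    return (total - non_null) / total


# Compute the missing rate per column and print it from most to least missing.
rates = {col: missing_rate(n) for col, n in non_null_counts.items()}
for col, rate in sorted(rates.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{col:<18}{rate:.2%}")
```

On the full DataFrame, df.isnull().mean() gives the same per-column rate directly.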

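The first 5 rows of the dataset are shown below:

2.2 Data Preprocessing

First, convert the grade and subGrade fields to numeric types. These are credit grades, and the grades have a natural order, so ordinal encoding is used to turn the two fields into numbers: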

# Ordinal-encode grade and subGrade
oe = OrdinalEncoder()
for i in ['grade', 'subGrade']:
    # fit_transform expects a 2-D input and returns a 2-D array; flatten it into a new column
    df[i + 'new'] = oe.fit_transform(df[[i]]).ravel()
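
With its default settings, OrdinalEncoder sorts the distinct values of each column and assigns them the codes 0, 1, 2, and so on. For grade strings, alphabetical order coincides with the credit ranking, which is why the default works here. The sketch below reproduces that mapping in plain Python; ordinal_codes is a hypothetical helper, and the sample values assume grades of the form 'A' to 'G' and sub-grades of the form 'A1' to 'G5', which is an assumption about the raw data.

```python
def ordinal_codes(values):
    """Map each distinct value to its rank in sorted order (0.0, 1.0, 2.0, ...)."""
    categories = sorted(set(values))
    mapping = {cat: float(code) for code, cat in enumerate(categories)}
    return [mapping[v] for v in values], mapping


# Illustrative samples only; these are not rows from the dataset.
grades = ["C", "A", "B", "A", "E"]
sub_grades = ["C3", "A1", "B5", "A2", "E1"]

grade_codes, grade_map = ordinal_codes(grades)
sub_codes, sub_map = ordinal_codes(sub_grades)

print(grade_map)     # {'A': 0.0, 'B': 1.0, 'C': 2.0, 'E': 3.0}
print(grade_codes)   # [2.0, 0.0, 1.0, 0.0, 3.0]
print(sub_codes)     # [3.0, 0.0, 2.0, 1.0, 4.0]
```

Because the codes are ranks among the values actually present, a grade that is absent from the data shifts the codes of the grades after it ('E' maps to 3.0 above because 'D' is missing). Passing an explicit categories list to OrdinalEncoder avoids this.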

Next, process the employment length field by extracting the number it contains:

# Extract the numeric value from employmentLength
df['employmentLength'] = df['employmentLength'].fillna('0 years')
df['employmentLength_new'] = df['employmentLength'].apply(lambda x: float(re.findall(r"\d+\.?\d*", x)[0]))
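
The regular expression keeps only the first number in each string, so it is worth checking what that does to edge cases. The sketch below applies the same pattern to a few hypothetical employmentLength strings; the sample values, including '< 1 year' and '10+ years', are assumptions about how the raw field is formatted, and parse_years is an illustrative helper.

```python
import re

NUMBER = re.compile(r"\d+\.?\d*")


def parse_years(text):
    """Return the first number in text as a float; missing values become 0.0."""
    if text is None:
        text = "0 years"  # same fill value as the preprocessing step above
    return float(NUMBER.findall(text)[0])


samples = ["2 years", "10+ years", "< 1 year", None]
parsed = [parse_years(s) for s in samples]
print(parsed)  # [2.0, 10.0, 1.0, 0.0]
```

Two consequences are visible here: a value like '< 1 year' ends up with the same code as '1 year', and a missing value is encoded as 0, so 0 means "unknown" rather than "no work experience".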

Next, we split the data by label (isDefault = 0 and isDefault = 1) and compare how the feature values are distributed in the two groups:

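# Select the numeric features
df_num = df.select_dtypes(include='number')
col_num = df_num.columns.tolist()
col_num.remove('policyCode')    # drop the policyCode feature

# Split the data by label and fill missing values with 0
df_0 = df[df['isDefault'] == 0].fillna(0)
df_1 = df[df['isDefault'] == 1].fillna(0)

# Plot the distribution of each feature in df_0 and df_1 side by side
for k, j in enumerate(col_num, start=1):
    fig, axes = plt.subplots(1, 2)    # one figure with 1 row and 2 columns
    # histplot with kde=True replaces the deprecated sns.distplot
    sns.histplot(df_0[j], kde=True, stat='density', ax=axes[0])
    sns.histplot(df_1[j], kde=True, stat='density', ax=axes[1])
    fig.tight_layout()                # adjust the spacing between subplots
    fig.savefig(str(k) + '_' + str(j) + '.png', transparent=True)    # save the figure
    plt.close(fig)                    # close the figure so that open figures do not accumulate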

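The distribution plots of some of the features are shown below:

The feature distributions of the two classes do not differ much, so the achievable accuracy of a machine learning model built on these features is probably limited. Comparing the class sizes also shows that the samples are imbalanced: the defaulted samples (isDefault = 1) make up the smaller share.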

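3. Building the Random Forest Model and Tuning Its Parameters

3.1 Building the Random Forest Classifier

First, process the field that records the month in which the borrower's earliest reported credit line was opened: compute the number of years from that date to the present and store it as a new feature. Then select from df the features that will be used: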

# Process earliesCreditLine (the month the borrower's earliest reported credit line was opened):
# extract the year and compute the number of years up to the current year
import datetime
df['earliesCreditLine_new'] = df['earliesCreditLine'].apply(lambda x: float(re.findall(r"\d+\.?\d*", x)[0]))
df['当前时间'] = datetime.datetime.now().year                        # 当前时间: the current year
df['信用开启年数'] = df['当前时间'] - df['earliesCreditLine_new']    # 信用开启年数: years since the credit line was opened

# Keep only the features that are actually used
df_2 = df.drop(columns=['grade', 'subGrade', 'employmentLength', 'issueDate', 'earliesCreditLine', '当前时间', 'earliesCreditLine_new'])
df_2 = df_2.fillna(0)
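
The same regular expression extracts the year from earliesCreditLine, and the new feature is the difference between a reference year and that year. Here is a minimal sketch, assuming values formatted like 'Aug-2001' (an assumption about the raw data) and using a fixed reference year so that the result is reproducible; years_since and REFERENCE_YEAR are illustrative names.

```python
import re

REFERENCE_YEAR = 2020  # hypothetical fixed reference year, chosen only for this example


def years_since(credit_line, reference_year=REFERENCE_YEAR):
    """Return the number of years between the year in credit_line and reference_year."""
    year = float(re.findall(r"\d+\.?\d*", credit_line)[0])
    return reference_year - year


samples = ["Aug-2001", "May-1992", "Dec-2015"]
credit_years = [years_since(s) for s in samples]
print(credit_years)  # [19.0, 28.0, 5.0]
```

Using datetime.datetime.now().year, as the code above does, makes the feature value depend on when the script is run. The offset is the same constant for every row, so the ordering of the samples by this feature is unaffected.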

Then build the random forest model:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics

# Separate the features from the label column
x = df_2.drop(columns=['isDefault'])
y = df_2['isDefault']

# Split the data into a training set and a test set
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=1)

# Build a random forest with the default parameters and train it
clf = RandomForestClassifier()
clf.fit(x_train, y_train)

# Predict the test set with the trained model (predict returns hard 0/1 labels)
y_pred = clf.predict(x_test)

# Evaluate the predictions: print the AUC
fpr, tpr, thresholds = metrics.roc_curve(y_test, y_pred, pos_label=1)
auc_val = metrics.auc(fpr, tpr)
print(auc_val)
# Output: 0.5292283033891222

print(metrics.accuracy_score(y_test, y_pred))
# Out[5]: 0.8036166666666666

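The printed AUC is 0.529, which is fairly low, so the model needs further optimization. Note that this AUC is computed from the hard 0/1 output of predict, not from predicted probabilities.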
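
When roc_curve is given 0/1 labels instead of scores, the ROC curve has a single interior point and the resulting AUC equals (TPR + TNR) / 2. AUC is normally computed from a continuous score such as clf.predict_proba(x_test)[:, 1]. The sketch below illustrates the difference on a tiny hand-made example in plain Python; pairwise_auc is a hypothetical helper, and the labels and probabilities are invented for illustration, not results from this dataset.

```python
def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Invented example: the model ranks positives fairly well,
# but no probability reaches the 0.5 cut-off used for hard labels.
y_true = [0, 0, 0, 0, 1, 1]
proba = [0.10, 0.20, 0.30, 0.45, 0.40, 0.35]
hard = [1 if p >= 0.5 else 0 for p in proba]

auc_from_proba = pairwise_auc(y_true, proba)
auc_from_labels = pairwise_auc(y_true, hard)
print(auc_from_proba, auc_from_labels)  # 0.75 0.5
```

With scikit-learn, the corresponding call on real predictions would be metrics.roc_auc_score(y_test, clf.predict_proba(x_test)[:, 1]).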

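3.2 Tuning the Model Parameters

Grid search is used here to tune the model parameters: it enumerates every combination of the candidate parameter values and selects the best one. Because grid search is relatively inefficient, only two parameters of the random forest are tuned: the number of trees (n_estimators) and the number of features considered at each split (max_features). During the search, n_jobs is set so that the candidate fits run in parallel, which speeds up execution; the comments in the code below explain how to set it.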

#######################################################
## Tune the parameters of the random forest model
#######################################################
from sklearn.model_selection import GridSearchCV

clf = RandomForestClassifier()
# Only the number of trees and the number of features considered at each split are tuned here
parameters = {'n_estimators': [50, 100], 'max_features': [5, 10]}

# Check the number of CPU cores
from multiprocessing import cpu_count
print(cpu_count())

'''
n_jobs is an integer, specifying the maximum number of concurrently running workers.
If 1 is given, no joblib parallelism is used at all, which is useful for debugging.
If set to -1, all CPUs are used. For n_jobs below -1, (n_cpus + 1 + n_jobs) are used.
For example with n_jobs=-2, all CPUs but one are used.
'''
grid_search = GridSearchCV(clf, parameters, cv=5, scoring='roc_auc', n_jobs=-5)

print('start+++++++++++++++++++++')
grid_search.fit(x_train, y_train)
print('end  +++++++++++++++++++++')
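
To get a feel for the cost of this search, the parameter grid can be enumerated by hand. The sketch below lists the combinations with itertools.product, counts the model fits, and applies the n_jobs rule quoted above; the CPU count of 8 is a made-up value, and workers_for is a hypothetical helper.

```python
from itertools import product

parameters = {"n_estimators": [50, 100], "max_features": [5, 10]}
cv_folds = 5

# Enumerate every combination of the candidate parameter values.
names = list(parameters)
combos = [dict(zip(names, values)) for values in product(*parameters.values())]
for combo in combos:
    print(combo)

# Each combination is fitted once per fold; with the default refit=True,
# GridSearchCV fits the best combination once more on the whole training set.
cv_fits = len(combos) * cv_folds
total_fits = cv_fits + 1


def workers_for(n_jobs, n_cpus):
    """Number of workers implied by n_jobs, following the rule quoted above."""
    if n_jobs < 0:
        return n_cpus + 1 + n_jobs
    return n_jobs


# Assuming a machine with 8 CPU cores (hypothetical).
print(cv_fits, total_fits, workers_for(-5, n_cpus=8))  # 20 21 4
```

Adding a third candidate value to either parameter would raise the number of cross-validation fits from 20 to 30, which is why the grid is kept small.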

Once the search has finished, the best parameter combination and its score can be inspected:

print(grid_search.best_params_)
# Out[3]: {'max_features': 5, 'n_estimators': 100}

print(grid_search.best_score_)
# Out[4]: 0.6806999331977975

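4. Summary

The above covers preprocessing the credit default data and building a classification model. Two things can still be improved:
1. The classification accuracy achievable with the current features is limited. Additional features could be constructed automatically from the existing data, or a credit scorecard could be built.
2. Grid search is an inefficient tuning method; Bayesian hyperparameter optimization could be used instead.

References: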
https://zhuanlan.zhihu.com/p/139510947
https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.auc.html
