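I. Project Introduction

Background

This project is set in the context of personal credit in financial risk control: given a loan applicant's data, predict whether the applicant is likely to default, and use that prediction to decide whether to approve the loan. This is a typical classification problem.

Meaning of each column

id: unique identifier assigned to the loan listing
loanAmnt: loan amount
term: loan term (years)
interestRate: loan interest rate
installment: installment payment amount
grade: loan grade
subGrade: loan sub-grade
employmentTitle: employment title
employmentLength: employment length (years)
homeOwnership: home ownership status provided by the borrower at registration
annualIncome: annual income
verificationStatus: verification status
issueDate: month in which the loan was issued
purpose: loan purpose category stated by the borrower in the loan application
postCode: first 3 digits of the postal code provided by the borrower in the loan application
regionCode: region code
dti: debt-to-income ratio
delinquency_2years: number of delinquency events of 30+ days past due in the borrower's credit file over the past 2 years
ficoRangeLow: lower bound of the borrower's FICO range at loan issuance
ficoRangeHigh: upper bound of the borrower's FICO range at loan issuance
openAcc: number of open credit lines in the borrower's credit file
pubRec: number of derogatory public records
pubRecBankruptcies: number of public-record bankruptcies
revolBal: total revolving credit balance
revolUtil: revolving line utilization rate, i.e. the amount of credit the borrower is using relative to all available revolving credit
totalAcc: total number of credit lines currently in the borrower's credit file
initialListStatus: initial listing status of the loan
applicationType: whether the loan is an individual application or a joint application with two co-borrowers
earliesCreditLine: month in which the borrower's earliest reported credit line was opened
title: loan title provided by the borrower
policyCode: publicly available policy code = 1; new products not publicly available, policy code = 2
n-series anonymous features: anonymous features n0-n14, derived from counts of borrower behavior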

II. Data Preparation

Importing the libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import datetime
from sklearn.model_selection import cross_val_score,train_test_split,GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
import warnings
warnings.filterwarnings('ignore')
# Remove the pandas limit on the number of displayed columns
pd.options.display.max_columns = None

Loading the data

train = pd.read_csv('../data/贷款违约预测/train.csv')

III. Data Analysis

3.1 Overview of the data

train.shape
(800000, 47)
train.columns
Index(['id', 'loanAmnt', 'term', 'interestRate', 'installment', 'grade','subGrade', 'employmentTitle', 'employmentLength', 'homeOwnership','annualIncome', 'verificationStatus', 'issueDate', 'isDefault','purpose', 'postCode', 'regionCode', 'dti', 'delinquency_2years','ficoRangeLow', 'ficoRangeHigh', 'openAcc', 'pubRec','pubRecBankruptcies', 'revolBal', 'revolUtil', 'totalAcc','initialListStatus', 'applicationType', 'earliesCreditLine', 'title','policyCode', 'n0', 'n1', 'n2', 'n3', 'n4', 'n5', 'n6', 'n7', 'n8','n9', 'n10', 'n11', 'n12', 'n13', 'n14'],dtype='object')
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 800000 entries, 0 to 799999
Data columns (total 47 columns):
 #   Column              Non-Null Count   Dtype
---  ------              --------------   -----
 0   id                  800000 non-null  int64
 1   loanAmnt            800000 non-null  float64
 2   term                800000 non-null  int64
 3   interestRate        800000 non-null  float64
 4   installment         800000 non-null  float64
 5   grade               800000 non-null  object
 6   subGrade            800000 non-null  object
 7   employmentTitle     799999 non-null  float64
 8   employmentLength    753201 non-null  object
 9   homeOwnership       800000 non-null  int64
 10  annualIncome        800000 non-null  float64
 11  verificationStatus  800000 non-null  int64
 12  issueDate           800000 non-null  object
 13  isDefault           800000 non-null  int64
 14  purpose             800000 non-null  int64
 15  postCode            799999 non-null  float64
 16  regionCode          800000 non-null  int64
 17  dti                 799761 non-null  float64
 18  delinquency_2years  800000 non-null  float64
 19  ficoRangeLow        800000 non-null  float64
 20  ficoRangeHigh       800000 non-null  float64
 21  openAcc             800000 non-null  float64
 22  pubRec              800000 non-null  float64
 23  pubRecBankruptcies  799595 non-null  float64
 24  revolBal            800000 non-null  float64
 25  revolUtil           799469 non-null  float64
 26  totalAcc            800000 non-null  float64
 27  initialListStatus   800000 non-null  int64
 28  applicationType     800000 non-null  int64
 29  earliesCreditLine   800000 non-null  object
 30  title               799999 non-null  float64
 31  policyCode          800000 non-null  float64
 32  n0                  759730 non-null  float64
 33  n1                  759730 non-null  float64
 34  n2                  759730 non-null  float64
 35  n3                  759730 non-null  float64
 36  n4                  766761 non-null  float64
 37  n5                  759730 non-null  float64
 38  n6                  759730 non-null  float64
 39  n7                  759730 non-null  float64
 40  n8                  759729 non-null  float64
 41  n9                  759730 non-null  float64
 42  n10                 766761 non-null  float64
 43  n11                 730248 non-null  float64
 44  n12                 759730 non-null  float64
 45  n13                 759730 non-null  float64
 46  n14                 759730 non-null  float64
dtypes: float64(33), int64(9), object(5)
memory usage: 286.9+ MB
train.describe()
id loanAmnt term interestRate installment employmentTitle homeOwnership annualIncome verificationStatus isDefault purpose postCode regionCode dti delinquency_2years ficoRangeLow ficoRangeHigh openAcc pubRec pubRecBankruptcies revolBal revolUtil totalAcc initialListStatus applicationType title policyCode n0 n1 n2 n3 n4 n5 n6 n7 n8 n9 n10 n11 n12 n13 n14
count 800000.000000 800000.000000 800000.000000 800000.000000 800000.000000 799999.000000 800000.000000 8.000000e+05 800000.000000 800000.000000 800000.000000 799999.000000 800000.000000 799761.000000 800000.000000 800000.000000 800000.000000 800000.000000 800000.000000 799595.000000 8.000000e+05 799469.000000 800000.000000 800000.000000 800000.000000 799999.000000 800000.0 759730.000000 759730.000000 759730.000000 759730.000000 766761.000000 759730.000000 759730.000000 759730.000000 759729.000000 759730.000000 766761.000000 730248.000000 759730.000000 759730.000000 759730.000000
mean 399999.500000 14416.818875 3.482745 13.238391 437.947723 72005.351714 0.614213 7.613391e+04 1.009683 0.199513 1.745982 258.535648 16.385758 18.284557 0.318239 696.204081 700.204226 11.598020 0.214915 0.134163 1.622871e+04 51.790734 24.998861 0.416953 0.019267 1754.113589 1.0 0.511932 3.642330 5.642648 5.642648 4.735641 8.107937 8.575994 8.282953 14.622488 5.592345 11.643896 0.000815 0.003384 0.089366 2.178606
std 230940.252015 8716.086178 0.855832 4.765757 261.460393 106585.640204 0.675749 6.894751e+04 0.782716 0.399634 2.367453 200.037446 11.036679 11.150155 0.880325 31.865995 31.866674 5.475286 0.606467 0.377471 2.245802e+04 24.516126 11.999201 0.493055 0.137464 7941.474040 0.0 1.333266 2.246825 3.302810 3.302810 2.949969 4.799210 7.400536 4.561689 8.124610 3.216184 5.484104 0.030075 0.062041 0.509069 1.844377
min 0.000000 500.000000 3.000000 5.310000 15.690000 0.000000 0.000000 0.000000e+00 0.000000 0.000000 0.000000 0.000000 0.000000 -1.000000 0.000000 630.000000 634.000000 0.000000 0.000000 0.000000 0.000000e+00 0.000000 2.000000 0.000000 0.000000 0.000000 1.0 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 199999.750000 8000.000000 3.000000 9.750000 248.450000 427.000000 0.000000 4.560000e+04 0.000000 0.000000 0.000000 103.000000 8.000000 11.790000 0.000000 670.000000 674.000000 8.000000 0.000000 0.000000 5.944000e+03 33.400000 16.000000 0.000000 0.000000 0.000000 1.0 0.000000 2.000000 3.000000 3.000000 3.000000 5.000000 4.000000 5.000000 9.000000 3.000000 8.000000 0.000000 0.000000 0.000000 1.000000
50% 399999.500000 12000.000000 3.000000 12.740000 375.135000 7755.000000 1.000000 6.500000e+04 1.000000 0.000000 0.000000 203.000000 14.000000 17.610000 0.000000 690.000000 694.000000 11.000000 0.000000 0.000000 1.113200e+04 52.100000 23.000000 0.000000 0.000000 1.000000 1.0 0.000000 3.000000 5.000000 5.000000 4.000000 7.000000 7.000000 7.000000 13.000000 5.000000 11.000000 0.000000 0.000000 0.000000 2.000000
75% 599999.250000 20000.000000 3.000000 15.990000 580.710000 117663.500000 1.000000 9.000000e+04 2.000000 0.000000 4.000000 395.000000 22.000000 24.060000 0.000000 710.000000 714.000000 14.000000 0.000000 0.000000 1.973400e+04 70.700000 32.000000 1.000000 0.000000 5.000000 1.0 0.000000 5.000000 7.000000 7.000000 6.000000 11.000000 11.000000 10.000000 19.000000 7.000000 14.000000 0.000000 0.000000 0.000000 3.000000
max 799999.000000 40000.000000 5.000000 30.990000 1715.420000 378351.000000 5.000000 1.099920e+07 2.000000 1.000000 13.000000 940.000000 50.000000 999.000000 39.000000 845.000000 850.000000 86.000000 86.000000 12.000000 2.904836e+06 892.300000 162.000000 1.000000 1.000000 61680.000000 1.0 51.000000 33.000000 63.000000 63.000000 49.000000 70.000000 132.000000 79.000000 128.000000 45.000000 82.000000 4.000000 4.000000 39.000000 30.000000
# Count the features that contain missing values
train.isnull().any().sum()
22
# Count the missing values per feature and plot them
missing = train.isnull().sum()
missing = missing[missing > 0]
missing.sort_values(inplace=True)
missing.plot.bar();

[Figure: bar chart of missing-value counts per feature (output_13_0.png)]
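Raw counts are hard to judge without the row total, so it can help to look at the share of missing values per column instead. The following is a minimal sketch on toy data; `missing_ratio` and the `toy` frame are illustrative names, not part of the original notebook.

```python
import numpy as np
import pandas as pd


def missing_ratio(df: pd.DataFrame) -> pd.Series:
    """Return the fraction of missing values per column, for columns that have any."""
    ratio = df.isnull().mean()  # the mean of a boolean mask is the missing share
    return ratio[ratio > 0].sort_values(ascending=False)


# Toy frame standing in for `train` (assumption: any DataFrame works the same way)
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [1, 2, 3, 4],
    "c": [np.nan, np.nan, 1.0, 2.0],
})
ratios = missing_ratio(toy)
print(ratios)
```

Applied to the real data, `missing_ratio(train)` would list the same 22 features, ordered by how much of each column is missing.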

# Find features in the training set that take only a single value
fea = [col for col in train.columns if train[col].nunique() <=1]
fea
['policyCode']
# Check which features are numeric and which are object-typed
numerical_fea = list(train.select_dtypes(exclude=['object']).columns)
category_fea = list(filter(lambda x: x not in numerical_fea, list(train.columns)))
print('There are {} numeric features: {}'.format(len(numerical_fea), numerical_fea))
print()
print('There are {} object-type features: {}'.format(len(category_fea), category_fea))
There are 42 numeric features: ['id', 'loanAmnt', 'term', 'interestRate', 'installment', 'employmentTitle', 'homeOwnership', 'annualIncome', 'verificationStatus', 'isDefault', 'purpose', 'postCode', 'regionCode', 'dti', 'delinquency_2years', 'ficoRangeLow', 'ficoRangeHigh', 'openAcc', 'pubRec', 'pubRecBankruptcies', 'revolBal', 'revolUtil', 'totalAcc', 'initialListStatus', 'applicationType', 'title', 'policyCode', 'n0', 'n1', 'n2', 'n3', 'n4', 'n5', 'n6', 'n7', 'n8', 'n9', 'n10', 'n11', 'n12', 'n13', 'n14']

There are 5 object-type features: ['grade', 'subGrade', 'employmentLength', 'issueDate', 'earliesCreditLine']
# Split the numeric features into continuous and discrete ones
# (a feature with at most 10 distinct values is treated as discrete)
numerical_noserial_fea = []
numerical_serial_fea = []
for fea in numerical_fea:
    temp = train[fea].nunique()
    if temp <= 10:
        numerical_noserial_fea.append(fea)
        continue
    numerical_serial_fea.append(fea)
print('Continuous numeric features:', numerical_serial_fea)
print()
print('Discrete numeric features:', numerical_noserial_fea)
Continuous numeric features: ['id', 'loanAmnt', 'interestRate', 'installment', 'employmentTitle', 'annualIncome', 'purpose', 'postCode', 'regionCode', 'dti', 'delinquency_2years', 'ficoRangeLow', 'ficoRangeHigh', 'openAcc', 'pubRec', 'pubRecBankruptcies', 'revolBal', 'revolUtil', 'totalAcc', 'title', 'n0', 'n1', 'n2', 'n3', 'n4', 'n5', 'n6', 'n7', 'n8', 'n9', 'n10', 'n13', 'n14']

Discrete numeric features: ['term', 'homeOwnership', 'verificationStatus', 'isDefault', 'initialListStatus', 'applicationType', 'policyCode', 'n11', 'n12']

3.2 Analysis of discrete numeric variables

for fea in numerical_noserial_fea:
    print('Discrete variable:', fea)
    print(train[fea].value_counts())
    print()
    print()
Discrete variable: term
3    606902
5    193098
Name: term, dtype: int64


Discrete variable: homeOwnership
0    395732
1    317660
2     86309
3       185
5        81
4        33
Name: homeOwnership, dtype: int64


Discrete variable: verificationStatus
1    309810
2    248968
0    241222
Name: verificationStatus, dtype: int64


Discrete variable: isDefault
0    640390
1    159610
Name: isDefault, dtype: int64


Discrete variable: initialListStatus
0    466438
1    333562
Name: initialListStatus, dtype: int64


Discrete variable: applicationType
0    784586
1     15414
Name: applicationType, dtype: int64


Discrete variable: policyCode
1.0    800000
Name: policyCode, dtype: int64


Discrete variable: n11
0.0    729682
1.0       540
2.0        24
4.0         1
3.0         1
Name: n11, dtype: int64


Discrete variable: n12
0.0    757315
1.0      2281
2.0       115
3.0        16
4.0         3
Name: n12, dtype: int64
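
The isDefault counts above show that the two classes are imbalanced. As a quick check, the default rate can be computed directly from those two counts; the variable names below are illustrative only.

```python
# isDefault value counts copied from the output above
counts = {0: 640390, 1: 159610}

total = sum(counts.values())
default_rate = counts[1] / total

print(f"rows: {total}, default rate: {default_rate:.4f}")
```

About one loan in five is a default, which agrees with the isDefault mean reported by `train.describe()`. On the DataFrame itself, `train['isDefault'].value_counts(normalize=True)` should give the same proportions.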

3.3 Analysis of continuous numeric variables

f = pd.melt(train, value_vars=numerical_serial_fea)
g = sns.FacetGrid(f, col="variable",  col_wrap=4, sharex=False, sharey=False)
g = g.map(sns.distplot, "value")

[Figure: distribution plots of the continuous numeric features, one panel per feature (output_20_0.png)]

3.4 Analysis of non-numeric categorical variables

for fea in category_fea:
    print('Categorical variable:', fea)
    print(train[fea].value_counts())
    print()
Categorical variable: grade
B    233690
C    227118
A    139661
D    119453
E     55661
F     19053
G      5364
Name: grade, dtype: int64

Categorical variable: subGrade
C1    50763
B4    49516
B5    48965
B3    48600
C2    47068
C3    44751
C4    44272
B2    44227
B1    42382
C5    40264
A5    38045
A4    30928
D1    30538
D2    26528
A1    25909
D3    23410
A3    22655
A2    22124
D4    21139
D5    17838
E1    14064
E2    12746
E3    10925
E4     9273
E5     8653
F1     5925
F2     4340
F3     3577
F4     2859
F5     2352
G1     1759
G2     1231
G3      978
G4      751
G5      645
Name: subGrade, dtype: int64

Categorical variable: employmentLength
10+ years    262753
2 years       72358
< 1 year      64237
3 years       64152
1 year        52489
5 years       50102
4 years       47985
6 years       37254
8 years       36192
7 years       35407
9 years       30272
Name: employmentLength, dtype: int64

Categorical variable: issueDate
2016-03-01    29066
2015-10-01    25525
2015-07-01    24496
2015-12-01    23245
2014-10-01    21461
              ...
2007-08-01       23
2007-07-01       21
2008-09-01       19
2007-09-01        7
2007-06-01        1
Name: issueDate, Length: 139, dtype: int64

Categorical variable: earliesCreditLine
Aug-2001    5567
Sep-2003    5403
Aug-2002    5403
Oct-2001    5258
Aug-2000    5246
            ...
Oct-1954       1
Jan-1944       1
May-1957       1
Nov-1954       1
Nov-1953       1
Name: earliesCreditLine, Length: 720, dtype: int64

III. Feature Engineering

3.1 Feature preprocessing

3.1.1 Filling missing values

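# Fill missing values in the numeric features with the median
train[numerical_fea] = train[numerical_fea].fillna(train[numerical_fea].median())
# Fill missing values in the categorical features with the mode
# (mode() returns a DataFrame, so take its first row and assign the result back)
train[category_fea] = train[category_fea].fillna(train[category_fea].mode().iloc[0])
# '10+ years' is the most frequent employmentLength value
train['employmentLength'] = train['employmentLength'].fillna('10+ years')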
train.isnull().any().sum()
0
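
A minimal sketch of median and mode imputation on toy data, written by assigning the filled result back to the frame. The `toy` frame and column names are assumptions made for illustration. `DataFrame.mode()` returns a DataFrame because a column can have several modes, so `.iloc[0]` picks the first mode of each column.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "income": [10.0, np.nan, 30.0, 50.0],
    "grade": ["A", "B", None, "B"],
})
num_cols = ["income"]
cat_cols = ["grade"]

# Numeric columns: fill with the per-column median (here 30.0)
toy[num_cols] = toy[num_cols].fillna(toy[num_cols].median())

# Categorical columns: fill with the per-column first mode (here "B")
toy[cat_cols] = toy[cat_cols].fillna(toy[cat_cols].mode().iloc[0])

print(toy)
```

Assigning back avoids relying on `inplace=True` applied to a column selection, which operates on a temporary copy and may leave the original frame unchanged.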

3.1.2 Preprocessing object-type categorical features

# Date handling: convert issueDate to the number of days since 2007-06-01
train['issueDate'] = pd.to_datetime(train['issueDate'],format='%Y-%m-%d')
startdate = datetime.datetime.strptime('2007-06-01','%Y-%m-%d')
train['issueDate'] = train['issueDate'].apply(lambda x: x-startdate).dt.days
train['issueDate'].value_counts()
3196    29066
3044    25525
2952    24496
3105    23245
2679    21461
        ...
61         23
30         21
458        19
92          7
0           1
Name: issueDate, Length: 139, dtype: int64
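
The same conversion can also be written without `apply`, by subtracting a `pd.Timestamp` from the whole column at once. This is a sketch on three sample dates taken from the output above; `issue` and `start` are illustrative names.

```python
import pandas as pd

# Three issue dates that appear in the value counts above
issue = pd.Series(["2007-06-01", "2007-07-01", "2016-03-01"])
start = pd.Timestamp("2007-06-01")

# Subtracting a Timestamp from a datetime Series yields timedeltas; .dt.days extracts whole days
days = (pd.to_datetime(issue, format="%Y-%m-%d") - start).dt.days

print(days.tolist())  # [0, 30, 3196]
```

The values 0, 30 and 3196 match the encoded issueDate values shown in the output above.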
# Preprocess employmentLength: keep only the number of years as an integer
def employmentLength_to_int(s):
    if pd.isnull(s):
        return s
    else:
        return np.int8(s.split()[0])

train['employmentLength'] = train['employmentLength'].replace('10+ years', '10 years')
train['employmentLength'] = train['employmentLength'].replace('< 1 year', '0 years')
train['employmentLength'] = train['employmentLength'].apply(employmentLength_to_int)
train['employmentLength'].value_counts()
10    309552
2      72358
0      64237
3      64152
1      52489
5      50102
4      47985
6      37254
8      36192
7      35407
9      30272
Name: employmentLength, dtype: int64
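
As an alternative sketch, the digits can be pulled out with pandas string methods, which also leaves missing entries as NaN. The sample Series below is made up for illustration and is not taken from the dataset.

```python
import pandas as pd

s = pd.Series(["10+ years", "< 1 year", "3 years", "1 year", None])

# Normalize the two special labels first, as in the code above
cleaned = s.replace({"10+ years": "10 years", "< 1 year": "0 years"})

# Extract the leading run of digits; missing entries stay NaN, hence the float dtype
years = cleaned.str.extract(r"(\d+)", expand=False).astype("float")

print(years.tolist())
```

The result is a float column because NaN cannot be stored in a plain integer dtype; after imputation it could be cast to an integer type.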
# Preprocess earliesCreditLine: keep only the year (the last four characters)
train['earliesCreditLine'] = train['earliesCreditLine'].apply(lambda x:int(x[-4:]))
train['earliesCreditLine'].value_counts()
2001    53194
2002    51060
2003    50649
2000    50624
2004    49280
        ...
1954        5
1953        5
1950        5
1946        2
1944        1
Name: earliesCreditLine, Length: 68, dtype: int64
# Preprocess grade: map the letters A-G to the ordinal values 1-7
train['grade'] = train['grade'].map({'A':1,'B':2,'C':3,'D':4,'E':5,'F':6,'G':7})
train['grade'].value_counts()
2    233690
3    227118
1    139661
4    119453
5     55661
6     19053
7      5364
Name: grade, dtype: int64
# Preprocess subGrade: map A1-G5 to the ordinal values 1-35
train['subGrade'] = train['subGrade'].map({
    'A1':1,'A2':2,'A3':3,'A4':4,'A5':5,
    'B1':6,'B2':7,'B3':8,'B4':9,'B5':10,
    'C1':11,'C2':12,'C3':13,'C4':14,'C5':15,
    'D1':16,'D2':17,'D3':18,'D4':19,'D5':20,
    'E1':21,'E2':22,'E3':23,'E4':24,'E5':25,
    'F1':26,'F2':27,'F3':28,'F4':29,'F5':30,
    'G1':31,'G2':32,'G3':33,'G4':34,'G5':35,
})
train['subGrade'].value_counts()
11    50763
9     49516
10    48965
8     48600
12    47068
13    44751
14    44272
7     44227
6     42382
15    40264
5     38045
4     30928
16    30538
17    26528
1     25909
18    23410
3     22655
2     22124
19    21139
20    17838
21    14064
22    12746
23    10925
24     9273
25     8653
26     5925
27     4340
28     3577
29     2859
30     2352
31     1759
32     1231
33      978
34      751
35      645
Name: subGrade, dtype: int64
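
The subGrade dictionary above follows a regular pattern (five sub-levels per letter), so it could also be generated rather than typed out. A small sketch, with `sub_grade_map` as an illustrative name:

```python
grades = "ABCDEFG"

# Letter index gi (0 for A) contributes a block of 5; sub-level i runs from 1 to 5
sub_grade_map = {
    f"{g}{i}": gi * 5 + i
    for gi, g in enumerate(grades)
    for i in range(1, 6)
}

print(sub_grade_map["A1"], sub_grade_map["C1"], sub_grade_map["G5"])  # 1 11 35
```

Passing `sub_grade_map` to `Series.map` should give the same encoding as the hand-written dictionary.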

3.1.3 Handling discrete numeric features

# One-hot encoding
temp = ['subGrade','homeOwnership','verificationStatus','purpose','regionCode']
data = pd.get_dummies(train,columns=temp,drop_first=True)
data.head()
id loanAmnt term interestRate installment grade employmentTitle employmentLength annualIncome issueDate isDefault postCode dti delinquency_2years ficoRangeLow ficoRangeHigh openAcc pubRec pubRecBankruptcies revolBal revolUtil totalAcc initialListStatus applicationType earliesCreditLine title policyCode n0 n1 n2 n3 n4 n5 n6 n7 n8 n9 n10 n11 n12 n13 n14 subGrade_2 subGrade_3 subGrade_4 subGrade_5 subGrade_6 subGrade_7 subGrade_8 subGrade_9 subGrade_10 subGrade_11 subGrade_12 subGrade_13 subGrade_14 subGrade_15 subGrade_16 subGrade_17 subGrade_18 subGrade_19 subGrade_20 subGrade_21 subGrade_22 subGrade_23 subGrade_24 subGrade_25 subGrade_26 subGrade_27 subGrade_28 subGrade_29 subGrade_30 subGrade_31 subGrade_32 subGrade_33 subGrade_34 subGrade_35 homeOwnership_1 homeOwnership_2 homeOwnership_3 homeOwnership_4 homeOwnership_5 verificationStatus_1 verificationStatus_2 purpose_1 purpose_2 purpose_3 purpose_4 purpose_5 purpose_6 purpose_7 purpose_8 purpose_9 purpose_10 purpose_11 purpose_12 purpose_13 regionCode_1 regionCode_2 regionCode_3 regionCode_4 regionCode_5 regionCode_6 regionCode_7 regionCode_8 regionCode_9 regionCode_10 regionCode_11 regionCode_12 regionCode_13 regionCode_14 regionCode_15 regionCode_16 regionCode_17 regionCode_18 regionCode_19 regionCode_20 regionCode_21 regionCode_22 regionCode_23 regionCode_24 regionCode_25 regionCode_26 regionCode_27 regionCode_28 regionCode_29 regionCode_30 regionCode_31 regionCode_32 regionCode_33 regionCode_34 regionCode_35 regionCode_36 regionCode_37 regionCode_38 regionCode_39 regionCode_40 regionCode_41 regionCode_42 regionCode_43 regionCode_44 regionCode_45 regionCode_46 regionCode_47 regionCode_48 regionCode_49 regionCode_50
0 0 35000.0 5 19.52 917.97 5 320.0 2 110000.0 2587 1 137.0 17.05 0.0 730.0 734.0 7.0 0.0 0.0 24178.0 48.9 27.0 0 0 2001 1.0 1.0 0.0 2.0 2.0 2.0 4.0 9.0 8.0 4.0 12.0 2.0 7.0 0.0 0.0 0.0 2.0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 1 18000.0 5 18.49 461.90 4 219843.0 5 46000.0 1888 0 156.0 27.83 0.0 700.0 704.0 13.0 0.0 0.0 15096.0 38.9 18.0 1 0 2002 1723.0 1.0 0.0 3.0 5.0 5.0 10.0 7.0 7.0 7.0 13.0 5.0 13.0 0.0 0.0 0.0 2.0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 2 12000.0 5 16.99 298.17 4 31698.0 8 74000.0 3044 0 337.0 22.77 0.0 675.0 679.0 11.0 0.0 0.0 4606.0 51.8 27.0 0 0 2006 0.0 1.0 0.0 0.0 3.0 3.0 0.0 0.0 21.0 4.0 5.0 3.0 11.0 0.0 0.0 0.0 4.0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 3 11000.0 3 7.26 340.96 1 46854.0 10 118000.0 2983 0 148.0 17.21 0.0 685.0 689.0 9.0 0.0 0.0 9948.0 52.6 28.0 1 0 1999 4.0 1.0 6.0 4.0 6.0 6.0 4.0 16.0 4.0 7.0 21.0 6.0 9.0 0.0 0.0 0.0 1.0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4 4 3000.0 3 12.99 101.07 3 54.0 10 29000.0 3196 0 301.0 32.16 0.0 690.0 694.0 12.0 0.0 0.0 2942.0 32.0 27.0 0 0 1977 11.0 1.0 1.0 2.0 7.0 7.0 2.0 4.0 9.0 10.0 15.0 7.0 12.0 0.0 0.0 0.0 4.0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

3.1.4 Handling continuous features

# Outlier handling: mark values outside mean ± 3 standard deviations as NaN,
# then drop every row that contains a marked value
def find_outliers_by_3segama(fea):
    data_std = np.std(data[fea])
    data_mean = np.mean(data[fea])
    outliers_cut_off = data_std * 3
    lower_rule = data_mean - outliers_cut_off
    upper_rule = data_mean + outliers_cut_off
    data[fea] = data[fea].apply(lambda x: np.nan if x > upper_rule or x < lower_rule else x)
    return data

for fea in data.columns:
    # Only features with more than 10 distinct values are treated as continuous
    if data[fea].nunique() > 10:
        find_outliers_by_3segama(fea)
data.dropna(axis=0, how='any', inplace=True)
data.shape
(626125, 146)
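
For illustration, the 3-sigma rule can also be expressed as a boolean mask instead of a per-element `apply`. This is a sketch on toy data; `three_sigma_mask` is a hypothetical helper and not part of the original notebook.

```python
import pandas as pd


def three_sigma_mask(col: pd.Series) -> pd.Series:
    """Return True for values within mean ± 3 standard deviations."""
    mean = col.mean()
    std = col.std(ddof=0)  # population std, the same default np.std uses
    return (col - mean).abs() <= 3 * std


# 19 ordinary values and one extreme value
values = pd.Series([10.0] * 19 + [1000.0])
mask = three_sigma_mask(values)

print(int(mask.sum()), bool(mask.iloc[-1]))  # 19 False
```

Here the mean is 59.5 and three standard deviations is about 647, so only the value 1000 falls outside the band. Rows can then be filtered with `values[mask]`, which avoids writing NaN and dropping afterwards.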
# Binning: choose the bin width from the value range of each continuous feature
for fea in data.columns:
    if data[fea].nunique() > 10:
        fea_range = data[fea].max() - data[fea].min()
        if fea_range > 100000:
            data[fea] = np.floor_divide(data[fea], 10000)
        elif fea_range > 10000:
            data[fea] = np.floor_divide(data[fea], 1000)
        elif fea_range > 1000:
            data[fea] = np.floor_divide(data[fea], 100)
        elif fea_range > 100:
            data[fea] = np.floor_divide(data[fea], 10)
data.head()
id loanAmnt term interestRate installment grade employmentTitle employmentLength annualIncome issueDate isDefault postCode dti delinquency_2years ficoRangeLow ficoRangeHigh openAcc pubRec pubRecBankruptcies revolBal revolUtil totalAcc initialListStatus applicationType earliesCreditLine title policyCode n0 n1 n2 n3 n4 n5 n6 n7 n8 n9 n10 n11 n12 n13 n14 subGrade_2 subGrade_3 subGrade_4 subGrade_5 subGrade_6 subGrade_7 subGrade_8 subGrade_9 subGrade_10 subGrade_11 subGrade_12 subGrade_13 subGrade_14 subGrade_15 subGrade_16 subGrade_17 subGrade_18 subGrade_19 subGrade_20 subGrade_21 subGrade_22 subGrade_23 subGrade_24 subGrade_25 subGrade_26 subGrade_27 subGrade_28 subGrade_29 subGrade_30 subGrade_31 subGrade_32 subGrade_33 subGrade_34 subGrade_35 homeOwnership_1 homeOwnership_2 homeOwnership_3 homeOwnership_4 homeOwnership_5 verificationStatus_1 verificationStatus_2 purpose_1 purpose_2 purpose_3 purpose_4 purpose_5 purpose_6 purpose_7 purpose_8 purpose_9 purpose_10 purpose_11 purpose_12 purpose_13 regionCode_1 regionCode_2 regionCode_3 regionCode_4 regionCode_5 regionCode_6 regionCode_7 regionCode_8 regionCode_9 regionCode_10 regionCode_11 regionCode_12 regionCode_13 regionCode_14 regionCode_15 regionCode_16 regionCode_17 regionCode_18 regionCode_19 regionCode_20 regionCode_21 regionCode_22 regionCode_23 regionCode_24 regionCode_25 regionCode_26 regionCode_27 regionCode_28 regionCode_29 regionCode_30 regionCode_31 regionCode_32 regionCode_33 regionCode_34 regionCode_35 regionCode_36 regionCode_37 regionCode_38 regionCode_39 regionCode_40 regionCode_41 regionCode_42 regionCode_43 regionCode_44 regionCode_45 regionCode_46 regionCode_47 regionCode_48 regionCode_49 regionCode_50
0 0 35.0 5 19.52 9.0 5 0.0 2 11.0 25.0 1 13.0 17.05 0.0 73.0 73.0 7.0 0.0 0.0 24.0 4.0 27.0 0 0 2001.0 0.0 1.0 0.0 2.0 2.0 2.0 4.0 9.0 8.0 4.0 12.0 2.0 7.0 0.0 0.0 0.0 2.0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 0 18.0 5 18.49 4.0 4 21.0 5 4.0 18.0 0 15.0 27.83 0.0 70.0 70.0 13.0 0.0 0.0 15.0 3.0 18.0 1 0 2002.0 1.0 1.0 0.0 3.0 5.0 5.0 10.0 7.0 7.0 7.0 13.0 5.0 13.0 0.0 0.0 0.0 2.0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 0 12.0 5 16.99 2.0 4 3.0 8 7.0 30.0 0 33.0 22.77 0.0 67.0 67.0 11.0 0.0 0.0 4.0 5.0 27.0 0 0 2006.0 0.0 1.0 0.0 0.0 3.0 3.0 0.0 0.0 21.0 4.0 5.0 3.0 11.0 0.0 0.0 0.0 4.0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4 0 3.0 3 12.99 1.0 3 0.0 10 2.0 31.0 0 30.0 32.16 0.0 69.0 69.0 12.0 0.0 0.0 2.0 3.0 27.0 0 0 1977.0 0.0 1.0 1.0 2.0 7.0 7.0 2.0 4.0 9.0 10.0 15.0 7.0 12.0 0.0 0.0 0.0 4.0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
6 0 2.0 3 7.69 0.0 1 18.0 9 3.0 26.0 0 51.0 17.49 0.0 75.0 75.0 12.0 0.0 0.0 3.0 0.0 23.0 0 0 2006.0 0.0 1.0 0.0 1.0 3.0 3.0 7.0 11.0 3.0 10.0 18.0 3.0 12.0 0.0 0.0 0.0 3.0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
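
To make the binning rule concrete, here is a small sketch that applies the same range thresholds to a handful of loanAmnt values from the first rows shown earlier. `choose_bin_width` is a hypothetical helper introduced only for this example.

```python
import numpy as np
import pandas as pd


def choose_bin_width(span: float) -> int:
    """Map a value range to a bin width using the thresholds from the loop above."""
    if span > 100000:
        return 10000
    if span > 10000:
        return 1000
    if span > 1000:
        return 100
    if span > 100:
        return 10
    return 1  # small ranges are left unbinned


loan = pd.Series([35000.0, 18000.0, 12000.0, 11000.0, 3000.0])
width = choose_bin_width(loan.max() - loan.min())  # range 32000 gives width 1000
binned = np.floor_divide(loan, width)

print(width, binned.tolist())  # 1000 [35.0, 18.0, 12.0, 11.0, 3.0]
```

This reproduces the loanAmnt values 35.0, 18.0, 12.0 and 3.0 seen in the binned table above. Fixed-width bins keep the ordering of values; quantile-based bins would be another option if equal-sized groups matter more.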

3.2 Feature interaction

3.3 Feature encoding

3.4 Feature selection

# Drop features whose absolute correlation is below 0.003
data.corr()
id loanAmnt term interestRate installment grade employmentTitle employmentLength annualIncome issueDate isDefault postCode dti delinquency_2years ficoRangeLow ficoRangeHigh openAcc pubRec pubRecBankruptcies revolBal revolUtil totalAcc initialListStatus applicationType earliesCreditLine title policyCode n0 n1 n2 n3 n4 n5 n6 n7 n8 n9 n10 n11 n12 n13 n14 subGrade_2 subGrade_3 subGrade_4 subGrade_5 subGrade_6 subGrade_7 subGrade_8 subGrade_9 subGrade_10 subGrade_11 subGrade_12 subGrade_13 subGrade_14 subGrade_15 subGrade_16 subGrade_17 subGrade_18 subGrade_19 subGrade_20 subGrade_21 subGrade_22 subGrade_23 subGrade_24 subGrade_25 subGrade_26 subGrade_27 subGrade_28 subGrade_29 subGrade_30 subGrade_31 subGrade_32 subGrade_33 subGrade_34 subGrade_35 homeOwnership_1 homeOwnership_2 homeOwnership_3 homeOwnership_4 homeOwnership_5 verificationStatus_1 verificationStatus_2 purpose_1 purpose_2 purpose_3 purpose_4 purpose_5 purpose_6 purpose_7 purpose_8 purpose_9 purpose_10 purpose_11 purpose_12 purpose_13 regionCode_1 regionCode_2 regionCode_3 regionCode_4 regionCode_5 regionCode_6 regionCode_7 regionCode_8 regionCode_9 regionCode_10 regionCode_11 regionCode_12 regionCode_13 regionCode_14 regionCode_15 regionCode_16 regionCode_17 regionCode_18 regionCode_19 regionCode_20 regionCode_21 regionCode_22 regionCode_23 regionCode_24 regionCode_25 regionCode_26 regionCode_27 regionCode_28 regionCode_29 regionCode_30 regionCode_31 regionCode_32 regionCode_33 regionCode_34 regionCode_35 regionCode_36 regionCode_37 regionCode_38 regionCode_39 regionCode_40 regionCode_41 regionCode_42 regionCode_43 regionCode_44 regionCode_45 regionCode_46 regionCode_47 regionCode_48 regionCode_49 regionCode_50
id 1.000000 0.000423 -0.000832 0.001380 0.000935 0.001117 -0.000515 0.000522 0.001601 0.000813 0.000054 0.002119 -0.001419 0.000611 -0.000891 -0.000891 -0.002579 -0.000380 -0.001113 -0.002203 0.001751 -0.001191 0.001625 0.001088 -0.000648 -0.000831 NaN 0.001187 -0.001507 -0.001596 -0.001596 -0.001167 -0.000726 -0.001190 -0.002063 -0.001261 -0.001383 -0.002682 -0.000979 0.000919 -0.000399 0.000466 -0.003861 0.002312 -0.000586 0.001854 0.001601 -0.000564 -0.001624 -0.001781 -0.000133 -0.002010 0.000298 -0.000805 0.000330 0.000902 0.000199 0.001544 0.000008 0.001353 0.002652 -0.000004 0.001617 0.000757 -0.001725 -0.001176 0.000167 0.000584 -0.000396 0.001567 -0.001031 -0.001643 -0.000925 0.000106 0.000632 0.001376 0.001050 -0.000191 -0.000324 -0.001007 -0.000076 -0.001066 0.002051 0.001086 -0.000949 0.000830 -0.001223 0.001617 0.001010 0.000869 0.000494 -0.000550 -0.001030 -0.000046 0.000530 -0.001829 -0.000512 0.000132 -0.002777 0.000837 0.000781 0.001360 0.000291 0.003432 0.000811 0.000275 -0.000130 -0.002148 -0.000562 -0.001097 0.000369 -0.000868 0.000086 0.000242 -0.001352 0.000190 0.002099 -0.001709 0.000090 -0.002066 -0.001333 0.001156 0.000646 -0.001189 0.001397 0.000154 -0.000731 -0.003658 -0.002803 -0.000918 0.000107 0.000119 0.001542 -0.000442 -0.001099 -0.000408 0.000445 0.000052 0.000535 -0.002393 0.001781 0.000439 0.001373 -0.000621 -0.001178 NaN
loanAmnt 0.000423 1.000000 0.409910 0.118841 0.944344 0.120627 -0.009845 0.065608 0.467275 -0.003040 0.059117 -0.021573 0.027657 0.011335 0.112145 0.112145 0.183657 -0.085582 -0.095282 0.441242 0.115935 0.224888 -0.064040 0.063317 -0.163749 -0.019045 NaN -0.039067 0.185727 0.139590 0.139590 0.205452 0.199770 0.096398 0.158380 0.173734 0.140495 0.179692 -0.000524 0.001787 -0.009348 -0.039259 -0.015960 -0.013134 -0.007710 -0.006264 -0.029218 -0.026768 -0.023572 -0.028199 -0.041442 -0.023762 -0.015886 0.001133 0.016117 0.011130 0.001577 -0.000291 0.009578 0.026116 0.028586 0.033834 0.038488 0.041068 0.045713 0.048670 0.036593 0.039976 0.030056 0.029536 0.028785 0.025081 0.018754 0.020721 0.016866 0.016305 -0.164712 -0.026631 -0.002439 0.001072 -0.000577 0.030942 0.155035 0.008906 -0.018050 -0.044906 0.029625 -0.137541 0.002530 -0.077588 -0.066592 -0.067713 -0.065535 -0.014204 -0.015790 -0.003060 -0.003828 0.005890 0.006220 -0.003724 -0.003024 -0.005583 -0.004681 0.006280 0.017927 -0.010561 -0.005282 0.012463 -0.006248 0.027367 -0.002174 -0.000754 -0.004629 0.006900 -0.014687 -0.000519 -0.024489 0.003046 -0.012151 -0.007140 0.004722 0.016062 -0.005721 0.006328 -0.004399 0.017766 -0.000734 -0.001726 0.000768 -0.005510 -0.011785 0.001404 -0.007696 -0.007866 0.000001 -0.004725 0.005218 -0.000192 -0.010283 -0.003038 0.002874 0.000126 -0.001679 0.013122 -0.001992 NaN
term -0.000832 0.409910 1.000000 0.415437 0.162911 0.423747 0.013595 0.043317 0.106556 -0.038790 0.172367 0.011870 0.064007 -0.003635 0.010680 0.010680 0.077523 -0.019022 -0.011780 0.133292 0.066051 0.117111 -0.093668 0.039031 -0.055153 0.006475 NaN -0.015322 0.042665 0.050506 0.050506 0.047499 0.058447 0.084320 0.051615 0.064934 0.050742 0.079613 0.000024 -0.000450 -0.005690 0.021253 -0.092775 -0.093894 -0.102663 -0.095506 -0.079971 -0.077644 -0.065284 -0.055868 -0.074103 -0.026473 -0.005394 0.033247 0.062081 0.057930 0.043461 0.043473 0.055638 0.083301 0.090898 0.096988 0.103288 0.109382 0.116059 0.117178 0.093415 0.089779 0.071815 0.071525 0.060260 0.053811 0.044648 0.035705 0.028910 0.027364 -0.099142 -0.020997 -0.001623 -0.001340 -0.000943 0.033459 0.088802 0.002053 0.004970 -0.013546 -0.038622 -0.050243 0.005194 -0.034611 -0.018643 -0.023816 -0.027348 -0.006035 -0.003893 0.002572 0.000277 -0.007993 0.003824 0.003904 0.004104 0.001358 0.011006 -0.021286 0.012225 -0.001926 0.008378 0.009153 -0.010520 -0.005189 0.001348 -0.000885 0.008042 -0.000223 0.007868 0.003341 -0.012818 0.005266 0.000807 0.002217 -0.003946 0.003221 0.002844 0.001443 -0.001787 0.001891 0.003290 0.005010 -0.000667 -0.001398 -0.003144 -0.001115 0.002452 -0.007542 0.009668 0.000364 0.004289 0.003000 0.006001 -0.000352 0.008120 -0.002490 -0.001081 0.000969 0.000010 NaN
interestRate 0.001380 0.118841 0.415437 1.000000 0.120640 0.951939 0.072586 -0.000166 -0.128969 -0.038144 0.253111 0.008645 0.178430 0.041478 -0.397102 -0.397102 -0.026849 0.052064 0.048077 -0.029575 0.240812 -0.071632 0.129388 0.029456 0.107659 0.016223 NaN 0.042391 0.013137 0.082619 0.082619 -0.066004 -0.093405 -0.015900 -0.020434 -0.069891 0.081010 -0.030676 0.003671 0.012847 0.026751 0.190942 -0.243300 -0.225212 -0.249667 -0.245046 -0.221699 -0.173199 -0.134540 -0.093975 -0.063343 -0.020932 0.018093 0.049204 0.086881 0.125377 0.144427 0.169726 0.184474 0.198299 0.208694 0.197218 0.206423 0.209187 0.214386 0.233260 0.170868 0.158553 0.152411 0.141718 0.124198 0.107149 0.085428 0.060953 0.047413 0.044686 0.063943 0.007200 0.000583 0.002810 0.002654 0.014635 0.208091 0.064548 -0.019210 -0.012516 -0.168315 0.074356 0.036263 0.009930 -0.022643 0.020322 0.036275 0.012358 0.010037 0.000691 0.001799 -0.006851 0.000489 -0.004114 0.002838 0.000529 0.005557 -0.007345 0.001095 -0.005518 0.001007 0.002319 0.011774 -0.007952 -0.002154 -0.001417 0.011875 0.002459 0.002773 0.000049 0.005241 -0.005651 0.004637 -0.001894 -0.006674 -0.015917 -0.004497 0.009403 -0.003376 -0.003325 0.000832 0.006044 -0.005375 0.001907 -0.003610 -0.003600 0.007047 0.003819 0.000419 -0.001031 0.000135 0.003551 0.001716 0.002648 -0.000331 0.001871 -0.002838 0.002601 0.002269 NaN
installment 0.000935 0.944344 0.162911 0.120640 1.000000 0.115138 -0.005217 0.055356 0.440590 -0.000578 0.041954 -0.024978 0.034484 0.018761 0.063176 0.063176 0.169524 -0.076187 -0.089488 0.418453 0.133743 0.193653 -0.019355 0.055460 -0.142758 -0.019282 NaN -0.031052 0.188863 0.146766 0.146766 0.197947 0.184984 0.073793 0.153171 0.158454 0.147354 0.164535 -0.000376 0.003462 -0.004688 -0.024037 -0.012639 -0.008244 -0.003798 -0.003254 -0.031604 -0.027494 -0.027593 -0.030215 -0.035670 -0.018203 -0.011481 -0.004921 0.006126 0.004602 0.002066 0.005157 0.013816 0.025210 0.027891 0.030986 0.034989 0.037623 0.042111 0.048661 0.032765 0.038853 0.032514 0.031743 0.032160 0.027272 0.020309 0.021406 0.016787 0.016617 -0.134317 -0.020577 -0.002310 0.001835 -0.000081 0.023925 0.159702 0.017272 -0.024932 -0.046447 0.024458 -0.129208 0.004799 -0.075295 -0.066935 -0.065726 -0.061881 -0.012486 -0.014627 -0.003316 -0.003710 0.008499 0.006028 -0.005617 -0.003991 -0.006158 -0.008413 0.012794 0.014852 -0.011654 -0.008282 0.009407 -0.001891 0.029017 -0.002955 -0.001005 -0.006236 0.007959 -0.018127 -0.001813 -0.021365 0.000745 -0.012224 -0.008418 0.005480 0.016269 -0.006885 0.007615 -0.004297 0.017403 -0.001918 -0.002410 0.000553 -0.005730 -0.011770 0.001309 -0.008045 -0.005362 -0.003144 -0.004970 0.004123 -0.000890 -0.012445 -0.002893 0.000689 0.001379 -0.002107 0.014514 -0.002434 NaN

regionCode_46 0.000439 0.000126 -0.002490 0.001871 0.001379 0.002765 -0.008369 -0.011931 -0.004625 0.024831 -0.000336 0.058462 0.009459 0.003612 -0.002016 -0.002016 -0.003194 0.001476 -0.002488 -0.004314 0.002978 -0.001576 -0.013439 0.009129 0.011684 -0.005360 NaN -0.000041 -0.002700 -0.004665 -0.004665 -0.004031 -0.008001 0.009072 -0.007103 -0.008401 -0.005226 -0.003726 0.001148 0.001783 0.000147 0.002509 -0.002335 -0.002496 -0.001349 -0.003406 0.001946 -0.000224 -0.001426 -0.000537 0.002037 -0.000117 -0.000140 0.000594 0.000096 0.005490 0.001900 0.000297 0.002810 -0.000436 -0.001953 0.000466 -0.000527 0.001300 -0.002110 -0.000180 0.000703 0.000282 -0.000673 -0.000245 -0.000626 -0.001348 -0.001072 -0.000790 -0.000618 -0.000568 -0.001450 0.002148 -0.000553 -0.000206 -0.000190 0.001963 0.001363 -0.000022 -0.002358 -0.001070 0.002238 0.000232 -0.001839 0.000393 0.000282 0.001306 0.000258 0.002608 -0.001346 -0.000119 -0.001614 -0.007011 -0.006412 -0.004771 -0.003901 -0.002357 -0.006112 -0.014704 -0.006070 -0.005696 -0.004578 -0.005392 -0.010477 -0.010645 -0.003315 -0.002605 -0.003985 -0.005350 -0.006564 -0.003865 -0.009814 -0.005398 -0.005811 -0.004546 -0.001762 -0.005437 -0.004141 -0.002559 -0.001590 -0.006658 -0.001902 -0.004436 -0.002476 -0.001846 -0.003979 -0.004241 -0.003088 -0.004423 -0.002117 -0.001948 -0.001703 -0.003432 -0.003531 -0.002525 -0.003082 1.000000 -0.001388 -0.001764 -0.000718 NaN
regionCode_47 0.001373 -0.001679 -0.001081 -0.002838 -0.002107 -0.002504 -0.007356 0.001941 -0.007198 0.030269 -0.005557 0.060215 0.007797 0.000471 0.001091 0.001091 -0.003704 0.003830 0.002131 0.000469 0.000210 0.000105 -0.013378 0.009576 0.002036 -0.005905 NaN -0.000110 -0.000971 -0.004866 -0.004866 -0.000450 -0.001806 0.004435 -0.006316 -0.004431 -0.005377 -0.004324 -0.000898 -0.000308 -0.000536 -0.003055 -0.000755 -0.000961 -0.001134 0.001827 0.001583 0.000080 -0.001527 0.002260 0.002075 -0.001430 -0.000158 0.001319 0.001773 -0.001070 0.001429 -0.000536 0.001861 0.001303 -0.001691 -0.001744 -0.001350 -0.001749 -0.000725 -0.001673 -0.000952 -0.000817 -0.001704 -0.000594 -0.000881 -0.001484 -0.001180 -0.000870 -0.000681 -0.000625 -0.009648 0.005573 -0.000609 -0.000226 -0.000210 0.001542 -0.001215 -0.001125 -0.001645 -0.000254 0.003326 -0.001654 -0.001564 0.001295 0.001141 -0.001868 -0.002305 0.000595 -0.001483 -0.000131 -0.001778 -0.007724 -0.007063 -0.005255 -0.004297 -0.002596 -0.006733 -0.016198 -0.006686 -0.006275 -0.005044 -0.005940 -0.011541 -0.011727 -0.003651 -0.002869 -0.004390 -0.005894 -0.007231 -0.004257 -0.010811 -0.005946 -0.006402 -0.005008 -0.001941 -0.005990 -0.004562 -0.002819 -0.001752 -0.007335 -0.002096 -0.004887 -0.002727 -0.002033 -0.004384 -0.004672 -0.003402 -0.004872 -0.002332 -0.002145 -0.001876 -0.003781 -0.003890 -0.002782 -0.003395 -0.001388 1.000000 -0.001943 -0.000791 NaN
regionCode_48 -0.000621 0.013122 0.000969 0.002601 0.014514 0.002697 0.005839 0.001035 0.005677 -0.006643 0.000386 0.074851 0.003538 -0.000505 -0.000066 -0.000066 -0.008804 -0.010687 -0.009029 0.012497 0.016935 -0.002152 0.003535 -0.000401 -0.000109 0.002563 NaN -0.003413 -0.004071 -0.006308 -0.006308 -0.006013 -0.003305 0.003450 -0.011382 -0.006447 -0.005964 -0.008729 -0.001141 -0.001788 -0.000788 -0.007267 -0.002167 0.000319 0.001117 0.000323 0.000613 -0.001174 -0.002295 -0.001939 0.001481 -0.000007 -0.001138 0.000008 -0.001269 -0.000826 0.001418 0.001531 -0.000812 0.002184 -0.000215 0.001059 0.000417 0.000593 0.001657 -0.000144 -0.000153 0.003352 -0.000897 -0.000291 -0.000136 0.000664 0.002774 0.001791 0.000984 0.001222 -0.002532 0.001040 -0.000774 -0.000288 0.005743 0.000934 0.001735 0.001361 0.000426 -0.000243 -0.001842 -0.003017 0.002071 0.003414 -0.001371 0.000768 0.001233 -0.000015 0.000669 -0.000166 -0.002259 -0.009814 -0.008975 -0.006678 -0.005460 -0.003299 -0.008555 -0.020582 -0.008496 -0.007973 -0.006408 -0.007547 -0.014664 -0.014901 -0.004640 -0.003646 -0.005578 -0.007488 -0.009188 -0.005409 -0.013736 -0.007556 -0.008134 -0.006364 -0.002466 -0.007610 -0.005797 -0.003582 -0.002226 -0.009320 -0.002663 -0.006209 -0.003465 -0.002583 -0.005570 -0.005937 -0.004323 -0.006191 -0.002963 -0.002726 -0.002384 -0.004804 -0.004943 -0.003534 -0.004314 -0.001764 -0.001943 1.000000 -0.001005 NaN
regionCode_49 -0.001178 -0.001992 0.000010 0.002269 -0.002434 0.000968 -0.003484 0.000365 -0.006709 0.020769 -0.002056 0.059028 0.004764 -0.000256 -0.001166 -0.001166 -0.001209 0.011928 0.004670 -0.003665 -0.000948 -0.001023 -0.006241 0.015515 0.005007 -0.003054 NaN 0.003467 -0.004131 -0.001632 -0.001632 -0.004665 -0.007655 0.002124 -0.001115 -0.003804 -0.001842 -0.001661 -0.000464 0.000555 0.001177 0.001127 0.000018 -0.000573 -0.003264 -0.003465 -0.000607 0.000914 -0.000261 0.001553 0.002841 -0.000248 0.000493 0.002180 0.000185 -0.000020 0.000113 -0.001948 0.001733 0.000162 0.000193 -0.000263 0.001869 -0.000318 0.000810 -0.001332 -0.001581 -0.000191 -0.001230 -0.001093 0.000817 -0.000768 -0.000610 -0.000450 -0.000352 -0.000323 -0.007148 -0.001560 -0.000315 -0.000117 -0.000108 -0.000953 0.002061 -0.001269 -0.000138 -0.001287 -0.002003 0.001516 -0.001416 -0.000695 -0.000552 0.001566 0.000195 -0.000518 -0.000767 -0.000068 -0.000920 -0.003995 -0.003653 -0.002718 -0.002222 -0.001343 -0.003482 -0.008377 -0.003458 -0.003245 -0.002608 -0.003072 -0.005969 -0.006065 -0.001888 -0.001484 -0.002271 -0.003048 -0.003740 -0.002202 -0.005591 -0.003075 -0.003311 -0.002590 -0.001004 -0.003098 -0.002359 -0.001458 -0.000906 -0.003794 -0.001084 -0.002527 -0.001411 -0.001052 -0.002267 -0.002416 -0.001760 -0.002520 -0.001206 -0.001110 -0.000970 -0.001955 -0.002012 -0.001439 -0.001756 -0.000718 -0.000791 -0.001005 1.000000 NaN
regionCode_50 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

146 rows × 146 columns

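# Correlation of every feature with the target, sorted from most positive to most negative
corr = data.corr()['isDefault'].sort_values(ascending=False)
# Drop features whose absolute correlation with isDefault is below 0.003
drop_fea = corr[abs(corr)<0.003].index
data.drop(drop_fea,axis=1,inplace=True)
data.head()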
loanAmnt term interestRate installment grade employmentTitle annualIncome issueDate isDefault postCode dti delinquency_2years ficoRangeLow ficoRangeHigh openAcc pubRec pubRecBankruptcies revolBal revolUtil totalAcc initialListStatus applicationType earliesCreditLine title policyCode n0 n1 n2 n3 n4 n5 n6 n7 n8 n9 n10 n13 n14 subGrade_2 subGrade_3 subGrade_4 subGrade_5 subGrade_6 subGrade_7 subGrade_8 subGrade_9 subGrade_10 subGrade_11 subGrade_12 subGrade_13 subGrade_14 subGrade_15 subGrade_16 subGrade_17 subGrade_18 subGrade_19 subGrade_20 subGrade_21 subGrade_22 subGrade_23 subGrade_24 subGrade_25 subGrade_26 subGrade_27 subGrade_28 subGrade_29 subGrade_30 subGrade_31 subGrade_32 subGrade_33 subGrade_34 subGrade_35 homeOwnership_1 homeOwnership_2 verificationStatus_1 verificationStatus_2 purpose_1 purpose_2 purpose_4 purpose_5 purpose_6 purpose_8 purpose_9 purpose_10 purpose_12 regionCode_2 regionCode_3 regionCode_5 regionCode_6 regionCode_7 regionCode_11 regionCode_12 regionCode_13 regionCode_14 regionCode_15 regionCode_17 regionCode_18 regionCode_19 regionCode_20 regionCode_21 regionCode_22 regionCode_24 regionCode_25 regionCode_27 regionCode_29 regionCode_30 regionCode_32 regionCode_33 regionCode_34 regionCode_35 regionCode_36 regionCode_37 regionCode_38 regionCode_39 regionCode_40 regionCode_42 regionCode_43 regionCode_44 regionCode_45 regionCode_47 regionCode_50
0 35.0 5 19.52 9.0 5 0.0 11.0 25.0 1 13.0 17.05 0.0 73.0 73.0 7.0 0.0 0.0 24.0 4.0 27.0 0 0 2001.0 0.0 1.0 0.0 2.0 2.0 2.0 4.0 9.0 8.0 4.0 12.0 2.0 7.0 0.0 2.0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 18.0 5 18.49 4.0 4 21.0 4.0 18.0 0 15.0 27.83 0.0 70.0 70.0 13.0 0.0 0.0 15.0 3.0 18.0 1 0 2002.0 1.0 1.0 0.0 3.0 5.0 5.0 10.0 7.0 7.0 7.0 13.0 5.0 13.0 0.0 2.0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 12.0 5 16.99 2.0 4 3.0 7.0 30.0 0 33.0 22.77 0.0 67.0 67.0 11.0 0.0 0.0 4.0 5.0 27.0 0 0 2006.0 0.0 1.0 0.0 0.0 3.0 3.0 0.0 0.0 21.0 4.0 5.0 3.0 11.0 0.0 4.0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4 3.0 3 12.99 1.0 3 0.0 2.0 31.0 0 30.0 32.16 0.0 69.0 69.0 12.0 0.0 0.0 2.0 3.0 27.0 0 0 1977.0 0.0 1.0 1.0 2.0 7.0 7.0 2.0 4.0 9.0 10.0 15.0 7.0 12.0 0.0 4.0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
6 2.0 3 7.69 0.0 1 18.0 3.0 26.0 0 51.0 17.49 0.0 75.0 75.0 12.0 0.0 0.0 3.0 0.0 23.0 0 0 2006.0 0.0 1.0 0.0 1.0 3.0 3.0 7.0 11.0 3.0 10.0 18.0 3.0 12.0 0.0 3.0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
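
One detail worth checking in the filter above: a constant column (policyCode and regionCode_50 in the correlation matrix) has an undefined correlation, so its entry is NaN, and `abs(corr) < 0.003` evaluates to False for NaN. Those columns therefore survive the drop, as the `data.head()` output shows. The following is a minimal sketch of a NaN-aware variant on a small made-up frame. The helper name `low_corr_features` and the `toy` data are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def low_corr_features(df, target, threshold):
    """Return columns whose correlation with `target` is NaN
    or smaller than `threshold` in absolute value."""
    corr = df.corr()[target].drop(target)
    mask = corr.isna() | (corr.abs() < threshold)
    return list(corr[mask].index)

# Hypothetical toy data: one informative column and one constant column
rng = np.random.default_rng(0)
n = 1000
signal = rng.normal(size=n)
toy = pd.DataFrame({
    "isDefault": (signal + 0.1 * rng.normal(size=n) > 0).astype(int),
    "useful": signal,
    "constant": np.ones(n),
})

to_drop = low_corr_features(toy, "isDefault", threshold=0.003)
print(to_drop)
```

Whether constant columns should be dropped at this step or earlier, during preprocessing, is a design choice. The sketch only shows how to catch them with the same threshold logic.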

4. Modeling and Hyperparameter Tuning

4.1 Splitting the data into training and validation sets

# Separate the features from the target column
x_train = data.drop(['isDefault'],axis=1)
y_train = data['isDefault']
# Hold out 25% of the rows as a validation set
x, val_x, y, val_y = train_test_split(x_train,y_train,test_size=0.25,random_state=1)
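
Default labels are usually imbalanced, so it can help to keep the class ratio identical in both splits. The sketch below uses the `stratify` parameter of `train_test_split` on synthetic data. The 20% positive rate and the names `X_demo` and `y_demo` are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the loan data: 1000 rows, 20% positives (assumed rate)
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(1000, 4))
y_demo = np.array([1] * 200 + [0] * 800)

# stratify=y_demo keeps the share of positives the same in both splits
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=1, stratify=y_demo
)

print("positive rate, train:", y_tr.mean())
print("positive rate, validation:", y_val.mean())
```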

4.2 Model selection

4.2.1 Logistic regression

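# 5-fold cross-validated accuracy of a default logistic regression on the full feature set
lr = LogisticRegression()
scores = cross_val_score(lr,x_train,y_train,cv=5,scoring='accuracy')
print('Logistic regression 5-fold cross-validation accuracy:',np.mean(scores))
Logistic regression 5-fold cross-validation accuracy: 0.803408265122779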

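4.2.2 Random forest

# Fit on the training split and report accuracy on the held-out validation split
rfc = RandomForestClassifier(n_estimators=250,max_depth=10)
rfc.fit(x,y)
print('Random forest accuracy:',rfc.score(val_x,val_y))
Random forest accuracy: 0.8049472312370634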
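
Both accuracies above sit near 0.80. If roughly one loan in five defaults (an assumption, since the class ratio is not shown in this section), a model that always predicts "no default" would score about the same, so accuracy alone may say little about how well the models separate the classes. The sketch below compares a majority-class baseline with logistic regression and also reports ROC AUC. It runs on synthetic data from `make_classification`, and all names in it are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic imbalanced data: about 80% negatives, 20% positives (assumed ratio)
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, n_informative=4,
    weights=[0.8, 0.2], random_state=0,
)

baseline = DummyClassifier(strategy="most_frequent")  # always predicts the majority class
model = LogisticRegression(max_iter=1000)

baseline_acc = cross_val_score(baseline, X_demo, y_demo, cv=5, scoring="accuracy").mean()
model_acc = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy").mean()
model_auc = cross_val_score(model, X_demo, y_demo, cv=5, scoring="roc_auc").mean()

print("baseline accuracy:", baseline_acc)
print("model accuracy:", model_acc)
print("model ROC AUC:", model_auc)
```

On the real data, the same comparison can be run by passing `x_train` and `y_train` in place of the synthetic arrays.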

4.3 Hyperparameter tuning with grid search

# Search 4 x 3 = 12 parameter combinations with 5-fold cross-validation
param_grid = {'n_estimators':[100,150,200,300],'max_depth':[5,10,15]}
r = RandomForestClassifier()
estimator = GridSearchCV(r,param_grid=param_grid,cv=5)
estimator.fit(x_train,y_train)
estimator.best_score_
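
The search above fits on all rows, so no data is left untouched for a final check, and 12 combinations times 5 folds on 800,000 rows can take a long time. A smaller sketch of the same idea follows: search on the training split only, then score the refitted best model on the held-out split. The data is synthetic and the reduced grid, `cv=3`, and `scoring="roc_auc"` are assumptions chosen to keep the example fast, not settings from the original.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Small synthetic data set so the search finishes quickly
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, n_informative=4, random_state=0
)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=1, stratify=y_demo
)

small_grid = {"n_estimators": [20, 40], "max_depth": [3, 5]}
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid=small_grid,
    cv=3,
    scoring="roc_auc",
)
search.fit(X_tr, y_tr)  # search only on the training split

print("best parameters:", search.best_params_)
print("best cross-validated AUC:", search.best_score_)
# score() applies the same scorer to the refitted best model
print("held-out AUC:", search.score(X_val, y_val))
```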

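5. Summary

Real-world financial risk-control projects usually involve credit scoring, which requires the model's features to be interpretable. For that reason, logistic regression is still the most common baseline model in practice.

Ensemble methods can be used to get better results, but at the cost of weaker interpretability.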
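
To make the interpretability point concrete: each logistic regression coefficient is the change in the log-odds of default per unit of its feature, so exponentiating it gives an odds ratio. The sketch below fits a model on synthetic data in which only one feature drives the outcome. The column names are borrowed from the dataset, but the values and the generating rule are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data: default probability depends on interestRate only (assumed rule)
rng = np.random.default_rng(0)
n = 2000
demo = pd.DataFrame({
    "interestRate": rng.normal(13, 4, n),
    "dti": rng.normal(18, 8, n),
})
logit = 0.25 * (demo["interestRate"] - 13) - 1.5
prob = 1 / (1 + np.exp(-logit))
y_demo = (rng.random(n) < prob).astype(int)

# Standardize so each coefficient refers to a one-standard-deviation change
X_scaled = StandardScaler().fit_transform(demo)
lr_demo = LogisticRegression().fit(X_scaled, y_demo)

# exp(coefficient) = multiplicative change in the odds of default
odds_ratios = pd.Series(np.exp(lr_demo.coef_[0]), index=demo.columns)
print(odds_ratios)
```

An odds ratio well above 1 marks a feature that raises the odds of default, and a value near 1 marks a feature with little effect. A random forest does not offer this kind of direct reading.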

