Data Mining Task 1: Predicting Whether a Loan Will Become Overdue
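Steps:
1. Import the data;
2. Basic properties:
(1) View the first 10 rows;
(2) Check the size of the data (number of rows and columns);
(3) Check the distribution of the target label;
(4) List the features (column names);
(5) Count the categories in each categorical variable;
(6) Compute the mean, median, and other statistics of the continuous variables;
(7) Handle missing values;
(8) One-hot encode the categorical variables;
(9) Process the time fields;
(10) Build several individual models.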
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
data=pd.read_csv(r'/....../data2.csv',encoding='gbk')
data.head(10)
Unnamed: 0 custid trade_no bank_card_no low_volume_percent middle_volume_percent take_amount_in_later_12_month_highest trans_amount_increase_rate_lately trans_activity_month trans_activity_day ... loans_max_limit loans_avg_limit consfin_credit_limit consfin_credibility consfin_org_count_current consfin_product_count consfin_max_limit consfin_avg_limit latest_query_day loans_latest_day
0 5 2791858 20180507115231274000000023057383 卡号1 0.01 0.99 0 0.90 0.55 0.313 ... 2900.0 1688.0 1200.0 75.0 1.0 2.0 1200.0 1200.0 12.0 18.0
1 10 534047 20180507121002192000000023073000 卡号1 0.02 0.94 2000 1.28 1.00 0.458 ... 3500.0 1758.0 15100.0 80.0 5.0 6.0 22800.0 9360.0 4.0 2.0
2 12 2849787 20180507125159718000000023114911 卡号1 0.04 0.96 0 1.00 1.00 0.114 ... 1600.0 1250.0 4200.0 87.0 1.0 1.0 4200.0 4200.0 2.0 6.0
3 13 1809708 20180507121358683000000388283484 卡号1 0.00 0.96 2000 0.13 0.57 0.777 ... 3200.0 1541.0 16300.0 80.0 5.0 5.0 30000.0 12180.0 2.0 4.0
4 14 2499829 20180507115448545000000388205844 卡号1 0.01 0.99 0 0.46 1.00 0.175 ... 2300.0 1630.0 8300.0 79.0 2.0 2.0 8400.0 8250.0 22.0 120.0
5 15 518072 20180507121233054000000388275132 卡号1 0.02 0.98 2000 7.59 1.00 0.733 ... 5300.0 1941.0 11200.0 80.0 10.0 12.0 20400.0 8130.0 3.0 4.0
6 16 1205125 20180507121931540000000388298915 卡号1 0.02 0.98 0 23.67 0.94 0.087 ... 2200.0 2200.0 7600.0 73.0 2.0 2.0 16800.0 8900.0 1.0 3.0
7 18 1129897 20180507124659235000000023105807 卡号1 0.02 0.98 0 0.25 0.88 0.302 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
8 20 2599411 20180507115855621000000388224458 卡号1 0.03 0.65 0 0.31 0.76 0.472 ... 5300.0 4750.0 5500.0 79.0 8.0 11.0 19200.0 7987.0 24.0 7.0
9 26 1413051 20180504155156296000000021138084 卡号1 0.01 0.99 500 0.80 1.00 0.088 ... 2800.0 1520.0 0.0 0.0 0.0 0.0 0.0 0.0 18.0 142.0
10 rows × 90 columns
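Check the missing values of each feature: student_feature is mostly missing (only 1756 of 4754 values are present), so consider dropping it.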
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4754 entries, 0 to 4753
Data columns (total 90 columns):
Unnamed: 0 4754 non-null int64
custid 4754 non-null int64
trade_no 4754 non-null object
bank_card_no 4754 non-null object
low_volume_percent 4752 non-null float64
middle_volume_percent 4752 non-null float64
take_amount_in_later_12_month_highest 4754 non-null int64
trans_amount_increase_rate_lately 4751 non-null float64
trans_activity_month 4752 non-null float64
trans_activity_day 4752 non-null float64
transd_mcc 4752 non-null float64
trans_days_interval_filter 4746 non-null float64
trans_days_interval 4752 non-null float64
regional_mobility 4752 non-null float64
student_feature 1756 non-null float64
repayment_capability 4754 non-null int64
is_high_user 4754 non-null int64
number_of_trans_from_2011 4752 non-null float64
first_transaction_time 4752 non-null float64
historical_trans_amount 4754 non-null int64
historical_trans_day 4752 non-null float64
rank_trad_1_month 4752 non-null float64
trans_amount_3_month 4754 non-null int64
avg_consume_less_12_valid_month 4752 non-null float64
abs 4754 non-null int64
top_trans_count_last_1_month 4752 non-null float64
avg_price_last_12_month 4754 non-null int64
avg_price_top_last_12_valid_month 4650 non-null float64
reg_preference_for_trad 4752 non-null object
trans_top_time_last_1_month 4746 non-null float64
trans_top_time_last_6_month 4746 non-null float64
consume_top_time_last_1_month 4746 non-null float64
consume_top_time_last_6_month 4746 non-null float64
cross_consume_count_last_1_month 4328 non-null float64
trans_fail_top_count_enum_last_1_month 4738 non-null float64
trans_fail_top_count_enum_last_6_month 4738 non-null float64
trans_fail_top_count_enum_last_12_month 4738 non-null float64
consume_mini_time_last_1_month 4728 non-null float64
max_cumulative_consume_later_1_month 4754 non-null int64
max_consume_count_later_6_month 4746 non-null float64
railway_consume_count_last_12_month 4742 non-null float64
pawns_auctions_trusts_consume_last_1_month 4754 non-null int64
pawns_auctions_trusts_consume_last_6_month 4754 non-null int64
jewelry_consume_count_last_6_month 4742 non-null float64
status 4754 non-null int64
source 4754 non-null object
first_transaction_day 4752 non-null float64
trans_day_last_12_month 4752 non-null float64
id_name 4478 non-null object
apply_score 4450 non-null float64
apply_credibility 4450 non-null float64
query_org_count 4450 non-null float64
query_finance_count 4450 non-null float64
query_cash_count 4450 non-null float64
query_sum_count 4450 non-null float64
latest_query_time 4450 non-null object
latest_one_month_apply 4450 non-null float64
latest_three_month_apply 4450 non-null float64
latest_six_month_apply 4450 non-null float64
loans_score 4457 non-null float64
loans_credibility_behavior 4457 non-null float64
loans_count 4457 non-null float64
loans_settle_count 4457 non-null float64
loans_overdue_count 4457 non-null float64
loans_org_count_behavior 4457 non-null float64
consfin_org_count_behavior 4457 non-null float64
loans_cash_count 4457 non-null float64
latest_one_month_loan 4457 non-null float64
latest_three_month_loan 4457 non-null float64
latest_six_month_loan 4457 non-null float64
history_suc_fee 4457 non-null float64
history_fail_fee 4457 non-null float64
latest_one_month_suc 4457 non-null float64
latest_one_month_fail 4457 non-null float64
loans_long_time 4457 non-null float64
loans_latest_time 4457 non-null object
loans_credit_limit 4457 non-null float64
loans_credibility_limit 4457 non-null float64
loans_org_count_current 4457 non-null float64
loans_product_count 4457 non-null float64
loans_max_limit 4457 non-null float64
loans_avg_limit 4457 non-null float64
consfin_credit_limit 4457 non-null float64
consfin_credibility 4457 non-null float64
consfin_org_count_current 4457 non-null float64
consfin_product_count 4457 non-null float64
consfin_max_limit 4457 non-null float64
consfin_avg_limit 4457 non-null float64
latest_query_day 4450 non-null float64
loans_latest_day 4457 non-null float64
dtypes: float64(70), int64(13), object(7)
memory usage: 3.3+ MB
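The non-null counts above are easier to compare as ratios. In the listing, student_feature has 1756 non-null values out of 4754, so roughly 63% of it is missing. A minimal sketch of that calculation follows; the toy frame and the 0.5 threshold are assumptions for illustration, not part of the original analysis.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `data`; the values are made up.
toy = pd.DataFrame({
    "student_feature": [1.0, np.nan, np.nan, np.nan],
    "apply_score": [500.0, 520.0, np.nan, 610.0],
    "status": [0, 1, 0, 0],
})

# Fraction of missing values per column, largest first.
missing_ratio = toy.isnull().mean().sort_values(ascending=False)
print(missing_ratio)

# Columns whose missing share exceeds a chosen threshold (0.5 is an assumption).
to_drop = missing_ratio[missing_ratio > 0.5].index.tolist()
print(to_drop)
```

On the real data the same two lines could be run against data instead of toy.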
View summary statistics for the numeric columns
data.describe()
Unnamed: 0 custid low_volume_percent middle_volume_percent take_amount_in_later_12_month_highest trans_amount_increase_rate_lately trans_activity_month trans_activity_day transd_mcc trans_days_interval_filter ... loans_max_limit loans_avg_limit consfin_credit_limit consfin_credibility consfin_org_count_current consfin_product_count consfin_max_limit consfin_avg_limit latest_query_day loans_latest_day
count 4754.000000 4.754000e+03 4752.000000 4752.000000 4754.000000 4751.000000 4752.000000 4752.000000 4752.000000 4746.000000 ... 4457.000000 4457.000000 4457.000000 4457.000000 4457.000000 4457.000000 4457.000000 4457.000000 4450.000000 4457.000000
mean 6008.414178 1.690993e+06 0.021806 0.901294 1940.197728 14.160674 0.804411 0.365425 17.502946 29.029920 ... 3390.038142 1820.357864 9187.009199 76.042630 4.732331 5.227507 16153.690823 8007.696881 24.112809 55.181512
std 3452.071428 1.034235e+06 0.041527 0.144856 3923.971494 694.180473 0.196920 0.170196 4.475616 22.722432 ... 1474.206546 583.418291 7371.257043 14.536819 2.974596 3.409292 14301.037628 5679.418585 37.725724 53.486408
min 5.000000 1.140000e+02 0.000000 0.000000 0.000000 0.000000 0.120000 0.033000 2.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 -2.000000 -2.000000
25% 3106.000000 7.593358e+05 0.010000 0.880000 0.000000 0.615000 0.670000 0.233000 15.000000 16.000000 ... 2300.000000 1535.000000 4800.000000 77.000000 2.000000 3.000000 7800.000000 4737.000000 5.000000 10.000000
50% 6006.500000 1.634942e+06 0.010000 0.960000 500.000000 0.970000 0.860000 0.350000 17.000000 23.000000 ... 3100.000000 1810.000000 7700.000000 79.000000 4.000000 5.000000 13800.000000 7050.000000 14.000000 36.000000
75% 8999.000000 2.597905e+06 0.020000 0.990000 2000.000000 1.600000 1.000000 0.480000 20.000000 32.000000 ... 4300.000000 2100.000000 11700.000000 80.000000 7.000000 7.000000 20400.000000 10000.000000 24.000000 91.000000
max 11992.000000 4.004694e+06 1.000000 1.000000 68000.000000 47596.740000 1.000000 0.941000 42.000000 285.000000 ... 10000.000000 6900.000000 87100.000000 87.000000 18.000000 20.000000 266400.000000 82800.000000 360.000000 323.000000
8 rows × 83 columns
data.shape
(4754, 90)
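Step (5) of the outline, counting the categories in each categorical variable, is not shown in the code. One possible way to do it is sketched below on a toy frame; the category labels are hypothetical, since the real values of reg_preference_for_trad and source are not printed in this post.

```python
import pandas as pd

toy = pd.DataFrame({
    "reg_preference_for_trad": ["A", "B", "A", None],  # hypothetical labels
    "source": ["x", "x", "x", "x"],                    # hypothetical constant column
    "status": [0, 1, 0, 0],
})

# Number of distinct non-null values in each non-numeric column.
cat_counts = toy.select_dtypes(exclude="number").nunique()
print(cat_counts)

# Columns with a single distinct value carry no information for a model.
constant_cols = cat_counts[cat_counts == 1].index.tolist()
print(constant_cols)
```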
Check the distribution of the target variable's labels
data.groupby('status').size()
status
0 3561
1 1193
dtype: int64
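The counts above can also be read as proportions, which shows how imbalanced the labels are. The sketch below rebuilds a label column from the printed counts (3561 and 1193) rather than loading the file.

```python
import pandas as pd

# Rebuild a label column with the counts printed above.
status = pd.Series([0] * 3561 + [1] * 1193, name="status")

# Share of each label instead of raw counts.
ratio = status.value_counts(normalize=True)
print(ratio.round(3))

# Accuracy of always predicting the majority class, a floor for later models.
majority_baseline = ratio.max()
print(round(majority_baseline, 3))
```

About a quarter of the samples are overdue, so a model that always predicts 0 would already reach roughly 75% accuracy.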
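Drop irrelevant variables. Unnamed: 0 can be treated as irrelevant; custid is the customer ID; trade_no is the transaction serial number; bank_card_no is the bank card number; id_name is the customer's name; source takes a single unique value. student_feature is dropped as well because most of its values are missing.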
data = data.drop(["Unnamed: 0","custid","trade_no","bank_card_no","id_name","source","student_feature"],axis=1)
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
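Modeling workflow:
1. Separate the columns by data type and handle missing values
2. Standardize the data
3. Split the dataset
4. Build the models
5. Evaluate the results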
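The standardization step in this workflow does not appear in the code that follows. A minimal sketch of one way to add it is shown here, assuming scikit-learn's StandardScaler inside a pipeline and using synthetic data in place of the loan features. Logistic regression and SVM are sensitive to feature scale, while decision trees are not.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the loan features; not the real data.
X, y = make_classification(n_samples=500, n_features=10, random_state=2018)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=2018
)

# The pipeline fits the scaler on the training rows only, then reuses
# those statistics when transforming the test rows.
model = make_pipeline(StandardScaler(), LogisticRegression(random_state=2018))
model.fit(X_train, y_train)
acc = model.score(X_test, y_test)
print(acc)
```

Fitting the scaler after the split keeps test-set statistics out of training.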
# String-typed columns
data_classify = data[['reg_preference_for_trad','latest_query_time','loans_latest_time']]
target=data['status']
# Numeric columns (this frame still includes the status label)
data_num=data.drop(['reg_preference_for_trad','latest_query_time','loans_latest_time'],axis=1)
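Handle missing values separately: fill numeric columns with the column mean, and fill string columns with the next valid value (backward fill)
data_num=data_num.fillna(data_num.mean())
data_classify= data_classify.bfill()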
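Backward fill and forward fill differ in which neighbor supplies the value, and a backward fill leaves a trailing gap unfilled. The toy series below is only an illustration of that behavior.

```python
import numpy as np
import pandas as pd

s = pd.Series(["a", np.nan, "b", np.nan])

back = s.bfill()      # each gap takes the next valid value
forward = s.ffill()   # each gap takes the previous valid value

print(back.tolist())
print(forward.tolist())
```

If the last rows of a column are missing, a backward fill alone will not remove every NaN, so it is worth checking isnull().sum() afterwards.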
# One-hot encode the string-typed column
dummies=pd.get_dummies(data_classify['reg_preference_for_trad'],prefix='reg_preference_for_trad')
data = pd.get_dummies(data,columns=["reg_preference_for_trad"])
# convert_objects was removed in pandas 1.0; infer_objects is the closest soft conversion
data = data.infer_objects()
print(data.dtypes.value_counts())
float64 69
int64 11
uint8 5
object 2
dtype: int64
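To make the encoding step concrete, here is a small sketch of what pd.get_dummies does to one column. The category labels are hypothetical placeholders; the real column has five categories according to the dtype summary above.

```python
import pandas as pd

toy = pd.DataFrame({
    "reg_preference_for_trad": ["tier_a", "tier_b", "tier_a"],  # hypothetical labels
    "status": [0, 1, 0],
})

# The original column is replaced by one indicator column per category.
encoded = pd.get_dummies(toy, columns=["reg_preference_for_trad"])
print(encoded.columns.tolist())
```

Recent pandas versions return these indicator columns as bool rather than uint8, so the dtype counts may differ from the output shown above.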
# Process the dates
# latest_query_time and loans_latest_time are stored as object (string), not datetime
data_classify['latest_query_time'] = pd.to_datetime(data_classify['latest_query_time'])
data_classify['loans_latest_time'] = pd.to_datetime(data_classify['loans_latest_time'])
# Month
data_classify['latest_query_time_month'] = data_classify['latest_query_time'].dt.month
data_classify['loans_latest_time_month'] = data_classify['loans_latest_time'].dt.month
# Day of the week (Monday=0)
data_classify['latest_query_time_week'] = data_classify['latest_query_time'].dt.weekday
data_classify['loans_latest_time_week'] = data_classify['loans_latest_time'].dt.weekday
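The date handling can be checked on a few values first. The sketch below uses made-up date strings and errors="coerce", which is an assumption added here so that missing or unparseable entries become NaT instead of raising.

```python
import pandas as pd

raw = pd.Series(["2018-05-07", "2018-04-15", None])

# errors="coerce" turns unparseable or missing entries into NaT.
parsed = pd.to_datetime(raw, errors="coerce")

month = parsed.dt.month        # 1-12
weekday = parsed.dt.weekday    # Monday=0 ... Sunday=6
print(month.tolist(), weekday.tolist())
```

Note that dt.weekday returns the day of the week, not the week number.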
y = data['status']
x = data.drop('status', axis=1)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=2018)
print(x_train.shape,y_train.shape)
print(x_test.shape,y_test.shape)
(3327, 86) (3327,)
(1427, 86) (1427,)
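Judging by the dtype summary earlier, data still holds two object (date string) columns, and the missing values were filled only in data_num, so x as built here may be rejected by the estimators. One possible way to assemble a fully numeric feature matrix from the processed pieces is sketched below. The toy frames mimic data_num and data_classify with made-up values, and stratify=target is an addition that is not in the original split.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-ins for the pieces built earlier; the values are made up.
data_num = pd.DataFrame({
    "apply_score": [500.0, 520.0, 610.0, 580.0, 450.0, 530.0, 600.0, 470.0],
    "status": [0, 1, 0, 0, 1, 0, 0, 1],
})
data_classify = pd.DataFrame({
    "reg_preference_for_trad": ["a", "b", "a", "b", "a", "a", "b", "b"],
    "latest_query_time_month": [5, 4, 5, 4, 5, 4, 5, 4],
})

# Keep only numeric columns: filled numerics, one-hot dummies, date parts.
dummies = pd.get_dummies(
    data_classify["reg_preference_for_trad"], prefix="reg_preference_for_trad"
)
date_parts = data_classify[["latest_query_time_month"]]
features = pd.concat([data_num.drop(columns="status"), dummies, date_parts], axis=1)
target = data_num["status"]

# stratify keeps the overdue share similar in both splits.
x_train, x_test, y_train, y_test = train_test_split(
    features, target, test_size=0.25, random_state=2018, stratify=target
)
print(x_train.shape, x_test.shape)
```

With the real data, the resulting column count would differ from the 86 shown above, because the raw date columns are replaced by derived month and weekday features.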
# train three models and score them
lr = LogisticRegression(random_state=2018)
lr.fit(x_train,y_train)
svm = SVC(random_state=2018)
svm.fit(x_train,y_train)
dtree = DecisionTreeClassifier(random_state=2018)
dtree.fit(x_train, y_train)
score_lr = lr.score(x_test,y_test)
score_svm = svm.score(x_test,y_test)
score_dtree = dtree.score(x_test,y_test)
print("LogisticRegression: ", score_lr)
print("SVM: ", score_svm)
print("DecisionTreeClassifier: ", score_dtree)
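The score method reports accuracy, which can look good on this task simply because about 75% of the labels are 0. A hedged sketch of additional metrics follows, using synthetic data with a similar imbalance rather than the loan data; max_iter=1000 is an assumption chosen to avoid convergence warnings.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in (about 75% class 0); not the loan data.
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.75, 0.25], random_state=2018
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=2018, stratify=y
)

clf = LogisticRegression(max_iter=1000, random_state=2018).fit(X_train, y_train)
pred = clf.predict(X_test)
proba = clf.predict_proba(X_test)[:, 1]  # probability of the positive class

metrics = {
    "precision": precision_score(y_test, pred),
    "recall": recall_score(y_test, pred),
    "f1": f1_score(y_test, pred),
    "auc": roc_auc_score(y_test, proba),
}
print(metrics)
```

Recall on the overdue class is usually the number to watch here, since a missed overdue loan tends to cost more than a false alarm.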