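Contents

Steps:

1. Import the data;
2. Basic properties:
(1) View the first 10 rows;

(2) Check the data size (number of rows and columns);

(3) Distribution of the target labels;

(4) Features of the data (column names);

(5) Number of categories in each categorical variable;

(6) Mean, median, etc. of the continuous variables;

(7) Handling missing values;

(8) One-hot encoding of the categorical variables;

(9) Handling the time fields;

(10) Building several individual models.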
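Steps (5) and (6) have no dedicated cell of their own further down, so here is a minimal sketch of how they might look. It runs on a small made-up frame; the column names `city` and `amount` are hypothetical and do not come from the dataset.

```python
import pandas as pd

# Toy frame standing in for the real data; column names are hypothetical.
toy = pd.DataFrame({
    "city": ["A", "B", "A", "C", None],
    "amount": [10.0, 20.0, 30.0, 40.0, 100.0],
})

# (5) number of distinct categories in each non-numeric column
#     (nunique ignores missing values by default)
n_categories = toy.select_dtypes(exclude="number").nunique()

# (6) mean and median of each numeric column
num_summary = toy.select_dtypes(include="number").agg(["mean", "median"])

print(n_categories)
print(num_summary)
```

On the real data the same two calls could be applied to `data` directly once it is loaded.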

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
data=pd.read_csv(r'/....../data2.csv',encoding='gbk')
data.head(10)
 Unnamed: 0  custid  trade_no    bank_card_no    low_volume_percent  middle_volume_percent   take_amount_in_later_12_month_highest   trans_amount_increase_rate_lately   trans_activity_month    trans_activity_day  ... loans_max_limit loans_avg_limit consfin_credit_limit    consfin_credibility consfin_org_count_current   consfin_product_count   consfin_max_limit   consfin_avg_limit   latest_query_day    loans_latest_day
0   5   2791858 20180507115231274000000023057383    卡号1 0.01    0.99    0   0.90    0.55    0.313   ... 2900.0  1688.0  1200.0  75.0    1.0 2.0 1200.0  1200.0  12.0    18.0
1   10  534047  20180507121002192000000023073000    卡号1 0.02    0.94    2000    1.28    1.00    0.458   ... 3500.0  1758.0  15100.0 80.0    5.0 6.0 22800.0 9360.0  4.0 2.0
2   12  2849787 20180507125159718000000023114911    卡号1 0.04    0.96    0   1.00    1.00    0.114   ... 1600.0  1250.0  4200.0  87.0    1.0 1.0 4200.0  4200.0  2.0 6.0
3   13  1809708 20180507121358683000000388283484    卡号1 0.00    0.96    2000    0.13    0.57    0.777   ... 3200.0  1541.0  16300.0 80.0    5.0 5.0 30000.0 12180.0 2.0 4.0
4   14  2499829 20180507115448545000000388205844    卡号1 0.01    0.99    0   0.46    1.00    0.175   ... 2300.0  1630.0  8300.0  79.0    2.0 2.0 8400.0  8250.0  22.0    120.0
5   15  518072  20180507121233054000000388275132    卡号1 0.02    0.98    2000    7.59    1.00    0.733   ... 5300.0  1941.0  11200.0 80.0    10.0    12.0    20400.0 8130.0  3.0 4.0
6   16  1205125 20180507121931540000000388298915    卡号1 0.02    0.98    0   23.67   0.94    0.087   ... 2200.0  2200.0  7600.0  73.0    2.0 2.0 16800.0 8900.0  1.0 3.0
7   18  1129897 20180507124659235000000023105807    卡号1 0.02    0.98    0   0.25    0.88    0.302   ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
8   20  2599411 20180507115855621000000388224458    卡号1 0.03    0.65    0   0.31    0.76    0.472   ... 5300.0  4750.0  5500.0  79.0    8.0 11.0    19200.0 7987.0  24.0    7.0
9   26  1413051 20180504155156296000000021138084    卡号1 0.01    0.99    500 0.80    1.00    0.088   ... 2800.0  1520.0  0.0 0.0 0.0 0.0 0.0 0.0 18.0    142.0
10 rows × 90 columns

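Check how many values are missing in each feature: student_feature is mostly missing (only 1756 of 4754 rows are non-null), so consider dropping it.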

data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4754 entries, 0 to 4753
Data columns (total 90 columns):
Unnamed: 0                                    4754 non-null int64
custid                                        4754 non-null int64
trade_no                                      4754 non-null object
bank_card_no                                  4754 non-null object
low_volume_percent                            4752 non-null float64
middle_volume_percent                         4752 non-null float64
take_amount_in_later_12_month_highest         4754 non-null int64
trans_amount_increase_rate_lately             4751 non-null float64
trans_activity_month                          4752 non-null float64
trans_activity_day                            4752 non-null float64
transd_mcc                                    4752 non-null float64
trans_days_interval_filter                    4746 non-null float64
trans_days_interval                           4752 non-null float64
regional_mobility                             4752 non-null float64
student_feature                               1756 non-null float64
repayment_capability                          4754 non-null int64
is_high_user                                  4754 non-null int64
number_of_trans_from_2011                     4752 non-null float64
first_transaction_time                        4752 non-null float64
historical_trans_amount                       4754 non-null int64
historical_trans_day                          4752 non-null float64
rank_trad_1_month                             4752 non-null float64
trans_amount_3_month                          4754 non-null int64
avg_consume_less_12_valid_month               4752 non-null float64
abs                                           4754 non-null int64
top_trans_count_last_1_month                  4752 non-null float64
avg_price_last_12_month                       4754 non-null int64
avg_price_top_last_12_valid_month             4650 non-null float64
reg_preference_for_trad                       4752 non-null object
trans_top_time_last_1_month                   4746 non-null float64
trans_top_time_last_6_month                   4746 non-null float64
consume_top_time_last_1_month                 4746 non-null float64
consume_top_time_last_6_month                 4746 non-null float64
cross_consume_count_last_1_month              4328 non-null float64
trans_fail_top_count_enum_last_1_month        4738 non-null float64
trans_fail_top_count_enum_last_6_month        4738 non-null float64
trans_fail_top_count_enum_last_12_month       4738 non-null float64
consume_mini_time_last_1_month                4728 non-null float64
max_cumulative_consume_later_1_month          4754 non-null int64
max_consume_count_later_6_month               4746 non-null float64
railway_consume_count_last_12_month           4742 non-null float64
pawns_auctions_trusts_consume_last_1_month    4754 non-null int64
pawns_auctions_trusts_consume_last_6_month    4754 non-null int64
jewelry_consume_count_last_6_month            4742 non-null float64
status                                        4754 non-null int64
source                                        4754 non-null object
first_transaction_day                         4752 non-null float64
trans_day_last_12_month                       4752 non-null float64
id_name                                       4478 non-null object
apply_score                                   4450 non-null float64
apply_credibility                             4450 non-null float64
query_org_count                               4450 non-null float64
query_finance_count                           4450 non-null float64
query_cash_count                              4450 non-null float64
query_sum_count                               4450 non-null float64
latest_query_time                             4450 non-null object
latest_one_month_apply                        4450 non-null float64
latest_three_month_apply                      4450 non-null float64
latest_six_month_apply                        4450 non-null float64
loans_score                                   4457 non-null float64
loans_credibility_behavior                    4457 non-null float64
loans_count                                   4457 non-null float64
loans_settle_count                            4457 non-null float64
loans_overdue_count                           4457 non-null float64
loans_org_count_behavior                      4457 non-null float64
consfin_org_count_behavior                    4457 non-null float64
loans_cash_count                              4457 non-null float64
latest_one_month_loan                         4457 non-null float64
latest_three_month_loan                       4457 non-null float64
latest_six_month_loan                         4457 non-null float64
history_suc_fee                               4457 non-null float64
history_fail_fee                              4457 non-null float64
latest_one_month_suc                          4457 non-null float64
latest_one_month_fail                         4457 non-null float64
loans_long_time                               4457 non-null float64
loans_latest_time                             4457 non-null object
loans_credit_limit                            4457 non-null float64
loans_credibility_limit                       4457 non-null float64
loans_org_count_current                       4457 non-null float64
loans_product_count                           4457 non-null float64
loans_max_limit                               4457 non-null float64
loans_avg_limit                               4457 non-null float64
consfin_credit_limit                          4457 non-null float64
consfin_credibility                           4457 non-null float64
consfin_org_count_current                     4457 non-null float64
consfin_product_count                         4457 non-null float64
consfin_max_limit                             4457 non-null float64
consfin_avg_limit                             4457 non-null float64
latest_query_day                              4450 non-null float64
loans_latest_day                              4457 non-null float64
dtypes: float64(70), int64(13), object(7)
memory usage: 3.3+ MB
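The non-null counts above can also be turned into a missing ratio per column, which makes heavily missing features such as student_feature (about 63% missing) easier to spot. A minimal sketch on a toy frame follows; the helper name `missing_ratio`, the toy column names and the 0.5 cut-off are all illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def missing_ratio(df):
    """Return the fraction of missing values per column, largest first."""
    return df.isnull().mean().sort_values(ascending=False)

# Toy frame; `mostly_missing` mimics a column like student_feature.
toy = pd.DataFrame({
    "mostly_missing": [1.0, np.nan, np.nan, np.nan],
    "complete": [1, 2, 3, 4],
    "one_gap": [0.5, np.nan, 0.2, 0.1],
})

ratios = missing_ratio(toy)
# The 0.5 cut-off is an arbitrary illustration, not a value from the post.
to_drop = ratios[ratios > 0.5].index.tolist()

print(ratios)
print(to_drop)
```

Whether a column above the cut-off should really be dropped is a judgment call; a missing indicator can sometimes carry signal of its own.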

Check the summary statistics of the numeric columns

data.describe()
 Unnamed: 0  custid  low_volume_percent  middle_volume_percent   take_amount_in_later_12_month_highest   trans_amount_increase_rate_lately   trans_activity_month    trans_activity_day  transd_mcc  trans_days_interval_filter  ... loans_max_limit loans_avg_limit consfin_credit_limit    consfin_credibility consfin_org_count_current   consfin_product_count   consfin_max_limit   consfin_avg_limit   latest_query_day    loans_latest_day
count   4754.000000 4.754000e+03   4752.000000 4752.000000 4754.000000 4751.000000 4752.000000 4752.000000 4752.000000 4746.000000 ... 4457.000000 4457.000000 4457.000000 4457.000000 4457.000000 4457.000000 4457.000000 4457.000000 4450.000000 4457.000000
mean    6008.414178 1.690993e+06   0.021806    0.901294    1940.197728 14.160674   0.804411    0.365425    17.502946   29.029920   ... 3390.038142 1820.357864 9187.009199 76.042630   4.732331    5.227507    16153.690823    8007.696881 24.112809   55.181512
std 3452.071428 1.034235e+06   0.041527    0.144856    3923.971494 694.180473  0.196920    0.170196    4.475616    22.722432   ... 1474.206546 583.418291  7371.257043 14.536819   2.974596    3.409292    14301.037628    5679.418585 37.725724   53.486408
min 5.000000    1.140000e+02   0.000000    0.000000    0.000000    0.000000    0.120000    0.033000    2.000000    0.000000    ... 0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    -2.000000   -2.000000
25% 3106.000000 7.593358e+05   0.010000    0.880000    0.000000    0.615000    0.670000    0.233000    15.000000   16.000000   ... 2300.000000 1535.000000 4800.000000 77.000000   2.000000    3.000000    7800.000000 4737.000000 5.000000    10.000000
50% 6006.500000 1.634942e+06   0.010000    0.960000    500.000000  0.970000    0.860000    0.350000    17.000000   23.000000   ... 3100.000000 1810.000000 7700.000000 79.000000   4.000000    5.000000    13800.000000    7050.000000 14.000000   36.000000
75% 8999.000000 2.597905e+06   0.020000    0.990000    2000.000000 1.600000    1.000000    0.480000    20.000000   32.000000   ... 4300.000000 2100.000000 11700.000000    80.000000   7.000000    7.000000    20400.000000    10000.000000    24.000000   91.000000
max 11992.000000    4.004694e+06   1.000000    1.000000    68000.000000    47596.740000    1.000000    0.941000    42.000000   285.000000  ... 10000.000000    6900.000000 87100.000000    87.000000   18.000000   20.000000   266400.000000   82800.000000    360.000000  323.000000
8 rows × 83 columns
data.shape

(4754, 90)

Check how the target variable is distributed across its labels

data.groupby('status').size()
status
0    3561
1    1193
dtype: int64
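The counts above show that roughly one quarter of the rows carry label 1, so the classes are imbalanced. The sketch below rebuilds a label column from those printed counts (it does not read the dataset) and derives the class shares and the accuracy a model would get by always predicting the majority class.

```python
import pandas as pd

# Rebuild the label column from the counts printed above (3561 zeros, 1193 ones).
status = pd.Series([0] * 3561 + [1] * 1193, name="status")

counts = status.value_counts()
shares = status.value_counts(normalize=True)
print(counts)
print(shares.round(3))

# Accuracy of a model that always predicts the majority class.
majority_baseline = shares.max()
print(f"majority-class baseline accuracy: {majority_baseline:.3f}")
```

Any accuracy reported later is best read against this baseline rather than against 50%.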

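Drop the irrelevant variables. Unnamed: 0 can be treated as irrelevant; custid is the customer ID; trade_no is the transaction serial number; bank_card_no is the bank card number; id_name is the customer's name; source appears to hold a single unique value. student_feature is dropped as well because most of its values are missing.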

data = data.drop(["Unnamed: 0","custid","trade_no","bank_card_no","id_name","source","student_feature"],axis=1)
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
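The columns dropped above were picked by reading their meaning. As a cross-check, single-valued and ID-like columns can also be flagged programmatically. This is only a sketch: the helper names and the toy columns (`record_id`, `channel`, `amount`) are hypothetical.

```python
import pandas as pd

def constant_columns(df):
    """Columns with at most one distinct non-null value."""
    return [c for c in df.columns if df[c].nunique(dropna=True) <= 1]

def id_like_columns(df):
    """Columns whose values are all distinct (typical of IDs and serial numbers)."""
    return [c for c in df.columns if df[c].nunique(dropna=True) == len(df)]

toy = pd.DataFrame({
    "record_id": [101, 102, 103, 104],    # hypothetical ID column
    "channel": ["xs", "xs", "xs", "xs"],  # hypothetical single-valued column
    "amount": [5.0, 7.5, 5.0, 9.0],
})

const_cols = constant_columns(toy)
id_cols = id_like_columns(toy)
print(const_cols)
print(id_cols)
```

A continuous feature can also happen to be all-distinct, so the ID-like list is a set of candidates to inspect, not a list to drop blindly.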
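Modeling workflow: 1. separate the columns by data type and handle missing values; 2. standardize the data; 3. split the dataset; 4. train the models; 5. evaluate the results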
# String (object-typed) variables; .copy() avoids modifying a view of `data` later on
data_classify = data[['reg_preference_for_trad','latest_query_time','loans_latest_time']].copy()
target=data['status']
# Numeric variables
data_num=data.drop(['reg_preference_for_trad','latest_query_time','loans_latest_time'],axis=1)
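Handle missing values separately: numeric columns are filled with the column mean, and string columns are backfilled, i.e. filled with the next valid value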
data_num=data_num.fillna(data_num.mean())
# fillna(method='bfill') is deprecated in recent pandas; .bfill() does the same backfill
data_classify = data_classify.bfill()
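To make the two filling strategies concrete, here is a small sketch on toy frames (the column names `a`, `b`, `c` are made up). It also shows one caveat of backfilling: a gap in the last row has no later value to copy, so a forward fill is chained here as one possible way to cover it.

```python
import numpy as np
import pandas as pd

num = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, 4.0, 8.0]})
cat = pd.DataFrame({"c": ["x", None, "y", None]})

# Numeric columns: replace NaN with the column mean.
num_filled = num.fillna(num.mean())

# String columns: backfill copies the NEXT valid value. The trailing gap has
# nothing after it, so a forward fill is chained to cover that case.
cat_filled = cat.bfill().ffill()

print(num_filled)
print(cat_filled)
```

Filling with statistics of the full dataset before splitting lets test rows influence the fill values; computing the means on the training split only is the stricter option.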
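# One-hot encode the string variable
dummies=pd.get_dummies(data_classify['reg_preference_for_trad'],prefix='reg_preference_for_trad')
# dtype='uint8' keeps the indicator columns as 0/1 integers (newer pandas defaults to bool)
data = pd.get_dummies(data,columns=["reg_preference_for_trad"],dtype='uint8')
# convert_objects() was removed in pandas 1.0; infer_objects() is the soft-conversion replacement
data = data.infer_objects()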
print(data.dtypes.value_counts())
float64    69
int64      11
uint8       5
object      2
dtype: int64
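The five uint8 columns in the output above are the indicator columns created for reg_preference_for_trad. A minimal sketch of what `pd.get_dummies` does, on a toy frame with a hypothetical `tier` column:

```python
import pandas as pd

toy = pd.DataFrame({
    "tier": ["first", "second", "first", "third"],  # hypothetical categorical column
    "amount": [1, 2, 3, 4],
})

# One indicator column per category; the original column is replaced.
encoded = pd.get_dummies(toy, columns=["tier"], dtype="uint8")

# drop_first=True omits the first category, which avoids perfectly
# collinear indicators when the features feed a linear model.
encoded_k_minus_1 = pd.get_dummies(toy, columns=["tier"], dtype="uint8", drop_first=True)

print(encoded.columns.tolist())
print(encoded_k_minus_1.columns.tolist())
```

Whether to drop one level is a modeling choice; tree-based models are generally indifferent to it.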
# Handle dates
# latest_query_time, loans_latest_time, etc. are stored as object (string) columns, not datetimes
data_classify['latest_query_time'] = pd.to_datetime(data_classify['latest_query_time'])
data_classify['loans_latest_time'] = pd.to_datetime(data_classify['loans_latest_time'])
# Month
data_classify['latest_query_time_month'] = data_classify['latest_query_time'].dt.month
data_classify['loans_latest_time_month'] = data_classify['loans_latest_time'].dt.month
# Day of week (Monday=0 ... Sunday=6); .dt.weekday is the weekday, not the week number
data_classify['latest_query_time_week'] = data_classify['latest_query_time'].dt.weekday
data_classify['loans_latest_time_week'] = data_classify['loans_latest_time'].dt.weekday
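The date handling above can be tried in isolation on a toy column. In this sketch the column name `event_time` and its values are made up; `errors="coerce"` is an added assumption that turns missing or unparseable entries into NaT instead of raising. It also contrasts the weekday with the ISO week number.

```python
import pandas as pd

toy = pd.DataFrame({"event_time": ["2018-05-07", "2018-04-25", None]})  # hypothetical column

# errors="coerce" turns unparseable or missing entries into NaT instead of raising.
ts = pd.to_datetime(toy["event_time"], errors="coerce")

toy["event_month"] = ts.dt.month
toy["event_weekday"] = ts.dt.weekday          # Monday=0 ... Sunday=6
toy["event_week"] = ts.dt.isocalendar().week  # ISO week number of the year

print(toy)
```

Rows whose date could not be parsed end up with missing values in all derived columns, so they still need a fill strategy before modeling.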
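y = data['status']
# Caution: `data` was never updated with the filled and date-derived columns, so x still holds
# NaNs and the two raw date-string columns, which the estimators below cannot fit on as-is.
x = data.drop('status', axis=1)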
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=2018)
print(x_train.shape,y_train.shape)
print(x_test.shape,y_test.shape)
(3327, 86) (3327,)
(1427, 86) (1427,)
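The workflow lists standardization as a step, but no scaling code appears in the notebook. Below is a minimal sketch of how splitting and scaling might be combined, assuming a feature matrix that is already fully numeric and free of missing values. The data is synthetic, the feature names are hypothetical, and `stratify=y` is an added option that the original split does not use.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2018)
n = 200

# Synthetic stand-in for a fully numeric, NaN-free feature matrix.
X = pd.DataFrame({
    "f_small": rng.normal(0.0, 1.0, n),
    "f_large": rng.normal(5000.0, 2000.0, n),
})
y = pd.Series((rng.random(n) < 0.25).astype(int), name="status")

# stratify=y keeps the label ratio similar in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=2018, stratify=y
)

# Fit the scaler on the training split only, then reuse it on the test split,
# so no statistics from the test data leak into training.
scaler = StandardScaler().fit(X_train)
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)

print(X_train_std.shape, X_test_std.shape)
```

Scaling matters mostly for the logistic regression and the SVM; a decision tree is unaffected by monotonic rescaling of its inputs.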
# Train three models and score them
lr = LogisticRegression(random_state=2018)
lr.fit(x_train,y_train)
svm = SVC(random_state=2018)
svm.fit(x_train,y_train)
dtree = DecisionTreeClassifier(random_state=2018)
dtree.fit(x_train, y_train)
score_lr = lr.score(x_test,y_test)
score_svm = svm.score(x_test,y_test)
score_dtree = dtree.score(x_test,y_test)
print("LogisticRegression: ", score_lr)
print("SVM: ", score_svm)
print("DecisionTreeClassifier: ", score_dtree)
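The `.score` calls above report plain accuracy, which can look good on a roughly 75% / 25% label split even for a weak model. The sketch below compares the same three model families on synthetic imbalanced data (not the loan dataset) and adds F1 and ROC AUC. The pipelines with `StandardScaler`, the `max_iter=1000` setting and the data generator parameters are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced data (about 75% / 25%); not the loan dataset.
X, y = make_classification(
    n_samples=1000, n_features=20, weights=[0.75, 0.25], random_state=2018
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=2018, stratify=y
)

models = {
    "LogisticRegression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000, random_state=2018)
    ),
    "SVM": make_pipeline(StandardScaler(), SVC(random_state=2018)),
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=2018),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    # ROC AUC needs a continuous score: use decision_function when the
    # model has one, otherwise the predicted probability of class 1.
    if hasattr(model, "decision_function"):
        score = model.decision_function(X_test)
    else:
        score = model.predict_proba(X_test)[:, 1]
    results[name] = {
        "accuracy": accuracy_score(y_test, pred),
        "f1": f1_score(y_test, pred),
        "roc_auc": roc_auc_score(y_test, score),
    }

for name, metrics in results.items():
    print(name, {k: round(v, 3) for k, v in metrics.items()})
```

The numbers printed by this sketch describe the synthetic data only and say nothing about how the models perform on the loan data.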
