Credit Default Risk Prediction (4): Training Models
Previously, based on application_train.csv and application_test.csv, we did some simple feature engineering (three different feature treatments on the recoded data), producing the following files:
- Polynomial Features: poly_train_data.csv, poly_test_data.csv
- Domain Knowledge Features: domain_train_data.csv, domain_test_data.csv
- Featuretools: auto_train_data.csv, auto_test_data.csv
Training model
Logistic Regression
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
# note: Imputer was removed in scikit-learn 0.22; newer versions use sklearn.impute.SimpleImputer
from sklearn.preprocessing import MinMaxScaler, Imputer
from sklearn.linear_model import LogisticRegression
# load data
poly_train = pd.read_csv('data/poly_train_data.csv')
poly_test = pd.read_csv('data/poly_test_data.csv')
domain_train = pd.read_csv('data/domain_train_data.csv')
domain_test = pd.read_csv('data/domain_test_data.csv')
auto_train = pd.read_csv('data/auto_train_data.csv')
auto_test = pd.read_csv('data/auto_test_data.csv')
Missing-value imputation and scaling
target = poly_train['TARGET']
Id = poly_test[['SK_ID_CURR']]
Polynomial features
poly_train = poly_train.drop(['TARGET'], axis = 1)
# feature names
poly_features = list(poly_train.columns)
# fill missing values with the median
imputer = Imputer(strategy = 'median')
# scale feature to 0-1
scaler = MinMaxScaler(feature_range = (0, 1))
# fit train data
imputer.fit(poly_train)
# Transform train test data
poly_train = imputer.transform(poly_train)
poly_test = imputer.transform(poly_test)
# scaler
scaler.fit(poly_train)
poly_train = scaler.transform(poly_train)
poly_test = scaler.transform(poly_test)
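A side note: `Imputer` was removed in scikit-learn 0.22. On newer versions, the same median-impute plus 0-1 scaling step can be written with `SimpleImputer`, for example inside a `Pipeline` so the train-set statistics are reused on the test set. A minimal sketch on toy data (not the competition files):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import Pipeline

# median imputation followed by 0-1 scaling, fit on train only
preprocess = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', MinMaxScaler(feature_range=(0, 1))),
])

X_train = np.array([[1.0, np.nan], [2.0, 10.0], [3.0, 30.0]])
X_test = np.array([[np.nan, 50.0]])

X_train_t = preprocess.fit_transform(X_train)  # fit on train only
X_test_t = preprocess.transform(X_test)        # reuse train statistics
print(X_train_t.min(), X_train_t.max())        # train values land in [0, 1]
```

Keeping both steps in one pipeline avoids the subtle leak of fitting the imputer or scaler on test data.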
domain features
domain_train = domain_train.drop(['TARGET'], axis = 1)
domain_features = list(domain_train.columns)
# fit train data
imputer.fit(domain_train)
# Transform train test data
domain_train = imputer.transform(domain_train)
domain_test = imputer.transform(domain_test)
# scaler
scaler.fit(domain_train)
domain_train = scaler.transform(domain_train)
domain_test = scaler.transform(domain_test)
Featuretools
auto_train = auto_train.drop(['TARGET'], axis = 1)
auto_features = list(auto_train.columns)
# fit train data
imputer.fit(auto_train)
# Transform train test data
auto_train = imputer.transform(auto_train)
auto_test = imputer.transform(auto_test)
# scaler
scaler.fit(auto_train)
auto_train = scaler.transform(auto_train)
auto_test = scaler.transform(auto_test)
print('poly_train',poly_train.shape)
print('poly_test',poly_test.shape)
print('domain_train',domain_train.shape)
print('domain_test',domain_test.shape)
print('auto_train',auto_train.shape)
print('auto_test',auto_test.shape)
poly_train (307511, 274)
poly_test (48744, 274)
domain_train (307511, 244)
domain_test (48744, 244)
auto_train (307511, 239)
auto_test (48744, 239)
Logistic Regression
lr = LogisticRegression(C = 0.0001, class_weight = 'balanced') # C: inverse regularization strength
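For context, `class_weight='balanced'` reweights each class inversely to its frequency, as `n_samples / (n_classes * np.bincount(y))`, which matters here because defaults are a small minority. A quick sketch of the weights it produces, on toy labels rather than the competition data:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# imbalanced toy labels: 8 negatives, 2 positives
y = np.array([0] * 8 + [1] * 2)
# 'balanced' weight = n_samples / (n_classes * class count)
weights = compute_class_weight(class_weight='balanced', classes=np.array([0, 1]), y=y)
print(weights)  # the rare positive class gets the larger weight
```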
Polynomial
lr.fit(poly_train, target)
lr_poly_pred = lr.predict_proba(poly_test)[:,1]
# submission dataframe
submit = Id.copy()
submit['TARGET'] = lr_poly_pred
submit.to_csv('lr_poly_submit.csv',index = False)
Domain Knowledge
lr = LogisticRegression(C = 0.0001,class_weight='balanced',solver='sag')
lr.fit(domain_train, target)
lr_domain_pred = lr.predict_proba(domain_test)[:,1]
submit = Id.copy()
submit['TARGET'] = lr_domain_pred
submit.to_csv('lr_domain_submit.csv',index = False)
Featuretools
lr = LogisticRegression(C = 0.0001,class_weight='balanced',solver='sag')
lr.fit(auto_train, target)
lr_auto_pred = lr.predict_proba(auto_test)[:,1]
submit = Id.copy()
submit['TARGET'] = lr_auto_pred
submit.to_csv('lr_auto_submit.csv',index = False)
Online leaderboard scores:
- Polynomial: 0.723
- Domain: 0.670
- Featuretools: 0.669
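The scores above come from the online evaluation; the same kind of comparison can be run offline by holding out part of the training data and scoring with `roc_auc_score`. A sketch on synthetic data, since the real test labels are not public:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# synthetic imbalanced data standing in for the real features
X, y = make_classification(n_samples=2000, n_features=20, weights=[0.92],
                           random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2,
                                            stratify=y, random_state=0)
lr = LogisticRegression(C=0.0001, class_weight='balanced')
lr.fit(X_tr, y_tr)
auc = roc_auc_score(y_val, lr.predict_proba(X_val)[:, 1])
print('validation AUC: %.3f' % auc)
```

A stratified hold-out like this usually tracks the leaderboard ranking well enough to compare feature sets without submitting.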
Next, let's upgrade the algorithm and train a random forest on the same three datasets.
Random Forest
from sklearn.ensemble import RandomForestClassifier
Polynomial
random_forest = RandomForestClassifier(n_estimators = 100, random_state = 55, verbose = 1,n_jobs = -1)
random_forest.fit(poly_train, target)
# extract feature importances
poly_importance_feature_values = random_forest.feature_importances_
poly_importance_features = pd.DataFrame({'feature':poly_features,'importance':poly_importance_feature_values})
rf_poly_pred = random_forest.predict_proba(poly_test)[:,1]
# submission
submit = Id.copy()
submit['TARGET'] = rf_poly_pred
submit.to_csv('rf_poly_submit.csv', index = False)
[Parallel(n_jobs=-1)]: Done 42 tasks | elapsed: 1.6min
[Parallel(n_jobs=-1)]: Done 100 out of 100 | elapsed: 3.7min finished
[Parallel(n_jobs=4)]: Done 42 tasks | elapsed: 0.4s
[Parallel(n_jobs=4)]: Done 100 out of 100 | elapsed: 0.9s finished
Feature importance
poly_importance_features = poly_importance_features.set_index(['feature'])
poly_importance_features.sort_values(by = 'importance').plot(kind='barh',figsize=(10, 120))
Based on the chart above we can do some feature selection: drop the features that contribute nothing, which also reduces the dimensionality of the data. Next, one more upgrade, to a heavy hitter of machine learning: the Light Gradient Boosting Machine.
from sklearn.model_selection import KFold
from sklearn.metrics import roc_auc_score
import lightgbm as lgb
import gc
import numpy as np
import warnings
warnings.filterwarnings('ignore')
def model(features, test_features, n_folds = 10):
    # extract the ID columns
    train_ids = features['SK_ID_CURR']
    test_ids = test_features['SK_ID_CURR']
    # TARGET
    labels = features['TARGET']
    # drop ID and TARGET
    features = features.drop(['SK_ID_CURR', 'TARGET'], axis = 1)
    test_features = test_features.drop(['SK_ID_CURR'], axis = 1)
    # feature names
    feature_names = list(features.columns)
    # randomly split the train data into n_folds parts: train on n-1, validate on 1
    k_fold = KFold(n_splits = n_folds, shuffle = True, random_state = 50)
    # test predictions
    test_predictions = np.zeros(test_features.shape[0])
    # out-of-fold validation predictions
    out_of_fold = np.zeros(features.shape[0])
    # record the scores of each fold
    valid_scores = []
    train_scores = []
    # iterate through each fold
    count = 0
    for train_indices, valid_indices in k_fold.split(features):
        # training data for the fold
        train_features = features.loc[train_indices, :]
        train_labels = labels.loc[train_indices]
        # validation data for the fold
        valid_features = features.loc[valid_indices, :]
        valid_labels = labels.loc[valid_indices]
        # create the model
        model = lgb.LGBMClassifier(n_estimators = 10000, objective = 'binary',
                                   class_weight = 'balanced', learning_rate = 0.05,
                                   reg_alpha = 0.1, reg_lambda = 0.1,
                                   subsample = 0.8, n_jobs = -1, random_state = 50)
        # train the model
        model.fit(train_features, train_labels, eval_metric = 'auc',
                  eval_set = [(valid_features, valid_labels), (train_features, train_labels)],
                  eval_names = ['valid', 'train'], categorical_feature = 'auto',
                  early_stopping_rounds = 100, verbose = 200)
        # record the best iteration
        best_iteration = model.best_iteration_
        # test-set predictions, averaged over the folds
        test_predictions += model.predict_proba(test_features, num_iteration = best_iteration)[:, 1] / n_folds
        # out-of-fold predictions on the validation part
        out_of_fold[valid_indices] = model.predict_proba(valid_features, num_iteration = best_iteration)[:, 1]
        # record the best scores
        valid_score = model.best_score_['valid']['auc']
        train_score = model.best_score_['train']['auc']
        valid_scores.append(valid_score)
        train_scores.append(train_score)
        # clean up memory
        gc.enable()
        del model, train_features, valid_features
        gc.collect()
        count += 1
        print("%d_fold is over" % count)
    # make the submission dataframe
    submission = pd.DataFrame({'SK_ID_CURR': test_ids, 'TARGET': test_predictions})
    # overall validation score
    valid_auc = roc_auc_score(labels, out_of_fold)
    # add the overall scores to the metrics
    valid_scores.append(valid_auc)
    train_scores.append(np.mean(train_scores))
    # dataframe of validation scores
    fold_names = list(range(n_folds))
    fold_names.append('overall')
    metrics = pd.DataFrame({'fold': fold_names,
                            'train': train_scores,
                            'valid': valid_scores})
    return submission, metrics
# load data
poly_train = pd.read_csv('data/poly_train_data.csv')
poly_test = pd.read_csv('data/poly_test_data.csv')
print('poly_train:',poly_train.shape)
print('poly_test:',poly_test.shape)
poly_train: (307511, 275)
poly_test: (48744, 274)
Select Features
Rank the features by importance, from smallest to largest:
poly_importance_features = poly_importance_features.sort_values(by = 'importance')
Drop these 20 features:
poly_importance_features.head(20).plot(kind = 'barh')
s_train_1 = poly_train.copy()
s_test_1 = poly_test.copy()
# names of the columns to drop
drop_feature_names = poly_importance_features.index[:20]
# drop the 20 least important features
s_train_1 = s_train_1.drop(drop_feature_names, axis = 1)
s_test_1 = s_test_1.drop(drop_feature_names, axis = 1)
submit2, metrics2 = model(s_train_1, s_test_1, n_folds = 5)
Training until validation scores don't improve for 100 rounds.
[200] train's auc: 0.800686 valid's auc: 0.755447
[400] train's auc: 0.831722 valid's auc: 0.755842
Early stopping, best iteration is:
[351] train's auc: 0.824767 valid's auc: 0.756092
Training until validation scores don't improve for 100 rounds.
[200] train's auc: 0.800338 valid's auc: 0.757507
[400] train's auc: 0.831318 valid's auc: 0.757378
Early stopping, best iteration is:
[307] train's auc: 0.818238 valid's auc: 0.757819
Training until validation scores don't improve for 100 rounds.
[200] train's auc: 0.799557 valid's auc: 0.762719
Early stopping, best iteration is:
[160] train's auc: 0.791849 valid's auc: 0.763023
Training until validation scores don't improve for 100 rounds.
[200] train's auc: 0.80053 valid's auc: 0.758546
Early stopping, best iteration is:
[224] train's auc: 0.804828 valid's auc: 0.758703
Training until validation scores don't improve for 100 rounds.
[200] train's auc: 0.799826 valid's auc: 0.758312
[400] train's auc: 0.831271 valid's auc: 0.758623
Early stopping, best iteration is:
[319] train's auc: 0.819603 valid's auc: 0.758971
metrics2
| | fold | train | valid |
|---|---|---|---|
| 0 | 0 | 0.824767 | 0.756092 |
| 1 | 1 | 0.818238 | 0.757819 |
| 2 | 2 | 0.791849 | 0.763023 |
| 3 | 3 | 0.804828 | 0.758703 |
| 4 | 4 | 0.819603 | 0.758971 |
| 5 | overall | 0.811857 | 0.758799 |
submit2.to_csv('submit2.csv',index = False)
Leaderboard score: 0.734
Dropping a few features improved the score slightly. Let's try removing 30.
s_train_2 = poly_train.copy()
s_test_2 = poly_test.copy()
# names of the columns to drop
drop_feature_names = poly_importance_features.index[:30]
# drop the 30 least important features
s_train_2 = s_train_2.drop(drop_feature_names, axis = 1)
s_test_2 = s_test_2.drop(drop_feature_names, axis = 1)
submit3, metrics3 = model(s_train_2, s_test_2, n_folds = 5)
Training until validation scores don't improve for 100 rounds.
[200] train's auc: 0.800547 valid's auc: 0.755442
Early stopping, best iteration is:
[267] train's auc: 0.81211 valid's auc: 0.755868
Training until validation scores don't improve for 100 rounds.
[200] train's auc: 0.80048 valid's auc: 0.757653
Early stopping, best iteration is:
[258] train's auc: 0.81057 valid's auc: 0.758107
Training until validation scores don't improve for 100 rounds.
[200] train's auc: 0.799261 valid's auc: 0.76291
Early stopping, best iteration is:
[189] train's auc: 0.797314 valid's auc: 0.762962
Training until validation scores don't improve for 100 rounds.
[200] train's auc: 0.800499 valid's auc: 0.758385
Early stopping, best iteration is:
[202] train's auc: 0.800851 valid's auc: 0.758413
Training until validation scores don't improve for 100 rounds.
[200] train's auc: 0.799977 valid's auc: 0.758234
Early stopping, best iteration is:
[284] train's auc: 0.814454 valid's auc: 0.758612
metrics3
| | fold | train | valid |
|---|---|---|---|
| 0 | 0 | 0.812110 | 0.755868 |
| 1 | 1 | 0.810570 | 0.758107 |
| 2 | 2 | 0.797314 | 0.762962 |
| 3 | 3 | 0.800851 | 0.758413 |
| 4 | 4 | 0.814454 | 0.758612 |
| 5 | overall | 0.807060 | 0.758735 |
submit3.to_csv('submit3.csv',index = False)
Leaderboard score: 0.733, basically no change. Using only the single main application table is clearly not enough; this was mostly for fun.
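To go beyond the main table, the dataset's other tables (keyed by SK_ID_CURR, with several rows per client) would need to be aggregated to one row per client and joined on. A hypothetical sketch of that pattern; the table and column names here are illustrative, not taken from this post:

```python
import pandas as pd

# hypothetical secondary table: multiple records per client
bureau = pd.DataFrame({
    'SK_ID_CURR': [1, 1, 2],
    'AMT_CREDIT_SUM': [1000.0, 3000.0, 500.0],
})
main = pd.DataFrame({'SK_ID_CURR': [1, 2, 3]})

# aggregate to one row per client, then left-join onto the main table
agg = (bureau.groupby('SK_ID_CURR')['AMT_CREDIT_SUM']
             .agg(['mean', 'max', 'count'])
             .add_prefix('BUREAU_')
             .reset_index())
merged = main.merge(agg, on='SK_ID_CURR', how='left')
print(merged)  # client 3 has no records, so its aggregates are NaN
```

The left join keeps every client from the main table, so clients with no secondary records simply get NaN aggregates, which the imputation step can then fill.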