Previously, based on application_train.csv and application_test.csv, we did some simple feature engineering and produced the following files (three different feature treatments of the recoded data):

  • Polynomial Features: poly_train_data.csv, poly_test_data.csv
  • Domain Knowledge Features: domain_train_data.csv, domain_test_data.csv
  • Featuretools: auto_train_data.csv, auto_test_data.csv

Training model

Logistic Regression

import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.preprocessing import MinMaxScaler, Imputer
from sklearn.linear_model import LogisticRegression
# load data
poly_train = pd.read_csv('data/poly_train_data.csv')
poly_test = pd.read_csv('data/poly_test_data.csv')
domain_train = pd.read_csv('data/domain_train_data.csv')
domain_test = pd.read_csv('data/domain_test_data.csv')
auto_train = pd.read_csv('data/auto_train_data.csv')
auto_test = pd.read_csv('data/auto_test_data.csv')

Missing-value imputation and scaling

# label column and the IDs used for the submission files
target = poly_train['TARGET']
Id = poly_test[['SK_ID_CURR']]

Polynomial features

poly_train = poly_train.drop(['TARGET'], axis = 1)
# feature names
poly_features = list(poly_train.columns)
# fill missing values with the median
imputer = Imputer(strategy = 'median')
# scale features to the 0-1 range
scaler = MinMaxScaler(feature_range = (0, 1))
# fit on the training data only
imputer.fit(poly_train)
# transform both train and test data
poly_train = imputer.transform(poly_train)
poly_test = imputer.transform(poly_test)
# scaler
scaler.fit(poly_train)
poly_train = scaler.transform(poly_train)
poly_test = scaler.transform(poly_test)
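
A compatibility note: Imputer was deprecated in scikit-learn 0.20 and removed in 0.22. On newer versions the drop-in replacement is SimpleImputer, e.g.:

# on scikit-learn >= 0.20, SimpleImputer replaces the removed Imputer
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy = 'median')  # same median strategy as above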

Domain features

domain_train = domain_train.drop(['TARGET'], axis = 1)
domain_features = list(domain_train.columns)
# fit on the training data only
imputer.fit(domain_train)
# transform both train and test data
domain_train = imputer.transform(domain_train)
domain_test = imputer.transform(domain_test)
# scaler
scaler.fit(domain_train)
domain_train = scaler.transform(domain_train)
domain_test = scaler.transform(domain_test)

Featuretools

auto_train = auto_train.drop(['TARGET'], axis = 1)
auto_features = list(auto_train.columns)
# fit on the training data only
imputer.fit(auto_train)
# transform both train and test data
auto_train = imputer.transform(auto_train)
auto_test = imputer.transform(auto_test)
# scaler
scaler.fit(auto_train)
auto_train = scaler.transform(auto_train)
auto_test = scaler.transform(auto_test)
print('poly_train',poly_train.shape)
print('poly_test',poly_test.shape)
print('domain_train',domain_train.shape)
print('domain_test',domain_test.shape)
print('auto_train',auto_train.shape)
print('auto_test',auto_test.shape)
poly_train (307511, 274)
poly_test (48744, 274)
domain_train (307511, 244)
domain_test (48744, 244)
auto_train (307511, 239)
auto_test (48744, 239)
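
Since the same impute-and-scale steps run three times above, they could be factored into one helper. A minimal sketch (the name preprocess is mine; it wraps the same Imputer and MinMaxScaler):

def preprocess(train_df, test_df):
    # median-impute and 0-1 scale a train/test pair, fitting on train only
    imputer = Imputer(strategy = 'median')
    scaler = MinMaxScaler(feature_range = (0, 1))
    train = scaler.fit_transform(imputer.fit_transform(train_df))
    test = scaler.transform(imputer.transform(test_df))
    return train, test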

LogisticRegression

lr = LogisticRegression(C = 0.0001, class_weight = 'balanced')   # C is the inverse regularization strength
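
C = 0.0001 is very strong regularization. As a sanity check, one could compare a few values of C by cross-validated AUC before committing to it; a sketch (the loop and values are my assumptions, not part of the original run):

from sklearn.model_selection import cross_val_score

# compare a few inverse regularization strengths by 3-fold AUC (illustrative values)
for C in [0.0001, 0.01, 1.0]:
    lr_c = LogisticRegression(C = C, class_weight = 'balanced')
    scores = cross_val_score(lr_c, poly_train, target, cv = 3, scoring = 'roc_auc')
    print('C = %g  mean AUC: %.4f' % (C, scores.mean()))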

Polynomial

lr.fit(poly_train, target)
lr_poly_pred = lr.predict_proba(poly_test)[:,1]
# submission dataframe
submit = Id.copy()
submit['TARGET'] = lr_poly_pred
submit.to_csv('lr_poly_submit.csv',index = False)

Domain Knowledge

lr = LogisticRegression(C = 0.0001,class_weight='balanced',solver='sag')
lr.fit(domain_train, target)
lr_domain_pred = lr.predict_proba(domain_test)[:,1]
submit = Id.copy()
submit['TARGET'] = lr_domain_pred
submit.to_csv('lr_domain_submit.csv',index = False)

FeatureTools

lr = LogisticRegression(C = 0.0001,class_weight='balanced',solver='sag')
lr.fit(auto_train, target)
lr_auto_pred = lr.predict_proba(auto_test)[:,1]
submit = Id.copy()
submit['TARGET'] = lr_auto_pred
submit.to_csv('lr_auto_submit.csv',index = False)

Online leaderboard results:

  • Polynomial: 0.723
  • Domain: 0.670
  • Featuretools: 0.669

Next, let's upgrade the algorithm and train a random forest on the same three datasets.

Random Forest

from sklearn.ensemble import RandomForestClassifier

Polynomial

random_forest = RandomForestClassifier(n_estimators = 100, random_state = 55, verbose = 1,n_jobs = -1)
random_forest.fit(poly_train, target)
# extract feature importances
poly_importance_feature_values = random_forest.feature_importances_
poly_importance_features = pd.DataFrame({'feature': poly_features, 'importance': poly_importance_feature_values})
rf_poly_pred = random_forest.predict_proba(poly_test)[:,1]
# build the submission
submit = Id.copy()
submit['TARGET'] = rf_poly_pred
submit.to_csv('rf_poly_submit.csv', index = False)
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:  1.6min
[Parallel(n_jobs=-1)]: Done 100 out of 100 | elapsed:  3.7min finished
[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:    0.4s
[Parallel(n_jobs=4)]: Done 100 out of 100 | elapsed:    0.9s finished
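
As an aside, a random forest can also estimate its generalization error without a separate validation split, via out-of-bag samples. A hedged sketch (the refit with oob_score=True is my addition, not part of the original run):

# optional sanity check using out-of-bag samples
rf_oob = RandomForestClassifier(n_estimators = 100, oob_score = True,
                                random_state = 55, n_jobs = -1)
rf_oob.fit(poly_train, target)
print('OOB accuracy: %.4f' % rf_oob.oob_score_)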

Feature importance

poly_importance_features = poly_importance_features.set_index(['feature'])
poly_importance_features.sort_values(by = 'importance').plot(kind='barh',figsize=(10, 120))
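
As a quick complement to the plot, a short sketch (my addition) that counts how many features the forest never used at all:

# count features assigned exactly zero importance by the fitted forest
zero_features = poly_importance_features[poly_importance_features['importance'] == 0.0]
print('%d features have zero importance' % len(zero_features))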

Based on the feature-importance plot above, we can do some feature selection: drop the features that contribute nothing, which also reduces the dimensionality of the data. Next, let's upgrade the algorithm once more, to the heavy hitter of machine learning: the Light Gradient Boosting Machine (LightGBM).

from sklearn.model_selection import KFold
from sklearn.metrics import roc_auc_score
import lightgbm as lgb
import gc
import numpy as np
import warnings
warnings.filterwarnings('ignore')
def model(features, test_features, n_folds = 10):
    # extract the ID columns
    train_ids = features['SK_ID_CURR']
    test_ids = test_features['SK_ID_CURR']
    # TARGET
    labels = features[['TARGET']]
    # drop ID and TARGET
    features = features.drop(['SK_ID_CURR', 'TARGET'], axis = 1)
    test_features = test_features.drop(['SK_ID_CURR'], axis = 1)
    # feature names
    feature_names = list(features.columns)
    # (optional) DataFrame --> array
    #features = np.array(features)
    #test_features = np.array(test_features)
    # randomly split train data into n_folds parts: train on n-1, validate on 1
    k_fold = KFold(n_splits = n_folds, shuffle = True, random_state = 50)
    # test predictions
    test_predictions = np.zeros(test_features.shape[0])
    # out-of-fold validation predictions
    out_of_fold = np.zeros(features.shape[0])
    # record the scores of each fold
    valid_scores = []
    train_scores = []
    # iterate through each fold
    count = 0
    for train_indices, valid_indices in k_fold.split(features):
        # training data for the fold
        train_features = features.loc[train_indices, :]
        train_labels = labels.loc[train_indices, :]
        # validation data for the fold
        valid_features = features.loc[valid_indices, :]
        valid_labels = labels.loc[valid_indices, :]
        # create the model
        model = lgb.LGBMClassifier(n_estimators = 10000, objective = 'binary',
                                   class_weight = 'balanced', learning_rate = 0.05,
                                   reg_alpha = 0.1, reg_lambda = 0.1, subsample = 0.8,
                                   n_jobs = -1, random_state = 50)
        # train the model with early stopping on the validation fold
        model.fit(train_features, train_labels, eval_metric = 'auc',
                  eval_set = [(valid_features, valid_labels), (train_features, train_labels)],
                  eval_names = ['valid', 'train'], categorical_feature = 'auto',
                  early_stopping_rounds = 100, verbose = 200)
        # record the best iteration
        best_iteration = model.best_iteration_
        # accumulate predictions on the test set
        test_predictions += model.predict_proba(test_features, num_iteration = best_iteration)[:, 1] / n_folds
        # out-of-fold predictions on the validation set
        out_of_fold[valid_indices] = model.predict_proba(valid_features, num_iteration = best_iteration)[:, 1]
        # record the best scores
        valid_score = model.best_score_['valid']['auc']
        train_score = model.best_score_['train']['auc']
        valid_scores.append(valid_score)
        train_scores.append(train_score)
        # clean up memory
        gc.enable()
        del model, train_features, valid_features
        gc.collect()
        count += 1
        print("%d_fold is over" % count)
    # make the submission dataframe
    submission = pd.DataFrame({'SK_ID_CURR': test_ids, 'TARGET': test_predictions})
    # overall validation score
    valid_auc = roc_auc_score(labels, out_of_fold)
    # add the overall scores to the metrics
    valid_scores.append(valid_auc)
    train_scores.append(np.mean(train_scores))
    # dataframe of validation scores
    fold_names = list(range(n_folds))
    fold_names.append('overall')
    metrics = pd.DataFrame({'fold': fold_names, 'train': train_scores, 'valid': valid_scores})
    return submission, metrics
# load data
poly_train = pd.read_csv('data/poly_train_data.csv')
poly_test = pd.read_csv('data/poly_test_data.csv')
print('poly_train:',poly_train.shape)
print('poly_test:',poly_test.shape)
poly_train: (307511, 275)
poly_test: (48744, 274)
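
With the raw frames reloaded (they still carry SK_ID_CURR and TARGET, which model() strips internally), the helper can be called directly. A hypothetical baseline call on the full feature set, before any feature selection, would be:

# hypothetical baseline: full polynomial feature set, 5-fold CV
submit1, metrics1 = model(poly_train, poly_test, n_folds = 5)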

Select Features

Rank the features by importance, from smallest to largest:

poly_importance_features = poly_importance_features.sort_values(by = 'importance')

The 20 features to drop:

poly_importance_features.head(20).plot(kind = 'barh')

s_train_1 = poly_train.copy()
s_test_1 = poly_test.copy()
# names of the feature columns to drop
drop_feature_names = poly_importance_features.index[:20]
# drop the 20 least important features
s_train_1 = s_train_1.drop(drop_feature_names, axis = 1)
s_test_1 = s_test_1.drop(drop_feature_names, axis = 1)
submit2, metrics2 = model(s_train_1, s_test_1, n_folds = 5)
Training until validation scores don't improve for 100 rounds.
[200]   train's auc: 0.800686   valid's auc: 0.755447
[400]   train's auc: 0.831722   valid's auc: 0.755842
Early stopping, best iteration is:
[351]   train's auc: 0.824767   valid's auc: 0.756092
Training until validation scores don't improve for 100 rounds.
[200]   train's auc: 0.800338   valid's auc: 0.757507
[400]   train's auc: 0.831318   valid's auc: 0.757378
Early stopping, best iteration is:
[307]   train's auc: 0.818238   valid's auc: 0.757819
Training until validation scores don't improve for 100 rounds.
[200]   train's auc: 0.799557   valid's auc: 0.762719
Early stopping, best iteration is:
[160]   train's auc: 0.791849   valid's auc: 0.763023
Training until validation scores don't improve for 100 rounds.
[200]   train's auc: 0.80053    valid's auc: 0.758546
Early stopping, best iteration is:
[224]   train's auc: 0.804828   valid's auc: 0.758703
Training until validation scores don't improve for 100 rounds.
[200]   train's auc: 0.799826   valid's auc: 0.758312
[400]   train's auc: 0.831271   valid's auc: 0.758623
Early stopping, best iteration is:
[319]   train's auc: 0.819603   valid's auc: 0.758971
metrics2
fold train valid
0 0 0.824767 0.756092
1 1 0.818238 0.757819
2 2 0.791849 0.763023
3 3 0.804828 0.758703
4 4 0.819603 0.758971
5 overall 0.811857 0.758799
submit2.to_csv('submit2.csv',index = False)

Leaderboard score: 0.734

Dropping a few features improved the result slightly. Let's try dropping 30.

s_train_2 = poly_train.copy()
s_test_2 = poly_test.copy()
# names of the feature columns to drop
drop_feature_names = poly_importance_features.index[:30]
# drop the 30 least important features
s_train_2 = s_train_2.drop(drop_feature_names, axis = 1)
s_test_2 = s_test_2.drop(drop_feature_names, axis = 1)
submit3, metrics3 = model(s_train_2, s_test_2, n_folds = 5)
Training until validation scores don't improve for 100 rounds.
[200]   train's auc: 0.800547   valid's auc: 0.755442
Early stopping, best iteration is:
[267]   train's auc: 0.81211    valid's auc: 0.755868
Training until validation scores don't improve for 100 rounds.
[200]   train's auc: 0.80048    valid's auc: 0.757653
Early stopping, best iteration is:
[258]   train's auc: 0.81057    valid's auc: 0.758107
Training until validation scores don't improve for 100 rounds.
[200]   train's auc: 0.799261   valid's auc: 0.76291
Early stopping, best iteration is:
[189]   train's auc: 0.797314   valid's auc: 0.762962
Training until validation scores don't improve for 100 rounds.
[200]   train's auc: 0.800499   valid's auc: 0.758385
Early stopping, best iteration is:
[202]   train's auc: 0.800851   valid's auc: 0.758413
Training until validation scores don't improve for 100 rounds.
[200]   train's auc: 0.799977   valid's auc: 0.758234
Early stopping, best iteration is:
[284]   train's auc: 0.814454   valid's auc: 0.758612
metrics3
fold train valid
0 0 0.812110 0.755868
1 1 0.810570 0.758107
2 2 0.797314 0.762962
3 3 0.800851 0.758413
4 4 0.814454 0.758612
5 overall 0.807060 0.758735
submit3.to_csv('submit3.csv',index = False)

The result is 0.733, basically no change. Using only the main table is certainly not enough; this was just for fun.
