Previously, starting from application_train.csv and application_test.csv, we did some simple feature engineering (three different feature treatments applied to the recoded data) and produced the following files:

  • Polynomial Features: poly_train_data.csv, poly_test_data.csv
  • Domain Knowledge Features: domain_train_data.csv, domain_test_data.csv
  • Featuretools: auto_train_data.csv, auto_test_data.csv

Training models

Logistic Regression

import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.preprocessing import MinMaxScaler
from sklearn.impute import SimpleImputer  # SimpleImputer replaces the removed sklearn.preprocessing.Imputer
from sklearn.linear_model import LogisticRegression
# load data
poly_train = pd.read_csv('data/poly_train_data.csv')
poly_test = pd.read_csv('data/poly_test_data.csv')
domain_train = pd.read_csv('data/domain_train_data.csv')
domain_test = pd.read_csv('data/domain_test_data.csv')
auto_train = pd.read_csv('data/auto_train_data.csv')
auto_test = pd.read_csv('data/auto_test_data.csv')

Missing-value imputation and scaling

target = poly_train['TARGET']
Id = poly_test[['SK_ID_CURR']]

Polynomial features

poly_train = poly_train.drop(['TARGET'], axis=1)
# feature names
poly_features = list(poly_train.columns)
# impute missing values with the median
imputer = SimpleImputer(strategy = 'median')
# scale features to the 0-1 range
scaler = MinMaxScaler(feature_range = (0, 1))
# fit train data
imputer.fit(poly_train)
# Transform train test data
poly_train = imputer.transform(poly_train)
poly_test = imputer.transform(poly_test)
# scaler
scaler.fit(poly_train)
poly_train = scaler.transform(poly_train)
poly_test = scaler.transform(poly_test)
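
The same impute-then-scale steps are repeated below for the domain and Featuretools data. As a minimal alternative sketch (not what the rest of this post runs), the two transformers can be bundled into a scikit-learn Pipeline, which keeps the fit-on-train / transform-on-test discipline in a single object:

from sklearn.pipeline import Pipeline
# one object that imputes medians, then scales to [0, 1]
preprocess = Pipeline([
    ('impute', SimpleImputer(strategy = 'median')),
    ('scale', MinMaxScaler(feature_range = (0, 1))),
])
# e.g. domain_train = preprocess.fit_transform(domain_train)
#      domain_test = preprocess.transform(domain_test)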

Domain features

domain_train = domain_train.drop(['TARGET'], axis=1)
domain_features = list(domain_train.columns)
# fit train data
imputer.fit(domain_train)
# Transform train test data
domain_train = imputer.transform(domain_train)
domain_test = imputer.transform(domain_test)
# scaler
scaler.fit(domain_train)
domain_train = scaler.transform(domain_train)
domain_test = scaler.transform(domain_test)

Featuretools

auto_train = auto_train.drop(['TARGET'], axis=1)
auto_features = list(auto_train.columns)
# fit train data
imputer.fit(auto_train)
# Transform train test data
auto_train = imputer.transform(auto_train)
auto_test = imputer.transform(auto_test)
# scaler
scaler.fit(auto_train)
auto_train = scaler.transform(auto_train)
auto_test = scaler.transform(auto_test)
print('poly_train',poly_train.shape)
print('poly_test',poly_test.shape)
print('domain_train',domain_train.shape)
print('domain_test',domain_test.shape)
print('auto_train',auto_train.shape)
print('auto_test',auto_test.shape)
poly_train (307511, 274)
poly_test (48744, 274)
domain_train (307511, 244)
domain_test (48744, 244)
auto_train (307511, 239)
auto_test (48744, 239)

LogisticRegression

lr = LogisticRegression(C = 0.0001, class_weight = 'balanced')   # C is the inverse of the regularization strength

Polynomial

lr.fit(poly_train, target)
lr_poly_pred = lr.predict_proba(poly_test)[:,1]
# submission dataframe
submit = Id.copy()
submit['TARGET'] = lr_poly_pred
submit.to_csv('lr_poly_submit.csv',index = False)

Domain Knowledge

lr = LogisticRegression(C = 0.0001,class_weight='balanced',solver='sag')
lr.fit(domain_train, target)
lr_domain_pred = lr.predict_proba(domain_test)[:,1]
submit = Id.copy()
submit['TARGET'] = lr_domain_pred
submit.to_csv('lr_domain_submit.csv',index = False)

Featuretools

lr = LogisticRegression(C = 0.0001,class_weight='balanced',solver='sag')
lr.fit(auto_train, target)
lr_auto_pred = lr.predict_proba(auto_test)[:,1]
submit = Id.copy()
submit['TARGET'] = lr_auto_pred
submit.to_csv('lr_auto_submit.csv',index = False)

Public leaderboard results:

  • Polynomial: 0.723
  • Domain: 0.670
  • Featuretools: 0.669
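
These leaderboard numbers can also be estimated offline with cross-validated AUC, which saves submission quota. A minimal sketch, assuming the preprocessed matrices and target from above are still in memory (3 folds is enough for a rough ranking):

from sklearn.model_selection import cross_val_score
# rough offline ranking of the three feature sets by ROC AUC
for name, X in [('poly', poly_train), ('domain', domain_train), ('auto', auto_train)]:
    scores = cross_val_score(lr, X, target, cv = 3, scoring = 'roc_auc', n_jobs = -1)
    print('%s: AUC %.3f +/- %.3f' % (name, scores.mean(), scores.std()))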

Next, let's upgrade the algorithm and train a random forest on the same three feature sets.

Random Forest

from sklearn.ensemble import RandomForestClassifier

Polynomial

random_forest = RandomForestClassifier(n_estimators = 100, random_state = 55, verbose = 1,n_jobs = -1)
random_forest.fit(poly_train, target)
# extract the feature importances
poly_importance_feature_values = random_forest.feature_importances_
poly_importance_features = pd.DataFrame({'feature':poly_features,'importance':poly_importance_feature_values})
rf_poly_pred = random_forest.predict_proba(poly_test)[:,1]
# submission
submit = Id.copy()
submit['TARGET'] = rf_poly_pred
submit.to_csv('rf_poly_submit.csv', index = False)
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:  1.6min
[Parallel(n_jobs=-1)]: Done 100 out of 100 | elapsed:  3.7min finished
[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:    0.4s
[Parallel(n_jobs=4)]: Done 100 out of 100 | elapsed:    0.9s finished

Feature importance

poly_importance_features = poly_importance_features.set_index(['feature'])
poly_importance_features.sort_values(by = 'importance').plot(kind='barh',figsize=(10, 120))

Based on the chart above we can do some feature selection: drop the features that contribute nothing (the sketch below counts them), which also lowers the dimensionality of the data. Then let's upgrade the algorithm once more, to one of machine learning's heavy hitters: the Light Gradient Boosting Machine (LightGBM).
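
Before picking an arbitrary cutoff, it is worth counting how many features the forest never used at all; a minimal sketch over the poly_importance_features DataFrame built above:

# features the forest never split on have importance exactly 0
# and are the safest candidates to drop
zero_importance = poly_importance_features[poly_importance_features['importance'] == 0.0]
print('%d of %d features have zero importance' % (len(zero_importance), len(poly_importance_features)))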

from sklearn.model_selection import KFold
from sklearn.metrics import roc_auc_score
import lightgbm as lgb
import gc
import numpy as np
import warnings
warnings.filterwarnings('ignore')
def model(features, test_features, n_folds = 10):
    # pull out the ID columns
    train_ids = features['SK_ID_CURR']
    test_ids = test_features['SK_ID_CURR']
    # the TARGET labels
    labels = features[['TARGET']]
    # drop ID and TARGET from the feature matrices
    features = features.drop(['SK_ID_CURR', 'TARGET'], axis = 1)
    test_features = test_features.drop(['SK_ID_CURR'], axis = 1)
    # feature names
    feature_names = list(features.columns)
    # randomly split the training data into n_folds parts:
    # (n_folds - 1) parts for training, 1 for validation
    k_fold = KFold(n_splits = n_folds, shuffle = True, random_state = 50)
    # test predictions (averaged over the folds)
    test_predictions = np.zeros(test_features.shape[0])
    # out-of-fold validation predictions
    out_of_fold = np.zeros(features.shape[0])
    # record the scores of each fold
    valid_scores = []
    train_scores = []
    # iterate through each fold
    count = 0
    for train_indices, valid_indices in k_fold.split(features):
        # training data for the fold
        train_features = features.loc[train_indices, :]
        train_labels = labels.loc[train_indices, :]
        # validation data for the fold
        valid_features = features.loc[valid_indices, :]
        valid_labels = labels.loc[valid_indices, :]
        # create the model
        model = lgb.LGBMClassifier(n_estimators = 10000, objective = 'binary',
                                   class_weight = 'balanced', learning_rate = 0.05,
                                   reg_alpha = 0.1, reg_lambda = 0.1, subsample = 0.8,
                                   n_jobs = -1, random_state = 50)
        # train the model with early stopping on the validation fold
        model.fit(train_features, train_labels, eval_metric = 'auc',
                  eval_set = [(valid_features, valid_labels), (train_features, train_labels)],
                  eval_names = ['valid', 'train'], categorical_feature = 'auto',
                  early_stopping_rounds = 100, verbose = 200)
        # record the best iteration
        best_iteration = model.best_iteration_
        # test-set predictions, averaged over the folds
        test_predictions += model.predict_proba(test_features, num_iteration = best_iteration)[:, 1] / n_folds
        # validation-fold predictions
        out_of_fold[valid_indices] = model.predict_proba(valid_features, num_iteration = best_iteration)[:, 1]
        # record the best scores
        valid_score = model.best_score_['valid']['auc']
        train_score = model.best_score_['train']['auc']
        valid_scores.append(valid_score)
        train_scores.append(train_score)
        # clean up memory
        gc.enable()
        del model, train_features, valid_features
        gc.collect()
        count += 1
        print("%d_fold is over" % count)
    # make the submission dataframe
    submission = pd.DataFrame({'SK_ID_CURR': test_ids, 'TARGET': test_predictions})
    # overall validation score
    valid_auc = roc_auc_score(labels, out_of_fold)
    # add the overall scores to the metrics
    valid_scores.append(valid_auc)
    train_scores.append(np.mean(train_scores))
    # dataframe of per-fold validation scores
    fold_names = list(range(n_folds))
    fold_names.append('overall')
    metrics = pd.DataFrame({'fold': fold_names, 'train': train_scores, 'valid': valid_scores})
    return submission, metrics
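
One detail worth noting: each training row gets its prediction from the single fold in which it was held out, so the final roc_auc_score(labels, out_of_fold) is an out-of-fold estimate of the generalization AUC over the whole training set, with no separate holdout needed; the test predictions are simply averaged across the n_folds models.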
# load data
poly_train = pd.read_csv('data/poly_train_data.csv')
poly_test = pd.read_csv('data/poly_test_data.csv')
print('poly_train:', poly_train.shape)
print('poly_test:',poly_test.shape)
poly_train: (307511, 275)
poly_test: (48744, 274)

Select Features

Rank the features by importance, from lowest to highest:

poly_importance_features = poly_importance_features.sort_values(by = 'importance')

Drop these 20 features:

poly_importance_features.head(20).plot(kind = 'barh')

s_train_1 = poly_train.copy()
s_test_1 = poly_test.copy()
# names of the 20 least-important feature columns
drop_feature_names = poly_importance_features.index[:20]
# drop those 20 features
s_train_1 = s_train_1.drop(drop_feature_names, axis = 1)
s_test_1 = s_test_1.drop(drop_feature_names, axis = 1)
submit2, metrics2 = model(s_train_1, s_test_1, n_folds = 5)
Training until validation scores don't improve for 100 rounds.
[200]   train's auc: 0.800686  valid's auc: 0.755447
[400]   train's auc: 0.831722  valid's auc: 0.755842
Early stopping, best iteration is:
[351]   train's auc: 0.824767  valid's auc: 0.756092
Training until validation scores don't improve for 100 rounds.
[200]   train's auc: 0.800338  valid's auc: 0.757507
[400]   train's auc: 0.831318  valid's auc: 0.757378
Early stopping, best iteration is:
[307]   train's auc: 0.818238  valid's auc: 0.757819
Training until validation scores don't improve for 100 rounds.
[200]   train's auc: 0.799557  valid's auc: 0.762719
Early stopping, best iteration is:
[160]   train's auc: 0.791849  valid's auc: 0.763023
Training until validation scores don't improve for 100 rounds.
[200]   train's auc: 0.80053   valid's auc: 0.758546
Early stopping, best iteration is:
[224]   train's auc: 0.804828  valid's auc: 0.758703
Training until validation scores don't improve for 100 rounds.
[200]   train's auc: 0.799826  valid's auc: 0.758312
[400]   train's auc: 0.831271  valid's auc: 0.758623
Early stopping, best iteration is:
[319]   train's auc: 0.819603  valid's auc: 0.758971
metrics2
      fold     train     valid
0        0  0.824767  0.756092
1        1  0.818238  0.757819
2        2  0.791849  0.763023
3        3  0.804828  0.758703
4        4  0.819603  0.758971
5  overall  0.811857  0.758799
submit2.to_csv('submit2.csv',index = False)

Leaderboard score: 0.734

Dropping a few features improved the score slightly. Let's try dropping 30; a more general sweep over the cutoff is sketched below.
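
The number of dropped features is really a hyperparameter. A hypothetical sweep (expensive, since each call retrains five LightGBM folds) could compare several cutoffs by the overall out-of-fold AUC that model() reports:

# hypothetical sweep over drop counts; model() is the CV function defined above
for k in (10, 20, 30, 40):
    drop_cols = poly_importance_features.index[:k]
    sub, met = model(poly_train.drop(drop_cols, axis = 1),
                     poly_test.drop(drop_cols, axis = 1), n_folds = 5)
    overall = met.loc[met['fold'] == 'overall', 'valid'].values[0]
    print('dropped %d -> overall valid AUC %.4f' % (k, overall))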

s_train_2 = poly_train.copy()
s_test_2 = poly_test.copy()
# names of the 30 least-important feature columns
drop_feature_names = poly_importance_features.index[:30]
# drop those 30 features
s_train_2 = s_train_2.drop(drop_feature_names, axis = 1)
s_test_2 = s_test_2.drop(drop_feature_names, axis = 1)
submit3, metrics3 = model(s_train_2, s_test_2, n_folds = 5)
Training until validation scores don't improve for 100 rounds.
[200]   train's auc: 0.800547  valid's auc: 0.755442
Early stopping, best iteration is:
[267]   train's auc: 0.81211   valid's auc: 0.755868
Training until validation scores don't improve for 100 rounds.
[200]   train's auc: 0.80048   valid's auc: 0.757653
Early stopping, best iteration is:
[258]   train's auc: 0.81057   valid's auc: 0.758107
Training until validation scores don't improve for 100 rounds.
[200]   train's auc: 0.799261  valid's auc: 0.76291
Early stopping, best iteration is:
[189]   train's auc: 0.797314  valid's auc: 0.762962
Training until validation scores don't improve for 100 rounds.
[200]   train's auc: 0.800499  valid's auc: 0.758385
Early stopping, best iteration is:
[202]   train's auc: 0.800851  valid's auc: 0.758413
Training until validation scores don't improve for 100 rounds.
[200]   train's auc: 0.799977  valid's auc: 0.758234
Early stopping, best iteration is:
[284]   train's auc: 0.814454  valid's auc: 0.758612
metrics3
      fold     train     valid
0        0  0.812110  0.755868
1        1  0.810570  0.758107
2        2  0.797314  0.762962
3        3  0.800851  0.758413
4        4  0.814454  0.758612
5  overall  0.807060  0.758735
submit3.to_csv('submit3.csv',index = False)

The score is 0.733, essentially unchanged. Using only the main application table was never going to be enough; this round was mostly for fun.
