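Machine Learning 10: Credit Card Anti-Fraud Model

Contents

1. Data preparation
2. Data sampling
3. Modeling and parameter tuning
Final code

1. Data preparation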

# Credit card anti-fraud model
# Goal: identify fraudulent records in the transaction data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Data preparation
# 1. Load the data
data = pd.read_csv('data/creditcard.csv', delimiter=',')
print(data.shape)
print(data.head(5))
# Class distribution of the samples
print(data['Class'].value_counts())
# Bar chart of the number of samples in each of the two classes
plt.subplots(1, 1, figsize=(7, 5))
count_classes = data['Class'].value_counts().sort_index()
count_classes.plot(kind='bar')
plt.title('Fraud class histogram', fontsize=13)
plt.xlabel('Class', fontsize=13)
plt.ylabel('Frequency', fontsize=15)
# Keep the x-axis tick labels horizontal
plt.xticks(rotation=0)
plt.show()
# The chart shows that the vast majority of samples belong to class 0 (legitimate
# transactions) and only a tiny fraction belong to class 1 (fraud), so the data
# is extremely imbalanced.
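
To put a number on the imbalance, the class counts reported for this data set (284315 legitimate, 492 fraudulent) can be turned into a fraud rate and into the accuracy of a trivial model that always predicts class 0. The following is a minimal sketch; the function name `imbalance_summary` is an assumption made for illustration and is not part of the original script.

```python
def imbalance_summary(n_negative, n_positive):
    """Return the positive-class rate and the accuracy of always predicting 0."""
    total = n_negative + n_positive
    fraud_rate = n_positive / total
    baseline_accuracy = n_negative / total
    return fraud_rate, baseline_accuracy


# Class counts taken from the value_counts() output for creditcard.csv
fraud_rate, baseline_accuracy = imbalance_summary(284315, 492)
print('fraud rate: %.4f%%' % (fraud_rate * 100))
print('accuracy of always predicting class 0: %.4f' % baseline_accuracy)
```

Because the trivial model already scores above 99.8% accuracy, accuracy alone says little here, which is why the later steps report recall, precision, and AUC.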

2. Data sampling

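# Data sampling
from imblearn.combine import SMOTETomek
from imblearn.over_sampling import RandomOverSampler, SMOTE
from sklearn.model_selection import StratifiedShuffleSplit

# Splitting imbalanced data into a training set and a test set
data = data.drop(['Time'], axis=1)
# 1. Stratified sampling on the target variable
# StratifiedShuffleSplit from sklearn.model_selection first shuffles the data set and
# then produces train/test pairs according to its parameters, keeping the class
# ratio the same in every split.
# This slice selects columns V1..V28 as features (Amount is not included)
X = np.array(data.loc[:, :'V28'])
y = np.array(data['Class'])
# n_splits=1 means a single random split
sess = StratifiedShuffleSplit(n_splits=1, test_size=0.4, random_state=0)
for train_index, test_index in sess.split(X, y):
    print(len(train_index))  # 170884
    # Split the data set
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
print('train_size:%s' % len(y_train), 'test_size:%s' % len(y_test))  # train_size:170884 test_size:113923

plt.figure(figsize=(7, 5))
count_classes = pd.Series(y_train).value_counts()
count_classes.plot(kind='bar')
plt.title("Histogram of fraud class in training data", fontsize=13)
plt.xlabel("Class", fontsize=13)
plt.ylabel("Frequency", fontsize=15)
plt.xticks(rotation=0)
plt.show()
# The bar chart of the training set shows far more class 0 than class 1 samples.
# The data is imbalanced and has to be handled before modeling.

# 2. Balance the classes by oversampling
# Random oversampling
ros = RandomOverSampler(random_state=0)
# SMOTE oversampling
sos = SMOTE(random_state=0)
# Combined over- and under-sampling (SMOTE + Tomek links)
kos = SMOTETomek(random_state=0)
# fit_resample replaces fit_sample, which was removed in newer imbalanced-learn releases
X_ros, y_ros = ros.fit_resample(X_train, y_train)
X_sos, y_sos = sos.fit_resample(X_train, y_train)
X_kos, y_kos = kos.fit_resample(X_train, y_train)
print('ros:%s,sos:%s,kos:%s' % (len(y_ros), len(y_sos), len(y_kos)))
a = pd.DataFrame(y_ros)
print(a[0].value_counts())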

(284807, 31)
   Time        V1        V2        V3  ...       V27       V28  Amount  Class
0   0.0 -1.359807 -0.072781  2.536347  ...  0.133558 -0.021053  149.62      0
1   0.0  1.191857  0.266151  0.166480  ... -0.008983  0.014724    2.69      0
2   1.0 -1.358354 -1.340163  1.773209  ... -0.055353 -0.059752  378.66      0
3   1.0 -0.966272 -0.185226  1.792993  ...  0.062723  0.061458  123.50      0
4   2.0 -1.158233  0.877737  1.548718  ...  0.219422  0.215153   69.99      0
[5 rows x 31 columns]
0    284315
1       492
Name: Class, dtype: int64
170884
train_size:170884 test_size:113923
ros:341178,sos:341178,kos:341178
1    170589
0    170589
Name: 0, dtype: int64
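
The balanced counts above follow from how random oversampling works: minority-class rows are drawn with replacement until both classes have the same size, so the resampled set holds twice the majority count (2 × 170589 = 341178). The sketch below illustrates the idea in plain Python on toy labels. `random_oversample` is a hypothetical helper written for this example and is not the `imblearn` implementation.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of smaller classes until all classes match the largest."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, l in zip(rows, labels) if l == cls]
        # Draw, with replacement, the number of extra rows this class still needs
        extra = [rng.choice(pool) for _ in range(target - n)]
        out_rows.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_rows, out_labels


# Toy data: 8 legitimate rows (class 0) and 2 fraudulent rows (class 1)
rows = [[i] for i in range(10)]
labels = [0] * 8 + [1] * 2
new_rows, new_labels = random_oversample(rows, labels)
print(Counter(new_labels))
```

SMOTE differs in that it synthesizes new minority points by interpolating between neighboring minority samples instead of copying existing rows. In both cases only the training split is resampled, so the test set keeps the original class ratio.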

3. Modeling and parameter tuning

# Modeling and parameter tuning
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (auc, roc_auc_score, precision_score, roc_curve,
                             recall_score, classification_report)
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# After oversampling the two classes are balanced. A decision tree is fitted on the
# original training split and on each of the three oversampled data sets, and the
# data set that predicts best is used for the later models.
clf = DecisionTreeClassifier(criterion='gini', random_state=1234)
param_grid = {'max_depth': [3, 4, 5, 6], 'max_leaf_nodes': [4, 6, 8, 10, 12]}
cv = GridSearchCV(clf, param_grid=param_grid, scoring='f1')
# Named `datasets` so that the DataFrame `data` is not overwritten
datasets = [[X_train, y_train], [X_ros, y_ros], [X_sos, y_sos], [X_kos, y_kos]]
# Train on each data set and evaluate on the untouched test set
for features, labels in datasets:
    cv.fit(features, labels)
    pred_test = cv.predict(X_test)
    print('auc:%.3f' % roc_auc_score(y_test, pred_test),
          'recall:%.3f' % recall_score(y_test, pred_test),
          'precision:%.3f' % precision_score(y_test, pred_test))
# The results show that the randomly oversampled data set gives the highest AUC,
# so it is used to build the prediction models.
train_data = X_ros
train_target = y_ros
test_target = y_test
test_data = X_test

# Logistic regression
# The L1 penalty needs a solver that supports it, e.g. liblinear
lr = LogisticRegression(C=1, penalty='l1', solver='liblinear')
lr.fit(train_data, train_target)
test_est = lr.predict(test_data)
print("Logistic Regression accuracy:")
# Classification report
print(classification_report(test_target, test_est))
# Note: roc_curve receives hard 0/1 predictions here, so the AUC reflects one threshold
fpr_test, tpr_test, th_test = roc_curve(test_target, test_est)
# AUC value
print('Logistic Regression AUC: %.4f' % auc(fpr_test, tpr_test))

# Random forest
rf = RandomForestClassifier(criterion='entropy', max_depth=10, n_estimators=15,
                            max_features=0.6, min_samples_split=50)
rf.fit(train_data, train_target)
test_est = rf.predict(test_data)
print("Random Forest accuracy:")
print(classification_report(test_target, test_est))
fpr_test, tpr_test, th_test = roc_curve(test_target, test_est)
print('Random Forest AUC: %.4f' % auc(fpr_test, tpr_test))

# GBDT
gb = GradientBoostingClassifier(loss='exponential', learning_rate=0.2, n_estimators=40,
                                max_depth=3, min_samples_split=30)
gb.fit(train_data, train_target)
test_est = gb.predict(test_data)
print("GradientBoosting accuracy:")
print(classification_report(test_target, test_est))
fpr_test, tpr_test, th_test = roc_curve(test_target, test_est)
print('GradientBoosting AUC : %.4f' % auc(fpr_test, tpr_test))

# Searching for the best parameters: the chosen parameter ranges strongly affect the
# result, and the search only finds the best combination inside the grid (a local
# optimum), not a global optimum.
# Random forest
param_grid = {
    'criterion': ['entropy', 'gini'],
    'max_depth': [8, 10, 12],
    'n_estimators': [11, 13, 15],
    'max_features': [0.3, 0.4, 0.5],
    'min_samples_split': [4, 8, 12]
}
rfc = RandomForestClassifier()
rfccv = GridSearchCV(estimator=rfc, param_grid=param_grid, scoring='roc_auc', cv=4)
rfccv.fit(train_data, train_target)
test_est = rfccv.predict(test_data)
print("Random Forest accuracy:")
# Classification report
print(classification_report(test_target, test_est))
fpr_test, tpr_test, th_test = roc_curve(test_target, test_est)
print('Random Forest AUC: %.4f' % auc(fpr_test, tpr_test))
print('Best parameters:\n', rfccv.best_params_)

# GBDT
param_grid = {
    'learning_rate': [0.1, 0.3, 0.5],
    'n_estimators': [15, 20, 30],
    'max_depth': [1, 2, 3],
    'min_samples_split': [12, 16, 20]
}
gbc = GradientBoostingClassifier()
gbccv = GridSearchCV(estimator=gbc, param_grid=param_grid, scoring='roc_auc', cv=4)
gbccv.fit(train_data, train_target)
test_est = gbccv.predict(test_data)
print("Gradient Boosting accuracy:")
# Classification report
print(classification_report(test_target, test_est))
fpr_test, tpr_test, th_test = roc_curve(test_target, test_est)
print('Gradient Boosting AUC : %.4f' % auc(fpr_test, tpr_test))
print('Best parameters:\n', gbccv.best_params_)
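
One detail of the evaluation is worth illustrating. The script passes hard 0/1 predictions to `roc_curve`, so the ROC curve has a single interior point and the reported AUC equals the average of recall and specificity. A ranking-based AUC is usually computed from continuous scores instead, for example `predict_proba(...)[:, 1]`. The sketch below, in plain Python with made-up labels and scores, shows the difference. The helper `rank_auc` and the toy numbers are assumptions for illustration only.

```python
def rank_auc(y_true, scores):
    """AUC as the chance that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical labels and predicted fraud probabilities
y_true = [0, 0, 0, 0, 1, 1]
proba = [0.1, 0.2, 0.6, 0.3, 0.4, 0.9]
# Hard predictions at the default 0.5 threshold
hard = [1 if p >= 0.5 else 0 for p in proba]

auc_from_scores = rank_auc(y_true, proba)
auc_from_labels = rank_auc(y_true, hard)

# With hard labels, AUC reduces to (recall + specificity) / 2
tp = sum(1 for y, h in zip(y_true, hard) if y == 1 and h == 1)
tn = sum(1 for y, h in zip(y_true, hard) if y == 0 and h == 0)
recall = tp / y_true.count(1)
specificity = tn / y_true.count(0)

print('AUC from scores: %.3f' % auc_from_scores)
print('AUC from hard labels: %.3f' % auc_from_labels)
print('(recall + specificity) / 2: %.3f' % ((recall + specificity) / 2))
```

If threshold-independent comparison between models is the goal, passing probabilities to `roc_curve` or `roc_auc_score` is the more common choice. The numbers reported by the original script correspond to the hard-label variant.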

Final code

# Credit card anti-fraud model
# Goal: identify fraudulent records in the transaction data
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from imblearn.combine import SMOTETomek
# Oversampling
from imblearn.over_sampling import RandomOverSampler, SMOTE
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (auc, roc_auc_score, precision_score, roc_curve,
                             recall_score, classification_report)
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.tree import DecisionTreeClassifier

# Data preparation
# 1. Load the data
data = pd.read_csv('data/creditcard.csv', delimiter=',')
print(data.shape)
print(data.head(5))
# Class distribution of the samples
print(data['Class'].value_counts())
# Bar chart of the number of samples in each of the two classes
plt.subplots(1, 1, figsize=(7, 5))
count_classes = data['Class'].value_counts().sort_index()
count_classes.plot(kind='bar')
plt.title('Fraud class histogram', fontsize=13)
plt.xlabel('Class', fontsize=13)
plt.ylabel('Frequency', fontsize=15)
# Keep the x-axis tick labels horizontal
plt.xticks(rotation=0)
plt.show()
# The chart shows that the vast majority of samples belong to class 0 (legitimate
# transactions) and only a tiny fraction belong to class 1 (fraud), so the data
# is extremely imbalanced.

# Data sampling
# Splitting imbalanced data into a training set and a test set
data = data.drop(['Time'], axis=1)
# 1. Stratified sampling on the target variable
# StratifiedShuffleSplit from sklearn.model_selection first shuffles the data set and
# then produces train/test pairs according to its parameters, keeping the class
# ratio the same in every split.
# This slice selects columns V1..V28 as features (Amount is not included)
X = np.array(data.loc[:, :'V28'])
y = np.array(data['Class'])
# n_splits=1 means a single random split
sess = StratifiedShuffleSplit(n_splits=1, test_size=0.4, random_state=0)
for train_index, test_index in sess.split(X, y):
    print(len(train_index))  # 170884
    # Split the data set
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
print('train_size:%s' % len(y_train), 'test_size:%s' % len(y_test))  # train_size:170884 test_size:113923

plt.figure(figsize=(7, 5))
count_classes = pd.Series(y_train).value_counts()
count_classes.plot(kind='bar')
plt.title("Histogram of fraud class in training data", fontsize=13)
plt.xlabel("Class", fontsize=13)
plt.ylabel("Frequency", fontsize=15)
plt.xticks(rotation=0)
plt.show()
# The bar chart of the training set shows far more class 0 than class 1 samples.
# The data is imbalanced and has to be handled before modeling.

# 2. Balance the classes by oversampling
# Random oversampling
ros = RandomOverSampler(random_state=0)
# SMOTE oversampling
sos = SMOTE(random_state=0)
# Combined over- and under-sampling (SMOTE + Tomek links)
kos = SMOTETomek(random_state=0)
# fit_resample replaces fit_sample, which was removed in newer imbalanced-learn releases
X_ros, y_ros = ros.fit_resample(X_train, y_train)
X_sos, y_sos = sos.fit_resample(X_train, y_train)
X_kos, y_kos = kos.fit_resample(X_train, y_train)
print('ros:%s,sos:%s,kos:%s' % (len(y_ros), len(y_sos), len(y_kos)))
a = pd.DataFrame(y_ros)
print(a[0].value_counts())

# Modeling and parameter tuning
# After oversampling the two classes are balanced. A decision tree is fitted on the
# original training split and on each of the three oversampled data sets, and the
# data set that predicts best is used for the later models.
clf = DecisionTreeClassifier(criterion='gini', random_state=1234)
param_grid = {'max_depth': [3, 4, 5, 6], 'max_leaf_nodes': [4, 6, 8, 10, 12]}
cv = GridSearchCV(clf, param_grid=param_grid, scoring='f1')
# Named `datasets` so that the DataFrame `data` is not overwritten
datasets = [[X_train, y_train], [X_ros, y_ros], [X_sos, y_sos], [X_kos, y_kos]]
# Train on each data set and evaluate on the untouched test set
for features, labels in datasets:
    cv.fit(features, labels)
    pred_test = cv.predict(X_test)
    print('auc:%.3f' % roc_auc_score(y_test, pred_test),
          'recall:%.3f' % recall_score(y_test, pred_test),
          'precision:%.3f' % precision_score(y_test, pred_test))
# The results show that the randomly oversampled data set gives the highest AUC,
# so it is used to build the prediction models.
train_data = X_ros
train_target = y_ros
test_target = y_test
test_data = X_test

# Logistic regression
# The L1 penalty needs a solver that supports it, e.g. liblinear
lr = LogisticRegression(C=1, penalty='l1', solver='liblinear')
lr.fit(train_data, train_target)
test_est = lr.predict(test_data)
print("Logistic Regression accuracy:")
# Classification report
print(classification_report(test_target, test_est))
# Note: roc_curve receives hard 0/1 predictions here, so the AUC reflects one threshold
fpr_test, tpr_test, th_test = roc_curve(test_target, test_est)
# AUC value
print('Logistic Regression AUC: %.4f' % auc(fpr_test, tpr_test))

# Random forest
rf = RandomForestClassifier(criterion='entropy', max_depth=10, n_estimators=15,
                            max_features=0.6, min_samples_split=50)
rf.fit(train_data, train_target)
test_est = rf.predict(test_data)
print("Random Forest accuracy:")
print(classification_report(test_target, test_est))
fpr_test, tpr_test, th_test = roc_curve(test_target, test_est)
print('Random Forest AUC: %.4f' % auc(fpr_test, tpr_test))

# GBDT
gb = GradientBoostingClassifier(loss='exponential', learning_rate=0.2, n_estimators=40,
                                max_depth=3, min_samples_split=30)
gb.fit(train_data, train_target)
test_est = gb.predict(test_data)
print("GradientBoosting accuracy:")
print(classification_report(test_target, test_est))
fpr_test, tpr_test, th_test = roc_curve(test_target, test_est)
print('GradientBoosting AUC : %.4f' % auc(fpr_test, tpr_test))

# Searching for the best parameters: the chosen parameter ranges strongly affect the
# result, and the search only finds the best combination inside the grid (a local
# optimum), not a global optimum.
# Random forest
param_grid = {
    'criterion': ['entropy', 'gini'],
    'max_depth': [8, 10, 12],
    'n_estimators': [11, 13, 15],
    'max_features': [0.3, 0.4, 0.5],
    'min_samples_split': [4, 8, 12]
}
rfc = RandomForestClassifier()
rfccv = GridSearchCV(estimator=rfc, param_grid=param_grid, scoring='roc_auc', cv=4)
rfccv.fit(train_data, train_target)
test_est = rfccv.predict(test_data)
print("Random Forest accuracy:")
# Classification report
print(classification_report(test_target, test_est))
fpr_test, tpr_test, th_test = roc_curve(test_target, test_est)
print('Random Forest AUC: %.4f' % auc(fpr_test, tpr_test))
print('Best parameters:\n', rfccv.best_params_)

# GBDT
param_grid = {
    'learning_rate': [0.1, 0.3, 0.5],
    'n_estimators': [15, 20, 30],
    'max_depth': [1, 2, 3],
    'min_samples_split': [12, 16, 20]
}
gbc = GradientBoostingClassifier()
gbccv = GridSearchCV(estimator=gbc, param_grid=param_grid, scoring='roc_auc', cv=4)
gbccv.fit(train_data, train_target)
test_est = gbccv.predict(test_data)
print("Gradient Boosting accuracy:")
# Classification report
print(classification_report(test_target, test_est))
fpr_test, tpr_test, th_test = roc_curve(test_target, test_est)
print('Gradient Boosting AUC : %.4f' % auc(fpr_test, tpr_test))
print('Best parameters:\n', gbccv.best_params_)
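
The remark that the parameter ranges shape the search result can be made concrete by counting how many models each grid search trains: every combination in `param_grid` is fitted once per cross-validation fold. The sketch below uses only the standard library and the two grids from the script. `count_fits` is a hypothetical helper introduced for this illustration.

```python
from math import prod


def count_fits(param_grid, n_folds):
    """Return (number of parameter combinations, number of cross-validation fits)."""
    n_candidates = prod(len(values) for values in param_grid.values())
    return n_candidates, n_candidates * n_folds


# Grids copied from the random forest and GBDT searches, both run with cv=4
rf_grid = {
    'criterion': ['entropy', 'gini'],
    'max_depth': [8, 10, 12],
    'n_estimators': [11, 13, 15],
    'max_features': [0.3, 0.4, 0.5],
    'min_samples_split': [4, 8, 12],
}
gb_grid = {
    'learning_rate': [0.1, 0.3, 0.5],
    'n_estimators': [15, 20, 30],
    'max_depth': [1, 2, 3],
    'min_samples_split': [12, 16, 20],
}

rf_candidates, rf_fits = count_fits(rf_grid, n_folds=4)
gb_candidates, gb_fits = count_fits(gb_grid, n_folds=4)
print('random forest: %d combinations, %d fits' % (rf_candidates, rf_fits))
print('GBDT: %d combinations, %d fits' % (gb_candidates, gb_fits))
```

Each added value multiplies the cost, and any setting outside these lists is never tried, which is why the best parameters found are only the best within the grid. `GridSearchCV` also refits the winning combination once on the full training data when `refit=True`, its default.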
