Building a Credit Card Fraud Prediction Model

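The Problem This Project Addresses

This project applies machine learning to historical credit card transaction data to build a fraud prediction model, so that fraudulent use of a customer's card can be detected as early as possible.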

Modeling Approach

Project Background

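The dataset contains credit card transactions made by European cardholders in September 2013. It covers transactions that occurred over two days, in which 492 of the 284,807 transactions were fraudulent. The dataset is highly imbalanced: the positive class (fraud) accounts for 0.172% of all transactions.

It contains only numerical input variables that are the result of a PCA transformation. Unfortunately, due to confidentiality issues, the original features and further background information about the data cannot be provided. Features V1, V2, ... V28 are the principal components obtained with PCA; the only features that have not been transformed with PCA are 'Time' and 'Amount'. The feature 'Time' contains the number of seconds elapsed between each transaction and the first transaction in the dataset. The feature 'Amount' is the transaction amount, and it can be used for example-dependent cost-sensitive learning. The feature 'Class' is the response variable; it takes the value 1 in case of fraud and 0 otherwise.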
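1 Scenario Analysis (Algorithm Selection)

1) First, the data we have is two days of credit card transactions by cardholders. It has many dimensions, and the problem to solve is predicting whether a cardholder's card will be used fraudulently. There are only two possible outcomes: fraud or no fraud. Because the data is already labeled (the field Class is the target column), this is a supervised learning scenario. Predicting whether fraud occurs is therefore a binary classification problem, which means a binary classification algorithm can be used to solve it. The algorithm chosen for this project is logistic regression.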

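2) Analyzing the data: the data is structured, so no feature abstraction is needed. Features V1 to V28 have already been processed with PCA, while Time and Amount are on a very different scale from the other features and need feature scaling to bring all features to the same scale. In terms of data quality, there are no garbled or empty values. The field Class is the target column and all other columns are feature columns.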
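The scaling step described above could be sketched as follows with scikit-learn's StandardScaler. The toy values and the `Time_scaled` / `Amount_scaled` column names are illustrative assumptions and are not taken from the dataset.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy stand-in for creditcard.csv: one PCA-like column plus raw Time and Amount.
# The numbers are invented for illustration only.
toy = pd.DataFrame({
    'V1': [-1.36, 1.19, -1.36, -0.97],
    'Time': [0.0, 3600.0, 86400.0, 172000.0],
    'Amount': [149.62, 2.69, 378.66, 123.50],
})

# Standardize Time and Amount to zero mean and unit variance so that they
# sit on a scale comparable to the PCA components.
scaler = StandardScaler()
scaled = scaler.fit_transform(toy[['Time', 'Amount']])
toy['Time_scaled'] = scaled[:, 0]
toy['Amount_scaled'] = scaled[:, 1]

print(toy[['Time_scaled', 'Amount_scaled']])
```

In a real pipeline the scaler would normally be fitted on the training split only and then applied to the test split, so that no information from the test data leaks into training.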

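3) Since all of the data is labeled, the model built on the training set can be evaluated with cross-validation. 70% of the data is used for training and 30% for prediction and evaluation.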
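As a rough illustration of the 70/30 split, here is a minimal sketch on synthetic data. The `stratify=y` argument is an assumption added here, not something the project itself specifies; it keeps the fraud ratio the same in both parts, which matters when positives are rare.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1000 rows, 3 features, 2% positives (invented data).
rng = np.random.RandomState(0)
X = rng.normal(size=(1000, 3))
y = np.zeros(1000, dtype=int)
y[:20] = 1

# 70% for training, 30% held out; stratify=y keeps the class ratio
# the same in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

print(X_train.shape, X_test.shape)
print(y_train.mean(), y_test.mean())
```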

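To summarize the business scenario:

1) Learn from historical records and predict whether a cardholder's card will be used fraudulently. This is a binary supervised learning scenario, and we choose the logistic regression algorithm.

2) The data is structured and needs no feature abstraction; whether feature scaling is needed will be decided after further inspection.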

2 Data Preprocessing

Setup
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# sklearn.cross_validation was removed in scikit-learn 0.20; use sklearn.model_selection
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.metrics import confusion_matrix,precision_recall_curve,auc,roc_auc_score,roc_curve,recall_score,classification_report
import itertools
import seaborn as sns
Loading the data and taking a first look

data = pd.read_csv('C:/Users/Administrator/Desktop/CredictCard/creditcard.csv')
data.shape
data.head()

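The data has 284,807 rows and 31 columns. V1 to V28 are structured data; Time and Amount are on a different scale from the other features, with much larger magnitudes.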

import missingno as msno  # third-party package for visualizing missing values
msno.matrix(data)

Initial data exploration

A first look at the target feature

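# Number of transactions in each class (0 = normal, 1 = fraud)
count_classes = data['Class'].value_counts().sort_index()
count_classes.plot(kind='bar')
plt.title('Fraud class histogram')
plt.xlabel('Class')
plt.ylabel('Frequency')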

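The data is clearly very imbalanced. With such an imbalance the model is likely to be very accurate when predicting '0' but inaccurate when predicting '1'.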
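A quick calculation with the counts quoted in the project background shows why plain accuracy is misleading here: a trivial classifier that labels every transaction as normal would already be right roughly 99.83% of the time while catching no fraud at all.

```python
# Counts quoted in the project background
n_total = 284807
n_fraud = 492

fraud_ratio = n_fraud / n_total
# Accuracy of a trivial classifier that labels every transaction as normal (0)
baseline_accuracy = (n_total - n_fraud) / n_total

print(f'fraud ratio: {fraud_ratio:.5f}')
print(f'accuracy of always predicting 0: {baseline_accuracy:.5f}')
```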

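Class imbalance is very common in machine learning. There are several ways to deal with it:

1) Collect more data

2) Change the evaluation metric. The following metrics give more insight into model quality than plain accuracy:

Confusion matrix: a table that breaks predictions down into correct predictions (the diagonal) and the types of incorrect predictions (which classes the wrong predictions were assigned to);
Precision: a measure of a classifier's exactness;
Recall: a measure of a classifier's completeness;
F1 score (or F-score): a weighted average (the harmonic mean) of precision and recall.

3) Resample the data

Oversampling: add copies of instances from the under-represented class

Undersampling: delete instances from the over-represented class

4) Generate synthetic samples (SMOTE)

5) Try different algorithms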

6) Try penalized models
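The two resampling options in point 3 can be sketched with `sklearn.utils.resample`. The toy frame below is invented for illustration; the project itself uses undersampling on the real data later on.

```python
import pandas as pd
from sklearn.utils import resample

# Invented toy data: 8 normal rows (Class 0) and 2 fraud rows (Class 1).
toy = pd.DataFrame({
    'Amount': [10, 25, 40, 12, 33, 8, 19, 51, 1, 3],
    'Class':  [0, 0, 0, 0, 0, 0, 0, 0, 1, 1],
})
majority = toy[toy.Class == 0]
minority = toy[toy.Class == 1]

# Oversampling: draw minority rows with replacement until both classes match.
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=0)
oversampled = pd.concat([majority, minority_up])

# Undersampling: draw majority rows without replacement down to the minority size.
majority_down = resample(majority, replace=False,
                         n_samples=len(minority), random_state=0)
undersampled = pd.concat([majority_down, minority])

print(oversampled.Class.value_counts().to_dict())
print(undersampled.Class.value_counts().to_dict())
```

Oversampling keeps all the information in the majority class but repeats minority rows, while undersampling discards most majority rows; which is preferable depends on how much data is available.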

3 Feature Engineering

1) Distribution of transaction amounts for fraudulent and normal transactions

f,(ax1,ax2) = plt.subplots(2, 1, sharex=True, figsize=(12,4))
bins=30
ax1.hist(data[data.Class ==1]['Amount'],bins=bins)
ax1.set_title('Fraud')
ax2.hist(data[data.Class == 0]['Amount'], bins=bins)
ax2.set_title('Normal')
plt.xlabel('Amount ($)')
plt.ylabel('Number of Transactions')
plt.yscale('log')
plt.show()

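Fraudulent transactions tend to involve smaller amounts than normal transactions. This suggests that fraudsters prefer small purchases so as not to attract the cardholder's attention.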
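The impression from the histograms could be cross-checked numerically with a per-class summary. The sketch below runs on an invented toy frame, so its numbers say nothing about the real data; on the real DataFrame the same `groupby` call would be applied to `data`.

```python
import pandas as pd

# Invented toy transactions; the real creditcard.csv is not loaded here.
toy = pd.DataFrame({
    'Amount': [5.0, 120.0, 60.0, 2500.0, 1.0, 99.99, 15.0, 300.0],
    'Class':  [0, 0, 0, 0, 1, 1, 0, 1],
})

# Per-class summary statistics of the transaction amount
summary = toy.groupby('Class')['Amount'].agg(['count', 'mean', 'median', 'max'])
print(summary)
```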

2) Distribution over time of normal and fraudulent transactions

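# 'Hour' is not a column of the raw data; here it is assumed to be the number
# of whole hours elapsed since the first transaction, derived from 'Time'
data['Hour'] = (data['Time'] // 3600).astype(int)
plt.figure(figsize=[16,4])
sns.histplot(x=data[data['Class']==1]['Hour'], bins=50, stat='density', kde=True)
sns.histplot(x=data[data['Class']==0]['Hour'], bins=100, stat='density', kde=True)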

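Overall, the time distributions of normal and fraudulent transactions do not differ much. We infer that fraudsters, to reduce the risk of being detected, make their transactions during the periods when normal card use is concentrated. This feature can therefore be filtered out when building the prediction model.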

3) Distributions of the other features

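import matplotlib.gridspec as gridspec
plt.figure(figsize=(12,28*4))
v_features = data.iloc[:,1:29].columns
gs = gridspec.GridSpec(28, 1)
for i, cn in enumerate(data[v_features]):
    ax = plt.subplot(gs[i])
    # Overlay the fraud (Class 1) and normal (Class 0) distributions
    sns.histplot(x=data[data.Class == 1][cn], bins=50, stat='density', kde=True, ax=ax)
    sns.histplot(x=data[data.Class == 0][cn], bins=50, stat='density', kde=True, ax=ax)
    ax.set_xlabel('')
    ax.set_title('histogram of feature:' + str(cn))
plt.show()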

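The figure above shows how each variable is distributed for fraudulent and normal transactions. We keep the variables whose distributions differ clearly between the two classes. We therefore drop V8, V13, V15, V20, V21, V22, V23, V24, V25, V26, V27 and V28, which is consistent with the conclusion we drew earlier from the correlation heatmap, and we also drop Time.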
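One way this feature selection might be applied is sketched below. The toy frame only mimics the column layout of creditcard.csv (its values are zeros), and the names `drop_cols` and `reduced` are illustrative.

```python
import pandas as pd

# Columns named in the text as having similar distributions in both classes
drop_cols = ['V8', 'V13', 'V15', 'V20', 'V21', 'V22', 'V23',
             'V24', 'V25', 'V26', 'V27', 'V28', 'Time']

# Toy frame with the same column layout as creditcard.csv (values are zeros,
# used only to show the column handling).
columns = ['Time'] + [f'V{i}' for i in range(1, 29)] + ['Amount', 'Class']
toy = pd.DataFrame(0.0, index=range(3), columns=columns)

reduced = toy.drop(columns=drop_cols)
print(reduced.columns.tolist())
```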

4 Model Training

1) Handling the imbalanced sample

X = data.loc[:, data.columns != 'Class']
y = data.loc[:, data.columns == 'Class']
# Number of fraud samples
number_records_fraud = len(data[data.Class == 1])
fraud_indices = np.array(data[data.Class == 1].index)
# Indices of the normal transactions
normal_indices = data[data.Class == 0].index
# Randomly pick as many normal transactions as there are fraud samples
random_normal_indices = np.random.choice(normal_indices, number_records_fraud, replace = False)
random_normal_indices = np.array(random_normal_indices)
# Combine the two sets of indices
under_sample_indices = np.concatenate([fraud_indices,random_normal_indices])
# Build the undersampled dataset
under_sample_data = data.iloc[under_sample_indices,:]
X_undersample = under_sample_data.loc[:, under_sample_data.columns != 'Class']
y_undersample = under_sample_data.loc[:, under_sample_data.columns == 'Class']
# Split both the full data and the undersampled data 70/30
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.3, random_state = 0)
X_train_undersample, X_test_undersample, y_train_undersample, y_test_undersample = train_test_split(X_undersample,y_undersample,test_size = 0.3)

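The data is resampled so that the two classes are in a 1:1 ratio. A training set with the original class ratio is also kept for comparing models.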

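2) Training the model

For this model we can use recall as the evaluation metric, since it directly measures the share of fraudulent transactions that we catch.

Accuracy: accuracy = (TP+TN)/total

Precision: precision = TP/(TP+FP)

Recall: recall = TP/(TP+FN)

For more details, see "Understanding accuracy, precision, recall, F1".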
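The three formulas can be checked on a small worked example. The labels below are invented; the hand-computed values are compared against scikit-learn's `precision_score` and `recall_score`.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Invented labels: 1 = fraud, 0 = normal
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

# For binary labels, ravel() yields the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print('by hand:     ', accuracy, precision, recall)

# Cross-check against scikit-learn's own implementations
print('scikit-learn:', precision_score(y_true, y_pred), recall_score(y_true, y_pred))
```

Here 3 of the 4 fraud cases are caught (recall 0.75) and 3 of the 4 alarms are genuine (precision 0.75), while accuracy is 0.8.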

a. Iterate over candidate values to find the regularization parameter that gives the best training result

from sklearn.model_selection import KFold  # KFold(n_splits=...) API

def printing_Kfold_scores(x_train_data, y_train_data):
    fold = KFold(n_splits=5, shuffle=False)
    # Candidate values of C, the inverse of the regularization strength
    c_param_range = [0.01, 0.1, 1, 10, 100]
    results_table = pd.DataFrame({'C_parameter': c_param_range, 'Mean recall score': np.nan})
    for j, c_param in enumerate(c_param_range):
        recall_accs = []
        # indices[0] holds the training rows of each fold, indices[1] the validation rows
        for iteration, indices in enumerate(fold.split(x_train_data), start=1):
            # The liblinear solver supports the L1 penalty
            lr = LogisticRegression(C=c_param, penalty='l1', solver='liblinear')
            lr.fit(x_train_data.iloc[indices[0],:], y_train_data.iloc[indices[0],:].values.ravel())
            y_pred_undersample = lr.predict(x_train_data.iloc[indices[1],:])
            recall_acc = recall_score(y_train_data.iloc[indices[1],:].values.ravel(), y_pred_undersample)
            recall_accs.append(recall_acc)
            print('Iteration ', iteration, ': recall score = ', recall_acc)
        results_table.loc[j, 'Mean recall score'] = np.mean(recall_accs)
        print('')
        print('Mean recall score ', np.mean(recall_accs))
        print('')
    # Return the C value with the highest mean recall
    best_c = results_table.loc[results_table['Mean recall score'].idxmax()]['C_parameter']
    return best_c

best_c = printing_Kfold_scores(X_train_undersample, y_train_undersample)

Iteration  1 : recall score =  0.952380952381
Iteration  2 : recall score =  0.986111111111
Iteration  3 : recall score =  0.984848484848
Iteration  4 : recall score =  1.0
Iteration  5 : recall score =  0.942857142857

Mean recall score  0.97323953824

Iteration  1 : recall score =  0.904761904762
Iteration  2 : recall score =  0.902777777778
Iteration  3 : recall score =  0.878787878788
Iteration  4 : recall score =  0.9375
Iteration  5 : recall score =  0.871428571429

Mean recall score  0.899051226551

Iteration  1 : recall score =  0.920634920635
Iteration  2 : recall score =  0.902777777778
Iteration  3 : recall score =  0.893939393939
Iteration  4 : recall score =  0.9375
Iteration  5 : recall score =  0.871428571429

Mean recall score  0.905256132756

Iteration  1 : recall score =  0.920634920635
Iteration  2 : recall score =  0.902777777778
Iteration  3 : recall score =  0.893939393939
Iteration  4 : recall score =  0.9375
Iteration  5 : recall score =  0.9

Mean recall score  0.91097041847

Iteration  1 : recall score =  0.920634920635
Iteration  2 : recall score =  0.902777777778
Iteration  3 : recall score =  0.909090909091
Iteration  4 : recall score =  0.9375
Iteration  5 : recall score =  0.914285714286

Mean recall score  0.916857864358
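A more compact way to run the same kind of search is `cross_val_score` with `scoring='recall'`. The sketch below is an alternative formulation, not the code that produced the output above: it uses synthetic data from `make_classification`, and with an integer `cv` and a classifier scikit-learn uses stratified folds rather than the plain KFold used in the function.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic, roughly balanced data standing in for the undersampled set
X, y = make_classification(n_samples=600, n_features=10, random_state=0)

mean_recalls = {}
for c in [0.01, 0.1, 1, 10, 100]:
    # liblinear supports the L1 penalty
    lr = LogisticRegression(C=c, penalty='l1', solver='liblinear')
    # 5-fold cross-validation scored by recall of the positive class
    scores = cross_val_score(lr, X, y, cv=5, scoring='recall')
    mean_recalls[c] = scores.mean()
    print(f'C={c}: mean recall = {scores.mean():.4f}')

# Pick the C value with the highest mean recall
best_c = max(mean_recalls, key=mean_recalls.get)
print('best C:', best_c)
```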

b. Build the model with C=0.01 and plot the confusion matrix

def plot_confusion_matrix(cm, classes, normalize=False, title='Confusion matrix', cmap=plt.cm.Blues):
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=0)
    plt.yticks(tick_marks, classes)
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
    else:
        print('Confusion matrix, without normalization')
    # Write each count into its cell, white on dark cells and black on light ones
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j], horizontalalignment="center", color="white" if cm[i, j] > thresh else "black")
    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

# Confusion matrix on the undersampled test set
lr = LogisticRegression(C = best_c, penalty = 'l1', solver = 'liblinear')
lr.fit(X_train_undersample,y_train_undersample.values.ravel())
y_pred_undersample = lr.predict(X_test_undersample)
cnf_matrix = confusion_matrix(y_test_undersample,y_pred_undersample)
np.set_printoptions(precision=2)
print("Recall metric in the testing dataset: ", cnf_matrix[1,1]/(cnf_matrix[1,0]+cnf_matrix[1,1]))
class_names = [0,1]
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=class_names, title='Confusion matrix')
plt.show()

# Confusion matrix on the original (full) test set, using the model trained on undersampled data
lr = LogisticRegression(C = best_c, penalty = 'l1', solver = 'liblinear')
lr.fit(X_train_undersample,y_train_undersample.values.ravel())
y_pred = lr.predict(X_test)
cnf_matrix = confusion_matrix(y_test,y_pred)
np.set_printoptions(precision=2)
print("Recall metric in the testing dataset: ", cnf_matrix[1,1]/(cnf_matrix[1,0]+cnf_matrix[1,1]))
class_names = [0,1]
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=class_names, title='Confusion matrix')
plt.show()

The model achieves a recall of 0.9554 on the resampled data and 0.9319 on the original data. Overall, these are good numbers.

c. Plot the ROC curve to evaluate the model

# The liblinear solver supports the L1 penalty
lr = LogisticRegression(C = best_c, penalty = 'l1', solver = 'liblinear')
y_pred_undersample_score = lr.fit(X_train_undersample,y_train_undersample.values.ravel()).decision_function(X_test_undersample)
fpr, tpr, thresholds = roc_curve(y_test_undersample.values.ravel(),y_pred_undersample_score)
roc_auc = auc(fpr,tpr)
# Plot ROC
plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, 'b',label='AUC = %0.2f'% roc_auc)
plt.legend(loc='lower right')
plt.plot([0,1],[0,1],'r--')
plt.xlim([-0.1,1.0])
plt.ylim([-0.1,1.01])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()

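Generally speaking, a smooth ROC curve suggests that there is no serious overfitting. However, because the data is imbalanced, evaluating with the ROC curve alone is not reliable.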

d. Train and evaluate the model directly on the original data

best_c = printing_Kfold_scores(X_train,y_train)
# Confusion matrix for the model trained on the original, imbalanced data
lr = LogisticRegression(C = best_c, penalty = 'l1', solver = 'liblinear')
lr.fit(X_train,y_train.values.ravel())
y_pred = lr.predict(X_test)
cnf_matrix = confusion_matrix(y_test,y_pred)
np.set_printoptions(precision=2)
print("Recall metric in the testing dataset: ", cnf_matrix[1,1]/(cnf_matrix[1,0]+cnf_matrix[1,1]))
class_names = [0,1]
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=class_names, title='Confusion matrix')
plt.show()
# ROC curve for the same model
lr = LogisticRegression(C = best_c, penalty = 'l1', solver = 'liblinear')
y_pred_score = lr.fit(X_train,y_train.values.ravel()).decision_function(X_test)
fpr, tpr, thresholds = roc_curve(y_test.values.ravel(),y_pred_score)
roc_auc = auc(fpr,tpr)
plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, 'b',label='AUC = %0.2f'% roc_auc)
plt.legend(loc='lower right')
plt.plot([0,1],[0,1],'r--')
plt.xlim([-0.1,1.0])
plt.ylim([-0.1,1.01])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()

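As can be seen, training directly on the original data does not give good recall, that is, the model does not identify fraudulent transactions well. Yet the model still performs well on the ROC curve, so a better evaluation method should be used to assess it.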
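One commonly used alternative for imbalanced data is the precision-recall curve, summarized by average precision. This is a suggestion added here rather than something the project evaluated; the sketch uses synthetic data with about 2% positives and a default (L2) logistic regression.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (average_precision_score, precision_recall_curve,
                             roc_auc_score)
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (about 2% positives); not the real transactions
X, y = make_classification(n_samples=5000, n_features=10, weights=[0.98],
                           flip_y=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

lr = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = lr.decision_function(X_test)

# Points of the precision-recall curve, one per candidate threshold
precision, recall, thresholds = precision_recall_curve(y_test, scores)

roc_auc = roc_auc_score(y_test, scores)
avg_precision = average_precision_score(y_test, scores)
print('ROC AUC:          ', roc_auc)
print('Average precision:', avg_precision)
```

Because precision ignores true negatives, it is not inflated by the large number of normal transactions in the way the false positive rate on the ROC curve is.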

Summary:

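1. When building a machine learning model on imbalanced data, decide case by case whether to rebalance the dataset. In general, precision and recall evaluate a model from two different angles. For example, in earthquake prediction we want recall to be very high, that is, we want every earthquake to be predicted, and we can sacrifice precision to get it: we would rather issue 1000 alerts and catch all 10 earthquakes than issue 100 alerts and catch 8 while missing 2. In convicting suspects, by contrast, the principle of never wronging an innocent person means we want convictions to be highly precise, even if some criminals occasionally go free (low recall). But misjudgments sometimes carry a cost, so the choice should depend on the specific situation.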

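2. The original dataset has been stripped of its business context, so the analysis cannot reveal the concrete characteristics of fraudulent transactions. In a real setting every cardholder has their own spending habits, such as usual locations, times and amounts, and models can be built on these features.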
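The precision/recall trade-off from point 1 can be made concrete by moving the decision threshold of a probabilistic classifier. The sketch below is an illustration on synthetic data, with thresholds chosen arbitrarily: a lower threshold raises more alarms, so recall can only go up, usually at the expense of precision.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic data with about 10% positives; not the real transactions
X, y = make_classification(n_samples=2000, n_features=10, weights=[0.9],
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

lr = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = lr.predict_proba(X_test)[:, 1]  # estimated probability of class 1

recalls = []
for threshold in [0.1, 0.3, 0.5, 0.7, 0.9]:
    # Flag a transaction as positive when its probability reaches the threshold
    y_pred = (proba >= threshold).astype(int)
    p = precision_score(y_test, y_pred, zero_division=0)
    r = recall_score(y_test, y_pred)
    recalls.append(r)
    print(f'threshold={threshold:.1f}  precision={p:.3f}  recall={r:.3f}')
```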