It’s Sunday morning, it’s quiet, and you wake up with a big smile on your face. Today is going to be a great day! Except, your phone rings, rather “internationally”. You pick it up slowly and hear something really bizarre: “Bonjour, je suis Michele. Oops, sorry. I am Michele, your personal bank agent.” What could possibly be so urgent that someone from Switzerland is calling you at this hour? “Did you authorize a transaction for $3,358.65 for 100 copies of Diablo 3?” Immediately, you start thinking of ways to explain to your loved one why you did that. “No, I didn’t!?” Michele’s answer is quick and to the point: “Thank you, we’re on it.” Whew, that was close! But how did Michele know that this transaction was suspicious? After all, you did order 10 new smartphones from that same bank account last week, and Michele didn’t call then.


Annual global fraud losses reached $21.8 billion in 2015, according to the Nilson Report. If you are a fraudster, you probably feel very lucky: about 12 cents of every $100 spent in the US were stolen during that same year. Our friend Michele might have a serious problem to solve here.

In this part of the series, we will train an Autoencoder Neural Network (implemented in Keras) in an unsupervised (or rather, semi-supervised) fashion for anomaly detection in credit card transaction data. The trained model will be evaluated on a pre-labeled and anonymized dataset.

The source code and pre-trained model are available on GitHub here.

Setup

We will be using TensorFlow 1.2 and Keras 2.0.4. Let’s begin:

import pandas as pd
import numpy as np
import pickle
import matplotlib.pyplot as plt
from scipy import stats
import tensorflow as tf
import seaborn as sns
from pylab import rcParams
from sklearn.model_selection import train_test_split
from keras.models import Model, load_model
from keras.layers import Input, Dense
from keras.callbacks import ModelCheckpoint, TensorBoard
from keras import regularizers

%matplotlib inline

sns.set(style='whitegrid', palette='muted', font_scale=1.5)
rcParams['figure.figsize'] = 14, 8

RANDOM_SEED = 42
LABELS = ["Normal", "Fraud"]

Loading the data

The dataset we’re going to use can be downloaded from Kaggle. It contains data about credit card transactions that occurred during a period of two days, with 492 frauds out of 284,807 transactions.

All variables in the dataset are numerical. For privacy reasons, the data has been transformed using PCA. The only two features that haven’t been transformed are Time and Amount. Time contains the seconds elapsed between each transaction and the first transaction in the dataset.

df = pd.read_csv("data/creditcard.csv")

Exploration

df.shape

(284807, 31)

31 columns, 2 of which are Time and Amount. The rest are output from the PCA transformation. Let’s check for missing values:

df.isnull().values.any()

False

count_classes = pd.value_counts(df['Class'], sort = True)
count_classes.plot(kind = 'bar', rot=0)
plt.title("Transaction class distribution")
plt.xticks(range(2), LABELS)
plt.xlabel("Class")
plt.ylabel("Frequency");

We have a highly imbalanced dataset on our hands. Normal transactions overwhelm the fraudulent ones by a large margin. Let’s look at the two types of transactions:

frauds = df[df.Class == 1]
normal = df[df.Class == 0]

frauds.shape

(492, 31)

normal.shape

(284315, 31)

How different is the amount of money used in the two transaction classes?

frauds.Amount.describe()

count     492.000000
mean      122.211321
std       256.683288
min         0.000000
25%         1.000000
50%         9.250000
75%       105.890000
max      2125.870000
Name: Amount, dtype: float64

normal.Amount.describe()

count    284315.000000
mean         88.291022
std         250.105092
min           0.000000
25%           5.650000
50%          22.000000
75%          77.050000
max       25691.160000
Name: Amount, dtype: float64

Let’s have a more graphical representation:

f, (ax1, ax2) = plt.subplots(2, 1, sharex=True)
f.suptitle('Amount per transaction by class')

bins = 50

ax1.hist(frauds.Amount, bins=bins)
ax1.set_title('Fraud')

ax2.hist(normal.Amount, bins=bins)
ax2.set_title('Normal')

plt.xlabel('Amount ($)')
plt.ylabel('Number of Transactions')
plt.xlim((0, 20000))
plt.yscale('log')
plt.show();

Do fraudulent transactions occur more often at certain times?

f, (ax1, ax2) = plt.subplots(2, 1, sharex=True)
f.suptitle('Time of transaction vs Amount by class')

ax1.scatter(frauds.Time, frauds.Amount)
ax1.set_title('Fraud')

ax2.scatter(normal.Time, normal.Amount)
ax2.set_title('Normal')

plt.xlabel('Time (in Seconds)')
plt.ylabel('Amount')
plt.show()

Doesn’t seem like the time of transaction really matters.

Autoencoders

Autoencoders can seem quite bizarre at first. The job of those models is to predict the input, given that same input. Puzzling? It definitely was for me the first time I heard it.

More specifically, let’s take a look at Autoencoder Neural Networks. This autoencoder tries to learn to approximate the following identity function:

$f_{W,b}(x) \approx x$

While trying to do just that might sound trivial at first, it is important to note that we want to learn a compressed representation of the data, and thus find structure. This can be done by limiting the number of hidden units in the model. These kinds of autoencoders are called undercomplete.

Here’s a visual representation of what an Autoencoder might learn:

Reconstruction error

We optimize the parameters of our Autoencoder model in such a way that a special kind of error - the reconstruction error - is minimized. In practice, the traditional squared error is often used:

$L(x, x') = \|x - x'\|^2$
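To make that concrete, here is a minimal NumPy sketch (with made-up vectors for the input and its reconstruction) of how this error is computed for a single example:

import numpy as np

x = np.array([0.5, -1.2, 3.0])        # original input (made-up values)
x_prime = np.array([0.4, -1.0, 2.8])  # a hypothetical reconstruction

# squared L2 norm of the difference - the reconstruction error L(x, x')
error = np.sum(np.power(x - x_prime, 2))
print(error)  # 0.09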

If you want to learn more about Autoencoders, I highly recommend the following videos by Hugo Larochelle:

Preparing the data

First, let’s drop the Time column (we’re not going to use it) and apply scikit-learn’s StandardScaler to the Amount column. The scaler removes the mean and scales the values to unit variance:

from sklearn.preprocessing import StandardScaler

data = df.drop(['Time'], axis=1)
data['Amount'] = StandardScaler().fit_transform(data['Amount'].values.reshape(-1, 1))

Training our Autoencoder is going to be a bit different from what we are used to. Let’s say you have a dataset containing a lot of non-fraudulent transactions at hand, and you want to detect anomalies in new transactions. We will recreate this situation by training our model on the normal transactions only. Reserving the correct class for the test set will give us a way to evaluate the performance of our model. We will reserve 20% of our data for testing:

X_train, X_test = train_test_split(data, test_size=0.2, random_state=RANDOM_SEED)
X_train = X_train[X_train.Class == 0]
X_train = X_train.drop(['Class'], axis=1)

y_test = X_test['Class']
X_test = X_test.drop(['Class'], axis=1)

X_train = X_train.values
X_test = X_test.values

X_train.shape

(227451, 29)

Building the model

Our Autoencoder uses 4 fully connected layers with 14, 7, 7 and 29 neurons, respectively. The first two layers form the encoder; the last two form the decoder. Additionally, L1 regularization will be used during training:

input_dim = X_train.shape[1]
encoding_dim = 14

input_layer = Input(shape=(input_dim, ))

encoder = Dense(encoding_dim, activation="tanh",
                activity_regularizer=regularizers.l1(10e-5))(input_layer)
encoder = Dense(int(encoding_dim / 2), activation="relu")(encoder)

decoder = Dense(int(encoding_dim / 2), activation='tanh')(encoder)
decoder = Dense(input_dim, activation='relu')(decoder)

autoencoder = Model(inputs=input_layer, outputs=decoder)

Let’s train our model for 100 epochs with a batch size of 32 samples and save the best-performing model to a file. The ModelCheckpoint callback provided by Keras is really handy for such tasks. Additionally, the training progress will be exported in a format that TensorBoard understands.

nb_epoch = 100
batch_size = 32

autoencoder.compile(optimizer='adam',
                    loss='mean_squared_error',
                    metrics=['accuracy'])

checkpointer = ModelCheckpoint(filepath="model.h5",
                               verbose=0,
                               save_best_only=True)
tensorboard = TensorBoard(log_dir='./logs',
                          histogram_freq=0,
                          write_graph=True,
                          write_images=True)

history = autoencoder.fit(X_train, X_train,
                          epochs=nb_epoch,
                          batch_size=batch_size,
                          shuffle=True,
                          validation_data=(X_test, X_test),
                          verbose=1,
                          callbacks=[checkpointer, tensorboard]).history

autoencoder = load_model('model.h5')

Evaluation

plt.plot(history['loss'])
plt.plot(history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper right');

The reconstruction error on our training and test data seems to converge nicely. Is it low enough? Let’s have a closer look at the error distribution:

predictions = autoencoder.predict(X_test)

mse = np.mean(np.power(X_test - predictions, 2), axis=1)
error_df = pd.DataFrame({'reconstruction_error': mse,
                         'true_class': y_test})

error_df.describe()

       reconstruction_error    true_class
count          56962.000000  56962.000000
mean               0.742613      0.001720
std                3.396838      0.041443
min                0.050139      0.000000
25%                0.237033      0.000000
50%                0.390503      0.000000
75%                0.626220      0.000000
max              263.909955      1.000000

Reconstruction error without fraud

fig = plt.figure()
ax = fig.add_subplot(111)
normal_error_df = error_df[(error_df['true_class']== 0) & (error_df['reconstruction_error'] < 10)]
_ = ax.hist(normal_error_df.reconstruction_error.values, bins=10)

Reconstruction error with fraud

fig = plt.figure()
ax = fig.add_subplot(111)
fraud_error_df = error_df[error_df['true_class'] == 1]
_ = ax.hist(fraud_error_df.reconstruction_error.values, bins=10)

from sklearn.metrics import (confusion_matrix, precision_recall_curve, auc,
                             roc_curve, recall_score, classification_report,
                             f1_score, precision_recall_fscore_support)

ROC curves are a very useful tool for understanding the performance of binary classifiers. However, our case is a bit out of the ordinary: we have a very imbalanced dataset. Nonetheless, let’s have a look at our ROC curve:

fpr, tpr, thresholds = roc_curve(error_df.true_class, error_df.reconstruction_error)
roc_auc = auc(fpr, tpr)

plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, label='AUC = %0.4f'% roc_auc)
plt.legend(loc='lower right')
plt.plot([0,1],[0,1],'r--')
plt.xlim([-0.001, 1])
plt.ylim([0, 1.001])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show();

The ROC curve plots the true positive rate versus the false positive rate over different threshold values. Basically, we want the blue line to be as close as possible to the upper left corner. While our results look pretty good, we have to keep in mind the nature of our dataset. ROC doesn’t look very useful for us. Onward…

Precision vs Recall


Precision and recall are defined as follows:

$\text{Precision} = \dfrac{\text{true positives}}{\text{true positives} + \text{false positives}}$

$\text{Recall} = \dfrac{\text{true positives}}{\text{true positives} + \text{false negatives}}$

Let’s take an example from Information Retrieval in order to better understand what precision and recall are. Precision measures the relevancy of the obtained results, while recall measures how many of the relevant results are returned. Both take values between 0 and 1, and you would love to have a system where both are equal to 1.

Returning to our Information Retrieval example: high recall but low precision means many results, most of which have low or no relevancy. When precision is high but recall is low we have the opposite - few returned results, but with very high relevancy. Ideally, you want high precision and high recall - many results that are highly relevant.
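To make this concrete, here is a tiny sketch (with made-up labels, not our fraud data) that computes both metrics using scikit-learn:

from sklearn.metrics import precision_score, recall_score

# made-up ground truth and predictions: 1 = relevant/fraud, 0 = not
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]

# 3 true positives, 1 false positive -> precision = 3/4
print(precision_score(y_true, y_pred))  # 0.75

# 3 true positives, 1 false negative -> recall = 3/4
print(recall_score(y_true, y_pred))     # 0.75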

precision, recall, th = precision_recall_curve(error_df.true_class, error_df.reconstruction_error)
plt.plot(recall, precision, 'b', label='Precision-Recall curve')
plt.title('Recall vs Precision')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.show()

A high area under the curve represents both high recall and high precision, where high precision relates to a low false positive rate, and high recall relates to a low false negative rate. High scores for both show that the classifier is returning accurate results (high precision), as well as returning a majority of all positive results (high recall).

plt.plot(th, precision[1:], 'b', label='Threshold-Precision curve')
plt.title('Precision for different threshold values')
plt.xlabel('Threshold')
plt.ylabel('Precision')
plt.show()

You can see that as the threshold on the reconstruction error increases, our precision rises as well. Let’s have a look at the recall:

plt.plot(th, recall[1:], 'b', label='Threshold-Recall curve')
plt.title('Recall for different threshold values')
plt.xlabel('Reconstruction error')
plt.ylabel('Recall')
plt.show()

Here, we have the exact opposite situation. As the threshold on the reconstruction error increases, the recall decreases.

Prediction

Our model is a bit different this time. It doesn’t know how to predict new values. But we don’t need that. In order to predict whether a new/unseen transaction is normal or fraudulent, we’ll calculate the reconstruction error from the transaction data itself. If the error is larger than a predefined threshold, we’ll mark the transaction as a fraud (since our model should have a low error on normal transactions). Let’s pick that value:

threshold = 2.9
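With a threshold picked, here is a minimal sketch of how you might wrap this logic for a single new transaction. The helper name is ours, not part of the original code, and it assumes the transaction comes in as a 1-D array of the 29 already-scaled features:

def is_fraud(transaction, model=autoencoder, threshold=threshold):
    # Keras expects a batch dimension, so reshape the single example
    transaction = transaction.reshape(1, -1)
    reconstruction = model.predict(transaction)
    # same mean squared reconstruction error as used above
    mse = np.mean(np.power(transaction - reconstruction, 2))
    return mse > threshold

For instance, is_fraud(X_test[0]) would return True only if that transaction’s reconstruction error exceeds the 2.9 threshold.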

And see how well we’re dividing the two types of transactions:

groups = error_df.groupby('true_class')
fig, ax = plt.subplots()

for name, group in groups:
    ax.plot(group.index, group.reconstruction_error, marker='o', ms=3.5, linestyle='',
            label="Fraud" if name == 1 else "Normal")
ax.hlines(threshold, ax.get_xlim()[0], ax.get_xlim()[1], colors="r", zorder=100, label='Threshold')
ax.legend()
plt.title("Reconstruction error for different classes")
plt.ylabel("Reconstruction error")
plt.xlabel("Data point index")
plt.show();

I know, that chart might be a bit deceiving. Let’s have a look at the confusion matrix:

y_pred = [1 if e > threshold else 0 for e in error_df.reconstruction_error.values]
conf_matrix = confusion_matrix(error_df.true_class, y_pred)

plt.figure(figsize=(12, 12))
sns.heatmap(conf_matrix, xticklabels=LABELS, yticklabels=LABELS, annot=True, fmt="d");
plt.title("Confusion matrix")
plt.ylabel('True class')
plt.xlabel('Predicted class')
plt.show()

Our model seems to catch a lot of the fraudulent cases. Of course, there is a catch (see what I did there?): the number of normal transactions classified as frauds is really high. Is this really a problem? It probably is. You might want to increase or decrease the threshold value, depending on the problem. That one is up to you.
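If you want to explore that trade-off more systematically, here is a small sketch (reusing error_df from above) that prints precision and recall for a few candidate thresholds:

from sklearn.metrics import precision_score, recall_score

# sweep a few candidate thresholds and inspect the precision/recall trade-off
for candidate in [1.0, 2.9, 5.0, 10.0]:
    preds = (error_df.reconstruction_error > candidate).astype(int)
    print("threshold=%5.1f precision=%.3f recall=%.3f" % (
        candidate,
        precision_score(error_df.true_class, preds),
        recall_score(error_df.true_class, preds)))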

Conclusion

We’ve created a very simple Deep Autoencoder in Keras that can reconstruct what non-fraudulent transactions look like. Initially, I was a bit skeptical about whether or not this whole thing was going to work out, but it kind of did. Think about it: we gave a lot of one-class examples (normal transactions) to a model and it learned (somewhat) how to discriminate whether or not new examples belong to that same class. Isn’t that cool? Our dataset was kind of magical, though. We really don’t know what the original features look like.

Keras gave us a very clean and easy-to-use API for building a non-trivial Deep Autoencoder. You can search for TensorFlow implementations and see for yourself how much boilerplate you need in order to train one. Can you apply a similar model to a different problem?

The source code and pre-trained model are available on GitHub here.

References

  • Building Autoencoders in Keras
  • Stanford tutorial on Autoencoders
  • Stacked Autoencoders in TensorFlow

Original article: http://curiousily.com/data-science/2017/06/11/tensorflow-for-hackers-part-7.html
