Amazon Real-Time Fraud Detection:

AWS Lambda

AWS Lambda is a serverless, event-driven compute service that lets you run code for virtually any type of application or backend service without …

Amazon S3
Amazon S3 is cloud object storage with industry-leading scalability, data availability, security, and performance. S3 is ideal for data lakes, …

Amazon SageMaker
Machine learning platform

Amazon Neptune: Graph database


Data Preprocessing and Feature Engineering

This section describes how to preprocess the sample dataset to determine the relationships between nodes in a heterogeneous graph.

Dataset

In this use case, we benchmark the modeling method using the IEEE-CIS fraud dataset. This is an anonymized dataset containing roughly 500,000 transactions between users. The dataset contains two main tables:

Transactions table: contains information about transactions or interactions between users.

    • TransactionDT: timedelta from a given reference datetime (not an actual timestamp)
    • TransactionAmt: transaction payment amount in USD
    • ProductCD: product code, the product for each transaction
    • card1 - card6: payment card information, such as card type, card category, issuing bank, country, etc.
    • addr1, addr2: both addresses are for the purchaser; addr1 is the billing region and addr2 is the billing country.
    • dist: distances between (not limited to) billing address, mailing address, zip code, IP address, phone area, etc.
    • P_emaildomain, R_emaildomain: purchaser and recipient email domains. Certain transactions don't need a recipient, so R_emaildomain is null.
    • C1-C14: counts, such as how many addresses are found to be associated with the payment card, etc. The actual meaning is masked. They could be counts of phone numbers, email addresses, names associated with the user, device, ipaddr, billingaddr, etc. These cover both purchaser and recipient.
    • D1-D15: timedelta, such as days since the previous transaction, etc.
    • M1-M9: match, such as names on card and address, etc.
    • Vxxx: Vesta-engineered rich features, including ranking, counting, and other entity relations.
      For example, how many times the payment card associated with an IP and email or address appeared in a 24-hour time range, etc.
      All Vesta features were derived as numerical. Some of them are counts of orders within a cluster, a time period, or a condition, so the value is finite and has an ordering (or ranking).

Categorical Features:

    • ProductCD
    • card1 - card6
    • addr1, addr2
    • P_emaildomain, R_emaildomain
    • M1 - M9
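
The categorical columns listed above can be expanded into explicit column names before encoding. The following is a minimal sketch, assuming the column names match the feature list; `split_columns` and `example_header` are hypothetical names used only for illustration.

```python
# Expand the column ranges named in the feature list into explicit names.
card_cols = [f"card{i}" for i in range(1, 7)]   # card1 .. card6
match_cols = [f"M{i}" for i in range(1, 10)]    # M1 .. M9

transaction_categorical = (
    ["ProductCD"] + card_cols + ["addr1", "addr2"]
    + ["P_emaildomain", "R_emaildomain"] + match_cols
)


def split_columns(all_columns, categorical):
    """Partition column names into (categorical, numerical), preserving order."""
    cat_set = set(categorical)
    cat = [c for c in all_columns if c in cat_set]
    num = [c for c in all_columns if c not in cat_set]
    return cat, num


# Hypothetical header fragment, used only to exercise the helper.
example_header = ["TransactionDT", "TransactionAmt", "ProductCD", "card1", "C1", "M4"]
cat, num = split_columns(example_header, transaction_categorical)
print(cat)  # ['ProductCD', 'card1', 'M4']
print(num)  # ['TransactionDT', 'TransactionAmt', 'C1']
```

Keeping the categorical list in one place makes it easier to treat these columns as node types or one-hot features later, rather than as numbers.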

train_transaction.csv: 553 features (399 Vesta-engineered features)


Identity table: contains the access log, device, and network information of the specific user executing the transaction.
Variables in this table are identity information: network connection information (IP, ISP, proxy, etc.) and digital signature (UA/browser/OS/version, etc.) associated with transactions.
They're collected by Vesta's fraud protection system and digital security partners.

Categorical Features:

  1. DeviceType
  2. DeviceInfo
  3. id_12 - id_38
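
To use both tables together, the identity columns have to be attached to the matching transaction rows. The sketch below shows a left join in plain Python, assuming both tables share a `TransactionID` key (as the processed edge-list file names suggest) and that some transactions have no identity row. The sample rows and the `left_join` helper are made up for illustration.

```python
import csv
import io

# Tiny made-up samples; the real files have many more columns and rows.
transactions_csv = """TransactionID,TransactionAmt,ProductCD
1001,59.0,W
1002,29.0,H
1003,117.5,C
"""
identity_csv = """TransactionID,DeviceType,DeviceInfo
1002,mobile,iOS Device
1003,desktop,Windows
"""


def left_join(left_rows, right_rows, key):
    """Left-join two lists of dict rows on `key`; missing right-side fields become None."""
    right_by_key = {row[key]: row for row in right_rows}
    right_fields = [f for f in (right_rows[0].keys() if right_rows else []) if f != key]
    joined = []
    for row in left_rows:
        match = right_by_key.get(row[key], {})
        merged_row = dict(row)
        for field in right_fields:
            merged_row[field] = match.get(field)
        joined.append(merged_row)
    return joined


transactions = list(csv.DictReader(io.StringIO(transactions_csv)))
identity = list(csv.DictReader(io.StringIO(identity_csv)))
merged = left_join(transactions, identity, "TransactionID")
for row in merged:
    print(row)
```

A left join keeps every transaction, so rows without identity information stay in the data with empty identity fields instead of being dropped.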

“id01 to id11 are numerical features for identity, which is collected by Vesta and security partners such as device rating, ip_domain rating, proxy rating, etc. Also it recorded behavioral fingerprint like account login times/failed to login times, how long an account stayed on the page, etc. All of these are not able to elaborate due to security partner T&C. I hope you could get basic meaning of these features, and by mentioning them as numerical/categorical, you won’t deal with them inappropriately.”

*test_identity.csv

We can use a subset of these transactions and their labels as the supervision signal for model training. For transactions in the test dataset, the labels are masked during training.

The task of the model is then clear: predict which of the masked transactions are fraudulent and which are legitimate.
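
The masking step can be sketched as follows. This is an illustration only, assuming labels are stored per transaction ID and that hidden labels are marked with a sentinel value of -1; `build_masks` and `masked_labels` are hypothetical helpers, not functions from the solution.

```python
def build_masks(all_ids, test_ids):
    """Return boolean train/test masks aligned with all_ids."""
    test_set = set(test_ids)
    train_mask = [tid not in test_set for tid in all_ids]
    test_mask = [tid in test_set for tid in all_ids]
    return train_mask, test_mask


def masked_labels(labels, train_mask, hidden=-1):
    """Hide the labels of test transactions so training never sees them."""
    return [y if keep else hidden for y, keep in zip(labels, train_mask)]


# Made-up transaction IDs and labels (1 = fraud, 0 = legitimate).
all_ids = [1001, 1002, 1003, 1004, 1005]
labels = [0, 1, 0, 0, 1]
test_ids = [1004, 1005]

train_mask, test_mask = build_masks(all_ids, test_ids)
visible = masked_labels(labels, train_mask)
print(visible)  # [0, 1, 0, -1, -1]
```

The test nodes stay in the graph, so they still contribute structure and features; only their labels are withheld from the loss.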


Loading Pre-processed Data from S3

The dataset used in this solution is the IEEE-CIS Fraud Detection dataset, a typical example of the financial transaction datasets that many companies have. The dataset consists of two tables:

  • Transactions: Records transactions between two users and metadata about them. Examples of columns include the product code for the transaction and features of the card used for the transaction.
  • Identity: Contains identity information about the users performing transactions. Examples of columns include the device type and device IDs used.

This notebook assumes that the two data tables have been pre-processed, mimicking the first-time data preparation.

The current version uses the pre-processed data in nearly raw format, including all relation files, a feature file, a tag file, and a test index file.

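from os import path
from sagemaker.s3 import S3Downloader

# train_data is the S3 prefix of the processed data
# (s3://graph-fraud-detection/dgl/processed-data in the output below)
processed_files = S3Downloader.list(train_data)
print("===== Processed Files =====")
print('\n'.join(processed_files))

Output: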
===== Processed Files =====
s3://graph-fraud-detection/dgl/processed-data/features.csv
s3://graph-fraud-detection/dgl/processed-data/relation_DeviceInfo_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_DeviceType_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_P_emaildomain_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_ProductCD_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_R_emaildomain_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_TransactionID_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_addr1_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_addr2_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_card1_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_card2_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_card3_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_card4_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_card5_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_card6_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_01_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_02_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_03_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_04_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_05_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_06_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_07_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_08_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_09_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_10_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_11_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_12_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_13_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_14_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_15_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_16_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_17_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_18_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_19_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_20_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_21_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_22_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_23_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_24_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_25_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_26_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_27_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_28_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_29_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_30_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_31_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_32_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_33_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_34_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_35_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_36_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_37_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_38_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/tags.csv
s3://graph-fraud-detection/dgl/processed-data/test.csv

Each relation edge-list file represents a different edge type used to construct the heterogeneous graph during training. features.csv contains the final transformed features of the transaction nodes, while tags.csv contains the node labels used as the training supervision signal. test.csv contains the TransactionID values that form the test set for evaluating model performance. The labels of these nodes are masked during training so that they do not leak into the model's predictions.
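
The way edge-list files become typed edges can be sketched in plain Python. This is a minimal illustration, assuming each file has a two-column header naming the source and destination node types; the sample rows and the `TransactionID<>card1` style edge-type names are hypothetical. The resulting dictionary, keyed by (source type, edge type, destination type), has the shape that a heterogeneous graph constructor such as `dgl.heterograph` takes, although the training code that does this is not shown here.

```python
import csv
import io

# Made-up contents for two of the edge-list files.
edgelists = {
    "relation_card1_edgelist.csv": "TransactionID,card1\n1001,13926\n1002,2755\n1003,13926\n",
    "relation_DeviceType_edgelist.csv": "TransactionID,DeviceType\n1002,mobile\n1003,desktop\n",
}

node_ids = {}    # node type -> {raw value -> integer id}
graph_data = {}  # (src_type, edge_type, dst_type) -> (src id list, dst id list)


def node_id(ntype, value):
    """Give each distinct value of a node type a consecutive integer id."""
    table = node_ids.setdefault(ntype, {})
    return table.setdefault(value, len(table))


for text in edgelists.values():
    reader = csv.reader(io.StringIO(text))
    src_type, dst_type = next(reader)  # header row names the two node types
    src, dst = [], []
    for s, d in reader:
        src.append(node_id(src_type, s))
        dst.append(node_id(dst_type, d))
    graph_data[(src_type, f"{src_type}<>{dst_type}", dst_type)] = (src, dst)

for etype, (src, dst) in graph_data.items():
    print(etype, src, dst)
```

Sharing one ID table per node type keeps the same transaction mapped to the same integer across all relation files, which is what lets the relations meet at the transaction nodes.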

Train a Graph Neural Network with DGL

We can model the fraud detection problem as a node classification task. The goal of the graph neural network is to learn how to use information from the topology of each transaction node's sub-graph to transform the node's features into a representation space where the node can be easily classified as fraudulent or not.

Specifically, we use a relational graph convolutional network (R-GCN) on a heterogeneous graph, since we have nodes and edges of different types.
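
To make the idea concrete, here is a toy sketch of one R-GCN-style propagation step in plain Python. It is not the model used in the solution. It assumes every node has the same feature size, one weight matrix per relation plus a self-loop matrix, mean aggregation over incoming neighbors per relation, and a ReLU activation. All names and numbers are made up.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def rgcn_layer(h, edges_by_relation, W_rel, W_self):
    """One step: h_i' = ReLU(W_self h_i + sum_r mean_{j in N_r(i)} W_r h_j).

    h: {node: feature list}; edges_by_relation: {relation: [(src, dst), ...]}.
    """
    out = {node: matvec(W_self, feat) for node, feat in h.items()}
    for rel, edges in edges_by_relation.items():
        in_deg = {}
        for _, dst in edges:
            in_deg[dst] = in_deg.get(dst, 0) + 1
        for src, dst in edges:
            msg = matvec(W_rel[rel], h[src])
            # Normalize by the number of incoming edges of this relation.
            out[dst] = [o + m / in_deg[dst] for o, m in zip(out[dst], msg)]
    return {n: [max(0.0, v) for v in vec] for n, vec in out.items()}


# Two transactions sharing one card, with edges in both directions.
h = {"t1": [1, 0], "t2": [0, 2], "card": [2, 2]}
edges = {
    "uses": [("t1", "card"), ("t2", "card")],
    "used_by": [("card", "t1"), ("card", "t2")],
}
W_self = [[1, 0], [0, 1]]
W_rel = {"uses": [[1, 0], [0, 1]], "used_by": [[0.5, 0], [0, 0.5]]}

out = rgcn_layer(h, edges, W_rel, W_self)
print(out)  # {'t1': [2.0, 1.0], 't2': [1.0, 3.0], 'card': [2.5, 3.0]}
```

After one step each transaction has absorbed information from the card it used, and stacking layers lets signals travel between transactions that share a card, device, or address.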

import argparse
import os


def parse_args():
    parser = argparse.ArgumentParser()

    # Paths: SageMaker sets these environment variables inside the training container
    parser.add_argument('--training-dir', type=str, default=os.environ['SM_CHANNEL_TRAIN'])
    parser.add_argument('--model-dir', type=str, default=os.environ['SM_MODEL_DIR'])
    parser.add_argument('--output-dir', type=str, default=os.environ['SM_OUTPUT_DATA_DIR'])

    # Input files and graph definition
    parser.add_argument('--nodes', type=str, default='features.csv')
    parser.add_argument('--target-ntype', type=str, default='TransactionID')
    parser.add_argument('--edges', type=str, default='homogeneous_edgelist.csv')
    parser.add_argument('--labels', type=str, default='tags.csv')
    parser.add_argument('--new-accounts', type=str, default='test.csv')

    # Evaluation
    parser.add_argument('--compute-metrics', type=lambda x: (str(x).lower() in ['true', '1', 'yes']),
                        default=True, help='compute evaluation metrics after training')
    parser.add_argument('--threshold', type=float, default=0,
                        help='threshold for making predictions, default : argmax')

    # Training hyperparameters
    parser.add_argument('--num-gpus', type=int, default=0)
    parser.add_argument('--optimizer', type=str, default='adam')
    parser.add_argument('--lr', type=float, default=1e-2)
    parser.add_argument('--n-epochs', type=int, default=100)
    parser.add_argument('--n-hidden', type=int, default=16, help='number of hidden units')
    parser.add_argument('--n-layers', type=int, default=3, help='number of hidden layers')
    parser.add_argument('--weight-decay', type=float, default=5e-4, help='Weight for L2 loss')
    parser.add_argument('--dropout', type=float, default=0.2,
                        help='dropout probability, for gat only features')
    parser.add_argument('--embedding-size', type=int, default=360,
                        help="embedding size for node embedding")

    return parser.parse_args()

  • nodes is the name of the file that contains the node_ids of the target nodes and the node features.

  • edges is a regular expression that, when expanded, lists all the filenames of the edge lists.

  • labels is the name of the file that contains the target node_ids and their labels.
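
The expansion of the edges argument can be sketched like this. It is an illustration under assumptions: the pattern `relation.*_edgelist\.csv` and the `expand_edges` helper are hypothetical, and whether the training script uses Python's `re` module or shell-style globbing is not shown here.

```python
import re


def expand_edges(pattern, filenames):
    """Return the sorted filenames that fully match the given regular expression."""
    rx = re.compile(pattern)
    return sorted(f for f in filenames if rx.fullmatch(f))


# A few of the processed file names from the listing above.
files = [
    "features.csv",
    "relation_card1_edgelist.csv",
    "relation_addr1_edgelist.csv",
    "tags.csv",
    "test.csv",
]

edge_files = expand_edges(r"relation.*_edgelist\.csv", files)
print(edge_files)  # ['relation_addr1_edgelist.csv', 'relation_card1_edgelist.csv']
```

Passing a pattern instead of a fixed list means new relation files are picked up without changing the training arguments.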


Reference & resource

resource 1: IEEE-CIS Fraud Detection@kaggle

resource 2: realtime-fraud-detection-with-gnn-on-dgl@github

resource 3: realtime-fraud-detection-with-gnn-on-dgl@aws.amazon
