Fraud Detection

Background

Online recruitment fraud (ORF) is a form of malicious behaviour that uses fraudulent job advertisements to cause loss of privacy, economic damage, or reputational harm to stakeholders.

The aim of the analytics task was to identify fraudulent job advertisements from the data, determine key indicators of fraud and make recommendations on how to identify fraudulent job advertisements in the future.

Dataset

We will use the Employment Scam Aegean Dataset (EMSCAD), which can be downloaded at http://icsdweb.aegean.gr/emscad. A description of how the data was collected and a data dictionary are available on that page.

The dataset contains 17,880 real-life job ads. Variables within the dataset include:

Methodology

It is important first to understand how the dataset can be used to distinguish between fraudulent and non-fraudulent ads, as this will govern the type of analytical method that will be employed. The response variable we are trying to predict is the binary field “fraudulent”, where t = “Yes” and f = “No”.

Understanding the dataset to choose the analytical approach

We have a set of variables that are not categorized and are essentially HTML strings: benefits, company profile, description and requirements. The types of analysis suited to this textual data are sentiment and emotion analysis, and word frequency analysis.

There are 11 categorical (factor) variables: location, company logo, industry, function, salary range, department, required education, required experience, employment type, telecommuting, and questions. These will be inputs into machine learning algorithms such as the gradient boosting machine (GBM), distributed random forest (DRF) and generalized linear model (GLMNET) to determine the top predictors that can be used to distinguish between fraudulent and non-fraudulent ads.

As two analytical approaches will be used, one for the string variables and one for the factor variables, there will be two sets of output:

  • HTML variables: Sentiment, emotion and word frequency plots
  • Nominal & Binary Variables: Top predictors, coefficients

Some variables contribute no useful information and were therefore excluded from the analysis. These are title, which is identifying information, and in-balanced, which is used to include or exclude records in order to balance the dataset.

Data ETL

Prior to modelling, a number of steps have to be carried out to cleanse the dataset. The flow diagram below shows the steps that were carried out to prepare the dataset for modelling.

Figure 1: Data ETL Flow Diagram

Results

Output: Text Analytics

Word clouds

Word clouds were created for each of the HTML strings: company profile, job description, requirements and benefits.
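A word cloud is driven by word counts, so the underlying step can be sketched in base R. This is a minimal illustration under stated assumptions, not the author's pipeline: the sample strings, the `word_freq` helper and the tiny stop-word list are all hypothetical.

```r
# Hypothetical sample of HTML-formatted ad text (not taken from EMSCAD).
ad_html <- c("<p>Earn <b>money</b> fast, money from home</p>",
             "<ul><li>Join our team</li><li>Great team culture</li></ul>")

# Hypothetical helper: strip tags, normalise, tokenise and count words.
word_freq <- function(html_text,
                      stop_words = c("a", "and", "from", "our", "the")) {
  text  <- gsub("<[^>]+>", " ", html_text)          # drop HTML tags
  text  <- tolower(gsub("[^A-Za-z ]", " ", text))   # keep letters and spaces
  words <- unlist(strsplit(text, " +"))
  words <- words[nzchar(words) & !(words %in% stop_words)]
  sort(table(words), decreasing = TRUE)
}

freq <- word_freq(ad_html)
print(freq)
```

Running the same function separately on fraudulent and non-fraudulent ads gives the two frequency tables that a pair of word clouds would visualise.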

The word clouds below are for company profile for non-fraudulent ads (left) and fraudulent ads (right).

  • Non-fraudulent ads emphasize work-life balance (“home”, “life”, “care”) and company culture (“team”, “experience”)
  • Fraudulent ads are largely missing a company profile and, where one is present, emphasize monetary perks (“cell phones”, “money”, “cost”)
Figure 2.1: Company profile, non-fraudulent ads
Figure 2.2: Company profile, fraudulent ads

The word clouds below are for job description for non-fraudulent ads (left) and fraudulent ads (right).

  • Non-fraudulent ads emphasize company offerings (“gas”, “oil”, “operations”)
  • Fraudulent ads emphasize monetary value (“money”, “financially”, “discounts”)
Figure 3.1: Job description, non-fraudulent ads
Figure 3.2: Job description, fraudulent ads

Below are word clouds for job requirements — non-fraudulent on the left and fraudulent on the right.

  • Non-fraudulent ads emphasize years of experience, skills, degree qualifications and project orientation
  • Fraudulent ads emphasize the above attributes to a lesser extent
Figure 4.1: Job requirements, non-fraudulent ads
Figure 4.2: Job requirements, fraudulent ads

Finally, the word clouds below are based on the text for job benefits.

  • Non-fraudulent ads emphasize benefits such as “sick leave”, “hours” and “vacation”
  • Fraudulent ads appear to offer monetary perks such as accommodation, holidays, food, competitive salary and visa, among others
Figure 5.1: Job benefits, non-fraudulent ads
Figure 5.2: Job benefits, fraudulent ads

Sentiment Analysis

Another way to analyse text is via sentiment analysis, which identifies the type of emotion (positive or negative) associated with each word in the text.

For instance, looking at the emotion categories below for the job requirements of non-fraudulent and fraudulent ads, we can see that a greater proportion of non-fraudulent ads (left) are positive (“joy”, “surprise”), whereas the opposite is true for fraudulent ads (right).

Figure 6: Emotion sentiment analysis: Non-fraudulent vs. Fraudulent ads

We can also look at the polarity of these ads, that is, their orientation towards a specific emotion category, positive or negative. A greater proportion of non-fraudulent ads than fraudulent ads are positive.

Figure 7: Emotion (Polarity) analysis for non-fraudulent vs. fraudulent ads

As the word clouds and the bar graphs of textual sentiment show, text information is very useful in predicting certain behaviour. The next logical step would be to tag these ads as positive or negative based on their emotion/polarity and introduce this information as binary variables into a machine learning model, to determine how important these variables are for prediction.

For instance, you would create four variables: job requirements, description, benefits and company profile. For each variable, each ad would be assigned a “0” or “1” to signify “positive” or “negative” sentiment.
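The encoding described above could look roughly like this in base R. The polarity labels, column names and the 0 = positive, 1 = negative convention are assumptions made for illustration; in practice the labels would come from whichever sentiment classifier is used.

```r
# Hypothetical polarity labels per text field for three ads.
polarity <- data.frame(
  company_profile = c("positive", "negative", "positive"),
  description     = c("positive", "positive", "negative"),
  requirements    = c("negative", "negative", "positive"),
  benefits        = c("positive", "negative", "negative"),
  stringsAsFactors = FALSE
)

# Encode each field as 0 = positive, 1 = negative (assumed convention).
sentiment_flags <- as.data.frame(
  lapply(polarity, function(x) as.integer(x == "negative"))
)
names(sentiment_flags) <- paste0(names(polarity), "_negative")
print(sentiment_flags)
```

The resulting four columns can then be bound to the other predictors before the model is trained.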

Now, let’s move on to utilizing the numerical variables in a model for predicting which ads are fraudulent and non-fraudulent.

Machine Learning Models

Overview

It is always good to run a number of different types of models and then select the model, or combination of models, that provides not only the highest accuracy but also meaningful results that can readily be explained to business stakeholders and are likely to be accepted by them.

For this problem, I ran three types of models:

  • Distributed random forest (DRF): Essentially a random forest, which is an ensemble of classification trees, but run in parallel on the h2o server; hence the word “distributed”.

  • Gradient boosting machine (GBM): Like the random forest, it is a classification method consisting of an ensemble of trees. The difference is that random forests build deep, independent trees (i.e. each tree is fitted to a random set of variables on a random subset of the data, the “bagging” method), whereas GBMs build many shallow, weak, dependent, successive trees. In this approach, each tree learns from the previous tree and tries to improve on it by reducing the amount of error and increasing the amount of variation in the response variable explained by the predictor variables.

  • Generalized linear model (GLM): GLMs are an extension of linear models that can be run on a non-normally distributed dependent variable. As this is a classification problem, the link function used is the logit, which gives logistic regression. The output of a logistic regression is a coefficient for each predictor in logits: a one-unit change in the predictor changes the log odds by the value of the coefficient. These logits can be converted to odds ratios to provide more meaningful information.

  • To calculate the odds ratio, we exponentiate each coefficient, i.e. raise e to the power of the coefficient b: e^b
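As a small worked example of converting logits to odds ratios, the sketch below exponentiates a few coefficients. The coefficient names and values are hypothetical and are not output from the models in this article.

```r
# Hypothetical logit coefficients (illustrative values only).
logit_coef <- c(has_company_logo = -1.2, telecommuting = 0.4, has_questions = 0)

# Odds ratio = e^b: the multiplicative change in the odds of an ad being
# fraudulent for a one-unit increase in the predictor.
odds_ratio <- exp(logit_coef)
print(round(odds_ratio, 3))
```

An odds ratio below 1 means the odds of fraud fall as the predictor increases, a value above 1 means they rise, and a value of exactly 1 (coefficient 0) means the predictor has no effect.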

Now that you have some understanding of the three types of models, let’s compare their model accuracy.

Methodology

The dataset was split into training (80% of the dataset) and test (20% of the dataset) sets using a random seed. The goal is to train the model on the training set and test its accuracy on the test set.
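A seeded 80/20 split can be sketched in base R as follows. The data frame is a hypothetical stand-in for the cleaned ads table and the seed value is arbitrary; when the data lives on an h2o server, `h2o.splitFrame` serves the same purpose.

```r
set.seed(1234)  # any fixed seed makes the split reproducible

# Hypothetical stand-in for the cleaned ads table.
ads <- data.frame(id = 1:100,
                  fraudulent = rep(c("f", "t"), times = c(95, 5)))

# Sample 80% of the row indices for training; the rest form the test set.
train_idx <- sample(nrow(ads), size = floor(0.8 * nrow(ads)))
train_set <- ads[train_idx, ]
test_set  <- ads[-train_idx, ]
```

Because fraudulent ads are rare, it is worth checking that both sets contain some of them after splitting.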

The GBM was run with the parameters below: the maximum depth of each tree was set to 4 (4 levels), with a small learning rate and five-fold cross-validation.

Cross-validation is a technique used to validate our training model before we apply it to the test set. Specifying five folds means that the training data is divided into five parts and five different models are built, each trained on four parts and tested on the fifth. So, the first model is trained on parts 1, 2, 3, and 4 and tested on part 5. The second model is trained on parts 1, 3, 4, and 5 and tested on part 2, and so on.

This method is called k-fold cross-validation and allows us to be more confident in the performance of the modelling method used. When we create five different models, we are testing the method on five different, unseen datasets. If we only test the model once, for example on our test set, then we only have a single evaluation, which may give a biased result.
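The fold bookkeeping behind k-fold cross-validation can be sketched in a few lines of base R. The row count and seed are hypothetical; h2o does this internally when `nfolds` is set, so this is only meant to show the idea.

```r
set.seed(1234)  # arbitrary seed so the fold assignment is reproducible

n <- 20  # hypothetical number of training rows
k <- 5   # number of folds

# Assign each row to one of k folds at random, keeping fold sizes equal.
fold_id <- sample(rep(seq_len(k), length.out = n))

# For fold i, train on the other k - 1 folds and validate on fold i.
splits <- lapply(seq_len(k), function(i) {
  list(train = which(fold_id != i), validate = which(fold_id == i))
})
```

Each row is used for validation exactly once, so the five validation scores together cover the whole training set.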

gbm_model <- h2o.gbm(y = y_dv, x = x_iv, training_frame = model_train.h2o,
                     ntrees = 500, max_depth = 4,
                     distribution = "bernoulli",  # for 0-1 outcomes
                     learn_rate = 0.01, seed = 1234, nfolds = 5,
                     keep_cross_validation_predictions = TRUE)

To measure model accuracy, I used the ROC-AUC metrics. The ROC, or Receiver Operating Characteristic, is a probability curve, and the AUC, or Area Under the Curve, is a measure of the degree of separation between classes. In our case, the AUC measures how accurately a given model can distinguish between non-fraudulent and fraudulent ads. The higher the AUC, the better the model is at classifying the ads correctly.

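perf <- h2o.performance(gbm_model, newdata = model_test.h2o)  # evaluate once on the test set
fpr <- h2o.fpr(perf)[['fpr']]  # false positive rate at each threshold
tpr <- h2o.tpr(perf)[['tpr']]  # true positive rate at each threshold
# gbm.auc is assumed to hold the test-set AUC computed earlier, e.g. via h2o.auc(perf)
ggplot( data.table(fpr = fpr, tpr = tpr), aes(fpr, tpr) ) +
  geom_line() + theme_bw() + ggtitle( sprintf('AUC: %f', gbm.auc) )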

The ROC curve, and therefore the AUC, is built from four counts that describe model accuracy:

  • True Positives (TP): Fraudulent ads that were correctly predicted as fraudulent
  • True Negatives (TN): Non-fraudulent ads that were correctly predicted as non-fraudulent
  • False Positives (FP): Non-fraudulent ads that were incorrectly predicted as fraudulent
  • False Negatives (FN): Fraudulent ads that were incorrectly predicted as non-fraudulent

These metrics can then be combined to calculate sensitivity and specificity.

Sensitivity is a measure of what proportion of fraudulent ads were correctly classified.

Sensitivity = count(TP) / (count(TP) + count(FN))

Specificity is a measure of what proportion of non-fraudulent ads were correctly identified.

Specificity = count(TN) / (count(TN) + count(FP))
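To make the two measures concrete, here is a small base R sketch that derives the four counts from actual and predicted labels and then applies the standard definitions, sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP). The label vectors are hypothetical, using the dataset's t/f coding.

```r
# Hypothetical actual and predicted labels ("t" = fraudulent, "f" = non-fraudulent).
actual    <- c("t", "t", "t", "t", "f", "f", "f", "f", "f", "f")
predicted <- c("t", "t", "t", "f", "f", "f", "f", "f", "t", "f")

tp <- sum(actual == "t" & predicted == "t")  # fraudulent, flagged as fraudulent
fn <- sum(actual == "t" & predicted == "f")  # fraudulent, missed
tn <- sum(actual == "f" & predicted == "f")  # non-fraudulent, correctly passed
fp <- sum(actual == "f" & predicted == "t")  # non-fraudulent, wrongly flagged

sensitivity <- tp / (tp + fn)  # share of fraudulent ads that were caught
specificity <- tn / (tn + fp)  # share of non-fraudulent ads that were cleared
```

Here three of the four fraudulent ads are caught and five of the six non-fraudulent ads are cleared.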

When determining which measure is more important for your analysis, ask yourself whether it is more important to correctly identify positives (sensitivity matters more) or negatives (specificity matters more). In our case, we want a model with higher sensitivity, as we are more interested in correctly identifying fraudulent ads.

All these metrics can be summarized in a confusion matrix, a table that compares the number of cases predicted correctly and incorrectly against the actual number of fraudulent and non-fraudulent cases. This information can be used to supplement our understanding of the ROC and AUC metrics.

Another aspect of the ROC-AUC metrics is the threshold used to decide whether an ad is classified as fraudulent or non-fraudulent. To determine the best threshold t, one that maximizes true positives while keeping false positives low, we can use the ROC curve, where we plot the TPR (True Positive Rate) on the y-axis against the FPR (False Positive Rate) on the x-axis.
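The relationship between thresholds, the ROC curve and the AUC can be sketched in base R. The probabilities and labels are hypothetical, and picking the threshold that maximises TPR minus FPR (Youden's J statistic) is one common choice rather than necessarily the criterion used in this analysis.

```r
# Hypothetical predicted probabilities of fraud and true labels (1 = fraudulent).
prob  <- c(0.95, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10)
label <- c(1,    1,    0,    1,    0,    1,    0,    0)

# Sweep the threshold from high to low and record TPR and FPR at each step.
thresholds <- sort(unique(prob), decreasing = TRUE)
tpr <- sapply(thresholds, function(th) sum(prob >= th & label == 1) / sum(label == 1))
fpr <- sapply(thresholds, function(th) sum(prob >= th & label == 0) / sum(label == 0))

# Prepend the (0, 0) starting point, then integrate with the trapezoidal rule.
roc_tpr <- c(0, tpr)
roc_fpr <- c(0, fpr)
auc <- sum(diff(roc_fpr) * (head(roc_tpr, -1) + tail(roc_tpr, -1)) / 2)

# Pick the threshold with the largest TPR - FPR (Youden's J).
best_threshold <- thresholds[which.max(tpr - fpr)]
```

Lowering the threshold moves up and to the right along the curve: more fraudulent ads are caught, at the cost of more false alarms.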

The AUC also lets us compare models: plotting their ROC curves on the test set shows which model is more accurate, as in the model output below.

Model Output

Model Accuracy Comparison

The table below shows that the DRF produces the model with the highest AUC on the test set, 0.962. All three models have high AUC values, well above the 0.5 expected from random prediction.

Figure 9: ROC curves for classification of ads by fraudulence

However, let’s dig deeper into what this AUC means in terms of ads correctly classified as fraudulent, using the confusion matrix below for the GLM model as an example.

The confusion matrix for the GLM on the test set shows that 8.15% of fraudulent ads were misclassified. The sensitivity of this model is 327/(327+29) = 92%, which is very good.

Now, let’s look at the remaining output from the models, specifically the top predictors for classifying fraudulent and non-fraudulent ads.

Most Important Predictors

The variable importance rank in a classification problem tells us how accurately a predictor variable can separate fraudulent ads from non-fraudulent ads, relative to all other predictors used in the model.

For both (a) GBM and (b) DRF, the top three variables in terms of how useful they are in classifying job ads as fraudulent or non-fraudulent are the same: location, company logo and industry. The two models also agree that the has questions and telecommuting variables are the least important.

Now, let’s plot the dataset to better understand how the top predictors vary for fraudulent and non-fraudulent ads.

Let’s look at the top variable, location, where we can see that a greater proportion of fraudulent than non-fraudulent ads are from the USA and Australia as indicated by the circled bars.

Figure 10: Frequency of Fraudulent vs. non-fraudulent ads by country

A greater proportion of fraudulent than non-fraudulent ads do not display a company logo in their job ads.

Figure 11: Frequency of fraudulent vs. non-fraudulent ads by absence/presence of company logo in ad

Understanding the model coefficients

Now, let’s try to numerically understand the relationship between the predictor variables and the classification of ads.

As shown in the table below, the highlighted predictors are best at distinguishing fraudulent and non-fraudulent ads.

  • Of the 767 variables entered into the model, only 48 have a non-zero coefficient (top predictors shown)
  • The greater the probability, the higher the chance of the ad being fraudulent

Conclusion and Next Steps

You should now have a good understanding of how to use both textual and numerical predictors in a classification problem, employing both text analytics tools and machine learning classification algorithms.

So, what can we do next?

  • A combination of textual analysis and predictive modelling should be used to classify job ads into fraudulent and non-fraudulent job ads
  • To improve accuracy of textual analysis the following methods can be introduced:
  • N-grams modelling: Look at the combination of words that occur together to identify patterns
  • Look for trends in capitalization and punctuation
  • Look for trends in emphasized text (bold, italicized)
  • Look for trends in types of HTML tags used (raw text lists vs. list text wrapped in list elements)
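To illustrate the n-gram idea from the list above, the base R sketch below counts word pairs (bigrams) in a piece of text. The `ngrams` helper and the sample sentence are hypothetical, and for simplicity the sketch ignores sentence boundaries.

```r
# Hypothetical helper: return all runs of n consecutive words in a text.
ngrams <- function(text, n = 2) {
  words <- unlist(strsplit(tolower(gsub("[^A-Za-z ]", " ", text)), " +"))
  words <- words[nzchar(words)]
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Hypothetical ad text.
ad_text <- "Work from home. Earn money from home today!"
bigram_counts <- sort(table(ngrams(ad_text, n = 2)), decreasing = TRUE)
print(bigram_counts)
```

Phrases such as "from home" carry more signal than either word on its own, which is the motivation for looking at word combinations.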

Predictive model accuracy can be improved by:

  • Working with a larger dataset
  • Splitting dataset into three: training, test and validation sets
  • Splitting salary range into numerical variables: minimum & maximum
  • Removing variables that are associated with each other (e.g. identified using a chi-squared test of independence)
  • Splitting location into country, state, and city
  • Reducing number of variables by grouping industry and function categories
  • Expanding the dataset to include online behaviour, e.g. number of times the ad was clicked on, IP location, time the ad was uploaded, etc.
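Two of the suggestions above, splitting salary range and splitting location, are simple string operations. The sketch below assumes the raw fields look like "min-max" and "country, state, city"; both layouts and the sample values are assumptions for illustration.

```r
# Hypothetical values; the field layouts are assumed, not confirmed.
ads <- data.frame(
  salary_range = c("20000-28000", "50000-60000", NA),
  location     = c("US, NY, New York", "GB, LND, London", "AU, NSW, Sydney"),
  stringsAsFactors = FALSE
)

# Split "min-max" into two numeric columns; missing ranges stay NA.
salary_parts   <- strsplit(ads$salary_range, "-", fixed = TRUE)
ads$salary_min <- as.numeric(sapply(salary_parts, `[`, 1))
ads$salary_max <- as.numeric(sapply(salary_parts, `[`, 2))

# Split "country, state, city" into three columns.
loc_parts   <- strsplit(ads$location, ", ", fixed = TRUE)
ads$country <- sapply(loc_parts, `[`, 1)
ads$state   <- sapply(loc_parts, `[`, 2)
ads$city    <- sapply(loc_parts, `[`, 3)
```

Real data would need extra handling for malformed entries, such as ranges with only one number or locations with missing parts.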

For all the code used to generate results, see my GitHub repository — https://github.com/shedoesdatascience/fraudanalytics

Original article: https://towardsdatascience.com/identifying-fraudulent-job-advertisements-using-r-programming-230daa20aec7
