Spam Classification in Python

Introduction

I have always been fascinated by Gmail’s spam detection system, which seems to judge effortlessly whether an incoming email is spam and therefore not worthy of our limited attention.

In this article, I seek to recreate such a spam detection system, but for SMS messages. I will use a few different models and compare their performance.

The models are as follows:

  1. Multinomial Naive Bayes Model (count vectorizer)
  2. Multinomial Naive Bayes Model (tfidf vectorizer)
  3. Support Vector Classifier Model
  4. Logistic Regression Model with n-grams

Using a train-test split, each of the 4 models went through the same stages: vectorizing X_train, fitting the model on X_train and Y_train, making predictions on the test set, and generating a confusion matrix and the area under the receiver operating characteristic curve (AUC-ROC) for evaluation.

The best-performing model was the Logistic Regression Model, although it should be noted that all 4 models performed reasonably well at detecting spam messages (all AUC > 0.9).

Photo by Hannes Johnson on Unsplash

The Data

The data was obtained from UCI’s Machine Learning Repository; alternatively, I have also uploaded the dataset to my GitHub repo. In total, the data set has 5571 rows and 2 columns: spamorham, indicating each message’s spam status, and the message’s text. I found it quite funny how relatable the text is.

Definitions: spam refers to spam messages as they are commonly known; ham refers to non-spam messages.

spam.head(10)

Data Preprocessing

As the dataset is relatively simple, not much preprocessing was needed. Spam messages were marked with a 1, while ham was marked with a 0.
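
The label encoding described above can be sketched as follows. This is a minimal illustration rather than the original code: the three rows are made up, and the column name ‘text’ is an assumption (only ‘spamorham’ and ‘target’ are named in this article).

```python
import pandas as pd

# Hypothetical rows standing in for the real dataset; the "text" column
# name is an assumption made for this sketch.
spam = pd.DataFrame({
    "spamorham": ["ham", "spam", "ham"],
    "text": [
        "See you at lunch?",
        "WINNER! Claim your free prize now",
        "Running late, sorry",
    ],
})

# Encode the label as an integer: spam -> 1, ham -> 0.
df = spam.copy()
df["target"] = (df["spamorham"] == "spam").astype(int)
df = df[["text", "target"]]

print(df.head(10))
```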

df.head(10)

Exploratory Data Analysis

Now, let’s look at the dataset in detail. Taking the average of the ‘target’ column, we find that 13.409% of the messages were marked as spam.

Further, might the message length be correlated with the target outcome? Splitting the spam and ham messages into their own dataframes, we add the number of characters in each message as a third column, ‘len’.

The split data frames
EDA code

Further, taking the average message lengths, we find that spam and ham messages have average lengths of 139.12 and 71.55 characters respectively.
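
A minimal sketch of these EDA steps, run on four made-up messages rather than the real dataset, might look like this. The ‘text’ column name and the variable names are assumptions.

```python
import pandas as pd

# Made-up messages for illustration only; not taken from the dataset.
df = pd.DataFrame({
    "text": [
        "Free entry! Text WIN to 80000 now",
        "ok",
        "Call me later",
        "URGENT: claim your cash prize today",
    ],
    "target": [1, 0, 0, 1],
})

# The mean of a 0/1 column is the fraction of messages labelled spam.
spam_rate = df["target"].mean()

# Split into spam and ham frames, then add the character count as 'len'.
spam_df = df[df["target"] == 1].copy()
ham_df = df[df["target"] == 0].copy()
spam_df["len"] = spam_df["text"].str.len()
ham_df["len"] = ham_df["text"].str.len()

print(spam_rate, spam_df["len"].mean(), ham_df["len"].mean())
```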

Data Modelling

Now it’s time for the interesting stuff.

Train-test split

We begin by creating a train-test split using sklearn’s default split, which keeps 75% of the data for training and 25% for testing.
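
As a rough sketch, the default split looks like this on a handful of placeholder messages. The data and the random_state value are assumptions added so the example is reproducible.

```python
from sklearn.model_selection import train_test_split

# Placeholder messages and labels, for illustration only.
texts = [f"message {i}" for i in range(8)]
labels = [0, 1, 0, 0, 1, 0, 0, 1]

# With no test_size given, sklearn holds out 25% of the rows for testing.
# random_state is fixed here only to make the sketch reproducible.
X_train, X_test, Y_train, Y_test = train_test_split(
    texts, labels, random_state=0
)

print(len(X_train), len(X_test))
```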

Count Vectorizer

A count vectorizer converts a collection of text documents into a sparse matrix of token counts. This is necessary because the models are fitted on numeric features rather than raw text.

We fit the CountVectorizer on X_train, and then use its transform method to convert X_train into a matrix of token counts.
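
The fit-then-transform pattern can be sketched on a toy corpus as below. The sentences are invented, and the variable names other than X_train are assumptions. Note that the vocabulary is learned from the training data only, so unseen test words are simply ignored.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus for illustration only.
X_train = ["free prize now", "see you now", "free free entry"]
X_test = ["free lunch now"]

# Learn the vocabulary from the training messages only.
vect = CountVectorizer().fit(X_train)

# Convert text to sparse matrices of token counts.
X_train_vectorized = vect.transform(X_train)
X_test_vectorized = vect.transform(X_test)

print(sorted(vect.vocabulary_))
print(X_train_vectorized.toarray())
```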

Count Vectorizer code

MNNB Model Fitting

Let’s first try fitting a classic Multinomial Naive Bayes Classifier Model (MNNB) on X_train and Y_train.

A Naive Bayes model assumes that each of the features it uses is conditionally independent of the others given the class. In practice, Naive Bayes models have performed surprisingly well, even on complex tasks where it is clear that the strong independence assumptions are false.
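
A minimal sketch of fitting the classifier on count features follows. The four training messages are made up, and the model uses sklearn’s default smoothing, which may differ from the settings used for the results in this article.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up training messages: 1 = spam, 0 = ham.
X_train = [
    "win a free prize now",
    "free cash claim now",
    "are we still meeting today",
    "see you at lunch today",
]
Y_train = [1, 1, 0, 0]

# Vectorize the text, then fit the classifier on the token counts.
vect = CountVectorizer().fit(X_train)
model = MultinomialNB()  # default smoothing; an assumption for this sketch
model.fit(vect.transform(X_train), Y_train)

# Score two unseen messages.
predictions = model.predict(
    vect.transform(["claim your free prize", "lunch today"])
)
print(predictions)
```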

MNNB Model Evaluation

To evaluate the model’s performance, we generate predictions on the test dataset and then look at the confusion matrix and AUC-ROC score.

The confusion matrix is generated as below:

MNNB confusion matrix

The results seem promising, with a True Positive Rate (TPR) of 92.6%, specificity of 99.7% and a False Positive Rate (FPR) of 0.3%. These results show that the model performs quite well in predicting whether messages are spam, based solely on the text in the messages.
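
For reference, these three rates can be derived from a confusion matrix as sketched below. The labels and predictions are toy values, not the article’s test set.

```python
from sklearn.metrics import confusion_matrix

# Toy labels and predictions for illustration only (1 = spam, 0 = ham).
Y_test = [0, 0, 0, 0, 1, 1, 1, 0]
predictions = [0, 0, 1, 0, 1, 0, 1, 0]

# For binary labels, ravel() yields the cells in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(Y_test, predictions).ravel()

tpr = tp / (tp + fn)          # share of spam that was caught
specificity = tn / (tn + fp)  # share of ham that was left alone
fpr = fp / (fp + tn)          # share of ham wrongly flagged; 1 - specificity

print(tpr, specificity, fpr)
```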

The Receiver Operating Characteristic (ROC) curve is an evaluation tool for binary classification problems. It is a probability curve that plots the TPR against the FPR at various threshold values, essentially separating the ‘signal’ from the ‘noise’. The Area Under the Curve (AUC) measures the ability of a classifier to distinguish between classes and is used as a summary of the ROC curve.

Image from this article

The model produced an AUC score of 0.962, which is far better than the 0.5 expected from a model that makes random guesses.
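
One detail worth illustrating: when AUC is computed from hard 0/1 predictions rather than probabilities, the ROC curve has a single interior point and the score reduces to (TPR + specificity) / 2. The figures reported here are consistent with that, since (0.926 + 0.997) / 2 ≈ 0.9615, although whether the original code scored labels or probabilities is an assumption. The sketch below uses toy values to contrast the two approaches.

```python
from sklearn.metrics import roc_auc_score

# Toy values for illustration only (1 = spam, 0 = ham).
Y_test = [0, 0, 0, 0, 1, 1, 1, 0]
hard_predictions = [0, 0, 1, 0, 1, 0, 1, 0]
spam_probabilities = [0.1, 0.2, 0.6, 0.3, 0.9, 0.4, 0.8, 0.05]

# Scoring 0/1 labels gives the average of TPR and specificity.
auc_from_labels = roc_auc_score(Y_test, hard_predictions)

# Scoring probabilities uses the full ranking of the messages.
auc_from_scores = roc_auc_score(Y_test, spam_probabilities)

print(auc_from_labels, auc_from_scores)
```

Scoring probabilities (for example from predict_proba) is the more common choice, because it evaluates the classifier across all thresholds instead of just one.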

Although the Multinomial Naive Bayes Classifier seems to have worked quite well, I felt that the result could be improved further with a different model.

MNNB (count_vect) code for fitting and evaluation

MNNB (Tfidf-vectorizer) Model Fitting

I then use a tfidf vectorizer instead of a count vectorizer to see if it improves the results.

The goal of using tfidf is to scale down the impact of tokens that occur very frequently in a given corpus and that are hence empirically less informative than features that occur in a small fraction of the training corpus.
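
The down-weighting can be seen in the inverse document frequency (idf) values that sklearn learns. In this toy corpus, ‘call’ appears in every message and therefore gets the lowest possible weight, while ‘prize’ appears once and is weighted more heavily. The corpus is made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus for illustration only.
X_train = ["call me now", "call you later", "call for free prize"]

vect = TfidfVectorizer().fit(X_train)

# Map each term to its learned idf weight.
idf = {term: vect.idf_[index] for term, index in vect.vocabulary_.items()}

# With the default smoothing, idf = ln((1 + n) / (1 + df)) + 1, so a term
# found in every document has an idf of exactly 1.
print(idf["call"], idf["prize"])
```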

MNNB (Tfidf-vectorizer) Model Evaluation

In evaluating the model’s performance, we look at the AUC-ROC score and the confusion matrix again. The model generates an AUC score of 91.67%.

The results seem promising, with a True Positive Rate (TPR) of 83.3%, specificity of 100% and a False Positive Rate (FPR) of 0.0%.

Tfidf confusion matrix

When comparing the two models based on AUC scores, it seems that the tfidf vectorizer did not improve the model, and even introduced more noise into the predictions! However, it does seem to have improved the model’s ability to identify ham messages, raising specificity from 99.7% to 100%, at the cost of a lower TPR (83.3% versus 92.6%).

MNNB (tfid_vect) code for fitting and evaluation

Being a stubborn person, I still believe that better performance can be obtained, with a few tweaks.

SVC Model Fitting

I now fit and transform the training data X_train using a Tfidf Vectorizer, ignoring terms that have a document frequency strictly lower than 5. After adding one more feature, the length of the document (number of characters), I fit a Support Vector Classification (SVC) model with regularization parameter C=10000.
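
A rough sketch of this pipeline is shown below. The add_feature helper, the toy messages and the default RBF kernel are assumptions made for illustration; only min_df=5, the length feature and C=10000 come from the description above.

```python
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC


def add_feature(X, feature_to_add):
    """Append one numeric column to a sparse matrix (hypothetical helper)."""
    return hstack([X, csr_matrix(feature_to_add).T], format="csr")


# Toy messages for illustration only: 1 = spam, 0 = ham.
X_train = (
    ["free prize call now"] * 4
    + ["free prize call now jackpot"]
    + ["see you later tonight"] * 5
)
Y_train = [1] * 5 + [0] * 5

# min_df=5 drops terms that appear in fewer than 5 documents ("jackpot").
vect = TfidfVectorizer(min_df=5).fit(X_train)

# Append the message length in characters as an extra feature column.
X_train_vectorized = add_feature(
    vect.transform(X_train), [len(text) for text in X_train]
)

model = SVC(C=10000).fit(X_train_vectorized, Y_train)

# The same two steps must be applied to unseen messages.
X_test = ["free prize now", "see you tonight"]
X_test_vectorized = add_feature(
    vect.transform(X_test), [len(text) for text in X_test]
)
predictions = model.predict(X_test_vectorized)
print(predictions)
```

Because the length column is on a much larger scale than the tf-idf weights, it can dominate a distance-based kernel; rescaling it may be worth trying.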

SVC Model Evaluation

This results in the following:

  • AUC score of 97.4%
  • TPR of 95.1%
  • Specificity of 99.7%
  • FPR of 0.3%
SVC confusion matrix
SVM code for fitting and evaluation

Logistic Regression Model (n-grams) Fitting

For the logistic regression, I additionally use n-grams, which allow the model to take into account groups of up to 3 consecutive words when deciding whether a message is spam.
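
As a sketch, n-grams of up to 3 words can be requested through the vectorizer’s ngram_range parameter. The choice of CountVectorizer, the max_iter value and the toy messages are assumptions; the article does not state which vectorizer or settings were used with this model.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Toy messages for illustration only: 1 = spam, 0 = ham.
X_train = [
    "win a free prize",
    "free prize inside",
    "not free tonight sorry",
    "are you free tonight",
]
Y_train = [1, 1, 0, 0]

# ngram_range=(1, 3) adds word pairs and triples to the single-word features.
vect = CountVectorizer(ngram_range=(1, 3)).fit(X_train)
X_train_vectorized = vect.transform(X_train)

# max_iter is raised only to avoid convergence warnings in this sketch.
model = LogisticRegression(max_iter=1000).fit(X_train_vectorized, Y_train)

predictions = model.predict(vect.transform(["free prize", "free tonight"]))
print(sorted(vect.vocabulary_))
```

Here the word ‘free’ appears in every message, so on its own it says little; the pairs ‘free prize’ and ‘free tonight’ are what separate the two classes.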

Logistic Regression Model (n-grams) Evaluation

This results in the following:

  • AUC score of 97.7%
  • TPR of 95.6%
  • Specificity of 99.7%
  • FPR of 0.3%

Model Comparison

After training and testing these 4 models, it’s time to compare them. I primarily compare them based on AUC scores, as the TPR and TNR values are all somewhat similar.

The logistic regression had the highest AUC score, with the SVC model and MNNB 1 model marginally behind. The MNNB 2 model underperformed the rest. However, all 4 models produce AUC scores much higher than 0.5, showing that all 4 perform well enough to beat a model that only randomly guesses the target.
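
The ranking can be reproduced directly from the AUC scores reported in the sections above. The dictionary keys are descriptive labels chosen for this sketch.

```python
# AUC scores as reported in this article.
auc_scores = {
    "MNNB 1 (count vectorizer)": 0.962,
    "MNNB 2 (tfidf vectorizer)": 0.9167,
    "SVC": 0.974,
    "Logistic regression (n-grams)": 0.977,
}

# Sort from best to worst AUC.
ranking = sorted(auc_scores.items(), key=lambda item: item[1], reverse=True)
for name, auc in ranking:
    print(f"{name}: {auc:.3f}")

best_model = ranking[0][0]
```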

AUC score comparison between the 4 models

Thanks for the read!

Do find the code here.

Do feel free to reach out to me on LinkedIn if you have questions or would like to discuss ideas on applying data science techniques in a post-Covid-19 world!

Translated from: https://medium.com/analytics-vidhya/create-a-sms-spam-classifier-in-python-b4b015f7404b
