Python spam detection: a hand-written naive Bayes classifier for spam email/SMS

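Writing these classic algorithms from scratch, without calling APIs such as sklearn, and then tuning a few parameters turns out to be quite rewarding and instructive.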

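Dataset

Overview: 5572 SMS messages, 13% of them spam.

Why this dataset: text preprocessing for SMS is simpler than for email and the computation is lighter, which makes it easier to focus on the algorithm itself.

The dataset comes from Kaggle. Its sampling is reasonably sound, so it gives a more accurate picture of how well the algorithm works.

My backup copy of the data: spam.csv on github.com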

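How the algorithm works

Objective: given a document d, compute the probability that it belongs to each class c, and take the class with the highest probability as the result:

$$\hat{c} = \arg\max_{c} P(c \mid d)$$

In spam email/SMS detection there are only two classes: spam and not-spam.

In the spam-detection field, not-spam is usually called ham. There is no deep reason for the name; someone early on came up with it on a whim. So the class names become spam and ham.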

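Applying Bayes' rule and dropping the denominator P(d), which is the same for every class, the objective becomes

$$\hat{c} = \arg\max_{c} P(c \mid d) = \arg\max_{c} \frac{P(d \mid c)\,P(c)}{P(d)} = \arg\max_{c} P(f_1, f_2, \ldots, f_n \mid c)\,P(c)$$

where f1, f2, ..., fn are the features of the document.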

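There are many ways to choose features: word frequency, the TF-IDF value of each word, or word frequency after removing stop words. The choice of features has nothing to do with Bayes itself; it is determined by the problem we want to solve. Different features are discussed and compared experimentally later.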

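Assume the features are mutually independent given the class (this is the "naive" assumption). In practice, even when they are not independent, applying the model directly still works well.

This gives a new formula:

$$\hat{c} = \arg\max_{c} P(c) \prod_{i=1}^{n} P(f_i \mid c)$$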

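As with other language models, we switch to log space:

$$\hat{c} = \arg\max_{c} \Big[ \log P(c) + \sum_{i=1}^{n} \log P(f_i \mid c) \Big]$$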
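
One practical reason for working in log space is numerical: a product of many small probabilities underflows to zero in floating point, while the sum of their logs stays well within range. A minimal sketch, with made-up per-word probabilities (the value 1e-5 and the count 100 are illustrative assumptions, not taken from the dataset):

```python
import math

# Hypothetical per-word likelihoods: 100 words, each with probability 1e-5.
word_probs = [1e-5] * 100

# Multiplying directly underflows: 1e-500 is below the smallest positive float.
product = 1.0
for p in word_probs:
    product *= p

# Summing logs keeps the same information in a comfortable numeric range.
log_sum = sum(math.log(p) for p in word_probs)

print(product)  # 0.0
print(log_sum)  # about -1151.29
```

Since log is monotonic, the class that maximizes the sum of logs is the same class that maximizes the original product.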

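Feature selection

Spam email detection is a mature field, and commercial versions also use non-text features such as the sender address.

Here we restrict the discussion to text (natural-language) features.

Before machine learning there were already rule-based filters, built on patterns that people had noticed, for example:

online pharmaceutical

WITHOUT ANY COST

Dear Winner

In short, spam tends to write certain keywords in all capitals, and its phrasing differs from ordinary text.

Common NLP preprocessing steps, such as lowercasing everything or TF-IDF weighting, can actually remove some of these features, so they may not be appropriate here.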
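
To see what lowercasing throws away, the sketch below tokenizes a sentence with and without case folding. The tokenizer (whitespace split plus stripping surrounding punctuation) mirrors the one used in the implementation later in this post; the function name `tokenize` and the sample sentence are made up for illustration.

```python
import string

def tokenize(text, lowercase=False):
    """Split on whitespace and strip surrounding punctuation from each token."""
    if lowercase:
        text = text.lower()
    return [w.strip(string.punctuation) for w in text.split()]

sample = "Dear Winner, claim your prize WITHOUT ANY COST. It is free, without doubt!"

cased = set(tokenize(sample))
folded = set(tokenize(sample, lowercase=True))

# With case preserved, "WITHOUT" and "without" are two distinct features.
print("WITHOUT" in cased, "without" in cased)    # True True
# After lowercasing they collapse into one, and the all-caps signal is gone.
print("WITHOUT" in folded, "without" in folded)  # False True
```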

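We use the simplest measure, the number of times each word occurs, as the feature.

All words under all classes in the training corpus form a single vocabulary; then, for each class separately, we count how many times each word occurs.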
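
A minimal sketch of this counting step on a toy corpus. The four training messages are invented for illustration, and `collections.Counter` is used here only for brevity:

```python
from collections import Counter

# Toy training set of (label, text) pairs; the messages are made up.
train_set = [
    ("spam", "WIN a FREE prize now"),
    ("spam", "FREE entry call now"),
    ("ham", "are you free now"),
    ("ham", "call me later"),
]

word_counts = {}  # label -> Counter of word occurrences under that label
for label, text in train_set:
    word_counts.setdefault(label, Counter()).update(text.split())

# The vocabulary is shared: every word seen under any label.
vocabulary = set()
for counts in word_counts.values():
    vocabulary |= set(counts)

print(len(vocabulary))                                          # 12
print(word_counts["spam"]["FREE"], word_counts["ham"]["FREE"])  # 2 0
print(word_counts["spam"]["now"], word_counts["ham"]["now"])    # 2 1
```

Note that "FREE" has a count of 0 under ham even though it is in the shared vocabulary, which is exactly the zero-probability case that smoothing has to handle.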

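A word that never occurs under some class gets probability 0 there, which drives the whole product for that class to 0. To solve this we use add-one (Laplace) smoothing:

$$P(w_i \mid c) = \frac{\mathrm{count}(w_i, c) + 1}{\sum_{w \in V} \mathrm{count}(w, c) + |V|}$$

where V is the vocabulary and count(w, c) is the number of times word w occurs in the training documents of class c.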
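
A small worked example of the smoothed estimate, under assumed counts (the function name and the numbers are illustrative, not measured on the dataset):

```python
import math

def smoothed_log_likelihood(word_count, total_words_in_class, vocab_size):
    """log P(w | c) with add-one (Laplace) smoothing."""
    return math.log((word_count + 1) / (total_words_in_class + vocab_size))

# Assumed numbers: a class with 1000 word tokens and a vocabulary of 500 words.
total, V = 1000, 500

unseen = smoothed_log_likelihood(0, total, V)  # word never seen in this class
seen = smoothed_log_likelihood(29, total, V)   # word seen 29 times

print(math.exp(unseen))  # 1/1500: small, but no longer zero
print(math.exp(seen))    # 30/1500 = 0.02
```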

Pseudocode

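Model evaluation

Spam is treated as the positive class. With TP, FP and FN denoting true positives (spam classified as spam), false positives (ham classified as spam) and false negatives (spam classified as ham):

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}$$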
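
With spam as the positive class, precision and recall can be computed from (true label, predicted label) pairs as in the sketch below. The function name `precision_recall` and the sample pairs are hypothetical, chosen only for illustration.

```python
def precision_recall(pairs, positive="spam"):
    """Return (precision, recall) for (true_label, predicted_label) pairs."""
    tp = sum(1 for true, pred in pairs if true == positive and pred == positive)
    fp = sum(1 for true, pred in pairs if true != positive and pred == positive)
    fn = sum(1 for true, pred in pairs if true == positive and pred != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Made-up results: 3 spam caught, 2 spam missed, 1 ham wrongly flagged, 5 ham passed.
pairs = ([("spam", "spam")] * 3 + [("spam", "ham")] * 2 +
         [("ham", "spam")] * 1 + [("ham", "ham")] * 5)

precision, recall = precision_recall(pairs)
print(precision, recall)  # 0.75 0.6
```

Precision answers "of the messages flagged as spam, how many really were spam", and recall answers "of the real spam, how much was caught".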

Python implementation

import csv
import math
import random
import string
from collections import Counter


def load_data(filename, train_ratio):
    # The Kaggle spam.csv is not valid UTF-8; latin-1 decodes every byte.
    with open(filename, newline='', encoding='latin-1') as f:
        csv_reader = csv.reader(f)
        next(csv_reader)  # skip the header row
        # Column 0 is the label ('ham' or 'spam'), column 1 is the text.
        dataset = [(line[0], line[1]) for line in csv_reader]
    random.shuffle(dataset)
    train_size = int(len(dataset) * train_ratio)
    return dataset[:train_size], dataset[train_size:]


def train(train_set):
    total_doc_cnt = len(train_set)
    label_doc_cnt = Counter()  # label -> number of documents
    bigdoc_words = {}          # label -> Counter of word occurrences
    for label, doc in train_set:
        label_doc_cnt[label] += 1
        bigdoc_words.setdefault(label, Counter()).update(
            w.strip(string.punctuation) for w in doc.split())

    # The vocabulary is the union of the words seen under every label.
    # A sorted list keeps it aligned with the likelihood lists below.
    vocabulary = sorted(set().union(*bigdoc_words.values()))
    V = len(vocabulary)

    log_priors = {label: math.log(cnt / total_doc_cnt)
                  for label, cnt in label_doc_cnt.items()}

    log_likelihoods = {}
    for label, words in bigdoc_words.items():
        # Add-one (Laplace) smoothing: +1 per word, +V in the denominator.
        word_cnt = sum(words.values()) + V
        log_likelihoods[label] = [math.log((1 + words[w]) / word_cnt)
                                  for w in vocabulary]
    return log_priors, log_likelihoods, vocabulary


def predict(log_priors, log_likelihoods, vocabulary, input_text, expect_label=None):
    # A set: each distinct word of the message contributes once.
    # Words that are not in the vocabulary are ignored.
    words = {w.strip(string.punctuation) for w in input_text.split()}
    prob_max = float('-inf')  # log probabilities are negative
    label_max = None
    probs = {}  # per-label scores, kept only for logging
    for label, likelihood in log_likelihoods.items():
        prob = log_priors[label] + sum(
            p for w, p in zip(vocabulary, likelihood) if w in words)
        probs[label] = prob
        if prob > prob_max:
            prob_max = prob
            label_max = label
    # Log every misclassified message when the expected label is known.
    if expect_label and expect_label != label_max:
        print('---')
        print('expect: %s, got: %s' % (expect_label, label_max))
        print(probs)
        print(input_text)
    return label_max


def main():
    filename = 'input/spam.csv'
    train_ratio = 0.75
    train_data, test_data = load_data(filename, train_ratio)
    print('data loaded. train: {}, test: {}'.format(
        len(train_data), len(test_data)))

    # train the model
    log_priors, log_likelihoods, vocabulary = train(train_data)
    print('model trained. log_priors: {}, V(vocabulary word count): {}'.format(
        log_priors, len(vocabulary)))

    # spam is the positive class; counts are grouped by the true label
    pos_true = 0   # spam predicted as spam (true positives)
    pos_false = 0  # spam predicted as ham (false negatives)
    neg_true = 0   # ham predicted as ham (true negatives)
    neg_false = 0  # ham predicted as spam (false positives)
    for label, text in test_data:
        got = predict(log_priors, log_likelihoods, vocabulary, text, label)
        if label != got:
            if label == 'spam':
                pos_false += 1
            else:
                neg_false += 1
        else:
            if label == 'spam':
                pos_true += 1
            else:
                neg_true += 1

    print('positive(spam) true: %s, false: %s' % (pos_true, pos_false))
    print('negative true: %s, false: %s' % (neg_true, neg_false))
    print('Precision: %.2f%%, Recall: %.2f%%' % (
        100.0 * pos_true / (pos_true + neg_false),  # TP / (TP + FP)
        100.0 * pos_true / (pos_true + pos_false),  # TP / (TP + FN)
    ))


if __name__ == '__main__':
    main()

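Results

load_data shuffles the dataset, so every run gives slightly different results, but recall stays around 90% and precision around 96% or higher. With spam as the positive class, precision = TP / (TP + FP) and recall = TP / (TP + FN). In the counts below, "positive false" is spam classified as ham (a false negative) and "negative false" is ham classified as spam (a false positive).

data loaded. train: 4179, test: 1393
model trained. log_priors: {'ham': -0.14608724117045765, 'spam': -1.995705843726764}, V(vocabulary word count): 9996
positive(spam) true: 156, false: 23
negative true: 1213, false: 1
Precision: 99.36%, Recall: 87.15%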

Details of the misclassified messages

---

expect: spam, got: ham

{'ham': -84.45531071975456, 'spam': -90.70055052987975}

You can donate £2.50 to UNICEF's Asian Tsunami disaster support fund by texting DONATE to 864233. £2.50 will be added to your next bill

---

expect: spam, got: ham

{'ham': -139.40891845750357, 'spam': -147.473600840229}

Got what it takes 2 take part in the WRC Rally in Oz? U can with Lucozade Energy! Text RALLY LE to 61200 (25p), see packs or lucozade.co.uk/wrc & itcould be u!

---

expect: spam, got: ham

{'ham': -52.676090263258466, 'spam': -56.80155705621162}

Are you unique enough? Find out from 30th August. www.areyouunique.co.uk

---

expect: spam, got: ham

{'ham': -70.17115950997167, 'spam': -72.51873090685471}

This message is brought to you by GMW Ltd. and is not connected to the

---

expect: spam, got: ham

{'ham': -139.40891845750357, 'spam': -147.473600840229}

Got what it takes 2 take part in the WRC Rally in Oz? U can with Lucozade Energy! Text RALLY LE to 61200 (25p), see packs or lucozade.co.uk/wrc & itcould be u!

---

expect: spam, got: ham

{'ham': -168.26318051822577, 'spam': -178.0225162634835}

Will u meet ur dream partner soon? Is ur career off 2 a flyng start? 2 find out free, txt HORO followed by ur star sign, e. g. HORO ARIES

---

expect: spam, got: ham

{'ham': -142.83829011312517, 'spam': -160.32809534946767}

ROMCAPspam Everyone around should be responding well to your presence since you are so warm and outgoing. You are bringing in a real breath of sunshine.

---

expect: spam, got: ham

{'ham': -153.87952746055848, 'spam': -157.51773572622523}

How about getting in touch with folks waiting for company? Just txt back your NAME and AGE to opt in! Enjoy the community (150p/SMS)

---

expect: spam, got: ham

{'ham': -70.49633709940345, 'spam': -71.4324288990577}

Latest News! Police station toilet stolen, cops have nothing to go on!

---

expect: spam, got: ham

{'ham': -167.8034762079193, 'spam': -181.29820672852267}

Guess who am I?This is the first time I created a web page WWW.ASJESUS.COM read all I wrote. I'm waiting for your opinions. I want to be your friend 1/1

---

expect: spam, got: ham

{'ham': -201.64646891761794, 'spam': -221.190966006233}

Babe: U want me dont u baby! Im nasty and have a thing 4 filthyguys. Fancy a rude time with a sexy bitch. How about we go slo n hard! Txt XXX SLO(4msgs)

---

expect: spam, got: ham

{'ham': -179.72840292764013, 'spam': -198.04988429454147}

Hi ya babe x u 4goten bout me?' scammers getting smart..Though this is a regular vodafone no, if you respond you get further prem rate msg/subscription. Other nos used also. Beware!

---

expect: spam, got: ham

{'ham': -169.32983912276305, 'spam': -171.08314763971316}

Talk sexy!! Make new friends or fall in love in the worlds most discreet text dating service. Just text VIP to 83110 and see who you could meet.

---

expect: spam, got: ham

{'ham': -94.5052592368405, 'spam': -100.35544827044257}

Reminder: You have not downloaded the content you have already paid for. Goto http://doit. mymoby. tv/ to collect your content.

---

expect: spam, got: ham

{'ham': -92.32667266264504, 'spam': -98.95596194645321}

Dont forget you can place as many FREE Requests with 1stchoice.co.uk as you wish. For more Information call 08707808226.

---

expect: spam, got: ham

{'ham': -72.48756276984723, 'spam': -76.59107843644884}

Missed call alert. These numbers called but left no message. 07008009200

---

expect: spam, got: ham

{'ham': -77.48695175645791, 'spam': -93.71448200582458}

Did you hear about the new \Divorce Barbie\"? It comes with all of Ken's stuff!"

---

expect: spam, got: ham

{'ham': -178.2932238903798, 'spam': -184.67604680539299}

Am new 2 club & dont fink we met yet Will B gr8 2 C U Please leave msg 2day wiv ur area 09099726553 reply promised CARLIE x Calls£1/minMobsmore LKPOBOX177HP51FL

---

expect: spam, got: ham

{'ham': -146.47067170071077, 'spam': -149.22917007660618}

Goal! Arsenal 4 (Henry, 7 v Liverpool 2 Henry scores with a simple shot from 6 yards from a pass by Bergkamp to give Arsenal a 2 goal margin after 78 mins.

---

expect: spam, got: ham

{'ham': -50.71127368750657, 'spam': -55.32646466594103}

Money i have won wining number 946 wot do i do next

---

expect: spam, got: ham

{'ham': -86.61673741525858, 'spam': -100.67904461391855}

Sorry I missed your call let's talk when you have the time. I'm on 07090201529

---

expect: spam, got: ham

{'ham': -135.8452563916395, 'spam': -138.23271598529016}

Download as many ringtones as u like no restrictions, 1000s 2 choose. U can even send 2 yr buddys. Txt Sir to 80082 £3

---

expect: spam, got: ham

{'ham': -106.60938155575177, 'spam': -115.73546864261806}

INTERFLORA - It's not too late to order Interflora flowers for christmas call 0800 505060 to place your order before Midnight tomorrow.

---

expect: ham, got: spam

{'ham': -109.18333505635592, 'spam': -107.80307740800048}

MAKE SURE ALEX KNOWS HIS BIRTHDAY IS OVER IN FIFTEEN MINUTES AS FAR AS YOU'RE CONCERNED

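Tuning

We run only one very simple experiment: if we convert everything to lowercase before counting occurrences, does the result improve?

In every run below, "positive false" counts spam classified as ham (false negatives) and "negative false" counts ham classified as spam (false positives), so Precision = TP / (TP + FP) and Recall = TP / (TP + FN).

Three consecutive runs of the current code:

positive(spam) true: 188, false: 21
negative true: 1182, false: 2
Precision: 98.95%, Recall: 89.95%
---
positive(spam) true: 160, false: 16
negative true: 1209, false: 8
Precision: 95.24%, Recall: 90.91%
---
positive(spam) true: 167, false: 13
negative true: 1208, false: 5
Precision: 97.09%, Recall: 92.78%

Code change: in load_data, replace

dataset = [(line[0], line[1]) for line in csv_reader]

with

dataset = [(line[0], line[1].lower()) for line in csv_reader]

Three consecutive runs after the change:

positive(spam) true: 164, false: 18
negative true: 1205, false: 6
Precision: 96.47%, Recall: 90.11%
---
positive(spam) true: 174, false: 20
negative true: 1197, false: 2
Precision: 98.86%, Recall: 89.69%
---
positive(spam) true: 162, false: 23
negative true: 1204, false: 4
Precision: 97.59%, Recall: 87.57%

The results change little: recall is slightly lower and precision slightly higher.

With only 3 runs and this much randomness, we cannot conclude which feature is better.

But we also see no clear improvement or degradation.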
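
One way to make the comparison less dependent on a single shuffle is to aggregate over runs. The sketch below does this for the six runs reported above, recomputing both metrics from the printed counts with the standard definitions, precision = TP / (TP + FP) and recall = TP / (TP + FN). It assumes the counts are grouped by true label, as the script counts them: "positive false" is spam classified as ham, and "negative false" is ham classified as spam. The helper names are hypothetical.

```python
from statistics import mean, stdev

# (spam correct, spam missed, ham correct, ham flagged), copied from the runs above.
original_case = [(188, 21, 1182, 2), (160, 16, 1209, 8), (167, 13, 1208, 5)]
lowercased = [(164, 18, 1205, 6), (174, 20, 1197, 2), (162, 23, 1204, 4)]

def precision_recall(run):
    """Precision and recall in percent, with spam as the positive class."""
    tp, fn, tn, fp = run
    return 100.0 * tp / (tp + fp), 100.0 * tp / (tp + fn)

def summarize(runs):
    """Mean and sample standard deviation of precision and recall."""
    precisions, recalls = zip(*(precision_recall(r) for r in runs))
    return (mean(precisions), stdev(precisions)), (mean(recalls), stdev(recalls))

for name, runs in [("original case", original_case), ("lowercased", lowercased)]:
    (p_mean, p_sd), (r_mean, r_sd) = summarize(runs)
    print("%-13s precision %.2f +/- %.2f, recall %.2f +/- %.2f"
          % (name, p_mean, p_sd, r_mean, r_sd))
```

With three runs per setting, the run-to-run spread is comparable in size to the gap between the two means, which is why more runs (or a fixed random seed and the same split for both variants) would be needed before drawing a conclusion.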

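Summary

The training stage mainly computes the priors and the likelihoods. Neither depends on the text of any individual document; both come from document counts and word counts aggregated over the whole training set of each label.

What affects the result is not how frequent word A is relative to word B within one training set, but how much a word's probability differs between the label sets.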
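
This can be made concrete by looking at the per-word difference log P(w | spam) - log P(w | ham): words with a large positive difference push a message toward spam, and words with a negative difference push it toward ham. A minimal sketch on assumed counts (the words, counts, totals and function names are invented for illustration):

```python
import math

# Assumed per-class word counts; illustrative only.
counts = {
    "spam": {"FREE": 40, "call": 30, "the": 50},
    "ham": {"FREE": 1, "call": 20, "the": 100},
}
totals = {"spam": 1000, "ham": 2000}  # word tokens per class
V = 3000                              # vocabulary size

def log_likelihood(word, label):
    """Add-one smoothed log P(word | label)."""
    return math.log((counts[label].get(word, 0) + 1) / (totals[label] + V))

def spam_evidence(word):
    """Positive values favor spam, negative values favor ham."""
    return log_likelihood(word, "spam") - log_likelihood(word, "ham")

for word in ["FREE", "call", "the"]:
    print(word, round(spam_evidence(word), 2))
```

In this toy setup "the" is the most frequent of the three words inside the spam class, yet its evidence is negative because it is relatively even more common in ham, while "FREE" carries strong spam evidence despite a lower count.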
