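Writing these classic algorithms from scratch, without calling sklearn or similar APIs, and then tuning a parameter or two, turned out to be rewarding and instructive.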

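Dataset

Overview: 5,572 SMS messages, about 13% of them spam.

Why this dataset: SMS text needs less preprocessing than email and is cheaper to process, which makes it easier to focus on the algorithm itself.

The dataset comes from Kaggle. Its sampling is reasonably sound, so it should reflect the algorithm's performance fairly accurately.

My backup of the data: spam.csv on github.com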

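How the algorithm works

Objective: given a document d, compute the probability that it belongs to each class c, and pick the class with the highest probability as the result.

For spam email/SMS detection there are only two classes: spam and not-spam.

In spam detection, not-spam is usually called ham. There is no deep reason for this; an early pioneer apparently came up with the name on a whim. So the class names become spam and ham.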

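By Bayes' rule, and dropping the denominator P(d) because it is the same for every class, this becomes

c* = argmax_c P(c | d) = argmax_c P(d | c) P(c) = argmax_c P(f1, f2, ..., fn | c) P(c)

where f1, f2, ..., fn are the features of the document.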

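There are many ways to choose features: word frequency, TF-IDF values, word frequency after removing stop words, and so on. The choice of features has nothing to do with Bayes itself; it is determined by the problem we are trying to solve. Different features are discussed and compared experimentally later on.

We assume the features are independent of one another given the class. In practice, even when they are not, applying the formula directly still works well.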

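This gives the new formula:

c* = argmax_c P(c) * P(f1 | c) * P(f2 | c) * ... * P(fn | c)

As with other language models, we move to log space, which turns the product into a sum:

c* = argmax_c [ log P(c) + log P(f1 | c) + log P(f2 | c) + ... + log P(fn | c) ]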
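One practical reason for working in log space is numerical: a product of many small probabilities underflows to zero in floating point, while the sum of their logarithms stays well within range. The sketch below illustrates this with made-up per-word probabilities; the values and the number of words are assumptions chosen only to trigger the effect.

```python
import math

# Hypothetical per-word likelihoods: 400 words, each with probability 1e-4.
word_probs = [1e-4] * 400

# Multiplying the raw probabilities underflows to 0.0.
product = 1.0
for p in word_probs:
    product *= p

# Summing log probabilities keeps a usable score.
log_score = sum(math.log(p) for p in word_probs)

print(product)
print(log_score)
```

Once the product has collapsed to 0.0 for every class, the classes can no longer be compared, whereas the log scores can.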

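Feature selection

Spam email detection is a mature area, and commercial systems also use non-text features such as the sender's address.

Here we restrict the discussion to text (natural-language) features.

Before machine learning there were already rule-based filters, built on patterns that people had noticed, for example:

online pharmaceutical

WITHOUT ANY COST

Dear Winner

In short, spam tends to put certain keywords in all capitals, and its phrasing differs from that of ordinary text.

Common NLP preprocessing steps, such as lowercasing everything or TF-IDF weighting, can actually erase some of these features, so they may not be appropriate here.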
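To make the point about lowercasing concrete, here is a small sketch with two made-up messages (they are not taken from the dataset). With case preserved the two messages share no tokens; after lowercasing, the capitalised spam keywords collapse onto ordinary words.

```python
# Two hypothetical messages, used only for illustration.
spam_like = "WITHOUT ANY COST claim your FREE prize"
ham_like = "are you free without any plans tonight"


def tokens(text, lowercase):
    """Split on whitespace, optionally lowercasing first."""
    return set((text.lower() if lowercase else text).split())


# Tokens the two messages have in common under each preprocessing choice.
shared_case_sensitive = tokens(spam_like, False) & tokens(ham_like, False)
shared_lowercased = tokens(spam_like, True) & tokens(ham_like, True)

print(shared_case_sensitive)
print(sorted(shared_lowercased))
```

Whether the lost distinction matters on real data is an empirical question, which is what the tuning experiment later in the article looks at.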

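We use the simplest possible feature: the number of times each word occurs.

All words from all classes in the training corpus form a vocabulary; then, for each class separately, we count how often each word occurs.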
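The counting step can be sketched with collections.Counter. The tiny training set below is made up for illustration, and the variable names are not from the full implementation.

```python
from collections import Counter

# A hand-made training set of (label, text) pairs; not taken from spam.csv.
train_set = [
    ("spam", "FREE prize call now"),
    ("ham", "call me when you are free"),
    ("spam", "FREE entry win prize"),
]

# label -> Counter of word occurrences over all documents of that label
counts = {}
for label, text in train_set:
    counts.setdefault(label, Counter()).update(text.split())

# The vocabulary is the union of the words seen in any class.
vocabulary = set()
for counter in counts.values():
    vocabulary |= set(counter)

print(len(vocabulary))
print(counts["spam"]["FREE"], counts["ham"]["FREE"])
```

A Counter returns 0 for a word it has never seen, which is exactly the zero-count case that the smoothing step has to deal with.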

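A word that never occurs in a class gets probability 0 for that class, which drives the whole product to 0. To fix this we use add-one (Laplace) smoothing:

P(w | c) = (count(w, c) + 1) / (sum over all w' in V of count(w', c) + |V|)

where V is the vocabulary and count(w, c) is the number of occurrences of w in the documents of class c.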
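As a minimal sketch of add-one smoothing, the helper below computes a smoothed log likelihood from per-class counts. The function name, the counts and the vocabulary size are all made up for illustration.

```python
import math
from collections import Counter


def smoothed_log_likelihood(word, class_counts, vocabulary_size):
    """log P(word | class) with add-one (Laplace) smoothing."""
    total = sum(class_counts.values())
    return math.log((class_counts[word] + 1) / (total + vocabulary_size))


spam_counts = Counter({"FREE": 2, "prize": 2, "call": 1})  # made-up counts
V = 6  # hypothetical vocabulary size

seen = smoothed_log_likelihood("FREE", spam_counts, V)
unseen = smoothed_log_likelihood("tonight", spam_counts, V)

print(seen, unseen)
```

The unseen word now gets a small but non-zero probability, so its logarithm is finite and a single unknown word can no longer veto a class.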

Pseudocode
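The following is a sketch of the training and prediction steps, written to be consistent with the formulas above and with the Python implementation in the next section. It is a reconstruction under those assumptions, and the routine names TRAIN and PREDICT are illustrative.

```text
function TRAIN(documents D, classes C):
    N_doc = number of documents in D
    for each class c in C:
        N_c = number of documents in D labelled c
        logprior[c] = log(N_c / N_doc)
        bigdoc[c] = all words of the documents labelled c
    V = set of all words in D
    for each class c in C:
        for each word w in V:
            count(w, c) = occurrences of w in bigdoc[c]
            loglikelihood[w, c] = log((count(w, c) + 1) / (length(bigdoc[c]) + |V|))
    return logprior, loglikelihood, V

function PREDICT(testdoc, logprior, loglikelihood, C, V):
    for each class c in C:
        score[c] = logprior[c]
        for each distinct word w in testdoc:
            if w in V:
                score[c] = score[c] + loglikelihood[w, c]
    return the class c with the highest score[c]
```

Words of the test document that are not in V are simply skipped, and each distinct word contributes once, which matches how the Python predict function treats its input.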

Evaluation

The spam class is treated as the positive class.
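With spam as the positive class, the usual definitions are precision = TP / (TP + FP) and recall = TP / (TP + FN), where a false positive is ham flagged as spam and a false negative is spam that slipped through. A small helper, with a hypothetical name and made-up counts, makes the mapping explicit:

```python
def precision_recall(true_pos, false_pos, false_neg):
    """Precision and recall for the positive (spam) class."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return precision, recall


# Made-up counts: 90 spam caught, 5 ham flagged as spam, 10 spam missed.
precision, recall = precision_recall(true_pos=90, false_pos=5, false_neg=10)
print("Precision: %.2f%%, Recall: %.2f%%" % (100 * precision, 100 * recall))
```

For a spam filter, low precision means legitimate messages are being blocked, and low recall means spam is getting through.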

Python implementation

import csv
import math
import string
from collections import Counter

import numpy as np


def load_data(filename, train_ratio):
    # Assumption: the file is latin-1 encoded, as the Kaggle spam.csv appears to be.
    with open(filename, newline="", encoding="latin-1") as f:
        csv_reader = csv.reader(f)
        next(csv_reader)  # skip the header row
        dataset = [(line[0], line[1]) for line in csv_reader]
    np.random.shuffle(dataset)
    train_size = int(len(dataset) * train_ratio)
    return dataset[:train_size], dataset[train_size:]


def train(train_set):
    total_doc_cnt = len(train_set)
    label_doc_cnt = {}
    bigdoc_words = {}  # label -> all words of that label's documents
    for label, doc in train_set:
        if label not in label_doc_cnt:
            # init
            label_doc_cnt[label] = 0
            bigdoc_words[label] = []
        label_doc_cnt[label] += 1
        bigdoc_words[label].extend(
            w.strip(string.punctuation) for w in doc.split())

    vocabulary = set()
    for words in bigdoc_words.values():
        vocabulary |= set(words)
    V = len(vocabulary)

    log_priors = {label: math.log(cnt / total_doc_cnt)
                  for label, cnt in label_doc_cnt.items()}
    log_likelihoods = {}
    for label, words in bigdoc_words.items():
        counts = Counter(words)  # one pass instead of words.count(w) per word
        word_cnt = len(words) + V  # denominator of add-one smoothing
        # One entry per vocabulary word, in the set's iteration order.
        log_likelihoods[label] = [
            math.log((1 + counts[w]) / word_cnt) for w in vocabulary]
    return log_priors, log_likelihoods, vocabulary


def predict(log_priors, log_likelihoods, vocabulary, input_text, expect_label=None):
    # A set: each distinct word of the message is counted once.
    words = {w.strip(string.punctuation) for w in input_text.split()}
    prob_max = None
    label_max = None
    probs = {}  # per-label scores, kept for the debug log
    for label, likelihood in log_likelihoods.items():
        prob = log_priors[label] + sum(
            p for w, p in zip(vocabulary, likelihood) if w in words)
        probs[label] = prob
        if prob_max is None or prob > prob_max:
            prob_max = prob
            label_max = label
    # Log the details of every misclassified message.
    if expect_label and expect_label != label_max:
        print('---')
        print('expect: %s, got: %s' % (expect_label, label_max))
        print(probs)
        print(input_text)
    return label_max


def main():
    filename = 'input/spam.csv'
    train_ratio = 0.75
    train_data, test_data = load_data(filename, train_ratio)
    print('data loaded. train: {}, test: {}'.format(
        len(train_data), len(test_data)))

    # train the model
    log_priors, log_likelihoods, vocabulary = train(train_data)
    print('model trained. log_priors: {}, V(vocabulary word count): {}'.format(
        log_priors, len(vocabulary)))

    pos_true = 0   # spam predicted as spam (true positive)
    pos_false = 0  # spam predicted as ham (false negative)
    neg_false = 0  # ham predicted as spam (false positive)
    neg_true = 0   # ham predicted as ham (true negative)
    for label, text in test_data:
        got = predict(log_priors, log_likelihoods, vocabulary, text, label)
        if label != got:
            if label == 'spam':
                pos_false += 1
            else:
                neg_false += 1
        else:
            if label == 'spam':
                pos_true += 1
            else:
                neg_true += 1

    print('positive(spam) true: %s, false: %s' % (pos_true, pos_false))
    print('negative true: %s, false: %s' % (neg_true, neg_false))
    # Precision = TP / (TP + FP); recall = TP / (TP + FN).
    print('Precision: %.2f%%, Recall: %.2f%%' % (
        100.0 * pos_true / (pos_true + neg_false),
        100.0 * pos_true / (pos_true + pos_false),
    ))


if __name__ == '__main__':
    main()

Results

load_data shuffles the dataset, so every run gives slightly different numbers, but the value printed as Precision is usually around 90% and the value printed as Recall around 96%.

The logs in this article are reproduced exactly as they were recorded, from a version of the script that divided by the other error count. In them, "positive(spam) false" counts spam messages classified as ham (false negatives) and "negative false" counts ham classified as spam (false positives). The value printed as Precision is therefore TP / (TP + FN), which is recall under the standard definition, and the value printed as Recall is TP / (TP + FP), the standard precision.
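If repeatable numbers are wanted, one option is to shuffle with a fixed seed. The sketch below uses the standard library's random.Random rather than np.random.shuffle; split_dataset and the seed value 42 are illustrative choices, not part of the original script.

```python
import random


def split_dataset(dataset, train_ratio, seed):
    """Shuffle a copy of dataset with a fixed seed, then split it."""
    shuffled = list(dataset)
    random.Random(seed).shuffle(shuffled)
    train_size = int(len(shuffled) * train_ratio)
    return shuffled[:train_size], shuffled[train_size:]


rows = [("ham", "msg %d" % i) for i in range(8)]  # placeholder rows

first = split_dataset(rows, 0.75, seed=42)
second = split_dataset(rows, 0.75, seed=42)

print(first == second)
print(len(first[0]), len(first[1]))
```

The same seed always yields the same split, which makes it easier to tell whether a change in the metrics comes from the code or from the shuffle.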

data loaded. train: 4179, test: 1393

model trained. log_priors: {'ham': -0.14608724117045765, 'spam': -1.995705843726764}, V(vocabulary word count): 9996

positive(spam) true: 156, false: 23

negative true: 1213, false: 1

Precision: 87.15%, Recall: 99.36%

Details of the misclassified messages

---

expect: spam, got: ham

{'ham': -84.45531071975456, 'spam': -90.70055052987975}

You can donate £2.50 to UNICEF's Asian Tsunami disaster support fund by texting DONATE to 864233. £2.50 will be added to your next bill

---

expect: spam, got: ham

{'ham': -139.40891845750357, 'spam': -147.473600840229}

Got what it takes 2 take part in the WRC Rally in Oz? U can with Lucozade Energy! Text RALLY LE to 61200 (25p), see packs or lucozade.co.uk/wrc & itcould be u!

---

expect: spam, got: ham

{'ham': -52.676090263258466, 'spam': -56.80155705621162}

Are you unique enough? Find out from 30th August. www.areyouunique.co.uk

---

expect: spam, got: ham

{'ham': -70.17115950997167, 'spam': -72.51873090685471}

This message is brought to you by GMW Ltd. and is not connected to the

---

expect: spam, got: ham

{'ham': -139.40891845750357, 'spam': -147.473600840229}

Got what it takes 2 take part in the WRC Rally in Oz? U can with Lucozade Energy! Text RALLY LE to 61200 (25p), see packs or lucozade.co.uk/wrc & itcould be u!

---

expect: spam, got: ham

{'ham': -168.26318051822577, 'spam': -178.0225162634835}

Will u meet ur dream partner soon? Is ur career off 2 a flyng start? 2 find out free, txt HORO followed by ur star sign, e. g. HORO ARIES

---

expect: spam, got: ham

{'ham': -142.83829011312517, 'spam': -160.32809534946767}

ROMCAPspam Everyone around should be responding well to your presence since you are so warm and outgoing. You are bringing in a real breath of sunshine.

---

expect: spam, got: ham

{'ham': -153.87952746055848, 'spam': -157.51773572622523}

How about getting in touch with folks waiting for company? Just txt back your NAME and AGE to opt in! Enjoy the community (150p/SMS)

---

expect: spam, got: ham

{'ham': -70.49633709940345, 'spam': -71.4324288990577}

Latest News! Police station toilet stolen, cops have nothing to go on!

---

expect: spam, got: ham

{'ham': -167.8034762079193, 'spam': -181.29820672852267}

Guess who am I?This is the first time I created a web page WWW.ASJESUS.COM read all I wrote. I'm waiting for your opinions. I want to be your friend 1/1

---

expect: spam, got: ham

{'ham': -201.64646891761794, 'spam': -221.190966006233}

Babe: U want me dont u baby! Im nasty and have a thing 4 filthyguys. Fancy a rude time with a sexy bitch. How about we go slo n hard! Txt XXX SLO(4msgs)

---

expect: spam, got: ham

{'ham': -179.72840292764013, 'spam': -198.04988429454147}

Hi ya babe x u 4goten bout me?' scammers getting smart..Though this is a regular vodafone no, if you respond you get further prem rate msg/subscription. Other nos used also. Beware!

---

expect: spam, got: ham

{'ham': -169.32983912276305, 'spam': -171.08314763971316}

Talk sexy!! Make new friends or fall in love in the worlds most discreet text dating service. Just text VIP to 83110 and see who you could meet.

---

expect: spam, got: ham

{'ham': -94.5052592368405, 'spam': -100.35544827044257}

Reminder: You have not downloaded the content you have already paid for. Goto http://doit. mymoby. tv/ to collect your content.

---

expect: spam, got: ham

{'ham': -92.32667266264504, 'spam': -98.95596194645321}

Dont forget you can place as many FREE Requests with 1stchoice.co.uk as you wish. For more Information call 08707808226.

---

expect: spam, got: ham

{'ham': -72.48756276984723, 'spam': -76.59107843644884}

Missed call alert. These numbers called but left no message. 07008009200

---

expect: spam, got: ham

{'ham': -77.48695175645791, 'spam': -93.71448200582458}

Did you hear about the new \Divorce Barbie\"? It comes with all of Ken's stuff!"

---

expect: spam, got: ham

{'ham': -178.2932238903798, 'spam': -184.67604680539299}

Am new 2 club & dont fink we met yet Will B gr8 2 C U Please leave msg 2day wiv ur area 09099726553 reply promised CARLIE x Calls£1/minMobsmore LKPOBOX177HP51FL

---

expect: spam, got: ham

{'ham': -146.47067170071077, 'spam': -149.22917007660618}

Goal! Arsenal 4 (Henry, 7 v Liverpool 2 Henry scores with a simple shot from 6 yards from a pass by Bergkamp to give Arsenal a 2 goal margin after 78 mins.

---

expect: spam, got: ham

{'ham': -50.71127368750657, 'spam': -55.32646466594103}

Money i have won wining number 946 wot do i do next

---

expect: spam, got: ham

{'ham': -86.61673741525858, 'spam': -100.67904461391855}

Sorry I missed your call let's talk when you have the time. I'm on 07090201529

---

expect: spam, got: ham

{'ham': -135.8452563916395, 'spam': -138.23271598529016}

Download as many ringtones as u like no restrictions, 1000s 2 choose. U can even send 2 yr buddys. Txt Sir to 80082 £3

---

expect: spam, got: ham

{'ham': -106.60938155575177, 'spam': -115.73546864261806}

INTERFLORA - ��It's not too late to order Interflora flowers for christmas call 0800 505060 to place your order before Midnight tomorrow.

---

expect: ham, got: spam

{'ham': -109.18333505635592, 'spam': -107.80307740800048}

MAKE SURE ALEX KNOWS HIS BIRTHDAY IS OVER IN FIFTEEN MINUTES AS FAR AS YOU'RE CONCERNED

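Tuning

We run just one very simple experiment: if everything is converted to lowercase before counting, do the results improve?

Three consecutive runs before the change: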

positive(spam) true: 188, false: 21

negative true: 1182, false: 2

Precision: 89.95%, Recall: 98.95%

---

positive(spam) true: 160, false: 16

negative true: 1209, false: 8

Precision: 90.91%, Recall: 95.24%

---

positive(spam) true: 167, false: 13

negative true: 1208, false: 5

Precision: 92.78%, Recall: 97.09%

Code change:

In load_data, change

dataset = [(line[0], line[1]) for line in csv_reader]

to

dataset = [(line[0], line[1].lower()) for line in csv_reader]

Three consecutive runs after the change:

positive(spam) true: 164, false: 18

negative true: 1205, false: 6

Precision: 90.11%, Recall: 96.47%

---

positive(spam) true: 174, false: 20

negative true: 1197, false: 2

Precision: 89.69%, Recall: 98.86%

---

positive(spam) true: 162, false: 23

negative true: 1204, false: 4

Precision: 87.57%, Recall: 97.59%

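The results barely change: the value printed as Precision drops slightly and the value printed as Recall rises slightly.

With only three runs each, the randomness is too large to conclude that either feature is better.

We also see no clear improvement or degradation.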
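One rough way to put a number on that randomness is to compare the gap between the two averages with the spread inside each group. The sketch below does this for the three values printed as Precision before and after the change, copied from the logs above. With three runs per group this is only a sanity check, not a significance test.

```python
from statistics import mean, stdev

# Values printed as "Precision" in the logs above, in percent.
before = [89.95, 90.91, 92.78]  # original casing
after = [90.11, 89.69, 87.57]   # text lowercased first

gap = mean(before) - mean(after)
spread = max(stdev(before), stdev(after))

print("mean before: %.2f, after: %.2f" % (mean(before), mean(after)))
print("gap: %.2f, largest stdev: %.2f" % (gap, spread))
```

The gap between the means is of the same order as the run-to-run standard deviation, which is consistent with the view that three runs cannot separate the two variants.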

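Summary

The training stage mainly computes the prior and the likelihood. Neither depends on the text of any individual document; both come from document counts and word counts aggregated over the whole training set for each label.

What affects the result is not how frequent word A is relative to word B within one training set, but how much a word's probability differs between the label sets.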
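This can be made concrete with the log-likelihood ratio log P(w | spam) - log P(w | ham), which is the amount a word shifts the decision between the two classes. The probabilities below are made up for illustration and are not measured on the dataset.

```python
import math

# Hypothetical smoothed probabilities P(word | label).
p_word = {
    "the": {"spam": 0.030, "ham": 0.031},
    "FREE": {"spam": 0.010, "ham": 0.0002},
}

# Positive values push towards spam, negative values towards ham.
ratios = {
    word: math.log(p["spam"]) - math.log(p["ham"])
    for word, p in p_word.items()
}

for word, ratio in ratios.items():
    print(word, round(ratio, 2))
```

In this example "the" is three times more frequent than "FREE" within the spam class, yet it barely moves the score, because it is about equally likely under both labels.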
