程序示例–垃圾邮件检测

邮件内容的预处理

下面展示了一封常见的 email，邮件内容包含了一个 URL （http://www.rackspace.com/），一个邮箱地址（groupname-unsubscribe@egroups.com），若干数字，美元符号等：

Anyone knows how much it costs to host a web portal ?Well, it depends on how many visitors youre expecting. This can be anywhere from less than 10 bucks a month to a couple of $100. You should checkout http://www.rackspace.com/ or perhaps Amazon EC2 if youre running something big..To unsubscribe yourself from this mailing list, send an email to: groupname-unsubscribe@egroups.com

许多其他邮件可能也包含了这些信息，因此，在对邮件内容的预处理阶段，就是要 “标准化” 这些参数。例如，我们需要将所有的 URL链接都替换为 httpaddr，这能让垃圾邮件分类器在获得决策边界时不受限于某个具体的 URL。标准化的手段能显著提升分类器的性能，这是因为，垃圾邮件通常会随机产生 URL，因此，在新的垃圾邮件中，发现之前见过的 URL 的可能性是不高的。

在预处理邮件内容阶段，我们将完成如下内容：

转换内容为小写：邮件的所有单词都会被转化为小写。
去除 HTML 标签：
标准化 URL：将所有的 URL 链接替换为 httpaddr。
标准化邮箱地址：将所有的邮箱地址替换为 emailaddr。
标准化所有的数字：将所有的数字替换为 number。
标准化所有的美元符号：将所有的美元符号 $ 替换为 dollar。
词干提取（Word-Stemming）：例如，我们会将 “discount”、“discounts”、“discounted” 及 “discounting” 都替换为 “discount”。
删去非单词字符：将空格、tab、换行符等删去。

上述邮件经过预处理后，内容如下：

anyon know how much it cost to host a web portal well it depend on how mani visitor your expect thi can be anywher from less than number buck a month to a coupl of dollarnumb you should checkout httpaddr or perhap amazon ecnumb if your run someth big to unsubscrib yourself from thi mail list send an email to emailaddr

接下来，我们需要确定用于垃圾邮件分类器的单词，亦即，需要构建一个用于垃圾邮件分类器的词汇列表。在本例中，将选择在垃圾邮件语料库（spam corpus）中，出现至少 100 次的单词。这么做的原因是，如果选择将出现寥寥的单词放到训练样本中，可能就会造成模型的过拟合。最终完整的词汇列表存入了文件 vocab.txt 中，当中含有共计 1899 个单词：

1 aa
2 ab
3 abil
...
86 anyon
...
916 know
...
1898 zero
1899 zip

在生产环境中，词汇列表的规模在 10,000 到 50,000 个单词之间。

有了词汇表之后，我们现在就能将预处理后的邮件映射为一个坐标集，每个坐标反映了邮件中单词在词汇表的位置：

86 916 794 1077 883
370 1699 790 1822
1831 883 431 1171
794 1002 1893 1364
592 1676 238 162 89
688 945 1663 1120
1062 1699 375 1162
479 1893 1510 799
1182 1237 810 1895
1440 1547 181 1699
1758 1896 688 1676
992 961 1477 71 530
1699 531

特别地，“anyone” 被标准化了词汇表的第86个单词 “anyon”。

预处理的代码片如下，我们用到了函数式编程库 pydash 以及词干提取库 stemming：

# coding: utf8
# svm/spam.py# ...
def processEmail(email):"""预处理邮件Args:email 邮件内容Returns:indices 单词在词表中的位置"""# 转换为小写 --> 标准化 URL --> 标准化 邮箱地址# --> 去除 HTML 标签 --> 标准化数字# --> 标准化美元 --> 删除非空格字符return py_(email) \.strip_tags() \.reg_exp_replace(r'(http|https)://[^\s]*', 'httpaddr') \.reg_exp_replace(r'[^\s]+@[^\s]+', 'emailaddr') \.reg_exp_replace(r'\d+', 'number') \.reg_exp_replace(r'[$]+', 'dollar') \.lower_case() \.trim() \.words() \.map(stem) \.map(lambda word : py_.index_of(vocabList, word) + 1) \.value()

邮件特征提取

我们定义特征 x(i)x^{(i)}x(i) 为：
x(i)={1,如果词汇表的第i个单词出现0,otherwisex^{(i)}=\begin{cases}1,\quad\quad 如果词汇表的第\ i\ 个单词出现\\ 0,\quad\quad otherwise\end{cases}x(i)={1,如果词汇表的第 i 个单词出现0,otherwise

则知， x∈Rnx∈\R^nx∈Rn ， nnn 为词汇表长度。

特征提取的代码同样封装到了项目的 svm/spam.py中：

# coding: utf8
# svm/spam.py# ....
def extractFeatures(indices):"""提取特征Args:indices 单词索引Returns:feature 邮件特征"""feature = py_.map(range(1, len(vocabList) + 1),lambda index: py_.index_of(indices, index) > -1)return np.array(feature, dtype=np.uint)

Top Predictors

在 SVM 训练后获得的模型中，权值向量 www 评估了每个特征的重要性，对应到词汇表的坐标中，我们也就知道了那些词汇最能分辨垃圾邮件，这些词汇称为 Top Predictors。

# coding: utf8
# svm/spam.py# ....
def getTopPredictors(weights, count):"""获得最佳标识词汇Args:weights 权值count top countReturns:predictors predicators"""return py_(vocabList) \.map(lambda word, idx: (word, weights[idx])) \.sort_by(lambda item: item[1], reverse = True) \.take(count) \.value()

测试

下面，我们会使用 sklearn 提供的 SVM 模型进行训练以加快训练效率，它是基于 libsvm 的，有不错的运算性能：

# coding: utf8
# svm/test_spam.py
import spam
import numpy as np
from scipy.io import loadmat
from sklearn.svm import SVC
import matplotlib.pyplot as plt# 垃圾邮件分类器
data = loadmat('data/spamTrain.mat')
X = np.mat(data['X'])
y = data['y']
m, n = X.shape
C = 0.1
tol = 1e-3# 使用训练集训练分类器
clf = SVC(C=C, kernel='linear', tol=tol)
clf.fit(X, y.ravel())
predictions = np.mat([clf.predict(X[i, :])  for i in range(m)])
accuracy = 100 * np.mean(predictions == y)
print 'Training set accuracy: %0.2f %%' % accuracy# 使用测试集评估训练结果
data = loadmat('data/spamTest.mat')
XTest = np.mat(data['Xtest'])
yTest = data['ytest']
mTest, _ = XTest.shapeclf.fit(XTest, yTest.ravel())
print 'Evaluating the trained Linear SVM on a test set ...'
predictions = np.mat([clf.predict(XTest[i, :])  for i in range(mTest)])
accuracy = 100 * np.mean(predictions == yTest)
print 'Test set accuracy: %0.2f %%' % accuracy# 获得最能标识垃圾邮件的词汇（在模型中获得高权值的）
weights = abs(clf.coef_.flatten())
top = 15
predictors = spam.getTopPredictors(weights, top)
print 'Top %d predictors of spam:'%top
for word, weight in predictors:print '%-15s (%f)' % (word, weight)# 使用邮件测试
def genExample(f):email = open(f).read()indices =  spam.processEmail(email)features =  spam.extractFeatures(indices)return featuresfiles = ['data/emailSample1.txt','data/emailSample1.txt','data/spamSample1.txt','data/spamSample2.txt'
]emails = np.mat([genExample(f) for f in files], dtype=np.uint8)
labels = np.array([[0, 0, 1, 1]]).reshape(-1, 1)
predictions = np.mat([clf.predict(emails[i, :])  for i in range(len(files))])
accuracy = 100 * np.mean(predictions == labels)
print('Test set accuracy for own datasets: %0.2f %%' % accuracy)

测试结果为：

Training set accuracy: 99.83 %
Test set accuracy: 99.90 %Top 15 predictors of spam:
click           (0.477352)
url             (0.332529)
date            (0.316182)
wrote           (0.282825)
spamassassin    (0.266681)
remov           (0.250697)
there           (0.248973)
numbertnumb     (0.240243)
the             (0.238051)
pleas           (0.235655)
our             (0.233895)
do              (0.226717)
basenumb        (0.218456)
we              (0.218421)
free            (0.214057)Test set accuracy for own datasets: 100.00 %

5.11 程序示例--垃圾邮件检测-机器学习笔记-斯坦福吴恩达教授相关推荐

4.4 机器学习系统设计--垃圾邮件分类-机器学习笔记-斯坦福吴恩达教授
机器学习系统设计–垃圾邮件分类假定我们现有一封邮件,其内容如下: From: cheapsales@buystufffromme.com To: ang@cs.stanford.edu Subjec ...
2.7 程序示例--多分类问题-机器学习笔记-斯坦福吴恩达教授
程序示例–多分类问题我们采用 One-vs-All 方法来进行多分类,在原有的逻辑回归模块中添加 One-vs-All 的训练以及预测方法: # coding: utf-8 # logical_re ...
2.5 程序示例--非线性决策边界-机器学习笔记-斯坦福吴恩达教授
程序示例–非线性决策边界我们首先对数据进行了多项式拟合,再分别使用 λ=0,λ=1,λ=100λ=0,λ=1,λ=100λ=0,λ=1,λ=100 的批量梯度下降法(sgd)完成了训练,获得了非线性 ...
2.4 程序示例--线性决策边界-机器学习笔记-斯坦福吴恩达教授
程序示例–线性决策边界回归模块在逻辑回归模块 logical_regression.py 中,实现了批量梯度下降法(bgd)以及随机梯度下降法(sgd),同时,支持正规化方程 # coding: ...
1.9 程序示例--局部加权线性回归-机器学习笔记-斯坦福吴恩达教授
程序示例–局部加权线性回归现在,我们在回归中又添加了 JLwr() 方法用于计算预测代价,以及 lwr() 方法用于完成局部加权线性回归: # coding: utf-8 # linear_regr ...
3.12 程序示例--多分类问题-机器学习笔记-斯坦福吴恩达教授
多分类问题我们手上包含有手写字符的数据集,该数据集来自斯坦福机器学习的课后作业,每个字符图片大小为 20×20 ,总的样本规模为 5000×400 , 我们的神经网络设计如下,包含 1 个隐含层,隐 ...
8.4 有监督学习与异常检测-机器学习笔记-斯坦福吴恩达教授
有监督学习与异常检测很多人会认为异常检测非常类似于有监督学习,尤其是逻辑回归,但我们用一张表格来描述有监督学习与异常检测的区别: 有监督学习异常检测数据分布均匀数据非常偏斜,异常样本数目远小于 ...
8.7 程序示例--异常检测-机器学习笔记-斯坦福吴恩达教授
程序示例–异常检测异常检测模型提供了一般高斯分布模型和多元高斯分布模型.其中,多元高斯分布模型被限制到了同轴分布: # coding: utf8 # anomaly_detection/anoma ...
6.8 程序示例--二分 K-Means-机器学习笔记-斯坦福吴恩达教授
程序示例–二分 K-Means 仍然是在 kmeans.py 中,我们又添加了二分 K-Means 算法: # coding: utf-8 # kmeans/kmeans.py# ... def bi ...

5.11 程序示例--垃圾邮件检测-机器学习笔记-斯坦福吴恩达教授

程序示例–垃圾邮件检测

邮件内容的预处理

邮件特征提取

Top Predictors

测试

5.11 程序示例--垃圾邮件检测-机器学习笔记-斯坦福吴恩达教授相关推荐

最新文章

热门文章