Classification Based on Probability Theory: Naive Bayes

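Introduction
1 Classification with Bayesian decision theory
- 1.1 Conditional probability
- 1.2 Classifying with conditional probabilities
- 1.3 Document classification with naive Bayes
2 Text classification with Python
- 2.1 Preparing the data: building word vectors from text
- 2.2 Training the algorithm: computing probabilities from word vectors
- 2.3 Testing the algorithm: adapting the classifier to real-world conditions
- 2.4 Preparing the data: the bag-of-words document model
3 Example: filtering spam email
- 3.1 Preparing the data: tokenizing text
- 3.2 Testing the algorithm: cross-validation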

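Introduction

With k-nearest neighbors and decision trees, we ask the classifier for a definite answer to the question "which class does this instance belong to?". A classifier is sometimes wrong, though, so we can instead ask it for its best guess at the class together with an estimate of the probability of that guess.

The method is called "naive" because the whole formulation rests on only the most basic, simplest assumptions, which are introduced below.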

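1 Classification with Bayesian decision theory

Naive Bayes

Pros: still effective with little data; handles multi-class problems.
Cons: sensitive to how the input data is prepared.
Works with: nominal values.

Naive Bayes is a part of Bayesian decision theory, so a quick look at Bayesian decision theory is in order before discussing naive Bayes.
Suppose we have a dataset made up of two classes of data, distributed as in Figure 1-1:

Figure 1-1 Two probability distributions with known parameters

Assume the statistical parameters of both classes are known. Let $p1(x, y)$ denote the probability that the data point $(x, y)$ belongs to class 1 (green in the figure), and $p2(x, y)$ the probability that it belongs to class 2 (red in the figure). A new data point $(x, y)$ can then be classified with the following rules:
1) If $p1(x, y) > p2(x, y)$, the class is 1;
2) If $p1(x, y) < p2(x, y)$, the class is 2.

In other words, we choose the class with the higher probability. That is the core idea of Bayesian decision theory. To be able to compute $p1$ and $p2$, we first need to discuss conditional probability; readers already familiar with it can skip ahead.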

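Bayes?

The interpretation of probability used here belongs to Bayesian probability theory, which is popular and works well in practice. Bayesian probability is named after Thomas Bayes, an 18th-century theologian. It allows prior knowledge and logical reasoning to be applied to uncertain statements. The other interpretation, frequency probability, draws conclusions only from the data itself and does not take logical reasoning or prior knowledge into account.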

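1.1 Conditional probability

Suppose we have a jar containing 7 stones, 3 green and 4 red (see Figure 1-2). Drawing one stone at random, the probability of green is 3/7 and the probability of red is 4/7.

Figure 1-2 A collection of 7 stones

If the 7 stones are split between two buckets, as in Figure 1-3, how should the probabilities be computed?

Figure 1-3 The 7 stones placed in two buckets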

When computing $P(green)$ or $P(red)$, does knowing which bucket the stone is in change the result? The probability of drawing a green stone given that it comes from bucket B is a conditional probability. It is written $P(green | bucketB)$ and read as "the probability of green given that the stone comes from bucket B".
It is easy to see that $P(green | bucketA) = 2/4$ and $P(green | bucketB) = 1/3$.

Conditional probability is computed as follows:
$$P(green | bucketB) = \frac{P(\text{green and bucketB})}{P(bucketB)}$$
Is this formula reasonable?
First, $P(\text{green and bucketB}) = 1/7$ and $P(bucketB) = 3/7$, so $P(green | bucketB) = (1/7) / (3/7) = 1/3$. The formula looks heavier than necessary for such a simple example, but it becomes very useful once many features are involved.
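The arithmetic above can be checked by enumerating the stones. This is a minimal sketch; the `stones` list and the `prob` helper are illustrative names that do not appear in the original, and the bucket contents are inferred from the counts in the text (bucket A: 2 green and 2 red, bucket B: 1 green and 2 red).

```python
from fractions import Fraction

# Each stone is a (colour, bucket) pair, matching the counts in the text.
stones = [("green", "A"), ("green", "A"), ("red", "A"), ("red", "A"),
          ("green", "B"), ("red", "B"), ("red", "B")]


def prob(event):
    """Probability that a stone drawn uniformly at random satisfies `event`."""
    return Fraction(sum(1 for s in stones if event(s)), len(stones))


p_green_and_b = prob(lambda s: s == ("green", "B"))
p_b = prob(lambda s: s[1] == "B")
p_green_given_b = p_green_and_b / p_b

print(p_green_and_b, p_b, p_green_given_b)
```

Using `Fraction` keeps the results exact, so they can be compared directly with the fractions in the text.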

Another way to work with conditional probabilities is Bayes' rule. Bayes' rule tells us how to swap the condition and the outcome in a conditional probability: if $P(x|c)$ is known and $P(c|x)$ is wanted, then
$$p(c|x) = \frac{p(x|c)p(c)}{p(x)}$$
The next question is how to apply this in a classifier.
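Before moving on, Bayes' rule can be tried on the stone example by asking the reverse question: given that a stone is green, how likely is it to have come from bucket B? The sketch below plugs in the values from the text; the variable names are illustrative.

```python
from fractions import Fraction

p_green_given_b = Fraction(1, 3)   # P(green | bucketB), from the text
p_b = Fraction(3, 7)               # P(bucketB): 3 of the 7 stones are in bucket B
p_green = Fraction(3, 7)           # P(green): 3 of the 7 stones are green

# Bayes' rule: P(bucketB | green) = P(green | bucketB) * P(bucketB) / P(green)
p_b_given_green = p_green_given_b * p_b / p_green

print(p_b_given_green)
```

The result agrees with a direct count: one of the three green stones sits in bucket B.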

1.2 Classifying with conditional probabilities

As noted above, Bayesian decision theory calls for computing two probabilities, $p1(x, y)$ and $p2(x, y)$:
1) If $p1(x, y) > p2(x, y)$, the class is 1;
2) If $p1(x, y) < p2(x, y)$, the class is 2.

These two rules are not the whole of Bayesian decision theory, though. $p1()$ and $p2()$ were used only to keep the description as simple as possible; what actually needs to be computed and compared is $p(c_1|x, y)$ and $p(c_2|x, y)$. These symbols mean the following:
Given a data point described by $x$ and $y$, what is the probability that it comes from class $c_1$, and what is the probability that it comes from class $c_2$? Note that these probabilities are not the same as $p(x, y|c_1)$, but Bayes' rule lets us compute one from the other:
$$p(c_i|x, y) = \frac{p(x, y|c_i)p(c_i)}{p(x, y)}$$
With these definitions, the Bayesian classification rule is:
1) If $P(c_1|x, y) > P(c_2|x, y)$, the class is $c_1$;
2) If $P(c_1|x, y) < P(c_2|x, y)$, the class is $c_2$.
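The rule can be sketched in a few lines. This is an illustrative sketch, not code from the original: the function name `classify` and the numbers are made up. Because $p(x, y)$ is the same for every class, the comparison only needs the numerators $p(x, y|c_i)p(c_i)$.

```python
def classify(likelihoods, priors):
    """Return the index of the class with the larger posterior p(c_i | x, y).

    likelihoods[i] is p(x, y | c_i) and priors[i] is p(c_i). The evidence
    p(x, y) is identical for all classes, so it is left out of the comparison.
    """
    scores = [lik * prior for lik, prior in zip(likelihoods, priors)]
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical numbers: the point is more likely under the first class,
# but the second class is far more common overall.
print(classify(likelihoods=[0.30, 0.20], priors=[0.2, 0.8]))
```

With these numbers the prior outweighs the likelihood, so the second class (index 1) is chosen; with equal priors the first class would be chosen instead.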

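1.3 Document classification with naive Bayes

An important application of machine learning is automatic document classification. In document classification the whole document (an email, say) is the instance, and certain elements of the email make up the features. Email is an ever-growing kind of text, but news stories, user comments, government documents, and any other kind of text can be classified in the same way. We can look at the words that occur in a document and treat the presence or absence of each word as a feature, which gives as many features as there are entries in the vocabulary.
Naive Bayes is an extension of the Bayesian classifier and a commonly used algorithm for document classification.

Suppose the vocabulary has 1000 words. Getting good probability distributions requires enough data samples. Statistics tells us that if each feature needs $N$ samples, then data with $k$ features needs $N^k$ samples, so the number of samples required grows steeply with the number of features.

If the features are independent of one another, the number of samples required drops to $1000 \times N$. Independence here means statistical independence: the likelihood of one feature or word occurring has nothing to do with which other words it appears next to. For example, it assumes that the word bacon is as likely to appear after unhealthy as after delicious. This assumption is of course not true: bacon often appears near delicious and rarely near unhealthy. This assumption is exactly what the word naive in naive Bayes refers to. The other assumption naive Bayes makes is that every feature is equally important, which is also questionable: to judge whether a message-board post is inappropriate, looking at 10 to 20 words is usually enough, rather than all 1000. Despite the flaws in these assumptions, naive Bayes works well in practice.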

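2 Text classification with Python

Take the message board of an online community as the example. To keep abusive posts from harming the community, we want to build a fast filter that flags a post as inappropriate when it breaks the rules. We set up two classes, abusive and not abusive, represented by 1 and 0 respectively.

2.1 Preparing the data: building word vectors from text

We treat text as a vector of words, or tokens. We consider all the words that occur in the existing documents, decide which of them go into the vocabulary (the set of words we care about), and then convert every document into a vector over that vocabulary. Create a file named bayes.py and add the following code:

Listing 2-1: Word list to vector conversion functions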

    # coding: utf-8

    def create_data_set():
        """Return toy test data: tokenized posts and their hand-assigned labels."""
        posting_list = [['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
                        ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
                        ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
                        ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
                        ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                        ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
        class_list = [0, 0, 0, 1, 1, 1]    # 0: not abusive; 1: abusive (labeled by hand)
        return posting_list, class_list


    def create_vocab_list(data_set):
        """Build a list of the distinct words that occur in the tokenized documents."""
        vocab_set = set()    # start from an empty set
        for document in data_set:
            vocab_set = vocab_set | set(document)    # "|" is set union
        return list(vocab_set)


    def set_of_words_vec(vocab_list, input_set):
        """Mark which vocabulary words occur in the document (set-of-words model)."""
        return_vec = [0] * len(vocab_list)    # one slot per vocabulary word
        for word in input_set:
            if word in vocab_list:    # presence only, not the number of occurrences
                return_vec[vocab_list.index(word)] = 1
            else:
                print("The word: %s is not in my Vocabulary!" % word)
        return return_vec


    if __name__ == '__main__':
        data_set, list_classes = create_data_set()
        my_vocab_list = create_vocab_list(data_set)
        print("My vocabulary is:", my_vocab_list)
        words_vec = set_of_words_vec(my_vocab_list, data_set[0])
        print("The vector of words is:", words_vec)

The output is shown below (the vocabulary is built from a set, so its order can differ from run to run):

My vocabulary is: ['posting', 'cute', 'problems', 'ate', 'love', 'food', 'how', 'my', 'to', 'park', 'help', 'buying', 'has', 'flea', 'is', 'I', 'him', 'steak', 'not', 'dog', 'worthless', 'stupid', 'take', 'mr', 'so', 'garbage', 'dalmation', 'stop', 'licks', 'please', 'quit', 'maybe']
The vector of words is: [0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0]

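The code takes the vocabulary, i.e. all the words we want to check, as input and builds one feature for each word. Given a document, it converts that document into a word vector.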

2.2 Training the algorithm: computing probabilities from word vectors

We now know whether each word occurs in a document, and we know which class the document belongs to. How do we use this data to compute probabilities?
Rewrite Bayes' rule with $\bm{w}$ in place of $x$, $y$. The bold $\bm{w}$ indicates a vector, i.e. it consists of multiple values:
$$p(c_i|\bm{w}) = \frac{p(\bm{w}|c_i)p(c_i)}{p(\bm{w})}$$
We use this formula to compute the value for each class and then compare the two probabilities. Since $p(\bm{w})$ is the same for both classes, only the numerators need to be compared. How are they computed?
1) $p(c_i)$: the number of documents in class $i$ divided by the total number of documents;
2) $p(\bm{w}|c_i)$: expanding $\bm{w}$ gives $p(\bm{w}|c_i) = p(w_0, w_1, w_2 \cdots w_N|c_i)$. Assuming all attributes are mutually independent given the class (the conditional independence assumption), we get $p(\bm{w}|c_i) = p(w_0|c_i)p(w_1|c_i)p(w_2|c_i) \cdots p(w_N|c_i)$, which greatly simplifies the computation.
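A small numeric sketch of the factorization in step 2 may help. The probabilities below are made up for a four-word vocabulary and are not taken from the training data; as in the classifier built later in this article, only the words that are present in the document contribute to the product.

```python
import math

# Hypothetical per-word conditional probabilities p(w_j | c_i) for one class.
p_word_given_class = [0.10, 0.20, 0.30, 0.10]
# 1 marks a vocabulary word that occurs in the document, 0 one that does not.
word_present = [1, 0, 0, 1]

# Conditional independence: p(w | c_i) is the product of the per-word terms.
p_doc_given_class = math.prod(
    p for p, present in zip(p_word_given_class, word_present) if present
)

print(p_doc_given_class)
```

Here the product is 0.10 times 0.10, i.e. about 0.01; multiplying it by the prior $p(c_i)$ gives the numerator that is compared across classes.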
　　
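Pseudocode for the function:

Count the number of documents in each class
For every training document:
    For each class:
        If a token appears in the document: increment the count for that token
        Increment the count for all tokens
For each class:
    For each token:
        Divide the token count by the total token count to get the conditional probability
Return the conditional probabilities for each class

Add the following code to bayes.py:

Listing 2-2: Naive Bayes classifier training function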

    from numpy import zeros


    # Initial version: counts start at 0 and the probabilities are not
    # log-transformed. Section 2.3 revises both.
    def train_NB(train_mat, train_class):
        """Train on word vectors and labels; this code handles two classes only."""
        num_train_mat = len(train_mat)    # number of training documents
        num_words = len(train_mat[0])    # same length as the vocabulary
        p_abusive = sum(train_class) / float(num_train_mat)    # prior of the abusive class
        p0_num = zeros(num_words); p1_num = zeros(num_words)    # per-word counts for each class
        p0_denom = 0.0; p1_denom = 0.0    # total word count for each class
        for i in range(num_train_mat):
            if train_class[i] == 1:
                # e.g. class-1 vectors [0, 0, 0, 1] and [1, 1, 0, 1] give p1_num = [1, 1, 0, 2]
                p1_num += train_mat[i]
                p1_denom += sum(train_mat[i])    # and p1_denom = 4, the sum of all the 1s
            else:
                p0_num += train_mat[i]    # same as for p1_num and p1_denom
                p0_denom += sum(train_mat[i])
        p1_vec = p1_num / p1_denom
        p0_vec = p0_num / p0_denom
        return p0_vec, p1_vec, p_abusive


    if __name__ == '__main__':
        data_set, class_list = create_data_set()
        my_vocab_list = create_vocab_list(data_set)
        train_mat = []
        for data in data_set:
            train_mat.append(set_of_words_vec(my_vocab_list, data))
        print("The train mat is:", train_mat)
        print("The vocabulary is:", my_vocab_list)
        p0_vec, p1_vec, p_abusive = train_NB(train_mat, class_list)
        print("The insulting probability is:\n", p_abusive)
        print("The probability of non-insulting each word is:\n", p0_vec)
        print("The probability of insulting each word is:\n", p1_vec)

The output is shown below (note: the train mat output has been reformatted by hand for readability):

The train mat is:[[1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0], [0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0], [0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1]]
The vocabulary is:['has', 'love', 'stop', 'how', 'posting', 'problems', 'cute', 'maybe', 'not', 'is', 'help', 'him', 'dalmation', 'garbage', 'quit', 'stupid', 'my', 'I', 'ate', 'mr', 'to', 'park', 'so', 'steak', 'buying', 'take', 'dog', 'food', 'licks', 'flea', 'please', 'worthless']
The insulting probability is:0.5
The probability of non-insulting each word is:
[0.04166667 0.04166667 0.04166667 0.04166667 0.         0.04166667
 0.04166667 0.         0.         0.04166667 0.04166667 0.08333333
 0.04166667 0.         0.         0.         0.125      0.04166667
 0.04166667 0.04166667 0.04166667 0.         0.04166667 0.04166667
 0.         0.         0.04166667 0.         0.04166667 0.04166667
 0.04166667 0.        ]
The probability of insulting each word is:
[0.         0.         0.05263158 0.         0.05263158 0.
 0.         0.05263158 0.05263158 0.         0.         0.05263158
 0.         0.05263158 0.05263158 0.15789474 0.         0.
 0.         0.         0.05263158 0.05263158 0.         0.
 0.05263158 0.05263158 0.10526316 0.05263158 0.         0.
 0.         0.10526316]

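First, the probability that a document is abusive is 0.5, which is correct: 3 of the 6 posts are labeled 1. Next come the per-word probabilities; take the first word, 'has', as an example:
In the document vectors the first three labels are 0 (not abusive) and the last three are 1 (abusive). For class 0, p0_num ends up as [1, 1, 1, 1, 0, ...]; for class 1, p1_num ends up as [0, 0, 1, 0, 1, ...]; p0_denom ends up as 24 and p1_denom as 19. So 'has' gets 1/24 ≈ 0.0417 in class 0 and 0 in class 1, and overall p0_vec = [0.0417, 0.0417, 0.0417, 0.0417, 0, ...] and p1_vec = [0, 0, 0.053, 0, 0.053, ...], which matches the program output up to rounding.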
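The counts 24 and 19 and the individual word probabilities can be cross-checked without NumPy. The sketch below recounts directly from the six toy posts; the helper `word_count_and_total` is an illustrative name, not part of bayes.py.

```python
posts = [['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
         ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
         ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
         ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
         ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
         ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
labels = [0, 0, 0, 1, 1, 1]


def word_count_and_total(word, cls):
    """Return (documents of class `cls` containing `word`, total distinct-word count of that class)."""
    docs = [set(doc) for doc, label in zip(posts, labels) if label == cls]
    total = sum(len(doc) for doc in docs)           # plays the role of p0_denom / p1_denom
    count = sum(1 for doc in docs if word in doc)   # plays the role of one entry of p0_num / p1_num
    return count, total


print(word_count_and_total('has', 0))
print(word_count_and_total('has', 1))
print(word_count_and_total('stupid', 1))
```

Dividing each count by its total gives 1/24 ≈ 0.0417, 0, and 3/19 ≈ 0.158, the values that appear for these words in the output above.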
Before the function can be used for classification, a few flaws in it need to be fixed.

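2.3 Testing the algorithm: adapting the classifier to real-world conditions

When a Bayesian classifier classifies a document, it multiplies many probabilities together to get the probability that the document belongs to a class, i.e. it computes $p(w_0|1)p(w_1|1) \cdots p(w_N|1)$. If any one of those probabilities is 0, the whole product is 0. To reduce this effect, initialize the occurrence count of every word to 1 and the denominators to 2.
That is, p0_num, p1_num, p0_denom and p1_denom are initialized as follows (ones is imported from NumPy):

    p0_num = ones(num_words); p1_num = ones(num_words)
    p0_denom = 2.0; p1_denom = 2.0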
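The effect of this initialization can be seen on a tiny made-up example. The sketch below uses plain Python lists instead of NumPy arrays, and the counts are hypothetical; the third word never occurs in the class.

```python
counts = [1, 1, 0, 2]    # hypothetical word counts for one class
total = sum(counts)

# Counts starting at 0: the unseen word gets probability exactly 0.
unsmoothed = [c / total for c in counts]

# Counts starting at 1 and the denominator at 2, as described above.
smoothed = [(c + 1) / (total + 2) for c in counts]

print(unsmoothed)
print(smoothed)
```

After the change no entry is 0, so a single unseen word can no longer force the whole product to 0.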

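Another problem is underflow, which comes from multiplying too many very small numbers (try multiplying many very small numbers yourself: the result eventually rounds to 0).
One solution is to take the natural logarithm of the product. In algebra $\ln(a*b) = \ln(a) + \ln(b)$, so taking logarithms turns the product into a sum and avoids the underflow. Nothing is lost by doing so; the figure below plots a function $f(x)$ together with $\ln(f(x))$:

Figure 2-1 The two functions rise and fall together, although their values differ

As Figure 2-1 shows, the two curves increase and decrease over the same regions and reach their extrema at the same points. The values differ, but the final result, i.e. which class scores higher, is unaffected. The code that computes p0_vec and p1_vec is therefore changed to (log is imported from NumPy):

    p1_vec = log(p1_num / p1_denom)
    p0_vec = log(p0_num / p0_denom)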
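The underflow itself is easy to reproduce. The sketch below uses the standard `math` module and 100 made-up per-word probabilities of 1e-5 each; the true product, 1e-500, is far below the smallest positive double-precision float.

```python
import math

probs = [1e-5] * 100    # made-up values, one per word

product = 1.0
for p in probs:
    product *= p        # underflows to 0.0 long before the loop ends

log_sum = sum(math.log(p) for p in probs)    # the same quantity in log space

print(product)
print(log_sum)
```

The direct product collapses to 0.0, while the sum of logarithms stays a perfectly ordinary negative number that can still be compared across classes.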

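All the pieces for a complete classifier are now in place. Add the following code to bayes.py:

Listing 2-3: Naive Bayes classification function

    from numpy import array, log


    def classify_NB(classify_vec, p0_vec, p1_vec, p_class):
        """Classify a word vector given the per-class log-probability vectors and the class-1 prior."""
        # Summing the log-probabilities of the words present equals the log of their product.
        p1 = sum(classify_vec * p1_vec) + log(p_class)
        p0 = sum(classify_vec * p0_vec) + log(1.0 - p_class)
        if p1 > p0:
            return 1
        else:
            return 0


    def create_test_data_set(i):
        """Return the i-th test document; add more test data here as needed."""
        test_data_set = [['love', 'my', 'dalmation'],
                         ['stupid', 'garbage']]
        return test_data_set[i]


    def test():
        data_set, class_list = create_data_set()
        my_vocab_list = create_vocab_list(data_set)
        # train_NB expects word vectors, so convert every document first.
        train_mat = [set_of_words_vec(my_vocab_list, data) for data in data_set]
        print("The train mat is:", train_mat)
        print("The vocabulary is:", my_vocab_list)
        p0_vec, p1_vec, p_abusive = train_NB(train_mat, class_list)
        test_data_set = create_test_data_set(1)
        test_mat = array(set_of_words_vec(my_vocab_list, test_data_set))
        print("The test mat is:", test_mat)
        classified_label = classify_NB(test_mat, p0_vec, p1_vec, p_abusive)
        print("The classified as:", classified_label)


    if __name__ == '__main__':
        test()

Output: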

The train mat is:[[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0], [0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0], [1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1], [0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0], [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0]]
The vocabulary is:['licks', 'mr', 'buying', 'maybe', 'dalmation', 'steak', 'park', 'I', 'posting', 'please', 'take', 'not', 'stupid', 'garbage', 'is', 'him', 'worthless', 'ate', 'food', 'love', 'so', 'to', 'has', 'quit', 'my', 'stop', 'dog', 'help', 'cute', 'flea', 'problems', 'how']
The test mat is: [0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
The classified as: 1
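To see why ['stupid', 'garbage'] ends up in class 1, the two scores can be recomputed by hand. This sketch assumes the training function uses the revisions from section 2.3 (counts start at 1, denominators at 2, log-probabilities) and takes its counts from the six toy posts: 24 word occurrences in class 0 and 19 in class 1. The variable names are illustrative.

```python
import math

p_stupid_1 = (3 + 1) / (19 + 2)     # 'stupid' occurs in all 3 abusive posts
p_garbage_1 = (1 + 1) / (19 + 2)    # 'garbage' occurs in 1 abusive post
p_stupid_0 = (0 + 1) / (24 + 2)     # neither word occurs in a non-abusive post
p_garbage_0 = (0 + 1) / (24 + 2)
p_abusive = 0.5                     # prior of class 1

# Same form as classify_NB: sum of log-probabilities plus the log prior.
p1 = math.log(p_stupid_1) + math.log(p_garbage_1) + math.log(p_abusive)
p0 = math.log(p_stupid_0) + math.log(p_garbage_0) + math.log(1.0 - p_abusive)

print(p1, p0, 1 if p1 > p0 else 0)
```

The class-1 score is the larger of the two, which agrees with the label printed by the program.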

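2.4 Preparing the data: the bag-of-words document model

So far we have treated the presence or absence of each word as a feature, which can be described as a set-of-words model. If a word appears more than once in a document, that may convey information that mere presence or absence cannot express; an approach that keeps the counts is called a bag-of-words model. In a bag of words each word can occur multiple times, whereas in a set of words each word can occur only once.
To support the bag-of-words model, add a new function bag_of_words_vec(), kept separate from set_of_words_vec(). Add the following code to bayes.py:

Listing 2-4: Naive Bayes bag-of-words model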

    def bag_of_words_vec(vocab_list, input_set):
        """Count how many times each vocabulary word occurs in the document."""
        return_vec = [0] * len(vocab_list)
        for word in input_set:
            if word in vocab_list:
                return_vec[vocab_list.index(word)] += 1    # += 1 here, versus = 1 in set_of_words_vec()
        return return_vec
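The difference between the two models shows up as soon as a word repeats. The sketch below uses condensed copies of the two functions (the set version omits the message for unknown words) and a made-up three-word vocabulary.

```python
def set_of_words_vec(vocab_list, input_set):
    """Condensed copy: 1 if the word occurs at all, else 0."""
    return_vec = [0] * len(vocab_list)
    for word in input_set:
        if word in vocab_list:
            return_vec[vocab_list.index(word)] = 1
    return return_vec


def bag_of_words_vec(vocab_list, input_set):
    """Condensed copy: number of occurrences of each vocabulary word."""
    return_vec = [0] * len(vocab_list)
    for word in input_set:
        if word in vocab_list:
            return_vec[vocab_list.index(word)] += 1
    return return_vec


vocab = ['dog', 'stupid', 'my']          # made-up vocabulary
doc = ['stupid', 'dog', 'stupid']        # 'stupid' occurs twice

print(set_of_words_vec(vocab, doc))
print(bag_of_words_vec(vocab, doc))
```

The set-of-words vector is [1, 1, 0], while the bag-of-words vector is [1, 2, 0] and so preserves the repetition.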

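The classifier is now complete.

3 Example: filtering spam email

To apply naive Bayes to real-world problems, we first need to turn the text content into a list of strings and then build word vectors from it. The following example is one of the best-known applications of naive Bayes: email filtering.

3.1 Preparing the data: tokenizing text

The test data is: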

Hi Peter,
With Jose out of town, do you want to
meet once in a while to keep things
going and do some interesting stuff?
Let me know
Eugene

Converted into a Python string:

    my_sent = 'Hi Peter,       With Jose out of town, do you want to meet once in a while to keep things ' \
              'going and do some interesting stuff?         Let me know Eugene'

Rewrite the test() function in bayes.py as follows:

Listing 3-1: Splitting test 1

    def test():
        my_sent = 'Hi Peter,       With Jose out of town, do you want to meet once in a while to keep things ' \
                  'going and do some interesting stuff?         Let me know Eugene'
        my_sent = my_sent.split()    # split on runs of whitespace
        print(my_sent)

Output:

['Hi', 'Peter,', 'With', 'Jose', 'out', 'of', 'town,', 'do', 'you', 'want', 'to', 'meet', 'once', 'in', 'a', 'while', 'to', 'keep', 'things', 'going', 'and', 'do', 'some', 'interesting', 'stuff?', 'Let', 'me', 'know', 'Eugene']

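This looks decent, but punctuation is treated as part of the words. So we split the sentence with a regular expression instead, where the delimiter is any run of characters that are not letters or digits (strictly speaking, \W also treats the underscore as a word character). Rewrite test() again:

Listing 3-2: Splitting test 2

    def test():
        import re
        my_sent = 'Hi Peter,       With Jose out of town, do you want to meet once in a while to keep things ' \
                  'going and do some interesting stuff?         Let me know Eugene'
        reg_ex = re.compile('\\W+')    # one or more non-word characters
        my_sent = reg_ex.split(my_sent)
        print(my_sent)

Output: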

['Hi', 'Peter', 'With', 'Jose', 'out', 'of', 'town', 'do', 'you', 'want', 'to', 'meet', 'once', 'in', 'a', 'while', 'to', 'keep', 'things', 'going', 'and', 'do', 'some', 'interesting', 'stuff', 'Let', 'me', 'know', 'Eugene']

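This works well, but notice that the first word of each sentence is capitalized. Capitalization would be useful if we were searching for sentences, but here the text is treated only as a bag of words, so a workable approach is to convert every word to lowercase. The version below also drops tokens of 2 characters or fewer. Rewrite test() once more:

Listing 3-3: Splitting test 3

    def test():
        import re
        my_sent = 'Hi Peter,       With Jose out of town, do you want to meet once in a while to keep things ' \
                  'going and do some interesting stuff?         Let me know Eugene'
        reg_ex = re.compile('\\W+')
        my_sent = reg_ex.split(my_sent)
        my_sent = [tok.lower() for tok in my_sent if len(tok) > 2]    # lowercase; keep tokens longer than 2 characters
        print(my_sent)

Output: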

['peter', 'with', 'jose', 'out', 'town', 'you', 'want', 'meet', 'once', 'while', 'keep', 'things', 'going', 'and', 'some', 'interesting', 'stuff', 'let', 'know', 'eugene']

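This gives the result we want. Note that this split treats words shorter than 3 characters as unimportant and discards them.

3.2 Testing the algorithm: cross-validation

The test emails are the data distributed with Machine Learning in Action; after downloading, they are in the machinelearninginaction/Ch04 directory.
Note: delete the extra question mark in the second paragraph of 23.txt under email\ham.

Next, we integrate the text parser into a complete classifier. Add the following code to bayes.py:

Listing 3-4: File parsing and the complete spam test function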

    import os
    import random
    import re
    import sys

    from numpy import array


    def load_data_set():
        """Load and tokenize every email; spam is labeled 1 and ham 0."""
        data_set = []; class_list = []
        # Spam emails (label 1)
        num_of_file1 = get_num_of_file('email/spam')    # number of emails in the folder
        for i in range(1, num_of_file1 + 1):
            word_list = text_parse(open('email/spam/%d.txt' % i).read())    # tokenize
            data_set.append(word_list)
            class_list.append(1)
        # Ham emails (label 0)
        num_of_file2 = get_num_of_file('email/ham')
        for i in range(1, num_of_file2 + 1):
            word_list = text_parse(open('email/ham/%d.txt' % i).read())
            data_set.append(word_list)
            class_list.append(0)
        num_of_file = num_of_file1 + num_of_file2
        return data_set, class_list, num_of_file


    def get_num_of_file(path):
        """Return the number of entries in the folder at path."""
        if not os.path.isdir(path):    # the path must be an existing directory
            print('Error: "', path, '" is not a directory or does not exist.')
            sys.exit(1)
        return len(os.listdir(path))


    def text_parse(doc_list):
        """Split raw text into lowercase tokens longer than 2 characters, dropping pure numbers."""
        reg_ex = re.compile(r'\W+')    # raw string avoids the invalid-escape warning for '\W'
        doc_list = reg_ex.split(doc_list)
        temp_doc_list = [tok.lower() for tok in doc_list if len(tok) > 2]
        return [value for value in temp_doc_list if not is_number(value)]


    def is_number(num):
        """Return True if the string can be parsed as a number."""
        try:
            float(num)
            return True
        except ValueError:
            return False


    def spam_test(__N):
        """Hold out __N randomly chosen emails as the test set and train on the rest."""
        data_set, class_list, num_of_file = load_data_set()
        if __N < 1 or __N >= num_of_file:    # at least one email must remain for training
            print("The n should between 1 to %d !" % (num_of_file - 1))
            sys.exit(1)
        vocab_list = create_vocab_list(data_set)
        print("The vocabulary is:\n", vocab_list)
        train_set = list(range(num_of_file)); test_set = []
        for i in range(__N):    # pick the test set at random
            rand_index = random.randrange(len(train_set))    # always a valid index
            test_set.append(train_set[rand_index])
            del(train_set[rand_index])
        train_mat = []; train_class = []    # the remaining documents form the training set
        for doc_index in train_set:
            train_mat.append(bag_of_words_vec(vocab_list, data_set[doc_index]))
            train_class.append(class_list[doc_index])
        p0_vec, p1_vec, p_spam = train_NB(array(train_mat), array(train_class))
        error_count = 0
        for doc_index in test_set:
            word_vec = bag_of_words_vec(vocab_list, data_set[doc_index])
            if classify_NB(array(word_vec), p0_vec, p1_vec, p_spam) != class_list[doc_index]:
                print("The Error-partitioned data is:", data_set[doc_index])
                error_count += 1
        print("The error rate is: ", float(error_count / len(test_set)))


    if __name__ == '__main__':
        spam_test(15)

Output:

The vocabulary is:['two', 'com', 'vuitton', '30mg', 'nvidia', 'code', 'tiffany', 'assistance', 'hydrocodone', 'please', 'thailand', 'edit', 'thirumalai', 'cca', 'derivatives', 'ma1eenhancement', 'see', 'items', 'major', 'natural', 'reliever', 'individual', 'drugs', 'add', 'store', 'has', 'sophisticated', 'dior', 'turd', 'insights', 'great', 'out', 'good', 'jar', 'announcement', 'year', 'google', 'uses', 'february', 'buy', 'through', 'certified', 'working', 'food', 'herbal', 'being', 'just', 'signed', 'both', 'more', 'ups', 'free', 'development', 'watchesstore', 'analgesic', 'but', 'trusted', 'fans', 'computing', 'well', 'ultimate', 'came', 'linkedin', 'success', 'fundamental', 'members', 'hamm', 'rock', 'been', 'window', 'party', 'wrote', 'all', 'plus', 'zolpidem', 'back', 'whybrew', 'creative', 'methods', 'close', 'rude', 'creation', 'fermi', 'professional', 'share', 'yeah', 'finder', 'night', 'generation', 'yourpenis', 'titles', 'rent', 'strategy', 'behind', 'using', 'learn', 'stepp', 'may', 'use', 'shipment', 'monte', 'generates', 'jpgs', 'get', 'effective', 'these', 'jquery', 'automatically', 'focus', 'talked', 'was', 'item', 'pictures', 'mathematics', 'pages', 'website', 'programming', 'you', 'had', 'should', 'museum', 'inspired', 'significantly', 'files', 'game', 'john', 'control', 'features', 'grounds', 'site', 'mandarin', 'sliding', 'survive', 'based', 'focusing', 'brained', 'featured', 'assigning', 'does', '100mg', 'enough', 'quantitative', 'location', 'expo', 'went', 'hello', 'than', 'permanantly', 'money', 'link', 'brandviagra', 'decision', 'china', 'percocet', 'station', 'latest', 'financial', 'photoshop', 'try', 'prices', 'modelling', 'team', 'information', 'docs', 'full', 'earn', 'used', 'canadian', 'writing', 'noprescription', 'october', 'faster', 'changing', 'book', 'meet', 'most', 'via', 'fine', 'place', 'storage', 'way', 'fedex', 'with', 'pavilion', 'core', 'received', 'credit', 'ofejacu1ate', 'like', 'accept', '10mg', 'far', 'here', 'york', 
'lunch', 'high', 'watches', 'pls', 'supplement', 'plugin', 'differ', 'need', 'ones', '14th', 'there', 'discount', 'commented', 'doing', 'moderate', 'length', 'who', 'cold', 'intenseorgasns', 'that', 'knocking', 'home', 'train', 'narcotic', 'withoutprescription', 'pro', 'girl', 'died', 'service', 'thousand', 'couple', 'requested', 'http', 'thought', 'nature', 'viagranoprescription', 'definitely', 'wilson', 'think', 'business', 'pick', 'of_penisen1argement', 'gpu', 'don', 'brand', 'cats', 'thanks', '50mg', 'chinese', 'gucci', 'germany', 'support', 'also', 'eugene', 'job', 'microsoft', 'door', 'shape', 'yay', 'doors', 'vicodin', 'owner', 'jay', 'opportunity', 'heard', 'number', 'order', 'chance', 'welcome', 'gas', 'mailing', 'sky', 'transformed', 'arolexbvlgari', 'scifinance', 'improving', 'launch', 'right', 'cs5', 'zach', 'once', 'could', 'looking', 'ideas', 'extended', 'parallel', 'plane', 'recieve', 'income', 'thank', 'amazing', 'bad', 'cat', 'tool', 'listed', 'can', 'forum', 'groups', 'sounds', 'gain', 'prototype', 'having', 'hold', 'cartier', 'dhl', 'giants', 'visa', 'aged', 'magazine', 'sorry', 'color', 'running', 'methylmorphine', 'bin', 'office', 'time', 'access', 'work', 'proven', 'hotels', 'concise', 'holiday', 'retirement', 'page', 'grow', 'this', 'not', 'volume', 'copy', 'placed', 'held', 'then', 'any', 'suggest', 'storedetailview_98', 'advocate', 'knew', 'computer', 'class', 'town', 'connection', 'might', 'keep', 'have', 'jqplot', 'blue', 'father', 'leaves', 'accepted', 'from', 'discussions', 'inside', 'harderecetions', 'supporting', 'instead', 'wednesday', 'answer', 'scenic', 'will', '100m', 'days', 'and', 'horn', 'name', 'expertise', 'want', 'things', 'since', 'butt', 'cuda', 'carlo', 'buyviagra', 'come', 'where', 'automatic', 'email', 'vivek', 'exhibit', 'notification', 'shipping', '25mg', 'mandelbrot', 'online', 'comment', 'fda', 'pretty', 'sent', 'least', 'reply', 'help', 'save', 'freeviagra', 'new', 'questions', 'release', 'your', 'drunk', 'school', 
'tickets', 'create', 'contact', 'courier', 'experience', 'phone', 'windows', 'cannot', 'attaching', 'incoming', 'ready', 'lined', 'explosive', 'fbi', 'each', 'program', 'worldwide', 'huge', '15mg', 'province', 'articles', 'bargains', 'per', 'approach', 'often', 'express', 'bathroom', 'warranty', 'level', 'yesterday', 'upload', 'over', 'changes', 'benoit', 'source', 'serial', 'pill', 'pills', 'told', 'jewerly', 'ems', 'spaying', 'address', 'file', 'arvind', 'cheers', 'risk', 'those', 'interesting', 'wallets', 'delivery', 'thickness', 'chapter', 'winter', 'finance', 'enabled', 'must', 'regards', 'ordercializviagra', 'model', 'julius', 'follow', 'discreet', 'doctor', 'treat', 'inform', 'specifications', 'made', 'you抮e', 'rain', 'view', 'car', 'pain', 'series', 'strategic', 'only', 'pricing', 'message', 'inches', 'now', 'products', 'fast', 'bettererections', 'today', 'them', 'phentermin', 'foaming', 'group', 'know', 'his', 'tokyo', 'safe', 'python', 'check', 'famous', 'runs', 'codeine', 'acrobat', 'guy', 'hotel', 'much', 'because', 'customized', 'peter', 'hommies', 'kerry', 'bike', 'increase', 'such', 'amex', 'said', 'hermes', 'wasn', 'price', 'take', 'wilmott', 'tesla', 'top', 'longer', 'troy', 'riding', 'specifically', 'sites', 'favorite', 'would', 'experts', 'net', 'done', 'about', 'thread', 'endorsed', 'jocelyn', 'brands', 'low', 'invitation', 'capabilities', 'incredib1e', 'designed', 'cheap', 'art', 'jose', 'enjoy', 'guaranteeed', 'mom', 'web', 'forward', 'too', 'hope', 'ambiem', 'possible', 'the', 'either', 'starting', 'borders', 'tabs', 'roofer', 'another', 'genuine', 'got', 'logged', 'glimpse', 'adobe', 'superb', 'while', 'louis', 'stuff', 'management', 'let', 'mandatory', 'moderately', 'update', 'issues', '0nline', 'moneyback', 'selected', 'approved', 'hours', '300x', 'includes', 'saw', 'severepain', 'works', 'ryan', 'life', 'away', 'cards', 'competitive', 'example', 'everything', 'same', 'how', 'pharmacy', 'femaleviagra', 'past', 'find', 'sure', 'call', 
'doggy', 'what', 'thing', 'network', 'they', 'mba', '5mg', 'don抰', 'mathematician', 'some', 'haloney', 'speedpost', 'cost', 'ferguson', 'private', 'encourage', 'quality', 'dusty', 'located', 'others', 'inconvenience', 'perhaps', 'naturalpenisenhancement', 'are', 'important', 'easily', 'functionalities', 'note', 'watson', 'reservation', 'care', 'opioid', 'off', 'www', 'prepared', 'for', 'download', 'one', 'tent', 'day', 'design', 'hangzhou', 'reputable', 'safest', 'required', 'trip', 'mail', 'dozen', 'fractal', 'coast', 'status', 'biggerpenis', 'needed', 'oris', 'wholesale', 'oem', 'going', 'below', 'lists', 'gains', 'betterejacu1ation', 'bags', 'when', 'tour', 'millions', 'style', 'softwares']
The Error-partitioned data is: ['linkedin', 'kerry', 'haloney', 'requested', 'add', 'you', 'connection', 'linkedin', 'peter', 'like', 'add', 'you', 'professional', 'network', 'linkedin', 'kerry', 'haloney']
The Error-partitioned data is: ['benoit', 'mandelbrot', 'benoit', 'mandelbrot', 'wilmott', 'team', 'benoit', 'mandelbrot', 'the', 'mathematician', 'the', 'father', 'fractal', 'mathematics', 'and', 'advocate', 'more', 'sophisticated', 'modelling', 'quantitative', 'finance', 'died', '14th', 'october', 'aged', 'wilmott', 'magazine', 'has', 'often', 'featured', 'mandelbrot', 'his', 'ideas', 'and', 'the', 'work', 'others', 'inspired', 'his', 'fundamental', 'insights', 'you', 'must', 'logged', 'view', 'these', 'articles', 'from', 'past', 'issues', 'wilmott', 'magazine']
The error rate is:  0.13333333333333333

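The result differs from run to run because the test emails are chosen at random, but the overall error rate is low. In this run 2 of the 15 test emails were misclassified (2/15 ≈ 0.133); judging by their content, both look like legitimate messages (a LinkedIn invitation and a Wilmott magazine notice) that were flagged as spam. There are several ways to improve the classifier, which are not discussed here.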
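Because a single random split gives a noisy error rate, a common practice is to repeat the hold-out procedure several times and average the results. The sketch below shows only the averaging step. `average_error` and `fake_run` are hypothetical names: `fake_run` merely simulates one run with a random number of errors, and in practice it would be replaced by a variant of spam_test() that returns its error rate instead of printing it.

```python
import random


def average_error(run_once, repeats=10):
    """Average the error rate returned by `run_once()` over several runs."""
    return sum(run_once() for _ in range(repeats)) / repeats


def fake_run():
    # Hypothetical stand-in for one hold-out run with 15 test emails:
    # pretend that 0, 1, or 2 of them were misclassified.
    return random.randint(0, 2) / 15


random.seed(0)    # fixed seed so the simulated runs are repeatable
avg = average_error(fake_run, repeats=10)
print(avg)
```

Averaging over 10 or more splits gives a steadier estimate than any single run.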
