Spam Detection with the Naive Bayes Algorithm

Reposted from: http://blog.csdn.net/wowcplusplus/article/details/25190809

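Naive Bayes is a machine learning algorithm widely used in industry. It rests on a solid mathematical foundation and performs well in a number of typical applications. The algorithm is based on Bayes' theorem from probability theory, whose core formula is:

$$P(c \mid x) = \frac{P(x \mid c)\,P(c)}{P(x)}$$

Here $c$ denotes a class, and $P(c \mid x)$ is the conditional probability that the class is $c$ given the observation $x$. We compute $P(c \mid x)$ for every class, compare the values, and take the class with the largest probability as the final label. Let us now look at how to compute these conditional probabilities.

Given $x = (x_1, x_2, \dots, x_n)$, we have $P(x \mid c) = P(x_1, x_2, \dots, x_n \mid c)$. Naive Bayes assumes that $x_1, x_2, \dots, x_n$ are mutually independent, so $P(x \mid c) = \prod_{i=1}^{n} P(x_i \mid c)$. Furthermore, $P(x_i \mid c) = \dfrac{\sum_{j} I(x_i \in x^{(j)},\ c^{(j)} = c)}{\sum_{j} I(c^{(j)} = c)}$, where the sums run over the training samples $(x^{(j)}, c^{(j)})$ and $I$ is the indicator function (1 if the condition holds, 0 otherwise). Both $P(x_i \mid c)$ and $P(c)$ can be counted directly from the training data, so $P(x \mid c)\,P(c)$ can be computed from the analysis above. Since $P(x)$ is the same for every class, comparing the conditional probabilities $P(c \mid x)$ is equivalent to comparing $P(x \mid c)\,P(c)$. This is the mathematical principle behind naive Bayes.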
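
The decision rule can be illustrated with a small Python sketch. The function name `naive_bayes_scores` and all probabilities below are hypothetical values chosen only for illustration; they are not taken from the email dataset.

```python
from math import prod

def naive_bayes_scores(features, priors, likelihoods):
    """Return P(x|c) * P(c) for every class c under the independence assumption."""
    return {
        c: priors[c] * prod(likelihoods[c][f] for f in features)
        for c in priors
    }

# Hypothetical priors P(c) and per-word likelihoods P(x_i|c).
priors = {"ham": 0.5, "spam": 0.5}
likelihoods = {
    "ham": {"free": 0.1, "meeting": 0.4},
    "spam": {"free": 0.6, "meeting": 0.05},
}

scores = naive_bayes_scores(["free", "meeting"], priors, likelihoods)
# P(x) is identical for both classes, so the larger score wins.
predicted = max(scores, key=scores.get)
print(scores, predicted)
```

With these numbers the ham score is about 0.02 and the spam score about 0.015, so the sketch predicts ham.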

Next, we use a typical application of naive Bayes, spam filtering, to demonstrate a Python implementation of the algorithm.

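Suppose we have 25 spam emails and 25 normal emails. How do we use them as training data to obtain a naive Bayes model that filters spam? First, we build a word vector $x = (x_1, x_2, \dots, x_n)$ from the words of each email, where $x_i$ is a word that appears in the email. Comparing $P(\text{ham} \mid x)$ with $P(\text{spam} \mid x)$ then decides whether the email is a normal email (ham) or a spam email (spam).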

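In the first step, we convert each email into a list of lowercase words with the following function (it returns a plain Python list, not a NumPy array):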

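import re

def file2array(filename):
    with open(filename, 'r', errors='ignore') as f:  # skip undecodable bytes
        text = f.read()
    # '\W+' splits on runs of non-word characters; '\W*' would split between every character on Python 3.7+.
    listOfWord = re.split(r'\W+', text)
    # Keep lowercase words longer than 3 characters.
    fileArray = [word.lower() for word in listOfWord if len(word) > 3]
    return fileArray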
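
To see what this tokenization produces, here is a minimal sketch that applies the same split-and-filter rule to an in-memory string instead of a file. The helper name `tokenize` and the sample sentence are made up for illustration.

```python
import re

def tokenize(text):
    """Split on runs of non-word characters; keep lowercase tokens longer than 3 characters."""
    return [tok.lower() for tok in re.split(r'\W+', text) if len(tok) > 3]

sample = "Hello, WORLD! Buy cheap pills now: visit http://example.com today."
tokens = tokenize(sample)
print(tokens)
# ['hello', 'world', 'cheap', 'pills', 'visit', 'http', 'example', 'today']
```

Short tokens such as "Buy", "now" and "com" are dropped by the length filter, which also removes the empty string that `re.split` returns after the trailing period.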

We then collect all emails into a single list and randomly pick 5 of them as the test set:

import random

def getAllInfo():
    allTextMat = []      # one word list per email
    allTypeArray = []    # label per email: 1 = spam, 0 = ham
    testMat = []
    testTypeArray = []
    for i in range(1, 26):
        allTextMat.append(file2array('email/spam/%d.txt' % i))
        allTypeArray.append(1)
        allTextMat.append(file2array('email/ham/%d.txt' % i))
        allTypeArray.append(0)
    # Move 5 randomly chosen emails from the training set into the test set.
    for i in range(5):
        randIndex = random.randrange(len(allTextMat))
        testMat.append(allTextMat[randIndex])
        testTypeArray.append(allTypeArray[randIndex])
        del allTextMat[randIndex]
        del allTypeArray[randIndex]
    return allTextMat, allTypeArray, testMat, testTypeArray
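
The same hold-out idea can be written without deleting from the lists in place. The following is only a sketch of an alternative: the function `split_holdout`, its `seed` parameter and the placeholder documents are assumptions introduced here, not part of the original code.

```python
import random

def split_holdout(docs, labels, test_size, seed=None):
    """Randomly move `test_size` documents (with their labels) into a test set."""
    rng = random.Random(seed)
    test_idx = set(rng.sample(range(len(docs)), test_size))
    train_docs, train_labels, test_docs, test_labels = [], [], [], []
    for i, (doc, label) in enumerate(zip(docs, labels)):
        if i in test_idx:
            test_docs.append(doc)
            test_labels.append(label)
        else:
            train_docs.append(doc)
            train_labels.append(label)
    return train_docs, train_labels, test_docs, test_labels

# 50 placeholder "emails" with alternating labels, mirroring the 25 + 25 setup.
docs = [["word%d" % i] for i in range(50)]
labels = [i % 2 for i in range(50)]
train_docs, train_labels, test_docs, test_labels = split_holdout(docs, labels, 5, seed=0)
print(len(train_docs), len(test_docs))
```

Passing a fixed `seed` makes the split reproducible, which helps when comparing error counts between runs.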

Next, we count how often each word occurs overall, and how often it occurs in spam and in normal emails respectively:

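from numpy import ones

def getWordList(allTextMat):
    # Vocabulary: every distinct word in the training emails.
    wordSet = set()
    for textVec in allTextMat:
        wordSet |= set(textVec)
    return list(wordSet)

def getCountList(wordList, allTextMat, allTypeArray):
    wordListLen = len(wordList)
    # Counts start at 1 (per class) and 2 (total) so that no ratio is ever 0.
    totalCntList = ones(wordListLen)
    totalCntList *= 2
    p0CntList = ones(wordListLen)
    p1CntList = ones(wordListLen)
    order = 0   # index of the current email
    p0Cnt = 0   # number of words seen in ham
    p1Cnt = 0   # number of words seen in spam
    for textVec in allTextMat:
        for word in textVec:
            wordPos = wordList.index(word)
            totalCntList[wordPos] += 1
            if allTypeArray[order] == 1:
                p1CntList[wordPos] += 1
                p1Cnt += 1
            elif allTypeArray[order] == 0:
                p0CntList[wordPos] += 1
                p0Cnt += 1
        order += 1
    # Class priors, estimated from the share of words that fall in each class.
    p0 = float(p0Cnt) / (p0Cnt + p1Cnt)
    p1 = 1 - p0
    return totalCntList, p0CntList, p1CntList, p0, p1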
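
The counting function above scores a word by its count in one class divided by its total count. For comparison, the following is a minimal sketch of the more common textbook estimate, in which $P(x_i \mid c)$ is the add-one-smoothed share of class $c$'s words that are $x_i$ and $P(c)$ is the share of documents in class $c$. The function `train_multinomial_nb` and the toy documents are hypothetical and are not part of the original code.

```python
from collections import Counter
from math import log

def train_multinomial_nb(docs, labels):
    """Estimate log P(c) and add-one-smoothed log P(w|c) from tokenized documents."""
    vocab = sorted({w for doc in docs for w in doc})
    doc_counts = Counter(labels)                      # documents per class
    word_counts = {c: Counter() for c in doc_counts}  # word occurrences per class
    for doc, c in zip(docs, labels):
        word_counts[c].update(doc)
    log_prior = {c: log(doc_counts[c] / len(docs)) for c in doc_counts}
    log_likelihood = {}
    for c, counts in word_counts.items():
        # Add-one smoothing: every vocabulary word gets one extra pseudo-count.
        total = sum(counts.values()) + len(vocab)
        log_likelihood[c] = {w: log((counts[w] + 1) / total) for w in vocab}
    return vocab, log_prior, log_likelihood

# Toy data: label 1 = spam, 0 = ham.
docs = [["cheap", "pills"], ["cheap", "offer"], ["team", "meeting"]]
labels = [1, 1, 0]
vocab, log_prior, log_likelihood = train_multinomial_nb(docs, labels)
print(vocab)
print(log_prior)
```

With this estimate the smoothed probabilities of all vocabulary words sum to 1 within each class, which is not the case for the per-word ratios used in `getCountList`.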

Finally, we perform the Bayesian classification:

from numpy import log

def bayesClassify(testMat, testTypeArray, totalCntList, p0CntList, p1CntList, p0, p1, wordList):
    docIndex = 0
    errorCnt = 0
    for testVec in testMat:
        sum0 = 0.0   # log score for ham
        sum1 = 0.0   # log score for spam
        for word in testVec:
            # Words never seen during training carry no information.
            if word not in wordList:
                continue
            wordPos = wordList.index(word)
            sum0 += log(float(p0CntList[wordPos] / totalCntList[wordPos]))
            sum1 += log(float(p1CntList[wordPos] / totalCntList[wordPos]))
        sum0 += log(p0)
        sum1 += log(p1)
        decType = 0 if sum0 > sum1 else 1
        if decType != testTypeArray[docIndex]:
            errorCnt += 1
            print(sum0, sum1)   # show the scores of misclassified emails
        docIndex += 1
    return errorCnt
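
The classifier adds logarithms instead of multiplying probabilities. A short sketch with made-up numbers shows why: a product of many small probabilities underflows to zero in floating point, while the sum of their logarithms stays finite and still preserves the ordering between classes.

```python
from math import log

# 100 hypothetical word probabilities, each equal to 1e-5.
probs = [1e-5] * 100

product = 1.0
for p in probs:
    product *= p          # underflows: the true value 1e-500 is not representable

log_sum = sum(log(p) for p in probs)

print(product)   # 0.0
print(log_sum)   # roughly -1151.29
```

Because the logarithm is monotonically increasing, comparing the two log sums gives the same decision as comparing the two products would, if the products could be represented.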

The complete code is as follows:

import random
import re

from numpy import log, ones


def file2array(filename):
    with open(filename, 'r', errors='ignore') as f:  # skip undecodable bytes
        text = f.read()
    # '\W+' splits on runs of non-word characters; '\W*' would split between every character on Python 3.7+.
    listOfWord = re.split(r'\W+', text)
    # Keep lowercase words longer than 3 characters.
    fileArray = [word.lower() for word in listOfWord if len(word) > 3]
    return fileArray


def getAllInfo():
    allTextMat = []      # one word list per email
    allTypeArray = []    # label per email: 1 = spam, 0 = ham
    testMat = []
    testTypeArray = []
    for i in range(1, 26):
        allTextMat.append(file2array('email/spam/%d.txt' % i))
        allTypeArray.append(1)
        allTextMat.append(file2array('email/ham/%d.txt' % i))
        allTypeArray.append(0)
    # Move 5 randomly chosen emails from the training set into the test set.
    for i in range(5):
        randIndex = random.randrange(len(allTextMat))
        testMat.append(allTextMat[randIndex])
        testTypeArray.append(allTypeArray[randIndex])
        del allTextMat[randIndex]
        del allTypeArray[randIndex]
    return allTextMat, allTypeArray, testMat, testTypeArray


def getWordList(allTextMat):
    # Vocabulary: every distinct word in the training emails.
    wordSet = set()
    for textVec in allTextMat:
        wordSet |= set(textVec)
    return list(wordSet)


def getCountList(wordList, allTextMat, allTypeArray):
    wordListLen = len(wordList)
    # Counts start at 1 (per class) and 2 (total) so that no ratio is ever 0.
    totalCntList = ones(wordListLen)
    totalCntList *= 2
    p0CntList = ones(wordListLen)
    p1CntList = ones(wordListLen)
    order = 0   # index of the current email
    p0Cnt = 0   # number of words seen in ham
    p1Cnt = 0   # number of words seen in spam
    for textVec in allTextMat:
        for word in textVec:
            wordPos = wordList.index(word)
            totalCntList[wordPos] += 1
            if allTypeArray[order] == 1:
                p1CntList[wordPos] += 1
                p1Cnt += 1
            elif allTypeArray[order] == 0:
                p0CntList[wordPos] += 1
                p0Cnt += 1
        order += 1
    # Class priors, estimated from the share of words that fall in each class.
    p0 = float(p0Cnt) / (p0Cnt + p1Cnt)
    p1 = 1 - p0
    return totalCntList, p0CntList, p1CntList, p0, p1


def bayesClassify(testMat, testTypeArray, totalCntList, p0CntList, p1CntList, p0, p1, wordList):
    docIndex = 0
    errorCnt = 0
    for testVec in testMat:
        sum0 = 0.0   # log score for ham
        sum1 = 0.0   # log score for spam
        for word in testVec:
            # Words never seen during training carry no information.
            if word not in wordList:
                continue
            wordPos = wordList.index(word)
            sum0 += log(float(p0CntList[wordPos] / totalCntList[wordPos]))
            sum1 += log(float(p1CntList[wordPos] / totalCntList[wordPos]))
        sum0 += log(p0)
        sum1 += log(p1)
        decType = 0 if sum0 > sum1 else 1
        if decType != testTypeArray[docIndex]:
            errorCnt += 1
            print(sum0, sum1)   # show the scores of misclassified emails
        docIndex += 1
    return errorCnt


def main():
    allTextMat, allTypeArray, testMat, testTypeArray = getAllInfo()
    wordList = getWordList(allTextMat)
    totalCntList, p0CntList, p1CntList, p0, p1 = getCountList(wordList, allTextMat, allTypeArray)
    # Print the number of misclassified test emails.
    print(bayesClassify(testMat, testTypeArray, totalCntList, p0CntList, p1CntList, p0, p1, wordList))


if __name__ == '__main__':
    main()

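The email data I used comes from Chapter 4 of Machine Learning in Action. Interested readers can download the dataset from the book's official site, http://www.manning.com/pharrington/.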

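That concludes this basic introduction to the naive Bayes algorithm. As this is the first post in the series, some of my explanations may be imprecise, and corrections in the comments are welcome.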
