Email Classification Exercise: Naive Bayes

Approach

  • Data import
  • Data cleaning
  • Building the training and test sets
  • Word cloud
  • Dimensionality reduction
  • Model training
  • Model testing
  • Model improvement

1. Data Import

The files live under C:\Users\kelanj\Documents\data\spam\… and C:\Users\kelanj\Documents\data\ham\…
# Data import
# set the working directory and build the paths
setwd("C:\\Users\\kelanj\\Documents")
spam.path <- file.path("data", "spam")
ham.path <- file.path("data", "ham")
# get the vector of file names in each directory
spam.docs <- dir(spam.path)
ham.docs <- dir(ham.path)
spam.docs[1:4]  # look at the first four files
## [1] "00001.317e78fa8ee2f54cd4890fdc09ba8176"
## [2] "00001.7848dde101aa985090474a91ec93fcf0"
## [3] "00002.9438920e9a55591b18e60d1ed37d992b"
## [4] "00002.d94f1b97e48ed3b553b3508d116e6a09"
ham.docs[1:4]
## [1] "00001.1a31cc283af0060967a233d26548a6ce"
## [2] "00001.7c53336b37003a9286aba55d2945844c"
## [3] "00002.5a587ae61666c5aa097c8e866aedcc59"
## [4] "00002.9c4069e25e1ef370c078db7ee85ff9ac"
# getContent() reads one email file and returns its body as a single string.
# Note: by convention an email consists of a header and a body,
# separated by the first blank line.
getContent <- function(path) {
  conn <- file(path, open = "rt")  # do not set an encoding here, or reading may fail
  line <- readLines(conn, warn = FALSE)
  # keep everything after the first blank line (the body)
  content <- tryCatch(line[seq(which(line == "")[1] + 1, length(line), 1)],
                      error = function(e) e)
  close(conn)
  content <- paste(content, collapse = "\n")
  return(content)
}
# read the contents of the 1897 spam files and the 3900 ham files
spamContent <- sapply(spam.docs, function(path) getContent(file.path(spam.path, path)))
hamContent <- sapply(ham.docs, function(path) getContent(file.path(ham.path, path)))
s.h.content <- c(spamContent, hamContent)  # combine the email contents

2. Data Processing

First load the packages NLP, tm, SnowballC, and slam.

1. Define custom functions to strip HTML and URLs, plus a custom stop-word list;

2. Define a text-cleaning function that returns a corpus.

library(NLP)
library(tm)
library(SnowballC)  # stemming
library(slam)       # efficient operations on sparse matrices
# replace every URL with the token "http"
myremoveURL <- function(x) {
  gsub(pattern = "(https?|ftp|file):\\/\\/[-A-Za-z0-9+&@#\\/%?=~_|!:,\\.;]+[-A-Za-z0-9+&@#\\/%=~_|]",
       "http", x)
}
# strip HTML tags
myremoveHTML <- function(x) {
  gsub(pattern = "<[^>]+>", "", x)
}
# custom English stop-word list
myenstopwords <- function() {
  c(stopwords(), "will", "also")
}
# text-cleaning pipeline: returns a cleaned corpus
cleanContent1 <- function(content) {
  contentCorpus <- Corpus(VectorSource(content))
  contentCorpus <- tm_map(contentCorpus, PlainTextDocument)
  contentCorpus <- tm_map(contentCorpus, myremoveURL)
  contentCorpus <- tm_map(contentCorpus, myremoveHTML)
  contentCorpus <- tm_map(contentCorpus, tolower)
  contentCorpus <- tm_map(contentCorpus, removeNumbers)
  contentCorpus <- tm_map(contentCorpus, removeWords, myenstopwords())
  contentCorpus <- tm_map(contentCorpus, removePunctuation)
  contentCorpus <- tm_map(contentCorpus, stripWhitespace)
  return(contentCorpus)
}

Prevent errors caused by encoding issues:

Sys.setlocale(category = "LC_ALL", locale = "us")
## [1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
# run the text cleaning
s.h.corpus <- cleanContent1(s.h.content)
inspect(s.h.corpus[1:3])
## <<SimpleCorpus>>
## Metadata:  corpus specific: 1, document level (indexed): 0
## Content:  documents: 3
##
## [1] greetings receiving letter expressed interest receiving information online business opportunities erroneous please accept sincere apology onetime mailing removal necessary burned betrayed backstabbed multilevel marketing mlm please read letter important one ever landed inbox multilevel marketing huge mistake people mlm failed deliver promises past years pursuit mlm dream cost hundreds thousands people friends fortunes sacred honor fact mlm fatally flawed meaning work people companies earn big money mlm going tell real story finally someone courage cut hype lies tell truth mlm good news alternative mlm works works big yet abandoned dreams need see earning kind income dreamed easier think permission like send brief letter tell mlm work people introduce something new refreshing wonder heard promise unwanted follow sales pitch one call email address used send information period receive free lifechanging information simply click reply type send info subject box hit send get information within hours just look words mlm wall shame inbox cordially siddhi ps someone recently sent letter eyeopening financially beneficial information ever received honestly believe feel way read free email never sent unsolicited spam receiving email explicitly signed list online signup form use ffa links page emaildom systems explicit terms use state use agree receive emailings may member altra computer systems list one many numerous free marketing services agreed signed list receiving emailing due email message considered unsolicitated spam irish linux users group iluglinuxie http unsubscription information list maintainer listmasterlinuxie
## [2]  save life insurance spend life quote savings ensurin g familys financial security important life quote savings ma kes buying life insurance simple affordable provide free access best companies lowest rates life quote savings fast eas y saves money let us help get started best val ues country new coverage can save hundreds even tho usands dollars requesting free quote lifequote savings service take less minutes complete shop d compare save types life insurance click free quote protecting family best investment eve r make receipt email error wish removed list please click type remove reside state prohibits email solicitations insuran ce please disregard email
## [3]  need safety real might get one chance ready free shipping handling within usa order may day super sale now may save items getting spring protect walk jog exercise outside protect loved ones return home college legal protection college students great coming outdoor protection gifts nothing worth protecting life stun devices pepper products legal protection join war crime stun guns batons effective safe nonlethal protect loved ones matter matter city town live live america touched crime hear tv read newspaper secret crime major problem us today criminals finding easier commit crimes time weapons readily available cities police forces work can handle even criminal caught spending long nations overcrowded jails lawmakers well aware crime problem seem effective answers email address merchantsallaolcom interested protecting within days wait visit web page join war crime http well effective answer take responsibility security site variety quality personal security products visit site choose personal security products right use join war crime free pepper spray stun unit purchase value ship orders within days every state us ups fedex us postal service visa mastercard american express debt card gladly accepted ask question help loved ones interested http the stun monster volts the zforce slim style volts the stunmaster volts straight the stunmaster volts curb the stunmaster volts straight the stunmaster volts curb the stunbaton volts the stunbaton volts pen knife one two wildfire pepper spray one two add shipping handling charge order postal mail please send address make payable mega safety technology mega safety technology merrimac ave dayton ohio email address merchantsallaolcom order hour fax important credit card information please read credit card address city state zip code must match billing address processed check moneyorder visa mastercard americanexpress debt card name appears check credit card address appears check credit card citystatezip appears check 
credit card country credit card number expiration month year authorized signature important note shipping address different billing address please fill information shipping name shipping address shipping citystatezip country email address phone numberplease write neat
s.h.dtm <- DocumentTermMatrix(s.h.corpus)
s.h.dtm  # 5797 rows (documents), 71600 columns (terms); the first 1897 rows are spam, the remaining 3900 are ham
## <<DocumentTermMatrix (documents: 5797, terms: 71600)>>
## Non-/sparse entries: 553817/414511383
## Sparsity           : 100%
## Maximal term length: 868
## Weighting          : term frequency (tf)

3. Building the Training and Test Sets

Note: roughly 75% of the data is used for training and 25% for testing.

# split the corpus
s.h.corpus.train <- s.h.corpus[c(1:1423, 1897:4822)]  # 4349 documents
s.h.corpus.test <- s.h.corpus[c(1424:1896, 4823:5797)]  # 1448 documents
# split the DTM the same way
s.h.dtm.train <- s.h.dtm[c(1:1423, 1897:4822), ]
s.h.dtm.test <- s.h.dtm[c(1424:1896, 4823:5797), ]

4. Word Cloud

Load the package wordcloud2.
Note: use the spam training set, i.e. the first 1423 rows of s.h.dtm.train.

# convert to an ordinary matrix to compute term frequencies
s.dtm.train <- as.matrix(s.h.dtm.train[1:1423, ])
s.sum <- col_sums(s.dtm.train)
s.term <- names(s.sum)
s.freq <- as.numeric(s.sum)
# build a data frame (keep s.freq numeric; cbind() would coerce everything to character)
s.frame <- data.frame(s.term = s.term, s.freq = s.freq, row.names = NULL, stringsAsFactors = FALSE)
head(s.frame)
head(s.frame)
##        s.term s.freq
## 1   abandoned     43
## 2      accept      8
## 3     address    330
## 4       agree     88
## 5      agreed    164
## 6 alternative    269


library(wordcloud2)
#wordcloud2(s.frame)

[Figure: word cloud of the most frequent spam terms]

5. Dimensionality Reduction

Keep only terms that appear at least 100 times.

# like tm::findFreqTerms(), but returns the filtered matrix itself
myfindFreqTerms <- function(x, lowfreq = 0, highfreq = Inf) {
  stopifnot(inherits(x, c("DocumentTermMatrix", "TermDocumentMatrix", "simple_triplet_matrix")),
            is.numeric(lowfreq), is.numeric(highfreq))
  if (inherits(x, "DocumentTermMatrix")) x <- t(x)
  rs <- slam::row_sums(x)
  y <- which(rs >= lowfreq & rs <= highfreq)
  return(x[y, ])
}
s.h.dict <- Terms(myfindFreqTerms(s.h.dtm.train, 100))
length(s.h.dict)  # 1151 terms remain
## [1] 1151

Rebuild the document-term matrices using only the selected vocabulary.

s.h.train <- DocumentTermMatrix(s.h.corpus.train, list(dictionary = s.h.dict))
s.h.train  # 4349 rows, 1151 columns
## <<DocumentTermMatrix (documents: 4349, terms: 1151)>>
## Non-/sparse entries: 224076/4781623
## Sparsity           : 96%
## Maximal term length: 35
## Weighting          : term frequency (tf)
s.h.test <- DocumentTermMatrix(s.h.corpus.test, list(dictionary = s.h.dict))
s.h.test  # 1448 rows, 1151 columns
## <<DocumentTermMatrix (documents: 1448, terms: 1151)>>
## Non-/sparse entries: 52503/1614145
## Sparsity           : 97%
## Maximal term length: 35
## Weighting          : term frequency (tf)

6. Training the Model

Load the package e1071.

# convert the counts in the DTMs to No/Yes presence factors
convert_counts <- function(x) {
  x <- ifelse(x > 0, 1, 0)
  x <- factor(x, levels = c(0, 1), labels = c("No", "Yes"))
  return(x)
}
s_h_train <- apply(s.h.train, MARGIN = 2, convert_counts)
s_h_test <- apply(s.h.test, MARGIN = 2, convert_counts)

Start training.

library(e1071)
s_h_train_type <- c(rep("spam", 1423), rep("ham", 2926))
s_h_test_type <- c(rep("spam", 473), rep("ham", 975))
s_h_train_type <- as.data.frame(s_h_train_type)
model_s_h <- naiveBayes(s_h_train, s_h_train_type$s_h_train_type, laplace = 1)
s_h_prediction <- predict(model_s_h, s_h_test, type = "class")

7. Testing the Model

Load the package gmodels.

library(gmodels)
CrossTable(s_h_prediction,s_h_test_type,prop.chisq=TRUE,prop.t=FALSE,dnn=c("predicted","actual"))
##
##
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |-------------------------|
##
##
## Total Observations in Table:  1448
##
##
##              | actual
##    predicted |       ham |      spam | Row Total |
## -------------|-----------|-----------|-----------|
##          ham |       967 |        69 |      1036 |
##              |   104.053 |   214.486 |           |
##              |     0.933 |     0.067 |     0.715 |
##              |     0.992 |     0.146 |           |
## -------------|-----------|-----------|-----------|
##         spam |         8 |       404 |       412 |
##              |   261.648 |   539.337 |           |
##              |     0.019 |     0.981 |     0.285 |
##              |     0.008 |     0.854 |           |
## -------------|-----------|-----------|-----------|
## Column Total |       975 |       473 |      1448 |
##              |     0.673 |     0.327 |           |
## -------------|-----------|-----------|-----------|
##
## 

Analysis: the table shows that 8 of the 975 ham emails were wrongly classified as spam (0.8%), while 69 of the 473 spam emails were wrongly classified as ham (14.6%).
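The same rates can be recomputed directly from the table (a minimal sketch; the counts below are copied from the CrossTable output above):

```r
# error rates from the confusion matrix above
ham_total   <- 975   # actual ham emails in the test set
spam_total  <- 473   # actual spam emails in the test set
ham_as_spam <- 8     # ham misclassified as spam
spam_as_ham <- 69    # spam misclassified as ham

ham_error  <- ham_as_spam / ham_total    # ~0.008
spam_error <- spam_as_ham / spam_total   # ~0.146
accuracy   <- 1 - (ham_as_spam + spam_as_ham) / (ham_total + spam_total)  # ~0.947
```

So despite the strong overall accuracy, most of the remaining error is spam slipping through as ham.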

8. Improving the Model

Tune the laplace value.

model_s_h<-naiveBayes(s_h_train,s_h_train_type$s_h_train_type,laplace=0.001)
s_h_prediction<-predict(model_s_h,s_h_test,type = "class")
CrossTable(s_h_prediction,s_h_test_type,prop.chisq=TRUE,prop.t=FALSE,dnn=c("predicted","actual"))
##
##
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |-------------------------|
##
##
## Total Observations in Table:  1448
##
##
##              | actual
##    predicted |       ham |      spam | Row Total |
## -------------|-----------|-----------|-----------|
##          ham |       969 |        66 |      1035 |
##              |   106.231 |   218.975 |           |
##              |     0.936 |     0.064 |     0.715 |
##              |     0.994 |     0.140 |           |
## -------------|-----------|-----------|-----------|
##         spam |         6 |       407 |       413 |
##              |   266.220 |   548.762 |           |
##              |     0.015 |     0.985 |     0.285 |
##              |     0.006 |     0.860 |           |
## -------------|-----------|-----------|-----------|
## Column Total |       975 |       473 |      1448 |
##              |     0.673 |     0.327 |           |
## -------------|-----------|-----------|-----------|
##
## 

Analysis:
Lowering the Laplace smoothing value to 0.001 improves the model: the share of ham misclassified as spam drops to 0.6%, and the share of spam misclassified as ham drops to 14.0%.
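To see why such a small laplace value changes the predictions, recall that for a two-level factor (No/Yes) Laplace smoothing estimates a conditional probability as (count + laplace) / (N + 2·laplace). The sketch below uses a hypothetical helper and illustrative counts, not values extracted from this data set:

```r
# smoothed estimate of P(term = Yes | class) for a two-level factor
# (hypothetical helper illustrating Laplace smoothing, not the e1071 internals)
cond_prob <- function(n_yes, n_docs, laplace) {
  (n_yes + laplace) / (n_docs + 2 * laplace)
}

# a term that never appears among the 1423 spam training documents:
p1 <- cond_prob(0, 1423, laplace = 1)      # ~7.0e-4
p2 <- cond_prob(0, 1423, laplace = 0.001)  # ~7.0e-7
# the smaller laplace makes "never seen in this class" far stronger evidence
```

With laplace = 1 a never-seen term still gets a non-negligible probability in every class; at 0.001 the model trusts the raw class frequencies much more, which here sharpens the decision boundary slightly in the right direction.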


Done!
