Naive Bayes (NaiveBayes) for Chinese Text Classification on Small Datasets

Reposted from Xiangguo's blog:

http://blog.csdn.net/github_36326955/article/details/54891204

Kept here as notes.

Run the scripts in the order 1, 2, 3, 4:

1.py(corpus_segment.py)

#!/usr/bin/env python
# -*- coding: UTF-8 -*-
"""
@version: python2.7.8
@author: XiangguoSun
@contact: sunxiangguodut@qq.com
@file: corpus_segment.py
@time: 2017/2/5 15:28
@software: PyCharm
"""
import sys
import os
import jieba

# Set the default encoding to UTF-8 (Python 2 only)
reload(sys)
sys.setdefaultencoding('utf-8')


# Save content to a file
def savefile(savepath, content):
    with open(savepath, "wb") as fp:
        fp.write(content)
    # The `with` statement used above was added in Python 2.6 and replaces
    # the tedious close() and try boilerplate. Python 2.5 needs
    # `from __future__ import with_statement`.
    # Beginners can learn more here: http://zhoutall.com/archives/325


# Read a file
def readfile(path):
    with open(path, "rb") as fp:
        content = fp.read()
    return content


def corpus_segment(corpus_path, seg_path):
    '''
    corpus_path: path of the unsegmented corpus
    seg_path: path where the segmented corpus is stored
    '''
    catelist = os.listdir(corpus_path)  # all subdirectories under corpus_path
    # Each subdirectory name is a category name. For example, in
    # train_corpus/art/21.txt, 'train_corpus/' is corpus_path and 'art' is
    # one member of catelist.

    # Process all files under each directory (category)
    for mydir in catelist:
        # Here mydir is the 'art' in train_corpus/art/21.txt
        class_path = corpus_path + mydir + "/"  # category subdirectory, e.g. train_corpus/art/
        seg_dir = seg_path + mydir + "/"  # matching output directory, e.g. train_corpus_seg/art/

        if not os.path.exists(seg_dir):  # create the output directory if it does not exist
            os.makedirs(seg_dir)

        file_list = os.listdir(class_path)  # all texts of this category in the unsegmented corpus
        # e.g. train_corpus/art/ contains 21.txt, 22.txt, 23.txt, ...
        # so file_list = ['21.txt', '22.txt', ...]
        for file_path in file_list:  # iterate over the files of this category
            fullname = class_path + file_path  # full path, e.g. train_corpus/art/21.txt
            content = readfile(fullname)  # read the file content
            # At this point content holds every character of the original text,
            # including extra spaces, blank lines and line breaks. Strip them so
            # that only compact text separated by punctuation remains.
            content = content.replace("\r\n", "")  # remove line breaks (Windows-style \r\n only)
            content = content.replace(" ", "")  # remove spaces
            content_seg = jieba.cut(content)  # segment the text into words
            savefile(seg_dir + file_path, " ".join(content_seg))  # save to the segmented corpus directory

    print "Chinese corpus segmentation finished!"


if __name__ == "__main__":
    # Segment the training set
    corpus_path = "/home/appleyuchi/PycharmProjects/MultiNB/csdn_blog/54891204_tenwhy/chinese_text_classification-master/train/"  # unsegmented corpus path
    seg_path = "/home/appleyuchi/PycharmProjects/MultiNB/csdn_blog/54891204_tenwhy/chinese_text_classification-master/train_corpus_seg/"  # segmented corpus path, output of this script
    corpus_segment(corpus_path, seg_path)

    # Segment the test set
    corpus_path = "/home/appleyuchi/PycharmProjects/MultiNB/csdn_blog/54891204_tenwhy/chinese_text_classification-master/answer/"  # unsegmented corpus path
    seg_path = "/home/appleyuchi/PycharmProjects/MultiNB/csdn_blog/54891204_tenwhy/chinese_text_classification-master/test_corpus_seg/"  # segmented corpus path, output of this script
    corpus_segment(corpus_path, seg_path)
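
The script above assumes a corpus laid out as corpus_path/category/file, with the folder name serving as the class label. As a rough illustration of that layout only (not the author's code), here is a small Python 3 sketch that builds a throwaway corpus and walks it. The helper name iter_corpus and the demo file names are made up for this example, and os.path.join is used in place of the string concatenation in the script.

```python
import os
import shutil
import tempfile


def iter_corpus(corpus_path):
    """Yield (category, file path) for every file one level below corpus_path."""
    for category in sorted(os.listdir(corpus_path)):
        class_path = os.path.join(corpus_path, category)
        if not os.path.isdir(class_path):
            continue  # skip stray files that are not category folders
        for name in sorted(os.listdir(class_path)):
            yield category, os.path.join(class_path, name)


# Build a tiny demo corpus in a temporary directory (hypothetical file names)
demo_root = tempfile.mkdtemp()
for category, name in [("art", "21.txt"), ("art", "22.txt"), ("sports", "1.txt")]:
    os.makedirs(os.path.join(demo_root, category), exist_ok=True)
    with open(os.path.join(demo_root, category, name), "w", encoding="utf-8") as fp:
        fp.write("demo")

# Each file is paired with the name of the folder that contains it
pairs = [(cat, os.path.basename(path)) for cat, path in iter_corpus(demo_root)]
print(pairs)

shutil.rmtree(demo_root)  # clean up the demo corpus
```

Sorting the directory listings is an extra choice made here so that the order is reproducible; os.listdir itself gives no ordering guarantee.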

2.py(corpus2Bunch.py)

#!/usr/bin/env python
# -*- coding: UTF-8 -*-
"""
@version: python2.7.8
@author: XiangguoSun
@contact: sunxiangguodut@qq.com
@file: corpus2Bunch.py
@time: 2017/2/7 7:41
@software: PyCharm
"""
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
import os  # built-in module for file and directory operations; os.listdir is used below
import cPickle as pickle  # import cPickle under the alias pickle
# Python also has a separate module that is itself called pickle; the name
# clash does not matter here.
# For cPickle versus pickle, see the original author's other post
# (on the pickle and cPickle core modules):
# http://blog.csdn.net/github_36326955/article/details/54882506
# The code below uses cPickle.dump.
from sklearn.datasets.base import Bunch
# You do not need to know much about this; just remember that this is how the
# Bunch data structure is imported.
# (In newer scikit-learn releases Bunch is imported from sklearn.utils instead.)
# Later posts will cover sklearn in more detail.


def _readfile(path):
    '''Read a file'''
    # The leading underscore marks a private function. It is only a naming
    # convention: outside code can still call it, but it makes the program
    # easier to read.
    with open(path, "rb") as fp:  # the with-as syntax was explained earlier
        content = fp.read()
    return content


def corpus2Bunch(wordbag_path, seg_path):
    catelist = os.listdir(seg_path)  # all subdirectories under seg_path, i.e. the categories
    # Create a Bunch instance
    bunch = Bunch(target_name=[], label=[], filenames=[], contents=[])
    bunch.target_name.extend(catelist)
    # extend(addlist) is a list method that appends every item of addlist
    # to the original list.

    # Collect all files under each directory
    for mydir in catelist:
        class_path = seg_path + mydir + "/"  # path of the category subdirectory
        file_list = os.listdir(class_path)  # all files under class_path
        for file_path in file_list:  # iterate over the files of this category
            fullname = class_path + file_path  # full path of the file
            bunch.label.append(mydir)
            bunch.filenames.append(fullname)
            bunch.contents.append(_readfile(fullname))  # read the file content
            # append(element) is a list method that adds a single element to
            # the list; note the difference from extend().

    # Store the bunch at wordbag_path
    with open(wordbag_path, "wb") as file_obj:
        pickle.dump(bunch, file_obj)
    print "Finished building the text object!"


if __name__ == "__main__":  # this idiom was explained earlier
    # Convert the training set to a Bunch:
    wordbag_path = "/home/appleyuchi/PycharmProjects/MultiNB/csdn_blog/54891204_tenwhy/chinese_text_classification-master/train_word_bag/train_set.dat"  # Bunch storage path, program output
    seg_path = "/home/appleyuchi/PycharmProjects/MultiNB/csdn_blog/54891204_tenwhy/chinese_text_classification-master/train_corpus_seg/"  # segmented corpus path, program input
    corpus2Bunch(wordbag_path, seg_path)

    # Convert the test set to a Bunch:
    wordbag_path = "/home/appleyuchi/PycharmProjects/MultiNB/csdn_blog/54891204_tenwhy/chinese_text_classification-master/test_word_bag/test_set.dat"  # Bunch storage path, program output
    seg_path = "/home/appleyuchi/PycharmProjects/MultiNB/csdn_blog/54891204_tenwhy/chinese_text_classification-master/test_corpus_seg/"  # segmented corpus path, program input
    corpus2Bunch(wordbag_path, seg_path)
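
The idea in this step is to gather labels, file names and contents into parallel lists and then pickle the whole container. The sketch below shows that pattern in Python 3 with the standard library only. A plain dict stands in for sklearn's Bunch, and the function name build_corpus_record and the sample texts are assumptions made for illustration.

```python
import pickle


def build_corpus_record(samples):
    """Collect (category, filename, text) triples into parallel lists."""
    record = {"target_name": [], "label": [], "filenames": [], "contents": []}
    # extend() adds every category name at once
    record["target_name"].extend(sorted({cat for cat, _, _ in samples}))
    for category, filename, text in samples:
        # append() adds one element per sample, keeping the three lists aligned
        record["label"].append(category)
        record["filenames"].append(filename)
        record["contents"].append(text)
    return record


# Hypothetical pre-segmented samples (words separated by spaces)
samples = [
    ("art", "art/21.txt", "画展 油画"),
    ("sports", "sports/1.txt", "足球 比赛"),
]
record = build_corpus_record(samples)

# Serialize to bytes and load back, as pickle.dump / pickle.load do with files
payload = pickle.dumps(record)
restored = pickle.loads(payload)
print(restored["label"], restored["filenames"])
```

Index i of label, filenames and contents always refers to the same document, which is what the later scripts rely on when they pair predictions with file names.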

3.py(TFIDF_space.py)

#!/usr/bin/env python
# -*- coding: UTF-8 -*-
"""
@version: python2.7.8
@author: XiangguoSun
@contact: sunxiangguodut@qq.com
@file: TFIDF_space.py
@time: 2017/2/8 11:39
@software: PyCharm
"""
import sys
reload(sys)
sys.setdefaultencoding('utf-8')

# In newer scikit-learn releases Bunch is imported from sklearn.utils instead.
from sklearn.datasets.base import Bunch
import cPickle as pickle
from sklearn.feature_extraction.text import TfidfVectorizer


def _readfile(path):
    with open(path, "rb") as fp:
        content = fp.read()
    return content


def _readbunchobj(path):
    with open(path, "rb") as file_obj:
        bunch = pickle.load(file_obj)
    return bunch


def _writebunchobj(path, bunchobj):
    with open(path, "wb") as file_obj:
        pickle.dump(bunchobj, file_obj)


def vector_space(stopword_path, bunch_path, space_path, train_tfidf_path=None):
    stpwrdlst = _readfile(stopword_path).splitlines()  # one stop word per line
    bunch = _readbunchobj(bunch_path)
    tfidfspace = Bunch(target_name=bunch.target_name, label=bunch.label, filenames=bunch.filenames, tdm=[], vocabulary={})

    if train_tfidf_path is not None:
        # Test set: reuse the training vocabulary so both matrices share the same columns.
        # Note that fit_transform still recomputes the IDF weights on this set.
        trainbunch = _readbunchobj(train_tfidf_path)
        tfidfspace.vocabulary = trainbunch.vocabulary
        vectorizer = TfidfVectorizer(stop_words=stpwrdlst, sublinear_tf=True, max_df=0.5, vocabulary=trainbunch.vocabulary)
        tfidfspace.tdm = vectorizer.fit_transform(bunch.contents)
    else:
        # Training set: learn the vocabulary from the data and keep it
        vectorizer = TfidfVectorizer(stop_words=stpwrdlst, sublinear_tf=True, max_df=0.5)
        tfidfspace.tdm = vectorizer.fit_transform(bunch.contents)
        tfidfspace.vocabulary = vectorizer.vocabulary_

    _writebunchobj(space_path, tfidfspace)
    print "TF-IDF vector space instance created successfully!"


if __name__ == '__main__':
    # stopword_path = "/home/appleyuchi/PycharmProjects/MultiNB/csdn_blog/54891204/chinese_text_classification-master/train_word_bag/hlt_stop_words.txt"  # input file
    # bunch_path = "train_word_bag/train_set.dat"  # input file
    # space_path = "train_word_bag/tfdifspace.dat"  # output file
    # vector_space(stopword_path,bunch_path,space_path)
    #
    # bunch_path = "test_word_bag/test_set.dat"  # input file
    # space_path = "test_word_bag/testspace.dat"
    # train_tfidf_path="train_word_bag/tfdifspace.dat"
    # vector_space(stopword_path,bunch_path,space_path,train_tfidf_path)

    stopword_path = "/home/appleyuchi/PycharmProjects/MultiNB/csdn_blog/54891204_tenwhy/chinese_text_classification-master/train_word_bag/hlt_stop_words.txt"  # input file
    train_bunch_path = "/home/appleyuchi/PycharmProjects/MultiNB/csdn_blog/54891204_tenwhy/chinese_text_classification-master/train_word_bag/train_set.dat"  # input file
    space_path = "/home/appleyuchi/PycharmProjects/MultiNB/csdn_blog/54891204_tenwhy/chinese_text_classification-master/train_word_bag/tfidfspace.dat"  # output file
    vector_space(stopword_path, train_bunch_path, space_path)

    train_tfidf_path = "/home/appleyuchi/PycharmProjects/MultiNB/csdn_blog/54891204_tenwhy/chinese_text_classification-master/train_word_bag/tfidfspace.dat"  # input file, generated above
    test_bunch_path = "/home/appleyuchi/PycharmProjects/MultiNB/csdn_blog/54891204_tenwhy/chinese_text_classification-master/test_word_bag/test_set.dat"  # input file
    test_space_path = "/home/appleyuchi/PycharmProjects/MultiNB/csdn_blog/54891204_tenwhy/chinese_text_classification-master/test_word_bag/testspace.dat"  # output file
    vector_space(stopword_path, test_bunch_path, test_space_path, train_tfidf_path)

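The key point of TFIDF_space.py is that the test set is vectorized with the vocabulary learned from the training set, so both matrices have the same columns and words never seen in training are dropped. The following Python 3 sketch illustrates this with the standard library only. It is a simplified stand-in, not TfidfVectorizer itself: stop words and max_df are left out, the helper names and sample documents are invented, and the weighting (sublinear TF, smoothed IDF, L2 normalization) follows my understanding of the scikit-learn defaults combined with sublinear_tf=True.

```python
import math
from collections import Counter


def build_vocabulary(docs):
    """Map each distinct token to a column index (tokens are space-separated)."""
    tokens = sorted({tok for doc in docs for tok in doc.split()})
    return {tok: i for i, tok in enumerate(tokens)}


def tfidf_rows(docs, vocabulary):
    """TF-IDF rows restricted to `vocabulary`; IDF is computed on `docs` itself."""
    n_docs = len(docs)
    # Tokens outside the vocabulary are dropped here
    counts = [Counter(t for t in doc.split() if t in vocabulary) for doc in docs]
    df = Counter(tok for c in counts for tok in c)  # document frequency per token
    rows = []
    for c in counts:
        row = [0.0] * len(vocabulary)
        for tok, tf in c.items():
            idf = math.log((1 + n_docs) / (1 + df[tok])) + 1  # smoothed IDF
            row[vocabulary[tok]] = (1 + math.log(tf)) * idf   # sublinear TF
        norm = math.sqrt(sum(v * v for v in row)) or 1.0      # L2 normalization
        rows.append([v / norm for v in row])
    return rows


# Hypothetical pre-segmented documents
train_docs = ["足球 比赛 进球", "画展 艺术 油画"]
test_docs = ["足球 进球 冠军"]  # "冠军" never appears in the training set

vocab = build_vocabulary(train_docs)
train_tdm = tfidf_rows(train_docs, vocab)
test_tdm = tfidf_rows(test_docs, vocab)  # same columns as train_tdm

print(len(vocab), len(train_tdm[0]), len(test_tdm[0]))
```

Because the column layout is shared, a classifier trained on train_tdm can be applied to test_tdm directly. As in the script, the IDF here is recomputed on whichever set is being vectorized; reusing the training IDF would be a different design.
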
4.py(NBayes_Predict.py)

#!/usr/bin/env python
# -*- coding: UTF-8 -*-
"""
@version: python2.7.8
@author: XiangguoSun
@contact: sunxiangguodut@qq.com
@file: NBayes_Predict.py
@time: 2017/2/8 12:21
@software: PyCharm
"""
import sys
reload(sys)
sys.setdefaultencoding('utf-8')

import cPickle as pickle
from sklearn.naive_bayes import MultinomialNB  # multinomial naive Bayes


# Load a Bunch object
def _readbunchobj(path):
    with open(path, "rb") as file_obj:
        bunch = pickle.load(file_obj)
    return bunch


# Load the training set
trainpath = "/home/appleyuchi/PycharmProjects/MultiNB/csdn_blog/54891204_tenwhy/chinese_text_classification-master/train_word_bag/tfidfspace.dat"
train_set = _readbunchobj(trainpath)

# Load the test set
testpath = "/home/appleyuchi/PycharmProjects/MultiNB/csdn_blog/54891204_tenwhy/chinese_text_classification-master/test_word_bag/testspace.dat"
test_set = _readbunchobj(testpath)

# Train the classifier on the TF-IDF matrix and the category labels.
# alpha is the additive (Laplace/Lidstone) smoothing parameter, 0.01 here;
# smaller values apply less smoothing. No iteration is involved.
clf = MultinomialNB(alpha=0.01).fit(train_set.tdm, train_set.label)

# Predict the categories of the test set
predicted = clf.predict(test_set.tdm)

# Print every misclassified file
for flabel, file_name, expct_cate in zip(test_set.label, test_set.filenames, predicted):
    if flabel != expct_cate:
        print file_name, ": actual category:", flabel, " --> predicted category:", expct_cate

print "Prediction finished!"

# Compute the classification metrics:
from sklearn import metrics


def metrics_result(actual, predict):
    print 'precision:{0:.3f}'.format(metrics.precision_score(actual, predict, average='weighted'))
    print 'recall:{0:0.3f}'.format(metrics.recall_score(actual, predict, average='weighted'))
    print 'f1-score:{0:.3f}'.format(metrics.f1_score(actual, predict, average='weighted'))


metrics_result(test_set.label, predicted)
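
To make the role of alpha more concrete, here is a minimal multinomial naive Bayes sketch in Python 3 using only the standard library. It is not sklearn's implementation: it works on raw word counts rather than TF-IDF weights, and the function names and toy documents are assumptions for illustration. It does use the same additive smoothing idea, where alpha is added to every word count before normalizing.

```python
import math
from collections import Counter, defaultdict


def train_multinomial_nb(docs, labels, alpha=0.01):
    """Estimate log priors and smoothed log likelihoods from tokenized docs."""
    vocab = sorted({tok for doc in docs for tok in doc})
    class_docs = Counter(labels)          # number of documents per class
    token_counts = defaultdict(Counter)   # word counts per class
    for doc, label in zip(docs, labels):
        token_counts[label].update(doc)
    model = {}
    for label, n_docs in class_docs.items():
        total = sum(token_counts[label].values())
        denom = total + alpha * len(vocab)
        # Additive smoothing: every word gets alpha extra pseudo-counts
        log_like = {
            tok: math.log((token_counts[label][tok] + alpha) / denom)
            for tok in vocab
        }
        model[label] = (math.log(n_docs / len(labels)), log_like)
    return model


def predict(model, doc):
    """Return the class with the highest log score; unseen tokens are skipped."""
    def score(label):
        log_prior, log_like = model[label]
        return log_prior + sum(log_like[t] for t in doc if t in log_like)
    return max(model, key=score)


# Hypothetical toy corpus, already segmented into words
train_docs = [["足球", "比赛", "进球"], ["足球", "冠军"], ["画展", "油画"], ["艺术", "油画", "画展"]]
train_labels = ["sports", "sports", "art", "art"]

model = train_multinomial_nb(train_docs, train_labels, alpha=0.01)
print(predict(model, ["足球", "油画", "进球"]))
print(predict(model, ["油画"]))
```

Without smoothing (alpha = 0), a word that never occurred in a class would give that class a probability of zero for the whole document; a small positive alpha avoids this while staying close to the observed frequencies.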

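A quick note on usage:

1. Run the four scripts above in order.

2. Store the data the same way as in the original blog post: each folder name is a category name, and the code picks it up automatically.

3. After each complete run, delete the train_corpus_seg and test_corpus_seg folders entirely before running again; otherwise leftover results from the previous run will affect the new predictions.

Likewise, delete these two folders whenever you switch to a different Chinese dataset. In short, the first step before running the code above is to check that these two folders are empty. (On the very first run the folders have not been generated yet, so there is nothing to check.)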
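
The cleanup described above can be automated. Below is a small Python 3 sketch, assuming both output folders sit under one base directory; the function name reset_output_dirs and the demo directory are hypothetical, and the real base path is whatever you pass as seg_path's parent in the scripts.

```python
import os
import shutil
import tempfile


def reset_output_dirs(base_dir, names=("train_corpus_seg", "test_corpus_seg")):
    """Delete each named folder under base_dir if it exists; return those removed."""
    removed = []
    for name in names:
        path = os.path.join(base_dir, name)
        if os.path.isdir(path):
            shutil.rmtree(path)  # removes the folder and everything inside it
            removed.append(name)
    return removed


# Demo in a throwaway directory so nothing real is deleted
demo_base = tempfile.mkdtemp()
os.makedirs(os.path.join(demo_base, "train_corpus_seg", "art"))

removed = reset_output_dirs(demo_base)
remaining = os.listdir(demo_base)
print(removed, remaining)

shutil.rmtree(demo_base)  # clean up the demo directory
```

Folders that do not exist are simply skipped, which matches the first-run case where nothing has been generated yet.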

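Another strength of his blog post is that it works on small datasets (fewer than 1,000 samples, with 10-fold cross-validation), where the prediction accuracy can reach 60%~70%.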
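
The four scripts do not show how the 10-fold cross-validation was carried out, so the following Python 3 sketch only illustrates the general idea of splitting sample indices into k folds. The function name k_fold_indices is made up, and no shuffling or stratification by category is done here, both of which you would probably want on a real dataset.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_idx, test_idx) pairs; fold sizes differ by at most one."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


folds = list(k_fold_indices(25, k=10))
for train_idx, test_idx in folds[:2]:
    print(len(train_idx), test_idx)
```

Each sample lands in the test part of exactly one fold, so averaging the score over the ten folds uses every sample for evaluation once.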

Diagram of the input/output relationships between the programs
