KNN for Chinese Text Classification
Adapted from the blog:
http://blog.csdn.net/github_36326955/article/details/54891204
Notes for my own reference.
Run the scripts in the order 1, 2, 3, 4:
1.py (corpus_segment.py)

#!/usr/bin/env python
# -*- coding: UTF-8 -*-
"""
@version: python2.7.8
@author: XiangguoSun
@contact: sunxiangguodut@qq.com
@file: corpus_segment.py
@time: 2017/2/5 15:28
@software: PyCharm
"""
import sys
import os
import jieba

# Set up a UTF-8 default encoding (Python 2 idiom)
reload(sys)
sys.setdefaultencoding('utf-8')

# Save content to a file
def savefile(savepath, content):
    with open(savepath, "wb") as fp:
        fp.write(content)
'''
The with-statement above is syntax available from Python 2.6 onward; it
spares us the verbose close() and try handling. Python 2.5 needs
"from __future__ import with_statement".
'''

# Read a file
def readfile(path):
    with open(path, "rb") as fp:
        content = fp.read()
    return content

def corpus_segment(corpus_path, seg_path):
    '''
    corpus_path is the path of the unsegmented corpus;
    seg_path is the path where the segmented corpus is stored.
    '''
    catelist = os.listdir(corpus_path)  # all subdirectories under corpus_path
    '''
    The subdirectory names are the category names. For example, in
    train_corpus/art/21.txt, 'train_corpus/' is corpus_path and 'art' is one
    member of catelist.
    '''
    # Process every file under each directory (category)
    for mydir in catelist:
        # mydir is the 'art' in train_corpus/art/21.txt (one category from catelist)
        class_path = corpus_path + mydir + "/"  # category directory, e.g. train_corpus/art/
        seg_dir = seg_path + mydir + "/"  # matching output directory, e.g. train_corpus_seg/art/
        if not os.path.exists(seg_dir):  # create the output directory if it does not exist
            os.makedirs(seg_dir)
        file_list = os.listdir(class_path)  # all texts of this category in the unsegmented corpus
        '''
        For 21.txt, 22.txt, 23.txt, ... in train_corpus/art/,
        file_list = ['21.txt', '22.txt', ...]
        '''
        for file_path in file_list:  # iterate over all files in the category directory
            fullname = class_path + file_path  # full path, e.g. train_corpus/art/21.txt
            content = readfile(fullname)  # read the file content
            '''
            At this point content holds every character of the original text,
            including redundant spaces, blank lines, carriage returns and so on.
            Next we strip those irrelevant characters, leaving compact text
            separated only by punctuation.
            '''
            content = content.replace("\r\n", "")  # remove line breaks
            content = content.replace(" ", "")  # remove blank lines and extra spaces
            content_seg = jieba.cut(content)  # segment the file content
            savefile(seg_dir + file_path, " ".join(content_seg))  # save the segmented file to the output corpus
    print "Chinese corpus segmentation finished!"

'''
If you are unsure what if __name__ == "__main__": means, see
http://imoyao.lofter.com/post/3492bc_bd0c4ce
In short: when another Python file calls this file's functions, or this file
is imported as a module, the code below does not run; it runs only when the
file itself is executed from the command line or an IDE such as PyCharm.
In other words, this block serves as a functional test.
'''
if __name__ == "__main__":
    # Segment the training set
    corpus_path = "/home/appleyuchi/PycharmProjects/MultiNB/csdn_blog/54891204_tenwhy/chinese_text_classification-master/train/"  # unsegmented corpus path
    seg_path = "/home/appleyuchi/PycharmProjects/MultiNB/csdn_blog/54891204_tenwhy/chinese_text_classification-master/train_corpus_seg/"  # segmented corpus path, the output of this program
    corpus_segment(corpus_path, seg_path)
    # Segment the test set
    corpus_path = "/home/appleyuchi/PycharmProjects/MultiNB/csdn_blog/54891204_tenwhy/chinese_text_classification-master/answer/"  # unsegmented corpus path
    seg_path = "/home/appleyuchi/PycharmProjects/MultiNB/csdn_blog/54891204_tenwhy/chinese_text_classification-master/test_corpus_seg/"  # segmented corpus path, the output of this program
    corpus_segment(corpus_path, seg_path)
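The script above targets Python 2 (reload/sys.setdefaultencoding and the print statement do not exist in Python 3). A rough Python 3 sketch of the same clean-then-segment loop follows; a character-bigram tokenizer stands in for jieba.cut (an assumption made so the sketch stays dependency-free; real runs should keep jieba):

```python
import os
import tempfile

def clean(text):
    # Same cleanup as the original script: drop line breaks and spaces.
    return text.replace("\r\n", "").replace("\n", "").replace(" ", "")

def toy_cut(text):
    # Stand-in for jieba.cut (assumption: real runs use jieba);
    # yields overlapping character bigrams.
    return (text[i:i + 2] for i in range(len(text) - 1))

def corpus_segment(corpus_path, seg_path):
    for category in os.listdir(corpus_path):        # subdirectory = category
        class_dir = os.path.join(corpus_path, category)
        seg_dir = os.path.join(seg_path, category)
        os.makedirs(seg_dir, exist_ok=True)
        for name in os.listdir(class_dir):
            with open(os.path.join(class_dir, name), encoding="utf-8") as fp:
                content = clean(fp.read())
            with open(os.path.join(seg_dir, name), "w", encoding="utf-8") as fp:
                fp.write(" ".join(toy_cut(content)))

# Tiny demo corpus: train/<category>/<file>.txt, mirroring the blog's layout.
root = tempfile.mkdtemp()
src, dst = os.path.join(root, "train"), os.path.join(root, "train_corpus_seg")
os.makedirs(os.path.join(src, "art"))
with open(os.path.join(src, "art", "21.txt"), "w", encoding="utf-8") as fp:
    fp.write("中文 文本\r\n分类")
corpus_segment(src, dst)
with open(os.path.join(dst, "art", "21.txt"), encoding="utf-8") as fp:
    print(fp.read())  # 中文 文文 文本 本分 分类
```

The demo writes one file, segments it, and prints the space-joined tokens, reproducing the train_corpus_seg/art/21.txt layout described above.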
2.py (corpus2Bunch.py)

#!/usr/bin/env python
# -*- coding: UTF-8 -*-
"""
@version: python2.7.8
@author: XiangguoSun
@contact: sunxiangguodut@qq.com
@file: corpus2Bunch.py
@time: 2017/2/7 7:41
@software: PyCharm
"""
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
import os  # built-in module for file and directory operations; we will use os.listdir
import cPickle as pickle  # import cPickle under the alias pickle
'''
Python also ships a module literally named pickle; the name clash with our
alias is harmless. For cPickle vs. pickle, see the author's other post:
http://blog.csdn.net/github_36326955/article/details/54882506
Below we use cPickle's dump function.
'''
from sklearn.datasets.base import Bunch
# No deep understanding needed for now; just remember that this is how the
# Bunch data structure is imported. Later posts cover sklearn in more depth.

def _readfile(path):
    '''Read a file.'''
    # The leading underscore marks this as a private function by convention
    # only; it can still be called from outside. It merely improves readability.
    with open(path, "rb") as fp:
        content = fp.read()
    return content

def corpus2Bunch(wordbag_path, seg_path):
    catelist = os.listdir(seg_path)  # subdirectories of seg_path, i.e. the category names
    # Create a Bunch instance
    bunch = Bunch(target_name=[], label=[], filenames=[], contents=[])
    bunch.target_name.extend(catelist)
    '''
    extend(addlist) is a Python list method that extends the original list
    with every element of addlist.
    '''
    # Collect all files under each directory
    for mydir in catelist:
        class_path = seg_path + mydir + "/"  # path of the category directory
        file_list = os.listdir(class_path)  # all files under class_path
        for file_path in file_list:  # iterate over the files of this category
            fullname = class_path + file_path  # full file path
            bunch.label.append(mydir)
            bunch.filenames.append(fullname)
            bunch.contents.append(_readfile(fullname))  # read the file content
            '''append(element) adds a single element to the list; note the difference from extend().'''
    # Store the bunch at wordbag_path
    with open(wordbag_path, "wb") as file_obj:
        pickle.dump(bunch, file_obj)
    print "Finished building the text Bunch object!"

if __name__ == "__main__":
    # Build the Bunch for the training set:
    wordbag_path = "/home/appleyuchi/PycharmProjects/MultiNB/csdn_blog/54891204_tenwhy/chinese_text_classification-master/train_word_bag/train_set.dat"  # Bunch storage path, program output
    seg_path = "/home/appleyuchi/PycharmProjects/MultiNB/csdn_blog/54891204_tenwhy/chinese_text_classification-master/train_corpus_seg/"  # segmented corpus path, program input
    corpus2Bunch(wordbag_path, seg_path)
    # Build the Bunch for the test set:
    wordbag_path = "/home/appleyuchi/PycharmProjects/MultiNB/csdn_blog/54891204_tenwhy/chinese_text_classification-master/test_word_bag/test_set.dat"  # Bunch storage path, program output
    seg_path = "/home/appleyuchi/PycharmProjects/MultiNB/csdn_blog/54891204_tenwhy/chinese_text_classification-master/test_corpus_seg/"  # segmented corpus path, program input
    corpus2Bunch(wordbag_path, seg_path)
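sklearn's Bunch is essentially a dict whose keys are also attributes; note that in current scikit-learn releases it lives at sklearn.utils.Bunch, and the sklearn.datasets.base import path used above no longer works. As a dependency-free sketch of the same build-and-pickle step, with a plain dict standing in for Bunch:

```python
import os
import pickle
import tempfile

def corpus2bunch(wordbag_path, seg_path):
    catelist = sorted(os.listdir(seg_path))  # subdirectory names = categories
    bunch = {"target_name": catelist, "label": [], "filenames": [], "contents": []}
    for category in catelist:
        class_dir = os.path.join(seg_path, category)
        for name in sorted(os.listdir(class_dir)):
            fullname = os.path.join(class_dir, name)
            with open(fullname, encoding="utf-8") as fp:
                content = fp.read()
            bunch["label"].append(category)      # one label per document
            bunch["filenames"].append(fullname)
            bunch["contents"].append(content)
    with open(wordbag_path, "wb") as fp:         # serialize the whole object
        pickle.dump(bunch, fp)

# Demo: one category, one segmented document, then a pickle round-trip.
root = tempfile.mkdtemp()
seg = os.path.join(root, "train_corpus_seg")
os.makedirs(os.path.join(seg, "art"))
with open(os.path.join(seg, "art", "21.txt"), "w", encoding="utf-8") as fp:
    fp.write("中文 文本 分类")
dat = os.path.join(root, "train_set.dat")
corpus2bunch(dat, seg)
with open(dat, "rb") as fp:                      # load it back
    bunch = pickle.load(fp)
print(bunch["label"], bunch["target_name"])  # ['art'] ['art']
```

The round-trip at the end mirrors what 3.py does when it re-reads train_set.dat.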
3.py (TFIDF_space.py)

#!/usr/bin/env python
# -*- coding: UTF-8 -*-
"""
@version: python2.7.8
@author: XiangguoSun
@contact: sunxiangguodut@qq.com
@file: TFIDF_space.py
@time: 2017/2/8 11:39
@software: PyCharm
"""
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
from sklearn.datasets.base import Bunch
import cPickle as pickle
from sklearn.feature_extraction.text import TfidfVectorizer

def _readfile(path):
    with open(path, "rb") as fp:
        content = fp.read()
    return content

def _readbunchobj(path):
    with open(path, "rb") as file_obj:
        bunch = pickle.load(file_obj)
    return bunch

def _writebunchobj(path, bunchobj):
    with open(path, "wb") as file_obj:
        pickle.dump(bunchobj, file_obj)

def vector_space(stopword_path, bunch_path, space_path, train_tfidf_path=None):
    stpwrdlst = _readfile(stopword_path).splitlines()
    bunch = _readbunchobj(bunch_path)
    tfidfspace = Bunch(target_name=bunch.target_name, label=bunch.label, filenames=bunch.filenames, tdm=[], vocabulary={})
    if train_tfidf_path is not None:
        # Test set: reuse the training vocabulary so the feature columns match
        trainbunch = _readbunchobj(train_tfidf_path)
        tfidfspace.vocabulary = trainbunch.vocabulary
        vectorizer = TfidfVectorizer(stop_words=stpwrdlst, sublinear_tf=True, max_df=0.5, vocabulary=trainbunch.vocabulary)
        tfidfspace.tdm = vectorizer.fit_transform(bunch.contents)
    else:
        vectorizer = TfidfVectorizer(stop_words=stpwrdlst, sublinear_tf=True, max_df=0.5)
        tfidfspace.tdm = vectorizer.fit_transform(bunch.contents)
        tfidfspace.vocabulary = vectorizer.vocabulary_
    _writebunchobj(space_path, tfidfspace)
    print "TF-IDF vector space created!"

if __name__ == '__main__':
    # stopword_path = "/home/appleyuchi/PycharmProjects/MultiNB/csdn_blog/54891204/chinese_text_classification-master/train_word_bag/hlt_stop_words.txt"  # input file
    # bunch_path = "train_word_bag/train_set.dat"  # input file
    # space_path = "train_word_bag/tfdifspace.dat"  # output file
    # vector_space(stopword_path,bunch_path,space_path)
    #
    # bunch_path = "test_word_bag/test_set.dat"  # input file
    # space_path = "test_word_bag/testspace.dat"
    # train_tfidf_path="train_word_bag/tfdifspace.dat"
    # vector_space(stopword_path,bunch_path,space_path,train_tfidf_path)
    stopword_path = "/home/appleyuchi/PycharmProjects/MultiNB/csdn_blog/54891204_tenwhy/chinese_text_classification-master/train_word_bag/hlt_stop_words.txt"  # input file
    train_bunch_path = "/home/appleyuchi/PycharmProjects/MultiNB/csdn_blog/54891204_tenwhy/chinese_text_classification-master/train_word_bag/train_set.dat"  # input file
    space_path = "/home/appleyuchi/PycharmProjects/MultiNB/csdn_blog/54891204_tenwhy/chinese_text_classification-master/train_word_bag/tfidfspace.dat"  # output file
    vector_space(stopword_path, train_bunch_path, space_path)
    train_tfidf_path = "/home/appleyuchi/PycharmProjects/MultiNB/csdn_blog/54891204_tenwhy/chinese_text_classification-master/train_word_bag/tfidfspace.dat"  # input file, produced by the call above
    test_bunch_path = "/home/appleyuchi/PycharmProjects/MultiNB/csdn_blog/54891204_tenwhy/chinese_text_classification-master/test_word_bag/test_set.dat"  # input file
    test_space_path = "/home/appleyuchi/PycharmProjects/MultiNB/csdn_blog/54891204_tenwhy/chinese_text_classification-master/test_word_bag/testspace.dat"  # output file
    vector_space(stopword_path, test_bunch_path, test_space_path, train_tfidf_path)
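Why pass the training vocabulary when vectorizing the test set? Because the train and test matrices must share the same columns, or the classifier in step 4 cannot compare them. A toy pure-Python TF-IDF illustrates this; it uses the textbook tf * log(N/df) rather than TfidfVectorizer's smoothed, sublinear variant, and (unlike the real pipeline, which also reuses the training idf) it only demonstrates the shared column layout:

```python
import math

def fit_vocabulary(docs):
    # Assign a column index to every term seen in the training documents.
    vocab = sorted({t for d in docs for t in d.split()})
    return {t: i for i, t in enumerate(vocab)}

def tfidf(docs, vocab):
    # Textbook tf * log(N / df). Terms outside the vocabulary are dropped,
    # which is what passing vocabulary= to TfidfVectorizer achieves.
    n = len(docs)
    df = {t: sum(1 for d in docs if t in d.split()) for t in vocab}
    rows = []
    for d in docs:
        terms = d.split()
        row = [0.0] * len(vocab)
        for t in terms:
            if t in vocab:
                row[vocab[t]] = terms.count(t) / len(terms) * math.log(n / max(df[t], 1))
        rows.append(row)
    return rows

train = ["中文 文本 分类", "文本 聚类"]
vocab = fit_vocabulary(train)           # 4 distinct terms -> 4 columns
test = ["中文 分类 新词"]                # "新词" never seen in training
train_m, test_m = tfidf(train, vocab), tfidf(test, vocab)
print(len(train_m[0]), len(test_m[0]))  # 4 4
```

The unseen term "新词" is silently dropped, and both matrices come out with the same width, which is exactly the property the train_tfidf_path branch of vector_space preserves.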
4.py (NBayes_Predict.py)

#!/usr/bin/env python
# -*- coding: UTF-8 -*-
"""
@version: python2.7.8
@author: XiangguoSun
@contact: sunxiangguodut@qq.com
@file: NBayes_Predict.py
@time: 2017/2/8 12:21
@software: PyCharm
"""
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
import cPickle as pickle
from sklearn.naive_bayes import MultinomialNB  # multinomial naive Bayes, kept for comparison

# Read a bunch object
def _readbunchobj(path):
    with open(path, "rb") as file_obj:
        bunch = pickle.load(file_obj)
    return bunch

# Load the training set
trainpath = "/home/appleyuchi/PycharmProjects/MultiNB/csdn_blog/54891204_tenwhy/chinese_text_classification-master/train_word_bag/tfidfspace.dat"
train_set = _readbunchobj(trainpath)

# Load the test set
testpath = "/home/appleyuchi/PycharmProjects/MultiNB/csdn_blog/54891204_tenwhy/chinese_text_classification-master/test_word_bag/testspace.dat"
test_set = _readbunchobj(testpath)

# Train a classifier on the bag-of-words vectors and labels.
# alpha is the additive (Laplace) smoothing strength; smaller alpha smooths less.
# clf = MultinomialNB(alpha=0.1).fit(train_set.tdm, train_set.label)

######################################################
# KNN Classifier
from sklearn.neighbors import KNeighborsClassifier
print '*************************\nKNN\n*************************'
clf = KNeighborsClassifier()  # default: k=5
clf.fit(train_set.tdm, train_set.label)

# Predict the test set
predicted = clf.predict(test_set.tdm)
for flabel, file_name, expct_cate in zip(test_set.label, test_set.filenames, predicted):
    if flabel != expct_cate:
        print file_name, ": actual category:", flabel, " --> predicted category:", expct_cate
print "Prediction finished!"

# Compute classification metrics:
from sklearn import metrics
def metrics_result(actual, predict):
    print 'Precision: {0:.3f}'.format(metrics.precision_score(actual, predict, average='weighted'))
    print 'Recall: {0:0.3f}'.format(metrics.recall_score(actual, predict, average='weighted'))
    print 'F1-score: {0:.3f}'.format(metrics.f1_score(actual, predict, average='weighted'))

metrics_result(test_set.label, predicted)
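KNeighborsClassifier classifies a test document by letting its k most similar training documents vote. A minimal pure-Python sketch of that idea, using cosine similarity over sparse term counts (toy data; the category names are borrowed from this dataset, everything else is illustrative):

```python
import math
from collections import Counter

def cosine(a, b):
    # Cosine similarity between two sparse term-count vectors.
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def knn_predict(train_vecs, train_labels, query, k=5):
    # Rank training documents by similarity and let the top k vote.
    nearest = sorted(range(len(train_vecs)),
                     key=lambda i: cosine(train_vecs[i], query),
                     reverse=True)[:k]
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy segmented documents (space-separated tokens, as produced by step 1).
train_docs = ["足球 比赛 进球", "射门 比赛 进球", "芯片 电路 晶体管"]
train_labels = ["C39-Sports", "C39-Sports", "C16-Electronics"]
train_vecs = [Counter(d.split()) for d in train_docs]
query = Counter("足球 进球 比赛".split())
print(knn_predict(train_vecs, train_labels, query, k=3))  # C39-Sports
```

sklearn's default metric is Euclidean distance; on tf-idf vectors cosine similarity (as above) is a common alternative, and the voting logic is the same.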
As before, the Fudan University news dataset is used.
Run output (a partial copy):
/home/appleyuchi/PycharmProjects/MultiNB/csdn_blog/54891204/chinese_text_classification-master/test_corpus_seg/C16-Electronics/C16-Electronics37.txt : actual category: C16-Electronics --> predicted category: C11-Space
/home/appleyuchi/PycharmProjects/MultiNB/csdn_blog/54891204/chinese_text_classification-master/test_corpus_seg/C16-Electronics/C16-Electronics19.txt : actual category: C16-Electronics --> predicted category: C34-Economy
/home/appleyuchi/PycharmProjects/MultiNB/csdn_blog/54891204/chinese_text_classification-master/test_corpus_seg/C16-Electronics/C16-Electronics35.txt : actual category: C16-Electronics --> predicted category: C39-Sports
/home/appleyuchi/PycharmProjects/MultiNB/csdn_blog/54891204/chinese_text_classification-master/test_corpus_seg/C16-Electronics/C16-Electronics31.txt : actual category: C16-Electronics --> predicted category: C11-Space
/home/appleyuchi/PycharmProjects/MultiNB/csdn_blog/54891204/chinese_text_classification-master/test_corpus_seg/C16-Electronics/C16-Electronics52.txt : actual category: C16-Electronics --> predicted category: C17-Communication
/home/appleyuchi/PycharmProjects/MultiNB/csdn_blog/54891204/chinese_text_classification-master/test_corpus_seg/C16-Electronics/C16-Electronics07.txt : actual category: C16-Electronics --> predicted category: C17-Communication
/home/appleyuchi/PycharmProjects/MultiNB/csdn_blog/54891204/chinese_text_classification-master/test_corpus_seg/C16-Electronics/C16-Electronics02.txt : actual category: C16-Electronics --> predicted category: C34-Economy
/home/appleyuchi/PycharmProjects/MultiNB/csdn_blog/54891204/chinese_text_classification-master/test_corpus_seg/C16-Electronics/C16-Electronics48.txt : actual category: C16-Electronics --> predicted category: C34-Economy
Prediction finished!
Precision: 0.890
Recall: 0.893
F1-score: 0.886
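The three scores above come from average='weighted': per-class precision, recall and F1 are averaged with weights proportional to each class's number of true samples. A stdlib sketch of that computation on a toy example (the data here is illustrative, not from the dataset):

```python
from collections import Counter

def weighted_prf(actual, predicted):
    # Per-class precision/recall/F1, combined with weights proportional to
    # each class's true-sample count (what sklearn's average='weighted' does).
    support = Counter(actual)
    p = r = f = 0.0
    for c in sorted(support):
        tp = sum(1 for a, y in zip(actual, predicted) if a == c and y == c)
        fp = sum(1 for a, y in zip(actual, predicted) if a != c and y == c)
        fn = sum(1 for a, y in zip(actual, predicted) if a == c and y != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        w = support[c] / len(actual)
        p, r, f = p + w * prec, r + w * rec, f + w * f1
    return p, r, f

actual    = ["a", "a", "a", "b"]   # toy ground truth
predicted = ["a", "a", "b", "b"]   # one "a" misclassified as "b"
scores = weighted_prf(actual, predicted)
print([round(x, 3) for x in scores])  # [0.875, 0.75, 0.767]
```

With weighted averaging, precision and recall need not be equal even on the same predictions, which is why the run above reports 0.890 and 0.893.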