20210126 NLP Chinese word segmentation libraries

http://www.360doc.cn/mip/794073715.html

HanLP's built-in text classifier uses a naive Bayes model.
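To make the naive Bayes idea concrete, here is a minimal pure-Python sketch of a multinomial naive Bayes classifier with add-one smoothing. It is not HanLP's implementation: the class name TinyNaiveBayes, the pre-tokenized input, and the tiny training set are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


class TinyNaiveBayes:
    """Multinomial naive Bayes over pre-tokenized documents (illustrative only)."""

    def __init__(self):
        self.class_doc_counts = Counter()        # documents seen per class
        self.word_counts = defaultdict(Counter)  # word frequencies per class
        self.vocabulary = set()

    def train(self, labeled_docs):
        # labeled_docs: iterable of (label, list_of_tokens)
        for label, tokens in labeled_docs:
            self.class_doc_counts[label] += 1
            self.word_counts[label].update(tokens)
            self.vocabulary.update(tokens)

    def log_scores(self, tokens):
        total_docs = sum(self.class_doc_counts.values())
        vocab_size = len(self.vocabulary)
        scores = {}
        for label, doc_count in self.class_doc_counts.items():
            score = math.log(doc_count / total_docs)  # log prior
            total_words = sum(self.word_counts[label].values())
            for token in tokens:
                # Add-one (Laplace) smoothing keeps unseen words from zeroing the score
                count = self.word_counts[label][token]
                score += math.log((count + 1) / (total_words + vocab_size))
            scores[label] = score
        return scores

    def classify(self, tokens):
        scores = self.log_scores(tokens)
        return max(scores, key=scores.get)


# Hypothetical, hand-segmented training data
training_data = [
    ("sports", ["球员", "教练", "足球"]),
    ("sports", ["比赛", "球员", "冠军"]),
    ("auto", ["汽车", "品牌", "库存"]),
    ("auto", ["汽车", "销量", "品牌"]),
]
model = TinyNaiveBayes()
model.train(training_data)
print(model.classify(["最佳", "球员", "教练"]))  # prints "sports"
```

A real classifier would also segment raw text and select features before counting; the sketch skips both steps to keep the probability arithmetic visible.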

I. HanLP information extraction: keyword and key-sentence extraction

https://blog.csdn.net/weixin_40605573/article/details/108316742

https://blog.csdn.net/lk7688535/article/details/89964014

1. Keyword extraction and automatic summarization

from pyhanlp import *

document = "水利部水资源司司长陈明忠9月29日在国务院新闻办举行的新闻发布会上透露，" \
           "根据刚刚完成了水资源管理制度的考核，有部分省接近了红线的指标，" \
           "有部分省超过红线的指标。对一些超过红线的地方，陈明忠表示，对一些取用水项目进行区域的限批，" \
           "严格地进行水资源论证和取水许可的批准。"

# Extract two keywords from document
print(HanLP.extractKeyword(document, 2))
# Extract three key sentences from document as a summary
print(HanLP.extractSummary(document, 3))
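HanLP's keyword and summary extraction are, to my understanding, based on TextRank. The sketch below shows the core of that idea on a hand-segmented word list: build a co-occurrence graph, then run PageRank-style iterations over it. The function name textrank_keywords, the window size, the damping factor, and the token list are all assumptions for illustration, not HanLP's actual code or parameters.

```python
from collections import defaultdict


def textrank_keywords(tokens, top_k=2, window=2, damping=0.85, iterations=50):
    """Rank words with PageRank over a co-occurrence graph (toy sketch)."""
    # Link every pair of distinct words that appear within `window` positions of each other
    neighbors = defaultdict(set)
    for i, word in enumerate(tokens):
        for other in tokens[i + 1:i + 1 + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)

    scores = {word: 1.0 for word in neighbors}
    for _ in range(iterations):
        updated = {}
        for word in neighbors:
            # Each neighbor shares its score equally among its own neighbors
            incoming = sum(scores[n] / len(neighbors[n]) for n in neighbors[word])
            updated[word] = (1 - damping) + damping * incoming
        scores = updated

    # Highest score first; ties are broken alphabetically for a stable order
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:top_k]


# Hypothetical hand-segmented tokens (stop words already removed)
tokens = ["水资源", "管理", "制度", "水资源", "论证", "取水", "许可", "水资源", "指标"]
print(textrank_keywords(tokens, top_k=2))
```

Words that co-occur with many other well-connected words rise to the top, which is why "水资源" ranks first here. Sentence-level summarization applies the same ranking to a graph whose nodes are sentences linked by similarity.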

2. Phrase extraction

from pyhanlp import *

text = "在计算机音视频和图形图像技术等二维信息算法处理方面目前比较先进的视频处理算法"
# Extract up to 10 phrases from text
phraseList = HanLP.extractPhrase(text, 10)
print(phraseList)
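Phrase extractors of this kind commonly score adjacent word pairs by how strongly they stick together; as far as I know, HanLP combines mutual information with left and right entropy. The sketch below covers only the mutual-information part. The function name rank_bigrams_by_pmi, the min_count threshold, and the sample sentences are assumptions for illustration.

```python
import math
from collections import Counter


def rank_bigrams_by_pmi(sentences, min_count=2):
    """Score adjacent word pairs by pointwise mutual information (toy sketch)."""
    word_counts = Counter()
    pair_counts = Counter()
    for tokens in sentences:
        word_counts.update(tokens)
        pair_counts.update(zip(tokens, tokens[1:]))

    total_words = sum(word_counts.values())
    total_pairs = sum(pair_counts.values())
    scored = []
    for (left, right), count in pair_counts.items():
        if count < min_count:
            continue  # rare pairs give unreliable estimates
        p_pair = count / total_pairs
        p_left = word_counts[left] / total_words
        p_right = word_counts[right] / total_words
        # PMI = log( P(left, right) / (P(left) * P(right)) )
        scored.append((left + right, math.log(p_pair / (p_left * p_right))))
    return sorted(scored, key=lambda item: -item[1])


# Hypothetical hand-segmented sentences
sentences = [
    ["视频", "处理", "算法"],
    ["图像", "处理", "算法"],
    ["视频", "处理", "技术"],
    ["先进", "的", "算法"],
]
for phrase, pmi in rank_bigrams_by_pmi(sentences):
    print(phrase, round(pmi, 3))
```

A pair scores highly when it occurs together more often than its parts would by chance, so "视频处理" outranks "处理算法" on this sample.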

3. Text classification

import os
import zipfile

from pyhanlp import SafeJClass
from pyhanlp.static import download, remove_file, HANLP_DATA_PATH


def test_data_path():
    """
    Return the test data path, $root/data/test; the root directory is set in the configuration file.
    :return: path of the test data directory
    """
    data_path = os.path.join(HANLP_DATA_PATH, 'test')
    if not os.path.isdir(data_path):
        os.mkdir(data_path)
    return data_path


def ensure_data(data_name, data_url):
    # Download (and unzip, if needed) the corpus unless it is already present
    root_path = test_data_path()
    dest_path = os.path.join(root_path, data_name)
    if os.path.exists(dest_path):
        return dest_path
    if data_url.endswith('.zip'):
        dest_path += '.zip'
    download(data_url, dest_path)
    if data_url.endswith('.zip'):
        with zipfile.ZipFile(dest_path, "r") as archive:
            archive.extractall(root_path)
        remove_file(dest_path)
        dest_path = dest_path[:-len('.zip')]
    return dest_path


NaiveBayesClassifier = SafeJClass('com.hankcs.hanlp.classification.classifiers.NaiveBayesClassifier')
IOUtil = SafeJClass('com.hankcs.hanlp.corpus.io.IOUtil')
sogou_corpus_path = ensure_data('搜狗文本分类语料库迷你版',
                                'http://file.hankcs.com/corpus/sogou-text-classification-corpus-mini.zip')


def train_or_load_classifier():
    # Reuse the serialized model if it exists; otherwise train and save it
    model_path = sogou_corpus_path + '.ser'
    if os.path.isfile(model_path):
        return NaiveBayesClassifier(IOUtil.readObjectFrom(model_path))
    classifier = NaiveBayesClassifier()
    classifier.train(sogou_corpus_path)
    model = classifier.getModel()
    IOUtil.saveObjectTo(model, model_path)
    return NaiveBayesClassifier(model)


def predict(classifier, text):
    print("《%16s》\tbelongs to category\t【%s】" % (text, classifier.classify(text)))
    # To get the distribution over categories (a discrete random variable), use the predict interface
    # print("《%16s》\tbelongs to category\t【%s】" % (text, classifier.predict(text)))


if __name__ == '__main__':
    classifier = train_or_load_classifier()
    predict(classifier, "C罗获2018环球足球奖最佳球员 德尚荣膺最佳教练")
    predict(classifier, "英国造航母耗时8年仍未服役 被中国速度远远甩在身后")
    predict(classifier, "研究生考录模式亟待进一步专业化")
    predict(classifier, '长城新出大狗和白猫品牌')
    predict(classifier, "通用及其部分竞争对手目前正在考虑解决库存问题")
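The train-or-load step above follows a common caching pattern: train once, write the model to disk, and read it back on later runs. Here is a minimal sketch of the same pattern using only the standard library. It uses pickle where the example above uses HanLP's IOUtil, and the names train_or_load and expensive_training, plus the dictionary standing in for a model, are hypothetical.

```python
import os
import pickle
import tempfile


def train_or_load(model_path, train_fn):
    """Return a cached model if one exists; otherwise train, save, and return it."""
    if os.path.isfile(model_path):
        with open(model_path, "rb") as f:
            return pickle.load(f)
    model = train_fn()
    with open(model_path, "wb") as f:
        pickle.dump(model, f)
    return model


calls = []


def expensive_training():
    calls.append(1)  # record each real training run
    return {"sports": 0.5, "auto": 0.5}  # stand-in for a trained model


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.pkl")
    first = train_or_load(path, expensive_training)   # trains and writes the file
    second = train_or_load(path, expensive_training)  # reads the file, no retraining
    print(len(calls), first == second)  # prints "1 True"
```

If the corpus or the training settings change, delete the cached file so that the stale model is not silently reused.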

4. Sentiment analysis

from pyhanlp import *
# ensure_data is the same helper defined in section 3; this import assumes
# the pyhanlp repository's tests package is on the path
from tests.test_utility import ensure_data

IClassifier = JClass('com.hankcs.hanlp.classification.classifiers.IClassifier')
NaiveBayesClassifier = JClass('com.hankcs.hanlp.classification.classifiers.NaiveBayesClassifier')
# Chinese sentiment mining corpus ChnSentiCorp, by Tan Songbo
chn_senti_corp = ensure_data("ChnSentiCorp情感分析酒店评论", "http://file.hankcs.com/corpus/ChnSentiCorp.zip")


def predict(classifier, text):
    print("《%s》 sentiment polarity is 【%s】" % (text, classifier.classify(text)))


if __name__ == '__main__':
    # Create the classifier; for more advanced features see the IClassifier interface definition
    classifier = NaiveBayesClassifier()
    # The trained model can be persisted, so it need not be retrained next time
    classifier.train(chn_senti_corp)
    predict(classifier, "前台客房服务态度非常好！早餐很丰富，房价很干净。再接再厉！")
    predict(classifier, "这个东西给差评")
    predict(classifier, "可利用文本分类实现情感分析，效果不是不行")
