pyhanlp

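HanLP is a Java toolkit made up of a collection of models and algorithms. Its goal is to bring natural language processing into wider use in production environments. HanLP is feature-complete, efficient, cleanly architected, built on up-to-date corpora, and customizable.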

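HanLP provides the following features:

Chinese word segmentation
Part-of-speech tagging
Named entity recognition
Dependency parsing
Keyword extraction
New word discovery
Phrase extraction
Automatic summarization
Text classification
Pinyin and Simplified/Traditional Chinese conversion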

Installing pyhanlp

pip install pyhanlp

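The first time you run from pyhanlp import * after installation, pyhanlp downloads the HanLP data files. The archive is large and the automatic download often fails, so it is recommended to download it manually and place it in the expected directory.

Data file download page: https://github.com/hankcs/HanLP/releases

Download data-for-1.7.2.zip from that page.

Then place the downloaded file in the C:\Anaconda3\Lib\site-packages\pyhanlp\static directory.

Run from pyhanlp import * again and the archive is extracted automatically.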
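The path above depends on where Python is installed, so it can help to locate the static directory programmatically. The following is a minimal sketch, not part of pyhanlp: static_dir and place_archive are hypothetical helper names, and the sketch assumes, as described above, that the archive belongs in the static folder inside the installed package. It uses importlib.util.find_spec so that pyhanlp itself is not imported (importing it would trigger the download).

```python
import importlib.util
import os
import shutil


def static_dir(package="pyhanlp"):
    """Return <package dir>/static without importing the package itself."""
    spec = importlib.util.find_spec(package)
    if spec is None or spec.origin is None:
        raise ModuleNotFoundError(f"{package} is not installed")
    return os.path.join(os.path.dirname(spec.origin), "static")


def place_archive(archive_path, target_dir):
    """Copy the manually downloaded zip into target_dir and return the new path."""
    os.makedirs(target_dir, exist_ok=True)
    return shutil.copy(archive_path, target_dir)
```

With pyhanlp installed and the zip saved in the current directory, place_archive("data-for-1.7.2.zip", static_dir()) would copy the archive to the location described above.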

Using pyhanlp

pyhanlp documentation: https://github.com/hankcs/pyhanlp

HanLP documentation: https://github.com/hankcs/HanLP/blob/master/README.md

pyhanlp demos: https://github.com/hankcs/pyhanlp/tree/master/tests/demos

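Word segmentation

pyhanlp supports a variety of segmentation rules and models, as well as custom dictionaries. In testing, the default segmenter already performs well. It also performs part-of-speech tagging and named entity recognition, so it can identify person names, place names, organization names, and similar entities.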

from pyhanlp import *

sentence = "下雨天地面积水"
# HanLP.segment returns a list of Term objects; each Term exposes a word
# attribute (the token) and a nature attribute (its part-of-speech tag)
terms = HanLP.segment(sentence)
for term in terms:
    print(term.word, term.nature)
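Because each Term carries a part-of-speech tag, named entities can be picked out by filtering on that tag. The sketch below is illustrative only: collect_entities is a hypothetical helper, it assumes the HanLP tag prefixes nr (person name), ns (place name) and nt (organization name), and the sample pairs are hand-written input rather than actual HanLP output.

```python
from collections import defaultdict

# Assumed HanLP part-of-speech prefixes for named entities
ENTITY_PREFIXES = {"nr": "person", "ns": "place", "nt": "organization"}


def collect_entities(pairs):
    """Group words by entity type, given (word, nature) string pairs."""
    entities = defaultdict(list)
    for word, nature in pairs:
        for prefix, label in ENTITY_PREFIXES.items():
            if nature.startswith(prefix):
                entities[label].append(word)
                break
    return dict(entities)


# Hand-written sample pairs (not real segmenter output)
sample = [("陈明忠", "nr"), ("在", "p"), ("北京", "ns"), ("水利部", "nt")]
print(collect_entities(sample))
```

With pyhanlp installed, the pairs could presumably be built from real output as [(str(t.word), str(t.nature)) for t in HanLP.segment(text)].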

Keyword extraction and automatic summarization

from pyhanlp import *

document = "水利部水资源司司长陈明忠9月29日在国务院新闻办举行的新闻发布会上透露，" \
           "根据刚刚完成了水资源管理制度的考核，有部分省接近了红线的指标，" \
           "有部分省超过红线的指标。对一些超过红线的地方，陈明忠表示，对一些取用水项目进行区域的限批，" \
           "严格地进行水资源论证和取水许可的批准。"
# Extract 2 keywords from document
print(HanLP.extractKeyword(document, 2))
# Extract 3 key sentences from document as a summary
print(HanLP.extractSummary(document, 3))

Dependency parsing

from pyhanlp import *
print(HanLP.parseDependency("徐先生还具体帮助他确定了把画雄鹰、松鼠和麻雀作为主攻目标。"))

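Co-occurrence analysis

Co-occurrence refers to words appearing together in a text.

First-order co-occurrence analysis is simply term-frequency counting; second-order and third-order analysis are mainly used to discover phrases.

HanLP's co-occurrence module reports, for combinations of 2 or 3 words, the occurrence count (tf), mutual information (mi), left entropy (le), right entropy (re), and a score.

Adapted from: https://blog.csdn.net/fontthrone/article/details/82824202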

from pyhanlp import *

# Co-occurrence analysis
Occurrence = JClass("com.hankcs.hanlp.corpus.occurrence.Occurrence")
PairFrequency = JClass("com.hankcs.hanlp.corpus.occurrence.PairFrequency")
TermFrequency = JClass("com.hankcs.hanlp.corpus.occurrence.TermFrequency")
TriaFrequency = JClass("com.hankcs.hanlp.corpus.occurrence.TriaFrequency")

occurrence = Occurrence()
occurrence.addAll("在计算机音视频和图形图像技术等二维信息算法处理方面目前比较先进的视频处理算法")
occurrence.compute()

print("First-order co-occurrence analysis, i.e. term-frequency counting")
unigram = occurrence.getUniGram()
for entry in unigram.iterator():
    term_frequency = entry.getValue()
    print(term_frequency)
print()

print("Second-order co-occurrence analysis")
bigram = occurrence.getBiGram()
for entry in bigram.iterator():
    pair_frequency = entry.getValue()
    if pair_frequency.isRight():
        print(pair_frequency)
print()

print("Third-order co-occurrence analysis")
trigram = occurrence.getTriGram()
for entry in trigram.iterator():
    tria_frequency = entry.getValue()
    if tria_frequency.isRight():
        print(tria_frequency)
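To make the tf, mi, le and re quantities concrete, here is a small pure-Python sketch that computes them for adjacent word pairs in an already segmented token list. It uses textbook definitions (pointwise mutual information and Shannon entropy of the neighbouring words, both with natural logarithms); HanLP's exact formulas and its score may well differ, and bigram_stats and entropy are hypothetical names, not HanLP APIs.

```python
import math
from collections import Counter, defaultdict


def entropy(counter):
    """Shannon entropy (natural log) of a Counter of neighbour words."""
    total = sum(counter.values())
    if total == 0:
        return 0.0
    return -sum(c / total * math.log(c / total) for c in counter.values())


def bigram_stats(tokens):
    """Return {(w1, w2): {"tf", "mi", "le", "re"}} for adjacent token pairs."""
    unigram = Counter(tokens)
    pairs = list(zip(tokens, tokens[1:]))
    bigram = Counter(pairs)
    left = defaultdict(Counter)   # words seen immediately before each pair
    right = defaultdict(Counter)  # words seen immediately after each pair
    for i, pair in enumerate(pairs):
        if i > 0:
            left[pair][tokens[i - 1]] += 1
        if i + 2 < len(tokens):
            right[pair][tokens[i + 2]] += 1
    stats = {}
    for pair, tf in bigram.items():
        p_xy = tf / len(pairs)
        p_x = unigram[pair[0]] / len(tokens)
        p_y = unigram[pair[1]] / len(tokens)
        stats[pair] = {
            "tf": tf,
            "mi": math.log(p_xy / (p_x * p_y)),
            "le": entropy(left[pair]),
            "re": entropy(right[pair]),
        }
    return stats


# Hand-segmented toy input
tokens = ["视频", "处理", "算法", "和", "视频", "处理", "技术"]
for pair, values in bigram_stats(tokens).items():
    print(pair, values)
```

A pair that occurs often, has high mutual information, and is followed or preceded by many different words (high entropy) is a plausible phrase candidate, which is why these measures are useful for phrase discovery.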

Phrase extraction

from pyhanlp import *

text = "在计算机音视频和图形图像技术等二维信息算法处理方面目前比较先进的视频处理算法"
# Extract up to 10 phrases from text
phraseList = HanLP.extractPhrase(text, 10)
print(phraseList)

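Text classification

The text classifier bundled with pyhanlp uses a naive Bayes model. You need to collect the training corpus yourself; here we use the mini version of the Sogou text classification corpus, which can be downloaded from:

http://file.hankcs.com/corpus/sogou-text-classification-corpus-mini.zip

After downloading, place the file in the pyhanlp/static/data/test folder of the pyhanlp package, that is:

C:\Anaconda3\Lib\site-packages\pyhanlp\static\data\test

We then use the official example from GitHub. Reference link: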

https://github.com/hankcs/pyhanlp/blob/master/tests/demos/demo_text_classification.py

# -*- coding:utf-8 -*-
# Author: hankcs
# Date: 2018-05-23 17:26
import os

from pyhanlp import SafeJClass
from tests.test_utility import ensure_data

NaiveBayesClassifier = SafeJClass('com.hankcs.hanlp.classification.classifiers.NaiveBayesClassifier')
IOUtil = SafeJClass('com.hankcs.hanlp.corpus.io.IOUtil')
sogou_corpus_path = ensure_data('搜狗文本分类语料库迷你版',
                                'http://file.hankcs.com/corpus/sogou-text-classification-corpus-mini.zip')


def train_or_load_classifier():
    # Reuse a previously serialized model if one exists; otherwise train and save it
    model_path = sogou_corpus_path + '.ser'
    if os.path.isfile(model_path):
        return NaiveBayesClassifier(IOUtil.readObjectFrom(model_path))
    classifier = NaiveBayesClassifier()
    classifier.train(sogou_corpus_path)
    model = classifier.getModel()
    IOUtil.saveObjectTo(model, model_path)
    return NaiveBayesClassifier(model)


def predict(classifier, text):
    print("《%16s》\tbelongs to category\t【%s】" % (text, classifier.classify(text)))
    # To obtain the distribution over categories (a discrete random variable), use the predict interface
    # print("《%16s》\tbelongs to category\t【%s】" % (text, classifier.predict(text)))


if __name__ == '__main__':
    classifier = train_or_load_classifier()
    predict(classifier, "C罗获2018环球足球奖最佳球员 德尚荣膺最佳教练")
    predict(classifier, "英国造航母耗时8年仍未服役 被中国速度远远甩在身后")
    predict(classifier, "研究生考录模式亟待进一步专业化")
    predict(classifier, "如果真想用食物解压,建议可以食用燕麦")
    predict(classifier, "通用及其部分竞争对手目前正在考虑解决库存问题")
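For readers unfamiliar with the model named above, the following is a minimal multinomial naive Bayes sketch with add-one smoothing, working on pre-tokenized documents. It is a textbook illustration only, not HanLP's implementation, which may differ in tokenization, feature selection and smoothing. TinyNaiveBayes and the category labels are hypothetical.

```python
import math
from collections import Counter, defaultdict


class TinyNaiveBayes:
    """Multinomial naive Bayes over token lists, with add-one smoothing."""

    def __init__(self):
        self.class_docs = Counter()              # documents seen per class
        self.word_counts = defaultdict(Counter)  # token counts per class
        self.vocab = set()

    def train(self, labeled_docs):
        for label, tokens in labeled_docs:
            self.class_docs[label] += 1
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)

    def classify(self, tokens):
        total_docs = sum(self.class_docs.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_docs.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # log prior plus the sum of smoothed log likelihoods
            score = math.log(n_docs / total_docs)
            for tok in tokens:
                score += math.log((counts[tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data with made-up category labels
nb = TinyNaiveBayes()
nb.train([
    ("sports", ["足球", "球员", "教练"]),
    ("military", ["航母", "服役"]),
])
print(nb.classify(["球员", "足球"]))
```

Working in log space avoids numeric underflow when many small probabilities are multiplied, and the add-one term keeps an unseen word from zeroing out a whole category.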

Performance metrics for text classification:

Classifier+Tokenizer	P	R	F1	Docs/sec
NaiveBayesClassifier+HanLPTokenizer	96.16	96.00	96.08	6172
NaiveBayesClassifier+BigramTokenizer	96.36	96.20	96.28	3378
LinearSVMClassifier+HanLPTokenizer	97.24	97.20	97.22	27777
LinearSVMClassifier+BigramTokenizer	97.83	97.80	97.81	12195
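As a quick sanity check on the table, the F1 column can be recomputed from the P and R columns, assuming F1 here is the usual harmonic mean of precision and recall. The helper name f1_score is hypothetical; the numbers are copied from the table.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (P, R) pairs copied from the table above
rows = {
    "NaiveBayesClassifier+HanLPTokenizer": (96.16, 96.00),
    "NaiveBayesClassifier+BigramTokenizer": (96.36, 96.20),
    "LinearSVMClassifier+HanLPTokenizer": (97.24, 97.20),
    "LinearSVMClassifier+BigramTokenizer": (97.83, 97.80),
}
for name, (p, r) in rows.items():
    print(f"{name}: F1 = {f1_score(p, r):.2f}")
```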

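Sentiment analysis

Like text classification, sentiment analysis uses a naive Bayes model by default. First we need to prepare training data; here we use Tan Songbo's hotel review corpus, which can be downloaded from: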

http://file.hankcs.com/corpus/ChnSentiCorp.zip