NLTK Study Notes (Part 1)

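Contents

NLTK Study Notes (Part 1)
- 1. Overview
- 2. NLTK Corpora
  - 2.1 Corpus Access API
- 3. Word and Sentence Tokenization
- 4. Word Frequency Counting
- 5. Word Distribution
- 6. Part-of-Speech Tagging
- 7. Removing Stop Words
- 8. WordNet in NLTK
- 9. Text Preprocessing
  - 9.1 Stemming
  - 9.2 Lemmatization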

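NLTK, short for Natural Language Toolkit, is a Python library widely used in NLP research. It was developed in Python by Steven Bird and Edward Loper at the University of Pennsylvania and has grown to more than 100,000 lines of code. It is an open-source project that bundles data sets, Python modules, and tutorials, and it makes tasks such as tokenization, part-of-speech tagging, named entity recognition, and syntactic parsing easy to carry out.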

NLTK design goals:

Simplicity;
Consistency;
Extensibility;
Modularity.

1. Overview

NLTK consists mainly of the following modules, each covering a group of language processing tasks:

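Language processing task	NLTK modules	Functionality
Accessing and processing corpora	nltk.corpus	Standardized interfaces to corpora and lexicons
String processing	nltk.tokenize, nltk.stem	Word tokenizers, sentence tokenizers, stemmers
Collocation discovery	nltk.collocations	t-test, chi-squared (χ²), pointwise mutual information (PMI)
Part-of-speech tagging	nltk.tag	n-gram, backoff, Brill, HMM, TnT
Classification	nltk.classify, nltk.cluster	Decision tree, maximum entropy, naive Bayes, EM, k-means
Chunking	nltk.chunk	Regular expression, n-gram, named entity
Parsing	nltk.parse	Chart, feature-based, unification, probabilistic, dependency
Semantic interpretation	nltk.sem, nltk.inference	λ-calculus, first-order logic, model checking
Evaluation metrics	nltk.metrics	Precision, recall, agreement coefficients
Probability and estimation	nltk.probability	Frequency distributions, smoothed probability distributions
Applications	nltk.app, nltk.chat	Graphical concordancer, parsers, WordNet browser, chatbots
Linguistic fieldwork	nltk.toolbox	Manipulating data in SIL Toolbox format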

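2. NLTK Corpora

The nltk.corpus package provides access to several kinds of ready-made corpora, some of which are listed in the table below. Each corpus has to be downloaded once, for example with nltk.download('reuters'), before it can be loaded.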

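Corpus	Description
gutenberg	A small selection of texts from Project Gutenberg, mostly classic literature
webtext	Informal web text, such as forum posts and personal advertisements
nps_chat	More than 10,000 posts from instant-messaging chat sessions
brown	A one-million-word English corpus, categorized by genre
reuters	Reuters news corpus: more than 10,000 news documents, over one million words, 90 topics, split into a training set and a test set
inaugural	A speech corpus of several dozen texts, all US presidential inaugural addresses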

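2.1 Corpus Access API

Method	Description
fileids()	Returns the list of file names in the corpus
fileids(categories=[c1, c2])	Returns the file names that belong to the given categories
raw(fileids=[f1, f2])	Returns the raw text of the given files as a string
raw(categories=[c1, c2])	Returns the raw text of the given categories
sents(fileids=[f1, f2])	Returns the list of sentences in the given files
sents(categories=[c1, c2])	Returns the list of sentences in the given categories
words(fileids=[f1, f2])	Returns the list of words in the given files
words(categories=[c1, c2])	Returns the list of words in the given categories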

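from nltk.corpus import reuters

# Requires the corpus data: run nltk.download('reuters') once beforehand.
# sents() additionally needs the Punkt sentence tokenizer data.
print(reuters.categories())  # categories of the Reuters corpus
print(len(reuters.sents()))  # number of sentences in the Reuters corpus
print(len(reuters.words()))  # number of words in the Reuters corpus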
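The same access methods are available on corpora built from your own files. The following is a minimal sketch, assuming NLTK is installed; the file names and contents are made up for illustration, and PlaintextCorpusReader is used here only because it needs no downloaded data for fileids(), raw(), and words().

```python
import os
import tempfile

from nltk.corpus.reader import PlaintextCorpusReader


def build_demo_corpus(root):
    """Write two hypothetical text files into root and wrap them in a corpus reader."""
    files = {
        "a.txt": "NLTK reads plain text. It is simple.",
        "b.txt": "A second file.",
    }
    for name, content in files.items():
        with open(os.path.join(root, name), "w", encoding="utf-8") as handle:
            handle.write(content)
    # Every file matching the regular expression becomes part of the corpus.
    return PlaintextCorpusReader(root, r".*\.txt")


with tempfile.TemporaryDirectory() as root:
    corpus = build_demo_corpus(root)
    file_ids = corpus.fileids()            # file names in the corpus
    raw_a = corpus.raw("a.txt")            # raw text of one file
    words_b = list(corpus.words("b.txt"))  # word list of one file

print(file_ids)
print(raw_a)
print(words_b)
```

Category-based calls such as words(categories=[...]) only work on corpora that define categories, such as brown or reuters.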

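3. Word and Sentence Tokenization

nltk.tokenize is NLTK's tokenization package. Its functions recognize English words and punctuation and split a text into sentences or words.

sent_tokenize is the sentence-splitting function in tokenize and returns the sentences of a text. It is called as sent_tokenize(text, language='english') and generally splits at sentence-final punctuation such as '.' and '?'.
word_tokenize is the word-splitting function in tokenize and returns the tokens of a text. It is called as word_tokenize(text, language='english') and generally splits at whitespace and at punctuation inside a sentence such as ','.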

import nltk
import numpy as np

sentence = ("In a new study, women researchers used lava flow records, along with sedimentary and Antarctic ice core data, "
            "to examine that event. They found that the reversal took about as long as many scientists previously "
            "believed it did, just a few thousand years. In a new study, researchers used lava flow records, along with "
            "sedimentary and Antarctic ice core data, to examine that event. ")

tokens = nltk.sent_tokenize(sentence)  # split into sentences
print('sent_tokenize:')
print(np.array(tokens))
tokens2 = nltk.word_tokenize(sentence)  # split into words and punctuation
print('word_tokenize:')
print(tokens2)

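TreebankWordTokenizer follows the conventions of the Penn Treebank corpus and splits contractions apart.
PunktWordTokenizer splits on punctuation, and every word is retained. It is only available in older NLTK releases; in current versions the punctuation-based tokenizer is WordPunctTokenizer, which turns every run of punctuation into its own token.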
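The difference between the two splitting strategies is easiest to see on a contraction. This is a small sketch, assuming a recent NLTK release; it uses WordPunctTokenizer as the punctuation-based tokenizer, and the sample sentence is chosen only for illustration. Neither class needs downloaded model data.

```python
from nltk.tokenize import TreebankWordTokenizer, WordPunctTokenizer

text = "Can't is a contraction."

# Penn Treebank convention: the contraction is split into "Ca" and "n't".
treebank_tokens = TreebankWordTokenizer().tokenize(text)

# Punctuation-based splitting: the apostrophe becomes a token of its own.
wordpunct_tokens = WordPunctTokenizer().tokenize(text)

print(treebank_tokens)   # ['Ca', "n't", 'is', 'a', 'contraction', '.']
print(wordpunct_tokens)  # ['Can', "'", 't', 'is', 'a', 'contraction', '.']
```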

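4. Word Frequency Counting

In NLTK, word frequency counting is implemented by the FreqDist class. It records how many times each word occurs and can present the counts as a table or a plot. Its structure is simple: it is a dictionary-like counter (in NLTK 3, a subclass of collections.Counter).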

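Method	Purpose
B()	Returns the number of distinct samples (the size of the dictionary)
plot(n, title='', cumulative=False)	Plots the frequencies of the n most frequent samples; if cumulative is True, plots cumulative frequencies instead
tabulate()	Prints the frequency distribution as a table
most_common(n)	Returns the n most frequent samples with their counts
hapaxes()	Returns the samples that occur exactly once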

import nltk

text = open('demo.txt').read()
fdist = nltk.FreqDist(nltk.word_tokenize(text))
fdist.plot(30, cumulative=True)  # cumulative plot of the 30 most frequent tokens
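The non-plotting methods can be tried on a short token list without any input file. A minimal sketch, with a made-up sentence split on whitespace; N(), which returns the total number of tokens counted, is a FreqDist method that the table above does not list.

```python
from nltk import FreqDist

tokens = "the cat sat on the mat and the dog sat too".split()
fdist = FreqDist(tokens)

distinct = fdist.B()            # number of distinct tokens
total = fdist.N()               # total number of tokens counted
top_two = fdist.most_common(2)  # the two most frequent tokens with counts
once = fdist.hapaxes()          # tokens that occur exactly once

print(distinct, total)  # 8 11
print(top_two)          # [('the', 3), ('sat', 2)]
print(once)
```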

5. Word Distribution

A dispersion plot shows the positions at which the specified words occur across a text.

import nltk

words = open('demo.txt').read()
text = nltk.text.Text(nltk.word_tokenize(words))
# Plotting requires matplotlib.
text.dispersion_plot(["time", "about", "field", "magnetic", "records", "underway"])
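The information a dispersion plot draws is simply the token offset of every occurrence of each target word. The helper below is a hypothetical pure-Python sketch of that idea, not NLTK's implementation, and the sample sentence is made up.

```python
def word_offsets(tokens, targets):
    """Map each target word to the token positions at which it occurs."""
    offsets = {word: [] for word in targets}
    for position, token in enumerate(tokens):
        if token in offsets:
            offsets[token].append(position)
    return offsets


tokens = "time flies and records show that time passes".split()
positions = word_offsets(tokens, ["time", "records", "field"])
print(positions)  # {'time': [0, 6], 'records': [3], 'field': []}
```

Each list corresponds to one row of marks in the plot; a word that never occurs, such as field here, leaves its row empty.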

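6. Part-of-Speech Tagging

Part-of-speech (POS) tagging is a way of analyzing the components of a sentence: it identifies the part of speech of each word. The table below lists common tags. Note that nltk.pos_tag returns Penn Treebank tags by default (for example JJ for adjectives and RB for adverbs), so the simplified tags in the table will not all appear in its output.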

Tag	Part of speech	Examples
ADJ	Adjective	new, good, high, special, big, local
ADV	Adverb	really, already, still, early, now
CONJ	Conjunction	and, or, but, if, while, although
DET	Determiner	the, a, some, most, every, no
EX	Existential	there, there's
MOD	Modal verb	will, can, would, may, must, should
NN	Noun	year, home, costs, time
NNP	Proper noun	April, China, Washington
NUM	Numeral	fourth, 2016, 09:30
PRON	Pronoun	he, they, us
P	Preposition	on, over, with, of
TO	The word "to"	to
UH	Interjection	ah, ha, oops
VB	Verb (base form)	make, say, go
VBD	Verb, past tense	made, said, went
VBG	Present participle	going, lying, playing
VBN	Past participle	taken, given, gone
WH	Wh-word	who, where, when, what

import nltk

sentence = ("They found that the reversal took about as long as many scientists previously believed it did, "
            "just a few thousand years.")
tokens = nltk.word_tokenize(sentence)
tagged_sent = nltk.pos_tag(tokens)  # list of (word, tag) pairs
print(tagged_sent)
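nltk.pos_tag relies on a pretrained model that has to be downloaded. To see how the n-gram and backoff taggers in nltk.tag behave without any download, the sketch below trains a UnigramTagger on a single hand-tagged sentence; the training data is a toy assumption and far too small for real use.

```python
import nltk

# Hypothetical toy training data: one hand-tagged sentence (Penn Treebank tags).
train_sents = [[("the", "DT"), ("reversal", "NN"), ("took", "VBD"),
                ("a", "DT"), ("few", "JJ"), ("years", "NNS")]]

# Words the unigram tagger has never seen fall back to the default tag NN.
default_tagger = nltk.DefaultTagger("NN")
unigram_tagger = nltk.UnigramTagger(train_sents, backoff=default_tagger)

tagged = unigram_tagger.tag(["the", "reversal", "took", "centuries"])
print(tagged)
# [('the', 'DT'), ('reversal', 'NN'), ('took', 'VBD'), ('centuries', 'NN')]
```

The word centuries does not occur in the training sentence, so its tag comes from the backoff tagger rather than from learned counts.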

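7. Removing Stop Words

After simple tokenization, a text still contains many common words that carry little meaning of their own, such as a, the, and he. They occur so frequently that almost every page contains them: if a search engine indexed them as keywords, every site would match and the keywords would have no discriminating power. Such words are therefore usually removed and not used as keywords. NLTK ships an English stop word list that can be used directly.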

import nltk

sentence = ("In a new study, women researchers used lava flow records, along with sedimentary and Antarctic ice core data, "
            "to examine that event. They found that the reversal took about as long as many scientists previously "
            "believed it did, just a few thousand years. In a new study, researchers used lava flow records, along with "
            "sedimentary and Antarctic ice core data, to examine that event. ")

tokens = nltk.word_tokenize(sentence)
stops = set(nltk.corpus.stopwords.words('english'))  # requires nltk.download('stopwords')
tokens = [word for word in tokens if word.lower() not in stops]
print(tokens)
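Filtering against the stop word list alone leaves punctuation tokens such as ',' and '.' in the result. The sketch below also drops non-alphabetic tokens. It assumes a deliberately tiny, hand-written stop word set (DEMO_STOPS is hypothetical and stands in for the NLTK list) and uses WordPunctTokenizer so that it runs without downloaded data.

```python
from nltk.tokenize import WordPunctTokenizer

# Hypothetical, deliberately tiny stop word set used only for illustration;
# in practice use nltk.corpus.stopwords.words('english').
DEMO_STOPS = {"a", "in", "the", "to", "that", "they", "it", "as"}


def content_words(text, stops=DEMO_STOPS):
    """Lowercase and tokenize text, then drop stop words and non-alphabetic tokens."""
    tokens = WordPunctTokenizer().tokenize(text.lower())
    return [token for token in tokens if token.isalpha() and token not in stops]


result = content_words("In a new study, researchers used lava flow records.")
print(result)  # ['new', 'study', 'researchers', 'used', 'lava', 'flow', 'records']
```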

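8. WordNet in NLTK

WordNet is a lexical database built for natural language processing. For the words it covers, it provides sets of synonyms (synsets), each with a short definition.

WordNet can be used to look up the definition and example sentences of a given word
- Usage: wordnet.synsets(word)[i].definition(), wordnet.synsets(word)[i].examples()
WordNet can be used to obtain synonyms
- Usage: lemma.name() for each lemma in wordnet.synsets(word)[i].lemmas()
WordNet can be used to obtain antonyms
- Usage: lemma.antonyms()[j].name() for each lemma in wordnet.synsets(word)[i].lemmas()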

from nltk.corpus import wordnet  # requires nltk.download('wordnet')

syn = wordnet.synsets("dynamic")
print("Definition:", syn[0].definition())
print("Examples:", syn[0].examples())

# Synonyms: the lemma names of the first synset
synonyms = []
for lemma in syn[0].lemmas():
    synonyms.append(lemma.name())
print("Synonyms:", synonyms)

# Antonyms: collected from the lemmas of all synsets
antonyms = []
for ss in syn:
    for lemma in ss.lemmas():
        if lemma.antonyms():
            antonyms.append(lemma.antonyms()[0].name())
print("Antonyms:", antonyms)

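9. Text Preprocessing

After a corpus has been obtained, NLP work usually starts with text preprocessing. For English, preprocessing with NLTK includes tokenization, stop word removal, and stemming. For stop word removal, nltk.corpus includes a stopwords corpus. For stemming, NLTK provides the Porter and Lancaster stemmers. In addition, it provides WordNetLemmatizer for lemmatization: its lemmatize() function takes the word as its first argument and the word's part of speech as its second.
Stemmers are usually implemented as grammar-based rules written as regular expressions; they cover a wide range of words but are rigid. Lemmatizers rely on a dictionary instead, so they are somewhat slower, and their coverage depends on the size of the dictionary.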
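The rule-based approach and its rigidity can be illustrated with NLTK's RegexpStemmer, which strips whatever matches a regular expression. This is a small sketch; the suffix pattern and the word list are chosen only for illustration and are not the rules used by the Porter or Lancaster stemmers.

```python
from nltk.stem import RegexpStemmer

# Strip a few common suffixes, but only from words of at least 4 characters.
stemmer = RegexpStemmer('ing$|s$|e$|able$', min=4)

stems = {word: stemmer.stem(word) for word in ["cars", "mass", "was", "compute"]}
print(stems)  # {'cars': 'car', 'mass': 'mas', 'was': 'was', 'compute': 'comput'}
```

The rule that correctly turns cars into car also damages mass, which is the kind of mistake a dictionary lookup avoids.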

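9.1 Stemming

Stemming is one of the main operations in text preprocessing: it strips prefixes and suffixes from a word to obtain its stem. Stemming is rule-based and relatively simple, and the resulting stem is not necessarily a complete, meaningful word.

Usage: nltk.PorterStemmer().stem(token) / nltk.LancasterStemmer().stem(token)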

import nltk

sentence = ("In a new study, women researchers used lava flow records, along with sedimentary and Antarctic ice core data, "
            "to examine that event. They found that the reversal took about as long as many scientists previously "
            "believed it did, just a few thousand years. In a new study, researchers used lava flow records, along with "
            "sedimentary and Antarctic ice core data, to examine that event. ")

tokens = nltk.word_tokenize(sentence)
porter = nltk.PorterStemmer()
lancaster = nltk.LancasterStemmer()

print("sentence: " + sentence)
print("PorterStemmer: ")
print([porter.stem(t) for t in tokens])
print("LancasterStemmer: ")
print([lancaster.stem(t) for t in tokens])

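The Porter and Lancaster stemmers strip affixes according to their own rule sets. In this example the Porter stemmer handles the word women correctly, while the Lancaster stemmer truncates it unnecessarily.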
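To compare the two stemmers on individual words rather than on a whole sentence, a side-by-side listing is convenient. A minimal sketch; the word list is picked from the sample sentence for illustration, and neither stemmer needs downloaded data.

```python
import nltk

porter = nltk.PorterStemmer()
lancaster = nltk.LancasterStemmer()

words = ["women", "researchers", "records"]

# One (word, Porter stem, Lancaster stem) triple per input word.
comparison = [(word, porter.stem(word), lancaster.stem(word)) for word in words]
for word, porter_stem, lancaster_stem in comparison:
    print(f"{word:12} porter={porter_stem:12} lancaster={lancaster_stem}")
```

The Lancaster rules are generally the more aggressive of the two, which is why it shortens women while Porter leaves it unchanged.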

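9.2 Lemmatization

Lemmatization is an important part of text preprocessing and is closely related to stemming. It removes inflectional affixes and returns the base form of a word, which is normally a dictionary word; stemming, by contrast, may produce a form that does not appear in the dictionary. For example, the lemma of "cars" is "car" and the lemma of "ate" is "eat".

Usage: nltk.WordNetLemmatizer().lemmatize(token, pos)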

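import nltk
from nltk.corpus import wordnet


def get_wordnet_pos(tag):
    # Convert a Penn Treebank tag into the matching WordNet part of speech
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        return None


# Same sample text as in section 9.1
sentence = ("In a new study, women researchers used lava flow records, along with sedimentary and Antarctic ice core data, "
            "to examine that event. They found that the reversal took about as long as many scientists previously "
            "believed it did, just a few thousand years. In a new study, researchers used lava flow records, along with "
            "sedimentary and Antarctic ice core data, to examine that event. ")

tokens = nltk.word_tokenize(sentence)
tagged_sent = nltk.pos_tag(tokens)
wnl = nltk.WordNetLemmatizer()

print("WordNetLemmatizer:")
# Fall back to NOUN when the tag has no WordNet counterpart
print([wnl.lemmatize(t[0], get_wordnet_pos(t[1]) or wordnet.NOUN) for t in tagged_sent])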
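Why the part of speech matters for a dictionary-based lemmatizer can be shown with a toy lookup table. The lexicon and the toy_lemmatize helper below are hypothetical and only mimic the idea; they are not WordNet data or NLTK's implementation. Like WordNetLemmatizer.lemmatize, the helper treats a word as a noun when no part of speech is given.

```python
# Hypothetical toy lexicon keyed by (word, part of speech); not WordNet data.
TOY_LEXICON = {
    ("cars", "n"): "car",
    ("ate", "v"): "eat",
    ("saw", "v"): "see",
    ("saw", "n"): "saw",
}


def toy_lemmatize(word, pos="n"):
    """Return the lexicon entry for (word, pos), or the word itself if unknown."""
    return TOY_LEXICON.get((word.lower(), pos), word)


lemmas = [
    toy_lemmatize("cars"),
    toy_lemmatize("ate", "v"),
    toy_lemmatize("saw", "v"),
    toy_lemmatize("saw", "n"),
]
print(lemmas)  # ['car', 'eat', 'see', 'saw']
```

The same surface form saw maps to different lemmas depending on its tag, and a word missing from the dictionary is returned unchanged, which is why coverage depends on dictionary size.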