Common NLTK Operations and Corpora

My original post: http://blog.hijerry.cn/p/22281.html

Installing NLTK

Follow the official instructions: Installing NLTK

Once NLTK is installed, printing the package docstring shows which version you have:

import nltk
print(nltk.__doc__)

Output:
The Natural Language Toolkit (NLTK) is an open source Python library
for Natural Language Processing.  A free online book is available.
(If you use the library for academic research, please cite the book.)
Steven Bird, Ewan Klein, and Edward Loper (2009).
Natural Language Processing with Python.  O'Reilly Media Inc.
http://nltk.org/book
@version: 3.2.5

So my version is 3.2.5 (the same value is available directly as nltk.__version__).

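Getting the book collection

To make learning easier, NLTK provides a book collection. Download it with:

nltk.download()

In the window that appears, select book and download it. The download is roughly 100 MB. This is the collection recommended by the book Natural Language Processing with Python.

When the download finishes, return to the Python interpreter and import the collection:

from nltk.book import *

Run texts() to list the texts and sents() to list the sentences.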

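Reading book.py shows that it imports the following:

several corpora from nltk.corpus
the Text class from nltk.text
the FreqDist class from nltk.probability
the bigrams function from nltk.util

It also builds nine Text instances from the corpora and binds them to text1 through text9.

sent1 through sent9 are nine hard-coded sentences.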

Introduction to the corpora

In book.py we can see:

from nltk.corpus import (gutenberg, genesis, inaugural, nps_chat, webtext, treebank, wordnet)

So the corpora are already imported. Printing one of them shows what it is:

print(gutenberg)

Output:
<PlaintextCorpusReader in u'/Users/jerry/nltk_data/corpora/gutenberg'>

So gutenberg is an instance of PlaintextCorpusReader.

Browsing to that directory shows all of the Gutenberg texts, which are plain .txt files:

cd /Users/jerry/nltk_data/corpora/gutenberg
ls

Output:
README                  burgess-busterbrown.txt milton-paradise.txt
austen-emma.txt         carroll-alice.txt       shakespeare-caesar.txt
austen-persuasion.txt   chesterton-ball.txt     shakespeare-hamlet.txt
austen-sense.txt        chesterton-brown.txt    shakespeare-macbeth.txt
bible-kjv.txt           chesterton-thursday.txt whitman-leaves.txt
blake-poems.txt         edgeworth-parents.txt
bryant-stories.txt      melville-moby_dick.txt

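The Gutenberg corpus

Project Gutenberg hosts about 36,000 free e-books, and NLTK includes only a small selection of them. The following lists the file identifiers (file names) in the corpus:

from nltk.corpus import gutenberg
gutenberg.fileids()

Output: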
[u'austen-emma.txt', u'austen-persuasion.txt', u'austen-sense.txt', u'bible-kjv.txt', u'blake-poems.txt', u'bryant-stories.txt', u'burgess-busterbrown.txt', u'carroll-alice.txt', u'chesterton-ball.txt', u'chesterton-brown.txt', u'chesterton-thursday.txt', u'edgeworth-parents.txt', u'melville-moby_dick.txt', u'milton-paradise.txt', u'shakespeare-caesar.txt', u'shakespeare-hamlet.txt', u'shakespeare-macbeth.txt', u'whitman-leaves.txt']

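Web and chat text

Project Gutenberg represents established literature. Because less formal language matters too, NLTK also includes a small collection of web text, such as forum posts and the movie script of Pirates of the Caribbean.

from nltk.corpus import webtext
webtext.fileids()

Output: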
[u'firefox.txt', u'grail.txt', u'overheard.txt', u'pirates.txt', u'singles.txt', u'wine.txt']

There is also a corpus of instant-messaging chat sessions, collected at the US Naval Postgraduate School.

from nltk.corpus import nps_chat
nps_chat.fileids()

Output:
[u'10-19-20s_706posts.xml', u'10-19-30s_705posts.xml', u'10-19-40s_686posts.xml', u'10-19-adults_706posts.xml', u'10-24-40s_706posts.xml', u'10-26-teens_706posts.xml', u'11-06-adults_706posts.xml', u'11-08-20s_705posts.xml', u'11-08-40s_706posts.xml', u'11-08-adults_705posts.xml', u'11-08-teens_706posts.xml', u'11-09-20s_706posts.xml', u'11-09-40s_706posts.xml', u'11-09-adults_706posts.xml', u'11-09-teens_706posts.xml']

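The Brown corpus

The Brown Corpus was the first million-word electronic corpus of English. It contains text from 500 sources, categorized by genre, such as news and editorial. The full list of categories can be printed with:

from nltk.corpus import brown
brown.categories()

Output: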
[u'adventure', u'belles_lettres', u'editorial', u'fiction', u'government', u'hobbies', u'humor', u'learned', u'lore', u'mystery', u'news', u'religion', u'reviews', u'romance', u'science_fiction']

Note that the corpus is organized by genre. For example, to get the text in the news category:

print(brown.words(categories='news'))

Output:
[u'The', u'Fulton', u'County', u'Grand', u'Jury', ...]

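Of course, you can also list every file identifier with fileids() and fetch text by file identifier, but fetching by category is usually the better choice.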

Custom corpora

For example, create a texts directory containing one file in your home directory:

cd ~ && mkdir texts && cd texts
echo "This is a love story" >> lovestory.txt
pwd

Output:
/Users/jerry/texts

Assign the path printed above to the corpus_root variable, then enter the following in the Python interpreter:

from nltk.corpus import PlaintextCorpusReader
corpus_root = '/Users/jerry/texts'
my_corpus = PlaintextCorpusReader(corpus_root, r'.*\.txt')
print(my_corpus.fileids())

Output:
['lovestory.txt']

Tokenize it into words:

print(my_corpus.words('lovestory.txt'))

Output:
[u'This', u'is', u'a', u'love', u'story']
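
The second argument to PlaintextCorpusReader is a regular expression that selects files under the corpus root. The selection step can be sketched with the standard library alone. This is only an illustration: find_fileids is a hypothetical helper, not an NLTK function, and the files are written to a temporary directory rather than to ~/texts.

```python
import os
import re
import tempfile


def find_fileids(root, pattern):
    """Return sorted '/'-separated relative paths under `root` that fully match `pattern`."""
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            rel = os.path.relpath(os.path.join(dirpath, name), root)
            rel = rel.replace(os.sep, '/')
            if re.fullmatch(pattern, rel):
                matches.append(rel)
    return sorted(matches)


# Build a scratch corpus directory (assumption: a temp dir stands in for ~/texts).
corpus_root = tempfile.mkdtemp()
with open(os.path.join(corpus_root, 'lovestory.txt'), 'w') as f:
    f.write('This is a love story\n')
with open(os.path.join(corpus_root, 'notes.md'), 'w') as f:
    f.write('not part of the corpus\n')

fileids = find_fileids(corpus_root, r'.*\.txt')
print(fileids)
```

Only lovestory.txt is selected, because notes.md does not match the pattern.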

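The PlaintextCorpusReader class

Official documentation: nltk.corpus.reader.plaintext.PlaintextCorpusReader

Description: a file-system-based reader for loading plain-text documents.

Source file: corpus/reader/plaintext.py, roughly lines 23 to 138

Tokenizing words

Signature: words(fileids=None)

The words method splits the files into tokens (called words here):

words = gutenberg.words('shakespeare-macbeth.txt')
print(words)

Output: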
[u'[', u'The', u'Tragedie', u'of', u'Macbeth', u'by', ...]

Punctuation marks are split out as separate tokens.
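
Punctuation ends up as separate tokens because, as far as I know, the reader's default word tokenizer is WordPunctTokenizer, which is built on the regular expression \w+|[^\w\s]+. A minimal stdlib sketch of that behavior (word_punct_tokenize is a name chosen here for illustration, not NLTK's API):

```python
import re

# A run of word characters is one token; a run of punctuation is another.
WORD_PUNCT = re.compile(r'\w+|[^\w\s]+')


def word_punct_tokenize(text):
    """Split text into alphanumeric tokens and punctuation tokens."""
    return WORD_PUNCT.findall(text)


tokens = word_punct_tokenize('[The Tragedie of Macbeth by William Shakespeare 1603]')
print(tokens)
```

The brackets come out as their own tokens, just as in the corpus output.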

Splitting sentences

Signature: sents(fileids=None)

sents = gutenberg.sents('shakespeare-macbeth.txt')
print(sents)

Output:
[[u'[', u'The', u'Tragedie', u'of', u'Macbeth', u'by', u'William', u'Shakespeare', u'1603', u']'], [u'Actus', u'Primus', u'.'], ...]

Each sentence is a list of strings. Joining the strings with spaces gives the sentence back:

print(' '.join(sents[0]))

Output:
[ The Tragedie of Macbeth by William Shakespeare 1603 ]

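Splitting paragraphs

Signature: paras(fileids=None)

A paragraph here is a natural paragraph: a blank line separates one paragraph from the next.

paras = gutenberg.paras('shakespeare-macbeth.txt')
print(' '.join(paras[0][0]))  # first paragraph, first sentence

Output: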
[ The Tragedie of Macbeth by William Shakespeare 1603 ]
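
The blank-line rule for paragraphs can be illustrated without NLTK. The sketch below assumes a short hand-typed sample in place of the real file, and split_paragraphs is a hypothetical helper; it does not split paragraphs further into sentences and words the way paras() does.

```python
import re

# Hand-typed sample (assumption): three paragraphs separated by blank lines.
raw_text = (
    "[The Tragedie of Macbeth by William Shakespeare 1603]\n"
    "\n\n"
    "Actus Primus. Scoena Prima.\n"
    "\n"
    "Thunder and Lightning. Enter three Witches.\n"
)


def split_paragraphs(text):
    """Split on one or more blank lines and drop empty chunks."""
    chunks = re.split(r'\n\s*\n', text)
    return [chunk.strip() for chunk in chunks if chunk.strip()]


paragraphs = split_paragraphs(raw_text)
for number, paragraph in enumerate(paragraphs, 1):
    print(number, paragraph)
```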

Loading the raw text

Signature: raw(fileids=None)

raw_text = gutenberg.raw('shakespeare-macbeth.txt')
print(raw_text[:100])

Output:
[The Tragedie of Macbeth by William Shakespeare 1603]
Actus Primus. Scoena Prima.
Thunder and Lig

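The Text class

Official documentation: nltk.text.Text

Description: a container for a sequence of tokens (words) that also provides common text-analysis methods.

Source file: text.py, roughly lines 264 to 395

Creating a Text

Signature: __init__(self, tokens, name=None)

The text1 variable is in fact created with the following statement:

text1 = Text(gutenberg.words('melville-moby_dick.txt'))
print(text1)

Output: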
<Text: Moby Dick by Herman Melville 1851>

Printing the result of the words method shows what is passed in:

print(gutenberg.words('melville-moby_dick.txt'))

Output:
[u'[', u'Moby', u'Dick', u'by', u'Herman', u'Melville', ...]

So the words method of gutenberg returns a list-like sequence of strings.

Finding a word in context

Signature: concordance(word, width=79, lines=25)

See also: the ConcordanceIndex class

text1.concordance('monstrous', 30, 5)

Output:
Displaying 5 of 11 matches:
of a most monstrous size . ..
hing that monstrous bulk of t
array of monstrous clubs and
ered what monstrous cannibal
od ; most monstrous and most

Here I asked for only 5 lines, each 30 characters wide.

If you look closely, each line has only 29 characters. The print_concordance method in text.py shows why.
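
The 29 characters appear to come from integer division. Assuming print_concordance in NLTK 3.2.5 computes the context on each side as (width - len(word) - 2) // 2 (this is my reading of the source and worth checking against your installed version), a width of 30 and the 9-letter word monstrous give 9 characters per side, and 9 + 1 + 9 + 1 + 9 = 29. The sketch below is a reconstruction under that assumption, not NLTK's code; unlike NLTK it matches the word case-sensitively.

```python
def concordance_lines(tokens, word, width=79, lines=25):
    """Return fixed-width context lines around each occurrence of `word`."""
    # Characters of context per side; the 2 accounts for the spaces around the word.
    half_width = (width - len(word) - 2) // 2
    context = width // 4  # rough number of neighbouring tokens to look at
    result = []
    for i, token in enumerate(tokens):
        if token != word:
            continue
        if len(result) >= lines:
            break
        # Left-pad so that the keyword always starts in the same column.
        left = ' ' * half_width + ' '.join(tokens[max(0, i - context):i])
        right = ' '.join(tokens[i + 1:i + context])
        result.append(left[-half_width:] + ' ' + word + ' ' + right[:half_width])
    return result


# Made-up sample tokens (assumption), not the Moby Dick text.
tokens = 'a whale of a most monstrous size . . . was seen near the ship'.split()
for line in concordance_lines(tokens, 'monstrous', width=30, lines=5):
    print(repr(line), len(line))
```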

Collocations

Signature: collocations(num=20, window_size=2)

text1.collocations()

Output:
Sperm Whale; Moby Dick; White Whale; old man; Captain Ahab; sperm
whale; Right Whale; Captain Peleg; New Bedford; Cape Horn; cried Ahab;
years ago; lower jaw; never mind; Father Mapple; cried Stubb; chief
mate; white whale; ivory leg; one hand

As you can see, this method prints multi-word expressions that tend to occur together in the text, such as old man and years ago above.
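
The idea behind collocations can be approximated by counting adjacent word pairs. Note the simplification: the sketch ranks pairs by raw frequency, whereas NLTK scores candidates with a statistical association measure, so the two will not produce the same ranking on real text. The stopword set and frequent_bigrams are stand-ins invented for this example.

```python
from collections import Counter

# A tiny stand-in stopword list (assumption); NLTK ships a much larger one.
STOPWORDS = {'the', 'a', 'of', 'and', 'was', 'by', 'said', 'his'}


def frequent_bigrams(tokens, num=20, min_count=2):
    """Rank adjacent word pairs by raw count, skipping stopwords and short tokens."""
    def keep(word):
        return len(word) >= 3 and word.lower() not in STOPWORDS

    pairs = Counter(
        (first, second)
        for first, second in zip(tokens, tokens[1:])
        if keep(first) and keep(second)
    )
    return [pair for pair, count in pairs.most_common(num) if count >= min_count]


# Made-up sample sentence (assumption).
tokens = ('the old man saw the white whale and the old man said '
          'the white whale was seen by Captain Ahab').split()
for first, second in frequent_bigrams(tokens):
    print(first, second)
```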

Similar words

Signature: similar(word, num=20)

See also: the ContextIndex.similar_word method

text1.similar('monstrous', 5)

Output:
imperial subtly impalpable pitiable curious

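This method looks at the contexts in which the given word is used and returns words used in similar contexts, with the most similar words first.

The num parameter limits how many words are returned; the call above shows only 5.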
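
One way to picture "used in similar contexts" is to treat the pair (previous word, next word) as a context and rank other words by how many contexts they share with the target. The following is a rough sketch of that idea on made-up tokens; build_contexts and similar_words are illustrative names, and the scoring is an assumption rather than NLTK's exact method.

```python
from collections import Counter, defaultdict


def build_contexts(tokens):
    """Map each lowercased word to a Counter of its (previous, next) neighbours."""
    contexts = defaultdict(Counter)
    padded = ['*START*'] + [t.lower() for t in tokens] + ['*END*']
    for prev, word, nxt in zip(padded, padded[1:], padded[2:]):
        contexts[word][(prev, nxt)] += 1
    return contexts


def similar_words(tokens, word, num=20):
    """Rank other words by how many distinct contexts they share with `word`."""
    contexts = build_contexts(tokens)
    target = set(contexts[word.lower()])
    scores = Counter()
    for other, seen in contexts.items():
        shared = target & set(seen)
        if other != word.lower() and shared:
            scores[other] = len(shared)
    return [w for w, _ in scores.most_common(num)]


# Made-up sample (assumption): 'curious' appears in the same slots as 'monstrous'.
tokens = ('a monstrous size . a curious size . a monstrous bulk . '
          'a curious bulk . a large ship').split()
print(similar_words(tokens, 'monstrous', 5))
```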

Common contexts

Signature: common_contexts(words, num=20)

See also: the ContextIndex.common_contexts method

text2.common_contexts(['monstrous', 'very'])

Output:
a_pretty is_pretty a_lucky am_glad be_glad

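This gives the contexts shared by the two words. In other words, either monstrous or very can fill the slot marked by the underscore, and a_pretty is the most frequent shared context.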
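
Shared contexts can be computed by collecting the (previous, next) neighbours of each target word and intersecting the sets. A minimal sketch on a made-up sentence; this common_contexts is a standalone function written for illustration, and ranking by total count is an assumption.

```python
from collections import Counter


def common_contexts(tokens, words, num=20):
    """Return 'prev_next' contexts in which every word in `words` occurs."""
    targets = [w.lower() for w in words]
    lowered = [t.lower() for t in tokens]
    seen = {w: Counter() for w in targets}
    for prev, word, nxt in zip(lowered, lowered[1:], lowered[2:]):
        if word in seen:
            seen[word][(prev, nxt)] += 1
    shared = set.intersection(*(set(c) for c in seen.values()))
    # Rank shared contexts by their total count across all target words.
    totals = Counter({ctx: sum(seen[w][ctx] for w in targets) for ctx in shared})
    return ['_'.join(ctx) for ctx, _ in totals.most_common(num)]


# Made-up sample (assumption).
tokens = ('she is very pretty and he is monstrous pretty . I am very glad , '
          'I am monstrous glad , a very pretty girl').split()
print(sorted(common_contexts(tokens, ['monstrous', 'very'])))
```

Here a_pretty is not reported, because only very occurs in that context in the sample.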

Word dispersion

Signature: dispersion_plot(words)

text4.dispersion_plot(['citizens', 'democracy', 'freedom', 'duties', 'America'])

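text4 is the text of the US presidential inaugural addresses. The resulting plot shows where each of these five words occurs over the course of the addresses.

This can be used to study how language use changes over time.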
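
Each mark in a dispersion plot corresponds to one occurrence of a word, positioned by its offset from the start of the text. The data behind such a plot can be sketched without any plotting library; dispersion_points is a hypothetical helper and the tokens are made up.

```python
def dispersion_points(tokens, words):
    """Return (token_offset, word) pairs: the raw data a dispersion plot draws."""
    wanted = set(words)
    return [(offset, token) for offset, token in enumerate(tokens) if token in wanted]


# Made-up sample (assumption), not the inaugural corpus.
tokens = ('fellow citizens we cherish freedom and our duties '
          'to America and to freedom').split()
points = dispersion_points(tokens, ['citizens', 'freedom', 'America'])
for offset, word in points:
    print(offset, word)
```

Because the inaugural addresses are stored in chronological order, a larger offset means a later speech, which is why the plot can reveal trends over time.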

Matching words with regular expressions

Signature: findall(regexp)

text5.findall("<.*><.*><bro>")

Output:
you rule bro; telling you bro; u twizted bro
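
In this pattern syntax each <...> stands for one token, so <.*><.*><bro> means "any two tokens followed by bro". One way to implement this, which I believe is close to how NLTK does it internally, is to wrap every token in angle brackets, join them into one string, and rewrite the pattern so that . cannot cross a token boundary. The sketch below uses made-up tokens and a hypothetical function name.

```python
import re


def token_findall(tokens, pattern):
    """Find token sequences matching a pattern written with <...> token markers."""
    raw = ''.join('<' + token + '>' for token in tokens)
    regexp = re.sub(r'\s', '', pattern)            # whitespace in the pattern is ignored
    regexp = re.sub(r'<', '(?:<(?:', regexp)       # open a non-capturing group per token
    regexp = re.sub(r'>', ')>)', regexp)           # close it again
    regexp = re.sub(r'(?<!\\)\.', '[^>]', regexp)  # '.' must not cross a token boundary
    hits = re.findall(regexp, raw)
    return [hit[1:-1].split('><') for hit in hits]


# Made-up chat tokens (assumption).
tokens = 'hey you rule bro lol telling you bro'.split()
for match in token_findall(tokens, '<.*><.*><bro>'):
    print(' '.join(match))
```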

Counting occurrences of a word

Signature: count(word)

text1.count('I')

Output:
2124

Index of a word

Signature: index(word)

text1.index('I')

Output:
37

Total length of the text

len(text1)

Output:
260819

Building a FreqDist

dist = text1.vocab()
print(dist)

Output:
<FreqDist with 19317 samples and 260819 outcomes>
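
The last few operations behave like ordinary list and counter operations. As far as I can tell, FreqDist in NLTK 3 is a subclass of collections.Counter, so a plain Counter over a made-up token list gives a feel for what "samples" (distinct tokens) and "outcomes" (total tokens) mean. This is an analogy, not NLTK code.

```python
from collections import Counter

# Made-up tokens (assumption), not Moby Dick.
tokens = 'I saw the whale and the whale saw me , I think'.split()

print(tokens.count('I'))   # number of occurrences of a token
print(tokens.index('I'))   # offset of its first occurrence
print(len(tokens))         # total number of tokens

freq = Counter(tokens)
samples = len(freq)            # distinct tokens
outcomes = sum(freq.values())  # total tokens counted
print(samples, outcomes)
print(freq['whale'])           # frequency of a single token
```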

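Plotting the frequency distribution

Signature: plot(*args)

See also: the FreqDist.plot method (FreqDist lives in nltk.probability; the docstring below refers to it as nltk.prob)

As the definition shows, this simply calls the plot method of FreqDist:

def plot(self, *args):
    """
    See documentation for FreqDist.plot()
    :seealso: nltk.prob.FreqDist.plot()
    """
    self.vocab().plot(*args)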