

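  • NLTK Study Notes (Part 1)
    • 1. Overview
    • 2. NLTK Corpora
      • 2.1 Corpus Access API
    • 3. Tokenization and Sentence Splitting
    • 4. Word Frequency Statistics
    • 5. Word Distribution
    • 6. Part-of-Speech Tagging
    • 7. Removing Stop Words
    • 8. WordNet in NLTK
    • 9. Text Preprocessing
      • 9.1 Stemming
      • 9.2 Lemmatization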

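1. Overview

  NLTK (Natural Language Toolkit) is a Python library widely used in NLP research. It was created by Steven Bird and Edward Loper at the University of Pennsylvania and has grown to more than 100,000 lines of code. It is an open-source project that bundles datasets, Python modules, and tutorials, and it makes tasks such as tokenization, part-of-speech tagging, named entity recognition, and syntactic parsing straightforward.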


  NLTK was designed with four goals in mind:

  • Simplicity
  • Consistency
  • Extensibility
  • Modularity



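| Language processing task | NLTK modules | Functionality |
| --- | --- | --- |
| Accessing corpora | nltk.corpus | Standardized interfaces to corpora and lexicons |
| String processing | nltk.tokenize, nltk.stem | Tokenizers, sentence tokenizers, stemmers |
| Collocation discovery | nltk.collocations | t-test, chi-squared, pointwise mutual information (PMI) |
| Part-of-speech tagging | nltk.tag | n-gram, backoff, Brill, HMM, TnT |
| Classification | nltk.classify, nltk.cluster | Decision tree, maximum entropy, naive Bayes, EM, k-means |
| Chunking | nltk.chunk | Regular expression, n-gram, named entity |
| Parsing | nltk.parse | Chart, feature-based, unification, probabilistic, dependency |
| Semantic interpretation | nltk.sem, nltk.inference | Lambda calculus, first-order logic, model checking |
| Evaluation metrics | nltk.metrics | Precision, recall, agreement coefficients |
| Probability and estimation | nltk.probability | Probability distributions, smoothed probability distributions |
| Applications | nltk.app, nltk.chat | Graphical concordancer, parsers, WordNet browser, chatbots |
| Linguistic fieldwork | nltk.toolbox | Manipulating data in SIL Toolbox format |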



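2. NLTK Corpora

| Corpus | Description |
| --- | --- |
| gutenberg | A small selection of texts from Project Gutenberg (an archive that holds tens of thousands of books), mostly classic literature |
| webtext | Text collected from the web, such as online advertisements |
| nps_chat | A corpus of more than ten thousand posts, mainly instant messaging chat |
| brown | A million-word English corpus, categorized by genre |
| reuters | The Reuters corpus: more than ten thousand news documents totaling over a million words, classified into 90 topics and split into a training set and a test set |
| inaugural | A corpus of speeches: a few dozen texts, all US presidential inaugural addresses |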

2.1 Corpus Access API

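| Method | Description |
| --- | --- |
| fileids() | Returns the list of file names in the corpus |
| fileids(categories=[c1, c2]) | Returns the file names that belong to the given categories |
| raw(fileids=[f1, f2]) | Returns the raw text of the given files as a string |
| raw(categories=[c1, c2]) | Returns the raw text of the given categories |
| sents(fileids=[f1, f2]) | Returns the list of sentences in the given files |
| sents(categories=[c1, c2]) | Returns the list of sentences in the given categories |
| words(fileids=[f1, f2]) | Returns the list of words in the given files |
| words(categories=[c1, c2]) | Returns the list of words in the given categories |

    # Requires the corpus data: nltk.download('reuters')
    from nltk.corpus import reuters

    print(reuters.categories())   # categories of the reuters corpus
    print(len(reuters.sents()))   # number of sentences in the reuters corpus
    print(len(reuters.words()))   # number of words in the reuters corpus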



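3. Tokenization and Sentence Splitting

  • sent_tokenize is the sentence-splitting function in nltk.tokenize. It returns the sentences of a text and is called as sent_tokenize(text, language='english'). It generally splits at sentence-final punctuation such as '.' and '?'.
  • word_tokenize is the word-splitting function in nltk.tokenize. It returns the tokens of a text and is called as word_tokenize(text, language='english'). It generally splits on whitespace and on punctuation inside the sentence such as ','.

    import nltk

    sentence = "In a new study, women researchers used lava flow records, along with sedimentary and Antarctic ice core data, " \
               "to examine that event. They found that the reversal took about as long as many scientists previously " \
               "believed it did, just a few thousand years. In a new study, researchers used lava flow records, along with " \
               "sedimentary and Antarctic ice core data, to examine that event. "
    tokens = nltk.sent_tokenize(sentence)    # list of sentences
    tokens2 = nltk.word_tokenize(sentence)   # list of words and punctuation marks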
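To make the two splitting rules described above concrete, here is a minimal sketch that uses only the standard library. It is not how NLTK implements these functions (sent_tokenize relies on a trained Punkt model), and the helper names naive_sent_split and naive_word_split are made up for this illustration.

```python
import re

def naive_sent_split(text):
    # Split after '.', '?' or '!' when the mark is followed by whitespace.
    parts = re.split(r"(?<=[.?!])\s+", text.strip())
    return [p for p in parts if p]

def naive_word_split(text):
    # A token is either a run of word characters or a single punctuation mark.
    return re.findall(r"\w+|[^\w\s]", text)

text = "They examined that event. Was the reversal fast? It took a few thousand years."
sentences = naive_sent_split(text)
words = naive_word_split(sentences[0])
print(sentences)
print(words)
```

A rule this simple breaks on abbreviations such as "Dr. Smith", which is the kind of case the Punkt model behind sent_tokenize is trained to handle.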

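  NLTK also provides other word tokenizers:

  • TreebankWordTokenizer follows the conventions of the Penn Treebank corpus and splits contractions apart (for example, "don't" becomes "do" and "n't").
  • PunktWordTokenizer splits at punctuation marks while keeping every word. It belongs to older NLTK releases and is no longer available in current versions.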
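The difference between splitting contractions and splitting on punctuation can be sketched with the standard library alone. This is a rough illustration, not NLTK code: the real TreebankWordTokenizer applies many more rules, treebank_style and punct_style are hypothetical names, and the pattern in punct_style is the one used by NLTK's WordPunctTokenizer, which is a different tokenizer from the two listed above.

```python
import re

# Clitics that the Penn Treebank convention separates from the preceding word.
CLITIC = re.compile(r"^(.+?)(n't|'s|'re|'ve|'ll|'d|'m)$", re.IGNORECASE)

def treebank_style(text):
    # Whitespace split, then detach a trailing clitic from each token.
    tokens = []
    for raw in text.split():
        match = CLITIC.match(raw)
        if match:
            tokens.extend([match.group(1), match.group(2)])
        else:
            tokens.append(raw)
    return tokens

def punct_style(text):
    # Runs of word characters and runs of punctuation become separate tokens.
    return re.findall(r"\w+|[^\w\s]+", text)

text = "They don't think it's over"
contraction_tokens = treebank_style(text)
punct_tokens = punct_style(text)
print(contraction_tokens)
print(punct_tokens)
```

The first result keeps "n't" and "'s" as meaningful units, while the second breaks each contraction into three pieces around the apostrophe.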



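4. Word Frequency Statistics

  nltk.FreqDist counts how often each token occurs. Its commonly used methods are:

| Method | Purpose |
| --- | --- |
| B() | Returns the number of distinct samples (the size of the vocabulary) |
| plot(n, cumulative=False) | Plots the frequencies of the n most frequent samples; with cumulative=True it plots the cumulative frequency distribution |
| tabulate() | Prints the frequency distribution as a table |
| most_common(n) | Returns the n most frequent samples together with their counts |
| hapaxes() | Returns the samples that occur only once |

    import nltk

    # Read the whole file, then count how often each token occurs
    with open('demo.txt') as f:
        text = f.read()
    fdist = nltk.FreqDist(nltk.word_tokenize(text))
    fdist.plot(30, cumulative=True)   # cumulative plot of the 30 most frequent tokens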
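nltk.FreqDist is built on collections.Counter, so the quantities in the table can be illustrated without NLTK. The sketch below uses a made-up token list and plain Counter operations; it shows what the methods report, not how NLTK implements them.

```python
from collections import Counter
from itertools import accumulate

tokens = "the field reversed and the field weakened and the records show it".split()
counts = Counter(tokens)

num_bins = len(counts)                  # what B() reports: number of distinct samples
top = counts.most_common(1)             # most frequent sample with its count
hapaxes = [w for w, c in counts.items() if c == 1]   # samples seen exactly once

# Running totals over the ranked samples: the values a cumulative plot draws.
ranked = counts.most_common()
cumulative = list(accumulate(c for _, c in ranked))

print(num_bins, top, hapaxes)
print(cumulative)
```

The last cumulative value always equals the total number of tokens, which is a quick sanity check on any frequency distribution.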



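5. Word Distribution

  The dispersion_plot method of nltk.text.Text shows where in a text each of the given words occurs.

    import nltk

    with open('demo.txt') as f:
        words = f.read()
    text = nltk.text.Text(nltk.word_tokenize(words))
    # One row per word, with a mark at every position where it occurs (needs matplotlib)
    text.dispersion_plot(['time', 'about', 'field', 'magnetic', 'records', 'underway'])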
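A dispersion plot marks each token offset at which a word occurs. The following sketch computes those offsets for a made-up token list and prints a text-only version of the plot; it illustrates the idea and is not NLTK's plotting code.

```python
tokens = "the field flipped and the records show the field weakened".split()
targets = ["field", "records"]

# For each target word, collect the token offsets at which it occurs.
offsets = {w: [i for i, tok in enumerate(tokens) if tok == w] for w in targets}

# Text rendering: one row per word, '|' where the word occurs, '.' elsewhere.
rows = {w: "".join("|" if tok == w else "." for tok in tokens) for w in targets}
for word in targets:
    print(f"{word:>8} {rows[word]}")
```

Words that cluster in one region of a text show up as dense groups of marks, while evenly used words are spread across the whole row.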


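6. Part-of-Speech Tagging

  Part-of-speech (POS) tagging is a way of analyzing the constituents of a sentence: it identifies the part of speech of each word.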

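| Tag | Part of speech | Examples |
| --- | --- | --- |
| ADJ | Adjective | new, good, high, special, big, local |
| ADV | Adverb | really, already, still, early, now |
| CONJ | Conjunction | and, or, but, if, while, although |
| DET | Determiner | the, a, some, most, every, no |
| EX | Existential "there" | there, there's |
| MOD | Modal verb | will, can, would, may, must, should |
| NN | Noun | year, home, costs, time |
| NNP | Proper noun | April, China, Washington |
| NUM | Number | fourth, 2016, 09:30 |
| PRON | Pronoun | he, they, us |
| P | Preposition | on, over, with, of |
| TO | The word "to" | to |
| UH | Interjection | ah, ha, oops |
| VB | Verb | |
| VBD | Past tense verb | made, said, went |
| VBG | Present participle | going, lying, playing |
| VBN | Past participle | taken, given, gone |
| WH | Wh-determiner | who, where, when, what |

    import nltk

    sentence = "They found that the reversal took about as long as many scientists previously believed it did, " \
               "just a few thousand years."
    tokens = nltk.word_tokenize(sentence)
    tagged_sent = nltk.pos_tag(tokens)   # list of (token, tag) pairs
    print(tagged_sent)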
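nltk.pos_tag returns a list of (token, tag) pairs, using Penn Treebank tags by default. A common next step is to group or filter tokens by tag, sketched below with the standard library. The tagged list is written by hand in the shape pos_tag returns; its tags are assigned for illustration and are not captured tagger output.

```python
from collections import defaultdict

# Hand-written (token, tag) pairs in the shape nltk.pos_tag returns.
# The tags are assigned by hand for illustration, not captured tagger output.
tagged = [
    ("They", "PRP"), ("found", "VBD"), ("that", "IN"), ("the", "DT"),
    ("reversal", "NN"), ("took", "VBD"), ("a", "DT"), ("few", "JJ"),
    ("thousand", "CD"), ("years", "NNS"),
]

# Group tokens by their exact tag.
by_tag = defaultdict(list)
for token, tag in tagged:
    by_tag[tag].append(token)

# Penn Treebank noun tags all start with "NN", verb tags with "VB".
nouns = [token for token, tag in tagged if tag.startswith("NN")]
verbs = [token for token, tag in tagged if tag.startswith("VB")]

print(dict(by_tag))
print("nouns:", nouns)
print("verbs:", verbs)
```

Matching on the tag prefix rather than the full tag keeps singular and plural nouns, or the different verb forms, in one group.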


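7. Removing Stop Words

  After simple tokenization, a text still contains many common words that carry little meaning of their own, such as a, the, and he. They occur so frequently that almost every page contains them. If a search engine indexed them as keywords, every site would match and the results would have no discriminating power, so these words are usually removed rather than used as keywords. NLTK ships an English stop word list that can be used directly.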

    # Requires the stop word data: nltk.download('stopwords')
    import nltk

    sentence = "In a new study, women researchers used lava flow records, along with sedimentary and Antarctic ice core data, " \
               "to examine that event. They found that the reversal took about as long as many scientists previously " \
               "believed it did, just a few thousand years. In a new study, researchers used lava flow records, along with " \
               "sedimentary and Antarctic ice core data, to examine that event. "
    tokens = nltk.word_tokenize(sentence)
    stops = set(nltk.corpus.stopwords.words('english'))
    # Compare in lowercase, because the stop word list is all lowercase
    tokens = [word for word in tokens if word.lower() not in stops]
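The stop word check above leaves punctuation tokens such as ',' in the result, because the stop word list contains only words. The sketch below shows one way to drop both in a single pass. It uses a tiny hand-picked stop list instead of NLTK's, and is_punctuation is a hypothetical helper written for this example.

```python
import string

# A tiny hand-picked stop list for illustration only.
STOPS = {"a", "the", "in", "and", "to", "that", "with"}

def is_punctuation(token):
    # True when every character of the token is an ASCII punctuation mark.
    return all(ch in string.punctuation for ch in token)

tokens = ["In", "a", "new", "study", ",", "researchers", "used",
          "lava", "flow", "records", "."]

kept = [t for t in tokens
        if t.lower() not in STOPS and not is_punctuation(t)]
print(kept)
```

Storing the stop words in a set keeps each membership test constant-time, which matters once the token list grows to the size of a real corpus.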



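8. WordNet in NLTK

  • WordNet provides the definition and example sentences of a given word.
    • Usage: wordnet.synsets(word)[i].definition(), wordnet.synsets(word)[i].examples()
  • WordNet provides synonyms.
    • Usage: lemma.name() for each lemma in wordnet.synsets(word)[i].lemmas()
  • WordNet provides antonyms.
    • Usage: lemma.antonyms()[0].name() for each lemma that has antonyms

    # Requires the WordNet data: nltk.download('wordnet')
    from nltk.corpus import wordnet

    syn = wordnet.synsets("dynamic")
    print("Definition:", syn[0].definition())
    print("Examples:", syn[0].examples())

    # Synonyms: the lemma names of the first synset
    synonyms = []
    for lemma in syn[0].lemmas():
        synonyms.append(lemma.name())
    print("Synonyms:", synonyms)

    # Antonyms: look through every lemma of every synset
    antonyms = []
    for ss in syn:
        for lemma in ss.lemmas():
            if lemma.antonyms():
                antonyms.append(lemma.antonyms()[0].name())
    print("Antonyms:", antonyms)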



9. Text Preprocessing

9.1 Stemming


  Stemming removes affixes from a word to obtain its stem; the result is not necessarily a dictionary word.

  • Usage: nltk.PorterStemmer().stem(token) / nltk.LancasterStemmer().stem(token)

    import nltk

    sentence = "In a new study, women researchers used lava flow records, along with sedimentary and Antarctic ice core data, " \
               "to examine that event. They found that the reversal took about as long as many scientists previously " \
               "believed it did, just a few thousand years. In a new study, researchers used lava flow records, along with " \
               "sedimentary and Antarctic ice core data, to examine that event. "
    tokens = nltk.word_tokenize(sentence)
    porter = nltk.PorterStemmer()
    lancaster = nltk.LancasterStemmer()
    print("sentence: " + sentence)
    print("PorterStemmer: ")
    print([porter.stem(t) for t in tokens])
    print("LancasterStemmer: ")
    print([lancaster.stem(t) for t in tokens])
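To show what suffix stripping does in principle, here is a deliberately crude stemmer written with the standard library. It is not the Porter or Lancaster algorithm, both of which apply far more elaborate rule sets; toy_stem, SUFFIXES, and min_stem are names invented for this sketch.

```python
SUFFIXES = ("ing", "ed", "es", "s")  # checked in this order; illustrative only

def toy_stem(word, min_stem=3):
    # Strip the first matching suffix, as long as at least min_stem letters remain.
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

words = ["records", "flowing", "examined", "took", "studies"]
stems = [toy_stem(w) for w in words]
print(stems)
```

The output illustrates two general properties of stemming: stems such as "examin" and "studi" need not be real words, and irregular forms such as "took" are left untouched, which is the gap lemmatization addresses.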


9.2 Lemmatization


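  Lemmatization maps a word to its dictionary form (lemma), using the part of speech to choose the right one.

  • Usage: nltk.WordNetLemmatizer().lemmatize(token, pos), where pos is a WordNet part-of-speech constant such as wordnet.NOUN

    import nltk
    from nltk.corpus import wordnet

    def get_wordnet_pos(tag):
        # Map a Penn Treebank tag to the matching WordNet POS constant
        if tag.startswith('J'):
            return wordnet.ADJ
        elif tag.startswith('V'):
            return wordnet.VERB
        elif tag.startswith('N'):
            return wordnet.NOUN
        elif tag.startswith('R'):
            return wordnet.ADV
        else:
            return None

    sentence = "They found that the reversal took about as long as many scientists previously believed it did, " \
               "just a few thousand years."
    tokens = nltk.word_tokenize(sentence)
    tagged_sent = nltk.pos_tag(tokens)
    wnl = nltk.WordNetLemmatizer()
    print("WordNetLemmatizer:")
    # Fall back to NOUN when the tag has no WordNet equivalent
    print([wnl.lemmatize(t[0], get_wordnet_pos(t[1]) or wordnet.NOUN) for t in tagged_sent])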
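The tag conversion in get_wordnet_pos can also be written as a lookup table, which makes the fallback explicit. The sketch below runs without NLTK by using stand-in constants: NLTK defines wordnet.ADJ, wordnet.VERB, wordnet.NOUN, and wordnet.ADV as the one-letter strings shown. The name penn_to_wordnet is hypothetical.

```python
# Stand-ins for wordnet.ADJ, wordnet.VERB, wordnet.NOUN and wordnet.ADV,
# which NLTK defines as these one-letter strings.
ADJ, VERB, NOUN, ADV = "a", "v", "n", "r"

# First letter of a Penn Treebank tag -> WordNet part of speech.
PREFIX_TO_POS = {"J": ADJ, "V": VERB, "N": NOUN, "R": ADV}

def penn_to_wordnet(tag, default=NOUN):
    # Tags without a WordNet equivalent (DT, IN, CD, ...) fall back to the default.
    return PREFIX_TO_POS.get(tag[:1], default)

for tag in ["JJ", "VBD", "NNS", "RB", "DT"]:
    print(tag, "->", penn_to_wordnet(tag))
```

Defaulting to the noun category mirrors the lemmatizer itself, whose pos argument defaults to noun when none is given.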


