Table of Contents

  • I. Obtaining Text Corpora
    • 1. Gutenberg Corpus
      • (1) Printing the file identifiers in the corpus
      • (2) Word counts and concordancing
      • (3) Text statistics
    • 2. Web and Chat Text
    • 3. Brown Corpus
      • (1) First look
      • (2) Comparing modal verb usage across genres
    • 4. Reuters Corpus
      • (1) First look
      • (2) Looking up topics and fileids
      • (3) Retrieving words or sentences by document or category
    • 5. Inaugural Address Corpus
      • (1) First look
      • (2) Conditional frequency distribution plot
    • 6. Annotated Text Corpora
    • 7. Corpora in Other Languages
    • 8. The Structure of Text Corpora
    • 9. Loading Your Own Corpus
      • (1) PlaintextCorpusReader
      • (2) BracketParseCorpusReader
  • II. Conditional Frequency Distributions
    • 1. Conditions and Events
    • 2. Counting Words by Genre
    • 3. Plotting and Tabulating Distributions
    • 4. Generating Random Text with Bigrams
  • III. Code Reuse (Python)
  • IV. Lexical Resources
    • 1. Wordlist Corpora
      • (1) Filtering text
      • (2) Stopwords Corpus
      • (3) A Word Puzzle
      • (4) Names Corpus
    • 2. A Pronouncing Dictionary
    • 3. Comparative Wordlists
  • V. WordNet
    • 1. Senses and Synonyms
    • 2. The Hierarchy
    • 3. More Lexical Relations
    • 4. Semantic Similarity

This chapter is about working with large bodies of linguistic data, i.e., corpora.

I. Obtaining Text Corpora

1. Gutenberg Corpus

NLTK includes a small selection of texts from the Project Gutenberg electronic text archive, which contains some 25,000 free electronic books.

(1) Printing the file identifiers in the corpus

import nltk
# Print the file identifiers in the corpus
print(nltk.corpus.gutenberg.fileids())

Output:

['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']

(2) Word counts and concordancing

import nltk
from nltk.corpus import gutenberg

emma = gutenberg.words('austen-emma.txt')
print(len(emma))
# Wrap the word list in an nltk.Text to get concordancing
emma = nltk.Text(gutenberg.words('austen-emma.txt'))
emma.concordance("surprize")   # prints its matches itself; returns None

Output:

192427
Displaying 25 of 37 matches:
er father , was sometimes taken by surprize at his being still able to pity `
hem do the other any good ." " You surprize me ! Emma must do Harriet good : a
Knightley actually looked red with surprize and displeasure , as he stood up ,
r . Elton , and found to his great surprize , that Mr . Elton was actually on
d aid ." Emma saw Mrs . Weston ' s surprize , and felt that it must be great ,
father was quite taken up with the surprize of so sudden a journey , and his f
y , in all the favouring warmth of surprize and conjecture . She was , moreove
he appeared , to have her share of surprize , introduction , and pleasure . Th
ir plans ; and it was an agreeable surprize to her , therefore , to perceive t
talking aunt had taken me quite by surprize , it must have been the death of m
f all the dialogue which ensued of surprize , and inquiry , and congratulation
the present . They might chuse to surprize her ." Mrs . Cole had many to agre
the mode of it , the mystery , the surprize , is more like a young woman ' s s
to her song took her agreeably by surprize -- a second , slightly but correct
" " Oh ! no -- there is nothing to surprize one at all .-- A pretty fortune ;
t to be considered . Emma ' s only surprize was that Jane Fairfax should accep
of your admiration may take you by surprize some day or other ." Mr . Knightle
ation for her will ever take me by surprize .-- I never had a thought of her i
expected by the best judges , for surprize -- but there was great joy . Mr .
sound of at first , without great surprize . " So unreasonably early !" she w
d Frank Churchill , with a look of surprize and displeasure .-- " That is easy
; and Emma could imagine with what surprize and mortification she must be retu
tled that Jane should go . Quite a surprize to me ! I had not the least idea !
. It is impossible to express our surprize . He came to speak to his father o
g engaged !" Emma even jumped with surprize ;-- and , horror - struck , exclai

(3) Text statistics

for fileid in gutenberg.fileids():
    # raw() returns the contents of the file without any linguistic processing
    num_chars = len(gutenberg.raw(fileid))
    num_words = len(gutenberg.words(fileid))
    # sents() divides the text into sentences, each of which is a list of words
    num_sents = len(gutenberg.sents(fileid))
    num_vocab = len(set(w.lower() for w in gutenberg.words(fileid)))
    print('avg word length:', round(num_chars/num_words),
          'avg sentence length:', round(num_words/num_sents),
          'avg occurrences per word:', round(num_words/num_vocab),
          'from:', fileid)

Output:

avg word length: 5 avg sentence length: 25 avg occurrences per word: 26 from: austen-emma.txt
avg word length: 5 avg sentence length: 26 avg occurrences per word: 17 from: austen-persuasion.txt
avg word length: 5 avg sentence length: 28 avg occurrences per word: 22 from: austen-sense.txt
avg word length: 4 avg sentence length: 34 avg occurrences per word: 79 from: bible-kjv.txt
avg word length: 5 avg sentence length: 19 avg occurrences per word: 5 from: blake-poems.txt
avg word length: 4 avg sentence length: 19 avg occurrences per word: 14 from: bryant-stories.txt
avg word length: 4 avg sentence length: 18 avg occurrences per word: 12 from: burgess-busterbrown.txt
avg word length: 4 avg sentence length: 20 avg occurrences per word: 13 from: carroll-alice.txt
avg word length: 5 avg sentence length: 20 avg occurrences per word: 12 from: chesterton-ball.txt
avg word length: 5 avg sentence length: 23 avg occurrences per word: 11 from: chesterton-brown.txt
avg word length: 5 avg sentence length: 18 avg occurrences per word: 11 from: chesterton-thursday.txt
avg word length: 4 avg sentence length: 21 avg occurrences per word: 25 from: edgeworth-parents.txt
avg word length: 5 avg sentence length: 26 avg occurrences per word: 15 from: melville-moby_dick.txt
avg word length: 5 avg sentence length: 52 avg occurrences per word: 11 from: milton-paradise.txt
avg word length: 4 avg sentence length: 12 avg occurrences per word: 9 from: shakespeare-caesar.txt
avg word length: 4 avg sentence length: 12 avg occurrences per word: 8 from: shakespeare-hamlet.txt
avg word length: 4 avg sentence length: 12 avg occurrences per word: 7 from: shakespeare-macbeth.txt
avg word length: 5 avg sentence length: 36 avg occurrences per word: 12 from: whitman-leaves.txt

This displays three statistics for each text: average word length, average sentence length, and the average number of times each word appears in the text (our lexical diversity score). Average word length appears to be a general property of English, since its value is consistently about 4. (In fact, the average word length is really 3, not 4, because the num_chars variable also counts whitespace characters.) By contrast, average sentence length and lexical diversity appear to be characteristic of particular authors.
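A minimal sketch of that whitespace effect, assuming the standard gutenberg reader: stripping whitespace from raw() before counting characters lowers the computed average word length.

import nltk
from nltk.corpus import gutenberg

fileid = 'austen-emma.txt'
raw = gutenberg.raw(fileid)
num_words = len(gutenberg.words(fileid))
# len(raw) counts spaces and newlines; joining the whitespace-split
# pieces removes them before counting
chars_with_ws = len(raw)
chars_without_ws = len(''.join(raw.split()))
print(round(chars_with_ws / num_words))     # with whitespace counted
print(round(chars_without_ws / num_words))  # roughly one less without it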

2. Web and Chat Text

  • from nltk.corpus import webtext: a small collection of web text
  • from nltk.corpus import nps_chat: a corpus of instant-messaging chat sessions

# A small collection of web text
from nltk.corpus import webtext
for fileid in webtext.fileids():
    print(fileid, webtext.raw(fileid)[:65], '...')

# Instant-messaging chat session corpus:
# 10-19-20s_706posts.xml contains 706 posts collected from
# the 20s chat room on October 19, 2006
from nltk.corpus import nps_chat
chatroom = nps_chat.posts('10-19-20s_706posts.xml')
print(chatroom[123])

Output:

firefox.txt Cookie Manager: "Don't allow sites that set removed cookies to se ...
grail.txt SCENE 1: [wind] [clop clop clop] KING ARTHUR: Whoa there!  [clop ...
overheard.txt White guy: So, do you have any plans for this evening?Asian girl ...
pirates.txt PIRATES OF THE CARRIBEAN: DEAD MAN'S CHEST, by Ted Elliott & Terr ...
singles.txt 25 SEXY MALE, seeks attrac older single lady, for discreet encoun ...
wine.txt Lovely delicate, fragrant Rhone wine. Polished leather and strawb ...
['i', 'do', "n't", 'want', 'hot', 'pics', 'of', 'a', 'female', ',', 'I', 'can', 'look', 'in', 'a', 'mirror', '.']

3. Brown Corpus

Created at Brown University in 1961, this was the first million-word electronic corpus of English; it contains text from 500 different sources.

(1) First look

Example documents from each section of the Brown Corpus:

from nltk.corpus import brown

print(brown.categories())
print(brown.words(categories='news'))
print(brown.words(fileids=['cg22']))
print(brown.sents(categories=['news', 'editorial', 'reviews']))

Output:

['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies', 'humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance', 'science_fiction']
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]
['Does', 'our', 'society', 'have', 'a', 'runaway', ',', ...]
[['Assembly', 'session', 'brought', 'much', 'good'], ['The', 'General', 'Assembly', ',', 'which', 'adjourns', 'today', ',', 'has', 'performed', 'in', 'an', 'atmosphere', 'of', 'crisis', 'and', 'struggle', 'from', 'the', 'day', 'it', 'convened', '.'], ...]

(2) Comparing modal verb usage across genres

Counting modal verbs within a single genre:

import nltk
from nltk.corpus import brown

news_text = brown.words(categories='news')
fdist = nltk.FreqDist(w.lower() for w in news_text)
modals = ['can', 'could', 'may', 'might', 'must', 'will']
for m in modals:
    print(m + ':', fdist[m], end=' ')

Output:

can: 94 could: 87 may: 93 might: 38 must: 53 will: 389

Counting the frequency distribution of the words of interest across different genres:

# Conditional frequency distribution
cfd = nltk.ConditionalFreqDist(
    (genre, word)
    for genre in brown.categories()
    for word in brown.words(categories=genre))
# The genres we want to display
genres = ['news', 'religion', 'hobbies', 'science_fiction', 'romance', 'humor']
# The words we want to count
modals = ['can', 'could', 'may', 'might', 'must', 'will']
cfd.tabulate(conditions=genres, samples=modals)
cfd.plot(conditions=genres, samples=modals)

Output:

                 can could  may might must will
           news   93    86   66    38   50  389
       religion   82    59   78    12   54   71
        hobbies  268    58  131    22   83  264
science_fiction   16    49    4    12    8   16
        romance   74   193   11    51   45   43
          humor   16    30    8     8    9   13

4. Reuters Corpus

10,788 news documents (1.3 million words in total), classified into 90 topics and split into two sets, "training" and "test".

(1) First look

from nltk.corpus import reuters

print(reuters.fileids())
print(reuters.categories())     # topics
print("\n")

Output:

['test/14826', 'test/14828', 'test/14829', 'test/14832', 'test/14833', 'test/14839', 'test/14840', 'test/14841', 'test/14842', 'test/14843', ..., 'test/21567', 'test/21568', 'test/21570', 'test/21571', 'test/21573', 'test/21574', 'test/21575', 'test/21576', 'training/1', 'training/10', 'training/100', 'training/1000', 'training/10000', 'training/10002', 'training/10005', ..., 'training/9988', 'training/9989', 'training/999', 'training/9992', 'training/9993', 'training/9994', 'training/9995']
['acq', 'alum', 'barley', 'bop', 'carcass', 'castor-oil', 'cocoa', 'coconut', 'coconut-oil', 'coffee', 'copper', 'copra-cake', 'corn', 'cotton', ..., 'sun-oil', 'sunseed', 'tea', 'tin', 'trade', 'veg-oil', 'wheat', 'wpi', 'yen', 'zinc']
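The "training"/"test" split mentioned above is encoded in the fileid prefixes, so the two sets can be separated with a simple filter; a minimal sketch:

from nltk.corpus import reuters

# The split is encoded in each fileid's prefix
train_ids = [f for f in reuters.fileids() if f.startswith('training')]
test_ids = [f for f in reuters.fileids() if f.startswith('test')]
print(len(train_ids), len(test_ids))  # together they cover all 10,788 documents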

(2) Looking up topics and fileids

Categories in the Reuters corpus overlap with each other: a news story often covers multiple topics.

# The topics covered by a given document
print(reuters.categories('training/9865'))
print(reuters.categories(['training/9865', 'training/9880']))
print("\n")
# The documents covering a given topic
print(reuters.fileids('barley'))
print(reuters.fileids(['barley', 'corn']))
print("\n")

Output:

['barley', 'corn', 'grain', 'wheat']
['barley', 'corn', 'grain', 'money-fx', 'wheat']
['test/15618', 'test/15649', 'test/15676', 'test/15728', 'test/15871', 'test/15875', 'test/15952', 'test/17767', 'test/17769', ..., 'training/8257', 'training/8759', 'training/9865', 'training/9958']
['test/14832', 'test/14858', 'test/15033', 'test/15043', 'test/15106', 'test/15287', 'test/15341', 'test/15618', 'test/15648',...,'training/9058', 'training/9093', 'training/9094', 'training/934', 'training/9470', 'training/9521', 'training/9667', 'training/97', 'training/9865', 'training/9958', 'training/9989']

(3) Retrieving words or sentences by document or category

print(reuters.words('training/9865')[:14])
print(reuters.words(['training/9865', 'training/9880']))
print(reuters.words(categories='barley'))
print(reuters.words(categories=['barley', 'corn']))

Output:

['FRENCH', 'FREE', 'MARKET', 'CEREAL', 'EXPORT', 'BIDS', 'DETAILED', 'French', 'operators', 'have', 'requested', 'licences', 'to', 'export']
['FRENCH', 'FREE', 'MARKET', 'CEREAL', 'EXPORT', ...]
['FRENCH', 'FREE', 'MARKET', 'CEREAL', 'EXPORT', ...]
['THAI', 'TRADE', 'DEFICIT', 'WIDENS', 'IN', 'FIRST', ...]

5. Inaugural Address Corpus

(1) First look

import nltk
from nltk.corpus import inaugural

print(inaugural.fileids())
print([fileid[:4] for fileid in inaugural.fileids()])

Output:

['1789-Washington.txt', '1793-Washington.txt', '1797-Adams.txt', '1801-Jefferson.txt', '1805-Jefferson.txt', '1809-Madison.txt', '1813-Madison.txt', '1817-Monroe.txt', '1821-Monroe.txt', '1825-Adams.txt', '1829-Jackson.txt', '1833-Jackson.txt', '1837-VanBuren.txt', '1841-Harrison.txt', '1845-Polk.txt', ...,  '1985-Reagan.txt', '1989-Bush.txt', '1993-Clinton.txt', '1997-Clinton.txt', '2001-Bush.txt', '2005-Bush.txt', '2009-Obama.txt']
['1789', '1793', '1797', '1801', '1805', '1809', '1813', '1817', '1821', '1825', '1829', '1833', '1837', '1841', '1845', '1849', '1853', '1857', '1861', '1865', '1869', '1873', '1877', '1881', '1885', '1889', '1893', '1897', '1901', '1905', '1909', '1913', '1917', '1921', '1925', '1929', '1933', '1937', '1941', '1945', '1949', '1953', '1957', '1961', '1965', '1969', '1973', '1977', '1981', '1985', '1989', '1993', '1997', '2001', '2005', '2009']

(2) Conditional frequency distribution plot

cfd = nltk.ConditionalFreqDist(
    (target, fileid[:4])
    for fileid in inaugural.fileids()
    for w in inaugural.words(fileid)
    for target in ['america', 'citizen']
    if w.lower().startswith(target))
cfd.plot()

Output: a conditional frequency distribution plot, counting all words in the inaugural address corpus that begin with america or citizen.
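When no plotting backend is available, the same counts can be inspected as a table with tabulate(), reusing the cfd built above; the sample years below are illustrative:

# Tabulate the same conditional frequency distribution for a few years
cfd.tabulate(conditions=['america', 'citizen'],
             samples=['1789', '1861', '1961', '2009'])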

6. Annotated Text Corpora

7. Corpora in Other Languages

print(nltk.corpus.cess_esp.words())
print(nltk.corpus.floresta.words())
print(nltk.corpus.indian.words('hindi.pos'))
print(nltk.corpus.udhr.fileids())
print(nltk.corpus.udhr.words('Javanese-Latin1')[11:])

Using a conditional frequency distribution to study differences in word length across language versions of the Universal Declaration of Human Rights (udhr) corpus:

from nltk.corpus import udhr

languages = ['Chickasaw', 'English', 'German_Deutsch',
             'Greenlandic_Inuktikut', 'Hungarian_Magyar', 'Ibibio_Efik']
cfd = nltk.ConditionalFreqDist(
    (lang, len(word))
    for lang in languages
    for word in udhr.words(lang + '-Latin1'))
cfd.plot(cumulative=True)

8. The Structure of Text Corpora


Common structures of text corpora (a sketch of the shared reader interface follows the list):

  • isolated: a collection of isolated texts with no particular organization;
  • categorized: texts organized into categories;
  • overlapping: categories that overlap, e.g. topic categories (Reuters Corpus);
  • temporal: language use changing over time (Inaugural Address Corpus).
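Whichever of these structures a corpus has, NLTK's corpus readers expose a largely uniform interface. A minimal sketch contrasting an isolated corpus (gutenberg) with a categorized one (brown):

from nltk.corpus import brown, gutenberg

# Isolated corpus: fileids() and words() are available
print(gutenberg.fileids()[:3])
print(gutenberg.words('austen-emma.txt')[:5])

# Categorized corpus: the same calls work, plus categories()
print(brown.categories()[:3])
print(brown.words(categories='news')[:5])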

9. Loading Your Own Corpus

(1) PlaintextCorpusReader

PlaintextCorpusReader is suited to plain text files, e.g. loading the corpus under corpus_root:

from nltk.corpus import PlaintextCorpusReader

corpus_root = '/usr/share/dict'
wordlists = PlaintextCorpusReader(corpus_root, '.*')
print(wordlists.fileids())
print(wordlists.words('american-english'))

Output:

['README.select-wordlist', 'american-english', 'british-english', 'cracklib-small', 'words', 'words.pre-dictionaries-common']
['A', 'A', "'", 's', 'AMD', 'AMD', "'", 's', 'AOL', ...]

(2) BracketParseCorpusReader

BracketParseCorpusReader is suited to corpora that have already been parsed:

from nltk.corpus import BracketParseCorpusReader

corpus_root = ''  # path to the corpus
file_pattern = r".*/wsj_.*\.mrg"  # filename pattern
# Initialize the reader with the corpus directory and the pattern of
# files to load; the encoding defaults to utf8
ptb = BracketParseCorpusReader(corpus_root, file_pattern)
print(ptb.fileids())
print(len(ptb.sents()))
print(ptb.sents(fileids='20/wsj_2013.mrg')[19])

II. Conditional Frequency Distributions

1. Conditions and Events

Each pair has the form (condition, event). If we process the entire Brown Corpus by genre, there are 15 conditions (one per genre) and 1,161,192 events (one per word).

text = ['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]
pairs = [('news', 'The'), ('news', 'Fulton'), ('news', 'County'), ...]
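As a minimal sketch, such a list of pairs can be fed straight to ConditionalFreqDist:

import nltk

# Each pair is (condition, event); the CFD groups events by condition
pairs = [('news', 'The'), ('news', 'Fulton'), ('romance', 'could')]
cfd = nltk.ConditionalFreqDist(pairs)
print(cfd.conditions())    # the two conditions, 'news' and 'romance'
print(cfd['news']['The'])  # 1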

2. Counting Words by Genre

import nltk
from nltk.corpus import brown

# Build the (genre, word) pairs
genre_word = [(genre, word)
              for genre in ['news', 'romance']
              for word in brown.words(categories=genre)]
print(len(genre_word))
print(genre_word[:4])
print(genre_word[-4:], '\n')

# Conditional frequency distribution
cfd = nltk.ConditionalFreqDist(genre_word)
print(cfd)
print(cfd.conditions(), '\n')

print(cfd['news'])
print(cfd['romance'])
print(cfd['romance'].most_common(20))
print(cfd['romance']['could'])

Output:

170576
[('news', 'The'), ('news', 'Fulton'), ('news', 'County'), ('news', 'Grand')]
[('romance', 'afraid'), ('romance', 'not'), ('romance', "''"), ('romance', '.')]
<ConditionalFreqDist with 2 conditions>
['news', 'romance']
<FreqDist with 14394 samples and 100554 outcomes>
<FreqDist with 8452 samples and 70022 outcomes>
[(',', 3899), ('.', 3736), ('the', 2758), ('and', 1776), ('to', 1502), ('a', 1335), ('of', 1186), ('``', 1045), ("''", 1044), ('was', 993), ('I', 951), ('in', 875), ('he', 702), ('had', 692), ('?', 690), ('her', 651), ('that', 583), ('it', 573), ('his', 559), ('she', 496)]
193

3. Plotting and Tabulating Distributions

from nltk.corpus import inaugural

cfd = nltk.ConditionalFreqDist(
    (target, fileid[:4])
    for fileid in inaugural.fileids()
    for w in inaugural.words(fileid)
    for target in ['america', 'citizen']
    if w.lower().startswith(target))
cfd.plot()

from nltk.corpus import udhr

languages = ['Chickasaw', 'English', 'German_Deutsch',
             'Greenlandic_Inuktikut', 'Hungarian_Magyar', 'Ibibio_Efik']
cfd = nltk.ConditionalFreqDist(
    (lang, len(word))
    for lang in languages
    for word in udhr.words(lang + '-Latin1'))
cfd.tabulate(conditions=['English', 'German_Deutsch'],
             samples=range(10), cumulative=True)
cfd.plot(cumulative=True)

4. Generating Random Text with Bigrams

Using bigrams to build a generative model:

def generate_model(cfdist, word, num=15):
    # Repeatedly emit the current word, then pick its most likely successor
    for i in range(num):
        print(word, end=" ")
        word = cfdist[word].max()

text = nltk.corpus.genesis.words("english-kjv.txt")
bigrams = nltk.bigrams(text)
cfd = nltk.ConditionalFreqDist(bigrams)
print(cfd)
print(list(cfd))
print(cfd["so"])
print(cfd["living"])

generate_model(cfd, "so")
generate_model(cfd, "living")

Output:

<ConditionalFreqDist with 2789 conditions>
['In', 'the', 'beginning', 'God', 'created', 'heaven', 'and', 'earth', '.', 'And', 'was', 'without', 'form', ',', 'void', ';', 'darkness', 'upon', 'face', 'of', 'deep', 'Spirit', 'moved', 'waters', 'said', 'Let', 'there', 'be', 'light', ':', 'saw', 'that', 'it', 'good', 'divided', 'from', 'called', 'Day', 'he', ..., 'embalmed', 'past', 'elders', 'chariots', 'horsemen', 'threshingfloor', 'Atad', 'lamentati', 'floor', 'Egyptia', 'Abelmizraim', 'requite', 'messenger', 'Forgive', 'forgive', 'meant', 'Machir', 'visit', 'coffin']
FreqDist({'that': 8, '.': 7, ',': 4, 'the': 3, 'I': 2, 'doing': 2, 'much': 2, ':': 2, 'did': 1, 'Noah': 1, ...})
FreqDist({'creature': 7, 'thing': 4, 'substance': 2, 'soul': 1, '.': 1, ',': 1})
so that he said , and the land of the land of the land of
living creature that he said , and the land of the land of

Common methods of conditional frequency distributions:

Example                               Description
cfdist = ConditionalFreqDist(pairs)   create a conditional frequency distribution from a list of pairs
cfdist.conditions()                   the conditions, sorted alphabetically
cfdist[condition]                     the frequency distribution for this condition
cfdist[condition][sample]             frequency of the given sample under this condition
cfdist.tabulate()                     tabulate the conditional frequency distribution
cfdist.tabulate(samples, conditions)  tabulate, restricted to the given samples and conditions
cfdist.plot()                         plot the conditional frequency distribution
cfdist.plot(samples, conditions)      plot, restricted to the given samples and conditions
cfdist1 < cfdist2                     test whether samples in cfdist1 occur less frequently than in cfdist2

III. Code Reuse (Python)

  • Functions and methods
  • Module: a collection of variable and function definitions in a single file; your own functions can be accessed by importing that file.
  • Package: a collection of related modules.

Note: when Python imports a module, it looks in the current directory (folder) first.
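A minimal sketch of the module mechanism; the file name textproc.py and its deliberately naive plural() rule are only for illustration:

# textproc.py -- a module: variable and function definitions in one file
def plural(word):
    # A naive pluralizer, for illustration only
    if word.endswith('y'):
        return word[:-1] + 'ies'
    return word + 's'

Any script in the same directory can then import it by file name:

from textproc import plural

print(plural('fairy'))  # fairies
print(plural('dog'))    # dogs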

IV. Lexical Resources

A lexicon, or lexical resource, is a collection of words and/or phrases together with associated information; it accompanies a text, and is typically created and enriched with the help of a text.


Lexicon terminology: lexical entries for two homonyms (spelled the same but with different meanings), each consisting of a lemma (also called the headword) plus additional information such as the part of speech and a gloss.

1. Wordlist Corpora

One wordlist corpus is the Unix file /usr/share/dict/words, which is used by some spell checkers. We can use it to find unusual or misspelled words in a text corpus.

(1) Filtering text

This program computes the vocabulary of a text, then removes every item that appears in an existing wordlist, leaving just the uncommon or misspelled words.

def unusual_words(text):
    text_vocab = set(w.lower() for w in text if w.isalpha())
    english_vocab = set(w.lower() for w in nltk.corpus.words.words())
    unusual = text_vocab - english_vocab
    return sorted(unusual)

un_words1 = unusual_words(nltk.corpus.gutenberg.words('austen-sense.txt'))
print(un_words1, '\n')
un_words2 = unusual_words(nltk.corpus.nps_chat.words())
print(un_words2)

Output:

['abbeyland', 'abhorred', 'abilities', 'abounded', 'abridgement', 'abused', 'abuses', 'accents', 'accepting', 'accommodations', ..., 'wiping', 'wisest', 'wishes', 'withdrew', 'witnessed', 'witnesses', 'witnessing', 'witticisms', 'wittiest', 'wives', 'women', 'wondered', 'woods', 'words', 'workmen', 'worlds', 'wrapt', 'writes', 'yards', 'years', 'yielded', 'youngest']
['aaaaaaaaaaaaaaaaa', 'aaahhhh', 'abortions', 'abou', 'abourted', 'abs', 'ack', 'acros', 'actualy', ..., 'yuuuuuuuuuuuummmmmmmmmmmm', 'yvw', 'yw', 'zebrahead', 'zoloft', 'zyban', 'zzzzzzzing', 'zzzzzzzz']

(2) Stopwords Corpus

Stopwords usually carry little lexical content, e.g. the, to, and also.

from nltk.corpus import stopwords

print(stopwords.words('english'))

# Compute the fraction of words in a text that are not on the stopword list
def content_fraction(text):
    stopwords = nltk.corpus.stopwords.words('english')
    content = [w for w in text if w.lower() not in stopwords]
    return len(content) / len(text)

frac = content_fraction(nltk.corpus.reuters.words())
print(frac)

Output:

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's",..., 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]
0.735240435097661

(3) A Word Puzzle

Given a grid of randomly chosen letters, find words made from the letters in the grid; this puzzle is known as "Target".

Requirements:

  • words must be at least 6 letters long
  • every word must include the center letter
  • each letter may be used at most once per word

import nltk

puzzle_letters = nltk.FreqDist('egivrvonl')
obligatory = 'r'
wordlist = nltk.corpus.words.words()
print([w for w in wordlist
       if len(w) >= 6
       and obligatory in w
       and nltk.FreqDist(w) <= puzzle_letters])

Output:

['glover', 'gorlin', 'govern', 'grovel', 'ignore', 'involver', 'lienor', 'linger', 'longer', 'lovering', 'noiler', 'overling', 'region', 'renvoi', 'revolving', 'ringle', 'roving', 'violer', 'virole']

FreqDist comparison: this lets us check that each letter in a candidate word occurs no more often than the same letter occurs in the puzzle.
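A minimal check of that comparison, using a word from the output above and one that violates the letter limits:

import nltk

puzzle_letters = nltk.FreqDist('egivrvonl')
# 'ringle' uses each of its letters no more often than the puzzle allows
print(nltk.FreqDist('ringle') <= puzzle_letters)  # True
# 'rrring' needs three r's, but the puzzle supplies only one
print(nltk.FreqDist('rrring') <= puzzle_letters)  # False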

(4) Names Corpus

Contains 8,000 first names categorized by gender; male and female names are stored in separate files.

names = nltk.corpus.names
print(names.fileids())
male_names = names.words('male.txt')
female_names = names.words('female.txt')

# Names used for both genders
print([w for w in male_names if w in female_names])

# Conditional frequency distribution over the final letter
# of male vs. female names
cfd = nltk.ConditionalFreqDist(
    (fileid, name[-1])
    for fileid in names.fileids()
    for name in names.words(fileid))
cfd.plot()

Output:

['female.txt', 'male.txt']
['Abbey', 'Abbie', 'Abby', 'Addie', 'Adrian', 'Adrien', 'Ajay', 'Alex', 'Alexis', 'Alfie', 'Ali', 'Alix', 'Allie', 'Allyn', 'Andie', 'Andrea', 'Andy', 'Angel',..., 'Ted', 'Teddie', 'Teddy', 'Terri', 'Terry', 'Theo', 'Tim', 'Timmie', 'Timmy', 'Tobe', 'Tobie', 'Toby', 'Tommie', 'Tommy', 'Tony', 'Torey', 'Trace', 'Tracey', 'Tracie', 'Tracy', 'Val', 'Vale', 'Valentine', 'Van', 'Vin', 'Vinnie', 'Vinny', 'Virgie', 'Wallie', 'Wallis', 'Wally', 'Whitney', 'Willi', 'Willie', 'Willy', 'Winnie', 'Winny', 'Wynn']


Conditional frequency distribution plot showing the final letters of male and female names: most names ending in a, e, or i are female; names ending in h or l are about equally split between the genders; names ending in k, o, r, s, or t are more likely male.

2. A Pronouncing Dictionary

A dictionary designed for use by speech synthesizers (the CMU Pronouncing Dictionary):

entries = nltk.corpus.cmudict.entries()
print(len(entries))
for entry in entries[39943:39951]:
    print(entry)

# Find three-phone words whose pronunciation begins with 'P' and ends
# with 'T', printing the middle phone
for word, pron in entries:
    if len(pron) == 3:
        ph1, ph2, ph3 = pron
        if ph1 == 'P' and ph3 == 'T':
            print(word, ph2, end=' ')

Output:

133737
('explorer', ['IH0', 'K', 'S', 'P', 'L', 'AO1', 'R', 'ER0'])
('explorers', ['IH0', 'K', 'S', 'P', 'L', 'AO1', 'R', 'ER0', 'Z'])
('explores', ['IH0', 'K', 'S', 'P', 'L', 'AO1', 'R', 'Z'])
('exploring', ['IH0', 'K', 'S', 'P', 'L', 'AO1', 'R', 'IH0', 'NG'])
('explosion', ['IH0', 'K', 'S', 'P', 'L', 'OW1', 'ZH', 'AH0', 'N'])
('explosions', ['IH0', 'K', 'S', 'P', 'L', 'OW1', 'ZH', 'AH0', 'N', 'Z'])
('explosive', ['IH0', 'K', 'S', 'P', 'L', 'OW1', 'S', 'IH0', 'V'])
('explosively', ['EH2', 'K', 'S', 'P', 'L', 'OW1', 'S', 'IH0', 'V', 'L', 'IY0'])
pait EY1
pat AE1
pate EY1
patt AE1
peart ER1
peat IY1
peet IY1
peete IY1
pert ER1
pet EH1
pete IY1
pett EH1
piet IY1
piette IY1
pit IH1
pitt IH1
pot AA1
pote OW1
pott AA1
pout AW1
puett UW1
purt ER1
put UH1
putt AH1

Find all words whose pronunciation ends like nicks.

syllable = ['N', 'IH0', 'K', 'S']
print([word for word, pron in entries if pron[-4:] == syllable])

Output:

["atlantic's", 'audiotronics', 'avionics', 'beatniks', 'calisthenics', 'centronics', 'chamonix', 'chetniks', "clinic's", 'clinics', 'conics', 'conics', 'cryogenics', 'cynics', 'diasonics', "dominic's", 'ebonics', 'electronics', "electronics'", "endotronics'", 'endotronics', 'enix', 'environics', 'ethnics', 'eugenics', 'fibronics', 'flextronics', 'harmonics', 'hispanics', 'histrionics', 'identics', 'ionics', 'kibbutzniks', 'lasersonics', 'lumonics', 'mannix', 'mechanics', "mechanics'", 'microelectronics', 'minix', 'minnix', 'mnemonics', 'mnemonics', 'molonicks', 'mullenix', 'mullenix', 'mullinix', 'mulnix', "munich's", 'nucleonics', 'onyx', 'organics', "panic's", 'panics', 'penix', 'pennix', 'personics', 'phenix', "philharmonic's", 'phoenix', 'phonics', 'photronics', 'pinnix', 'plantronics', 'pyrotechnics', 'refuseniks', "resnick's", 'respironics', 'sconnix', 'siliconix', 'skolniks', 'sonics', 'sputniks', 'technics', 'tectonics', 'tektronix', 'telectronics', 'telephonics', 'tonics', 'unix', "vinick's", "vinnick's", 'vitronics']

3. Comparative Wordlists

The Swadesh list of core vocabulary words:

from nltk.corpus import swadesh

print(swadesh.fileids())
print(swadesh.words('en'))

# entries(): specify a list of languages to access cognate
# words across those languages
fr2en = swadesh.entries(['fr', 'en'])
print(fr2en)
translate = dict(fr2en)
print(translate['chien'])
print(translate['jeter'])

de2en = swadesh.entries(['de', 'en'])   # German-English
es2en = swadesh.entries(['es', 'en'])   # Spanish-English
translate.update(dict(de2en))
translate.update(dict(es2en))
print(translate['Hund'])
print(translate['perro'])

languages = ['en', 'de', 'nl', 'es', 'fr', 'pt', 'la']
for i in [139, 140, 141, 142]:
    print(swadesh.entries(languages)[i])

Output:

['be', 'bg', 'bs', 'ca', 'cs', 'cu', 'de', 'en', 'es', 'fr', 'hr', 'it', 'la', 'mk', 'nl', 'pl', 'pt', 'ro', 'ru', 'sk', 'sl', 'sr', 'sw', 'uk']
['I', 'you (singular), thou', 'he', 'we', 'you (plural)', 'they', 'this', 'that', 'here', 'there', 'who', 'what', 'where', 'when', 'how', 'not', 'all', 'many', 'some', 'few', 'other', 'one', 'two', 'three', 'four', 'five', 'big', 'long', 'wide', 'thick', 'heavy', 'small', 'short', 'narrow', 'thin', 'woman', 'man (adult male)', 'man (human being)', 'child', 'wife', 'husband', 'mother', 'father', 'animal', 'fish', 'bird', 'dog', 'louse', 'snake', 'worm', 'tree', 'forest', 'stick', 'fruit', 'seed', 'leaf', 'root', 'bark (from tree)', 'flower', 'grass', 'rope', 'skin', 'meat', 'blood', 'bone', 'fat (noun)', 'egg', 'horn', 'tail', 'feather', 'hair', 'head', 'ear', 'eye', 'nose', 'mouth', 'tooth', 'tongue', 'fingernail', 'foot', 'leg', 'knee', 'hand', 'wing', 'belly', 'guts', 'neck', 'back', 'breast', 'heart', 'liver', 'drink', 'eat', 'bite', 'suck', 'spit', 'vomit', 'blow', 'breathe', 'laugh', 'see', 'hear', 'know (a fact)', 'think', 'smell', 'fear', 'sleep', 'live', 'die', 'kill', 'fight', 'hunt', 'hit', 'cut', 'split', 'stab', 'scratch', 'dig', 'swim', 'fly (verb)', 'walk', 'come', 'lie', 'sit', 'stand', 'turn', 'fall', 'give', 'hold', 'squeeze', 'rub', 'wash', 'wipe', 'pull', 'push', 'throw', 'tie', 'sew', 'count', 'say', 'sing', 'play', 'float', 'flow', 'freeze', 'swell', 'sun', 'moon', 'star', 'water', 'rain', 'river', 'lake', 'sea', 'salt', 'stone', 'sand', 'dust', 'earth', 'cloud', 'fog', 'sky', 'wind', 'snow', 'ice', 'smoke', 'fire', 'ashes', 'burn', 'road', 'mountain', 'red', 'green', 'yellow', 'white', 'black', 'night', 'day', 'year', 'warm', 'cold', 'full', 'new', 'old', 'good', 'bad', 'rotten', 'dirty', 'straight', 'round', 'sharp', 'dull', 'smooth', 'wet', 'dry', 'correct', 'near', 'far', 'right', 'left', 'at', 'in', 'with', 'and', 'if', 'because', 'name']
[('je', 'I'), ('tu, vous', 'you (singular), thou'), ('il', 'he'), ('nous', 'we'), ('vous', 'you (plural)'), ('ils, elles', 'they'), ('ceci', 'this'), ('cela', 'that'), ('ici', 'here'), ('là', 'there'), ('qui', 'who'), ('quoi', 'what'), ('où', 'where'), ('quand', 'when'), ('comment', 'how'), ('ne...pas', 'not'), ('tout', 'all'), ('plusieurs', 'many'), ... , ('sec', 'dry'), ('juste, correct', 'correct'), ('proche', 'near'), ('loin', 'far'), ('à droite', 'right'), ('à gauche', 'left'), ('à', 'at'), ('dans', 'in'), ('avec', 'with'), ('et', 'and'), ('si', 'if'), ('parce que', 'because'), ('nom', 'name')]
dog
throw
dog
dog
('say', 'sagen', 'zeggen', 'decir', 'dire', 'dizer', 'dicere')
('sing', 'singen', 'zingen', 'cantar', 'chanter', 'cantar', 'canere')
('play', 'spielen', 'spelen', 'jugar', 'jouer', 'jogar, brincar', 'ludere')
('float', 'schweben', 'zweven', 'flotar', 'flotter', 'flutuar, boiar', 'fluctuare')

V. WordNet

A semantically oriented dictionary of English, containing 155,287 words and 117,659 synsets (sets of synonyms).

1. Senses and Synonyms

from nltk.corpus import wordnet as wn

print(wn.synsets('motorcar'))
# The set of words (or "lemmas") with the same meaning
print(wn.synset('car.n.01').lemma_names())
# The definition of the synset
print(wn.synset('car.n.01').definition())
# Example sentences for the synset
print(wn.synset('car.n.01').examples())

# All lemmas of a given synset
print(wn.synset('car.n.01').lemmas())
# Look up a particular lemma
print(wn.lemma('car.n.01.automobile'))
# Get the synset a lemma belongs to
print(wn.lemma('car.n.01.automobile').synset())
# Get the "name" of a lemma
print(wn.lemma('car.n.01.automobile').name(), '\n')

print(wn.synsets('car'))
for synset in wn.synsets('car'):
    print(synset.lemma_names())
print(wn.lemmas('car'))

Output:

[Synset('car.n.01')]
['car', 'auto', 'automobile', 'machine', 'motorcar']
a motor vehicle with four wheels; usually propelled by an internal combustion engine
['he needs a car to get to work']
[Lemma('car.n.01.car'), Lemma('car.n.01.auto'), Lemma('car.n.01.automobile'), Lemma('car.n.01.machine'), Lemma('car.n.01.motorcar')]
Lemma('car.n.01.automobile')
Synset('car.n.01')
automobile
[Synset('car.n.01'), Synset('car.n.02'), Synset('car.n.03'), Synset('car.n.04'), Synset('cable_car.n.01')]
['car', 'auto', 'automobile', 'machine', 'motorcar']
['car', 'railcar', 'railway_car', 'railroad_car']
['car', 'gondola']
['car', 'elevator_car']
['cable_car', 'car']
[Lemma('car.n.01.car'), Lemma('car.n.02.car'), Lemma('car.n.03.car'), Lemma('car.n.04.car'), Lemma('cable_car.n.01.car')]

2. The Hierarchy

WordNet synsets correspond to abstract concepts, which do not always have corresponding words in English. These concepts are linked together in a hierarchy.

from nltk.corpus import wordnet as wn

motorcar = wn.synset('car.n.01')

# Hyponyms: more specific concepts
type_of_motorcar = motorcar.hyponyms()
print(type_of_motorcar[26])
print(sorted([lemma.name()
              for synset in type_of_motorcar
              for lemma in synset.lemmas()]))

# Hypernyms: more general concepts
print(motorcar.hypernyms())
paths = motorcar.hypernym_paths()
print(len(paths))
print([synset.name for synset in paths[0]])    # .name without () gives the bound method
print([synset.name() for synset in paths[0]])
print([synset.name() for synset in paths[1]])

# The most general (root) hypernym synset
print(motorcar.root_hypernyms())

Output:

Synset('stanley_steamer.n.01')
['Model_T', 'S.U.V.', 'SUV', 'Stanley_Steamer', 'ambulance', 'beach_waggon', 'beach_wagon', 'bus', 'cab', 'compact', 'compact_car', 'convertible', 'coupe', 'cruiser', 'electric', 'electric_automobile', 'electric_car', 'estate_car', 'gas_guzzler', 'hack', 'hardtop', 'hatchback', 'heap', 'horseless_carriage', 'hot-rod', 'hot_rod', 'jalopy', 'jeep', 'landrover', 'limo', 'limousine', 'loaner', 'minicar', 'minivan', 'pace_car', 'patrol_car', 'phaeton', 'police_car', 'police_cruiser', 'prowl_car', 'race_car', 'racer', 'racing_car', 'roadster', 'runabout', 'saloon', 'secondhand_car', 'sedan', 'sport_car', 'sport_utility', 'sport_utility_vehicle', 'sports_car', 'squad_car', 'station_waggon', 'station_wagon', 'stock_car', 'subcompact', 'subcompact_car', 'taxi', 'taxicab', 'tourer', 'touring_car', 'two-seater', 'used-car', 'waggon', 'wagon']
[Synset('motor_vehicle.n.01')]
2
[<bound method Synset.name of Synset('entity.n.01')>, <bound method Synset.name of Synset('physical_entity.n.01')>, <bound method Synset.name of Synset('object.n.01')>, <bound method Synset.name of Synset('whole.n.02')>, <bound method Synset.name of Synset('artifact.n.01')>, <bound method Synset.name of Synset('instrumentality.n.03')>, <bound method Synset.name of Synset('container.n.01')>, <bound method Synset.name of Synset('wheeled_vehicle.n.01')>, <bound method Synset.name of Synset('self-propelled_vehicle.n.01')>, <bound method Synset.name of Synset('motor_vehicle.n.01')>, <bound method Synset.name of Synset('car.n.01')>]
['entity.n.01', 'physical_entity.n.01', 'object.n.01', 'whole.n.02', 'artifact.n.01', 'instrumentality.n.03', 'container.n.01', 'wheeled_vehicle.n.01', 'self-propelled_vehicle.n.01', 'motor_vehicle.n.01', 'car.n.01']
['entity.n.01', 'physical_entity.n.01', 'object.n.01', 'whole.n.02', 'artifact.n.01', 'instrumentality.n.03', 'conveyance.n.03', 'vehicle.n.01', 'wheeled_vehicle.n.01', 'self-propelled_vehicle.n.01', 'motor_vehicle.n.01', 'car.n.01']
[Synset('entity.n.01')]

3. More Lexical Relations

  • part_meronyms(): parts; e.g. the parts of a tree include its trunk and crown.
  • substance_meronyms(): the substances something is made of; e.g. a tree is made of heartwood and sapwood.
  • member_holonyms(): the whole formed by the members; e.g. a collection of trees forms a forest.
  • entailments(): entailment relations between verbs
  • antonyms(): antonyms
  • dir(): inspect the lexical relations and other methods defined on a synset
print(wn.synset('tree.n.01').part_meronyms())
print(wn.synset('tree.n.01').substance_meronyms())
print(wn.synset('tree.n.01').member_holonyms())

print(wn.synset('mint.n.04').part_holonyms())
print(wn.synset('mint.n.04').substance_holonyms())
for synset in wn.synsets('mint', wn.NOUN):
    print(synset.name() + ':', synset.definition())
print('-----------' * 4)

# Entailments
print(wn.synset('walk.v.01').entailments())
print(wn.synset('eat.v.01').entailments())
print(wn.synset('tease.v.03').entailments())
print('-----------' * 4)

# Antonyms
print(wn.lemma('supply.n.02.supply').antonyms())
print(wn.lemma('rush.v.01.rush').antonyms())
print(wn.lemma('horizontal.a.01.horizontal').antonyms())
print(wn.lemma('staccato.r.01.staccato').antonyms())

# dir(): see the lexical relations and other methods defined on a synset
print(dir(wn.synset('harmony.n.02')))

Output:

[Synset('burl.n.02'), Synset('crown.n.07'), Synset('limb.n.02'), Synset('stump.n.01'), Synset('trunk.n.01')]
[Synset('heartwood.n.01'), Synset('sapwood.n.01')]
[Synset('forest.n.01')]
[Synset('mint.n.02')]
[Synset('mint.n.05')]
batch.n.02: (often followed by `of') a large number or amount or extent
mint.n.02: any north temperate plant of the genus Mentha with aromatic leaves and small mauve flowers
mint.n.03: any member of the mint family of plants
mint.n.04: the leaves of a mint plant used fresh or candied
mint.n.05: a candy that is flavored with a mint oil
mint.n.06: a plant where money is coined by authority of the government
--------------------------------------------
[Synset('step.v.01')]
[Synset('chew.v.01'), Synset('swallow.v.01')]
[Synset('arouse.v.07'), Synset('disappoint.v.01')]
--------------------------------------------
[Lemma('demand.n.02.demand')]
[Lemma('linger.v.04.linger')]
[Lemma('vertical.a.01.vertical'), Lemma('inclined.a.02.inclined')]
[Lemma('legato.r.01.legato')]
['__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__slots__', '__str__', '__subclasshook__', '__unicode__', '__weakref__', '_all_hypernyms', '_definition', '_examples', '_frame_ids', '_hypernyms', '_instance_hypernyms', '_iter_hypernym_lists', '_lemma_names', '_lemma_pointers', '_lemmas', '_lexname', '_max_depth', '_min_depth', '_name', '_needs_root', '_offset', '_pointers', '_pos', '_related', '_shortest_hypernym_paths', '_wordnet_corpus_reader', 'also_sees', 'attributes', 'causes', 'closure', 'common_hypernyms', 'definition', 'entailments', 'examples', 'frame_ids', 'hypernym_distances', 'hypernym_paths', 'hypernyms', 'hyponyms', 'in_region_domains', 'in_topic_domains', 'in_usage_domains', 'instance_hypernyms', 'instance_hyponyms', 'jcn_similarity', 'lch_similarity', 'lemma_names', 'lemmas', 'lexname', 'lin_similarity', 'lowest_common_hypernyms', 'max_depth', 'member_holonyms', 'member_meronyms', 'min_depth', 'name', 'offset', 'part_holonyms', 'part_meronyms', 'path_similarity', 'pos', 'region_domains', 'res_similarity', 'root_hypernyms', 'shortest_path_distance', 'similar_tos', 'substance_holonyms', 'substance_meronyms', 'topic_domains', 'tree', 'unicode_repr', 'usage_domains', 'verb_groups', 'wup_similarity']

4. Semantic Similarity

right = wn.synset('right_whale.n.01')
orca = wn.synset('orca.n.01')
minke = wn.synset('minke_whale.n.01')
tortoise = wn.synset('tortoise.n.01')
novel = wn.synset('novel.n.01')

# Lowest common hypernyms
print(right.lowest_common_hypernyms(minke))
print(right.lowest_common_hypernyms(orca))
print(right.lowest_common_hypernyms(tortoise))
print(right.lowest_common_hypernyms(novel))

# Quantify specificity by looking up the depth of each synset
print(wn.synset('baleen_whale.n.01').min_depth())
print(wn.synset('whale.n.02').min_depth())
print(wn.synset('vertebrate.n.01').min_depth())
print(wn.synset('entity.n.01').min_depth())

# Similarity based on the shortest path connecting the concepts in the
# hypernym hierarchy: a score in the range 0-1 (-1 if there is no path)
print(right.path_similarity(minke))
print(right.path_similarity(orca))
print(right.path_similarity(tortoise))
print(right.path_similarity(novel))

Output:

[Synset('baleen_whale.n.01')]
[Synset('whale.n.02')]
[Synset('vertebrate.n.01')]
[Synset('entity.n.01')]
14
13
8
0
0.25
0.16666666666666666
0.07692307692307693
0.043478260869565216
