Chapter 2: Accessing Text Corpora and Lexical Resources

  • 2.1 Accessing Text Corpora
    • Gutenberg Corpus
    • Web and Chat Text
    • Brown Corpus
    • Reuters Corpus
    • Inaugural Address Corpus
    • Annotated Text Corpora
    • Corpora in Other Languages
    • Text Corpus Structure
    • Loading Your Own Corpus
    • Chinese NLP Corpora/Datasets
      • Sentiment/Opinion/Review Polarity Analysis
      • Chinese Named Entity Recognition
      • Recommender Systems
  • 2.2 Conditional Frequency Distributions
    • Conditions and Events
    • Counting Words by Genre
    • Plotting and Tabulating Distributions
    • Generating Random Text with Bigrams
  • 2.3 More on Python: Reusing Code
    • Creating Programs with a Text Editor
    • Functions
    • Modules
  • 2.4 Lexical Resources
    • Wordlist Corpora
    • A Pronouncing Dictionary
    • Comparative Wordlists
    • Lexical Tools: Toolbox and Shoebox
  • 2.5 WordNet
    • Senses and Synonyms
    • The WordNet Hierarchy
    • More Lexical Relations
    • Semantic Similarity
  • 2.6 Summary

This chapter aims to answer the following questions:

  1. What are some useful text corpora and lexical resources, and how can we access them with Python?
  2. Which Python constructs are most helpful for this work?
  3. How do we avoid repeating ourselves when writing Python code?

2.1 Accessing Text Corpora

Gutenberg Corpus

import nltk
from nltk.book import *
*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908
nltk.corpus.gutenberg.fileids()
['austen-emma.txt','austen-persuasion.txt','austen-sense.txt','bible-kjv.txt','blake-poems.txt','bryant-stories.txt','burgess-busterbrown.txt','carroll-alice.txt','chesterton-ball.txt','chesterton-brown.txt','chesterton-thursday.txt','edgeworth-parents.txt','melville-moby_dick.txt','milton-paradise.txt','shakespeare-caesar.txt','shakespeare-hamlet.txt','shakespeare-macbeth.txt','whitman-leaves.txt']
emma = nltk.corpus.gutenberg.words('austen-emma.txt')  # Jane Austen's "Emma"
len(emma)
192427
emma = nltk.Text(nltk.corpus.gutenberg.words('austen-emma.txt'))
emma.concordance("surprize")
Displaying 25 of 37 matches:
er father , was sometimes taken by surprize at his being still able to pity `
hem do the other any good ." " You surprize me ! Emma must do Harriet good : a
Knightley actually looked red with surprize and displeasure , as he stood up ,
r . Elton , and found to his great surprize , that Mr . Elton was actually on
d aid ." Emma saw Mrs . Weston ' s surprize , and felt that it must be great ,
father was quite taken up with the surprize of so sudden a journey , and his f
y , in all the favouring warmth of surprize and conjecture . She was , moreove
he appeared , to have her share of surprize , introduction , and pleasure . Th
ir plans ; and it was an agreeable surprize to her , therefore , to perceive t
talking aunt had taken me quite by surprize , it must have been the death of m
f all the dialogue which ensued of surprize , and inquiry , and congratulation
the present . They might chuse to surprize her ." Mrs . Cole had many to agre
the mode of it , the mystery , the surprize , is more like a young woman ' s s
to her song took her agreeably by surprize -- a second , slightly but correct
" " Oh ! no -- there is nothing to surprize one at all .-- A pretty fortune ;
t to be considered . Emma ' s only surprize was that Jane Fairfax should accep
of your admiration may take you by surprize some day or other ." Mr . Knightle
ation for her will ever take me by surprize .-- I never had a thought of her i
expected by the best judges , for surprize -- but there was great joy . Mr .
sound of at first , without great surprize . " So unreasonably early !" she w
d Frank Churchill , with a look of surprize and displeasure .-- " That is easy
; and Emma could imagine with what surprize and mortification she must be retu
tled that Jane should go . Quite a surprize to me ! I had not the least idea !
. It is impossible to express our surprize . He came to speak to his father o
g engaged !" Emma even jumped with surprize ;-- and , horror - struck , exclai

This calls the words() function of the gutenberg object in NLTK's corpus package. Since typing such a long name each time is cumbersome, Python provides an alternative form of the import statement:

from nltk.corpus import gutenberg
gutenberg.fileids()
['austen-emma.txt','austen-persuasion.txt','austen-sense.txt','bible-kjv.txt','blake-poems.txt','bryant-stories.txt','burgess-busterbrown.txt','carroll-alice.txt','chesterton-ball.txt','chesterton-brown.txt','chesterton-thursday.txt','edgeworth-parents.txt','melville-moby_dick.txt','milton-paradise.txt','shakespeare-caesar.txt','shakespeare-hamlet.txt','shakespeare-macbeth.txt','whitman-leaves.txt']
emma = gutenberg.words("austen-emma.txt")
# This program displays three statistics for each text: average word length,
# average sentence length, and the average number of times each word appears
# in the text (our lexical diversity score).
for fileid in gutenberg.fileids():
    num_chars = len(gutenberg.raw(fileid))
    num_words = len(gutenberg.words(fileid))
    num_sents = len(gutenberg.sents(fileid))
    num_vocab = len(set(w.lower() for w in gutenberg.words(fileid)))
    print(round(num_chars/num_words), round(num_words/num_sents), round(num_words/num_vocab), fileid)
5 25 26 austen-emma.txt
5 26 17 austen-persuasion.txt
5 28 22 austen-sense.txt
4 34 79 bible-kjv.txt
5 19 5 blake-poems.txt
4 19 14 bryant-stories.txt
4 18 12 burgess-busterbrown.txt
4 20 13 carroll-alice.txt
5 20 12 chesterton-ball.txt
5 23 11 chesterton-brown.txt
5 18 11 chesterton-thursday.txt
4 21 25 edgeworth-parents.txt
5 26 15 melville-moby_dick.txt
5 52 11 milton-paradise.txt
4 12 9 shakespeare-caesar.txt
4 12 8 shakespeare-hamlet.txt
4 12 7 shakespeare-macbeth.txt
5 36 12 whitman-leaves.txt

Average word length appears to be a general property of English, since its value is always 4. (In fact, the average word length is really 3, not 4, because the num_chars variable counts space characters.) By contrast, average sentence length and lexical diversity appear to be characteristics of particular authors.
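As a quick check of that parenthetical, here is a minimal sketch (mine, not the book's) that recomputes average word length with whitespace excluded from the character count:

from nltk.corpus import gutenberg

for fileid in gutenberg.fileids():
    raw = gutenberg.raw(fileid)
    # Count only non-whitespace characters, so spaces and newlines
    # no longer inflate the average word length.
    num_chars = len([c for c in raw if not c.isspace()])
    num_words = len(gutenberg.words(fileid))
    print(round(num_chars / num_words), fileid)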

len(gutenberg.raw('blake-poems.txt'))  # raw() gives the file's contents without any linguistic processing; its length counts every character, including the spaces between words.
38153
macbeth_sentences = gutenberg.sents('shakespeare-macbeth.txt')  # sents() divides the text into sentences, each of which is a list of words.
macbeth_sentences
[['[', 'The', 'Tragedie', 'of', 'Macbeth', 'by', 'William', 'Shakespeare', '1603', ']'], ['Actus', 'Primus', '.'], ...]
macbeth_sentences[1603]
['The', 'hart', 'is', 'sorely', 'charg', "'", 'd']
longest_len = max(len(s) for s in macbeth_sentences)
longest_len_sent = [s for s in macbeth_sentences if len(s) == longest_len]
print(longest_len_sent)
[['Doubtfull', 'it', 'stood', ',', 'As', 'two', 'spent', 'Swimmers', ',', 'that', 'doe', 'cling', 'together', ',', 'And', 'choake', 'their', 'Art', ':', 'The', 'mercilesse', 'Macdonwald', '(', 'Worthie', 'to', 'be', 'a', 'Rebell', ',', 'for', 'to', 'that', 'The', 'multiplying', 'Villanies', 'of', 'Nature', 'Doe', 'swarme', 'vpon', 'him', ')', 'from', 'the', 'Westerne', 'Isles', 'Of', 'Kernes', 'and', 'Gallowgrosses', 'is', 'supply', "'", 'd', ',', 'And', 'Fortune', 'on', 'his', 'damned', 'Quarry', 'smiling', ',', 'Shew', "'", 'd', 'like', 'a', 'Rebells', 'Whore', ':', 'but', 'all', "'", 's', 'too', 'weake', ':', 'For', 'braue', 'Macbeth', '(', 'well', 'hee', 'deserues', 'that', 'Name', ')', 'Disdayning', 'Fortune', ',', 'with', 'his', 'brandisht', 'Steele', ',', 'Which', 'smoak', "'", 'd', 'with', 'bloody', 'execution', '(', 'Like', 'Valours', 'Minion', ')', 'caru', "'", 'd', 'out', 'his', 'passage', ',', 'Till', 'hee', 'fac', "'", 'd', 'the', 'Slaue', ':', 'Which', 'neu', "'", 'r', 'shooke', 'hands', ',', 'nor', 'bad', 'farwell', 'to', 'him', ',', 'Till', 'he', 'vnseam', "'", 'd', 'him', 'from', 'the', 'Naue', 'toth', "'", 'Chops', ',', 'And', 'fix', "'", 'd', 'his', 'Head', 'vpon', 'our', 'Battlements']]

Web and Chat Text

from nltk.corpus import webtext
for fileid in webtext.fileids():
    print(fileid, webtext.raw(fileid)[:65], '...')
# A Firefox discussion forum, conversations overheard in New York, the movie
# script of Pirates of the Carribean, personal advertisements, and wine reviews.
firefox.txt Cookie Manager: "Don't allow sites that set removed cookies to se ...
grail.txt SCENE 1: [wind] [clop clop clop]
KING ARTHUR: Whoa there!  [clop ...
overheard.txt White guy: So, do you have any plans for this evening?
Asian girl ...
pirates.txt PIRATES OF THE CARRIBEAN: DEAD MAN'S CHEST, by Ted Elliott & Terr ...
singles.txt 25 SEXY MALE, seeks attrac older single lady, for discreet encoun ...
wine.txt Lovely delicate, fragrant Rhone wine. Polished leather and strawb ...
from nltk.corpus import nps_chat
chatroom = nps_chat.posts('10-19-20s_706posts.xml')  # e.g. 10-19-20s_706posts.xml contains 706 posts gathered from the 20s chat room on 10/19/2006.
print(chatroom[123])
['i', 'do', "n't", 'want', 'hot', 'pics', 'of', 'a', 'female', ',', 'I', 'can', 'look', 'in', 'a', 'mirror', '.']

Brown Corpus

The Brown Corpus was the first million-word electronic corpus of English, created at Brown University in 1961. It contains text from 500 sources, categorized by genre, such as news, editorial, and so on.

Table 2-1. Example documents for each section of the Brown Corpus

ID  File  Genre            Description
A16 ca16  news             Chicago Tribune: Society Reportage
B02 cb02  editorial        Christian Science Monitor: Editorials
C17 cc17  reviews          Time Magazine: Reviews
D12 cd12  religion         Underwood: Probing the Ethics of Realtors
E36 ce36  hobbies          Norling: Renting a Car in Europe
F25 cf25  lore             Boroff: Jewish Teenage Culture
G22 cg22  belles_lettres   Reiner: Coping with Runaway Technology
H15 ch15  government       US Office of Civil and Defence Mobilization: The Family Fallout Shelter
J17 cj19  learned          Mosteller: Probability with Statistical Applications
K04 ck04  fiction          W.E.B. Du Bois: Worlds of Color
L13 cl13  mystery          Hitchens: Footsteps in the Night
M01 cm01  science_fiction  Heinlein: Stranger in a Strange Land
N14 cn15  adventure        Field: Rattlesnake Ridge
P12 cp12  romance          Callaghan: A Passion in Rome
R06 cr06  humor            Thurber: The Future, If Any, of Comedy
from nltk.corpus import brown
brown.categories()
['adventure','belles_lettres','editorial','fiction','government','hobbies','humor','learned','lore','mystery','news','religion','reviews','romance','science_fiction']
brown.words(categories='news')
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]
brown.words(fileids=['cg22'])
['Does', 'our', 'society', 'have', 'a', 'runaway', ',', ...]
brown.sents(categories=['news', 'editorial', 'reviews'])
[['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of', "Atlanta's", 'recent', 'primary', 'election', 'produced', '``', 'no', 'evidence', "''", 'that', 'any', 'irregularities', 'took', 'place', '.'], ['The', 'jury', 'further', 'said', 'in', 'term-end', 'presentments', 'that', 'the', 'City', 'Executive', 'Committee', ',', 'which', 'had', 'over-all', 'charge', 'of', 'the', 'election', ',', '``', 'deserves', 'the', 'praise', 'and', 'thanks', 'of', 'the', 'City', 'of', 'Atlanta', "''", 'for', 'the', 'manner', 'in', 'which', 'the', 'election', 'was', 'conducted', '.'], ...]

The Brown Corpus is a convenient resource for studying systematic differences between genres, a kind of linguistic inquiry known as stylistics.

Let's compare the use of modal verbs across different genres:

from nltk.corpus import brown
news_text = brown.words(categories='news')
fdist = nltk.FreqDist([w.lower() for w in news_text])
modals = ['can', 'could', 'may', 'might', 'must', 'will']
for m in modals:
    print(m + ':', fdist[m], end=' ')
can: 94 could: 87 may: 93 might: 38 must: 53 will: 389
modals = ['what', 'when', 'where', 'who', 'why']
for m in modals:
    print(m + ':', fdist[m], end=' ')
what: 95 when: 169 where: 59 who: 268 why: 14

To produce counts for each genre of interest, we use NLTK's conditional frequency distributions:

cfd = nltk.ConditionalFreqDist(
    (genre, word)
    for genre in brown.categories()
    for word in brown.words(categories=genre))
genres = ['news', 'religion', 'hobbies', 'science_fiction', 'romance', 'humor']
modals = ['can', 'could', 'may', 'might', 'must', 'will']

The most frequent modal in the news genre is will, while the most frequent modal in the romance genre is could:

cfd.tabulate(conditions=genres, samples=modals)
                  can could   may might  must  will
           news    93    86    66    38    50   389
       religion    82    59    78    12    54    71
        hobbies   268    58   131    22    83   264
science_fiction    16    49     4    12     8    16
        romance    74   193    11    51    45    43
          humor    16    30     8     8     9    13

Reuters Corpus

from nltk.corpus import reuters
reuters_fileids = reuters.fileids()
reuters_fileids[1:10]
['test/14828','test/14829','test/14832','test/14833','test/14839','test/14840','test/14841','test/14842','test/14843']
reuters_categories = reuters.categories()
reuters_categories[:5]
['acq', 'alum', 'barley', 'bop', 'carcass']
reuters.categories('training/9865')
['barley', 'corn', 'grain', 'wheat']
reuters.categories(['training/9865', 'training/9880'])
['barley', 'corn', 'grain', 'money-fx', 'wheat']
reuters_fileids = reuters.fileids('barley')
reuters_fileids[:5]
['test/15618', 'test/15649', 'test/15676', 'test/15728', 'test/15871']
reuters_fileids = reuters.fileids(['barley', 'corn'])
reuters_fileids[:5]
['test/14832', 'test/14858', 'test/15033', 'test/15043', 'test/15106']
reuters.words('training/9865')[:5]
['FRENCH', 'FREE', 'MARKET', 'CEREAL', 'EXPORT']

Inaugural Address Corpus

from nltk.corpus import inaugural
inaugural_fileids = inaugural.fileids()
inaugural_fileids[:5]
['1789-Washington.txt','1793-Washington.txt','1797-Adams.txt','1801-Jefferson.txt','1805-Jefferson.txt']

The year of each text appears in its filename. To get the year out of a filename, we extract the first four characters using fileid[:4].

print([fileid[:4] for fileid in inaugural.fileids()])
['1789', '1793', '1797', '1801', '1805', '1809', '1813', '1817', '1821', '1825', '1829', '1833', '1837', '1841', '1845', '1849', '1853', '1857', '1861', '1865', '1869', '1873', '1877', '1881', '1885', '1889', '1893', '1897', '1901', '1905', '1909', '1913', '1917', '1921', '1925', '1929', '1933', '1937', '1941', '1945', '1949', '1953', '1957', '1961', '1965', '1969', '1973', '1977', '1981', '1985', '1989', '1993', '1997', '2001', '2005', '2009']

Let's look at how the words america and citizen have been used over time.

from nltk.corpus import inaugural
cfd = nltk.ConditionalFreqDist(
    (target, fileid[:4])
    for fileid in inaugural.fileids()
    for w in inaugural.words(fileid)
    for target in ['america', 'citizen']
    if w.lower().startswith(target))
cfd.plot()

Figure 2-1. Conditional frequency distribution plot: counts of all words in the Inaugural Address Corpus beginning with america or citizen. Each speech is counted separately, so trends in usage over time can be observed. Counts are not normalized for document length.
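Since the counts are not normalized, longer speeches weigh more heavily. A small sketch (mine, not the book's) that plots relative frequencies instead, dividing each hit by the length of the corresponding speech:

import nltk
from nltk.corpus import inaugural

cfd = nltk.ConditionalFreqDist()
for fileid in inaugural.fileids():
    words = inaugural.words(fileid)
    n = len(words)
    for w in words:
        for target in ['america', 'citizen']:
            if w.lower().startswith(target):
                # Accumulate 1/n per hit, so each year's value is a
                # per-word relative frequency rather than a raw count.
                cfd[target][fileid[:4]] += 1 / n
cfd.plot()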

Annotated Text Corpora

Table 2-2 (not reproduced here) lists some of these corpora. They can be downloaded free of charge for teaching and research; see http://www.nltk.org/data for download information.

Corpora in Other Languages

from nltk.corpus import udhr
languages = ['Chickasaw', 'English', 'German_Deutsch','Greenlandic_Inuktikut', 'Hungarian_Magyar', 'Ibibio_Efik']
cfd = nltk.ConditionalFreqDist(
    (lang, len(word))
    for lang in languages
    for word in udhr.words(lang + '-Latin1'))
cfd.plot(cumulative=True)

Figure 2-2. Cumulative word-length distributions for six translations of the Universal Declaration of Human Rights.

raw_text = udhr.raw('Chinese_Mandarin-GB2312')
nltk.FreqDist(raw_text).plot()  # crashes ...
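A workaround sketch (my note, not the book's): inspect the character frequencies textually instead of plotting, which sidesteps the CJK font problems that typically make this plot fail:

import nltk
from nltk.corpus import udhr

raw_text = udhr.raw('Chinese_Mandarin-GB2312')
fd = nltk.FreqDist(raw_text)
# Print the 20 most common characters instead of plotting them.
for char, count in fd.most_common(20):
    print(repr(char), count)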

Text Corpus Structure

NLTK's corpus readers support efficient access to a wide range of corpora, and can also be applied to new corpora.
Common structures of text corpora:

  • Isolated corpus: the simplest kind of corpus is a collection of isolated texts with no particular organization;
  • Categorized corpus: some corpora are organized into categories such as genre (Brown Corpus);
  • Overlapping corpus: some categorizations overlap, such as topic categories (Reuters Corpus);
  • Temporal corpus: other corpora represent changes in language use over time (Inaugural Address Corpus).

Table 2-3. Basic corpus functionality defined in NLTK

Example                     Description
fileids()                   The files of the corpus
fileids([categories])       The files of the corpus corresponding to these categories
categories()                The categories of the corpus
categories([fileids])       The categories of the corpus corresponding to these files
raw()                       The raw content of the corpus
raw(fileids=[f1,f2,f3])     The raw content of the specified files
raw(categories=[c1,c2])     The raw content of the specified categories
words()                     The words of the whole corpus
words(fileids=[f1,f2,f3])   The words of the specified files
words(categories=[c1,c2])   The words of the specified categories
sents()                     The sentences of the whole corpus
sents(fileids=[f1,f2,f3])   The sentences of the specified files
sents(categories=[c1,c2])   The sentences of the specified categories
abspath(fileid)             The location of the given file on disk
encoding(fileid)            The encoding of the file (if known)
open(fileid)                Open a stream for reading the given corpus file
root()                      The path to the root of the locally installed corpus
raw = gutenberg.raw("burgess-busterbrown.txt")
raw[1:20]
'The Adventures of B'
words = gutenberg.words("burgess-busterbrown.txt")
words[1:5]
['The', 'Adventures', 'of', 'Buster']
sents = gutenberg.sents("burgess-busterbrown.txt")
sents[1:3]
[['I'], ['BUSTER', 'BEAR', 'GOES', 'FISHING']]

Loading Your Own Corpus

If you have your own collection of text files and would like to access them with the methods discussed above, you can easily load them with the help of NLTK's PlaintextCorpusReader. Check the location of your files on your file system; in the following example, we assume your files are in the /usr/share/dict directory. Whatever the location, set the variable corpus_root to this directory. The second argument of the PlaintextCorpusReader initializer can be a list of fileids, like ['a.txt', 'test/b.txt'], or a pattern matching all fileids, like '[abc]/.*\.txt'.

from nltk.corpus import PlaintextCorpusReader
corpus_root = '/usr/share/dict'  # adjust this to your own directory
wordlists = PlaintextCorpusReader(corpus_root, '.*')
wordlists.fileids()
wordlists.words(' ...')  # substitute one of the fileids listed above
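If your files are organized into per-category subdirectories, a categorized reader exposes them through the same categories()/words() interface as the Brown Corpus. A minimal sketch, assuming a hypothetical layout such as my_corpus/news/a.txt and my_corpus/sports/b.txt:

from nltk.corpus.reader import CategorizedPlaintextCorpusReader

reader = CategorizedPlaintextCorpusReader(
    'my_corpus',                 # corpus root (assumption: adjust to your path)
    r'.*\.txt',                  # pattern matching all fileids
    cat_pattern=r'([^/]+)/.*')   # category = first path component
print(reader.categories())
print(reader.words(categories='news')[:10])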

Chinese NLP Corpora/Datasets

https://github.com/SophonPlus/ChineseNlpCorpus

Sentiment/Opinion/Review Polarity Analysis

1. ChnSentiCorp_htl_all dataset
■ Overview: 7,000+ hotel reviews; 5,000+ positive and 2,000+ negative
■ Download:
https://github.com/SophonPlus/ChineseNlpCorpus/blob/master/datasets/ChnSentiCorp_htl_all/intro.ipynb

2. waimai_10k dataset

■ Overview: user reviews collected from a food-delivery platform; 4,000 positive and about 8,000 negative
■ Download:
https://github.com/SophonPlus/ChineseNlpCorpus/blob/master/datasets/waimai_10k/intro.ipynb

3. online_shopping_10_cats dataset
■ Overview: 10 categories, 60,000+ reviews in total, roughly 30,000 each positive and negative; covers books, tablets, mobile phones, fruit, shampoo, water heaters, Mengniu dairy, clothing, computers, and hotels
■ Download:
https://github.com/SophonPlus/ChineseNlpCorpus/blob/master/datasets/online_shopping_10_cats/intro.ipynb

4. weibo_senti_100k dataset
■ Overview: 100,000+ sentiment-annotated Sina Weibo posts, roughly 50,000 each positive and negative
■ Download:
https://github.com/SophonPlus/ChineseNlpCorpus/blob/master/datasets/weibo_senti_100k/intro.ipynb

5. simplifyweibo_4_moods dataset
■ Overview: 360,000+ sentiment-annotated Sina Weibo posts with four emotion labels: about 200,000 joy, and about 50,000 each of anger, disgust, and depression
■ Download:
https://github.com/SophonPlus/ChineseNlpCorpus/blob/master/datasets/simplifyweibo_4_moods/intro.ipynb

6. dmsc_v2 dataset
■ Overview: 28 movies, 700,000+ users, 2,000,000+ ratings/reviews
■ Download:
https://github.com/SophonPlus/ChineseNlpCorpus/blob/master/datasets/dmsc_v2/intro.ipynb

7. yf_dianping dataset
■ Overview: 240,000 restaurants, 540,000 users, 4.4 million reviews/ratings
■ Download:
https://github.com/SophonPlus/ChineseNlpCorpus/blob/master/datasets/yf_dianping/intro.ipynb

8. yf_amazon dataset
■ Overview: 520,000 products, 1,100+ categories, 1.42 million users, 7.2 million reviews/ratings
■ Download:
https://github.com/SophonPlus/ChineseNlpCorpus/blob/master/datasets/yf_amazon/intro.ipynb

Chinese Named Entity Recognition

dh_msra dataset

■ Overview: 50,000+ Chinese named-entity-recognition annotations (covering locations, organizations, and persons)

■ Download:

https://github.com/SophonPlus/ChineseNlpCorpus/blob/master/datasets/dh_msra/intro.ipynb

Recommender Systems


1. ez_douban dataset
■ Overview: 50,000+ movies (30,000+ with titles, 20,000+ without), 28,000 users, 2.8 million ratings
■ Download:
https://github.com/SophonPlus/ChineseNlpCorpus/blob/master/datasets/ez_douban/intro.ipynb

2. dmsc_v2 dataset
■ Overview: 28 movies, 700,000+ users, 2,000,000+ ratings/reviews
■ Download:
https://github.com/SophonPlus/ChineseNlpCorpus/blob/master/datasets/dmsc_v2/intro.ipynb

3. yf_dianping dataset
■ Overview: 240,000 restaurants, 540,000 users, 4.4 million reviews/ratings
■ Download:
https://github.com/SophonPlus/ChineseNlpCorpus/blob/master/datasets/yf_dianping/intro.ipynb

4. yf_amazon dataset
■ Overview: 520,000 products, 1,100+ categories, 1.42 million users, 7.2 million reviews/ratings
■ Download:
https://github.com/SophonPlus/ChineseNlpCorpus/blob/master/datasets/yf_amazon/intro.ipynb

2.2 Conditional Frequency Distributions

A frequency distribution counts observable events, such as the appearance of words in a text. A conditional frequency distribution pairs each event with a condition, so instead of processing a single sequence of words, we have to process a sequence of pairs.

Conditions and Events

text = ['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]
pairs = [('news', 'The'), ('news', 'Fulton'), ('news', 'County'), ...]  # each pair has the form (condition, event)

Counting Words by Genre

FreqDist() takes a simple list as input; ConditionalFreqDist() takes a list of pairs.
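A tiny illustrative contrast (my example, not from the book):

import nltk

fd = nltk.FreqDist(['a', 'b', 'a'])                      # counts events
cfd = nltk.ConditionalFreqDist([('x', 'a'), ('x', 'b'),
                                ('y', 'a')])             # counts events per condition
print(fd['a'])        # 2
print(cfd['x']['a'])  # 1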

import nltk
from nltk.corpus import brown
cfd = nltk.ConditionalFreqDist(
    (genre, word)
    for genre in brown.categories()
    for word in brown.words(categories=genre))
genre_word = [(genre, word)
              for genre in ['news', 'romance']
              for word in brown.words(categories=genre)]
len(genre_word)
170576
genre_word[:4]
[('news', 'The'), ('news', 'Fulton'), ('news', 'County'), ('news', 'Grand')]
genre_word[-3:]
[('romance', 'not'), ('romance', "''"), ('romance', '.')]
cfd = nltk.ConditionalFreqDist(genre_word)
print(cfd)
<ConditionalFreqDist with 2 conditions>
cfd.conditions()
['news', 'romance']
print(cfd['news'])
<FreqDist with 14394 samples and 100554 outcomes>
print(cfd['romance'])
<FreqDist with 8452 samples and 70022 outcomes>

Plotting and Tabulating Distributions

from nltk.corpus import inaugural
cfd = nltk.ConditionalFreqDist(
    (target, fileid[:4])
    for fileid in inaugural.fileids()
    for w in inaugural.words(fileid)
    for target in ['america', 'citizen']
    if w.lower().startswith(target))
from nltk.corpus import udhr
languages = ['Chickasaw', 'English', 'German_Deutsch', 'Greenlandic_Inuktikut', 'Hungarian_Magyar', 'Ibibio_Efik']
cfd = nltk.ConditionalFreqDist(
    (lang, len(word))
    for lang in languages
    for word in udhr.words(lang + '-Latin1'))
cfd.tabulate(conditions=['English', 'German_Deutsch'],samples=range(10), cumulative=True)
                  0    1    2    3    4    5    6    7    8    9
       English    0  185  525  883  997 1166 1283 1440 1558 1638
German_Deutsch    0  171  263  614  717  894 1013 1110 1213 1275

Generating Random Text with Bigrams

sent = ['In', 'the', 'beginning', 'God', 'created', 'the', 'heaven','and', 'the', 'earth', '.']
list(nltk.bigrams(sent))
[('In', 'the'),('the', 'beginning'),('beginning', 'God'),('God', 'created'),('created', 'the'),('the', 'heaven'),('heaven', 'and'),('and', 'the'),('the', 'earth'),('earth', '.')]

Example 2-1. Generating random text: this program obtains all bigrams from the text of Genesis, then constructs a conditional frequency distribution to record which words are most likely to follow a given word; e.g., the most likely word after living is creature; the generate_model() function uses this data, plus a seed word, to generate random text.

def generate_model(cfdist, word, num=15):
    for i in range(num):
        print(word, end=' ')
        word = cfdist[word].max()

text = nltk.corpus.genesis.words('english-kjv.txt')
bigrams = nltk.bigrams(text)
cfd = nltk.ConditionalFreqDist(bigrams)
cfd['living']
FreqDist({',': 1,'.': 1,'creature': 7,'soul': 1,'substance': 2,'thing': 4})
generate_model(cfd, 'living')
living creature that he said , and the land of the land of the land
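generate_model() always picks cfdist[word].max(), so it quickly falls into a loop ("the land of the land of ..."). A hedged variant (my sketch, not the book's) samples the next word in proportion to its bigram frequency, which produces more varied output:

import random

def generate_model_random(cfdist, word, num=15):
    for i in range(num):
        print(word, end=' ')
        fd = cfdist[word]
        if not fd:
            break  # dead end: word never occurs as the first element of a bigram
        # Sample the next word weighted by its observed bigram count.
        word = random.choices(list(fd.keys()), weights=list(fd.values()))[0]

generate_model_random(cfd, 'living')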

Table 2-4. NLTK's conditional frequency distributions: commonly used methods and idioms for defining, accessing, and visualizing a conditional frequency distribution of counts

Example                               Description
cfdist = ConditionalFreqDist(pairs)   Create a conditional frequency distribution from a list of pairs
cfdist.conditions()                   The conditions, sorted alphabetically
cfdist[condition]                     The frequency distribution for this condition
cfdist[condition][sample]             Frequency of the given sample under this condition
cfdist.tabulate()                     Tabulate the conditional frequency distribution
cfdist.tabulate(samples, conditions)  Tabulation limited to the specified samples and conditions
cfdist.plot()                         Graphical plot of the conditional frequency distribution
cfdist.plot(samples, conditions)      Plot limited to the specified samples and conditions
cfdist1 < cfdist2                     Test if samples in cfdist1 occur less frequently than in cfdist2

2.3 More on Python: Reusing Code

Creating Programs with a Text Editor

  • Use IDLE
  • Use Spyder

Functions

def lexical_diversity(text):
    return len(text) / len(set(text))  # the keyword return indicates the value produced as output by the function

def lexical_diversity(my_text_data):
    # local variables, which cannot be accessed outside the function body
    word_count = len(my_text_data)
    vocab_size = len(set(my_text_data))
    diversity_score = word_count / vocab_size
    return diversity_score
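A quick usage check (my example): applying the function to one Brown Corpus genre. The expected value follows from the FreqDist shown in Section 2.2 (100554 outcomes / 14394 samples for news, i.e. roughly 7):

from nltk.corpus import brown

# Each distinct word form in the news genre is used about 7 times on average.
print(lexical_diversity(brown.words(categories='news')))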

Example 2-2. A Python function: this function tries to produce the plural form of any English noun. The keyword def (define) is followed by the function name, then the parameters in parentheses and a colon; the body of the function is an indented block of code; the function tries to recognize patterns within the word and process it accordingly; e.g., if the word ends in y, delete the y and add ies.

def plural(word):
    if word.endswith('y'):  # object name, then a period, then the function name; such functions are usually called methods
        return word[:-1] + 'ies'
    elif word[-1] in 'sx' or word[-2:] in ['sh', 'ch']:
        return word + 'es'
    elif word.endswith('an'):
        return word[:-2] + 'en'
    else:
        return word + 's'
plural('fairy')
'fairies'
plural('woman')
'women'

Modules

Save your plural function to a file named textproc.py:

from textproc import plural
plural('woman')
'women'

2.4 Lexical Resources

Wordlist Corpora

Example 2-3. Filtering a text: this program computes the vocabulary of a text, then removes all items that occur in an existing wordlist, leaving just the uncommon or misspelled words.

def unusual_words(text):
    text_vocab = set(w.lower() for w in text if w.isalpha())
    english_vocab = set(w.lower() for w in nltk.corpus.words.words())
    unusual = text_vocab.difference(english_vocab)
    return sorted(unusual)
len(unusual_words(nltk.corpus.gutenberg.words('austen-sense.txt')))
1601
len(unusual_words(nltk.corpus.nps_chat.words()))
2095

There is also a corpus of stopwords, that is, high-frequency words like the and to, which we sometimes want to filter out of a document before further processing.

from nltk.corpus import stopwords
len(stopwords.words('english'))
179

Let's define a function to compute the fraction of words in a text that are not in the stopwords list:

def content_fraction(text):
    stopwords = nltk.corpus.stopwords.words('english')
    content = [w for w in text if w.lower() not in stopwords]
    return len(content) / len(text)
content_fraction(nltk.corpus.reuters.words())
0.735240435097661

Figure 2-6. A letter puzzle: in a grid of randomly chosen letters, choose letters to make words. This puzzle is known as "Target." The instructions read: how many words of four letters or more can you make from those shown here? Each letter may be used only once per word. Each word must contain the centre letter, and there must be at least one nine-letter word. No plurals ending in "s"; no foreign words; no proper names. 21 words is "good"; 32 words, "very good"; 42 words, "excellent."

puzzle_letters = nltk.FreqDist('egivrvonl')
obligatory = 'r'
wordlist = nltk.corpus.words.words()
[w for w in wordlist if len(w) >= 6 and obligatory in w and nltk.FreqDist(w) <= puzzle_letters][:5]
['glover', 'gorlin', 'govern', 'grovel', 'ignore']
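The test nltk.FreqDist(w) <= puzzle_letters above works as a multiset-inclusion check: it holds when every letter of w occurs no more often than in the puzzle letters. A quick illustration (my example):

import nltk

print(nltk.FreqDist('govern') <= nltk.FreqDist('egivrvonl'))  # True: all letters fit
print(nltk.FreqDist('green') <= nltk.FreqDist('egivrvonl'))   # False: needs two e's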
names = nltk.corpus.names
names.fileids()
['female.txt', 'male.txt']
male_names = names.words('male.txt')
female_names = names.words('female.txt')
[w for w in male_names if w in female_names][:5]
['Abbey', 'Abbie', 'Abby', 'Addie', 'Adrian']
cfd = nltk.ConditionalFreqDist(
    (fileid, name[-1])
    for fileid in names.fileids()
    for name in names.words(fileid))
cfd.plot()

A Pronouncing Dictionary

entries = nltk.corpus.cmudict.entries()
len(entries)
133737
for entry in entries[39943:39951]:
    print(entry)
('explorer', ['IH0', 'K', 'S', 'P', 'L', 'AO1', 'R', 'ER0'])
('explorers', ['IH0', 'K', 'S', 'P', 'L', 'AO1', 'R', 'ER0', 'Z'])
('explores', ['IH0', 'K', 'S', 'P', 'L', 'AO1', 'R', 'Z'])
('exploring', ['IH0', 'K', 'S', 'P', 'L', 'AO1', 'R', 'IH0', 'NG'])
('explosion', ['IH0', 'K', 'S', 'P', 'L', 'OW1', 'ZH', 'AH0', 'N'])
('explosions', ['IH0', 'K', 'S', 'P', 'L', 'OW1', 'ZH', 'AH0', 'N', 'Z'])
('explosive', ['IH0', 'K', 'S', 'P', 'L', 'OW1', 'S', 'IH0', 'V'])
('explosively', ['EH2', 'K', 'S', 'P', 'L', 'OW1', 'S', 'IH0', 'V', 'L', 'IY0'])
for word, pron in entries:
    if len(pron) == 3:
        ph1, ph2, ph3 = pron
        if ph1 == 'P' and ph3 == 'T':
            print(word, ph2, end=' ')
pait EY1 pat AE1 pate EY1 patt AE1 peart ER1 peat IY1 peet IY1 peete IY1 pert ER1 pet EH1 pete IY1 pett EH1 piet IY1 piette IY1 pit IH1 pitt IH1 pot AA1 pote OW1 pott AA1 pout AW1 puett UW1 purt ER1 put UH1 putt AH1
syllable = ['N', 'IH0', 'K', 'S']
[word for word, pron in entries if pron[-4:] == syllable][:5]  # use this idiom to find rhyming words
["atlantic's", 'audiotronics', 'avionics', 'beatniks', 'calisthenics']
[w for w, pron in entries if pron[-1] == 'M' and w[-1] == 'n'][:5]
['autumn', 'column', 'condemn', 'damn', 'goddamn']
sorted(set(w[:2] for w, pron in entries if pron[0] == 'N' and w[0] != 'n'))
['gn', 'kn', 'mn', 'pn']
def stress(pron):
    return [char for phone in pron for char in phone if char.isdigit()]
[w for w, pron in entries if stress(pron) == ['0', '1', '0', '2', '0']][:5]
['abbreviated', 'abbreviated', 'abbreviating', 'accelerated', 'accelerating']
[w for w, pron in entries if stress(pron) == ['0', '2', '0', '1', '0']][:5]
['abbreviation','abbreviations','abomination','abortifacient','abortifacients']
p3 = [(pron[0] + '-' + pron[2], word)
      for (word, pron) in entries
      if pron[0] == 'P' and len(pron) == 3]
cfd = nltk.ConditionalFreqDist(p3)
for template in cfd.conditions():
    if len(cfd[template]) > 10:
        words = cfd[template].keys()
        wordlist = ' '.join(words)
        print(template, wordlist[:70] + "...")
P-S pus peace pesce pass puss purse piece perse pease perce pasts poss pos...
P-N pain penn pyne pinn poon pine pin paine penh paign peine pawn pun pane...
P-T pat peart pout put pit purt putt piet pert pet pote pate patt piette p...
P-UW1 pru plue prue pshew prugh prew peru peugh pugh pew plew...
P-K pique pack paque paek perk poke puck pik polk purk peak poch pake perc...
P-Z poe's pas pei's pows pao's pose purrs peas paiz pies pays pause p.s pa...
P-CH perch petsch piech petsche piche peach pautsch pouch pietsch pitsch pu...
P-P pipp papp pep pope paup pop popp pup poop pape poppe pip paap paape pe...
P-R pour poor poore parr porr pear peer pore pier paar por pare pair par...
P-L pal pall peil pehl peele paille pile poll pearl peale perle pull pill ...
prondict = nltk.corpus.cmudict.dict()
prondict['fire']  # look up a word by giving the dictionary's name followed by a key (e.g. the word fire) in square brackets
[['F', 'AY1', 'ER0'], ['F', 'AY1', 'R']]
prondict['blog']
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-175-f0ffb282ba9a> in <module>()
----> 1 prondict['blog']
KeyError: 'blog'
prondict['blog'] = [['B', 'L', 'AA1', 'G']]
prondict['blog']
[['B', 'L', 'AA1', 'G']]
text = ['natural', 'language', 'processing']
[ph for w in text for ph in prondict[w][0]][:5]
['N', 'AE1', 'CH', 'ER0', 'AH0']
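Given the KeyError above, a defensive variant (my sketch) uses dict.get() to skip out-of-vocabulary words instead of raising:

def phonemes(words, prondict):
    result = []
    for w in words:
        prons = prondict.get(w)  # None for out-of-vocabulary words
        if prons:
            result.extend(prons[0])  # take the first listed pronunciation
    return result

print(phonemes(['natural', 'language', 'blog'], prondict))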

Comparative Wordlists

The Swadesh core wordlists

from nltk.corpus import swadesh
swadesh.fileids()[:5]
['be', 'bg', 'bs', 'ca', 'cs']
swadesh.words('en')[:5]
['I', 'you (singular), thou', 'he', 'we', 'you (plural)']
fr2en = swadesh.entries(['fr', 'en'])
fr2en[:5]
[('je', 'I'),('tu, vous', 'you (singular), thou'),('il', 'he'),('nous', 'we'),('vous', 'you (plural)')]
translate = dict(fr2en)
translate['chien']
'dog'
translate['jeter']
'throw'
de2en = swadesh.entries(['de', 'en']) # German-English
es2en = swadesh.entries(['es', 'en']) # Spanish-English
translate.update(dict(de2en))
translate.update(dict(es2en))
translate['Hund']
'dog'
translate['perro']
'dog'
languages = ['en', 'de', 'nl', 'es', 'fr', 'pt', 'la']
for i in [139, 140, 141, 142]:
    print(swadesh.entries(languages)[i])
('say', 'sagen', 'zeggen', 'decir', 'dire', 'dizer', 'dicere')
('sing', 'singen', 'zingen', 'cantar', 'chanter', 'cantar', 'canere')
('play', 'spielen', 'spelen', 'jugar', 'jouer', 'jogar, brincar', 'ludere')
('float', 'schweben', 'zweven', 'flotar', 'flotter', 'flutuar, boiar', 'fluctuare')

Lexical Tools: Toolbox and Shoebox

Toolbox, previously known as Shoebox

from nltk.corpus import toolbox
toolbox.entries('rotokas.dic')[:1]
[('kaa',[('ps', 'V'),('pt', 'A'),('ge', 'gag'),('tkp', 'nek i pas'),('dcsv', 'true'),('vx', '1'),('sc', '???'),('dt', '29/Oct/2005'),('ex', 'Apoka ira kaaroi aioa-ia reoreopaoro.'),('xp', 'Kaikai i pas long nek bilong Apoka bikos em i kaikai na toktok.'),('xe', 'Apoka is gagging from food while talking.')])]

2.5 WordNet

NLTK includes the English WordNet, with 155,287 words and 117,659 synonym sets.

Senses and Synonyms

from nltk.corpus import wordnet as wn
wn.synsets('motorcar')  # car.n.01 is called a synset, or "synonym set": a collection of words (or "lemmas") with the same sense
[Synset('car.n.01')]
wn.synset('car.n.01').lemma_names()  # the synonyms in this synset
['car', 'auto', 'automobile', 'machine', 'motorcar']
wn.synset('car.n.01').definition()  # a prose definition of the synset
'a motor vehicle with four wheels; usually propelled by an internal combustion engine'
wn.synset('car.n.01').examples()
['he needs a car to get to work']
wn.synset('car.n.01').lemmas()
[Lemma('car.n.01.car'),Lemma('car.n.01.auto'),Lemma('car.n.01.automobile'),Lemma('car.n.01.machine'),Lemma('car.n.01.motorcar')]
wn.lemma('car.n.01.automobile')
Lemma('car.n.01.automobile')
wn.lemma('car.n.01.automobile').synset()
Synset('car.n.01')
wn.lemma('car.n.01.automobile').name()
'automobile'
wn.synsets('car')
[Synset('car.n.01'),Synset('car.n.02'),Synset('car.n.03'),Synset('car.n.04'),Synset('cable_car.n.01')]
for synset in wn.synsets('car'):
    print(synset.lemma_names())
['car', 'auto', 'automobile', 'machine', 'motorcar']
['car', 'railcar', 'railway_car', 'railroad_car']
['car', 'gondola']
['car', 'elevator_car']
['cable_car', 'car']
wn.lemmas('car')
[Lemma('car.n.01.car'),Lemma('car.n.02.car'),Lemma('car.n.03.car'),Lemma('car.n.04.car'),Lemma('cable_car.n.01.car')]

The WordNet Hierarchy

motorcar = wn.synset('car.n.01')  # hyponyms: the more specific concepts below this synset
types_of_motorcar = motorcar.hyponyms()
types_of_motorcar[26]
Synset('stanley_steamer.n.01')
sorted(lemma.name() for synset in types_of_motorcar for lemma in synset.lemmas())[:5]
['Model_T', 'S.U.V.', 'SUV', 'Stanley_Steamer', 'ambulance']
motorcar.hypernyms()
[Synset('motor_vehicle.n.01')]
paths = motorcar.hypernym_paths()
len(paths)
2
[synset.name() for synset in paths[0]]
['entity.n.01', 'physical_entity.n.01', 'object.n.01', 'whole.n.02', 'artifact.n.01', 'instrumentality.n.03', 'container.n.01', 'wheeled_vehicle.n.01', 'self-propelled_vehicle.n.01', 'motor_vehicle.n.01', 'car.n.01']
[synset.name() for synset in paths[1]]
['entity.n.01', 'physical_entity.n.01', 'object.n.01', 'whole.n.02', 'artifact.n.01', 'instrumentality.n.03', 'conveyance.n.03', 'vehicle.n.01', 'wheeled_vehicle.n.01', 'self-propelled_vehicle.n.01', 'motor_vehicle.n.01', 'car.n.01']
motorcar.root_hypernyms()
[Synset('entity.n.01')]
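To walk the entire hypernym chain in one call, Synset.closure() can be combined with a hypernym accessor; a small sketch (my example):

hyper = lambda s: s.hypernyms()
# All ancestors of car.n.01, without writing an explicit loop.
print([s.name() for s in motorcar.closure(hyper)][:5])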

More Lexical Relations

wn.synset('tree.n.01').part_meronyms()  # the parts of a tree are its trunk, crown, and so on: these are the part_meronyms()
[Synset('burl.n.02'),Synset('crown.n.07'),Synset('limb.n.02'),Synset('stump.n.01'),Synset('trunk.n.01')]
wn.synset('tree.n.01').substance_meronyms()  # the substance of a tree comprises heartwood and sapwood: the substance_meronyms()
[Synset('heartwood.n.01'), Synset('sapwood.n.01')]
wn.synset('tree.n.01').member_holonyms()  # a collection of trees forms a forest: the member_holonyms()
[Synset('forest.n.01')]
for synset in wn.synsets('mint', wn.NOUN):
    print(synset.name() + ':', synset.definition())
batch.n.02: (often followed by `of') a large number or amount or extent
mint.n.02: any north temperate plant of the genus Mentha with aromatic leaves and small mauve flowers
mint.n.03: any member of the mint family of plants
mint.n.04: the leaves of a mint plant used fresh or candied
mint.n.05: a candy that is flavored with a mint oil
mint.n.06: a plant where money is coined by authority of the government
wn.synset('mint.n.04').part_holonyms()
[Synset('mint.n.02')]
wn.synset('mint.n.04').substance_holonyms()
[Synset('mint.n.05')]

There are also relationships between verbs. For example, the act of walking includes the act of stepping, so walking entails stepping. Some verbs have multiple entailments:

wn.synset('walk.v.01').entailments()
[Synset('step.v.01')]
wn.synset('eat.v.01').entailments()
[Synset('chew.v.01'), Synset('swallow.v.01')]
wn.synset('tease.v.03').entailments()
[Synset('arouse.v.07'), Synset('disappoint.v.01')]

Some lexical relationships hold between lemmas, e.g., antonymy:

wn.lemma('supply.n.02.supply').antonyms()
[Lemma('demand.n.02.demand')]
wn.lemma('rush.v.01.rush').antonyms()
[Lemma('linger.v.04.linger')]
wn.lemma('horizontal.a.01.horizontal').antonyms()
[Lemma('vertical.a.01.vertical'), Lemma('inclined.a.02.inclined')]
wn.lemma('staccato.r.01.staccato').antonyms()
[Lemma('legato.r.01.legato')]

Semantic Similarity

right = wn.synset('right_whale.n.01')
orca = wn.synset('orca.n.01')
minke = wn.synset('minke_whale.n.01')
tortoise = wn.synset('tortoise.n.01')
novel = wn.synset('novel.n.01')
right.lowest_common_hypernyms(minke)
[Synset('baleen_whale.n.01')]
right.lowest_common_hypernyms(orca)
[Synset('whale.n.02')]
right.lowest_common_hypernyms(tortoise)
[Synset('vertebrate.n.01')]
right.lowest_common_hypernyms(novel)
[Synset('entity.n.01')]
wn.synset('baleen_whale.n.01').min_depth()
14
wn.synset('whale.n.02').min_depth()
13
wn.synset('vertebrate.n.01').min_depth()
8
wn.synset('entity.n.01').min_depth()
0
right.path_similarity(minke)
0.25
right.path_similarity(orca)
0.16666666666666666
right.path_similarity(tortoise)
0.07692307692307693
right.path_similarity(novel)
0.043478260869565216

Several other similarity measures are available; NLTK also includes VerbNet, a hierarchical verb lexicon linked to WordNet.
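For example, two of the other measures (a hedged sketch: these Synset methods exist in NLTK, but no output values are claimed here):

# Wu-Palmer similarity: based on the depths of the two senses and of
# their lowest common hypernym; returns a score in (0, 1].
print(right.wup_similarity(minke))

# Leacock-Chodorow similarity: -log of the shortest path length, scaled
# by taxonomy depth (both synsets must have the same part of speech).
print(right.lch_similarity(orca))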

2.6 Summary

  • A text corpus is a large, structured collection of texts. NLTK comes with many corpora, e.g., the Brown Corpus, nltk.corpus.brown.
  • Some text corpora are categorized, e.g., by genre or topic; sometimes the categories of a corpus overlap each other.
  • A conditional frequency distribution is a collection of frequency distributions, each one for a different condition. They can be used for counting word frequencies, given a context or a genre.
  • Python programs more than a few lines long should be entered using a text editor, saved to a file with a .py extension, and accessed using an import statement.
  • Python functions permit you to associate a name with a particular block of code, and reuse that code as often as necessary.
  • Some functions, known as "methods", are associated with an object; we call them by giving the object's name, then a period, then the method name, like x.funct(y) or word.isalpha().
  • To find out about some variable v, type help(v) in the Python interactive interpreter to read the help entry for this kind of object.
  • WordNet is a semantically oriented dictionary of English, consisting of synonym sets (synsets) organized into a network.
  • Some functions are not available by default and must be accessed using Python's import statement.

Acknowledgments
Natural Language Processing with Python [1][2][3][4], by Steven Bird, Ewan Klein & Edward Loper, is a highly practical introductory text, first published in 2009 with a second edition in 2015. These study notes draw on both editions, extending and practicing parts of the material. They are shared here in the hope that they help; feel free to add me on WeChat (verification message: NLP) to study and discuss together, and corrections are welcome.

References


  1. http://nltk.org/

  2. Steven Bird, Ewan Klein & Edward Loper, Natural Language Processing with Python, 2009

  3. Bird, Klein & Loper, 《Python自然语言处理》 (Chinese edition), Southeast University Press, 2010

  4. Steven Bird, Ewan Klein & Edward Loper, Natural Language Processing with Python, 2nd edition, 2015
