Chapter 2: Accessing Text Corpora and Lexical Resources

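Because time was limited, I have only written up solutions for the exercises I actually used. If you want the answer to another exercise, leave a comment; I check the blog from time to time. ^_^ I hope to hear your feedback, and I will keep working on these notes.
1. Create a variable phrase containing a list of words. Experiment with the operations described in this chapter, including addition, multiplication, indexing, slicing, and sorting.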

>>> phrase = ['This','file','is','available','for','text','mining']
>>> phrase2 = ['.']
>>> phrase+phrase2
['This', 'file', 'is', 'available', 'for', 'text', 'mining', '.']
>>> phrase2*5
['.', '.', '.', '.', '.']
>>> phrase[2]
'is'
>>> phrase[-3:]
['for', 'text', 'mining']
>>> sorted(phrase)
['This', 'available', 'file', 'for', 'is', 'mining', 'text']

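2. Use the corpus module to explore austen-persuasion.txt. How many word tokens does this book have? How many word types?

Note: the session below loads austen-sense.txt (Sense and Sensibility), so the counts shown are for that book. To answer the exercise as stated, pass fileids='austen-persuasion.txt' instead.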

>>> import nltk
>>> gutenberg_words = nltk.corpus.gutenberg.words(fileids='austen-sense.txt')
>>> len(gutenberg_words)
141576
>>> len(set(gutenberg_words))
6833
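The counts above depend on what is treated as a word: the corpus reader returns punctuation as tokens and keeps 'The' and 'the' apart. The following is a minimal sketch of how those choices change the result. count_tokens_and_types is a hypothetical helper (not part of NLTK), and the sample list stands in for the output of gutenberg.words().

```python
def count_tokens_and_types(words, fold_case=False, alpha_only=False):
    """Return (number of tokens, number of types) for a sequence of tokens."""
    tokens = list(words)
    if alpha_only:
        # Drop punctuation and numbers, keeping purely alphabetic tokens.
        tokens = [w for w in tokens if w.isalpha()]
    if fold_case:
        # Treat 'The' and 'the' as the same type.
        tokens = [w.lower() for w in tokens]
    return len(tokens), len(set(tokens))


# Hypothetical sample standing in for a corpus reader's word list.
sample = ['The', 'cat', 'saw', 'the', 'other', 'cat', '.']
print(count_tokens_and_types(sample))                                   # (7, 6)
print(count_tokens_and_types(sample, fold_case=True, alpha_only=True))  # (6, 4)
```

With NLTK installed, the same helper could be called on nltk.corpus.gutenberg.words('austen-persuasion.txt').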

3. Use the Brown corpus reader nltk.corpus.brown.words() or the Web Text corpus reader nltk.corpus.webtext.words() to access some sample text in two different genres.

>>> nltk.corpus.brown.categories()
[u'adventure', u'belles_lettres', u'editorial', u'fiction', u'government', u'hobbies', u'humor', u'learned', u'lore', u'mystery', u'news', u'religion', u'reviews', u'romance', u'science_fiction']
>>> nltk.corpus.brown.words(categories=u'adventure')
[u'Dan', u'Morgan', u'told', u'himself', u'he', ...]
>>> nltk.corpus.brown.words(categories=u'humor')
[u'It', u'was', u'among', u'these', u'that', u'Hinkle', ...]

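4. Read in the texts of the State of the Union addresses using the state_union corpus reader. Count the occurrences of men, women, and people in each document. How has the usage of these words changed over time?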

from nltk import ConditionalFreqDist
from nltk.corpus import state_union

# Condition on the target word; the sample is the year taken from the file id.
cfd = ConditionalFreqDist(
    (target, fileid[:4])
    for fileid in state_union.fileids()
    for w in state_union.words(fileid)
    for target in ['men', 'women', 'people']
    if w.lower() == target)
cfd.plot()

The frequency of 'men' has declined over time.
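Raw counts also depend on how long each address is, so a trend is easier to judge from a rate. The following is a minimal sketch, assuming each document is available as a list of tokens. per_thousand is a hypothetical helper and the two toy documents are invented stand-ins, not text from the corpus.

```python
from collections import Counter

TARGETS = ('men', 'women', 'people')


def per_thousand(words, targets=TARGETS):
    """Occurrences of each target word per 1,000 tokens of one document."""
    tokens = [w.lower() for w in words]
    counts = Counter(w for w in tokens if w in targets)
    total = len(tokens) or 1  # avoid dividing by zero on an empty document
    return {t: 1000.0 * counts[t] / total for t in targets}


# Invented toy documents; with NLTK, pass state_union.words(fileid) instead.
docs = {
    '1945': 'The men and women of this nation and the men abroad'.split(),
    '2005': 'The people want what all people want'.split(),
}
for year in sorted(docs):
    print(year, per_thousand(docs[year]))
```

Several addresses can share a year, so with the real corpus it may be safer to key the results by the full file id rather than by fileid[:4].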

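8. Define a conditional frequency distribution over the Names corpus that shows which initial letters are more frequent in male names than in female names.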

from nltk import ConditionalFreqDist
from nltk.corpus import names

# Condition on the file (male.txt / female.txt); the sample is the lowercased initial.
cfd = ConditionalFreqDist(
    (fileid, w.lower()[0])
    for fileid in names.fileids()
    for w in names.words(fileid))
cfd.plot()

Method 2:

>>> from nltk.corpus import names
>>> pairs = [(sex,name[0])
... for sex in names.fileids()
... for name in names.words(fileids=sex)]
>>> pairs[:4:]
[(u'female.txt', u'A'), (u'female.txt', u'A'), (u'female.txt', u'A'), (u'female.txt', u'A')]
>>> pairs[-4::]
[(u'male.txt', u'Z'), (u'male.txt', u'Z'), (u'male.txt', u'Z'), (u'male.txt', u'Z')]
>>> from nltk import ConditionalFreqDist
>>> cfd = ConditionalFreqDist(pairs)
>>> cfd.plot()
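The plot compares raw counts, and the Names corpus lists considerably more female than male names, so the question may be easier to answer from each letter's share within its own list. The following is a rough sketch under that assumption. initial_shares and male_leaning_initials are hypothetical helpers, and the toy name lists are invented.

```python
from collections import Counter


def initial_shares(names):
    """Map each initial letter to its share of the given list of names."""
    counts = Counter(name[0].upper() for name in names if name)
    total = sum(counts.values())
    return {letter: n / float(total) for letter, n in counts.items()}


def male_leaning_initials(male_names, female_names):
    """Initials whose share among male names exceeds their share among female names."""
    male = initial_shares(male_names)
    female = initial_shares(female_names)
    return sorted(l for l in male if male[l] > female.get(l, 0.0))


# Invented toy lists; with NLTK, pass names.words('male.txt') and
# names.words('female.txt') instead.
male_sample = ['William', 'Walter', 'Adam', 'Tom']
female_sample = ['Alice', 'Anna', 'Wendy', 'Mary']
print(male_leaning_initials(male_sample, female_sample))  # ['T', 'W']
```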

15. Write a program to find all words that occur at least three times in the Brown Corpus.

>>> from nltk.corpus import brown
>>> words = brown.words()
>>> from nltk import FreqDist
>>> fd = FreqDist([w.lower() for w in words])
>>> freq_words = [w for w in fd if fd[w]>=3]

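16. Write a program to generate a table of lexical diversity scores (i.e. token/type ratios), like Table 1-1. Include the full set of Brown Corpus genres (nltk.corpus.brown.categories()). Which genre has the lowest lexical diversity (the greatest number of tokens per type)? Is this what you would have expected?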

from nltk.corpus import brown

def word_diversity(words):
    # Tokens per type after lowercasing; a higher score means lower diversity.
    words = [w.lower() for w in words]
    return len(words) * 1.0 / len(set(words))

def main():
    for category in brown.categories():
        diversity_sent = word_diversity(brown.words(categories=category))
        print("%s\t%.2f" % (category, diversity_sent))

if __name__ == "__main__":
    main()

adventure 8.37
belles_lettres 10.15
editorial 6.76
fiction 7.89
government 9.53
hobbies 7.61
humor 4.56
learned 11.75
lore 8.23
mystery 8.85
news 7.67
religion 6.64
reviews 5.04
romance 8.88
science_fiction 4.77
The learned genre has the lowest lexical diversity (the most tokens per type).

17. Write a function that finds the 50 most frequently occurring words of a text that are not stopwords.

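from nltk import FreqDist
from nltk.corpus import stopwords

def max_frequent_words(words):
    # Build the stopword set once; calling stopwords.words() for every token is very slow.
    stopwords_set = set(stopwords.words(fileids=u'english'))
    filtered_words = [w.lower() for w in words if w.lower() not in stopwords_set]
    fdist = FreqDist(filtered_words)
    sorted_words = sorted(fdist.keys(), key=lambda x: fdist[x], reverse=True)
    return sorted_words[:50]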

The 50 most frequent words in text1 (Moby Dick by Herman Melville, 1851):
,
.
;
‘

-
”
!
whale

–
one
like
?
upon
man
ship
ahab
.”
ye
sea
old
would
though
yet
head
boat
time
long
captain
still
great
!”
said
,”
two
seemed
must
white
last
see
thou
way
whales
stubb
?”
queequeg
little
round
three
sperm
men
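Punctuation takes up many of the top slots in the list above because only stopwords are filtered out. The following is a minimal sketch of a variant that also drops non-alphabetic tokens. most_frequent_content_words is a hypothetical helper, and the toy stopword set and text are invented stand-ins for the NLTK resources.

```python
from collections import Counter


def most_frequent_content_words(words, stopword_set, n=50):
    """The n most frequent alphabetic, non-stopword tokens, lowercased."""
    counts = Counter(
        w.lower() for w in words
        if w.isalpha() and w.lower() not in stopword_set
    )
    return [w for w, _ in counts.most_common(n)]


# Invented stand-ins; with NLTK, use set(stopwords.words('english')) and text1.
toy_stopwords = {'the', 'a', 'of', 'and'}
toy_text = 'The whale , the white whale ! A ship and the sea .'.split()
print(most_frequent_content_words(toy_text, toy_stopwords, n=3))
```

Note that str.isalpha() also rejects hyphenated words and tokens with apostrophes, which may or may not be wanted.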

18. Write a program to print the 50 most frequent bigrams (pairs of adjacent words) of a text, omitting bigrams that contain stopwords.

from nltk import FreqDist, bigrams
from nltk.corpus import stopwords

def max_frequent_bigrams(words):
    stopwords_set = set(stopwords.words(fileids=u'english'))
    bigram_words = bigrams([w.lower() for w in words])
    # Keep only bigrams in which neither word is a stopword.
    filtered_bigram_words = []
    for bigram_word in bigram_words:
        if bigram_word[0] not in stopwords_set and bigram_word[1] not in stopwords_set:
            filtered_bigram_words.append(bigram_word)
    fdist = FreqDist(filtered_bigram_words)
    sorted_filtered_bigram_words = sorted(fdist.keys(), key=lambda x: fdist[x], reverse=True)
    return sorted_filtered_bigram_words[:50]

. ”
.” ”
whale ,
sperm whale
, ”
?” ”
, sir
, like
, though
,” said
whale ’
whale -
aye ,
sea ,
ahab ,
white whale
man ,
say ,
. chapter
, ahab
oh ,
ship ,
moby dick
head ,
stubb ,
old man
mast -
boat ,
ship ’
whale .
time ,
ye ,
deck ,
.” –
ahab ’
queequeg ,
, yet
whales ,
!” ”
whale ;
sea -
- like
sea .
mr .
men ,
one ,
starbuck ,
, queequeg
, ye
captain ahab

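19. Write a program to create a table of word frequencies by genre, modeled on the one given in Section 2.1. Choose your own words and try to find words whose presence or absence is typical of a genre. Discuss your findings.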

from nltk.corpus import brown
from nltk import ConditionalFreqDist
from nltk.corpus import stopwords

stopwords_set = set(stopwords.words(fileids=u'english'))
# wenti = genre; pair each genre with its lowercased non-stopword tokens.
pairs = [(wenti, word.lower())
         for wenti in brown.categories()
         for word in brown.words(categories=wenti)
         if word.lower() not in stopwords_set]
cfd = ConditionalFreqDist(pairs)
for wenti in brown.categories():
    fdist = cfd[wenti]
    sorted_words = sorted(fdist.keys(), key=lambda x: fdist[x], reverse=True)
    print(wenti)
    # First line: the 10 most frequent words; second line: the 10 least frequent.
    print('===> ' + ','.join(sorted_words[:10]))
    print('===> ' + ','.join(sorted_words[-10:]))

adventure
===> .,,,,'',?,!,said,;,--,would
===> conclusion,lance,kinds,shredded,june,pumps,geysers,ranks,britain's,races
belles_lettres
===> ,,.,,”,;,one,?,–,would,:
===> circumspection,contends,frustrate,squeak,cliff,intentionally,britain’s,richardson,life-death,ranke
editorial
===> ,,.,,'',?,;,--,would,one,:
===> percent,boom,sick,truckers,non-military,kinda,dollarette,coupling,ranks,britain's
fiction
===> ,,.,,”,?,;,would,!,said,one
===> album,junk,kinds,corcoran,pumps,sopping,volumes,tulips,furthermore,mattathias
government
===> ,,.,;,),(,state,year,states,may,united
===> broadly,meaningless,diplomat,book,apportioned,4.21,gap,auxiliary,intentionally,photek
hobbies
===> ,,.,;,?,–,one,(,),”,
===> clicks,junk,completes,galvanism,designer's,riboflavin,xenon,smoke-filled,awake,mosaics
humor
===> ,,.,,”,?,;,said,!,one,–
===> misnomer,entertainment,rule,eddies,misconstructions,portion,lift,furthermore,unspeakable,liver
learned
===> ,,.,;,af,(,),”,,one,may
===> incurred,29.2,altitude-azimuth-mounted,clergymen,infuriated,races,ideological,two-valued,expands,irony
lore
===> ,,.,,”,;,–,one,?,would,time
===> contends,risked,dissension,incurred,shredder,northerly,volumes,gynecologist,pods,jawbone
mystery
===> .,,,,'',?,--,;,said,would,one
===> goodness,trailing,basketball-playing,glint,incredibly,conclusion,kinds,auxiliary,elaine's,intentionally
news
===> ,,.,,”,said,;,–,mrs.,would,new
===> brigantine,skyway,kinda,alienated,pate,layman’s,republicans’,hemisphere’s,pastel-like,motion-picture
religion
===> ,,.,;,,'',?,god,),:,--
===> meaningless,gnawing,boot,boom,non-military,alienated,coupling,ranks,intentionally,volumes
reviews
===> ,,.,,”,;,–,one,mr.,(,)
===> lance,kinds,nourishment,june,contends,cliff,auxiliary,insipid,ideological,sash
romance
===> ,,.,,'',?,said,!,--,;,would
===> lubricated,goodness,trailing,overdeveloped,conclusion,junk,kinds,pumps,infuriated,volumes
science_fiction
===> ,,.,'',,?,;,would,–,could,!
===> exposure,bulwark,bcd,jewel,lasted,lift,delicate-beyond-description,well-oriented,oklahoma,augmented
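The highest-frequency words do not discriminate between genres: the same punctuation marks and common words top every list. A measure of how a word is distributed across the categories should be taken into account instead.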
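One possible cross-genre measure, offered only as an illustration and not as the author's method, is the ratio of a word's relative frequency within a genre to its relative frequency over all genres combined. Words scoring well above 1 are over-represented in that genre. salient_words and the toy data below are hypothetical.

```python
from collections import Counter


def salient_words(genre_tokens, n=3, min_count=2):
    """For each genre, rank words by (relative frequency in the genre) /
    (relative frequency in all genres combined), highest first."""
    per_genre = {g: Counter(w.lower() for w in toks)
                 for g, toks in genre_tokens.items()}
    overall = Counter()
    for counts in per_genre.values():
        overall.update(counts)
    grand_total = float(sum(overall.values()))

    result = {}
    for genre, counts in per_genre.items():
        genre_total = float(sum(counts.values()))
        # min_count keeps one-off words from dominating the ranking.
        scores = {
            w: (c / genre_total) / (overall[w] / grand_total)
            for w, c in counts.items() if c >= min_count
        }
        # Sort by descending score, breaking ties alphabetically.
        result[genre] = sorted(scores, key=lambda w: (-scores[w], w))[:n]
    return result


# Invented toy data; with NLTK, build the dict from brown.words(categories=genre).
toy = {
    'news': 'the state said the state , the year'.split(),
    'romance': 'she said , the kiss , she smiled'.split(),
}
print(salient_words(toy, n=2))
```

On the toy data, 'state' outranks 'the' for news even though 'the' is more frequent, which is the effect the remark above asks for.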

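20. Write a function word_freq() that takes a word and the name of a section of the Brown Corpus as arguments, and computes the frequency of the word in that section of the corpus.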

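from nltk import FreqDist
from nltk.corpus import brown

def word_freq(word, f_name):
    # Count lowercased tokens in one Brown section (file id), e.g. 'ca01'.
    words = [w.lower() for w in brown.words(fileids=f_name)]
    fdist = FreqDist(words)
    # Lowercase the query too, since the corpus tokens were lowercased above.
    word = word.lower()
    if word in fdist:
        return fdist[word]
    else:
        return 0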

22. Define a function hedge(text) that processes a text and produces a new version with the word 'like' inserted after every third word.

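def hedge(sent):
    new_sent = []
    # Start at 0 so that inputs of three words or fewer are returned unchanged.
    insert_index = 0
    for insert_index in range(3, len(sent), 3):
        new_sent.extend(sent[insert_index-3:insert_index] + ['like'])
    # Append the words left over after the last insertion point.
    new_sent.extend(sent[insert_index:])
    return new_sent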

['The', 'family', 'of', 'Dashwood', 'had', 'long', 'been', 'settled', 'in', 'Sussex', '.']
['The', 'family', 'of', 'like', 'Dashwood', 'had', 'long', 'like', 'been', 'settled', 'in', 'like', 'Sussex', '.']

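23. Zipf's law: let f(w) be the frequency of a word w in a free text. Suppose that all the words of a text are ranked according to their frequency, with the most frequent word first. Zipf's law states that the frequency of a word type is inversely proportional to its rank (i.e. f*r=k for some constant k). For example, the 50th most common word type should occur three times as frequently as the 150th most common word type.
a. Write a function to process a large text and plot word frequency against word rank using pylab.plot. Do you confirm Zipf's law? (Hint: it helps to use a logarithmic scale.) What is going on at the extreme ends of the plotted line? Answer: the data largely agree with Zipf's law.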

>>> from nltk import FreqDist
>>> from nltk.corpus import brown
>>> brown_words = brown.words()
>>> fdist = FreqDist([w.lower() for w in brown_words])
>>> sorted_brown_words = sorted(fdist.keys(),key=lambda x:fdist[x],reverse=True)
>>> # Write "word<TAB>count" lines in rank order so they can be plotted later.
>>> outfile = open(u"/home/yf-u/2_23.txt",'w')
>>> for w in sorted_brown_words:
...     outfile.write("%s\t%d\n"%(w,fdist[w]))
...
>>> outfile.close()
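The session above writes the ranked counts to a file without checking the law itself. Since f*r=k means log f = log k - log r, one rough check is the slope of log frequency against log rank, which should be close to -1. The following is a sketch under that reading. rank_frequency and loglog_slope are hypothetical helpers, and the synthetic word list is built to follow the law so that the check can run without the corpus.

```python
import math
from collections import Counter


def rank_frequency(words):
    """Return [(rank, frequency), ...] with rank 1 for the most frequent word."""
    counts = Counter(w.lower() for w in words)
    freqs = sorted(counts.values(), reverse=True)
    return list(enumerate(freqs, start=1))


def loglog_slope(pairs):
    """Least-squares slope of log(frequency) against log(rank).

    Needs at least two (rank, frequency) pairs.
    """
    xs = [math.log(r) for r, _ in pairs]
    ys = [math.log(f) for _, f in pairs]
    n = float(len(pairs))
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Synthetic check: word i occurs 600 // i times, so f * r is roughly constant.
synthetic = []
for i in range(1, 21):
    synthetic.extend(['w%d' % i] * (600 // i))
pairs = rank_frequency(synthetic)
print(round(loglog_slope(pairs), 2))
```

To draw the plot the exercise asks for, the ranks and frequencies in pairs can be passed to pylab.loglog(), and with NLTK the input would be brown.words().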

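b. Generate random text, e.g. using random.choice("abcdefg "), taking care to include the space character. You will need to import random first. Use the string concatenation operator to accumulate characters into a very long string. Then tokenize this string, generate the Zipf plot as before, and compare the two plots. What do you make of Zipf's law in the light of this? Answer: the random text largely does not follow Zipf's law, so agreement with Zipf's law could serve as one factor in measuring text quality.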

# coding=gbk
import random
import re
from nltk import FreqDist

text = ""
for i in range(500000):
    text += random.choice("abcdefg ")
# Split on runs of spaces and drop the empty strings left at either end.
word_li = [w for w in re.split(r"\s+", text) if w]
fdist = FreqDist(word_li)
sorted_word_li = sorted(fdist.keys(), key=lambda x: fdist[x], reverse=True)
with open(u'/home/yf-u/2_23_2.txt', 'w') as outfile:
    for w in sorted_word_li:
        outfile.write("%s\t%d\n" % (w, fdist[w]))
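To compare the random text with a real corpus without plotting, one option is to look at the product f*r at a few ranks, which Zipf's law predicts should stay roughly constant. The following is only a sketch. random_words and frequency_times_rank are hypothetical helpers, the seed is fixed purely for repeatability, and the characters are joined once rather than concatenated in a loop.

```python
import random
from collections import Counter


def random_words(n_chars, alphabet="abcdefg ", seed=0):
    """Generate n_chars random characters and split the result into words."""
    rng = random.Random(seed)  # fixed seed so repeated runs give the same text
    # Building the string with one join avoids repeated string copying.
    text = "".join(rng.choice(alphabet) for _ in range(n_chars))
    return text.split()  # split() drops the empty strings between repeated spaces


def frequency_times_rank(words, ranks=(1, 10, 100)):
    """Return {rank: frequency * rank}; Zipf's law predicts similar values."""
    freqs = sorted(Counter(words).values(), reverse=True)
    return {r: freqs[r - 1] * r for r in ranks if r <= len(freqs)}


words = random_words(200000)
print(frequency_times_rank(words))
```

Running frequency_times_rank on brown.words() as well gives a side-by-side view of how far each text departs from a constant product.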
