Original book: Natural Language Processing with Python: https://usyiyi.github.io/nlp-py-2e-zh/

Language Processing and Python

Source: https://usyiyi.github.io/nlp-py-2e-zh/1.html

1. Getting Started with NLTK

1. Installing NLTK and the nltk.book data
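
The notes give no commands for this step. Below is a minimal sketch of one way to do it, assuming NLTK itself has already been installed (for example with `pip install nltk`) and that network access is available. The wrapper name `install_book_data` is hypothetical and only used for illustration; `nltk.download("book")` fetches the data collection that `nltk.book` loads.

```python
def install_book_data():
    """Download the data used by nltk.book (one-time step, needs network access)."""
    import nltk  # assumes NLTK itself was installed beforehand, e.g. with pip

    nltk.download("book")


# Typical one-time usage in an interactive session:
#   install_book_data()
#   from nltk.book import *   # loads text1 ... text9 and sent1 ... sent9
```

After the import, names such as `text1` and `sent7` used in the rest of these notes become available.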

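2. Searching text

text1.concordance("monstrous") # show every occurrence of "monstrous" in text1 with its surrounding context
text1.similar("monstrous") # find words in text1 that appear in contexts similar to those of "monstrous"
text2.common_contexts(["monstrous", "very"]) # find the contexts shared by the two words in text2
text4.dispersion_plot(["citizens", "democracy", "freedom", "duties", "America"]) # dispersion plot: where each word occurs across text4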
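
To make the idea of a concordance concrete, here is a plain-Python sketch that works on any list of tokens. It is a simplified stand-in, not NLTK's implementation; the function name `simple_concordance` and the `width` parameter are assumptions made for this example.

```python
def simple_concordance(tokens, word, width=2):
    """Return each occurrence of `word` with up to `width` tokens of context per side."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == word.lower():
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            hits.append(" ".join(left + [tok] + right))
    return hits


tokens = "the monstrous whale swam past a most monstrous ship".split()
lines = simple_concordance(tokens, "monstrous", width=2)
for line in lines:
    print(line)
```

Each printed line is one occurrence of the search word together with its neighbours, which is the same kind of view `concordance` gives for a full text.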

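3. Counting vocabulary

len(text3) # total number of tokens in text3
sorted(set(text3))  # sorted list of the distinct tokens
len(set(text3))  # number of distinct tokens
len(set(text3)) / len(text3) # lexical richness: distinct tokens are about 6% of all tokens, i.e. each word is used about 16 times on average
text3.count("smote") # number of times "smote" occurs in the text
def lexical_diversity(text): # lexical richness of a text
    return len(set(text)) / len(text)
def percentage(word, text): # percentage of the tokens in text that are word
    return 100 * text.count(word) / len(text)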
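
A small worked example of the two measures on a toy token list, so the arithmetic can be checked by hand. It uses plain Python lists instead of an NLTK text; the sample sentence is made up for illustration.

```python
def lexical_diversity(tokens):
    """Ratio of distinct tokens to all tokens."""
    return len(set(tokens)) / len(tokens)


def percentage(word, tokens):
    """Share of the tokens that are equal to `word`, as a percentage."""
    return 100 * tokens.count(word) / len(tokens)


tokens = "the cat sat on the mat the end".split()
diversity = lexical_diversity(tokens)  # 6 distinct tokens out of 8
the_share = percentage("the", tokens)  # "the" is 3 of the 8 tokens
print(diversity, the_share)
```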

4. Indexing lists

>>> text4[173]
'awaken'
>>>

>>> text4.index('awaken')
173
>>>
>>> sent[5:8]
['word6', 'word7', 'word8']
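
The `sent` list used in the slice above is not defined in these notes. The output suggests a ten-item list running from `word1` to `word10`, and that is the assumption in this sketch of indexing and slicing.

```python
# Assumed definition of `sent`: ten placeholder words, word1 ... word10.
sent = ["word" + str(i) for i in range(1, 11)]

position = sent.index("word6")  # indexes start at 0, so word6 is at position 5
middle = sent[5:8]              # a slice includes index 5 and stops before index 8
print(position, middle)
```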

5. Converting between strings and lists

>>> ' '.join(['Monty', 'Python'])
'Monty Python'
>>> 'Monty Python'.split()
['Monty', 'Python']
>>>

6. Frequency distributions

>>> fdist1 = FreqDist(text1)  # count how often each token occurs in text1
>>> print(fdist1)
<FreqDist with 19317 samples and 260819 outcomes>
>>> fdist1.most_common(50)
[(',', 18713), ('the', 13721), ('.', 6862), ('of', 6536), ('and', 6024),
('a', 4569), ('to', 4542), (';', 4072), ('in', 3916), ('that', 2982),
("'", 2684), ('-', 2552), ('his', 2459), ('it', 2209), ('I', 2124),
('s', 1739), ('is', 1695), ('he', 1661), ('with', 1659), ('was', 1632),
('as', 1620), ('"', 1478), ('all', 1462), ('for', 1414), ('this', 1280),
('!', 1269), ('at', 1231), ('by', 1137), ('but', 1113), ('not', 1103),
('--', 1070), ('him', 1058), ('from', 1052), ('be', 1030), ('on', 1005),
('so', 918), ('whale', 906), ('one', 889), ('you', 841), ('had', 767),
('have', 760), ('there', 715), ('But', 705), ('or', 697), ('were', 680),
('now', 646), ('which', 640), ('?', 637), ('me', 627), ('like', 624)]
>>> fdist1['whale']
906
>>>

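fdist1.plot(50, cumulative=True) # cumulative frequency plot of the 50 most common tokens

fdist1.hapaxes() # tokens that occur exactly once

7. Fine-grained selection of words

Select words longer than 15 characters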
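
The same counting can be tried without any corpus by using `collections.Counter` from the standard library as a stand-in for `FreqDist`. The token list below is made up, and the hapax computation is written out by hand here rather than taken from NLTK.

```python
from collections import Counter

tokens = "a rose is a rose is a rose but a daisy is not".split()
counts = Counter(tokens)

top = counts.most_common(1)  # the single most frequent token with its count
hapaxes = sorted(w for w, n in counts.items() if n == 1)  # tokens seen exactly once
print(top, counts["rose"], hapaxes)
```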

sorted(w for w in set(text1) if len(w) > 15)
['CIRCUMNAVIGATION', 'Physiognomically', 'apprehensiveness', 'cannibalistically',

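Select words longer than 7 characters that occur more than 7 times

fdist5 = FreqDist(text5)  # build the frequency distribution once, not once per word
sorted(w for w in set(text5) if len(w) > 7 and fdist5[w] > 7)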
['#14-19teens', '#talkcity_adults', '((((((((((', '........', 'Question',

Extract word pairs (bigrams) from a list of words

>>> list(bigrams(['more', 'is', 'said', 'than', 'done']))
[('more', 'is'), ('is', 'said'), ('said', 'than'), ('than', 'done')]
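
For intuition, the pairing of each word with its successor can be sketched in plain Python with `zip`. The helper name `simple_bigrams` is hypothetical and this is not how NLTK implements `bigrams`.

```python
def simple_bigrams(words):
    """Pair every word with the word that follows it."""
    return list(zip(words, words[1:]))


pairs = simple_bigrams(["more", "is", "said", "than", "done"])
print(pairs)
```

A list of n words yields n - 1 pairs, because the last word has no successor.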

Extract the bigrams that occur frequently in a text (collocations)

>>> text4.collocations()
United States; fellow citizens; four years; years ago; Federal
Government; General Government; American people; Vice President; Old
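
A rough sketch of the underlying idea: count adjacent word pairs and keep those that repeat. NLTK's `collocations()` ranks pairs with an association measure rather than by raw count, so this simplified version, run on a made-up token list, only approximates what it does.

```python
from collections import Counter

tokens = ("fellow citizens of the united states , the united states thank "
          "the fellow citizens").split()

# Count every adjacent pair, then keep the pairs seen at least twice.
pair_counts = Counter(zip(tokens, tokens[1:]))
frequent = sorted(pair for pair, n in pair_counts.items() if n >= 2)
print(frequent)
```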

8. Distribution of word lengths in a text

>>> [len(w) for w in text1] # length of each token in the text
[1, 4, 4, 2, 6, 8, 4, 1, 9, 1, 1, 8, 2, 1, 4, 11, 5, 2, 1, 7, 6, 1, 3, 4, 5, 2, ...]
>>> fdist = FreqDist(len(w) for w in text1)  # how many tokens there are of each length
>>> print(fdist)
<FreqDist with 19 samples and 260819 outcomes>
>>> fdist
FreqDist({3: 50223, 1: 47933, 4: 42345, 2: 38513, 5: 26597, 6: 17111, 7: 14399,8: 9966, 9: 6428, 10: 3528, ...})
>>>

>>> fdist.most_common()
[(3, 50223), (1, 47933), (4, 42345), (2, 38513), (5, 26597), (6, 17111), (7, 14399),
(8, 9966), (9, 6428), (10, 3528), (11, 1873), (12, 1053), (13, 567), (14, 177),
(15, 70), (16, 22), (17, 12), (18, 1), (20, 1)]
>>> fdist.max()
3
>>> fdist[3]
50223
>>> fdist.freq(3) # proportion of all tokens that have length 3
0.19255882431878046
>>>
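
The same three questions (counts per length, the most common length, and its relative frequency) can be answered on a toy sentence with `collections.Counter`. This is a stand-in for `FreqDist`; the relative frequency is computed by hand here.

```python
from collections import Counter

tokens = "the quick brown fox jumps over the lazy dog".split()
length_counts = Counter(len(w) for w in tokens)  # maps word length to number of tokens

most_common_length = length_counts.most_common(1)[0][0]
share_of_3 = length_counts[3] / sum(length_counts.values())  # 4 of the 9 tokens
print(length_counts, most_common_length, share_of_3)
```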

9. The `[w for w in text if condition]` pattern

Select words that end in "ableness"

>>> sorted(w for w in set(text1) if w.endswith('ableness'))
['comfortableness', 'honourableness', 'immutableness', 'indispensableness', ...]

Select words that contain "gnt"

>>> sorted(term for term in set(text4) if 'gnt' in term)
['Sovereignty', 'sovereignties', 'sovereignty']

Select titlecased words (initial capital, remaining letters lowercase)

>>> sorted(item for item in set(text6) if item.istitle())
['A', 'Aaaaaaaaah', 'Aaaaaaaah', 'Aaaaaah', 'Aaaah', 'Aaaaugh', 'Aaagh', ...]

Select tokens that consist only of digits

>>> sorted(item for item in set(sent7) if item.isdigit())
['29', '61']
>>>

Select tokens that are not entirely lowercase

sorted(w for w in set(sent7) if not w.islower())

Convert every word to uppercase

>>> [w.upper() for w in text1]
['[', 'MOBY', 'DICK', 'BY', 'HERMAN', 'MELVILLE', '1851', ']', 'ETYMOLOGY', '.', ...]
>>>

Drop the non-alphabetic tokens in text1, lowercase the rest, remove duplicates, then count

>>> len(set(word.lower() for word in text1 if word.isalpha()))
16948
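
The string tests used in this section are easy to compare side by side on a small made-up token list. Note in particular that `not t.islower()` also lets punctuation and digits through, since they contain no lowercase letters.

```python
tokens = ["The", "the", "THE", "cat", ",", "Cat", "29", "."]

titlecased = sorted(t for t in set(tokens) if t.istitle())
digits = sorted(t for t in set(tokens) if t.isdigit())
not_lower = sorted(t for t in set(tokens) if not t.islower())

# Normalised vocabulary: alphabetic tokens only, case folded, duplicates removed.
vocab = set(t.lower() for t in tokens if t.isalpha())
print(titlecased, digits, not_lower, len(vocab))
```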

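10. Loops with conditions

Passing end=' ' to print, as in print(word, end=' '), keeps the output on one line instead of starting a new line after each word.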

>>> tricky = sorted(w for w in set(text2) if 'cie' in w or 'cei' in w)
>>> for word in tricky:
...     print(word, end=' ')
ancient ceiling conceit conceited conceive conscience
conscientious conscientiously deceitful deceive ...
>>>
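
A self-contained version of the same loop, using a short made-up word list in place of `text2` so it runs without any corpus.

```python
words = ["ancient", "receive", "science", "tree", "ceiling"]

# Keep the words that contain either letter sequence, in sorted order.
tricky = sorted(w for w in set(words) if "cie" in w or "cei" in w)

for word in tricky:
    print(word, end=" ")  # a space instead of a newline after each word
print()  # finish the line
```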

11. Exercises

Compute how often a word occurs, as a percentage

>>> def percent(word, text):
...     return 100*text.count(word)/len([w for w in text if w.isalpha()])
>>> percent(",", text1)
8.569753756394228

Compute the vocabulary size of a text

>>> def vocab_size(text):
...     return len(set(w.lower() for w in text if w.isalpha()))
>>> vocab_size(text1)
16948
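
The `percent` function above divides by the number of alphabetic tokens rather than by all tokens, so a punctuation mark such as "," is measured against a total that does not include it. The sketch below contrasts the two denominators on a made-up token list; both function names are hypothetical.

```python
def percent_of_all_tokens(word, tokens):
    """Occurrences of `word` relative to every token, punctuation included."""
    return 100 * tokens.count(word) / len(tokens)


def percent_of_alpha_tokens(word, tokens):
    """Occurrences of `word` relative to the alphabetic tokens only."""
    alpha_total = len([t for t in tokens if t.isalpha()])
    return 100 * tokens.count(word) / alpha_total


tokens = ["the", ",", "cat", ",", "the", ".", "end", "."]
comma_vs_all = percent_of_all_tokens(",", tokens)      # 2 commas out of 8 tokens
comma_vs_alpha = percent_of_alpha_tokens(",", tokens)  # 2 commas against 4 words
print(comma_vs_all, comma_vs_alpha)
```

Which denominator is appropriate depends on whether punctuation should count as part of the text being measured.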

