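Chinese editions of this book for Python 2 are easy to find online, but editions updated for Python 3 are very rare, so I had to start reading the English e-book on the official site. I'm treating it as practice anyway.

The official site explicitly lists a few changes in NLTK 3.0, which suggests they are probably important or commonly used, much like print is for Python.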

NLTK also includes some pervasive changes:

  • many types are initialised from strings using a fromstring() method
  • many functions now return iterators instead of lists
  • ContextFreeGrammar is now called CFG and WeightedGrammar is now called PCFG
  • batch_tokenize() is now called tokenize_sents(); there are corresponding changes for batch taggers, parsers, and classifiers
  • some implementations have been removed in favour of external packages, or because they could not be maintained adequately

Details: https://github.com/nltk/nltk/wiki/Porting-your-code-to-NLTK-3.0

Chapter 1 has little new content; there is one additional method, concordance:
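The "iterators instead of lists" change is the one most likely to break old code, so here is a minimal sketch of what it means in practice. The `bigrams` function below is a hypothetical stand-in written in plain Python, not NLTK's own code; as far as I can tell, `nltk.bigrams` in NLTK 3 behaves the same way, which is why the book wraps it in `list(...)`.

```python
def bigrams(tokens):
    """Yield adjacent token pairs lazily (hypothetical stand-in, not NLTK's code)."""
    for left, right in zip(tokens, tokens[1:]):
        yield left, right


words = ['more', 'is', 'said', 'than', 'done']

pairs = bigrams(words)        # a generator: nothing has been computed yet
pairs_as_list = list(pairs)   # materialise it when len() or indexing is needed
leftover = list(pairs)        # a second pass finds the generator exhausted

print(pairs_as_list)
print(leftover)
```

The practical consequence: code that used to index or re-read a result needs a `list(...)` around the call.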

>>> text5.concordance('lol')
Displaying 25 of 25 matches:
ast PART 24 / m boo . 26 / m and sexy lol U115 boo . JOIN PART he drew a girl w
ope he didnt draw a penis PART ewwwww lol & a head between her legs JOIN JOIN s
a bowl i got a blunt an a bong ...... lol JOIN well , glad it worked out my cha
e " PART Hi U121 in ny . ACTION would lol @ U121 . . . but appearently she does
30 make sure u buy a nice ring for U6 lol U7 Hi U115 . ACTION isnt falling for
didnt ya hear !!!! PART JOIN geeshhh lol U6 PART hes deaf ppl here dont get it
es nobody here i wanna misbeahve with lol JOIN so read it . thanks U7 .. Im hap
ies want to chat can i talk to him !! lol U121 !!! forwards too lol JOIN ALL PE
k to him !! lol U121 !!! forwards too lol JOIN ALL PErvs ... redirect to U121 '
loves ME the most i love myself JOIN lol U44 how do u know that what ? jerkett
ng wrong ... i can see it in his eyes lol U20 = fiance Jerketts lmao wtf yah I
cooler by the minute what 'd I miss ? lol noo there too much work ! why not ??
that mean I want you ? U6 hello room lol U83 and this .. has been the grammar
the rule he 's in PM land now though lol ah ok i wont bug em then someone wann
flight to hell :) lmao bbl maybe PART LOL lol U7 it was me , U83 hahah U83 ! 80
ht to hell :) lmao bbl maybe PART LOL lol U7 it was me , U83 hahah U83 ! 808265
082653953 K-Fed got his ass kicked .. Lol . ACTION laughs . i got a first class
. i got a first class ticket to hell lol U7 JOIN any texas girls in here ? any
. whats up U155 i was only kidding . lol he 's a douchebag . Poor U121 i 'm bo
??? sits with U30 Cum to my shower . lol U121 . ACTION U1370 watches his nads
ur nad with a stick . ca u U23 ewwww lol *sniffs* ewwwwww PART U115 ! owww spl
ACTION is resisting . ur female right lol U115 beeeeehave Remember the LAst tim
pm's me . charge that is 1.99 / min . lol @ innocent hahah lol .... yeah LOLOLO
is 1.99 / min . lol @ innocent hahah lol .... yeah LOLOLOLLL U12 thats not nic
s . lmao no U115 Check my record . :) Lol lick em U7 U23 how old r u lol Way to
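To see what concordance is doing, here is a rough sketch in plain Python. It is not NLTK's implementation: the function name, the `width` parameter and the sample tokens are all made up for illustration. It matches case-insensitively, which is consistent with the output above, where `Lol` and `LOL` show up for the query `'lol'`.

```python
def concordance(tokens, word, width=30):
    """Return one context line per case-insensitive match of `word`."""
    target = word.lower()
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() != target:
            continue
        # Keep only `width` characters of context on each side of the match.
        left = ' '.join(tokens[:i])[-width:]
        right = ' '.join(tokens[i + 1:])[:width]
        lines.append(f'{left:>{width}} {tok} {right}')
    return lines


# Hypothetical sample data, not taken from the chat corpus.
chat = 'hi U7 lol that was funny LOL ok bye'.split()
lines = concordance(chat, 'lol', width=12)
for line in lines:
    print(line)
```

Right-aligning the left context is what makes the matched word line up in a single column.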


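Some experimenting shows that dispersion_plot is case-sensitive, a small reminder that letter case needs careful attention throughout NLP processing.
As for the generate function, judging by https://github.com/nltk/nltk/issues/736 it is still not fixed. The most recent reply is, surprisingly, from the 18th, yet none of the other replies offer a real answer either; they mostly amount to "nothing can be done, leave it". I tried several different approaches myself without getting a decent result, so I'm shelving it for now. The book says generate will come up again in chapter 3, so I'll revisit it in installment 3.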
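A dispersion plot marks the position of every occurrence of a word, so the case sensitivity comes down to how tokens are compared. The sketch below is my own illustration, not NLTK's code; `offsets` and `ignore_case` are hypothetical names.

```python
def offsets(tokens, word, ignore_case=False):
    """Return the positions at which `word` occurs in `tokens`."""
    if ignore_case:
        return [i for i, t in enumerate(tokens) if t.lower() == word.lower()]
    return [i for i, t in enumerate(tokens) if t == word]


# Hypothetical sample tokens.
tokens = ['America', 'is', 'large', ';', 'america', 'is', 'lowercase', 'here']

exact_upper = offsets(tokens, 'America')
exact_lower = offsets(tokens, 'america')
folded = offsets(tokens, 'America', ignore_case=True)

print(exact_upper, exact_lower, folded)
```

With exact matching, the two spellings produce two unrelated rows of marks; lowercasing the tokens first merges them.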

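The Chinese edition translates token as 标识符 (never mind how the second character is pronounced). A combination of brackets and punctuation marks, such as an emoticon, apparently also counts as a token, which is kind of interesting.
Word type: when the vocabulary contains punctuation, its unique items are generally not called word types but just types (item types). In other words, only a vocabulary of pure words is made up of word types.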
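A small example makes the three counts concrete. The token list is made up for illustration, and using `str.isalpha()` to pick out "pure" words is my own simplification.

```python
# Hypothetical token list containing words, punctuation and an emoticon.
tokens = ['After', 'all', ',', 'is', 'said', ':)', 'is', 'said', '.']

types = set(tokens)                               # unique items, punctuation included
word_types = {t for t in types if t.isalpha()}    # unique items that are pure words

print(len(tokens), len(types), len(word_types))
```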

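Section 1.3 opens with this saying variable without explaining what it is, and there is a string of dots in the middle of its definition… (the dots are most likely the interpreter's `...` continuation prompt, because the list is split across two lines in the book).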

>>> saying = ['After', 'all', 'is', 'said', 'and', 'done','more', 'is', 'said', 'than', 'done']
>>> tokens=set(saying)
>>> tokens=sorted(tokens)
>>> tokens[-2:]
['said', 'than']

"Taken purely on its own"

When calling the hapaxes method, IDLE may freeze for a short while, but it recovers after a moment; there are more than 9,000 words, after all.
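For reference, hapaxes are simply the words that occur exactly once. The book gets them from `FreqDist.hapaxes()`; the version below is a plain-Python sketch of the same idea using `collections.Counter`, not NLTK's code, reusing the saying list from section 1.3.

```python
from collections import Counter


def hapaxes(tokens):
    """Return the tokens that occur exactly once, in first-seen order."""
    counts = Counter(tokens)
    return [word for word, n in counts.items() if n == 1]


saying = ['After', 'all', 'is', 'said', 'and', 'done',
          'more', 'is', 'said', 'than', 'done']

once_only = hapaxes(saying)
print(once_only)
```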

Collocations is translated as 搭配, which seems fine.

Counting only lowercase words is bound to cause problems: country names, place names and the like…
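Here is a toy illustration of the problem; the tokens are made up, and the two strategies shown are my own reading of what "lowercase only" could mean. Keeping only tokens that are already lowercase drops the proper noun entirely, while lowercasing everything merges the country with the bird.

```python
# Hypothetical tokens: "Turkey" the country and "turkey" the bird.
tokens = ['Turkey', 'is', 'a', 'country', ';', 'the', 'turkey', 'is', 'a', 'bird']

# Strategy 1: keep only tokens that are already lowercase.
only_lower = {t for t in tokens if t.islower()}

# Strategy 2: lowercase every alphabetic token before counting.
folded = {t.lower() for t in tokens if t.isalpha()}

print('Turkey' in only_lower)   # the country has vanished
print(sorted(folded))           # country and bird are now one entry
```

Either way some information is lost, so the right choice depends on what is being counted.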

The babelize_shell() function is no longer available. The e-book on the official site explains:

Today, practical translation systems exist for particular pairs of languages, and some are integrated into web search engines. However, these systems have some serious shortcomings, which are starkly revealed by translating a sentence back and forth between a pair of languages until equilibrium is reached, e.g.:
0> how long before the next flight to Alice Springs?
1> wie lang vor dem folgenden Flug zu Alice Springs?
2> how long before the following flight to Alice jump?
3> wie lang vor dem folgenden Flug zu Alice springen Sie?
4> how long before the following flight to Alice do you jump?
5> wie lang, bevor der folgende Flug zu Alice tun, Sie springen?
6> how long, before the following flight to Alice does, do you jump?
7> wie lang bevor der folgende Flug zu Alice tut, tun Sie springen?
8> how long before the following flight to Alice does, do you jump?
9> wie lang, bevor der folgende Flug zu Alice tut, tun Sie springen?
10> how long, before the following flight does to Alice, do do you jump?
11> wie lang bevor der folgende Flug zu Alice tut, Sie tun Sprung?
12> how long before the following flight does leap to Alice, does you?
Observe that the system correctly translates Alice Springs from English to German (in the line starting 1>), but on the way back to English, this ends up as Alice jump (line 2). The preposition before is initially translated into the corresponding German preposition vor, but later into the conjunction bevor (line 5). After line 5 the sentences become nonsensical (but notice the various phrasings indicated by the commas, and the change from jump to leap). The translation system did not recognize when a word was part of a proper name, and it misinterpreted the grammatical structure.

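As the earlier discussion concluded, the output of many current translators drifts: a sentence translated into another language and then back again does not come out the same as the original. Perhaps this is yet another hard problem NLP faces today.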
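The "back and forth until equilibrium" procedure from the quote is easy to sketch. Everything below is hypothetical: the two translators are toy word-by-word lookup tables I made up, not a real translation API, and `round_trip` is just my name for the loop.

```python
def round_trip(sentence, to_foreign, to_english, max_rounds=10):
    """Translate back and forth until the English sentence stops changing."""
    history = [sentence]
    for _ in range(max_rounds):
        back = to_english(to_foreign(history[-1]))
        if back == history[-1]:   # equilibrium: a round trip changes nothing
            break
        history.append(back)
    return history


# Toy dictionaries (hypothetical): both English words map to one German word,
# so the proper noun cannot survive the trip back.
EN_DE = {'alice': 'alice', 'springs': 'springt', 'jumps': 'springt'}
DE_EN = {'alice': 'alice', 'springt': 'jumps'}


def to_german(sentence):
    return ' '.join(EN_DE.get(word, word) for word in sentence.split())


def to_english(sentence):
    return ' '.join(DE_EN.get(word, word) for word in sentence.split())


history = round_trip('alice springs', to_german, to_english)
print(history)
```

Even this toy version reproduces the effect in the quote: the place name is lost on the first round trip and the sentence settles on something else.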
