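Natural Language Processing (1): NLTK and Python

Preface: My current project is a search engine, which made me curious about natural language processing. I had also wanted to learn Python for a long time but never had the chance or the time. While looking for books on Amazon a few days ago, I came across "Natural Language Processing with Python" (《Python自然语言处理》) and immediately felt it would help me get started with NLP and Python at the same time. I will be working through this book for a while and writing these notes as I go.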

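1. NLTK overview

NLTK modules and their functionality

| Language processing task | NLTK modules | Functionality |
|---|---|---|
| Accessing corpora | nltk.corpus | Standardized interfaces to corpora and lexicons |
| String processing | nltk.tokenize, nltk.stem | Tokenizers, sentence tokenizers, stemmers |
| Collocation discovery | nltk.collocations | t-test, chi-squared, pointwise mutual information |
| Part-of-speech tagging | nltk.tag | n-gram, backoff, Brill, HMM, TnT |
| Classification | nltk.classify, nltk.cluster | Decision tree, maximum entropy, naive Bayes, EM, k-means |
| Chunking | nltk.chunk | Regular expression, n-gram, named entity |
| Parsing | nltk.parse | Chart, feature-based, unification, probabilistic, dependency |
| Semantic interpretation | nltk.sem, nltk.inference | Lambda calculus, first-order logic, model checking |
| Evaluation metrics | nltk.metrics | Precision, recall, agreement coefficients |
| Probability and estimation | nltk.probability | Frequency distributions, smoothed probability distributions |
| Applications | nltk.app, nltk.chat | Graphical concordancer, parsers, WordNet browser, chatbots |
| Linguistic fieldwork | nltk.toolbox | Manipulating data in SIL Toolbox format |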

2. Installing NLTK

My Python version is 2.7.5 and my NLTK version is 2.0.4. The package description reads:

    DESCRIPTION
        The Natural Language Toolkit (NLTK) is an open source Python library
        for Natural Language Processing.  A free online book is available.
        (If you use the library for academic research, please cite the book.)

        Steven Bird, Ewan Klein, and Edward Loper (2009).
        Natural Language Processing with Python.  O'Reilly Media Inc.
        http://nltk.org/book

        @version: 2.0.4

The installation steps are the same as those at http://www.nltk.org/install.html:

2. Install Pip: run sudo easy_install pip (this must be run with root privileges)

3. Install Numpy (optional): run sudo pip install -U numpy

4. Install NLTK: run sudo pip install -U nltk

5. Start python and enter the following commands:

    192:chapter2 rcf$ python
    Python 2.7.5 (default, Mar 9 2014, 22:15:05)
    [GCC 4.2.1 Compatible Apple LLVM 5.0 (clang-500.0.68)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import nltk
    >>> nltk.download()

When the NLTK downloader window appears, use it to download nltk_data.

Finally, run the following in the Python interpreter. Output like this shows that the installation succeeded:

    >>> from nltk.book import *
    *** Introductory Examples for the NLTK Book ***
    Loading text1, ..., text9 and sent1, ..., sent9
    Type the name of the text or sentence to view it.
    Type: 'texts()' or 'sents()' to list the materials.
    text1: Moby Dick by Herman Melville 1851
    text2: Sense and Sensibility by Jane Austen 1811
    text3: The Book of Genesis
    text4: Inaugural Address Corpus
    text5: Chat Corpus
    text6: Monty Python and the Holy Grail
    text7: Wall Street Journal
    text8: Personals Corpus
    text9: The Man Who Was Thursday by G . K . Chesterton 1908

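3. First steps with NLTK

Now to the main topic. Since I have never studied Python, using NLTK is also how I am learning Python. These first experiments mainly use the sample data that ships with NLTK, listed in the output above. All of it is available from nltk.book.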

3.1 Searching text

concordance: search text1 for "monstrous" and show every occurrence in context

    >>> text1.concordance("monstrous")
    Building index...
    Displaying 11 of 11 matches:
    ong the former , one was of a most monstrous size . ... This came towards us ,
    ON OF THE PSALMS . "Touching that monstrous bulk of the whale or ork we have r
    ll over with a heathenish array of monstrous clubs and spears . Some were thick
    d as you gazed , and wondered what monstrous cannibal and savage could ever hav
    that has survived the flood ; most monstrous and most mountainous ! That Himmal
    they might scout at Moby Dick as a monstrous fable , or still worse and more de
    th of Radney .'" CHAPTER 55 Of the Monstrous Pictures of Whales . I shall ere l
    ing Scenes . In connexion with the monstrous pictures of whales , I am strongly
    ere to enter upon those still more monstrous stories of them which are to be fo
    ght have been rummaged out of this monstrous cabinet there is no telling . But
    of Whale - Bones ; for Whales of a monstrous size are oftentimes cast up dead u
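To make the idea of a concordance concrete, here is a minimal plain-Python sketch. It is not NLTK's implementation: the function name `concordance`, the `width` parameter, and the sample sentence are all made up for illustration. It only shows the principle of finding every occurrence of a word (ignoring case, as the output above does) and keeping a few tokens of context on each side.

```python
def concordance(tokens, word, width=3):
    """Return (left context, match, right context) for each occurrence of word."""
    target = word.lower()
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == target:  # case-insensitive match
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


# Hypothetical sample text, already split into tokens.
tokens = "one was of a most monstrous size and a Monstrous fable too".split()
hits = concordance(tokens, "monstrous")
for left, match, right in hits:
    print("%s [%s] %s" % (left, match, right))
```

The real method works on character widths rather than token counts, which is why its output lines are cut off in the middle of words.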

similar: find the words in text1 that appear in contexts similar to those of "monstrous"

    >>> text1.similar("monstrous")
    Building word-context index...
    abundant candid careful christian contemptible curious delightfully
    determined doleful domineering exasperate fearless few gamesome
    horrible impalpable imperial lamentable lazy loving
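One way to picture "similar contexts" is to record, for every word, the pairs of words that surround it, and then look for other words that share those pairs. The sketch below follows that idea in plain Python. It is a simplification under my own assumptions, not what NLTK actually does internally, and `similar_words` and the sample sentence are hypothetical.

```python
from collections import Counter, defaultdict


def similar_words(tokens, word):
    """Rank other words by how many (previous, next) contexts they share with word."""
    contexts = defaultdict(set)
    for prev, cur, nxt in zip(tokens, tokens[1:], tokens[2:]):
        contexts[cur].add((prev, nxt))
    target = contexts.get(word, set())
    shared = Counter()
    for other, ctxs in contexts.items():
        overlap = len(ctxs & target)
        if other != word and overlap:
            shared[other] = overlap
    return [w for w, _ in shared.most_common()]


# Hypothetical sample: "big" and "small" both occur between "very" and "dog".
tokens = "a very big dog and a very small dog saw a very big cat".split()
result = similar_words(tokens, "big")
print(result)
```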

dispersion_plot: draw a dispersion plot showing where words occur in a text, that is, their offsets from the beginning

    >>> text4.dispersion_plot(["citizens", "democracy", "freedom", "duties", "America"])
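The data behind such a plot is simply the list of positions at which each word occurs. The following sketch computes those offsets without drawing anything. The helper name `word_offsets` and the sample sentence are my own, chosen only for illustration.

```python
def word_offsets(tokens, words):
    """Map each requested word to the token positions where it occurs."""
    offsets = {w: [] for w in words}
    for position, tok in enumerate(tokens):
        if tok in offsets:
            offsets[tok].append(position)
    return offsets


# Hypothetical sample text, already split into tokens.
tokens = "we the citizens value freedom and the duties of citizens".split()
offsets = word_offsets(tokens, ["citizens", "freedom", "democracy"])
print(offsets)
```

A word that never occurs, such as "democracy" here, keeps an empty list, which corresponds to an empty row in the plot.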

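3.2 Counting vocabulary

len: returns a length. It gives the number of tokens in a text as well as the length of a single word.

    >>> len(text1)        # number of tokens in text1
    260819
    >>> len(set(text1))   # number of distinct tokens in text1
    19317
    >>> len(text1[0])     # length of the first token in text1
    1

sorted: sorting

    >>> sent1
    ['Call', 'me', 'Ishmael', '.']
    >>> sorted(sent1)
    ['.', 'Call', 'Ishmael', 'me']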
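The same built-ins work on any list of strings, so they can be tried without loading a corpus. The short sketch below uses a made-up token list and also shows why '.' and 'Call' sort ahead of 'me': sorted() compares strings by character code, so punctuation comes first, then capital letters, then lowercase letters.

```python
# Hypothetical token list; any list of strings behaves the same way.
tokens = ["Call", "me", "Ishmael", ".", "Call", "me", "back", "."]

total = len(tokens)             # every token counts, repeats included
distinct = len(set(tokens))     # set() removes the repeats
first_length = len(tokens[0])   # length of the string "Call"
ordered = sorted(set(tokens))   # punctuation < capitals < lowercase

print(total, distinct, first_length)
print(ordered)
```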

3.3 Frequency distributions

nltk.probability.FreqDist

    >>> fdist1 = FreqDist(text1)   # frequency distribution of text1
    >>> fdist1                     # text1 has 19317 samples but 260819 outcomes in total
    >>> keys = fdist1.keys()
    >>> keys[:50]                  # the 50 most frequent samples in text1
    [',', 'the', '.', 'of', 'and', 'a', 'to', ';', 'in', 'that', "'", '-', 'his',
    'it', 'I', 's', 'is', 'he', 'with', 'was', 'as', '"', 'all', 'for', 'this',
    '!', 'at', 'by', 'but', 'not', '--', 'him', 'from', 'be', 'on', 'so', 'whale',
    'one', 'you', 'had', 'have', 'there', 'But', 'or', 'were', 'now', 'which',
    '?', 'me', 'like']

    >>> fdist1.items()[:50]        # sample counts; for example ',' occurs 18713 times out of 260819 tokens
    [(',', 18713), ('the', 13721), ('.', 6862), ('of', 6536), ('and', 6024),
    ('a', 4569), ('to', 4542), (';', 4072), ('in', 3916), ('that', 2982),
    ("'", 2684), ('-', 2552), ('his', 2459), ('it', 2209), ('I', 2124),
    ('s', 1739), ('is', 1695), ('he', 1661), ('with', 1659), ('was', 1632),
    ('as', 1620), ('"', 1478), ('all', 1462), ('for', 1414), ('this', 1280),
    ('!', 1269), ('at', 1231), ('by', 1137), ('but', 1113), ('not', 1103),
    ('--', 1070), ('him', 1058), ('from', 1052), ('be', 1030), ('on', 1005),
    ('so', 918), ('whale', 906), ('one', 889), ('you', 841), ('had', 767),
    ('have', 760), ('there', 715), ('But', 705), ('or', 697), ('were', 680),
    ('now', 646), ('which', 640), ('?', 637), ('me', 627), ('like', 624)]

    >>> fdist1.hapaxes()[:50]      # samples that occur only once in text1
    ['!\'"', '!)"', '!*', '!--"', '"...', "',--", "';", '):', ');--', ',)',
    '--\'"', '---"', '---,', '."*', '."--', '.*--', '.--"', '100', '101', '102',
    '103', '104', '105', '106', '107', '108', '109', '11', '110', '111', '112',
    '113', '114', '115', '116', '117', '118', '119', '12', '120', '121', '122',
    '123', '124', '125', '126', '127', '128', '129', '130']
    >>> fdist1['!\'"']
    1

    >>> fdist1.plot(50, cumulative=True)   # cumulative frequency plot of the 50 most frequent samples
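The transcript above relies on NLTK 2.0.4 behaviour, where keys() and items() return lists already sorted by decreasing frequency. In NLTK 3, FreqDist is built on collections.Counter and most_common(50) is the way to get the top samples. The sketch below uses Counter directly on a made-up token list to show the three ideas from this section: the most frequent samples, the hapaxes, and the running totals that a cumulative plot would display. It is an illustration only, not a replacement for FreqDist.

```python
from collections import Counter

# Hypothetical sample text, already split into tokens.
tokens = "the whale , the sea , and the ship".split()
counts = Counter(tokens)

# Most frequent samples first, as (sample, count) pairs.
top = counts.most_common(2)

# Hapaxes: samples that occur exactly once.
hapaxes = sorted(w for w, c in counts.items() if c == 1)

# Running totals over samples in decreasing frequency order;
# these are the values a cumulative plot would show.
running = 0
cumulative = []
for _, count in counts.most_common():
    running += count
    cumulative.append(running)

print(top)
print(hapaxes)
print(cumulative)
```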

3.4 Fine-grained selection of words

    >>> long_words = [w for w in set(text1) if len(w) > 15]   # words in text1 longer than 15 characters
    >>> sorted(long_words)                                     # sorted in dictionary order
    ['CIRCUMNAVIGATION', 'Physiognomically', 'apprehensiveness',
    'cannibalistically', 'characteristically', 'circumnavigating',
    'circumnavigation', 'circumnavigations', 'comprehensiveness',
    'hermaphroditical', 'indiscriminately', 'indispensableness',
    'irresistibleness', 'physiognomically', 'preternaturalness',
    'responsibilities', 'simultaneousness', 'subterraneousness',
    'supernaturalness', 'superstitiousness', 'uncomfortableness',
    'uncompromisedness', 'undiscriminating', 'uninterpenetratingly']
    >>> fdist1 = FreqDist(text1)
    >>> # words from the vocabulary of text5 that are longer than 7 characters
    >>> # and occur more than 7 times in text1, sorted in dictionary order
    >>> sorted([w for w in set(text5) if len(w) > 7 and fdist1[w] > 7])
    ['American', 'actually', 'afternoon', 'anything', 'attention', 'beautiful',
    'carefully', 'carrying', 'children', 'commanded', 'concerning', 'considered',
    'considering', 'difference', 'different', 'distance', 'elsewhere', 'employed',
    'entitled', 'especially', 'everything', 'excellent', 'experience',
    'expression', 'floating', 'following', 'forgotten', 'gentlemen', 'gigantic',
    'happened', 'horrible', 'important', 'impossible', 'included', 'individual',
    'interesting', 'invisible', 'involved', 'monsters', 'mountain', 'occasional',
    'opposite', 'original', 'originally', 'particular', 'pictures', 'pointing',
    'position', 'possibly', 'probably', 'question', 'regularly', 'remember',
    'revolving', 'shoulders', 'sleeping', 'something', 'sometimes', 'somewhere',
    'speaking', 'specially', 'standing', 'starting', 'straight', 'stranger',
    'superior', 'supposed', 'surprise', 'terrible', 'themselves', 'thinking',
    'thoughts', 'together', 'understand', 'watching', 'whatever', 'whenever',
    'wonderful', 'yesterday', 'yourself']

Note that the last expression iterates over set(text5) while counting with fdist1, which was built from text1. To select from text1's own vocabulary, iterate over set(text1) instead; the result would then differ from the list shown here.
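The pattern in this section is a list comprehension with one or two conditions, where the vocabulary and the counts should normally come from the same text. Here is a small self-contained sketch of that pattern, using collections.Counter in place of FreqDist and a made-up sentence. The thresholds are lowered to suit the tiny sample.

```python
from collections import Counter

# Hypothetical sample text, already split into tokens.
tokens = ("the whale swam and the whale dived while the sailors watched "
          "the extraordinary whale").split()
counts = Counter(tokens)   # counts and vocabulary come from the same tokens

# Words longer than 6 characters, in dictionary order.
long_words = sorted(w for w in set(tokens) if len(w) > 6)

# Words longer than 4 characters that also occur more than once.
frequent_long = sorted(w for w in set(tokens) if len(w) > 4 and counts[w] > 1)

print(long_words)
print(frequent_long)
```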

3.5 Collocations and bigrams

bigrams() produces the bigrams (pairs of adjacent words) of a word list, and collocations() prints the collocations found in a text:

    >>> bigrams(['more', 'is', 'said', 'than', 'done'])
    [('more', 'is'), ('is', 'said'), ('said', 'than'), ('than', 'done')]
    >>> text1.collocations()
    Building collocations list
    Sperm Whale; Moby Dick; White Whale; old man; Captain Ahab; sperm
    whale; Right Whale; Captain Peleg; New Bedford; Cape Horn; cried Ahab;
    years ago; lower jaw; never mind; Father Mapple; cried Stubb; chief
    mate; white whale; ivory leg; one hand
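A bigram list can be built in plain Python by pairing each token with its successor. The sketch below defines its own `bigrams` helper for illustration (it is not the NLTK function, which in NLTK 3 returns a generator rather than a list) and then counts the pairs in a made-up token list. Raw pair counts are only a crude stand-in for collocations(), which ignores stopwords and scores the pairs.

```python
from collections import Counter


def bigrams(tokens):
    """Pair each token with the token that follows it."""
    return list(zip(tokens, tokens[1:]))


pairs = bigrams(["more", "is", "said", "than", "done"])
print(pairs)

# Counting bigrams in a hypothetical token list.
tokens = "sperm whale and white whale and sperm whale".split()
pair_counts = Counter(bigrams(tokens))
print(pair_counts[("sperm", "whale")])
```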

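3.6 Functions defined for NLTK frequency distributions

| Example | Description |
|---|---|
| fdist = FreqDist(samples) | Create a frequency distribution containing the given samples |
| fdist.inc(sample) | Increment the count for this sample |
| fdist['monstrous'] | Count of the number of times a given sample occurred |
| fdist.freq('monstrous') | Frequency of a given sample |
| fdist.N() | Total number of samples |
| fdist.keys() | The samples sorted in order of decreasing frequency |
| for sample in fdist: | Iterate over the samples in order of decreasing frequency |
| fdist.max() | Sample with the greatest count |
| fdist.tabulate() | Tabulate the frequency distribution |
| fdist.plot() | Graphical plot of the frequency distribution |
| fdist.plot(cumulative=True) | Cumulative plot of the frequency distribution |
| fdist1 < fdist2 | Test whether samples in fdist1 occur less frequently than in fdist2 |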
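To see what several of these operations compute, the sketch below reproduces them with collections.Counter on a made-up sample list. This is only an analogy under my own assumptions and not NLTK code. One point I am sure of is that fdist.inc(sample) belongs to NLTK 2 and was removed in NLTK 3, where fdist[sample] += 1 is used instead.

```python
from collections import Counter

# Hypothetical samples.
samples = ["a", "b", "a", "c", "a", "b"]
fdist = Counter(samples)        # like FreqDist(samples)

fdist["d"] += 1                 # like fdist.inc("d") in NLTK 2
count_a = fdist["a"]            # like fdist["a"]
n = sum(fdist.values())         # like fdist.N()
freq_a = float(count_a) / n     # like fdist.freq("a")
most = max(fdist, key=fdist.get)                 # like fdist.max()
ordered = [w for w, _ in fdist.most_common()]    # decreasing frequency

print(count_a, n, freq_a, most)
print(ordered)
```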

Finally, let's look at the class of text1. type() shows the type of a variable, and help() lists the attributes and methods of a class. Whenever I need the details of a particular method from now on I can turn to help(), which is very handy.

    >>> type(text1)
    <class 'nltk.text.Text'>
    >>> help('nltk.text.Text')
    Help on class Text in nltk.text:

    nltk.text.Text = class Text(__builtin__.object)
     |  A wrapper around a sequence of simple (string) tokens, which is
     |  intended to support initial exploration of texts (via the
     |  interactive console).  Its methods perform a variety of analyses
     |  on the text's contexts (e.g., counting, concordancing, collocation
     |  discovery), and display the results.  If you wish to write a
     |  program which makes use of these analyses, then you should bypass
     |  the ``Text`` class, and use the appropriate analysis function or
     |  class directly instead.
     |
     |  A ``Text`` is typically initialized from a given document or
     |  corpus.  E.g.:
     |
     |  >>> import nltk.corpus
     |  >>> from nltk.text import Text
     |  >>> moby = Text(nltk.corpus.gutenberg.words('melville-moby_dick.txt'))
     |
     |  Methods defined here:
     |
     |  __getitem__(self, i)
     |
     |  __init__(self, tokens, name=None)
     |      Create a Text object.
     |
     |      :param tokens: The source text.
     |      :type tokens: sequence of str
     |
     |  __len__(self)
     |
     |  __repr__(self)
     |      :return: A string representation of this FreqDist.
     |      :rtype: string
     |
     |  collocations(self, num=20, window_size=2)
     |      Print collocations derived from the text, ignoring stopwords.
     |
     |      :seealso: find_collocations
     |      :param num: The maximum number of collocations to print.
     |      :type num: int
     |      :param window_size: The number of tokens spanned by a collocation (default=2)
     |      :type window_size: int
     |
     |  common_contexts(self, words, num=20)
     |      Find contexts where the specified words appear; list
     |      most frequent common contexts first.
     |
     |      :param word: The word used to seed the similarity search
     |      :type word: str
     |      :param num: The number of words to generate (default=20)
     |      :type num: int
     |      :seealso: ContextIndex.common_contexts()

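4. Techniques for language understanding

1. Word sense disambiguation

2. Anaphora (pronoun) resolution

3. Automatic language generation

4. Machine translation

5. Human-machine dialogue systems

6. Textual entailment

5. Summary

Although this is my first contact with Python and NLTK, I already find both of them convenient and easy to use, and I will be studying them in more depth next.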
