第03章 加工原料文本

  • 3.1 从网络和硬盘访问文本
    • 电子书
    • 处理HTML
    • 处理搜索引擎的结果
    • 处理RSS 订阅
    • 读取本地文件
    • 从PDF、MS Word 及其他二进制格式中提取文本
    • 捕获用户输入
    • NLP 的流程
  • 3.2 字符串最底层的文本处理
    • 字符串的基本操作
    • 输出字符串
    • 访问单个字符
    • 访问子字符串
    • 更多的字符串操作
    • 链表与字符串的差异
  • 3.3 使用Unicode 进行文字处理
    • 什么是Unicode?
    • 从文件中提取已编码文本
    • 在Python中使用本地编码
  • 3.4 使用正则表达式检测词组搭配
    • 使用基本的元字符
    • 范围与闭包
  • 3.5 正则表达式的有益应用
    • 提取字符块
    • 在字符块上做更多事情
    • 查找词干
    • 搜索已分词文本
  • 3.6 规范化文本
    • 词干提取器
    • 词形归并
  • 3.7 用正则表达式为文本分词
    • 分词的简单方法
    • NLTK 的正则表达式分词器
    • 分词的进一步问题
  • 3.8 分割
    • 断句
    • 分词
  • 3.9 格式化:从链表到字符串
    • 从链表到字符串
    • 字符串与格式
    • 排列
    • 将结果写入文件
    • 文本换行
  • 3.10 小结
import nltk
from nltk import word_tokenize

3.1 从网络和硬盘访问文本

电子书

在 http://www.gutenberg.org/catalog 上可以浏览25,000 本免费在线书籍的目录,获得ASCII 码文本文件的URL,其中也包括中文图书。

from urllib import request
url ="http://www.gutenberg.org/files/25196/25196-0.txt" #编号2554 的文本是《百家姓》
response = request.urlopen(url)
raw = response.read().decode('utf8')
type(raw)
str
len(raw)
20497
raw[600:800]
'百家姓\r\n\r\n趙錢孫李 周吳鄭王 馮陳褚衛 蔣沈韓楊\r\n朱秦尤許 何呂施張 孔曹嚴華 金魏陶薑\r\n戚謝鄒喻 柏水竇章 雲蘇潘葛 奚範彭郎\r\n魯韋昌馬 苗鳳花方 俞任袁柳 酆鮑史唐\r\n費廉岑薛 雷賀倪湯 滕殷羅畢 郝鄔安常\r\n\r\n樂於時傅 皮卞齊康 伍餘元蔔 顧孟平黃\r\n和穆蕭尹 姚邵堪汪 祁毛禹狄 米貝明臧\r\n計伏成戴 談宋茅龐 熊紀舒屈 項祝董梁\r\n杜阮藍閔 席季麻強 賈路婁危 江童顏郭\r\n梅盛'

对于语言处理,我们要将字符串分解为词和标点符号,这一步被称为分词,它产生我们所熟悉的结构,一个词汇和标点符号的链表。

tokens = nltk.word_tokenize(raw)
type(tokens)
list
len(tokens)
3542
tokens[:5]
['\ufeffThe', 'Project', 'Gutenberg', 'EBook', 'of']
text = nltk.Text(tokens) #创建一个NLTK 文本
type(text)
nltk.text.Text
text[:5]
['\ufeffThe', 'Project', 'Gutenberg', 'EBook', 'of']
text.collocations()
Project Gutenberg-tm; Project Gutenberg; Literary Archive; Archive
Foundation; United States; Gutenberg Literary; electronic works;
Gutenberg-tm electronic; set forth; public domain; electronic work;
Gutenberg-tm License; Bai Jia; Jia Xing; copyright holder; PROJECT
GUTENBERG; BAI JIA; EBOOK BAI; JIA XING; Plain Vanilla

方法find()和rfind()(反向的find)帮助我们得到字符串切片需要用到的正确的索引值

raw.find("朱")
628
raw.rfind("周")
612
raw[612:629]
'周吳鄭王 馮陳褚衛 蔣沈韓楊\r\n朱'

处理HTML

url = "https://www.baidu.com"
html = request.urlopen(url).read().decode('utf8')
html[:60]
'<html>\r\n<head>\r\n\t<script>\r\n\t\tlocation.replace(location.href.'
from bs4 import BeautifulSoup
raw = BeautifulSoup(html).get_text()
C:\Program Files\Anaconda3\lib\site-packages\bs4\__init__.py:181: UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("lxml"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.The code that caused this warning is on line 193 of the file C:\Program Files\Anaconda3\lib\runpy.py. To get rid of this warning, change code that looks like this:BeautifulSoup([your markup])to this:BeautifulSoup([your markup], "lxml")markup_type=markup_type))
tokens = word_tokenize(raw)
tokens[:5]
['location.replace', '(', 'location.href.replace', '(', '``']
tokens = tokens[110:390]
text = nltk.Text(tokens)
text.concordance('replace')
no matches

更多更复杂的有关处理HTML 的内容,可以使用http://www.crummy.com/software/BeautifulSoup/上的Beautiful Soup 软件包。

处理搜索引擎的结果

  • 优势

1.规模

2.非常容易使用

  • 缺点

1.首先,允许的搜索方式的范围受到严格限制。不同于本地驱动器中的语料库,你可以编写程序来搜索任意复杂的模式,搜索引擎一般只允许你搜索单个词或词串,有时也允许使用通配符。

2.其次,搜索引擎给出的结果不一致,并且在不同的时间或在不同的地理区域会给出非常不同的结果。

3.最后,搜索引擎返回的结果中的标记可能会不可预料的改变,基于模式的方法定位特定的内容将无法使用。

处理RSS 订阅

博客圈是文本的重要来源,无论是正式的还是非正式的。第三方Python 库Universal Feed Parser(http://feedparser.org/)可以访问博客的内容。

import feedparser
llog = feedparser.parse("http://languagelog.ldc.upenn.edu/nll/?feed=atom")
llog['feed']['title']
'Language Log'
len(llog.entries)
13
post = llog.entries[2]
post.title
'Miscellaneous bacteria'
content = post.content[0].value
content[:70]
'<p>Jeff DeMarco spotted this menu item at the Splendid China attractio'
raw = BeautifulSoup(content).get_text()
C:\Program Files\Anaconda3\lib\site-packages\bs4\__init__.py:181: UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("lxml"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.The code that caused this warning is on line 193 of the file C:\Program Files\Anaconda3\lib\runpy.py. To get rid of this warning, change code that looks like this:BeautifulSoup([your markup])to this:BeautifulSoup([your markup], "lxml")markup_type=markup_type))
word_tokenize(raw)[10:15]
['attraction', 'in', 'Shenzhen', ':', 'zá']

读取本地文件

import os
os.listdir('.')
['.ipynb_checkpoints','document.txt','NLP','output.txt','readme.txt.txt','Steven Bird-2009-Natural Language Processing with Python.pdf','Steven Bird-2015-Natural Language Processing with Python.pdf','textproc.py','__pycache__','目录.txt','第01章 语言处理与Python.ipynb','第02章 获得文本语料和词汇资源.ipynb','第03章 加工原料文本.ipynb']
f = open('document.txt')
f = open('document.txt', 'rU')
C:\Program Files\Anaconda3\lib\site-packages\ipykernel\__main__.py:1: DeprecationWarning: 'U' mode is deprecatedif __name__ == '__main__':
for line in f:
    print(line.strip())

从PDF、MS Word 及其他二进制格式中提取文本

ASCII 码文本和HTML 文本是人可读的格式。文字常常以二进制格式出现,如PDF 和MS Word,只能使用专门的软件打开。第三方函数库如pypdf 和pywin32 提供了对这些格式的访问。
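
下面是一个用第三方库pypdf 从PDF 中提取文本的简单示意(假设已安装pypdf,文件名example.pdf 仅为示例,并非本笔记实际运行过的代码):

from pypdf import PdfReader
reader = PdfReader('example.pdf') #打开PDF 文件
text = '\n'.join(page.extract_text() or '' for page in reader.pages) #逐页提取文本并拼接
print(text[:200]) #查看提取结果的前200 个字符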

捕获用户输入

s = input("Enter some text: ")
Enter some text: sdasdassadsa
print("You typed", len(word_tokenize(s)), "words.")
You typed 1 words.

NLP 的流程

NLP 处理流程:打开一个URL,读取其中HTML 格式的内容,去除标记,并选择字符的切片,然后分词;是否转换为nltk.Text 对象是可选的。我们也可以将所有词汇小写并提取词汇表。

from bs4 import BeautifulSoup
url = "https://www.baidu.com/"
html = request.urlopen(url).read().decode('utf8')
raw = BeautifulSoup(html).get_text()
raw = raw[:500]
tokens = word_tokenize(raw)
tokens = tokens[:390]
text = nltk.Text(tokens)
words = [w.lower() for w in text]
vocab = sorted(set(words))
vocab[:5]
C:\Program Files\Anaconda3\lib\site-packages\bs4\__init__.py:181: UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("lxml"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.The code that caused this warning is on line 193 of the file C:\Program Files\Anaconda3\lib\runpy.py. To get rid of this warning, change code that looks like this:BeautifulSoup([your markup])to this:BeautifulSoup([your markup], "lxml")markup_type=markup_type))
["''", '(', ')', ',', '//']

3.2 字符串最底层的文本处理

字符串的基本操作

monty = 'Monty Python' #单引号
monty
'Monty Python'
circus = "Monty Python's Flying Circus" #双引号
circus
"Monty Python's Flying Circus"
circus = 'Monty Python\'s Flying Circus' #如果一个字符串中包含一个单引号,我们必须在单引号前加反斜杠或者也可以将这个字符串放入双引号中
circus
"Monty Python's Flying Circus"
circus = 'Monty Python's Flying Circus'
  File "<ipython-input-6-d481af75953d>", line 1circus = 'Monty Python's Flying Circus'^
SyntaxError: invalid syntax
couplet = "Shall I compare thee to a Summer's day?"\"Thou are more lovely and more temperate:" #使用反斜杠
print(couplet)
Shall I compare thee to a Summer's day?Thou are more lovely and more temperate:
couplet = ("Rough winds do shake the darling buds of May,""And Summer's lease hath all too short a date:") #或者括号
print(couplet)
Rough winds do shake the darling buds of May,And Summer's lease hath all too short a date:
couplet = """Shall I compare thee to a Summer's day?
Thou are more lovely and more temperate:""" #三重引号的字符串
print(couplet)
Shall I compare thee to a Summer's day?
Thou are more lovely and more temperate:
'very' + 'very' + 'very' #加法或连接
'veryveryvery'
'very' * 3  # 乘法
'veryveryvery'
a = [1, 2, 3, 4, 5, 6, 7, 6, 5, 4, 3, 2, 1]
b = [' ' * 2 * (7 - i) + 'very' * i for i in a]
for line in b:
    print(line)
            very
          veryvery
        veryveryvery
      veryveryveryvery
    veryveryveryveryvery
  veryveryveryveryveryvery
veryveryveryveryveryveryvery
  veryveryveryveryveryvery
    veryveryveryveryvery
      veryveryveryvery
        veryveryvery
          veryvery
            very
'very' - 'y' #不能对字符串用减法
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-22-4771cfed954b> in <module>()
----> 1 'very' - 'y' #不能对字符串用减法
TypeError: unsupported operand type(s) for -: 'str' and 'str'
'very' / 2 #不能对字符串用除法:
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-23-c8fac486cd75> in <module>()
----> 1 'very' / 2 #不能对字符串用除法
TypeError: unsupported operand type(s) for /: 'str' and 'int'

输出字符串

print(monty)
Monty Python
grail = 'Holy Grail'
print(monty + grail)
Monty PythonHoly Grail
print(monty, grail)
Monty Python Holy Grail
print(monty, "and the", grail)
Monty Python and the Holy Grail

访问单个字符

monty[0]
'M'
monty[3]
't'
monty[-1]
'n'
monty[20]
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-30-3d1d6a1691c6> in <module>()
----> 1 monty[20]
IndexError: string index out of range
sent = 'colorless green ideas sleep furiously'
for char in sent:
    print(char, end=' ')
c o l o r l e s s   g r e e n   i d e a s   s l e e p   f u r i o u s l y

一个文本相关的字母频率特征可以用在文本语言自动识别中。下面这段代码按照出现频率最高排在最先的顺序显示出英文字母,用fdist.plot()可视化这个分布。

from nltk.corpus import gutenberg
raw = gutenberg.raw('melville-moby_dick.txt')
fdist = nltk.FreqDist(ch.lower() for ch in raw if ch.isalpha())
fdist.keys()
dict_keys(['s', 'n', 'w', 'o', 'y', 'a', 'z', 'd', 'h', 'g', 'b', 'j', 'f', 'x', 'e', 'p', 't', 'i', 'u', 'k', 'c', 'r', 'v', 'q', 'l', 'm'])
fdist.plot()

访问子字符串

monty[6:10] #切片[m,n]包含从位置m 到n-1 中的字符。
'Pyth'
monty[-12:-7]
'Monty'
monty[:5]
'Monty'
monty[6:]
'Python'
phrase = 'And now for something completely different'
if 'thing' in phrase:
    print('found "thing"')
found "thing"
monty.find('Python') #查找子字符串,返回在字符串内的首次出现的位置
6

更多的字符串操作

表3-2. 有用的字符串方法

方法 功能
s.find(t) 字符串s 中包含t 的第一个索引(没找到返回-1)
s.rfind(t) 字符串s 中包含t 的最后一个索引(没找到返回-1)
s.index(t) 与s.find(t)功能类似,但没找到时引起ValueError
s.rindex(t) 与s.rfind(t)功能类似,但没找到时引起ValueError
s.join(text) 用s 作为分隔符连接text 中的词汇
s.split(t) 在所有找到t 的位置将s 分割成链表(默认为空白符)
s.splitlines() 将s 按行分割成字符串链表
s.lower() 将字符串s 小写
s.upper() 将字符串s 大写
s.title() 将字符串s 中每个词的首字母大写
s.strip() 返回一个没有首尾空白字符的s 的拷贝
s.replace(t, u) 用u 替换s 中的t
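
下面用几个小例子验证表3-2 中的部分方法(示例字符串s 为任意选取,仅作演示):

s = '  Natural Language Processing  '
s.strip() #去掉首尾空白
'Natural Language Processing'
s.strip().lower() #转为小写
'natural language processing'
s.strip().split(' ') #按空格分割成链表
['Natural', 'Language', 'Processing']
'-'.join(['Natural', 'Language', 'Processing']) #用'-'连接链表中的词汇
'Natural-Language-Processing'
s.find('Language') #子字符串首次出现的索引
10
s.replace('Processing', 'Toolkit') #替换子字符串
'  Natural Language Toolkit  '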

链表与字符串的差异

query = 'Who knows?'
beatles = ['John', 'Paul', 'George', 'Ringo']
query[2]
'o'
beatles[2]
'George'
query[:2]
'Wh'
beatles[:2]
['John', 'Paul']
query + " I don't"
"Who knows? I don't"
beatles + 'Brian'
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-52-fa447bb3f680> in <module>()
----> 1 beatles + 'Brian'
TypeError: can only concatenate list (not "str") to list
beatles + ['Brian']
['John', 'Paul', 'George', 'Ringo', 'Brian']
beatles[0] = "John Lennon" #,链表是可变的,其内容可以随时修改。作为一个结论,链表支持修改原始值的操作,而不是产生一个新的值。
del beatles[-1]
beatles
['John Lennon', 'Paul', 'George']
query[0] = 'F' #字符串是不可变的:一旦你创建了一个字符串,就不能改变它。
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-57-f19cb6b849a3> in <module>()
----> 1 query[0] = 'F' #字符串是不可变的:一旦你创建了一个字符串,就不能改变它。
TypeError: 'str' object does not support item assignment

3.3 使用Unicode 进行文字处理

什么是Unicode?

将文本翻译成Unicode 的过程叫做解码;将Unicode 转化为其它编码的过程叫做编码。一个字体是一个从字符到字形的映射。
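
下面用一个极小的例子说明解码与编码的方向(示例字符串为任意选取,仅作演示):

s = '自然语言' #Python 3 的str 本身就是Unicode
b = s.encode('utf8') #编码:从Unicode 到字节串
b
b'\xe8\x87\xaa\xe7\x84\xb6\xe8\xaf\xad\xe8\xa8\x80'
b.decode('utf8') #解码:从字节串回到Unicode
'自然语言'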

从文件中提取已编码文本

path = nltk.data.find('corpora/unicode_samples/polish-lat2.txt')
f = open(path, encoding='latin2')
for line in f:
    line = line.strip()
    print(line)
Pruska Biblioteka Państwowa. Jej dawne zbiory znane pod nazwą
"Berlinka" to skarb kultury i sztuki niemieckiej. Przewiezione przez
Niemców pod koniec II wojny światowej na Dolny Śląsk, zostały
odnalezione po 1945 r. na terytorium Polski. Trafiły do Biblioteki
Jagiellońskiej w Krakowie, obejmują ponad 500 tys. zabytkowych
archiwaliów, m.in. manuskrypty Goethego, Mozarta, Beethovena, Bacha.
f = open(path, encoding='latin2')
for line in f:
    line = line.strip()
    print(line.encode('unicode_escape'))
b'Pruska Biblioteka Pa\\u0144stwowa. Jej dawne zbiory znane pod nazw\\u0105'
b'"Berlinka" to skarb kultury i sztuki niemieckiej. Przewiezione przez'
b'Niemc\\xf3w pod koniec II wojny \\u015bwiatowej na Dolny \\u015al\\u0105sk, zosta\\u0142y'
b'odnalezione po 1945 r. na terytorium Polski. Trafi\\u0142y do Biblioteki'
b'Jagiello\\u0144skiej w Krakowie, obejmuj\\u0105 ponad 500 tys. zabytkowych'
b'archiwali\\xf3w, m.in. manuskrypty Goethego, Mozarta, Beethovena, Bacha.'
ord('a') #ord()查找一个字符的整数序数
97
nacute = '\u0144' #带尖音符的小写字母n
nacute
'ń'
nacute_utf = nacute.encode('utf8')
nacute_utf
b'\xc5\x84'
print(repr(nacute_utf))
b'\xc5\x84'
import unicodedata  #unicodedata 模块使我们可以检查Unicode 字符的属性。
lines = open(path, encoding='latin2').readlines()
line = lines[2]
print(line.encode('unicode_escape'))
b'Niemc\\xf3w pod koniec II wojny \\u015bwiatowej na Dolny \\u015al\\u0105sk, zosta\\u0142y\\n'
for c in line:
    if ord(c) > 127:
        print('{} U+{:04x} {}'.format(c.encode('utf8'), ord(c), unicodedata.name(c)))
b'\xc3\xb3' U+00f3 LATIN SMALL LETTER O WITH ACUTE
b'\xc5\x9b' U+015b LATIN SMALL LETTER S WITH ACUTE
b'\xc5\x9a' U+015a LATIN CAPITAL LETTER S WITH ACUTE
b'\xc4\x85' U+0105 LATIN SMALL LETTER A WITH OGONEK
b'\xc5\x82' U+0142 LATIN SMALL LETTER L WITH STROKE

下面展示Python 字符串函数和re 模块是如何接收Unicode 字符串的

line.find(u'zosta\u0142y')
54
line = line.lower()
line
'niemców pod koniec ii wojny światowej na dolny śląsk, zostały\n'
line.encode('unicode_escape')
b'niemc\\xf3w pod koniec ii wojny \\u015bwiatowej na dolny \\u015bl\\u0105sk, zosta\\u0142y\\n'
import re
m = re.search('\u015b\w*', line)
m.group()
'światowej'
nltk.word_tokenize(line)
['niemców','pod','koniec','ii','wojny','światowej','na','dolny','śląsk',',','zostały']

在Python中使用本地编码

# -*- coding: utf-8 -*-

3.4 使用正则表达式检测词组搭配

import re
wordlist = [w for w in nltk.corpus.words.words('en') if w.islower()]

使用基本的元字符

[w for w in wordlist if re.search('ed$', w)][:5]
['abaissed', 'abandoned', 'abased', 'abashed', 'abatised']
'''
通配符“.”匹配任何单个字符。假设我们有一个8 个字母组成的词的字谜,
j 是其第三个字母,t 是其第六个字母。插入符号“^”匹配字符串的开始,
就像“$”符号匹配字符串的结尾。符号“?”表示前面的字符是可选的。
'''
[w for w in wordlist if re.search('^..j..t..$', w)][:5]
['abjectly', 'adjuster', 'dejected', 'dejectly', 'injector']
sum(1 for w in wordlist if re.search('^e-?mail$', w))
0

范围与闭包

[w for w in wordlist if re.search('^[ghi][mno][jlk][def]$', w)]
['gold', 'golf', 'hold', 'hole']

“+”和“*”符号有时被称为Kleene 闭包,或者干脆称为闭包。

chat_words = sorted(set(w for w in nltk.corpus.nps_chat.words()))
[w for w in chat_words if re.search('^m+i+n+e+$', w)]
['miiiiiiiiiiiiinnnnnnnnnnneeeeeeeeee','miiiiiinnnnnnnnnneeeeeeee','mine','mmmmmmmmiiiiiiiiinnnnnnnnneeeeeeee']
[w for w in chat_words if re.search('^[ha]+$', w)][:5]
['a', 'aaaaaaaaaaaaaaaaa', 'aaahhhh', 'ah', 'ahah']
[w for w in chat_words if re.search('^m*i*n*e*$', w)][:5]
['', 'e', 'i', 'in', 'm']
wsj = sorted(set(nltk.corpus.treebank.words()))
[w for w in wsj if re.search('^[0-9]+\.[0-9]+$', w)][:5]
['0.0085', '0.05', '0.1', '0.16', '0.2']
[w for w in wsj if re.search('^[A-Z]+\$$', w)]
['C$', 'US$']
[w for w in wsj if re.search('^[0-9]{4}$', w)][:5]
['1614', '1637', '1787', '1901', '1903']
[w for w in wsj if re.search('^[0-9]+-[a-z]{3,5}$', w)][:5]
['10-day', '10-lap', '10-year', '100-share', '12-point']
[w for w in wsj if re.search('^[a-z]{5,}-[a-z]{2,3}-[a-z]{,6}$', w)]
['black-and-white','bread-and-butter','father-in-law','machine-gun-toting','savings-and-loan']
[w for w in wsj if re.search('(ed|ing)$', w)][:5]
['62%-owned', 'Absorbed', 'According', 'Adopting', 'Advanced']

表3-3. 正则表达式基本元字符,其中包括通配符,范围和闭包

操作符 行为
. 通配符,匹配任何单个字符
^abc 匹配以abc 开始的字符串
abc$ 匹配以abc 结尾的字符串
[abc] 匹配字符集合中的一个
[A-Z0-9] 匹配字符一个范围
ed|ing|s 匹配指定的一个字符串(析取)
* 前面的项目零个或多个,如a*, [a-z]* (也叫Kleene 闭包)
+ 前面的项目1 个或多个,如a+, [a-z]+
? 前面的项目零个或1 个(即:可选)如:a?, [a-z]?
{n} 重复n 次,n 为非负整数
{n,} 至少重复n 次
{,n} 重复不多于n 次
{m,n} 至少重复m 次不多于n 次
a(b|c)+ 括号表示操作符的范围

通过给字符串加一个前缀“r”来表明它是一个原始字符串。
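
下面用一个小例子对比普通字符串与原始字符串中反斜杠的差别(示例字符串为任意选取,仅作演示):

len('\band') #'\b'被解释为退格符,整个字符串只有4 个字符
4
len(r'\band') #原始字符串保留反斜杠,共5 个字符
5
re.findall(r'\band\b', 'ham and eggs') #在正则表达式中\b 表示词边界
['and']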

3.5 正则表达式的有益应用

提取字符块

word = 'supercalifragilisticexpialidocious'
re.findall(r'[aeiou]', word) #找出一个词中的元音
['u','e','a','i','a','i','i','i','e','i','a','i','o','i','o','u']
len(re.findall(r'[aeiou]', word)) #计数
16
wsj = sorted(set(nltk.corpus.treebank.words()))
fd = nltk.FreqDist(vs for word in wsj
                   for vs in re.findall(r'[aeiou]{2,}', word)) #文本中的两个或两个以上的元音序列,并确定它们的相对频率
fd.items()
dict_items([('ai', 261), ('eau', 10), ('iou', 27), ('iai', 1), ('io', 549), ('ue', 105), ('ei', 86), ('ea', 476), ('ee', 217), ('oei', 1), ('ie', 331), ('aii', 1), ('oe', 15), ('eu', 18), ('oui', 6), ('uu', 1), ('ioa', 1), ('uie', 3), ('au', 106), ('iu', 14), ('ou', 329), ('oi', 65), ('aa', 3), ('uou', 5), ('aiia', 1), ('oo', 174), ('ao', 6), ('ueui', 1), ('iao', 1), ('aia', 1), ('oa', 59), ('ui', 95), ('ia', 253), ('ieu', 3), ('eou', 5), ('uo', 8), ('uee', 4), ('eea', 1), ('eo', 39), ('ooi', 1), ('eei', 2), ('ae', 11), ('ua', 109)])

在字符块上做更多事情

正则表达式匹配词首元音序列,词尾元音序列和所有的辅音;其它的被忽略。这三个阶段从左到右处理,如果词匹配了三个部分之一,正则表达式后面的部分将被忽略。我们使用re.findall()提取所有匹配的词中的字符,然后使用''.join()将它们连接在一起。

regexp = r'^[AEIOUaeiou]+|[AEIOUaeiou]+$|[^AEIOUaeiou]'
def compress(word):
    pieces = re.findall(regexp, word)
    return ''.join(pieces)
english_udhr = nltk.corpus.udhr.words('English-Latin1')
print(nltk.tokenwrap(compress(w) for w in english_udhr[:75]))
Unvrsl Dclrtn of Hmn Rghts Prmble Whrs rcgntn of the inhrnt dgnty and
of the eql and inlnble rghts of all mmbrs of the hmn fmly is the fndtn
of frdm , jstce and pce in the wrld , Whrs dsrgrd and cntmpt fr hmn
rghts hve rsltd in brbrs acts whch hve outrgd the cnscnce of mnknd ,
and the advnt of a wrld in whch hmn bngs shll enjy frdm of spch and

提取所有辅音-元音序列,如ka 和si。因为每部分都是成对的,它可以被用来初始化一个条件频率分布,然后将每对的频率制成表格。

rotokas_words = nltk.corpus.toolbox.words('rotokas.dic')
cvs = [cv for w in rotokas_words for cv in re.findall(r'[ptksvr][aeiou]', w)]
cfd = nltk.ConditionalFreqDist(cvs)
cfd.tabulate()
    a   e   i   o   u
k 418 148  94 420 173
p  83  31 105  34  51
r 187  63  84  89  79
s   0   0 100   2   1
t  47   8   0 148  37
v  93  27 105  48  49

要检查表格中数字背后的词汇,有一个索引允许我们迅速找到包含一个给定的辅音-元音对的单词的列表将会有帮助。例如:cv_index['su']应该给我们所有含有“su”的词汇。

cv_word_pairs = [(cv, w) for w in rotokas_words
                 for cv in re.findall(r'[ptksvr][aeiou]', w)]
cv_index = nltk.Index(cv_word_pairs)
cv_index['su']
['kasuari']
cv_index['po'][:10]
['kaapo','kaapopato','kaipori','kaiporipie','kaiporivira','kapo','kapoa','kapokao','kapokapo','kapokapo']

查找词干

def stem(word):
    for suffix in ['ing', 'ly', 'ed', 'ious', 'ies', 'ive', 'es', 's', 'ment']:
        if word.endswith(suffix):
            return word[:-len(suffix)]
    return word
re.findall(r'^.*(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processing')
['ing']
re.findall(r'^.*(?:ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processing')
['processing']
re.findall(r'^(.*)(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processing')
[('process', 'ing')]
re.findall(r'^(.*)(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processes')
[('processe', 's')]
re.findall(r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processes')
[('process', 'es')]
re.findall(r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$', 'language')
[('language', '')]
def stem(word):
    regexp = r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$'
    stem, suffix = re.findall(regexp, word)[0]
    return stem
raw = """DENNIS: Listen, strange women lying in ponds distributing swords
is no basis for a system of government. Supreme executive power derives from
a mandate from the masses, not from some farcical aquatic ceremony."""
tokens = nltk.word_tokenize(raw)
[stem(t) for t in tokens][:10]
['DENNIS',':','Listen',',','strange','women','ly','in','pond','distribut']

搜索已分词文本

from nltk.corpus import gutenberg, nps_chat
moby = nltk.Text(gutenberg.words('melville-moby_dick.txt'))
moby.findall(r"<a> (<.*>) <man>")
monied; nervous; dangerous; white; white; white; pious; queer; good;
mature; white; Cape; great; wise; wise; butterless; white; fiendish;
pale; furious; better; certain; complete; dismasted; younger; brave;
brave; brave; brave
chat = nltk.Text(nps_chat.words())
chat.findall(r"<.*> <.*> <bro>")
you rule bro; telling you bro; u twizted bro
chat.findall(r"<l.*>{3,}")
lol lol lol; lmao lol lol; lol lol lol; la la la la la; la la la; la
la la; lovely lol lol love; lol lol lol.; la la la; la la la
from nltk.corpus import brown
hobbies_learned = nltk.Text(brown.words(categories=['hobbies', 'learned']))
hobbies_learned.findall(r"<\w*> <and> <other> <\w*s>")
speed and other activities; water and other liquids; tomb and other
landmarks; Statues and other monuments; pearls and other jewels;
charts and other items; roads and other features; figures and other
objects; military and other areas; demands and other factors;
abstracts and other compilations; iron and other metals

自动和人工处理相结合,是建造新语料库最常见的方式。

3.6 规范化文本

raw = """DENNIS: Listen, strange women lying in ponds distributing swords
is no basis for a system of government. Supreme executive power derives from
a mandate from the masses, not from some farcical aquatic ceremony."""
tokens = nltk.word_tokenize(raw)

词干提取器

NLTK 中包括了一些现成的词干提取器,如果需要词干提取器,应该优先使用它们,而不是用正则表达式自己制作。Porter 和Lancaster 词干提取器按照它们各自的规则剥离词缀。

porter = nltk.PorterStemmer()
lancaster = nltk.LancasterStemmer()
[porter.stem(t) for t in tokens][:4]
['denni', ':', 'listen', ',']
[lancaster.stem(t) for t in tokens][:4]
['den', ':', 'list', ',']

例3-1. 使用词干提取器索引文本。

class IndexedText(object):

    def __init__(self, stemmer, text):
        self._text = text
        self._stemmer = stemmer
        self._index = nltk.Index((self._stem(word), i)
                                 for (i, word) in enumerate(text))

    def concordance(self, word, width=40):
        key = self._stem(word)
        wc = int(width/4)  # words of context
        for i in self._index[key]:
            lcontext = ' '.join(self._text[i-wc:i])
            rcontext = ' '.join(self._text[i:i+wc])
            ldisplay = '{:>{width}}'.format(lcontext[-width:], width=width)
            rdisplay = '{:{width}}'.format(rcontext[:width], width=width)
            print(ldisplay, rdisplay)

    def _stem(self, word):
        return self._stemmer.stem(word).lower()
porter = nltk.PorterStemmer()
grail = nltk.corpus.webtext.words('grail.txt')
text = IndexedText(porter, grail)
text.concordance('lie')
r king ! DENNIS : Listen , strange women lying in ponds distributing swords is no
 beat a very brave retreat . ROBIN : All lies ! MINSTREL : [ singing ] Bravest of
       Nay . Nay . Come . Come . You may lie here . Oh , but you are wounded !
doctors immediately ! No , no , please ! Lie down . [ clap clap ] PIGLET : Well
ere is much danger , for beyond the cave lies the Gorge of Eternal Peril , which
   you . Oh ... TIM : To the north there lies a cave -- the cave of Caerbannog --
h it and lived ! Bones of full fifty men lie strewn about its lair . So , brave k
not stop our fight ' til each one of you lies dead , and the Holy Grail returns t

词形归并

WordNet 词形归并器只有在产生的词出现在它的词典中时才删除词缀。如果想编译一些文本的词汇,或者想要一个有效词条(或中心词)列表,WordNet 词形归并器是一个不错的选择。

wnl = nltk.WordNetLemmatizer()
[wnl.lemmatize(t) for t in tokens][:5]
['DENNIS', ':', 'Listen', ',', 'strange']

识别非标准词,包括数字、缩写、日期,并将任何此类标识符映射到一个特殊的词汇。例如:每一个十进制数可以被映射到一个单独的标识符0.0,每个首字母缩写词可以映射为AAA。这使词汇量变小,提高了许多语言建模任务的准确性。
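
下面按这个思路给出一个极简的示意(映射规则和函数名normalize_token 均为假设的示例,并非NLTK 提供的接口):

def normalize_token(token):
    if re.fullmatch(r'\d+(\.\d+)?', token): #十进制数映射为'0.0'
        return '0.0'
    elif re.fullmatch(r'[A-Z]{2,}', token): #全大写的首字母缩写映射为'AAA'
        return 'AAA'
    else:
        return token.lower() #其余词汇一律小写
[normalize_token(t) for t in ['NLTK', '3.14', 'Text', 'processing']]
['AAA', '0.0', 'text', 'processing']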

3.7 用正则表达式为文本分词

分词是将字符串切割成可识别的构成一块语言数据的语言单元。

分词的简单方法

raw = """'When I'M a Duchess,' she said to herself, (not in a very hopeful tone
though), 'I won't have any pepper in my kitchen AT ALL. Soup does very..."""
print(re.split(r' ', raw)) #在空格符处分割文本
["'When", "I'M", 'a', "Duchess,'", 'she', 'said', 'to', 'herself,', '(not', 'in', 'a', 'very', 'hopeful', 'tone\nthough),', "'I", "won't", 'have', 'any', 'pepper', 'in', 'my', 'kitchen', 'AT', 'ALL.', 'Soup', 'does', 'very...']
print(re.split(r'[ \t\n]+', raw)) #正则表达式«[ \t\n]+»匹配一个或多个空格、制表符(\t)或换行符(\n)
["'When", "I'M", 'a', "Duchess,'", 'she', 'said', 'to', 'herself,', '(not', 'in', 'a', 'very', 'hopeful', 'tone', 'though),', "'I", "won't", 'have', 'any', 'pepper', 'in', 'my', 'kitchen', 'AT', 'ALL.', 'Soup', 'does', 'very...']
print(re.split(r'\W+', raw)) #分割所有单词字符以外的输入
['', 'When', 'I', 'M', 'a', 'Duchess', 'she', 'said', 'to', 'herself', 'not', 'in', 'a', 'very', 'hopeful', 'tone', 'though', 'I', 'won', 't', 'have', 'any', 'pepper', 'in', 'my', 'kitchen', 'AT', 'ALL', 'Soup', 'does', 'very', '']
print(re.findall(r'\w+|\S\w*', raw)) #标点会与跟在后面的字母(如's)在一起,但两个或两个以上的标点字符序列会被分割。
["'When", 'I', "'M", 'a', 'Duchess', ',', "'", 'she', 'said', 'to', 'herself', ',', '(not', 'in', 'a', 'very', 'hopeful', 'tone', 'though', ')', ',', "'I", 'won', "'t", 'have', 'any', 'pepper', 'in', 'my', 'kitchen', 'AT', 'ALL', '.', 'Soup', 'does', 'very', '.', '.', '.']
print(re.findall(r"\w+(?:[-']\w+)*|'|[-.(]+|\S\w*", raw)) #来匹配引号字符让它们与它们包括的文字分开
["'", 'When', "I'M", 'a', 'Duchess', ',', "'", 'she', 'said', 'to', 'herself', ',', '(', 'not', 'in', 'a', 'very', 'hopeful', 'tone', 'though', ')', ',', "'", 'I', "won't", 'have', 'any', 'pepper', 'in', 'my', 'kitchen', 'AT', 'ALL', '.', 'Soup', 'does', 'very', '...']

表3-4. 正则表达式符号

符号 功能
\b 词边界(零宽度)
\d 任一十进制数字(相当于[0-9])
\D 任何非数字字符(等价于[^ 0-9])
\s 任何空白字符(相当于[ \t\n\r\f\v])
\S 任何非空白字符(相当于[^ \t\n\r\f\v])
\w 任何字母数字字符(相当于[a-zA-Z0-9_])
\W 任何非字母数字字符(相当于[^a-zA-Z0-9_])
\t 制表符
\n 换行符
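
下面用re.findall()和re.split()简单验证表3-4 中的几个符号(示例字符串为任意选取,仅作演示):

re.findall(r'\d+', 'NLTK 3.0 was released in 2014') #连续的数字
['3', '0', '2014']
re.findall(r'\w+', "can't") #字母数字字符序列
['can', 't']
re.split(r'\s+', 'one  two\tthree\nfour') #按任意空白字符分割
['one', 'two', 'three', 'four']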

NLTK 的正则表达式分词器

函数nltk.regexp_tokenize()与re.findall()类似。然而,nltk.regexp_tokenize()分词效率更高,且不需要特殊处理括号。

text = 'That U.S.A. poster-print costs $12.40...'
pattern = r'''(?x)           # set flag to allow verbose regexps
      (?:[A-Z]\.)+           # abbreviations, e.g. U.S.A.
    | \w+(?:-\w+)*           # words with optional internal hyphens
    | \$?\d+(?:\.\d+)?%?     # currency and percentages, e.g. $12.40, 82%
    | \.\.\.                 # ellipsis
    | [][.,;"'?():-_`]       # these are separate tokens; includes ], [
'''
#注意:这里使用非捕获组(?:...)。如果使用捕获组( ),re.findall()只会返回各分组的内容,得到一串空字符串元组。
nltk.regexp_tokenize(text, pattern)
['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']

分词的进一步问题

  • 分词是一个艰巨的任务。没有单一的解决方案行之有效,我们必须根据应用领域的需要决定。
  • 开发分词器时,访问已经手工标注好的原始文本是有益的,这可以让你的分词器的输出结果与高品质(或称“黄金标准”)的标注进行比较。
  • 缩写

3.8 分割

分词是一个更普遍的分割问题的一个实例

断句

在词级水平处理文本通常假定能够将文本划分成单个句子。

len(nltk.corpus.brown.words()) / len(nltk.corpus.brown.sents()) #计算布朗语料库中每个句子的平均词数
20.250994070456922

更多情况下,文本只是作为一个字符流获得的。在将文本分词之前,我们需要将它分割成句子。NLTK 提供了Punkt 句子分割器来完成这项工作。

import pprint
sent_tokenizer=nltk.data.load('tokenizers/punkt/english.pickle')
text = nltk.corpus.gutenberg.raw('chesterton-thursday.txt')
sents = sent_tokenizer.tokenize(text)
pprint.pprint(sents[79:89])
['"Nonsense!"','said Gregory, who was very rational when anyone else\nattempted paradox.','"Why do all the clerks and navvies in the\n''railway trains look so sad and tired, so very sad and tired?','I will\ntell you.','It is because they know that the train is going right.','It\n''is because they know that whatever place they have taken a ticket\n''for that place they will reach.','It is because after they have\n''passed Sloane Square they know that the next station must be\n''Victoria, and nothing but Victoria.','Oh, their wild rapture!','oh,\n''their eyes like stars and their souls again in Eden, if the next\n''station were unaccountably Baker Street!"','"It is you who are unpoetical," replied the poet Syme.']

分词

由于没有词边界,文本分词变得更加困难。例如:在中文中,三个字符的字符串:爱国人可以被分词为“爱国/人”,或者“爱/国人”。

例3-2. 从分词表示字符串seg1 和seg2 中重建文本分词。seg1 和seg2 表示假设的一些儿童讲话的初始和最终分词。

text = "doyouseethekittyseethedoggydoyoulikethekittylikethedoggy"
seg1 = "0000000000000001000000000010000000000000000100000000000"
seg2 = "0100100100100001001001000010100100010010000100010010000"
def segment(text, segs):
    words = []
    last = 0
    for i in range(len(segs)):
        if segs[i] == '1':
            words.append(text[last:i+1])
            last = i+1
    words.append(text[last:])
    return words
segment(text, seg1)
['doyouseethekitty', 'seethedoggy', 'doyoulikethekitty', 'likethedoggy']
segment(text, seg2)
['do','you','see','the','kitty','see','the','doggy','do','you','like','the','kitty','like','the','doggy']

例3-3. 计算存储词典和重构源文本的成本

计算目标函数:给定一个假设的源文本的分词(左),推导出一个词典和推导表,它能让源文本重构,然后合计每个词项(包括边界标志)与推导表的字符数,作为分词质量的得分;得分值越小表明分词越好。

def evaluate(text, segs):
    words = segment(text, segs)
    text_size = len(words)
    lexicon_size = sum(len(word) + 1 for word in set(words))
    return text_size + lexicon_size
text = "doyouseethekittyseethedoggydoyoulikethekittylikethedoggy"
seg1 = "0000000000000001000000000010000000000000000100000000000"
seg2 = "0100100100100001001001000010100100010010000100010010000"
seg3 = "0000100100000011001000000110000100010000001100010000001"
segment(text, seg3)
['doyou','see','thekitt','y','see','thedogg','y','doyou','like','thekitt','y','like','thedogg','y']
evaluate(text, seg3)
47
evaluate(text, seg2)
48
evaluate(text, seg1)
64

例3-4. 使用模拟退火算法的非确定性搜索

一开始仅搜索短语分词;随机扰动0 和1,它们与“温度”成比例;每次迭代温度都会降低,扰动边界会减少。

from random import randint
def flip(segs, pos):
    return segs[:pos] + str(1-int(segs[pos])) + segs[pos+1:]

def flip_n(segs, n):
    for i in range(n):
        segs = flip(segs, randint(0, len(segs)-1))
    return segs

def anneal(text, segs, iterations, cooling_rate):
    temperature = float(len(segs))
    while temperature > 0.5:
        best_segs, best = segs, evaluate(text, segs)
        for i in range(iterations):
            guess = flip_n(segs, round(temperature))
            score = evaluate(text, guess)
            if score < best:
                best, best_segs = score, guess
        score, segs = best, best_segs
        temperature = temperature / cooling_rate
        print(evaluate(text, segs), segment(text, segs))
    print()
    return segs
text = "doyouseethekittyseethedoggydoyoulikethekittylikethedoggy"
seg1 = "0000000000000001000000000010000000000000000100000000000"
anneal(text, seg1, 5000, 1.2)
64 ['doyouseethekitty', 'seethedoggy', 'doyoulikethekitty', 'likethedoggy']
64 ['doyouseethekitty', 'seethedoggy', 'doyoulikethekitty', 'likethedoggy']
64 ['doyouseethekitty', 'seethedoggy', 'doyoulikethekitty', 'likethedoggy']
64 ['doyouseethekitty', 'seethedoggy', 'doyoulikethekitty', 'likethedoggy']
64 ['doyouseethekitty', 'seethedoggy', 'doyoulikethekitty', 'likethedoggy']
64 ['doyouseethekitty', 'seethedoggy', 'doyoulikethekitty', 'likethedoggy']
62 ['doyou', 'seethek', 'ittys', 'eeth', 'edoggy', 'doyou', 'likethe', 'kitty', 'likethe', 'doggy']
62 ['doyou', 'seethek', 'ittys', 'eeth', 'edoggy', 'doyou', 'likethe', 'kitty', 'likethe', 'doggy']
62 ['doyou', 'seethek', 'ittys', 'eeth', 'edoggy', 'doyou', 'likethe', 'kitty', 'likethe', 'doggy']
62 ['doyou', 'seethek', 'ittys', 'eeth', 'edoggy', 'doyou', 'likethe', 'kitty', 'likethe', 'doggy']
62 ['doyou', 'seethek', 'ittys', 'eeth', 'edoggy', 'doyou', 'likethe', 'kitty', 'likethe', 'doggy']
59 ['doyou', 's', 'eeth', 'e', 'k', 'it', 'tys', 'eeth', 'e', 'doggy', 'doyou', 'likethe', 'kitty', 'likethe', 'doggy']
57 ['doyou', 's', 'eethe', 'kit', 'ty', 's', 'eethe', 'dogg', 'y', 'doyou', 'likethe', 'kitty', 'likethe', 'dogg', 'y']
55 ['doyous', 'eethe', 'kitty', 's', 'eethe', 'dogg', 'ydoyou', 'likethe', 'kitty', 'likethe', 'dogg', 'y']
49 ['doyou', 's', 'eethe', 'kitty', 's', 'eethe', 'dogg', 'y', 'doyou', 'likethe', 'kitty', 'likethe', 'dogg', 'y']
49 ['doyou', 's', 'eethe', 'kitty', 's', 'eethe', 'dogg', 'y', 'doyou', 'likethe', 'kitty', 'likethe', 'dogg', 'y']
49 ['doyou', 's', 'eethe', 'kitty', 's', 'eethe', 'dogg', 'y', 'doyou', 'likethe', 'kitty', 'likethe', 'dogg', 'y']
46 ['doyou', 'seethe', 'kitty', 'seethe', 'dogg', 'y', 'doyou', 'likethe', 'kitty', 'likethe', 'dogg', 'y']
43 ['doyou', 'seethe', 'kitty', 'seethe', 'doggy', 'doyou', 'likethe', 'kitty', 'likethe', 'doggy']
43 ['doyou', 'seethe', 'kitty', 'seethe', 'doggy', 'doyou', 'likethe', 'kitty', 'likethe', 'doggy']
43 ['doyou', 'seethe', 'kitty', 'seethe', 'doggy', 'doyou', 'likethe', 'kitty', 'likethe', 'doggy']
43 ['doyou', 'seethe', 'kitty', 'seethe', 'doggy', 'doyou', 'likethe', 'kitty', 'likethe', 'doggy']
43 ['doyou', 'seethe', 'kitty', 'seethe', 'doggy', 'doyou', 'likethe', 'kitty', 'likethe', 'doggy']
43 ['doyou', 'seethe', 'kitty', 'seethe', 'doggy', 'doyou', 'likethe', 'kitty', 'likethe', 'doggy']
43 ['doyou', 'seethe', 'kitty', 'seethe', 'doggy', 'doyou', 'likethe', 'kitty', 'likethe', 'doggy']
43 ['doyou', 'seethe', 'kitty', 'seethe', 'doggy', 'doyou', 'likethe', 'kitty', 'likethe', 'doggy']
'0000100000100001000001000010000100000010000100000010000'

有了足够的数据,就可能以一个合理的准确度自动将文本分割成词汇。这种方法可用于为那些词的边界没有任何视觉表示的书写系统分词。

3.9 格式化:从链表到字符串

从链表到字符串

silly = ['We', 'called', 'him', 'Tortoise', 'because', 'he', 'taught', 'us', '.']

' '.join(silly)的意思是:取出silly 中的所有项目,将它们连接成一个大的字符串,使用' '作为词项之间的分隔符。

' '.join(silly)
'We called him Tortoise because he taught us .'
';'.join(silly)
'We;called;him;Tortoise;because;he;taught;us;.'
''.join(silly)
'WecalledhimTortoisebecausehetaughtus.'

字符串与格式

word = 'cat'
sentence = """helloworld"""
print(word)
cat
print(sentence)
hello
    world
word
'cat'
sentence
'hello\n    world'
fdist = nltk.FreqDist(['dog', 'cat', 'dog', 'cat', 'dog', 'snake', 'dog', 'cat'])
for word in fdist:
    print(word, '->', fdist[word], end='; ')
dog -> 4; cat -> 3; snake -> 1;
for word in sorted(fdist):
    print('{}->{};'.format(word, fdist[word]), end=' ')  #字符串格式化表达式
cat->3; dog->4; snake->1;
'{}->{};'.format ('cat', 3)
'cat->3;'
'{}->'.format('cat')
'cat->'
'{}'.format(3)
'3'
'I want a {} right now'.format('coffee')
'I want a coffee right now'
'{} wants a {} {}'.format ('Lee', 'sandwich', 'for lunch')
'Lee wants a sandwich for lunch'
'{} wants a {} {}'.format ('sandwich', 'for lunch')
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-208-209f2a88ac01> in <module>()
----> 1 '{} wants a {} {}'.format ('sandwich', 'for lunch')
IndexError: tuple index out of range
'{} wants a {}'.format ('Lee', 'sandwich', 'for lunch')
'Lee wants a sandwich'
'from {1} to {0}'.format('A', 'B')
'from B to A'
template = 'Lee wants a {} right now'
menu = ['sandwich', 'spam fritter', 'pancake']
for snack in menu:
    print(template.format(snack))
Lee wants a sandwich right now
Lee wants a spam fritter right now
Lee wants a pancake right now

排列

'%6s' % 'dog'
'   dog'
'%-6s' % 'dog'
'dog   '
width = 6
'%-*s' % (width, 'dog')
'dog   '
count, total = 3205, 9375
"accuracy for %d words: %2.4f%%" % (total, 100 * count / total)
'accuracy for 9375 words: 34.1867%'

例3-5. 布朗语料库的不同部分的频率模型

def tabulate(cfdist, words, categories):
    print('{:16}'.format('Category'), end=' ')                      # column headings
    for word in words:
        print('{:>6}'.format(word), end=' ')
    print()
    for category in categories:
        print('{:16}'.format(category), end=' ')                    # row heading
        for word in words:                                          # for each word
            print('{:6}'.format(cfdist[category][word]), end=' ')   # print table cell
        print()                                                     # end the row
from nltk.corpus import brown
cfd = nltk.ConditionalFreqDist(
    (genre, word)
    for genre in brown.categories()
    for word in brown.words(categories=genre))
genres = ['news', 'religion', 'hobbies', 'science_fiction', 'romance', 'humor']
modals = ['can', 'could', 'may', 'might', 'must', 'will']
tabulate(cfd, modals, genres)
Category            can  could    may  might   must   will
news                 93     86     66     38     50    389
religion             82     59     78     12     54     71
hobbies             268     58    131     22     83    264
science_fiction      16     49      4     12      8     16
romance              74    193     11     51     45     43
humor                16     30      8      8      9     13

将结果写入文件

output_file = open('output.txt', 'w')
words = set(nltk.corpus.genesis.words('english-kjv.txt'))
for word in sorted(words):
    output_file.write(word + "\n")
len(words)
2789
str(len(words))
'2789'
output_file.write(str(len(words)) + "\n")
5
output_file.close()

文本换行

saying = ['After', 'all', 'is', 'said', 'and', 'done', ',',
'more', 'is', 'said', 'than', 'done', '.']
for word in saying:
    print(word, '(' + str(len(word)) + '),', end=' ')
After (5), all (3), is (2), said (4), and (3), done (4), , (1), more (4), is (2), said (4), than (4), done (4), . (1),
from textwrap import fill
format = '%s (%d),'
pieces = [format % (word, len(word)) for word in saying]
output = ' '.join(pieces)
wrapped = fill(output)
print(wrapped)
After (5), all (3), is (2), said (4), and (3), done (4), , (1), more
(4), is (2), said (4), than (4), done (4), . (1),

3.10 小结

  • 将文本作为一个词链表。“原始文本”是一个潜在的长字符串,其中包含文字和用于设置格式的空白字符,也是我们通常存储和可视化文本的原料。
  • 在Python 中指定一个字符串使用单引号或双引号:‘Monty Python’,“Monty Python”。
  • 字符串中的字符是使用索引来访问的,索引从零计数:‘Monty Python’[0]的值是M。求字符串的长度可以使用len()。
  • 子字符串使用切片符号来访问: ‘Monty Python’[1:5]的值是onty。如果省略起始索引,子字符串从字符串的开始处开始;如果省略结尾索引,切片会一直到字符串的结尾处结束。
  • 字符串可以被分割成链表:‘Monty Python’.split()得到[‘Monty’, ‘Python’]。链表可以连接成字符串:’/’.join([‘Monty’, ‘Python’])得到’Monty/Python’。
  • 我们可以使用text = open(f).read()从一个文件f 读取文本。可以使用text = urlopen(u).read()从一个URL u 读取文本。我们可以使用for line in open(f)遍历一个文本文件的每一行。
  • 在网上找到的文本可能包含不需要的内容(如页眉、页脚和标记),在我们做任何语言处理之前需要去除它们。
  • 分词是将文本分割成基本单位或标记,例如词和标点符号等。基于空格符的分词对于许多应用程序都是不够的,因为它会捆绑标点符号和词。NLTK 提供了一个现成的分词器nltk.word_tokenize()。
  • 词形归并是一个过程,将一个词的各种形式(如:appeared,appears)映射到这个词标准的或引用的形式,也称为词位或词元(如:appear)。
  • 正则表达式是用来指定模式的一种强大而灵活的方法。只要导入了re 模块,我们就可以使用re.findall()来找到一个字符串中匹配一个模式的所有子字符串。
  • 如果一个正则表达式字符串包含一个反斜杠,你应该使用原始字符串与一个r 前缀:r’regexp’,告诉Python 不要预处理这个字符串。
  • 当某些字符前使用了反斜杠时,例如:\n,处理时会有特殊的含义(换行符);然而,当反斜杠用于正则表达式通配符和操作符时,如:.,|,$,这些字符失去其特殊的含义,只按字面表示匹配。
  • 一个字符串格式化表达式template % arg_tuple 包含一个格式字符串template,它由如%-6s 和%0.2d 这样的转换标识符组成。

致谢
《Python自然语言处理》[1][2][3][4],作者:Steven Bird, Ewan Klein & Edward Loper,是实践性很强的一部入门读物,2009年第一版,2015年第二版,本学习笔记结合上述版本,对部分内容进行了延伸学习、练习,在此分享,期待对大家有所帮助,欢迎加我微信(验证:NLP),一起学习讨论,不足之处,欢迎指正。

参考文献


  1. http://nltk.org/

  2. Steven Bird, Ewan Klein & Edward Loper,Natural Language Processing with Python,2009

  3. (英)伯德,(英)克莱因,(美)洛普,《Python自然语言处理》,2010年,东南大学出版社

  4. Steven Bird, Ewan Klein & Edward Loper,Natural Language Processing with Python,2015
