3.7 Regular Expressions for Tokenizing Text

Tokenization is the task of cutting a string into identifiable linguistic units that constitute a piece of language data. Although it is a fundamental task, we have been able to delay it until now because many corpora are already tokenized, and because NLTK includes some tokenizers. Now that you are familiar with regular expressions, you can learn how to use them to tokenize text, and to have much more control over the process.

Simple Approaches to Tokenization

The very simplest method for tokenizing text is to split on whitespace. Consider the following text from Alice’s Adventures in Wonderland:

>>> raw = """'When I'M a Duchess,' she said to herself, (not in a very hopeful tone
... though), 'I won't have any pepper in my kitchen AT ALL. Soup does very
... well without--Maybe it's always pepper that makes people hot-tempered,'..."""

We could split this raw text on whitespace using raw.split(). To do the same using a regular expression, it is not enough to match any space characters in the string, since this results in tokens that contain a \n newline character; instead, we need to match any number of spaces, tabs, or newlines:

>>> import re
>>> re.split(r' ', raw) ①
["'When", "I'M", 'a', "Duchess,'", 'she', 'said', 'to', 'herself,', '(not', 'in',
'a', 'very', 'hopeful', 'tone\nthough),', "'I", "won't", 'have', 'any', 'pepper',
'in', 'my', 'kitchen', 'AT', 'ALL.', 'Soup', 'does', 'very\nwell', 'without--Maybe',
"it's", 'always', 'pepper', 'that', 'makes', 'people', "hot-tempered,'..."]
>>> re.split(r'[ \t\n]+', raw) ②
["'When", "I'M", 'a', "Duchess,'", 'she', 'said', 'to', 'herself,', '(not', 'in',
'a', 'very', 'hopeful', 'tone', 'though),', "'I", "won't", 'have', 'any', 'pepper',
'in', 'my', 'kitchen', 'AT', 'ALL.', 'Soup', 'does', 'very', 'well', 'without--Maybe',
"it's", 'always', 'pepper', 'that', 'makes', 'people', "hot-tempered,'..."]

The regular expression «[ \t\n]+» matches one or more spaces, tabs (\t), or newlines (\n). Other whitespace characters, such as carriage return and form feed, should really be included too. Instead, we will use a built-in re abbreviation, \s, which means any whitespace character. The second statement in the preceding example can be rewritten as re.split(r'\s+', raw).
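
As a quick check (an added sketch, not part of the original text), the two forms should produce identical token lists for this passage, since it contains only spaces and newlines:

>>> re.split(r'\s+', raw) == re.split(r'[ \t\n]+', raw)
True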

Important: Remember to prefix regular expressions with the letter r (meaning “raw”), which instructs the Python interpreter to treat the string literally, rather than processing any backslashed characters it contains.
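
For instance (an illustrative check, not from the original), the raw form keeps the backslash as a literal character rather than interpreting an escape sequence:

>>> len('\n'), len(r'\n')   # escape sequence is one character; raw string is two
(1, 2)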

Splitting on whitespace gives us tokens like '(not' and 'herself,'. An alternative is to use the fact that Python provides us with a character class \w for word characters, equivalent to [a-zA-Z0-9_]. It also defines the complement of this class, \W, i.e., all characters other than letters, digits, or underscore. We can use \W in a simple regular expression to split the input on anything other than a word character:

>>> re.split(r'\W+', raw)
['', 'When', 'I', 'M', 'a', 'Duchess', 'she', 'said', 'to', 'herself', 'not', 'in',
'a', 'very', 'hopeful', 'tone', 'though', 'I', 'won', 't', 'have', 'any', 'pepper',
'in', 'my', 'kitchen', 'AT', 'ALL', 'Soup', 'does', 'very', 'well', 'without',
'Maybe', 'it', 's', 'always', 'pepper', 'that', 'makes', 'people', 'hot',
'tempered', '']

Observe that this gives us empty strings at the start and the end (to understand why, try doing 'xx'.split('x')). With re.findall(r'\w+', raw), we get the same tokens, but without the empty strings, using a pattern that matches the words instead of the spaces. Now that we’re matching the words, we’re in a position to extend the regular expression to cover a wider range of cases. The regular expression «\w+|\S\w*» will first try to match any sequence of word characters. If no match is found, it will try to match any non-whitespace character (\S is the complement of \s) followed by further word characters. This means that punctuation is grouped with any following letters (e.g., ’s) but that sequences of two or more punctuation characters are separated.
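
Before turning to that extended pattern (its example appears below), here is a quick added check, not in the original text, that re.findall(r'\w+', raw) really does give the split tokens minus the empty strings:

>>> re.findall(r'\w+', raw) == [t for t in re.split(r'\W+', raw) if t]
True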

>>> re.findall(r'\w+|\S\w*', raw)
["'When", 'I', "'M", 'a', 'Duchess', ',', "'", 'she', 'said', 'to', 'herself', ',',
'(not', 'in', 'a', 'very', 'hopeful', 'tone', 'though', ')', ',', "'I", 'won', "'t",
'have', 'any', 'pepper', 'in', 'my', 'kitchen', 'AT', 'ALL', '.', 'Soup', 'does',
'very', 'well', 'without', '-', '-Maybe', 'it', "'s", 'always', 'pepper', 'that',
'makes', 'people', 'hot', '-tempered', ',', "'", '.', '.', '.']

Let's generalize the \w+ in the preceding expression to permit word-internal hyphens and apostrophes (apostrophes mark omitted letters, as in it's, o', o'er, can't, and '99): «\w+([-']\w+)*». This expression means \w+ followed by zero or more instances of [-']\w+; it would match hot-tempered and it's. (We need to include ?: in this expression for reasons discussed earlier.) We'll also add a pattern to match quote characters so these are kept separate from the text they enclose.

>>> print(re.findall(r"\w+(?:[-']\w+)*|'|[-.(]+|\S\w*", raw))
["'", 'When', "I'M", 'a', 'Duchess', ',', "'", 'she', 'said', 'to', 'herself', ',',
'(', 'not', 'in', 'a', 'very', 'hopeful', 'tone', 'though', ')', ',', "'", 'I',
"won't", 'have', 'any', 'pepper', 'in', 'my', 'kitchen', 'AT', 'ALL', '.', 'Soup',
'does', 'very', 'well', 'without', '--', 'Maybe', "it's", 'always', 'pepper', 'that',
'makes', 'people', 'hot-tempered', ',', "'", '...']

The expression in this example also included «[-.(]+», which causes the double hyphen, ellipsis, and open parenthesis to be tokenized separately.

Table 3-4 lists the regular expression character class symbols we have seen in this section, in addition to some other useful symbols.

Table 3-4. Regular expression symbols

Symbol   Function
\b       Word boundary (zero width)
\d       Any decimal digit (equivalent to [0-9])
\D       Any non-digit character (equivalent to [^0-9])
\s       Any whitespace character (equivalent to [ \t\n\r\f\v])
\S       Any non-whitespace character (equivalent to [^ \t\n\r\f\v])
\w       Any alphanumeric character (equivalent to [a-zA-Z0-9_])
\W       Any non-alphanumeric character (equivalent to [^a-zA-Z0-9_])
\t       The tab character
\n       The newline character

NLTK’s Regular Expression Tokenizer

The function nltk.regexp_tokenize() is similar to re.findall() (as we’ve been using it for tokenization). However, nltk.regexp_tokenize() is more efficient for this task, and avoids the need for special treatment of parentheses. For readability we break up the regular expression over several lines and add a comment about each line. The special (?x) “verbose flag” tells Python to strip out the embedded whitespace and comments.

>>> import nltk
>>> text = 'That U.S.A. poster-print costs $12.40...'
>>> pattern = r'''(?x)          # set flag to allow verbose regexps
...     ([A-Z]\.)+              # abbreviations, e.g. U.S.A.
...   | \w+(-\w+)*              # words with optional internal hyphens
...   | \$?\d+(\.\d+)?%?        # currency and percentages, e.g. $12.40, 82%
...   | \.\.\.                  # ellipsis
...   | [][.,;"'?():-_`]        # these are separate tokens
... '''
>>> nltk.regexp_tokenize(text, pattern)
['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']

When using the verbose flag, you can no longer use ' ' to match a space character; use \s instead. The regexp_tokenize() function has an optional gaps parameter. When set to True, the regular expression specifies the gaps between tokens, as with re.split().
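
As an added illustration (not in the original text) of the gaps parameter, splitting the example string above on whitespace gaps would look like this:

>>> nltk.regexp_tokenize(text, r'\s+', gaps=True)
['That', 'U.S.A.', 'poster-print', 'costs', '$12.40...']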

We can evaluate a tokenizer by comparing the resulting tokens with a wordlist, and then report any tokens that don't appear in the wordlist, using set(tokens).difference(wordlist). You'll probably want to lowercase all the tokens first. (A handy way to evaluate a tokenizer!)
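
A rough sketch of that check (an addition; it uses NLTK's words corpus as the wordlist, and the variable names are only illustrative):

>>> wordlist = set(w.lower() for w in nltk.corpus.words.words())
>>> tokens = nltk.regexp_tokenize(text, pattern)
>>> unknown = set(t.lower() for t in tokens).difference(wordlist)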

Further Issues with Tokenization

Tokenization turns out to be a far more difficult task than you might have expected. No single solution works well across the board, and we must decide what counts as a token depending on the application domain.

When developing a tokenizer it helps to have access to raw text which has been manually tokenized, in order to compare the output of your tokenizer with high-quality (or "gold-standard") tokens. The NLTK corpus collection includes a sample of Penn Treebank data, including the raw Wall Street Journal text (nltk.corpus.treebank_raw.raw()) and the tokenized version (nltk.corpus.treebank.words()).
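
For example (an added sketch, not from the original), one could run the verbose-pattern tokenizer from the previous section over the raw Wall Street Journal text and collect the tokens that are absent from the gold-standard version:

>>> wsj_raw = nltk.corpus.treebank_raw.raw()
>>> gold = set(nltk.corpus.treebank.words())
>>> mismatches = set(nltk.regexp_tokenize(wsj_raw, pattern)).difference(gold)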

A final issue for tokenization is the presence of contractions, such as didn't. If we are analyzing the meaning of a sentence, it would probably be more useful to normalize this form to two separate forms: did and n't (or not). We can do this work with the help of a lookup table.
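
A minimal sketch of such a lookup table (the mapping shown is only illustrative, not a complete list of English contractions):

>>> CONTRACTIONS = {"didn't": ['did', "n't"], "won't": ['wo', "n't"], "it's": ['it', "'s"]}
>>> [part for tok in "I didn't know".split() for part in CONTRACTIONS.get(tok, [tok])]
['I', 'did', "n't", 'know']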
