Python Natural Language Processing Study Notes (23): 3.7 Tokenizing Text with Regular Expressions

3.7 Regular Expressions for Tokenizing Text
Tokenization is the task of cutting a string into identifiable linguistic units that constitute a piece of language data. Although it is a fundamental task, we have been able to delay it until now because many corpora are already tokenized, and because NLTK includes some tokenizers. Now that you are familiar with regular expressions, you can learn how to use them to tokenize text, and to have much more control over the process.
Simple Approaches to Tokenization
The very simplest method for tokenizing text is to split on whitespace. Consider the following text from Alice’s Adventures in Wonderland:
>>> raw = """'When I'M a Duchess,' she said to herself, (not in a very hopeful tone
... though), 'I won't have any pepper in my kitchen AT ALL. Soup does very
... well without--Maybe it's always pepper that makes people hot-tempered,'..."""

We could split this raw text on whitespace using raw.split(). To do the same using a regular expression, it is not enough to match any space characters in the string ①, since this results in tokens that contain a \n newline character; instead, we need to match any number of spaces, tabs, or newlines ②:

>>> re.split(r' ', raw) ①
["'When", "I'M", 'a', "Duchess,'", 'she', 'said', 'to', 'herself,', '(not', 'in',
'a', 'very', 'hopeful', 'tone\nthough),', "'I", "won't", 'have', 'any', 'pepper',
'in', 'my', 'kitchen', 'AT', 'ALL.', 'Soup', 'does', 'very\nwell', 'without--Maybe',
"it's", 'always', 'pepper', 'that', 'makes', 'people', "hot-tempered,'..."]
>>> re.split(r'[ \t\n]+', raw) ②
["'When", "I'M", 'a', "Duchess,'", 'she', 'said', 'to', 'herself,', '(not', 'in',
'a', 'very', 'hopeful', 'tone', 'though),', "'I", "won't", 'have', 'any', 'pepper',
'in', 'my', 'kitchen', 'AT', 'ALL.', 'Soup', 'does', 'very', 'well', 'without--Maybe',
"it's", 'always', 'pepper', 'that', 'makes', 'people', "hot-tempered,'..."]

The regular expression «[ \t\n]+» matches one or more spaces, tabs (\t), or newlines (\n). Other whitespace characters, such as carriage return and form feed, should really be included too. Instead, we will use a built-in re abbreviation, \s, which matches any whitespace character. The second statement in the preceding example can be rewritten as re.split(r'\s+', raw).
Important: Remember to prefix regular expressions with the letter r (meaning “raw”), which instructs the Python interpreter to treat the string literally, rather than processing any backslashed characters it contains.
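A minimal illustration of why the prefix matters (my own example): in an ordinary string literal, Python converts \b to a backspace character before the regular expression engine ever sees it, so the word-boundary assertion is lost; \t and \n happen to denote the same characters either way, which is why the earlier example worked without r:

>>> re.findall('\bAT\b', raw)    # '\b' is interpreted as a backspace character
[]
>>> re.findall(r'\bAT\b', raw)   # the raw string preserves \b as a word boundary
['AT']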
Splitting on whitespace gives us tokens like '(not' and 'herself,'. An alternative is to use the fact that Python provides us with a character class \w for word characters, equivalent to [a-zA-Z0-9_]. It also defines the complement of this class, \W, i.e., all characters other than letters, digits, or underscore. We can use \W in a simple regular expression to split the input on anything other than a word character:
>>> re.split(r'\W+', raw)
['', 'When', 'I', 'M', 'a', 'Duchess', 'she', 'said', 'to', 'herself', 'not', 'in',
'a', 'very', 'hopeful', 'tone', 'though', 'I', 'won', 't', 'have', 'any', 'pepper',
'in', 'my', 'kitchen', 'AT', 'ALL', 'Soup', 'does', 'very', 'well', 'without',
'Maybe', 'it', 's', 'always', 'pepper', 'that', 'makes', 'people', 'hot', 'tempered', '']

Observe that this gives us empty strings at the start and the end (to understand why, try doing 'xx'.split('x')). With re.findall(r'\w+', raw), we get the same tokens, but without the empty strings, using a pattern that matches the words instead of the spaces. Now that we're matching the words, we're in a position to extend the regular expression to cover a wider range of cases. The regular expression «\w+|\S\w*» will first try to match any sequence of word characters. If no match is found, it will try to match any non-whitespace character (\S is the complement of \s) followed by further word characters. This means that punctuation is grouped with any following letters (e.g., 's) but that sequences of two or more punctuation characters are separated:
>>> re.findall(r'\w+|\S\w*', raw)
["'When", 'I', "'M", 'a', 'Duchess', ',', "'", 'she', 'said', 'to', 'herself', ',',
'(not', 'in', 'a', 'very', 'hopeful', 'tone', 'though', ')', ',', "'I", 'won', "'t",
'have', 'any', 'pepper', 'in', 'my', 'kitchen', 'AT', 'ALL', '.', 'Soup', 'does',
'very', 'well', 'without', '-', '-Maybe', 'it', "'s", 'always', 'pepper', 'that',
'makes', 'people', 'hot', '-tempered', ',', "'", '.', '.', '.']

Let's generalize the \w+ in the preceding expression to permit word-internal hyphens and apostrophes (an apostrophe marks the omission of one or more letters, as in it is → it's, of → o', over → o'er, cannot → can't, 1999 → '99): «\w+([-']\w+)*». This expression means \w+ followed by zero or more instances of [-']\w+; it would match hot-tempered and it's. (We need to include ?: in this expression for reasons discussed earlier.) We'll also add a pattern to match quote characters so these are kept separate from the text they enclose:
>>> re.findall(r"\w+(?:[-']\w+)*|'|[-.(]+|\S\w*", raw)
["'", 'When', "I'M", 'a', 'Duchess', ',', "'", 'she', 'said', 'to', 'herself', ',',
'(', 'not', 'in', 'a', 'very', 'hopeful', 'tone', 'though', ')', ',', "'", 'I',
"won't", 'have', 'any', 'pepper', 'in', 'my', 'kitchen', 'AT', 'ALL', '.', 'Soup',
'does', 'very', 'well', 'without', '--', 'Maybe', "it's", 'always', 'pepper', 'that',
'makes', 'people', 'hot-tempered', ',', "'", '...']

The expression in this example also included «[-.(]+», which causes the double hyphen, ellipsis, and open parenthesis to be tokenized separately.
Table 3-4 lists the regular expression character class symbols we have seen in this section, in addition to some other useful symbols.
Table 3-4. Regular expression symbols
Symbol   Function
\b       Word boundary (zero width)
\d       Any decimal digit (equivalent to [0-9])
\D       Any non-digit character (equivalent to [^0-9])
\s       Any whitespace character (equivalent to [ \t\n\r\f\v])
\S       Any non-whitespace character (equivalent to [^ \t\n\r\f\v])
\w       Any alphanumeric character (equivalent to [a-zA-Z0-9_])
\W       Any non-alphanumeric character (equivalent to [^a-zA-Z0-9_])
\t       The tab character
\n       The newline character
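A couple of these symbols in action (quick illustrative checks of my own):

>>> re.findall(r'\d+', 'In 1998 she paid $12.40')
['1998', '12', '40']
>>> re.findall(r'\w+', 'hot-tempered')
['hot', 'tempered']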
NLTK's Regular Expression Tokenizer
The function nltk.regexp_tokenize() is similar to re.findall() (as we’ve been using it for tokenization). However, nltk.regexp_tokenize() is more efficient for this task, and avoids the need for special treatment of parentheses. For readability we break up the regular expression over several lines and add a comment about each line. The special (?x) “verbose flag” tells Python to strip out the embedded whitespace and comments.
>>> text = 'That U.S.A. poster-print costs $12.40...'
>>> pattern = r'''(?x)    # set flag to allow verbose regexps
...     ([A-Z]\.)+        # abbreviations, e.g. U.S.A.
...   | \w+(-\w+)*        # words with optional internal hyphens
...   | \$?\d+(\.\d+)?%?  # currency and percentages, e.g. $12.40, 82%
...   | \.\.\.            # ellipsis
...   | [][.,;"'?():-_`]  # these are separate tokens
... '''
>>> nltk.regexp_tokenize(text, pattern)
['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']

When using the verbose flag, you can no longer use ' ' to match a space character; use \s instead. The regexp_tokenize() function has an optional gaps parameter. When set to True, the regular expression specifies the gaps between tokens, as with re.split().
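For instance (a small sketch reusing the text variable above), passing a whitespace pattern with gaps=True behaves like re.split():

>>> nltk.regexp_tokenize(text, r'\s+', gaps=True)
['That', 'U.S.A.', 'poster-print', 'costs', '$12.40...']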
We can evaluate a tokenizer by comparing the resulting tokens with a wordlist, and then report any tokens that don't appear in the wordlist, using set(tokens).difference(wordlist). You'll probably want to lowercase all the tokens first. (A handy way to evaluate a tokenizer!)
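A minimal sketch of this evaluation idea, assuming NLTK's Words Corpus as the reference wordlist:

>>> wordlist = set(w.lower() for w in nltk.corpus.words.words())
>>> tokens = nltk.regexp_tokenize(raw, r'\w+')
>>> unknown = set(t.lower() for t in tokens).difference(wordlist)

Stray fragments such as the t split off from won't would likely surface in unknown, flagging places where the tokenizer mishandles contractions.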
Further Issues with Tokenization
Tokenization turns out to be a far more difficult task than you might have expected. No single solution works well across the board, and we must decide what counts as a token depending on the application domain.
When developing a tokenizer it helps to have access to raw text which has been manually tokenized, in order to compare the output of your tokenizer with high-quality (or "gold-standard") tokens. The NLTK corpus collection includes a sample of Penn Treebank data, including the raw Wall Street Journal text (nltk.corpus.treebank_raw.raw()) and the tokenized version (nltk.corpus.treebank.words()).
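As a rough sketch of such a comparison (set-based rather than a proper alignment, so it only hints at where a naive tokenizer diverges from the gold standard):

>>> gold = set(nltk.corpus.treebank.words())
>>> naive = set(nltk.corpus.treebank_raw.raw().split())
>>> mismatches = naive.difference(gold)

Most of the mismatches will likely be punctuation left attached to words, which whitespace splitting cannot separate.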
A final issue for tokenization is the presence of contractions, such as didn't. If we are analyzing the meaning of a sentence, it would probably be more useful to normalize this form to two separate forms: did and n't (or not). We can do this work with the help of a lookup table.
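A minimal sketch of such a lookup table (the entries follow the Penn Treebank convention for contractions; the table and the expand() helper are my own illustration):

>>> CONTRACTIONS = {"didn't": ['did', "n't"], "won't": ['wo', "n't"], "it's": ['it', "'s"]}
>>> def expand(tokens):
...     return [part for t in tokens for part in CONTRACTIONS.get(t.lower(), [t])]
...
>>> expand(['I', "didn't", 'know'])
['I', 'did', "n't", 'know']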