3.7 Regular Expressions for Tokenizing Text

Tokenization is the task of cutting a string into identifiable linguistic units that constitute a piece of language data. Although it is a fundamental task, we have been able to delay it until now because many corpora are already tokenized, and because NLTK includes some tokenizers. Now that you are familiar with regular expressions, you can learn how to use them to tokenize text, and to have much more control over the process.

Simple Approaches to Tokenization

The very simplest method for tokenizing text is to split on whitespace. Consider the following text from Alice’s Adventures in Wonderland:

>>> raw = """'When I'M a Duchess,' she said to herself, (not in a very hopeful tone
... though), 'I won't have any pepper in my kitchen AT ALL. Soup does very
... well without--Maybe it's always pepper that makes people hot-tempered,'..."""

We could split this raw text on whitespace using raw.split(). To do the same using a regular expression, it is not enough to match any space characters in the string, since this results in tokens that contain a \n newline character; instead, we need to match any number of spaces, tabs, or newlines:

>>> import re
>>> re.split(r' ', raw) ①
["'When", "I'M", 'a', "Duchess,'", 'she', 'said', 'to', 'herself,', '(not', 'in',
'a', 'very', 'hopeful', 'tone\nthough),', "'I", "won't", 'have', 'any', 'pepper',
'in', 'my', 'kitchen', 'AT', 'ALL.', 'Soup', 'does', 'very\nwell', 'without--Maybe',
"it's", 'always', 'pepper', 'that', 'makes', 'people', "hot-tempered,'..."]
>>> re.split(r'[ \t\n]+', raw) ②
["'When", "I'M", 'a', "Duchess,'", 'she', 'said', 'to', 'herself,', '(not', 'in',
'a', 'very', 'hopeful', 'tone', 'though),', "'I", "won't", 'have', 'any', 'pepper',
'in', 'my', 'kitchen', 'AT', 'ALL.', 'Soup', 'does', 'very', 'well', 'without--Maybe',
"it's", 'always', 'pepper', 'that', 'makes', 'people', "hot-tempered,'..."]

The regular expression «[ \t\n]+» matches one or more spaces, tabs (\t), or newlines (\n). Other whitespace characters, such as carriage return and form feed, should really be included too. Instead, we will use a built-in re abbreviation, \s, which means any whitespace character. The second statement in the preceding example can be rewritten as re.split(r'\s+', raw).
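
As a quick check (an added sketch, not part of the original text), the two forms should produce identical token lists for this passage, since it contains only spaces and newlines:

>>> re.split(r'\s+', raw) == re.split(r'[ \t\n]+', raw)
True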

Important: Remember to prefix regular expressions with the letter r (meaning “raw”), which instructs the Python interpreter to treat the string literally, rather than processing any backslashed characters it contains.
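
For instance (an illustrative check, not from the original), the raw form keeps the backslash as a literal character rather than interpreting an escape sequence:

>>> len('\n'), len(r'\n')   # escape sequence is one character; raw string is two
(1, 2)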

Splitting on whitespace gives us tokens like '(not' and 'herself,'. An alternative is to use the fact that Python provides us with a character class \w for word characters, equivalent to [a-zA-Z0-9_]. It also defines the complement of this class, \W, i.e., all characters other than letters, digits, or underscore. We can use \W in a simple regular expression to split the input on anything other than a word character:

>>> re.split(r'\W+', raw)
['', 'When', 'I', 'M', 'a', 'Duchess', 'she', 'said', 'to', 'herself', 'not', 'in',
'a', 'very', 'hopeful', 'tone', 'though', 'I', 'won', 't', 'have', 'any', 'pepper',
'in', 'my', 'kitchen', 'AT', 'ALL', 'Soup', 'does', 'very', 'well', 'without',
'Maybe', 'it', 's', 'always', 'pepper', 'that', 'makes', 'people', 'hot',
'tempered', '']

Observe that this gives us empty strings at the start and the end (to understand why, try doing 'xx'.split('x')). With re.findall(r'\w+', raw), we get the same tokens, but without the empty strings, using a pattern that matches the words instead of the spaces. Now that we’re matching the words, we’re in a position to extend the regular expression to cover a wider range of cases. The regular expression «\w+|\S\w*» will first try to match any sequence of word characters. If no match is found, it will try to match any non-whitespace character (\S is the complement of \s) followed by further word characters. This means that punctuation is grouped with any following letters (e.g., ’s) but that sequences of two or more punctuation characters are separated.
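
Before turning to that extended pattern (its example appears below), here is a quick added check, not in the original text, that re.findall(r'\w+', raw) really does give the split tokens minus the empty strings:

>>> re.findall(r'\w+', raw) == [t for t in re.split(r'\W+', raw) if t]
True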

>>> re.findall(r'\w+|\S\w*', raw)
["'When", 'I', "'M", 'a', 'Duchess', ',', "'", 'she', 'said', 'to', 'herself', ',',
'(not', 'in', 'a', 'very', 'hopeful', 'tone', 'though', ')', ',', "'I", 'won', "'t",
'have', 'any', 'pepper', 'in', 'my', 'kitchen', 'AT', 'ALL', '.', 'Soup', 'does',
'very', 'well', 'without', '-', '-Maybe', 'it', "'s", 'always', 'pepper', 'that',
'makes', 'people', 'hot', '-tempered', ',', "'", '.', '.', '.']

Let's generalize the \w+ in the preceding expression to permit word-internal hyphens and apostrophes (apostrophes mark omitted letters, as in it's, o', o'er, can't, and '99): «\w+([-']\w+)*». This expression means \w+ followed by zero or more instances of [-']\w+; it would match hot-tempered and it's. (We need to include ?: in this expression for reasons discussed earlier.) We'll also add a pattern to match quote characters so these are kept separate from the text they enclose.

>>> print(re.findall(r"\w+(?:[-']\w+)*|'|[-.(]+|\S\w*", raw))
["'", 'When', "I'M", 'a', 'Duchess', ',', "'", 'she', 'said', 'to', 'herself', ',',
'(', 'not', 'in', 'a', 'very', 'hopeful', 'tone', 'though', ')', ',', "'", 'I',
"won't", 'have', 'any', 'pepper', 'in', 'my', 'kitchen', 'AT', 'ALL', '.', 'Soup',
'does', 'very', 'well', 'without', '--', 'Maybe', "it's", 'always', 'pepper', 'that',
'makes', 'people', 'hot-tempered', ',', "'", '...']

The expression in this example also included «[-.(]+», which causes the double hyphen, ellipsis, and open parenthesis to be tokenized separately.

Table 3-4 lists the regular expression character class symbols we have seen in this section, in addition to some other useful symbols.

Table 3-4. Regular expression symbols

Symbol   Function
\b       Word boundary (zero width)
\d       Any decimal digit (equivalent to [0-9])
\D       Any non-digit character (equivalent to [^0-9])
\s       Any whitespace character (equivalent to [ \t\n\r\f\v])
\S       Any non-whitespace character (equivalent to [^ \t\n\r\f\v])
\w       Any alphanumeric character (equivalent to [a-zA-Z0-9_])
\W       Any non-alphanumeric character (equivalent to [^a-zA-Z0-9_])
\t       The tab character
\n       The newline character

NLTK’s Regular Expression Tokenizer

The function nltk.regexp_tokenize() is similar to re.findall() (as we’ve been using it for tokenization). However, nltk.regexp_tokenize() is more efficient for this task, and avoids the need for special treatment of parentheses. For readability we break up the regular expression over several lines and add a comment about each line. The special (?x) “verbose flag” tells Python to strip out the embedded whitespace and comments.

>>> import nltk
>>> text = 'That U.S.A. poster-print costs $12.40...'
>>> pattern = r'''(?x)          # set flag to allow verbose regexps
...     ([A-Z]\.)+              # abbreviations, e.g. U.S.A.
...   | \w+(-\w+)*              # words with optional internal hyphens
...   | \$?\d+(\.\d+)?%?        # currency and percentages, e.g. $12.40, 82%
...   | \.\.\.                  # ellipsis
...   | [][.,;"'?():-_`]        # these are separate tokens
... '''
>>> nltk.regexp_tokenize(text, pattern)
['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']

When using the verbose flag, you can no longer use ' ' to match a space character; use \s instead. The regexp_tokenize() function has an optional gaps parameter. When set to True, the regular expression specifies the gaps between tokens, as with re.split().
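
As an added illustration (not in the original text) of the gaps parameter, splitting the example string above on whitespace gaps would look like this:

>>> nltk.regexp_tokenize(text, r'\s+', gaps=True)
['That', 'U.S.A.', 'poster-print', 'costs', '$12.40...']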

We can evaluate a tokenizer by comparing the resulting tokens with a wordlist, and then report any tokens that don't appear in the wordlist, using set(tokens).difference(wordlist). You'll probably want to lowercase all the tokens first. (A handy way to evaluate a tokenizer!)
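
A rough sketch of that check (an addition; it uses NLTK's words corpus as the wordlist, and the variable names are only illustrative):

>>> wordlist = set(w.lower() for w in nltk.corpus.words.words())
>>> tokens = nltk.regexp_tokenize(text, pattern)
>>> unknown = set(t.lower() for t in tokens).difference(wordlist)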

Further Issues with Tokenization

Tokenization turns out to be a far more difficult task than you might have expected. No single solution works well across the board, and we must decide what counts as a token depending on the application domain.

When developing a tokenizer it helps to have access to raw text which has been manually tokenized, in order to compare the output of your tokenizer with high-quality (or "gold-standard") tokens. The NLTK corpus collection includes a sample of Penn Treebank data, including the raw Wall Street Journal text (nltk.corpus.treebank_raw.raw()) and the tokenized version (nltk.corpus.treebank.words()).
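
For example (an added sketch, not from the original), one could run the verbose-pattern tokenizer from the previous section over the raw Wall Street Journal text and collect the tokens that are absent from the gold-standard version:

>>> wsj_raw = nltk.corpus.treebank_raw.raw()
>>> gold = set(nltk.corpus.treebank.words())
>>> mismatches = set(nltk.regexp_tokenize(wsj_raw, pattern)).difference(gold)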

A final issue for tokenization is the presence of contractions, such as didn't. If we are analyzing the meaning of a sentence, it would probably be more useful to normalize this form to two separate forms: did and n't (or not). We can do this work with the help of a lookup table.
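
A minimal sketch of such a lookup table (the mapping shown is only illustrative, not a complete list of English contractions):

>>> CONTRACTIONS = {"didn't": ['did', "n't"], "won't": ['wo', "n't"], "it's": ['it', "'s"]}
>>> [part for tok in "I didn't know".split() for part in CONTRACTIONS.get(tok, [tok])]
['I', 'did', "n't", 'know']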
