5.6 Transformation-Based Tagging

A potential issue with n-gram taggers is the size of their n-gram table (or language model). If tagging is to be employed in a variety of language technologies deployed on mobile computing devices, it is important to strike a balance between model size and tagger performance. An n-gram tagger with backoff may store trigram and bigram tables, which are large, sparse arrays that may have hundreds of millions of entries.

A second issue concerns context. The only information an n-gram tagger considers from prior context is tags, even though words themselves might be a useful source of information. It is simply impractical for n-gram models to be conditioned on the identities of words in the context. In this section, we examine Brill tagging, an inductive tagging method which performs very well using models that are only a tiny fraction of the size of n-gram taggers.

Brill tagging is a kind of transformation-based learning, named after its inventor. The general idea is very simple: guess the tag of each word, then go back and fix the mistakes.

In this way, a Brill tagger successively transforms a bad tagging of a text into a better one. As with n-gram tagging, this is a supervised learning method, since we need annotated training data to figure out whether the tagger’s guess is a mistake or not. However, unlike n-gram tagging, it does not count observations but compiles a list of transformational correction rules.

The process of Brill tagging is usually explained by analogy with painting. Suppose we were painting a tree, with all its details of boughs, branches, twigs, and leaves, against a uniform sky-blue background. Instead of painting the tree first and then trying to paint blue in the gaps, it is simpler to paint the whole canvas blue, then “correct” the tree section by over-painting the blue background. In the same fashion, we might paint the trunk a uniform brown before going back to over-paint further details with even finer brushes. Brill tagging uses the same idea: begin with broad brush strokes, and then fix up the details, with successively finer changes. Let’s look at an example involving the following sentence:

(1) The President said he will ask Congress to increase grants to states for vocational rehabilitation.

We will examine the operation of two rules: (a) replace NN with VB when the previous word is TO; (b) replace TO with IN when the next tag is NNS. Table 5-6 illustrates this process, first tagging with the unigram tagger, then applying the rules to fix the errors.

Table 5-6. Steps in Brill tagging
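The two rules can be traced in a few lines of Python. This is only a sketch: the starting tags are assumed for illustration rather than produced by a trained unigram tagger, and the helper names (`apply_rule`, `prev_tag_is`, `next_tag_is`) are hypothetical, not part of NLTK. Rule (a) is read here as testing the tag of the preceding word.

```python
def prev_tag_is(tag):
    """Context test: the tag immediately to the left equals `tag`."""
    return lambda tags, i: i > 0 and tags[i - 1] == tag


def next_tag_is(tag):
    """Context test: the tag immediately to the right equals `tag`."""
    return lambda tags, i: i + 1 < len(tags) and tags[i + 1] == tag


def apply_rule(tags, old, new, context):
    """Return a copy of `tags` with `old` replaced by `new` wherever `context` holds.

    The context is always checked against the tags as they were before
    this rule fired, so one rewrite cannot trigger another in the same pass.
    """
    return [new if t == old and context(tags, i) else t
            for i, t in enumerate(tags)]


words = "to increase grants to states for vocational rehabilitation".split()

# Assumed initial tagging (illustrative only): two tags are wrong.
unigram_tags = ["TO", "NN", "NNS", "TO", "NNS", "IN", "JJ", "NN"]

# Rule (a): replace NN with VB when the preceding tag is TO.
after_a = apply_rule(unigram_tags, "NN", "VB", prev_tag_is("TO"))

# Rule (b): replace TO with IN when the next tag is NNS.
after_b = apply_rule(after_a, "TO", "IN", next_tag_is("NNS"))

for word, before, after in zip(words, unigram_tags, after_b):
    print(f"{word:15} {before:4} {after}")
```

Under these assumed starting tags, rule (a) rewrites only "increase" and rule (b) rewrites only the second "to"; the first "to" is left alone because the tag that follows it is not NNS.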

In this table, we see two rules. All such rules are generated from a template of the following form: “replace T1 with T2 in the context C.” Typical contexts are the identity or the tag of the preceding or following word, or the appearance of a specific tag within two to three words of the current word. During its training phase, the tagger guesses values for T1, T2, and C, to create thousands of candidate rules. Each rule is scored according to its net benefit: the number of incorrect tags that it corrects, less the number of correct tags it incorrectly modifies.
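The scoring step can be sketched as follows. This is a simplified illustration, not NLTK's trainer: it uses a single hypothetical template ("replace T1 with T2 when the preceding tag is C"), the tag sequences are made up, and the function names are assumptions introduced here.

```python
def apply_rule(tags, rule):
    """Apply (t1, t2, prev): change t1 to t2 wherever the preceding tag is prev."""
    t1, t2, prev = rule
    return [t2 if t == t1 and i > 0 and tags[i - 1] == prev else t
            for i, t in enumerate(tags)]


def net_benefit(rule, current, gold):
    """Number of wrong tags the rule corrects, less the correct tags it breaks."""
    changed = apply_rule(current, rule)
    fixed = sum(1 for c, n, g in zip(current, changed, gold) if c != g and n == g)
    broken = sum(1 for c, n, g in zip(current, changed, gold) if c == g and n != g)
    return fixed - broken


def candidate_rules(current, gold):
    """Instantiate the template at every position where the current tag is wrong."""
    return {(c, g, current[i - 1])
            for i, (c, g) in enumerate(zip(current, gold))
            if i > 0 and c != g}


# Made-up data: `current` is the initial guess, `gold` the annotated answer.
current = ["TO", "NN", "NNS", "TO", "NN", "DT", "NN", "IN", "DT", "NN"]
gold    = ["TO", "VB", "NNS", "TO", "VB", "DT", "NN", "IN", "DT", "NNP"]

scores = {rule: net_benefit(rule, current, gold)
          for rule in candidate_rules(current, gold)}
best = max(scores, key=scores.get)

for rule, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(score, "%s -> %s if the preceding tag is %s" % rule)
```

On this data the rule NN -> VB after TO fixes two tags and breaks none, while NN -> NNP after DT fixes one tag but breaks another, so its net benefit is zero. A trainer along these lines would keep the best rule, apply it to `current`, and repeat until no candidate scores above some threshold.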

Brill taggers have another interesting property: the rules are linguistically interpretable. Compare this with the n-gram taggers, which employ a potentially massive table of n-grams. We cannot learn much from direct inspection of such a table, in comparison to the rules learned by the Brill tagger. Example 5-6 demonstrates NLTK’s Brill tagger.

Example 5-6. Brill tagger demonstration: The tagger has a collection of templates of the form X -> Y if the preceding word is Z; the variables in these templates are instantiated to particular words and tags to create “rules”; the score for a rule is the number of broken examples it corrects minus the number of correct cases it breaks; apart from training a tagger, the demonstration displays residual errors.

>>> nltk.tag.brill.demo()
Training Brill tagger on 80 sentences...
Finding initial useful rules...
    Found 6555 useful rules.

           B      |
   S   F   r   O  |        Score = Fixed - Broken
   c   i   o   t  |  R     Fixed = num tags changed incorrect -> correct
   o   x   k   h  |  u     Broken = num tags changed correct -> incorrect
   r   e   e   e  |  l     Other = num tags changed incorrect -> incorrect
   e   d   n   r  |  e
------------------+-------------------------------------------------------
  12  13   1   4  | NN -> VB if the tag of the preceding word is 'TO'
   8   9   1  23  | NN -> VBD if the tag of the following word is 'DT'
   8   8   0   9  | NN -> VBD if the tag of the preceding word is 'NNS'
   6   9   3  16  | NN -> NNP if the tag of words i-2...i-1 is '-NONE-'
   5   8   3   6  | NN -> NNP if the tag of the following word is 'NNP'
   5   6   1   0  | NN -> NNP if the text of words i-2...i-1 is 'like'
   5   5   0   3  | NN -> VBN if the text of the following word is '*-1'
   ...

>>> print(open("errors.out").read())
             left context |    word/test->gold     | right context
--------------------------+------------------------+--------------------------
                          |      Then/NN->RB       | ,/, in/IN the/DT guests/N
, in/IN the/DT guests/NNS |       '/VBD->POS       | honor/NN ,/, the/DT speed
'/POS honor/NN ,/, the/DT |    speedway/JJ->NN     | hauled/VBD out/RP four/CD
NN ,/, the/DT speedway/NN |     hauled/NN->VBD     | out/RP four/CD drivers/NN
DT speedway/NN hauled/VBD |      out/NNP->RP       | four/CD drivers/NNS ,/, c
dway/NN hauled/VBD out/RP |      four/NNP->CD      | drivers/NNS ,/, crews/NNS
hauled/VBD out/RP four/CD |    drivers/NNP->NNS    | ,/, crews/NNS and/CC even
P four/CD drivers/NNS ,/, |     crews/NN->NNS      | and/CC even/RB the/DT off
NNS and/CC even/RB the/DT |    official/NNP->JJ    | Indianapolis/NNP 500/CD a
                          |     After/VBD->IN      | the/DT race/NN ,/, Fortun
ter/IN the/DT race/NN ,/, |    Fortune/IN->NNP     | 500/CD executives/NNS dro
s/NNS drooled/VBD like/IN |  schoolboys/NNP->NNS   | over/IN the/DT cars/NNS a
olboys/NNS over/IN the/DT |      cars/NN->NNS      | and/CC drivers/NNS ./.

