7.9   Exercises

  1. ☼ The IOB format categorizes tagged tokens as I, O and B. Why are three tags necessary? What problem would be caused if we used I and O tags exclusively?
  2. ☼ Write a tag pattern to match noun phrases containing plural head nouns, e.g. "many/JJ researchers/NNS", "two/CD weeks/NNS", "both/DT new/JJ positions/NNS". Try to do this by generalizing the tag pattern that handled singular noun phrases.
  3. ☼ Pick one of the three chunk types in the CoNLL corpus. Inspect the CoNLL corpus and try to observe any patterns in the POS tag sequences that make up this kind of chunk. Develop a simple chunker using the regular expression chunker nltk.RegexpParser. Discuss any tag sequences that are difficult to chunk reliably.
  4. ☼ An early definition of chunk was the material that occurs between chinks. Develop a chunker that starts by putting the whole sentence in a single chunk, and then does the rest of its work solely by chinking. Determine which tags (or tag sequences) are most likely to make up chinks with the help of your own utility program. Compare the performance and simplicity of this approach relative to a chunker based entirely on chunk rules.
  5. ◑ Write a tag pattern to cover noun phrases that contain gerunds, e.g. "the/DT receiving/VBG end/NN", "assistant/NN managing/VBG editor/NN". Add these patterns to the grammar, one per line. Test your work using some tagged sentences of your own devising.
  6. ◑ Write one or more tag patterns to handle coordinated noun phrases, e.g. "July/NNP and/CC August/NNP", "all/DT your/PRP$ managers/NNS and/CC supervisors/NNS", "company/NN courts/NNS and/CC adjudicators/NNS".
  7. ◑ Carry out the following evaluation tasks for any of the chunkers you have developed earlier. (Note that most chunking corpora contain some internal inconsistencies, such that any reasonable rule-based approach will produce errors.)
    1. Evaluate your chunker on 100 sentences from a chunked corpus, and report the precision, recall and F-measure.
    2. Use the chunkscore.missed() and chunkscore.incorrect() methods to identify the errors made by your chunker. Discuss.
    3. Compare the performance of your chunker to the baseline chunker discussed in the evaluation section of this chapter.
  8. ◑ Develop a chunker for one of the chunk types in the CoNLL corpus using a regular-expression-based chunk grammar with nltk.RegexpParser. Use any combination of rules for chunking, chinking, merging or splitting.
  9. ◑ Sometimes a word is incorrectly tagged, e.g. the head noun in "12/CD or/CC so/RB cases/VBZ". Instead of requiring manual correction of tagger output, good chunkers are able to work with the erroneous output of taggers. Look for other examples of correctly chunked noun phrases with incorrect tags.
  10. ◑ The bigram chunker scores about 90% accuracy. Study its errors and try to work out why it doesn't get 100% accuracy. Experiment with trigram chunking. Are you able to improve the performance any more?
  11. ★ Apply the n-gram and Brill tagging methods to IOB chunk tagging. Instead of assigning POS tags to words, here we will assign IOB tags to the POS tags. E.g. if the tag DT (determiner) often occurs at the start of a chunk, it will be tagged B (begin). Evaluate the performance of these chunking methods relative to the regular expression chunking methods covered in this chapter.
  12. ★ We saw in Chapter 5 that it is possible to establish an upper limit to tagging performance by looking for ambiguous n-grams, that is, n-grams that are tagged in more than one way in the training data. Apply the same method to determine an upper bound on the performance of an n-gram chunker.
  13. ★ Pick one of the three chunk types in the CoNLL corpus. Write functions to do the following tasks for your chosen type:
    1. List all the tag sequences that occur with each instance of this chunk type.
    2. Count the frequency of each tag sequence, and produce a ranked list in order of decreasing frequency; each line should consist of an integer (the frequency) and the tag sequence.
    3. Inspect the high-frequency tag sequences. Use these as the basis for developing a better chunker.
  14. ★ The baseline chunker presented in the evaluation section tends to create larger chunks than it should. For example, the phrase: [every/DT time/NN] [she/PRP] sees/VBZ [a/DT newspaper/NN] contains two consecutive chunks, and our baseline chunker will incorrectly combine the first two: [every/DT time/NN she/PRP]. Write a program that finds which of these chunk-internal tags typically occur at the start of a chunk, then devise one or more rules that will split up these chunks. Combine these with the existing baseline chunker and re-evaluate it, to see if you have discovered an improved baseline.
  15. ★ Develop an NP chunker that converts POS-tagged text into a list of tuples, where each tuple consists of a verb followed by a sequence of noun phrases and prepositions, e.g. "the little cat sat on the mat" becomes ('sat', 'on', 'NP')...
  16. ★ The Penn Treebank contains a section of tagged Wall Street Journal text that has been chunked into noun phrases. The format uses square brackets, and we have encountered it several times during this chapter. The Treebank corpus can be accessed using: for sent in nltk.corpus.treebank_chunk.chunked_sents(fileid). These are flat trees, just as we got using nltk.corpus.conll2000.chunked_sents().
    1. The tree method pprint() (pformat() in NLTK 3, where pprint() prints rather than returns a string) and the function nltk.chunk.tree2conllstr() can be used to create Treebank and IOB strings from a tree. Write functions chunk2brackets() and chunk2iob() that take a single chunk tree as their sole argument, and return the required multi-line string representation.
    2. Write command-line conversion utilities bracket2iob.py and iob2bracket.py that take a file in Treebank or CoNLL format, respectively, and convert it to the other format. (Obtain some raw Treebank or CoNLL data from the NLTK Corpora, save it to a file, and then use for line in open(filename) to access it from Python.)
  17. ★ An n-gram chunker can use information other than the current part-of-speech tag and the n-1 previous chunk tags. Investigate other models of the context, such as the n-1 previous part-of-speech tags, or some combination of previous chunk tags along with previous and following part-of-speech tags.
  18. ★ Consider the way an n-gram tagger uses recent tags to inform its tagging choice. Now observe how a chunker may re-use this sequence information. For example, both tasks will make use of the information that nouns tend to follow adjectives (in English). It would appear that the same information is being maintained in two places. Is this likely to become a problem as the size of the rule sets grows? If so, speculate about any ways that this problem might be addressed.
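
As a starting point for the evaluation tasks in exercises 7 and 14, the following is a minimal sketch of chunk-level scoring in plain Python, with no NLTK dependency. The function names iob_to_chunks and chunk_scores are hypothetical helpers written for this illustration and are not part of NLTK. It assumes each sentence is a list of (word, POS tag, IOB tag) triples with IOB tags of the form B-NP, I-NP and O, and it uses the "every time she sees a newspaper" example from exercise 14.

```python
def iob_to_chunks(tagged):
    """Turn (word, pos, iob) triples into a set of (start, end, type) spans.

    `end` is exclusive. An I- tag that does not continue an open chunk of
    the same type is treated as starting a new chunk (an assumption made
    for this sketch).
    """
    chunks = set()
    start, ctype = None, None
    for i, (_word, _pos, iob) in enumerate(tagged):
        prefix, _, label = iob.partition("-")
        continues = prefix == "I" and ctype == label
        # Close the open chunk unless the current token continues it.
        if start is not None and not continues:
            chunks.add((start, i, ctype))
            start, ctype = None, None
        # Open a new chunk on B-, or on an I- with nothing open.
        if prefix == "B" or (prefix == "I" and start is None):
            start, ctype = i, label
    if start is not None:
        chunks.add((start, len(tagged), ctype))
    return chunks


def chunk_scores(gold, guess):
    """Return (precision, recall, F-measure) over two sets of chunk spans."""
    correct = len(gold & guess)
    precision = correct / len(guess) if guess else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Gold standard: [every time] [she] sees [a newspaper]
gold_sent = [
    ("every", "DT", "B-NP"), ("time", "NN", "I-NP"), ("she", "PRP", "B-NP"),
    ("sees", "VBZ", "O"), ("a", "DT", "B-NP"), ("newspaper", "NN", "I-NP"),
]
# Chunker output that wrongly merges the first two chunks.
guess_sent = [
    ("every", "DT", "B-NP"), ("time", "NN", "I-NP"), ("she", "PRP", "I-NP"),
    ("sees", "VBZ", "O"), ("a", "DT", "B-NP"), ("newspaper", "NN", "I-NP"),
]

gold = iob_to_chunks(gold_sent)
guess = iob_to_chunks(guess_sent)
precision, recall, f_measure = chunk_scores(gold, guess)
missed = gold - guess      # gold chunks the chunker failed to find
incorrect = guess - gold   # chunks the chunker proposed that are not in gold

print(precision, recall, f_measure)
print(sorted(missed), sorted(incorrect))
```

A chunk counts as correct here only if its boundaries and type both match the gold standard exactly, so the merged chunk is penalized twice: it is an incorrect guess, and both gold chunks it covers are missed. The two set differences are meant as rough analogues of what chunkscore.missed() and chunkscore.incorrect() report, not a reimplementation of them.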
