Natural Language Processing with Python study notes (41): 5.2 Tagged Corpora
5.2 Tagged Corpora
Representing Tagged Tokens
By convention in NLTK, a tagged token is represented using a tuple consisting of the token and the tag. We can create one of these special tuples from the standard string representation of a tagged token, using the function str2tuple():
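A minimal sketch of the idea, written in plain Python so that it runs without NLTK. The helper name str2tuple_sketch is hypothetical; it splits the string at the last separator and uppercases the tag. With NLTK installed, nltk.tag.str2tuple('fly/NN') would be the real call.

```python
def str2tuple_sketch(s, sep="/"):
    """Split 'word/tag' at the last separator and uppercase the tag."""
    loc = s.rfind(sep)
    if loc >= 0:
        return (s[:loc], s[loc + len(sep):].upper())
    # No separator found: treat the token as untagged.
    return (s, None)


tagged_token = str2tuple_sketch("fly/NN")
print(tagged_token)      # the (word, tag) pair
print(tagged_token[0])   # the word
print(tagged_token[1])   # the tag
```

Splitting at the last separator rather than the first keeps tokens that themselves contain a slash intact.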
We can construct a list of tagged tokens directly from a string. The first step is to tokenize the string to access the individual word/tag strings, and then to convert each of these into a tuple (using str2tuple()).
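The two steps can be sketched as follows. This is an illustrative stand-in, not the original listing: the sample sentence and the helper name to_tagged_tokens are assumptions, and the splitting is done with plain string methods rather than NLTK's str2tuple().

```python
sent = """
The/AT grand/JJ jury/NN commented/VBD on/IN a/AT number/NN of/IN
other/AP topics/NNS ./.
"""


def to_tagged_tokens(text, sep="/"):
    """Tokenize on whitespace, then split each 'word/tag' item into a tuple.

    Assumes every item contains the separator.
    """
    pairs = []
    for item in text.split():
        word, _, tag = item.rpartition(sep)
        pairs.append((word, tag.upper()))
    return pairs


tagged = to_tagged_tokens(sent)
print(tagged[:3])
```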
Reading Tagged Corpora
Several of the corpora included with NLTK have been tagged for their part-of-speech. Here's an example of what you might see if you opened a file from the Brown Corpus with a text editor:
The/at Fulton/np-tl County/nn-tl Grand/jj-tl Jury/nn-tl said/vbd Friday/nr an/at investigation/nn of/in Atlanta's/np$ recent/jj primary/nn election/nn produced/vbd ``/`` no/at evidence/nn ''/'' that/cs any/dti irregularities/nns took/vbd place/nn ./.
Other corpora use a variety of formats for storing part-of-speech tags. NLTK's corpus readers provide a uniform interface so that you don't have to be concerned with the different file formats. In contrast with the file extract shown above, the corpus reader for the Brown Corpus represents the data as a list of (word, tag) tuples, beginning ('The', 'AT'), ('Fulton', 'NP-TL'). Note that part-of-speech tags have been converted to uppercase, since this has become standard practice since the Brown Corpus was published.
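Whenever a corpus contains tagged text, the NLTK corpus interface will have a tagged_words() method. The same call works on other tagged corpora, for example nltk.corpus.nps_chat.tagged_words(), nltk.corpus.conll2000.tagged_words() and nltk.corpus.treebank.tagged_words(), and returns data in the output format illustrated for the Brown Corpus.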
Not all corpora employ the same set of tags; see the tagset help functionality and the readme() methods mentioned above for documentation. Initially we want to avoid the complications of these tagsets, so we use a built-in mapping to a simplified tagset:
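To make the idea of a tagset mapping concrete, here is a small sketch. The lookup table SIMPLIFIED is a hypothetical, hand-written fragment for illustration only; it is not NLTK's built-in mapping. Older NLTK releases exposed the real mapping through a simplify_tags=True argument to tagged_words(), and recent releases offer a similar facility through tagset='universal', whose tag names differ from the simplified tags used in these notes.

```python
# Hypothetical fragment of a Brown-to-simplified mapping (illustration only).
SIMPLIFIED = {
    "AT": "DET",
    "NN": "N",
    "NNS": "N",
    "NP": "NP",
    "JJ": "ADJ",
    "VBD": "VD",
    "IN": "P",
}


def simplify(tag):
    """Drop suffix modifiers such as -TL, then look up the base tag."""
    base = tag.split("-")[0]
    # Tags missing from the fragment are passed through unchanged.
    return SIMPLIFIED.get(base, base)


brown_style = [("The", "AT"), ("Fulton", "NP-TL"), ("County", "NN-TL"),
               ("Grand", "JJ-TL"), ("Jury", "NN-TL"), ("said", "VBD")]
simplified = [(word, simplify(tag)) for word, tag in brown_style]
print(simplified)
```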
Tagged corpora for several other languages are distributed with NLTK, including Chinese, Hindi, Portuguese, Spanish, Dutch and Catalan. These usually contain non-ASCII text. Under Python 2, such text is displayed as hexadecimal escapes when printing a larger structure such as a list; Python 3 shows the characters directly.
If your environment is set up correctly, with appropriate editors and fonts, you should be able to display individual strings in a human-readable way. For example, Figure 5.1 shows data accessed using nltk.corpus.indian.
Figure 5.1: POS-Tagged Data from Four Indian Languages: Bangla, Hindi, Marathi, and Telugu
If the corpus is also segmented into sentences, it will have a tagged_sents() method that divides up the tagged words into sentences rather than presenting them as one big list. This will be useful when we come to developing automatic taggers, as they are trained and tested on lists of sentences, not words.
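The difference between the two views can be sketched in plain Python. This is only an illustration: real corpus readers take sentence boundaries from the corpus files themselves, whereas the hypothetical helper split_into_sentences assumes that a sentence ends at any token tagged ".".

```python
def split_into_sentences(tagged_words, end_tag="."):
    """Group a flat list of (word, tag) pairs into sentences.

    Assumption: a sentence ends at a token whose tag equals end_tag.
    """
    sents, current = [], []
    for word, tag in tagged_words:
        current.append((word, tag))
        if tag == end_tag:
            sents.append(current)
            current = []
    if current:  # keep trailing words that lack a final end tag
        sents.append(current)
    return sents


flat = [("The", "DET"), ("jury", "N"), ("met", "VD"), (".", "."),
        ("It", "PRO"), ("adjourned", "VD"), (".", ".")]
tagged_sents = split_into_sentences(flat)
print(len(tagged_sents))
print(tagged_sents[1])
```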
Tagged corpora use many different conventions for tagging words. To help us get started, we will be looking at a simplified tagset (shown in Table 5.1).
Tag | Meaning | Examples
ADJ | adjective | new, good, high, special, big, local
ADV | adverb | really, already, still, early, now
CNJ | conjunction | and, or, but, if, while, although
DET | determiner | the, a, some, most, every, no
EX | existential | there, there's
FW | foreign word | dolce, ersatz, esprit, quo, maitre
MOD | modal verb | will, can, would, may, must, should
N | noun | year, home, costs, time, education
NP | proper noun | Alison, Africa, April, Washington
NUM | number | twenty-four, fourth, 1991, 14:24
PRO | pronoun | he, their, her, its, my, I, us
P | preposition | on, of, at, with, by, into, under
TO | the word to | to
UH | interjection | ah, bang, ha, whee, hmpf, oops
V | verb | is, has, get, do, make, see, run
VD | past tense | said, took, told, made, asked
VG | present participle | making, going, playing, working
VN | past participle | given, taken, begun, sung
WH | wh determiner | who, which, when, what, where, how
Table 5.1: Simplified Part-of-Speech Tagset
Let's see which of these tags are the most common in the news category of the Brown corpus:
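A rough sketch of the counting step, using collections.Counter on a tiny hand-made sample rather than the Brown news category (the sample sentence is an assumption). NLTK's FreqDist offers the same counting interface plus extras such as plot().

```python
from collections import Counter

# Toy sample in the simplified tagset (not real corpus data).
tagged_words = [("The", "DET"), ("grand", "ADJ"), ("jury", "N"),
                ("said", "VD"), ("the", "DET"), ("election", "N"),
                ("produced", "VD"), ("evidence", "N"), (".", ".")]

# Count the tags only, ignoring the words.
tag_fd = Counter(tag for (word, tag) in tagged_words)
print(tag_fd.most_common())

# Share of tokens covered by the two most frequent tags.
top_share = sum(count for _, count in tag_fd.most_common(2)) / sum(tag_fd.values())
print(round(top_share, 2))
```

The same cumulative share, taken over the five most frequent tags of a real corpus, is what the cumulative plot in the exercise that follows lets you read off.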
Note
Your Turn: Plot the above frequency distribution using tag_fd.plot(cumulative=True). What percentage of words are tagged using the first five tags of the above list? (Answer recorded in these notes: 60%.)
We can use these tags to do powerful searches using a graphical POS-concordance tool nltk.app.concordance(). Use it to search for any combination of words and POS tags, e.g. N N N N, hit/VD, hit/VN, or the ADJ man.
Nouns
Nouns generally refer to people, places, things, or concepts, e.g.: woman, Scotland, book, intelligence. Nouns can appear after determiners and adjectives, and can be the subject or object of the verb, as shown in Table 5.2.
Word | After a determiner | Subject of the verb
woman | the woman who I saw yesterday ... | the woman sat down
Scotland | the Scotland I remember as a child ... | Scotland has five million people
book | the book I bought yesterday ... | this book recounts the colonization of Australia
intelligence | the intelligence displayed by the child ... | Mary's intelligence impressed her teachers
Table 5.2: Syntactic Patterns involving some Nouns
The simplified noun tags are N for common nouns like book, and NP for proper nouns like Scotland.
Let's inspect some tagged text to see what parts of speech occur before a noun, with the most frequent ones first. To begin with, we construct a list of bigrams whose members are themselves word-tag pairs such as (('The', 'DET'), ('Fulton', 'NP')) and (('Fulton', 'NP'), ('County', 'N')). Then we construct a FreqDist from the tag parts of the bigrams.
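A sketch of the same procedure on a toy sample (the sentence is an assumption, and zip plus Counter stand in for NLTK's bigram and FreqDist helpers):

```python
from collections import Counter

# Toy sample in the simplified tagset (not real corpus data).
tagged = [("The", "DET"), ("jury", "N"), ("praised", "VD"), ("the", "DET"),
          ("recent", "ADJ"), ("election", "N"), ("and", "CNJ"),
          ("two", "NUM"), ("officials", "N"), ("in", "P"),
          ("the", "DET"), ("county", "N")]

# Bigrams whose members are themselves (word, tag) pairs.
word_tag_pairs = list(zip(tagged, tagged[1:]))

# For every bigram (a, b) whose second member is a noun, count the tag of a.
preceding = Counter(a[1] for (a, b) in word_tag_pairs if b[1] == "N")
print(preceding.most_common())
```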
Each bigram (a, b) looks like (('The', 'DET'), ('Fulton', 'NP')); whenever b[1] == 'N', we record a[1], the tag of the preceding word.
This confirms our assertion that nouns occur after determiners and adjectives, including numeral adjectives (tagged as NUM).
Verbs
Verbs are words that describe events and actions, e.g. fall, eat in Table 5.3. In the context of a sentence, verbs typically express a relation involving the referents of one or more noun phrases.
Word | Simple | With modifiers and adjuncts
fall | Rome fell | Dot com stocks suddenly fell like a stone
eat | Mice eat cheese | John ate the pizza with gusto
Table 5.3: Syntactic Patterns involving some Verbs
What are the most common verbs in news text? Let's sort all the verbs by frequency:
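One way to sketch this in plain Python is to count the word-tag pairs with a Counter and keep those whose tag starts with V, in frequency order. The sample data is made up for illustration; the original works on a tagged news corpus.

```python
from collections import Counter

# Toy sample in the simplified tagset (not real corpus data).
tagged = [("said", "VD"), ("is", "V"), ("said", "VD"), ("the", "DET"),
          ("is", "V"), ("said", "VD"), ("made", "VN"), ("report", "N")]

# The items being counted are whole (word, tag) pairs.
word_tag_fd = Counter(tagged)

# Walk the pairs from most to least frequent, keeping only verb tags.
verbs = [word + "/" + tag
         for (word, tag), count in word_tag_fd.most_common()
         if tag.startswith("V")]
print(verbs)
```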
Note that the items being counted in the frequency distribution are word-tag pairs. Since words and tags are paired, we can treat the word as a condition and the tag as an event, and initialize a conditional frequency distribution with a list of condition-event pairs. This lets us see a frequency-ordered list of tags given a word:
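A conditional frequency distribution can be approximated with a dictionary of Counters, as in this sketch (toy data; NLTK's ConditionalFreqDist is built from the same kind of condition-event pairs):

```python
from collections import Counter, defaultdict

# Toy (word, tag) pairs: the word is the condition, the tag is the event.
tagged = [("yield", "V"), ("yield", "N"), ("yield", "V"),
          ("cut", "V"), ("cut", "VN"), ("cut", "N"),
          ("cut", "V"), ("cut", "VD")]

cfd1 = defaultdict(Counter)
for word, tag in tagged:
    cfd1[word][tag] += 1

# Tags seen for "yield", most frequent first.
print(cfd1["yield"].most_common())
# All distinct tags seen for "cut".
print(sorted(cfd1["cut"]))
```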
We can reverse the order of the pairs, so that the tags are the conditions, and the words are the events. Now we can see likely words for a given tag:
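Swapping the roles only changes which element is used as the dictionary key. A sketch with made-up data:

```python
from collections import Counter, defaultdict

# Toy (word, tag) pairs (not real corpus data).
tagged = [("been", "VN"), ("expected", "VN"), ("been", "VN"),
          ("made", "VN"), ("said", "VD"), ("made", "VD")]

# Now the tag is the condition and the word is the event.
cfd2 = defaultdict(Counter)
for word, tag in tagged:
    cfd2[tag][word] += 1

# Likely words for the past-participle tag.
print(cfd2["VN"].most_common(1))
print(sorted(cfd2["VN"]))
```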
To clarify the distinction between VD (past tense) and VN (past participle), let's find words which can be both VD and VN, and see some surrounding text:
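The search can be sketched as follows. The token list is a small hand-built sample chosen so that kicked occurs with both tags; the window of four preceding tokens is likewise an illustrative choice.

```python
from collections import defaultdict

# Toy sample in the simplified tagset (not real corpus data).
tagged = [("While", "CNJ"), ("program", "N"), ("trades", "N"),
          ("swiftly", "ADV"), ("kicked", "VD"), ("in", "P"), (",", ","),
          ("head", "N"), ("of", "P"), ("state", "N"), ("has", "V"),
          ("kicked", "VN"), ("off", "P")]

# Collect the set of tags observed for each word.
tags_by_word = defaultdict(set)
for word, tag in tagged:
    tags_by_word[word].add(tag)

# Words seen both as past tense (VD) and as past participle (VN).
both = sorted(w for w, tags in tags_by_word.items() if {"VD", "VN"} <= tags)
print(both)

# Show up to four tokens of left context for the first occurrence of each use.
idx1 = tagged.index(("kicked", "VD"))
context_vd = tagged[max(0, idx1 - 4):idx1 + 1]
idx2 = tagged.index(("kicked", "VN"))
context_vn = tagged[max(0, idx2 - 4):idx2 + 1]
print(context_vd)
print(context_vn)
```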
In this case, we see that the past participle of kicked is preceded by a form of the auxiliary verb have. Is this generally true?
Note
Your Turn: Given the list of past participles specified by cfd2['VN'].keys(), try to collect a list of all the word-tag pairs that immediately precede items in that list.
Adjectives and Adverbs
Two other important word classes are adjectives and adverbs. Adjectives describe nouns, and can be used as modifiers (e.g. large in the large pizza), or in predicates (e.g. the pizza is large). English adjectives can have internal structure (e.g. fall+ing in the falling stocks). Adverbs modify verbs to specify the time, manner, place or direction of the event described by the verb (e.g. quickly in the stocks fell quickly). Adverbs may also modify adjectives (e.g. really in Mary's teacher was really nice).
English has several categories of closed class words in addition to prepositions, such as articles (also often called determiners) (e.g., the, a), modals (e.g., should, may), and personal pronouns (e.g., she, they). Each dictionary and grammar classifies these words differently.
Note
Your Turn: If you are uncertain about some of these parts of speech, study them using nltk.app.concordance(), or watch some of the Schoolhouse Rock! grammar videos available at YouTube, or consult the Further Reading section at the end of this chapter.
Unsimplified Tags
Let's find the most frequent nouns of each noun part-of-speech type. The program in Example 5.2 finds all tags starting with NN, and provides a few example words for each one. You will see that there are many variants of NN; the most important contain $ for possessive nouns, S for plural nouns (since plural nouns typically end in s) and P for proper nouns. In addition, most of the tags have suffix modifiers: -NC for citations, -HL for words in headlines and -TL for titles (a feature of Brown tags).
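A stdlib sketch in the spirit of that program: it groups words by tag for every tag with a given prefix and keeps the most frequent words per tag. The sample data and the parameter n are assumptions made for illustration; the original runs over the tagged Brown news text using NLTK's ConditionalFreqDist.

```python
from collections import Counter, defaultdict


def findtags(tag_prefix, tagged_text, n=5):
    """Map each tag starting with tag_prefix to its n most frequent words."""
    cfd = defaultdict(Counter)
    for word, tag in tagged_text:
        if tag.startswith(tag_prefix):
            cfd[tag][word] += 1
    return {tag: [w for w, _ in cfd[tag].most_common(n)] for tag in cfd}


# Toy sample with unsimplified Brown-style tags (not real corpus data).
sample = [("year", "NN"), ("time", "NN"), ("year", "NN"),
          ("years", "NNS"), ("members", "NNS"), ("years", "NNS"),
          ("President", "NN-TL"), ("year's", "NN$"), ("said", "VBD")]

tagdict = findtags("NN", sample)
for tag in sorted(tagdict):
    print(tag, tagdict[tag])
```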
Example 5.2 (code_findtags.py): Program to Find the Most Frequent Noun Tags
When we come to constructing part-of-speech taggers later in this chapter, we will use the unsimplified tags.
Exploring Tagged Corpora
Let's briefly return to the kinds of exploration of corpora we saw in previous chapters, this time exploiting POS tags.
Suppose we're studying the word often and want to see how it is used in text. We could ask to see the words that follow often:
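On a plain word list this is a bigram filter, sketched here with a made-up sentence in place of a corpus:

```python
# Toy word list (not real corpus data).
words = "he often said that prices often rise but often , he was wrong".split()

# Pair every word with its successor and keep successors of "often".
following = sorted({b for (a, b) in zip(words, words[1:]) if a == "often"})
print(following)
```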
However, it's probably more instructive to use the tagged_words() method to look at the part-of-speech tag of the following words:
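With tagged data the same filter can tabulate tags instead of words. A sketch on a hand-made tagged sample (an assumption; the original uses a tagged corpus via tagged_words()):

```python
from collections import Counter

# Toy sample in the simplified tagset (not real corpus data).
tagged = [("he", "PRO"), ("often", "ADV"), ("said", "VD"), ("that", "CNJ"),
          ("prices", "N"), ("often", "ADV"), ("rise", "V"), (",", ","),
          ("and", "CNJ"), ("often", "ADV"), (",", ","), ("he", "PRO"),
          ("often", "ADV"), ("made", "VD"), ("errors", "N")]

# Tags of the tokens that immediately follow "often".
tags = [b[1] for (a, b) in zip(tagged, tagged[1:]) if a[0] == "often"]
fd = Counter(tags)
for tag, count in fd.most_common():
    print(tag, count)
```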
Notice that the highest-frequency parts of speech following often are verbs. Nouns never appear in this position (in this particular corpus).
Next, let's look at some larger context, and find words involving particular sequences of tags and words (in this case "<Verb> to <Verb>"). In Example 5.3 we consider each three-word window in the sentence, and check whether it meets our criterion. If the tags match, we print the corresponding words.
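The sliding-window check can be sketched without NLTK by zipping a sentence with two shifted copies of itself. The function name verb_to_verb and the two sample sentences are assumptions for illustration.

```python
def verb_to_verb(sentence):
    """Return (w1, w2, w3) for each window tagged <Verb> TO <Verb>."""
    hits = []
    windows = zip(sentence, sentence[1:], sentence[2:])
    for (w1, t1), (w2, t2), (w3, t3) in windows:
        if t1.startswith("V") and t2 == "TO" and t3.startswith("V"):
            hits.append((w1, w2, w3))
    return hits


# Toy tagged sentences in the simplified tagset (not real corpus data).
tagged_sents = [
    [("They", "PRO"), ("combined", "VN"), ("to", "TO"),
     ("achieve", "V"), ("success", "N"), (".", ".")],
    [("He", "PRO"), ("went", "VD"), ("to", "TO"),
     ("school", "N"), (".", ".")],
]

for sent in tagged_sents:
    for w1, w2, w3 in verb_to_verb(sent):
        print(w1, w2, w3)
```

Working sentence by sentence, rather than over one flat word list, avoids matching a window that straddles a sentence boundary.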
Example 5.3 (code_three_word_phrase.py): Searching for Three-Word Phrases Using POS Tags
Finally, let's look for words that are highly ambiguous as to their part of speech tag. Understanding why such words are tagged as they are in each context can help us clarify the distinctions between the tags.
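One possible sketch: lowercase each word, collect the tags it appears with, and report the words seen with more than three distinct tags. Both the threshold of three and the toy data are illustrative choices.

```python
from collections import Counter, defaultdict

# Toy sample in the simplified tagset (not real corpus data).
tagged = [("Best", "ADJ"), ("best", "ADV"), ("best", "NP"), ("best", "N"),
          ("close", "ADV"), ("close", "ADJ"), ("Close", "V"),
          ("the", "DET"), ("The", "DET")]

# Word (lowercased) is the condition, tag is the event.
data = defaultdict(Counter)
for word, tag in tagged:
    data[word.lower()][tag] += 1

# Keep words that occur with more than three distinct tags.
ambiguous = {w: sorted(tags) for w, tags in data.items() if len(tags) > 3}
for word, tags in ambiguous.items():
    print(word, " ".join(tags))
```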
Note
Your Turn: Open the POS concordance tool nltk.app.concordance() and load the complete Brown Corpus (simplified tagset). Now pick some of the above words and see how the tag of the word correlates with the context of the word. E.g. search for near to see all forms mixed together, near/ADJ to see it used as an adjective, near N to see just those cases where a noun follows, and so forth.
Reposted from: https://www.cnblogs.com/yuxc/archive/2011/08/24/2152667.html