POS Tagging Tag Reference Table (Penn Treebank Project)

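When analyzing English text, we often care about each word's part of speech and the role it plays in the sentence. The process of identifying the part of speech of each word in a text is called part-of-speech (POS) tagging.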

The eight main parts of speech in English are:

1. Noun

2. Pronoun

3. Verb

4. Adjective

5. Adverb

6. Preposition

7. Conjunction

8. Interjection

Other categories include numerals and articles.

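When a third-party tool such as NLTK does the tagging, the result is usually finer-grained than these eight categories. The tags NLTK assigns follow the POS tagset published by the Penn Treebank Project; the second reference lists the full tagset in alphabetical order.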
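To relate the finer-grained Penn Treebank tags back to the eight traditional categories, a small prefix-based mapping can help. The sketch below is only an illustration: `COARSE_BY_PREFIX` and `coarse_pos` are hypothetical names, not part of NLTK, and the grouping (for example, treating `IN` as a preposition although it also covers subordinating conjunctions) is a simplifying assumption.

```python
# Hypothetical helper: collapse Penn Treebank tags into coarse categories.
# The grouping is an illustrative assumption, not an NLTK API.
COARSE_BY_PREFIX = [
    ("NN", "noun"),            # NN, NNS, NNP, NNPS
    ("PRP", "pronoun"),        # PRP, PRP$
    ("WP", "pronoun"),         # WP, WP$
    ("VB", "verb"),            # VB, VBD, VBG, VBN, VBP, VBZ
    ("MD", "verb"),            # modal auxiliaries
    ("JJ", "adjective"),       # JJ, JJR, JJS
    ("RB", "adverb"),          # RB, RBR, RBS
    ("WRB", "adverb"),         # wh-adverbs
    ("IN", "preposition"),     # also used for subordinating conjunctions
    ("TO", "preposition"),
    ("CC", "conjunction"),
    ("UH", "interjection"),
    ("CD", "numeral"),
    ("DT", "article/determiner"),
]


def coarse_pos(tag):
    """Return a coarse category for a Penn Treebank tag, or 'other'."""
    for prefix, label in COARSE_BY_PREFIX:
        if tag.startswith(prefix):
            return label
    return "other"  # punctuation, FW, POS, EX, and anything unmapped


for tag in ["NNP", "VBD", "PRP$", "JJ", ","]:
    print(tag, "->", coarse_pos(tag))
```

Tags that have no counterpart among the eight categories, such as punctuation or `FW`, fall through to `"other"` rather than being forced into a category.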

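As an example, we use NLTK to tag a passage of English text.

The passage is taken from a Washington Post report of March 13, 2019, on Canada grounding the Boeing 737 airliner. The original text reads: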

After the Lion Air crash, questions were raised, so Boeing sent further instructions that it said pilots should know,” he said, according to the Associated Press. “Those relate to the specific behavior of this specific type of aircraft. As a result, training was given by Boeing, and our pilots have taken it and put it into our manuals.

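We split the passage into sentences, tokenize each sentence into words, tag each token with its part of speech, and then print the tagged tokens one sentence at a time. The code is as follows: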

import nltk

# NLTK's tokenizer and tagger models must be available locally; they can be
# fetched once with nltk.download() (resource names vary by NLTK version).
passage = """After the Lion Air crash, questions were raised, so Boeing sent further instructions that it said pilots should know,” he said, according to the Associated Press. “Those relate to the specific behavior of this specific type of aircraft. As a result, training was given by Boeing, and our pilots have taken it and put it into our manuals."""
sentences = nltk.sent_tokenize(passage)  # split the passage into sentences
for sent in sentences:
    tokens = nltk.word_tokenize(sent)    # split the sentence into tokens
    pos_tags = nltk.pos_tag(tokens)      # list of (word, tag) tuples
    print(pos_tags)

The print() call produces the following output:

[('After', 'IN'), ('the', 'DT'), ('Lion', 'NNP'), ('Air', 'NNP'), ('crash', 'NN'), (',', ','), ('questions', 'NNS'), ('were', 'VBD'), ('raised', 'VBN'), (',', ','), ('so', 'IN'), ('Boeing', 'NNP'), ('sent', 'VBD'), ('further', 'JJ'), ('instructions', 'NNS'), ('that', 'IN'), ('it', 'PRP'), ('said', 'VBD'), ('pilots', 'NNS'), ('should', 'MD'), ('know', 'VB'), (',', ','), ('”', 'FW'), ('he', 'PRP'), ('said', 'VBD'), (',', ','), ('according', 'VBG'), ('to', 'TO'), ('the', 'DT'), ('Associated', 'NNP'), ('Press', 'NNP'), ('.', '.')]
[('“Those', 'JJ'), ('relate', 'NN'), ('to', 'TO'), ('the', 'DT'), ('specific', 'JJ'), ('behavior', 'NN'), ('of', 'IN'), ('this', 'DT'), ('specific', 'JJ'), ('type', 'NN'), ('of', 'IN'), ('aircraft', 'NN'), ('.', '.')]
[('As', 'IN'), ('a', 'DT'), ('result', 'NN'), (',', ','), ('training', 'NN'), ('was', 'VBD'), ('given', 'VBN'), ('by', 'IN'), ('Boeing', 'NNP'), (',', ','), ('and', 'CC'), ('our', 'PRP$'), ('pilots', 'NNS'), ('have', 'VBP'), ('taken', 'VBN'), ('it', 'PRP'), ('and', 'CC'), ('put', 'VB'), ('it', 'PRP'), ('into', 'IN'), ('our', 'PRP$'), ('manuals', 'NNS'), ('.', '.')]

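How to read the output above: each sentence in the passage is a list, and each word in a sentence is represented, together with its tag, as a tuple. The left element is the word itself and the right element is the abbreviated POS tag. The meaning of each abbreviation can be looked up in the Penn Treebank POS tags table.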
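For convenience, the abbreviations that occur in this particular output can be collected into a small lookup table. The sketch below covers only those tags, with descriptions following the Penn Treebank tag list; `PENN_TAGS` and `describe` are illustrative names, not NLTK functions, and the full tagset contains more entries than shown here.

```python
# Descriptions for the tags that appear in the output above.
# This is a subset of the Penn Treebank tagset, not the complete list.
PENN_TAGS = {
    "CC": "coordinating conjunction",
    "DT": "determiner",
    "FW": "foreign word",
    "IN": "preposition or subordinating conjunction",
    "JJ": "adjective",
    "MD": "modal",
    "NN": "noun, singular or mass",
    "NNP": "proper noun, singular",
    "NNS": "noun, plural",
    "PRP": "personal pronoun",
    "PRP$": "possessive pronoun",
    "TO": "to",
    "VB": "verb, base form",
    "VBD": "verb, past tense",
    "VBG": "verb, gerund or present participle",
    "VBN": "verb, past participle",
    "VBP": "verb, non-3rd person singular present",
}


def describe(tagged_tokens):
    """Return (word, tag, description) triples.

    Tags missing from PENN_TAGS (such as punctuation) are echoed unchanged.
    """
    return [(word, tag, PENN_TAGS.get(tag, tag)) for word, tag in tagged_tokens]


# The first few tuples from the first sentence of the output.
sample = [("After", "IN"), ("the", "DT"), ("Lion", "NNP"),
          ("Air", "NNP"), ("crash", "NN"), (",", ",")]
for word, tag, meaning in describe(sample):
    print(f"{word:<6} {tag:<4} {meaning}")
```

The same function can be applied to any list returned by `nltk.pos_tag`, since it only expects an iterable of `(word, tag)` pairs.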

We can modify the code slightly so that the result is easier to read:

import nltk

passage = """After the Lion Air crash, questions were raised, so Boeing sent further instructions that it said pilots should know,” he said, according to the Associated Press. “Those relate to the specific behavior of this specific type of aircraft. As a result, training was given by Boeing, and our pilots have taken it and put it into our manuals."""
sentences = nltk.sent_tokenize(passage)
for sent in sentences:
    tokens = nltk.word_tokenize(sent)
    pos_tags = nltk.pos_tag(tokens)
    # Print every token as word(TAG) on one continuous line.
    for word, tag in pos_tags:
        print("{}({}) ".format(word, tag), end="")

The output is as follows (each tag follows its word in parentheses):

After(IN) the(DT) Lion(NNP) Air(NNP) crash(NN) ,(,) questions(NNS) were(VBD) raised(VBN) ,(,) so(IN) Boeing(NNP) sent(VBD) further(JJ) instructions(NNS) that(IN) it(PRP) said(VBD) pilots(NNS) should(MD) know(VB) ,(,) ”(FW) he(PRP) said(VBD) ,(,) according(VBG) to(TO) the(DT) Associated(NNP) Press(NNP) .(.) “Those(JJ) relate(NN) to(TO) the(DT) specific(JJ) behavior(NN) of(IN) this(DT) specific(JJ) type(NN) of(IN) aircraft(NN) .(.) As(IN) a(DT) result(NN) ,(,) training(NN) was(VBD) given(VBN) by(IN) Boeing(NNP) ,(,) and(CC) our(PRP$) pilots(NNS) have(VBP) taken(VBN) it(PRP) and(CC) put(VB) it(PRP) into(IN) our(PRP$) manuals(NNS) .(.)
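In the output above, the opening curly quote stays attached to its word, so the token `“Those` is tagged `JJ`, and the closing quote `”` is tagged `FW`. One possible remedy, which the original example does not apply, is to replace typographic quotes with ASCII quotes before tokenizing. The sketch below shows only that preprocessing step; `QUOTE_MAP` and `normalize_quotes` are hypothetical names, and whether the resulting tags change depends on the NLTK version and models in use.

```python
# Map typographic quotes to ASCII quotes before tokenizing.
# QUOTE_MAP and normalize_quotes are illustrative names, not part of NLTK.
QUOTE_MAP = str.maketrans({
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark
})


def normalize_quotes(text):
    """Return text with curly quotes replaced by their ASCII equivalents."""
    return text.translate(QUOTE_MAP)


cleaned = normalize_quotes("\u201cThose relate to the specific behavior")
print(cleaned)
```

The cleaned string can then be passed to `nltk.sent_tokenize` in place of the raw passage.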

References:

1. https://en.wikipedia.org/wiki/Part_of_speech

2. https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html

3. https://www.washingtonpost.com/local/trafficandcommuting/canada-grounds-boeing-737-max-8-leaving-us-as-last-major-user-of-plane/2019/03/13/25ac2414-459d-11e9-90f0-0ccfeec87a61_story.html?utm_term=.f359a714d4d8

Reposted from: https://www.cnblogs.com/creatures-of-habit/p/10520079.html
