Table of Contents

1. What is Part of Speech (POS)?

2. Information Extraction

2.1 POS Open Classes

2.2 POS Closed Classes (English)

2.3 Ambiguity

2.4 POS Ambiguity in News Headlines

3. Tagsets

3.1 Major Penn Treebank Tags

3.2 Derived Tags (Open Class)

3.3 Derived Tags (Closed Class)

3.4 Tagged Text Example

4. Automatic Tagging

4.1 Why Automatically POS tag?

4.2 Automatic Taggers

4.3 Rule-based tagging

4.4 Unigram tagger

4.5 Classifier-Based Tagging

4.6 Hidden Markov Models

4.7 Unknown Words

4.8 A Final Word


1. What is Part of Speech (POS)?

AKA word classes, morphological classes, syntactic categories

Nouns, verbs, adjectives, etc.

POS tells us quite a bit about a word and its neighbours:

  • nouns are often preceded by determiners
  • verbs are often preceded by nouns
  • content as a noun is pronounced CONtent
  • content as an adjective is pronounced conTENT

2. Information Extraction

Given this:

  • “Brasilia, the Brazilian capital, was founded in 1960.”

Obtain this:

  • capital(Brazil, Brasilia)
  • founded(Brasilia, 1960)

Many steps are involved, but the first is to identify the nouns (Brasilia, capital), adjectives (Brazilian), verbs (founded) and numbers (1960); a rough sketch of this step follows.
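
As a rough illustration, NLTK's off-the-shelf tagger (one possible tool among many) assigns Penn Treebank tags to the example sentence; the output shown in the comment is what a typical run produces and may vary slightly:

    import nltk

    # One-off downloads for the tokenizer and the default POS tagger
    nltk.download("punkt")
    nltk.download("averaged_perceptron_tagger")

    sentence = "Brasilia, the Brazilian capital, was founded in 1960."
    tokens = nltk.word_tokenize(sentence)
    print(nltk.pos_tag(tokens))
    # Expect something like:
    # [('Brasilia', 'NNP'), (',', ','), ('the', 'DT'), ('Brazilian', 'JJ'),
    #  ('capital', 'NN'), (',', ','), ('was', 'VBD'), ('founded', 'VBN'),
    #  ('in', 'IN'), ('1960', 'CD'), ('.', '.')]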

2.1 POS Open Classes

Open vs closed classes: how readily do POS categories take on new words? There are just a few open classes:

Nouns

  • Proper (Australia) versus common (wombat)
  • Mass (rice) versus count (bowls)

Verbs

  • Rich inflection (go/goes/going/gone/went)
  • Auxiliary verbs (be, have, and do in English)
  • Transitivity (wait versus hit versus give) — number of arguments

Adjectives

  • Gradable (happy) versus non-gradable (computational)

Adverbs

  • Manner (slowly)
  • Locative (here)
  • Degree (really)
  • Temporal (today)

2.2 POS Closed Classes (English)

Prepositions (in, on, with, for, of, over,…)

  • on the table

Particles

  • brushed himself off

Determiners

  • Articles (a, an, the)
  • Demonstratives (this, that, these, those)
  • Quantifiers (each, every, some, two,…)

Pronouns

  • Personal (I, me, she,…)
  • Possessive (my, our,…)
  • Interrogative or Wh (who, what, …)

Conjunctions

  • Coordinating (and, or, but)
  • Subordinating (if, although, that, …)

Modal verbs

  • Ability (can, could)
  • Permission (can, may)
  • Possibility (may, might, could, will)
  • Necessity (must)

And some more…

  • negatives, politeness markers, etc

2.3 Ambiguity

Many word types belong to multiple classes

POS depends on context

Compare:

  • Time flies like an arrow
  • Fruit flies like a banana
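
One plausible tagging of each reading, using Penn Treebank-style tags (introduced in the next section); these taggings are illustrative, not the output of any particular tagger:

  • Time/NN flies/VBZ like/IN an/DT arrow/NN
  • Fruit/NN flies/NNS like/VBP a/DT banana/NN

Here flies is a verb in the first sentence but a noun in the second, and like switches from preposition to verb.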

2.4 POS Ambiguity in News Headlines

  • British Left Waffles on Falkland Islands
    • [British Left] [Waffles] [on] [Falkland Islands]
  • Juvenile Court to Try Shooting Defendant
    • [Juvenile Court] [to] [Try] [Shooting Defendant]
  • Teachers Strike Idle Kids
    • [Teachers Strike] [Idle Kids]
  • Eye Drops Off Shelf
    • [Eye Drops] [Off Shelf]

3. Tagsets

A compact representation of POS information

  • Usually ≤ 4 capitalized characters (e.g. NN = noun)
  • Often includes inflectional distinctions

Major English tagsets

  • Brown (87 tags)
  • Penn Treebank (45 tags)
  • CLAWS/BNC (61 tags)
  • “Universal” (12 tags)

At least one tagset for all major languages

3.1 Major Penn Treebank Tags

  • NN noun
  • VB verb
  • JJ adjective
  • RB adverb
  • DT determiner
  • CD cardinal number
  • IN preposition
  • PRP personal pronoun
  • MD modal
  • CC coordinating conjunction
  • RP particle
  • WP wh-pronoun
  • TO to

3.2 Derived Tags (Open Class)

NN (noun singular, wombat)

  • NNS (plural, wombats)
  • NNP (proper, Australia)
  • NNPS (proper plural, Australians)

VB (verb infinitive, eat)

  • VBP (1st/2nd person present, eat)
  • VBZ (3rd person singular, eats)
  • VBD (past tense, ate)
  • VBG (gerund, eating)
  • VBN (past participle, eaten)

JJ (adjective, nice)

  • JJR (comparative, nicer)
  • JJS (superlative, nicest)

RB (adverb, fast)

  • RBR (comparative, faster)
  • RBS (superlative, fastest)

3.3 Derived Tags (Closed Class)

PRP (pronoun personal, I)

  • PRP$ (possessive, my)

WP (wh-pronoun, what)

  • WP$ (possessive, whose)
  • WDT (wh-determiner, which)
  • WRB (wh-adverb, where)

3.4 Tagged Text Example
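
An illustrative sentence in the word/TAG format commonly used for Penn Treebank-tagged text (a made-up example, not taken from the original slides):

  • The/DT wombats/NNS were/VBD quietly/RB eating/VBG the/DT fresh/JJ grass/NN in/IN the/DT park/NN ./.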

4. Automatic Tagging

4.1 Why Automatically POS tag?

Important for morphological analysis, e.g. lemmatisation

For some applications, we want to focus on certain POS

  • E.g. nouns are important for information retrieval, adjectives for sentiment analysis

Very useful features for certain classification tasks

  • E.g. genre attribution (fiction vs. non-fiction)

POS tags can help with word sense disambiguation

  • E.g. cross/NN vs cross/VB vs cross/JJ

Can use them to create larger structures (parsing; lectures 14–16)

4.2 Automatic Taggers

Rule-based taggers

Statistical taggers

  • Unigram tagger
  • Classifier-based taggers
  • Hidden Markov Model (HMM) taggers

4.3 Rule-based tagging

Typically starts with a list of possible tags for each word

  • From a lexical resource, or a corpus

Often includes other lexical information, e.g. verb subcategorisation (its arguments)

Apply rules to narrow down to a single tag

  • E.g. if a DT comes directly before the word, eliminate VB
  • Relies on some unambiguous contexts

Large systems have 1000s of constraints
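
A minimal sketch of the elimination idea in Python, assuming a tiny hand-written lexicon of possible tags (both the lexicon and the single rule are made up for illustration):

    # Possible tags per word, e.g. from a lexical resource (toy lexicon for illustration)
    lexicon = {
        "the":  {"DT"},
        "can":  {"MD", "VB", "NN"},
        "fish": {"NN", "VB"},
    }

    def rule_based_tag(words):
        tags = [set(lexicon.get(w.lower(), {"NN"})) for w in words]
        for i in range(1, len(words)):
            # Rule: a word directly after an unambiguous determiner is not a verb or modal
            if tags[i - 1] == {"DT"}:
                tags[i] -= {"VB", "MD"}
        return list(zip(words, tags))

    print(rule_based_tag(["The", "can", "fish"]))
    # e.g. [('The', {'DT'}), ('can', {'NN'}), ('fish', {'NN', 'VB'})] (set order may vary)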

4.4 Unigram tagger

Assign most common tag to each word type

Requires a corpus of tagged words

“Model” is just a look-up table

But actually quite good, ~90% accuracy

  • Correctly resolves about 75% of ambiguity

Often considered the baseline for more complex approaches
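
A minimal sketch of such a look-up table, built from NLTK's sample of the Penn Treebank (this assumes the corpus has already been fetched with nltk.download("treebank")):

    from collections import Counter, defaultdict
    from nltk.corpus import treebank

    # Count how often each tag is assigned to each word type in the tagged corpus
    counts = defaultdict(Counter)
    for sent in treebank.tagged_sents():
        for word, tag in sent:
            counts[word.lower()][tag] += 1

    # The "model" is just a look-up table: the most common tag for each word type
    most_common_tag = {w: c.most_common(1)[0][0] for w, c in counts.items()}

    def unigram_tag(words, default="NN"):
        # Back off to NN for unseen words (nouns are the safest guess; see 4.7)
        return [(w, most_common_tag.get(w.lower(), default)) for w in words]

    print(unigram_tag("Time flies like an arrow".split()))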

4.5 Classifier-Based Tagging

Use a standard discriminative classifier (e.g. logistic regression, neural network), with features:

  • Target word
  • Lexical context around the word
  • Already classified tags in sentence

But can suffer from error propagation: wrong predictions from previous steps affect the next ones
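
A sketch of one possible feature function (the feature set is illustrative, not prescribed by the lecture); the resulting dictionaries could be fed to a standard classifier, e.g. scikit-learn's DictVectorizer followed by LogisticRegression:

    def tagging_features(sentence, i, prev_tags):
        """Features for predicting the tag of sentence[i], given the tags predicted so far."""
        word = sentence[i]
        return {
            "word": word.lower(),                                             # target word
            "suffix3": word[-3:].lower(),                                     # crude morphology
            "is_capitalised": word[0].isupper(),
            "prev_word": sentence[i - 1].lower() if i > 0 else "<s>",         # lexical context
            "next_word": sentence[i + 1].lower() if i + 1 < len(sentence) else "</s>",
            "prev_tag": prev_tags[-1] if prev_tags else "<s>",                # already-classified tag
        }

    sent = ["Time", "flies", "like", "an", "arrow"]
    print(tagging_features(sent, 1, ["NN"]))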

4.6 Hidden Markov Models

A basic sequential (or structured) model

Like sequential classifiers, HMMs use both the previous tag and lexical evidence

Unlike classifiers, they consider all possibilities for the previous tag

Unlike classifiers, they treat previous-tag evidence and lexical evidence as independent of each other

  • Less sparsity
  • Fast algorithms for sequential prediction, i.e. finding the best tagging of entire word sequence
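
A sketch of that independence assumption as a formula (notation only, details next lecture): the joint probability of a tag sequence t1 … tn and a word sequence w1 … wn factorises roughly as

    P(t1 … tn, w1 … wn) = ∏i P(wi | ti) × P(ti | ti−1)

i.e. each word depends only on its own tag, and each tag only on the tag before it.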

Next lecture!

4.7 Unknown Words

Huge problem in morphologically rich languages (e.g. Turkish)

Can use things we’ve seen only once (hapax legomena) as the best guess for things we’ve never seen before

  • Tend to be nouns, followed by verbs
  • Unlikely to be determiners

Can use sub-word representations to capture morphology (look for common affixes)
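
A toy sketch of the affix idea (the particular suffixes and guessed tags are illustrative, not a tested rule set):

    def guess_unknown_tag(word):
        # Heuristic guess for a word never seen in training, based on common English affixes
        if word[0].isupper():
            return "NNP"
        if word.endswith("ing"):
            return "VBG"
        if word.endswith("ed"):
            return "VBD"
        if word.endswith("ly"):
            return "RB"
        if word.endswith("s"):
            return "NNS"
        return "NN"   # nouns are the most likely class for unseen words

    print([guess_unknown_tag(w) for w in ["Brasilia", "rewilding", "defrosted", "quizzically"]])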

4.8 A Final Word

  • Part of speech is a fundamental intersection between linguistics and automatic text analysis
  • A fundamental task in NLP that provides useful information for many other applications
  • The methods applied to it are typical of language tasks in general, e.g. probabilistic, sequential machine learning
