Chapter 8 Part-of-Speech Tagging

Reading notes on Speech and Language Processing, 3rd edition draft.

The eight traditional parts-of-speech: noun, verb, pronoun, preposition, adverb, conjunction, participle, and article.

Parts-of-speech (also known as POS, word classes, or syntactic categories) are useful because they reveal a lot about a word and its neighbors. Knowing whether a word is a noun or a verb tells us about likely neighboring words (nouns are preceded by determiners and adjectives, verbs by nouns) and syntactic structure (nouns are generally part of noun phrases), making part-of-speech tagging a key aspect of parsing (Chapter 11). Parts of speech are useful features for labeling named entities like people or organizations in information extraction (Chapter 17), or for coreference resolution (Chapter 20). A word’s part-of-speech can even play a role in speech recognition or synthesis, e.g., the word content is pronounced CONtent when it is a noun and conTENT when it is an adjective.

This chapter introduces parts-of-speech, and then introduces two algorithms for part-of-speech tagging, the task of assigning parts-of-speech to words. One is generative—the Hidden Markov Model (HMM)—and one is discriminative—the Maximum Entropy Markov Model (MEMM). Chapter 9 then introduces a third algorithm based on the recurrent neural network (RNN). All three have roughly equal performance but, as we’ll see, have different tradeoffs.

8.1 (Mostly) English Word Classes

While word classes do have semantic tendencies—adjectives, for example, often describe properties and nouns people— parts-of-speech are traditionally defined instead based on syntactic and morphological function, grouping words that have similar neighboring words (their distributional properties) or take similar affixes (their morphological properties).

Parts-of-speech can be divided into two broad supercategories: closed class types and open class types. Closed classes are those with relatively fixed membership, such as prepositions—new prepositions are rarely coined. Closed class words are generally function words like of, it, and, or you, which tend to be very short, occur frequently, and often have structuring uses in grammar.

Four major open classes occur in the languages of the world: nouns, verbs, adjectives, and adverbs.

Open class nouns fall into two classes, proper nouns and common nouns. Common nouns are divided into count nouns and mass nouns.

Verbs refer to actions and processes, including main verbs like draw, provide, and go. English verbs have inflections (non-third-person-sg (eat), third-person-sg (eats), progressive (eating), past participle (eaten)).

The third open class English form is adjectives, a class that includes many terms for properties or qualities.

The final open class form, adverbs, is rather a hodge-podge in both form and meaning. Directional adverbs or locative adverbs (home, here, downhill) specify the direction or location of some action; degree adverbs (extremely, very, somewhat) specify the extent of some action, process, or property; manner adverbs (slowly, slinkily, delicately) describe the manner of some action or process; and temporal adverbs describe the time that some action or event took place (yesterday, Monday). Because of the heterogeneous nature of this class, some adverbs (e.g., temporal adverbs like Monday) are tagged in some tagging schemes as nouns.

Some of the important closed classes in English include:

prepositions: on, under, over, near, by, at, from, to, with
particles: up, down, on, off, in, out, at, by
determiners: a, an, the
conjunctions: and, but, or, as, if, when
pronouns: she, who, I, others
auxiliary verbs: can, may, should, are
numerals: one, two, three, first, second, third

Prepositions occur before noun phrases. A particle resembles a preposition or an adverb and is used in combination with a verb.

A verb and a particle that act as a single syntactic and/or semantic unit are called a phrasal verb. The meaning of phrasal verbs is often problematically noncompositional—not predictable from the distinct meanings of the verb and the particle. Thus, turn down means something like ‘reject’, rule out ‘eliminate’, find out ‘discover’, and go on ‘continue’.

A closed class that occurs with nouns, often marking the beginning of a noun phrase, is the determiner. One small subtype of determiners is the article: English has three articles: a, an, and the. Other determiners include this and that (this chapter, that page). A and an mark a noun phrase as indefinite, while the can mark it as definite; definiteness is a discourse property (Chapter 21). Articles are quite frequent in English; indeed, the is the most frequently occurring word in most corpora of written English, and a and an are generally right behind.

Conjunctions join two phrases, clauses, or sentences. Coordinating conjunctions like and, or, and but join two elements of equal status. Subordinating conjunctions are used when one of the elements has some embedded status. For example, that in “I thought that you might like some milk” is a subordinating conjunction that links the main clause I thought with the subordinate clause you might like some milk. This clause is called subordinate because this entire clause is the “content” of the main verb thought. Subordinating conjunctions like that which link a verb to its argument in this way are also called complementizers.

Pronouns are forms that often act as a kind of shorthand for referring to some noun phrase or entity or event. Personal pronouns refer to persons or entities (you, she, I, it, me, etc.). Possessive pronouns are forms of personal pronouns that indicate either actual possession or more often just an abstract relation between the person and some object (my, your, his, her, its, one’s, our, their). Wh-pronouns (what, who, whom, whoever) are used in certain question forms, or may also act as complementizers (Frida, who married Diego…).

A closed class subtype of English verbs are the auxiliary verbs. Cross-linguistically, auxiliaries mark semantic features of a main verb: whether an action takes place in the present, past, or future (tense), whether it is completed (aspect), whether it is negated (polarity), and whether an action is necessary, possible, suggested, or desired (mood). English auxiliaries include the copula verb be, the two verbs do and have, along with their inflected forms, as well as a class of modal verbs. Be is called a copula because it connects subjects with certain kinds of predicate nominals and adjectives (He is a duck). The verb have can mark the perfect tenses (I have gone, I had gone), and be is used as part of the passive (We were robbed) or progressive (We are leaving) constructions. Modals are used to mark the mood associated with the event depicted by the main verb: can indicates ability or possibility, may permission or possibility, must necessity. There is also a modal use of have (e.g., I have to go).

English also has many words of more or less unique function, including interjections (oh, hey, alas, uh, um), negatives (no, not), politeness markers (please, thank you), greetings (hello, goodbye), and the existential there (there are two on the table) among others. These classes may be distinguished or lumped together as interjections or adverbs depending on the purpose of the labeling.

8.2 The Penn Treebank Part-of-Speech Tagset

An important tagset for English is the 45-tag Penn Treebank tagset (Marcus et al., 1993), shown in Fig. 8.1, which has been used to label many corpora. In such labelings, parts-of-speech are generally represented by placing the tag after each word, delimited by a slash:

[Figure 8.1: The 45-tag Penn Treebank part-of-speech tagset]

(8.1) The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN of/IN other/JJ topics/NNS ./.

(8.2) There/EX are/VBP 70/CD children/NNS there/RB

(8.3) Preliminary/JJ findings/NNS were/VBD reported/VBN in/IN today/NN ’s/POS New/NNP England/NNP Journal/NNP of/IN Medicine/NNP ./.

Example (8.1) shows the determiners the and a, the adjectives grand and other, the common nouns jury, number, and topics, and the past tense verb commented. Example (8.2) shows the use of the EX tag to mark the existential there construction in English, and, for comparison, another use of there which is tagged as an adverb (RB). Example (8.3) shows the segmentation of the possessive morpheme ’s and a passive construction, ‘were reported’, in which reported is marked as a past participle (VBN). Note that since New England Journal of Medicine is a proper noun, the Treebank tagging chooses to mark each noun in it separately as NNP, including journal and medicine, which might otherwise be labeled as common nouns (NN).

Corpora labeled with parts-of-speech are crucial training (and testing) sets for statistical tagging algorithms. Three main tagged corpora are consistently used for training and testing part-of-speech taggers for English. The Brown corpus is a million words of samples from 500 written texts from different genres published in the United States in 1961. The WSJ corpus contains a million words published in the Wall Street Journal in 1989. The Switchboard corpus consists of 2 million words of telephone conversations collected in 1990-1991. The corpora were created by running an automatic part-of-speech tagger on the texts and then human annotators hand-corrected each tag.

There are some minor differences in the tagsets used by the corpora. For example in the WSJ and Brown corpora, the single Penn tag TO is used for both the infinitive to (I like to race) and the preposition to (go to the store), while in Switchboard the tag TO is reserved for the infinitive use of to and the preposition is tagged IN:

Well/UH ,/, I/PRP ,/, I/PRP want/VBP to/TO go/VB to/IN a/DT restaurant/NN

Finally, there are some idiosyncrasies inherent in any tagset. For example, because the Penn 45 tags were collapsed from a larger 87-tag tagset, the original Brown tagset, some potentially useful distinctions were lost. The Penn tagset was designed for a treebank in which sentences were parsed, and so it leaves off syntactic information recoverable from the parse tree. Thus for example the Penn tag IN is used for both subordinating conjunctions like if, when, unless, after:

after/IN spending/VBG a/DT day/NN at/IN the/DT beach/NN

and prepositions like in, on, after:

after/IN sunrise/NN

Words are generally tokenized before tagging. The Penn Treebank and the British National Corpus split contractions and the ’s-genitive from their stems:

would/MD n’t/RB
children/NNS ’s/POS

The Treebank tagset assumes that tokenization of multipart words like New York is done at whitespace, thus tagging a New York City firm as a/DT New/NNP York/NNP City/NNP firm/NN.

Another commonly used tagset, the Universal POS tag set of the Universal Dependencies project (Nivre et al., 2016a), is used when building systems that can tag many languages. See Section 8.7.

8.3 Part-of-Speech Tagging

Part-of-speech tagging is the process of assigning a part-of-speech marker to each word in an input text. The input to a tagging algorithm is a sequence of (tokenized) words and a tagset, and the output is a sequence of tags, one per token.

Tagging is a disambiguation task; words are ambiguous —have more than one possible part-of-speech—and the goal is to find the correct tag for the situation.

Some of the most ambiguous frequent words are that, back, down, put and set.

Nonetheless, many words are easy to disambiguate, because their different tags aren’t equally likely. For example, a can be a determiner or the letter a, but the determiner sense is much more likely. This idea suggests a simplistic baseline algorithm for part-of-speech tagging: given an ambiguous word, choose the tag which is most frequent in the training corpus. This is a key concept:

Most Frequent Class Baseline: Always compare a classifier against a baseline at least as good as the most frequent class baseline (assigning each token to the class it occurred in most often in the training set).

How good is this baseline? A standard way to measure the performance of part-of-speech taggers is accuracy: the percentage of tags correctly labeled (matching human labels on a test set).
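
To make the baseline concrete, here is a minimal sketch in Python; the toy corpus and the fallback tag for unseen words are hypothetical choices for illustration, not part of the chapter.

```python
from collections import Counter, defaultdict

def train_most_frequent_class(tagged_corpus):
    """Count, for every word, how often each tag occurred in training."""
    counts = defaultdict(Counter)
    for word, tag in tagged_corpus:
        counts[word][tag] += 1
    # For each word, remember only its single most frequent tag.
    most_frequent = {w: c.most_common(1)[0][0] for w, c in counts.items()}
    # Overall most frequent tag, used as a crude fallback for unseen words.
    overall = Counter(tag for _, tag in tagged_corpus).most_common(1)[0][0]
    return most_frequent, overall

def tag_baseline(words, most_frequent, fallback_tag):
    return [most_frequent.get(w, fallback_tag) for w in words]

# Toy example (hypothetical data, just to show the interface).
corpus = [("the", "DT"), ("back", "NN"), ("the", "DT"), ("back", "VB"),
          ("back", "NN"), ("bill", "NN")]
mfc, fallback = train_most_frequent_class(corpus)
print(tag_baseline(["the", "back", "janet"], mfc, fallback))
# -> ['DT', 'NN', 'NN']
```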

8.4 HMM Part-of-Speech Tagging

In this section we introduce the use of the Hidden Markov Model for part-of-speech tagging. The HMM is a sequence model. A sequence model or sequence classifier is a model whose job is to assign a label or class to each unit in a sequence, thus mapping a sequence of observations to a sequence of labels. An HMM is a probabilistic sequence model: given a sequence of units (words, letters, morphemes, sentences, whatever), it computes a probability distribution over possible sequences of labels and chooses the best label sequence.

8.4.1 Markov Chains

The HMM is based on augmenting the Markov chain.

Consider a sequence of state variables $q_1, q_2, \ldots, q_i$. A Markov model embodies the Markov assumption on the probabilities of this sequence: that when predicting the future, the past doesn’t matter, only the present.
$$P(q_i = a \mid q_1 \ldots q_{i-1}) = P(q_i = a \mid q_{i-1})$$
Formally, a Markov chain is specified by the following components:

$Q = q_1 q_2 \ldots q_N$: a set of $N$ states
$A = a_{11}\ldots a_{NN}$: a transition probability matrix $A$, each $a_{ij}$ representing the probability of moving from state $i$ to state $j$, s.t. $\sum_{j=1}^N a_{ij} = 1\ \forall i$
$\pi = \pi_1, \pi_2, \ldots, \pi_N$: an initial probability distribution over states. $\pi_i$ is the probability that the Markov chain will start in state $i$. Some states $j$ may have $\pi_j = 0$, meaning that they cannot be initial states. Also, $\sum_{i=1}^N \pi_i = 1$.
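
To make these definitions concrete, the following sketch scores a state sequence under a tiny Markov chain; the weather states and the probability values are invented for illustration only.

```python
import numpy as np

states = ["HOT", "COLD"]                  # Q: the set of states
pi = np.array([0.6, 0.4])                 # initial distribution (sums to 1)
A = np.array([[0.7, 0.3],                 # A[i][j] = P(q_t = j | q_{t-1} = i)
              [0.4, 0.6]])                # each row sums to 1

def sequence_probability(seq):
    """P(q_1 ... q_T) = pi[q_1] * prod_t A[q_{t-1}][q_t], using the Markov assumption."""
    idx = [states.index(s) for s in seq]
    p = pi[idx[0]]
    for prev, cur in zip(idx, idx[1:]):
        p *= A[prev][cur]
    return p

print(sequence_probability(["HOT", "HOT", "COLD"]))  # 0.6 * 0.7 * 0.3 = 0.126
```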

8.4.2 The Hidden Markov Model

A Markov chain is useful when we need to compute a probability for a sequence of observable events. In many cases, however, the events we are interested in are hidden: we don’t observe them directly. For example we don’t normally observe part-of-speech tags in a text. Rather, we see words, and must infer the tags from the word sequence. We call the tags hidden because they are not observed.

A hidden Markov model (HMM) allows us to talk about both observed events (like words that we see in the input) and hidden events (like part-of-speech tags) that we think of as causal factors in our probabilistic model. An HMM is specified by the following components:

$Q = q_1 q_2 \ldots q_N$: a set of $N$ hidden states
$A = a_{11}\ldots a_{NN}$: a transition probability matrix $A$, each $a_{ij}$ representing the probability of moving from hidden state $i$ to hidden state $j$, s.t. $\sum_{j=1}^N a_{ij} = 1\ \forall i$
$O = o_1 o_2 \ldots o_T$: a sequence of $T$ observations, each one drawn from a vocabulary $V = v_1, v_2, \ldots, v_V$
$B = b_i(o_t)$: a sequence of observation likelihoods, also called emission probabilities, each expressing the probability of an observation $o_t$ being generated from a hidden state $i$
$\pi = \pi_1, \pi_2, \ldots, \pi_N$: an initial probability distribution over hidden states. $\pi_i$ is the probability that the Markov chain will start in state $i$. Some states $j$ may have $\pi_j = 0$, meaning that they cannot be initial states. Also, $\sum_{i=1}^N \pi_i = 1$.

A first-order hidden Markov model instantiates two simplifying assumptions. First, as with a first-order Markov chain, the probability of a particular hidden state depends only on the previous state:
$$\textbf{Markov Assumption:}\quad P(q_i \mid q_1 \ldots q_{i-1}) = P(q_i \mid q_{i-1})$$
Second, the probability of an output observation $o_i$ depends only on the hidden state that produced the observation, $q_i$, and not on any other states or any other observations:
$$\textbf{Output Independence:}\quad P(o_i \mid q_1 \ldots q_i, \ldots, q_T, o_1, \ldots, o_i, \ldots, o_T) = P(o_i \mid q_i)$$
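
Taken together, the two assumptions let the joint probability of a hidden state sequence and an observation sequence factor into initial, transition, and emission terms. A minimal sketch with an invented two-state HMM:

```python
import numpy as np

def joint_probability(pi, A, B, states_seq, obs_seq):
    """P(Q, O) = pi[q_1] * B[q_1][o_1] * prod_t A[q_{t-1}][q_t] * B[q_t][o_t],
    which is exactly the factorization licensed by the two assumptions above."""
    p = pi[states_seq[0]] * B[states_seq[0]][obs_seq[0]]
    for t in range(1, len(obs_seq)):
        p *= A[states_seq[t-1]][states_seq[t]] * B[states_seq[t]][obs_seq[t]]
    return p

# Tiny invented HMM: 2 hidden states, 2 observation symbols.
pi = np.array([0.8, 0.2])
A = np.array([[0.6, 0.4],
              [0.5, 0.5]])
B = np.array([[0.7, 0.3],
              [0.1, 0.9]])
print(joint_probability(pi, A, B, states_seq=[0, 1], obs_seq=[0, 1]))
# 0.8 * 0.7 * 0.4 * 0.9 = 0.2016
```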

8.4.3 The components of an HMM tagger

Let’s start by looking at the pieces of an HMM tagger, and then we’ll see how to use it to tag. An HMM has two components, the $A$ and $B$ probabilities.

The $A$ matrix contains the tag (hidden state) transition probabilities $P(t_i|t_{i-1})$, which represent the probability of a tag occurring given the previous tag. For example, modal verbs like will are very likely to be followed by a verb in the base form, a VB, like race, so we expect this probability to be high. We compute the maximum likelihood estimate of this transition probability by counting, out of the times we see the first tag in a labeled corpus, how often the first tag is followed by the second:
$$P(t_i|t_{i-1}) = \frac{C(t_{i-1},t_i)}{C(t_{i-1})}$$
In the WSJ corpus, for example, MD occurs 13124 times, of which it is followed by VB 10471 times, for an MLE estimate of
$$P(VB|MD) = \frac{C(MD,VB)}{C(MD)} = \frac{10471}{13124} = .80$$
The $B$ emission probabilities, $P(w_i|t_i)$, represent the probability, given a tag (say MD), that it will be associated with a given word (say will). The MLE of the emission probability is
$$P(w_i|t_i) = \frac{C(t_i,w_i)}{C(t_i)}$$
Of the 13124 occurrences of MD in the WSJ corpus, it is associated with will 4046 times:
$$P(will|MD) = \frac{C(MD,will)}{C(MD)} = \frac{4046}{13124} = .31$$
We saw this kind of Bayesian modeling in Chapter 4; recall that this likelihood term is not asking “which is the most likely tag for the word will?”. That would be the posterior $P(MD|will)$. Instead, $P(will|MD)$ answers the slightly counterintuitive question “If we were going to generate a MD, how likely is it that this modal would be will?”
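
Both MLE formulas are plain relative-frequency counts, so they can be estimated with a couple of dictionaries. A sketch, assuming the training data is a list of sentences of (word, tag) pairs; the toy corpus is invented purely to show the interface.

```python
from collections import Counter, defaultdict

def estimate_hmm(tagged_sentences):
    """MLE estimates of the HMM parameters:
    P(t_i|t_{i-1}) = C(t_{i-1}, t_i) / C(t_{i-1})  and  P(w_i|t_i) = C(t_i, w_i) / C(t_i)."""
    transition = defaultdict(Counter)   # C(t_{i-1}, t_i), with <s> as the sentence-initial "tag"
    emission = defaultdict(Counter)     # C(t_i, w_i)
    for sentence in tagged_sentences:
        prev = "<s>"
        for word, tag in sentence:
            transition[prev][tag] += 1
            emission[tag][word] += 1
            prev = tag
    A = {p: {t: c / sum(nxt.values()) for t, c in nxt.items()} for p, nxt in transition.items()}
    B = {t: {w: c / sum(ws.values()) for w, c in ws.items()} for t, ws in emission.items()}
    return A, B

# Invented toy corpus, just to show the shape of the estimates.
corpus = [[("will", "MD"), ("race", "VB")], [("will", "MD"), ("go", "VB")], [("will", "NN")]]
A, B = estimate_hmm(corpus)
print(A["MD"]["VB"])     # 1.0: every MD in the toy data is followed by VB
print(B["MD"]["will"])   # 1.0: the only word emitted by MD is "will"
```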

The $A$ transition probabilities and $B$ observation likelihoods of the HMM are illustrated in Fig. 8.4 for three states in an HMM part-of-speech tagger; the full tagger would have one state for each tag.

8.4.4 HMM tagging as decoding

For any model, such as an HMM, that contains hidden variables, the task of determining the sequence of hidden variables corresponding to the sequence of observations is called decoding. More formally,

Decoding: Given as input an HMM $\lambda = (A,B)$ and a sequence of observations $O = o_1, o_2, \ldots, o_T$, find the most probable sequence of states $Q = q_1 q_2 q_3 \ldots q_T$.

For part of speech tagging, the goal of HMM decoding is to choose the tag sequence $t_1^n$ that is most probable given the observation sequence of $n$ words $w_1^n$:
$$\hat t_1^n = \mathop{\textrm{argmax}}_{t_1^n} P(t_1^n|w_1^n)$$
The way we’ll do this in the HMM is to use Bayes’ rule to instead compute:
$$\hat t_1^n = \mathop{\textrm{argmax}}_{t_1^n} \frac{P(w_1^n|t_1^n)P(t_1^n)}{P(w_1^n)}$$
Furthermore, we simplify it by dropping the denominator $P(w_1^n)$:
$$\hat t_1^n = \mathop{\textrm{argmax}}_{t_1^n} P(w_1^n|t_1^n)P(t_1^n)$$
HMM taggers make two further simplifying assumptions. The first is that the probability of a word appearing depends only on its own tag and is independent of neighboring words and tags:
$$P(w_1^n|t_1^n) \approx \prod_{i=1}^n P(w_i|t_i)$$
The second assumption, the bigram assumption, is that the probability of a tag is dependent only on the previous tag, rather than the entire tag sequence:
$$P(t_1^n) \approx \prod_{i=1}^n P(t_i|t_{i-1})$$
The most probable tag sequence from a bigram tagger is thus:
$$\hat t_1^n = \mathop{\textrm{argmax}}_{t_1^n} P(t_1^n|w_1^n) \approx \mathop{\textrm{argmax}}_{t_1^n}\prod_{i=1}^n \overbrace{P(w_i|t_i)}^{\textrm{emission}}\;\overbrace{P(t_i|t_{i-1})}^{\textrm{transition}}$$

The two parts correspond neatly to the $B$ emission probability and $A$ transition probability that we just defined above!

8.4.5 The Viterbi Algorithm

The decoding algorithm for HMMs is the Viterbi algorithm. As an instance of dynamic programming, Viterbi resembles the dynamic programming minimum edit distance algorithm of Chapter 2.

[Figure 8.5: Pseudocode for the Viterbi algorithm for finding the optimal sequence of tags]

```python
import numpy as np

obs = "Janet will back the bill"
states = ['NNP', 'MD', 'VB', 'JJ', 'NN', 'RB', 'DT']

# Initial probabilities pi, one per state:
#       NNP     MD      VB      JJ      NN      RB      DT
pi = [0.2767, 0.0006, 0.0031, 0.0453, 0.0449, 0.0510, 0.2026]

# Transition probabilities state_graph[i][j] = P(t_j | t_i):
#                  NNP     MD      VB      JJ      NN      RB      DT
state_graph = [[0.3777, 0.0110, 0.0009, 0.0084, 0.0584, 0.0090, 0.0025],  # NNP
               [0.0008, 0.0002, 0.7968, 0.0005, 0.0008, 0.1698, 0.0041],  # MD
               [0.0322, 0.0005, 0.0050, 0.0837, 0.0615, 0.0514, 0.2231],  # VB
               [0.0366, 0.0004, 0.0001, 0.0733, 0.4509, 0.0036, 0.0036],  # JJ
               [0.0096, 0.0176, 0.0014, 0.0086, 0.1216, 0.0177, 0.0068],  # NN
               [0.0068, 0.0102, 0.1011, 0.1012, 0.0120, 0.0728, 0.0479],  # RB
               [0.1147, 0.0021, 0.0002, 0.2157, 0.4744, 0.0102, 0.0017]]  # DT

# Emission probabilities b[i][t] = P(w_t | t_i):
#     Janet     will      back      the       bill
b = [[0.000032, 0,        0,        0.000048, 0       ],  # NNP
     [0,        0.308431, 0,        0,        0       ],  # MD
     [0,        0.000028, 0.000672, 0,        0.000028],  # VB
     [0,        0,        0.000340, 0,        0       ],  # JJ
     [0,        0.000200, 0.000223, 0,        0.002337],  # NN
     [0,        0,        0.010446, 0,        0       ],  # RB
     [0,        0,        0,        0.506099, 0       ]]  # DT

n_obs = len(obs.split())
n_states = len(state_graph)
v = np.zeros(shape=(n_states, n_obs))
backpointer = np.zeros((n_states, n_obs), dtype=int) - 1  # -1 indicates not specified

# Initialization step
for row in range(n_states):
    v[row][0] = pi[row] * b[row][0]
    backpointer[row][0] = 0

# Recursion step: as shown in Eq. 8.18, at each cell we multiply the maximum
# probability so far by P(t_i|t_{i-1}) = state_graph[s][row] and P(w_i|t_i) = b[row][col]
for col in range(1, n_obs):
    for row in range(n_states):
        attempts = [v[s][col-1] * state_graph[s][row] * b[row][col] for s in range(n_states)]
        v[row][col] = np.max(attempts)
        backpointer[row][col] = np.argmax(attempts)

print(backpointer)

# Termination and backtrace
bestpathpointer = np.argmax([v[s][n_obs-1] for s in range(n_states)])
bestpath = [-1] * n_obs
bestpath[n_obs-1] = bestpathpointer
for col in range(n_obs-1, 0, -1):
    bestpath[col-1] = backpointer[bestpathpointer][col]
    bestpathpointer = bestpath[col-1]

print(bestpath)
print(np.array(states)[bestpath])
```

Output:

```
[[0 0 0 2 0]
 [0 0 0 0 0]
 [0 0 1 0 6]
 [0 0 1 0 0]
 [0 0 1 0 6]
 [0 0 1 0 0]
 [0 0 0 2 0]]
[0, 1, 2, 6, 4]
['NNP' 'MD' 'VB' 'DT' 'NN']
```

The Viterbi algorithm first sets up a probability matrix or lattice, with one column for each observation (word) $o_t$ and one row for each state (tag) in the state graph. Each column thus has a cell for each state $q_i$ in the single combined automaton. Figure 8.6
shows an intuition of this lattice for the sentence Janet will back the bill.

[Figure 8.6: A sketch of the lattice for Janet will back the bill, showing the possible tags for each word]

Each cell of the trellis, $v_t(j)$, represents the probability that the HMM is in state $j$ after seeing the first $t$ observations and passing through the most probable state sequence $q_1, \ldots, q_{t-1}$, given the HMM $\lambda$. The value of each cell $v_t(j)$ is computed by recursively taking the most probable path that could lead us to this cell. Formally, each cell expresses the probability
$$v_t(j) = \max_{q_1, \ldots, q_{t-1}} P(q_1, \ldots, q_{t-1}, o_1, o_2, \ldots, o_t, q_t = j \mid \lambda)$$
We represent the most probable path by taking the maximum over all possible previous state sequences $\max_{q_1, \ldots, q_{t-1}}$. Like other dynamic programming algorithms, Viterbi fills each cell recursively. Given that we had already computed the probability of being in every state at time $t-1$, we compute the Viterbi probability by taking the most probable of the extensions of the paths that lead to the current cell. For a given state $q_j$ at time $t$, the value $v_t(j)$ is computed as
$$v_t(j) = \max_{i=1}^{N} v_{t-1}(i)\, a_{ij}\, b_j(o_t)$$
The three factors that are multiplied for extending the previous paths to compute the Viterbi probability at time $t$ are

$v_{t-1}(i)$: the previous Viterbi path probability from the previous time step
$a_{ij}$: the transition probability from previous state $q_i$ to current state $q_j$
$b_j(o_t)$: the state observation likelihood of the observation symbol $o_t$ given the current state $j$

8.4.6 Working through an example

Let’s tag the sentence Janet will back the bill; the goal is the correct series of tags
(see also Fig. 8.6):

(8.21) Janet/NNP will/MD back/VB the/DT bill/NN

Let the HMM be defined by the two tables in Fig. 8.7 and Fig. 8.8. Figure 8.7 lists the $a_{ij}$ probabilities for transitioning between the hidden states (part-of-speech tags). Figure 8.8 expresses the $b_i(o_t)$ probabilities, the observation likelihoods of words given tags. This table is (slightly simplified) from counts in the WSJ corpus. So the word Janet only appears as an NNP, back has 4 possible parts of speech, and the word the can appear as a determiner or as an NNP (in titles like “Somewhere Over the Rainbow” all words are tagged as NNP).

[Figure 8.7: The $a_{ij}$ transition probabilities computed from the WSJ corpus; Figure 8.8: The $b_i(o_t)$ observation likelihoods computed from the WSJ corpus]

Figure 8.9 shows a fleshed-out version of the sketch we saw in Fig. 8.6, the Viterbi trellis for computing the best hidden state sequence for the observation sequence Janet will back the bill.

[Figure 8.9: The Viterbi trellis for computing the best hidden state sequence for Janet will back the bill]

There are five columns of states, one for each observation word. We begin in column 1 (for the word Janet) by setting the Viterbi value in each cell to the product of the $\pi$ transition probability (the start probability for that state $i$, which we get from the $\langle s \rangle$ entry of Fig. 8.7) and the observation likelihood of the word Janet given the tag for that cell. Most of the cells in the column are zero since the word Janet cannot be any of those tags. The reader should find this in Fig. 8.9.

Next, each cell in the will column gets updated. For each state, we compute the value $viterbi[s,t]$ by taking the maximum over the extensions of all the paths from the previous column that lead to the current cell. We have shown the values for the MD, VB, and NN cells. Each cell gets the max of the 7 values from the previous column, multiplied by the appropriate transition probability; as it happens in this case, most of them are zero from the previous column. The remaining value is multiplied by the relevant observation probability, and the (trivial) max is taken. In this case the final value, .0000002772, comes from the NNP state at the previous column. The reader should fill in the rest of the trellis in Fig. 8.9 and backtrace to reconstruct the correct state sequence NNP MD VB DT NN.

8.4.7 Extending the HMM Algorithm to Trigrams

In practice we use more of the history, letting the probability of a tag depend on two previous tags:
$$P(t_1^n) \approx \prod_{i=1}^n P(t_i|t_{i-1},t_{i-2})$$
Extending the algorithm from bigram to trigram taggers gives a small (perhaps a half point) increase in performance, but conditioning on two previous tags instead of one requires a significant change to the Viterbi algorithm. For each cell, instead of taking a max over transitions from each cell in the previous column, we have to take a max over paths through the cells in the previous two columns, thus considering $N^2$ rather than $N$ hidden states at every observation.

In addition to increasing the context window, HMM taggers have a number of other advanced features. One is to let the tagger know the location of the end of the sentence by adding dependence on an end-of-sequence marker for $t_{n+1}$. This gives the following equation for part-of-speech tagging:
$$\hat t_1^n = \mathop{\textrm{argmax}}_{t_1^n} P(t_1^n|w_1^n) \approx \mathop{\textrm{argmax}}_{t_1^n}\left[\prod_{i=1}^n P(w_i|t_i)P(t_i|t_{i-1},t_{i-2})\right]P(t_{n+1}|t_n)$$
In tagging any sentence with it, three of the tags used in the context will fall off the edge of the sentence, and hence will not match regular words. These tags, $t_{-1}$, $t_0$, and $t_{n+1}$, can all be set to be a single special ‘sentence boundary’ tag that is added to the tagset, which assumes that sentence boundaries have already been marked.

One problem with trigram taggers is data sparsity. Any particular sequence of tags $t_{i-2}, t_{i-1}, t_i$ that occurs in the test set may simply never have occurred in the training set. The standard approach to solving this problem is the same interpolation idea we saw in language modeling: estimate the probability by combining more robust, but weaker estimators. The standard way to combine the trigram, bigram, and unigram estimators to estimate the trigram probability $P(t_i|t_{i-1},t_{i-2})$ is via linear interpolation. We estimate the probability $P(t_i|t_{i-1}t_{i-2})$ by a weighted sum of the unigram, bigram, and trigram probabilities:
$$P(t_i|t_{i-1}t_{i-2}) = \lambda_3\hat P(t_i|t_{i-1}t_{i-2}) + \lambda_2\hat P(t_i|t_{i-1}) + \lambda_1\hat P(t_i)$$
We require $\lambda_1 + \lambda_2 + \lambda_3 = 1$, ensuring that the resulting $P$ is a probability distribution. The $\lambda$s are set by deleted interpolation (Jelinek and Mercer, 1980): we successively delete each trigram from the training corpus and choose the $\lambda$s so as to maximize the likelihood of the rest of the corpus. The deletion helps to set the $\lambda$s in such a way as to generalize to unseen data and not overfit. Figure 8.10 gives a deleted interpolation algorithm for tag trigrams.

[Figure 8.10: The deleted interpolation algorithm for setting the weights for combining unigram, bigram, and trigram tag probabilities]
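
A sketch of the deleted-interpolation idea of Fig. 8.10, following the version described in Brants (2000); the count tables are assumed to have been collected from a tagged corpus, the zero-denominator guard follows the convention that such a case contributes 0, and ties are broken arbitrarily.

```python
from collections import Counter

def deleted_interpolation(trigram_c, bigram_c, unigram_c, total_tokens):
    """Set lambda_1..lambda_3 by asking, for each tag trigram (t1, t2, t3) seen in
    training, which of the three estimators (with that trigram 'deleted', hence the
    -1 terms) would have predicted t3 best, and crediting its lambda with the count."""
    lambdas = [0.0, 0.0, 0.0]   # lambdas[0] ~ unigram, [1] ~ bigram, [2] ~ trigram
    for (t1, t2, t3), c in trigram_c.items():
        case3 = (c - 1) / (bigram_c[(t1, t2)] - 1) if bigram_c[(t1, t2)] > 1 else 0.0
        case2 = (bigram_c[(t2, t3)] - 1) / (unigram_c[t2] - 1) if unigram_c[t2] > 1 else 0.0
        case1 = (unigram_c[t3] - 1) / (total_tokens - 1)
        best = max(range(3), key=lambda k: (case1, case2, case3)[k])
        lambdas[best] += c
    s = sum(lambdas)
    return [l / s for l in lambdas]  # normalize so lambda_1 + lambda_2 + lambda_3 = 1

# Tiny invented count tables, just to show the interface.
tri = Counter({("NNP", "MD", "VB"): 3})
bi = Counter({("NNP", "MD"): 3, ("MD", "VB"): 4})
uni = Counter({"NNP": 5, "MD": 4, "VB": 6})
print(deleted_interpolation(tri, bi, uni, total_tokens=20))
# [0.0, 1.0, 0.0] here: the bigram estimator wins the (tie-broken) comparison
```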

8.4.8 Beam Search

When the number of states grows very large, the vanilla Viterbi algorithm will be slow. The complexity of the algorithm is $O(N^2T)$; $N$ (the number of states) can be large for trigram taggers, which have to consider every previous pair of the 45 tags, resulting in $45^3 = 91{,}125$ computations per column. $N$ can be even larger for other applications of Viterbi, for example to decoding in neural networks, as we will see in future chapters.

One common solution to the complexity problem is the use of beam search decoding. In beam search, instead of keeping the entire column of states at each time point $t$, we just keep the best few hypotheses at that point. At time $t$ this requires computing the Viterbi score for each of the $N$ cells, sorting the scores, and keeping only the best-scoring states. The rest are pruned out and not continued forward to time $t+1$.

One way to implement beam search is to keep a fixed number of states instead of all $N$ current states; here the beam width $\beta$ is a fixed number of states. Alternatively $\beta$ can be modeled as a fixed percentage of the $N$ states, or as a probability threshold. Figure 8.11 shows the search lattice using a beam width of 2 states.
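
A minimal sketch of beam pruning layered on the Viterbi recursion, reusing the pi, state_graph, and b tables from the code in Section 8.4.5; the helper names are my own. Only the indices of the beta best states in each column are extended to the next column.

```python
import numpy as np

def beam_viterbi(pi, A, B, n_obs, beta=2):
    """Like Viterbi, but after each column keep only the beta highest-scoring states."""
    n_states = len(A)
    v = np.zeros((n_states, n_obs))
    back = np.zeros((n_states, n_obs), dtype=int)
    v[:, 0] = np.array(pi) * np.array(B)[:, 0]
    beam = list(np.argsort(v[:, 0])[-beta:])           # indices of the surviving states
    for t in range(1, n_obs):
        for j in range(n_states):
            scores = [v[s, t-1] * A[s][j] * B[j][t] for s in beam]  # extend beam states only
            v[j, t] = max(scores)
            back[j, t] = beam[int(np.argmax(scores))]
        beam = list(np.argsort(v[:, t])[-beta:])
    # Backtrace from the best final state.
    best = int(np.argmax(v[:, n_obs-1]))
    path = [best]
    for t in range(n_obs-1, 0, -1):
        path.insert(0, int(back[path[0], t]))
    return path
```

On the Janet will back the bill example, `beam_viterbi(pi, state_graph, b, n_obs=5, beta=2)` should still recover the path [0, 1, 2, 6, 4] (NNP MD VB DT NN), since the states pruned at each column carry little or no probability mass.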

8.4.9 Unknown Words

To achieve high accuracy with part-of-speech taggers, it is also important to have a good model for dealing with unknown words. Proper names and acronyms are created very often, and even new common nouns and verbs enter the language at a surprising rate. One useful feature for distinguishing parts of speech is word shape: words starting with capital letters are likely to be proper nouns (NNP).

But the strongest source of information for guessing the part-of-speech of unknown words is morphology. Words that end in -s are likely to be plural nouns (NNS), words ending with -ed tend to be past participles (VBN), words ending with -able adjectives (JJ), and so on. We store for each final letter sequence (for simplicity referred to as word suffixes) of up to 10 letters the statistics of the tag it was associated with in training. We are thus computing for each suffix of length $i$ the probability of the tag $t_i$ given the suffix letters (Samuelsson 1993, Brants 2000):
$$P(t_i|l_{n-i+1}\ldots l_n)$$
Back-off is used to smooth these probabilities with successively shorter suffixes. Because unknown words are unlikely to be closed-class words like prepositions, suffix probabilities can be computed only for words whose training set frequency is $\le 10$, or only for open-class words. Separate suffix tries are kept for capitalized and uncapitalized words.

Finally, we can compute the likelihood $P(w_i|t_i)$ that HMMs require by using Bayesian inversion (i.e., using Bayes’ rule and computation of the two priors $P(t_i)$ and $P(t_i|l_{n-i+1}\ldots l_n)$):
$$P(w_i|t_i) = \frac{P(t_i|w_i)P(w_i)}{P(t_i)} \approx \frac{P(t_i|l_{n-i+1}\ldots l_n)P(w_i)}{P(t_i)}$$
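
A rough sketch of the suffix statistics with back-off to shorter suffixes; note that the smoothing below is a simple interpolation from shorter to longer suffixes rather than Brants’ exact weighting, and `max_len` and `theta` are hypothetical parameters.

```python
from collections import Counter, defaultdict

class SuffixTagModel:
    """Estimate P(tag | last i letters of the word), backing off over suffix lengths."""
    def __init__(self, max_len=10, theta=0.3):
        self.max_len = max_len
        self.theta = theta                      # interpolation weight for the longer suffix
        self.suffix_tags = defaultdict(Counter)
        self.tag_counts = Counter()

    def train(self, word_tag_pairs):
        for word, tag in word_tag_pairs:
            self.tag_counts[tag] += 1
            for i in range(1, min(self.max_len, len(word)) + 1):
                self.suffix_tags[word[-i:]][tag] += 1

    def tag_given_suffix(self, word, tag):
        # Start from the unconditional tag probability, then successively
        # interpolate in longer and longer suffixes that were seen in training.
        p = self.tag_counts[tag] / sum(self.tag_counts.values())
        for i in range(1, min(self.max_len, len(word)) + 1):
            counts = self.suffix_tags.get(word[-i:])
            if not counts:
                break
            p_suffix = counts[tag] / sum(counts.values())
            p = self.theta * p_suffix + (1 - self.theta) * p
        return p

model = SuffixTagModel()
model.train([("reported", "VBN"), ("walked", "VBD"), ("tables", "NNS")])
print(model.tag_given_suffix("amortized", "VBN"))  # boosted by the -ed suffix statistics
```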
In addition to using capitalization information for unknown words, Brants (2000) also uses capitalization for known words by adding a capitalization feature to each tag. Thus, instead of computing $P(t_i|t_{i-1},t_{i-2})$, the algorithm computes the probability $P(t_i|t_{i-1},c_{i-1},t_{i-2},c_{i-2})$. This is equivalent to having a capitalized and uncapitalized version of each tag, doubling the size of the tagset.

Combining all these features, a trigram HMM like that of Brants (2000) has a tagging accuracy of 96.7% on the Penn Treebank, perhaps just slightly below the performance of the best MEMM and neural taggers.

8.5 Maximum Entropy Markov Models

We could turn logistic regression into a discriminative sequence model simply by running it on successive words, using the class assigned to the prior word as a feature in the classification of the next word. When we apply logistic regression in this way, it’s called the maximum entropy Markov model or MEMM.

Let the sequence of words be $W = w_1^n$ and the sequence of tags $T = t_1^n$. In an HMM, to compute the best tag sequence that maximizes $P(T|W)$ we rely on Bayes’ rule and the likelihood $P(W|T)$:
$$\hat T = \mathop{\textrm{argmax}}_T P(T|W) = \mathop{\textrm{argmax}}_T P(W|T)P(T) = \mathop{\textrm{argmax}}_T \prod_i P(word_i|tag_i) \prod_i P(tag_i|tag_{i-1})$$
In an MEMM, by contrast, we compute the posterior $P(T|W)$ directly, training it to discriminate among the possible tag sequences:
$$\hat T = \mathop{\textrm{argmax}}_T P(T|W) = \mathop{\textrm{argmax}}_T \prod_i P(t_i|w_i, t_{i-1})$$
Consider tagging just one word. A multinomial logistic regression classifier could compute the single probability $P(t_i|w_i,t_{i-1})$ in a different way than an HMM. Fig. 8.12 shows the intuition of the difference via the direction of the arrows; HMMs compute likelihood (observation word conditioned on tags) but MEMMs compute posterior (tags conditioned on observation words).

8.5.1 Features in a MEMM

Of course we don’t build MEMMs that condition just on $w_i$ and $t_{i-1}$. The reason to use a discriminative sequence model is that it’s easier to incorporate a lot of features. Figure 8.13 shows a graphical intuition of some of these additional features.

[Figure 8.13: A graphical intuition of some additional features available to an MEMM part-of-speech tagger]

A basic MEMM part-of-speech tagger conditions on the observation word itself, neighboring words, and previous tags, and various combinations, using feature templates like the following:
$$\langle t_i, w_{i-2}\rangle,\ \langle t_i, w_{i-1}\rangle,\ \langle t_i, w_i\rangle,\ \langle t_i, w_{i+1}\rangle,\ \langle t_i, w_{i+2}\rangle$$
$$\langle t_i, t_{i-1}\rangle,\ \langle t_i, t_{i-1}, t_{i-2}\rangle$$
$$\langle t_i, w_i, w_{i+1}\rangle$$
Recall from Chapter 5 that feature templates are used to automatically populate the set of features from every instance in the training and test set. Thus our example Janet/NNP will/MD back/VB the/DT bill/NN, when $w_i$ is the word back, would generate the following features:

$t_i = VB$ and $w_{i-2} = Janet$
$t_i = VB$ and $w_{i-1} = will$
$t_i = VB$ and $w_i = back$
$t_i = VB$ and $w_{i+1} = the$
$t_i = VB$ and $w_{i+2} = bill$
$t_i = VB$ and $t_{i-1} = MD$
$t_i = VB$ and $t_{i-1} = MD$ and $t_{i-2} = NNP$
$t_i = VB$ and $w_i = back$ and $w_{i+1} = the$
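
A sketch of how such feature templates might be instantiated in code; the padding token and the feature-string format are my own choices, not from the chapter.

```python
def extract_features(words, tags, i, pad="<pad>"):
    """Instantiate the word/tag feature templates for position i,
    given the words of the sentence and the (already-predicted) previous tags."""
    def w(k):  # safe lookup with padding outside the sentence
        return words[k] if 0 <= k < len(words) else pad
    def t(k):
        return tags[k] if 0 <= k < len(tags) else pad
    ti = tags[i]
    return [
        f"t_i={ti} w_i-2={w(i-2)}", f"t_i={ti} w_i-1={w(i-1)}", f"t_i={ti} w_i={w(i)}",
        f"t_i={ti} w_i+1={w(i+1)}", f"t_i={ti} w_i+2={w(i+2)}",
        f"t_i={ti} t_i-1={t(i-1)}", f"t_i={ti} t_i-1={t(i-1)} t_i-2={t(i-2)}",
        f"t_i={ti} w_i={w(i)} w_i+1={w(i+1)}",
    ]

words = ["Janet", "will", "back", "the", "bill"]
tags = ["NNP", "MD", "VB", "DT", "NN"]
for f in extract_features(words, tags, i=2):
    print(f)   # reproduces the eight features listed above
```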

Also necessary are features to deal with unknown words, expressing properties of the word’s spelling or shape:

$w_i$ contains a particular prefix (from all prefixes of length $\le 4$)
$w_i$ contains a particular suffix (from all suffixes of length $\le 4$)
$w_i$ contains a number
$w_i$ contains an upper-case letter
$w_i$ contains a hyphen
$w_i$ is all upper case
$w_i$’s word shape
$w_i$’s short word shape
$w_i$ is upper case and has a digit and a dash (like CFC-12)
$w_i$ is upper case and followed within 3 words by Co., Inc., etc.

Word shape features are used to represent the abstract letter pattern of the word by mapping lower-case letters to ‘x’, upper-case to ‘X’, numbers to ’d’, and retaining punctuation. Thus for example I.M.F would map to X.X.X. and DC10-30 would map to XXdd-dd. A second class of shorter word shape features is also used. In these features consecutive character types are removed, so DC10-30 would be mapped to Xd-d but I.M.F would still map to X.X.X. For example the word well-dressed would generate the following non-zero valued feature values:

prefix($w_i$) = w
prefix($w_i$) = we
prefix($w_i$) = wel
prefix($w_i$) = well
suffix($w_i$) = ssed
suffix($w_i$) = sed
suffix($w_i$) = ed
suffix($w_i$) = d
has-hyphen($w_i$)
word-shape($w_i$) = xxxx-xxxxxxx
short-word-shape($w_i$) = x-x
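
As a concrete illustration of the two shape mappings just described, here is a small sketch; the regular expression implements the “consecutive character types removed” step of the short shape.

```python
import re

def word_shape(word):
    """Map lower-case letters to 'x', upper-case to 'X', digits to 'd'; keep punctuation."""
    shape = ""
    for ch in word:
        if ch.islower():
            shape += "x"
        elif ch.isupper():
            shape += "X"
        elif ch.isdigit():
            shape += "d"
        else:
            shape += ch
    return shape

def short_word_shape(word):
    """Collapse runs of the same shape character into a single character."""
    return re.sub(r"(.)\1+", r"\1", word_shape(word))

for w in ["I.M.F", "DC10-30", "well-dressed"]:
    print(w, word_shape(w), short_word_shape(w))
# I.M.F        X.X.X        X.X.X
# DC10-30      XXdd-dd      Xd-d
# well-dressed xxxx-xxxxxxx x-x
```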

Features for known words are computed for every word seen in the training set. The unknown word features can also be computed for all words in training, or only on training words whose frequency is below some threshold. The result of the known-word templates and word-signature features is a very large set of features. Generally a feature cutoff is used in which features are thrown out if they have count < 5 in the training set.

8.5.2 Decoding and Training MEMMs

The most likely sequence of tags is then computed by combining these features of the input word $w_i$, its neighbors within $l$ words $w_{i-l}^{i+l}$, and the previous $k$ tags $t_{i-k}^{i-1}$ as follows (using $\theta$ to refer to feature weights instead of $w$ to avoid confusion with $w$ meaning words):
$$\begin{aligned} \hat T &= \mathop{\textrm{argmax}}_T P(T|W)\\ &= \mathop{\textrm{argmax}}_T \prod_i P(t_i|w_{i-l}^{i+l}, t_{i-k}^{i-1})\\ &= \mathop{\textrm{argmax}}_T \prod_i \frac{\exp\left(\sum_j \theta_j f_j(t_i, w_{i-l}^{i+l}, t_{i-k}^{i-1})\right)}{\sum_{t'\in \textrm{tagset}}\exp\left(\sum_j \theta_j f_j(t', w_{i-l}^{i+l}, t_{i-k}^{i-1})\right)} \end{aligned}$$
where each $f_j$ is a function representing one of the features introduced in the last section, and $\theta_j$ is the weight for $f_j$.
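
As a sanity check on this formula, here is a minimal sketch of the inner softmax for a single position, with features represented as strings and weights stored in a dictionary; the `tag||feature` key format and the toy weights are invented for illustration.

```python
import math

def memm_local_posterior(tag, context_features, tagset, weights):
    """P(t | context) = exp(sum_j theta_j f_j(t, context)) / sum_{t'} exp(...),
    where the active features for a (tag, context) pair are string keys into `weights`."""
    def score(t):
        return sum(weights.get(f"{t}||{feat}", 0.0) for feat in context_features)
    z = sum(math.exp(score(t)) for t in tagset)
    return math.exp(score(tag)) / z

weights = {"VB||w_i=back": 2.0, "VB||t_i-1=MD": 1.5, "NN||w_i=back": 0.5}
feats = ["w_i=back", "t_i-1=MD"]
print(memm_local_posterior("VB", feats, ["VB", "NN", "JJ"], weights))  # ~0.93
```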

How should we decode to find this optimal tag sequence T^\hat TT^? The simplest way to turn logistic regression into a sequence model is to build a local classifier that classifies each word left to right, making a hard classification of the first word in the sentence, then a hard decision on the second word, and so on. This is called a greedy decoding algorithm, because we greedily choose the best tag for each word.

greedy: at each time step, choose the tag that maximizes the local probability

Viterbi: choose the tag sequence that maximizes the probability of the whole path
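
A sketch of greedy decoding under these notes, given some hypothetical function `p_tag(tag, prev_tag, words, i)` standing in for the trained logistic-regression model’s local posterior; it is not a real API.

```python
def greedy_decode(words, tagset, p_tag):
    """Left-to-right greedy decoding: commit to the locally best tag at each step,
    then feed that hard decision forward as the previous-tag feature."""
    tags = []
    prev = "<s>"
    for i in range(len(words)):
        best = max(tagset, key=lambda t: p_tag(t, prev, words, i))
        tags.append(best)
        prev = best
    return tags

# Dummy stand-in model for illustration: prefer NNP for capitalized words, NN otherwise.
dummy = lambda t, prev, words, i: float(t == ("NNP" if words[i][0].isupper() else "NN"))
print(greedy_decode(["Janet", "will"], ["NNP", "NN"], dummy))  # ['NNP', 'NN']
```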

The MEMM requires only a slight change to the Viterbi recursion we saw for the HMM, replacing the $a$ and $b$ prior and likelihood probabilities with the direct posterior:
$$v_t(j) = \max_{i=1}^{N} v_{t-1}(i)\, P(s_j|s_i, o_t),\quad 1 \le j \le N,\ 1 < t \le T$$
Learning in MEMMs relies on the same supervised learning algorithms we presented for logistic regression. Given a sequence of observations, feature functions, and corresponding hidden states, we use gradient descent to train the weights to maximize the log-likelihood of the training corpus.

8.6 Bidirectionality

The one problem with the MEMM and HMM models as presented is that they are exclusively run left-to-right. While the Viterbi algorithm still allows present decisions to be influenced indirectly by future decisions, it would help even more if a decision about word $w_i$ could directly use information about future tags $t_{i+1}$ and $t_{i+2}$.

Adding bidirectionality has another useful advantage. MEMMs have a theoretical weakness, referred to alternatively as the label bias or observation bias problem (Lafferty et al. 2001, Toutanova et al. 2003). These are names for situations when one source of information is ignored because it is explained away by another source. Consider an example from Toutanova et al. (2003), the sequence will/NN to/TO fight/VB. The tag TO is often preceded by NN but rarely by modals (MD), and so that tendency should help predict the correct NN tag for will. But the previous transition $P(t_{will}|\langle s\rangle)$ prefers the modal (for example, Will you please…), and because $P(TO|to, t_{will})$ is so close to 1 regardless of $t_{will}$, the model cannot make use of the transition probability and incorrectly chooses MD. The strong information that to must have the tag TO has explained away the presence of TO and so the model doesn’t learn the importance of the previous NN tag for predicting TO. Bidirectionality helps the model by making the link between TO available when tagging the NN.

One way to implement bidirectionality is to switch to a more powerful model called a conditional random field or CRF. The CRF is an undirected graphical model, which means that it’s not computing a probability for each tag at each time step. Instead, at each time step the CRF computes log-linear functions over a clique, a set of relevant features. Unlike for an MEMM, these might include output features of words in future time steps. The probability of the best sequence is similarly computed by the Viterbi algorithm. Because a CRF normalizes probabilities over all tag sequences, rather than over all the tags at an individual time t, training requires computing the sum over all possible labelings, which makes CRF training quite slow.

Simpler methods can also be used; the Stanford tagger uses a bidirectional version of the MEMM called a cyclic dependency network (Toutanova et al., 2003).

Alternatively, any sequence model can be turned into a bidirectional model by using multiple passes. For example, the first pass would use only part-of-speech features from already-disambiguated words on the left. In the second pass, tags for all words, including those on the right, can be used. Alternately, the tagger can be run twice, once left-to-right and once right-to-left. In greedy decoding, for each word the classifier chooses the highest-scoring of the tag assigned by the left-to-right and right-to-left classifier. In Viterbi decoding, the classifier chooses the higher scoring of the two sequences (left-to-right or right-to-left). These bidirectional models lead directly into the bi-LSTM models that we will introduce in Chapter 9 as a standard neural sequence model.

8.7 Part-of-Speech Tagging for Other Languages

Augmentations to tagging algorithms become necessary when dealing with languages with rich morphology like Czech, Hungarian and Turkish.

Highly inflectional languages also have much more information than English coded in word morphology, like case (nominative, accusative, genitive) or gender (masculine, feminine). Because this information is important for tasks like parsing and coreference resolution, part-of-speech taggers for morphologically rich languages need to label words with case and gender information. Tagsets for morphologically rich languages are therefore sequences of morphological tags rather than a single primitive tag.

For non-word-space languages like Chinese, word segmentation (Chapter 2) is either applied before tagging or done jointly. Although Chinese words are on average very short (around 2.4 characters per unknown word compared with 7.7 for English) the problem of unknown words is still large. While English unknown words tend to be proper nouns in Chinese the majority of unknown words are common nouns and verbs because of extensive compounding. Tagging models for Chinese use similar unknown word features to English, including character prefix and suffix features, as well as novel features like the radicals of each character in a word (Tseng et al., 2005b).

A standard for multilingual tagging is the Universal POS tag set of the Universal Dependencies project, which contains 16 tags plus a wide variety of features that can be added to them to create a large tagset for any language (Nivre et al., 2016a).

8.8 Summary

This chapter introduced parts-of-speech and part-of-speech tagging:

  • Languages generally have a small set of closed class words that are highly frequent, often ambiguous, and act as function words, and open-class words like nouns, verbs, adjectives. Various part-of-speech tagsets exist, of between 40 and 200 tags.
  • Part-of-speech tagging is the process of assigning a part-of-speech label to each of a sequence of words.
  • Two common approaches to sequence modeling are a generative approach, HMM tagging, and a discriminative approach, MEMM tagging. We will see a third, discriminative neural approach in Chapter 9.
  • The probabilities in HMM taggers are estimated by maximum likelihood estimation on tag-labeled training corpora. The Viterbi algorithm is used for decoding, finding the most likely tag sequence.
  • Beam search is a variant of Viterbi decoding that maintains only a fraction of high scoring states rather than all states during decoding.
  • Maximum entropy Markov model or MEMM taggers train logistic regression models to pick the best tag given an observation word and its context and the previous tags, and then use Viterbi to choose the best sequence of tags.
  • Modern taggers are generally run bidirectionally.
