Chapter 4 Naive Bayes and Sentiment Classification

Reading notes on Speech and Language Processing, 3rd edition.

Text categorization is the task of assigning a label or category to an entire text or document.

We focus on one common text categorization task, sentiment analysis: the extraction of sentiment, the positive or negative orientation that a writer expresses toward some object.

Spam detection is another important commercial application, the binary classification task of assigning an email to one of the two classes spam or not-spam.

Another thing we might want to know about a text is the language it’s written in. The task of language id is thus the first step in most language processing pipelines. Related tasks like determining a text’s author (authorship attribution), or author characteristics like gender, age, and native language are text classification tasks that are also relevant to the digital humanities, social sciences, and forensic linguistics.

One of the oldest tasks in text classification is assigning a library subject category or topic label to a text. In fact, as we will see, subject category classification is the task for which the naive Bayes algorithm was invented in 1961.

Generative classifiers like naive Bayes build a model of how a class could generate some input data. Given an observation, they return the class most likely to have generated the observation. Discriminative classifiers like logistic regression instead learn what features from the input are most useful to discriminate between the different possible classes.

4.1 Naive Bayes Classifiers

A bag-of-words is an unordered set of words with their position ignored, keeping only their frequency in the document.

Naive Bayes is a probabilistic classifier, meaning that for a document $d$, out of all classes $c \in C$ the classifier returns the class $\hat c$ which has the maximum posterior probability given the document:
$$\hat c=\mathop{\textrm{argmax}}_{c\in C}P(c|d)$$
The intuition of Bayesian classification is to use Bayes’ rule to transform the equation above into other probabilities that have some useful properties. Bayes’ rule, shown below, gives us a way to break down any conditional probability $P(x|y)$ into three other probabilities:
$$P(x|y)=\frac{P(y|x)P(x)}{P(y)}$$

$$\hat c=\mathop{\textrm{argmax}}_{c\in C}P(c|d)=\mathop{\textrm{argmax}}_{c\in C}\frac{P(d|c)P(c)}{P(d)}$$

Since $P(d)$ is identical for all classes $c$, we can drop the denominator:
$$\hat c=\mathop{\textrm{argmax}}_{c\in C}P(c|d)=\mathop{\textrm{argmax}}_{c\in C}P(d|c)P(c)$$
We thus compute the most probable class $\hat c$ given some document $d$ by choosing the class which has the highest product of two probabilities: the prior probability of the class $P(c)$ and the likelihood of the document $P(d|c)$:
$$\hat c=\mathop{\textrm{argmax}}_{c\in C}\overbrace{P(d|c)}^{\textrm{likelihood}}\ \overbrace{P(c)}^{\textrm{prior}}$$
Without loss of generality, we can represent a document $d$ as a set of features $f_1,f_2,\ldots, f_n$:
$$\hat c=\mathop{\textrm{argmax}}_{c\in C}\overbrace{P(f_1,f_2,\ldots, f_n|c)}^{\textrm{likelihood}}\ \overbrace{P(c)}^{\textrm{prior}}$$
Naive Bayes classifiers make two simplifying assumptions.

The first is the bag of words assumption discussed intuitively above: we assume position doesn’t matter, and that the word “love” has the same effect on classification whether it occurs as the 1st, 20th, or last word in the document. Thus we assume that the features $f_1,f_2,\ldots, f_n$ only encode word identity and not position.

The second is commonly called the naive Bayes assumption: this is the conditional independence assumption that the probabilities $P(f_i|c)$ are independent given the class $c$ and hence can be ‘naively’ multiplied as follows:
$$P(f_1,f_2,\ldots, f_n|c)=P(f_1|c)\cdot P(f_2|c)\cdot \ldots \cdot P(f_n|c)$$
The final equation for the class chosen by a naive Bayes classifier is thus:
$$c_{NB}=\mathop{\textrm{argmax}}_{c\in C}P(c)\prod_{f\in F} P(f|c)$$
To apply the naive Bayes classifier to text, we need to consider word positions, by simply walking an index through every word position in the document:
$$\textrm{positions} \leftarrow \textrm{all word positions in test document}$$

$$c_{NB}=\mathop{\textrm{argmax}}_{c\in C}P(c)\prod_{i\in positions} P(w_i|c)$$
To avoid underflow and increase speed, we do the computation in log space:
$$c_{NB}=\mathop{\textrm{argmax}}_{c\in C}\log P(c)+\sum_{i\in positions} \log P(w_i|c)$$
Classifiers that use a linear combination of the inputs to make a classification decision —like naive Bayes and also logistic regression— are called linear classifiers.

4.2 Training the Naive Bayes Classifier

Let $N_c$ be the number of documents in our training data with class $c$ and $N_{doc}$ be the total number of documents. Then:
$$\hat P(c)=\frac{N_c}{N_{doc}}$$
To learn the probability $P(f_i|c)$, we’ll assume a feature is just the existence of a word in the document’s bag of words, and so we’ll want $P(w_i|c)$, which we compute as the fraction of times the word $w_i$ appears among all words in all documents of topic $c$. We first concatenate all documents with category $c$ into one big “category $c$” text. Then we use the frequency of $w_i$ in this concatenated document to give a maximum likelihood estimate of the probability:
$$\hat P(w_i|c)= \frac{\textrm{count}(w_i,c)}{\sum_{w\in V}\textrm{count}(w,c)}$$
Here the vocabulary $V$ consists of the union of all the word types in all classes, not just the words in one class $c$.

To solve the zero count problem, use add-one (Laplace) smoothing:
$$\hat P(w_i|c)= \frac{\textrm{count}(w_i,c)+1}{\sum_{w\in V}\bigl(\textrm{count}(w,c)+1\bigr)}=\frac{\textrm{count}(w_i,c)+1}{\left(\sum_{w\in V}\textrm{count}(w,c)\right)+|V|}$$
Note once again that it is crucial that the vocabulary $V$ consists of the union of all the word types in all classes, not just the words in one class $c$. The reason for this lies in the fact that, as shown in the per-position equation above, $w_i$ with $i\in positions$ ranges over all words from all the training documents, not just those of one class $c$.

The solution for unknown words is to ignore them—remove them from the test document and not include any probability for them at all.

Finally, some systems choose to completely ignore another class of words: stop words, very frequent words like the and a. This can be done by sorting the vocabulary by frequency in the training set and defining the top 10–100 vocabulary entries as stop words, or alternatively by using one of the many pre-defined stop word lists available online.
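As a rough sketch of the frequency-based variant (assuming the training documents are already tokenized into lists of words; the function and parameter names are made up for illustration):

```python
from collections import Counter

def build_stopwords(tokenized_docs, k=50):
    # Count word frequencies over the whole training set and return
    # the k most frequent word types as the stop word list.
    freq = Counter(word for doc in tokenized_docs for word in doc)
    return {word for word, _ in freq.most_common(k)}
```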

training_data={"just plain boring":"-",
"entirely predictable and lacks energy":"-",
"no surprises and very few laughs":"-",
"very powerful":"+",
"the most fun film of the summer":"+"}import numpy as np
def naive_bayes_classifier(training_data):n_docs=len(training_data)classes=set(training_data.values())n_c={}logprior={}vocabulary=set()bigdoc={}  for key,value in training_data.items():vocabulary.update(key.split())if value not in bigdoc.keys():bigdoc[value]=key.split()n_c[value]=1else:bigdoc[value]+=key.split()n_c[value]+=1count={}loglikelihood={}for c in classes:count[c]={}loglikelihood[c]={}logprior[c]=np.log(1.0*n_c[c]/n_docs)for word in vocabulary:count[c][word]=bigdoc[c].count(word)loglikelihood[c][word]=np.log(1.0*(count[c][word]+1)/(len(bigdoc[c])+len(vocabulary)))        return vocabulary,classes,logprior,loglikelihoodvocabulary,classes,logprior,loglikelihood= naive_bayes_classifier(training_data)
def test_naive_bayes_classifier(test_data):sum={}for c in classes:sum[c]=logprior[c]for word in test_data.split():if word in vocabulary:sum[c]+=loglikelihood[c][word]return max(sum,key=sum.get)

4.3 Worked example
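
Using the toy training set from the code above, consider classifying the test sentence “predictable with no fun” (“with” never occurs in the training data, so it is ignored). The priors are $P(-)=\frac{3}{5}$ and $P(+)=\frac{2}{5}$. The concatenated negative and positive class texts contain 14 and 9 tokens respectively, and $|V|=20$, so with add-one smoothing:

$$P(\textrm{predictable}|-)=\frac{1+1}{14+20} \quad P(\textrm{no}|-)=\frac{1+1}{14+20} \quad P(\textrm{fun}|-)=\frac{0+1}{14+20}$$

$$P(\textrm{predictable}|+)=\frac{0+1}{9+20} \quad P(\textrm{no}|+)=\frac{0+1}{9+20} \quad P(\textrm{fun}|+)=\frac{1+1}{9+20}$$

Multiplying by the priors gives $P(-)\,P(s|-)=\frac{3}{5}\times\frac{2\times 2\times 1}{34^3}\approx 6.1\times 10^{-5}$ and $P(+)\,P(s|+)=\frac{2}{5}\times\frac{1\times 1\times 2}{29^3}\approx 3.3\times 10^{-5}$, so the model chooses the negative class. The toy classifier above makes the same prediction:

```python
# "with" is an unknown word and is ignored; the remaining words favor the negative class.
print(test_naive_bayes_classifier("predictable with no fun"))  # prints "-"
```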

4.4 Optimization for Sentiment Analysis

Some small changes are generally employed that improve performance.

First, for sentiment classification and a number of other text classification tasks, whether a word occurs or not seems to matter more than its frequency. Thus it often improves performance to clip the word counts in each document at 1. This variant is called binary multinomial naive Bayes or binary NB.
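A minimal sketch of the binarization step, assuming each document arrives as a list of tokens:

```python
def binarize(doc_tokens):
    # Clip each word's count within a single document at 1 by dropping duplicates;
    # order is irrelevant anyway under the bag-of-words assumption.
    return list(set(doc_tokens))

# binarize("great great great plot".split()) keeps a single "great"
```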

A second important addition commonly made when doing text classification for sentiment is to deal with negation. A very simple baseline that is commonly used in sentiment analysis to deal with negation is, during text normalization, to prepend the prefix NOT_ to every word after a token of logical negation (n’t, not, no, never) until the next punctuation mark. Newly formed ‘words’ like NOT_like, NOT_recommend will thus occur more often in negative documents and act as cues for negative sentiment, while words like NOT_bored, NOT_dismiss will acquire positive associations.
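A rough sketch of this baseline; the token-level input, the negation word set, and the punctuation test are simplifying assumptions of the sketch:

```python
import re

NEGATION = {"not", "no", "never", "n't"}

def mark_negation(tokens):
    # Prepend NOT_ to every token after a negation word, until the next punctuation mark.
    out, negating = [], False
    for tok in tokens:
        if re.fullmatch(r"[.,!?;:]", tok):
            negating = False
            out.append(tok)
        elif tok.lower() in NEGATION or tok.lower().endswith("n't"):
            negating = True
            out.append(tok)
        else:
            out.append("NOT_" + tok if negating else tok)
    return out

# mark_negation("didn't like this movie , but I".split())
# -> ["didn't", 'NOT_like', 'NOT_this', 'NOT_movie', ',', 'but', 'I']
```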

Finally, in some situations we might have insufficient labeled training data to train accurate naive Bayes classifiers using all words in the training set to estimate positive and negative sentiment. In such cases we can instead derive the positive and negative word features from sentiment lexicons, lists of words that are pre-annotated with positive or negative sentiment. Four popular lexicons are the General Inquirer (Stone et al., 1966), LIWC (Pennebaker et al., 2007), the opinion lexicon of Hu and Liu (2004a), and the MPQA Subjectivity Lexicon (Wilson et al., 2005).

A common way to use lexicons in a naive Bayes classifier is to add a feature that is counted whenever a word from that lexicon occurs. Thus we might add a feature called ‘this word occurs in the positive lexicon’, and treat all instances of words in the lexicon as counts for that one feature, instead of counting each word separately. Similarly, we might add a second feature ‘this word occurs in the negative lexicon’ for words in the negative lexicon. If we have lots of training data, and if the test data matches the training data, using just two features won’t work as well as using all the words. But when training data is sparse or not representative of the test set, using dense lexicon features instead of sparse individual-word features may generalize better.
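A sketch of such dense lexicon features; the two tiny lexicons below are placeholders, not entries from the actual General Inquirer or MPQA lexicons:

```python
# Placeholder lexicons; a real system would load General Inquirer, MPQA, etc.
POS_LEXICON = {"good", "great", "powerful", "fun"}
NEG_LEXICON = {"boring", "predictable", "lacks", "bad"}

def lexicon_features(tokens):
    # Two dense features: how many tokens fall in the positive lexicon
    # and how many fall in the negative lexicon.
    return {"pos_lexicon_count": sum(t in POS_LEXICON for t in tokens),
            "neg_lexicon_count": sum(t in NEG_LEXICON for t in tokens)}
```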

4.5 Naive Bayes for other text classification tasks

Consider the task of spam detection. A common solution here, rather than using all the words as individual features, is to predefine likely sets of words or phrases as features, and to combine these with features that are not purely linguistic.

For other tasks, like language ID—determining what language a given piece of text is written in—the most effective naive Bayes features are not words at all, but byte n-grams: 2-grams (‘zw’), 3-grams (‘nya’, ‘ Vo’), or 4-grams (‘ie z’, ‘thei’). Because spaces count as a byte, byte n-grams can model statistics about the beginning or ending of words. A widely used naive Bayes system, langid.py (Lui and Baldwin, 2012), begins with all possible n-grams of lengths 1–4, using feature selection to winnow down to the most informative 7000 final features.
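A minimal sketch of byte n-gram extraction (the feature selection step used by langid.py is omitted):

```python
def byte_ngrams(text, n_values=(1, 2, 3, 4)):
    # Encode the text as bytes (spaces included, so word boundaries are captured)
    # and collect every n-gram of the requested lengths.
    data = text.encode("utf-8")
    return [data[i:i + n]
            for n in n_values
            for i in range(len(data) - n + 1)]

# byte_ngrams("the zwölf", n_values=(2,))[:3] -> [b'th', b'he', b'e ']
```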

4.6 Naive Bayes as a Language Model

If we use only individual word features, and we use all of the words in the text (not a subset), then naive Bayes has an important similarity to language modeling. Specifically, a naive Bayes model can be viewed as a set of class-specific unigram language models, in which the model for each class instantiates a unigram language model.

Since the likelihood features from the naive Bayes model assign a probability to each word $P(\textrm{word}|c)$, the model also assigns a probability to each sentence:
$$P(s|c)=\prod_{i\in positions}P(w_i|c)$$
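
A small sketch that scores a sentence under a class-specific unigram model, reusing the loglikelihood table and vocabulary trained in the toy code of Section 4.2 (unknown words are skipped, as before):

```python
def sentence_logprob(sentence, c):
    # Log probability of the sentence under the unigram language model of class c;
    # words outside the training vocabulary are skipped.
    return sum(loglikelihood[c][w] for w in sentence.split() if w in vocabulary)

# np.exp(sentence_logprob("the most fun", "+")) gives P(s | +) under the positive model
```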

4.7 Evaluation: Precision, Recall, F-measure

We will refer to human labels as the gold labels.
$$\textrm{Precision}=\frac{\textrm{true positives}}{\textrm{true positives + false positives}}$$

$$\textrm{Recall}=\frac{\textrm{true positives}}{\textrm{true positives + false negatives}}$$

$$F_\beta =\frac{(\beta^2 +1)PR}{\beta^2P+R}$$

$$F_1=\frac{2PR}{P+R}$$

$$\textrm{accuracy}=\frac{\textrm{tp}+\textrm{tn}}{\textrm{tp}+\textrm{fp}+\textrm{tn}+\textrm{fn}}$$

Although accuracy might seem a natural metric, we generally don’t use it. That’s because accuracy doesn’t work well when the classes are unbalanced. Accuracy is not a good metric when the goal is to discover something that is rare, or at least not completely balanced in frequency, which is a very common situation in the world.

F-measure comes from a weighted harmonic mean of precision and recall. The harmonic mean of a set of numbers is the reciprocal of the arithmetic mean of reciprocals:
$$\textrm{HarmonicMean}(a_1,a_2,a_3,\ldots,a_n) =\frac{n}{\frac{1}{a_1}+\frac{1}{a_2}+\frac{1}{a_3}+\ldots+\frac{1}{a_n}}$$
and hence F-measure is
$$F=\frac{1}{\alpha\frac{1}{P}+(1-\alpha)\frac{1}{R}}\quad\textrm{or, with }\beta^2=\frac{1-\alpha}{\alpha},\quad F =\frac{(\beta^2 +1)PR}{\beta^2P+R}$$
Harmonic mean is used because it is a conservative metric; the harmonic mean of two values is closer to the minimum of the two values than the arithmetic mean is. Thus it weighs the lower of the two numbers more heavily.
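
A minimal sketch that computes these metrics from raw confusion counts:

```python
def precision_recall_f(tp, fp, fn, beta=1.0):
    # Precision, recall, and F-beta from true positive, false positive, and false negative counts.
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f = (beta ** 2 + 1) * precision * recall / (beta ** 2 * precision + recall)
    return precision, recall, f

# precision_recall_f(tp=70, fp=30, fn=10) -> (0.7, 0.875, ~0.778) with beta = 1
```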

4.7.1 More than two classes

There are two kinds of multi-class classification tasks. In any-of or multi-label classification, each document or item can be assigned more than one label. We can solve any-of classification by building separate binary classifiers for each class c, trained on positive examples labeled c and negative examples not labeled c. Given a test document or item d, each classifier makes its decision independently, and we may assign multiple labels to d.

More common in language processing is one-of or multinomial classification, in which the classes are mutually exclusive and each document or item appears in exactly one class. Here we again build a separate binary classifier for each class c, trained on positive examples from c and negative examples from all other classes. Now given a test document or item d, we run all the classifiers and choose the label from the classifier with the highest score.

In order to derive a single metric that tells us how well the system is doing, we can combine these values in two ways. In macroaveraging, we compute the performance for each class, and then average over classes. In microaveraging, we collect the decisions for all classes into a single contingency table, and then compute precision and recall from that table.

As the figure shows, a microaverage is dominated by the more frequent class (in this case spam), since the counts are pooled. The macroaverage better reflects the statistics of the smaller classes, and so is more appropriate when performance on all the classes is equally important.
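
A sketch of the two averages given per-class true positive and false positive counts; the three-class counts below are only illustrative:

```python
def macro_micro_precision(per_class):
    # per_class maps each class to its true positive ("tp") and false positive ("fp") counts.
    # Macroaverage: mean of the per-class precisions.
    # Microaverage: precision computed from the pooled counts.
    macro = sum(c["tp"] / (c["tp"] + c["fp"]) for c in per_class.values()) / len(per_class)
    tp = sum(c["tp"] for c in per_class.values())
    fp = sum(c["fp"] for c in per_class.values())
    return macro, tp / (tp + fp)

counts = {"urgent": {"tp": 8, "fp": 11},
          "normal": {"tp": 60, "fp": 55},
          "spam":   {"tp": 200, "fp": 33}}
# macro_micro_precision(counts) -> (~0.60, ~0.73); the microaverage is pulled toward spam.
```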

4.8 Test sets and Cross-validation

In cross-validation, we randomly choose a training and test set division of our data, train our classifier, and then compute the error rate on the test set. Then we repeat with a different randomly selected training set and test set. We do this sampling process 10 times and average these 10 runs to get an average error rate. This is called 10-fold cross-validation.

It is common to create a fixed training set and test set, then do 10-fold cross-validation inside the training set, but compute error rate the normal way in the test set.
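
A sketch of the k-fold bookkeeping inside the training set; the data is assumed to be shuffled already, and train_and_evaluate is a placeholder callback that trains on one split and returns the error rate on the held-out fold:

```python
def cross_validation_error(docs, labels, train_and_evaluate, k=10):
    # Split the (already shuffled) training data into k folds; each fold is held out once
    # while the classifier is trained on the other k-1 folds. Return the mean error rate.
    fold_size = len(docs) // k
    errors = []
    for i in range(k):
        lo, hi = i * fold_size, (i + 1) * fold_size
        held_out = (docs[lo:hi], labels[lo:hi])
        train = (docs[:lo] + docs[hi:], labels[:lo] + labels[hi:])
        errors.append(train_and_evaluate(train, held_out))  # placeholder callback
    return sum(errors) / k
```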

4.9 Statistical Significance Testing

Let’s say we have a test set $x$ of $n$ observations $x = x_1, x_2, \ldots, x_n$ on which A’s performance is better than B’s by $\delta(x)$. How can we know if A is really better than B? To do so we’d need to reject the null hypothesis that A isn’t really better than B and that this difference $\delta(x)$ occurred purely by chance. If the null hypothesis were correct, we would expect that, if we had many test sets of size $n$ and we measured A and B’s performance on all of them, on average A might accidentally still be better than B by this amount $\delta(x)$ just by chance.

More formally, if we had a random variable $X$ ranging over test sets, the null hypothesis $H_0$ expects $P(\delta(X) > \delta(x)\,|\,H_0)$, the probability that we’ll see similarly big differences just by chance, to be high.

If we had all these test sets we could just measure $\delta(x')$ for all the test sets $x'$. If we found that those deltas didn’t tend to be bigger than $\delta(x)$, that is, that p-value($x$) was sufficiently small (less than the standard thresholds of 0.05 or 0.01), then we might reject the null hypothesis and agree that $\delta(x)$ was a sufficiently surprising difference and that A is really a better algorithm than B. Following Berg-Kirkpatrick et al. (2012), we refer to $P(\delta(X) > \delta(x)\,|\,H_0)$ as p-value($x$).

In language processing we don’t generally use traditional statistical approaches like paired t-tests to compare system outputs, because most metrics are not normally distributed, violating the assumptions of the tests. The standard approach to computing p-value($x$) in natural language processing is to use non-parametric tests like the bootstrap test (Efron and Tibshirani, 1993) or a similar test, approximate randomization (Noreen, 1989). The advantage of these tests is that they can apply to any metric, from precision, recall, or F1 to the BLEU metric used in machine translation.

The word bootstrapping refers to repeatedly drawing large numbers of smaller samples with replacement (called bootstrap samples) from an original larger sample. The intuition of the bootstrap test is that we can create many virtual test sets from an observed test set by repeatedly sampling from it. The method only makes the assumption that the sample is representative of the population.

```python
import numpy as np
import numpy.random as rnd

# Paired per-example scores on the test set: x[0][i] = 1 if system A is correct
# on example i, x[1][i] = 1 if system B is correct on example i.
x = [[1, 1, 1, 0, 1, 0, 1, 1, 0, 1],
     [1, 0, 1, 1, 0, 1, 0, 1, 0, 0]]

def delta(x):
    # Difference in accuracy between system A and system B.
    return (np.count_nonzero(x[0]) - np.count_nonzero(x[1])) / len(x[0])

def bootstrap(x, batches):
    s = 0
    n = len(x[0])
    for i in range(batches):
        # Draw a bootstrap sample of the same size by sampling examples with replacement.
        x_star = [[None] * n, [None] * n]
        for j in range(n):
            rnd_index = rnd.randint(0, n)
            x_star[0][j] = x[0][rnd_index]
            x_star[1][j] = x[1][rnd_index]
        # Count samples on which A beats B by at least twice the observed delta,
        # i.e. delta(x*) - delta(x) > delta(x).
        if delta(x_star) > 2 * delta(x):
            s += 1
    p_value = 1.0 * s / batches
    return p_value
```
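For example, a toy run on the paired score vectors above (the number of bootstrap samples, 10,000 here, is an arbitrary choice):

```python
# delta(x) is the observed accuracy gain of system A over B on the toy vectors (here 0.2);
# bootstrap() estimates how often a gain at least that large would arise by chance.
print(delta(x))               # 0.2
print(bootstrap(x, 10000))    # estimated p-value; reject H0 if it falls below 0.05
```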

4.10 Advanced: Feature Selection

Feature selection is a method of removing features that are unlikely to generalize well. The basis of feature selection is to assign some metric of goodness to each feature, rank the features, and keep the best ones. The number of features to keep is a meta-parameter that can be optimized on a dev set.

Features are generally ranked by how informative they are about the classification decision. A very common metric is information gain. Information gain tells us how many bits of information the presence of the word gives us for guessing the class, and can be computed as follows (where $c_i$ is the $i$th class and $\bar{w}$ means that a document does not contain the word $w$):
$$G(w)=-\sum_{i=1}^C P(c_i)\log P(c_i)+P(w)\sum_{i=1}^C P(c_i|w)\log P(c_i|w)+P(\bar w)\sum_{i=1}^C P(c_i|\bar w)\log P(c_i|\bar w)$$
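
A sketch of this computation from document counts; the argument names n_c and n_cw are assumptions of the sketch:

```python
import numpy as np

def information_gain(n_c, n_cw):
    # n_c[i]:  number of training documents in class i
    # n_cw[i]: number of those documents that contain the word w
    n_c = np.asarray(n_c, dtype=float)
    n_cw = np.asarray(n_cw, dtype=float)

    def sum_p_log_p(counts):
        p = counts / counts.sum()
        p = p[p > 0]                      # treat 0 log 0 as 0
        return np.sum(p * np.log2(p))

    p_w = n_cw.sum() / n_c.sum()          # P(w): fraction of documents containing w
    return (-sum_p_log_p(n_c)                       # entropy of the class distribution
            + p_w * sum_p_log_p(n_cw)               # minus entropy of classes given w present
            + (1 - p_w) * sum_p_log_p(n_c - n_cw))  # minus entropy of classes given w absent

# information_gain([50, 50], [40, 5]) -> about 0.4 bits
```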

4.11 Summary

This chapter introduced the naive Bayes model for classification and applied it to the text categorization task of sentiment analysis.

  • Many language processing tasks can be viewed as tasks of classification.
  • Text categorization, in which an entire text is assigned a class from a finite set, includes such tasks as sentiment analysis, spam detection, language identification, and authorship attribution.
  • Sentiment analysis classifies a text as reflecting the positive or negative orientation (sentiment) that a writer expresses toward some object.
  • Naive Bayes is a generative model that makes the bag of words assumption (position doesn’t matter) and the conditional independence assumption (words are conditionally independent of each other given the class).
  • Naive Bayes with binarized features seems to work better for many text classification tasks.
  • Feature selection can be used to automatically remove features that aren’t helpful.
  • Classifiers are evaluated based on precision and recall.
  • Classifiers are trained using distinct training, dev, and test sets, including the use of cross-validation in the training set.
