CHAPTER 19 Lexicons for Sentiment, Affect, and Connotation

Reading notes on Speech and Language Processing, 3rd edition.

[Figure 19.1 (image unavailable)]

We can design extractors for each of these kinds of affective states. Chapter 4 already introduced sentiment analysis, the task of extracting the positive or negative orientation that a writer expresses in a text. This corresponds in Scherer’s typology to the extraction of attitudes: figuring out what people like or dislike, from affect-rich texts like consumer reviews of books or movies, newspaper editorials, or public sentiment in blogs or tweets.

Detecting emotion and moods is useful for detecting whether a student is confused, engaged, or certain when interacting with a tutorial system, whether a caller to a help line is frustrated, or whether someone’s blog posts or tweets indicate depression. Detecting emotions like fear in novels, for example, could help us trace what groups or situations are feared and how that changes over time.

Detecting different interpersonal stances can be useful when extracting information from human-human conversations. The goal here is to detect stances like friendliness or awkwardness in interviews or friendly conversations, or even to detect flirtation in dating. For the task of automatically summarizing meetings, we’d like to be able to automatically understand the social relations between people, who is friendly or antagonistic to whom. A related task is finding parts of a conversation where people are especially excited or engaged, conversational hot spots that can help a summarizer focus on the correct region.

Detecting the personality of a user—such as whether the user is an extrovert or the extent to which they are open to experience — can help improve conversational agents, which seem to work better if they match users’ personality expectations (Mairesse and Walker, 2008).

In this chapter we focus on an alternative model, in which instead of using every word as a feature, we focus only on certain words, ones that carry particularly strong cues to affect or sentiment. We call these lists of words affective lexicons or sentiment lexicons. These lexicons presuppose a fact about semantics: that words have affective meanings or connotations. The word connotation has different meanings in different fields, but here we use it to mean the aspects of a word’s meaning that are related to a writer or reader’s emotions, sentiment, opinions, or evaluations. In addition to their ability to help determine the affective status of a text, connotation lexicons can be useful features for other kinds of affective tasks, and for computational social science analysis.

In the next sections we introduce basic theories of emotion, show how sentiment lexicons can be viewed as a special case of emotion lexicons, and then summarize some publicly available lexicons. We then introduce three ways for building new lexicons: human labeling, semi-supervised, and supervised. Finally, we turn to some other kinds of affective meaning, including interpersonal stance, personality, and connotation frames.

19.1 Defining Emotion

There are two widely-held families of theories of emotion. In one family, emotions are viewed as fixed atomic units, limited in number, and from which others are generated, often called basic emotions (Tomkins 1962, Plutchik 1962). Perhaps the best known of this family of theories are the 6 emotions proposed by Ekman (1999) as a set of emotions that is likely to be universally present in all cultures: surprise, happiness, anger, fear, disgust, sadness. Another atomic theory is the Plutchik (1980) wheel of emotion, consisting of 8 basic emotions in four opposing pairs: joy–sadness, anger–fear, trust–disgust, and anticipation–surprise, together with the emotions derived from them, shown in Fig. 19.2.

[Figure 19.2 (image unavailable)]

The second class of emotion theories views emotion as a space in 2 or 3 dimensions (Russell, 1980). Most models include the two dimensions valence and arousal, and many add a third, dominance. These can be defined as:

valence: the pleasantness of the stimulus

arousal: the intensity of emotion provoked by the stimulus

dominance: the degree of control exerted by the stimulus

Sentiment can be viewed as a special case of this second view of emotions as points in space. In particular, the valence dimension, measuring how pleasant or unpleasant a word is, is often used directly as a measure of sentiment.

19.2 Available Sentiment and Affect Lexicons

The most basic lexicons label words along one dimension of semantic variability, generally called "sentiment" or "valence".

In the simplest lexicons this dimension is represented in a binary fashion, with a wordlist for positive words and a wordlist for negative words.

[Figure 19.3 (image unavailable)]

Slightly more general than these sentiment lexicons are lexicons that assign each word a value on all three emotional dimensions. The lexicon of Warriner et al. (2013) assigns valence, arousal, and dominance scores to 14,000 words. Some examples are shown in Fig. 19.4.

[Figure 19.4 (image unavailable)]

The NRC Word-Emotion Association Lexicon, also called EmoLex (Mohammad and Turney, 2013), uses the Plutchik (1980) 8 basic emotions defined above. Values from the lexicon for some sample words:

[EmoLex example values (image unavailable)]

There are various other hand-built affective lexicons. The General Inquirer includes additional lexicons for dimensions like strong vs. weak, active vs. passive, overstated vs. understated, as well as lexicons for categories like pleasure, pain, virtue, vice, motivation, and cognitive orientation.

Another useful feature for various tasks is the distinction between concrete words like banana or bathrobe and abstract words like belief and although. The lexicon of Brysbaert et al. (2014) used crowdsourcing to assign a rating from 1 to 5 for the concreteness of 40,000 words, thus assigning banana, bathrobe, and bagel 5, belief 1.19, although 1.07, and in-between words like brisk a 2.5.

LIWC, Linguistic Inquiry and Word Count, is another set of 73 lexicons containing over 2300 words (Pennebaker et al., 2007), designed to capture aspects of lexical meaning relevant for social psychological tasks. In addition to sentiment-related lexicons like ones for negative emotion (bad, weird, hate, problem, tough) and positive emotion (love, nice, sweet), LIWC includes lexicons for categories like anger, sadness, cognitive mechanisms, perception, tentative, and inhibition, shown in Fig. 19.5.

[Figure 19.5 (image unavailable)]

19.3 Creating affect lexicons by human labeling

The earliest method used to build affect lexicons, and still in common use, is to have humans label each word. This is now most commonly done via crowdsourcing: breaking the task into small pieces and distributing them to a large number of annotators. Let’s take a look at some of the methodological choices for two crowdsourced emotion lexicons.

The NRC Word-Emotion Association Lexicon (EmoLex) (Mohammad and Turney, 2013) labeled emotions in two steps. In order to ensure that the annotators were judging the correct sense of the word, they first answered a multiple-choice synonym question that primed the correct sense of the word (without requiring the annotator to read a potentially confusing sense definition). These were created automatically using the headwords associated with the thesaurus category of the sense in question in the Macquarie Thesaurus and the headwords of 3 random distractor categories. An example:

Which word is closest in meaning (most related) to startle?

  • automobile
  • shake
  • honesty
  • entertain

For each word (e.g. startle), the annotator was then asked to rate how associated that word is with each of the 8 emotions (joy, fear, anger, etc.). The associations were rated on a scale of not, weakly, moderately, and strongly associated. Outlier ratings were removed, and then each term was assigned the class chosen by the majority of the annotators, with ties broken by choosing the stronger intensity, and then the 4 levels were mapped into a binary label for each word (no and weak mapped to 0, moderate and strong mapped to 1).
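The aggregation scheme above can be sketched in a few lines. The rating data and the tie-breaking helper here are hypothetical illustrations, not the actual EmoLex pipeline:

```python
from collections import Counter

# Order of the four association levels, weakest to strongest.
LEVELS = ["not", "weak", "moderate", "strong"]

def aggregate(ratings):
    """Majority class over annotators, ties broken toward the stronger
    intensity; then the 4 levels are collapsed to a 0/1 association."""
    counts = Counter(ratings)
    # max by (vote count, intensity): intensity breaks count ties
    best = max(counts.items(), key=lambda kv: (kv[1], LEVELS.index(kv[0])))[0]
    return 1 if best in ("moderate", "strong") else 0

print(aggregate(["weak", "moderate", "moderate", "not"]))  # majority: moderate -> 1
print(aggregate(["not", "weak", "weak", "not"]))           # tie, weak wins -> 0
```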

For the Warriner et al. (2013) lexicon of valence, arousal, and dominance, crowdworkers marked each word with a value from 1 to 9 on each of the dimensions, with the scale defined for them as follows:

  • valence (the pleasantness of the stimulus)
    9: happy, pleased, satisfied, contented, hopeful
    1: unhappy, annoyed, unsatisfied, melancholic, despaired, or bored
  • arousal (the intensity of emotion provoked by the stimulus)
    9: stimulated, excited, frenzied, jittery, wide-awake, or aroused
    1: relaxed, calm, sluggish, dull, sleepy, or unaroused;
  • dominance (the degree of control exerted by the stimulus)
    9: in control, influential, important, dominant, autonomous, or controlling
    1: controlled, influenced, cared-for, awed, submissive, or guided

19.4 Semi-supervised induction of affect lexicons

Another common way to learn sentiment lexicons is to start from a set of seed words that define two poles of a semantic axis (words like good or bad), and then find ways to label each word $w$ by its similarity to the two seed sets. Here we summarize two families of seed-based semi-supervised lexicon induction algorithms, axis-based and graph-based.

19.4.1 Semantic axis methods

One of the most well-known lexicon induction methods, the Turney and Littman (2003) algorithm, is given seed words like good or bad, and then for each word $w$ to be labeled, measures both how similar it is to good and how different it is from bad. Here we describe a slight extension of the algorithm due to An et al. (2018), which is based on computing a semantic axis.

In the first step, we choose seed words by hand. Because the sentiment or affect of a word is different in different contexts, it’s common to choose different seed words for different genres, and most algorithms are quite sensitive to the choice of seeds. For example, for inducing sentiment lexicons, Hamilton et al. (2016a) define one set of seed words for general sentiment analysis, a different set for Twitter, and yet another set for learning a lexicon for sentiment in financial text:

[Seed word examples (image unavailable)]

In the second step, we compute embeddings for each of the pole words. These embeddings can be off-the-shelf word2vec embeddings, or can be computed directly on a specific corpus (for example using a financial corpus if a finance lexicon is the goal), or we can fine-tune off-the-shelf embeddings to a corpus. Fine-tuning is especially important if we have a very specific genre of text but don’t have enough data to train good embeddings. In fine-tuning, we begin with off-the-shelf embeddings like word2vec, and continue training them on the small target corpus. Once we have embeddings for each pole word, we create an embedding that represents each pole by taking the centroid of the embeddings of each of the seed words; recall that the centroid is the multidimensional version of the mean. Given a set of embeddings for the positive seed words $S^+ = \{E(w_1^+), E(w_2^+), \ldots, E(w_n^+)\}$, and embeddings for the negative seed words $S^- = \{E(w_1^-), E(w_2^-), \ldots, E(w_m^-)\}$, the pole centroids are:
$$V^+ = \frac{1}{n}\sum_{i=1}^{n}E(w_i^+) \qquad V^- = \frac{1}{m}\sum_{i=1}^{m}E(w_i^-)$$
The semantic axis defined by the poles is computed just by subtracting the two vectors:
$$V_{axis} = V^+ - V^-$$
$V_{axis}$, the semantic axis, is a vector in the direction of sentiment. Finally, we compute how close each word $w$ is to this sentiment axis, by taking the cosine between $w$’s embedding and the axis vector. A higher cosine means that $w$ is more aligned with $S^+$ than $S^-$.
$$\mathrm{score}(w) = \cos(E(w), V_{axis}) = \frac{E(w)\cdot V_{axis}}{\|E(w)\|\,\|V_{axis}\|}$$
If a dictionary of words with sentiment scores is sufficient, we’re done! Or if we need to group words into a positive and a negative lexicon, we can use a threshold or other method to give us discrete lexicons.
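A minimal sketch of the whole procedure, using tiny made-up 2-dimensional embeddings in place of real word2vec vectors:

```python
import numpy as np

def semantic_axis_score(E, pos_seeds, neg_seeds, word):
    """E: dict word -> vector. Returns the cosine of word with V+ - V-."""
    v_pos = np.mean([E[w] for w in pos_seeds], axis=0)  # centroid of S+
    v_neg = np.mean([E[w] for w in neg_seeds], axis=0)  # centroid of S-
    axis = v_pos - v_neg                                # the semantic axis
    e = E[word]
    return float(e @ axis / (np.linalg.norm(e) * np.linalg.norm(axis)))

# Toy embeddings, invented for illustration
E = {"good": np.array([1.0, 0.2]), "great": np.array([0.9, 0.1]),
     "bad": np.array([-1.0, 0.1]), "awful": np.array([-0.8, 0.3]),
     "tasty": np.array([0.7, 0.0])}
score = semantic_axis_score(E, ["good", "great"], ["bad", "awful"], "tasty")
print(score > 0)  # tasty aligns with the positive pole
```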

19.4.2 Label propagation

An alternative family of methods defines lexicons by propagating sentiment labels on graphs, an idea suggested in early work by Hatzivassiloglou and McKeown (1997). We’ll describe the simple SentProp (Sentiment Propagation) algorithm of Hamilton et al. (2016a), which has four steps:

  1. Define a graph: Given word embeddings, build a weighted lexical graph by connecting each word with its $k$ nearest neighbors (according to cosine similarity). The weight of the edge between words $w_i$ and $w_j$ is set as:
    $$E_{i,j}=\arccos\left(-\frac{w_i^{\mathsf{T}}w_j}{\|w_i\|\,\|w_j\|}\right)$$

  2. Define a seed set: By hand, choose positive and negative seed words.

  3. Propagate polarities from the seed set: Now we perform a random walk on this graph, starting at the seed set. In a random walk, we start at a node and then choose a node to move to with probability proportional to the edge probability. A word’s polarity score for a seed set is proportional to the probability of a random walk from the seed set landing on that word (Fig. 19.6).

    [Figure 19.6 (image unavailable)]

  4. Create word scores: We walk from both positive and negative seed sets, resulting in positive ($\mathrm{score}^+(w_i)$) and negative ($\mathrm{score}^-(w_i)$) label scores. We then combine these values into a positive-polarity score as:
    $$\mathrm{score}^+(w_i) = \frac{\mathrm{score}^+(w_i)}{\mathrm{score}^+(w_i)+ \mathrm{score}^-(w_i)}$$
    It’s often helpful to standardize the scores to have zero mean and unit variance within a corpus.

  5. Assign confidence to each score: Because sentiment scores are influenced by the seed set, we’d like to know how much the score of a word would change if a different seed set is used. We can use bootstrap-sampling to get confidence regions, by computing the propagation $B$ times over random subsets of the positive and negative seed sets (for example using $B = 50$ and choosing 7 of the 10 seed words each time). The standard deviation of the bootstrap-sampled polarity scores gives a confidence measure.
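The propagation step (step 3) can be sketched as a random walk with restart on a toy graph. The transition matrix, seed choice, and damping factor beta below are illustrative assumptions, not SentProp’s exact settings:

```python
import numpy as np

def propagate(T, seeds, beta=0.85, iters=100):
    """T: column-stochastic transition matrix; seeds: node indices.
    Returns per-node probabilities of a walk (with restart to the
    seed set) landing on each node."""
    n = T.shape[0]
    s = np.zeros(n)
    s[seeds] = 1.0 / len(seeds)          # restart distribution
    p = s.copy()
    for _ in range(iters):               # power iteration; converges since beta < 1
        p = beta * (T @ p) + (1 - beta) * s
    return p

# 4-node chain 0 - 1 - 2 - 3, seed at node 0
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
T = A / A.sum(axis=0)                    # column-normalize adjacency
p = propagate(T, seeds=[0])
print(p[0] > p[2] and p[1] > p[3])       # scores decay with distance from the seed
```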

19.4.3 Other methods

Other methods have chosen other kinds of distance metrics besides embedding cosine.

For example the Hatzivassiloglou and McKeown (1997) algorithm uses syntactic cues: two adjectives are considered similar if they were frequently conjoined by and and rarely conjoined by but. This is based on the intuition that adjectives conjoined by and tend to have the same polarity; positive adjectives are generally coordinated with positive, negative with negative:

fair and legitimate, corrupt and brutal

but less often positive adjectives coordinated with negative:

*fair and brutal, *corrupt and legitimate

By contrast, adjectives conjoined by but are likely to be of opposite polarity:

fair but brutal

Another cue to opposite polarity comes from morphological negation (un-, im-, -less). Adjectives with the same root but differing in a morphological negative (adequate/inadequate, thoughtful/thoughtless) tend to be of opposite polarity.

Yet another method for finding words that have a similar polarity to seed words is to make use of a thesaurus like WordNet (Kim and Hovy 2004, Hu and Liu 2004b). A word’s synonyms presumably share its polarity while a word’s antonyms probably have the opposite polarity. After a seed lexicon is built, each lexicon is updated as follows, possibly iterated.

Lex$^+$: Add synonyms of positive words (well) and antonyms (like fine) of negative words

Lex$^-$: Add synonyms of negative words (awful) and antonyms (like evil) of positive words
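One iteration of this expansion can be sketched with a hand-built stand-in for WordNet’s synonym and antonym relations (the toy thesaurus below is invented for illustration):

```python
# Toy stand-ins for WordNet synonym/antonym lookups
SYN = {"good": ["well", "fine"], "bad": ["awful", "poor"]}
ANT = {"good": ["bad"], "bad": ["good"], "awful": ["fine"]}

def expand(lex_pos, lex_neg):
    """One update of the two lexicons from thesaurus relations."""
    new_pos, new_neg = set(lex_pos), set(lex_neg)
    for w in lex_pos:
        new_pos.update(SYN.get(w, []))   # synonyms share polarity
        new_neg.update(ANT.get(w, []))   # antonyms flip it
    for w in lex_neg:
        new_neg.update(SYN.get(w, []))
        new_pos.update(ANT.get(w, []))
    return new_pos, new_neg

pos, neg = expand({"good"}, {"awful"})
print(sorted(pos), sorted(neg))  # ['fine', 'good', 'well'] ['awful', 'bad']
```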

An extension of this algorithm assigns polarity to WordNet senses, called SentiWordNet (Baccianella et al., 2010). Fig. 19.7 shows some examples.

[Figure 19.7 (image unavailable)]

In this algorithm, polarity is assigned to entire synsets rather than words. A positive lexicon is built from all the synsets associated with 7 positive words, and a negative lexicon from synsets associated with 7 negative words. A classifier is then trained from this data to take a WordNet gloss and decide if the sense being defined is positive, negative or neutral. A further step (involving a random-walk algorithm) assigns a score to each WordNet synset for its degree of positivity, negativity, and neutrality.

In summary, semisupervised algorithms use a human-defined set of seed words for the two poles of a dimension, and use similarity metrics like embedding cosine, coordination, morphology, or thesaurus structure to score words by how similar they are to the positive seeds and how dissimilar to the negative seeds.

19.5 Supervised learning of word sentiment

Semi-supervised methods require only minimal human supervision (in the form of seed sets). But sometimes a supervision signal exists in the world and can be made use of. One such signal is the scores associated with online reviews.

The web contains an enormous number of online reviews for restaurants, movies, books, or other products, each of which has the text of the review along with an associated review score: a value that may range from 1 star to 5 stars, or a score from 1 to 10. Fig. 19.8 shows samples extracted from restaurant, book, and movie reviews.

[Figure 19.8 (image unavailable)]

We can use this review score as supervision: positive words are more likely to appear in 5-star reviews; negative words in 1-star reviews. And instead of just a binary polarity, this kind of supervision allows us to assign a word a more complex representation of its polarity: its distribution over stars (or other scores).

Thus in a ten-star system we could represent the sentiment of each word as a 10-tuple, each number a score representing the word’s association with that polarity level. This association can be a raw count, or a likelihood $P(w|c)$, or some other function of the count, for each class $c$ from 1 to 10.

For example, we could compute the IMDB likelihood of a word like disappoint(ed/ing) occurring in a 1-star review by dividing the number of times disappoint(ed/ing) occurs in 1-star reviews in the IMDB dataset (8,557) by the total number of words occurring in 1-star reviews (25,395,214), so the IMDB estimate of $P(disappointing|1)$ is .0003.

A slight modification of this weighting, the normalized likelihood, can be used as an illuminating visualization (Potts, 2011):
$$P(w|c) = \frac{count(w,c)}{\sum_{w\in C}count(w,c)} \qquad \mathrm{PottsScore}(w,c)=\frac{P(w|c)}{\sum_c P(w|c)}$$
Dividing the IMDB estimate $P(disappointing|1)$ of .0003 by the sum of the likelihood $P(w|c)$ over all categories gives a Potts score of 0.10. The word disappointing thus is associated with the vector [.10, .12, .14, .14, .13, .11, .08, .06, .06, .05]. The Potts diagram (Potts, 2011) is a visualization of these word scores, representing the prior sentiment of a word as a distribution over the rating categories.
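The normalized-likelihood computation can be sketched directly; the per-category counts below are made up, not the real IMDB figures:

```python
def potts_scores(word_counts, total_counts):
    """word_counts[c]: count of the word in category c;
    total_counts[c]: total words in category c.
    Returns the likelihoods renormalized over categories."""
    likelihood = [wc / tc for wc, tc in zip(word_counts, total_counts)]
    z = sum(likelihood)
    return [p / z for p in likelihood]   # distribution over categories

# A word used mostly in low ratings (hypothetical 4-category example)
scores = potts_scores([40, 30, 20, 10], [1000, 1000, 1000, 1000])
print([round(s, 2) for s in scores])     # [0.4, 0.3, 0.2, 0.1]
```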

Fig. 19.9 shows the Potts diagrams for 3 positive and 3 negative scalar adjectives. Note that the curves for strongly positive scalars have the shape of the letter J, while strongly negative scalars look like a reverse J. By contrast, weakly positive and negative scalars have a hump shape, with the maximum either below the mean (weakly negative words like disappointing) or above the mean (weakly positive words like good). These shapes offer an illuminating typology of affective word meaning.

[Figure 19.9 (image unavailable)]

Fig. 19.10 shows the Potts diagrams for emphasizing and attenuating adverbs. Again we see generalizations in the characteristic curves associated with words of particular meanings. Note that emphatics tend to have a J-shape (most likely to occur in the most positive reviews) or a U-shape (most likely to occur in the strongly positive and negative). Attenuators all have the hump-shape, emphasizing the middle of the scale and downplaying both extremes.

[Figure 19.10 (image unavailable)]

The diagrams can be used both as a typology of lexical sentiment, and also play a role in modeling sentiment compositionality.

In addition to functions like posterior $P(c|w)$, likelihood $P(w|c)$, or normalized likelihood (Eq. 19.6), many other functions of the count of a word occurring with a sentiment label have been used. We’ll introduce some of these on page 394, including ideas like normalizing the counts per writer in Eq. 19.14.

19.5.1 Log odds ratio informative Dirichlet prior

One thing we often want to do with word polarity is to distinguish between words that are more likely to be used in one category of texts than in another.

In this section we walk through the details of one solution to this problem: the “log odds ratio informative Dirichlet prior” method of Monroe et al. (2008) that is a particularly useful method for finding words that are statistically overrepresented in one particular category of texts compared to another. It’s based on the idea of using another large corpus to get a prior estimate of what we expect the frequency of each word to be.

Let’s start with the goal: assume we want to know whether the word horrible occurs more in corpus $i$ or corpus $j$. We could compute the log likelihood ratio, using $f^i(w)$ to mean the frequency of word $w$ in corpus $i$, and $n^i$ to mean the total number of words in corpus $i$:
$$\mathrm{llr}(horrible)=\log \frac{P^{i}(horrible)}{P^{j}(horrible)}=\log P^{i}(horrible)-\log P^{j}(horrible)=\log \frac{f^i(horrible)}{n^{i}}-\log \frac{f^{j}(horrible)}{n^{j}}$$
Instead, let’s compute the log odds ratio: does horrible have higher odds in $i$ or in $j$?
$$\mathrm{lor}(horrible)=\log\left(\frac{P^i(horrible)}{1-P^i(horrible)}\right)-\log\left(\frac{P^j(horrible)}{1-P^j(horrible)}\right)$$
$$=\log\left(\frac{f^i(horrible)/n^i}{1-f^i(horrible)/n^i}\right)-\log\left(\frac{f^j(horrible)/n^j}{1-f^j(horrible)/n^j}\right)$$
$$=\log\left(\frac{f^i(horrible)}{n^i-f^i(horrible)}\right)-\log\left(\frac{f^j(horrible)}{n^j-f^j(horrible)}\right)$$
The Dirichlet intuition is to use a large background corpus to get a prior estimate of what we expect the frequency of each word $w$ to be. We’ll do this very simply by adding the counts from that corpus to the numerator and denominator, so that we’re essentially shrinking the counts toward that prior. It’s like asking how large the differences between $i$ and $j$ are, given what we would expect from their frequencies in a well-estimated large background corpus.

The method estimates the difference between the frequency of word $w$ in two corpora $i$ and $j$ via the prior-modified log odds ratio for $w$, $\delta_w^{(i-j)}$, which is estimated as:
$$\delta_{w}^{(i-j)}=\log \left(\frac{f_{w}^{i}+\alpha_{w}}{n^{i}+\alpha_{0}-\left(f_{w}^{i}+\alpha_{w}\right)}\right)-\log \left(\frac{f_{w}^{j}+\alpha_{w}}{n^{j}+\alpha_{0}-\left(f_{w}^{j}+\alpha_{w}\right)}\right)$$
(where $n^i$, $n^j$, and $\alpha_0$ are the sizes of corpus $i$, corpus $j$, and the background corpus, and $f_w^i$, $f_w^j$, and $\alpha_w$ are the counts of word $w$ in corpus $i$, corpus $j$, and the background corpus.)

In addition, Monroe et al. (2008) make use of an estimate for the variance of the log odds ratio:
$$\sigma^{2}\left(\hat{\delta}_{w}^{(i-j)}\right) \approx \frac{1}{f_{w}^{i}+\alpha_{w}}+\frac{1}{f_{w}^{j}+\alpha_{w}}$$
The final statistic for a word is then the z-score of its log odds ratio:
$$\frac{\hat{\delta}_{w}^{(i-j)}}{\sqrt{\sigma^{2}\left(\hat{\delta}_{w}^{(i-j)}\right)}}$$
The Monroe et al. (2008) method thus modifies the commonly used log odds ratio in two ways: it uses the z-scores of the log odds ratio, which controls for the amount of variance in a word’s frequency, and it uses counts from a background corpus to provide a prior count for words.
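Both modifications can be sketched in a few lines; the counts and pseudocount values below are invented for illustration:

```python
import math

def log_odds_z(f_i, n_i, f_j, n_j, alpha_w, alpha_0):
    """Prior-modified log odds ratio of a word in corpora i vs. j,
    divided by its estimated standard deviation (the z-score)."""
    def lo(f, n):
        # log odds with background pseudocounts added
        return math.log((f + alpha_w) / (n + alpha_0 - (f + alpha_w)))
    delta = lo(f_i, n_i) - lo(f_j, n_j)
    var = 1.0 / (f_i + alpha_w) + 1.0 / (f_j + alpha_w)
    return delta / math.sqrt(var)

# "horrible": frequent in 1-star reviews (corpus i), rare in 5-star (corpus j)
z = log_odds_z(f_i=120, n_i=100_000, f_j=10, n_j=100_000,
               alpha_w=50, alpha_0=500_000)
print(z > 0)  # overrepresented in corpus i
```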

Fig. 19.11 shows the method applied to a dataset of restaurant reviews from Yelp, comparing the words used in 1-star reviews to the words used in 5-star reviews (Jurafsky et al., 2014). The largest difference is in obvious sentiment words, with the 1-star reviews using negative sentiment words like worse, bad, awful and the 5-star reviews using positive sentiment words like great, best, amazing. But there are other illuminating differences. 1-star reviews use logical negation (no, not), while 5-star reviews use emphatics and emphasize universality (very, highly, every, always). 1-star reviews use first person plurals (we, us, our) while 5-star reviews use the second person. 1-star reviews talk about people (manager, waiter, customer) while 5-star reviews talk about dessert and properties of expensive restaurants like courses and atmosphere. See Jurafsky et al. (2014) for more details.

[Figure 19.11 (image unavailable)]

19.6 Using Lexicons for Sentiment Recognition

In Chapter 4 we introduced the naïve Bayes algorithm for sentiment analysis. The lexicons we have focused on throughout the chapter so far can be used in a number of ways to improve sentiment detection. In the simplest case, lexicons can be used when we don’t have sufficient training data to build a supervised sentiment analyzer; it can often be expensive to have a human assign sentiment to each document to train the supervised classifier.

In such situations, lexicons can be used in a simple rule-based algorithm for classification. The simplest version is just to use the ratio of positive to negative words: if a document has more positive than negative words (using the lexicon to decide the polarity of each word in the document), it is classified as positive. Often a threshold $\lambda$ is used, in which a document is classified as positive only if the ratio is greater than $\lambda$. If the sentiment lexicon includes positive and negative weights for each word, $\theta_w^+$ and $\theta_w^-$, these can be used as well. Here’s a simple such sentiment algorithm:
$$f^{+}=\sum_{w \,\in\, \text{positive lexicon}} \theta_{w}^{+}\, count(w) \qquad f^{-}=\sum_{w \,\in\, \text{negative lexicon}} \theta_{w}^{-}\, count(w)$$
$$sentiment=\begin{cases} + & \text{if } \frac{f^{+}}{f^{-}}>\lambda \\ - & \text{if } \frac{f^{-}}{f^{+}}>\lambda \\ 0 & \text{otherwise}\end{cases}$$
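A minimal sketch of this thresholded classifier, with a tiny weighted lexicon invented for illustration:

```python
# Hypothetical weighted sentiment lexicons
POS = {"great": 1.0, "good": 0.5, "amazing": 1.0}
NEG = {"bad": 1.0, "awful": 1.0, "bland": 0.5}

def classify(tokens, lam=1.5):
    """Weighted positive/negative sums, compared via a ratio threshold."""
    f_pos = sum(POS.get(w, 0.0) for w in tokens)
    f_neg = sum(NEG.get(w, 0.0) for w in tokens)
    if f_neg and f_pos / f_neg > lam:
        return "+"
    if f_pos and f_neg / f_pos > lam:
        return "-"
    if f_pos and not f_neg:          # only one polarity present
        return "+"
    if f_neg and not f_pos:
        return "-"
    return "0"

print(classify("the food was great and amazing".split()))  # +
print(classify("bland and awful".split()))                 # -
```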
If supervised training data is available, these counts computed from sentiment lexicons, sometimes weighted or normalized in various ways, can also be used as features in a classifier along with other lexical or non-lexical features. We return to such algorithms in Section 19.8.

19.7 Other tasks: Personality

Many other kinds of affective meaning can be extracted from text and speech. For example detecting a person’s personality from their language can be useful for dialog systems (users tend to prefer agents that match their personality), and can play a useful role in computational social science questions like understanding how personality is related to other kinds of behavior.

Many theories of human personality are based around a small number of dimensions, such as various versions of the “Big Five” dimensions (Digman, 1990):

Extroversion vs. Introversion: sociable, assertive, playful vs. aloof, reserved, shy
Emotional stability vs. Neuroticism: calm, unemotional vs. insecure, anxious
Agreeableness vs. Disagreeableness: friendly, cooperative vs. antagonistic, faultfinding
Conscientiousness vs. Unconscientiousness: self-disciplined, organized vs. inefficient, careless
Openness to experience: intellectual, insightful vs. shallow, unimaginative

A few corpora of text and speech have been labeled for the personality of their author by having the authors take a standard personality test. The essay corpus of Pennebaker and King (1999) consists of 2,479 essays (1.9 million words) from psychology students who were asked to “write whatever comes into your mind” for 20 minutes. The EAR (Electronically Activated Recorder) corpus of Mehl et al. (2006) was created by having volunteers wear a recorder that randomly captured short snippets of conversation throughout the day, which were then transcribed. The Facebook corpus of Schwartz et al. (2013) includes 309 million words of Facebook posts from 75,000 volunteers.

For example, here are samples from Pennebaker and King (1999) from an essay written by someone on the neurotic end of the neurotic/emotionally stable scale,

One of my friends just barged in, and I jumped in my seat. This is crazy. I should tell him not to do that again. I’m not that fastidious actually. But certain things annoy me. The things that would annoy me would actually annoy any normal human being, so I know I’m not a freak.

and someone on the emotionally stable end of the scale:
I should excel in this sport because I know how to push my body harder than anyone I know, no matter what the test I always push my body harder than everyone else. I want to be the best no matter what the sport or event. I should also be good at this because I love to ride my bike.

Another kind of affective meaning is what Scherer (2000) calls interpersonal stance, the ‘affective stance taken toward another person in a specific interaction coloring the interpersonal exchange’. Extracting this kind of meaning means automatically labeling participants for whether they are friendly, supportive, or distant. For example Ranganath et al. (2013) studied a corpus of speed-dates, in which participants went on a series of 4-minute romantic dates, wearing microphones. Each participant labeled each other for how flirtatious, friendly, awkward, or assertive they were. Ranganath et al. (2013) then used a combination of lexicons and other features to detect these interpersonal stances from text.

19.8 Affect Recognition

Detection of emotion, personality, interactional stance, and the other kinds of affective meaning described by Scherer (2000) can be done by generalizing the algorithms described above for detecting sentiment.

The most common algorithms involve supervised classification: a training set is labeled for the affective meaning to be detected, and a classifier is built using features extracted from the training set. As with sentiment analysis, if the training set is large enough, and the test set is sufficiently similar to the training set, simply using all the words or all the bigrams as features in a powerful classifier like SVM or logistic regression, as described in Fig. 4.2 in Chapter 4, is an excellent algorithm whose performance is hard to beat. Thus we can treat affective meaning classification of a text sample as simple document classification.
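As a concrete illustration, here is a minimal sketch of extracting unigram and bigram count features from a text, the kind of feature set such a classifier would consume. The function name and the naive whitespace tokenization are ours, not from the text:

```python
from collections import Counter

def ngram_features(text, n_max=2):
    """Count all unigrams up through n_max-grams in a text,
    for use as features in a supervised affect classifier."""
    tokens = text.lower().split()  # naive whitespace tokenization
    feats = Counter(tokens)        # unigrams
    for n in range(2, n_max + 1):
        for i in range(len(tokens) - n + 1):
            feats[" ".join(tokens[i:i + n])] += 1
    return feats

feats = ngram_features("I love this movie , I love it")
# the bigram "i love" occurs twice
```

In practice these counts would be fed to a classifier such as logistic regression or an SVM, as in Fig. 4.2.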

Some modifications are nonetheless often necessary for very large datasets. For example, the Schwartz et al. (2013) study of personality, gender, and age using 700 million words of Facebook posts used only a subset of the n-grams of lengths 1-3. Only words and phrases used by at least 1% of the subjects were included as features, and 2-grams and 3-grams were only kept if they had sufficiently high PMI (PMI greater than $2*length$, where length is the number of words in the phrase):

$$\text{pmi}(phrase) = \log \frac{p(phrase)}{\prod\limits_{w \in phrase} p(w)} \tag{19.13}$$
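A sketch of this PMI filter, assuming unigram and phrase probabilities have already been estimated from the corpus; the dictionaries below hold toy values, not numbers from Schwartz et al. (2013):

```python
import math

# toy probability estimates; a real system would estimate these from the corpus
word_prob = {"new": 0.01, "york": 0.005, "the": 0.05, "cat": 0.01}
phrase_prob = {"new york": 0.004, "the cat": 0.0005}

def pmi(phrase):
    """PMI of a phrase: log of its probability over the product
    of its words' unigram probabilities."""
    words = phrase.split()
    indep = math.prod(word_prob[w] for w in words)
    return math.log(phrase_prob[phrase] / indep)

def keep_phrase(phrase):
    """Keep a 2-gram or 3-gram only if its PMI exceeds 2 * length."""
    return pmi(phrase) > 2 * len(phrase.split())

# "new york": pmi = log(0.004 / 0.00005) ≈ 4.38 > 4, so it is kept
# "the cat":  pmi = log(0.0005 / 0.0005) = 0 < 4, so it is dropped
```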
Various weights can be used for the features, including the raw count in the training set, or some normalized probability or log probability. Schwartz et al. (2013), for example, turn feature counts into phrase likelihoods by normalizing them by each subject’s total word use.
$$p(phrase \mid subject) = \frac{\text{freq}(phrase, subject)}{\sum\limits_{phrase' \in \text{vocab}(subject)} \text{freq}(phrase', subject)} \tag{19.14}$$
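The phrase-likelihood normalization above is a one-liner over a subject's phrase counts; a minimal sketch with toy counts:

```python
from collections import Counter

def phrase_likelihoods(phrase_counts):
    """Normalize one subject's phrase counts into likelihoods
    p(phrase | subject) by dividing by the subject's total count."""
    total = sum(phrase_counts.values())
    return {phrase: count / total for phrase, count in phrase_counts.items()}

# toy counts for one subject
probs = phrase_likelihoods(Counter({"so happy": 3, "my family": 1}))
# probs["so happy"] == 0.75
```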
If the training data is sparser, or not as similar to the test set, any of the lexicons we’ve discussed can play a helpful role, either alone or in combination with all the words and n-grams. Many possible values can be used for lexicon features. The simplest is just an indicator function, in which the value of a feature $f_L$ takes the value 1 if a particular text has any word from the relevant lexicon $L$. Using the notation of Chapter 4, in which a feature value is defined for a particular output class $c$ and document $x$:
$$f_{L}(c, x)=\begin{cases}1 & \text{if } \exists w : w \in L \text{ and } w \in x \text{ and } class = c\\ 0 & \text{otherwise}\end{cases}$$
Alternatively the value of a feature $f_L$ for a particular lexicon $L$ can be the total number of word tokens in the document that occur in $L$:

$$f_L = \sum_{w \in L} count(w)$$
For lexica in which each word is associated with a score or weight, the count can be multiplied by a weight $\theta_w^L$:

$$f_L = \sum_{w \in L} \theta_w^L\, count(w)$$
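The three feature definitions above (indicator, token count, and weighted count) can be computed together. A minimal sketch, using a toy positive-sentiment lexicon whose words and weights are purely illustrative:

```python
def lexicon_features(tokens, lexicon, weights=None):
    """Compute the three lexicon feature variants: binary indicator,
    token count, and weighted count."""
    in_lex = [w for w in tokens if w in lexicon]
    indicator = 1 if in_lex else 0   # f_L as indicator function
    count = len(in_lex)              # f_L as total token count
    # f_L as weighted count, using theta_w^L when weights are given
    weighted = sum(weights.get(w, 0.0) for w in in_lex) if weights else float(count)
    return indicator, count, weighted

pos_lex = {"great", "love", "happy"}
theta = {"great": 0.8, "love": 0.9, "happy": 0.7}
ind, cnt, wtd = lexicon_features("i love love this great movie".split(),
                                 pos_lex, theta)
# ind == 1, cnt == 3 ('love' twice, 'great' once), wtd ≈ 0.9 + 0.9 + 0.8 = 2.6
```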
Counts can alternatively be logged or normalized per writer as in Eq. 19.14.

However they are defined, these lexicon features are then used in a supervised classifier to predict the desired affective category for the text or document. Once a classifier is trained, we can examine which lexicon features are associated with which classes. For a classifier like logistic regression the feature weight gives an indication of how associated the feature is with the class.

Thus, for example, Mairesse and Walker (2008) found that for classifying personality, for the dimension Agreeable, the LIWC lexicons Family and Home were positively associated while the LIWC lexicons Anger and Swear were negatively associated. By contrast, Extroversion was positively associated with the Friend, Religion, and Self lexicons, and Emotional Stability was positively associated with Sports and negatively associated with Negative Emotion.

[Figure 19.12]

19.9 Connotation Frames

The lexicons we’ve described so far define a word as a point in affective space. A connotation frame, by contrast, is a lexicon that incorporates a richer kind of grammatical structure, by combining affective lexicons with the frame semantic lexicons of Chapter 18. The basic insight of connotation frame lexicons is that a predicate like a verb expresses connotations about the verb’s arguments (Rashkin et al. 2016, Rashkin et al. 2017).

Consider sentences like:

(19.15) Country A violated the sovereignty of Country B

(19.16) the teenager … survived the Boston Marathon bombing

By using the verb violate in (19.15), the author is expressing their sympathies with Country B, portraying Country B as a victim, and expressing antagonism toward the agent Country A. By contrast, in using the verb survive, the author of (19.16) is expressing that the bombing is a negative experience, and the subject of the sentence the teenager, is a sympathetic character. These aspects of connotation are inherent in the meaning of the verbs violate and survive, as shown in Fig. 19.13.

[Figure 19.13: connotation frames for the verbs violate and survive]

The connotation frame lexicons of Rashkin et al. (2016) and Rashkin et al. (2017) also express other connotative aspects of the predicate toward each argument, including effect (something bad happened to x), value (x is valuable), and mental state (x is distressed by the event). Connotation frames can also mark aspects of power and agency; see Chapter 18 (Sap et al., 2017).
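A connotation frame lexicon can be sketched as a map from verbs to the writer's sentiment toward each argument. The tiny two-verb lexicon below is hypothetical, hand-coded from the violate/survive examples above rather than taken from Rashkin et al.:

```python
# Hypothetical miniature connotation frame lexicon: for each verb,
# the writer's sentiment toward the subject (agent) and object (theme).
FRAMES = {
    "violate": {"writer_to_subj": "-", "writer_to_obj": "+"},
    "survive": {"writer_to_subj": "+", "writer_to_obj": "-"},
}

def writer_sentiment(subj, verb, obj):
    """Project the verb's connotation frame onto its two arguments."""
    frame = FRAMES.get(verb)
    if frame is None:
        return None  # verb not covered by the lexicon
    return {subj: frame["writer_to_subj"], obj: frame["writer_to_obj"]}

out = writer_sentiment("Country A", "violate", "Country B")
# the writer is antagonistic toward Country A ("-")
# and sympathetic toward Country B ("+")
```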

Connotation frames can be built by hand (Sap et al., 2017), or they can be learned by supervised learning (Rashkin et al., 2016), for example using hand-labeled training data to supervise classifiers for each of the individual relations, e.g., whether S(writer → Role1) is + or −, and then improving accuracy via global constraints across all relations.

19.10 Summary

  • Many kinds of affective states can be distinguished, including emotions, moods, attitudes (which include sentiment), interpersonal stance, and personality.
  • Emotion can be represented by fixed atomic units often called basic emotions, or as points in space defined by dimensions like valence and arousal.
  • Words have connotational aspects related to these affective states, and this connotational aspect of word meaning can be represented in lexicons.
  • Affective lexicons can be built by hand, using crowd sourcing to label the affective content of each word.
  • Lexicons can be built semi-supervised, bootstrapping from seed words using similarity metrics like embedding cosine.
  • Lexicons can be learned in a fully supervised manner, when a convenient training signal can be found in the world, such as ratings assigned by users on a review site.
  • Words can be assigned weights in a lexicon by using various functions of word counts in training texts, and ratio metrics like the log odds ratio with an informative Dirichlet prior.
  • Personality is often represented as a point in 5-dimensional space.
  • Affect can be detected, just like sentiment, by using standard supervised text classification techniques, using all the words or bigrams in a text as features. Additional features can be drawn from counts of words in lexicons.
  • Lexicons can also be used to detect affect in a rule-based classifier by picking the simple majority sentiment based on counts of words in each lexicon.
  • Connotation frames express richer relations of affective meaning that a predicate encodes about its arguments.
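The rule-based classifier summarized above, which picks the simple majority sentiment from lexicon word counts, can be sketched as follows (the two toy lexicons are illustrative):

```python
def majority_sentiment(tokens, pos_lex, neg_lex):
    """Rule-based classification: count tokens in each lexicon
    and pick the simple majority sentiment."""
    pos = sum(1 for w in tokens if w in pos_lex)
    neg = sum(1 for w in tokens if w in neg_lex)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"

label = majority_sentiment("a great great but boring film".split(),
                           {"great", "love"}, {"boring", "awful"})
# pos = 2 ('great' twice), neg = 1 ('boring'), so label == "positive"
```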
