CHAPTER 23 Question Answering

Reading notes on Speech and Language Processing, 3rd edition.

There are two major paradigms of question answering: information retrieval-based and knowledge-based.

Most question answering systems focus on factoid questions, questions that can be answered with simple facts expressed in short texts.

Information-retrieval or IR-based question answering relies on the vast quantities of textual information on the web or in collections like PubMed. Given a user question, information retrieval techniques first find relevant documents and passages. Then systems (feature-based, neural, or both) use reading comprehension algorithms to read these retrieved documents or passages and draw an answer directly from spans of text.

In knowledge-based question answering, a system instead builds a semantic representation of the query, mapping "What states border Texas?" to the logical representation $\lambda x.\,state(x) \land borders(x, texas)$, or "When was Ada Lovelace born?" to the gapped relation `birth-year(Ada Lovelace, ?x)`. These meaning representations are then used to query databases of facts.

23.1 IR-based Factoid Question Answering

Figure 23.2 shows the three phases of an IR-based factoid question-answering system: question processing, passage retrieval and ranking, and answer extraction.

[Figure 23.2]

23.1.1 Question Processing

The main goal of the question-processing phase is to extract the query: the keywords passed to the IR system to match potential documents. Some systems additionally extract further information such as:

  • answer type: the entity type (person, location, time, etc.) of the answer
  • focus: the string of words in the question that are likely to be replaced by the answer in any answer string found.
  • question type: is this a definition question, a math question, a list question?

For example, for the question Which US state capital has the largest population? the query processing might produce:

query: “US state capital has the largest population”

answer type: city

focus: state capital

In the next two sections we summarize the two most commonly used tasks, query formulation and answer type detection.

23.1.2 Query Formulation

For question answering from the web, we can simply pass the entire question to the web search engine, at most perhaps leaving out the question word (where, when, etc.).

For question answering from smaller sets of documents like corporate information pages or Wikipedia, we still use an IR engine to index and search our documents, generally using standard tf-idf cosine matching, but we might need to do more processing. For example, for searching Wikipedia, it helps to compute tf-idf over bigrams rather than unigrams in the query and document (Chen et al., 2017). Or we might need to do query expansion, adding query terms in hopes of matching the particular form of the answer as it appears, like adding morphological variants of the content words in the question, or synonyms from a thesaurus.
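As a concrete illustration of the bigram tf-idf idea, here is a minimal retrieval sketch using scikit-learn; the toy documents and the scoring setup are invented for illustration and stand in for a real index over Wikipedia passages.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy document collection standing in for Wikipedia paragraphs.
docs = [
    "Austin is the capital of the state of Texas.",
    "Phoenix is the capital and most populous city of Arizona.",
    "The capital of France is Paris.",
]
question = "Which US state capital has the largest population?"

# tf-idf over unigrams and bigrams (the bigram features follow the suggestion above).
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
doc_vectors = vectorizer.fit_transform(docs)
query_vector = vectorizer.transform([question])

# Rank documents by cosine similarity to the query.
scores = cosine_similarity(query_vector, doc_vectors)[0]
for score, doc in sorted(zip(scores, docs), reverse=True):
    print(f"{score:.3f}  {doc}")
```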

A query formulation approach that is sometimes used for querying the web is to apply query reformulation rules to the query. The rules rephrase the question to make it look like a substring of possible declarative answers. The question “when was the laser invented?” might be reformulated as “the laser was invented”; the question “where is the Valley of the Kings?” as “the Valley of the Kings is located in”. Here are some sample hand-written reformulation rules from Lin (2007) (a regex sketch follows the rules):

(23.4) wh-word did A verb B → … A verb+ed B
(23.5) Where is A → A is located in
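A minimal sketch of such reformulation rules as regular-expression rewrites; the two patterns below are illustrative stand-ins, not the actual rules from Lin (2007).

```python
import re

# Illustrative reformulation rules in the spirit of (23.4)-(23.5):
# each pattern rewrites a question into a substring of a likely declarative answer.
RULES = [
    (re.compile(r"^where is (?P<A>.+)\?$", re.I), r"\g<A> is located in"),
    (re.compile(r"^when was (?P<A>.+) invented\?$", re.I), r"\g<A> was invented"),
]

def reformulate(question):
    question = question.strip()
    for pattern, template in RULES:
        if pattern.match(question):
            return pattern.sub(template, question)
    return question  # fall back to the original question

print(reformulate("Where is the Valley of the Kings?"))
# -> "the Valley of the Kings is located in"
print(reformulate("When was the laser invented?"))
# -> "the laser was invented"
```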

23.1.3 Answer Types

Some systems make use of question classification, the task of finding the answer type, the named-entity category of the answer.

We can also use a larger hierarchical set of answer types called an answer type taxonomy. Figure 23.4 shows one such hand-built ontology, the Li and Roth (2005) tagset; a subset is also shown in Fig. 23.3.

[Figure 23.3]

[Figure 23.4]

Question classifiers can be built by

  • hand-writing rules
  • supervised learning, trained on databases of questions that have been hand-labeled with an answer type (Li and Roth, 2002).

    • Feature-based methods rely on words in the questions and their embeddings, the part-of-speech of each word, and named entities in the questions. A particularly useful feature is the answer type word or question headword, which may be defined as the headword of the first NP after the question’s wh-word; headwords are indicated in boldface in the following examples (a minimal classifier sketch follows the examples):
      (23.7) Which **city** in China has the largest number of foreign financial companies?
      (23.8) What is the state **flower** of California?
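As a minimal illustration of the supervised approach, the sketch below trains a bag-of-words answer type classifier with scikit-learn. The tiny training set and the coarse label set are invented for illustration; a real classifier would be trained on the Li and Roth data with richer features (POS tags, headwords, named entities, embeddings).

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny invented training set: (question, answer type).
train = [
    ("Who wrote Hamlet?", "HUMAN"),
    ("Who is the president of France?", "HUMAN"),
    ("Where is the Eiffel Tower?", "LOCATION"),
    ("Which city hosted the 2008 Olympics?", "LOCATION"),
    ("When was the laser invented?", "NUMERIC"),
    ("How many moons does Mars have?", "NUMERIC"),
]
questions, labels = zip(*train)

# Bag-of-words and bigram features plus a linear classifier.
classifier = make_pipeline(CountVectorizer(ngram_range=(1, 2)), LogisticRegression())
classifier.fit(questions, labels)

print(classifier.predict(
    ["Which city in China has the largest number of foreign financial companies?"]))
```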

23.1.4 Document and Passage Retrieval

The IR query produced from the question processing stage is sent to an IR engine, resulting in a set of documents ranked by their relevance to the query.

QA systems next divide the top $n$ documents into smaller passages such as sections, paragraphs, or sentences.

The simplest form of passage retrieval is then to simply pass along every passage to the answer extraction stage. A more sophisticated variant is to filter the passages by running a named entity or answer type classification on the retrieved passages.

It’s also possible to use supervised learning to fully rank the remaining passages, using features like the following (a scoring sketch follows the list):

  • The number of named entities of the right type in the passage
  • The number of question keywords in the passage
  • The longest exact sequence of question keywords that occurs in the passage
  • The rank of the document from which the passage was extracted
  • The proximity of the keywords from the original query to each other (Pasca 2003, Monz 2004).
  • The number of n-grams that overlap between the passage and the question (Brill et al., 2002).
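A hedged sketch of a feature-based passage scorer along these lines; the feature set is a subset of those above, and the hand-set weights stand in for what a trained ranker would learn.

```python
def passage_features(passage, question_keywords, named_entity_count, doc_rank):
    """Compute a few of the passage-ranking features described above."""
    tokens = [t.strip(".,;:!?").lower() for t in passage.split()]
    matched = {t for t in tokens if t in question_keywords}
    return {
        "num_answer_type_entities": named_entity_count,  # from an NER / answer-type tagger
        "num_question_keywords": len(matched),
        "document_rank": doc_rank,                       # rank of the source document
    }

def passage_score(features):
    # Hand-set weights standing in for a trained ranker (e.g. logistic regression).
    weights = {"num_answer_type_entities": 1.0,
               "num_question_keywords": 0.5,
               "document_rank": -0.1}
    return sum(weights[name] * value for name, value in features.items())

feats = passage_features(
    "Austin, the capital of Texas, has a population of nearly one million.",
    question_keywords={"capital", "population", "texas"},
    named_entity_count=2,
    doc_rank=1,
)
print(feats, passage_score(feats))
```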

For question answering from the web we can instead take snippets from the Web search engine (see Fig. 23.5) as the passages.

[Figure 23.5]

23.1.5 Answer Extraction

This task is commonly modeled by span labeling: given a passage, identifying the span of text which constitutes an answer.

A simple baseline algorithm for answer extraction is to run a named entity tagger on the candidate passage and return whatever span in the passage is the correct answer type.
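A minimal sketch of this baseline using spaCy's named entity tagger (assuming spaCy and its small English model are installed); the mapping from answer types to spaCy entity labels is an assumption made for illustration.

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed

# Assumed mapping from coarse answer types to spaCy entity labels.
ANSWER_TYPE_TO_LABELS = {
    "PERSON": {"PERSON"},
    "LOCATION": {"GPE", "LOC"},
    "TIME": {"DATE", "TIME"},
}

def extract_candidates(passage, answer_type):
    """Return every entity span in the passage whose label matches the answer type."""
    doc = nlp(passage)
    wanted = ANSWER_TYPE_TO_LABELS.get(answer_type, set())
    return [ent.text for ent in doc.ents if ent.label_ in wanted]

print(extract_candidates("Ada Lovelace was born in London in 1815.", "TIME"))
# e.g. ['1815'] (exact spans depend on the tagger)
```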

The answers to many questions, such as DEFINITION questions, don’t tend to be of a particular named entity type.

23.1.6 Feature-based Answer Extraction

Supervised learning approaches to answer extraction train classifiers to decide if a span or a sentence contains an answer. One obviously useful feature is the answer type feature of the above baseline algorithm. Hand-written regular expression patterns also play a role, such as the sample patterns for definition questions in Fig. 23.6.

[Figure 23.6]

Other features in such classifiers include the following (a feature-computation sketch follows the list):

Answer type match: True if the candidate answer contains a phrase with the correct answer type.

Pattern match: The identity of a pattern that matches the candidate answer.

Number of matched question keywords: How many question keywords are contained in the candidate answer.

Keyword distance: The distance between the candidate answer and query keywords.

Novelty factor: True if at least one word in the candidate answer is novel, that is, not in the query.

Apposition features: True if the candidate answer is an appositive to a phrase containing many question terms. Can be approximated by the number of question terms separated from the candidate answer through at most three words and one comma (Pasca, 2003).

Punctuation location: True if the candidate answer is immediately followed by a comma, period, quotation marks, semicolon, or exclamation mark.

Sequences of question terms: The length of the longest sequence of question terms that occurs in the candidate answer.
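A hedged sketch computing a few of the features above for a candidate answer; the whitespace tokenization and the example inputs are simplifications for illustration.

```python
def answer_features(candidate, question_tokens):
    """Compute a few of the answer-extraction features listed above."""
    cand_tokens = candidate.lower().split()
    q_tokens = {t.lower() for t in question_tokens}

    # Novelty factor: at least one candidate word is not in the question.
    novelty = any(t not in q_tokens for t in cand_tokens)

    # Number of matched question keywords contained in the candidate.
    matched = sum(1 for t in cand_tokens if t in q_tokens)

    # Length of the longest run of consecutive question terms in the candidate.
    longest = run = 0
    for t in cand_tokens:
        run = run + 1 if t in q_tokens else 0
        longest = max(longest, run)

    return {"novelty": novelty,
            "matched_keywords": matched,
            "longest_question_sequence": longest}

print(answer_features("Lovelace was born in 1815",
                      ["When", "was", "Ada", "Lovelace", "born", "?"]))
# -> {'novelty': True, 'matched_keywords': 3, 'longest_question_sequence': 3}
```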

23.1.7 N-gram tiling answer extraction

An alternative approach to answer extraction, used solely in Web search, is based on n-gram tiling, an approach that relies on the redundancy of the web (Brill et al. 2002, Lin 2007). This simplified method begins with the snippets returned from the Web search engine, produced by a reformulated query. In the first step, n-gram mining, every unigram, bigram, and trigram occurring in the snippet is extracted and weighted. The weight is a function of the number of snippets in which the n-gram occurred, and the weight of the query reformulation pattern that returned it. In the n-gram filtering step, n-grams are scored by how well they match the predicted answer type. These scores are computed by hand-written filters built for each answer type. Finally, an n-gram tiling algorithm concatenates overlapping n-gram fragments into longer answers. A standard greedy method is to start with the highest-scoring candidate and try to tile each other candidate with this candidate. The best-scoring concatenation is added to the set of candidates, the lower-scoring candidate is removed, and the process continues until a single answer is built.
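The greedy tiling step can be sketched as follows; the toy scores and the overlap test are simplified stand-ins for the weighting and filtering described by Brill et al. (2002).

```python
def tile(a, b):
    """Return a + b merged on the longest suffix/prefix token overlap, or None."""
    a_toks, b_toks = a.split(), b.split()
    for k in range(min(len(a_toks), len(b_toks)), 0, -1):
        if a_toks[-k:] == b_toks[:k]:
            return " ".join(a_toks + b_toks[k:])
    return None

def ngram_tiling(candidates):
    """Greedily tile scored n-grams (string -> score) into a single longer answer."""
    while len(candidates) > 1:
        best = max(candidates, key=candidates.get)
        merged = False
        for other in list(candidates):
            if other == best:
                continue
            tiled = tile(best, other) or tile(other, best)
            if tiled is not None:
                # Keep the concatenation, drop the lower-scoring fragment.
                score = candidates[best] + candidates[other]
                del candidates[best], candidates[other]
                candidates[tiled] = score
                merged = True
                break
        if not merged:
            break
    return max(candidates, key=candidates.get)

print(ngram_tiling({"ada lovelace was": 3.0,
                    "lovelace was born": 2.5,
                    "born in 1815": 2.0}))
# -> "ada lovelace was born in 1815"
```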

23.1.8 Neural Answer Extraction

Neural network approaches to answer extraction draw on the intuition that a question and its answer are semantically similar in some appropriate way. As we’ll see, this intuition can be fleshed out by computing an embedding for the question and an embedding for each token of the passage, and then selecting passage spans whose embeddings are closest to the question embedding.

Reading Comprehension Datasets. Neural answer extractors are often designed in the context of the reading comprehension task.

Modern reading comprehension systems tend to use collections of questions that are designed specifically for NLP, and so are large enough for training supervised learning systems. For example the Stanford Question Answering Dataset (SQuAD) consists of passages from Wikipedia and associated questions whose answers are spans from the passage, as well as some questions that are designed to be unanswerable (Rajpurkar et al. 2016, Rajpurkar et al. 2018); a total of just over 150,000 questions. Fig. 23.7 shows a (shortened) excerpt from a SQuAD 2.0 passage together with three questions and their answer spans.

[Figure 23.7]

SQuAD was built by having humans write questions for a given Wikipedia passage and choose the answer span. Other datasets used similar techniques; the NewsQA dataset consists of 100,000 question-answer pairs from CNN news articles. For other datasets like WikiQA the span is the entire sentence containing the answer (Yang et al., 2015); the task of choosing a sentence rather than a smaller answer span is sometimes called the sentence selection task.

These reading comprehension datasets are used both as a reading comprehension task in themselves, and as a training set and evaluation set for the sentence extraction component of open question answering algorithms.

Basic Reading Comprehension Algorithm. Neural algorithms for reading comprehension are given a question $q$ of $l$ tokens $q_1,\ldots,q_l$ and a passage $p$ of $m$ tokens $p_1,\ldots,p_m$. Their goal is to compute, for each token $p_i$, the probability $p_{\text{start}}(i)$ that $p_i$ is the start of the answer span, and the probability $p_{\text{end}}(i)$ that $p_i$ is the end of the answer span.

Fig. 23.8 shows the architecture of the Document Reader component of the DrQA system of Chen et al. (2017). Like most such systems, DrQA builds an embedding for the question, builds an embedding for each token in the passage, computes a similarity function between the question and each passage word in context, and then uses the question-passage similarity scores to decide where the answer span starts and ends.

[Figure 23.8]

Let’s consider the algorithm in detail, following closely the description in Chen et al. (2017). The question is represented by a single embedding $\mathbf{q}$, which is a weighted sum of representations for each question word $q_i$. It is computed by passing the series of embeddings $\mathbf{E}(q_1),\ldots,\mathbf{E}(q_l)$ of question words through an RNN (such as the bi-LSTM shown in Fig. 23.8). The resulting hidden representations $\{\mathbf{q}_1,\ldots,\mathbf{q}_l\}$ are combined by a weighted sum:
$$\mathbf{q}=\sum_{j} b_{j} \mathbf{q}_{j}$$
The weight $b_j$ is a measure of the relevance of each question word, and relies on a learned weight vector $\mathbf{w}$:
$$b_{j}=\frac{\exp\left(\mathbf{w}\cdot\mathbf{q}_{j}\right)}{\sum_{j'}\exp\left(\mathbf{w}\cdot\mathbf{q}_{j'}\right)}$$
To compute the passage embeddings $\{\mathbf{p}_{1},\ldots,\mathbf{p}_{m}\}$ we first form an input representation $\tilde{\mathbf{p}}=\{\tilde{\mathbf{p}}_{1},\ldots,\tilde{\mathbf{p}}_{m}\}$ by concatenating four components:

  • An embedding for each word $\mathbf{E}(p_i)$, such as from GloVe (Pennington et al., 2014).

  • Token features like the part of speech of $p_i$, or the named entity tag of $p_i$, from running POS or NER taggers.

  • Exact match features representing whether the passage word $p_i$ occurred in the question: $\mathbb{1}(p_i \in q)$. Separate exact match features might be used for lemmatized or lower-cased versions of the tokens.

  • Aligned question embedding: In addition to the exact match features, many QA systems use an attention mechanism to give a more sophisticated model of similarity between the passage and question words, such as similar but nonidentical words like release and singles. For example a weighted similarity $\sum_j a_{i,j}\mathbf{E}(q_j)$ can be used, where the attention weight $a_{i,j}$ encodes the similarity between $p_i$ and each question word $q_j$. This attention weight can be computed as the dot product between functions $\alpha$ of the word embeddings of the question and passage:
    $$a_{i, j}=\frac{\exp\left(\alpha\left(\mathbf{E}\left(p_{i}\right)\right) \cdot \alpha\left(\mathbf{E}\left(q_{j}\right)\right)\right)}{\sum_{j'} \exp\left(\alpha\left(\mathbf{E}\left(p_{i}\right)\right) \cdot \alpha\left(\mathbf{E}\left(q_{j'}\right)\right)\right)}$$
    $\alpha(\cdot)$ can be a simple feed forward network.

We then pass $\tilde{\mathbf{p}}$ through a biLSTM:
$$\{\mathbf{p}_{1},\ldots,\mathbf{p}_{m}\}=\text{RNN}\left(\{\tilde{\mathbf{p}}_{1},\ldots,\tilde{\mathbf{p}}_{m}\}\right)$$
The result of the previous two steps is a single question embedding $\mathbf{q}$ and a representation for each word in the passage $\{\mathbf{p}_{1},\ldots,\mathbf{p}_{m}\}$. In order to find the answer span, we can train two separate classifiers, one to compute for each $p_i$ the probability $p_{\text{start}}(i)$ that $p_i$ is the start of the answer span, and one to compute the probability $p_{\text{end}}(i)$. While the classifiers could just take the dot product between the passage and question embeddings as input, it turns out to work better to learn a more sophisticated similarity function, like a bilinear attention layer $\mathbf{W}$:
$$p_{\text{start}}(i) \propto \exp\left(\mathbf{p}_{i} \mathbf{W}_{s} \mathbf{q}\right), \qquad p_{\text{end}}(i) \propto \exp\left(\mathbf{p}_{i} \mathbf{W}_{e} \mathbf{q}\right)$$
These neural answer extractors can be trained end-to-end by using datasets like SQuAD.
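A condensed PyTorch sketch of this kind of reader, using random embeddings in place of GloVe and omitting the extra token features, to show how the weighted question sum and the bilinear start/end scores fit together; it follows the equations above but is not the DrQA implementation.

```python
import torch
import torch.nn as nn

class MiniReader(nn.Module):
    def __init__(self, vocab_size=1000, emb_dim=50, hidden=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)             # stands in for GloVe
        self.q_rnn = nn.LSTM(emb_dim, hidden, bidirectional=True, batch_first=True)
        self.p_rnn = nn.LSTM(emb_dim, hidden, bidirectional=True, batch_first=True)
        self.w = nn.Linear(2 * hidden, 1, bias=False)              # weights b_j for the question sum
        self.W_s = nn.Linear(2 * hidden, 2 * hidden, bias=False)   # bilinear start scorer
        self.W_e = nn.Linear(2 * hidden, 2 * hidden, bias=False)   # bilinear end scorer

    def forward(self, question_ids, passage_ids):
        q_hidden, _ = self.q_rnn(self.embed(question_ids))   # (1, l, 2h)
        p_hidden, _ = self.p_rnn(self.embed(passage_ids))    # (1, m, 2h)

        # q = sum_j b_j q_j, with b_j a softmax over w . q_j
        b = torch.softmax(self.w(q_hidden).squeeze(-1), dim=-1)    # (1, l)
        q = torch.bmm(b.unsqueeze(1), q_hidden).squeeze(1)         # (1, 2h)

        # p_start(i) ∝ exp(p_i W_s q),  p_end(i) ∝ exp(p_i W_e q)
        start_logits = torch.bmm(p_hidden, self.W_s(q).unsqueeze(-1)).squeeze(-1)  # (1, m)
        end_logits = torch.bmm(p_hidden, self.W_e(q).unsqueeze(-1)).squeeze(-1)    # (1, m)
        return torch.softmax(start_logits, dim=-1), torch.softmax(end_logits, dim=-1)

reader = MiniReader()
p_start, p_end = reader(torch.randint(0, 1000, (1, 6)), torch.randint(0, 1000, (1, 20)))
print(p_start.shape, p_end.shape)  # torch.Size([1, 20]) twice
```

Training would minimize the cross-entropy of the gold start and end positions, which is what end-to-end training on a dataset like SQuAD amounts to.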

23.2 Knowledge-based Question Answering

Knowledge-based question answering: answering a natural language question by mapping it to a query over a structured database.

Systems for mapping from a text string to any logical form are called semantic parsers. Semantic parsers for question answering usually map either to some version of predicate calculus or a query language like SQL or SPARQL, as in the examples in Fig. 23.9.

[Figure 23.9]

Popular ontologies like Freebase (Bollacker et al., 2008) or DBpedia (Bizer et al., 2009) have large numbers of triples derived from Wikipedia infoboxes, the structured tables associated with certain Wikipedia articles.
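As a hedged example of querying such a knowledge source, the snippet below asks DBpedia's public SPARQL endpoint for Ada Lovelace's birth date using the SPARQLWrapper library; endpoint availability and the exact property names are assumptions.

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Public DBpedia endpoint; availability and schema are assumptions.
sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setReturnFormat(JSON)
sparql.setQuery("""
    PREFIX dbr: <http://dbpedia.org/resource/>
    PREFIX dbo: <http://dbpedia.org/ontology/>
    SELECT ?birth WHERE { dbr:Ada_Lovelace dbo:birthDate ?birth }
""")

results = sparql.query().convert()
for row in results["results"]["bindings"]:
    print(row["birth"]["value"])   # expected to include 1815-12-10 if the triple is present
```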

23.2.1 Rule-based Methods

One approach is to write hand-written rules to extract relations from the question, just as we saw in Section 17.2.

23.2.2 Supervised Methods

Most supervised algorithms for learning to answer these simple questions about relations first parse the questions and then align the parse trees to the logical form. Generally these systems bootstrap by having a small set of rules for building this mapping, and an initial lexicon as well. The supervised approach can be extended to deal with more complex questions that are not just about single relations.

23.2.3 Dealing with Variation: Semi-Supervised Methods

The most common source of redundancy, of course, is the web, which contains a vast number of textual variants expressing any relation. For this reason, most methods make some use of web text, either via semi-supervised methods like distant supervision or unsupervised methods like open information extraction, both introduced in Chapter 17. For example, the REVERB open information extractor (Fader et al., 2011) extracts billions of (subject, relation, object) triples of strings from the web, such as (“Ada Lovelace”, “was born in”, “1815”). By aligning these strings with a canonical knowledge source like Wikipedia, we create new relations that can be queried while simultaneously learning to map between the words in a question and canonical relations.

To align a REVERB triple with a canonical knowledge source we first align the arguments and then the predicate. Recall from Chapter 20 that linking a string like “Ada Lovelace” with a Wikipedia page is called entity linking. Once we’ve aligned the arguments, we align the predicates. Given the Freebase relation people.person.birthdate(ada lovelace,1815) and the string ‘Ada Lovelace was born in 1815’, having linked Ada Lovelace and normalized 1815, we learn the mapping between the string ‘was born in’ and the relation people.person.birthdate. In the simplest case, this can be done by aligning the relation with the string of words in between the arguments; more complex alignment algorithms like IBM Model 1 (Chapter 22) can be used. Then if a phrase aligns with a predicate across many entities, it can be extracted into a lexicon for mapping questions to relations.
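A simplest-case sketch of this predicate alignment: once entity linking has located the two arguments in the sentence, take the string between them as the predicate phrase and count it in a lexicon mapping phrases to canonical relations. The lexicon structure below is illustrative.

```python
from collections import defaultdict

def predicate_between(sentence, subj, obj):
    """Return the string between the two (already linked) arguments, or None."""
    start = sentence.find(subj)
    end = sentence.find(obj)
    if start == -1 or end == -1 or start + len(subj) > end:
        return None
    return sentence[start + len(subj):end].strip()

# Lexicon mapping predicate phrases to canonical relations, with counts.
lexicon = defaultdict(lambda: defaultdict(int))

phrase = predicate_between("Ada Lovelace was born in 1815", "Ada Lovelace", "1815")
lexicon[phrase]["people.person.birthdate"] += 1
print(phrase, dict(lexicon[phrase]))
# -> was born in {'people.person.birthdate': 1}
```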

Another useful source of linguistic redundancy is paraphrase databases. For example, the site wikianswers.com contains millions of pairs of questions that users have tagged as having the same meaning, 18 million of which have been collected in the PARALEX corpus (Fader et al., 2013).

23.3 Using multiple information sources: IBM’s Watson

Figure 23.11 shows the 4 stages of the DeepQA system that is the question answering component of Watson.

[Figure 23.11]

The first stage is question processing. The DeepQA system runs parsing, named entity tagging, and relation extraction on the question. Then, like the text-based systems in Section 23.1, the DeepQA system extracts the focus, the answer type (also called the lexical answer type or LAT), and performs question classification and question sectioning. The focus is the part of the question that co-refers with the answer, used for example to align with a supporting passage. The lexical answer type is a word or words which tell us something about the semantic type of the answer.

In the second candidate answer generation stage, we combine the processed question with external documents and other knowledge sources to suggest many candidate answers. These candidate answers can either be extracted from text documents or from structured knowledge bases.

For structured resources like DBpedia, IMDB, or the triples produced by Open Information Extraction, we can just query these stores with the relation and the known entity, just as we saw in Section 23.2.

The method for extracting answers from text depends on the type of text documents. To extract answers from normal text documents we can do passage search just as we did in Section 23.1.

The third candidate answer scoring stage uses many sources of evidence to score the candidates.

The final answer merging and scoring step first merges candidate answers that are equivalent. We merge the evidence for each variant, combining the scoring feature vectors for the merged candidates into a single vector. Now we have a set of candidates, each with a feature vector. A classifier takes each feature vector and assigns a confidence value to this candidate answer. The classifier is trained on thousands of candidate answers, each labeled for whether it is correct or incorrect, together with their feature vectors, and learns to predict a probability of being a correct answer. Since, in training, there are far more incorrect answers than correct answers, we need to use one of the standard techniques for dealing with very imbalanced data. DeepQA uses instance weighting, assigning an instance weight of .5 for each incorrect answer example in training. The candidate answers are then sorted by this confidence value, resulting in a single best answer.

In summary, we’ve seen in the four stages of DeepQA that it draws on the intuitions of both the IR-based and knowledge-based paradigms. Indeed, Watson’s architectural innovation is its reliance on proposing a very large number of candidate answers from both text-based and knowledge-based sources and then developing a wide variety of evidence features for scoring these candidates —again both text-based and knowledge-based.

23.4 Evaluation of Factoid Answers

A common evaluation metric for factoid question answering, introduced in the TREC Q/A track in 1999, is mean reciprocal rank, or MRR. MRR assumes that systems are returning a short ranked list of answers or passages containing answers. Each question is then scored according to the reciprocal of the rank of the first correct answer. For example, if the system returned five answers but the first three are wrong and hence the highest-ranked correct answer is ranked fourth, the reciprocal rank score for that question would be $\frac{1}{4}$. Questions with return sets that do not contain any correct answers are assigned a zero. The score of a system is then the average of the score for each question in the set. More formally, for an evaluation of a system returning a set of ranked answers for a test set consisting of $N$ questions, the MRR is defined as
$$\text{MRR}=\frac{1}{N} \sum_{i=1\ \text{s.t. } rank_{i} \neq 0}^{N} \frac{1}{rank_{i}}$$
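A small sketch of this computation, where each question is represented by the rank of its first correct answer (0 if no correct answer was returned):

```python
def mean_reciprocal_rank(ranks):
    """ranks[i] is the rank of the first correct answer for question i, or 0 if none."""
    return sum(1.0 / r for r in ranks if r != 0) / len(ranks)

# Three questions: correct answer at rank 4, at rank 1, and no correct answer.
print(mean_reciprocal_rank([4, 1, 0]))  # (1/4 + 1 + 0) / 3 ≈ 0.417
```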
Reading comprehension systems on datasets like SQuAD are often evaluated using two metrics, both ignoring punctuation and articles (a, an, the) (Rajpurkar et al., 2016); a small scoring sketch follows the list:

  • Exact match: The percentage of predicted answers that match the gold answer exactly.
  • F1 score: The average overlap between predicted and gold answers. The prediction and gold answer are treated as bags of tokens, F1 is computed for each question, and then averaged over all questions.
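A minimal sketch of these two metrics for a single question, with simple whitespace tokenization and normalization standing in for the official SQuAD evaluation script:

```python
import re
from collections import Counter

def normalize(text):
    """Lowercase, strip punctuation and articles, and split into tokens."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return [t for t in text.split() if t not in {"a", "an", "the"}]

def exact_match(prediction, gold):
    return normalize(prediction) == normalize(gold)

def token_f1(prediction, gold):
    pred, ref = Counter(normalize(prediction)), Counter(normalize(gold))
    overlap = sum((pred & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

print(exact_match("the Normans", "Normans"))   # True after normalization
print(token_f1("born in 1815", "1815"))        # 0.5
```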

A number of test sets are available for question answering. Early systems used the TREC QA dataset; questions and hand-written answers for TREC competitions from 1999 to 2004 are publicly available. TriviaQA (Joshi et al., 2017) has 650K question-answer evidence triples, from 95K hand-created question-answer pairs together with on average six supporting evidence documents collected retrospectively from Wikipedia and the Web.

Another family of datasets starts from WEBQUESTIONS (Berant et al., 2013), which contains 5,810 questions asked by web users, each beginning with a wh-word and containing exactly one entity. Questions are paired with hand-written answers drawn from the Freebase page of the question’s entity. WEBQUESTIONSSP (Yih et al., 2016) augments WEBQUESTIONS with human-created semantic parses (SPARQL queries) for those questions answerable using Freebase. COMPLEXWEBQUESTIONS (Talmor and Berant, 2018) further augments the dataset with compositional and other kinds of complex questions, resulting in 34,689 questions, along with answers, web snippets, and SPARQL queries.

There are a wide variety of datasets for training and testing reading comprehension/answer extraction in addition to the SQuAD (Rajpurkar et al., 2016) and WikiQA (Yang et al., 2015) datasets discussed above. The NarrativeQA (Kocisky et al., 2018) dataset, for example, has questions based on entire long documents like books or movie scripts, while the Question Answering in Context (QuAC) dataset (Choi et al., 2018) has 100K questions created by two crowdworkers who are asking and answering questions about a hidden Wikipedia text.

Others take their structure from the fact that reading comprehension tasks designed for children tend to be multiple choice, with the task being to choose among the given answers. The MCTest dataset uses this structure, with 500 fictional short stories created by crowd workers with questions and multiple choice answers (Richardson et al., 2013). The AI2 Reasoning Challenge (ARC) (Clark et al., 2018) has questions that are designed to be hard to answer with simple lexical methods:

Which property of a mineral can be determined just by looking at it?
(A) luster [correct] (B) mass (C) weight (D) hardness

This ARC example is difficult because the correct answer luster is unlikely to co-occur frequently on the web with phrases like looking at it, while the word mineral is highly associated with the incorrect answer hardness.
