Author: 孙嘉伟 (Sun Jiawei)

Affiliation: Yanshan University


On May 23, NAACL finally released its list of accepted papers.

Paper list: https://www.aclweb.org/anthology/events/naacl-2021/

NAACL is one of the top NLP conferences, and the papers it accepts are of very high quality, including quite a few on reading comprehension and question answering that are worth studying. As a graduate student (still very much a beginner) working in this area, I have collected the NAACL 2021 papers on reading comprehension and QA and put together a short summary of how the field is developing. Based on my own research direction I added brief notes to some of the papers; below, they are introduced one by one, grouped by specific research topic.

There are 29 papers in total on reading comprehension and question answering:

Open-domain QA (5 papers)

QA datasets (3 papers)

Robustness (2 papers)

Machine reading comprehension (4 papers)

Multi-hop QA (3 papers)

Multimodal QA (2 papers)

Visual QA (2 papers)

Knowledge-graph QA (2 papers)

Cross-lingual QA (2 papers)

Miscellaneous (4 papers)


Open-domain QA (5 papers):

Open Domain Question Answering over Tables via Dense Retrieval https://www.aclweb.org/anthology/2021.naacl-main.43/
RocketQA: An Optimized Training Approach to Dense Passage Retrieval for Open-Domain Question Answering https://www.aclweb.org/anthology/2021.naacl-main.466/
Designing a Minimal Retrieve-and-Read System for Open-Domain Question Answering https://www.aclweb.org/anthology/2021.naacl-main.468/
SPARTA: Efficient Open-Domain Question Answering via Sparse Transformer Matching Retrieval https://www.aclweb.org/anthology/2021.naacl-main.47/
RECONSIDER: Improved Re-Ranking using Span-Focused Cross-Attention for Open Domain Question Answering https://www.aclweb.org/anthology/2021.naacl-main.100/

1.

Title: Open Domain Question Answering over Tables via Dense Retrieval

Link: https://www.aclweb.org/anthology/2021.naacl-main.43.pdf

Abstract: Recent advances in open-domain QA have led to strong models based on dense retrieval, but only focused on retrieving textual passages. In this work, we tackle open-domain QA over tables for the first time, and show that retrieval can be improved by a retriever designed to handle tabular context. We present an effective pre-training procedure for our retriever and improve retrieval quality with mined hard negatives. As relevant datasets are missing, we extract a subset of NATURAL QUESTIONS (Kwiatkowski et al., 2019) into a Table QA dataset. We find that our retriever improves retrieval results from 72.0 to 81.1 recall@10 and end-to-end QA results from 33.8 to 37.7 exact match, over a BERT based retriever.

Open-domain question answering over tables (Table QA).

2.
Title: RocketQA: An Optimized Training Approach to Dense Passage Retrieval for Open-Domain Question Answering

Link: https://www.aclweb.org/anthology/2021.naacl-main.466.pdf

Abstract: In open-domain question answering, dense passage retrieval has become a new paradigm to retrieve relevant passages for finding answers. Typically, the dual-encoder architecture is adopted to learn dense representations of questions and passages for semantic matching. However, it is difficult to effectively train a dual-encoder due to the challenges including the discrepancy between training and inference, the existence of unlabeled positives and limited training data. To address these challenges, we propose an optimized training approach, called RocketQA, to improving dense passage retrieval. We make three major technical contributions in RocketQA, namely cross-batch negatives, denoised hard negatives and data augmentation. The experiment results show that RocketQA significantly outperforms previous state-of-the-art models on both MSMARCO and Natural Questions. We also conduct extensive experiments to examine the effectiveness of the three strategies in RocketQA. Besides, we demonstrate that the performance of end-to-end QA can be improved based on our RocketQA retriever.

Proposes three optimizations for training the DPR (Dense Passage Retrieval) model used to retrieve candidate answer passages: cross-batch negatives, denoised hard negatives, and data augmentation, reaching SOTA on the MSMARCO and Natural Questions datasets. The contrastive objective is sketched below.
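To make the objective concrete, here is a minimal single-process sketch of the dual-encoder contrastive loss behind in-batch/cross-batch negatives: every passage in the gathered batch other than a question's own positive serves as a negative. All names and shapes are illustrative; in RocketQA the passage embeddings are gathered across GPUs and the encoders are trained Transformers, not random vectors.

```python
# Minimal sketch of in-batch/cross-batch negatives for a dual encoder.
# Illustrative only: real systems gather p_emb across devices so that
# one question sees thousands of negatives per step.
import torch
import torch.nn.functional as F

def dual_encoder_loss(q_emb: torch.Tensor, p_emb: torch.Tensor) -> torch.Tensor:
    """q_emb: (B, d) question vectors; p_emb: (B*k, d) passage vectors,
    where row i*k is the positive for question i and every other row
    (positives and hard negatives of other questions) acts as a negative."""
    scores = q_emb @ p_emb.t()                 # (B, B*k) similarity matrix
    k = p_emb.size(0) // q_emb.size(0)
    targets = torch.arange(q_emb.size(0)) * k  # index of each question's positive
    return F.cross_entropy(scores, targets)

# toy usage: 4 questions, 1 positive + 1 hard negative per question
q = F.normalize(torch.randn(4, 128), dim=-1)
p = F.normalize(torch.randn(8, 128), dim=-1)
print(dual_encoder_loss(q, p))
```

The larger the gathered batch, the more negatives each question sees per step, which is what cross-batch sharing buys over per-GPU in-batch negatives.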

3.
Title: Designing a Minimal Retrieve-and-Read System for Open-Domain Question Answering

Link: https://www.aclweb.org/anthology/2021.naacl-main.468.pdf

Abstract: In open-domain question answering (QA), retrieve-and-read mechanism has the inherent benefit of interpretability and the easiness of adding, removing, or editing knowledge compared to the parametric approaches of closed-book QA models. However, it is also known to suffer from its large storage footprint due to its document corpus and index. Here, we discuss several orthogonal strategies to drastically reduce the footprint of a retrieve-and-read open-domain QA system by up to 160x. Our results indicate that retrieve-and-read can be a viable option even in a highly constrained serving environment such as edge devices, as we show that it can achieve better accuracy than a purely parametric model with comparable docker-level system size.

QA systems built on retrieve-and-read typically need a very large knowledge/corpus index. The authors propose methods for shrinking that index's storage footprint; in their experiments the system's memory footprint drops to 1/160 of the original, while accuracy falls by 2.45% on the dev set and 4% on the test set. One strategy of this general kind, compressing the dense index, is sketched below.
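The snippet below is only a hedged sketch of index compression via FAISS product quantization, not the authors' exact system; the dimensions and corpus size are invented.

```python
# Hedged sketch: product-quantizing a dense passage index with FAISS.
import numpy as np
import faiss

d = 768                                            # embedding dimension
xb = np.random.randn(10_000, d).astype("float32")  # stand-in passage embeddings

index = faiss.IndexPQ(d, 64, 8)  # 64 sub-quantizers x 8 bits = 64 bytes/vector
index.train(xb)                  # learn the codebooks
index.add(xb)                    # only compressed codes are stored
dists, ids = index.search(xb[:1], 5)
print(ids)                       # nearest passages for the first query
```

At 64 bytes per vector instead of 768 x 4 = 3072 bytes for raw float32, the index shrinks by roughly 48x, traded against some retrieval accuracy.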

4.
Title: SPARTA: Efficient Open-Domain Question Answering via Sparse Transformer Matching Retrieval

Link: https://www.aclweb.org/anthology/2021.naacl-main.47.pdf


Proposes a method for retrieving answer-candidate passages: token-level representations replace the usual sentence-level representations in the query-answer interaction, with strong results on several datasets. A sketch of this style of scoring follows.
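A rough sketch of token-level matching in this spirit: each query token interacts with every passage token and keeps only its best match, instead of comparing two single sentence vectors. The max-pool, ReLU, and log shaping follow my reading of the paper's scoring; the random tensors stand in for learned token representations.

```python
# Sketch of token-level query-passage scoring (SPARTA-style), illustrative only.
import torch

def token_match_score(query_tok: torch.Tensor, passage_tok: torch.Tensor) -> torch.Tensor:
    """query_tok: (Lq, d), passage_tok: (Lp, d) token embeddings."""
    interaction = query_tok @ passage_tok.t()           # (Lq, Lp) token-pair scores
    best_match = interaction.max(dim=1).values          # best passage token per query token
    return torch.log(torch.relu(best_match) + 1).sum()  # sparse, non-negative pooling

print(token_match_score(torch.randn(5, 64), torch.randn(120, 64)))
```

Scores of this shape can be precomputed per passage against a fixed vocabulary, which is what makes such retrieval efficient to serve in practice.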

5.
Title: RECONSIDER: Improved Re-Ranking using Span-Focused Cross-Attention for Open Domain Question Answering

Link: https://www.aclweb.org/anthology/2021.naacl-main.100.pdf

Abstract: State-of-the-art Machine Reading Comprehension (MRC) models for Open-domain Question Answering (QA) are typically trained for span selection using distantly supervised positive examples and heuristically retrieved negative examples. This training scheme possibly explains empirical observations that these models achieve a high recall amongst their top few predictions, but a low overall accuracy, motivating the need for answer re-ranking. We develop a successful re-ranking approach (RECONSIDER) for span-extraction tasks that improves upon the performance of MRC models, even beyond large-scale pre-training. RECONSIDER is trained on positive and negative examples extracted from high confidence MRC model predictions, and uses in-passage span annotations to perform span-focused reranking over a smaller candidate set. As a result, RECONSIDER learns to eliminate close false positives, achieving a new extractive state of the art on four QA tasks, with 45.5% Exact Match accuracy on Natural Questions with real user questions, and 61.7% on TriviaQA. We will release all related data, models, and code.

Targets the failure mode in open-domain QA where the MRC module cannot distinguish semantically similar wrong answers; proposes a method that re-ranks candidate answers and further improves the accuracy of SOTA systems on four datasets.

QA datasets (3 papers):

A Dataset of Information-Seeking Questions and Answers Anchored in Research Papers https://www.aclweb.org/anthology/2021.naacl-main.365/
Open-Domain Question Answering Goes Conversational via Question Rewriting https://www.aclweb.org/anthology/2021.naacl-main.44/
SPARTQA: A Textual Question Answering Benchmark for Spatial Reasoning https://www.aclweb.org/anthology/2021.naacl-main.364/

1.
Title: A Dataset of Information-Seeking Questions and Answers Anchored in Research Papers

Link: https://www.aclweb.org/anthology/2021.naacl-main.365.pdf

Abstract: Readers of academic research papers often read with the goal of answering specific questions. Question Answering systems that can answer those questions can make consumption of the content much more efficient. However, building such tools requires data that reflect the difficulty of the task arising from complex reasoning about claims made in multiple parts of a paper. In contrast, existing information-seeking question answering datasets usually contain questions about generic factoid-type information. We therefore present QASPER, a dataset of 5,049 questions over 1,585 Natural Language Processing papers. Each question is written by an NLP practitioner who read only the title and abstract of the corresponding paper, and the question seeks information present in the full text. The questions are then answered by a separate set of NLP practitioners who also provide supporting evidence to answers. We find that existing models that do well on other QA tasks do not perform well on answering these questions, underperforming humans by at least 27 F1 points when answering them from entire papers, motivating further research in document-grounded, information-seeking QA, which our dataset is designed to facilitate.

A question-answering dataset built over academic papers.

2.
Title: Open-Domain Question Answering Goes Conversational via Question Rewriting

Link: https://www.aclweb.org/anthology/2021.naacl-main.44.pdf

Abstract: We introduce a new dataset for Question Rewriting in Conversational Context (QReCC), which contains 14K conversations with 80K question-answer pairs. The task in QReCC is to find answers to conversational questions within a collection of 10M web pages (split into 54M passages). Answers to questions in the same conversation may be distributed across several web pages. QReCC provides annotations that allow us to train and evaluate individual subtasks of question rewriting, passage retrieval and reading comprehension required for the end-to-end conversational question answering (QA) task. We report the effectiveness of a strong baseline approach that combines the state-of-the-art model for question rewriting, and competitive models for open-domain QA. Our results set the first baseline for the QReCC dataset with F1 of 19.10, compared to the human upper bound of 75.45, indicating the difficulty of the setup and a large room for improvement.

A conversational reading-comprehension dataset in an open-domain setting.

3.
Title: SPARTQA: A Textual Question Answering Benchmark for Spatial Reasoning

Link: https://www.aclweb.org/anthology/2021.naacl-main.364.pdf

Abstract: This paper proposes a question-answering (QA) benchmark for spatial reasoning on natural language text which contains more realistic spatial phenomena not covered by prior work and is challenging for state-of-the-art language models (LM). We propose a distant supervision method to improve on this task. Specifically, we design grammar and reasoning rules to automatically generate a spatial description of visual scenes and corresponding QA pairs. Experiments show that further pretraining LMs on these automatically generated data significantly improves LMs’ capability on spatial understanding, which in turn helps to better solve two external datasets, bAbI, and boolQ. We hope that this work can foster investigations into more sophisticated models for spatial reasoning over text.

A spatial-reasoning QA dataset over natural text, built by distant supervision: textual descriptions of visual scenes are generated automatically, together with QA pairs about those scenes.

Robustness (2 papers):

Robust Question Answering Through Sub-part Alignment https://www.aclweb.org/anthology/2021.naacl-main.98/
On the Transferability of Minimal Prediction Preserving Inputs in Question Answering https://www.aclweb.org/anthology/2021.naacl-main.101/

1.
Title: Robust Question Answering Through Sub-part Alignment

Link: https://www.aclweb.org/anthology/2021.naacl-main.98.pdf

Abstract: Current textual question answering (QA) models achieve strong performance on in-domain test sets, but often do so by fitting surface-level patterns, so they fail to generalize to out-of-distribution settings. To make a more robust and understandable QA system, we model question answering as an alignment problem. We decompose both the question and context into smaller units based on off-the-shelf semantic representations (here, semantic roles), and align the question to a subgraph of the context in order to find the answer. We formulate our model as a structured SVM, with alignment scores computed via BERT, and we can train end-to-end despite using beam search for approximate inference. Our use of explicit alignments allows us to explore a set of constraints with which we can prohibit certain types of bad model behavior arising in cross-domain settings. Furthermore, by investigating differences in scores across different potential answers, we can seek to understand what particular aspects of the input lead the model to choose the answer without relying on post-hoc explanation techniques. We train our model on SQuAD v1.1 and test it on several adversarial and out-of-domain datasets. The results show that our model is more robust than the standard BERT QA model, and constraints derived from alignment scores allow us to effectively trade off coverage and accuracy.

This paper addresses MRC models' susceptibility to adversarial examples. Prior work improves robustness through adversarial training, data augmentation, and posterior regularization, but those methods are developed on fixed datasets and generalize poorly to out-of-domain data. The paper instead recasts QA as a constrained graph-alignment problem: both the question and the context are represented as graphs whose vertices are predicates and their associated arguments and whose edges encode the relations between them; the answer is found by aligning the question graph to a subgraph of the context graph.

2.
Title: On the Transferability of Minimal Prediction Preserving Inputs in Question Answering

Link: https://www.aclweb.org/anthology/2021.naacl-main.101.pdf

Abstract: Recent work (Feng et al., 2018) establishes the presence of short, uninterpretable input fragments that yield high confidence and accuracy in neural models. We refer to these as Minimal Prediction Preserving Inputs (MPPIs). In the context of question answering, we investigate competing hypotheses for the existence of MPPIs, including poor posterior calibration of neural models, lack of pretraining, and "dataset bias" (where a model learns to attend to spurious, non-generalizable cues in the training data). We discover a perplexing invariance of MPPIs to random training seed, model architecture, pretraining, and training domain. MPPIs demonstrate remarkable transferability across domains — achieving significantly higher performance than comparably short queries. Additionally, penalizing over-confidence on MPPIs fails to improve either generalization or adversarial robustness. These results suggest the interpretability of MPPIs is insufficient to characterize generalization capacity of these models. We hope this focused investigation encourages more systematic analysis of model behavior outside of the human interpretable distribution of examples.

This paper studies the MPPI (Minimal Prediction Preserving Input) problem in reading comprehension. For example, after deleting most of the words in a question, a human can no longer understand it, yet the model still finds the same answer as before; the MPPI is the shortest such reduced question that still preserves the prediction. The authors further show experimentally that MPPIs persist after changing the random seed, the model architecture, and the pretraining. A toy reduction loop is sketched below.
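A toy input-reduction loop in the spirit of how such minimal inputs are found: greedily drop question tokens while the model's prediction stays unchanged. The `predict` function here is a hypothetical stub that ignores the question entirely, so the reduction collapses to a single token, an extreme caricature of the MPPI phenomenon.

```python
# Toy greedy input reduction; `predict` is a stub standing in for a real MRC model.
def predict(question: str, context: str) -> str:
    # stub: a real model would return its answer span for (question, context)
    return "1969" if "1969" in context else ""

def reduce_question(question: str, context: str) -> str:
    answer = predict(question, context)
    tokens = question.split()
    changed = True
    while changed and len(tokens) > 1:
        changed = False
        for i in range(len(tokens)):
            shorter = tokens[:i] + tokens[i + 1:]
            if predict(" ".join(shorter), context) == answer:  # prediction preserved?
                tokens, changed = shorter, True
                break
    return " ".join(tokens)

ctx = "Apollo 11 landed on the moon in 1969."
print(reduce_question("When did Apollo 11 land on the moon?", ctx))  # collapses to one token
```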

Machine reading comprehension (4 papers):

Hurdles to Progress in Long-form Question Answering https://www.aclweb.org/anthology/2021.naacl-main.393/
Self-Supervised Test-Time Learning for Reading Comprehension https://www.aclweb.org/anthology/2021.naacl-main.95/
Does Structure Matter? Encoding Documents for Machine Reading Comprehension https://www.aclweb.org/anthology/2021.naacl-main.367/
ReadTwice: Reading Very Large Documents with Memories https://www.aclweb.org/anthology/2021.naacl-main.408/

1.
Title: Hurdles to Progress in Long-form Question Answering

Link: https://www.aclweb.org/anthology/2021.naacl-main.393.pdf

Abstract: The task of long-form question answering (LFQA) involves retrieving documents relevant to a given question and using them to generate a paragraph-length answer. While many models have recently been proposed for LFQA, we show in this paper that the task formulation raises fundamental challenges regarding evaluation and dataset creation that currently preclude meaningful modeling progress. To demonstrate these challenges, we first design a new system that relies on sparse attention and contrastive retriever learning to achieve state-of-the-art performance on the ELI5 LFQA dataset. While our system tops the public leaderboard, a detailed analysis reveals several troubling trends: (1) our system’s generated answers are not actually grounded in the documents that it retrieves; (2) ELI5 contains significant train / validation overlap, as at least 81% of ELI5 validation questions occur in paraphrased form in the training set; (3) ROUGE-L is not an informative metric of generated answer quality and can be easily gamed; and (4) human evaluations used for other text generation tasks are unreliable for LFQA. We offer suggestions to mitigate each of these issues, which we hope will lead to more rigorous LFQA research and meaningful progress in the future.

Long-form QA is the task of retrieving documents relevant to a question and then generating a paragraph-length answer. The authors point out problems with the current setup, among them: (1) the training and validation sets share many near-duplicate questions, which heavily distorts retriever training; (2) ROUGE-L is unreliable as a measure of generated-answer quality: an answer that merely repeats fragments of the reference pushes the metric up without being practically useful (see the example below).
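To see why ROUGE-L is easy to game, here is a self-contained example in plain Python: a semantically wrong answer that copies the reference's phrasing far outscores a correct paraphrase. The sentences are invented.

```python
# LCS-based ROUGE-L (F1) and a small gaming demonstration.
def rouge_l_f1(candidate: str, reference: str) -> float:
    c, r = candidate.split(), reference.split()
    dp = [[0] * (len(r) + 1) for _ in range(len(c) + 1)]
    for i, ct in enumerate(c):
        for j, rt in enumerate(r):
            dp[i + 1][j + 1] = dp[i][j] + 1 if ct == rt else max(dp[i][j + 1], dp[i + 1][j])
    lcs = dp[-1][-1]
    if lcs == 0:
        return 0.0
    prec, rec = lcs / len(c), lcs / len(r)
    return 2 * prec * rec / (prec + rec)

ref = "the heart pumps blood through the circulatory system"
correct_paraphrase = "blood is circulated around the body by the heart"
wrong_but_copied = "the heart pumps blood through the circulatory system of plants"
print(rouge_l_f1(correct_paraphrase, ref))  # ~0.24
print(rouge_l_f1(wrong_but_copied, ref))    # ~0.89
```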

2.
Title: Self-Supervised Test-Time Learning for Reading Comprehension

Link: https://www.aclweb.org/anthology/2021.naacl-main.95.pdf

Abstract: Recent work on unsupervised question answering has shown that models can be trained with procedurally generated question-answer pairs and can achieve performance competitive with supervised methods. In this work, we consider the task of unsupervised reading comprehension and present a method that performs “test-time learning” (TTL) on a given context (text passage), without requiring training on large-scale human-authored datasets containing context-question-answer triplets. This method operates directly on a single test context, uses self-supervision to train models on synthetically generated question-answer pairs, and then infers answers to unseen human-authored questions for this context. Our method achieves accuracies competitive with fully supervised methods and significantly outperforms current unsupervised methods. TTL methods with a smaller model are also competitive with the current state-of-the-art in unsupervised reading comprehension.

Proposes TTL, an unsupervised training framework for reading comprehension under ever-changing contexts: question-answer pairs are generated automatically from each given context. The framework assumes each context comes from a different distribution, so it suits datasets whose samples are distributed very differently.

3.
Title: Does Structure Matter? Encoding Documents for Machine Reading Comprehension

Link: https://www.aclweb.org/anthology/2021.naacl-main.367.pdf

Abstract: Machine reading comprehension is a challenging task especially for querying documents with deep and interconnected contexts. Transformer-based methods have shown advanced performances on this task; however, most of them still treat documents as a flat sequence of tokens. This work proposes a new Transformer-based method that reads a document as tree slices. It contains two modules for identifying more relevant text passage and the best answer span respectively, which are not only jointly trained but also jointly consulted at inference time. Our evaluation results show that our proposed method outperforms several competitive baseline approaches on two datasets from varied domains.

Converts the sequential tokens (flat sentence structure) into a tree, using hierarchical relations between parts of the document (parent-child, sibling, and so on) to expose semantic structure. The model has two parts: one locates the passage containing the answer, the other extracts the answer from that passage alone.

4.
Title: READTWICE: Reading Very Large Documents with Memories

Link: https://www.aclweb.org/anthology/2021.naacl-main.408.pdf

Abstract: Knowledge-intensive tasks such as question answering often require assimilating information from different sections of large inputs such as books or article collections. We propose READTWICE, a simple and effective technique that combines several strengths of prior approaches to model long-range dependencies with Transformers. The main idea is to read text in small segments, in parallel, summarizing each segment into a memory table to be used in a second read of the text. We show that the method outperforms models of comparable size on several question answering (QA) datasets and sets a new state of the art on the challenging NarrativeQA task, with questions about entire books.

Long texts such as books and article collections carry a large amount of key information, but most existing models have trouble encoding inputs of that length. This paper proposes a way to build long-range dependencies by encoding the text twice: the first read splits the long text into short segments, encodes them, and writes each into a memory table; the second read lets the segments interact through that memory, yielding enriched segment representations on which the QA task is then performed (sketched below).
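A toy sketch of the two-pass data flow only, not the authors' architecture: the first pass summarizes each segment into one memory vector; the second pass lets every segment's tokens attend over the whole memory table. PyTorch's MultiheadAttention and mean-pooled summaries stand in for the real Transformer layers and learned memory.

```python
# Toy two-pass "read twice" data flow; shapes and layers are illustrative.
import torch
import torch.nn as nn

d, n_seg, seg_len = 64, 6, 20
segments = torch.randn(n_seg, seg_len, d)        # first-pass token states per segment

memory = segments.mean(dim=1)                    # (n_seg, d): one summary per segment
attn = nn.MultiheadAttention(embed_dim=d, num_heads=4, batch_first=True)

# second pass: tokens of each segment query the global memory table,
# so information flows between distant parts of the document
mem = memory.unsqueeze(0).expand(n_seg, -1, -1)  # (n_seg, n_seg, d)
enriched, _ = attn(segments, mem, mem)
print(enriched.shape)                            # torch.Size([6, 20, 64])
```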

Multi-hop QA (3 papers):

If You Want to Go Far Go Together: Unsupervised Joint Candidate Evidence Retrieval for Multi-hop Question Answering https://www.aclweb.org/anthology/2021.naacl-main.363/
Breadth First Reasoning Graph for Multi-hop Question Answering https://www.aclweb.org/anthology/2021.naacl-main.464/
Unsupervised Multi-hop Question Answering by Question Generation https://www.aclweb.org/anthology/2021.naacl-main.469/

1.
Title: If You Want to Go Far Go Together: Unsupervised Joint Candidate Evidence Retrieval for Multi-hop Question Answering

Link: https://www.aclweb.org/anthology/2021.naacl-main.363.pdf

Abstract: Multi-hop reasoning requires aggregation and inference from multiple facts. To retrieve such facts, we propose a simple approach that retrieves and reranks set of evidence facts jointly. Our approach first generates unsupervised clusters of sentences as candidate evidence by accounting links between sentences and coverage with the given query. Then, a RoBERTa-based reranker is trained to bring the most representative evidence cluster to the top. We specifically emphasize on the importance of retrieving evidence jointly by showing several comparative analyses to other methods that retrieve and rerank evidence sentences individually. First, we introduce several attention- and embedding-based analyses, which indicate that jointly retrieving and reranking approaches can learn compositional knowledge required for multi-hop reasoning. Second, our experiments show that jointly retrieving candidate evidence leads to substantially higher evidence retrieval performance when fed to the same supervised reranker. In particular, our joint retrieval and then reranking approach achieves new state-of-the-art evidence retrieval performance on two multi-hop question answering (QA) datasets: 30.5 Recall@2 on QASC, and 67.6% F1 on MultiRC. When the evidence text from our joint retrieval approach is fed to a RoBERTa-based answer selection classifier, we achieve new state-of-the-art QA performance on MultiRC and second best result on QASC.

2.
Title: Breadth First Reasoning Graph for Multi-hop Question Answering

Link: https://www.aclweb.org/anthology/2021.naacl-main.464.pdf

Abstract: Recently Graph Neural Network (GNN) has been used as a promising tool in multi-hop question answering task. However, the unnecessary updations and simple edge constructions prevent an accurate answer span extraction in a more direct and interpretable way. In this paper, we propose a novel model of Breadth First Reasoning Graph (BFR-Graph), which presents a new message passing way that better conforms to the reasoning process. In BFR-Graph, the reasoning message is required to start from the question node and pass to the next sentences node hop by hop until all the edges have been passed, which can effectively prevent each node from over-smoothing or being updated multiple times unnecessarily. To introduce more semantics, we also define the reasoning graph as a weighted graph with considering the number of co-occurrence entities and the distance between sentences. Then we present a more direct and interpretable way to aggregate scores from different levels of granularity based on the GNN. On HotpotQA leaderboard, the proposed BFR-Graph achieves state-of-the-art on answer span prediction.

3.
Title: Unsupervised Multi-hop Question Answering by Question Generation

Link: https://www.aclweb.org/anthology/2021.naacl-main.469.pdf

Abstract: Obtaining training data for multi-hop question answering (QA) is time-consuming and resource-intensive. We explore the possibility to train a well-performed multi-hop QA model without referencing any human-labeled multi-hop question-answer pairs, i.e., unsupervised multi-hop QA. We propose MQA-QG, an unsupervised framework that can generate human-like multi-hop training data from both homogeneous and heterogeneous data sources. MQA-QG generates questions by first selecting/generating relevant information from each data source and then integrating the multiple information to form a multi-hop question. Using only generated training data, we can train a competent multi-hop QA which achieves 61% and 83% of the supervised learning performance for the HybridQA and the HotpotQA dataset, respectively. We also show that pretraining the QA system with the generated data would greatly reduce the demand for human-annotated training data. Our codes are publicly available at https://github.com/teacherpeterpan/Unsupervised-Multi-hop-QA.

Multimodal QA (2 papers):

Worldly Wise (WoW) - Cross-Lingual Knowledge Fusion for Fact-based Visual Spoken-Question Answering https://www.aclweb.org/anthology/2021.naacl-main.153/
MIMOQA: Multimodal Input Multimodal Output Question Answering https://www.aclweb.org/anthology/2021.naacl-main.418/

1.
Title: Worldly Wise (WoW) - Cross-Lingual Knowledge Fusion for Fact-based Visual Spoken-Question Answering

Link: https://www.aclweb.org/anthology/2021.naacl-main.153.pdf

Abstract: Although Question-Answering has long been of research interest, its accessibility to users through a speech interface and its support to multiple languages have not been addressed in prior studies. Towards these ends, we present a new task and a synthetically-generated dataset to do Fact-based Visual Spoken-Question Answering (FVSQA). FVSQA is based on the FVQA dataset, which requires a system to retrieve an entity from Knowledge Graphs (KGs) to answer a question about an image. In FVSQA, the question is spoken rather than typed. Three sub-tasks are proposed: (1) speech-to-text based, (2) end-to-end, without speech-to-text as an intermediate component, and (3) cross-lingual, in which the question is spoken in a language different from that in which the KG is recorded. The end-to-end and cross-lingual tasks are the first to require world knowledge from a multi-relational KG as a differentiable layer in an end-to-end spoken language understanding task, hence the proposed reference implementation is called WorldlyWise (WoW). WoW is shown to perform end-to-end cross-lingual FVSQA at same levels of accuracy across 3 languages - English, Hindi, and Turkish.

2.
Title: MIMOQA: Multimodal Input Multimodal Output Question Answering

Link: https://www.aclweb.org/anthology/2021.naacl-main.418.pdf

Abstract: Multimodal research has picked up significantly in the space of question answering with the task being extended to visual question answering, charts question answering as well as multimodal input question answering. However, all these explorations produce a unimodal textual output as the answer. In this paper, we propose a novel task - MIMOQA - Multimodal Input Multimodal Output Question Answering in which the output is also multimodal. Through human experiments, we empirically show that such multimodal outputs provide better cognitive understanding of the answers. We also propose a novel multimodal question-answering framework, MExBERT, that incorporates a joint textual and visual attention towards producing such a multimodal output. Our method relies on a novel multimodal dataset curated for this problem from publicly available unimodal datasets. We show the superior performance of MExBERT against strong baselines on both the automatic as well as human metrics.

Visual QA (2 papers):

Video Question Answering with Phrases via Semantic Roles https://www.aclweb.org/anthology/2021.naacl-main.196/
CLEVR_HYP: A Challenge Dataset and Baselines for Visual Question Answering with Hypothetical Actions over Images https://www.aclweb.org/anthology/2021.naacl-main.289/

1.
Title: Video Question Answering with Phrases via Semantic Roles

Link: https://www.aclweb.org/anthology/2021.naacl-main.196.pdf

Abstract: Video Question Answering (VidQA) evaluation metrics have been limited to a single-word answer or selecting a phrase from a fixed set of phrases. These metrics limit the VidQA models’ application scenario. In this work, we leverage semantic roles derived from video descriptions to mask out certain phrases, to introduce VidQAP which poses VidQA as a fill-in-the-phrase task. To enable evaluation of answer phrases, we compute the relative improvement of the predicted answer compared to an empty string. To reduce the influence of language-bias in VidQA datasets, we retrieve a video having a different answer for the same question. To facilitate research, we construct ActivityNet-SRL-QA and Charades-SRL-QA and benchmark them by extending three vision-language models. We perform extensive analysis and ablative studies to guide future work. Code and data are public.

2.
Title: CLEVR_HYP: A Challenge Dataset and Baselines for Visual Question Answering with Hypothetical Actions over Images

Link: https://www.aclweb.org/anthology/2021.naacl-main.289.pdf

Abstract: Most existing research on visual question answering (VQA) is limited to information explicitly present in an image or a video. In this paper, we take visual understanding to a higher level where systems are challenged to answer questions that involve mentally simulating the hypothetical consequences of performing specific actions in a given scenario. Towards that end, we formulate a vision-language question answering task based on the CLEVR (Johnson et al., 2017a) dataset. We then modify the best existing VQA methods and propose baseline solvers for this task. Finally, we motivate the development of better vision-language models by providing insights about the capability of diverse architectures to perform joint reasoning over image-text modality.

Knowledge-graph QA (2 papers):

Improving Zero-Shot Cross-lingual Transfer for Multilingual Question Answering over Knowledge Graph https://www.aclweb.org/anthology/2021.naacl-main.465/
QA-GNN: Reasoning with Language Models and Knowledge Graphs for Question Answering https://www.aclweb.org/anthology/2021.naacl-main.45/

1.
Title: Improving Zero-Shot Cross-lingual Transfer for Multilingual Question Answering over Knowledge Graph

Link: https://www.aclweb.org/anthology/2021.naacl-main.465.pdf

Abstract: Multilingual question answering over knowledge graph (KGQA) aims to derive answers from a knowledge graph (KG) for questions in multiple languages. To be widely applicable, we focus on its zero-shot transfer setting. That is, we can only access training data in a high-resource language, while need to answer multilingual questions without any labeled data in target languages. A straightforward approach is resorting to pre-trained multilingual models (e.g., mBERT) for cross-lingual transfer, but there is still a significant gap of KGQA performance between source and target languages. In this paper, we exploit unsupervised bilingual lexicon induction (BLI) to map training questions in source language into those in target language as augmented training data, which circumvents language inconsistency between training and inference. Furthermore, we propose an adversarial learning strategy to alleviate syntax-disorder of the augmented data, making the model incline to both language- and syntax-independence. Consequently, our model narrows the gap in zero-shot cross-lingual transfer. Experiments on two multilingual KGQA datasets with 11 zero-resource languages verify its effectiveness.

2.
Title: QA-GNN: Reasoning with Language Models and Knowledge Graphs for Question Answering

Link: https://www.aclweb.org/anthology/2021.naacl-main.45.pdf

Abstract: The problem of answering questions using knowledge from pre-trained language models (LMs) and knowledge graphs (KGs) presents two challenges: given a QA context (question and answer choice), methods need to (i) identify relevant knowledge from large KGs, and (ii) perform joint reasoning over the QA context and KG. Here we propose a new model, QA-GNN, which addresses the above challenges through two key innovations: (i) relevance scoring, where we use LMs to estimate the importance of KG nodes relative to the given QA context, and (ii) joint reasoning, where we connect the QA context and KG to form a joint graph, and mutually update their representations through graph-based message passing. We evaluate QA-GNN on the CommonsenseQA and OpenBookQA datasets, and show its improvement over existing LM and LM+KG models, as well as its capability to perform interpretable and structured reasoning, e.g., correctly handling negation in questions.

Cross-lingual QA (2 papers):

XOR QA: Cross-lingual Open-Retrieval Question Answering https://www.aclweb.org/anthology/2021.naacl-main.46/
X-METRA-ADA: Cross-lingual Meta-Transfer learning Adaptation to Natural Language Understanding and Question Answering https://www.aclweb.org/anthology/2021.naacl-main.283/

1.
Title: XOR QA: Cross-lingual Open-Retrieval Question Answering

Link: https://www.aclweb.org/anthology/2021.naacl-main.46.pdf

Abstract: Multilingual question answering tasks typically assume that answers exist in the same language as the question. Yet in practice, many languages face both information scarcity—where languages have few reference articles—and information asymmetry—where questions reference concepts from other cultures. This work extends open-retrieval question answering to a cross-lingual setting enabling questions from one language to be answered via answer content from another language. We construct a large-scale dataset built on 40K information-seeking questions across 7 diverse non-English languages that TYDI QA could not find same-language answers for. Based on this dataset, we introduce a task framework, called Cross-lingual Open-Retrieval Question Answering (XOR QA), that consists of three new tasks involving cross-lingual document retrieval from multilingual and English resources. We establish baselines with state-of-the-art machine translation systems and cross-lingual pretrained models. Experimental results suggest that XOR QA is a challenging task that will facilitate the development of novel techniques for multilingual question answering. Our data and code are available at https://nlp.cs.washington.edu/xorqa/.

2.
Title: X-METRA-ADA: Cross-lingual Meta-Transfer Learning Adaptation to Natural Language Understanding and Question Answering

Link: https://www.aclweb.org/anthology/2021.naacl-main.283.pdf

Abstract: Multilingual models, such as M-BERT and XLM-R, have gained increasing popularity, due to their zero-shot cross-lingual transfer learning capabilities. However, their generalization ability is still inconsistent for typologically diverse languages and across different benchmarks. Recently, meta-learning has garnered attention as a promising technique for enhancing transfer learning under low-resource scenarios: particularly for cross-lingual transfer in Natural Language Understanding (NLU). In this work, we propose X-METRA-ADA, a cross-lingual MEta-TRAnsfer learning ADAptation approach for NLU. Our approach adapts MAML, an optimization-based meta-learning approach, to learn to adapt to new languages. We extensively evaluate our framework on two challenging cross-lingual NLU tasks: multilingual task-oriented dialog and typologically diverse question answering. We show that our approach outperforms naive fine-tuning, reaching competitive performance on both tasks for most languages. Our analysis reveals that X-METRA-ADA can leverage limited data for faster adaptation.

Miscellaneous (4 papers):

Multilingual Language Models Predict Human Reading Behavior https://www.aclweb.org/anthology/2021.naacl-main.10/
KPQA: A Metric for Generative Question Answering Using Keyphrase Weights https://www.aclweb.org/anthology/2021.naacl-main.170/
AVA: an Automatic eValuation Approach for Question Answering Systems https://www.aclweb.org/anthology/2021.naacl-main.412/
Capturing Row and Column Semantics in Transformer Based Question Answering over Tables https://www.aclweb.org/anthology/2021.naacl-main.96/

1.
Title: Multilingual Language Models Predict Human Reading Behavior

Link: https://www.aclweb.org/anthology/2021.naacl-main.10.pdf

Abstract: We analyze if large language models are able to predict patterns of human reading behavior. We compare the performance of language-specific and multilingual pretrained transformer models to predict reading time measures reflecting natural human sentence processing on Dutch, English, German, and Russian texts. This results in accurate models of human reading behavior, which indicates that transformer models implicitly encode relative importance in language in a way that is comparable to human processing mechanisms. We find that BERT and XLM models successfully predict a range of eye tracking features. In a series of experiments, we analyze the cross-domain and cross-language abilities of these models and show how they reflect human sentence processing.

The authors study whether today's popular Transformer-based models exhibit some of the behavioral patterns humans show when reading a text.

2.
Title: KPQA: A Metric for Generative Question Answering Using Keyphrase Weights

Link: https://www.aclweb.org/anthology/2021.naacl-main.170.pdf

Abstract: In the automatic evaluation of generative question answering (GenQA) systems, it is difficult to assess the correctness of generated answers due to the free-form of the answer. Especially, widely used n-gram similarity metrics often fail to discriminate the incorrect answers since they equally consider all of the tokens. To alleviate this problem, we propose KPQA-metric, a new metric for evaluating the correctness of GenQA. Specifically, our new metric assigns different weights to each token via keyphrase prediction, thereby judging whether a generated answer sentence captures the key meaning of the reference answer. To evaluate our metric, we create high-quality human judgments of correctness on two GenQA datasets. Using our human-evaluation datasets, we show that our proposed metric has a significantly higher correlation with human judgments than existing metrics. Code for KPQA-metric will be available at https://github.com/hwanheelee1993/KPQA.

Proposes a metric for evaluating the correctness of answers produced by generative MRC models (a simplified sketch follows).
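A simplified sketch of the idea: token overlap where tokens predicted to be keyphrases count more. The real metric learns the weights with a trained keyphrase-prediction model and combines them with existing metrics; the hand-made weight table below is purely illustrative.

```python
# Keyphrase-weighted token F1, a toy stand-in for a KPQA-style metric.
def weighted_f1(candidate: str, reference: str, weights: dict) -> float:
    c, r = candidate.split(), reference.split()
    w = lambda t: weights.get(t, 0.1)       # non-key tokens count for little
    prec = sum(w(t) for t in c if t in r) / sum(w(t) for t in c)
    rec = sum(w(t) for t in r if t in c) / sum(w(t) for t in r)
    return 0.0 if prec + rec == 0 else 2 * prec * rec / (prec + rec)

weights = {"1969": 1.0, "moon": 0.8}        # assumed keyphrase weights
print(weighted_f1("they landed in 1969",
                  "apollo 11 landed on the moon in 1969", weights))
```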
3.
Title: AVA: an Automatic eValuation Approach for Question Answering Systems

Link: https://www.aclweb.org/anthology/2021.naacl-main.412.pdf

Abstract: We introduce AVA, an automatic evaluation approach for Question Answering, which given a set of questions associated with Gold Standard answers (references), can estimate system Accuracy. AVA uses Transformer-based language models to encode question, answer, and reference texts. This allows for effectively assessing answer correctness using similarity between the reference and an automatic answer, biased towards the question semantics. To design, train, and test AVA, we built multiple large training, development, and test sets on public and industrial benchmarks. Our innovative solutions achieve up to 74.7% F1 score in predicting human judgment for single answers. Additionally, AVA can be used to evaluate the overall system Accuracy with an error lower than 7% at 95% of confidence when measured on several QA systems.

Proposes a new way to evaluate QA quality: the model's accuracy is estimated by semantically comparing each generated answer with the gold answer (illustrated below).
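Not the authors' model, only a loose stand-in to make the core idea concrete: score a system answer by its semantic similarity to the gold reference, biased by the question. AVA itself trains Transformer-based models for this; the sentence-transformers dependency and checkpoint name below are assumptions.

```python
# Hedged sketch of reference-based answer scoring with an off-the-shelf encoder.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed available checkpoint

def answer_score(question: str, candidate: str, reference: str) -> float:
    # prepend the question so similarity is biased towards its semantics
    cand, ref = model.encode([question + " " + candidate,
                              question + " " + reference])
    return float(util.cos_sim(cand, ref))

print(answer_score("Who wrote Hamlet?", "It was written by Shakespeare.", "William Shakespeare"))
```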

4.

Title: Capturing Row and Column Semantics in Transformer Based Question Answering over Tables

Link: https://www.aclweb.org/anthology/2021.naacl-main.96.pdf

Abstract: Transformer based architectures are recently used for the task of answering questions over tables. In order to improve the accuracy on this task, specialized pre-training techniques have been developed and applied on millions of open-domain web tables. In this paper, we propose two novel approaches demonstrating that one can achieve superior performance on table QA task without even using any of these specialized pre-training techniques. The first model, called RCI interaction, leverages a transformer based architecture that independently classifies rows and columns to identify relevant cells. While this model yields extremely high accuracy at finding cell values on recent benchmarks, a second model we propose, called RCI representation, provides a significant efficiency advantage for online QA systems over tables by materializing embeddings for existing tables. Experiments on recent benchmarks prove that the proposed methods can effectively locate cell values on tables (up to ∼98% Hit@1 accuracy on WikiSQL lookup questions). Also, the interaction model outperforms the state-of-the-art transformer based approaches, pre-trained on very large table corpora (TAPAS and TABERT), achieving ∼3.4% and ∼18.86% additional precision improvement on the standard WikiSQL benchmark.

This paper studies using machine reading comprehension models to find and return answers to questions posed over traditional database tables. The two proposed models yield significant improvements on WikiSQL.


Judging by paper counts, open-domain QA and classic MRC take the largest share, roughly a third of the total. Work on open-domain QA concentrates on improving the retriever that finds answer passages. On the dataset side, complex settings involving long documents, reasoning, and multiple modalities are increasingly common: as reading comprehension matures, researchers increasingly test whether models can still answer questions correctly in more complex scenarios. Unsupervised and self-supervised QA without labeled data looks like the future trend. I am personally most interested in model robustness in real-world settings (heavy interference, adversarial examples, and test samples drawn from a distribution different from the training data), and two of this year's papers address exactly that direction.

In my view, reading comprehension, as one of the most fundamental NLU tasks, has improved step by step with advances in context encoding such as pretrained models and word embeddings. Accuracy on simple datasets such as SQuAD has largely plateaued; what matters now is the effectiveness and robustness of MRC models in more complex scenarios, and exploring how reading comprehension can interact and combine with other tasks.

Finally, since my understanding is inevitably imperfect, there may be mistakes in this article; criticism and corrections are very welcome.
