Reposted from: VQA — Notes on the Innovations of Top-Conference Visual Question Answering Papers from the Past Five Years

A brief survey of the innovations in Visual Question Answering (VQA) papers published at top conferences over the past five years, selected from NIPS, CVPR, ICCV, ACL, and others; 86 papers have been compiled so far.

Revised 2019-10-21: added 5 papers from ACL 2019.
2014 A Multi-World Approach to Question Answering about Real-World Scenes based on Uncertain Input

Malinowski M, Fritz M. A multi-world approach to question answering about real-world scenes based on uncertain input[C]//Advances in neural information processing systems. 2014: 1682-1690.

This paper is the conceptual precursor of VQA, but a later paper [2015 VQA: Visual Question Answering] argued that the problem defined here restricts answers to a predefined set of 16 basic colors and 894 object categories, so it counts only as a "VQA effort" rather than a true definition of VQA.
[Figure 1: Overview of our approach to question answering with multiple latent worlds in contrast to single world approach.]

Within a Bayesian framework, the paper combines semantic segmentation of real-world scenes with symbolic reasoning over the question sentence to perform automatic question answering.

The paper also releases an RGB-D (color plus depth) image dataset with roughly 12,000 human-annotated question-answer pairs.
2015 Are You Talking to a Machine Dataset and Methods for Multilingual Image Question Answering

Gao H, Mao J, Zhou J, et al. Are you talking to a machine? dataset and methods for multilingual image question answering[C]//Advances in neural information processing systems. 2015: 2296-2304.

The paper proposes the mQA model, which answers questions about the content of an image; the answer can be a sentence, a phrase, or a single word.
[Figure 2: Illustration of the mQA model architecture. We input an image and a question about the image (i.e. “What is the cat doing?”) to the model. The model is trained to generate the answer to the question (i.e. “Sitting on the umbrella”). The weight matrix in the word embedding layers of the two LSTMs (one for the question and one for the answer) are shared. In addition, as in [25], this weight matrix is also shared, in a transposed manner, with the weight matrix in the Softmax layer. Different colors in the figure represent different components of the model. (Best viewed in color.)]

The model has four components:

an LSTM that extracts the question representation;
a CNN that extracts the visual representation of the image;
another LSTM that stores the linguistic context of the answer;
a fusion component that combines the three above and generates the answer.

The paper also releases the Freestyle Multilingual Image Question Answering (FM-IQA) dataset.
2015 Ask Your Neurons A Neural-Based Approach to Answering Questions About Images

Malinowski M, Rohrbach M, Fritz M. Ask your neurons: A neural-based approach to answering questions about images[C]//Proceedings of the IEEE international conference on computer vision. 2015: 1-9.

The paper proposes the Neural-Image-QA model. It follows an encoder-decoder scheme and casts VQA as a generation problem: after the question sentence is encoded word by word, the answer is decoded word by word until the END symbol is predicted.
[Figure 1. Our approach Neural-Image-QA to question answering with a Recurrent Neural Network using Long Short Term Memory (LSTM). To answer a question about an image, we feed in both, the image (CNN features) and the question (green boxes) into the LSTM. After the (variable length) question is encoded, we generate the answers (multiple words, orange boxes). During the answer generation phase the previously predicted answers are fed into the LSTM until the END symbol is predicted.]
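Below is a minimal sketch (in PyTorch, with all module names, shapes, and the greedy decoding choice assumed rather than taken from the paper) of the generative answering loop described above: the image feature and the question words are first fed into the LSTM, then answer words are decoded one at a time until the END symbol is predicted.

```python
import torch

def decode_answer(lstm_cell, classifier, embed, img_feat, question_emb, end_id, max_len=10):
    # lstm_cell: torch.nn.LSTMCell(input_dim, hidden_dim)
    # classifier: maps the hidden state (1, hidden_dim) to vocabulary logits
    # embed: torch.nn.Embedding over the answer vocabulary
    # img_feat: (input_dim,) CNN image feature, already projected to the embedding size
    # question_emb: list of (input_dim,) word embeddings of the question
    h = torch.zeros(1, lstm_cell.hidden_size)
    c = torch.zeros(1, lstm_cell.hidden_size)
    for x in [img_feat] + list(question_emb):          # encode the image, then the question
        h, c = lstm_cell(x.unsqueeze(0), (h, c))
    answer = []
    for _ in range(max_len):                           # decode the answer word by word
        word_id = classifier(h).argmax(dim=-1)         # greedy choice at this step
        if word_id.item() == end_id:                   # stop once the END symbol is predicted
            break
        answer.append(word_id.item())
        h, c = lstm_cell(embed(word_id), (h, c))       # feed the predicted word back in
    return answer
```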
2015 Exploring Models and Data for Image Question Answering

Ren M, Kiros R, Zemel R. Exploring models and data for image question answering[C]//Advances in neural information processing systems. 2015: 2953-2961.

The paper uses neural networks and visual-semantic embeddings to predict answers to simple visual questions directly, without intermediate steps such as object detection or image segmentation.
[Figure 2: VIS+LSTM Model]

This is also similar to an encoder-decoder framework: the image feature and the question words are fed into an LSTM one by one for encoding, but instead of decoding an output sentence, the final encoded vector is classified over a predefined vocabulary to predict the answer word.
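A minimal PyTorch sketch of this classification-style formulation, with vocabulary sizes and feature dimensions assumed: the image feature is projected to the word-embedding size and treated as the first token, and the final LSTM state is classified over a fixed answer vocabulary.

```python
import torch
import torch.nn as nn

class VisLSTM(nn.Module):
    def __init__(self, vocab_size=10000, num_answers=1000, embed_dim=300, hidden_dim=512, img_dim=4096):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.img_proj = nn.Linear(img_dim, embed_dim)    # treat the image as the "first word"
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_answers)

    def forward(self, img_feat, question_tokens):
        # img_feat: (B, img_dim) CNN feature; question_tokens: (B, T) word ids
        img_tok = self.img_proj(img_feat).unsqueeze(1)   # (B, 1, E)
        q_emb = self.embed(question_tokens)              # (B, T, E)
        seq = torch.cat([img_tok, q_emb], dim=1)         # image first, then the question words
        _, (h, _) = self.lstm(seq)                       # h: (1, B, H) final hidden state
        return self.classifier(h[-1])                    # logits over the answer vocabulary
```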

The paper also proposes a question-generation algorithm that converts image descriptions into question sentences.
2015 Visalogy Answering Visual Analogy Questions

Sadeghi F, Zitnick C L, Farhadi A. Visalogy: Answering visual analogy questions[C]//Advances in Neural Information Processing Systems. 2015: 1882-1890.

The paper studies visual analogy questions: image A is to image B as image C is to which image?

It uses a convolutional neural network with a quadruple Siamese architecture.
[Figure 2: VISALOGY Network has quadruple Siamese architecture with shared θ parameters. The network is trained with correct analogy quadruples of images [I1, I2, I3, I4] along with wrong analogy quadruples as negative samples. The contrastive loss function pushes (I1; I2) and (I3; I4) of correct analogies close to each other in the embedding space while forcing the distance between (I1; I2) and (I3; I4) in negative samples to be more than margin m.]

The VISALOGY network is built as a quadruple Siamese architecture with shared parameters θ.
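The following is a hedged sketch of a margin-based contrastive objective of the kind described in the caption; the pair embeddings d12 and d34 and the margin m are assumptions, not the paper's exact formulation. Correct analogy pairs are pulled together in the embedding space, while negatives are pushed apart beyond the margin.

```python
import torch
import torch.nn.functional as F

def analogy_contrastive_loss(d12, d34, is_positive, margin=1.0):
    # d12, d34: (B, D) pair embeddings, e.g. differences f(I2)-f(I1) and f(I4)-f(I3)
    # is_positive: (B,) 1.0 for correct analogy quadruples, 0.0 for negative samples
    dist = F.pairwise_distance(d12, d34)
    pos_term = is_positive * dist.pow(2)                     # pull correct analogies together
    neg_term = (1.0 - is_positive) * F.relu(margin - dist).pow(2)  # push negatives beyond the margin
    return (pos_term + neg_term).mean()
```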
2015 VisKE Visual Knowledge Extraction and Question Answering by Visual Verification of Relation Phrases

Sadeghi F, Kumar Divvala S K, Farhadi A. Viske: Visual knowledge extraction and question answering by visual verification of relation phrases[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2015: 1456-1464.

The paper focuses mainly on visual knowledge extraction; VQA is presented as a downstream application of the extracted knowledge.

Prior knowledge-extraction work generally verifies relation phrases only through text-driven reasoning and therefore cannot exploit visual information.

This paper is the first to pose the problem of visually verifying relation phrases, and it builds the Visual Knowledge Extraction system (VisKE).
[Figure 2. Approach Overview. Given a relation predicate, such as fish(bear,salmon) VisKE formulates visual verification as the problem of estimating the most probable explanation (MPE) by searching for visual consistencies among the patterns of subject, object and the action being involved.]

The input is a relation predicate such as fish(bear, salmon): bear (subject noun), fish (verb), salmon (object). Given such a predicate, e.g. "a bear fishes salmon," VisKE casts visual verification as estimating the most probable explanation (MPE) by searching for visual consistencies among the patterns of the subject, the object, and the action involved.
2015 Visual Madlibs Fill in the Blank Description Generation and Question Answering

Yu L, Park E, Berg A C, et al. Visual madlibs: Fill in the blank description generation and question answering[C]//Proceedings of the ieee international conference on computer vision. 2015: 2461-2469.

[Figure 1. An example from the Visual Madlibs Dataset, including a variety of targeted descriptions for people and objects.]

The paper releases the Visual Madlibs dataset, in which fill-in-the-blank templates generate descriptions of people, objects, appearance, activities, interactions, and scenes.
2015 VQA Visual Question Answering

Antol S, Agrawal A, Lu J, et al. Vqa: Visual question answering[C]//Proceedings of the IEEE international conference on computer vision. 2015: 2425-2433.

This paper is the first to define the Visual Question Answering (VQA) task.
[Figure 1: Examples of free-form, open-ended questions collected for images via Amazon Mechanical Turk. Note that commonsense knowledge is needed along with a visual understanding of the scene to answer many questions.]
2016 Answer-Type Prediction for Visual Question Answering

Kafle K, Kanan C. Answer-type prediction for visual question answering[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016: 4976-4984.

The core idea is to predict the answer type (e.g., that a question asks for a color) from the question sentence and to use that prediction for VQA.
[Figure 1: In the open-ended VQA problem, an algorithm is given an image and a question, and it must output a string containing the answer. We obtain state-of-the-art results on multiple VQA datasets by adopting a Bayesian approach that incorporates information about the form the answer should take. In this example, the system is given an image of a bear and it is asked about the color of the bear. Our method explicitly infers that this is a “color” question and uses that information in its predictive process.]

By predicting the answer type, a prediction process specialized for that type can be selected, improving overall VQA accuracy.

The paper designs a Bayesian model that estimates the probability that the answer is k and the answer type is c, conditioned on the image feature x and the question feature q (see the factorization sketched below).
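One plausible way to write that quantity out explicitly is given below; the notation is assumed, and the factorization assumes the answer type c is predicted from the question feature alone, as the surrounding text suggests.

```latex
% A hedged reconstruction of the quantity described above (notation assumed;
% the factorization assumes the answer type c depends only on the question q):
P(A = k,\ T = c \mid x, q) \;=\; P(A = k \mid T = c,\ x,\ q)\, P(T = c \mid q)
```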
2016 Ask Me Anything Free-Form Visual Question Answering Based on Knowledge from External Sources

Wu Q, Wang P, Shen C, et al. Ask me anything: Free-form visual question answering based on knowledge from external sources[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016: 4622-4630.

The paper studies VQA that combines image content with information from external knowledge bases, so that a wide range of image-based questions can be answered.

In effect, the paper converts the image into attribute words and textual descriptions, turning VQA into a conventional text question-answering problem.
[Figure 2. Our proposed framework: given an image, a CNN is first applied to produce the attribute-based representation Vatt(I). The internal textual representation is made up of image captions generated based on the image-attributes. The hidden state of the caption-LSTM after it has generated the last word in each caption is used as its vector representation. These vectors are then aggregated as Vcap(I) with average-pooling. The external knowledge is mined from the KB (in this case DBpedia) and the responses encoded by Doc2Vec, which produces a vector Vknow(I). The 3 vectors are combined into a single representation of scene content, which is input to the VQA LSTM model which interprets the question and generates an answer.]

Image captioning converts the image into text, vectorized as Vcap(I); image annotation converts the image into attribute words, vectorized as Vatt(I), and the attributes are used to query the knowledge base for textual descriptions, encoded as Vknow(I). These vectors are then fed into the LSTM decoder, which consumes the question sentence step by step and outputs the answer sentence.
2016 Hierarchical Question-Image Co-Attention for Visual Question Answering

Lu J, Yang J, Batra D, et al. Hierarchical question-image co-attention for visual question answering[C]//Advances In Neural Information Processing Systems. 2016: 289-297.

The paper argues that models should learn not only visual attention (where to look) but also question attention (which words to attend to). It proposes a co-attention model for VQA that jointly reasons about image and question attention.

Co-attention here means that the image representation guides attention over the question, and the question representation guides attention over the image; the model can attend both to different image regions and to different segments of the question (words, phrases).

A hierarchical attention mechanism is built over the question text (word level, phrase level, and question level).
[Figure 1: Flowchart of our proposed hierarchical co-attention model. Given a question, we extract its word level, phrase level and question level embeddings. At each level, we apply co-attention on both the image and question. The final answer prediction is based on all the co-attended image and question features.]

The question is embedded at the word, phrase, and whole-question levels (the hierarchical structure over the question text). At each level, co-attention computes attention over both the image and the question, and the final answer is predicted from all of the co-attended image and question features.
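A minimal PyTorch sketch of parallel co-attention in the spirit described above; the dimensions, weight names, and the affinity-matrix formulation are assumptions rather than the paper's released code. An affinity matrix between question and image features guides attention in both directions, yielding an attended image vector and an attended question vector.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ParallelCoAttention(nn.Module):
    def __init__(self, dim=512, k=256):
        super().__init__()
        self.W_b = nn.Linear(dim, dim, bias=False)   # affinity between the two modalities
        self.W_v = nn.Linear(dim, k, bias=False)
        self.W_q = nn.Linear(dim, k, bias=False)
        self.w_hv = nn.Linear(k, 1, bias=False)
        self.w_hq = nn.Linear(k, 1, bias=False)

    def forward(self, V, Q):
        # V: (B, N, dim) image region features; Q: (B, T, dim) word/phrase/question features
        C = torch.tanh(torch.bmm(Q, self.W_b(V).transpose(1, 2)))                   # (B, T, N) affinity
        H_v = torch.tanh(self.W_v(V) + torch.bmm(C.transpose(1, 2), self.W_q(Q)))   # (B, N, k)
        H_q = torch.tanh(self.W_q(Q) + torch.bmm(C, self.W_v(V)))                   # (B, T, k)
        a_v = F.softmax(self.w_hv(H_v).squeeze(-1), dim=1)    # attention over image regions
        a_q = F.softmax(self.w_hq(H_q).squeeze(-1), dim=1)    # attention over question tokens
        v_hat = torch.bmm(a_v.unsqueeze(1), V).squeeze(1)     # (B, dim) attended image feature
        q_hat = torch.bmm(a_q.unsqueeze(1), Q).squeeze(1)     # (B, dim) attended question feature
        return v_hat, q_hat
```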
2016 Image Question Answering Using Convolutional Neural Network With Dynamic Parameter Prediction

Noh H, Hongsuck Seo P, Han B. Image question answering using convolutional neural network with dynamic parameter prediction[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2016: 30-38.

The paper observes that different questions differ in type and in the level of image understanding they require (hence the CNN should change dynamically to adapt to each question).

The model is a CNN with a dynamic parameter layer whose weights are determined adaptively by the question sentence.

To realize adaptive parameter prediction, a separate parameter-prediction network processes the input question with a GRU and outputs a set of candidate weights through a fully connected layer.

Generating the huge number of parameters of a CNN fully connected layer is too costly, so the paper reduces the complexity with a hashing trick: hashing selects, for each weight of the dynamic layer, one of the candidate weights produced by the parameter-prediction network.

The resulting model is the Dynamic Parameter Prediction network (DPPnet).
[Figure 2. Overall architecture of the proposed Dynamic Parameter Prediction network (DPPnet), which is composed of the classification network and the parameter prediction network. The weights in the dynamic parameter layer are mapped by a hashing trick from the candidate weights obtained from the parameter prediction network.]

The parameters of the dynamic parameter layer in the classification network are obtained by hash-mapping the candidate weights output by the parameter-prediction network (directly predicting the full set of fully connected weights would be far too expensive).
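A hedged sketch of this hashing trick (array sizes assumed): every weight of the dynamic fully connected layer is looked up, via a fixed hash of its (row, column) index, in the much smaller candidate-weight vector predicted from the question. The optional sign hash follows the HashedNets idea and may differ from the paper's exact choice.

```python
import numpy as np

def build_dynamic_weights(candidate, out_dim, in_dim, seed=0):
    # candidate: (K,) candidate weights predicted from the question, with K << out_dim * in_dim
    rng = np.random.RandomState(seed)                       # fixed "hash", shared across examples
    hash_idx = rng.randint(0, candidate.shape[0], size=(out_dim, in_dim))
    sign = rng.choice([-1.0, 1.0], size=(out_dim, in_dim))  # optional sign hash (HashedNets-style)
    return sign * candidate[hash_idx]                       # (out_dim, in_dim) dynamic weight matrix

# usage sketch: W = build_dynamic_weights(predicted_candidates, out_dim=1000, in_dim=2048)
```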
2016 MovieQA Understanding Stories in Movies Through Question-Answering

Tapaswi M, Zhu Y, Stiefelhagen R, et al. Movieqa: Understanding stories in movies through question-answering[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2016: 4631-4640.

The paper releases the MovieQA dataset.
[Figure 1: Our MovieQA dataset contains 14,944 questions about 408 movies. It contains multiple sources of information: plots, subtitles, video clips, scripts, and DVS transcriptions. In this figure we show example QAs from The Matrix and localize them in the timeline.]

MovieQA is multiple-choice QA: the model must pick the answer from five options.

A notable ingredient is DVS (Descriptive Video Service), a service for visually impaired viewers that inserts descriptions of the movie scene between stretches of dialogue; the paper uses these DVS transcriptions as a source of textual descriptions.

The QA-oriented Memory Network in this paper is designed with reference to MemN2N.
2016 Stacked Attention Networks for Image Question Answering

Yang Z, He X, Gao J, et al. Stacked attention networks for image question answering[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2016: 21-29.

The paper proposes Stacked Attention Networks (SANs).

SAN represents the question sentence as a semantic vector and uses it as a query to search the image for regions related to the answer. Arguing that VQA requires multi-step reasoning, the paper builds a multi-layer SAN that queries the image several times and infers the answer progressively.
[Figure 1: Model architecture and visualization. (b) Visualization of the learned multiple attention layers. The stacked attention network first focuses on all referred concepts, e.g., bicycle, basket and objects in the basket (dogs) in the first attention layer, and then further narrows down the focus in the second layer and finds out the answer dog.]

Judging from the visualizations, the stacked attention layers serve to narrow the attended region step by step.

The so-called multi-step reasoning works as follows: attention first picks out the image content mentioned in the question, and each subsequent attention step searches within the content found by the previous step; over a few steps, the attended region keeps shrinking until it settles on the most relevant area, which yields the answer.

Concretely, the output of the last pooling layer of VGGNet is taken as a 512×14×14 feature map that preserves the spatial layout of the original image. The map divides the image into a 14×14 grid of regions, each represented by a 512-dimensional feature vector, and these region vectors are scored against the question feature vector to compute per-region attention weights.

14×14 is the number of regions in the image and 512 is the dimension of the feature vector for each region.
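A minimal sketch of one such attention layer over the 14×14 grid (weight names and the intermediate dimension k are assumptions): region features are scored against the current query, softmax-normalized, and the attended image vector is added to the query to form the refined query for the next stacked layer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StackedAttentionLayer(nn.Module):
    def __init__(self, d=512, k=256):
        super().__init__()
        self.W_i = nn.Linear(d, k, bias=False)   # projects each of the 196 region vectors
        self.W_q = nn.Linear(d, k)               # projects the current query vector
        self.W_p = nn.Linear(k, 1)               # scores each region

    def forward(self, regions, query):
        # regions: (B, 196, 512) grid features; query: (B, 512) question (or refined) vector
        h = torch.tanh(self.W_i(regions) + self.W_q(query).unsqueeze(1))  # (B, 196, k)
        p = F.softmax(self.W_p(h).squeeze(-1), dim=1)                     # attention over regions
        v_att = (p.unsqueeze(-1) * regions).sum(dim=1)                    # (B, 512) attended image
        return v_att + query                                              # refined query for the next layer
```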

The authors, Zichao Yang, Xiaodong He, and colleagues, also happen to be the proposers of the Hierarchical Attention Network (HAN). I am fond of their papers: they take real responsibility for explaining the principles, and always lay out the technical ideas with the clearest reasoning and the most precise wording. My thanks to them!
2016 Visual Question Answering with Question Representation Update (QRU)

Li R, Jia J. Visual question answering with question representation update (qru)[C]//Advances in Neural Information Processing Systems. 2016: 4655-4663.

The paper argues that the question representation should be updated according to the image.
Its method iterates over the image regions: each iteration computes the relevance between a region and the question, selects the question-relevant regions to update the question representation, and the model then learns to produce the correct answer.
[Figure 2: The overall architecture of our model with single reasoning layer for VQA]

The initial question vector Query0 is compared against the feature vectors of the M image regions to compute M relevance-weighted vectors, which are then used to update the question representation according to the M regions of the image; this is the question representation update (QRU) of the title.
2016 Visual7W Grounded Question Answering in Images

Zhu Y, Groth O, Bernstein M, et al. Visual7w: Grounded question answering in images[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2016: 4995-5004.

Previous work established only loose, global associations between QA sentences and images; this work instead builds semantic links between object-level image regions and the textual descriptions.

The paper releases the Visual7W dataset of multiple-choice QA pairs; the "7W" refers to the seven question types (what, where, when, who, why, how, and which).

"Grounding" here means localizing the objects in the image on which the answer depends.
[Figure 1: Deep image understanding relies on detailed knowledge about different image parts. We employ diverse questions to acquire detailed information on images, ground objects mentioned in text with their visual appearances, and provide a multiple-choice setting for evaluating a visual question answering task with both textual and visual answers.]

Beyond images, questions, and answers, Visual7W also annotates the grounded image region that corresponds to each answer.
2016 Where to Look Focus Regions for Visual Question Answering

Shih K J, Singh S, Hoiem D. Where to look: Focus regions for visual question answering[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2016: 4613-4621.

The proposed model answers visual questions based on the relevance between the query text and image regions.

The paper argues that the model must learn, from the question, where to look in the image.

Relevance between the query text and the visual features is computed with an inner product.
[Figure 3. Overview of our network for the example question-answer pairing: “What color is the fire hydrant? Yellow.” Question and answer representations are concatenated, fed through the network, then combined with selectively weighted image region features to produce a score.]

Image region feature vectors are combined with the text feature vector (the "+" in the figure denotes concatenation); the per-region attention weights are then used to compute a weighted average of the N region vectors in the green box on the right.
The dot product followed by softmax is one attention step that focuses on image regions according to the text features; region vectors and text vectors are first mapped into a common embedding space.
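A brief sketch of this inner-product attention (the projection matrices W_r and W_t are assumed names, not the paper's): text and region features are mapped into the common space, scored by dot product, and the softmax weights give the weighted average of the region features.

```python
import torch
import torch.nn.functional as F

def attend_regions(region_feats, text_feat, W_r, W_t):
    # region_feats: (N, d_r) region features; text_feat: (d_t,) question/answer text feature
    # W_r: (d_r, d) and W_t: (d_t, d) project both modalities into a common d-dim space
    r = region_feats @ W_r                    # (N, d) regions in the common space
    t = text_feat @ W_t                       # (d,)   text in the common space
    weights = F.softmax(r @ t, dim=0)         # (N,) relevance of each region to the query
    return weights @ region_feats             # (d_r,) attention-weighted image feature
```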

2016 Yin and Yang Balancing and Answering Binary Visual Questions

Zhang P, Goyal Y, Summers-Stay D, et al. Yin and yang: Balancing and answering binary visual questions[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016: 5014-5022.

The paper focuses on binary VQA over abstract scenes.

Binary VQA is modeled as a visual verification problem: verify whether the concept asked about in the question is present in the image.
[Figure 1: We address the problem of answering binary questions about images. To eliminate strong language priors that shadow the role of detailed visual understanding in visual question answering (VQA), we use abstract scenes to collect a balanced dataset containing pairs of complementary scenes: the two scenes have opposite answers to the same question, while being visually as similar as possible. We view the task of answering binary questions as a visual verification task: we convert the question into a tuple that concisely summarizes the visual concept, which if present, result in the answer of the question being “yes”, and otherwise “no”. Our approach attends to relevant portions of the image when verifying the presence of the visual concept.]

Each question is paired with two complementary abstract scenes, one whose answer is yes and one whose answer is no.

The model used in the experiments is based on the VQA method.
2017 A Dataset and Exploration of Models for Understanding Video Data Through Fill-In-The-Blank Question-Answering

Maharaj T, Ballas N, Rohrbach A, et al. A dataset and exploration of models for understanding video data through fill-in-the-blank question-answering[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017: 6884-6893.

The paper argues that the video-understanding field lacks sufficient data.

It introduces the MovieFIB (Movie Fill-In-the-Blank) dataset of about 300,000 examples, built from descriptive video annotations prepared for visually impaired viewers.
[Figure 1. Two examples from the training set of our fill-in-the-blank dataset.]

MovieFIB uses a fill-in-the-blank QA format.
2017 An Analysis of Visual Question Answering Algorithms

Kafle K, Kanan C. An analysis of visual question answering algorithms[C]//Proceedings of the IEEE International Conference on Computer Vision. 2017: 1965-1973.

The main contribution is a new dataset, the Task Driven Image Understanding Challenge (TDIUC), whose questions are grouped into 12 categories.
The dataset is then used to analyze existing VQA algorithms.
[Figure 1: A good VQA benchmark tests a wide range of computer vision tasks in an unbiased manner. In this paper, we propose a new dataset with 12 distinct tasks and evaluation metrics that compensate for bias, so that the strengths and limitations of algorithms can be better measured.]

In addition, the dataset includes absurd questions, so that models learn to recognize nonsensical or irrelevant questions.
2017 An Empirical Evaluation of Visual Question Answering for Novel Objects

Ramakrishnan S K, Pal A, Sharma G, et al. An empirical evaluation of visual question answering for novel objects[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017: 4392-4401.

The paper studies how to make VQA handle novel objects (objects never seen in the training set); popular existing VQA methods suffer large accuracy drops when such objects appear.
[Figure 1: We are interested in answering questions about images containing objects not seen at training.]

External corpora and image data are used:

unlabeled text;
labeled images.

2017 Are You Smarter Than a Sixth Grader Textbook Question Answering for Multimodal Machine Comprehension

Kembhavi A, Seo M, Schwenk D, et al. Are you smarter than a sixth grader? textbook question answering for multimodal machine comprehension[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017: 4999-5007.

The textbook question answering studied here is an instance of Multi-Modal Machine Comprehension (M3C): given a context composed of text, diagrams, and images, the machine must answer multimodal questions.
[Figure 1. An overview of the Multi-modal Machine Comprehension (M3C) paradigm, statistics of the proposed Textbook Question Answering (TQA) dataset and an illustration of a lesson in it. TQA can be downloaded at http://textbookqa.org .]

The paper releases the Textbook Question Answering (TQA) dataset.

TQA is harder than existing machine reading comprehension and VQA benchmarks: a substantial portion of its questions require complex parsing of and reasoning over both text and diagrams.
2017 Creativity Generating Diverse Questions Using Variational Autoencoders

Jain U, Zhang Z, Schwing A G. Creativity: Generating diverse questions using variational autoencoders[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017: 6485-6494.

[Figure 3: High level VAE overview of our approach.]

The paper combines a variational autoencoder (VAE) with LSTMs to build a creative algorithm for visual question generation.
2017 End-To-End Concept Word Detection for Video Captioning, Retrieval, and Question Answering

Yu Y, Ko H, Choi J, et al. End-to-end concept word detection for video captioning, retrieval, and question answering[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017: 3165-3173.

The paper proposes a high-level concept-word detector that can be integrated into various video-to-language models.
Given an input video, the detector generates a list of concept words that are provided to the language-generation model.
[Figure 1. The intuition of the proposed concept word detector. Given a video clip, a set of tracing LSTMs extract multiple concept words that consistently appear across frame regions. We then employ semantic attention to combine the detected concepts with text encoding/decoding for several video-to-language tasks of LSMDC 2016, such as captioning, retrieval, and question answering.]

The concept-word detector is trained on videos paired with their description sentences; once trained, it generates a set of high-level concept words for each video.
[Figure 2. The architecture of the concept word detection in a top red box (section 2.2), and our video description model in bottom, which uses semantic attention on the detected concept words (section 3.1).]

The overall system is an encoder-decoder framework augmented with the concept-word detector.
2017 Explicit Knowledge-based Reasoning for Visual Question Answering

Wang P, Wu Q, Shen C, et al. Explicit knowledge-based reasoning for visual question answering[C]//Proceedings of the 26th International Joint Conference on Artificial Intelligence. AAAI Press, 2017: 1290-1296.

The proposed VQA model can reason about the image using information extracted from a large-scale knowledge base.
[Figure 1: A real example of the proposed KB-VQA dataset and the results given by Ahab, the proposed VQA approach. Our approach answers questions by extracting several types of visual concepts from an image and aligning them to large-scale structured knowledge bases. Apart from answers, our approach can also provide reasons and explanations for certain types of questions.]

The method not only answers questions that go beyond the image semantics, but can also explain the reasoning process that led to the answer.
[Figure 3: Top: An RDF graph such as might be constructed by Ahab. For simplicity, we only show entities that are relevant to answering the questions in Fig. 1. Each arrow corresponds to one triple in the graph, with circles representing entities and green text reflecting predicate type. The graph of extracted visual concepts (left side) is linked to DBpedia (right side) by mapping object/attribute/scene to DBpedia entities using the predicate same-concept. Bottom: The question processing pipeline. The input question is parsed using a set of NLP tools to identify the appropriate template. The extracted slot-phrases are then mapped to entities in the KB. Next, KB queries are generated to mine the relevant relationships for the KB-entities. Finally, the answer and reason are generated based on the query results. The predicate category/?broader is used to obtain the categories transitively.]

It constructs an RDF (Resource Description Framework) graph [Cyganiak et al., 2014], links it to an external knowledge base, and then parses the natural-language question, maps it to KB entities, and issues logical queries, producing answers whose reasoning process is interpretable.
2017 Graph-Structured Representations for Visual Question Answering

Teney D, Liu L, van den Hengel A. Graph-structured representations for visual question answering[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017: 1-9.

This paper is the first to represent both the image and the question text as graphs and to process them with a deep model that classifies the answer.

The paper:

encodes the image as a scene graph;
represents the sentence as a syntactic dependency graph;
trains a neural network to reason over the scene graph and the dependency graph and to classify the result into one word of the answer vocabulary.

[Figure 2. Architecture of the proposed neural network. The input is provided as a description of the scene (a list of objects with their visual characteristics) and a parsed question (words with their syntactic relations). The scene-graph contains a node with a feature vector for each object, and edge features that represent their spatial relationships. The question-graph reflects the parse tree of the question, with a word embedding for each node, and a vector embedding of types of syntactic dependencies for edges. A recurrent unit (GRU) is associated with each node of both graphs. Over multiple iterations, the GRU updates a representation of each node that integrates context from its neighbours within the graph. Features of all objects and all words are combined (concatenated) pairwise, and they are weighted with a form of attention. That effectively matches elements between the question and the scene. The weighted sum of features is passed through a final classifier that predicts scores over a fixed set of candidate answers.]

Key points:

Scene graph: a set of objects and their visual characteristics. Each object corresponds to a node carrying a feature vector, and edges between nodes encode their spatial relationships.
Question graph: the syntactically parsed question. It mirrors the parse tree, with one node per word; each node carries the word embedding, and edges carry vector embeddings of the syntactic dependency types.
The feature vectors of all objects and all words are combined pairwise (the Words-Objects matrix in Figure 2) and summed with attention weights (the Matching weights matrix is the attention weight matrix); a sketch of this pairwise matching follows below.
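A hedged sketch of that pairwise word-object matching (shapes and the scoring module are assumptions): every (word, object) pair is formed by concatenation, scored to produce the matching-weight matrix, and the weighted sum of pair features is what the final answer classifier consumes.

```python
import torch

def pairwise_match(word_feats, obj_feats, scorer):
    # word_feats: (T, d) question-graph node features; obj_feats: (N, d) scene-graph node features
    # scorer: a callable mapping (T, N, 2d) pair features to (T, N, 1) raw matching scores
    T, N = word_feats.size(0), obj_feats.size(0)
    w = word_feats.unsqueeze(1).expand(T, N, -1)            # (T, N, d) word features repeated per object
    o = obj_feats.unsqueeze(0).expand(T, N, -1)             # (T, N, d) object features repeated per word
    pairs = torch.cat([w, o], dim=-1)                       # (T, N, 2d) every word-object pair
    scores = scorer(pairs).reshape(-1)                      # (T*N,) raw matching scores
    weights = torch.softmax(scores, dim=0).reshape(T, N)    # matching-weight matrix
    pooled = (weights.unsqueeze(-1) * pairs).sum(dim=(0, 1))  # weighted sum of pair features
    return pooled                                           # (2d,) vector passed to the answer classifier
```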

Limitation: the scene graph here encodes only relative spatial positions.
2017 High-Order Attention Models for Visual Question Answering

Schwartz I, Schwing A, Hazan T. High-order attention models for visual question answering[C]//Advances in Neural Information Processing Systems. 2017: 3664-3674.

The paper proposes a new form of attention that can learn high-order correlations between different data modalities.

Addressing the drawback that previous attention mechanisms are hand-designed for specific tasks and therefore generalize poorly, the paper emphasizes:

general applicability: an attention mechanism broadly usable across many tasks;
high-order correlation: the ability to learn high-order correlations between data modalities, where a k-th-order correlation models the interactions among k modalities.

The paper decomposes current decision systems into three parts:

data embedding;
attention mechanisms;
decision making.

[Figure 2: Our state-of-the-art VQA system]

Attention is treated as a probabilistic model: the attention mechanism computes potentials.

Unary potentials θ_V, θ_Q, θ_A express the importance of each element of the visual input, the question, and the answer, respectively.
Pairwise potentials θ_{V,Q}, θ_{V,A}, θ_{Q,A} express the correlation between two modalities.
Ternary potentials θ_{V,Q,A} capture the dependencies among all three modalities.

The decision-making stage uses MCB and MCT pooling:

MCB (Multimodal Compact Bilinear) pooling fuses two modalities in the pairwise setting;
MCT (Multimodal Compact Trilinear) pooling fuses the data of three modalities.

2017 Knowledge Acquisition for Visual Question Answering via Iterative Querying

Zhu Y, Lim J J, Fei-Fei L. Knowledge acquisition for visual question answering via iterative querying[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017: 1154-1163.

When a person cannot answer a question directly, they ask supplementary questions to understand the scene, and answer the original question only once they understand it well enough. Inspired by this, the paper proposes a dynamic VQA model that issues queries to acquire the supporting evidence needed for the answering task.

The model iteratively issues new queries and gathers relevant evidence from external sources.
[Figure 2: (a) An illustration of a standard VQA model. (b) An overview of our iterative model. (c) Detailed flowchart of our model. The model consists of two major components: core network (green) and query generator (blue). The query generator proposes task-driven queries to fetch evidence from external sources. Acquired knowledge is encoded and stored as memories in the core network for answering a question.]

Concretely, the model acquires supporting evidence through a series of queries to knowledge sources. The retrieved evidence is encoded and stored in a memory bank, and the freshly updated memories are then used either to propose the next round of queries or to answer the target question.
2017 Learning to Disambiguate by Asking Discriminative Questions

Li Y, Huang C, Tang X, et al. Learning to disambiguate by asking discriminative questions[C]//Proceedings of the IEEE International Conference on Computer Vision. 2017: 3419-3428.

Humans learn about the world and resolve ambiguity by asking questions. Inspired by this, the paper poses a new research problem: how to generate discriminative questions that help disambiguate visual instances.
[Figure 4: Overview of the attribute-conditioned question generation process. Given a pair of ambiguous images, we first extract semantic attributes from the images respectively. The attribute scores are sent into a selection model to select the distinguishing attributes pair, which reflects the most obvious difference between the ambiguous images. Then the visual feature and selected attribute pair are fed into an attribute-conditioned LSTM model to generate discriminative questions.]

Given a pair of ambiguous images:

semantic attributes are extracted from each image;
a selection module uses the attribute scores to pick the most distinguishing attribute pair, reflecting the most obvious difference between the two images;
the visual features and the selected attribute pair are fed into an attribute-conditioned LSTM to generate the discriminative question.

2017 Learning to Reason End-to-End Module Networks for Visual Question Answering

Hu R, Andreas J, Rohrbach M, et al. Learning to reason: End-to-end module networks for visual question answering[C]//Proceedings of the IEEE International Conference on Computer Vision. 2017: 804-813.

Building on the recently proposed Neural Module Network (NMN) architecture, the paper proposes End-to-End Module Networks (N2NMNs), which predict instance-specific network layouts directly, without using the external parser NMN relies on.

The model learns to generate network structures by imitating expert demonstrations, while the network parameters are learned from the downstream task loss.
[Figure 2: Model overview. Our approach first computes a deep representation of the question, and uses this as an input to a layout-prediction policy implemented with a recurrent neural network. This policy emits both a sequence of structural actions, specifying a template for a modular neural network in reverse Polish notation, and a sequence of attentive actions, extracting parameters for these neural modules from the input sentence. These two sequences are passed to a network builder, which dynamically instantiates an appropriate neural network and applies it to the input image to obtain an answer.]

The network is built from the question: the question sentence determines which operations are needed, each operation is realized by a sub-network (module), and the chain of modules forms the modular neural network that solves the task at hand. Concretely:

a deep representation of the question is computed first;
an RNN-based layout-prediction policy takes this representation and emits a sequence of structural actions that specify the modular network's template (in reverse Polish notation), plus a sequence of attentive actions that extract the module parameters from the input sentence;
a network builder dynamically instantiates the corresponding neural network from these two sequences and applies it to the image to obtain the answer.

2017 Making the V in VQA Matter Elevating the Role of Image Understanding in Visual Question Answering

Goyal Y, Khot T, Summers-Stay D, et al. Making the V in VQA matter: Elevating the role of image understanding in Visual Question Answering[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017: 6904-6913.

VQA has been led astray by natural regularities in the data and by language priors: trained VQA models answer from statistical regularities (real-world and linguistic) while neglecting the visual input.

The paper stresses the importance of the "V" (vision) in Visual Question Answering.
[Figure 1: Examples from our balanced VQA dataset.]

Concretely, for every question the authors collect a pair of semantically complementary images with opposite answers (e.g., man/woman, yes/no), balancing positives and negatives so that VQA models cannot exploit vision-irrelevant statistical regularities. The resulting fully balanced dataset is the balanced VQA dataset (VQA v2.0).
2017 MarioQA Answering Questions by Watching Gameplay Videos

Mun J, Hongsuck Seo P, Jung I, et al. Marioqa: Answering questions by watching gameplay videos[C]//Proceedings of the IEEE International Conference on Computer Vision. 2017: 2867-2875.

The paper studies video question answering (VideoQA). It proposes an analysis framework for VideoQA models: a synthetic dataset is built from automatically generated gameplay videos and used to analyze model behavior at different levels.
[Figure 1: Overall QA generation procedure. Given a gameplay video and event logs shown on the left, (a) target event is selected (marked as a green box), (b) question semantic chunk is generated from the target event, (c) question template is sampled from template pool, and (d) QA pairs are generated by filling the template and linguistically realizing the answer.]

The synthetic VideoQA dataset is generated from Super Mario Bros. gameplay videos.
2017 Multi-level Attention Networks for Visual Question Answering

Yu D, Fu J, Mei T, et al. Multi-level attention networks for visual question answering[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017: 4709-4717.

The paper argues that existing methods infer answers mainly from abstract low-level visual features, neglecting high-level image semantics and the rich spatial context of image regions.

The paper proposes a multi-level attention network that narrows the semantic gap through semantic attention and fine-grained spatial reasoning based on visual attention. Concretely,

semantic concepts are generated from the high-level CNN semantics, and the question-relevant concepts are selected as semantic attention;
the region-based middle-layer CNN outputs are encoded by a bidirectional RNN into spatially embedded representations, and an MLP further localizes the answer-relevant regions as visual attention;
the semantic attention, visual attention, and question embedding are jointly optimized, and a softmax classifier produces the answer.

[Figure 2. Overall framework of multi-level attention networks. Our framework consists of three components: (A) semantic attention, (B) context-aware visual attention and (C) joint attention learning. Here, we denote by vq the representation of the question Q, by vimg, vc the representation of image content on the visual and semantic level queried by the question, respectively. vr and pimg c is the activation of the last convolutional layer and the probability layer from the CNN.]

Here vimg and vc are both image representations, derived from the region embeddings and from the semantic concepts respectively:

vimg: the image representation obtained by encoding the middle-layer CNN outputs into spatially embedded (per-region) representations and applying visual attention;

vc: the image representation obtained by generating semantic concepts from the high-level CNN semantics and selecting them with semantic attention.

2017 Multi-modal Factorized Bilinear Pooling with Co-attention Learning for Visual Question Answering

Yu Z, Yu J, Fan J, et al. Multi-modal factorized bilinear pooling with co-attention learning for visual question answering[C]//Proceedings of the IEEE international conference on computer vision. 2017: 1821-1830.

The paper proposes the Multi-modal Factorized Bilinear (MFB) pooling approach to improve multi-modal feature fusion and thereby improve VQA.

Although visual features from the image and textual features from the question can both be represented as feature vectors, their distributions differ greatly, so simple concatenation or element-wise addition may not suffice to fuse the two modalities; bilinear models were proposed to address this.
[Figure 3. MFB with Co-Attention network architecture for VQA. Different from the network of MFB baseline, the images and questions are firstly represented as the fine-grained features respectively. Then, Question Attention and Image Attention modules are jointly modeled in the framework to provide more accurate answer predictions.]

Multi-modal Compact Bilinear pooling (MCB) takes the outer product of two feature vectors, which expands quadratically and yields very high-dimensional features. MLB alleviates the dimensionality problem with low-rank projection matrices.

Multi-modal Low-rank Bilinear Pooling (MLB):

x is projected by a matrix U into an o-dimensional vector;
y is projected by a matrix V into an o-dimensional vector;
the two o-dimensional vectors are then multiplied element-wise to obtain the o-dimensional output z (a compact restatement is given below).
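A compact restatement of the MLB fusion just described; the notation is assumed to match the prose above.

```latex
% A hedged restatement of MLB (notation assumed): both inputs are projected to
% o dimensions by U and V and fused with an element-wise (Hadamard) product.
z \;=\; \left(U^{\top} x\right) \circ \left(V^{\top} y\right)
```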

The paper argues that MLB in turn suffers from slow convergence, and therefore proposes MFB.

The simplest multi-modal bilinear model computes each output element as z_i = x^T W_i y, which requires one full projection matrix W_i per output dimension.

Multi-modal Factorized Bilinear pooling (MFB): W_i is factorized into two low-rank matrices U_i and V_i, giving z_i = x^T U_i V_i^T y = 1^T (U_i^T x ∘ V_i^T y); both inputs are projected, multiplied element-wise, and sum-pooled over the k factors.
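A hedged PyTorch sketch of MFB pooling as reconstructed above; the factor count k, the output size, and the normalization steps are assumptions based on the common MFB formulation. Both modalities are projected, multiplied element-wise, sum-pooled over k factors, then power- and L2-normalized.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MFB(nn.Module):
    def __init__(self, x_dim=2048, y_dim=1024, out_dim=1000, k=5):
        super().__init__()
        self.k = k
        self.out_dim = out_dim
        self.U = nn.Linear(x_dim, out_dim * k)   # projection of modality x (e.g. image)
        self.V = nn.Linear(y_dim, out_dim * k)   # projection of modality y (e.g. question)

    def forward(self, x, y):
        z = self.U(x) * self.V(y)                            # (B, out_dim * k) element-wise product
        z = z.view(-1, self.out_dim, self.k).sum(dim=2)      # sum pooling over the k factors
        z = torch.sign(z) * torch.sqrt(torch.abs(z) + 1e-8)  # signed square-root (power normalization)
        return F.normalize(z, dim=1)                         # L2 normalization
```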
