TRAR: Routing the Attention Spans in Transformer for Visual Question Answering

一、Background

With its superior ability to model global dependencies, the Transformer and its variants have become the primary backbone for many vision-and-language tasks. However, in tasks such as visual question answering (VQA) and referring expression comprehension (REC), multimodal reasoning usually requires visual information at granularities ranging from macro to micro. How to dynamically schedule global and local dependency modeling in the Transformer therefore becomes an emerging problem.

二、Motivation

1)In some V&L tasks, such as visual question answering (VQA) and referring expression comprehension (REC), multimodal reasoning usually requires visual attention over different receptive fields. The model should not only understand the overall semantics but, more importantly, capture local relationships in order to produce the correct answer.
2)In this paper, the authors propose a new lightweight routing scheme called Transformer Routing (TRAR), which enables automatic attention-span selection while keeping the extra computation and GPU memory overhead negligible.

三、Model

(一)The framework of TRAR
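  • In outline, each self-attention (SA) layer of the Transformer is equipped with a path controller and a set of attention masks with different spans; for every input example the controller predicts routing probabilities, and the selected (or mixed) mask determines the receptive field of that layer. The following subsections detail this process.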

(二)Routing Process

  • To achieve per-example dynamic routing, an intuitive solution is to build a multi-branch network structure in which each layer is equipped with modules of different settings. Specifically, given the features X∈R^n×d from the previous inference step and the routing space F=[F_0,…,F_N], the output X' of the next inference step is obtained as shown below.
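    A plausible form of the referenced update, consistent with the surrounding description, is
        X' = Σ_{i=0}^{N} α_i · F_i(X)
    where α = [α_0, …, α_N] are the per-example routing probabilities predicted by the path controller introduced below.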

  • However, as the above equation shows, such a multi-branch routing scheme inevitably makes the network very complicated and greatly increases the training cost.

  • The key to reducing this burden is to rethink the definition of routing. Revisiting the standard self-attention (SA), it can be written as:
        SA(X) = A · (XW_v),    A = Softmax((XW_q)(XW_k)^T / √d)
    where W_q, W_k, W_v ∈ R^d×d are the query, key and value projection matrices and A∈R^n×n is the attention matrix.

  • From this form, SA can be regarded as the feature-update function of a fully connected graph, where A∈R^n×n acts as a weighted adjacency matrix. Therefore, to obtain features with different attention spans, one only needs to restrict the graph connections of each input element. This can be achieved by multiplying the dot-product result with an adjacency mask D∈R^n×n, as shown below.
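    Following this description, the masked self-attention presumably takes the form
        SA_D(X) = Softmax((XW_q)(XW_k)^T / √d ⊙ D) · (XW_v)
    where ⊙ denotes element-wise multiplication with the binary mask D; in practice, positions with D = 0 are typically set to −∞ before the Softmax so that they receive exactly zero attention weight.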

  • Based on the above equations, a routing layer for SA is then defined as:
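    Combining the two equations above, the routing layer presumably takes the form
        X' = Σ_{i=0}^{N} α_i · SA_{D_i}(X)
    i.e., all branches share the same projection weights and differ only in their adjacency masks D_i.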

  • However, the above formula is still computationally expensive. Therefore, the authors further simplify the module-selection problem into the selection of the adjacency mask D, defined as:
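    Since the branches differ only in their masks, routing can be collapsed into selecting (or mixing) a single mask, presumably as
        D̄ = Σ_{i=0}^{N} α_i · D_i,    X' = SA_{D̄}(X)
    so that only one self-attention pass is needed per layer, which is why the extra computation and memory overhead remains negligible.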

  • In TRAR, each SA layer is equipped with a path controller to predict the probabilities of the routing options, i.e., to perform the module selection. Specifically, given the input features X∈R^n×d, the path probabilities α (one entry per routing option) are defined as:
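    A plausible form of the controller, consistent with the description above, is
        α = Softmax(MLP(Pool(X)))
    where Pool(·) aggregates the n input elements into a single d-dimensional vector (e.g., by average pooling) and the MLP maps it to one logit per routing option; the exact pooling and MLP design are assumptions here rather than the paper's stated formula.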

(三)Optimization

  • By applying the Softmax function, routing-path selection becomes a continuous, differentiable operation. The router and the whole network can then be optimized end-to-end with the task-specific objective, arg min_{w,z} L_train(w, z), where w and z denote the network weights and the routing parameters, respectively. At test time, the features of different attention spans are combined dynamically according to the predicted probabilities. Because soft routing is directly differentiable, training is relatively easy.

  • Hard routing instead performs binary path selection, which makes it possible to introduce dedicated CUDA kernels to accelerate model inference. However, discrete routing makes the router weights non-differentiable, and simply binarizing the soft-routing results may lead to a feature gap between training and testing. To solve this problem, the authors introduce the Gumbel-Softmax trick to implement differentiable path routing:
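    The Gumbel-Softmax relaxation replaces the hard argmax over the routing probabilities with the differentiable sample
        α̂_i = exp((log α_i + g_i) / τ) / Σ_j exp((log α_j + g_j) / τ),    g_i = −log(−log u_i),  u_i ~ Uniform(0, 1)
    where τ is a temperature; as τ → 0 the samples approach one-hot vectors, and a straight-through estimator lets gradients flow through the soft values during training.

    To make the whole scheme concrete, below is a minimal PyTorch sketch of a routed self-attention layer in the spirit of TRAR. The helper names (build_span_masks, RoutedSelfAttention), the average-pooling/MLP design of the path controller, and the square local windows are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def build_span_masks(h, w, orders=(0, 1, 2, 3)):
    """Binary adjacency masks D_i for an h*w grid of visual tokens.

    Order 0 is the global (fully connected) mask; order k restricts each
    query to keys inside a (2k+1) x (2k+1) local window around it.
    """
    n = h * w
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    coords = torch.stack([ys.flatten(), xs.flatten()], dim=-1)        # (n, 2)
    dist = (coords[:, None, :] - coords[None, :, :]).abs().amax(-1)   # Chebyshev distance, (n, n)
    masks = [torch.ones(n, n)]                                        # global span
    masks += [(dist <= k).float() for k in orders if k > 0]           # local spans
    return torch.stack(masks)                                         # (M, n, n)


class RoutedSelfAttention(nn.Module):
    """Self-attention whose span is selected per example by a path controller."""

    def __init__(self, dim, masks, tau=1.0, hard=False):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.register_buffer("masks", masks)              # (M, n, n) binary masks D_i
        # path controller: pooled features -> one logit per mask
        self.router = nn.Sequential(nn.Linear(dim, dim // 2), nn.ReLU(),
                                    nn.Linear(dim // 2, masks.size(0)))
        self.tau, self.hard = tau, hard

    def forward(self, x):                                 # x: (B, n, d)
        logits = self.router(x.mean(dim=1))               # (B, M) routing logits
        if self.hard:
            # hard routing: one-hot forward pass, soft gradients (straight-through)
            alpha = F.gumbel_softmax(logits, tau=self.tau, hard=True)
        else:
            # soft routing: a differentiable mixture of all spans
            alpha = F.softmax(logits, dim=-1)
        scores = self.q(x) @ self.k(x).transpose(-2, -1) / x.size(-1) ** 0.5  # (B, n, n)
        # one masked attention map per routing option D_i (pruned edges get zero weight)
        per_mask = torch.stack([
            F.softmax(scores.masked_fill(D == 0, float("-inf")), dim=-1)
            for D in self.masks
        ])                                                 # (M, B, n, n)
        attn = torch.einsum("bm,mbij->bij", alpha, per_mask)  # mix maps by path probabilities
        return attn @ self.v(x)


# usage: an 8x8 grid of 512-d visual features, batch of 2
masks = build_span_masks(8, 8)
layer = RoutedSelfAttention(dim=512, masks=masks, hard=False)
out = layer(torch.randn(2, 64, 512))                       # -> (2, 64, 512)
```

    With hard=True, torch.nn.functional.gumbel_softmax returns a one-hot selection in the forward pass while keeping soft gradients in the backward pass, matching the straight-through behaviour described above.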

四、Experiment

  • To validate the proposed TRAR, the authors apply it to visual question answering (VQA) and referring expression comprehension (REC), and conduct extensive experiments on five benchmark datasets: VQA2.0, CLEVR, RefCOCO, RefCOCO+ and RefCOCOg.

(一) Ablations

(二) Comparison with SOTA



(三) Qualitative Analysis


五、Conclusion

  • In this paper, the authors examine dependency modeling for the Transformer in two vision-and-language tasks, namely VQA and REC. These tasks typically require visual attention over different receptive fields, which the standard Transformer cannot fully handle.
  • To this end, the authors propose a lightweight routing scheme called Transformer Routing (TRAR) that helps the model dynamically select an attention span for each sample. TRAR transforms the module-selection problem into the selection of an attention mask, making the extra computation and GPU memory overhead negligible.
  • To verify the effectiveness of TRAR, extensive experiments were carried out on five benchmark datasets, and the results confirm the superiority of TRAR.
