TRAR: Routing the Attention Spans in Transformer for Visual Question Answering

一、Background

With its superior ability to model global dependencies, the Transformer and its variants have become the primary backbone for many vision-and-language tasks. However, in tasks such as visual question answering (VQA) and referring expression comprehension (REC), multimodal reasoning usually requires visual information at granularities ranging from macro to micro. How to dynamically schedule global and local dependency modeling in the Transformer therefore becomes an emerging problem.

二、Motivation

1)In some V&L tasks, such as visual question answering (VQA) and referring expression comprehension (REC), multimodal reasoning usually requires visual attention over different receptive fields. The model should not only understand the overall semantics but, more importantly, capture local relationships in order to produce the correct answer.
2)In this paper, the authors propose a new lightweight routing scheme called Transformer Routing (TRAR), which enables automatic attention-span selection while keeping the extra computation and GPU memory overhead negligible.

三、Model

(一)The framework of TRAR
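  • In outline, each self-attention (SA) layer of the Transformer is equipped with a path controller and a set of attention masks with different spans; for every input example the controller predicts routing probabilities, and the selected (or mixed) mask determines the receptive field of that layer. The following subsections detail this process.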

(二)Routing Process

  • To achieve per-example dynamic routing, an intuitive solution is to build a multi-branch network structure in which each layer is equipped with modules of different settings. Specifically, given the features X∈R^n×d from the previous inference step and the routing space F=[F_0,…,F_N], the output X' of the next inference step is obtained as shown below.
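    A plausible form of the referenced update, consistent with the surrounding description, is
        X' = Σ_{i=0}^{N} α_i · F_i(X)
    where α = [α_0, …, α_N] are the per-example routing probabilities predicted by the path controller introduced below.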

  • However, as the above equation shows, such a multi-branch routing scheme inevitably makes the network very complicated and greatly increases the training cost.

  • The key to reducing this burden is to rethink the definition of routing. Revisiting the standard self-attention (SA), it can be written as:
        SA(X) = A · (XW_v),    A = Softmax((XW_q)(XW_k)^T / √d)
    where W_q, W_k, W_v ∈ R^d×d are the query, key and value projection matrices and A∈R^n×n is the attention matrix.

  • From this form, SA can be regarded as the feature-update function of a fully connected graph, where A∈R^n×n acts as a weighted adjacency matrix. Therefore, to obtain features with different attention spans, one only needs to restrict the graph connections of each input element. This can be achieved by multiplying the dot-product result with an adjacency mask D∈R^n×n, as shown below.
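    Following this description, the masked self-attention presumably takes the form
        SA_D(X) = Softmax((XW_q)(XW_k)^T / √d ⊙ D) · (XW_v)
    where ⊙ denotes element-wise multiplication with the binary mask D; in practice, positions with D = 0 are typically set to −∞ before the Softmax so that they receive exactly zero attention weight.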

  • Based on the above equations, a routing layer for SA is then defined as:
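    Combining the two equations above, the routing layer presumably takes the form
        X' = Σ_{i=0}^{N} α_i · SA_{D_i}(X)
    i.e., all branches share the same projection weights and differ only in their adjacency masks D_i.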

  • However, the above formula is still computationally expensive. Therefore, the authors further simplify the module-selection problem into the selection of the adjacency mask D, defined as:
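    Since the branches differ only in their masks, routing can be collapsed into selecting (or mixing) a single mask, presumably as
        D̄ = Σ_{i=0}^{N} α_i · D_i,    X' = SA_{D̄}(X)
    so that only one self-attention pass is needed per layer, which is why the extra computation and memory overhead remains negligible.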

  • In TRAR, each SA layer is equipped with a path controller to predict the probabilities of the routing options, i.e., to perform the module selection. Specifically, given the input features X∈R^n×d, the path probabilities α (one entry per routing option) are defined as:
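    A plausible form of the controller, consistent with the description above, is
        α = Softmax(MLP(Pool(X)))
    where Pool(·) aggregates the n input elements into a single d-dimensional vector (e.g., by average pooling) and the MLP maps it to one logit per routing option; the exact pooling and MLP design are assumptions here rather than the paper's stated formula.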

(三)Optimization

  • By applying the Softmax function, routing-path selection becomes a continuous, differentiable operation. The router and the whole network can then be optimized end-to-end with the task-specific objective, arg min_{w,z} L_train(w, z), where w and z denote the network weights and the routing parameters, respectively. At test time, the features of different attention spans are combined dynamically according to the predicted probabilities. Because soft routing is directly differentiable, training is relatively easy.

  • Hard routing instead performs binary path selection, which makes it possible to introduce dedicated CUDA kernels to accelerate model inference. However, discrete routing makes the router weights non-differentiable, and simply binarizing the soft-routing results may lead to a feature gap between training and testing. To solve this problem, the authors introduce the Gumbel-Softmax trick to implement differentiable path routing:
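    The Gumbel-Softmax relaxation replaces the hard argmax over the routing probabilities with the differentiable sample
        α̂_i = exp((log α_i + g_i) / τ) / Σ_j exp((log α_j + g_j) / τ),    g_i = −log(−log u_i),  u_i ~ Uniform(0, 1)
    where τ is a temperature; as τ → 0 the samples approach one-hot vectors, and a straight-through estimator lets gradients flow through the soft values during training.

    To make the whole scheme concrete, below is a minimal PyTorch sketch of a routed self-attention layer in the spirit of TRAR. The helper names (build_span_masks, RoutedSelfAttention), the average-pooling/MLP design of the path controller, and the square local windows are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def build_span_masks(h, w, orders=(0, 1, 2, 3)):
    """Binary adjacency masks D_i for an h*w grid of visual tokens.

    Order 0 is the global (fully connected) mask; order k restricts each
    query to keys inside a (2k+1) x (2k+1) local window around it.
    """
    n = h * w
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    coords = torch.stack([ys.flatten(), xs.flatten()], dim=-1)        # (n, 2)
    dist = (coords[:, None, :] - coords[None, :, :]).abs().amax(-1)   # Chebyshev distance, (n, n)
    masks = [torch.ones(n, n)]                                        # global span
    masks += [(dist <= k).float() for k in orders if k > 0]           # local spans
    return torch.stack(masks)                                         # (M, n, n)


class RoutedSelfAttention(nn.Module):
    """Self-attention whose span is selected per example by a path controller."""

    def __init__(self, dim, masks, tau=1.0, hard=False):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.register_buffer("masks", masks)              # (M, n, n) binary masks D_i
        # path controller: pooled features -> one logit per mask
        self.router = nn.Sequential(nn.Linear(dim, dim // 2), nn.ReLU(),
                                    nn.Linear(dim // 2, masks.size(0)))
        self.tau, self.hard = tau, hard

    def forward(self, x):                                 # x: (B, n, d)
        logits = self.router(x.mean(dim=1))               # (B, M) routing logits
        if self.hard:
            # hard routing: one-hot forward pass, soft gradients (straight-through)
            alpha = F.gumbel_softmax(logits, tau=self.tau, hard=True)
        else:
            # soft routing: a differentiable mixture of all spans
            alpha = F.softmax(logits, dim=-1)
        scores = self.q(x) @ self.k(x).transpose(-2, -1) / x.size(-1) ** 0.5  # (B, n, n)
        # one masked attention map per routing option D_i (pruned edges get zero weight)
        per_mask = torch.stack([
            F.softmax(scores.masked_fill(D == 0, float("-inf")), dim=-1)
            for D in self.masks
        ])                                                 # (M, B, n, n)
        attn = torch.einsum("bm,mbij->bij", alpha, per_mask)  # mix maps by path probabilities
        return attn @ self.v(x)


# usage: an 8x8 grid of 512-d visual features, batch of 2
masks = build_span_masks(8, 8)
layer = RoutedSelfAttention(dim=512, masks=masks, hard=False)
out = layer(torch.randn(2, 64, 512))                       # -> (2, 64, 512)
```

    With hard=True, torch.nn.functional.gumbel_softmax returns a one-hot selection in the forward pass while keeping soft gradients in the backward pass, matching the straight-through behaviour described above.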

四、Experiment

  • To validate the proposed TRAR, the authors apply it to visual question answering (VQA) and referring expression comprehension (REC), and conduct extensive experiments on five benchmark datasets: VQA2.0, CLEVR, RefCOCO, RefCOCO+ and RefCOCOg.

(一) Ablations

(二) Comparison with SOTA



(三) Qualitative Analysis


五、Conclusion

  • In this paper, the authors examine dependency modeling for the Transformer in two vision-and-language tasks, namely VQA and REC. These tasks typically require visual attention over different receptive fields, which the standard Transformer cannot fully handle.
  • To this end, the authors propose a lightweight routing scheme called Transformer Routing (TRAR) that helps the model dynamically select an attention span for each sample. TRAR transforms the module-selection problem into the selection of an attention mask, making the extra computation and GPU memory overhead negligible.
  • To verify the effectiveness of TRAR, extensive experiments were carried out on five benchmark datasets, and the results confirm the superiority of TRAR.
