Dynamic Graph Attention for Referring Expression Comprehension: Paper Reading Notes
Abstract
1、Referring expression comprehension is compositional and inherently requires visual reasoning on top of the relationships among the objects in the image. Meanwhile, the visual reasoning process is guided by the linguistic structure of the referring expression.
Referring expression comprehension: given a sentence as guidance, determine which object/region in the image the sentence describes.
2、However, existing approaches treat the objects in isolation or only explore the first-order relationships between objects without being aligned with the potential complexity of the expression.
Limitation of existing work: the objects described in the image are treated without any reasoning, or with only simple first-order reasoning.
3、In this paper, the authors propose a dynamic graph attention network to perform multi-step reasoning by modeling both the relationships among the objects in the image and the linguistic structure of the expression. They also propose a differential analyzer to predict a language-guided visual reasoning process, and perform stepwise reasoning on top of the graph to update the compound object representation at every node.
Introduction
Problem addressed: a Dynamic Graph Attention network (DGA) is proposed to realize multi-step reasoning over the objects in the image, with a differential analyzer (essentially a GCN?) predicting the language-guided reasoning process over the relationships in the graph.
1、The most classic work [13, 16, 21, 25] encodes an expression with an LSTM model [5], extracts features of visual objects in the image using CNNs [24, 20], and adopts matching loss functions to learn a common feature space for the expression and the visual objects.
The author notes that these models have poor interpretability and do not emphasize reasoning: almost all existing approaches for referring expression comprehension either introduce no reasoning at all or only support single-step reasoning.
2、[30, 19, 26, 28] involve extra pairwise context features or multi-order context features to improve the understanding of the image. However, they generally treat the learning process as a black box without explicit reasoning, and the learned monolithic features are not competitive enough when complex referring expressions are given. In short: these models use context-feature extraction to understand the image, but the learning process is like a black box, and the learned features are insufficient.
3、Recently, single-step reasoning [7, 29] has been proposed by decomposing the expression into different components and matching each component with a corresponding visual region via modular networks. That is, some researchers recently proposed single-step reasoning that links language components with their corresponding visual modules.
4、 [33] Its stepwise reasoning is implemented using an LSTM model, which recurrently generates attended visual features while feeding the combination of word embedding and the attended visual features back to the LSTM. However, its stepwise reasoning does not consider the linguistic structure of the expression, and it does not explore the relationships among objects in the image.
In this multi-step reasoning work, the combination of word embeddings and attended visual features is repeatedly fed back into the LSTM, which recurrently generates attended visual features. However, its stepwise reasoning neither considers the linguistic structure of the expression nor explores the relationships among the objects in the image.
Hence the authors propose DGA.
The core ideas behind the proposed DGA come from three aspects:
1、Expression decomposition based on linguistic structure.
It is hard to accurately obtain the linguistic structure of a referring expression, as such expressions are usually complex and flexible. Therefore, the authors resort to a differential analyzer module to predict the constituent expressions of the input expression step by step, capturing the linguistic structure; the input expression is thus represented as a sequence of constituent expressions.
2、Object relationship modeling.
The proposed DGA constructs a directed graph over the objects in the image. The nodes and edges of the graph correspond to the objects and relationships among the objects respectively.
3、Multi-step reasoning for identifying compound objects from relations.
The graph is updated under the guidance of the constituent expressions in a stepwise manner to capture higher-order relationships among the objects, and the compound object corresponding to each node is updated through graph propagation.
Related Work
1、Referring Expression Comprehension
A. Some previous work [16, 21, 25] independently encodes the inputs in the two modals and learns a common feature space for them. To learn the common feature space, they propose different matching loss functions to optimize, e.g., softmax loss [16, 21] and triplet loss [25].
This was mentioned in the first point of the Introduction.
B. Recent work [32, 4] adopts co-attention mechanisms to build up the interactions between the expression and the objects in the image.
C. [7, 29] design fixed templates to softly decompose the expression into different semantic components via self-attention, and compute language-vision matching scores for each pair of component and visual region. However, such work is not applicable to expressions that do not conform to the fixed templates.
D. [14] explores visual reasoning for referring expression comprehension in a synthetic domain. Different from them, the authors focus on real-world images and expressions, but do not resort to the guidance of language-parsing ground truth (language programs [14]).
2、Interpretable Reasoning
A. For one-step relational reasoning, the relation networks [22] model pairwise relationships between objects directly.
B. For single-step or multi-step reasoning, some work [28, 26, 15, 8] explains the reasoning steps by generating updated attention distribution on the image for each step using the attention mechanisms.
C. The other work [1, 9, 6, 3] decomposes the reasoning procedure into a sequence of sub-tasks and learns different modular networks to deal with each sub-task.
However, none of the above introduces interpretable reasoning steps.
D. Modular networks are used to improve the interpretability of models on referring expression comprehension [7, 29].
E. The other work [32] enables reasoning as a step-wise attention process following the stepwise representation of the expression; however, it treats the expression as the sequence of words, which ignores the linguistic structure of the expression.
Different from existing work on referring expression comprehension, we adopt a differential analyzer module to dynamically decompose the expression into its constituent expressions step by step to maintain its linguistic structure and to implement multi-step and dynamic reasoning.
Dynamic Graph Attention Network
A Dynamic Graph Attention network (DGA) is introduced to address the interpretability and multi-step reasoning problems in referring expression comprehension.

(1) A language-driven differential analyzer
We model an expression as a sequence of constituent expressions, and each constituent expression is specified as a soft distribution over the words in the expression.
Notation (a tuple of soft distributions over the words):
$Q=\{q_l\}_{l=1}^{L}$ denotes the $L$ words of the expression.
$R^{(t)}=\{r_l^{(t)}\}_{l=1}^{L}$ is the soft distribution over the words at step $t$.
LSTM input: the word embeddings $F=\{f_l\}_{l=1}^{L}$.
LSTM output: a vector sequence of word features $H=\{h_l\}_{l=1}^{L}$.
The LSTM also yields the feature of the whole sentence, $q$.
DGA performs $T$ reasoning steps. First, a linear transformation maps the sentence feature $q$ to a step-specific vector $q^{(t)}$ (presumably specializing the sentence feature per step):

$$q^{(t)}=W^{(t)}q+b^{(t)}$$

Then the previous step's result $y^{(t-1)}$ is combined with $q^{(t)}$ to produce a new vector $u^{(t)}$:

$$u^{(t)}=[q^{(t)};y^{(t-1)}]$$

Next, the soft distribution $R^{(t)}=\{r_l^{(t)}\}_{l=1}^{L}$ is generated by combining $u^{(t)}$ with each word feature $h_l$; finally:

$$y^{(t)}=\sum_{l=1}^{L} r_l^{(t)} h_l$$
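A minimal numpy sketch of one step of the differential analyzer. The scoring function that combines $u^{(t)}$ with each $h_l$ is not spelled out in these notes, so the dot-product form over a concatenation (weights `Wr`) is an assumption; shapes are illustrative.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def analyzer_step(q, y_prev, H, Wt, bt, Wr):
    """One reasoning step of the differential analyzer (sketch).

    q:      sentence feature, shape (d,)
    y_prev: previous step output y^{(t-1)}, shape (d,)
    H:      word features h_l, shape (L, d)
    Wt, bt: step-specific linear map, so q^{(t)} = W^{(t)} q + b^{(t)}
    Wr:     scoring weights for the soft distribution, shape (3d,) -- assumed form
    """
    q_t = Wt @ q + bt                      # q^{(t)} = W^{(t)} q + b^{(t)}
    u_t = np.concatenate([q_t, y_prev])    # u^{(t)} = [q^{(t)}; y^{(t-1)}]
    # Score each word by combining u^{(t)} with h_l (exact form is an assumption):
    scores = np.array([Wr @ np.concatenate([u_t, h]) for h in H])
    r_t = softmax(scores)                  # soft distribution R^{(t)}
    y_t = r_t @ H                          # y^{(t)} = sum_l r_l^{(t)} h_l
    return r_t, y_t
```

Running this for $t=1,\dots,T$ (with a fresh `Wt`, `bt` per step) yields the sequence of constituent-expression features $y^{(1)},\dots,y^{(T)}$.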
(2) A static graph attention module
A. Graph structure
$G^I=(V,E,X^I)$
$X^I=\{x_k^I\}_{k=1}^{K}$
$x_k^I=[x_k^o;p_k]$
$x_k^I$ is the concatenation of $o_k$'s visual feature $x_k^o$ and $o_k$'s spatial feature $p_k$.
$x_k^o$ is extracted from a pretrained CNN model [24, 20].

$$p_k=W_p[x_{0k};x_{1k};w_k;h_k;w_k h_k]$$

$x_{0k}$ and $x_{1k}$ are the normalized coordinates of the center of object $o_k$; $w_k$ and $h_k$ are the normalized width and height.
Then, as in [28], the relationships between objects are divided into 11 types; this categorization describes the positional relations between objects and forms an important part of the edge information propagated through the graph.
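A small sketch of assembling one node feature $x_k^I$ from the visual and spatial components defined above. The five-element spatial vector follows the formula for $p_k$; `Wp` is a learned projection and the dimensions are illustrative.

```python
import numpy as np

def node_feature(visual_feat, box, Wp):
    """Build the node feature x_k^I = [x_k^o; p_k] for one object.

    visual_feat: x_k^o from a pretrained CNN, shape (dv,)
    box: (x0, x1, w, h) -- normalized center coordinates, width, height
    Wp:  projection for the spatial feature, shape (dp, 5)
    """
    x0, x1, w, h = box
    spatial = np.array([x0, x1, w, h, w * h])   # [x_0k; x_1k; w_k; h_k; w_k h_k]
    p_k = Wp @ spatial                          # p_k = W_p [...]
    return np.concatenate([visual_feat, p_k])   # x_k^I = [x_k^o; p_k]
```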
B. Static Attention
$G^M=(V,E,X^M)$

$$x_k^M=W_m[x_k^I;c_k]+b_m$$

$$c_k=\sum_{l=1}^{L}\alpha_{k,l} f_l$$

Word weights fall into two categories (entity and relation):

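The static attention update above can be sketched as follows. How the weights $\alpha_{k,l}$ are derived from the entity/relation word categories is omitted here, so `alpha` is taken as given (an assumption); shapes are illustrative.

```python
import numpy as np

def static_attention(x_I, F, Wm, bm, alpha):
    """Language-conditioned node update x_k^M = W_m [x_k^I; c_k] + b_m.

    x_I:   node features x_k^I, shape (K, d)
    F:     word embeddings f_l, shape (L, dw)
    alpha: attention weights alpha_{k,l} over words per node, shape (K, L)
    Wm:    weight matrix, shape (dm, d + dw); bm: bias, shape (dm,)
    """
    C = alpha @ F                         # c_k = sum_l alpha_{k,l} f_l
    X = np.concatenate([x_I, C], axis=1)  # [x_k^I; c_k]
    return X @ Wm.T + bm                  # x_k^M
```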
(3) A dynamic graph attention module
A GCN-style feature-aggregation module; features are aggregated once at each reasoning step.

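A rough sketch of one aggregation step of this module, assuming a GCN-style message pass gated by the current constituent-expression feature $y^{(t)}$. The softmax gate and the residual update are assumptions; the paper's exact gating and its handling of the 11 edge types are not reproduced here.

```python
import numpy as np

def propagate(X, A, y_t, Wg):
    """One language-gated aggregation step over the object graph (sketch).

    X:   node features, shape (K, d)
    A:   adjacency matrix of the directed object graph, shape (K, K)
    y_t: constituent-expression feature for step t, shape (d,)
    Wg:  weight matrix, shape (d, d)
    """
    # Weight nodes by their relevance to y^{(t)} (assumed dot-product gate):
    gate = X @ y_t
    gate = np.exp(gate - gate.max())
    gate = gate / gate.sum()
    M = A @ (X * gate[:, None])        # messages weighted by the language gate
    return np.tanh(M @ Wg.T) + X       # residual update of compound objects
```

Repeating this step $T$ times lets each node accumulate higher-order relationships, i.e., represent a compound object.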
(4) A matching module

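The notes leave this section empty. As a placeholder, a matching module can be sketched as a similarity between the final compound-object features and the expression feature; cosine similarity is an assumption here, since the paper learns the matching function (cf. the matching losses in Related Work).

```python
import numpy as np

def matching_scores(X_final, q):
    """Score each object node against the expression feature (cosine
    similarity -- an assumed form, not the paper's learned matcher)."""
    Xn = X_final / np.linalg.norm(X_final, axis=1, keepdims=True)
    qn = q / np.linalg.norm(q)
    return Xn @ qn   # the argmax picks the referred object
```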