Dynamic Graph Attention for Referring Expression Comprehension: Paper Reading Notes
Abstract
1、Referring expression comprehension is compositional and inherently requires visual reasoning on top of the relationships among the objects in the image. Meanwhile, the visual reasoning process is guided by the linguistic structure of the referring expression.
Referring expression comprehension: given a sentence as guidance, determine which object/region in the image the sentence describes.
2、However, existing approaches treat the objects in isolation or only explore the first-order relationships between objects without being aligned with the potential complexity of the expression.
Limitation of existing work: the objects described in the image are treated without any reasoning, or with only simple first-order reasoning.
3、In this paper, the authors propose a dynamic graph attention network to perform multi-step reasoning by modeling both the relationships among the objects in the image and the linguistic structure of the expression. They also propose a differential analyzer to predict a language-guided visual reasoning process, and perform stepwise reasoning on top of the graph to update the compound object representation at every node.
Introduction
Problem addressed: a Dynamic Graph Attention network (DGA) is proposed to realize multi-step reasoning over the objects in the image, with a differential analyzer (essentially a GCN?) predicting the language-guided reasoning process over the relationships in the graph.
1、The most classic work [13, 16, 21, 25] encodes an expression with an LSTM model [5], extracts features of visual objects in the image using CNNs [24, 20], and adopts matching loss functions to learn a common feature space for the expression and the visual objects.
The author notes that these models have poor interpretability and do not emphasize reasoning: almost all existing approaches for referring expression comprehension either introduce no reasoning at all or only support single-step reasoning.
2、[30, 19, 26, 28] involve extra pairwise context features or multi-order context features to improve the understanding of the image. However, they generally treat the learning process as a black box without explicit reasoning, and the learned monolithic features are not competitive enough when complex referring expressions are given. In short: these models use context-feature extraction to understand the image, but the learning process is like a black box, and the learned features are insufficient.
3、Recently, single-step reasoning [7, 29] has been proposed by decomposing the expression into different components and matching each component with a corresponding visual region via modular networks. That is, some researchers recently proposed single-step reasoning that links language components with their corresponding visual modules.
4、 [33] Its stepwise reasoning is implemented using an LSTM model, which recurrently generates attended visual features while feeding the combination of word embedding and the attended visual features back to the LSTM. However, its stepwise reasoning does not consider the linguistic structure of the expression, and it does not explore the relationships among objects in the image.
In this multi-step reasoning work, the combination of word embeddings and attended visual features is repeatedly fed back into the LSTM, which recurrently generates attended visual features. However, its stepwise reasoning neither considers the linguistic structure of the expression nor explores the relationships among the objects in the image.
Hence the authors propose DGA.
The core ideas behind the proposed DGA come from three aspects:
1、Expression decomposition based on linguistic structure.
It is hard to accurately obtain the linguistic structure of a referring expression, as such expressions are usually complex and flexible. Therefore, the authors resort to a differential analyzer module to predict the constituent expressions of the input expression step by step, capturing the linguistic structure; the input expression is thus represented as a sequence of constituent expressions.
2、Object relationship modeling.
The proposed DGA constructs a directed graph over the objects in the image. The nodes and edges of the graph correspond to the objects and relationships among the objects respectively.
3、Multi-step reasoning for identifying compound objects from relations.
The graph is updated under the guidance of the constituent expressions in a stepwise manner to capture higher-order relationships among the objects, and the compound object corresponding to each node is updated through graph propagation.
Related Work
1、Referring Expression Comprehension
A. Some previous work [16, 21, 25] independently encodes the inputs in the two modals and learns a common feature space for them. To learn the common feature space, they propose different matching loss functions to optimize, e.g., softmax loss [16, 21] and triplet loss [25].
This was mentioned in the first point of the Introduction.
B. Recent work [32, 4] adopts co-attention mechanisms to build up the interactions between the expression and the objects in the image.
C. [7, 29] design fixed templates to softly decompose the expression into different semantic components via self-attention, and compute language-vision matching scores for each pair of component and visual region. However, such work is not applicable to expressions that do not conform to the fixed templates.
D. [14] explores visual reasoning for referring expression comprehension in a synthetic domain. Different from them, the authors focus on real-world images and expressions, but do not resort to the guidance of language-parsing ground truth (language programs [14]).
2、Interpretable Reasoning
A. For one-step relational reasoning, the relation networks [22] model pairwise relationships between objects directly.
B. For single-step or multi-step reasoning, some work [28, 26, 15, 8] explains the reasoning steps by generating updated attention distribution on the image for each step using the attention mechanisms.
C. The other work [1, 9, 6, 3] decomposes the reasoning procedure into a sequence of sub-tasks and learns different modular networks to deal with each sub-task.
However, none of the above introduces interpretable reasoning steps.
D. Modular networks are used to improve the interpretability of models on referring expression comprehension [7, 29].
E. The other work [32] enables reasoning as a step-wise attention process following the stepwise representation of the expression; however, it treats the expression as the sequence of words, which ignores the linguistic structure of the expression.
Different from existing work on referring expression comprehension, we adopt a differential analyzer module to dynamically decompose the expression into its constituent expressions step by step to maintain its linguistic structure and to implement multi-step and dynamic reasoning.
Dynamic Graph Attention Network
A Dynamic Graph Attention network (DGA) is introduced to address the interpretability and multi-step reasoning problems in referring expression comprehension.

(1) A language-driven differential analyzer
We model an expression as a sequence of constituent expressions, and each constituent expression is specified as a soft distribution over the words in the expression.
Notation (a tuple of soft distributions over the words):
$Q=\{q_l\}_{l=1}^{L}$ denotes the $L$ words of the expression.
$R^{(t)}=\{r_l^{(t)}\}_{l=1}^{L}$ is the soft distribution over the words at step $t$.
LSTM input: the word embeddings $F=\{f_l\}_{l=1}^{L}$.
LSTM output: a vector sequence of word features $H=\{h_l\}_{l=1}^{L}$.
The LSTM also yields the feature of the whole sentence, $q$.
DGA performs $T$ reasoning steps. First, a linear transformation maps the sentence feature $q$ to a step-specific vector $q^{(t)}$ (presumably specializing the sentence feature per step):

$$q^{(t)}=W^{(t)}q+b^{(t)}$$

Then the previous step's result $y^{(t-1)}$ is combined with $q^{(t)}$ to produce a new vector $u^{(t)}$:

$$u^{(t)}=[q^{(t)};y^{(t-1)}]$$

Next, the soft distribution $R^{(t)}=\{r_l^{(t)}\}_{l=1}^{L}$ is generated by combining $u^{(t)}$ with each word feature $h_l$; finally:

$$y^{(t)}=\sum_{l=1}^{L} r_l^{(t)} h_l$$
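A minimal numpy sketch of one step of the differential analyzer. The scoring function that combines $u^{(t)}$ with each $h_l$ is not spelled out in these notes, so the dot-product form over a concatenation (weights `Wr`) is an assumption; shapes are illustrative.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def analyzer_step(q, y_prev, H, Wt, bt, Wr):
    """One reasoning step of the differential analyzer (sketch).

    q:      sentence feature, shape (d,)
    y_prev: previous step output y^{(t-1)}, shape (d,)
    H:      word features h_l, shape (L, d)
    Wt, bt: step-specific linear map, so q^{(t)} = W^{(t)} q + b^{(t)}
    Wr:     scoring weights for the soft distribution, shape (3d,) -- assumed form
    """
    q_t = Wt @ q + bt                      # q^{(t)} = W^{(t)} q + b^{(t)}
    u_t = np.concatenate([q_t, y_prev])    # u^{(t)} = [q^{(t)}; y^{(t-1)}]
    # Score each word by combining u^{(t)} with h_l (exact form is an assumption):
    scores = np.array([Wr @ np.concatenate([u_t, h]) for h in H])
    r_t = softmax(scores)                  # soft distribution R^{(t)}
    y_t = r_t @ H                          # y^{(t)} = sum_l r_l^{(t)} h_l
    return r_t, y_t
```

Running this for $t=1,\dots,T$ (with a fresh `Wt`, `bt` per step) yields the sequence of constituent-expression features $y^{(1)},\dots,y^{(T)}$.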
(2) A static graph attention module
A. Graph structure
$G^I=(V,E,X^I)$
$X^I=\{x_k^I\}_{k=1}^{K}$
$x_k^I=[x_k^o;p_k]$
$x_k^I$ is the concatenation of $o_k$'s visual feature $x_k^o$ and $o_k$'s spatial feature $p_k$.
$x_k^o$ is extracted from a pretrained CNN model [24, 20].

$$p_k=W_p[x_{0k};x_{1k};w_k;h_k;w_k h_k]$$

$x_{0k}$ and $x_{1k}$ are the normalized coordinates of the center of object $o_k$; $w_k$ and $h_k$ are the normalized width and height.
Then, as in [28], the relationships between objects are divided into 11 types; this categorization describes the positional relations between objects and forms an important part of the edge information propagated through the graph.
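A small sketch of assembling one node feature $x_k^I$ from the visual and spatial components defined above. The five-element spatial vector follows the formula for $p_k$; `Wp` is a learned projection and the dimensions are illustrative.

```python
import numpy as np

def node_feature(visual_feat, box, Wp):
    """Build the node feature x_k^I = [x_k^o; p_k] for one object.

    visual_feat: x_k^o from a pretrained CNN, shape (dv,)
    box: (x0, x1, w, h) -- normalized center coordinates, width, height
    Wp:  projection for the spatial feature, shape (dp, 5)
    """
    x0, x1, w, h = box
    spatial = np.array([x0, x1, w, h, w * h])   # [x_0k; x_1k; w_k; h_k; w_k h_k]
    p_k = Wp @ spatial                          # p_k = W_p [...]
    return np.concatenate([visual_feat, p_k])   # x_k^I = [x_k^o; p_k]
```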
B. Static Attention
$G^M=(V,E,X^M)$

$$x_k^M=W_m[x_k^I;c_k]+b_m$$

$$c_k=\sum_{l=1}^{L}\alpha_{k,l} f_l$$

Word weights fall into two categories (entity and relation):

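The static attention update above can be sketched as follows. How the weights $\alpha_{k,l}$ are derived from the entity/relation word categories is omitted here, so `alpha` is taken as given (an assumption); shapes are illustrative.

```python
import numpy as np

def static_attention(x_I, F, Wm, bm, alpha):
    """Language-conditioned node update x_k^M = W_m [x_k^I; c_k] + b_m.

    x_I:   node features x_k^I, shape (K, d)
    F:     word embeddings f_l, shape (L, dw)
    alpha: attention weights alpha_{k,l} over words per node, shape (K, L)
    Wm:    weight matrix, shape (dm, d + dw); bm: bias, shape (dm,)
    """
    C = alpha @ F                         # c_k = sum_l alpha_{k,l} f_l
    X = np.concatenate([x_I, C], axis=1)  # [x_k^I; c_k]
    return X @ Wm.T + bm                  # x_k^M
```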
(3) A dynamic graph attention module
A GCN-style feature-aggregation module; features are aggregated once at each reasoning step.

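A rough sketch of one aggregation step of this module, assuming a GCN-style message pass gated by the current constituent-expression feature $y^{(t)}$. The softmax gate and the residual update are assumptions; the paper's exact gating and its handling of the 11 edge types are not reproduced here.

```python
import numpy as np

def propagate(X, A, y_t, Wg):
    """One language-gated aggregation step over the object graph (sketch).

    X:   node features, shape (K, d)
    A:   adjacency matrix of the directed object graph, shape (K, K)
    y_t: constituent-expression feature for step t, shape (d,)
    Wg:  weight matrix, shape (d, d)
    """
    # Weight nodes by their relevance to y^{(t)} (assumed dot-product gate):
    gate = X @ y_t
    gate = np.exp(gate - gate.max())
    gate = gate / gate.sum()
    M = A @ (X * gate[:, None])        # messages weighted by the language gate
    return np.tanh(M @ Wg.T) + X       # residual update of compound objects
```

Repeating this step $T$ times lets each node accumulate higher-order relationships, i.e., represent a compound object.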
(4) A matching module

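The notes leave this section empty. As a placeholder, a matching module can be sketched as a similarity between the final compound-object features and the expression feature; cosine similarity is an assumption here, since the paper learns the matching function (cf. the matching losses in Related Work).

```python
import numpy as np

def matching_scores(X_final, q):
    """Score each object node against the expression feature (cosine
    similarity -- an assumed form, not the paper's learned matcher)."""
    Xn = X_final / np.linalg.norm(X_final, axis=1, keepdims=True)
    qn = q / np.linalg.norm(q)
    return Xn @ qn   # the argmax picks the referred object
```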