Paper notes: Neural Motifs: Scene Graph Parsing with Global Context (CVPR-18)

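(These notes only record some of the paper's content along with my own limited understanding; I have not yet reproduced the experiments. I am new to this area, so please point out any errors or omissions.)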

Summary

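1. This paper addresses scene graph generation. Inspecting the Visual Genome (VG) dataset, the authors observe:
- In more than half of the images, the possible relations between an entity pair depend strongly on the labels of the two entities, while the reverse does not hold (object labels are highly predictive of relation labels but not vice-versa).
- When one object label is present, the range of plausible values for the other object label narrows (in general, the identity of edges involved in a relationship is not highly informative of other elements of the structure while the identities of head or tail provide significant information, both to each other and to edge labels).
The paper calls these regularly appearing substructures in scene graphs motifs, and proposes Stacked Motif Networks, a new architecture designed to capture higher order motifs. The same observations also motivate a simple but powerful baseline: given object detections with labels, predict the most frequent relation between object pairs without visual cues. In the reported experiments this baseline does not hurt performance: it improves over the prior state of the art by an average of 1.4 recall points.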
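To make the frequency baseline concrete, here is a minimal sketch, assuming the training annotations are available as (head label, relation, tail label) triplets. The function names, the toy triplets, and the `no_relation` fallback are my own illustrative choices, not the paper's code.

```python
from collections import Counter, defaultdict

def fit_frequency_baseline(triplets):
    """Count how often each relation occurs for every (head, tail) label pair."""
    counts = defaultdict(Counter)
    for head, relation, tail in triplets:
        counts[(head, tail)][relation] += 1
    return counts

def predict_relation(counts, head, tail, default="no_relation"):
    """Return the most frequent training relation for the label pair.

    No visual features are used; unseen pairs fall back to `default`
    (a hypothetical placeholder label)."""
    if (head, tail) not in counts:
        return default
    return counts[(head, tail)].most_common(1)[0][0]

# Toy training triplets, made up for illustration (not taken from VG).
train = [
    ("man", "wearing", "shirt"),
    ("man", "wearing", "shirt"),
    ("man", "holding", "shirt"),
    ("dog", "on", "grass"),
]
counts = fit_frequency_baseline(train)
print(predict_relation(counts, "man", "shirt"))
print(predict_relation(counts, "shirt", "man"))
```

The lookup is keyed on the ordered pair, so (man, shirt) and (shirt, man) are counted separately, which matches the head/tail asymmetry described above.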

2. Key Challenge: devise an efficient mechanism to encode the global context that can directly inform the local predictors.

Model architecture

The model takes an unannotated image $I$ as input and outputs all visual triplets in the image. The stages are:

stage 1)
Bounding boxes. A Faster R-CNN pretrained on VG predicts the label and bbox of each object in the input image. Its output is $[(b_1,\mathbf{f}_1,\mathbf{l}_1),\dots,(b_n,\mathbf{f}_n,\mathbf{l}_n)]$, where $b_i$ is region $i$, $\mathbf{f}_i$ is the Faster R-CNN feature of that region, and $\mathbf{l}_i$ is the probability distribution vector over object labels for that region.

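stage 2)
Objects. This stage consists of two LSTMs. The first is a biLSTM that re-encodes each stage-1 feature $\mathbf{f}_i$ into a feature $\mathbf{c}_i$ carrying contextualized information. The second is a plain LSTM that predicts the label of each region once context is taken into account, written $\hat{\mathbf{o}}_i$. (The stage-1 label distribution $\mathbf{l}_i$ ignores context, so its label may be inaccurate; $\hat{\mathbf{o}}_i$ uses context and is likely to be more accurate.) The two LSTMs are defined as: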
$$
\begin{array}{llll}
\mathbf{C} &=& \text{biLSTM}([\mathbf{f}_i, \mathbf{W}_1\mathbf{l}_i]_{i=1,\dots,n}) & \text{(1)} \\
\mathbf{h}_i &=& \text{LSTM}([\mathbf{c}_i, \hat{\mathbf{o}}_{i-1}]) & \text{(2)} \\
\hat{\mathbf{o}}_i &=& \arg\max(\mathbf{W}_o\mathbf{h}_i) \in \mathbb{R}^{|\mathcal{C}|} \ (\text{one-hot}) & \text{(3)}
\end{array}
$$
Equation (1) comes from the first LSTM (the biLSTM), with $\mathbf{C} = [\mathbf{c}_1,\dots,\mathbf{c}_n]$ (the biLSTM allows all elements of $B=\{b_1,\dots,b_n\}$ to contribute information about potential object identities); $\mathbf{c}_i$ is the hidden state of that LSTM's final layer. Equations (2) and (3) come from the second LSTM, where $\mathbf{h}_i$ is the hidden state of its final layer and $\hat{\mathbf{o}}_i$ is the predicted label. In short, the two LSTMs re-predict the region labels because of finding 2 in item 1 of the Summary: object labels carry information about one another.
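The decoding loop of equations (2) and (3) can be sketched as follows. This is only an illustration under stated assumptions: a plain tanh recurrent cell stands in for the LSTM to keep the code dependency-free, the "previous label" before the first region is an all-zero vector, and the weights are random. All names (`decode_labels`, `W_in`, `W_rec`, `rand_matrix`) are hypothetical.

```python
import math
import random

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def one_hot(index, size):
    return [1.0 if k == index else 0.0 for k in range(size)]

def decode_labels(C, W_in, W_rec, W_o, num_classes):
    """Greedy label decoding in the spirit of equations (2)-(3).

    Assumption: a tanh recurrent cell replaces the LSTM cell."""
    h = [0.0] * len(W_rec)
    prev = [0.0] * num_classes  # assumed start input: no previous label
    labels = []
    for c in C:
        x = c + prev  # concatenation [c_i, o_hat_{i-1}]
        h = [math.tanh(a + b) for a, b in zip(matvec(W_in, x), matvec(W_rec, h))]
        scores = matvec(W_o, h)  # W_o h_i
        label = max(range(num_classes), key=scores.__getitem__)  # arg max
        labels.append(label)
        prev = one_hot(label, num_classes)  # fed to the next step as one-hot
    return labels

# Random toy weights; all sizes are arbitrary choices for the illustration.
rng = random.Random(0)

def rand_matrix(rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

feat_dim, hidden, num_classes = 4, 6, 5
C = rand_matrix(3, feat_dim)  # three contextualized features c_1..c_3
labels = decode_labels(
    C,
    rand_matrix(hidden, feat_dim + num_classes),
    rand_matrix(hidden, hidden),
    rand_matrix(num_classes, hidden),
    num_classes,
)
print(labels)
```

The point of the sketch is the data flow: each step sees the contextualized feature of the current region together with the one-hot label committed to at the previous step.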

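stage 3)
Relations. This stage consists of a single biLSTM and predicts the relations that may hold between entity pairs. The biLSTM is introduced because of finding 1 in item 1 of the Summary: object labels are informative for relation prediction. The equations for this stage are: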
$$
\begin{array}{llll}
\mathbf{D} &=& \text{biLSTM}([\mathbf{c}_i, \mathbf{W}_2\hat{\mathbf{o}}_i]_{i=1,\dots,n}) & \text{(4)} \\
\mathbf{g}_{i,j} &=& (\mathbf{W}_h \mathbf{d}_i) \circ (\mathbf{W}_t \mathbf{d}_j) \circ \mathbf{f}_{i,j} & \text{(5)} \\
\text{Pr}(x_{i \to j} \mid B,O) &=& \text{softmax}(\mathbf{W}_r \mathbf{g}_{i,j} + \mathbf{w}_{o_i,o_j}) & \text{(6)}
\end{array}
$$
where $\mathbf{g}_{i,j} \in \mathbb{R}^{4096}$, $\text{Pr}(\cdot) \in \mathbb{R}^{|P|}$, $O$ is the object label set, and $\mathbf{f}_{i,j}$ is the feature of the union of the regions $b_i, b_j$ of entities $i$ and $j$. It is computed as follows: ① the image region covered by the union is fed to the Faster R-CNN to extract a visual feature $\mathbf{f}_{i,j}^1$, which is resized to $\mathbb{R}^{7\times7\times256}$; ② the spatial masks of $i$ and $j$ are each resized to $\text{Mask} \in \mathbb{R}^{14\times14}$ and stacked into a $14\times14\times2$ tensor, which passes through two convolution layers to give feature maps $\mathbf{f}_{i,j}^2 \in \mathbb{R}^{7\times7\times256}$; ③ $\mathbf{f}_{i,j}^1 + \mathbf{f}_{i,j}^2$ is fed to the FC layers of a pretrained VGG to obtain $\mathbf{f}_{i,j} \in \mathbb{R}^{4096}$.
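Two pieces of this stage are easy to illustrate without a deep learning library: the two-channel box mask of step ② and the edge scoring of equations (5)-(6). The sketch below rests on assumptions of mine: the mask grid is laid over the union box, a cell is marked when its centre lies inside the box, and `head` and `tail` are taken to be the already projected vectors $\mathbf{W}_h\mathbf{d}_i$ and $\mathbf{W}_t\mathbf{d}_j$. Function names and toy numbers are hypothetical, and the convolution layers and VGG FC layers are not modelled.

```python
import math

def union_box(a, b):
    """Smallest box (x1, y1, x2, y2) containing both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))

def box_mask(box, frame, size=14):
    """Rasterize `box` onto a size x size grid laid over `frame`.

    Assumption: a cell is 1.0 when its centre falls inside the box."""
    fx1, fy1, fx2, fy2 = frame
    cell_w, cell_h = (fx2 - fx1) / size, (fy2 - fy1) / size
    mask = []
    for row in range(size):
        cy = fy1 + (row + 0.5) * cell_h
        line = []
        for col in range(size):
            cx = fx1 + (col + 0.5) * cell_w
            inside = box[0] <= cx <= box[2] and box[1] <= cy <= box[3]
            line.append(1.0 if inside else 0.0)
        mask.append(line)
    return mask

def pair_masks(a, b, size=14):
    """Stack both masks into a size x size x 2 input, as in step 2."""
    frame = union_box(a, b)
    mask_a, mask_b = box_mask(a, frame, size), box_mask(b, frame, size)
    return [[[mask_a[r][c], mask_b[r][c]] for c in range(size)] for r in range(size)]

def edge_distribution(head, tail, union_feat, W_r, bias):
    """Equations (5)-(6): elementwise product, linear map, pair bias, softmax."""
    g = [h * t * f for h, t, f in zip(head, tail, union_feat)]
    logits = [sum(w * v for w, v in zip(row, g)) + b for row, b in zip(W_r, bias)]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Two side-by-side toy boxes and a two-predicate toy scorer.
box_a, box_b = (0, 0, 14, 14), (14, 0, 28, 14)
masks = pair_masks(box_a, box_b)
probs = edge_distribution([1.0, 2.0], [3.0, 4.0], [0.5, 0.5],
                          [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
print(masks[0][0], masks[0][13], probs)
```

The `bias` argument plays the role of $\mathbf{w}_{o_i,o_j}$: a vector over predicates selected by the predicted head and tail labels, which is where the label-pair statistics enter the model.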

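Note that all $\mathbf{W}_{\ast}$ and $\mathbf{w}_{\ast}$ in the stages above are learned model parameters. In addition, to mitigate vanishing gradients, the paper replaces the LSTM with Alternating Highway LSTMs, which amounts to wrapping the LSTM in one extra layer: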
$$
\begin{array}{llll}
\mathbf{r}_i &=& \sigma(\mathbf{W}_g[\mathbf{h}_{i-\delta},\mathbf{x}_i] + \mathbf{b}_g) & \text{(7)} \\
\mathbf{h}_i &=& \mathbf{r}_i \circ \text{LSTM}(\mathbf{x}_i,\mathbf{h}_{i-\delta}) + (1 - \mathbf{r}_i) \circ \mathbf{W}_i \mathbf{x}_i & \text{(8)}
\end{array}
$$
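where $\delta$ is the direction, which alternates with the layer index $\ell$ (the symbol $\ell$ is introduced here for clarity): $\delta = \begin{cases}1, & \ell \bmod 2 = 0 \\ -1, & \text{otherwise} \end{cases}$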
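A minimal sketch of the highway wrapper in equations (7)-(8) for a single position, assuming the inner LSTM output is computed elsewhere and passed in as `cell_out`, and assuming the direction flips with the parity of the layer index. The names `highway_step` and `direction` and the toy numbers are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def direction(layer):
    """Assumed rule: delta = 1 on even layers, -1 on odd layers."""
    return 1 if layer % 2 == 0 else -1

def highway_step(x, h_prev, cell_out, W_g, b_g, W_i):
    """Equations (7)-(8) for one position.

    `cell_out` stands for LSTM(x_i, h_{i-delta}); `h_prev` is h_{i-delta}."""
    gate_in = h_prev + x  # concatenation [h_{i-delta}, x_i]
    r = [sigmoid(sum(w * v for w, v in zip(row, gate_in)) + b)
         for row, b in zip(W_g, b_g)]
    proj = [sum(w * v for w, v in zip(row, x)) for row in W_i]  # W_i x_i
    # The gate r mixes the LSTM output with the projected input.
    return [ri * c + (1.0 - ri) * p for ri, c, p in zip(r, cell_out, proj)]

# With zero gate weights the gate is 0.5 everywhere, so the result is the
# plain average of the LSTM output and the projected input.
out = highway_step(
    x=[1.0, 2.0],
    h_prev=[0.0, 0.0],
    cell_out=[0.2, -0.4],
    W_g=[[0.0] * 4, [0.0] * 4],
    b_g=[0.0, 0.0],
    W_i=[[1.0, 0.0], [0.0, 1.0]],
)
print(out, direction(0), direction(1))
```

When the gate saturates towards 1 the layer behaves like the wrapped LSTM, and towards 0 it passes a linear projection of the input straight through, which is the shortcut that helps gradients flow.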
Loss function: the sum of the cross entropy for predicates and the cross entropy for objects predicted by the object context layer
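As a small illustration of this loss, the sketch below adds the two cross-entropy terms for already normalized probability vectors. Whether the terms are summed or averaged over regions and pairs is not stated in these notes, so the plain sum used here is an assumption, and the function names are hypothetical.

```python
import math

def cross_entropy(probs, target):
    """Negative log-probability assigned to the ground-truth class."""
    return -math.log(probs[target])

def total_loss(object_probs, object_targets, predicate_probs, predicate_targets):
    """Object-label term plus predicate term (assumed: plain sums)."""
    obj = sum(cross_entropy(p, t) for p, t in zip(object_probs, object_targets))
    pred = sum(cross_entropy(p, t) for p, t in zip(predicate_probs, predicate_targets))
    return obj + pred

# One region with two candidate labels, one pair with two candidate predicates.
loss = total_loss([[0.5, 0.5]], [0], [[0.25, 0.75]], [1])
print(loss)
```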

Experiments

Evaluation settings

| Task | Input | Output |
| --- | --- | --- |
| predicate classification (PREDCLS) | image, object bboxes & labels, head & tail | edge labels |
| scene graph classification (SGCLS) | image, object bboxes | box labels & edge labels |
| scene graph detection (SGDET) | image | boxes & box labels & edge labels |

Results
