Contents

  • Introduction
    • Scene Graph
    • ERNIE-ViL
  • Model Architecture
  • Scene Graph Prediction (SGP)
  • Experiments
    • Training ERNIE-ViL
    • Downstream Tasks
    • Results
  • References

Introduction

Scene Graph

  • Scene graphs contain structured knowledge of visual scenes, including the objects present, the attributes of objects, and the relationships (actions or relative positions) between objects (an example is sketched in code below).
  • As beneficial prior knowledge describing the detailed semantics of images and captions, scene graphs have enabled many state-of-the-art models in image captioning, image retrieval, VQA, and image generation.
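As a toy illustration (a hypothetical Python sketch; the container and field names are invented here, not from any scene-graph library), a parsed scene graph for the caption "a man standing on a white boat" could be represented as:

```python
from dataclasses import dataclass, field

@dataclass
class SceneGraph:
    """Toy scene-graph container: objects, (object, attribute) pairs,
    and (subject, relationship, object) triples."""
    objects: set = field(default_factory=set)
    attributes: list = field(default_factory=list)   # e.g. ("boat", "white")
    relations: list = field(default_factory=list)    # e.g. ("man", "standing on", "boat")

# "a man standing on a white boat" parses to roughly:
g = SceneGraph(
    objects={"man", "boat"},
    attributes=[("boat", "white")],
    relations=[("man", "standing on", "boat")],
)
print(g.relations)  # [('man', 'standing on', 'boat')]
```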

ERNIE-ViL

  • Existing vision-language pre-training models all try to learn joint vision-language representations through image/text-level tasks (e.g., Masked Language Modeling, Masked Region Prediction, and Image-Text Matching; these tasks mainly target the semantic alignment between sub-words and image regions). However, they neglect the importance of building detailed semantic alignments across vision and language: they pay no special attention to the words describing the fine-grained semantics of an image (e.g., objects ("man", "boat"), attributes of objects ("boat is white"), relationships between objects ("man standing on boat")), i.e., they do not distinguish common words from words describing the detailed semantics, even though these fine-grained semantics are crucial for understanding the meaning of an image.
    Therefore, to obtain better joint vision-language representations, a model should pay more attention to detailed semantic words and strengthen its ability to build detailed semantic alignments across the modalities.
  • In NLP, the ERNIE model learns more structured knowledge by masking phrases and named entities rather than individual sub-words. Inspired by this, ERNIE-ViL incorporates the fine-grained semantic information contained in scene graphs (objects, attributes, relationships) so that the model learns better cross-modal representations. Concretely, ERNIE-ViL constructs three Scene Graph Prediction pre-training tasks (i.e., Object Prediction, Attribute Prediction, and Relationship Prediction), which mask and predict the different types of nodes (Object/Attribute/Relationship) in the scene graph. This pushes the model to focus on understanding detailed semantic words rather than common words, to model the connections between detailed semantics across modalities, and ultimately to learn fine-grained cross-modal semantic alignment.

Model Architecture

  • Sentence Embedding: Text is processed exactly as in BERT: the input is tokenized with WordPiece, and the special tokens [CLS] and [SEP] are added. Each token embedding is the sum of its word embedding, segment embedding, and sequence position embedding. The text input is $\{[CLS], w_1, \dots, w_T, [SEP]\}$.
  • Image Embedding: A pre-trained Faster R-CNN extracts salient image regions (detection confidence threshold 0.2, keeping 10 to 36 RoIs), and the mean-pooled convolutional features before the RoI's multi-class classification layer serve as the region features. From each region's position and area, a 5-d vector $\left(\frac{x_1}{W}, \frac{y_1}{H}, \frac{x_2}{W}, \frac{y_2}{H}, \frac{(y_2-y_1)(x_2-x_1)}{WH}\right)$ is computed and projected to obtain the location features (see the sketch after this list). Summing the region features and location features gives the region visual features. Besides the RoIs, ERNIE-ViL also feeds the model the region visual features of the entire image as [IMG], so the image input is $\{[IMG], v_1, \dots, v_I\}$.
  • Vision-Language Encoder: Like ViLBERT, ERNIE-ViL adopts a two-stream cross-modal Transformer architecture: two Transformer encoders encode the image and text inputs separately, and cross-modal Transformer blocks then fuse the two modalities. The model outputs one embedding for each input image/text embedding; $h_{[IMG]}$ and $h_{[CLS]}$ can be regarded as the holistic features of the image and the text extracted by the model.
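To make the location features concrete, here is a minimal NumPy sketch (illustrative code for the formula above, not from the ERNIE-ViL release) that computes the 5-d vector from pixel coordinates; in the model, this vector is then linearly projected to the dimension of the region features and added to them:

```python
import numpy as np

def location_feature(box, img_w, img_h):
    """5-d location vector for an RoI: normalized corner coordinates
    plus the fraction of the image area the box covers.
    `box` is (x1, y1, x2, y2) in pixels."""
    x1, y1, x2, y2 = box
    return np.array([
        x1 / img_w,
        y1 / img_h,
        x2 / img_w,
        y2 / img_h,
        (y2 - y1) * (x2 - x1) / (img_w * img_h),
    ])

# A 100x50 box in the top-left corner of a 640x480 image:
print(location_feature((0, 0, 100, 50), 640, 480))
```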

Scene Graph Prediction (SGP)

  • Scene graph parsing: A given sentence $w$ is parsed into a scene graph $G(w) = \langle O(w), E(w), K(w) \rangle$, where $O(w)$ is the set of objects mentioned in the sentence, $E(w) \subseteq O(w) \times R(w) \times O(w)$ is the set of edges of the scene graph, representing the relationships between objects, $R(w)$ is the set of relationship nodes between object nodes, $K(w) \subseteq O(w) \times A(w)$ represents the attributes attached to objects, and $A(w)$ is the set of attribute nodes associated with object nodes. ERNIE-ViL uses the Scene Graph Parser provided by Anderson et al. (2016).
  • Object Prediction: Predicting the objects forces the model to build vision-language connections at the object level. Concretely, ERNIE-ViL randomly selects 30% of the object nodes in the scene graph for prediction; each selected object node is replaced with [MASK] with probability 80%, replaced with a random token with probability 10%, and kept unchanged with probability 10%. Note that an object actually corresponds to a sub-sequence of the sentence (i.e., several tokens), so masking an object node amounts to masking the corresponding sub-sequence (see the masking sketch after this list). The Object Prediction objective is to minimize

    $$\mathcal{L}_{obj}(\theta) = -E_{(w,v) \sim D}\, \log P\left(w_{o_t} \mid w_{\backslash w_{o_t}}, v\right)$$

    where $w_{o_t}$ denotes the masked object tokens, $w_{\backslash w_{o_t}}$ the unmasked tokens, and $v$ the visual information
  • Attribute Prediction: The masking scheme is the same as in Object Prediction, except that the masked targets become attribute nodes from $A(w)$. The Attribute Prediction objective is to minimize

    $$\mathcal{L}_{attr}(\theta) = -E_{(w,v) \sim D}\, \log P\left(w_{a_t} \mid w_{o_t}, w_{\backslash w_{a_t}}, v\right)$$

    where $w_{a_t}$ denotes the masked attribute tokens and $w_{o_t}$ the object tokens of the object that attribute $w_{a_t}$ describes
  • Relationship Prediction: The masking scheme again follows Object Prediction, except that the masked targets become relationship nodes from $R(w)$. The Relationship Prediction objective is to minimize

    $$\mathcal{L}_{rel}(\theta) = -E_{(w,v) \sim D}\, \log P\left(w_{r_t} \mid w_{o_{t_1}}, w_{o_{t_2}}, w_{\backslash w_{r_t}}, v\right)$$

    where $w_{r_t}$ denotes the masked relationship tokens and $w_{o_{t_1}}, w_{o_{t_2}}$ the object tokens of the two objects it connects
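To make the node-masking scheme concrete, here is a hedged Python sketch (illustrative only; the function name, the toy vocabulary, and the assumption that node token spans are already known are inventions of this note, not the released ERNIE-ViL code):

```python
import random

MASK = "[MASK]"

def mask_node_tokens(tokens, node_spans, select_prob=0.30, vocab=None):
    """Select scene-graph nodes of one type (object/attribute/relationship)
    with probability `select_prob` and apply the 80/10/10 replacement rule
    to each selected node's token span. Returns the corrupted tokens and a
    map from masked positions to the original tokens to predict."""
    vocab = vocab or ["cat", "car", "red", "on"]  # toy vocabulary
    tokens = list(tokens)
    targets = {}
    for start, end in node_spans:
        if random.random() >= select_prob:
            continue
        r = random.random()
        for i in range(start, end):
            targets[i] = tokens[i]
            if r < 0.8:
                tokens[i] = MASK                   # 80%: replace with [MASK]
            elif r < 0.9:
                tokens[i] = random.choice(vocab)   # 10%: random token
            # remaining 10%: keep the token unchanged
    return tokens, targets

tokens = "the cat is on top of the car".split()
# "on top of" (token indices 3-5) is a relationship node:
print(mask_node_tokens(tokens, [(3, 6)], select_prob=1.0))
```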

Why does Scene Graph Prediction work? For example, when the relationship phrase "on top of" is masked, the model may, based on the language context alone, predict that the missing word is "under" or "into". These words are grammatically fluent in the sentence but inconsistent with the scene "the cat is on top of the car". Through training on the Relationship Prediction task, the model obtains the spatial relation of the corresponding objects ("car", "cat") from the image and can thus accurately predict that the missing phrase is "on top of". By constructing the Scene Graph Prediction tasks, ERNIE-ViL learns cross-modal detailed semantic alignments.

  • Pre-training with Scene Graph Prediction: Besides the three Scene Graph Prediction tasks above, ERNIE-ViL also uses Masked Language Modeling (MLM), Masked Region Prediction, and Image-Text Matching as additional pre-training tasks. The losses of all pre-training tasks are summed to form the final loss (according to the paper, all of these tasks are trained jointly; note that Image-Text Matching requires randomly sampling sentences to form negative image-text pairs, and the token- and region-level prediction tasks are not applied to these negative samples; see the sketch below).
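A hedged sketch of how the joint objective might be assembled (the paper states the tasks are trained together, but the exact batching and weighting are assumptions here; the loss names are illustrative):

```python
import torch

def total_pretraining_loss(batch, losses):
    """Sum the per-task losses. `losses` maps a task name to a per-sample
    loss tensor; `batch["is_match"]` is 1 for positive image-text pairs and
    0 for the sampled negatives. Token/region prediction losses are only
    counted on positive pairs, as described above."""
    pos = batch["is_match"].float()
    loss = losses["itm"].mean()  # image-text matching sees all pairs
    for task in ("mlm", "mrp", "obj", "attr", "rel"):
        loss = loss + (losses[task] * pos).sum() / pos.sum().clamp(min=1)
    return loss

# Toy usage with random per-sample losses for a batch of 4 pairs:
batch = {"is_match": torch.tensor([1, 0, 1, 1])}
losses = {k: torch.rand(4) for k in ("itm", "mlm", "mrp", "obj", "attr", "rel")}
print(total_pretraining_loss(batch, losses))
```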

Notably, a model trained with SGP is more accurate at predicting masked detailed semantic words than a model trained without SGP, which suggests that SGP indeed helps the model learn finer-grained semantic alignments.

Experiments

Training ERNIE-ViL

  • Pre-training Data: ERNIE-ViL uses the Conceptual Captions (CC) dataset and the SBU Captions (SBU) dataset for pre-training; CC contains 3.0 million image-caption pairs and SBU contains 0.8 million image-caption pairs. Both datasets were collected directly from the web and differ from the datasets used by the downstream tasks (out-of-domain datasets), which ensures a fair evaluation of the model's downstream performance.
  • Implementation Details: ERNIE-ViL comes in two model sizes, ERNIE-ViL-base and ERNIE-ViL-large, both initialized with the parameters of ERNIE 2.0 (see the "Experiments" section of the paper for further details).

Downstream Tasks

  • The paper evaluates ERNIE-ViL on five downstream tasks:

    • (1) Visual Question Answering (VQA 2.0): The VQA task requires answering natural language questions according to images. VQA 2.0 dataset contains 204k images and 1.1M questions about these images.

      • We treat VQA as a multi-label classification task, assigning a soft target score to each answer based on its relevancy to the 10 human answer responses. We take the dot product of the final hidden states $h_{[CLS]}$ and $h_{[IMG]}$ and map this representation into 3,129 possible answers with an additional two-layer MLP (see the sketch after this list).
    • (2) Visual Commonsense Reasoning (VCR): The VCR task contains two sub-tasks: visual question answering (Q→A) and answer justification (QA→R), both multiple-choice problems. The holistic setting (Q→AR) requires both the chosen answer and the chosen rationale to be correct.
      • In the visual question answering (Q→A) task, we concatenate the question and each candidate answer for the language modality. We take the dot product of the final hidden states $h_{[CLS]}$ and $h_{[IMG]}$ to predict the matching score with an additional FC layer.
      • For the answer justification (QA→R) task, we concatenate the question, the answer and each candidate rationale as the input of the text stream.
    • (3) Grounding Referring Expressions (RefCOCO+): The referring expression task is to localize an image region given a natural language reference. We evaluate the task on the RefCOCO+ dataset. Bounding box proposals provided by MAttNet (Yu et al. 2018) are utilized.
      • The representation of each region is its final hidden state $h_{v_i}$ with an additional FC layer. A region $i$ is labelled as positive only when the IoU between it and the ground-truth box exceeds 0.5.
    • (4, 5) Image Retrieval & Text Retrieval (Flickr30K): Caption-based image retrieval is the task of identifying an image from a pool based on a caption describing its content. Flickr30K contains 31,000 images with 5 captions each. Adopting the same split as ViLBERT, we use 1,000 images for validation, 1,000 for testing, and the rest for training.
      • We take the dot product of the final hidden states $h_{[CLS]}$ and $h_{[IMG]}$ to predict the matching score $s(w, v)$ for each image-text pair with an additional FC layer. We use a circle loss with 20 random negative samples for each image-text pair.
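The fine-tuning heads above share one pattern: fuse $h_{[CLS]}$ and $h_{[IMG]}$ and classify the result. A hedged sketch follows (module and dimension names are invented here; following ViLBERT, the "dot product" is read as an element-wise product, which keeps a vector for the MLP to map into 3,129 VQA answers):

```python
import torch
import torch.nn as nn

class PairHead(nn.Module):
    """Fuse h_[CLS] and h_[IMG] with an element-wise product and map the
    fused vector to the task's outputs (3,129 answers for VQA; num_out=1
    with a single FC layer would give a VCR/retrieval matching score)."""
    def __init__(self, hidden=768, num_out=3129):
        super().__init__()
        self.mlp = nn.Sequential(          # the "two-layer MLP" for VQA
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_out),
        )

    def forward(self, h_cls, h_img):
        return self.mlp(h_cls * h_img)     # fused text-image representation

head = PairHead()
h_cls, h_img = torch.randn(2, 768), torch.randn(2, 768)
print(head(h_cls, h_img).shape)  # torch.Size([2, 3129])
```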

Results

  • When pre-trained only on out-of-domain datasets, ERNIE-ViL achieves the best results on all tasks. To further compare with models pre-trained on both in-domain and out-of-domain datasets, the authors also pre-trained ERNIE-ViL on MS-COCO and Visual Genome, after which it reaches SOTA on all downstream tasks.

References

  • Fei Yu, Jiji Tang, Weichong Yin, Yu Sun, Hao Tian, Hua Wu, Haifeng Wang. ERNIE-ViL: Knowledge Enhanced Vision-Language Representations Through Scene Graph. AAAI 2021.
