Paper reading: "Context-Transformer: Tackling Object Confusion for Few-Shot Detection"

Background & Motivation

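When data is scarce, transfer learning is the mainstream approach. After transferring to the target domain, the most common error is misclassification. The box regressor learned on the source domain is class-agnostic, whereas the classifier is class-specific; with so little data its training struggles to converge, which leads to misclassification. Many earlier few-shot learning methods did not address this misclassification problem.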

Context modeling (also referred to here as background modeling) has long been a challenge in object detection.

The main reason is that, objects may have various locations, scales, aspect ratios, and classes. It is often difficult to model such complex instance-level relations by manual design.

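Many earlier few-shot learning methods are motivated by the idea that the human visual system can remember an object after seeing it once, and ask how a model could gain the same ability. This motivation is itself not very sound. First, regarding this human ability, some argue that the human visual system is a learning system of enormous scale: when we see an object, we have in effect already imagined it in countless scenes, and that amount of data is beyond what any dataset or data augmentation method can match. Second, today's deep learning models are data-driven, and their accuracy scales with the data they are given. Expecting a model to detect objects without feeding it data is unrealistic.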

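This paper draws a different motivation from the human visual system: when recognizing objects, humans build contextual associations, that is, they find clues linking an object to its surroundings (called contextual fields below) to complete recognition. This motivation is clearly more convincing than the previous one.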

Source Detection Transfer

The base detector is SSD. One reason is that its multi-scale receptive fields provide richer context; the other is its simple design.

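The detection task is split into BBOX, BG (background), and OBJ (object, which can be understood as multi-class classification over object types). The first two can be transferred to the target domain directly by fine-tuning. For the last one, if the source-domain OBJ at the top of the model were simply replaced with a randomly initialized target-domain OBJ, the combination of too many parameters and too little data would make it very hard to train to convergence.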

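The paper instead keeps the source-domain OBJ and adds a target-domain OBJ on top of it (which in effect operates on the prior boxes after refinement). The paper explains that the output dimension of the source-domain OBJ is much smaller than the number of feature channels of the convolutional layers, so the new classifier has far fewer parameters, which helps avoid overfitting.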

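Simply transferring from the source domain to the target domain can hardly solve the misclassification problem in the target domain. The paper addresses it by taking the context around each object into account.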

For example, a few images may be discriminative enough to distinguish horse from dog, when we find that these images contain important contents such as a person sits on this animal, the scene is about wild grassland, etc.

Context-Transformer

Context-Transformer consists of two sub-modules: affinity discovery and context aggregation.

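Prior boxes can be understood as the bboxes output when the pretrained model runs detection on a target image. Affinity discovery first builds a set of contextual fields from the prior boxes, then uses the features learned on the source domain to obtain the relations between the prior boxes and the contextual fields. Once the affinity between the two is found, a Transformer-style attention computation integrates the context into the representation of each prior box.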

Affinity Discovery

Source-Domain Object Knowledge of Prior Boxes

This step feeds the target image into the SSD trained on the source domain and obtains the score tensor of the output bboxes.

We would like to emphasize that, the score of source-domain classifier often provides rich semantic knowledge about target-domain object categories.

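The score tensor gives, for the prior box at position (h, w) with the m-th aspect ratio at the k-th scale, the probability that its object belongs to each of the Cs source-domain classes.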

Cs is the number of object classes in the source-domain data; the bboxes here are also called prior boxes.
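To make the shapes concrete, here is a minimal NumPy sketch of per-scale score tensors being flattened into one matrix with a row per prior box. The grid sizes, the number of aspect ratios per scale, and the random values standing in for real SSD outputs are all assumptions for illustration, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

C_s = 60  # number of source-domain classes
# Hypothetical (H_k, W_k, M_k) per scale: grid height, grid width, aspect ratios.
scales = [(8, 8, 4), (4, 4, 6), (2, 2, 4)]

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# One score tensor per scale; P_k[h, w, m, :] is a distribution over C_s classes.
# Random logits stand in for the output of a source-domain SSD.
P_per_scale = [softmax(rng.normal(size=(H, W, M, C_s))) for H, W, M in scales]

# Flatten every scale and stack: one row per prior box.
P = np.concatenate([p.reshape(-1, C_s) for p in P_per_scale], axis=0)
D_p = P.shape[0]  # total number of prior boxes

print(P.shape)  # (368, 60), since 8*8*4 + 4*4*6 + 2*2*4 = 368
```

The row count D_p is the total number of prior boxes over all scales, locations, and aspect ratios.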

Contextual Field Construction via Pooling

Given these prior boxes, we want their contextual fields. The most direct choice is to use all the anchor boxes in SSD, but there are so many of them that learning becomes difficult.

Alternatively, humans often check sparse contextual fields, instead of paying attention to every tiny detail in an image.

The paper instead applies spatial pooling to the prior boxes, max pooling in this case, to obtain the contextual fields.
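A rough sketch of the pooling step, assuming non-overlapping square windows over the spatial grid. The window size and the helper name `spatial_max_pool` are hypothetical; the paper's actual kernel sizes are not given in these notes.

```python
import numpy as np

def spatial_max_pool(P_k, size):
    """Max-pool a score tensor of shape (H, W, M, C) over non-overlapping
    size x size spatial windows. H and W must be divisible by size."""
    H, W, M, C = P_k.shape
    if H % size or W % size:
        raise ValueError("H and W must be divisible by the window size")
    # Split each spatial axis into (number of windows, position inside window).
    blocks = P_k.reshape(H // size, size, W // size, size, M, C)
    # Reduce over the two "position inside window" axes.
    return blocks.max(axis=(1, 3))

rng = np.random.default_rng(0)
P_k = rng.random((8, 8, 4, 60))      # hypothetical score tensor at one scale
Q_k = spatial_max_pool(P_k, size=2)  # contextual fields at that scale

print(Q_k.shape)  # (4, 4, 4, 60)
```

Pooling shrinks only the spatial grid, so each contextual field keeps a score per aspect ratio and per source class while the number of fields is much smaller than the number of prior boxes.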

Affinity Discovery

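This step finds the relations between prior boxes and contextual fields. P and Q are reshaped to sizes Dp × Cs and Dq × Cs, where Dp and Dq are the total numbers of prior boxes and contextual fields, and each is passed through a fully connected layer with shared parameters (f and g in the formula) to obtain its embedding. P and Q are then compared with a dot product, which yields the affinity matrix. Note that a dot product is an unnormalized similarity; it equals cosine similarity only when the embeddings are L2-normalized.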

Here, the i-th row of the affinity matrix holds the importance scores of all contextual fields for the i-th prior box.
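The affinity computation can be sketched as a single matrix product. This is a minimal illustration: the name A for the affinity matrix, the sizes, and the plain linear maps used for f and g (random weights, no bias, no residual connection) are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
D_p, D_q, C_s = 368, 92, 60  # hypothetical sizes

P = rng.random((D_p, C_s))   # prior-box scores, one row per prior box
Q = rng.random((D_q, C_s))   # contextual-field scores, one row per field

# f and g as plain linear layers with output dimension C_s (assumed form).
W_f = rng.normal(scale=0.1, size=(C_s, C_s))
W_g = rng.normal(scale=0.1, size=(C_s, C_s))

def f(X):
    return X @ W_f

def g(X):
    return X @ W_g

# Affinity matrix: A[i, j] is the dot product between the embedding of
# prior box i and the embedding of contextual field j.
A = f(P) @ g(Q).T

print(A.shape)  # (368, 92)
```

Row `A[i]` is then the vector of unnormalized importance scores of all contextual fields for prior box i.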

To sum up, affinity discovery allows a prior box to identify its important contextual fields automatically from various aspect ratios, locations and spatial scales.

Such diversified relations provide discriminative clues to reduce object confusion caused by annotation scarcity.

This resembles self-attention in the Transformer: a matrix product gives each box an attention score over all the other boxes.

Context Aggregation

The importance scores of the i-th prior box are used to compute a weighted contextual vector for it.

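h is a fully connected layer; whether it is the same as f and g is unclear. It is added to make learning more flexible. The weighted contextual vectors L are then aggregated onto the original score tensor P to obtain the context-aware score tensor of the prior boxes, which amounts to one round of refinement of the prior boxes.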

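φ is also a fully connected layer. At this point P carries information from its interaction with the context: each row (one prior box) aggregates contextual fields across different scales and aspect ratios.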
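The aggregation step can be sketched as attention-style weighting followed by a residual update. Several things here are assumptions: the name A for the affinity matrix, the row-wise softmax used to normalize the importance scores (these notes do not state the normalization), the tiny sizes, and the plain linear maps standing in for h and φ.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
D_p, D_q, C_s = 6, 4, 5              # tiny hypothetical sizes

P = rng.random((D_p, C_s))           # prior-box scores
Q = rng.random((D_q, C_s))           # contextual-field scores
A = rng.normal(size=(D_p, D_q))      # stand-in affinity matrix

W_h = rng.normal(scale=0.1, size=(C_s, C_s))
W_phi = rng.normal(scale=0.1, size=(C_s, C_s))

def h(X):
    return X @ W_h

def phi(X):
    return X @ W_phi

weights = softmax(A, axis=1)  # per prior box, weights over contextual fields
L = weights @ h(Q)            # weighted contextual vector per prior box
P_hat = P + phi(L)            # context-aware scores, same shape as P

print(L.shape, P_hat.shape)   # (6, 5) (6, 5)
```

Because the update is additive, a prior box whose context adds little keeps scores close to the original P.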

All of the fully connected layers above are residual-style FC layers.

Target OBJ

The context-aware P is then fed into the target-domain OBJ to complete the final classification.
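Putting the steps of this section together, the computation can be written roughly as follows. The original formulas are not reproduced in these notes, so this is a reconstruction from the descriptions: the symbol A for the affinity matrix, the row-wise softmax, the hat notation, and Θ for the weights of the target-domain OBJ are assumed notation rather than quotes from the paper.

```latex
\[
\begin{aligned}
A &= f(P)\, g(Q)^{\top}, \qquad A \in \mathbb{R}^{D_p \times D_q} \\
L(i,:) &= \operatorname{softmax}\big(A(i,:)\big)\, h(Q) \\
\hat{P} &= P + \varphi(L) \\
\hat{Y} &= \operatorname{softmax}\big(\hat{P}\,\Theta\big)
\end{aligned}
\]
```

Here P is Dp × Cs, Q is Dq × Cs, and the last line is the target-domain classification over the context-aware scores.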

Experiments

Context-Transformer effectively reduces misclassification.

Few-Shot Object Detection

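The paper reports an ablation study of the individual modules, an ablation over the implementation choices for each sub-module of Context-Transformer, and the detection accuracy (tables not reproduced here).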

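Sixty COCO classes serve as the source domain and the 20 VOC classes as the target domain, so the accuracy reported here is measured on VOC. Non-local in the accuracy table is a self-attention method, and its results show that applying self-attention directly does not work.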

Moreover, as the number of shots increases, the gap to the Baseline narrows.

This matches our insight that, Context-Transformer is preferable to distinguish object confusion caused by low data diversity.

It also brings gains on the generic detection task.

Incremental Few-Shot Object Detection

Specifically, we add a residual-style FC layer on top of the pretrained source-domain OBJ, which allows us to construct a new source-domain OBJ that can be more compatible with target-domain OBJ.

Then, we concatenate new source and target OBJs to classify objects in both domains.
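A minimal sketch of the two quoted steps, assuming the residual-style FC layer has the form x + xW and that the concatenated scores are normalized with a softmax. Both the layer form and the normalization are assumptions, as are the sizes and the random stand-in scores; background handling is left out.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
N, C_s, C_t = 6, 5, 3  # boxes, source classes, target classes (hypothetical)

source_scores = rng.normal(size=(N, C_s))  # pretrained source-domain OBJ output
target_scores = rng.normal(size=(N, C_t))  # target-domain OBJ output

# Residual-style FC layer on top of the source-domain OBJ (assumed form).
W_res = rng.normal(scale=0.1, size=(C_s, C_s))
new_source_scores = source_scores + source_scores @ W_res

# Concatenate the new source and target OBJs and classify over both domains.
logits = np.concatenate([new_source_scores, target_scores], axis=1)
probs = softmax(logits, axis=1)

print(probs.shape)  # (6, 8)
```

With W_res at zero the new source OBJ reduces to the pretrained one, which is one way to see why a residual form would be a gentle way to adapt it.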

The results show that accuracy on the source domain does not drop either.

Conclusion

As the paper puts it:

To our best knowledge, Context-Transformer is the first work to investigate context for few-shot object detection.

Modeling the background context is a new line of thinking for this problem.

It seems that what self-attention in the Transformer does is what metric learning sets out to do. Combining the two looks feasible, but a lot of work remains.