[Multimodal] 9. GLIP | The first work to recast object detection as phrase grounding

Contents

I. Background
II. Method
- 2.1 Unifying object detection and phrase grounding
- 2.2 Language-aware deep fusion
- 2.3 Pre-training with semantically rich data
III. Results
- 3.1 Transfer to established benchmarks
- 3.2 Zero-shot and supervised transfer on COCO
- 3.3 Zero-shot transfer on LVIS
- 3.4 Phrase grounding on Flickr30K Entities
- 3.5 Analysis
- 3.6 Object detection in the wild
- 3.7 Comparison of all results
IV. Code
- 4.1 Environment setup
- 4.2 Downloading the pre-trained models
- 4.3 Pre-training
- 4.4 (Zero-Shot) Evaluation
- 4.5 Fine-tuning/Prompt tuning
- 4.6 The Object Detection in the Wild Benchmark
- 4.7 Inference visualization
- 4.8 RPN network structure

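Paper: Grounded Language-Image Pre-training

Code: https://github.com/microsoft/GLIP

Affiliation: Microsoft | University of Washington

Date: 2022.06

Highlights:

- GLIP is pre-trained on 27M grounding data (3M human-annotated and 24M web-crawled image-text pairs).
- The pre-trained model shows strong zero-shot and few-shot transfer. Evaluated directly on COCO and LVIS, without training on them, it reaches 49.8 AP and 26.9 AP; after fine-tuning it reaches 60.8 AP on COCO val and 61.5 AP on COCO test-dev.
- On 13 downstream object detection tasks, a 1-shot GLIP rivals a fully supervised Dynamic Head.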

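I. Background

Visual recognition models are usually trained to predict a fixed set of predefined categories, which limits their usefulness in real-world scenarios.

CLIP showed that image-level visual representations can be learned effectively from large numbers of image-text pairs.

Because the paired text covers a very broad set of visual concepts, the pre-trained CLIP model is semantically rich and transfers well to downstream zero-shot image classification and image-text retrieval.

Finer-grained image understanding, however, also depends on object-level tasks such as object detection, segmentation, pose estimation, and scene understanding.

Key points of the paper:

- It shows that phrase grounding, the task of associating phrases in a sentence with objects/regions in an image, is an effective pre-training task for learning object-level, language-aware visual representations.
- It proposes Grounded Language-Image Pre-training (GLIP), which unifies phrase grounding and object detection in a single framework.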

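II. Method

The main ideas of GLIP are as follows.

1. Reformulating object detection as phrase grounding

How the task is reformulated:

- The detector's input changes from an image alone to an image plus a text prompt, where the prompt lists all candidate categories of the detection task.
- Example: for COCO detection the text prompt is a single string containing 80 phrases separated by '.', as shown on the left of Figure 2.
- Any object detection model can be converted into a grounding model by replacing the object classification logits in its box classifier with word-region alignment scores, for example the dot product between region (or box) visual features and token (or phrase) language features, as shown on the right of Figure 2.
- The language features are computed by a language model. Unlike CLIP, which fuses vision and language only at the final dot-product layer, GLIP fuses the two modalities deeply (middle of Figure 2), which helps learn higher-quality language-aware visual representations and leads to better transfer.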
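The prompt construction described above can be sketched in a few lines. This is a minimal illustration, not the repository's code: the helper name `build_detection_prompt` is hypothetical, and the ". " separator is taken from the COCO prompt shown later in this post. Recording the character span of each phrase is what later makes it possible to map categories to token positions.

```python
def build_detection_prompt(class_names, separator=". "):
    """Join category names into one prompt and record each phrase's character span.

    Hypothetical helper for illustration only.
    """
    parts, spans, cursor = [], [], 0
    for index, name in enumerate(class_names):
        if index > 0:
            parts.append(separator)
            cursor += len(separator)
        spans.append((cursor, cursor + len(name)))  # [start, end) of this phrase
        parts.append(name)
        cursor += len(name)
    return "".join(parts), spans


prompt, spans = build_detection_prompt(["person", "bicycle", "traffic light"])
print(prompt)                            # person. bicycle. traffic light
print(prompt[spans[2][0]:spans[2][1]])   # traffic light
```

A multi-word category such as "traffic light" stays one phrase with one span, even though it will later occupy more than one token position.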

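What unifying detection and grounding buys us: both kinds of data can be used together, and each task benefits the other.

- For detection, grounding data enriches the set of visual concepts.
- For grounding, detection data contributes more bounding-box annotations.

2. Scaling up visual concepts with massive image-text data

Given a good grounding model (the teacher), grounding boxes can be generated automatically for large amounts of image-text pairs, with the phrases detected by an NLP parser.

The teacher can localize concepts that are ambiguous or abstract, and the resulting data carries rich semantic information.

The student GLIP-Large model (GLIP-L) can therefore be pre-trained on 27M grounding data.

The 27M grounding data consists of:

- 3M finely human-annotated data
- 24M web-crawled image-text pairs, with 78.1M high-confidence (> 0.5) phrase-box pseudo annotations covering 58.1M unique phrases
- Examples are shown in Figure 3

3. Transfer learning with GLIP: one model for all

When GLIP-L is evaluated directly on COCO and LVIS (without having been trained on them), it reaches 49.8 AP on COCO val 2017 and 26.9 AP on LVIS val, surpassing many supervised baselines.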

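2.1 Unifying object detection and phrase grounding

1. Conventional object detection

The image is fed into a backbone (CNN or Transformer) to extract basic features (bottom of Figure 2), and each candidate region is then passed to a classification head and a localization head to predict its category and position. The training loss is the sum of a classification loss and a localization loss:

$$L = L_{cls} + L_{loc} \quad (1)$$

- In two-stage detectors, an RPN first separates foreground from background; the authors conceptually fold the RPN loss into $L_{loc}$.
- In one-stage detectors, $L_{loc}$ may likewise include terms such as a centerness loss.

The box classifier $C$ can be a simple linear layer, and the classification loss $L_{cls}$ can be written as:

$$O = Enc_I(Img), \quad S_{cls} = O W^T, \quad L_{cls} = loss(S_{cls}; T) \quad (2)$$

- $O \in R^{N \times d}$: the object/region/box features
- $W \in R^{c \times d}$: the weight matrix of the box classifier $C$
- $S_{cls} \in R^{N \times c}$: the output classification logits
- $T \in \{0, 1\}^{N \times c}$: the target matching between regions and classes
- $loss(S; T)$: a cross-entropy loss in two-stage detectors and a focal loss in one-stage detectors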

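2. Reformulating object detection as phrase grounding

- Instead of classifying each region/box into one of $c$ classes, the grounding task aligns (grounds) each region to the $c$ phrases in a text prompt, as shown in Figure 2.
- Designing the text prompt for a detection task: all categories to detect are concatenated into a prompt of the following form, in which every category name is a candidate phrase to be grounded:

Prompt = "Detect: person, bicycle, car, ..., toothbrush"

- In the grounding model, the authors compute the alignment scores $S_{ground}$ between image regions and the words in the prompt:

$$O = Enc_I(Img), \quad P = Enc_L(Prompt), \quad S_{ground} = O P^T \quad (3)$$

- $P \in R^{M \times d}$: the contextual word/token features from the language encoder
- $Enc_I$: the image encoder
- $Enc_L$: the language encoder
- Training: replace the classification logits $S_{cls}$ in Equation 2 with $S_{ground}$, then minimize the loss defined by Equations 1 and 2.

Note that after this replacement Equation 2 has $S_{ground} \in R^{N \times M}$ but $T \in \{0, 1\}^{N \times c}$, so the shapes no longer match. The number of word tokens $M$ is generally larger than the number of phrases $c$, for four reasons:

- Some phrases contain multiple words (e.g. traffic light).
- Some single-word phrases are split into multiple sub-word tokens (e.g. toothbrush is split into tooth and brush).
- Some tokens are added tokens (e.g. "Detect:" and ",").
- A [NoObj] token is appended to the end of the tokenized sequence.

So when the loss is a binary sigmoid loss, $T \in \{0, 1\}^{N \times c}$ is expanded to $T' \in \{0, 1\}^{N \times M}$ such that:

- when a phrase is a positive match, all of its sub-word tokens are positive matches;
- the added tokens are negative matches for all image features.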
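The target expansion can be made concrete with a small sketch. The function name `expand_targets` and the toy token layout are assumptions for illustration; the rule it implements is the one stated above: a positive phrase marks all of its token positions as positive, and added tokens stay negative for every region.

```python
def expand_targets(T, phrase_to_tokens, num_tokens):
    """Expand region-to-phrase matches (N x c) into region-to-token matches (N x M).

    phrase_to_tokens maps a phrase index to the token positions it occupies.
    Hypothetical helper for illustration only.
    """
    expanded = []
    for row in T:
        token_row = [0] * num_tokens
        for phrase_index, matched in enumerate(row):
            if matched:
                for token_index in phrase_to_tokens[phrase_index]:
                    token_row[token_index] = 1
        expanded.append(token_row)
    return expanded


# Toy prompt "person. traffic light" tokenized as
# [CLS] person . traffic light [SEP]  ->  M = 6 token positions
phrase_to_tokens = {0: [1], 1: [3, 4]}
T = [[1, 0],   # region 0 matches "person"
     [0, 1],   # region 1 matches "traffic light"
     [0, 0]]   # region 2 is background
T_expanded = expand_targets(T, phrase_to_tokens, num_tokens=6)
for row in T_expanded:
    print(row)
```

The columns for [CLS], '.', and [SEP] are zero in every row, which is exactly the "added tokens are negative for all features" rule.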

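2.2 Language-aware deep fusion

In Equation 3, the image and the text are encoded by separate encoders and are fused only at the very end to compute the alignment scores. Such models are called late-fusion models.

In the vision-language literature, deep fusion is known to be important and to improve phrase grounding models.

GLIP therefore also fuses the image and language encoders deeply: the image and text features are fused in the last few encoder layers, as shown in the middle of Figure 2.

With DyHead [10] as the image encoder and BERT as the text encoder, the deep-fused encoder is:

$$O_{t2i}^i, P_{i2t}^i = \text{X-MHA}(O^i, P^i), \quad i \in \{0, 1, \dots, L-1\} \quad (4)$$

$$O^{i+1} = \text{DyHeadModule}(O^i + O_{t2i}^i), \quad O = O^L \quad (5)$$

$$P^{i+1} = \text{BERTLayer}(P^i + P_{i2t}^i), \quad P = P^L \quad (6)$$

- $L$: the number of DyHeadModules in DyHead
- BERTLayer: a newly added BERT layer on top of the pre-trained BERT
- $O^0$: the visual features output by the vision backbone
- $P^0$: the token features output by the language backbone
- X-MHA: the cross-modality multi-head attention module that performs the cross-modal interaction. $O_{t2i}^i$ is the token-to-image result and $P_{i2t}^i$ is the image-to-token result. Without X-MHA, the model degenerates to a late-fusion model.

In X-MHA, each head computes the context vectors of one modality by attending to the other modality.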
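To make the bidirectional attention concrete, here is a dependency-free sketch of a single X-MHA head. The formulation follows my reading of the GLIP paper (shared attention logits $O^{(q)} (P^{(q)})^T / \sqrt{d}$, softmax over tokens for the token-to-image direction and over regions for the image-to-token direction); treat the exact form as an assumption. The function names and the weight dictionary layout are hypothetical, and the real implementation is multi-head and tensor-based.

```python
import math


def matmul(A, B):
    """Plain-Python matrix product of two list-of-lists matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def softmax_rows(A):
    """Numerically stable softmax applied to each row."""
    out = []
    for row in A:
        peak = max(row)
        exps = [math.exp(x - peak) for x in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def xmha_single_head(O, P, W):
    """O: N x d_img region features, P: M x d_txt token features.

    W holds the (assumed) projections: q_img, q_txt, v_img, v_txt, out_img, out_txt.
    Returns (O_t2i, P_i2t), the context vectors added to O and P before the
    next DyHeadModule / BERTLayer.
    """
    d = len(W["q_img"][0])
    O_q = matmul(O, W["q_img"])
    P_q = matmul(P, W["q_txt"])
    scale = math.sqrt(d)
    attn = [[x / scale for x in row] for row in matmul(O_q, transpose(P_q))]  # N x M
    P_v = matmul(P, W["v_txt"])
    O_v = matmul(O, W["v_img"])
    O_t2i = matmul(matmul(softmax_rows(attn), P_v), W["out_img"])             # N x d_img
    P_i2t = matmul(matmul(softmax_rows(transpose(attn)), O_v), W["out_txt"])  # M x d_txt
    return O_t2i, P_i2t


identity = [[1.0, 0.0], [0.0, 1.0]]
weights = {name: identity for name in
           ("q_img", "q_txt", "v_img", "v_txt", "out_img", "out_txt")}
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # N = 3 regions
tokens = [[1.0, 0.0], [0.0, 1.0]]                # M = 2 tokens
O_t2i, P_i2t = xmha_single_head(regions, tokens, weights)
print(len(O_t2i), len(O_t2i[0]), len(P_i2t), len(P_i2t[0]))
```

Both directions reuse the same attention logits, only transposed, which is why the two outputs keep the shapes of their own modality.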

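2.3 Pre-training with semantically rich data

Human annotation is slow and expensive, so many works scale up data with self-training: a teacher model generates pseudo boxes that are used to train a student model. The generated labels, however, are limited to the teacher's concept pool, so the student can only learn that predefined set of concepts.

GLIP can be trained on detection and grounding data at the same time, and grounding data supplies rich semantics that help localization:

- First, good grounding data covers a far larger vocabulary than existing detection datasets. Even the largest extended detection datasets contain no more than about 2,000 categories, whereas grounding data scales much further: Flickr30K [44] contains 44,518 unique phrases and VG Caption [28] contains 110,689 unique phrases, far more than the number of categories in any detection dataset.
- Second, semantic richness can be increased by scaling up grounding data rather than detection data. Inspired by self-training, the authors first pre-train a teacher GLIP on human-annotated detection and grounding data, then use the teacher to predict boxes for image-text data crawled from the web, and finally train a student model on both the annotated data and the generated pseudo labels. As Figure 3 shows, the teacher produces accurate boxes for semantically rich descriptions.

Why the student may outperform the teacher:

- As illustrated in Figure 3, if the annotated data lacks certain concepts, such as vaccine or turquoise, the teacher may not recognize them directly. The rich language context, however, gives the teacher strong guidance and lets it make an educated guess: if it can localize "a small vial", it can probably localize the vaccine; if it can localize "caribbean sea", it can probably localize turquoise.
- When the student is trained, these educated guesses become supervised signals, so the student learns the concepts vaccine and turquoise.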

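III. Results

Limit on the input prompt length:

- BERT accepts at most 512 tokens; to reduce compute, the maximum prompt length is capped at 256 here.
- When there are too many categories to fit in one prompt, the authors split them into multiple prompts at both training and test time, and show that this costs only a small amount of accuracy.
- As shown in Table 2, DyHead-T pre-trained on Objects365 reaches 43.6 AP zero-shot on COCO, while GLIP-T(A), i.e. DyHead implemented in the grounding formulation, reaches 42.9 AP, a small gap.

For LVIS, which has over 1,000 categories, the authors put 40 categories in each prompt and evaluate each group of categories with its own prompt.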
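Splitting a large category list into prompts of 40 categories each is simple chunking. The sketch below is only an illustration: `split_into_prompts` is a hypothetical helper, the placeholder names stand in for real LVIS categories, and the count of 1203 is the LVIS v1 category count, which the text above only describes as "over 1,000".

```python
def split_into_prompts(class_names, per_prompt=40, separator=". "):
    """Chunk the category list and build one prompt string per chunk."""
    return [separator.join(class_names[start:start + per_prompt])
            for start in range(0, len(class_names), per_prompt)]


# Placeholder names; 1203 is assumed to be the number of LVIS v1 categories.
names = [f"class_{i}" for i in range(1203)]
prompts = split_into_prompts(names)
print(len(prompts))               # 31 prompts: 30 full ones plus a final short one
print(prompts[-1])                # class_1200. class_1201. class_1202
```

Each image is then run once per prompt, and the detections for each group of categories are taken from the run that used their prompt.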

3.1 Transfer to established benchmarks

A pre-trained GLIP can be applied directly to both grounding and detection tasks. The authors evaluate on three benchmarks:

- MS-COCO object detection, with 80 categories
- LVIS object detection, with over 1,000 categories
- Flickr30K phrase grounding

Five GLIP variants are trained, as listed in Table 1:

- GLIP-T(A): based on the SOTA detector Dynamic Head, with the classification loss replaced by the word-region alignment loss; Swin-Tiny backbone; pre-trained on Objects365 (0.66M data, 365 categories).
- GLIP-T(B): adds language-aware deep fusion; pre-trained on Objects365 only.
- GLIP-T(C): pre-trained on Objects365 and GoldG, where GoldG is 0.8M human-annotated grounding data (including Flickr30K, VG Caption [28], and GQA [19]).
- GLIP-T: Swin-Tiny backbone; pre-trained on Objects365, GoldG, and Cap4M (4M image-text pairs collected from the web, with boxes generated by GLIP-T(C)).
- GLIP-L: Swin-Large backbone; pre-trained on FourODs (2.66M data from four detection datasets: Objects365, OpenImages [27], Visual Genome (excluding COCO images) [28], and ImageNetBoxes [29]), GoldG, and CC12M+SBU (24M image-text data collected from the web with generated boxes).

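3.2 Zero-shot and supervised transfer on COCO

DyHead baseline: for comparison, DyHead is first trained on Objects365 (whose categories largely cover the 80 COCO categories), and at inference time only the 80 COCO categories are predicted.

As shown in Table 2:

- GLIP performs well in both the zero-shot and the supervised setting.
- GLIP-T reaches 46.7 AP zero-shot, surpassing Faster RCNN.
- GLIP-L reaches 49.8 AP zero-shot, surpassing DyHead-T. In the supervised setting, GLIP-T beats DyHead by 5.5 AP (55.2 vs. 49.7).

Three observations on zero-shot performance:

- Objects365 and COCO are close in domain, so a model pre-trained on Objects365 already does well on COCO, reaching 43.6 AP zero-shot.
- Directly recasting the detector as a grounding model costs some accuracy (GLIP-T(A)), but deep fusion brings a 2 AP improvement (GLIP-T(B)).
- The largest gain comes from the gold grounding data: GLIP-T(C) reaches the best zero-shot result of 46.7 AP. Adding image-text data brings little further improvement on COCO (GLIP-T vs. GLIP-T(C)).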

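3.3 Zero-shot transfer on LVIS

GLIP performs well across all categories.

3.4 Phrase grounding on Flickr30K Entities

3.5 Analysis

The ablation over the pre-training data of GLIP-T supports three findings:

- Finding 1: adding detection data improves model performance.
- Finding 2: grounding data brings richer semantic information to the model.
- Finding 3: grounding data helps the detection of both common and rare categories.

3.6 Object detection in the wild

To test how well GLIP transfers to real-world tasks, the authors assemble an Object Detection in the Wild (ODinW) benchmark from 13 open datasets on Roboflow. Each dataset has a different focus: EgoHands requires localizing human hands, for instance, while Pothole requires detecting potholes in roads.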

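Can GLIP transfer to tasks in different domains? This is examined from two angles:

- the data perspective
- the fine-tuning perspective

1. The data perspective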

The models compared are the five GLIP variants listed in Section 3.1: GLIP-T(A), GLIP-T(B), GLIP-T(C), GLIP-T, and GLIP-L.

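X-shot means X training samples per category.

In the zero-shot comparison, grounding data improves performance on novel concepts such as Pothole and EgoHands: the models without grounding data (A and B) perform poorly, while model C, trained with grounding data, does better.

2. The fine-tuning perspective: extending one model to all tasks

As foundation models keep growing, reducing deployment cost becomes important.

In language modeling, image classification, and object detection, a common approach is to keep one pre-trained model and adapt it to each task by changing only a small number of task-specific parameters,

for example linear probing [26], prompt tuning [27], or efficient task adapters [13].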

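1. Manual prompt tuning

GLIP lets the user add task-specific text to the prompt to steer the model.

In Figure 6, the model on the left fails to recognize stingrays, but once attribute descriptions such as "flat and round" are added to the prompt it localizes them, and AP50 rises from 4.6 to 9.7.

2. Prompt tuning

For object detection with linear probing, GLIP fine-tunes only two components:

- the box head
- the projection layer between the region and prompt embeddings

Thanks to language-aware deep fusion, GLIP also supports a more effective and efficient transfer strategy, prompt tuning:

- Each detection task has a single prompt in GLIP (see Figure 7).
- The prompt embedding $P^0$ is first obtained from the language backbone.
- The language backbone is then discarded, and only $P^0$ is fine-tuned as the task-specific input.

Results for the three settings (linear probing, prompt tuning, and full-model tuning) are shown in Figure 7.

3.7 Comparison of all results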

IV. Code

4.1 Environment setup

Pull the Docker image and run the code inside a container.

# Pull the Docker image
docker pull pengchuanzhang/pytorch:ubuntu20.04_torch1.9-cuda11.3-nccl2.9.9
# Start the container
sudo docker run --runtime=nvidia -it -d --name "xxx" -w /home/ -v /home:/home -e NVIDIA_VISIBLE_DEVICES=all --shm-size 128g pengchuanzhang/pytorch:ubuntu20.04_torch1.9-cuda11.3-nccl2.9.9 bash
# Install the dependencies
pip install einops shapely timm yacs tensorboardX ftfy prettytable pymongo -i http://pypi.douban.com/simple/ --trusted-host pypi.douban.com
pip install transformers -i http://pypi.douban.com/simple/ --trusted-host pypi.douban.com
python setup.py build develop --user

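4.2 Downloading the pre-trained models

1. GLIP pre-trained checkpoints

https://huggingface.co/harold/GLIP/tree/main

2. The pre-trained language model (BERT)

GLIP is a multimodal model and needs a pre-trained language model. The code downloads it from Hugging Face through the transformers library, but that download often fails. In that case, download the model manually from the Hugging Face website and place it at the expected path.

I found a mirror hosted on Alibaba Cloud that downloads quickly.

① Create a pretrained_models folder under the glip folder, then run the following commands in it to fetch the model weights, config, and vocabulary: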

mkdir bert-base-uncased
wget -P bert-base-uncased https://thunlp.oss-cn-qingdao.aliyuncs.com/opennre/pretrain/bert-base-uncased/config.json
wget -P bert-base-uncased https://thunlp.oss-cn-qingdao.aliyuncs.com/opennre/pretrain/bert-base-uncased/pytorch_model.bin
wget -P bert-base-uncased https://thunlp.oss-cn-qingdao.aliyuncs.com/opennre/pretrain/bert-base-uncased/vocab.txt

② In glip/maskrcnn_benchmark/modeling/language_backbone/bert_model.py, make the following change. The commented-out lines are the original code, and the lines with the rewritten path are the ones that run.

Note: some blog posts suggest passing mirror='https://mirrors.tuna.tsinghua.edu.cn/hugging-face-models' to from_pretrained to use the Tsinghua mirror. This has not worked since 2021, so the safest approach is to download the model manually and point the code at the local path.

# Imports this excerpt relies on
from torch import nn
from transformers import BertConfig, BertModel, RobertaConfig, RobertaModel


class BertEncoder(nn.Module):
    def __init__(self, cfg):
        super(BertEncoder, self).__init__()
        self.cfg = cfg
        self.bert_name = cfg.MODEL.LANGUAGE_BACKBONE.MODEL_TYPE
        print("LANGUAGE BACKBONE USE GRADIENT CHECKPOINTING: ", self.cfg.MODEL.LANGUAGE_BACKBONE.USE_CHECKPOINT)

        # Original code: loads the model from the Hugging Face hub by name
        # if self.bert_name == "bert-base-uncased":
        #     config = BertConfig.from_pretrained(self.bert_name)
        #     config.gradient_checkpointing = self.cfg.MODEL.LANGUAGE_BACKBONE.USE_CHECKPOINT
        #     self.model = BertModel.from_pretrained(self.bert_name, add_pooling_layer=False, config=config)
        #     self.language_dim = 768

        # Modified code: loads the manually downloaded model from a local path
        if self.bert_name == "bert-base-uncased":
            config = BertConfig.from_pretrained('/glip/pretrained_models/bert-base-uncased')
            config.gradient_checkpointing = self.cfg.MODEL.LANGUAGE_BACKBONE.USE_CHECKPOINT
            self.model = BertModel.from_pretrained('/glip/pretrained_models/bert-base-uncased', add_pooling_layer=False, config=config)
            self.language_dim = 768
        elif self.bert_name == "roberta-base":
            config = RobertaConfig.from_pretrained(self.bert_name)
            config.gradient_checkpointing = self.cfg.MODEL.LANGUAGE_BACKBONE.USE_CHECKPOINT
            self.model = RobertaModel.from_pretrained(self.bert_name, add_pooling_layer=False, config=config)
            self.language_dim = 768
        else:
            raise NotImplementedError

        self.num_layers = cfg.MODEL.LANGUAGE_BACKBONE.N_LAYERS

4.3 Pre-training

4.4 (Zero-Shot) Evaluation

1. COCO Evaluation

Step 1: organize the COCO data as described in https://github.com/microsoft/GLIP/blob/main/DATA.md

That is, place the COCO dataset under the DATASET folder following the layout given there.

Step 2: write the YAML task config for COCO, configs/coco/val.yaml:

MODEL:
  ATSS:
    NUM_CLASSES: 80 # these fields are not used; just a placeholder
  FCOS:
    NUM_CLASSES: 80
  ROI_BOX_HEAD:
    NUM_CLASSES: 80
  DYHEAD:
    NUM_CLASSES: 80
DATASETS:
  REGISTER:
    coco_train:
      img_dir: "coco/train2017"
      ann_file: "coco/annotations/instances_train2017.json"
    coco_val:
      img_dir: "coco/val2017/"
      ann_file: "coco/annotations/instances_val2017.json"
  TRAIN: ("coco_train",)
  TEST: ("coco_val",)
INPUT:
  MIN_SIZE_TRAIN: 800
  MAX_SIZE_TRAIN: 1333
  MIN_SIZE_TEST: 800
  MAX_SIZE_TEST: 1333
DATALOADER:
  SIZE_DIVISIBILITY: 32
  ASPECT_RATIO_GROUPING: False
TEST:
  IMS_PER_BATCH: 8

Run the evaluation with the following command:

python tools/test_grounding_net.py \
--config-file configs/pretrain/glip_Swin_T_O365_GoldG.yaml \
--task_config configs/coco/val.yaml \
--weight MODEL/glip_tiny_model_o365_goldg_cc_sbu.pth \
TEST.IMS_PER_BATCH 1 \
MODEL.DYHEAD.SCORE_AGG "MEAN" \
TEST.EVAL_TASK detection \
MODEL.DYHEAD.FUSE_CONFIG.MLM_LOSS False \
OUTPUT_DIR {output_dir}

This gives the following result: mAP = 46.6 (Grounding DINO reaches 48.5).

 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.46638
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.63107
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.50778
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.33602
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.50633
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.59646
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.37554
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.62210
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.65833
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.49174
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.71016
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.81746
2023-07-10 09:52:15,449 maskrcnn_benchmark.inference INFO: OrderedDict([('bbox', OrderedDict([('AP', 0.4663779092127794), ('AP50', 0.6310650844926528), ('AP75', 0.50778317378933), ('APs', 0.33601965830665764), ('APm', 0.5063337770432529), ('APl', 0.5964644305214388)]))])

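The inference/evaluation flow is as follows; the inference code starts at engine/inference.py (line 382):

Step 1: this is a detection evaluation task, so all_queries and all_positive_map_label_to_token are generated first.
- all_queries: the text input built from the category names. The queries are tokenized later and passed through the language model to extract features.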
  
  ['person. bicycle. car. motorcycle. airplane. bus. train. truck. boat. traffic light. fire hydrant. stop sign. parking meter. bench. bird. cat. dog. horse. sheep. cow. elephant. bear. zebra. giraffe. backpack. umbrella. handbag. tie. suitcase. frisbee. skis. snowboard. sports ball. kite. baseball bat. baseball glove. skateboard. surfboard. tennis racket. bottle. wine glass. cup. fork. knife. spoon. bowl. banana. apple. sandwich. orange. broccoli. carrot. hot dog. pizza. donut. cake. chair. couch. potted plant. bed. dining table. toilet. tv. laptop. mouse. remote. keyboard. cell phone. microwave. oven. toaster. sink. refrigerator. book. clock. vase. scissors. teddy bear. hair drier. toothbrush']
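- all_positive_map_label_to_token: maps each category id to the positions of its tokens in the tokenized input_ids. For example, 1: [1] means that category 1 (person) occupies position 1 (position 0 is [CLS]). The first few categories map to positions 1, 3, 5, 7, 9, ...; positions 2, 4, 6, 8 do not appear because they hold the '.' separators. Some categories span two or more positions, e.g. 10: [19, 20] for traffic light, which consists of two words.
  
  This is also the positive_map used at test time.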
  
  [defaultdict(<class 'list'>, {1: [1], 2: [3], 3: [5], 4: [7], 5: [9], 6: [11], 7: [13], 8: [15], 9: [17], 10: [19, 20], 11: [22, 23, 24], 12: [26, 27], 13: [29, 30], 14: [32], 15: [34], 16: [36], 17: [38], 18: [40], 19: [42], 20: [44], 21: [46], 22: [48], 23: [50], 24: [52, 53, 54], 25: [56], 26: [58], 27: [60, 61], 28: [63], 29: [65], 30: [67, 68, 69], 31: [71, 72], 32: [74, 75], 33: [77, 78], 34: [80], 35: [82, 83], 36: [85, 86], 37: [88, 89], 38: [91, 92], 39: [94, 95, 96], 40: [98], 41: [100, 101], 42: [103], 43: [105], 44: [107], 45: [109], 46: [111], 47: [113], 48: [115], 49: [117], 50: [119], 51: [121, 122, 123], 52: [125], 53: [127, 128], 54: [130], 55: [132, 133], 56: [135], 57: [137], 58: [139], 59: [141, 142, 143], 60: [145], 61: [147, 148], 62: [150], 63: [152], 64: [154], 65: [156], 66: [158], 67: [160], 68: [162, 163], 69: [165], 70: [167], 71: [169, 170], 72: [172], 73: [174], 74: [176], 75: [178], 76: [180], 77: [182], 78: [184, 185], 79: [187, 188, 189], 80: [191, 192]})]
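Step 2: enter detector/generalized_vl_rcnn.py (line 175) and run the model's forward pass.
- Text: the caption built from the category names is tokenized (as in training) and fed to the language backbone to extract text features of shape [1, 256, 768].
- Image: the image is fed to the Swin Transformer to extract multi-level features.
- The text features and the multi-level image features are then fused with attention, giving the fused features.
- The fused text features are normalized and projected down to [1, 256, 256], and are then used to compute similarities with the fused image features.
- The cls, box_reg, and centerness outputs are computed from the fused image features.
Step 3: convert the cls, bbox_reg, centerness, and anchor information from the grounding format to the object detection format:
- Generate anchors in the same way as in training.
- Use cls, box_reg, centerness, anchors, positive_map, and dot_product_logits to select the output boxes, i.e. run the ATSS post-processing (inference.py line 592).
- Apply a sigmoid to cls and dot_product_logits, then convert from the grounding format to the COCO format (inference.py line 771). Here dot_product_logits has shape [1, 16400, 256], i.e. the score of every anchor against every token position of the prompt.
- Next, for each category, look up its token positions in positive_map and take the corresponding columns of dot_product_logits. For the first category, person, the positions are [1], so column 1 is taken (one value per anchor). If a category maps to several columns, e.g. 78: [184, 185], those columns are combined by mean or max, depending on the configured aggregation. The result is a [1, 16400] score of every anchor for that category.
- Doing this for all 80 categories gives a score tensor of shape [1, 16400, 80], which replaces dot_product_logits as the per-anchor classification score.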
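The grounding-to-detection score conversion in Step 3 can be sketched without any framework. This is an illustration under assumptions, not the repository's implementation: `token_scores_to_class_scores` is a hypothetical name, plain lists stand in for tensors, and the "MEAN"/"MAX" modes mirror the MODEL.DYHEAD.SCORE_AGG option used in the evaluation command.

```python
def token_scores_to_class_scores(token_scores, positive_map, mode="MEAN"):
    """Convert per-token scores into per-class scores.

    token_scores: one list of per-token sigmoid scores for each anchor.
    positive_map: {class_id: [token positions]}, as in
                  all_positive_map_label_to_token.
    Returns one list of class scores per anchor, ordered by class id.
    """
    class_ids = sorted(positive_map)
    class_scores = []
    for anchor_scores in token_scores:
        row = []
        for class_id in class_ids:
            values = [anchor_scores[t] for t in positive_map[class_id]]
            if mode == "MAX":
                row.append(max(values))
            else:  # "MEAN"
                row.append(sum(values) / len(values))
        class_scores.append(row)
    return class_scores


# Toy example: 1 anchor, 6 token positions, two categories.
positive_map = {1: [1], 10: [3, 4]}          # person -> [1], traffic light -> [3, 4]
token_scores = [[0.0, 0.9, 0.0, 0.2, 0.4, 0.0]]
mean_scores = token_scores_to_class_scores(token_scores, positive_map, "MEAN")
max_scores = token_scores_to_class_scores(token_scores, positive_map, "MAX")
print(mean_scores, max_scores)
```

With 80 COCO categories and 256 token positions, the same loop turns a [num_anchors, 256] score matrix into the [num_anchors, 80] matrix described above.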
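Step 4: score-threshold filtering. Each level is processed separately, keeping at most the top 1000 boxes whose score exceeds the threshold of 0.05.
- The classification score of each box (the score derived from dot_product_logits) is multiplied by its centerness score, which suppresses boxes far from an object center.
- Anchors with a score above self.pre_nms_thresh (0.05) are kept, and of those only the top 1000.
- The inputs to this filtering are the scores derived from dot_product_logits ([16400, 80]), the box regression outputs ([16400, 4]), the pre-NMS limit (top 1000), and the anchors (16400).
- What is kept is the anchor index and the predicted category of each box that scores above 0.05 and ranks in the top 1000, so each level contributes at most 1000 predictions.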
Step 5: decode the kept boxes.
- The predicted offsets are decoded against the anchors, which gives the refined anchors, i.e. the predicted boxes. The decoding works as follows:
  - Compute the center, width, and height of each anchor.
  - Read the predicted offsets (the box regression output is the offset relative to the anchor); the width and height offsets are passed through torch.exp().
  - predicted center x = anchor center x + predicted x offset * anchor width
  - predicted center y = anchor center y + predicted y offset * anchor height
  - predicted width = anchor width * exp(predicted width offset)
  - predicted height = anchor height * exp(predicted height offset)
  - Remove boxes smaller than min_size.
- After every level has been decoded, bbox_list looks like this: [(BoxList(num_boxes=1000, image_width=1201, image_height=800, mode=xyxy), BoxList(num_boxes=877, image_width=1201, image_height=800, mode=xyxy), BoxList(num_boxes=313, image_width=1201, image_height=800, mode=xyxy), BoxList(num_boxes=0, image_width=1201, image_height=800, mode=xyxy), BoxList(num_boxes=0, image_width=1201, image_height=800, mode=xyxy))]
- The per-level results are then concatenated: [BoxList(num_boxes=2190, image_width=1201, image_height=800, mode=xyxy)]
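The decoding formulas in Step 5 translate directly into code. The sketch below is a simplified illustration: `decode_box` is a hypothetical name, and it omits details a real box coder typically has (per-coordinate weights, clamping of the width/height offsets before the exponential, and the min_size filter).

```python
import math


def decode_box(anchor, deltas):
    """Apply predicted offsets (dx, dy, dw, dh) to an anchor in xyxy format."""
    ax1, ay1, ax2, ay2 = anchor
    dx, dy, dw, dh = deltas
    width, height = ax2 - ax1, ay2 - ay1
    center_x, center_y = ax1 + 0.5 * width, ay1 + 0.5 * height

    pred_cx = center_x + dx * width      # shift the center by a fraction of the width
    pred_cy = center_y + dy * height     # shift the center by a fraction of the height
    pred_w = width * math.exp(dw)        # scale the width
    pred_h = height * math.exp(dh)       # scale the height

    return (pred_cx - 0.5 * pred_w, pred_cy - 0.5 * pred_h,
            pred_cx + 0.5 * pred_w, pred_cy + 0.5 * pred_h)


unchanged = decode_box((0.0, 0.0, 10.0, 10.0), (0.0, 0.0, 0.0, 0.0))
shifted = decode_box((0.0, 0.0, 10.0, 10.0), (0.1, 0.0, 0.0, 0.0))
widened = decode_box((0.0, 0.0, 10.0, 10.0), (0.0, 0.0, math.log(2.0), 0.0))
print(unchanged, shifted, widened)
```

Zero offsets return the anchor itself, an x offset of 0.1 moves the box by one tenth of the anchor width, and a width offset of ln 2 doubles the width around the same center.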
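Step 6: run NMS over all kept boxes, without separating levels (inference.py line 750), with an NMS threshold of 0.6. NMS leaves BoxList(num_boxes=334, image_width=1201, image_height=800, mode=xyxy), i.e. 334 boxes. Because at most 100 detections are allowed per image, only the top 100 predictions are kept.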
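Step 6 can be illustrated with a small greedy NMS. This is a sketch under assumptions rather than the repository's routine: it is class-agnostic for brevity, the function names are hypothetical, and only the 0.6 IoU threshold and the limit of 100 detections come from the text above.

```python
def iou(a, b):
    """Intersection over union of two boxes in xyxy format."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def nms(boxes, scores, iou_threshold=0.6, max_detections=100):
    """Greedy NMS: visit boxes by descending score, drop those overlapping a kept box."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep[:max_detections]


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
print(kept)   # [0, 2]: the second box overlaps the first with IoU of about 0.68
```

The final slice plays the role of the per-image detection limit: even if several hundred boxes survive NMS, only the highest-scoring 100 are returned.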

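2. LVIS Evaluation

3. ODinW / Custom Dataset Evaluation

4. Flickr30K Evaluation

4.5 Fine-tuning/Prompt tuning

The fine-tuning flow is as follows.

Before going through the code, a few key questions need answering.

1. The input: [img, text, target]

- img: the image
- text: the string built from the names of all categories to detect. Taking COCO as an example, the COCO annotations have to be converted into a form the grounding task can use, i.e. the labels go through an object-detection-to-grounding conversion. The code that performs it is CocoGrounding() in modulated_coco.py.

The yaml file sets SEPARATION_TOKENS: ". ", so categories are separated by ". ".

All of this happens inside make_data_loader, so it is easy to miss.

The details are in make_data_loader in data/build.py. The full set of arguments is: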

{'few_shot': 3, 'shuffle_seed': 3, 'random_sample_negative': 85, 'disable_shuffle': True, 'control_probabilities': (0.0, 0.0, 0.5, 0.0), 'separation_tokens': '. ', 'caption_min_box': 1, 'caption_conf': 0.9, 'caption_nms': 0.9, 'safeguard_positive_caption': True, 'local_debug': False, 'no_mask_for_od': False, 'no_mask_for_gold': False, 'mlm_obj_for_only_positive': False, 'caption_format_version': 'v1', 'special_safeguard_for_coco_grounding': False, 'diver_box_for_vqa': False, 'caption_prompt': None, 'use_caption_prompt': True, 'tokenizer': BertTokenizerFast(name_or_path='bert-base-uncased', vocab_size=30522, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True)}

data: {'factory': 'CocoGrounding', 'args': {'img_folder': './DATASET/coco/train2017', 'ann_file': './DATASET/coco/annotations/instances_train2017.json'}}

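The code that does the actual work is at build.py (line 93), which dispatches to maskrcnn_benchmark.data.datasets.modulated_coco.CocoGrounding.

The processed label is shown below; the OD labels have been converted into labels usable for the grounding task: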

'person. bicycle. car. motorcycle. airplane. bus. train. truck. boat. traffic light. fire hydrant. stop sign. parking meter. bench. bird. cat. dog. horse. sheep. cow. elephant. bear. zebra. giraffe. backpack. umbrella. handbag. tie. suitcase. frisbee. skis. snowboard. sports ball. kite. baseball bat. baseball glove. skateboard. surfboard. tennis racket. bottle. wine glass. cup. fork. knife. spoon. bowl. banana. apple. sandwich. orange. broccoli. carrot. hot dog. pizza. donut. cake. chair. couch. potted plant. bed. dining table. toilet. tv. laptop. mouse. remote. keyboard. cell phone. microwave. oven. toaster. sink. refrigerator. book. clock. vase. scissors. teddy bear. hair drier. toothbrush'
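- target: the labels, i.e. the positions and categories of the boxes in the image

2. How the text is turned into labels usable for detection

The caption is tokenized with self.tokenizer.batch_encode_plus, which returns the three outputs below. The tokenized sequence is used for later encoding: the category of a box is encoded as a 256-dimensional vector whose entries correspond to the token positions of that category in input_ids.
- input_ids: 101 is [CLS] and 102 is [SEP], marking the start and the end of the sequence.
  
  Values such as 2711, 1012, and 10165 are token ids from the bert-base-uncased vocabulary: each id is the 0-based line number of the token in the vocabulary file.
  
  For example, 2711 is person, 1012 is '.', 11868 is tooth, and 18623 is ##brush, so a single word can be split into two or more sub-word tokens.
  
  input_ids thus encodes every category in the caption as vocabulary ids (the line numbers in vocab.txt; the same mapping is stored in the tokenizer's JSON file). The ids are then fed to the language_backbone for feature extraction.
  
  input_ids has shape [4, 256]: the caption of every image in the batch is encoded, including both the words and the '.' separators. There are 194 non-zero values here, because multi-token categories take several positions; the remaining 62 zeros are unused (padding) positions.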
  
  input_ids: [ 101, 2711, 1012, 10165, 1012, 2482, 1012, 9055, 1012, 13297, 1012, 3902, 1012, 3345, 1012, 4744, 1012, 4049, 1012, 4026, 2422, 1012, 2543, 26018, 3372, 1012, 2644, 3696, 1012, 5581, 8316, 1012, 6847, 1012, 4743, 1012, 4937, 1012, 3899, 1012, 3586, 1012, 8351, 1012, 11190, 1012, 10777, 1012, 4562, 1012, 29145, 1012, 21025, 27528, 7959, 1012, 13383, 1012, 12977, 1012, 2192, 16078, 1012, 5495, 1012, 15940, 1012, 10424, 2483, 11306, 1012, 8301, 2015, 1012, 4586, 6277, 1012, 2998, 3608, 1012, 20497, 1012, 3598, 7151, 1012, 3598, 15913, 1012, 17260, 6277, 1012, 14175, 6277, 1012, 5093, 14513, 3388, 1012, 5835, 1012, 4511, 3221, 1012, 2452, 1012, 9292, 1012, 5442, 1012, 15642, 1012, 4605, 1012, 15212, 1012, 6207, 1012, 11642, 1012, 4589, 1012, 22953, 21408, 3669, 1012, 25659, 1012, 2980, 3899, 1012, 10733, 1012, 2123, 4904, 1012, 9850, 1012, 3242, 1012, 6411, 1012, 8962, 3064, 3269, 1012, 2793, 1012, 7759, 2795, 1012, 11848, 1012, 2694, 1012, 12191, 1012, 8000, 1012, 6556, 1012, 9019, 1012, 3526, 3042, 1012, 18302, 1012, 17428, 1012, 15174, 2121, 1012, 7752, 1012, 18097, 1012, 2338, 1012, 5119, 1012, 18781, 1012, 25806, 1012, 11389, 4562, 1012, 2606, 2852, 3771, 1012, 11868, 18623, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
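- token_type_ids: indicates which sentence each token belongs to (0 for the first sentence, 1 for the second). The caption is a single sentence, so all values are 0.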
  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
- attention_mask: also of length 256; 1 marks a real token and 0 marks a padding position.
  [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
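To show where sub-word tokens such as tooth / ##brush and the padded input_ids / attention_mask come from, here is a toy WordPiece-style encoder. It is a sketch under assumptions: the vocabulary is a tiny hand-picked subset (the ids are the ones visible in the input_ids above, plus the standard [PAD] = 0 and [UNK] = 100), the function names are hypothetical, and the real tokenizer handles normalization, punctuation, and truncation far more carefully.

```python
TOY_VOCAB = {"[PAD]": 0, "[UNK]": 100, "[CLS]": 101, "[SEP]": 102, ".": 1012,
             "person": 2711, "traffic": 4026, "light": 2422,
             "tooth": 11868, "##brush": 18623}


def wordpiece(word, vocab):
    """Greedy longest-match-first split of one word into sub-word tokens."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            candidate = ("##" if start > 0 else "") + word[start:end]
            if candidate in vocab:
                pieces.append(candidate)
                break
            end -= 1
        else:
            return ["[UNK]"]   # no vocabulary entry matches this word
        start = end
    return pieces


def encode(caption, vocab, max_length=8):
    """Return (input_ids, attention_mask), both padded to max_length."""
    words = caption.lower().replace(".", " . ").split()
    tokens = ["[CLS]"] + [p for w in words for p in wordpiece(w, vocab)] + ["[SEP]"]
    input_ids = [vocab[t] for t in tokens][:max_length]
    attention_mask = [1] * len(input_ids)
    padding = max_length - len(input_ids)
    return input_ids + [vocab["[PAD]"]] * padding, attention_mask + [0] * padding


input_ids, attention_mask = encode("person. toothbrush", TOY_VOCAB)
print(input_ids)        # [101, 2711, 1012, 11868, 18623, 102, 0, 0]
print(attention_mask)   # [1, 1, 1, 1, 1, 1, 0, 0]
```

The single category toothbrush ends up at two token positions, which is why its entry in the positive map lists two indices.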

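3. How text features are extracted from the tokens

BERT is used to extract the text features.
BERT outputs features from multiple layers, each of shape [4, 256, 768], where 768 is the embedding dimension of each token. After processing, the following are returned:
- embedded: [4, 256, 768]. The last-layer features are multiplied by the mask (the attention_mask above), which zeroes the features at positions without tokens. The shape stays [4, 256, 768], with all-zero rows at the end.
- aggregate: [4, 768]. The embedded features are summed over all tokens and divided by the number of real tokens (194).
- mask: the attention_mask above.
- hidden: the last-layer features before multiplication by the mask. These are the features used for fusion.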
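The relation between hidden, embedded, and aggregate can be sketched for a single caption as follows. This is an illustration with plain lists in place of tensors; `masked_mean` is a hypothetical name, and the tiny 3-token, 2-dimensional example stands in for the [256, 768] features.

```python
def masked_mean(hidden, mask):
    """hidden: tokens x dim features, mask: one 0/1 value per token.

    Returns (embedded, aggregate): the masked features and their mean
    over the real (mask == 1) tokens.
    """
    embedded = [[value * m for value in row] for row, m in zip(hidden, mask)]
    real_tokens = sum(mask)
    dim = len(hidden[0])
    aggregate = [sum(row[k] for row in embedded) / real_tokens for k in range(dim)]
    return embedded, aggregate


hidden = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # the last row is a padding position
mask = [1, 1, 0]
embedded, aggregate = masked_mean(hidden, mask)
print(embedded)    # [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]
print(aggregate)   # [2.0, 3.0]
```

Dividing by the number of real tokens rather than by the padded length keeps the pooled feature independent of how much padding the caption received.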

4. What the targets look like

[BoxList(num_boxes=4, image_width=640, image_height=968, mode=xyxy), BoxList(num_boxes=4, image_width=800, image_height=1210, mode=xyxy), BoxList(num_boxes=2, image_width=800, image_height=1064, mode=xyxy), BoxList(num_boxes=2, image_width=800, image_height=1199, mode=xyxy)]
Box coordinates, targets[0].bbox: tensor([[310.1792, 241.7580, 427.6033, 419.7641], [236.5868, 86.9536, 353.6931, 421.2010], [418.5555, 123.3293, 543.2572, 398.0749], [ 77.2827, 117.1431, 543.6808, 912.5518]], device='cuda:0')
Box labels, targets[0].get_field('labels'): tensor([80, 80, 80, 42], device='cuda:0')

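5. What the all-important positive_map is

Another name that shows up throughout the code is positive_map, created at data/datasets/modulated_coco.py (line 561).

What positive_map looks like: [0,0,0,...,1,0],...,[0,1,0,...,0]

What positive_map is: the target holds the category of each box, and the prompt is tokenized into a sequence of 256 token positions. The category of a box can therefore be represented by a 256-dimensional vector that marks the token positions of its category name, instead of by a string such as 'person' as in conventional detectors.

positive_map is thus a tensor of shape [number of objects, 256]. Each row encodes the category of one object, and the non-zero entries map back to positions in the tokenized input_ids, so category encodings and category ids correspond one to one.

positive_map is applied to the anchors: after ATSS assignment each anchor has a category (inherited from its matched ground-truth box), so combining positive_map with the anchors maps each anchor's category to a 256-dimensional encoding.

What positive_map is for: it assigns category encodings to anchors, which then serve as ground truth in the loss:

- Classification loss: uses the category encoding as the target.
- Region-word contrastive loss: the region is the anchor and the word is the encoding of the ground-truth category ([0,0,0,…,1]). The more similar a positive pair is (the closer to 1), the closer the prediction is to the word distribution and the smaller the loss.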

1. COCO Fine-tuning

python -m torch.distributed.launch --nproc_per_node=4 --master_port=10801 tools/finetune.py \
--config-file configs/pretrain/glip_Swin_T_O365_GoldG.yaml --skip-test \
--custom_shot_and_epoch_and_general_copy 3_200_4 \
--evaluate_only_best_on_test --push_both_val_and_test \
DATASETS.TRAIN '("coco_grounding_train", )' \
MODEL.WEIGHT MODEL/glip_tiny_model_o365_goldg_cc_sbu.pth \
SOLVER.USE_AMP True TEST.DURING_TRAINING True TEST.IMS_PER_BATCH 4 SOLVER.IMS_PER_BATCH 4 SOLVER.WEIGHT_DECAY 0.05 TEST.EVAL_TASK detection MODEL.BACKBONE.FREEZE_CONV_BODY_AT 2 MODEL.DYHEAD.USE_CHECKPOINT True SOLVER.FIND_UNUSED_PARAMETERS False SOLVER.TEST_WITH_INFERENCE True SOLVER.USE_AUTOSTEP True DATASETS.USE_OVERRIDE_CATEGORY True SOLVER.SEED 10 DATASETS.SHUFFLE_SEED 3 DATASETS.USE_CAPTION_PROMPT True DATASETS.DISABLE_SHUFFLE True \
SOLVER.STEP_PATIENCE 3 SOLVER.CHECKPOINT_PER_EPOCH 1.0 SOLVER.AUTO_TERMINATE_PATIENCE 8 SOLVER.MODEL_EMA 0.0 SOLVER.TUNING_HIGHLEVEL_OVERRIDE full

--custom_shot_and_epoch_and_general_copy sets the number of shots, the number of epochs, and the general copy count:
- 1-shot: 1_200_8
- 3-shot: 3_200_4
- 5-shot: 5_200_2
- 10-shot: 10_200_1
- all data: 0_200_1, and additionally set SOLVER.STEP_PATIENCE 2 and SOLVER.AUTO_TERMINATE_PATIENCE 4
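The value is three integers joined by underscores, which a few lines of Python can unpack. The helper below is hypothetical and only illustrates the format described above (shot, epoch, general copy, with shot 0 meaning all data); it is not the parsing code in tools/finetune.py.

```python
def parse_shot_epoch_copy(value):
    """Split a value such as '3_200_4' into its three integer fields."""
    shot, epoch, general_copy = (int(part) for part in value.split("_"))
    return {"shot": shot, "epoch": epoch, "general_copy": general_copy}


few_shot = parse_shot_epoch_copy("3_200_4")
all_data = parse_shot_epoch_copy("0_200_1")
print(few_shot)   # {'shot': 3, 'epoch': 200, 'general_copy': 4}
print(all_data)   # {'shot': 0, 'epoch': 200, 'general_copy': 1}
```

In the settings listed above, the general copy count shrinks as the number of shots grows (8, 4, 2, 1).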

2. COCO prompt tuning

python -m torch.distributed.launch --nproc_per_node=4 --master_port=10831 tools/finetune.py \
--config-file configs/pretrain/glip_Swin_T_O365_GoldG.yaml --skip-test \
--custom_shot_and_epoch_and_general_copy 3_200_4 \
--evaluate_only_best_on_test --push_both_val_and_test \
DATASETS.TRAIN '("coco_grounding_train", )' \
MODEL.WEIGHT MODEL/glip_tiny_model_o365_goldg_cc_sbu.pth \
SOLVER.USE_AMP True TEST.DURING_TRAINING True TEST.IMS_PER_BATCH 4 SOLVER.IMS_PER_BATCH 4 SOLVER.BASE_LR 0.05 SOLVER.WEIGHT_DECAY 0.25 TEST.EVAL_TASK detection MODEL.BACKBONE.FREEZE_CONV_BODY_AT 2 MODEL.DYHEAD.USE_CHECKPOINT True SOLVER.FIND_UNUSED_PARAMETERS False SOLVER.TEST_WITH_INFERENCE True SOLVER.USE_AUTOSTEP True DATASETS.USE_OVERRIDE_CATEGORY True SOLVER.SEED 10 DATASETS.SHUFFLE_SEED 3 DATASETS.USE_CAPTION_PROMPT True DATASETS.DISABLE_SHUFFLE True \
SOLVER.STEP_PATIENCE 3 SOLVER.CHECKPOINT_PER_EPOCH 1.0 SOLVER.AUTO_TERMINATE_PATIENCE 8 SOLVER.MODEL_EMA 0.0 SOLVER.TUNING_HIGHLEVEL_OVERRIDE language_prompt_v2

The main differences from fine-tuning are:

- SOLVER.BASE_LR 0.05
- SOLVER.WEIGHT_DECAY 0.25
- SOLVER.TUNING_HIGHLEVEL_OVERRIDE language_prompt_v2

The handling of SOLVER.TUNING_HIGHLEVEL_OVERRIDE can be found at tools/finetune.py (line 215):

def tuning_highlevel_override(cfg,):
    if cfg.SOLVER.TUNING_HIGHLEVEL_OVERRIDE == "full":
        cfg.MODEL.BACKBONE.FREEZE = False
        cfg.MODEL.FPN.FREEZE = False
        cfg.MODEL.RPN.FREEZE = False
        cfg.MODEL.LINEAR_PROB = False
        cfg.MODEL.DYHEAD.FUSE_CONFIG.ADD_LINEAR_LAYER = False
        cfg.MODEL.LANGUAGE_BACKBONE.FREEZE = False
    elif cfg.SOLVER.TUNING_HIGHLEVEL_OVERRIDE == "linear_prob":
        cfg.MODEL.BACKBONE.FREEZE = True
        cfg.MODEL.FPN.FREEZE = True
        cfg.MODEL.RPN.FREEZE = False
        cfg.MODEL.LINEAR_PROB = True
        cfg.MODEL.DYHEAD.FUSE_CONFIG.ADD_LINEAR_LAYER = False
        cfg.MODEL.LANGUAGE_BACKBONE.FREEZE = True
        cfg.MODEL.LANGUAGE_BACKBONE.FREEZE = True
        cfg.MODEL.DYHEAD.USE_CHECKPOINT = False # Disable checkpoint
    elif cfg.SOLVER.TUNING_HIGHLEVEL_OVERRIDE == "language_prompt_v1":
        cfg.MODEL.BACKBONE.FREEZE = True
        cfg.MODEL.FPN.FREEZE = True
        cfg.MODEL.RPN.FREEZE = True
        cfg.MODEL.LINEAR_PROB = False
        cfg.MODEL.DYHEAD.FUSE_CONFIG.ADD_LINEAR_LAYER = False
        cfg.MODEL.LANGUAGE_BACKBONE.FREEZE = False
    elif cfg.SOLVER.TUNING_HIGHLEVEL_OVERRIDE == "language_prompt_v2":
        cfg.MODEL.BACKBONE.FREEZE = True
        cfg.MODEL.FPN.FREEZE = True
        cfg.MODEL.RPN.FREEZE = True
        cfg.MODEL.LINEAR_PROB = False
        cfg.MODEL.DYHEAD.FUSE_CONFIG.ADD_LINEAR_LAYER = True
        cfg.MODEL.LANGUAGE_BACKBONE.FREEZE = True
    elif cfg.SOLVER.TUNING_HIGHLEVEL_OVERRIDE == "language_prompt_v3":
        cfg.MODEL.BACKBONE.FREEZE = True
        cfg.MODEL.FPN.FREEZE = True
        cfg.MODEL.RPN.FREEZE = True
        cfg.MODEL.LINEAR_PROB = True # Turn on linear probe
        cfg.MODEL.DYHEAD.FUSE_CONFIG.ADD_LINEAR_LAYER = False
        cfg.MODEL.LANGUAGE_BACKBONE.FREEZE = False # Turn on language backbone
    elif cfg.SOLVER.TUNING_HIGHLEVEL_OVERRIDE == "language_prompt_v4":
        cfg.MODEL.BACKBONE.FREEZE = True
        cfg.MODEL.FPN.FREEZE = True
        cfg.MODEL.RPN.FREEZE = True
        cfg.MODEL.LINEAR_PROB = True # Turn on linear probe
        cfg.MODEL.DYHEAD.FUSE_CONFIG.ADD_LINEAR_LAYER = True
        cfg.MODEL.LANGUAGE_BACKBONE.FREEZE = True # Turn off language backbone
    return cfg

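3. ODinW / Custom Dataset Fine-Tuning

Training code path: finetune.py → maskrcnn_benchmark/engine/trainer.py (line 45) → maskrcnn_benchmark/modeling/detector/generalized_vl_rcnn.py (line 134), where the layers of the model that need freezing are frozen.
After freezing, the images, labels, etc. for the current iteration are fetched in maskrcnn_benchmark/engine/trainer.py (line 45).
The training forward pass runs at maskrcnn_benchmark/modeling/detector/generalized_vl_rcnn.py (line 175).
The caption is first tokenized with batch_encode_plus, whose outputs include:
- input_ids: [4, 256], the vocabulary ids of the tokens; 4 is the batch size and 256 is the maximum number of text tokens
- token_type_ids: [4, 256], the segment ids that distinguish two sentences (all 0 for the first sentence, all 1 for the second)
- attention_mask: [4, 256], marks which tokens take part in self-attention
- special_tokens_mask: [4, 256]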
caption:'person. bicycle. car. motorcycle. airplane. bus. train. truck. boat. traffic light. fire hydrant. stop sign. parking meter. bench. bird. cat. dog. horse. sheep. cow. elephant. bear. zebra. giraffe. backpack. umbrella. handbag. tie. suitcase. frisbee. skis. snowboard. sports ball. kite. baseball bat. baseball glove. skateboard. surfboard. tennis racket. bottle. wine glass. cup. fork. knife. spoon. bowl. banana. apple. sandwich. orange. broccoli. carrot. hot dog. pizza. donut. cake. chair. couch. potted plant. bed. dining table. toilet. tv. laptop. mouse. remote. keyboard. cell phone. microwave. oven. toaster. sink. refrigerator. book. clock. vase. scissors. teddy bear. hair drier. toothbrush'
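The caption above looks like the class names joined with '. '. A minimal sketch of how such a prompt could be assembled while recording the character span of each class name is shown below; `build_caption` is a hypothetical helper written for illustration and is not a function from the GLIP repository.

```python
def build_caption(class_names, sep=". "):
    """Join class names into one prompt and record each name's character span.

    Hypothetical helper: the separator and the absence of a trailing period
    are read off the caption printed above, not taken from the GLIP code.
    """
    caption = ""
    spans = []
    for i, name in enumerate(class_names):
        if i > 0:
            caption += sep
        start = len(caption)
        caption += name
        spans.append((start, len(caption)))  # [start, end) in characters
    return caption, spans


caption, spans = build_caption(["person", "bicycle", "traffic light"])
print(caption)
for (start, end) in spans:
    # Each span slices the original class name back out of the prompt.
    print((start, end), repr(caption[start:end]))
```

The character spans are what later allows each class (or phrase) to be mapped onto the token positions it occupies in the prompt.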
Encoding the text yields language_dict_features:
- language_dict_features['aggregate']: [4, 768]
- language_dict_features['embedded']: [4, 256, 768], where 256 is the token length and 768 is the dimension of the vector BERT produces for each token
- language_dict_features['masks']: [4, 256]
- language_dict_features['hidden']: [4, 256, 768]
Next, the visual backbone (Swin) extracts the visual features, giving 5 feature levels with shapes [4, 256, 92, 136], [4, 256, 46, 68], [4, 256, 23, 34], [4, 256, 12, 17], [4, 256, 6, 9].
Attention is then used to fuse the image and language features: maskrcnn_benchmark/modeling/rpn/vldyhead.py (line 892) → (line 560).
The early fusion module fuses the two groups of features (MHS-A/MHS-B). There are 5 levels of visual features and 5 kinds of language features; the fusion operates on the 5 levels of image features and the language hidden/mask features. Multi-level fusion here means that the Swin Transformer features of every level are fused with the language features, rather than only a single level. The fused features have the same shapes as before fusion: the visual features are [4, 256, 92, 136], [4, 256, 46, 68], [4, 256, 23, 34], [4, 256, 12, 17], [4, 256, 6, 9], and the language features are [4, 256, 768]. The fused language features are written back to language_dict_features['hidden'].
The language features are normalized and passed through a few linear operations.
Classification, regression, and centerness features are extracted from the visual features, and the similarity between each level's features and the language features is computed.
Anchors are generated as one list of 5 BoxLists (one per feature level) for each of the 4 images in the batch:
[[BoxList(num_boxes=16800, image_width=720, image_height=1283, mode=xyxy), BoxList(num_boxes=4200, image_width=720, image_height=1283, mode=xyxy), BoxList(num_boxes=1050, image_width=720, image_height=1283, mode=xyxy), BoxList(num_boxes=273, image_width=720, image_height=1283, mode=xyxy), BoxList(num_boxes=77, image_width=720, image_height=1283, mode=xyxy)],
 [BoxList(num_boxes=16800, image_width=748, image_height=1333, mode=xyxy), BoxList(num_boxes=4200, image_width=748, image_height=1333, mode=xyxy), BoxList(num_boxes=1050, image_width=748, image_height=1333, mode=xyxy), BoxList(num_boxes=273, image_width=748, image_height=1333, mode=xyxy), BoxList(num_boxes=77, image_width=748, image_height=1333, mode=xyxy)],
 [BoxList(num_boxes=16800, image_width=560, image_height=839, mode=xyxy), BoxList(num_boxes=4200, image_width=560, image_height=839, mode=xyxy), BoxList(num_boxes=1050, image_width=560, image_height=839, mode=xyxy), BoxList(num_boxes=273, image_width=560, image_height=839, mode=xyxy), BoxList(num_boxes=77, image_width=560, image_height=839, mode=xyxy)],
 [BoxList(num_boxes=16800, image_width=800, image_height=1147, mode=xyxy), BoxList(num_boxes=4200, image_width=800, image_height=1147, mode=xyxy), BoxList(num_boxes=1050, image_width=800, image_height=1147, mode=xyxy), BoxList(num_boxes=273, image_width=800, image_height=1147, mode=xyxy), BoxList(num_boxes=77, image_width=800, image_height=1147, mode=xyxy)]]
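The numbers above can be sanity-checked with a little arithmetic. The sketch below assumes that each pyramid level halves the previous one with rounding up, and that there is one anchor per feature-map location; both are inferences from the printed shapes and counts, not statements taken from the code. Under those assumptions the anchor counts 16800, 4200, 1050, 273, 77 correspond to a finest level of 100×168, which suggests the anchor dump comes from a different (larger padded) batch than the 92×136 feature shapes listed earlier.

```python
import math


def pyramid_shapes(height, width, num_levels=5):
    """Spatial sizes of a feature pyramid where each level halves the last (rounding up)."""
    shapes = [(height, width)]
    for _ in range(num_levels - 1):
        h, w = shapes[-1]
        shapes.append((math.ceil(h / 2), math.ceil(w / 2)))
    return shapes


def anchors_per_level(shapes, anchors_per_location=1):
    """Anchor count per level, assuming a fixed number of anchors at every location."""
    return [h * w * anchors_per_location for h, w in shapes]


# Feature shapes reported for the Swin/FPN outputs.
print(pyramid_shapes(92, 136))
# Assumed finest level of 100x168 that reproduces the printed anchor counts.
print(anchors_per_level(pyramid_shapes(100, 168)))
```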
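Label information is extracted for each image in the batch (loss.py line 653). Taking one image as an example: bboxes_per_im = targets_per_im.bbox, labels_per_im = targets_per_im.get_field("labels"), num_gt = len(bboxes_per_im)
- bboxes_per_im: tensor([[282.8400, 263.7800, 319.6100, 296.5700], [ 21.1700, 30.2600, 449.5800, 457.6000], [122.3300, 129.2700, 429.1300, 471.5900], [235.6900, 338.0400, 257.0000, 366.4200]], device='cuda:0')
- labels_per_im: tensor([80, 1, 1, 80], device='cuda:0')
- num_gt: 4
IoU is computed between the anchors and the ground-truth (gt) boxes, giving one IoU value for every (anchor, gt) pair.
The distance between the center of every anchor and the center of every gt is computed (anchors from all levels are compared against each gt). Then, separately within each level, the 9 anchors closest to each gt are kept as candidates (this is the ATSS assignment scheme). The mean and standard deviation of the candidates' IoUs are computed; anchors whose IoU exceeds mean + standard deviation are marked positive and all others negative. If an anchor is assigned to several gts, it is kept only for the gt with the highest IoU, and each positive anchor is given the class of its gt.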
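The assignment rule just described can be sketched in plain Python. This is a simplified illustration of ATSS on lists of (x1, y1, x2, y2) tuples, not the tensor code in loss.py: `box_iou`, `box_center`, and `atss_assign` are names chosen here, the sample standard deviation is used for the threshold, and the check that a positive anchor's center lies inside the gt follows the ATSS paper rather than anything stated above.

```python
import math
from statistics import mean, stdev


def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def box_center(box):
    return ((box[0] + box[2]) / 2.0, (box[1] + box[3]) / 2.0)


def atss_assign(anchors_per_level, gts, topk=9):
    """Return {flat_anchor_index: gt_index} for the positive anchors."""
    flat = [a for level in anchors_per_level for a in level]
    best = {}  # anchor index -> (iou, gt index)
    for g, gt in enumerate(gts):
        gcx, gcy = box_center(gt)
        candidates, offset = [], 0
        for level in anchors_per_level:
            # Keep the top-k anchors of this level that are closest to the gt center.
            dists = []
            for i, anchor in enumerate(level):
                acx, acy = box_center(anchor)
                dists.append((math.hypot(acx - gcx, acy - gcy), offset + i))
            dists.sort()
            candidates.extend(idx for _, idx in dists[:topk])
            offset += len(level)
        ious = [box_iou(flat[idx], gt) for idx in candidates]
        # Adaptive threshold: mean + standard deviation of the candidate IoUs.
        std = stdev(ious) if len(ious) > 1 else 0.0
        threshold = mean(ious) + std
        for idx, value in zip(candidates, ious):
            acx, acy = box_center(flat[idx])
            inside = gt[0] < acx < gt[2] and gt[1] < acy < gt[3]
            # An anchor matched to several gts keeps the gt with the highest IoU.
            if value >= threshold and inside and (idx not in best or value > best[idx][0]):
                best[idx] = (value, g)
    return {idx: g for idx, (_, g) in best.items()}


anchors = [[(0, 0, 10, 10), (5, 5, 15, 15), (20, 20, 30, 30)]]  # one toy level
gts = [(0, 0, 10, 10), (20, 20, 30, 30)]
print(atss_assign(anchors, gts, topk=3))
```

Because the threshold is derived from each gt's own candidate statistics, no fixed IoU cutoff has to be tuned per dataset, which is the main design point of ATSS.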
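Using the retained anchors and their corresponding box, cls, token, and centerness predictions, the individual losses are computed:
- box reg loss: GIoU loss
- cls loss: sigmoid focal loss. Its weight in the final loss can be 0, in which case the classification loss does not contribute: {'loss_reg': tensor(0.1464, device='cuda:0', grad_fn=<MulBackward0>), 'loss_centerness': tensor(0.6058, device='cuda:0', grad_fn=<DivBackward0>), 'loss_cls': tensor(0., device='cuda:0', grad_fn=<MulBackward0>)}
- token loss: (loss.py line 1159) token sigmoid focal loss. It is computed on the language-vision similarity (between the visual features of the 5 levels and the language features), i.e. the $L_{inter}+L_{intra}$ term of the loss function, the loss on the image-text similarity scores. Similarities are computed between the positives and all negatives in the batch; the goal is to pull positive pairs together and push them away from the negatives. {'loss_dot_product_token': tensor(0.3043, device='cuda:0', grad_fn=<MulBackward0>)}
- centerness loss: BCEWithLogitsLoss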
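For reference, here is a scalar sketch of two of the losses named above: the sigmoid focal loss (with gamma = 2.0 and alpha = 0.25, the values that appear in the module printout of the head) and the GIoU loss. These are the textbook formulas written for single values and single boxes; the repository's versions operate on tensors and may differ in reduction and normalization, so treat this as an illustration only.

```python
import math


def sigmoid_focal_loss(logit, target, gamma=2.0, alpha=0.25):
    """Focal loss for one logit and a binary target (0 or 1)."""
    p = 1.0 / (1.0 + math.exp(-logit))
    if target == 1:
        # Easy positives (p close to 1) are down-weighted by (1 - p) ** gamma.
        return -alpha * (1.0 - p) ** gamma * math.log(p)
    # Easy negatives (p close to 0) are down-weighted by p ** gamma.
    return -(1.0 - alpha) * p ** gamma * math.log(1.0 - p)


def giou_loss(pred, gt):
    """1 - GIoU for two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(pred[2], gt[2]) - max(pred[0], gt[0]))
    ih = max(0.0, min(pred[3], gt[3]) - max(pred[1], gt[1]))
    inter = iw * ih
    area_p = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_g = (gt[2] - gt[0]) * (gt[3] - gt[1])
    union = area_p + area_g - inter
    iou = inter / union
    # Smallest axis-aligned box enclosing both boxes.
    enclose = (max(pred[2], gt[2]) - min(pred[0], gt[0])) * (
        max(pred[3], gt[3]) - min(pred[1], gt[1])
    )
    giou = iou - (enclose - union) / enclose
    return 1.0 - giou


print(sigmoid_focal_loss(0.0, 1), sigmoid_focal_loss(0.0, 0))
print(giou_loss((0, 0, 2, 2), (0, 0, 2, 2)), giou_loss((0, 0, 1, 1), (2, 0, 3, 1)))
```

Unlike plain IoU, GIoU still gives a useful signal for disjoint boxes, since the enclosing-box term grows as the boxes move apart.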

4.6 The Object Detection in the Wild Benchmark

1、Download ODinW

2、(Zero-Shot) Evaluation

3、Full-Model Fine-Tuning

4、Prompt Tuning

5、Linear Probing

6、Knowledge-Augmented Inference

4.7 Inference Visualization

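The inference visualization pipeline is as follows:

First, the image is resized so that its shorter side is 800, e.g. a (480, 640, 3) image becomes (3, 800, 1066), and is then normalized.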
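A minimal sketch of the resize arithmetic is given below. It follows the usual shorter-side rule with a cap on the longer side; the cap of 1333 is an assumption (it matches the largest image_height seen in the anchor printout during training, but the value used at inference is not stated here), and `resized_shape` is a name chosen for this example.

```python
def resized_shape(height, width, min_size=800, max_size=1333):
    """Output (height, width) after scaling the shorter side to min_size.

    max_size is an assumed cap on the longer side; if scaling the shorter
    side to min_size would exceed it, the target short side is reduced.
    """
    size = min_size
    short, long_ = float(min(height, width)), float(max(height, width))
    if long_ / short * size > max_size:
        size = int(round(max_size * short / long_))
    if height <= width:
        # Height is the shorter side; width is truncated to an integer.
        return size, int(size * width / height)
    return int(size * height / width), size


print(resized_shape(480, 640))   # the example from the text
print(resized_shape(400, 1000))  # a wide image that hits the assumed cap
```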
Next, the caption is tokenized. Given the input 'bobble heads on top of the shelf', tokenization yields {'input_ids': tensor([[ 101, 3960, 3468, 4641, 2006, 2327, 1997, 1996, 11142, 102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
nltk is used to find noun_phrases: ['bobble heads', 'top', 'the shelf']. These extracted phrases are the labels to detect. If the input caption is 'sofa . remote . dog . person . car . sky . plane .', nltk produces noun_phrases: ['sofa', 'remote', 'dog', 'person', 'car', 'sky', 'plane']
Then the start and end character positions of each label in the caption are located, forming tokens_positive: [[[0, 12]], [[16, 19]], [[23, 32]]], from which a 3x256 positive_map is built (3 because 3 phrases were extracted above). For the first span [0, 12], tokenized.char_to_token(0) and tokenized.char_to_token(12-1) give beg_pos and end_pos. Each phrase is thus encoded as a 256-dimensional vector, and the positions of the 1s in that vector give positive_map_label_to_token: {1: [1, 2, 3], 2: [5], 3: [7, 8]}
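The span-to-token mapping can be reproduced with a small sketch. To stay self-contained it does not call a real tokenizer: `TOKEN_OFFSETS` is a hand-written table of character offsets that mimics how the 10 input_ids above would split the caption (an assumption, with 'bobble' split into two word pieces), and `char_to_token` here is a local stand-in for the tokenizer method of the same name. Phrase positions are found with `str.find`, which only handles the first occurrence.

```python
CAPTION = "bobble heads on top of the shelf"
PHRASES = ["bobble heads", "top", "the shelf"]

# Assumed character offsets per token: [CLS], bob, ##ble, heads, on, top, of, the, shelf, [SEP].
TOKEN_OFFSETS = [None, (0, 3), (3, 6), (7, 12), (13, 15), (16, 19), (20, 22), (23, 26), (27, 32), None]


def find_tokens_positive(caption, phrases):
    """Character span [start, end) of the first occurrence of each phrase."""
    spans = []
    for phrase in phrases:
        start = caption.find(phrase)
        spans.append([[start, start + len(phrase)]])
    return spans


def char_to_token(offsets, char_index):
    """Index of the token covering a character (stand-in for the tokenizer method)."""
    for token_index, span in enumerate(offsets):
        if span is not None and span[0] <= char_index < span[1]:
            return token_index
    return None


def label_to_token_map(tokens_positive, offsets):
    """Map 1-based phrase labels to the token positions they occupy."""
    mapping = {}
    for label, spans in enumerate(tokens_positive, start=1):
        positions = []
        for beg, end in spans:
            beg_pos = char_to_token(offsets, beg)
            end_pos = char_to_token(offsets, end - 1)
            positions.extend(range(beg_pos, end_pos + 1))
        mapping[label] = positions
    return mapping


tokens_positive = find_tokens_positive(CAPTION, PHRASES)
print(tokens_positive)
print(label_to_token_map(tokens_positive, TOKEN_OFFSETS))
```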
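Then the inputs go through the model:
- The caption is tokenized and passed through the language model to extract features
- Swin features are extracted from the image
- The language and image features are fused, and the text features are then projected down to 256 dimensions
- Classification, regression, centerness, and region-text similarity features are computed from the image features
- The generated anchors and the predictions are decoded (see the earlier sections for decode details); NMS is applied with a threshold of 0.6 and at most 100 boxes are kept
- The output prediction boxes are BoxList(num_boxes=13, image_width=640, image_height=480, mode=xyxy); the predictions are then resized back to the original image size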
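The NMS step can be illustrated with a greedy, class-agnostic sketch using the 0.6 IoU threshold and the cap of 100 boxes mentioned above. This is a simplification written for illustration: the function names are chosen here, and the repository's post-processor may apply NMS per label rather than across all boxes.

```python
def pair_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, scores, iou_threshold=0.6, max_detections=100):
    """Return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box.
        if all(pair_iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
            if len(keep) == max_detections:
                break
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(greedy_nms(boxes, scores))
```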
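Finally, the predictions are filtered by a visualization threshold. The output labels are predictions.get_field("labels").tolist(): [1, 1, 1, 3, 1, 2, 1, 3, 1, 1, 2, 3, 1], and the output scores are predictions.get_field("scores"): tensor([0.2432, 0.3272, 0.2529, 0.3403, 0.7509, 0.3260, 0.2384, 0.2128, 0.1887, 0.4218, 0.3763, 0.7082, 0.1189]). With the visualization threshold set to 0.5, for example, only boxes with a score above 0.5 are kept.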
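Applied to the labels and scores listed above, the threshold step looks like the sketch below. It assumes that label i refers to the i-th extracted noun phrase (1-based), which is consistent with the keys of positive_map_label_to_token but is not spelled out in the text; `filter_predictions` is a name chosen for this example.

```python
LABELS = [1, 1, 1, 3, 1, 2, 1, 3, 1, 1, 2, 3, 1]
SCORES = [0.2432, 0.3272, 0.2529, 0.3403, 0.7509, 0.3260, 0.2384,
          0.2128, 0.1887, 0.4218, 0.3763, 0.7082, 0.1189]
NOUN_PHRASES = ["bobble heads", "top", "the shelf"]


def filter_predictions(labels, scores, phrases, threshold=0.5):
    """Keep (index, phrase, score) for predictions scoring above the threshold."""
    kept = []
    for index, (label, score) in enumerate(zip(labels, scores)):
        if score > threshold:
            # Labels are assumed to be 1-based indices into the phrase list.
            kept.append((index, phrases[label - 1], score))
    return kept


for index, phrase, score in filter_predictions(LABELS, SCORES, NOUN_PHRASES):
    print(index, phrase, score)
```

Of the 13 predicted boxes, only two clear the 0.5 threshold in this example.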

4.8 RPN Network Structure

VLDyHeadModule((head): VLDyHead((dyhead_tower): Sequential((0): VLFuse((b_attn): BiAttentionBlockForCheckpoint((layer_norm_v): LayerNorm((256,), eps=1e-05, elementwise_affine=True)(layer_norm_l): LayerNorm((768,), eps=1e-05, elementwise_affine=True)(attn): BiMultiHeadAttention((v_proj): Linear(in_features=256, out_features=2048, bias=True)(l_proj): Linear(in_features=768, out_features=2048, bias=True)(values_v_proj): Linear(in_features=256, out_features=2048, bias=True)(values_l_proj): Linear(in_features=768, out_features=2048, bias=True)(out_v_proj): Linear(in_features=2048, out_features=256, bias=True)(out_l_proj): Linear(in_features=2048, out_features=768, bias=True))(drop_path): Identity()))(1): BertEncoderLayer((attention): BertAttention((self): BertSelfAttention((query): Linear(in_features=768, out_features=768, bias=True)(key): Linear(in_features=768, out_features=768, bias=True)(value): Linear(in_features=768, out_features=768, bias=True)(dropout): Dropout(p=0.1, inplace=False))(output): BertSelfOutput((dense): Linear(in_features=768, out_features=768, bias=True)(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)(dropout): Dropout(p=0.1, inplace=False)))(intermediate): BertIntermediate((dense): Linear(in_features=768, out_features=3072, bias=True)(intermediate_act_fn): GELUActivation())(output): BertOutput((dense): Linear(in_features=3072, out_features=768, bias=True)(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)(dropout): Dropout(p=0.1, inplace=False)))(2): DyConv((DyConv): ModuleList((0): Conv3x3Norm((conv): ModulatedDeformConv(in_channels=256, out_channels=256, kernel_size=(3, 3), stride=1, dilation=1, padding=1, groups=1, deformable_groups=1, bias=True)(bn): GroupNorm(16, 256, eps=1e-05, affine=True))(1): Conv3x3Norm((conv): ModulatedDeformConv(in_channels=256, out_channels=256, kernel_size=(3, 3), stride=1, dilation=1, padding=1, groups=1, deformable_groups=1, bias=True)(bn): GroupNorm(16, 256, eps=1e-05, 
affine=True))(2): Conv3x3Norm((conv): ModulatedDeformConv(in_channels=256, out_channels=256, kernel_size=(3, 3), stride=2, dilation=1, padding=1, groups=1, deformable_groups=1, bias=True)(bn): GroupNorm(16, 256, eps=1e-05, affine=True)))(AttnConv): Sequential((0): AdaptiveAvgPool2d(output_size=1)(1): Conv2d(256, 1, kernel_size=(1, 1), stride=(1, 1))(2): ReLU(inplace=True))(h_sigmoid): h_sigmoid((relu): ReLU6(inplace=True))(relu): DYReLU((avg_pool): AdaptiveAvgPool2d(output_size=1)(fc): Sequential((0): Linear(in_features=256, out_features=64, bias=True)(1): ReLU(inplace=True)(2): Linear(in_features=64, out_features=1024, bias=True)(3): h_sigmoid((relu): ReLU6(inplace=True))))(offset): Conv2d(256, 27, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)))(3): VLFuse((b_attn): BiAttentionBlockForCheckpoint((layer_norm_v): LayerNorm((256,), eps=1e-05, elementwise_affine=True)(layer_norm_l): LayerNorm((768,), eps=1e-05, elementwise_affine=True)(attn): BiMultiHeadAttention((v_proj): Linear(in_features=256, out_features=2048, bias=True)(l_proj): Linear(in_features=768, out_features=2048, bias=True)(values_v_proj): Linear(in_features=256, out_features=2048, bias=True)(values_l_proj): Linear(in_features=768, out_features=2048, bias=True)(out_v_proj): Linear(in_features=2048, out_features=256, bias=True)(out_l_proj): Linear(in_features=2048, out_features=768, bias=True))(drop_path): Identity()))(4): BertEncoderLayer((attention): BertAttention((self): BertSelfAttention((query): Linear(in_features=768, out_features=768, bias=True)(key): Linear(in_features=768, out_features=768, bias=True)(value): Linear(in_features=768, out_features=768, bias=True)(dropout): Dropout(p=0.1, inplace=False))(output): BertSelfOutput((dense): Linear(in_features=768, out_features=768, bias=True)(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)(dropout): Dropout(p=0.1, inplace=False)))(intermediate): BertIntermediate((dense): Linear(in_features=768, out_features=3072, 
bias=True)(intermediate_act_fn): GELUActivation())(output): BertOutput((dense): Linear(in_features=3072, out_features=768, bias=True)(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)(dropout): Dropout(p=0.1, inplace=False)))(5): DyConv((DyConv): ModuleList((0): Conv3x3Norm((conv): ModulatedDeformConv(in_channels=256, out_channels=256, kernel_size=(3, 3), stride=1, dilation=1, padding=1, groups=1, deformable_groups=1, bias=True)(bn): GroupNorm(16, 256, eps=1e-05, affine=True))(1): Conv3x3Norm((conv): ModulatedDeformConv(in_channels=256, out_channels=256, kernel_size=(3, 3), stride=1, dilation=1, padding=1, groups=1, deformable_groups=1, bias=True)(bn): GroupNorm(16, 256, eps=1e-05, affine=True))(2): Conv3x3Norm((conv): ModulatedDeformConv(in_channels=256, out_channels=256, kernel_size=(3, 3), stride=2, dilation=1, padding=1, groups=1, deformable_groups=1, bias=True)(bn): GroupNorm(16, 256, eps=1e-05, affine=True)))(AttnConv): Sequential((0): AdaptiveAvgPool2d(output_size=1)(1): Conv2d(256, 1, kernel_size=(1, 1), stride=(1, 1))(2): ReLU(inplace=True))(h_sigmoid): h_sigmoid((relu): ReLU6(inplace=True))(relu): DYReLU((avg_pool): AdaptiveAvgPool2d(output_size=1)(fc): Sequential((0): Linear(in_features=256, out_features=64, bias=True)(1): ReLU(inplace=True)(2): Linear(in_features=64, out_features=1024, bias=True)(3): h_sigmoid((relu): ReLU6(inplace=True))))(offset): Conv2d(256, 27, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)))(6): VLFuse((b_attn): BiAttentionBlockForCheckpoint((layer_norm_v): LayerNorm((256,), eps=1e-05, elementwise_affine=True)(layer_norm_l): LayerNorm((768,), eps=1e-05, elementwise_affine=True)(attn): BiMultiHeadAttention((v_proj): Linear(in_features=256, out_features=2048, bias=True)(l_proj): Linear(in_features=768, out_features=2048, bias=True)(values_v_proj): Linear(in_features=256, out_features=2048, bias=True)(values_l_proj): Linear(in_features=768, out_features=2048, bias=True)(out_v_proj): Linear(in_features=2048, 
out_features=256, bias=True)(out_l_proj): Linear(in_features=2048, out_features=768, bias=True))(drop_path): Identity()))(7): BertEncoderLayer((attention): BertAttention((self): BertSelfAttention((query): Linear(in_features=768, out_features=768, bias=True)(key): Linear(in_features=768, out_features=768, bias=True)(value): Linear(in_features=768, out_features=768, bias=True)(dropout): Dropout(p=0.1, inplace=False))(output): BertSelfOutput((dense): Linear(in_features=768, out_features=768, bias=True)(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)(dropout): Dropout(p=0.1, inplace=False)))(intermediate): BertIntermediate((dense): Linear(in_features=768, out_features=3072, bias=True)(intermediate_act_fn): GELUActivation())(output): BertOutput((dense): Linear(in_features=3072, out_features=768, bias=True)(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)(dropout): Dropout(p=0.1, inplace=False)))(8): DyConv((DyConv): ModuleList((0): Conv3x3Norm((conv): ModulatedDeformConv(in_channels=256, out_channels=256, kernel_size=(3, 3), stride=1, dilation=1, padding=1, groups=1, deformable_groups=1, bias=True)(bn): GroupNorm(16, 256, eps=1e-05, affine=True))(1): Conv3x3Norm((conv): ModulatedDeformConv(in_channels=256, out_channels=256, kernel_size=(3, 3), stride=1, dilation=1, padding=1, groups=1, deformable_groups=1, bias=True)(bn): GroupNorm(16, 256, eps=1e-05, affine=True))(2): Conv3x3Norm((conv): ModulatedDeformConv(in_channels=256, out_channels=256, kernel_size=(3, 3), stride=2, dilation=1, padding=1, groups=1, deformable_groups=1, bias=True)(bn): GroupNorm(16, 256, eps=1e-05, affine=True)))(AttnConv): Sequential((0): AdaptiveAvgPool2d(output_size=1)(1): Conv2d(256, 1, kernel_size=(1, 1), stride=(1, 1))(2): ReLU(inplace=True))(h_sigmoid): h_sigmoid((relu): ReLU6(inplace=True))(relu): DYReLU((avg_pool): AdaptiveAvgPool2d(output_size=1)(fc): Sequential((0): Linear(in_features=256, out_features=64, bias=True)(1): ReLU(inplace=True)(2): 
Linear(in_features=64, out_features=1024, bias=True)(3): h_sigmoid((relu): ReLU6(inplace=True))))(offset): Conv2d(256, 27, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)))(9): VLFuse((b_attn): BiAttentionBlockForCheckpoint((layer_norm_v): LayerNorm((256,), eps=1e-05, elementwise_affine=True)(layer_norm_l): LayerNorm((768,), eps=1e-05, elementwise_affine=True)(attn): BiMultiHeadAttention((v_proj): Linear(in_features=256, out_features=2048, bias=True)(l_proj): Linear(in_features=768, out_features=2048, bias=True)(values_v_proj): Linear(in_features=256, out_features=2048, bias=True)(values_l_proj): Linear(in_features=768, out_features=2048, bias=True)(out_v_proj): Linear(in_features=2048, out_features=256, bias=True)(out_l_proj): Linear(in_features=2048, out_features=768, bias=True))(drop_path): Identity()))(10): BertEncoderLayer((attention): BertAttention((self): BertSelfAttention((query): Linear(in_features=768, out_features=768, bias=True)(key): Linear(in_features=768, out_features=768, bias=True)(value): Linear(in_features=768, out_features=768, bias=True)(dropout): Dropout(p=0.1, inplace=False))(output): BertSelfOutput((dense): Linear(in_features=768, out_features=768, bias=True)(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)(dropout): Dropout(p=0.1, inplace=False)))(intermediate): BertIntermediate((dense): Linear(in_features=768, out_features=3072, bias=True)(intermediate_act_fn): GELUActivation())(output): BertOutput((dense): Linear(in_features=3072, out_features=768, bias=True)(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)(dropout): Dropout(p=0.1, inplace=False)))(11): DyConv((DyConv): ModuleList((0): Conv3x3Norm((conv): ModulatedDeformConv(in_channels=256, out_channels=256, kernel_size=(3, 3), stride=1, dilation=1, padding=1, groups=1, deformable_groups=1, bias=True)(bn): GroupNorm(16, 256, eps=1e-05, affine=True))(1): Conv3x3Norm((conv): ModulatedDeformConv(in_channels=256, out_channels=256, kernel_size=(3, 3), 
stride=1, dilation=1, padding=1, groups=1, deformable_groups=1, bias=True)(bn): GroupNorm(16, 256, eps=1e-05, affine=True))(2): Conv3x3Norm((conv): ModulatedDeformConv(in_channels=256, out_channels=256, kernel_size=(3, 3), stride=2, dilation=1, padding=1, groups=1, deformable_groups=1, bias=True)(bn): GroupNorm(16, 256, eps=1e-05, affine=True)))(AttnConv): Sequential((0): AdaptiveAvgPool2d(output_size=1)(1): Conv2d(256, 1, kernel_size=(1, 1), stride=(1, 1))(2): ReLU(inplace=True))(h_sigmoid): h_sigmoid((relu): ReLU6(inplace=True))(relu): DYReLU((avg_pool): AdaptiveAvgPool2d(output_size=1)(fc): Sequential((0): Linear(in_features=256, out_features=64, bias=True)(1): ReLU(inplace=True)(2): Linear(in_features=64, out_features=1024, bias=True)(3): h_sigmoid((relu): ReLU6(inplace=True))))(offset): Conv2d(256, 27, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)))(12): VLFuse((b_attn): BiAttentionBlockForCheckpoint((layer_norm_v): LayerNorm((256,), eps=1e-05, elementwise_affine=True)(layer_norm_l): LayerNorm((768,), eps=1e-05, elementwise_affine=True)(attn): BiMultiHeadAttention((v_proj): Linear(in_features=256, out_features=2048, bias=True)(l_proj): Linear(in_features=768, out_features=2048, bias=True)(values_v_proj): Linear(in_features=256, out_features=2048, bias=True)(values_l_proj): Linear(in_features=768, out_features=2048, bias=True)(out_v_proj): Linear(in_features=2048, out_features=256, bias=True)(out_l_proj): Linear(in_features=2048, out_features=768, bias=True))(drop_path): Identity()))(13): BertEncoderLayer((attention): BertAttention((self): BertSelfAttention((query): Linear(in_features=768, out_features=768, bias=True)(key): Linear(in_features=768, out_features=768, bias=True)(value): Linear(in_features=768, out_features=768, bias=True)(dropout): Dropout(p=0.1, inplace=False))(output): BertSelfOutput((dense): Linear(in_features=768, out_features=768, bias=True)(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)(dropout): Dropout(p=0.1, 
inplace=False)))(intermediate): BertIntermediate((dense): Linear(in_features=768, out_features=3072, bias=True)(intermediate_act_fn): GELUActivation())(output): BertOutput((dense): Linear(in_features=3072, out_features=768, bias=True)(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)(dropout): Dropout(p=0.1, inplace=False)))(14): DyConv((DyConv): ModuleList((0): Conv3x3Norm((conv): ModulatedDeformConv(in_channels=256, out_channels=256, kernel_size=(3, 3), stride=1, dilation=1, padding=1, groups=1, deformable_groups=1, bias=True)(bn): GroupNorm(16, 256, eps=1e-05, affine=True))(1): Conv3x3Norm((conv): ModulatedDeformConv(in_channels=256, out_channels=256, kernel_size=(3, 3), stride=1, dilation=1, padding=1, groups=1, deformable_groups=1, bias=True)(bn): GroupNorm(16, 256, eps=1e-05, affine=True))(2): Conv3x3Norm((conv): ModulatedDeformConv(in_channels=256, out_channels=256, kernel_size=(3, 3), stride=2, dilation=1, padding=1, groups=1, deformable_groups=1, bias=True)(bn): GroupNorm(16, 256, eps=1e-05, affine=True)))(AttnConv): Sequential((0): AdaptiveAvgPool2d(output_size=1)(1): Conv2d(256, 1, kernel_size=(1, 1), stride=(1, 1))(2): ReLU(inplace=True))(h_sigmoid): h_sigmoid((relu): ReLU6(inplace=True))(relu): DYReLU((avg_pool): AdaptiveAvgPool2d(output_size=1)(fc): Sequential((0): Linear(in_features=256, out_features=64, bias=True)(1): ReLU(inplace=True)(2): Linear(in_features=64, out_features=1024, bias=True)(3): h_sigmoid((relu): ReLU6(inplace=True))))(offset): Conv2d(256, 27, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)))(15): VLFuse((b_attn): BiAttentionBlockForCheckpoint((layer_norm_v): LayerNorm((256,), eps=1e-05, elementwise_affine=True)(layer_norm_l): LayerNorm((768,), eps=1e-05, elementwise_affine=True)(attn): BiMultiHeadAttention((v_proj): Linear(in_features=256, out_features=2048, bias=True)(l_proj): Linear(in_features=768, out_features=2048, bias=True)(values_v_proj): Linear(in_features=256, out_features=2048, 
bias=True)(values_l_proj): Linear(in_features=768, out_features=2048, bias=True)(out_v_proj): Linear(in_features=2048, out_features=256, bias=True)(out_l_proj): Linear(in_features=2048, out_features=768, bias=True))(drop_path): Identity()))(16): BertEncoderLayer((attention): BertAttention((self): BertSelfAttention((query): Linear(in_features=768, out_features=768, bias=True)(key): Linear(in_features=768, out_features=768, bias=True)(value): Linear(in_features=768, out_features=768, bias=True)(dropout): Dropout(p=0.1, inplace=False))(output): BertSelfOutput((dense): Linear(in_features=768, out_features=768, bias=True)(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)(dropout): Dropout(p=0.1, inplace=False)))(intermediate): BertIntermediate((dense): Linear(in_features=768, out_features=3072, bias=True)(intermediate_act_fn): GELUActivation())(output): BertOutput((dense): Linear(in_features=3072, out_features=768, bias=True)(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)(dropout): Dropout(p=0.1, inplace=False)))(17): DyConv((DyConv): ModuleList((0): Conv3x3Norm((conv): ModulatedDeformConv(in_channels=256, out_channels=256, kernel_size=(3, 3), stride=1, dilation=1, padding=1, groups=1, deformable_groups=1, bias=True)(bn): GroupNorm(16, 256, eps=1e-05, affine=True))(1): Conv3x3Norm((conv): ModulatedDeformConv(in_channels=256, out_channels=256, kernel_size=(3, 3), stride=1, dilation=1, padding=1, groups=1, deformable_groups=1, bias=True)(bn): GroupNorm(16, 256, eps=1e-05, affine=True))(2): Conv3x3Norm((conv): ModulatedDeformConv(in_channels=256, out_channels=256, kernel_size=(3, 3), stride=2, dilation=1, padding=1, groups=1, deformable_groups=1, bias=True)(bn): GroupNorm(16, 256, eps=1e-05, affine=True)))(AttnConv): Sequential((0): AdaptiveAvgPool2d(output_size=1)(1): Conv2d(256, 1, kernel_size=(1, 1), stride=(1, 1))(2): ReLU(inplace=True))(h_sigmoid): h_sigmoid((relu): ReLU6(inplace=True))(relu): DYReLU((avg_pool): 
AdaptiveAvgPool2d(output_size=1)(fc): Sequential((0): Linear(in_features=256, out_features=64, bias=True)(1): ReLU(inplace=True)(2): Linear(in_features=64, out_features=1024, bias=True)(3): h_sigmoid((relu): ReLU6(inplace=True))))(offset): Conv2d(256, 27, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))))(cls_logits): Conv2d(256, 80, kernel_size=(1, 1), stride=(1, 1))(bbox_pred): Conv2d(256, 4, kernel_size=(1, 1), stride=(1, 1))(centerness): Conv2d(256, 1, kernel_size=(1, 1), stride=(1, 1))(dot_product_projection_image): Identity()(dot_product_projection_text): Linear(in_features=768, out_features=256, bias=True)(scales): ModuleList((0): Scale()(1): Scale()(2): Scale()(3): Scale()(4): Scale()))(loss_evaluator): ATSSLossComputation((cls_loss_func): SigmoidFocalLoss(gamma=2.0, alpha=0.25)(centerness_loss_func): BCEWithLogitsLoss()(token_loss_func): TokenSigmoidFocalLoss(gamma=2.0, alpha=0.25))(box_selector_train): ATSSPostProcessor()(box_selector_test): ATSSPostProcessor()(anchor_generator): AnchorGenerator((cell_anchors): BufferList())
)