Paper notes: "Harmonizing Transferability and Discriminability for Adapting Object Detectors"

CVPR2020 | Code

Idea: the paper first points out that current adversarial-based methods, such as

image and instance levels alignment [7], strong-local and weak-global alignment [44], local-region alignment based on region proposal [62], multi-level feature alignment with prediction-guided instance-level constraint

do not take into account that transferability and discriminability may conflict with each other:

the transferability refers to the invariance of the learned representations across domains,
and discriminability refers to the ability of the detector to localize and distinguish different instances.

The authors therefore propose the Hierarchical Transferability Calibration Network (HTCN) to balance the two.

Importance Weighted Adversarial Training with Input Interpolation (IWAT-I)

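CycleGAN is first used to generate synthetic images for both the source and the target domain. The resulting images are then assigned weights according to their cross-domain similarity.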

Our key insight is that not all images are created equally in terms of transferability especially after interpolation

The output of the domain classifier $D_2$, $d_i = D_2(G_1 \cdot G_2(x_i))$, is plugged into the entropy function:

$$v_i = H(d_i) = -d_i \cdot \log(d_i) - (1 - d_i) \cdot \log(1 - d_i)$$

The larger this value, the higher the uncertainty of the sample, and the larger the weight given to the feature $f_i$ that is fed into $D_3$.

$$g_i = f_i \times (1 + v_i)$$
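As a rough illustration of the weighting above, the sketch below computes the entropy $v_i$ from a single domain-classifier probability and rescales a feature vector by $(1 + v_i)$. The function names `binary_entropy` and `reweight_feature`, the clamping constant `eps`, and the use of the natural logarithm are assumptions made for this example; they are not taken from the paper or its code.

```python
import math


def binary_entropy(d, eps=1e-12):
    """Binary entropy H(d) = -d*log(d) - (1-d)*log(1-d), natural log (assumed)."""
    # Clamp to avoid log(0); eps is an illustrative choice.
    d = min(max(d, eps), 1.0 - eps)
    return -d * math.log(d) - (1.0 - d) * math.log(1.0 - d)


def reweight_feature(f, d):
    """Return g = f * (1 + v), where v = H(d) is the entropy of the domain prediction."""
    v = binary_entropy(d)
    return [x * (1.0 + v) for x in f]


# Hypothetical domain-classifier outputs for two images.
confident = reweight_feature([1.0, 2.0], d=0.99)  # low entropy, factor close to 1
uncertain = reweight_feature([1.0, 2.0], d=0.5)   # maximal entropy, factor 1 + ln 2
```

With these definitions the scaling factor always lies between 1 and 1 + ln 2, so a sample is never down-weighted; uncertain samples are only emphasized relative to confident ones.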

The adversarial loss of the domain classifier $D_3$ can then be written as:

$$\mathcal{L}_{ga} = \mathbb{E}[\log(D_3(G_3(g_i^s)))] + \mathbb{E}[1 - \log(D_3(G_3(g_i^t)))]$$

This calibrates transferability at the image level.

Context-Aware Instance-Level Alignment (CILA)

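The authors argue that in common instance-level alignment, the feature vector of each ROI represents only a local instance and ignores the global context. They therefore propose fusing $f_{ins}$ (a feature vector of length 128) with the feature vectors from the different levels $G_i$: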

$$f_{fus} = [f_c^1, f_c^2, f_c^3] \otimes f_{ins}$$

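Unlike strong-weak alignment, however, such a tensor product operation can cause a dimension explosion, so random sampling is applied first and the results are then combined with a Hadamard (element-wise) product.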

$$f_{fus} = \frac{1}{\sqrt{d}}(R_1 f_c) \odot (R_2 f_{ins})$$
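The randomized fusion above can be sketched in a few lines of plain Python. This is only an illustration under several assumptions: $f_c$ is taken to be the concatenation of the three context vectors, $R_1$ and $R_2$ are drawn once from a standard normal distribution and then kept fixed, and the fused dimension `d = 64` as well as the helper names `random_matrix`, `matvec`, and `randomized_fusion` are hypothetical.

```python
import math
import random


def random_matrix(rows, cols, rng):
    """Random projection matrix with standard-normal entries (an assumed choice)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def matvec(m, v):
    """Multiply matrix m (a list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def randomized_fusion(f_c, f_ins, r1, r2):
    """f_fus = (R1 f_c) * (R2 f_ins) / sqrt(d), element-wise, d = output dimension."""
    d = len(r1)
    p_c = matvec(r1, f_c)
    p_ins = matvec(r2, f_ins)
    return [a * b / math.sqrt(d) for a, b in zip(p_c, p_ins)]


rng = random.Random(0)
f_c = [rng.random() for _ in range(384)]    # assumed: three concatenated 128-d context vectors
f_ins = [rng.random() for _ in range(128)]  # instance feature, length 128 as stated in the text
d = 64                                      # hypothetical fused dimension
r1 = random_matrix(d, len(f_c), rng)        # assumed to be sampled once and kept fixed
r2 = random_matrix(d, len(f_ins), rng)
f_fus = randomized_fusion(f_c, f_ins, r1, r2)
# A full tensor product would have 384 * 128 = 49152 entries; f_fus has only d.
```

The point of the construction is that the output size is controlled by `d` alone, independent of the product of the two input lengths.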

Local Feature Mask for Semantic Consistency

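The same object may appear in different scenes across domains, but its semantics should stay invariant and therefore be matchable. The authors thus argue that some local regions of an image are better suited to transfer than others.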

we assume that some local regions of the whole image are more descriptive and dominant than others.

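$D_1$ is a pixel-wise domain classifier (operating on the feature map produced by $G_1$). As before, the entropy of the classifier output is used to determine the uncertainty of each region (with respect to the original image).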

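For each position $k$ on the feature map, $m_f^k = 2 - H(d_i^k)$ is computed to obtain the corresponding mask map (the lower the uncertainty of a region, the better its transferability), which is used as an attention map.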
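A minimal sketch of the mask computation described above, assuming a single-channel feature map stored as nested lists and a natural-log entropy. Applying the mask by element-wise multiplication is also an assumption here, since the text only says the mask serves as an attention map; `local_feature_mask` and `apply_mask` are hypothetical helper names.

```python
import math


def binary_entropy(d, eps=1e-12):
    """Binary entropy of a probability d, natural log (assumed), clamped for stability."""
    d = min(max(d, eps), 1.0 - eps)
    return -d * math.log(d) - (1.0 - d) * math.log(1.0 - d)


def local_feature_mask(domain_map):
    """m_f^k = 2 - H(d^k) for every position k of a 2-D map of domain probabilities."""
    return [[2.0 - binary_entropy(d) for d in row] for row in domain_map]


def apply_mask(feature_map, mask):
    """Rescale each position of a single-channel feature map by its mask value (assumed usage)."""
    return [[f * m for f, m in zip(f_row, m_row)]
            for f_row, m_row in zip(feature_map, mask)]


# Hypothetical 2x2 output of the pixel-wise domain classifier D_1.
domain_map = [[0.5, 0.9],
              [0.1, 0.99]]
mask = local_feature_mask(domain_map)
masked = apply_mask([[1.0, 1.0], [1.0, 1.0]], mask)
```

Positions where the classifier is confident (probability near 0 or 1) receive a mask value close to 2, while maximally uncertain positions (probability 0.5) receive the smallest value, 2 - ln 2.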

Summary

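The authors use information entropy as a criterion for reading the domain classifiers' outputs, and with it mine hard-to-transfer images (IWAT-I) and easy-to-transfer regions (Local Feature Mask).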