Abstract

Recognizing objects from subcategories with very subtle differences remains a challenging task due to the large intra-class and small inter-class variation. Recent work tackles this problem in a weakly-supervised manner: object parts are first detected and the corresponding part-specific features are extracted for fine-grained classification. However, these methods typically treat the part-specific features of each image in isolation while neglecting their relationships between different images. In this paper, we propose Cross-X learning, a simple yet effective approach that exploits the relationships between different images and between different network layers for robust multi-scale feature learning. Our approach involves two novel components: (i) a cross-category cross-semantic regularizer that guides the extracted features to represent semantic parts and, (ii) a cross-layer regularizer that improves the robustness of multi-scale features by matching the prediction distribution across multiple layers. Our approach can be easily trained end-to-end and is scalable to large datasets like NABirds. We empirically analyze the contributions of different components of our approach and demonstrate its robustness, effectiveness and state-of-the-art performance on five benchmark datasets. Code is available at https://github.com/cswluo/CrossX.

Implementation details

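Figure 1. Network architecture. The network outputs multiple feature maps by using OSME blocks. To illustrate the approach, two OSME blocks, each with two excitations, are shown in the last two stages. Feature maps from stage L−1 (blue) and stage L (red) are combined to generate the merged feature maps (orange). The top left shows a zoomed-in view of the merging procedure. The feature maps are then aggregated by GAP or GMP to obtain the corresponding pooled features. Pooled features from the same stage are mutually constrained by the C3S regularizer and, at the same time, concatenated and fed to the fc layer to generate logits. After being converted to class probabilities, the logits are constrained by the CL regularizer and combined for classification.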
———————————————————————————————————————

OSME block

The OSME (one-squeeze multi-excitation) block is easiest to understand by reading the code alongside the architecture diagram:

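import torch.nn as nn


class MELayer(nn.Module):
    """One-squeeze multi-excitation layer: one pooled descriptor, nparts gates."""

    def __init__(self, channel, reduction=16, nparts=1):
        super(MELayer, self).__init__()
        # Squeeze: global average pooling, shared by all excitation branches.
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        self.nparts = nparts
        parts = list()
        for part in range(self.nparts):
            # One excitation branch: bottleneck MLP followed by a sigmoid gate.
            parts.append(nn.Sequential(
                nn.Linear(channel, channel // reduction),
                nn.ReLU(inplace=True),
                nn.Linear(channel // reduction, channel),
                nn.Sigmoid()
            ))
        # nn.Sequential is used here only as an indexable container of branches.
        self.parts = nn.Sequential(*parts)

    def forward(self, x):
        b, c, _, _ = x.size()
        y = self.avg_pool(x).view(b, c)
        meouts = list()
        for i in range(self.nparts):
            # Rescale the input channel-wise with the gate of the i-th branch.
            meouts.append(x * self.parts[i](y).view(b, c, 1, 1))
        return meouts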

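Let $U=\left[u_{1}, \cdots, u_{C}\right] \in \mathbb{R}^{W \times H \times C}$ denote the output feature map of a residual block $\tau$. To generate multiple attention-specific feature maps, the OSME block extends the original residual block by performing one squeeze (pooling) operation and multiple excitation (FC) operations. Although OSME can produce attention-specific features, guiding these features to be semantically meaningful is challenging.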
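As a reading aid, the squeeze and excitation steps can be written out as equations. This is a sketch read off the `MELayer` code, not the paper's own notation: $\mathbf{z}$, $\mathbf{m}^{p}$, $\mathbf{W}_{1}^{p}$, $\mathbf{W}_{2}^{p}$ and $r$ (the `reduction` argument) are symbols assumed here, $\delta$ is ReLU, $\sigma$ is the sigmoid, and the bias terms of the linear layers are omitted.

```latex
\begin{aligned}
\mathbf{z} &= \operatorname{GAP}(U) \in \mathbb{R}^{C},
  \qquad z_{c} = \frac{1}{W H} \sum_{w=1}^{W} \sum_{h=1}^{H} u_{c}(w, h) \\
\mathbf{m}^{p} &= \sigma\left(\mathbf{W}_{2}^{p}\, \delta\left(\mathbf{W}_{1}^{p} \mathbf{z}\right)\right),
  \qquad \mathbf{W}_{1}^{p} \in \mathbb{R}^{\frac{C}{r} \times C},\;
         \mathbf{W}_{2}^{p} \in \mathbb{R}^{C \times \frac{C}{r}},\;
         p = 1, \cdots, P \\
\mathbf{U}_{p} &= \left[m_{1}^{p} u_{1}, \cdots, m_{C}^{p} u_{C}\right] \in \mathbb{R}^{W \times H \times C}
\end{aligned}
```

The single squeeze $\mathbf{z}$ is shared by all $P$ excitation branches, so the branches differ only in their gating vectors $\mathbf{m}^{p}$.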

Cross-category cross-semantic regularization ($C^{3}S$)

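To address this challenge, the authors propose to learn semantic features by exploiting the correlations between feature maps from different images and from different excitation modules. Ideally, features extracted from the same excitation module should share the same semantic meaning, even when they come from different images with different class labels. Features extracted from different excitation modules should have different semantic meanings, even when they come from the same image. To achieve this, the authors introduce the cross-category cross-semantic regularizer ($C^{3}S$), which maximizes the correlation of features from the same excitation module while minimizing the correlation of features from different excitation modules.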

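Figure 2. Illustration of $C^{3}S$. Taking the middle image as an example, $C^{3}S$ encourages the excitation modules $U_{1}$ and $U_{2}$ to be activated on different semantic parts by exploiting the relationships between features from different images (orange dashed boxes) and between features from different excitation modules (blue shaded boxes).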
———————————————————————————————————————
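The regularization loss is defined as:
$$\mathcal{L}_{C^{3} S}(S)=\frac{1}{2}\left(\|S\|_{F}^{2}-2\|\operatorname{diag}(S)\|_{2}^{2}\right)$$
where each entry of the correlation matrix $S$ is
$$S_{p, p^{\prime}}=\frac{1}{N^{2}} \sum \mathbf{F}_{p}^{T} \mathbf{F}_{p^{\prime}}$$
Each pooled feature is first $\ell_2$-normalized,
$$\mathbf{f}_{p} \leftarrow \mathbf{f}_{p} /\left\|\mathbf{f}_{p}\right\|$$
and the normalized features of the $N$ images in a batch are stacked into
$$\mathbf{F}_{p}=\left[\mathbf{f}_{p, 1}, \cdots, \mathbf{f}_{p, N}\right] \in \mathbb{R}^{C \times N}$$
The regularizer is simply a regularization loss. Its input is the set of same-colored cubes (feature maps) in the architecture diagram, and the loss is built from two parts: 1) maximizing the diagonal of $S$, which maximizes the correlation within the same excitation module; 2) penalizing the norm of $S$, which minimizes the correlation between different excitation modules.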
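The $C^{3}S$ formula can be checked by hand on a toy batch. The following is a minimal sketch in plain Python that follows the formula term by term; it is not the author's implementation, and the helper names `l2_normalize`, `correlation_matrix` and `c3s_loss` as well as the toy features are assumptions made for illustration.

```python
import math


def l2_normalize(v):
    """Scale a feature vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def correlation_matrix(parts):
    """Build S, where parts[p][n] is the pooled feature of image n from excitation p.

    S[p][q] is the mean of all N*N dot products between the normalized
    features of excitation p and those of excitation q.
    """
    num_parts = len(parts)
    num_images = len(parts[0])
    normed = [[l2_normalize(f) for f in feats] for feats in parts]
    S = [[0.0] * num_parts for _ in range(num_parts)]
    for p in range(num_parts):
        for q in range(num_parts):
            total = sum(
                sum(a * b for a, b in zip(fi, fj))
                for fi in normed[p]
                for fj in normed[q]
            )
            S[p][q] = total / (num_images * num_images)
    return S


def c3s_loss(S):
    """0.5 * (squared Frobenius norm of S - 2 * squared L2 norm of diag(S))."""
    frob_sq = sum(v * v for row in S for v in row)
    diag_sq = sum(S[p][p] ** 2 for p in range(len(S)))
    return 0.5 * (frob_sq - 2.0 * diag_sq)


# Toy batch (hypothetical values): 2 excitations, 2 images, 2-dimensional features.
# Disentangled: the two excitations respond along orthogonal directions.
disentangled = [[[2.0, 0.0], [1.0, 0.0]], [[0.0, 3.0], [0.0, 1.0]]]
# Entangled: both excitations respond along the same direction.
entangled = [[[2.0, 0.0], [1.0, 0.0]], [[1.0, 0.0], [4.0, 0.0]]]

loss_disentangled = c3s_loss(correlation_matrix(disentangled))
loss_entangled = c3s_loss(correlation_matrix(entangled))
print(loss_disentangled, loss_entangled)
```

In the disentangled case $S$ is the identity matrix and the loss is −1; in the entangled case every entry of $S$ is 1 and the loss is 0. The lower value for the disentangled batch is the behavior the regularizer is meant to reward.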
Code implementation:

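# Setup: one C3S regularizer per feature source, each with its own weight.
# The names suggest ulti = last stage, plty = penultimate stage, cmbn = merged.
reg_loss_ulti = RegularLoss(gamma=gamma1, nparts=nparts)
reg_loss_plty = RegularLoss(gamma=gamma2, nparts=nparts)
reg_loss_cmbn = RegularLoss(gamma=gamma3, nparts=nparts)
###############################################
# Inside the training loop: the model returns logits and pooled part features.
outputs_ulti, outputs_plty, outputs_cmbn, ulti_ftrs, plty_ftrs, cmbn_ftrs = model(inputs)
# Results are stored under new names so the regularizer modules are not overwritten.
loss_reg_cmbn = reg_loss_cmbn(cmbn_ftrs)
loss_reg_ulti = reg_loss_ulti(ulti_ftrs)
loss_reg_plty = reg_loss_plty(plty_ftrs)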

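import torch
import torch.nn as nn


class RegularLoss(nn.Module):

    def __init__(self, gamma=0, part_features=None, nparts=2):
        """
        :param gamma: weight applied to the regularization loss
        :param part_features: unused, kept so the signature stays unchanged
        :param nparts: number of excitation modules (parts)
        """
        super(RegularLoss, self).__init__()
        # self.register_buffer('part_features', part_features)
        self.nparts = nparts
        self.gamma = gamma

    def forward(self, x):
        assert isinstance(x, list), "parts features should be presented in a list"
        # Allocated on the input's device; the original relied on a global `device`.
        corr_matrix = torch.zeros(self.nparts, self.nparts, device=x[0].device)
        for i in range(self.nparts):
            # Note: the list entries are normalized in place.
            x[i] = x[i].squeeze()
            x[i] = torch.div(x[i], x[i].norm(dim=1, keepdim=True))

        # original design
        for i in range(self.nparts):
            for j in range(self.nparts):
                # Mean pairwise correlation between parts i and j over the batch.
                corr_matrix[i, j] = torch.mean(torch.mm(x[i], x[j].t()))
                if i == j:
                    # Diagonal entries are penalized for being below 1.
                    corr_matrix[i, j] = 1.0 - corr_matrix[i, j]
        # Sum the upper triangle (diagonal included) and scale by gamma.
        regloss = torch.mul(torch.sum(torch.triu(corr_matrix)), self.gamma)
        return regloss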

Cross-layer regularization (CL)

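Exploiting semantic features from different layers of a CNN benefits many vision tasks. A straightforward way to extend this idea to fine-grained recognition is to combine the predictions of different layers for the final prediction. However, this simple strategy usually leads to inferior performance. The authors hypothesize two causes: 1) mid-level features are more sensitive to input variations, which makes them less robust for fine-grained recognition, where intra-class variation is large; 2) the relationships between the predictions made from different features are not exploited. To address these issues, they adopt a feature pyramid network (FPN) to integrate features from different layers and propose a new cross-layer regularizer (CL) that learns robust features by matching the prediction distributions across layers.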

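$$\mathbf{U}_{p}^{G}=\mathbf{BN}\left(\mathbf{K}_{2} *\left(\mathbf{U}_{p}^{L-1}+\operatorname{Bilinear}\left(\mathbf{K}_{1} * \mathbf{U}_{p}^{L}\right)\right)\right)$$
where $\mathbf{K}_{1}$ and $\mathbf{K}_{2}$ are 1 × 1 and 3 × 3 filters, and $\operatorname{Bilinear}(\cdot)$ denotes bilinear interpolation.
$\mathbf{U}^{G}$ combines the high spatial resolution of the intermediate layer with the rich semantics of the top layer. To further exploit the relationships between the predictions made from different features, the authors propose the CL regularizer, which matches the prediction distributions of different layers.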
$$\begin{aligned} \mathcal{L}_{C L}\left(\mathbf{Pr}^{L}, \mathbf{Pr}^{L-1}\right) &=\mathrm{KL}\left(\mathbf{Pr}^{L} \| \mathbf{Pr}^{L-1}\right) \\ &=\frac{1}{N} \sum_{n=1}^{N} \sum_{k=1}^{K} p_{n k}^{L} \log \frac{p_{n k}^{L}}{p_{n k}^{L-1}} \end{aligned}$$
$$\mathbf{Pr}^{L-1}=\sigma\left(f\left(\mathbf{U}^{L-1}\right)\right)$$
$$\mathbf{Pr}^{L}=\sigma\left(f\left(\mathbf{U}^{L}\right)\right)$$
where $\sigma(\cdot)$ is the softmax function and $f(\cdot)$ denotes the output layer.
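The CL term is a batch-averaged KL divergence between two softmax outputs, which is easy to verify on small numbers. The following is a minimal sketch in plain Python under that reading; it is not taken from the CrossX repository, and the function names `softmax` and `cl_loss`, the `eps` guard and the toy logits are assumptions made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cl_loss(probs_top, probs_other, eps=1e-12):
    """Batch-averaged KL(Pr^L || Pr^other) over N samples and K classes.

    probs_top[n][k] plays the role of p_nk^L and probs_other[n][k] the role of
    p_nk^{L-1} (or p_nk^G). eps only guards against log(0) in the denominator.
    """
    num_samples = len(probs_top)
    total = 0.0
    for p_row, q_row in zip(probs_top, probs_other):
        for p, q in zip(p_row, q_row):
            if p > 0.0:
                total += p * math.log(p / max(q, eps))
    return total / num_samples


# Toy logits (hypothetical values) for a batch of 2 images and 3 classes.
logits_top = [[2.0, 0.5, 0.1], [0.2, 1.7, 0.3]]
logits_mid = [[1.0, 1.0, 0.2], [0.1, 0.9, 0.8]]
pr_top = [softmax(row) for row in logits_top]
pr_mid = [softmax(row) for row in logits_mid]

mismatch = cl_loss(pr_top, pr_mid)   # positive when the two layers disagree
match = cl_loss(pr_top, pr_top)      # zero when the distributions coincide
print(mismatch, match)
```

Because KL divergence is asymmetric, the order of the arguments matters: here the top-layer distribution $\mathbf{Pr}^{L}$ is the first argument, as in the formula.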
The final prediction combines the outputs of the different branches:
$$\mathbf{Pr}=\sigma\left(f\left(\mathbf{U}^{L}\right)+f\left(\mathbf{U}^{L-1}\right)+f\left(\mathbf{U}^{G}\right)\right)$$
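The combination rule sums the three logit vectors before a single softmax, rather than averaging three separate probability vectors. A minimal sketch in plain Python, with hypothetical logits and an assumed helper name `combined_prediction`:

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def combined_prediction(logits_top, logits_mid, logits_merged):
    """Softmax of the element-wise sum of f(U^L), f(U^{L-1}) and f(U^G)."""
    summed = [a + b + c for a, b, c in zip(logits_top, logits_mid, logits_merged)]
    return softmax(summed)


# Hypothetical logits for one image and three classes.
probs = combined_prediction([2.0, 0.5, 0.1], [0.3, 1.5, 0.2], [1.0, 1.0, 0.5])
predicted_class = max(range(len(probs)), key=probs.__getitem__)
print(probs, predicted_class)
```

In this toy case the middle branch alone would pick class 1, but the summed logits are largest for class 0, so the combined prediction is class 0.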

The final loss is:
$$\mathcal{L}=\mathcal{L}_{\text{data}}+\gamma \mathcal{L}_{C^{3} S}+\lambda \mathcal{L}_{C L}$$
$$\mathcal{L}_{\text{data}}=-\frac{1}{N} \sum_{n=1}^{N} \sum_{k=1}^{K} c_{n k} \log p_{n k}$$
$$\mathcal{L}_{C^{3} S}=\gamma_{1} \mathcal{L}_{C^{3} S}\left(S^{L}\right)+\gamma_{2} \mathcal{L}_{C^{3} S}\left(S^{L-1}\right)+\gamma_{3} \mathcal{L}_{C^{3} S}\left(S^{G}\right)$$
$$\mathcal{L}_{C L}=\lambda_{1} \mathcal{L}_{C L}\left(\mathbf{Pr}^{L}, \mathbf{Pr}^{L-1}\right)+\lambda_{2} \mathcal{L}_{C L}\left(\mathbf{Pr}^{L}, \mathbf{Pr}^{G}\right)$$
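To see how the terms fit together, the total loss can be assembled from already-computed pieces. The sketch below is plain Python and makes one assumption explicit: the global weights $\gamma$ and $\lambda$ are treated as folded into the per-term weights $\gamma_{1..3}$ and $\lambda_{1..2}$, which matches how `RegularLoss` applies its own `gamma`. The function names and all numeric values are hypothetical.

```python
import math


def cross_entropy(probs, labels):
    """L_data: mean negative log-probability of the true class.

    probs[n][k] is p_nk and labels[n] is the index k where c_nk = 1.
    """
    return -sum(math.log(p[y]) for p, y in zip(probs, labels)) / len(labels)


def total_loss(l_data, c3s_terms, cl_terms, gammas, lambdas):
    """L_data + sum_i gamma_i * C3S_i + sum_j lambda_j * CL_j.

    c3s_terms: C3S losses for S^L, S^{L-1}, S^G (in that order).
    cl_terms:  CL losses for (Pr^L, Pr^{L-1}) and (Pr^L, Pr^G).
    """
    l_c3s = sum(g * t for g, t in zip(gammas, c3s_terms))
    l_cl = sum(w * t for w, t in zip(lambdas, cl_terms))
    return l_data + l_c3s + l_cl


# Hypothetical values, chosen only to make the arithmetic easy to follow.
l_data = cross_entropy([[0.5, 0.5], [0.25, 0.75]], [0, 1])
loss = total_loss(
    l_data=1.0,
    c3s_terms=[-1.0, -0.5, 0.0],
    cl_terms=[0.2, 0.1],
    gammas=[0.5, 0.5, 0.5],
    lambdas=[1.0, 1.0],
)
print(l_data, loss)
```

With these numbers the weighted $C^{3}S$ part contributes −0.75 and the CL part contributes 0.3, so the total is 0.55. Suitable weights are dataset dependent and are not specified in this post.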

Personal website: https://aydenfan.github.io/

[Paper reading] Cross-X Learning for Fine-Grained Visual Categorization