Long-Tailed Recognition

Long-tailed data

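In conventional classification and recognition tasks, the training data distribution is usually balanced by hand, so that the number of samples does not differ noticeably across classes. A balanced dataset certainly relaxes the robustness demanded of the algorithm and, to some extent, safeguards the reliability of the resulting model. But as the number of classes of interest grows, keeping the classes balanced brings an exponentially growing collection cost.
- A simple example: for an animal classification dataset, millions of images of common animals such as cats and dogs are easy to collect, but to keep the dataset balanced we would have to collect just as many samples of rare animals such as snow leopards, and the collection cost tends to grow exponentially with the rarity of the class.
So what if we skip manual balancing altogether and simply collect all relevant data as it occurs naturally? The result is long-tailed data, the subject of this article. As the figure below shows, a few classes (head classes) hold most of the labeled training data, while a large number of remaining classes (tail classes) have only a handful of labeled samples each. Classification and recognition systems trained directly on long-tailed data tend to overfit the head classes (instance-rich) and underfit the tail classes (instance-scarce).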

Note that although the training set in a long-tailed problem is imbalanced, the test set must be balanced.

Basic methods

Data re-sampling

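Re-sampling: the sampling frequency of images from each class is weighted inversely to the class's sample count:
$$p_j=\frac{n_j^q}{\sum_{i=1}^C n_i^q}$$
where $C$ is the number of classes in the dataset, $n_i$ is the total number of samples in class $i$, and $p_j$ is the probability of sampling an image from class $j$.
- Re-sampling in the narrow sense corresponds to $q\in[0,1)$: an image from a tail class is more likely to be sampled than an image from a head class. Tail-class images may therefore be drawn over and over, so simple data augmentation such as flipping and random cropping is usually applied as well.
- Conventional instance-balanced sampling corresponds to $q=1$ in this formula: every image is sampled with equal probability.
- Class-balanced sampling corresponds to $q=0$: every class contributes the same number of samples. It can be viewed as a two-stage process: first pick a class uniformly at random, then sample uniformly among the instances of that class.
- Square-root sampling is another variant, which sets $q=1/2$.
- Progressively-balanced sampling is a hybrid strategy. Training starts with instance-balanced (IB) sampling ($p_j=p_j^{IB}$) and gradually transitions to class-balanced (CB) sampling ($p_j=p_j^{CB}$) as training proceeds:
  $$p_j^{PB}(t)=\left(1-\frac{t}{T}\right)p_j^{IB}+\frac{t}{T}p_j^{CB}$$
  where $t$ is the current epoch and $T$ is the total number of epochs.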
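The sampling strategies above can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation of any cited paper: the function names `sampling_probs`, `progressive_probs`, and `draw_sample` and the toy class sizes are assumptions made for the example. It uses the convention that $q=1$ gives instance-balanced sampling and $q=0$ gives class-balanced sampling.

```python
import random

def sampling_probs(class_counts, q):
    """Return p_j = n_j^q / sum_i n_i^q for every class j."""
    powered = [n ** q for n in class_counts]
    total = sum(powered)
    return [p / total for p in powered]

def progressive_probs(class_counts, epoch, total_epochs):
    """Interpolate linearly from instance-balanced (q=1) to class-balanced (q=0)."""
    ib = sampling_probs(class_counts, 1.0)
    cb = sampling_probs(class_counts, 0.0)
    mix = epoch / total_epochs  # this is t / T in the formula
    return [(1.0 - mix) * a + mix * b for a, b in zip(ib, cb)]

def draw_sample(samples_by_class, probs, rng):
    """Two-stage draw: pick a class with probability p_j, then a uniform instance in it."""
    class_idx = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return class_idx, rng.choice(samples_by_class[class_idx])

counts = [900, 90, 10]  # hypothetical head, medium, and tail class sizes
for q in (1.0, 0.5, 0.0):
    print(q, [round(p, 3) for p in sampling_probs(counts, q)])
print(progressive_probs(counts, epoch=5, total_epochs=10))
```

Lowering $q$ moves probability mass from the head class to the tail class, and the progressive schedule blends the two extremes epoch by epoch.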

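Drawbacks

In short, re-sampling takes data that is already imbalanced and artificially makes the training samples the model sees class-balanced, which reduces overfitting to the head classes to some extent.
However, the few tail samples are learned over and over and lack enough variation to yield robust features, while the plentiful and varied head data often does not get learned thoroughly. Re-sampling is therefore not a truly perfect solution either.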

Loss re-weighting

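Re-weighting acts mainly on the classification loss. Unlike sampling, a loss is flexible and convenient to compute, so many more complex tasks, such as object detection and instance segmentation, tend to use a re-weighted loss to handle long-tailed distributions. After all, when one image contains several objects to detect or segment, filtering them class by class during sampling is far more cumbersome than sampling at the image level, whereas re-weighting is both simple to implement and usually more flexible. Approaches range from class-level inverse weighting based on the class distribution (Cui et al., 2019; Khan et al., 2017; Cao et al., 2019; Khan et al., 2019; Huang et al., 2019) to sample-level hard example mining, which needs no class information and works directly from the classification confidence, as in focal loss, Meta-Weight-Net, and re-weighted training. More recently, Hayat et al. (2019) proposed an affinity measure that spreads the class centers uniformly and at equal distances.
We start with a general formula for the re-weighted cross-entropy loss:
$$Loss=-\beta\log\frac{\exp(z_j)}{\sum_{i=1}^C\exp(z_i)}$$
where $z_i$ is the logit output by the network for class $i$, $j$ is the ground-truth class, and $\beta$ is the re-weighting weight. Note that $\beta$ is not a constant but a computed weight that depends on the specific implementation. In general, $\beta$ gives head classes a lower weight and tail classes a higher weight, which counteracts the long-tail effect.
- The simplest re-weighting scheme can use $\beta=g\left(\frac{\sum_{i=1}^C f(n_i)}{f(n_j)}\right)$ directly, where $f$ and $g$ can be any monotonically increasing functions, such as $\log$ or a power function with a positive exponent.
- In focal loss, let sample $x_i$ have class $y_i$ and let $h_i$ be the model's predicted probability for that class; then $\beta=(1-h_i)^\gamma$.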
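The following sketch shows how the two choices of $\beta$ above plug into the re-weighted cross-entropy loss for a single sample. It is a plain-Python illustration under stated assumptions: the helper names (`class_weights`, `softmax_prob`, `reweighted_ce`, `focal_loss`), the identity defaults for $f$ and $g$, and the default $\gamma=2$ are choices made for the example, not part of the original text.

```python
import math

def class_weights(class_counts, f=lambda n: n, g=lambda x: x):
    """Class-level beta_j = g(sum_i f(n_i) / f(n_j)); f and g default to the identity.

    Note: f must be positive on every count (log would fail for a class of size 1).
    """
    total = sum(f(n) for n in class_counts)
    return [g(total / f(n)) for n in class_counts]

def softmax_prob(logits, target):
    """Softmax probability of the target class, shifted by the max logit for stability."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    return exps[target] / sum(exps)

def reweighted_ce(logits, target, beta):
    """Loss = -beta * log softmax(z)[target]."""
    return -beta * math.log(softmax_prob(logits, target))

def focal_loss(logits, target, gamma=2.0):
    """Sample-level weight beta = (1 - h)^gamma, with h the predicted target probability."""
    h = softmax_prob(logits, target)
    return reweighted_ce(logits, target, (1.0 - h) ** gamma)

counts = [900, 90, 10]           # hypothetical class sizes
betas = class_weights(counts)    # tail classes receive the larger weights
logits = [2.0, 0.5, -1.0]        # hypothetical network output for one sample
print(betas)
print(reweighted_ce(logits, target=2, beta=betas[2]))
print(focal_loss(logits, target=0), focal_loss(logits, target=2))
```

With these toy numbers the focal term shrinks the loss of the confidently predicted class 0 much more than that of the poorly predicted class 2, which is the hard-example-mining behavior described above.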

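Transfer learning (from head to tail classes)

Transfer learning: learn general knowledge from the common head classes and transfer it to the few-shot tail classes. This usually takes considerable effort to design the special modules needed for feature transfer (e.g. external memory). Recent work includes transferring the intra-class variance (Yin et al., 2019) and transferring semantic deep features (Liu et al., 2019).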

Long-tailed datasets

Generally, in long-tailed recognition tasks, the classes are divided into many-shot (more than 100 training samples), medium-shot (20 to 100 samples), and few-shot (fewer than 20 samples) splits.
The imbalance factor (IF) of a long-tailed dataset, defined as the number of samples in the largest class divided by that in the smallest class, varies from 10 to over 500 across the datasets below.
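Both definitions are easy to state in code. The sketch below is an illustration only: the function names `imbalance_factor` and `shot_split` are hypothetical, and the thresholds simply follow the many-shot, medium-shot, and few-shot definition given above.

```python
def imbalance_factor(class_counts):
    """Size of the largest class divided by the size of the smallest class."""
    return max(class_counts) / min(class_counts)

def shot_split(num_train_samples):
    """Assign a class to a split by its number of training samples."""
    if num_train_samples > 100:
        return "many-shot"
    if num_train_samples >= 20:
        return "medium-shot"
    return "few-shot"

counts = [1280, 100, 50, 19, 5]  # hypothetical per-class training counts
print(imbalance_factor(counts))
print([shot_split(n) for n in counts])
```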

Long-tailed CIFAR

Both CIFAR-10 and CIFAR-100 contain 60,000 images, 50,000 for training and 10,000 for validation, with 10 and 100 categories, respectively. CIFAR-10-LT and CIFAR-100-LT contain the same categories as the original CIFAR-10 and CIFAR-100. However, they are created by reducing the number of training samples per class according to an exponential function $n=n_t\times\mu^t$, where $t$ is the class index (0-indexed), $n_t$ is the original number of training images, and $\mu\in(0,1)$. The test set remains unchanged.
The imbalance factor of a long-tailed CIFAR dataset is defined as the number of training samples in the largest class divided by that of the smallest, which ranges from 10 to 200. In the literature, imbalance factors of 50 and 100 are widely used, with around 12,000 training images under each imbalance factor.
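To make the exponential decay concrete, the sketch below generates per-class training counts for a target imbalance factor. It rests on two assumptions that the text does not state: every class starts with the same number of images $n_t$, so that the imbalance factor equals $\mu^{-(C-1)}$ and hence $\mu^t=(1/\mathrm{IF})^{t/(C-1)}$, and fractional counts are truncated to integers. The function name `long_tailed_counts` is hypothetical.

```python
def long_tailed_counts(n_max, num_classes, imbalance_factor):
    """Per-class counts n = n_max * mu^t, with mu chosen so that first/last = imbalance_factor."""
    counts = []
    for t in range(num_classes):
        decay = (1.0 / imbalance_factor) ** (t / (num_classes - 1))  # equals mu^t
        counts.append(int(n_max * decay))  # assumption: truncate to an integer count
    return counts

# CIFAR-10 has 5,000 training images per class; IF = 100 is one widely used setting.
cifar10_lt = long_tailed_counts(n_max=5000, num_classes=10, imbalance_factor=100)
print(cifar10_lt)
print(sum(cifar10_lt))
```

Under these assumptions the total lands near the roughly 12,000 training images mentioned above.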

iNaturalist 2018

The iNaturalist species classification datasets are large-scale real-world datasets for species identification of animals and plants that suffer from extremely imbalanced label distributions.
The most challenging iNaturalist dataset is the 2018 version, which contains 437,513 images from 8,142 categories. Besides the extreme imbalance (IF=512), the iNaturalist datasets also pose a fine-grained classification problem.

Long-Tailed ImageNet

The long-tailed ImageNet (ImageNet-LT) is derived from the original ImageNet-2012 by sampling a subset that follows the Pareto distribution with power value $\alpha=6$. It consists of 115.8K images from 1,000 categories, with at most 1,280 and at least 5 images per class. The test set is balanced, following Liu et al. (2019).

Places-LT

Places-LT has an imbalanced training set with 62,500 images for 365 classes from Places-2. The class frequencies follow a natural power law distribution with a maximum number of 4,980 images per class and a minimum number of 5 images per class. The validation and testing sets are balanced and contain 20 and 100 images per class respectively.

References

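An introduction to classification under long-tailed distributions and its basic methods
Recent research on classification under long-tailed distributions (continuously updated)
Recent research on object detection and instance segmentation under long-tailed distributions
A brand-new general algorithm for classification under long-tailed distributions
Decoupling Representation and Classifier for Long-Tailed Recognition, ICLR 2020
Places-LT, ImageNet-LT, iNaturalist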
Zhang, Yongshun, et al. “Bag of tricks for long-tailed visual recognition with deep convolutional neural networks.” Proceedings of the AAAI conference on artificial intelligence. Vol. 35. No. 4. 2021.
Cai, Jiarui, Yizhou Wang, and Jenq-Neng Hwang. “Ace: Ally complementary experts for solving long-tailed recognition in one-shot.” Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021.
