Improving Few-Shot Learning with Auxiliary Self-Supervised Pretext Tasks


**Combining supervised and self-supervised tasks for meta-training**

Figure 1: We train the embedding model $F_{\theta}$ with both annotated images and unlabeled images in a multi-task setting. Self-supervised tasks such as rotation prediction or representation prediction (BYOL) act as a data-dependent regularizer for the shared feature extractor $F_{\theta}$. Although additional unlabeled data can be used for the self-supervised tasks, in this work we sample images from the annotated set.

Abstract

Recent work on few-shot learning (Tian et al., 2020a) showed that the quality of learned representations plays an important role in few-shot classification performance. On the other hand, the goal of self-supervised learning is to recover useful semantic information of the data without the use of class labels. In this work, we exploit the complementarity of both paradigms via a multi-task framework where we leverage recent self-supervised methods as auxiliary tasks. We found that combining multiple tasks is often beneficial, and that solving them simultaneously can be done efficiently. Our results suggest that self-supervised auxiliary tasks are effective data-dependent regularizers for representation learning.

3. Method

In this section, we first describe in §3.1 the few-shot learning problem addressed and introduce in §3.2 the proposed multi-task approach to improve few-shot performance with self-supervised auxiliary tasks.

3.1. Problem Formulation

Standard few-shot learning benchmarks evaluate models in episodes of N-way, K-shot classification tasks. Each task consists of a small number of N classes with K training examples per class. Meta-learning approaches for few-shot learning aim to minimize the generalization error across a distribution of tasks sampled from a task distribution. This can be thought of as learning over a collection of tasks $\mathcal{T}=\left\{\left(\mathcal{D}_{i}^{\text {train }}, \mathcal{D}_{i}^{\text {test }}\right)\right\}_{i=1}^{I}$ , commonly referred to as the meta-training set.

In practice, a task is constructed on the fly during the meta-training phase and sampled as follows. For each task, N classes from the set of training classes are first sampled (with replacement), from which the training (support) set $\mathcal{D}_{i}^{\text{train}}$ of K images per class is sampled, and finally the test (query) set $\mathcal{D}_{i}^{\text{test}}$ consisting of Q images per class is sampled. The support set is used to learn how to solve this specific task, and the additional examples from the query set are used to evaluate the performance for this task. Once the meta-training phase of a model is finished, its performance is evaluated on a set of held-out tasks $\mathcal{S}=\left\{\left(\mathcal{D}_{j}^{\text{train}}, \mathcal{D}_{j}^{\text{test}}\right)\right\}_{j=1}^{J}$, called the meta-test set. During meta-training, an additional held-out meta-validation set can be used for hyperparameter selection and model selection. Training examples $\mathcal{D}^{\text{train}}=\left\{\left(x_{t}, y_{t}\right)\right\}_{t=1}^{T}$ and testing examples $\mathcal{D}^{\text{test}}=\left\{\left(x_{q}, y_{q}\right)\right\}_{q=1}^{Q}$ are sampled from the same distribution, and are mapped to a feature space using an embedding model $F_{\theta}$. A base learner is trained on $\mathcal{D}^{\text{train}}$ and used as a predictor on $\mathcal{D}^{\text{test}}$.
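The episode construction described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `sample_episode`, the `labels` list, and the seeded `rng` argument are all hypothetical, and the sketch assumes that the N classes inside a single episode are distinct (classes can still recur across episodes) and that support and query images of a class do not overlap.

```python
import random
from collections import defaultdict

def sample_episode(labels, n_way, k_shot, q_query, rng=None):
    """Sample one N-way, K-shot episode.

    Returns (support, query), each a list of (dataset_index, episode_label) pairs.
    """
    rng = rng or random.Random()
    # Group dataset indices by their class label.
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    # Assumption: classes within one episode are distinct.
    classes = rng.sample(sorted(by_class), n_way)
    support, query = [], []
    for episode_label, c in enumerate(classes):
        # Draw K support and Q query images of this class without overlap.
        picked = rng.sample(by_class[c], k_shot + q_query)
        support += [(i, episode_label) for i in picked[:k_shot]]
        query += [(i, episode_label) for i in picked[k_shot:]]
    return support, query

# Toy dataset: 10 classes with 20 images each (labels only).
labels = [c for c in range(10) for _ in range(20)]
support, query = sample_episode(labels, n_way=5, k_shot=1, q_query=15,
                                rng=random.Random(0))
print(len(support), len(query))  # 5 support items, 75 query items
```

Class labels are re-indexed to 0..N-1 inside each episode, since the base learner only needs to separate the N classes of the current task.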


BYOL.

In BYOL (Grill et al., 2020), the online network directly predicts the output of one view from another view given by the target network. Essentially, this is a representation prediction task in the latent space, similar to contrastive learning except that it relies only on positive pairs. In this task, the online network is composed of the shared encoder $F_{\theta}$, the MLP projection head $g_{\phi}$ and the predictor $q_{\phi}$ (also an MLP). The target network has the same architecture as the online network (minus the predictor), but its parameters are an exponential moving average (EMA) of the online network parameters, as illustrated in Figure 1. Denoting the parameters of the online network as $\theta_{O}=\{\theta, \phi\}$, those of the target network as $\xi$ and the target decay rate as $\tau \in[0,1)$, the update rule for $\xi$ is:

$$\xi \leftarrow \tau \xi+(1-\tau) \theta_{O}$$
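As a rough illustration of the target-network update, the sketch below applies the standard BYOL moving-average rule, target ← τ · target + (1 − τ) · online, to parameters stored as NumPy arrays. The dictionary layout and parameter names (`encoder.w`, `projector.w`, `predictor.w`) are hypothetical; the only point taken from the text is that the target network has no predictor, so the predictor weights are never averaged.

```python
import numpy as np

def ema_update(target_params, online_params, tau):
    """In-place EMA update of every target parameter from its online counterpart."""
    if not 0.0 <= tau < 1.0:
        raise ValueError("tau must lie in [0, 1)")
    for name, xi in target_params.items():
        xi *= tau                                  # keep a fraction tau of the old target
        xi += (1.0 - tau) * online_params[name]    # blend in the online weights
    return target_params

# Hypothetical parameter dictionaries; the predictor exists only in the online network.
online = {
    "encoder.w": np.ones((2, 2)),
    "projector.w": np.full((2,), 2.0),
    "predictor.w": np.zeros(2),
}
target = {
    "encoder.w": np.zeros((2, 2)),
    "projector.w": np.zeros(2),
}
ema_update(target, online, tau=0.99)
```

With τ close to 1 the target network changes slowly, which is what makes it a stable regression target for the online network.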

The self-supervised loss $L_{BYOL}$ is the mean squared error between the normalized predictions and target projections as defined in Grill et al. (2020). Effectively, this task enforces the representations for different views of positive pairs to be closer together in latent space, which provides transformation invariance to the pre-defined set of data augmentations used for BYOL.
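The loss can be sketched as follows, assuming batches of prediction and target-projection vectors stored as 2-D NumPy arrays (one row per image). The function name `byol_loss` and the `eps` stabilizer are illustrative choices. For unit-normalized vectors the squared error equals 2 − 2·cos(p, z), so the value lies between 0 (aligned) and 4 (opposite).

```python
import numpy as np

def byol_loss(prediction, target_projection, eps=1e-12):
    """Batch mean of ||normalize(p) - normalize(z)||^2, i.e. 2 - 2 * cosine similarity."""
    p = prediction / (np.linalg.norm(prediction, axis=1, keepdims=True) + eps)
    z = target_projection / (np.linalg.norm(target_projection, axis=1, keepdims=True) + eps)
    return float(np.mean(np.sum((p - z) ** 2, axis=1)))

aligned = byol_loss(np.array([[1.0, 0.0]]), np.array([[3.0, 0.0]]))     # same direction
orthogonal = byol_loss(np.array([[1.0, 0.0]]), np.array([[0.0, 5.0]]))  # perpendicular
opposite = byol_loss(np.array([[1.0, 0.0]]), np.array([[-2.0, 0.0]]))   # opposite direction
```

Because both vectors are normalized first, the loss depends only on the angle between prediction and target, not on their magnitudes.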

In §4.4, we explore this more efficient setting by combining the supervised and BYOL tasks using the stronger data augmentation strategy from BYOL in both. More concretely, we generate an augmented view of an input image and compute the supervised loss on the first augmented view, while another augmented view of the same input is generated to solve the representation prediction task in BYOL.
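The following toy sketch shows how the two losses could share one forward pass on the first view. Everything here is an assumption made for illustration: the networks are single random linear layers, `augment` is a stand-in for the BYOL augmentation pipeline, and `byol_weight` is a hypothetical weighting coefficient that the text above does not specify.

```python
import numpy as np

def augment(x, rng):
    """Hypothetical augmentation: random horizontal flip plus small Gaussian noise."""
    flip = rng.random((x.shape[0], 1)) < 0.5
    return np.where(flip, x[:, ::-1], x) + 0.1 * rng.standard_normal(x.shape)

def l2_normalize(v, eps=1e-12):
    return v / (np.linalg.norm(v, axis=1, keepdims=True) + eps)

def cross_entropy(logits, labels):
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerically stable log-softmax
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())

def multitask_loss(x, y, online, target, rng, byol_weight=1.0):
    view1, view2 = augment(x, rng), augment(x, rng)
    h1 = np.tanh(view1 @ online["encoder"])             # shared encoder, first view
    sup = cross_entropy(h1 @ online["classifier"], y)   # supervised loss on first view
    pred = (h1 @ online["projector"]) @ online["predictor"]
    # Target branch on the second view (treated as a constant during training).
    z2 = np.tanh(view2 @ target["encoder"]) @ target["projector"]
    byol = float(np.mean(np.sum((l2_normalize(pred) - l2_normalize(z2)) ** 2, axis=1)))
    return sup + byol_weight * byol, sup, byol

rng = np.random.default_rng(0)
online = {
    "encoder": rng.standard_normal((8, 16)),
    "classifier": rng.standard_normal((16, 5)),
    "projector": rng.standard_normal((16, 4)),
    "predictor": rng.standard_normal((4, 4)),
}
target = {"encoder": online["encoder"].copy(), "projector": online["projector"].copy()}
x = rng.standard_normal((6, 8))       # six toy "images" with eight features each
y = rng.integers(0, 5, size=6)        # class labels in 0..4
total, sup, byol = multitask_loss(x, y, online, target, rng)
```

The encoder output `h1` is computed once and reused by both the classifier and the BYOL projection head, which is where the efficiency of the shared-input setting comes from in this sketch.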

4.4. Data Augmentation: Stronger is Better

In order to ensure that the performance improvement from BYOL is not strictly due to the stronger data augmentation strategy used by the task, we conduct experiments using the same data augmentations for the supervised baseline. An additional experiment without data augmentation for the supervised task is presented in Appendix A. On both CIFAR-FS (Table 1) and miniImageNet (Table 2), we find that stronger data augmentation improves the supervised baseline. This is in line with much of the recent work on strong data augmentation techniques (DeVries & Taylor, 2017; Zhang et al., 2018; Yun et al., 2019; Cubuk et al., 2019; 2020). Effectively, data augmentation is an important regularization technique that has been shown to improve generalization. Furthermore, we show that in this setting the addition of BYOL as a self-supervised auxiliary task still boosts the performance. As mentioned in §3.2, when used in combination with BYOL, both tasks share the same transformed inputs as part of our multi-task framework. Additional experiments in which both augmented views are used to compute both the supervised and BYOL losses can be found in Table 4 (Appendix C).
