Prototype Rectification for Few-Shot Learning

Paper: https://arxiv.org/pdf/1911.10713v1.pdf

Code: not yet released

First, a cosine-similarity based prototypical network (CSPN) is trained to learn a discriminative feature space and obtain the prototype vectors of the novel classes. A bias diminishing (BD) method is then proposed for prototype rectification, which reduces both the intra-class bias and the cross-class bias.
① Train a feature extractor together with a cosine-similarity classifier on the base classes. Support samples are passed through the feature extractor; the mean feature of all samples of a class serves as the basic prototype. For each sample, the distance to every class prototype is computed and normalized into a probability, and the model is optimized with stochastic gradient descent. The cosine classifier drives the feature extractor to learn an embedding space in which features of the same class cluster tightly around the class mean.
② At prediction time, support samples are passed through the feature extractor, the mean feature of each class is taken as its basic prototype, and query samples are classified by their similarity to the prototypes.
③ The basic prototypes are biased against the expected prototypes, so both the intra-class bias and the cross-class bias need to be diminished.
Intra-Class Bias: the bias between the basic class prototype computed from the few available samples and the ideal class prototype.
        First, pseudo-labels are assigned to query samples according to their cosine similarity to the basic class prototypes; the top-Z most confident query samples are then added to the support set, and the prototypes are recomputed to obtain the rectified prototypes P′n.
Cross-Class Bias: the bias between the mean sample feature of the support set and the mean sample feature of the query set.
Computing the prototypes:
① Average the features of all samples of class c to obtain the class prototype.

② Given the prototype of each class, we need the probability that a sample belongs to a class. During training the sample is labeled: given the prototype of class k and a sample known to belong to class k, we compute the probability that this sample belongs to class k, and the training objective is to maximize this probability.

For a sample x, we compute its distance to every class prototype and normalize to obtain the probability that x belongs to class k; the negative log-probability is then minimized with stochastic gradient descent, so that after convergence a good embedding space is learned.
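A minimal sketch of this pipeline (PyTorch-style; names such as `class_prototypes` and `cosine_logits` are illustrative assumptions, since the official code is not released):

```python
import torch
import torch.nn.functional as F

def class_prototypes(support_feats, support_labels, num_classes):
    """Basic prototypes: mean of the L2-normalized support features of each class."""
    feats = F.normalize(support_feats, dim=-1)                 # [N*K, D]
    return torch.stack([feats[support_labels == c].mean(dim=0)
                        for c in range(num_classes)])          # [N, D]

def cosine_logits(query_feats, prototypes, tau=10.0):
    """Scaled cosine similarity between query features and class prototypes."""
    q = F.normalize(query_feats, dim=-1)                       # [M, D]
    p = F.normalize(prototypes, dim=-1)                        # [N, D]
    return tau * q @ p.t()                                     # [M, N]

# Training on the base classes: a softmax over the cosine logits, optimized
# with cross-entropy (negative log-likelihood) and SGD, e.g.
# loss = F.cross_entropy(cosine_logits(feats, class_weights), labels)
```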


Abstract

Few-shot learning is a challenging problem that requires a model to recognize novel classes with few labeled data. In this paper, we aim to find the expected prototypes of the novel classes, which have the maximum cosine similarity with the samples of the same class. Firstly, we propose a cosine similarity based prototypical network to compute basic prototypes of the novel classes from the few samples. A bias diminishing module is further proposed for  prototype rectification since the basic prototypes computed in the low-data regime are biased against the expected prototypes. In our method, the intra-class bias and the cross-class bias are diminished to modify the prototypes. Then we give a theoretical analysis of the impact of the bias diminishing module on the expected performance of our method. We conduct extensive experiments on four few-shot benchmarks and further analyze the advantage of the bias diminishing module. The bias diminishing module brings in significant improvement by a large margin of 3% to 9% in general. Notably, our approach achieves state-of-the-art performance on miniImageNet (70.31% in 1-shot and 81.89% in 5-shot) and tieredImageNet (78.74% in 1-shot and 86.92% in 5-shot), which demonstrates the superiority of the proposed method.


1  Introduction

Many deep learning based methods have achieved significant performance on object recognition tasks with abundant labeled data provided [13, 28, 9]. However, these methods generally perform unsatisfactorily if the labeled data is scarce. To reduce the dependency on data annotation, more researchers make efforts to develop powerful methods to learn new concepts from very few samples, which is so-called few-shot learning (FSL) [18, 5, 33]. In FSL, we aim to learn prior knowledge on base classes with large amounts of labeled data and utilize the knowledge to recognize novel classes with few labeled data. It is usually formed as N-way K-shot few-shot tasks where each task consists of N novel classes with K labeled samples per class (the support set) and some unlabeled samples (the query set) for testing. Such sampled N-way K-shot tasks are called episodes in few-shot learning.


Few-shot learning methods can be split into two categories: metric learning based methods [33, 31, 32, 1] and gradient based methods [6, 16, 21, 29, 14]. Prototypical Networks (PN) [31] is a classical metric learning based method, which recognizes test samples by computing the Euclidean distance to the class prototypes. It views the mean vector as the class prototype in the embedding space. The performance of PN is limited by the representation ability of the prototypes. In PN, the features learnt in the embedding space are not discriminative enough, such that the prototypes are not suitable enough to represent the classes. Furthermore, the prototypes computed in PN cannot achieve the expected performance in few-shot learning. We argue that an expected prototype is supposed to have minimal distance to all samples of its class. However, especially in few-shot scenarios, the prototypes we get are biased against the expected prototypes.


In this paper, we target to find the expected prototypes of the novel classes which have the maximum cosine similarity to all data points in the same class. A cosine similarity based prototypical network (CSPN) is used to learn discriminative features and compute the basic prototypes from the few samples. Furthermore, we propose a bias diminishing (BD) method for rectifying the basic prototypes so as to achieve the expected performance. In CSPN, we firstly train a feature extractor with a cosine similarity based classifier on the base classes. The cosine classifier has a strong ability to drive the feature extractor to learn discriminative features. It learns an embedding space where features of the same class cluster more tightly. At the inference stage, we use class means as the basic prototypes of the novel classes and compute the cosine similarity between prototypes and samples for classification. Since the basic prototypes computed in CSPN are biased due to the data scarcity, we propose to diminish the intra-class bias and cross-class bias for prototype rectification. The intra-class bias refers to the distance between the expectedly unbiased prototype of a class and the prototype actually computed from the available data. To reduce it, we adopt the pseudo-labeling strategy to add unlabeled samples with high prediction confidence into the support set. Considering that some of the pseudo-labeled samples are possibly misclassified, we use the weighted sum as the modified prototype instead of simple averaging. This avoids introducing larger bias when computing the prototypes. The cross-class bias refers to the distance between the representatives of the training and test datasets, which are commonly expressed by the mean vectors. We reduce it by adding a shifting term ξ. The framework of our bias diminishing method for prototype rectification is shown in Figure 1.


Figure 1. Framework of our proposed method for prototype rectification. The cross-class bias diminishing module reduces the bias between the support set and the query set while the intra-class bias diminishing module reduces the bias between the actually computed prototypes and the expected prototypes. We aim to find more suitable class prototypes by diminishing the bias so as to achieve the expected accuracy.

In Section 3.4, we theoretically analyze the performance of the prototypical network using cosine similarity as the distance metric. It leads to the conclusion that the lower bound of the expected accuracy is positively correlated with the number of samples. The analysis demonstrates that our method can achieve excellent performance by exploiting more samples for prototype rectification. We argue that our method is simpler but more efficient than many complicated few-shot learning methods. Our contributions are summarized as:


1) We propose a bias diminishing method which is utilized to reduce the intra-class bias and the cross-class bias for prototype rectification.
2) We theoretically analyze the impact of the bias diminishing method on the lower bound of the expected performance. Our proposed method can raise the lower bound and thus achieves superior performance in few-shot scenarios.
3) We conduct extensive experiments on four popular few-shot benchmarks and achieve the state-of-the-art performance. The experiment results demonstrate that our proposed bias diminishing module can bring in significant improvement by a large margin.


2  Related Works

Few-Shot Learning Few-shot learning methods can be divided into two groups: gradient based methods and metric learning based methods. Gradient based methods focus on fast adapting model parameters to new tasks through gradient descent [2, 6, 16, 21, 29, 14]. Typical methods such as MAML [6] and Reptile [21] aim to learn a good way of parameter initialization that enables the model to be easily fine-tuned. In this section, we focus on metric learning based methods, which are more closely related to our approach. Metric learning based methods learn an informative metric to indicate the similarity relationship in the embedding space [33, 31, 32, 1]. Relation network [32] learns a distance metric to construct the relation of samples within an episode. The unlabeled samples thus can be classified according to the computed relation scores. Prototypical Networks (PN) [31] views the mean feature as the class prototype and assigns the points to the nearest class prototype based on Euclidean distance in the embedding space. It is indicated in [14] that PN shows limited performance in the high-dimensional embedding space. In some recent works, models trained with a cosine-similarity based classifier are more effective in learning discriminative features [8, 4]. In this paper, we use a cosine classifier to learn a discriminative embedding space and compute the cosine distance to the class prototype (mean) for classification. The prototype computed in the discriminative feature space is more robust in representing a class.


According to the test protocol, FSL can be divided into two branches: inductive few-shot learning and transductive few-shot learning. The former predicts the test samples one by one while the latter predicts the test samples as a whole. As proven early in [10, 36], transductive inference outperforms inductive inference especially when training data is scarce. Several recent works attack the few-shot learning problem under the transductive setting. In [21], the information shared between test samples via normalization is used to improve classification accuracy. Different from [21], TPN [17] adopts transductive inference to alleviate the low-data problem in few-shot learning. It constructs a graph using the union of the support set and the query set, where labels are propagated from support to query. Under transductive inference, the edge-labeling graph neural network (EGNN) proposed in [11] learns more accurate edge-labels through exploring the intra-cluster similarity and the inter-cluster dissimilarity. Our method takes advantage of transductive inference: when the test samples are predicted as a whole, samples with higher prediction confidence can be obtained.


Semi-Supervised Few-Shot Learning In semi-supervised few-shot learning, an extra unlabeled set, excluded from the support and query sets, is introduced to improve classification accuracy [30, 26, 15]. In [26], extended versions of Prototypical Networks [31] are proposed that use the unlabeled data to create class prototypes by Soft k-Means. LST [15] employs a pseudo-labeling strategy on the unlabeled set, then re-trains and fine-tunes the base model based on the pseudo-labeled data. For recognizing the novel classes, it utilizes extra data dynamically sampled beyond the current episode. Strictly speaking, it does not perform inference under the semi-supervised FSL setting at test time. Different from these methods, the unlabeled data in our method comes from the query set and we require no extra datasets besides the support and query sets.


3  Methodology

In this paper, we firstly use cosine similarity based prototypical network (CSPN) to learn a discriminative feature space and get the basic prototypes of the novel classes. Then we propose a bias diminishing (BD) method for prototype rectification by which we target to diminish both the intra-class bias and the cross-class bias. Finally, we provide a theoretical analysis of the expected performance to show the rationale of our proposed method.


3.1. Denotation

At the training stage, a labeled dataset D of N_base base classes is given to train the feature extractor Fθ and the cosine classifier Cw. At the inference stage, we aim to recognize N_novel novel classes with K labeled images per class. Episodic sampling is adopted to form such N-way K-shot tasks. Each episode consists of a support set S and a query set Q. In the support set, all samples are labeled and we use the extracted features X = Fθ(x) to compute the prototypes P of the novel classes. The samples in the query set are unlabeled and used for testing.


3.2. Cosine Similarity Based Prototypical Network

We propose a metric learning based method: cosine similarity based prototypical network (CSPN) to compute basic prototypes of the novel classes. Training a good feature extractor that can extract discriminative features is of great importance. Thus, we firstly train a feature extractor Fθ(·) with a cosine similarity based classifier C(·|W ) on the base classes. The cosine classifier C(·|W ) is:
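The equation itself is not reproduced above; from the description, the cosine classifier can be reconstructed as (with τ scaling the cosine similarity to the base-class weights W):

$$C\big(F_\theta(x)\mid W\big)=\mathrm{softmax}\big(\tau\cdot\mathrm{Cos}(F_\theta(x),\,W)\big)$$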


where W is the learnable weight of the base classes and τ is a scalar parameter. We target to minimize the negative log-likelihood loss on the supervised classification task:
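A plausible reconstruction of this training loss, written as the negative log-likelihood of the ground-truth class over the base dataset D:

$$L(\theta,W\mid D)=-\sum_{(x_i,\,y_i)\in D}\log C_{y_i}\big(F_\theta(x_i)\mid W\big)$$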


At the inference stage, retraining Fθ(·) and the classification weights on the few data of the novel classes is likely to result in overfitting. To avoid it, we directly compute the prototypes of the novel classes. The prototype Pn of a class is computed as:
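A reconstruction of Eq. 3 from the description below, where $\bar{X}_{i,n}$ denotes the normalized feature of the i-th support sample of class n and K is the number of support samples per class:

$$P_n=\frac{1}{K}\sum_{i=1}^{K}\bar{X}_{i,n},\qquad \bar{X}_{i,n}=\frac{F_\theta(x_{i,n})}{\lVert F_\theta(x_{i,n})\rVert}$$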


where $\bar{X}_{i,n}$ is the normalized feature of the i-th support sample of class n. Hence we get the basic prototypes of the novel classes by Eq. 3. The query samples can be classified by finding the nearest prototype based on cosine similarity.


3.3. Bias Diminishing for Prototype Rectification

In CSPN, we can obtain the basic prototypes by simply averaging the features of support samples. However, the prototypes computed in such low-data regimes are biased against the expected prototypes we want to find. To rectify the basic prototypes, we propose to diminish the intra-class bias and the cross-class bias as introduced in this section.


Intra-Class Bias of a class is defined by Eq. (4):
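A reconstruction of Eq. (4) from the definitions that follow (the difference between the expectations of the two distributions):

$$B_{intra}=\mathbb{E}_{X'\sim p_{X'}}\big[X'\big]-\mathbb{E}_{X\sim p_{X}}\big[X\big]$$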

where pX′ is the distribution of all samples of a class and pX is the distribution of the available labeled samples of a class. It is easy to observe the difference between the expectations of the two distributions, and the difference is more obvious in the low-data regime. Since the prototype is computed by feature averaging, the intra-class bias can also be understood as the difference between the expected prototype and the actually computed prototype of a class. The expected prototype is supposed to be represented by the mean feature of all samples in a class. In practice, only a part of the samples are available for training, which is to say that it is almost impossible to get the expected prototype. In the few-shot learning scenario, we merely have K samples per novel class. The available samples are far fewer than the full set of samples we would need, so the prototypes actually computed from such limited samples are greatly biased.


To reduce the bias, we adopt the pseudo-labeling strategy to augment the support set, which assigns temporary labels to the unlabeled data according to their prediction confidence [15]. Pseudo-labeled samples can be augmented into the support set such that we can compute new prototypes in a ‘higher-data’ regime. We can simply select the top-Z confidently predicted query samples per class to augment the support set S with their pseudo labels. We use CSPN as the recognition model to get prediction scores. Then we have an augmented support set S′ consisting of the original support samples and the confidently predicted query samples. Since some pseudo-labeled samples are likely to be misclassified, simple averaging with equal weights may result in larger bias in prototype computation. To compute new prototypes in a more reasonable way, we use the weighted sum of X′ as the rectified prototype. We note that X′ refers to the feature of a sample in S′, including both original support samples and pseudo-labeled query samples. The rectified prototype of a class is thus computed from the normalized features $\bar{X}'$:
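A reconstruction of Eq. (5), the weighted sum over the K + Z normalized features of the augmented support set:

$$P'_n=\sum_{i=1}^{K+Z}W_{i,n}\,\bar{X}'_{i,n}$$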


where $W_{i,n}$ is the weight indicating the relation between the augmented support samples and the basic prototypes. The weight is computed by Eq. (6):
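A reconstruction of Eq. (6); the weights are assumed to be a softmax over the scaled cosine similarities to the basic prototype:

$$W_{i,n}=\frac{\exp\big(\varepsilon\cdot\mathrm{Cos}(\bar{X}'_{i,n},\,P_n)\big)}{\sum_{j=1}^{K+Z}\exp\big(\varepsilon\cdot\mathrm{Cos}(\bar{X}'_{j,n},\,P_n)\big)}$$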


ε is a scalar parameter and $P_n$ is the basic prototype obtained in Section 3.2. It indicates that the new prototype computation is based on the cosine similarity between the samples and the basic prototypes. Samples with larger cosine similarity take up larger proportions in prototype rectification. Compared with the basic prototype $P_n$, the rectified prototype $P'_n$ is closer to the expected prototype.
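Below is a minimal PyTorch-style sketch of this intra-class rectification step. Function and variable names such as `rectify_prototypes`, `top_z` and `eps` are illustrative assumptions, not from the paper's unreleased code; `protos` are the basic prototypes from CSPN and the returned tensor contains the rectified prototypes P′n.

```python
import torch
import torch.nn.functional as F

def rectify_prototypes(protos, support_feats, support_labels,
                       query_feats, num_classes, top_z=8, eps=10.0):
    """Intra-class bias diminishing: augment each class with its top-Z most
    confident pseudo-labeled queries, then recompute the prototype as a
    similarity-weighted sum (cf. Eqs. 5-6)."""
    s = F.normalize(support_feats, dim=-1)
    q = F.normalize(query_feats, dim=-1)
    scores = q @ F.normalize(protos, dim=-1).t()          # cosine scores [M, N]
    pseudo = scores.argmax(dim=1)                          # pseudo labels of queries

    rectified = []
    for c in range(num_classes):
        conf = scores[:, c].clone()
        conf[pseudo != c] = float('-inf')                  # keep queries pseudo-labeled as c
        k = min(top_z, int((pseudo == c).sum()))
        z_idx = conf.topk(k).indices if k > 0 else torch.empty(0, dtype=torch.long)
        aug = torch.cat([s[support_labels == c], q[z_idx]], dim=0)   # augmented set S'
        # softmax of scaled cosine similarity to the basic prototype
        w = F.softmax(eps * (aug @ F.normalize(protos[c], dim=-1)), dim=0)
        rectified.append((w.unsqueeze(1) * aug).sum(dim=0))          # weighted sum
    return torch.stack(rectified)                          # [N, D] rectified prototypes
```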


Cross-Class Bias refers to the distance between the mean vectors of the support and query datasets. It is derived from the domain adaptation problem, where the mean value is used as a first-order statistic to represent a dataset [34]. Minimizing the distance between different domains is a typical method of mitigating domain gaps. Since the support and query sets are assumed to be sampled from the same domain, the distance between them is a distribution bias rather than a domain gap. The cross-class bias $B_{cross}$ is formulated as:
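A reconstruction of Eq. 7 from the definition (difference of the expectations of the support and query distributions):

$$B_{cross}=\mathbb{E}_{X^{s}\sim p_{S}}\big[X^{s}\big]-\mathbb{E}_{X^{q}\sim p_{Q}}\big[X^{q}\big]$$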


where pS and pQ are the distributions of the support set and the query set respectively. Notably, the support set S and the query set Q include N novel classes in Eq. 7. To diminish $B_{cross}$, we can shift the query set towards the support set. In practice, we add a shifting term ξ to each normalized query feature $\bar{X}^q$, where ξ is defined as:
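A reconstruction of the shifting term, i.e. the difference between the mean normalized support feature and the mean normalized query feature of the episode:

$$\xi=\frac{1}{|S|}\sum_{i=1}^{|S|}\bar{X}^{s}_{i}-\frac{1}{|Q|}\sum_{j=1}^{|Q|}\bar{X}^{q}_{j}$$

Each normalized query feature is then replaced by $\bar{X}^{q}+\xi$ before classification.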


3.4. Theoretical Analysis

In this section, we explain why the prototypes in our method can achieve superior performance in few-shot learning. We derive the formulation of our expected performance in theory and point out what factors influence the final result. We use X to represent the feature of a class. For clear illustration, we give the prototype formulation we used in this section:
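A reconstruction of Eq. 9 from the notation explained below, where $\bar{X}'_i$ denotes the normalized feature of $X'_i$:

$$P=\frac{1}{T}\sum_{i=1}^{T}\bar{X}'_{i},\qquad T=K+Z$$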


where T = K + Z, $X'_i \in S'$ and S′ is a subset sampled from X. $\bar{X}$ is the normalized feature and P is the normalized prototype. For a cosine similarity based prototypical network, an expected prototype should have the largest cosine similarity with all samples of its class. Our objective is to maximize the expected cosine similarity, which is positively correlated with the classification accuracy. It is formulated as:


And we derive it as:

From previous works [22, 27], we know that:


where A and B refer to random variables. In Eq. 12, $E[A]/E[B]$ is the first-order estimator of $E[A/B]$. Thus, Eq. 11 is approximated as:


Based on the Cauchy-Schwarz inequality, we have:

P and $\bar{X}$ are D-dimensional vectors which can be denoted as $P = [p_1, p_2, ..., p_D]$ and $\bar{X} = [x_1, x_2, ..., x_D]$ respectively. In our method, we assume that the dimensions of a vector are independent of each other. Then, we can derive that:



Thus, the lower bound of the expected cosine similarity is formulated as:


Maximizing the expected accuracy is approximately equivalent to maximizing the lower bound of the expected cosine similarity shown in Eq. 16. It shows the impact of the sample number T on the prototypical network. The expected performance is positively correlated with T, which is to say that the lower bound can be raised by enlarging the value of T. In our intra-class bias diminishing method, more samples are utilized for prototype computation. As a result, the expected performance is improved. In practice we use Eq. 5 in place of Eq. 9, since a few of the samples in Eq. 5 may be misclassified. The weights in Eq. 5 are set to avoid introducing larger bias in prototype rectification.


4  Experiments

4.1. Datasets

The miniImageNet is initially proposed in [33], which consists of 100 randomly chosen classes from ILSVRC-2012 [28]. Since the exact class splits are not released by [33], we adopt another commonly used version proposed in [25]. The 100 classes are split into 64 training classes, 16 validation classes and 20 test classes. Each class contains 600 images of size 84 × 84.


The tieredImageNet [26] is also a derivative of ILSVRC-2012 [28] containing 34 high-level categories. These categories are split into 20/6/8 categories for training/validation/test. The splits include 351, 97, 160 low-level classes respectively with images of size 84 × 84.


The recently proposed CIFAR-FS [3] is derived from CIFAR-100 [12] and contains all 100 classes of CIFAR-100. The 100 classes are randomly split into 64, 16, 20 classes for training, validation and test, using the same criteria according to which miniImageNet was created. Each class consists of 600 images of size 32 × 32.


The FC100 [23] is a newly split dataset based on CIFAR-100 [12] for few-shot learning. It contains 20 high-level categories which are divided into 12, 4, 4 categories for training, validation and test. There are 60, 20, 20 low-level classes in the corresponding splits, with 600 images of size 32 × 32 per class. The smaller image size makes it more challenging for few-shot learning.


4.2. Implementation details

We use WRN-28-10 [35] as the backbone and remove the last ReLU layer of the network. At the training stage, we train the base recognition model CSPN in the traditional supervised way and test on the validation set in the 5-way 5-shot setting for model selection. The results reported in our experiments are collected by sampling 600 episodes, with 95% confidence intervals. For both the 5-way 1-shot and 5-way 5-shot tests, each episode contains 15 randomly selected query samples per class. We choose SGD as the optimizer with a momentum of 0.9 and a weight decay parameter of 0.0005. For all networks, the initial learning rate is 0.1. The maximum training epoch is set to 60 for miniImageNet and tieredImageNet and the learning rate is reduced after 10, 20, 40 epochs. As for CIFAR-FS and FC100, the maximum training epoch is 30 and the learning rate is reduced after 5, 10, 20 epochs. At the training stage, we adopt horizontal flip and random crop for data augmentation on the two ImageNet derivatives and only use horizontal flip on the two CIFAR derivatives. In our experiments, the initial value of τ is set to 10 and the value of ε is fixed at 10.
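For reference, a minimal sketch of the optimizer and schedule described above. The backbone is replaced by a placeholder module, and the learning-rate decay factor is an assumption (0.1), since it is not stated in the text:

```python
import torch
import torch.nn as nn

# Placeholder standing in for the WRN-28-10 backbone + cosine classifier head
# (the real architecture is not reproduced here).
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 84 * 84, 64))

optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=0.0005)
# Decay factor not stated in the text; 0.1 assumed.
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[10, 20, 40], gamma=0.1)   # miniImageNet / tieredImageNet
# CIFAR-FS / FC100: train 30 epochs with milestones [5, 10, 20] instead.
```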


4.3. Comprehensive Comparison

Table 1 shows the result comparison of 5-way tasks on miniImageNet and tieredImageNet. It is clear that we achieve state-of-the-art performance on both 5-way 1-shot and 5-way 5-shot tasks on the two datasets. It can be seen that CSPN provides a strong baseline which is competitive with previous methods. Although the cosine-similarity based class prototype network is trained on the traditional supervised task (64-way) rather than few-shot recognition tasks (5-way), it is more effective at learning discriminative representations of few-shot classes compared with previous complicated methods [24, 29, 7] with the same backbone. Based on the strong baseline, our proposed bias diminishing (BD) module still improves it by a large margin, approximately 9% and 3% on 1-shot and 5-shot tasks respectively. Note that LST [15] achieves a competitive performance especially on 1-shot tasks. It trains a model under the semi-supervised setting, and re-trains and fine-tunes the model on each novel task. At test time, it dynamically samples extra unlabeled data besides the current episode as auxiliary information. Different from LST [15], our proposed method needs no additional re-training and fine-tuning at test time. We improve the performance in low-data regimes by diminishing the intra-class bias and the cross-class bias. BFSL [7] uses image rotation and relative patch location as supervisory signals at the training stage to boost the recognition performance. Differently, we do not perform auxiliary tasks during training but make use of the prediction confidence as side information during transductive inference. We argue that our bias diminishing method is efficient for prototype rectification. It generally promotes the performance by a large margin. Meanwhile, it strengthens the model capacity during inference.


We further display the results comparison with prior approaches on recently proposed few-shot benchmarks CIFAR-FS and FC100 in Table 2. Notably, we achieve the best performance in 1-shot scenarios and a superior result in 5-shot scenarios. The improvement brought by our bias diminishing method in FC100 is relatively lower than the improvements in other datasets. It is caused by the low accuracy of the basic recognition model CSPN. Since the prediction confidence provided by CSPN is the criteria for sample selection, low confidence indicates more misclassified samples in prototype computation.


4.4. Comparison with Prototypical Networks

We compare our method with Prototypical Networks (PN) [31] on miniImageNet and the results are displayed in Table 3. Experiments are conducted on three widely used backbones. ConvNet-128 refers to a 4-module convolutional network with 128 filters per layer in the last module [8]. And ResNet-12 refers to a ResNet-like network consisting of three residual blocks as described in [14]. It can be seen in Table 3 that our proposed method outperforms PN in all cases. With bigger backbone networks, our method outperforms PN by larger margins. BD-CSPN achieves results of 76.12% and 79.23% with Conv-128 and ResNet-12 on 5-shot tasks, which are higher than PN by 9.5% and 23.27% respectively. We can learn a more discriminative feature space than PN, where the computed prototypes are more suitable to represent the classes. Table 3 also indicates the large enhancement brought by our bias diminishing method on different backbones, e.g. 6.12% and 6.8% in 1-shot with Conv-128 and ResNet-12. The result comparison demonstrates that our prototype rectification method is capable of finding better prototypes for few-shot problems.


4.5. Intra-Class Bias Diminishing

To illustrate the effectiveness of our proposed intra-class bias diminishing method, we display the 5-way classification accuracy in Figure 2(a)-2(b). FC100 is a more challenging dataset compared with the others, and there is an obvious gap between its accuracy and that of the other datasets. To better show how the accuracy changes with different numbers Z of pseudo-labeled samples, we only choose the accuracy of miniImageNet, tieredImageNet and CIFAR-FS for visualization. Figure 2(a)-2(b) show a consistent tendency: with larger Z, there is an obvious growth of classification accuracy on each dataset. In BD-CSPN, the support set is augmented based on the confidence of the query samples as predicted by CSPN. As shown in Table 3, the baseline provided by CSPN is strong enough that the top-Z pseudo-labeled samples can be confidently treated as support data. Hence, the augmented support set contains more labeled and effective samples to modify the original class prototypes. Our intra-class bias diminishing method efficiently exploits more samples for prototype computation, such that we improve the lower bound of the expected performance as formulated in Eq. 16. We use the validation set to determine the value of Z and set it to 8 for the accuracy comparison in Table 1 and Table 2.


Figure 2. (a)-(b) Classification accuracy with different numbers Z of pseudo-labeled samples used to diminish the intra-class bias. (c) Comparison of experimental and theoretical results on miniImageNet. The experimental results (solid lines) show a consistent tendency with the theoretical results (dashed lines).

As we know, the expected accuracy Acc(P, X) has a positive correlation with the expected cosine similarity. Then we derive the first-order estimation of Acc(P, X) from Eq. 16 which is formulated as:


where η is a coefficient and K + Z = T. λ and α are values correlated with the variance term and the expectation term in Eq. 16. The theoretical values of λ and α can be approximately computed from the extracted features. Furthermore, we can compute the value of η from the 1-shot and 5-shot accuracies of CSPN. Thus, the number Z is the only variable of Eq. 17. The theoretical curves are displayed as the dashed lines in Figure 2(c) to show the impact of Z on classification accuracy. The dashed lines of the theoretical lower bound of the expected accuracy have a consistent tendency with our experimental results in Figure 2(a)-2(b). Since the cosine similarity is continuous and the accuracy is discrete, the accuracy stops increasing when the cosine similarity grows to a certain value.


T-SNE Visualization We show the t-SNE visualization of our intra-class bias diminishing method in Figure 3 for intuitive illustration. In Figure 3, the basic prototype of each class is computed from the support set and the rectified prototype is computed from the augmented support set. For visualization in this section, the expected prototype refers to the first term in Eq. 4, which is represented by the average vector of all samples (both the support and query samples) of a class in an episode. Due to the limited labeled samples, there is a large bias between the basic prototype and the expected prototype. The bias is reflected by the distance between the stars and the triangles. For example, we can see that the original prototype is far from the query samples in the class marked in purple. By diminishing the intra-class bias, the rectified prototype distributes closer to the expected prototype. As shown in Figure 3, the rhombuses move towards the triangles from the stars.


4.6. Cross-Class Bias Diminishing

Table 4 shows the accuracy comparison of the cross-class bias diminishing method on four datasets. It shows an overall performance improvement as a result of diminishing the cross-class bias. Moving the whole query set towards the support set center by adding the shifting term ξ is an efficient approach to decrease the bias between the two sets. For example, the accuracy increases by 1.64% on 1-shot tieredImageNet.


T-SNE Visualization In few-shot learning, the support set includes far fewer samples than the query set in an episode, e.g. 5 support samples and 75 query samples in a 5-way 1-shot episode. There is a large distance between the means of the two sets. We aim to decrease the distance by shifting the query samples towards the center of the support set as shown in Figure 4. Figure 4 depicts the spatial change of the query samples before and after cross-class bias diminishing. A typical part is zoomed in for clear visualization, where the query samples with BD_cross (marked in green) distribute closer to the center of the support set.


5  Conclusions

In our work, we propose the bias diminishing method for prototype rectification in few-shot learning. The prototypes are rectified by diminishing the intra-class bias and the cross-class bias. Our theoretical analysis shows that the proposed bias diminishing method can raise the lower bound of the expected performance. Extensive experiments on four few-shot datasets demonstrate the superiority of our method. The proposed bias diminishing method makes significant improvements by a large margin (e.g. 8.47% on 1-shot miniImageNet and 9.54% on 1-shot tieredImageNet). In total, we achieve state-of-the-art performance in 6/8 cases on the four datasets.

