Distant Domain Transfer Learning

前面第一篇Transfer Learning的论文是其在医学影像上的应用,这一篇主要是提出了远域迁移学习的概念,并从其结构上探讨远域迁移学习的学习过程。

原文

In this paper, we study a novel transfer learning problem termed Distant Domain Transfer Learning (DDTL). Different from existing transfer learning problems which assume that there is a close relation between the source domain and the target domain, in the DDTL problem, the target domain can be totally different from the source domain. For example, the source domain classifies face images but the target domain distinguishes plane images. Inspired by the cognitive process of human where two seemingly unrelated concepts can be connected by learning intermediate concepts gradually, we propose a Selective Learning Algorithm (SLA) to solve the DDTL problem with supervised autoencoder or supervised convolutional autoencoder as a base model for handling different types of inputs. Intuitively, the SLA algorithm selects usefully unlabeled data gradually from intermediate domains as a bridge to break the large distribution gap for transferring knowledge between two distant domains. Empirical studies on image classification problems demonstrate the effectiveness of the proposed algorithm, and on some tasks the improvement in terms of the classification accuracy is up to 17% over “non-transfer” methods.

译文

在这篇论文中,我们研究了一种被称为远域迁移学习(DDTL)的新颖的迁移学习问题。现行的迁移学习问题假设源域和目标域之间有密切的联系。不同于现行的迁移学习问题的是,在远域迁移学习问题中,目标域和源域可以完全不同。例如,源域是对人脸图像的分类而目标域却是对飞机图像的辨别。人类可以通过学习中间概念将两个看起来不相干的概念逐渐联系起来,受人类这一认知过程的启发,我们提出了一种基于有监督自编码器或有监督卷积自编码器作为处理不同类型输入的基础模型的选择性学习算法(SLA),来解决远域迁移学习问题。直观上,SLA算法逐渐从中间域中挑选有用的无标签数据作为桥梁来打破两个远域之间有关迁移知识的巨大分布差距。在图像分类问题上的实践研究证实了所提出算法的有效性,并且在某些任务上,相较于“无迁移”的方法,其分类精度提升了17%。

笔记

本文针对目标域和源域相似度不高的所谓远域情形,提出了新的远域迁移学习算法 SLA,并通过实验验证了其有效性和精度。SLA 算法以有监督自编码器或有监督卷积自编码器为基础模型,用于解决远域迁移学习问题。

Introduction

Transfer Learning, which borrows knowledge from a source domain to enhance the learning ability in a target domain, has received much attention recently and has been demonstrated to be effective in many applications. An essential requirement for successful knowledge transfer is that the source domain and the target domain should be closely related. This relation can be in the form of related instances, features or models, and measured by the KL-divergence or A-distance (Blitzer et al. 2008). For two distant domains where no direct relation can be found, transferring knowledge between them forcibly will not work. In the worst case, it could lead to even worse performance than ‘non-transfer’ algorithms in the target domain, which is the ‘negative transfer’ phenomena (Rosenstein et al. 2005; Pan and Yang 2010). For example, online photo sharing communities, such as Flickr and Qzone, generate vast amount of images as well as their tags. However, due to the diverse interests of users, the tag distribution is often long-tailed, which can be verified by our analysis in Figure 1 on the tag distribution of the uploaded images at Qzone from January to April in 2016. For the tags in the head part, we can build accurate learners as there are plenty of labeled data but in the tail part, due to the scarce labeled data, the learner for each tag usually has no satisfactory performance. In this case, we can adopt transfer learning algorithms to build accurate classifiers for tags in the tail part by reusing knowledge in the head part. When the tag in the tail part is related to that in the head part, this strategy usually works very well. For example, as shown in Figure 1, we can build an accurate tiger classifier by transferring knowledge from cat images when we have few labeled tiger images, where the performance improvement is as large as 24% compared to some supervised learning algorithm learned from labeled tiger images only. However, if the two tags (e.g., face and airplane images) are totally unrelated from our perspective, existing transfer learning algorithms such as (Patel et al. 2015) fail as shown in Figure 1. One reason for the failure of existing transfer learning algorithms is that the two domains, face and airplane, do not share any common characteristic in shape or other aspects, and hence they are conceptually distant, which violates the assumption of existing transfer learning works that the source domain and the target domain are closely related.

In this paper, we focus on transferring knowledge between two distant domains, which is referred to as Distant Domain Transfer Learning (DDTL). The DDTL problem is critical as solving it can largely expand the application scope of transfer learning and help reuse as much previous knowledge as possible. Nonetheless, this is a difficult problem as the distribution gap between the source domain and the target domain is large. The motivation behind our solution to solve the DDTL problem is inspired by human’s ‘transitivity’ learning and inference ability (Bryant and Trabasso 1971). That is, people transfer knowledge between two seemingly unrelated concepts via one or more intermediate concepts as a bridge.

Along this line, there are several works aiming to solve the DDTL problem. For instance, Tan et al. (2015) introduce annotated images to bridge the knowledge transfer between text data in the source domain and image data in the target domain, and Xie et al. (2016) predict the poverty based on the daytime satellite imagery by transferring knowledge from an object classification task with the help of some nighttime light intensity information as an intermediate bridge. Those studies assume that there is only one intermediate domain and that all the data in the intermediate domain are helpful. However, in some cases the distant domains can only be related via multiple intermediate domains. Exploiting only one intermediate domain is not enough to help transfer knowledge across long-distant domains. Moreover, given multiple intermediate domains, it is highly possible that only a subset of data from each intermediate domain is useful for the target domain, and hence we need an automatic selection mechanism to determine the subsets.

In this paper, to solve the DDTL problem in a better way, we aim to transfer knowledge between distant domains by gradually selecting multiple subsets of instances from a mixture of intermediate domains as a bridge. We use the reconstruction error as a measure of distance between two domains. That is, if the data reconstruction error on some data points in the source domain is small based on a model trained on the target domain, then we consider that these data points in the source domain are helpful for the target domain. Based on this measure, we propose a Selective Learning Algorithm (SLA) for the DDTL problem, which simultaneously selects useful instances from the source and intermediate domains, learns high-level representations for selected data, and trains a classifier for the target domain. The learning process of SLA is an iterative procedure that selectively adds new data points from intermediate domains and removes unhelpful data in the source domain to revise the source-specific model changing towards a target-specific model step by step until some stopping criterion is satisfied.

The contributions of this paper are three-fold. Firstly, to our best knowledge, this is the first work that studies the DDTL problem by using a mixture of intermediate domains. Secondly, we propose an SLA algorithm for DDTL. Thirdly, we conduct extensive experiments on several real-world datasets to demonstrate the effectiveness of the proposed algorithm.

译文

迁移学习借用源域的知识来提高目标域的学习能力,近年来受到了广泛的关注,并在很多应用中被证明是有效的。成功的知识迁移的一个基本要求是源域和目标域要紧密相联。这种联系可以以相关样本、特征或模型的形式呈现,且可以用KL散度或A距离(A-distance)来度量。对于两个找不到直接关联的远域,强行在它们之间进行知识迁移是行不通的。最坏的情况下,这样做甚至会导致比在目标域使用“无迁移”算法更差的性能,也就是“负迁移”现象。例如,在线照片共享社区(如Flickr和Qzone)会生成大量的图片及其标签。然而,由于用户兴趣的多样性,标签的分布往往呈长尾状,我们在图1中对2016年1-4月Qzone上传图片的标签分布进行了分析,验证了这一点。对于头部的标签,由于有大量的标注数据,我们可以建立准确的学习器;但在尾部,由于标注数据稀缺,每个标签的学习器通常没有令人满意的表现。在这种情况下,我们可以采用迁移学习算法,通过重用头部的知识来为尾部的标签搭建精确的分类器。如若尾部的标签和头部的标签相近,这一策略通常很有效。例如,正如图1所示,当我们缺少带标注的老虎图像时,可以通过迁移从猫的图像中学到的知识来建立一个准确的老虎分类器,与仅用带标注的老虎图像训练的有监督学习算法相比,性能改进高达24%。然而,如果从我们的角度来看两个标签(如人脸和飞机图像)完全不相关,现有的迁移学习算法如论文(Patel et al. 2015)就会迁移失败,如图1所示。现有迁移学习算法失败的一个原因是,人脸和飞机这两个领域在形状或其他方面不具有任何共性,因此它们在概念上属于远域,这违反了现有迁移学习工作中源域和目标域密切相关的假设。

在这篇论文中,我们关注的是在两个远域之间进行知识迁移,这被称为远域迁移学习(DDTL)。远域迁移学习问题至关重要,因为解决它可以极大地扩展迁移学习的应用范围,并帮助尽可能多地重用以前的知识。尽管如此,远域迁移学习仍然是一个难题,因为源域和目标域之间的分布差距很大。我们解决DDTL问题的灵感来自于人类的“传递性”学习和推理能力。也就是说,人们通过一个或多个中间概念作为桥梁,在两个看似不相关的概念之间进行知识迁移。

沿着这条思路,已有不少工作旨在解决DDTL问题。例如,Tan et al. (2015) 引入带标注的图像来搭建源域中文本数据和目标域中图像数据之间知识迁移的桥梁;Xie et al. (2016) 则以一些夜间灯光强度信息作为中间桥梁,将物体分类任务中的知识迁移过来,基于日间卫星图像预测贫困程度。这些研究假设只有一个中间域,且中间域中的所有数据都是有帮助的。然而,在一些情况下,远域只能通过多个中间域进行关联,仅利用一个中间域不足以在相距很远的域之间完成知识迁移。此外,在给定多个中间域的情况下,很有可能只有每个中间域中的一部分数据对目标域有用,因此我们需要一种自动选择机制来确定这些子集。

在本文中,为了更好地解决DDTL问题,我们的目标是通过从混合的中间域中逐步选择多个样本子集作为桥梁,在远域之间进行知识迁移。我们使用重构误差作为两个域之间距离的度量。也就是说,如果基于在目标域上训练的模型,源域上某些数据点的数据重构误差较小,则我们认为这些源域上的数据点对目标域是有帮助的。在此基础上,我们提出了一种针对DDTL问题的选择性学习算法(SLA),该算法同时从源域和中间域选择有用的样本,学习所选数据的高级表示,训练目标域的分类器。SLA的学习过程是一个迭代的过程,它有选择地从中间域中添加新的数据点,并删除源域中无用的数据,逐步将特定于源域的模型修改为特定于目标域的模型,直到满足某种停止条件。

本文的贡献有三方面。首先,据我们所知,这是第一次使用中间域的混合来研究DDTL问题。其次,我们提出了一个针对DDTL问题的SLA算法。第三,我们在几个真实的数据集上进行了大量的实验来证明所提算法的有效性。

笔记

这一部分主要是对全文工作的一个介绍,首先谈到了迁移学习的成功,然后介绍了源领域(Source Domain)、目标领域(Target Domain)、负迁移(Negative Transfer),以及衡量源域和目标域之间相似性的度量:KL散度(Kullback-Leibler divergence)和A-distance。之后借用从人脸向飞机图像的失败迁移引出了远域迁移学习(Distant Domain Transfer Learning, DDTL)的概念。在介绍了一些前人对DDTL问题的研究后,作者提出了本文所研究的选择性学习算法(Selective Learning Algorithm, SLA)并介绍了其学习过程。最后,简述了本文的三个主要贡献(就不列了)。

  • 源领域
    有知识、有大量数据标注的领域,是要迁移的对象
  • 目标领域
    最终要赋予知识、赋予标注的对象
  • 负迁移
    两个领域之间基本不相似,这样知识迁移就会产生负效果
  • KL散度
    $D_{KL}(P \| Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}$
    这是一个非对称的度量:$D_{KL}(P \| Q) \neq D_{KL}(Q \| P)$
  • A-distance
    A-distance可以用来估计不同分布之间的差异性。它被定义为建立一个线性分类器来区分两个数据领域的hinge损失。它的计算方式是:首先在源域和目标域上训练一个二分类器 $h$,使得这个分类器可以区分样本来自哪一个领域;用 $err(h)$ 表示该分类器的错误率,则A-distance定义为 $A(\mathcal{D}_s, \mathcal{D}_t) = 2(1 - 2\,err(h))$。KL散度与A-distance的计算示意见本列表后的代码。
  • 传递迁移学习
    有一个中间域
  • 远域迁移学习
    有多个中间域
  • 选择性学习算法
    该算法同时从源域和中间域选择有用的实例,学习所选数据的高级表示,训练目标域的分类器。SLA的学习过程是一个迭代的过程,它有选择地从中间域中添加新的数据点,并删除源域中无用的数据,逐步将特定于源域的模型修改为特定于目标域的模型,直到满足某种停止条件。
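
为帮助理解上面两个度量,这里补充一个最小的 Python 计算示意(其中的随机数据、用 LinearSVC 充当域判别器等都是为演示而做的假设,并非论文所用实现):

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split

def kl_divergence(p, q, eps=1e-12):
    """离散分布 P、Q 的 KL 散度(非对称)。"""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def a_distance(Xs, Xt):
    """A-distance 的一种常见近似:训练一个线性分类器区分两个域,
    用其错误率 err(h) 计算 A(Ds, Dt) = 2 * (1 - 2 * err(h))。"""
    X = np.vstack([Xs, Xt])
    y = np.hstack([np.zeros(len(Xs)), np.ones(len(Xt))])
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)
    clf = LinearSVC(C=1.0).fit(X_tr, y_tr)
    err = 1.0 - clf.score(X_te, y_te)
    return 2.0 * (1.0 - 2.0 * err)

# 玩具示例:两个接近的分布 KL 较小;两个相距较远的域 A-distance 较大
p, q = [0.7, 0.3], [0.6, 0.4]
print(kl_divergence(p, q), kl_divergence(q, p))   # 非对称

rng = np.random.RandomState(0)
Xs = rng.normal(0.0, 1.0, size=(500, 10))   # 源域特征(假设)
Xt = rng.normal(3.0, 1.0, size=(500, 10))   # 目标域特征(假设)
print(a_distance(Xs, Xt))
```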

Figure 1

介绍

上传在Qzone的图片的标签分布。在第一个任务中,我们在猫和老虎的图像之间进行知识迁移,迁移学习算法比有监督学习算法有更好的性能。在第二个任务中,我们在人脸和飞机图像之间进行知识迁移,迁移学习算法由于性能比有监督学习算法差而失败。然而,在应用我们提出的SLA算法时,我们发现我们的模型获得了更好的性能。

笔记

Figure 1 主要利用从QQ空间(Qzone)获得的图像进行迁移学习:在近类(猫和虎)之间的迁移中表现较好,而在远域(人脸和飞机)之间的迁移中,只有使用SLA算法的迁移学习才能起到正向的迁移作用。

Related work

原文

Typical transfer learning algorithms include instance weighting approaches (Dai et al. 2007) which select relevant data from the source domain to help the learning in the target domain, feature mapping approaches (Pan et al. 2011) which transform the data in both source and target domains into a common feature space where data from the two domains follow similar distributions, and model adaptation approach (Aytar and Zisserman 2011) which adapt the model trained in the source domain to the target domain. However, these approaches cannot handle the DDTL problem as they assume that the source domain and the target domain are conceptually close. Recent studies (Yosinski et al. 2014; Oquab et al. 2014; Long et al. 2015) reveal that deep neural networks can learn transferable features for a target domain from a source domain but they still assume that the target domain is closely related to the source domain.

The transitive transfer learning (TTL) (Tan et al. 2015; Xie et al. 2016) also learns from the target domain with the help of a source domain and an intermediate domain. In TTL, there is only one intermediate domain, which is selected by users manually, and all intermediate domain data are used. Different from TTL, our work automatically selects subsets from a mixture of multiple intermediate domains as a bridge across the source domain and the target domain. Transfer Learning with Multiple Source Domains (TLMS) (Mansour, Mohri, and Rostamizadeh 2009; Tan et al. 2013) leverages multiple source domains to help learning in the target domain, and aims to combine knowledge simultaneously transferred from all the source domains. The difference between TLMS and our work is two-fold. First, all the source domains in TLMS have sufficient labeled data. Second, all the source domains in TLMS are close to the target domain.

Self-Taught Learning (STL) (Raina et al. 2007) aims to build a target classifier with limited target-domain labeled data by utilizing a large amount of unlabeled data from different domains to learn a universal feature representation. The difference between STL and our work is two-fold. First, there is no so-called source domain in STL. Second, STL aims to use all the unlabeled data from different domains to help learning in the target domain while our work aims to identify useful subsets of instances from the intermediate domains to bridge the source domain and the target domain.

Semi-supervised autoencoder (SSA) (Weston, Ratle, and Collobert 2008; Socher et al. 2011) also aims to minimize both the reconstruction error and the training loss while learning a feature representation. However, our work is different from SSA in three-fold. First, in SSA, both unlabeled and labeled data are from the same domain, while in our work, labeled data are from either the source domain or the target domain and unlabeled data are from a mixture of intermediate domains, whose distributions can be very different to each other. Second, SSA uses all the labeled and unlabeled data for learning, while our work selectively chooses some unlabeled data from the intermediate domains and removes some labeled data from the source domain for assisting the learning in the target domain. Third, SSA does not have convolutional layer(s), while our work uses convolutional filters if the input is a matrix or tensor.

译文

典型的迁移学习算法包括基于样本的方法、基于特征的方法和基于模型的方法。基于样本的迁移方法从源域选择相关数据,以帮助目标域的学习;基于特征的迁移方法将源域和目标域中的数据变换到统一特征空间,在这个空间中来自这两个域的数据遵循类似的分布;基于模型的迁移方法使在源域训练的模型适应于目标域。然而,这些方法不能处理DDTL问题,因为它们是在源域和目标域在概念上相近的假设前提下进行的。最近的研究 (Yosinski et al. 2014; Oquab et al. 2014; Long et al. 2015) 表明深度神经网络可以从源域中学习到对目标域可迁移的特征,但这些工作仍是基于目标域和源域密切相关的假设。

传递迁移学习(TTL)也是在一个源域和一个中间域的帮助下对目标域进行学习。TTL中只有一个中间域,该中间域是人为选取的,且该域中所有的数据都会被使用。与TTL不同,我们的工作是从多个中间域的混合中自动选择子集,作为连接源域和目标域的桥梁。多源域迁移学习(TLMS)利用多个源域来帮助目标域中的学习,并旨在同时结合从所有源域迁移来的知识。我们的工作和TLMS有两点不同:第一,TLMS中的所有源域都有足够的带标签数据;第二,TLMS中的所有源域都和目标域相近。

自学习(STL)的目标是利用来自不同领域的大量未标记数据学习一种通用的特征表示,从而在目标域带标记数据有限的情况下建立目标分类器。我们的工作和STL有两点不同:第一,STL里没有所谓的源域;第二,STL旨在使用来自不同域的所有未标记数据来帮助目标域中的学习,而我们的工作旨在识别来自中间域的有用样本子集来连接源域和目标域。

半监督自编码器(SSA)亦旨在学习特征表示时最小化重构误差和训练损失。然而,我们的工作与SSA有三点不同。首先,在SSA中,无论是未标记数据还是有标签的数据都来自于同一个领域;而在我们的工作中,有标签的数据来自于源域或目标域,未标记的数据来自中间域的混合,而这些中间域的分布可能大相径庭。其次,SSA使用所有带标记的和未标记的数据进行学习,而我们的工作是从中间域中有选择地挑选一些未标记的数据,并从源域中删除一些有标记的数据来辅助目标域中的学习。第三,SSA没有卷积层,而我们的工作在输入为矩阵或张量时使用卷积滤波器。

笔记

这一部分作者介绍了传统的一些迁移学习算法的原理,并比较了TTL、TLMS、STL、SSA与SLA算法的异同。

Problem definition

原文

We denote by $S = \{(x_S^1, y_S^1), \cdots, (x_S^{n_S}, y_S^{n_S})\}$ the source domain labeled data of size $n_S$, which is assumed to be sufficient enough to train an accurate classifier for the source domain, and by $T = \{(x_T^1, y_T^1), \cdots, (x_T^{n_T}, y_T^{n_T})\}$ the target domain labeled data of size $n_T$, which is assumed to be too insufficient to learn an accurate classifier for the target domain. Moreover, we denote by $I = \{x_I^1, \cdots, x_I^{n_I}\}$ the mixture of unlabeled data of multiple intermediate domains, where $n_I$ is assumed to be large enough. In this work, a domain corresponds to a concept or class for a specific classification problem, such as face or airplane recognition from images. Without loss of generality, we suppose the classification problems in the source domain and the target domain are both binary. All data points are supposed to lie in the same feature space. Let $p_S(x)$, $p_S(y|x)$, $p_S(x, y)$ be the marginal, conditional and joint distributions of the source domain data, respectively, $p_T(x)$, $p_T(y|x)$, $p_T(x, y)$ be the parallel definitions for the target domain, and $p_I(x)$ be the marginal distribution for the intermediate domains. In a DDTL problem, we have $p_T(x) \neq p_S(x)$, $p_T(x) \neq p_I(x)$, and $p_T(y|x) \neq p_S(y|x)$. The goal of DDTL is to exploit the unlabeled data in the intermediate domains to build a bridge between the source and target domains, which are originally distant to each other, and train an accurate classifier for the target domain by transferring supervised knowledge from the source domain with the help of the bridge. Note that not all the data in the intermediate domains are supposed to be similar to the source domain data, and some of them may be quite different. Therefore, simply using all the intermediate data to build the bridge may fail to work.

译文

我们用 $S = \{(x_S^1, y_S^1), \cdots, (x_S^{n_S}, y_S^{n_S})\}$ 表示源域中大小为 $n_S$ 的带标签数据集,且我们假定用这个数据集为源域训练一个精确的分类器是足够的;同理用 $T = \{(x_T^1, y_T^1), \cdots, (x_T^{n_T}, y_T^{n_T})\}$ 表示目标域中大小为 $n_T$ 的带标签数据集,且假定仅用它为目标域训练一个精确的分类器是远远不够的。此外,我们用 $I = \{x_I^1, \cdots, x_I^{n_I}\}$ 表示多个中间域的未标记数据的混合,且假定 $n_I$ 足够大。在这项工作中,一个领域对应于一个特定分类问题的概念或类别,例如从图像中识别人脸或飞机。在不失一般性的前提下,我们假设源域和目标域上的分类问题都是二分类问题。所有的数据点都假定位于相同的特征空间中。设 $p_S(x)$、$p_S(y|x)$、$p_S(x, y)$ 依次为源域数据的边际分布、条件分布和联合分布,$p_T(x)$、$p_T(y|x)$、$p_T(x, y)$ 是目标域上的对应定义,$p_I(x)$ 为中间域的边际分布。在一个远域迁移学习问题中,我们有 $p_T(x) \neq p_S(x)$、$p_T(x) \neq p_I(x)$ 和 $p_T(y|x) \neq p_S(y|x)$。远域迁移学习的目标是利用中间域的未标记数据,在原本相距遥远的源域和目标域之间建立一座桥梁,并在该桥梁的帮助下,通过从源域迁移有监督的知识来为目标域训练精确的分类器。注意,并不是所有中间域中的数据都与源域数据相似,其中一些可能会有很大差异。因此,简单地使用所有中间数据来构建桥梁可能达不到效果。

笔记

该部分介绍了本文中关于DDTL的一些常用记法,包括源域、中间域、目标域的表示,以及各域之间应满足的分布关系。值得一提的是,本文中的源域、中间域、目标域的选取基于两个前提假设:

  • 源域和目标域的分类问题都是二分类问题
  • 所有的数据,无论是隶属于哪个域,都应分布于相同的特征空间。

The Selective Learning Algorithm

原文

In this section, we present the proposed SLA.

译文

在本节中,我们将介绍所提出的SLA算法。

Auto-Encoders and Its Variant

原文

Since a basic component in our proposed method to solve the DDTL problem is the autoencoder (Bengio 2009) and its variant, we first review them. An autoencoder is an unsupervised feed-forward neural network with an input layer, one or more hidden layers, and an output layer. It usually includes two processes: encoding and decoding. Given an input $x \in \mathbb{R}^q$, an autoencoder first encodes it through an encoding function $f_e(\cdot)$ to map it to a hidden representation, and then decodes it through a decoding function $f_d(\cdot)$ to reconstruct $x$. The process of the autoencoder can be summarized as encoding: $h = f_e(x)$, and decoding: $\hat{x} = f_d(h)$, where $\hat{x}$ is the reconstructed input to approximate $x$. The learning of the pair of encoding and decoding functions, $f_e(\cdot)$ and $f_d(\cdot)$, is done by minimizing the reconstruction error over all training data, i.e., $\min_{f_e, f_d} \sum_{i=1}^{n} \|\hat{x}_i - x_i\|_2^2$.

After the pair of encoding and decoding functions are learned, the output of the encoding function for an input $x$, i.e., $h = f_e(x)$, is considered as a higher-level and robust representation for $x$. Note that an autoencoder takes a vector as the input. When an input instance represented by a matrix or tensor, such as an image, is presented to an autoencoder, the spatial information of the instance may be discarded. In this case, a convolutional autoencoder is more desired, and it is a variant of the autoencoder obtained by adding one or more convolutional layers to capture inputs, and one or more corresponding deconvolutional layers to generate outputs.

译文

在我们提出的解决DDTL问题的方法中,自编码器及其变体是一个基本组成部分,因此我们首先对它们进行回顾。自编码器是一种无监督前馈神经网络,具有一个输入层、一个或多个隐藏层和一个输出层。它通常包括两个过程:编码和解码。对于输入 $x \in \mathbb{R}^q$,自编码器首先通过编码函数 $f_e(\cdot)$ 对其进行编码,将其映射到一个隐藏表示,然后通过解码函数 $f_d(\cdot)$ 对其进行解码,重构出 $x$。自编码器的过程可以总结为
编码(encoding):$h = f_e(x)$,和解码(decoding):$\hat{x} = f_d(h)$,
其中 $\hat{x}$ 是输入 $x$ 的近似重构。对于 $f_e(\cdot)$ 和 $f_d(\cdot)$ 这一对编码和解码函数的学习,是通过最小化所有训练数据上的重构误差来完成的,即 $\min_{f_e, f_d} \sum_{i=1}^{n} \|\hat{x}_i - x_i\|_2^2$。

在学完这一对编码和解码函数之后,对于输入 $x$,编码函数的输出 $h = f_e(x)$ 被视为 $x$ 的更高层次、更鲁棒的表示。需要指出的是,自编码器以向量作为输入。当实例以矩阵或张量的形式(例如图像)输入自编码器时,其空间信息可能会被丢弃。这种情况下更需要卷积自编码器,它是自编码器的一种变体:通过添加一个或多个卷积层来接收输入,并添加一个或多个相应的反卷积层来生成输出。

笔记

这一部分作者对自编码器和其变体卷积自编码器进行了简单介绍,自编码器的形式为


输入 x → 编码器(encoder) → 隐藏表示 h → 解码器(decoder) → 输出 x'

评价方式是最小化重构误差,即 $\min_{f_e, f_d} \sum_{i=1}^{n} \|\hat{x}_i - x_i\|_2^2$。
而所谓卷积自编码器则只是在自编码器的基础之上添加了卷积层,使得整个模型能直接接受图片作为输入。
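
下面补充一个最小的自编码器代码示意(PyTorch 实现,网络结构、维度和超参数均为演示用的假设,并非论文原始实现),对应上面的编码 $h = f_e(x)$、解码 $\hat{x} = f_d(h)$ 与重构误差最小化:

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    """最简单的全连接自编码器:encoder 得到隐藏表示 h,decoder 重构输入。"""
    def __init__(self, in_dim=784, hidden_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.ReLU())  # f_e
        self.decoder = nn.Linear(hidden_dim, in_dim)                             # f_d

    def forward(self, x):
        h = self.encoder(x)          # h = f_e(x)
        x_hat = self.decoder(h)      # x_hat = f_d(h)
        return x_hat, h

model = AutoEncoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(32, 784)              # 一批假设的输入向量

for _ in range(5):                   # 最小化重构误差 ||x_hat - x||^2
    x_hat, _ = model(x)
    loss = ((x_hat - x) ** 2).sum(dim=1).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```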

Instance Selection via Reconstruction Error

原文

A motivation behind our proposed method is that in an ideal case, if the data from the source domain are similar and useful for the target domain, then one should be able to find a pair of encoding and decoding functions such that the reconstruction errors on the source domain data and the target domain data are both small. In practice, as the source domain and the target domain are distant, there may be only a subset of the source domain data is useful for the target domain. The situation is similar in the intermediate domains. Therefore, to select useful instances from the intermediate domains, and remove irrelevant instances from the source domain for the target domain, we propose to learn a pair of encoding and decoding functions by minimizing reconstruction errors on the selected instances in the source and intermediate domains, and all the instances in the target domain simultaneously. The objective function to be minimized is formulated as follows:
$$\mathcal{J}_1(f_e, f_d, v_S, v_T) = \frac{1}{n_S}\sum_{i=1}^{n_S} v_S^i \left\|\hat{x}_S^i - x_S^i\right\|_2^2 + \frac{1}{n_I}\sum_{i=1}^{n_I} v_I^i \left\|\hat{x}_I^i - x_I^i\right\|_2^2 + \frac{1}{n_T}\sum_{i=1}^{n_T} \left\|\hat{x}_T^i - x_T^i\right\|_2^2 + R(v_S, v_T),$$
where $\hat{x}_S^i$, $\hat{x}_I^i$ and $\hat{x}_T^i$ are reconstructions of $x_S^i$, $x_I^i$ and $x_T^i$ based on the autoencoder, $v_S = (v_S^1, \cdots, v_S^{n_S})^T$, $v_I = (v_I^1, \cdots, v_I^{n_I})^T$, and $v_S^i, v_I^i \in \{0, 1\}$ are selection indicators for the $i$-th instance in the source domain and the $i$-th instance in the intermediate domains, respectively. When the value is $1$, the corresponding instance is selected and otherwise unselected. The last term in the objective, $R(v_S, v_T)$, is a regularization term on $v_S$ and $v_T$ to avoid a trivial solution by setting all values of $v_S$ and $v_T$ to be zero. In this paper, we define $R(v_S, v_T)$ as
$$R(v_S, v_T) = -\frac{\lambda_S}{n_S}\sum_{i=1}^{n_S} v_S^i - \frac{\lambda_I}{n_I}\sum_{i=1}^{n_I} v_I^i.$$
Minimizing this term is equivalent to encouraging to select as many instances as possible from the source and intermediate domains. Two regularization parameters, $\lambda_S$ and $\lambda_I$, control the importance of this regularization term. Note that the more useful instances are selected, the more robust hidden representations can be learned through the autoencoder.

译文

我们所提出的方法背后的一个动机是:在理想情况下,如果来自源域的数据与目标域相似且对目标域有用,那么应该能找到一对编码和解码函数,使得在源域数据和目标域数据上的重构误差都很小。实际上,由于源域和目标域相距较远,可能只有一部分源域数据对目标域有用,中间域的情况与此类似。因此,为了从中间域选择有用的实例,并为目标域去除源域中无关的实例,我们提出通过同时最小化源域和中间域中所选实例以及目标域中全部实例上的重构误差,来学习一对编码和解码函数。待最小化的目标函数表示如下:
$$\mathcal{J}_1(f_e, f_d, v_S, v_I) = \frac{1}{n_S}\sum_{i=1}^{n_S} v_S^i \left\|\hat{x}_S^i - x_S^i\right\|_2^2 + \frac{1}{n_I}\sum_{i=1}^{n_I} v_I^i \left\|\hat{x}_I^i - x_I^i\right\|_2^2 + \frac{1}{n_T}\sum_{i=1}^{n_T} \left\|\hat{x}_T^i - x_T^i\right\|_2^2 + R(v_S, v_I),$$
其中,$\hat{x}_S^i$、$\hat{x}_I^i$ 和 $\hat{x}_T^i$ 是 $x_S^i$、$x_I^i$ 和 $x_T^i$ 基于自编码器的重构;$v_S = (v_S^1, \cdots, v_S^{n_S})^T$、$v_I = (v_I^1, \cdots, v_I^{n_I})^T$,且 $v_S^i, v_I^i \in \{0, 1\}$ 分别是源域中第 $i$ 个实例和中间域中第 $i$ 个实例的选择指标。如果取值为 $1$,相对应的实例就被选入,反之则不被选入。目标函数的最后一项 $R(v_S, v_I)$ 是 $v_S$ 和 $v_I$ 上的正则化项,用于避免 $v_S$ 和 $v_I$ 的所有取值均为 $0$ 而导致的平凡解。在这篇论文中,我们定义 $R(v_S, v_I)$ 为
$$R(v_S, v_I) = -\frac{\lambda_S}{n_S}\sum_{i=1}^{n_S} v_S^i - \frac{\lambda_I}{n_I}\sum_{i=1}^{n_I} v_I^i.$$
最小化这一项相当于鼓励尽可能多地从源域和中间域中选择实例。两个正则化参数 $\lambda_S$ 和 $\lambda_I$ 控制了这个正则化项的重要性。注意,选择的有用实例越多,通过自编码器就能学到越鲁棒的隐藏表示。

笔记

这里介绍了自编码器的目标函数 $\mathcal{J}_1(f_e, f_d, v_S, v_T)$ 以及表达式中各项表示的意义。这里原文出现了一个错误,多次将 $v_I$ 误写为了 $v_T$,在译文中已经直接纠正了。
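
结合上式,下面给出 $\mathcal{J}_1$ 的一个 PyTorch 计算示意(model 的接口、默认超参数等均为假设,仅用来说明“带选择指标的重构误差 + 正则项”这一结构):

```python
import torch

def reconstruction_j1(model, x_S, x_I, x_T, v_S, v_I, lam_S=0.5, lam_I=0.5):
    """按式(1)的形式计算 J1:带选择指标的重构误差 + 正则项 R。
    model(x) 返回 (x_hat, h);v_S、v_I 为取值 {0,1} 的选择指标张量(接口为本笔记的假设)。"""
    v_S, v_I = v_S.float(), v_I.float()
    err = lambda x: ((model(x)[0] - x) ** 2).sum(dim=1)   # 每个样本的 ||x_hat - x||_2^2
    j_S = (v_S * err(x_S)).mean()                         # 源域:只累加被选中的样本
    j_I = (v_I * err(x_I)).mean()                         # 中间域:只累加被选中的样本
    j_T = err(x_T).mean()                                 # 目标域:使用全部样本
    R = -lam_S * v_S.mean() - lam_I * v_I.mean()          # 正则项:鼓励尽可能多选样本
    return j_S + j_I + j_T + R
```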

Incorporation of Side Information

原文

By solving the minimization problem $(1)$, one can select useful instances from the source and intermediate domains for the target domain through $v_S$ and $v_T$, and learn high-level hidden representations for data in different domains through the encoding function, i.e., $h = f_e(x)$, simultaneously. However, the learning process is in an unsupervised manner. As a result, the learned hidden representations may not be relevant to the classification problem in the target domain. This motivates us to incorporate side information into the learning of the hidden representations for different domains. For the source and target domains, labeled data can be used as the side information, while for the intermediate domains, there is no label information. In this work, we consider the predictions on the intermediate domains as the side information, and use the confidence on the predictions to guide the learning of the hidden representations. To be specific, we propose to incorporate the side information into learning by minimizing the following function:
$$\mathcal{J}_2(f_e, f_c, f_d) = \frac{1}{n_S}\sum_{i=1}^{n_S} v_S^i\, l\left(y_S^i, f_c(h_S^i)\right) + \frac{1}{n_T}\sum_{i=1}^{n_T} l\left(y_T^i, f_c(h_T^i)\right) + \frac{1}{n_I}\sum_{i=1}^{n_I} v_I^i\, g\left(f_c(h_I^i)\right),$$
where $f_c(\cdot)$ is a classification function to output classification probabilities, and $g(\cdot)$ is an entropy function defined as $g(z) = -z \ln z - (1 - z) \ln(1 - z)$ for $0 \leq z \leq 1$, which is used to select instances of high prediction confidence in the intermediate domains.

译文

通过求解最小化问题 $(1)$,可以通过 $v_S$ 和 $v_I$ 从源域和中间域中为目标域选择有用的实例,并同时利用编码函数(即 $h = f_e(x)$)为不同领域中的数据学习高级隐藏表示。然而,这一学习过程是以无监督方式进行的,因此所学习的隐藏表示可能与目标域中的分类问题无关。这促使我们将边信息纳入不同领域隐藏表示的学习中。对于源域和目标域,可以使用带标签的数据作为边信息;而对于中间域,则没有标签信息。在这项工作中,我们把对中间域的预测作为边信息,并利用预测的置信度来指导隐藏表示的学习。具体来说,我们建议通过最小化以下函数,将边信息纳入学习中:
$$\mathcal{J}_2(f_e, f_c, f_d) = \frac{1}{n_S}\sum_{i=1}^{n_S} v_S^i\, l\left(y_S^i, f_c(h_S^i)\right) + \frac{1}{n_T}\sum_{i=1}^{n_T} l\left(y_T^i, f_c(h_T^i)\right) + \frac{1}{n_I}\sum_{i=1}^{n_I} v_I^i\, g\left(f_c(h_I^i)\right),$$
其中,$f_c(\cdot)$ 是输出分类概率的分类函数,$g(\cdot)$ 是熵函数,定义为 $g(z) = -z \ln z - (1 - z) \ln(1 - z)$($0 \leq z \leq 1$),用于在中间域中选择预测置信度高的实例。

笔记

这里定义了引入边信息后的另一个目标函数 $\mathcal{J}_2(f_e, f_c, f_d)$。
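
同样地,下面给出熵函数 $g(z)$ 与 $\mathcal{J}_2$ 的一个计算示意(分类函数 f_c 的接口、逐样本损失 l 取二分类交叉熵,均为演示用的假设):

```python
import torch
import torch.nn.functional as F

def entropy(z, eps=1e-12):
    """g(z) = -z ln z - (1 - z) ln(1 - z),0 <= z <= 1;预测越接近 0 或 1(置信度越高),g 越小。"""
    return -z * torch.log(z + eps) - (1 - z) * torch.log(1 - z + eps)

def side_info_j2(f_c, h_S, y_S, h_T, y_T, h_I, v_S, v_I):
    """按式(2)的形式计算 J2:源域/目标域用带标签的损失 l,中间域用预测熵 g 度量置信度。
    f_c(h) 输出正类概率(形状 (n,)),这里的 l 取逐样本的二分类交叉熵(示意性假设)。"""
    loss = lambda p, y: F.binary_cross_entropy(p, y, reduction='none')
    j_S = (v_S.float() * loss(f_c(h_S), y_S)).mean()
    j_T = loss(f_c(h_T), y_T).mean()
    j_I = (v_I.float() * entropy(f_c(h_I))).mean()
    return j_S + j_T + j_I
```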

Overall Objective Function

原文

By combining the two objectives in Eqs. $(1)$ and $(2)$, we obtain the final objective function for DDTL as follows:
$$\min_{\Theta, v} \mathcal{J} = \mathcal{J}_1 + \mathcal{J}_2, \quad \mathrm{s.t.}\ v_S^i, v_I^i \in \{0, 1\},$$
where $v = \{v_S, v_T\}$, and $\Theta$ denotes all parameters of the functions $f_c(\cdot)$, $f_e(\cdot)$, and $f_d(\cdot)$.

To solve problem $(3)$, we use the block coordinate decedent (BCD) method, where in each iteration, variables in each block are optimized sequentially while keeping other variables fixed. In problem $(3)$, there are two blocks of variables: $\Theta$ and $v$. When the variables in $v$ are fixed, we can update $\Theta$ using the Back Propagation (BP) algorithm where the gradients can be computed easily. Alternatingly, when the variables in $\Theta$ are fixed, we can obtain an analytical solution for $v$ as follows,
$$v_S^i = \begin{cases} 1 & \text{if } l\left(y_S^i, f_c(f_e(x_S^i))\right) + \left\|\hat{x}_S^i - x_S^i\right\|_2^2 < \lambda_S \\ 0 & \text{otherwise} \end{cases}$$
$$v_I^i = \begin{cases} 1 & \text{if } \left\|\hat{x}_I^i - x_I^i\right\|_2^2 + g\left(f_c(f_e(x_I^i))\right) < \lambda_I \\ 0 & \text{otherwise} \end{cases}$$
Based on Eq. $(4)$, we can see that for the data in the source domain, only those with low reconstruction errors and low training losses will be selected during the optimization procedure. Similarly, based on Eq. $(5)$, it can be found that for the data in the intermediate domains, only those with low reconstruction errors and high prediction confidence will be selected.

An intuitive explanation of this learning strategy is twofold: 1) When updating $v$ with a fixed $\Theta$, “useless” data in the source domain will be removed, and those intermediate data that can bridge the source and target domains will be selected for training; 2) When updating $\Theta$ with fixed $v$, the model is trained only on the selected “useful” data samples. The overall algorithm for solving problem $(3)$ is summarized in Algorithm $1$.

The deep learning architecture corresponding to problem $(3)$ is illustrated in Figure 2. From Figure 2, we note that except for the instance selection component $v$, the rest of the architecture in Figure 2 can be viewed as a generalization of an autoencoder or a convolutional autoencoder by incorporating the side information, respectively. In the sequel, we refer to the architecture (except for the instance selection component) using an autoencoder for $f_e(\cdot)$ and $f_d(\cdot)$ as SAN (Supervised AutoeNcoder), and the one using a convolutional autoencoder as SCAN (Supervised Convolutional AutoeNcoder).

译文

通过结合方程 $(1)$ 和 $(2)$ 中的两个目标函数,我们得到了DDTL问题最终的目标函数,如下所示:
$$\min_{\Theta, v} \mathcal{J} = \mathcal{J}_1 + \mathcal{J}_2, \quad \mathrm{s.t.}\ v_S^i, v_I^i \in \{0, 1\},$$
其中 $v = \{v_S, v_T\}$,$\Theta$ 表示函数 $f_c(\cdot)$、$f_e(\cdot)$ 和 $f_d(\cdot)$ 的所有参数。

为了求解问题 $(3)$,我们使用块坐标下降(block coordinate descent, BCD)方法,在每次迭代中,每个块中的变量被依次优化,同时保持其他变量不变。在问题 $(3)$ 中有两个变量块 $\Theta$ 和 $v$:当 $v$ 中的变量固定时,我们可以利用BP算法更新 $\Theta$,其中梯度很容易计算;交替地,当 $\Theta$ 中的变量固定时,我们可以得到 $v$ 的解析解如下
$$v_S^i = \begin{cases} 1 & \text{if } l\left(y_S^i, f_c(f_e(x_S^i))\right) + \left\|\hat{x}_S^i - x_S^i\right\|_2^2 < \lambda_S \\ 0 & \text{otherwise} \end{cases}$$
$$v_I^i = \begin{cases} 1 & \text{if } \left\|\hat{x}_I^i - x_I^i\right\|_2^2 + g\left(f_c(f_e(x_I^i))\right) < \lambda_I \\ 0 & \text{otherwise} \end{cases}$$
从方程 $(4)$ 可以看出,对于源域的数据,在优化过程中只选取重构误差小、训练损失小的数据。同样,根据方程 $(5)$ 可以发现,对于中间域的数据,只选择重构误差小、预测置信度高的数据。

这种学习策略有两个直观的解释:1)当用固定的 $\Theta$ 更新 $v$ 时,去除源域中的“无用”数据,选择那些可以连接源域和目标域的中间数据进行训练;2)当用固定的 $v$ 更新 $\Theta$ 时,模型只在选定的“有用”数据样本上训练。求解问题 $(3)$ 的整体算法总结在算法 $1$ 中。

与问题 $(3)$ 对应的深度学习架构如图2所示。从图2中我们注意到,除了实例挑选部分 $v$,图2中的其余架构可以分别看作是通过合并边信息对自编码器或卷积自编码器的一种推广。在下文中,除去实例挑选部分,我们将 $f_e(\cdot)$ 和 $f_d(\cdot)$ 使用自编码器的架构称为SAN(Supervised AutoeNcoder),将使用卷积自编码器的架构称为SCAN(Supervised Convolutional AutoeNcoder)。

笔记

这里亦有一处作者的笔误,decedent应为descent,已在译文中予以纠正。
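
式 $(4)$、$(5)$ 的解析更新可以写成如下示意(model、f_c 的接口,以及用二分类交叉熵作为逐样本损失 l,均为本笔记的假设;阈值 λ_S、λ_I 为超参数):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def update_v(model, f_c, x_S, y_S, x_I, lam_S, lam_I, eps=1e-12):
    """固定网络参数 Θ 时,按式(4)、(5)逐样本解析地给出 0/1 选择指标 v_S、v_I。
    model(x) 返回 (x_hat, h),f_c(h) 输出正类概率(均为示意性假设)。"""
    x_hat_S, h_S = model(x_S)
    x_hat_I, h_I = model(x_I)
    rec_S = ((x_hat_S - x_S) ** 2).sum(dim=1)                       # 源域样本的重构误差
    rec_I = ((x_hat_I - x_I) ** 2).sum(dim=1)                       # 中间域样本的重构误差
    l_S = F.binary_cross_entropy(f_c(h_S), y_S, reduction='none')   # 源域样本的训练损失
    p_I = f_c(h_I)
    g_I = -p_I * torch.log(p_I + eps) - (1 - p_I) * torch.log(1 - p_I + eps)   # 中间域预测熵
    v_S = (l_S + rec_S < lam_S).long()      # 式(4):训练损失小且重构误差小的源域样本被保留
    v_I = (rec_I + g_I < lam_I).long()      # 式(5):重构误差小且预测置信度高的中间域样本被选入
    return v_S, v_I
```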

Figure 2

Algorithm 1

Algorithm 1 The Selective Learning Algorithm (SLA)
1: Input: Data in $\mathcal{S}$, $\mathcal{T}$ and $\mathcal{I}$, and parameters $\lambda_S$, $\lambda_I$, and $T$;
2: Initialize $\Theta$, $v_S = 1$, $v_I = 0$;  // All source data are used
3: while $t < T$ do
4:   Update $\Theta$ via the BP algorithm;  // Update the network
5:   Update $v$ by Eqs. $(4)$ and $(5)$;  // Select "useful" instances
6:   $t = t + 1$
7: end while
8: Output: $\Theta$ and $v$.

译文

算法 1 选择性学习算法(SLA)
1: 输入: $\mathcal{S}$、$\mathcal{T}$ 和 $\mathcal{I}$ 中的数据,以及参数 $\lambda_S$、$\lambda_I$ 和 $T$;
2: 初始化 $\Theta$,$v_S = 1$,$v_I = 0$;  // 使用源域中的所有数据
3: while $t < T$ do
4:   利用BP算法更新 $\Theta$;  // 更新网络
5:   通过方程 $(4)$ 和 $(5)$ 更新 $v$;  // 选择“有用的”实例
6:   $t = t + 1$
7: end while
8: 输出: $\Theta$ 和 $v$。
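
把上面的各个部分串起来,算法 1 的迭代过程大致可以写成下面的示意(reconstruction_j1、side_info_j2、update_v 沿用前面代码块中的假设实现,数据组织方式也是演示用的假设):

```python
import torch

def sla_train(model, f_c, data, lam_S, lam_I, T=30, lr=1e-3):
    """算法 1 的整体流程示意(块坐标下降):固定 v 用 BP 更新 Θ,再固定 Θ 按式(4)(5)更新 v。
    data 是包含 x_S, y_S, x_T, y_T, x_I 张量的字典;model、f_c 为 nn.Module(以上接口均为假设)。"""
    x_S, y_S, x_T, y_T, x_I = (data[k] for k in ("x_S", "y_S", "x_T", "y_T", "x_I"))
    v_S = torch.ones(len(x_S), dtype=torch.long)    # 初始化:使用全部源域数据
    v_I = torch.zeros(len(x_I), dtype=torch.long)   # 初始化:暂不使用中间域数据
    opt = torch.optim.Adam(list(model.parameters()) + list(f_c.parameters()), lr=lr)

    for t in range(T):
        # 1) 固定 v,最小化 J = J1 + J2,用反向传播更新网络参数 Θ
        loss = reconstruction_j1(model, x_S, x_I, x_T, v_S, v_I, lam_S, lam_I) \
             + side_info_j2(f_c, model(x_S)[1], y_S, model(x_T)[1], y_T, model(x_I)[1], v_S, v_I)
        opt.zero_grad()
        loss.backward()
        opt.step()
        # 2) 固定 Θ,按式(4)、(5)解析地更新选择指标 v
        v_S, v_I = update_v(model, f_c, x_S, y_S, x_I, lam_S, lam_I)
    return model, f_c, v_S, v_I
```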

Experiments

原文

In this section, we conduct empirical studies to evaluate the proposed SLA algorithm from three aspects. Firstly, we test the effectiveness of the algorithm when the source domain and the target domain are distant. Secondly, we visualize some selected intermediate data to understand how the algorithm works. Thirdly, we evaluate the importance of the learning order of intermediate instances by comparing the order generated by SLA with other manually designed orders.

译文

在本节中,我们从三个方面对所提出的SLA算法进行了实证研究。首先,我们测试了该算法在源域和目标域距离较远时的有效性。其次,我们将选取的一些中间数据可视化,以了解算法的工作原理。第三,通过比较SLA生成的(中间数据的)顺序与其他人工设计的顺序,评估中间实例学习顺序的重要性。

笔记

介绍了实验的细节。

Baseline Methods

原文

Three categories of methods are used as baselines.

Supervised Learning In this category, we choose two supervised learning algorithms, SVM and convolutional neural networks (CNN) (Krizhevsky, Sutskever, and Hinton 2012), as baselines. For SVM, we use the linear kernel. For CNN, we implement a network that is composed of two convolutional layers with kernel size 3×3, where each convolutional layer is followed by a max pooling layer with kernel size 2 × 2, a fully connected layer, and a logistic regression layer.
Transfer Learning In this category, we choose five transfer learning algorithms including Adaptive SVM (ASVM) (Aytar and Zisserman 2011), Geodesic Flow Kernel (GFK) (Gong et al. 2012), Landmark (LAN) (Gong, Grauman, and Sha 2013), Deep Transfer Learning (DTL) (Yosinski et al. 2014), and Transitive Transfer Learning (TTL) (Tan et al. 2015). For ASVM, we use all the source domain and target domain labeled data to train a target model. For GFK, we first generate a series of intermediate spaces using all the source domain and target domain data to learn a new feature space, and then train a model in the space using all the source domain and target domain labeled data. For LAN, we first select a subset of labeled data from the source domain, which has a similar distribution as the target domain, and then use all the selected data to facilitate knowledge transfer. For DTL, we first train a deep model in the source domain, and then train another deep model for the target domain by reusing the first several layers of the source model. For TTL, we use all the source domain, target domain and intermediate domains data to learn a model.

Self-taught Learning (STL) We apply the autoencoder or convolutional autoencoder on all the intermediate domains data to learn a universal feature representation, and then train a classifier with all the source domain and target domain labeled data with the new feature representation.

Among the baselines, CNN can receive tensors as inputs, DTL and STL can receive both vectors and tensors as inputs, while the other models can only receive vectors as inputs. For a fair comparison, in experiments, we compare the proposed SLA-SAN method with SVM, GFK, LAN, ASVM, DTL, STL and TTL, while compare the SLA-SCAN method with CNN, DTL and STL. The convolutional autoencoder component used in SCAN has the same network structure as the CNN model, except that the fully connected layer is connected to two unpooling layers and two deconvolutional layers to reconstruct inputs. For all deep-learning based models, we use all the training data for pre-training.

译文

三种方法被用作基准。

有监督学习 在这一类中,我们选择了支持向量机(SVM)和卷积神经网络(CNN)这两种有监督学习算法作为基线。对于SVM,我们使用线性核函数。对于CNN,我们实现了一个由两个卷积层、一个全连接层和一个逻辑回归层组成的网络,其中卷积层的内核大小为3×3,每个卷积层后面跟着一个内核大小为2×2的最大池化层。

迁移学习 在这一类别中,我们选择了五种迁移学习算法,包括自适应支持向量机(ASVM)、测地线核(GFK) 、Landmark (LAN)、深度迁移学习(DTL) 和传递迁移学习(Transitive transfer learning) 。对于ASVM,我们使用源域和目标域中所有有标签数据来训练目标模型。对于GFK,我们首先使用所有的源域和目标域数据生成一系列中间空间来学习一个新的特征空间,然后使用所有的源域和目标域标记数据在该空间中训练一个模型。对于LAN,我们首先从源域选择与目标域分布相似的有标记数据子集,然后使用所有选择的数据来促进知识迁移。对于DTL,我们首先在源域训练一个深度模型,然后通过重用源模型的前几层为目标域训练另一个深度模型。对于TTL,我们使用所有的源域、目标域和中间域数据来学习模型。

自学习(STL) 我们在所有中间域数据上应用自编码器或卷积自编码器来学习通用的特征表示,然后用新特征表示下的所有源域和目标域带标记数据训练分类器。

在这些基线中,CNN可以接收张量作为输入,DTL和STL可以同时接收向量和张量作为输入,而其他模型只能接收向量作为输入。为了公平比较,在实验中,我们将提出的SLA-SAN方法与SVM、GFK、LAN、ASVM、DTL、STL和TTL进行比较,同时将SLA-SCAN方法与CNN、DTL和STL进行比较。SCAN中使用的卷积自编码器部分,其网络结构与CNN模型相同,不同之处是全连接层后接两个反池化层和两个反卷积层来重构输入。对于所有基于深度学习的模型,我们使用所有的训练数据进行预训练。

笔记

这一部分作者列举了机器学习领域的三类基线算法,这些算法被用来和论文所提出的SLA-SAN和SLA-SCAN方法进行对比。由于各种算法所能接受的输入形式不同,后续的对比实验按输入类型分成了几组。
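
根据文中对 CNN 基线的描述(两个 3×3 卷积层、各接一个 2×2 最大池化层、一个全连接层和一个逻辑回归层),一个可能的实现示意如下(通道数、输入尺寸、隐藏维度等文中未给出,这里均为假设):

```python
import torch
import torch.nn as nn

class CNNBaseline(nn.Module):
    """文中 CNN 基线的一种可能实现:conv3x3 -> pool2x2 -> conv3x3 -> pool2x2 -> FC -> 逻辑回归层。"""
    def __init__(self, in_ch=3, img_size=64, n_hidden=128):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_ch, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        feat_dim = 32 * (img_size // 4) * (img_size // 4)
        self.fc = nn.Sequential(nn.Linear(feat_dim, n_hidden), nn.ReLU())
        self.classifier = nn.Linear(n_hidden, 1)   # 逻辑回归层:线性层 + sigmoid 输出二分类概率

    def forward(self, x):
        z = self.features(x).flatten(1)
        return torch.sigmoid(self.classifier(self.fc(z))).squeeze(1)

# 用法示意:一批 3x64x64 的假设图像
model = CNNBaseline()
probs = model(torch.rand(8, 3, 64, 64))   # 形状 (8,),为正类概率
```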

Figure 3

介绍

不同学习算法用于 Caltech-256 和 AwA 数据集上的平均准确度。

Table 1 Accuracies (%) of selected tasks on Caltech-256

// 表一 在Caltech-256数据集上所选任务的准确率

                        CNN      DTL      STL      SLA-SCAN
‘face-to-horse’         72 ± 2   67 ± 2   70 ± 3   89 ± 2
‘horse-to-airplane’     82 ± 2   76 ± 3   82 ± 2   92 ± 2
‘gorilla-to-face’       83 ± 1   80 ± 2   82 ± 2   96 ± 1

Datasets

原文

The datasets used for experiments include Caltech-256 (Griffin, Holub, and Perona 2007) and Animals with Attributes (AwA). The Caltech-256 dataset is a collection of 30,607 images from 256 object categories, where the number of instances per class varies from 80 to 827. Since we need to transfer knowledge between distant categories, we choose a subset of categories including ‘face’, ‘watch’, ‘airplane’, ‘horse’, ‘gorilla’, ‘billiards’ to form the source and target domains. Specifically, we first randomly choose one category to form the source domain, and consider images belonging to this category as positive examples for the source domain. Then we randomly choose another category to form the target domain, and consider images in this category as positive examples for the target domain. As this dataset has a clutter category, we sample images from this category as negative samples for the source and target domains and the sampling process guarantees that each pair of source and target domains has no overlapping on negative samples. After constructing the source domain and the target domain, all the other images in the dataset are used as the intermediate domains data. Therefore, we construct $P_6^2 = 30$ pairs of DDTL problems from this dataset.

The AwA dataset consists of 30,475 images of 50 animals classes, where the number of samples per class varies from 92 to 1,168. Due to the copyright reason, this dataset only provides extracted SIFT features for each image instead of the original images. We choose three categories including ‘humpback+whale’, ‘zebra’ and ‘collie’ to form the source and target domains. The source-target domains construction procedure is similar to that on the Caltech-256 dataset, and images in the categories ‘beaver’, ‘blue+whale’, ‘mole’, ‘mouse’, ‘ox’, ‘skunk’, and ‘weasel’ are considered as negative samples for the source domain and the target domain. In total we construct $P_3^2 = 6$ pairs of DDTL problems.

译文

用于实验的数据集包括Caltech-256和动物属性数据集(Animals with Attributes, AwA)。Caltech-256数据集是来自256个对象类别的30,607张图像的集合,其中每个类的实例数量从80到827不等。因为我们需要在相距遥远的类别之间迁移知识,所以我们选择了一个类别子集,包括 ‘face’、‘watch’、‘airplane’、‘horse’、‘gorilla’、‘billiards’,来构成源域和目标域。具体来说,我们首先随机选择一个类别来构成源域,并将属于这个类别的图像作为源域的正例;然后我们随机选择另一个类别来构成目标域,并将这个类别中的图像作为目标域的正例。由于该数据集包含一个杂物(clutter)类别,我们从该类别中采样图像作为源域和目标域的负例,采样过程保证了每一对源域和目标域在负样本上没有重叠。在构建了源域和目标域之后,数据集中的所有其他图像都被用作中间域数据。因此,我们从这个数据集构造了 $P_6^2 = 30$ 对DDTL问题。

AwA数据集包含50个动物类别的30,475张图片,每个类别的样本数量从92到1,168不等。由于版权原因,这个数据集只提供每张图像提取好的SIFT特征,而不提供原始图像。我们选择了 ‘humpback+whale’(座头鲸)、‘zebra’(斑马)和 ‘collie’(柯利牧羊犬)三个类别来构成源域和目标域。源域-目标域的构建过程与Caltech-256数据集上的类似,‘beaver’(海狸)、‘blue+whale’(蓝鲸)、‘mole’(鼹鼠)、‘mouse’(老鼠)、‘ox’(公牛)、‘skunk’(臭鼬)和 ‘weasel’(鼬鼠)等类别的图像被作为源域和目标域的负样本。总的来说,我们构造了 $P_3^2 = 6$ 对DDTL问题。

笔记

没啥可说的
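
源域-目标域对的构造本质上是在所选类别上取有序对,例如 Caltech-256 的 6 个类别给出 $P_6^2 = 30$ 对。下面是一个小的组对示意(类别名取自文中,组对逻辑为通用写法):

```python
from itertools import permutations

caltech_classes = ['face', 'watch', 'airplane', 'horse', 'gorilla', 'billiards']
awa_classes = ['humpback+whale', 'zebra', 'collie']

caltech_pairs = list(permutations(caltech_classes, 2))   # P(6,2) = 30 对 (源域, 目标域)
awa_pairs = list(permutations(awa_classes, 2))           # P(3,2) = 6 对

print(len(caltech_pairs), len(awa_pairs))                # 30 6
print(caltech_pairs[0])                                  # 例如 ('face', 'watch')
```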

Performance Comparison

原文

In experiments, for each target domain, we randomly sample 6 labeled instances for training, and use the rest for testing. Each configuration is repeated 10 times. On the Caltech-256 dataset, we conduct two sets of experiments. The first experiment uses the original images as inputs to compare SLA-SCAN with CNN, DTL and STL. The second experiment uses SIFT features extracted from images as inputs to do a comparison among SLA-SAN and the rest baselines. On the AwA dataset, since only SIFT features are given, we compare SLA-SAN with SVM, GFK, LAN, ASVM, DTL, STL and TTL. The comparison results in terms of average accuracy are shown in Figure 3. From the results, we can see that the transfer learning methods such as DTL, GFK, LAN and ASVM achieve worse performance than CNN or SVM because the source domain and the target domain have huge distribution gap, which leads to ‘negative transfer’. TTL does not work either due to the distribution gap among the source, intermediate and target domains. STL achieves slightly better performance than the supervised learning methods. A reason is that the universal feature representation learned from the intermediate domain data helps the learning of the target model to some extent. Our proposed SLA method obtains the best performance under all the settings. We also report the average accuracy as well as standard deviations of different methods on some selected tasks in Tables 1 and 2. On these tasks, we can see that for SLA-SCAN, the improvement in accuracy over CNN is larger than 10% and sometimes up to 17%. For SLA-SAN, the improvement in accuracy is around 10% over SVM.

译文

在实验中,对于每个目标域,我们随机抽取6个带标签的实例用于训练,其余用于测试。每种配置重复10次。在Caltech-256数据集上,我们进行了两组实验。第一组实验使用原始图像作为输入,将SLA-SCAN与CNN、DTL和STL进行比较。第二组实验使用从图像中提取的SIFT特征作为输入,对SLA-SAN和其余基准方法进行比较。在AwA数据集上,由于只提供SIFT特征,我们将SLA-SAN与SVM、GFK、LAN、ASVM、DTL、STL和TTL进行比较。平均准确率的比较结果如图3所示。从结果中可以看出,DTL、GFK、LAN、ASVM等迁移学习方法的性能比CNN或SVM差,因为源域和目标域存在巨大的分布差距,从而导致了"负迁移"。由于源域、中间域和目标域之间的分布差距,TTL同样不起作用。STL的性能略优于监督学习方法,一个原因是从中间域数据中学到的通用特征表示在一定程度上有助于目标模型的学习。我们提出的SLA方法在所有设置下都获得了最佳性能。我们还在表1和表2中报告了不同方法在部分选定任务上的平均准确率及标准差。在这些任务上可以看到,SLA-SCAN相对于CNN的准确率提升超过10%,有时高达17%;SLA-SAN相对于SVM的准确率提升约为10%。

笔记

这一部分展示了SLA-SCAN和SLA-SAN两种框架与其他常见的深度学习、传统机器学习及迁移学习基准方法在指定数据集上的实验效果对比。依据文中数据(图3以及表1、表2),论文所提出的这两种算法的准确率均优于其他方法。
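
下面是论文实验评估协议的一个最小示意(假定每个目标域的 6 个训练样本为正负例合计,论文未明确说明;`train_and_test_fn` 是假设的模型训练与测试接口,可替换为 SLA、CNN、SVM 等任意方法):

```python
import random
import numpy as np


def evaluate_task(target_pos, target_neg, train_and_test_fn,
                  n_labeled=6, n_repeats=10, seed=0):
    """示意性的评估流程:每个目标域随机抽取 n_labeled 个有标签样本用于训练,
    其余用于测试,重复 n_repeats 次,返回平均准确率与标准差。"""
    rng = random.Random(seed)
    labeled = [(x, 1) for x in target_pos] + [(x, 0) for x in target_neg]
    accs = []
    for _ in range(n_repeats):
        shuffled = labeled[:]
        rng.shuffle(shuffled)
        train_set, test_set = shuffled[:n_labeled], shuffled[n_labeled:]
        accs.append(train_and_test_fn(train_set, test_set))
    # 对应表 1 / 表 2 中报告的 "均值 ± 标准差"
    return np.mean(accs), np.std(accs)
```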

Comparison on Different Learning Orders

原文

As shown in Figure 4, the SLA algorithm can learn an order of useful intermediate data to be added into the learning process. To compare different orders of intermediate data to be added into learning, we manually design three ordering strategies. In the first strategy, denoted by Random, we randomly split intermediate domains data into ten subsets. In each iteration, we add one subset into the learning process. In the second strategy, denoted by Category, we first obtain an order using the SLA algorithm, and then use the additional category information of the intermediate data to add the intermediate data category by category into the learning process. Hence the order of the categories is the order they first appear in the order learned by the SLA algorithm. The third strategy, denoted by Reverse, is similar to the second one except that the order of the categories to be added is reversed. In the experiments, the source domain data are removed in the same way as the original learning process of the SLA algorithm. From the results shown in Figure 5, we can see that the three manually designed strategies obtain much worse accuracy than SLA. Furthermore, among the three strategies, ‘Category’ achieves the best performance, which is because this order generation is close to that of SLA.

译文

如图4所示,SLA算法可以学习出有用的中间数据被加入学习过程的顺序。为了比较将中间数据加入学习的不同顺序,我们手工设计了三种排序策略。第一种策略(记为Random)将中间域数据随机划分为十个子集,每次迭代向学习过程中加入一个子集。第二种策略(记为Category)首先使用SLA算法获得一个顺序,然后利用中间数据附加的类别信息,逐类别地将中间数据加入学习过程,类别的先后顺序就是它们在SLA算法学到的顺序中首次出现的先后顺序。第三种策略(记为Reverse)与第二种类似,只是加入类别的顺序相反。实验中,源域数据按照与SLA算法原始学习过程相同的方式被移除。从图5所示的结果可以看出,这三种手工设计的策略得到的准确率都比SLA差得多。此外,在这三种策略中,"Category"的性能最好,这是因为它的顺序生成方式最接近SLA。

笔记

这一部分的目的是弄清楚SLA的学习过程。作者设计了三种人工排序策略:Random、Category、Reverse,并加以验证,结果发现三者都明显不如SLA,其中 Category 策略因为最接近SLA学到的顺序而表现最好。
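
下面给出三种人工排序策略的一个 Python 示意实现,仅作为对上述文字描述的草图:`sla_selected_sequence` 表示 SLA 学到的中间数据选择顺序(按选中先后排列的「样本、类别」序列),是假设的输入格式,并非论文原始代码。

```python
import random


def random_order(intermediate_data, n_subsets=10, seed=0):
    """Random 策略:将中间域数据随机划分为 10 个子集,每轮迭代加入一个子集。"""
    rng = random.Random(seed)
    data = intermediate_data[:]
    rng.shuffle(data)
    size = (len(data) + n_subsets - 1) // n_subsets  # 每个子集的大小(向上取整)
    return [data[i:i + size] for i in range(0, len(data), size)]


def category_order(sla_selected_sequence, data_by_category, reverse=False):
    """Category / Reverse 策略:按各类别在 SLA 学到的选择顺序中
    首次出现的先后,逐类别加入中间数据;Reverse 则把类别顺序整体倒转。"""
    first_seen = []
    for _, category in sla_selected_sequence:
        if category not in first_seen:
            first_seen.append(category)
    if reverse:
        first_seen = first_seen[::-1]
    return [data_by_category[c] for c in first_seen]
```

Category 与 Reverse 共用同一个函数,仅通过 `reverse` 参数翻转类别顺序。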

Figure 4

介绍

详细的结果:
在两个迁移学习任务 “face-to-airplane” 和 “face-to-watch” 中:

  • 迭代过程中所选中间数据的可视化;
  • 迭代过程中所保留的源域正样本数量的变化;
  • 迭代过程中目标函数值(训练损失)的变化。

Figure 5

介绍

在数据集 Caltech-256 和 AwA 上不同学习顺序的结果比较。

Table 2

不同方法在部分选定任务上的分类准确率(%,均值 ± 标准差):

任务                     SVM     DTL     GFK     LAN     ASVM    TTL     STL     SLA
'horse-to-face'          84 ± 2  88 ± 2  77 ± 3  79 ± 2  76 ± 4  78 ± 2  86 ± 3  92 ± 2
'airplane-to-gorilla'    75 ± 1  62 ± 3  67 ± 5  66 ± 4  51 ± 2  65 ± 2  76 ± 3  84 ± 2
'face-to-watch'          75 ± 7  68 ± 3  61 ± 4  63 ± 4  60 ± 5  67 ± 4  75 ± 5  88 ± 4
'zebra-to-collie'        71 ± 3  69 ± 2  56 ± 2  57 ± 3  59 ± 2  70 ± 3  72 ± 3  76 ± 2

Detailed Results

原文

To understand how the data in intermediate domains can help connect the source domain and the target domain, in Figure 4, we show some images from intermediate domains selected in different iterations of the SLA algorithm on two transfer learning tasks, ‘face-to-airplane’ and ‘face-to-watch’, on the Caltech-256 dataset. Given the same source domain ‘face’ and two different target domains ‘airplane’ and ‘watch’, in Figure 4, we can see that the model selects completely different data points from intermediate domains. It may be not easy to figure out why those images are iteratively selected from our perspective. However, intuitively we can find that at the beginning, the selected images are closer to the source images such as ‘buddhas’ and ‘football-helmet’ in the two tasks, respectively, and at the end of the learning process, the selected images look closer to the target images. Moreover, in Figure 4, we show that as the iterative learning process proceeds, the number of positive samples in the source domain involved decreases and the value of the objective function in problem (3) decreases as well.

译文

为了了解中间域中的数据如何帮助连接源域和目标域,我们在图4中展示了SLA算法在Caltech-256数据集上的 'face-to-airplane' 和 'face-to-watch' 两个迁移学习任务的不同迭代中所选出的一些中间域图片。在图4中可以看到,给定相同的源域 'face' 和两个不同的目标域 'airplane' 与 'watch',模型从中间域选择了完全不同的数据点。从我们(人)的角度来看,也许不容易弄清楚为什么这些图像会被迭代地选中。但是可以直观地发现:在学习开始时,所选图像更接近源域图像,例如两个任务中分别选中的 'buddhas' 和 'football-helmet';而在学习过程的最后,所选图像看起来更接近目标域图像。此外,图4还显示,随着迭代学习过程的进行,所涉及的源域正样本数量在减少,问题(3)中的目标函数值也在下降。

笔记

用可视化方法探究了学习过程中的一些奥秘:远域迁移学习经由一系列中间域完成,开始时所选中间域数据的特征分布与源域相似,而学习后期所选中间域数据的特征分布与目标域相似。

Conclusion

原文

In this paper, we study the novel DDTL problem, where the source domain and the target domain are distant, but can be connected via some intermediate domains. To solve the DDTL problem, we propose the SLA algorithm to gradually select unlabeled data from the intermediate domains to bridge the two distant domains. Experiments conducted on two benchmark image datasets demonstrate that SLA is able to achieve state-of-the-art performance in terms of accuracy. As a future direction, we will extend the proposed algorithm to handle multiple source domains.

译文

在这篇论文中,我们研究了一种新的DDTL问题。在该类问题中,源域和目标域相距较远,但可以通过一些中间域连接起来。为了解决DDTL问题,我们提出了SLA算法,从中间域中逐步选择未标记的数据来连接两个相距较远的域。在两个基准图像数据集上进行的实验表明,SLA在准确率方面能够达到最先进的性能。未来的一个方向是将所提出的算法扩展到处理多个源域的情形。

笔记

nothing

Acknowledgments

译文

本工作得到国家基础研究计划(973计划)项目2014CB340304、香港CERG项目16211214、16209715、16244616以及国家自然科学基金项目61305071的资助。Sinno Jialin Pan感谢新加坡南洋理工大学南洋助理教授(NAP)基金M4081532.020的支持。我们感谢Hao Wang的讨论,以及审稿人的宝贵意见。
