Contents

A theory of learning from different domains
- H-divergence
Analysis of Representations for Domain Adaptation
References

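These notes summarize two papers:

A theory of learning from different domains
Analysis of Representations for Domain Adaptation

Both show theoretically that the error on the target domain is bounded by a quantity that depends in part on the error on the source domain, which gives useful guidance for designing domain adaptation methods.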

A theory of learning from different domains

We start with the basic setup. Let $\mathcal{D}_S$ and $f_S$ denote the input distribution of the source domain and the labeling function of that domain. The task is binary classification, and $f_S$ takes values in $[0,1]$. Likewise, $\mathcal{D}_T$ and $f_T$ denote the distribution and labeling function of the target domain.

A hypothesis is a classification function $h:\mathcal{X}\rightarrow \{0,1\}$. The error of $h$ with respect to a function $f$ is defined as:

$$\epsilon_S(h,f)=\mathrm{E}_{\mathbf{x}\sim\mathcal{D}_S}\left[|h(\mathbf{x})-f(\mathbf{x})|\right]$$

This is the disagreement between $h$ and $f$ on the source domain. In particular, when $f=f_S$ is the true labeling function, we write $\epsilon_S(h)=\epsilon_S(h,f_S)$. The target-domain error is defined in the same way: $\epsilon_T(h)=\epsilon_T(h,f_T)$. Next we introduce the key quantity, the H-divergence.
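The error definition above can be estimated from a finite sample by averaging $|h(\mathbf{x})-f(\mathbf{x})|$. The following is a minimal sketch; the 1-D threshold functions `f_s` and `h` and the sample `xs` are hypothetical choices made only for illustration.

```python
import numpy as np

def disagreement_error(h, f, samples):
    """Empirical estimate of eps(h, f) = E_x |h(x) - f(x)| over the samples."""
    samples = np.asarray(samples, dtype=float)
    return float(np.mean(np.abs(h(samples) - f(samples))))

# Hypothetical labeling function: label 1 when x >= 0.
f_s = lambda x: (x >= 0.0).astype(float)
# Hypothetical hypothesis: label 1 when x >= 0.5, so it disagrees with f_s on [0, 0.5).
h = lambda x: (x >= 0.5).astype(float)

xs = np.array([-1.0, -0.2, 0.1, 0.3, 0.7, 1.5])
err = disagreement_error(h, f_s, xs)
print(err)  # two of the six points (0.1 and 0.3) fall in [0, 0.5)
```

With a sample drawn from $\mathcal{D}_S$ this estimates $\epsilon_S(h)$; with a sample drawn from $\mathcal{D}_T$ and the target labeling function it estimates $\epsilon_T(h)$.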

H-divergence

A divergence is a relaxed notion of distance: it need not satisfy every property of a metric (for example, a divergence may fail to be symmetric). The H-divergence is a distance between two distributions $\mathcal{D}$ and $\mathcal{D}^{\prime}$ defined relative to a hypothesis space $\mathcal{H}$:

$$d_{\mathcal{H}}\left(\mathcal{D},\mathcal{D}^{\prime}\right)=2\sup_{h\in\mathcal{H}}\left|\operatorname{Pr}_{x\sim\mathcal{D}}[h(x)=1]-\operatorname{Pr}_{x\sim\mathcal{D}^{\prime}}[h(x)=1]\right|$$

Intuitively, we search the hypothesis space $\mathcal{H}$ for a function $h$ that makes $\operatorname{Pr}_{x\sim\mathcal{D}}[h(x)=1]$ as large as possible while making $\operatorname{Pr}_{x\sim\mathcal{D}^{\prime}}[h(x)=1]$ as small as possible (or the reverse, because of the absolute value). The largest achievable gap is used as the distance between $\mathcal{D}$ and $\mathcal{D}^{\prime}$. Equivalently, $h$ is the function in $\mathcal{H}$ that best distinguishes the two distributions.
This divergence can also be estimated from data:

Lemma 1 Let $\mathcal{H}$ be a hypothesis space on $\mathcal{X}$ with VC dimension $d$. If $\mathcal{U}$ and $\mathcal{U}^{\prime}$ are samples of size $m$ from $\mathcal{D}$ and $\mathcal{D}^{\prime}$ respectively and $\hat{d}_{\mathcal{H}}\left(\mathcal{U},\mathcal{U}^{\prime}\right)$ is the empirical H-divergence between the samples, then for any $\delta\in(0,1)$, with probability at least $1-\delta$,

$$d_{\mathcal{H}}\left(\mathcal{D},\mathcal{D}^{\prime}\right)\leq\hat{d}_{\mathcal{H}}\left(\mathcal{U},\mathcal{U}^{\prime}\right)+4\sqrt{\frac{d\log(2m)+\log\left(\frac{2}{\delta}\right)}{m}}$$

This is essentially a VC-dimension bound, where $d$ is the VC dimension of $\mathcal{H}$ and $m$ is the sample size. When $d$ is finite, the square-root term vanishes as $m\rightarrow\infty$, so the empirical divergence converges to the true one. Next we give a way to compute it:
Lemma 2 For a symmetric hypothesis space $\mathcal{H}$ (one where $h\in\mathcal{H}$ implies $1-h\in\mathcal{H}$) and samples $\mathcal{U}$, $\mathcal{U}^{\prime}$ of size $m$, the empirical divergence can be computed from the samples:

$$\hat{d}_{\mathcal{H}}\left(\mathcal{U},\mathcal{U}^{\prime}\right)=2\left(1-\min_{h\in\mathcal{H}}\left[\frac{1}{m}\sum_{\mathbf{x}:h(\mathbf{x})=0}I[\mathbf{x}\in\mathcal{U}]+\frac{1}{m}\sum_{\mathbf{x}:h(\mathbf{x})=1}I\left[\mathbf{x}\in\mathcal{U}^{\prime}\right]\right]\right)$$

Here $I[\mathbf{x}\in\mathcal{U}]$ equals 1 when $\mathbf{x}\in\mathcal{U}$ and 0 otherwise, so each sum counts the sample points that belong to the given set.
It is easy to see that the bracketed term estimates the probability gap that defines the H-divergence. For a fixed $h$, with probabilities taken over the samples:

$$1-\left[\frac{1}{m}\sum_{\mathbf{x}:h(\mathbf{x})=0}I[\mathbf{x}\in\mathcal{U}]+\frac{1}{m}\sum_{\mathbf{x}:h(\mathbf{x})=1}I\left[\mathbf{x}\in\mathcal{U}^{\prime}\right]\right]=\operatorname{Pr}_{x\sim\mathcal{U}}[h(x)=1]-\operatorname{Pr}_{x\sim\mathcal{U}^{\prime}}[h(x)=1]$$

Minimizing the bracket over $h$ therefore maximizes this gap.
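The Lemma 2 formula can be evaluated by brute force when the hypothesis space is small. The sketch below assumes, purely for illustration, that $\mathcal{H}$ is the set of 1-D threshold classifiers in both orientations (which makes the class symmetric); the function name `empirical_h_divergence` is hypothetical.

```python
import numpy as np

def empirical_h_divergence(u, u_prime):
    """Empirical H-divergence via Lemma 2 for 1-D threshold classifiers.

    Assumed hypothesis space: h(x) = 1[x >= t] and h(x) = 1[x <= t] for all t.
    """
    u = np.asarray(u, dtype=float)
    u_prime = np.asarray(u_prime, dtype=float)
    if len(u) != len(u_prime):
        raise ValueError("Lemma 2 assumes both samples have the same size m")
    m = len(u)
    pts = np.sort(np.concatenate([u, u_prime]))
    # Candidate thresholds: below all points, between neighbours, above all points.
    thresholds = np.concatenate(
        [[pts[0] - 1.0], (pts[:-1] + pts[1:]) / 2.0, [pts[-1] + 1.0]]
    )
    best = np.inf
    for t in thresholds:
        for sign in (1.0, -1.0):
            pred_u = sign * (u - t) >= 0          # h(x) = 1 on U
            pred_up = sign * (u_prime - t) >= 0   # h(x) = 1 on U'
            # Points of U labeled 0 plus points of U' labeled 1, each divided by m.
            cost = np.sum(~pred_u) / m + np.sum(pred_up) / m
            best = min(best, cost)
    return float(2.0 * (1.0 - best))

separated = empirical_h_divergence([0, 1, 2, 3], [10, 11, 12, 13])
identical = empirical_h_divergence([0, 1, 2, 3], [0, 1, 2, 3])
print(separated, identical)
```

Perfectly separable samples give the maximum value 2, while identical samples give 0, since no classifier can tell them apart.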

Definition 1 The symmetric difference hypothesis space $\mathcal{H}\Delta\mathcal{H}$ is the set of hypotheses

$$g\in\mathcal{H}\Delta\mathcal{H}\ \ \Longleftrightarrow\ \ g(\mathbf{x})=h(\mathbf{x})\oplus h^{\prime}(\mathbf{x})\ \ \text{ for some } h,h^{\prime}\in\mathcal{H}$$

where $\oplus$ denotes XOR: $g(\mathbf{x})=1$ exactly when $h(\mathbf{x})\neq h^{\prime}(\mathbf{x})$.

Intuitively, $g$ indicates whether two hypotheses disagree at a point. The benefit is that functions in this set express the probability that two hypotheses differ, that is, the error of one with respect to the other. The largest change in this disagreement between the two domains then gives the value of the divergence:

$$d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{D}_S,\mathcal{D}_T)=2\sup_{h,h^{\prime}\in\mathcal{H}}\left|\epsilon_S\left(h,h^{\prime}\right)-\epsilon_T\left(h,h^{\prime}\right)\right|$$

The derivation appears in the proof of Lemma 3:

Lemma 3 For any hypotheses $h,h^{\prime}\in\mathcal{H}$,

$$\left|\epsilon_S\left(h,h^{\prime}\right)-\epsilon_T\left(h,h^{\prime}\right)\right|\leq\frac{1}{2}d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{D}_S,\mathcal{D}_T)$$

Proof:

$$\begin{aligned} d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{D}_S,\mathcal{D}_T) &=2\sup_{h,h^{\prime}\in\mathcal{H}}\left|\operatorname{Pr}_{x\sim\mathcal{D}_S}\left[h(x)\oplus h^{\prime}(x)=1\right]-\operatorname{Pr}_{x\sim\mathcal{D}_T}\left[h(x)\oplus h^{\prime}(x)=1\right]\right|\\ &=2\sup_{h,h^{\prime}\in\mathcal{H}}\left|\operatorname{Pr}_{x\sim\mathcal{D}_S}\left[h(x)\neq h^{\prime}(x)\right]-\operatorname{Pr}_{x\sim\mathcal{D}_T}\left[h(x)\neq h^{\prime}(x)\right]\right|\\ &=2\sup_{h,h^{\prime}\in\mathcal{H}}\left|\epsilon_S\left(h,h^{\prime}\right)-\epsilon_T\left(h,h^{\prime}\right)\right|\\ &\geq 2\left|\epsilon_S\left(h,h^{\prime}\right)-\epsilon_T\left(h,h^{\prime}\right)\right| \end{aligned}$$

This completes the proof.
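Lemma 3 can be checked numerically on a small example. The sketch below assumes a finite grid of 1-D threshold classifiers as $\mathcal{H}$ and two synthetic Gaussian samples standing in for the source and target domains; all of these choices are hypothetical. The divergence is computed over the grid only, so it is an empirical stand-in for $d_{\mathcal{H}\Delta\mathcal{H}}$.

```python
import itertools
import numpy as np

def threshold(t):
    """Hypothesis h(x) = 1[x >= t]."""
    return lambda x: (np.asarray(x) >= t).astype(int)

def disagreement(h, h2, xs):
    """Empirical eps(h, h') = fraction of points where h XOR h' equals 1."""
    return float(np.mean(h(xs) != h2(xs)))

rng = np.random.default_rng(0)
xs_s = rng.normal(0.0, 1.0, size=200)  # synthetic source sample
xs_t = rng.normal(1.0, 1.0, size=200)  # synthetic target sample (shifted mean)

hs = [threshold(t) for t in np.linspace(-3.0, 4.0, 29)]

# |eps_S(h, h') - eps_T(h, h')| for every pair in the grid.
gaps = [
    abs(disagreement(h, h2, xs_s) - disagreement(h, h2, xs_t))
    for h, h2 in itertools.product(hs, repeat=2)
]
d_hdh = 2.0 * max(gaps)  # empirical H-delta-H divergence over the grid

print(d_hdh, all(g <= d_hdh / 2.0 for g in gaps))
```

For thresholds, $h\oplus h^{\prime}$ is the indicator of an interval, so the divergence here measures the largest difference in probability mass that the two samples assign to any interval between grid points.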

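With these lemmas in hand, we can prove the main theorem. It says that a hypothesis $h$ with small error on the source domain also has small error on the target domain, provided the divergence between the two domains and the joint error $\lambda$ of the best hypothesis are small.
Theorem 1 Let $\mathcal{H}$ be a hypothesis space of VC dimension $d$, and let $\mathcal{U}_S$, $\mathcal{U}_T$ be unlabeled samples of size $m^{\prime}$ each, drawn from $\mathcal{D}_S$ and $\mathcal{D}_T$. Then for any $\delta\in(0,1)$, with probability at least $1-\delta$, for every $h\in\mathcal{H}$:

$$\epsilon_T(h)\leq\epsilon_S(h)+\frac{1}{2}\hat{d}_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{U}_S,\mathcal{U}_T)+4\sqrt{\frac{2d\log\left(2m^{\prime}\right)+\log\left(\frac{2}{\delta}\right)}{m^{\prime}}}+\lambda$$

where $\lambda=\epsilon_S(h^{*})+\epsilon_T(h^{*})$ and $h^{*}=\operatorname{argmin}_{h\in\mathcal{H}}\left(\epsilon_S(h)+\epsilon_T(h)\right)$.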

Proof: the proof uses Lemma 3 and Lemma 1 above, together with the triangle inequality $\epsilon_T(h,f_T)\leq\epsilon_T\left(f_T,h^{*}\right)+\epsilon_T\left(h,h^{*}\right)$.

$$\begin{aligned} \epsilon_T(h) &\leq\epsilon_T\left(h^{*}\right)+\epsilon_T\left(h,h^{*}\right)\\ &=\epsilon_T\left(h^{*}\right)+\epsilon_T\left(h,h^{*}\right)+\epsilon_S\left(h,h^{*}\right)-\epsilon_S\left(h,h^{*}\right)\\ &\leq\epsilon_T\left(h^{*}\right)+\epsilon_S\left(h,h^{*}\right)+\left|\epsilon_T\left(h,h^{*}\right)-\epsilon_S\left(h,h^{*}\right)\right|\\ &\leq\epsilon_T\left(h^{*}\right)+\epsilon_S\left(h,h^{*}\right)+\frac{1}{2}d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{D}_S,\mathcal{D}_T) && \text{(Lemma 3)}\\ &\leq\epsilon_T\left(h^{*}\right)+\epsilon_S(h)+\epsilon_S\left(h^{*}\right)+\frac{1}{2}d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{D}_S,\mathcal{D}_T) && \text{(triangle inequality)}\\ &=\epsilon_S(h)+\frac{1}{2}d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{D}_S,\mathcal{D}_T)+\lambda\\ &\leq\epsilon_S(h)+\frac{1}{2}\hat{d}_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{U}_S,\mathcal{U}_T)+4\sqrt{\frac{2d\log\left(2m^{\prime}\right)+\log\left(\frac{2}{\delta}\right)}{m^{\prime}}}+\lambda && \text{(Lemma 1)} \end{aligned}$$

The first line uses the triangle inequality stated above. The fifth line uses the triangle inequality $\epsilon_S\left(h,h^{*}\right)\leqslant\epsilon_S(h,f_S)+\epsilon_S\left(h^{*},f_S\right)$. The last line uses VC theory: it applies Lemma 1 to $\mathcal{H}\Delta\mathcal{H}$, whose VC dimension is at most $2d$, to bound the error of estimating $\frac{1}{2}d_{\mathcal{H}\Delta\mathcal{H}}$ from samples, where $d$ is the VC dimension of $\mathcal{H}$.
This completes the proof.

In essence, the bound uses the H-divergence to link the errors on the two domains: the gap between them is controlled by the divergence. Informally,
$$|\epsilon_S-\epsilon_T|\approx\frac{1}{2}d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{D}_S,\mathcal{D}_T)$$
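To get a feel for the size of each term in Theorem 1, the right-hand side can be evaluated numerically. This is a small sketch; the values passed for the source error, empirical divergence, $\lambda$, $d$ and $\delta$ are made-up inputs, not results from either paper.

```python
import math

def divergence_estimation_term(d, m_prime, delta):
    """4 * sqrt((2 d log(2 m') + log(2 / delta)) / m') from Theorem 1."""
    return 4.0 * math.sqrt(
        (2 * d * math.log(2 * m_prime) + math.log(2 / delta)) / m_prime
    )

def target_error_bound(eps_s, d_hat, lam, d, m_prime, delta):
    """Right-hand side of Theorem 1 for the given (hypothetical) quantities."""
    return eps_s + 0.5 * d_hat + divergence_estimation_term(d, m_prime, delta) + lam

# Illustrative inputs only: eps_S = 0.05, empirical divergence = 0.2, lambda = 0.02.
for m_prime in (10**3, 10**5, 10**7):
    bound = target_error_bound(0.05, 0.2, 0.02, d=10, m_prime=m_prime, delta=0.05)
    print(m_prime, round(bound, 4))
```

The estimation term shrinks roughly like $\sqrt{\log m^{\prime}/m^{\prime}}$, so with few unlabeled samples the bound can exceed 1 and carry no information, while for large $m^{\prime}$ it approaches $\epsilon_S(h)+\frac{1}{2}\hat{d}_{\mathcal{H}\Delta\mathcal{H}}+\lambda$.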

Analysis of Representations for Domain Adaptation

This paper extends the domain adaptation error bound to distributions over a representation. It assumes a representation function $\mathcal{R}$ that maps instances in $\mathcal{X}$ to a feature space $\mathcal{Z}$. Once $\mathcal{R}$ is fixed, the domain induced on $\mathcal{Z}$ is determined as well, because any event in $\mathcal{Z}$ can be mapped back to $\mathcal{X}$ through the preimage $\mathcal{R}^{-1}$:

$$\begin{array}{ccc} \operatorname{Pr}_{\tilde{\mathcal{D}}}[B] & \stackrel{\mathrm{def}}{=} & \operatorname{Pr}_{\mathcal{D}}\left[\mathcal{R}^{-1}(B)\right]\\ \tilde{f}(\mathbf{z}) & \stackrel{\mathrm{def}}{=} & \mathrm{E}_{\mathcal{D}}[f(\mathbf{x})\mid\mathcal{R}(\mathbf{x})=\mathbf{z}] \end{array}$$

Simply put, $B$ is an event in the feature space, and $\operatorname{Pr}_{\tilde{\mathcal{D}}}[B]$ is the probability measure induced on the representation. $\tilde{f}(\mathbf{z})$ is the average of $f(\mathbf{x})$ over all $\mathbf{x}$ that are mapped to $\mathbf{z}$: each $f(\mathbf{x})$ is a label, and their average serves as the label of the representation $\mathbf{z}$.
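The two definitions above can be made concrete on a tiny discrete domain. In the sketch below, the input space, the uniform distribution `D`, the labels `f` and the representation `R` are all hypothetical; the helper `induced` computes the induced distribution and the averaged labeling function exactly, using fractions.

```python
from collections import defaultdict
from fractions import Fraction

# Hypothetical domain: X = {0, 1, 2, 3} with a uniform distribution D.
D = {x: Fraction(1, 4) for x in range(4)}
# Hypothetical deterministic labels f(x).
f = {0: 0, 1: 1, 2: 1, 3: 1}
# Hypothetical representation R: X -> Z = {0, 1}, merging pairs of inputs.
R = lambda x: x // 2

def induced(D, f, R):
    """Return (D_tilde, f_tilde): the pushforward of D and the averaged labels."""
    mass = defaultdict(Fraction)        # Pr_D[R^{-1}({z})]
    label_mass = defaultdict(Fraction)  # sum of Pr_D[x] * f(x) over x with R(x) = z
    for x, p in D.items():
        z = R(x)
        mass[z] += p
        label_mass[z] += p * f[x]
    f_tilde = {z: label_mass[z] / mass[z] for z in mass}  # E_D[f(x) | R(x) = z]
    return dict(mass), f_tilde

D_tilde, f_tilde = induced(D, f, R)
print(D_tilde)  # each z receives the mass of its preimage
print(f_tilde)  # z = 0 mixes labels 0 and 1, so its averaged label is 1/2
```

The value $\tilde{f}(0)=1/2$ shows how a representation that merges inputs with different labels makes the induced labeling function stochastic even when $f$ is deterministic.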

In the DA setting, $\mathcal{D}_S$ denotes the distribution of the source domain and $\tilde{\mathcal{D}}_S$ denotes the source distribution over the feature space, that is, the distribution obtained by transforming $\mathcal{D}_S$ through $\mathcal{R}$, exactly as in the definition above.

The error extends to the representation setting in the same way: we simply sample $\mathbf{z}$ from $\tilde{\mathcal{D}}_S$. Let $h$ be any classifier on the feature space. Its error on the source domain is:

$$\begin{aligned} \epsilon_S(h) &=\mathrm{E}_{\mathbf{z}\sim\tilde{\mathcal{D}}_S}\left[\mathrm{E}_{y\sim\tilde{f}_S(\mathbf{z})}\left[I[y\neq h(\mathbf{z})]\right]\right]\\ &=\mathrm{E}_{\mathbf{z}\sim\tilde{\mathcal{D}}_S}\left|\tilde{f}_S(\mathbf{z})-h(\mathbf{z})\right| \end{aligned}$$

The target-domain error is defined in the same way:

$$\begin{aligned} \epsilon_T(h) &=\mathrm{E}_{\mathbf{z}\sim\tilde{\mathcal{D}}_T}\left[\mathrm{E}_{y\sim\tilde{f}_T(\mathbf{z})}\left[I[y\neq h(\mathbf{z})]\right]\right]\\ &=\mathrm{E}_{\mathbf{z}\sim\tilde{\mathcal{D}}_T}\left|\tilde{f}_T(\mathbf{z})-h(\mathbf{z})\right| \end{aligned}$$

That is, $\epsilon_S(h)=\epsilon_S(h,\tilde{f}_S)$ and $\epsilon_T(h)=\epsilon_T(h,\tilde{f}_T)$.
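The two forms of the error agree because, for a binary prediction, the probability that a label drawn with mean $\tilde{f}(\mathbf{z})$ differs from $h(\mathbf{z})$ is exactly $|\tilde{f}(\mathbf{z})-h(\mathbf{z})|$. The sketch below verifies this on a hypothetical two-point feature space; `D_tilde`, `f_tilde` and `h` are illustrative values.

```python
from fractions import Fraction

# Hypothetical induced distribution and averaged labels on Z = {0, 1}.
D_tilde = {0: Fraction(1, 2), 1: Fraction(1, 2)}
f_tilde = {0: Fraction(1, 2), 1: Fraction(1)}
# Hypothetical classifier on the feature space.
h = {0: 0, 1: 1}

def error_via_labels(D_tilde, f_tilde, h):
    """E_z[ Pr_{y ~ Bernoulli(f_tilde(z))}[y != h(z)] ]."""
    total = Fraction(0)
    for z, p in D_tilde.items():
        p_y1 = f_tilde[z]
        # If h(z) = 0 a mismatch means y = 1; if h(z) = 1 it means y = 0.
        p_mismatch = p_y1 if h[z] == 0 else 1 - p_y1
        total += p * p_mismatch
    return total

def error_via_abs(D_tilde, f_tilde, h):
    """E_z |f_tilde(z) - h(z)|."""
    return sum((p * abs(f_tilde[z] - h[z]) for z, p in D_tilde.items()), Fraction(0))

e1 = error_via_labels(D_tilde, f_tilde, h)
e2 = error_via_abs(D_tilde, f_tilde, h)
print(e1, e2)  # both forms give the same value
```

Here the only error comes from $z=0$, where the averaged label is $1/2$ and any binary prediction is wrong half of the time.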

Next we extend Theorem 1 to the setting with a representation.

Theorem 2 Let $\mathcal{R}$ be a fixed representation function from $\mathcal{X}$ to $\mathcal{Z}$ and $\mathcal{H}$ be a hypothesis space of VC dimension $d$. If a random labeled sample of size $m$ is generated by applying $\mathcal{R}$ to a $\mathcal{D}_S$-i.i.d. sample labeled according to $f$, then with probability at least $1-\delta$, for every $h\in\mathcal{H}$:

$$\epsilon_T(h)\leq\hat{\epsilon}_S(h)+\sqrt{\frac{4}{m}\left(d\log\frac{2em}{d}+\log\frac{4}{\delta}\right)}+d_{\mathcal{H}}\left(\tilde{\mathcal{D}}_S,\tilde{\mathcal{D}}_T\right)+\lambda$$

where $e$ is the base of the natural logarithm.
Proof:
Let $h^{*}=\operatorname{argmin}_{h\in\mathcal{H}}\left(\epsilon_T(h)+\epsilon_S(h)\right)$, with $\epsilon_T(h^{*})=\lambda_T$ and $\epsilon_S(h^{*})=\lambda_S$, and write $\lambda=\lambda_T+\lambda_S$.

$$\begin{aligned} \epsilon_T(h) &\leq\lambda_T+\operatorname{Pr}_{\mathcal{D}_T}[\mathcal{Z}_h\Delta\mathcal{Z}_{h^{*}}]\\ &=\lambda_T+\operatorname{Pr}_{\mathcal{D}_S}[\mathcal{Z}_h\Delta\mathcal{Z}_{h^{*}}]+\operatorname{Pr}_{\mathcal{D}_T}[\mathcal{Z}_h\Delta\mathcal{Z}_{h^{*}}]-\operatorname{Pr}_{\mathcal{D}_S}[\mathcal{Z}_h\Delta\mathcal{Z}_{h^{*}}]\\ &\leq\lambda_T+\operatorname{Pr}_{\mathcal{D}_S}[\mathcal{Z}_h\Delta\mathcal{Z}_{h^{*}}]+\left|\operatorname{Pr}_{\mathcal{D}_S}[\mathcal{Z}_h\Delta\mathcal{Z}_{h^{*}}]-\operatorname{Pr}_{\mathcal{D}_T}[\mathcal{Z}_h\Delta\mathcal{Z}_{h^{*}}]\right|\\ &\leq\lambda_T+\operatorname{Pr}_{\mathcal{D}_S}[\mathcal{Z}_h\Delta\mathcal{Z}_{h^{*}}]+d_{\mathcal{H}}\left(\tilde{\mathcal{D}}_S,\tilde{\mathcal{D}}_T\right)\\ &\leq\lambda_T+\lambda_S+\epsilon_S(h)+d_{\mathcal{H}}\left(\tilde{\mathcal{D}}_S,\tilde{\mathcal{D}}_T\right)\\ &=\lambda+\epsilon_S(h)+d_{\mathcal{H}}\left(\tilde{\mathcal{D}}_S,\tilde{\mathcal{D}}_T\right) \end{aligned}$$

Here $\mathcal{Z}_h=\{\mathbf{z}\in\mathcal{Z}:h(\mathbf{z})=1\}$, so $\mathcal{Z}_h\Delta\mathcal{Z}_{h^{*}}$ is the set on which $h$ and $h^{*}$ disagree, and $\operatorname{Pr}_{\mathcal{D}_T}[\mathcal{Z}_h\Delta\mathcal{Z}_{h^{*}}]$ can be read as $\epsilon_T\left(h,h^{*}\right)$.
The first inequality comes from the triangle inequality $\epsilon_T(h,f_T)\leqslant\epsilon_T(h^{*},f_T)+\epsilon_T(h^{*},h)$.
The fifth line comes from the triangle inequality $\epsilon_S(h^{*},h)\leqslant\epsilon_S(h^{*},f_S)+\epsilon_S(h,f_S)$.
Finally, by Vapnik-Chervonenkis theory (V. Vapnik. Statistical Learning Theory. John Wiley, New York, 1998),

$$\epsilon_S(h)\leq\hat{\epsilon}_S(h)+\sqrt{\frac{4}{m}\left(d\log\frac{2em}{d}+\log\frac{4}{\delta}\right)}$$

Substituting this into the last line of the derivation gives the bound stated in Theorem 2.

Similarly, $d_{\mathcal{H}}\left(\tilde{\mathcal{D}}_S,\tilde{\mathcal{D}}_T\right)$ can be replaced by its empirical estimate. With $m^{\prime}$ samples from each distribution, the bound can be further written as:

$$\epsilon_T(h)\leq\hat{\epsilon}_S(h)+\sqrt{\frac{4}{m}\left(d\log\frac{2em}{d}+\log\frac{4}{\delta}\right)}+\lambda+d_{\mathcal{H}}\left(\tilde{\mathcal{U}}_S,\tilde{\mathcal{U}}_T\right)+4\sqrt{\frac{d\log\left(2m^{\prime}\right)+\log\left(\frac{4}{\delta}\right)}{m^{\prime}}}$$

This completes the proof.
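The final bound has two sample-dependent terms: one driven by the number of labeled source examples $m$ and one driven by the number of unlabeled examples $m^{\prime}$. The sketch below evaluates both so they can be compared; every numeric input is a made-up value for illustration, and the function names are hypothetical.

```python
import math

def labeled_term(d, m, delta):
    """sqrt(4/m * (d log(2em/d) + log(4/delta))): cost of estimating the source error."""
    return math.sqrt(4.0 / m * (d * math.log(2 * math.e * m / d) + math.log(4 / delta)))

def unlabeled_term(d, m_prime, delta):
    """4 * sqrt((d log(2m') + log(4/delta)) / m'): cost of estimating the divergence."""
    return 4.0 * math.sqrt((d * math.log(2 * m_prime) + math.log(4 / delta)) / m_prime)

def theorem2_bound(emp_eps_s, emp_d_h, lam, d, m, m_prime, delta):
    """Right-hand side of the empirical form of Theorem 2."""
    return (emp_eps_s + labeled_term(d, m, delta) + lam
            + emp_d_h + unlabeled_term(d, m_prime, delta))

# Illustrative inputs only.
print(labeled_term(d=10, m=10**4, delta=0.05))
print(unlabeled_term(d=10, m_prime=10**4, delta=0.05))
print(theorem2_bound(0.05, 0.1, 0.02, d=10, m=10**4, m_prime=10**6, delta=0.05))
```

Because unlabeled data is usually cheaper than labeled data, $m^{\prime}$ can often be made much larger than $m$, which is one reason the divergence term is practical to estimate.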

References

A theory of learning from different domains

Analysis of Representations for Domain Adaptation

V. Vapnik. Statistical Learning Theory. John Wiley, New York, 1998
