The concept of entropy comes up constantly in statistical machine learning, so before introducing NCE and InfoNCE we briefly review entropy and its related quantities. Information content measures the amount of uncertainty, and entropy can be viewed as the expectation of information content. Shannon's definitions: for a random variable $X$, the Shannon information is $I(X) = -\log P(X)$, and the Shannon entropy is the expectation of the Shannon information, $H(X) = E(I(X)) = \sum_{x} P(x)I(x) = -\sum_{x} P(x)\log P(x)$. Entropy is maximized when the random variable is uniformly distributed. Shannon's source coding theorem shows that entropy is a lower bound on the number of bits needed to transmit the state of a random variable.
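As a quick sanity check, here is a minimal Python sketch (the distributions are made-up examples) that computes Shannon entropy in nats and shows the uniform distribution attains the maximum:

```python
import numpy as np

def entropy(p):
    """Shannon entropy H(X) = -sum_x p(x) log p(x), in nats."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # 0 * log(0) is taken as 0
    return -np.sum(p * np.log(p))

uniform = [0.25, 0.25, 0.25, 0.25]   # maximal uncertainty
skewed  = [0.90, 0.05, 0.03, 0.02]   # nearly deterministic

print(entropy(uniform))  # log(4) ≈ 1.386, the maximum for 4 outcomes
print(entropy(skewed))   # ≈ 0.43, much lower
```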

For multivariate random variables, the joint entropy is defined as: $H(X,Y) = E(I(X,Y)) = \sum_{x,y}P(x,y)I(x,y) = -\sum_{x,y}P(x,y)\log P(x,y)$

Conditional entropy:

$$H(Y|X) = E_{x}(H(Y|X=x)) = -\sum_{x} p(x)\sum_{y}p(y|x)\log p(y|x)$$

$$= -\sum_{x}\sum_{y}p(x)p(y|x)\log p(y|x)$$

$$= -\sum_{x,y}p(x,y)\log p(y|x)$$

From the definition we can derive the following property of conditional entropy:

$$H(Y|X) = -\sum_{x,y}p(x,y)\log p(y|x)$$

$$= -\sum_{x,y}p(x,y)\log\frac{p(x,y)}{p(x)}$$

$$= -\sum_{x,y}p(x,y)\log p(x,y) + \sum_{x,y}p(x,y)\log p(x)$$

$$= -\sum_{x,y}p(x,y)\log p(x,y) + \sum_{x}\log p(x)\sum_{y}p(x,y)$$

$$= H(X,Y) + \sum_{x}p(x)\log p(x)$$

$$= H(X,Y) - H(X)$$
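A small numeric check of the chain rule $H(Y|X) = H(X,Y) - H(X)$ on a made-up $2\times 2$ joint table:

```python
import numpy as np

# Hypothetical joint distribution p(x, y) (rows: x, cols: y)
pxy = np.array([[0.4, 0.1],
                [0.2, 0.3]])

px = pxy.sum(axis=1)                        # marginal p(x)
H_xy = -np.sum(pxy * np.log(pxy))           # joint entropy H(X, Y)
H_x = -np.sum(px * np.log(px))              # marginal entropy H(X)

# Conditional entropy computed directly from p(y|x)
py_given_x = pxy / px[:, None]
H_y_given_x = -np.sum(pxy * np.log(py_given_x))

print(np.isclose(H_y_given_x, H_xy - H_x))  # True: the chain rule holds
```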

Relative entropy (also called KL divergence):

Let $p(x)$ and $q(x)$ be two probability distributions of a discrete random variable; the relative entropy of $p$ with respect to $q$ is

$$KL(p||q) = E_{x\sim p(x)}\left(\log\frac{p(x)}{q(x)}\right) = \sum_{x}p(x)\log\frac{p(x)}{q(x)}$$

  • If $p$ and $q$ are identical, then $KL(p||q) = KL(q||p) = 0$
  • $KL(p||q) \ne KL(q||p)$ in general (KL divergence is not symmetric)
  • $KL(p||q) \ge 0$

Proof:

$$KL(p||q) = E_{x\sim p(x)}\left(\log\frac{p(x)}{q(x)}\right)$$

$$= \sum_{x}p(x)\log\frac{p(x)}{q(x)}$$

$$= -\sum_{x}p(x)\log\frac{q(x)}{p(x)}$$

$$= -E_{x\sim p(x)}\left(\log\frac{q(x)}{p(x)}\right)$$

$$\ge -\log E_{x\sim p(x)}\left(\frac{q(x)}{p(x)}\right) \quad \text{(Jensen's inequality)}$$

$$= -\log\sum_{x}p(x)\frac{q(x)}{p(x)}$$

$$= -\log\sum_{x}q(x) = 0$$
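These properties are easy to verify numerically; a minimal sketch with made-up distributions:

```python
import numpy as np

def kl(p, q):
    """KL(p || q) = sum_x p(x) log(p(x)/q(x)), in nats."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0  # terms with p(x) = 0 contribute 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

p = [0.5, 0.4, 0.1]
q = [0.3, 0.3, 0.4]

print(kl(p, q), kl(q, p))  # both >= 0 and generally unequal (asymmetry)
print(kl(p, p))            # 0.0 when the distributions coincide
```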

Cross entropy:

$$H(p,q) = -\sum_{x}p(x)\log q(x)$$

Cross entropy has the following property:

$$KL(p||q) = \sum_{x}p(x)\log\frac{p(x)}{q(x)}$$

$$= \sum_{x}p(x)\log p(x) - \sum_{x}p(x)\log q(x)$$

$$= -\sum_{x}p(x)\log q(x) - \left(-\sum_{x}p(x)\log p(x)\right)$$

$$= H(p,q) - H(p)$$

$$\le H(p,q)$$

Therefore minimizing the cross entropy $H(p,q)$ minimizes an upper bound on the KL divergence; in fact, since $H(p)$ does not depend on $q$, minimizing $H(p,q)$ over $q$ is exactly equivalent to minimizing $KL(p||q)$. In machine learning we can regard $p(x)$ as the true data distribution and $q(x)$ as the distribution modeled by the learner, whose parameters are to be estimated (learned). A common estimation (learning) strategy is to minimize the cross entropy; one can show that with the samples fixed, minimizing cross entropy is equivalent to minimizing relative entropy (KL divergence) and also to maximizing likelihood.

In practice the true data distribution $p(x)$ is unknown, so learning approximates it with samples $\{x_i, i=1,2,\ldots,n\}$ drawn from the population, in the hope that the learned distribution $q(x)$ agrees with the sample distribution.

Under the minimum-cross-entropy strategy, the loss function is defined as:

$$Loss(\theta) = H(p,q) = -\sum_{x}p(x)\log q(x;\theta) = -E_{x\sim p(x)}(\log q(x;\theta))$$

$$\approx -\frac{1}{n}\sum_{i=1}^{n}\log q(x_i;\theta)$$

$$\theta^* = \arg\min_{\theta} -\frac{1}{n}\sum_{i=1}^{n}\log q(x_i;\theta)$$

Under the maximum-likelihood strategy, the log-likelihood of the samples is:

$$L(\theta) = \sum_{i=1}^{n}\log q(x_i;\theta)$$

$$\theta_{MLE} = \arg\max_{\theta}\sum_{i=1}^{n}\log q(x_i;\theta)$$

It is clear that minimum-cross-entropy estimation is equivalent to maximum-likelihood estimation: the two objectives differ only by a sign and the constant factor $1/n$.
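A quick illustration with a Bernoulli model (a made-up example): the empirical cross-entropy loss is exactly the negative average log-likelihood, so grid-minimizing it recovers the MLE, the sample mean:

```python
import numpy as np

# Made-up Bernoulli samples; theta is the model's success probability q(x=1; theta)
samples = np.array([1, 0, 1, 1, 0, 1, 1, 1])

def neg_log_likelihood(theta, x):
    """Average negative log-likelihood == empirical cross entropy H(p_emp, q_theta)."""
    return -np.mean(x * np.log(theta) + (1 - x) * np.log(1 - theta))

thetas = np.linspace(0.01, 0.99, 99)
losses = [neg_log_likelihood(t, samples) for t in thetas]
print(thetas[np.argmin(losses)])  # ≈ 0.75, the sample mean and the MLE
```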

Mutual information:

The mutual information of two random variables $X$ and $Y$ is defined as

$$I(X;Y) = E_{x,y\sim p(x,y)}\left(\log\frac{p(x,y)}{p(x)p(y)}\right) = \sum_{x}\sum_{y}p(x,y)\log\frac{p(x,y)}{p(x)p(y)}$$

Mutual information satisfies:

  • Symmetry: $I(X;Y) = I(Y;X)$
  • Non-negativity: $I(X;Y) \ge 0$, with $I(X;Y) = 0$ when $X$ and $Y$ are independent

Proof:

$$I(X;Y) = E_{x,y\sim p(x,y)}\left(\log\frac{p(x,y)}{p(x)p(y)}\right)$$

$$= -E_{x,y\sim p(x,y)}\left(\log\frac{p(x)p(y)}{p(x,y)}\right)$$

$$\ge -\log E_{x,y\sim p(x,y)}\left(\frac{p(x)p(y)}{p(x,y)}\right) \quad \text{(Jensen's inequality)}$$

$$= -\log\sum_{x,y}p(x,y)\frac{p(x)p(y)}{p(x,y)}$$

$$= -\log\sum_{x,y}p(x)p(y) = -\log 1 = 0$$
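A small numeric check on made-up joint tables: the mutual information is positive for a dependent pair and zero exactly when the joint factorizes:

```python
import numpy as np

def mutual_information(pxy):
    """I(X;Y) = sum_{x,y} p(x,y) log( p(x,y) / (p(x)p(y)) ), in nats."""
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    mask = pxy > 0
    return np.sum(pxy[mask] * np.log(pxy[mask] / (px * py)[mask]))

dependent   = np.array([[0.4, 0.1], [0.1, 0.4]])
independent = np.outer([0.5, 0.5], [0.5, 0.5])  # p(x,y) = p(x)p(y)

print(mutual_information(dependent))    # > 0
print(mutual_information(independent))  # 0.0
```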

Noise Contrastive Estimation (NCE)

A language model models the conditional probability of a word $w$ given its context $c$:

$$p(w|c) = p_{\theta}(w|c) = \frac{\exp(s_{\theta}(w,c))}{\sum_{w'\in V}\exp(s_{\theta}(w',c))} \quad\quad (1)$$

Below we consider the problem of estimating the parameters $\theta$ given the context $c$.
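To make the cost of equation (1) concrete, here is a minimal sketch with a hypothetical dot-product score $s_{\theta}(w,c) = W[w]\cdot c$ (the embedding matrix $W$ stands in for $\theta$); note that the normalizer $Z(c)$ touches the entire vocabulary:

```python
import numpy as np

V, d = 10000, 128                 # vocabulary size, embedding dimension
rng = np.random.default_rng(0)
W = rng.normal(size=(V, d))       # hypothetical stand-in for theta

def p_w_given_c(c_vec):
    """Full softmax of equation (1): O(|V|) work for every context."""
    scores = W @ c_vec            # s_theta(w, c) for all w in V at once
    scores -= scores.max()        # numerical stability
    e = np.exp(scores)
    return e / e.sum()            # the normalizer Z(c) sums over the whole vocabulary

probs = p_w_given_c(rng.normal(size=d))
print(probs.shape, probs.sum())   # (10000,) ~1.0
```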

Using maximum-likelihood estimation, the log-likelihood of the samples is:

$$L(\theta) = \sum_{i=1}^{n}\log p_{\theta}(w_i|c)$$

$$= \sum_{i=1}^{n}\log\exp(s_{\theta}(w_i,c)) - \sum_{i=1}^{n}\log\sum_{w'\in V}\exp(s_{\theta}(w',c))$$

$$= \sum_{i=1}^{n}s_{\theta}(w_i,c) - n\log\sum_{w'\in V}\exp(s_{\theta}(w',c))$$

where $n$ is the sample size.

The maximum-likelihood estimate is:

$$\hat{\theta}_{MLE} = \arg\max_{\theta}L(\theta)$$

The estimate $\hat{\theta}_{MLE}$ is found by gradient-based optimization; the gradient is:

$$\frac{\partial L(\theta)}{\partial\theta} = \sum_{i=1}^{n}\frac{\partial s_{\theta}(w_i,c)}{\partial\theta} - n\sum_{w'\in V}\frac{\exp(s_{\theta}(w',c))}{\sum_{w''\in V}\exp(s_{\theta}(w'',c))}\frac{\partial s_{\theta}(w',c)}{\partial\theta}$$

$$= \sum_{i=1}^{n}\frac{\partial s_{\theta}(w_i,c)}{\partial\theta} - n\sum_{w'\in V}p_{\theta}(w'|c)\frac{\partial s_{\theta}(w',c)}{\partial\theta}$$

The computation above splits into two parts: the first term runs over the training samples $(w_i,c)$, while the second requires a sum over the entire vocabulary $V$. When the vocabulary is large this is expensive, which has motivated several approximation schemes:

  • Sampling-based approximation: draw samples from $p_{\theta}(w|c)$ and approximate the second term by $\frac{1}{k}\sum_{i=1}^{k}\frac{\partial s_{\theta}(w_i',c)}{\partial\theta}$. In practice, sampling from $p_{\theta}(w|c)$ itself is awkward, so another distribution $q(w|c)$ is used as the proposal; this approach is known as sampled softmax (see the sketch after the equations below).

$$\frac{\partial L(\theta)}{\partial\theta} = \sum_{i=1}^{n}\frac{\partial s_{\theta}(w_i,c)}{\partial\theta} - n\sum_{w'\in V}p_{\theta}(w'|c)\frac{\partial s_{\theta}(w',c)}{\partial\theta}$$

$$= \sum_{i=1}^{n}\frac{\partial s_{\theta}(w_i,c)}{\partial\theta} - n\,E_{w'\sim p_{\theta}(w|c)}\left[\frac{\partial s_{\theta}(w',c)}{\partial\theta}\right]$$

$$\approx \sum_{i=1}^{n}\frac{\partial s_{\theta}(w_i,c)}{\partial\theta} - n\,\frac{1}{k}\sum_{i=1}^{k}\frac{\partial s_{\theta}(w_i',c)}{\partial\theta}$$
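To make the Monte-Carlo idea concrete, a small sketch (a hypothetical dot-product scorer; for simplicity it samples from $p_{\theta}$ itself rather than a proposal $q$, and takes the gradient with respect to the context vector, for which $\partial s/\partial c = W[w]$); the estimate converges to the full-vocabulary sum at the usual $1/\sqrt{k}$ rate:

```python
import numpy as np

V, d = 50000, 64
rng = np.random.default_rng(0)
W = rng.normal(size=(V, d)) * 0.1     # hypothetical embeddings; s_theta(w, c) = W[w] @ c
c = rng.normal(size=d)                # one context vector

scores = W @ c
p = np.exp(scores - scores.max())
p /= p.sum()                          # p_theta(w | c)

# With s = W[w] @ c we have ds/dc = W[w], so the expensive term is E_{w'~p}[W[w']]
exact = p @ W                         # full-vocabulary sum: O(|V| * d)

for k in (100, 1000, 10000):
    idx = rng.choice(V, size=k, p=p)  # k Monte-Carlo samples from p_theta
    mc = W[idx].mean(axis=0)          # (1/k) * sum_i W[w'_i]
    print(k, np.linalg.norm(mc - exact) / np.linalg.norm(exact))  # error ~ 1/sqrt(k)
```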

  • Noise Contrastive Estimation (NCE)

NCE treats each training sample $(w_i,c)$ as a positive, labeled $D = 1$, and draws negatives, labeled $D = 0$, from a known, fixed noise distribution $q(w)$ that does not depend on $c$. With $k$ negatives sampled per positive, the augmented sample satisfies:

$$p(D = 1) = \frac{n}{n + kn}$$

$$p(D = 0) = 1 - p(D = 1) = \frac{kn}{n + kn}$$

$$p(w|D = 1, c) = p_{\theta}(w|c)$$

$$p(w|D = 0, c) = q(w)$$

$$p(D = 1|w, c) = \frac{p(D = 1)\,p(w|D = 1, c)}{p(D = 0)\,p(w|D = 0, c) + p(D = 1)\,p(w|D = 1, c)} = \frac{p_{\theta}(w|c)}{p_{\theta}(w|c) + kq(w)}$$

$$p(D = 0|w, c) = 1 - p(D = 1|w, c) = \frac{kq(w)}{p_{\theta}(w|c) + kq(w)}$$

The original samples and the randomly drawn negatives together form a new sample set of size $n + kn$. Applying maximum likelihood again, the log-likelihood of the new sample is:

$$L^c(\theta) = \sum_{i=1}^{n}\log p_{\theta}(D = 1|w_i,c) + \sum_{i=n+1}^{n+kn}\log p_{\theta}(D = 0|w_i,c)$$

where

$$p_{\theta}(w|c) = \frac{\exp(s_{\theta}(w,c))}{\sum_{w'\in V}\exp(s_{\theta}(w',c))} = \frac{u_{\theta}(w,c)}{Z(c)}$$

$$p(D = 1|w,c) = \frac{p_{\theta}(w|c)}{p_{\theta}(w|c) + kq(w)}$$

$$p(D = 0|w,c) = \frac{kq(w)}{p_{\theta}(w|c) + kq(w)}$$

Note that this likelihood still involves the normalizer $Z(c) = \sum_{w'\in V}\exp(s_{\theta}(w',c))$, which remains expensive to compute, so we modify the model once more:

$$Z(c) = \theta^c$$

$$p_{\theta}(w|c) = \exp(s_{\theta}(w,c))/\theta^c = p_{\theta^0}(w|c)/\theta^c$$

The new model's parameters consist of two parts, $\theta = \{\theta^0, \theta^c\}$, where $p_{\theta^0}(w|c)$ is the unnormalized score, $p_{\theta^0}(w|c) = u_{\theta^0}(w,c)$.

$$L^c(\theta) = \sum_{i=1}^{n}\log p_{\theta}(D = 1|w_i,c) + \sum_{i=n+1}^{n+kn}\log p_{\theta}(D = 0|w_i,c)$$

$$= \sum_{i=1}^{n}\log\frac{p_{\theta}(w_i|c)}{p_{\theta}(w_i|c) + kq(w_i)} + \sum_{w_j\sim q}\log\frac{kq(w_j)}{p_{\theta}(w_j|c) + kq(w_j)}$$

The authors found experimentally that simply fixing $\theta^c = 1$ also works well, i.e., the model learns to self-normalize, leaving only the parameters $\theta^0$ to estimate.
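A minimal NumPy sketch of the NCE objective above, with $Z(c)$ fixed to 1 (self-normalization) and a uniform noise distribution; all names are hypothetical stand-ins:

```python
import numpy as np

V, d, k = 10000, 128, 5
rng = np.random.default_rng(0)
W = rng.normal(size=(V, d)) * 0.01   # theta^0; u_theta(w, c) = exp(W[w] @ c), Z(c) := 1
q = np.full(V, 1.0 / V)              # noise distribution q(w); uniform for simplicity

def nce_loss(pos_ids, c_vec):
    """Negative L^c(theta) for one context: a binary classification objective."""
    neg_ids = rng.choice(V, size=k * len(pos_ids), p=q)    # k noise samples per positive
    u_pos = np.exp(W[pos_ids] @ c_vec)                     # self-normalized model prob
    u_neg = np.exp(W[neg_ids] @ c_vec)
    log_p_d1 = np.log(u_pos / (u_pos + k * q[pos_ids]))            # log p(D=1 | w, c)
    log_p_d0 = np.log(k * q[neg_ids] / (u_neg + k * q[neg_ids]))   # log p(D=0 | w, c)
    return -(log_p_d1.sum() + log_p_d0.sum())

print(nce_loss(np.array([42, 7]), rng.normal(size=d)))     # scalar loss to minimize
```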

InfoNCE

  • Representation learning: learn good representations via the task of predicting the future.

  • The authors argue that directly modeling the conditional probability $p(x|c)$ is not optimal for extracting the information shared between $x$ and $c$.

  • Instead they model the density ratio between $x$ and $c$: $f_k(x_{t+k},c_t) \propto \frac{p(x_{t+k}|c_t)}{p(x_{t+k})}$; raising the density ratio raises the mutual information between $x$ and $c$: $I(X;C) = \sum_x\sum_c p(x,c)\log\frac{p(x|c)}{p(x)}$

  • In practice the authors use the model $f_k(x_{t+k},c_t) = \exp(z_{t+k}^T W_k c_t)$, with $z_t = g_{enc}(x_t)$ and $c_t = g_{ar}(z_{\le t})$

  • Given a training batch $X=\{x_1,x_2,...,x_N\}$ containing one sample drawn from $p(x_{t+k}|c_t)$ (the positive) and $N-1$ samples drawn from $p(x_{t+k})$ (the negatives), the InfoNCE loss is defined as $L_N = -E\left[\log\frac{f_k(x_{t+k},c_t)}{\sum_{x_j\in X}f_k(x_j,c_t)}\right]$ (a code sketch follows the proof below)

  • $I(x_{t+k};c_t) \ge \log N - L_N$

Proof:

$$L_N = -E\left[\log\frac{f_k(x_{t+k},c_t)}{f_k(x_{t+k},c_t) + \sum_{x_j\in X_{neg}}f_k(x_j,c_t)}\right]$$

$$= E\left[\log\frac{f_k(x_{t+k},c_t) + \sum_{x_j\in X_{neg}}f_k(x_j,c_t)}{f_k(x_{t+k},c_t)}\right]$$

Substituting the optimal $f_k(x,c) = \frac{p(x|c)}{p(x)}$:

$$= E\log\left[1 + \frac{p(x_{t+k})}{p(x_{t+k}|c_t)}\sum_{x_j\in X_{neg}}\frac{p(x_j|c_t)}{p(x_j)}\right]$$

$$\approx E\log\left[1 + \frac{p(x_{t+k})}{p(x_{t+k}|c_t)}(N-1)\,E_{x_j\sim p(x_j)}\frac{p(x_j|c_t)}{p(x_j)}\right]$$

$$= E\log\left[1 + \frac{p(x_{t+k})}{p(x_{t+k}|c_t)}(N-1)\right]$$

since $E_{x_j\sim p(x_j)}\frac{p(x_j|c_t)}{p(x_j)} = \sum_{x_j}p(x_j|c_t) = 1$. Then

$$\ge E\log\left[\frac{p(x_{t+k})}{p(x_{t+k}|c_t)}N\right]$$

$$= -I(x_{t+k};c_t) + \log N$$

where the last step uses $E_{x,c\sim p(x,c)}\log\frac{p(x|c)}{p(x)} = I(x_{t+k};c_t)$.

Hence minimizing $L_N$ amounts to maximizing $\log N - L_N$, a lower bound on the mutual information $I(x_{t+k};c_t)$.
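A minimal sketch of the InfoNCE loss as commonly implemented: a cross entropy over similarity scores in which each context's positive sits on the diagonal and the rest of the batch serves as negatives (shapes and names are illustrative):

```python
import numpy as np

def info_nce_loss(z, c, Wk):
    """L_N = -E[ log f_pos / sum_j f_j ] with f_k(x_j, c_i) = exp(z_j^T Wk c_i).

    z: (N, d) encodings z_{t+k}; c: (N, d) contexts c_t; row i of z is the
    positive for row i of c, and the other N-1 rows act as negatives.
    """
    logits = z @ Wk @ c.T                  # logits[j, i] = score of candidate j for context i
    logits -= logits.max(axis=0, keepdims=True)           # numerical stability
    log_softmax = logits - np.log(np.exp(logits).sum(axis=0, keepdims=True))
    return -np.mean(np.diag(log_softmax))  # positives sit on the diagonal

N, d = 256, 64
rng = np.random.default_rng(0)
z = rng.normal(size=(N, d)) / np.sqrt(d)   # random, uninformative encodings
c = rng.normal(size=(N, d)) / np.sqrt(d)
print(info_nce_loss(z, c, np.eye(d)), np.log(N))
# with uninformative features L_N ≈ log N, so the MI bound log N - L_N ≈ 0
```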
