To summarize my own understanding:

NCE samples negatives from an explicit noise distribution (often based on word frequency) and corrects its objective for that distribution, whereas NEG drops this correction and simply treats the sampled words as negatives. InfoNCE is the contrastive loss used to learn (encode) feature representations of high-dimensional data through an unsupervised task, where the usual unsupervised strategy is to predict future or missing information from the context.
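As a concrete illustration of that difference, here is a minimal sketch (hypothetical names; it assumes the model's raw score s(w, c) is treated as an unnormalized log-probability, q is the noise distribution, and k is the number of noise samples):

```python
import numpy as np

def nce_logit(score, noise_prob, k):
    """NCE: the binary classifier's logit for a candidate word is the model
    score corrected by the (scaled) noise probability, log(k * q(w))."""
    return score - np.log(k * noise_prob)

def neg_logit(score):
    """NEG: the correction term is dropped; the raw score is used directly."""
    return score
```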

The following is reposted from Zhihu:

In TensorFlow, the implementations of these two losses are remarkably concise, although plenty of details are hidden inside and worth digging into. Here we mainly look at the two loss implementations from the perspective of the algorithm.


```python
def nce_loss(weights, biases, labels, inputs, num_sampled, num_classes,
             num_true=1, sampled_values=None, remove_accidental_hits=False,
             partition_strategy="mod", name="nce_loss"):
  logits, labels = _compute_sampled_logits(
      weights=weights, biases=biases, labels=labels, inputs=inputs,
      num_sampled=num_sampled, num_classes=num_classes, num_true=num_true,
      sampled_values=sampled_values, subtract_log_q=True,
      remove_accidental_hits=remove_accidental_hits,
      partition_strategy=partition_strategy, name=name)
  sampled_losses = sigmoid_cross_entropy_with_logits(
      labels=labels, logits=logits, name="sampled_losses")
  # sampled_losses is batch_size x {true_loss, sampled_losses...}
  # We sum out true and sampled losses.
  return _sum_rows(sampled_losses)
```

The NCE implementation consists of these three parts. Judging from the names, _compute_sampled_logits is responsible for the sampling, sigmoid_cross_entropy_with_logits performs the logistic regression and computes the cross-entropy loss, and _sum_rows does the summation.

Let's peel back these three functions one by one, from the simplest to the most involved.

  • First, the code of _sum_rows is very simple. Looking at the comments, its argument is the sampled loss, a [batch_size, 1 + num_sampled] matrix. _sum_rows constructs a ones tensor, matrix-multiplies it with the sampled loss, and finally reshapes the result into a [batch_size] vector, where each element is the sum of the true loss and the sampled losses (see the sketch after this list).
  • From the above, and from what we already know, sigmoid_cross_entropy_with_logits should be computing a logistic loss from the logits and the labels.
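Based on that description, here is a minimal re-implementation sketch of _sum_rows (my own paraphrase, not the TensorFlow source):

```python
import tensorflow as tf

def sum_rows_sketch(sampled_losses):
    """sampled_losses: [batch_size, 1 + num_sampled]; the first column is the
    loss on the true pair, the remaining columns are the losses on the sampled
    noise pairs. Returns a [batch_size] vector of per-example totals."""
    num_cols = tf.shape(sampled_losses)[1]
    ones = tf.ones(tf.stack([num_cols, 1]), dtype=sampled_losses.dtype)
    # [batch_size, 1 + num_sampled] @ [1 + num_sampled, 1] -> [batch_size, 1]
    summed = tf.matmul(sampled_losses, ones)
    return tf.reshape(summed, [-1])
```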

The issue

There are some issues with learning word vectors using a "standard" neural network. In this approach, the word vectors are learned while the network learns to predict the next word given a window of words (the input of the network).

Predicting the next word is like predicting the class. That is, such a network is just a "standard" multinomial (multi-class) classifier, and it must have as many output neurons as there are classes. When the classes are actual words, the number of neurons is, well, huge.

A "standard" neural network is usually trained with a cross-entropy cost function which requires the values of the output neurons to represent probabilities - which means that the output "scores" computed by the network for each class have to be normalized, converted into actual probabilities for each class. This normalization step is achieved by means of the softmax function. Softmax is very costly when applied to a huge output layer.
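To make the cost concrete, here is a toy NumPy sketch (vocabulary and embedding sizes are made up): producing one normalized prediction requires computing and summing over all |V| output scores.

```python
import numpy as np

vocab_size, hidden_dim = 50_000, 300              # made-up sizes
W_out = np.random.randn(vocab_size, hidden_dim)   # one output row per word (class)
h = np.random.randn(hidden_dim)                   # hidden state for one input window

scores = W_out @ h                                # |V| dot products, one per class
scores -= scores.max()                            # numerical stability
probs = np.exp(scores) / np.exp(scores).sum()     # softmax: the normalizer sums over all |V| terms
```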

The (a) solution

In order to deal with this issue, that is, the expensive computation of the softmax, Word2Vec uses a technique called noise-contrastive estimation. This technique was introduced in [A] (and reformulated in [B]), then used in [C], [D], and [E] to learn word embeddings from unlabelled natural language text.

The basic idea is to convert a multinomial classification problem (as is the problem of predicting the next word) into a binary classification problem. That is, instead of using softmax to estimate a true probability distribution of the output word, a binary logistic regression (binary classification) is used.

For each training sample, the enhanced (optimized) classifier is fed a true pair (a center word and another word that appears in its context) and k randomly corrupted pairs (consisting of the center word and a randomly chosen word from the vocabulary). By learning to distinguish the true pairs from the corrupted ones, the classifier will ultimately learn the word vectors.

This is important: instead of predicting the next word (the "standard" training technique), the optimized classifier simply predicts whether a pair of words is good or bad.
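A minimal sketch of that binary objective (hypothetical names; word vectors are scored with a dot product, and the true pair plus the k corrupted pairs are pushed towards labels 1 and 0 with a logistic/sigmoid loss):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pair_loss(center_vec, context_vec, negative_vecs):
    """Binary logistic loss for one training sample: classify the
    (center, context) pair as 'good' (label 1) and each of the k
    (center, random word) pairs as 'bad' (label 0)."""
    loss = -np.log(sigmoid(center_vec @ context_vec))        # true pair -> 1
    for neg_vec in negative_vecs:                             # k corrupted pairs
        loss += -np.log(sigmoid(-(center_vec @ neg_vec)))     # corrupted pair -> 0
    return loss
```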

Word2Vec slightly customizes this process and calls it negative sampling. In Word2Vec, the words for the negative samples (used for the corrupted pairs) are drawn from a specially designed distribution that favours less frequent words, drawing them more often than their raw frequency would suggest.
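For reference, the distribution used in word2vec raises the unigram counts to the 3/4 power before renormalizing, which is what boosts rarer words relative to their raw frequency; a small sketch:

```python
import numpy as np

def negative_sampling_probs(word_counts, power=0.75):
    """Word2vec-style noise distribution: unigram counts raised to the 3/4
    power and renormalized, so rare words are drawn more often than their
    raw frequency would suggest."""
    scaled = np.asarray(word_counts, dtype=np.float64) ** power
    return scaled / scaled.sum()

# counts [900, 90, 10] -> raw frequencies [0.90, 0.09, 0.01],
# but sampling probabilities ≈ [0.82, 0.15, 0.03]
print(negative_sampling_probs([900, 90, 10]))
```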

References

[A] (2005) - Contrastive estimation: Training log-linear models on unlabeled data

[B] (2010) - Noise-contrastive estimation: A new estimation principle for unnormalized statistical models

[C] (2008) - A unified architecture for natural language processing: Deep neural networks with multitask learning

[D] (2012) - A fast and simple algorithm for training neural probabilistic language models

[E] (2013) - Learning word embeddings efficiently with noise-contrastive estimation

From: https://stats.stackexchange.com/questions/244616/how-sampling-works-in-word2vec-can-someone-please-make-me-understand-nce-and-ne

The following is reposted from the Zhihu question "Why can negative sampling in word2vec achieve the same effect as softmax?" (word2vec中的负例采样为什么可以得到和softmax一样的效果? - 知乎)

On InfoNCE:

InfoNCE was proposed in the paper "Representation Learning with Contrastive Predictive Coding". I will not describe CPC itself in detail here, but rather focus on how the idea of NCE was borrowed to derive InfoNCE and apply it in CPC. If you are not yet familiar with CPC, see my article "对 CPC (对比预测编码) 的理解" (Understanding CPC, Contrastive Predictive Coding).

In short, CPC (Contrastive Predictive Coding) learns (encodes) feature representations of high-dimensional data through an unsupervised task, and the usual unsupervised strategy is to predict future or missing information from the context. NLP has already used this idea to learn word representations.
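A minimal sketch of the InfoNCE loss itself (hypothetical array names; one positive sample is scored against N-1 negatives, and the loss is the categorical cross-entropy of identifying the positive among the N candidates):

```python
import numpy as np

def info_nce_loss(scores):
    """scores: length-N array of similarity scores f(x, c) for one context c,
    where index 0 is the positive (true future) sample and the remaining
    entries are negative samples. InfoNCE is the cross-entropy of picking
    the positive out of the N candidates, i.e. a softmax over the scores."""
    scores = np.asarray(scores, dtype=np.float64)
    scores = scores - scores.max()                      # numerical stability
    log_softmax = scores - np.log(np.exp(scores).sum())
    return -log_softmax[0]
```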

Partially reposted from the Zhihu article "Noise Contrastive Estimation 前世今生——从 NCE 到 InfoNCE" (The past and present of Noise Contrastive Estimation: from NCE to InfoNCE).
