DSSM's loss function: one positive example and five negative examples are first passed through a softmax, and a cross-entropy loss is then applied to the softmax output.
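As a reference, here is a minimal sketch of that objective, assuming the usual DSSM notation where R(Q, D) is the cosine similarity between the query and document semantic vectors and γ is a smoothing factor:

$$
P(D^{+} \mid Q) = \frac{\exp\big(\gamma\, R(Q, D^{+})\big)}{\sum_{D' \in \mathbf{D}} \exp\big(\gamma\, R(Q, D')\big)},
\qquad
L = -\log \prod_{(Q,\, D^{+})} P(D^{+} \mid Q)
$$

Here $\mathbf{D}$ contains the single positive document $D^{+}$ together with the five sampled negatives, so the cross-entropy reduces to the negative log-likelihood of the positive document.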

Word2Vec's loss function: the input word's embedding is dotted with the output-side (hyperplane) vector of the predicted word (or of a negative sample); the result goes through a sigmoid and then a cross-entropy loss.
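For one (input word, predicted word) pair with $k$ negative samples, that amounts to the following per-example objective (a sketch; $v_c$ is the input word's embedding, $u_o$ the output-side vector of the predicted word, and $u_{w_i}$ the output-side vectors of the sampled negatives, notation assumed here):

$$
L = -\log \sigma\big(u_{o}^{\top} v_{c}\big) \;-\; \sum_{i=1}^{k} \log \sigma\big(-u_{w_i}^{\top} v_{c}\big)
$$

The first term is the binary cross-entropy of the true pair labeled 1; each term of the sum is the cross-entropy of a corrupted pair labeled 0.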

When generating word vectors, the loss function used is NCE or negative sampling rather than the regular softmax. In the book Learning TensorFlow, the authors put it this way: "but it is sufficient to think of it (NCE) as a sort of efficient approximation to the ordinary softmax function used in classification tasks". So NCE is an approximation of the softmax, but why make this approximation instead of using the softmax directly? A forum user's answer, quoted below, addresses exactly this question.

Answer: when there are too many classes, the softmax is far too expensive to compute.
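Frameworks deal with this by exposing sampled losses directly. Below is a minimal sketch using TensorFlow's tf.nn.nce_loss; the vocabulary size, dimensions, and variable names are illustrative assumptions, not taken from any particular source.

```python
import tensorflow as tf  # assumes TensorFlow 2.x

vocab_size = 50000   # illustrative vocabulary size
embed_dim = 128
num_sampled = 5      # number of sampled noise (negative) words per true pair

# Input-side embeddings and output-side NCE weights/biases.
embeddings = tf.Variable(tf.random.uniform([vocab_size, embed_dim], -1.0, 1.0))
nce_weights = tf.Variable(tf.random.truncated_normal([vocab_size, embed_dim], stddev=0.05))
nce_biases = tf.Variable(tf.zeros([vocab_size]))

# Hypothetical mini-batch: center-word ids and their true context-word ids.
center_ids = tf.constant([12, 7, 345])
context_ids = tf.constant([[40], [3], [99]], dtype=tf.int64)  # shape [batch, 1]

embed = tf.nn.embedding_lookup(embeddings, center_ids)

# Only the one true class plus `num_sampled` sampled classes are scored per example,
# instead of all `vocab_size` classes as a full softmax would require.
loss = tf.reduce_mean(
    tf.nn.nce_loss(weights=nce_weights,
                   biases=nce_biases,
                   labels=context_ids,
                   inputs=embed,
                   num_sampled=num_sampled,
                   num_classes=vocab_size))
```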

Original answer: https://stats.stackexchange.com/questions/244616/how-sampling-works-in-word2vec-can-someone-please-make-me-understand-nce-and-ne/245452#245452

There are some issues with learning the word vectors using a "standard" neural network. In this approach, the word vectors are learned while the network learns to predict the next word given a window of words (the input of the network).

Predicting the next word is like predicting a class. That is, such a network is just a "standard" multinomial (multi-class) classifier, and it must have as many output neurons as there are classes. When the classes are actual words, the number of neurons is, well, huge.

A "standard" neural network is usually trained with a cross-entropy cost function which requires the values of the output neurons to represent probabilities - which means that the output "scores" computed by the network for each class have to be normalized, converted into actual probabilities for each class. This normalization step is achieved by means of the softmax function. Softmax is very costly when applied to a huge output layer.
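As a rough illustration of that cost (a toy sketch with made-up sizes), a full softmax must compute and normalize one score per vocabulary word for every single training example:

```python
import numpy as np

vocab_size = 100_000   # one output neuron per word in the vocabulary
hidden_dim = 300

h = np.random.randn(hidden_dim)                  # hidden representation for one example
W_out = np.random.randn(vocab_size, hidden_dim)  # output layer: one weight row per word

scores = W_out @ h                     # 100,000 dot products for a single example
probs = np.exp(scores - scores.max())
probs /= probs.sum()                   # the normalization touches every class
```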

The (a) solution

In order to deal with this issue, that is, the expensive computation of the softmax, Word2Vec uses a technique called noise-contrastive estimation. This technique was introduced by [A] (reformulated by [B]) then used in [C], [D], [E] to learn word embeddings from unlabelled natural language text.

The basic idea is to convert a multinomial classification problem (which is what predicting the next word amounts to) into a binary classification problem. That is, instead of using softmax to estimate a true probability distribution over the output word, binary logistic regression (binary classification) is used.

For each training sample, the enhanced (optimized) classifier is fed a true pair (a center word and another word that appears in its context) and k randomly corrupted pairs (each consisting of the center word and a randomly chosen word from the vocabulary). By learning to distinguish the true pairs from the corrupted ones, the classifier will ultimately learn the word vectors.

This is important: instead of predicting the next word (the "standard" training technique), the optimized classifier simply predicts whether a pair of words is good or bad.
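A minimal sketch of that binary objective for one training sample (plain NumPy; the vector names and sizes are assumptions used only for illustration):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pair_classification_loss(v_center, u_context, u_negatives):
    """Binary logistic loss: one true pair labeled 1, k corrupted pairs labeled 0.

    v_center:    embedding of the center word, shape (d,)
    u_context:   output vector of the true context word, shape (d,)
    u_negatives: output vectors of k randomly drawn noise words, shape (k, d)
    """
    true_pair = -np.log(sigmoid(u_context @ v_center))
    corrupted = -np.log(sigmoid(-u_negatives @ v_center)).sum()
    return true_pair + corrupted

# Toy usage: 8-dimensional vectors, 5 negative samples.
d, k = 8, 5
rng = np.random.default_rng(0)
loss = pair_classification_loss(rng.normal(size=d), rng.normal(size=d), rng.normal(size=(k, d)))
```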

Word2Vec slightly customizes the process and calls it negative sampling. In Word2Vec, the words for the negative samples (used for the corrupted pairs) are drawn from a specially designed distribution, which favours less frequent words to be drawn more often.
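The distribution usually cited for word2vec raises unigram counts to the 3/4 power before normalizing; a small sketch of drawing negatives from it (the function and the toy counts are illustrative):

```python
import numpy as np

def noise_distribution(word_counts, power=0.75):
    """P(w) proportional to count(w) ** 0.75, as commonly described for word2vec.

    The 3/4 power flattens the unigram distribution, so rare words are drawn
    more often than their raw frequency alone would suggest.
    """
    counts = np.asarray(word_counts, dtype=np.float64)
    weights = counts ** power
    return weights / weights.sum()

# Toy vocabulary of four words with heavily skewed counts.
p = noise_distribution([1000, 100, 10, 1])
negative_ids = np.random.default_rng(0).choice(len(p), size=5, p=p)  # five negative word ids
```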

References

[A] (2005) - Contrastive estimation: Training log-linear models on unlabeled data

[B] (2010) - Noise-contrastive estimation: A new estimation principle for unnormalized statistical models

[C] (2008) - A unified architecture for natural language processing: Deep neural networks with multitask learning

[D] (2012) - A fast and simple algorithm for training neural probabilistic language models

[E] (2013) - Learning word embeddings efficiently with noise-contrastive estimation
