DSSM's loss function: one positive example and five negative examples are first passed through a softmax, and a cross-entropy loss is then applied to the softmax output.
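As a reference, here is a minimal sketch of that objective, assuming the usual DSSM notation where R(Q, D) is the cosine similarity between the query and document semantic vectors and γ is a smoothing factor:

$$
P(D^{+} \mid Q) = \frac{\exp\big(\gamma\, R(Q, D^{+})\big)}{\sum_{D' \in \mathbf{D}} \exp\big(\gamma\, R(Q, D')\big)},
\qquad
L = -\log \prod_{(Q,\, D^{+})} P(D^{+} \mid Q)
$$

Here $\mathbf{D}$ contains the single positive document $D^{+}$ together with the five sampled negatives, so the cross-entropy reduces to the negative log-likelihood of the positive document.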

Word2Vec's loss function: the input word's embedding is dotted with the output-side (hyperplane) vector of the predicted word (or of a negative sample); the result goes through a sigmoid and then a cross-entropy loss.
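For one (input word, predicted word) pair with $k$ negative samples, that amounts to the following per-example objective (a sketch; $v_c$ is the input word's embedding, $u_o$ the output-side vector of the predicted word, and $u_{w_i}$ the output-side vectors of the sampled negatives, notation assumed here):

$$
L = -\log \sigma\big(u_{o}^{\top} v_{c}\big) \;-\; \sum_{i=1}^{k} \log \sigma\big(-u_{w_i}^{\top} v_{c}\big)
$$

The first term is the binary cross-entropy of the true pair labeled 1; each term of the sum is the cross-entropy of a corrupted pair labeled 0.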

When generating word vectors, the loss function used is NCE or negative sampling rather than the regular softmax. In the book Learning TensorFlow, the authors put it this way: "but it is sufficient to think of it (NCE) as a sort of efficient approximation to the ordinary softmax function used in classification tasks". So NCE is an approximation of the softmax, but why make this approximation instead of using the softmax directly? A forum user's answer, quoted below, addresses exactly this question.

Answer: when there are too many classes, the softmax is far too expensive to compute.
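Frameworks deal with this by exposing sampled losses directly. Below is a minimal sketch using TensorFlow's tf.nn.nce_loss; the vocabulary size, dimensions, and variable names are illustrative assumptions, not taken from any particular source.

```python
import tensorflow as tf  # assumes TensorFlow 2.x

vocab_size = 50000   # illustrative vocabulary size
embed_dim = 128
num_sampled = 5      # number of sampled noise (negative) words per true pair

# Input-side embeddings and output-side NCE weights/biases.
embeddings = tf.Variable(tf.random.uniform([vocab_size, embed_dim], -1.0, 1.0))
nce_weights = tf.Variable(tf.random.truncated_normal([vocab_size, embed_dim], stddev=0.05))
nce_biases = tf.Variable(tf.zeros([vocab_size]))

# Hypothetical mini-batch: center-word ids and their true context-word ids.
center_ids = tf.constant([12, 7, 345])
context_ids = tf.constant([[40], [3], [99]], dtype=tf.int64)  # shape [batch, 1]

embed = tf.nn.embedding_lookup(embeddings, center_ids)

# Only the one true class plus `num_sampled` sampled classes are scored per example,
# instead of all `vocab_size` classes as a full softmax would require.
loss = tf.reduce_mean(
    tf.nn.nce_loss(weights=nce_weights,
                   biases=nce_biases,
                   labels=context_ids,
                   inputs=embed,
                   num_sampled=num_sampled,
                   num_classes=vocab_size))
```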

Original answer: https://stats.stackexchange.com/questions/244616/how-sampling-works-in-word2vec-can-someone-please-make-me-understand-nce-and-ne/245452#245452

There are some issues with learning the word vectors using a "standard" neural network. In this approach, the word vectors are learned while the network learns to predict the next word given a window of words (the input of the network).

Predicting the next word is like predicting a class. That is, such a network is just a "standard" multinomial (multi-class) classifier, and it must have as many output neurons as there are classes. When the classes are actual words, the number of neurons is, well, huge.

A "standard" neural network is usually trained with a cross-entropy cost function which requires the values of the output neurons to represent probabilities - which means that the output "scores" computed by the network for each class have to be normalized, converted into actual probabilities for each class. This normalization step is achieved by means of the softmax function. Softmax is very costly when applied to a huge output layer.
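As a rough illustration of that cost (a toy sketch with made-up sizes), a full softmax must compute and normalize one score per vocabulary word for every single training example:

```python
import numpy as np

vocab_size = 100_000   # one output neuron per word in the vocabulary
hidden_dim = 300

h = np.random.randn(hidden_dim)                  # hidden representation for one example
W_out = np.random.randn(vocab_size, hidden_dim)  # output layer: one weight row per word

scores = W_out @ h                     # 100,000 dot products for a single example
probs = np.exp(scores - scores.max())
probs /= probs.sum()                   # the normalization touches every class
```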

The (a) solution

In order to deal with this issue, that is, the expensive computation of the softmax, Word2Vec uses a technique called noise-contrastive estimation. This technique was introduced by [A] (reformulated by [B]) then used in [C], [D], [E] to learn word embeddings from unlabelled natural language text.

The basic idea is to convert a multinomial classification problem (which is what predicting the next word amounts to) into a binary classification problem. That is, instead of using softmax to estimate a true probability distribution over the output word, binary logistic regression (binary classification) is used.

For each training sample, the enhanced (optimized) classifier is fed a true pair (a center word and another word that appears in its context) and k randomly corrupted pairs (each consisting of the center word and a randomly chosen word from the vocabulary). By learning to distinguish the true pairs from the corrupted ones, the classifier will ultimately learn the word vectors.

This is important: instead of predicting the next word (the "standard" training technique), the optimized classifier simply predicts whether a pair of words is good or bad.
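A minimal sketch of that binary objective for one training sample (plain NumPy; the vector names and sizes are assumptions used only for illustration):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pair_classification_loss(v_center, u_context, u_negatives):
    """Binary logistic loss: one true pair labeled 1, k corrupted pairs labeled 0.

    v_center:    embedding of the center word, shape (d,)
    u_context:   output vector of the true context word, shape (d,)
    u_negatives: output vectors of k randomly drawn noise words, shape (k, d)
    """
    true_pair = -np.log(sigmoid(u_context @ v_center))
    corrupted = -np.log(sigmoid(-u_negatives @ v_center)).sum()
    return true_pair + corrupted

# Toy usage: 8-dimensional vectors, 5 negative samples.
d, k = 8, 5
rng = np.random.default_rng(0)
loss = pair_classification_loss(rng.normal(size=d), rng.normal(size=d), rng.normal(size=(k, d)))
```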

Word2Vec slightly customizes the process and calls it negative sampling. In Word2Vec, the words for the negative samples (used for the corrupted pairs) are drawn from a specially designed distribution, which favours less frequent words to be drawn more often.
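The distribution usually cited for word2vec raises unigram counts to the 3/4 power before normalizing; a small sketch of drawing negatives from it (the function and the toy counts are illustrative):

```python
import numpy as np

def noise_distribution(word_counts, power=0.75):
    """P(w) proportional to count(w) ** 0.75, as commonly described for word2vec.

    The 3/4 power flattens the unigram distribution, so rare words are drawn
    more often than their raw frequency alone would suggest.
    """
    counts = np.asarray(word_counts, dtype=np.float64)
    weights = counts ** power
    return weights / weights.sum()

# Toy vocabulary of four words with heavily skewed counts.
p = noise_distribution([1000, 100, 10, 1])
negative_ids = np.random.default_rng(0).choice(len(p), size=5, p=p)  # five negative word ids
```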

References

[A] (2005) - Contrastive estimation: Training log-linear models on unlabeled data

[B] (2010) - Noise-contrastive estimation: A new estimation principle for unnormalized statistical models

[C] (2008) - A unified architecture for natural language processing: Deep neural networks with multitask learning

[D] (2012) - A fast and simple algorithm for training neural probabilistic language models

[E] (2013) - Learning word embeddings efficiently with noise-contrastive estimation
