Negative Sampling and NCE Loss

1. Noise Contrastive Estimation (NCE)

In a language model, the final layer typically has to predict, given a context c, the probability of a word w over the entire vocabulary V. This is usually written as a softmax:

p_\theta(w \mid c) = \frac{\exp\big(u_\theta(w, c)\big)}{\sum_{w' \in V} \exp\big(u_\theta(w', c)\big)} = \frac{\exp\big(u_\theta(w, c)\big)}{Z_\theta(c)}

where the partition function Z_\theta(c) normalizes the scores so that p_\theta(\cdot \mid c) is a proper probability distribution. The parameters are normally estimated by maximum likelihood, but evaluating this expression is extremely expensive, because Z_\theta(c) requires iterating over every word in V.
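To make that cost concrete, here is a toy NumPy sketch (my own illustration, not from the original post; all sizes are made up): a single prediction needs a score and an exponential for every one of the |V| words.

<code><pre>
import numpy as np

# Toy illustration of why the full softmax is expensive: the partition
# function Z(c) sums over the entire vocabulary V.
V, dim = 50000, 128
W = np.random.randn(V, dim)          # one output weight vector per word in V
c = np.random.randn(dim)             # representation of the context

scores = W @ c                       # u_theta(w, c) for every word: O(|V| * dim)
Z_c = np.exp(scores).sum()           # partition function: a sum over all |V| words
p_w_given_c = np.exp(scores) / Z_c   # normalized probability for every word in V
</pre></code>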

This is where NCE comes in. To avoid the huge computation, NCE turns the softmax parameter-estimation problem into a binary classification problem. The two classes are real samples and noise samples: positive samples are drawn from the empirical distribution \tilde p(w \mid c) (i.e. the true data distribution) and get the label D = 1, while negative samples are noise drawn from q(w) and get the label D = 0. Let c denote the context, draw k noise samples from the noise distribution for every true sample, and let w denote the predicted target word within the combined set of samples (true samples plus noise samples).
The joint distribution of (d, w) is then:

p(d, w \mid c) =
\begin{cases}
\dfrac{k}{1 + k}\, q(w) & d = 0 \\[4pt]
\dfrac{1}{1 + k}\, \tilde p(w \mid c) & d = 1
\end{cases}

From the formula above it is easy to see that, over the combined samples, p(w \mid c) = p(d = 0, w \mid c) + p(d = 1, w \mid c).

Tip: here p refers to the overall distribution over positive and negative samples together, which is different from the earlier empirical distribution of the positive samples alone.

Applying the definition of conditional probability gives p(d = 0 \mid w, c) = p(d = 0, w \mid c) / p(w \mid c), and p(d = 1 \mid w, c) is obtained in the same way, i.e. the following formulas:

p(d = 0 \mid w, c) = \frac{k\, q(w)}{\tilde p(w \mid c) + k\, q(w)},
\qquad
p(d = 1 \mid w, c) = \frac{\tilde p(w \mid c)}{\tilde p(w \mid c) + k\, q(w)}

These follow directly from the definition of conditional probability.

NCE then replaces the empirical distribution \tilde p(w \mid c) with the model distribution p_\theta(w \mid c), which ties everything back to the softmax at the beginning: the optimal parameters \theta are obtained by maximizing the likelihood. This alone does not solve the problem, however, because just as in the original formula, computing the partition function Z(c) still requires iterating over all of V.

NCE therefore makes two assumptions:

  1. The partition function Z(c) does not have to be computed by summing over V; instead it is estimated as a per-context parameter Z_c.
  2. Because the neural network has plenty of parameters, Z_c can simply be fixed to a constant value Z_c = 1, and this setting works for every context c ([Mnih and Teh2012]).

Under these assumptions p_\theta(w \mid c) = \exp(u_\theta(w, c)), and the formulas can be rewritten as:

p(d = 1 \mid w, c) = \frac{\exp(u_\theta(w, c))}{\exp(u_\theta(w, c)) + k\, q(w)},
\qquad
p(d = 0 \mid w, c) = \frac{k\, q(w)}{\exp(u_\theta(w, c)) + k\, q(w)}

The parameters are then trained by maximizing the log-likelihood. With k negative samples drawn per true sample, the objective is:

\mathcal{L}_{\mathrm{NCE}_k} = \sum_{(w, c) \in \mathcal{D}} \Big( \log p(d = 1 \mid w, c) + k\, \mathbb{E}_{\bar w \sim q} \big[ \log p(d = 0 \mid \bar w, c) \big] \Big)

The second term is still hard to compute because it involves an expectation E over the noise distribution, so a Monte Carlo approximation is used: the expectation (scaled by k) is replaced by an average over the k sampled noise words:

\mathcal{L}_{\mathrm{NCE}_k}^{\mathrm{MC}} = \sum_{(w, c) \in \mathcal{D}} \Big( \log p(d = 1 \mid w, c) + \sum_{i=1}^{k} \log p(d = 0 \mid \bar w_i, c) \Big), \qquad \bar w_i \sim q
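The sketch below (toy NumPy code, not from the original post) evaluates this Monte Carlo objective for a single (w, c) pair under the Z(c) = 1 assumption; every score and noise probability in it is made up for illustration.

<code><pre>
import numpy as np

# Toy sketch of the Monte Carlo NCE objective for one (w, c) pair, assuming
# Z(c) = 1 so that p_theta(w|c) = exp(u_theta(w, c)).
def nce_objective(u_pos, u_neg, q_pos, q_neg, k):
    """u_pos: score of the true word; u_neg: scores of the k noise words;
    q_pos / q_neg: noise probabilities q(w) of the true / noise words."""
    p_true = np.exp(u_pos) / (np.exp(u_pos) + k * q_pos)   # p(d=1 | w, c)
    p_noise = (k * q_neg) / (np.exp(u_neg) + k * q_neg)    # p(d=0 | w_i, c)
    # One term of L^MC: training maximizes this (i.e. minimizes its negative).
    return np.log(p_true) + np.sum(np.log(p_noise))

k = 5
u_pos = 2.0                             # u_theta(w, c) for the true word (made up)
u_neg = np.random.randn(k)              # u_theta(w_i, c) for the k noise samples
q_pos, q_neg = 0.01, np.full(k, 0.01)   # q(w) looked up for each word (made up)
print(nce_objective(u_pos, u_neg, q_pos, q_neg, k))
</pre></code>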

2. Negative Sampling

Negative sampling is a variant of NCE; the difference lies in how the probability is defined:

p(d = 1 \mid w, c) = \frac{\exp(u_\theta(w, c))}{\exp(u_\theta(w, c)) + 1} = \sigma\big(u_\theta(w, c)\big)

that is, the term k\, q(w) in the NCE formula is replaced by 1.
Looking at the NCE probability formula: if k = |V| and q is the uniform distribution, then k \cdot q(w) = 1 and the two formulas coincide.

Outside of that special case, however, the two formulas differ. Even though negative sampling performs very well for learning word embeddings, it does not inherit the statistical properties of NCE (such as its asymptotic consistency guarantees).

3. NCE Loss in TensorFlow

3.1 Computing the loss

After negative sampling produces k negative samples, every sample is either a positive sample with label = 1 or a negative sample with label = 0. Turning the maximization of the log-likelihood above into the minimization of its negative, the NCE loss can be written as a binary logistic loss (cross entropy):

Using TensorFlow's notation:

  1. x = logits stands for u(w, c) above, i.e. the product of the final-layer weight vector for word w with the context vector of c.
  2. z = labels, where positive samples have label 1 and negative samples have label 0.
  3. The logits and labels above are vectors/matrices covering all samples (the positive samples plus the sampled negatives).

    The loss function can then be written as:
    L = z * -log(sigmoid(x)) + (1 - z) * -log(1 - sigmoid(x))

To avoid overflow when computing exp, TensorFlow applies a simple transformation:

<code><pre>  z * -log(sigmoid(x)) + (1 - z) * -log(1 - sigmoid(x))
= z * -log(1 / (1 + exp(-x))) + (1 - z) * -log(exp(-x) / (1 + exp(-x)))
= z * log(1 + exp(-x)) + (1 - z) * (-log(exp(-x)) + log(1 + exp(-x)))
= z * log(1 + exp(-x)) + (1 - z) * (x + log(1 + exp(-x)))
= (1 - z) * x + log(1 + exp(-x))
= x - x * z + log(1 + exp(-x))

For x < 0, to avoid overflow in exp(-x), we reformulate the above:
  x - x * z + log(1 + exp(-x))
= log(exp(x)) - x * z + log(1 + exp(-x))
= - x * z + log(1 + exp(x))

Hence, combining the x > 0 and x < 0 cases, to ensure stability and avoid
overflow the implementation uses this equivalent formulation:
  max(x, 0) - x * z + log(1 + exp(-abs(x)))</pre></code>
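As a quick sanity check (my own toy snippet, not part of the TensorFlow source), the following NumPy code confirms that the stable formulation matches the naive one:

<code><pre>
import numpy as np

# Numerical check that the stable formulation agrees with the naive sigmoid
# cross entropy; x and z are arbitrary test values.
def naive(x, z):
    return z * -np.log(1.0 / (1.0 + np.exp(-x))) \
        + (1 - z) * -np.log(np.exp(-x) / (1.0 + np.exp(-x)))

def stable(x, z):
    return np.maximum(x, 0) - x * z + np.log1p(np.exp(-np.abs(x)))

x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
z = np.array([1.0, 0.0, 1.0, 0.0, 1.0])
print(np.allclose(naive(x, z), stable(x, z)))   # True
# For large |x| the naive form overflows or loses precision in exp(-x),
# while the stable form stays well-behaved.
</pre></code>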

3.2 Source code of the cross-entropy computation

The code that computes the loss from labels and logits lives in: https://github.com/tensorflow/tensorflow/blob/r1.13/tensorflow/python/ops/nn_impl.py

<code><pre>
def sigmoid_cross_entropy_with_logits(  # pylint: disable=invalid-name
    _sentinel=None,
    labels=None,
    logits=None,
    name=None):
  """Computes sigmoid cross entropy given `logits`.

  Args:
    _sentinel: Used to prevent positional parameters. Internal, do not use.
    labels: A `Tensor` of the same type and shape as `logits`.
    logits: A `Tensor` of type `float32` or `float64`.
    name: A name for the operation (optional).

  Returns:
    A `Tensor` of the same shape as `logits` with the componentwise
    logistic losses.
  """
  nn_ops._ensure_xent_args("sigmoid_cross_entropy_with_logits", _sentinel,
                           labels, logits)
  # pylint: enable=protected-access

  with ops.name_scope(name, "logistic_loss", [logits, labels]) as name:
    logits = ops.convert_to_tensor(logits, name="logits")
    labels = ops.convert_to_tensor(labels, name="labels")
    try:
      labels.get_shape().merge_with(logits.get_shape())
    except ValueError:
      raise ValueError("logits and labels must have the same shape (%s vs %s)" %
                       (logits.get_shape(), labels.get_shape()))

    # The logistic loss formula from above is
    #   x - x * z + log(1 + exp(-x))
    # For x < 0, a more numerically stable formula is
    #   -x * z + log(1 + exp(x))
    # Note that these two expressions can be combined into the following:
    #   max(x, 0) - x * z + log(1 + exp(-abs(x)))
    # To allow computing gradients at zero, we define custom versions of max and
    # abs functions.
    zeros = array_ops.zeros_like(logits, dtype=logits.dtype)
    cond = (logits >= zeros)
    relu_logits = array_ops.where(cond, logits, zeros)
    neg_abs_logits = array_ops.where(cond, -logits, logits)
    return math_ops.add(
        relu_logits - logits * labels,
        math_ops.log1p(math_ops.exp(neg_abs_logits)),
        name=name)
</pre></code>
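A minimal usage sketch of this op (my own example with made-up values), laid out the way TensorFlow arranges one positive class followed by sampled negatives:

<code><pre>
import tensorflow as tf

# One example: logits for [true class, negative, negative, negative].
logits = tf.constant([[2.1, -1.3, 0.4, -0.7]])   # u(w, c) per class
labels = tf.constant([[1.0, 0.0, 0.0, 0.0]])     # z: 1 for the true class, 0 otherwise

losses = tf.nn.sigmoid_cross_entropy_with_logits(labels=labels, logits=logits)
per_example_loss = tf.reduce_sum(losses, axis=1)  # sum the true and sampled losses
</pre></code>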

3.3 Negative sampling and computing logits and labels

First, tf.nn.log_uniform_candidate_sampler (the commonly used sampler) draws num_sampled negative samples:

In natural language, when words are sorted from most to least frequent their frequencies follow a Zipfian distribution (whether this also holds in other scenarios remains to be verified). The sampler draws candidates from a log-uniform (Zipfian) distribution, so it requires the word ids to be ordered from high frequency to low frequency; in other words, the embedding table has to be built in order of word frequency.
nn.log_uniform_candidate_sampler samples according to P(class) = (log(class + 2) - log(class + 1)) / log(range_max + 1); the larger the class id, the smaller the probability P, so this method cannot be used if the ids are not sorted by frequency.
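A usage sketch of the sampler (my own example; the batch values and vocabulary size are made up):

<code><pre>
import tensorflow as tf

# Vocabulary of 10,000 ids sorted by descending frequency, one true class per example.
true_classes = tf.constant([[42], [7], [1999]], dtype=tf.int64)  # [batch_size, num_true]

sampled, true_expected, sampled_expected = tf.nn.log_uniform_candidate_sampler(
    true_classes=true_classes,   # ids of the positive words
    num_true=1,                  # one positive per example
    num_sampled=5,               # draw 5 negative candidates shared by the batch
    unique=True,                 # sample without replacement within the batch
    range_max=10000)             # vocabulary size

# sampled contains ids drawn with
#   P(class) = (log(class + 2) - log(class + 1)) / log(range_max + 1),
# so small ids (frequent words) are chosen more often; the expected-count
# outputs are used by _compute_sampled_logits when subtract_log_q=True.
</pre></code>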

The next step is to compute the logits and labels for all samples; a few points to note (a small sketch of this appears after the function signature below):

  1. logits: the row of the final-layer weight matrix weights corresponding to a sample, multiplied by inputs.
  2. labels: (1) sampled negative samples get label 0; (2) by default each input has a single true class, so its label is 1, but if num_true > 1 each true class gets label 1 / num_true, so the labels still sum to 1 per example.
<code><pre>
def _compute_sampled_logits(weights,
                            biases,
                            labels,
                            inputs,
                            num_sampled,
                            num_classes,
                            num_true=1,
                            sampled_values=None,
                            subtract_log_q=True,
                            remove_accidental_hits=False,
                            partition_strategy="mod",
                            name=None,
                            seed=None):
  """Helper function for nce_loss and sampled_softmax_loss functions.

  Computes sampled output training logits and labels suitable for implementing
  e.g. noise-contrastive estimation (see nce_loss) or sampled softmax (see
  sampled_softmax_loss).

  Note: In the case where num_true > 1, we assign to each target class
  the target probability 1 / num_true so that the target probabilities
  sum to 1 per-example.

  Args:
    weights: A `Tensor` of shape `[num_classes, dim]`, or a list of `Tensor`
      objects whose concatenation along dimension 0 has shape
      `[num_classes, dim]`.  The (possibly-partitioned) class embeddings.
    biases: A `Tensor` of shape `[num_classes]`.  The (possibly-partitioned)
      class biases.
    labels: A `Tensor` of type `int64` and shape `[batch_size, num_true]`.
      The target classes.  Note that this format differs from
      the `labels` argument of `nn.softmax_cross_entropy_with_logits_v2`.
    inputs: A `Tensor` of shape `[batch_size, dim]`.  The forward
      activations of the input network.
    num_sampled: An `int`.  The number of classes to randomly sample per batch.
    num_classes: An `int`. The number of possible classes.
    num_true: An `int`.  The number of target classes per training example.
    sampled_values: a tuple of (`sampled_candidates`, `true_expected_count`,
      `sampled_expected_count`) returned by a `*_candidate_sampler` function.
      (if None, we default to `log_uniform_candidate_sampler`)
    subtract_log_q: A `bool`.  whether to subtract the log expected count of
      the labels in the sample to get the logits of the true labels.
      Default is True.  Turn off for Negative Sampling.
    remove_accidental_hits:  A `bool`.  whether to remove "accidental hits"
      where a sampled class equals one of the target classes.  Default is
      False.
    partition_strategy: A string specifying the partitioning strategy, relevant
      if `len(weights) > 1`. Currently `"div"` and `"mod"` are supported.
      Default is `"mod"`. See `tf.nn.embedding_lookup` for more details.
    name: A name for the operation (optional).
    seed: random seed for candidate sampling. Default to None, which doesn't set
      the op-level random seed for candidate sampling.

  Returns:
    out_logits: `Tensor` object with shape
      `[batch_size, num_true + num_sampled]`, for passing to either
      `nn.sigmoid_cross_entropy_with_logits` (NCE) or
      `nn.softmax_cross_entropy_with_logits_v2` (sampled softmax).
    out_labels: A Tensor object with the same shape as `out_logits`.
  """
</pre></code>
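To make the shapes concrete, here is a simplified NumPy sketch (my own illustration, not the TensorFlow implementation) of the logits and labels this helper returns for a single example; the real code works on batches, can subtract log q(w), and can remove accidental hits:

<code><pre>
import numpy as np

# One example with a single true class and num_sampled negatives; all values made up.
dim, num_sampled = 4, 3
inputs = np.random.randn(dim)                  # forward activations for the example
w_true, b_true = np.random.randn(dim), 0.0     # weight row / bias of the true class
w_samp = np.random.randn(num_sampled, dim)     # weight rows of the sampled classes
b_samp = np.zeros(num_sampled)

true_logit = inputs @ w_true + b_true          # u(w, c) for the true class
sampled_logits = w_samp @ inputs + b_samp      # u(w, c) for each sampled negative

out_logits = np.concatenate([[true_logit], sampled_logits])  # [num_true + num_sampled]
out_labels = np.concatenate([[1.0], np.zeros(num_sampled)])  # 1 for true, 0 for negatives
</pre></code>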

3.4 NCE loss source code

Putting the two steps above together gives the NCE loss directly: first compute the logits and labels for all samples (section 3.3), then compute the cross-entropy loss on them (section 3.2).

<code><pre>
def nce_loss(weights,
             biases,
             labels,
             inputs,
             num_sampled,
             num_classes,
             num_true=1,
             sampled_values=None,
             remove_accidental_hits=False,
             partition_strategy="mod",
             name="nce_loss"):
  """Computes and returns the noise-contrastive estimation training loss.

  See [Noise-contrastive estimation: A new estimation principle for
  unnormalized statistical
  models](http://www.jmlr.org/proceedings/papers/v9/gutmann10a/gutmann10a.pdf).
  Also see our [Candidate Sampling Algorithms
  Reference](https://www.tensorflow.org/extras/candidate_sampling.pdf)

  A common use case is to use this method for training, and calculate the full
  sigmoid loss for evaluation or inference. In this case, you must set
  `partition_strategy="div"` for the two losses to be consistent, as in the

  Note: In the case where `num_true` > 1, we assign to each target class
  the target probability 1 / `num_true` so that the target probabilities
  sum to 1 per-example.

  Note: It would be useful to allow a variable number of target classes per
  example.  We hope to provide this functionality in a future release.
  For now, if you have a variable number of target classes, you can pad them
  out to a constant number by either repeating them or by padding
  with an otherwise unused class.

  Args:
    weights: A `Tensor` of shape `[num_classes, dim]`, or a list of `Tensor`
      objects whose concatenation along dimension 0 has shape
      [num_classes, dim].  The (possibly-partitioned) class embeddings.
    biases: A `Tensor` of shape `[num_classes]`.  The class biases.
    labels: A `Tensor` of type `int64` and shape `[batch_size, num_true]`.
      The target classes.
    inputs: A `Tensor` of shape `[batch_size, dim]`.  The forward
      activations of the input network.
    num_sampled: An `int`.  The number of negative classes to randomly sample
      per batch. This single sample of negative classes is evaluated for each
      element in the batch.
    num_classes: An `int`. The number of possible classes.
    num_true: An `int`.  The number of target classes per training example.
    sampled_values: a tuple of (`sampled_candidates`, `true_expected_count`,
      `sampled_expected_count`) returned by a `*_candidate_sampler` function.
      (if None, we default to `log_uniform_candidate_sampler`)
    remove_accidental_hits:  A `bool`.  Whether to remove "accidental hits"
      where a sampled class equals one of the target classes.  If set to
      `True`, this is a "Sampled Logistic" loss instead of NCE, and we are
      learning to generate log-odds instead of log probabilities.  See
      our [Candidate Sampling Algorithms Reference](
      https://www.tensorflow.org/extras/candidate_sampling.pdf).
      Default is False.
    partition_strategy: A string specifying the partitioning strategy, relevant
      if `len(weights) > 1`. Currently `"div"` and `"mod"` are supported.
      Default is `"mod"`. See `tf.nn.embedding_lookup` for more details.
    name: A name for the operation (optional).

  Returns:
    A `batch_size` 1-D tensor of per-example NCE losses.
  """
  logits, labels = _compute_sampled_logits(
      weights=weights,
      biases=biases,
      labels=labels,
      inputs=inputs,
      num_sampled=num_sampled,
      num_classes=num_classes,
      num_true=num_true,
      sampled_values=sampled_values,
      subtract_log_q=True,
      remove_accidental_hits=remove_accidental_hits,
      partition_strategy=partition_strategy,
      name=name)
  sampled_losses = sigmoid_cross_entropy_with_logits(
      labels=labels, logits=logits, name="sampled_losses")
  # sampled_losses is batch_size x {true_loss, sampled_losses...}
  # We sum out true and sampled losses.
  return _sum_rows(sampled_losses)
</pre></code>
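For completeness, a minimal word2vec-style usage sketch of tf.nn.nce_loss (my own example in the TF 1.x graph style matching the r1.13 source above; all sizes and names are arbitrary):

<code><pre>
import tensorflow as tf

vocab_size, embed_dim, num_sampled = 10000, 128, 64

embeddings = tf.Variable(tf.random_uniform([vocab_size, embed_dim], -1.0, 1.0))
nce_weights = tf.Variable(
    tf.truncated_normal([vocab_size, embed_dim], stddev=1.0 / embed_dim ** 0.5))
nce_biases = tf.Variable(tf.zeros([vocab_size]))

train_inputs = tf.placeholder(tf.int32, shape=[None])     # context word ids
train_labels = tf.placeholder(tf.int64, shape=[None, 1])  # target word ids, [batch, num_true]

embed = tf.nn.embedding_lookup(embeddings, train_inputs)  # [batch, embed_dim]

# Per-example NCE losses, averaged into a scalar training loss.
loss = tf.reduce_mean(
    tf.nn.nce_loss(weights=nce_weights,
                   biases=nce_biases,
                   labels=train_labels,
                   inputs=embed,
                   num_sampled=num_sampled,   # k negative classes sampled per batch
                   num_classes=vocab_size))
train_op = tf.train.GradientDescentOptimizer(1.0).minimize(loss)
</pre></code>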

References:

[Mnih and Teh2012] Andriy Mnih and Yee Whye Teh. 2012. A fast and simple algorithm for training neural probabilistic language models. In Proc. ICML.

Chris Dyer. 2014. Notes on Noise Contrastive Estimation and Negative Sampling. https://arxiv.org/pdf/1410.8251.pdf

https://knet.readthedocs.io/en/v0.7.3/deprecated/nce.html
