Distributed Representations of Sentences and Documents

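Following a series of posts on word-embedding papers, today's post covers a paper on sentence vectors: Distributed Representations of Sentences and Documents. The authors are Quoc Le and Tomas Mikolov of Google; the latter is also the author of Word2Vec.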

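Once words can be represented as low-dimensional vectors, the next challenge is representing sentences and paragraphs. The traditional approach is the bag-of-words model: each sentence becomes a very high-dimensional vector whose entries are mostly 0. That representation ignores word order and also fails to capture semantics. The paper builds on the ideas behind Word2Vec and proposes an unsupervised model that maps variable-length sentences or paragraphs to fixed-length vectors. It takes word order into account within a limited context window and captures semantic information well.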
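To make the word-order problem concrete, here is a minimal sketch of a bag-of-words encoder. The vocabulary and the two sentences are made up for illustration; they do not come from the paper.

```python
from collections import Counter

def bag_of_words(sentence, vocab):
    """Return a count vector over `vocab`; word order is discarded."""
    counts = Counter(sentence.lower().split())
    return [counts[w] for w in vocab]

# Toy vocabulary (assumed); a real one would have tens of thousands of entries.
vocab = ["the", "dog", "bit", "man", "cat", "sat"]

a = bag_of_words("the dog bit the man", vocab)
b = bag_of_words("the man bit the dog", vocab)

print(a)       # [2, 1, 1, 1, 0, 0]
print(a == b)  # True: opposite meanings, identical vectors
```

The two sentences mean opposite things but get exactly the same vector, which is the weakness a paragraph vector is meant to address.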

First, a quick recap of the architecture of word2vec's CBOW model:

Given the three context words "the", "cat", and "sat", the model predicts the word "on".
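The forward pass for this example can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the vocabulary, the embedding size `dim`, and the random initialisation are invented here, the context vectors are averaged, and a full softmax is used for readability (word2vec itself relies on hierarchical softmax or negative sampling for speed).

```python
import math
import random

random.seed(0)
vocab = ["the", "cat", "sat", "on", "mat"]
dim = 4  # embedding size (assumed, tiny for illustration)

# Input word vectors W and output weights U, randomly initialised.
W = {w: [random.uniform(-0.5, 0.5) for _ in range(dim)] for w in vocab}
U = {w: [random.uniform(-0.5, 0.5) for _ in range(dim)] for w in vocab}

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cbow_predict(context):
    """Return a probability for every vocabulary word given the context."""
    # Average the context word vectors into one hidden vector.
    h = [sum(W[w][i] for w in context) / len(context) for i in range(dim)]
    # Score every vocabulary word against the hidden vector.
    scores = [sum(U[w][i] * h[i] for i in range(dim)) for w in vocab]
    return dict(zip(vocab, softmax(scores)))

probs = cbow_predict(["the", "cat", "sat"])
print(probs["on"])  # weights are untrained, so this value is not yet meaningful
```

Training would adjust `W` and `U` so that the probability assigned to "on" rises for this context.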

Analogous to CBOW, the paper proposes PV-DM (Distributed Memory Model of Paragraph Vectors), shown in the figure below:

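The difference is that the input contains one extra paragraph vector. It can be treated as just another word vector; its role is to remember what is missing from the current context, or, put differently, to represent the topic of the paragraph. The word vectors are shared across all paragraphs, whereas a paragraph vector is shared only across the contexts drawn from its own paragraph during training. The rest of the procedure is the same as in word2vec.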

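Whether you call it a topic or a memory, that reads more like deliberate framing. In essence it is just a word, except that this word uniquely identifies the paragraph and enriches the context vector.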
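The "paragraph as an extra word" view translates directly into code. Below is a minimal PV-DM training sketch, not the authors' implementation: the toy paragraph, `dim`, `window`, and `lr` are assumptions, the inputs are averaged (the paper allows averaging or concatenation), and a full softmax stands in for the faster output layers used in practice.

```python
import math
import random

random.seed(0)
paragraphs = [["the", "cat", "sat", "on", "the", "mat"]]
vocab = sorted({w for p in paragraphs for w in p})
dim, window, lr = 4, 3, 0.2  # toy hyperparameters (assumed)

def rand_vec():
    return [random.uniform(-0.5, 0.5) for _ in range(dim)]

W = {w: rand_vec() for w in vocab}    # word vectors, shared by all paragraphs
D = [rand_vec() for _ in paragraphs]  # one vector per paragraph
U = {w: rand_vec() for w in vocab}    # softmax output weights

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def pv_dm_step(pid, context, target):
    """One SGD step on a single window; returns the loss before the update."""
    # The paragraph vector joins the context as if it were one more word.
    inputs = [D[pid]] + [W[w] for w in context]
    h = [sum(v[i] for v in inputs) / len(inputs) for i in range(dim)]
    probs = softmax([sum(U[w][i] * h[i] for i in range(dim)) for w in vocab])
    # Gradient of the cross-entropy loss w.r.t. each score: prob - one_hot.
    err = [p - (w == target) for w, p in zip(vocab, probs)]
    grad_h = [sum(e * U[w][i] for w, e in zip(vocab, err)) for i in range(dim)]
    for w, e in zip(vocab, err):
        for i in range(dim):
            U[w][i] -= lr * e * h[i]
    for v in inputs:  # updates D[pid] and the context word vectors in place
        for i in range(dim):
            v[i] -= lr * grad_h[i] / len(inputs)
    return -math.log(probs[vocab.index(target)])

losses = []
for epoch in range(100):
    total = 0.0
    for pid, words in enumerate(paragraphs):
        # Slide a window over the paragraph; every window reuses D[pid].
        for t in range(window, len(words)):
            total += pv_dm_step(pid, words[t - window:t], words[t])
    losses.append(total)

print(round(losses[0], 3), round(losses[-1], 3))
```

Note that `D[pid]` is updated by every window of its paragraph, while each entry of `W` would be updated by windows from any paragraph that contains that word.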

The other model is PV-DBOW (Distributed Bag of Words version of Paragraph Vector), shown in the figure below:

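It looks a lot like word2vec's skip-gram model: the paragraph vector is the only input, and the model predicts words sampled from that paragraph, much as skip-gram predicts context words from a center word.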
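A PV-DBOW training step can be sketched the same way. Again this is an illustration under assumptions rather than the paper's code: the two toy paragraphs, `dim`, and `lr` are invented, and a full softmax replaces the faster output layers used in practice. There are no input word vectors at all; the paragraph vector alone has to predict a word drawn from its paragraph.

```python
import math
import random

random.seed(0)
paragraphs = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["stocks", "fell", "on", "monday"],
]
vocab = sorted({w for p in paragraphs for w in p})
dim, lr = 4, 0.1  # toy hyperparameters (assumed)

def rand_vec():
    return [random.uniform(-0.5, 0.5) for _ in range(dim)]

D = [rand_vec() for _ in paragraphs]  # one vector per paragraph
U = {w: rand_vec() for w in vocab}    # softmax output weights

def word_probs(pid):
    """Distribution over the vocabulary predicted from the paragraph vector alone."""
    scores = [sum(U[w][i] * D[pid][i] for i in range(dim)) for w in vocab]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def own_word_mass(pid):
    """Probability mass placed on words that occur in paragraph `pid`."""
    return sum(p for w, p in zip(vocab, word_probs(pid)) if w in paragraphs[pid])

def pv_dbow_step(pid):
    target = random.choice(paragraphs[pid])  # sample a word from the paragraph
    d = D[pid]
    # Gradient of the cross-entropy loss w.r.t. each score: prob - one_hot.
    err = [p - (w == target) for w, p in zip(vocab, word_probs(pid))]
    grad_d = [sum(e * U[w][i] for w, e in zip(vocab, err)) for i in range(dim)]
    for w, e in zip(vocab, err):
        for i in range(dim):
            U[w][i] -= lr * e * d[i]
    for i in range(dim):
        d[i] -= lr * grad_d[i]

before = [own_word_mass(pid) for pid in range(len(paragraphs))]
for step in range(500):
    for pid in range(len(paragraphs)):
        pv_dbow_step(pid)
after = [own_word_mass(pid) for pid in range(len(paragraphs))]

print(before, after)
```

After training, each paragraph vector should put more probability on its own paragraph's words than it did at initialisation, which is what `own_word_mass` measures.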

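Vectors trained with PV-DM alone already perform well, but in the experiments the final paragraph vector combines the vectors computed separately by the two models, which works even better. The approach was evaluated on several sentiment classification tasks and achieved good results.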
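The combination step can be sketched as a simple concatenation followed by a classifier. Everything concrete below is an assumption for illustration: the toy vectors and labels are made up, and the small logistic regression is just one possible downstream classifier, not the exact setup from the paper's experiments.

```python
import math

def combine(pv_dm, pv_dbow):
    """Concatenate the two paragraph vectors into one feature vector."""
    return list(pv_dm) + list(pv_dbow)

# Made-up toy vectors and labels (1 = positive, 0 = negative), illustration only.
docs = [
    ([0.9, 0.2], [0.8, 0.1], 1),
    ([1.1, 0.1], [0.7, 0.3], 1),
    ([-1.0, 0.2], [-0.9, 0.1], 0),
    ([-0.8, 0.3], [-1.1, 0.2], 0),
]
data = [(combine(dm, dbow), y) for dm, dbow, y in docs]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def score(w, b, x):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

# Logistic regression trained with plain SGD on the concatenated features.
w, b, lr = [0.0] * 4, 0.0, 0.5
for epoch in range(200):
    for x, y in data:
        g = score(w, b, x) - y  # gradient of the log loss w.r.t. the logit
        w = [wi - lr * g * xi for wi, xi in zip(w, x)]
        b -= lr * g

preds = [int(score(w, b, x) > 0.5) for x, _ in data]
print(preds)
```

In a real pipeline the two input vectors for each document would come from trained PV-DM and PV-DBOW models rather than being written by hand.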

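The significance of the paper is that it proposes an unsupervised model for representing paragraphs as vectors, and the fact that it is unsupervised matters a great deal. With an effective paragraph-level representation available, tasks such as sentence classification, retrieval, question answering, and text summarization all stand to benefit greatly.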

Source: paperweekly
