PART I: Classical Machine Learning

Why vectorize words?

"Vectorization" can be understood as "numericalization". Why turn text into numbers? Because text cannot be computed on directly, while numbers can. 【明天探索者 2021】 Whatever method is used to vectorize words, the goal is the same: to feed them into a model for training and computation. In one sentence, the purpose is to put words into vector space.

What is the relationship between word features in text and vectorization?

In NLP, characters, words, word frequencies, n-grams, part-of-speech tags, and so on can all be treated as features; once these features are vectorized, they can be fed into a model for computation. For example, the bag-of-words model uses word-frequency features, while word2vec can be seen as using co-occurrence features of the text inside a window. 【明天探索者 2021】

The figure below shows how the algorithms discussed in this article relate to one another.

Bag-of-Words (BoW): The BoW model separately matches and counts each element in the document to form a vector representation of the document. [Dongyang Yan 2020 Network-Based]

How it works: a document is mapped into a vector v = [x1, x2, ..., xn], where xi denotes the occurrence of the ith word among the basic terms.

- The basic terms (lemmatized forms: ate --> eat, jumping --> jump; stopwords such as 'a' and 'the' removed) are usually the top n highest-frequency words collected from the dataset (note: from all documents, not just the single document being analysed).

- The occurrence value xi can be binary, a term frequency, or a term frequency-inverse document frequency (TF-IDF; see the formula below). A binary value denotes whether the ith word is present in a document, which ignores word weighting. The term frequency is the number of occurrences of each word. TF-IDF assumes that the importance of a word increases proportionally to its frequency in a document but is offset by its frequency in the corpus. [Dongyang Yan 2020 Network-Based]
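For concreteness, one standard formulation of TF-IDF (the cited source does not spell out the formula; this is the common textbook version) is:

tfidf(i, d) = tf(i, d) × log(N / df(i))

where tf(i, d) is the count of word i in document d, N is the total number of documents, and df(i) is the number of documents that contain word i.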

Example: 我喜欢水,很想喝水。("I like water, and I really want to drink water.") [Jonathan Hui]

basic terms: [我,喜,欢,水,很,想,喝]

Test character: [水]

binary: [0,0,0,1,0,0,0]

term-frequency: [0,0,0,2,0,0,0]

TF-IDF: ...
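A minimal Python sketch reproducing the binary and term-frequency values above (character-level tokenization; the variable names are my own, not from the cited sources):

```python
from collections import Counter

# Basic terms collected from the corpus (here, the characters of the example sentence).
basic_terms = ['我', '喜', '欢', '水', '很', '想', '喝']

document = '我喜欢水,很想喝水。'
counts = Counter(document)  # punctuation never matches a basic term, so it is harmless

# Vectors for the test character 水, as in the example above:
# every dimension is 0 except the one corresponding to 水.
binary_vector = [1 if t == '水' and counts[t] > 0 else 0 for t in basic_terms]
tf_vector = [counts[t] if t == '水' else 0 for t in basic_terms]

print(binary_vector)  # [0, 0, 0, 1, 0, 0, 0]
print(tf_vector)      # [0, 0, 0, 2, 0, 0, 0]
```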

Pros: a straightforward method for text representation in vector space. [Dongyang Yan 2020 Network-Based]

Cons: 1) The occurrence value xi is matched and counted without considering the influence of other words, so much context information may be lost by ignoring correlations between words.

Example: Sen 1: 我想喝水。("I want to drink water.") Sen 2: 水想喝我。("Water wants to drink me.")

basic terms: [我,想,喝,水]

Test characters: [我,想,喝,水]

binary: [1,1,1,1]

In both sentences, each word among the basic terms occurs exactly once, so the BoW model maps Sen 1 and Sen 2 to the same vector, i.e. v1 = v2 = [1,1,1,1], even though the two sentences have opposite meanings. [Dongyang Yan 2020 Network-Based]
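This loss of word order is easy to verify; a short sketch of my own that maps both sentences to count vectors:

```python
from collections import Counter

basic_terms = ['我', '想', '喝', '水']

def bow_vector(sentence):
    """Count each basic term in the sentence; word order is discarded."""
    counts = Counter(sentence)
    return [counts[t] for t in basic_terms]

v1 = bow_vector('我想喝水')
v2 = bow_vector('水想喝我')
print(v1, v2, v1 == v2)  # [1, 1, 1, 1] [1, 1, 1, 1] True
```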

2) The traditional vector space model uses exact word matching: the user's input words are matched exactly against the words in the vector space, so the model cannot handle polysemy (one word with several meanings) or synonymy (several words with one meaning). In search, what we actually want to compare is not the words themselves but the meanings and concepts behind them.

Word Embedding scheme:

This method introduces the dependence of one word on other words and is the most popular vector representation of document vocabulary. [Dhruvil Karani 2018 Introduction to Word]

In a vector space, words with a similar context occupy close spatial positions; in other words, similar words cluster together and different words repel each other. Mathematically, the cosine of the angle between such vectors should be close to 1, i.e. an angle close to 0. [Dhruvil Karani 2018 Introduction to Word]
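A small NumPy sketch of this cosine measure (the 2-dimensional vectors are invented purely for illustration):

```python
import numpy as np

def cosine_similarity(a, b):
    # cos(theta) = a.b / (|a||b|); values near 1 mean the vectors nearly coincide
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

king = np.array([0.9, 0.8])
queen = np.array([0.85, 0.75])
apple = np.array([-0.7, 0.2])

print(cosine_similarity(king, queen))  # close to 1: words sharing similar contexts
print(cosine_similarity(king, apple))  # far from 1: unrelated words
```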

Implementation 1:

Word2Vec: a method to construct such an embedding that captures the local statistics of a corpus. It can be trained in two ways (both involving neural networks): Continuous Bag of Words (CBOW) or Skip-gram. A very good video explanation can be found on Bilibili: word2vec_bilibili.

How it works:

--> CBOW: predicts the center word from the surrounding words (the context). This algorithm takes the context of each word as the input and tries to predict the word corresponding to the context. Detailed mathematics: [Xin Rong 2016 word2vec Parameter]

Pros: according to Mikolov, 1) CBOW is faster (computationally efficient) and 2) has better representations for more frequent words.

Cons: lower accuracy, particularly for rare words.

--> Skip-gram: predicts the surrounding words (the context) from the center word. This model uses the target word (whose representation we want to generate) to predict the context, and in the process produces the representations. To some extent, it can be seen as a flipped (multiple-context) CBOW. Detailed mathematics: [Xin Rong 2016 word2vec Parameter]

Pros: according to Mikolov, 1) Skip-gram works well with small amounts of data and 2) represents rare words well, i.e. good accuracy.

Cons: higher computational cost (slower to train).
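A minimal training sketch for both variants, assuming the gensim library (4.x API) and a toy corpus invented for illustration:

```python
from gensim.models import Word2Vec

# Toy corpus: a list of tokenized sentences.
sentences = [['i', 'like', 'water'],
             ['i', 'want', 'to', 'drink', 'water'],
             ['water', 'is', 'good']]

# sg=0 selects CBOW (predict the center word from its context);
# sg=1 selects Skip-gram (predict the context from the center word).
cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

print(skipgram.wv['water'].shape)         # one 50-dimensional vector per word
print(skipgram.wv.most_similar('water'))  # nearest neighbours by cosine similarity
```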

Word2Vec pros: word2vec, which captures local statistics, does very well in analogy tasks. [Sharma 2017 Vector]

Word2Vec cons: word2vec relies only on the local information of language.

It is unable to leverage the global statistics of the corpus, since the vectors are trained on separate local context windows. That is, the semantics learnt for a given word are affected only by the surrounding words. [Sharma 2017 Vector]

Implementation 2:

LSA: a method that efficiently utilizes statistical information from global co-occurrence counts. The vector representation is obtained via a singular value decomposition (SVD).

How it works: 1) A single term-frequency matrix X containing word counts per document (rows represent unique words, columns represent documents) is constructed from a large piece of text. 2) A mathematical technique called singular value decomposition (SVD) is used to decompose the matrix into a product of three simpler matrices: a term-concept matrix U, a singular-values matrix D, and a concept-document matrix V. [Letsche 1997 Large-scale information retrieval]

Quoting Wu Jun's summary in "矩阵计算与文本处理中的分类问题" (Matrix Computation and Classification Problems in Text Processing): the three matrices have very clear physical meanings.

  • Each row of the first matrix U represents a class of words with related meanings; each non-zero element indicates the importance (or relevance) of a word within that class, and the larger the value, the stronger the relevance.
  • Each column of the last matrix V represents a class of articles on the same topic; each element indicates the relevance of an article within that class.
  • The middle matrix D represents the correlation between the word classes and the article classes.

Therefore, a single singular value decomposition of the association matrix X simultaneously yields both a clustering of near-synonyms and a clustering of articles, together with the relevance between each article class and each word class. 【笨兔勿应 博客园】
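A minimal NumPy sketch of this decomposition (the tiny term-document matrix is invented for illustration):

```python
import numpy as np

# Term-document matrix X: rows are unique words, columns are documents.
X = np.array([[2., 0., 1.],
              [1., 0., 0.],
              [0., 3., 1.],
              [0., 1., 2.]])

# X = U @ diag(d) @ Vt, where U is the term-concept matrix, d holds the
# singular values, and Vt is the transposed concept-document matrix.
U, d, Vt = np.linalg.svd(X, full_matrices=False)

k = 2  # keep only the k strongest "concepts" (rank-k approximation)
X_k = U[:, :k] @ np.diag(d[:k]) @ Vt[:k, :]
print(np.round(X_k, 2))  # the low-rank view of X used for clustering/retrieval
```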

Pros: 1) Efficiently utilizes statistical information from global co-occurrence counts.

2) LSA can handle the synonymy problem that the traditional vector space model cannot.

Cons: LSA cannot handle polysemy. Because LSA maps each word to a single point in the latent semantic space, the multiple senses of a word correspond to the same point and are never distinguished.

Implementation 3:

GloVe: this method captures both the global statistics and the local statistics of a corpus when putting words into a vector space. [Ganegedara 2019 Intuitive Guide to] GloVe employs the observation that ratios of word-word co-occurrence probabilities have the potential to encode some form of meaning. [Stanford GloVe]

How it works: 1) It utilizes count data, which gives it the ability to capture global statistics: a co-occurrence matrix is constructed, where each cell Xij is a "strength" value representing how often word i appears in the context of word j. 【Glove算法原理 知乎】

2) It then fits a weighted least-squares (log-bilinear) regression model: the training objective of GloVe is to learn word vectors such that their dot product equals the logarithm of the words' probability of co-occurrence.

Because the logarithm of a ratio equals the difference of logarithms, this objective associates (the logarithm of) ratios of co-occurrence probabilities with vector differences in the word vector space. [Stanford GloVe]
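For reference, the weighted least-squares objective from the GloVe paper is

J = \sum_{i,j=1}^{V} f(X_{ij}) (w_i^T \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij})^2

where w_i and \tilde{w}_j are the word and context vectors, b_i and \tilde{b}_j are bias terms, V is the vocabulary size, and f is a weighting function that caps the influence of very frequent co-occurrences.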

Pros: GloVe combines the benefits of the word2vec skip-gram model on word analogy tasks with those of matrix factorization methods such as LSA that exploit global statistical information.

Cons: it uses a lot of memory: the fastest way to construct a term-co-occurrence matrix is to keep it in RAM as a hash map and perform co-occurrence increments globally. [Sciforce 2018]

BERT

Pros: BERT produces contextual embeddings: the vector for a word depends on the sentence around it, so the same word receives different representations in different contexts, which addresses the polysemy problem that LSA and the static embeddings above cannot solve.

Cons: the model is large and computationally expensive, both to pre-train and to run at inference time.
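A minimal sketch of extracting contextual vectors, assuming the Hugging Face transformers library and the public bert-base-chinese checkpoint:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-chinese')
model = AutoModel.from_pretrained('bert-base-chinese')

# The same character 水 appears in two different contexts.
for sentence in ['我想喝水。', '水想喝我。']:
    inputs = tokenizer(sentence, return_tensors='pt')
    with torch.no_grad():
        outputs = model(**inputs)
    # One vector per token, conditioned on the whole sentence, so the two
    # occurrences of 水 receive different embeddings.
    print(sentence, outputs.last_hidden_state.shape)  # (1, seq_len, 768)
```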
