Anything2Vec: Mapping Reddit into Vector Spaces

A common problem in ML, natural language processing (NLP), and AI at large is how to represent objects in a way computers can process. Since computers understand numbers, for which we have a common language for comparing, combining and manipulating, this generally means assigning objects numbers in some fashion. Think of taking something abstract but intuitive to humans, like the text of a book, and assigning each word in that book a unique number. That book could then be represented by the list, or vector, of numbers assigned to it. This is the process of embedding that book as a vector, and there is an increasingly rich literature of techniques for embedding objects as vectors.


While much of this literature focuses on representing words as vectors, which can aid in NLP problems, much of the logic transfers to embedding any arbitrary set of objects. Through my research at the University of Toronto and its Computational Social Science Lab, I’ve been applying embedding techniques to understand online forums like Reddit. This article is meant to serve as a starting point for breaking down the research being done at UofT. For more information on my research check out https://cameronraymond.me, and for the original paper that this article is based on see Waller, I., & Anderson, A.


First, we’ll take a look at what it means to embed some thing as a vector and what a good embedding entails. Then we’ll take a common embedding technique, Word2Vec, and see how it is used to model words as vectors. After seeing why Word2Vec is so useful, we can start to generalize its principles and show its utility in mapping the different communities of Reddit.


What is an embedding?

While embedding techniques can get complex, at their core, to embed some thing is just to represent that thing as a vector of real numbers. This is useful because there is a common currency when talking about vectors of real numbers; namely, they are easy to add, subtract, compare and manipulate. To embed some set of objects, then, is just to represent those objects with unique vectors of real numbers. Not all embedding techniques involve complex neural nets, and often simple embeddings are powerful enough for a given problem; however, there are benefits to the more nuanced techniques that we’ll focus on.


A ‘dumb embedding’ would be to one-hot encode all the different unique objects as their own unit basis vectors. This means that in a set of |V| objects, each object v in that set is represented as a vector of length |V| that is all 0s, except for the v-th index, which is a 1.


[Figure: one-hot encoding example. Source: Kaggle.]
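
To make this concrete, here is a minimal sketch of one-hot encoding in Python (the vocabulary is made up for illustration):

```python
import numpy as np

# A hypothetical set of |V| = 3 objects.
vocab = ["red", "yellow", "orange"]
index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a vector of length |V| that is all 0s except a 1 at the word's index."""
    vec = np.zeros(len(vocab))
    vec[index[word]] = 1.0
    return vec

print(one_hot("red"))     # [1. 0. 0.]
print(one_hot("yellow"))  # [0. 1. 0.]
```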

Why might this not be a powerful enough embedding? Even though we have the tools to manipulate these vectors, doing so may not return intuitive results. This is because when objects are one-hot encoded, the embedding isn’t tied back to the real world in any way. Specifically, there isn’t a logical relationship between objects’ representations that reflects their actual relationships; each vector is equally far from every other vector. In an ideal world, you might want the vector representing ‘red’ ([red] = <1 0 0>) and the vector representing ‘yellow’ ([yellow] = <0 1 0>), when added together, to return the vector representing ‘orange’ ([red] + [yellow] = <1 1 0> = [orange]). One-hot encoding only allows you to say what an item is by its vector; it doesn’t tell you how the vectors relate to one another. With that said, one-hot encoding is often a good starting point.

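A quick sketch shows the problem: every pair of one-hot vectors is exactly as far apart as every other pair, so the geometry says nothing about how the objects relate (again using a made-up three-word vocabulary):

```python
import numpy as np
from itertools import combinations

vocab = ["red", "yellow", "orange"]
vectors = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}  # one-hot rows

# Every pair has the same Euclidean distance and zero cosine similarity,
# so nothing in the representation hints that red + yellow relates to orange.
for a, b in combinations(vocab, 2):
    dist = np.linalg.norm(vectors[a] - vectors[b])
    cos = vectors[a] @ vectors[b]
    print(f"{a} vs {b}: distance={dist:.3f}, cosine={cos:.1f}")
# each pair: distance=1.414, cosine=0.0
```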

To understand how we can embed objects in a way that is tied back to the real world, we’ll look at a more nuanced technique called Word2Vec. While generally used to embed words, it generalizes to arbitrary objects in certain cases as well. Word2Vec allows us to represent each object from a set of objects as a dense vector of real numbers, in a way that preserves relations between different objects.


To get the intuition behind how Word2Vec works, we’ll look at its most common use case: embedding words as vectors. As such, those familiar with Word2Vec can skip the next section. From there we’ll see how Word2Vec can generalize to embed other objects. For this we’ll embed Reddit’s 10,000 most active communities. Finally, we’ll show how this embedding aligns with our understanding of what these communities represent.


Word2Vec

The underlying intuition behind Word2Vec is that two words are similar if they are used in similar ways. For example, if you substitute the word ‘good’ for the word ‘great’ in a sentence, it will likely still make sense. This concept is well summarized by the linguist John Rupert Firth, who, in 1957, said “you shall know a word by the company it keeps.” While there are various implementations of Word2Vec, this article will focus on the Skip-gram model, which fits in well with Firth’s ideas.


“You shall know a word by the company it keeps.” — J.R. Firth


The Skip-gram model, when applied to words, goes through each word in the text corpus and tries to predict the n words on either side of it. The n words surrounding the target word are its context. In the picture below we see that the context for the word ‘nasty’ is ‘ferocious’, ‘dog’s’, ‘sharp’ and ‘bite’.


[Figure: the context window around the target word ‘nasty’.]
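
As a rough illustration of how the training pairs are generated (the tokenization and window size here are arbitrary), each target word is paired with the n words on either side of it:

```python
def skipgram_pairs(tokens, n=2):
    """Yield (target, context) pairs using a window of n words on either side."""
    for i, target in enumerate(tokens):
        for j in range(max(0, i - n), min(len(tokens), i + n + 1)):
            if j != i:
                yield target, tokens[j]

sentence = "the ferocious dog's nasty sharp bite".split()
print([context for target, context in skipgram_pairs(sentence) if target == "nasty"])
# ['ferocious', "dog's", 'sharp', 'bite']
```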

We start off by one-hot encoding each word, and then use a shallow neural network to predict all the context words associated with the target word. In this way, words used in similar contexts will have similar output vectors. By taking the output of the hidden layer, before it is converted into the concatenation of the one-hot encoded context predictions, we can represent that word as a dense vector of real numbers.


[Figure: the Skip-gram network, from one-hot input through the hidden layer to context predictions.]
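
A bare-bones sketch of that architecture (the sizes are arbitrary, and training by backpropagation is omitted) shows why the hidden layer doubles as the word’s dense vector:

```python
import numpy as np

vocab_size, embed_dim = 10_000, 150                            # hypothetical sizes
rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.01, size=(vocab_size, embed_dim))    # input -> hidden
W_out = rng.normal(scale=0.01, size=(embed_dim, vocab_size))   # hidden -> output

def skipgram_forward(target_idx):
    # Multiplying a one-hot input by W_in just selects a row: this is the hidden layer.
    hidden = W_in[target_idx]              # the dense vector for this word
    scores = hidden @ W_out                # one score per possible context word
    probs = np.exp(scores - scores.max())
    return hidden, probs / probs.sum()     # softmax over the vocabulary

hidden, context_probs = skipgram_forward(42)
print(hidden.shape, context_probs.shape)   # (150,) (10000,)
```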

Through this training process, Word2Vec preserves semantic as well as syntactic shifts in language. For example, the transformation from the vector representing the word ‘King’ (denoted by [King]) to [Queen] is roughly the same as the transformation from [Man] to [Woman]. Therefore we can represent the analogy ‘Man is to Woman as King is to Queen’ as [Man] - [Woman] = [King] - [Queen]. And if we didn’t already know that Queen is the final component of the analogy, we could solve for it using the equation [Queen] = [King] - [Man] + [Woman].


[Figure: the [King] - [Man] + [Woman] ≈ [Queen] analogy in vector space.]
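
With a trained model this arithmetic is a one-line query. Below is a hedged sketch using gensim (assuming gensim 4.x; the toy corpus is only a stand-in, since a real corpus is needed before the analogy actually resolves to ‘queen’):

```python
from gensim.models import Word2Vec

# Stand-in corpus: any iterable of tokenized sentences would do.
sentences = [["the", "king", "ruled"], ["the", "queen", "ruled"],
             ["the", "man", "walked"], ["the", "woman", "walked"]]

# sg=1 selects the Skip-gram variant.
model = Word2Vec(sentences, vector_size=150, window=2, sg=1, min_count=1)

# [Queen] ≈ [King] - [Man] + [Woman]
print(model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```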

Anything2Vec

The Skip-gram model has been well explored when applied to words, as seen through the popularity of Word2Vec, but its utility doesn’t stop at linguistic analogies. Here we’ll show how Word2Vec generalizes to situations where there’s a logical target-context relation.


Subreddit Embeddings

Just as you can “know a word by the company it keeps,” the same logic applies to Reddit and its variety of online communities, called subreddits. The less pithy analog in this case is that we can know a subreddit by the commenters it keeps. For the Skip-gram model, each subreddit represents a “word” and that subreddit’s commenters act as the “context.” So, just as with words, subreddits with similar commenters will have similar output vectors.


[Figure: subreddits as “words” and their commenters as “context”.]
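
One simple way to reuse an off-the-shelf Word2Vec implementation for this, sketched below with made-up data and not necessarily the exact setup of the original paper, is to treat each user’s comment history as a “sentence” of subreddit tokens, so that subreddits sharing commenters end up sharing contexts:

```python
from collections import defaultdict
from gensim.models import Word2Vec

# Stand-in data: (username, subreddit) pairs from a comment dump.
comments = [("alice", "hiphopheads"), ("alice", "popheads"),
            ("bob", "boston"), ("bob", "bostonceltics"),
            ("carol", "chicago"), ("carol", "chicagobulls")]

# Each user's comment history becomes one "sentence" of subreddit tokens.
history = defaultdict(list)
for user, subreddit in comments:
    history[user].append(subreddit)

model = Word2Vec(list(history.values()), vector_size=150, window=5, sg=1, min_count=1)
subreddit_vectors = model.wv   # one dense vector per subreddit
```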

While the output vectors are embedded in a high-dimensional vector space (often 150+ dimensions), and thus can’t be visualized directly, principal component analysis can return a 3-dimensional approximation. Below is a visualization of such an approximation for all 10,000 subreddits. In this plot we’ve highlighted the hip-hop-oriented subreddit /r/hiphopheads and its 100 closest vectors. As we can see, the closest subreddits by cosine similarity are also hip-hop themed.


[Figure: 3-D PCA projection of the 10,000 subreddit vectors, with /r/hiphopheads and its 100 nearest neighbours highlighted.]
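
Both the projection and the nearest-neighbour query can be done with standard tooling. A sketch, assuming the subreddit_vectors object trained in the snippet above:

```python
from sklearn.decomposition import PCA

# 3-D approximation of the full embedding space for plotting.
coords_3d = PCA(n_components=3).fit_transform(subreddit_vectors.vectors)

# 100 nearest subreddits to /r/hiphopheads by cosine similarity.
neighbours = subreddit_vectors.most_similar("hiphopheads", topn=100)
```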

Subreddit Analogies

With Word2Vec, the resulting embeddings can preserve relationships between words. This allows simple vector addition and subtraction to answer analogy problems. For example, to answer the analogy ‘Berlin is to Germany as Ottawa is to x’, we calculate [x] = [Germany] - [Berlin] + [Ottawa] and choose the closest vector to [x], which would be [Canada]. This property holds for our subreddit embedding as well. When posing the analogy ‘/r/boston is to /r/chicago as /r/bostonceltics is to x’, the closest vector to [/r/bostonceltics] - [/r/boston] + [/r/chicago] is the subreddit dedicated to the Chicago Bulls.


[Figure: vector transformation from a city to its corresponding NBA team.]
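
The same most_similar call answers subreddit analogies; under the hood it is just the vector arithmetic above followed by a cosine-similarity search (again assuming the subreddit_vectors object from the earlier sketch):

```python
# /r/boston is to /r/chicago as /r/bostonceltics is to x,
# so x ≈ [bostonceltics] - [boston] + [chicago].
answer = subreddit_vectors.most_similar(
    positive=["bostonceltics", "chicago"], negative=["boston"], topn=1)
print(answer)   # on the real data this surfaces the Chicago Bulls subreddit
```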

On a testing set of ~1,500 similar analogy problems (city to sports team, university to university town, state to state capital), our embedding attained 81% accuracy.

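That accuracy figure comes from a loop like the following sketch, where analogies is a stand-in for the ~1,500 (a, b, c, d) test tuples:

```python
def analogy_accuracy(kv, analogies):
    """Fraction of analogies a : b :: c : d where d is the top answer for b - a + c."""
    correct = 0
    for a, b, c, d in analogies:
        prediction, _ = kv.most_similar(positive=[b, c], negative=[a], topn=1)[0]
        correct += prediction == d
    return correct / len(analogies)

# e.g. analogy_accuracy(subreddit_vectors,
#                       [("boston", "bostonceltics", "chicago", "chicagobulls")])
```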

When and When Not?

The core intuition behind Word2Vec, and its generalization, is that you can represent words, subreddits, Twitter users, etc. by the company they keep. Words used in similar contexts are likely similar; the same holds for subreddits with similar commenters and Twitter users with similar followers. However, if there isn’t enough data, the embedding isn’t likely to pick up on the different dimensions in which the entities can be similar or different. Any user on Reddit likely comments on a variety of subreddits, not all of which are related. Yet, from a macro point of view, over millions of comments, very nuanced relations begin to emerge.


By first starting with a bare-bones approach to what an embedding can be, and then seeing how more nuanced embeddings can improve NLP problems, this article showed how embedding techniques can derive interesting results when applied to arbitrary objects, like subreddits. If you have thoughts on how you’d like to see this work used, feel free to let me know below!


Originally published at https://cameronraymond.me.


Translated from: https://towardsdatascience.com/anything2vec-mapping-reddit-into-vector-spaces-dcc77d9f3bea
