Python Gensim Word2Vec

Gensim is an open-source vector space and topic modelling toolkit. It is implemented in Python and uses NumPy & SciPy. It also uses Cython for performance.

1. Python Gensim Module

Gensim is designed for data streaming, large text collections, and efficient incremental algorithms. Put simply, it is designed to extract semantic topics from documents automatically and efficiently.

This differentiates it from most other packages, which target only in-memory, batch processing. At the core of Gensim are unsupervised algorithms such as Latent Semantic Analysis and Latent Dirichlet Allocation, which examine statistical word co-occurrence patterns within a corpus of training documents to discover the semantic structure of those documents.

2. Why use Gensim?

Gensim has several features that give it an edge over other scientific packages:

Memory independent: the whole training corpus does not need to reside in RAM at any one time, so Gensim can process large, web-scale corpora.
It provides I/O wrappers and converters for several popular data formats.
It has efficient implementations of several vector space algorithms, including Tf-Idf, distributed incremental Latent Semantic Analysis, distributed incremental Latent Dirichlet Allocation (LDA), and Random Projection. Adding new ones is easy.
It provides similarity queries for documents in their semantic representation.
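
The memory-independence point can be illustrated with a small streaming corpus. This is a minimal sketch, not Gensim's own code: the class name LineSentences and the sample text are made up for illustration. The idea is that an object whose __iter__ re-reads a file yields one tokenized sentence at a time, so only the current line sits in memory, and it can be iterated repeatedly (training typically makes several passes).

```python
import os
import tempfile


class LineSentences:
    """Yield one tokenized sentence per line, reading the file lazily.

    Hypothetical helper for illustration; not part of Gensim.
    """

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # The file is reopened on every pass, so the corpus can be iterated many times.
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                tokens = line.lower().split()
                if tokens:  # skip blank lines
                    yield tokens


# Build a tiny throwaway corpus to demonstrate (made-up data).
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("This is the first sentence\n\nYet another sentence\n")
    corpus_path = tmp.name

corpus = LineSentences(corpus_path)
first_pass = list(corpus)
second_pass = list(corpus)  # a second pass yields the same sentences again
print(first_pass)
os.remove(corpus_path)
```

An iterable like this can be passed wherever a list of tokenized sentences is expected. Gensim also ships a similar helper, gensim.models.word2vec.LineSentence, which is usually the better choice in real code.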

3. Getting Started with Gensim

Before getting started with Gensim, check that your machine is ready to work with it. Gensim assumes the following are installed and working:

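Python 2.6 or later
NumPy 1.3 or later
SciPy 0.7 or later

These are the minimums listed in the original tutorial. The Gensim 4.x series runs on Python 3 only, so check the installation notes for the release you actually install.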
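
One way to check what is installed is sketched below. It assumes Python 3.8 or later (for importlib.metadata); the helper name installed_version is made up for this example.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Report the packages this tutorial depends on.
for name in ("gensim", "numpy", "scipy"):
    print(name, installed_version(name) or "not installed")
```

Knowing the Gensim version matters because some attribute names differ between the 3.x and 4.x series.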

3.1) Install Gensim Library

Once these requirements are satisfied, your machine is ready for Gensim. You can install it with pip by running the following command in your terminal:

pip install --upgrade gensim

3.2) Using Gensim

You can use Gensim in any Python script by importing it like any other package:

import gensim

3.3) Develop Gensim Word2Vec Embedding

We have talked a lot about text, words, and vectors while introducing Gensim. Let’s start by training a Word2Vec embedding:

from gensim.models import Word2Vec

# define training data: a list of tokenized sentences
sentences = [
    ['this', 'is', 'the', 'first', 'sentence', 'for', 'word2vec'],
    ['this', 'is', 'the', 'second', 'sentence'],
    ['yet', 'another', 'sentence'],
    ['one', 'more', 'sentence'],
    ['and', 'the', 'final', 'sentence'],
]
# train model (min_count=1 keeps words that appear only once)
model = Word2Vec(sentences, min_count=1)
# summarize the trained model
print(model)
# summarize vocabulary (Gensim 4.x; on 3.x use list(model.wv.vocab))
words = list(model.wv.key_to_index)
print(words)
# access vector for one word through the model's KeyedVectors (model.wv)
print(model.wv['sentence'])
# save model
model.save('model.bin')
# load model
new_model = Word2Vec.load('model.bin')
print(new_model)

Running the code prints a summary of the model, the vocabulary list, the vector for the word 'sentence' (100 dimensions by default), and a summary of the reloaded model.

3.4) Visualize Word Embedding

Each word in our training data is now represented by a vector with many dimensions, which is hard to interpret directly. Projecting the vectors down to two dimensions and plotting them can help:

from gensim.models import Word2Vec
from sklearn.decomposition import PCA
from matplotlib import pyplot

# define training data
sentences = [
    ['this', 'is', 'the', 'first', 'sentence', 'for', 'word2vec'],
    ['this', 'is', 'the', 'second', 'sentence'],
    ['yet', 'another', 'sentence'],
    ['one', 'more', 'sentence'],
    ['and', 'the', 'final', 'sentence'],
]
# train model
model = Word2Vec(sentences, min_count=1)
# collect the words and their vectors (Gensim 4.x; on 3.x use list(model.wv.vocab))
words = list(model.wv.key_to_index)
X = model.wv[words]
# fit a 2d PCA model to the vectors
pca = PCA(n_components=2)
result = pca.fit_transform(X)
# create a scatter plot of the projection
pyplot.scatter(result[:, 0], result[:, 1])
# label each point with its word
for i, word in enumerate(words):
    pyplot.annotate(word, xy=(result[i, 0], result[i, 1]))
pyplot.show()

Running the program shows a scatter plot with each word labeled at its projected position, which is much easier to read than the raw vectors.
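
To make the projection step less opaque, here is a rough sketch of what a 2-d PCA does, written with NumPy only. It is an illustration rather than scikit-learn's implementation: the function name pca_2d is made up, random numbers stand in for real word vectors, and scikit-learn may flip the sign of an axis relative to this version.

```python
import numpy as np


def pca_2d(vectors):
    """Project row vectors onto their two directions of greatest variance."""
    X = np.asarray(vectors, dtype=float)
    centered = X - X.mean(axis=0)  # PCA works on mean-centered data
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T


# Random stand-in for 14 word vectors of 100 dimensions (not real embeddings).
rng = np.random.default_rng(0)
fake_vectors = rng.normal(size=(14, 100))

projected = pca_2d(fake_vectors)
print(projected.shape)  # (14, 2)
```

The two output columns correspond to the x and y coordinates used in the scatter plot.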

3.5) Load Google’s Word2Vec Embedding

Using pre-trained embeddings may not be the best approach for every NLP application, but training your own on a large corpus is difficult and time consuming, and it requires a lot of RAM. So we use Google’s pre-trained vectors for this example. You’ll need the pre-trained model archive that contains the file GoogleNews-vectors-negative300.bin.

Download the archive, unzip it, and use the binary file inside.

Here is a sample program:

from gensim.models import KeyedVectors
# load the google word2vec model
filename = 'GoogleNews-vectors-negative300.bin'
model = KeyedVectors.load_word2vec_format(filename, binary=True)
# calculate: (king - man) + woman = ?
result = model.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)
print(result)

The above example loads Google’s Word2Vec vectors and then computes king - man + woman. We should expect the following:

[('queen', 0.7118192315101624)]
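
The arithmetic behind this query can be sketched in plain Python. This is a simplified illustration, not Gensim's implementation: the function names and the hand-made 2-d vectors are invented for the example. It adds the unit vectors of the positive words, subtracts those of the negative words, and ranks the remaining words by cosine similarity to the result.

```python
import math


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def analogy(vectors, positive, negative, topn=1):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    dims = len(next(iter(vectors.values())))
    target = [0.0] * dims
    for word in positive:
        target = [t + x for t, x in zip(target, unit(vectors[word]))]
    for word in negative:
        target = [t - x for t, x in zip(target, unit(vectors[word]))]
    target = unit(target)
    exclude = set(positive) | set(negative)  # query words are never returned
    scores = [
        (word, sum(a * b for a, b in zip(target, unit(vec))))
        for word, vec in vectors.items()
        if word not in exclude
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Hand-made toy vectors: axis 0 is roughly "royalty", axis 1 roughly "gender".
toy = {
    'king': [0.9, 0.8],
    'queen': [0.9, -0.8],
    'man': [0.1, 0.9],
    'woman': [0.1, -0.9],
    'apple': [-0.7, 0.05],
}

best = analogy(toy, positive=['woman', 'king'], negative=['man'])
print(best[0][0])  # queen
```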

3.6) Load Stanford’s GloVe Embedding

There is another algorithm for converting words to vectors, known as Global Vectors for Word Representation, or GloVe. We’ll use it for our next example.

Since we are again using pre-trained vectors, we need another file. This one is smaller than the Google model; the examples below use glove.6B.100d.txt.

First, we need to convert the file to Word2Vec format, which can be done as follows:

from gensim.scripts.glove2word2vec import glove2word2vec
glove_input_file = 'glove.6B.100d.txt'
word2vec_output_file = 'glove.6B.100d.txt.word2vec'
glove2word2vec(glove_input_file, word2vec_output_file)
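
The two text formats differ only in their first line: the Word2Vec text format starts with a header giving the vocabulary size and the vector dimensionality, which GloVe files lack. The sketch below shows that idea on a tiny made-up file. The function name add_word2vec_header is hypothetical, and unlike the real script it reads the whole file into memory, so it is only suitable for small inputs.

```python
import os
import tempfile


def add_word2vec_header(glove_path, output_path):
    """Copy a GloVe text file, prepending the '<vocab_size> <dimensions>' header."""
    with open(glove_path, encoding="utf-8") as source:
        lines = [line for line in source if line.strip()]
    vocab_size = len(lines)
    dimensions = len(lines[0].split()) - 1  # the first token on a line is the word
    with open(output_path, "w", encoding="utf-8") as target:
        target.write(f"{vocab_size} {dimensions}\n")
        target.writelines(lines)
    return vocab_size, dimensions


# Tiny made-up file in GloVe layout: one word per line, then its vector.
workdir = tempfile.mkdtemp()
glove_path = os.path.join(workdir, "toy_glove.txt")
w2v_path = os.path.join(workdir, "toy_glove.txt.word2vec")
with open(glove_path, "w", encoding="utf-8") as handle:
    handle.write("king 0.1 0.2 0.3\nqueen 0.2 0.1 0.4\n")

shape = add_word2vec_header(glove_path, w2v_path)
with open(w2v_path, encoding="utf-8") as handle:
    header = handle.readline().strip()
print(shape, header)  # (2, 3) 2 3
```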

Once this is done, we can continue with our example:

from gensim.models import KeyedVectors
# load the Stanford GloVe model (converted to Word2Vec text format above)
filename = 'glove.6B.100d.txt.word2vec'
model = KeyedVectors.load_word2vec_format(filename, binary=False)
# calculate: (king - man) + woman = ?
result = model.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)
print(result)

Again, we expect queen to be the top result when the program is run.

4. Conclusion

In this tutorial, we have seen how to train and load word embeddings in Python using Gensim. Specifically, we have learned:

To train our own word embedding model on text data.
To visualize a trained word embedding model.
To load pre-trained GloVe and Word2Vec word embedding models from Stanford and Google respectively.

We have seen that Gensim makes it straightforward and efficient to convert words to vectors, and that querying the resulting model is just as simple.

Translated from: https://www.journaldev.com/19279/python-gensim-word2vec