Gensim is an open-source vector space and topic modelling toolkit. It is implemented in Python and uses NumPy & SciPy. It also uses Cython for performance.

1. Python Gensim Module

Gensim is designed for data streaming, handling large text collections, and efficient incremental algorithms. In simple terms, Gensim is designed to extract semantic topics from documents automatically, as efficiently and effortlessly as possible.
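To make the streaming idea concrete, here is a minimal sketch in plain Python (not Gensim's own code) of a corpus that yields one tokenized document at a time, so only a single line is held in memory. The class name `StreamingCorpus` and the one-document-per-line file layout are assumptions made for illustration.

```python
import os
import tempfile


class StreamingCorpus:
    """Iterate over a text file, yielding one tokenized document per line.

    Hypothetical helper: assumes one document per line and whitespace tokens.
    """

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # The file is read lazily, so memory use does not grow with corpus size.
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                tokens = line.lower().split()
                if tokens:
                    yield tokens


# Write a tiny throwaway corpus to disk so the sketch runs end to end.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("Human machine interface\n\nGraph of trees\n")
    corpus_path = tmp.name

corpus = StreamingCorpus(corpus_path)
documents = list(corpus)  # materialized here only to display the result
print(documents)
os.remove(corpus_path)
```

Because the object defines `__iter__` rather than being a one-shot generator, it can be iterated several times, which is what multi-pass training over a corpus needs.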


This sets it apart from most other packages, which target only in-memory, batch processing. At the core of Gensim are unsupervised algorithms such as Latent Semantic Analysis and Latent Dirichlet Allocation, which examine statistical word co-occurrence patterns within a corpus of training documents to discover the semantic structure of those documents.
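As a rough illustration of what "co-occurrence patterns" means, the sketch below counts how often two words appear in the same document. This is only the raw signal such algorithms build on, not LSA or LDA themselves, and the three toy documents are made up for the example.

```python
from collections import Counter
from itertools import combinations

# Made-up toy corpus: each document is a list of tokens.
documents = [
    ["human", "computer", "interface"],
    ["computer", "system", "interface"],
    ["graph", "trees"],
]

cooccurrence = Counter()
for tokens in documents:
    # Count each unordered pair of distinct words once per document.
    for pair in combinations(sorted(set(tokens)), 2):
        cooccurrence[pair] += 1

# Pairs that share many documents hint at a shared topic.
print(cooccurrence.most_common(1))
```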

2. Why use Gensim?

Gensim has several features that give it an edge over other scientific packages:


  • Memory independence: the whole training corpus never needs to reside in RAM at once, so Gensim can process large, web-scale corpora.
  • It provides I/O wrappers and converters around several popular data formats.
  • Gensim has efficient implementations of several vector space algorithms, including Tf-Idf, distributed incremental Latent Semantic Analysis, distributed incremental Latent Dirichlet Allocation (LDA), and Random Projection. Adding new ones is also easy.
  • It also provides similarity queries for documents in their semantic representation.
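Two of the features above, Tf-Idf weighting and similarity queries, can be sketched in plain Python. The code below is an illustration under stated assumptions, not Gensim's implementation: it uses a common tf × log2(N / df) weighting (Gensim's exact defaults may differ), cosine similarity as the query measure, and a made-up three-document corpus. The helper names `tfidf` and `cosine` are hypothetical.

```python
import math
from collections import Counter

# Made-up toy corpus: each document is a list of tokens.
documents = [
    ["human", "computer", "interaction"],
    ["graph", "of", "trees"],
    ["computer", "graph", "survey"],
]

# Document frequency: in how many documents each term appears.
doc_freq = Counter(term for doc in documents for term in set(doc))
num_docs = len(documents)


def tfidf(tokens):
    """Return a sparse {term: weight} vector using tf * log2(N / df)."""
    counts = Counter(tokens)
    return {
        term: count * math.log2(num_docs / doc_freq[term])
        for term, count in counts.items()
        if term in doc_freq
    }


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


index = [tfidf(doc) for doc in documents]
query = tfidf(["computer", "interaction"])
scores = [cosine(query, vec) for vec in index]
best = max(range(num_docs), key=scores.__getitem__)
print(best, [round(s, 3) for s in scores])
```

The query shares two terms with the first document, one with the third, and none with the second, so the ranking follows that overlap.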

3. Getting Started with Gensim

Before getting started with Gensim, check that your machine is ready to work with it. Gensim assumes the following are installed and working on your machine:

  • Python 2.6 or later
  • Numpy 1.3 or later
  • Scipy 0.7 or later
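The requirements above can be checked from Python itself. The following is a small sketch that reports which of the packages are installed and in which version; it assumes Python 3.8 or later (for `importlib.metadata`), and the helper name `installed_versions` is made up for this example.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each package name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


report = installed_versions(["numpy", "scipy", "gensim"])
for name, version in report.items():
    print(f"{name}: {version or 'not installed'}")
```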

3.1) Install Gensim Library

Once the above requirements are satisfied, your machine is ready for Gensim. You can install it using pip. Open your terminal and run the following command:

sudo pip install --upgrade gensim

3.2) Using Gensim

You can use Gensim in any of your Python scripts by importing it like any other package. Just use the following import:

import gensim

3.3) Develop Gensim Word2Vec Embedding

We have talked a lot about text, words and vectors while introducing Gensim. Let’s start by developing a Word2Vec embedding:

from gensim.models import Word2Vec
# define training data
sentences = [['this', 'is', 'the', 'first', 'sentence', 'for', 'word2vec'],
             ['this', 'is', 'the', 'second', 'sentence'],
             ['yet', 'another', 'sentence'],
             ['one', 'more', 'sentence'],
             ['and', 'the', 'final', 'sentence']]
# train model
model = Word2Vec(sentences, min_count=1)
# summarize the loaded model
print(model)
# summarize vocabulary (Gensim 4.x; older releases used model.wv.vocab)
words = list(model.wv.key_to_index)
print(words)
# access vector for one word
print(model.wv['sentence'])
# save model
model.save('model.bin')
# load model
new_model = Word2Vec.load('model.bin')
print(new_model)

When we run the code, we expect the model to hold one vector for each word in the vocabulary.

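Word2Vec learns these vectors from words that appear near each other. As a rough sketch of that idea, the function below lists (center, context) pairs within a fixed window. The name `context_pairs` is hypothetical and this is not Gensim's internal routine; the real training procedure does more than this.

```python
def context_pairs(tokens, window=2):
    """Yield (center, context) pairs for words within `window` positions."""
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:
                yield center, tokens[j]


# One of the training sentences from the example above.
sentence = ["yet", "another", "sentence"]
pairs = list(context_pairs(sentence, window=1))
print(pairs)
```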

3.4) Visualize Word Embedding

Each word in our training data is represented by a vector with many dimensions (100 by default), which is hard to interpret directly. Visualization can help us in this scenario:

from gensim.models import Word2Vec
from sklearn.decomposition import PCA
from matplotlib import pyplot
# define training data
sentences = [['this', 'is', 'the', 'first', 'sentence', 'for', 'word2vec'],['this', 'is', 'the', 'second', 'sentence'],['yet', 'another', 'sentence'],['one', 'more', 'sentence'],['and', 'the', 'final', 'sentence']]
# train model
model = Word2Vec(sentences, min_count=1)
# collect the vectors for every word (Gensim 4.x; older releases used model[model.wv.vocab])
words = list(model.wv.index_to_key)
X = model.wv[words]
# fit a 2d PCA model to the vectors
pca = PCA(n_components=2)
result = pca.fit_transform(X)
# create a scatter plot of the projection
pyplot.scatter(result[:, 0], result[:, 1])
# label each point with its word
for i, word in enumerate(words):
    pyplot.annotate(word, xy=(result[i, 0], result[i, 1]))
pyplot.show()

When we run the program, the 2D projection should be simpler and easier to understand than the raw vectors.
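To show what the projection step does, here is a minimal pure-Python sketch that finds only the first principal component by power iteration and projects the data onto it. It is a simplification under stated assumptions: scikit-learn's `PCA` works differently internally and returns two components in the example above. The function name `first_principal_component` and the toy vectors are made up for illustration.

```python
import math


def first_principal_component(rows, iterations=100):
    """Project rows onto their first principal component (power iteration)."""
    n, dim = len(rows), len(rows[0])
    # Center each column so the component describes variance, not the mean.
    means = [sum(row[k] for row in rows) / n for k in range(dim)]
    centered = [[row[k] - means[k] for k in range(dim)] for row in rows]
    # Covariance matrix (dim x dim) of the centered data.
    cov = [
        [sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(dim)]
        for a in range(dim)
    ]
    # Power iteration converges to the eigenvector with the largest eigenvalue.
    vec = [1.0] * dim
    for _ in range(iterations):
        vec = [sum(cov[a][b] * vec[b] for b in range(dim)) for a in range(dim)]
        norm = math.sqrt(sum(x * x for x in vec))
        vec = [x / norm for x in vec]
    return [sum(r[k] * vec[k] for k in range(dim)) for r in centered]


# Toy 3-dimensional "embeddings" that all lie along one direction.
vectors = [[1.0, 2.0, 0.0], [2.0, 4.0, 0.0], [3.0, 6.0, 0.0]]
projection = first_principal_component(vectors)
print([round(p, 3) for p in projection])
```

Because the toy points lie on a single line, one component captures all of their variance; real embeddings lose some information when reduced to two dimensions.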


3.5) Load Google’s Word2Vec Embedding

Using existing pre-trained vectors may not be the best approach for every NLP application, but training your own embeddings can be a time-consuming and difficult task, since it requires a lot of RAM and time. So we are using Google’s pre-trained vectors for this example. You’ll need a file, which you can find here.

Download the file, unzip it and we’ll use the binary file inside.


Here is a sample program:


from gensim.models import KeyedVectors
# load the google word2vec model
filename = 'GoogleNews-vectors-negative300.bin'
model = KeyedVectors.load_word2vec_format(filename, binary=True)
# calculate: (king - man) + woman = ?
result = model.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)
print(result)

The above example loads Google’s Word2Vec vectors and then calculates king - man + woman = ?. We should expect the following:

[('queen', 0.7118192315101624)]
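The arithmetic behind this query can be sketched with hand-made vectors. Everything below is illustrative: the 2-dimensional vectors are invented so that one axis loosely means "royalty" and the other "gender", and the helper `analogy` is hypothetical. Gensim's `most_similar` differs in details such as vector normalization.

```python
import math

# Made-up 2-d vectors for illustration only: axis 0 ~ "royalty", axis 1 ~ "gender".
toy_vectors = {
    "king": [0.9, 0.9],
    "queen": [0.9, -0.9],
    "man": [0.1, 0.9],
    "woman": [0.1, -0.9],
    "apple": [-0.8, 0.1],
}


def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def analogy(positive, negative, vectors):
    """Return the word closest to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    target = [0.0] * dim
    for word in positive:
        target = [t + x for t, x in zip(target, vectors[word])]
    for word in negative:
        target = [t - x for t, x in zip(target, vectors[word])]
    # Skip the query words themselves so they cannot be returned as the answer.
    candidates = {w: v for w, v in vectors.items() if w not in positive + negative}
    return max(candidates, key=lambda w: cosine(target, candidates[w]))


answer = analogy(["woman", "king"], ["man"], toy_vectors)
print(answer)
```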


3.6) Load Stanford’s GloVe Embedding

Another algorithm for converting words to vectors is Global Vectors for Word Representation, popularly known as GloVe. We’ll use GloVe vectors for our next example.

Since we are using existing data, we’ll need another file. This one is relatively small and can be downloaded from here.


First we’ll need to convert the file to Word2Vec format, which can be done as follows:


from gensim.scripts.glove2word2vec import glove2word2vec
glove_input_file = 'glove.6B.100d.txt'
word2vec_output_file = 'glove.6B.100d.txt.word2vec'
glove2word2vec(glove_input_file, word2vec_output_file)
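Conceptually the conversion is small: the Word2Vec text format begins with a header line giving the vocabulary size and the vector dimensionality, while a GloVe text file contains only the word-vector rows. The sketch below shows that idea on in-memory lines. The function `add_word2vec_header` is a hypothetical stand-in, not the Gensim script itself, and the two sample rows are made up.

```python
def add_word2vec_header(glove_lines):
    """Prepend the '<vocab_size> <vector_size>' header that GloVe text lacks."""
    rows = [line.rstrip("\n") for line in glove_lines if line.strip()]
    vector_size = len(rows[0].split(" ")) - 1  # first field is the word itself
    return [f"{len(rows)} {vector_size}"] + rows


# Two made-up 3-dimensional entries in GloVe's plain-text layout.
glove_sample = ["the 0.1 0.2 0.3\n", "cat 0.4 0.5 0.6\n"]
converted = add_word2vec_header(glove_sample)
print(converted[0])
```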

Once this is done, we can continue with our example:


from gensim.models import KeyedVectors
# load the Stanford GloVe model
filename = 'glove.6B.100d.txt.word2vec'
model = KeyedVectors.load_word2vec_format(filename, binary=False)
# calculate: (king - man) + woman = ?
result = model.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)
print(result)

Again, we expect queen as the top result when we run the program.

4. Conclusion

In this tutorial, we have seen how to produce and load word embeddings in Python using Gensim. Specifically, we have learned:

  • To train our own word embedding model on text data.
  • To visualize a trained word embedding model.
  • To load pre-trained GloVe and Word2Vec word embedding models from Stanford and Google respectively.

We have seen that Gensim makes it easy and efficient to convert words to vectors. Querying the resulting model is just as easy and efficient.


