Training Word2Vec Model on English Wikipedia by Gensim

Update: I found another translation, 中英文维基百科语料上的Word2Vec实验 (Word2Vec experiments on Chinese and English Wikipedia corpora), which also covers how to do this for Chinese Wikipedia.
After learning about word2vec and GloVe, a natural next step is to train a model on a large corpus, and for that task English Wikipedia is an ideal choice. After googling related keywords such as "word2vec wikipedia" and "gensim word2vec wikipedia", I found a thread in the gensim Google group, "training word2vec on full Wikipedia", that lays out the right approach. There are other options such as wiki2vec, but I think word2vec is simpler and more efficient.
I downloaded the English Wikipedia dump (dated 2015-03-01, about 11 GB):
https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
First, we need to convert the XML-format Wikipedia dump into plain text. Copy the following code from process_wiki.py:
#!/usr/bin/env python
# -*- coding: utf-8 -*-

import logging
import os.path
import sys

from gensim.corpora import WikiCorpus

if __name__ == '__main__':
    program = os.path.basename(sys.argv[0])
    logger = logging.getLogger(program)

    logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s')
    logging.root.setLevel(level=logging.INFO)
    logger.info("running %s" % ' '.join(sys.argv))

    # check and process input arguments
    if len(sys.argv) < 3:
        print globals()['__doc__'] % locals()
        sys.exit(1)
    inp, outp = sys.argv[1:3]
    space = " "
    i = 0

    output = open(outp, 'w')
    wiki = WikiCorpus(inp, lemmatize=False, dictionary={})
    for text in wiki.get_texts():
        output.write(space.join(text) + "\n")
        i = i + 1
        if (i % 10000 == 0):
            logger.info("Saved " + str(i) + " articles")

    output.close()
    logger.info("Finished Saved " + str(i) + " articles")

Note one difference here:

wiki = WikiCorpus(inp, lemmatize=False, dictionary={})
Originally it was: wiki = WikiCorpus(inp, dictionary={})
We set lemmatize to False so that the pattern package is not used, because lemmatization with pattern slows processing down severely.
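(For reference: with lemmatize=True, gensim runs every article through the pattern package via utils.lemmatize, which returns lemma/POS tokens instead of plain words. A rough sketch of what that produces, assuming pattern is installed:)

from gensim import utils

# utils.lemmatize requires the pattern package; the output is lemma/POS
# pairs rather than the plain lowercase tokens we get with lemmatize=False
print utils.lemmatize("Anarchism is a collection of movements")
# roughly: ['anarchism/NN', 'be/VB', 'collection/NN', 'movement/NN']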
Run "python process_wiki.py enwiki-latest-pages-articles.xml.bz2 wiki.en.text" and we get:
2015-03-07 15:08:39,181: INFO: running process_enwiki.py enwiki-latest-pages-articles.xml.bz2 wiki.en.text
2015-03-07 15:11:12,860: INFO: Saved 10000 articles
2015-03-07 15:13:25,369: INFO: Saved 20000 articles
2015-03-07 15:15:19,771: INFO: Saved 30000 articles
2015-03-07 15:16:58,424: INFO: Saved 40000 articles
2015-03-07 15:18:12,374: INFO: Saved 50000 articles
2015-03-07 15:19:03,213: INFO: Saved 60000 articles
2015-03-07 15:19:47,656: INFO: Saved 70000 articles
2015-03-07 15:20:29,135: INFO: Saved 80000 articles
2015-03-07 15:22:02,365: INFO: Saved 90000 articles
2015-03-07 15:23:40,141: INFO: Saved 100000 articles
.....
2015-03-07 19:33:16,549: INFO: Saved 3700000 articles
2015-03-07 19:33:49,493: INFO: Saved 3710000 articles
2015-03-07 19:34:23,442: INFO: Saved 3720000 articles
2015-03-07 19:34:57,984: INFO: Saved 3730000 articles
2015-03-07 19:35:31,976: INFO: Saved 3740000 articles
2015-03-07 19:36:05,790: INFO: Saved 3750000 articles
2015-03-07 19:36:32,392: INFO: finished iterating over Wikipedia corpus of 3758076 documents with 2018886604 positions (total 15271374 articles, 2075130438 positions before pruning articles shorter than 50 words)
2015-03-07 19:36:32,394: INFO: Finished Saved 3758076 articles

After about 5 hours of processing on my Mac Pro (4-core CPU and 16 GB of RAM), we get a roughly 12 GB wiki.en.text, one article per line with punctuation stripped, like this:

anarchism is collection of movements and ideologies that hold the state to be undesirable unnecessary or harmful these movements advocate some form of stateless society instead often based on self governed voluntary institutions or non hierarchical free associations although anti statism is central to anarchism as political philosophy anarchism also entails rejection of and often hierarchical organisation in general as an anti dogmatic philosophy anarchism draws on many currents of thought and strategy anarchism does not offer fixed body of doctrine from single particular world view instead fluxing and flowing as philosophy there are many types and traditions of anarchism not all of which are mutually exclusive anarchist schools of thought can differ fundamentally supporting anything from extreme individualism to complete collectivism strains of anarchism have often been divided into the categories of social and individualist anarchism or similar dual classifications anarchism is usually considered radical left wing ideology and much of anarchist economics and anarchist legal philosophy reflect anti authoritarian interpretations of communism collectivism syndicalism mutualism or participatory economics etymology and terminology the term anarchism is compound word composed from the word anarchy and the suffix ism themselves derived respectively from the greek anarchy from anarchos meaning one without rulers from the privative prefix ἀν an without and archos leader ruler cf archon or arkhē authority sovereignty realm magistracy and the suffix or ismos isma from the verbal infinitive suffix...
The second step is to train a word2vec model on this text. You could train a model with the original word2vec binary, as is commonly done on the text8 file, but that seems too slow here. Instead, as in the thread above, we use gensim's word2vec implementation to train the English Wikipedia model; copy the following code from train_word2vec_model.py:
#!/usr/bin/env python
# -*- coding: utf-8 -*-

import logging
import os.path
import sys
import multiprocessing

from gensim.corpora import WikiCorpus
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

if __name__ == '__main__':
    program = os.path.basename(sys.argv[0])
    logger = logging.getLogger(program)

    logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s')
    logging.root.setLevel(level=logging.INFO)
    logger.info("running %s" % ' '.join(sys.argv))

    # check and process input arguments
    if len(sys.argv) < 3:
        print globals()['__doc__'] % locals()
        sys.exit(1)
    inp, outp = sys.argv[1:3]

    model = Word2Vec(LineSentence(inp), size=400, window=5, min_count=5,
            workers=multiprocessing.cpu_count())

    # trim unneeded model memory = use (much) less RAM
    model.init_sims(replace=True)
    model.save(outp)

执行"python train_word2vec_model.py wiki.en.text wiki.en.word2vec.model":

2015-03-07 20:11:47,796 : INFO :  running train_word2vec_model.py wiki.en.text wiki.en.word2vec.model
2015-03-07 20:11:47,801 : INFO :  collecting all words and their counts
2015-03-07 20:11:47,823 : INFO :  PROGRESS: at sentence #0, processed 0 words and 0 word types
2015-03-07 20:12:09,816 : INFO :  PROGRESS: at sentence #10000, processed 29353579 words and 430650 word types
2015-03-07 20:12:29,920 : INFO :  PROGRESS: at sentence #20000, processed 54695775 words and 610833 word types
2015-03-07 20:12:45,654 : INFO :  PROGRESS: at sentence #30000, processed 75344844 words and 742274 word types
2015-03-07 20:13:02,623 : INFO :  PROGRESS: at sentence #40000, processed 93430415 words and 859131 word types
2015-03-07 20:13:13,613 : INFO :  PROGRESS: at sentence #50000, processed 106057188 words and 935606 word types
2015-03-07 20:13:20,383 : INFO :  PROGRESS: at sentence #60000, processed 114319016 words and 952771 word types
2015-03-07 20:13:25,511 : INFO :  PROGRESS: at sentence #70000, processed 121263134 words and 969526 word types
2015-03-07 20:13:30,756 : INFO :  PROGRESS: at sentence #80000, processed 127773799 words and 984130 word types
2015-03-07 20:13:42,144 : INFO :  PROGRESS: at sentence #90000, processed 142688762 words and 1062932 word types
2015-03-07 20:13:54,513 : INFO :  PROGRESS: at sentence #100000, processed 159550824 words and 1157644 word types
......
2015-03-07 20:36:02,246 : INFO :  PROGRESS: at sentence #3700000, processed 1999452503 words and 7990138 word types
2015-03-07 20:36:04,786 : INFO :  PROGRESS: at sentence #3710000, processed 2002777270 words and 8002903 word types
2015-03-07 20:36:07,423 : INFO :  PROGRESS: at sentence #3720000, processed 2006213923 words and 8019620 word types
2015-03-07 20:36:10,115 : INFO :  PROGRESS: at sentence #3730000, processed 2009762733 words and 8035408 word types
2015-03-07 20:36:12,595 : INFO :  PROGRESS: at sentence #3740000, processed 2013066196 words and 8045218 word types
2015-03-07 20:36:15,120 : INFO :  PROGRESS: at sentence #3750000, processed 2016363087 words and 8057784 word types
2015-03-07 20:36:17,057 : INFO :  collected 8069236 word types from a corpus of 2018886604 words and 3758076 sentences
2015-03-07 20:36:22,710 : INFO :  total 1969354 word types after removing those with count<5
2015-03-07 20:36:22,710 : INFO :  constructing a huffman tree from 1969354 words
2015-03-07 20:38:20,767 : INFO :  built huffman tree with maximum node depth 29
2015-03-07 20:38:23,219 : INFO :  resetting layer weights
2015-03-07 20:39:18,277 : INFO :  training model with 4 workers on 1969354 vocabulary and 400 features, using 'skipgram'=1 'hierarchical softmax'=1 'subsample'=0 and 'negative sampling'=0
2015-03-07 20:39:33,141 : INFO :  PROGRESS: at 0.01% words, alpha 0.02500, 18766 words/s
2015-03-07 20:39:34,874 : INFO :  PROGRESS: at 0.05% words, alpha 0.02500, 56782 words/s
2015-03-07 20:39:35,886 : INFO :  PROGRESS: at 0.07% words, alpha 0.02500, 76206 words/s
2015-03-07 20:39:41,163 : INFO :  PROGRESS: at 0.08% words, alpha 0.02499, 66533 words/s
2015-03-07 20:39:43,442 : INFO :  PROGRESS: at 0.09% words, alpha 0.02500, 70345 words/s
2015-03-07 20:39:47,604 : INFO :  PROGRESS: at 0.11% words, alpha 0.02498, 77893 words/s
......
2015-03-08 02:33:26,624 : INFO :  PROGRESS: at 99.19% words, alpha 0.00020, 93814 words/s
2015-03-08 02:33:27,976 : INFO :  PROGRESS: at 99.20% words, alpha 0.00020, 93814 words/s
2015-03-08 02:33:29,097 : INFO :  PROGRESS: at 99.21% words, alpha 0.00020, 93814 words/s
2015-03-08 02:33:30,465 : INFO :  PROGRESS: at 99.21% words, alpha 0.00020, 93814 words/s
2015-03-08 02:33:31,768 : INFO :  PROGRESS: at 99.22% words, alpha 0.00020, 93813 words/s
2015-03-08 02:33:32,839 : INFO :  PROGRESS: at 99.22% words, alpha 0.00020, 93814 words/s
2015-03-08 02:33:32,839 : INFO :  PROGRESS: at 99.22% words, alpha 0.00020, 93814 words/s
2015-03-08 02:33:33,535 : INFO :  reached the end of input; waiting to finish 8 outstanding jobs
2015-03-08 02:33:33,939 : INFO :  PROGRESS: at 99.23% words, alpha 0.00019, 93815 words/s
2015-03-08 02:33:34,998 : INFO :  PROGRESS: at 99.23% words, alpha 0.00019, 93815 words/s
2015-03-08 02:33:36,127 : INFO :  PROGRESS: at 99.24% words, alpha 0.00019, 93815 words/s
2015-03-08 02:33:36,961 : INFO :  training on 1994415728 words took 21258.7s, 93817 words/s
2015-03-08 02:33:36,996 : INFO :  precomputing L2-norms of word weight vectors
2015-03-08 02:33:58,490 : INFO :  saving Word2Vec object under wiki.en.word2vec.model, separately None
2015-03-08 02:33:58,666 : INFO :  not storing attribute syn0norm
2015-03-08 02:33:58,666 : INFO :  storing numpy array 'syn0' to wiki.en.word2vec.model.syn0.npy

After about 7 hours we got the English Wikipedia model "wiki.en.word2vec.model", but something in the model looked wrong:

In [1]: import gensim

In [2]: model = gensim.models.Word2Vec.load("wiki.en.word2vec.model")

In [3]: model.most_similar("queen")
...python2.7/site-packages/gensim/models/word2vec.py:827: RuntimeWarning: invalid value encountered in divide
  self.syn0norm = (self.syn0 / sqrt((self.syn0 ** 2).sum(-1))[..., newaxis]).astype(REAL)
Out[3]:
[(u'ahsns', nan),
 (u'ny\xedl', nan),
 (u'indradeo', nan),
 (u'jaimovich', nan),
 (u'addlepate', nan),
 (u'jagello', nan),
 (u'festenburg', nan),
 (u'picatic', nan),
 (u'tolosanum', nan),
 (u'mithoo', nan)]
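Every returned similarity is nan, which suggests the stored vectors themselves are corrupted. One way to see how widespread the damage is would be to check the raw weight matrix for NaN directly; a minimal sketch, assuming the attribute names of gensim 0.10.x (syn0, index2word):

import numpy as np
import gensim

model = gensim.models.Word2Vec.load("wiki.en.word2vec.model")
bad = np.isnan(model.syn0).any(axis=1)   # rows (word vectors) containing NaN
print "%d of %d word vectors contain NaN" % (bad.sum(), model.syn0.shape[0])
print [model.index2word[i] for i in np.nonzero(bad)[0][:10]]   # a few affected words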

As gensim's author Radim Řehůřek said, I think the problem comes from the following:

Thanks h3im.
Both numbers are identical, so there’s no problem with the dictionary/input.
I had another idea — inside the cython code, the maximum sentence length is clipped to 1,000 words. Any sentence longer than that will only consider the first 1,000 words.
In your case, you’re storing entire documents as a single sentence (1 wiki doc = 1 sentence). So this restriction may be kicking in.
Can you try increasing `DEF MAX_SENTENCE_LEN = 1000` to 10k for example, in word2vec_inner.pyx?
Or, alternatively, split documents into sentences, so each sentence is < 1,000 words long. Let me know, Radim
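Radim's second suggestion is easy to try without touching the cython code: split each one-line document into chunks of at most 1,000 words before training. A minimal sketch (wiki.en.split.text is a hypothetical output name):

MAX_WORDS = 1000    # matches DEF MAX_SENTENCE_LEN in word2vec_inner.pyx

with open("wiki.en.text") as fin, open("wiki.en.split.text", "w") as fout:
    for line in fin:
        words = line.split()
        # emit one "sentence" per 1000-word chunk so nothing gets clipped
        for i in range(0, len(words), MAX_WORDS):
            fout.write(" ".join(words[i:i + MAX_WORDS]) + "\n")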
But I found that my gensim (version 0.10.3) already sets it to 10k. I then tried a small text, "wiki.en.10w", consisting of the first 100,000 lines of wiki.en.text, trained a word2vec model "wiki.en.10w.model" on it with train_word2vec_model.py, and found that everything was fine:
In [1]: import gensim

In [2]: model = gensim.models.Word2Vec.load("wiki.en.10w.model")

In [3]: model.most_similar("queen")
Out[3]:
[(u'princess', 0.5976558327674866),
 (u'elizabeth', 0.591829776763916),
 (u'consort', 0.5514105558395386),
 (u'drottningens', 0.5454206466674805),
 (u'regnant', 0.5419434309005737),
 (u'f\xf6delsedag', 0.5259706974029541),
 (u'saovabha', 0.5250850915908813),
 (u'margrethe', 0.5195728540420532),
 (u'mary', 0.5035395622253418),
 (u'armgard', 0.5028442144393921)]

In [4]: model.most_similar("man")
Out[4]:
[(u'woman', 0.6305292844772339),
 (u'boy', 0.5495858788490295),
 (u'girl', 0.5382533073425293),
 (u'bespectacled', 0.44303444027900696),
 (u'eutychus', 0.43531811237335205),
 (u'coochie', 0.42641448974609375),
 (u'soldier', 0.4228038191795349),
 (u'hater', 0.4212420582771301),
 (u'mannish', 0.4139400124549866),
 (u'bellybutton', 0.4139178991317749)]

In [5]: model.similarity("man", "woman")
Out[5]: 0.63052930788363182

In [6]: model.similarity("girl", "woman")
Out[6]: 0.59083314898425321
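(For reference, a sample like wiki.en.10w can be produced with a couple of lines; a sketch, using itertools.islice to take the first 100,000 lines:)

from itertools import islice

# copy the first 100,000 articles (one per line) into a smaller test file
with open("wiki.en.text") as fin, open("wiki.en.10w", "w") as fout:
    fout.writelines(islice(fin, 100000))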

I decided to also save the model in the original word2vec text format for debugging, so I modified train_word2vec_model.py as follows:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import logging
import os.path
import sys
import multiprocessing

from gensim.corpora import WikiCorpus
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

if __name__ == '__main__':
    program = os.path.basename(sys.argv[0])
    logger = logging.getLogger(program)

    logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s')
    logging.root.setLevel(level=logging.INFO)
    logger.info("running %s" % ' '.join(sys.argv))

    # check and process input arguments
    if len(sys.argv) < 4:
        print globals()['__doc__'] % locals()
        sys.exit(1)
    inp, outp1, outp2 = sys.argv[1:4]

    model = Word2Vec(LineSentence(inp), size=400, window=5, min_count=5,
            workers=multiprocessing.cpu_count())

    # trim unneeded model memory = use (much) less RAM
    #model.init_sims(replace=True)
    model.save(outp1)
    model.save_word2vec_format(outp2, binary=False)

Then run "python train_word2vec_model.py wiki.en.text wiki.en.text.model wiki.en.text.vector":

2015-03-09 22:48:29,588: INFO: running train_word2vec_model.py wiki.en.text wiki.en.text.model wiki.en.text.vector
2015-03-09 22:48:29,593: INFO: collecting all words and their counts
2015-03-09 22:48:29,607: INFO: PROGRESS: at sentence #0, processed 0 words and 0 word types
2015-03-09 22:48:50,686: INFO: PROGRESS: at sentence #10000, processed 29353579 words and 430650 word types
2015-03-09 22:49:08,476: INFO: PROGRESS: at sentence #20000, processed 54695775 words and 610833 word types
2015-03-09 22:49:22,985: INFO: PROGRESS: at sentence #30000, processed 75344844 words and 742274 word types
2015-03-09 22:49:35,607: INFO: PROGRESS: at sentence #40000, processed 93430415 words and 859131 word types
2015-03-09 22:49:44,125: INFO: PROGRESS: at sentence #50000, processed 106057188 words and 935606 word types
2015-03-09 22:49:49,185: INFO: PROGRESS: at sentence #60000, processed 114319016 words and 952771 word types
2015-03-09 22:49:53,316: INFO: PROGRESS: at sentence #70000, processed 121263134 words and 969526 word types
2015-03-09 22:49:57,268: INFO: PROGRESS: at sentence #80000, processed 127773799 words and 984130 word types
2015-03-09 22:50:07,593: INFO: PROGRESS: at sentence #90000, processed 142688762 words and 1062932 word types
2015-03-09 22:50:19,162: INFO: PROGRESS: at sentence #100000, processed 159550824 words and 1157644 word types
......
2015-03-09 23:11:52,977: INFO: PROGRESS: at sentence #3700000, processed 1999452503 words and 7990138 word types
2015-03-09 23:11:55,367: INFO: PROGRESS: at sentence #3710000, processed 2002777270 words and 8002903 word types
2015-03-09 23:11:57,842: INFO: PROGRESS: at sentence #3720000, processed 2006213923 words and 8019620 word types
2015-03-09 23:12:00,439: INFO: PROGRESS: at sentence #3730000, processed 2009762733 words and 8035408 word types
2015-03-09 23:12:02,793: INFO: PROGRESS: at sentence #3740000, processed 2013066196 words and 8045218 word types
2015-03-09 23:12:05,178: INFO: PROGRESS: at sentence #3750000, processed 2016363087 words and 8057784 word types
2015-03-09 23:12:07,013: INFO: collected 8069236 word types from a corpus of 2018886604 words and 3758076 sentences
2015-03-09 23:12:12,230: INFO: total 1969354 word types after removing those with count<5
2015-03-09 23:12:12,230: INFO: constructing a huffman tree from 1969354 words
2015-03-09 23:14:07,415: INFO: built huffman tree with maximum node depth 29
2015-03-09 23:14:09,790: INFO: resetting layer weights
2015-03-09 23:15:04,506: INFO: training model with 4 workers on 1969354 vocabulary and 400 features, using 'skipgram'=1 'hierarchical softmax'=1 'subsample'=0 and 'negative sampling'=0
2015-03-09 23:15:19,112: INFO: PROGRESS: at 0.01% words, alpha 0.02500, 19098 words/s
2015-03-09 23:15:20,224: INFO: PROGRESS: at 0.03% words, alpha 0.02500, 37671 words/s
2015-03-09 23:15:22,305: INFO: PROGRESS: at 0.07% words, alpha 0.02500, 75393 words/s
2015-03-09 23:15:27,712: INFO: PROGRESS: at 0.08% words, alpha 0.02499, 65618 words/s
2015-03-09 23:15:29,452: INFO: PROGRESS: at 0.09% words, alpha 0.02500, 70966 words/s
2015-03-09 23:15:34,032: INFO: PROGRESS: at 0.11% words, alpha 0.02498, 77369 words/s
2015-03-09 23:15:37,249: INFO: PROGRESS: at 0.12% words, alpha 0.02498, 74935 words/s
2015-03-09 23:15:40,618: INFO: PROGRESS: at 0.14% words, alpha 0.02498, 75399 words/s
2015-03-09 23:15:42,301: INFO: PROGRESS: at 0.16% words, alpha 0.02497, 86029 words/s
2015-03-09 23:15:46,283: INFO: PROGRESS: at 0.17% words, alpha 0.02497, 83033 words/s
2015-03-09 23:15:48,374: INFO: PROGRESS: at 0.18% words, alpha 0.02497, 83370 words/s
2015-03-09 23:15:51,398: INFO: PROGRESS: at 0.19% words, alpha 0.02496, 82794 words/s
2015-03-09 23:15:55,069: INFO: PROGRESS: at 0.21% words, alpha 0.02496, 83753 words/s
2015-03-09 23:15:57,718: INFO: PROGRESS: at 0.23% words, alpha 0.02496, 85031 words/s
2015-03-09 23:16:00,106: INFO: PROGRESS: at 0.24% words, alpha 0.02495, 86567 words/s
2015-03-09 23:16:05,523: INFO: PROGRESS: at 0.26% words, alpha 0.02495, 84850 words/s
2015-03-09 23:16:06,596: INFO: PROGRESS: at 0.27% words, alpha 0.02495, 87926 words/s
2015-03-09 23:16:09,500: INFO: PROGRESS: at 0.29% words, alpha 0.02494, 88618 words/s
2015-03-09 23:16:10,714: INFO: PROGRESS: at 0.30% words, alpha 0.02494, 91023 words/s
2015-03-09 23:16:18,467: INFO: PROGRESS: at 0.32% words, alpha 0.02494, 85960 words/s
2015-03-09 23:16:19,547: INFO: PROGRESS: at 0.33% words, alpha 0.02493, 89140 words/s
2015-03-09 23:16:23,500: INFO: PROGRESS: at 0.36% words, alpha 0.02493, 92026 words/s
2015-03-09 23:16:29,738: INFO: PROGRESS: at 0.37% words, alpha 0.02491, 88180 words/s
2015-03-09 23:16:32,000: INFO: PROGRESS: at 0.40% words, alpha 0.02492, 92734 words/s
2015-03-09 23:16:34,392: INFO: PROGRESS: at 0.42% words, alpha 0.02491, 93300 words/s
2015-03-09 23:16:41,018: INFO: PROGRESS: at 0.43% words, alpha 0.02490, 89727 words/s
.......
2015-03-10 05:03:31,849: INFO: PROGRESS: at 99.20% words, alpha 0.00020, 95350 words/s
2015-03-10 05:03:32,901: INFO: PROGRESS: at 99.21% words, alpha 0.00020, 95350 words/s
2015-03-10 05:03:34,296: INFO: PROGRESS: at 99.21% words, alpha 0.00020, 95350 words/s
2015-03-10 05:03:35,635: INFO: PROGRESS: at 99.22% words, alpha 0.00020, 95349 words/s
2015-03-10 05:03:36,730: INFO: PROGRESS: at 99.22% words, alpha 0.00020, 95350 words/s
2015-03-10 05:03:37,489: INFO: reached the end of input; waiting to finish 8 outstanding jobs
2015-03-10 05:03:37,908: INFO: PROGRESS: at 99.23% words, alpha 0.00019, 95350 words/s
2015-03-10 05:03:39,028: INFO: PROGRESS: at 99.23% words, alpha 0.00019, 95350 words/s
2015-03-10 05:03:40,127: INFO: PROGRESS: at 99.24% words, alpha 0.00019, 95350 words/s
2015-03-10 05:03:40,910: INFO: training on 1994415728 words took 20916.4s, 95352 words/s
2015-03-10 05:03:41,058: INFO: saving Word2Vec object under wiki.en.text.model, separately None
2015-03-10 05:03:41,209: INFO: not storing attribute syn0norm
2015-03-10 05:03:41,209: INFO: storing numpy array 'syn0' to wiki.en.text.model.syn0.npy
2015-03-10 05:04:35,199: INFO: storing numpy array 'syn1' to wiki.en.text.model.syn1.npy
2015-03-10 05:11:25,400: INFO: storing 1969354x400 projection weights into wiki.en.text.vector

After about 7 hours, we got the word2vec model in text format, wiki.en.text.vector:

1969354 400
the 0.129255 0.015725 0.049174 -0.016438 -0.018912 0.032752 0.079885 0.033669 -0.077722 -0.025709 0.012775 0.044153 0.134307 0.070499 -0.002243 0.105198 -0.016832 -0.028631 -0.124312 -0.123064 -0.116838 0.051181 -0.096058 -0.049734 0.017380 -0.101221 0.058945 0.013669 -0.012755 0.061053 0.061813 0.083655 -0.069382 -0.069868 0.066529 -0.037156 -0.072935 -0.009470 0.037412 -0.004406 0.047011 0.005033 -0.066270 -0.031815 0.023160 -0.080117 0.172918 0.065486 -0.072161 0.062875 0.019939 -0.048380 0.198152 -0.098525 0.023434 0.079439 0.045150 -0.079479 -0.051441 -0.021556 -0.024981 -0.045291 0.040284 -0.082500 0.014618 -0.071998 0.031887 0.043916 0.115783 -0.174898 0.086603 -0.023124 0.007293 -0.066576 -0.164817 -0.081223 0.058412 0.000132 0.064160 0.055848 0.029776 -0.103420 -0.007541 -0.031742 0.082533 -0.061760 -0.038961 0.001754 -0.023977 0.069616 0.095920 0.017136 0.067126 -0.111310 0.053632 0.017633 -0.003875 -0.005236 0.063151 0.039729 -0.039158 0.001415 0.021754 -0.012540 0.015070 -0.062636 -0.013605 -0.031770 0.005296 -0.078119 -0.069303 -0.080634 -0.058377 0.024398 -0.028173 0.026353 0.088662 0.018755 -0.113538 0.055538 -0.086012 -0.027708 -0.028788 0.017759 0.029293 0.047674 -0.106734 -0.134380 0.048605 -0.089583 0.029426 0.030552 0.141916 -0.022653 0.017204 -0.036059 0.061045 -0.000077 -0.076579 0.066747 0.060884 -0.072817 -0.051195 0.017663 0.043462 0.027486 -0.040694 0.025904 -0.075665 -0.000057 -0.076601 0.006704 -0.078985 -0.027770 -0.038087 0.097482 -0.001861 0.003741 -0.010897 0.042828 -0.037804 0.041546 -0.018394 -0.092459 0.010917 -0.004262 -0.113903 -0.037155 0.066674 0.096078 -0.114286 0.027908 -0.003139 -0.007529 -0.076928 0.025825 -0.090934 -0.013763 -0.057434 0.071827 -0.031783 -0.052096 0.107292 0.001864 -0.020808 0.043721 -0.024951 -0.046789 0.092858 0.037771 -0.006570 0.018282 -0.013571 -0.069215 0.019530 -0.080015 -0.078925 0.003094 0.044550 -0.046577 0.004945 -0.010885 -0.098681 0.044861 0.001618 -0.077582 -0.013834 0.024985 0.008541 -0.011861 0.023718 -0.018038 0.004162 -0.005827 -0.036836 0.081241 -0.028473 0.043937 0.005622 -0.004714 -0.029995 0.002236 -0.044635 -0.100051 0.006926 0.012636 -0.132891 -0.097755 -0.118586 0.038355 -0.034691 0.027983 0.074292 0.075199 0.033331 0.067474 -0.023996 0.024614 -0.039520 -0.110454 0.046004 -0.047849 0.023945 -0.022695 -0.053563 0.035277 0.011309 0.044326 0.026382 0.043251 0.004535 0.112228 0.022841 -0.068083 -0.122575 -0.053305 -0.005031 -0.078522 -0.044147 0.083576 0.005531 -0.063187 -0.032841 -0.067989 0.111359 0.125724 0.074154 0.040301 0.082240 0.015494 -0.066648 0.091087 0.095067 -0.059386 0.003256 -0.006734 -0.058248 0.020567 -0.006784 -0.017885 0.146956 -0.014679 -0.019453 -0.009875 -0.031508 0.002070 -0.002830 0.060321 0.056237 -0.080740 0.017465 0.016851 -0.067723 -0.061582 0.028104 0.067970 -0.024162 0.027407 0.075006 0.084483 -0.011534 0.129151 -0.072387 0.083424 -0.009501 0.041553 0.016603 0.002965 -0.027677 -0.110295 0.033986 0.028290 0.049621 0.001125 -0.018187 -0.001404 -0.024074 0.025322 -0.023594 -0.076071 0.107616 0.091381 -0.116943 0.109416 -0.045990 0.024346 0.152548 -0.010692 0.120887 -0.012670 -0.044978 -0.050880 -0.012535 -0.080475 0.036055 -0.050770 0.040417 -0.030957 -0.013680 0.001236 0.010180 -0.040136 -0.118249 0.017540 0.107725 -0.118492 -0.032438 -0.009072 -0.081345 -0.022384 0.045453 -0.008754 -0.098392 -0.113199 0.023589 0.017172 0.108523 -0.029611 0.041029 0.005958 0.010155 -0.036815 0.073110 -0.048424 -0.029022 -0.016711 -0.126587 0.045923 0.018589 0.113195 -0.002896 -0.051350 -0.007355 
0.012278 0.093481 0.093676 -0.145230 -0.068279 -0.068407 0.008837 -0.012186 -0.136079 0.087961 0.041402 -0.058727 0.003030 0.008455 -0.062826 -0.139834 -0.014068 -0.115521 -0.117215 0.093502 0.026607 0.095726 -0.016339 0.033879 -0.022889 0.023565 0.028705
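The layout is simple: a header line with the vocabulary size and the dimensionality (1969354 400), then one word followed by its 400 float components per line. Just to illustrate the format, here is a minimal hand-rolled reader (a sketch; gensim's load_word2vec_format below is the proper way to load it):

import numpy as np

def read_text_vectors(path, limit=10000):
    # read at most `limit` vectors from a word2vec text-format file
    vectors = {}
    with open(path) as f:
        vocab_size, dim = map(int, f.readline().split())
        for lineno, line in enumerate(f):
            if lineno >= limit:
                break
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.array(parts[1:1 + dim], dtype=np.float32)
    return vectors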

We test it in ipython as follows; note that wiki.en.text.vector is about 7 GB, so loading it takes quite a long time:
In [2]: import gensim

In [3]: model = gensim.models.Word2Vec.load_word2vec_format("wiki.en.text.vector", binary=False)

In [4]: model.most_similar("queen")
Out[4]:
[(u'princess', 0.5760838389396667),
 (u'hyoui', 0.5671186447143555),
 (u'janggyung', 0.5598698854446411),
 (u'king', 0.5556215047836304),
 (u'dollallolla', 0.5540223121643066),
 (u'loranella', 0.5522741079330444),
 (u'ramphaiphanni', 0.5310937166213989),
 (u'jeheon', 0.5298476219177246),
 (u'soheon', 0.5243583917617798),
 (u'coronation', 0.5217245221138)]

In [5]: model.most_similar("man")
Out[5]:
[(u'woman', 0.7120707035064697),
 (u'girl', 0.58659827709198),
 (u'handsome', 0.5637181997299194),
 (u'boy', 0.5425317287445068),
 (u'villager', 0.5084836483001709),
 (u'mustachioed', 0.49287813901901245),
 (u'mcgucket', 0.48355430364608765),
 (u'spider', 0.4804879426956177),
 (u'policeman', 0.4780033826828003),
 (u'stranger', 0.4750771224498749)]

In [6]: model.most_similar("woman")
Out[6]:
[(u'man', 0.7120705842971802),
 (u'girl', 0.6736541986465454),
 (u'prostitute', 0.5765659809112549),
 (u'divorcee', 0.5429972410202026),
 (u'person', 0.5276163816452026),
 (u'schoolgirl', 0.5102938413619995),
 (u'housewife', 0.48748138546943665),
 (u'lover', 0.4858251214027405),
 (u'handsome', 0.4773051142692566),
 (u'boy', 0.47445783019065857)]

In [8]: model.similarity("woman", "man")
Out[8]: 0.71207063453821218

In [10]: model.doesnt_match("breakfast cereal dinner lunch".split())
Out[10]: 'cereal'

In [11]: model.similarity("woman", "girl")
Out[11]: 0.67365416785207421

In [13]: model.most_similar("frog")
Out[13]:
[(u'toad', 0.6868536472320557),
 (u'barycragus', 0.6607867479324341),
 (u'grylio', 0.626731276512146),
 (u'heckscheri', 0.6208407878875732),
 (u'clamitans', 0.6150864362716675),
 (u'coplandi', 0.612680196762085),
 (u'pseudacris', 0.6108512878417969),
 (u'litoria', 0.6084023714065552),
 (u'raniformis', 0.6044802665710449),
 (u'watjulumensis', 0.6043726205825806)]

Everything is fine here, but when I load the numpy-format model, I still hit the "RuntimeWarning: invalid value encountered in divide" problem:

In [1]: import gensim

In [2]: model = gensim.models.Word2Vec.load("wiki.en.text.model")

In [3]: model.most_similar("man")
... RuntimeWarning: invalid value encountered in divide
  self.syn0norm = (self.syn0 / sqrt((self.syn0 ** 2).sum(-1))[..., newaxis]).astype(REAL)
Out[3]:
[(u'ahsns', nan),
 (u'ny\xedl', nan),
 (u'indradeo', nan),
 (u'jaimovich', nan),
 (u'addlepate', nan),
 (u'jagello', nan),
 (u'festenburg', nan),
 (u'picatic', nan),
 (u'tolosanum', nan),
 (u'mithoo', nan)]

If you have read this far and know how to solve the problem I ran into, please point it out. Many thanks.
