I have recently been running relation classification experiments on SemEval 2010 Task 8, mainly reimplementing the model from the paper "A neural network framework for relation extraction: Learning entity semantic and relation pattern". Unfortunately my results are about two points below those reported in the paper, which is rather disheartening, and I suspect the word vectors are to blame: the paper uses word vectors trained on an English Wikipedia corpus, so there was nothing for it but to train my own.


1. Data Preparation

First, download the latest English Wikipedia dump (wiki dumps). I downloaded it on 2017-01-17; it is a 12.7 GB compressed XML archive.

1.1 Converting to Plain Text

I had previously seen an experiment that processed the dump with process_wiki.py plus Gensim, but gensim does not split sentences by default, so after processing this way each "sentence" ends up extremely long, which may hurt the quality of the word vectors.
So I made a few changes to the source code. Gensim's WikiCorpus module lives in the following file:
python3.5/dist-packages/gensim/corpora/wikicorpus.py
Copy wikicorpus.py out of the package and modify mainly the process_article function and the WikiCorpus class, using nltk's sent_tokenize module to split articles into sentences, and change ARTICLE_MIN_WORDS to 10 (after the modification it effectively acts as the minimum sentence length). The modified process_wiki.py and wikicorpus.py are given in Attachment 1 and Attachment 2 respectively.
With that done, run python3 process_wiki.py enwiki-latest-pages-articles.xml.bz2 wiki.en.text. The result is a plain-text file wiki.en.text of about 13 GB with one sentence per line (ps: it took roughly 100 minutes on a 24-core machine). As the sample below shows, the output looks much better than before sentence splitting, but all punctuation has disappeared; I am not sure how to keep it, and suggestions are welcome (one possible approach is sketched right after the sample).

Anarchism is a political philosophy that advocates self governed societies based on voluntary institutions
These are often described as stateless societies although several authors have defined them more specifically as institutions based on non hierarchical free associations
Anarchism holds the state to be undesirable unnecessary and harmful
While anti statism is central anarchism entails opposing authority or hierarchical organisation in the conduct of all human relations including but not limited to the state system
Anarchism does not offer a fixed body of doctrine from a single particular world view instead fluxing and flowing as a philosophy
Many types and traditions of anarchism exist not all of which are mutually exclusive
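
On keeping the punctuation: gensim's utils.tokenize, which the modified tokenize() in Attachment 2 still calls, only yields alphabetic tokens, so every punctuation mark is dropped. A possible workaround (just a sketch, not something I have rerun on the full dump) is to tokenize with nltk's word_tokenize instead, which keeps punctuation marks as separate tokens; Attachment 2 already imports it:

# Hypothetical replacement for tokenize() in the modified wikicorpus.py (Attachment 2).
# nltk's word_tokenize keeps punctuation marks as separate tokens instead of discarding them.
from nltk.tokenize import word_tokenize


def tokenize(content):
    # keep tokens up to 15 characters (punctuation included) and skip
    # MediaWiki-internal tokens that start with '_'
    return [token.encode('utf8') for token in word_tokenize(content)
            if len(token) <= 15 and not token.startswith('_')]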

2. Training the Word Vectors

Once the wiki corpus has been converted to plain text, training can start with Gensim's Word2Vec module, using the script train_word2vec_model.py below: run python3 train_word2vec_model.py wiki.en.text model.bin. Training takes a long time; at the time of writing this post it was only 2.05% done.

# -*- coding: utf-8 -*-
import logging
import os.path
import sys
import multiprocessing
from gensim.models.word2vec import LineSentence
from time import time
from gensim.models import Word2Vec

if __name__ == '__main__':
    t0 = time()
    program = os.path.basename(sys.argv[0])
    logger = logging.getLogger(program)
    logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s')
    logging.root.setLevel(level=logging.INFO)
    logger.info("running %s" % ' '.join(sys.argv))

    # check and process input arguments
    if len(sys.argv) < 3:
        print(globals()['__doc__'] % locals())
        sys.exit(1)
    inp, outp = sys.argv[1:3]  # corpus path and path to save model

    model = Word2Vec(sg=0, sentences=LineSentence(inp), size=300, window=5, min_count=5,
                     workers=16, iter=35)

    # trim unneeded model memory = use (much) less RAM
    # model.init_sims(replace=True)
    model.save_word2vec_format(outp, binary=True)
    print('done in %ds!' % (time() - t0))
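
Once training is done, the binary file can be given a quick sanity check with the same old-style gensim API used above (in gensim 1.0+ these methods moved to KeyedVectors); 'king' below is just an arbitrary probe word:

# Quick sanity check of the trained vectors (old-style gensim API, matching the scripts in this post).
from gensim.models import Word2Vec

model = Word2Vec.load_word2vec_format('model.bin', binary=True)
print(model['king'].shape)                 # should print (300,)
print(model.most_similar('king', topn=5))  # nearest neighbours by cosine similarity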

3. Saving the Word Vectors in pkl Format

Loading word vectors from a txt file is painfully slow. The bin file produced by Gensim is much faster, but it still has to be loaded through the Gensim module, which is inconvenient. When I use word vectors in Python, I usually keep them in a dictionary whose keys are the words and whose values are the corresponding vectors (numpy arrays). Storing that dictionary directly in binary form in a file makes loading both more convenient and considerably faster. The pickle module does the job; the script is as follows:

# -*- encoding:utf-8 -*-
import pickle
import numpy as np
from time import time
from gensim.models import Word2Vec

if __name__ == '__main__':
    t0 = time()
    word_weights = {}
    model = Word2Vec.load_word2vec_format('model.bin', binary=True)
    for word in model.vocab:
        word_weights[word] = model[word]
    with open('model.pkl', 'wb') as file:
        pickle.dump(word_weights, file)
    print('Done in %ds!' % (time() - t0))
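
Loading the pickled dictionary later is then a single pickle.load; a minimal usage sketch (the probe word is arbitrary):

# Minimal sketch: load the pickled {word: numpy vector} dictionary and look up one word.
import pickle

with open('model.pkl', 'rb') as f:
    word_weights = pickle.load(f)

vec = word_weights.get('king')  # 300-dimensional numpy array, or None if the word is out of vocabulary
print(len(word_weights), None if vec is None else vec.shape)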

4. Summary

The main change here is to the gensim module: while processing the wiki text, nltk is used to split each article into sentences.

Attachment 1: process_wiki.py

# -*- encoding:utf-8 -*-
import logging
import os.path
import sys

from wikicorpus import WikiCorpus  # note: import the modified local copy, not gensim's


# add by ljx
def decode_text(text):
    words = []
    for w in text:
        words.append(w.decode('utf-8'))
    return words


if __name__ == '__main__':
    program = os.path.basename(sys.argv[0])
    logger = logging.getLogger(program)
    logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s')
    logging.root.setLevel(level=logging.INFO)
    logger.info("running %s" % ' '.join(sys.argv))

    # check and process input arguments
    if len(sys.argv) < 3:
        print(globals()['__doc__'] % locals())
        sys.exit(1)
    inp, outp = sys.argv[1:3]

    space = " "
    i = 0
    output = open(outp, 'w')
    wiki = WikiCorpus(inp, lemmatize=False, dictionary={})
    for text in wiki.get_texts():
        output.write(space.join(decode_text(text)) + "\n")
        i = i + 1
        if (i % 10000 == 0):
            logger.info("Saved " + str(i) + " sentences")
    output.close()
    logger.info("Finished saving " + str(i) + " sentences")

Attachment 2: wikicorpus.py

#!/usr/bin/env python
# -*- coding: utf-8 -*-
#
# Copyright (C) 2010 Radim Rehurek <radimrehurek@seznam.cz>
# Copyright (C) 2012 Lars Buitinck <larsmans@gmail.com>
# Licensed under the GNU LGPL v2.1 - http://www.gnu.org/licenses/lgpl.html

"""
Construct a corpus from a Wikipedia (or other MediaWiki-based) database dump.

If you have the `pattern` package installed, this module will use a fancy
lemmatization to get a lemma of each token (instead of plain alphabetic
tokenizer). The package is available at https://github.com/clips/pattern .

See scripts/process_wiki.py for a canned (example) script based on this
module.
"""

import bz2
import logging
import re
from xml.etree.cElementTree import iterparse  # LXML isn't faster, so let's go with the built-in solution
import multiprocessing

from gensim import utils

# cannot import whole gensim.corpora, because that imports wikicorpus...
from gensim.corpora.dictionary import Dictionary
from gensim.corpora.textcorpus import TextCorpus

from nltk.tokenize import sent_tokenize, word_tokenize

logger = logging.getLogger('gensim.corpora.wikicorpus')

# ignore articles shorter than ARTICLE_MIN_WORDS characters (after full preprocessing)
# note: in this modified version it is used as the minimum sentence length in get_texts()
ARTICLE_MIN_WORDS = 10

RE_P0 = re.compile('<!--.*?-->', re.DOTALL | re.UNICODE)  # comments
RE_P1 = re.compile('<ref([> ].*?)(</ref>|/>)', re.DOTALL | re.UNICODE)  # footnotes
RE_P2 = re.compile("(\n\[\[[a-z][a-z][\w-]*:[^:\]]+\]\])+$", re.UNICODE)  # links to languages
RE_P3 = re.compile("{{([^}{]*)}}", re.DOTALL | re.UNICODE)  # template
RE_P4 = re.compile("{{([^}]*)}}", re.DOTALL | re.UNICODE)  # template
RE_P5 = re.compile('\[(\w+):\/\/(.*?)(( (.*?))|())\]', re.UNICODE)  # remove URL, keep description
RE_P6 = re.compile("\[([^][]*)\|([^][]*)\]", re.DOTALL | re.UNICODE)  # simplify links, keep description
RE_P7 = re.compile('\n\[\[[iI]mage(.*?)(\|.*?)*\|(.*?)\]\]', re.UNICODE)  # keep description of images
RE_P8 = re.compile('\n\[\[[fF]ile(.*?)(\|.*?)*\|(.*?)\]\]', re.UNICODE)  # keep description of files
RE_P9 = re.compile('<nowiki([> ].*?)(</nowiki>|/>)', re.DOTALL | re.UNICODE)  # outside links
RE_P10 = re.compile('<math([> ].*?)(</math>|/>)', re.DOTALL | re.UNICODE)  # math content
RE_P11 = re.compile('<(.*?)>', re.DOTALL | re.UNICODE)  # all other tags
RE_P12 = re.compile('\n(({\|)|(\|-)|(\|}))(.*?)(?=\n)', re.UNICODE)  # table formatting
RE_P13 = re.compile('\n(\||\!)(.*?\|)*([^|]*?)', re.UNICODE)  # table cell formatting
RE_P14 = re.compile('\[\[Category:[^][]*\]\]', re.UNICODE)  # categories
# Remove File and Image template
RE_P15 = re.compile('\[\[([fF]ile:|[iI]mage)[^]]*(\]\])', re.UNICODE)

# MediaWiki namespaces (https://www.mediawiki.org/wiki/Manual:Namespace) that
# ought to be ignored
IGNORED_NAMESPACES = ['Wikipedia', 'Category', 'File', 'Portal', 'Template',
                      'MediaWiki', 'User', 'Help', 'Book', 'Draft',
                      'WikiProject', 'Special', 'Talk']


def filter_wiki(raw):
    """
    Filter out wiki mark-up from `raw`, leaving only text. `raw` is either unicode
    or utf-8 encoded string.
    """
    # parsing of the wiki markup is not perfect, but sufficient for our purposes
    # contributions to improving this code are welcome :)
    text = utils.to_unicode(raw, 'utf8', errors='ignore')
    text = utils.decode_htmlentities(text)  # '&amp;nbsp;' --> '\xa0'
    return remove_markup(text)


def remove_markup(text):
    text = re.sub(RE_P2, "", text)  # remove the last list (=languages)
    # the wiki markup is recursive (markup inside markup etc)
    # instead of writing a recursive grammar, here we deal with that by removing
    # markup in a loop, starting with inner-most expressions and working outwards,
    # for as long as something changes.
    text = remove_template(text)
    text = remove_file(text)
    iters = 0
    while True:
        old, iters = text, iters + 1
        text = re.sub(RE_P0, "", text)  # remove comments
        text = re.sub(RE_P1, '', text)  # remove footnotes
        text = re.sub(RE_P9, "", text)  # remove outside links
        text = re.sub(RE_P10, "", text)  # remove math content
        text = re.sub(RE_P11, "", text)  # remove all remaining tags
        text = re.sub(RE_P14, '', text)  # remove categories
        text = re.sub(RE_P5, '\\3', text)  # remove urls, keep description
        text = re.sub(RE_P6, '\\2', text)  # simplify links, keep description only
        # remove table markup
        text = text.replace('||', '\n|')  # each table cell on a separate line
        text = re.sub(RE_P12, '\n', text)  # remove formatting lines
        text = re.sub(RE_P13, '\n\\3', text)  # leave only cell content
        # remove empty mark-up
        text = text.replace('[]', '')
        if old == text or iters > 2:  # stop if nothing changed between two iterations or after a fixed number of iterations
            break

    # the following is needed to make the tokenizer see '[[socialist]]s' as a single word 'socialists'
    # TODO is this really desirable?
    text = text.replace('[', '').replace(']', '')  # promote all remaining markup to plain text
    return text


def remove_template(s):
    """Remove template wikimedia markup.

    Return a copy of `s` with all the wikimedia markup template removed. See
    http://meta.wikimedia.org/wiki/Help:Template for wikimedia templates
    details.

    Note: Since template can be nested, it is difficult remove them using
    regular expressions.
    """
    # Find the start and end position of each template by finding the opening
    # '{{' and closing '}}'
    n_open, n_close = 0, 0
    starts, ends = [], []
    in_template = False
    prev_c = None
    for i, c in enumerate(iter(s)):
        if not in_template:
            if c == '{' and c == prev_c:
                starts.append(i - 1)
                in_template = True
                n_open = 1
        if in_template:
            if c == '{':
                n_open += 1
            elif c == '}':
                n_close += 1
            if n_open == n_close:
                ends.append(i)
                in_template = False
                n_open, n_close = 0, 0
        prev_c = c

    # Remove all the templates
    s = ''.join([s[end + 1:start] for start, end in
                 zip(starts + [None], [-1] + ends)])

    return s


def remove_file(s):
    """Remove the 'File:' and 'Image:' markup, keeping the file caption.

    Return a copy of `s` with all the 'File:' and 'Image:' markup replaced by
    their corresponding captions. See http://www.mediawiki.org/wiki/Help:Images
    for the markup details.
    """
    # The regex RE_P15 match a File: or Image: markup
    for match in re.finditer(RE_P15, s):
        m = match.group(0)
        caption = m[:-2].split('|')[-1]
        s = s.replace(m, caption, 1)
    return s


def tokenize(content):  # modified
    """
    Tokenize a piece of text from wikipedia. The input string `content` is assumed
    to be mark-up free (see `filter_wiki()`).

    Return list of tokens as utf8 bytestrings. Ignore words longer than
    15 characters (not bytes!).
    """
    # TODO maybe ignore tokens with non-latin characters? (no chinese, arabic, russian etc.)
    return [token.encode('utf8') for token in utils.tokenize(content, lower=False, errors='ignore')
            if len(token) <= 15 and not token.startswith('_')]


def get_namespace(tag):
    """Returns the namespace of tag."""
    m = re.match("^{(.*?)}", tag)
    namespace = m.group(1) if m else ""
    if not namespace.startswith("http://www.mediawiki.org/xml/export-"):
        raise ValueError("%s not recognized as MediaWiki dump namespace"
                         % namespace)
    return namespace
_get_namespace = get_namespace


def extract_pages(f, filter_namespaces=False):
    """
    Extract pages from a MediaWiki database dump = open file-like object `f`.

    Return an iterable over (str, str, str) which generates (title, content, pageid) triplets.
    """
    elems = (elem for _, elem in iterparse(f, events=("end",)))

    # We can't rely on the namespace for database dumps, since it's changed
    # it every time a small modification to the format is made. So, determine
    # those from the first element we find, which will be part of the metadata,
    # and construct element paths.
    elem = next(elems)
    namespace = get_namespace(elem.tag)
    ns_mapping = {"ns": namespace}
    page_tag = "{%(ns)s}page" % ns_mapping
    text_path = "./{%(ns)s}revision/{%(ns)s}text" % ns_mapping
    title_path = "./{%(ns)s}title" % ns_mapping
    ns_path = "./{%(ns)s}ns" % ns_mapping
    pageid_path = "./{%(ns)s}id" % ns_mapping

    for elem in elems:
        if elem.tag == page_tag:
            title = elem.find(title_path).text
            text = elem.find(text_path).text

            if filter_namespaces:
                ns = elem.find(ns_path).text
                if ns not in filter_namespaces:
                    text = None

            pageid = elem.find(pageid_path).text
            yield title, text or "", pageid     # empty page will yield None

            # Prune the element tree, as per
            # http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
            # except that we don't need to prune backlinks from the parent
            # because we don't use LXML.
            # We do this only for <page>s, since we need to inspect the
            # ./revision/text element. The pages comprise the bulk of the
            # file, so in practice we prune away enough.
            elem.clear()
_extract_pages = extract_pages  # for backward compatibility


def process_article(args):  # modified
    """
    Parse a wikipedia article, returning its content as a list of tokens
    (utf8-encoded strings).
    """
    text, lemmatize, title, pageid = args
    text = filter_wiki(text)
    sentences = []
    sentences_str = sent_tokenize(text)
    for sentence_str in sentences_str:
        sentences.append(tokenize(sentence_str))
    return sentences, title, pageid


class WikiCorpus(TextCorpus):  # modified
    """
    Treat a wikipedia articles dump (\*articles.xml.bz2) as a (read-only) corpus.

    The documents are extracted on-the-fly, so that the whole (massive) dump
    can stay compressed on disk.

    >>> wiki = WikiCorpus('enwiki-20100622-pages-articles.xml.bz2') # create word->word_id mapping, takes almost 8h
    >>> MmCorpus.serialize('wiki_en_vocab200k.mm', wiki) # another 8h, creates a file in MatrixMarket format plus file with id->word

    """
    def __init__(self, fname, processes=None, lemmatize=utils.has_pattern(), dictionary=None, filter_namespaces=('0',)):
        """
        Initialize the corpus. Unless a dictionary is provided, this scans the
        corpus once, to determine its vocabulary.

        If `pattern` package is installed, use fancier shallow parsing to get
        token lemmas. Otherwise, use simple regexp tokenization. You can override
        this automatic logic by forcing the `lemmatize` parameter explicitly.
        """
        self.fname = fname
        self.filter_namespaces = filter_namespaces
        self.metadata = False
        if processes is None:
            processes = max(1, multiprocessing.cpu_count() - 1)
        self.processes = processes
        self.lemmatize = lemmatize
        if dictionary is None:
            self.dictionary = Dictionary(self.get_texts())
        else:
            self.dictionary = dictionary

    def get_texts(self):
        """
        Iterate over the dump, returning text version of each article as a list
        of tokens.

        Only articles of sufficient length are returned (short articles & redirects
        etc are ignored).

        Note that this iterates over the **texts**; if you want vectors, just use
        the standard corpus interface instead of this function::

        >>> for vec in wiki_corpus:
        >>>     print(vec)
        """
        articles, articles_all = 0, 0
        positions, positions_all = 0, 0
        texts = ((text, self.lemmatize, title, pageid) for title, text, pageid in extract_pages(bz2.BZ2File(self.fname), self.filter_namespaces))
        pool = multiprocessing.Pool(self.processes)
        # process the corpus in smaller chunks of docs, because multiprocessing.Pool
        # is dumb and would load the entire input into RAM at once...
        for group in utils.chunkize(texts, chunksize=10 * self.processes, maxsize=1):
            for sentences, title, pageid in pool.imap(process_article, group):  # chunksize=10):
                articles_all += 1
                positions_all += len(sentences)
                # article redirects and short stubs are pruned here
                if any(title.startswith(ignore + ':') for ignore in IGNORED_NAMESPACES):
                    continue
                for sentence in sentences:
                    if len(sentence) < ARTICLE_MIN_WORDS:
                        continue
                    articles += 1
                    positions += len(sentence)
                    yield sentence
        pool.terminate()

        logger.info("finished iterating over Wikipedia corpus of %i documents with %i positions"
                    " (total %i articles, %i positions before pruning articles shorter than %i words)",
                    articles, positions, articles_all, positions_all, ARTICLE_MIN_WORDS)
        self.length = articles  # cache corpus length
# endclass WikiCorpus
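
One practical note on Attachment 2: nltk's sent_tokenize relies on the Punkt sentence tokenizer models, so if they are not installed yet a one-time download is needed before running process_wiki.py:

# One-time setup for the sentence splitter used in the modified wikicorpus.py.
import nltk

nltk.download('punkt')  # fetches the Punkt sentence tokenizer models used by sent_tokenize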
