I have recently been running relation classification experiments on SemEval 2010 Task 8, mainly reimplementing the model from the paper "A neural network framework for relation extraction: Learning entity semantic and relation pattern". Unfortunately my results are about two points below those reported in the paper, which is rather disheartening, and I suspect the word vectors are to blame: the paper uses word vectors trained on an English Wikipedia corpus, so there was nothing for it but to train my own.


1. Data Preparation

First, download the latest English Wikipedia dump (wiki dumps). I downloaded it on 2017-01-17; it is a 12.7 GB compressed XML archive.

1.1 Converting to Plain Text

I had previously seen an experiment that processed the dump with process_wiki.py plus Gensim, but gensim does not split sentences by default, so after processing this way each "sentence" ends up extremely long, which may hurt the quality of the word vectors.
So I made a few changes to the source code. Gensim's WikiCorpus module lives in the following file:
python3.5/dist-packages/gensim/corpora/wikicorpus.py
Copy wikicorpus.py out of the package and modify mainly the process_article function and the WikiCorpus class, using nltk's sent_tokenize module to split articles into sentences, and change ARTICLE_MIN_WORDS to 10 (after the modification it effectively acts as the minimum sentence length). The modified process_wiki.py and wikicorpus.py are given in Attachment 1 and Attachment 2 respectively.
With that done, run python3 process_wiki.py enwiki-latest-pages-articles.xml.bz2 wiki.en.text. The result is a plain-text file wiki.en.text of about 13 GB with one sentence per line (ps: it took roughly 100 minutes on a 24-core machine). As the sample below shows, the output looks much better than before sentence splitting, but all punctuation has disappeared; I am not sure how to keep it, and suggestions are welcome (one possible approach is sketched right after the sample).

Anarchism is a political philosophy that advocates self governed societies based on voluntary institutions
These are often described as stateless societies although several authors have defined them more specifically as institutions based on non hierarchical free associations
Anarchism holds the state to be undesirable unnecessary and harmful
While anti statism is central anarchism entails opposing authority or hierarchical organisation in the conduct of all human relations including but not limited to the state system
Anarchism does not offer a fixed body of doctrine from a single particular world view instead fluxing and flowing as a philosophy
Many types and traditions of anarchism exist not all of which are mutually exclusive
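
On keeping the punctuation: gensim's utils.tokenize, which the modified tokenize() in Attachment 2 still calls, only yields alphabetic tokens, so every punctuation mark is dropped. A possible workaround (just a sketch, not something I have rerun on the full dump) is to tokenize with nltk's word_tokenize instead, which keeps punctuation marks as separate tokens; Attachment 2 already imports it:

# Hypothetical replacement for tokenize() in the modified wikicorpus.py (Attachment 2).
# nltk's word_tokenize keeps punctuation marks as separate tokens instead of discarding them.
from nltk.tokenize import word_tokenize


def tokenize(content):
    # keep tokens up to 15 characters (punctuation included) and skip
    # MediaWiki-internal tokens that start with '_'
    return [token.encode('utf8') for token in word_tokenize(content)
            if len(token) <= 15 and not token.startswith('_')]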

2. Training the Word Vectors

Once the wiki corpus has been converted to plain text, training can start with Gensim's Word2Vec module, using the script train_word2vec_model.py below: run python3 train_word2vec_model.py wiki.en.text model.bin. Training takes a long time; at the time of writing this post it was only 2.05% done.

# -*- coding: utf-8 -*-
import logging
import os.path
import sys
import multiprocessing
from gensim.models.word2vec import LineSentence
from time import time
from gensim.models import Word2Vec

if __name__ == '__main__':
    t0 = time()
    program = os.path.basename(sys.argv[0])
    logger = logging.getLogger(program)
    logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s')
    logging.root.setLevel(level=logging.INFO)
    logger.info("running %s" % ' '.join(sys.argv))

    # check and process input arguments
    if len(sys.argv) < 3:
        print(globals()['__doc__'] % locals())
        sys.exit(1)
    inp, outp = sys.argv[1:3]  # corpus path and path to save model

    model = Word2Vec(sg=0, sentences=LineSentence(inp), size=300, window=5, min_count=5,
                     workers=16, iter=35)

    # trim unneeded model memory = use (much) less RAM
    # model.init_sims(replace=True)
    model.save_word2vec_format(outp, binary=True)
    print('done in %ds!' % (time() - t0))
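
Once training is done, the binary file can be given a quick sanity check with the same old-style gensim API used above (in gensim 1.0+ these methods moved to KeyedVectors); 'king' below is just an arbitrary probe word:

# Quick sanity check of the trained vectors (old-style gensim API, matching the scripts in this post).
from gensim.models import Word2Vec

model = Word2Vec.load_word2vec_format('model.bin', binary=True)
print(model['king'].shape)                 # should print (300,)
print(model.most_similar('king', topn=5))  # nearest neighbours by cosine similarity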

3. Saving the Word Vectors in pkl Format

Loading word vectors from a txt file is painfully slow. The bin file produced by Gensim is much faster, but it still has to be loaded through the Gensim module, which is inconvenient. When I use word vectors in Python, I usually keep them in a dictionary whose keys are the words and whose values are the corresponding vectors (numpy arrays). Storing that dictionary directly in binary form in a file makes loading both more convenient and considerably faster. The pickle module does the job; the script is as follows:

# -*- encoding:utf-8 -*-
import pickle
import numpy as np
from time import time
from gensim.models import Word2Vec

if __name__ == '__main__':
    t0 = time()
    word_weights = {}
    model = Word2Vec.load_word2vec_format('model.bin', binary=True)
    for word in model.vocab:
        word_weights[word] = model[word]
    with open('model.pkl', 'wb') as file:
        pickle.dump(word_weights, file)
    print('Done in %ds!' % (time() - t0))
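
Loading the pickled dictionary later is then a single pickle.load; a minimal usage sketch (the probe word is arbitrary):

# Minimal sketch: load the pickled {word: numpy vector} dictionary and look up one word.
import pickle

with open('model.pkl', 'rb') as f:
    word_weights = pickle.load(f)

vec = word_weights.get('king')  # 300-dimensional numpy array, or None if the word is out of vocabulary
print(len(word_weights), None if vec is None else vec.shape)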

4. Summary

The main change here is to the gensim module: while processing the wiki text, nltk is used to split each article into sentences.

Attachment 1: process_wiki.py

# -*- encoding:utf-8 -*-
import logging
import os.path
import sys

from wikicorpus import WikiCorpus  # note: import the modified local copy, not gensim's


# add by ljx
def decode_text(text):
    words = []
    for w in text:
        words.append(w.decode('utf-8'))
    return words


if __name__ == '__main__':
    program = os.path.basename(sys.argv[0])
    logger = logging.getLogger(program)
    logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s')
    logging.root.setLevel(level=logging.INFO)
    logger.info("running %s" % ' '.join(sys.argv))

    # check and process input arguments
    if len(sys.argv) < 3:
        print(globals()['__doc__'] % locals())
        sys.exit(1)
    inp, outp = sys.argv[1:3]

    space = " "
    i = 0
    output = open(outp, 'w')
    wiki = WikiCorpus(inp, lemmatize=False, dictionary={})
    for text in wiki.get_texts():
        output.write(space.join(decode_text(text)) + "\n")
        i = i + 1
        if (i % 10000 == 0):
            logger.info("Saved " + str(i) + " sentences")
    output.close()
    logger.info("Finished saving " + str(i) + " sentences")

Attachment 2: wikicorpus.py

#!/usr/bin/env python
# -*- coding: utf-8 -*-
#
# Copyright (C) 2010 Radim Rehurek <radimrehurek@seznam.cz>
# Copyright (C) 2012 Lars Buitinck <larsmans@gmail.com>
# Licensed under the GNU LGPL v2.1 - http://www.gnu.org/licenses/lgpl.html

"""
Construct a corpus from a Wikipedia (or other MediaWiki-based) database dump.

If you have the `pattern` package installed, this module will use a fancy
lemmatization to get a lemma of each token (instead of plain alphabetic
tokenizer). The package is available at https://github.com/clips/pattern .

See scripts/process_wiki.py for a canned (example) script based on this
module.
"""

import bz2
import logging
import re
from xml.etree.cElementTree import iterparse  # LXML isn't faster, so let's go with the built-in solution
import multiprocessing

from gensim import utils

# cannot import whole gensim.corpora, because that imports wikicorpus...
from gensim.corpora.dictionary import Dictionary
from gensim.corpora.textcorpus import TextCorpus

from nltk.tokenize import sent_tokenize, word_tokenize

logger = logging.getLogger('gensim.corpora.wikicorpus')

# ignore articles shorter than ARTICLE_MIN_WORDS characters (after full preprocessing)
# note: in this modified version it is used as the minimum sentence length in get_texts()
ARTICLE_MIN_WORDS = 10

RE_P0 = re.compile('<!--.*?-->', re.DOTALL | re.UNICODE)  # comments
RE_P1 = re.compile('<ref([> ].*?)(</ref>|/>)', re.DOTALL | re.UNICODE)  # footnotes
RE_P2 = re.compile("(\n\[\[[a-z][a-z][\w-]*:[^:\]]+\]\])+$", re.UNICODE)  # links to languages
RE_P3 = re.compile("{{([^}{]*)}}", re.DOTALL | re.UNICODE)  # template
RE_P4 = re.compile("{{([^}]*)}}", re.DOTALL | re.UNICODE)  # template
RE_P5 = re.compile('\[(\w+):\/\/(.*?)(( (.*?))|())\]', re.UNICODE)  # remove URL, keep description
RE_P6 = re.compile("\[([^][]*)\|([^][]*)\]", re.DOTALL | re.UNICODE)  # simplify links, keep description
RE_P7 = re.compile('\n\[\[[iI]mage(.*?)(\|.*?)*\|(.*?)\]\]', re.UNICODE)  # keep description of images
RE_P8 = re.compile('\n\[\[[fF]ile(.*?)(\|.*?)*\|(.*?)\]\]', re.UNICODE)  # keep description of files
RE_P9 = re.compile('<nowiki([> ].*?)(</nowiki>|/>)', re.DOTALL | re.UNICODE)  # outside links
RE_P10 = re.compile('<math([> ].*?)(</math>|/>)', re.DOTALL | re.UNICODE)  # math content
RE_P11 = re.compile('<(.*?)>', re.DOTALL | re.UNICODE)  # all other tags
RE_P12 = re.compile('\n(({\|)|(\|-)|(\|}))(.*?)(?=\n)', re.UNICODE)  # table formatting
RE_P13 = re.compile('\n(\||\!)(.*?\|)*([^|]*?)', re.UNICODE)  # table cell formatting
RE_P14 = re.compile('\[\[Category:[^][]*\]\]', re.UNICODE)  # categories
# Remove File and Image template
RE_P15 = re.compile('\[\[([fF]ile:|[iI]mage)[^]]*(\]\])', re.UNICODE)

# MediaWiki namespaces (https://www.mediawiki.org/wiki/Manual:Namespace) that
# ought to be ignored
IGNORED_NAMESPACES = ['Wikipedia', 'Category', 'File', 'Portal', 'Template',
                      'MediaWiki', 'User', 'Help', 'Book', 'Draft',
                      'WikiProject', 'Special', 'Talk']


def filter_wiki(raw):
    """
    Filter out wiki mark-up from `raw`, leaving only text. `raw` is either unicode
    or utf-8 encoded string.
    """
    # parsing of the wiki markup is not perfect, but sufficient for our purposes
    # contributions to improving this code are welcome :)
    text = utils.to_unicode(raw, 'utf8', errors='ignore')
    text = utils.decode_htmlentities(text)  # '&amp;nbsp;' --> '\xa0'
    return remove_markup(text)


def remove_markup(text):
    text = re.sub(RE_P2, "", text)  # remove the last list (=languages)
    # the wiki markup is recursive (markup inside markup etc)
    # instead of writing a recursive grammar, here we deal with that by removing
    # markup in a loop, starting with inner-most expressions and working outwards,
    # for as long as something changes.
    text = remove_template(text)
    text = remove_file(text)
    iters = 0
    while True:
        old, iters = text, iters + 1
        text = re.sub(RE_P0, "", text)  # remove comments
        text = re.sub(RE_P1, '', text)  # remove footnotes
        text = re.sub(RE_P9, "", text)  # remove outside links
        text = re.sub(RE_P10, "", text)  # remove math content
        text = re.sub(RE_P11, "", text)  # remove all remaining tags
        text = re.sub(RE_P14, '', text)  # remove categories
        text = re.sub(RE_P5, '\\3', text)  # remove urls, keep description
        text = re.sub(RE_P6, '\\2', text)  # simplify links, keep description only
        # remove table markup
        text = text.replace('||', '\n|')  # each table cell on a separate line
        text = re.sub(RE_P12, '\n', text)  # remove formatting lines
        text = re.sub(RE_P13, '\n\\3', text)  # leave only cell content
        # remove empty mark-up
        text = text.replace('[]', '')
        if old == text or iters > 2:  # stop if nothing changed between two iterations or after a fixed number of iterations
            break

    # the following is needed to make the tokenizer see '[[socialist]]s' as a single word 'socialists'
    # TODO is this really desirable?
    text = text.replace('[', '').replace(']', '')  # promote all remaining markup to plain text
    return text


def remove_template(s):
    """Remove template wikimedia markup.

    Return a copy of `s` with all the wikimedia markup template removed. See
    http://meta.wikimedia.org/wiki/Help:Template for wikimedia templates
    details.

    Note: Since template can be nested, it is difficult remove them using
    regular expressions.
    """
    # Find the start and end position of each template by finding the opening
    # '{{' and closing '}}'
    n_open, n_close = 0, 0
    starts, ends = [], []
    in_template = False
    prev_c = None
    for i, c in enumerate(iter(s)):
        if not in_template:
            if c == '{' and c == prev_c:
                starts.append(i - 1)
                in_template = True
                n_open = 1
        if in_template:
            if c == '{':
                n_open += 1
            elif c == '}':
                n_close += 1
            if n_open == n_close:
                ends.append(i)
                in_template = False
                n_open, n_close = 0, 0
        prev_c = c

    # Remove all the templates
    s = ''.join([s[end + 1:start] for start, end in
                 zip(starts + [None], [-1] + ends)])

    return s


def remove_file(s):
    """Remove the 'File:' and 'Image:' markup, keeping the file caption.

    Return a copy of `s` with all the 'File:' and 'Image:' markup replaced by
    their corresponding captions. See http://www.mediawiki.org/wiki/Help:Images
    for the markup details.
    """
    # The regex RE_P15 match a File: or Image: markup
    for match in re.finditer(RE_P15, s):
        m = match.group(0)
        caption = m[:-2].split('|')[-1]
        s = s.replace(m, caption, 1)
    return s


def tokenize(content):  # modified
    """
    Tokenize a piece of text from wikipedia. The input string `content` is assumed
    to be mark-up free (see `filter_wiki()`).

    Return list of tokens as utf8 bytestrings. Ignore words longer than
    15 characters (not bytes!).
    """
    # TODO maybe ignore tokens with non-latin characters? (no chinese, arabic, russian etc.)
    return [token.encode('utf8') for token in utils.tokenize(content, lower=False, errors='ignore')
            if len(token) <= 15 and not token.startswith('_')]


def get_namespace(tag):
    """Returns the namespace of tag."""
    m = re.match("^{(.*?)}", tag)
    namespace = m.group(1) if m else ""
    if not namespace.startswith("http://www.mediawiki.org/xml/export-"):
        raise ValueError("%s not recognized as MediaWiki dump namespace"
                         % namespace)
    return namespace
_get_namespace = get_namespace


def extract_pages(f, filter_namespaces=False):
    """
    Extract pages from a MediaWiki database dump = open file-like object `f`.

    Return an iterable over (str, str, str) which generates (title, content, pageid) triplets.
    """
    elems = (elem for _, elem in iterparse(f, events=("end",)))

    # We can't rely on the namespace for database dumps, since it's changed
    # it every time a small modification to the format is made. So, determine
    # those from the first element we find, which will be part of the metadata,
    # and construct element paths.
    elem = next(elems)
    namespace = get_namespace(elem.tag)
    ns_mapping = {"ns": namespace}
    page_tag = "{%(ns)s}page" % ns_mapping
    text_path = "./{%(ns)s}revision/{%(ns)s}text" % ns_mapping
    title_path = "./{%(ns)s}title" % ns_mapping
    ns_path = "./{%(ns)s}ns" % ns_mapping
    pageid_path = "./{%(ns)s}id" % ns_mapping

    for elem in elems:
        if elem.tag == page_tag:
            title = elem.find(title_path).text
            text = elem.find(text_path).text

            if filter_namespaces:
                ns = elem.find(ns_path).text
                if ns not in filter_namespaces:
                    text = None

            pageid = elem.find(pageid_path).text
            yield title, text or "", pageid     # empty page will yield None

            # Prune the element tree, as per
            # http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
            # except that we don't need to prune backlinks from the parent
            # because we don't use LXML.
            # We do this only for <page>s, since we need to inspect the
            # ./revision/text element. The pages comprise the bulk of the
            # file, so in practice we prune away enough.
            elem.clear()
_extract_pages = extract_pages  # for backward compatibility


def process_article(args):  # modified
    """
    Parse a wikipedia article, returning its content as a list of tokens
    (utf8-encoded strings).
    """
    text, lemmatize, title, pageid = args
    text = filter_wiki(text)
    sentences = []
    sentences_str = sent_tokenize(text)
    for sentence_str in sentences_str:
        sentences.append(tokenize(sentence_str))
    return sentences, title, pageid


class WikiCorpus(TextCorpus):  # modified
    """
    Treat a wikipedia articles dump (\*articles.xml.bz2) as a (read-only) corpus.

    The documents are extracted on-the-fly, so that the whole (massive) dump
    can stay compressed on disk.

    >>> wiki = WikiCorpus('enwiki-20100622-pages-articles.xml.bz2') # create word->word_id mapping, takes almost 8h
    >>> MmCorpus.serialize('wiki_en_vocab200k.mm', wiki) # another 8h, creates a file in MatrixMarket format plus file with id->word

    """
    def __init__(self, fname, processes=None, lemmatize=utils.has_pattern(), dictionary=None, filter_namespaces=('0',)):
        """
        Initialize the corpus. Unless a dictionary is provided, this scans the
        corpus once, to determine its vocabulary.

        If `pattern` package is installed, use fancier shallow parsing to get
        token lemmas. Otherwise, use simple regexp tokenization. You can override
        this automatic logic by forcing the `lemmatize` parameter explicitly.
        """
        self.fname = fname
        self.filter_namespaces = filter_namespaces
        self.metadata = False
        if processes is None:
            processes = max(1, multiprocessing.cpu_count() - 1)
        self.processes = processes
        self.lemmatize = lemmatize
        if dictionary is None:
            self.dictionary = Dictionary(self.get_texts())
        else:
            self.dictionary = dictionary

    def get_texts(self):
        """
        Iterate over the dump, returning text version of each article as a list
        of tokens.

        Only articles of sufficient length are returned (short articles & redirects
        etc are ignored).

        Note that this iterates over the **texts**; if you want vectors, just use
        the standard corpus interface instead of this function::

        >>> for vec in wiki_corpus:
        >>>     print(vec)
        """
        articles, articles_all = 0, 0
        positions, positions_all = 0, 0
        texts = ((text, self.lemmatize, title, pageid) for title, text, pageid in extract_pages(bz2.BZ2File(self.fname), self.filter_namespaces))
        pool = multiprocessing.Pool(self.processes)
        # process the corpus in smaller chunks of docs, because multiprocessing.Pool
        # is dumb and would load the entire input into RAM at once...
        for group in utils.chunkize(texts, chunksize=10 * self.processes, maxsize=1):
            for sentences, title, pageid in pool.imap(process_article, group):  # chunksize=10):
                articles_all += 1
                positions_all += len(sentences)
                # article redirects and short stubs are pruned here
                if any(title.startswith(ignore + ':') for ignore in IGNORED_NAMESPACES):
                    continue
                for sentence in sentences:
                    if len(sentence) < ARTICLE_MIN_WORDS:
                        continue
                    articles += 1
                    positions += len(sentence)
                    yield sentence
        pool.terminate()

        logger.info("finished iterating over Wikipedia corpus of %i documents with %i positions"
                    " (total %i articles, %i positions before pruning articles shorter than %i words)",
                    articles, positions, articles_all, positions_all, ARTICLE_MIN_WORDS)
        self.length = articles  # cache corpus length
# endclass WikiCorpus
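
One practical note on Attachment 2: nltk's sent_tokenize relies on the Punkt sentence tokenizer models, so if they are not installed yet a one-time download is needed before running process_wiki.py:

# One-time setup for the sentence splitter used in the modified wikicorpus.py.
import nltk

nltk.download('punkt')  # fetches the Punkt sentence tokenizer models used by sent_tokenize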
