I previously wrote "Word2Vec Experiments on Chinese and English Wikipedia Corpora", and quite a few readers have recently left questions under that post. Some of my recent work also touches on Word2Vec, so I did my homework again: I went back over the relevant Word2Vec material, tried out the updated gensim interfaces, and googled "wikipedia word2vec" and "维基百科 word2vec" for related English and Chinese resources. Most of what I found still follows the old route of that post: extract the Wikipedia corpus with the preprocessing module gensim provides, "gensim.corpora.WikiCorpus", store one article per line of text, and then train a word vector model with gensim's Word2Vec module. Here I want to offer another way to process the Wikipedia corpus, train a word vector model, and compute word similarity. As for Word2Vec itself, if your English is good, I recommend starting with this article: Getting started with Word2Vec.

This time we will use only the English Wikipedia corpus as the example. As before, the first step is to download the latest packaged, compressed XML dump of Wikipedia. In the listing of the latest English dumps, https://dumps.wikimedia.org/enwiki/latest/ , find and download "enwiki-latest-pages-articles.xml.bz2". This full English Wikipedia dump was packaged around April 4, 2017 and weighs in at about 13 GB; downloading it with wget on my home connection (100 Mbps China Telecom broadband) took about 3 hours, which is a decent speed.
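With wget this is simply:

wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2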

The next step is to process this compressed XML English Wikipedia corpus. This time we use WikiExtractor:

WikiExtractor.py is a Python script that extracts and cleans text from a Wikipedia database dump.

The tool is written in Python and requires Python 2.7 or Python 3.3+ but no additional library.

WikiExtractor is a Python script dedicated to extracting and cleaning text from Wikipedia dump data. It supports Python 2.7 or Python 3.3+ with no extra dependencies, so installation and usage are both very straightforward:

Install:

git clone https://github.com/attardi/wikiextractor.git

cd wikiextractor/

sudo python setup.py install

Usage:

WikiExtractor.py -o enwiki enwiki-latest-pages-articles.xml.bz2

......

INFO: 53665431 Pampapaul

INFO: 53665433 Charles Frederick Zimpel

INFO: Finished 11-process extraction of 5375019 articles in 8363.5s (642.7 art/s)


This process took over 2 hours in total and extracted roughly 5.37 million articles. For my machine configuration, see "Notes on Building a Deep Learning Machine".

The extracted files are split, in order, across a number of subdirectories:
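On disk this looks roughly as follows (AA, AB, ... are the subdirectory names WikiExtractor generates by default; listing shown for illustration):

enwiki/AA/wiki_00 ... wiki_99
enwiki/AB/wiki_00 ... wiki_99
enwiki/AC/...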

Each subdirectory in turn holds a number of files named wiki_00, wiki_01 and so on, each around 1 MB; this size can be controlled with the -b option:

-b n[KMG], --bytes n[KMG] maximum bytes per output file (default 1M)
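For example, to cut the output into larger 100 MB files instead:

WikiExtractor.py -o enwiki -b 100M enwiki-latest-pages-articles.xml.bz2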

Let's look at the actual content of wiki_00:

Anarchism

Anarchism is a political philosophy that advocates self-governed societies based on voluntary institutions. These are often described as stateless societies, although several authors have defined them more specifically as institutions based on non-hierarchical free associations. Anarchism holds the state to be undesirable, unnecessary, and harmful.

...

Criticisms of anarchism include moral criticisms and pragmatic criticisms. Anarchism is often evaluated as unfeasible or utopian by its critics.

Autism

Autism is a neurodevelopmental disorder characterized by impaired social interaction, verbal and non-verbal communication, and restricted and repetitive behavior. Parents usually notice signs in the first two years of their child's life. These signs often develop gradually, though some children with autism reach their developmental milestones at a normal pace and then regress. The diagnostic criteria require that symptoms become apparent in early childhood, typically before age three.

...

...

Each wiki_nn file in turn holds a number of docs, and every doc carries tag markup with its id, url, title and so on, which makes them easy to tell apart.
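The wrapper looks roughly like this (the id and url shown here are illustrative):

<doc id="12" url="https://en.wikipedia.org/wiki?curid=12" title="Anarchism">
Anarchism

Anarchism is a political philosophy that advocates self-governed societies ...
</doc>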

Here we process the English Wikipedia data in the "memory-friendly iterator" style from the word2vec tutorial provided by gensim's author. The code is as follows and has also been pushed to github: train_word2vec_with_gensim.py

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Author: Pan Yang (panyangnlp@gmail.com)
# Copyright 2017 @ Yu Zhen

import gensim
import logging
import multiprocessing
import os
import re
import sys

from pattern.en import tokenize
from time import time

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',
                    level=logging.INFO)


def cleanhtml(raw_html):
    # Strip any leftover HTML/XML tags (e.g. the <doc> wrappers)
    cleanr = re.compile('<.*?>')
    cleantext = re.sub(cleanr, ' ', raw_html)
    return cleantext


class MySentences(object):
    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        # Memory-friendly iterator: walk every extracted file and yield one
        # cleaned, tokenized, lowercased, alphabetic-only sentence at a time
        for root, dirs, files in os.walk(self.dirname):
            for filename in files:
                file_path = root + '/' + filename
                for line in open(file_path):
                    sline = line.strip()
                    if sline == "":
                        continue
                    rline = cleanhtml(sline)
                    tokenized_line = ' '.join(tokenize(rline))
                    is_alpha_word_line = [word for word in
                                          tokenized_line.lower().split()
                                          if word.isalpha()]
                    yield is_alpha_word_line


if __name__ == '__main__':
    if len(sys.argv) != 2:
        print "Please use python train_word2vec_with_gensim.py data_path"
        exit()
    data_path = sys.argv[1]
    begin = time()

    sentences = MySentences(data_path)
    model = gensim.models.Word2Vec(sentences,
                                   size=200,
                                   window=10,
                                   min_count=10,
                                   workers=multiprocessing.cpu_count())
    model.save("data/model/word2vec_gensim")
    model.wv.save_word2vec_format("data/model/word2vec_org",
                                  "data/model/vocabulary",
                                  binary=False)

    end = time()
    print "Total procesing time: %d seconds" % (end - begin)


Note that word tokenization uses the English tokenize module from pattern; you could also use nltk's word_tokenize module with a small modification (see the sketch below), though nltk does not handle the word tokenization of some sentence-final tokens quite as well. We set the word vector dimensionality to 200, the window size to 10 and the minimum count to 10, and filter out punctuation and non-English tokens via isalpha().
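A minimal sketch of the nltk variant (assuming nltk is installed together with its punkt tokenizer models; line_to_words is just an illustrative helper, not part of the script above):

from nltk.tokenize import word_tokenize  # needs a one-off nltk.download('punkt')

def line_to_words(line):
    # Tokenize with nltk, lowercase, and keep alphabetic tokens only,
    # mirroring what MySentences does with pattern.en above
    return [w for w in word_tokenize(line.lower()) if w.isalpha()]

print(line_to_words("Anarchism is a political philosophy."))
# ['anarchism', 'is', 'a', 'political', 'philosophy']

Now we can use this script to train a Word2Vec model on English Wikipedia: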

python train_word2vec_with_gensim.py enwiki

2017-04-22 14:31:04,703 : INFO : collecting all words and their counts

2017-04-22 14:31:04,704 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types

2017-04-22 14:31:06,442 : INFO : PROGRESS: at sentence #10000, processed 480546 words, keeping 33925 word types

2017-04-22 14:31:08,104 : INFO : PROGRESS: at sentence #20000, processed 983240 words, keeping 51765 word types

2017-04-22 14:31:09,685 : INFO : PROGRESS: at sentence #30000, processed 1455218 words, keeping 64982 word types

2017-04-22 14:31:11,349 : INFO : PROGRESS: at sentence #40000, processed 1957479 words, keeping 76112 word types

......

2017-04-23 02:50:59,844 : INFO : worker thread finished; awaiting finish of 2 more threads

2017-04-23 02:50:59,844 : INFO : worker thread finished; awaiting finish of 1 more threads

2017-04-23 02:50:59,854 : INFO : worker thread finished; awaiting finish of 0 more threads

2017-04-23 02:50:59,854 : INFO : training on 8903084745 raw words (6742578791 effective words) took 37805.2s, 178351 effective words/s

2017-04-23 02:50:59,855 : INFO : saving Word2Vec object under data/model/word2vec_gensim, separately None

2017-04-23 02:50:59,855 : INFO : not storing attribute syn0norm

2017-04-23 02:50:59,855 : INFO : storing np array 'syn0' to data/model/word2vec_gensim.wv.syn0.npy

2017-04-23 02:51:00,241 : INFO : storing np array 'syn1neg' to data/model/word2vec_gensim.syn1neg.npy

2017-04-23 02:51:00,574 : INFO : not storing attribute cum_table

2017-04-23 02:51:13,886 : INFO : saved data/model/word2vec_gensim

2017-04-23 02:51:13,886 : INFO : storing vocabulary in data/model/vocabulary

2017-04-23 02:51:17,480 : INFO : storing 868777x200 projection weights into data/model/word2vec_org

Total procesing time: 44476 seconds


The training took a little over 12 hours in total, and the trained files are stored under data/model:
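Going by the log above, a listing of that directory should look something like this:

$ ls data/model
vocabulary  word2vec_gensim  word2vec_gensim.syn1neg.npy  word2vec_gensim.wv.syn0.npy  word2vec_org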

Let's test this English Wikipedia Word2Vec model:

textminer@textminer:/opt/wiki/data$ ipython

Python 2.7.12 (default, Nov 19 2016, 06:48:10)

Type "copyright", "credits" or "license" for more information.

IPython 2.4.1 -- An enhanced Interactive Python.

? -> Introduction and overview of IPython's features.

%quickref -> Quick reference.

help -> Python's own help system.

object? -> Details about 'object', use 'object??' for extra details.

In [1]: from gensim.models import Word2Vec

In [2]: en_wiki_word2vec_model = Word2Vec.load('data/model/word2vec_gensim')


First, let's test the similar words (word similarity) for a few query terms:

In [3]: en_wiki_word2vec_model.most_similar('word')

Out[3]:

[('phrase', 0.8129693269729614),

('meaning', 0.7311851978302002),

('words', 0.7010501623153687),

('adjective', 0.6805518865585327),

('noun', 0.6461974382400513),

('suffix', 0.6440576314926147),

('verb', 0.6319557428359985),

('loanword', 0.6262609958648682),

('proverb', 0.6240501403808594),

('pronunciation', 0.6105246543884277)]


In [4]: en_wiki_word2vec_model.most_similar('similarity')

Out[4]:

[('similarities', 0.8517599701881409),

('resemblance', 0.786037266254425),

('resemblances', 0.7496883869171143),

('affinities', 0.6571112275123596),

('differences', 0.6465682983398438),

('dissimilarities', 0.6212711930274963),

('correlation', 0.6071442365646362),

('dissimilarity', 0.6062943935394287),

('variation', 0.5970577001571655),

('difference', 0.5928016901016235)]


In [5]: en_wiki_word2vec_model.most_similar('nlp')

Out[5]:

[('neurolinguistic', 0.6698148250579834),

('psycholinguistic', 0.6388964056968689),

('connectionism', 0.6027182936668396),

('semantics', 0.5866401195526123),

('connectionist', 0.5865628719329834),

('bandler', 0.5837364196777344),

('phonics', 0.5733655691146851),

('psycholinguistics', 0.5613113641738892),

('bootstrapping', 0.559638261795044),

('psychometrics', 0.5555593967437744)]


In [6]: en_wiki_word2vec_model.most_similar('learn')

Out[6]:

[('teach', 0.7533557415008545),

('understand', 0.71148681640625),

('discover', 0.6749690771102905),

('learned', 0.6599283218383789),

('realize', 0.6390970349311829),

('find', 0.6308424472808838),

('know', 0.6171890497207642),

('tell', 0.6146825551986694),

('inform', 0.6008728742599487),

('instruct', 0.5998791456222534)]


In [7]: en_wiki_word2vec_model.most_similar('man')

Out[7]:

[('woman', 0.7243080735206604),

('boy', 0.7029494047164917),

('girl', 0.6441491842269897),

('stranger', 0.63275545835495),

('drunkard', 0.6136815547943115),

('gentleman', 0.6122575998306274),

('lover', 0.6108279228210449),

('thief', 0.609005331993103),

('beggar', 0.6083744764328003),

('person', 0.597919225692749)]


Now let's look at a few of the other related APIs:

In [8]: en_wiki_word2vec_model.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)

Out[8]: [('queen', 0.7752252817153931)]

In [9]: en_wiki_word2vec_model.similarity('woman', 'man')

Out[9]: 0.72430799548282099

In [10]: en_wiki_word2vec_model.doesnt_match("breakfast cereal dinner lunch".split())

Out[10]: 'cereal'

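One more note: since we also saved the vectors in the original word2vec text format (word2vec_org, plus the vocabulary file), they can be reloaded without the full model. A minimal sketch, assuming the gensim version used here already ships KeyedVectors:

from gensim.models import KeyedVectors

# Load only the text-format vectors written by save_word2vec_format above;
# this skips the training state that Word2Vec.load() restores
word_vectors = KeyedVectors.load_word2vec_format('data/model/word2vec_org',
                                                 binary=False)
print(word_vectors.most_similar('word', topn=3))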

I have tidied up the code from this post together with the code from "Word2Vec Experiments on Chinese and English Wikipedia Corpora" and set up a Wikipedia_Word2vec project on github; interested readers can take a look there.
