Training a gensim word2vec model on a self-collected corpus (not gensim data)

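1. Data download
2. Chinese Wikipedia
- 2.1 Getting the data
- 2.2 Processing the data
3. Tsinghua University NLP Lab dataset
4. Sogou full-web news dataset
5. Merging the files
6. Word2Vec
Summary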

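1. Data download

The English corpora are the British National Corpus (BNC; 538 MB, plus a 22 MB sample) and the American National Corpus (318 MB). The Chinese corpora are the dataset published with the Tsinghua University Natural Language Processing Lab's "efficient Chinese text classification toolkit" (1.45 GB) and the Chinese Wikipedia dump (1.96 GB). The Sogou full-web news dataset had already been downloaded and used for an earlier post.

Pitfall: downloading the British and American national corpora with Xunlei (Thunder) was very slow, and the download stalled repeatedly. After switching to Internet Download Manager and leaving the computer on overnight, the downloads finished.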

2. Chinese Wikipedia

2.1 Getting the data

This step follows two reference articles.

Error:

OSError: Invalid data stream

According to a reference article, the input should be the '.bz2' file as downloaded, before decompression.

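Error:

TypeError: sequence item 0: expected a bytes-like object, str found

The fix comes from two reference posts: the separator passed to join() has to be a str, matching the str tokens, rather than a bytes object.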

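Note: the join() method concatenates the elements of a sequence into a new string, using the given string as the separator.

# -*- coding: UTF-8 -*-
sep = "-"
seq = ("a", "b", "c")  # a sequence of strings
print(sep.join(seq))

Result: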

a-b-c
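The TypeError quoted earlier can be reproduced in a few lines. This is a minimal sketch with made-up tokens; the helper name try_join is hypothetical and only exists so that both cases can be shown side by side.

```python
tokens = ["数学", "物理"]  # made-up str tokens, standing in for one article's words


def try_join(separator, items):
    """Return the joined value, or the TypeError text if join() refuses."""
    try:
        return separator.join(items)
    except TypeError as exc:
        return "TypeError: {}".format(exc)


bytes_result = try_join(b" ", tokens)  # bytes separator with str items fails
str_result = try_join(" ", tokens)     # str separator with str items works

print(bytes_result)
print(str_result)
```

The separator and the items have to be of the same kind: either all str or all bytes.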

The code that produces the txt files is:

from gensim.corpora import WikiCorpus


def parse_corpus():
    space = ''
    i = 0
    output = open('...your path/zhwiki-latest-pages-articles.xml/zhwiki_{}.txt'.format(i), 'w', encoding='utf-8')
    # WikiCorpus is gensim's reader for Wikipedia dumps
    # (lemmatize is a gensim 3.x argument; it was removed in gensim 4)
    wiki = WikiCorpus('...your path/zhwiki-latest-pages-articles.xml.bz2', lemmatize=False, dictionary={})
    for text in wiki.get_texts():
        output.write(space.join(text) + '\n')
        i += 1
        if i % 10000 == 0:
            print('Saved' + str(i) + 'articles')
            output.close()      # close the finished file before starting the next one
            output = open('...your path/zhwiki-latest-pages-articles.xml/zhwiki_{}.txt'.format(int(i/10000)), 'w', encoding='utf-8')
    output.close()
    print('Finish Saved' + str(i) + 'articles')

……
Saved370000articles
Saved380000articles
Finish Saved384755articles

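So there are 384,755 articles in total, written 10,000 per file: 38 full files (380,000 articles) plus one last file with the remaining 4,755, which makes 39 txt files and 1.28 GB in total.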

2.2 Processing the data

Removing English characters

One pitfall: at first I used

regex = r'[a‐z]+'

which never matched as intended. It turned out the pattern should be

regex = r'[a-z]+'

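The two hyphens are different characters: the first is a typographic Unicode hyphen, the second is the ASCII hyphen-minus that defines a range inside a character class. Type the pattern by hand rather than copying it.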
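The difference is easy to check. The sketch below assumes the pasted character is U+2010 HYPHEN (other look-alike dashes behave the same way) and uses a made-up sample string.

```python
import re

text = "abc 中文 xyz"  # made-up sample

# With U+2010 the class is just the three characters 'a', U+2010 and 'z'.
bad = re.compile("[a\u2010z]+")
# With the ASCII hyphen-minus the class is the range a..z.
good = re.compile("[a-z]+")

bad_matches = bad.findall(text)
good_matches = good.findall(text)

print(bad_matches)
print(good_matches)
```

Only the ASCII version matches whole English words. The look-alike version matches nothing except isolated 'a', 'z' or hyphen characters.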

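The code for deleting English text is:

# regex = r'[a-z]+ '        # runs of the 26 lowercase letters followed by a space
# txt = re.sub(regex, '', txt)
# regex = r'[a-z]+'         # runs of the 26 lowercase letters
# txt = re.sub(regex, '', txt)

However, the text also contains Japanese and other characters. Following a reference article, and noting that the text8 file has no punctuation and the Chinese Wikipedia text has no digits, it is enough to keep only Chinese characters and the relevant whitespace.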
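A minimal sketch of this "keep only Chinese characters" filter, using the same U+4E00 to U+9FA5 range that appears in the processing code. The function name keep_chinese and the sample string are made up for illustration.

```python
import re

# Chinese characters in the range U+4E00..U+9FA5, plus newlines
# (one article per line, so line breaks must survive).
CHINESE_OR_NEWLINE = re.compile("[\u4e00-\u9fa5\n]")


def keep_chinese(text):
    """Drop everything except Chinese characters and newlines."""
    return "".join(CHINESE_OR_NEWLINE.findall(text))


sample = "abc中文かな123\n测试, ok"
cleaned = keep_chinese(sample)
print(repr(cleaned))
```

This range also covers the kanji used in Japanese, so the filter removes kana, Latin letters, digits and punctuation, but not kanji.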

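Converting Traditional Chinese to Simplified Chinese

Install the opencc library:

pip install opencc

Installation succeeded.

The conversion configurations, from a reference article:

t2s - Traditional Chinese to Simplified Chinese
s2t - Simplified Chinese to Traditional Chinese
mix2t - Mixed to Traditional Chinese
mix2s - Mixed to Simplified Chinese

Note: one problem came up during conversion. One txt file made almost no progress and was still converting after an hour. That file was therefore converted separately: each line (one article) is passed to the converter as its own input, and the results are appended to a list.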
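The line-by-line workaround can be sketched independently of opencc. Here convert_lines is a hypothetical helper and demo_convert is a two-character stand-in converter, used only so that the example runs without opencc. With opencc installed, one would pass the convert method of an opencc.OpenCC('t2s') object instead.

```python
def convert_lines(text, convert):
    """Apply `convert` to each line of `text` separately and re-join the results."""
    converted = []
    for line in text.split("\n"):
        converted.append(convert(line))
    return "\n".join(converted)


# Stand-in converter for illustration only: it maps just two characters.
demo_table = str.maketrans({"國": "国", "語": "语"})


def demo_convert(line):
    return line.translate(demo_table)


demo_result = convert_lines("中國\n國語", demo_convert)
print(demo_result)
```

Joining with "\n" instead of appending "\n" to every piece keeps the number of line breaks unchanged, so a text that ends with a newline does not gain an extra blank line.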

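Note that a list cannot be turned into a string directly with str(): the newline ends up as the literal escape sequence '\n' instead of a real line break. For example:

chars = ['我', '爱', '你', '\n']
txt = '\'' + str(chars).replace('[', '').replace(']', '').replace('\'', '').replace(', ', '') + '\''
print(txt)
print('我爱你\n')
txt = ''.join(chars)
print(txt)

Result:

'我爱你\n'
我爱你

我爱你

So, following a reference article, use .join() to turn a list into a string.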

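Chinese word segmentation

This was covered in the earlier post "[NLP] 3: the word2vec library and a worked example on the Sogou full-web news dataset", so it is not repeated here.

Stop-word list

Stop words are words that carry little meaning, so they are removed by default when extracting keywords, key phrases or key sentences. The idea of stop-word removal is to strip unwanted words and characters from the raw text.

This post uses the stop-word list from the Sichuan University Machine Intelligence Laboratory. Other lists are collected in "Python 1.2: common stop-word lists for Chinese text analysis"; for English, see the linked reference article.

How stop-word removal works:

stopwords_list = ['了']       # illustrative subset; the real list is loaded from the stop-word file
line = '牙齿 突然 又 小 了 许多 第二'
for stopword in stopwords_list:
    for item in line.split(' '):
        if item == stopword:
            line = line.replace(stopword + ' ', '')
print(line)

Result: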

牙齿 突然 又 小 许多 第二
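One caveat with the replace-based approach: str.replace works on substrings, so removing '了 ' also cuts the '了' out of a longer word such as '为了'. A token-based filter avoids this. The sketch below is an alternative, not the code used in this post; remove_stopwords and demo_stopwords are made-up names.

```python
def remove_stopwords(line, stopwords):
    """Drop every space-separated token that appears in `stopwords`."""
    kept = [token for token in line.split(" ") if token and token not in stopwords]
    return " ".join(kept)


demo_stopwords = {"了"}  # hypothetical one-word stop list for the demo

filtered = remove_stopwords("牙齿 突然 又 小 了 许多 第二", demo_stopwords)
safe = remove_stopwords("为了 你 了", demo_stopwords)
damaged = "为了 你 了".replace("了 ", "")  # the substring approach, for comparison

print(filtered)
print(safe)
print(damaged)
```

Using a set also makes each membership test constant time, instead of looping over the whole stop list for every word.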

Code:

import re
import opencc
import jieba

stopwords_table_path = '...your path/四川大学机器智能实验室停用词库.txt'
file = open(stopwords_table_path, 'r', encoding='utf-8')
stopwords_table = file.readlines()
file.close()
stopwords_list = []
for item in stopwords_table:
    stopwords_list.append(item.replace('\n', ''))        # build the stop-word list


def simplified_Chinese(txt):
    cc = opencc.OpenCC('t2s')
    # txt = cc.convert(txt)
    i = 0
    txt_sim = []
    for sentence in txt.split('\n'):
        txt_sim.append(cc.convert(sentence) + '\n')
        print("Converted line {} to Simplified Chinese".format(i))
        i += 1
    txt = ''.join(txt_sim)
    return txt


def stopwords(i):
    # stop-word removal
    in_path = '...your path/zhwiki-latest-pages-articles.xml/parse_txt/'
    file = open(in_path + 'zhwikiSegDone_{}.txt'.format(i), 'r', encoding='utf-8')
    out_path = '...your path/zhwiki-latest-pages-articles.xml/stopword_txt/'
    txt = open(out_path + 'zhwikiStopWord_{}.txt'.format(i), 'w', encoding='utf-8')
    for line in file.readlines():
        for item in line.split(' '):
            for stopword in stopwords_list:
                if item == stopword:
                    line = line.replace(stopword + ' ', '')
        txt.write(line)
    file.close()
    txt.close()
    return


def seg_done(i, txt):
    # word segmentation
    out_path = '...your path/zhwiki-latest-pages-articles.xml/parse_txt/'
    file = open(out_path + 'zhwikiSegDone_{}.txt'.format(i), 'w', encoding='utf-8')
    file.write(' '.join(jieba.cut(txt, cut_all=False)).replace(' \n ', '\n'))
    file.close()
    # segmentation and stop-word removal in one pass
    # out_path = '...your path/zhwiki-latest-pages-articles.xml/segdone_stopword_txt/'
    # file = open(out_path + 'zhwiki_prepro_{}.txt'.format(i),'w',encoding='utf-8')
    # txt = ' '.join(jieba.cut(txt, cut_all=False)).replace(' \n ', '\n')
    # print('txt file ' + str(i) + ': segmentation done')
    # for sentence in txt.split('\n'):
    #     for word in sentence.split(' '):
    #         for stopword in stopwords_list:
    #             if word == stopword:
    #                 sentence = sentence.replace(stopword+' ','')
    #     file.write(sentence+'\n')
    # file.close()
    # print('txt file ' + str(i) + ': stop-word removal done')


def parse_txt():
    in_path = '...your path/zhwiki-latest-pages-articles.xml/zhwiki/'
    for i in range(21, 22):      # in principle this should be range(0, 39), i.e. [0, 39)
        file = open(in_path + 'zhwiki_{}.txt'.format(i), 'r', encoding='utf-8')
        txt = file.read()
        file.close()
        # keep only Chinese characters and newlines
        # (the original class also contained a literal '|', which is not an "or" inside [...])
        txt = ''.join(re.findall('[\u4e00-\u9fa5\n]', txt))
        print('txt file ' + str(i) + ': Chinese characters extracted')
        txt = simplified_Chinese(txt)
        print('txt file ' + str(i) + ': Traditional converted to Simplified')
        seg_done(i, txt)


def main():
    # parse_txt()
    stopwords(21)       # file 21 is very slow when run directly; reason unknown
    return


if __name__ == '__main__':
    main()

Result:

This produced 39 files, 'zhwiki_prepro_0' through 'zhwiki_prepro_38', which completes the Chinese Wikipedia processing.

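3. Tsinghua University NLP Lab dataset

Download and extract the archive to get a folder named 'THUCNews', which contains 14 categories of data (the ones listed in NewsCatalog in the code below).

Code: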

import re
import jieba

stopwords_table_path = '...your path/四川大学机器智能实验室停用词库.txt'
file = open(stopwords_table_path, 'r', encoding='utf-8')
stopwords_table = file.readlines()
file.close()
stopwords_list = []
for item in stopwords_table:
    stopwords_list.append(item.replace('\n', ''))        # build the stop-word list

NewsCatalog = ['体育','娱乐','家居','彩票','房产','教育','时尚','时政','星座','游戏','社会','科技','股票','财经']
# NewsCatalog = ['社会','科技','股票','财经']     # i = 430801
# NewsCatalog = ['时政']

file_path = '...your path/THUCNews/THUCNews/'
i = 0       # article index; it is not reset between categories
for category in NewsCatalog:
    combine = open(file_path + '{}.txt'.format(category), 'w', encoding='utf-8')
    sentence = []
    while True:
        if i % 200 == 0:
            print('\rTexts processed: {}'.format(i), end='')
        try:
            file = open(file_path + category + '/' + '{}.txt'.format(i), 'r', encoding='utf-8')
        except FileNotFoundError:
            # no file with this index in this category: write it out and move on
            combine.write(''.join(sentence))
            combine.close()
            print(category + ' finished')
            break
        i += 1
        txt = file.read().replace('\n　　', ' ')      # one article per line
        file.close()
        txt = ''.join(re.findall('[\u4e00-\u9fa5 ]', txt))      # keep Chinese characters and spaces
        txt = ' '.join(jieba.cut(txt, cut_all=False)).replace('   ', ' ')
        for word in txt.split(' '):
            for stopword in stopwords_list:
                if word == stopword:
                    txt = txt.replace(stopword + ' ', '')
        sentence.append(txt + '\n')

Result:

……
Texts processed: 481600社会 finished
Texts processed: 644400科技 finished
Texts processed: 798800股票 finished
Texts processed: 836000财经 finished

4. Sogou full-web news dataset

Reprocess the data: keep only Chinese characters and newlines, segment, remove stop words, and write every 30,000 news articles to one txt file. The code is:

import re
import jieba

stopwords_table_path = '...your path/四川大学机器智能实验室停用词库.txt'
file = open(stopwords_table_path, 'r', encoding='utf-8')
stopwords_table = file.readlines()
file.close()
stopwords_list = []
for item in stopwords_table:
    stopwords_list.append(item.replace('\n', ''))        # build the stop-word list

path = '...your path/news_tensite_xml.full/'
file = open(path + 'news_tensite_xml.txt', 'r', encoding='gb18030')
news = []
i = 0
for txt in file:
    txt = ''.join(re.findall('[\u4e00-\u9fa5\n]', txt))      # keep Chinese characters and newlines
    txt = ' '.join(jieba.cut(txt, cut_all=False))
    for word in txt.split(' '):
        for stopword in stopwords_list:
            if word == stopword:
                txt = txt.replace(stopword + ' ', '')
    news.append(txt)
    if i > 0 and i % 30000 == 0:
        out_file = open(path + 'Sogounews_{}.txt'.format(int(i/30000-1)), 'w', encoding='utf-8')
        out_file.write(''.join(news))
        out_file.close()
        news = []
    i += 1
# note: articles left over after the last full batch are not written out
file.close()

Result:

Just over 1.14 million news articles in total, written to 38 txt files.
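The batching logic can also be written so that the final partial batch is kept. This is a sketch under assumptions, not the code that produced the 38 files above: write_in_batches and _flush are hypothetical helpers, and the demo writes seven dummy lines to a temporary directory.

```python
import os
import tempfile


def _flush(batch, out_dir, prefix, index):
    """Write one batch of lines to '<prefix>_<index>.txt' and return its path."""
    path = os.path.join(out_dir, "{}_{}.txt".format(prefix, index))
    with open(path, "w", encoding="utf-8") as out_file:
        out_file.writelines(batch)
    return path


def write_in_batches(lines, batch_size, out_dir, prefix):
    """Split `lines` into numbered files of at most `batch_size` lines each."""
    paths = []
    batch = []
    for line in lines:
        batch.append(line)
        if len(batch) == batch_size:
            paths.append(_flush(batch, out_dir, prefix, len(paths)))
            batch = []
    if batch:  # write the final, partial batch as well
        paths.append(_flush(batch, out_dir, prefix, len(paths)))
    return paths


with tempfile.TemporaryDirectory() as demo_dir:
    demo_lines = ("line {}\n".format(n) for n in range(7))
    demo_paths = write_in_batches(demo_lines, 3, demo_dir, "demo")
    with open(demo_paths[-1], encoding="utf-8") as last_file:
        last_batch = last_file.read()

print(len(demo_paths), repr(last_batch))
```

Counting by len(batch) instead of by a running index also keeps every file at exactly batch_size lines, apart from the last one.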

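5. Merging the files

Model training only needs a single txt file, so all of the corpus txt files above have to be merged into one. The principle is as follows.

File txt1:

This is a Test
You

File txt2:

This also a test
As you can see

Pay attention to spaces and newlines: no blank lines or stray spaces should be left between texts from different files. See the linked reference for how .strip works.

Code:

path = '...your path/test/'
file1 = open(path + '{}.txt'.format(1), 'r', encoding='utf-8')
file2 = open(path + '{}.txt'.format(2), 'r', encoding='utf-8')
file = open(path + '{}.txt'.format('total'), 'a', encoding='utf-8')
txt1 = file1.read().strip('\n').strip(' ')
txt2 = file2.read().strip('\n').strip(' ')
file.write(txt1 + '\n')
file.write(txt2 + '\n')
file1.close()
file2.close()
file.close()

Result: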

This is a Test
You
This also a test
As you can see
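One detail of .strip is worth checking: chaining .strip('\n').strip(' ') removes newlines first and spaces second, so it can leave newlines behind when a space is the outermost character. Passing both characters in one call handles any mixture. A small sketch with a made-up sample string:

```python
sample = " \nThis is a Test\nYou\n "  # a space on the outside, newlines just inside

chained = sample.strip("\n").strip(" ")   # the newline pass finds nothing to remove
combined = sample.strip(" \n")            # strips any mix of spaces and newlines

print(repr(chained))
print(repr(combined))
```

For files that end with a plain newline, as in the example above, both forms give the same result.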

The files produced by the processing steps above are now merged:

Chinese Wikipedia files: 39 files, 1.23 GB;
THU dataset: 14 categories, 1.86 GB;
Sogou news dataset: 38 files, 1.60 GB

Merging is really just copy and paste plus some handling of newlines and spaces. Because the Notepad that ships with Windows 10 cannot open very large files (generally anything over 1 GB), it is done in code:

NewsCatalog = ['体育','娱乐','家居','彩票','房产','教育','时尚','时政','星座','游戏','社会','科技','股票','财经']

path = '...your path/test/'
wiki_path = '...your path/zhwiki-latest-pages-articles.xml/segdone_stopword_txt/'
THUCNews_path = '...your path/THUCNews/THUCNews/'
SougouNews_path = '...your path/news_tensite_xml.full/'

Data = open(path + 'Data.txt', 'a', encoding='utf-8')

for i in range(39):      # merge the Chinese Wikipedia files
    file = open(wiki_path + 'zhwiki_prepro_{}.txt'.format(i), 'r', encoding='utf-8')
    txt = file.read().strip('\n').strip(' ')
    Data.write(txt + '\n')
    file.close()
print('Chinese Wikipedia files merged')

for item in NewsCatalog:        # merge the THU dataset
    file = open(THUCNews_path + '{}.txt'.format(item), 'r', encoding='utf-8')
    txt = file.read().strip('\n').strip(' ')
    Data.write(txt + '\n')
    file.close()
print('THU dataset merged')

for i in range(38):      # merge the Sogou news dataset
    file = open(SougouNews_path + 'Sogounews_{}.txt'.format(i), 'r', encoding='utf-8')
    txt = file.read().strip('\n').strip(' ')
    Data.write(txt + '\n')
    file.close()
print('Sogou news dataset merged')

Data.close()
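For inputs too large to read in one go, the same merge can be done line by line. The following is a sketch under assumptions rather than the code used here: merge_files is a hypothetical helper, it overwrites the output instead of appending, and it also drops blank lines inside a file, which the read-and-strip version keeps.

```python
import os
import tempfile


def merge_files(in_paths, out_path):
    """Copy the non-blank lines of each input file to out_path, one line at a time."""
    with open(out_path, "w", encoding="utf-8") as out_file:
        for in_path in in_paths:
            with open(in_path, "r", encoding="utf-8") as in_file:
                for line in in_file:
                    line = line.strip(" \n")
                    if line:
                        out_file.write(line + "\n")


# Demo with the two small example files from this section.
with tempfile.TemporaryDirectory() as demo_dir:
    demo_inputs = []
    for name, content in [("1.txt", "This is a Test\nYou\n"),
                          ("2.txt", "\nThis also a test\nAs you can see \n\n")]:
        demo_path = os.path.join(demo_dir, name)
        with open(demo_path, "w", encoding="utf-8") as demo_file:
            demo_file.write(content)
        demo_inputs.append(demo_path)
    total_path = os.path.join(demo_dir, "total.txt")
    merge_files(demo_inputs, total_path)
    with open(total_path, encoding="utf-8") as total_file:
        merged = total_file.read()

print(merged)
```

Only one line is held in memory at a time, so memory use does not grow with file size.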

The three sources add up to 4.69 GB (1.23 + 1.86 + 1.60), and the resulting 'Data.txt' is 4.70 GB, which is consistent.

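6. Word2Vec

Model training

The first attempt passed the entire txt document to the sentences parameter, which raised a MemoryError:

from gensim.models import Word2Vec

path = '...your path/test/Data.txt'
file = open(path, 'r', encoding='utf-8')
txt = file.read()       # loads the whole file into memory as one string
model = Word2Vec(sentences=txt, size=300, window=5, iter=10)

Clearly this is the wrong approach: sentences expects an iterable of token lists, not one giant string. Following some reference articles, the LineSentence class from the word2vec module should be used. The complete training code is:

from gensim.models import Word2Vec
from gensim.models import word2vec

path = '...your path/test/Data.txt'
sentences = word2vec.LineSentence(path)

# sg: chooses between the two word2vec models. 0 is CBOW, 1 is Skip-Gram; the default is 0 (CBOW).
# hs: chooses between the two training methods. 1 is Hierarchical Softmax; 0 with negative > 0 is Negative Sampling. The default is 0 (Negative Sampling).
# negative: number of negative samples when using Negative Sampling; default 5, recommended range [3, 10].
# min_count: minimum frequency for a word to get a vector; filters out rare words. Default 5; lower it for a small corpus.
# iter: number of training iterations (epochs) over the corpus; default 5. Can be increased for a large corpus.
# alpha: initial step size for stochastic gradient descent, written as η in the algorithm write-up; default 0.025.
# min_alpha: the step size decays during training, and min_alpha is its minimum value.
# size and iter are the gensim 3.x parameter names (gensim 4 renamed them vector_size and epochs).
model = Word2Vec(sentences, size=300, window=5, iter=10)
model.save('word2vec.model')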
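Why a line-based iterable avoids the MemoryError can be shown with a small stand-in. The class below is an illustrative sketch, not gensim's implementation: LineSentences is a made-up name, and the demo file is a temporary three-line file.

```python
import os
import tempfile


class LineSentences:
    """Yield one list of tokens per non-empty line, reading the file lazily."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # Re-opening the file on every iteration lets the corpus be
        # traversed several times (for example once per training epoch).
        with open(self.path, "r", encoding="utf-8") as corpus:
            for line in corpus:
                tokens = line.split()
                if tokens:
                    yield tokens


with tempfile.TemporaryDirectory() as demo_dir:
    demo_path = os.path.join(demo_dir, "corpus.txt")
    with open(demo_path, "w", encoding="utf-8") as demo_file:
        demo_file.write("清华大学 北京大学\n\n狗 猫 小狗\n")
    demo_sentences = LineSentences(demo_path)
    first_pass = list(demo_sentences)
    second_pass = list(demo_sentences)

print(first_pass)
```

Only one line is in memory at a time, and the object can be iterated repeatedly, which a plain generator cannot.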

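Model testing:

from gensim.models import Word2Vec

model = Word2Vec.load('word2vec.model')
print(model.vector_size)
print(model.accuracy)             # prints the bound method itself; it is not called here
print(model.total_train_time)
print(model.wv)
print(model.most_similar('清华大学'))      # deprecated in favour of model.wv.most_similar
print(model.most_similar('狗'))
print(model.most_similar('爱因斯坦'))
print(model.most_similar('加拿大'))

Result: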

300
<bound method Word2Vec.accuracy of <gensim.models.word2vec.Word2Vec object at 0x0000022002D12B50>>
11660.956405
<gensim.models.keyedvectors.Word2VecKeyedVectors object at 0x0000022002D120D0>
[('北京大学', 0.8560054302215576), ('清华', 0.75086909532547), ('复旦大学', 0.7291961312294006), ('武汉大学', 0.6843674182891846), ('南开大学', 0.6799485683441162), ('北大', 0.6776059865951538), ('上海交通大学', 0.6770678758621216), ('浙江大学', 0.6726593971252441), ('上海交大', 0.6696270108222961), ('人民大学', 0.6536081433296204)]
[('小狗', 0.778811514377594), ('猫', 0.6958572864532471), ('狗狗', 0.6867367029190063), ('宠物狗', 0.6718654632568359), ('流浪狗', 0.6594550013542175), ('大狗', 0.6557995676994324), ('小猫', 0.6476685404777527), ('金毛犬', 0.6255425214767456), ('狼狗', 0.6238532662391663), ('爱犬', 0.6213604807853699)]
[('海森堡', 0.6210607290267944), ('玻尔', 0.6101648807525635), ('霍金', 0.5979549288749695), ('相对论', 0.5952655076980591), ('朗道', 0.5940419435501099), ('波耳', 0.5926888585090637), ('薛定谔', 0.5919954180717468), ('泡利', 0.5903158187866211), ('劳厄', 0.5894380807876587), ('爱丁顿', 0.5840190052986145)]
[('澳大利亚', 0.799114465713501), ('澳洲', 0.7645018100738525), ('新西兰', 0.7618841528892517), ('英国', 0.7025945782661438), ('纽西兰', 0.6988958120346069), ('多伦多', 0.692410409450531), ('墨西哥', 0.6753289699554443), ('渥太华', 0.6639829277992249), ('温哥华', 0.6550958156585693), ('新加坡', 0.6529322266578674)]

Warning raised:

E:/Users/Yang SiCheng/PycharmProjects/main.py:233: DeprecationWarning: Call to deprecated `most_similar` (Method will be removed in 4.0.0, use self.wv.most_similar() instead).
  print(model.most_similar('清华大学'))
……       # likewise for 狗, 爱因斯坦 and 加拿大

Model analysis

Total training time was 11660.956405 seconds, about 3.24 hours. Next time, enable logging before training, which should make the training progress visible. The configuration is:

import logging

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

Following the previous post, the full model can be replaced by KeyedVectors, which stores only the mapping from keys (words) to vectors. The code is:

from gensim.models import Word2Vec
from gensim.models import KeyedVectors

model = Word2Vec.load('word2vec.model')
word_vectors = model.wv
word_vectors.save('vectors.kv')
# reloaded_word_vectors = KeyedVectors.load('vectors.kv')
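To make the "keys and vectors" idea concrete, here is a toy version in plain Python: a dict from word to vector, ranked by cosine similarity. The three-dimensional vectors are invented for illustration and do not come from the trained model. The functions cosine and most_similar are sketches, not gensim's code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(vectors, word, topn=10):
    """Rank every other key by cosine similarity to `word`."""
    query = vectors[word]
    scores = [(other, cosine(query, vec))
              for other, vec in vectors.items() if other != word]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:topn]


# Invented toy vectors; a real model here has 300 dimensions.
toy_vectors = {
    "狗": [1.0, 0.9, 0.0],
    "小狗": [0.9, 1.0, 0.1],
    "汽车": [0.0, 0.1, 1.0],
}

ranking = most_similar(toy_vectors, "狗")
print(ranking)
```

Nothing beyond the key-to-vector mapping is needed for similarity queries, which is why the saved KeyedVectors can stand in for the full model when no further training is planned.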

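Summary

Judging by eye, this round of training gives much better results than the earlier post "[NLP] 3: the word2vec library and a worked example on the Sogou full-web news dataset". Next steps:

How should word2vec results be evaluated? Is there a Chinese dataset, similar to the one in "[NLP] Paper translation 2: an analysis of Word2Vec models for English word semantic similarity", that can be used to measure accuracy and correlation coefficients?
Build sentence vectors from word vectors so that sentences can be compared for similarity, for example by simple concatenation of word vectors, weighted sums, convolutional neural networks, doc2vec, or the methods built into gensim word2vec for sentence similarity and multi-word similarity. The principles behind these could all be analysed and compared.
Following "[NLP] Paper translation 2: an analysis of Word2Vec models for English word semantic similarity", vary the window and size parameters (window width and vector dimensionality), and perhaps the training method, then retrain and re-analyse the model to compare how the parameters affect it.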
