Training your own word vectors with word2vec (code, comments, training process, and results)

Download links for the MSRP, SICK, and STS datasets

Baidu Cloud: https://pan.baidu.com/s/1sqlCc702owp_T6KjyNT6Yw

Extraction code: 66nb

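To run: msr_train.zip on the network drive is msr_train.txt after preprocessing, ready to train on directly. Train on it with the word2vec.py code, and change the file paths to match your own setup.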

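Preprocessing: import the txt file into an Excel spreadsheet, delete everything except the text, save the sheet as a UTF-8 encoded .csv file, then compress that file into a .zip archive.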
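The same preprocessing could also be scripted rather than done by hand in Excel. The sketch below is only an illustration, not the author's procedure: it assumes a tab-separated input whose first line is a header and whose sentence text sits in the columns given by `text_columns`. The helper name `build_corpus_zip`, the column indices, and the archive member name are all hypothetical.

```python
import zipfile


def build_corpus_zip(lines, text_columns, zip_path, inner_name="corpus.csv"):
    """Keep only the text columns of tab-separated lines and zip the result.

    lines        -- iterable of raw lines; the first one is assumed to be a header
    text_columns -- indices of the columns that hold sentence text (an assumption)
    """
    kept = []
    for line_number, line in enumerate(lines):
        if line_number == 0:
            continue  # skip the assumed header row
        fields = line.rstrip("\r\n").split("\t")
        for column in text_columns:
            if column < len(fields) and fields[column].strip():
                kept.append(fields[column].strip())
    text = "\n".join(kept) + "\n"
    # Store the text as a single UTF-8 encoded member of the archive.
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr(inner_name, text.encode("utf-8"))
    return len(kept)
```

Called with the lines of an open text file, this writes an archive with a single member, which is the layout the training script expects when it reads the first file in the zip and splits it on whitespace.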

The word2vec code (works for both Chinese and English); the copy on the network drive has been updated.

import collections
import math
import random
import zipfile

import numpy as np
import tensorflow as tf  # written against the TensorFlow 1.x API (tf.placeholder, tf.Session)
from six.moves import xrange


def read_data(filename):
    # Read the first file in the zip archive and split it on whitespace.
    with zipfile.ZipFile(filename) as f:
        data = tf.compat.as_str(f.read(f.namelist()[0])).split()
    return data


# 1. Path to the training corpus (strip any annotations; keep only the segmented words).
words = read_data('data/msr_train.zip')
print('Data size', len(words))

# 2. Vocabulary size of the output word vectors.
vocabulary_size = 8000


def build_dataset(words, vocabulary_size):
    # Index 0 is reserved for 'UNK'; the remaining ids go to the most frequent words.
    count = [['UNK', -1]]
    count.extend(collections.Counter(words).most_common(vocabulary_size - 1))
    dictionary = dict()
    for word, _ in count:
        dictionary[word] = len(dictionary)
    data = list()
    unk_count = 0
    for word in words:
        if word in dictionary:
            index = dictionary[word]
        else:
            index = 0  # dictionary['UNK']
            unk_count += 1
        data.append(index)
    count[0][1] = unk_count
    reverse_dictionary = dict(zip(dictionary.values(), dictionary.keys()))
    return data, count, dictionary, reverse_dictionary


data, count, dictionary, reverse_dictionary = build_dataset(words, vocabulary_size)
# Drop the reference to words to free memory.
del words

# ****************************** Training starts ******************************
data_index = 0


# Step 3: Function to generate a training batch for the skip-gram model.
def generate_batch(batch_size, num_skips, skip_window):
    global data_index
    assert batch_size % num_skips == 0
    assert num_skips <= 2 * skip_window
    batch = np.ndarray(shape=(batch_size), dtype=np.int32)
    labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32)
    span = 2 * skip_window + 1  # [ skip_window target skip_window ]
    buffer = collections.deque(maxlen=span)
    for _ in range(span):
        buffer.append(data[data_index])
        data_index = (data_index + 1) % len(data)
    # Fill batch and labels.
    for i in range(batch_size // num_skips):
        target = skip_window  # target label at the center of the buffer
        targets_to_avoid = [skip_window]
        # With num_skips = 2, each center word is paired with two context words.
        for j in range(num_skips):
            while target in targets_to_avoid:
                # Either the preceding or the following word may be picked first.
                target = random.randint(0, span - 1)
            targets_to_avoid.append(target)
            batch[i * num_skips + j] = buffer[skip_window]
            labels[i * num_skips + j, 0] = buffer[target]
        buffer.append(data[data_index])
        data_index = (data_index + 1) % len(data)
    # Backtrack a little bit to avoid skipping words in the end of a batch.
    # Step back span words (3 here): after a batch, data_index has moved span positions too far.
    data_index = (data_index + len(data) - span) % len(data)
    return batch, labels


batch_size = 128
# 3. Dimension of the embedding vector.
embedding_size = 128
skip_window = 1  # How many words to consider left and right.
num_skips = 2  # How many times to reuse an input to generate a label.
valid_size = 16  # Random set of words to evaluate similarity on.
valid_window = 100  # Only pick dev samples in the head of the distribution.
# Draw 16 distinct integers from 0-99 (sampling without replacement).
valid_examples = np.random.choice(valid_window, valid_size, replace=False)
num_sampled = 64  # Number of negative examples to sample.

# Step 4: Build and train a skip-gram model.
graph = tf.Graph()
with graph.as_default():
    # Input data.
    with tf.variable_scope('input'):
        train_inputs = tf.placeholder(tf.int32, shape=[batch_size], name='train_inputs')
        train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1], name='train_labels')
        valid_dataset = tf.constant(valid_examples, dtype=tf.int32)

    # Ops and variables pinned to the CPU because of missing GPU implementation
    #     with tf.device('/cpu:0'):
    # Embedding matrix: one row per vocabulary word, one column per embedding dimension.
    with tf.variable_scope('embedding'):
        embeddings = tf.Variable(
            tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0),
            name='embedding')
        # embedding_lookup(params, ids) returns the rows of params selected by ids, in order.
        # For example, ids=[1, 7, 4] returns a tensor made of rows 1, 7 and 4 of params.
        # Each step looks up only the word ids in the current batch, not the whole vocabulary.
        embed = tf.nn.embedding_lookup(embeddings, train_inputs)

    with tf.variable_scope('net'):
        # Construct the variables for the noise-contrastive estimation (NCE) loss.
        nce_weights = tf.Variable(
            tf.truncated_normal([vocabulary_size, embedding_size],
                                stddev=1.0 / math.sqrt(embedding_size)))
        nce_biases = tf.Variable(tf.zeros([vocabulary_size]))

    # Compute the average NCE loss for the batch.
    with tf.variable_scope('loss'):
        loss = tf.reduce_mean(
            tf.nn.nce_loss(weights=nce_weights,
                           biases=nce_biases,
                           labels=train_labels,
                           inputs=embed,
                           num_sampled=num_sampled,
                           num_classes=vocabulary_size),
            name='loss')
    tf.summary.scalar('nce_loss', loss)

    # Construct the SGD optimizer using a learning rate of 1.0.
    optimizer = tf.train.GradientDescentOptimizer(1.0).minimize(loss)

    # Compute the cosine similarity between the validation words and all embeddings.
    norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keep_dims=True))
    normalized_embeddings = embeddings / norm
    # Look up a few frequent words to test cosine similarity.
    # An input id of 64 selects row 64 of normalized_embeddings.
    valid_embeddings = tf.nn.embedding_lookup(normalized_embeddings, valid_dataset)
    # [valid_size, embedding_size] x [embedding_size, vocabulary_size]
    #   = [valid_size, vocabulary_size]
    similarity = tf.matmul(valid_embeddings, normalized_embeddings, transpose_b=True)

    # Add variable initializer.
    init = tf.global_variables_initializer()

# Number of training steps (100000-200000 is a sensible range).
# num_steps = 100000
# A small value, only for checking that the code runs.
num_steps = 2000
final_embeddings = []

# Step 5: Start training in a session.
with tf.Session(graph=graph) as session:
    print("Session started")
    merge = tf.summary.merge_all()
    init.run()
    train_writer = tf.summary.FileWriter('log')
    average_loss = 0
    for step in xrange(num_steps):
        batch_inputs, batch_labels = generate_batch(batch_size, num_skips, skip_window)
        feed_dict = {train_inputs: batch_inputs, train_labels: batch_labels}
        # We perform one update step by evaluating the optimizer op (including it
        # in the list of returned values for session.run()).
        _, loss_val, summary_train = session.run([optimizer, loss, merge], feed_dict=feed_dict)
        average_loss += loss_val
        train_writer.add_summary(summary_train, step)
        # print("batch_inputs:%s  batch_labels:%s" % (batch_inputs, batch_labels))
        # batch_inputs holds the center-word ids, each repeated num_skips times;
        # batch_labels holds the matching context-word ids, shape [batch_size, 1].

        # Every 2000 steps, print the average loss.
        if step % 2000 == 0:
            if step > 0:
                average_loss /= 2000
            # The average loss is an estimate of the loss over the last 2000 batches.
            print("Average loss at step ", step, ": ", average_loss)
            average_loss = 0

        # Every 2000 steps, print the nearest neighbors of each validation word.
        if step % 2000 == 0:
            sim = similarity.eval()
            for i in xrange(valid_size):
                # Map the id back to its word.
                valid_word = reverse_dictionary[valid_examples[i]]
                top_k = 8  # number of nearest neighbors
                # Sort by descending similarity, skip the word itself, keep the top_k.
                nearest = (-sim[i, :]).argsort()[1:top_k + 1]
                log_str = "Nearest to %s:" % valid_word
                for k in xrange(top_k):
                    close_word = reverse_dictionary[nearest[k]]
                    log_str = "%s %s," % (log_str, close_word)
                print(log_str)

    # Embedding matrix for the whole vocabulary once training has finished.
    final_embeddings = normalized_embeddings.eval()
    # Write the graph to the log directory.
    writer = tf.summary.FileWriter("log", session.graph)

# 4. Path of the output word-vector file (the word_embeddings directory must already exist).
with open('word_embeddings/msr_embeddings', 'w', encoding='utf-8') as e:
    e.write(str(vocabulary_size) + " " + str(embedding_size) + '\n')
    for index in range(len(final_embeddings)):
        embedding_list = final_embeddings[index].tolist()
        embedding_str = " ".join('%s' % value for value in embedding_list)
        e.write(str(reverse_dictionary[index]) + " " + embedding_str + '\n')


# Step 6: Visualize the embeddings (reduce to two dimensions and plot).
# filename is the output path of the 2-D plot.
def plot_with_labels(low_dim_embs, labels, filename='word_embeddings/msr_embeddings.png'):
    assert low_dim_embs.shape[0] >= len(labels), "More labels than embeddings"
    # Set the figure size.
    plt.figure(figsize=(15, 15))  # in inches
    for i, label in enumerate(labels):
        x, y = low_dim_embs[i, :]
        plt.scatter(x, y)
        plt.annotate(label,
                     xy=(x, y),
                     xytext=(5, 2),
                     textcoords='offset points',
                     fontproperties='SimHei',
                     fontsize=14,
                     ha='right',
                     va='bottom')
    plt.savefig(filename)


try:
    from sklearn.manifold import TSNE
    import matplotlib.pyplot as plt

    # On macOS use method='exact'.
    tsne = TSNE(perplexity=30, n_components=2, init='pca', n_iter=5000, method='exact')
    # Plot 300 points.
    plot_only = 300
    # Row i of final_embeddings is the vector of the word reverse_dictionary[i].
    low_dim_embs = tsne.fit_transform(final_embeddings[:plot_only, :])
    labels = [reverse_dictionary[i] for i in xrange(plot_only)]
    plot_with_labels(low_dim_embs, labels)
except ImportError:
    print("Please install sklearn, matplotlib, and scipy to visualize embeddings.")
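To make the batch-generation step easier to follow, here is a small standalone sketch (not part of the script) that lists every (center, context) pair for a toy sentence. Unlike `generate_batch`, which samples `num_skips` context positions at random from each window, the hypothetical helper `skipgram_pairs` enumerates the whole window deterministically. With `skip_window = 1` and `num_skips = 2` both approaches yield the same pairs for interior words, possibly in a different order.

```python
def skipgram_pairs(token_ids, skip_window):
    """Return all (center, context) pairs within skip_window of each position."""
    pairs = []
    for center_pos, center in enumerate(token_ids):
        start = max(0, center_pos - skip_window)
        stop = min(len(token_ids), center_pos + skip_window + 1)
        for context_pos in range(start, stop):
            if context_pos != center_pos:
                pairs.append((center, token_ids[context_pos]))
    return pairs


# Toy sentence; the script works on integer word ids, strings are used here for readability.
sentence = ["the", "quick", "brown", "fox"]
pairs = skipgram_pairs(sentence, skip_window=1)
for center, context in pairs:
    print(center, "->", context)
```

In the script the center words end up in `batch` and the context words in `labels`, and the NCE loss trains the model to tell each true context word apart from `num_sampled` randomly drawn ones.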

Training process:

Training results:

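The result is a 128-dimensional vector for each of 7000 words (the number of output words, the dimension, and the number of training steps can all be set as needed; note that the listing above sets vocabulary_size = 8000, which yields 8000 words).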
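The script saves the vectors as plain text: a header line with the vocabulary size and the dimension, then one line per word with the word followed by its components. A minimal reader for that layout might look like the sketch below; the function name `load_embeddings` is hypothetical.

```python
def load_embeddings(path):
    """Read a 'count dim' header plus 'word v1 ... v_dim' lines into a dict."""
    vectors = {}
    with open(path, encoding="utf-8") as handle:
        header = handle.readline().split()
        dim = int(header[1])
        for line in handle:
            parts = line.rstrip("\n").split(" ")
            if len(parts) < dim + 1:
                continue  # skip blank or malformed lines
            # The last dim fields are numbers; whatever precedes them is the word.
            word = " ".join(parts[:-dim])
            vectors[word] = [float(value) for value in parts[-dim:]]
    return vectors, dim
```

For the file written by the script, the call would be `vectors, dim = load_embeddings('word_embeddings/msr_embeddings')`, after which `vectors['UNK']` holds the row with id 0.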

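The distances between word vectors in the high-dimensional space are flattened onto a two-dimensional plane, so that words plotted close together are semantically similar.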
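In the script, closeness between words is measured with cosine similarity on length-normalized rows. The sketch below applies the same measure to hand-made 3-dimensional vectors; the words and numbers are invented for illustration and are not training output, and the helper names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_words(query, vectors, top_k):
    """Rank every other word by descending cosine similarity to query."""
    scores = [(cosine_similarity(vectors[query], vector), word)
              for word, vector in vectors.items() if word != query]
    scores.sort(reverse=True)
    return [word for _, word in scores[:top_k]]


# Invented toy vectors, not real embeddings.
toy_vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.8, 0.9, 0.2],
    "apple": [0.1, 0.2, 0.9],
}
print(nearest_words("king", toy_vectors, top_k=2))
```

With these toy numbers "queen" ranks ahead of "apple" for the query "king", because its vector points in almost the same direction.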

More training steps give somewhat better results, at the cost of longer training time.

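To improve the quality of the word embeddings, you can also use pre-trained word vectors; see this blog post for details:

Downloads of pre-trained word vectors: Chinese Wikipedia and Stanford GloVe (English)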

https://blog.csdn.net/sinat_41144773/article/details/89875130
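GloVe text files contain one word per line followed by its components and, unlike the file written by the script above, have no header line. Below is a minimal sketch of turning such lines into an embedding matrix aligned with a word-to-id `dictionary` like the one `build_dataset` returns. The helper name `build_embedding_matrix`, the random initialization range for missing words, and the assumption that ids run from 0 to `len(dictionary) - 1` are mine, not taken from the linked post.

```python
import random


def build_embedding_matrix(pretrained_lines, dictionary, dim, seed=0):
    """Return one row per id in dictionary, copied from pretrained_lines when available."""
    rng = random.Random(seed)
    # Start from small random vectors so words missing from the file still get a row.
    matrix = [[rng.uniform(-0.1, 0.1) for _ in range(dim)]
              for _ in range(len(dictionary))]
    found = 0
    for line in pretrained_lines:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip lines that do not match the expected dimension
        word = parts[0]
        if word in dictionary:
            matrix[dictionary[word]] = [float(value) for value in parts[1:]]
            found += 1
    return matrix, found
```

`pretrained_lines` can be an open file handle, so a large vector file is read one line at a time. The returned `found` count shows how much of the vocabulary the pre-trained file actually covers.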

The end.
