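How to Train Word Vectors with TensorFlow

Preface

The earlier article "谈谈谷歌word2vec的原理" (on the principles behind Google's word2vec) covered where word2vec comes from and how it works. This article builds on those principles and uses TensorFlow to train word vectors, choosing the skip-gram model.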

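Preparing the corpus

The corpus here consists only of real-estate news articles collected from the web, all concatenated into a single corpus file.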

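A brief look at skip-gram

The core idea of skip-gram is easiest to see with an example. Suppose the window size is 2 and the text is "The quick brown fox jumps over the lazy dog." Training samples are produced as the window slides over the text. At the start we get two samples, (the, quick) and (the, brown). After moving one step to the right, the samples are (quick, the), (quick, brown) and (quick, fox). One more step gives (brown, the), (brown, quick), (brown, fox) and (brown, jumps), and so on as the window keeps moving right. Each sample pairs the center word with one word from its window, and that is the core idea of the skip-gram model.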

[Figure: skip-gram training samples produced by a sliding window]
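The sliding window described above can be sketched in a few lines of plain Python. This is only an illustration: skipgram_pairs is a hypothetical helper that enumerates every (center, context) pair, whereas the batch generator later in the article samples a subset of them.

```python
def skipgram_pairs(tokens, window):
    """Return every (center, context) pair within `window` words of each center."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is never its own context
                pairs.append((center, tokens[j]))
    return pairs


tokens = "the quick brown fox jumps over the lazy dog".split()
pairs = skipgram_pairs(tokens, window=2)
print(pairs[:5])
# [('the', 'quick'), ('the', 'brown'), ('quick', 'the'), ('quick', 'brown'), ('quick', 'fox')]
```

Words near the edges of the text simply get fewer context words, which is why "the" yields two pairs and "brown" yields four.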

Loading the corpus & word segmentation

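    import codecs

    import jieba
    import tensorflow as tf


    def read_data(filename):
        # Read the whole corpus file as UTF-8 text
        with codecs.open(filename, 'r', encoding='utf-8') as f:
            data = f.read()
        # Segment with jieba in accurate mode (cut_all=False)
        seg_list = jieba.cut(data, cut_all=False)
        text = tf.compat.as_str("/".join(seg_list)).split('/')
        return text


    filename = "D:\\data6\\house_train\\result.txt"
    vocabulary = read_data(filename)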

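This loads the corpus file and segments it into words. filename points to the corpus file, segmentation is done with jieba, and the function returns a list containing every word in the corpus.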
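One detail of read_data is worth a closer look: the tokens are joined with "/" and then split on "/" again, so any token that itself contains a slash is broken apart. The sketch below shows the effect with fake_cut, a hypothetical stand-in for jieba.cut that just splits on spaces. Collecting the generator with list() is one possible alternative that keeps tokens intact.

```python
def fake_cut(text):
    # Hypothetical stand-in for jieba.cut: lazily yields tokens
    yield from text.split(" ")


sentence = "sold 3/4 units"

# Join-then-split, as in read_data: the token "3/4" is broken in two
tokens_roundtrip = "/".join(fake_cut(sentence)).split("/")

# Collecting the generator directly keeps every token intact
tokens_direct = list(fake_cut(sentence))

print(tokens_roundtrip)  # ['sold', '3', '4', 'units']
print(tokens_direct)     # ['sold', '3/4', 'units']
```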

Building the dictionary

    import collections

    vocabulary_size = 50000


    def build_dataset(words, n_words):
        # count[0] is the UNK entry; its frequency is filled in below
        count = [['UNK', -1]]
        count.extend(collections.Counter(words).most_common(n_words - 1))
        dictionary = dict()
        for word, _ in count:
            dictionary[word] = len(dictionary)
        data = list()
        unk_count = 0
        for word in words:
            if word in dictionary:
                index = dictionary[word]
            else:
                index = 0  # dictionary['UNK']
                unk_count += 1
            data.append(index)
        count[0][1] = unk_count
        reversed_dictionary = dict(zip(dictionary.values(), dictionary.keys()))
        return data, count, dictionary, reversed_dictionary


    data, count, dictionary, reverse_dictionary = build_dataset(vocabulary, vocabulary_size)
    del vocabulary  # free memory; data now holds the indexed corpus

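Here we build a vocabulary of size 50000. vocabulary holds every word taken from the corpus. We count how often each word occurs and keep the 49999 most frequent ones. count is a list of [word, frequency] pairs with 'UNK' first, which makes it easy to look up how often a word occurs. Next we build dictionary, which maps each word to its index. We then convert every word in vocabulary to its index and store the result in data. Any word outside the 49999 most frequent is treated as unknown and given index 0, and we count the unknown words along the way. Finally we build the reverse lookup reversed_dictionary (named reverse_dictionary at the call site), which maps an index back to its word.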
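To make the four return values concrete, here is a compact restatement of build_dataset run on a toy word list. The words and the vocabulary size of 3 are made up for illustration, and the dict comprehensions are just a shorter way to write the same loops.

```python
import collections


def build_dataset(words, n_words):
    count = [['UNK', -1]]
    count.extend(collections.Counter(words).most_common(n_words - 1))
    dictionary = {word: index for index, (word, _) in enumerate(count)}
    data = []
    unk_count = 0
    for word in words:
        index = dictionary.get(word, 0)  # unseen words fall back to UNK (index 0)
        if word not in dictionary:
            unk_count += 1
        data.append(index)
    count[0][1] = unk_count
    reverse_dictionary = {index: word for word, index in dictionary.items()}
    return data, count, dictionary, reverse_dictionary


words = ["house", "price", "house", "policy", "house", "price"]
data, count, dictionary, reverse_dictionary = build_dataset(words, n_words=3)

print(dictionary)  # {'UNK': 0, 'house': 1, 'price': 2}
print(data)        # [1, 2, 1, 0, 1, 2]
print(count[0])    # ['UNK', 1]
```

With n_words=3 only the two most frequent words get their own index, so "policy" is mapped to 0 and counted as unknown.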

Generating batches

    import collections
    import random

    import numpy as np

    data_index = 0  # position of the next unread word in data


    def generate_batch(batch_size, num_skips, skip_window):
        global data_index
        assert batch_size % num_skips == 0
        assert num_skips <= 2 * skip_window
        batch = np.ndarray(shape=(batch_size), dtype=np.int32)
        labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32)
        span = 2 * skip_window + 1  # [ skip_window, center, skip_window ]
        buffer = collections.deque(maxlen=span)
        if data_index + span > len(data):
            data_index = 0
        buffer.extend(data[data_index:data_index + span])
        data_index += span
        for i in range(batch_size // num_skips):
            target = skip_window  # the center position, always avoided
            targets_to_avoid = [skip_window]
            for j in range(num_skips):
                while target in targets_to_avoid:
                    target = random.randint(0, span - 1)
                targets_to_avoid.append(target)
                batch[i * num_skips + j] = buffer[skip_window]
                labels[i * num_skips + j, 0] = buffer[target]
            if data_index == len(data):
                # deque does not support slice assignment, so refill with extend
                buffer.extend(data[0:span])
                data_index = span
            else:
                buffer.append(data[data_index])
                data_index += 1
        # Step back so that words at the end of this batch are not skipped
        data_index = (data_index + len(data) - span) % len(data)
        return batch, labels

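This function generates one batch of training data. batch_size is the number of samples per batch. num_skips is the number of context words to draw from the window around a word: with the window size of 2 used earlier there are 4 words around a center word, so it can form at most 4 training samples, and if you only want 2 of them you set num_skips to 2. skip_window sets the window size. Samples are drawn by sliding the window over the whole indexed corpus data, and both batch and labels hold dictionary indices, which is convenient for the computation later.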
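The relationship between skip_window and num_skips can be seen on a single window. The sketch below is a simplified illustration, not the article's function: sample_context is a hypothetical helper, and it uses random.sample instead of the retry loop in generate_batch, though both pick num_skips distinct context positions and never the center.

```python
import random


def sample_context(window_tokens, skip_window, num_skips, rng):
    """Pick num_skips distinct context words for the center of one window."""
    assert num_skips <= 2 * skip_window
    center = window_tokens[skip_window]
    context_positions = [i for i in range(len(window_tokens)) if i != skip_window]
    chosen = rng.sample(context_positions, num_skips)
    return [(center, window_tokens[i]) for i in chosen]


rng = random.Random(0)  # fixed seed only to make the run repeatable
window = ["the", "quick", "brown", "fox", "jumps"]  # span = 2 * skip_window + 1 = 5
pairs = sample_context(window, skip_window=2, num_skips=2, rng=rng)
print(pairs)  # two pairs, each with "brown" as the center word
```

Setting num_skips to 4 here would use every context word in the window, and anything larger would trip the assertion, just as in generate_batch.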

Building the graph

    import math

    import tensorflow as tf

    # TensorFlow 1.x API. batch_size, embedding_size, num_sampled and
    # valid_examples must already be defined; this listing does not show them.
    graph = tf.Graph()
    with graph.as_default():
        # Inputs: a batch of center-word indices and their context-word labels
        train_inputs = tf.placeholder(tf.int32, shape=[batch_size])
        train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])
        valid_dataset = tf.constant(valid_examples, dtype=tf.int32)
        with tf.device('/cpu:0'):
            embeddings = tf.Variable(
                tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
            embed = tf.nn.embedding_lookup(embeddings, train_inputs)
            nce_weights = tf.Variable(
                tf.truncated_normal([vocabulary_size, embedding_size],
                                    stddev=1.0 / math.sqrt(embedding_size)))
            nce_biases = tf.Variable(tf.zeros([vocabulary_size]))
        # Average NCE loss over the batch
        loss = tf.reduce_mean(
            tf.nn.nce_loss(weights=nce_weights,
                           biases=nce_biases,
                           labels=train_labels,
                           inputs=embed,
                           num_sampled=num_sampled,
                           num_classes=vocabulary_size))
        optimizer = tf.train.GradientDescentOptimizer(1.0).minimize(loss)
        # Cosine similarity between the validation words and every word
        norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keep_dims=True))
        normalized_embeddings = embeddings / norm
        valid_embeddings = tf.nn.embedding_lookup(normalized_embeddings, valid_dataset)
        similarity = tf.matmul(valid_embeddings, normalized_embeddings, transpose_b=True)
        init = tf.global_variables_initializer()
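The graph and the training loop use several names that the article never defines. The sketch below shows one way to set them up before building the graph. Only vocabulary_size (50000) and embedding_size (128) come from the article. Every other value is an assumption chosen for illustration, and picking the validation words with random.sample is likewise just one option.

```python
import random

vocabulary_size = 50000  # from the article
embedding_size = 128     # from the article: 128-dimensional word vectors

# Everything below is assumed, not taken from the article
batch_size = 128         # samples per training batch
skip_window = 2          # words considered on each side of the center word
num_skips = 2            # context words sampled per center word
num_sampled = 64         # noise words drawn per batch for the NCE loss
num_steps = 100001       # number of training steps

valid_size = 16          # how many words to inspect during training
valid_window = 100       # draw them from the 100 lowest (most frequent) indices
valid_examples = random.sample(range(valid_window), valid_size)
```

Because the dictionary assigns indices in order of frequency, restricting valid_examples to low indices means the words inspected during training are common ones, which tend to have more meaningful neighbours early on. Note that index 0 is UNK and may be drawn as well.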

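train_inputs is an input placeholder of shape [batch_size] that holds the indices of a batch of input words. train_labels is a placeholder of shape [batch_size, 1] that holds the correct label, also a word index, for each input in the batch.

The embeddings variable holds the 128-dimensional word vectors for every word in the dictionary, and these vectors are updated continuously during training. It is a matrix of shape [vocabulary_size, embedding_size], here [50000, 128] because the vocabulary has 50000 words, and its elements are initialized to values between -1 and 1.

The embedding_lookup function then uses the indices in train_inputs to fetch a batch of 128-dimensional inputs, embed.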
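Conceptually, an embedding lookup is just row selection from the embedding matrix. The plain-Python sketch below mimics that behaviour with a tiny made-up matrix (4 words, 3 dimensions). It illustrates the idea only and is not how TensorFlow implements it.

```python
# A toy embedding matrix: one row per word index
embeddings = [
    [0.1, 0.2, 0.3],  # index 0 (UNK)
    [0.4, 0.5, 0.6],  # index 1
    [0.7, 0.8, 0.9],  # index 2
    [1.0, 1.1, 1.2],  # index 3
]

train_inputs = [2, 0, 2]  # a batch of word indices; repeats are fine

# The lookup returns one row per index, in batch order
embed = [embeddings[index] for index in train_inputs]
print(embed)  # [[0.7, 0.8, 0.9], [0.1, 0.2, 0.3], [0.7, 0.8, 0.9]]
```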

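Next we use NCE (noise-contrastive estimation) as the loss function, built from the vocabulary size vocabulary_size and the word-vector dimension embedding_size. NCE is a sampling-based loss closely related to negative sampling: instead of scoring all 50000 words for every example, it contrasts the true context word with num_sampled randomly drawn noise words. Other loss functions are worth trying too. nce_weights and nce_biases are the weights and biases used by NCE. The loss is averaged over the batch and then minimized with gradient descent.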
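To give a feel for what a sampled loss computes, here is a simplified sketch for a single training example: the true context word is pushed toward a high score and a few noise words toward low scores. It is written in the spirit of negative sampling and deliberately leaves out the sampling-probability correction that a full NCE implementation applies, so it is not a reimplementation of tf.nn.nce_loss. All names and numbers are made up for illustration.

```python
import math


def log_sigmoid(x):
    # Numerically stable log(sigmoid(x))
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def sampled_logistic_loss(embed, true_weight, true_bias, noise):
    """Loss for one example: one true class plus a list of (weight, bias) noise classes."""
    loss = -log_sigmoid(dot(embed, true_weight) + true_bias)
    for weight, bias in noise:
        loss -= log_sigmoid(-(dot(embed, weight) + bias))
    return loss


embed = [1.0, 0.0]
noise = [([-1.0, 0.0], 0.0), ([0.0, 1.0], 0.0)]
loss_good = sampled_logistic_loss(embed, [2.0, 0.0], 0.0, noise)   # true word scores high
loss_bad = sampled_logistic_loss(embed, [-2.0, 0.0], 0.0, noise)   # true word scores low
print(loss_good < loss_bad)  # True
```

The cost per example grows with the number of noise words rather than with the vocabulary size, which is the reason for using a sampled loss with a 50000-word vocabulary.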

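Finally we normalize embeddings so that every word vector has unit length, then compute the similarity between the words chosen for validation and all word vectors. Because the vectors are normalized, the matrix product gives the cosine similarity, and a higher value means the two words are closer.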
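The normalization and similarity step can be checked by hand on a tiny example. The sketch below uses plain Python lists and made-up 2-dimensional vectors; l2_normalize and dot are hypothetical helpers standing in for the tensor operations in the graph.

```python
import math


def l2_normalize(vector):
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


embeddings = [[3.0, 4.0], [6.0, 8.0], [-4.0, 3.0]]
normalized = [l2_normalize(v) for v in embeddings]

valid_ids = [0]  # indices of the words we want neighbours for
similarity = [[dot(normalized[i], row) for row in normalized] for i in valid_ids]
print(similarity[0])  # about 1 for words 0 and 1, about 0 for word 2
```

Words 0 and 1 point in the same direction and differ only in length, so after normalization their similarity is about 1. Word 2 is perpendicular to word 0, so its similarity is about 0.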

Creating the session

    # num_steps, num_skips, skip_window and valid_size must already be defined;
    # this listing does not show them.
    with tf.Session(graph=graph) as session:
        init.run()
        average_loss = 0
        for step in range(num_steps):
            batch_inputs, batch_labels = generate_batch(batch_size, num_skips, skip_window)
            feed_dict = {train_inputs: batch_inputs, train_labels: batch_labels}
            _, loss_val = session.run([optimizer, loss], feed_dict=feed_dict)
            average_loss += loss_val
            # Report the average loss over the last 2000 steps
            if step % 2000 == 0:
                if step > 0:
                    average_loss /= 2000
                print('Average loss at step ', step, ': ', average_loss)
                average_loss = 0
            # Print the nearest neighbours of each validation word
            if step % 10000 == 0:
                sim = similarity.eval()
                for i in range(valid_size):
                    valid_word = reverse_dictionary[valid_examples[i]]
                    top_k = 8
                    # Position 0 of the sorted order is the word itself, so skip it
                    nearest = (-sim[i, :]).argsort()[1:top_k + 1]
                    log_str = 'Nearest to %s:' % valid_word
                    for k in range(top_k):
                        close_word = reverse_dictionary[nearest[k]]
                        log_str = '%s %s,' % (log_str, close_word)
                    print(log_str)
        final_embeddings = normalized_embeddings.eval()

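We create a session and start training. num_steps sets how many training steps to run. In each step, generate_batch supplies a batch of inputs with their labels, and we run the optimizer and loss operations on it. Every 2000 steps we print the average loss over those steps, and every 10000 steps we take the validation words and print the 8 words closest to each of them.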
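The nearest-neighbour selection is easy to check on one row of similarities. The sketch below mirrors the argsort-and-slice idea with plain Python sorting. nearest_indices is a hypothetical helper and the similarity values are made up.

```python
def nearest_indices(sim_row, top_k):
    """Indices of the top_k most similar words, skipping the best match (the word itself)."""
    order = sorted(range(len(sim_row)), key=lambda j: -sim_row[j])
    return order[1:top_k + 1]


# Similarities of one validation word (index 1) to a 5-word vocabulary
sim_row = [0.1, 1.0, 0.7, -0.2, 0.9]
print(nearest_indices(sim_row, top_k=2))  # [4, 2]
```

Dropping the first entry relies on the word being most similar to itself, which holds here because the vectors are normalized and a word's similarity to itself is 1.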

Dimensionality reduction and plotting

    import matplotlib.pyplot as plt
    from sklearn.manifold import TSNE


    def plot_with_labels(low_dim_embs, labels, filename='tsne.png'):
        assert low_dim_embs.shape[0] >= len(labels), 'More labels than embeddings'
        plt.figure(figsize=(18, 18))  # in inches
        for i, label in enumerate(labels):
            x, y = low_dim_embs[i, :]
            plt.scatter(x, y)
            plt.annotate(label,
                         xy=(x, y),
                         xytext=(5, 2),
                         textcoords='offset points',
                         ha='right',
                         va='bottom')
        plt.savefig(filename)


    # Use a font with Chinese glyphs so the word labels render correctly
    plt.rcParams['font.sans-serif'] = ['SimHei']
    plt.rcParams['axes.unicode_minus'] = False
    tsne = TSNE(perplexity=30, n_components=2, init='pca', n_iter=5000, method='exact')
    plot_only = 300
    low_dim_embs = tsne.fit_transform(final_embeddings[:plot_only, :])
    labels = [reverse_dictionary[i] for i in range(plot_only)]
    plot_with_labels(low_dim_embs, labels)

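We take the 300 entries with the lowest indices, which are the most frequent words, reduce their vectors to two dimensions with t-SNE, and plot them.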

[Figure: t-SNE plot of the 300 selected words]

github

github.com/sea-boat/De…
