In the previous posts we introduced the Hierarchical Softmax versions of the skip-gram and CBOW models. Replacing the conventional softmax output layer with a Huffman tree does make training more efficient, but if the center word w in a training sample is a very rare word, we still have to walk a long way down the Huffman tree. Can we drop this complicated Huffman tree altogether and make the model even simpler?

Negative Sampling is exactly such a way of training word2vec: it discards the Huffman tree and instead draws a handful of randomly sampled "negative" words for each training pair. Below we look at how Negative Sampling solves the model.
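As a concrete illustration of the sampling step, here is a minimal numpy sketch (toy counts of my own, not from the dataset). In the original word2vec implementation the noise words are drawn from the unigram distribution raised to the 3/4 power, so frequent words are sampled often but not overwhelmingly so; the TensorFlow code later in this post delegates the sampling to tf.nn.nce_loss instead.

import numpy as np

# Minimal sketch of drawing negative samples; `word_counts` maps word index -> corpus count (toy values).
word_counts = np.array([50, 20, 5, 1], dtype=np.float64)  # toy counts for a 4-word vocabulary
noise_dist = word_counts ** 0.75                           # unigram distribution raised to the 3/4 power
noise_dist /= noise_dist.sum()

num_negatives = 5
negative_samples = np.random.choice(len(word_counts), size=num_negatives, p=noise_dist)
print(negative_samples)  # indices of the sampled noise words

For each (target, context) training pair, these sampled indices play the role of the "wrong" words the model must learn to score low.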






Skip-gram

Implementing Skip-gram with TensorFlow (movie review dataset)

Import the required libraries

# Import libraries
import tensorflow as tf
import matplotlib.pyplot as plt
import numpy as np
import random
import os            # file and directory operations
import string        # string helpers (e.g. string.punctuation)
import requests      # HTTP client library
import collections
"""
collections provides several handy container types:
1. namedtuple(): tuple subclass whose fields can be accessed by name
2. deque: double-ended queue with fast appends and pops on either end
3. Counter: a counter, mainly used for counting hashable objects
4. OrderedDict: a dictionary that remembers insertion order
5. defaultdict: a dictionary with a default value for missing keys
"""
import io            # StringIO, BytesIO, ...
import gzip
import tarfile       # compress / extract archives
import urllib.request
"""
urllib is Python's built-in HTTP package and includes these modules:
urllib.request      request module
urllib.error        exception handling module
urllib.parse        URL parsing module
urllib.robotparser  robots.txt parsing module
"""
from nltk.corpus import stopwords
from tensorflow.python.framework import ops

ops.reset_default_graph()
sess = tf.Session()

Declare the model parameters

batch_size = 50          # train on 50 (target, context) pairs per batch
embedding_size = 200     # each word embedding is a vector of length 200
vocabulary_size = 10000  # only keep the 10,000 most frequent words
generations = 50000      # run 50,000 training iterations
print_loss_every = 500   # print the loss every 500 iterations
num_sampled = int(batch_size/2)  # number of negative examples to sample
window_size = 3          # consider 3 context words on each side of the target word
stops = stopwords.words('english')
print_valid_every = 2000 # print the nearest-neighbor words every 2,000 iterations
valid_words = ['cliche', 'love', 'hate', 'silly', 'sad']

Load the data. First check whether the dataset has already been downloaded; if it has, load it from disk, otherwise download it.

def load_movie_data():
    save_folder_name = 'temp'
    pos_file = os.path.join(save_folder_name, 'rt-polaritydata', 'rt-polarity.pos')
    neg_file = os.path.join(save_folder_name, 'rt-polaritydata', 'rt-polarity.neg')
    # Check whether the files have already been downloaded
    if not os.path.exists(os.path.join(save_folder_name, 'rt-polaritydata')):
        movie_data_url = 'http://www.cs.cornell.edu/people/pabo/movie-review-data/rt-polaritydata.tar.gz'
        # Save the tar.gz file
        req = requests.get(movie_data_url, stream=True)
        with open('temp_movie_review_temp.tar.gz', 'wb') as f:
            for chunk in req.iter_content(chunk_size=1024):
                if chunk:
                    f.write(chunk)
                    f.flush()
        # Extract tar.gz file into temp folder
        tar = tarfile.open('temp_movie_review_temp.tar.gz', "r:gz")
        tar.extractall(path='temp')
        tar.close()
    pos_data = []
    with open(pos_file, 'r', encoding='latin-1') as f:
        for line in f:
            pos_data.append(line.encode('ascii', errors='ignore').decode())
    pos_data = [x.rstrip() for x in pos_data]  # rstrip() strips trailing whitespace
    neg_data = []
    with open(neg_file, 'r', encoding='latin-1') as f:
        for line in f:
            neg_data.append(line.encode('ascii', errors='ignore').decode())
    neg_data = [x.rstrip() for x in neg_data]
    texts = pos_data + neg_data
    target = [1] * len(pos_data) + [0] * len(neg_data)
    return texts, target
texts, target = load_movie_data()

Output:
[‘the rock is destined to be the 21st century’s new " conan " and that he’s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .’, ‘the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson’s expanded vision of j . r . r . tolkien’s middle-earth .’, ‘effective but too-tepid biopic’, ‘if you sometimes like to go to the movies to have fun , wasabi is a good place to start .’, “emerges as something rare , an issue movie that’s so honest and keenly observed that it doesn’t feel like one .”]
[1, 1, 1, 1, 1]

Create a text-normalization function: lowercase the text and remove punctuation, digits, extra whitespace, and stop words.

def normalize_text(texts, stops):
    texts = [x.lower() for x in texts]                                                       # lowercase
    texts = [''.join(x for x in text if x not in string.punctuation) for text in texts]      # strip punctuation
    print(texts[0:2])
    print("---------------------------------------------------------------------------------------")
    texts = [''.join(x for x in text if x not in '0123456789') for text in texts]            # strip digits
    print(texts[0:2])
    print("---------------------------------------------------------------------------------------")
    texts = [' '.join([word for word in text.split() if word not in stops]) for text in texts]  # strip stop words
    print(texts[0:2])
    print("---------------------------------------------------------------------------------------")
    texts = [' '.join(text.split()) for text in texts]                                       # collapse extra whitespace
    return texts
texts = normalize_text(texts, stops)
print(texts[0:5])

Output:
['the rock is destined to be the 21st centurys new conan and that hes going to make a splash even greater than arnold schwarzenegger jeanclaud van damme or steven segal ', 'the gorgeously elaborate continuation of the lord of the rings trilogy is so huge that a column of words cannot adequately describe cowriterdirector peter jacksons expanded vision of j r r tolkiens middleearth ']

['the rock is destined to be the st centurys new conan and that hes going to make a splash even greater than arnold schwarzenegger jeanclaud van damme or steven segal ', 'the gorgeously elaborate continuation of the lord of the rings trilogy is so huge that a column of words cannot adequately describe cowriterdirector peter jacksons expanded vision of j r r tolkiens middleearth ']

[‘rock destined st centurys new conan hes going make splash even greater arnold schwarzenegger jeanclaud van damme steven segal’, ‘gorgeously elaborate continuation lord rings trilogy huge column words cannot adequately describe cowriterdirector peter jacksons expanded vision j r r tolkiens middleearth’]

[‘rock destined st centurys new conan hes going make splash even greater arnold schwarzenegger jeanclaud van damme steven segal’, ‘gorgeously elaborate continuation lord rings trilogy huge column words cannot adequately describe cowriterdirector peter jacksons expanded vision j r r tolkiens middleearth’, ‘effective tootepid biopic’, ‘sometimes like go movies fun wasabi good place start’, ‘emerges something rare issue movie thats honest keenly observed doesnt feel like one’]

To make sure each review is informative, keep only reviews that are at least three words long. Then build the vocabulary dictionary, which maps each word to an integer index; words outside the vocabulary_size most frequent are labelled RARE.

target = [target[ix] for ix, x in enumerate(texts) if len(x.split()) > 2]
texts = [x for x in texts if len(x.split()) > 2]

# Build the vocabulary: a dictionary mapping each word to an integer index.
# Words that are not among the (vocabulary_size - 1) most frequent are marked as RARE.
def build_dictionary(sentences, vocabulary_size):
    # Turn sentences (list of strings) into lists of words
    split_sentences = [s.split() for s in sentences]
    print(split_sentences[0:5])
    print("=============================================================")
    words = [x for sublist in split_sentences for x in sublist]
    print(words[0:5])
    print("=============================================================")
    # Initialize list of [word, word_count] pairs, starting with the unknown token
    count = [['RARE', -1]]
    # Add the most frequent words, limited to the N most frequent (N = vocabulary size)
    count.extend(collections.Counter(words).most_common(vocabulary_size - 1))
    # Now create the dictionary: each word gets the current dictionary length as its index
    word_dict = {}
    for word, word_count in count:
        word_dict[word] = len(word_dict)
    return word_dict

word_dict = build_dictionary(texts, vocabulary_size)
for k, v in word_dict.items():
    print(k + ":" + str(v))

Output:
[[‘rock’, ‘destined’, ‘st’, ‘centurys’, ‘new’, ‘conan’, ‘hes’, ‘going’, ‘make’, ‘splash’, ‘even’, ‘greater’, ‘arnold’, ‘schwarzenegger’, ‘jeanclaud’, ‘van’, ‘damme’, ‘steven’, ‘segal’], [‘gorgeously’, ‘elaborate’, ‘continuation’, ‘lord’, ‘rings’, ‘trilogy’, ‘huge’, ‘column’, ‘words’, ‘cannot’, ‘adequately’, ‘describe’, ‘cowriterdirector’, ‘peter’, ‘jacksons’, ‘expanded’, ‘vision’, ‘j’, ‘r’, ‘r’, ‘tolkiens’, ‘middleearth’], [‘effective’, ‘tootepid’, ‘biopic’], [‘sometimes’, ‘like’, ‘go’, ‘movies’, ‘fun’, ‘wasabi’, ‘good’, ‘place’, ‘start’], [‘emerges’, ‘something’, ‘rare’, ‘issue’, ‘movie’, ‘thats’, ‘honest’, ‘keenly’, ‘observed’, ‘doesnt’, ‘feel’, ‘like’, ‘one’]]

[‘rock’, ‘destined’, ‘st’, ‘centurys’, ‘new’]

thats:39
idiocy:4039
tenacious:5294
stand:685
authenticity:3966
worry:3967
familial:3092
... (remaining vocabulary entries omitted) ...

Convert the list of sentences into lists of word indices; these index lists are what will later be fed to the embedding-lookup function.

def text_to_numbers(sentences, word_dict):
    data = []
    for sentence in sentences:
        sentence_data = []
        for word in sentence.split():
            if word in word_dict:
                word_ix = word_dict[word]
            else:
                word_ix = 0  # unknown words map to RARE (index 0)
            sentence_data.append(word_ix)
        data.append(sentence_data)
    return data

# Create the word dictionary and convert the list of sentences into lists of word indices
word_dictionary = build_dictionary(texts, vocabulary_size)
word_dictionary_rev = dict(zip(word_dictionary.values(), word_dictionary.keys()))  # swap keys and values
text_data = text_to_numbers(texts, word_dictionary)
print(text_data[0:5])

# Look up the indices of the validation words in the dictionary
valid_examples = [word_dictionary[x] for x in valid_words]
print(valid_examples)
print(word_dictionary["rock"])
print(word_dictionary_rev[546])

[[2708, 0, 1872, 1044, 0, 0, 753, 0, 0, 0, 2408, 753, 0, 0, 0, 0, 0, 1872, 753, 2408, 0, 1775, 2708, 0, 0, 0, 2408, 753, 2754, 0, 1872, 0, 2408, 0, 2408, 0, 2044, 753, 0, 0, 2637, 0, 0, 2408, 2637, 0, 0, 0, 1044, 753, 0, 0, 4294, 4227, 0, 0, 2044, 0, 753, 8725, 753, 2408, 0, 2637, 2708, 753, 0, 0, 753, 2708, 0, 0, 2708, 2408, 0, 4227, 0, 0, 0, 1872, 2044, 2754, 0, 2708, 0, 753, 2408, 753, 2637, 2637, 753, 2708, 0, 1150, 753, 0, 2408, 1872, 4227, 0, 1775, 0, 0, 8725, 0, 2408, 0, 0, 0, 0, 0, 753, 0, 0, 0, 753, 8725, 753, 2408, 0, 0, 753, 2637, 0, 4227], [2637, 0, 2708, 2637, 753, 0, 1775, 0, 4227, 0, 0, 753, 4227, 0, 1363, 0, 2708, 0, 0, 753, 0, 1872, 0, 2408, 0, 0, 2408, 1775, 0, 0, 0, 0, 2408, 0, 4227, 0, 2708, 0, 0, 2708, 0, 2408, 2637, 0, 0, 0, 2708, 0, 4227, 0, 2637, 0, 0, 2044, 1775, 2637, 753, 0, 1872, 0, 4227, 1775, 0, 2408, 0, 2754, 0, 2708, 0, 0, 0, 1872, 0, 2408, 2408, 0, 0, 0, 0, 0, 753, 2864, 1775, 0, 0, 753, 4227, 0, 0, 0, 753, 0, 1872, 2708, 0, 1363, 753, 0, 1872, 0, 2754, 2708, 0, 0, 753, 2708, 0, 0, 2708, 753, 1872, 0, 0, 2708, 0, 4294, 753, 0, 753, 2708, 0, 1150, 0, 1872, 1044, 0, 0, 2408, 0, 0, 753, 979, 4294, 0, 2408, 0, 753, 0, 0, 8725, 0, 0, 0, 0, 2408, 0, 1150, 0, 2708, 0, 2708, 0, 0, 0, 4227, 1044, 0, 753, 2408, 0, 0, 0, 0, 0, 0, 4227, 753, 753, 0, 2708, 0, 2044], [753, 9010, 9010, 753, 1872, 0, 0, 8725, 753, 0, 0, 0, 0, 0, 753, 4294, 0, 0, 0, 1363, 0, 0, 4294, 0, 1872], [0, 0, 0, 753, 0, 0, 0, 753, 0, 0, 4227, 0, 1044, 753, 0, 2637, 0, 0, 0, 0, 8725, 0, 753, 0, 0, 9010, 1775, 2408, 0, 2754, 0, 0, 0, 1363, 0, 0, 2637, 0, 0, 0, 0, 4294, 4227, 0, 1872, 753, 0, 0, 0, 0, 2708, 0], [753, 0, 753, 2708, 2637, 753, 0, 0, 0, 0, 0, 753, 0, 2044, 0, 2408, 2637, 0, 2708, 0, 2708, 753, 0, 0, 0, 0, 1775, 753, 0, 0, 0, 8725, 0, 753, 0, 0, 2044, 0, 0, 0, 0, 2044, 0, 2408, 753, 0, 0, 0, 1044, 753, 753, 2408, 4227, 0, 0, 0, 1363, 0, 753, 2708, 8725, 753, 0, 0, 0, 0, 753, 0, 2408, 0, 0, 9010, 753, 753, 4227, 0, 4227, 0, 1044, 753, 0, 0, 2408, 753]]

[1423, 28, 952, 200, 360]

546

rock

Create the batching function for the skip-gram model. We want to train on word pairs in which the first word is the input (the target word) and the second word is one of its context words. For example, in "the cat in the hat", with target word "in" and a context window of size 2, the training pairs are (in, the), (in, cat), (in, the), (in, hat).
Note: descriptions of a "window of size 2" sometimes count the target word as part of the window, whereas the code below takes window_size words on each side of the target, i.e. up to 2*window_size context words plus the target itself. Either way the idea is the same; a toy illustration of the windowing follows the sample output below.

# Create a function that returns batches of data for the skip-gram model
def generate_batch_data(sentences, batch_size, window_size, method='skip-gram'):
    batch_data = []
    label_data = []
    while len(batch_data) < batch_size:
        rand_sentence = sentences[np.random.choice(len(sentences))]  # pick a random sentence to start from
        # Build the window of surrounding words for every position in the sentence
        window_sequences = [rand_sentence[max((ix - window_size), 0):(ix + window_size + 1)]
                            for ix, x in enumerate(rand_sentence)]
        # Index of the center word within each window
        label_indices = [ix if ix < window_size else window_size for ix, x in enumerate(window_sequences)]
        # Pull out the center word of interest for each window and create a tuple for each window
        if method == 'skip-gram':
            # Predict the context from the target word
            batch_and_labels = [(x[y], x[:y] + x[(y + 1):]) for x, y in zip(window_sequences, label_indices)]
            tuple_data = [(x, y_) for x, y in batch_and_labels for y_ in y]  # (target word, surrounding word)
        elif method == 'cbow':
            # Predict the target word from its context
            batch_and_labels = [(x[:y] + x[(y + 1):], x[y]) for x, y in zip(window_sequences, label_indices)]
            tuple_data = [(x_, y) for x, y in batch_and_labels for x_ in x]  # (surrounding word, target word)
        else:
            raise ValueError('Method {} not implemented yet.'.format(method))
        batch, labels = [list(x) for x in zip(*tuple_data)]  # split into batch and labels
        batch_data.extend(batch[:batch_size])
        label_data.extend(labels[:batch_size])
    # Trim batch and labels to the requested size
    batch_data = batch_data[:batch_size]
    label_data = label_data[:batch_size]
    batch_data = np.array(batch_data)
    label_data = np.transpose(np.array([label_data]))
    return batch_data, label_data

batch_data, label_data = generate_batch_data(text_data, batch_size, window_size, method='skip-gram')
for i in range(4):
    print(str(batch_data[i + 8]) + ' ' + str(label_data[i + 8]))

Output:
0 [2408]
0 [0]
0 [4294]
0 [753]
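To make the windowing concrete, here is a small standalone sketch (using a toy sentence of my own, not part of the dataset) that applies the same window construction as generate_batch_data to "the cat in the hat" with a window size of 2:

# Toy illustration of the skip-gram pair construction used in generate_batch_data
toy_sentence = ['the', 'cat', 'in', 'the', 'hat']
win = 2
windows = [toy_sentence[max(ix - win, 0):ix + win + 1] for ix, _ in enumerate(toy_sentence)]
centers = [ix if ix < win else win for ix, _ in enumerate(windows)]
pairs = [(w[c], ctx) for w, c in zip(windows, centers) for ctx in w[:c] + w[c + 1:]]
print(pairs)  # includes ('in', 'the'), ('in', 'cat'), ('in', 'the'), ('in', 'hat'), among others

The pairs centered on "in" match the four pairs listed above.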

Initialize the embedding matrix, and declare the placeholders and the embedding-lookup operation.

# Initialize the embedding matrix
embeddings = tf.Variable(tf.random_uniform([vocabulary_size, embedding_size]))

# Create the placeholders
x_inputs = tf.placeholder(tf.int32, shape=[batch_size])
y_target = tf.placeholder(tf.int32, shape=[batch_size, 1])
valid_dataset = tf.constant(valid_examples, dtype=tf.int32)

# Embedding lookup
embed = tf.nn.embedding_lookup(embeddings, x_inputs)

Softmax is the usual loss for multi-class classification, but here there are 10,000 target classes and the targets are extremely sparse, which makes the model hard to fit and slow to converge. To get around this we use noise-contrastive estimation (NCE). The NCE loss turns the problem into a binary prediction task: distinguish the true target word from randomly sampled noise words. (A small sketch of this binary objective follows the code below.)

# NCE loss
nce_weights = tf.Variable(tf.truncated_normal([vocabulary_size, embedding_size],
                                              stddev=1.0 / np.sqrt(embedding_size)))
nce_biases = tf.Variable(tf.zeros([vocabulary_size]))

# Get the loss from the prediction
loss = tf.reduce_mean(tf.nn.nce_loss(weights=nce_weights,
                                     biases=nce_biases,
                                     labels=y_target,
                                     inputs=embed,
                                     num_sampled=num_sampled,
                                     num_classes=vocabulary_size))
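tf.nn.nce_loss takes care of drawing the noise words and computing the binary objective internally. As a rough conceptual illustration only, here is a minimal numpy sketch of the negative-sampling form of the objective for one (target, context) pair with k noise words; it is not TensorFlow's exact implementation. The dimensions match embedding_size and num_sampled above.

# Conceptual sketch of the binary objective behind NCE / negative sampling (not TF's exact nce_loss)
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_loss(center_vec, true_context_vec, noise_vecs):
    """center_vec: embedding of the input word, shape [d]
    true_context_vec: output vector of the observed context word, shape [d]
    noise_vecs: output vectors of k sampled noise words, shape [k, d]"""
    pos_term = -np.log(sigmoid(np.dot(true_context_vec, center_vec)))   # pull the true pair together
    neg_term = -np.sum(np.log(sigmoid(-noise_vecs.dot(center_vec))))    # push the noise words away
    return pos_term + neg_term

d, k = 200, 25  # embedding_size and num_sampled from the parameters above
rng = np.random.RandomState(0)
print(negative_sampling_loss(rng.randn(d), rng.randn(d), rng.randn(k, d)))

Minimizing this pushes the true pair's dot product up and the noise pairs' dot products down, which is what lets the model avoid the full 10,000-way softmax.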

Find the words closest to each validation word by computing the cosine similarity between the validation-word embeddings and all word embeddings. (A quick numpy sanity check of this identity follows the code below.)

norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keepdims=True))
normalized_embeddings = embeddings / norm
valid_embeddings = tf.nn.embedding_lookup(normalized_embeddings, valid_dataset)
similarity = tf.matmul(valid_embeddings, normalized_embeddings, transpose_b=True)
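As a sanity check of the trick used above, here is a standalone numpy sketch (independent of the TensorFlow graph): after L2-normalizing each row of an embedding matrix, a plain matrix product gives pairwise cosine similarities.

# Numpy sketch: row-normalized dot products equal cosine similarities
import numpy as np
E = np.random.randn(5, 4)                                  # 5 toy "word vectors" of dimension 4
E_norm = E / np.sqrt((E ** 2).sum(axis=1, keepdims=True))  # L2-normalize each row
cos = E_norm @ E_norm.T                                    # pairwise cosine similarities
a, b = E[0], E[1]
print(np.isclose(cos[0, 1], a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))))  # True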

Declare the optimizer and initialize the model variables.

# Create optimizer
optimizer = tf.train.GradientDescentOptimizer(learning_rate=1.0).minimize(loss)

# Add variable initializer
init = tf.global_variables_initializer()
sess.run(init)

Iteratively train the word embeddings.

loss_vec = []
loss_x_vec = []
for i in range(generations):
    batch_inputs, batch_labels = generate_batch_data(text_data, batch_size, window_size)
    feed_dict = {x_inputs: batch_inputs, y_target: batch_labels}
    # Run the train step
    sess.run(optimizer, feed_dict=feed_dict)
    # Report the loss
    if (i + 1) % print_loss_every == 0:
        loss_val = sess.run(loss, feed_dict=feed_dict)
        loss_vec.append(loss_val)
        loss_x_vec.append(i + 1)
        print('Loss at step {} : {}'.format(i + 1, loss_val))
    # Validation: print the top 5 nearest words for each validation word
    if (i + 1) % print_valid_every == 0:
        sim = sess.run(similarity, feed_dict=feed_dict)
        for j in range(len(valid_words)):
            valid_word = valid_words[j]
            top_k = 5  # number of nearest neighbors
            nearest = (-sim[j, :]).argsort()[1:top_k + 1]
            log_str = "Nearest to {}:".format(valid_word)
            for k in range(top_k):
                close_word = word_dictionary_rev[nearest[k]]
                log_str = '{} {},'.format(log_str, close_word)
            print(log_str)

Output:
... (intermediate training output omitted) ...
Loss at step 48500 : 1.0923460721969604
Loss at step 49000 : 0.956238865852356
Loss at step 49500 : 0.19344420731067657
Loss at step 50000 : 1.9008288383483887
Nearest to cliche: enticing, mcgrath, media, hitandmiss, guiltypleasure,
Nearest to love: witty, rant, punchdrunk, brisk, scene,
Nearest to hate: commerce, cannot, proven, constant, triste,
Nearest to silly: dramatization, wellshot, fabulous, tuned, death,
Nearest to sad: chouchou, tomcats, enduring, kicking, patric,
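The loop above collects the loss values in loss_vec and loss_x_vec but never plots them, even though matplotlib was imported at the top. A minimal sketch for visualizing the training loss (the title and axis labels are my own choice):

# Plot the recorded training loss (uses the loss_x_vec / loss_vec lists filled above)
plt.plot(loss_x_vec, loss_vec, 'k-')
plt.title('Skip-gram training loss')
plt.xlabel('Generation')
plt.ylabel('NCE loss')
plt.show()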

CBOW

import tensorflow as tf
import matplotlib.pyplot as plt
import numpy as np
import random
import os
import pickle
import string
import requests
import collections
import io
import tarfile
import urllib.request
from nltk.corpus import stopwords
from tensorflow.python.framework import ops
ops.reset_default_graph()
tf.set_random_seed(42)
np.random.seed(42)

data_folder_name = 'temp'
if not os.path.exists(data_folder_name):
    os.makedirs(data_folder_name)

sess = tf.Session()

batch_size = 500
embedding_size = 200
vocabulary_size = 2000
generations = 50000
model_learning_rate = 0.25
num_sampled = int(batch_size/2)    # Number of negative examples to sample.
window_size = 3       # How many words to consider left and right.
save_embeddings_every = 5000
print_valid_every = 5000
print_loss_every = 100
stops = stopwords.words('english')
valid_words = ['love', 'hate', 'happy', 'sad', 'man', 'woman']

# Normalize text
def normalize_text(texts, stops):
    texts = [x.lower() for x in texts]
    texts = [''.join(c for c in x if c not in string.punctuation) for x in texts]
    texts = [''.join(c for c in x if c not in '0123456789') for x in texts]
    texts = [' '.join([word for word in x.split() if word not in stops]) for x in texts]
    texts = [' '.join(x.split()) for x in texts]
    return texts
# Build dictionary of words (word -> integer index); infrequent words map to 'RARE'
def build_dictionary(sentences, vocabulary_size):
    split_sentences = [s.split() for s in sentences]  # list of token lists, e.g. [['w1', 'w2', ...], [...]]
    words = [x for sublist in split_sentences for x in sublist]
    count = [['RARE', -1]]
    count.extend(collections.Counter(words).most_common(vocabulary_size - 1))
    word_dict = {}
    for word, word_count in count:
        word_dict[word] = len(word_dict)
    return word_dict
# Turn text data into lists of integer indices using the dictionary
def text_to_numbers(sentences, word_dict):
    data = []
    for sentence in sentences:
        sentence_data = []
        for word in sentence.split(' '):
            if word in word_dict:
                word_ix = word_dict[word]
            else:
                word_ix = 0
            sentence_data.append(word_ix)
        data.append(sentence_data)
    return data
# Generate batch data randomly (N words behind, target, N words ahead)
def generate_batch_data(sentences, batch_size, window_size, method='skip_gram'):
    batch_data = []
    label_data = []
    while len(batch_data) < batch_size:
        rand_sentence_ix = int(np.random.choice(len(sentences), size=1))
        rand_sentence = sentences[rand_sentence_ix]
        window_sequences = [rand_sentence[max((ix - window_size), 0):(ix + window_size + 1)]
                            for ix, x in enumerate(rand_sentence)]
        # Denote which element of each window is the center word of interest
        label_indices = [ix if ix < window_size else window_size for ix, x in enumerate(window_sequences)]
        # Pull out the center word of interest for each window and create a tuple for each window
        batch, labels = [], []
        if method == 'skip_gram':
            batch_and_labels = [(x[y], x[:y] + x[(y + 1):]) for x, y in zip(window_sequences, label_indices)]
            # Make it into a big list of tuples (target word, surrounding word)
            tuple_data = [(x, y_) for x, y in batch_and_labels for y_ in y]
            if len(tuple_data) > 0:
                batch, labels = [list(x) for x in zip(*tuple_data)]
        elif method == 'cbow':
            batch_and_labels = [(x[:y] + x[(y + 1):], x[y]) for x, y in zip(window_sequences, label_indices)]
            # Only keep windows with a full set of 2*window_size context words
            batch_and_labels = [(x, y) for x, y in batch_and_labels if len(x) == 2 * window_size]
            if len(batch_and_labels) > 0:
                batch, labels = [list(x) for x in zip(*batch_and_labels)]
        elif method == 'doc2vec':
            # For doc2vec we keep only the left-hand window to predict the target word
            batch_and_labels = [(rand_sentence[i:i + window_size], rand_sentence[i + window_size])
                                for i in range(0, len(rand_sentence) - window_size)]
            batch, labels = [list(x) for x in zip(*batch_and_labels)]
            # Add the document index to the batch; the last index of each batch row is the doc index
            batch = [x + [rand_sentence_ix] for x in batch]
        else:
            raise ValueError('Method {} not implemented yet.'.format(method))
        # Extract batch and labels
        batch_data.extend(batch[:batch_size])
        label_data.extend(labels[:batch_size])
    # Trim batch and labels at the end
    batch_data = batch_data[:batch_size]
    label_data = label_data[:batch_size]
    # Convert to numpy arrays
    batch_data = np.array(batch_data)
    label_data = np.transpose(np.array([label_data]))
    return batch_data, label_data
# Load the movie review data
# Check if data was downloaded, otherwise download it and save for future use
def load_movie_data():
    save_folder_name = 'temp'
    pos_file = os.path.join(save_folder_name, 'rt-polaritydata', 'rt-polarity.pos')
    neg_file = os.path.join(save_folder_name, 'rt-polaritydata', 'rt-polarity.neg')
    # Check if files are already downloaded
    if not os.path.exists(os.path.join(save_folder_name, 'rt-polaritydata')):
        movie_data_url = 'http://www.cs.cornell.edu/people/pabo/movie-review-data/rt-polaritydata.tar.gz'
        # Save tar.gz file
        req = requests.get(movie_data_url, stream=True)
        with open('temp_movie_review_temp.tar.gz', 'wb') as f:
            for chunk in req.iter_content(chunk_size=1024):
                if chunk:
                    f.write(chunk)
                    f.flush()
        # Extract tar.gz file into temp folder
        tar = tarfile.open('temp_movie_review_temp.tar.gz', "r:gz")
        tar.extractall(path='temp')
        tar.close()
    pos_data = []
    with open(pos_file, 'r', encoding='latin-1') as f:
        for line in f:
            pos_data.append(line.encode('ascii', errors='ignore').decode())
    pos_data = [x.rstrip() for x in pos_data]
    neg_data = []
    with open(neg_file, 'r', encoding='latin-1') as f:
        for line in f:
            neg_data.append(line.encode('ascii', errors='ignore').decode())
    neg_data = [x.rstrip() for x in neg_data]
    texts = pos_data + neg_data
    target = [1] * len(pos_data) + [0] * len(neg_data)
    return texts, target

data_folder_name = 'temp'
if not os.path.exists(data_folder_name):
    os.makedirs(data_folder_name)

texts, target = load_movie_data()
texts = normalize_text(texts, stops)
target = [target[ix] for ix,x in enumerate(texts) if len(x.split()) > 2]
texts = [x for x in texts if len(x.split()) > 2]

# Create the word dictionary so that words can be looked up by index
word_dictionary = build_dictionary(texts, vocabulary_size)
word_dictionary_rev = dict(zip(word_dictionary.values(), word_dictionary.keys()))
text_data = text_to_numbers(texts, word_dictionary)
valid_examples = [word_dictionary[word] for word in valid_words]

# Initialize the word embeddings to be fitted and declare the model's data placeholders
embeddings = tf.Variable(tf.random_uniform([vocabulary_size, embedding_size], -1., 1.))
x_inputs = tf.placeholder(shape=[batch_size, 2*window_size], dtype=tf.int32)
y_target = tf.placeholder(shape=[batch_size, 1], dtype=tf.int32)
valid_dataset = tf.constant(valid_examples, dtype=tf.int32)

# Build the CBOW input representation: sum the embeddings of the 2*window_size context words
embed = tf.zeros([batch_size, embedding_size])
for element in range(2 * window_size):
    embed += tf.nn.embedding_lookup(embeddings, x_inputs[:, element])

# NCE loss
nce_weights = tf.Variable(tf.truncated_normal([vocabulary_size,embedding_size], stddev=1.0/np.sqrt(embedding_size)))
nce_bias = tf.Variable(tf.zeros(vocabulary_size))
loss = tf.reduce_mean(tf.nn.nce_loss(weights=nce_weights,
                                     biases=nce_bias,
                                     labels=y_target,
                                     inputs=embed,
                                     num_sampled=num_sampled,
                                     num_classes=vocabulary_size))

# Use cosine similarity to find the words closest to the validation words
norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keep_dims=True))
normalized_embeddings = embeddings / norm
valid_embeddings = tf.nn.embedding_lookup(normalized_embeddings, valid_dataset)
similarity = tf.matmul(valid_embeddings, normalized_embeddings, transpose_b=True)

# Saver to store the learned word embeddings
saver = tf.train.Saver({"embeddings": embeddings})

# Declare the optimizer and initialize the variables
optimizer = tf.train.GradientDescentOptimizer(learning_rate=model_learning_rate).minimize(loss)
init = tf.global_variables_initializer()
sess.run(init)

# Training loop: run the optimizer, print the loss, and periodically save the embeddings
loss_vec = []
loss_x_vec = []
text_data = [x for x in text_data if len(x) >= (2 * window_size + 1)]  # keep only sentences long enough for a full window
for i in range(generations):
    batch_inputs, batch_labels = generate_batch_data(text_data, batch_size, window_size, method="cbow")
    feed_dict = {x_inputs: batch_inputs, y_target: batch_labels}
    sess.run(optimizer, feed_dict=feed_dict)
    # Print the loss
    if (i + 1) % print_loss_every == 0:
        loss_val = sess.run(loss, feed_dict=feed_dict)
        loss_vec.append(loss_val)
        loss_x_vec.append(i + 1)
        print('loss at step {}:{}'.format(i + 1, loss_val))
    # Print the nearest neighbors of the validation words
    if (i + 1) % print_valid_every == 0:
        sim = sess.run(similarity, feed_dict=feed_dict)
        for j in range(len(valid_words)):
            valid_word = word_dictionary_rev[valid_examples[j]]
            top_k = 5
            nearest = (-sim[j, :]).argsort()[1:top_k + 1]
            log_str = "Nearest to {}".format(valid_word)
            for k in range(top_k):
                close_word = word_dictionary_rev[nearest[k]]
                print_str = '{} {}'.format(log_str, close_word)
            print(print_str)
    # Save the vocabulary dictionary and the embeddings
    if (i + 1) % save_embeddings_every == 0:
        with open(os.path.join(data_folder_name, 'movie_vocab.pkl'), 'wb') as f:
            pickle.dump(word_dictionary, f)
        model_checkpoint_path = os.path.join(os.getcwd(), data_folder_name, 'cbow_movie_embeddings.ckpt')
        save_path = saver.save(sess, model_checkpoint_path)
        print('Model saved in file: {}'.format(save_path))

Output:
... (intermediate training output omitted) ...
loss at step 49200:2.3909201622009277
loss at step 49300:2.5774049758911133
loss at step 49400:2.058192491531372
loss at step 49500:2.1763217449188232
loss at step 49600:2.239227056503296
loss at step 49700:2.235402822494507
loss at step 49800:2.0298542976379395
loss at step 49900:2.2141225337982178
loss at step 50000:2.3587381839752197
Nearest to love side
Nearest to hate talent
Nearest to happy taste
Nearest to sad energy
Nearest to man irritating
Nearest to woman chemistry
Model saved in file: D:\anaconda\envs\tensorflow\Scripts\Tensorflow机器学习实战指南\temp\cbow_movie_embeddings.ckpt
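The pickled vocabulary and the checkpoint saved above can be loaded back in a later session. A minimal sketch, assuming the graph above (in particular the embeddings variable and data_folder_name) has already been built:

# Reload the saved vocabulary and embeddings (sketch; assumes `embeddings` is defined in the current graph)
with open(os.path.join(data_folder_name, 'movie_vocab.pkl'), 'rb') as f:
    word_dictionary = pickle.load(f)

saver = tf.train.Saver({"embeddings": embeddings})
with tf.Session() as sess2:
    saver.restore(sess2, os.path.join(os.getcwd(), data_folder_name, 'cbow_movie_embeddings.ckpt'))
    trained_embeddings = sess2.run(embeddings)
    print(trained_embeddings.shape)  # (vocabulary_size, embedding_size)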

Reference: https://www.cnblogs.com/pinard/p/7249903.html
