The code in this article comes from Natural Language Processing with TensorFlow (《TensorFlow自然语言处理》) by Thushan Ganegedara.

By the way, my WeChat official account 【野指针小李】 is now live; I look forward to discussing these topics with you there!

Contents

  • 0 Preface
  • 1 Downloading the dataset
  • 2 Reading the dataset
  • 3 Building the dictionary
  • 4 Generating batch data for GloVe
  • 5 Building the co-occurrence matrix
  • 6 The GloVe algorithm
    • 6.1 Defining hyperparameters
    • 6.2 Defining inputs and outputs
    • 6.3 Defining model parameters and other variables
    • 6.4 Defining the model computation
    • 6.5 Similarity computation
    • 6.6 Defining the optimizer
    • 6.7 Running the GloVe model
  • References

0 Preface

The code in this article comes from Natural Language Processing with TensorFlow by Thushan Ganegedara. On top of the author's code I have added some comments of my own (the author's comments are in English; mine were written in Chinese). The code has been uploaded to GitHub; the link is here.

If anything is wrong or not explained clearly, please leave a comment below and I will correct it once I see it.

If you have questions about how GloVe works, you can refer to my earlier article: GloVe原理与公式讲解 (GloVe: principles and formulas).

The TensorFlow version used is 1.8.0.

1 Downloading the dataset

import os
from urllib.request import urlretrieve

url = 'http://www.evanjones.ca/software/'

def maybe_download(filename, expected_bytes):
    """Download a file if not present, and make sure it's the right size."""
    if not os.path.exists(filename):
        filename, _ = urlretrieve(url + filename, filename)
    statinfo = os.stat(filename)
    if statinfo.st_size == expected_bytes:
        print('Found and verified %s' % filename)
    else:
        print(statinfo.st_size)
        raise Exception(
            'Failed to verify ' + filename + '. Can you get to it with a browser?')
    return filename

filename = maybe_download('wikipedia2text-extracted.txt.bz2', 18377035)

If you would rather not download it this way, you can also fetch the file directly from http://www.evanjones.ca/software/wikipedia2text-extracted.txt.bz2.

2 Reading the dataset

This step mainly consists of reading the data into strings, lowercasing it, and tokenizing it. The file is read 1 MB at a time.

import bz2
import nltk
from math import ceil

def read_data(filename):
    """
    Extract the first file enclosed in a zip file as a list of words
    and pre-processes it using the nltk python library
    """
    with bz2.BZ2File(filename) as f:
        data = []
        file_size = os.stat(filename).st_size
        chunk_size = 1024 * 1024  # reading 1 MB at a time as the dataset is moderately large
        print('Reading data...')
        for i in range(ceil(file_size//chunk_size)+1):
            bytes_to_read = min(chunk_size, file_size - (i*chunk_size))
            file_string = f.read(bytes_to_read).decode('utf-8')
            file_string = file_string.lower()  # convert the text to lowercase
            # tokenizes a string to words residing in a list
            file_string = nltk.word_tokenize(file_string)  # tokenization
            data.extend(file_string)
    return data

words = read_data(filename)
print('Data size %d' % len(words))
token_count = len(words)
print('Example words (start): ', words[:10])
print('Example words (end): ', words[-10:])

Output:

Reading data...
Data size 3361192
Example words (start):  ['propaganda', 'is', 'a', 'concerted', 'set', 'of', 'messages', 'aimed', 'at', 'influencing']
Example words (end):  ['favorable', 'long-term', 'outcomes', 'for', 'around', 'half', 'of', 'those', 'diagnosed', 'with']

3 Building the dictionary

The dictionary is built according to the following rules. To make the elements below easier to understand, the sentence "I like to go to school" is used as an example.

  • dictionary: the mapping from words to IDs (e.g. {'I': 0, 'like': 1, 'to': 2, 'go': 3, 'school': 4})
  • reverse_dictionary: the mapping from IDs to words (e.g. {0: 'I', 1: 'like', 2: 'to', 3: 'go', 4: 'school'})
  • count: a list of (word, frequency) tuples (e.g. [('I', 1), ('like', 1), ('to', 2), ('go', 1), ('school', 1)])
  • data: the words of the text, each replaced by its ID (e.g. [0, 1, 2, 3, 2, 4])

Rare words are represented by the special token UNK.

Only the 50,000 most common words are counted into the vocabulary.
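Before the author's build_dataset below, here is a minimal sketch of my own (not from the book) that applies the same idea to the toy sentence above; note that IDs come out in frequency order with UNK fixed at 0, which is also why UNK has ID 0 in the real output further down:

import collections

toy_words = "I like to go to school".split()
toy_vocabulary_size = 6  # 'UNK' plus the 5 distinct words of the toy sentence
toy_count = [['UNK', -1]]
toy_count.extend(collections.Counter(toy_words).most_common(toy_vocabulary_size - 1))
toy_dictionary = {word: i for i, (word, _) in enumerate(toy_count)}
toy_data = [toy_dictionary.get(word, 0) for word in toy_words]
print(toy_count)       # [['UNK', -1], ('to', 2), ...] -- 'to' is the most frequent word
print(toy_dictionary)  # {'UNK': 0, 'to': 1, ...}
print(toy_data)        # every word replaced by its ID; 'to' appears twice with the same ID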

import collections

# we restrict our vocabulary size to 50000
vocabulary_size = 50000

def build_dataset(words):
    count = [['UNK', -1]]
    # Gets only the vocabulary_size most common words as the vocabulary
    # All the other words will be replaced with UNK token
    count.extend(collections.Counter(words).most_common(vocabulary_size - 1))
    dictionary = dict()

    # Create an ID for each word by giving the current length of the dictionary
    # And adding that item to the dictionary
    for word, _ in count:
        dictionary[word] = len(dictionary)

    data = list()
    unk_count = 0
    # Traverse through all the text we have and produce a list
    # where each element corresponds to the ID of the word found at that index
    for word in words:
        # If word is in the dictionary use the word ID,
        # else use the ID of the special token "UNK"
        if word in dictionary:
            index = dictionary[word]
        else:
            index = 0  # dictionary['UNK']
            unk_count = unk_count + 1
        data.append(index)

    # update the count variable with the number of UNK occurences
    count[0][1] = unk_count

    reverse_dictionary = dict(zip(dictionary.values(), dictionary.keys()))
    # Make sure the dictionary is of size of the vocabulary
    assert len(dictionary) == vocabulary_size

    return data, count, dictionary, reverse_dictionary

data, count, dictionary, reverse_dictionary = build_dataset(words)
print('Most common words (+UNK)', count[:5])
print('Sample data', data[:10])
del words  # Hint to reduce memory.

Since it is the same author and the same codebase, I already walked through the logic of this code in my earlier Word2Vec post, 《TensorFlow学习笔记(3)——TensorFlow实现Word2Vec》, section 4. If this code is unclear, you can jump over there.

Output:

Most common words (+UNK) [['UNK', 68751], ('the', 226893), (',', 184013), ('.', 120919), ('of', 116323)]
Sample data [1721, 9, 8, 16479, 223, 4, 5168, 4459, 26, 11597]

4 Generating batch data for GloVe

batch holds the target (center) words; labels holds the words in each target word's context window. For each target word, 2 * window_size + 1 words are read at a time, which is called a span. Each span contains 1 target word and 2 * window_size context words. The function keeps going like this until batch_size data points have been created. Whenever it reaches the end of the word sequence, it starts again from the beginning.

batch: a $1 \times 8$ vector; labels: an $8 \times 1$ vector; weights: a $1 \times 8$ vector holding each pair's contribution to the co-occurrence count of word $i$ and word $j$, namely $\frac{1}{d}$, where $d$ is the distance between the two words.

import numpy as np

data_index = 0

def generate_batch(batch_size, window_size):
    # data_index is updated by 1 everytime we read a data point
    global data_index

    # two numpy arrays to hold target words (batch)
    # and context words (labels)
    batch = np.ndarray(shape=(batch_size), dtype=np.int32)
    labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32)
    weights = np.ndarray(shape=(batch_size), dtype=np.float32)

    # span defines the total window size, where
    # data we consider at an instance looks as follows.
    # [ skip_window target skip_window ]
    span = 2 * window_size + 1

    # The buffer holds the data contained within the span
    buffer = collections.deque(maxlen=span)

    # Fill the buffer and update the data_index
    for _ in range(span):
        buffer.append(data[data_index])
        data_index = (data_index + 1) % len(data)

    # This is the number of context words we sample for a single target word
    num_samples = 2*window_size

    # We break the batch reading into two for loops
    # The inner for loop fills in the batch and labels with
    # num_samples data points using data contained within the span
    # The outer for loop repeats this for batch_size//num_samples times
    # to produce a full batch
    for i in range(batch_size // num_samples):
        k = 0
        # avoid the target word itself as a prediction
        # fill in batch and label numpy arrays
        for j in list(range(window_size)) + list(range(window_size+1, 2*window_size+1)):
            batch[i * num_samples + k] = buffer[window_size]
            labels[i * num_samples + k, 0] = buffer[j]
            # since j skips window_size, j - window_size is never 0
            weights[i * num_samples + k] = abs(1.0/(j - window_size))
            k += 1

        # Everytime we read num_samples data points,
        # we have created the maximum number of datapoints possible
        # within a single span, so we need to move the span by 1
        # to create a fresh new span
        buffer.append(data[data_index])
        data_index = (data_index + 1) % len(data)
    return batch, labels, weights

print('data:', [reverse_dictionary[di] for di in data[:9]])

for window_size in [2, 4]:
    data_index = 0
    batch, labels, weights = generate_batch(batch_size=8, window_size=window_size)
    print('\nwith window_size = %d:' % window_size)
    print('    batch:', [reverse_dictionary[bi] for bi in batch])
    print('    labels:', [reverse_dictionary[li] for li in labels.reshape(8)])
    print('    weights:', [w for w in weights])

Here weights realizes exactly what the paper describes:

In all cases we use a decreasing weighting function, so that word pairs that are $d$ words apart contribute $1/d$ to the total count.

Output:

data: ['propaganda', 'is', 'a', 'concerted', 'set', 'of', 'messages', 'aimed', 'at']

with window_size = 2:
    batch: ['a', 'a', 'a', 'a', 'concerted', 'concerted', 'concerted', 'concerted']
    labels: ['propaganda', 'is', 'concerted', 'set', 'is', 'a', 'set', 'of']
    weights: [0.5, 1.0, 1.0, 0.5, 0.5, 1.0, 1.0, 0.5]

with window_size = 4:
    batch: ['set', 'set', 'set', 'set', 'set', 'set', 'set', 'set']
    labels: ['propaganda', 'is', 'a', 'concerted', 'of', 'messages', 'aimed', 'at']
    weights: [0.25, 0.33333334, 0.5, 1.0, 1.0, 0.5, 0.33333334, 0.25]

Looking at this output with window_size = 4: in this window the target word is set, its left context is ['propaganda', 'is', 'a', 'concerted'], and its right context is ['of', 'messages', 'aimed', 'at']. The entries of batch and labels correspond one to one (e.g. labels[0] is a context word of batch[0]). Taking propaganda as an example, its distance to set is 4 ($4 - 0 = 4$), so weights[0] = 1/4 = 0.25. The window_size = 2 case works the same way.
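As a quick check on the 1/d weighting, here is a one-off sketch of my own (independent of generate_batch) that reproduces the weights printed above for window_size = 4:

window_size = 4
# context positions inside a span, skipping the target word at index window_size
context_positions = list(range(window_size)) + list(range(window_size + 1, 2 * window_size + 1))
# each context word contributes 1/d, where d is its distance to the target
print([1.0 / abs(j - window_size) for j in context_positions])
# [0.25, 0.3333..., 0.5, 1.0, 1.0, 0.5, 0.3333..., 0.25]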

5 Building the co-occurrence matrix

from scipy.sparse import lil_matrix

# We are creating the co-occurance matrix as a compressed sparse column matrix from scipy.
cooc_data_index = 0
dataset_size = len(data)  # We iterate through the full text
skip_window = 4  # How many words to consider left and right.

# The sparse matrix that stores the word co-occurences
cooc_mat = lil_matrix((vocabulary_size, vocabulary_size), dtype=np.float32)
print(cooc_mat.shape)

def generate_cooc(batch_size, skip_window):
    '''
    Generate co-occurence matrix by processing batches of data
    '''
    data_index = 0
    print('Running %d iterations to compute the co-occurance matrix' % (dataset_size//batch_size))
    for i in range(dataset_size//batch_size):
        # Printing progress
        if i > 0 and i % 100000 == 0:
            print('\tFinished %d iterations' % i)

        # Generating a single batch of data
        batch, labels, weights = generate_batch(batch_size, skip_window)
        labels = labels.reshape(-1)

        # Incrementing the sparse matrix entries accordingly
        # inp: ID of the target word i
        # lbl: ID of the context word j
        # w: co-occurrence weight of i and j
        for inp, lbl, w in zip(batch, labels, weights):
            cooc_mat[inp, lbl] += (1.0*w)

# Generate the matrix
generate_cooc(8, skip_window)

# Just printing some parts of co-occurance matrix
print('Sample chunks of co-occurance matrix')

# Basically calculates the highest cooccurance of several chosen word
for i in range(10):
    idx_target = i

    # get the ith row of the sparse matrix and make it dense
    ith_row = cooc_mat.getrow(idx_target)
    ith_row_dense = ith_row.toarray('C').reshape(-1)  # the counts; entries missing from ith_row are 0

    # select target words only with a reasonable words around it.
    # keep sampling until we hit a word whose total count X_i lies between 10 and 50000
    while np.sum(ith_row_dense) < 10 or np.sum(ith_row_dense) > 50000:
        # Choose a random word
        idx_target = np.random.randint(0, vocabulary_size)

        # get the ith row of the sparse matrix and make it dense
        ith_row = cooc_mat.getrow(idx_target)
        ith_row_dense = ith_row.toarray('C').reshape(-1)

    print('\nTarget Word: "%s"' % reverse_dictionary[idx_target])

    # sort_indices orders ith_row_dense by count in ascending order; the result is indices
    sort_indices = np.argsort(ith_row_dense).reshape(-1)
    # flip to descending order of count
    sort_indices = np.flip(sort_indices, axis=0)  # reverse the array (to get max values to the start)

    # printing several context words to make sure cooc_mat is correct
    print('Context word:', end='')
    for j in range(10):
        idx_context = sort_indices[j]
        print('"%s"(id:%d,count:%.2f), ' % (reverse_dictionary[idx_context], idx_context, ith_row_dense[idx_context]), end='')
    print()

Here the author uses lil_matrix from scipy.sparse because, as the original paper points out, the co-occurrence matrix is sparse, so lil_matrix saves memory. lil_matrix(arg1, shape=None, dtype=None, copy=False) is a row-based linked-list sparse matrix. It keeps the non-zero elements in two lists: data holds the non-zero values of each row, and rows holds the column indices of those values. This format is also well suited to adding elements one by one and to fast row-wise access [4]. In short, lil_matrix stores only the rows, columns, and values of the non-zero entries; every other position is implicitly 0.
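Here is a minimal sketch of my own (not from the book) showing the lil_matrix behavior described above:

import numpy as np
from scipy.sparse import lil_matrix

m = lil_matrix((5, 5), dtype=np.float32)
m[0, 1] += 0.5            # entries can be updated incrementally, just like generate_cooc does
m[0, 1] += 0.5
m[3, 2] = 2.0

print(m.data)                 # per-row lists of the non-zero values
print(m.rows)                 # per-row lists of the column indices of those values
print(m.getrow(0).toarray())  # densify a single row: [[0. 1. 0. 0. 0.]]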

Partial output:

(50000, 50000)
Running 420149 iterations to compute the co-occurance matrix
    Finished 100000 iterations
    Finished 200000 iterations
    Finished 300000 iterations
    Finished 400000 iterations
Sample chunks of co-occurance matrix
...
Target Word: "to"
Context word:"the"(id:1,count:2481.16), ","(id:2,count:989.33), "."(id:3,count:689.00), "a"(id:8,count:579.83), "and"(id:5,count:573.08), "be"(id:30,count:553.83), "of"(id:4,count:470.50), "UNK"(id:0,count:470.00), "in"(id:6,count:412.25), "is"(id:9,count:283.42),

The logic here is simple: each pass grabs 8 data points (batch_size), so 420149 passes (3361192 // 8) are needed in total. Each pass yields a target word, its 8 context words (window_size = 4), and the weighted co-occurrence of the target with each of those context words in the window, which is then added into the co-occurrence matrix.

6 The GloVe algorithm

6.1 Defining hyperparameters

batch_size: the number of samples in each batch; embedding_size: the dimensionality of the embedding vectors; window_size: the size of the context window; valid_examples: randomly chosen validation words (constant once chosen); epsilon: keeps the $\log$ in the loss from diverging.

import random

batch_size = 128  # Data points in a single batch
embedding_size = 128  # Dimension of the embedding vector.
window_size = 4  # How many words to consider left and right.

# We pick a random validation set to sample nearest neighbors
valid_size = 16  # Random set of words to evaluate similarity on.
# We sample valid datapoints randomly from a large window without always being deterministic
valid_window = 50

# When selecting valid examples, we select some of the most frequent words as well as
# some moderately rare words as well
valid_examples = np.array(random.sample(range(valid_window), valid_size))
valid_examples = np.append(valid_examples, random.sample(range(1000, 1000+valid_window), valid_size), axis=0)

num_sampled = 32  # Number of negative examples to sample.

epsilon = 1  # used for the stability of log in the loss function

6.2 Defining inputs and outputs

Create the placeholders for the training inputs and outputs of each batch, and a constant tensor for the validation set.

import tensorflow as tf

tf.reset_default_graph()

# Training input data (target word IDs).
train_dataset = tf.placeholder(tf.int32, shape=[batch_size])
# Training input label data (context word IDs)
train_labels = tf.placeholder(tf.int32, shape=[batch_size])
# Validation input data, we don't need a placeholder
# as we have already defined the IDs of the words selected
# as validation data
valid_dataset = tf.constant(valid_examples, dtype=tf.int32)

valid_dataset here corresponds to valid_examples from section 6.1, while train_dataset and train_labels are used in every batch to look up the word vectors; see section 6.4 for details.

6.3 Defining model parameters and other variables

in_embeddings: $W$, $50000 \times 128$; in_bias_embeddings: $b$, a vector of length $50000$; out_embeddings: $\tilde{W}$, $50000 \times 128$; out_bias_embeddings: $\tilde{b}$, a vector of length $50000$.

The word vectors are initialized uniformly in $[-1, 1]$, and the biases uniformly in $[0, 0.01]$.

# Variables.
in_embeddings = tf.Variable(
    tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0), name='embeddings')
in_bias_embeddings = tf.Variable(
    tf.random_uniform([vocabulary_size], 0.0, 0.01, dtype=tf.float32), name='embeddings_bias')

out_embeddings = tf.Variable(
    tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0), name='embeddings')
out_bias_embeddings = tf.Variable(
    tf.random_uniform([vocabulary_size], 0.0, 0.01, dtype=tf.float32), name='embeddings_bias')

This defines the word vector matrices $W$ and $\tilde{W}$, as well as the bias terms $b$ and $\tilde{b}$ from the loss function.

6.4 Defining the model computation

Four lookups are defined: embed_in, embed_out, embed_bias_in, embed_bias_out.

weights_x: a vector of shape [batch_size], the weighting function $f(X_{ij})$.

x_ij: a vector of shape [batch_size], the co-occurrence count $X_{ij}$ of words $i$ and $j$.

Loss function: $J = \sum_{i, j=1}^{V} f(X_{ij}) \left( w_i^T \tilde{w}_j + b_i + \tilde{b}_j - \log(1 + X_{ij}) \right)^2$

# Look up embeddings for inputs and outputs
# Have two separate embedding vector spaces for inputs and outputs
embed_in = tf.nn.embedding_lookup(in_embeddings, train_dataset)
embed_out = tf.nn.embedding_lookup(out_embeddings, train_labels)
embed_bias_in = tf.nn.embedding_lookup(in_bias_embeddings, train_dataset)
embed_bias_out = tf.nn.embedding_lookup(out_bias_embeddings, train_labels)

# weights used in the cost function
weights_x = tf.placeholder(tf.float32, shape=[batch_size], name='weights_x')
# Cooccurence value for that position
x_ij = tf.placeholder(tf.float32, shape=[batch_size], name='x_ij')

# Compute the loss defined in the paper. Note that
# I'm not following the exact equation given (which is computing a pair of words at a time)
# I'm calculating the loss for a batch at one time, but the calculations are identical.
# I also made an assumption about the bias, that it is a smaller type of embedding
loss = tf.reduce_mean(
    weights_x * (tf.reduce_sum(embed_in*embed_out, axis=1) + embed_bias_in + embed_bias_out - tf.log(epsilon+x_ij))**2)

Here the word vectors and bias terms are looked up with each batch's train_dataset and train_labels, and the results are plugged into the loss for computation. Since the paper notes that $\log(0)$ diverges, $\log(1 + X_{ij})$ is used to work around the problem.
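For intuition, here is a NumPy sketch of my own that mirrors (rather than replaces) the batched loss computed by the TensorFlow graph above:

import numpy as np

def glove_batch_loss(embed_in, embed_out, bias_in, bias_out, weights_x, x_ij, epsilon=1.0):
    # embed_in, embed_out: [batch_size, embedding_size]; the remaining arguments: [batch_size]
    dot = np.sum(embed_in * embed_out, axis=1)            # w_i^T w~_j for every pair in the batch
    residual = dot + bias_in + bias_out - np.log(epsilon + x_ij)
    return np.mean(weights_x * residual ** 2)             # mean over the batch, like tf.reduce_mean

# toy check: a batch of 2 pairs with 3-dimensional embeddings
rng = np.random.RandomState(0)
print(glove_batch_loss(rng.rand(2, 3), rng.rand(2, 3), rng.rand(2), rng.rand(2),
                       weights_x=np.ones(2), x_ij=np.array([3.0, 0.5])))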

6.5 Similarity computation

This part uses cosine similarity to measure how similar two words are; the details are in section 6.7.

# Compute the similarity between minibatch examples and all embeddings.
# We use the cosine distance:
embeddings = (in_embeddings + out_embeddings)/2.0  # X = U + V
norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keepdims=True))  # L2 norm of each row
normalized_embeddings = embeddings / norm  # L2 normalization
valid_embeddings = tf.nn.embedding_lookup(normalized_embeddings, valid_dataset)  # look up the validation words
similarity = tf.matmul(valid_embeddings, tf.transpose(normalized_embeddings))  # cosine similarity

The cosine similarity here relies on L2 normalization, which makes $|\vec{A}| \times |\vec{B}| = 1$, so the dot product of the two normalized vectors directly gives the cosine similarity between the corresponding words.
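A small NumPy sketch of my own showing why normalizing first turns a plain dot product into cosine similarity:

import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])

# cosine similarity computed directly
direct = a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))

# the same value from L2-normalized vectors, which is what the graph above does for all embeddings at once
a_hat, b_hat = a / np.linalg.norm(a), b / np.linalg.norm(b)
print(direct, a_hat.dot(b_hat))  # both are ≈ 0.9746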

6.6 Defining the optimizer

The Adagrad optimizer is used here.

# Optimizer.
optimizer = tf.train.AdagradOptimizer(1.0).minimize(loss)

6.7 Running the GloVe model

Train for num_steps steps. At regular intervals the algorithm is evaluated on a fixed validation set, printing the words closest to each validation word.

From the results we can see that, as training progresses, the words closest to each validation word keep changing.

num_steps = 100001
glove_loss = []

average_loss = 0
with tf.Session(config=tf.ConfigProto(allow_soft_placement=True)) as session:
    tf.global_variables_initializer().run()
    print('Initialized')

    for step in range(num_steps):

        # generate a single batch (data,labels,co-occurance weights)
        # the co-occurrence matrix is already computed, so the batch_weights returned here are not used
        batch_data, batch_labels, batch_weights = generate_batch(batch_size, skip_window)

        # Computing the weights required by the loss function
        batch_weights = []  # weighting used in the loss function
        batch_xij = []  # weighted frequency of finding i near j

        # Compute the weights for each datapoint in the batch
        for inp, lbl in zip(batch_data, batch_labels.reshape(-1)):
            # 100: x_max, 0.75: 3/4, point_weight: f(X_ij), batch_xij: co-occurrence count of words i and j
            point_weight = (cooc_mat[inp, lbl]/100.0)**0.75 if cooc_mat[inp, lbl] < 100.0 else 1.0
            batch_weights.append(point_weight)
            batch_xij.append(cooc_mat[inp, lbl])
        batch_weights = np.clip(batch_weights, -100, 1)
        batch_xij = np.asarray(batch_xij)

        # Populate the feed_dict and run the optimizer (minimize loss)
        # and compute the loss. Specifically we provide
        # train_dataset/train_labels: training inputs and training labels
        # weights_x: measures the importance of a data point with respect to how much those two words co-occur
        # x_ij: co-occurence matrix value for the row and column denoted by the words in a datapoint
        feed_dict = {train_dataset: batch_data.reshape(-1), train_labels: batch_labels.reshape(-1),
                     weights_x: batch_weights, x_ij: batch_xij}
        _, l = session.run([optimizer, loss], feed_dict=feed_dict)

        # Update the average loss variable
        average_loss += l
        if step % 2000 == 0:
            if step > 0:
                average_loss = average_loss / 2000
            # The average loss is an estimate of the loss over the last 2000 batches.
            print('Average loss at step %d: %f' % (step, average_loss))
            glove_loss.append(average_loss)
            average_loss = 0

        # Here we compute the top_k closest words for a given validation word
        # in terms of the cosine distance
        # We do this for all the words in the validation set
        # Note: This is an expensive step
        if step % 10000 == 0:
            sim = similarity.eval()
            for i in range(valid_size):
                valid_word = reverse_dictionary[valid_examples[i]]
                top_k = 8  # number of nearest neighbors
                nearest = (-sim[i, :]).argsort()[1:top_k+1]
                log = 'Nearest to %s:' % valid_word
                for k in range(top_k):
                    close_word = reverse_dictionary[nearest[k]]
                    log = '%s %s,' % (log, close_word)
                print(log)

    final_embeddings = normalized_embeddings.eval()

Partial output (step 0, i.e. the initial state, and step 100000):

Average loss at step 0: 8.672687
Nearest to ,: pitcher, discharges, pigs, tolerant, fuzzy, medium-, on-campus, eduskunta,
Nearest to this: mediastinal, destined, implementing, honolulu, non-mormon, juniors, tycho, powered,
Nearest to most: translating, absolute, 111, bechet, adam, aleksey, penetrators, rake,
Nearest to but: motown, ridged, beginnings, shareholder, resurfacing, english, intelligence, o'dea,
Nearest to is: higher-quality, kitchener, kelley, confronted, m15, stanislaus, depictions, buf,
Nearest to ): encyclopedic, commute, symbiotic, forecasts, 1993., 243-year, cenwealh, inclosure,
Nearest to not: toulon, discount, dunblane, vividly, recorded, olive, afrikaansche, german-speaking,
Nearest to with: tofu, expansive, penned, grids, 102, drought, merced, cunningham,
Nearest to ;: all-electric, internationally-recognised, czars, 1216, kana, immaculate, innings, wnba,
Nearest to a: non-residents, presumption, cephas, tau, stepfather, beside, aorist, vom,
Nearest to for: bitterroots, sx-64, weekday, edificio, sousley, self-proclaimed, whoever, liquid,
Nearest to have: dissenting, barret, psilocybin, massamba-débat, kopfstein, 5.5, fillmore, innovator,
Nearest to was: ., is, most, wheelchair, 1575, warm-blooded, dynamically, 1913.,
Nearest to 's: eoka, melancholia, downs, gallipoli, reichswehr, easter, chest, construed,
Nearest to were: 1138, djuna, 3, beni, high-grade, slander, agency, séamus,
Nearest to be: knelt, horrors, assistant, hospitalised, 1802, fierce, cinemas, magnified,
...
Average loss at step 100000: 0.019544
Nearest to ,: ., the, in, a, of, and, ,, is,
Nearest to this: ), (, ``, UNK, or, ., in, ,,
Nearest to most: ., the, of, ,, and, for, a, to,
Nearest to but: ), UNK, '', or, and, ,, in, .,
Nearest to is: 's, the, of, at, world, ., in, on,
Nearest to ): were, in, ., and, ,, the, by, is,
Nearest to not: (, ``, UNK, ), '', of, 's, the,
Nearest to with: been, had, to, has, be, that, a, may,
Nearest to ;: a, such, an, ,, for, and, with, is,
Nearest to a: the, was, ., in, and, ,, to, of,
Nearest to for: are, by, and, ,, in, to, the, was,
Nearest to have: is, was, that, also, this, not, has, a,
Nearest to was: ., of, in, and, ,, 's, for, to,
Nearest to 's: it, is, has, there, this, are, was, not,
Nearest to were: a, as, is, with, and, ,, to, for,
Nearest to be: was, it, when, had, that, his, in, ,,

From the results, we can see that as training progresses the words closest to each validation word change and become more reasonable (for example, the initial nearest neighbors of be look arbitrary, while after 100000 steps they include was).

As for the overall logic of the code:

  1. Each iteration generates a batch of target words batch_data and the context words batch_labels in their windows;
  2. iterating over these pairs gives the weighting function $f(X_{ij}) = \left(\frac{X_{ij}}{x_{\max}}\right)^{0.75}$ and extracts the co-occurrence counts $X_{ij}$ (see the sketch after this list);
  3. np.clip() is used to cap the weighting function so it never exceeds 1;
  4. the data is fed into the loss function loss defined in 6.4 and the optimizer optimizer defined in 6.6 for training.
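Here is a minimal sketch of the weighting function used in step 2 (x_max = 100 and the exponent 0.75 are the values hard-coded in the training loop above):

def f(x_ij, x_max=100.0, alpha=0.75):
    # GloVe weighting: (x / x_max)^alpha below x_max, capped at 1 above it
    return (x_ij / x_max) ** alpha if x_ij < x_max else 1.0

print([f(x) for x in (0.5, 10.0, 100.0, 2481.16)])
# e.g. f(10.0) = (10/100)**0.75 ≈ 0.178; any count >= 100 gets weight 1.0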

References

[1] Thushan Ganegedara. Natural Language Processing with TensorFlow (TensorFlow自然语言处理)[M]. 北京: 机械工业出版社, 2019: 88-90.
[2] Jeffrey Pennington, Richard Socher, Christopher D. Manning. Glove: Global Vectors for Word Representation[C]// Conference on Empirical Methods in Natural Language Processing. 2014.
[3] AI研习社-译站. 【官方】【中英】CS224n 斯坦福深度自然语言处理课 @雷锋字幕组[EB/OL]. (2019-01-22)[2021-07-06]. https://www.bilibili.com/video/BV1pt411h7aT?p=3
[4] -柚子皮-. SciPy教程 - 稀疏矩阵库scipy.sparse[EB/OL]. (2014-12-06)[2021-07-08]. https://blog.csdn.net/pipisorry/article/details/41762945
[5] TaoTao Yu. embedding_lookup的学习笔记[EB/OL]. (2019-08-04)[2021-07-08]. https://blog.csdn.net/hit0803107/article/details/98377030
