Assignment #1

3. word2vec

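(a) Suppose you are given a predicted word vector $v_{c}$ corresponding to the center word c of the skip-gram model, and word prediction is made with the softmax function used in word2vec models:

$\widehat{y}_{o}=p(o|c)=\frac{\exp(u_{o}^{T}v_{c})}{\sum_{w=1}^{W}\exp(u_{w}^{T}v_{c})}$

where w denotes the w-th word and $u_{w}(w=1,...,W)$ are the "output" word vectors of all words in the vocabulary. Assume the cross-entropy loss is applied to this prediction and that word o is the expected word (the o-th element of the one-hot label vector is 1). Derive the gradient with respect to $v_{c}$.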

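Hint: it helps to use the notation from question 2. For example, let $\widehat{y}$ be the vector of softmax predictions for every word and $y$ the expected word vector; the loss function is then:

$J_{softmax-CE}(o,v_{c},U)=CE(y,\widehat{y})$

where $U=[u_{1},u_{2},...,u_{W}]$ is the matrix of all the output vectors. Make sure to state the orientation of your vectors and matrices.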
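One way to carry out the derivation for (a) is sketched below. It assumes that $U$ stores the output vectors as columns (so $U$ is $d\times W$), that $v_{c}$, $y$ and $\widehat{y}$ are column vectors, and it introduces the score vector $\theta$ as a helper symbol that is not part of the problem statement. It relies on the standard softmax cross-entropy result $\partial CE/\partial\theta=\widehat{y}-y$.

```latex
\begin{aligned}
\theta &= U^{T}v_{c}, \qquad \widehat{y}=\mathrm{softmax}(\theta), \qquad
J = -\sum_{w=1}^{W} y_{w}\log\widehat{y}_{w} = -\log\widehat{y}_{o} \\
\frac{\partial J}{\partial \theta} &= \widehat{y}-y \\
\frac{\partial J}{\partial v_{c}} &= U(\widehat{y}-y)
 = \sum_{w=1}^{W}\widehat{y}_{w}u_{w}-u_{o}
\end{aligned}
```

The last form reads as "expected output vector under the model minus the observed output vector", which is the quantity the code accumulates into `gradPred`.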

(b) Derive the gradients for the "output" word vectors $u_{w}$ from the previous part (including $u_{o}$).
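A possible answer for (b), under the same assumptions as before ($U$ is $d\times W$ with output vectors as columns, and $y$ is the one-hot label vector), follows from the chain rule applied to the score $u_{w}^{T}v_{c}$:

```latex
\frac{\partial J}{\partial u_{w}} = (\widehat{y}_{w}-y_{w})\,v_{c} =
\begin{cases}
(\widehat{y}_{o}-1)\,v_{c}, & w=o \\
\widehat{y}_{w}\,v_{c}, & w\neq o
\end{cases}
\qquad\Longrightarrow\qquad
\frac{\partial J}{\partial U} = v_{c}\,(\widehat{y}-y)^{T}
```

If the output vectors are stored as rows instead, as in the code portion, the matrix gradient is the transpose, $(\widehat{y}-y)\,v_{c}^{T}$.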

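(c) Repeat (a) and (b) using the negative sampling loss for the predicted vector $v_{c}$, with expected output word o. Assume that K negative samples (words) are drawn and that, for simplicity of notation, they are denoted $1,2,...,K$ ( $o\notin \left \{ 1,...,K \right \}$ ). For a given word o, denote its output vector by $u_{o}$. The negative sampling loss in this case is:

$J_{neg-sample}(o,v_{c},U)=-\log(\sigma (u_{o}^{T}v_{c}))-\sum_{k=1}^{K}\log(\sigma (-u_{k}^{T}v_{c}))$

where $\sigma (\cdot )$ is the sigmoid function.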
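The gradients for (c) can be sketched as follows, using only the identities $\sigma'(x)=\sigma(x)(1-\sigma(x))$ and $1-\sigma(-x)=\sigma(x)$. This is one way to write the result, not necessarily the form expected by the official solution.

```latex
\begin{aligned}
\frac{\partial J_{neg-sample}}{\partial v_{c}}
 &= \bigl(\sigma(u_{o}^{T}v_{c})-1\bigr)u_{o}
  + \sum_{k=1}^{K}\bigl(1-\sigma(-u_{k}^{T}v_{c})\bigr)u_{k} \\
\frac{\partial J_{neg-sample}}{\partial u_{o}}
 &= \bigl(\sigma(u_{o}^{T}v_{c})-1\bigr)v_{c} \\
\frac{\partial J_{neg-sample}}{\partial u_{k}}
 &= \bigl(1-\sigma(-u_{k}^{T}v_{c})\bigr)v_{c}
  = \sigma(u_{k}^{T}v_{c})\,v_{c}, \qquad k=1,...,K
\end{aligned}
```

Only $K+1$ output vectors receive a non-zero gradient, compared with all $W$ vectors for the softmax loss, which is what makes negative sampling cheaper per update.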

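(d) Given the previous parts and a set of context words $[word_{c-m},...,word_{c-1},word_{c},word_{c+1},...,word_{c+m}]$ , derive the gradients for all of the word vectors for skip-gram and CBOW, where m is the context size. Denote the "input" and "output" word vectors of $word_{k}$ by $v_{k}$ and $u_{k}$ respectively.

Hint: you may use $F(o,v_{c})$ (where o is the expected word) as a placeholder for the $J_{softmax-CE}(o,v_{c},...)$ or $J_{neg-sample}(o,v_{c},...)$ cost functions; this is a useful abstraction for the coding part. Your answer may then contain terms of the form $\frac{\partial F(o,v_{c})}{\partial ...}$ .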

For skip-gram, the cost for a context centered around c is:

$J_{skip-gram}(word_{c-m...c+m})=\sum_{-m\leqslant j\leqslant m,j\neq 0}F(w_{c+j},v_{c})$

where $w_{c+j}$ refers to the word at the j-th index from the center.

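CBOW is slightly different. Instead of using $v_{c}$ as the predicted vector, it uses $\widehat{v}$ as defined below. For CBOW, we sum the input word vectors of the context words:

$\widehat{v}=\sum_{-m\leqslant j\leqslant m,j\neq 0}v_{c+j}$

The CBOW cost is then:

$J_{CBOW}(word_{c-m...c+m})=F(w_{c},\widehat{v})$

Note: to keep the $\widehat{v}$ notation consistent, for example in the code portion, for skip-gram $\widehat{v}=v_{c}$ .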
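A sketch of the gradients for (d), written in terms of the placeholder $F$ and obtained by differentiating the two cost definitions term by term. It assumes each word in the window is distinct; if a word occurs more than once in the context, its contributions add up.

```latex
\begin{aligned}
\textbf{skip-gram:}\quad
\frac{\partial J_{skip-gram}}{\partial U}
 &= \sum_{-m\leqslant j\leqslant m,\,j\neq 0}\frac{\partial F(w_{c+j},v_{c})}{\partial U}, \\
\frac{\partial J_{skip-gram}}{\partial v_{c}}
 &= \sum_{-m\leqslant j\leqslant m,\,j\neq 0}\frac{\partial F(w_{c+j},v_{c})}{\partial v_{c}},
 \qquad
\frac{\partial J_{skip-gram}}{\partial v_{k}} = 0 \;\; (k\neq c) \\[6pt]
\textbf{CBOW:}\quad
\frac{\partial J_{CBOW}}{\partial U}
 &= \frac{\partial F(w_{c},\widehat{v})}{\partial U}, \\
\frac{\partial J_{CBOW}}{\partial v_{c+j}}
 &= \frac{\partial F(w_{c},\widehat{v})}{\partial \widehat{v}}
 \;\; (-m\leqslant j\leqslant m,\ j\neq 0),
 \qquad
\frac{\partial J_{CBOW}}{\partial v_{k}} = 0 \;\; (k\notin\{c-m,...,c+m\}\setminus\{c\})
\end{aligned}
```

For CBOW the factor $\partial\widehat{v}/\partial v_{c+j}$ is the identity because $\widehat{v}$ is a plain sum, so every context word receives the same gradient.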

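(e) Implement the word2vec models and train your own word vectors with stochastic gradient descent (SGD).

First, write a helper function that normalizes the rows of a matrix: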

    import numpy as np


    def normalizeRows(x):
        """ Row normalization function

        Implement a function that normalizes each row of a matrix to have
        unit length.
        """

        # YOUR CODE HERE
        # Divide every row by its L2 norm.
        x = np.array([x_row / np.sqrt(np.sum(x_row**2)) for x_row in x])
        # END YOUR CODE

        return x
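The same row normalization can be written without a Python-level loop. The sketch below is an alternative, not the assignment's reference solution; `normalize_rows_vectorized` is a hypothetical name, and the function assumes a 2-D input with no all-zero rows (a zero row would cause a division by zero).

```python
import numpy as np


def normalize_rows_vectorized(x):
    """Scale every row of a 2-D array to unit L2 norm (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    # keepdims=True keeps the norms as a column so they broadcast across each row.
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / norms


print(normalize_rows_vectorized([[3.0, 4.0], [1.0, 2.0]]))
```

The input used here is chosen so that the first row becomes `[0.6, 0.8]`, matching the first row of the `normalizeRows` test output reported in the results.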

Then, implement the cost and gradient functions for softmax and negative sampling:

    def softmaxCostAndGradient(predicted, target, outputVectors, dataset):
        """ Softmax cost function for word2vec models

        Implement the cost and gradients for one predicted word vector
        and one target word vector as a building block for word2vec
        models, assuming the softmax prediction function and cross
        entropy loss.

        Arguments:
        predicted -- numpy ndarray, predicted word vector (\hat{v} in
                     the written component)
        target -- integer, the index of the target word
        outputVectors -- "output" vectors (as rows) for all tokens
        dataset -- needed for negative sampling, unused here.

        Return:
        cost -- cross entropy cost for the softmax word prediction
        gradPred -- the gradient with respect to the predicted word
               vector
        grad -- the gradient with respect to all the other word
               vectors

        We will not provide starter code for this function, but feel
        free to reference the code you previously wrote for this
        assignment!
        """

        # predicted:(d,)
        # outputVectors:(W, d)

        # YOUR CODE HERE
        W = len(outputVectors)
        d = len(predicted)
        y = np.zeros(shape=W)
        y[target] = 1
        y_hat_denominator = np.sum([np.exp(np.dot(u, predicted)) for u in outputVectors])
        y_hat = np.array([np.exp(np.dot(outputVectors[i], predicted)) / y_hat_denominator for i in range(W)])
        y_target_hat = y_hat[target]
        cost = -np.log(y_target_hat)
        gradPred = np.zeros(shape=d)
        grad = np.zeros(shape=(W, d))
        for w in range(W):
            if w == target:
                gradPred += (y_hat[w] - 1) * outputVectors[w]
                grad[w] = (y_hat[w] - 1) * predicted
            else:
                gradPred += y_hat[w] * outputVectors[w]
                grad[w] = y_hat[w] * predicted
        # y_hat:(W,)
        # y:(W,)
        # gradPred:(d,)
        # grad:(W,d)
        # END YOUR CODE

        return cost, gradPred, grad


    def getNegativeSamples(target, dataset, K):
        """ Samples K indexes which are not the target """

        indices = [None] * K
        for k in range(K):
            newidx = dataset.sampleTokenIdx()
            while newidx == target:
                newidx = dataset.sampleTokenIdx()
            indices[k] = newidx
        return indices


    def negSamplingCostAndGradient(predicted, target, outputVectors, dataset,
                                   K=10):
        """ Negative sampling cost function for word2vec models

        Implement the cost and gradients for one predicted word vector
        and one target word vector as a building block for word2vec
        models, using the negative sampling technique. K is the sample
        size.

        Note: See test_word2vec below for dataset's initialization.

        Arguments/Return Specifications: same as softmaxCostAndGradient
        """

        # predicted:(d,)
        # outputVectors:(W, d)

        # Sampling of indices is done for you. Do not modify this if you
        # wish to match the autograder and receive points!
        indices = [target]
        indices.extend(getNegativeSamples(target, dataset, K))

        # YOUR CODE HERE
        W = len(outputVectors)
        d = len(predicted)
        cost = -np.log(sigmoid(np.dot(outputVectors[target], predicted)))
        gradPred = (sigmoid(np.dot(outputVectors[target], predicted)) - 1) * outputVectors[target]
        grad = np.zeros(shape=(W, d))
        for k in indices:
            if k == target:
                grad[k] = (sigmoid(np.dot(outputVectors[k], predicted)) - 1) * predicted
            else:
                cost -= np.log(sigmoid(-np.dot(outputVectors[k], predicted)))
                gradPred -= (sigmoid(-np.dot(outputVectors[k], predicted)) - 1) * outputVectors[k]
                grad[k] -= (sigmoid(-np.dot(outputVectors[k], predicted)) - 1) * predicted
        # END YOUR CODE

        return cost, gradPred, grad
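The softmax cost and its gradients can also be expressed with matrix operations and checked against a finite-difference estimate. The following is a minimal sketch under stated assumptions: `softmax_cost_and_gradient_vectorized` is a hypothetical name, the `dataset` argument is dropped because softmax does not use it, and subtracting the maximum score is an added numerical-stability step that the loop version above does not perform (it leaves the softmax value unchanged).

```python
import numpy as np


def softmax_cost_and_gradient_vectorized(predicted, target, output_vectors):
    """Cross-entropy cost and gradients for one (predicted, target) pair.

    predicted: shape (d,); output_vectors: shape (W, d), one vector per row.
    """
    scores = output_vectors @ predicted           # (W,) values u_w^T v
    scores = scores - np.max(scores)              # shift for numerical stability
    y_hat = np.exp(scores) / np.sum(np.exp(scores))
    cost = -np.log(y_hat[target])
    delta = y_hat.copy()
    delta[target] -= 1.0                          # y_hat - y
    grad_pred = output_vectors.T @ delta          # (d,)  sum_w (y_hat_w - y_w) u_w
    grad = np.outer(delta, predicted)             # (W, d) row w is (y_hat_w - y_w) v
    return cost, grad_pred, grad


# Compare the analytic gradient w.r.t. the predicted vector with central differences.
rng = np.random.default_rng(0)
U = rng.normal(size=(5, 3))
v = rng.normal(size=3)
cost, grad_pred, grad = softmax_cost_and_gradient_vectorized(v, 2, U)

eps = 1e-6
numeric = np.zeros_like(v)
for i in range(v.size):
    step = np.zeros_like(v)
    step[i] = eps
    c_plus = softmax_cost_and_gradient_vectorized(v + step, 2, U)[0]
    c_minus = softmax_cost_and_gradient_vectorized(v - step, 2, U)[0]
    numeric[i] = (c_plus - c_minus) / (2 * eps)

max_abs_error = np.max(np.abs(numeric - grad_pred))
print(max_abs_error)
```

A small `max_abs_error` indicates that the analytic gradient agrees with the numerical one; the same kind of check can be applied to the negative sampling function, provided the sampled indices are held fixed between evaluations.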

Finally, implement the cost and gradient for the skip-gram model:

    def skipgram(currentWord, C, contextWords, tokens, inputVectors, outputVectors,
                 dataset, word2vecCostAndGradient=softmaxCostAndGradient):
        """ Skip-gram model in word2vec

        Implement the skip-gram model in this function.

        Arguments:
        currentWord -- a string of the current center word
        C -- integer, context size
        contextWords -- list of no more than 2*C strings, the context words
        tokens -- a dictionary that maps words to their indices in
                  the word vector list
        inputVectors -- "input" word vectors (as rows) for all tokens
        outputVectors -- "output" word vectors (as rows) for all tokens
        word2vecCostAndGradient -- the cost and gradient function for
                                   a prediction vector given the target
                                   word vectors, could be one of the two
                                   cost functions you implemented above.

        Return:
        cost -- the cost function value for the skip-gram model
        grad -- the gradient with respect to the word vectors
        """

        # currentWord: a single word (string)
        # C: context window size
        # contextWords: list of context words
        # tokens: dictionary mapping each word to its index
        # inputVectors: (W, d); (5, 3) in the test run
        # outputVectors: (W, d); (5, 3) in the test run
        cost = 0.0
        gradIn = np.zeros(inputVectors.shape)
        gradOut = np.zeros(outputVectors.shape)

        # YOUR CODE HERE
        # For skip-gram the predicted vector is the center word's input vector.
        predicted = inputVectors[tokens[currentWord]]
        for target in contextWords:
            dcost, dgradPred, dgrad = word2vecCostAndGradient(predicted, tokens[target], outputVectors, dataset)
            # dgradPred: (d,); (3,) in the test run
            # dgrad: (W, d); (5, 3) in the test run
            cost += dcost
            gradIn[tokens[currentWord]] += dgradPred
            gradOut += dgrad
        # END YOUR CODE

        return cost, gradIn, gradOut
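For comparison, the CBOW cost from part (d) can be implemented along the same lines. The sketch below is not part of the original post: `cbow_sketch` and the compact `softmax_cost_and_gradient` helper are hypothetical names introduced so the example runs on its own, and the toy vocabulary and random vectors are made up for illustration.

```python
import numpy as np


def softmax_cost_and_gradient(predicted, target, output_vectors, dataset=None):
    """Softmax cross-entropy cost and gradients (output vectors stored as rows)."""
    scores = output_vectors @ predicted
    y_hat = np.exp(scores - np.max(scores))
    y_hat /= y_hat.sum()
    delta = y_hat.copy()
    delta[target] -= 1.0                                   # y_hat - y
    return (-np.log(y_hat[target]),
            output_vectors.T @ delta,
            np.outer(delta, predicted))


def cbow_sketch(current_word, context_words, tokens, input_vectors,
                output_vectors, dataset=None,
                cost_and_gradient=softmax_cost_and_gradient):
    """Predict the center word from the sum of the context input vectors."""
    grad_in = np.zeros(input_vectors.shape)
    context_idx = [tokens[w] for w in context_words]
    predicted = input_vectors[context_idx].sum(axis=0)     # v_hat
    cost, grad_pred, grad_out = cost_and_gradient(
        predicted, tokens[current_word], output_vectors, dataset)
    # Every occurrence of a context word receives dF/dv_hat, so repeats add up.
    for idx in context_idx:
        grad_in[idx] += grad_pred
    return cost, grad_in, grad_out


# Toy vocabulary and random vectors, for illustration only.
tokens = {"a": 0, "b": 1, "c": 2, "d": 3, "e": 4}
rng = np.random.default_rng(0)
input_vectors = rng.normal(size=(5, 3))
output_vectors = rng.normal(size=(5, 3))
cost, grad_in, grad_out = cbow_sketch(
    "c", ["a", "b", "a", "d"], tokens, input_vectors, output_vectors)
print(cost)
print(grad_in)
```

In this example the word "a" appears twice in the context, so its row of `grad_in` is twice the row of "b", while the rows of the center word "c" and the unused word "e" stay at zero. This mirrors the skip-gram code, where only the center word's row of `gradIn` is non-zero.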

The results are as follows:

    Testing normalizeRows...
    [[0.6        0.8       ]
     [0.4472136  0.89442719]]
    ==== Gradient check for skip-gram ====
    Gradient check passed!
    Gradient check passed!
    === Results ===
    (11.16610900153398, array([[ 0.        ,  0.        ,  0.        ],
           [ 0.        ,  0.        ,  0.        ],
           [-1.26947339, -1.36873189,  2.45158957],
           [ 0.        ,  0.        ,  0.        ],
           [ 0.        ,  0.        ,  0.        ]]), array([[-0.41045956,  0.18834851,  1.43272264],
           [ 0.38202831, -0.17530219, -1.33348241],
           [ 0.07009355, -0.03216399, -0.24466386],
           [ 0.09472154, -0.04346509, -0.33062865],
           [-0.13638384,  0.06258276,  0.47605228]]))
    (16.15119285363322, array([[ 0.        ,  0.        ,  0.        ],
           [ 0.        ,  0.        ,  0.        ],
           [-4.54650789, -1.85942252,  0.76397441],
           [ 0.        ,  0.        ,  0.        ],
           [ 0.        ,  0.        ,  0.        ]]), array([[-0.69148188,  0.31730185,  2.41364029],
           [-0.22716495,  0.10423969,  0.79292674],
           [-0.45528438,  0.20891737,  1.58918512],
           [-0.31602611,  0.14501561,  1.10309954],
           [-0.80620296,  0.36994417,  2.81407799]]))

(f) Implement the SGD optimizer.

    # Note: random, load_saved_params, save_params and SAVE_PARAMS_EVERY are
    # expected to be imported or defined elsewhere in the same module.
    def sgd(f, x0, step, iterations, postprocessing=None, useSaved=False,
            PRINT_EVERY=10):
        """ Stochastic Gradient Descent

        Implement the stochastic gradient descent method in this function.

        Arguments:
        f -- the function to optimize, it should take a single
             argument and yield two outputs, a cost and the gradient
             with respect to the arguments
        x0 -- the initial point to start SGD from
        step -- the step size for SGD
        iterations -- total iterations to run SGD for
        postprocessing -- postprocessing function for the parameters
                          if necessary. In the case of word2vec we will need to
                          normalize the word vectors to have unit length.
        PRINT_EVERY -- specifies how many iterations to output loss

        Return:
        x -- the parameter value after SGD finishes
        """

        # Anneal learning rate every several iterations
        ANNEAL_EVERY = 20000

        if useSaved:
            start_iter, oldx, state = load_saved_params()
            if start_iter > 0:
                x0 = oldx
                # Integer division: the step is halved once per completed
                # ANNEAL_EVERY block (true division would give a fractional power).
                step *= 0.5 ** (start_iter // ANNEAL_EVERY)

            if state:
                random.setstate(state)
        else:
            start_iter = 0

        x = x0

        if not postprocessing:
            postprocessing = lambda x: x

        expcost = None

        for iter in range(start_iter + 1, iterations + 1):
            # Don't forget to apply the postprocessing after every iteration!
            # You might want to print the progress every few iterations.

            cost = None
            # YOUR CODE HERE
            cost, grad = f(x)
            x -= step * grad
            # Keep the returned value: postprocessing (e.g. normalizeRows)
            # may return a new array instead of modifying x in place.
            x = postprocessing(x)
            # END YOUR CODE

            if iter % PRINT_EVERY == 0:
                if not expcost:
                    expcost = cost
                else:
                    expcost = .95 * expcost + .05 * cost
                print("iter %d: %f" % (iter, expcost))

            if iter % SAVE_PARAMS_EVERY == 0 and useSaved:
                save_params(iter, x)

            if iter % ANNEAL_EVERY == 0:
                step *= 0.5

        return x
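To see the update rule in isolation, the loop can be reduced to its core and run on a simple quadratic. This is a stripped-down sketch, not the assignment's `sgd`: `sgd_minimal` and `quadratic` are hypothetical names, and parameter saving, progress printing and learning-rate annealing are deliberately left out.

```python
import numpy as np


def sgd_minimal(f, x0, step, iterations, postprocessing=None):
    """Plain gradient-descent loop: x <- postprocessing(x - step * grad)."""
    x = np.array(x0, dtype=float)          # copy, so the caller's x0 is untouched
    for _ in range(iterations):
        cost, grad = f(x)
        x = x - step * grad
        if postprocessing is not None:
            x = postprocessing(x)
    return x


def quadratic(x):
    """Cost sum(x^2) with gradient 2x; the minimum is at the origin."""
    return np.sum(x ** 2), 2 * x


x_final = sgd_minimal(quadratic, [0.5, -1.5], step=0.01, iterations=1000)
print(x_final)
```

With this step size each iteration multiplies `x` by 0.98, so after 1000 iterations both coordinates are very close to zero. The function being minimized here is deterministic; in the word2vec setting `f` would instead evaluate the cost on a randomly sampled batch of contexts.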

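(g) Now load real data and train word vectors with the implementation above. The word vectors are trained on the Stanford Sentiment Treebank (SST) dataset and are later used for a simple sentiment classification task. First run sh get_datasets.sh to fetch the dataset, then run q3_run.py to obtain a visualization of the word vectors.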
