Written: Understanding word2vec

Let’s have a quick refresher on the word2vec algorithm. The key insight behind word2vec is that ‘a word is known by the company it keeps’. Concretely, suppose we have a ‘center’ word $c$ and a contextual window surrounding $c$. We shall refer to words that lie in this contextual window as ‘outside words’. For example, in Figure 1 we see that the center word $c$ is ‘banking’. Since the context window size is 2, the outside words are ‘turning’, ‘into’, ‘crises’, and ‘as’.

The goal of the skip-gram word2vec algorithm is to accurately learn the probability distribution $P(O \mid C)$. Given a specific word $o$ and a specific word $c$, we want to calculate $P(O = o \mid C = c)$, which is the probability that word $o$ is an ‘outside’ word for $c$, i.e., the probability that $o$ falls within the contextual window of $c$.

Figure 1: The word2vec skip-gram prediction model with window size 2

In word2vec, the conditional probability distribution is given by taking vector dot-products and applying the softmax function:
$$P(O=o \mid C=c) = \frac{\exp(u_o^\top v_c)}{\sum_{w \in \mathrm{Vocab}} \exp(u_w^\top v_c)} \tag{1}$$

Here, $u_o$ is the ‘outside’ vector representing outside word $o$, and $v_c$ is the ‘center’ vector representing center word $c$. To contain these parameters, we have two matrices, $U$ and $V$. The columns of $U$ are all the ‘outside’ vectors $u_w$. The columns of $V$ are all of the ‘center’ vectors $v_w$. Both $U$ and $V$ contain a vector for every $w \in$ Vocabulary.[^1]
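To make the notation concrete, here is a minimal NumPy sketch of Equation (1). The shapes follow the convention above (outside vectors as columns of `U`), but the function name and the tiny dimensions are illustrative assumptions, not the assignment’s starter code.

```python
import numpy as np

# Minimal sketch (not the assignment's starter code): P(O = o | C = c) from
# Equation (1). U is (d, |Vocab|) with the outside vectors u_w as columns,
# v_c is a length-d center vector.
def softmax_prob(U, v_c):
    scores = U.T @ v_c                    # u_w^T v_c for every w in Vocab
    scores = scores - scores.max()        # shift for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()  # y_hat, i.e. the distribution P(O | C = c)

# Tiny example with made-up dimensions.
rng = np.random.default_rng(0)
U = rng.normal(size=(5, 10))              # d = 5, |Vocab| = 10
v_c = rng.normal(size=5)
y_hat = softmax_prob(U, v_c)
print(y_hat[3], y_hat.sum())              # P(O = word 3 | C = c); probabilities sum to 1.0
```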

Recall from lectures that, for a single pair of words $c$ and $o$, the loss is given by:
$$J_{\text{naive-softmax}}(v_c, o, U) = -\log P(O=o \mid C=c) \tag{2}$$
Another way to view this loss is as the cross-entropy[^2] between the true distribution $y$ and the predicted distribution $\hat{y}$. Here, both $y$ and $\hat{y}$ are vectors with length equal to the number of words in the vocabulary. Furthermore, the $k^{th}$ entry in these vectors indicates the conditional probability of the $k^{th}$ word being an ‘outside word’ for the given $c$. The true empirical distribution $y$ is a one-hot vector with a 1 for the true outside word $o$, and 0 everywhere else. The predicted distribution $\hat{y}$ is the probability distribution $P(O \mid C=c)$ given by our model in Equation (1).

Question a

Show that the naive-softmax loss given in Equation (2) is the same as the cross-entropy loss between $y$ and $\hat{y}$; i.e., show that

$$-\sum_{w \in \mathrm{Vocab}} y_w \log(\hat{y}_w) = -\log(\hat{y}_o) \tag{3}$$

Ans for a

Since the true distribution $y$ is a one-hot vector,

$$y_w = \begin{cases} 1, & w = o \\ 0, & w \neq o \end{cases}$$

every term in the sum vanishes except the one for $w = o$:

$$-\sum_{w \in \mathrm{Vocab}} y_w \log(\hat{y}_w) = -y_o \log(\hat{y}_o) = -\log(\hat{y}_o)$$
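As a quick numerical check of this identity (a sketch with made-up numbers, not part of the assignment): with a one-hot $y$, the full cross-entropy sum collapses to the single term $-\log(\hat{y}_o)$.

```python
import numpy as np

rng = np.random.default_rng(0)
U, v_c = rng.normal(size=(5, 10)), rng.normal(size=5)
scores = U.T @ v_c
y_hat = np.exp(scores) / np.exp(scores).sum()    # predicted distribution, Equation (1)

o = 3                                 # index of the true outside word (arbitrary choice)
y = np.zeros(10)
y[o] = 1.0                            # one-hot true distribution

cross_entropy = -np.sum(y * np.log(y_hat))       # left-hand side of Equation (3)
naive_softmax = -np.log(y_hat[o])                # right-hand side of Equation (3)
print(np.isclose(cross_entropy, naive_softmax))  # True
```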

Question b

Compute the partial derivative of $J_{\text{naive-softmax}}(v_c, o, U)$ with respect to $v_c$. Please write your answer in terms of $y$, $\hat{y}$, and $U$.

Ans for b

$$
\begin{aligned}
\frac{\partial}{\partial v_c} J_{\text{naive-softmax}}
&= -\frac{\partial}{\partial v_c} \log P(O=o \mid C=c)\\
&= -\frac{\partial}{\partial v_c} \log \frac{\exp(u_o^\top v_c)}{\sum_{w=1}^{V} \exp(u_w^\top v_c)}\\
&= -\frac{\partial}{\partial v_c} \log \exp(u_o^\top v_c) + \frac{\partial}{\partial v_c} \log \sum_{w=1}^{V} \exp(u_w^\top v_c)\\
&= -u_o + \frac{1}{\sum_{w=1}^{V} \exp(u_w^\top v_c)} \frac{\partial}{\partial v_c} \sum_{x=1}^{V} \exp(u_x^\top v_c)\\
&= -u_o + \frac{1}{\sum_{w=1}^{V} \exp(u_w^\top v_c)} \sum_{x=1}^{V} \exp(u_x^\top v_c)\, \frac{\partial}{\partial v_c} u_x^\top v_c\\
&= -u_o + \frac{1}{\sum_{w=1}^{V} \exp(u_w^\top v_c)} \sum_{x=1}^{V} \exp(u_x^\top v_c)\, u_x\\
&= -u_o + \sum_{x=1}^{V} \frac{\exp(u_x^\top v_c)}{\sum_{w=1}^{V} \exp(u_w^\top v_c)}\, u_x\\
&= -u_o + \sum_{x=1}^{V} P(O=x \mid C=c)\, u_x\\
&= -Uy + U\hat{y}\\
&= U(\hat{y} - y)
\end{aligned}
$$
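A hedged sanity check of the closed form $U(\hat{y} - y)$ against a finite-difference gradient; the shapes and the small loss helper are assumptions consistent with the notation above, not the assignment’s code.

```python
import numpy as np

def naive_softmax_loss(U, v_c, o):
    scores = U.T @ v_c
    y_hat = np.exp(scores) / np.exp(scores).sum()
    return -np.log(y_hat[o])

rng = np.random.default_rng(1)
U, v_c, o = rng.normal(size=(5, 10)), rng.normal(size=5), 3

# Analytic gradient: U (y_hat - y)
scores = U.T @ v_c
y_hat = np.exp(scores) / np.exp(scores).sum()
y = np.zeros(10)
y[o] = 1.0
grad_analytic = U @ (y_hat - y)

# Central finite differences, one coordinate of v_c at a time
eps = 1e-6
grad_numeric = np.zeros_like(v_c)
for i in range(v_c.size):
    e = np.zeros_like(v_c)
    e[i] = eps
    grad_numeric[i] = (naive_softmax_loss(U, v_c + e, o)
                       - naive_softmax_loss(U, v_c - e, o)) / (2 * eps)

print(np.allclose(grad_analytic, grad_numeric, atol=1e-5))  # True
```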

Question c

Compute the partial derivatives of $J_{\text{naive-softmax}}(v_c, o, U)$ with respect to each of the ‘outside’ word vectors, $u_w$. There will be two cases: $w = o$, the true ‘outside’ word vector, and $w \neq o$, for all other words. Please write your answer in terms of $y$, $\hat{y}$, and $v_c$.

Ans for c

$$
\begin{aligned}
\frac{\partial}{\partial u_w} J_{\text{naive-softmax}}
&= -\frac{\partial}{\partial u_w} \log \frac{\exp(u_o^\top v_c)}{\sum_{m=1}^{V} \exp(u_m^\top v_c)}\\
&= -\frac{\partial}{\partial u_w} \log \exp(u_o^\top v_c) + \frac{\partial}{\partial u_w} \log \sum_{m=1}^{V} \exp(u_m^\top v_c)
\end{aligned}
$$

When $w = o$:

$$
\begin{aligned}
\frac{\partial}{\partial u_o} J_{\text{naive-softmax}}
&= -v_c + \frac{1}{\sum_{m=1}^{V} \exp(u_m^\top v_c)} \sum_{n=1}^{V} \frac{\partial}{\partial u_o} \exp(u_n^\top v_c)\\
&= -v_c + \frac{1}{\sum_{m=1}^{V} \exp(u_m^\top v_c)} \frac{\partial}{\partial u_o} \exp(u_o^\top v_c)\\
&= -v_c + \frac{\exp(u_o^\top v_c)}{\sum_{m=1}^{V} \exp(u_m^\top v_c)}\, v_c\\
&= -v_c + P(O=o \mid C=c)\, v_c\\
&= \bigl(P(O=o \mid C=c) - 1\bigr) v_c
\end{aligned}
$$

When $w \neq o$:

$$
\begin{aligned}
\frac{\partial}{\partial u_w} J_{\text{naive-softmax}}
&= \frac{\partial}{\partial u_w} \log \sum_{m=1}^{V} \exp(u_m^\top v_c)\\
&= \frac{\exp(u_w^\top v_c)}{\sum_{m=1}^{V} \exp(u_m^\top v_c)}\, v_c\\
&= P(O=w \mid C=c)\, v_c\\
&= \bigl(P(O=w \mid C=c) - 0\bigr) v_c
\end{aligned}
$$

In summary:

$$\frac{\partial}{\partial u_w} J_{\text{naive-softmax}} = (\hat{y}_w - y_w)\, v_c$$
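Stacking this per-vector result over all $w$ gives the gradient for the whole matrix $U$ in one outer product; a minimal sketch under the column-vector convention above (so $\partial J / \partial U = v_c (\hat{y} - y)^\top$; the layout is an assumption, adapt it to however your code stores $U$).

```python
import numpy as np

rng = np.random.default_rng(2)
U, v_c, o = rng.normal(size=(5, 10)), rng.normal(size=5), 3

scores = U.T @ v_c
y_hat = np.exp(scores) / np.exp(scores).sum()
y = np.zeros(10)
y[o] = 1.0

# Column w of grad_U is (y_hat_w - y_w) * v_c, computed for every w at once.
grad_U = np.outer(v_c, y_hat - y)   # shape (d, |Vocab|), same as U
print(grad_U.shape)                 # (5, 10)
```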

Question d

The sigmoid function is given by Equation 4:
$$\sigma(x) = \frac{1}{1 + e^{-x}} = \frac{e^x}{e^x + 1} \tag{4}$$
Please compute the derivative of $\sigma(x)$ with respect to $x$, where $x$ is a vector.

Ans for d

Writing $y = e^x$ and applying the chain rule elementwise:

$$
\begin{aligned}
\frac{\partial}{\partial x}\sigma(x)
&= \frac{\partial}{\partial x} \frac{e^x}{e^x + 1}\\
&= \frac{\partial}{\partial y} \frac{y}{y+1} \cdot \frac{\partial}{\partial x} e^x\\
&= \frac{\partial}{\partial y}\left(1 - \frac{1}{y+1}\right) \cdot e^x\\
&= \frac{1}{(y+1)^2}\, e^x\\
&= \frac{e^x}{(e^x + 1)^2}\\
&= \frac{e^x}{e^x + 1} \cdot \frac{1}{e^x + 1}\\
&= \frac{e^x}{e^x + 1} \cdot \frac{e^x + 1 - e^x}{e^x + 1}\\
&= \frac{e^x}{e^x + 1}\left(1 - \frac{e^x}{e^x + 1}\right)\\
&= \sigma(x)\bigl(1 - \sigma(x)\bigr)
\end{aligned}
$$

Since $x$ is a vector, the derivative applies elementwise, so the result is a vector of the same shape as $x$.
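A small numerical check of $\sigma'(x) = \sigma(x)(1 - \sigma(x))$ on a vector input; the test values are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-2.0, -0.5, 0.0, 1.0, 3.0])
analytic = sigmoid(x) * (1.0 - sigmoid(x))                    # elementwise sigma(x)(1 - sigma(x))

eps = 1e-6
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)   # elementwise finite differences
print(np.allclose(analytic, numeric))                         # True
```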

Question e

Now we shall consider the Negative Sampling loss, which is an alternative to the Naive Softmax loss. Assume that $K$ negative samples (words) are drawn from the vocabulary. For simplicity of notation we shall refer to them as $w_1, w_2, \dots, w_K$ and their outside vectors as $u_1, \dots, u_K$. Note that $o \notin \{w_1, \dots, w_K\}$. For a center word $c$ and an outside word $o$, the negative sampling loss function is given by:
$$J_{\text{neg-sample}}(v_c, o, U) = -\log\bigl(\sigma(u_o^\top v_c)\bigr) - \sum_{k=1}^{K} \log\bigl(\sigma(-u_k^\top v_c)\bigr) \tag{5}$$
for a sample $w_1, \dots, w_K$, where $\sigma(\cdot)$ is the sigmoid function.[^3]

Please repeat parts (b) and (c), computing the partial derivatives of $J_{\text{neg-sample}}$ with respect to $v_c$, with respect to $u_o$, and with respect to a negative sample $u_k$. Please write your answers in terms of the vectors $u_o$, $v_c$, and $u_k$, where $k \in [1, K]$. After you’ve done this, describe with one sentence why this loss function is much more efficient to compute than the naive-softmax loss. Note, you should be able to use your solution to part (d) to help compute the necessary gradients here.

Ans for e

$$
\begin{aligned}
\frac{\partial}{\partial v_c} J_{\text{neg-sample}}
&= -\frac{\partial}{\partial v_c} \log\bigl(\sigma(u_o^\top v_c)\bigr) - \frac{\partial}{\partial v_c} \sum_{k=1}^{K} \log\bigl(\sigma(-u_k^\top v_c)\bigr)\\
&= -\frac{1}{\sigma(u_o^\top v_c)} \frac{\partial}{\partial v_c} \sigma(u_o^\top v_c) - \sum_{k=1}^{K} \frac{1}{\sigma(-u_k^\top v_c)} \frac{\partial}{\partial v_c} \sigma(-u_k^\top v_c)\\
&= -\frac{1}{\sigma(u_o^\top v_c)} \sigma(u_o^\top v_c)\bigl(1 - \sigma(u_o^\top v_c)\bigr) \frac{\partial}{\partial v_c} u_o^\top v_c - \sum_{k=1}^{K} \frac{1}{\sigma(-u_k^\top v_c)} \sigma(-u_k^\top v_c)\bigl(1 - \sigma(-u_k^\top v_c)\bigr) \frac{\partial}{\partial v_c} (-u_k^\top v_c)\\
&= \bigl(\sigma(u_o^\top v_c) - 1\bigr) u_o + \sum_{k=1}^{K} \bigl(1 - \sigma(-u_k^\top v_c)\bigr) u_k
\end{aligned}
$$

$$
\begin{aligned}
\frac{\partial}{\partial u_o} J_{\text{neg-sample}}
&= -\frac{\partial}{\partial u_o} \log\bigl(\sigma(u_o^\top v_c)\bigr) - \frac{\partial}{\partial u_o} \sum_{k=1}^{K} \log\bigl(\sigma(-u_k^\top v_c)\bigr)\\
&= -\frac{\partial}{\partial u_o} \log\bigl(\sigma(u_o^\top v_c)\bigr)\\
&= -\frac{1}{\sigma(u_o^\top v_c)} \frac{\partial}{\partial u_o} \sigma(u_o^\top v_c)\\
&= -\frac{1}{\sigma(u_o^\top v_c)} \sigma(u_o^\top v_c)\bigl(1 - \sigma(u_o^\top v_c)\bigr) \frac{\partial}{\partial u_o} u_o^\top v_c\\
&= \bigl(\sigma(u_o^\top v_c) - 1\bigr) v_c
\end{aligned}
$$

$$
\begin{aligned}
\frac{\partial}{\partial u_k} J_{\text{neg-sample}}
&= -\frac{\partial}{\partial u_k} \log\bigl(\sigma(u_o^\top v_c)\bigr) - \frac{\partial}{\partial u_k} \sum_{x=1}^{K} \log\bigl(\sigma(-u_x^\top v_c)\bigr)\\
&= -\frac{\partial}{\partial u_k} \sum_{x=1}^{K} \log\bigl(\sigma(-u_x^\top v_c)\bigr)\\
&= -\frac{\partial}{\partial u_k} \log\bigl(\sigma(-u_k^\top v_c)\bigr)\\
&= -\frac{1}{\sigma(-u_k^\top v_c)} \frac{\partial}{\partial u_k} \sigma(-u_k^\top v_c)\\
&= -\frac{1}{\sigma(-u_k^\top v_c)} \sigma(-u_k^\top v_c)\bigl(1 - \sigma(-u_k^\top v_c)\bigr) \frac{\partial}{\partial u_k} (-u_k^\top v_c)\\
&= \bigl(1 - \sigma(-u_k^\top v_c)\bigr) v_c
\end{aligned}
$$

Because the negative-sampling loss only involves the true outside word and the $K$ sampled words, each gradient computation touches $K + 1$ outside vectors instead of summing over the entire vocabulary, which makes it much cheaper to compute than the naive-softmax loss.
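A minimal NumPy sketch of Equation (5) and the three gradients derived above; the dimensions, the sampled indices, and the variable names are made up for illustration, and repeated negative samples are ignored, as in the derivation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(3)
d, V, K = 5, 10, 4
U, v_c = rng.normal(size=(d, V)), rng.normal(size=d)
o = 3
neg = np.array([1, 5, 7, 9])              # indices of the K negative samples, o not among them

u_o = U[:, o]                             # (d,)
U_neg = U[:, neg]                         # (d, K), columns are u_1 ... u_K

# Negative-sampling loss, Equation (5)
loss = -np.log(sigmoid(u_o @ v_c)) - np.sum(np.log(sigmoid(-U_neg.T @ v_c)))

# Gradients derived above
grad_v_c = (sigmoid(u_o @ v_c) - 1) * u_o + U_neg @ (1 - sigmoid(-U_neg.T @ v_c))
grad_u_o = (sigmoid(u_o @ v_c) - 1) * v_c
grad_u_k = np.outer(v_c, 1 - sigmoid(-U_neg.T @ v_c))   # column k is the gradient for u_k
print(loss, grad_v_c.shape, grad_u_o.shape, grad_u_k.shape)
```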

Question f

Suppose the center word is $c = w_t$ and the context window is $[w_{t-m}, \dots, w_{t-1}, w_t, w_{t+1}, \dots, w_{t+m}]$, where $m$ is the context window size. Recall that for the skip-gram version of word2vec, the total loss for the context window is:
$$J_{\text{skip-gram}}(v_c, w_{t-m}, \dots, w_{t+m}, U) = \sum_{-m \leq j \leq m,\, j \neq 0} J(v_c, w_{t+j}, U) \tag{6}$$
Here, $J(v_c, w_{t+j}, U)$ represents an arbitrary loss term for the center word $c = w_t$ and outside word $w_{t+j}$. $J(v_c, w_{t+j}, U)$ could be $J_{\text{naive-softmax}}(v_c, w_{t+j}, U)$ or $J_{\text{neg-sample}}(v_c, w_{t+j}, U)$, depending on your implementation.

Write down three partial derivatives:

  1. $\frac{\partial J_{\text{skip-gram}}(v_c, w_{t-m}, \dots, w_{t+m}, U)}{\partial U}$
  2. $\frac{\partial J_{\text{skip-gram}}(v_c, w_{t-m}, \dots, w_{t+m}, U)}{\partial v_c}$
  3. $\frac{\partial J_{\text{skip-gram}}(v_c, w_{t-m}, \dots, w_{t+m}, U)}{\partial v_w}$ when $w \neq c$

Write your answers in terms of $\frac{\partial J(v_c, w_{t+j}, U)}{\partial U}$ and $\frac{\partial J(v_c, w_{t+j}, U)}{\partial v_c}$. This is very simple - each solution should be one line.

Ans for f

$$
\begin{aligned}
\frac{\partial}{\partial U} J_{\text{skip-gram}}(v_c, w_{t-m}, \dots, w_{t+m}, U)
&= \sum_{-m \leq j \leq m,\, j \neq 0} \frac{\partial J(v_c, w_{t+j}, U)}{\partial U}\\
\frac{\partial}{\partial v_c} J_{\text{skip-gram}}(v_c, w_{t-m}, \dots, w_{t+m}, U)
&= \sum_{-m \leq j \leq m,\, j \neq 0} \frac{\partial J(v_c, w_{t+j}, U)}{\partial v_c}\\
\frac{\partial}{\partial v_w} J_{\text{skip-gram}}(v_c, w_{t-m}, \dots, w_{t+m}, U)
&= 0 \quad \text{when } w \neq c
\end{aligned}
$$
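These sums translate directly into an accumulation loop over the window; a hedged sketch assuming a per-pair helper `loss_and_grads(v_c, o, U)` that returns the loss, $\partial J / \partial v_c$, and $\partial J / \partial U$ for one (center, outside) pair. The helper’s name and signature are assumptions, not the assignment’s API.

```python
import numpy as np

def skipgram_window_grads(v_c, outside_idxs, U, loss_and_grads):
    """Accumulate the skip-gram loss and gradients over one context window.

    outside_idxs: indices of w_{t+j} for -m <= j <= m, j != 0.
    loss_and_grads: assumed per-pair helper returning (J, dJ/dv_c, dJ/dU).
    """
    total_loss = 0.0
    grad_v_c = np.zeros_like(v_c)
    grad_U = np.zeros_like(U)
    for o in outside_idxs:                # the sum over j in Equation (6)
        J, d_vc, d_U = loss_and_grads(v_c, o, U)
        total_loss += J
        grad_v_c += d_vc
        grad_U += d_U
    # Gradients with respect to v_w for w != c stay zero, since no loss term touches them.
    return total_loss, grad_v_c, grad_U
```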


[^1]: Assume that every word in our vocabulary is matched to an integer number $k$. $u_k$ is both the $k^{th}$ column of $U$ and the ‘outside’ word vector for the word indexed by $k$. $v_k$ is both the $k^{th}$ column of $V$ and the ‘center’ word vector for the word indexed by $k$. In order to simplify notation we shall interchangeably use $k$ to refer to the word and the index-of-the-word.

[^2]: The Cross Entropy Loss between the true (discrete) probability distribution $p$ and another distribution $q$ is $-\sum_i p_i \log(q_i)$.

[^3]: Note: the loss function here is the negative of what Mikolov et al. had in their original paper, because we are doing a minimization instead of maximization in our assignment code. Ultimately, this is the same objective function.
