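L5W2 Assignment 2: Operations on Word Vectors

Welcome to your first assignment of this week!

Because word embeddings are very computationally expensive to train, most ML practitioners load a pre-trained set of embeddings.

After this assignment you will be able to:

Load pre-trained word vectors, and measure similarity using cosine similarity
Use word embeddings to solve word analogy problems such as "man is to woman as king is to ____".
Modify word embeddings to reduce their gender bias

Let's get started! Run the following cell to load the packages you will need.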

In [2]:

cd /home/kesci/input/deeplearning163512
/home/kesci/input/deeplearning163512

In [3]:

import numpy as np
from w2v_utils import *
Using TensorFlow backend.

Next, let's load the word vectors. For this assignment, we will use 50-dimensional GloVe vectors to represent words. Run the following cell to load word_to_vec_map.

In [4]:

words, word_to_vec_map = read_glove_vecs('data/glove.6B.50d.txt')

You've loaded:

words: the set of words in the vocabulary.
word_to_vec_map: a dictionary mapping each word to its GloVe vector representation.
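The helper read_glove_vecs comes from w2v_utils and its source is not shown in this notebook. As a rough sketch of what such a loader might look like, assuming the usual GloVe text layout of one word followed by its space-separated vector components per line, the hypothetical function load_glove_sketch below parses a tiny in-memory sample with made-up 3-dimensional vectors:

```python
import io
import numpy as np

def load_glove_sketch(handle):
    """Hypothetical stand-in for read_glove_vecs (not the course helper).

    Parses lines of the form 'word v1 v2 ... vn' from a file-like object.
    """
    words = set()
    word_to_vec_map = {}
    for line in handle:
        parts = line.strip().split()
        if not parts:
            continue  # skip blank lines
        word = parts[0]
        words.add(word)
        # The remaining fields are the vector components.
        word_to_vec_map[word] = np.array(parts[1:], dtype=np.float64)
    return words, word_to_vec_map

# Made-up sample data, not real GloVe vectors.
sample = io.StringIO("cat 0.1 0.2 0.3\ndog 0.4 0.5 0.6\n")
demo_words, demo_map = load_glove_sketch(sample)
print(sorted(demo_words), demo_map["dog"])
```

With a real file, the same function could presumably be called on an open file handle instead of the StringIO object.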

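You've seen that one-hot vectors do not do a good job of capturing which words are similar. GloVe vectors provide much more useful information about the meaning of individual words. Let's now see how you can use GloVe vectors to decide how similar two words are.

1 Cosine similarity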

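To measure how similar two words are, we need a way to measure the degree of similarity between the two embedding vectors for the two words. Given two vectors u and v, cosine similarity is defined as follows:
$\text{CosineSimilarity}(u, v) = \frac{u \cdot v}{||u||_2 \, ||v||_2} = \cos(\theta) \tag{1}$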

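where $u \cdot v$ is the dot product (or inner product) of the two vectors, $||u||_2$ is the norm (or length) of the vector u, and θ is the angle between u and v. This similarity depends on the angle between u and v. If u and v are very similar, their cosine similarity will be close to 1; if they are dissimilar, the cosine similarity will take a smaller value.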

Figure 1: The cosine of the angle between two vectors is a measure of how similar they are
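To make formula (1) concrete, here is a small sketch on hand-made 2-dimensional vectors (the helper name cosine and the vectors are illustrative assumptions, not part of the assignment data). It shows the three reference cases: same direction, orthogonal, and opposite direction.

```python
import numpy as np

def cosine(u, v):
    # Formula (1): dot product divided by the product of the L2 norms.
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

a = np.array([1.0, 0.0])
same_direction = np.array([3.0, 0.0])   # longer, but points the same way
orthogonal = np.array([0.0, 2.0])       # at 90 degrees to a
opposite = np.array([-1.0, 0.0])        # points the other way

print(cosine(a, same_direction))  # 1.0
print(cosine(a, orthogonal))      # 0.0
print(cosine(a, opposite))        # -1.0
```

Note that the length of the vectors does not matter: only the angle between them does.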

Exercise: Implement the function cosine_similarity() to evaluate the similarity between word vectors.

Reminder: The norm of u is defined as $||u||_2 = \sqrt{\sum_{i=1}^{n} u_i^2}$

In [5]:

# GRADED FUNCTION: cosine_similarity

def cosine_similarity(u, v):
    """
    Cosine similarity reflects the degree of similarity between u and v

    Arguments:
        u -- a word vector of shape (n,)
        v -- a word vector of shape (n,)

    Returns:
        cosine_similarity -- the cosine similarity between u and v defined by the formula above.
    """

    distance = 0.0

    ### START CODE HERE ###
    # Compute the dot product between u and v (≈1 line)
    dot = np.dot(u, v)
    # Compute the L2 norm of u (≈1 line)
    norm_u = np.linalg.norm(u)
    # Compute the L2 norm of v (≈1 line)
    norm_v = np.linalg.norm(v)
    # Compute the cosine similarity defined by formula (1) (≈1 line)
    cosine_similarity = dot / (norm_u * norm_v)
    ### END CODE HERE ###

    return cosine_similarity

In [6]:

father = word_to_vec_map["father"]
mother = word_to_vec_map["mother"]
ball = word_to_vec_map["ball"]
crocodile = word_to_vec_map["crocodile"]
france = word_to_vec_map["france"]
italy = word_to_vec_map["italy"]
paris = word_to_vec_map["paris"]
rome = word_to_vec_map["rome"]

print("cosine_similarity(father, mother) = ", cosine_similarity(father, mother))
print("cosine_similarity(ball, crocodile) = ",cosine_similarity(ball, crocodile))
print("cosine_similarity(france - paris, rome - italy) = ",cosine_similarity(france - paris, rome - italy))
cosine_similarity(father, mother) =  0.8909038442893615
cosine_similarity(ball, crocodile) =  0.2743924626137942
cosine_similarity(france - paris, rome - italy) =  -0.6751479308174202

Expected output:
cosine_similarity(father, mother) = 0.8909038442893615
cosine_similarity(ball, crocodile) = 0.2743924626137942
cosine_similarity(france - paris, rome - italy) = -0.6751479308174202

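After you get the correct expected output, feel free to modify the inputs and measure the cosine similarity between other pairs of words! Experimenting with the cosine similarity of other inputs will give you a better sense of how word vectors behave.

2 Word analogy task

In the word analogy task, we complete the sentence "a is to b as c is to ____". An example is "man is to woman as king is to queen". In detail, we are trying to find a word d such that the associated word vectors $e_a, e_b, e_c, e_d$ are related in the following manner: $e_b - e_a \approx e_d - e_c$. We will measure the similarity between $e_b - e_a$ and $e_d - e_c$ using cosine similarity.

Exercise: Complete the code below to perform word analogies!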

In [7]:

# GRADED FUNCTION: complete_analogy

def complete_analogy(word_a, word_b, word_c, word_to_vec_map):
    """
    Performs the word analogy task as explained above: a is to b as c is to ____.

    Arguments:
    word_a -- a word, string
    word_b -- a word, string
    word_c -- a word, string
    word_to_vec_map -- dictionary that maps words to their corresponding vectors.

    Returns:
    best_word -- the word such that v_b - v_a is close to v_best_word - v_c, as measured by cosine similarity
    """

    # convert words to lower case
    word_a, word_b, word_c = word_a.lower(), word_b.lower(), word_c.lower()

    ### START CODE HERE ###
    # Get the word embeddings v_a, v_b and v_c (≈1-3 lines)
    e_a, e_b, e_c = word_to_vec_map[word_a], word_to_vec_map[word_b], word_to_vec_map[word_c]
    ### END CODE HERE ###

    words = word_to_vec_map.keys()
    max_cosine_sim = -100              # Initialize max_cosine_sim to a large negative number
    best_word = None                   # Initialize best_word with None, it will help keep track of the word to output

    # loop over the whole word vector set
    for w in words:
        # to avoid best_word being one of the input words, pass on them.
        if w in [word_a, word_b, word_c]:
            continue

        ### START CODE HERE ###
        # Compute cosine similarity between the vector (e_b - e_a) and the vector ((w's vector representation) - e_c)  (≈1 line)
        cosine_sim = cosine_similarity(e_b - e_a, word_to_vec_map[w] - e_c)

        # If the cosine_sim is more than the max_cosine_sim seen so far,
        # then: set the new max_cosine_sim to the current cosine_sim and the best_word to the current word (≈3 lines)
        if cosine_sim > max_cosine_sim:
            max_cosine_sim = cosine_sim
            best_word = w
        ### END CODE HERE ###

    return best_word

Run the cell below to test your code; this may take 1-2 minutes.

In [8]:

triads_to_try = [('italy', 'italian', 'spain'), ('india', 'delhi', 'japan'), ('man', 'woman', 'boy'), ('small', 'smaller', 'large')]
for triad in triads_to_try:
    print('{} -> {} :: {} -> {}'.format(*triad, complete_analogy(*triad, word_to_vec_map)))
italy -> italian :: spain -> spanish
india -> delhi :: japan -> tokyo
man -> woman :: boy -> girl
small -> smaller :: large -> larger

Expected output:
italy -> italian :: spain -> spanish
india -> delhi :: japan -> tokyo
man -> woman :: boy -> girl
small -> smaller :: large -> larger

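Once you get the correct expected output, feel free to modify the input cell above to test your own analogies. Try to find some other analogy pairs that do work, but also find some where the algorithm doesn't give the right answer: for example, you can try small -> smaller as big -> ?.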
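The Python loop in complete_analogy visits every word one at a time, which is why the test cell can take a minute or two. As an alternative technique (not the graded solution), the same search can be sketched with NumPy array operations. The function name complete_analogy_vectorized and the toy 2-dimensional vocabulary below are assumptions made for illustration only.

```python
import numpy as np

def complete_analogy_vectorized(word_a, word_b, word_c, vec_map):
    """Vectorized sketch of the analogy search: a is to b as c is to ____."""
    # Candidate words exclude the three inputs, as in the loop version.
    candidates = [w for w in vec_map if w not in (word_a, word_b, word_c)]
    matrix = np.stack([vec_map[w] for w in candidates])   # shape (n_words, dim)

    target = vec_map[word_b] - vec_map[word_a]            # e_b - e_a
    diffs = matrix - vec_map[word_c]                      # e_w - e_c for every candidate w

    # Cosine similarity of every row of diffs with target, computed at once.
    sims = diffs @ target / (np.linalg.norm(diffs, axis=1) * np.linalg.norm(target))
    return candidates[int(np.argmax(sims))]

# Made-up toy embeddings, chosen so that the second coordinate plays the role of "gender".
toy_map = {
    "man": np.array([1.0, 0.0]),
    "woman": np.array([1.0, 1.0]),
    "king": np.array([3.0, 0.0]),
    "queen": np.array([3.0, 1.0]),
    "apple": np.array([0.0, -2.0]),
    "table": np.array([5.0, 0.5]),
}
print(complete_analogy_vectorized("man", "woman", "king", toy_map))  # queen
```

On the full GloVe vocabulary this approach trades extra memory (one matrix holding all vectors) for fewer Python-level iterations.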

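Congratulations! You've come to the end of this assignment. Here are the main points you should remember:

Cosine similarity is a good way to compare the similarity between pairs of word vectors. (Though L2 distance works too.)
For NLP applications, using a pre-trained set of word vectors from the internet is often a good way to get started.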
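One way to see why L2 distance can serve a similar purpose is the standard identity for unit-length vectors, $||\hat{u} - \hat{v}||_2^2 = 2 - 2\cos(\theta)$: after normalization, a small distance and a large cosine similarity carry the same information. The sketch below checks this numerically on random vectors, which are stand-ins for word vectors and not taken from GloVe.

```python
import numpy as np

rng = np.random.default_rng(0)
u = rng.normal(size=50)   # random stand-ins for two 50-dimensional word vectors
v = rng.normal(size=50)

# Normalize both vectors to unit length.
u_hat = u / np.linalg.norm(u)
v_hat = v / np.linalg.norm(v)

cos = np.dot(u_hat, v_hat)                 # cosine similarity of u and v
sq_dist = np.sum((u_hat - v_hat) ** 2)     # squared L2 distance after normalization

print(np.isclose(sq_dist, 2 - 2 * cos))    # True
```

Without normalization the two measures can disagree, because L2 distance is also sensitive to vector length.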

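Even though you have finished the graded portions, we recommend you take a look at the rest of this notebook too.

Congratulations on finishing the graded portions of this notebook!

3 Debiasing word vectors (optional exercise)

In the following exercise, you will examine gender biases that can be reflected in a word embedding, and explore algorithms for reducing the bias. In addition to learning about the topic of debiasing, this exercise will also help sharpen your intuition about what word vectors are doing. This section involves a bit of linear algebra, though you can probably complete it even without being an expert in linear algebra, and we encourage you to give it a try. This portion of the notebook is optional and is not graded.

Let's first see how the GloVe word embeddings relate to gender. You will first compute a vector $g = e_{woman}-e_{man}$, where $e_{woman}$ represents the word vector corresponding to the word woman, and $e_{man}$ the word vector corresponding to the word man. The resulting vector g roughly encodes the concept of "gender". (You might get a more accurate representation if you compute $g_1 = e_{mother}-e_{father}$, $g_2 = e_{girl}-e_{boy}$, etc. and average over them. But just using $e_{woman}-e_{man}$ will give good enough results for now.)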

In [9]:

g = word_to_vec_map['woman'] - word_to_vec_map['man']
print(g)
[-0.087144    0.2182     -0.40986    -0.03922    -0.1032      0.94165
 -0.06042     0.32988     0.46144    -0.35962     0.31102    -0.86824
  0.96006     0.01073     0.24337     0.08193    -1.02722    -0.21122
  0.695044   -0.00222     0.29106     0.5053     -0.099454    0.40445
  0.30181     0.1355     -0.0606     -0.07131    -0.19245    -0.06115
 -0.3204      0.07165    -0.13337    -0.25068714 -0.14293    -0.224957
 -0.149       0.048882    0.12191    -0.27362    -0.165476   -0.20426
  0.54376    -0.271425   -0.10245    -0.32108     0.2516     -0.33455
 -0.04371     0.01258   ]

Now, you will consider the cosine similarity of different words with g. Consider what a positive value of similarity means versus a negative cosine similarity.

In [10]:

print('List of names and their similarities with constructed vector:')

# girls and boys name
name_list = ['john', 'marie', 'sophie', 'ronaldo', 'priya', 'rahul', 'danielle', 'reza', 'katy', 'yasmin']

for w in name_list:
    print(w, cosine_similarity(word_to_vec_map[w], g))
List of names and their similarities with constructed vector:
john -0.23163356145973724
marie 0.315597935396073
sophie 0.31868789859418784
ronaldo -0.31244796850329437
priya 0.17632041839009402
rahul -0.16915471039231716
danielle 0.24393299216283895
reza -0.07930429672199553
katy 0.2831068659572615
yasmin 0.2331385776792876

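As you can see, female first names tend to have a positive cosine similarity with our constructed vector g, while male first names tend to have a negative cosine similarity. This is not surprising, and the result seems acceptable.

But let's try with some other words.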

In [11]:

print('Other words and their similarities:')
word_list = ['lipstick', 'guns', 'science', 'arts', 'literature', 'warrior','doctor', 'tree', 'receptionist', 'technology',  'fashion', 'teacher', 'engineer', 'pilot', 'computer', 'singer']
for w in word_list:
    print(w, cosine_similarity(word_to_vec_map[w], g))
Other words and their similarities:
lipstick 0.2769191625638267
guns -0.1888485567898898
science -0.06082906540929701
arts 0.008189312385880337
literature 0.06472504433459932
warrior -0.20920164641125288
doctor 0.11895289410935041
tree -0.07089399175478091
receptionist 0.3307794175059374
technology -0.13193732447554302
fashion 0.03563894625772699
teacher 0.17920923431825664
engineer -0.0803928049452407
pilot 0.0010764498991916937
computer -0.10330358873850498
singer 0.1850051813649629

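Do you notice anything surprising? It is striking how these results reflect certain unhealthy gender stereotypes. For example, "computer" is closer to "man" while "literature" is closer to "woman".

We'll see below how to reduce the bias of these vectors, using an algorithm due to Bolukbasi et al., 2016. Note that some word pairs such as "actor"/"actress" or "grandmother"/"grandfather" should remain gender specific, while other words such as "receptionist" or "technology" should be neutralized, i.e. not be gender-related. You will have to treat these two types of words differently when debiasing.

3.1 Neutralize bias for non-gender specific words

The figure below should help you visualize what neutralizing does. If you're using a 50-dimensional word embedding, the 50-dimensional space can be split into two parts: the bias direction g, and the remaining 49 dimensions, which we'll call $g_{\perp}$. In linear algebra, we say that the 49-dimensional $g_{\perp}$ is perpendicular (or "orthogonal") to g, meaning it is at 90 degrees to g. The neutralization step takes a vector such as $e_{receptionist}$ and zeros out the component in the direction of g, giving us $e_{receptionist}^{debiased}$.

Even though $g_{\perp}$ is 49-dimensional, given the limitations of what we can draw on a screen, we illustrate it using a 1-dimensional axis below.

Figure 2: The word vector for "receptionist" represented before and after applying the neutralize operation.

Exercise: Implement neutralize() to remove the bias of words such as "receptionist" or "scientist". Given an input embedding e, you can use the following formulas to compute $e^{debiased}$: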
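$e^{bias\_component} = \frac{e \cdot g}{||g||_2^2} * g \tag{2}$

$e^{debiased} = e - e^{bias\_component} \tag{3}$

If you are an expert in linear algebra, you may recognize $e^{bias\_component}$ as the projection of e onto the direction g. If you're not an expert in linear algebra, don't worry about this.

Reminder: a vector u can be split into two parts: its projection onto a vector axis $v_B$ and its projection onto the axis orthogonal to $v_B$:
$u = u_B + u_{\perp}$

where: $u_B = \frac{u \cdot v_B}{||v_B||_2^2} * v_B$ and $u_{\perp} = u - u_B$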
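As a quick numerical illustration of this decomposition, the sketch below uses the usual projection formula $u_B = \frac{u \cdot v_B}{||v_B||_2^2} v_B$ (the same form as equation (2)) on small hand-picked vectors. The specific numbers are assumptions chosen to make the result easy to check by eye.

```python
import numpy as np

u = np.array([3.0, 4.0])
v_B = np.array([1.0, 0.0])   # the axis we project onto

# Projection of u onto v_B, and the remainder orthogonal to v_B.
u_B = np.dot(u, v_B) / np.dot(v_B, v_B) * v_B
u_perp = u - u_B

# u_B is [3, 0], u_perp is [0, 4], and u_perp has zero dot product with v_B.
print(u_B, u_perp, np.dot(u_perp, v_B))
```

Adding the two parts back together recovers u, which is exactly what the neutralize step relies on: dropping u_B leaves only the part orthogonal to the bias axis.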

In [12]:

def neutralize(word, g, word_to_vec_map):
    """
    Removes the bias of "word" by projecting it on the space orthogonal to the bias axis.
    This function ensures that gender neutral words are zero in the gender subspace.

    Arguments:
        word -- string indicating the word to debias
        g -- numpy-array of shape (50,), corresponding to the bias axis (such as gender)
        word_to_vec_map -- dictionary mapping words to their corresponding vectors.

    Returns:
        e_debiased -- neutralized word vector representation of the input "word"
    """

    ### START CODE HERE ###
    # Select word vector representation of "word". Use word_to_vec_map. (≈ 1 line)
    e = word_to_vec_map[word]

    # Compute e_biascomponent using the formula given above. (≈ 1 line)
    e_biascomponent = (np.dot(e, g) / np.square(np.linalg.norm(g))) * g

    # Neutralize e by subtracting e_biascomponent from it
    # e_debiased should be equal to its orthogonal projection. (≈ 1 line)
    e_debiased = e - e_biascomponent
    ### END CODE HERE ###

    return e_debiased

In [13]:

e = "receptionist"
print("cosine similarity between " + e + " and g, before neutralizing: ", cosine_similarity(word_to_vec_map["receptionist"], g))

e_debiased = neutralize("receptionist", g, word_to_vec_map)
print("cosine similarity between " + e + " and g, after neutralizing: ", cosine_similarity(e_debiased, g))
cosine similarity between receptionist and g, before neutralizing:  0.3307794175059374
cosine similarity between receptionist and g, after neutralizing:  -2.099120994400013e-17

Expected output: The second result is essentially 0, up to numerical round-off (on the order of $10^{-17}$).
cosine similarity between receptionist and g, before neutralizing: 0.3307794175059374
cosine similarity between receptionist and g, after neutralizing: -2.099120994400013e-17

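3.2 Equalization algorithm for gender-specific words

Next, let's see how debiasing can also be applied to word pairs such as "actress" and "actor". Equalization is applied only to pairs of words that you want to differ solely through the gender property. As a concrete example, suppose that "actress" is closer to "babysit" than "actor" is. By applying neutralization to "babysit" we can reduce the gender stereotype associated with babysitting. But this still does not guarantee that "actress" and "actor" are equidistant from "babysit". The equalization algorithm takes care of this.

The key idea behind equalization is to make sure that a particular pair of words is equidistant from the 49-dimensional $g_{\perp}$. The equalization step also ensures that the two equalized words are now the same distance from $e_{receptionist}^{debiased}$, or from any other word that has been neutralized. The figure shows how equalization works: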

The derivation of the linear algebra to do this is a bit more complex. (See Bolukbasi et al., 2016 for details.) But the key equations are:
$\mu = \frac{e_{w1} + e_{w2}}{2} \tag{4}$

$\mu_{B} = \frac{\mu \cdot \text{bias\_axis}}{||\text{bias\_axis}||_2^2} * \text{bias\_axis} \tag{5}$

$\mu_{\perp} = \mu - \mu_{B} \tag{6}$

$e_{w1B} = \sqrt{|1 - ||\mu_{\perp}||_2^2|} * \frac{(e_{w1} - \mu_{\perp}) - \mu_B}{||(e_{w1} - \mu_{\perp}) - \mu_B||_2} \tag{7}$

$e_{w2B} = \sqrt{|1 - ||\mu_{\perp}||_2^2|} * \frac{(e_{w2} - \mu_{\perp}) - \mu_B}{||(e_{w2} - \mu_{\perp}) - \mu_B||_2} \tag{8}$

$e_1 = e_{w1B} + \mu_{\perp} \tag{9}$

$e_2 = e_{w2B} + \mu_{\perp} \tag{10}$

Exercise: Implement the function below. Use the equations above to get the final equalized version of the pair of words. Good luck!

In [15]:

def equalize(pair, bias_axis, word_to_vec_map):
    """
    Debias gender specific words by following the equalize method described in the figure above.

    Arguments:
    pair -- pair of strings of gender specific words to debias, e.g. ("actress", "actor")
    bias_axis -- numpy-array of shape (50,), vector corresponding to the bias axis, e.g. gender
    word_to_vec_map -- dictionary mapping words to their corresponding vectors

    Returns
    e_1 -- word vector corresponding to the first word
    e_2 -- word vector corresponding to the second word
    """

    ### START CODE HERE ###
    # Step 1: Select word vector representation of "word". Use word_to_vec_map. (≈ 2 lines)
    w1, w2 = pair
    e_w1, e_w2 = word_to_vec_map[w1], word_to_vec_map[w2]

    # Step 2: Compute the mean of e_w1 and e_w2 (≈ 1 line)
    mu = (e_w1 + e_w2) / 2

    # Step 3: Compute the projections of mu over the bias axis and the orthogonal axis (≈ 2 lines)
    mu_B = np.dot(mu, bias_axis) / np.sum(bias_axis * bias_axis) * bias_axis
    mu_orth = mu - mu_B

    # Step 4: Compute the projections of e_w1 and e_w2 onto the bias axis (≈2 lines)
    e_w1B = np.dot(e_w1, bias_axis) / np.sum(bias_axis * bias_axis) * bias_axis
    e_w2B = np.dot(e_w2, bias_axis) / np.sum(bias_axis * bias_axis) * bias_axis

    # Step 5: Rescale the bias parts with the factor from equations (7) and (8);
    # the numerator here is the bias-axis projection minus mu_B (≈2 lines)
    corrected_e_w1B = np.sqrt(np.abs(1 - np.sum(mu_orth * mu_orth))) * (e_w1B - mu_B) / np.linalg.norm(e_w1 - mu_orth - mu_B)
    corrected_e_w2B = np.sqrt(np.abs(1 - np.sum(mu_orth * mu_orth))) * (e_w2B - mu_B) / np.linalg.norm(e_w2 - mu_orth - mu_B)

    # Step 6: Debias by equalizing e1 and e2 to the sum of their corrected projections,
    # as in equations (9) and (10) (≈2 lines)
    e1 = corrected_e_w1B + mu_orth
    e2 = corrected_e_w2B + mu_orth
    ### END CODE HERE ###

    return e1, e2

In [16]:

print("cosine similarities before equalizing:")
print("cosine_similarity(word_to_vec_map[\"man\"], gender) = ", cosine_similarity(word_to_vec_map["man"], g))
print("cosine_similarity(word_to_vec_map[\"woman\"], gender) = ", cosine_similarity(word_to_vec_map["woman"], g))
print()
e1, e2 = equalize(("man", "woman"), g, word_to_vec_map)
print("cosine similarities after equalizing:")
print("cosine_similarity(e1, gender) = ", cosine_similarity(e1, g))
print("cosine_similarity(e2, gender) = ", cosine_similarity(e2, g))
cosine similarities before equalizing:
cosine_similarity(word_to_vec_map["man"], gender) =  -0.11711095765336832
cosine_similarity(word_to_vec_map["woman"], gender) =  0.35666618846270376

cosine similarities after equalizing:
cosine_similarity(e1, gender) =  -0.7004364289309387
cosine_similarity(e2, gender) =  0.7004364289309389

Expected output:

cosine similarities before equalizing:
cosine_similarity(word_to_vec_map["man"], gender) = -0.11711095765336832
cosine_similarity(word_to_vec_map["woman"], gender) = 0.35666618846270376

cosine similarities after equalizing:
cosine_similarity(e1, gender) = -0.7004364289309387
cosine_similarity(e2, gender) = 0.7004364289309389

Please feel free to play with the input words in the cell above, to apply equalization to other pairs of words.
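The claim that an equalized pair ends up equidistant from any neutralized word can be checked numerically. The sketch below re-expresses the same steps as the equalize code above, but on raw vectors instead of dictionary lookups. The names project and equalize_vectors, and the random 5-dimensional vectors, are assumptions made for illustration; they are not GloVe data.

```python
import numpy as np

def project(x, axis):
    # Projection of x onto the direction given by axis.
    return np.dot(x, axis) / np.dot(axis, axis) * axis

def equalize_vectors(e_w1, e_w2, bias_axis):
    """Same computation as equalize(), applied directly to two vectors."""
    mu = (e_w1 + e_w2) / 2
    mu_B = project(mu, bias_axis)
    mu_orth = mu - mu_B
    scale = np.sqrt(np.abs(1 - np.sum(mu_orth ** 2)))
    c1 = scale * (project(e_w1, bias_axis) - mu_B) / np.linalg.norm(e_w1 - mu_orth - mu_B)
    c2 = scale * (project(e_w2, bias_axis) - mu_B) / np.linalg.norm(e_w2 - mu_orth - mu_B)
    return c1 + mu_orth, c2 + mu_orth

rng = np.random.default_rng(1)
axis = rng.normal(size=5)                 # stand-in for the bias direction g
w1 = 0.3 * rng.normal(size=5)             # stand-ins for a gender-specific word pair
w2 = 0.3 * rng.normal(size=5)

# A "neutralized" vector: any vector with its component along the axis removed.
neutral = rng.normal(size=5)
neutral = neutral - project(neutral, axis)

e1, e2 = equalize_vectors(w1, w2, axis)
d1 = np.linalg.norm(e1 - neutral)
d2 = np.linalg.norm(e2 - neutral)
print(np.isclose(d1, d2))                 # True
```

The two distances match because the equalized vectors share the same orthogonal part and have bias components of equal size and opposite sign.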

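These debiasing algorithms are very helpful for reducing bias, but they are not perfect and do not eliminate all traces of bias. For example, one weakness of this implementation is that the bias direction g is defined using only the pair of words woman and man. As discussed earlier, if g were defined by computing $g_1 = e_{woman} - e_{man}$; $g_2 = e_{mother} - e_{father}$; $g_3 = e_{girl} - e_{boy}$; and so on, and then averaging over them, you would obtain a better estimate of the "gender" dimension in the 50-dimensional word embedding space. Feel free to play with such variants as well.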
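A minimal sketch of the averaging variant described above might look as follows. The function name average_bias_direction is hypothetical, and the toy 2-dimensional vectors are made up so that the result is easy to verify by hand.

```python
import numpy as np

def average_bias_direction(pairs, vec_map):
    """Average the difference vectors e_first - e_second over several word pairs."""
    diffs = [vec_map[first] - vec_map[second] for first, second in pairs]
    return np.mean(diffs, axis=0)

# Made-up toy embeddings (not GloVe): the second coordinate carries the "gender" signal.
toy_map = {
    "woman": np.array([1.0, 1.0]), "man": np.array([1.0, -1.0]),
    "mother": np.array([2.0, 0.5]), "father": np.array([2.0, -0.5]),
    "girl": np.array([0.0, 1.5]), "boy": np.array([0.0, -1.5]),
}
pairs = [("woman", "man"), ("mother", "father"), ("girl", "boy")]

g_avg = average_bias_direction(pairs, toy_map)
print(g_avg)  # the differences are (0, 2), (0, 1), (0, 3), so the mean is (0, 2)
```

With the real embeddings, the same call could be made with word_to_vec_map in place of toy_map, and the resulting vector passed to neutralize and equalize instead of g.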

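Congratulations, you have come to the end of this notebook, and have seen many of the ways that word vectors can be used and modified.

Congratulations on finishing this notebook!

References:

The debiasing algorithm is from Bolukbasi et al., 2016, Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings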
The GloVe word embeddings were due to Jeffrey Pennington, Richard Socher, and Christopher D. Manning. (https://nlp.stanford.edu/projects/glove/)