My original post: https://www.hijerry.cn/p/54554.html

Last winter I worked through the 2017 offering of cs224n and did its three assignments in TensorFlow. This year cs224n is being offered again, with five assignments in total, now using PyTorch. The lecturer is still Manning; I really like this teacher, his lectures are lively, fun, and he's pretty adorable haha~~

The task of Assignment 1 (click to download) is to explore word vectors. Using two approaches, count-based co-occurrence matrices and prediction-based word2vec, we compute word similarity and study properties such as synonyms and antonyms, understanding them at the code level so they stick in memory better.

The assignment is an ipynb file, so it has to be opened with Jupyter; you can refer to chaibubble's post on how to open an ipynb file.

Note: Python version >= 3.5 is required.

Word Vectors

Word vectors are a basic building block of downstream NLP tasks (question answering, text generation, translation, and so on), and their quality largely determines downstream performance. Here we explore two kinds of word vectors: those derived from co-occurrence matrices and those produced by word2vec.

A note on terminology: "word vectors" and "word embeddings" are usually used interchangeably. The word "embedding" conveys that words are encoded into a low-dimensional space. "Conceptually, it involves a mathematical embedding from a space with one dimension per word to a continuous vector space of much lower dimension, where each word or phrase is mapped to a vector of real numbers." (Wikipedia)

Part 1: Count-Based Word Vectors

Most word vector models are built on a single observation:

You shall know a word by the company it keeps (Firth, J. R. 1957:11)

At the core of most word-vector implementations is word similarity: similar words (near-synonyms) occur in similar contexts. Here we introduce one such strategy, the co-occurrence matrix (for more information, see here or here).

In this part we implement the following: given a corpus, compute word vectors from a co-occurrence matrix so that every word in the corpus gets a vector. The pipeline is:

  • Compute the set of distinct words in the corpus
  • Build the co-occurrence matrix
  • Reduce its dimensionality with SVD
  • Analyze the resulting word vectors

Question 1.1: Implement distinct_words

Compute the set of distinct words in the corpus and its size.

def distinct_words(corpus):
    """ Determine a list of distinct words for the corpus.
        Params:
            corpus (list of list of strings): corpus of documents
        Return:
            corpus_words (list of strings): list of distinct words across the corpus, sorted (using python 'sorted' function)
            num_corpus_words (integer): number of distinct words across the corpus
    """
    corpus_words = []
    num_corpus_words = -1

    # ------------------
    # Write your implementation here.
    corpus = [w for sent in corpus for w in sent]
    corpus_words = list(set(corpus))
    corpus_words = sorted(corpus_words)
    num_corpus_words = len(corpus_words)
    # ------------------

    return corpus_words, num_corpus_words
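A quick sanity check on a made-up toy corpus (this snippet is only illustrative, not part of the assignment):

toy_corpus = [["START", "all", "that", "glitters", "END"],
              ["START", "is", "not", "gold", "END"]]
words, n = distinct_words(toy_corpus)
print(words)  # ['END', 'START', 'all', 'glitters', 'gold', 'is', 'not', 'that']
print(n)      # 8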

Question 1.2: Implement compute_co_occurrence_matrix

Compute the co-occurrence matrix for the given corpus. Concretely, for every word w, count the occurrences of the words within window_size positions before and after it.

def compute_co_occurrence_matrix(corpus, window_size=4):
    """ Compute co-occurrence matrix for the given corpus and window_size (default of 4).
        Note: Each word in a document should be at the center of a window. Words near edges will have a smaller
        number of co-occurring words.
        For example, if we take the document "START All that glitters is not gold END" with window size of 4,
        "All" will co-occur with "START", "that", "glitters", "is", and "not".
        Params:
            corpus (list of list of strings): corpus of documents
            window_size (int): size of context window
        Return:
            M (numpy matrix of shape (number of corpus words, number of corpus words)):
                Co-occurence matrix of word counts.
                The ordering of the words in the rows/columns should be the same as the ordering of the words given by the distinct_words function.
            word2Ind (dict): dictionary that maps word to index (i.e. row/column number) for matrix M.
    """
    words, num_words = distinct_words(corpus)
    M = None
    word2Ind = {}

    # ------------------
    # Write your implementation here.
    M = np.zeros(shape=(num_words, num_words), dtype=np.int32)
    for i in range(num_words):
        word2Ind[words[i]] = i
    for sent in corpus:
        for p in range(len(sent)):
            ci = word2Ind[sent[p]]
            # preceding
            for w in sent[max(0, p - window_size):p]:
                wi = word2Ind[w]
                M[ci][wi] += 1
            # subsequent
            for w in sent[p + 1:p + 1 + window_size]:
                wi = word2Ind[w]
                M[ci][wi] += 1
    # ------------------

    return M, word2Ind
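Continuing with the toy corpus from the previous snippet, a small illustrative check of the counts (window_size=1 keeps the numbers easy to verify by hand):

M, word2Ind = compute_co_occurrence_matrix(toy_corpus, window_size=1)
print(M.shape)                               # (8, 8): one row/column per distinct word
print(word2Ind['all'])                       # 2, the position of 'all' in the sorted word list
print(M[word2Ind['all']][word2Ind['that']])  # 1: 'all' and 'that' are adjacent exactly once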

Question 1.3: Implement reduce_to_k_dim

This step performs dimensionality reduction. Question 1.2 produced an N x N matrix (N is the size of the vocabulary); using scikit-learn's SVD (singular value decomposition), we decompose this large matrix into a smaller N x k matrix with k features per word.

Note: numpy, scipy, and scikit-learn all provide SVD implementations, but only scipy and sklearn offer truncated SVD, and only sklearn provides an efficient randomized algorithm for computing large-scale SVDs; see sklearn.decomposition.TruncatedSVD for details.

def reduce_to_k_dim(M, k=2):
    """ Reduce a co-occurence count matrix of dimensionality (num_corpus_words, num_corpus_words)
        to a matrix of dimensionality (num_corpus_words, k) using the following SVD function from Scikit-Learn:
            - http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html
        Params:
            M (numpy matrix of shape (number of corpus words, number of corpus words)): co-occurence matrix of word counts
            k (int): embedding size of each word after dimension reduction
        Return:
            M_reduced (numpy matrix of shape (number of corpus words, k)): matrix of k-dimensioal word embeddings.
                    In terms of the SVD from math class, this actually returns U * S
    """
    n_iters = 10     # Use this parameter in your call to `TruncatedSVD`
    M_reduced = None
    print("Running Truncated SVD over %i words..." % (M.shape[0]))

    # ------------------
    # Write your implementation here.
    svd = TruncatedSVD(n_components=k, n_iter=n_iters)
    # fit_transform projects M onto its top-k singular directions, i.e. it returns U * S
    M_reduced = svd.fit_transform(M)
    # ------------------

    print("Done.")
    return M_reduced

Question 1.4: Implement plot_embeddings

Using matplotlib, draw the "×" markers with scatter and add the labels with text.

def plot_embeddings(M_reduced, word2Ind, words):
    """ Plot in a scatterplot the embeddings of the words specified in the list "words".
        NOTE: do not plot all the words listed in M_reduced / word2Ind.
        Include a label next to each point.
        Params:
            M_reduced (numpy matrix of shape (number of unique words in the corpus , k)): matrix of k-dimensioal word embeddings
            word2Ind (dict): dictionary that maps word to indices for matrix M
            words (list of strings): words whose embeddings we want to visualize
    """
    # ------------------
    # Write your implementation here.
    for w in words:
        x = M_reduced[word2Ind[w]][0]
        y = M_reduced[word2Ind[w]][1]
        plt.scatter(x, y, marker='x')
        plt.text(x, y, w)
    plt.show()
    # ------------------

Result:

[figure: scatter plot of the test words, each drawn as an "×" with a text label]

Question 1.5: Co-Occurrence Plot Analysis

Embed the words in two dimensions and normalize them; the resulting word vectors then lie on the unit circle, and we look for words that end up close together in the plot.
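For reference, a minimal sketch of the driver code for this question; it assumes the read_corpus() helper provided in the assignment notebook plus the functions implemented above, and the word list is just an illustrative subset:

reuters_corpus = read_corpus()  # provided by the assignment notebook (Reuters "crude" articles)
M_co_occurrence, word2Ind_co_occurrence = compute_co_occurrence_matrix(reuters_corpus)
M_reduced_co_occurrence = reduce_to_k_dim(M_co_occurrence, k=2)

# Normalize each row to unit length so the 2-D vectors fall on the unit circle.
row_norms = np.linalg.norm(M_reduced_co_occurrence, axis=1)
M_normalized = M_reduced_co_occurrence / row_norms[:, np.newaxis]

words = ['barrels', 'bpd', 'ecuador', 'energy', 'industry', 'kuwait', 'oil', 'output', 'petroleum', 'venezuela']
plot_embeddings(M_normalized, word2Ind_co_occurrence, words)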

[figure: scatter plot of the normalized 2-D co-occurrence embeddings]

Part 2: Prediction-Based Word Vectors

Prediction-based word vectors such as word2vec are currently the most popular. Here we explore the word vectors produced by word2vec; if you want to dig deeper, read the original paper.

This part mainly uses gensim to explore word vectors rather than implementing word2vec ourselves. The vectors are 300-dimensional and were released by Google.

First we use SVD to reduce the 300 dimensions to 2 so the vectors are easy to plot and inspect.
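For context, here is a minimal sketch of how the pretrained vectors can be loaded and reduced. The notebook wraps this in its own load_word2vec() / get_matrix_of_vectors() helpers; the gensim downloader model name and the word list below are assumptions for illustration:

import gensim.downloader as api
import numpy as np

# Load the pretrained 300-dimensional GoogleNews vectors (about 3 million words).
wv_from_bin = api.load("word2vec-google-news-300")

# Stack the vectors of a small set of words into a matrix, then reduce 300 -> 2 dims
# with the reduce_to_k_dim function from Part 1.
words = ['barrels', 'bpd', 'ecuador', 'energy', 'industry', 'kuwait', 'oil', 'output', 'petroleum', 'venezuela']
word2Ind = {w: i for i, w in enumerate(words)}
M = np.array([wv_from_bin[w] for w in words])   # shape (len(words), 300)
M_reduced = reduce_to_k_dim(M, k=2)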

Question 2.1: word2vec Plot Analysis

Same procedure as Question 1.5, but applied to the word2vec vectors.
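A sketch of the corresponding plotting step, reusing the matrix built in the snippet above and the plot_embeddings function from Part 1:

# Normalize the 2-D vectors to unit length, as in Question 1.5, then plot them.
row_norms = np.linalg.norm(M_reduced, axis=1)
M_reduced_normalized = M_reduced / row_norms[:, np.newaxis]
plot_embeddings(M_reduced_normalized, word2Ind, words)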

Question 2.2: Polysemy

Find a word with multiple meanings (e.g. "leaves", "scoop") whose top-10 most similar words (by cosine similarity) contain words from two different senses. For example, the top 10 for "leaves" (the foliage sense and the "departs" sense) includes both "vanishes" and "stalks".

The word I found is "column": its top 10 includes "columnist" and "article".

# ------------------
# Write your polysemous word exploration code here.
wv_from_bin.most_similar("column")
# ------------------

Output:

[('columns', 0.767943263053894),
 ('columnist', 0.6541407108306885),
 ('article', 0.651928186416626),
 ('columnists', 0.617466926574707),
 ('syndicated_column', 0.599014401435852),
 ('op_ed', 0.588202714920044),
 ('Op_Ed', 0.5801560282707214),
 ('op_ed_column', 0.5779396891593933),
 ('nationally_syndicated_column', 0.572504997253418),
 ('colum', 0.5595961213111877)]

Question 2.3: Synonyms and Antonyms

Find three words (w1, w2, w3) such that w1 and w2 are synonyms and w1 and w3 are antonyms, but the distance between w1 and w3 is smaller than the distance between w1 and w2. For example: w1 = "happy", w2 = "cheerful", w3 = "sad".

Why can the antonym turn out to be more similar (a smaller distance means more similar)? Because antonyms usually appear in very similar contexts.

# ------------------
# Write your synonym & antonym exploration code here.
w1 = "love"
w2 = "like"
w3 = "hate"
w1_w2_dist = wv_from_bin.distance(w1, w2)
w1_w3_dist = wv_from_bin.distance(w1, w3)
print("Synonyms {}, {} have cosine distance: {}".format(w1, w2, w1_w2_dist))
print("Antonyms {}, {} have cosine distance: {}".format(w1, w3, w1_w3_dist))
# ------------------

Output:

Synonyms love, like have cosine distance: 0.6328612565994263
Antonyms love, hate have cosine distance: 0.39960432052612305

Question 2.4: Analogies

Questions of the form "man is to king as woman is to ___" can also be answered with word2vec; for the detailed usage of most_similar, see the GenSim documentation.
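As a quick illustration of that classic example (a sketch; with the GoogleNews vectors the top result should be "queen", assuming pprint is imported as in the cells below):

# man : king :: woman : ?
pprint.pprint(wv_from_bin.most_similar(positive=['woman', 'king'], negative=['man']))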

Here we try a different analogy instead:

# ------------------
# Write your analogy exploration code here.
# man : him :: woman : her
pprint.pprint(wv_from_bin.most_similar(positive=['woman', 'him'], negative=['man']))
# ------------------

Output:

[('her', 0.694490909576416),
 ('she', 0.6385233402252197),
 ('me', 0.628451406955719),
 ('herself', 0.6239798665046692),
 ('them', 0.5843966007232666),
 ('She', 0.5237804651260376),
 ('myself', 0.4885627031326294),
 ('saidshe', 0.48337966203689575),
 ('he', 0.48184287548065186),
 ('Gail_Quets', 0.4784894585609436)]

We can see that it correctly produces "her".

Question 2.5: An Incorrect Analogy

Find an analogy that the word vectors get wrong. Here: tree : leaf :: flower : petal.

# ------------------
# Write your incorrect analogy exploration code here.
# tree : leaf :: flower : petal
pprint.pprint(wv_from_bin.most_similar(positive=['leaf', 'flower'], negative=['tree']))
# ------------------

Output:

[('floral', 0.5532568693161011),
 ('marigold', 0.5291938185691833),
 ('tulip', 0.521312952041626),
 ('rooted_cuttings', 0.5189826488494873),
 ('variegation', 0.5136324763298035),
 ('Asiatic_lilies', 0.5132641792297363),
 ('gerberas', 0.5106234550476074),
 ('gerbera_daisies', 0.5101010203361511),
 ('Verbena_bonariensis', 0.5070016980171204),
 ('violet', 0.5058108568191528)]

"petal" does not appear anywhere in the output.

Question 2.6: Bias Analysis

It is important to be aware of the biases implicit in word vectors, such as gender and racial bias. Run the code below and analyze two questions:

(a) Which words are most similar to "woman" and "boss" and most dissimilar to "man"?

(b) Which words are most similar to "man" and "boss" and most dissimilar to "woman"?

# Run this cell
# Here `positive` indicates the list of words to be similar to and `negative` indicates the list of words to be
# most dissimilar from.
pprint.pprint(wv_from_bin.most_similar(positive=['woman', 'boss'], negative=['man']))
print()
pprint.pprint(wv_from_bin.most_similar(positive=['man', 'boss'], negative=['woman']))

Output:

[('bosses', 0.5522644519805908),
 ('manageress', 0.49151360988616943),
 ('exec', 0.45940813422203064),
 ('Manageress', 0.45598435401916504),
 ('receptionist', 0.4474116563796997),
 ('Jane_Danson', 0.44480544328689575),
 ('Fiz_Jennie_McAlpine', 0.44275766611099243),
 ('Coronation_Street_actress', 0.44275566935539246),
 ('supremo', 0.4409853219985962),
 ('coworker', 0.43986251950263977)]

[('supremo', 0.6097398400306702),
 ('MOTHERWELL_boss', 0.5489562153816223),
 ('CARETAKER_boss', 0.5375303626060486),
 ('Bully_Wee_boss', 0.5333974361419678),
 ('YEOVIL_Town_boss', 0.5321705341339111),
 ('head_honcho', 0.5281980037689209),
 ('manager_Stan_Ternent', 0.525971531867981),
 ('Viv_Busby', 0.5256162881851196),
 ('striker_Gabby_Agbonlahor', 0.5250812768936157),
 ('BARNSLEY_boss', 0.5238943099975586)]

For the first analogy, man : woman :: boss : ___, the most fitting word would be something like "landlady", but the top 10 only contains words such as "manageress" and "receptionist".

For the second analogy, woman : man :: boss : ___, I honestly can't tell what the output is supposed to mean (facepalm).

Question 2.7: Exploring Bias Yourself

The examples I chose here are:

  • man : woman :: doctor : ___
  • woman : man :: doctor : ___
# ------------------
# Write your bias exploration code here.
pprint.pprint(wv_from_bin.most_similar(positive=['woman', 'doctor'], negative=['man']))
print()
pprint.pprint(wv_from_bin.most_similar(positive=['man', 'doctor'], negative=['woman']))
# ------------------

Output:

[('gynecologist', 0.7093892097473145),
 ('nurse', 0.647728681564331),
 ('doctors', 0.6471461057662964),
 ('physician', 0.64389967918396),
 ('pediatrician', 0.6249487996101379),
 ('nurse_practitioner', 0.6218312978744507),
 ('obstetrician', 0.6072014570236206),
 ('ob_gyn', 0.5986712574958801),
 ('midwife', 0.5927063226699829),
 ('dermatologist', 0.5739566683769226)]

[('physician', 0.6463665962219238),
 ('doctors', 0.5858404040336609),
 ('surgeon', 0.5723941326141357),
 ('dentist', 0.552364706993103),
 ('cardiologist', 0.5413815975189209),
 ('neurologist', 0.5271126627922058),
 ('neurosurgeon', 0.5249835848808289),
 ('urologist', 0.5247740149497986),
 ('Doctor', 0.5240625143051147),
 ('internist', 0.5183224081993103)]

In the first analogy we see "nurse", which is a biased association.

Question 2.8: Thinking About Bias

What can cause bias in word vectors?

Bias in the training data: the vectors simply reproduce the co-occurrence patterns of the text they were trained on, so biased associations present in that text end up encoded in the embeddings.

