MIT Natural Language Processing, Lecture 3: Probabilistic Language Modeling (Part 4)

Natural Language Processing: Probabilistic Language Modeling
Author: Regina Barzilay (MIT, EECS Department, November 15, 2004)
Translator: 我爱自然语言处理 (www.52nlp.cn, January 20, 2009)

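IV. Smoothing Algorithms
a) Maximum Likelihood Estimate (MLE)
  i. MLE makes the training data as probable as possible:
      P_{ML}(w_i \mid w_{i-2}, w_{i-1}) = \frac{\mathrm{Count}(w_{i-2}, w_{i-1}, w_i)}{\mathrm{Count}(w_{i-2}, w_{i-1})}
    1. For a vocabulary of size N, the trigram model has N^{3} parameters.
    2. For N = 1,000, we have to estimate 1,000^{3} = 10^{9} parameters.
    3. Problem: how do we deal with unseen words?
  ii. Sparsity
    1. The aggregate probability of unseen events constitutes a large fraction of the test data.
    2. Brown et al. (1992): even in a 350-million-word corpus of English, 14% of trigrams are unseen.
  iii. Translator's note: a brief supplement on MLE
    1. Maximum likelihood estimation is a statistical method for estimating the parameters of the probability density function underlying a sample. It was first used by the geneticist and statistician Sir Ronald Fisher between 1912 and 1922.
    2. The Chinese term 似然 is a rather literary rendering of "likelihood"; in everyday modern Chinese it simply means "possibility" (可能性), so "maximum possibility estimation" would be the more colloquial name.
    3. MLE chooses the parameters that give the training corpus the highest probability; it wastes no probability mass on events that do not occur in the training corpus.
    4. For that reason an MLE model is usually unsuitable as a statistical language model for NLP: it assigns probability zero to unseen events, and a single zero makes the probability of the whole sentence zero, which is not acceptable.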
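A minimal sketch of the trigram MLE above, assuming a whitespace-tokenized toy corpus. The sentence and the helper names `train_trigram_mle` and `p_ml` are illustrative and not from the lecture. It shows how an unseen trigram gets probability zero and how few of the N^{3} possible parameters are actually observed:

```python
from collections import Counter

def train_trigram_mle(tokens):
    """Return trigram counts and the counts of their two-word contexts."""
    trigrams = Counter(zip(tokens, tokens[1:], tokens[2:]))
    # Count a context only where a third word follows it, so that each
    # conditional distribution P(. | u, v) sums to one.
    contexts = Counter(zip(tokens[:-2], tokens[1:-1]))
    return trigrams, contexts

def p_ml(w, u, v, trigrams, contexts):
    """P_ML(w | u, v) = Count(u, v, w) / Count(u, v); 0.0 for an unseen context."""
    if contexts[(u, v)] == 0:
        return 0.0
    return trigrams[(u, v, w)] / contexts[(u, v)]

tokens = "the dog chased the cat and the dog chased the ball".split()
trigrams, contexts = train_trigram_mle(tokens)

seen = p_ml("cat", "chased", "the", trigrams, contexts)    # trigram occurs once
unseen = p_ml("dog", "chased", "the", trigrams, contexts)  # trigram never occurs
possible = len(set(tokens)) ** 3                            # N^3 parameters
observed = len(trigrams)                                    # non-zero parameters

print(f"P(cat | chased the) = {seen}")
print(f"P(dog | chased the) = {unseen}")
print(f"{observed} of {possible} possible trigrams observed")
```

Even in this tiny corpus almost every parameter is zero, which is the sparsity problem that the smoothing methods in the following sections address.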
b) How do we estimate the probability of unseen elements?
  i. Discounting
    1. Laplace (add-one) smoothing
    2. Good-Turing discounting
  ii. Linear interpolation
  iii. Katz back-off
c) Add-One (Laplace) Smoothing
  i. The simplest discounting technique:
      P(w_i \mid w_{i-1}) = \frac{C(w_{i-1}, w_i) + 1}{C(w_{i-1}) + V}
    where V is the vocabulary size, i.e. the number of word types in the corpus.
    Translator's note: the MIT slides appear to contain an error here; the formula above is the corrected version.
  ii. This is the Bayesian estimator that assumes a uniform unit prior on events.
  iii. Problem: too much probability mass goes to unseen events.
  iv. Example:
    Assume V = 10,000 word types and S = 1,000,000 word tokens:
      P_{MLE}(ball \mid kick\ a) = \frac{\mathrm{Count}(kick\ a\ ball)}{\mathrm{Count}(kick\ a)} = \frac{9}{10} = 0.9
      P_{+1}(ball \mid kick\ a) = \frac{\mathrm{Count}(kick\ a\ ball) + 1}{\mathrm{Count}(kick\ a) + V} = \frac{9 + 1}{10 + 10000} \approx 9.99 \times 10^{-4}
  v. Weaknesses of Laplace
    1. For sparse distributions, Laplace's law gives too much of the probability space to unseen events.
    2. It is worse than other smoothing methods at predicting the actual probabilities of bigrams.
    3. Add-epsilon smoothing (Lidstone's law) is more reasonable.
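The "kick a ball" example can be reproduced with a small add-epsilon function, where epsilon = 1 gives add-one smoothing. This is only a sketch: the function name `lidstone`, the value epsilon = 0.001, and the assumption that just two word types were ever seen after "kick a" are illustrative choices, not part of the lecture.

```python
def lidstone(count_ngram, count_context, vocab_size, eps=1.0):
    """Add-epsilon estimate; eps=1.0 is add-one (Laplace) smoothing."""
    return (count_ngram + eps) / (count_context + eps * vocab_size)

V = 10_000                                  # vocabulary size from the example
p_mle = 9 / 10                              # unsmoothed estimate
p_add_one = lidstone(9, 10, V)              # (9 + 1) / (10 + 10000)
p_add_eps = lidstone(9, 10, V, eps=0.001)   # a much smaller pseudo-count

# Assumption for illustration: only 2 word types were seen after "kick a",
# so the other V - 2 continuations are unseen.
unseen_mass_add_one = (V - 2) * lidstone(0, 10, V)
unseen_mass_add_eps = (V - 2) * lidstone(0, 10, V, eps=0.001)

print(f"MLE          P(ball | kick a) = {p_mle:.4f}")
print(f"add-one      P(ball | kick a) = {p_add_one:.6f}, unseen mass = {unseen_mass_add_one:.4f}")
print(f"add-epsilon  P(ball | kick a) = {p_add_eps:.4f}, unseen mass = {unseen_mass_add_eps:.4f}")
```

With add-one smoothing nearly all of the probability mass moves to continuations that were never observed; a smaller epsilon moves less, but how much is appropriate still has to be tuned.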

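Appendix: the course page with the original English slides (PDF) is at:
      http://people.csail.mit.edu/regina/6881/

Note: this translation is published under the MIT OpenCourseWare Creative Commons license; when reposting, please credit "我爱自然语言处理": www.52nlp.cn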

from：http://www.52nlp.cn/mit-nlp-third-lesson-probabilistic-language-modeling-fourth-part/

MIT Natural Language Processing, Lecture 3: Probabilistic Language Modeling (Part 5)

Translator: 我爱自然语言处理 (www.52nlp.cn, February 10, 2009)

V. Good-Turing Discounting
a) How likely are you to see a new word type in the future? Use things you have seen once to estimate the probability of unseen things.
  i. n_r: the number of elements (n-grams) with frequency r, for r > 0.
  ii. n_0: the size of the total lexicon minus the size of the observed lexicon, i.e. the number of n-grams that occur zero times.
  iii. Modified count for elements with frequency r:
      r^* = (r+1) \frac{n_{r+1}}{n_r}
b) Translator's notes on Good-Turing discounting:
  i. Good (1953) was the first to describe the Good-Turing algorithm; the original idea came from Turing.
  ii. The basic idea of Good-Turing smoothing is to look at the number of n-grams with higher counts in order to re-estimate how much probability mass to assign to n-grams with zero or low counts.
c) Good-Turing Discounting: Intuition
  i. Goal: estimate how often a word with r counts in the training data occurs in a test set of equal size.
  ii. We use deleted estimation:
    1. Delete one word at a time and treat it as the "test" word.
    2. If the "test" word occurs r+1 times in the complete data set:
      - it occurs r times in the "training" set;
      - add one count to the bucket for words with r counts.
  iii. The total count placed in the bucket for r-count words is:
      n_{r+1} (r+1)
  iv. The average count is:
      (avg-count of r-count words) = \frac{n_{r+1} (r+1)}{n_r}
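The deleted-estimation argument can be checked by simulation. The sketch below, on a made-up token list, deletes each token in turn, drops one count into the bucket for its remaining frequency r, and divides by n_r. The function name `leave_one_out_average` is illustrative. The result should agree with (r+1) n_{r+1} / n_r wherever n_r > 0.

```python
from collections import Counter

def leave_one_out_average(tokens):
    """Average held-out count of words that occur r times in the 'training' part."""
    counts = Counter(tokens)
    n = Counter(counts.values())      # n[r] = number of word types with count r
    bucket = Counter()
    for tok in tokens:                # delete one token at a time
        r = counts[tok] - 1           # its count in the remaining "training" set
        bucket[r] += 1                # add one count to the bucket for r
    # The r = 0 bucket is skipped here because n_0 (the number of unseen
    # types) is not known from the sample alone.
    averages = {r: bucket[r] / n[r] for r in bucket if n[r] > 0}
    return averages, n

tokens = "a a a b b c c d e f".split()
averages, n = leave_one_out_average(tokens)

for r in sorted(averages):
    formula = (r + 1) * n[r + 1] / n[r]
    print(f"r={r}: simulated {averages[r]:.4f}, formula {formula:.4f}")
```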
d) Good-Turing Discounting (cont.):
  i. In Good-Turing, the total probability assigned to all unobserved events is n_1/N, where N is the size of the training set. This is the same total that a relative-frequency estimate would assign to the singleton events.
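The n_1/N result follows directly from the modified-count formula with r = 0. Written out as a short derivation (the symbol P_unseen is introduced here only for illustration):

```latex
r^*_0 = (0+1)\,\frac{n_1}{n_0} = \frac{n_1}{n_0},
\qquad
P_{\text{unseen}} = n_0 \cdot \frac{r^*_0}{N}
                  = n_0 \cdot \frac{n_1}{n_0\,N}
                  = \frac{n_1}{N}
```

Each of the n_0 unseen elements receives probability r^*_0 / N, and the n_0 cancels, so the total depends only on the number of singletons.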
e) Example: Good-Turing
  Training sample of 22,000,000 (Church & Gale, 1991)
  r    N_r              held-out    r^*
  0    74,671,100,000   0.000027    0.000027
  1    2,018,046        0.448       0.446
  2    449,721          1.25        1.26
  3    188,933          2.24        2.24
  4    105,668          3.23        3.24
  5    68,379           4.21        4.22
  6    48,190           5.23        5.19
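As a rough check, the r^* column can be recomputed from the N_r column with r^* = (r+1) N_{r+1} / N_r. This is a sketch using only the counts listed in the table; the function name `good_turing` is illustrative. No value is produced for r = 6 because N_7 is not listed, and the raw formula lands within about 0.01 of the tabulated values.

```python
# N_r values copied from the Church & Gale table above.
N_R = {
    0: 74_671_100_000,
    1: 2_018_046,
    2: 449_721,
    3: 188_933,
    4: 105_668,
    5: 68_379,
    6: 48_190,
}

def good_turing(n_r):
    """Return {r: r*} for every r whose successor count n_{r+1} is known."""
    return {r: (r + 1) * n_r[r + 1] / n_r[r] for r in n_r if r + 1 in n_r}

r_star = good_turing(N_R)
for r in sorted(r_star):
    print(f"r={r}: r* = {r_star[r]:.3g}")
```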
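f) Translator's notes:
  i. By Zipf's law, N_r is large for small r and small for large r. For the most frequent n-gram, N_{r+1} = 0, so r^* = 0!
  ii. Good-Turing estimates are therefore unreliable for n-grams with high counts, where the MLE is fairly accurate, so the MLE can be used directly for them. Good-Turing is generally applied only to n-grams whose count k is small (k < 10).
  iii. Seen as "robbing the rich to help the poor", the "rich" who pay here turn out to be the middle class, while the truly rich get off lightly (even though a small loss would not hurt them). Even discounting does not dare to take from the wealthy!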

from：http://www.52nlp.cn/mit-nlp-third-lesson-probabilistic-language-modeling-fifth-part/

MIT Natural Language Processing, Lecture 3: Probabilistic Language Modeling (Part 6)

VI. Interpolation and Back-Off
a) The Bias-Variance Trade-Off
  i. Unsmoothed trigram estimate:
      P_{ML}(w_i \mid w_{i-2}, w_{i-1}) = \frac{\mathrm{Count}(w_{i-2}, w_{i-1}, w_i)}{\mathrm{Count}(w_{i-2}, w_{i-1})}
  ii. Unsmoothed bigram estimate:
      P_{ML}(w_i \mid w_{i-1}) = \frac{\mathrm{Count}(w_{i-1}, w_i)}{\mathrm{Count}(w_{i-1})}
  iii. Unsmoothed unigram estimate:
      P_{ML}(w_i) = \frac{\mathrm{Count}(w_i)}{\sum_j \mathrm{Count}(w_j)}
  iv. How close is each of these estimates to the "true" probability P(w_i \mid w_{i-2}, w_{i-1})? The trigram estimate is the least biased but, with sparse counts, has the highest variance; the unigram estimate is the reverse.
b) Interpolation
  i. One way of solving the sparseness in a trigram model is to mix it with bigram and unigram models, which suffer less from data sparseness.
  ii. The weights can be set using the Expectation-Maximization (EM) algorithm or another numerical optimization technique.
  iii. Linear interpolation:
      \hat{P}(w_i \mid w_{i-2}, w_{i-1}) = \lambda_1 P_{ML}(w_i \mid w_{i-2}, w_{i-1}) + \lambda_2 P_{ML}(w_i \mid w_{i-1}) + \lambda_3 P_{ML}(w_i)
    where \lambda_1 + \lambda_2 + \lambda_3 = 1 and \lambda_i \ge 0 for all i.
  iv. Parameter estimation
    1. Hold out part of the training set as "validation" data.
    2. Define Count_2(w_1, w_2, w_3) to be the number of times the trigram w_1, w_2, w_3 is seen in the validation set.
    3. Choose the \lambda_i to maximize:
      L(\lambda_1, \lambda_2, \lambda_3) = \sum_{w_1, w_2, w_3 \in \mathcal{V}} \mathrm{Count}_2(w_1, w_2, w_3) \log \hat{P}(w_3 \mid w_1, w_2)
    subject to \lambda_1 + \lambda_2 + \lambda_3 = 1 and \lambda_i \ge 0 for all i.
    Translator's note: the rest of the material on parameter estimation is formula-heavy and is omitted here; please refer to the original slides.
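The interpolation and weight-estimation steps can be sketched as follows. The slides only say that EM "or another numerical optimization technique" may be used, so the particular EM update below, the toy corpora, and all function names are assumptions made for illustration rather than the lecture's own procedure.

```python
import math
from collections import Counter

def ngram_counts(tokens):
    """Unigram, bigram and trigram counts from a token list."""
    uni = Counter(tokens)
    bi = Counter(zip(tokens, tokens[1:]))
    tri = Counter(zip(tokens, tokens[1:], tokens[2:]))
    return uni, bi, tri

def component_probs(w1, w2, w3, uni, bi, tri):
    """(trigram, bigram, unigram) ML estimates for predicting w3 after w1 w2."""
    p_tri = tri[(w1, w2, w3)] / bi[(w1, w2)] if bi[(w1, w2)] else 0.0
    p_bi = bi[(w2, w3)] / uni[w2] if uni[w2] else 0.0
    p_uni = uni[w3] / sum(uni.values())
    return p_tri, p_bi, p_uni

def log_likelihood(lambdas, held_out, counts):
    """L(lambda_1, lambda_2, lambda_3) over the validation trigram counts Count_2."""
    total = 0.0
    for trigram, c in held_out.items():
        probs = component_probs(*trigram, *counts)
        mix = sum(l * p for l, p in zip(lambdas, probs))
        if mix == 0.0:
            return float("-inf")
        total += c * math.log(mix)
    return total

def em_lambdas(held_out, counts, iterations=50):
    """Estimate the interpolation weights with EM, starting from uniform weights."""
    lambdas = [1 / 3, 1 / 3, 1 / 3]
    for _ in range(iterations):
        expected = [0.0, 0.0, 0.0]
        for trigram, c in held_out.items():
            probs = component_probs(*trigram, *counts)
            mix = sum(l * p for l, p in zip(lambdas, probs))
            if mix == 0.0:
                continue  # no component explains this trigram; skip it
            for k in range(3):
                # E-step: expected number of times component k produced w3
                expected[k] += c * lambdas[k] * probs[k] / mix
        norm = sum(expected)
        if norm == 0.0:
            break
        lambdas = [e / norm for e in expected]  # M-step: renormalize
    return lambdas

train = "the dog chased the cat the cat chased the dog the dog saw the cat".split()
validation = "the dog chased the cat saw the dog".split()

counts = ngram_counts(train)
held_out = Counter(zip(validation, validation[1:], validation[2:]))  # Count_2
uniform = [1 / 3, 1 / 3, 1 / 3]
lambdas = em_lambdas(held_out, counts)

print("lambdas:", [round(l, 3) for l in lambdas])
print("log-likelihood, uniform weights:", log_likelihood(uniform, held_out, counts))
print("log-likelihood, EM weights:     ", log_likelihood(lambdas, held_out, counts))
```

The weights are fitted on held-out data rather than on the training set because on the training set itself the trigram component always looks best, which would push \lambda_1 toward 1.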
c) Katz Back-Off Models (Bigrams):
  i. Define two sets:
      A(w_{i-1}) = \{ w : \mathrm{Count}(w_{i-1}, w) > 0 \}
      B(w_{i-1}) = \{ w : \mathrm{Count}(w_{i-1}, w) = 0 \}
  ii. A bigram model:
      P_K(w_i \mid w_{i-1}) = \begin{cases} \frac{\mathrm{Count}^*(w_{i-1}, w_i)}{\mathrm{Count}(w_{i-1})} & \text{if } w_i \in A(w_{i-1}) \\ \alpha(w_{i-1}) \frac{P_{ML}(w_i)}{\sum_{w \in B(w_{i-1})} P_{ML}(w)} & \text{if } w_i \in B(w_{i-1}) \end{cases}
      \alpha(w_{i-1}) = 1 - \sum_{w \in A(w_{i-1})} \frac{\mathrm{Count}^*(w_{i-1}, w)}{\mathrm{Count}(w_{i-1})}
  iii. Count^* definitions
    1. Katz uses the Good-Turing method for Count(x) < 5, and Count^*(x) = Count(x) for Count(x) >= 5.
    2. "Kneser-Ney" method:
      Count^*(x) = Count(x) - D, where D = \frac{n_1}{n_1 + 2 n_2}
      n_1 is the number of elements with frequency 1
      n_2 is the number of elements with frequency 2
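A minimal sketch of the bigram back-off model with the subtract-D definition of Count^*. Several choices here are assumptions for illustration: the toy corpus, the function name `katz_bigram_model`, the discount D = n_1 / (n_1 + 2 n_2) (the commonly cited absolute-discounting estimate), and the restriction that the queried word is in the training vocabulary and that B(w_{i-1}) is not empty.

```python
from collections import Counter

def katz_bigram_model(tokens):
    """Return (prob, d): prob(w, prev) is the backed-off P_K(w | prev)."""
    uni = Counter(tokens)
    bi = Counter(zip(tokens, tokens[1:]))
    ctx = Counter(tokens[:-1])            # Count(prev) as a bigram context
    total = sum(uni.values())
    n1 = sum(1 for c in bi.values() if c == 1)
    n2 = sum(1 for c in bi.values() if c == 2)
    d = n1 / (n1 + 2 * n2)                # discount D (assumed estimate)

    def prob(w, prev):
        if bi[(prev, w)] > 0:             # w is in A(prev): discounted estimate
            return (bi[(prev, w)] - d) / ctx[prev]
        seen = {v for v in uni if bi[(prev, v)] > 0}            # A(prev)
        # alpha(prev): mass left over after discounting the seen bigrams
        alpha = 1.0 - sum((bi[(prev, v)] - d) / ctx[prev] for v in seen)
        # Sum of unigram ML estimates over B(prev), used to renormalize
        unseen_mass = sum(uni[v] / total for v in uni if v not in seen)
        return alpha * (uni[w] / total) / unseen_mass

    return prob, d

tokens = "the dog chased the cat the cat chased the dog the dog saw the cat".split()
prob, d = katz_bigram_model(tokens)
vocab = sorted(set(tokens))

print("D =", d)
print("P(dog | the) =", prob("dog", "the"))   # seen bigram
print("P(saw | cat) =", prob("saw", "cat"))   # unseen bigram, backed off
for prev in vocab:
    print(prev, "sums to", round(sum(prob(w, prev) for w in vocab), 6))
```

The final loop checks that, for each context, the discounted mass over A plus the redistributed mass over B sums to one, which is the role \alpha(w_{i-1}) plays in the definition.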

VII. Overview
a) Weaknesses of n-gram Models
  i. Any ideas?
    Short-range
    Mid-range
    Long-range
b) More Refined Models
  i. Class-based models
  ii. Structural models
  iii. Topical and long-range models
c) Summary
  i. Start with a vocabulary
  ii. Select a type of model
  iii. Estimate parameters
d) Toolkit references:
  i. CMU-Cambridge language modeling toolkit:
    http://mi.eng.cam.ac.uk/~prc14/toolkit.html
  ii. SRILM - The SRI Language Modeling Toolkit:
    http://www.speech.sri.com/projects/srilm/

End of Lecture 3!
Next, Lecture 4: Tagging

from：http://www.52nlp.cn/mit-nlp-third-lesson-probabilistic-language-modeling-sixth-part/
