原文

这一篇介绍拼写的检查和更正，主要研究打字者键入的文本，同时这样的算法也可以应用于 OCR 和手写体识别。

这篇博客要解决的三个问题：

Detecting isolated non-words(非词错误检查)
如 giraffe 拼写成 graffe
Fixing isolated non-words(孤立词错误改正)
把 graffe 更正为 giraffe，但只在孤立的环境中寻找这个词
Fixing errors in context(依赖于上下文的错误检查和更正)
真词错误，用上下文来检查和更正拼写错误。如把 I ate desert 改为 I ate dessert

拼写错误模式

Kukich(1992) 把人的打字错误分为两大类：打字操作错误(typographic error) 和 认知错误(cognitive error)。

打字操作错误(typographic error)
一般与键盘有关，如 spell 拼成了 speel
包括 插入(insertion)，脱落(deletion)，替代(substitution)，换位(transposition)
占所有错误类型的大多数
认知错误(cognitive error)
由于写文章的人不知道如何拼写某个单词造成的
- 语音错误: 用语音上等价的字母序列来替代，如 separate 拼成了 seperate
- 同音词错误: 如用 piece 来替代 peace

所以单词的拼写错误其实有两类，Non-word Errors 和 Real-word Errors。前者指那些拼写错误后的词本身就不合法，如错误的将“giraffe”写成“graffe”；后者指那些拼写错误后的词仍然是合法的情况，如将“there”错误拼写为“three”（形近），将“peace”错误拼写为“piece”（同音），这一篇主要讲 Non-word Errors。

补充： OCR 错误分为五类：替代、多重替代、空白脱落、空白插入和识别失败

非词错误的检查与更正

非词错误的检查

一般有两种方法，一是 使用词典，二是 检查序列

使用词典

看键入词是否出现在了词典中。用来做拼写错误检查的词典一般还要包括形态分析模式，来表示能产性的屈折变换和派生词。词典通常是哈希表，用 integer 代替 string，来提高 performance。

检查序列

这个方法是自己概括的。类似于 “letter combination” 的思想。截取单词的部分字母，来检查这个字母序列在词典中出现的频率如何，是不是根本不会出现这种排列组合，如 “xy” 这个序列就基本不会出现在单词中，所以判断这个词是错误的。然而截取的长度很难定义，而且也需要使用词典。

非词错误改正

查找词典中与 error 最近似的词，常见的方法有 Shortest weighted edit distance 和 Highest noisy channel probability。

编辑距离(edit distance)

编辑距离，顾名思义，把一个符号串转换为另一个符号串所需的最小编辑操作的次数(how many letter changes to map A to B)。Leetcode 上有相应的题目。

编辑距离的 4 种转换方式

插入(insertion)
脱落(deletion)
替代(substitution)
换位(transposition)

示例

• Substitutions

– E X A M P E L

– E X A M P L E 2 substitutions

• Insertions

– E X A P L E

– E X A M P L E 1 insertion

• Deletions

– E X A M M P L E

– E X A _ M P L E 1 deletion

下图表示如何从 intention 变换到 execution。

由上图可知 intention 变换到 execution 的 Levenshtein 距离是 5。

算法

最小编辑距离的算法。用动态规划(dynamic programming)来解决。
只有插入(insertion)/脱落(deletion)/替代(substitution)三种操作的 Levenshtein 距离：

加上 transposition 的 Levenshtein 距离：

Levenshtein Distance

function LevenshteinDistance(char s[1..m], char t[1..n]):

// for all i and j, d[i,j] will hold the Levenshtein distance between

// the first i characters of s and the first j characters of t

// note that d has (m+1)*(n+1) values

declare int d[0..m, 0..n]

set each element in d to zero

// source prefixes can be transformed into empty string by

// dropping all characters

for i from 1 to m:

d[i, 0] := i

// target prefixes can be reached from empty source prefix

// by inserting every character

for j from 1 to n:

d[0, j] := j

for j from 1 to n:

for i from 1 to m:

if s[i] = t[j]:

substitutionCost := 0

else:

substitutionCost := 1

d[i, j] := minimum(d[i-1, j] + 1, // deletion

d[i, j-1] + 1, // insertion

d[i-1, j-1] + substitutionCost) // substitution

return d[m, n]

intention 变换到 execution 的最小编辑距离。

Damerau-Levenshtein(DL) distance

algorithm DL-distance is

input: strings a[1..length(a)], b[1..length(b)]

output: distance, integer

da := new array of |Σ| integers

for i := 1 to |Σ| inclusive do

da[i] := 0

let d[−1..length(a), −1..length(b)] be a 2-d array of integers, dimensions length(a)+2, length(b)+2

// note that d has indices starting at −1, while a, b and da are one-indexed.

maxdist := length(a) + length(b)

d[−1, −1] := maxdist

for i := 0 to length(a) inclusive do

d[i, −1] := maxdist

d[i, 0] := i

for j := 0 to length(b) inclusive do

d[−1, j] := maxdist

d[0, j] := j

for i := 1 to length(a) inclusive do

db := 0

for j := 1 to length(b) inclusive do

k := da[b[j]]

ℓ := db

if a[i] = b[j] then

cost := 0

db := j

else

cost := 1

d[i, j] := minimum(d[i−1, j−1] + cost, //substitution

d[i, j−1] + 1, //insertion

d[i−1, j ] + 1, //deletion

d[k−1, ℓ−1] + (i−k−1) + 1 + (j-ℓ−1)) //transposition

da[a[i]] := i

return d[length(a), length(b)]

Optimal String Alignment(OSA) distance

algorithm OSA-distance is

input: strings a[1..length(a)], b[1..length(b)]

output: distance, integer

let d[0..length(a), 0..length(b)] be a 2-d array of integers, dimensions length(a)+1, length(b)+1

// note that d is zero-indexed, while a and b are one-indexed.

for i := 0 to length(a) inclusive do

d[i, 0] := i

for j := 0 to length(b) inclusive do

d[0, j] := j

for i := 1 to length(a) inclusive do

for j := 1 to length(b) inclusive do

if a[i] = b[j] then

cost := 0

else

cost := 1

d[i, j] := minimum(d[i-1, j ] + 1, // deletion

d[i, j-1] + 1, // insertion

d[i-1, j-1] + cost) // substitution

if i > 1 and j > 1 and a[i] = b[j-1] and a[i-1] = b[j] then

d[i, j] := minimum(d[i, j],

d[i-2, j-2] + cost) // transposition

return d[length(a), length(b)]

与唯一的差别就是多了下面几行：

if i > 1 and j > 1 and a[i] = b[j-1] and a[i-1] = b[j] then

d[i, j] := minimum(d[i, j],

d[i-2, j-2] + cost) // transposition

噪声信道模型(Noisy Channel Model)

Noisy Channel Model 即噪声信道模型，或称信源信道模型，这是一个普适性的模型，被用于 语音识别、拼写纠错、机器翻译、中文分词、词性标注、音字转换 等众多应用领域。噪声信道模型本身是一个贝叶斯推理的特殊情况。

其形式很简单，如下图所示：

应用于拼写纠错任务的流程如下：

noisy word（即 spelling error）被看作 original word 通过 noisy channel 转换得到。由于在信道中有噪声，我们很难辨认词汇形式的真实单词的面目。我们的目的就是建立一个信道模型，使得能够计算出这个真实单词是如何被噪声改变面目的，从而恢复它的本来面目。噪声就是给正确的拼写戴上假面具的拼写错误，它有很多来源：发音变异、音子实现时的变异以及来自信道的声学方面的变异（扩音器、电话网络等）。

无论单词 separate 是怎样错误拼写了，我们只想把它识别为 separate。也就是，给定 observation，我们的任务是确定这个 observation 属于哪个类别的集合。所以，我们考虑一切可能的类，也就是一切可能的单词，在这些单词中，我们只想选择那些最有可能给出已有的 observation 的单词。也就是，在词汇 V 的所有单词中，我们只想使得 P(Word|Observation)最大的那个单词，也就是我们对单词 W 的正确估计就是 argmaxP(W|O)

现在已知 noisy word（用 O 表示）如何求得最大可能的 original word（用 W 表示），公式如下：

看一下留下的两个 factor：

P(W): prior, language model
how likely the word is going to be a word
P(O|W): likelihood, channel model/error model
if it was that word, how likely is that generates this exact error, models the correct word into a misspelled word

Bayes 方法应用于拼写的算法分两个步骤：

提出候选更正表(proposing candidate correction)
怎么产生候选更正表？可以采用编辑距离产生下面两种 words set
- 相似拼写(words with similar spelling)
- 相似发音(words with similar pronunciation)
  事实上，80%的错误单词与正确单词的编辑距离是 1，而几乎所有的情况下编辑距离都小于 2
对候选进行打分(scoring the candidate)
用上面的 bayes 公式

Generate candidate words

举个例子，给定拼写错误“acress”，首先通过词典匹配容易确定为 “Non-word spelling error”；然后通过计算最小编辑距离获取最相似的 candidate correction。下面是通过 insertion, deletion, substitution, transposition 四种操作转化产生且编辑距离为 1 的 candidate words。

此时，我们希望选择概率最大的 W 作为最终的拼写建议，基于噪声信道模型思想，需要进一步计算 P(W) 和 P(O|W)。

Language model probability

通过对语料库计数、平滑等处理可以很容易建立语言模型，即可得到 P(w)，如下表所示，计算 Unigram Prior Probability（word 总数：404,253,213）

Channel model probability

P(O|W)=probability of the editP(O|W)=probability of the edit

P(O|W)的精确计算至今还是一个没有解决的课题，我们可以进行简单的估算，用 confusion matrix，confusion matrix 是一个 26*26 的矩阵，表示一个字母被另一个字母错误替代的次数，有 4 种 confusion matrix(因为有四种错误)。

基于大量pair 计算 del、ins、sub 和 trans 四种转移矩阵，然后求得转移概率 P(O|W)，这里用 P(x|w) 表示:

Calculation

计算P(“acress”|w)如下：

计算P(w)P(“acress”|w)如下：

“across”相比其他 candidate 可能性更大。

Evaluation

一些测试集：
• Wikipedia’s list of common English misspelling
• Aspell filtered version of that list
• Birkbeck spelling error corpus
• Peter Norvig’s list of errors (includes Wikipedia and Birkbeck, for training
or tes/ng)

Other application

噪声信道模型：Y →→ Channel →→ X
看一下其他应用：
在 POS tag 里，Y 就是 POS tag 序列，X 就是单词序列。
在机器翻译里，如 L1 翻译成 L2，那么 Y 就是 L2， X 就是 L1，P(Y)就是 language model，P(X|Y) 就是 channel model。

真词错误的检查和更正

25%-40% 的错误是真词(Real-word)，比如说 Can they lave him my messages? / The study was conducted mainly beJohn Black. 这类错误。真词错误的检查和更正往往依赖于上下文。

真词(Real-word)的检查和纠正:
Detection： 每个 word 都作为 spelling error candidate。
Correction： 从发音和拼写等角度，查找与每个 word 最近似的 words 集合作为拼写建议，常见的方法有 Highest noisy channel probability 和 classifier

对一个句子中的每个单词，都选出与之编辑距离为 1 的所有单词作为候选单词(包括原单词本身)，也就是说一个句子 N 个单词，就有 N 个 candidate set，然后从每个单词 set 里各取出一个单词组成一个句子，求 P(W) 最大的单词序列

简化版，就是在所有 candidate words 里，每次只选出一个单词，与其它原词组成句子，然后同样求 P(W) 最大的单词序列。

真词纠正和非词纠正的逻辑相同，都有 language model 概率和 channel model 概率，不同的是对真词纠正，channel model 的概率包含了 p(w|w) 也就是完全没有错误的概率。

拼写错误更正系统

为了使人机交互(HCI)的体验更加友好，我们可以根据拼写检查的 confidence 来决定对其进行哪种操作

Very confident
直接改正(autocorrect)
Less confident
给出最佳的更正单词(best correction)
Less confident
给出更正列表(correction list)
Unconfident
给出错误提示(flag)，就像 MS word 里错误单词下面的红线提示

Phone’c error model

噪声信道模型的改进

可以改进/思考的方向：

language model: 上文用的是 unigram，实际应用中当然是 bigram/trigram 效果会更好
channel model: 是否要考虑编辑距离大于 1 的情况
unseen words: 关于未登录词，尤其是层出不穷的新动词/名词，我们怎么处理。一般还是检查单词序列/组合的概率分布
classifier: 考虑分类器，综合各种特征

在实际应用中，我们并不会直接把 prior 和 error model probability 相乘，因为我们不能作出独立性假设，所以，我们会用权重来计算：

w^=argmaxw∈VP(o|w)P(w)λw^=argmaxw∈VP(o|w)P(w)λ

通常从训练集里学习 λλ 参数。

有其他的方法对噪声信道模型进行改进，如允许更多的编辑(ph → f, le → al, etc.)，把发音特征加入信道等。另外，也可以把 channel model 和 language model 当做特征，并加入其他特征，来训练分类器来进行拼写错误的改正。

根据可能影响 p(misspelling|word) 的因素来提取特征：

• The source letter

• The target letter

• Surrounding letters

• The position in the word

• Nearby keys on the keyboard

• Homology on the keyboard

• Pronunciations

• Likely morpheme transformations

参考链接
斯坦福大学自然语言处理第五课“拼写纠错（Spelling Correction）”

Spelling, Edit Distance, and Noisy Channels 拼写、编辑距离和噪声通道相关推荐

字符串编辑距离（Edit Distance）
一.问题描述定义字符串编辑距离(Edit Distance),是俄罗斯科学家 Vladimir Levenshtein 在 1965 年提出的概念,又称 Levenshtein 距离,是指两个字符 ...
详解编辑距离(Edit Distance)及其代码实现
概述编辑距离(Minimum Edit Distance,MED),由俄罗斯科学家 Vladimir Levenshtein 在1965年提出,也因此而得名 Levenshtein Distance ...
mysql 编辑距离搜索_最小编辑距离（Minimum Edit Distance）
最小编辑距离 1)定义编辑距离(Minimum Edit Distance,MED),又称Levenshtein距离,是指两个字符串之间,由一个转成另一个所需要的最少编辑操作次数.允许的编辑操作包括 ...
stanford NLP学习笔记3：最小编辑距离（Minimum Edit Distance）
I. 最小编辑距离的定义最小编辑距离旨在定义两个字符串之间的相似度(word similarity).定义相似度可以用于拼写纠错,计算生物学上的序列比对,机器翻译,信息提取,语音识别等. 编辑距离就 ...
编辑距离算法（Edit Distance）
目录一.概念二.算法过程三.举例说明四.Python实现 1. 递归方式 2. 动态规划 3. Jupyter Notebook 步骤分解一.概念编辑距离的作用主要是用来比较两个字符串的相 ...
Edit Distance编辑距离（NM tag）- sam/bam格式解读进阶
sam格式很精炼,几乎包含了比对的所有信息,我们平常用到的信息很少,但特殊情况下,我们会用到一些较为生僻的信息,关于这些信息sam官方文档的介绍比较精简,直接看估计很难看懂. 今天要介绍的是如何通过b ...
Minimum edit distance（levenshtein distance）(最小编辑距离)初探
最小编辑距离的定义:编辑距离(Edit Distance),又称Levenshtein距离.是指两个字串之间,由一个转成还有一个所需的最少编辑操作次数.许可的编辑操作包含将一个字符替换成还有一个字符. ...
[LeetCode] One Edit Distance 一个编辑距离
Given two strings S and T, determine if they are both one edit distance apart. 这道题是之前那道Edit Distance ...
最小编辑距离算法 Edit Distance（经典DP）
最小编辑距离算法 Edit Distance(经典DP) 转载baodream 最后发布于2018-05-23 11:36:32 阅读数 22226 收藏展开编辑距离(Edit Distance ...
python 编辑距离_python实现编辑距离edit distance
1.定义理解 edit distance--指两个字符串之间,一个转为另一个的最小编辑次数(方式有:插入/删除/替换) 若edit distance越小,则字符串之间的相似度越高. 例1: 输入: w ...

Spelling, Edit Distance, and Noisy Channels 拼写、编辑距离和噪声通道

拼写错误模式

非词错误的检查与更正

非词错误的检查

使用词典

检查序列

非词错误改正

编辑距离(edit distance)

噪声信道模型(Noisy Channel Model)

真词错误的检查和更正

拼写错误更正系统

Phone’c error model

噪声信道模型的改进

Spelling, Edit Distance, and Noisy Channels 拼写、编辑距离和噪声通道相关推荐

最新文章

热门文章