Speech recognition: keyword detection

introduction

Also known as word spotting, audio indexing, or spoken term detection.
The recognizer outputs a word lattice, and the posterior probability of each keyword is computed from that lattice.

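The evaluation metric is the actual term-weighted value (ATWV), averaged over the search terms s:

ATWV=\mathrm{mean}_s\left(\frac{N_{correct}(s)}{N_{true}(s)}-\beta\frac{N_{spurious}(s)}{T-N_{true}(s)}\right)
where N_{correct}(s) is the number of correct detections of term s, N_{true}(s) is the number of occurrences of the term in the reference, N_{spurious}(s) is the number of incorrect detections (false alarms), and T is the duration of the audio in seconds. \beta is usually set to 999.9 in evaluations.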
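The metric can be sketched in a few lines of Python. This is a minimal illustration of the formula above, not an official scoring tool: the function name `atwv`, the `counts` layout, and the choice to skip terms that never occur in the reference are all assumptions made for the example.

```python
def atwv(counts, total_seconds, beta=999.9):
    """Average term-weighted value over all terms.

    counts: dict mapping term -> (n_correct, n_spurious, n_true)
    total_seconds: T, the duration of the audio in seconds
    """
    values = []
    for term, (n_correct, n_spurious, n_true) in counts.items():
        if n_true == 0:
            # Assumption: a term with no reference occurrence has an
            # undefined hit rate, so it is left out of the average.
            continue
        hit_rate = n_correct / n_true
        false_alarm_rate = n_spurious / (total_seconds - n_true)
        values.append(hit_rate - beta * false_alarm_rate)
    if not values:
        raise ValueError("no term occurs in the reference")
    return sum(values) / len(values)
```

A system that finds every occurrence with no false alarms scores 1, and a system that outputs nothing scores 0. Because \beta is large, a single false alarm on a rare term costs far more than a single miss.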
The detection system has four parts:
1. speech-to-text engine
Outputs lattices and single-best phonetic transcripts.
2. indexer
The indexer takes these as input and creates an index containing a precomputed list of candidate detection records for each word in the speech-to-text lexicon. The index also contains the phonetic transcripts to accommodate out-of-vocabulary search terms.
3. detector
The detector loads the index and processes a list of search terms, generating a sorted, scored list of detection records for each term.
4. decider
The decider takes the lists of candidate detections and the cost parameter β and sets a per-term score threshold for making yes/no decisions.

system

recognition

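For a large amount of offline speech data, first segment the audio, then decode it with a general-purpose speech recognition system to obtain lattices (whose edges carry acoustic and language-model scores).
Detecting keywords directly on the single recognized transcript would lead to more misses, because of homophones.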
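One thing the scores on lattice edges make possible is a posterior probability for every edge, which is what lets a homophone that lost in the best path still be found. The sketch below uses a standard forward-backward pass in the log domain. The lattice encoding is an assumption made for this example: node ids are topologically ordered, node 0 is the start, the last node is the end, and each arc carries one combined log score (acoustic plus language model).

```python
import math


def logsumexp(a, b):
    """log(exp(a) + exp(b)) without overflow; -inf means 'no path'."""
    if a == -math.inf:
        return b
    if b == -math.inf:
        return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))


def arc_posteriors(num_nodes, arcs):
    """Posterior of each arc, in the same order as `arcs`.

    arcs: list of (src, dst, word, log_score) with src < dst.
    Assumes at least one path connects node 0 to the last node.
    """
    fwd = [-math.inf] * num_nodes  # log mass of paths from start to node
    bwd = [-math.inf] * num_nodes  # log mass of paths from node to end
    fwd[0] = 0.0
    bwd[num_nodes - 1] = 0.0

    # Forward pass: arcs into a node all have a smaller src than arcs
    # leaving it, so visiting arcs by ascending src is a valid order.
    for src, dst, _, score in sorted(arcs, key=lambda a: a[0]):
        fwd[dst] = logsumexp(fwd[dst], fwd[src] + score)

    # Backward pass: the mirror image, visiting arcs by descending dst.
    for src, dst, _, score in sorted(arcs, key=lambda a: a[1], reverse=True):
        bwd[src] = logsumexp(bwd[src], score + bwd[dst])

    total = fwd[num_nodes - 1]  # log mass of all complete paths
    return [math.exp(fwd[src] + score + bwd[dst] - total)
            for src, dst, _, score in arcs]
```

Two competing arcs over the same span, for example a word and its homophone, share the probability mass in proportion to their path scores, so both remain searchable.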

indexing

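Build the index. Suppose the candidate words that appear in the lattices are w_1,w_2,...,w_L.
1. First compute the posterior probability of every word w_i that appears in a lattice, using the likelihood scores the lattice carries.
2. Sum the posteriors of occurrences of the same word w_i over the same time span to get its final score.
3. Collect the words w_i from all lattices into L independent linked lists, one per word, each sorted by posterior in descending order.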
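Steps 2 and 3 can be sketched as follows, assuming step 1 has already produced a posterior for each lattice arc. The tuple layout, the function name `build_index`, the use of Python lists in place of linked lists, and the rule that "same time span" means overlapping intervals are assumptions for illustration.

```python
from collections import defaultdict


def build_index(arcs):
    """Build a per-word index of detection records.

    arcs: iterable of (word, start, end, posterior), times in seconds.
    Returns dict word -> list of (start, end, posterior), sorted by
    posterior from highest to lowest.
    """
    by_word = defaultdict(list)
    for word, start, end, posterior in arcs:
        by_word[word].append((start, end, posterior))

    index = {}
    for word, hits in by_word.items():
        hits.sort()  # order by start time so overlaps are adjacent
        merged = []
        for start, end, posterior in hits:
            if merged and start < merged[-1][1]:
                # Overlaps the previous record of the same word:
                # accumulate the posterior and extend the span.
                prev_start, prev_end, prev_post = merged[-1]
                merged[-1] = (prev_start, max(prev_end, end),
                              prev_post + posterior)
            else:
                merged.append((start, end, posterior))
        # One sorted list per word, best candidates first.
        merged.sort(key=lambda record: record[2], reverse=True)
        index[word] = merged
    return index
```

Keeping each list sorted means the detector can read the most likely candidates first and stop as soon as the scores fall below the decision threshold.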

detection

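Single word: look it up directly in the index.
Multiple words: look up each word first, then keep only the combinations that are in the correct word order and separated by short time gaps.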
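A sketch of the multi-word filter, assuming an index that maps each word to a list of (start, end, posterior) records. The function name `find_phrase`, the 0.5 second default for `max_gap`, and scoring a phrase by the product of its word posteriors are assumptions for this example, not choices stated in the source.

```python
def find_phrase(index, words, max_gap=0.5):
    """Find occurrences of a word sequence in a per-word index.

    index: dict word -> list of (start, end, posterior)
    words: the search term as a list of words, in order
    Returns a list of (start, end, score), best score first.
    """
    # Start from every candidate of the first word.
    partial = list(index.get(words[0], []))
    for word in words[1:]:
        extended = []
        for start, end, score in partial:
            for w_start, w_end, w_post in index.get(word, []):
                gap = w_start - end
                # Keep the pair only if the next word starts after the
                # previous one ends (correct order) and the gap is short.
                if 0 <= gap <= max_gap:
                    extended.append((start, w_end, score * w_post))
        partial = extended
    return sorted(partial, key=lambda hit: hit[2], reverse=True)
```

With a single-word term the loop body never runs, so the function reduces to a plain index lookup, matching the single-word case.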

decision

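A candidate detection with posterior probability p is accepted when the expected gain from a hit, p/N_{true}, exceeds the expected false-alarm cost, \beta(1-p)/(T-N_{true}). Solving for p gives the threshold:

p>\frac{N_{true}}{T/\beta+\frac{\beta-1}{\beta}N_{true}}
Here \frac{\beta-1}{\beta}\approx1, and for three hours of speech T/\beta\approx10. N_{true}(w_i) is unknown, so it is estimated as the sum of the posteriors of all candidates for w_i, multiplied by a term-independent coefficient.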
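The per-term decision can be sketched as below. The function name `decide` and the parameter `scale`, which stands in for the term-independent coefficient, are assumptions for this example. The source does not give a value for the coefficient, so it defaults to 1.0 here purely as a placeholder.

```python
def decide(posteriors, total_seconds, beta=999.9, scale=1.0):
    """Set a per-term threshold and make yes/no decisions.

    posteriors: posteriors of all candidate detections of one term
    total_seconds: T, the duration of the audio in seconds
    scale: term-independent coefficient applied to the posterior sum
           (placeholder value; it would be tuned in practice)
    Returns (threshold, list of booleans aligned with `posteriors`).
    """
    # N_true is unknown, so estimate it from the candidates themselves.
    n_true_est = scale * sum(posteriors)
    threshold = n_true_est / (
        total_seconds / beta + (beta - 1.0) / beta * n_true_est
    )
    return threshold, [p > threshold for p in posteriors]
```

For three hours of audio (T = 10800) and candidates with posteriors 0.9, 0.6 and 0.1, the estimate of N_{true} is 1.6 and the threshold comes out near 0.13, so the first two candidates are accepted and the third is rejected. Rare terms get a low threshold and frequent terms a high one, which is the point of thresholding per term.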

problems

out-of-vocabulary words
online system for real-time tasks

reference

Rapid and Accurate Spoken Term Detection
Kaldi keyword search code