Basic concepts of speech -- translated from CMU Sphinx




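This is the first part of the CMU Sphinx speech recognition system wiki, which mainly introduces some basic concepts of speech. I have tried to translate it. My English is limited, so mistakes in the translation are hard to avoid; corrections are very welcome!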

Basic concepts of speech


Speech is a complex phenomenon. People rarely understand how it is produced and perceived. The naive perception is often that speech is built with words, and each word consists of phones. The reality is unfortunately very different. Speech is a dynamic process without clearly distinguished parts. It's always useful to open a speech recording in a sound editor, look at it, and listen to it. Here, for example, is a speech recording in an audio editor.


All modern descriptions of speech are to some degree probabilistic. That means that there are no certain boundaries between units, or between words. Speech to text translation and other applications of speech are never 100% correct. That idea is rather unusual for software developers, who usually work with deterministic systems. And it creates a lot of issues specific only to speech technology.


Structure of speech


In current practice, speech structure is understood as follows:


Speech is a continuous audio stream where rather stable states mix with dynamically changing states. In this sequence of states, one can define more or less similar classes of sounds, or phones. Words are understood to be built of phones, but this is certainly not true. The acoustic properties of a waveform corresponding to a phone can vary greatly depending on many factors: phone context, speaker, style of speech and so on. The so-called coarticulation makes phones sound very different from their "canonical" representation. Next, since transitions between words are more informative than stable regions, developers often talk about diphones, which are parts of phones between two consecutive phones. Sometimes developers talk about subphonetic units, which are different substates of a phone. Often three or more regions of a different nature can easily be found.

The number three is easily explained. The first part of the phone depends on its preceding phone, the middle part is stable, and the last part depends on the subsequent phone. That's why there are often three states in a phone selected for HMM recognition.



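Coarticulation (a sound changes under the influence of the sounds before and after it; in terms of how speech is produced, the vocal organs can only change gradually when moving from one sound to the next, so the spectrum of the later sound differs from its spectrum under other conditions) makes the perceived phone differ from its canonical form, so we need context to identify phones. A phone is therefore divided into several subphonetic units. The number three comes about as follows: the first part of the phone is related to the phone before it, the middle part is stable, and the last part is related to the next phone. This is why a three-state HMM per phone is chosen when doing speech recognition with HMMs. (Context-dependent modeling takes this influence into account, so the model can describe speech more accurately. A unit that considers only the influence of the preceding sound is called a Bi-Phone, and one that considers both the preceding and the following sound is called a Tri-Phone.)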

Sometimes phones are considered in context. There are triphones or even quinphones. But note that unlike phones and diphones, they are matched with the same range in the waveform as plain phones. They just differ by name. That's why we prefer to call this object a senone. A senone's dependence on context could be more complex than just left and right context. It can be a rather complex function defined by a decision tree, or in some other way.


Next, phones build subword units, like syllables. Sometimes, syllables are defined as “reduction-stable entities”. To illustrate, when speech becomes fast, phones often change, but syllables remain the same. Also, syllables are related to intonational contour. There are other ways to build subwords - morphologically-based in morphology-rich languages or phonetically-based. Subwords are often used in open vocabulary speech recognition.


Subwords form words. Words are important in speech recognition because they restrict combinations of phones significantly. If there are 40 phones and an average word has 7 phones, there could be up to 40^7 possible words. Luckily, even a very educated person rarely uses more than 20k words in practice, which makes recognition far more feasible.
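To get a feeling for how strongly the vocabulary restricts the search, here is a small sketch of the arithmetic. The numbers are the rough figures from the paragraph above; the variable names are only illustrative.

```python
# Rough figures from the text: 40 phones, 7 phones per average word,
# and a working vocabulary of about 20k words.
num_phones = 40
phones_per_word = 7
vocabulary_size = 20_000

# Number of distinct 7-phone sequences if every combination were allowed.
possible_sequences = num_phones ** phones_per_word

# Share of those sequences that a 20k-word vocabulary could occupy at most.
fraction_used = vocabulary_size / possible_sequences

print(f"{possible_sequences:,} possible phone sequences")
print(f"at most {fraction_used:.2e} of them are words in the vocabulary")
```

In other words, a word list rules out all but a tiny fraction of the phone combinations before any audio is examined.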


Words and other non-linguistic sounds, which we call fillers (breath, um, uh, cough), form utterances. They are separate chunks of audio between pauses. They don't necessarily match sentences, which are more semantic concepts.


On top of this, there are dialog acts like turns, but they go beyond the scope of this document.

Recognition process


The common way to recognize speech is the following: we take the waveform, split it into utterances at the silences, and then try to recognize what is being said in each utterance. To do that, we want to take all possible combinations of words and try to match them with the audio. We choose the best matching combination. There are a few important things in this match.



First of all, there is the concept of features. Since the number of parameters is large, we try to reduce it. Features are numbers calculated from speech, usually by dividing the speech into frames. Then, for each frame, typically 10 milliseconds long, we extract 39 numbers that represent the speech. That is called a feature vector. How to generate these numbers is a subject of active investigation, but in the simple case they are derived from the spectrum.
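The framing step can be sketched as follows. This is only an illustration under simplifying assumptions: frames do not overlap, and a single number (log energy) stands in for the 39 spectrum-derived numbers a real front end would compute. The function names and the synthetic test tone are made up for the example.

```python
import math

def split_into_frames(samples, sample_rate, frame_ms=10):
    """Cut the signal into consecutive, non-overlapping frames of frame_ms."""
    frame_len = int(sample_rate * frame_ms / 1000)
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]

def log_energy(frame):
    """One toy feature per frame: the log of the mean squared amplitude."""
    energy = sum(s * s for s in frame) / len(frame)
    return math.log(energy + 1e-10)  # small constant avoids log(0) on silence

# One second of a synthetic 440 Hz tone sampled at 16 kHz (assumed input).
sample_rate = 16000
samples = [math.sin(2 * math.pi * 440 * n / sample_rate)
           for n in range(sample_rate)]

frames = split_into_frames(samples, sample_rate)
# Each frame becomes a (here one-dimensional) feature vector.
features = [[log_energy(frame)] for frame in frames]
```

With these settings one second of audio yields 100 frames of 160 samples each, so the recognizer works on 100 short feature vectors instead of 16000 raw samples.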



Second, there is the concept of the model. A model describes some mathematical object that gathers the common attributes of the spoken word. In practice, the audio model of a senone is a Gaussian mixture over its three states; to put it simply, it is the most probable feature vector. The concept of the model raises the following issues: how well the model fits practice, whether the model can be improved despite its internal problems, and how well the model adapts to changed conditions.
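As a rough sketch of what "matching a feature vector against a Gaussian mixture" means, the code below scores a vector with a diagonal-covariance mixture. The weights, means and variances are invented toy values, not parameters of any real acoustic model, and the function names are illustrative.

```python
import math

def log_gaussian(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at vector x."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))

def log_gmm(x, weights, means, variances):
    """Log density of a Gaussian mixture, using log-sum-exp for stability."""
    logs = [math.log(w) + log_gaussian(x, m, v)
            for w, m, v in zip(weights, means, variances)]
    peak = max(logs)
    return peak + math.log(sum(math.exp(value - peak) for value in logs))

# Toy two-component mixture over two-dimensional feature vectors.
weights = [0.6, 0.4]
means = [[0.0, 0.0], [3.0, 3.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]

near = log_gmm([0.1, -0.1], weights, means, variances)   # close to a mean
far = log_gmm([10.0, 10.0], weights, means, variances)   # far from both means
```

A vector close to one of the means gets a much higher score than a distant one, which is the sense in which the model stores a "most probable feature vector".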



Third, there is the matching process itself. Since comparing all feature vectors with all models would take longer than the universe has existed, the search is usually optimized with many tricks. At any point we maintain the best matching variants and extend them as time goes on, producing the best matching variants for the next frame.
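One common trick of this kind is a frame-by-frame search that drops poorly scoring variants (a beam search). The sketch below is a minimal illustration and not the decoder of any particular engine; the three-state left-to-right model, the scores, and the name `beam_search` are assumptions made for the example.

```python
import math

def beam_search(emissions, transitions, initial, beam):
    """Extend the best variants frame by frame, pruning weak ones.

    emissions:   list of dicts, state -> log-likelihood of that frame
    transitions: dict, state -> dict of next_state -> log transition prob
    initial:     dict, state -> log start probability
    beam:        variants scoring more than this below the best are dropped
    """
    hyps = {s: (lp + emissions[0][s], [s]) for s, lp in initial.items()}
    for frame in emissions[1:]:
        extended = {}
        for state, (score, path) in hyps.items():
            for nxt, trans_lp in transitions[state].items():
                cand = score + trans_lp + frame[nxt]
                # Keep only the best variant that ends in each state.
                if nxt not in extended or cand > extended[nxt][0]:
                    extended[nxt] = (cand, path + [nxt])
        best = max(score for score, _ in extended.values())
        hyps = {s: h for s, h in extended.items() if h[0] >= best - beam}
    return max(hyps.values(), key=lambda h: h[0])

half = math.log(0.5)
transitions = {0: {0: half, 1: half}, 1: {1: half, 2: half}, 2: {2: 0.0}}
emissions = [
    {0: -1.0, 1: -5.0, 2: -5.0},
    {0: -1.0, 1: -4.0, 2: -5.0},
    {0: -5.0, 1: -1.0, 2: -4.0},
    {0: -5.0, 1: -4.0, 2: -1.0},
]
score, path = beam_search(emissions, transitions, {0: 0.0}, beam=5.0)
```

A narrow beam makes the search faster but risks pruning the variant that would have won later, which is exactly the precision versus speed trade-off of the matching process.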




According to the speech structure, three models are used in speech recognition to do the match:

An acoustic model contains acoustic properties for each senone. There are context-independent models that contain properties (most probable feature vectors for each phone) and context-dependent ones (built from senones with context).



A phonetic dictionary contains a mapping from words to phones. This mapping is not very effective. For example, only two to three pronunciation variants are noted in it, but it's practical enough most of the time. The dictionary is not the only way to map words to phones. The mapping could also be done with some complex function learned with a machine learning algorithm.
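A phonetic dictionary can be pictured as a plain lookup table. The miniature dictionary below is hypothetical (entries are written in the style of the CMU Pronouncing Dictionary, with stress marks omitted), and `phone_sequences` is an illustrative helper, not part of any toolkit.

```python
# Hypothetical miniature dictionary: each word maps to its pronunciation variants.
PRONUNCIATIONS = {
    "hello": [["HH", "AH", "L", "OW"], ["HH", "EH", "L", "OW"]],
    "world": [["W", "ER", "L", "D"]],
}

def phone_sequences(words, dictionary):
    """Expand a word sequence into every phone sequence the dictionary allows."""
    sequences = [[]]
    for word in words:
        variants = dictionary[word.lower()]  # KeyError for out-of-vocabulary words
        sequences = [seq + variant for seq in sequences for variant in variants]
    return sequences

expansions = phone_sequences(["hello", "world"], PRONUNCIATIONS)
```

Here "hello world" expands into two phone sequences, one per pronunciation variant of "hello". A word that is missing from the table cannot be expanded at all, which is one reason learned word-to-phone mappings are attractive.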




A language model is used to restrict word search. It defines which word could follow previously recognized words (remember that matching is a sequential process) and helps to significantly restrict the matching process by stripping words that are not probable. The most common language models are n-gram language models, which contain statistics of word sequences, and finite state language models, which define speech sequences by a finite state automaton, sometimes with weights. To reach a good accuracy rate, your language model must be very successful in search space restriction. This means it should be very good at predicting the next word. A language model usually restricts the vocabulary considered to the words it contains. That's an issue for name recognition. To deal with this, a language model can contain smaller chunks like subwords or even phones. Please note that search space restriction in this case is usually worse and the corresponding recognition accuracies are lower than with a word-based language model.
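The simplest n-gram model is a bigram model that only looks at the previous word. The sketch below estimates it from raw counts without any smoothing; the three training sentences and the function names are made up for illustration.

```python
from collections import Counter, defaultdict

def train_bigram(sentences):
    """Count how often each word follows each other word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = ["<s>"] + sentence.split() + ["</s>"]
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts

def next_word_probs(counts, prev):
    """Relative frequency of each word seen after prev (no smoothing)."""
    total = sum(counts[prev].values())
    return {word: c / total for word, c in counts[prev].items()} if total else {}

corpus = ["turn the light on", "turn the light off", "turn the radio on"]
counts = train_bigram(corpus)
after_the = next_word_probs(counts, "the")
```

After "the", only "light" and "radio" receive any probability, so the decoder does not need to consider any other word at that point. It also shows the vocabulary limitation mentioned above: a word that never occurred in the training text gets no probability at all.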



Those three entities are combined together in an engine to recognize speech. If you are going to apply your engine to some other language, you need to get such structures in place. For many languages there are acoustic models, phonetic dictionaries and even large vocabulary language models available for download.


Other concepts used


A Lattice is a directed graph that represents variants of the recognition. Often, getting the best match is not practical; in that case, lattices are good intermediate formats to represent the recognition result.


N-best lists of variants are like lattices, though their representations are not as dense as the lattice ones.

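N-best lists are a bit like lattices, but they are not as dense as lattices (that is, they keep fewer results than lattices do). (N-best search and multi-pass search: to make use of various knowledge sources during search, several passes are usually needed. The first pass uses cheap knowledge sources (such as the acoustic model, the language model and the phonetic dictionary) to produce a candidate list or a word candidate lattice. On top of that, a second pass uses expensive knowledge sources (such as 4-gram or 5-gram N-Gram models and context-dependent models of order 4 or higher) to find the best path.)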

Word confusion networks (sausages) are lattices where the strict order of nodes is taken from lattice edges.
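The relation between a lattice and an N-best list can be sketched with a tiny hand-made graph. The lattice, its words and its log scores below are invented for illustration, and enumerating every path is only workable for such a toy example.

```python
import heapq

# Toy lattice: node -> list of (next_node, word, log_score) edges.
LATTICE = {
    0: [(1, "recognize", -1.0), (1, "wreck a nice", -1.5)],
    1: [(2, "speech", -0.5), (2, "beach", -1.2)],
    2: [],
}

def all_paths(lattice, node, final):
    """Yield (total_log_score, words) for every path from node to final."""
    if node == final:
        yield 0.0, []
        return
    for nxt, word, score in lattice[node]:
        for rest_score, rest_words in all_paths(lattice, nxt, final):
            yield score + rest_score, [word] + rest_words

def n_best(lattice, start, final, n):
    """Flatten the lattice into its n highest-scoring word sequences."""
    return heapq.nlargest(n, all_paths(lattice, start, final),
                          key=lambda path: path[0])

top_two = n_best(LATTICE, 0, 2, 2)
```

The lattice stores four variants with only four edges, while the 2-best list keeps just two of them written out in full, which is why lattices are the denser representation.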


Speech database - a set of typical recordings from the task database. If we develop a dialog system, it might be dialogs recorded from users. For a dictation system, it might be recordings of read text. Speech databases are used to train, tune and test the decoding systems.


Text databases - sample texts collected for language model training and so on. Usually, databases of texts are collected in sample text form. The issue with collection is converting existing documents (PDFs, web pages, scans) into spoken text form. That is, you need to remove tags and headings, to expand numbers to their spoken form, and to expand abbreviations.

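Text database: text collected for training the language model. It is usually collected in the form of sample text. The difficulty in collecting it is converting existing documents such as PDFs, web pages and scans into spoken text form. So we need to remove the tags and headers that these files bring along, expand numbers into their spoken form (for example, expand 1 into the English "one" or the Chinese "yi"), and expand abbreviations back into full words.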
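The cleanup steps listed above can be sketched in a few lines. This is a deliberately small illustration: the abbreviation table is hypothetical, only integers from 0 to 99 are spelled out, and real text normalization needs far more rules.

```python
import re

ABBREVIATIONS = {"Dr.": "doctor", "St.": "street"}  # hypothetical table
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]

def number_to_words(n):
    """Spell out an integer from 0 to 99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + ("" if ones == 0 else " " + ONES[ones])

def normalize(text):
    """Turn a marked-up written sentence into plain spoken-form text."""
    text = re.sub(r"<[^>]+>", " ", text)              # remove HTML-style tags
    for abbreviation, full in ABBREVIATIONS.items():  # expand abbreviations
        text = text.replace(abbreviation, full)
    text = re.sub(r"\b\d{1,2}\b",                     # expand 1-2 digit numbers
                  lambda m: number_to_words(int(m.group())), text)
    text = re.sub(r"[^\w\s']", " ", text)             # drop leftover punctuation
    return " ".join(text.lower().split())

spoken = normalize("<p>Dr. Smith lives at 42 Main St.</p>")
```

For the sample sentence this produces "doctor smith lives at forty two main street", which is the form a language model for speech should be trained on.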


What is optimized


When speech recognition is being developed, the most complex issue is to make search precise (consider as many variants to match as possible) and to make it fast enough to not run for ages. There are also issues with making the model match the speech since models aren't perfect.


Usually the system is tested on a test database that is meant to represent the target task correctly.


The following characteristics are used:


Word error rate. Suppose we have an original text of N words and the recognized text. Compared with the original, I words were inserted, D words were deleted and S words were substituted in the recognized text. The word error rate is

WER = (I + D + S) / N

WER is usually measured in percent.

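Word error rate: we have an original text that is N words long and the recognized text. (When recognizing a string of words, insertion, substitution and deletion errors are hard to avoid.) I is the number of inserted words, D is the number of deleted words, and S is the number of substituted words. The word error rate is then defined as: WER = (I + D + S) / N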
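The counts I, D and S come from aligning the recognized text with the original text. A minimal sketch using the standard edit-distance alignment is shown below; the function name and the example sentences are illustrative, and when several alignments are equally good the split into I, D and S can differ while the WER stays the same.

```python
def wer_counts(reference, hypothesis):
    """Return (insertions, deletions, substitutions, N) for two word strings."""
    ref, hyp = reference.split(), hypothesis.split()
    # table[i][j] = (errors, ins, dels, subs) for ref[:i] against hyp[:j]
    table = [[None] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    table[0][0] = (0, 0, 0, 0)
    for i in range(1, len(ref) + 1):
        table[i][0] = (i, 0, i, 0)          # everything deleted
    for j in range(1, len(hyp) + 1):
        table[0][j] = (j, j, 0, 0)          # everything inserted
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                match = table[i - 1][j - 1]
            else:
                e, ins, dels, subs = table[i - 1][j - 1]
                match = (e + 1, ins, dels, subs + 1)
            e, ins, dels, subs = table[i][j - 1]
            insertion = (e + 1, ins + 1, dels, subs)
            e, ins, dels, subs = table[i - 1][j]
            deletion = (e + 1, ins, dels + 1, subs)
            table[i][j] = min(match, insertion, deletion)
    _, ins, dels, subs = table[-1][-1]
    return ins, dels, subs, len(ref)

i_count, d_count, s_count, n_words = wer_counts("turn the light on",
                                                "turn light on now")
wer = (i_count + d_count + s_count) / n_words
```

In this example "the" was deleted and "now" was inserted, so with N = 4 the word error rate is 2 / 4, or 50%.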


Accuracy. It is almost the same thing as word error rate, but it doesn't count insertions.

Accuracy = (N - D - S) / N

Accuracy is actually a worse measure for most tasks, since insertions are also important in final results. But for some tasks, accuracy is a reasonable measure of the decoder performance.

Accuracy. It is mostly similar to the word error rate, but it does not count the number of inserted words. It is defined as: Accuracy = (N - D - S) / N
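A small worked example shows why ignoring insertions can be misleading. The counts below are invented for illustration.

```python
def accuracy(deletions, substitutions, n_reference):
    """Accuracy = (N - D - S) / N; insertions are not counted."""
    return (n_reference - deletions - substitutions) / n_reference

# Invented counts: a 10-word original text, with the decoder adding many words.
n_reference, insertions, deletions, substitutions = 10, 5, 1, 1

acc = accuracy(deletions, substitutions, n_reference)
wer = (insertions + deletions + substitutions) / n_reference
```

Here the accuracy is 80% although the word error rate is 70%, because the five inserted words do not show up in the accuracy figure.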


Speed. Suppose the audio file was 2 hours and the decoding took 6 hours. Then speed is counted as 3xRT.
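The same calculation written out, with an illustrative function name:

```python
def real_time_factor(decoding_seconds, audio_seconds):
    """Decoding time divided by audio duration; 1.0 means real time."""
    return decoding_seconds / audio_seconds

# The example from the text: 2 hours of audio decoded in 6 hours.
rtf = real_time_factor(6 * 3600, 2 * 3600)
```

A value below 1 means the decoder runs faster than the audio plays.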


ROC curves. When we talk about detection tasks, where there are false alarms and hits/misses, ROC curves are used. Such a curve plots the number of false alarms against the number of hits. It is used to look for an operating point where the number of false alarms is small and the number of hits is as close to 100% as possible.
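The points of such a curve can be computed by sweeping a decision threshold over the detector scores. The scores and labels below are invented, and `roc_points` is an illustrative helper that reports rates rather than raw counts.

```python
def roc_points(scores, labels):
    """Return (false_alarm_rate, hit_rate) pairs, one per decision threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = []
    for threshold in sorted(set(scores), reverse=True):
        hits = sum(1 for s, y in zip(scores, labels)
                   if s >= threshold and y == 1)
        false_alarms = sum(1 for s, y in zip(scores, labels)
                           if s >= threshold and y == 0)
        points.append((false_alarms / negatives, hits / positives))
    return points

# Invented detector scores; label 1 means the event was really present.
scores = [0.9, 0.8, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0]
points = roc_points(scores, labels)
```

Lowering the threshold moves along the curve: more events are detected, but more false alarms are raised too, so the operating point is a compromise between the two.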


There are other properties that aren't often taken into account but are still important for many practical applications. Your first task should be to build such a measure and systematically apply it during system development. Your second task is to collect the test database and test how your application performs.

