The LeVoice Far-field Speech Recognition System for VOiCES from a Distance Challenge 2019
Yulong Liang, Lin Yang, Xuyang Wang, Yingjie Li, Chen Jia, Junjie Wang
Lenovo Research


This paper describes our submission to the “VOiCES from a Distance Challenge 2019”, which is designed to foster research in speaker recognition and automatic speech recognition (ASR), with a special focus on single-channel distant/far-field audio under noisy conditions. We focused on the ASR task under the fixed condition, in which the training data was clean and small while the development and test data were noisy and mismatched. Our system therefore combined data augmentation, weighted-prediction-error (WPE) based speech enhancement, acoustic models based on different network architectures, TDNN- or LSTM-based language model rescoring, and ROVER. Experiments on the development set and the evaluation set showed that front-end processing, data augmentation, and system fusion contributed most to the performance improvement; the final system achieved word error rates of 15.91% and 19.6% on the two sets, respectively.

  1. Introduction

Since the accuracy of close-talking, noise-free speech recognition is approaching the best human speech recognition performance [1-4], more and more researchers have turned their attention to far-field and noisy scenarios [5-8]. The “VOiCES from a Distance Challenge 2019” [9][10] is a competition designed to foster research in speaker recognition and automatic speech recognition (ASR), with a special focus on single-channel distant/far-field audio under noisy conditions. The challenge is based on the newly released Voices Obscured in Complex Environmental Settings (VOiCES) corpus, and the training data is an 80-hour subset of the LibriSpeech dataset. The VOiCES challenge has two tasks, speaker recognition and ASR, and each task has a fixed and an open training condition. The main difficulty of each task is that the training data is small and mismatched with the evaluation data.

Far-field speech recognition has been studied extensively, and the existing work falls into two categories. In the first category, researchers process the evaluation data in the front end so that it better matches the model. In the second category, researchers train acoustic models (AM) and language models in the back end so that the model parameters match the data under the test conditions as closely as possible. For front-end processing, the main methods, such as Optimally Modified Log-Spectral Amplitude with Improved Minima Controlled Recursive Averaging (OMLSA-IMCRA) [11] and Weighted Prediction Error (WPE) [12][13], are used for de-noising and de-reverberation.
For the back end, the main methods include applying different acoustic model architectures, such as the Deep Neural Network (DNN), Time-Delay Neural Network (TDNN) [5], factorized TDNN (TDNNF), Convolutional Neural Network (CNN), and Long Short-Term Memory (LSTM) network, as well as model parameter optimization, Neural Network Language Model (NNLM) based rescoring, and multi-model fusion. The goal is to reduce the mismatch between the distant speech to be recognized and the training condition.

Because the given training set was clean speech, while the development and evaluation sets were recorded under complex conditions with different kinds of noise and reverberation, we took several measures to optimize recognition performance. First, to address the shortage of training data, we expanded the dataset with data augmentation strategies, including adding reverberation and noise. Second, we trained acoustic models with different network architectures. Third, we added a rescoring mechanism based on the first-pass decoding lattices. Finally, we used ROVER [14] to exploit the complementarity among the different systems.

The rest of this paper is organized as follows. Section 2 introduces each component of the system. Section 3 presents ASR results obtained on the VOiCES corpus. Section 4 concludes the paper.
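As a rough illustration of the reverberation-and-noise augmentation mentioned in the introduction, the following minimal sketch convolves a clean utterance with a room impulse response and then mixes in noise at a chosen signal-to-noise ratio. It is not the pipeline used for this submission: the function names (`convolve`, `mix_at_snr`, `augment`), the synthetic impulse response, and the 10 dB SNR are all assumptions made for the example.

```python
import math
import random


def convolve(signal, impulse_response):
    """Full linear convolution of two sample sequences (simulates reverberation)."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out


def power(samples):
    """Mean squared amplitude of a sample sequence."""
    return sum(x * x for x in samples) / len(samples)


def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that speech power / noise power equals `snr_db`, then add it."""
    # Repeat and truncate the noise so that it covers the whole utterance.
    reps = len(speech) // len(noise) + 1
    noise = (noise * reps)[:len(speech)]
    target_noise_power = power(speech) / (10 ** (snr_db / 10))
    gain = math.sqrt(target_noise_power / power(noise))
    return [s + gain * n for s, n in zip(speech, noise)]


def augment(clean, impulse_response, noise, snr_db):
    """Reverberate a clean utterance, then add noise at the requested SNR."""
    reverberant = convolve(clean, impulse_response)[:len(clean)]
    return mix_at_snr(reverberant, noise, snr_db)


# Toy usage with synthetic data (hypothetical values, not challenge data).
rng = random.Random(0)
clean = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(1600)]
rir = [math.exp(-k / 50.0) * rng.uniform(-1, 1) for k in range(200)]
rir[0] = 1.0  # direct path
noise = [rng.gauss(0, 1) for _ in range(800)]
noisy = augment(clean, rir, noise, snr_db=10)
```

In practice the impulse responses and noise recordings would come from measured or simulated rooms and real noise corpora, and the convolution would be done with an FFT-based routine rather than the quadratic loop used here for clarity.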


