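As a top international conference in the speech field, INTERSPEECH has long been a focus of attention for both academia and industry. The conference covers all aspects of speech and language processing and its applications, as well as cutting-edge advances in speech-related fields. INTERSPEECH 2021 takes place from August 30 to September 3 and is organized by the International Speech Communication Association (ISCA). This year's conference is held in a hybrid format, both online and in person (Brno, Czech Republic). To make exchange easier for researchers around the world, every paper accepted this year can be presented by video.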


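Each edition of INTERSPEECH receives submissions from more than a thousand research institutions and companies worldwide, yet only a limited number are ultimately accepted. At INTERSPEECH 2021, both papers submitted by 希尔贝壳 (AISHELL) were accepted by the conference: "AISHELL-3: A Multi-speaker Mandarin TTS Corpus" and "AISHELL-4: An Open Source Dataset for Speech Enhancement, Separation, Recognition and Speaker Diarization in Conference Scenario".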

Paper 1

Title: AISHELL-3: A Multi-speaker Mandarin TTS Corpus


Authors: Yao Shi, Hui Bu, Xin Xu, Shaoji Zhang, Ming Li


  • School of Computer Science, Wuhan University, Wuhan, China

  • Data Science Research Center, Duke Kunshan University, Kunshan, China

  • Beijing Shell Shell Technology Co., Ltd., Beijing, China


In this paper, we present AISHELL-3, a large-scale and high-fidelity multi-speaker Mandarin speech corpus which could be used to train multi-speaker Text-to-Speech (TTS) systems. The corpus contains roughly 85 hours of emotion-neutral recordings spoken by 218 native Chinese Mandarin speakers. Their auxiliary attributes such as gender, age group and native accents are explicitly marked and provided in the corpus. Accordingly, transcripts in Chinese character-level and pinyin-level are provided along with the recordings. We present a baseline system that uses AISHELL-3 for multi-speaker Mandarin speech synthesis. The multi-speaker speech synthesis system is an extension on Tacotron-2 where a speaker verification model and a corresponding loss regarding voice similarity are incorporated as the feedback constraint. We aim to use the presented corpus to build a robust synthesis model that is able to achieve zero-shot voice cloning. The system trained on this dataset also generalizes well on speakers that are never seen in the training process. Objective evaluation results from our experiments show that the proposed multi-speaker synthesis system achieves high voice similarity concerning both speaker embedding similarity and equal error rate measurement. The dataset, baseline system code and generated samples are available online.
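The two objective measures named in the abstract, speaker embedding similarity and equal error rate (EER), can be illustrated with a small sketch. This is not the authors' evaluation code: the function names `cosine_similarity` and `equal_error_rate` are hypothetical, cosine similarity is assumed as the embedding comparison, and the scores are toy values chosen only to show the arithmetic.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two speaker embeddings (lists of floats)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by trying every observed score as a threshold.

    target_scores: similarity scores for same-speaker trials.
    nontarget_scores: similarity scores for different-speaker trials.
    Returns the mean of the false acceptance and false rejection rates
    at the threshold where the two rates are closest.
    """
    best_gap, eer = float("inf"), None
    for threshold in sorted(set(target_scores) | set(nontarget_scores)):
        # False rejection: same-speaker trials scored below the threshold.
        frr = sum(s < threshold for s in target_scores) / len(target_scores)
        # False acceptance: different-speaker trials at or above the threshold.
        far = sum(s >= threshold for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


if __name__ == "__main__":
    # Toy embeddings (assumed values, not taken from AISHELL-3).
    print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))
    # Toy trial scores: one same-speaker trial scores low, one impostor scores high.
    print(equal_error_rate([0.9, 0.8, 0.7, 0.4], [0.1, 0.2, 0.3, 0.6]))
```

A higher similarity between the embedding of a synthesized utterance and that of the reference speaker means the voices are closer. For EER the desirable direction depends on how the trials are defined, so the sketch only shows how the number is computed.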


Paper 2


Title: AISHELL-4: An Open Source Dataset for Speech Enhancement, Separation, Recognition and Speaker Diarization in Conference Scenario



Authors: Yihui Fu, Luyao Cheng, Shubo Lv, Yukai Jv, Yuxiang Kong, Zhuo Chen, Yanxin Hu, Lei Xie, Jian Wu, Hui Bu, Xin Xu, Jun Du, Jingdong Chen


  • Northwestern Polytechnical University, Xi’an, China

  • Microsoft Corporation, USA

  • Microsoft Corporation, China

  • Beijing Shell Shell Technology Co., Ltd., Beijing, China

  • University of Science and Technology of China, Hefei, China


In this paper, we present AISHELL-4, a sizable real-recorded Mandarin speech dataset collected by an 8-channel circular microphone array for speech processing in conference scenarios. The dataset consists of 211 recorded meeting sessions, each containing 4 to 8 speakers, with a total length of 120 hours. This dataset aims to bridge the advanced research on multi-speaker processing and the practical application scenario in three aspects. With real recorded meetings, AISHELL-4 provides realistic acoustics and rich natural speech characteristics in conversation such as short pause, speech overlap, quick speaker turn, noise, etc. Meanwhile, accurate transcription and speaker voice activity are provided for each meeting in AISHELL-4. This allows the researchers to explore different aspects in meeting processing, ranging from individual tasks such as speech front-end processing, speech recognition and speaker diarization, to multi-modality modeling and joint optimization of relevant tasks. Given that most open source datasets for multi-speaker tasks are in English, AISHELL-4 is the only Mandarin dataset for conversation speech, providing additional value for data diversity in the speech community. We also release a PyTorch-based training and evaluation framework as a baseline system to promote reproducible research in this field.


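AISHELL's open-source projects have become a benchmark for open data in the speech technology field. Together they now form an open-source matrix of intelligent speech technology plus data, covering speech recognition, speaker (voiceprint) recognition, speech synthesis, and scenario-based intelligent speech application solutions.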


