ICASSP 2019----Analysis and Mitigation of Vocal Effort Variations in Speaker Recognition

Mahesh Kumar Nandwana¹, Mitchell McLaren¹, Luciana Ferrer², Diego Castan¹, Aaron Lawson¹

¹ Speech Technology and Research Laboratory, SRI International, Menlo Park, California, USA
² Instituto de Investigación en Ciencias de la Computación, UBA-CONICET, Argentina

http://150.162.46.34:8080/icassp2019/ICASSP2019/pdfs/0006001.pdf

Abstract:
In this work, we assess the impact of vocal effort on the discrimination and calibration performance of a state-of-the-art speaker recognition system.
We analyze three levels of vocal effort (low, normal, and high) from the SRI-FRTIV corpus.
We use a deep neural network (DNN) speaker embeddings system with probabilistic linear discriminant analysis (PLDA) and find that vocal effort variation significantly degrades system performance.
We apply both mixture PLDA (mix-PLDA) and trial-based calibration with condition PLDA similarity (TBC-CPLDA) to improve system robustness.
Our proposed approaches resulted in 18% and 33% relative improvement in discrimination and calibration performance, respectively, on the SRI-FRTIV corpus.
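
The relative figures quoted in the abstract can be read with the usual convention for error metrics, where lower is better: the reduction is divided by the baseline value. The following is a minimal sketch under that assumption. The function name and the example numbers are hypothetical and are not results from the paper.

```python
def relative_improvement(baseline: float, improved: float) -> float:
    """Relative reduction of an error metric (lower is better), as a fraction."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - improved) / baseline


# Hypothetical values chosen only to illustrate the arithmetic.
hypothetical_baseline = 10.0
hypothetical_improved = 8.2
print(f"{relative_improvement(hypothetical_baseline, hypothetical_improved):.0%}")
```

Note that a relative improvement says nothing about the absolute error level; the baseline value is needed to recover it.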

From Wikipedia:
Vocal effort is a quantity varied by speakers when adjusting to an increase or decrease in the communication distance.
The communication distance is the distance between the speaker and the listener.
Vocal effort is a subjective physiological quantity, and is mainly dependent on subglottal pressure, vocal fold tension and jaw opening.
Vocal effort is different from sound pressure.
To measure vocal effort, listeners are asked to rate the distance between speaker and addressee.

SECTION 1. INTRODUCTION
Variability in the acoustic signal is a persistent challenge for speaker recognition systems operating under real-world conditions.
Such variability is caused by either intrinsic or extrinsic factors.
Intrinsic factors are associated with the speaker rather than the recording environment.
These factors include changes in vocal effort, speaking style [1], non-speech sounds [2], [3], [4], emotions, language [5], aging, etc. across recordings of the same speaker.
Extrinsic factors are associated with differences in the recording environment between recordings.
These factors include changes in background noise, microphone, room acoustics, distance from the microphone [6], transmission channel, codec [7], etc.
Intrinsic factors are also known as speaker-dependent factors, whereas extrinsic factors are called speaker-independent factors [8].
During recent decades, US government evaluations and programs (such as the NIST Speaker Recognition Evaluations (SRE), the IARPA BEST program, and the DARPA RATS program) have motivated particular research directions in the speaker recognition community.
Those research programs have primarily focused on the problem of extrinsic variability, including channel effects, transmission noise, and environmental noise.
Intrinsic variability, in contrast, has received little research attention.
Yet intrinsic variability is a key factor for unconstrained applications, such as forensic speaker recognition.
This work focuses specifically on vocal effort variation, which is one class of intrinsic variability.
Vocal effort has been shown to impact the performance of speaker recognition systems [9].
In the past, a number of studies focused on different levels of vocal effort, such as whisper [10], shouts [11], and screams [4].
The impact of Lombard speech on the performance of speaker verification systems was considered in [12], [13].
The main contributions of this work are as follows.
First, we use a state-of-the-art speaker recognition system based on DNN speaker embeddings rather than classical GMM-UBM or i-vector based systems.
Second, rather than focusing on just one type of vocal effort level such as whisper or shouts, we develop our mitigation approaches for a range of vocal efforts from low to high.
Third, we use a relatively large number of speakers with sufficient audio data per speaker to get significant results.
Also, to the best of our knowledge, this study is the first to consider calibration of a speaker recognition system for a range of vocal efforts.
In this study, we first assess the impact of vocal effort on the discrimination and calibration performance of a speaker recognition system based on DNN speaker embeddings.
We then apply mixture PLDA (mix-PLDA) using meta information and the recently proposed trial-based calibration with condition PLDA similarity (TBC-CPLDA) to mitigate the impact of vocal effort.
We use the SRI-FRTIV corpus for all experiments.
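
The introduction separates discrimination from calibration performance. This excerpt does not name the metrics used, so the sketch below assumes two common choices in speaker recognition: equal error rate (EER) for discrimination and the log-likelihood-ratio cost (Cllr) for calibration. The function names are illustrative, and the scores passed to `cllr` are assumed to be log-likelihood ratios (natural log).

```python
import math


def eer(target_scores, nontarget_scores):
    """Approximate equal error rate via a sweep over observed score thresholds."""
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, best_eer = float("inf"), None
    for t in thresholds:
        # A trial is accepted when its score is >= t.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, best_eer = gap, (far + frr) / 2
    return best_eer


def _softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def cllr(target_llrs, nontarget_llrs):
    """Log-likelihood-ratio cost in bits; 0 is ideal, 1 matches an uninformative system."""
    tar = sum(_softplus(-s) for s in target_llrs) / len(target_llrs)
    non = sum(_softplus(s) for s in nontarget_llrs) / len(nontarget_llrs)
    return 0.5 * (tar + non) / math.log(2)
```

EER depends only on how the scores rank target against non-target trials, whereas Cllr also penalizes scores whose magnitudes are misleading. A system can therefore discriminate well and still be poorly calibrated, which is why the two are assessed separately.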
