Ondřej Novotný, Pavel Matějka, Ondřej Glembek, Oldřich Plchot, František Grézl, Lukáš Burget, and Jan “Honza” Černocký
Brno University of Technology, Speech@FIT and IT4I Center of Excellence, Brno, Czech Republic
{inovoton,matejkap,glembek,iplchot,grezl,burget,cernocky}@fit.vutbr.cz

https://ieeexplore.ieee.xilesou.top/abstract/document/7846265

This work was supported by the DARPA RATS Program under Contract No. HR0011-15-C-0038. The views expressed are those of the author and do not reflect the official policy or position of the Department of Defense or the U.S. Government.
This work was also supported by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Defense US Army Research Laboratory contract number W911NF-12-C-0013. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DoD/ARL, or the U.S. Government.
The work was also supported by the Czech Ministry of Interior project No. VI20152020025 “DRAPAK” and the European Union’s Horizon 2020 programme under grant agreement No. 645523 BISON.

Abstract
This paper analyzes the behavior of our state-of-the-art Deep Neural Network/i-vector/PLDA-based speaker recognition systems in multi-language conditions. On the “Language Pack” of the PRISM set, we evaluate the systems’ performance using NIST’s standard metrics. We show that not only does the gain from using DNNs vanish and using dedicated DNNs for target conditions fail to help, but the DNN-based systems also tend to produce de-calibrated scores under the studied conditions. This work gives suggestions for directions of future research rather than any particular solutions to these issues.

  1. INTRODUCTION

During the last decade, neural networks have experienced a renaissance as a powerful machine learning tool. Deep Neural Networks (DNN) have also been successfully applied to the field of speech processing. After their great success in automatic speech recognition (ASR) [1], DNNs were also found very useful in other fields of speech processing such as speaker [2, 3, 4] or language recognition [5, 6, 7]. In speech recognition, DNNs are often directly trained for the “target” task of frame-by-frame classification of speech sounds (e.g. tied tri-phone states). Similarly, a DNN directly trained for frame-by-frame classification of languages was successfully used for language recognition in [7]. However, this system provided competitive performance only for speech utterances of short durations.
Note: i.e., the DNN-based system performed well only when the utterances were short, for example on the order of a few to a few tens of seconds.
In the field of speaker recognition, DNNs are usually used in more elaborate and indirect way: One approach is to use DNNs for extracting frame-by-frame speech features. Such features are then used in the usual way (e.g. input to i-vector based system [8]).
These features can be directly derived from the DNN output posterior probabilities [9] and combined with the conventional features (PLP or MFCC) [10]. More commonly, however, bottleneck (BN) DNNs are trained for a specific task, and the features are taken from a narrow hidden layer compressing the relevant information into low-dimensional feature vectors [6, 5, 11]. Alternatively, a standard DNN (with no bottleneck) can be used, where the high-dimensional outputs of one of the hidden layers can be converted to features using a dimensionality reduction technique such as PCA [12].
In [13], we analyzed various DNN approaches to speaker recognition (and similar studies were conducted e.g. in [14, 15]). We used two different DNNs (a mono-lingual DNN trained on the Fisher English data corpus, and a multi-lingual DNN trained on 11 languages of the Babel data collection). The rest of the system was trained on the PRISM set, i.e. mainly on the English data. We reported our results only on the NIST SRE 2010 telephone condition (i.e. only on English speech) via the Equal Error Rates (EERs) and the minimum DCF NIST metrics.

[13] Pavel Matějka, Ondřej Glembek, Ondřej Novotný, Oldřich Plchot, František Grézl, Lukáš Burget, and Jan Černocký, “Analysis of DNN approaches to speaker identification,” in Proceedings of the 41st IEEE International Conference on Acoustics, Speech and Signal Processing. 2016, pp. 5100–5104, IEEE Signal Processing Society.
[14] Yao Tian, Meng Cai, Liang He, and Jia Liu, “Investigation of bottleneck features and multilingual deep neural networks,” in Interspeech, 2015.
[15] Sandro Cumani, Oldřich Plchot, and Pietro Laface, “Comparison of hybrid DNN-GMM architectures for speaker recognition,” in ICASSP. 2016, IEEE Signal Processing Society.

However, when tested on non-English test sets, we observed that the benefit of using the DNNs degraded dramatically. We used the “lan” Language Pack of the PRISM set (described later in the paper), and its Chinese subset, the “chn” pack, in comparison with the originally used NIST SRE 2010 telephone condition. Not only did we see performance degradation in terms of EER and the minimum DCFs, but more so in terms of the actual DCFs, i.e. the systems produce heavily de-calibrated scores.
Note: “de-calibrated scores” means the scores the system produces cannot be trusted at face value; “calibrated scores” would mean the produced score values are directly interpretable.

Our hypothesis was that when we use the DNN trained for the target language, the error rates would decrease. To match the sre10, “lan”, and “chn” test conditions, we chose three DNNs, trained on: i) the Fisher English, ii) the Multilingual set, and iii) the Mandarin, respectively. However, it turned out that, apart from the Fisher English being optimal for the NIST SRE 2010 test, there was no clear correlation between the test language and the DNN training language.
Note: the hypothesis was that the error rate drops when the DNN training language matches the test language. Here “lan” is the condition whose test languages are multilingual, “chn” the condition whose test language is Mandarin, and there is also a condition whose test language is English; hence the three DNNs trained on English, multilingual data, and Mandarin, respectively. The results did not bear the hypothesis out: matching the training language to the test language did not lower the error rates.

This paper analyzes the problems that emerged when applying the current state-of-the-art SRE systems to non-English domains, and provides directions for future research. This work is an extension of our previous analysis, available as a technical report [16].

Note: reference [16] looks identical to this paper; where exactly is the extension?

[16] Ondřej Novotný, Pavel Matějka, Ondřej Glembek, Oldřich Plchot, František Grézl, Lukáš Burget, and Jan “Honza” Černocký, “DNN-based SRE systems in multi-language conditions,” 2016, BUT Technical Report, http://www.fit.vutbr.cz/research/pubs/report.php?id=11235, also being submitted to IEEE Signal Processing Letters.

  2. THEORETICAL BACKGROUND
    2.1. i-vector Systems
    The i-vectors [8] provide an elegant way of reducing large-dimensional input data to a small-dimensional feature vector while retaining most of the relevant information. The main principle is that the utterance-dependent Gaussian Mixture Model (GMM) supervector of concatenated mean vectors s is modeled as

    s = m + Tw,

    where m is the UBM mean supervector, T is a low-rank matrix spanning the subspace with important variability in the mean supervector space, and w is a latent variable with a standard normal prior whose MAP point estimate is the i-vector.
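To make the model concrete, here is a minimal numpy sketch of the standard i-vector point estimate, assuming the per-component sufficient statistics have already been collected; the variable names are ours, not the paper's.

    import numpy as np

    def extract_ivector(N, F, T, Sigma):
        # MAP point estimate of w given per-component sufficient statistics.
        #   N     : (C,)      zero-order stats (soft frame counts)
        #   F     : (C, D)    first-order stats, centered around the UBM means
        #   T     : (C, D, R) total-variability matrix, one D x R block per component
        #   Sigma : (C, D)    diagonal UBM covariances
        C, D, R = T.shape
        precision = np.eye(R)
        linear = np.zeros(R)
        for c in range(C):
            TtSinv = (T[c] / Sigma[c][:, None]).T  # R x D, equals T_c' Sigma_c^{-1}
            precision += N[c] * TtSinv @ T[c]
            linear += TtSinv @ F[c]
        return np.linalg.solve(precision, linear)  # posterior mean of w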

We experimented with monolingual (English and Mandarin) and multilingual BN features. In the case of multilingual training, we adopted a training scheme with block-softmax, which divides the output layer into parts according to the individual languages. During training, only the part of the output layer corresponding to the language of the given training example is activated. See [20, 21] for a detailed description.
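A minimal sketch of the block-softmax idea, with hypothetical index ranges for the language blocks (the real systems train full DNNs; this only illustrates how the loss is restricted to the block of the frame's language):

    import numpy as np

    def block_softmax_nll(logits, target_idx, lang_id, blocks):
        # Cross-entropy with block-softmax: the softmax is computed only over
        # the output block belonging to the language of the current frame.
        #   logits     : (K,) raw output-layer activations for one frame
        #   target_idx : index of the target class within the full output layer
        #   lang_id    : which language block this frame belongs to
        #   blocks     : list of (start, end) index ranges, one per language
        start, end = blocks[lang_id]
        z = logits[start:end] - logits[start:end].max()  # numerical stability
        log_probs = z - np.log(np.exp(z).sum())
        return -log_probs[target_idx - start]            # NLL within the block

    # Illustrative usage: two languages with 4 and 3 output classes.
    blocks = [(0, 4), (4, 7)]
    logits = np.random.randn(7)
    loss = block_softmax_nll(logits, target_idx=5, lang_id=1, blocks=blocks)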

Bottleneck Neural-Network (BN-NN) refers to a NN topology in which one of the hidden layers has significantly lower dimensionality than the surrounding layers. A bottleneck feature vector is generally understood as a by-product of forwarding a primary input feature vector through the BN-NN and reading off the vector of values at the bottleneck layer. We have used a cascade of two such NNs for our experiments. The output of the first network is stacked in time, defining context-dependent input features for the second NN, hence the term Stacked Bottleneck Features.

However, it was shown that DNNs can be used directly for posterior computation [2].
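For context, such posteriors typically enter the i-vector pipeline as soft frame-to-component alignments when collecting sufficient statistics, with the DNN outputs standing in for the UBM's component posteriors [2]. A minimal numpy sketch of that statistics collection, under our own naming:

    import numpy as np

    def collect_stats(features, posteriors, means):
        # Zero- and centered first-order statistics for i-vector extraction;
        # the per-frame alignments come from the DNN outputs instead of a UBM.
        #   features   : (T, D) acoustic feature vectors
        #   posteriors : (T, C) DNN posteriors, senones playing the role of components
        #   means      : (C, D) class means estimated from the same alignments
        N = posteriors.sum(axis=0)                        # (C,) soft frame counts
        F = posteriors.T @ features - N[:, None] * means  # (C, D) centered 1st-order
        return N, F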

In other words, we show the utility of the trained DNNs as both feature- and posterior-extractors.

SBN feature extraction involves two NNs: the bottleneck (BN) outputs of the first NN are stacked in time, downsampled, and fed as the input vector to the second NN. The second NN in turn has its own BN layer, whose outputs serve as input features to a conventional Gaussian Mixture Model/Hidden Markov Model (GMM-HMM) speech recognition system. Fundamental frequency (f0) related features are important in speech recognition of both tonal and non-tonal languages, and different f0 features were tried as additional SBN inputs. The bottleneck features finally output by the SBN are 80-dimensional and subsequently serve as input features to a conventional GMM/UBM i-vector speaker recognition system.
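A rough numpy sketch of the time-stacking step that builds the input of the second NN; the context offsets below are illustrative assumptions, not the paper's exact configuration:

    import numpy as np

    def stack_bn_outputs(bn, offsets=(-10, -5, 0, 5, 10)):
        # Input construction for the second NN of the stacked-bottleneck
        # cascade: first-stage BN outputs are sampled at the given time
        # offsets around each frame and concatenated.
        #   bn : (T, B) bottleneck outputs of the first NN
        T, _ = bn.shape
        stacked = []
        for t in range(T):
            idx = np.clip(np.asarray(offsets) + t, 0, T - 1)  # clamp at edges
            stacked.append(bn[idx].reshape(-1))
        return np.stack(stacked)                              # (T, len(offsets)*B)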

Note: the DNNs in this paper are mainly used for feature extraction. Where are the extracted posteriors actually used?

There are two DNN architectures, the standard DNN and the SBN, and three training sets: English, Mandarin, and multilingual. This gives five DNNs in total:
English SBN
Mandarin SBN
Multilang SBN
English DNN
Mandarin DNN

Note: the paper does not state the architecture behind the baseline; it is presumably a standard Kaldi i-vector system. Its feature extraction is configured as follows:
19 MFCC coefficients + energy augmented with their delta and double delta coefficients, resulting in 60-dimensional feature vectors. The analysis window was 20 ms long with the shift of 10 ms.
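As an illustration only, the following librosa-based sketch produces features of this shape; the authors' exact tooling is not stated, and treating c0 as the energy term is our simplification:

    import numpy as np
    import librosa

    def baseline_features(wav, sr=8000):
        # 20 ms window, 10 ms shift; 20 static coefficients (19 MFCCs plus an
        # energy-like c0), with deltas and double deltas appended, giving 60
        # dimensions per frame.
        n_fft, hop = int(0.020 * sr), int(0.010 * sr)
        mfcc = librosa.feature.mfcc(y=wav, sr=sr, n_mfcc=20,
                                    n_fft=n_fft, hop_length=hop)  # (20, T)
        d1 = librosa.feature.delta(mfcc, order=1)
        d2 = librosa.feature.delta(mfcc, order=2)
        return np.vstack([mfcc, d1, d2]).T                        # (T, 60)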

The input to the SBN is:
24 log Mel-scale filter bank outputs augmented with fundamental frequency features from 4 different f0 estimators (Kaldi, Snack, and two others according to [17] and [18]). Together, we have 13 f0 related features, see [19] for more details.
The output of the SBN is:
80-dimensional, subsequently used as input features to a conventional GMM/UBM i-vector speaker recognition system.

Note: what are the inputs and outputs of the English DNN and the Mandarin DNN? The paper does not seem to say.

I also did not understand the explanation of Tab. 3:
In Tab. 3, we show the effect of a linear calibration on the English SBN system. Because of the lack of an independent held-out set, we performed a cheating (gender-independent) calibration trained using the “lan” trial set, which contains both English and Chinese trials.
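A linear (affine) score calibration of this kind can be sketched as a logistic regression on the raw scores; training and evaluating on the same “lan” trials is what makes it “cheating”. This is our illustration, not the authors' code:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def train_linear_calibration(scores, labels):
        # Learn an affine map a*s + b on raw PLDA scores via logistic
        # regression. A proper setup would train on held-out trials and
        # compensate for the target-trial prior (omitted here).
        #   scores : (N,) raw trial scores
        #   labels : (N,) 1 for target trials, 0 for non-target trials
        lr = LogisticRegression()
        lr.fit(scores.reshape(-1, 1), labels)
        a, b = lr.coef_[0, 0], lr.intercept_[0]
        return lambda s: a * s + b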

  5. CONCLUSIONS
    In this work, we have studied the behavior of the DNN techniques in SRE i-vector/PLDA systems, currently considered to be state-of-the-art, as evaluated on the most common NIST SRE English test sets, such as the NIST SRE 2010, condition 5. We have shown that when applied to non-English test sets, these techniques stop being effective and are susceptible to de-calibration of the scores produced by the traditional i-vector/PLDA systems. We have also observed that selecting a DNN to match the test condition does not solve the issues mentioned above.
    This work therefore leaves more questions than answers, and suggests that we focus on the analysis of the DNN acoustic space clustering with regard to multiple languages and other types of variability, and that we study the behavior of clustering with regard to the available SRE training data.
