A Summary of Pooling Methods in Speaker (Voiceprint) Recognition
1、Statistics Pooling
http://danielpovey.com/files/2017_interspeech_embeddings.pdf
The statistics pooling layer computes the mean vector µ and, as second-order statistics, the standard deviation vector σ over the frame-level features ht (t = 1, · · · , T):

$$\mu = \frac{1}{T}\sum_{t=1}^{T} h_t$$

$$\sigma = \sqrt{\frac{1}{T}\sum_{t=1}^{T} h_t \odot h_t - \mu \odot \mu}$$

where ⊙ represents the Hadamard product.
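A minimal NumPy sketch of statistics pooling, assuming the frame-level features are stacked into a `(T, D)` array. The function name `statistics_pooling` and the small `eps` floor on the variance are illustrative choices, not part of the cited paper.

```python
import numpy as np

def statistics_pooling(h, eps=1e-8):
    """Pool (T, D) frame-level features into a (2*D,) vector [mean, std]."""
    mu = h.mean(axis=0)                        # mean over frames
    var = (h * h).mean(axis=0) - mu * mu       # E[h ⊙ h] - mu ⊙ mu
    sigma = np.sqrt(np.clip(var, eps, None))   # eps floor is an assumption for numerical safety
    return np.concatenate([mu, sigma])

# Two frames, two feature dimensions.
frames = np.array([[1.0, 2.0],
                   [3.0, 6.0]])
pooled = statistics_pooling(frames)  # mean is [2, 4], std is [1, 2]
```

The output dimension is twice the feature dimension because the mean and standard deviation vectors are concatenated.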
2、Attentive Statistics Pooling
https://arxiv.org/pdf/1803.10963.pdf
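The attention model calculates a scalar score et for each frame-level feature:

$$e_t = v^{\top} f(W h_t + b) + k$$

where f(·) is a non-linear activation function, such as tanh or ReLU.
The score is normalized over all frames by a softmax function so that the weights sum to one:

$$\alpha_t = \frac{\exp(e_t)}{\sum_{\tau=1}^{T} \exp(e_\tau)}$$

The normalized score αt is then used as the weight in the pooling layer to calculate the weighted mean vector:

$$\tilde{\mu} = \sum_{t=1}^{T} \alpha_t h_t$$

The weighted standard deviation is defined as follows:

$$\tilde{\sigma} = \sqrt{\sum_{t=1}^{T} \alpha_t h_t \odot h_t - \tilde{\mu} \odot \tilde{\mu}}$$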
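The scoring, softmax normalization, and weighted statistics can be sketched in NumPy as follows. This is a sketch under assumptions: the parameter names `W`, `b`, `v`, `k` and their shapes are chosen for illustration (row-vector convention, so `W` has shape `(D, A)`), tanh is used for f(·), and the parameters are random rather than trained.

```python
import numpy as np

def attentive_statistics_pooling(h, W, b, v, k=0.0, eps=1e-8):
    """h: (T, D) frames; W: (D, A); b: (A,); v: (A,). Returns ([mean, std], weights)."""
    e = np.tanh(h @ W + b) @ v + k             # one scalar score per frame, shape (T,)
    e = e - e.max()                            # shift for a numerically stable softmax
    alpha = np.exp(e) / np.exp(e).sum()        # weights over frames, sum to one
    mu = alpha @ h                             # weighted mean, shape (D,)
    var = alpha @ (h * h) - mu * mu            # weighted second moment minus mean squared
    sigma = np.sqrt(np.clip(var, eps, None))   # eps floor is an assumption
    return np.concatenate([mu, sigma]), alpha

rng = np.random.default_rng(0)
h = rng.normal(size=(5, 3))                    # T=5 frames, D=3 features
W = rng.normal(size=(3, 4))
b = np.zeros(4)
v = rng.normal(size=4)
pooled, alpha = attentive_statistics_pooling(h, W, b, v)
```

When `v` is all zeros every frame receives the same score, the weights become uniform, and the result reduces to plain statistics pooling.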
3、Self-Attentive Pooling
https://danielpovey.com/files/2018_interspeech_xvector_attention.pdf
Let H = {h1, h2, · · · , hT } be a matrix of size dh × T, where ht is the hidden representation of input frame xt captured by the hidden layer below the self-attention layer. The annotation matrix A is computed as

$$A = \mathrm{softmax}\left(g(H^{\top} W_1)\, W_2\right)$$

where W1 is a matrix of size dh × da; W2 is a matrix of size da × dr, and dr is a hyperparameter that represents the number of attention heads; g(·) is an activation function, and ReLU is chosen here. The softmax(·) is performed column-wise.
Each column vector of A is an annotation vector that represents the weights for the different ht. Finally, the weighted means E are obtained by

$$E = H A$$

By increasing dr, we can easily have multiple attention heads that learn different aspects of a speaker's speech. To encourage diversity in the annotation vectors, so that each attention head extracts dissimilar information from the same speech segment, a penalty term P is introduced when dr > 1:

$$P = \left\lVert A^{\top} A - I \right\rVert_F^2$$

where I is the identity matrix and ‖·‖F represents the Frobenius norm of a matrix. P is similar to L2 regularization and is minimized together with the original cost of the whole system.
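The multi-head annotation matrix, the weighted means, and the diversity penalty can be sketched in NumPy as below. The sketch assumes H is stored as a `(d_h, T)` array so that the matrix products match the stated sizes of W1 and W2; the helper names and the random, untrained weights are illustrative only.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    ex = np.exp(x)
    return ex / ex.sum(axis=axis, keepdims=True)

def self_attentive_pooling(H, W1, W2):
    """H: (d_h, T); W1: (d_h, d_a); W2: (d_a, d_r). Returns (E, A, P)."""
    scores = np.maximum(H.T @ W1, 0.0) @ W2    # ReLU as g(.), shape (T, d_r)
    A = softmax(scores, axis=0)                # column-wise: each head sums to 1 over frames
    E = H @ A                                  # weighted means, shape (d_h, d_r)
    d_r = A.shape[1]
    # Penalty pushes the heads' annotation vectors toward being mutually orthogonal.
    P = np.linalg.norm(A.T @ A - np.eye(d_r), ord="fro") ** 2
    return E, A, P

rng = np.random.default_rng(0)
H = rng.normal(size=(4, 6))                    # d_h=4, T=6
W1 = rng.normal(size=(4, 3))                   # d_a=3
W2 = rng.normal(size=(3, 2))                   # d_r=2 heads
E, A, P = self_attentive_pooling(H, W1, W2)
```

In training, P would typically be added to the main loss with a weighting coefficient; that coefficient is not specified in the text above.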
4、Self Multi-Head Attention Pooling
https://ieeexplore.ieee.org/document/9053217
where Ti ≥ 1 is a temperature hyperparameter.
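As a generic illustration of what a temperature does inside a softmax (not the exact formulation of the cited paper), the sketch below divides the attention scores by a temperature before normalizing. The function name and the example scores are made up for illustration.

```python
import numpy as np

def softmax_with_temperature(scores, temperature=1.0):
    """Softmax over a 1-D score vector after dividing by a temperature."""
    z = scores / temperature
    z = z - z.max()                  # shift for numerical stability
    w = np.exp(z)
    return w / w.sum()

scores = np.array([2.0, 1.0, 0.1])  # hypothetical per-frame attention scores
sharp = softmax_with_temperature(scores, temperature=1.0)
smooth = softmax_with_temperature(scores, temperature=4.0)
```

A larger temperature flattens the weights, so a head with a high temperature attends more evenly across frames, while a temperature of 1 leaves the softmax unchanged.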
5、NetVLAD
https://arxiv.org/pdf/1902.10107.pdf
https://arxiv.org/pdf/1511.07247.pdf
For a more detailed explanation, see: https://zhuanlan.zhihu.com/p/96718053
6、Learnable Dictionary Encoding (LDE)
https://arxiv.org/pdf/1804.05160.pdf
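Here, we introduce two groups of learnable parameters. One is the set of dictionary component centers, denoted µ = {µ1, µ2, · · · , µC}. The other is the assigned weights, denoted w. The weight that assigns frame xt to center µc is

$$w_{tc} = \frac{\exp\left(-s_c \lVert x_t - \mu_c \rVert^2\right)}{\sum_{m=1}^{C} \exp\left(-s_m \lVert x_t - \mu_m \rVert^2\right)}$$

where the smoothing factor sc for each dictionary center µc is learnable.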
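A rough NumPy sketch of an LDE-style layer follows. It rests on assumptions: each frame is softly assigned to every center with a softmax over negative scaled squared distances, and the residuals to each center are aggregated with those weights and averaged over the number of frames. The names `centers` and `s` and the choice to normalize by T are illustrative; the cited paper should be consulted for the exact aggregation.

```python
import numpy as np

def lde_pooling(x, centers, s):
    """x: (T, D) frames; centers: (C, D); s: (C,) positive smoothing factors."""
    r = x[:, None, :] - centers[None, :, :]          # residuals, shape (T, C, D)
    dist2 = (r ** 2).sum(axis=-1)                    # squared distances, shape (T, C)
    logits = -s * dist2                              # closer centers get larger logits
    logits = logits - logits.max(axis=1, keepdims=True)
    w = np.exp(logits)
    w = w / w.sum(axis=1, keepdims=True)             # soft assignment per frame, rows sum to 1
    e = (w[:, :, None] * r).sum(axis=0) / x.shape[0] # weighted residual per center, (C, D)
    return e.reshape(-1), w

rng = np.random.default_rng(0)
x = rng.normal(size=(10, 4))        # T=10 frames, D=4 features
centers = rng.normal(size=(3, 4))   # C=3 dictionary centers (learned in practice)
s = np.ones(3)                      # smoothing factors (learned in practice)
encoding, w = lde_pooling(x, centers, s)
```

The fixed-length output has C × D dimensions regardless of the number of input frames.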
7、Attentive bilinear pooling (ABP)
https://www.isca-speech.org/archive/Interspeech_2020/pdfs/1922.pdf
Specifically, let $H \in \mathbb{R}^{L \times D}$ be the frame-level feature map captured by the hidden layer below the self-attention layer, where L and D are the number of frames and the feature dimension, respectively. The attention map $A \in \mathbb{R}^{L \times K}$ is then obtained by feeding H into a 1×1 convolutional layer followed by a softmax non-linear activation, where K is the number of attention heads. The 1st-order and 2nd-order attentive statistics of H, denoted by µ and σ², can be computed similarly to cross-layer bilinear pooling [4], which is
where T1(x) is the operation of reshaping x into a vector, and T2(x) includes a signed square-root step and an L2-normalization step. ⊙ represents the Hadamard product. The output of ABP is the concatenation of µ and σ².
8、Short-time Spectral Pooling (STSP)
https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9414094
(End)