Notes on the Transformer Transducer paper
Transformer Transducer: A Streamable Speech Recognition Model with Transformer Encoders and RNN-T Loss
Google; SOTA on the LibriSpeech dataset; February 2020
Abstract
In this paper we present an end-to-end speech recognition model with Transformer encoders that can be used in a streaming speech recognition system. This is similar to the Recurrent Neural Network Transducer (RNN-T) model, which uses RNNs for information encoding instead of Transformer encoders.
The paper replaces the LSTM encoders of the Recurrent Neural Network Transducer (RNN-T) model with Transformer encoders, giving the Transformer Transducer (T-T) architecture. Trained with the RNN-T loss, the model supports streaming decoding, and it achieves SOTA results on LibriSpeech.
Full attention: not streamable.
Limited attention: limits the left context for self-attention in the Transformer layers.
bridge the gap between full attention and limited attention versions of our model by attending to a limited number of future frames.
The paper trades off between the full-attention and limited-attention settings: the model is allowed to attend to a few future frames, which closes part of the accuracy gap at the cost of a fixed amount of latency.
Introduction
Transformerbased models attend over encoder features using decoder features, implying that the decoding has to be done in a label-synchronous way, thereby posing a challenge for streaming speech recognition applications. An additional challenge for streaming speech recognition with these models is that the number of computations for self-attention increases quadratically with input sequence size.
The paper identifies two problems with Transformer-based models: (1) they cannot be used in a streaming fashion; (2) the self-attention computation grows quadratically with the input size (for an RNN, the computation grows only linearly, which is friendlier).
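To make the quadratic-growth point concrete, here is a tiny count of attention-score computations for full attention versus a left-limited window (the values of T and W are arbitrary illustrative choices, not from the paper):

```python
T = 1000  # number of input frames
W = 32    # left-context window size (illustrative)

# full self-attention: every frame attends to every frame -> O(T^2) scores
full = T * T

# left-limited attention: frame t attends only to frames [t - W, t] -> O(T * W)
limited = sum(min(t, W) + 1 for t in range(T))

print(full, limited)  # 1000000 vs 32472
```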
In particular, the RNN-T and RNA models are trained to learn alignments between the acoustic encoder features and the label encoder features, and so lend themselves naturally to frame-synchronous decoding.
The paper notes that the RNN-T and RNA models are trained to learn alignments between acoustic features and label features, and therefore naturally support frame-synchronous decoding.
we show that Transformer-based models can be trained with self-attention on a fixed number of past input frames and previous labels
The paper limits the history available to self-attention so that the computation no longer grows quadratically, making it suitable for streaming.
Compared with RNNs, self-attention can also be parallelized during training, which greatly improves training efficiency.
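A minimal sketch of how such a limited-attention mask might be built; the helper name and window sizes are hypothetical, not taken from the paper:

```python
import numpy as np

def limited_attention_mask(T, left_context, right_context):
    """Boolean [T, T] mask: position t may attend to frames in
    [t - left_context, t + right_context] (clipped to the sequence)."""
    idx = np.arange(T)
    rel = idx[None, :] - idx[:, None]  # rel[t, s] = s - t
    return (rel >= -left_context) & (rel <= right_context)

# each frame sees at most 2 past frames and 1 future frame
mask = limited_attention_mask(5, left_context=2, right_context=1)
```

In attention, positions where the mask is False would have their scores set to -inf before the softmax. Setting `right_context=0` gives the fully streamable variant; increasing it trades latency for accuracy, which is what the paper's experiments explore.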
RNN-T Loss
(to be added)
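Until this section is filled in, here is an illustrative forward-algorithm sketch of the RNN-T loss over the T x (U+1) alignment lattice; the log-probability layout and the choice of blank index 0 are assumptions for the sketch, not the paper's implementation:

```python
import numpy as np

def rnnt_neg_log_likelihood(log_probs, labels):
    """RNN-T loss -log P(y|x) via the forward recursion (sketch).

    log_probs: array [T, U+1, V] of log posteriors over the vocabulary
               at each (frame, label-position) pair; blank is index 0
               (layout is an assumption for this sketch).
    labels:    target label ids, length U.
    """
    T, U1, _ = log_probs.shape
    U = len(labels)
    assert U1 == U + 1
    alpha = np.full((T, U + 1), -np.inf)
    alpha[0, 0] = 0.0
    for t in range(T):
        for u in range(U + 1):
            if t > 0:  # emit blank at (t-1, u): advance in time
                alpha[t, u] = np.logaddexp(
                    alpha[t, u], alpha[t - 1, u] + log_probs[t - 1, u, 0])
            if u > 0:  # emit label u at (t, u-1): advance in the label sequence
                alpha[t, u] = np.logaddexp(
                    alpha[t, u],
                    alpha[t, u - 1] + log_probs[t, u - 1, labels[u - 1]])
    # terminate with a final blank from the last lattice node (T-1, U)
    return -(alpha[T - 1, U] + log_probs[T - 1, U, 0])
```

As a sanity check: with 3 frames, 2 labels, and uniform posteriors over 3 symbols, every one of the 6 monotonic alignment paths emits 5 symbols, so P(y|x) = 6 * (1/3)^5.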
Transformer Transducer (T-T) Architecture
1) Each layer consists of two sub-layers: a multi-headed attention layer and a feed-forward layer.
2) The input first goes through LayerNorm, is then projected into key, query, and value spaces, and multi-head attention is applied; the resulting values are concatenated and fed into dense layers.
3) Two residual connections:
first: LayerNorm(x) + AttentionLayer(LayerNorm(x));
second: LayerNorm(x) + FeedForwardLayer(LayerNorm(x));
4) dropout = 0.1
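The layer described in 1)-4) can be sketched as follows, with random placeholder weights and small dimensions (dropout omitted; an illustration, not the paper's implementation):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def tt_layer(x, params, n_heads=2):
    """One encoder layer as in the notes: pre-LayerNorm, multi-head
    self-attention, feed-forward, with the residual form
    LayerNorm(x) + SubLayer(LayerNorm(x))."""
    Wq, Wk, Wv, Wo, W1, b1, W2, b2 = params
    d = x.shape[-1]
    dh = d // n_heads
    # sub-layer 1: multi-head self-attention on LayerNorm(x)
    h = layer_norm(x)
    q, k, v = h @ Wq, h @ Wk, h @ Wv
    heads = [softmax(q[:, i*dh:(i+1)*dh] @ k[:, i*dh:(i+1)*dh].T / np.sqrt(dh))
             @ v[:, i*dh:(i+1)*dh] for i in range(n_heads)]
    x = h + np.concatenate(heads, axis=-1) @ Wo      # first residual
    # sub-layer 2: feed-forward (ReLU) on LayerNorm(x)
    h = layer_norm(x)
    x = h + np.maximum(h @ W1 + b1, 0.0) @ W2 + b2   # second residual
    return x

rng = np.random.default_rng(0)
T, d, d_ff = 5, 8, 16
Wq, Wk, Wv, Wo = (0.1 * rng.standard_normal((d, d)) for _ in range(4))
W1, b1 = 0.1 * rng.standard_normal((d, d_ff)), np.zeros(d_ff)
W2, b2 = 0.1 * rng.standard_normal((d_ff, d)), np.zeros(d)
y = tt_layer(rng.standard_normal((T, d)), (Wq, Wk, Wv, Wo, W1, b1, W2, b2))
```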
With relative positional encoding, the encoding only affects the attention score instead of the Values being summed. This allows us to reuse previously computed states rather than recomputing all previous states and getting the last state in an overlapping inference manner when the number of frames or labels that AudioEncoder or LabelEncoder processed is larger than the maximum length used during training (which would again be intractable for streaming applications).
The paper uses relative positional encoding rather than absolute positional encoding. Reasons: (1) previously computed states can be reused instead of being recomputed; (2) when the AudioEncoder or LabelEncoder processes more frames or labels than the maximum length seen during training, the model still works without the overlapping-inference trick, which would be intractable for streaming.
The AudioEncoder has 18 layers; the LabelEncoder has 2 layers.
Experiment
Dataset: the LibriSpeech ASR corpus, 970 hours of speech; the paired transcripts contain 10M words, and the unpaired text contains 800M words. The T-T model is trained on the 10M-word transcripts; the language model (LM) is trained on 810M word tokens.
Spectral features: 128-channel log-mel features with a 32 ms window, stacked every 4 frames and sub-sampled every 3 frames, yielding 512-dimensional acoustic features at a 30 ms frame rate.
Data augmentation: SpecAugment, see reference [12] of the paper.
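The frame stacking and sub-sampling of the log-mel features can be sketched as follows (the function name and the boundary handling are assumptions):

```python
import numpy as np

def stack_and_subsample(feats, stack=4, stride=3):
    """Stack every `stack` consecutive frames into one vector,
    then keep every `stride`-th stacked frame."""
    T, d = feats.shape
    stacked = np.stack([feats[t:t + stack].reshape(-1)
                        for t in range(T - stack + 1)])
    return stacked[::stride]

logmel = np.zeros((100, 128))       # 100 log-mel frames, 128 channels
out = stack_and_subsample(logmel)   # 512-dim features at 1/3 the frame rate
```

With 128-channel frames, stacking 4 gives 4 x 128 = 512 dimensions, and keeping every 3rd stacked frame triples the effective frame step to 30 ms.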
Table 2 shows that the model reaches a WER of 2.4% without an LM, and 2.0% with an LM.
The LM has 6 layers and 57M parameters, with a perplexity of 2.49 on dev-clean.
Next come the limited-attention experiments, which test the model's online streaming ability; limited attention is implemented with masks. Table 4 masks the AudioEncoder, Table 5 masks the LabelEncoder, and Table 6 masks both.