Transformer Transducer: A Streamable Speech Recognition Model with Transformer Encoders and RNN-T Loss

Google team; state of the art on the LibriSpeech dataset; February 2020


Abstract

In this paper we present an end-to-end speech recognition model with Transformer encoders that can be used in a streaming speech recognition system. This is similar to the Recurrent Neural Network Transducer (RNN-T) model, which uses RNNs for information encoding instead of Transformer encoders.

The paper replaces the LSTM encoders of the Recurrent Neural Network Transducer (RNN-T) model with Transformer encoders, yielding the Transformer Transducer (T-T) architecture. Trained with the RNN-T loss, the model supports streaming decoding and achieves state-of-the-art results on LibriSpeech.

Full attention: not streamable.

Limited attention: limit the left context for self-attention in the Transformer layers.

The gap between the full-attention and limited-attention versions of the model can be bridged by attending to a limited number of future frames.

This is a trade-off between full attention and limited attention: the model is allowed to attend to a few future frames, i.e., it incurs a fixed amount of latency.

Introduction

Transformer-based models attend over encoder features using decoder features, implying that the decoding has to be done in a label-synchronous way, thereby posing a challenge for streaming speech recognition applications. An additional challenge for streaming speech recognition with these models is that the number of computations for self-attention increases quadratically with input sequence size.

The paper identifies two problems with Transformer-based models: (1) they cannot be used in a streaming fashion; (2) the self-attention computation grows quadratically with input size (for RNNs it grows only linearly, which is friendlier).

In particular, the RNN-T and RNA models are trained to learn alignments between the acoustic encoder features and the label encoder features, and so lend themselves naturally to frame-synchronous decoding.

The paper notes that the RNN-T and RNA models are trained to learn the alignment between acoustic encoder features and label encoder features, so they naturally support frame-synchronous decoding.

We show that Transformer-based models can be trained with self-attention on a fixed number of past input frames and previous labels.

The paper limits the history available to self-attention so that the computation no longer grows quadratically, making it suitable for streaming.

Compared with RNNs, self-attention can also be parallelized during training, which greatly improves training efficiency.
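To make the limited-context idea concrete, below is a minimal sketch (not from the paper) of how a left-/right-context attention mask could be built; the function name and context sizes are illustrative assumptions.

```python
import torch

def streaming_attention_mask(seq_len: int, left_context: int, right_context: int) -> torch.Tensor:
    """Boolean mask where entry (i, j) is True if frame i may attend to frame j.

    Each frame sees at most `left_context` past frames and `right_context`
    future frames, which bounds the per-frame attention cost and the latency.
    """
    positions = torch.arange(seq_len)
    offsets = positions.unsqueeze(1) - positions.unsqueeze(0)  # offsets[i, j] = i - j
    return (offsets <= left_context) & (offsets >= -right_context)

# Example: 10 frames, 3 frames of left context, 1 frame of look-ahead.
mask = streaming_attention_mask(10, left_context=3, right_context=1)
print(mask.int())
```

Setting right_context = 0 gives the fully streamable limited-attention model; increasing it trades latency for accuracy, which is the trade-off described above.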

RNN-T Loss

(to be added)
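Until this section is filled in, here is a minimal illustrative sketch of computing an RNN-T loss with torchaudio's rnnt_loss; the tensor shapes and values are made-up assumptions, not the paper's configuration.

```python
import torch
import torchaudio.functional as taF

batch, frames, label_len, vocab = 2, 50, 10, 30  # illustrative sizes
blank = 0

# Joint-network output: one logit per (frame, label position, output symbol).
logits = torch.randn(batch, frames, label_len + 1, vocab)
targets = torch.randint(1, vocab, (batch, label_len), dtype=torch.int32)
logit_lengths = torch.full((batch,), frames, dtype=torch.int32)
target_lengths = torch.full((batch,), label_len, dtype=torch.int32)

loss = taF.rnnt_loss(logits, targets, logit_lengths, target_lengths, blank=blank)
print(loss)  # scalar loss, averaged over the batch by default
```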

Transformer Transducer (T-T) Architecture

1) Each layer consists of two sub-layers: a multi-headed attention layer and a feed-forward layer.

2) The input first passes through LayerNorm, is projected into key, query, and value feature spaces, and multi-head attention is applied; the per-head values are concatenated and fed into dense layers.

3) Two residual connections (see the sketch after this list):

First: LayerNorm(x) + AttentionLayer(LayerNorm(x));

Second: LayerNorm(x) + FeedForwardLayer(LayerNorm(x));

4) Dropout = 0.1
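The following is a minimal PyTorch sketch of one such block, following the notes above (pre-LayerNorm, multi-head attention, feed-forward sub-layer, residuals taken from the LayerNorm output, dropout 0.1); the dimensions and module choices are illustrative assumptions, not the paper's exact hyper-parameters.

```python
import torch
import torch.nn as nn
from typing import Optional

class TTBlock(nn.Module):
    """One Transformer Transducer encoder layer as described in the notes above."""

    def __init__(self, d_model: int = 512, n_heads: int = 8, d_ff: int = 2048, dropout: float = 0.1):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor, attn_mask: Optional[torch.Tensor] = None) -> torch.Tensor:
        # attn_mask follows nn.MultiheadAttention semantics (True = not allowed to attend).
        # First residual: LayerNorm(x) + AttentionLayer(LayerNorm(x))
        y = self.norm1(x)
        attn_out, _ = self.attn(y, y, y, attn_mask=attn_mask)
        x = y + self.dropout(attn_out)
        # Second residual: LayerNorm(x) + FeedForwardLayer(LayerNorm(x))
        z = self.norm2(x)
        return z + self.dropout(self.ff(z))
```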

With relative positional encoding, the encoding only affects the attention score instead of the Values being summed. This allows us to reuse previously computed states rather than recomputing all previous states and getting the last state in an overlapping inference manner when the number of frames or labels that AudioEncoder or LabelEncoder processed is larger than the maximum length used during training (which would again be intractable for streaming applications).

The paper uses relative positional encoding rather than absolute positional encoding. Reasons: (1) previously computed states can be reused instead of being recomputed in an overlapping fashion; (2) since the encoding only affects the attention scores, inference is not tied to the maximum sequence length seen during training.
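Below is a minimal sketch (illustrative, not the paper's exact formulation) of how a relative positional bias can enter the attention scores while leaving the values untouched:

```python
import torch

def attention_with_relative_bias(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                                 rel_bias: torch.Tensor) -> torch.Tensor:
    """Scaled dot-product attention with a learned bias indexed by the offset (i - j).

    q, k, v: (seq_len, d); rel_bias: (2 * seq_len - 1,), one scalar per relative offset.
    Only the scores depend on position, so cached states can be reused when the
    sequence grows beyond the maximum length seen during training.
    """
    seq_len, d = q.shape
    scores = q @ k.T / d ** 0.5                                            # content term
    offsets = torch.arange(seq_len).unsqueeze(1) - torch.arange(seq_len).unsqueeze(0)
    scores = scores + rel_bias[offsets + seq_len - 1]                      # position term
    return torch.softmax(scores, dim=-1) @ v
```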

The AudioEncoder has 18 layers and the LabelEncoder has 2 layers.

Experiments

Dataset: the LibriSpeech ASR corpus, 970 hours of speech with paired transcripts of about 10M word tokens, plus an additional 800M word tokens of unpaired text. The T-T model is trained on the 10M paired word tokens; the language model (LM) is trained on the full 810M word tokens.

Spectral features: 128-channel log-mel energies from a 32 ms window, stacked every 4 frames and sub-sampled every 3 frames to a 30 ms stride, yielding 512-dimensional acoustic features (128 × 4 = 512).
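As a concrete illustration of the 128 × 4 = 512 arithmetic, here is a minimal sketch of the frame stacking and sub-sampling, assuming the underlying log-mel frames have a 10 ms hop; the function name and exact stacking order are assumptions.

```python
import torch

def stack_and_subsample(feats: torch.Tensor, stack: int = 4, stride: int = 3) -> torch.Tensor:
    """Concatenate `stack` consecutive frames and emit one output every `stride` frames.

    feats: (num_frames, 128) log-mel frames with a 10 ms hop.
    Returns roughly (num_frames / stride, 128 * stack) features, i.e. 512-dim
    vectors at a 30 ms effective stride when stack=4 and stride=3.
    """
    windows = feats.unfold(0, stack, stride)                      # (num_out, 128, stack)
    return windows.transpose(1, 2).reshape(windows.size(0), -1)   # (num_out, 512)
```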

Data augmentation: see SpecAugment [12].

Table 2 shows WERs of 2.4% without an LM and 2.0% with an LM.

The LM has 6 layers and 57M parameters, with a perplexity of 2.49 on dev-clean.

Next come the limited-attention experiments, which test the model's online streaming ability; limited attention is implemented with attention masks (as sketched earlier). Table 4 masks the AudioEncoder, Table 5 masks the LabelEncoder, and Table 6 masks both.
