Transformer Transducer: A Streamable Speech Recognition Model with Transformer Encoders and RNN-T Loss

Google team; state of the art on the LibriSpeech dataset; February 2020


Abstract

In this paper we present an end-to-end speech recognition model with Transformer encoders that can be used in a streaming speech recognition system. This is similar to the Recurrent Neural Network Transducer (RNN-T) model, which uses RNNs for information encoding instead of Transformer encoders.

The paper replaces the LSTM encoders of the Recurrent Neural Network Transducer (RNN-T) model with Transformer encoders, yielding the Transformer Transducer (T-T) architecture. Trained with the RNN-T loss, the model supports streaming decoding and achieves state-of-the-art results on LibriSpeech.

Full attention: not streamable.

Limited attention: limit the left context for self-attention in the Transformer layers.

The gap between the full-attention and limited-attention versions of the model can be bridged by attending to a limited number of future frames.

This is a trade-off between full attention and limited attention: the model is allowed to attend to a few future frames, i.e., it incurs a fixed amount of latency.

Introduction

Transformer-based models attend over encoder features using decoder features, implying that the decoding has to be done in a label-synchronous way, thereby posing a challenge for streaming speech recognition applications. An additional challenge for streaming speech recognition with these models is that the number of computations for self-attention increases quadratically with input sequence size.

The paper identifies two problems with Transformer-based models: (1) they cannot be used in a streaming fashion; (2) the self-attention computation grows quadratically with input size (for RNNs it grows only linearly, which is friendlier).

In particular, the RNN-T and RNA models are trained to learn alignments between the acoustic encoder features and the label encoder features, and so lend themselves naturally to frame-synchronous decoding.

The paper notes that the RNN-T and RNA models are trained to learn the alignment between acoustic encoder features and label encoder features, so they naturally support frame-synchronous decoding.

We show that Transformer-based models can be trained with self-attention on a fixed number of past input frames and previous labels.

The paper limits the history available to self-attention so that the computation no longer grows quadratically, making it suitable for streaming.

Compared with RNNs, self-attention can also be parallelized during training, which greatly improves training efficiency.
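To make the limited-context idea concrete, below is a minimal sketch (not from the paper) of how a left-/right-context attention mask could be built; the function name and context sizes are illustrative assumptions.

```python
import torch

def streaming_attention_mask(seq_len: int, left_context: int, right_context: int) -> torch.Tensor:
    """Boolean mask where entry (i, j) is True if frame i may attend to frame j.

    Each frame sees at most `left_context` past frames and `right_context`
    future frames, which bounds the per-frame attention cost and the latency.
    """
    positions = torch.arange(seq_len)
    offsets = positions.unsqueeze(1) - positions.unsqueeze(0)  # offsets[i, j] = i - j
    return (offsets <= left_context) & (offsets >= -right_context)

# Example: 10 frames, 3 frames of left context, 1 frame of look-ahead.
mask = streaming_attention_mask(10, left_context=3, right_context=1)
print(mask.int())
```

Setting right_context = 0 gives the fully streamable limited-attention model; increasing it trades latency for accuracy, which is the trade-off described above.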

RNN-T Loss

(to be added)
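Until this section is filled in, here is a minimal illustrative sketch of computing an RNN-T loss with torchaudio's rnnt_loss; the tensor shapes and values are made-up assumptions, not the paper's configuration.

```python
import torch
import torchaudio.functional as taF

batch, frames, label_len, vocab = 2, 50, 10, 30  # illustrative sizes
blank = 0

# Joint-network output: one logit per (frame, label position, output symbol).
logits = torch.randn(batch, frames, label_len + 1, vocab)
targets = torch.randint(1, vocab, (batch, label_len), dtype=torch.int32)
logit_lengths = torch.full((batch,), frames, dtype=torch.int32)
target_lengths = torch.full((batch,), label_len, dtype=torch.int32)

loss = taF.rnnt_loss(logits, targets, logit_lengths, target_lengths, blank=blank)
print(loss)  # scalar loss, averaged over the batch by default
```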

Transformer Transducer (T-T) Architecture

1) Each layer consists of two sub-layers: a multi-headed attention layer and a feed-forward layer.

2) The input first passes through LayerNorm, is projected into key, query, and value feature spaces, and multi-head attention is applied; the per-head values are concatenated and fed into dense layers.

3) Two residual connections (see the sketch after this list):

First: LayerNorm(x) + AttentionLayer(LayerNorm(x));

Second: LayerNorm(x) + FeedForwardLayer(LayerNorm(x));

4) Dropout = 0.1
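The following is a minimal PyTorch sketch of one such block, following the notes above (pre-LayerNorm, multi-head attention, feed-forward sub-layer, residuals taken from the LayerNorm output, dropout 0.1); the dimensions and module choices are illustrative assumptions, not the paper's exact hyper-parameters.

```python
import torch
import torch.nn as nn
from typing import Optional

class TTBlock(nn.Module):
    """One Transformer Transducer encoder layer as described in the notes above."""

    def __init__(self, d_model: int = 512, n_heads: int = 8, d_ff: int = 2048, dropout: float = 0.1):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor, attn_mask: Optional[torch.Tensor] = None) -> torch.Tensor:
        # attn_mask follows nn.MultiheadAttention semantics (True = not allowed to attend).
        # First residual: LayerNorm(x) + AttentionLayer(LayerNorm(x))
        y = self.norm1(x)
        attn_out, _ = self.attn(y, y, y, attn_mask=attn_mask)
        x = y + self.dropout(attn_out)
        # Second residual: LayerNorm(x) + FeedForwardLayer(LayerNorm(x))
        z = self.norm2(x)
        return z + self.dropout(self.ff(z))
```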

With relative positional encoding, the encoding only affects the attention score instead of the Values being summed. This allows us to reuse previously computed states rather than recomputing all previous states and getting the last state in an overlapping inference manner when the number of frames or labels that AudioEncoder or LabelEncoder processed is larger than the maximum length used during training (which would again be intractable for streaming applications).

The paper uses relative positional encoding rather than absolute positional encoding. Reasons: (1) previously computed states can be reused instead of being recomputed in an overlapping fashion; (2) since the encoding only affects the attention scores, inference is not tied to the maximum sequence length seen during training.
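Below is a minimal sketch (illustrative, not the paper's exact formulation) of how a relative positional bias can enter the attention scores while leaving the values untouched:

```python
import torch

def attention_with_relative_bias(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                                 rel_bias: torch.Tensor) -> torch.Tensor:
    """Scaled dot-product attention with a learned bias indexed by the offset (i - j).

    q, k, v: (seq_len, d); rel_bias: (2 * seq_len - 1,), one scalar per relative offset.
    Only the scores depend on position, so cached states can be reused when the
    sequence grows beyond the maximum length seen during training.
    """
    seq_len, d = q.shape
    scores = q @ k.T / d ** 0.5                                            # content term
    offsets = torch.arange(seq_len).unsqueeze(1) - torch.arange(seq_len).unsqueeze(0)
    scores = scores + rel_bias[offsets + seq_len - 1]                      # position term
    return torch.softmax(scores, dim=-1) @ v
```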

The AudioEncoder has 18 layers and the LabelEncoder has 2 layers.

Experiments

Dataset: the LibriSpeech ASR corpus, 970 hours of speech with paired transcripts of about 10M word tokens, plus an additional 800M word tokens of unpaired text. The T-T model is trained on the 10M paired word tokens; the language model (LM) is trained on the full 810M word tokens.

Spectral features: 128-channel log-mel energies from a 32 ms window, stacked every 4 frames and sub-sampled every 3 frames to a 30 ms stride, yielding 512-dimensional acoustic features (128 × 4 = 512).
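As a concrete illustration of the 128 × 4 = 512 arithmetic, here is a minimal sketch of the frame stacking and sub-sampling, assuming the underlying log-mel frames have a 10 ms hop; the function name and exact stacking order are assumptions.

```python
import torch

def stack_and_subsample(feats: torch.Tensor, stack: int = 4, stride: int = 3) -> torch.Tensor:
    """Concatenate `stack` consecutive frames and emit one output every `stride` frames.

    feats: (num_frames, 128) log-mel frames with a 10 ms hop.
    Returns roughly (num_frames / stride, 128 * stack) features, i.e. 512-dim
    vectors at a 30 ms effective stride when stack=4 and stride=3.
    """
    windows = feats.unfold(0, stack, stride)                      # (num_out, 128, stack)
    return windows.transpose(1, 2).reshape(windows.size(0), -1)   # (num_out, 512)
```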

Data augmentation: see SpecAugment [12].

Table 2 shows WERs of 2.4% without an LM and 2.0% with an LM.

The LM has 6 layers and 57M parameters, with a perplexity of 2.49 on dev-clean.

Next come the limited-attention experiments, which test the model's online streaming ability; limited attention is implemented with attention masks (as sketched earlier). Table 4 masks the AudioEncoder, Table 5 masks the LabelEncoder, and Table 6 masks both.
