A Survey of Transformers: Study Notes

1 Introduction

A number of X-formers improve on the vanilla Transformer from the following perspectives: model efficiency, model generalization, and model adaptation.

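2 Background

2.1 Vanilla Transformer

The vanilla Transformer is a sequence-to-sequence model. It consists of an encoder and a decoder, each of which is a stack of L identical blocks.

Each encoder block is composed of a multi-head self-attention module and a position-wise FFN. To build a deeper model, a residual connection is employed around each module, followed by layer normalization.

Each decoder block additionally inserts a cross-attention module between the multi-head self-attention module and the position-wise FFN.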

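2.1.1 Attention Modules

The scaled dot-product attention used by the Transformer is given by

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{D_k}}\right)V = AV \qquad \text{(Eq. 1)}$$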

Query-Key-Value (QKV)

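$Q \in \mathbb{R}^{N\times D_k}$; $K \in \mathbb{R}^{M\times D_k}$; $V \in \mathbb{R}^{M\times D_v}$

$N$, $M$: the lengths of the queries and of the keys (or values); $D_k$ and $D_v$: the dimensions of the keys (or queries) and of the values.

$A$: the attention matrix.

Dividing the dot products by $\sqrt{D_k}$ in Eq. 1 alleviates the vanishing-gradient problem of the softmax function.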
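The scaled dot-product attention described above can be sketched in NumPy as follows. This is a minimal illustration rather than a reference implementation; the function names and the random test shapes are assumptions made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row-wise max for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q: (N, D_k), K: (M, D_k), V: (M, D_v) -> output (N, D_v), A (N, M)."""
    D_k = Q.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(D_k), axis=-1)  # attention matrix
    return A @ V, A

rng = np.random.default_rng(0)
N, M, D_k, D_v = 3, 5, 8, 4          # illustrative sizes
Q = rng.normal(size=(N, D_k))
K = rng.normal(size=(M, D_k))
V = rng.normal(size=(M, D_v))
out, A = scaled_dot_product_attention(Q, K, V)
```

Each row of `A` is a probability distribution over the M keys, so every output row is a weighted average of the rows of `V`.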

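Multi-head attention

Eq. 2 projects the queries, keys, and values from $D_m$ dimensions to $D_k$, $D_k$, and $D_v$ dimensions, respectively; Eq. 3 then projects the result back to $D_m$ dimensions.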
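A rough NumPy sketch of multi-head attention, assuming H heads with separate projection matrices per head and a single output projection. The argument names (`W_q`, `W_k`, `W_v`, `W_o`) and the sizes are illustrative choices, not notation from the notes.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(Xq, Xkv, W_q, W_k, W_v, W_o):
    """Xq: (N, D_m), Xkv: (M, D_m).
    W_q, W_k: (H, D_m, D_k); W_v: (H, D_m, D_v); W_o: (H * D_v, D_m)."""
    heads = []
    for Wq_i, Wk_i, Wv_i in zip(W_q, W_k, W_v):
        # Project from D_m down to D_k, D_k, D_v for this head.
        Q, K, V = Xq @ Wq_i, Xkv @ Wk_i, Xkv @ Wv_i
        A = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
        heads.append(A @ V)                          # (N, D_v)
    # Concatenate the heads and project back to D_m.
    return np.concatenate(heads, axis=-1) @ W_o

rng = np.random.default_rng(0)
H, D_m, D_k, D_v, N = 4, 16, 4, 4, 6                 # illustrative sizes
X = rng.normal(size=(N, D_m))
W_q = rng.normal(size=(H, D_m, D_k))
W_k = rng.normal(size=(H, D_m, D_k))
W_v = rng.normal(size=(H, D_m, D_v))
W_o = rng.normal(size=(H * D_v, D_m))
Y = multi_head_attention(X, X, W_q, W_k, W_v, W_o)   # self-attention usage
```

The output has the same shape as the input, which is what allows blocks to be stacked.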

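Categories:

Depending on where the queries, keys, and values come from, there are three types of attention:

Self-attention: Q = K = V = X in Eq. 1, where X is the output of the previous layer.
Masked self-attention: in the decoder, each query may attend only to positions up to and including its own. To allow parallel training, this is done by masking the unnormalized attention matrix.
Cross-attention: the queries are projected from the outputs of the previous (decoder) layer, whereas the keys and values are projected using the outputs of the encoder.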
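The masking used by masked self-attention can be sketched as below. The sketch sets the illegal logits to negative infinity before the softmax, which has the same effect as zeroing the corresponding entries of the unnormalized attention matrix; the learned projections are left out for brevity, which is a simplification.

```python
import numpy as np

def masked_self_attention(X):
    """Causal self-attention with Q = K = V = X of shape (T, D)."""
    T, D = X.shape
    scores = X @ X.T / np.sqrt(D)
    # Position i may not attend to any position j > i.
    illegal = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores = np.where(illegal, -np.inf, scores)
    # Every row keeps its diagonal entry, so the row max is finite.
    scores = scores - scores.max(axis=-1, keepdims=True)
    A = np.exp(scores)
    A = A / A.sum(axis=-1, keepdims=True)
    return A @ X, A

X = np.random.default_rng(0).normal(size=(4, 8))
out, A = masked_self_attention(X)
```

All T positions are processed in one matrix product, which is why training can run in parallel even though generation is autoregressive.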

2.1.2 Position-wise FFN.

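In essence, this is a fully connected feed-forward module that operates on each position separately and identically:

$$\mathrm{FFN}(H') = \mathrm{ReLU}(H'W^{1} + b^{1})W^{2} + b^{2}$$

where $H'$ is the output of the previous layer.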
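A minimal sketch of the position-wise FFN in its usual two-layer ReLU form. The inner width `D_f` and the random weights are assumptions for illustration.

```python
import numpy as np

def position_wise_ffn(H, W1, b1, W2, b2):
    """H: (T, D_m); W1: (D_m, D_f); W2: (D_f, D_m)."""
    return np.maximum(H @ W1 + b1, 0.0) @ W2 + b2

rng = np.random.default_rng(0)
T, D_m, D_f = 5, 8, 32               # illustrative sizes
H = rng.normal(size=(T, D_m))
W1, b1 = rng.normal(size=(D_m, D_f)), np.zeros(D_f)
W2, b2 = rng.normal(size=(D_f, D_m)), np.zeros(D_m)
out = position_wise_ffn(H, W1, b1, W2, b2)

# "Position-wise": feeding one row alone gives the same result for that row.
row2 = position_wise_ffn(H[2:3], W1, b1, W2, b2)[0]
```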

2.1.3 Residual Connection and Normalization.

A residual connection is employed around each module, followed by layer normalization.
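The residual-then-normalize pattern can be sketched as follows. The learnable gain and bias of layer normalization are omitted, and `residual_block` is a name chosen for this example only.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each position (row) to zero mean and unit variance.
    # The learnable gain and bias are omitted in this sketch.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def residual_block(x, module):
    # Post-LN arrangement: LayerNorm(module(x) + x).
    return layer_norm(module(x) + x)

X = np.random.default_rng(0).normal(size=(5, 16))
H = residual_block(X, lambda h: 0.1 * h)  # stand-in for attention or FFN
```

In an encoder block this wrapper would be applied twice: once around self-attention and once around the FFN.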

2.1.4 Position Encodings.

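The Transformer on its own ignores positional information, so an additional positional representation is needed to make up for this.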
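One common way to supply the missing positional information is the fixed sinusoidal encoding of the original Transformer. The notes do not say which scheme is meant, so the sketch below is only one possible choice.

```python
import numpy as np

def sinusoidal_position_encoding(T, D):
    """Return a (T, D) matrix of position encodings; D is assumed to be even."""
    pos = np.arange(T)[:, None]                 # (T, 1)
    i = np.arange(D // 2)[None, :]              # (1, D/2)
    angles = pos / np.power(10000.0, 2 * i / D)
    pe = np.zeros((T, D))
    pe[:, 0::2] = np.sin(angles)                # even dimensions
    pe[:, 1::2] = np.cos(angles)                # odd dimensions
    return pe

pe = sinusoidal_position_encoding(T=10, D=16)
```

The matrix is added to the token embeddings before the first block.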

2.2 Model Usage

Encoder-Decoder: sequence-to-sequence tasks
Encoder only: used for classification
Decoder only: used for sequence generation

2.3 Model Analysis

D: hidden dimension

T: input sequence length
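With these two symbols, self-attention costs on the order of T²·D operations with about 4D² parameters, while the position-wise FFN costs on the order of T·D² operations with about 8D² parameters. The sketch below makes the counts concrete under stated assumptions: biases are ignored, the FFN inner width is taken to be 4D, and only the QKᵀ and AV products are counted for attention.

```python
def self_attention_cost(T, D):
    params = 4 * D * D          # W_Q, W_K, W_V, W_O, each D x D (biases ignored)
    mults = 2 * T * T * D       # QK^T and AV, each about T*T*D multiply-adds
    return params, mults

def ffn_cost(T, D, inner=None):
    inner = 4 * D if inner is None else inner   # assumed inner width of 4D
    params = 2 * D * inner      # two weight matrices (biases ignored)
    mults = 2 * T * D * inner   # two matrix products per position
    return params, mults

attn_params, attn_mults = self_attention_cost(T=128, D=512)
ffn_params, ffn_mults = ffn_cost(T=128, D=512)
```

Doubling T quadruples the attention term but only doubles the FFN term, which is why attention becomes the bottleneck for long sequences.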

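2.4 Comparing Transformer to Other Network Types

2.4.1 Analysis of Self-Attention.

Compared with fully connected layers, self-attention is more parameter-efficient and handles variable sequence lengths more flexibly.
Compared with convolutional layers, which need a deep stack to capture global information, self-attention can capture global information with a constant number of layers.
Compared with recurrent layers, self-attention parallelizes better.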

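2.4.2 In Terms of Inductive Bias

Convolutional networks: an inductive bias of translation invariance, imposed through shared kernel functions.
Recurrent networks: an inductive bias of temporal invariance, imposed through their Markovian structure.
Transformer: makes few assumptions about the structure of the data, which makes it more flexible but also prone to overfitting on small datasets.
GNN: the Transformer can be viewed as a GNN defined over a complete directed graph (with self-loops) where each input is a node in the graph. The main difference between the two is that the Transformer uses no prior structural information.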

3 TAXONOMY OF TRANSFORMERS

4 ATTENTION

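Self-attention faces two challenges: computational complexity and the lack of a structural prior.

Improvements to the attention mechanism fall into six directions: sparse attention, linearized attention, prototype and memory compression, low-rank self-attention, attention with prior, and improved multi-head mechanism.

4.1 Sparse Attention (two classes: position-based sparse attention and content-based sparse attention)

In the standard self-attention mechanism, every token needs to attend to all other tokens. In practice, however, the attention matrix A is sparse across most data points.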

4.1.1 Position-based Sparse Attention.

4.1.1.1 Atomic Sparse Attention.

4.1.1.2 Compound Sparse Attention

4.1.1.3 Extended Sparse Attention

4.1.2 Content-based Sparse Attention.

Routing Transformer:

Clusters the queries and keys on the same set of centroid vectors.

(I did not fully understand what the formula means.)
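As an aid to reading the clustering idea, here is a rough sketch of how a shared set of centroids can define a sparse attention pattern: each query attends only to keys assigned to the same centroid. It leaves out how the centroids are learned; the random centroids, the Euclidean nearest-centroid rule, and the function names are all assumptions made for illustration.

```python
import numpy as np

def cluster_ids(X, centroids):
    # Index of the nearest centroid (squared Euclidean distance) for each row.
    d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return d.argmin(axis=-1)

def routing_mask(Q, K, centroids):
    # mask[i, j] is True when query i and key j fall in the same cluster.
    return cluster_ids(Q, centroids)[:, None] == cluster_ids(K, centroids)[None, :]

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 4))
centroids = rng.normal(size=(3, 4))   # hypothetical, untrained centroids
mask = routing_mask(X, X, centroids)  # queries and keys share the same vectors here
```

Attention scores would then be computed only where `mask` is True, so the cost depends on cluster sizes rather than on the full sequence length.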

Reformer:

b: the number of buckets

R: a random matrix of size [$D_k$, $\frac{b}{2}$]

The LSH (locality-sensitive hashing) function is computed as:

$$h(x) = \arg\max([xR;\,-xR])$$
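The LSH function can be sketched in a few lines: project with the random matrix R, concatenate the projection with its negation, and take the index of the largest entry as the bucket. The sizes below are illustrative.

```python
import numpy as np

def lsh_bucket(X, R):
    """X: (T, D_k), R: (D_k, b/2) -> bucket index in [0, b) for each row."""
    proj = X @ R                                           # (T, b/2)
    return np.concatenate([proj, -proj], axis=-1).argmax(axis=-1)

rng = np.random.default_rng(0)
D_k, b = 8, 6                          # illustrative sizes; b must be even
R = rng.normal(size=(D_k, b // 2))
X = rng.normal(size=(10, D_k))
buckets = lsh_bucket(X, R)
```

Because the bucket depends only on the direction of a vector, queries and keys that point in similar directions tend to land in the same bucket, and attention is then restricted to pairs within a bucket.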

Sparse Adaptive Connection (SAC)
Sparse Sinkhorn Attention

4.2 Linearized Attention

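Computational complexity

Standard self-attention: computing $\mathrm{softmax}(QK^{\top})V$ is quadratic in the sequence length $T$.

Linearized self-attention: approximating the unnormalized attention matrix $\exp(QK^{\top})$ with $\phi(Q)\phi(K)^{\top}$ allows $\phi(K)^{\top}V$ to be computed first, which is linear in $T$.

Z is given by:

$$Z = D^{-1}\,\phi(Q)\left(\phi(K)^{\top}V\right), \quad D = \mathrm{diag}\left(\phi(Q)\,\phi(K)^{\top}\mathbf{1}_T\right)$$

$\phi$: a feature map applied row-wise

Writing the formula out in vector form gives a deeper understanding of linearized attention.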

4.2.1 Feature Maps

Feature map of the Linear Transformer: $\phi_i(x) = \mathrm{elu}(x_i) + 1$
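A small sketch of linearized attention with the elu(x) + 1 feature map. It computes the same result in two ways: the linear-time order that first accumulates φ(K)ᵀV, and a quadratic reference that builds the explicit T×T matrix. Function names and sizes are assumptions for the example.

```python
import numpy as np

def phi(x):
    # elu(x) + 1: strictly positive, so all attention weights are positive.
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def linear_attention(Q, K, V):
    """Linear-time order: build phi(K)^T V once, then multiply by phi(Q)."""
    Qp, Kp = phi(Q), phi(K)
    S = Kp.T @ V                    # (D_k, D_v) summary of keys and values
    z = Kp.sum(axis=0)              # (D_k,) normalizer
    return (Qp @ S) / (Qp @ z)[:, None]

def quadratic_reference(Q, K, V):
    """Same result via the explicit T x T matrix phi(Q) phi(K)^T."""
    A = phi(Q) @ phi(K).T
    return (A / A.sum(axis=-1, keepdims=True)) @ V

rng = np.random.default_rng(0)
T, D_k, D_v = 12, 4, 5              # illustrative sizes
Q, K, V = (rng.normal(size=(T, d)) for d in (D_k, D_k, D_v))
Z_fast = linear_attention(Q, K, V)
Z_ref = quadratic_reference(Q, K, V)
```

In vector form, each output row is φ(q_i)ᵀS divided by φ(q_i)ᵀz, where S and z are sums over all keys that are shared by every query.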

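Performer [18, 19] uses random feature maps that approximate the scoring function of the Transformer.

Performer [18]:

Cannot guarantee non-negative attention scores, which leads to instability and abnormal behavior.

Performer [19]:

Guarantees an unbiased estimate and non-negative outputs, and is therefore more stable than Performer [18].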

Schlag et al.

4.2.2 Aggregation Rule

RFA: introduces a gating mechanism to the summation
Schlag et al.: enlarge the capacity in a write-and-remove fashion

4.3 Query Prototyping and Memory Compression

4.3.1 Attention with Prototype Queries

Decreases the number of queries with query prototyping.

4.3.2 Attention with Compressed Key-Value Memory

Reduces the number of key-value pairs before applying the attention mechanism.

Liu et al.: propose Memory Compressed Attention (MCA), which reduces the number of keys and values using a strided convolution.
Set Transformer [70] and Luna [90]: use a number of external trainable global nodes to summarize information from the inputs.
Linformer [142]: utilizes linear projections to project keys and values from length $n$ to a smaller length $n_k$.
Poolingformer [165]: adopts two-level attention that combines a sliding window attention and a compressed memory attention.
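To make the key-value compression idea concrete, here is a sketch in the style of the linear-projection variant: keys and values are projected along the length axis from n to n_k before attention. The projection matrices `E` and `F` are random here, whereas in a real model they would be learned; the names and sizes are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def compressed_kv_attention(Q, K, V, E, F):
    """Q, K, V: (n, D); E, F: (n_k, n) projections along the length axis."""
    K_c, V_c = E @ K, F @ V                          # (n_k, D) compressed memory
    A = softmax(Q @ K_c.T / np.sqrt(Q.shape[-1]))    # (n, n_k) instead of (n, n)
    return A @ V_c, A

rng = np.random.default_rng(0)
n, n_k, D = 64, 8, 16                                # illustrative sizes
Q, K, V = (rng.normal(size=(n, D)) for _ in range(3))
E = rng.normal(size=(n_k, n)) / np.sqrt(n)
F = rng.normal(size=(n_k, n)) / np.sqrt(n)
out, A = compressed_kv_attention(Q, K, V, E, F)
```

The attention matrix shrinks from n×n to n×n_k, so the cost grows linearly in n for a fixed n_k.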

4.4 Low-rank Self-Attention

4.4.1 Low-rank Parameterization

Restrict the dimension $D_k$.

4.4.2 Low-rank Approximation

kernel approximation with random feature maps

Nyström method.

4.5 Attention with Prior

4.5.1 Prior that Models locality

A higher $G_{ij}$ indicates a higher prior probability that the i-th input attends to the j-th input.

Yang et al. [156]

Gaussian Transformer
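A locality prior of this kind can be sketched as a bias added to the attention scores before the softmax. The sketch assumes a Gaussian-shaped bias centered on each query's own position with a fixed width `sigma`; the methods listed above choose or learn the center and width differently, so this is only a simplified illustration.

```python
import numpy as np

def attention_with_locality_prior(Q, K, V, sigma=1.0):
    """Self-attention over T positions with a fixed Gaussian locality bias."""
    T, D_k = Q.shape
    scores = Q @ K.T / np.sqrt(D_k)
    idx = np.arange(T)
    # G[i, j] = -(j - i)^2 / (2 sigma^2): zero on the diagonal, more negative far away.
    G = -((idx[None, :] - idx[:, None]) ** 2) / (2.0 * sigma ** 2)
    logits = scores + G
    logits = logits - logits.max(axis=-1, keepdims=True)
    A = np.exp(logits)
    A = A / A.sum(axis=-1, keepdims=True)
    return A @ V, A

# With all-zero queries and keys, the attention pattern is the prior alone.
T = 5
Q = K = np.zeros((T, 4))
V = np.arange(T * 3, dtype=float).reshape(T, 3)
out, A = attention_with_locality_prior(Q, K, V, sigma=1.0)
```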

4.5.2 Prior from Lower Modules.
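Attention distributions are similar in adjacent layers, so the attention distribution of the previous layer can be used as a prior:

$$\hat{A}^{(l)} = w_1 \cdot A^{(l)} + w_2 \cdot g(A^{(l-1)})$$

$A^{(l)}$: the attention scores of the l-th layer;

$w_1$, $w_2$: the weights applied to the scores of adjacent layers;

$g$: a function that turns the previous layer's scores into the prior

Predictive Attention Transformer: $w_1 = \alpha$, $w_2 = 1 - \alpha$; g is a convolutional layer
Realformer: $w_1 = w_2 = 1$; g is the identity map
Lazyformer: shares attention maps between a number of adjacent layers. This is equivalent to setting g(·) to identity and switching between the settings $w_1 = 0$, $w_2 = 1$ and $w_1 = 1$, $w_2 = 0$ alternately.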
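The three variants differ only in the weights and in the choice of g, which a short sketch makes visible. The function name and the random score matrices are assumptions for illustration; the convolutional g of the Predictive Attention Transformer is not implemented here.

```python
import numpy as np

def combine_with_previous(A_l, A_prev, w1, w2, g=lambda a: a):
    # w1 * A^(l) + w2 * g(A^(l-1)); g defaults to the identity map.
    return w1 * A_l + w2 * g(A_prev)

rng = np.random.default_rng(0)
A_prev = rng.normal(size=(4, 4))   # scores from layer l-1
A_l = rng.normal(size=(4, 4))      # scores generated at layer l

# Realformer-style setting: add the previous scores as a residual.
realformer_like = combine_with_previous(A_l, A_prev, w1=1.0, w2=1.0)
# Lazyformer-style alternation: either reuse the previous map or compute a new one.
lazy_reuse = combine_with_previous(A_l, A_prev, w1=0.0, w2=1.0)
lazy_compute = combine_with_previous(A_l, A_prev, w1=1.0, w2=0.0)
```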

4.5.3 Prior as Multi-task Adapters

$\oplus$: direct sum

$A_j$: trainable parameters

$\beta_j$, $\gamma_j$: Feature Wise Linear Modulation functions

4.5.4 Attention with Only Prior

Zhang et al.: use only a discrete uniform distribution as the source of the attention distribution.
You et al.: utilize a Gaussian distribution as the hardcoded attention distribution for attention calculation.
Synthesizer: replaces the generated attention scores with (1) learnable, randomly initialized attention scores, and (2) attention scores output by a feed-forward network that is conditioned only on the querying input itself.

4.6 Improved Multi-Head Mechanism

4.6.1 Head Behavior Modeling.

Li et al.: introduce a regularization term into the loss function to encourage diversity among attention heads.
Deshpande and Narasimhan: use an auxiliary loss, the Frobenius norm between attention distribution maps and predefined attention patterns.
Talking-head Attention: a talking-head mechanism
Collaborative Multi-head Attention

4.6.2 Multi-head with Restricted Spans

• Locality.
• Efficiency.

4.6.3 Multi-head with Refined Aggregation.

Best balances translation performance and computational efficiency.

4.6.4 Other Modifications.
