Transformer [137] is an important deep learning model that has been widely adopted in various fields, such as natural language processing (NLP), computer vision (CV), and speech processing. Transformer was originally proposed as a sequence-to-sequence model [130] for machine translation. Later works show that Transformer-based pre-trained models (PTMs) [100] can achieve state-of-the-art performance on various tasks. As a consequence, Transformer has become the go-to architecture in NLP, especially for PTMs. In addition to language-related applications, Transformer has also been adopted in CV [13, 33, 94], audio processing [15, 31, 41] and even other disciplines, such as chemistry [114] and life sciences [109].

Due to this success, a variety of Transformer variants (collectively referred to as X-formers) have been proposed in recent years. These X-formers improve the vanilla Transformer from different perspectives.

  1. Model Efficiency: A key challenge of applying Transformer is its inefficiency at processing long sequences, mainly due to the computation and memory complexity of the self-attention module. The improvement methods include lightweight attention (e.g., sparse attention variants) and divide-and-conquer methods (e.g., recurrent and hierarchical mechanisms).
  2. Model Generalization: Since Transformer is a flexible architecture that makes few assumptions on the structural bias of input data, it is hard to train on small-scale data. The improvement methods include introducing structural bias or regularization, pre-training on large-scale unlabeled data, etc.
  3. Model Adaptation: This line of work aims to adapt the Transformer to specific downstream tasks and applications.

In this survey, we aim to provide a comprehensive review of the Transformer and its variants. Although we could organize the X-formers on the basis of the perspectives mentioned above, many existing X-formers address one or several of these issues. For example, sparse attention variants not only reduce the computational complexity but also introduce a structural prior on input data to alleviate the overfitting problem on small datasets. Therefore, it is more methodical to categorize the various existing X-formers and propose a new taxonomy mainly according to their ways of improving the vanilla Transformer: architecture modification, pre-training, and applications. Considering that the audience of this survey may come from different domains, we mainly focus on the general architecture variants and only briefly discuss the specific variants for pre-training and applications.

2 Background

2.1 Vanilla Transformer

The vanilla Transformer [137] is a sequence-to-sequence model and consists of an encoder and a decoder, each of which is a stack of $L$ identical blocks. Each encoder block is mainly composed of a multi-head self-attention module and a position-wise feed-forward network (FFN). For building a deeper model, a residual connection [49] is employed around each module, followed by a Layer Normalization [4] module. Compared to the encoder blocks, decoder blocks additionally insert cross-attention modules between the multi-head self-attention modules and the position-wise FFNs. Furthermore, the self-attention modules in the decoder are adapted to prevent each position from attending to subsequent positions. The overall architecture of the vanilla Transformer is shown in Fig. 1.
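To make the block structure concrete, here is a minimal PyTorch sketch of a single encoder block under the post-LN arrangement described above (sub-layer, then residual add, then LayerNorm); the class name and hyperparameter defaults are illustrative and not taken from the survey:

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):  # illustrative name, not from the survey
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        # Multi-head self-attention sub-layer
        self.self_attn = nn.MultiheadAttention(d_model, n_heads,
                                               dropout=dropout, batch_first=True)
        # Position-wise feed-forward network (FFN)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, key_padding_mask=None):
        # Self-attention sub-layer with residual connection and LayerNorm
        attn_out, _ = self.self_attn(x, x, x, key_padding_mask=key_padding_mask)
        x = self.norm1(x + self.dropout(attn_out))
        # FFN sub-layer with residual connection and LayerNorm
        x = self.norm2(x + self.dropout(self.ffn(x)))
        return x
```

A decoder block would additionally insert a cross-attention sub-layer (attending to the encoder output) between the two sub-layers above and apply a causal mask in its self-attention.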

In the following subsections, we introduce the key modules of the vanilla Transformer.

2.1.1 Attention Modules

Transformer adopts the attention mechanism with a Query-Key-Value (QKV) model. Given the packed matrix representations of queries $Q \in \mathbb{R}^{N \times D_k}$, keys $K \in \mathbb{R}^{M \times D_k}$, and values $V \in \mathbb{R}^{M \times D_v}$, the scaled dot-product attention used by Transformer is given by

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{D_k}}\right)V = AV \quad\quad\quad (1)$$

where $N$ and $M$ denote the lengths of queries and keys (or values); $D_k$ and $D_v$ denote the dimensions of keys (or queries) and values; $A = \text{softmax}\left(\frac{QK^\top}{\sqrt{D_k}}\right)$ is often called the attention matrix; softmax is applied in a row-wise manner. The dot-products of queries and keys are divided by $\sqrt{D_k}$ to alleviate the gradient vanishing problem of the softmax function.
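As a quick illustration of Eq. (1), here is a minimal NumPy sketch (the function name is ours; shapes follow the notation above):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Q: (N, D_k), K: (M, D_k), V: (M, D_v) -> output: (N, D_v)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                # (N, M) scaled dot-product scores
    scores -= scores.max(axis=-1, keepdims=True)   # subtract row max for numerical stability
    A = np.exp(scores)
    A /= A.sum(axis=-1, keepdims=True)             # row-wise softmax -> attention matrix A
    return A @ V                                   # weighted sum of values
```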

Instead of simply applying a single attention function, Transformer uses multi-head attention, where the $D_m$-dimensional original queries, keys, and values are projected into $D_k$, $D_k$, and $D_v$ dimensions, respectively, with $H$ different sets of learned projections. For each of the projected queries, keys, and values, an output is computed with attention according to Eq. (1). The model then concatenates all the outputs and projects them back to a $D_m$-dimensional representation.
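Written out, this is the standard multi-head attention of the original Transformer paper [137] (notation follows Eq. (1); $W_i^Q$, $W_i^K$, $W_i^V$, and $W^O$ are learned projection matrices):

$$\text{MultiHeadAttn}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_H)\, W^O,$$
$$\text{head}_i = \text{Attention}(Q W_i^Q,\, K W_i^K,\, V W_i^V),$$

where $W_i^Q \in \mathbb{R}^{D_m \times D_k}$, $W_i^K \in \mathbb{R}^{D_m \times D_k}$, $W_i^V \in \mathbb{R}^{D_m \times D_v}$, and $W^O \in \mathbb{R}^{H D_v \times D_m}$.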

To be completed.

3 Taxonomy of Transformers

A wide variety of models have been proposed so far based on the vanilla Transformer, from three perspectives: types of architecture modification, pre-training methods, and applications. Fig. 2 gives an illustration of our categorization of Transformer variants.
