Editor: NewBeeNLP

A few days ago, while browsing GitHub, I came across a large "awesome-fast-attention" list that collects a series of papers on efficient improvements to attention, covering the papers themselves, their citation counts, source-code implementations, algorithmic complexity, and key ideas. Some of these papers were also covered in our earlier "Transformer Assemble" series.

Efficient Attention

| Paper (citations) | Implementation | Main Idea |
|---|---|---|
| Generating Wikipedia by Summarizing Long Sequences[1] (208) | memory-compressed-attention[2] | compresses key and value + blocked attention |
| CBAM: Convolutional Block Attention Module[3] (677) | attention-module[4] | combines SE attention with a per-pixel (local) weight |
| CCNet: Criss-Cross Attention for Semantic Segmentation[5] (149) | CCNet[6] | each pixel attends to its row and column simultaneously |
| Efficient Attention: Attention with Linear Complexities[7] (2) | efficient-attention[8] | Softmax(Q) * (Softmax(K^T) * V) |
| Star-Transformer[9] (24) | fastNLP[10] | uses a relay (global) node and attends to/from that node |
| Generating Long Sequences with Sparse Transformers[11] (139) | torch-blocksparse[12] | sparse block-based attention |
| GCNet: Non-local Networks Meet Squeeze-Excitation Networks and Beyond[13] (96) | GCNet[14] | squeeze-and-excitation with attention pooling (instead of a GAP) |
| SCRAM: Spatially Coherent Randomized Attention Maps[15] (1) | - | uses PatchMatch to find close keys |
| Interlaced Sparse Self-Attention for Semantic Segmentation[16] (13) | IN_PAPER | combination of short-range and then long-range (dilated) attention |
| Permutohedral Attention Module for Efficient Non-Local Neural Networks[17] (2) | Permutohedral_attention_module[18] | uses a permutohedral lattice approximation algorithm to approximate the attention output |
| Large Memory Layers with Product Keys[19] (28) | XLM[20] | searches for nearest-neighbor keys |
| Expectation-Maximization Attention Networks for Semantic Segmentation[21] (38) | EMANet[22] | applies expectation maximization to cluster keys into k clusters |
| Compressive Transformers for Long-Range Sequence Modelling[23] (20) | compressive-transformer-pytorch[24] | compresses distant tokens instead of just stop_grad()-ing them; a more efficient version of Transformer-XL |
| BP-Transformer: Modelling Long-Range Context via Binary Partitioning[25] (8) | BPT[26] | attends to distant tokens coarsely and to close tokens in a more fine-grained manner |
| Axial Attention in Multidimensional Transformers[27] (5) | axial-attention[28] | applies attention along each axis separately |
| Reformer: The Efficient Transformer[29] (69) | trax[30] | uses LSH to find close keys |
| Transformer on a Diet[31] (2) | transformer-on-diet[32] | dilated transformer, similar to WaveNet |
| Sparse Sinkhorn Attention[33] (4) | sinkhorn-transformer[34] | uses a cost matrix to limit attention between buckets |
| SAC: Accelerating and Structuring Self-Attention via Sparse Adaptive Connection[35] (1) | - | learns the q, k connections, i.e. dynamically creates a sparse attention matrix |
| Efficient Content-Based Sparse Attention with Routing Transformers[36] (11) | routing-transformer[37] | computes attention with same-cluster tokens (computed by online k-means) |
| Longformer: The Long-Document Transformer[38] (15) | longformer[39] | global + blocked attention |
| Neural Architecture Search for Lightweight Non-Local Networks[40] (2) | AutoNL[41] | computes Q(KV) and also downsamples q, k, v in both the spatial and channel dimensions |
| ETC: Encoding Long and Structured Data in Transformers[42] (2) | - | combines global attention (Star-Transformer with multiple global tokens) with local attention |
| Multi-scale Transformer Language Models[43] (1) | IN_PAPER | UNet-like; its retina attention is close to BP-Transformer |
| Synthesizer: Rethinking Self-Attention in Transformer Models[44] (5) | - | does not compute pairwise interactions |
| Jukebox: A Generative Model for Music[45] (9) | jukebox[46] | better attention patterns from Sparse Transformer |
| GMAT: Global Memory Augmentation for Transformers[47] (0) | gmat[48] | adds global tokens |
| Masked Language Modeling for Proteins via Linearly Scalable Long-Context Transformers[49] (0) | google-research[50] | calculates an unbiased stochastic approximation of the attention matrix |
| Hand-crafted Attention is All You Need? A Study of Attention on Self-supervised Audio Transformer[51] (0) | - | does not compute pairwise interactions and uses fixed mask patterns |
| Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention[52] (1) | fast-transformers[53] | uses phi(q) * (phi(k) * v) and also improves the sequential sampling step |
| Linformer: Self-Attention with Linear Complexity[54] (3) | linformer-pytorch[55] | projects key and value from n×d to k×d |
| Real-time Semantic Segmentation with Fast Attention[56] (0) | - | l2_norm(q) * (l2_norm(k) * v) |
| Fast Transformers with Clustered Attention[57] (0) | fast-transformers[58] | groups queries together with LSH |
| Big Bird: Transformers for Longer Sequences[59] (0) | - | ETC with random connections |
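
Several entries above share one core trick: instead of forming Softmax(QK^T)V, which costs O(n^2) in sequence length, they compute something of the form phi(Q) * (phi(K)^T * V), so the n×n attention map is never materialized (Efficient Attention [7], the Performer-style approximation [49], linear-attention Transformers [52], and fast attention [56] differ mainly in the choice of phi and normalization). Below is a minimal PyTorch sketch of that idea, written for illustration only and not taken from any of the linked repositories; shapes and the softmax-based phi follow the description given for Efficient Attention [7].

```python
import torch

def standard_attention(q, k, v):
    # Full softmax attention: O(n^2) time and memory in sequence length n.
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5  # (batch, n, n)
    return torch.softmax(scores, dim=-1) @ v               # (batch, n, d)

def factorized_attention(q, k, v):
    # Softmax(Q) * (Softmax(K)^T * V): softmax over the feature axis for Q,
    # over the token axis for K, then contract K with V first so the cost
    # is linear in n (O(n * d^2)).
    q = torch.softmax(q, dim=-1)        # (batch, n, d)
    k = torch.softmax(k, dim=-2)        # (batch, n, d)
    context = k.transpose(-2, -1) @ v   # (batch, d, d)
    return q @ context                  # (batch, n, d)

if __name__ == "__main__":
    q, k, v = (torch.randn(2, 4096, 64) for _ in range(3))
    print(standard_attention(q, k, v).shape)    # torch.Size([2, 4096, 64])
    print(factorized_attention(q, k, v).shape)  # torch.Size([2, 4096, 64])
```

The two functions are not numerically equivalent: the factorized form trades the exact softmax attention map for linear cost, which is why these papers report quality/efficiency trade-offs rather than drop-in equivalence.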
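Linformer [54] takes a different route: K and V (each n×d) are projected along the token axis down to a fixed length k before attention, so the attention map is n×k instead of n×n. The following is a hypothetical single-head sketch of that projection, not the API of the linked linformer-pytorch repo; all layer and parameter names are made up for illustration.

```python
import torch
import torch.nn as nn

class LowRankSelfAttention(nn.Module):
    """Linformer-style attention sketch: compress the sequence axis of K and V."""

    def __init__(self, dim, seq_len, proj_len=128):
        super().__init__()
        self.scale = dim ** -0.5
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        # Learned (proj_len, seq_len) projections applied to the token axis of K and V.
        self.proj_k = nn.Parameter(torch.randn(proj_len, seq_len) / seq_len ** 0.5)
        self.proj_v = nn.Parameter(torch.randn(proj_len, seq_len) / seq_len ** 0.5)

    def forward(self, x):  # x: (batch, n, dim)
        q, k, v = self.to_q(x), self.to_k(x), self.to_v(x)
        k = torch.einsum("kn,bnd->bkd", self.proj_k, k)  # (batch, proj_len, dim)
        v = torch.einsum("kn,bnd->bkd", self.proj_v, v)  # (batch, proj_len, dim)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)  # (batch, n, proj_len)
        return attn @ v  # (batch, n, dim)

if __name__ == "__main__":
    layer = LowRankSelfAttention(dim=64, seq_len=1024, proj_len=128)
    print(layer(torch.randn(2, 1024, 64)).shape)  # torch.Size([2, 1024, 64])
```

The projection length proj_len is fixed regardless of n, which is where the linear complexity comes from; the price is that the projection matrices are tied to a maximum sequence length.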

Articles

  • A Survey of Long-Term Context in Transformers[60]

  • Transformers Assemble(PART I)

  • Transformers Assemble(PART II)

  • Transformers Assemble(PART III)

  • Transformers Assemble(PART IV)

  • Transformers Assemble(PART V)

  • ICLR2020 | Depth-Adaptive Transformer

  • Memory Transformer: a simple and straightforward way to modify the Transformer

  • [ICLR2020] Transformer Complex-order: a new kind of positional encoding

References

[1] Generating Wikipedia by Summarizing Long Sequences: https://arxiv.org/abs/1801.10198v1
[2] memory-compressed-attention: https://github.com/lucidrains/memory-compressed-attention
[3] CBAM: Convolutional Block Attention Module: https://arxiv.org/abs/1807.06521v2
[4] attention-module: https://github.com/Jongchan/attention-module
[5] CCNet: Criss-Cross Attention for Semantic Segmentation: https://arxiv.org/abs/1811.11721v2
[6] CCNet: https://github.com/speedinghzl/CCNet
[7] Efficient Attention: Attention with Linear Complexities: https://arxiv.org/abs/1812.01243v8
[8] efficient-attention: https://github.com/cmsflash/efficient-attention
[9] Star-Transformer: https://arxiv.org/abs/1902.09113v2
[10] fastNLP: https://github.com/fastnlp/fastNLP/blob/master/fastNLP/modules/encoder/star_transformer.py
[11] Generating Long Sequences with Sparse Transformers: https://arxiv.org/abs/1904.10509v1
[12] torch-blocksparse: https://github.com/ptillet/torch-blocksparse
[13] GCNet: Non-local Networks Meet Squeeze-Excitation Networks and Beyond: https://arxiv.org/abs/1904.11492v1
[14] GCNet: https://github.com/xvjiarui/GCNet
[15] SCRAM: Spatially Coherent Randomized Attention Maps: https://arxiv.org/abs/1905.10308v1
[16] Interlaced Sparse Self-Attention for Semantic Segmentation: https://arxiv.org/abs/1907.12273v2
[17] Permutohedral Attention Module for Efficient Non-Local Neural Networks: https://arxiv.org/abs/1907.00641v2
[18] Permutohedral_attention_module: https://github.com/SamuelJoutard/Permutohedral_attention_module
[19] Large Memory Layers with Product Keys: https://arxiv.org/abs/1907.05242v2
[20] XLM: https://github.com/facebookresearch/XLM
[21] Expectation-Maximization Attention Networks for Semantic Segmentation: https://arxiv.org/abs/1907.13426v2
[22] EMANet: https://github.com/XiaLiPKU/EMANet
[23] Compressive Transformers for Long-Range Sequence Modelling: https://arxiv.org/abs/1911.05507v1
[24] compressive-transformer-pytorch: https://github.com/lucidrains/compressive-transformer-pytorch
[25] BP-Transformer: Modelling Long-Range Context via Binary Partitioning: https://arxiv.org/abs/1911.04070v1
[26] BPT: https://github.com/yzh119/BPT
[27] Axial Attention in Multidimensional Transformers: https://arxiv.org/abs/1912.12180v1
[28] axial-attention: https://github.com/lucidrains/axial-attention
[29] Reformer: The Efficient Transformer: https://arxiv.org/abs/2001.04451v2
[30] trax: https://github.com/google/trax/tree/master/trax/models/reformer
[31] Transformer on a Diet: https://arxiv.org/abs/2002.06170v1
[32] transformer-on-diet: https://github.com/cgraywang/transformer-on-diet
[33] Sparse Sinkhorn Attention: https://arxiv.org/abs/2002.11296v1
[34] sinkhorn-transformer: https://github.com/lucidrains/sinkhorn-transformer
[35] SAC: Accelerating and Structuring Self-Attention via Sparse Adaptive Connection: https://arxiv.org/abs/2003.09833v2
[36] Efficient Content-Based Sparse Attention with Routing Transformers: https://arxiv.org/abs/2003.05997v1
[37] routing-transformer: https://github.com/lucidrains/routing-transformer
[38] Longformer: The Long-Document Transformer: https://arxiv.org/abs/2004.05150v1
[39] longformer: https://github.com/allenai/longformer
[40] Neural Architecture Search for Lightweight Non-Local Networks: https://arxiv.org/abs/2004.01961v1
[41] AutoNL: https://github.com/LiYingwei/AutoNL
[42] ETC: Encoding Long and Structured Data in Transformers: https://arxiv.org/abs/2004.08483v2
[43] Multi-scale Transformer Language Models: https://arxiv.org/abs/2005.00581v1
[44] Synthesizer: Rethinking Self-Attention in Transformer Models: https://arxiv.org/abs/2005.00743v1
[45] Jukebox: A Generative Model for Music: https://arxiv.org/abs/2005.00341v1
[46] jukebox: https://github.com/openai/jukebox
[47] GMAT: Global Memory Augmentation for Transformers: https://arxiv.org/abs/2006.03274v1
[48] gmat: https://github.com/ag1988/gmat
[49] Masked Language Modeling for Proteins via Linearly Scalable Long-Context Transformers: https://arxiv.org/abs/2006.03555v1
[50] google-research: https://github.com/google-research/google-research/tree/master/performer/fast_self_attention
[51] Hand-crafted Attention is All You Need? A Study of Attention on Self-supervised Audio Transformer: https://arxiv.org/abs/2006.05174v1
[52] Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention: https://arxiv.org/abs/2006.16236v2
[53] fast-transformers: https://github.com/idiap/fast-transformers
[54] Linformer: Self-Attention with Linear Complexity: https://arxiv.org/abs/2006.04768v3
[55] linformer-pytorch: https://github.com/tatp22/linformer-pytorch
[56] Real-time Semantic Segmentation with Fast Attention: https://arxiv.org/abs/2007.03815v2
[57] Fast Transformers with Clustered Attention: https://arxiv.org/abs/2007.04825v1
[58] fast-transformers: https://github.com/idiap/fast-transformers
[59] Big Bird: Transformers for Longer Sequences: https://arxiv.org/abs/2007.14062v1
[60] A Survey of Long-Term Context in Transformers: https://www.pragmatic.ml/a-survey-of-methods-for-incorporating-long-term-context/

