A Big Collection of Modified Attention Mechanisms
Source: NewBeeNLP
Editor: 极市平台
Editor's note
How can attention be improved efficiently? This article surveys the relevant papers and organizes their citation counts, code implementations, algorithmic complexity, and key ideas for easy comparison.
A few days ago, while browsing GitHub, I came across the awesome-fast-attention list, which collects a series of papers on efficient attention improvements along with their citation counts, source implementations, algorithmic complexity, and key highlights.
GitHub repo:
https://github.com/Separius/awesome-fast-attention
Efficient Attention
| Paper (citations) | Implementation | Main Idea |
|---|---|---|
| Generating Wikipedia by Summarizing Long Sequences[1] (208) | memory-compressed-attention[2] | compresses key and value + blocked attention |
| CBAM: Convolutional Block Attention Module[3] (677) | attention-module[4] | combines SE attention with a per-pixel (local) weight |
| CCNet: Criss-Cross Attention for Semantic Segmentation[5] (149) | CCNet[6] | each pixel attends to its row and column simultaneously |
| Efficient Attention: Attention with Linear Complexities[7] (2) | efficient-attention[8] | Softmax(Q)\*(Softmax(K^T)\*V) |
| Star-Transformer[9] (24) | fastNLP[10] | uses a relay (global) node and attends to/from that node |
| Generating Long Sequences with Sparse Transformers[11] (139) | torch-blocksparse[12] | sparse block-based attention |
| GCNet: Non-local Networks Meet Squeeze-Excitation Networks and Beyond[13] (96) | GCNet[14] | squeeze-and-excitation with attention pooling (instead of a GAP) |
| SCRAM: Spatially Coherent Randomized Attention Maps[15] (1) | - | uses PatchMatch to find close keys |
| Interlaced Sparse Self-Attention for Semantic Segmentation[16] (13) | IN_PAPER | combination of short-range and then long-range (dilated) attention |
| Permutohedral Attention Module for Efficient Non-Local Neural Networks[17] (2) | Permutohedral_attention_module[18] | uses a permutohedral lattice approximation algorithm to approximate the attention output |
| Large Memory Layers with Product Keys[19] (28) | XLM[20] | searches for nearest-neighbor keys |
| Expectation-Maximization Attention Networks for Semantic Segmentation[21] (38) | EMANet[22] | applies expectation maximization to cluster keys into k clusters |
| Compressive Transformers for Long-Range Sequence Modelling[23] (20) | compressive-transformer-pytorch[24] | compresses distant tokens instead of just stop_grad()-ing them; a more efficient version of Transformer-XL |
| BP-Transformer: Modelling Long-Range Context via Binary Partitioning[25] (8) | BPT[26] | attends to distant tokens coarsely and to close tokens in a more fine-grained manner |
| Axial Attention in Multidimensional Transformers[27] (5) | axial-attention[28] | applies attention on each axis separately |
| Reformer: The Efficient Transformer[29] (69) | trax[30] | uses LSH to find close keys |
| Transformer on a Diet[31] (2) | transformer-on-diet[32] | dilated transformer, WaveNet-style |
| Sparse Sinkhorn Attention[33] (4) | sinkhorn-transformer[34] | uses a cost matrix to limit attention between buckets |
| SAC: Accelerating and Structuring Self-Attention via Sparse Adaptive Connection[35] (1) | - | learns the q, k connections, i.e. dynamically creates a sparse attention matrix |
| Efficient Content-Based Sparse Attention with Routing Transformers[36] (11) | routing-transformer[37] | computes attention with same-cluster tokens (computed by online k-means) |
| Longformer: The Long-Document Transformer[38] (15) | longformer[39] | global + blocked attention |
| Neural Architecture Search for Lightweight Non-Local Networks[40] (2) | AutoNL[41] | computes Q(KV) and also downsamples q, k, v in both spatial and channel dimensions |
| ETC: Encoding Long and Structured Data in Transformers[42] (2) | - | combines global attention (Star-Transformer with multiple global tokens) with local attention |
| Multi-scale Transformer Language Models[43] (1) | IN_PAPER | UNet-like; its retina attention is close to BP-Transformer |
| Synthesizer: Rethinking Self-Attention in Transformer Models[44] (5) | - | does not compute pairwise interactions |
| Jukebox: A Generative Model for Music[45] (9) | jukebox[46] | better attention patterns from Sparse Transformer |
| GMAT: Global Memory Augmentation for Transformers[47] (0) | gmat[48] | adds global tokens |
| Masked Language Modeling for Proteins via Linearly Scalable Long-Context Transformers[49] (0) | google-research[50] | calculates an unbiased stochastic approximation of the attention matrix |
| Hand-crafted Attention is All You Need? A Study of Attention on Self-supervised Audio Transformer[51] (0) | - | does not compute pairwise interactions and uses fixed mask patterns |
| Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention[52] (1) | fast-transformers[53] | uses phi(q)\*(phi(k)\*v) and also improves the sequential sampling step |
| Linformer: Self-Attention with Linear Complexity[54] (3) | linformer-pytorch[55] | projects key and value from length n down to a shorter fixed length k |
| Real-time Semantic Segmentation with Fast Attention[56] (0) | - | l2_norm(q)\*(l2_norm(k)\*v) |
| Fast Transformers with Clustered Attention[57] (0) | fast-transformers[58] | groups queries together with LSH |
| Big Bird: Transformers for Longer Sequences[59] (0) | - | ETC with random connections |
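Several entries above (Efficient Attention[7], Transformers are RNNs[52], Real-time Semantic Segmentation[56]) share the same core trick: reassociating the attention product so the d×d "context" matrix is computed before the query side, which avoids ever materializing the n×n attention matrix and drops the cost from O(n²·d) to O(n·d²). Below is a minimal NumPy sketch of this factorization in the Softmax(Q)\*(Softmax(K^T)\*V) form of Efficient Attention[7]; it is an illustration, not the authors' implementation, and the shapes and helper names are my own assumptions. Note that the result is an alternative attention, not numerically identical to standard softmax attention.

```python
import numpy as np

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def standard_attention(Q, K, V):
    # O(n^2 * d): materializes the full n x n attention matrix.
    scores = softmax(Q @ K.T / np.sqrt(Q.shape[-1]), axis=-1)  # (n, n)
    return scores @ V

def efficient_attention(Q, K, V):
    # O(n * d^2): softmax(Q) @ (softmax(K)^T @ V).
    # Keys are normalized over the n positions (axis=0), queries over
    # the feature dimension (axis=-1); the small (d, d) context matrix
    # is formed first, so no n x n matrix ever exists.
    context = softmax(K, axis=0).T @ V       # (d, d)
    return softmax(Q, axis=-1) @ context     # (n, d)

n, d = 512, 64
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, n, d))
out = efficient_attention(Q, K, V)
print(out.shape)  # (512, 64)
```

For long sequences (large n, modest d) the factorized version is the one that scales, which is exactly the regime the papers in the table target.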
References
[1] Generating Wikipedia by Summarizing Long Sequences: https://arxiv.org/abs/1801.10198v1
[2]memory-compressed-attention: https://github.com/lucidrains/memory-compressed-attention
[3] CBAM: Convolutional Block Attention Module: https://arxiv.org/abs/1807.06521v2
[4] attention-module: https://github.com/Jongchan/attention-module
[5] CCNet: Criss-Cross Attention for Semantic Segmentation: https://arxiv.org/abs/1811.11721v2
[6] CCNet: https://github.com/speedinghzl/CCNet
[7] Efficient Attention: Attention with Linear Complexities: https://arxiv.org/abs/1812.01243v8
[8] Efficient-attention: https://github.com/cmsflash/efficient-attention
[9] Star-Transformer: https://arxiv.org/abs/1902.09113v2
[10] fastNLP: https://github.com/fastnlp/fastNLP/blob/master/fastNLP/modules/encoder/star_transformer.py
[11] Generating Long Sequences with Sparse Transformers: https://arxiv.org/abs/1904.10509v1
[12] torch-blocksparse: https://github.com/ptillet/torch-blocksparse
[13] GCNet: Non-local Networks Meet Squeeze-Excitation Networks and Beyond: https://arxiv.org/abs/1904.11492v1
[14] GCNet: https://github.com/xvjiarui/GCNet
[15] SCRAM: Spatially Coherent Randomized Attention Maps: https://arxiv.org/abs/1905.10308v1
[16] Interlaced Sparse Self-Attention for Semantic Segmentation: https://arxiv.org/abs/1907.12273v2
[17] Permutohedral Attention Module for Efficient Non-Local Neural Networks: https://arxiv.org/abs/1907.00641v2
[18] Permutohedral_attention_module: https://github.com/SamuelJoutard/Permutohedral_attention_module
[19] Large Memory Layers with Product Keys: https://arxiv.org/abs/1907.05242v2
[20] XLM: https://github.com/facebookresearch/XLM
[21] Expectation-Maximization Attention Networks for Semantic Segmentation: https://arxiv.org/abs/1907.13426v2
[22] EMANet: https://github.com/XiaLiPKU/EMANet
[23] Compressive Transformers for Long-Range Sequence Modelling: https://arxiv.org/abs/1911.05507v1
[24] compressive-transformer-pytorch: https://github.com/lucidrains/compressive-transformer-pytorch
[25] BP-Transformer: Modelling Long-Range Context via Binary Partitioning: https://arxiv.org/abs/1911.04070v1
[26] BPT: https://github.com/yzh119/BPT
[27] Axial Attention in Multidimensional Transformers: https://arxiv.org/abs/1912.12180v1
[28] axial-attention: https://github.com/lucidrains/axial-attention
[29] Reformer: The Efficient Transformer: https://arxiv.org/abs/2001.04451v2
[30] trax: https://github.com/google/trax/tree/master/trax/models/reformer
[31] Transformer on a Diet: https://arxiv.org/abs/2002.06170v1
[32] transformer-on-diet: https://github.com/cgraywang/transformer-on-diet
[33] Sparse Sinkhorn Attention: https://arxiv.org/abs/2002.11296v1
[34] sinkhorn-transformer: https://github.com/lucidrains/sinkhorn-transformer
[35] SAC: Accelerating and Structuring Self-Attention via Sparse Adaptive Connection: https://arxiv.org/abs/2003.09833v2
[36] Efficient Content-Based Sparse Attention with Routing Transformers: https://arxiv.org/abs/2003.05997v1
[37] routing-transformer: https://github.com/lucidrains/routing-transformer
[38] Longformer: The Long-Document Transformer: https://arxiv.org/abs/2004.05150v1
[39] longformer: https://github.com/allenai/longformer
[40] Neural Architecture Search for Lightweight Non-Local Networks: https://arxiv.org/abs/2004.01961v1
[41] AutoNL: https://github.com/LiYingwei/AutoNL
[42] ETC: Encoding Long and Structured Data in Transformers: https://arxiv.org/abs/2004.08483v2
[43] Multi-scale Transformer Language Models: https://arxiv.org/abs/2005.00581v1
[44] Synthesizer: Rethinking Self-Attention in Transformer Models: https://arxiv.org/abs/2005.00743v1
[45] Jukebox: A Generative Model for Music: https://arxiv.org/abs/2005.00341v1
[46] jukebox: https://github.com/openai/jukebox
[47] GMAT: Global Memory Augmentation for Transformers: https://arxiv.org/abs/2006.03274v1
[48] gmat: https://github.com/ag1988/gmat
[49] Masked Language Modeling for Proteins via Linearly Scalable Long-Context Transformers: https://arxiv.org/abs/2006.03555v1
[50] google-research: https://github.com/google-research/google-research/tree/master/performer/fast_self_attention
[51] Hand-crafted Attention is All You Need? A Study of Attention on Self-supervised Audio Transformer: https://arxiv.org/abs/2006.05174v1
[52] Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention: https://arxiv.org/abs/2006.16236v2
[53] fast-transformers: https://github.com/idiap/fast-transformers
[54] Linformer: Self-Attention with Linear Complexity: https://arxiv.org/abs/2006.04768v3
[55] linformer-pytorch: https://github.com/tatp22/linformer-pytorch
[56] Real-time Semantic Segmentation with Fast Attention: https://arxiv.org/abs/2007.03815v2
[57] Fast Transformers with Clustered Attention: https://arxiv.org/abs/2007.04825v1
[58] fast-transformers: https://github.com/idiap/fast-transformers
[59] Big Bird: Transformers for Longer Sequences: https://arxiv.org/abs/2007.14062v1
[60] A Survey of Long-Term Context in Transformers: https://www.pragmatic.ml/a-survey-of-methods-for-incorporating-long-term-context/