Visual Transformers: Token-based Image Representation and Processing for Computer Vision

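Author: 今天不吹牛
Link: https://www.zhihu.com/question/400733777/answer/1466879756
Source: Zhihu
Copyright belongs to the author. For commercial reproduction, contact the author for permission; for non-commercial reproduction, cite the source.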

First, thanks to @吃完就饿 for recommending two other papers in their answer:

Graph-Based Global Reasoning Networks (GloRe)

LatentGNN: Learning Efficient Non-local Relations for Visual Recognition

Visual Transformer has a lot in common with these two papers, and reading the three together was very rewarding for me.

The order in which the three appeared on arXiv is GloRe -> LatentGNN -> Visual Transformer. All three apply GCNs, or more broadly relation learning, to vision. GloRe and LatentGNN use relation learning as a means of feature augmentation, similar to Non-local; Visual Transformer instead aims to "replace convolution", similar to Local Relation, SAN, and Stand-Alone Self-Attention.

First, GloRe and LatentGNN. Both position themselves as feature augmentation for convolutional features, in order to capture long-range dependencies. In terms of method, GloRe has three steps:

（1）From Coordinate Space to Interaction Space

（2）Reasoning with Graph Convolution

（3）From Interaction Space to Coordinate Space

LatentGNN also has three steps:

（1）Visible-to-latent propagation

（2）Latent-to-latent propagation

（3）Latent-to-visible propagation

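Steps (1) and (3) are the same in GloRe and LatentGNN: both use a 1×1 convolution layer to produce a projection (affinity) matrix and apply that matrix to the input feature map, with the height and width dimensions flattened into one. The main difference lies in step (2): although both GloRe and LatentGNN say they use a GCN, they use it differently. LatentGNN takes the dot products of the latent node features as the adjacency matrix, which can simply be understood as performing one non-local operation in the latent space. GloRe's adjacency matrix, by contrast, consists of learnable parameters, on which it then performs the GCN. GloRe's GCN implementation is interesting: it uses two 1×1 convolutions, one across channels and one across nodes.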
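The shared structure and the two variants of step (2) can be sketched in a few lines of NumPy. This is a minimal illustration of the description above, not either paper's implementation: the sizes, the weight names (`W_assign`, `A_learn`, `W_chan`), the softmax normalisation of the dot-product adjacency, and the residual addition at the end are all assumptions made for the example. A 1×1 convolution over a flattened feature map is written as a plain matrix product.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Feature map with the spatial dims flattened: N = H*W positions, C channels.
N, C, K = 16, 8, 4                  # K = number of graph / latent nodes (illustrative)
X = rng.standard_normal((N, C))

# A 1x1 conv acts on each position independently, so it is a (C, K) matrix here.
W_assign = rng.standard_normal((C, K))
B = X @ W_assign                    # (N, K) position-to-node matrix

def to_nodes(X, B):
    # Step (1): positions -> nodes, result has shape (K, C).
    return B.T @ X

def to_positions(Z, B):
    # Step (3): nodes -> positions, result has shape (N, C).
    return B @ Z

def reason_learnable_adjacency(Z, A, W):
    # GloRe-style step (2): the adjacency A (K, K) is a free parameter.
    # A mixes across nodes, W (C, C) mixes across channels.
    return A @ Z @ W

def reason_dot_product_adjacency(Z):
    # LatentGNN-style step (2): adjacency from dot products of node features,
    # i.e. a non-local operation over the K latent nodes (softmax is assumed).
    A = softmax(Z @ Z.T, axis=1)
    return A @ Z

Z = to_nodes(X, B)
A_learn = rng.standard_normal((K, K))   # stands in for learned parameters
W_chan = rng.standard_normal((C, C))

# Feature augmentation: add the globally reasoned features back onto X (assumed residual).
Y_glore = X + to_positions(reason_learnable_adjacency(Z, A_learn, W_chan), B)
Y_latent = X + to_positions(reason_dot_product_adjacency(Z), B)
```

In both variants the reasoning happens on K nodes rather than N positions, so the pairwise cost is K×K instead of N×N.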

GloRe and LatentGNN were not compared head-to-head on the same datasets, but in terms of storytelling, I think LatentGNN comes out slightly ahead.

Visual Transformer positions itself as replacing convolution layers with transformer layers, and it also has three steps:

（1）Tokenization

（2）Transformer

（3）Projector
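A minimal NumPy sketch of how these three steps could fit together. It is an illustration rather than the paper's implementation: the sizes, the weight names, the additive learnable position encoding `P`, the single-head attention, and the residual connections are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

N, C, L = 16, 8, 4                   # positions (H*W), channels, tokens (illustrative)
X = rng.standard_normal((N, C))      # flattened feature map

# (1) Tokenization: a 1x1 conv (a (C, L) matrix here) scores every position
# for every token; softmax over positions turns the scores into spatial attention.
W_tok = rng.standard_normal((C, L))
A = softmax(X @ W_tok, axis=0)       # (N, L), each column sums to 1
P = rng.standard_normal((L, C))      # position encoding as plain learnable params (assumed additive)
T = A.T @ X + P                      # (L, C) visual tokens

# (2) Transformer: single-head self-attention among the L tokens.
W_q, W_k, W_v = (rng.standard_normal((C, C)) for _ in range(3))
attn = softmax((T @ W_q) @ (T @ W_k).T / np.sqrt(C), axis=1)   # (L, L)
T_out = T + attn @ (T @ W_v)

# (3) Projector: every position attends to the tokens and reads them back.
W_pq = rng.standard_normal((C, C))
W_pk = rng.standard_normal((C, C))
read = softmax((X @ W_pq) @ (T_out @ W_pk).T, axis=1)          # (N, L)
X_out = X + read @ T_out             # same shape as the input feature map
```

Because `X_out` has the same shape as `X`, a block like this could in principle stand in where a convolution layer would otherwise go.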

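Steps (1) and (3) here are basically the same as in GloRe and LatentGNN: a 1×1 Conv learns a projection (affinity) matrix, which is then applied to the feature map. The difference is the added position encoding, which is really just a set of learnable parameters. Personally, I think the reason self-attention could only serve as a feature-enhancement method a few years ago, while the past year or two have seen so many self-attention layers replacing convolution layers, is this ACL paper on a Transformer with relative position encoding: a suitable relative position encoding provides the spatial translation equivariance that convolution layers have by nature and self-attention layers lack.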
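The equivariance point can be checked numerically on a toy case. The sketch below is an assumption-laden illustration, not the formulation of any of the papers mentioned: it uses a 1D sequence, circular (wrap-around) shifts, a relative bias indexed by `(j - i) mod n`, and an absolute encoding added to the input. Under these assumptions, attention with the relative bias commutes with shifting the input, while attention with the absolute encoding does not.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(X, W_q, W_k, W_v, bias):
    # Single-head self-attention with an additive bias on the logits.
    logits = (X @ W_q) @ (X @ W_k).T + bias
    return softmax(logits, axis=1) @ (X @ W_v)

n, c = 6, 4                                  # sequence length, channels (illustrative)
X = rng.standard_normal((n, c))
W_q, W_k, W_v = (rng.standard_normal((c, c)) for _ in range(3))

# Relative encoding: the bias for the pair (i, j) depends only on (j - i) mod n.
rel = rng.standard_normal(n)
idx = np.arange(n)
rel_bias = rel[(idx[None, :] - idx[:, None]) % n]   # (n, n)

# Absolute encoding: one learnable vector per position, added to the input.
abs_pos = rng.standard_normal((n, c))

def f_rel(X):
    return attention(X, W_q, W_k, W_v, rel_bias)

def f_abs(X):
    return attention(X + abs_pos, W_q, W_k, W_v, 0.0)

shift = 2
X_shift = np.roll(X, shift, axis=0)          # circularly shifted input

# Equivariance means: shifting the input shifts the output by the same amount.
rel_equivariant = np.allclose(f_rel(X_shift), np.roll(f_rel(X), shift, axis=0))
abs_equivariant = np.allclose(f_abs(X_shift), np.roll(f_abs(X), shift, axis=0))
```

Here `rel_equivariant` comes out `True` and `abs_equivariant` comes out `False`: the relative bias only sees offsets between positions, so it is unchanged by a shift, whereas the absolute encoding ties each output to a fixed location.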

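Overall, the three papers share a lot in their methods, especially GloRe and LatentGNN. But anyone who has read the original papers will find that they differ considerably in how they frame the story, and each offers distinctive insights, such as GloRe's learnable adjacency matrix and LatentGNN's low-rank representation view. So I don't think it is fair to harshly dismiss any of them as a filler paper.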
