Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, Baining Guo
Microsoft Research Asia


This paper presents a new vision Transformer, called Swin Transformer, that capably serves as a general-purpose backbone for computer vision. Challenges in adapting Transformer from language to vision arise from differences between the two domains, such as large variations in the scale of visual entities and the high resolution of pixels in images compared to words in text. To address these differences, we propose a hierarchical Transformer whose representation is computed with Shifted windows. The shifted windowing scheme brings greater efficiency by limiting self-attention computation to non-overlapping local windows while also allowing for cross-window connection. This hierarchical architecture has the flexibility to model at various scales and has linear computational complexity with respect to image size. These qualities of Swin Transformer make it compatible with a broad range of vision tasks, including image classification (87.3 top-1 accuracy on ImageNet-1K) and dense prediction tasks such as object detection (58.7 box AP and 51.1 mask AP on COCO test-dev) and semantic segmentation (53.5 mIoU on ADE20K val). Its performance surpasses the previous state-of-the-art by a large margin of +2.7 box AP and +2.6 mask AP on COCO, and +3.2 mIoU on ADE20K, demonstrating the potential of Transformer-based models as vision backbones. The hierarchical design and the shifted window approach also prove beneficial for all-MLP architectures. The code and models are publicly available at https://github.com/microsoft/Swin-Transformer.


Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Cite as: arXiv:2103.14030 [cs.CV]
(or arXiv:2103.14030v2 [cs.CV] for this version)
https://doi.org/10.48550/arXiv.2103.14030
Submission history
From: Han Hu
[v1] Thu, 25 Mar 2021 17:59:31 UTC (1,064 KB)
[v2] Tue, 17 Aug 2021 16:41:34 UTC (1,065 KB)

ICCV 2021 Best Paper
Paper: https://doi.org/10.48550/arXiv.2103.14030
Code: https://github.com/microsoft/Swin-Transformer


0. Introduction

0.1 Comparing Swin Transformer with Vision Transformer

The two architectures differ in the following ways:

  1. The feature maps built by Swin Transformer are hierarchical, much like the convolutional neural networks we covered earlier: as the feature-extraction layers go deeper, the feature maps keep shrinking (4x, 8x, 16x downsampling). It is precisely this CNN-like downsampling behavior that lets Swin Transformer build hierarchical feature maps, and the authors note in the paper that, thanks to these hierarchical feature maps, Swin Transformer has a clear advantage over ViT on object detection and segmentation tasks (see the first sketch after this list).

In the ViT model, the feature map is downsampled 16x right away, and that ratio is kept unchanged throughout the rest of the network (only 16x downsampling, rather than the multiple downsampling scales of Swin Transformer), so ViT cannot build hierarchical feature maps.

  2. Swin Transformer splits its feature maps into windows, and the windows do not overlap; in ViT, by contrast, the feature map is treated as one whole and is never partitioned. These windows are the basis of the Windows Multi-head Self-Attention (W-MSA) mechanism discussed later. With this structure, Swin Transformer computes multi-head self-attention inside each window, and no information is passed between windows. The benefit is a large drop in computation, especially in the shallow layers where the downsampling ratio is low, compared with ViT running multi-head self-attention over the entire feature map (see the FLOP comparison after this list).
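Below is a minimal PyTorch sketch of how the hierarchy in point 1 arises (shapes, the channels-last layout, and module names here are illustrative assumptions, not the official implementation): a 4x patch embedding is followed by patch-merging steps that each halve the spatial resolution and double the channels, producing the 4x/8x/16x pyramid, whereas ViT would stay at a single 16x grid throughout.

```python
import torch
import torch.nn as nn

class PatchMerging(nn.Module):
    """Concatenate each 2x2 neighborhood of tokens (4C channels) and
    project down to 2C, halving resolution like a CNN downsampling stage."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(4 * dim)
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x):             # x: (B, H, W, C), channels-last
        x0 = x[:, 0::2, 0::2, :]      # top-left token of each 2x2 block
        x1 = x[:, 1::2, 0::2, :]      # bottom-left
        x2 = x[:, 0::2, 1::2, :]      # top-right
        x3 = x[:, 1::2, 1::2, :]      # bottom-right
        x = torch.cat([x0, x1, x2, x3], dim=-1)   # (B, H/2, W/2, 4C)
        return self.reduction(self.norm(x))       # (B, H/2, W/2, 2C)

img = torch.randn(1, 224, 224, 3)              # channels-last for simplicity
patches = img.unfold(1, 4, 4).unfold(2, 4, 4)  # (1, 56, 56, 3, 4, 4)
patches = patches.reshape(1, 56, 56, 48)       # flatten each 4x4x3 patch
feat = nn.Linear(48, 96)(patches)              # 4x downsampled: (1, 56, 56, 96)
for dim in (96, 192):                          # two merges: 8x, then 16x
    feat = PatchMerging(dim)(feat)
print(feat.shape)                              # torch.Size([1, 14, 14, 384])
```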
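To make "reduces computation in the shallow layers" from point 2 concrete, here is a back-of-the-envelope FLOP comparison based on the attention-complexity formulas given in the paper, Ω(MSA) = 4hwC^2 + 2(hw)^2·C versus Ω(W-MSA) = 4hwC^2 + 2M^2·hwC; the 56x56 token grid with C=96 and window size M=7 corresponds to the first (4x) stage of Swin-T on a 224x224 input.

```python
def msa_flops(h, w, C):
    # global multi-head self-attention: quadratic in the number of tokens h*w
    return 4 * h * w * C**2 + 2 * (h * w) ** 2 * C

def wmsa_flops(h, w, C, M=7):
    # window attention: quadratic only in the fixed window size M
    return 4 * h * w * C**2 + 2 * M**2 * h * w * C

print(f"global MSA: {msa_flops(56, 56, 96) / 1e9:.2f} GFLOPs")   # ~2.00
print(f"W-MSA:      {wmsa_flops(56, 56, 96) / 1e9:.2f} GFLOPs")  # ~0.15
```

Because the window size M is fixed, the W-MSA cost grows linearly with the number of tokens, which is exactly the "linear computational complexity with respect to image size" claimed in the abstract; in this stage-1 setting, windowing cuts the attention cost by roughly 14x.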

0.2 Accuracy Comparison of Swin Transformer with Other Networks

0.2.1 Accuracy Comparison on ImageNet-1K

These figures show how the models perform on ImageNet-1K after first being pretrained on the ImageNet-1K dataset.

From this comparison we can see: