transformer--ViT

Code

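To handle 2D images, we reshape an image of size H×W×C into a sequence of flattened 2D patches of size N×(P^2×C). Here (P,P) is the patch size and N=HW/P^2 is the number of patches, which also determines the length of the input sequence. The Transformer uses a constant latent vector size D in all of its layers, so we flatten the patches and map them to D dimensions with a trainable linear projection. The output of this projection is called the patch embedding. The corresponding code simply flattens the spatial dimensions by brute force: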

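# Transformer.
# x has shape (n, h, w, c); n is the batch dimension here, not the patch count N.
n, h, w, c = x.shape
# Merge the two spatial axes into a single sequence axis of length h * w.
x = jnp.reshape(x, [n, h * w, c])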
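The snippet above only merges the two spatial axes. As a minimal sketch of the full patch arithmetic described in the text, the following JAX code cuts a batch of images into non-overlapping P×P patches, flattens each patch to length P^2×C, and projects it to D dimensions. The function name `patchify`, the random projection weights, and all sizes are illustrative assumptions and are not taken from the original code.

```python
import jax
import jax.numpy as jnp


def patchify(images, patch_size):
    """Split (B, H, W, C) images into flattened patches of shape (B, N, P*P*C)."""
    b, h, w, c = images.shape
    p = patch_size
    assert h % p == 0 and w % p == 0, "H and W must be divisible by P"
    # (B, H/P, P, W/P, P, C): separate the patch grid from the pixels in a patch.
    x = images.reshape(b, h // p, p, w // p, p, c)
    # (B, H/P, W/P, P, P, C): bring the two grid axes next to each other.
    x = x.transpose(0, 1, 3, 2, 4, 5)
    # (B, N, P*P*C) with N = H*W / P^2.
    return x.reshape(b, (h // p) * (w // p), p * p * c)


# Illustrative sizes (assumptions): 32x32 RGB images, 8x8 patches, D = 64.
B, H, W, C, P, D = 2, 32, 32, 3, 8, 64
key_img, key_w = jax.random.split(jax.random.PRNGKey(0))
images = jax.random.normal(key_img, (B, H, W, C))

patches = patchify(images, P)            # shape (2, 16, 192)
# In a model this projection is trainable; random values stand in for it here.
proj = 0.02 * jax.random.normal(key_w, (P * P * C, D))
patch_embeddings = patches @ proj        # shape (2, 16, 64)
print(patches.shape, patch_embeddings.shape)
```

With these sizes, N = 32·32/8^2 = 16 and each flattened patch has 8·8·3 = 192 values, so the sequence length seen by the Transformer depends only on the image and patch sizes.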

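Similar to BERT's [class] token, we prepend a learnable embedding to the sequence of embedded patches (z_0^0=x_class). Its state at the output of the Transformer encoder (z_L^0) serves as the image representation y. During both pre-training and fine-tuning, the classification head is attached to z_L^0.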

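# If we want to add a class token, add it here.
if self.classifier == 'token':
    # Learnable class token, zero-initialized, with shape (1, 1, c).
    cls = self.param('cls', nn.initializers.zeros, (1, 1, c))
    # Repeat the token once per example in the batch: (n, 1, c).
    cls = jnp.tile(cls, [n, 1, 1])
    # Prepend it, so the sequence length grows by one.
    x = jnp.concatenate([cls, x], axis=1)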
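To make the shape bookkeeping concrete, here is a small standalone sketch of the same prepend step outside of a Flax module. The sizes and the all-ones stand-in for the patch embeddings are assumptions chosen only for illustration.

```python
import jax.numpy as jnp

# Illustrative sizes (assumptions): batch of 2, N = 16 patches, D = 64.
n, num_patches, d = 2, 16, 64
x = jnp.ones((n, num_patches, d))        # stand-in for the patch embeddings

cls = jnp.zeros((1, 1, d))               # a single token, zero-initialized
cls = jnp.tile(cls, [n, 1, 1])           # repeated for every image: (n, 1, d)
x = jnp.concatenate([cls, x], axis=1)    # sequence length grows from N to N + 1

print(x.shape)
# Position 0 now holds the class token. After the encoder, this is the
# slot that would be read out as the image representation.
image_repr_slot = x[:, 0]
```

The token is stored once with a leading axis of size 1 and tiled per batch, so every image starts from the same learnable vector.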

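The classification head is implemented by an MLP with one hidden layer at pre-training time and by a single linear layer at fine-tuning time. The code below is the MlpBlock module, which its docstring describes as the Transformer MLP / feed-forward block: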

from typing import Any, Callable, Optional, Tuple

import flax.linen as nn
import jax.numpy as jnp

# Type aliases used in the annotations below.
Array = Any
PRNGKey = Any
Shape = Tuple[int]
Dtype = Any


class MlpBlock(nn.Module):
    """Transformer MLP / feed-forward block."""

    mlp_dim: int
    dtype: Dtype = jnp.float32
    out_dim: Optional[int] = None
    dropout_rate: float = 0.1
    kernel_init: Callable[[PRNGKey, Shape, Dtype],
                          Array] = nn.initializers.xavier_uniform()
    bias_init: Callable[[PRNGKey, Shape, Dtype],
                        Array] = nn.initializers.normal(stddev=1e-6)

    @nn.compact
    def __call__(self, inputs, *, deterministic):
        """Applies Transformer MlpBlock module."""
        # Default to the input width so the block preserves the feature size.
        actual_out_dim = inputs.shape[-1] if self.out_dim is None else self.out_dim
        # Expand to mlp_dim, then apply GELU and dropout.
        x = nn.Dense(
            features=self.mlp_dim,
            dtype=self.dtype,
            kernel_init=self.kernel_init,
            bias_init=self.bias_init)(  # pytype: disable=wrong-arg-types
                inputs)
        x = nn.gelu(x)
        x = nn.Dropout(rate=self.dropout_rate)(x, deterministic=deterministic)
        # Project back to actual_out_dim and apply dropout again.
        output = nn.Dense(
            features=actual_out_dim,
            dtype=self.dtype,
            kernel_init=self.kernel_init,
            bias_init=self.bias_init)(  # pytype: disable=wrong-arg-types
                x)
        output = nn.Dropout(rate=self.dropout_rate)(
            output, deterministic=deterministic)
        return output
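The two classification heads described in the text (one hidden layer for pre-training, a single linear layer for fine-tuning) can be sketched in plain JAX as follows. This is a hedged illustration, not the original implementation: the helper names `init_dense`, `dense`, `pretrain_head` and `finetune_head`, the tanh activation in the hidden layer, the initialization, and all sizes are assumptions.

```python
import jax
import jax.numpy as jnp


def init_dense(key, in_dim, out_dim):
    """Return weights and bias for one dense layer (scaled-normal init)."""
    w = jax.random.normal(key, (in_dim, out_dim)) / jnp.sqrt(in_dim)
    return {"w": w, "b": jnp.zeros((out_dim,))}


def dense(params, x):
    """Affine map x @ w + b."""
    return x @ params["w"] + params["b"]


def pretrain_head(params, z):
    """MLP head with one hidden layer (the activation is an assumption)."""
    hidden = jnp.tanh(dense(params["hidden"], z))
    return dense(params["out"], hidden)


def finetune_head(params, z):
    """Single linear layer mapping the representation to class logits."""
    return dense(params["out"], z)


# Illustrative sizes (assumptions): D = 64, hidden width 128, 10 classes.
d, hidden_dim, num_classes = 64, 128, 10
k1, k2, k3, kz = jax.random.split(jax.random.PRNGKey(0), 4)
pre_params = {"hidden": init_dense(k1, d, hidden_dim),
              "out": init_dense(k2, hidden_dim, num_classes)}
fine_params = {"out": init_dense(k3, d, num_classes)}

z_cls = jax.random.normal(kz, (2, d))    # stand-in for z_L^0 of two images
pre_logits = pretrain_head(pre_params, z_cls)
fine_logits = finetune_head(fine_params, z_cls)
print(pre_logits.shape, fine_logits.shape)
```

Both heads consume only the class-token state, so swapping one for the other leaves the encoder untouched.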

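Position embeddings are added to the patch embeddings to retain positional information. We use standard learnable 1D position embeddings, since we did not observe significant performance gains from using more advanced 2D-aware position embeddings. The resulting sequence of embedding vectors serves as the input to the encoder.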

class AddPositionEmbs(nn.Module):
    """Adds (optionally learned) positional embeddings to the inputs.

    Attributes:
      posemb_init: positional embedding initializer.
    """

    posemb_init: Callable[[PRNGKey, Shape, Dtype], Array]

    @nn.compact
    def __call__(self, inputs):
        """Applies AddPositionEmbs module.

        By default this layer uses a fixed sinusoidal embedding table. If a
        learned position embedding is desired, pass an initializer to
        posemb_init.

        Args:
          inputs: Inputs to the layer.

        Returns:
          Output tensor with shape `(bs, timesteps, in_dim)`.
        """
        # inputs.shape is (batch_size, seq_len, emb_dim).
        assert inputs.ndim == 3, ('Number of dimensions should be 3,'
                                  ' but it is: %d' % inputs.ndim)
        pos_emb_shape = (1, inputs.shape[1], inputs.shape[2])
        pe = self.param('pos_embedding', self.posemb_init, pos_emb_shape)
        return inputs + pe
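The addition in the last line relies on broadcasting: the embedding table has a leading axis of size 1, so the same table is added to every example in the batch. A minimal standalone sketch, where the sizes and the 0.02 scale of the random table are assumptions for illustration:

```python
import jax
import jax.numpy as jnp

# Illustrative sizes (assumptions): batch 2, sequence length 17, D = 64.
batch, seq_len, emb_dim = 2, 17, 64
key_x, key_pe = jax.random.split(jax.random.PRNGKey(0))
inputs = jax.random.normal(key_x, (batch, seq_len, emb_dim))

# One vector per position, shared across the batch.
pos_embedding = 0.02 * jax.random.normal(key_pe, (1, seq_len, emb_dim))

# Broadcasting expands the leading axis of size 1 to the batch size.
outputs = inputs + pos_embedding
print(outputs.shape)

# Every image receives the same offset at a given position.
offset_0 = outputs[0] - inputs[0]
offset_1 = outputs[1] - inputs[1]
```

Because the table is indexed only by position, it is a 1D position embedding in the sense of the text: the model is not told which row or column of the image a patch came from.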
