Abstract

Two contributions: We propose a novel method for unsupervised image-to-image translation, which incorporates a new attention module and a new learnable normalization function in an end-to-end manner.

Difference from previous work: Unlike previous attention-based methods, which cannot handle geometric changes between domains,
our model can translate both images requiring holistic changes and images requiring large shape changes. (Presumably this refers to translations such as orange <==> apple.)

1. Role of the new attention module: The attention module guides our model to focus on more important regions distinguishing between source and target domains based on the attention map obtained by the auxiliary classifier.

2. Role of the new learnable normalization function: Moreover, our new AdaLIN (Adaptive Layer-Instance Normalization) function helps our attention-guided model to flexibly control the amount of change in shape and texture by learned parameters depending on datasets.

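Comparing IN and LN, we can see the following:
① What they share: both normalize over a single instance and are independent of the batch.
② Where they differ: IN restricts its statistics to a single channel, whereas LN computes them across all channels. IN therefore assumes that the channels of a feature map are uncorrelated, so normalizing each channel on its own may disturb the original semantic content. LN does weigh all channels together, but because normalization is essentially a form of "smoothing", it tends to wipe out some semantic information. The authors argue that combining the two lets each compensate for the other's weakness while keeping the strengths of both.

The simplest form of the "adaptive" idea is to find a ratio ρ that balances IN against LN within a layer, i.e.:

$$\rho \cdot \mathrm{IN}(x)+(1-\rho) \cdot \mathrm{LN}(x)$$

Here ρ is simply a weight obtained through training (forward pass, backpropagation, gradient update). This is exactly what the paper does!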
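To make the IN/LN comparison concrete, here is a minimal NumPy sketch of the two normalizations and their weighted blend. The function names, the `(C, H, W)` layout, and the toy input are assumptions made for illustration; this is not the authors' implementation.

```python
import numpy as np

def instance_norm(a, eps=1e-5):
    """Normalize each channel of a (C, H, W) array with its own statistics."""
    mu = a.mean(axis=(1, 2), keepdims=True)
    var = a.var(axis=(1, 2), keepdims=True)
    return (a - mu) / np.sqrt(var + eps)

def layer_norm(a, eps=1e-5):
    """Normalize a (C, H, W) array with one mean/variance over all channels."""
    return (a - a.mean()) / np.sqrt(a.var() + eps)

def blend(a, rho):
    """Weighted mix: rho = 1 gives pure IN, rho = 0 gives pure LN."""
    return rho * instance_norm(a) + (1.0 - rho) * layer_norm(a)

# Toy feature map whose three channels have very different scales.
rng = np.random.default_rng(0)
scales = np.array([1.0, 5.0, 10.0]).reshape(3, 1, 1)
a = rng.normal(size=(3, 8, 8)) * scales

per_channel_std_in = blend(a, 1.0).std(axis=(1, 2))  # IN equalizes the channels
per_channel_std_ln = blend(a, 0.0).std(axis=(1, 2))  # LN keeps their relative scale
```

With IN every channel ends up with roughly unit standard deviation, while LN preserves the scale differences between channels. That is the trade-off the ratio is meant to balance.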

INTRODUCTION

Previous methods are successful for style transfer tasks mapping local texture (e.g., photo2vangogh and photo2portrait) but are typically unsuccessful for image translation tasks with larger shape change (e.g., selfie2anime and cat2dog) in wild images.
Therefore, the pre-processing steps such as image cropping and alignment are often required to avoid these problems by limiting the complexity of the data distributions.

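For example, translations of local texture (photo to Van Gogh) work well, but translations involving large shape changes (cat to dog, selfie to anime) do not.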

Attention maps embedded in the generator and the discriminator facilitate the translation: Our model guides the translation to focus on more important regions and ignore minor regions by distinguishing between source and target domains based on the attention map obtained by the auxiliary classifier. These attention maps are embedded into the generator and discriminator to focus on semantically important areas, thus facilitating the shape transformation.

1. Generator: While the attention map in the generator induces the focus on areas that specifically distinguish between the two domains,
2. Discriminator: the attention map in the discriminator helps fine-tuning by focusing on the difference between real image and fake image in target domain.

Adaptive Layer-Instance Normalization (AdaLIN): In addition to the attentional mechanism, we have found that the choice of the normalization function has a significant impact on the quality of the transformed results for various datasets with different amounts of change in shape and texture. Inspired by Batch-Instance Normalization (BIN), we propose Adaptive Layer-Instance Normalization (AdaLIN), whose parameters are learned from datasets during training time by adaptively selecting a proper ratio between Instance Normalization (IN) and Layer Normalization (LN).

Role of AdaLIN (freer control over changes in texture and shape, so image translation works without changing the network architecture): The AdaLIN function helps our attention-guided model to flexibly control the amount of change in shape and texture. As a result, our model, without modifying the model architecture or the hyper-parameters, can perform image translation tasks not only requiring holistic changes but also requiring large shape changes.

The main contribution of the proposed work can be summarized as follows:

1. We propose a novel method for unsupervised image-to-image translation with a new attention module and a new normalization function, AdaLIN.

2. Our attention module helps the model to know where to transform intensively by distinguishing between source and target domains based on the attention map obtained by the auxiliary classifier.

3. AdaLIN function helps our attention-guided model to flexibly control the amount of change in shape and texture without modifying the model architecture or the hyper-parameters.

Unsupervised generative attentional networks with AdaLIN

Our goal is to train a function Gs→t that maps images from a source domain Xs to a target domain Xt using only unpaired samples drawn from each domain.
Our framework consists of two generators, Gs→t [generator 1] and Gt→s, and two discriminators, Ds and Dt.

We integrate the attention module into both generator and discriminator:
① The attention module in the discriminator guides the generator to focus on regions that are critical to generate a realistic image.
② The attention module in the generator gives attention to the region distinguished from the other domain.

Here, we only explain Gs→t and Dt (See Fig 1) as the vice versa should be straight-forward.

2.1 MODEL

2.1.1 GENERATOR

Let $x \in\left\{X_{s}, X_{t}\right\}$ represent a sample from the source and the target domain.
Our translation model $G_{s \rightarrow t}$ consists of:
① an encoder $E_{s}$,
② a decoder $G_{t}$,
③ and an auxiliary classifier $\eta_{s}$,
where $\eta_{s}(x)$ represents the probability that $x$ comes from $X_{s}$.
Let $E_{s}^{k}(x)$ be the k-th activation map (feature map) of the encoder
and $E_{s}^{k_{i j}}(x)$ be the value at $(i,j)$.

Inspired by CAM, the auxiliary classifier is trained to learn the weight of the k-th feature map for the source domain, $w_{s}^{k}$, by using the global average pooling and global max pooling, i.e., $\eta_{s}(x)=\sigma\left(\Sigma_{k} w_{s}^{k} \Sigma_{i j} E_{s}^{k_{i j}}(x)\right)$.
By exploiting $w_{s}^{k}$, we can calculate a set of domain specific attention feature maps $a_{s}(x)=w_{s} * E_{s}(x)=\left\{w_{s}^{k} * E_{s}^{k}(x) \mid 1 \leq k \leq n\right\}$, where $n$ is the number of encoded feature maps.
Then, our translation model $G_{s \rightarrow t}$ becomes equal to $G_{t}\left(a_{s}(x)\right)$.
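The CAM-style attention described above can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions: only the global-average-pooling branch is shown (the max-pooling branch is omitted), the mean is used instead of the plain sum in the formula (they differ only by the constant factor 1/(H·W)), and the names `cam_attention`, `features`, and `w` are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cam_attention(features, w):
    """features: encoder maps E_s(x) with shape (n, H, W); w: weights w_s^k with shape (n,)."""
    pooled = features.mean(axis=(1, 2))      # global average pooling, one value per map
    prob = sigmoid(np.dot(w, pooled))        # eta_s(x): probability that x comes from X_s
    attended = w[:, None, None] * features   # a_s(x): each map scaled by its weight
    return prob, attended

# Toy example: two constant feature maps with opposite weights.
features = np.ones((2, 2, 2))
w = np.array([1.0, -1.0])
prob, attended = cam_attention(features, w)
heatmap = attended.sum(axis=0)               # channel-summed map, useful for visualization
```

The same weights serve two purposes: they feed the auxiliary classifier, and they rescale the feature maps that are passed on to the decoder.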

Inspired by recent works that use affine transformation parameters in normalization layers and combine normalization functions, we equip the residual blocks with AdaLIN, whose parameters γ and β are dynamically computed by a fully connected layer from the attention map.

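$$\begin{aligned}
\operatorname{AdaLIN}(a, \gamma, \beta) &=\gamma \cdot\left(\rho \cdot \hat{a}_{I}+(1-\rho) \cdot \hat{a}_{L}\right)+\beta, \\
\hat{a}_{I} &=\frac{a-\mu_{I}}{\sqrt{\sigma_{I}^{2}+\epsilon}}, \quad \hat{a}_{L}=\frac{a-\mu_{L}}{\sqrt{\sigma_{L}^{2}+\epsilon}}, \\
\rho & \leftarrow \operatorname{clip}_{[0,1]}(\rho-\tau \Delta \rho)
\end{aligned}$$

where $\mu_{I}$, $\mu_{L}$ and $\sigma_{I}$, $\sigma_{L}$ are the channel-wise and layer-wise means and standard deviations, respectively (subscript I: computed per channel; subscript L: computed over the whole layer),
$a$ is the input activation and $\epsilon$ is a small constant for numerical stability,
γ and β are parameters generated by the fully connected layer,
τ is the learning rate,
and ∆ρ indicates the parameter update vector (e.g., the gradient) determined by the optimizer.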

The values of ρ are constrained to the range of [0, 1] simply by imposing bounds at the parameter update step.
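The AdaLIN computation and the bounded update of ρ described above can be sketched as follows. This is a NumPy illustration under assumptions: a single `(C, H, W)` activation, per-channel γ and β passed in directly (in the paper they come from a fully connected layer), and a plain gradient step standing in for whatever the optimizer produces. The function names are hypothetical.

```python
import numpy as np

def adalin(a, gamma, beta, rho, eps=1e-5):
    """a: (C, H, W); gamma, beta: (C,); rho: scalar or (C, 1, 1), expected in [0, 1]."""
    # Channel-wise (instance) statistics.
    a_in = (a - a.mean(axis=(1, 2), keepdims=True)) / np.sqrt(
        a.var(axis=(1, 2), keepdims=True) + eps)
    # Layer-wise statistics over all channels.
    a_ln = (a - a.mean()) / np.sqrt(a.var() + eps)
    mixed = rho * a_in + (1.0 - rho) * a_ln
    return gamma[:, None, None] * mixed + beta[:, None, None]

def update_rho(rho, delta_rho, tau):
    """One update step for rho, clipped so the result stays inside [0, 1]."""
    return np.clip(rho - tau * delta_rho, 0.0, 1.0)

a = np.arange(24, dtype=float).reshape(2, 3, 4)
out = adalin(a, gamma=np.ones(2), beta=np.array([2.0, -1.0]), rho=1.0)

rho = np.array([0.95, 0.02, 0.5])
new_rho = update_rho(rho, delta_rho=np.array([-1.0, 1.0, 0.0]), tau=0.1)
```

Clipping after the step is the simplest way to impose the bounds: a value that would leave the interval is simply pinned to 0 or 1.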

How the generator adjusts ρ:
The generator adjusts the value so that ρ is close to 1 in tasks where IN is important
and close to 0 in tasks where LN is important.
The value of ρ is initialized to 1 in the residual blocks of the decoder
and to 0 in the up-sampling blocks of the decoder.

LN does not assume uncorrelation between channels, but sometimes it does not keep the content structure of the original domain well because it considers global statistics only for the feature maps.
To overcome this, our proposed normalization technique AdaLIN combines the advantages of AdaIN and LN by selectively keeping or changing the content information, which helps to solve a wide range of image-to-image translation problems.

2.1.2 DISCRIMINATOR

Let $x \in\left\{X_{t}, G_{s \rightarrow t}\left(X_{s}\right)\right\}$ represent a sample from the target domain and the translated source domain.
Similar to other translation models, the discriminator $D_{t}$, which is a multi-scale model, consists of an encoder $E_{D_{t}}$, a classifier $C_{D_{t}}$, and an auxiliary classifier $\eta_{D_{t}}$.
Unlike the other translation models, both $\eta_{D_{t}}(x)$ and $D_{t}(x)$ are trained to discriminate whether $x$ comes from $X_{t}$ or $G_{s \rightarrow t}(X_{s})$.
Given a sample $x$, $D_{t}(x)$ exploits the attention feature maps $a_{D_{t}}(x)=w_{D_{t}} * E_{D_{t}}(x)$ using $w_{D_{t}}$ on the encoded feature maps $E_{D_{t}}(x)$ that is trained by $\eta_{D_{t}}(x)$.
Then, our discriminator $D_{t}(x)$ becomes equal to $C_{D_{t}}\left(a_{D_{t}}(x)\right)$.

2.2 LOSS FUNCTION

The full objective of our model comprises four loss functions. Here, instead of using the vanilla GAN objective,
we used the Least Squares GAN objective for stable training.

1. Adversarial loss

An adversarial loss is employed to match the distribution of the translated images to the target image distribution.
Explanation:
This loss makes the generated images look more like real images, i.e., more realistic. It does not, however, guarantee that the generator produces the image we actually want.
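As a rough illustration of a least-squares adversarial objective, the sketch below uses the standard LSGAN formulation (real target 1, fake target 0). That convention is an assumption here: these notes do not reproduce the paper's exact expression, and the function names are hypothetical.

```python
import numpy as np

def lsgan_d_loss(d_real, d_fake):
    """Discriminator side: push scores on real images to 1 and on fakes to 0."""
    return np.mean((d_real - 1.0) ** 2) + np.mean(d_fake ** 2)

def lsgan_g_loss(d_fake):
    """Generator side: push the discriminator's scores on fakes toward 1."""
    return np.mean((d_fake - 1.0) ** 2)

# A discriminator that separates real and fake perfectly:
d_real = np.array([1.0, 1.0])
d_fake = np.array([0.0, 0.0])
d_loss = lsgan_d_loss(d_real, d_fake)   # nothing left for D to improve
g_loss = lsgan_g_loss(d_fake)           # G is penalized for being detected
```

Because the penalty is quadratic rather than a saturating log term, the generator still receives a useful gradient when its samples are far from the decision boundary, which is the usual argument for the training stability mentioned above.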

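2. Cycle loss

To alleviate the mode collapse problem, we apply a cycle consistency constraint to the generator. Given an image x ∈ $X_{s}$, after the sequential translations of x from $X_{s}$ to $X_{t}$ and from $X_{t}$ to $X_{s}$, the image should be successfully translated back to the original domain. (Two purposes: 1. alleviate mode collapse; 2. make sure an image can be translated to the target domain and back again.)
Explanation:
Here G, which normally takes x as input, is the generator that produces the fake y image, and F, which normally takes y as input, is the generator that produces the fake x image (in the notation of this paper, G is Gs→t and F is Gt→s). When we feed x into G we get a fake y; feeding that fake y into F gives a reconstructed, "even more fake" x. Ideally this reconstructed x should be almost identical to the original x. This forms a loop, hence the name cycle consistency loss.

A situation that often comes up in training: the fake y obtained from x drifts toward whatever fools the discriminator. The generator gradually discovers that, no matter what x comes in, the more its output looks like y the better it fools the discriminator, so it only has to produce something that looks like y. That is not the y we want: we want to keep the content of x and change only its style. The cycle consistency loss is designed to enforce this.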
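The cycle constraint can be sketched with two toy "generators". The L1 distance averaged over pixels is an assumption (a common choice for this loss), and `g_st` / `g_ts` are hypothetical stand-ins for Gs→t and Gt→s.

```python
import numpy as np

def cycle_loss(x, g_st, g_ts):
    """Mean absolute difference between x and its round trip s -> t -> s."""
    return np.mean(np.abs(x - g_ts(g_st(x))))

x = np.array([[1.0, 2.0], [3.0, 4.0]])

# A pair of mappings that invert each other: the round trip recovers x.
good = cycle_loss(x, g_st=lambda img: img + 1.0, g_ts=lambda img: img - 1.0)

# A collapsed generator that ignores its input: the content of x is lost.
collapsed = cycle_loss(x, g_st=lambda img: np.zeros_like(img), g_ts=lambda img: img)
```

The collapsed generator might still fool a discriminator, but the cycle term penalizes it because nothing of x can be reconstructed from its output.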

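3. Identity loss

To ensure that the color distributions of input image and output image are similar, we apply an identity consistency constraint to the generator. Given an image x ∈ Xt, after the translation of x using Gs→t, the image should not change.
Explanation:
The generator G produces images in the style of y, so if we feed y itself into G it should still output y; only then can we say that G has really learned to generate the y style. G(y) and y should therefore be as close as possible. According to the paper, without this loss the generator may change the tint of the image on its own, shifting the overall colors.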
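A small sketch of the identity constraint, again assuming a pixel-averaged L1 distance; `g_st` is a hypothetical stand-in for Gs→t, and the "tinting" generator is a made-up example of the color drift the loss is meant to suppress.

```python
import numpy as np

def identity_loss(x_t, g_st):
    """Penalize any change when a target-domain image is fed to the s -> t generator."""
    return np.mean(np.abs(x_t - g_st(x_t)))

x_t = np.full((3, 2, 2), 0.5)                              # toy target-domain image (C, H, W)

unchanged = identity_loss(x_t, lambda img: img)            # leaves the image alone
tinted = identity_loss(x_t, lambda img: img + 0.1)         # shifts every color value
```

A generator that leaves target-domain images untouched pays nothing, while one that shifts the colors is penalized in proportion to the shift.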

4. CAM loss

By exploiting the information from the auxiliary classifiers ηs and ηDt, given an image x ∈ {Xs, Xt}, Gs→t and Dt get to know where they need to improve or what makes the most difference between two domains in the current state.
Explanation:
Through the auxiliary classifiers ηs and ηDt, generator 1 (Gs→t) and discriminator 1 (Dt) learn where they need to improve in the current state and where the largest change is needed.

Full objective

Finally, we jointly train the encoders, decoders, discriminators, and auxiliary classifiers to optimize the final objective:

$$\min _{G_{s \rightarrow t}, G_{t \rightarrow s}, \eta_{s}, \eta_{t}} \max _{D_{s}, D_{t}, \eta_{D_{s}}, \eta_{D_{t}}} \lambda_{1} L_{lsgan}+\lambda_{2} L_{cycle}+\lambda_{3} L_{identity}+\lambda_{4} L_{cam}$$

where λ1 = 1, λ2 = 10, λ3 = 10, λ4 = 1000. Here, $L_{lsgan}=L_{lsgan}^{s \rightarrow t}+L_{lsgan}^{t \rightarrow s}$ and the other losses ($L_{cycle}$, $L_{identity}$, and $L_{cam}$) are defined in the similar way.
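To show how the four weights interact, here is a tiny sketch of the weighted sum. The dictionary keys and the example loss values are made up for illustration; each entry is assumed to already be the sum over both translation directions.

```python
# Weights from the text: lambda1..lambda4.
LAMBDAS = {"lsgan": 1.0, "cycle": 10.0, "identity": 10.0, "cam": 1000.0}

def total_loss(losses, lambdas=LAMBDAS):
    """Weighted sum of the four loss terms."""
    return sum(lambdas[name] * losses[name] for name in lambdas)

# Hypothetical loss values at some training step.
losses = {"lsgan": 0.5, "cycle": 0.2, "identity": 0.1, "cam": 0.001}
total = total_loss(losses)
```

Note how the large λ4 lets even a tiny CAM loss contribute as much as the other terms.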

3 EXPERIMENTS

3.3.1 CAM ANALYSIS

1. First, we conduct an ablation study to confirm the benefit from the attention modules used in both generator and discriminator. As shown in Fig 2 (b), the attention feature map helps the generator to focus on the source image regions that are more discriminative from the target domain, such as eyes and mouth.

2. Meanwhile, we can see the regions where the discriminator concentrates its attention to determine whether the target image is real or fake by visualizing local and global attention maps of the discriminator as shown in Fig 2 (c) and (d), respectively. The generator can fine-tune the area where the discriminator focuses on with those attention maps. Note that we incorporate both global and local attention maps from two discriminators having different receptive field sizes. Those maps can help the generator to capture the global structure (e.g., the face area and the area near the eyes) as well as the local regions. With this information some regions are translated with more care. (In short: the discriminator's global and local attention maps help the generator capture both global and local structure.)

3. The results with the attention module shown in Fig 2 (e) verify the advantageous effect of exploiting attention feature map in an image translation task.
On the other hand, one can see that the eyes are misaligned, or the translation is not done at all, in the results without the attention module, as shown in Fig 2 (f). (So CAM matters a lot.)

3.3.2 ADALIN ANALYSIS

We have applied the AdaLIN only to the decoder of the generator. The role of the residual blocks in the decoder is to embed features, and the role of the up-sampling convolution blocks in the decoder is to generate target domain images from the embedded features.

Figure 3: Comparison of the results using each normalization function: (a) Source images, (b) Our results, (c) Results only using IN in decoder with CAM, (d) Results only using LN in decoder with CAM, (e) Results only using AdaIN in decoder with CAM, (f) Results only using GN in decoder with CAM.

If the learned value of the gate parameter ρ is closer to 1, it means that the corresponding layers rely more on IN than LN. Likewise, if the learned value of ρ is closer to 0, it means that the corresponding layers rely more on LN than IN.

As shown in Fig 3 (c), in the case of using only IN in the decoder, the features of the source domain (e.g., earrings and shades around cheekbones) are well preserved due to channel-wise normalized feature statistics used in the residual blocks. However, the amount of translation to target domain style is somewhat insufficient since the global style cannot be captured by IN of the up-sampling convolution blocks.

On the other hand, as shown in Fig 3 (d), if we use only LN in the decoder, target domain style can be transferred sufficiently by virtue of layer-wise normalized feature statistics used in the up-sampling convolution. But the features of the source domain image are less preserved by using LN in the residual blocks.

This analysis of two extreme cases tells us that it is beneficial to rely more on IN than LN in the feature representation layers to preserve semantic characteristics of source domain, and the opposite is true for the up-sampling layers that actually generate images from the feature embedding. Therefore, the proposed AdaLIN, which adjusts the ratio of IN and LN in the decoder according to source and target domain distributions, is preferable in unsupervised image-to-image translation tasks. Additionally, Fig 3 (e) and (f) are the results of using AdaIN and Group Normalization (GN) respectively, and our method shows better results compared to these. (So combining IN and LN through a learned ratio is a good choice.)

B IMPLEMENTATION DETAILS

B.1 NETWORK ARCHITECTURE

The network architectures of U-GAT-IT are shown in Table 4, 5, and 6.

1. Generator: (Note that the generator's encoder and decoder each contain two down-/up-sampling convolutions, not three; the tables list three layers because the first one only adjusts the number of channels.)
The encoder of the generator is composed of two convolution layers with the stride size of two for down-sampling and four residual blocks.
The decoder of the generator consists of four residual blocks and two up-sampling convolution layers with the stride size of one.
Note that we use the instance normalization for the encoder and AdaLIN for the decoder, respectively. In general, LN does not perform better than batch normalization in classification problems. Since the auxiliary classifier is connected from the encoder in the generator, to increase the accuracy of the auxiliary classifier we use the instance normalization (batch normalization with a mini-batch size of 1) instead of the AdaLIN.

2. Discriminator: (It consists of two PatchGANs, called the local discriminator and the global discriminator.)
Spectral normalization is used for the discriminator. We employ two different scales of PatchGAN for the discriminator network, which classifies whether local (70 x 70) and global (286 x 286) image patches are real or fake. For the activation function, we use ReLU in the generator and leaky-ReLU with a slope of 0.2 in the discriminator.

B.2 TRAINING

All models are trained using Adam with β1=0.5 and β2=0.999.
For data augmentation, we flipped the images horizontally with a probability of 0.5, resized them to 286 x 286, and randomly cropped them to 256 x 256.
The batch size is set to one for all experiments.
We train all models with a fixed learning rate of 0.0001 until 500,000 iterations and then linearly decay it up to 1,000,000 iterations.
We also use a weight decay at rate of 0.0001. The weights are initialized from a zero-centered normal distribution with a standard deviation of 0.02.
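The learning-rate schedule above can be written as a small function. The text only says the rate is decayed linearly up to 1,000,000 iterations; decaying all the way to zero is an assumption of this sketch, and the function and parameter names are hypothetical.

```python
def learning_rate(step, base_lr=1e-4, decay_start=500_000, total_steps=1_000_000):
    """Constant rate until decay_start, then a linear ramp down (assumed to end at 0)."""
    if step < decay_start:
        return base_lr
    remaining = max(total_steps - step, 0)
    return base_lr * remaining / (total_steps - decay_start)

lr_early = learning_rate(100_000)      # still in the constant phase
lr_mid = learning_rate(750_000)        # halfway through the decay
lr_end = learning_rate(1_000_000)      # end of training
```

Under this assumption the rate halfway through the decay phase is half the base rate.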
