GAN for Image-to-Image Translation: A Survey of 2019 Papers

Contents

Preface
- How to read a paper
  - 1. What is the paper's novelty (what is the idea)?
  - 2. What does the paper do (how is the idea implemented)?
  - 3. How does the paper show that its method works (how are the experiments set up)?
GAN
Conditional Generative Adversarial Nets (cGAN, 2014)
- Image-to-Image Translation with Conditional GAN (2017)
CycleGAN
IIT task concepts
- conditional
- unsupervised
- multi-modal and multi-domain
Unsupervised Image-to-Image Translation (UNIT)
Multimodal Unsupervised Image-to-Image Translation (MUNIT)
StyleGAN
StyleGAN2
StarGAN (2018.12)
RelGAN (2019.08)
Latent Filter Scaling for Multimodal Unsupervised Image-to-Image Translation
Multi-Channel Attention Selection GAN with Cascaded Semantic Guidance for Cross-View Image Translation
To be continued

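Preface

This post records the 2019 papers on GANs for image-to-image translation (IIT) that I have read recently. To keep the thread clear, it also covers some earlier, classic IIT GAN papers.
The following are included: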

GAN
cGAN (conditional GAN)
CycleGAN
Explanation of common IIT concepts
StyleGAN and StyleGAN2
UNIT
MUNIT
StarGAN
RelGAN
Latent Filter Scaling for Multimodal Unsupervised Image-to-Image Translation
Multi-Channel Attention Selection GAN with Cascaded Semantic Guidance for Cross-View Image Translation
A Unified Feature Disentangler for Multi-Domain I2I
Homomorphic Latent Space Interpolation for Unpaired Image-to-image
TraVeLGAN Image-to-image Translation by Transformation Vector Learning
Image-to-Image Translation via Group-wise Deep Whitening-and-Coloring Transformation

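The way this post introduces each paper largely follows "How to read a paper" by lyf, a senior labmate of mine. The format is described below. In some places, admittedly, I took a shortcut and only touched on a point briefly.

This post does not describe the papers in detail. It only presents each paper's idea and roughly how that idea is implemented, so that the reader gets a general sense of what the paper does. For the details, please read a dedicated in-depth write-up or the original paper.

I have only been studying CV for two months and my knowledge is limited, so this post certainly contains omissions and mistakes. Corrections from readers are very welcome.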

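How to read a paper

1. What is the paper's novelty (what is the idea)?

a) Why do other methods fail or perform poorly in this situation or setting, and what are their shortcomings (Limitations)?
b) Identify whether the paper proposes a new method, a new theory, and so on.
c) How does the proposed method, starting from its motivation, address the shortcomings of the other methods above?

2. What does the paper do (how is the idea implemented)?

a) Identify the inputs and outputs of the method.
b) Identify how the core implementation maps one-to-one onto the paper's motivation.

3. How does the paper show that its method works (how are the experiments set up)?

a) How do the visualized images and reported numbers show that the method solves problems earlier methods could not? (This validates 1. a).)
This usually takes the form of strong performance: better visual quality and better numbers than other methods.
b) How does the implementation reflect the paper's motivation? (This validates 1. b) and 2. b).)
This usually requires analysis experiments or ablation studies. The former use visualization tools to show that the method's intermediate results match expectations. The latter remove or replace the core component of the proposed method and verify, by controlling variables, that the key component is effective.
c) Shortcomings of the proposed idea or implementation, or other points of rigor that need discussion.
Not every paper has this part, but it lets you re-examine the paper's principle and the implementation of its method.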

GAN

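A GAN is a method for training a generative model. It consists of two models that compete against each other: a generative model G that fits the distribution of the sample data, and a discriminative model D that estimates whether an input sample comes from the real training data or from G.
The generator maps noise into data space through a mapping function, while the discriminator outputs a scalar: the probability that the data comes from the real training data rather than from G.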
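For reference, the two-player game described here is usually written as the minimax objective below (the standard formulation from the original GAN paper; p_data and p_z denote the data and noise distributions, symbols this post does not otherwise use):

```latex
\min_G \max_D V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}(x)}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big]
```

D tries to push D(x) toward 1 on real data and D(G(z)) toward 0 on generated data, while G tries to make D(G(z)) large.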

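Conditional Generative Adversarial Nets (cGAN, 2014)

To address the problem that a GAN is too unconstrained, a condition y is added to both the generative model G and the discriminative model D to guide the data generation process. The condition can be any auxiliary information, such as a class label (one-hot encoded). This makes GANs better suited to cross-modal problems such as automatic image tagging.
The difference between cGAN and GAN is clear from the loss function, in which x|y means that the real image x and the label y are fed to D together, and z|y means that the noise z and the label y are fed to G together.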
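The cGAN objective, as given in the cGAN paper, makes the x|y and z|y notation concrete (p_data and p_z are the data and noise distributions):

```latex
\min_G \max_D V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}(x)}\big[\log D(x \mid y)\big]
  + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z \mid y))\big)\big]
```

The only change from the unconditional objective is that both D and G receive y as an extra input.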

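Image-to-Image Translation with Conditional GAN (2017)

This work applies cGAN to I2I. The loss is essentially the same as cGAN's; the "label" fed to G and D is the image to be translated.

Contributions:
a. An L1 loss is added to the loss function, so that the generated image must not only look real but also stay close to the ground-truth target image paired with the input.
b. The generator uses a U-Net structure (skip connections between mirrored encoder and decoder layers of G) instead of a plain encoder-decoder.
c. PatchGAN is proposed. A discriminator usually judges a generated sample as a whole; for an image, it decides whether the entire picture is real. Because much of the evaluation in image-to-image translation is pixel-to-pixel, the paper instead judges patch by patch: each patch of the image is classified as real or fake, and the results are averaged into the final output.

Drawback: it is supervised learning, so each source image must be paired with a target image.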
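As a rough illustration of contributions a and c, the sketch below averages per-patch discriminator scores and adds a weighted L1 term to the generator loss. It is a minimal NumPy sketch, not the paper's implementation: the function names are hypothetical, the discriminator output is simply passed in as an array of probabilities, and the default weight `lam=100.0` is an assumption chosen for illustration.

```python
import numpy as np

def patch_decision(patch_scores):
    """Average the per-patch real/fake probabilities into one score."""
    return float(np.mean(patch_scores))

def generator_loss(patch_scores_fake, fake, target, lam=100.0):
    """Adversarial term plus lam times the L1 distance to the target image.

    patch_scores_fake: D's per-patch probabilities for the generated image.
    fake, target: arrays of the same shape (generated and ground-truth image).
    lam: weight of the L1 term (illustrative default, an assumption).
    """
    eps = 1e-8  # avoids log(0)
    adv = -np.mean(np.log(patch_scores_fake + eps))  # non-saturating GAN term
    l1 = np.mean(np.abs(fake - target))              # pixel-wise L1 term
    return adv + lam * l1

# Toy usage: a 30x30 grid of patch scores and an 8x8 RGB image.
scores = np.full((30, 30), 0.5)
target = np.zeros((3, 8, 8))
loss_perfect = generator_loss(scores, target, target)
loss_off_by_one = generator_loss(scores, target + 1.0, target)
```

With identical patch scores, the two losses differ only through the L1 term, which shows how the L1 weight pulls the output toward the paired target.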

CycleGAN

CycleGAN adds a second network that maps target-domain images back to the source domain, together with a cycle-consistency loss that preserves some properties of the original image and avoids mode collapse (mapping all images to the same image).
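A minimal sketch of the cycle-consistency idea, assuming an L1 penalty and using toy functions in place of the two generators (the names `G` and `F` and the toy mappings are illustrative assumptions, not trained networks):

```python
import numpy as np

def cycle_consistency_loss(x, y, G, F):
    """L1 cycle loss: F(G(x)) should recover x, and G(F(y)) should recover y."""
    forward = np.mean(np.abs(F(G(x)) - x))   # source -> target -> source
    backward = np.mean(np.abs(G(F(y)) - y))  # target -> source -> target
    return forward + backward

# Toy stand-ins for the two generators (hypothetical, exact inverses).
G = lambda img: img + 1.0   # source domain -> target domain
F = lambda img: img - 1.0   # target domain -> source domain

x = np.random.rand(3, 4, 4)
y = np.random.rand(3, 4, 4)
loss = cycle_consistency_loss(x, y, G, F)

# A collapsed generator that maps every input to the same image is penalized.
collapsed = lambda img: np.zeros_like(img)
loss_collapsed = cycle_consistency_loss(x, y, collapsed, F)
```

Because the collapsed mapping discards the input, the original image cannot be recovered from it and the loss becomes large, which is the mechanism by which the cycle term discourages mode collapse.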

IIT task concepts:

conditional

Unlike the unconditional case, where the latent vector can be simply mapped to a full size image, the conditional case requires using both the latent vector and the input image.

unsupervised

GANs that take an image from one domain and produce an image in another domain will be referred to as image-to-image translation GANs. If paired data are used, the GAN will be referred to as supervised. It will be referred to as unsupervised if the images from the two domains are not paired.

multi-modal and multi-domain

Multi-modal: Finally, image-to-image translation GANs that produce a single image will be referred to as deterministic or unimodal, while multimodal ones make use of an input latent vector in addition to the input image to produce many outputs. (Training a multi-modal model does not require attribute labels in the dataset.)

Multi-domain: here a domain is defined by an attribute in the dataset. For the gender attribute, male is one domain and female is another; for hair color, blond is one domain and black is another. (Multi-domain GANs such as StarGAN and RelGAN do require attribute labels in the dataset.)

Fine-grained: fine-grained control over the output.

Unsupervised Image-to-Image Translation (UNIT)

Multimodal Unsupervised Image-to-Image Translation (MUNIT)

StyleGAN

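StyleGAN is not an IIT paper; it proposes a new generator framework.

1. Idea
a. Researchers do not yet fully understand how a GAN generator synthesizes an image; the generator works like a black box. The latent space is also poorly understood.
b. The paper proposes a generator that separates stochastic variation in the generated image from the image's attributes and allows fine-grained control over the synthesis process. It also achieves better interpolation properties and better disentanglement of the latent variables.
2. Implementation of the idea
a. Input: the synthesis network starts from a learned constant input; the style is specified through the latent code described next.
b. Noise sampled from a normal distribution is first mapped by an MLP into the latent space W. An AdaIN is then added after every convolution layer, with its parameters obtained from W through an affine transformation. The key point is that all layers share the same latent variable w (a point in W), but each layer has its own affine transformation and therefore its own adaptive parameters.
c. Stochastic variation (such as individual hair strands) is achieved by adding noise, which does not affect the inherent attributes of the face.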
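The AdaIN step in point b can be sketched as follows. This is a minimal NumPy illustration under stated assumptions: each layer's affine transformation is reduced to two plain matrices (`A_scale`, `A_bias`, hypothetical names), biases are omitted, and the feature map is a single (C, H, W) array.

```python
import numpy as np

def adain(x, w, A_scale, A_bias, eps=1e-5):
    """Adaptive instance normalization driven by a latent code.

    x: feature map of shape (C, H, W).
    w: latent code of shape (D,), shared by all layers.
    A_scale, A_bias: per-layer affine matrices of shape (C, D) (assumed form).
    """
    y_s = A_scale @ w                               # per-channel style scale
    y_b = A_bias @ w                                # per-channel style bias
    mu = x.mean(axis=(1, 2), keepdims=True)         # per-channel mean
    sigma = x.std(axis=(1, 2), keepdims=True)       # per-channel std
    x_norm = (x - mu) / (sigma + eps)               # instance normalization
    return y_s[:, None, None] * x_norm + y_b[:, None, None]

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8, 8))
w = rng.normal(size=(16,))
A_s = rng.normal(size=(4, 16)) * 0.1
A_b = rng.normal(size=(4, 16)) * 0.1
out = adain(x, w, A_s, A_b)
```

After this step each channel's mean and standard deviation are set by the style (up to sign and `eps`), regardless of the statistics of the incoming features. Giving every layer its own `A_scale` and `A_bias` is what lets one shared w produce different adaptive parameters per layer.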

StyleGAN2

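StyleGAN2 fixes several image-quality problems in StyleGAN:

- AdaIN is replaced with modulation and demodulation (AdaIN subtracts the mean, whereas modulation only rescales the variance). This removes the droplet-like artifacts.
- Progressive growing is no longer used for training; G uses skip connections and D uses residual blocks. This fixes the tendency of some facial parts to stay at fixed positions in the image.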
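A minimal sketch of modulation followed by demodulation, applied to the convolution weights rather than to the feature maps. The function name and the array layout (C_out, C_in, k, k) are assumptions made for illustration.

```python
import numpy as np

def modulate_demodulate(weight, style, eps=1e-8):
    """Scale conv weights by a style, then renormalize each output filter.

    weight: conv weights of shape (C_out, C_in, k, k).
    style:  per-input-channel scale of shape (C_in,).
    """
    # Modulation: scale every input channel of every filter by the style.
    w_mod = weight * style[None, :, None, None]
    # Demodulation: rescale each output filter to unit L2 norm, which
    # normalizes the expected output variance without subtracting a mean.
    norm = np.sqrt((w_mod ** 2).sum(axis=(1, 2, 3), keepdims=True) + eps)
    return w_mod / norm

rng = np.random.default_rng(0)
weight = rng.normal(size=(6, 4, 3, 3))
style = rng.uniform(0.5, 2.0, size=(4,))
w_out = modulate_demodulate(weight, style)
filter_norms = np.sqrt((w_out ** 2).sum(axis=(1, 2, 3)))
```

Every output filter ends up with unit norm whatever the style magnitude is, so the normalization acts on expected statistics instead of on the actual feature map contents.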

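StarGAN (2018.12)

1. Novelty
a. Existing IIT methods only handle two domains and fall short in the multi-domain setting.
b. The paper proposes the StarGAN architecture, which can train multi-domain IIT within a single network, and can even train on multiple datasets in the same network.
2. Implementation of the idea
a. Input: an image and domain information. Output: the translated image.
b. Core implementation:
1. A domain classifier is attached to D, together with a domain classification loss.
2. A reconstruction loss.
3. A mask vector makes the GAN ignore the labels that are unknown when translating across multiple datasets and focus on the known labels.
4. The adversarial loss is the Wasserstein GAN loss.

3. Experiments
a. DIAT, CycleGAN, and IcGAN serve as baselines. Visual comparisons show that StarGAN outperforms the baselines on facial attribute transfer on the CelebA dataset and on facial expression synthesis on the RaFD dataset. A quantitative evaluation of StarGAN is also given, using the classification accuracy of a ResNet-18 classifier.
b. An ablation study demonstrates the effect of joint training on multiple datasets, and a comparison of correct and incorrect mask vectors shows how important the mask vector is.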
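The mask vector can be sketched as below: labels of the dataset whose annotations are unknown are zero-filled, a one-hot mask says which dataset's labels are valid, and the combined vector is replicated spatially and concatenated with the image as generator input. The function name, the argument names, and the label sizes are illustrative assumptions.

```python
import numpy as np

def build_generator_input(image, celeba_label, rafd_label, known):
    """Concatenate an image with a unified label vector and a mask vector.

    image: array of shape (3, H, W).
    celeba_label, rafd_label: 1-D label vectors for the two datasets.
    known: "celeba" or "rafd", the dataset whose labels are actually known.
    """
    if known == "celeba":
        rafd_label = np.zeros_like(rafd_label)      # unknown labels -> zeros
        mask = np.array([1.0, 0.0])
    else:
        celeba_label = np.zeros_like(celeba_label)  # unknown labels -> zeros
        mask = np.array([0.0, 1.0])
    c = np.concatenate([celeba_label, rafd_label, mask])
    _, h, w = image.shape
    # Replicate the label vector over every spatial position.
    c_map = np.broadcast_to(c[:, None, None], (c.size, h, w))
    return np.concatenate([image, c_map], axis=0)

image = np.zeros((3, 4, 4))
celeba = np.array([1.0, 0.0, 0.0, 1.0, 1.0])                # 5 binary attributes
rafd = np.array([0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0])   # 8 one-hot expressions
g_in = build_generator_input(image, celeba, rafd, known="celeba")
```

Here the generator sees 3 image channels plus 5 + 8 + 2 label channels, and the RaFD channels are all zero because the mask marks only the CelebA labels as valid.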

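RelGAN (2019.08)

1. Novelty
a. In earlier multi-domain IIT methods the attribute vector is binary, so control over the result is not fine-grained enough. They also use absolute attribute vectors, so attribute values must be specified even for attributes that are not supposed to change. The paper proposes relative attribute vectors to solve this.
b. The paper proposes: 1. the relative attribute vector; 2. D_match; 3. D_interp.
2. Implementation of the idea
The implementation combines a conditional adversarial loss, a reconstruction loss, an interpolation loss, and others.
3. Experiments
StarGAN and AttGAN serve as baselines. On the IIT task, FID is used as the metric to show RelGAN's advantage in visual quality, and the classification accuracy of the three GANs is also compared. An ablation study then shows the effect of each loss, and an analysis experiment shows the interpolation results. Finally, a user study compares the image quality of RelGAN against the baselines.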
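The relative attribute vector can be illustrated with a few lines of NumPy. This sketch assumes the relative vector is the difference between target and source attributes and that a scalar in [0, 1] scales it for interpolation; the attribute order and function names are hypothetical.

```python
import numpy as np

def relative_attributes(a, b):
    """Relative attribute vector v = b - a; a zero entry means 'leave unchanged'."""
    return np.asarray(b, dtype=float) - np.asarray(a, dtype=float)

def interpolate(v, alpha):
    """Scale the relative attributes by alpha in [0, 1] for a partial edit."""
    return alpha * v

# Hypothetical attribute order: [black hair, blond hair, smiling]
a = [1, 0, 0]   # source image: black hair
b = [0, 1, 0]   # target: blond hair, smile left as it is
v = relative_attributes(a, b)
half = interpolate(v, 0.5)
```

Only the attributes to be changed are non-zero in `v`, so the user does not need to know or restate the remaining attributes, and the scaled vector gives the intermediate edits that the interpolation loss is meant to make realistic.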

Latent Filter Scaling for Multimodal Unsupervised Image-to-Image Translation

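The paper points out that many existing IIT methods map the latent code directly to an image, which requires a very complex network architecture and introduces many hyperparameters. It proposes instead to treat the latent code as a modifier of the convolutional filters (similar to MUNIT) while keeping the discriminator loss of a traditional GAN; the only parameter left to tune is the one trading off image quality against diversity. The contributions can be summarized as:
• No encoder and no reconstruction loss; only the traditional GAN structure and discriminator loss are kept. This inherently suppresses mode collapse.
• Fewer hyperparameters and loss terms: a single parameter controls the quality and diversity of the generated images.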
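One way to picture "latent code as a modifier of the convolutional filters" is a per-filter scalar derived from the latent code that rescales each filter. The sketch below is only an illustration under assumptions: a single linear map `M` (hypothetical) stands in for whatever mapping network the paper uses, and the weight layout is (C_out, C_in, k, k).

```python
import numpy as np

def scale_filters(weight, z, M):
    """Rescale each convolution filter by a scalar computed from the latent code.

    weight: conv weights of shape (C_out, C_in, k, k).
    z:      latent code of shape (D,).
    M:      hypothetical linear map of shape (C_out, D), latent -> one scalar per filter.
    """
    s = M @ z                                   # one scalar per output filter
    return weight * s[:, None, None, None]      # broadcast over C_in, k, k

rng = np.random.default_rng(0)
weight = rng.normal(size=(6, 4, 3, 3))
M = rng.normal(size=(6, 8))
z1 = rng.normal(size=(8,))
z2 = rng.normal(size=(8,))
w1 = scale_filters(weight, z1, M)
w2 = scale_filters(weight, z2, M)
```

Different latent codes give differently scaled filters for the same input image, which is how a single generator can produce multiple outputs without decoding the latent code into an image directly.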

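Multi-Channel Attention Selection GAN with Cascaded Semantic Guidance for Cross-View Image Translation

Task:
Cross-view image translation: given a semantic map (object outlines) and a scene image, generate an image of the same scene from the viewpoint of the semantic map.

1. Novelty:
a. Previous work mostly deals with views that overlap heavily. One earlier work uses the semantic map to supervise image generation, but because the semantic map is not accurate enough, the generated images are not good enough either.
b. Contributions:
A. A novel multi-channel attention selection GAN framework.
B. A novel multi-channel attention selection module.
As for why multi-channel selection can refine the generated image, the authors explain: We argue that this is not enough for the complex translation problem we are dealing with, and thus we explore using a larger generation space to have a richer synthesis via constructing multiple intermediate generations.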

To be continued