Parser-Free Virtual Try-On via Distilling Appearance Flows


Figure 1 compares our model with recent state-of-the-art parser-based try-on methods (left) and the emerging parser-free approach (right). On the left, green boxes highlight inaccurate segmentation regions; these imprecise region partitions mislead existing parser-based methods such as CP-VTON, ClothFlow, CP-VTON+, and ACGPN into producing erroneous results. On the right, WUTON, the first parser-free method, was recently proposed, but its image quality is bounded by the fake images generated by the parser-based model, because it simply trains a "student" network to mimic the parser-based method via knowledge distillation. Our method produces far better image quality than previous state-of-the-art approaches, without relying on human segmentation.

Abstract

Image virtual try-on aims to fit a garment image (target clothes) onto a person image. Prior methods rely heavily on human parsing; however, even slightly wrong segmentation results lead to unrealistic try-on images with large artifacts.

A recent pioneering work employed knowledge distillation to reduce the dependency on human parsing: the try-on images produced by a parser-based method are used as supervision to train a "student" network that does not rely on segmentation, making the student mimic the try-on ability of the parser-based model.

However, the image quality of the student is bounded by the parser-based model. To address this problem, we propose a novel approach, "teacher-tutor-student" knowledge distillation, which produces highly photo-realistic images without human parsing and possesses several appealing advantages over prior arts.

(1) Unlike existing work, our approach treats the fake images produced by the parser-based method as "tutor knowledge", whose artifacts are corrected by real "teacher knowledge" extracted from real person images in a self-supervised way.

(2) Rather than only using real images as supervision, we formulate knowledge distillation in the try-on problem as distilling the appearance flows between the person image and the garment image, enabling us to find accurate dense correspondences between them and produce high-quality results.

(3) Extensive evaluations show the large superiority of our method (see Fig. 1).

1. Introduction

Virtual try-on of fashion images fits an image of a clothing item (garment) onto an image of a human body. The task has attracted a lot of attention in recent years because of its wide applications in e-commerce and fashion image editing. Most state-of-the-art methods, such as VTON [9], CP-VTON [30], VTNFP [33], ClothFlow [8], ACGPN [32], and CP-VTON+ [18], rely on segmenting the person into body parts such as upper body, lower body, arms, face, and hair in order to enable the learning procedure of virtual try-on. However, high-quality human parsing is typically required to train the try-on models, because slightly wrong segmentation leads to highly unrealistic try-on images, as shown in Fig. 1.

To reduce the dependency on accurate masks, a recent pioneering work, WUTON, proposed the first parser-free network for virtual try-on that does not use human segmentation. Unfortunately, WUTON has an inherent flaw in its model design. As shown at the bottom of Fig. 2, WUTON adopts a conventional knowledge distillation scheme: it treats a parser-based model (a try-on network that requires human segmentation) as the teacher network and distills the try-on images (synthesized images) produced by the teacher into a parser-free student network that takes no segmentation as input. This makes the student directly mimic the teacher's try-on ability. However, the images generated by the teacher contain many synthesis artifacts (Fig. 1), so using these images as teacher knowledge to supervise the student gives unsatisfactory results, because the image quality of the student is bounded by the parser-based model.

To address the above problem, this work proposes a new perspective that produces photo-realistic try-on images without human parsing, named Parser-Free Appearance Flow Network (PF-AFN), which employs a novel "teacher-tutor-student" knowledge distillation scheme. As shown at the top of Fig. 2, PF-AFN treats the parser-based model as a tutor that may produce unrealistic images (i.e., tutor knowledge), rather than as a teacher; a real teacher is then needed to improve upon the tutor. The key lies in designing the source of the teacher knowledge. To this end, PF-AFN feeds the fake person image (tutor knowledge) into the parser-free student model, which is supervised by the original real person image (teacher knowledge), making the student mimic the real image. This resembles self-supervised learning: the student network is trained by transferring the clothes on the real person image onto the fake person image generated by the parser-based model. In other words, the student is asked to change the clothes on the fake person image to the clothes on the real person image, so that it can supervise itself under the guidance of the real image. In this setting, the images generated by our parser-free model significantly outperform their counterparts produced by previous methods.

To further improve the image quality of the student, beyond using real images as supervision, we formulate knowledge distillation of the try-on problem as distilling the appearance flows between the person image and the garment image, which facilitates finding dense correspondences between them to generate high-quality images.

Our work has three main contributions. First, we propose a "teacher-tutor-student" knowledge distillation scheme for the try-on problem, producing highly photo-realistic results without using human segmentation as model input and thus completely removing the need for human parsing.

Second, we formulate knowledge distillation in the try-on problem as distilling the appearance flows between the person image and the garment image, which is important for finding accurate dense correspondences between pixels to generate high-quality images.

Third, extensive experiments and evaluations on popular datasets demonstrate that our proposed method has large superiority over recent state-of-the-art approaches, both qualitatively and quantitatively.

2. Related Work

Virtual Try-On. Existing deep-learning-based methods for virtual try-on can be divided into 3D-model-based and 2D-image-based methods. Since the former require additional 3D measurements and more computation, 2D-image-based methods are more widely applicable. Because existing 2D try-on datasets contain only unpaired data (a garment and a person wearing that garment), previous methods typically mask the clothing region in the person image and then reconstruct the person image with the corresponding garment image, which requires accurate human parsing. When the parsing results are inaccurate, such parser-based methods generate highly unrealistic images. WUTON recently proposed a pioneering parser-free method, but the quality of its generated images is bounded by the fake images of the parser-based network.

Appearance Flow. Appearance flow refers to 2D coordinate vectors indicating which pixels in the source can be used to synthesize the target. It has motivated work in visual tracking, image restoration, and face hallucination. Appearance flow was first introduced to synthesize images of the same object observed from arbitrary viewpoints, but its flow estimation is limited on non-rigid clothing regions with large deformation. Another line of work uses 3D appearance flows to synthesize a person image with a target pose, fitting a 3D model to compute the appearance flows as supervision, which is not available in 2D try-on.

Knowledge Distillation. Knowledge distillation leverages the intrinsic information of a teacher network to train a student network, and was first introduced in [12] for model compression. As introduced in [34], knowledge distillation has also been extended to cross-modality knowledge transfer, where a model trained with superior modalities (i.e., multi-modalities) as inputs intermediately supervises another model taking weak modalities (i.e., a single modality) as inputs, and the two models can use the same network.
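For concreteness, the sketch below shows the classic response-based formulation from [12], where softened teacher outputs supervise the student alongside ground-truth labels. The temperature and loss weight are illustrative choices, not values used by any of the cited works.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Response-based distillation: the student matches softened teacher
    outputs in addition to the ground-truth labels (illustrative constants)."""
    soft_targets = F.softmax(teacher_logits / T, dim=1)
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        soft_targets,
        reduction="batchmean",
    ) * (T * T)
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1.0 - alpha) * hard_loss
```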

3.1. Network Training

As shown in Fig. 3, our method contains a parser-based network PB-AFN and a parser-free network PF-AFN. We first train PB-AFN with data (Ic, I) following existing methods [30, 8, 32], where Ic and I denote the image of the clothes and the image of the person wearing these clothes, respectively.

We concatenate a mask containing the hair, face, and lower-body clothes regions, the human body segmentation result, and the human pose estimation result as the person representation p∗, and use it to infer the appearance flows uf between p∗ and the clothes image Ic.
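As a rough illustration, p∗ is simply a channel-wise concatenation of these cues. The sketch below assumes hypothetical channel counts (one preserved-region mask channel, one segmentation channel, 18 pose heatmaps); these are not values specified by the paper.

```python
import torch

# Assemble a person representation p* for the parser-based warping branch.
# Channel counts are illustrative assumptions, not the paper's configuration.
batch, H, W = 1, 256, 192
preserve_mask = torch.rand(batch, 1, H, W)   # hair / face / lower-body clothes mask
body_seg      = torch.rand(batch, 1, H, W)   # human body segmentation map
pose_map      = torch.rand(batch, 18, H, W)  # one heatmap per pose keypoint

p_star  = torch.cat([preserve_mask, body_seg, pose_map], dim=1)  # (B, 20, H, W)
clothes = torch.rand(batch, 3, H, W)          # clothes image Ic

# p_star and clothes are the two inputs from which PB-AFWM predicts the
# appearance flows uf.
```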

The appearance flows uf are then used, together with Ic, to generate the warped clothes uw.
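A common way to realize this warping step is bilinear sampling with the predicted flow used as a sampling grid. The sketch below assumes the flow stores per-pixel offsets in pixel units, which may differ from the paper's exact parameterization.

```python
import torch
import torch.nn.functional as F

def warp_with_flow(clothes, flow):
    """Warp a clothes image with dense appearance flows via bilinear sampling.

    clothes: (B, 3, H, W) garment image Ic
    flow:    (B, 2, H, W) per-pixel 2D offsets (in pixels) pointing into Ic
    returns: (B, 3, H, W) warped clothes uw
    """
    B, _, H, W = clothes.shape
    # Base sampling grid in normalized [-1, 1] coordinates.
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, H), torch.linspace(-1, 1, W), indexing="ij"
    )
    base_grid = torch.stack([xs, ys], dim=-1).unsqueeze(0).expand(B, -1, -1, -1)
    base_grid = base_grid.to(clothes.device)
    # Convert pixel offsets to normalized offsets and add to the base grid.
    norm_flow = torch.stack(
        [flow[:, 0] / ((W - 1) / 2), flow[:, 1] / ((H - 1) / 2)], dim=-1
    )
    grid = base_grid + norm_flow
    return F.grid_sample(clothes, grid, mode="bilinear", align_corners=True)
```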

Concatenating the warped clothes, the preserved regions of the person image, and the human pose estimation along the channel dimension as inputs, we can train a generative module to synthesize the person image under the ground-truth supervision I.
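The generative module can then be trained directly against the real image I. Below is a minimal sketch assuming an L1 term plus a VGG-based perceptual term, which is a common choice for try-on generators rather than the paper's exact loss.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class PerceptualLoss(nn.Module):
    """L1 distance between early VGG-19 features (a common perceptual loss)."""
    def __init__(self):
        super().__init__()
        vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1).features[:16]
        for p in vgg.parameters():
            p.requires_grad = False
        self.vgg = vgg.eval()
        self.l1 = nn.L1Loss()

    def forward(self, pred, target):
        return self.l1(self.vgg(pred), self.vgg(target))

def generator_loss(pred_person, real_person, perceptual, lambda_perc=0.2):
    # lambda_perc is an illustrative weight, not a value from the paper.
    return nn.functional.l1_loss(pred_person, real_person) + \
           lambda_perc * perceptual(pred_person, real_person)
```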

After training PB-AFN, we randomly select a different clothes image Ic̃ and generate the try-on result uĨ, i.e., the image of the person in I wearing this different clothes.

Intuitively, the generated fake image uĨ, together with the clothes image Ic, is used as input to train the student network PF-AFN.

We treat the parser-based network as the "tutor" network and its generated fake image as "tutor knowledge" to enable the training of the student network.

In PF-AFN, a warping module is adopted to predict the appearance flows sf between the tutor uĨ and the clothes image Ic, and to warp Ic into sw. A generative module then synthesizes the student result sI from the warped clothes and the tutor.

We treat the real image I as the "teacher knowledge" to correct the student sI, making the student mimic the original real image. Furthermore, the tutor network PB-AFN distills the appearance flows uf to the student network PF-AFN through adjustable knowledge distillation, which is explained in Sec. 3.4.
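Putting these pieces together, one training step of the student might look like the sketch below. The PB_AFN and PF_AFN interfaces (each returning a synthesized image and its appearance flows) and the weight lambda_kd are assumptions for illustration, and the adjustable gating of the flow distillation described in Sec. 3.4 is omitted.

```python
import torch
import torch.nn.functional as F

def student_step(PB_AFN, PF_AFN, I, Ic, Ic_tilde, lambda_kd=0.04):
    """One sketched 'teacher-tutor-student' training step (interfaces assumed).

    I        : real person image (teacher knowledge)
    Ic       : clothes worn by the person in I
    Ic_tilde : a randomly selected different clothes image
    """
    with torch.no_grad():
        # Tutor knowledge: fake person image wearing the random garment.
        # PB_AFN is assumed to build its person representation p* from I
        # internally and to return (synthesized image, appearance flows).
        u_I_tilde, _ = PB_AFN(I, Ic_tilde)
        _, uf = PB_AFN(I, Ic)   # tutor's flows for the original garment

    # The student sees only the fake image and the garment image (no parsing).
    sI, sf = PF_AFN(u_I_tilde, Ic)

    # Teacher knowledge: the real image supervises the student's output,
    # while the tutor's flows supervise the student's flows.
    loss_img = F.l1_loss(sI, I)
    loss_kd = F.l1_loss(sf, uf)
    return loss_img + lambda_kd * loss_kd
```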

3.2. Appearance Flow Warping Module (AFWM).

Both PB-AFN and PF-AFN contain the warping module AFWM, which predicts the dense correspondences between the clothes image and the person image for warping the clothes.

As shown in Fig. 3, the output of the warping module is the appearance flows (e.g., uf), which are a set of 2D coordinate vectors. Each vector indicates which pixel in the clothes image should be used to fill the given pixel in the person image.

The warping module consists of a dual pyramid feature extraction network (PFEN) and a progressive appearance flow estimation network (AFEN).

PFEN extracts two-branch pyramid deep feature representations from the two inputs. Then, at each pyramid level, AFEN learns to generate coarse appearance flows, which are refined at the next level. A second-order smooth constraint is also adopted when learning the appearance flows, to further preserve clothes characteristics such as logos and stripes. The parser-based warping module (PB-AFWM) and the parser-free warping module (PF-AFWM) have identical architectures except for the difference in their inputs.
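As a sketch, the second-order smooth constraint can be written as a penalty on the second differences of neighboring flow vectors. The version below uses a generalized Charbonnier penalty along the horizontal and vertical directions only; its constants are illustrative rather than the paper's, and the paper may also sum over additional neighborhoods.

```python
import torch

def second_order_smooth(flow, eps=1e-3, alpha=0.45):
    """Second-order smoothness on appearance flows.

    flow: (B, 2, H, W). Penalizes the second difference of neighboring flow
    vectors along x and y with a generalized Charbonnier penalty (simplified).
    """
    d2x = flow[:, :, :, :-2] - 2 * flow[:, :, :, 1:-1] + flow[:, :, :, 2:]
    d2y = flow[:, :, :-2, :] - 2 * flow[:, :, 1:-1, :] + flow[:, :, 2:, :]
    charbonnier = lambda t: (t.pow(2) + eps * eps).pow(alpha)
    return charbonnier(d2x).mean() + charbonnier(d2y).mean()
```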
