Deep Video Portraits

Synthesizing and editing video portraits—i.e., videos framed to show a person’s head and upper body—is an important problem in computer graphics, with applications in video editing and movie postproduction, visual effects, visual dubbing, virtual reality, and telepresence, among others.

The problem of synthesizing a photo-realistic video portrait of a target actor that mimics the actions of a source actor—and especially where the source and target actors can be different subjects—is still an open problem.

Until now, there hasn't been an approach that enables one to take full control of the rigid head pose, facial expressions, and eye motion of a target actor. With this work, that control becomes possible, and even face identity can be modified to some extent.

In this post, I’m going to review “Deep Video Portraits”, which presents a novel approach that enables photo-realistic re-animation of portrait videos using only an input video.

In this post, I’ll cover two things: First, a short definition of a DeepFake. Second, an overview of the paper “Deep Video Portraits” in the words of the authors.

1. Defining DeepFakes

The word DeepFake combines the terms “deep learning” and “fake”, and refers to manipulated videos or other digital representations that produce fabricated images and sounds that appear to be real but have in fact been generated by deep neural networks.

2. Deep Video Portraits

2.1 Overview

The core method presented in the paper provides full control over the head of a target actor by transferring the rigid head pose, facial expressions, and eye motion of a source actor, while preserving the target’s identity and appearance.

On top of that, full video of the target is synthesized, including consistent upper body posture, hair, and background.

Figure 1. Facial reenactment results from “DVP”. Expressions are transferred from the source to the target actor, while retaining the head pose (rotation and translation) as well as the eye gaze of the target actor

The overall architecture of the paper’s framework is illustrated below in Figure 2.

First, the source and target actors are tracked using a state-of-the-art face reconstruction approach that works from a single image, and a 3D morphable model (3DMM) is derived to best fit each actor.

The resulting sequence of low-dimensional parameter vectors represents the actor’s identity, head pose, expression, eye gaze, and the scene lighting for every video frame.

Then, the head pose, expression, and/or eye gaze parameters of the source are taken and mixed with the illumination and identity parameters of the target. This allows the network to generate a full-head reenactment while preserving the target actor's identity and appearance.

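As a concrete illustration of this parameter mixing, here is a minimal Python sketch. The split between source-driven and target-preserved parameters follows the description above, but the dataclass layout and field names are my own assumptions, not the authors' code.

```python
from dataclasses import dataclass, replace
import numpy as np

@dataclass
class FaceParams:
    """Per-frame face parameters (field names are illustrative assumptions)."""
    pose: np.ndarray          # rigid head pose (rotation + translation)
    expression: np.ndarray    # expression coefficients
    gaze: np.ndarray          # eye gaze parameters
    identity: np.ndarray      # facial identity (geometry) coefficients
    reflectance: np.ndarray   # skin reflectance coefficients
    illumination: np.ndarray  # spherical harmonics lighting coefficients

def mix_parameters(source: FaceParams, target: FaceParams) -> FaceParams:
    """Drive the target with the source's pose, expression, and gaze while
    keeping the target's identity, reflectance, and illumination."""
    return replace(target,
                   pose=source.pose,
                   expression=source.expression,
                   gaze=source.gaze)
```

The mixed parameter vector is then used to render the synthetic conditioning images described in Section 2.3.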

Next, new synthetic renderings of the target actor are generated based on the mixed parameters. These renderings are the input to the paper’s novel “rendering-to-video translation network”, which is trained to convert the synthetic input into photo-realistic output.

Figure 2. Deep video portraits enable a source actor to fully control a target video portrait. First, a low-dimensional parametric representation (left) of both videos is obtained using monocular face reconstruction. The head pose, expression, and eye gaze can then be transferred in parameter space (middle). Finally, conditioning input images are rendered and converted to a photo-realistic video portrait of the target actor (right). Obama video courtesy of the White House (public domain)

2.2 Face Reconstruction from a Single Image

3D morphable models are used for face analysis because the intrinsic properties of 3D faces provide a representation that’s immune to intra-personal variations, such as pose and illumination. Given a single facial input image, a 3DMM can recover 3D face (shape and texture) and scene properties (pose and illumination) via a fitting process.

The authors employ a state-of-the-art dense face reconstruction approach that fits a parametric model of the face and illumination to each video frame. It obtains a meaningful parametric face representation for both the source and the target, given an input video sequence.

Equation 1. Source actor video sequence, where N_s denotes the total number of source frames.

The meaningful parametric face representation consists of a set of parameters P, which can be denoted as the corresponding parameter sequence that fully describes the source or target facial performance.

Equation 2. A meaningful parametric face representation best describes each frame in the input video sequence.

The set of reconstructed parameters P encodes the rigid head pose, facial identity coefficients, expression coefficients, gaze direction for both eyes, and spherical harmonics illumination coefficients. Overall, the face reconstruction process estimates 261 parameters per video frame.

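To make the 261-parameter figure tangible, here is one breakdown that is consistent with it, with the parameter groups listed above, and with the skin reflectance coefficients mentioned in Section 2.3. Treat the exact split as my reading of the paper rather than a verbatim specification.

```python
# Assumed per-frame parameter layout; the total matches the 261 reported in the paper.
PARAM_LAYOUT = {
    "rigid_pose": 6,       # 3 rotation + 3 translation
    "identity": 80,        # facial identity (geometry) coefficients
    "reflectance": 80,     # skin reflectance coefficients
    "expression": 64,      # expression coefficients
    "illumination": 27,    # 9 spherical harmonics coefficients per color channel
    "eye_gaze": 4,         # gaze direction for both eyes
}

assert sum(PARAM_LAYOUT.values()) == 261
```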

Below are more details on the parametric face representation and the fitting process.

2.2.1 Parametric Face Representation

The paper represents the space of facial identity based on a parametric head model, and the space of facial expressions via an affine model. Mathematically, they model geometry variation through an affine model v∈ R^(3N) that stacks per-vertex deformations of the underlying template mesh with N vertices, as follows:

Equation 3. per-vertex deformations of the underlying template mesh with N vertices
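
As a sketch of the form Equation 3 takes, assuming the standard 3DMM notation (identity coefficients α and expression coefficients δ are my naming, chosen to match common usage), the geometry can be written as:

```latex
v(\alpha, \delta) \;=\; a_{\mathrm{geo}}
  + \sum_{k=1}^{N_{\alpha}} \alpha_k \, b_k^{\mathrm{geo}}
  + \sum_{k=1}^{N_{\delta}} \delta_k \, b_k^{\mathrm{exp}}
```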

Here, a_{geo} ∈ R^(3N) stores the average facial geometry. The geometry basis vectors b_k have been computed by applying principal component analysis (PCA) to 200 high-quality face scans, and the expression basis vectors have been obtained in the same manner from blendshapes.

2.2.2 Image Formation Model

To render synthetic head images, a full perspective camera model is assumed that maps model-space 3D points v via camera space to 2D points on the image plane. The perspective mapping Π consists of multiplication by the camera intrinsics followed by the perspective division.

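A minimal sketch of such a perspective mapping Π, assuming a pinhole intrinsics matrix K and points already transformed into camera space; the function and variable names are illustrative, not taken from the paper.

```python
import numpy as np

def project(points_cam: np.ndarray, K: np.ndarray) -> np.ndarray:
    """Map camera-space 3D points of shape (N, 3) to 2D image-plane points (N, 2):
    multiply by the camera intrinsics, then apply the perspective division."""
    homogeneous = points_cam @ K.T               # rows: [fx*X + cx*Z, fy*Y + cy*Z, Z]
    return homogeneous[:, :2] / homogeneous[:, 2:3]
```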

In addition, based on a distant illumination assumption, spherical harmonics basis functions are used to approximate the incoming radiance B from the environment.

Equation 4. Spherical harmonics basis functions are used to approximate the incoming radiance B from the environment
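
Equation 4 presumably follows the standard spherical harmonics approximation; a reconstruction from the surrounding definitions (with y_b denoting the SH basis functions) would read:

```latex
B(r_i, n_i \mid \gamma) \;=\; r_i \cdot \sum_{b=1}^{B^2} \gamma_b \, y_b(n_i)
```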

Here, B is the number of spherical harmonics bands, ɣ_b are the spherical harmonics coefficients, and r_i and n_i are the reflectance and unit normal vector of the i-th vertex, respectively.

2.3 Synthetic Conditioning Input

Using the face reconstruction approach described above, a face is reconstructed in each frame of the source and target videos. Next, the rigid head pose, expression, and eye gaze of the target actor are modified. All parameters are copied in a relative manner from the source to the target.

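"Relative" transfer is not spelled out in detail here; a plausible reading, sketched below, is that the source's per-frame deviation from a reference (e.g., neutral) frame is applied on top of the target's own reference parameters. This is an illustration of the idea, not the authors' exact formulation.

```python
import numpy as np

def transfer_relative(source_t: np.ndarray,
                      source_ref: np.ndarray,
                      target_ref: np.ndarray) -> np.ndarray:
    """Apply the source's deviation from its reference frame to the target's
    reference parameters (e.g., head translation or expression coefficients)."""
    return target_ref + (source_t - source_ref)
```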

Then the authors render synthetic conditioning images of the target actor’s face model under the modified parameters using hardware rasterization.

For each frame, three different conditioning inputs are generated: a color rendering, a correspondence image, and an eye gaze image.

Figure 3. The synthetic input used for conditioning the rendering-to-video translation network: (1) colored face rendering under target illumination, (2) correspondence image, and (3) the eye gaze image

The color rendering shows the modified target actor model under the estimated target illumination, while keeping the target identity (geometry and skin reflectance) fixed. This image provides a good starting point for the following rendering-to-video translation, since in the face region only the delta to a real image has to be learned.

A correspondence image encoding the index of the parametric model’s vertex that projects into each pixel is also rendered to keep the 3D information.

Finally, a gaze map provides information about the eye gaze direction and blinking.

All of the images are stacked to obtain the input to the rendering-to-video translation network.

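A sketch of how the three per-frame conditioning images might be stacked into a single network input. The channel layout, and the idea of stacking a short temporal window of frames, are assumptions for illustration rather than the paper's exact tensor format.

```python
import numpy as np

def build_conditioning_input(color_frames, correspondence_frames, gaze_frames):
    """Stack a temporal window of conditioning images, each of shape (H, W, 3),
    along the channel axis, yielding an (H, W, 9 * window) tensor."""
    per_frame = [np.concatenate([c, m, g], axis=-1)   # (H, W, 9) per time step
                 for c, m, g in zip(color_frames, correspondence_frames, gaze_frames)]
    return np.concatenate(per_frame, axis=-1)         # (H, W, 9 * window)
```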

2.4 Rendering-to-Video Translation

The generated space-time stack of conditioning images is the input to the rendering-to-video translation network.

The network learns to convert the synthetic input into full frames of a photo-realistic target video, in which the target actor now mimics the head motion, facial expression, and eye gaze of the synthetic input.

The network learns to synthesize the entire actor in the foreground, i.e., not only the face, for which conditioning input exists, but also all other parts of the actor, such as hair and body, so that they comply with the target head pose.

It also synthesizes the appropriately modified and filled-in background, even including some consistent lighting effects between the foreground and background.

The network shown in Figure 4 follows an encoder-decoder architecture and is trained in an adversarial manner.

Figure 4. The rendering-to-video translation network follows an encoder-decoder architecture

The training objective function comprises a conditional adversarial loss and an L1 photometric loss.

Equation 5. Rendering-To-Video Translation objective function
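
Equation 5 is presumably the usual conditional-GAN objective combined with the L1 term, roughly of the form below, where λ weights the photometric loss; this is a reconstruction based on the description, not a verbatim copy of the paper:

```latex
T^{*} \;=\; \arg\min_{T} \max_{D} \; E_{\mathrm{cGAN}}(T, D) \;+\; \lambda \, E_{L_1}(T)
```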

During adversarial training, the discriminator D tries to get better at classifying given images as real or synthetic, while the transformation network T tries to improve in fooling the discriminator. The L1 loss penalizes the distance between the synthesized image T(x) and the ground truth image Y, which encourages the sharpness of the synthesized output:

Equation 6. L1 loss
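
A minimal PyTorch-style sketch of the two training terms, assuming a conditioning input x, a ground-truth frame y, a translation network T, and a conditional discriminator D(x, ·). The networks themselves are omitted, and both the names and the weight lam are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

def translation_loss(T, D, x, y, lam=100.0):
    """Conditional adversarial loss (generator side) plus the L1 photometric loss."""
    fake = T(x)                                    # synthesized frame
    pred_fake = D(x, fake)                         # discriminator score for the fake
    adv = F.binary_cross_entropy_with_logits(
        pred_fake, torch.ones_like(pred_fake))     # T tries to fool D
    l1 = F.l1_loss(fake, y)                        # distance to the ground-truth frame
    return adv + lam * l1

def discriminator_loss(T, D, x, y):
    """D learns to classify real frames as real and synthesized frames as fake."""
    with torch.no_grad():
        fake = T(x)
    pred_real, pred_fake = D(x, y), D(x, fake)
    return (F.binary_cross_entropy_with_logits(pred_real, torch.ones_like(pred_real))
            + F.binary_cross_entropy_with_logits(pred_fake, torch.zeros_like(pred_fake)))
```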

3. Experiments & Results

This approach enables us to take full control of the rigid head pose, facial expression, and eye motion of a target actor in a video portrait, thus opening up a wide range of video rewrite applications.

3.1 Reenactment Under Full Head Control

This approach is the first that can photo-realistically transfer the full 3D head pose (spatial position and rotation), facial expression, as well as eye gaze and eye blinking of a captured source actor to a target actor video.

Figure 5 shows some examples of full-head reenactment between different source and target actors. Here, the authors use the full target video for training and the source video as the driving sequence.

As can be seen, the output of their approach achieves a high level of realism and faithfully mimics the driving sequence, while still retaining the mannerisms of the original target actor.

Figure 5. Qualitative results of full-head reenactment

3.2 Facial Reenactment and Video Dubbing

Besides full-head reenactment, the approach also enables facial reenactment. In this experiment, the authors replaced the expression coefficients of the target actor with those of the source actor before synthesizing the conditioning input to the rendering-to-video translation network.

Here, the head pose and position and eye gaze remain unchanged. Figure 6 shows facial reenactment results.

Figure 6. Facial reenactment results

Video dubbing could also be applied by modifying the facial motion of actors who originally spoke in another language to match a translation spoken by a professional dubbing actor in a dubbing studio.

More precisely, the captured facial expressions of the dubbing actor could be transferred to the target actor, while leaving the original target gaze and eye blinks intact.

4. Discussion

In this post, I presented Deep Video Portraits, a novel approach that enables photo-realistic re-animation of portrait videos using only an input video.

In contrast to existing approaches that are restricted to manipulations of facial expressions only, the authors are the first to transfer the full 3D head position, head rotation, face expression, eye gaze, and eye blinking from a source actor to a portrait video of a target actor.

The authors have shown, through experiments and a user study, that their method outperforms prior work, both in terms of model performance and expanded capabilities. This opens doors to many applications, like video reenactment for virtual reality and telepresence, interactive video editing, and visual dubbing.

5. Conclusions

As always, if you have any questions or comments feel free to leave your feedback below or you can always reach me on LinkedIn.

Till then, see you in the next post!
