Paper reading: Synthesizing Obama: Learning Lip Sync from Audio

Contents

Audio to landmarks
Facial texture synthesis
- Candidate frame selection
- Weighted median texture synthesis
- Teeth proxy

Terms used in the audio-to-video part:

stock video footage: the many hours of online weekly address video
source: the input audio track
target video: the stock video clip into which we composite the synthesized mouth region

Audio to landmarks

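Audio processing:

Given 16 kHz mono audio, normalize the volume using RMS-based normalization in ffmpeg.
Take the discrete Fourier transform of every 25 ms sliding window, with a 10 ms sampling interval.
Apply 40 triangular Mel-scale filters to the Fourier power spectrum and take the logarithm of the outputs.
Apply the discrete cosine transform to reduce the dimensionality to a 13-D vector.

The final output is a 28-D vector: the 13-D vector plus the log mean energy (14 values), together with their first temporal derivatives (another 14).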
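The steps above can be sketched in NumPy as follows. This is only a minimal illustration, not the authors' code: the Hamming window, the 512-point FFT, the exact definition of "log mean energy", and the use of `np.gradient` for the derivative are all assumptions, and the function names are hypothetical. Volume normalization with ffmpeg is assumed to have happened already.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular filters spaced evenly on the Mel scale from 0 Hz to sr / 2."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):          # rising edge
            fb[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):         # falling edge
            fb[m - 1, k] = (right - k) / max(right - center, 1)
    return fb

def audio_features(signal, sr=16000, win_ms=25, hop_ms=10,
                   n_filters=40, n_ceps=13, n_fft=512):
    """Return one 28-D feature vector per 10 ms step."""
    signal = np.asarray(signal, dtype=float)
    win = int(sr * win_ms / 1000)              # 400 samples at 16 kHz
    hop = int(sr * hop_ms / 1000)              # 160 samples at 16 kHz
    n_frames = 1 + (len(signal) - win) // hop
    idx = np.arange(win)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(win)     # window choice is an assumption

    eps = 1e-10                                # avoids log(0)
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    log_mel = np.log(power @ mel_filterbank(n_filters, n_fft, sr).T + eps)

    # DCT-II written out by hand, keeping the first n_ceps coefficients.
    n = np.arange(n_filters)
    basis = np.cos(np.pi / n_filters * (n[None, :] + 0.5)
                   * np.arange(n_ceps)[:, None])
    mfcc = log_mel @ basis.T                   # (n_frames, 13)

    log_energy = np.log(np.mean(frames ** 2, axis=1) + eps)[:, None]
    static = np.hstack([mfcc, log_energy])     # (n_frames, 14)
    delta = np.gradient(static, axis=0)        # first temporal derivative
    return np.hstack([static, delta])          # (n_frames, 28)
```

With a 10 ms hop, one second of audio gives roughly 100 feature vectors, which is the 100 Hz rate the mouth shapes are later matched to.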

Face processing:

Frontalize Obama's face in every frame, using the 2014 paper Total Moving Face Reconstruction.
Then detect the mouth landmarks: 18 points, i.e. 36 numbers, which are reduced by PCA to 20 coefficients.
Finally, we temporally upsample the mouth shape from 30Hz to 100Hz by linearly interpolating PCA coefficients, to match the audio sampling rate. Note that this upsampling is only used for training; we generate the final video at 30Hz.

The audio features are spaced 10 ms apart, i.e. 100 per second, whereas the video only provides 30 mouth shapes per second. Upsampling the mouth shapes to 100 Hz, with the in-between values obtained by interpolation, puts them in one-to-one correspondence with the audio features.
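A minimal sketch of this upsampling step, assuming the PCA coefficients are stored as an array with one row per video frame. The function name `upsample_coeffs` and the way the end of the clip is handled are my own choices, not taken from the paper.

```python
import numpy as np

def upsample_coeffs(coeffs, src_hz=30.0, dst_hz=100.0):
    """Linearly interpolate per-frame PCA coefficients onto a denser time grid."""
    coeffs = np.asarray(coeffs, dtype=float)
    t_src = np.arange(coeffs.shape[0]) / src_hz           # timestamps of video frames
    n_dst = int(np.floor(t_src[-1] * dst_hz + 1e-9)) + 1  # samples up to the last frame
    t_dst = np.arange(n_dst) / dst_hz                     # timestamps at the audio rate
    cols = [np.interp(t_dst, t_src, coeffs[:, k]) for k in range(coeffs.shape[1])]
    return np.stack(cols, axis=1)

# 31 frames at 30 Hz span exactly one second, giving 101 samples at 100 Hz.
mouth = np.zeros((31, 20))
print(upsample_coeffs(mouth).shape)  # (101, 20)
```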

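RNN processing:

An important point: audio and mouth motion are not perfectly synchronized. For example, when you open your mouth to say "ah", the mouth may start opening slightly before the sound comes out. The network therefore needs to see some future audio: if it can see the "ah" a few tens of milliseconds ahead, it knows the mouth should already be opening.

So an important step here is to add a time delay: the network output is shifted forward in time, i.e. a delay is added to the output. The input is the current audio, but the landmarks predicted at that step are slightly delayed. This works quite well.
The time-delayed RNN in the illustration has a delay of 2. Online this is usually called target delay, which took me a long time to find out.
[Figure: time-delayed RNN with a delay of 2]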

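One question here: with the delay, input and output no longer correspond one to one, yet the audio and video sampling rates were matched earlier, so how is this handled?
Is the output sequence the same length, except that the first two outputs are skipped and the later ones are used?
The cell state c is 60-dimensional and the time delay is 20 steps (200 ms), i.e. the predicted mouth shape lags the audio by 200 ms.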
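One common way to realise a target delay in a sequence-to-sequence setup is to keep input and output the same length, train the output at step t against the target from step t - d, and ignore the first d outputs in the loss. The sketch below shows only that bookkeeping. It is an assumption about how such a delay is usually implemented, not something confirmed by the paper, and `delay_targets` is a hypothetical helper.

```python
import numpy as np

def delay_targets(inputs, targets, delay):
    """Pair the output at step t with the target from step t - delay.

    Returns the shifted targets and a boolean mask that is False for the
    first `delay` steps, which have no valid target.
    """
    n = len(inputs)
    assert n == len(targets) and 0 <= delay < n
    delayed = np.zeros_like(targets)
    mask = np.zeros(n, dtype=bool)
    delayed[delay:] = targets[: n - delay]
    mask[delay:] = True
    return delayed, mask

audio = np.zeros((10, 28))                    # 10 steps of 28-D audio features
mouth = np.arange(10.0).reshape(10, 1)        # mouth target at step t is just t
delayed, mask = delay_targets(audio, mouth, delay=2)
print(delayed[:, 0])  # [0. 0. 0. 1. 2. 3. 4. 5. 6. 7.]
print(mask.sum())     # 8
```

Under this reading, the one-to-one correspondence between audio and mouth samples is kept; only the first d outputs are discarded.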

I understood this part after reading the source code, from train.py of Obama Net:

timeDelay = 20
lookBack = 10
n_epoch = 20
n_videos = 12

for key in tqdm(keys[0: n_videos]):
    audio = audioKp[key]
    video = videoKp[key]
    if (len(audio) > len(video)):
        audio = audio[0: len(video)]
    else:
        video = video[0: len(audio)]
    start = (timeDelay - lookBack) if (timeDelay - lookBack > 0) else 0
    for i in range(start, len(audio) - lookBack):
        a = np.array(audio[i: i + lookBack])
        v = np.array(video[i + lookBack - timeDelay]).reshape((1, -1))
        X.append(a)
        y.append(v)

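Here audio is a sequence but the landmarks are a single frame, i.e. one audio sequence predicts one set of landmarks. The landmark frame also sits earlier in time than the audio: the window audio[i: i + lookBack] is paired with frame i + lookBack - timeDelay, so the target lies timeDelay steps before the (exclusive) end of the window.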
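To make the indexing concrete, here is a self-contained version of the same pairing loop run on toy data where every audio step and every video frame simply stores its own index. `make_pairs` is a hypothetical wrapper around the loop from train.py, and it omits the reshape of each target.

```python
import numpy as np

def make_pairs(audio, video, look_back, time_delay):
    """Pair each audio window with the landmark frame time_delay steps
    before the (exclusive) end of that window."""
    n = min(len(audio), len(video))           # trim both to the same length
    audio, video = audio[:n], video[:n]
    start = max(time_delay - look_back, 0)    # keeps the video index non-negative
    X, y = [], []
    for i in range(start, n - look_back):
        X.append(audio[i: i + look_back])
        y.append(video[i + look_back - time_delay])
    return np.array(X), np.array(y)

audio = np.arange(100).reshape(100, 1)        # audio step t holds the value t
video = np.arange(100).reshape(100, 1)        # video frame t holds the value t
X, y = make_pairs(audio, video, look_back=10, time_delay=20)
print(X.shape, y.shape)        # (80, 10, 1) (80, 1)
print(X[0, -1, 0], y[0, 0])    # 19 0
```

The first sample pairs audio steps 10 to 19 with landmark frame 0, so the network sees audio that is up to 19 steps (about 190 ms at 100 Hz) later than the mouth shape it has to predict.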

from keras.models import Sequential
from keras.layers import LSTM, Dropout, Dense

model = Sequential()
model.add(LSTM(25, input_shape = (lookBack, 39)))
model.add(Dropout(0.25))
model.add(Dense(8))

This is written in Keras. The output is 8-dimensional, and after some PCA processing it is turned back into the usual 40 values.
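The round trip between 40 landmark values and 8 PCA coefficients can be sketched with plain NumPy. This is a generic PCA via SVD under my own assumptions; how Obama Net actually fits and stores its PCA is not shown in the excerpt above, and the helper names are hypothetical.

```python
import numpy as np

def fit_pca(data, n_components):
    """Return the mean and the top principal directions of `data` (rows = samples)."""
    mean = data.mean(axis=0)
    _, _, vt = np.linalg.svd(data - mean, full_matrices=False)
    return mean, vt[:n_components]

def project(x, mean, components):
    """Landmark values -> PCA coefficients (what the network is trained to predict)."""
    return (x - mean) @ components.T

def reconstruct(coeffs, mean, components):
    """PCA coefficients -> landmark values (applied to the network output)."""
    return coeffs @ components + mean

# Toy landmark data: 200 frames, 40 values each, lying in an 8-D subspace.
rng = np.random.default_rng(0)
landmarks = rng.normal(size=(200, 8)) @ rng.normal(size=(8, 40)) + 5.0
mean, comps = fit_pca(landmarks, n_components=8)
coeffs = project(landmarks, mean, comps)
print(coeffs.shape, reconstruct(coeffs, mean, comps).shape)  # (200, 8) (200, 40)
```

On real landmarks the reconstruction is only approximate, since 8 components cannot capture all the variation in 40 values.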

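Facial texture synthesis

Given landmarks, how do we get to a face? Some methods replace the original mouth region with the predicted one and then synthesize the result with pix2pix. But there is an obvious problem: if details such as the teeth are not available at test time, how can the network output them? Let's see how this paper does it.
Because this paper mainly synthesizes the mouth keypoints, what gets synthesized is mainly the lower face: the mouth, chin, cheeks, and the area around the nose and mouth. See the mask below:

a is the region without the shirt; adding the neck and shirt parts, with mouth being the synthesized part, gives the final composited frame.

How is this mouth region synthesized? Let's go through it step by step.

The authors write: "Instead, we propose an approach that combines weighted median and high frequencies from a teeth proxy." In other words, a weighted median texture is combined with high-frequency detail taken from a teeth proxy.

Now we have the mouth keypoints and the target video. The algorithm is outlined as follows:

Per mouth PCA shape, select a fixed number of target video frames that best match the given shape (for example, compute the average distance between mouth shapes and pick the candidates with the smallest distance).
Apply weighted median on the candidates to synthesize a median texture.
Select teeth proxy frames from the target video, and transfer high-frequency teeth details from the proxy into the teeth region of the median texture. Each step is described below.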
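The first two steps of this outline can be sketched as a nearest-neighbour search followed by a per-pixel weighted median. This is an illustration under my own assumptions: the L2 distance on PCA coefficients, the Gaussian weighting with a hand-picked `sigma`, and all function names are hypothetical and may differ from what the paper does.

```python
import numpy as np

def select_candidates(target_shapes, query_shape, k):
    """Indices and distances of the k target frames whose mouth shape is closest."""
    dists = np.linalg.norm(target_shapes - query_shape, axis=1)
    idx = np.argsort(dists)[:k]
    return idx, dists[idx]

def gaussian_weights(dists, sigma):
    """Closer candidates get larger weights (the weighting scheme is an assumption)."""
    return np.exp(-dists ** 2 / (2.0 * sigma ** 2))

def weighted_median(values, weights):
    """Per-pixel weighted median over a stack of candidates.

    values: (k, ...) array, one texture per candidate; weights: (k,) array.
    For each pixel, returns the smallest candidate value whose cumulative
    weight (in sorted order) reaches half of the total weight.
    """
    order = np.argsort(values, axis=0)
    sorted_vals = np.take_along_axis(values, order, axis=0)
    cum = np.cumsum(weights[order], axis=0)
    pick = np.argmax(cum >= 0.5 * cum[-1], axis=0)   # first index reaching half
    return np.take_along_axis(sorted_vals, pick[None, ...], axis=0)[0]

# Toy example: 50 target frames with 20-D mouth shapes and 4x4 grey textures.
rng = np.random.default_rng(0)
shapes = rng.normal(size=(50, 20))
textures = rng.uniform(size=(50, 4, 4))
idx, dists = select_candidates(shapes, shapes[3], k=5)
median_texture = weighted_median(textures[idx], gaussian_weights(dists, sigma=1.0))
print(idx[0], median_texture.shape)  # 3 (4, 4)
```

A median, unlike a weighted mean, always returns a value that actually occurs in one of the candidates, which helps explain why it blurs less than averaging.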

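Candidate frame selection

Given the generated mouth shape, select the best-matching frames. The method is as follows:

Detect landmarks on the target video, and at the same time estimate the 3D pose and frontalize, because the earlier landmarks were frontalized. The 3D face model is computed with the paper Total Moving Face Reconstruction, which is in fact the authors' own earlier work, and is augmented with a rough approximation of the chin and background. The authors say this augmentation significantly improves the frontalization result, but none of this is open-sourced.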
