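ECCV Study Notes (1): A Roundup of ECCV 2018

I. Download Links

All papers: http://openaccess.thecvf.com/ECCV2018.py

PDFs of all papers can be downloaded from the link below:

Link: https://pan.baidu.com/s/1oEEgEdnYsPUiqdAJEHZpdQ
Password: hdpm

Related paper walkthroughs: https://blog.csdn.net/Extremevision/article/details/81875068

II. Oral Presentations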

Convolutional Networks with Adaptive Computation Graphs
Progressive Neural Architecture Search
Diverse Image-to-Image Translation via Disentangled Representations
Lifting Layers: Analysis and Applications
Learning with Biased Complementary Labels
Light Structure from Pin Motion: Simple and Accurate Point Light Calibration for Physics-based Modeling
Programmable Light Curtains
Learning to Separate Object Sounds by Watching Unlabeled Video
Coded Two-Bucket Cameras for Computer Vision
Materials for Masses: SVBRDF Acquisition with a Single Mobile Phone Image
End-to-End Joint Semantic Segmentation of Actors and Actions in Video
Learning-based Video Motion Magnification
Massively Parallel Video Networks
DeepWrinkles: Accurate and Realistic Clothing Modeling
Learning Discriminative Video Representations Using Adversarial Perturbations

Scaling Egocentric Vision: The EPIC-KITCHENS Dataset
Unsupervised Person Re-identification by Deep Learning Tracklet Association
Predicting Gaze in Egocentric Video by Learning Task-dependent Attention Transition
Instance-level Human Parsing via Part Grouping Network
Adversarial Geometry-Aware Human Motion Prediction
Weakly-supervised 3D Hand Pose Estimation from Monocular RGB Images
Audio-Visual Scene Analysis with Self-Supervised Multisensory Features
Jointly Discovering Visual Objects and Spoken Words from Raw Sensory Input
DeepIM: Deep Iterative Matching for 6D Pose Estimation
Implicit 3D Orientation Learning for 6D Object Detection from RGB Images
Direct Sparse Odometry With Rolling Shutter
3D Motion Sensing from 4D Light Field Gradients
A Style-Aware Content Loss for Real-time HD Style Transfer
Scale-Awareness of Light Field Camera based Visual Odometry
Burst Image Deblurring Using Permutation Invariant Convolutional Neural Networks

MVSNet: Depth Inference for Unstructured Multi-view Stereo
PlaneMatch: Patch Coplanarity Prediction for Robust RGB-D Registration
ActiveStereoNet: End-to-End Self-Supervised Learning for Active Stereo Systems
GAL: Geometric Adversarial Loss for Single-View 3D-Object Reconstruction
Deep Virtual Stereo Odometry: Leveraging Deep Depth Prediction for Monocular Direct Sparse Odometry
Unsupervised Geometry-Aware Representation for 3D Human Pose Estimation
Dual-Agent Deep Reinforcement Learning for Deformable Face Tracking
Deep Autoencoder for Combined Human Pose Estimation and Body Model Upscaling
Occlusion-aware Hand Pose Estimation Using Hierarchical Mixture Density Network
GANimation: Anatomically-aware Facial Animation from a Single Image
Deterministic Consensus Maximization with Biconvex Programming
Robust fitting in computer vision: easy or hard?
Highly-Economized Multi-View Binary Compression for Scalable Image Clustering
Efficient Semantic Scene Completion Network with Spatial Group Convolution
Asynchronous, Photometric Feature Tracking using Events and Frames

Group Normalization
Deep Matching Autoencoder
Deep Expander Networks: Efficient Deep Networks from Graph Theory
Towards Realistic Predictors
Learning SO(3) Equivariant Representations with Spherical CNNs
CornerNet: Detecting Objects as Paired Keypoints
RelocNet: Continuous Metric Learning Relocalisation using Neural Nets
The Contextual Loss for Image Transformation with Non-Aligned Data
Acquisition of Localization Confidence for Accurate Object Detection
Deep Model-Based 6D Pose Refinement in RGB
DeepTAM: Deep Tracking and Mapping
ContextVP: Fully Context-Aware Video Prediction
Saliency Benchmarking Made Easy: Separating Models, Maps and Metrics
Museum Exhibit Identification Challenge for the Supervised Domain Adaptation
Multi-Attention Multi-Class Constraint for Fine-grained Image Recognition

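III. Selected Papers and Research Directions

Semantic Segmentation

"Effective Use of Synthetic Data for Urban Scene Semantic Segmentation"

Abstract: Training a deep network to perform semantic segmentation requires large amounts of labeled data. To reduce the manual effort of annotating real images, researchers have investigated the use of synthetic data, which can be labeled automatically. Unfortunately, a network trained on synthetic data performs relatively poorly on real images. While this can be addressed by domain adaptation, existing methods all require access to real images during training. In this paper, we introduce a drastically different way to handle synthetic images that does not require seeing any real images at training time. Our approach builds on the observation that foreground and background classes are not affected in the same manner by the domain shift, and thus should be treated differently. In particular, the former should be handled in a detection-based manner to better account for the fact that, while their texture in synthetic images is not photo-realistic, their shape looks natural. Our experiments demonstrate the effectiveness of our approach on Cityscapes and CamVid, with models trained on synthetic data only.

arXiv: https://arxiv.org/abs/1807.06132

Note: domain adaptation has been a hot topic lately!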

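Stereo

"ActiveStereoNet: End-to-End Self-Supervised Learning for Active Stereo Systems"

Abstract: In this paper we present ActiveStereoNet, the first deep learning solution for active stereo systems. Due to the lack of ground truth, our method is fully self-supervised, yet it produces precise depth with a subpixel precision of 1/30th of a pixel; it does not suffer from the common over-smoothing issues; it preserves the edges; and it explicitly handles occlusions. We introduce a novel reconstruction loss that is more robust to noise and texture-less patches, and is invariant to illumination changes. The proposed loss is optimized using a window-based cost aggregation with an adaptive support weight scheme. This cost aggregation is edge-preserving and smooths the loss function, which is key to allowing the network to reach compelling results. Finally, we show how the task of predicting invalid regions, such as occlusions, can be trained end-to-end without ground truth. This component is crucial to reduce blur and particularly improves predictions along depth discontinuities. Extensive quantitative and qualitative evaluations on real and synthetic data demonstrate state-of-the-art results in many challenging scenes.

arXiv: https://arxiv.org/abs/1807.06009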

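CNN

"CBAM: Convolutional Block Attention Module"

Abstract: We propose the Convolutional Block Attention Module (CBAM), a simple yet effective attention module for feed-forward convolutional neural networks. Given an intermediate feature map, our module sequentially infers attention maps along two separate dimensions, channel and spatial, then multiplies the attention maps with the input feature map for adaptive feature refinement. Because CBAM is a lightweight and general module, it can be integrated into any CNN architecture seamlessly with negligible overhead and is end-to-end trainable along with the base CNN. We validate CBAM through extensive experiments on the ImageNet-1K, MS COCO detection, and VOC 2007 detection datasets. Our experiments show improvements in classification and detection performance across various models, demonstrating the wide applicability of CBAM. The code and models will be made publicly available.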
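
To make the "channel first, then spatial" order concrete, here is a minimal pure-Python sketch. It is not the CBAM implementation: the real module learns its attention maps, whereas this sketch assumes a fixed stand-in (a sigmoid over pooled averages) purely for illustration. The function name cbam_like_refine and the [C][H][W] nested-list layout are also assumptions.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def cbam_like_refine(feature):
    """Refine a feature map given as nested lists with shape [C][H][W].

    Assumption: attention weights come from fixed pooling + sigmoid,
    standing in for the learned layers a real attention module would use.
    """
    C, H, W = len(feature), len(feature[0]), len(feature[0][0])

    # Step 1: channel attention, one weight per channel (global average pool).
    ch_att = [
        sigmoid(sum(sum(row) for row in feature[c]) / (H * W))
        for c in range(C)
    ]
    x = [
        [[feature[c][h][w] * ch_att[c] for w in range(W)] for h in range(H)]
        for c in range(C)
    ]

    # Step 2: spatial attention, one weight per location (mean over channels),
    # computed on the channel-refined map so the two steps are sequential.
    sp_att = [
        [sigmoid(sum(x[c][h][w] for c in range(C)) / C) for w in range(W)]
        for h in range(H)
    ]
    return [
        [[x[c][h][w] * sp_att[h][w] for w in range(W)] for h in range(H)]
        for c in range(C)
    ]
```

The output has the same shape as the input, which is why a module of this kind can be dropped between existing layers of a network.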

arXiv: https://arxiv.org/abs/1807.06521

Note: a great paper; it will probably help a wave of students write their own papers (with minimal effort).

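Multi-View Reconstruction (3D Reconstruction)

"Specular-to-Diffuse Translation for Multi-View Reconstruction"

Abstract: Most multi-view 3D reconstruction algorithms, especially when shape-from-shading cues are used, assume that object appearance is predominantly diffuse. To alleviate this restriction, we introduce S2Dnet, a generative adversarial network for translating multiple views of objects with specular reflection into diffuse ones, so that multi-view reconstruction methods can be applied more effectively. Our network extends unsupervised image-to-image translation to multi-view "specular to diffuse" translation. To preserve object appearance across multiple views, we introduce a Multi-View Coherence (MVC) loss that evaluates the similarity and faithfulness of local patches after the view transformation. Our MVC loss ensures that the similarity of local correspondences among multi-view images is preserved under the image-to-image translation. As a result, our network yields significantly better results than several single-view baseline techniques. In addition, we carefully design and generate a large synthetic training dataset using physically-based rendering. During testing, our network takes only the raw glossy images as input, without extra information such as segmentation masks or lighting estimation. Results demonstrate that multi-view reconstruction can be significantly improved using the images filtered by our network. We also show promising performance on real-world training and testing data.

arXiv: https://arxiv.org/abs/1807.05439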

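Data Augmentation

"Modeling Visual Context is Key to Augmenting Object Detection Datasets"

Abstract: Data augmentation for deep neural networks is known to be important for training visual recognition systems. By artificially increasing the number of training examples, it helps reduce overfitting and improves generalization. For object detection, classical approaches to data augmentation consist of generating images obtained by basic geometrical transformations and color changes of the original training images. In this work, we go one step further and leverage segmentation annotations to increase the number of object instances present in the training data. For this approach to be successful, we show that modeling the visual context surrounding objects appropriately is crucial for placing them in the right environment. Otherwise, we show that the previous strategy actually hurts. With our context model, we achieve significant mean average precision improvements when few labeled examples are available on the VOC'12 benchmark.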
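
For reference, the "classical" augmentation the abstract mentions (basic geometric transformations and color changes) can be sketched as follows. This illustrates only that baseline, not the paper's context model. The function names, the grayscale nested-list image, and the (x_min, y_min, x_max, y_max) box convention with an exclusive x_max are all assumptions made for the example.

```python
def hflip_with_boxes(image, boxes):
    """Horizontally flip an image and its bounding boxes.

    image: list of rows, each a list of pixel values.
    boxes: list of (x_min, y_min, x_max, y_max) in pixels, x_max exclusive
           (an assumed convention for this sketch).
    """
    width = len(image[0])
    flipped = [row[::-1] for row in image]
    # Mirroring swaps the roles of the left and right edges of each box.
    flipped_boxes = [
        (width - x_max, y_min, width - x_min, y_max)
        for (x_min, y_min, x_max, y_max) in boxes
    ]
    return flipped, flipped_boxes


def adjust_brightness(image, factor):
    """Scale pixel values by `factor`, clamped to the 8-bit range [0, 255]."""
    return [[min(255, max(0, round(p * factor))) for p in row] for row in image]
```

In a detection setting the key point is that a geometric transformation must be applied to the annotations as well as the pixels, while a color change leaves the boxes untouched.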

arXiv: https://arxiv.org/abs/1807.07428

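Face Recognition

"From Face Recognition to Models of Identity: A Bayesian Approach to Learning about Unknown Identities from Unsupervised Data"

Abstract: Current face recognition systems robustly recognize identities across a wide variety of imaging conditions. In these systems, recognition is performed via classification into known identities obtained from supervised identity annotations. There are two problems with this current paradigm: (1) current systems are unable to benefit from unlabelled data which may be available in large quantities; and (2) current systems equate successful recognition with labelling a given input image. Humans, on the other hand, regularly identify individuals with no supervision at all, recognising someone they have seen before even without being able to name that person. How can we go beyond the current classification paradigm towards a more human understanding of identities? We propose an integrated Bayesian model that coherently reasons about the observed images, identities, partial knowledge about names, and the situational context of each observation. Our model not only achieves good recognition performance on known identities, it can also discover new identities from unsupervised data and learn to associate identities with different contexts depending on which identities tend to be observed together. In addition, the proposed semi-supervised component is able to handle not only acquaintances, whose names are known, but also unlabelled familiar faces and complete strangers in a unified framework.

arXiv: https://arxiv.org/abs/1807.07872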

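Instance Segmentation

"Semi-convolutional Operators for Instance Segmentation"

Abstract: Object detection and instance segmentation are dominated by region-based methods such as Mask RCNN. However, there is growing interest in reducing these problems to pixel labeling tasks, as the latter could be more efficient, could be integrated seamlessly into the image-to-image network architectures used in many other tasks, and could be more accurate for objects that are not well approximated by bounding boxes. In this paper we show theoretically and empirically that constructing dense pixel embeddings that can separate object instances cannot be easily achieved using convolutional operators. At the same time, we show that simple modifications, which we call semi-convolutional, perform better on this task. We show that these operators can also be used to improve approaches such as Mask RCNN, demonstrating better segmentation of complex biological shapes and PASCAL VOC categories than achievable by Mask RCNN alone.

arXiv: https://arxiv.org/abs/1807.10712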

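Super Resolution

"CrossNet: An End-to-end Reference-based Super Resolution Network using Cross-scale Warping"

Abstract: Reference-based Super-resolution (RefSR) super-resolves a low-resolution (LR) image given an external high-resolution (HR) reference image, where the reference image and the LR image share a similar viewpoint but have a significant resolution gap (8x). Existing RefSR methods work in a cascaded way, such as patch matching followed by a synthesis pipeline with two independently defined objective functions, leading to inter-patch misalignment, grid effects, and inefficient optimization. To resolve these issues, we present CrossNet, an end-to-end and fully convolutional deep neural network using cross-scale warping. Our network contains image encoders, cross-scale warping layers, and a fusion decoder: the encoder extracts multi-scale features from both the LR and the reference images; the cross-scale warping layers spatially align the reference feature map with the LR feature map; the decoder finally aggregates feature maps from both domains to synthesize the HR output. Using cross-scale warping, our network is able to perform spatial alignment at pixel level in an end-to-end fashion, which improves on existing schemes in both precision (around 2dB-4dB) and efficiency (more than 100 times faster).

arXiv: https://arxiv.org/abs/1807.10547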

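Crowd Counting

"Iterative Crowd Counting"

Abstract: In this work, we tackle the problem of crowd counting in images. We present a Convolutional Neural Network (CNN) based density estimation approach to solve this problem. Predicting a high-resolution density map in one go is a challenging task. Hence, we present a two-branch CNN architecture for generating high-resolution density maps, where the first branch generates a low-resolution density map, and the second branch incorporates the low-resolution prediction and feature maps from the first branch to generate a high-resolution density map. We also propose a multi-stage extension of our approach where each stage in the pipeline utilizes the predictions from all the previous stages. Empirical comparison with previous state-of-the-art crowd counting methods shows that our method achieves the lowest mean absolute error on three challenging crowd counting benchmarks: Shanghaitech, WorldExpo'10, and UCF datasets.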
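
As a small aid to reading the evaluation, the sketch below shows how a count is usually read off a density map (by summing it) and how mean absolute error over a test set is computed. It does not reproduce the two-branch network; the function names and the toy numbers are assumptions for illustration.

```python
def count_from_density(density_map):
    """Estimated head count = sum of all values in the density map."""
    return sum(sum(row) for row in density_map)


def mean_absolute_error(pred_counts, true_counts):
    """Average absolute difference between predicted and true counts."""
    return sum(abs(p - t) for p, t in zip(pred_counts, true_counts)) / len(true_counts)


# Toy example: a 2x2 density map whose values sum to one person.
density = [[0.25, 0.25], [0.5, 0.0]]
print(count_from_density(density))                 # 1.0
print(mean_absolute_error([10, 20], [12, 17]))     # 2.5
```

Because the count is an integral over the map, a low-resolution and a high-resolution density map can describe the same count; the higher resolution only localizes people more precisely.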

arXiv: https://arxiv.org/abs/1807.09959

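Depth Prediction

"StereoNet: Guided Hierarchical Refinement for Real-Time Edge-Aware Depth Prediction"

Abstract: This paper presents StereoNet, the first end-to-end deep architecture for real-time stereo matching that runs at 60 fps on an NVidia Titan X, producing high-quality, edge-preserved, quantization-free disparity maps. A key insight of this paper is that the network achieves a sub-pixel matching precision that is an order of magnitude higher than that of traditional stereo matching approaches. This allows us to achieve real-time performance by using a very low resolution cost volume that encodes all the information needed to achieve high disparity precision. Spatial precision is achieved by employing a learned edge-aware upsampling function. Our model uses a Siamese network to extract features from the left and right images. A first estimate of the disparity is computed in a very low resolution cost volume; the model then hierarchically re-introduces high-frequency details through a learned upsampling function that uses compact pixel-to-pixel refinement networks. Leveraging color input as a guide, this function is capable of producing high-quality edge-aware output. We achieve compelling results on multiple benchmarks.

arXiv: https://arxiv.org/abs/1807.08865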

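3D Face

"Generating 3D faces using Convolutional Mesh Autoencoders"

Abstract: Learned 3D representations of human faces are useful for computer vision problems such as 3D face tracking and reconstruction from images, as well as graphics applications such as character generation and animation. Traditional models learn a latent representation of a face using linear subspaces or higher-order tensor generalizations. Due to this linearity, they cannot capture extreme deformations and non-linear expressions. To address this, we introduce a versatile model that learns a non-linear representation of a face using spectral convolutions on a mesh surface. We introduce mesh sampling operations that enable a hierarchical mesh representation that captures non-linear variations in shape and expression at multiple scales within the model. In a variational setting, our model samples diverse realistic 3D faces from a multivariate Gaussian distribution. Our training data consists of 20,466 meshes of extreme expressions captured over 12 different subjects. Despite limited training data, our trained model outperforms state-of-the-art face models with 50% lower reconstruction error, while using 75% fewer parameters. We also show that replacing the expression space of an existing state-of-the-art face model with our autoencoder achieves a lower reconstruction error.

arXiv: https://arxiv.org/abs/1807.10267

GitHub: https://github.com/anuragranj/coma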

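Image Captioning

"Recurrent Fusion Network for Image Captioning"

Abstract: Recently, much progress has been made in image captioning, and an encoder-decoder framework has been adopted by all the state-of-the-art models. Under this framework, an input image is encoded by a convolutional neural network (CNN) and then translated into natural language with a recurrent neural network (RNN). The existing models relying on this framework employ only one kind of CNN, e.g., ResNet or Inception-X, which describes the image contents from only one specific viewpoint. Thus, the semantic meaning of the input image cannot be comprehensively understood, which restricts captioning performance. In this paper, to exploit the complementary information from multiple encoders, we propose a novel recurrent fusion network (RFNet) for the image captioning task. The fusion process in our model can exploit the interactions among the outputs of the image encoders and then generate new compact yet informative representations for the decoder. Experiments on the MSCOCO dataset demonstrate the effectiveness of our proposed RFNet, which sets a new state of the art for image captioning.

arXiv: https://arxiv.org/abs/1807.09986

Note: image captioning is quite interesting! A perfect combination of CNNs and RNNs.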

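Image-to-Image Translation

"Diverse Image-to-Image Translation via Disentangled Representations"

Abstract: Image-to-image translation aims to learn the mapping between two visual domains. There are two main challenges for many applications: 1) the lack of aligned training pairs and 2) multiple possible outputs from a single input image. In this work, we present an approach based on disentangled representation for producing diverse outputs without paired training images. To achieve diversity, we propose to embed images onto two spaces: a domain-invariant content space capturing shared information across domains and a domain-specific attribute space. Our model takes the encoded content features extracted from a given input and the attribute vectors sampled from the attribute space to produce diverse outputs at test time. To handle unpaired training data, we introduce a novel cross-cycle consistency loss based on disentangled representations. Qualitative results show that our model can generate diverse and realistic images on a wide range of tasks without paired training data. For quantitative comparisons, we measure realism with a user study and diversity with a perceptual distance metric. We apply the proposed model to domain adaptation and show competitive performance when compared to the state of the art on the MNIST-M and LineMod datasets.

arXiv: https://arxiv.org/abs/1808.00948

Homepage: http://vllab.ucmerced.edu/hylee/DRIT/

GitHub: https://github.com/HsinYingLee/DRIT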

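Object Localization

"Self-produced Guidance for Weakly-supervised Object Localization"

Abstract: Weakly supervised methods usually generate localization results based on attention maps produced by classification networks. However, the attention maps exhibit the most discriminative parts of the object, which are small and sparse. We propose to generate Self-produced Guidance (SPG) masks, which separate the foreground, the object of interest, from the background to provide the classification networks with spatial correlation information of pixels. A stagewise approach is proposed to incorporate high-confidence object regions to learn the SPG masks. The high-confidence regions within attention maps are utilized to progressively learn the SPG masks. The masks are then used as an auxiliary pixel-level supervision to facilitate the training of classification networks. Extensive experiments on ILSVRC demonstrate that SPG is effective in producing high-quality object localization maps. In particular, the proposed SPG achieves a Top-1 localization error rate of 43.83% on the ILSVRC validation set, which is a new state-of-the-art error rate.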
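
One plausible reading of "high-confidence regions within attention maps" is a simple two-threshold split of the attention map into confident foreground, confident background, and an undecided remainder. The sketch below illustrates that idea only; the thresholds (0.7 and 0.2), the label values, and the function name are assumptions, not values taken from the paper.

```python
def guidance_mask(attention, high=0.7, low=0.2):
    """Turn an attention map (values in [0, 1]) into a coarse guidance mask.

    Returns a map of the same shape with
      1  = confident foreground (attention >= high),
      0  = confident background (attention <= low),
     -1  = undecided, to be ignored when the mask is used as supervision.
    The two thresholds are hypothetical values chosen for illustration.
    """
    return [
        [1 if a >= high else 0 if a <= low else -1 for a in row]
        for row in attention
    ]


attention = [[0.9, 0.5], [0.1, 0.7]]
print(guidance_mask(attention))  # [[1, -1], [0, 1]]
```

Leaving the middle band unlabeled is what lets a mask like this serve as pixel-level supervision without forcing a decision on uncertain pixels.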

arXiv: https://arxiv.org/abs/1807.08902

VAE (Variational Autoencoder)

"MT-VAE: Learning Motion Transformations to Generate Multimodal Human Dynamics"

Abstract：Long-term human motion can be represented as a series of motion modes—motion sequences that capture short-term temporal dynamics—with transitions between them. We leverage this structure and present a novel Motion Transformation Variational Auto-Encoders (MT-VAE) for learning motion sequence generation. Our model jointly learns a feature embedding for motion modes (that the motion sequence can be reconstructed from) and a feature transformation that represents the transition of one motion mode to the next motion mode. Our model is able to generate multiple diverse and plausible motion sequences in the future from the same input. We apply our approach to both facial and full body motion, and demonstrate applications like analogy-based motion transfer and video synthesis.

arXiv: https://arxiv.org/abs/1808.04545

Visual Reasoning

"Visual Reasoning with Multi-hop Feature Modulation"

Abstract：Recent breakthroughs in computer vision and natural language processing have spurred interest in challenging multi-modal tasks such as visual question-answering and visual dialogue. For such tasks, one successful approach is to condition image-based convolutional network computation on language via Feature-wise Linear Modulation (FiLM) layers, i.e., per-channel scaling and shifting. We propose to generate the parameters of FiLM layers going up the hierarchy of a convolutional network in a multi-hop fashion rather than all at once, as in prior work. By alternating between attending to the language input and generating FiLM layer parameters, this approach is better able to scale to settings with longer input sequences such as dialogue. We demonstrate that multi-hop FiLM generation achieves state-of-the-art for the short input sequence task ReferIt — on-par with single-hop FiLM generation — while also significantly outperforming prior state-of-the-art and single-hop FiLM generation on the GuessWhat?! visual dialogue task.
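
The FiLM operation itself, "per-channel scaling and shifting", fits in a few lines. The sketch below shows only that operation on a [C][H][W] nested-list feature map; in the paper the scale and shift parameters are generated from the language input, whereas here gamma and beta are simply passed in as hypothetical values.

```python
def film(feature, gamma, beta):
    """Apply feature-wise linear modulation to a [C][H][W] feature map.

    Each channel c is transformed as gamma[c] * x + beta[c].
    gamma and beta (one value per channel) are assumed to be given;
    producing them from language is outside this sketch.
    """
    return [
        [[gamma[c] * v + beta[c] for v in row] for row in feature[c]]
        for c in range(len(feature))
    ]


feature = [[[1.0, 2.0]], [[3.0, 4.0]]]   # 2 channels, 1x2 spatial size
print(film(feature, gamma=[2.0, 0.5], beta=[1.0, -1.0]))
# [[[3.0, 5.0]], [[0.5, 1.0]]]
```

In the multi-hop setting described above, a different (gamma, beta) pair would be produced for each layer in turn rather than all at once.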

arXiv: https://arxiv.org/abs/1808.04446

Note: Amusi thinks combining CV with NLP has great research value and promise.

Semantic Segmentation

"Concept Mask: Large-Scale Segmentation from Semantic Concepts"

Abstract：Existing works on semantic segmentation typically consider a small number of labels, ranging from tens to a few hundreds. With a large number of labels, training and evaluation of such task become extremely challenging due to correlation between labels and lack of datasets with complete annotations. We formulate semantic segmentation as a problem of image segmentation given a semantic concept, and propose a novel system which can potentially handle an unlimited number of concepts, including objects, parts, stuff, and attributes. We achieve this using a weakly and semi-supervised framework leveraging multiple datasets with different levels of supervision. We first train a deep neural network on a 6M stock image dataset with only image-level labels to learn visual-semantic embedding on 18K concepts. Then, we refine and extend the embedding network to predict an attention map, using a curated dataset with bounding box annotations on 750 concepts. Finally, we train an attention-driven class agnostic segmentation network using an 80-category fully annotated dataset. We perform extensive experiments to validate that the proposed system performs competitively to the state of the art on fully supervised concepts, and is capable of producing accurate segmentations for weakly learned and unseen concepts.

arXiv: https://arxiv.org/abs/1808.06032

Monocular Depth Estimation

"Learning Monocular Depth by Distilling Cross-domain Stereo Networks"

Abstract：Monocular depth estimation aims at estimating a pixelwise depth map for a single image, which has wide applications in scene understanding and autonomous driving. Existing supervised and unsupervised methods face great challenges. Supervised methods require large amounts of depth measurement data, which are generally difficult to obtain, while unsupervised methods are usually limited in estimation accuracy. Synthetic data generated by graphics engines provide a possible solution for collecting large amounts of depth data. However, the large domain gaps between synthetic and realistic data make directly training with them challenging. In this paper, we propose to use the stereo matching network as a proxy to learn depth from synthetic data and use predicted stereo disparity maps for supervising the monocular depth estimation network. Cross-domain synthetic data could be fully utilized in this novel framework. Different strategies are proposed to ensure learned depth perception capability well transferred across different domains. Our extensive experiments show state-of-the-art results of monocular depth estimation on KITTI dataset.
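
The supervision idea, using the stereo network's predicted disparity as a stand-in for ground truth when training the monocular network, can be sketched as a per-pixel loss between the two predictions. The choice of a mean L1 loss and the function name are assumptions for illustration; the abstract does not say which loss is used.

```python
def proxy_l1_loss(mono_pred, stereo_proxy):
    """Mean absolute difference between two disparity maps of equal size.

    mono_pred:    disparity predicted by the monocular network.
    stereo_proxy: disparity predicted by the stereo network, treated as
                  a fixed target (assumed L1 loss, for illustration only).
    """
    total, n = 0.0, 0
    for pred_row, proxy_row in zip(mono_pred, stereo_proxy):
        for p, q in zip(pred_row, proxy_row):
            total += abs(p - q)
            n += 1
    return total / n


mono = [[1.0, 2.0], [3.0, 4.0]]
proxy = [[1.5, 2.0], [2.0, 4.0]]
print(proxy_l1_loss(mono, proxy))  # 0.375
```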

arXiv: https://arxiv.org/abs/1808.06586