[Paper quick-read]: Depth Estimation for Indoors Spherical Panoramas — indoor images from panoramic (360°) cameras (three papers)

Indoor images from panoramic (360°) cameras have one very important dataset, the 3D60 dataset (comprising three datasets: SunCG, Matterport3D and Stanford2D3D). For download instructions, see my post:

3D60 Dataset download steps (detailed).

Contents

1. ECCV2018 Distortion-Aware Convolutional Filters for Dense Prediction in Panoramic Images

Abstract

Introduction

Related works

Distortion-aware CNN for depth prediction

Distortion-aware Convolution

2. ECCV2018 OmniDepth: Dense Depth Estimation for Indoors Spherical Panoramas

Abstract

Introduction

Contributions

Omnidirectional Depth Estimation

UResNet

RectNet

3. CVPR2020 Geometric Structure Based and Regularized Depth Estimation From 360◦ Indoor Imagery

Abstract

Method


1. ECCV 2018

Distortion-Aware Convolutional Filters for Dense Prediction in Panoramic Images

The reason I analyze this paper in detail is that when I first encountered the panoramic-camera problem, my intuition was to use a slightly modified deformable convolution. This coincides with the authors' idea: the paper proposes a new deformable convolution and uses it in a domain-adaptation network. Although it feels like I was a step too late, this paper was published back in 2018, meaning the work was done around 2017-2018; coming up with such a method was very advanced at the time.

[paper]

Abstract

There is a high demand of 3D data for 360◦ panoramic images and videos, pushed by the growing availability on the market of specialized hardware for both capturing (e.g., omni-directional cameras) as well as visualizing in 3D (e.g., head mounted displays) panoramic images and videos.

At the same time, 3D sensors able to capture 3D panoramic data are expensive and/or hardly available.

To fill this gap, we propose a learning approach for panoramic depth map estimation from a single image.

Thanks to a specifically developed distortion-aware deformable convolution filter, our method can be trained by means of conventional perspective images, then used to regress depth for panoramic images, thus bypassing the effort needed to create annotated panoramic training dataset.

We also demonstrate our approach for emerging tasks such as panoramic monocular SLAM, panoramic semantic segmentation and panoramic style transfer.

The abstract has five parts:

1. Background: the importance of 3D information for 360◦ panoramic images and videos;

2. Problem: 3D sensors are expensive and/or hardly available;

3. Goal: estimate depth by learning, avoiding the need for extra 3D sensors;

4. Method: the paper proposes a distortion-aware deformable convolution filter;

5. Experiments: the method achieves good results on panoramic monocular SLAM, panoramic semantic segmentation and panoramic style transfer.

Introduction

  • Background and problem:

The availability of 360◦ panoramic visual data is quickly increasing thanks to the availability on the market of a new generation of cheap and compact omni-directional cameras: to name a few, Ricoh Theta, Gear360, Insta360 One. At the same time, there is also a growing demand of utilizing such visual content within 3D panoramic displays as provided by head mounted displays (HMDs) and new smartphone apps, dictated by emerging applications in the field of virtual reality (VR) and gaming. Nevertheless, the great majority of currently available panoramic content is just monoscopic, since available hardware has no means to associate depth or geometry information to the acquired RGB data. This naturally limits the sense of 3D when experiencing such content, even if the current hardware could already exploit 3D content, since almost all HMDs feature a stereoscopic display.

Therefore, the ability to acquire 3D data for panoramic images is strongly desired from both a hardware and an application standpoint. Nevertheless, acquiring depth from a panoramic video or image is not an easy task. Conversely to the case of conventional perspective imaging, where there are off-the-shelf, cheap and lightweight 3D sensors (e.g. Intel RealSense, Orbbec Astra), consumer 3D omni-directional cameras have not yet been developed. Current devices for obtaining 360◦ panoramic RGB-D images rely on a set of depth cameras (e.g. the Matterport camera), a laser scanner (e.g. FARO), or a mobile robotic setup (e.g. the NavVis trolley). All these solutions are particularly expensive, require long set-up times and are not suited to mobile devices. Additionally, most of these solutions require static working conditions and cannot deal with dynamic environments, since the devices incrementally scan the surroundings either via mechanical rotation or being pushed around.


The background shows that the authors know this field well and understand the pain points met in real applications of it, points that are hard to come up with while sitting in a lab. Good work is inspired by practice: discover real problems in applications, then propose new techniques to solve them.

  • The paper's main strategy and motivation

Recently a research trend has emerged aiming at depth prediction from a single RGB image. In particular, the use of convolutional neural networks (CNNs) [15, 4, 5] in an end-to-end fashion has proved the ability to regress dense depth maps at a relatively high resolution and with good generalization accuracy, even in the absence of monocular cues to drive the depth estimation task.

With our work, we aim to explore the possibility of predicting depth information from monoscopic 360◦ panoramic image using a learned approach, which would allow obtaining depth information based simply on low-cost omni-directional cameras.

One main challenge to accomplish this goal is represented by the need of extensive annotations for training depth prediction, which would still require the aforementioned high-cost, impractical solutions based on 3D panoramic sensors.

Instead, if we could exploit conventional perspective images for training a panoramic depth predictor, this would be greatly beneficial for reducing the cost of annotations and for training under a variety of conditions (outdoor/indoor, static/dynamic, etc.), by exploiting the wealth of publicly available perspective datasets.

This passage has four parts, reasoning step by step toward the paper's motivation:

1. Prior work: existing methods estimate depth from a single RGB image, learned with CNNs;

2. Goal: learn to predict depth for 360° panoramic images;

3. Main challenge: the dataset is a big problem, because annotating depth for panoramic images is costly in time, labor and money, and hard to do at scale;

4. Strategy: if conventional perspective images could be used to train a panoramic depth predictor, this would greatly reduce annotation cost and allow training under varied conditions (outdoor/indoor, static/dynamic, etc.) by exploiting the wealth of publicly available perspective datasets.

With the overall strategy in place, what remains to be solved is the technical, algorithmic problem.

  • The algorithmic problem

With this motivation, our goal is to develop a learning approach which trains on perspective RGB images and regresses 360◦ panoramic depth images.

The main problem is represented by the distortions caused by the equirectangular representation: indeed, when projecting the spherical pixels to a flat plane, the image gets remarkably distorted especially along the y axis. This distortion leads to significant error in depth prediction, as shown in Fig. 1 (bottom row, left).

A simple but partial solution to this problem is represented by rectification. Since 360◦ panoramic images cannot be rectified to a single perspective image due to the limitations of the field of view of the camera model, they are usually rectified using a collection of 6 perspective images, each associated to a different direction, i.e. a representation known as cube map projection [8]. However, such representation includes discontinuities at each image border, despite the panoramic image being continuous on those regions. As a consequence, the predicted depth also shows unwanted discontinuities, as shown in Fig. 1 (bottom row, middle), since the receptive field of the network is terminated on the cube map’s borders.

For this problem, Su et al. [29] proposed a method for domain adaptation of CNNs from perspective image to equirectangular panoramic image. Nevertheless, their approach relies on feature extraction specifically aimed at object detection, hence it does not easily extend to dense prediction tasks such as depth prediction and semantic segmentation.

This passage has four parts:

1. The strategy made explicit: learn depth from conventional images to drive depth estimation for panoramas;

2. The technical problem: a panorama lives on a sphere, and the equirectangular mapping from spherical to planar coordinates introduces distortion;

3. The first existing approach and its flaw: the cube map projection, which produces discontinuities at the image borders;

4. The second existing approach and its flaw: domain adaptation, which suits object detection but does not extend to dense prediction tasks such as depth estimation and semantic segmentation.
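Point 2 above, the equirectangular distortion, can be made concrete: each image row corresponds to a latitude on the sphere, and content is stretched horizontally by a factor of 1/cos(latitude), diverging toward the poles. A minimal numpy sketch (the image height of 512 is an arbitrary choice for illustration):

```python
import numpy as np

def horizontal_stretch(height):
    """Per-row horizontal stretch factor of an equirectangular image.

    Row v maps to a latitude in (-pi/2, pi/2); a pixel on that row spans
    a horizontal arc cos(lat) times shorter than one on the equator, so
    content appears stretched by 1 / cos(lat).
    """
    v = (np.arange(height) + 0.5) / height   # normalized row in (0, 1)
    lat = (0.5 - v) * np.pi                  # latitude in (-pi/2, pi/2)
    return 1.0 / np.cos(lat)

s = horizontal_stretch(512)
# near-pole rows are stretched orders of magnitude more than the equator
print(s[0], s[256])
```

This is why a fixed square filter sees very different content depending on the row it is applied to.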

  • Overview of the proposed method

We propose to modify the network’s convolutions by leveraging geometrical priors for the image distortion, by means of a novel distortion-aware convolution that adapts its receptive field by deforming the shape of the convolutional filter according to the distortion and projection model. Thus, these modified filters can compensate for the image distortions directly during the convolutional operation, so to rectify the receptive field.

This allows employing different distortion models for training and testing a network: in particular, the advantage is that panoramic depth prediction can be trained by means of standard perspective images.

We demonstrate the domain adaptation capability for the depth prediction task between rectified perspective images and equirectangular panoramic images on a public panoramic image benchmarks, by replacing the convolutional layers of a state-of-the-art architecture [15] with the proposed distortion-aware convolutions.

Moreover, we also test our approach for semantic segmentation and obtain 360◦ semantic 3D reconstruction from a single panoramic image (see Fig. 1, top right). Finally, we show examples of application of our approach for tasks such as panoramic monocular SLAM and panoramic style transfer.

This passage has four parts:

1. A new distortion-aware convolution: guided by geometric priors, it deforms the shape of the convolutional filter according to the distortion and projection model. The filter can thus compensate for the image distortion directly during the convolution operation, rectifying the receptive field.

2. Why the new convolution matters: the distortion-aware deformable convolution can be trained and used wherever distortion appears; crucially, a model trained on standard perspective images can still be applied to panoramic estimation. The simplest recipe is to crop panoramas into multiple patches as training data.

This idea is really clever!

3. Implementation: replace the conventional convolutions in the network of [15] with the proposed distortion-aware deformable convolutions.

4. Experimental validation: omitted.
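To illustrate the core idea of the distortion-aware convolution (a simplified sketch, not the authors' exact implementation): lay the regular k×k sampling grid on the tangent plane of the sphere at the filter's center via a gnomonic projection, then map the grid points back to equirectangular pixel coordinates, so the receptive field automatically widens where the projection stretches the image. The gnomonic model and the one-pixel angular step are assumptions for illustration:

```python
import numpy as np

def distorted_grid(u, v, width, height, k=3):
    """Sampling locations of a k x k filter centered at pixel (u, v).

    The grid is laid out on the tangent plane at (u, v) and projected
    back to equirectangular coordinates with the inverse gnomonic
    projection, so it widens toward the poles instead of keeping unit
    pixel offsets.  One tangent-plane step equals the angular size of
    one pixel at the equator.
    """
    lon0 = (u / width) * 2 * np.pi - np.pi
    lat0 = np.pi / 2 - (v / height) * np.pi
    step = 2 * np.pi / width                  # angular size of one pixel
    r = (k - 1) // 2
    xs, ys = np.meshgrid(np.arange(-r, r + 1), np.arange(-r, r + 1))
    x, y = xs * step, ys * step               # tangent-plane coordinates
    rho = np.sqrt(x**2 + y**2)
    c = np.arctan(rho)                        # angular distance from center
    rho = np.where(rho == 0, 1e-12, rho)      # avoid 0/0 at the center tap
    lat = np.arcsin(np.cos(c) * np.sin(lat0)
                    + y * np.sin(c) * np.cos(lat0) / rho)
    lon = lon0 + np.arctan2(
        x * np.sin(c),
        rho * np.cos(lat0) * np.cos(c) - y * np.sin(lat0) * np.sin(c))
    uu = (lon + np.pi) / (2 * np.pi) * width  # back to pixel coordinates
    vv = (np.pi / 2 - lat) / np.pi * height
    return uu, vv
```

Feeding these locations to a bilinear sampler, as in deformable convolution, yields filters whose effective footprint is approximately rectified on the sphere: near the equator the grid stays a regular 3×3 block, near the poles it spreads horizontally.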

2. ECCV 2018

OmniDepth: Dense Depth Estimation for Indoors Spherical Panoramas


[dataset] [paper]

The rendered dataset accompanying this paper can be downloaded together with the others as 3D60 (comprising SunCG, Matterport3D and Stanford2D3D).

For the dataset download, see my other post: https://mp.csdn.net/console/editor/html/109238767

Abstract

Recent work on depth estimation up to now has only focused on projective images ignoring 360◦ content which is now increasingly and more easily produced. We show that monocular depth estimation models trained on traditional images produce sub-optimal results on omnidirectional images, showcasing the need for training directly on 360◦ datasets, which however, are hard to acquire.

In this work, we circumvent the challenges associated with acquiring high quality 360◦ datasets with ground truth depth annotations, by re-using recently released large scale 3D datasets and re-purposing them to 360◦ via rendering. This dataset, which is considerably larger than similar projective datasets, is publicly offered to the community to enable future research in this direction. We use this dataset to learn in an end-to-end fashion the task of depth estimation from 360◦ images.

We show promising results in our synthesized data as well as in unseen realistic images.

The abstract covers three things:

1. Problems, two of them: first, past depth estimation has mostly been done on projective images (ordinary planar images), not on 360◦ camera images; second, algorithms trained on projective images are suboptimal on 360◦ images.

2. Method: first, there is again the dataset problem (the previous paper hit it too, but solved it differently: its distortion-aware convolutions let a model trained on projective images work on 360◦ images as well); this paper instead constructs a synthetically rendered 360◦ dataset, and then trains a depth estimation model for 360◦ images on it.

3. Conclusion: promising results on synthetic data and on unseen real images.

Introduction

I will skip the background and related work, and look at the main problem this paper sets out to solve.

Depth and/or geometry extraction from omnidirectional content has been approached similar to traditional 2D content via omnidirectional stereo [11,12,13,14] and SfM [4] analytical solutions. There are inherent problems though to applying learning based methods to 360◦ content as a result of its acquisition process that inhibits the creation of high quality datasets. Coupling them with 360◦ LIDARs would produce low resolution depths and would also insert the depth sensor into the content itself, a drawback that also exists when aiming to acquire stereo datasets. One alternative would be to manually re-position the camera but that would be tedious and error prone as a consistent baseline would not be possible.


In short: 1. depth estimation models for conventional 2D images do not carry over to 360◦ images; 2. accurate depth ground truth is hard to acquire with LIDAR or other hardware.

Then a brief description of this paper's work:

In this work, we train a CNN to learn to estimate a scene’s depth given an omnidirectional (equirectangular) image as input. To circumvent the lack of available training data we resort to re-using existing 3D datasets and repurposing them for use within a 360◦ context. This is accomplished by generating diverse 360◦ views via rendering. We use this dataset for learning to infer depth from omnidirectional content.


Contributions

1. We present the first, to the authors’ knowledge, learning based dense depth estimation method that was trained with and operates directly on omnidirectional content in the form of equirectangular images.

2. We offer a dataset consisting of 360◦ color images paired with ground truth 360◦ depth maps in equirectangular format. The dataset is available online.

3. We propose and validate, a CNN auto-encoder architecture specifically designed for estimating depth directly on equirectangular images.

4. We show how monocular depth estimation methods trained on traditional 2D images fall short or produce low quality results when applied to equirectangular inputs, highlighting the need for learning directly on the 360◦ domain.


Omnidirectional Depth Estimation

The paper trains two models: UResNet and RectNet.

UResNet

The network architecture is as follows:

UResNet Architecture: The encoder consists of two input preprocessing blocks, and four down-scaling blocks (dark green). The former are single convolutional (conv) layers while the latter consist of a strided conv and two more regular convs with a skip/residual connection. The decoder contains one upscaling block (orange) and three up-prediction blocks (red), followed by the prediction layer (pink). Up-scaling is achieved with a strided deconv followed by a conv, and similarly, up-predictions additionally branch out to estimate a depth prediction at the corresponding scale with an extra conv that is concatenated with the features of the next block’s last layer.

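The scale bookkeeping implied by this description can be sketched as follows; the 256×512 input resolution and the exact scales of the side predictions are assumptions for illustration, not taken from the paper:

```python
def down(hw):
    """Halve both spatial dims, as a stride-2 conv block does."""
    return (hw[0] // 2, hw[1] // 2)

def up(hw):
    """Double both spatial dims, as a strided deconv block does."""
    return (hw[0] * 2, hw[1] * 2)

size = (256, 512)            # (H, W) of the input panorama (assumed)
enc = [size]
for _ in range(4):           # four stride-2 down-scaling blocks
    enc.append(down(enc[-1]))
dec = [enc[-1]]
for _ in range(4):           # one up-scaling + three up-prediction blocks
    dec.append(up(dec[-1]))
# the three up-prediction blocks each branch out a depth estimate
side_scales = dec[2:]
```

So the bottleneck sits at 1/16 of the input resolution and the decoder restores it while emitting intermediate depth predictions on the way up.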

RectNet

The network architecture is as follows:

RectNet Architecture: The encoder consists of two preprocessing blocks (yellow and blue) and a downscaling block (dark green), followed by two increasing dilation blocks (light green and black). The preprocessing blocks concatenate features produced by convolutions (convs) with different filter sizes, accounting for the equirectangular projection’s varying distortion factor. The down-scaling block comprises a strided and two regular convs.

Note in particular the yellow and blue blocks in the encoder, which use convolution kernels of different shapes.
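A small sketch of why the parallel branches with differently shaped kernels can be concatenated: every branch must preserve the spatial size, which "same" padding derived from each kernel shape guarantees. The specific kernel sizes (a square 3×3 and a wide 3×9) are illustrative assumptions, not the paper's exact choices:

```python
import numpy as np

def same_pad(kh, kw):
    """Padding that preserves spatial size for a stride-1 odd-sized kernel."""
    return (kh - 1) // 2, (kw - 1) // 2

def conv2d_same(img, kh, kw):
    """Stride-1 'same' convolution with an averaging kh x kw kernel."""
    ph, pw = same_pad(kh, kw)
    padded = np.pad(img, ((ph, ph), (pw, pw)), mode="edge")
    out = np.zeros_like(img, dtype=float)
    for i in range(kh):          # accumulate shifted views instead of
        for j in range(kw):      # an explicit sliding-window loop
            out += padded[i:i + img.shape[0], j:j + img.shape[1]]
    return out / (kh * kw)

img = np.random.rand(64, 128)
# parallel branches: square kernel and wide (rectangular) kernel, then concat
branches = [conv2d_same(img, 3, 3), conv2d_same(img, 3, 9)]
feat = np.stack(branches)        # spatial sizes match, so stacking works
```

The wide kernel gives a horizontally elongated receptive field, matching the direction in which the equirectangular projection stretches content.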

3. CVPR 2020

Geometric Structure Based and Regularized Depth Estimation From 360◦ Indoor Imagery

[paper]

My take: this paper is full of ideas. It expresses a geometric structural prior with corners, boundaries and planes, and this prior and the depth information assist each other during learning. These ideas are well worth studying. That said, the method only applies indoors; and judging from the examples given in the paper, all the indoor panoramas have fairly clear geometric structure. Would the method still estimate depth effectively when the structure is less distinct?

Abstract

Motivated by the correlation between the depth and the geometric structure of a 360◦ indoor image, we propose a novel learning-based depth estimation framework that leverages the geometric structure of a scene to conduct depth estimation. Specifically, we represent the geometric structure of an indoor scene as a collection of corners, boundaries and planes. On the one hand, once a depth map is estimated, this geometric structure can be inferred from the estimated depth map; thus, the geometric structure functions as a regularizer for depth estimation. On the other hand, this estimation also benefits from the geometric structure of a scene estimated from an image where the structure functions as a prior.

However, furniture in indoor scenes makes it challenging to infer geometric structure from depth or image data. An attention map is inferred to facilitate both depth estimation from features of the geometric structure and also geometric inferences from the estimated depth map.

To validate the effectiveness of each component in our framework under controlled conditions, we render a synthetic dataset, Shanghaitech-Kujiale Indoor 360◦ dataset with 3550 360◦ indoor images.

Extensive experiments on popular datasets validate the effectiveness of our solution. We also demonstrate that our method can also be applied to counterfactual depth.

The abstract covers the following points; the first three are the paper's three methods:

1. Method 1: starting from the prior that geometric structure correlates with depth, the authors propose a model that couples structure estimation with depth estimation, so that the two assist and optimize each other. The geometric structure comprises corners, wall boundaries and planes.

2. Method 2: an attention map facilitates both structure and depth estimation;

3. Method 3: a new dataset is constructed;

4. Conclusion: omitted.

Method

The network architecture is as follows:

Figure 2. Overview of our architecture. (a) Depth estimation predicts depth from a given panorama image. (c) Structure as a prior takes panorama as inputs, and estimates the structure of the room. (b) Attention Module aims to generate attention map to avoid the inconsistency between structure and depth map caused by furniture. (d) Structure as a regularizer is designed to regularize the estimated depth maps by predicting structure from them. We neglect the skip connections in U-Net here for simplicity. Different rectangles with different colors represent convolution blocks.

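One way to picture the attention module's role (an illustrative sketch; the loss form is my assumption, not the paper's exact formulation): the attention map downweights furniture pixels when the structure inferred from the estimated depth map is compared against the structure prior, so the regularizer only acts where room layout and depth should agree:

```python
import numpy as np

def masked_structure_loss(struct_from_depth, struct_prior, attention):
    """Attention-weighted L1 consistency between two structure maps.

    attention is ~1 on reliable layout regions (walls, floor, ceiling)
    and ~0 where furniture makes depth and room structure disagree.
    """
    w = attention
    return np.sum(w * np.abs(struct_from_depth - struct_prior)) / (np.sum(w) + 1e-8)

prior = np.zeros((8, 8))
from_depth = np.zeros((8, 8))
from_depth[2:6, 2:6] = 1.0          # disagreement caused by "furniture"
att = np.ones((8, 8))
full = masked_structure_loss(from_depth, prior, att)
att[2:6, 2:6] = 0.0                 # attention masks the furniture region
masked = masked_structure_loss(from_depth, prior, att)
```

With the mask in place, the penalty from the furniture region vanishes, so the structure regularizer no longer fights the depth branch over pixels where the clean room layout simply does not hold.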
