Matterport3D: Deep Learning on RGB-D Data of Indoor Environments

Taylor Guo, September 24, 2017

Matterport3D: Learning from RGB-D Data in Indoor Environments

Matterport3D Home Page

mpview is a C++ application for parsing and viewing houses in the Matterport3D dataset.

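Figure 1: The Matterport3D dataset, which includes high dynamic range (HDR) color images, depth images, panoramic skyboxes, textured meshes, region layouts and categories, and object semantic segmentations.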

3 The Matterport3D Dataset

Figure 2: Viewpoints from which panoramas were captured (green spheres), with an average spacing of 2.25 m.

3.1 Data Acquisition Process

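Matterport's data acquisition uses a tripod-mounted camera rig with three color cameras and three depth cameras, arranged top, middle, and bottom. For each panorama, the rig rotates about the vertical axis to 6 distinct orientations (that is, one stop every 60 degrees), and each color camera takes a high dynamic range (HDR) photo at every stop. While the rig rotates, the 3 depth cameras capture data continuously; this data is integrated into a 1280×1024 depth image registered to each color photo. Each panorama therefore consists of 18 color images, each with an aligned depth image, centered at about the height of a human observer.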
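The capture pattern described above (3 camera rows, 6 stops at 60-degree intervals) can be enumerated in a few lines. This is a minimal sketch: the row labels and the function name `panorama_views` are illustrative assumptions, not part of the Matterport tooling.

```python
# Assumed labels for the three vertically stacked color/depth camera pairs.
CAMERA_ROWS = ("top", "middle", "bottom")
NUM_STOPS = 6  # orientations per panorama, as stated in the text


def panorama_views(num_stops=NUM_STOPS, rows=CAMERA_ROWS):
    """Return a (row, yaw_degrees) pair for every image in one panorama."""
    step = 360.0 / num_stops  # 60 degrees between consecutive stops
    return [(row, stop * step) for stop in range(num_stops) for row in rows]


views = panorama_views()
print(len(views))  # 18
```

Each of the 18 entries corresponds to one color image with its registered depth image.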

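For each environment in the dataset, an operator captures a set of panoramas spaced uniformly at about 2.5 m across the walkable area of the floor plan. The user marks the locations of windows and mirrors with an iPad app and uploads the data to Matterport. Matterport then processes the raw data as follows: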

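1) stitch the images into panoramic "skyboxes" for panorama viewing,

2) estimate the 6-DoF pose of each image with global bundle adjustment,

3) reconstruct a single textured polygon mesh containing all visible surfaces of the environment.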

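After processing, each scene yields a set of 1280×1024 RGB-D images (with color in HDR), each with a 6-DoF camera pose estimate, plus a skybox for each group of 18 images belonging to the same panorama and a textured polygon mesh of the entire scene. In total, the dataset covers 90 buildings with 194,400 RGB-D images, 10,800 panoramas, and 24,727,520 textured triangles; the textured meshes are reconstructed with methods similar to [21] and [25].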
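The totals quoted above are consistent with one another, which a quick arithmetic check confirms. The variable names are illustrative; the numbers are the ones given in the text.

```python
NUM_BUILDINGS = 90
NUM_PANORAMAS = 10_800
IMAGES_PER_PANORAMA = 18
NUM_TRIANGLES = 24_727_520

# 18 RGB-D images per panorama should reproduce the image total.
total_images = NUM_PANORAMAS * IMAGES_PER_PANORAMA
panoramas_per_building = NUM_PANORAMAS / NUM_BUILDINGS
triangles_per_building = NUM_TRIANGLES / NUM_BUILDINGS

print(total_images)            # 194400
print(panoramas_per_building)  # 120.0
```

On average, then, a building is covered by 120 panoramas and roughly 275,000 textured triangles.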

3.2 Semantic Annotation

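To produce 3D instance-level semantic annotations, we first create a floor plan annotation for each house, extract room-like regions from the floor plan, and then densely annotate the objects in each region.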

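The first step of the annotation process is to break each building down into region components by specifying the 3D extent and a semantic category label for every room-like region. Annotators use a simple interactive tool: they select a category and draw a 2D polygon on the floor for each region (see Figure 3). The tool then snaps the polygon to the appropriate planar surfaces (walls and floor) and extends it up to the ceiling.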
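Once each region has a floor polygon, a 3D point can be assigned to a region by testing its horizontal position against the polygons. The sketch below uses a standard even-odd ray casting test; the region names, coordinates, and the choice to ignore height (floor-to-ceiling extent) are simplifying assumptions for illustration, not the annotation tool's actual logic.

```python
def point_in_polygon(x, y, polygon):
    """Even-odd ray casting test for a 2D point against a polygon."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does a horizontal ray starting at (x, y) cross this edge?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def region_of(point, regions):
    """Return the label of the first region whose floor polygon contains the point."""
    x, y, _z = point  # height is ignored in this simplified version
    for label, polygon in regions.items():
        if point_in_polygon(x, y, polygon):
            return label
    return None


# Hypothetical floor plan with two rectangular regions (meters).
regions = {
    "kitchen": [(0, 0), (4, 0), (4, 3), (0, 3)],
    "hallway": [(4, 0), (6, 0), (6, 3), (4, 3)],
}
print(region_of((1.0, 1.0, 0.8), regions))  # kitchen
```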

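Figure 3: Floor plan annotations provided by annotators. The floor plan defines the regions within which object semantics are annotated. Left: the textured mesh. Right: the floor plan, with regions distinguished by color.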

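The second step is to label the 3D surfaces of the objects in each region. We extract a mesh for each region using screened Poisson surface reconstruction [6]. Then, using the ScanNet crowd-sourcing interface of Dai et al. [7], annotators "paint" the triangles of each region's mesh and name every object in the region. We first collect an initial set of labels on a crowd-sourcing platform, which 10 expert annotators then complete, correct, and verify. This ensures high-quality labels and broad annotation coverage.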

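The 3D segmentations contain 50,811 object instance annotations. Because crowd workers typed labels as free-form text, there are 1,659 distinct text labels. We post-process these into a set of 40 object categories mapped to WordNet. Figure 5 shows the distribution of objects over semantic categories, and Figure 4 shows examples of color-coded meshes.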
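Collapsing free-form text labels into a fixed category set amounts to normalizing each string and looking it up in a mapping table. The sketch below shows only the mechanism: the table entries, the `misc` fallback, and the function names are hypothetical, since the actual 1,659-to-40 mapping is not given here.

```python
# Hypothetical mapping; the real label-to-category table is not given in the text.
RAW_TO_CATEGORY = {
    "sofa": "sofa",
    "couch": "sofa",
    "arm chair": "chair",
    "office chair": "chair",
    "bath towel": "towel",
}


def normalize(raw_label):
    """Lowercase and collapse whitespace so spelling variants share one key."""
    return " ".join(raw_label.lower().split())


def categorize(raw_label, mapping=RAW_TO_CATEGORY, default="misc"):
    """Map a free-form label to a category, falling back to a default bucket."""
    return mapping.get(normalize(raw_label), default)


print(categorize("  Couch "))  # sofa
```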

Figure 4: Examples of semantic annotations. Left: 3D room mesh. Middle: object instance labels. Right: object category labels.

3.3 Properties of the Dataset

RGB-D Panoramas

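Matterport3D contains aligned 1280×1024 color and depth images for 18 viewpoints covering approximately 3.75 sr (the entire sphere except the north and south poles), along with "skybox" images for outward-looking views aligned with the faces of a cube centered at the panorama center. These RGB-D panoramas open new opportunities for recognizing scene categories, estimating region layout, learning contextual relationships, and more (see Section 4.4).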
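A skybox stores the panorama on the six faces of a cube around the capture point, so looking up a viewing direction starts by finding which face the ray exits through. Here is a minimal sketch of that step; the face names and axis convention are assumptions and may not match the dataset's own skybox ordering.

```python
def skybox_face(direction):
    """Return the cube face hit by a ray leaving the panorama center."""
    x, y, z = direction
    ax, ay, az = abs(x), abs(y), abs(z)
    if ax == ay == az == 0:
        raise ValueError("direction must be non-zero")
    # The face is decided by the axis with the largest absolute component.
    if ax >= ay and ax >= az:
        return "+x" if x > 0 else "-x"
    if ay >= az:
        return "+y" if y > 0 else "-y"
    return "+z" if z > 0 else "-z"


print(skybox_face((1.0, 0.2, 0.1)))  # +x
```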

Precise Global Alignment

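Although the dataset has no ground-truth camera poses, so errors cannot be measured objectively, we estimate that the average registration error between corresponding surface points is 1 cm or less (see Figure 6). Some surface misalignments are as large as 10 cm or more, but they are rare and usually occur for pairs of images whose viewpoints are several meters apart.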

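Figure 6: Point cloud visualization (left to right: color, diffuse shading, normals). These images show pixels from all RGB-D images back-projected into world space according to the camera poses. Note the accuracy of the global alignment and the relatively low noise in the surface normals, even without depth fusion techniques.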
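Back-projecting a depth pixel into world space needs only the camera intrinsics and the 6-DoF pose. The sketch below assumes a pinhole model with the camera looking down +z and a 4×4 camera-to-world matrix; the intrinsics (`fx`, `fy`, `cx`, `cy`) and these conventions are assumptions for illustration, and the dataset's own files may use different ones.

```python
def backproject(u, v, depth, fx, fy, cx, cy, cam_to_world):
    """Lift pixel (u, v) with metric depth into world coordinates."""
    # Pinhole model: camera-space point along the pixel's viewing ray.
    xc = (u - cx) * depth / fx
    yc = (v - cy) * depth / fy
    pc = (xc, yc, depth, 1.0)
    # Apply the 4x4 camera-to-world pose (row-major nested lists).
    return tuple(
        sum(cam_to_world[r][c] * pc[c] for c in range(4)) for r in range(3)
    )


# Hypothetical pose: no rotation, camera placed at (2.0, 0.0, 0.5).
pose = [
    [1.0, 0.0, 0.0, 2.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.5],
    [0.0, 0.0, 0.0, 1.0],
]
p = backproject(640, 512, 3.0, 1000.0, 1000.0, 640.0, 512.0, pose)
print(p)  # (2.0, 0.0, 3.5)
```

Applying this to every pixel of every image produces a merged point cloud, and how cleanly the overlapping surfaces coincide is a visual check on pose accuracy.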

Comprehensive Viewpoint Sampling

Panoramas are captured at uniform spacing, 2.25 m ± 0.57 m apart, with panorama centers at a typical human viewing height of 1.13 m.
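A spacing statistic of this kind can be computed from the panorama positions as the distance from each viewpoint to its nearest neighbor, summarized as mean ± standard deviation. The sketch below assumes that definition; the viewpoint coordinates are made up.

```python
import math
from statistics import mean, pstdev


def nearest_neighbor_spacing(positions):
    """Distance from each viewpoint to its closest other viewpoint."""
    spacings = []
    for i, p in enumerate(positions):
        spacings.append(
            min(math.dist(p, q) for j, q in enumerate(positions) if j != i)
        )
    return spacings


# Hypothetical floor positions in meters.
viewpoints = [(0.0, 0.0), (2.0, 0.0), (2.0, 2.5), (0.0, 2.5)]
spacing = nearest_neighbor_spacing(viewpoints)
print(mean(spacing), pstdev(spacing))  # 2.0 0.0
```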

Diverse Views of Each Surface

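Matterport3D provides views of each surface patch from a wide range of angles and distances (see Figure 7). Each surface patch is observed by 11 cameras on average (see Figure 8). Over all pixels, the depth has a mean of 2.125 m and a standard deviation of 1.4356 m, and the viewing angle has a mean of 42.584° and a standard deviation of 15.546°. This diversity of views makes it possible to estimate view-independent surface properties, such as material reflectance [4, 26], and to learn view-independent representations, such as patch descriptors [45, 46] and normals [9, 23, 3, 41, 48] (see Section 4.3).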
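For a single surface point, the viewing distance and viewing angle of each camera follow directly from the camera position and the surface normal. The sketch below measures the angle between the normal and the ray toward the camera, which is an assumed definition; the sample positions are invented.

```python
import math


def view_distance_and_angle(camera, point, normal):
    """Distance to a surface point, and angle (degrees) between its normal and the ray to the camera."""
    to_cam = [c - p for c, p in zip(camera, point)]
    dist = math.sqrt(sum(d * d for d in to_cam))
    n_len = math.sqrt(sum(n * n for n in normal))
    cos_a = sum(d * n for d, n in zip(to_cam, normal)) / (dist * n_len)
    cos_a = max(-1.0, min(1.0, cos_a))  # guard against rounding error
    return dist, math.degrees(math.acos(cos_a))


# Hypothetical floor point seen from directly above and from an oblique camera.
point, normal = (0.0, 0.0, 0.0), (0.0, 0.0, 1.0)
cameras = [(0.0, 0.0, 2.0), (2.0, 0.0, 2.0)]
stats = [view_distance_and_angle(c, point, normal) for c in cameras]
for dist, angle in stats:
    print(round(dist, 3), round(angle, 1))
```

Averaging these per-view values over all pixels yields dataset-level means and standard deviations of the kind quoted above.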

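Figure 7: Visualization of the set of images that observe a selected surface point (shown as red lines). (The mesh in this image is heavily decimated for easier viewing.)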

Figure 8: Histogram of the number of images that observe each vertex. The mode is 7 and the mean is 11.

4 Learning from the Data

The properties of the Matterport3D dataset enable new ways of learning scene representations.

4.1 Keypoint Matching

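Pre-training deep local descriptors on image correspondences yields more useful features for training stronger descriptors. We use a convolutional neural network (ResNet-50) [18] to map an input image patch to a 512-dimensional feature descriptor. The network is trained on matching and non-matching image patches. Matches are extracted from SIFT keypoint locations that lie within 0.02 m of each other in world space and whose world-space normals are within 100° of each other.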
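The matching criterion (world positions within 0.02 m, world normals within 100°) is easy to state as a predicate. The sketch below assumes each keypoint already carries a world-space position and a unit normal; the function name and sample values are illustrative.

```python
import math

MAX_DISTANCE_M = 0.02          # position threshold from the text
MAX_NORMAL_ANGLE_DEG = 100.0   # normal threshold from the text


def is_match(p_a, n_a, p_b, n_b):
    """True if two keypoints (world position, unit normal) count as a match."""
    if math.dist(p_a, p_b) > MAX_DISTANCE_M:
        return False
    cos_angle = sum(a * b for a, b in zip(n_a, n_b))
    cos_angle = max(-1.0, min(1.0, cos_angle))  # guard against rounding error
    return math.degrees(math.acos(cos_angle)) <= MAX_NORMAL_ANGLE_DEG


up = (0.0, 0.0, 1.0)
print(is_match((0.0, 0.0, 0.0), up, (0.01, 0.0, 0.0), up))  # True
print(is_match((0.0, 0.0, 0.0), up, (0.05, 0.0, 0.0), up))  # False
```

Keypoint pairs that pass the test serve as positives; pairs that fail can be sampled as non-matching patches.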

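Figure 9: Example image matches and image patches from Matterport3D. Matching and non-matching patches are used to train a deep local keypoint descriptor.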

4.2 View Overlap Prediction

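Recognizing previously scanned parts of a scene is a fundamental step in many reconstruction problems, such as loop closure detection.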

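The Matterport3D dataset has a large number of overlapping views, mainly because of the panoramic nature of the capture and the comprehensive viewpoint sampling during scanning. This abundance of loop closures can be used to train a deep model that recognizes them, which could later be integrated into a SLAM reconstruction pipeline.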

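We treat loop closure detection as an image retrieval task. Given a query image, the goal is to find other images whose visible surface overlaps with it as much as possible.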

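We train a convolutional neural network (ResNet-50) to map each image to a feature vector, such that a smaller L2 distance between features indicates higher overlap. The model is trained with the distance ratio loss [19]. The overlap function takes values between 0 and 1. Adding a regression loss on top of the triplet network regresses the overlap for matching image pairs (overlap ratio greater than 0.1).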
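One common form of a distance ratio loss takes a softmax over negative distances and penalizes the negative log-probability of the matching pair. The sketch below implements that form for a single triplet in plain Python; it is assumed, not confirmed, that this matches the authors' exact setup, and the overlap regression term is left out because its form is not specified here.

```python
import math


def distance_ratio_loss(anchor, positive, negative):
    """-log( e^{-d_pos} / (e^{-d_pos} + e^{-d_neg}) ) for one triplet of feature vectors."""
    d_pos = math.dist(anchor, positive)
    d_neg = math.dist(anchor, negative)
    # The expression simplifies to log(1 + e^{d_pos - d_neg}); compute it stably.
    diff = d_pos - d_neg
    return max(diff, 0.0) + math.log1p(math.exp(-abs(diff)))


# Hypothetical 2-D features: the overlapping view is closer to the anchor.
anchor = (0.0, 0.0)
good = distance_ratio_loss(anchor, (0.1, 0.0), (2.0, 0.0))
bad = distance_ratio_loss(anchor, (2.0, 0.0), (0.1, 0.0))
print(good < bad)  # True
```

The loss shrinks as the overlapping image moves closer to the query than the non-overlapping one, which is the behavior retrieval by L2 distance relies on.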

4.3 Surface Normal Estimation

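Estimating surface normals is a core problem in scene reconstruction and scene understanding. Given a color image, the task is to estimate the surface normal direction at every pixel. Depth maps from RGB-D cameras are quite noisy, which makes for poor training data.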
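To see why depth noise hurts normal supervision, consider deriving a normal from a back-projected depth pixel and two of its neighbors via a cross product. This is a generic finite-difference sketch, not the dataset's normal computation; the sign of the result depends on the neighbor ordering chosen here.

```python
import math


def normal_from_neighbors(p, p_right, p_down):
    """Unit normal of the surface through p, from two neighboring 3D points."""
    u = [a - b for a, b in zip(p_right, p)]
    v = [a - b for a, b in zip(p_down, p)]
    n = (
        u[1] * v[2] - u[2] * v[1],
        u[2] * v[0] - u[0] * v[2],
        u[0] * v[1] - u[1] * v[0],
    )
    length = math.sqrt(sum(c * c for c in n))
    if length == 0.0:
        raise ValueError("degenerate neighborhood")
    return tuple(c / length for c in n)


# Flat surface sampled 1 cm apart: the normal points along +z.
clean = normal_from_neighbors((0.0, 0.0, 1.0), (0.01, 0.0, 1.0), (0.0, 0.01, 1.0))
# Same surface with a 1 cm depth error on one neighbor.
noisy = normal_from_neighbors((0.0, 0.0, 1.0), (0.01, 0.0, 1.01), (0.0, 0.01, 1.0))
tilt = math.degrees(math.acos(sum(a * b for a, b in zip(clean, noisy))))
print(round(tilt, 1))  # 45.0
```

A depth error as large as the sample spacing tilts the estimated normal by 45 degrees, so cleaner depth translates directly into cleaner normal targets.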

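The normals in the Matterport3D dataset can be used to train better normal prediction models. We adopt the model of [48], which obtained improved results on the NYUv2 dataset. The model is a fully convolutional neural network consisting of an encoder, which has exactly the same architecture as VGG-16 from the beginning up to the fully connected layers, and a purely symmetric decoder.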

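The model fine-tuned on Matterport3D produces visually better results: it captures more detail on small objects, such as towels and fire extinguishers, and generates smoother surface regions. This improvement in surface normal estimation demonstrates the importance of high-quality depth maps.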

Screened Poisson Surface Reconstruction

ScanNet Dataset

ScanNet Home Page

3DMatch: RGB-D Local Geometric Descriptors

References

[1] I. Armeni, S. Sax, A. R. Zamir, and S. Savarese. Joint 2D-3D-semantic data for indoor scene understanding. arXiv preprint arXiv:1702.01105, 2017.
[2] I. Armeni, O. Sener, A. R. Zamir, H. Jiang, I. Brilakis, M. Fischer, and S. Savarese. 3D semantic parsing of large-scale indoor spaces. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1534–1543, 2016.
[3] A. Bansal, B. Russell, and A. Gupta. Marr revisited: 2D-3D alignment via surface normal prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5965–5974, 2016.
[4] S. Bell, K. Bala, and N. Snavely. Intrinsic images in the wild. ACM Trans. on Graphics (SIGGRAPH), 33(4), 2014.
[5] S. Choi, Q.-Y. Zhou, S. Miller, and V. Koltun. A large dataset of object scans. arXiv preprint arXiv:1602.02481, 2016.
[6] M. Chuang and M. Kazhdan. Interactive and anisotropic geometry processing using the screened poisson equation. ACM Transactions on Graphics (TOG), 30(4):57, 2011.
[7] A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser, and M. Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. http://arxiv.org/abs/1702.04405, 2017.
[8] A. Dai, M. Nießner, M. Zollhöfer, S. Izadi, and C. Theobalt. Bundlefusion: Real-time globally consistent 3d reconstruction using on-the-fly surface re-integration. ACM Transactions on Graphics 2017 (TOG), 2017.
[9] D. Eigen and R. Fergus. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In Proceedings of the IEEE International Conference on Computer Vision, pages 2650–2658, 2015.
[10] M. Firman. RGBD datasets: Past, present and future. In CVPR Workshop on Large Scale 3D Data: Acquisition, Modelling and Analysis, 2016.
[11] D. F. Fouhey, A. Gupta, and M. Hebert. Data-driven 3D primitives for single image understanding. In ICCV, 2013.
[12] D. F. Fouhey, A. Gupta, and M. Hebert. Unfolding an indoor origami world. In European Conference on Computer Vision, pages 687–702. Springer, 2014.
[13] S. Gupta, P. Arbelaez, and J. Malik. Perceptual organization and recognition of indoor scenes from RGB-D images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 564–571, 2013.
[14] S. Gupta, R. Girshick, P. Arbeláez, and J. Malik. Learning rich features from RGB-D images for object detection and segmentation: Supplementary material, 2014.
[15] M. Halber and T. Funkhouser. Structured global registration of rgb-d scans in indoor environments. 2017.
[16] X. Han, T. Leung, Y. Jia, R. Sukthankar, and A. C. Berg. Matchnet: Unifying feature and metric learning for patch-based matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3279–3286, 2015.
[17] A. Handa, V. Patraucean, V. Badrinarayanan, S. Stent, and R. Cipolla. SceneNet: Understanding real world indoor scenes with synthetic data. arXiv preprint arXiv:1511.07041, 2015.
[18] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[19] E. Hoffer, I. Hubara, and N. Ailon. Deep unsupervised learning through spatial contrasting. arXiv preprint arXiv:1610.00243, 2016.
[20] B.-S. Hua, Q.-H. Pham, D. T. Nguyen, M.-K. Tran, L.-F. Yu, and S.-K. Yeung. SceneNN: A scene meshes dataset with annotations. In International Conference on 3D Vision (3DV), volume 1, 2016.
[21] M. Kazhdan, M. Bolitho, and H. Hoppe. Poisson surface reconstruction. In Proceedings of the fourth Eurographics symposium on Geometry processing, volume 7, 2006.
[22] A. Knapitsch, J. Park, Q.-Y. Zhou, and V. Koltun. Tanks and temples: Benchmarking large-scale scene reconstruction. ACM Transactions on Graphics, 36(4), 2017.
[23] B. Li, C. Shen, Y. Dai, A. van den Hengel, and M. He. Depth and surface normal estimation from monocular images using regression on deep features and hierarchical CRFs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1119–1127, 2015.
[24] D. Lin, S. Fidler, and R. Urtasun. Holistic scene understanding for 3D object detection with rgbd cameras. In Proceedings of the IEEE International Conference on Computer Vision, pages 1417–1424, 2013.
[25] M. Nießner, M. Zollhöfer, S. Izadi, and M. Stamminger. Real-time 3D reconstruction at scale using voxel hashing. ACM Transactions on Graphics (TOG), 2013.
[26] K. Rematas, T. Ritschel, M. Fritz, E. Gavves, and T. Tuytelaars. Deep reflectance maps. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4508–4516, 2016.
[27] X. Ren, L. Bo, and D. Fox. RGB-(D) scene labeling: Features and algorithms. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 2759–2766. IEEE, 2012.
[28] M. Savva, A. X. Chang, P. Hanrahan, M. Fisher, and M. Nießner. PiGraphs: Learning Interaction Snapshots from Observations. ACM Transactions on Graphics (TOG), 35(4), 2016.
[29] T. Schmidt, R. Newcombe, and D. Fox. Self-supervised visual descriptor learning for dense correspondence. IEEE Robotics and Automation Letters, 2(2):420–427, 2017.
[30] J. Shotton, B. Glocker, C. Zach, S. Izadi, A. Criminisi, and A. Fitzgibbon. Scene coordinate regression forests for camera relocalization in RGB-D images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2930–2937, 2013.
[31] A. Shrivastava and A. Gupta. Building part-based object detectors via 3D geometry. In Proceedings of the IEEE International Conference on Computer Vision, pages 1745–1752, 2013.
[32] N. Silberman and R. Fergus. Indoor scene segmentation using a structured light sensor. In Computer Vision Workshops (ICCV Workshops), 2011 IEEE International Conference on, 2011.
[33] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus. Indoor segmentation and support inference from RGBD images. In European Conference on Computer Vision, 2012.
[34] E. Simo-Serra, E. Trulls, L. Ferraz, I. Kokkinos, P. Fua, and F. Moreno-Noguer. Discriminative learning of deep convolutional feature point descriptors. In Proceedings of the IEEE International Conference on Computer Vision, pages 118–126, 2015.
[35] S. Song, S. P. Lichtenberg, and J. Xiao. SUN RGB-D: A RGB-D scene understanding benchmark suite. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 567–576, 2015.
[36] S. Song and J. Xiao. Sliding shapes for 3D object detection in depth images. In European conference on computer vision, pages 634–651. Springer, 2014.
[37] S. Song and J. Xiao. Deep sliding shapes for amodal 3D object detection in RGB-D images. 2016.
[38] S. Song, F. Yu, A. Zeng, A. X. Chang, M. Savva, and T. Funkhouser. Semantic scene completion from a single depth image. arXiv preprint arXiv:1611.08974, 2016.
[39] J. Valentin, A. Dai, M. Nießner, P. Kohli, P. Torr, S. Izadi, and C. Keskin. Learning to navigate the energy landscape. arXiv preprint arXiv:1603.05772, 2016.
[40] J. Valentin, V. Vineet, M.-M. Cheng, D. Kim, J. Shotton, P. Kohli, M. Nießner, A. Criminisi, S. Izadi, and P. Torr. SemanticPaint: Interactive 3D labeling and learning at your fingertips. ACM Transactions on Graphics (TOG), 34(5):154, 2015.
[41] X. Wang, D. Fouhey, and A. Gupta. Designing deep networks for surface normal estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 539–547, 2015.
[42] J. Xiao, K. A. Ehinger, A. Oliva, and A. Torralba. Recognizing scene viewpoint using panoramic place representation. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 2695–2702. IEEE, 2012.
[43] J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba. Sun database: Large-scale scene recognition from abbey to zoo. In Computer vision and pattern recognition (CVPR), 2010 IEEE conference on, pages 3485–3492. IEEE, 2010.
[44] J. Xiao, A. Owens, and A. Torralba. SUN3D: A database of big spaces reconstructed using SFM and object labels. In Proceedings of the IEEE International Conference on Computer Vision, pages 1625–1632, 2013.
[45] K. M. Yi, E. Trulls, V. Lepetit, and P. Fua. LIFT: Learned invariant feature transform. In European Conference on Computer Vision, pages 467–483. Springer, 2016.
[46] A. Zeng, S. Song, M. Niessner, M. Fisher, J. Xiao, and T. Funkhouser. 3DMatch: Learning local geometric descriptors from RGB-D reconstructions. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[47] J. Zhang, C. Kan, A. G. Schwing, and R. Urtasun. Estimating the 3D layout of indoor scenes and its clutter from depth sensors. In Proceedings of the IEEE International Conference on Computer Vision, pages 1273–1280, 2013.
[48] Y. Zhang, S. Song, E. Yumer, M. Savva, J.-Y. Lee, H. Jin, and T. Funkhouser. Physically-based rendering for indoor scene understanding using convolutional neural networks. arXiv preprint arXiv:1612.07429, 2016.
[49] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva. Learning deep features for scene recognition using places database. In Advances in neural information processing systems, pages 487–495, 2014.