Abstract

A point cloud is a commonly used geometric data type with many applications in computer vision, computer graphics and robotics. The availability of inexpensive 3D sensors has made point cloud data widely available and the current interest in self-driving vehicles has highlighted the importance of reliable and efficient point cloud processing.

Due to its irregular format, however, current convolutional deep learning methods cannot be directly used with point clouds. Most researchers transform such data to regular 3D voxel grids or collections of images, which renders data unnecessarily voluminous and causes quantization and other issues.

In this thesis, we present novel types of neural networks (PointNet and PointNet++) that directly consume point clouds, in ways that respect the permutation invariance of points in the input. Our network provides a unified architecture for applications ranging from object classification and part segmentation to semantic scene parsing, while being efficient and robust against various input perturbations and data corruption.

We provide a theoretical analysis of our approach, showing that our network can approximate any set function that is continuous, and explain its robustness. In PointNet++, we further exploit local contexts in point clouds, investigate the challenge of non-uniform sampling density in common 3D scans, and design new layers that learn to adapt to varying sampling densities.

The proposed architectures have opened doors to new 3D-centric approaches to scene understanding. We show how we can adapt and apply PointNets to two important perception problems in robotics: 3D object detection and 3D scene flow estimation. In 3D object detection, we propose a new frustum-based detection framework that achieves 3D instance segmentation and 3D amodal box estimation in point clouds.

Our model, called Frustum PointNets, benefits from accurate geometry provided by 3D points and is able to canonicalize the learning problem by applying both non-parametric and data-driven geometric transformations on the inputs. Evaluated on large-scale indoor and outdoor datasets, our real-time detector significantly advances state of the art. In scene flow estimation, we propose a new deep network called FlowNet3D that learns to recover 3D motion flow from two frames of point clouds.

Compared with previous work that focuses on 2D representations and optimizes for optical flow, our model directly optimizes 3D scene flow and shows great advantages in evaluations on real LiDAR scans. As point clouds are prevalent, our architectures are not restricted to the above two applications or even 3D scene understanding. This thesis concludes with a discussion on other potential application domains and directions for future research.

Chapter 1

Introduction

1.1 Overview

Recently, we observe many emerging applications that require perception of 3D environment or interaction with 3D objects. For example, in autonomous driving, in order to make driving decisions, a robot car needs to be aware of pedestrians and cars around it, as well as understand their motions. In augmented reality (AR), AR glasses are equipped with depth cameras to perceive and understand 3D geometry, in order to display virtual objects at the right place, such as a virtual menu on the refrigerator door.

Many of those 3D scene understanding problems cannot be directly programmed, so to solve them, data-driven methods are required. Inspired by the recent success of deep convolutional neural networks (CNNs) in 2D image understanding, we would also like to benefit from deep learning in 3D data understanding, leading to 3D deep learning.

However, different from images that have a dominant representation as 2D pixel arrays, 3D has many popular representations, as shown in Fig. 1.1. The diversity of representations exists for a reason: just as computational algorithms depend on data structures (lists, trees, graphs etc.), different applications of 3D data have their own preferences for 3D representations. Among them, a point cloud is a set of points in space sampled from object surfaces, usually collected by 3D sensors such as LiDARs or depth cameras.

A polygon mesh is a collection of triangles or quads, and is heavily used in computer graphics because it is friendly to 3D modeling and rendering algorithms. Volumetric representation quantizes the space into small voxels (3D cubes) and is common in simulation and reconstruction (such as in medical imaging), because it is easier to aggregate signals in regular grids than in other formats. Last but not least, we also represent 3D as multiple projected views, often used for visualizations, as humans are more used to perceiving 2D than 3D.

Figure 1.1: 3D representations. 3D has multiple popular representations. Among them, the point cloud is the closest to raw sensor data and the simplest in representation, and therefore becomes the focus of this thesis.

Among those 3D representations, we particularly care about point clouds for 3D scene understanding, for two reasons. First, a point cloud is probably the closest representation to raw sensor data. It encodes full information from sensors, without any quantization loss (as in volumetric representations) or projection loss (as in multi-view representations), therefore it is preferred for end-to-end learning in 3D scene understanding.

Second, a point cloud is very simple in representation – it is just a collection of points, which avoids the combinatorial irregularities and complexities of meshes (such as choices of polygons, polygon sizes and connectivities), and thus is easier to learn from. The point cloud is also free from the necessity to choose a resolution as in volumetric representations, or a projection viewpoint as in multi-view images.

Despite its simplicity and prevalence, there is barely any work on representation learning on point clouds. Most existing point cloud features are handcrafted towards specific tasks. Very recently, there have been a few works that process point clouds with deep neural networks. However, because point clouds are irregular, nearly all of these methods convert point clouds to other regular representations first, and then apply existing deep network architectures.

One example is to convert a point cloud into a binary occupancy grid (a voxel is one if there is a point in it, or zero if it is empty), and then apply 3D CNNs on the volumetric grid [127, 65, 83]. This, however, suffers from very high space and computation cost. In 3D CNNs, the storage and computation cost grows cubically with the grid resolution. Even worse, a lot of computation is wasted since many voxels are empty as scanners only capture points from surfaces of objects.

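To make the cost concrete, here is a minimal NumPy sketch of that conversion (the helper name `occupancy_grid` and the normalization scheme are our own, for illustration); storage grows cubically with the resolution, while most voxels of a surface scan stay empty:

```python
import numpy as np

def occupancy_grid(points, resolution=30):
    """Voxelize an (N, 3) point cloud into a binary occupancy grid:
    a voxel is 1 if any point falls inside it, and 0 otherwise."""
    mins = points.min(axis=0)
    extent = (points.max(axis=0) - mins).max() + 1e-9
    normalized = (points - mins) / extent            # into the unit cube
    idx = np.clip((normalized * resolution).astype(int), 0, resolution - 1)
    grid = np.zeros((resolution, resolution, resolution), dtype=np.uint8)
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = 1
    return grid

points = np.random.rand(2048, 3)                     # stand-in for a real scan
grid = occupancy_grid(points, resolution=30)
print(grid.shape, grid.mean())                       # 27,000 voxels, mostly 0
```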

Because of the expensive costs, most works just use a very coarse grid, such as 30 by 30 by 30 in resolution [127], which in turn causes large quantization errors. Besides voxelizing point clouds to volumetric grids and using 3D CNNs, one could also project point clouds onto a 2D plane or render a 2D image from them, and then use popular 2D CNNs [103]. However, due to projection, certain 3D information is lost and it is not always obvious which viewpoint to select for the projection.

Furthermore, one could extract hand-crafted features from point clouds first, and then use simple fully connected networks to process them [28]. In this way, however, the feature learning is gated by the hand-crafted features. Since all these conversions have shortcomings, an appealing research question is: Can we achieve effective feature learning directly on point clouds?

The answer is yes. In our work PointNet (Chapter 3), we achieve end-to-end learning for the irregular point data, without converting it to other representations first. We propose a unified architecture that directly takes point clouds as input and outputs either class labels for the entire input or per point segment/part labels for each point of the input. This new machinery also enables us to study both static and dynamic point clouds.

The basic architecture of our network is surprisingly simple, as in the initial stages each point is processed identically and independently. In the basic setting, each point is represented by just its three coordinates (x, y, z). Additional dimensions may be added by computing normals and other local or global features.

The key to our approach is the use of a single symmetric function, max pooling. Effectively the network learns a set of optimization functions/criteria that select interesting or informative points of the point cloud and encode the reason for their selection. The final fully connected layers of the network aggregate these learned optimal values into the global descriptor for the entire shape as mentioned above (shape classification) or are used to predict a label per point (shape segmentation).

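The idea can be illustrated with a minimal NumPy sketch (a single random ReLU layer stands in for the trained per-point network; all names are illustrative): composing a shared per-point function with max pooling yields a symmetric function, so the global descriptor does not depend on the input order.

```python
import numpy as np

rng = np.random.default_rng(0)

def vanilla_pointnet(points, W, b):
    """f(x1, ..., xn) = max_i h(x_i): a shared per-point layer followed
    by a max pool over the point axis, which is symmetric in its inputs."""
    per_point = np.maximum(points @ W + b, 0.0)   # (n, 64) point features
    return per_point.max(axis=0)                  # (64,) global descriptor

points = rng.normal(size=(1024, 3))
W, b = rng.normal(size=(3, 64)), np.zeros(64)
g1 = vanilla_pointnet(points, W, b)
g2 = vanilla_pointnet(rng.permutation(points, axis=0), W, b)
print(np.allclose(g1, g2))                        # True: order does not matter
```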

PointNet is both efficient and effective, as well as robust against various data corruptions. Compared with state-of-the-art volumetric models [83], our PointNet model saves more than 80% memory and more than 88% computation cost, and is able to process more than one million points per second on a GTX 1080 GPU – it is therefore a very promising architecture for portable and mobile devices. Although it is the first deep learning architecture on raw point clouds, PointNet is already able to outperform or match prior art based on volumetric or multi-view inputs and CNNs.

Furthermore, PointNet is also very robust to missing points and various input perturbations: under a multi-class shape classification task, when as much as 50% of the points are missing, PointNet is barely affected with only 3.8% accuracy decrease. In comparison, a 3D CNN baseline drops more than 40% in accuracy, 10x worse than PointNet.

Besides experimental evaluations, we also provide a theoretical analysis of our approach. We show that our network can approximate any set function that is continuous. More interestingly, it turns out that our network learns to summarize an input point cloud by a sparse set of key points, which roughly corresponds to the skeleton of objects according to visualization. The theoretical analysis provides an explanation for why our PointNet is highly robust to small perturbation of input points as well as to corruption through point insertion (outliers) or deletion (missing data).

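The critical-point observation can be made concrete: for each feature channel, the max pool keeps exactly one contributing point, and the union of those winners determines the entire global descriptor. A toy sketch with random (untrained) weights:

```python
import numpy as np

rng = np.random.default_rng(1)
points = rng.normal(size=(1024, 3))
W, b = rng.normal(size=(3, 64)), np.zeros(64)
feats = np.maximum(points @ W + b, 0.0)   # (1024, 64) per-point features

# Points that "win" the max pool in at least one channel; deleting any
# point outside this set leaves feats.max(axis=0) unchanged.
critical = np.unique(feats.argmax(axis=0))
print(f"{len(critical)} of {len(points)} points determine the descriptor")
```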

While PointNet succeeds as a permutation invariant network, it lacks the ability to capture local context, as points are processed either independently or all together globally. To address this issue, we extend the PointNet architecture to build a hierarchical point cloud learning network, PointNet++ (Chapter 4), which applies mini-PointNets recursively at local regions of point clouds. Since all its operations are based on the PointNet structure, PointNet++ is also guaranteed to be order invariant. It is actually a superset architecture of PointNet: in a plain single-hierarchy setting, PointNet++ reduces to PointNet.

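A rough sketch of one such hierarchical level, under simplifying assumptions (greedy farthest point sampling, a ball query for grouping, and a single-layer mini-PointNet; all helper names are ours):

```python
import numpy as np

def farthest_point_sampling(xyz, m):
    """Greedily pick m centroids that spread over the point set."""
    dist = np.full(len(xyz), np.inf)
    chosen = [0]
    for _ in range(m - 1):
        dist = np.minimum(dist, np.linalg.norm(xyz - xyz[chosen[-1]], axis=1))
        chosen.append(int(dist.argmax()))
    return xyz[chosen]

def set_abstraction(xyz, m, radius, W, b):
    """One level: sample centroids, group points within `radius` of each
    centroid, and summarize every group with a mini-PointNet (shared
    layer + max pool) expressed in the group's local coordinate frame."""
    centers = farthest_point_sampling(xyz, m)
    feats = []
    for c in centers:
        group = xyz[np.linalg.norm(xyz - c, axis=1) < radius] - c
        feats.append(np.maximum(group @ W + b, 0.0).max(axis=0))
    return centers, np.stack(feats)   # (m, 3) positions, (m, C) features

rng = np.random.default_rng(2)
xyz = rng.normal(size=(1024, 3))
W, b = rng.normal(size=(3, 64)), np.zeros(64)
centers, feats = set_abstraction(xyz, m=128, radius=0.5, W=W, b=b)
```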

Similar to convolutional neural networks, PointNet++ learns hierarchical levels of abstractions and is fully translation invariant. However, the differences between point clouds and images yield new challenges that are unseen in CNNs. One interesting example is the common non-uniform sampling density in point clouds. In images, small kernels are often preferred in modern architectures (such as VGG [97]), but it is not necessarily the case in point cloud learning.

In point clouds, local regions might contain a highly varying number of points, thus a very small 3D kernel will suffer from unstable features. In PointNet++, we design special multi-scale and multi-resolution layers to resolve the issue, as the new layers enable the network to learn how to combine information from multiple scales adapting to the underlying density.

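The multi-scale variant can be sketched as follows (an illustrative helper; the real layer is also trained with random input dropout so the network learns to weigh the scales according to the local density):

```python
import numpy as np

def multi_scale_features(xyz, center, radii, mlps):
    """Summarize one neighborhood at several radii and concatenate the
    results, letting later layers combine scales adaptively."""
    feats = []
    for radius, (W, b) in zip(radii, mlps):
        group = xyz[np.linalg.norm(xyz - center, axis=1) < radius] - center
        feats.append(np.maximum(group @ W + b, 0.0).max(axis=0))
    return np.concatenate(feats)   # small-scale and large-scale features
```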

While the first half of the thesis studies fundamental deep architectures for point clouds (PointNets: PointNet and PointNet++) and their simple applications to point set classification and segmentation, the second half of the thesis focuses on applications of PointNets in more sophisticated 3D scene understanding tasks. In particular, we examine two representative problems: 3D object detection (to estimate 3D bounding boxes of objects in the scene) and scene flow estimation (to recover 3D motion field of points in a dynamic scene).

A shared theme is that our new deep networks on point clouds have opened doors to novel 3D-centric approaches to 3D scene understanding. Without learning machinery on point clouds, most previous methods rely on image representations and suffer from projection distortions. CNNs on front-view images define pixel neighborhoods based on Manhattan distances in 2D images.

However, two neighboring pixels can be actually very far in 3D due to their differences in depth. Failure to capture the real 3D neighborhood greatly increases difficulty in 3D understanding, as we will see in Chapter 5. In contrast, our PointNets enable learning directly in point clouds, with correct 3D neighborhoods to capture local context, thus making problems such as 3D instance segmentation much easier.

For 3D object detection, we propose a novel frustum-based detection framework, which is made possible by PointNets. Our 3D detector, named Frustum PointNets (Chapter 5), significantly advances the state of the art on this problem, as measured by standard benchmarks. In the well-known KITTI dataset for autonomous driving, we achieve more than 8% average precision (AP) improvement on car detection compared with previous state-of-the-art [18];

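The frustum idea amounts to a simple geometric filter: a 2D detection box, combined with the camera projection, carves out the sub-cloud that could contain the object. A hedged sketch (the 3x4 projection matrix `P` and the box format are assumptions for illustration):

```python
import numpy as np

def frustum_points(points, P, box2d):
    """Keep the 3D points whose image projection falls inside a 2D box.
    P: assumed 3x4 camera projection matrix; box2d: (xmin, ymin, xmax, ymax)."""
    homo = np.hstack([points, np.ones((len(points), 1))])   # (N, 4)
    uvw = homo @ P.T                                        # (N, 3)
    u, v, w = uvw[:, 0] / uvw[:, 2], uvw[:, 1] / uvw[:, 2], uvw[:, 2]
    xmin, ymin, xmax, ymax = box2d
    in_box = (u >= xmin) & (u < xmax) & (v >= ymin) & (v < ymax)
    return points[in_box & (w > 0)]   # also require points in front of camera
```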

In the indoor SUN RGB-D dataset, we get a 6.4% improvement in AP while running three orders of magnitude faster than the previous best model [88]. This is the first time that a 3D deep learning method on point clouds has been applied on the problem, but it already outperforms other 3D representations (e.g., volumetric, multi-view) in large-scale scenes, showing the great potential of point cloud based methods.

For scene flow estimation, we are arguably the first to propose an end-to-end deep learning approach to learn scene flow directly in 3D point clouds. Our network, named FlowNet3D (Chapter 6) takes a pair of point clouds from consecutive frames, without assuming correspondence or rigidity in motion, and predicts motion flow for each point in the first frame.

We design novel layers to learn to associate points in two frames and propagate flow estimations, by extending the basic PointNet structure. Since our network is able to directly optimize 3D flows, we achieve more than 40% error reduction compared with various strong baselines. Interestingly, because scene flow is a low-level task, our model trained on large-scale synthetic data successfully generalizes to real LiDAR scans, proving its effectiveness in feature learning.

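One way to picture such an association layer, as a hedged sketch (a single random layer stands in for the learned MLP; the function name and signature are ours): for every point in the first frame, gather nearby second-frame points, encode their displacements together with their features, and max-pool into a per-point motion embedding, with no hard correspondences assumed.

```python
import numpy as np

def flow_embedding(xyz1, xyz2, feat2, radius, W, b):
    """For each frame-1 point, aggregate frame-2 neighbors within `radius`
    into a motion embedding; empty neighborhoods give a zero embedding.
    W has shape (3 + feature_dim, C): displacement plus frame-2 feature."""
    out = np.zeros((len(xyz1), W.shape[1]))
    for i, p in enumerate(xyz1):
        near = np.linalg.norm(xyz2 - p, axis=1) < radius
        if near.any():
            rel = np.hstack([xyz2[near] - p, feat2[near]])
            out[i] = np.maximum(rel @ W + b, 0.0).max(axis=0)
    return out   # (N1, C) per-point flow embeddings
```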

Besides 3D object detection and scene flow estimation, we expect to see more applications of point clouds and their deep architectures in 3D scene understanding. And note that although we have focused on 3D understanding tasks, the problem of processing unordered sets by neural nets is a very general and fundamental one. We expect that our ideas can be transferred to other domains as well.

1.2 Contributions and Thesis Outline

Figure 1.2: Thesis contributions: novel architectures for deep learning on point clouds, and their applications in 3D scene understanding.

As illustrated in Fig. 1.2, we start by designing a new family of deep neural networks specialized for point clouds, and then equipped with this new learning machinery, we propose new approaches to two example problems in 3D scene understanding. In summary, the key contributions of this thesis are as follows:

• Novel architectures for deep learning on point clouds – PointNet [81] and PointNet++ [82] that are light-weight, robust to data corruption and respect invariances of point cloud data, providing a unified framework for various 3D tasks such as object classification, object part segmentation and semantic segmentation. Extensive empirical and theoretical analyses are also provided on the stability and efficiency of the method.

• A new framework for RGB-D data based 3D object detection called Frustum PointNets [79], with direct learning on point clouds for 3D geometry inference. Evaluated on large-scale public benchmarks, our detectors improve state-of-the-art detection performance by significant margins.

• A new deep network for learning scene flow in 3D point clouds, called FlowNet3D [80]. The network has a homogeneous structure with innovative uses of basic PointNet modules, and learns to extract point cloud features, correlate geometry patterns and smooth motion flows in an end-to-end manner.

Contents for each chapter of the thesis are as follows:

In Chapter 2 we introduce the datasets we use in the thesis and survey related work in point cloud descriptors, deep learning on 3D data and 3D scene understanding tasks.

Chapter 3 describes a novel type of neural network that directly consumes point clouds, faithful to the permutation invariance of points in the input. The network, named PointNet, provides a unified architecture for applications ranging from object classification, part segmentation, to scene semantic parsing.

Though simple, PointNet is highly efficient and effective. Empirically, it shows strong performance on par with or even better than the state-of-the-art. Theoretically, we provide an analysis towards understanding what the network has learnt and why the network is robust with respect to input perturbation and corruption. The content of this chapter is based primarily on Qi et al. [81].

Chapter 4 extends Chapter 3 by introducing a hierarchical neural network that applies PointNet recursively on a nested partitioning of the input point set. By exploiting metric space distances, the new network named PointNet++ is able to learn local features with increasing contextual scales. A further observation is that point sets are usually sampled with varying densities, which results in greatly decreased performance for networks trained on uniform densities.

To address the issue, we propose novel learning layers on point sets that adaptively combine features from multiple scales. Lastly, we show that the same network architecture can be extended to spaces in higher dimensions than 3D, with an example in organic shape classification with points in geodesic spaces. The content of this chapter is based primarily on Qi et al. [82].

Chapter 5 presents a new framework for 3D object detection from RGB-D data, based on PointNet/PointNet++ architectures. Previous methods focus on images or 3D voxels, often obscuring natural 3D patterns and invariances of 3D data. In contrast, we directly operate on raw point clouds by popping up depth scans.

However, a key challenge of this approach is how to efficiently localize objects in point clouds of large-scale scenes (region proposal). Instead of solely relying on 3D proposals, our method leverages both mature 2D object detectors and advanced 3D deep learning for object localization, achieving efficiency as well as high recall for even small objects. The content of this chapter is based primarily on Qi et al. [79].

Chapter 6 proposes a novel deep neural network named FlowNet3D that learns scene flow in 3D point clouds. While previous image-based methods optimize for 2D terms such as optical flow and disparity, and compute 3D flow afterwards, unnecessarily suffering from accumulated errors, our method directly optimizes for 3D flow accuracy. With a homogeneous structure of PointNet modules, FlowNet3D simultaneously learns hierarchical point cloud features as well as how to correlate features and smooth flows. The content of this chapter is based primarily on Qi et al. [80].

Finally, in Chapter 7, we summarize the thesis, provide insights on directions for future research, and discuss a broad range of other applications of deep learning on point clouds.

Chapter 2

Background

Recent advances in deep learning are supported by high-performance computing hardware (GPU) and platforms (Caffe [44], TensorFlow [1], PyTorch [76] etc.), but more importantly they are fueled by large-scale datasets (e.g. ImageNet [22]). Similarly, 3D deep learning rises because 3D data and 3D datasets become available.

The growth in 3D data is due to two forces. The first driving force is the progress in 3D sensing technology. As costs of sensors decrease quickly, we now have access to very affordable depth cameras such as Kinect, Tango, Structure Sensor, Intel RealSense or even depth cameras in phones (e.g. iPhone X).

Extremely accurate 3D sensors such as LiDARs are also more widespread thanks to the needs of the autonomous driving industry. Increases in 3D sensors directly result in much quicker growth of raw 3D data. The second force is the availability and popularity of free 3D modeling tools such as Blender and SketchUp as well as 3D model sharing platforms such as the Trimble 3D Warehouse, leading to a fast growing number of 3D models that are publicly accessible.

However, while 3D data quickly grows from both sensing and modeling, they are not yet ready to be used by deep learning algorithms, because most models still require annotated data for training. Fortunately, we also observe a few recent efforts in organizing, cleaning and annotating 3D data (e.g. ShapeNet [14]). Our work in this thesis is only possible with these public 3D datasets.

In Sec. 2.1, we introduce the datasets we used and discuss their characteristics. Then in Sec. 2.2, we survey previous work and also discuss parallel and more recent advances in the field.

2.1 Datasets

We used various datasets for a variety of 3D understanding problems to test the performance of our proposed deep architectures and algorithms. The datasets involved are very diverse, ranging from single objects to multi-objects in scenes, from synthetic shapes to real scans and from indoor rooms to outdoor environments.

Datasets with single objects:

• MNIST [52]: A dataset for handwritten digits, with 60k training and 10k testing samples. There are ten digits from 0 to 9, and the image size is 28×28. Images are grayscale. While the dataset is simple, it is a popular benchmark for machine learning algorithms. We represent the images with 2D points and use them to test our models in digit classification, to compare with prior work using CNNs and fully-connected networks. (A sketch of this image-to-point-set conversion is given after the list.)


• ModelNet40 [127]: A 3D dataset of CAD models from 40 categories (mostly man-made objects such as chairs, bathtubs, cars etc.). All models are aligned in up-right poses. By default, 10 of the 40 categories have aligned poses in azimuth angles. There is also a recent update in the dataset, providing fully aligned models for all categories. We use the official split with 9,843 shapes for training and 2,468 for testing to evaluate our models in 3D shape classification, with random shape rotations in the azimuth direction (along the up-axis; a sketch of this rotation is given after the list).


• ShapeNetPart [132]: This is a dataset built on top of the ShapeNet [14] dataset (a large-scale, richly annotated 3D shape repository), containing 16,881 shapes from 16 of the ShapeNet classes, annotated with 50 parts in total. We use the official train/test split following [14]. We test our models for object part segmentation on this dataset, which is a much finer grained task than whole shape classification in ModelNet40.

• SHREC15 [56]: A dataset for non-rigid shape retrieval, containing 1200 shapes (as triangle meshes) from 50 categories. Each category contains 24 shapes which are mostly organic ones with various poses such as horses, cats, etc. We use five-fold cross-validation to acquire classification accuracy on this dataset to test how our model generalizes to point sets in non-Euclidean spaces.
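
Two of the preprocessing steps above can be sketched briefly. First, the conversion of an MNIST digit into a 2D point set; the threshold and the fixed-size sampling are our assumptions for illustration, not necessarily the exact scheme used in the experiments:

```python
import numpy as np

def digit_to_points(img, threshold=0.5, n_points=256, rng=np.random):
    """Turn a 28x28 grayscale digit (values in [0, 1]) into a fixed-size
    2D point set from the bright pixels (assumes a non-empty digit)."""
    ys, xs = np.nonzero(img > threshold)
    pts = np.stack([xs, ys], axis=1).astype(np.float32) / 27.0  # to [0, 1]^2
    idx = rng.choice(len(pts), n_points, replace=len(pts) < n_points)
    return pts[idx]   # (n_points, 2)
```

Second, the random azimuth rotation used to augment ModelNet40 shapes; this sketch assumes z is the up axis (permute the matrix rows if the models are y-up):

```python
import numpy as np

def random_azimuth_rotation(points, rng=np.random):
    """Rotate an (N, 3) point cloud by a random angle about the up axis."""
    theta = rng.uniform(0.0, 2.0 * np.pi)
    c, s = np.cos(theta), np.sin(theta)
    R = np.array([[c, -s, 0.0],
                  [s,  c, 0.0],
                  [0.0, 0.0, 1.0]])   # rotation about z (assumed up axis)
    return points @ R.T
```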

Datasets with multiple objects or of 3D scenes:

• S3DIS [2]: The Stanford 3D Indoor Spaces (S3DIS) dataset contains 3D scans from Matterport scanners in 6 building areas including 271 rooms. Each point in the scan is annotated with one of the semantic labels from 13 categories (chair, table, floor, wall etc. plus clutter). The dataset also provides annotations for depth and normal maps, as well as instance segmentations. We use this dataset for evaluation of semantic segmentation and 3D object detection in our PointNet model.

• ScanNet [20]: It is a richly annotated dataset of reconstructions of indoor scenes. It contains more than 2.5 million views of more than 1,500 scans of indoor rooms, annotated with 3D camera poses, surface reconstructions, and instance-level semantic segmentation. Compared with S3DIS, it has more diversity in room layouts and styles. We follow the experimental setting in [20] and use 1,201 scenes for training and 312 scenes for testing, to evaluate our model on the task of semantic segmentation.

• SUN RGB-D [98]: The dataset consists of 10,355 RGB-D images captured from various depth sensors for indoor scenes (bedrooms, dining rooms etc.). It provides annotations on scene classification, room layouts, semantic segmentation and 3D bounding boxes. We follow the same train/val splits as [98, 88] for experiments. We use this benchmark to evaluate our Frustum PointNets on 3D object detection in indoor scenes.

• FlyingThings3D [66]: The dataset consists of rendered scenes with multiple randomly moving objects sampled from ShapeNet. There are in total around 32k stereo images with ground truth disparity and optical flow maps. With synthetic generation, the dataset gets full annotations on object masks, optical flow and disparity changes between frames, so we can compute ground truth scene flow to test and validate designs in our FlowNet3D model.

• KITTI [31, 70, 69]: It is a large-scale dataset for autonomous driving and 3D vision, with various benchmarks in object recognition, tracking, depth estimation, flow estimation and odometry etc. In our thesis we use the annotations involved in the 3D object detection and scene flow estimation benchmarks.

The object detection benchmark in KITTI provides synchronized RGB images and LiDAR point clouds with ground truth amodal 2D and 3D box annotations for vehicles, pedestrians and cyclists. The training set contains 7481 frames and an undisclosed test set contains 7581 frames. We use this benchmark to evaluate our Frustum PointNets on 3D object detection in outdoor scenes.

The scene flow benchmark consists of 200 stereo images for training scenes and 200 for test scenes (4 color images per scene), with dynamic scenes of multiple moving objects. Each pixel in the image is annotated with ground truth optical flow and disparity change. By propagating annotations on pixels to LiDAR point clouds, we can recover ground truth scene flow labels in 3D point clouds. We use this dataset to evaluate our FlowNet3D model in scene flow estimation.

Besides the datasets mentioned here, which are used in experiments in this thesis, there are many more other efforts in building 3D datasets. Examples include a large-scale synthetic 3D scene dataset SUNCG [101], and Matterport3D [13], another richly annotated RGB-D dataset of indoor scenes. There are also on-going efforts to build large 3D/RGB-D video datasets and simulation platforms for 3D understanding and interactions. We foresee that datasets will keep playing an important role in future 3D deep learning research.

2.2 Related Work

As 3D is prevalent in computer vision, graphics and robotics, there have been tremendous works on topics of point clouds, 3D data learning and 3D scene understanding. In this section, we briefly review previous works related to our research, and introduce some concurrent and more recent efforts since the work of this thesis. We first review prior attempts in designing and learning features for point clouds, then discuss 3D deep learning on other 3D representations, and finally look at related work in 3D scene understanding.

2.2.1 Point Cloud Descriptors

Most existing features for point clouds are handcrafted towards specific tasks. Point features often encode certain statistical properties of points and are designed to be invariant to certain transformations, which are typically classified as intrinsic (WKS [3], HKS [107, 8] etc.) or extrinsic (PFH [93], FPFH [92], D2 [75], inner-distance [60], Spin Image [45], LFD [15] etc.). They can also be categorized as local features and global features, or be classified based on the input format they take (e.g. 3D points, points with features, or depth maps). For a specific task, it is not trivial to find the optimal feature combination (feature selection), even with a deep neural network [28].

In this work we present a new type of network for point feature learning, which can be easily adapted to different tasks, such as classification, segmentation, shape retrieval and finding correspondences. Our network is also flexible enough to consume extra features beyond 3D coordinates, and is able to deal with partial point clouds.

2.2.2 Deep Learning on Unordered Points

Vinyals et al. [116] were among the first to study deep learning on point sets. They designed a read-process-write network with an attention mechanism to consume unordered input sets and showed that their network has the ability to sort numbers. However, since their work focuses on generic sets and natural language processing (NLP) applications, it lacks the role of geometry in the sets. Two other works concurrent to ours are from Ravanbakhsh et al. [85] and Zaheer et al. [135], in which they designed deep networks that achieve invariance to set element ordering.

However, their works emphasize a wide range of applications for point set learning, rather than focusing on 3D understanding. Compared with these prior works, we exploit the geometric properties of 3D point sets (geometric transformations, and distance spaces for sampling and local context), provide theoretical analysis of what has been learned, and give rich visualizations to help understand the model. Our work also targets real applications in 3D scene understanding.

Points sampled from a metric space are usually noisy and have non-uniform sampling density. This affects effective point feature extraction and causes difficulty for learning. One of the key issues is to select a proper scale for point feature design. Previously, several approaches have been developed in this regard [77, 72, 6, 21, 33, 126], either in the geometry processing community or in the photogrammetry and remote sensing community. None of the above deep learning works explicitly considers the problem of non-uniform sampling density. In PointNet++ we learn to extract point features and balance multiple feature scales in an end-to-end fashion.

Since our work, recently there have been more efforts in designing novel deep learning networks for point clouds. For example, dynamic graph CNNs [123] extends PointNet++ by allowing neighborhood search in latent feature spaces. VoxelNet [136] combines PointNet with 3D CNNs. It computes local voxel features using a PointNet-like network, and then applies 3D CNNs on voxels for object proposals and classification. O-CNN [122], OctNet [89] and Kd-network [49] have introduced indexing trees to 3D deep learning, to avoid computations at empty spaces.

ShapeContextNet [130] extends the traditional Shape Context [5] descriptor to a trainable setting with deep neural networks. SPLATNet [102] achieves efficient 3D convolution by sparse bilateral convolutions on a lattice structure. Tangent convolution network [110] exploits the fact that point clouds from 3D sensors live on a 2D manifold in 3D space, and achieves point cloud learning by 2D convolutions on local tangent planes. With diverse applications and representations of 3D data, we envision more deep architectures being invented in the near future.

2.2.3 Deep Learning on Other 3D Representations

3D data has multiple popular representations, leading to various approaches for learning. Volumetric CNNs: [127, 65, 83] are the pioneers applying 3D convolutional neural networks (CNNs) on voxelized shapes. There have also been extensive works that apply 3D CNNs for semantic segmentation of scenes [20, 111] on voxelized point clouds. However, volumetric representation is constrained by its resolution due to data sparsity and the computation cost of 3D convolution. FPNN [55] and Vote3D [121] proposed special methods to deal with the sparsity problem; however, their operations are still on sparse volumes, and it is challenging for them to process very large point clouds.

Multiview CNNs: [103, 83] have tried to render 3D point clouds or shapes into 2D images and then apply 2D conv nets to classify them. With well engineered image CNNs, this line of methods has achieved dominating performance on shape classification and retrieval tasks [95]. However, it is nontrivial to extend them to scene understanding or other 3D tasks such as point classification and shape completion. Spectral CNNs: Some recent works [10, 64] use spectral CNNs on meshes. However, these methods are currently constrained to manifold meshes such as organic objects, and it is not obvious how to extend them to non-isometric shapes such as furniture.

Feature-based DNNs: [28, 35] first convert the 3D data into a vector by extracting traditional shape features, and then use a fully connected net to classify the shape. We think they are constrained by the representation power of the features extracted.

In our work, we do deep learning directly on the raw 3D representation – point clouds, without the pre-processing stage of converting them to volumes, images, graphs, or feature vectors.

2.2.4 3D Object Detection

3D scene understanding is a very broad topic, involving object recognition, layout estimation, semantic segmentation, motion estimation etc. 3D object detection is one of the most important tasks among them and is especially relevant in modern applications such as autonomous driving and augmented reality. Because of its importance, researchers have made many attempts to approach the 3D detection problem, taking various ways to represent RGB-D data. Front view image based methods: [16, 73, 129] take monocular RGB images and shape priors or occlusion patterns to infer 3D bounding boxes. [54, 23] represent depth data as 2D maps and apply CNNs to localize objects in the 2D image.

In comparison, we represent depth as a point cloud and use advanced 3D deep networks (PointNets) that can exploit 3D geometry more effectively. Bird's eye view based methods: MV3D [18] projects a LiDAR point cloud to bird's eye view and trains a region proposal network (RPN [87]) for 3D bounding box proposal. However, the method lags behind in detecting small objects, such as pedestrians and cyclists, and cannot easily adapt to scenes with multiple objects in the vertical direction. 3D based methods: [121, 99] train 3D object classifiers by SVMs on hand-designed geometry features extracted from point clouds and then localize objects using sliding-window search.

[26] extends [121] by replacing SVM with 3D CNN on voxelized 3D grids. [88] designs new geometric features for 3D object detection in a point cloud. [100, 53] convert a point cloud of the entire scene into a volumetric grid and use 3D volumetric CNNs for object proposal and classification. Computation cost for those methods is usually quite high due to the expensive cost of 3D convolutions and the large 3D search space.

Recently, [51] proposed a 2D-driven 3D object detection method that is similar to our Frustum PointNets in spirit. However, they use hand-crafted features (based on histogram of point coordinates) with simple fully connected networks to regress 3D box location and pose, which is sub-optimal in both speed and performance. In contrast, we propose a more flexible and effective solution with deep 3D feature learning (PointNets). More previously, Kim et al. [47] also tried to leverage 2D recognition to reduce 3D search space, for 3D object localization in RGB-D data.

Their method proposes multiple 2D hypothesis masks for an object in an image, and extracts hand-crafted features from corresponding point clouds, to estimate the object’s 3D location with a structural latent SVM. In our method, instead of relying on 2D segmentation, we argue and show that 3D segmentation directly in point clouds leads to cleaner and more accurate 3D localization. While deriving 3D masks was hard at that time (with hand-crafted features and non-deep learning models), it is now possible with our newly proposed PointNets. Beyond 3D localization in [47], our method also achieves 3D instance segmentation and amodal 3D bounding box estimation.

In terms of using point clouds for 3D object detection, most existing works convert point clouds to images or volumetric forms before feature learning. [127, 65, 83] voxelize point clouds into volumetric grids and generalize image CNNs to 3D CNNs. [55, 90, 122, 26] design more efficient 3D CNN or neural network architectures that exploit sparsity in point clouds. However, these CNN based methods still require quantization of point clouds at a certain voxel resolution. Our PointNets provide a new possibility to directly learn features and detect objects in point clouds.

2.2.5 Scene Flow Estimation

Vedula et al. [115] first introduced the concept of scene flow, as a three-dimensional field of motion vectors in the world. They assumed knowledge of stereo correspondences and combined optical flow and first-order approximations of depth maps to estimate scene flow. Since this seminal work, many others have tried to jointly estimate structure and motion from stereoscopic images [40, 78, 125, 114, 12, 124, 117, 118, 4, 119, 68], mostly in a variational setting with regularizations for smoothness of motion and structure [40, 4, 114], or with assumption of the rigidity of the local structures [118, 68, 119].

With the recent advent of commodity depth sensors, it has become feasible to estimate scene flow from monocular RGB-D images [36], by generalizing variational 2D flow algorithms to 3D [38, 43] and exploiting more geometric cues provided by the depth channel [84, 39, 105]. Our work focuses on learning scene flow directly from point clouds, without any dependence on RGB images or assumptions on rigidity and camera motions. By doing so we can directly optimize for the 3D flow accuracy instead of optimizing the 2D optical flow.

Very recently, Dewan et al. [24] proposed to estimate dense rigid motion fields in 3D LiDAR scans. They formulate the problem as an energy minimization problem over a factor graph, with hand-crafted 3D descriptors (SHOT [112]) for correspondence search. Later, Ushani et al. [113] presented a different pipeline: they first convert point clouds to an occupancy grid, use a learned background filter to remove static voxels, and then train a logistic classifier to tell whether two columns of occupancy grids correspond. Compared to these two works, our method is a cleaner, end-to-end solution with deeply learned features and no dependency on hard correspondences or assumptions on rigidity.

Our FlowNet3D is inspired by FlowNet [25] and FlowNet 2.0 [41], two seminal works that proposed to learn optical flow with convolutional neural networks. [66] extended FlowNet to simultaneously estimate disparity and optical flow. However, the irregular structure in point clouds (no regular grids as in images) presents new challenges and opportunities for design of novel architectures for scene flow estimation directly in point clouds, which is the focus of our work.

Chapter 3

Deep Learning on Point Sets: PointNet

3.1 Introduction

In this chapter we explore deep learning architectures capable of reasoning about point clouds, a popular and important type of 3D geometric data. Typical convolutional architectures require highly regular input data formats, like those of image grids or 3D voxels, in order to perform weight sharing and other kernel optimizations. Since point clouds are not in a regular format, most researchers typically transform such data to regular 3D voxel grids or collections of images (e.g., views) before feeding them to a deep net architecture. This data representation transformation, however, renders the resulting data unnecessarily voluminous, while also introducing quantization artifacts that can obscure natural invariances of the data.

For this reason we focus on directly processing point clouds without converting them to other formats – and we name our resulting deep nets PointNets. Our PointNet is a unified architecture that directly takes point clouds as input and outputs either class labels for the entire input or per point segment/part labels for each point of the input (Fig. 3.1).

Although point clouds are simple in representation, we still face two challenges in our architecture design. First, the model needs to respect the fact that a point cloud is just a set of points and therefore invariant to permutations of its members. Second, invariances to rigid motions also need to be considered. To address the challenges, we construct PointNet as a symmetric function composed of neural networks, which guarantees its invariance to input point orders. Furthermore, our input format is easy to apply rigid or affine transformations to, as each point transforms independently. Thus we can add a data-dependent spatial transformer network that attempts to canonicalize the data before the PointNet processes them, so as to further improve the results.

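That alignment step can be sketched as follows; `tnet` stands for the small PointNet-like sub-network that predicts the transform (a stand-in returning the identity is used here for illustration):

```python
import numpy as np

def apply_input_transform(points, tnet):
    """Canonicalize a cloud before the main network: `tnet` predicts one
    3x3 matrix from the whole input, and every point is multiplied by it
    (each point transforms independently, so this is cheap)."""
    T = tnet(points)                       # (3, 3) matrix per cloud
    return points @ T.T

identity_tnet = lambda pts: np.eye(3)      # placeholder for the real T-Net
aligned = apply_input_transform(np.random.rand(1024, 3), identity_tnet)
```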

We provide both a theoretical analysis and an experimental evaluation of our approach. We show that our network can approximate any set function that is continuous. More interestingly, it turns out that our network learns to summarize an input point cloud by a sparse set of key points, which roughly corresponds to the skeleton of objects according to visualization. The theoretical analysis provides an understanding why our PointNet is highly robust to small perturbation of input points as well as to corruption through point insertion (outliers) or deletion (missing data).

Figure 3.1: Applications of PointNet. We propose a novel deep net architecture that consumes raw point cloud (set of points) without voxelization or rendering. It is a unified architecture that learns both global and local point features, providing a simple, efficient and effective approach for a number of 3D recognition tasks.

On a number of benchmark datasets ranging from shape classification, part segmentation to scene segmentation, we experimentally compare our PointNet with state-of-the-art approaches based upon multi-view and volumetric representations. Under a unified architecture, not only is our PointNet much faster in speed, but it also exhibits strong performance on par or even better than state of the art.

The key contributions of this chapter are as follows:

• We design a novel deep net architecture suitable for consuming unordered point sets in 3D;

• We show how such a net can be trained to perform 3D shape classification, shape part segmentation and scene semantic parsing tasks;

• We provide thorough empirical and theoretical analysis on the stability and efficiency of our method;

• We illustrate the 3D features computed by the selected neurons in the net and develop intuitive explanations for its performance.

3.2 Problem Statement

We design a deep learning framework that directly consumes unordered point sets as inputs. A point cloud is represented as a set of 3D points {Pi | i = 1, ..., n}, where each point Pi is a vector of its (x, y, z) coordinates plus extra feature channels such as color, normal, etc. For simplicity and clarity, unless otherwise noted, we only use the (x, y, z) coordinates as our point's channels.
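
As a concrete illustration (a minimal sketch assuming NumPy; not part of the original text), such a point set is conveniently stored as an n x 3 array, with any extra feature channels widening the last axis:

import numpy as np

n = 1024
points_xyz = np.random.rand(n, 3).astype(np.float32)  # {Pi | i = 1, ..., n} as an (n, 3) array
# extra channels, e.g. normals, simply widen the last axis:
points_xyz_nrm = np.concatenate(
    [points_xyz, np.random.rand(n, 3).astype(np.float32)], axis=1)  # (n, 6)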

For the object classification task, the input point cloud is either directly sampled from a shape or presegmented from a scene point cloud. Our proposed deep network outputs k scores for all the k candidate classes.

For semantic segmentation, the input can be a single object for part region segmentation, or a sub-volume from a 3D scene for object region segmentation. Our model will output n×m scores for each of the n points and each of the m semantic sub-categories.

Figure 3.2: PointNet architecture. The classification network takes n points as input, applies input and feature transformations, and then aggregates point features by max pooling. The output is classification scores for k classes. The segmentation network is an extension to the classification net. It concatenates global and local features and outputs per-point scores. “mlp” stands for multi-layer perceptron; numbers in brackets are layer sizes. Batch norm is used for all layers with ReLU. Dropout layers are used for the last mlp in the classification net.

3.3 Deep Learning on Point Sets

The architecture of our network (Sec 3.3.2) is inspired by the properties of point sets in R^n (Sec 3.3.1).

3.3.1 Properties of Point Sets in R^n

Our input is a subset of points from a Euclidean space. It has three main properties:

• Unordered. Unlike pixel arrays in images or voxel arrays in volumetric grids, a point cloud is a set of points without specific order. In other words, a network that consumes N 3D points needs to be invariant to the N! permutations of the input set in data feeding order.

• Interaction among points. The points are from a space with a distance metric. This means that points are not isolated, and neighboring points form a meaningful subset. Therefore, the model needs to be able to capture local structures from nearby points, and the combinatorial interactions among local structures.

• Invariance under transformations. As a geometric object, the learned representation of the point set should be invariant to certain transformations. For example, rotating and translating points all together should not modify the global point cloud category nor the segmentation of the points.

3.3.2 PointNet Architecture

Our full network architecture is visualized in Fig. 3.2, where the classification network and the segmentation network share a great portion of common structure. Please read the caption of Fig. 3.2 for the pipeline.

Our network has three key modules: the max pooling layer as a symmetric function to aggregate information from all the points, a local and global information combination structure, and two joint alignment networks that align both input points and point features.

We will discuss our reason behind these design choices in separate paragraphs below.

Symmetry Function for Unordered Input In order to make a model invariant to input permutation, three strategies exist: 1) sort the input into a canonical order; 2) treat the input as a sequence to train an RNN, but augment the training data by all kinds of permutations; 3) use a simple symmetric function to aggregate the information from each point. Here, a symmetric function takes n vectors as input and outputs a new vector that is invariant to the input order. For example, the + and × operators are symmetric binary functions, as the check below illustrates.
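
As a minimal illustration (assuming NumPy; not part of the original text), aggregating a point set by a symmetric function such as + or max is unaffected by the feeding order, whereas anything that depends on indexing is not:

import numpy as np

pts = np.random.rand(5, 3)               # five 3D points
shuffled = np.random.permutation(pts)    # the same set, fed in a different order

print(np.allclose(pts.sum(axis=0), shuffled.sum(axis=0)))  # True: + is symmetric
print(np.allclose(pts.max(axis=0), shuffled.max(axis=0)))  # True: max is symmetric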

While sorting sounds like a simple solution, in high dimensional space there in fact does not exist an ordering that is stable w.r.t. point perturbations in the general sense. This can easily be shown by contradiction. If such an ordering strategy existed, it would define a bijection map between a high-dimensional space and a 1D real line. It is not hard to see that requiring an ordering to be stable w.r.t. point perturbations is equivalent to requiring that this map preserve spatial proximity as the dimension reduces, a task that cannot be achieved in the general case. Therefore, sorting does not fully resolve the ordering issue, and it is hard for a network to learn a consistent mapping from input to output as the ordering issue persists. As shown in experiments (Fig. 3.5), we find that a fully connected network applied on a sorted point set performs poorly, though slightly better than one directly processing an unsorted input.

The idea to use an RNN considers the point set as a sequential signal and hopes that by training the RNN with randomly permuted sequences, the RNN will become invariant to input order. However, in “Order Matters” [116] the authors have shown that order does matter and cannot be totally omitted. While RNNs have relatively good robustness to input ordering for sequences of small length (dozens of elements), it is hard to scale to thousands of input elements, which is the common size for point sets. Empirically, we have also shown that a model based on an RNN does not perform as well as our proposed method (Fig. 3.5).

Our idea is to approximate a general function defined on a point set by applying a symmetric function on transformed elements in the set:

f({x1,..., xn}) ≈ g(h(x1),...,h(xn)), (3.1)

Empirically, our basic module is very simple: we approximate h by a multi-layer perceptron network and g by a composition of a single variable function and a max pooling function.
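
The following much-simplified PyTorch sketch (layer sizes are illustrative; the actual network in Fig. 3.2 additionally uses batch norm and the alignment networks described below) shows this h-then-max-then-gamma composition and verifies its permutation invariance:

import torch
import torch.nn as nn

class TinyPointNet(nn.Module):
    def __init__(self, k=40, feat_dim=1024):
        super().__init__()
        # h: a multi-layer perceptron applied independently to every point (shared weights)
        self.h = nn.Sequential(nn.Linear(3, 64), nn.ReLU(),
                               nn.Linear(64, feat_dim), nn.ReLU())
        # gamma: maps the pooled global feature to k class scores
        self.gamma = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(),
                                   nn.Linear(256, k))

    def forward(self, pts):                        # pts: (batch, n, 3)
        per_point = self.h(pts)                    # (batch, n, feat_dim)
        global_feat = per_point.max(dim=1).values  # g: symmetric max pooling over points
        return self.gamma(global_feat)             # (batch, k)

net = TinyPointNet()
x = torch.randn(2, 1024, 3)
perm = torch.randperm(1024)
assert torch.allclose(net(x), net(x[:, perm, :]), atol=1e-5)  # invariant to point order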

Experiments show that this works well. Through a collection of h functions, we can learn a number of f’s to capture different properties of the set.

While our key module seems simple, it has interesting properties (see Sec 3.4.3) and can achieve strong performance (see Sec 3.4.1) in a few different applications. Due to the simplicity of our module, we are also able to provide theoretical analysis, as in Sec 3.3.3.

Local and Global Information Aggregation The output from the above section forms a vector [f1, ..., fK], which is a global signature of the input set. We can easily train an SVM or a multi-layer perceptron classifier on the global shape features for classification. However, point segmentation requires a combination of local and global knowledge. We can achieve this in a simple yet highly effective manner.

Our solution can be seen in Fig. 3.2 (Segmentation Network). After computing the global point cloud feature vector, we feed it back to per-point features by concatenating the global feature with each of the point features. Then we extract new per-point features based on the combined point features; this time the per-point feature is aware of both the local and the global information, as sketched below.
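
A hedged sketch of this concatenation step in PyTorch (shapes are illustrative, not the exact layer sizes of Fig. 3.2):

import torch

per_point = torch.randn(2, 1024, 64)        # local per-point features (batch, n, 64)
global_feat = per_point.max(dim=1).values   # pooled global signature (batch, 64)
tiled = global_feat.unsqueeze(1).expand(-1, per_point.size(1), -1)
combined = torch.cat([per_point, tiled], dim=2)  # (batch, n, 128): local + global per point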

With this modification our network is able to predict per-point quantities that rely on both local geometry and global semantics. For example, we can accurately predict per-point normals (Appendix A, Fig. A.7), validating that the network is able to summarize information from the point’s local neighborhood. In the experiments section, we also show that our model can achieve state-of-the-art performance on shape part segmentation and scene segmentation.

Joint Alignment Network The semantic labeling of a point cloud has to be invariant if the point cloud undergoes certain geometric transformations, such as a rigid transformation. We therefore expect the representation learned from our point set to be invariant to these transformations.

A natural solution is to align all input sets to a canonical space before feature extraction. Jaderberg et al. [42] introduce the idea of a spatial transformer to align 2D images through sampling and interpolation, achieved by a specifically tailored layer implemented on the GPU.

Our input form of point clouds allows us to achieve this goal in a much simpler way compared with [42]. We do not need to invent any new layers, and no aliasing is introduced as in the image case. We predict an affine transformation matrix by a mini-network (T-net in Fig. 3.2) and directly apply this transformation to the coordinates of the input points. The mini-network itself resembles the big network and is composed of basic modules of point-independent feature extraction, max pooling and fully connected layers. More details about the T-net are described in Appendix A, Sec. A.1.
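
Because each point transforms independently, applying the predicted alignment reduces to one batched matrix multiplication, as in this sketch (the identity matrix stands in for whatever 3x3 matrix a T-net would actually predict):

import torch

pts = torch.randn(2, 1024, 3)      # a batch of raw point clouds
A = torch.eye(3).repeat(2, 1, 1)   # stand-in for the T-net's predicted 3x3 transforms
aligned = torch.bmm(pts, A)        # each point is transformed independently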

This idea can be further extended to the alignment of feature space as well. We can insert another alignment network on point features and predict a feature transformation matrix to align features from different input point clouds. However, the transformation matrix in the feature space has much higher dimension than the spatial transformation matrix, which greatly increases the difficulty of optimization. We therefore add a regularization term to our softmax training loss. We constrain the feature transformation matrix to be close to an orthogonal matrix:

Lreg = ||I − AA^T||_F^2,   (3.2)

where A is the feature alignment matrix predicted by a mini-network. An orthogonal transformation will not lose information in the input and is thus desirable. We find that by adding the regularization term, the optimization becomes more stable and our model achieves better performance.
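
Written directly in PyTorch, the regularizer in (3.2) is only a few lines (a sketch; d = 64 matches the feature transform in Fig. 3.2):

import torch

def orthogonality_loss(A):  # A: (batch, d, d) feature alignment matrices
    I = torch.eye(A.size(1), device=A.device)
    diff = I - torch.bmm(A, A.transpose(1, 2))
    return (diff ** 2).sum(dim=(1, 2)).mean()  # squared Frobenius norm, batch-averaged

loss = orthogonality_loss(torch.randn(4, 64, 64))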

3.3.3 Theoretical Analysis

Universal Approximation We first show the universal approximation ability of our neural network to continuous set functions. By the continuity of set functions, intuitively, a small perturbation to the input point set should not greatly change the function values, such as classification or segmentation scores.

In words, the theorem states that, in theory, our network can approximate any continuous set function f, provided the network has enough neurons, i.e., provided K in (3.1) is large enough.

The proof of this theorem can be found in Appendix A, Sec. A.5. The key idea is that in the worst case the network can learn to convert a point cloud into a volumetric representation, by partitioning the space into equal-sized voxels. In practice, however, the network learns a much smarter strategy to probe the space, as we shall see in point function visualizations.

Bottleneck Dimension and Stability Theoretically and experimentally we find that the expressiveness of our network is strongly affected by the dimension of the max pooling layer, i.e., K in (3.1). Here we provide an analysis, which also reveals properties related to the stability of our model.

Figure 3.3: Qualitative results for part segmentation. We visualize the CAD part segmentation results across all 16 object categories. We show both results for partial simulated Kinect scans (left block) and complete ShapeNet CAD models (right block).

In fact, one can show that the output f(S) is fully determined by a small critical subset CS ⊆ S of at most K points: any input set that contains CS and stays within a bounding superset NS yields exactly the same global feature. Combined with the continuity of h, this explains the robustness of our model w.r.t. point perturbation, corruption and extra noise points. The robustness is gained in analogy to the sparsity principle in machine learning models. Intuitively, our network learns to summarize a shape by a sparse set of key points. In the experiments section we see that the key points form the skeleton of an object.
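
The critical subset is easy to read off a trained network: it consists of the points whose features survive the max pooling. A hedged PyTorch sketch (random features stand in for a trained h):

import torch

per_point = torch.rand(1024, 1024)   # h(x_i) for one cloud: n points x K feature dims
winners = per_point.argmax(dim=0)    # per feature dim, the point that wins the max pooling
critical_idx = torch.unique(winners) # the critical set C_S contains at most K distinct points
print(critical_idx.numel(), "<= K =", per_point.size(1))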

3.4 Experiments

Experiments are divided into four parts. First, we show PointNets can be applied to multiple 3D recognition tasks (Sec 3.4.1). Second, we provide detailed experiments to validate our network design (Sec 3.4.2). Finally, we visualize what the network learns (Sec 3.4.3) and analyze time and space complexity (Sec 3.4.4).

3.4.1 Applications

In this section we show how our network can be trained to perform 3D object classification, object part segmentation and semantic scene segmentation. Even though we are working on a brand new data representation (point sets), we are able to achieve comparable or even better performance on benchmarks for several tasks.

3D Object Classification Our network learns a global point cloud feature that can be used for object classification. We evaluate our model on the ModelNet40 [127] shape classification benchmark. There are 12,311 CAD models from 40 man-made object categories, split into 9,843 for training and 2,468 for testing. While previous methods focus on volumetric and multi-view image representations, we are the first to directly work on raw point clouds.

We uniformly sample 1024 points on mesh faces according to face area and normalize them into a unit sphere. During training we augment the point cloud on the fly by randomly rotating the object along the up-axis and jittering the position of each point with Gaussian noise of zero mean and 0.02 standard deviation.
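
A sketch of this preprocessing in NumPy (assuming y is the up axis; the exact pipeline may differ):

import numpy as np

def normalize_unit_sphere(pts):                      # pts: (n, 3)
    pts = pts - pts.mean(axis=0)                     # center the cloud
    return pts / np.linalg.norm(pts, axis=1).max()   # scale into the unit sphere

def augment(pts, jitter_std=0.02):
    theta = np.random.uniform(0.0, 2.0 * np.pi)      # random rotation about the up axis
    c, s = np.cos(theta), np.sin(theta)
    rot = np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])
    return pts @ rot.T + np.random.normal(0.0, jitter_std, pts.shape)  # rotate + jitter

pts = augment(normalize_unit_sphere(np.random.rand(1024, 3)))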

In Table 3.1, we compare our model with previous works as well as with our baseline using an MLP on traditional features extracted from the point cloud (point density, D2, shape contour, etc.). Our model achieves state-of-the-art performance among methods based on 3D input (volumetric and point cloud). With only fully connected layers and max pooling, our net gains a strong lead in inference speed and can easily be parallelized on CPU as well. There is still a small gap between our method and the multi-view based method (MVCNN [103]), which we think is due to the loss of fine geometry details that can be captured by rendered images.

3D Object Part Segmentation Part segmentation is a challenging fine-grained 3D recognition task. Given a 3D scan or a mesh model, the task is to assign part category label (e.g. chair leg, cup handle) to each point or face.

We evaluate on ShapeNet part data set from [132], which contains 16,881 shapes from 16 categories, annotated with 50 parts in total. Most object categories are labeled with two to five parts. Ground truth annotations are labeled on sampled points on the shapes.

We formulate part segmentation as a per-point classification problem. The evaluation metric is mIoU on points. For each shape S of category C, we calculate the shape’s mIoU as follows: for each part type in category C, compute the IoU between groundtruth and prediction; if the union of groundtruth and prediction points is empty, count the part IoU as 1. We then average the IoUs over all part types in category C to get the mIoU for that shape. To calculate the mIoU for a category, we take the average of the mIoUs over all shapes in that category.
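
This per-shape rule translates directly into code; a minimal NumPy sketch (function and argument names are illustrative):

import numpy as np

def shape_miou(pred, gt, part_ids):   # pred, gt: (n,) per-point part labels of one shape
    ious = []
    for p in part_ids:                # part_ids: all part types of the shape's category
        inter = np.logical_and(pred == p, gt == p).sum()
        union = np.logical_or(pred == p, gt == p).sum()
        ious.append(1.0 if union == 0 else inter / union)  # empty union counts as IoU 1
    return float(np.mean(ious))

print(shape_miou(np.array([0, 0, 1, 1]), np.array([0, 1, 1, 1]), part_ids=[0, 1]))  # ~0.583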

(To be continued...)
