(CVPR 2019) PointRCNN: 3D Object Proposal Generation and Detection From Point Cloud

Abstract

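In this paper, we propose PointRCNN for 3D object detection from raw point clouds. The whole framework consists of two stages: stage-1 for bottom-up 3D proposal generation, and stage-2 for refining proposals in canonical coordinates to obtain the final detection results. Instead of generating proposals from RGB images or projecting the point cloud to bird's-eye view or voxels as previous methods do, our stage-1 sub-network directly generates a small number of high-quality 3D proposals from the point cloud in a bottom-up manner by segmenting the point cloud of the whole scene into foreground points and background. The stage-2 sub-network transforms the pooled points of each proposal to canonical coordinates to learn better local spatial features, which are combined with the global semantic features of each point learned in stage-1 for accurate box refinement and confidence prediction. Extensive experiments on the 3D detection benchmark of the KITTI dataset show that our proposed architecture outperforms state-of-the-art methods by remarkable margins while using only point clouds as input. The code is available at https://github.com/sshaoshuai/PointRCNN.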

1. Introduction

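Deep learning has achieved remarkable progress on 2D computer vision tasks, including object detection [8, 32, 16] and instance segmentation [6, 10, 20]. Beyond 2D scene understanding, 3D object detection is crucial for many real-world applications, such as autonomous driving and domestic robots. While recently developed 2D detection algorithms can handle large variations of viewpoint and background clutter in images, detecting 3D objects from point clouds still faces great challenges from the irregular data format and from the large search space spanned by the 6 degrees of freedom (DoF) of a 3D object.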

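In autonomous driving, the most commonly used 3D sensors are LiDAR sensors, which generate 3D point clouds to capture the 3D structure of the scene. The difficulty of point-cloud-based 3D object detection lies mainly in the irregularity of point clouds. State-of-the-art 3D detection methods either leverage mature 2D detection frameworks by projecting the point cloud into bird's-eye view [14, 42, 17] (see Fig. 1(a)) or into frontal view [4, 38], or convert it into regular 3D voxels [34, 43]. These representations are not optimal and lose information during quantization.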

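Figure 1. Comparison with state-of-the-art methods. Instead of generating proposals from fused feature maps of bird's-eye view and front view [14], or from RGB images [25], our method directly generates 3D proposals from the raw point cloud in a bottom-up manner.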

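Instead of transforming the point cloud into voxels or other regular data structures for feature learning, Qi et al. [26, 28] proposed PointNet, which learns 3D representations directly from point cloud data for point cloud classification and segmentation. As shown in Fig. 1(b), their follow-up work [25] applied PointNet to 3D object detection, estimating 3D bounding boxes from frustum point clouds cropped according to 2D RGB detection results. However, the performance of this method relies heavily on the 2D detection performance, and it cannot take advantage of 3D information to generate robust bounding box proposals.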

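Unlike object detection from 2D images, 3D objects in autonomous driving scenes are naturally and well separated by their annotated 3D bounding boxes. In other words, the training data for 3D object detection directly provides semantic masks for 3D object segmentation. This is a key difference between the training data of 3D detection and that of 2D detection: in 2D object detection, bounding boxes can only provide weak supervision for semantic segmentation [5].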

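Based on this observation, we present a novel two-stage 3D object detection framework, named PointRCNN, which operates directly on 3D point clouds and achieves robust and accurate 3D detection performance (see Fig. 1(c)). The proposed framework consists of two stages. The first stage aims at generating 3D bounding box proposals in a bottom-up scheme. By using the 3D bounding boxes to generate ground-truth segmentation masks, the first stage segments foreground points and simultaneously generates a small number of bounding box proposals from the segmented points. This strategy avoids placing a large number of 3D anchor boxes in the whole 3D space, as previous methods [43, 14, 4] do, and saves much computation.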

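The second stage of PointRCNN conducts canonical 3D box refinement. After the 3D proposals are generated, a point cloud region pooling operation is adopted to pool the point representations learned in stage-1. Unlike existing 3D methods that directly estimate global box coordinates, the pooled 3D points are transformed to canonical coordinates and combined with the pooled point features and the segmentation mask from stage-1 to learn relative coordinate refinement. This strategy fully utilizes all the information provided by our robust stage-1 segmentation and proposal sub-network. To learn more effective coordinate refinement, we also propose a full bin-based 3D box regression loss for proposal generation and refinement. Ablation experiments show that it converges faster and achieves higher recall than other 3D box regression losses.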

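Our contributions can be summarized in three aspects. (1) We propose a novel bottom-up, point-cloud-based 3D bounding box proposal generation algorithm, which generates a small number of high-quality 3D proposals by segmenting the point cloud into foreground objects and background. The point representations learned from segmentation are not only good at proposal generation but also helpful for the later box refinement. (2) The proposed canonical 3D bounding box refinement takes advantage of the high-recall box proposals generated in stage-1 and learns to predict box coordinate refinements in canonical coordinates with robust bin-based losses. (3) Using only point clouds as input, our proposed 3D detection framework PointRCNN outperforms state-of-the-art methods by remarkable margins and ranks first among all published works on the 3D detection test board of KITTI as of Nov. 16, 2018.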

2. Related Work

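3D object detection from 2D images. Several existing works estimate 3D bounding boxes from images. [24, 15] leveraged the geometric constraints between 3D and 2D bounding boxes to recover the 3D object pose. [1, 44, 23] exploited the similarity between 3D objects and CAD models. Chen et al. [2, 3] formulated the 3D geometric information of objects as an energy function to score predefined 3D boxes. Because of the lack of depth information, these works can only generate coarse 3D detection results and can be substantially affected by appearance variations.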

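3D object detection from point clouds. State-of-the-art 3D object detection methods have proposed various ways to learn discriminative features from sparse 3D point clouds. [4, 14, 42, 17, 41] projected the point cloud into bird's-eye view and used 2D CNNs to learn point cloud features for 3D box generation. Song et al. [34] and Zhou et al. [43] grouped the points into voxels and used 3D CNNs to learn voxel features for generating 3D boxes. However, bird's-eye-view projection and voxelization suffer from information loss caused by data quantization, and 3D CNNs are both memory and computation inefficient. [25, 39] used mature 2D detectors to generate 2D proposals from images and thereby reduced the number of 3D points in each cropped image region. PointNet [26, 28] is then used to learn point cloud features for 3D box estimation. However, 2D image-based proposal generation may fail in challenging cases that can only be well observed from 3D space, and such failures cannot be recovered by the 3D box estimation step. In contrast, our bottom-up 3D proposal generation method directly generates robust 3D proposals from point clouds, and it is both efficient and quantization free.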

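Learning point cloud representations. Instead of representing the point cloud as voxels [22, 33, 35] or multi-view formats [27, 36, 37], Qi et al. [26] presented the PointNet architecture to learn point features directly from raw point clouds, which greatly increases the speed and accuracy of point cloud classification and segmentation. The follow-up works [28, 12] further improve the quality of the extracted features by considering the local structures in point clouds. Our work extends point-based feature extractors to 3D point-cloud-based object detection, leading to a novel two-stage 3D detection framework that directly generates 3D box proposals and detection results from raw point clouds.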

3. PointRCNN for Point Cloud 3D Detection

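In this section, we present our proposed two-stage detection framework, PointRCNN, for detecting 3D objects from irregular point clouds. The overall structure is illustrated in Fig. 2 and consists of the bottom-up 3D proposal generation stage and the canonical bounding box refinement stage.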

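Figure 2. The PointRCNN architecture for 3D object detection from point clouds. The whole network consists of two parts: (a) generating 3D proposals from the raw point cloud in a bottom-up manner; (b) refining the 3D proposals in canonical coordinates.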

3.1. Bottom-up 3D proposal generation via point cloud segmentation

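Existing 2D object detection methods can be categorized into one-stage and two-stage methods. One-stage methods [19, 21, 31, 30, 29] are generally faster but directly estimate object bounding boxes without refinement, while two-stage methods [10, 18, 32, 8] first generate proposals and then further refine the proposals and confidences in a second stage. However, directly extending two-stage methods from 2D to 3D is non-trivial because of the huge 3D search space and the irregular format of point clouds. AVOD [14] places 80-100k anchor boxes in the 3D space and pools features for each anchor in multiple views to generate proposals. F-PointNet [25] generates 2D proposals from 2D images and estimates 3D boxes from the 3D points cropped from the 2D regions, which may miss difficult objects that can only be clearly observed from 3D space.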

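We propose an accurate and robust 3D proposal generation algorithm as our stage-1 sub-network, based on whole-scene point cloud segmentation. We observe that objects in 3D scenes are naturally separated without overlapping each other. The segmentation masks of all 3D objects can therefore be obtained directly from their 3D bounding box annotations, i.e., 3D points inside 3D boxes are considered foreground points.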

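We therefore propose to generate 3D proposals in a bottom-up manner. Specifically, we learn point-wise features to segment the raw point cloud and simultaneously generate 3D proposals from the segmented foreground points. With this bottom-up strategy, our method avoids using a large set of predefined 3D boxes in the 3D space and significantly constrains the search space for 3D proposal generation. Experiments show that our proposed 3D box proposal method achieves significantly higher recall than 3D anchor-based proposal generation methods.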

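Learning point cloud representations. To learn discriminative point-wise features for describing the raw point cloud, we use PointNet++ [28] with multi-scale grouping as our backbone network. Several alternative point cloud network structures, such as [26, 13] or VoxelNet [43] with sparse convolutions [9], could also be adopted as the backbone network.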

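Foreground point segmentation. Foreground points provide rich information for predicting the locations and orientations of their associated objects. By learning to segment the foreground points, the point cloud network is forced to capture contextual information for making accurate point-wise predictions, which also benefits 3D box generation. We design the bottom-up 3D proposal generation method to generate 3D box proposals directly from the foreground points, i.e., foreground segmentation and 3D box proposal generation are performed simultaneously.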

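Given the point-wise features encoded by the backbone point cloud network, we append one segmentation head for estimating the foreground mask and one box regression head for generating 3D proposals. For point segmentation, the ground-truth segmentation mask is naturally provided by the 3D ground-truth boxes. In large-scale outdoor scenes, the number of foreground points is generally much smaller than the number of background points. We therefore use the focal loss [19] to handle the class imbalance problem: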

$\mathcal{L}_{\text{focal}}\left(p_{t}\right)=-\alpha_{t}\left(1-p_{t}\right)^{\gamma} \log \left(p_{t}\right), \tag{1}$
$\text{where } p_{t}= \begin{cases}p & \text{for foreground point} \\ 1-p & \text{otherwise}\end{cases}$

During training of point cloud segmentation, we keep the default settings $\alpha_{t}=0.25$ and $\gamma=2$ as in the original paper.
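The focal loss of Eq. (1) can be sketched per point as follows. This is a minimal illustration rather than the released implementation: the function name `focal_loss`, the clamping constant `eps`, and the use of one constant $\alpha_t$ for both classes (as the text states $\alpha_t = 0.25$) are assumptions of this sketch.

```python
import math


def focal_loss(p, is_foreground, alpha=0.25, gamma=2.0, eps=1e-12):
    """Focal loss of Eq. (1) for a single point.

    p             -- predicted foreground probability in [0, 1]
    is_foreground -- ground-truth label of the point
    alpha, gamma  -- defaults quoted in the text; eps is a hypothetical guard
    """
    # p_t is the probability the network assigns to the true class.
    p_t = p if is_foreground else 1.0 - p
    # (1 - p_t)^gamma down-weights points that are already well classified.
    return -alpha * (1.0 - p_t) ** gamma * math.log(max(p_t, eps))


easy = focal_loss(0.9, True)  # confident and correct: tiny loss
hard = focal_loss(0.1, True)  # confident and wrong: dominates the sum
```

Because background points vastly outnumber foreground points, the modulating factor keeps the many easy background points from dominating the gradient.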

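Bin-based 3D bounding box generation. As mentioned above, a box regression head is also appended to generate bottom-up 3D proposals simultaneously with foreground point segmentation. During training, we only require the box regression head to regress 3D bounding box locations from foreground points. Note that although boxes are not regressed from the background points, those points still provide supporting information for generating boxes because of the receptive field of the point cloud network.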

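A 3D bounding box is represented as $(x, y, z, h, w, l, \theta)$ in the LiDAR coordinate system, where $(x, y, z)$ is the object center location, $(h, w, l)$ is the object size, and $\theta$ is the object orientation from the bird's-eye view. To constrain the generated 3D box proposals, we propose bin-based regression losses for estimating the 3D bounding boxes of objects.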

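To estimate the center location of an object, as shown in Fig. 3, we split the surrounding area of each foreground point into a series of discrete bins along the X and Z axes. Specifically, we set a search range $\mathcal{S}$ for each of the X and Z axes of the current foreground point, and each 1D search range is divided into bins of uniform length $\delta$ to represent different object centers $(x, z)$ on the X-Z plane. We observe that bin-based classification with the cross-entropy loss for the X and Z axes, instead of direct regression with the smooth L1 loss, yields more accurate and robust center localization.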

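The localization loss for the X or Z axis consists of two terms: one for bin classification along each of the X and Z axes, and one for residual regression within the classified bin. For the center location $y$ along the vertical Y axis, we directly use the smooth L1 loss for regression, since the $y$ values of most objects lie within a very small range. Direct regression is sufficient to obtain accurate $y$ values.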

The localization targets can therefore be formulated as
$\begin{aligned} \operatorname{bin}_{x}^{(p)} &=\left\lfloor\frac{x^{p}-x^{(p)}+\mathcal{S}}{\delta}\right\rfloor, \quad \operatorname{bin}_{z}^{(p)}=\left\lfloor\frac{z^{p}-z^{(p)}+\mathcal{S}}{\delta}\right\rfloor, \\ \operatorname{res}_{u \in\{x, z\}}^{(p)} &=\frac{1}{\mathcal{C}}\left(u^{p}-u^{(p)}+\mathcal{S}-\left(\operatorname{bin}_{u}^{(p)} \cdot \delta+\frac{\delta}{2}\right)\right), \\ \operatorname{res}_{y}^{(p)} &=y^{p}-y^{(p)} \end{aligned} \tag{2}$

where $\left(x^{(p)}, y^{(p)}, z^{(p)}\right)$ are the coordinates of a foreground point of interest, $\left(x^{p}, y^{p}, z^{p}\right)$ are the center coordinates of its corresponding object, $\operatorname{bin}_{x}^{(p)}$ and $\operatorname{bin}_{z}^{(p)}$ are the ground-truth bin assignments along the X and Z axes, $\operatorname{res}_{x}^{(p)}$ and $\operatorname{res}_{z}^{(p)}$ are the ground-truth residuals for further location refinement within the assigned bin, and $\mathcal{C}$ is the bin length used for normalization.
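As a worked illustration of Eq. (2), the sketch below computes the bin index and normalized residual along one horizontal axis. The function name `center_targets` and the numeric values $\mathcal{S} = 3.0$ and $\delta = 0.5$ are illustrative assumptions, not values stated in this section; the normalizer $\mathcal{C}$ is taken to be the bin length $\delta$, as the text describes.

```python
import math


def center_targets(point_u, center_u, search_range, bin_size):
    """Eq. (2) targets along one axis u in {x, z}.

    point_u      -- coordinate u^(p) of the foreground point
    center_u     -- coordinate u^p of the object center
    search_range -- S, half-width of the 1D search window
    bin_size     -- delta, uniform bin length (also used as C)
    Assumes the center lies inside the window [-S, S) around the point.
    """
    shifted = center_u - point_u + search_range        # offset in [0, 2S)
    bin_index = math.floor(shifted / bin_size)         # ground-truth bin
    bin_center = bin_index * bin_size + bin_size / 2.0
    residual = (shifted - bin_center) / bin_size       # normalized by C
    return bin_index, residual


# Point at x = 10.0, object center at x = 11.3 (illustrative values).
bin_index, residual = center_targets(10.0, 11.3, search_range=3.0, bin_size=0.5)
```

With these values the shifted offset is 4.3, which falls in bin 8 (covering 4.0 to 4.5) with a normalized residual of about 0.1 relative to the bin center 4.25.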

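The targets for orientation $\theta$ and size $(h, w, l)$ estimation are similar to those in [25]. We divide the orientation range $2\pi$ into $n$ bins and calculate the bin classification target $\operatorname{bin}_{\theta}^{(p)}$ and the residual regression target $\operatorname{res}_{\theta}^{(p)}$ in the same way as for the $x$ or $z$ prediction. The object size $(h, w, l)$ is directly regressed by calculating the residuals $\left(\operatorname{res}_{h}^{(p)}, \operatorname{res}_{w}^{(p)}, \operatorname{res}_{l}^{(p)}\right)$ with respect to the average object size of each class in the entire training set.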

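Figure 3. Illustration of bin-based localization. The surrounding area of each foreground point along the X and Z axes is split into a series of bins to locate the object center.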

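In the inference stage, for the bin-based predicted parameters $x$, $z$, and $\theta$, we first choose the bin center with the highest predicted confidence and add the predicted residual to obtain the refined parameters. For the other directly regressed parameters, including $y$, $h$, $w$, and $l$, we add the predicted residual to their initial values.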
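The inference-time decoding described above inverts Eq. (2). The following sketch shows it for one horizontal axis; the function name `decode_center`, the assumption that the network emits one residual per bin, and the numeric values are hypothetical choices made for illustration.

```python
def decode_center(point_u, bin_scores, residuals, search_range, bin_size):
    """Recover an object-center coordinate from bin-based predictions.

    bin_scores -- per-bin classification confidences along axis u
    residuals  -- per-bin normalized residuals (an assumption of this sketch)
    """
    # Choose the bin with the highest predicted confidence.
    best = max(range(len(bin_scores)), key=lambda k: bin_scores[k])
    bin_center = best * bin_size + bin_size / 2.0
    # Undo the normalization by the bin length and the shift by S.
    offset = bin_center + residuals[best] * bin_size - search_range
    return point_u + offset


scores = [0.0] * 12
scores[8] = 1.0
res = [0.0] * 12
res[8] = 0.1
decoded = decode_center(10.0, scores, res, search_range=3.0, bin_size=0.5)
```

Here the selected bin center is 4.25, the residual contributes 0.05, and subtracting the search range gives an offset of 1.3, so the decoded center is approximately 11.3.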

The overall 3D bounding box regression loss $\mathcal{L}_{\mathrm{reg}}$, with its different training loss terms, can then be formulated as

$\begin{aligned} \mathcal{L}_{\mathrm{bin}}^{(p)} &=\sum_{u \in\{x, z, \theta\}}\left(\mathcal{F}_{\mathrm{cls}}\left(\widehat{\operatorname{bin}}_{u}^{(p)}, \operatorname{bin}_{u}^{(p)}\right)+\mathcal{F}_{\mathrm{reg}}\left(\widehat{\operatorname{res}}_{u}^{(p)}, \operatorname{res}_{u}^{(p)}\right)\right), \\ \mathcal{L}_{\mathrm{res}}^{(p)} &=\sum_{v \in\{y, h, w, l\}} \mathcal{F}_{\mathrm{reg}}\left(\widehat{\operatorname{res}}_{v}^{(p)}, \operatorname{res}_{v}^{(p)}\right), \\ \mathcal{L}_{\mathrm{reg}} &=\frac{1}{N_{\mathrm{pos}}} \sum_{p \in \mathrm{pos}}\left(\mathcal{L}_{\mathrm{bin}}^{(p)}+\mathcal{L}_{\mathrm{res}}^{(p)}\right) \end{aligned} \tag{3}$

where $N_{\mathrm{pos}}$ is the number of foreground points, $\widehat{\operatorname{bin}}_{u}^{(p)}$ and $\widehat{\operatorname{res}}_{u}^{(p)}$ are the predicted bin assignments and residuals of the foreground point $p$, $\operatorname{bin}_{u}^{(p)}$ and $\operatorname{res}_{u}^{(p)}$ are the ground-truth targets calculated as above, $\mathcal{F}_{\mathrm{cls}}$ denotes the cross-entropy classification loss, and $\mathcal{F}_{\mathrm{reg}}$ denotes the smooth L1 loss.
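The structure of Eq. (3) can be sketched in plain Python as below. The dictionary layout, the function names, and the smooth L1 transition point `beta = 1.0` are assumptions of this sketch; the text only specifies that $\mathcal{F}_{\mathrm{cls}}$ is cross-entropy and $\mathcal{F}_{\mathrm{reg}}$ is smooth L1.

```python
import math


def smooth_l1(pred, target, beta=1.0):
    """Smooth L1: quadratic near zero, linear beyond beta (beta is assumed)."""
    diff = abs(pred - target)
    return 0.5 * diff * diff / beta if diff < beta else diff - 0.5 * beta


def cross_entropy(logits, target_bin):
    """Cross-entropy of a softmax over bin logits, computed stably."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum - logits[target_bin]


def point_reg_loss(bin_logits, bin_targets, res_preds, res_targets):
    """L_bin + L_res of Eq. (3) for one foreground point."""
    loss = 0.0
    for u in ("x", "z", "theta"):          # bin-based parameters
        loss += cross_entropy(bin_logits[u], bin_targets[u])
        loss += smooth_l1(res_preds[u], res_targets[u])
    for v in ("y", "h", "w", "l"):         # directly regressed parameters
        loss += smooth_l1(res_preds[v], res_targets[v])
    return loss


def reg_loss(per_point_losses):
    """L_reg of Eq. (3): mean over the N_pos foreground points."""
    return sum(per_point_losses) / len(per_point_losses)


logits = {u: [0.0, 0.0] for u in ("x", "z", "theta")}
bins = {u: 0 for u in ("x", "z", "theta")}
res = {k: 0.0 for k in ("x", "z", "theta", "y", "h", "w", "l")}
loss = reg_loss([point_reg_loss(logits, bins, res, res)])
```

In this toy case every residual is predicted exactly, so the loss reduces to three cross-entropy terms over two equally likely bins.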

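To remove redundant proposals, we conduct non-maximum suppression (NMS) based on the oriented IoU from the bird's-eye view to generate a small number of high-quality proposals. For training, we use 0.85 as the bird's-eye-view IoU threshold, and after NMS we keep the top 300 proposals for training the stage-2 sub-network. For inference, we use oriented NMS with an IoU threshold of 0.8, and only the top 100 proposals are kept for refinement by the stage-2 sub-network.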

3.2. Point cloud region pooling

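After obtaining the 3D bounding box proposals, we aim to refine the box locations and orientations based on the previously generated box proposals. To learn more specific local features for each proposal, we propose to pool the 3D points and their corresponding point features from stage-1 according to the location of each 3D proposal.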

For each 3D box proposal $\mathbf{b}_{i}=\left(x_{i}, y_{i}, z_{i}, h_{i}, w_{i}, l_{i}, \theta_{i}\right)$, we slightly enlarge it to create a new 3D box $\mathbf{b}_{i}^{e}=\left(x_{i}, y_{i}, z_{i}, h_{i}+\eta, w_{i}+\eta, l_{i}+\eta, \theta_{i}\right)$ that encodes additional information from its context, where $\eta$ is a constant value for enlarging the size of the box.

For each point $p=\left(x^{(p)}, y^{(p)}, z^{(p)}\right)$, an inside/outside test is performed to determine whether $p$ lies inside the enlarged bounding box proposal $\mathbf{b}_{i}^{e}$. If so, the point and its features are kept for refining the box $\mathbf{b}_{i}$. The features associated with an inside point $p$ include its 3D point coordinates $\left(x^{(p)}, y^{(p)}, z^{(p)}\right) \in \mathbb{R}^{3}$, its laser reflection intensity $r^{(p)} \in \mathbb{R}$, its predicted segmentation mask $m^{(p)} \in\{0,1\}$ from stage-1, and the $C$-dimensional learned point feature representation $\mathbf{f}^{(p)} \in \mathbb{R}^{C}$ from stage-1.

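We include the segmentation mask $m^{(p)}$ to differentiate the predicted foreground and background points within the enlarged box $\mathbf{b}_{i}^{e}$. The learned point feature $\mathbf{f}^{(p)}$ encodes valuable information from learning segmentation and proposal generation and is therefore also included. Proposals that have no inside points are eliminated in the following stage.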
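The region pooling step can be illustrated with a small inside/outside test. The sketch below makes several assumptions that the text does not fix: $(x, y, z)$ is treated as the geometric center of the box, the heading direction in the X-Z plane is taken as $(\cos\theta, \sin\theta)$, the length $l$ runs along the heading, and the function names and the value of $\eta$ are hypothetical.

```python
import math


def enlarge_box(box, eta):
    """b_i -> b_i^e: add the constant eta to h, w and l."""
    x, y, z, h, w, l, theta = box
    return (x, y, z, h + eta, w + eta, l + eta, theta)


def point_in_box(point, box):
    """Inside/outside test for a box rotated about the vertical Y axis."""
    px, py, pz = point
    x, y, z, h, w, l, theta = box
    dx, dy, dz = px - x, py - y, pz - z
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    along = dx * cos_t + dz * sin_t      # offset along the heading (length l)
    across = -dx * sin_t + dz * cos_t    # offset across the heading (width w)
    return (abs(along) <= l / 2.0 and abs(across) <= w / 2.0
            and abs(dy) <= h / 2.0)


def pool_points(points, features, proposal, eta):
    """Keep the points (and their features) inside the enlarged proposal."""
    enlarged = enlarge_box(proposal, eta)
    return [(p, f) for p, f in zip(points, features)
            if point_in_box(p, enlarged)]


box = (0.0, 0.0, 0.0, 2.0, 2.0, 4.0, math.pi / 2.0)  # heading along +Z
pts = [(0.0, 0.0, 1.9), (1.2, 0.0, 0.0), (5.0, 0.0, 0.0)]
kept = pool_points(pts, ["a", "b", "c"], box, eta=1.0)
```

The second point lies just outside the original box but inside the enlarged one, which is the kind of context point that the enlargement by $\eta$ is meant to capture. A proposal for which `pool_points` returns an empty list would be discarded.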

3.3. Canonical 3D bounding box refinement

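As illustrated in Fig. 2(b), the pooled points and their associated features (see Sec. 3.2) for each proposal are fed to our stage-2 sub-network for refining the 3D box locations as well as the foreground object confidence.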

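Canonical transformation. To take advantage of the high-recall box proposals from stage-1 and to estimate only the residuals of the box parameters of proposals, we transform the pooled points belonging to each proposal to the canonical coordinate system of the corresponding 3D proposal. As shown in Fig. 4, the canonical coordinate system of one 3D proposal is defined as follows: (1) the origin is located at the center of the box proposal; (2) the local $X^{\prime}$ and $Z^{\prime}$ axes are approximately parallel to the ground plane, with $X^{\prime}$ pointing towards the head direction of the proposal and $Z^{\prime}$ perpendicular to $X^{\prime}$; (3) the $Y^{\prime}$ axis remains the same as that of the LiDAR coordinate system. The coordinates $p$ of all pooled points of the box proposal are transformed to the canonical coordinate system as $\tilde{p}$ by proper rotation and translation. Using the proposed canonical coordinate system enables the box refinement stage to learn better local spatial features for each proposal.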

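Figure 4. Illustration of the canonical transformation. The pooled points belonging to each proposal are transformed to the corresponding canonical coordinate system for better local spatial feature learning, where CCS denotes Canonical Coordinate System.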
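The canonical transformation amounts to a translation to the proposal center followed by a rotation about the vertical axis. The sketch below assumes that the heading direction in the X-Z plane is $(\cos\theta, \sin\theta)$; the actual sign convention depends on the dataset's coordinate frame, and the function name `to_canonical` is hypothetical.

```python
import math


def to_canonical(point, proposal):
    """Transform a pooled point p into the proposal's canonical frame.

    proposal -- (x, y, z, h, w, l, theta); only the center and the
                heading angle are used here.
    """
    px, py, pz = point
    x, y, z, _h, _w, _l, theta = proposal
    # (1) Move the origin to the proposal center.
    dx, dy, dz = px - x, py - y, pz - z
    # (2) Rotate in the ground plane so that X' follows the heading.
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    x_c = dx * cos_t + dz * sin_t
    z_c = -dx * sin_t + dz * cos_t
    # (3) The Y' axis keeps the direction of the original Y axis.
    return (x_c, dy, z_c)


proposal = (10.0, 1.0, 20.0, 1.5, 1.6, 3.9, math.pi / 2.0)  # heading along +Z
p_tilde = to_canonical((10.0, 1.5, 22.0), proposal)
```

A point 2 m ahead of the center along the heading maps to roughly $(2, 0.5, 0)$ in canonical coordinates, whatever the proposal's global position and orientation are.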

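Feature learning for box proposal refinement. As mentioned in Sec. 3.2, the refinement sub-network combines both the transformed local spatial points (features) $\tilde{p}$ and their global semantic features $\mathbf{f}^{(p)}$ from stage-1 for further box and confidence refinement.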

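Although the canonical transformation enables robust local spatial feature learning, it inevitably loses the depth information of each object. For instance, far-away objects generally have far fewer points than nearby objects because of the fixed angular scanning resolution of the LiDAR sensor. To compensate for the lost depth information, we include the distance to the sensor, i.e., $d^{(p)}=\sqrt{\left(x^{(p)}\right)^{2}+\left(y^{(p)}\right)^{2}+\left(z^{(p)}\right)^{2}}$, in the features of point $p$.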

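For each proposal, the local spatial features $\tilde{p}$ of its associated points and the extra features $\left[r^{(p)}, m^{(p)}, d^{(p)}\right]$ are first concatenated and fed to several fully-connected layers, which encode the local features to the same dimension as the global features $\mathbf{f}^{(p)}$. The local and global features are then concatenated and fed into a network following the structure of [28] to obtain a discriminative feature vector for the subsequent confidence classification and box refinement.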

Losses for box proposal refinement. We adopt similar bin-based regression losses for proposal refinement. A ground-truth box is assigned to a 3D box proposal for learning box refinement if their 3D IoU is greater than 0.55. Both the 3D proposals and their corresponding 3D ground-truth boxes are transformed into the canonical coordinate systems, which means that the 3D proposal $\mathbf{b}_{i}=\left(x_{i}, y_{i}, z_{i}, h_{i}, w_{i}, l_{i}, \theta_{i}\right)$ and the 3D ground-truth box $\mathbf{b}_{i}^{\mathrm{gt}}=\left(x_{i}^{\mathrm{gt}}, y_{i}^{\mathrm{gt}}, z_{i}^{\mathrm{gt}}, h_{i}^{\mathrm{gt}}, w_{i}^{\mathrm{gt}}, l_{i}^{\mathrm{gt}}, \theta_{i}^{\mathrm{gt}}\right)$ are transformed to

$\begin{aligned} \tilde{\mathbf{b}}_{i} &=\left(0,0,0, h_{i}, w_{i}, l_{i}, 0\right), \\ \tilde{\mathbf{b}}_{i}^{\mathrm{gt}} &=\left(x_{i}^{\mathrm{gt}}-x_{i}, y_{i}^{\mathrm{gt}}-y_{i}, z_{i}^{\mathrm{gt}}-z_{i}, h_{i}^{\mathrm{gt}}, w_{i}^{\mathrm{gt}}, l_{i}^{\mathrm{gt}}, \theta_{i}^{\mathrm{gt}}-\theta_{i}\right) \end{aligned} \tag{4}$

The training targets for the center location of the $i$-th box proposal, $\left(\operatorname{bin}_{\Delta x}^{i}, \operatorname{bin}_{\Delta z}^{i}, \operatorname{res}_{\Delta x}^{i}, \operatorname{res}_{\Delta z}^{i}, \operatorname{res}_{\Delta y}^{i}\right)$, are set in the same way as in Eq. (2), except that we use a smaller search range $\mathcal{S}$ for refining the locations of the 3D proposals. We still directly regress the size residuals $\left(\operatorname{res}_{\Delta h}^{i}, \operatorname{res}_{\Delta w}^{i}, \operatorname{res}_{\Delta l}^{i}\right)$ with respect to the average object size of each class in the training set, since the pooled sparse points usually cannot provide enough information about the proposal size $\left(h_{i}, w_{i}, l_{i}\right)$.

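For refining the orientation, we assume that the angular difference with respect to the ground-truth orientation, $\theta_{i}^{\mathrm{gt}}-\theta_{i}$, lies within the range $\left[-\frac{\pi}{4}, \frac{\pi}{4}\right]$, based on the fact that the 3D IoU between a proposal and its ground-truth box is at least 0.55. We therefore divide $\frac{\pi}{2}$ into discrete bins with bin size $\omega$ and predict the bin-based orientation targets as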

$\begin{aligned} \operatorname{bin}_{\Delta \theta}^{i} &=\left\lfloor\frac{\theta_{i}^{\mathrm{gt}}-\theta_{i}+\frac{\pi}{4}}{\omega}\right\rfloor, \\ \operatorname{res}_{\Delta \theta}^{i} &=\frac{2}{\omega}\left(\theta_{i}^{\mathrm{gt}}-\theta_{i}+\frac{\pi}{4}-\left(\operatorname{bin}_{\Delta \theta}^{i} \cdot \omega+\frac{\omega}{2}\right)\right) \end{aligned} \tag{5}$
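A short numerical illustration of Eq. (5) follows. The bin size $\omega = \pi/18$ (10 degrees) and the function names are assumptions chosen for the example; the handling of the upper end point of the range, which would need clamping, is omitted.

```python
import math


def orientation_targets(theta_gt, theta_prop, bin_size):
    """Eq. (5): bin index and residual for the heading refinement.

    Assumes theta_gt - theta_prop lies in [-pi/4, pi/4).
    """
    shifted = theta_gt - theta_prop + math.pi / 4.0    # in [0, pi/2)
    bin_index = math.floor(shifted / bin_size)
    bin_center = bin_index * bin_size + bin_size / 2.0
    # The factor 2/omega scales the residual into [-1, 1).
    residual = (2.0 / bin_size) * (shifted - bin_center)
    return bin_index, residual


def decode_orientation(bin_index, residual, bin_size):
    """Inverse of Eq. (5): recover theta_gt - theta_prop."""
    bin_center = bin_index * bin_size + bin_size / 2.0
    return bin_center + residual * bin_size / 2.0 - math.pi / 4.0


omega = math.pi / 18.0  # 10-degree bins (illustrative assumption)
b, r = orientation_targets(theta_gt=0.1, theta_prop=0.0, bin_size=omega)
recovered = decode_orientation(b, r, omega)
```

An angular difference of 0.1 rad shifts to about 0.885 rad, which falls in bin 5 with a negative residual because it lies below that bin's center; decoding the pair returns the original difference.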

The overall loss for the stage-2 sub-network can therefore be formulated as
$\begin{aligned} \mathcal{L}_{\text{refine}}=& \frac{1}{\|\mathcal{B}\|} \sum_{i \in \mathcal{B}} \mathcal{F}_{\text{cls}}\left(\operatorname{prob}_{i}, \operatorname{label}_{i}\right) \\ &+\frac{1}{\left\|\mathcal{B}_{\text{pos}}\right\|} \sum_{i \in \mathcal{B}_{\text{pos}}}\left(\tilde{\mathcal{L}}_{\text{bin}}^{(i)}+\tilde{\mathcal{L}}_{\text{res}}^{(i)}\right) \end{aligned} \tag{6}$

where $\mathcal{B}$ is the set of 3D proposals from stage-1, $\mathcal{B}_{\text{pos}}$ stores the positive proposals for regression, $\operatorname{prob}_{i}$ is the estimated confidence of $\tilde{\mathbf{b}}_{i}$, $\operatorname{label}_{i}$ is the corresponding label, $\mathcal{F}_{\text{cls}}$ is the cross-entropy loss that supervises the predicted confidence, and $\tilde{\mathcal{L}}_{\text{bin}}^{(i)}$ and $\tilde{\mathcal{L}}_{\text{res}}^{(i)}$ are similar to $\mathcal{L}_{\text{bin}}^{(p)}$ and $\mathcal{L}_{\text{res}}^{(p)}$ in Eq. (3), with the new targets calculated from $\tilde{\mathbf{b}}_{i}$ and $\tilde{\mathbf{b}}_{i}^{\mathrm{gt}}$ as described above.

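Finally, we apply oriented NMS with a bird's-eye-view IoU threshold of 0.01 to remove overlapping bounding boxes and generate the 3D bounding boxes for the detected objects.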

5. Conclusion

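We have presented PointRCNN, a novel 3D object detector for detecting 3D objects from raw point clouds. The proposed stage-1 network directly generates 3D proposals from the point cloud in a bottom-up manner and achieves significantly higher recall than previous proposal generation methods. The stage-2 network refines the proposals in canonical coordinates by combining semantic features and local spatial features. Moreover, the newly proposed bin-based loss has demonstrated its efficiency and effectiveness for 3D bounding box regression. Experiments show that PointRCNN outperforms previous state-of-the-art methods by remarkable margins on the challenging 3D detection benchmark of the KITTI dataset.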

References

[1] Florian Chabot, Mohamed Chaouch, Jaonary Rabarisoa, Céline Teulière, and Thierry Chateau. Deep manta: A coarse-to-fine many-task network for joint 2d and 3d vehicle analysis from monocular image. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit.(CVPR), pages 2040–2049, 2017.

[2] Xiaozhi Chen, Kaustav Kundu, Ziyu Zhang, Huimin Ma, Sanja Fidler, and Raquel Urtasun. Monocular 3d object detection for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2147–2156, 2016.

[3] Xiaozhi Chen, Kaustav Kundu, Yukun Zhu, Andrew G Berneshawi, Huimin Ma, Sanja Fidler, and Raquel Urtasun. 3d object proposals for accurate object class detection. In Advances in Neural Information Processing Systems, pages 424–432, 2015.

[4] Xiaozhi Chen, Huimin Ma, Ji Wan, Bo Li, and Tian Xia. Multi-view 3d object detection network for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1907–1915, 2017.

[5] Jifeng Dai, Kaiming He, and Jian Sun. Boxsup: Exploiting bounding boxes to supervise convolutional networks for semantic segmentation. In Proceedings of the IEEE International Conference on Computer Vision, pages 1635–1643, 2015.

[6] Jifeng Dai, Kaiming He, and Jian Sun. Instance-aware semantic segmentation via multi-task network cascades. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3150–3158, 2016.

[7] Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? the kitti vision benchmark suite. In Conference on Computer Vision and Pattern Recognition (CVPR), 2012.

[8] Ross Girshick. Fast r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 1440–1448, 2015.

[9] Benjamin Graham, Martin Engelcke, and Laurens van der Maaten. 3d semantic segmentation with submanifold sparse convolutional networks. CVPR, 2018.

[10] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In Computer Vision (ICCV), 2017 IEEE International Conference on, pages 2980–2988. IEEE, 2017.

[11] Jan Hosang, Rodrigo Benenson, Piotr Dollár, and Bernt Schiele. What makes for effective detection proposals? IEEE transactions on pattern analysis and machine intelligence, 38(4):814–830, 2016.

[12] Qiangui Huang, Weiyue Wang, and Ulrich Neumann. Recurrent slice networks for 3d segmentation of point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2626–2635, 2018.

[13] Mingyang Jiang, Yiran Wu, and Cewu Lu. Pointsift: A sift-like network module for 3d point cloud semantic segmentation. CoRR, abs/1807.00652, 2018.

[14] Jason Ku, Melissa Mozifian, Jungwook Lee, Ali Harakeh, and Steven Lake Waslander. Joint 3d proposal generation and object detection from view aggregation. CoRR, abs/1712.02294, 2017.

[15] Buyu Li, Wanli Ouyang, Lu Sheng, Xingyu Zeng, and Xiaogang Wang. Gs3d: An efficient 3d object detection framework for autonomous driving. 2019.

[16] Hongyang Li, Bo Dai, Shaoshuai Shi, Wanli Ouyang, and Xiaogang Wang. Feature Intertwiner for Object Detection. In ICLR, 2019.

[17] Ming Liang, Bin Yang, Shenlong Wang, and Raquel Urtasun. Deep continuous fusion for multi-sensor 3d object detection. In Proceedings of the European Conference on Computer Vision (ECCV), pages 641–656, 2018.

[18] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2117–2125, 2017.

[19] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. IEEE transactions on pattern analysis and machine intelligence, 2018.

[20] Shu Liu, Jiaya Jia, Sanja Fidler, and Raquel Urtasun. Sgn: Sequential grouping networks for instance segmentation. In The IEEE International Conference on Computer Vision (ICCV), 2017.

[21] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg. Ssd: Single shot multibox detector. In European conference on computer vision, pages 21–37. Springer, 2016.

[22] Daniel Maturana and Sebastian Scherer. Voxnet: A 3d convolutional neural network for real-time object recognition. In Intelligent Robots and Systems (IROS), 2015 IEEE/RSJ International Conference on, pages 922–928. IEEE, 2015.

[23] Roozbeh Mottaghi, Yu Xiang, and Silvio Savarese. A coarse-to-fine model for 3d pose estimation and sub-category recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 418–426, 2015.

[24] Arsalan Mousavian, Dragomir Anguelov, John Flynn, and Jana Košecká. 3d bounding box estimation using deep learning and geometry. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pages 5632–5640. IEEE, 2017.

[25] Charles Ruizhongtai Qi, Wei Liu, Chenxia Wu, Hao Su, and Leonidas J. Guibas. Frustum pointnets for 3d object detection from RGB-D data. CoRR, abs/1711.08488, 2017.

[26] Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. Proc. Computer Vision and Pattern Recognition (CVPR), IEEE, 1(2):4, 2017.

[27] Charles R Qi, Hao Su, Matthias Nießner, Angela Dai, Mengyuan Yan, and Leonidas J Guibas. Volumetric and multi-view cnns for object classification on 3d data. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5648–5656, 2016.

[28] Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J Guibas. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. In Advances in Neural Information Processing Systems, pages 5099–5108, 2017.

[29] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 779–788, 2016.

[30] Joseph Redmon and Ali Farhadi. Yolo9000: better, faster, stronger. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7263–7271, 2017.

[31] Joseph Redmon and Ali Farhadi. Yolov3: An incremental improvement. CoRR, abs/1804.02767, 2018.

[32] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pages 91–99, 2015.

[33] Gernot Riegler, Ali Osman Ulusoy, and Andreas Geiger. Octnet: Learning deep 3d representations at high resolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, volume 3, 2017.

[34] Shuran Song and Jianxiong Xiao. Deep sliding shapes for amodal 3d object detection in rgb-d images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 808–816, 2016.

[35] Shuran Song, Fisher Yu, Andy Zeng, Angel X Chang, Manolis Savva, and Thomas Funkhouser. Semantic scene completion from a single depth image. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pages 190–198. IEEE, 2017.

[36] Hang Su, Subhransu Maji, Evangelos Kalogerakis, and Erik Learned-Miller. Multi-view convolutional neural networks for 3d shape recognition. In Proceedings of the IEEE international conference on computer vision, pages 945–953, 2015.

[37] Hao Su, Fan Wang, Eric Yi, and Leonidas J Guibas. 3d-assisted feature synthesis for novel views of an object. In Proceedings of the IEEE International Conference on Computer Vision, pages 2677–2685, 2015.

[38] Bin Xu and Zhenzhong Chen. Multi-level fusion based 3d object detection from monocular images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2345–2353, 2018.

[39] Danfei Xu, Dragomir Anguelov, and Ashesh Jain. Pointfusion: Deep sensor fusion for 3d bounding box estimation. CoRR, abs/1711.10871, 2017.

[40] Yan Yan, Yuxing Mao, and Bo Li. Second: Sparsely embedded convolutional detection. Sensors, 18(10):3337, 2018.

[41] Bin Yang, Ming Liang, and Raquel Urtasun. Hdnet: Exploiting hd maps for 3d object detection. In Conference on Robot Learning, pages 146–155, 2018.

[42] Bin Yang, Wenjie Luo, and Raquel Urtasun. Pixor: Real-time 3d object detection from point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7652–7660, 2018.

[43] Yin Zhou and Oncel Tuzel. Voxelnet: End-to-end learning for point cloud based 3d object detection. CoRR, abs/1711.06396, 2017.

[44] Menglong Zhu, Konstantinos G Derpanis, Yinfei Yang, Samarth Brahmbhatt, Mabel Zhang, Cody Phillips, Matthieu Lecce, and Kostas Daniilidis. Single image 3d object detection and pose estimation for grasping. In Robotics and Automation (ICRA), 2014 IEEE International Conference on, pages 3936–3943. IEEE, 2014.