Reading Notes on STD: Sparse-to-Dense 3D Object Detector for Point Cloud

Yang Z, Sun Y, Liu S, et al. Std: Sparse-to-dense 3d object detector for point cloud[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019: 1951-1960.

1.Research Background

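STD is a two-stage 3D object detection framework. The first stage is a bottom-up proposal generation module that takes the raw point cloud as input and generates proposals by seeding a new spherical anchor at each point. Compared with previous work, it achieves higher recall with less computation. PointsPool is then applied for proposal feature generation: it converts the interior point features of each proposal from a sparse representation to a compact one, which saves further computation. In the second stage (box prediction), an IoU branch is added to improve localization accuracy, which further improves performance. Experiments are run on the KITTI dataset, where inference runs at more than 10 FPS.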

2.Method

2.1.Proposal Generation Module

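Spherical Anchor: Because a 3D object can have any orientation, the authors design a spherical anchor. Each spherical anchor has a spherical receptive field whose radius depends on the class (2 m for car, 1 m for pedestrian and cyclist). The proposal predicted by each anchor is based on the points inside its spherical receptive field, and the orientation of the reference box is predicted directly. Compared with conventional rectangular anchors, this design cuts the number of anchors by about 50% while achieving higher recall.
To reduce the number of anchors further, PointNet++ (the backbone) is run once on the raw point cloud to obtain a segmentation score for every point. NMS is applied using these scores, leaving about 500 anchors.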
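The spherical receptive field can be illustrated with a small sketch. It assumes a brute-force distance test in plain Python; the function name `points_in_sphere`, the dictionary `RADIUS_BY_CLASS`, and the sample points are hypothetical and only serve the illustration.

```python
import math

def points_in_sphere(points, center, radius):
    """Return the points whose Euclidean distance to center is at most radius."""
    return [p for p in points if math.dist(p, center) <= radius]

# Radii quoted in the notes: 2 m for car, 1 m for pedestrian and cyclist.
RADIUS_BY_CLASS = {"car": 2.0, "pedestrian": 1.0, "cyclist": 1.0}

# Hypothetical sample data: four points and one anchor centered at the origin.
points = [(0.0, 0.0, 0.0), (1.0, 1.0, 0.0), (3.0, 0.0, 0.0), (0.5, 0.0, 0.5)]
anchor_center = (0.0, 0.0, 0.0)

car_field = points_in_sphere(points, anchor_center, RADIUS_BY_CLASS["car"])
ped_field = points_in_sphere(points, anchor_center, RADIUS_BY_CLASS["pedestrian"])
```

Because the test depends only on distance, the same anchor covers an object regardless of its heading, which is why no per-orientation copies of the anchor are needed.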

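Proposal Generation Network: For each anchor that remains after the spherical anchor step, the coordinates of the points inside it are normalized, with the center of the anchor's sphere as the origin. Each point's coordinates (XYZ) are concatenated with its segmentation feature (the feature produced by PointNet++) and fed into a PointNet, which outputs a classification score, regression offsets, and an orientation.
The predicted proposals are obtained from the anchor's center coordinates and its predefined size plus the predicted offsets. For the orientation angle, the angle range is divided into equal bins; the network classifies which bin the angle falls in and then adds an angle residual, which is regressed within the classified bin. Finally, NMS based on the predicted classification scores and BEV IoU removes redundant proposals, and 300 proposals are kept.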
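The bin-plus-residual angle scheme can be sketched as follows. The notes do not give the number of bins, the angle range, or the reference point of the residual, so this sketch assumes equal bins over [0, 2π) and a residual measured from the bin center; `NUM_BINS = 12` and both function names are hypothetical.

```python
import math

def encode_angle(angle, num_bins):
    """Return (bin index, residual from the bin center) for an angle in radians."""
    bin_size = 2.0 * math.pi / num_bins
    angle = angle % (2.0 * math.pi)  # wrap into [0, 2*pi)
    bin_index = min(int(angle // bin_size), num_bins - 1)
    residual = angle - (bin_index + 0.5) * bin_size
    return bin_index, residual

def decode_angle(bin_index, residual, num_bins):
    """Invert encode_angle: bin center plus residual, wrapped into [0, 2*pi)."""
    bin_size = 2.0 * math.pi / num_bins
    return ((bin_index + 0.5) * bin_size + residual) % (2.0 * math.pi)

NUM_BINS = 12  # assumption: the notes do not state the bin count
angle = math.radians(100.0)
bin_index, residual = encode_angle(angle, NUM_BINS)
recovered = decode_angle(bin_index, residual, NUM_BINS)
```

With 30-degree bins, 100 degrees lands in the bin covering 90 to 120 degrees, and the residual is the signed offset from that bin's center. The classification target is the bin index and the regression target is the residual.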

2.2.Proposal Feature Generation

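PointsPool: For the points inside a proposal, PointNet++ gives the best accuracy but is the slowest (41 ms, 79.1), while PointNet is the fastest but gives the worst accuracy (10 ms, 77.1). The authors therefore propose a new layer, PointsPool + 2FC. PointsPool has three steps:
(1) Select $N$ points in each proposal, subtract the proposal center from their coordinates, and rotate them according to the predicted angle to obtain canonical (normalized) coordinates.
(2) Following VoxelNet, divide the proposal into voxels. The features of the points in each voxel consist of the canonical coordinates and the segmentation feature.
(3) Following the VFE layer in VoxelNet, extract features for each proposal, then flatten them and pass them to the box prediction network.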
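Steps (1) and (2) above can be sketched in plain Python. The sketch assumes a z-up coordinate system with the heading angle measured about the z axis, and a proposal box centered at the origin after canonization; the function names, the grid resolution, and the sample points are hypothetical. The learned VFE layer of step (3) is not reproduced here.

```python
import math
from collections import defaultdict

def canonicalize(points, center, angle):
    """Shift points to the proposal center, then rotate by -angle about z."""
    cos_a, sin_a = math.cos(-angle), math.sin(-angle)
    canonical = []
    for x, y, z in points:
        dx, dy, dz = x - center[0], y - center[1], z - center[2]
        canonical.append((cos_a * dx - sin_a * dy, sin_a * dx + cos_a * dy, dz))
    return canonical

def voxelize(points, box_size, grid):
    """Group canonical points by voxel index; points outside the box are dropped."""
    voxels = defaultdict(list)
    for point in points:
        index = tuple(
            math.floor((coord + size / 2.0) / size * cells)
            for coord, size, cells in zip(point, box_size, grid)
        )
        if all(0 <= i < cells for i, cells in zip(index, grid)):
            voxels[index].append(point)
    return dict(voxels)

# Hypothetical proposal: center (10, 5, 0), heading pi/2, size (l, w, h) = (4, 2, 2).
points = [(10.5, 6.0, 0.5), (20.0, 5.0, 0.0)]
canonical = canonicalize(points, center=(10.0, 5.0, 0.0), angle=math.pi / 2)
voxels = voxelize(canonical, box_size=(4.0, 2.0, 2.0), grid=(2, 2, 2))
```

After canonization the box's heading is aligned with the x axis, so a fixed voxel grid means the same thing for every proposal. The second sample point lies far outside the box and is discarded.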

2.3.Box Prediction Network

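Box Prediction Branch: Two FC layers predict offsets for the box width (w), height (h), and length (l), as well as offsets for the center (x, y, z). The angle is predicted in the same way as in the proposal generation network.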

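IoU Branch: Two FC layers predict the box's IoU score. The authors argue that a box's classification score is not strongly correlated with its localization quality, so NMS based on the classification score performs worse than NMS based on the IoU value.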

2.4.Loss Function

The total loss $L_{total}$ is the sum of the proposal generation loss $L_{prop}$ and the box prediction loss $L_{box}$:
$$L_{total} = L_{prop} + L_{box}$$

$L_{prop}$ consists of the 3D semantic segmentation loss $L_{seg}$, the proposal classification loss $L_{cls}$, the location regression loss $L_{loc}$, and the angle loss $L_{angle}$. $L_{seg}$ uses the focal loss and $L_{cls}$ uses softmax cross-entropy.
$$L_{prop} = L_{seg} + \frac{1}{N_{cls}} \sum_i L_{cls}(s_i,u_i) + \lambda \frac{1}{N_{pos}} \sum_i [u_i \geq 1](L_{loc}+L_{angle})$$
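How the terms of $L_{prop}$ combine can be sketched as below. The notes do not define $N_{cls}$ and $N_{pos}$; this sketch assumes they are the number of anchors and the number of positive anchors, reads $[u_i \geq 1]$ as an indicator that keeps only anchors with a foreground label, and takes the per-anchor loss values as given. The function name and all numbers are hypothetical.

```python
def proposal_loss(seg_loss, cls_losses, labels, loc_losses, angle_losses, lam=1.0):
    """Combine per-anchor losses following the L_prop formula.

    labels[i] is the class label u_i of anchor i (0 = background, assumed).
    """
    n_cls = len(cls_losses)  # assumption: N_cls = number of anchors
    positives = [i for i, u in enumerate(labels) if u >= 1]
    n_pos = max(len(positives), 1)  # guard against division by zero
    cls_term = sum(cls_losses) / n_cls
    reg_term = sum(loc_losses[i] + angle_losses[i] for i in positives) / n_pos
    return seg_loss + cls_term + lam * reg_term

# Hypothetical values for four anchors; anchors 1 and 3 are background,
# so their (deliberately large) regression losses must not contribute.
total = proposal_loss(
    seg_loss=0.2,
    cls_losses=[0.5, 0.1, 0.3, 0.1],
    labels=[1, 0, 2, 0],
    loc_losses=[0.4, 9.0, 0.2, 9.0],
    angle_losses=[0.1, 9.0, 0.3, 9.0],
    lam=1.0,
)
```

The classification term averages over all anchors, while the regression terms average only over the positive ones, so background anchors never push the box parameters around.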
$L_{loc}$ consists of the center residual prediction loss and the size residual prediction loss. $L_{dis}$ is the Smooth L1 function. $A_{ctr}$ and $A_{size}$ are the center residual and size residual predicted by the proposal generation network.
$$L_{loc} = L_{dis}(A_{ctr},G_{ctr}) + L_{dis}(A_{size},G_{size})$$
$$\begin{cases} G_{ctr} = G_j - A_j & j \in (x,y,z) \\ G_{size} = (G_j - A_j)/A_j & j \in (l,w,h) \end{cases}$$
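A small numeric sketch of the location loss follows. It assumes that $G_j$ and $A_j$ on the right-hand side of the target definitions are the ground-truth and anchor box parameters, that Smooth L1 uses the common threshold of 1, and that the per-component losses are summed; the function names and the box values are hypothetical.

```python
def smooth_l1(pred, target, beta=1.0):
    """Smooth L1: quadratic below beta, linear above it."""
    diff = abs(pred - target)
    return 0.5 * diff * diff / beta if diff < beta else diff - 0.5 * beta

def location_targets(gt_box, anchor_box):
    """Center targets are plain differences; size targets are relative to the anchor."""
    g_ctr = [gt_box[k] - anchor_box[k] for k in ("x", "y", "z")]
    g_size = [(gt_box[k] - anchor_box[k]) / anchor_box[k] for k in ("l", "w", "h")]
    return g_ctr, g_size

def location_loss(a_ctr, a_size, g_ctr, g_size):
    """Sum of Smooth L1 over center and size residuals (summation is an assumption)."""
    ctr = sum(smooth_l1(p, t) for p, t in zip(a_ctr, g_ctr))
    size = sum(smooth_l1(p, t) for p, t in zip(a_size, g_size))
    return ctr + size

# Hypothetical anchor and ground-truth boxes.
anchor = {"x": 0.0, "y": 0.0, "z": 0.0, "l": 4.0, "w": 2.0, "h": 2.0}
gt = {"x": 0.5, "y": -0.2, "z": 0.1, "l": 4.4, "w": 1.8, "h": 2.0}

g_ctr, g_size = location_targets(gt, anchor)
# A network that predicts zero residuals everywhere:
loss = location_loss([0.0, 0.0, 0.0], [0.0, 0.0, 0.0], g_ctr, g_size)
```

Dividing the size difference by the anchor size makes the size targets dimensionless, so a 0.4 m error on a 4 m length counts the same as a 0.2 m error on a 2 m width.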
$L_{angle}$ consists of the orientation classification loss and the residual prediction loss. $t_{a-cls}$ and $t_{a-res}$ are the predicted angle class and angle residual. (The paper does not explicitly explain $v_{a-cls}$ and $v_{a-res}$; presumably they are the ground-truth values.)
$$L_{angle} = L_{cls}(t_{a-cls},v_{a-cls}) + L_{dis}(t_{a-res},v_{a-res})$$

$L_{box}$ consists of $L_{cls}$, $L_{loc}$, and $L_{angle}$, plus a 3D IoU loss and a corner loss $L_{corner}$. The 3D IoU loss uses the Smooth L1 function. $L_{corner}$ is the sum of the distances between the 8 predicted corners and the corresponding ground-truth corners.
$$L_{corner} = \sum_{k=1}^8 \lVert P_k - G_k \rVert$$
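The corner loss can be sketched as follows. The sketch assumes a z-up frame with the box rotated about the z axis, the Euclidean norm for $\lVert \cdot \rVert$, and the same corner ordering for the predicted and ground-truth boxes; the function names and box values are hypothetical.

```python
import math
from itertools import product

def box_corners(x, y, z, l, w, h, angle):
    """Return the 8 corners of a box centered at (x, y, z), rotated about z."""
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    corners = []
    for sx, sy, sz in product((-0.5, 0.5), repeat=3):
        dx, dy, dz = sx * l, sy * w, sz * h
        corners.append((
            x + cos_a * dx - sin_a * dy,
            y + sin_a * dx + cos_a * dy,
            z + dz,
        ))
    return corners

def corner_loss(pred_box, gt_box):
    """Sum of Euclidean distances between matching corners."""
    pairs = zip(box_corners(*pred_box), box_corners(*gt_box))
    return sum(math.dist(p, g) for p, g in pairs)

# Hypothetical boxes as (x, y, z, l, w, h, angle).
gt_box = (0.0, 0.0, 0.0, 4.0, 2.0, 2.0, 0.3)
shifted_box = (1.0, 0.0, 0.0, 4.0, 2.0, 2.0, 0.3)

loss_same = corner_loss(gt_box, gt_box)
loss_shifted = corner_loss(shifted_box, gt_box)
```

Unlike the separate center, size, and angle terms, the corner distance couples all box parameters: an error in any one of them moves the corners.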

3.Results

3.1.Results on the KITTI Dataset

3.2.Ablation Study

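Effect of the anchors' receptive field: A cuboid receptive field needs 2 angles, $(0, \pi/2)$, because its length and width differ. This doubles the amount of data and requires more computation. A cuboid with only 1 angle lowers recall by 1.5 (from 75.7 to 74.2). The higher recall of the spherical receptive field (76.8) indicates that it captures additional context.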

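Effect of more complex anchor shapes: More complex anchor shapes (cylindrical 77.0, ellipsoidal 77.4) give higher recall, but the difference in final mAP is small. Compared with cylindrical and ellipsoidal anchors, the spherical anchor is simpler, being determined by its radius alone, and is effective enough.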

Effect of the proposal feature: Proposal features consist of canonical coordinates and 3D semantic features. Compared with using the raw point coordinates (row 1), both canonization and semantic features bring a clear performance gain.

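Effect of the IoU branch: 3D-IoU outperforms conventional NMS and Soft-NMS, but 3D-IoU only considers positive proposals, whereas the cls-score (classification score) can distinguish positive from negative predictions. Combining the cls-score with 3D-IoU is therefore more effective.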
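The effect of the ranking score on NMS can be shown with a toy sketch. It assumes that the combination is the product of the cls-score and the predicted 3D IoU, and it replaces rotated-box IoU with a simple axis-aligned BEV IoU to stay short; the function names and all numbers are hypothetical.

```python
def bev_iou_axis_aligned(a, b):
    """IoU of two axis-aligned BEV rectangles given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold):
    """Greedy NMS: visit boxes by descending score, drop those overlapping a kept box."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(bev_iou_axis_aligned(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep

# Boxes 0 and 1 cover the same object; box 1 is better localized (higher predicted IoU)
# but has a slightly lower classification score.
boxes = [(0.0, 0.0, 4.0, 2.0), (0.2, 0.0, 4.2, 2.0), (10.0, 10.0, 14.0, 12.0)]
cls_scores = [0.90, 0.85, 0.60]
pred_ious = [0.50, 0.80, 0.70]

keep_by_cls = nms(boxes, cls_scores, iou_threshold=0.5)
combined = [c * i for c, i in zip(cls_scores, pred_ious)]
keep_by_combined = nms(boxes, combined, iou_threshold=0.5)
```

Ranking by cls-score alone keeps box 0 and suppresses the better-localized box 1; ranking by the combined score keeps box 1 instead.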