Voxel R-CNN: A High-Performance 3D Object Detection Network (AAAI 2021)

Author: 柒柒 @ Zhihu

Source: https://zhuanlan.zhihu.com/p/390497086

Editor: 3D视觉工坊

Paper title: Voxel R-CNN: Towards High Performance Voxel-based 3D Object Detection
Affiliation: CAS Key Laboratory of GIPAS, EEIS Department, University of Science and Technology of China
Code: https://github.com/djiajunustc/Voxel-R-CNN
Paper: https://arxiv.org/pdf/2012.15712.pdf

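The paper in one sentence:

The paper's core question is how to extract 3D structure information effectively.

The authors' arguments:

1. Why do point-based methods work well? The authors attribute it to the precise position information these methods retain. That precision costs efficiency, though, because sampling and aggregating point features is very time-consuming.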

Many existing high performance 3D detectors are point-based because this structure can better retain precise point positions. Nevertheless, point-level features lead to high computation overheads due to unordered storage. Despite the superior detection accuracy, point-based methods are in general less efficient because it is more costly to search nearest neighbor with the point representation for point set abstraction.

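2. What is the advantage of voxels? Dividing the point cloud into a regular grid aggregates the points inside each cell, which makes feature extraction considerably more efficient. The trade-off is that aggregating the points inside a cell discards each point's precise position.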

The voxel-based methods divide point clouds into regular grids, which are more applicable for convolutional neural networks (CNNs) and more efficient for feature extraction due to superior memory locality. Nevertheless, the downside is that voxelization often causes loss of precise position information.

3. The authors' central claim: a high-performance 3D detector does not need the precise positions of raw points.

In this paper, we take a slightly different viewpoint — we find that precise positioning of raw points is not essential for high performance 3D object detection and that the coarse voxel granularity can also offer sufficient detection accuracy.

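4. The authors see the core difference between voxel-based and point-based methods as follows: voxel-based methods detect on the bird's-eye-view (BEV) map, whereas point-based methods use point features that carry more 3D structure information. What limits voxel-based methods, then, is that detection on a BEV map loses 3D structure information.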

By taking a close look at the underlying mechanisms, we find that the key disadvantage of existing voxel-based methods stems from the fact that they convert 3D feature volumes into BEV representations without ever restoring the 3D structure context.

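The key module of the paper is therefore the one that extracts 3D structure information for voxel-based methods. How did earlier networks handle this?

5. SECOND vs. PV-RCNN. The authors compare two networks with a large performance gap and draw two conclusions:

a) 3D structure information matters a great deal for 3D object detection. PV-RCNN adds a keypoint module that supplies 3D structure information, and performs better as a result.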

the 3D structure is of significant importance for 3D object detectors, since the BEV representation alone is insufficient to precisely predict bounding boxes in a 3D space. Typically, PV-RCNN integrates voxel features into sampled keypoints with Voxel Set Abstraction. The keypoints works as an intermediate feature representation to effectively preserve 3D structure information.

b) Point-voxel interaction is very time-consuming, so PV-RCNN is much slower than SECOND.

the point-voxel feature interaction is time-consuming and affects the detector's efficiency. However, the point-voxel interaction takes almost half of the overall running time, which makes PV-RCNN much slower than SECOND.
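
Points 2 and 4 above can be made concrete with a small sketch. It is only an illustration under stated assumptions, not code from the Voxel R-CNN repository: `voxelize_mean` and `to_bev` are hypothetical helper names, mean pooling is just one possible way to summarize the points in a voxel, and folding the height axis into channels is the usual SECOND-style way to obtain a BEV map.

```python
import numpy as np

def voxelize_mean(points, voxel_size, pc_min):
    """Group points into voxels and average the points inside each voxel.

    points: (N, 3) xyz coordinates; voxel_size, pc_min: (3,) arrays.
    Returns integer voxel indices (M, 3) and the per-voxel mean xyz (M, 3).
    Assumption: mean pooling stands in for whatever voxel encoder is used.
    """
    idx = np.floor((points - pc_min) / voxel_size).astype(np.int64)
    uniq, inverse = np.unique(idx, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)  # the inverse shape differs across NumPy versions
    sums = np.zeros((len(uniq), 3))
    np.add.at(sums, inverse, points)  # accumulate the points that share a voxel
    counts = np.bincount(inverse, minlength=len(uniq))
    return uniq, sums / counts[:, None]

def to_bev(volume):
    """Fold the height axis of a (C, D, H, W) volume into channels: (C * D, H, W)."""
    c, d, h, w = volume.shape
    return volume.reshape(c * d, h, w)
```

After `voxelize_mean`, two points that fall into the same voxel are represented by a single averaged position, so their individual coordinates can no longer be recovered. After `to_bev`, the height axis no longer exists as a spatial dimension, which is the loss of 3D structure the authors point to.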

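So how can the feature representation be made both fast and effective? Concretely, the pipeline can be read as:

Input point cloud → voxel feature extraction → RPN → Voxel RoI pooling → detection results, as shown in the figure below.

Figure: Voxel R-CNN network framework

The overall network is fairly simple, so only the key points are noted here.

How does Voxel RoI pooling work? In short, it represents proposal features with 3D voxels; that is, it aggregates structure information from the 3D voxel features.

a) Voxel Query matches each proposal with nearby voxels: for a query point, it finds the neighboring voxels by Manhattan distance.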

We exploit Manhattan distance in voxel query and sample up to K voxels within a distance threshold.

b) With the matched voxels in hand, the next step is to extract structure information from the voxel group. The authors introduce voxel RoI pooling layers, whose steps are: proposal → grids → voxel query → aggregate features with PointNet.

It starts by dividing a region proposal into sub-voxels. The center point is taken as the grid point of the corresponding sub-voxel. Specifically, given a grid point, we first exploit voxel query to group a set of neighboring voxels. Then, we aggregate the neighboring voxel features with a PointNet module.
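
The voxel query in step a) can be sketched with a plain dictionary lookup. This is a minimal illustration, not the authors' CUDA implementation: `voxel_query`, `voxel_table` and `max_samples` are hypothetical names, and visiting the nearest offsets first is an assumption made here so that the sampling is deterministic.

```python
from itertools import product

def voxel_query(query_idx, voxel_table, max_dist, max_samples):
    """Return row ids of up to max_samples non-empty voxels near query_idx.

    query_idx: integer (x, y, z) index of the voxel containing the grid point.
    voxel_table: dict mapping a non-empty voxel index to its feature row id.
    max_dist: Manhattan distance threshold, measured in voxel units.
    """
    qx, qy, qz = query_idx
    # Enumerate index offsets whose Manhattan distance is within the threshold.
    offsets = [
        o for o in product(range(-max_dist, max_dist + 1), repeat=3)
        if sum(abs(v) for v in o) <= max_dist
    ]
    offsets.sort(key=lambda o: sum(abs(v) for v in o))  # assumption: nearest first
    found = []
    for dx, dy, dz in offsets:
        row = voxel_table.get((qx + dx, qy + dy, qz + dz))
        if row is not None:  # empty voxels are simply absent from the table
            found.append(row)
            if len(found) == max_samples:
                break
    return found
```

Because voxels sit on a regular grid, the neighbors are reached by shifting the index, so the cost per query depends on the number of offsets rather than on the total number of voxels.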

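In other words: divide each proposal into grids, run a voxel query for each grid point, aggregate the queried features into a grid feature representation, and concatenate the grid features into the proposal feature, which the authors regard as carrying 3D structure information.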
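
The whole pooling flow for a single proposal can be sketched as follows. Everything here is a simplification under stated assumptions: the function names and the box layout `(cx, cy, cz, dx, dy, dz, yaw)` are hypothetical, neighbors are found by brute force rather than by an indexed voxel query, and a plain max over neighbor features stands in for the PointNet module.

```python
import numpy as np

def roi_grid_points(box, grid_size):
    """Centers of the grid_size**3 sub-voxels of a box (cx, cy, cz, dx, dy, dz, yaw)."""
    cx, cy, cz, dx, dy, dz, yaw = box
    steps = (np.arange(grid_size) + 0.5) / grid_size - 0.5  # cell centers in (-0.5, 0.5)
    local = np.stack(np.meshgrid(steps, steps, steps, indexing="ij"), axis=-1)
    local = local.reshape(-1, 3) * np.array([dx, dy, dz])
    cos, sin = np.cos(yaw), np.sin(yaw)
    rot = np.array([[cos, -sin, 0.0], [sin, cos, 0.0], [0.0, 0.0, 1.0]])
    return local @ rot.T + np.array([cx, cy, cz])  # rotate about z, then translate

def voxel_roi_pool(box, grid_size, voxel_idx, voxel_feats, voxel_size, pc_min,
                   max_dist, max_samples):
    """Pool one proposal into a (grid_size**3 * C,) feature vector.

    voxel_idx: (N, 3) integer indices of non-empty voxels; voxel_feats: (N, C).
    """
    grid = roi_grid_points(box, grid_size)
    grid_idx = np.floor((grid - pc_min) / voxel_size).astype(np.int64)
    pooled = np.zeros((len(grid), voxel_feats.shape[1]))
    for g, q in enumerate(grid_idx):
        dist = np.abs(voxel_idx - q).sum(axis=1)  # Manhattan distance in voxel units
        near = np.flatnonzero(dist <= max_dist)[:max_samples]
        if len(near) > 0:
            pooled[g] = voxel_feats[near].max(axis=0)  # stand-in for PointNet
    return pooled.reshape(-1)  # concatenate the grid features
```

Grid points with no non-empty voxel in range keep a zero feature in this sketch; how empty neighborhoods are handled in practice is an implementation detail not covered here.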

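The other question is how to stay efficient while still extracting 3D structure information effectively. For this the authors propose Accelerated Local Aggregation. The module essentially processes the voxel features first and handles the position information afterwards, which greatly reduces the time complexity.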
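
The idea can be checked numerically with a small sketch. It assumes the aggregation step is a single linear layer applied to the concatenation of a relative offset and a voxel feature; the function names are hypothetical, and the real module also includes a nonlinearity and max pooling that are omitted here. With M grid points, K neighbors each, N voxels, C input channels and C' output channels, the naive form costs roughly O(M·K·(C+3)·C'), while the split form costs roughly O(N·C·C' + M·K·3·C').

```python
import numpy as np

def aggregate_naive(rel_xyz, feats, neighbor_ids, weight):
    """Concatenate [offset, feature] per neighbor, then apply one linear layer.

    rel_xyz: (M, K, 3) offsets from each grid point to its K neighbors.
    feats: (N, C) voxel features; neighbor_ids: (M, K) row ids into feats.
    weight: (3 + C, C_out).
    """
    grouped = np.concatenate([rel_xyz, feats[neighbor_ids]], axis=-1)  # (M, K, 3 + C)
    return grouped @ weight  # (M, K, C_out)

def aggregate_accelerated(rel_xyz, feats, neighbor_ids, weight):
    """Same output, but the feature part is transformed once per voxel."""
    w_pos, w_feat = weight[:3], weight[3:]
    feats_out = feats @ w_feat  # (N, C_out): N voxels instead of M * K pairs
    return rel_xyz @ w_pos + feats_out[neighbor_ids]  # grouping happens after the transform
```

The two functions agree because a linear layer on a concatenated input equals the sum of two linear layers on the separate parts. The saving comes from the feature transform no longer being repeated for every grid point that shares a voxel.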

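The rest needs little comment: a standard RPN, detection head, and so on.

Experimental results:

KITTI test set

As usual, the results improve on prior methods.

Thoughts after reading:

In essence this is a two-stage network: it uses the BEV map to generate initial proposals, maps them back into the 3D voxel features to extract structure information, and relies on the acceleration module to keep detection efficient while improving accuracy.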
