AutoShape: Real-Time Shape-Aware Monocular 3D Object Detection (ICCV 2021)

Author: 柒柒 @ Zhihu

Source: https://zhuanlan.zhihu.com/p/404683961

Editor: 3D视觉工坊

Paper title: AutoShape: Real-Time Shape-Aware Monocular 3D Object Detection
Affiliation: Robotics and Autonomous Driving Laboratory, Baidu Research, among others
Code: https://github.com/zongdai/AutoShape
Paper: https://arxiv.org/pdf/2108.11127.pdf

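The paper in one sentence:

Use shape information to improve monocular 3D detection.

The authors' arguments:

1. The main challenge in monocular 3D detection is obtaining accurate depth information.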

The main challenge for monocular-based approaches is to obtain accurate depth information. In general, depth estimation from a single image without any prior information is a challenging problem and recent many deep learning-based approaches achieve good results.

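2. Detectors based on pseudo point clouds are relatively inefficient, while more direct methods such as SMOKE and RTM3D also achieve good performance. These methods model object detection as keypoints plus attributes (size, offsets, orientation, depth, etc.), which makes them more efficient.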

To improve the efficiency, many direct regression-based approaches have been proposed (e.g., SMOKE, RTM3D) and achieved promising results. By representing the object as one center point, the object detection task is formulated as keypoints detection and its corresponding attributes (e.g., size, offsets, orientation, depth, etc.) regression.

But these methods have an equally obvious drawback: they use only keypoint information and ignore the overall shape of the object.

However, the drawback is also obvious. One center point representation ignores the detailed shape of the object and results in location ambiguity if its projected center point is on another object's surface due to occlusion.

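In short, the motivation is straightforward. The basic approach still follows keypoint-based detectors, but introduces shape information to remedy their weakness. How should shape information be constructed? The usual 8 box corners clearly cannot represent an object's shape adequately, so the authors introduce n keypoints to model it. In the experiments, n = 16 or n = 48.

Concretely, the network can be understood in two parts: first, how to obtain the n keypoints; second, how to use the n keypoints to model shape information.

First, to follow the order of the paper, we start with how keypoints are used to model shape information, as shown in the figure below.

Overall framework

A single image is fed through the backbone network, which produces 7 kinds of outputs, including: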

In anchor-free based object detection frameworks, the object center is essential information, which serves two functions: one is whether there is an object and the other is that if there exists an object, where is the center. The output of this branch will be above, where C is the number of classes and 2 represents the offset in x and y direction respectively.
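To make the two roles of the center branch concrete, here is a minimal NumPy sketch of how such outputs are often decoded in CenterNet-style detectors. It assumes a heatmap of shape (C, H, W) that has already been passed through a sigmoid, an offset map of shape (2, H, W), and peak picking with a 3x3 local maximum; the function name, the threshold, and the peak-picking rule are illustrative assumptions, not details confirmed by the paper.

```python
import numpy as np

def decode_centers(heatmap, offset, score_thresh=0.3):
    """heatmap: (C, H, W) scores in [0, 1]; offset: (2, H, W) sub-pixel x/y offsets."""
    C, H, W = heatmap.shape
    # Pad with -inf so border pixels are compared only against real neighbours.
    padded = np.pad(heatmap, ((0, 0), (1, 1), (1, 1)), constant_values=-np.inf)
    # 3x3 neighbourhood maximum per class, built from the nine shifted views.
    neighbourhood = np.stack([
        padded[:, dy:dy + H, dx:dx + W] for dy in range(3) for dx in range(3)
    ])
    is_peak = heatmap >= neighbourhood.max(axis=0)
    cls, ys, xs = np.nonzero(is_peak & (heatmap > score_thresh))
    detections = []
    for c, y, x in zip(cls, ys, xs):
        # Refine the integer peak location with the regressed offset.
        cx = x + offset[0, y, x]
        cy = y + offset[1, y, x]
        detections.append((int(c), float(cx), float(cy), float(heatmap[c, y, x])))
    return detections
```

Each returned tuple is (class index, center x, center y, score) in feature-map coordinates; mapping back to image pixels would additionally need the backbone stride, which is not covered here.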

A separate branch is used to regress the object dimension. Similar to other approaches, we don't regress the absolute object’s size directly and regress a relative scale compared to the mean object size of each class.
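As a sketch of what regressing a relative scale can look like, the snippet below encodes a box size as the log-ratio to a per-class mean and decodes it back. The log-ratio form is an assumption (the quoted text only says "relative scale"), and the mean size values are illustrative placeholders rather than numbers from the paper.

```python
import numpy as np

# Illustrative per-class mean size (height, width, length) in metres.
# These values are placeholders, not taken from the paper.
MEAN_SIZE = {"Car": np.array([1.5, 1.6, 3.9])}

def encode_dimension(size, cls):
    """Regression target: log of the ratio between the true size and the class mean."""
    return np.log(np.asarray(size, dtype=float) / MEAN_SIZE[cls])

def decode_dimension(pred, cls):
    """Invert the encoding: a prediction of zero returns the class mean size."""
    return MEAN_SIZE[cls] * np.exp(np.asarray(pred, dtype=float))
```

With this parameterisation the network output stays close to zero for typical objects, which is the usual reason for preferring a relative target over absolute metres.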

Rather than directly detect these keypoints from the image, we regress n ordered 2D offset coordinates for each object center. The benefit is that the number and order of keypoints for each object can be well guaranteed.
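A small sketch of this idea: because every keypoint is stored as an offset from the object center, decoding is a single addition, and the keypoint order is fixed by the channel order. The flat layout [dx1, dy1, dx2, dy2, ...] and the function name are assumptions made for illustration.

```python
import numpy as np

def decode_keypoints_2d(center, offsets):
    """center: (2,) as (x, y); offsets: flat (2n,) as [dx1, dy1, dx2, dy2, ...].

    Returns an (n, 2) array of keypoint positions in the same order as the offsets.
    """
    return np.asarray(center, dtype=float) + np.asarray(offsets, dtype=float).reshape(-1, 2)
```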

Similar to 2D keypoints, we regress the 3d keypoints in the local object coordinates. In addition, all 3D keypoints are normalized by object dimension in xyz direction respectively. By using this format, the 3D values are in a relatively small range, which will benefit the whole regression process.
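The normalization described here can be sketched as a per-axis division by the object size. The snippet assumes the dimensions are supplied in the same x, y, z axis order as the keypoints; which box dimension maps to which local axis is an assumption and is not specified in the quoted text.

```python
import numpy as np

def normalize_keypoints_3d(kps_local, dims_xyz):
    """kps_local: (n, 3) keypoints in the object's local frame; dims_xyz: (3,) extents along x, y, z."""
    return np.asarray(kps_local, dtype=float) / np.asarray(dims_xyz, dtype=float)

def denormalize_keypoints_3d(kps_norm, dims_xyz):
    """Recover metric local coordinates from normalized predictions."""
    return np.asarray(kps_norm, dtype=float) * np.asarray(dims_xyz, dtype=float)
```

For an object centered at the origin of its local frame, a point on the bounding box surface maps to a coordinate of magnitude 0.5 on that axis, so all targets fall in a small range regardless of object size.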

The overall loss contains the following items: a center point classification loss and center point offset regression loss, a 2D keypoints regression loss, a 3D keypoints regression loss, an orientation multibin loss, a dimension regression loss, a 3D IoU confidence loss and a 3D box IoU loss.

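That completes the inputs and outputs of the overall framework. Back to the first question: how are the predicted keypoints used to model shape information? Here the authors bring in the idea of pose estimation: each 3D keypoint on the object is put in one-to-one correspondence with a 2D keypoint, and each pair gives a constraint. First, some definitions: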
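One way to make such a 2D-3D constraint concrete is the sketch below. It assumes a pinhole camera with intrinsics K (zero skew), a KITTI-style camera frame in which the object rotates about the y axis by a yaw angle, 3D keypoints p_i in the local object frame, and their matched 2D keypoints (u_i, v_i). With the rotation known, each correspondence gives two equations that are linear in the object translation t, so t can be recovered by least squares. This is an illustrative formulation under those assumptions, not necessarily the exact solver used in the paper, and all function names are hypothetical.

```python
import numpy as np

def rot_y(yaw):
    """Rotation about the camera y axis (KITTI-style yaw)."""
    c, s = np.cos(yaw), np.sin(yaw)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def project(K, R, t, pts_local):
    """Project local 3D points into the image: x = K (R p + t), then divide by depth."""
    cam = pts_local @ R.T + t
    uv = cam @ K.T
    return uv[:, :2] / uv[:, 2:3]

def solve_translation(K, yaw, pts_local, kps_2d):
    """Least-squares translation from n matched 3D/2D keypoints, rotation assumed known."""
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    r = pts_local @ rot_y(yaw).T          # rotated keypoints, before translation
    u = kps_2d[:, 0] - cx
    v = kps_2d[:, 1] - cy
    n = len(r)
    A = np.zeros((2 * n, 3))
    b = np.zeros(2 * n)
    # u * (r_z + t_z) = fx * (r_x + t_x)  ->  fx * t_x - u * t_z = u * r_z - fx * r_x
    A[0::2, 0] = fx
    A[0::2, 2] = -u
    b[0::2] = u * r[:, 2] - fx * r[:, 0]
    # v * (r_z + t_z) = fy * (r_y + t_y)  ->  fy * t_y - v * t_z = v * r_z - fy * r_y
    A[1::2, 1] = fy
    A[1::2, 2] = -v
    b[1::2] = v * r[:, 2] - fy * r[:, 1]
    t, *_ = np.linalg.lstsq(A, b, rcond=None)
    return t
```

With n = 16 or 48 keypoints the system is heavily over-determined (2n equations, 3 unknowns), which is where the extra shape keypoints help compared with a single center point.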

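Second, 3D shape auto-labeling. This module answers the question of how to automatically generate keypoint annotations for each object, as shown in the figure below.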

In this section, we will introduce how to automatically fit the 3D shape to the visual observations and then automatically generate ground-truth annotations of 2D keypoints and 3D locations in the local object coordinate for training the network.

Auto-labeling framework

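The core idea is that any object can be represented as a basic template plus a deformation. The authors therefore derive a basic template and deform it with PCA coefficients and 6D pose parameters, so that the resulting 3D model matches the current object as closely as possible. How do we know whether the generated 3D model is a good fit? The authors design an objective function for this.

Its key ingredient is a differentiable rendering function. What does that mean? To quote the explanation from the article on differentiable rendering, Learning to Predict 3D Objects with an Interpolation-based Differentiable Renderer (https://blog.csdn.net/qq_43420530/article/details/117909788):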

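Rendering a 3D object into a 2D image is essentially a series of matrix transformations, interpolations, and similar operations, which resembles a neural network in some respects. Rendering corresponds to the forward pass and produces the rendered image; comparing the rendered image with the input image defines a loss, which can be backpropagated to optimize the shape and texture of the 3D object. This enables 3D reconstruction from a single image without depending on 3D datasets.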
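The render, compare, and update loop can be illustrated with a deliberately tiny toy. The sketch below "renders" a 1D object (a center and a width) into a row of pixels using soft sigmoid edges, compares it with a target image, and follows the hand-derived gradient of the mean squared error back to the two shape parameters. It is only a stand-in for the concept: a real differentiable renderer rasterizes a 3D mesh, and every name and constant here is an assumption chosen for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def render(center, width, pixels, tau=0.3):
    """Soft 1D 'silhouette', plus the two edge factors needed for the gradient."""
    a = sigmoid((pixels - (center - width / 2)) / tau)  # rises at the left edge
    b = sigmoid(((center + width / 2) - pixels) / tau)  # falls at the right edge
    return a * b, a, b

def fit(target, pixels, center, width, lr=1.0, steps=1000, tau=0.3):
    """Gradient descent on mean((render - target)^2) with respect to center and width."""
    for _ in range(steps):
        image, a, b = render(center, width, pixels, tau)    # forward pass
        residual = 2.0 * (image - target) / pixels.size      # dLoss/dImage
        d_left = -a * (1 - a) * b / tau                      # dImage/d(left edge)
        d_right = a * b * (1 - b) / tau                      # dImage/d(right edge)
        # Chain rule: left = center - width/2, right = center + width/2.
        grad_center = np.sum(residual * (d_left + d_right))
        grad_width = np.sum(residual * 0.5 * (d_right - d_left))
        center -= lr * grad_center                           # parameter update
        width -= lr * grad_width
    return center, width
```

For example, rendering a target with `render(5.5, 3.0, pixels)` on `pixels = np.linspace(0.0, 10.0, 200)` and calling `fit` from an initial guess of center 4.0 and width 2.0 moves both parameters toward the target values. The soft edges matter: with hard 0/1 edges the image would not change smoothly with the parameters and the gradient would carry no signal.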

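What can a differentiable renderer be used for?

With a differentiable renderer, 3D reconstruction can be organized into the pipeline above. An input image goes through a neural network that predicts three 3D attributes: mesh, light, and texture. Without differentiable rendering, the network has to be optimized with a 3D dataset to make its predictions as accurate as possible. With differentiable rendering, the 3D dataset is no longer needed: the predicted attributes are passed through the differentiable renderer, and the loss is used to optimize them, yielding the reconstruction. And if enough input images are available (for example, generated by a GAN), the reconstructed 3D attributes can even be treated as ground truth to train the network that predicts them!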
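Returning to the auto-labeling step, the quantity that such an optimization adjusts can be sketched as a template plus a linear PCA deformation, followed by a rigid 6D pose (rotation and translation). The array shapes, the linear form of the deformation, and the function names below are assumptions for illustration; the source only states that a template is deformed with PCA coefficients and 6D pose parameters.

```python
import numpy as np

def deform_shape(mean_shape, basis, coeffs):
    """mean_shape: (V, 3) template vertices; basis: (K, V, 3) PCA components; coeffs: (K,).

    Returns the deformed vertices: template + sum_k coeffs[k] * basis[k].
    """
    return mean_shape + np.tensordot(coeffs, basis, axes=1)

def apply_pose(vertices, R, t):
    """Place the deformed shape in the camera frame with rotation R (3, 3) and translation t (3,)."""
    return vertices @ R.T + t
```

Once the coefficients and pose have been fitted for an object, the same n template vertices can be read off as ordered 3D keypoints and projected into the image to obtain the matching 2D keypoint labels.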

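Experimental results:

KITTI test set

The improvement is fairly clear, though it is concentrated on easy samples. My guess is that this is because the 3D shape auto-labels still contain some error.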
