Paper notes: 3D-CVF: Generating Joint Camera and LiDAR Features Using Cross-View Spatial Feature Fusion for 3D Object Detection (ECCV 2020)

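1. Why do this research (theoretical trends and current shortcomings)?

Earlier fusion-based 3D object detection methods all lose some information during fusion, which limits their accuracy. This paper sets out to improve on that point.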

2. How did they do the research (method, especially what differs from prior work)?

1. Auto-calibrated projection and gated attention

2. 3D RoI fusion-based refinement

3. In effect, fusion is performed twice.

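3. What did they find (summary of results, and how they relate to the theory)?

1. Why does the paper say that projecting camera features into 3D world coordinates loses some spatial information because the transform is a one-to-many mapping? (I would have expected information to be lost only when going from 3D to 2D.)

2. There are six camera views. Which six?

3. How is the camera voxel structure mentioned in the paper obtained, so that it can be processed by PointNet?

4. What does the "360 degree Camera Voxel" in cross-view feature mapping refer to?

5. I did not understand RoI grid-based pooling of camera features.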

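4. Abstract

Problem statement: one challenge in camera-LiDAR fusion is that the feature maps from each modality live in different coordinate systems (camera view versus 3D world). Fusing feature maps with such different structure without losing information is therefore not trivial.

To address this, the paper proposes 3D-CVF, which uses a cross-view spatial feature fusion strategy.

First, auto-calibrated projection transforms the 2D camera features into a smooth feature map in BEV. This BEV feature map is highly correlated with the LiDAR features in BEV.

Next, a gated feature fusion network applies spatial attention maps to fuse the camera and LiDAR features.

The fused features from this step are then passed to the subsequent proposal refinement stage.

In addition, the low-level LiDAR and camera features are each pooled with region of interest (RoI)-based feature pooling, and the pooled features are fused once more with the joint feature from the previous step.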

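5. Introduction

The introduction opens with the usual review of prior 2D and 3D object detection work. The 3D detectors cited include MV3D [2], PIXOR [29], ContFuse [13], PointRCNN [22], F-ConvNet [26], STD [30], VoxelNet [31], SECOND [28], MMF [12], PointPillar [9], and Part A2 [23].

It then covers the drawbacks of using a camera alone, which motivates fusion. Fusion in turn raises the problem stated in the abstract, so how does this paper address it?

First, projecting camera features into 3D world coordinates is a one-to-many mapping, so some spatial information is lost. In addition, the projected coordinates may still be inconsistent with the LiDAR 3D coordinates. For these two reasons, camera-LiDAR fusion-based methods have struggled to outperform LiDAR-only methods.

So in this paper the inputs are LiDAR and multi-view cameras, and detection is split into two stages.

Stage one: auto-calibrated feature projection maps the camera-view features to smooth and dense BEV feature maps using the interpolated projection capable of correcting the spatial offsets.

The paper's Fig. 1 compares the BEV camera features with and without auto-calibrated feature projection.

Note also that even after auto-calibrated feature projection, the camera features alone still cannot localize objects.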

We also note from Fig. 1 (b) that since the camera feature mapping is a one-to-many mapping, we cannot localize the objects on the transformed camera feature map. To resolve objects in the BEV domain, we employ the adaptive gated fusion network that determines where and what should be brought from two sources using attention mechanism. Fig. 1 (c) shows the appropriately-localized activation for the objects obtained by applying the adaptive gated fusion network.

6. Related work

7. Network architecture

7.1 Overall architecture

It consists of five modules including the 1) LiDAR pipeline, 2) camera pipeline, 3) cross-view spatial feature mapping, 4) gated camera-LiDAR feature fusion network, and 5) proposal generation and refinement network.

LiDAR Pipeline: the point cloud is first voxelized as in VoxelNet, then passed through six 3D sparse convolution layers (as in SECOND), producing a LiDAR feature map of 128 channels in the BEV domain.
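The voxelization step can be sketched roughly as follows. This is a minimal NumPy illustration, not the paper's implementation: the function name `voxelize`, the `voxel_size` and `pc_range` parameters, and the use of a per-voxel mean (as a stand-in for a learned voxel feature encoder) are all assumptions made for the example. The sparse 3D convolutions that follow are not shown.

```python
import numpy as np

def voxelize(points, voxel_size, pc_range):
    """Group points into voxels and average the points in each voxel.

    points:     (N, D) array, first three columns are x, y, z.
    voxel_size: (3,) voxel edge lengths (hypothetical values).
    pc_range:   (6,) [x_min, y_min, z_min, x_max, y_max, z_max].
    Returns integer voxel coordinates (M, 3) and mean features (M, D).
    """
    points = np.asarray(points, dtype=np.float64)
    pc_min = np.asarray(pc_range[:3], dtype=np.float64)
    pc_max = np.asarray(pc_range[3:], dtype=np.float64)
    size = np.asarray(voxel_size, dtype=np.float64)

    # Drop points that fall outside the detection range.
    inside = np.all((points[:, :3] >= pc_min) & (points[:, :3] < pc_max), axis=1)
    pts = points[inside]

    # Integer voxel index of every remaining point.
    idx = np.floor((pts[:, :3] - pc_min) / size).astype(np.int64)

    # One row per occupied voxel; `inverse` maps each point to its voxel row.
    coords, inverse = np.unique(idx, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)

    # Mean of the points in each voxel (simple stand-in for a learned encoder).
    feats = np.zeros((len(coords), pts.shape[1]))
    np.add.at(feats, inverse, pts)
    counts = np.bincount(inverse, minlength=len(coords))
    feats /= counts[:, None]
    return coords, feats
```

Only occupied voxels are returned, which is what makes sparse convolution attractive for LiDAR data: most of the 3D grid is empty.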

RGB Pipeline: a pre-trained ResNet-18 [6] followed by a feature pyramid network (FPN) [14] generates the camera feature map of 256 channels represented in camera-view.

Cross-View Feature Mapping: The auto-calibrated projection converts the camera feature maps in camera-view to those in BEV. Then, the projected feature map is enhanced by the additional convolutional layers and delivered to the gated camera-LiDAR feature fusion block.

Gated Camera-LiDAR Feature Fusion: The spatial attention maps are applied to both feature maps to adjust the contributions from each modality depending on their importance. This produces the joint camera-LiDAR feature map.

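3D RoI Fusion-based Refinement: RoI pooling is first applied to the joint camera-LiDAR feature map from the previous step, which is used for proposal refinement. Because the joint camera-LiDAR feature map does not carry enough spatial information, 3D RoI-based pooling is also applied to the six low-level LiDAR features and six low-level camera features, and PointNet is used to encode both. The resulting features are combined with the joint camera-LiDAR feature map by the 3D RoI-based fusion network to give the final fused feature, which produces the final detections.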

7.2 Cross-View Feature Mapping

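First, a camera voxel structure has to be generated.

Next comes the auto-calibrated projection method. The center of each voxel in the 3D voxel map is transformed into the camera-view plane using the world-to-camera-view projection matrix. Each projected center is then combined with its 4 neighbouring pixels to form one feature u.

See the figure and equations in the paper.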
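The projection-and-interpolation step described above can be sketched as follows. This is a simplified NumPy illustration under stated assumptions, not the paper's code: `P` is taken to be a 3x4 world-to-image projection matrix, the 4 neighbouring pixels are combined with plain bilinear weights, and the optional `offsets` argument is a hypothetical stand-in for the learned calibration offsets that the abstract alludes to.

```python
import numpy as np

def project_points(centers, P):
    """Project (N, 3) world points with a 3x4 matrix P to pixel coords (u, v)."""
    homo = np.hstack([centers, np.ones((len(centers), 1))])
    cam = homo @ P.T                           # (N, 3)
    depth = cam[:, 2]
    safe = np.where(depth > 0, depth, 1.0)     # avoid dividing by <= 0
    return cam[:, :2] / safe[:, None], depth

def bilinear_sample(feat, uv):
    """Sample feat (H, W, C) at continuous pixel coords uv (N, 2), u = column."""
    H, W, _ = feat.shape
    u = np.clip(uv[:, 0], 0, W - 1)
    v = np.clip(uv[:, 1], 0, H - 1)
    u0 = np.floor(u).astype(int)
    v0 = np.floor(v).astype(int)
    u1 = np.minimum(u0 + 1, W - 1)
    v1 = np.minimum(v0 + 1, H - 1)
    du = (u - u0)[:, None]
    dv = (v - v0)[:, None]
    # Weighted sum of the 4 neighbouring pixels.
    top = feat[v0, u0] * (1 - du) + feat[v0, u1] * du
    bottom = feat[v1, u0] * (1 - du) + feat[v1, u1] * du
    return top * (1 - dv) + bottom * dv

def camera_features_for_voxels(centers, P, feat, offsets=None):
    """Return one camera feature per voxel center, zero where not visible."""
    uv, depth = project_points(np.asarray(centers, dtype=np.float64), P)
    if offsets is not None:                    # hypothetical learned correction
        uv = uv + offsets
    H, W, C = feat.shape
    valid = ((depth > 0)
             & (uv[:, 0] >= 0) & (uv[:, 0] <= W - 1)
             & (uv[:, 1] >= 0) & (uv[:, 1] <= H - 1))
    out = np.zeros((len(uv), C))
    out[valid] = bilinear_sample(feat, uv[valid])
    return out, valid
```

The sketch also makes the one-to-many issue concrete: every voxel center that lies on the same camera ray projects to the same pixel and therefore receives the same feature, so the projected camera features are smeared along depth and cannot localize an object on their own.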

7.3 Gated Camera-LiDAR Feature Fusion

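This module measures how important the camera and LiDAR features are, i.e. it computes a weight for each of them.

The corresponding figure and equations are in the paper.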
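A rough sketch of the gating idea, assuming both feature maps are already aligned on the same BEV grid. The names `gated_fusion`, `w_cam` and `w_lidar` are made up for this example, and the attention maps here come from a 1x1 convolution (a per-pixel dot product) followed by a sigmoid, which is a simplification chosen to keep the example short rather than the layer configuration used in the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(f_cam, f_lidar, w_cam, w_lidar):
    """Weight each modality by a spatial attention map, then concatenate.

    f_cam:   (H, W, Cc) camera features in BEV.
    f_lidar: (H, W, Cl) LiDAR features in BEV.
    w_cam, w_lidar: (Cc + Cl,) weights of two hypothetical 1x1 convolutions,
                    each producing a one-channel attention logit map.
    """
    joint = np.concatenate([f_cam, f_lidar], axis=-1)   # (H, W, Cc + Cl)

    # Both attention maps see both modalities, so each gate can depend
    # on what the other sensor observes at the same BEV location.
    a_cam = sigmoid(joint @ w_cam)[..., None]           # (H, W, 1)
    a_lidar = sigmoid(joint @ w_lidar)[..., None]       # (H, W, 1)

    fused = np.concatenate([f_cam * a_cam, f_lidar * a_lidar], axis=-1)
    return fused, a_cam, a_lidar
```

Because each attention value lies in (0, 1) and varies per BEV cell, the network can lean on LiDAR where camera features are smeared along depth and lean on the camera where the point cloud is sparse.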

7.4 3D-RoI Fusion-based Refinement

Region Proposal Generation: the joint camera-LiDAR feature is first passed through an RPN. Because the RPN produces too many proposal boxes, they are filtered with NMS.
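For reference, greedy NMS can be sketched as below. This is a simplified illustration: it uses axis-aligned BEV boxes `[x1, y1, x2, y2]`, whereas 3D detectors typically work with rotated boxes, and the function name `nms_bev` and the default `iou_thresh` are assumptions for the example, not values from the paper.

```python
import numpy as np

def nms_bev(boxes, scores, iou_thresh=0.5):
    """Greedy non-maximum suppression on axis-aligned BEV boxes.

    boxes:  (N, 4) array of [x1, y1, x2, y2].
    scores: (N,) confidence scores.
    Returns the indices of the boxes that are kept, highest score first.
    """
    boxes = np.asarray(boxes, dtype=np.float64)
    scores = np.asarray(scores, dtype=np.float64)
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    order = np.argsort(-scores)
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]

        # Intersection of the current best box with all remaining boxes.
        xx1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        yy1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        xx2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        yy2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        union = np.maximum(areas[i] + areas[rest] - inter, 1e-12)

        # Keep only boxes that do not overlap the current best box too much.
        order = rest[inter / union <= iou_thresh]
    return keep
```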

3D RoI-based Feature Fusion:

This part explains how RoI grid-based pooling of camera features is done.
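One plausible reading of "RoI grid-based pooling", offered as an assumption that these notes do not confirm: a regular r x r x r grid of points is placed inside each 3D RoI box, each grid point is projected into the camera view to pick up a camera feature, and the resulting set of point features is encoded with a PointNet-style encoder. The sketch below covers only the first step, generating the grid points; the function name `roi_grid_points` and the box parameterization (center, dimensions, yaw about the z axis) are hypothetical.

```python
import numpy as np

def roi_grid_points(center, dims, yaw, r=3):
    """Return r*r*r equally spaced points inside a rotated 3D box.

    center: (3,) box center in world coordinates.
    dims:   (3,) box size along its local x, y, z axes.
    yaw:    rotation of the box about the z axis, in radians.
    """
    # Cell-centered fractions in (-0.5, 0.5) along each box axis.
    t = (np.arange(r) + 0.5) / r - 0.5
    gx, gy, gz = np.meshgrid(t, t, t, indexing="ij")
    local = np.stack([gx, gy, gz], axis=-1).reshape(-1, 3) * np.asarray(dims)

    # Rotate from the box frame into the world frame, then translate.
    c, s = np.cos(yaw), np.sin(yaw)
    R = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    return local @ R.T + np.asarray(center)
```

Under this reading, each grid point carries a known 3D position, so the sampled camera features regain the spatial detail that the BEV camera feature map lacks.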
