Motivation

In this paper we develop a one-stage detector and forecaster that exploits both 3D point clouds produced by a LiDAR sensor as well as dynamic maps of the environment.
Specifically, we exploit 3D LiDAR point clouds and dynamic HD maps containing semantic elements such as lanes, intersections and traffic lights.

Our IntentNet is a fully convolutional neural network that outputs three types of variables in a single forward pass: detection scores for the vehicle and background classes, high-level action probabilities corresponding to the discrete intention, and bounding box regressions for the current and future time steps that represent the intended trajectory.
IntentNet is inspired by FaF [4], which performs joint detection and future prediction.

Advantages

(i) a more suitable architecture based on an early fusion of a larger number of previous LiDAR sweeps,
(ii) a parametrization of the map that allows our model to understand traffic constraints for all vehicles at once, and
(iii) an improved loss function that includes a temporal discount factor to account for the inherent ambiguity of the future.

Input parametrization

3D point cloud: we represent point clouds in bird’s eye view (BEV) as a 3D tensor, treating height as our channel dimension.
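As a concrete illustration of this input tensor, here is a minimal NumPy sketch of BEV voxelization with height as the channel dimension. The grid extents, resolution, number of height bins, and the binary occupancy encoding are illustrative assumptions, not values from the paper.

```python
import numpy as np

def voxelize_bev(points, x_range=(-40.0, 40.0), y_range=(-40.0, 40.0),
                 z_range=(-2.0, 3.5), resolution=0.2, z_bins=11):
    """Rasterize an (N, 3) LiDAR point cloud into a BEV occupancy tensor.

    Height is discretized into `z_bins` slices and used as the channel
    dimension, as described in the text. Grid extents and resolution here
    are placeholders for illustration only.
    """
    H = int((x_range[1] - x_range[0]) / resolution)
    W = int((y_range[1] - y_range[0]) / resolution)
    bev = np.zeros((z_bins, H, W), dtype=np.float32)

    # Map metric coordinates to integer grid / height-bin indices.
    xi = ((points[:, 0] - x_range[0]) / resolution).astype(np.int64)
    yi = ((points[:, 1] - y_range[0]) / resolution).astype(np.int64)
    zi = ((points[:, 2] - z_range[0]) / (z_range[1] - z_range[0]) * z_bins).astype(np.int64)

    # Keep only points that fall inside the grid.
    valid = (xi >= 0) & (xi < H) & (yi >= 0) & (yi < W) & (zi >= 0) & (zi < z_bins)
    bev[zi[valid], xi[valid], yi[valid]] = 1.0  # binary occupancy per height slice
    return bev
```

The early fusion of several previous sweeps mentioned above would presumably be handled by computing one such tensor per sweep and stacking them along the channel dimension.
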
Dynamic maps:
We form a BEV representation of our maps by rasterization.
We represent each semantic component in the map with a binary mask (i.e., 1 or -1). Roads and intersections are represented as filled polygons covering the whole drivable surface. Lane boundaries are parametrized as poly-lines representing the left and right boundaries of lane segments.
In total, 17 binary masks are used as map features, resulting in a 3D tensor that represents the map.
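A minimal sketch of how such a map tensor could be assembled, assuming the map elements have already been projected into BEV grid coordinates; the helper names, the use of OpenCV for rasterization, and the channel assignment are my own assumptions, not the paper's pipeline.

```python
import numpy as np
import cv2  # used here only to rasterize polygons and poly-lines

def rasterize_map(elements_by_channel, H, W):
    """Stack per-element binary masks into a (17, H, W) map tensor.

    `elements_by_channel` is a hypothetical dict mapping a channel index to a
    list of (points, filled) pairs, where `points` is an (M, 2) array of BEV
    grid coordinates, `filled=True` for drivable-surface polygons (roads,
    intersections) and `filled=False` for lane-boundary poly-lines.
    Covered pixels get +1 and everything else -1, matching the binary
    (1 / -1) encoding described above.
    """
    map_tensor = -np.ones((17, H, W), dtype=np.float32)
    for channel, elements in elements_by_channel.items():
        mask = np.zeros((H, W), dtype=np.uint8)
        for points, filled in elements:
            pts = points.astype(np.int32).reshape(-1, 1, 2)
            if filled:
                cv2.fillPoly(mask, [pts], 1)
            else:
                cv2.polylines(mask, [pts], isClosed=False, color=1)
        map_tensor[channel][mask > 0] = 1.0
    return map_tensor
```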

Output parametrization

Trajectory regression:
For each detected vehicle, we parametrize its trajectory as a sequence of bounding boxes, including current and future locations.
High level actions:
We frame the discrete intention prediction problem as a multi-class classification with 8 classes: keep lane, turn left, turn right, left lane change, right lane change, stopping/stopped, parked, and other, where other covers any remaining action such as reverse driving.
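For reference, the eight intention classes written out as a small enum; the identifier names and ordering are my own shorthand, only the class set itself comes from the text.

```python
from enum import IntEnum

class Intention(IntEnum):
    KEEP_LANE = 0
    TURN_LEFT = 1
    TURN_RIGHT = 2
    LEFT_LANE_CHANGE = 3
    RIGHT_LANE_CHANGE = 4
    STOPPING_STOPPED = 5
    PARKED = 6
    OTHER = 7  # e.g. reverse driving
```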

Network architecture

IntentNet exploits a late fusion of LiDAR and map information through an architecture consisting of a two-stream backbone network and three task-specific branches on top.


Backbone network: we exploit residual connections

Header network:
The detection branch outputs two scores for each anchor box. An anchor is a predefined bounding box with orientation that serves as a prior for detection. We use multiple anchors at each feature map location.

The intention network performs a multi-class classification over the set of high-level actions, assigning a calibrated probability to the 8 possible behaviors at each feature map location. The discrete intention scores are in turn fed into an embedding convolutional layer to provide extra features that condition the motion estimation.

The motion estimation branch receives the concatenation of the shared features and the embedding from the high level action scores, and outputs the predicted trajectories for each anchor box at each feature map location.
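To make the data flow through the two streams and the three branches concrete, here is a minimal PyTorch-style sketch. The channel counts, layer depths, strides, number of anchors, prediction horizon and the number of regression targets per box are placeholders rather than the paper's configuration, and the residual connections of the real backbone are omitted for brevity.

```python
import torch
import torch.nn as nn

class IntentNetSketch(nn.Module):
    """Illustrative two-stream backbone + three task heads (not the real config)."""

    def __init__(self, lidar_ch=33, map_ch=17, feat_ch=128,
                 num_anchors=6, num_intents=8, horizon=10):
        super().__init__()
        # Two-stream backbone: one CNN per input modality (late fusion).
        self.lidar_stream = self._stream(lidar_ch, feat_ch)
        self.map_stream = self._stream(map_ch, feat_ch)

        # Per-anchor detection scores: vehicle vs. background.
        self.det_head = nn.Conv2d(2 * feat_ch, num_anchors * 2, 1)
        # Per-location high-level action scores (8 classes).
        self.intent_head = nn.Conv2d(2 * feat_ch, num_intents, 1)
        # Embedding of the intention scores, fed back into motion estimation.
        self.intent_embed = nn.Conv2d(num_intents, 32, 1)
        # Per-anchor bounding boxes for the current + future time steps
        # (6 regression targets per box is an assumption).
        self.motion_head = nn.Conv2d(2 * feat_ch + 32,
                                     num_anchors * (horizon + 1) * 6, 1)

    @staticmethod
    def _stream(in_ch, out_ch):
        return nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )

    def forward(self, lidar_bev, map_bev):
        feats = torch.cat([self.lidar_stream(lidar_bev),
                           self.map_stream(map_bev)], dim=1)   # late fusion
        det = self.det_head(feats)
        intent = self.intent_head(feats)
        motion = self.motion_head(
            torch.cat([feats, self.intent_embed(intent)], dim=1))
        return det, intent, motion
```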

Loss Function

Our model is fully differentiable and thus can be trained end-to-end through back-propagation.
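The overall objective referred to in the next sentence is not reproduced in these notes; based on the description, a plausible form (the relative weighting of the three terms is an assumption) is:

$$
\mathcal{L}(\theta) = \mathcal{L}_{det}(\theta) + \mathcal{L}_{int}(\theta) + \sum_{t \ge 0} \lambda^{t}\, \mathcal{L}^{t}_{reg}(\theta)
$$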

where t = 0 is the current frame and t > 0 the future, θ denotes the model parameters, and λ is a temporal discount factor that ensures distant times into the future do not dominate the loss, as they are more difficult to predict.

Detection:
We define a binary focal loss, computed over all feature map locations and predefined anchor boxes, with ground truth assigned using the matching strategy proposed in SSD.
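The formula itself is missing from these notes; up to notation, the binary focal loss summed over locations and anchors has the standard form (whether an additional α-balancing term is used here is not stated):

$$
\mathcal{L}_{det}(\theta) = -\sum_{i,j,k} \left(1 - \hat{p}_{i,j,k;\theta}\right)^{\gamma} \log \hat{p}_{i,j,k;\theta},
\qquad
\hat{p}_{i,j,k;\theta} =
\begin{cases}
p_{i,j,k;\theta} & \text{if } q_{i,j,k} = 1 \\
1 - p_{i,j,k;\theta} & \text{otherwise}
\end{cases}
$$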

where $i, j$ are the location indices on the feature map and $k$ is the index over the predefined set of anchor boxes; $q_{i,j,k}$ is the true class label and $p_{i,j,k;\theta}$ the predicted probability.

Trajectory regression:
Objects are represented as bounding boxes, so the trajectory regression targets are defined in terms of the bounding box parameters.

We apply a weighted smooth L1 loss to the regression targets, as shown below:
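The equation is not reproduced here; reconstructed from the variable definitions that follow, it takes the usual weighted smooth-L1 form:

$$
\mathcal{L}^{t}_{reg}(\theta) = \sum_{r} \chi_{r}\, \operatorname{smooth}_{L1}\!\left(x^{t}_{r;\theta} - y^{t}_{r}\right),
\qquad
\operatorname{smooth}_{L1}(d) =
\begin{cases}
0.5\, d^{2} & \text{if } |d| < 1 \\
|d| - 0.5 & \text{otherwise}
\end{cases}
$$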

where $t$ is the prediction time step, $x^{t}_{r;\theta}$ refers to the predicted value of the $r$-th regression target, $y^{t}_{r}$ is its ground truth value, and $\chi_{r}$ is the weight assigned to the $r$-th regression target.

Intention prediction:
We employ a cross entropy loss over the set of high level actions. To address the high imbalance in the intention distribution, we downsample the dominant classes keep lane, stopping/stopped and parked by 95%.
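Written out, this term is an ordinary cross entropy per feature-map location over the 8 classes (the exact summation and normalization are assumptions):

$$
\mathcal{L}_{int}(\theta) = -\sum_{i,j} \sum_{c=1}^{8} q^{int}_{i,j,c} \log p^{int}_{i,j,c;\theta}
$$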

Referenced techniques

  • focal loss

Source: https://blog.csdn.net/qq_34199326/article/details/83824778

The motivation for focal loss is to let a one-stage detector reach the accuracy of a two-stage detector without sacrificing its speed.
Looking for the reason why one-stage detectors are less accurate than two-stage detectors, the authors attribute it to class imbalance among the training samples, which has the following consequences:
(1) training is inefficient as most locations are easy negatives that contribute no useful learning signal;
(2) en masse, the easy negatives can overwhelm training and lead to degenerate models.
The negatives are so numerous that they account for most of the total loss, and most of them are easy to classify, so optimization is pulled away from the direction we actually want. OHEM increases the weight of misclassified samples, but it ignores the easily classified ones entirely.
Focal loss is a modification of the standard cross-entropy loss: by down-weighting easily classified samples, it makes the model focus on hard samples during training.
Formally, the original classification loss is the direct sum of the cross entropy over all training samples, i.e., every sample carries the same weight.
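For the binary case, the standard cross entropy from the focal loss paper is:

$$
\mathrm{CE}(p, y) =
\begin{cases}
-\log(p) & \text{if } y = 1 \\
-\log(1 - p) & \text{otherwise}
\end{cases}
$$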

Taking binary classification as an example, $p$ denotes the predicted probability that the sample belongs to class 1, and $p_t$ is substituted for $p$.
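Following the focal loss paper's notation, $p_t$ is defined as:

$$
p_t =
\begin{cases}
p & \text{if } y = 1 \\
1 - p & \text{otherwise}
\end{cases}
$$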

This simplifies the loss to the expression below.
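Written with $p_t$, the cross entropy becomes:

$$
\mathrm{CE}(p_t) = -\log(p_t)
$$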

A common remedy is to weight positive and negative samples: a factor α controls the relative contribution of the two classes to the total loss, and the weight for the over-represented negatives is set lower. Formally:
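This is equation 3 of the focal loss paper, with $\alpha_t$ defined analogously to $p_t$ ($\alpha$ for the positive class, $1 - \alpha$ for the negative):

$$
\mathrm{CE}(p_t) = -\alpha_t \log(p_t)
$$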

Equation 3 above can balance the weights of positive and negative samples, but it cannot control the weights of easy versus hard samples, which is what leads to the focal loss:
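This is equation 4 of the focal loss paper:

$$
\mathrm{FL}(p_t) = -(1 - p_t)^{\gamma} \log(p_t)
$$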

Here γ is called the focusing parameter, with γ ≥ 0. The modulating factor $(1 - p_t)^{\gamma}$ reduces the weight of easily classified samples so that training concentrates on the hard ones.
The two properties of focal loss are its core: in essence it uses a suitable function to measure how much easy and hard samples each contribute to the total loss.
In the experiments the authors actually use the focal loss of equation 5, which combines equations 3 and 4 and thus adjusts both the positive/negative balance and the easy/hard balance:
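Equation 5, the α-balanced focal loss:

$$
\mathrm{FL}(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t)
$$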

  • hard negative mining
    Source: https://www.zhihu.com/question/46292829
    Like non-maximum suppression (NMS), hard negative mining targets two perennial problems in object detection: sample imbalance and low recall, as follows:

    To improve recall, the intuitive approach is "better to wrongly flag a thousand than to miss a single one". Detectors therefore generate far more region proposals than there are actual objects (the anchors of one-stage methods such as SSD can also be seen as region proposals). This creates a problem: with so many proposals, the vast majority of training samples are negatives, and the gradients from these mostly meaningless negatives drown out those from the meaningful positives.
    Solutions are discussed in the following papers:
    1. Object detection with discriminatively trained part-based models
    2. Example-based learning for view-based human face detection

    Bootstrapping methods train a model with an initial subset of negative examples, and then collect negative examples that are incorrectly classified by this initial model to form a set of hard negatives. A new model is trained with the hard negative examples, and the process may be repeated a few times.
    we use the following “bootstrap” strategy that incrementally selects only those “nonface” patterns with high utility value:

    1. Start with a small set of “nonface” examples in the training database.
    2. Train the MLP classifier with the current database of examples.
    3. Run the face detector on a sequence of random images. Collect all the "nonface" patterns that the current system wrongly classifies as "faces" (see Fig. 5b). Add these "nonface" patterns to the training database as new negative examples.
    4. Return to Step 2.

R-CNN's hard negative mining is like keeping a book of wrong answers for the model: in every training round it keeps "recording mistakes" and adds them to the next round of training, until the network's performance stops improving.
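A minimal Python sketch of the bootstrapping loop in the four steps above; `train_fn`, `predict_fn` and the data containers are hypothetical stand-ins rather than any particular library's API, and the candidate images are assumed to contain no true objects (as in the random-image setting of the quoted procedure).

```python
def hard_negative_mining(model, positives, initial_negatives, candidate_images,
                         train_fn, predict_fn, rounds=3, score_thresh=0.5):
    """Bootstrapping: iteratively grow the negative set with false positives.

    train_fn(model, positives, negatives) trains the model in place;
    predict_fn(model, image) is assumed to yield (patch, score) pairs for
    the model's region proposals on that image.
    """
    negatives = list(initial_negatives)        # Step 1: small initial negative set
    for _ in range(rounds):
        train_fn(model, positives, negatives)  # Step 2: train on the current database
        hard = []
        for image in candidate_images:         # Step 3: run the detector on new images
            for patch, score in predict_fn(model, image):
                # Images contain no true objects, so every confident
                # detection is a false positive, i.e. a hard negative.
                if score > score_thresh:
                    hard.append(patch)
        if not hard:                           # stop once no new mistakes are collected
            break
        negatives.extend(hard)                 # add the "wrong answers" to the database
    return model                               # Step 4: loop back to training
```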
