Motivation

In this paper we develop a one-stage detector and forecaster that exploits both 3D point clouds produced by a LiDAR sensor as well as dynamic maps of the environment.
Specifically, we exploit 3D LiDAR point clouds and dynamic HD maps containing semantic elements such as lanes, intersections and traffic lights.

Our IntentNet is a fully convolutional neural network that outputs three types of variables in a single forward pass: detection scores for the vehicle and background classes, high-level action probabilities corresponding to the discrete intention, and bounding box regressions for the current and future time steps that represent the intended trajectory.
IntentNet is inspired by FaF [4], which performs joint detection and future prediction.

Advantages

(i) a more suitable architecture based on an early fusion of a larger number of previous LiDAR sweeps,
(ii) a parametrization of the map that allows our model to understand traffic constraints for all vehicles at once, and
(iii) an improved loss function that includes a temporal discount factor to account for the inherent ambiguity of the future.

Input parametrization

3D point cloud: we represent point clouds in bird’s eye view (BEV) as a 3D tensor, treating height as our channel dimension.
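As a concrete illustration of this input tensor, here is a minimal NumPy sketch of BEV voxelization with height as the channel dimension. The grid extents, resolution, number of height bins, and the binary occupancy encoding are illustrative assumptions, not values from the paper.

```python
import numpy as np

def voxelize_bev(points, x_range=(-40.0, 40.0), y_range=(-40.0, 40.0),
                 z_range=(-2.0, 3.5), resolution=0.2, z_bins=11):
    """Rasterize an (N, 3) LiDAR point cloud into a BEV occupancy tensor.

    Height is discretized into `z_bins` slices and used as the channel
    dimension, as described in the text. Grid extents and resolution here
    are placeholders for illustration only.
    """
    H = int((x_range[1] - x_range[0]) / resolution)
    W = int((y_range[1] - y_range[0]) / resolution)
    bev = np.zeros((z_bins, H, W), dtype=np.float32)

    # Map metric coordinates to integer grid / height-bin indices.
    xi = ((points[:, 0] - x_range[0]) / resolution).astype(np.int64)
    yi = ((points[:, 1] - y_range[0]) / resolution).astype(np.int64)
    zi = ((points[:, 2] - z_range[0]) / (z_range[1] - z_range[0]) * z_bins).astype(np.int64)

    # Keep only points that fall inside the grid.
    valid = (xi >= 0) & (xi < H) & (yi >= 0) & (yi < W) & (zi >= 0) & (zi < z_bins)
    bev[zi[valid], xi[valid], yi[valid]] = 1.0  # binary occupancy per height slice
    return bev
```

The early fusion of several previous sweeps mentioned above would presumably be handled by computing one such tensor per sweep and stacking them along the channel dimension.
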
Dynamic maps:
We form a BEV representation of our maps by rasterization.
We represent each semantic component in the map with a binary mask (i.e., 1 or -1). Roads and intersections are represented as filled polygons covering the whole drivable surface. Lane boundaries are parametrized as poly-lines representing the left and right boundaries of lane segments.
In total, 17 binary masks are used as map features, resulting in a 3D tensor that represents the map.
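A minimal sketch of how such a map tensor could be assembled, assuming the map elements have already been projected into BEV grid coordinates; the helper names, the use of OpenCV for rasterization, and the channel assignment are my own assumptions, not the paper's pipeline.

```python
import numpy as np
import cv2  # used here only to rasterize polygons and poly-lines

def rasterize_map(elements_by_channel, H, W):
    """Stack per-element binary masks into a (17, H, W) map tensor.

    `elements_by_channel` is a hypothetical dict mapping a channel index to a
    list of (points, filled) pairs, where `points` is an (M, 2) array of BEV
    grid coordinates, `filled=True` for drivable-surface polygons (roads,
    intersections) and `filled=False` for lane-boundary poly-lines.
    Covered pixels get +1 and everything else -1, matching the binary
    (1 / -1) encoding described above.
    """
    map_tensor = -np.ones((17, H, W), dtype=np.float32)
    for channel, elements in elements_by_channel.items():
        mask = np.zeros((H, W), dtype=np.uint8)
        for points, filled in elements:
            pts = points.astype(np.int32).reshape(-1, 1, 2)
            if filled:
                cv2.fillPoly(mask, [pts], 1)
            else:
                cv2.polylines(mask, [pts], isClosed=False, color=1)
        map_tensor[channel][mask > 0] = 1.0
    return map_tensor
```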

Output parametrization

Trajectory regression:
For each detected vehicle, we parametrize its trajectory as a sequence of bounding boxes, including current and future locations.
High level actions:
We frame the discrete intention prediction problem as a multi-class classification with 8 classes: keep lane, turn left, turn right, left lane change, right lane change, stopping/stopped, parked, and other, where other covers any remaining action such as reverse driving.
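For reference, the eight intention classes written out as a small enum; the identifier names and ordering are my own shorthand, only the class set itself comes from the text.

```python
from enum import IntEnum

class Intention(IntEnum):
    KEEP_LANE = 0
    TURN_LEFT = 1
    TURN_RIGHT = 2
    LEFT_LANE_CHANGE = 3
    RIGHT_LANE_CHANGE = 4
    STOPPING_STOPPED = 5
    PARKED = 6
    OTHER = 7  # e.g. reverse driving
```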

Network architecture

IntentNet exploits a late fusion of LiDAR and map information through an architecture consisting of a two-stream backbone network and three task-specific branches on top.


Backbone network: we exploit residual connections

Header network:
The detection branch outputs two scores for each anchor box. An anchor is a predefined bounding box with orientation that serves as a prior for detection. We use multiple anchors at each feature map location.

The intention network performs a multi-class classification over the set of high-level actions, assigning a calibrated probability to the 8 possible behaviors at each feature map location. The discrete intention scores are in turn fed into an embedding convolutional layer to provide extra features that condition the motion estimation.

The motion estimation branch receives the concatenation of the shared features and the embedding from the high level action scores, and outputs the predicted trajectories for each anchor box at each feature map location.
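To make the data flow through the two streams and the three branches concrete, here is a minimal PyTorch-style sketch. The channel counts, layer depths, strides, number of anchors, prediction horizon and the number of regression targets per box are placeholders rather than the paper's configuration, and the residual connections of the real backbone are omitted for brevity.

```python
import torch
import torch.nn as nn

class IntentNetSketch(nn.Module):
    """Illustrative two-stream backbone + three task heads (not the real config)."""

    def __init__(self, lidar_ch=33, map_ch=17, feat_ch=128,
                 num_anchors=6, num_intents=8, horizon=10):
        super().__init__()
        # Two-stream backbone: one CNN per input modality (late fusion).
        self.lidar_stream = self._stream(lidar_ch, feat_ch)
        self.map_stream = self._stream(map_ch, feat_ch)

        # Per-anchor detection scores: vehicle vs. background.
        self.det_head = nn.Conv2d(2 * feat_ch, num_anchors * 2, 1)
        # Per-location high-level action scores (8 classes).
        self.intent_head = nn.Conv2d(2 * feat_ch, num_intents, 1)
        # Embedding of the intention scores, fed back into motion estimation.
        self.intent_embed = nn.Conv2d(num_intents, 32, 1)
        # Per-anchor bounding boxes for the current + future time steps
        # (6 regression targets per box is an assumption).
        self.motion_head = nn.Conv2d(2 * feat_ch + 32,
                                     num_anchors * (horizon + 1) * 6, 1)

    @staticmethod
    def _stream(in_ch, out_ch):
        return nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )

    def forward(self, lidar_bev, map_bev):
        feats = torch.cat([self.lidar_stream(lidar_bev),
                           self.map_stream(map_bev)], dim=1)   # late fusion
        det = self.det_head(feats)
        intent = self.intent_head(feats)
        motion = self.motion_head(
            torch.cat([feats, self.intent_embed(intent)], dim=1))
        return det, intent, motion
```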

Loss Function

Our model is fully differentiable and thus can be trained end-to-end through back-propagation.
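The overall objective referred to in the next sentence is not reproduced in these notes; based on the description, a plausible form (the relative weighting of the three terms is an assumption) is:

$$
\mathcal{L}(\theta) = \mathcal{L}_{det}(\theta) + \mathcal{L}_{int}(\theta) + \sum_{t \ge 0} \lambda^{t}\, \mathcal{L}^{t}_{reg}(\theta)
$$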

where t = 0 is the current frame and t > 0 the future, θ denotes the model parameters, and λ is a temporal discount factor that ensures distant times into the future do not dominate the loss, as they are more difficult to predict.

Detection:
We define a binary focal loss, computed over all feature map locations and predefined anchor boxes, with ground truth assigned using the matching strategy proposed in SSD.
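The formula itself is missing from these notes; up to notation, the binary focal loss summed over locations and anchors has the standard form (whether an additional α-balancing term is used here is not stated):

$$
\mathcal{L}_{det}(\theta) = -\sum_{i,j,k} \left(1 - \hat{p}_{i,j,k;\theta}\right)^{\gamma} \log \hat{p}_{i,j,k;\theta},
\qquad
\hat{p}_{i,j,k;\theta} =
\begin{cases}
p_{i,j,k;\theta} & \text{if } q_{i,j,k} = 1 \\
1 - p_{i,j,k;\theta} & \text{otherwise}
\end{cases}
$$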

where $i, j$ are the location indices on the feature map and $k$ is the index over the predefined set of anchor boxes; $q_{i,j,k}$ is the true class label and $p_{i,j,k;\theta}$ the predicted probability.

Trajectory regression:
Objects are represented as bounding boxes, so the trajectory regression targets are defined in terms of the bounding box parameters.

We apply a weighted smooth L1 loss to the regression targets, as shown below:
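The equation is not reproduced here; reconstructed from the variable definitions that follow, it takes the usual weighted smooth-L1 form:

$$
\mathcal{L}^{t}_{reg}(\theta) = \sum_{r} \chi_{r}\, \operatorname{smooth}_{L1}\!\left(x^{t}_{r;\theta} - y^{t}_{r}\right),
\qquad
\operatorname{smooth}_{L1}(d) =
\begin{cases}
0.5\, d^{2} & \text{if } |d| < 1 \\
|d| - 0.5 & \text{otherwise}
\end{cases}
$$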

where $t$ is the prediction time step, $x^{t}_{r;\theta}$ refers to the predicted value of the $r$-th regression target, $y^{t}_{r}$ is its ground truth value, and $\chi_{r}$ is the weight assigned to the $r$-th regression target.

Intention prediction:
We employ a cross entropy loss over the set of high level actions. To address the high imbalance in the intention distribution, we downsample the dominant classes keep lane, stopping/stopped and parked by 95%.
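Written out, this term is an ordinary cross entropy per feature-map location over the 8 classes (the exact summation and normalization are assumptions):

$$
\mathcal{L}_{int}(\theta) = -\sum_{i,j} \sum_{c=1}^{8} q^{int}_{i,j,c} \log p^{int}_{i,j,c;\theta}
$$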

Referenced techniques

  • focal loss

Source: https://blog.csdn.net/qq_34199326/article/details/83824778

The motivation for focal loss is to let a one-stage detector reach the accuracy of a two-stage detector without sacrificing its speed.
Looking for the reason why one-stage detectors are less accurate than two-stage detectors, the authors attribute it to class imbalance among the training samples, which has the following consequences:
(1) training is inefficient as most locations are easy negatives that contribute no useful learning signal;
(2) en masse, the easy negatives can overwhelm training and lead to degenerate models.
The negatives are so numerous that they account for most of the total loss, and most of them are easy to classify, so optimization is pulled away from the direction we actually want. OHEM increases the weight of misclassified samples, but it ignores the easily classified ones entirely.
Focal loss is a modification of the standard cross-entropy loss: by down-weighting easily classified samples, it makes the model focus on hard samples during training.
Formally, the original classification loss is the direct sum of the cross entropy over all training samples, i.e., every sample carries the same weight.
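For the binary case, the standard cross entropy from the focal loss paper is:

$$
\mathrm{CE}(p, y) =
\begin{cases}
-\log(p) & \text{if } y = 1 \\
-\log(1 - p) & \text{otherwise}
\end{cases}
$$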

Taking binary classification as an example, $p$ denotes the predicted probability that the sample belongs to class 1, and $p_t$ is substituted for $p$.
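Following the focal loss paper's notation, $p_t$ is defined as:

$$
p_t =
\begin{cases}
p & \text{if } y = 1 \\
1 - p & \text{otherwise}
\end{cases}
$$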

This simplifies the loss to the expression below.
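Written with $p_t$, the cross entropy becomes:

$$
\mathrm{CE}(p_t) = -\log(p_t)
$$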

A common remedy is to weight positive and negative samples: a factor α controls the relative contribution of the two classes to the total loss, and the weight for the over-represented negatives is set lower. Formally:
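This is equation 3 of the focal loss paper, with $\alpha_t$ defined analogously to $p_t$ ($\alpha$ for the positive class, $1 - \alpha$ for the negative):

$$
\mathrm{CE}(p_t) = -\alpha_t \log(p_t)
$$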

Equation 3 above can balance the weights of positive and negative samples, but it cannot control the weights of easy versus hard samples, which is what leads to the focal loss:
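This is equation 4 of the focal loss paper:

$$
\mathrm{FL}(p_t) = -(1 - p_t)^{\gamma} \log(p_t)
$$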

Here γ is called the focusing parameter, with γ ≥ 0. The modulating factor $(1 - p_t)^{\gamma}$ reduces the weight of easily classified samples so that training concentrates on the hard ones.
The two properties of focal loss are its core: in essence it uses a suitable function to measure how much easy and hard samples each contribute to the total loss.
In the experiments the authors actually use the focal loss of equation 5, which combines equations 3 and 4 and thus adjusts both the positive/negative balance and the easy/hard balance:
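Equation 5, the α-balanced focal loss:

$$
\mathrm{FL}(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t)
$$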

  • hard negative mining
    Source: https://www.zhihu.com/question/46292829
    Like non-maximum suppression (NMS), hard negative mining targets two perennial problems in object detection: sample imbalance and low recall, as follows:

    To improve recall, the intuitive approach is "better to wrongly flag a thousand than to miss a single one". Detectors therefore generate far more region proposals than there are actual objects (the anchors of one-stage methods such as SSD can also be seen as region proposals). This creates a problem: with so many proposals, the vast majority of training samples are negatives, and the gradients from these mostly meaningless negatives drown out those from the meaningful positives.
    Solutions are discussed in the following papers:
    1. Object detection with discriminatively trained part-based models
    2. Example-based learning for view-based human face detection

    Bootstrapping methods train a model with an initial subset of negative examples, and then collect negative examples that are incorrectly classified by this initial model to form a set of hard negatives. A new model is trained with the hard negative examples, and the process may be repeated a few times.
    we use the following “bootstrap” strategy that incrementally selects only those “nonface” patterns with high utility value:

    1. Start with a small set of “nonface” examples in the training database.
    2. Train the MLP classifier with the current database of examples.
    3. Run the face detector on a sequence of random images. Collect all the "nonface" patterns that the current system wrongly classifies as "faces" (see Fig. 5b). Add these "nonface" patterns to the training database as new negative examples.
    4. Return to Step 2.

R-CNN's hard negative mining is like keeping a book of wrong answers for the model: in every training round it keeps "recording mistakes" and adds them to the next round of training, until the network's performance stops improving.
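A minimal Python sketch of the bootstrapping loop in the four steps above; `train_fn`, `predict_fn` and the data containers are hypothetical stand-ins rather than any particular library's API, and the candidate images are assumed to contain no true objects (as in the random-image setting of the quoted procedure).

```python
def hard_negative_mining(model, positives, initial_negatives, candidate_images,
                         train_fn, predict_fn, rounds=3, score_thresh=0.5):
    """Bootstrapping: iteratively grow the negative set with false positives.

    train_fn(model, positives, negatives) trains the model in place;
    predict_fn(model, image) is assumed to yield (patch, score) pairs for
    the model's region proposals on that image.
    """
    negatives = list(initial_negatives)        # Step 1: small initial negative set
    for _ in range(rounds):
        train_fn(model, positives, negatives)  # Step 2: train on the current database
        hard = []
        for image in candidate_images:         # Step 3: run the detector on new images
            for patch, score in predict_fn(model, image):
                # Images contain no true objects, so every confident
                # detection is a false positive, i.e. a hard negative.
                if score > score_thresh:
                    hard.append(patch)
        if not hard:                           # stop once no new mistakes are collected
            break
        negatives.extend(hard)                 # add the "wrong answers" to the database
    return model                               # Step 4: loop back to training
```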
