Motivation:

As the scenario becomes more complicated, tuning to improve the motion planner performance becomes increasingly difficult. To systematically solve this issue, we develop a data-driven auto-tuning framework based on the Apollo autonomous driving framework.
【CC】In short, as scenarios get more and more complex, the only systematic way to solve this problem is a data-driven approach.

Third, the expert driving data and information about the surrounding environment are collected and automatically labeled.
【CC】So this framework can label data automatically? The whole approach is oriented toward automation.

Typically, two major approaches are used to develop such a map: learning via demonstration (imitation learning) or through optimizing the current reward/cost functional.
【CC】Background: the two typical ways to build a motion planner are imitation learning and the optimization route.

In an imitation learning system, the state-to-action mapping is learned directly from expert demonstrations; a multimodal distribution loss function is necessary but slows down the training process.
【CC】The typical idea of imitation learning is to build a state-to-action mapping (a distribution) from data, which generally trains slowly.

When optimizing through a reward functional, the reward/cost functional is typically either provided by an expert or learned from data via inverse reinforcement learning (IRL).
【cc】For the optimization route, the cost function is either defined by experts or learned via IRL; this paper takes the IRL route.

Expert driving data from different scenarios are easy to collect but extremely difficult to reproduce in simulation, since the ego car requires interaction with the surrounding environment.
【cc】The pain point of data collection: data are easy to gather but hard to reproduce in simulation, because the ego car interacts with the environment.

We build an auto-tuning system that includes both online trajectory optimization and offline parameter tuning.
【cc】Trajectory optimization runs online, parameter tuning runs offline; according to the figure below, the trained cost function/parameters are fed back into the online system. However, the data flow is not shown, so it is unclear whether any data twinning is done.

Our motion planner module is not tied to a specific approach.
【CC】Given the Apollo framework, my guess is that the motion planner may come not only in an optimization-based version but possibly also in an imitation-learning version; what matters is to "define" or "learn" a cost function good enough to evaluate the optimized/generated results, which is the focus of the IRL work in this paper.

The performance of these motion planners is evaluated with metrics that quantify both optimality and robustness. The optimality of the online part can be measured by the difference between the reward functional values of the optimal trajectory and the generated trajectory, and the robustness can be measured by the variance of the generated trajectory behavior in a given scenario.
【CC】The optimality metric measures the difference between the optimal trajectory and the generated trajectory, which is a bit puzzling. The robustness metric looks at the variance of the output trajectories for a given scenario (how the distance is defined is not stated), which is easier to understand: for the same scenario the trajectory variance should not fluctuate too much, otherwise robustness is clearly a problem.
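
A hedged formalization of these two metrics (the exact distance and variance used in the paper are not specified in this note; the notation below is assumed):

$$ \text{optimality}(s_0) = \big|\, V_r(\xi^{*}) - V_r(\xi_{\text{gen}}) \,\big|, \qquad \text{robustness}(s_0) = \operatorname{Var}\big[\, \xi_{\text{gen}} \mid s_0 \,\big] $$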

Algorithm tuning loop for the motion planner in the Apollo autonomous driving platform
【CC】There is not much to say about this tuning workflow; the key question is which steps are actually automated.

For the offline tuning module, we focus on providing a reward/cost functional that can adapt to different driving scenarios. Furthermore, the features that are used to build the reward functional may be collinear, which may also impact the stability of the tuned motion planner.
【CC】Could different scenarios use different evaluation functions, chosen dynamically? Going further, when a new cost function appears during tuning, how do we judge whether it is any good? Adversarially, GAN-style? The text only says this affects the stability of the motion planner, which is probably an understatement; something like model collapse could occur. Note two points here: because the features are collinear, no higher-order terms are introduced and the problem is solved in a linear objective space (is there a theorem guaranteeing a stable optimal or sub-optimal solution, e.g. via Lyapunov stability?). Thinking back, the optimality metric is the deviation between trajectories, so what does that mean mathematically? That after adjusting the weights of the different basis directions of the objective space, the optima produced by different generators stay relatively concentrated? Otherwise the optima the generators produce are fake. This is actually answered later: the expert demonstration is taken as the ground truth.

Main idea:

An MDP is defined by a set of states S, a set of transition actions A, and transition probabilities T = P(s_{t+1}|s_t, a); a reward functional r ∈ R is defined as a mapping S → R, and a policy π ∈ Π is defined as a map from a state to an action distribution π(a_t|s_t). Reinforcement learning aims to find the policy that optimizes the accumulated reward function.
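
The objective itself is not reproduced in this note; a reconstruction from the surrounding definitions, assuming the standard formulation:

$$ \pi^{*} = \arg\max_{\pi \in \Pi} \; \mathbb{E}_{s_0 \sim D}\big[ V_{\pi}(s_0) \big] $$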

where s_0 ∼ D is the initial state that follows a predefined distribution D. E_{s_0∼D}(V_π(s_0)) is defined as E[∑_t γ^t r(s_t, a_t)] with a_t ∼ π(·|s_t), s_{t+1} ∼ P(·|s_t, a_t), where γ^t is a time discount factor.
【CC】The standard formalization of RL: a Markov decision process with a state set S, an action set A, transition probabilities T (the probability of the next state given the current state and action), an environment reward/value function r/V, and a policy set Π (a policy gives, for the current state s_t, a distribution over which action a to pick from A). The goal of reinforcement learning is to find the π in Π that maximizes the expected return from the initial state, where the initial state follows a predefined distribution D (which is also chosen fairly arbitrarily).

Define the expert policy as π_E. The idea of IRL is to find the reward functional such that the expected value function is best for the expert demonstration.
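
Eq. 2 is not reproduced in this note; a reconstruction from the description (the reward is chosen so that the expert's expected value dominates that of the policy optimized under the same reward; notation assumed):

$$ \max_{r} \; \mathbb{E}_{s_0 \sim D}\big[ V_{\pi_E}(s_0) \big] - \mathbb{E}_{s_0 \sim D}\big[ V_{\hat{\pi}_r}(s_0) \big] $$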

where π̂_r is the estimated optimal policy generated by reinforcement learning under reward r.
【CC】My earlier worry about the "difference" was overthinking (I expected something more mathematically general). The definition here is very simple: π_E is taken as the optimum (the ground truth, driven by professional drivers). Intuitively, we look for an evaluation function r such that the larger the gap between the expert's expected value and the value of the best trajectory the optimizer/generator can produce, the better that evaluation function is. This has a GAN flavor: r plays the discriminator and π plays the generator.

Our idea for learning the reward functional includes two key parts: conditional comparison and rank-based learning.
Conditional comparison

Thus, instead of comparing the expectation of value functions of the expert demonstration and the optimal policy defined in Eq. 2, we compare the value functions state by state to measure the performance of a policy π_r under initial state s_0 given the reward function r(s,a).

【CC】Basically simplification idea #1 for computation/modeling: the expectation above contains an integral we cannot evaluate; the later formula replaces the expectation with a loss function L whose inputs are the values conditioned on the state s_0.
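
A hedged sketch of this state-by-state comparison, with L a generic loss whose exact form is not given in this note:

$$ \min_{r} \; \sum_{s_0 \in D} L\big( V_{\pi_E}(s_0),\; V_{\hat{\pi}_r}(s_0) \big) $$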

Rank-based learning
We sample random policies and compare them against the expert demonstration instead of generating the optimal policy first. Under each initial state s_0, which we define as a scenario, a set of random policies π_i, i = 1, 2, …, N is sampled. Our assumption is that the human demonstrations rank near the top of the distribution of policies conditional on initial state s_0 ∈ D on average. Additionally, since we generate random policies for comparison with the expert demonstration, the tuned reward functional can easily learn useful information from corner cases. Difficult scenarios can also be generated to train and test the robustness of the reward functional.

【cc】Simplification idea #2: instead of computing the generator's optimum (the max term in Eq. 2), sample and take a simple sum as the expectation. Sampling also has an engineering benefit: corner cases and hard cases can be injected to train the robustness of the cost function. There is an assumption here: the expert demonstration π_E sits near the peak/mean of the policy distribution under state s_0; intuitively, in state s_0 most people would drive this way, which is reasonable.
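
A minimal sketch of this rank-based comparison for a single scenario, assuming a linear reward V(ξ) = θ·f(ξ) and the leaky-ReLU-style penalty mentioned later in this note; the function and variable names are illustrative, not from the paper:

```python
import numpy as np

def rank_based_loss(theta, expert_feats, sampled_feats, slope=0.1):
    """Rank-based IRL loss for one scenario (one initial state s0).

    theta:         (K,) reward weights being tuned.
    expert_feats:  (K,) accumulated feature vector of the human demonstration.
    sampled_feats: (N, K) accumulated features of N randomly sampled trajectories.
    The expert should rank above every sample; a leaky penalty is applied
    when a sampled trajectory scores higher than the demonstration.
    """
    v_expert = expert_feats @ theta        # scalar value of the demonstration
    v_samples = sampled_feats @ theta      # (N,) values of the random samples
    gaps = v_samples - v_expert            # > 0 means a sample beats the expert
    penalties = np.where(gaps > 0, gaps, slope * gaps)   # leaky-ReLU shape
    return penalties.mean()
```

Averaging this over many scenarios (and deliberately injecting hard cases into the sampled set) would give the offline tuning objective sketched above.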

Background shifting problem
Based on the idea of the maximum margin, the goal is to find the direction that clearly separates the demonstrated trajectory from randomly generated ones.

R1 and R2 represent two features of the reward function. The output reward function is a linear combination of R1 and R2. Thus, the reward function can be seen as the direction that maximizes the margin between the pseudo demonstration RH(θ) and the randomly generated samples. The circle points in the top-left of the figure are 100 samples randomly generated from a Cauchy distribution. The top-right figure shifts both the random samples and the pseudo demonstration point RH by a fixed amount. The two red arrows represent the optimal directions for the top two frames. If the two frames are combined, then the optimal direction shifts to the black one, as shown in the bottom figure. However, the direction trained with the combined frame is not optimal in either of the top frames.
【CC】Essentially a multi-dimensional, multi-objective optimization problem, except the optimization direction is chosen so that the margin between the sampled trajectories and the expert trajectory is as large as possible. Here R is assumed to be linear, so in the plane it is a line, and the objective is that once that line is drawn the gap between R_H and the origin is as large as possible (quite similar to an SVM; if R were a higher-order function the separation could presumably be even cleaner).
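
A hedged formalization of this max-margin view, with θ the linear reward weights, R(ξ_H) the feature vector of the demonstration and R(ξ_{S,i}) those of the random samples (notation assumed, not from the paper):

$$ \max_{\theta,\ \|\theta\| = 1} \; \min_{i} \; \theta^{\top}\big( R(\xi_H) - R(\xi_{S,i}) \big) $$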

Network architecture:

The trajectory sampler uses the same strategy to generate candidate trajectories for both the offline and online modules.

【CC】It is not clear how trajectories are sampled in the online setting. Does the generator produce multiple candidate trajectories that are then simply sampled at random?

We define a trajectory under the MDP as ξ = (a_0, s_0, …, a_N, s_N) ∈ Ξ, where the space Ξ is the sampled trajectory space. A trajectory under initial state s_0 is evaluated by the value function.


We use f_j(a_t, s_t), j = 1, 2, …, K to represent the features given a current state and action. We choose the reward function R as a function of all features with parameter θ ∈ Ω:

Typically, R̃ can be as simple as a linear combination of features or a neural network with the features as input. We use ξ_H to represent the human expert demonstration trajectory and ξ_S to represent a randomly generated sample trajectory in Ξ.
【cc】The paper does not describe how S is represented, so we do not know how many dimensions S actually has! That matters for whether the inputs the reward function R needs later can be satisfied, especially once the whole feature representation is vectorized. How ξ is represented is in fact very important: it determines both the data R needs and how the representation layer works (is it purely structured data, IntentNet-style, or vectorized data, TNR-style, and how does it model interactions between the environment and objects?). R is a multi-input value function; each f corresponds to one input feature, and the concrete dimensions are hand-designed, intuitively the multiple objectives that have to be considered; the table below is an example from the paper. Is it really the f_i that are learned (softmax-style), or R itself? The paper does not say, only that either works.
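
The value-function and reward definitions referenced above are not reproduced in this note; a reconstruction under the stated symbols, with the linear case shown for R̃ (an assumption, not the paper's exact equations):

$$ V_{\theta}(\xi \mid s_0) = \sum_{t=0}^{N} R\big( f_1(a_t, s_t), \ldots, f_K(a_t, s_t);\ \theta \big), \qquad \tilde{R}_{\theta}(s_t, a_t) = \sum_{j=1}^{K} \theta_j\, f_j(a_t, s_t) $$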


Siamese network in RC-IRL. The value networks of both the human and the sampled trajectories share the same network parameter settings. The loss function evaluates the difference between the sampled data and the generated trajectory via the value network outputs
【CC】A standard siamese network architecture: shared parameters, with a difference as the objective. At first I thought L itself would be learned, but a leaky ReLU is simply used directly as the loss function.

The value network inside the siamese model is used to capture driving behavior based on encoded features. The network is a trainable linear combination of encoded rewards at different times t = t0, …, t17. The weight of each encoded reward is a learnable time decay factor. The encoded reward includes an input layer with 21 raw features and a hidden layer with 15 nodes to cover possible interactions. The parameters of the reward at different times share the same θ to maintain consistency.
【CC】The R given here is a simple linear combination. How the features are mapped is not explained clearly either: a 15-node hidden layer crudely handles the environment interaction, and the interaction itself is just a plain FC layer. It feels rather crude; one could borrow the Social LSTM idea of sharing an intermediate feature layer as the interaction representation, or, attention-style, use P/Q matrices to describe the interactions between nodes.
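
A minimal PyTorch sketch of this value network and the siamese loss, following the numbers quoted above (21 raw features, a 15-node hidden layer, 18 time steps t0…t17, shared θ across time, leaky-ReLU pairwise loss); the layer choices, nonlinearity, and the exponential form of the time decay are assumptions, not the paper's code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EncodedReward(nn.Module):
    """Per-time-step encoded reward: 21 raw features -> 15 hidden nodes -> scalar."""
    def __init__(self, n_features=21, n_hidden=15):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, n_hidden),
            nn.Tanh(),                      # assumed nonlinearity
            nn.Linear(n_hidden, 1),
        )

    def forward(self, x):                   # x: (batch, T, n_features)
        return self.net(x).squeeze(-1)      # (batch, T) per-step rewards

class ValueNet(nn.Module):
    """Time-decayed sum of encoded rewards over t = t0..t17; theta shared across t."""
    def __init__(self, n_features=21, n_hidden=15):
        super().__init__()
        self.reward = EncodedReward(n_features, n_hidden)
        self.log_decay = nn.Parameter(torch.zeros(()))   # learnable decay factor

    def forward(self, traj):                # traj: (batch, 18, 21)
        r = self.reward(traj)               # (batch, 18)
        t = torch.arange(r.shape[1], dtype=r.dtype, device=r.device)
        decay = F.softplus(self.log_decay)  # keep the decay rate positive
        return (r * torch.exp(-decay * t)).sum(dim=-1)   # (batch,) trajectory values

def siamese_loss(value_net, expert_traj, sampled_traj, slope=0.1):
    """Leaky-ReLU penalty whenever a sampled trajectory outscores the expert one."""
    v_h = value_net(expert_traj)            # same weights for both branches
    v_s = value_net(sampled_traj)
    return F.leaky_relu(v_s - v_h, negative_slope=slope).mean()
```

Assumed usage: each training step feeds one expert trajectory tensor of shape (1, 18, 21) and a batch of sampled trajectories of shape (N, 18, 21) from the same scenario through the shared ValueNet, then backpropagates siamese_loss into θ and the decay factor.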
