Background assumptions:

A fundamental aspect of future state prediction is that it is inherently stochastic, as agents cannot know each other’s motivations
【CC】The assumption here is that we cannot truly know the behavior of other agents. This differs from computing a Nash equilibrium in RL, where every agent is assumed to be rational and to pursue its own optimum.

We seek a model of the future that can provide both (1) a weighted, parsimonious set of discrete trajectories that covers the space of likely outcomes and (2) a closed-form evaluation of the likelihood of any trajectory
【CC】The model output is expected to (1) quantify the quality of predicted trajectories so that a small set of likely futures can be selected (otherwise the possibilities explode combinatorially), and (2) provide a closed-form likelihood for any trajectory.

Main idea:

MultiPath leverages a fixed set of future state-sequence anchors that correspond to modes of the trajectory distribution.
【CC】A fixed (predefined) set of state sequences (predefined trajectories) is used, and the distribution is then predicted relative to them; the overall idea is much like detection, where a fixed set of template anchor boxes is defined first and regression is done against them.

Our method is influenced heavily by the concept of predefined anchors, which have a rich history in machine learning applications to handle multi-modal problems
【CC】How the anchors (trajectories) are "predefined" has a significant impact.

MultiPath predicts a discrete distribution over the anchors and, for each anchor, regresses offsets from anchor waypoints along with uncertainties, yielding a Gaussian mixture at each time step.
【CC】For each agent (think of each as a vehicle), offsets relative to the fixed anchors (predefined trajectories) are regressed, with a GMM assumed as the prior; it feels like scattering points first and then regressing each of them.

The MultiPath model addresses these issues with a key insight: it employs a fixed set of trajectory anchors as the basis of our modeling
【CC】To avoid the combinatorial explosion of feasible future space-time, a fixed set of anchors (predefined trajectories) is introduced.

assume control uncertainty is normally distributed at each future time step, parameterized such that the mean corresponds to a context-specific offset from the anchor state, with the associated covariance capturing the unimodal aleatoric uncertainty
【CC】Control uncertainty is explained later: it describes the distribution of offsets from the anchor, assumed a priori to be Gaussian; its mean is a quantity tied to the current scene/context, and its covariance captures the unimodal (aleatoric) spread.

Our trajectory anchors are modes found in our training data in state-sequence space via unsupervised learning.
【CC】The anchors (predefined trajectories) are learned from the dataset in an unsupervised way (k-means).

Our complete model predicts a Gaussian mixture model (GMM) at each time step, with the mixture weights (intent distribution) fixed over time
【CC】The mixture weights of the GMM stay fixed over time; this is logically reasonable if you think of it as the agent's intended manoeuvre staying the same for a given scene throughout the horizon.

Given such a parametric distribution model, we can directly evaluate the likelihood of any future trajectory and also have a simple way to obtain a compact, diverse weighted set of trajectory samples: the MAP sample from each anchor-intent
【CC】Once the model is given, the likelihood of any trajectory can be computed directly, and the MAP (maximum a posteriori) sample per anchor intent can serve as the quantitative summary; this answers the opening question of how to quantify the quality of the output trajectories.

Formal description:

Given observations x in the form of past trajectories of all agents in a scene and possibly additional contextual information, MultiPath seeks to provide (1) a parametric distribution over future trajectories s: p(s|x), and (2) a compact weighted set of explicit trajectories which summarizes this distribution well.
【CC】Given the observation x, which contains all agents' past trajectories and the current environment (how the input is structured is described later), the model aims to (1) output the parameters of the future-trajectory distribution p(s|x) (i.e., fit the data distribution with a model) and (2) quantitatively assess the quality of predicted trajectories (as said earlier, via MAP).

Let t denote a discrete time step, and let s_t denote the state of an agent at time t; the future trajectory s = [s_1, . . . , s_T] is a sequence of states from t = 1 to a fixed time horizon T
【CC】s_t denotes the state of the agent at time t.

We factorize the notion of uncertainty into independent quantities. Intent uncertainty models uncertainty about the agent's latent coarse-scale intent or desired goal. Control uncertainty describes the uncertainty over the sequence of states the agent will follow to satisfy its intent. Both intent and control uncertainty depend on the past observations of static and dynamic world context x
【CC】Uncertainty is split into two kinds (i.e., the prediction involves two random variables whose distributions we want to fit): intent uncertainty and control uncertainty. The former is coarse-grained (e.g., turn left / keep following / turn right); the latter is the deviation of the state sequence conditioned on the former, itself a random variable. Both are taken to depend on the input x. My own view is that the latter should be a distribution conditioned on the former rather than an independent one; the paper presumably simplifies here.

We model a discrete set of intents as a set of K anchor trajectories A = {a^k}_{k=1..K}, where each anchor trajectory is a sequence of states: a^k = [a^k_1, . . . , a^k_T]. We model uncertainty over this discrete set of intents with a softmax distribution:

π(a^k | x) = exp(f_k(x)) / Σ_{k'} exp(f_{k'}(x))

where f_k(x) : R^{d(x)} → R is the output of a deep neural network.
【CC】The set A contains all predefined trajectories and a^k is the k-th one. The "probability" of observing the k-th anchor a^k given the input x is measured with a softmax; this π(a^k|x) is exactly the intent uncertainty above, i.e., the probability of picking a particular predefined trajectory. Note that f_k(x) is a function learned by the network: intent uncertainty is assumed to depend only on the input x, and since we do not know the form of f_k, it is learned by the NN. Also note there are K such functions, i.e., each predefined trajectory has its own mapping from x.
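As a minimal sketch of this step, assuming the network has already produced the K logits f_k(x) (array name `f_x` is illustrative, not from the paper):

```python
import numpy as np

def intent_distribution(f_x: np.ndarray) -> np.ndarray:
    """Softmax over the K anchor logits f_k(x), giving pi(a^k | x).

    f_x: shape (K,), one logit per predefined anchor trajectory.
    """
    z = f_x - f_x.max()   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()
```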

We make the simplifying assumption that uncertainty is unimodal given intent, and model control uncertainty as a Gaussian distribution dependent on each waypoint state of an anchor trajectory:

φ(s_t | a^k, x) = N(s_t | a^k_t + µ^k_t(x), Σ^k_t(x))

The Gaussian parameters µ^k_t and Σ^k_t are directly predicted by our model as a function of x for each time step of each anchor trajectory a^k_t. Note that in the Gaussian mean, a^k_t + µ^k_t, the term µ^k_t represents a scene-specific offset from the anchor state a^k_t; it can be thought of as modeling a scene-specific residual or error term on top of the prior anchor distribution
【CC】Control uncertainty is simply assumed to be Gaussian. This can be understood two ways: it is mathematically convenient, and since the coarse trajectory shape is already fixed, this step only refines it, so the result should not stray far from the anchor (i.e., from the mean a^k_t + µ^k_t). Here µ^k_t / Σ^k_t are treated as functions of x only and not of a^k; recalling the earlier discussion, this says control uncertainty is independent of intent uncertainty and depends only on x, hence my earlier doubt: intuitively the two should be related. If we wrote µ^k as µ^k(x, a^k) it would take two inputs, but since µ already comes in K variants (one per anchor) it is effectively class-specific anyway, so it may not matter much. The φ(s^k_t | a^k, x) here is simply the control-uncertainty density, direct enough.

The time-step distributions are assumed to be conditionally independent given an anchor, i.e., we write φ(s_t | ·) instead of φ(s_t | ·, s_{1:t−1}). This modeling assumption allows us to predict all time steps jointly with a single inference pass, making our model simple to train and efficient to evaluate. If desired, it is straightforward to add a conditional next-time-step dependency to our model, using a recurrent structure (RNN).
【CC】Time steps are assumed to be independently distributed. In plain terms, the prediction at time t has nothing to do with time t−1; intuitively that feels off, but it has the advantage of speed, since everything is computed in one pass. If that bothers you, it can be changed to an RNN that predicts time t from steps [1, t−1].

To obtain a distribution over the entire state space, we marginalize over agent intent:

p(s | x) = Σ_{k=1..K} π(a^k | x) Π_{t=1..T} N(s_t | a^k_t + µ^k_t(x), Σ^k_t(x))

Note that this yields a Gaussian Mixture Model distribution, with mixture weights fixed over all time steps.
【CC】Combining intent uncertainty and control uncertainty with the law of total probability gives a standard GMM form.
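A rough sketch of the closed-form likelihood evaluation this enables: given the model outputs `pi` (anchor probabilities), per-anchor per-step means `mu` (anchor waypoint plus predicted offset) and covariances `cov`, the log-likelihood of an arbitrary trajectory can be computed by marginalizing over intents. Array names and shapes are assumptions for illustration:

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def trajectory_log_likelihood(s, pi, mu, cov):
    """log p(s | x) for a single trajectory under the anchor GMM.

    s:   (T, 2)        trajectory to evaluate
    pi:  (K,)          intent distribution pi(a^k | x)
    mu:  (K, T, 2)     per-anchor means a^k_t + mu^k_t(x)
    cov: (K, T, 2, 2)  per-anchor, per-step covariances Sigma^k_t(x)
    """
    K, T = mu.shape[0], mu.shape[1]
    log_terms = np.empty(K)
    for k in range(K):
        # time steps are conditionally independent given the anchor
        log_phi = sum(
            multivariate_normal.logpdf(s[t], mean=mu[k, t], cov=cov[k, t])
            for t in range(T)
        )
        log_terms[k] = np.log(pi[k]) + log_phi
    return logsumexp(log_terms)  # marginalize over the K intents
```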

Main pipeline:


Figure 1: MultiPath estimates the distribution over future trajectories per agent in a scene, as follows: 1) Based on a top-down scene representation, the Scene CNN extracts mid-level features that encode the state of individual agents and their interactions. 2) For each agent in the scene, we crop an agent-centric view of the mid-level feature representation and predict the probabilities over the fixed set of K predefined anchor trajectories. 3) For each anchor, the model regresses offsets from the anchor states and uncertainty distributions for each future time step.

  • Input representation

We follow other recent approaches [2, 11] and represent a history of dynamic and static scene context as a 3-dimensional array of data rendered from a top-down orthographic perspective. The first two dimensions represent spatial locations in the top-down image. The channels in the depth dimension hold static and time-varying (dynamic) content of a fixed number of previous time steps.
【CC】The input representation follows IntentNet / ChauffeurNet: a 3-D array whose first two dimensions are spatial and whose depth dimension stores static and dynamic content over a fixed number of past frames. Incidentally, could the representation be learned directly instead, in the style of VectorNet? A schematic tensor layout is sketched below.
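A schematic of the top-down input tensor; the spatial resolution, channel layout and history length used here are illustrative assumptions, not the paper's numbers:

```python
import numpy as np

H, W = 400, 400   # top-down spatial grid (assumed resolution)
T_HIST = 5        # number of past frames kept (assumed)
STATIC_CH = 3     # e.g. road mask, lane lines, crosswalks (assumed)

# Depth dimension holds the static map channels plus one dynamic channel
# (agent occupancy / state) per previous time step.
scene = np.zeros((H, W, STATIC_CH + T_HIST), dtype=np.float32)

static_view = scene[..., :STATIC_CH]    # rendered once per scene
dynamic_view = scene[..., STATIC_CH:]   # one channel per past time step
```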

  • Obtaining anchor trajectories

As noted by [6, 5], directly learning a mixture suffers from issues of mode collapse
【CC】Letting the network learn A itself as part of the mixture is prone to mode collapse, so the anchors are obtained by preprocessing instead.
we used the k-means algorithm as a simple approximation to obtain A with the following squared distance between trajectories:

d(u, v) = Σ_t || M_u u_t − M_v v_t ||²

where M_u, M_v are affine transformation matrices which put trajectories into a canonical rotation- and translation-invariant agent-centric coordinate frame.
【CC】Plain k-means is used directly; the distance is just a squared L2 norm, computed after transforming each trajectory into a canonical agent-centric frame.
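A sketch of how the anchor set A could be obtained, assuming each training trajectory has already been transformed into its canonical agent-centric frame (i.e., the M_u transform above has been applied); the flattening trick and the use of scikit-learn's KMeans are implementation assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans

def fit_anchors(trajs: np.ndarray, k: int) -> np.ndarray:
    """Cluster trajectories into K anchor trajectories.

    trajs: (N, T, 2) trajectories already expressed in a canonical
           rotation- and translation-invariant agent-centric frame.
    Flattening to (N, T*2) makes the Euclidean distance used by k-means
    equal to the summed squared waypoint distance defined above.
    """
    n, t, d = trajs.shape
    km = KMeans(n_clusters=k, n_init=10).fit(trajs.reshape(n, t * d))
    return km.cluster_centers_.reshape(k, t, d)  # anchors a^k, shape (K, T, 2)
```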

  • Learning

We train our model via imitation learning by fitting our parameters to maximize the log-likelihood of recorded driving trajectories.
Let our data be of the form {(x^m, ŝ^m)}_{m=1..M}. We learn to predict the distribution parameters π(a^k|x), µ^k_t(x) and Σ^k_t(x) as outputs of a deep neural network parameterized by weights θ, with the following negative log-likelihood loss built upon Equation 2:

ℓ(θ) = − Σ_{m=1..M} Σ_{k=1..K} 1(k = k̂^m) [ log π(a^k | x^m; θ) + Σ_{t=1..T} log N(ŝ^m_t | a^k_t + µ^k_t(x^m; θ), Σ^k_t(x^m; θ)) ]

The notation 1(·) is the indicator function, and k̂^m is the index of the anchor most closely matching the ground-truth trajectory ŝ^m, measured as L2-norm distance in state-sequence space. This hard assignment of ground-truth anchors sidesteps the intractability of direct GMM likelihood fitting and avoids resorting to an expectation-maximization procedure.
【CC】The usual recipe: the so-called imitation learning here just fits predictions to ground truth by minimizing a loss. The loss function is direct as well, simply the NLL of p(s|x), with an extra index m running over all samples. The assignment of the ground truth to the nearest a^k is also blunt, just an L2 distance; the paper frames this as a way to avoid an EM procedure. Later implementers could adjust it: could the distance be learned, e.g., by softening 1(·) into a softmax and letting the network learn it?
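A minimal PyTorch-style sketch of the hard-assignment NLL for a single training sample; tensor names, shapes and the use of `torch.distributions` are assumptions, not the paper's code:

```python
import torch
from torch.distributions import MultivariateNormal

def multipath_nll(logits, mu, cov, gt_traj, anchors):
    """Negative log-likelihood with hard anchor assignment (one sample).

    logits:  (K,)          f_k(x), pre-softmax intent scores
    mu:      (K, T, 2)     predicted means a^k_t + mu^k_t(x)
    cov:     (K, T, 2, 2)  predicted covariances Sigma^k_t(x)
    gt_traj: (T, 2)        ground-truth future trajectory s_hat
    anchors: (K, T, 2)     predefined anchor trajectories a^k
    """
    # k_hat: anchor closest to the ground truth in state-sequence space (L2)
    with torch.no_grad():
        d2 = ((anchors - gt_traj.unsqueeze(0)) ** 2).sum(dim=(1, 2))
        k_hat = d2.argmin()

    log_pi = torch.log_softmax(logits, dim=0)[k_hat]
    step_dists = MultivariateNormal(mu[k_hat], covariance_matrix=cov[k_hat])
    log_phi = step_dists.log_prob(gt_traj).sum()  # sum over T independent steps
    return -(log_pi + log_phi)
```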

  • Inferring a diverse weighted set of test-time trajectories
we take the MAP trajectory estimates from each of our K anchor modes, and consider the distribution over anchors π(a^k|x) as the sample weights
【CC】Already covered above; a small sketch of the summary step follows.
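Since each anchor's control distribution is a unimodal Gaussian, its MAP trajectory is simply the predicted mean, and the anchor probabilities serve as the sample weights. Names here are illustrative:

```python
import numpy as np

def map_trajectory_set(pi: np.ndarray, mu: np.ndarray):
    """Return K weighted trajectories summarizing p(s | x).

    pi: (K,)       anchor probabilities pi(a^k | x), used as sample weights
    mu: (K, T, 2)  per-anchor means; for a Gaussian the mean is the MAP point
    """
    order = np.argsort(-pi)                 # most likely intents first
    return [(pi[k], mu[k]) for k in order]  # list of (weight, trajectory)
```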

  • Neural network details

We opt to use ResNet-based architectures [24] for this scene-level feature extractor. The second phase extracts patches of size 11×11 centered on agent locations in this feature map. The extracted features are also rotated to an agent-centric coordinate system via a differentiable bilinear warping. The second, agent-centric network then operates on a per-agent basis. It contains 4 convolutional layers with kernel size 3 and 8 or 16 depth channels
【CC】The first stage uses a ResNet variant for representation learning. An 11×11 feature patch centered on each agent is cropped, transformed into a coordinate frame aligned with the agent's heading, and fed into a 4-layer conv network, with a prediction made per agent. Two points to note: first, the 11×11 size is a bit tricky, since it crops a square region around the agent, whereas human intuition would pay more attention to a trapezoid-shaped region ahead; second, the backbone must not downsample (or must upsample back), otherwise the coordinate frames will not line up.

It produces K×T×5 parameters describing a bivariate Gaussian per time step per anchor (parameterized by µ_x, µ_y, log σ_x, log σ_y and ρ; the last 3 parameters define the 2×2 covariance matrix Σ_xy in the agent-centric x, y coordinate space), as well as K softmax logits to represent π(a|x)
【CC】Outputs a K×T×5 tensor, where K is the number of predefined anchors and T the prediction horizon; µ_x, µ_y give the mean and Σ_xy the covariance, i.e., the learnable quantities in φ(s^k_t | a^k, x), while the additional K softmax logits represent π(a|x).
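A sketch of unpacking the K×T×5 head output into Gaussian parameters; the channel order (µ_x, µ_y, log σ_x, log σ_y, ρ) follows the quoted description, while the tanh clamping of ρ and the function name are assumptions:

```python
import numpy as np

def unpack_head(raw: np.ndarray):
    """Split the regression head output into offsets and covariances.

    raw: (K, T, 5) with channels (mu_x, mu_y, log_sig_x, log_sig_y, rho_raw).
    Returns mu (K, T, 2), the offsets from the anchor waypoints (add a^k_t to
    get the distribution mean), and cov (K, T, 2, 2) in agent-centric coords.
    """
    mu = raw[..., 0:2]
    sig_x, sig_y = np.exp(raw[..., 2]), np.exp(raw[..., 3])
    rho = np.tanh(raw[..., 4])   # keep the correlation in (-1, 1); an assumption
    cov = np.empty(raw.shape[:2] + (2, 2))
    cov[..., 0, 0] = sig_x ** 2
    cov[..., 1, 1] = sig_y ** 2
    cov[..., 0, 1] = cov[..., 1, 0] = rho * sig_x * sig_y
    return mu, cov
```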

Supplementary background:
Conditional variational autoencoders (CVAEs)
