[Paper Translation] Pedestrian Alignment Network for Large-scale Person Re-Identification

Links:

https://arxiv.org/pdf/1707.00408.pdf

https://github.com/layumi/Pedestrian_Alignment

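Abstract

Person re-identification (re-ID) is usually treated as an image retrieval problem: the goal is to search for a query person in a large image pool. In practice, re-ID typically relies on automatic detectors to obtain cropped pedestrian images. This process introduces two kinds of detector errors: excessive background and part missing. Both degrade the quality of pedestrian alignment, and the resulting position and scale variance hurts pedestrian matching accuracy. To address this misalignment problem, the paper proposes performing alignment within the identification procedure. It introduces PAN (Pedestrian Alignment Network), which learns a discriminative embedding and pedestrian alignment without extra annotations. Key observation: when a CNN learns to discriminate between identities, the learned features show strong activations on the human body rather than on the background. The paper exploits this attention mechanism to adjust the position and align the pedestrian within a bounding box. Experiments on three datasets demonstrate PAN's …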

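1. Introduction

Recent re-ID work mainly uses CNNs to learn discriminative embeddings on large datasets, and the results outperform hand-crafted features. Misalignment is a critical factor. The problem arises from the use of detectors: large-scale datasets usually rely on off-the-shelf detectors to find pedestrians, which saves most of the labeling effort and is closer to realistic settings. However, detector errors are inevitable and lead to two common noise factors: excessive background and part missing. In the former case, the background may take up a large proportion of the detected image; in the latter, the detected image may contain only part of the body. See Figure 1.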

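Pedestrian alignment and re-ID are two connected problems. Given identity labels for the pedestrian bounding boxes, we can find the optimal affine transformation that best helps distinguish different identities. Under this affine transformation, pedestrians are better aligned. In turn, better alignment allows more discriminative features to be learned, which improves matching accuracy.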

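Based on the above, the paper incorporates pedestrian alignment into the re-ID architecture, yielding PAN (Pedestrian Alignment Network). Given a detected pedestrian image, the network simultaneously re-localizes the person and classifies the person into pre-defined identities. PAN thus exploits the complementary nature of person alignment and re-ID.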

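PAN's training process has three components: (1) a network that predicts the identity of the input image; (2) an affine transformation, estimated to re-localize the input image; (3) another network that predicts the identity of the re-localized image. For (1) and (3), we use two CNN branches, called the base branch and the alignment branch, which predict the identity from the original image and the aligned image, respectively. Internally, the two share low-level features; at test time, their FC-layer features are concatenated to form the pedestrian descriptor. For (2), the affine parameters are estimated from the feature maps of a high-level convolutional layer of the base branch, and the affine transformation is then applied to the lower-level feature maps of the base branch. For this step we use a differentiable localization network, the STN (spatial transformer network). With the STN, we can crop images that contain too much background, or pad zeros on the borders of images with missing parts. As a result, we reduce the impact of the scale and position variance caused by misdetection and make matching more accurate.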

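Note that this method addresses the misalignment caused by detection errors, whereas the commonly used patch matching strategy aims to find matched local structures in well-aligned images. Patch matching methods assume that the matched local structures lie in the same horizontal stripe or square neighborhood, so they are robust to small spatial variance. When misdetection occurs, however, the limited search range means such methods may fail to find the matched structures, and the risk of mismatching may be high. We speculate that the proposed method is a good complement to part matching.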

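Contributions:

(1) PAN (Pedestrian Alignment Network), which simultaneously aligns pedestrians and learns pedestrian descriptors, using only identity labels and no extra annotations.

(2) Manually cropped images are still imperfect, and alignment improves re-ID performance on them as well.

(3) Experiments on Market-1501, CUHK03, and DukeMTMC-reID achieve state-of-the-art results.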

2. Related Work

This paper addresses two tasks at once: re-ID and person alignment.

2.1 Hand-crafted Systems for Re-ID

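Re-ID requires features that are robust and discriminative across different cameras.

Representative methods (local hand-crafted features): LBP, Gabor, LOMO, and multi-feature fusion.

Metric learning methods: KISSME, among others.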

2.2 Deeply-learned Models for Re-ID

More recent CNN-based approaches apply spatial constraints by splitting images or adding new patch-matching layers. These approaches are limited by computational inefficiency, because they take pairs of images as input.

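In addition, a single CNN without explicit patch matching already has high discriminative ability. This paper adopts similar CNN branches without explicit part-matching layers. Note that we aim at a more robust pedestrian embedding for re-ID, so previous methods can be applied on top of it to further improve performance.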

2.3 Objective Alignment

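Face alignment (the rectification of face misdetection) has been widely studied. Attention models driven by object localization tasks include the STN (spatial transformer network). In re-ID, related work includes 3D body models (which do not handle the misdetection problem) and PoseBox. The proposed method is similar to PoseBox but differs from it.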

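3. Pedestrian Alignment Network

3.1 Overview of PAN

Our goal is to design an architecture that simultaneously aligns the images and identifies the person. The main challenge is to design a model that supports end-to-end training and benefits from the two related tasks. PAN addresses these design constraints with two convolutional branches and one affine estimation branch.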

We use the ResNet-50 model as the base model. Each Res_i block (i = 1, 2, 3, 4, 5) denotes several convolutional layers with batch normalization, ReLU, and optionally max pooling. The features are down-sampled after each block.

3.2 Base and Alignment Branches

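There are two main convolutional branches, called the base branch and the alignment branch. Both are classification networks that predict the identity of the training images. Given the originally detected image, the base branch not only learns to distinguish its identity from the other identities, but also encodes the appearance of the image and provides spatial localization cues, as shown in the figure below. The alignment branch has a similar convolutional network, but it processes the aligned features produced by the affine estimation branch.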

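In the base branch, we use a ResNet-50 pre-trained on ImageNet with its final FC layer removed. Because the Market-1501 training set has K = 751 identities, we add an FC layer that maps the 1 × 1 × 2048 CNN embedding to 751 unnormalized probabilities. The alignment branch consists of three ResBlocks and one average pooling layer, and we likewise add an FC layer to predict the multi-class probabilities. The two branches do not share weights. We use W1 and W2 to denote the parameters of the two convolutional branches, respectively.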

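More formally, given an input image x, p(k|x) denotes the probability that x belongs to class k ∈ {1...K}. Specifically,

$$p(k|x) = \frac{\exp(z_k)}{\sum_{i=1}^{K} \exp(z_i)}$$

where z_i is the unnormalized probability (score) output by the CNN for class i. For the two branches, the cross-entropy losses are

$$\mathrm{base\_loss}(W_1) = -\sum_{k=1}^{K} \log\big(p(k|x)\big)\, q(k|x) \tag{1}$$

$$\mathrm{align\_loss}(W_2) = -\sum_{k=1}^{K} \log\big(p(k|x_a)\big)\, q(k|x_a) \tag{2}$$

where x_a denotes the aligned input, which is derived from the original input as x_a = T(x). Given the label y, the ground-truth distribution is q(y|x) = 1 and q(k|x) = 0 for all k ≠ y. Discarding the zero terms in Eq. 1 and Eq. 2, the losses become:

$$\mathrm{base\_loss}(W_1) = -\log\big(p(y|x)\big), \qquad \mathrm{align\_loss}(W_2) = -\log\big(p(y|x_a)\big)$$

Thus, at each iteration we minimize the total entropy, which is equivalent to maximizing the probability of the correct prediction.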
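The reduction from the full cross-entropy sum to the negative log-probability of the true class can be checked numerically. The following is a minimal sketch in plain Python; the function names and the example logits are illustrative assumptions, not taken from the paper or its code.

```python
import math

def softmax(logits):
    """Turn unnormalized scores z_1..z_K into probabilities p(k|x)."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, label):
    """Full sum over all classes against a one-hot target q(k|x)."""
    p = softmax(logits)
    q = [1.0 if k == label else 0.0 for k in range(len(logits))]
    return -sum(qk * math.log(pk) for qk, pk in zip(q, p))

def negative_log_likelihood(logits, label):
    """Same loss after discarding the zero terms: -log p(y|x)."""
    return -math.log(softmax(logits)[label])

# Hypothetical scores for K = 3 identities; the true identity is class 0.
logits = [2.0, 0.5, -1.0]
label = 0
full = cross_entropy(logits, label)
short = negative_log_likelihood(logits, label)
print(full, short)  # the two values agree
```

Because the loss is just -log p(y|x), lowering it is the same as raising the predicted probability of the correct identity.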

3.3 Affine Estimation Branch

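To address the excessive background and part missing problems, the key is to predict the position of the pedestrian and apply the corresponding spatial transformation. When there is excessive background, a cropping strategy is used; when parts are missing, we pad zeros on the corresponding image borders. Both strategies require finding the parameters of an affine transformation. In this paper, this is done by the affine estimation branch.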

Note: the blogger has modified the following paragraph (Res2 and Res4).

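The affine estimation branch takes two activation tensors from the base branch as input, of size 14 × 14 × 1024 and 56 × 56 × 256, called the Res4 Feature Maps and the Res2 Feature Maps, respectively. The Res2 Feature Maps contain shallow features of the original image and reflect local pattern information. The Res4 Feature Maps, being closer to the classification layer, encode the attention on the pedestrian and the semantic cues that aid identification. The affine estimation branch consists of a bilinear sampler and a small network called the Grid Network, which contains one ResBlock and one average pooling layer. We pass the Res4 Feature Maps through the Grid Network to regress a set of 6-dimensional transformation parameters. The learned parameters θ are used to produce the image grid. The mapping is:

$$\begin{pmatrix} x_s \\ y_s \end{pmatrix} = \begin{bmatrix} \theta_{11} & \theta_{12} & \theta_{13} \\ \theta_{21} & \theta_{22} & \theta_{23} \end{bmatrix} \begin{pmatrix} x_t \\ y_t \\ 1 \end{pmatrix}$$

where (x_t, y_t) are the target coordinates on the output feature map and (x_s, y_s) are the source coordinates on the input feature map (Res2 Feature Maps). θ11, θ12, θ21, and θ22 handle scale and rotation, while θ13 and θ23 handle the offset. Coordinates are defined so that (-1, -1) is the top-left pixel of the image and (1, 1) is the bottom-right pixel. For example, if

$$\theta = \begin{bmatrix} 0.8 & 0 & -0.1 \\ 0 & 0.7 & 0 \end{bmatrix}$$

the pixel value at (-1, -1) on the output image equals that at (-0.9, -0.7) on the input features. A bilinear sampler fills in the missing pixels, and positions that fall outside the original map are assigned 0. We thus obtain an injective function from the original feature map V to the aligned output U. More formally:

$$U^{c}_{(m,n)} = \sum_{x_s}^{H} \sum_{y_s}^{W} V^{c}_{(x_s,y_s)} \max(0, 1-|x_t-m|)\, \max(0, 1-|y_t-n|)$$

where U^c_(m,n) is the output feature map at location (m, n) in channel c, V^c_(xs,ys) is the input feature map at location (x_s, y_s) in channel c, and H and W are the height and width of the input feature map. If (x_t, y_t) is close to (m, n), we add the pixel at (x_s, y_s) according to bilinear sampling.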
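The coordinate mapping can be made concrete with a few lines of plain Python. This is a sketch only: the function name is hypothetical, and the θ used here is one choice that is consistent with the (-1, -1) → (-0.9, -0.7) example in the text.

```python
def affine_source_coords(theta, xt, yt):
    """Map a target coordinate (xt, yt) on the output grid to the source
    coordinate (xs, ys) on the input feature map, using a 2x3 theta."""
    (t11, t12, t13), (t21, t22, t23) = theta
    xs = t11 * xt + t12 * yt + t13
    ys = t21 * xt + t22 * yt + t23
    return xs, ys

# Assumed example parameters: scale 0.8 / 0.7 and a horizontal offset of -0.1.
theta = [[0.8, 0.0, -0.1],
         [0.0, 0.7, 0.0]]

# Top-left corner of the output grid.
xs, ys = affine_source_coords(theta, -1.0, -1.0)
# Bottom-right corner of the output grid.
xs_br, ys_br = affine_source_coords(theta, 1.0, 1.0)
print((xs, ys), (xs_br, ys_br))
```

With these values the output corners land near (-0.9, -0.7) and (0.7, 0.7) on the input, so the output grid covers a window strictly inside the input map, which corresponds to the cropping case.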

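In this paper, we perform pedestrian alignment on the shallow features rather than on the original image, which reduces running time and the number of model parameters. This is also why the re-localization grid is applied to the features. The bilinear sampler takes the grid and the features and produces the aligned output x_a. A visualization is shown in Figure 3. With only ID supervision, we can re-localize the pedestrian and correct misdetections to some extent.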
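To illustrate how one sampler covers both cropping and zero padding, here is a small single-channel sketch in plain Python. It assumes a particular convention for converting normalized coordinates to pixel indices (-1 maps to index 0 and 1 maps to the last index); the helper names and the 3 × 3 toy feature map are made up for the example and are not the paper's implementation.

```python
import math

def bilinear_sample(feat, xs, ys):
    """Sample a 2D map (list of rows) at normalized (xs, ys) in [-1, 1].
    Neighbors outside the map contribute 0, which gives zero padding."""
    h, w = len(feat), len(feat[0])
    px = (xs + 1.0) * (w - 1) / 2.0  # assumed convention: -1 -> 0, 1 -> w-1
    py = (ys + 1.0) * (h - 1) / 2.0
    x0, y0 = math.floor(px), math.floor(py)
    value = 0.0
    for yy in (y0, y0 + 1):
        for xx in (x0, x0 + 1):
            if 0 <= yy < h and 0 <= xx < w:
                weight = max(0.0, 1 - abs(px - xx)) * max(0.0, 1 - abs(py - yy))
                value += weight * feat[yy][xx]
    return value

def warp(feat, theta, out_h, out_w):
    """Build the target grid, map it through theta, and sample the input."""
    out = []
    for i in range(out_h):
        yt = -1.0 + 2.0 * i / (out_h - 1)
        row = []
        for j in range(out_w):
            xt = -1.0 + 2.0 * j / (out_w - 1)
            xs = theta[0][0] * xt + theta[0][1] * yt + theta[0][2]
            ys = theta[1][0] * xt + theta[1][1] * yt + theta[1][2]
            row.append(bilinear_sample(feat, xs, ys))
        out.append(row)
    return out

feat = [[1.0, 2.0, 3.0],
        [4.0, 5.0, 6.0],
        [7.0, 8.0, 9.0]]

same = warp(feat, [[1, 0, 0], [0, 1, 0]], 3, 3)      # identity: unchanged
crop = warp(feat, [[0.5, 0, 0], [0, 0.5, 0]], 3, 3)  # scale < 1: zoom into the center
padded = warp(feat, [[2, 0, 0], [0, 2, 0]], 3, 3)    # scale > 1: reaches past the border
print(crop)
print(padded)
```

A scale below 1 reads only the central region (cropping away background), while a scale above 1 asks for positions beyond the map, which come back as zeros (padding for missing parts).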

3.4 Pedestrian Descriptor

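Given a fine-tuned PAN model and an input image x_i, the pedestrian descriptor is the weighted fusion of the FC features of the base branch and the alignment branch. That is, we capture the pedestrian characteristics from both the original image and the aligned image. The experiments in Section 4.3 show that the two features are complementary and improve re-ID performance.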

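The paper adopts a simple late fusion strategy:

$$f_i = g(f_i^1, f_i^2)$$

Here f_i^1 and f_i^2 are the FC-layer descriptors from the two types of images, respectively. We reshape the tensor after the final average pooling into a one-dimensional vector and use it as the pedestrian descriptor of each branch. The pedestrian descriptor is then written as:

$$f_i = \big[\alpha\,|f_i^1|^{T},\ (1-\alpha)\,|f_i^2|^{T}\big]^{T}$$

where the |·| operator denotes an L2-normalization step. After L2-normalization, the aligned descriptor is concatenated with the descriptor of the original image. Unless stated otherwise, α = 0.5.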
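The late fusion step is short enough to sketch directly. The helper names and the two-dimensional toy descriptors below are assumptions for illustration; real descriptors would be the 2048-dimensional pooled features of each branch.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit L2 norm (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def fuse(f_base, f_align, alpha=0.5):
    """Weighted concatenation of the two L2-normalized branch descriptors."""
    base = [alpha * x for x in l2_normalize(f_base)]
    align = [(1.0 - alpha) * x for x in l2_normalize(f_align)]
    return base + align

# Toy descriptors standing in for the base and alignment branch outputs.
f1 = [3.0, 4.0]
f2 = [0.0, 2.0]
desc = fuse(f1, f2)
print(desc)  # approximately [0.3, 0.4, 0.0, 0.5]
```

Normalizing each branch before weighting keeps one branch from dominating simply because its activations have a larger magnitude.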

3.5 Re-ranking for re-ID

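We first obtain the rank list N(q, n) = [x1, x2, ...xn] by sorting the gallery images by their Euclidean distance to the query. The distance is computed as D_{i,j} = ||f_i − f_j||_2^2, where f_i and f_j are the L2-normalized features of images i and j, respectively. We then perform re-ranking to obtain better retrieval results.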
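The initial ranking is a plain sort by squared Euclidean distance. A minimal sketch with made-up, already L2-normalized two-dimensional features (function names are illustrative):

```python
def squared_distance(fi, fj):
    """Squared Euclidean distance between two descriptors."""
    return sum((a - b) ** 2 for a, b in zip(fi, fj))

def rank_gallery(query, gallery):
    """Return gallery indices sorted from nearest to farthest."""
    dists = [squared_distance(query, g) for g in gallery]
    return sorted(range(len(gallery)), key=lambda i: dists[i])

query = [1.0, 0.0]
gallery = [[0.0, 1.0], [0.8, 0.6], [1.0, 0.0]]
order = rank_gallery(query, gallery)
print(order)  # [2, 1, 0]
```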

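In addition to the Euclidean distance, we also consider the Jaccard similarity. To introduce this distance, we first define a robust retrieval set for each image. The k-reciprocal nearest neighbors R(p, k) consist of the images x in the top-k rank list of the query p for which p also appears in the top-k rank list of x:

$$R(p, k) = \{\, x \mid x \in N(p, k),\ p \in N(x, k) \,\}$$

Following [Zhong et al., 2017], the set R is extended to R* to include more positive samples, and the Jaccard similarity over R* is used for re-ranking. If we use a correctly matched image as the query, we should retrieve a rank list similar to that of the original query. The Jaccard similarity is computed as:

$$D_{\mathrm{Jaccard}}(q, x_i) = 1 - \frac{|R^*(q, k) \cap R^*(x_i, k)|}{|R^*(q, k) \cup R^*(x_i, k)|}$$

where | · | denotes the cardinality of the set. The more elements R*(q, k) and R*(x_i, k) share, the more likely x_i is a true match. This helps distinguish some hard negative samples from the correct matches. At test time, this similarity distance is added to the Euclidean distance to re-rank the result. The experiments show the resulting improvement.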
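The k-reciprocal set and the Jaccard distance can be sketched on a toy example. This sketch makes several simplifying assumptions that are not from the source: all images live in one pool, each image counts as its own nearest neighbor, and the expansion from R to R* is omitted. The one-dimensional "features" are invented so that two clusters are obvious.

```python
def top_k(dist, p, k):
    """N(p, k): the k items nearest to p (p itself included, at distance 0)."""
    return sorted(range(len(dist)), key=lambda j: dist[p][j])[:k]

def k_reciprocal(dist, p, k):
    """R(p, k): items in p's top-k that also have p in their own top-k."""
    return {x for x in top_k(dist, p, k) if p in top_k(dist, x, k)}

def jaccard_distance(set_a, set_b):
    """1 - |A ∩ B| / |A ∪ B|; defined as 1.0 when both sets are empty."""
    union = set_a | set_b
    if not union:
        return 1.0
    return 1.0 - len(set_a & set_b) / len(union)

# Toy 1-D features: items 0-2 form one cluster, items 3-4 another.
points = [0.0, 1.0, 3.0, 10.0, 11.0]
dist = [[abs(a - b) for b in points] for a in points]

k = 3
r0 = k_reciprocal(dist, 0, k)
r1 = k_reciprocal(dist, 1, k)
r3 = k_reciprocal(dist, 3, k)
same_cluster = jaccard_distance(r0, r1)
other_cluster = jaccard_distance(r0, r3)
print(r0, r3, same_cluster, other_cluster)
```

Item 2 is in item 3's top-3 list but not the other way round, so the reciprocity check removes it from R(3, k); this is the mechanism that filters hard negatives out of the retrieval set.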

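4. Experiments

Datasets: Market-1501, CUHK03, and DukeMTMC-reID.

Market-1501 and CUHK03 (detected) are produced automatically by the DPM detector and suffer from misdetection. It is not known in advance whether slight alignment brings extra benefit on hand-labeled images, so we also evaluate on the manually annotated bounding boxes of CUHK03 (labeled) and DukeMTMC-reID. As shown in Figure 4, the three datasets have different characteristics (scene variance and detection bias).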

4.1 Datasets

1. Market-1501: misalignment problem (DPM)

2. CUHK03: split into a "detected" set (DPM) and a "labeled" set

3. DukeMTMC-reID: similar clothing, and possible occlusion by cars and trees

4. Evaluation Metrics: rank-1, 5, 20 accuracy and mAP (mean average precision)

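Rank-i accuracy is the probability that a correctly matched image appears in the top-i results, measured over the query images. For a single query, rank-i is 0 if no correct match appears in the top-i, and 1 otherwise. For each query we also compute the area under the precision-recall curve, i.e. the AP (average precision), and then take the mean over queries (mAP); this metric reflects both precision and recall.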
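Both metrics can be computed from a ranked list of gallery identities. The sketch below uses the common non-interpolated form of AP (mean of the precision at each correct match) as an approximation of the area under the precision-recall curve; official evaluation scripts for these datasets may compute the area differently and also filter out same-camera or junk images, which is not modeled here. All names and the toy ranking are illustrative.

```python
def rank_k_hit(ranked_ids, query_id, k):
    """1 if a correct match appears in the top-k of the ranked list, else 0."""
    return 1 if query_id in ranked_ids[:k] else 0

def average_precision(ranked_ids, query_id):
    """Mean of the precision values measured at each correct match."""
    hits = 0
    precision_sum = 0.0
    for position, gallery_id in enumerate(ranked_ids, start=1):
        if gallery_id == query_id:
            hits += 1
            precision_sum += hits / position
    return precision_sum / hits if hits else 0.0

# Toy ranked list for one query of identity "a".
ranked = ["b", "a", "c", "a"]
r1 = rank_k_hit(ranked, "a", 1)
r5 = rank_k_hit(ranked, "a", 5)
ap = average_precision(ranked, "a")
print(r1, r5, ap)  # 0 1 0.5
```

Rank-i accuracy over a dataset is the mean of the per-query hits, and mAP is the mean of the per-query AP values.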

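4.2 Implementation Details

ConvNet. (1) Fine-tune the base branch on the re-ID dataset; (2) with the base branch fixed, fine-tune the whole network. Specifically, when fine-tuning the base branch, the learning rate is decreased from 10^−3 to 10^−4 after 30 epochs, and training stops at the 40th epoch. The whole network is then trained in the same way. Weights are updated with mini-batch SGD (stochastic gradient descent) with Nesterov momentum fixed to 0.9. The implementation is based on MatConvNet. Input images are uniformly resized to 224 × 224, and simple data augmentation is used (e.g. cropping and horizontal flipping).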
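The schedule above can be written as a small function, together with one common formulation of a Nesterov momentum update. This is a sketch under assumptions: epochs are counted from 1, the Nesterov variant shown (gradient evaluated at a look-ahead point) is one of several equivalent-in-spirit forms and is not necessarily the one MatConvNet uses, and the quadratic toy objective exists only to make the loop runnable.

```python
def learning_rate(epoch):
    """Step schedule: 1e-3 for epochs 1-30, 1e-4 for epochs 31-40."""
    if not 1 <= epoch <= 40:
        raise ValueError("training runs for epochs 1..40")
    return 1e-3 if epoch <= 30 else 1e-4

def nesterov_step(w, v, grad_fn, lr, momentum=0.9):
    """One Nesterov update on a scalar weight: take the gradient at the
    look-ahead point w + momentum * v, then update velocity and weight."""
    lookahead = w + momentum * v
    v_new = momentum * v - lr * grad_fn(lookahead)
    return w + v_new, v_new

# Toy objective f(w) = w^2 with gradient 2w, one update per "epoch".
w, v = 1.0, 0.0
for epoch in range(1, 41):
    w, v = nesterov_step(w, v, lambda x: 2.0 * x, learning_rate(epoch))
print(w)  # smaller than the starting value of 1.0
```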

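STN. The affine estimation branch may fall into a local minimum in the early iterations. A small learning rate helps stabilize training, so we use a learning rate of 1 × 10^−5 for the final convolutional layer of the affine estimation branch. In addition, all θ are initialized to 0 except θ11 = θ22 = 0.8. The alignment branch therefore starts training by attending to the center of the Res2 Feature Maps.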
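A quick check shows why this initialization amounts to looking at the center of the feature map. The sketch maps the four corners of the output grid through the initial θ, using the normalized coordinate convention from Section 3.3; the variable names are illustrative.

```python
# Initial affine parameters: pure scaling by 0.8, no rotation, no offset.
theta_init = [[0.8, 0.0, 0.0],
              [0.0, 0.8, 0.0]]

# Corners of the output grid in normalized coordinates.
corners = [(-1.0, -1.0), (1.0, -1.0), (-1.0, 1.0), (1.0, 1.0)]

# Where each output corner reads from on the input feature map.
window = [
    (theta_init[0][0] * x + theta_init[0][1] * y + theta_init[0][2],
     theta_init[1][0] * x + theta_init[1][1] * y + theta_init[1][2])
    for x, y in corners
]

# Fraction of the input map covered by the initial sampling window.
covered = theta_init[0][0] * theta_init[1][1]
print(window, covered)
```

The output grid initially samples the central window from (-0.8, -0.8) to (0.8, 0.8), about 64% of the map area, so training starts from a mild center crop rather than from an arbitrary transformation.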

4.3 Evaluation

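Evaluation of the ResNet baseline. We build on the conventional baseline of [Zheng et al., 2016b]; the specific details are given in Section 4.2. The baseline results are shown in Table 1:

We use a batch size of 16 and a dropout rate of 0.75.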

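Base branch vs. alignment branch. To investigate how alignment helps learn discriminative pedestrian representations, we evaluate the pedestrian descriptors of the base branch and the alignment branch separately. Two conclusions can be drawn:

(1) On the latter two datasets (CUHK03 and DukeMTMC-reID), the alignment branch yields better results, while on Market-1501 it yields similar results. Our conjecture is that Market-1501 contains more intensive detection errors, so the effect of alignment is limited.

(2) Although the latter two datasets are manually annotated, using the alignment branch still improves performance. This suggests that manual annotations may not be good enough for learning a good descriptor, and that alignment can learn a more discriminative representation.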

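The complementarity of the two branches. Concatenating the descriptors of the base branch and the alignment branch improves performance on all three datasets. The two branches are complementary, so together they contain more meaningful information than either branch alone. The simple fusion adds no extra computation.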

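Parameter sensitivity. We evaluate how sensitive re-ID accuracy is to the parameter α. Figure 5 shows rank-1 accuracy and mAP as α is varied from 0 to 1. α has only a small effect on both. For simplicity we set α = 0.5; this may not be the best choice for a particular dataset, but it is a simple choice when the dataset distribution is not known in advance.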
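The role of α can be made explicit with a short derivation. It assumes the fused descriptor from Section 3.4, i.e. the concatenation of α times the L2-normalized base descriptor and (1 − α) times the L2-normalized alignment descriptor, compared with the squared Euclidean distance from Section 3.5; the hat notation for normalized vectors is introduced here for convenience.

$$D_{i,j} = \| f_i - f_j \|_2^2 = \alpha^2\, \| \hat{f}_i^1 - \hat{f}_j^1 \|_2^2 + (1-\alpha)^2\, \| \hat{f}_i^2 - \hat{f}_j^2 \|_2^2$$

The fused distance is thus a weighted sum of the two per-branch distances: α = 1 ranks by the base branch alone, α = 0 by the alignment branch alone, and α = 0.5 weights them equally. Renormalizing the fused descriptor only rescales every distance by the same constant, so it does not change the ranking.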

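Comparison with the state-of-the-art methods. The results are shown in Tables 2, 4, and 3, respectively. Market-1501: rank-1 accuracy = 85.78% and mAP = 76.56% after re-ranking, which is the best result, and the method is compatible with previous methods. Training with GAN-generated images brings further improvement. CUHK03...DukeMTMC-reID...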

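Visualization of the result comparison:

Although the three datasets differ (scene variance and detection bias), all of them show improvement.

Visualization of the alignment. The network cannot solve the alignment problem perfectly, but it more or less reduces the scale and position variance, which is critical for learning representations.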