PSC-Net: Learning Part Spatial Co-occurrence for Occluded Pedestrian Detection (Explained)

Original Text
Abstract: Detecting pedestrians, especially under heavy occlusions, is a challenging computer vision problem with numerous real-world applications. This paper introduces a novel approach, termed PSC-Net, for occluded pedestrian detection. The proposed PSC-Net contains a dedicated module that is designed to explicitly capture both inter- and intra-part co-occurrence information of different pedestrian body parts through a Graph Convolutional Network (GCN). Both inter- and intra-part co-occurrence information contribute towards improving the feature representation for handling varying levels of occlusion, ranging from partial to severe occlusions. Our PSC-Net exploits the topological structure of the pedestrian and does not require part-based annotations or additional visible bounding-box (VBB) information to learn part spatial co-occurrence. Comprehensive experiments are performed on two challenging datasets: the CityPersons and Caltech datasets. The proposed PSC-Net achieves state-of-the-art detection performance on both. On the heavily occluded (HO) set of the CityPersons test set, our PSC-Net obtains an absolute gain of 4.0% in terms of log-average miss rate over the state-of-the-art [34] with the same backbone and input scale and without using additional VBB supervision. Further, PSC-Net improves the state-of-the-art [54] from 37.9 to 34.8 in terms of log-average miss rate on the Caltech (HO) test set.


I. INTRODUCTION

PEDESTRIAN detection is a challenging problem in computer vision with various real-world applications, e.g., robotics, autonomous driving and visual surveillance. Recent years have witnessed significant progress in the field of pedestrian detection, mainly due to the advances in deep convolutional neural networks (CNNs). Modern pedestrian detection methods can be generally divided into single-stage [22], [30], [3] and two-stage [20], [57], [54], [34], [52], [53], [4], [46], [24] methods. Single-stage pedestrian detectors typically work by directly regressing the default anchors into pedestrian detection boxes. Different from single-stage pedestrian detectors, two-stage methods first produce a set of candidate pedestrian proposals, which are then classified and regressed. Most existing two-stage pedestrian detectors [53], [34], [20], [52], [46], [24] are based on the popular Faster R-CNN detection framework [37], adapted from generic object detection. Existing pedestrian detectors typically assume entirely visible pedestrians when trained using full-body pedestrian annotations.


While promising results have been achieved by existing pedestrian detectors on standard non-occluded pedestrians, their performance on heavily occluded pedestrians is far from satisfactory. This is evident from the fact that the best reported performance [34] on the reasonable (R) set (where the visibility ratio is larger than 65%) of the CityPersons test set [51] is 9.3 (log-average miss rate), whereas it is 41.0 on the heavily occluded (HO) set (where the visibility ratio ranges from 20% to 65%) of the same dataset. Handling pedestrian occlusion is an open problem in computer vision and presents a great challenge for detecting pedestrians in real-world applications due to its frequent occurrence. Therefore, a pedestrian detector is desired to be accurate with respect to varying levels of occlusion, ranging from reasonably occluded to severely occluded pedestrians.


II. RELATED WORK

Convolutional neural networks (CNNs) have significantly advanced the state-of-the-art in numerous computer vision applications, such as image classification [12], [41], [39], [43], object detection [37], [33], [45], [29], [21], object counting [48], [9], [8], [17], image retrieval [35], [36], [23], [49], action recognition [40], [38], [19], [28], and pedestrian detection [52], [34], [53], [46]. State-of-the-art pedestrian detection methods can generally be divided into single-stage and two-stage methods. Next, we present a brief overview of two-stage pedestrian detection methods.


Two-stage Deep Pedestrian Detection: In recent years, two-stage pedestrian detection approaches [51], [34], [57], [54], [52], [20], [53], [46] have shown superior performance on standard pedestrian benchmarks. Generally, in two-stage pedestrian detectors, a set of candidate pedestrian proposals is first generated. Then, these candidate proposals are classified and regressed. Zhang et al. [51] propose key adaptations of the popular Faster R-CNN [37] for pedestrian detection. The work of [46] proposes an approach based on a bounding-box regression loss designed for crowded scenes. The work of [52] investigates several attention strategies, e.g., channel, part and visible bounding-box attention, for pedestrian detection. The work of [5] introduces a multi-scale pedestrian detection approach with layers having receptive fields similar to object scales. Zhang et al. [53] propose a loss formulation that enforces candidate proposals to be close to the corresponding objects and integrates structural information with visibility predictions. The work of [3] proposes a multi-phase autoregressive pedestrian detection approach that utilizes a stackable decoder-encoder module with convolutional re-sampling layers. In [20], an adaptive NMS strategy is introduced that applies a dynamic suppression threshold to an instance.


Most recent approaches [52], [57], [53], [54], [34] investigate the problem of occluded pedestrian detection by utilizing additional visible bounding-box (VBB) annotations together with the standard full-body information. Zhang et al. [52] use the VBB along with a pre-trained body part prediction model to tackle occluded pedestrian detection. The work of [57] demonstrates that an additional task of visible-region bounding-box prediction can improve full-body pedestrian detection. The work of [53] proposes a novel loss that improves localization, and a part occlusion-aware region-of-interest pooling, to integrate structure information with visibility predictions. Zhou et al. [54] propose a discriminative feature transformation module that projects the features into a feature space where the distance between occluded and non-occluded pedestrians is smaller. Such a transformation improves the robustness of the pedestrian detector under occlusion. In their approach, the visible bounding-box (VBB) is used to identify the occluded pedestrian. Recently, MGAN [34] proposed a mask-guided attention network, using VBB annotations, to emphasize the visible regions while suppressing the occluded regions, leading to state-of-the-art results on standard benchmarks.


Our Approach: Contrary to the above-mentioned recent approaches that rely on additional visible bounding-box (VBB) annotations, our proposed PSC-Net only requires the standard full-body supervision to handle occluded pedestrian detection. The focus of our design is the introduction of a part spatial co-occurrence (PSC) module that explicitly captures both inter- and intra-part co-occurrence information of different body parts through a Graph Convolutional Network (GCN) [14]. To the best of our knowledge, the proposed approach is the first to capture both inter- and intra-part co-occurrence information through a GCN to address the problem of occluded pedestrian detection.


III. METHOD
As discussed above, occlusion is one of the most challenging problems in pedestrian detection. In case of occlusion, pedestrian body parts are either partially or fully occluded. Based on the observation that a human body generally has a fixed empirical ratio with limited flexibility of deformation, we propose a two-stage approach that utilizes the spatial co-occurrence of different pedestrian body parts as a useful cue for heavily occluded pedestrian detection.


Fig. 2 shows the overall architecture of our proposed PSC-Net. It consists of a standard pedestrian detection (PD) branch (Sec. III-A) and a part spatial co-occurrence (PSC) module (Sec. III-B). The standard pedestrian detection (PD) branch is based on Faster R-CNN [37], as typically employed in existing pedestrian detection works [51], [34]. The part spatial co-occurrence (PSC) module encodes both inter- and intra-part co-occurrence information of different body parts. The PSC module comprises two components. In the first component, the intra-part co-occurrence of a pedestrian body part is captured by utilizing the corresponding RoI feature. As a result, an enhanced part feature representation is obtained. This representation is used as an input to the second component for capturing the inter-part co-occurrence of spatially adjacent body parts, leading to a final enhanced feature representation that encodes both intra- and inter-part information. This final enhanced feature representation of a candidate proposal is then fed into the later part of the pedestrian detection (PD) branch, which performs the final bounding-box regression and classification.


Next, we describe the standard pedestrian detection (PD) branch, followed by a detailed presentation of our part spatial co-occurrence (PSC) module (Sec. III-B).


A. Standard Pedestrian Detector
 Here, we describe the standard pedestrian detection (PD) branch that is based on the popular Faster R-CNN framework [37] and typically employed in several pedestrian detection methods [51], [34].
The PD branch consists of a backbone network, a region proposal network (RPN), a region-of-interest (RoI) pooling layer and a classification network for final bounding-box regression and classification. In the PD branch, the image is first fed into the backbone network and the RPN generates a set of candidate proposals for the input image. For each candidate proposal, a fixed-sized feature representation is obtained through RoI pooling. Finally, this fixed-sized feature representation is passed through a classification network that outputs the classification score and the regressed bounding-box locations for the corresponding proposal. The loss function $L_f$ of the standard pedestrian detection (PD) branch is given as follows:

$$L_f = L_{rpn\_cls} + L_{rpn\_reg} + L_{rcnn\_cls} + L_{rcnn\_reg} \tag{1}$$


where $L_{rpn\_cls}$ and $L_{rpn\_reg}$ are the classification loss and bounding-box regression loss of the RPN, respectively, and $L_{rcnn\_cls}$ and $L_{rcnn\_reg}$ are the classification and bounding-box regression losses of the classification network. In the PD branch, the Cross-Entropy loss is used as the classification loss, and the Smooth-L1 loss as the bounding-box regression loss.
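To make Eq. (1) concrete, here is a minimal sketch, assuming PyTorch, of how the four loss terms can be combined; the tensor arguments and the helper name `pd_branch_loss` are hypothetical, not from the authors' code.

```python
import torch.nn.functional as F

def pd_branch_loss(rpn_cls_logits, rpn_labels, rpn_deltas, rpn_targets,
                   rcnn_cls_logits, rcnn_labels, rcnn_deltas, rcnn_targets):
    # Cross-Entropy for both classification terms,
    # Smooth-L1 for both bounding-box regression terms (as stated above).
    l_rpn_cls = F.cross_entropy(rpn_cls_logits, rpn_labels)
    l_rpn_reg = F.smooth_l1_loss(rpn_deltas, rpn_targets)
    l_rcnn_cls = F.cross_entropy(rcnn_cls_logits, rcnn_labels)
    l_rcnn_reg = F.smooth_l1_loss(rcnn_deltas, rcnn_targets)
    return l_rpn_cls + l_rpn_reg + l_rcnn_cls + l_rcnn_reg  # L_f in Eq. (1)
```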


Limitations: To handle heavy occlusions, several recent two-stage pedestrian detection approaches [52], [53], [34] extend the PD branch by exploiting additional visible bounding-box (VBB) annotations along with the standard full-body information. However, this reliance on additional VBB information implies that two sets of annotations are required for pedestrian detection training. Further, these approaches obtain a fixed-sized proposal representation by performing a pooling operation (e.g., RoI Pool or RoIAlign) on the high-level features from a later layer of the backbone network (e.g., conv5 of VGG). Alternatively, several anchor-free methods [47], [42] have investigated the effective utilization of features from multiple (conv) layers of the backbone network for pedestrian detection.

In this work, we propose a two-stage pedestrian detection method, PSC-Net, to address heavy occlusions. Our main contribution is the introduction of a part spatial co-occurrence (PSC) module that only requires standard full-body supervision and explicitly captures both inter- and intra-part spatial co-occurrence information of different body parts. Next, we describe the details of our PSC module.



B. Part Spatial Co-occurrence Module
In pedestrian detection, the task is to accurately localize the full body of a pedestrian. This task is relatively easy in case of standard non-occluded pedestrians. However, it becomes particularly challenging in case of partial or severe occlusions. Here, we introduce a part spatial co-occurrence (PSC) module that utilizes the spatial co-occurrence of different body parts, captured through a Graph Convolutional Network (GCN) [14]. In the PSC module, the GCN is employed to capture intra-part and inter-part spatial co-occurrence by exploiting the topological structure of a pedestrian. The intra-part co-occurrence is expected to improve the feature representation in scenarios where a particular body part is partially occluded, whereas the inter-part co-occurrence targets severe occlusion of a particular body part.
Our PSC module neither requires pedestrian body part annotations nor relies on the use of an external pre-trained part model. Instead, it divides the full-body bounding-box of a pedestrian into five parts ($F_{head}$, $F_{left}$, $F_{right}$, $F_{mid}$, $F_{foot}$), based on the empirical fixed ratio of the human body (see Fig. 3), as in [53]. The RoI pooling operation is performed on each of the five body parts as well as the full body, resulting in six RoI pooled features for each proposal, as sketched below.
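As a rough illustration of the fixed-ratio split, the following sketch divides a full-body box into the five part boxes. The exact fractions used here are illustrative assumptions only; the paper adopts the empirical human-body ratios of [53], which are not reproduced in this text.

```python
def part_boxes(x1, y1, x2, y2):
    """Split a full-body box into five part boxes plus the full body.

    The fractions below are placeholders for the empirical ratios
    of [53]; only the five-part layout itself follows the paper.
    """
    w, h = x2 - x1, y2 - y1
    return {
        "head":  (x1, y1, x2, y1 + 0.25 * h),            # top strip
        "left":  (x1, y1 + 0.25 * h, x1 + 0.5 * w, y2),  # left half below head
        "right": (x1 + 0.5 * w, y1 + 0.25 * h, x2, y2),  # right half below head
        "mid":   (x1, y1 + 0.3 * h, x2, y1 + 0.7 * h),   # middle band
        "foot":  (x1, y1 + 0.75 * h, x2, y2),            # bottom strip
        "full":  (x1, y1, x2, y2),
    }
```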
As described above, the RoI pooling operation is performed for the five body parts as well as the full body, resulting in an increased feature dimension. Therefore, a direct utilization of all these RoI features would drastically increase the computational complexity of our PSC module. Note that Faster R-CNN and its pedestrian detection adaptations [51], [34], [53] commonly use a single RoI pooling only on the conv5 features of VGG, resulting in 512 channels. To maintain a similar number of channels as in Faster R-CNN and its pedestrian detection adaptations [52], [53], [34], we introduce an additional 1 × 1 convolution in the RoI pooling strategy that significantly reduces the number of channels (576 in total). Consequently, the RoI pooled features of each body part and the full body have only 64 and 256 channels, respectively.
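The channel bookkeeping can be sketched as follows, assuming torchvision's `roi_align`, VGG conv5 features with stride 16, and a 7 × 7 RoI resolution (the RoI resolution is our assumption): five 64-channel part features plus one 256-channel full-body feature gives 5 × 64 + 256 = 576 channels per proposal.

```python
import torch
import torch.nn as nn
from torchvision.ops import roi_align

feat = torch.randn(1, 512, 64, 128)           # dummy VGG conv5 feature map (stride 16)
to_part = nn.Conv2d(512, 64, kernel_size=1)   # 1x1 reduction: 512 -> 64 per part
to_full = nn.Conv2d(512, 256, kernel_size=1)  # 1x1 reduction: 512 -> 256 for full body

# one proposal, format (batch_index, x1, y1, x2, y2) in image coordinates
full_box = torch.tensor([[0., 160., 80., 320., 560.]])
part_box = torch.tensor([[0., 160., 80., 320., 200.]])  # e.g. the head strip

full_roi = roi_align(to_full(feat), full_box, output_size=7, spatial_scale=1 / 16)
part_roi = roi_align(to_part(feat), part_box, output_size=7, spatial_scale=1 / 16)
print(full_roi.shape, part_roi.shape)  # (1, 256, 7, 7) and (1, 64, 7, 7)
```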
1) Intra-part Co-occurrence: Here, we enhance the RoI pooled feature representation of each body part by considering its intra-part co-occurrence. For instance, consider a scenario where the head part $F_{head}$ is partially occluded, making the top part of the head invisible. Our intra-part co-occurrence component aims to capture the spatial relation between different sub-regions (e.g., eyes and ears) within the RoI feature $F_m$ of a body part (e.g., $F_{head}$) through a graph convolutional layer,

$$\tilde{F}_m = \sigma(A_s F_m W_s) \tag{2}$$


where σ is the ReLU activation, $W_s \in \mathbb{R}^{C \times C}$ is a learnable parameter matrix, C is the number of channels (64 for each body part and 256 for the full body) and $A_s \in \mathbb{R}^{HW \times HW}$ is the intra-part spatial adjacency matrix of a graph $G_s = (V_s, A_s)$. Here, each pixel within the RoI region is treated as a graph node. In total, there are H × W nodes $V_s$ in the graph. The intra-part spatial adjacency matrix $A_s$ is computed as follows, and is also shown in Fig. 4. We first pass the RoI feature $F_m \in \mathbb{R}^{H \times W \times C}$ through two parallel 1 × 1 convolution layers that are cascaded with ReLU activations. The resulting outputs are re-shaped prior to performing a matrix multiplication, followed by a softmax operation to compute the intra-part spatial adjacency matrix $A_s$. The output from the graph convolution layer (Eq. 2) is $\tilde{F}_m \in \mathbb{R}^{H \times W \times C}$. This output $\tilde{F}_m$ is first added to its input $F_m$ (the original RoI feature), and the sum is then passed through a fully connected layer to obtain a d-dimensional enhanced part feature. The enhanced part features $F_e$ of all six parts (five parts and the full body) are further used to capture the inter-part co-occurrence described next.
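A minimal sketch of this intra-part component, under the stated shapes (C channels, H × W nodes): two ReLU-cascaded 1 × 1 convolutions produce the operands of the softmax-normalized adjacency $A_s$, one graph convolution applies Eq. (2), and a residual connection plus a fully connected layer yields the d-dimensional part feature. The class and variable names are hypothetical, not the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class IntraPartCooccurrence(nn.Module):
    def __init__(self, c=64, h=7, w=7, d=1024):
        super().__init__()
        self.theta = nn.Conv2d(c, c, kernel_size=1)   # first 1x1 conv branch
        self.phi = nn.Conv2d(c, c, kernel_size=1)     # second 1x1 conv branch
        self.w_s = nn.Linear(c, c, bias=False)        # learnable W_s in Eq. (2)
        self.fc = nn.Linear(c * h * w, d)             # -> d-dim enhanced part feature

    def forward(self, f_m):               # f_m: (1, C, H, W) RoI feature of one part
        q = F.relu(self.theta(f_m)).flatten(2).transpose(1, 2)   # (1, HW, C)
        k = F.relu(self.phi(f_m)).flatten(2)                     # (1, C, HW)
        a_s = torch.softmax(q @ k, dim=-1)    # intra-part adjacency A_s: (1, HW, HW)
        x = f_m.flatten(2).transpose(1, 2)    # graph nodes: (1, HW, C)
        f_m_tilde = F.relu(self.w_s(a_s @ x)) # Eq. (2): sigma(A_s F_m W_s)
        out = f_m_tilde + x                   # residual: add the original RoI feature
        return self.fc(out.flatten(1))        # (1, d) enhanced part feature
```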
2) Inter-part Co-occurrence: Our inter-part co-occurrence component is designed to improve the feature representation, especially in case of severe occlusion of a particular body part. A traditional convolution layer only captures information from a small spatial neighborhood, defined by the kernel size (e.g., 3 × 3), and is therefore often ineffective for encoding the inter-part co-occurrence information of a pedestrian. To address this issue, we introduce an additional graph convolutional layer in our PSC module that captures the inter-part relationships of the different body parts and the full body of a pedestrian. We treat each part (including the full body) as a separate node ($V_p$) of a graph $G_p = (V_p, A_p)$, where $A_p$ denotes the spatial adjacency matrix capturing the neighborhood relationship of the different nodes. The graph convolutional operation is used to improve the node features $F_e$ as follows,

$$\tilde{F}_e = \sigma\big((I - A_p)\, F_e\, W_p\big) \tag{3}$$

where $F_e \in \mathbb{R}^{n \times d}$ denotes the enhanced features of the n parts, $A_p$ is the spatial adjacency matrix, $W_p \in \mathbb{R}^{d \times d}$ is a learnable parameter matrix and σ is the ReLU activation. Here, n = 6 is the number of nodes, and d = 1024 is the number of channels of each node. $(I - A_p)$ is used to conduct Laplacian smoothing [16] to propagate the node features over the graph. The spatial adjacency matrix $A_p$ captures the relation between different graph nodes. During heavy occlusion, the enhanced part features of a particular node may not contain relevant body part information. Therefore, it is desired to assign smaller weights to the edges linking such nodes in $A_p$. To this end, we introduce a self-attention scheme that assigns a learnable weight $a_{ij}$ to each edge. The input of the self-attention is the concatenated features of node i and node j. The self-attention weight of each edge, $a_{ij}$, is computed by a fully-connected operation followed by a sigmoid activation. The unnormalized spatial adjacency matrix $\hat{A}_p(i, j)$ is defined as:

$$\hat{A}_p(i, j) = a_{ij} = \mathrm{sigmoid}\big(\mathrm{FC}\,[F_e^i,\, F_e^j]\big)$$

The spatial adjacency matrix $A_p(i, j)$ is computed by conducting L2 normalization on each row of $\hat{A}_p(i, j)$. Afterwards, we employ a fully connected layer to merge all the features $\tilde{F}_e \in \mathbb{R}^{n \times d}$ into a d-dimensional feature vector. The resulting enriched features are then utilized as an input to the classification network, which predicts the final classification score and bounding-box regression.
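The inter-part component can be sketched as below, assuming for simplicity a fully connected part graph (the paper restricts edges to spatially adjacent parts, which would correspond to zeroing entries of the mask); the class name and everything other than n = 6 and d = 1024 are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InterPartCooccurrence(nn.Module):
    def __init__(self, n=6, d=1024):
        super().__init__()
        self.attn = nn.Linear(2 * d, 1)          # FC producing edge weight a_ij
        self.w_p = nn.Linear(d, d, bias=False)   # learnable W_p in Eq. (3)
        self.merge = nn.Linear(n * d, d)         # merge all nodes -> d-dim vector
        # 1 where two parts are treated as neighbors; fully connected here
        self.register_buffer("mask", torch.ones(n, n))

    def forward(self, f_e):                      # f_e: (n, d) enhanced part features
        n, d = f_e.shape
        pair = torch.cat([f_e.unsqueeze(1).expand(n, n, d),
                          f_e.unsqueeze(0).expand(n, n, d)], dim=-1)    # (n, n, 2d)
        a_hat = torch.sigmoid(self.attn(pair)).squeeze(-1) * self.mask  # unnormalized A_p
        a_p = F.normalize(a_hat, p=2, dim=1)     # L2-normalize each row
        eye = torch.eye(n, device=f_e.device)
        f_e_tilde = torch.relu((eye - a_p) @ self.w_p(f_e))  # Eq. (3)
        return self.merge(f_e_tilde.flatten())   # d-dim feature for the classifier
```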


IV. EXPERIMENTS

Datasets: We perform experiments on two datasets: CityPersons [51] and Caltech [10]. CityPersons [51] consists of 2975 training, 500 validation, and 1525 test images. Caltech [10] contains 11 sets of videos. The first 6 sets (0-5) are used for training and the last 5 sets (6-10) are used for testing. To get a large amount of training data, we sample the videos at 10 Hz. The training set consists of 42782 images in total.

Evaluation Metrics: We report the performance using the log-average miss rate (MR) throughout our experiments. Here, MR is computed over the false positives per image (FPPI) range of $[10^{-2}, 10^{0}]$ [10]. On CityPersons, we report the results across two different occlusion degrees: Reasonable (R) and Heavy Occlusion (HO). For both the R and HO sets, the height of pedestrians is larger than 50 pixels. The visibility ratio in the R set is larger than 65%, whereas in HO it ranges from 20% to 65%. In addition to these sets, results are reported on the combined (R + HO) set on Caltech.
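For reference, a minimal sketch of the standard log-average miss rate computation of [10], assuming NumPy and a precomputed miss-rate-vs-FPPI curve; the array and function names are hypothetical.

```python
import numpy as np

def log_average_miss_rate(fppi, miss_rate):
    """fppi, miss_rate: arrays tracing the detector's miss-rate/FPPI curve,
    with fppi sorted in increasing order."""
    refs = np.logspace(-2.0, 0.0, num=9)   # 9 points evenly spaced in log space
    sampled = []
    for r in refs:
        idx = np.where(fppi <= r)[0]
        # miss rate at the largest FPPI not exceeding the reference point
        sampled.append(miss_rate[idx[-1]] if idx.size else 1.0)
    sampled = np.maximum(np.asarray(sampled), 1e-10)  # guard against log(0)
    return float(np.exp(np.mean(np.log(sampled))))    # geometric mean
```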

Implementation Details: For both datasets, we train our network on an NVIDIA GPU with a mini-batch consisting of two images per GPU. The Adam [13] solver is selected as the optimizer. In case of CityPersons, we fine-tune an ImageNet pre-trained VGG model [41] on the train set of CityPersons. We follow the same experimental protocol as in [51] and employ two fully connected layers with 1024 instead of 4096 output dimensions. The initial learning rate is set to $10^{-4}$ for the first 8 epochs, and decayed by a factor of 10 for another 3 epochs. For Caltech, we start with a model that is pre-trained on CityPersons. An initial learning rate of $10^{-5}$ is used for the first 3 training epochs and decayed to $10^{-6}$ for one additional training epoch. Both the source code and models will be publicly available.
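The CityPersons schedule above amounts to the following, shown as a hedged PyTorch sketch; the stand-in model and the epoch loop body are placeholders.

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 2)  # stand-in for the detector network (illustration only)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # Adam [13]
# 8 epochs at 1e-4, then decayed by a factor of 10 for 3 more epochs
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[8], gamma=0.1)

for epoch in range(11):
    # ... iterate mini-batches of two images per GPU and call optimizer.step() ...
    scheduler.step()
```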

A. CityPersons Dataset

Baseline Comparison: As discussed earlier, the core of our design is the introduction of a part spatial co-occurrence (PSC) module that explicitly captures both intra-part (Sec. III-B1) and inter-part (Sec. III-B2) co-occurrence information of different body parts. Tab. I shows the impact of integrating our intra-part and inter-part co-occurrence components into the baseline. For a fair comparison, all results in Tab. I are reported by using the same set of ground-truth pedestrian examples during training. All ground-truth pedestrian examples which are at least 50 pixels tall with visibility ≥ 65% are utilized for training. Further, an input scale of 1.0× is employed during the experiments. The integration of each component into the baseline results in a consistent improvement in performance. Further, our final PSC-Net that integrates both the intra-part and inter-part co-occurrence achieves absolute gains of 3.2% and 6.6% on the R and HO sets, respectively, over the baseline. These results demonstrate that both components are required to obtain optimal performance.

State-of-the-art Comparison: Here, we perform a comparison of our PSC-Net with state-of-the-art pedestrian detection methods in the literature. Tab. II shows the comparison on the CityPersons validation set. Note that existing approaches utilize different sets of ground-truth pedestrian examples for training. For a fair comparison, we therefore select the same set of ground-truth pedestrian examples (denoted as data (visibility) in Tab. II) and input scale when performing a comparison with each state-of-the-art method. Our PSC-Net achieves superior performance in all these settings on both the R and HO sets, compared to the state-of-the-art methods.
When using an input scale of 1× and data visibility (≥ 65%), the attention-based approaches F.RCNN+ATT-part [52] and F.RCNN+ATT-vbb [52] obtain log-average miss rates of 16.0, 56.7 and 16.4, 57.3 on the R and HO sets, respectively. The work of [46] based on the Repulsion Loss obtains log-average miss rates of 13.2 and 56.9 on the R and HO sets, respectively. The Adaptive-NMS approach [20], which applies a dynamic suppression threshold and learns density scores, obtains log-average miss rates of 11.9 and 55.2 on the R and HO sets, respectively. MGAN [34] learns a spatial attention mask using VBB information to modulate full-body features and achieves log-average miss rates of 11.5 and 51.7 on the R and HO sets, respectively. Our PSC-Net outperforms MGAN, without using VBB supervision, on both sets with log-average miss rates of 10.6 and 50.2. When using the same data visibility but a 1.3× input scale, Adaptive-NMS [20] and MGAN [34] achieve log-average miss rates of 54.0 and 49.6, respectively, on the HO set. The same two approaches report 10.8 and 10.3 on the R set. PSC-Net achieves superior results with log-average miss rates of 9.8 and 48.3 on the R and HO sets, respectively. On this dataset, the best existing results of 10.5 and 39.4 are reported [34] on the R and HO sets, respectively, when using an input scale of 1.3× and data visibility (≥ 30%). Our PSC-Net outperforms the state-of-the-art [34] with log-average miss rates of 9.9 and 37.2 on the two sets.
Tab. III shows the comparison on the CityPersons test set. Among existing methods, the multi-stage Cascade MS-CNN [7], consisting of a sequence of detectors trained with increasing IoU thresholds, obtains a log-average miss rate of 47.1 on the HO set. MGAN [34] obtains a log-average miss rate of 41.0 on the same set. Our PSC-Net significantly reduces the error by 4.0% over MGAN on the HO set. Similarly, PSC-Net also improves the performance on the R set.

B. Caltech Dataset
Finally, we evaluate our PSC-Net on Caltech and compare it with state-of-the-art approaches in the literature. Tab. II shows the comparison on the Caltech test set under three sets: R, HO and R + HO. Among existing methods, ATT-vbb [52], Bi-box [57], FRCN+A+DT [54] and MGAN [34] address the problem of occlusions by utilizing VBB information. On the R, HO and R + HO subsets, AR-Ped [3], FRCN+A+DT [54] and MGAN [34] report the best existing performance, respectively. PSC-Net achieves superior detection performance on all three subsets with log-average miss rates of 6.4, 34.8 and 12.7, respectively. Fig. 5 displays example detections showing a visual comparison of our PSC-Net with the recently introduced AR-Ped [3] and MGAN [34] under occlusions. All detection results in Fig. 5 are obtained using the same false positives per image (FPPI) criterion. Our PSC-Net provides improved detections under these occlusion scenarios compared to the other two approaches.


V. CONCLUSION
We proposed a two-stage approach, PSC-Net, for occluded pedestrian detection. Our PSC-Net consists of a standard pedestrian detection branch and a part spatial co-occurrence (PSC) module. The focus of our design is the PSC module, which is designed to capture the intra-part and inter-part spatial co-occurrence of different body parts through a Graph Convolutional Network (GCN). Our PSC module only requires standard full-body supervision and exploits the topological structure of a pedestrian. Experiments are performed on two popular datasets: CityPersons and Caltech. Our results clearly demonstrate that the proposed PSC-Net significantly outperforms the baseline in all cases. Further, PSC-Net sets a new state-of-the-art on both datasets.

