Principle: reference paper

Joint Face Detection and Alignment using Multi-task Cascaded Convolutional Networks

Abstract: Face detection and alignment in unconstrained environments are challenging due to various poses, illuminations and occlusions. Recent studies show that deep learning approaches can achieve impressive performance on these two tasks. In this paper, we propose a deep cascaded multi-task framework which exploits the inherent correlation between detection and alignment to boost their performance. In particular, our framework leverages a cascaded architecture with three stages of carefully designed deep convolutional networks to predict face and landmark locations in a coarse-to-fine manner. In addition, we propose a new online hard sample mining strategy that further improves the performance in practice. Our method achieves superior accuracy over the state-of-the-art techniques on the challenging FDDB and WIDER FACE benchmarks for face detection, and the AFLW benchmark for face alignment, while keeping real-time performance.

Index Terms: face detection, face alignment, cascaded convolutional neural network

I. INTRODUCTION

Face detection and alignment are essential to many face applications, such as face recognition and facial expression analysis. However, the large visual variations of faces, such as occlusions, large pose variations and extreme lighting, impose great challenges for these tasks in real-world applications.

The cascade face detector proposed by Viola and Jones [2] utilizes Haar-like features and AdaBoost to train cascaded classifiers, which achieve good performance with real-time efficiency. However, quite a few works [1, 3, 4] indicate that this kind of detector may degrade significantly in real-world applications with larger visual variations of human faces, even with more advanced features and classifiers. Besides the cascade structure, [5, 6, 7] introduce deformable part models (DPM) for face detection and achieve remarkable performance. However, they are computationally expensive and usually require expensive annotation in the training stage. Recently, convolutional neural networks (CNNs) have achieved remarkable progress in a variety of computer vision tasks, such as image classification [9] and face recognition [10]. Inspired by the significant successes of deep learning methods in computer vision tasks, several studies utilize deep CNNs for face detection. Yang et al. [11] train deep convolutional neural networks for facial attribute recognition to obtain high responses in face regions, which further yield candidate windows of faces. However, due to its complex CNN structure, this approach is time-costly in practice. Li et al. [19] use cascaded CNNs for face detection, but their approach requires bounding box calibration from face detection with extra computational expense and ignores the inherent correlation between facial landmark localization and bounding box regression.

Face alignment also attracts extensive research interest. Research in this area can be roughly divided into two categories: regression-based methods [12, 13, 16] and template fitting approaches [14, 15, 7]. Recently, Zhang et al. [22] proposed to use facial attribute recognition as an auxiliary task to enhance face alignment performance using a deep convolutional neural network.

However, most previous face detection and face alignment methods ignore the inherent correlation between these two tasks. Though several existing works attempt to solve them jointly, these works still have limitations. For example, Chen et al. [18] jointly conduct alignment and detection with a random forest using features of pixel value differences, but these handcrafted features limit its performance considerably. Zhang et al. [20] use a multi-task CNN to improve the accuracy of multi-view face detection, but the detection recall is limited by the initial detection windows produced by a weak face detector.

On the other hand, mining hard samples in training is critical to strengthen the power of the detector. However, traditional hard sample mining is usually performed offline, which significantly increases the manual operations. It is desirable to design an online hard sample mining method for face detection that adapts to the current training status automatically.

In this paper, we propose a new framework to integrate these two tasks using unified cascaded CNNs by multi-task learning. The proposed CNNs consist of three stages. In the first stage, it produces candidate windows quickly through a shallow CNN. Then, it refines the windows by rejecting a large number of non-face windows through a more complex CNN. Finally, it uses a more powerful CNN to refine the result again and output five facial landmark positions. Thanks to this multi-task learning framework, the performance of the algorithm can be notably improved. The code has been released on the project page.

The major contributions of this paper are summarized as follows: (1) We propose a new cascaded-CNN-based framework for joint face detection and alignment, and carefully design a lightweight CNN architecture for real-time performance. (2) We propose an effective method to conduct online hard sample mining to improve the performance. (3) Extensive experiments are conducted on challenging benchmarks to show the significant performance improvement of the proposed approach compared to the state-of-the-art techniques in both face detection and face alignment tasks.

II. APPROACH

In this section, we describe our approach towards joint face detection and alignment.

A. Overall Framework

The overall pipeline of our approach is shown in Fig. 1. Given an image, we initially resize it to different scales to build an image pyramid, which is the input of the following three-stage cascaded framework:

Stage 1: We exploit a fully convolutional network, called Proposal Network (P-Net), to obtain the candidate facial windows and their bounding box regression vectors. Then candidates are calibrated based on the estimated bounding box regression vectors. After that, we employ non-maximum suppression (NMS) to merge highly overlapped candidates.

Stage 2: All candidates are fed to another CNN, called Refine Network (R-Net), which further rejects a large number of false candidates, performs calibration with bounding box regression, and conducts NMS.

Stage 3: This stage is similar to the second stage, but in this stage we aim to identify face regions with more supervision. In particular, the network, referred to as O-Net in the rest of the text, will output the positions of five facial landmarks.
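
The image pyramid and the NMS step that closes each stage can be sketched as follows. This is a minimal pure-Python sketch, not the authors' code: the function names, the 12-pixel network input size, the 0.709 scale factor and the 0.5 overlap threshold are illustrative assumptions that the text above does not state.

```python
def pyramid_scales(height, width, min_face=20, net_input=12, factor=0.709):
    """Return the scales at which the image is resized for the first stage.

    All default values are assumptions chosen for illustration.
    """
    scale = net_input / min_face          # map the smallest face onto the net input
    min_side = min(height, width) * scale
    scales = []
    while min_side >= net_input:          # stop once the image is smaller than the input
        scales.append(scale)
        scale *= factor
        min_side *= factor
    return scales


def iou(a, b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, threshold=0.5):
    """Greedy NMS: keep the best-scoring box, drop boxes overlapping it, repeat."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) <= threshold]
    return keep


boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))          # indices of the boxes that survive
print(pyramid_scales(480, 640))    # scales from largest to smallest
```

In the example, the second box overlaps the first heavily and is suppressed, while the third box is far away and is kept.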

B. CNN Architectures

In [19], multiple CNNs have been designed for face detection. However, we notice that their performance might be limited by the following facts: (1) Some filters in the convolution layers lack diversity, which may limit their discriminative ability. (2) Compared to other multi-class object detection and classification tasks, face detection is a challenging binary classification task, so it may need fewer filters per layer. To this end, we reduce the number of filters and change the 5×5 filters to 3×3 filters to reduce computation, while increasing the depth to get better performance. With these improvements, compared to the previous architecture in [19], we can get better performance with less runtime (the results in the training phase are shown in Table I; for a fair comparison, we use the same training and validation data in each group). Our CNN architectures are shown in Fig. 2. We apply PReLU [30] as the nonlinear activation function after the convolution and fully connected layers (except output layers).
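
To make the filter-size argument concrete, the sketch below counts the weights of a convolution layer for 5×5 and 3×3 filters and shows the PReLU activation. The channel counts and the slope value 0.25 are illustrative assumptions (in PReLU the slope is a learned parameter), and the helper names are hypothetical.

```python
def conv_weights(kernel, in_channels, out_channels):
    """Number of weights (excluding biases) in a kernel x kernel convolution layer."""
    return kernel * kernel * in_channels * out_channels


def prelu(x, slope=0.25):
    """PReLU: identity for positive inputs, a slope for negative ones.

    The slope is learned during training; it is fixed here only for illustration.
    """
    return x if x > 0 else slope * x


# Assumed example layer: 16 input channels, 32 filters.
w5 = conv_weights(5, 16, 32)
w3 = conv_weights(3, 16, 32)
print(w5, w3, w3 / w5)                 # a 3x3 layer has 9/25 of the weights of a 5x5 layer
print([prelu(v) for v in (-2.0, 0.0, 3.0)])
```

The saving per layer is what leaves room to add depth without increasing the overall cost.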

C. Training

We leverage three tasks to train our CNN detectors: face/non-face classification, bounding box regression, and facial landmark localization.

1) Face classification: The learning objective is formulated as a two-class classification problem. For each sample $x_i$, we use the cross-entropy loss:

$$L_i^{det} = -\left(y_i^{det}\log(p_i) + (1 - y_i^{det})\log(1 - p_i)\right) \tag{1}$$

where $p_i$ is the probability produced by the network that indicates sample $x_i$ being a face. The notation $y_i^{det} \in \{0, 1\}$ denotes the ground-truth label.
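
The classification loss for one sample can be sketched in a few lines of Python. The sketch assumes the standard binary cross-entropy form; the function name and the small clamp `eps` are illustrative additions, not part of the paper.

```python
import math


def face_classification_loss(p, y, eps=1e-12):
    """Binary cross-entropy for one sample.

    p: predicted probability that the sample is a face.
    y: ground-truth label, 1 for face and 0 for non-face.
    eps is an assumed clamp that avoids log(0).
    """
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


print(face_classification_loss(0.9, 1))   # confident and correct: small loss
print(face_classification_loss(0.9, 0))   # confident and wrong: large loss
```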

2) Bounding box regression: For each candidate window, we predict the offset between it and the nearest ground truth (i.e., the bounding box's left, top, height, and width). The learning objective is formulated as a regression problem, and we employ the Euclidean loss for each sample $x_i$:

$$L_i^{box} = \left\| \hat{y}_i^{box} - y_i^{box} \right\|_2^2 \tag{2}$$

where $\hat{y}_i^{box}$ is the regression target obtained from the network and $y_i^{box}$ is the ground-truth coordinate. There are four coordinates, including left, top, height and width, and thus $y_i^{box} \in \mathbb{R}^4$.

(Translator's note: "ground truth" refers to the correct labels in supervised learning. Data come as pairs $(x, t)$, where $x$ is the input and $t$ is the label; a correct label $t$ is the ground truth, while an incorrect label is not.)
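
A regression target and its squared Euclidean loss might be computed as below. This is a sketch under stated assumptions: the text does not say how the offset is normalised, so dividing by the candidate's size is an assumption, and both function names are hypothetical.

```python
def squared_l2(pred, target):
    """Squared Euclidean distance between two equal-length vectors."""
    if len(pred) != len(target):
        raise ValueError("vectors must have the same length")
    return sum((p - t) ** 2 for p, t in zip(pred, target))


def box_offset(candidate, truth):
    """Offset of a ground-truth box from a candidate window.

    Boxes are (left, top, height, width). Dividing by the candidate size is an
    assumed normalisation; the text does not specify one.
    """
    left, top, height, width = candidate
    return (
        (truth[0] - left) / width,
        (truth[1] - top) / height,
        (truth[2] - height) / height,
        (truth[3] - width) / width,
    )


candidate = (10.0, 20.0, 40.0, 40.0)
truth = (14.0, 18.0, 44.0, 36.0)
target = box_offset(candidate, truth)      # regression target, a vector in R^4
predicted = (0.0, 0.0, 0.0, 0.0)           # a network that predicts "no change"
print(target, squared_l2(predicted, target))
```

The same `squared_l2` helper applies unchanged to the ten-dimensional landmark vector.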

3) Facial landmark localization: Similar to the bounding box regression task, facial landmark detection is formulated as a regression problem and we minimize the Euclidean loss:

$$L_i^{landmark} = \left\| \hat{y}_i^{landmark} - y_i^{landmark} \right\|_2^2 \tag{3}$$

where $\hat{y}_i^{landmark}$ is the vector of facial landmark coordinates obtained from the network and $y_i^{landmark}$ is the ground-truth coordinate for the $i$-th sample. There are five facial landmarks, including left eye, right eye, nose, left mouth corner, and right mouth corner, and thus $y_i^{landmark} \in \mathbb{R}^{10}$.

4) Multi-source training: Since we employ different tasks in each CNN, there are different types of training images in the learning process, such as face, non-face, and partially aligned face. In this case, some of the loss functions (i.e., Eq. (1)-(3)) are not used. For example, for a sample from a background region, we only compute $L_i^{det}$, and the other two losses are set to 0. This can be implemented directly with a sample type indicator. The overall learning target can then be formulated as:

$$\min \sum_{i=1}^{N} \sum_{j \in \{det,\, box,\, landmark\}} \alpha_j \beta_i^j L_i^j \tag{4}$$

where $N$ is the number of training samples and $\alpha_j$ denotes the task importance. We use ($\alpha_{det} = 1$, $\alpha_{box} = 0.5$, $\alpha_{landmark} = 0.5$) in P-Net and R-Net, while ($\alpha_{det} = 1$, $\alpha_{box} = 0.5$, $\alpha_{landmark} = 1$) in O-Net for more accurate facial landmark localization. $\beta_i^j \in \{0, 1\}$ is the sample type indicator. In this case, it is natural to employ stochastic gradient descent to train these CNNs.

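
The weighted sum over tasks can be sketched as follows. The task weights are the ones quoted in the text; the indicator table follows the task assignment given in the training data section (negatives and positives for classification, positives and part faces for box regression, landmark faces for landmarks). The names `ALPHA`, `BETA` and `total_loss`, and the per-sample loss values, are hypothetical.

```python
# Task weights alpha_j per network, as quoted in the text.
ALPHA = {
    "pnet": {"det": 1.0, "box": 0.5, "landmark": 0.5},
    "rnet": {"det": 1.0, "box": 0.5, "landmark": 0.5},
    "onet": {"det": 1.0, "box": 0.5, "landmark": 1.0},
}

# Sample type indicator beta_i^j: which losses are active for each sample type.
BETA = {
    "negative": {"det": 1, "box": 0, "landmark": 0},
    "positive": {"det": 1, "box": 1, "landmark": 0},
    "part":     {"det": 0, "box": 1, "landmark": 0},
    "landmark": {"det": 0, "box": 0, "landmark": 1},
}


def total_loss(samples, net="pnet"):
    """Sum alpha_j * beta_i^j * L_i^j over all samples i and tasks j.

    samples: list of (sample_type, losses), where losses maps a task name to L_i^j.
    """
    alpha = ALPHA[net]
    total = 0.0
    for sample_type, losses in samples:
        beta = BETA[sample_type]
        for task in ("det", "box", "landmark"):
            total += alpha[task] * beta[task] * losses.get(task, 0.0)
    return total


batch = [
    ("negative", {"det": 0.2, "box": 9.9}),   # the box loss is masked out by beta
    ("positive", {"det": 0.1, "box": 0.4}),
    ("landmark", {"landmark": 0.6}),
]
print(total_loss(batch, "pnet"), total_loss(batch, "onet"))
```

The only difference between the two printed totals comes from the larger landmark weight in O-Net.
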
5) Online hard sample mining: Unlike traditional hard sample mining, which is conducted after the original classifier has been trained, we conduct online hard sample mining in the face/non-face classification task, so that it adapts to the training process.

In particular, in each mini-batch, we sort the losses computed in the forward propagation from all samples and select the top 70% of them as hard samples. Then we only compute the gradients from these hard samples in the backward propagation. That means we ignore the easy samples that are less helpful for strengthening the detector during training. Experiments show that this strategy yields better performance without manual sample selection. Its effectiveness is demonstrated in Section III.
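
The selection step can be sketched as follows. This is a minimal pure-Python sketch rather than the authors' implementation: the function names are hypothetical, and rounding the number of kept samples up is an assumption.

```python
import math


def hard_sample_indices(losses, keep_ratio=0.7):
    """Indices of the samples with the largest losses in a mini-batch.

    keep_ratio=0.7 follows the 70% stated in the text; rounding up is an assumption.
    """
    if not losses:
        return []
    # The small tolerance guards against floating-point error in the product.
    keep = max(1, math.ceil(len(losses) * keep_ratio - 1e-9))
    ranked = sorted(range(len(losses)), key=lambda i: losses[i], reverse=True)
    return sorted(ranked[:keep])


def mined_loss(losses, keep_ratio=0.7):
    """Mean loss over the hard samples only; easy samples contribute no gradient."""
    hard = hard_sample_indices(losses, keep_ratio)
    return sum(losses[i] for i in hard) / len(hard) if hard else 0.0


batch_losses = [0.05, 1.2, 0.3, 0.01, 0.9, 0.4, 0.02, 0.7, 0.6, 0.1]
print(hard_sample_indices(batch_losses))   # the three smallest losses are dropped
print(mined_loss(batch_losses))
```

Because the ranking is recomputed for every mini-batch, the set of hard samples follows the current state of the network without any manual selection.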

III. EXPERIMENTS

In this section, we first evaluate the effectiveness of the proposed hard sample mining strategy. Then we compare our face detector and alignment against the state-of-the-art methods on the Face Detection Data Set and Benchmark (FDDB) [25], WIDER FACE [24], and the Annotated Facial Landmarks in the Wild (AFLW) benchmark [8]. The FDDB dataset contains annotations for 5,171 faces in a set of 2,845 images. The WIDER FACE dataset consists of 393,703 labeled face bounding boxes in 32,203 images, of which 50% are used for testing (divided into three subsets according to the difficulty of the images), 40% for training and the remainder for validation. AFLW contains facial landmark annotations for 24,386 faces, and we use the same test subset as [22]. Finally, we evaluate the computational efficiency of our face detector.

A. Training Data

Since we jointly perform face detection and alignment, we use four different kinds of data annotation in our training process:

(i) Negatives: regions whose Intersection-over-Union (IoU) ratio is less than 0.3 to any ground-truth face;
(ii) Positives: IoU above 0.65 to a ground-truth face;
(iii) Part faces: IoU between 0.4 and 0.65 to a ground-truth face;
(iv) Landmark faces: faces labeled with the positions of 5 landmarks.

There is an unclear gap between part faces and negatives, and there are variances among different face annotations, so we leave an IoU gap between 0.3 and 0.4; regions in this range are assigned to neither class. Negatives and positives are used for the face classification task, positives and part faces are used for bounding box regression, and landmark faces are used for facial landmark localization. The total training data are composed in a 3:1:1:2 ratio (negatives / positives / part faces / landmark faces). The training data collection for each network is described as follows:

1) P-Net: We randomly crop several patches from WIDER FACE [24] to collect positives, negatives and part faces. Then, we crop faces from CelebA [23] as landmark faces.

2) R-Net: We use the first stage of our framework to detect faces from WIDER FACE [24] to collect positives, negatives and part faces, while landmark faces are detected from CelebA [23].

3) O-Net: Data are collected as for R-Net, but we use the first two stages of our framework to detect faces and collect data.
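
The IoU thresholds for negatives, positives and part faces can be turned into a small labelling routine. This is an illustrative sketch: the `(x1, y1, x2, y2)` box format and the function names are assumptions, and returning `None` for the 0.3 to 0.4 gap reflects the reading that such regions are not used.

```python
def iou(a, b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def label_region(region, truths):
    """Assign a cropped region to negative, positive, part, or None (unused gap)."""
    best = max((iou(region, t) for t in truths), default=0.0)
    if best < 0.3:
        return "negative"
    if best > 0.65:
        return "positive"
    if best >= 0.4:
        return "part"
    return None          # IoU in the 0.3 to 0.4 gap: the region is discarded


truths = [(0, 0, 100, 100)]
for region in [(0, 0, 100, 100), (0, 0, 100, 50), (0, 0, 100, 35), (200, 200, 300, 300)]:
    print(region, label_region(region, truths))
```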

B. The effectiveness of online hard sample mining

To evaluate the contribution of the proposed online hard sample mining strategy, we train two P-Nets (with and without online hard sample mining) and compare their performance on FDDB. Fig. 3 (a) shows the results from the two different P-Nets on FDDB. It is clear that online hard sample mining improves performance: it brings about a 1.5% overall performance improvement on FDDB.

C. The effectiveness of joint detection and alignment

To evaluate the contribution of joint detection and alignment, we evaluate the performance of two different O-Nets (one trained jointly with facial landmark regression and one without) on FDDB, with the same P-Net and R-Net. We also compare the performance of bounding box regression in these two O-Nets. Fig. 3 (b) suggests that jointly learning the landmark localization task helps to enhance both the face classification and bounding box regression tasks.

Fig. 3. (a) Detection performance of P-Net with and without online hard sample mining. (b) "JA" denotes joint face alignment learning in O-Net, while "No JA" denotes training without it. "No JA in BBR" denotes using the "No JA" O-Net for bounding box regression.

D. Evaluation on face detection

To evaluate the performance of our face detection method, we compare our method against the state-of-the-art methods [1, 5, 6, 11, 18, 19, 26, 27, 28, 29] on FDDB, and the state-of-the-art methods [1, 24, 11] on WIDER FACE. Fig. 4 (a)-(d) shows that our method consistently outperforms all the compared approaches by a large margin on both benchmarks.

Fig. 4. (a) Evaluation on FDDB. (b-d) Evaluation on three subsets of WIDER FACE. The number following the method indicates the average accuracy.

E. Evaluation on face alignment

In this part, we compare the face alignment performance of our method against the following methods: RCPR [12], TSPM [7], Luxand face SDK [17], ESR [13], CDM [15], SDM [21], and TCDCN [22]. The mean error is measured by the distances between the estimated landmarks and the ground truths, normalized with respect to the inter-ocular distance. Fig. 5 shows that our method outperforms all the state-of-the-art methods by a margin. It also shows that our method has less of an advantage in mouth corner localization. This may result from the small variance of expression in our training data, since expression has a significant influence on mouth corner position.
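
The normalized mean error might be computed as below. This is a sketch under stated assumptions: measuring the inter-ocular distance on the ground-truth eye positions is an assumption, the landmark order follows the list given in the training section (left eye, right eye, nose, left mouth corner, right mouth corner), and the function name is hypothetical.

```python
import math


def normalized_mean_error(pred, truth):
    """Mean landmark error divided by the inter-ocular distance.

    pred, truth: five (x, y) points ordered left eye, right eye, nose,
    left mouth corner, right mouth corner.
    """
    # Assumed normaliser: distance between the two ground-truth eye points.
    inter_ocular = math.hypot(truth[1][0] - truth[0][0], truth[1][1] - truth[0][1])
    errors = [math.hypot(p[0] - t[0], p[1] - t[1]) for p, t in zip(pred, truth)]
    return sum(errors) / len(errors) / inter_ocular


truth = [(30, 40), (70, 40), (50, 60), (35, 80), (65, 80)]
pred = [(x + 3, y + 4) for x, y in truth]    # every landmark is off by 5 pixels
print(normalized_mean_error(pred, truth))
```

Dividing by the inter-ocular distance makes errors comparable across faces of different sizes.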

Fig. 5. Evaluation on AFLW for face alignment.

F. Runtime efficiency

Given the cascade structure, our method can achieve high speed in joint face detection and alignment. We compare our method with the state-of-the-art techniques on GPU, and the results are shown in Table II. Note that our current implementation is based on unoptimized MATLAB code.

Table II. Speed comparison of our method and other methods.

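IV. CONCLUSION

In this paper, we have proposed a multi-task cascaded CNN-based framework for joint face detection and alignment. Experimental results show that our method consistently outperforms the state-of-the-art methods across several challenging benchmarks (including FDDB and WIDER FACE for face detection, and AFLW for face alignment), while achieving real-time performance on 640×480 VGA images with a minimum face size of 20×20. The main contributions to the performance improvement are the carefully designed cascaded CNN architecture, the online hard sample mining strategy, and joint face detection and alignment learning.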
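Translator's note: in many places I felt I understood what the paper meant but was not sure how best to translate it, so I translated literally. Some passages do not read smoothly; the general sense should still come through.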
