Paper: http://arxiv.org/pdf/1801.07698v1.pdf
Translation of the latest version (v3): Arcface v3 论文翻译与解读

The ArcFace v1 paper is fairly long; preparing this annotated version took about three days, and I hope it helps readers understand the paper better.

ArcFace: Additive Angular Margin Loss for Deep Face Recognition
Contents
Abstract

1. Introduction

2. From Softmax to ArcFace

2.1. Softmax

2.2. Weights Normalisation

2.3. Multiplicative Angular Margin

2.4. Feature Normalisation

2.5. Additive Cosine Margin

2.6. Additive Angular Margin

2.7. Comparison under Binary Case

2.8. Target Logit Analysis

3. Experiments

3.1. Data

3.1.1 Training data

3.1.2 Validation data

3.1.3 Test data

3.2. Network Settings

3.2.1 Input setting

3.2.2 Output setting

3.2.3 Block Setting

3.2.4 Backbones

3.2.5 Network Setting Conclusions

3.3. Loss Setting

3.4. MegaFace Challenge1 on FaceScrub

3.5. Further Improvement by Triplet Loss

4. Conclusions


Abstract
Convolutional neural networks have significantly boosted the performance of face recognition in recent years due to their high capacity in learning discriminative features. To enhance the discriminative power of the Softmax loss, multiplicative angular margin [23] and additive cosine margin [44, 43] incorporate an angular margin and a cosine margin into the loss function, respectively. In this paper, we propose a novel supervision signal, additive angular margin (ArcFace), which has a better geometrical interpretation than the supervision signals proposed so far. Specifically, the proposed ArcFace cos(θ + m) directly maximises the decision boundary in angular (arc) space based on the L2-normalised weights and features. Compared to the multiplicative angular margin cos(mθ) and the additive cosine margin cos θ - m, ArcFace can obtain more discriminative deep features. We also emphasise the importance of network settings and data refinement in the problem of deep face recognition. Extensive experiments on several relevant face recognition benchmarks, LFW, CFP and AgeDB, prove the effectiveness of the proposed ArcFace. Most importantly, we achieve state-of-the-art performance in the MegaFace Challenge in a totally reproducible way. We make data, models and training/test code publicly available.

Figure 1. Geometrical interpretation of ArcFace. (a) Blue and green points represent embedding features from two different classes. ArcFace can directly impose an angular (arc) margin between classes. (b) There is an intuitive correspondence between the angle and the arc margin: the angular margin of ArcFace corresponds to the arc margin (geodesic distance) on the hypersphere.

1. Introduction
Face representation through deep convolutional network embedding is considered the state-of-the-art method for face verification, face clustering, and face recognition [42, 35, 31]. The deep convolutional network is responsible for mapping the face image, typically after a pose normalisation step, into an embedding feature vector such that features of the same person have a small distance while features of different individuals have a considerable distance. The various face recognition approaches based on deep convolutional network embedding differ along three primary attributes.

The first attribute is the training data employed to train the model. The identity number of publicly available training data, such as VGG-Face [31], VGG2-Face [7], CASIA-WebFace [48], UMDFaces [6], MS-Celeb-1M [11], and MegaFace [21], ranges from several thousand to half a million. Although MS-Celeb-1M and MegaFace have a significant number of identities, they suffer from annotation noise [47] and long tail distributions [50]. By comparison, the private training data of Google [35] has several million identities. As we can check from the latest performance report of the Face Recognition Vendor Test (FRVT) [4], Yitu, a start-up company from China, ranks first based on their private 1.8 billion face images [5]. Due to the orders-of-magnitude difference in training data scale, face recognition models from industry perform much better than models from academia. The difference in training data also makes some deep face recognition results [2] not fully reproducible.

The second attribute is the network architecture and settings. High capacity deep convolutional networks, such as ResNet [14, 15, 46, 50, 23] and Inception-ResNet [40, 3], can obtain better performance compared to the VGG network [37, 31] and the Google Inception V1 network [41, 35]. Different applications of deep face recognition prefer different trade-offs between speed and accuracy [16, 51]. For face verification on mobile devices, real-time running speed and compact model size are essential for a slick customer experience. For billion-level security systems, high accuracy is as important as efficiency.

The third attribute is the design of the loss functions.

(1) Euclidean margin based loss.

In [42] and [31], a Softmax classification layer is trained over a set of known identities. The feature vector is then taken from an intermediate layer of the network and used to generalise recognition beyond the set of identities used in training. Centre loss [46], Range loss [50] and Marginal loss [10] add an extra penalty to compress the intra-class variance or enlarge the inter-class distance to improve the recognition rate, but all of them still combine Softmax to train recognition models. However, the classification-based methods [42, 31] suffer from massive GPU memory consumption on the classification layer when the identity number increases to the million level, and prefer balanced and sufficient training data for each identity.

The contrastive loss [39] and the Triplet loss [35] utilise a pair training strategy. The contrastive loss function consists of positive pairs and negative pairs. The gradients of the loss function pull together positive pairs and push apart negative pairs. Triplet loss minimises the distance between an anchor and a positive sample and maximises the distance between the anchor and a negative sample from a different identity. However, the training procedure of the contrastive loss [39] and the Triplet loss [35] is tricky due to the selection of effective training samples.

(2) Angular and cosine margin based loss.

Liu et al. [24] proposed a large margin Softmax (L-Softmax) by adding multiplicative angular constraints to each identity to improve feature discrimination. SphereFace cos(mθ) [23] applies L-Softmax to deep face recognition with weight normalisation. Due to the non-monotonicity of the cosine function, a piece-wise function is applied in SphereFace to guarantee monotonicity. During the training of SphereFace, the Softmax loss is combined to facilitate and ensure convergence. To overcome the optimisation difficulty of SphereFace, additive cosine margin [44, 43] cos(θ) - m moves the angular margin into cosine space. The implementation and optimisation of the additive cosine margin are much easier than SphereFace. The additive cosine margin is easily reproducible and achieves state-of-the-art performance on MegaFace (TencentAILab_FaceCNN_v1) [2].

Compared to the Euclidean margin based losses, the angular and cosine margin based losses explicitly add discriminative constraints on a hypersphere manifold, which intrinsically matches the prior that the human face lies on a manifold.

As is well known, the above mentioned three attributes, data, network and loss, have a high-to-low influence on the performance of face recognition models. In this paper, we contribute to improving deep face recognition from all of these three attributes.

Data. We refined the largest publicly available training data, MS-Celeb-1M [11], in both an automatic and a manual way. We checked the quality of the refined MS1M dataset with the ResNet-27 [14, 50, 10] network and the marginal loss [10] on the NIST Face Recognition Prize Challenge. We also find that there are hundreds of overlapping face images between the MegaFace one million distractors and the FaceScrub dataset, which significantly affects the evaluation results. (In the MegaFace challenge, the one million distractor images are drawn from Flickr, while the probe images come from the FaceScrub dataset.) We manually find these overlapping face images among the MegaFace distractors. Both the refinement of the training data and the test data will be publicly available.

Network. Taking VGG2 [7] as the training data, we conduct extensive contrast experiments regarding the convolutional network settings and report the verification accuracy on LFW, CFP and AgeDB. The proposed network settings have been confirmed robust under large pose and age variations. We also explore the trade-off between speed and accuracy based on the most recent network structures.

Loss. We propose a new loss function, additive angular margin (ArcFace), to learn highly discriminative features for robust face recognition. As shown in Figure 1, the proposed loss function cos(θ + m) directly maximises the decision boundary in angular (arc) space based on the L2-normalised weights and features. We show that ArcFace not only has a clearer geometrical interpretation but also outperforms the baseline methods, e.g. multiplicative angular margin [23] and additive cosine margin [44, 43]. We innovatively explain why ArcFace is better than Softmax, SphereFace [23] and CosineFace [44, 43] from the view of semi-hard sample distributions.

Performance. The proposed ArcFace achieves state-of-the-art results on the MegaFace Challenge [21], which is the largest public face benchmark with one million faces for recognition. We make these results totally reproducible with data, trained models and training/test code publicly available.

2. From Softmax to ArcFace 
2.1. Softmax
The most widely used classification loss function, the Softmax loss, is presented as follows:

$$L = -\frac{1}{m}\sum_{i=1}^{m}\log\frac{e^{W_{y_i}^{T}x_i + b_{y_i}}}{\sum_{j=1}^{n}e^{W_j^{T}x_i + b_j}}$$

where x_i ∈ ℝ^d denotes the deep feature of the i-th sample, belonging to the y_i-th class. The feature dimension d is set as 512 in this paper following [46, 50, 23, 43]. W_j ∈ ℝ^d denotes the j-th column of the weights W ∈ ℝ^{d×n} in the last fully connected layer and b_j is the bias term. The batch size and the class number are m and n, respectively. The traditional Softmax loss is widely used in deep face recognition [31, 7]. However, the Softmax loss function does not explicitly optimise the features to have a higher similarity score for positive pairs and a lower similarity score for negative pairs, which leads to a performance gap.
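To make this starting point concrete, here is a minimal PyTorch sketch of a plain Softmax classification head over embedding features; the class name, toy dimensions and random inputs are illustrative, not from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftmaxHead(nn.Module):
    """Plain Softmax classification head: logits = W^T x + b."""
    def __init__(self, feat_dim=512, num_classes=10):
        super().__init__()
        self.fc = nn.Linear(feat_dim, num_classes, bias=True)  # W in R^{d x n}, b in R^n

    def forward(self, embeddings, labels):
        logits = self.fc(embeddings)            # W_j^T x_i + b_j
        return F.cross_entropy(logits, labels)  # -log softmax probability of the true class

# toy usage
head = SoftmaxHead(feat_dim=512, num_classes=10)
x = torch.randn(4, 512)           # batch of deep features (m = 4, d = 512)
y = torch.randint(0, 10, (4,))    # ground-truth identities
loss = head(x, y)
```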

2.2. Weights Normalisation
For simplicity, we fix the bias b_j = 0 as in [23]. Then, we transform the target logit [32] as follows:

$$W_j^{T}x_i = \|W_j\|\,\|x_i\|\cos\theta_j$$

where θ_j is the angle between the weight W_j and the feature x_i. Following [23, 43, 45], we fix ||W_j|| = 1 by L2 normalisation, which makes the predictions only depend on the angle between the feature vector and the weight:

$$L = -\frac{1}{m}\sum_{i=1}^{m}\log\frac{e^{\|x_i\|\cos\theta_{y_i}}}{e^{\|x_i\|\cos\theta_{y_i}} + \sum_{j=1,\,j\neq y_i}^{n}e^{\|x_i\|\cos\theta_j}}$$

In the experiments of SphereFace, L2 weight normalisation only improves the performance slightly.
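A small sketch of this weight-normalisation step, assuming PyTorch: the bias is dropped and each class weight W_j is L2-normalised, so the logit reduces to ||x_i||·cos θ_j. The function and variable names are mine.

```python
import torch
import torch.nn.functional as F

def weight_normalised_logits(embeddings, weight):
    """Logits after fixing b_j = 0 and ||W_j|| = 1: logit_j = ||x_i|| * cos(theta_j)."""
    w = F.normalize(weight, dim=1)                            # one unit-norm row per class
    cos_theta = F.linear(F.normalize(embeddings, dim=1), w)   # cos(theta_j)
    return embeddings.norm(dim=1, keepdim=True) * cos_theta   # re-attach ||x_i||

# toy usage: 8 classes, 512-d features
W = torch.randn(8, 512)
x = torch.randn(4, 512)
logits = weight_normalised_logits(x, W)   # shape (4, 8)
```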

2.3. Multiplicative Angular Margin
In SphereFace [23, 24], the angular margin m is introduced by multiplication on the angle:

$$L = -\frac{1}{m}\sum_{i=1}^{m}\log\frac{e^{\|x_i\|\cos(m\theta_{y_i})}}{e^{\|x_i\|\cos(m\theta_{y_i})} + \sum_{j=1,\,j\neq y_i}^{n}e^{\|x_i\|\cos\theta_j}}$$

where θ_{y_i} ∈ [0, π/m]. In order to remove this restriction, cos(mθ_{y_i}) is substituted by a piece-wise monotonic function ψ(θ_{y_i}), and SphereFace is formulated as:

$$L = -\frac{1}{m}\sum_{i=1}^{m}\log\frac{e^{\|x_i\|\psi(\theta_{y_i})}}{e^{\|x_i\|\psi(\theta_{y_i})} + \sum_{j=1,\,j\neq y_i}^{n}e^{\|x_i\|\cos\theta_j}}$$

where ψ(θ_{y_i}) = (-1)^k cos(mθ_{y_i}) - 2k, θ_{y_i} ∈ [kπ/m, (k+1)π/m], k ∈ [0, m-1], and m ≥ 1 is the integer that controls the size of the angular margin. However, during the implementation of SphereFace, Softmax supervision is incorporated to guarantee the convergence of training, and its weight is controlled by a dynamic hyper-parameter λ. With the additional Softmax loss, ψ(θ_{y_i}) in fact is:

$$\psi(\theta_{y_i}) = \frac{(-1)^{k}\cos(m\theta_{y_i}) - 2k + \lambda\cos(\theta_{y_i})}{1 + \lambda}$$

where λ is an additional hyper-parameter to facilitate the training of SphereFace. λ is set to 1,000 at the beginning and decreases to 5 to make the angular space of each class more compact [23]. This additional dynamic hyper-parameter λ makes the training of SphereFace relatively tricky.
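The piece-wise function ψ(θ) and its λ-softened form can be sketched in a few lines of numpy; the values shown only illustrate the shape of the curves, and the λ annealing schedule (1,000 → 5) is not reproduced here.

```python
import numpy as np

def psi(theta, m=4):
    """Piece-wise monotonic psi(theta) = (-1)^k cos(m*theta) - 2k on [k*pi/m, (k+1)*pi/m]."""
    k = np.minimum(np.floor(theta * m / np.pi), m - 1).astype(int)   # segment index in [0, m-1]
    return ((-1.0) ** k) * np.cos(m * theta) - 2.0 * k

def psi_annealed(theta, m=4, lam=5.0):
    """psi softened by the extra Softmax term with weight lambda."""
    return (psi(theta, m) + lam * np.cos(theta)) / (1.0 + lam)

theta = np.linspace(0.0, np.pi, 5)
print(psi(theta, m=4))            # monotonically decreasing on [0, pi]
print(psi_annealed(theta, m=4))   # closer to cos(theta) when lambda is large
```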

2.4. Feature Normalisation
Feature normalisation is widely used for face verification, e.g. L2-normalised Euclidean distance and cosine distance [29]. Parde et al. [30] observe that the L2-norm of features learned using the Softmax loss is informative of the quality of the face. Features of good quality frontal faces have a high L2-norm while blurry faces with extreme pose have a low L2-norm. Ranjan et al. [33] add an L2-constraint to the feature descriptors and restrict features to lie on a hypersphere of a fixed radius. L2 normalisation on features can be easily implemented using existing deep learning frameworks and significantly boosts the performance of face verification. Wang et al. [44] point out that the gradient norm may be extremely large when the feature norm from a low-quality face image is very small, which potentially increases the risk of gradient explosion. The advantages of feature normalisation are also revealed in [25, 26, 43, 45], where feature normalisation is explained from analytic, geometric and experimental perspectives.

As we can see from the above works, L2 normalisation on features and weights is an important step for hypersphere metric learning. The intuitive insight behind feature and weight normalisation is to remove the radial variation and push every feature to distribute on a hypersphere manifold.

Following [33, 43, 45, 44], we fix ||x_i|| by L2 normalisation and re-scale it to s, which is the hypersphere radius, whose lower bound is given in [33]. In this paper, we use s = 64 for the face recognition experiments [33, 43]. Based on feature and weight normalisation, we get W_j^T x_i = cos θ_j.

If feature normalisation is applied to SphereFace, we get the feature-normalised SphereFace, denoted as SphereFace-FNorm:

$$L = -\frac{1}{m}\sum_{i=1}^{m}\log\frac{e^{s\,\psi(\theta_{y_i})}}{e^{s\,\psi(\theta_{y_i})} + \sum_{j=1,\,j\neq y_i}^{n}e^{s\cos\theta_j}}$$
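A minimal sketch of the combined feature and weight normalisation, assuming PyTorch; after both sides are normalised, the logit becomes s·cos θ_j with s = 64. The helper name is mine.

```python
import torch
import torch.nn.functional as F

def normalised_cosine_logits(embeddings, weight, s=64.0):
    """Fully normalised logits: logit_j = s * cos(theta_j), with ||x_i|| re-scaled to s."""
    x = F.normalize(embeddings, dim=1)   # push features onto the unit hypersphere
    w = F.normalize(weight, dim=1)       # unit-norm class centres
    return s * F.linear(x, w)            # s * cos(theta_j)

x = torch.randn(4, 512)
W = torch.randn(8, 512)
print(normalised_cosine_logits(x, W).shape)   # (4, 8), every entry in [-64, 64]
```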

2.5. Additive Cosine Margin
In [44, 43], the angular margin m is moved to the outside of cos θ, giving the cosine margin loss function:

$$L = -\frac{1}{m}\sum_{i=1}^{m}\log\frac{e^{s(\cos\theta_{y_i} - m)}}{e^{s(\cos\theta_{y_i} - m)} + \sum_{j=1,\,j\neq y_i}^{n}e^{s\cos\theta_j}}$$

In this paper, we set the cosine margin m as 0.35 [44, 43]. Compared to SphereFace, the additive cosine margin (CosineFace) has three advantages: (1) extremely easy to implement without tricky hyper-parameters; (2) clearer and able to converge without the Softmax supervision; (3) obvious performance improvement.
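A short sketch of the additive cosine margin on top of the normalised logits from the previous sketches (PyTorch assumed; the helper name is mine): m is subtracted only at the ground-truth class before the usual cross-entropy.

```python
import torch
import torch.nn.functional as F

def cosineface_loss(cos_theta, labels, s=64.0, m=0.35):
    """CosineFace: subtract m from cos(theta) at the ground-truth class, then scale by s."""
    one_hot = F.one_hot(labels, num_classes=cos_theta.size(1)).float()
    logits = s * (cos_theta - one_hot * m)      # cos(theta_yi) - m only for the true class
    return F.cross_entropy(logits, labels)

cos_theta = F.normalize(torch.randn(4, 512), dim=1) @ F.normalize(torch.randn(8, 512), dim=1).t()
labels = torch.randint(0, 8, (4,))
print(cosineface_loss(cos_theta, labels))
```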

2.6. Additive Angular Margin
Although the cosine margin in [44, 43] has a one-to-one mapping from the cosine space to the angular space, there is still a difference between these two margins. In fact, the angular margin has a more clear geometric interpretation compared to cosine margin, and the margin in angular space corresponds to the arc distance on the hypersphere manifold.

We add an angular margin m inside cos θ. Since cos(θ + m) is lower than cos(θ) when θ ∈ [0, π - m], the constraint is more stringent for classification. We define the proposed ArcFace as:

$$L = -\frac{1}{m}\sum_{i=1}^{m}\log\frac{e^{s\cos(\theta_{y_i} + m)}}{e^{s\cos(\theta_{y_i} + m)} + \sum_{j=1,\,j\neq y_i}^{n}e^{s\cos\theta_j}}$$

If we expand the proposed additive angular margin cos(θ + m), we get cos(θ + m) = cos θ cos m - sin θ sin m. Compared to the additive cosine margin cos(θ) - m proposed in [44, 43], the proposed ArcFace is similar but the margin is dynamic due to sin θ.

In Figure 2, we illustrate the proposed ArcFace, and the angular margin corresponds to the arc margin. Compared to SphereFace and CosineFace, our method has the best geometric interpretation.

Figure 2. Geometrical interpretation of ArcFace. Different colour regions represent the feature spaces of different classes. ArcFace not only compresses the feature regions but also corresponds to the geodesic distance on the hypersphere.
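Below is a minimal PyTorch sketch of an ArcFace head, computing cos(θ + m) through the expansion cos θ cos m − sin θ sin m; the module name and initialisation are my own choices, and the region where θ + m exceeds 180° is left unhandled because, as Sec. 2.8 argues, it is almost never reached in practice.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class ArcFaceHead(nn.Module):
    """Additive angular margin head: s*cos(theta_yi + m) for the true class, s*cos(theta_j) otherwise."""
    def __init__(self, feat_dim=512, num_classes=10, s=64.0, m=0.5):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_classes, feat_dim) * 0.01)
        self.s, self.m = s, m

    def forward(self, embeddings, labels):
        cos_t = F.linear(F.normalize(embeddings, dim=1), F.normalize(self.weight, dim=1))
        cos_t = cos_t.clamp(-1.0 + 1e-7, 1.0 - 1e-7)
        sin_t = torch.sqrt(1.0 - cos_t ** 2)
        cos_t_m = cos_t * math.cos(self.m) - sin_t * math.sin(self.m)   # cos(theta + m)
        one_hot = F.one_hot(labels, cos_t.size(1)).float()
        logits = self.s * (one_hot * cos_t_m + (1.0 - one_hot) * cos_t)
        return F.cross_entropy(logits, labels)

head = ArcFaceHead(512, 10)
loss = head(torch.randn(4, 512), torch.randint(0, 10, (4,)))
loss.backward()   # trainable end to end together with the backbone
```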

2.7. Comparison under Binary Case
To better understand the process from Softmax to the proposed ArcFace, we give the decision boundaries under the binary classification case in Table 1 and Figure 3. Based on weight and feature normalisation, the main difference among these methods is where we put the margin.

Table 1. Decision boundaries for class 1 under the binary classification case. Note that θ_i is the angle between W_i and x, s is the hypersphere radius, and m is the margin.

Figure 3. Decision margins of different loss functions under the binary classification case. The dashed line represents the decision boundary, and the grey areas are the decision margins.
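The Table 1 boundaries follow directly from the loss definitions above; written out for the class-1 side, this is my reconstruction, since the original table image is not reproduced here:

```latex
\begin{aligned}
\text{Softmax:}        &\quad (W_1 - W_2)^{T}x + b_1 - b_2 = 0\\
\text{W-Norm Softmax:} &\quad \|x\|(\cos\theta_1 - \cos\theta_2) = 0\\
\text{SphereFace:}     &\quad \|x\|(\cos m\theta_1 - \cos\theta_2) = 0\\
\text{CosineFace:}     &\quad s(\cos\theta_1 - m - \cos\theta_2) = 0\\
\text{ArcFace:}        &\quad s(\cos(\theta_1 + m) - \cos\theta_2) = 0
\end{aligned}
```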

2.8. Target Logit Analysis
To investigate why the face recognition performance can be improved by SphereFace, CosineFace and ArcFace, we analyse the target logit curves and the θ distributions during training. Here, we use the LResNet34E-IR network (refer to Sec. 3.2) and the refined MS1M dataset (refer to Sec. 3.1).

Translator's note: the target logit is the entry of the last fully connected layer's output that corresponds to the ground-truth class of a sample, i.e. the score assigned to the correct identity.

Figure 4. Target logit analysis. (a) Target logit curves for Softmax, SphereFace, CosineFace and ArcFace. (b) Target logit converge curves estimated on training batches for Softmax, CosineFace and ArcFace. (c) θ distributions at the start, middle and end of training, which move from large angles to small angles. Best viewed zoomed in.

In Figure 4(a), we plot the target logit curves for Softmax, SphereFace, CosineFace and ArcFace. For SphereFace, the best setting is m = 4 and λ = 5, which is similar to the curve with m = 1.5 and λ = 0. However, the implementation of SphereFace requires m to be an integer. When we try the minimum multiplicative margin, m = 2 and λ = 0, the training cannot converge. Therefore, decreasing the target logit curve slightly from Softmax is able to increase the training difficulty and improve the performance, but decreasing it too much may cause the training to diverge.
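The curves in Figure 4(a) can be re-plotted from the closed-form target logits above; here is a small numpy/matplotlib sketch (the SphereFace curve uses the λ-softened ψ from Sec. 2.3 with m = 4, λ = 5, which approximates the paper's setting).

```python
import numpy as np
import matplotlib.pyplot as plt

theta = np.linspace(0.0, np.pi, 500)
deg = np.degrees(theta)

def sphereface_psi(t, m=4, lam=5.0):
    k = np.minimum(np.floor(t * m / np.pi), m - 1).astype(int)
    return (((-1.0) ** k) * np.cos(m * t) - 2.0 * k + lam * np.cos(t)) / (1.0 + lam)

plt.plot(deg, np.cos(theta), label="Softmax: cos(t)")
plt.plot(deg, sphereface_psi(theta), label="SphereFace: psi(t), m=4, lambda=5")
plt.plot(deg, np.cos(theta) - 0.35, label="CosineFace: cos(t) - 0.35")
plt.plot(deg, np.cos(theta + 0.5), label="ArcFace: cos(t + 0.5)")
plt.xlabel("theta (degrees)"); plt.ylabel("target logit"); plt.legend(); plt.show()
```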

Both CosineFace and ArcFace follow this insight. As we can see from Figure 4(a), CosineFace moves the target logit curve along the negative direction of the y-axis, while ArcFace moves the target logit curve along the negative direction of the x-axis. Now, we can easily understand the performance improvement from Softmax to CosineFace and ArcFace.

For ArcFace with the margin m = 0.5, the target logit curve is not monotonically decreasing over θ ∈ [0°, 180°]. In fact, the target logit curve increases once θ + m exceeds 180°, i.e. when θ is larger than about 151°. However, as shown in Figure 4(c), when training starts from the randomly initialised network, θ has a Gaussian distribution with its centre at about 90° and the largest angle below about 105°. The increasing interval of ArcFace is therefore almost never reached during training, and we do not need to deal with this case explicitly.

In Figure 4(c), we show the θ distributions of CosineFace and ArcFace in three phases of training: start, middle and end. The distribution centres gradually move from about 90° to about 35°–40°. In Figure 4(a), we find the target logit curve of ArcFace is lower than that of CosineFace between roughly 30° and 90°; therefore, the proposed ArcFace puts a more strict margin penalty than CosineFace in this interval. In Figure 4(b), we show the target logit converge curves estimated on training batches for Softmax, CosineFace and ArcFace. We can also find that the margin penalty of ArcFace is heavier than that of CosineFace at the beginning, as the red dotted line is lower than the blue dotted line. At the end of training, ArcFace converges better than CosineFace, as the histogram of θ lies further to the left (Figure 4(c)) and the target logit converge curve is higher (Figure 4(b)). From Figure 4(c), we can find that almost all of the θs are smaller than about 60° at the end of training. The samples beyond this range are the hardest samples as well as the noise samples of the training dataset. Even though CosineFace puts a more strict margin penalty at small angles (roughly θ ∈ [0°, 30°], Figure 4(a)), this range is seldom reached even at the end of training (Figure 4(c)). Therefore, we can also understand why SphereFace can obtain very good performance even with a relatively small margin in this range.

In conclusion, adding too much margin penalty in the large-angle region (roughly θ ∈ [60°, 90°]) may cause training divergence, e.g. SphereFace (m = 2 and λ = 0). Adding margin in the middle region (roughly θ ∈ [30°, 60°]) can potentially improve the performance, because this section corresponds to the most effective semi-hard negative samples [35]. Adding margin in the small-angle region (roughly θ ∈ [0°, 30°]) cannot obviously improve the performance, because this section corresponds to the easiest samples. When we go back to Figure 4(a) and rank the curves between 30° and 60°, we can understand why the performance improves from Softmax, SphereFace and CosineFace to ArcFace under their best parameter settings. Note that 30° and 60° here are the roughly estimated thresholds for easy and hard training samples.

3. Experiments
In this paper, we aim to obtain state-of-the-art performance on the MegaFace Challenge [21], the largest face identification and verification benchmark, in a totally reproducible way. We take Labelled Faces in the Wild (LFW) [19], Celebrities in Frontal Profile (CFP) [36] and the Age Database (AgeDB) [27] as the validation datasets, and conduct extensive experiments regarding network settings and loss function designs. The proposed ArcFace achieves state-of-the-art performance on all of these four datasets.

3.1. Data
3.1.1 Training data
We use two datasets, VGG2 [7] and MS-Celeb-1M [11], as our training data.

VGG2. The VGG2 dataset contains a training set with 8,631 identities (3,141,890 images) and a test set with 500 identities (169,396 images). VGG2 has large variations in pose, age, illumination, ethnicity and profession. Since VGG2 is a high-quality dataset, we use it directly without data refinement.

MS-Celeb-1M. The original MS-Celeb-1M dataset contains about 100k identities with 10 million images. To decrease the noise of MS-Celeb-1M and get high-quality training data, we rank all face images of each identity by their distances to the identity centre. For a particular identity, the face images whose feature vectors are too far from the identity's feature centre are automatically removed [10]. We further manually check the face images around the threshold of the first automatic step for each identity. Finally, we obtain a dataset which contains 3.8M images of 85k unique identities. To help other researchers reproduce all of the experiments in this paper, we make the refined MS1M dataset publicly available as a binary file, but please cite the original paper [11] and follow the original license [11] when using this dataset. Our contribution here is only the training data refinement, not a release.
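A sketch of the automatic refinement step, under the assumptions that embeddings for one identity have already been computed and that the keep threshold is illustrative (the paper does not state its exact value); numpy is assumed.

```python
import numpy as np

def refine_identity(features, keep_threshold=0.5):
    """Drop faces whose cosine similarity to the identity centre is below a threshold.

    features: (num_images, dim) embeddings of one identity, pre-computed by a face model.
    Returns the indices of images kept, ranked by similarity to the centre.
    """
    feats = features / np.linalg.norm(features, axis=1, keepdims=True)
    centre = feats.mean(axis=0)
    centre /= np.linalg.norm(centre)
    sims = feats @ centre                      # cosine similarity to the identity centre
    order = np.argsort(-sims)                  # rank images, closest to the centre first
    return [i for i in order if sims[i] >= keep_threshold]

feats = np.random.randn(20, 512)
print(refine_identity(feats, keep_threshold=0.0))
```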

3.1.2 Validation data
We employ Labelled Faces in the Wild (LFW) [19], Celebrities in Frontal Profile (CFP) [36] and the Age Database (AgeDB) [27] as the validation datasets.

LFW [19]. The LFW dataset contains 13,233 web-collected images from 5,749 different identities, with large variations in pose, expression and illumination. Following the standard protocol of unrestricted with labelled outside data, we give the verification accuracy on 6,000 face pairs.

CFP [36]. The CFP dataset consists of 500 subjects, each with 10 frontal and 4 profile images. The evaluation protocol includes frontal-frontal (FF) and frontal-profile (FP) face verification, each having 10 folds with 350 same-person pairs and 350 different-person pairs. In this paper, we only use the most challenging subset, CFP-FP, to report the performance.

Note: face images collected and annotated under unconstrained conditions are commonly referred to as "in-the-wild".

AgeDB [27, 10]. The AgeDB dataset is an in-the-wild dataset with large variations in pose, expression, illumination and age. AgeDB contains 12,240 images of 440 distinct subjects, such as actors, actresses, writers, scientists and politicians. Each image is annotated with identity, age and gender attributes. The minimum and maximum ages are 3 and 101, respectively. The average age range for each subject is 49 years. There are four groups of test data with different year gaps (5 years, 10 years, 20 years and 30 years, respectively) [10]. Each group has ten splits of face images, and each split contains 300 positive examples and 300 negative examples. The face verification evaluation metric is the same as LFW. In this paper, we only use the most challenging subset, AgeDB-30, to report the performance.

3.1.3 Test data
MegaFace. The MegaFace datasets [21] are released as the largest publicly available testing benchmark, which aims at evaluating the performance of face recognition algorithms at the million scale of distractors. The MegaFace datasets include a gallery set and a probe set. The gallery set, a subset of Flickr photos from Yahoo, consists of more than one million images from 690k different individuals. The probe sets are two existing databases: FaceScrub [28] and FGNet [1]. FaceScrub is a publicly available dataset containing 100k photos of 530 unique individuals, in which 55,742 images are of males and 52,076 images are of females. FGNet is a face ageing dataset, with 1,002 images from 82 identities. Each identity has multiple face images at different ages (ranging from 1 to 69).

It is quite understandable that the data collection of MegaFace is very arduous and time-consuming, thus data noise is inevitable. For the FaceScrub dataset, all of the face images from one particular identity should have the same identity. For the one million distractors, there should not be any overlap with the FaceScrub identities. However, we find noisy face images not only in the FaceScrub dataset but also in the one million distractors, which significantly affects the performance.

In Figure 5, we give noisy face image examples from the FaceScrub dataset. As shown in Figure 8(c), we rank all of the faces according to the cosine distance to the identity centre. In fact, face images 221 and 136 are not Aaron Eckhart. We manually clean the FaceScrub dataset and finally find 605 noisy face images. During testing, we change each noisy face to another correct face, which can increase the identification accuracy by about 1%. In Figure 6(b), we give noisy face image examples from the MegaFace distractors. All of the four face images from the MegaFace distractors are Alec Baldwin. We manually clean the MegaFace distractors and finally find 707 noisy face images. During testing, we add one additional feature dimension to distinguish these noisy faces, which can increase the identification accuracy by about 15%.

Even though the noisy face images are double checked by seven annotators who are very familiar with these celebrities, we still cannot promise these images are 100% noisy. We put the noise lists of the FaceScrub dataset and the MegaFace distractors online. We believe the masses have sharp eyes and we will update these lists based on other researchers' feedback.

Figure 5. Noisy face image examples from the FaceScrub dataset. In (a), the image id is shown in the top left corner and the cosine distance to the identity centre in the bottom left corner.

Figure 6. (a) Face images from the FaceScrub dataset used by the annotators to learn the identity. (b) Overlapping faces selected from the MegaFace distractors.

3.2. Network Settings
We first evaluate the face verification performance under different network settings by using VGG2 as the training data and Softmax as the loss function. All experiments in this paper are implemented in MXNet [8]. We set the batch size to 512 and train models on four or eight NVIDIA Tesla P40 (24GB) GPUs. The learning rate starts from 0.1 and is divided by 10 at 100k, 140k and 160k iterations. The total number of iterations is set to 200k. We set the momentum to 0.9 and the weight decay to 5e-4 (Table 5).

3.2.1 Input setting
Following [46, 23], we use five facial landmarks (eye centres, nose tip and mouth corners) [49] for similarity transformation to normalise the face images. The faces are cropped and resized to 112 × 112, and each pixel (ranging over [0, 255]) in the RGB images is normalised by subtracting 127.5 and then dividing by 128.

As most of the convolutional networks are designed for the ImageNet [34] classification task, the input image size is usually set as 224 × 224 or larger. However, the size of our face crops is only 112 × 112. To preserve a higher feature map resolution, we use conv3×3 and stride = 1 in the first convolutional layer instead of conv7×7 and stride = 2. For these two settings, the output size of the convolutional networks is 7×7 (denoted by an "L" in front of the network names) and 3×3, respectively.
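A minimal numpy sketch of this input normalisation, assuming the 112×112 aligned crop has already been produced by the similarity transformation (the alignment itself is omitted); the function name is mine.

```python
import numpy as np

def preprocess_face(aligned_rgb):
    """Normalise an aligned 112x112x3 uint8 face crop: (pixel - 127.5) / 128."""
    assert aligned_rgb.shape == (112, 112, 3)
    x = (aligned_rgb.astype(np.float32) - 127.5) / 128.0
    return np.transpose(x, (2, 0, 1))          # HWC -> CHW for the convolutional network

crop = np.random.randint(0, 256, (112, 112, 3), dtype=np.uint8)
print(preprocess_face(crop).shape, preprocess_face(crop).min() >= -1.0)
```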

3.2.2 Output setting
In the last several layers, different options can be investigated to check how the embedding settings affect the model performance. The feature embedding dimension is set to 512 for all options except Option-A, as the embedding size in Option-A is determined by the channel size of the last convolutional layer.

• Option-A: Use a global pooling layer (GP).

• Option-B: Use one fully connected (FC) layer after GP.

• Option-C: Use FC-Batch Normalisation (BN) [20] after GP.

• Option-D: Use FC-BN-Parametric Rectified Linear Unit (PReLu) [13] after GP.

• Option-E: Use BN-Dropout [38]-FC-BN after the last convolutional layer.

During testing, the score is computed as the cosine distance between two feature vectors. Nearest neighbour and threshold comparison are used for the face identification and verification tasks, respectively.
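A sketch of the Option-E head in PyTorch, assuming a 512-channel 7×7 feature map from an "L"-setting backbone; the layer names and the class name are mine.

```python
import torch
import torch.nn as nn

class OptionE(nn.Module):
    """Option-E embedding head: BN - Dropout - FC - BN after the last convolutional layer."""
    def __init__(self, in_channels=512, spatial=7, embed_dim=512, p_drop=0.4):
        super().__init__()
        self.bn1 = nn.BatchNorm2d(in_channels)
        self.drop = nn.Dropout(p_drop)
        self.fc = nn.Linear(in_channels * spatial * spatial, embed_dim)
        self.bn2 = nn.BatchNorm1d(embed_dim)

    def forward(self, feature_map):
        x = self.drop(self.bn1(feature_map)).flatten(1)
        return self.bn2(self.fc(x))            # 512-d embedding, compared by cosine distance

head = OptionE()
emb = head(torch.randn(4, 512, 7, 7))
print(emb.shape)   # torch.Size([4, 512])
```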

3.2.3 Block Setting
Besides the original ResNet [14] unit, we also investigate a more advanced residual unit setting [12] for the training of the face recognition model. In Figure 7, we show the improved residual unit (denoted by "IR" at the end of model names), which has a BN-Conv-BN-PReLu-Conv-BN structure. Compared to the residual unit proposed by [12], we set stride = 2 for the second convolutional layer instead of the first one (in Figure 7, the stride = 2 setting is on the second convolution block). In addition, PReLu [13] is used to substitute the original ReLu.
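A PyTorch sketch of the IR unit described above (BN-Conv(stride 1)-BN-PReLU-Conv(stride 2)-BN); the 1×1-conv shortcut used to match the downsampled shape is my assumption, since the shortcut branch is not spelled out here.

```python
import torch
import torch.nn as nn

class IRBlock(nn.Module):
    """Improved residual unit: BN - Conv(3x3, s=1) - BN - PReLU - Conv(3x3, s=2) - BN."""
    def __init__(self, in_ch, out_ch, stride=2):
        super().__init__()
        self.body = nn.Sequential(
            nn.BatchNorm2d(in_ch),
            nn.Conv2d(in_ch, out_ch, 3, stride=1, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.PReLU(out_ch),
            nn.Conv2d(out_ch, out_ch, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        # shortcut downsampling so the residual addition matches; the 1x1-conv form is an assumption
        self.shortcut = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
            nn.BatchNorm2d(out_ch),
        )

    def forward(self, x):
        return self.body(x) + self.shortcut(x)

block = IRBlock(64, 128)
print(block(torch.randn(2, 64, 56, 56)).shape)   # torch.Size([2, 128, 28, 28])
```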

3.2.4 Backbones
Based on recent advances in model structure design, we also explore MobileNet [16], Inception-ResNet-V2 [40], Densely connected convolutional networks (DenseNet) [18], Squeeze and excitation networks (SE) [17] and Dual path networks (DPN) [9] for deep face recognition. In this paper, we compare the differences between these networks in terms of accuracy, speed and model size.

3.2.5 Network Setting Conclusions
Input selects L. In Table 2, we compare two networks with and without the setting of “L”. When using conv3 × 3 and stride = 1 as the first convolutional layer, the network output is 7×7. By contrast, if we use conv7×7 and stride = 2 as the first  convolutional layer, the network output is only 3×3. It is obvious from Table 2 that choosing larger feature maps during training obtains higher verification accuracy.

Table 2. Verification accuracy (%) under different input settings (Softmax@VGG2).

Output selects E. In Table 3, we give a detailed comparison between different output settings. Option E (BN-Dropout-FC-BN) obtains the best performance. In this paper, the dropout parameter is set as 0.4. Dropout can effectively act as a regularisation term to avoid over-fitting and obtain better generalisation for deep face recognition.

Table 3. Verification accuracy (%) under different output settings (Softmax@VGG2).

Block selects IR. In Table 4, we give the comparison between the original residual unit and the improved residual unit. As we can see from the results, the proposed BN-Conv(stride=1)-BN-PReLu-Conv(stride=2)-BN unit can obviously improve the verification performance.

Table 4. Verification accuracy (%) comparison between the original residual unit and the improved residual unit (Softmax@VGG2).

Backbones Comparisons. In Table 8, we give the verification accuracy, test speed and model size of different backbones. The running time is estimated on a P40 GPU. As the performance on LFW is almost saturated, we focus on the more challenging test sets, CFP-FP and AgeDB-30, to compare these network backbones. The Inception-ResNet-V2 network obtains the best performance, with the longest running time (53.6 ms) and the largest model size (642 MB). By contrast, MobileNet can finish face feature embedding within 4.2 ms with a 112 MB model, and the performance only drops slightly. As we can see from Table 8, the performance gaps between the large networks, e.g. ResNet-100, Inception-ResNet-V2, DenseNet, DPN and SE-ResNet-100, are relatively small. Based on the trade-off between accuracy, speed and model size, we choose LResNet100E-IR to conduct the experiments on the MegaFace challenge.

Table 8. Accuracy (%), speed (ms) and model size (MB) comparison between different backbones (Softmax@VGG2).

Weight decay. Based on the SE-LResNet50E-IR network, we also explore how the weight decay (WD) value affects the verification performance. As we can see from Table 5, when the weight decay value is set as 5e-4, the verification accuracy reaches its highest point. Therefore, we fix the weight decay at 5e-4 in all other experiments.

Table 5. Verification accuracy (%) with different weight decay (WD) values (SE-LResNet50E-IR, Softmax@VGG2).

3.3. Loss Setting
Since the margin parameter m plays an important role in the proposed ArcFace, we first conduct experiments to search for the best angular margin. Varying m from 0.2 to 0.8, we use the LMobileNetE network and the ArcFace loss to train models on the refined MS1M dataset. As illustrated in Table 6, the performance improves consistently from m = 0.2 on all datasets and gets saturated at m = 0.5; beyond that the verification accuracy starts to decrease. In this paper, we fix the additive angular margin m as 0.5.

Table 6. Verification performance (%) of ArcFace with different angular margins m (LMobileNetE, ArcFace@MS1M).

Based on the LResNet100E-IR network and the refined MS1M dataset, we compare the performance of different loss functions, e.g. Softmax, SphereFace [23], CosineFace [44, 43] and ArcFace. In Table 7, we give the detailed verification accuracy on the LFW, CFP-FP and AgeDB-30 datasets. As LFW is almost saturated, the performance improvement there is not obvious. We find that (1) compared to Softmax, SphereFace, CosineFace and ArcFace improve the performance obviously, especially under large pose and age variations; (2) CosineFace and ArcFace obviously outperform SphereFace with a much easier implementation: both can converge easily without additional supervision from Softmax, whereas such supervision is indispensable for SphereFace to avoid divergence during training; (3) ArcFace is slightly better than CosineFace, and ArcFace is more intuitive and has a clearer geometric interpretation on the hypersphere manifold, as shown in Figure 1.

Table 7. Verification performance (%) of different loss functions (LResNet100E-IR@MS1M).

3.4. MegaFace Challenge1 on FaceScrub
For the experiments on the MegaFace challenge, we use the LResNet100E-IR network and the refined MS1M dataset as the training data. In both Table 9 and Table 10, we give the identification and verification results on the original MegaFace dataset and the refined MegaFace dataset.

In Table 9, we use the whole refined MS1M dataset to train the models. We compare the performance of the proposed ArcFace with related baseline methods, e.g. Softmax, Triplet, SphereFace and CosineFace. The proposed ArcFace obtains the best performance both before and after the distractor refinement. After the overlapping face images are removed from the one million distractors, the identification performance significantly improves. We believe that the results on the manually refined MegaFace dataset are more reliable, and that the performance of face identification under one million distractors is better than previously thought [2].

To strictly follow the evaluation instructions of MegaFace, we need to remove all of the identities appearing in the FaceScrub dataset from our training data. We calculate the feature centre for each identity in the refined MS1M dataset and the FaceScrub dataset. We find that 578 identities from the refined MS1M dataset have a close distance (cosine similarity higher than 0.45) to identities from the FaceScrub dataset. We remove these 578 identities from the refined MS1M dataset and compare the proposed ArcFace to other baseline methods in Table 10. ArcFace still outperforms CosineFace, with a slight performance drop compared to Table 9. For Softmax, however, the identification rate drops obviously from 78.89% to 73.66% after the suspected overlapping identities are removed from the training data. On the refined MegaFace test set, the verification result of CosineFace is slightly higher than that of ArcFace. This is because we read the verification results which are closest to FAR=1e-6 from the outputs of the devkit. As we can see from Figure 8, the proposed ArcFace always outperforms CosineFace under both the identification and the verification metric.

Two related concepts, for reference:

rank-1: https://blog.csdn.net/sinat_42239797/article/details/93651594

TAR and FAR: https://blog.csdn.net/liuweiyuxiang/article/details/81259492
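For reference, a simplified numpy sketch of these two metrics, rank-1 identification accuracy and TAR at a fixed FAR, computed from cosine scores; this is an illustration, not the MegaFace devkit.

```python
import numpy as np

def rank1_accuracy(probe_feats, probe_ids, gallery_feats, gallery_ids):
    """Rank-1: the nearest gallery face (cosine similarity) has the correct identity."""
    sims = probe_feats @ gallery_feats.T                # features assumed L2-normalised
    nearest = gallery_ids[np.argmax(sims, axis=1)]
    return float(np.mean(nearest == probe_ids))

def tar_at_far(genuine_scores, impostor_scores, far=1e-6):
    """TAR at a given FAR: threshold chosen so that the impostor accept rate equals `far`."""
    thresh = np.quantile(impostor_scores, 1.0 - far)    # accept scores above this threshold
    return float(np.mean(genuine_scores >= thresh))

g = np.random.uniform(0.3, 1.0, 10000)     # toy genuine-pair scores
i = np.random.uniform(-1.0, 0.5, 10000)    # toy impostor-pair scores
print(tar_at_far(g, i, far=1e-3))
```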

Table 9. Identification and verification results of different methods on MegaFace Challenge 1 (LResNet100E-IR@MS1M). "Rank 1" refers to the rank-1 face identification accuracy and "VR" refers to the face verification TAR (True Accept Rate) at 1e-6 FAR (False Accept Rate). "(R)" denotes the refined version of the MegaFace dataset.

Table 10. Identification and verification results of different methods on MegaFace Challenge 1 (Methods@MS1M with FaceScrub overlaps removed). "Rank 1" refers to the rank-1 face identification accuracy and "VR" refers to the face verification TAR at 1e-6 FAR. "(R)" denotes the refined version of the MegaFace dataset.

Figure 8. (a) and (c) report the CMC curves of different methods on the MegaFace dataset with 1M distractors. (b) and (d) report the ROC curves of different methods on the MegaFace dataset with 1M distractors. (a) and (b) are evaluated on the original MegaFace dataset, while (c) and (d) are evaluated on the refined MegaFace dataset.

3.5. Further Improvement by Triplet Loss
Due to the limitation of GPU memory, it is hard to train Softmax-based methods, e.g. SphereFace, CosineFace and ArcFace, with millions of identities. One practical solution is to employ metric learning methods, and the most widely used one is the Triplet loss [35, 22]. However, the converging speed of the Triplet loss is relatively slow. To this end, we explore the Triplet loss to fine-tune existing face recognition models which were trained with Softmax-based methods.

For Triplet loss fine-tuning, we use the LResNet100E-IR network and set the learning rate at 0.005, the momentum at 0 and the weight decay at 5e-4. As shown in Table 11, we give the verification accuracy after Triplet loss fine-tuning on the AgeDB-30 dataset. We find that (1) a Softmax model trained on a dataset with fewer identities (e.g. VGG2 with 8,631 identities) can be obviously improved by Triplet loss fine-tuning on a dataset with more identities (e.g. MS1M with 85k identities). This improvement confirms the effectiveness of the two-step training strategy, and this strategy can significantly accelerate the whole model training compared to training the Triplet loss from scratch. (2) A Softmax model can be further improved by Triplet loss fine-tuning on the same dataset, which proves that local refinement can improve the global model. (3) The advantage of the margin-improved Softmax methods, e.g. SphereFace, CosineFace and ArcFace, can be kept and further improved by Triplet loss fine-tuning, which also verifies that a local metric learning method, e.g. the Triplet loss, is complementary to global hypersphere metric learning based methods.

As the margin used in the Triplet loss is the Euclidean distance, we will investigate the Triplet loss with an angular margin in the near future.

Table 11. Verification accuracy (%) improvement by Triplet loss fine-tuning (LResNet100E-IR).
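A sketch of the Triplet-loss fine-tuning step under stated assumptions: PyTorch, a placeholder backbone standing in for the pretrained model, the built-in TripletMarginLoss with an illustrative Euclidean margin of 0.3, and no semi-hard sample mining (which the paper relies on for sample selection).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# `backbone` stands for a face model pretrained with a Softmax-based loss (e.g. ArcFace).
backbone = nn.Sequential(nn.Flatten(), nn.Linear(112 * 112 * 3, 512))   # placeholder network
criterion = nn.TripletMarginLoss(margin=0.3)                            # Euclidean margin (illustrative value)
optimiser = torch.optim.SGD(backbone.parameters(), lr=0.005, momentum=0.0, weight_decay=5e-4)

def fine_tune_step(anchor, positive, negative):
    """One fine-tuning step: pull anchor-positive together, push anchor-negative apart."""
    fa = F.normalize(backbone(anchor), dim=1)
    fp = F.normalize(backbone(positive), dim=1)
    fn = F.normalize(backbone(negative), dim=1)
    loss = criterion(fa, fp, fn)
    optimiser.zero_grad(); loss.backward(); optimiser.step()
    return loss.item()

imgs = lambda: torch.randn(8, 3, 112, 112)
print(fine_tune_step(imgs(), imgs(), imgs()))
```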

4. Conclusions
In this paper, we contribute to improving deep face recognition from data refinement, network settings and loss function designs. We have (1) refined the largest publicly available training dataset (MS1M) and test dataset (MegaFace); (2) explored different network settings and analysed the trade-off between accuracy and speed; (3) proposed a geometrically interpretable loss function called ArcFace and explained why the proposed ArcFace is better than Softmax, SphereFace and CosineFace from the view of semi-hard sample distributions; (4) obtained state-of-the-art performance on the MegaFace dataset in a totally reproducible way.
————————————————
Copyright notice: this is an original article by the CSDN blogger 神罗Noctis, released under the CC 4.0 BY-SA licence. Please include the original source link and this notice when reposting.
Original link: https://blog.csdn.net/qq_39937396/article/details/102523945
