Contents

  • 5. Experiments
    • 5.1. Computation Accuracy Trade-off
    • 5.2. Effect of CNN Model
    • 5.3. Sensitivity to Image Quality
    • 5.4. Embedding Dimensionality
    • 5.5. Amount of Training Data
    • 5.6. Performance on LFW
    • 5.7. Performance on Youtube Faces DB
    • 5.8. Face Clustering
  • 6. Summary
  • Acknowledgments
  • References

5. Experiments

If not mentioned otherwise we use between 100M–200M training face thumbnails consisting of about 8M different identities. A face detector is run on each image and a tight bounding box around each face is generated. These face thumbnails are resized to the input size of the respective network. Input sizes range from 96×96 pixels to 224×224 pixels in our experiments.

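To make this pipeline concrete, here is a minimal preprocessing sketch, assuming a detector-provided bounding box (the paper's own detector is proprietary and not described); `make_thumbnail` and its parameters are hypothetical names for illustration.

```python
from PIL import Image

def make_thumbnail(image_path: str, box: tuple, size: int = 220) -> Image.Image:
    """Crop a tight face bounding box and resize it to the network input.

    box: (left, top, right, bottom) pixel coordinates from a face
    detector (assumed given; the paper's detector is proprietary).
    size: network input resolution; the experiments use 96 to 224.
    """
    face = Image.open(image_path).crop(box)
    return face.resize((size, size), Image.BILINEAR)
```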

5.1. Computation Accuracy Trade-off

Before diving into the details of more specific experiments, let's discuss the trade-off between accuracy and the number of FLOPS that a particular model requires.



Figure 4. FLOPS vs. Accuracy trade-off. Shown is the trade-off between FLOPS and accuracy for a wide range of different model sizes and architectures. Highlighted are the four models that we focus on in our experiments.


Table 3. Network Architectures. This table compares the performance of our model architectures on the hold-out test set (see section 4.1). Reported is the mean validation rate VAL at 10E-3 false accept rate. Also shown is the standard error of the mean across the five test splits.

Figure 4 shows the FLOPS on the x-axis and, on the y-axis, the accuracy at 0.001 false accept rate (FAR) on our user-labelled test-data set from section 4.2. It is interesting to see the strong correlation between the computation a model requires and the accuracy it achieves. The figure highlights the five models (NN1, NN2, NN3, NNS1, NNS2) that we discuss in more detail in our experiments.


We also looked into the accuracy trade-off with regard to the number of model parameters. However, the picture is not as clear in that case. For example, the Inception-based model NN2 achieves performance comparable to NN1, but has only a twentieth of the parameters. The number of FLOPS is comparable, though. Obviously at some point the performance is expected to decrease if the number of parameters is reduced further. Other model architectures may allow further reductions without loss of accuracy, just like Inception [16] did in this case.



Figure 5. Network Architectures. This plot shows the complete ROC for the four different models on our personal photos test set from section 4.2. The sharp drop at 10E-4 FAR can be explained by noise in the groundtruth labels. The models in order of performance are: NN2: 224×224 input Inception-based model; NN1: Zeiler&Fergus-based network with 1×1 convolutions; NNS1: small Inception-style model with only 220M FLOPS; NNS2: tiny Inception model with only 20M FLOPS.

5.2. Effect of CNN Model

We now discuss the performance of our four selected models in more detail. On the one hand we have our traditional Zeiler&Fergus-based architecture with 1×1 convolutions [22, 9] (see Table 1). On the other hand we have Inception-based [16] models that dramatically reduce the model size. Overall, in the final performance the top models of both architectures perform comparably. However, some of our Inception-based models, such as NN3, still achieve good performance while significantly reducing both the FLOPS and the model size.


The detailed evaluation on our personal photos test set is shown in Figure 5. While the largest model achieves a dramatic improvement in accuracy compared to the tiny NNS2, the latter runs in 30 ms per image on a mobile phone and is still accurate enough to be used for face clustering. The sharp drop in the ROC for FAR < 10E-4 indicates noisy labels in the test data groundtruth. At extremely low false accept rates, a single mislabeled image can have a significant impact on the curve.


5.3. Sensitivity to Image Quality

Table 4 shows the robustness of our model across a wide range of image sizes. The network is surprisingly robust with respect to JPEG compression and performs very well down to a JPEG quality of 20. The performance drop is very small for face thumbnails down to a size of 120×120 pixels, and even at 80×80 pixels it shows acceptable performance. This is notable, because the network was trained on 220×220 input images. Training with lower-resolution faces could improve this range further.



Table 4. Image Quality. The table on the left shows the effect of varying JPEG quality on the validation rate at 10E-3 precision. The one on the right shows how the image size in pixels affects the validation rate at 10E-3 precision. This experiment was done with NN1 on the first split of our test hold-out dataset.
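The JPEG part of this experiment is easy to reproduce in spirit: re-encode a face crop at a given quality and check how far its embedding moves. A minimal sketch, assuming Pillow and a hypothetical `embed` function standing in for a trained network:

```python
import io
from PIL import Image

def jpeg_degrade(face: Image.Image, quality: int = 20) -> Image.Image:
    """Re-encode a face crop as JPEG at the given quality setting."""
    buf = io.BytesIO()
    face.convert("RGB").save(buf, format="JPEG", quality=quality)
    buf.seek(0)
    return Image.open(buf)

# Hypothetical check: if the model is robust, the embedding of the
# degraded crop stays within the verification threshold of the original.
# d2 = np.sum((embed(face) - embed(jpeg_degrade(face, 20))) ** 2)
```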


Table 5. Embedding Dimensionality. This table compares the effect of the embedding dimensionality of our model NN1 on our hold-out set from section 4.1. In addition to the VAL at 10E-3 we also show the standard error of the mean computed across five splits.

5.4. Embedding Dimensionality

We explored various embedding dimensionalities and selected 128 for all experiments other than the comparison reported in Table 5. One would expect the larger embeddings to perform at least as well as the smaller ones; however, it is possible that they require more training to achieve the same accuracy. That said, the differences in the performance reported in Table 5 are statistically insignificant.


It should be noted that during training a 128-dimensional float vector is used, but it can be quantized to 128 bytes without loss of accuracy. Thus each face is compactly represented by a 128-dimensional byte vector, which is ideal for large-scale clustering and recognition. Smaller embeddings are possible at a minor loss of accuracy and could be employed on mobile devices.

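The paper does not spell out the quantization scheme. A minimal sketch, assuming simple per-dimension linear quantization of the L2-normalized embedding (components in [-1, 1]) to one byte each:

```python
import numpy as np

def quantize_embedding(embedding: np.ndarray) -> np.ndarray:
    """Quantize a 128-D float embedding to 128 bytes.

    Assumes the embedding is L2-normalized, so every component lies
    in [-1, 1]; each is mapped linearly onto [0, 255]. (This linear
    scheme is an assumption; the paper does not specify its own.)
    """
    clipped = np.clip(embedding, -1.0, 1.0)
    return np.round((clipped + 1.0) * 127.5).astype(np.uint8)

def dequantize_embedding(quantized: np.ndarray) -> np.ndarray:
    """Recover an approximate float embedding from the byte vector."""
    return quantized.astype(np.float32) / 127.5 - 1.0
```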

5.5. Amount of Training Data

Table 6 shows the impact of large amounts of training data. Due to time constraints this evaluation was run on a smaller model; the effect may be even larger on larger models. Using tens of millions of exemplars clearly boosts accuracy on our personal photo test set from section 4.2: compared to only millions of images, the relative reduction in error is 60%. Using another order of magnitude more images (hundreds of millions) still gives a small boost, but the improvement tapers off.



Table 6. Training Data Size. This table compares the performance after 700h of training for a smaller model with 96×96 pixel inputs. The model architecture is similar to NN2, but without the 5×5 convolutions in the Inception modules.


Figure 6. LFW errors. This shows all pairs of images that were incorrectly classified on LFW.

5.6. Performance on LFW

We evaluate our model on LFW using the standard protocol for unrestricted, labeled outside data. Nine training splits are used to select the L2-distance threshold. Classification (same or different) is then performed on the tenth test split. The selected optimal threshold is 1.242 for all test splits except the eighth (1.256).

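A minimal sketch of this protocol step, with hypothetical distance arrays standing in for the LFW pairs: sweep candidate thresholds on the nine training splits, then apply the best one to the held-out split.

```python
import numpy as np

def select_threshold(dists: np.ndarray, same: np.ndarray,
                     candidates: np.ndarray) -> float:
    """Pick the L2-distance threshold maximizing pair accuracy.

    dists: squared L2 distances between embedding pairs (hypothetical)
    same:  boolean array, True when a pair shares an identity
    """
    accs = [np.mean((dists < t) == same) for t in candidates]
    return float(candidates[int(np.argmax(accs))])

# Hypothetical usage for the ten-fold LFW protocol:
# train_d, train_same = ...  # pairs pooled from the nine training splits
# test_d, test_same = ...    # pairs from the tenth split
# t = select_threshold(train_d, train_same, np.linspace(0.5, 2.0, 151))
# accuracy = np.mean((test_d < t) == test_same)
```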

Our model is evaluated in two modes:

  1. Fixed center crop of the LFW-provided thumbnail.
  2. A proprietary face detector (similar to Picasa [3]) is run on the provided LFW thumbnails. If it fails to align the face (this happens for two images), the LFW alignment is used.


Figure 6 gives an overview of all failure cases. It shows false accepts on the top as well as false rejects at the bottom. We achieve a classification accuracy of 98.87% ± 0.15 when using the fixed center crop described in (1) and the record-breaking 99.63% ± 0.09 standard error of the mean when using the extra face alignment (2). This reduces the error reported for DeepFace in [17] by more than a factor of 7 and the previous state-of-the-art reported for DeepId2+ in [15] by 30%. This is the performance of model NN1, but even the much smaller NN3 achieves performance that is not statistically significantly different.


5.7. Performance on Youtube Faces DB

We use the average similarity of all pairs of the first one hundred frames that our face detector detects in each video. This gives us a classification accuracy of 95.12% ± 0.39. Using the first one thousand frames results in 95.18%. Compared to [17] (91.4%), who also evaluate one hundred frames per video, we reduce the error rate by almost half. DeepId2+ [15] achieved 93.2%, and our method reduces this error by 30%, comparable to our improvement on LFW.

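A minimal sketch of this pairwise-frame evaluation, assuming each video is reduced to a (frames × 128) array of embeddings (hypothetical inputs); the resulting mean distance is thresholded exactly as in the LFW protocol.

```python
import numpy as np

def video_pair_distance(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """Mean squared L2 distance over all frame pairs of two videos.

    emb_a, emb_b: (n_frames, 128) embedding arrays for the first one
    hundred detected frames of each video (hypothetical inputs).
    """
    # Pairwise squared distances via ||a||^2 + ||b||^2 - 2 a.b
    sq_a = (emb_a ** 2).sum(axis=1)[:, None]
    sq_b = (emb_b ** 2).sum(axis=1)[None, :]
    d2 = np.maximum(sq_a + sq_b - 2.0 * emb_a @ emb_b.T, 0.0)
    return float(d2.mean())
```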

5.8. Face Clustering

Our compact embedding lends itself to being used to cluster a user's personal photos into groups of people with the same identity. The constraints in assignment imposed by clustering faces, compared to the pure verification task, lead to truly amazing results. Figure 7 shows one cluster in a user's personal photo collection, generated using agglomerative clustering. It is a clear showcase of the incredible invariance to occlusion, lighting, pose and even age.

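As a hedged illustration of this step: the paper names agglomerative clustering but not its linkage or distance threshold, so both are assumptions in the sketch below (scikit-learn API).

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def cluster_faces(embeddings: np.ndarray, threshold: float = 1.0) -> np.ndarray:
    """Group face embeddings by identity via agglomerative clustering.

    threshold: maximum merge distance in the embedding's L2 space;
    both linkage and threshold are assumptions, not the paper's settings.
    """
    model = AgglomerativeClustering(
        n_clusters=None,               # let the distance threshold decide
        distance_threshold=threshold,
        linkage="average",             # assumed; the paper does not say
    )
    return model.fit_predict(embeddings)  # one cluster id per face
```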

6. Summary

We provide a method to directly learn an embedding into a Euclidean space for face verification. This sets it apart from other methods [15, 17] that use the CNN bottleneck layer, or require additional post-processing such as concatenation of multiple models and PCA, as well as SVM classification. Our end-to-end training both simplifies the setup and shows that directly optimizing a loss relevant to the task at hand improves performance.



Figure 7. Face Clustering. Shown is an exemplar cluster for one user. All these images in the user's personal photo collection were clustered together.

Another strength of our model is that it only requires minimal alignment (a tight crop around the face area). [17], for example, performs a complex 3D alignment. We also experimented with a similarity-transform alignment and noticed that this can actually improve performance slightly. It is not clear whether it is worth the extra complexity.


Future work will focus on better understanding of the error cases, further improving the model, and also reducing model size and CPU requirements. We will also look into ways of improving the currently extremely long training times, e.g. variations of our curriculum learning with smaller batch sizes, and offline as well as online positive and negative mining.


Acknowledgments

We would like to thank Johannes Steffens for his discussions and great insights on face recognition and Christian Szegedy for providing new network architectures like [16] and discussing network design choices. Also we are indebted to the DistBelief [4] team for their support, especially to Rajat Monga for help in setting up efficient training schemes.

Also our work would not have been possible without the support of Chuck Rosenberg, Hartwig Adam, and Simon Han.

References

[1] Y. Bengio, J. Louradour, R. Collobert, and J. Weston. Curriculum learning. In Proc. of ICML, New York, NY, USA, 2009.
[2] D. Chen, X. Cao, L. Wang, F. Wen, and J. Sun. Bayesian face revisited: A joint formulation. In Proc. ECCV, 2012.
[3] D. Chen, S. Ren, Y. Wei, X. Cao, and J. Sun. Joint cascade face detection and alignment. In Proc. ECCV, 2014.
[4] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, Q. V. Le, and A. Y. Ng. Large scale distributed deep networks. In P. Bartlett, F. Pereira, C. Burges, L. Bottou, and K. Weinberger, editors, NIPS, pages 1232–1240. 2012.
[5] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res., 12:2121–2159, July 2011.
[6] I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio. Maxout networks. In ICML, 2013.
[7] G. B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. Technical Report 07-49, University of Massachusetts, Amherst, October 2007.
[8] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4):541–551, Dec. 1989.
[9] M. Lin, Q. Chen, and S. Yan. Network in network. CoRR, abs/1312.4400, 2013.
[10] C. Lu and X. Tang. Surpassing human-level face verification performance on LFW with GaussianFace. CoRR, abs/1404.3840, 2014.
[11] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-propagating errors. Nature, 1986.
[12] M. Schultz and T. Joachims. Learning a distance metric from relative comparisons. In S. Thrun, L. Saul, and B. Schölkopf, editors, NIPS, pages 41–48. MIT Press, 2004.
[13] T. Sim, S. Baker, and M. Bsat. The CMU pose, illumination, and expression (PIE) database. In Proc. FG, 2002.
[14] Y. Sun, X. Wang, and X. Tang. Deep learning face representation by joint identification-verification. CoRR, abs/1406.4773, 2014.
[15] Y. Sun, X. Wang, and X. Tang. Deeply learned face representations are sparse, selective, and robust. CoRR, abs/1412.1265, 2014.
[16] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. CoRR, abs/1409.4842, 2014.
[17] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf. DeepFace: Closing the gap to human-level performance in face verification. In IEEE Conf. on CVPR, 2014.
[18] J. Wang, Y. Song, T. Leung, C. Rosenberg, J. Wang, J. Philbin, B. Chen, and Y. Wu. Learning fine-grained image similarity with deep ranking. CoRR, abs/1404.4661, 2014.
[19] K. Q. Weinberger, J. Blitzer, and L. K. Saul. Distance metric learning for large margin nearest neighbor classification. In NIPS. MIT Press, 2006.
[20] D. R. Wilson and T. R. Martinez. The general inefficiency of batch training for gradient descent learning. Neural Networks, 16(10):1429–1451, 2003.
[21] L. Wolf, T. Hassner, and I. Maoz. Face recognition in unconstrained videos with matched background similarity. In IEEE Conf. on CVPR, 2011.
[22] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. CoRR, abs/1311.2901, 2013.
[23] Z. Zhu, P. Luo, X. Wang, and X. Tang. Recover canonical-view faces in the wild with deep neural networks. CoRR, abs/1404.3543, 2014.
