FaceNet paper link: facenet; download link for this analysis (PDF version): paper analysis

FaceNet: A Unified Embedding for Face Recognition and Clustering

Abstract

Despite significant recent advances in the field of face recognition [10, 14, 15, 17], implementing face verification and recognition efficiently at scale presents serious challenges to current approaches. In this paper we present a system, called FaceNet, that directly learns a mapping from face images to a compact Euclidean space where distances directly correspond to a measure of face similarity. Once this space has been produced, tasks such as face recognition, verification and clustering can be easily implemented using standard techniques with FaceNet embeddings as feature vectors.


Our method uses a deep convolutional network trained to directly optimize the embedding itself, rather than an intermediate bottleneck layer as in previous deep learning approaches. To train, we use triplets of roughly aligned matching / non-matching face patches generated using a novel online triplet mining method. The benefit of our approach is much greater representational efficiency: we achieve state-of-the-art face recognition performance using only 128 bytes per face.


On the widely used Labeled Faces in the Wild (LFW) dataset, our system achieves a new record accuracy of 99.63%.

On YouTube Faces DB it achieves 95.12%. Our system cuts the error rate in comparison to the best published result [15] by 30% on both datasets.


We also introduce the concept of harmonic embeddings, and a harmonic triplet loss, which describe different versions of face embeddings (produced by different networks) that are compatible to each other and allow for direct comparison between each other.


1. Introduction

In this paper we present a unified system for face verification (is this the same person), recognition (who is this person) and clustering (find common people among these faces). Our method is based on learning a Euclidean embedding per image using a deep convolutional network. The network is trained such that the squared L2 distances in the embedding space directly correspond to face similarity: faces of the same person have small distances and faces of distinct people have large distances.


Once this embedding has been produced, then the aforementioned tasks become straight-forward: face verification simply involves thresholding the distance between the two embeddings; recognition becomes a k-NN classification problem; and clustering can be achieved using off-the-shelf techniques such as k-means or agglomerative clustering.

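To make this concrete, here is a minimal Python sketch of how the three tasks reduce to standard operations once the embeddings exist. It assumes the (e.g. 128-D, L2-normalized) embeddings have already been computed by a trained network; the 1.1 threshold is the value quoted for Figure 1 below, and the scikit-learn classes are ordinary off-the-shelf tools, not anything specific to the paper.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

def same_person(e1, e2, threshold=1.1):
    # Verification: threshold the squared L2 distance between two embeddings.
    return float(np.sum((e1 - e2) ** 2)) <= threshold

def identify(gallery_embs, gallery_labels, probe_emb, k=5):
    # Recognition: k-NN classification in the embedding space.
    knn = KNeighborsClassifier(n_neighbors=k).fit(gallery_embs, gallery_labels)
    return knn.predict(probe_emb[None, :])[0]

def cluster(embs, n_people):
    # Clustering: off-the-shelf k-means on the embeddings.
    return KMeans(n_clusters=n_people, n_init=10).fit_predict(embs)
```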

Previous face recognition approaches based on deep networks use a classification layer [15, 17] trained over a set of known face identities and then take an intermediate bottleneck layer as a representation used to generalize recognition beyond the set of identities used in training. The downsides of this approach are its indirectness and its inefficiency: one has to hope that the bottleneck representation generalizes well to new faces; and by using a bottleneck layer the representation size per face is usually very large (1000s of dimensions). Some recent work [15] has reduced this dimensionality using PCA, but this is a linear transformation that can be easily learnt in one layer of the network.


In contrast to these approaches, FaceNet directly trains its output to be a compact 128-D embedding using a triplet-based loss function based on LMNN [19]. Our triplets consist of two matching face thumbnails and a non-matching face thumbnail and the loss aims to separate the positive pair from the negative by a distance margin. The thumbnails are tight crops of the face area; no 2D or 3D alignment, other than scale and translation, is performed.


Choosing which triplets to use turns out to be very important for achieving good performance and, inspired by curriculum learning [1], we present a novel online negative exemplar mining strategy which ensures consistently increasing difficulty of triplets as the network trains. To improve clustering accuracy, we also explore hard-positive mining techniques which encourage spherical clusters for the embeddings of a single person.


As an illustration of the incredible variability that our method can handle, see Figure 1. Shown are image pairs from PIE [13] that previously were considered to be very difficult for face verification systems.


Figure 1. Illumination and Pose invariance. Pose and illumination have been a long standing problem in face recognition. This figure shows the output distances of FaceNet between pairs of faces of the same and a different person in different pose and illumination combinations. A distance of 0.0 means the faces are identical, 4.0 corresponds to the opposite spectrum, two different identities. You can see that a threshold of 1.1 would classify every pair correctly.


An overview of the rest of the paper is as follows: in section 2 we review the literature in this area; section 3.1 defines the triplet loss and section 3.2 describes our novel triplet selection and training procedure; in section 3.3 we describe the model architecture used. Finally in section 4 and 5 we present some quantitative results of our embeddings and also qualitatively explore some clustering results.


2. Related Work

Similarly to other recent works which employ deep networks [15, 17], our approach is a purely data driven method which learns its representation directly from the pixels of the face. Rather than using engineered features, we use a large dataset of labelled faces to attain the appropriate invariances to pose, illumination, and other variational conditions.


In this paper we explore two different deep network architectures that have been recently used to great success in the computer vision community. Both are deep convolutional networks [8, 11]. The first architecture is based on the Zeiler&Fergus [22] model which consists of multiple interleaved layers of convolutions, non-linear activations, local response normalizations, and max pooling layers. We additionally add several 1*1*d convolution layers inspired by the work of [9]. The second architecture is based on the Inception model of Szegedy et al. which was recently used as the winning approach for ImageNet 2014 [16]. These networks use mixed layers that run several different convolutional and pooling layers in parallel and concatenate their responses. We have found that these models can reduce the number of parameters by up to 20 times and have the potential to reduce the number of FLOPS required for comparable performance.


There is a vast corpus of face verification and recognition works. Reviewing it is out of the scope of this paper, so we will only briefly discuss the most relevant recent work.


The works of [15, 17, 23] all employ a complex system of multiple stages, that combines the output of a deep convolutional network with PCA for dimensionality reduction and an SVM for classification.


Zhenyao et al. [23] employ a deep network to "warp" faces into a canonical frontal view and then learn a CNN that classifies each face as belonging to a known identity. For face verification, PCA on the network output in conjunction with an ensemble of SVMs is used.

Taigman et al. [17] propose a multi-stage approach that aligns faces to a general 3D shape model. A multi-class network is trained to perform the face recognition task on over four thousand identities. The authors also experimented with a so called Siamese network where they directly optimize the L1-distance between two face features. Their best performance on LFW (97.35%) stems from an ensemble of three networks using different alignments and color channels. The predicted distances (non-linear SVM predictions based on the χ² kernel) of those networks are combined using a non-linear SVM.


Sun et al. [14, 15] propose a compact and therefore relatively cheap to compute network. They use an ensemble of 25 of these networks, each operating on a different face patch. For their final performance on LFW (99.47% [15]) the authors combine 50 responses (regular and flipped). Both PCA and a Joint Bayesian model [2] that effectively correspond to a linear transform in the embedding space are employed. Their method does not require explicit 2D/3D alignment. The networks are trained by using a combination of classification and verification loss. The verification loss is similar to the triplet loss we employ [12, 19], in that it minimizes the L2-distance between faces of the same identity and enforces a margin between the distance of faces of different identities. The main difference is that only pairs of images are compared, whereas the triplet loss encourages a relative distance constraint.


A similar loss to the one used here was explored in Wang et al. [18] for ranking images by semantic and visual similarity.


3. Method

FaceNet uses a deep convolutional network. We discuss two different core architectures: the Zeiler&Fergus [22] style networks and the recent Inception [16] type networks. The details of these networks are described in section 3.3.


Given the model details, and treating it as a black box (see Figure 2), the most important part of our approach lies in the end-to-end learning of the whole system. To this end we employ the triplet loss that directly reflects what we want to achieve in face verification, recognition and clustering. Namely, we strive for an embedding f(x), from an image x into a feature space R^d, such that the squared distance between all faces, independent of imaging conditions, of the same identity is small, whereas the squared distance between a pair of face images from different identities is large.


Figure 2. Model structure. Our network consists of a batch input layer and a deep CNN followed by L2 normalization, which results in the face embedding. This is followed by the triplet loss during training.


Although we did not directly compare to other losses, e.g. the one using pairs of positives and negatives, as used in [14] Eq. (2), we believe that the triplet loss is more suitable for face verification. The motivation is that the loss from [14] encourages all faces of one identity to be projected onto a single point in the embedding space. The triplet loss, however, tries to enforce a margin between each pair of faces from one person to all other faces. This allows the faces for one identity to live on a manifold, while still enforcing the distance and thus discriminability to other identities.


The following section describes this triplet loss and how it can be learned efficiently at scale.


3.1. Triplet Loss

The embedding is represented by f(x) ∈ R^d. It embeds an image x into a d-dimensional Euclidean space. Additionally, we constrain this embedding to live on the d-dimensional hypersphere, i.e. ||f(x)||_2 = 1. This loss is motivated in [19] in the context of nearest-neighbor classification. Here we want to ensure that an image x_i^a (anchor) of a specific person is closer to all other images x_i^p (positive) of the same person than it is to any image x_i^n (negative) of any other person. This is visualized in Figure 3.

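As a small illustration of the hypersphere constraint, the raw network output can simply be projected onto the unit sphere by L2 normalization; this is a generic sketch, not code from the paper. A useful side effect: for unit-norm vectors the squared distance equals 2 - 2cos(θ), so it is bounded by 4.0, which matches the "opposite spectrum" value quoted in Figure 1.

```python
import numpy as np

def l2_normalize(x, eps=1e-10):
    # Project each row onto the d-dimensional unit hypersphere: ||f(x)||_2 = 1.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

raw = np.random.randn(4, 128)  # stand-in for raw network outputs
emb = l2_normalize(raw)
assert np.allclose(np.linalg.norm(emb, axis=1), 1.0)
```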

Figure 3. The Triplet Loss minimizes the distance between an anchor and a positive, both of which have the same identity, and maximizes the distance between the anchor and a negative of a different identity.


Thus we want,

$$\|f(x_i^a) - f(x_i^p)\|_2^2 + \alpha < \|f(x_i^a) - f(x_i^n)\|_2^2, \quad \forall \left(f(x_i^a), f(x_i^p), f(x_i^n)\right) \in T, \tag{1}$$

where α is a margin that is enforced between positive and negative pairs. T is the set of all possible triplets in the training set and has cardinality N.

The loss that is being minimized is then

$$L = \sum_i^N \Big[ \|f(x_i^a) - f(x_i^p)\|_2^2 - \|f(x_i^a) - f(x_i^n)\|_2^2 + \alpha \Big]_+ .$$
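For reference, this loss is only a few lines of NumPy; a sketch, with the hinge [·]_+ implemented as max(·, 0) and the margin defaulting to the α = 0.2 used later in the paper.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, alpha=0.2):
    # anchor, positive, negative: (N, d) arrays of L2-normalized embeddings.
    pos_dist = np.sum((anchor - positive) ** 2, axis=1)  # ||f(x_a) - f(x_p)||^2
    neg_dist = np.sum((anchor - negative) ** 2, axis=1)  # ||f(x_a) - f(x_n)||^2
    return float(np.sum(np.maximum(pos_dist - neg_dist + alpha, 0.0)))
```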

Generating all possible triplets would result in many triplets that are easily satisfied (i.e. fulfill the constraint in Eq. (1)). These triplets would not contribute to the training and result in slower convergence, as they would still be passed through the network. It is crucial to select hard triplets, that are active and can therefore contribute to improving the model. The following section talks about the different approaches we use for the triplet selection.


3.2. Triplet Selection

In order to ensure fast convergence it is crucial to select triplets that violate the triplet constraint in Eq. (1). This means that, given $x_i^a$, we want to select an $x_i^p$ (hard positive) such that $\mathrm{argmax}_{x_i^p} \|f(x_i^a) - f(x_i^p)\|_2^2$, and similarly an $x_i^n$ (hard negative) such that $\mathrm{argmin}_{x_i^n} \|f(x_i^a) - f(x_i^n)\|_2^2$.


It is infeasible to compute the argmin and argmax across the whole training set. Additionally, it might lead to poor training, as mislabelled and poorly imaged faces would dominate the hard positives and negatives. There are two obvious choices that avoid this issue:


• Generate triplets offline every n steps, using the most recent network checkpoint and computing the argmin and argmax on a subset of the data.

• Generate triplets online. This can be done by selecting the hard positive/negative exemplars from within a mini-batch.


Here, we focus on the online generation and use large mini-batches in the order of a few thousand exemplars and only compute the argmin and argmax within a mini-batch.


To have a meaningful representation of the anchor-positive distances, it needs to be ensured that a minimal number of exemplars of any one identity is present in each mini-batch. In our experiments we sample the training data such that around 40 faces are selected per identity per mini-batch. Additionally, randomly sampled negative faces are added to each mini-batch.

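A possible shape for this sampling scheme is sketched below; the identity-to-images mapping and the exact counts are illustrative assumptions (45 identities × 40 faces gives roughly the 1,800-exemplar batches mentioned at the end of this section).

```python
import random

def sample_minibatch(images_by_identity, ids_per_batch=45, faces_per_id=40):
    # images_by_identity: dict mapping identity -> list of image paths (assumed).
    # Every other identity in the batch supplies the random negatives.
    batch = []
    for identity in random.sample(sorted(images_by_identity), ids_per_batch):
        imgs = images_by_identity[identity]
        chosen = random.sample(imgs, min(faces_per_id, len(imgs)))
        batch.extend((identity, img) for img in chosen)
    return batch
```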

Instead of picking the hardest positive, we use all anchor-positive pairs in a mini-batch while still selecting the hard negatives. We don't have a side-by-side comparison of hard anchor-positive pairs versus all anchor-positive pairs within a mini-batch, but we found in practice that the all anchor-positive approach was more stable and converged slightly faster at the beginning of training.


We also explored the offline generation of triplets in conjunction with the online generation and it may allow the use of smaller batch sizes, but the experiments were inconclusive.


Selecting the hardest negatives can in practice lead to bad local minima early on in training, specifically it can result in a collapsed model (i.e. f(x) = 0). In order to mitigate this, it helps to select $x_i^n$ such that

$$\|f(x_i^a) - f(x_i^p)\|_2^2 < \|f(x_i^a) - f(x_i^n)\|_2^2 .$$


We call these negative exemplars semi-hard, as they are further away from the anchor than the positive exemplar, but still hard because the squared distance is close to the anchor-positive distance. Those negatives lie inside the margin α.

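Within a mini-batch, picking a semi-hard negative for a given anchor-positive pair can be sketched as follows; the embedding and integer-label arrays are assumed inputs, and the fallback when no semi-hard candidate exists is a common practical choice rather than something the paper prescribes.

```python
import numpy as np

def semi_hard_negative(emb, labels, a, p, alpha=0.2):
    # emb: (B, d) embeddings of a mini-batch; labels: (B,) identity ids.
    # a, p: indices of an anchor-positive pair (labels[a] == labels[p]).
    pos_dist = np.sum((emb[a] - emb[p]) ** 2)
    neg_dist = np.sum((emb[a] - emb) ** 2, axis=1)
    # Semi-hard: farther from the anchor than the positive, but inside the margin.
    mask = (labels != labels[a]) & (neg_dist > pos_dist) & (neg_dist < pos_dist + alpha)
    candidates = np.where(mask)[0]
    if candidates.size == 0:
        return None  # in practice one might fall back to the hardest valid negative
    return int(candidates[np.argmin(neg_dist[candidates])])
```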

As mentioned before, correct triplet selection is crucial for fast convergence. On the one hand we would like to use small mini-batches as these tend to improve convergence during Stochastic Gradient Descent (SGD) [20]. On the other hand, implementation details make batches of tens to hundreds of exemplars more efficient. The main constraint with regards to the batch size, however, is the way we select hard relevant triplets from within the mini-batches. In most experiments we use a batch size of around 1,800 exemplars.


3.3. Deep Convolutional Networks

In all our experiments we train the CNN using Stochastic Gradient Descent (SGD) with standard backprop [8, 11] and AdaGrad [5]. In most experiments we start with a learning rate of 0.05 which we lower to finalize the model. The models are initialized from random, similar to [16], and trained on a CPU cluster for 1,000 to 2,000 hours. The decrease in the loss (and increase in accuracy) slows down drastically after 500h of training, but additional training can still significantly improve performance. The margin α is set to 0.2.

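In modern terms this recipe maps onto standard tooling. Below is a hedged PyTorch sketch with a toy stand-in for the CNN; note that `TripletMarginLoss` uses the plain (not squared) L2 distance, a small departure from the paper's formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyEmbedder(nn.Module):
    # Toy stand-in for the CNN: any module producing L2-normalized 128-D embeddings.
    def __init__(self, d_in=1024, d_emb=128):
        super().__init__()
        self.fc = nn.Linear(d_in, d_emb)

    def forward(self, x):
        return F.normalize(self.fc(x), p=2, dim=1)  # ||f(x)||_2 = 1

model = ToyEmbedder()
optimizer = torch.optim.Adagrad(model.parameters(), lr=0.05)  # starting rate from the paper
criterion = nn.TripletMarginLoss(margin=0.2)                  # alpha = 0.2

anchor, positive, negative = (torch.randn(32, 1024) for _ in range(3))  # dummy triplet batch
loss = criterion(model(anchor), model(positive), model(negative))
optimizer.zero_grad()
loss.backward()
optimizer.step()
```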

We used two types of architectures and explore their trade-offs in more detail in the experimental section. Their practical differences lie in the difference of parameters and FLOPS. The best model may be different depending on the application. E.g. a model running in a data center can have many parameters and require a large number of FLOPS, whereas a model running on a mobile phone needs to have few parameters, so that it can fit into memory. All our models use rectified linear units as the non-linear activation function.


The first category, shown in Table 1, adds 1*1*d convolutional layers, as suggested in [9], between the standard convolutional layers of the Zeiler&Fergus [22] architecture and results in a model 22 layers deep. It has a total of 140 million parameters and requires around 1.6 billion FLOPS per image.


Table 1. NN1. Zeiler&Fergus [22] based model with 1*1 convolutions inspired by [9]. The input and output sizes are described in rows * cols * #filters. The kernel is specified as rows * cols, stride, and the maxout [6] pooling size as p = 2.


The second category we use is based on GoogLeNet style Inception models [16]. These models have 20x fewer parameters (around 6.6M-7.5M) and up to 5x fewer FLOPS (between 500M-1.6B). Some of these models are dramatically reduced in size (both depth and number of filters), so that they can be run on a mobile phone. One, NNS1, has 26M parameters and only requires 220M FLOPS per image. The other, NNS2, has 4.3M parameters and 20M FLOPS. Table 2 describes NN2, our largest network, in detail. NN3 is identical in architecture but has a reduced input size of 160x160. NN4 has an input size of only 96x96, thereby drastically reducing the CPU requirements (285M FLOPS vs 1.6B for NN2). In addition to the reduced input size it does not use 5x5 convolutions in the higher layers as the receptive field is already too small by then. Generally we found that the 5x5 convolutions can be removed throughout with only a minor drop in accuracy. Figure 4 compares all our models.


(Figure 4. FLOPS vs. Accuracy trade-off. The figure shows that there is a wide range of trade-offs between FLOPS and accuracy across the different model sizes and architectures. Highlighted are the four models that we focus on in our experiments.)

(Table 2. NN2. Details of the NN2 Inception incarnation. This model is very similar to the one described in [16]. The two major differences are the use of L2 pooling instead of max pooling (m) where specified, i.e. the L2 norm is computed instead of the spatial max. The pooling is always 3*3 (apart from the final average pooling) and in parallel to the convolutional modules inside each Inception module. If there is a dimensionality reduction after the pooling it is denoted with p. 1*1, 3*3, and 5*5 pooling are then concatenated to get the final output.)

4. Datasets and Evaluation

We evaluate our method on four datasets and with the exception of Labelled Faces in the Wild and YouTube Faces we evaluate our method on the face verification task. I.e. given a pair of two face images a squared L2 distance threshold D(x_i, x_j) is used to determine the classification of same and different. All face pairs (i, j) of the same identity are denoted with P_same, whereas all pairs of different identities are denoted with P_diff.


We define the set of all true accepts as

$$\mathrm{TA}(d) = \{(i, j) \in P_{same}, \ \text{with} \ D(x_i, x_j) \le d\}.$$

These are the face pairs (i, j) that were correctly classified as same at threshold d. Similarly,

$$\mathrm{FA}(d) = \{(i, j) \in P_{diff}, \ \text{with} \ D(x_i, x_j) \le d\}$$

is the set of all pairs that was incorrectly classified as same (false accept).

The validation rate VAL(d) and the false accept rate FAR(d) for a given face distance d are then defined as

$$\mathrm{VAL}(d) = \frac{|\mathrm{TA}(d)|}{|P_{same}|}, \qquad \mathrm{FAR}(d) = \frac{|\mathrm{FA}(d)|}{|P_{diff}|}.$$
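These definitions translate directly into a few lines of NumPy; a sketch assuming the squared distances and same-identity flags for all evaluated pairs have been precomputed.

```python
import numpy as np

def val_far(dist, is_same, d):
    # dist: (M,) squared L2 distances for all face pairs; is_same: (M,) booleans.
    accept = dist <= d                   # pairs classified as "same" at threshold d
    ta = np.sum(accept & is_same)        # |TA(d)|, true accepts
    fa = np.sum(accept & ~is_same)       # |FA(d)|, false accepts
    val = ta / max(np.sum(is_same), 1)   # VAL(d) = |TA(d)| / |P_same|
    far = fa / max(np.sum(~is_same), 1)  # FAR(d) = |FA(d)| / |P_diff|
    return val, far
```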

4.1. Hold-out Test Set

We keep a hold out set of around one million images, that has the same distribution as our training set, but disjoint identities. For evaluation we split it into five disjoint sets of 200k images each. The FAR and VAL rate are then computed on 100k x 100k image pairs. Standard error is reported across the five splits.


4.2. Personal Photos

This is a test set with similar distribution to our training set, but has been manually verified to have very clean labels. It consists of three personal photo collections with a total of around 12k images. We compute the FAR and VAL rate across all 12k squared pairs of images.


4.3. Academic Datasets

Labeled Faces in the Wild (LFW) is the de-facto academic test set for face verification [7]. We follow the standard protocol for unrestricted, labeled outside data and report the mean classification accuracy as well as the standard error of the mean.


Youtube Faces DB [21] is a new dataset that has gained popularity in the face recognition community [17, 15]. The setup is similar to LFW, but instead of verifying pairs of images, pairs of videos are used.


5. Experiments

If not mentioned otherwise we use between 100M-200M training face thumbnails consisting of about 8M different identities. A face detector is run on each image and a tight bounding box around each face is generated. These face thumbnails are resized to the input size of the respective network. Input sizes range from 96x96 pixels to 224x224 pixels in our experiments.


5.1. Computation Accuracy Trade-off

Before diving into the details of more specific experiments we will discuss the trade-off of accuracy versus number of FLOPS that a particular model requires. Figure 4 shows the FLOPS on the x-axis and the accuracy at 0.001 false accept rate on our user labelled test-data set from section 4.2. It is interesting to see the strong correlation between the computation a model requires and the accuracy it achieves. The figure highlights the five models (NN1, NN2, NN3, NNS1, NNS2) that we discuss in more detail in our experiments.


We also looked into the accuracy trade-off with regards to the number of model parameters. However, the picture is not as clear in that case. For example, the Inception based model NN2 achieves a comparable performance to NN1, but only has a 20th of the parameters. The number of FLOPS is comparable, though. Obviously at some point the performance is expected to decrease, if the number of parameters is reduced further. Other model architectures may allow further reductions without loss of accuracy, just like Inception [16] did in this case.


5.2. Effect of CNN Model

We now discuss the performance of our four selected models in more detail. On the one hand we have our traditional Zeiler&Fergus based architecture with 1x1 convolutions [22, 9] (see Table 1). On the other hand we have Inception [16] based models that dramatically reduce the model size. Overall, in the final performance the top models of both architectures perform comparably. However, some of our Inception based models, such as NN3, still achieve good performance while significantly reducing both the FLOPS and the model size.


The detailed evaluation on our personal photos test set is shown in Figure 5. While the largest model achieves a dramatic improvement in accuracy compared to the tiny NNS2, the latter can be run 30ms/image on a mobile phone and is still accurate enough to be used in face clustering. The sharp drop in the ROC for FAR < 10^-4 indicates noisy labels in the test data groundtruth. At extremely low false accept rates a single mislabeled image can have a significant impact on the curve.


(Figure 5. Network Architectures. This figure shows the complete ROC curves on the personal photos test set from section 4.2 for the four different models. The sharp drop at 10E-4 FAR can be explained by noise in the groundtruth labels. In order of performance the models are: NN2: 224x224 input Inception based model; NN1: Zeiler&Fergus based network with 1x1 convolutions; NNS1: small Inception style model with only 220M FLOPS; NNS2: tiny Inception model with only 20M FLOPS.)

5.3. Sensitivity to Image Quality

Table 4 shows the robustness of our model across a wide range of image sizes. The network is surprisingly robust with respect to JPEG compression and performs very well down to a JPEG quality of 20. The performance drop is very small for face thumbnails down to a size of 120x120 pixels and even at 80x80 pixels it shows acceptable performance. This is notable, because the network was trained on 220x220 input images. Training with lower resolution faces could improve this range further.


(Table 4. Image Quality. The table on the left shows the effect of varying JPEG quality on the validation rate at 10E-3 precision. The table on the right shows the effect of image size in pixels on the validation rate at 10E-3 precision. This experiment was run with NN1 on the hold-out dataset.)

5.4. Embedding Dimensionality

We explored various embedding dimensionalities and selected 128 for all experiments other than the comparison reported in Table 5. One would expect the larger embeddings to perform at least as good as the smaller ones, however, it is possible that they require more training to achieve the same accuracy. That said, the differences in the performance reported in Table 5 are statistically insignificant. It should be noted, that during training a 128 dimensional float vector is used, but it can be quantized to 128 bytes without loss of accuracy. Thus each face is compactly represented by a 128 dimensional byte vector, which is ideal for large scale clustering and recognition. Smaller embeddings are possible at a minor loss of accuracy and could be employed on mobile devices.

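The paper does not spell out its quantization scheme, so the sketch below, a simple linear per-dimension mapping of each float coordinate into one byte, is only an illustration of how a 128-D float vector can become 128 bytes (unit-norm embeddings have coordinates in [-1, 1]).

```python
import numpy as np

def quantize(emb, lo=-1.0, hi=1.0):
    # Map each float coordinate linearly into one byte (0..255): 128 dims -> 128 bytes.
    q = np.round((np.clip(emb, lo, hi) - lo) / (hi - lo) * 255)
    return q.astype(np.uint8)

def dequantize(q, lo=-1.0, hi=1.0):
    return q.astype(np.float32) / 255 * (hi - lo) + lo
```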

(Table 5. Embedding Dimensionality. This table compares the effect of the embedding dimensionality on our hold-out test set using model NN1. In addition to the VAL at 10E-3 we report the standard error of the mean computed across the five splits.)

5.5. Amount of Training Data

Table 6 shows the impact of large amounts of training data. Due to time constraints this evaluation was run on a smaller model; the effect may be even larger on larger models. It is clear that using tens of millions of exemplars results in a clear boost of accuracy on our personal photo test set from section 4.2. Compared to only millions of images the relative reduction in error is 60%. Using another order of magnitude more images (hundreds of millions) still gives a small boost, but the improvement tapers off.


(Table 6. Training Data Size. This table compares the performance after 700h of training for a smaller model with 96x96 pixel inputs. The model architecture is similar to NN2, but without the 5x5 convolutions in the Inception modules.)

5.6. Performance on LFW

We evaluate our model on LFW using the standard protocol for unrestricted, labeled outside data. Nine training splits are used to select the L2-distance threshold. Classification (same or different) is then performed on the tenth test split. The selected optimal threshold is 1.242 for all test splits except split eighth (1.256).

Our model is evaluated in two modes:

1. Fixed center crop of the LFW provided thumbnail.

2. A proprietary face detector (similar to Picasa [3]) is run on the provided LFW thumbnails. If it fails to align the face (this happens for two images), the LFW alignment is used.

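Selecting the optimal threshold on the nine training splits amounts to a one-dimensional search; in the sketch below the grid resolution and the plain-accuracy criterion are illustrative assumptions.

```python
import numpy as np

def best_threshold(dist, is_same, grid=np.arange(0.0, 4.0, 0.001)):
    # dist: squared L2 distances on the nine training splits; is_same: booleans.
    acc = [np.mean((dist <= d) == is_same) for d in grid]
    return float(grid[int(np.argmax(acc))])
```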

Figure 6 gives an overview of all failure cases. It shows false accepts on the top as well as false rejects at the bottom. We achieve a classification accuracy of 98.87%±0.15 when using the fixed center crop described in (1) and the record breaking 99.63%±0.09 standard error of the mean when using the extra face alignment (2). This reduces the error reported for Deep Face in [17] by more than a factor of 7 and the previous state-of-the-art reported for DeepId2+ in [15] by 30%. This is the performance of model NN1, but even the much smaller NN3 achieves performance that is not statistically significantly different.


(Figure 6. LFW errors. This shows all pairs of images that were incorrectly classified on LFW. Of the thirteen false rejects shown, only eight are actual errors; the other five are mislabeled in LFW.)

5.7. Performance on YouTube Faces DB

We use the average similarity of all pairs of the first one hundred frames that our face detector detects in each video. This gives us a classification accuracy of 95.12%±0.39. Using the first one thousand frames results in 95.18%. Compared to [17] (91.4%), who also evaluate one hundred frames per video, we reduce the error rate by almost half. DeepId2+ [15] achieved 93.2% and our method reduces this error by 30%, comparable to our improvement on LFW.


5.8. Face Clustering

Our compact embedding lends itself to be used in order to cluster a user's personal photos into groups of people with the same identity. The constraints in assignment imposed by clustering faces, compared to the pure verification task, lead to truly amazing results. Figure 7 shows one cluster in a user's personal photo collection, generated using agglomerative clustering. It is a clear showcase of the incredible invariance to occlusion, lighting, pose and even age.


(Figure 7. Face Clustering. Shown is an exemplar cluster for one user. All these images from the user's personal photo collection were clustered together.)
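Such a clustering is a one-liner with off-the-shelf tools; a sketch using scikit-learn's agglomerative clustering, where the distance threshold is an assumed value chosen in the spirit of the 1.1 verification threshold (here on plain, not squared, euclidean distance).

```python
from sklearn.cluster import AgglomerativeClustering

def cluster_faces(embeddings, threshold=1.0):
    # Group photos by identity without fixing the number of people in advance;
    # clusters keep merging while their average pairwise distance stays below threshold.
    clu = AgglomerativeClustering(n_clusters=None,
                                  distance_threshold=threshold,
                                  linkage="average")
    return clu.fit_predict(embeddings)
```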

6. Summary

We provide a method to directly learn an embedding into an Euclidean space for face verification. This sets it apart from other methods [15, 17] who use the CNN bottleneck layer, or require additional post-processing such as concatenation of multiple models and PCA, as well as SVM classification. Our end-to-end training both simplifies the setup and shows that directly optimizing a loss relevant to the task at hand improves performance.


Another strength of our model is that it only requires minimal alignment (tight crop around the face area). [17], for example, performs a complex 3D alignment. We also experimented with a similarity transform alignment and notice that this can actually improve performance slightly. It is not clear if it is worth the extra complexity.


Future work will focus on better understanding of the error cases, further improving the model, and also reducing model size and reducing CPU requirements. We will also look into ways of improving the currently extremely long training times, e.g. variations of our curriculum learning with smaller batch sizes and offline as well as online positive and negative mining.


7. Appendix: Harmonic Embedding

In this section we introduce the concept of harmonic embeddings. By this we denote a set of embeddings that are generated by different models v1 and v2 but are compatible in the sense that they can be compared to each other.


This compatibility greatly simplifies upgrade paths. E.g. in a scenario where embedding v1 was computed across a large set of images and a new embedding model v2 is being rolled out, this compatibility ensures a smooth transition without the need to worry about version incompatibilities. Figure 8 shows results on our 3G dataset. It can be seen that the improved model NN2 significantly outperforms NN1, while the comparison of NN2 embeddings to NN1 embeddings performs at an intermediate level.


(Figure 8. Harmonic Embedding Compatibility. These ROC curves show the harmonic embedding compatibility of NN2 embeddings to NN1 embeddings. NN2 is an improved model that performs much better than NN1. Comparing embeddings generated by NN1 to those generated by NN2 shows the compatibility of the two. In fact, the mixed-mode performance is still better than NN1 by itself.)

7.1. Harmonic Triplet Loss

In order to learn the harmonic embedding we mix embeddings of v1 together with the embeddings v2, that are being learned. This is done inside the triplet loss and results in additionally generated triplets that encourage the compatibility between the different embedding versions. Figure 9 visualizes the different combinations of triplets that contribute to the triplet loss.


(Figure 9. Learning the harmonic embedding. In order to learn a harmonic embedding, we generate triplets that mix the v1 embeddings with the v2 embeddings that are being trained. The semi-hard negatives are selected from the whole set of both v1 and v2 embeddings.)

We initialized the v2 embedding from an independently trained NN2 and retrained the last layer (embedding layer) from random initialization with the compatibility encouraging triplet loss. First only the last layer is retrained, then we continue training the whole v2 network with the harmonic loss.


Figure 10 shows a possible interpretation of how this compatibility may work in practice. The vast majority of v2 embeddings may be embedded near the corresponding v1 embedding, however, incorrectly placed v1 embeddings can be perturbed slightly such that their new location in embedding space improves verification accuracy.


(Figure 10. Harmonic Embedding Space. This illustrates a possible interpretation of how harmonic embeddings can improve verification accuracy while maintaining compatibility to less accurate embeddings. In this scenario there is one misclassified face whose embedding is perturbed to the "correct" location in v2.)

7.2. Summary

These are very interesting findings and it is somewhat surprising that it works so well. Future work can explore how far this idea can be extended. Presumably there is a limit as to how much the v2 embedding can improve over v1, while still being compatible. Additionally it would be interesting to train small networks that can run on a mobile phone and are compatible to a larger server side model.


I am a beginner in face recognition and my English is weak, so parts of this translation reflect my own shallow understanding; please point out anything that is translated poorly. Possibly due to the blog editor, some of the accompanying figures may be missing or overwritten in this post; for the complete version, please refer to my PDF analysis here. Thank you.
