Taigman, Yaniv, et al. “Deepface: Closing the gap to human-level performance in face verification.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2014. (Citations: 851).

1 Motivation

Aligning faces in the unconstrained scenario is difficult because of variations in
• Pose (due to the non-planarity of the face).
• Non-rigid expressions.

The whole detection and alignment pipeline can be seen in Fig. 12.

Figure 12: Detection and alignment pipeline. (a) The detected face, with 6 initial fiducial points. (b) The induced 2D-aligned crop. (c) 67 fiducial points on the 2D-aligned crop with their corresponding Delaunay triangulation; triangles were added on the contour to avoid discontinuities. (d) The reference 3D shape transformed to the 2D-aligned crop image-plane. (e) Triangle visibility with respect to the fitted 3D-2D camera; darker triangles are less visible. (f) The 67 fiducial points induced by the 3D model that are used to direct the piece-wise affine warping. (g) The final frontalized crop. (h) A new view generated by the 3D model (not used in this paper).

2 Detection
Fiducial Point Detector
• Choose 6 fiducial points: 2 eyes’ center + 1 nose tip + 3 mouth points, see Fig. 12(a).
• Fiducial points are extracted by an SVR (support vector regressor) trained to predict point configurations from an image descriptor (LBP histograms).
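
To make the descriptor concrete, here is a minimal sketch of a basic 8-neighbour LBP histogram. These notes do not say which LBP variant the paper uses, so the variant, the neighbour ordering, and the function name `lbp_histogram` are assumptions; the SVR that consumes the histogram is not shown.

```python
def lbp_histogram(image):
    """256-bin histogram of basic 8-neighbour LBP codes.

    `image` is a 2D grayscale image given as a list of equal-length rows.
    Border pixels are skipped because they lack a full neighbourhood.
    """
    # Neighbour offsets, clockwise from top-left (ordering is an assumption).
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    hist = [0] * 256
    for r in range(1, len(image) - 1):
        for c in range(1, len(image[0]) - 1):
            center = image[r][c]
            code = 0
            for bit, (dr, dc) in enumerate(offsets):
                # Set the bit when the neighbour is at least as bright as the center.
                if image[r + dr][c + dc] >= center:
                    code |= 1 << bit
            hist[code] += 1
    return hist
```

In practice such histograms are usually computed per image cell and concatenated before being passed to the regressor.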

3 Alignment
2D Alignment
• Use the 6 fiducial points to scale, rotate, and translate the image so that the points land on 6 fixed anchor locations, see Fig. 12(b) for the alignment result.
• However, this alignment fails to compensate for out-of-plane rotation, which is particularly important in unconstrained conditions.
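
The scale/rotate/translate step can be sketched as a closed-form least-squares fit of a 2D similarity transform. This is an illustration, not necessarily the paper's exact procedure; the function names and the complex-number encoding of points are assumptions.

```python
def fit_similarity(src, dst):
    """Least-squares similarity transform mapping src points onto dst points.

    Points are (x, y) tuples. Returns complex numbers (a, b) such that
    z -> a * z + b, where abs(a) is the scale, the phase of a is the
    rotation angle, and b is the translation.
    """
    zs = [complex(x, y) for x, y in src]
    ws = [complex(x, y) for x, y in dst]
    n = len(zs)
    z_mean = sum(zs) / n
    w_mean = sum(ws) / n
    # Minimizing sum |a*z + b - w|^2 gives this closed form.
    num = sum((z - z_mean).conjugate() * (w - w_mean) for z, w in zip(zs, ws))
    den = sum(abs(z - z_mean) ** 2 for z in zs)
    a = num / den
    b = w_mean - a * z_mean
    return a, b


def apply_similarity(a, b, point):
    """Apply z -> a * z + b to a single (x, y) point."""
    w = a * complex(*point) + b
    return (w.real, w.imag)
```

Here `src` would hold the 6 detected fiducial points and `dst` the 6 anchor locations; the fitted transform is then applied to the whole image.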

3D Alignment
• Use a generic 3D shape model.
• Localize 67 additional fiducial points on the 2D-aligned crop (Fig. 12(b)) using a second SVR, see Fig. 12(c).
• An affine 3D-to-2D camera P is then fitted by generalized least squares.
• However, this alignment fails to model full perspective projections and non-rigid deformations. Therefore, small distortions are allowed when warping the 2D image, see Fig. 12(g) for the alignment result.
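
A minimal sketch of fitting the affine camera, assuming a 2 × 4 matrix P and ordinary least squares (the generalized version additionally weights the residuals by a known covariance, which is omitted here). All function names are hypothetical.

```python
def solve_linear(A, b):
    """Solve the square system A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def fit_affine_camera(points_3d, points_2d):
    """Fit a 2x4 affine camera P so that P [X, Y, Z, 1]^T ~= (u, v)."""
    H = [[X, Y, Z, 1.0] for X, Y, Z in points_3d]
    # Normal equations (H^T H) p = H^T t, solved once per image coordinate.
    HtH = [[sum(h[i] * h[j] for h in H) for j in range(4)] for i in range(4)]
    P = []
    for k in range(2):
        Htt = [sum(h[i] * pt[k] for h, pt in zip(H, points_2d)) for i in range(4)]
        P.append(solve_linear(HtH, Htt))
    return P


def project(P, point_3d):
    """Project one 3D point to the image plane with the affine camera P."""
    h = list(point_3d) + [1.0]
    return tuple(sum(p * x for p, x in zip(row, h)) for row in P)
```

With the 67 reference 3D points and their detected 2D counterparts as input, the residuals between `project(P, X)` and the detected points are what the affine camera fails to explain.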

4 Representation + Classification
In a Nutshell (120M Parameters)

• input (3 × 152 × 152).
• conv1 (32@11 × 11), relu1, pool1 (3 × 3, s2).
• conv2 (16@9 × 9), relu2.
• lc3 (16@9 × 9), relu3.
• lc4 (16@7 × 7), relu4.
• lc5 (16@5 × 5), relu5.
• fc6 (4096), relu6, drop6.
• fc7 (4030).

CONV Layers Used to extract low-level features like simple edges and texture.

No POOL2 Several levels of pooling would cause the network to lose information about the precise position of detailed facial structure and microtextures.

Locally Connected Layers (LC) Like a conv layer, they apply a filter bank, but every location in the feature map learns a different set of filters, since different regions of an aligned image have different local statistics.
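
The cost of not sharing filters can be seen by counting weights. The sketch below compares a conv layer with a locally connected layer of the same filter shape (lc3: 16 filters of 9 × 9 over 16 input channels). The 63 × 63 input size, the absence of padding, and the function names are assumptions for illustration; biases are ignored.

```python
def output_size(input_size, kernel, stride=1):
    """Spatial output size of an unpadded filter with the given kernel and stride."""
    return (input_size - kernel) // stride + 1


def conv_weights(kernel, in_channels, out_channels):
    """Weights in a conv layer: one filter bank shared by all positions."""
    return kernel * kernel * in_channels * out_channels


def lc_weights(input_size, kernel, in_channels, out_channels, stride=1):
    """Weights in a locally connected layer: a separate filter bank per output position."""
    out = output_size(input_size, kernel, stride)
    return out * out * conv_weights(kernel, in_channels, out_channels)


# lc3-shaped layer; the 63 x 63 input size is an assumption.
shared = conv_weights(9, 16, 16)
unshared = lc_weights(63, 9, 16, 16)
print(shared, unshared, unshared // shared)
```

Under these assumptions the layer has 55 × 55 output positions, so the locally connected version holds 3,025 times as many weights as the shared-filter version. This is why the LC layers account for a large share of the parameter count.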

5 Analysis
Features Produced by this Network are Very Sparse. See Fig. 13.

6 Verification Task
Definition Verifying whether two input instances belong to the same class (identity).

Idea Use the ℓ2-normalized fc6 features. The key is to design a similarity measure.

Unsupervised Similarity Inner product of the two features.
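
A minimal sketch of this measure, assuming the features are plain lists of floats (function names are hypothetical). After ℓ2 normalization the inner product equals the cosine similarity.

```python
import math


def l2_normalize(features):
    """Scale a feature vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in features))
    return [x / norm for x in features] if norm > 0 else list(features)


def inner_product_similarity(f1, f2):
    """Inner product of two l2-normalized feature vectors."""
    a, b = l2_normalize(f1), l2_normalize(f2)
    return sum(x * y for x, y in zip(a, b))
```

Two faces would be declared the same identity when the similarity exceeds a threshold chosen on held-out pairs.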

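Weighted χ^2 Distance

χ^2(f1, f2) = Σ_i w_i (f1[i] − f2[i])^2 / (f1[i] + f2[i])

where f1 and f2 are the two feature vectors. The weights w are learned using a linear SVM applied to the vectors with elements (f1[i] − f2[i])^2 / (f1[i] + f2[i]).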
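
A minimal sketch of the weighted χ^2 distance, Σ_i w_i (f1[i] − f2[i])^2 / (f1[i] + f2[i]), assuming non-negative features (as produced by a ReLU). The weights are taken as given here; learning them with the linear SVM is not shown, and the `eps` guard is an implementation assumption.

```python
def weighted_chi2(f1, f2, w, eps=1e-12):
    """Weighted chi-squared distance between two non-negative feature vectors."""
    total = 0.0
    for a, b, wi in zip(f1, f2, w):
        denom = a + b
        # Skip dimensions where both features are zero to avoid dividing by zero.
        if denom > eps:
            total += wi * (a - b) ** 2 / denom
    return total
```

Because the features are sparse, many dimensions are zero in both vectors and contribute nothing to the sum.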

Siamese Network End-to-end learning. The face recognition network (without the top layer) is replicated twice (once for each input image), and the features are used to directly predict whether the two input images belong to the same person. This is done in two steps:
• 1. Take the absolute difference between the features.
• 2. Feed it to a top fc layer that maps into a single logistic unit (same/not same).
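
A minimal sketch of the Siamese head only: the absolute difference of the two feature vectors followed by a single logistic unit. The shared feature network is not shown, and `weights` and `bias` stand in for parameters that would be learned; the function name is hypothetical.

```python
import math


def siamese_same_probability(f1, f2, weights, bias):
    """Probability that two feature vectors belong to the same person."""
    # Element-wise absolute difference of the two feature vectors.
    abs_diff = [abs(a - b) for a, b in zip(f1, f2)]
    # Top layer: one logistic unit over the difference vector.
    logit = sum(w * d for w, d in zip(weights, abs_diff)) + bias
    return 1.0 / (1.0 + math.exp(-logit))
```

Using the absolute difference makes the prediction symmetric in the two inputs, so swapping the images does not change the output.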

7 Ensembles of Networks
Several networks are trained, each fed a different type of input:
• 3D aligned RGB inputs.
• 2D aligned RGB inputs.
• The gray-level image plus image gradient magnitude and orientation.

Dataset Face dataset
• SFC (Social Face Classification): 4.4M faces, 4k people.

Identification dataset
• LFW (Labeled Faces in the Wild): 6k face pairs, 5.7k people.
• YTF (YouTube Faces): 5k video pairs, 1.6k people.

Train/Test Splitting The most recent 5% of face images of each identity are left out for testing. This is done according to the images' timestamps in order to simulate continuous identification through aging.
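
A minimal sketch of such a per-identity, time-ordered split. The `(identity, timestamp, image_id)` record layout, the function name, and rounding the 5% down are assumptions.

```python
from collections import defaultdict


def split_by_recency(samples, test_fraction=0.05):
    """Hold out the most recent fraction of each identity's images for testing.

    `samples` is an iterable of (identity, timestamp, image_id) tuples.
    Returns (train, test) lists of the same tuples.
    """
    by_identity = defaultdict(list)
    for sample in samples:
        by_identity[sample[0]].append(sample)
    train, test = [], []
    for items in by_identity.values():
        items.sort(key=lambda s: s[1])  # oldest first
        n_test = int(len(items) * test_fraction)  # rounds down (assumption)
        cut = len(items) - n_test
        train.extend(items[:cut])
        test.extend(items[cut:])
    return train, test
```

Splitting by time rather than at random means every test image is newer than all training images of the same person.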

Result LFW winner
• Accuracy: 97.25%.
• Human performance: 97.5%.

