FaceNet directly maps face images to $\mathbb{R}^{128}$, where distances directly correspond to a measure of face similarity. It uses a triplet loss function. The triplet is (face of person A, another face of person A, face of a person who is not A), later called (anchor, positive, negative). The loss function is learned and inspired by LMNN. The idea is to minimize the distance between the two images of the same person and to maximize the distance to the other person's image.

## LMNN

Large Margin Nearest Neighbor (LMNN) learns a pseudometric

$$d(x, y) = (x - y) M (x - y)^T$$

where $M$ is a positive semidefinite matrix. The only difference between a pseudometric and a metric is that $d(x, y) = 0 \Leftrightarrow x = y$ does not have to hold.

## Curriculum Learning: Triplet selection

Show simple examples first, then increase the difficulty. This is done by selecting the triplets: they use the triplets which are *hard*. For the positive example, this means the distance between the anchor and the positive example is high; for the negative example, it means the distance between the anchor and the negative example is low. They want to have

$$\|f(x_i^a) - f(x_i^p)\|_2^2 + \alpha < \|f(x_i^a) - f(x_i^n)\|_2^2$$

where $\alpha$ is a margin, $x_i^a$ is the anchor, $x_i^p$ is the positive face example, and $x_i^n$ is the negative example. They increase $\alpha$ over time. It is crucial that $f$ maps the images not into all of $\mathbb{R}^{128}$, but onto the unit sphere; otherwise one could satisfy any margin $\alpha$ simply by scaling $f$, e.g. $f' = 2 \cdot f$.

## Tasks

* **Face verification**: Is this the same person?
* **Face recognition**: Who is this person?
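The triplet constraint above can be sketched in numpy as a hinge loss; this is a minimal illustration, not the paper's implementation, and the margin value `0.2` is an arbitrary placeholder:

```python
import numpy as np

def l2_normalize(x):
    # Project embeddings onto the unit sphere, so that the margin alpha
    # cannot be trivially satisfied by rescaling f (e.g. f' = 2 * f).
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def triplet_loss(anchor, positive, negative, alpha=0.2):
    # Hinge form of the constraint
    #   ||f(a) - f(p)||^2 + alpha < ||f(a) - f(n)||^2:
    # the loss is zero exactly when the constraint is satisfied.
    a, p, n = (l2_normalize(v) for v in (anchor, positive, negative))
    d_pos = np.sum((a - p) ** 2, axis=-1)
    d_neg = np.sum((a - n) ** 2, axis=-1)
    return np.maximum(0.0, d_pos - d_neg + alpha)
```

For an easy triplet (negative far from the anchor) the loss is zero; when negative and positive coincide, the loss equals the margin $\alpha$.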
## Datasets

* 99.63% accuracy on Labeled Faces in the Wild (LFW)
* 95.12% accuracy on YouTube Faces DB

## Network

Two models are evaluated: the [Zeiler & Fergus model](http://www.shortscience.org/paper?bibtexKey=journals/corr/ZeilerF13) and an architecture based on the [Inception model](http://www.shortscience.org/paper?bibtexKey=journals/corr/SzegedyLJSRAEVR14).

## See also

* [DeepFace](http://www.shortscience.org/paper?bibtexKey=conf/cvpr/TaigmanYRW14#martinthoma)
## Keywords

Triplet loss, face embedding, harmonic embedding

## Summary

### Introduction

**Goal of the paper**

A unified system is given for face verification, recognition and clustering. It uses a 128-dimensional float feature vector (embedding) in Euclidean space that is invariant to pose and illumination.

* Face verification: faces of the same person give feature vectors with a very small L2 distance between them.
* Face recognition: face recognition becomes a clustering task in the embedding space.

**Previous work**

* Previous deep learning approaches used a bottleneck layer to represent a face as an embedding with thousands of dimensions.
* Some other techniques use PCA to reduce the dimensionality of the embedding for comparison.

**Method**

* This method uses an Inception-style CNN to get an embedding of each face.
* The thumbnails of the face image are tight crops of the face area, with only scaling and translation applied to them.

**Triplet Loss**

The triplet loss uses two matching face thumbnails and one non-matching thumbnail. The loss function tries to reduce the distance between the matching pair while increasing the separation between the non-matching pair of images.

**Triplet Selection**

* Triplets are selected such that the samples are hard-positive or hard-negative.
* The hardest negatives can lead to bad local minima early in training and, in a few cases, a collapsed model.
* Using semi-hard negatives improves convergence speed while at the same time getting closer to the global minimum.

**Deep Convolutional Network**

* Training is done using SGD (stochastic gradient descent) with backpropagation and AdaGrad.
* Two networks are trained:
  * Zeiler & Fergus architecture with a depth of 22 layers and 140 million parameters
  * GoogLeNet-style Inception model with 6.6 to 7.5 million parameters
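The semi-hard selection idea above can be sketched as follows; this is a simplified per-triplet version (the paper mines negatives within mini-batches), and the margin `0.2` is an illustrative value:

```python
import numpy as np

def select_semi_hard_negative(anchor, positive, candidates, alpha=0.2):
    # Semi-hard mining sketch: among candidate negatives that are farther
    # from the anchor than the positive (d_pos < d_neg), pick the closest
    # one. Picking the absolute hardest negative (smallest d_neg overall)
    # can collapse the model early in training.
    d_pos = np.sum((anchor - positive) ** 2)
    d_neg = np.sum((candidates - anchor) ** 2, axis=1)
    mask = d_neg > d_pos
    if not mask.any():
        # No candidate satisfies d_pos < d_neg: fall back to the
        # least-violating (farthest) negative.
        return int(np.argmax(d_neg))
    idx = np.where(mask)[0]
    return int(idx[np.argmin(d_neg[idx])])
```

The returned index points at the negative that is hard but still consistent with the triplet constraint, which is what stabilizes early training.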
**Experiment**

* The following cases are studied:
  * JPEG quality: the validation rate of the model improves with JPEG quality up to a certain threshold.
  * Embedding dimensionality: performance improves as the dimension of the embedding increases from 64 to 128 and 256, then starts to decrease at 512 dimensions.
  * Number of images in the training data set

**Results (classification accuracy)**

* LFW (Labeled Faces in the Wild) dataset: 98.87% ± 0.15
* YouTube Faces DB: 95.12% ± 0.39

On clustering tasks, the model was able to work on a wide variety of face images and is invariant to pose, lighting and also age.

**Conclusion**

* The model can be extended further to improve the overall accuracy.
* Train networks to run on smaller systems such as mobile phones.
* There is a need to improve training efficiency.

## Notes

* A harmonic embedding is a set of embeddings obtained from different models that are nevertheless compatible with each other. This helps with future upgrades and transitions to a newer model.
* To make the embeddings of different models compatible, the harmonic triplet loss and the generated triplets must be compatible with each other.

## Open research questions

* Better understanding of the error cases.
* Making the model more compact for embedded and mobile use cases.
* Methods to reduce the training times.
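As both summaries note, once faces live in the embedding space, verification reduces to comparing an L2 distance against a threshold. A minimal sketch (the threshold `1.1` here is an illustrative placeholder, not the value tuned in the paper):

```python
import numpy as np

def same_person(emb_a, emb_b, threshold=1.1):
    # Face verification via the embedding: two (unit-norm) embeddings
    # depict the same person iff their squared L2 distance falls below
    # a validation threshold chosen on held-out data.
    return float(np.sum((emb_a - emb_b) ** 2)) < threshold
```

Clustering works the same way: group embeddings whose pairwise distances stay under the threshold.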