FaceNet: A Unified Embedding for Face Recognition and Clustering on ShortScience.org

arxiv.org
arxiv-vanity.com
scholar.google.com

FaceNet: A Unified Embedding for Face Recognition and Clustering
Florian Schroff and Dmitry Kalenichenko and James Philbin
arXiv e-Print archive - 2015 via Local arXiv
Keywords: cs.CV
more

Summaries/Notes 2

[link] Summary by ANIRUDH NJ 5 years ago

## Keywords
Triplet-loss , face embedding , harmonic embedding

---
## Summary

### Introduction

**Goal of the paper**
A unified system is given for face verification , recognition and clustering.
Use of a 128 float pose and illumination invariant feature vector or embedding in the euclidean space.
* Face Verification : Same faces of the person gives feature vectors that have a very close L2 distance between them.
* Face recognition : Face recognition becomes a clustering task in the embedding space

**Previous work**
* Previous use of deep learning made use of an bottleneck layer to represent face as an embedding of 1000s dimension vector.
* Some other techniques use PCA to reduce the dimensionality of the embedding for comparison.

**Method**
* This method makes use of inception style CNN to get an embedding of each face.
* The thumbnails of the face image are the tight crop of the face area with only scaling and translation done on them.

**Triplet Loss**
Triplet loss makes use of two matching face thumbnails and a non-matching thumbnail. The loss function tries to reduce the distance between the matching pair while increasing the separation between the the non-matching pair of images.

**Triplet Selection**
* Selection of triplets is done such that samples are hard-positive or hard-negative .
* Hardest negative can lead to local minima early in the training and a collapse model in a few cases
* Use of semi-hard negatives help to improve the convergence speed while at the same time reach nearer to the global minimum.

**Deep Convolutional Network**
* Training is done using SGD (Stochastic gradient descent) with Backpropagation and AdaGrad
* The training is done on two networks :
- Zeiler&Fergus architecture with model depth of 22 and 140 million parameters
- GoogLeNet style inception model with 6.6 to 7.5 million parameters.

**Experiment**
* Study of the following cases are done :
- Quality of the jpeg image : The validation rate of model improves with the JPEG quality upto a certain threshold.

- Embedding dimensionality : The dimension of the embedding increases from 64 to 128,256 and then gradually starts to decrease at 512 dimensions.

- No. of images in the training data set

**Results classification accuracy** :
- LFW(Labelled faces in the wild) dataset : 98.87% 0.15
- Youtube Faces DB : 95.12% .39
On clustering tasks the model was able to work on a wide varieties of face images and is invariant to pose , lighting and also age.

**Conclusion**

* The model can be extended further to improve the overall accuracy.
* Training networks to run on smaller systems like mobile phones.
* There is need for improving the training efficiency.

---

## Notes

* Harmonic embedding is a set of embedding that we get from different models but are compatible to each other. This helps to improve future upgrades and transitions to a newer model

* To make the embeddings compatible with different models , harmonic-triplet loss and the generated triplets must be compatible with each other

## Open research questions

* Better understanding of the error cases.
* Making the model more compact for embedded and mobile use cases.
* Methods to reduce the training times.

Your comment:

Write your summary here (You can use $\LaTeX$ and markdown syntax):

Anon Private