Universal representations: The missing link between faces, text, planktons, and cat breeds
Hakan Bilen and Andrea Vedaldi, 2017
Paper summary by martinthoma

This paper is about transfer learning for computer vision tasks.
* Before this paper, people focused on similar datasets (e.g. ImageNet-like images) or even the same dataset but a different task (classification -> segmentation). This paper looks at extremely different datasets (ImageNet-like images vs. text) but only one task (classification). The authors show that all layers can be shared (including the last classification layer) between datasets such as MNIST and CIFAR-10.
* Normalizing information is necessary for sharing models between datasets in order to compensate for dataset-specific differences. Domain-specific scaling parameters work well.
* Used datasets:
1. MNIST (10 classes: handwritten digits 0-9),
2. SVHN (10 classes: house number digits, 0-9),
3. [CIFAR-10](https://www.cs.toronto.edu/~kriz/cifar.html) (10 classes: airplane, automobile, bird, ...)
4. Daimler Mono Pedestrian Classification Benchmark (18 × 36 pixels)
5. Human Sketch dataset (20000 human sketches of every day objects such as “book”, “car”, “house”, “sun”)
6. German Traffic Sign Recognition (GTSR) Benchmark (43 traffic signs)
7. Plankton imagery data (classification benchmark that contains 30336 images of various organisms ranging from the smallest single-celled protists to copepods, larval fish, and larger jellies)
8. Animals with Attributes (AwA): 30475 images of 50 animal species (for zero-shot learning)
9. Caltech-256: object classification benchmark (256 object categories and an additional background class)
10. Omniglot: 1623 different handwritten characters from 50 different alphabets (one shot learning)
* Images are resized to 64 × 64 pixels; greyscale ones are converted into RGB by setting the three channels to the same value.
* Each dataset is also whitened, by subtracting its mean and dividing it by its standard deviation per channel
* **Architecture**: ResNet + Global Average Pooling + FC with Softmax
* "As the majority of the datasets have a different number of classes, we use a dataset-specific fully connected layer in our experiments unless otherwise stated."
* **Data augmentation**: Same strategy as in [He et al. (2015)](http://www.shortscience.org/paper?bibtexKey=journals/corr/HeZRS15): the 64 × 64 whitened image is padded with 8 pixels on all sides, and a 64 × 64 patch is randomly sampled from the padded image or its horizontal flip (no flipping for MNIST / Omniglot / SVHN, as those contain text).
* **Training**: stochastic gradient descent with momentum
1. Baseline: Train networks for each dataset independently
2. Full sharing: For MNIST / SVHN / CIFAR-10, group classes randomly together so that Node 2 might be digit "7" for MNIST, digit "3" for SVHN and "aeroplane" for CIFAR-10. They are trained together in one network.
3. Deep sharing: Share all layers except the last one. Use all 10 datasets for this.
4. Partial sharing: Have a dataset-specific first part to compensate for different image statistics, but share the middle of the network.
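
To make the "deep sharing" setup with domain-specific scaling parameters more concrete, here is a minimal NumPy sketch. It is not the authors' code: a single linear layer stands in for the shared ResNet trunk, the weights are random rather than trained, and all names (`shared_w`, `domain_params`, `forward`) as well as the feature size are hypothetical choices made for illustration. The class counts are taken from the dataset list above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes; the paper uses a ResNet, not a single linear layer.
FEATURE_DIM = 16
NUM_CLASSES = {"mnist": 10, "gtsr": 43, "caltech256": 257}

# Shared parameters: one weight matrix reused by every dataset.
shared_w = rng.normal(scale=0.1, size=(64 * 64 * 3, FEATURE_DIM))

# Dataset-specific parameters: a scale/shift pair and a classification head.
domain_params = {
    name: {
        "gamma": np.ones(FEATURE_DIM),
        "beta": np.zeros(FEATURE_DIM),
        "head": rng.normal(scale=0.1, size=(FEATURE_DIM, n_classes)),
    }
    for name, n_classes in NUM_CLASSES.items()
}


def forward(images, dataset):
    """Return class probabilities for a batch of 64x64x3 images from `dataset`."""
    p = domain_params[dataset]
    x = images.reshape(len(images), -1) @ shared_w       # shared features
    x = (x - x.mean(axis=0)) / (x.std(axis=0) + 1e-5)    # normalise over the batch
    x = np.maximum(p["gamma"] * x + p["beta"], 0.0)      # domain-specific scale/shift, then ReLU
    logits = x @ p["head"]                               # dataset-specific FC layer
    logits = logits - logits.max(axis=1, keepdims=True)  # numerically stable softmax
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)


batch = rng.normal(size=(4, 64, 64, 3))
probs_mnist = forward(batch, "mnist")
probs_gtsr = forward(batch, "gtsr")
```

Only `gamma`, `beta` and `head` differ between datasets here; everything else is shared. In the "full sharing" variant the per-dataset head would also be replaced by a single shared one.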
The results seem to be inconclusive to me.
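
The preprocessing and augmentation steps listed in the summary could look roughly like the following NumPy sketch. It assumes images have already been resized to 64 × 64, and it uses zero padding, which the summary does not specify. The function names (`to_rgb`, `whiten`, `augment`) are made up for this example.

```python
import numpy as np


def to_rgb(image):
    """Replicate a greyscale (H, W) image into three identical channels."""
    if image.ndim == 2:
        image = np.stack([image, image, image], axis=-1)
    return image.astype(np.float32)


def whiten(images):
    """Subtract the per-channel mean of a dataset (N, H, W, 3) and divide by its per-channel std."""
    mean = images.mean(axis=(0, 1, 2), keepdims=True)
    std = images.std(axis=(0, 1, 2), keepdims=True)
    return (images - mean) / (std + 1e-8)


def augment(image, rng, allow_flip=True, pad=8):
    """Pad by `pad` pixels per side, crop a random patch of the original size, optionally flip."""
    h, w, _ = image.shape
    # Assumption: zero padding (np.pad's default constant mode).
    padded = np.pad(image, ((pad, pad), (pad, pad), (0, 0)))
    top = rng.integers(0, 2 * pad + 1)
    left = rng.integers(0, 2 * pad + 1)
    patch = padded[top:top + h, left:left + w]
    if allow_flip and rng.random() < 0.5:
        patch = patch[:, ::-1]  # horizontal flip
    return patch


rng = np.random.default_rng(0)
grey = rng.random((5, 64, 64))                        # stand-in for a greyscale dataset
dataset = whiten(np.stack([to_rgb(g) for g in grey]))
patch = augment(dataset[0], rng, allow_flip=False)    # no flip, as for MNIST / Omniglot / SVHN
```

Whitening is computed once over the whole dataset rather than per image, matching the "each dataset is also whitened" wording.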
## Follow-up / related work
First published: 2017/01/25

Abstract: With the advent of large labelled datasets and high-capacity models, the
performance of machine vision systems has been improving rapidly. However, the
technology has still major limitations, starting from the fact that different
vision problems are still solved by different models, trained from scratch or
fine-tuned on the target data. The human visual system, in stark contrast,
learns a universal representation for vision in the early life of an
individual. This representation works well for an enormous variety of vision
problems, with little or no change, with the major advantage of requiring
little training data to solve any of them.