Learning Deep Structure-Preserving Image-Text Embeddings Learning Deep Structure-Preserving Image-Text Embeddings
Paper summary The authors present a neural model that maps images and sentences into the same space, in order to perform cross-modal retrieval – find images based on a sentence or find sentences based on an image. https://i.imgur.com/DCFYzN8.png The image vectors come from a pre-trained VGG image detection network. The sentence vectors are constructed using Fisher vectors, but they also explore simpler options, such as mean word2vec vectors and tfidf. Both are then mapped through nonlinearities and normalised, and Euclidean distance is used to measure vector similarity. They also investigate the task of mapping noun phrases from the image caption to specific areas of the image.

Summary by Marek Rei 3 years ago
Your comment:

ShortScience.org allows researchers to publish paper summaries that are voted on and ranked!

Sponsored by: and