This paper feels a bit like watching a 90’s show, and everyone’s in denim and miniskirts, except it’s a 2017 ML paper, and everything uses attention. (I’ll say it again, ML years are like dog years, but more so). That said, that’s not a critique of the paper: finding clever ways to cobble together techniques for your application can be an important and valuable contribution. This paper addresses the problem of text to image generation: how to take a description of an image and generate an image that matches it, and it makes two main contributions: 1) a GAN structure that seems to merge insights from Attention and Progressive GANs in order to select areas of the sentence to inform details in specific image regions, and 2) a novel discriminator structure to evaluate whether a sentence matches an image. https://i.imgur.com/JLuuhJF.png Focusing on the first of these first: their generation system works by an iterative process, that gradually builds up image resolution, and also pulls specific information from the sentence to inform details in each region. The first layer of the network generates a first “hidden state” based on a compressed representation of the sentence as a whole (the final hidden state of a LSTM text encoder, I believe), as well as random noise (typical input to a GAN). Subsequent “hidden states” are calculated by calculating attention weightings between each region of the image, and each word in the sentence, and pulling together a per-region context vector based on that attention map. (As far as I understand it, “region” here refers to the fact that when you’re at lower spatial scales of what is essentially a progressive generation process, 64x64 rather than 256x256, for example, each “pixel” actually represents a larger region of the image). I’m using quotes around “hidden state” in the above paragraph because I think it’s actually pretty confusing terminology, since it suggests a recurrent structure, but this model isn’t actually recurrent: there’s a specific set of weights for resolution block 0, and 1, and 2. This whole approach, of calculating a specific attention-weighted context vector over input words based on where you are in the generation process is very conceptually similar to the original domain of attention, where the attention query would be driven by the hidden state of the LSTM generating the translated version of some input sentence, except, here, instead of translating between languages, you’re translating across mediums. The loss for this model is a combination of per-layer loss, and a final, special, full-resolution loss. At each level of resolution, there exists a separate discriminator, which seems to be able to take in both 1) only an image, and judge whether it thinks that image looks realistic on it’s own, and 2) an image and a global sentence vector, and judge whether the image matches the sentence. It’s not fully clear from the paper, but it seems like this is based on just feeding in the sentence vector as additional input? https://i.imgur.com/B6qPFax.png For each non-final layer’s discriminator, the loss is a combination of both of these unconditional and conditional losses. The final contribution of this paper is something they call the DAMSM loss: the Deep Attention Multimodal Similarity Model. This is a fairly complex model structure, whose ultimate goal is to assess how closely a final generated image matches a sentence. The whole structure of this loss is based on projecting region-level image features (from an intermediate, 17x17 layer of a pretrained Inception Net) and word features into the same space, and then calculating dot product similarities between them, which are then used to build “visual context vectors” for each word (for each word, created a weighted sum of visual vectors, based on how similar each is to the word). Then, we take each word’s context vector, and see how close it is to the original word vector. If we, again, imagine image and word vectors as being in a conceptually shared space, then this is basically saying “if I take a weighted average of all the things that are the most similar to me, how ultimately similar is that weighted average to me”. This allows there to be a “concept representation” match found when, for example, a particular word’s concept, like “beak”, is only present in one region, but present there very strongly: the context vector will be strongly weighted towards that region, and will end up being very close, in cosine similarity terms, to the word itself. By contrast, if none of the regions are a particularly good match for the word’s concept, this value will be low. DAMSM then aggregates up to an overall “relevance” score between a sentence and image, that’s simply a sum over a word’s “concept representation”, for each word in a sentence. It then calculates conditional probabilities in two directions: what’s the probability of the sentence, given the image (relevance score of (Sent, Imag), divided by that image’s summed relevance with all possible sentences in the batch), and, also, what’s the probability of the image, given the sentence (relevance score of the pair, divided by the sentence’s summed relevance with all possible images in the batch). In addition to this word-level concept modeling, DAMSM also has full sentence-level versions, where it simply calculates the relevance of each (sentence, image) pair by taking the cosine similarity between the global sentence and global image features (the final hidden state of an encoder RNN, and the final aggregated InceptionNet features, respectively). All these losses are aggregated together, to get one that uses both global information, and information as to whether specific words in a sentence are represented well in an image.