# Summary The authors present a way to generate captions describing the content of images using attention-based mechanisms. They present two ways of training the network, one via standard backpropagation techniques and another using stochastic processes. They also show how their model can selectively "focus" on the relevant parts of an image to generate appropriate captions, as shown in the classic example of the famous woman throwing a frisbee. Finally, they validate their model on Flicker8k, Flicker30k and MSCOCO. ![image](https://user-images.githubusercontent.com/18450628/61397054-10639300-a897-11e9-8b4a-f4cd804c3229.png) # Model At a very high level, the model takes as input an image I and returns a caption generated from a pre-defined vocabulary: ![image](https://user-images.githubusercontent.com/18450628/61398513-20c93d00-a89a-11e9-8e93-72ccf7a61be1.png) A high-level overview of the model is presented in Figure 1: ![image](https://user-images.githubusercontent.com/18450628/61398365-de076500-a899-11e9-8413-55ec755f0f83.png) ## Visual extractor A CNN is used to extract features from the image. The authors experimented with VGG-19 pretrained on ImageNet and not finetuned. They use the features from the last convolutional layer as their representations. Starting with images of 224x224, the last CNN feature map has shape 14x14x512, which they flatten along width and height to obtain a vector representation of 196x512. These 512 vectors are used as inputs to the language model. ## Sentence generation An LSTM network is used to generate a sequence of words from a fixed vocabulary of size L. As input, a weighted sum based on attention values of the vectors of the flattened image features is used. The previous word is also fed as input to the LSTM. The hidden layer from the previous timestep as well as the layers from the CNN a_i are fed through an MLP + softmax layer and used to generate attention values for each flattened image feature vector that sum to one. ![image](https://user-images.githubusercontent.com/18450628/61462206-41e46900-a940-11e9-991d-e3a9e4b98837.png) ![image](https://user-images.githubusercontent.com/18450628/61462544-c9ca7300-a940-11e9-8c31-dbf85bf8301f.png) The authors propose two ways to compute phi, i.e. the attention, which they refer to as "soft attention" and "hard attention". These will be covered in a later section. The output of the LSTM, z, is then fed to a deep network to generate the next word. This is detailed in the following figure. ![image](https://user-images.githubusercontent.com/18450628/61408594-6132b600-a8ae-11e9-894c-392396e299b0.png) ## Attention The paper proposes two methods of attention, a "soft" attention and a "hard" attention. ### Soft attention Soft attention is the most intuitive one and is relatively straight forward. In order to compute the vector representing the image as input to the LSTM, **z**, the expectation of the context vector is computed by using a weighted average scheme: ![image](https://user-images.githubusercontent.com/18450628/61408939-34cb6980-a8af-11e9-989e-24308be3ed3c.png) where alpha are the attention weights and a_i are the vectors of the feature representation. To ensure that all image features are used somewhat equally, a regularization term is added to the loss function: ![image](https://user-images.githubusercontent.com/18450628/61412057-d2c23280-a8b5-11e9-9d9c-7f35edc650ef.png) This ensures that the image feature vectors over time sum to 1 as closely as possible and that no part of the image is ignored. ### Hard attention The authors propose an alternative method to calculate attention. Each attention parameter is treated as an intermediate latent variables that can be represented in one-hot encoding, i.e. on or off. To do so, they use a multinoulli distribution parametrized by alpha, the softmax output of f_att. They show how they can approximate the gradient using monte-carlo methods: ![image](https://user-images.githubusercontent.com/18450628/61413158-d1463980-a8b8-11e9-8aef-b2a9d9bb6bad.png) Refer to the paper for more mathemagical details. Finally, they use soft attention with probability 0.5 when using hard attention. ## Visualizing features One of the contributions of this work is showing what the network is "attending" to. To do so, the authors use the final layer of VGG-19, which consists of 14x14x512 features, upsample the resulting filters to the original image size of 224x224 and use a Gaussian blur to recreate the receptive field. ## Results The authors evaluate their methods on 3 caption datasets, i.e. Flicker8k, Flicker30k and MSCOCO. For all our experiments, they used a fixed vocabulary size of 10,000. They report both BLEU and METEOR scores on the task. As can be seen in the following figure, both the soft and hard attention mechanisms beat all state of the art methods at the time of publishing. Hard attention outperformed soft attention most of the time. ![image](https://user-images.githubusercontent.com/18450628/61412281-77dd0b00-a8b6-11e9-882f-73e86638bc9d.png) # Comments Cool paper which lead the way in terms of combining text and images and using attention mechanisms. They show an interesting way to visualize what the network is attending to, although it was not clear to me why they should expect the final layer of the CNN to be showing that in the first place since they did not finetune on the datasets they were training on. I would expect that to mean that their method would work best on datasets most "similar" to ImageNet. Their hard attention mechanism seems a lot more complicated than the soft attention mechanism and it isn't always clear that it is much better other than it offers a stronger regularization and a type of dropout.