[link]
This paper addresses two tasks: image captioning and VQA. The main idea is to use Faster R-CNN to embed an image as k 2048-dim vectors (one per detected bounding box, giving a k×2048 feature matrix) instead of a ResNet grid (14×14×2048), and to apply attention over the k vectors.

For **VQA**, this is essentially Faster R-CNN features plugged into Show, Ask, Attend and Answer (SAAA). SAAA computes a 2D attention map from the concatenation of a text vector (2048-dim, from an LSTM) and an image tensor (2048×14×14, from ResNet); the image tensor can be viewed as a collection of 196 feature vectors of 2048 dims each, one per spatial location. This paper instead uses Faster R-CNN to obtain k bounding boxes; each box is represented by a 2048-dim vector, so the image becomes a k×2048 matrix, which is fed to SAAA.

**SAAA**: https://i.imgur.com/2FnPXi0.png

**This paper (VQA)**: https://i.imgur.com/xib77Iy.png

For **image captioning**, the paper uses a 2-layer LSTM. The first layer receives the average of the k 2048-dim vectors; its output is used to compute attention weights over the k vectors. The second layer receives the attention-weighted average 2048-dim vector together with the output of the first layer.

https://i.imgur.com/GeXaC30.png
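A minimal NumPy sketch of the SAAA-style attention step over k region vectors (the weight matrices `W_att`, `w` and the hidden size are hypothetical toy values, not the paper's actual parameterization): concatenate the question vector with each region feature, score each pair with a small MLP, softmax over the k regions, and return the weighted average.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(v, q, W_att, w):
    """Attention over k region vectors (SAAA-style, simplified sketch).
    v: (k, 2048) Faster R-CNN region features
    q: (2048,) question vector from an LSTM
    Returns the attention-weighted average region feature, (2048,)."""
    k = v.shape[0]
    fused = np.concatenate([v, np.tile(q, (k, 1))], axis=1)  # (k, 4096)
    scores = np.tanh(fused @ W_att) @ w                      # (k,) one score per box
    alpha = softmax(scores)                                  # attention weights
    return alpha @ v                                         # weighted sum of regions

# toy dimensions for illustration (hypothetical)
rng = np.random.default_rng(0)
k, d, h = 5, 2048, 64
v = rng.standard_normal((k, d))
q = rng.standard_normal(d)
W_att = rng.standard_normal((2 * d, h)) * 0.01
w = rng.standard_normal(h)
v_hat = attend(v, q, W_att, w)
print(v_hat.shape)  # (2048,)
```

The attended vector `v_hat` would then be combined with the question vector and classified over the answer vocabulary, as in SAAA.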
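The captioning decoder can be sketched as one decoding step of a 2-layer LSTM in NumPy. This is a simplified sketch limited to what the summary states: layer 1 sees the mean of the k region vectors, its hidden state scores the k regions, and layer 2 sees the attention-weighted feature plus layer 1's output. The `LSTMCell` class, `W_a`, and all sizes are hypothetical stand-ins.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

class LSTMCell:
    """Minimal NumPy LSTM cell (hypothetical stand-in for a framework cell)."""
    def __init__(self, in_dim, hid, seed):
        rng = np.random.default_rng(seed)
        self.W = rng.standard_normal((in_dim + hid, 4 * hid)) * 0.01
        self.b = np.zeros(4 * hid)
    def __call__(self, x, h, c):
        z = np.concatenate([x, h]) @ self.W + self.b
        i, f, o, g = np.split(z, 4)
        sig = lambda a: 1.0 / (1.0 + np.exp(-a))
        c = sig(f) * c + sig(i) * np.tanh(g)
        h = sig(o) * np.tanh(c)
        return h, c

k, d, hid = 5, 2048, 128                 # toy sizes; d = 2048 as in the paper
V = np.random.default_rng(1).standard_normal((k, d))   # k region vectors
lstm1 = LSTMCell(d, hid, seed=2)         # layer 1: input is the mean of V
lstm2 = LSTMCell(d + hid, hid, seed=3)   # layer 2: attended feature + h1
W_a = np.random.default_rng(4).standard_normal((d + hid, 1)) * 0.01

h1 = c1 = h2 = c2 = np.zeros(hid)
# one decoding step
h1, c1 = lstm1(V.mean(axis=0), h1, c1)   # layer 1 sees the average region vector
scores = np.concatenate([V, np.tile(h1, (k, 1))], axis=1) @ W_a  # (k, 1)
alpha = softmax(scores[:, 0])            # attention weights over the k regions
v_hat = alpha @ V                        # attention-weighted 2048-dim feature
h2, c2 = lstm2(np.concatenate([v_hat, h1]), h2, c2)  # layer 2 input
# h2 would feed a softmax over the caption vocabulary to emit the next word
```

At generation time this step repeats once per output word, with the attention weights recomputed from the current `h1` at every step.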