Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering
Paper summary

This paper addresses two tasks: image captioning and VQA. The main idea is to embed the image with Faster R-CNN (kx2048, one 2048-dim vector for each of k bounding boxes) instead of a ResNet grid (14x14x2048), and to apply attention over the k vectors.

For **VQA**, this is basically Faster R-CNN + SAAA (ShowAskAttendAnswer). SAAA calculates a 2D attention map from the concatenation of a text vector (2048-dim from LSTM) and an image tensor (2048x14x14 from ResNet). The image tensor can be thought of as a collection of 14x14 = 196 feature vectors, each 2048-dim. This paper instead uses Faster R-CNN to get k bounding boxes. Each bounding box is represented by a 2048-dim vector, so the image becomes a kx2048 matrix, which is fed to SAAA in place of the grid.

**SAAA**: https://i.imgur.com/2FnPXi0.png

**This paper (VQA)**: https://i.imgur.com/xib77Iy.png

For **Image Captioning**, it uses a 2-layer LSTM. The first layer gets the average of the k 2048-dim vectors. Its output is used to calculate the attention weights over the k vectors. The second layer gets the attention-weighted average of the k vectors (a single 2048-dim vector) and the output of the first layer.

https://i.imgur.com/GeXaC30.png
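The step shared by both models, attention over k region vectors, can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the additive scoring function, the weight names (`W_v`, `W_q`, `w_a`), the sizes `k=36`, `h=512`, `a=256`, and the random features standing in for Faster R-CNN outputs are all assumptions made for the example.

```python
import numpy as np

def softmax(x):
    # Subtract the max for numerical stability before exponentiating.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attend(regions, query, W_v, W_q, w_a):
    """Soft attention over k region vectors.

    regions: (k, d) one feature vector per bounding box
    query:   (h,)   text vector (VQA) or first-layer LSTM output (captioning)
    Returns the attention weights (k,) and the weighted average (d,).
    """
    # Assumed additive scoring: w_a^T tanh(W_v v_i + W_q q) for each region i.
    hidden = np.tanh(regions @ W_v + query @ W_q)   # (k, a)
    scores = hidden @ w_a                           # (k,)
    weights = softmax(scores)                       # (k,), sums to 1
    attended = weights @ regions                    # (d,)
    return weights, attended

# Hypothetical sizes; random values stand in for real Faster R-CNN features.
rng = np.random.default_rng(0)
k, d, h, a = 36, 2048, 512, 256
regions = rng.standard_normal((k, d))
query = rng.standard_normal(h)
W_v = rng.standard_normal((d, a)) * 0.01
W_q = rng.standard_normal((h, a)) * 0.01
w_a = rng.standard_normal(a) * 0.01

weights, attended = attend(regions, query, W_v, W_q, w_a)

# Input to the first captioning LSTM layer: the plain average of the k vectors.
mean_pooled = regions.mean(axis=0)
```

In the captioning model, `mean_pooled` plays the role of the first layer's input and `attended` the role of the second layer's input. When all scores are equal, the attention weights are uniform and `attended` reduces to `mean_pooled`.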
Peter Anderson and Xiaodong He and Chris Buehler and Damien Teney and Mark Johnson and Stephen Gould and Lei Zhang
arXiv e-Print archive - 2017 via Local arXiv
Keywords: cs.CV


Summary from elbaro