Summaries from Conference and Computer Vision and Pattern Recognition on ShortScience.org

Geometric robustness of deep networks: analysis and improvement
Kanbak, Can and Moosavi-Dezfooli, Seyed-Mohsen and Frossard, Pascal
arXiv e-Print archive - 2017 via Local Bibsonomy
Keywords: dblp

Summary by David Stutz 5 years ago

Kanbak et al. propose ManiFool, a method to determine a network’s invariance to transformations by iteratively finding adversarial transformations. In particular, given a class of transformations to consider, ManiFool alternates two steps. First, a gradient step is taken to move in an adversarial direction; then, the resulting perturbation is projected back onto the space of allowed transformations. While the details are slightly more involved, I found this approach similar to the general projected gradient ascent approach to finding adversarial examples. By finding worst-case transformations for a set of test samples, Kanbak et al. are able to quantify the invariance of a network against specific transformations. Furthermore, they show that adversarial fine-tuning on the found adversarial transformations boosts invariance while incurring only a small loss in general accuracy. Examples of the found adversarial transformations are shown in Figure 1.

https://i.imgur.com/h83RdE8.png
Figure 1: The proposed attack method can handle different classes of transformations, as shown in these examples.
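
To make the comparison with projected gradient ascent concrete, here is a minimal sketch of that generic scheme applied to a single transformation parameter (a rotation angle). This is not ManiFool itself: the toy linear "classifier" loss, the finite-difference gradient, and all names (`rotate`, `loss`, `project`, `adversarial_rotation`) are assumptions made purely for illustration.

```python
import math

def rotate(point, theta):
    """Rotate a 2-D point by theta radians around the origin."""
    x, y = point
    c, s = math.cos(theta), math.sin(theta)
    return (c * x - s * y, s * x + c * y)

def loss(point, theta, w=(1.0, 0.0)):
    """Toy stand-in for a classifier loss (assumption): the negative margin
    of a linear classifier w on the rotated input. Higher means worse."""
    rx, ry = rotate(point, theta)
    return -(w[0] * rx + w[1] * ry)

def project(theta, max_angle):
    """Project the parameter back onto the allowed set [-max_angle, max_angle]."""
    return max(-max_angle, min(max_angle, theta))

def adversarial_rotation(point, max_angle, step=0.1, iters=50, eps=1e-5):
    """Projected gradient ascent on the rotation angle."""
    theta = 0.0
    for _ in range(iters):
        # Central finite difference approximates d(loss)/d(theta).
        grad = (loss(point, theta + eps) - loss(point, theta - eps)) / (2 * eps)
        # Ascent step followed by projection onto the allowed transformations.
        theta = project(theta + step * grad, max_angle)
    return theta

theta_adv = adversarial_rotation((1.0, 0.5), max_angle=0.5)
```

In this toy setting the loss keeps increasing with the angle, so the search ends on the boundary of the allowed set, which is the worst-case rotation within the budget.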

Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/).

Embodied Question Answering
Abhishek Das and Samyak Datta and Georgia Gkioxari and Stefan Lee and Devi Parikh and Dhruv Batra
arXiv e-Print archive - 2017 via Local arXiv
Keywords: cs.CV, cs.AI, cs.CL, cs.LG

Summary by Oleksandr Bailo 5 years ago

This paper introduces a new AI task: Embodied Question Answering. The agent's goal is to answer a question by observing the environment through a single egocentric RGB camera while navigating inside that environment. The agent has 4 natural modules:
https://i.imgur.com/6Mjidsk.png
1. **Vision**. 224x224 RGB images are processed by a CNN to produce a fixed-size representation. This CNN is pretrained on pixel-to-pixel tasks such as RGB reconstruction, semantic segmentation, and depth estimation.

2. **Language**. Questions are encoded with 2-layer LSTMs with 128-d hidden states. Separate question encoders are used for the navigation and answering modules to capture important words for each module.

3. **Navigation** is composed of a planner (forward, left, right, and stop actions) and a controller that executes the planner-selected action a variable number of times. The planner is an LSTM that takes its hidden state, the image representation, the question, and the previous action as input. In contrast, the controller is an MLP with 1 hidden layer that takes the planner's hidden state, the action from the planner, and the image representation, and decides whether to execute the action again or pass control back to the planner.

4. **Answering**. This module computes the image-question similarity for each of the last 5 frames via a dot product between the image features (passed through an fc-layer to align with the question features) and the question encoding. These similarities are converted to attention weights via a softmax, and the attention-weighted image features are combined with the question features and passed through an answer classifier. The process is shown in the figure below.

https://i.imgur.com/LeZlSZx.png
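
The attention step of the answering module can be sketched in a few lines. This is a minimal illustration, assuming the frame features have already been projected to the question dimension; the fc-layer, the combination with the question features, and the answer classifier are left out, and the function names and toy vectors are hypothetical.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(frame_feats, question_enc):
    """Return (attention weights, attention-weighted frame feature)."""
    # Image-question similarity per frame via a dot product.
    scores = [dot(f, question_enc) for f in frame_feats]
    weights = softmax(scores)
    # Weighted sum of the frame features, one output value per dimension.
    dim = len(question_enc)
    attended = [sum(w * f[i] for w, f in zip(weights, frame_feats))
                for i in range(dim)]
    return weights, attended

# Toy features for the last 5 frames and a toy question encoding (made up).
frames = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0],
          [0.5, 0.5, 0.0], [0.2, 0.2, 0.2]]
question = [2.0, 0.0, 1.0]
weights, attended = attend(frames, question)
```

The frame whose features align best with the question encoding (the first one here) receives the largest attention weight.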

[Successful results](https://www.youtube.com/watch?v=gVj-TeIJfrk) as well as [failure cases](https://www.youtube.com/watch?v=4zH8cz2VlEg) are provided.

Generally, this is very promising work that just scratches the surface of what is possible. Several constraints could be relaxed to push this field toward more general outcomes, for example by using more general environments with more realistic graphics and a broader set of questions and answers.

Data Distillation: Towards Omni-Supervised Learning
Radosavovic, Ilija and Dollár, Piotr and Girshick, Ross B. and Gkioxari, Georgia and He, Kaiming
arXiv e-Print archive - 2017 via Local Bibsonomy
Keywords: dblp

Summary by isarandi 6 years ago

* It's a semi-supervised method (the goal is to make use of unlabeled data in addition to labeled data).
* They first train a neural net normally, in the supervised way, on a labeled dataset.
* Then **they retrain the net using *its own predictions* on the originally unlabeled data as if they were ground truth** (but only when the net is confident enough about the prediction).
  * More precisely they retrain on the union of the original dataset and the examples labeled by the net itself. (Each minibatch is on average 60% original and 40% self-labeled)
* When making these predictions (that will subsequently be used for training), they use **multi-transform inference**.
  * They apply the net to differently transformed versions of the image (mirroring, scaling), transform the outputs back accordingly and combine the results.
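
The two ingredients above (multi-transform inference, then keeping only confident self-labels) can be sketched as follows. This is a simplified illustration, not the authors' pipeline: images and outputs are plain 2-D score grids, the only transform is a horizontal mirror, outputs are combined by averaging, and `toy_model`, the threshold value, and all function names are assumptions.

```python
def identity(grid):
    return grid

def hflip(grid):
    """Mirror a 2-D grid horizontally."""
    return [row[::-1] for row in grid]

def multi_transform_inference(model, image):
    """Run the model on transformed copies, undo each transform on the
    output, and average the results."""
    # (forward transform, inverse applied to the output); a mirror undoes itself.
    transforms = [(identity, identity), (hflip, hflip)]
    outputs = [inverse(model(forward(image))) for forward, inverse in transforms]
    n = len(outputs)
    return [[sum(o[r][c] for o in outputs) / n for c in range(len(image[0]))]
            for r in range(len(image))]

def self_label(model, unlabeled, threshold=0.9):
    """Keep (image, prediction) pairs only when the net is confident enough."""
    kept = []
    for image in unlabeled:
        pred = multi_transform_inference(model, image)
        confidence = max(max(row) for row in pred)
        if confidence >= threshold:
            kept.append((image, pred))
    return kept

def toy_model(grid):
    """Hypothetical stand-in for a trained net: it under-scores every
    column except the first, so it is not mirror-consistent."""
    return [[v * (1.0 if c == 0 else 0.8) for c, v in enumerate(row)]
            for row in grid]

unlabeled = [[[1.0, 0.0], [0.0, 1.0]], [[0.3, 0.0], [0.0, 0.2]]]
kept = self_label(toy_model, unlabeled, threshold=0.85)
```

The kept pairs would then be merged with the original labeled set for retraining, with minibatches mixing both sources as described above.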

Adversarial Discriminative Domain Adaptation
Tzeng, Eric and Hoffman, Judy and Saenko, Kate and Darrell, Trevor
Conference and Computer Vision and Pattern Recognition - 2017 via Local Bibsonomy
Keywords: dblp

Summary by Léo Paillier 6 years ago

_Objective:_ Define a general framework for adversarial domain adaptation and propose a new state-of-the-art architecture within it.

_Dataset:_ MNIST, USPS, SVHN and NYUD.

## Inner workings:

Subsumes previous work in a generalized framework in which designing a new method reduces to making three design choices:

*   whether to use a generative or discriminative base model.
*   whether to tie or untie the weights.
*   which adversarial learning objective to use.

[![screen shot 2017-04-18 at 5 10 01 pm](https://cloud.githubusercontent.com/assets/17261080/25138167/15d5e644-245a-11e7-9fb8-636ce4111036.png)](https://cloud.githubusercontent.com/assets/17261080/25138167/15d5e644-245a-11e7-9fb8-636ce4111036.png)

## Architecture:

[![screen shot 2017-04-18 at 5 14 44 pm](https://cloud.githubusercontent.com/assets/17261080/25138526/07848bd0-245b-11e7-94c9-f6ae7ccea76f.png)](https://cloud.githubusercontent.com/assets/17261080/25138526/07848bd0-245b-11e7-94c9-f6ae7ccea76f.png)

## Results:

Interesting, as the theoretical framework seems to converge with other papers, and their architecture improves on the performance of previous papers, even if the improvement is not huge.

Learning from Noisy Large-Scale Datasets with Minimal Supervision
Veit, Andreas and Alldrin, Neil and Chechik, Gal and Krasin, Ivan and Gupta, Abhinav and Belongie, Serge J.
Conference and Computer Vision and Pattern Recognition - 2017 via Local Bibsonomy
Keywords: dblp

Summary by Léo Paillier 6 years ago

_Objective:_ Predict labels using a very large dataset with noisy labels and a much smaller (3 orders of magnitude) dataset with human-verified annotations.

_Dataset:_ [Open Images](https://research.googleblog.com/2016/09/introducing-open-images-dataset.html)


## Architecture:

Unlike other approaches, they use not only the clean and noisy labels but also image features. They train 3 networks:

1.  A feature extractor for the image.
2.  A label cleaning network that learns to predict verified labels from the noisy labels and the image features.
3.  An image classifier that predicts using just the image.
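
As a rough illustration of the second network, the sketch below maps noisy labels plus image features to cleaned labels. Everything here is an assumption for illustration: the linear form, the residual correction added to the noisy labels, the clipping to [0, 1], and the hypothetical weight matrices `w_label` and `w_feat`.

```python
def clip01(x):
    """Clamp a score to the valid label range [0, 1]."""
    return max(0.0, min(1.0, x))

def clean_labels(noisy, feats, w_label, w_feat):
    """Predict cleaned label scores from noisy labels and image features.

    noisy:   list of d noisy label scores
    feats:   list of k image features (from the feature extractor)
    w_label: d x d weights on the noisy labels (hypothetical)
    w_feat:  d x k weights on the image features (hypothetical)
    """
    cleaned = []
    for j, y in enumerate(noisy):
        # Correction term that depends on both the labels and the image.
        residual = (sum(w * n for w, n in zip(w_label[j], noisy))
                    + sum(w * f for w, f in zip(w_feat[j], feats)))
        cleaned.append(clip01(y + residual))
    return cleaned

# Toy case: the image feature is evidence against the second noisy label.
noisy = [1.0, 1.0]
feats = [1.0]
zeros = [[0.0, 0.0], [0.0, 0.0]]
cleaned = clean_labels(noisy, feats, zeros, [[0.0], [-2.0]])
```

With all-zero weights the sketch returns the noisy labels unchanged; in training, the weights would be fit on the small human-verified subset.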

[![screen shot 2017-04-12 at 11 10 56 am](https://cloud.githubusercontent.com/assets/17261080/24950258/c4764106-1f70-11e7-82e4-c1111ffc089e.png)](https://cloud.githubusercontent.com/assets/17261080/24950258/c4764106-1f70-11e7-82e4-c1111ffc089e.png)

## Results:

Overall better performance, but not a breathtaking improvement: from `AP 83.832 / MAP 61.82` for a NN trained only on labels to `AP 87.67 / MAP 62.38` with their approach.

Speed/Accuracy Trade-Offs for Modern Convolutional Object Detectors
Huang, Jonathan and Rathod, Vivek and Sun, Chen and Zhu, Menglong and Korattikara, Anoop and Fathi, Alireza and Fischer, Ian and Wojna, Zbigniew and Song, Yang and Guadarrama, Sergio and Murphy, Kevin
Conference and Computer Vision and Pattern Recognition - 2017 via Local Bibsonomy
Keywords: dblp

Summary by Léo Paillier 6 years ago

_Objective:_ Compare several meta-architectures and hyper-parameters within a single framework so that the results are directly comparable.

## Architectures:

Four meta-architectures:

1.  R-FCN
2.  Faster R-CNN
3.  SSD
4.  YOLO Architecture (not evaluated in the paper)

[![screen shot 2017-05-05 at 3 12 57 pm](https://cloud.githubusercontent.com/assets/17261080/25746807/5a294360-31a5-11e7-808e-d48497a16cd5.png)](https://cloud.githubusercontent.com/assets/17261080/25746807/5a294360-31a5-11e7-808e-d48497a16cd5.png)

## Results:

Very useful for deciding, at a glance, which framework to implement.