Summaries from International Conference on Computer Vision on ShortScience.org

arxiv.org
arxiv-vanity.com
scholar.google.com

Efficient Convolutional Network Learning using Parametric Log based Dual-Tree Wavelet ScatterNet
Amarjot Singh and Nick Kingsbury
arXiv e-Print archive - 2017 via Local arXiv
Keywords: cs.LG, stat.ML

[link] Summary by hanoch kremer 4 years ago

ScatterNets incorporate geometric knowledge of images to produce discriminative features that are invariant to translation and rotation, i.e. edge information. The first layers of a CNN end up learning much the same thing. So why not replace those first layers with an equivalent fixed structure and let the optimizer find the best weights for the rest of the CNN?
The main motivations for replacing the first convolutional, ReLU and pooling layers of the CNN with a two-layer parametric log-based Dual-Tree Complex Wavelet Transform (DTCWT) ScatterNet, an idea covered by a few papers, were:
- Despite the success of CNNs, the design and optimal configuration of these networks is not well understood, which makes them difficult to develop
- Fixing the first layers improves training, since the later layers can learn more complex patterns from the start of learning because the edge representations are already present
- The network converges faster because it has fewer filter weights to learn
My takeaway: a slight reduction in the amount of data necessary for training!

Experiments on CIFAR-10 and Caltech-101 with 14 self-made CNNs of increasing depth, plus VGG, NIN and WideResNet:
With transfer learning (ImageNet), DTSCNN outperformed all of its CNN counterparts by a "useful margin" when fine-tuning with only 1000 examples (balanced over classes). On larger datasets the gap shrinks until the two end up on par. However, when the first layers of VGG and NIN are frozen, as in DTSCNN, NIN is on par with DTSCNN while VGG outperforms it!

DTSCNN learns at a faster rate but reaches the same target with only a minor speedup (a few minutes)

Complexity analysis in terms of weights and operations is missing

Datasets: CIFAR-10 and Caltech-101 are a good starting point (a further step with a more substantial dataset like COCO would be a plus). For other modalities/domains, please try and let me know

Great work, but an ablation study is missing, such as comparing full training of WResNet+DTCWT vs. WResNet

14 citations so far (Cambridge): probably low value for money at the moment
https://i.imgur.com/GrzSviU.png


Focal Loss for Dense Object Detection
Tsung-Yi Lin and Priya Goyal and Ross Girshick and Kaiming He and Piotr Dollár
arXiv e-Print archive - 2017 via Local arXiv
Keywords: cs.CV

[link] Summary by RyanDsouza 5 years ago

In object detection, gains in speed and accuracy mostly come from changes to the network architecture. This paper takes a different route toward that goal: it introduces a new loss function called focal loss.

The authors identify class imbalance as the main obstacle preventing one-stage detectors from achieving results as good as those of two-stage detectors.

The loss function they introduce is a dynamically scaled cross-entropy loss, where the scaling factor decays to zero as confidence in the correct class increases.

They add a modulating factor to the cross-entropy loss, as shown in this image: https://i.imgur.com/N7R3M9J.png
The result ends up looking like this: https://i.imgur.com/kxC8NCB.png
In experiments, though, they add an additional alpha term to it because it gives them better results.
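
The loss with the alpha term can be sketched for a single binary prediction as follows. This is a minimal illustration assuming the usual form $FL(p_t) = -\alpha_t (1 - p_t)^\gamma \log(p_t)$; the function names and the default values `gamma=2.0` and `alpha=0.25` are choices made for this sketch and are not stated in the summary.

```python
import math


def cross_entropy(p, y):
    """Plain binary cross-entropy for one prediction.

    p: predicted probability of the positive class, y: label (1 or 0).
    """
    p_t = p if y == 1 else 1.0 - p  # probability assigned to the true class
    return -math.log(p_t)


def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Alpha-balanced focal loss for one prediction (defaults are assumptions)."""
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    # (1 - p_t) ** gamma is the modulating factor: it goes to zero as the
    # confidence in the correct class goes to one, down-weighting easy examples.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


easy = focal_loss(0.9, 1)  # confident, correct prediction
hard = focal_loss(0.1, 1)  # confident, wrong prediction
```

With `gamma=0` the modulating factor is 1 and the loss reduces to alpha-weighted cross-entropy, so the easy example is only down-weighted relative to the hard one once `gamma` is positive.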

**RetinaNet**

The model is a single unified network composed of a backbone network and two task-specific subnetworks. The backbone computes feature maps for the input image. The first subnetwork classifies objects from the backbone's output, and the second performs bounding-box regression.
The backbone they use is a Feature Pyramid Network built on top of ResNet.


Learning to Estimate 3D Hand Pose from Single RGB Images
Zimmermann, Christian and Brox, Thomas
International Conference on Computer Vision - 2017 via Local Bibsonomy
Keywords: dblp

[link] Summary by Oleksandr Bailo 6 years ago

This paper estimates 3D hand shape from **single** RGB images using deep learning. The overall pipeline is the following:
https://i.imgur.com/H72P5ns.png
1. **Hand segmentation**. The network is derived from this [paper](https://arxiv.org/pdf/1602.00134.pdf) but, in essence, any segmentation network would do the job. The hand image is cropped from the original image using the segmentation mask and resized to a fixed size (256x256) with bilinear interpolation.
2. **Detecting hand keypoints**. 2D keypoint detection is formulated as predicting a score map for each of the hand joints (a fixed number, 21). An encoder-decoder architecture is used.
3. **3D hand pose estimation**. 
https://i.imgur.com/uBheX3o.png
    - In this paper, the hand pose is represented as $w_i = (x_i, y_i, z_i)$, where $i$ is the index of a particular hand joint. This representation is further normalized as $w_i^{norm} = \frac{1}{s} \cdot w_i$, where $s = ||w_{k+1} - w_{k} ||$ is the distance between a chosen pair of adjacent joints, and the position relative to a reference joint $r$ (palm) is obtained as $w_i^{rel} = w_i^{norm} - w_r^{norm}$.
    - The network predicts coordinates within a canonical frame and additionally estimates the transformation into the canonical frame (as opposed to predicting absolute 3D coordinates). Therefore, the network predicts $w^{c^*} = R(w^{rel}) \cdot w^{rel}$ and $R(w^{rel}) = R_y \cdot R_{xz}$.
Information on whether the input is a left or a right hand is concatenated to the flattened feature representation. The training loss is composed of separate L2 terms for the canonical coordinates and the canonical transformation matrix.
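
The scale normalization and root-relative step from the pose representation above can be sketched as follows. This is only an illustration of the two formulas; the function name and the default indices `k=0` and `root=0` are hypothetical and do not reflect the joint ordering used in the paper.

```python
import math


def normalize_pose(joints, k=0, root=0):
    """Scale-normalize 3D joints and express them relative to a root joint.

    joints: list of (x, y, z) tuples, one per hand joint.
    k:      hypothetical index; the segment joints[k] -> joints[k + 1] sets s.
    root:   hypothetical index of the reference joint r.
    """
    # s = ||w_{k+1} - w_k||
    s = math.dist(joints[k + 1], joints[k])
    # w_i^norm = w_i / s
    norm = [tuple(c / s for c in w) for w in joints]
    # w_i^rel = w_i^norm - w_r^norm
    ref = norm[root]
    return [tuple(c - rc for c, rc in zip(w, ref)) for w in norm]


pose = [(1.0, 1.0, 1.0), (1.0, 3.0, 1.0), (3.0, 3.0, 1.0)]
rel = normalize_pose(pose)
```

After this step the reference joint sits at the origin and the chosen segment has unit length, so the representation no longer depends on the absolute position or size of the hand.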

Contribution: 
- Apparently, the first method to perform 3D hand shape estimation from a single RGB image rather than using both RGB and depth sensors;
- Possible extension to sign language recognition problem by attaching classifier on predicted 3D poses.

While this approach predicts 3D hand poses quite accurately in individual frames, the predictions often fluctuate from frame to frame. Several techniques (e.g. optical flow, an RNN, post-processing smoothing) could probably be used to enforce temporal consistency and make predictions more stable across frames.


SubUNets: End-to-End Hand Shape and Continuous Sign Language Recognition
Camgöz, Necati Cihan and Hadfield, Simon and Koller, Oscar and Bowden, Richard
International Conference on Computer Vision - 2017 via Local Bibsonomy
Keywords: dblp

[link] Summary by Oleksandr Bailo 6 years ago

This paper tackles a challenging task of hand shape and continuous Sign Language Recognition (SLR) directly from images obtained from a common RGB camera (rather than utilizing motion sensors like Kinect). The basic idea is to create a network that is end-to-end trainable with input (i.e. images) and output (i.e. hand shape labels, word labels) sequences. The network is composed of three parts:
 - CNN as a feature extractor
 - Bidirectional LSTMs for temporal modeling
 - Connectionist Temporal Classification as a loss layer 
![Network structure](https://ai2-s2-public.s3.amazonaws.com/figures/2017-08-08/3269d3541f0eec006aee6ce086db2665b7ded92d/1-Figure1-1.png)
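
To make the role of the CTC layer more concrete, here is a minimal sketch of best-path (greedy) CTC decoding, which turns per-frame label scores into a shorter label sequence without a frame-level alignment. It is a generic illustration, not necessarily the decoding used in the paper; the function name and the convention that index 0 is the blank label are assumptions.

```python
def ctc_greedy_decode(frame_scores, blank=0):
    """Best-path CTC decoding.

    frame_scores: one list of scores per video frame, indexed by label id.
    blank:        id of the CTC blank label (assumed to be 0 here).
    """
    # Pick the highest-scoring label for every frame.
    best = [max(range(len(f)), key=f.__getitem__) for f in frame_scores]
    decoded, prev = [], None
    for label in best:
        # Collapse repeated labels, then drop blanks.
        if label != prev and label != blank:
            decoded.append(label)
        prev = label
    return decoded


# Eight frames whose per-frame best labels are 0 1 1 0 1 2 2 0.
path = [0, 1, 1, 0, 1, 2, 2, 0]
frames = [[1.0 if i == label else 0.0 for i in range(3)] for label in path]
labels = ctc_greedy_decode(frames)
```

The blank between the two runs of label 1 is what lets the same label appear twice in a row in the output, which matters for repeated signs or hand shapes.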

Results:
 - Observed state-of-the-art results (at the time of publishing) on the "One-Million Hands" and "RWTH-PHOENIX-Weather-2014" datasets.
 - Utilizing full images rather than hand patches provides better performance for continuous SLR.
 - A network that recognizes hand shape and a network that recognizes word sequences can be combined and trained together to recognize word sequences. Fine-tuning all layers of the combined system works better than fixing the "feature extraction" layers.
 - A combination of two networks, each trained on a separate task, performs slightly better than training each network on word sequences.
 - Only a marginal difference in performance was observed between different decoding and post-processing techniques during sequence-to-sequence prediction.


Be Your Own Prada: Fashion Synthesis with Structural Coherence
Zhu, Shizhan and Fidler, Sanja and Urtasun, Raquel and Lin, Dahua and Loy, Chen Change
International Conference on Computer Vision - 2017 via Local Bibsonomy
Keywords: dblp

[link] Summary by Arian 6 years ago

[FashionGAN][1] works as follows. Given an input image of a person and a sentence describing an outfit, the model tries to "redress" the person in the image.
The Generator in the model is stacked. 
* The first stage of the generator gets as input a low resolution version of the segmentation of the input image (which is obtained independently) and the design encoding, and generates a **human segmentation map** (not dressed). 
* Then in the second stage, the model renders the generated image using another generator conditioned on the design encoding. It adds region specific texture using the segmentation map and generates the final image.

![FashionGAN Model](https://i.imgur.com/DzwB8xm.png "FashionGAN model")

They added sentence descriptions to a subset of the [DeepFashion dataset][2] (79k examples).

[1]:http://mmlab.ie.cuhk.edu.hk/projects/FashionGAN/
[2]:http://mmlab.ie.cuhk.edu.hk/projects/DeepFashion.html


Learning to Reason: End-to-End Module Networks for Visual Question Answering
Hu, Ronghang and Andreas, Jacob and Rohrbach, Marcus and Darrell, Trevor and Saenko, Kate
International Conference on Computer Vision - 2017 via Local Bibsonomy
Keywords: dblp

[link] Summary by Marek Rei 6 years ago

A modular neural architecture for visual question answering. A seq2seq component predicts the sequence of neural modules (e.g. find() and compare()) based on the textual question; these modules are then dynamically assembled and trained end-to-end. Achieves good results on three separate benchmarks that focus on reasoning about the image.

https://i.imgur.com/iOkSh8y.png
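
The idea of assembling a predicted module sequence into an executable program can be illustrated with a toy, non-neural sketch. Everything here is hypothetical: the modules operate on a symbolic scene instead of image features, the module names are invented for the example, and encoding the layout in postfix (reverse Polish) order is an assumption of this sketch.

```python
def run_layout(layout, modules):
    """Execute module names given in postfix (reverse Polish) order."""
    stack = []
    for name in layout:
        arity, fn = modules[name]
        # Pop this module's inputs and restore their original order.
        args = [stack.pop() for _ in range(arity)][::-1]
        stack.append(fn(*args))
    assert len(stack) == 1, "layout must reduce to a single answer"
    return stack[0]


def make_modules(scene):
    """Hypothetical toy modules over a symbolic scene (a list of dicts)."""
    return {
        "find_red": (0, lambda: [o for o in scene if o["color"] == "red"]),
        "find_ball": (0, lambda: [o for o in scene if o["shape"] == "ball"]),
        "count": (1, len),
        "compare": (2, lambda a, b: "more" if a > b else "fewer" if a < b else "same"),
    }


scene = [
    {"shape": "cube", "color": "red"},
    {"shape": "ball", "color": "red"},
    {"shape": "ball", "color": "blue"},
    {"shape": "ball", "color": "green"},
]
# "Are there more red things than balls?" as a module sequence.
layout = ["find_red", "count", "find_ball", "count", "compare"]
answer = run_layout(layout, make_modules(scene))
```

In the actual model the layout would come from the seq2seq component and each module would be a small network, so a different question yields a different computation graph over the same set of modules.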


Towards Diverse and Natural Image Descriptions via a Conditional GAN
Bo Dai and Sanja Fidler and Raquel Urtasun and Dahua Lin
arXiv e-Print archive - 2017 via Local arXiv
Keywords: cs.CV

[link] Summary by Abhishek Das 6 years ago

This paper proposes a conditional GAN-based image captioning model.
Given an image, the generator generates a caption, and given an image
and caption, the discriminator/evaluator distinguishes between generated
and real captions. Key ideas:

- Since caption generation involves sequential sampling, which is
non-differentiable, the model is trained with policy gradients, with
the action being the choice of word at every time step, policy being
the distribution over words, and reward the score assigned by the
evaluator to generated caption.

- The evaluator's role assumes a completely generated caption as input
(along with image), which in practice leads to convergence issues. Thus
to accommodate feedback for partial sequences during training, Monte Carlo
rollouts are used, i.e. given a partial generated sequence, n completions
are sampled and run through the evaluator to compute reward.

- The evaluator's objective function consists of three terms
    - image-caption pairs from training data (positive)
    - image and generated captions (negative)
    - image and sampled captions for other images from training data (negative)

- Both the generator and evaluator are pretrained with supervision / MLE, then
fine-tuned with policy gradients. During inference, evaluator score is used as
the beam search objective.
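
The Monte Carlo rollout idea from the list above can be sketched as follows: a partial caption is completed n times by sampling from the policy, and the evaluator scores of the completions are combined into a reward. This is a toy illustration; `toy_policy` and `toy_evaluator` are hypothetical stand-ins for the generator and evaluator networks, and averaging the scores is an assumption about how they are combined.

```python
import random

EOS = "<eos>"


def toy_policy(image, caption, rng):
    """Hypothetical stand-in for the generator: samples the next word uniformly."""
    return rng.choice(["a", "dog", "runs", EOS])


def toy_evaluator(image, caption):
    """Hypothetical stand-in for the evaluator: rewards captions mentioning 'dog'."""
    return 1.0 if "dog" in caption else 0.0


def rollout_reward(partial, image, policy, evaluator, n=16, max_len=10, rng=None):
    """Estimate the reward of a partial caption by averaging over n sampled completions."""
    rng = rng or random.Random(0)
    total = 0.0
    for _ in range(n):
        caption = list(partial)
        # Keep sampling words until the caption ends or reaches max_len.
        while len(caption) < max_len and (not caption or caption[-1] != EOS):
            caption.append(policy(image, caption, rng))
        total += evaluator(image, caption)
    return total / n


reward_with_dog = rollout_reward(["a", "dog"], None, toy_policy, toy_evaluator)
reward_empty = rollout_reward([], None, toy_policy, toy_evaluator)
```

The estimate for a partial sequence can then serve as the per-step reward in the policy-gradient update, so the generator receives feedback before the caption is complete.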

## Strengths

This is a neat paper with insightful ideas (Monte Carlo rollouts for assigning
rewards to partial sequences, evaluator score as beam search objective),
and is perhaps the first work on C-GAN-based image captioning.

## Weaknesses / Notes


The Pose Knows: Video Forecasting by Generating Pose Futures
Walker, Jacob and Marino, Kenneth and Gupta, Abhinav and Hebert, Martial
arXiv e-Print archive - 2017 via Local Bibsonomy
Keywords: dblp

[link] Summary by Kirill Pevzner 6 years ago

Problem
---------
Video prediction with human objects


Contribution
--------------
Instead of the common approach of predicting directly in pixel-space, use explicit knowledge of human motion space to predict the future of the video.

Approach
--------------
1. VAE to model the possible future movements of humans in the pose space
2. Conditional GAN - uses the pose information to predict the video in pixel space.



https://image.ibb.co/b1omVF/The_pose_knows.png


Mask R-CNN
He, Kaiming and Gkioxari, Georgia and Dollár, Piotr and Girshick, Ross B.
arXiv e-Print archive - 2017 via Local Bibsonomy
Keywords: dblp

[link] Summary by Martin Thoma 7 years ago

## See also
* [R-CNN](http://www.shortscience.org/paper?bibtexKey=conf/iccv/Girshick15#joecohen)
* [Fast R-CNN](http://www.shortscience.org/paper?bibtexKey=conf/iccv/Girshick15#joecohen)
* [Faster R-CNN](http://www.shortscience.org/paper?bibtexKey=conf/nips/RenHGS15#martinthoma)
* [Mask R-CNN](http://www.shortscience.org/paper?bibtexKey=journals/corr/HeGDG17)