ShortScience.org - Making Science Accessible!

Welcome to ShortScience.org!

Bayesian dark knowledge
Balan, Anoop Korattikara and Rathod, Vivek and Murphy, Kevin P. and Welling, Max
Neural Information Processing Systems Conference - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by Hugo Larochelle 8 years ago

This paper combines two ideas. The first is stochastic gradient Langevin dynamics (SGLD), which is an efficient Bayesian learning method for larger datasets, allowing to efficiently sample from the posterior over the parameters of a model (e.g. a deep neural network). In short, SGLD is stochastic (minibatch) gradient descent, but where Gaussian noise is added to the gradients before each update. Each update thus results in a sample from the SGLD sampler. To make a prediction for a new data point, a number of previous parameter values are combined into an ensemble, which effectively corresponds to Monte Carlo estimate of the posterior predictive distribution of the model.

The second idea is distillation or dark knowledge, which in short is the idea of training a smaller model (student) in replicating the behavior and performance of a much larger model (teacher), by essentially training the student to match the outputs of the teacher.

The observation made in this paper is that the step of creating an ensemble of several models (e.g. deep networks) can be expensive, especially if many samples are used and/or if each model is large. Thus, they propose to approximate the output of that ensemble by training a single network to predict to output of ensemble. Ultimately, this is done by having the student predict the output of a teacher corresponding to the model with the last parameter value sampled by SGLD.

Interestingly, this process can be operated in an online fashion, where one alternates between sampling from SGLD (i.e. performing a noisy SGD step on the teacher model) and performing a distillation update (i.e. updating the student model, given the current teacher model). The end result is a student model, whose outputs should be calibrated to the bayesian predictive distribution.

arxiv.org
scholar.google.com

Learning Representations for Counterfactual Inference
Johansson, Fredrik D. and Shalit, Uri and Sontag, David
arXiv e-Print archive - 2016 via Local Bibsonomy
Keywords: dblp

[link] Summary by Hugo Larochelle 7 years ago

This paper presents a method to train a neural network to make predictions for *counterfactual* questions. In short, such questions are questions about what the result of an intervention would have been, had a different choice for the intervention been made (e.g. *Would this patient have lower blood sugar had she received a different medication?*).

One approach to tackle this problem is to collect data of the form $(x_i, t_i, y_i^F)$ where $x_i$ describes a situation (e.g. a patient), $t_i$ describes the intervention made (in this paper $t_i$ is binary, e.g. $t_i = 1$ if a new treatment is used while $t_i = 0$ would correspond to using the current treatment) and $y_i^F$ is the factual outcome of the intervention $t_i$ for $x_i$. From this training data, a predictor $h(x,t)$ taking the pair $(x_i, t_i)$ as input and outputting a prediction for $y_i^F$ could be trained. 

From this predictor, one could imagine answering counterfactual questions by feeding $(x_i, 1-t_i)$ (i.e. a description of the same situation $x_i$ but with the opposite intervention $1-t_i$) to our predictor and comparing the prediction $h(x_i, 1-t_i)$ with $y_i^F$. This would give us an estimate of the change in the outcome, had a different intervention been made, thus providing an answer to our counterfactual question.

The authors point out that this scenario is related to that of domain adaptation (more specifically to the special case of covariate shift) in which the input training distribution (here represented by inputs $(x_i,t_i)$) is different from the distribution of inputs that will be fed at test time to our predictor (corresponding to the inputs $(x_i, 1-t_i)$). If the choice of intervention $t_i$ is evenly spread and chosen independently from $x_i$, the distributions become the same. However, in observational studies, the choice of $t_i$ for some given $x_i$ is often not independent of $x_i$ and made according to some unknown policy. This is the situation of interest in this paper.

Thus, the authors propose an approach inspired by the domain adaptation literature. Specifically, they propose to have the predictor $h(x,t)$ learn a representation of $x$ that is indiscriminate of the intervention $t$ (see Figure 2 for the proposed neural network architecture). Indeed, this is a notion that is [well established][1] in the domain adaptation literature and has been exploited previously using regularization terms based on [adversarial learning][2] and [maximum mean discrepancy][3]. In this paper, the authors used instead a regularization (noted in the paper as $disc(\Phi_{t=0},\Phi_ {t=1})$) based on the so-called discrepancy distance of [Mansour et al.][4], adapting its use to the case of a neural network. 

As an example, imagine that in our dataset, a new treatment ($t=1$) was much more frequently used than not ($t=0$) for men. Thus, for men, relatively insufficient evidence for counterfactual inference is expected to be found in our training dataset. Intuitively, we would thus want our predictor to not rely as much on that "feature" of patients when inferring the impact of the treatment. 

In addition to this term, the authors also propose incorporating an additional regularizer where the prediction $h(x_i,1-t_i)$ on counterfactual inputs is pushed to be as close as possible to the target $y_{j}^F$ of the observation $x_j$ that is closest to $x_i$ **and** actually had the counterfactual intervention $t_j = 1-t_i$. 

The paper first shows a bound relating the counterfactual generalization error to the discrepancy distance. Moreover, experiments simulating counterfactual inference tasks are presented, in which performance is measured by comparing the predicted treatment effects (as estimated by the difference between the observed effect $y_i^F$ for the observed treatment and the predicted effect $h(x_i, 1-t_i)$ for the opposite treatment) with the real effect (known here because the data is simulated). The paper shows that the proposed approach using neural networks outperforms several baselines on this task.

**My two cents**

The connection with domain adaptation presented here is really clever and enlightening. This sounds like a very compelling approach to counterfactual inference, which can exploit a lot of previous work on domain adaptation.

The paper mentions that selecting the hyper-parameters (such as the regularization terms weights) in this scenario is not a trivial task. Indeed, measuring performance here requires knowing the true difference in intervention outcomes, which in practice usually cannot be known (e.g. two treatments usually cannot be given to the same patient once). In the paper, they somewhat "cheat" by using the ground truth difference in outcomes to measure out-of-sample performance, which the authors admit is unrealistic. Thus, an interesting avenue for future work would be to design practical hyper-parameter selection procedures for this scenario. I wonder whether the *reverse cross-validation* approach we used in our work on our adversarial approach to domain adaptation (see [Section 5.1.2][5]) could successfully be used here.

Finally, I command the authors for presenting such a nicely written description of counterfactual inference problem setup in general, I really enjoyed it!

[1]: https://papers.nips.cc/paper/2983-analysis-of-representations-for-domain-adaptation.pdf
[2]: http://arxiv.org/abs/1505.07818
[3]: http://ijcai.org/Proceedings/09/Papers/200.pdf
[4]: http://www.cs.nyu.edu/~mohri/pub/nadap.pdf
[5]: http://arxiv.org/pdf/1505.07818v4.pdf#page=16

dx.doi.org
sci-hub
scholar.google.com

SSD: Single Shot MultiBox Detector
Liu, Wei and Anguelov, Dragomir and Erhan, Dumitru and Szegedy, Christian and Reed, Scott E. and Fu, Cheng-Yang and Berg, Alexander C.
European Conference on Computer Vision - 2016 via Local Bibsonomy
Keywords: dblp

[link] Summary by Qure.ai 7 years ago

SSD aims to solve the major problem with most of the current state of the art object detectors namely Faster RCNN and like. All the object detection algortihms have same  methodology 

- Train 2 different nets - Region Proposal Net (RPN) and advanced classifier to detect class of an object and bounding box separately.
- During inference, run the test image at different scales to detect object at multiple scales to account for invariance

This makes the nets extremely slow. Faster RCNN could operate at **7 FPS with 73.2% mAP** while SSD could achieve **59 FPS with 74.3% mAP ** on VOC 2007 dataset.

#### Methodology

SSD uses a single net for predict object class and bounding box. However it doesn't do that directly. It uses a mechanism for choosing ROIs, training end-to-end for predicting class and boundary shift for that ROI.

##### ROI selection

Borrowing from FasterRCNNs SSD uses the  concept of anchor boxes for generating ROIs from the feature maps of last layer of shared conv layer. For each pixel in layer of feature maps, k default boxes with different aspect ratios are chosen around every pixel in the map. So if there are feature maps each of m x n resolutions - that's *mnk* ROIs for a single feature layer. Now SSD uses multiple feature layers (with differing resolutions) for generating such ROIs primarily to capture size invariance of objects. But because earlier layers in deep conv net tends to capture low level features, it uses features after certain levels and layers henceforth.

##### ROI labelling
Any ROI that matches to Ground Truth for a class after applying appropriate transforms and having Jaccard overlap greater than 0.5 is positive. Now, given all feature maps are at different resolutions and each boxes are at different aspect ratios, doing that's not simple. SDD uses simple scaling and aspect ratios to get to the appropriate  ground truth dimensions for calculating Jaccard overlap for default boxes for each pixel at the given resolution

##### ROI classification

SSD uses single convolution kernel of 3*3 receptive fields to predict for each ROI the 4 offsets (centre-x offset, centre-y offset, height offset , width offset) from the Ground Truth box for each RoI, along with class confidence scores for each class. So that is if there are c classes (including background), there are (c+4) filters for each convolution kernels that looks at a ROI. 

So summarily we have convolution kernels  that look at ROIs (which are default boxes around each pixel in feature map layer) to generate (c+4) scores for each RoI. Multiple feature map layers with different resolutions are used for generating such ROIs. Some ROIs are positive and some negative depending on jaccard overlap after ground box has scaled appropriately taking resolution differences in input image and feature map into consideration.

Here's how it looks :
![](https://i.imgur.com/HOhsPZh.png)

##### Training

For each ROI a combined loss is calculated as a combination of localisation error and classification error. The details are best explained in the figure. 
![](https://i.imgur.com/zEDuSgi.png)

##### Inference
For each ROI predictions a small threshold is used to first filter out irrelevant predictions, Non Maximum Suppression (nms) with jaccard overlap of 0.45 per class is applied then on the remaining candidate ROIs and the top 200 detections per image are kept.

For further understanding of the intuitions regarding the paper and the results obtained please consider giving the full paper a read. 

The open sourced code is available at this [Github repo](https://github.com/weiliu89/caffe/tree/ssd)

arxiv.org
scholar.google.com

Deep Residual Learning for Image Recognition
He, Kaiming and Zhang, Xiangyu and Ren, Shaoqing and Sun, Jian
arXiv e-Print archive - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by Martin Thoma 8 years ago

Deeper networks should never have a higher **training** error than smaller ones. In the worst case, the layers should "simply" learn identities. It seems as this is not so easy with conventional networks, as they get much worse with more layers. So the idea is to add identity functions which skip some layers. The network only has to learn the **residuals**. 

Advantages:

* Learning the identity becomes learning 0 which is simpler
* Loss in information flow in the forward pass is not a problem anymore
    * No vanishing / exploding gradient
* Identities don't have parameters to be learned

## Evaluation

The learning rate starts at 0.1 and is divided by 10 when the error plateaus. Weight decay of 0.0001 ($10^{-4}$), momentum of 0.9. They use mini-batches of size 128.

* ImageNet ILSVRC 2015: 3.57% (ensemble)
* CIFAR-10: 6.43%
* MS COCO: 59.0% mAp@0.5 (ensemble)
* PASCAL VOC 2007: 85.6% mAp@0.5
* PASCAL VOC 2012: 83.8% mAp@0.5

## See also

* [DenseNets](http://www.shortscience.org/paper?bibtexKey=journals/corr/1608.06993)

arxiv.org
scholar.google.com

Learning to Segment Object Candidates
Pinheiro, Pedro H. O. and Collobert, Ronan and Dollár, Piotr
arXiv e-Print archive - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by Qure.ai 7 years ago

Facebook has [released a series of papers](https://research.facebook.com/blog/learning-to-segment/) for object segmentation and detection. This paper is the first in that series.

This is how modern object detection works (think [RCNN](https://arxiv.org/abs/1311.2524), [Fast RCNN](http://arxiv.org/abs/1504.08083)):

1. A rich set of object proposals (i.e., a set of image regions which are likely to contain an object) is generated using a fast (but possibly imprecise) algorithm.
2. A CNN classifier is applied on each of the proposals.

The current paper improves the step 1, i.e., region/object proposals.

Most object proposals approaches fall into three categories:
* Objectness scoring
* Seed Segmentation
* Superpixel Merging

Current method is different from these three.
It share similarities with [Faster R-CNN](https://arxiv.org/abs/1506.01497) in that proposals are generated using a CNN.
The method predicts a segmentation mask given an input *patch* and assigns a score corresponding to how likely the patch is to contain an object.

## Model and Training

Both mask and score predictions are achieved with a single convolutional network but with multiple outputs. All the convolutional layers except the last few are from VGG-A pretrained model.

Each training sample is a triplet of RGB input patch, the binary mask corresponding to the input patch, a label which specifies whether the patch contains an object. A patch is given label 1 only if it satisfies the following constraints:
* the patch contains an object roughly centered in the input patch
* the object is fully contained in the patch and in a given scale range

Note that the network must output a mask for a single object at the center even when multiple objects are present.

Figure 1 shows the architecture and sampling for training.
![figure1](https://i.imgur.com/zSyP0ij.png)

Model is then jointly trained for segmentation and objectness. Negative samples are not used for segmentation.

## Inference

During full image inference, model is applied densely at multiple locations and scales.
This can be done efficiently since all computations are convolutional like in a fully convolutional network (FCN).

![figure2](https://i.imgur.com/dQWfy8R.png)

This approach surpasses the previous state of the art by a large margin in both box and segmentation proposal generation.