Welcome to ShortScience.org! |

- ShortScience.org is a platform for post-publication discussion aiming to improve accessibility and reproducibility of research ideas.
- The website has 1519 public summaries, mostly in machine learning, written by the community and organized by paper, conference, and year.
- Reading summaries of papers is useful to obtain the perspective and insight of another reader, why they liked or disliked it, and their attempt to demystify complicated sections.
- Also, writing summaries is a good exercise to understand the content of a paper because you are forced to challenge your assumptions when explaining it.
- Finally, you can keep up to date with the flood of research by reading the latest summaries on our Twitter and Facebook pages.

Algorithms for Non-negative Matrix Factorization

Lee, Daniel D. and Seung, H. Sebastian

Neural Information Processing Systems Conference - 2000 via Local Bibsonomy

Keywords: dblp

Lee, Daniel D. and Seung, H. Sebastian

Neural Information Processing Systems Conference - 2000 via Local Bibsonomy

Keywords: dblp

[link]
We want to find two matrices $W$ and $H$ such that $V = WH$. Often a goal is to determine underlying patterns in the relationships between the concepts represented by each row and column. $W$ is some $m$ by $n$ matrix and we want the inner dimension of the factorization to be $r$. So $$\underbrace{V}_{m \times n} = \underbrace{W}_{m \times r} \underbrace{H}_{r \times n}$$ Let's consider an example matrix where of three customers (as rows) are associated with three movies (the columns) by a rating value. $$ V = \left[\begin{array}{c c c} 5 & 4 & 1 \\\\ 4 & 5 & 1 \\\\ 2 & 1 & 5 \end{array}\right] $$ We can decompose this into two matrices with $r = 1$. First lets do this without any non-negative constraint using an SVD reshaping matrices based on removing eigenvalues: $$ W = \left[\begin{array}{c c c} -0.656 \\\ -0.652 \\\ -0.379 \end{array}\right], H = \left[\begin{array}{c c c} -6.48 & -6.26 & -3.20\\\\ \end{array}\right] $$ We can also decompose this into two matrices with $r = 1$ subject to the constraint that $w_{ij} \ge 0$ and $h_{ij} \ge 0$. (Note: this is only possible when $v_{ij} \ge 0$): $$ W = \left[\begin{array}{c c c} 0.388 \\\\ 0.386 \\\\ 0.224 \end{array}\right], H = \left[\begin{array}{c c c} 11.22 & 10.57 & 5.41 \\\\ \end{array}\right] $$ Both of these $r=1$ factorizations reconstruct matrix $V$ with the same error. $$ V \approx WH = \left[\begin{array}{c c c} 4.36 & 4.11 & 2.10 \\\ 4.33 & 4.08 & 2.09 \\\ 2.52 & 2.37 & 1.21 \\\ \end{array}\right] $$ If they both yield the same reconstruction error then why is a non-negativity constraint useful? We can see above that it is easy to observe patterns in both factorizations such as similar customers and similar movies. `TODO: motivate why NMF is better` #### Paper Contribution This paper discusses two approaches for iteratively creating a non-negative $W$ and $H$ based on random initial matrices. The paper discusses a multiplicative update rule where the elements of $W$ and $H$ are iteratively transformed by scaling each value such that error is not increased. The multiplicative approach is discussed in contrast to an additive gradient decent based approach where small corrections are iteratively applied. The multiplicative approach can be reduced to this by setting the learning rate ($\eta$) to a ratio that represents the magnitude of the element in $H$ to the scaling factor of $W$ on $H$. ### Still a draft |

Robustness of classifiers: from adversarial to random noise

Alhussein Fawzi and Seyed-Mohsen Moosavi-Dezfooli and Pascal Frossard

arXiv e-Print archive - 2016 via Local arXiv

Keywords: cs.LG, cs.CV, stat.ML

**First published:** 2016/08/31 (3 years ago)

**Abstract:** Several recent works have shown that state-of-the-art classifiers are
vulnerable to worst-case (i.e., adversarial) perturbations of the datapoints.
On the other hand, it has been empirically observed that these same classifiers
are relatively robust to random noise. In this paper, we propose to study a
\textit{semi-random} noise regime that generalizes both the random and
worst-case noise regimes. We propose the first quantitative analysis of the
robustness of nonlinear classifiers in this general noise regime. We establish
precise theoretical bounds on the robustness of classifiers in this general
regime, which depend on the curvature of the classifier's decision boundary.
Our bounds confirm and quantify the empirical observations that classifiers
satisfying curvature constraints are robust to random noise. Moreover, we
quantify the robustness of classifiers in terms of the subspace dimension in
the semi-random noise regime, and show that our bounds remarkably interpolate
between the worst-case and random noise regimes. We perform experiments and
show that the derived bounds provide very accurate estimates when applied to
various state-of-the-art deep neural networks and datasets. This result
suggests bounds on the curvature of the classifiers' decision boundaries that
we support experimentally, and more generally offers important insights onto
the geometry of high dimensional classification problems.
more
less

Alhussein Fawzi and Seyed-Mohsen Moosavi-Dezfooli and Pascal Frossard

arXiv e-Print archive - 2016 via Local arXiv

Keywords: cs.LG, cs.CV, stat.ML

[link]
Fawzi et al. study robustness in the transition from random samples to semi-random and adversarial samples. Specifically they present bounds relating the norm of an adversarial perturbation to the norm of random perturbations – for the exact form I refer to the paper. Personally, I find the definition of semi-random noise most interesting, as it allows to get an intuition for distinguishing random noise from adversarial examples. As in related literature, adversarial examples are defined as $r_S(x_0) = \arg\min_{x_0 \in S} \|r\|_2$ s.t. $f(x_0 + r) \neq f(x_0)$ where $f$ is the classifier to attack and $S$ the set of allowed perturbations (e.g. requiring that the perturbed samples are still images). If $S$ is mostly unconstrained regarding the direction of $r$ in high dimensional space, Fawzi et al. consider $r$ to be an adversarial examples – intuitively, and adversary can choose $r$ arbitrarily to fool the classifier. If, however, the directions considered in $S$ are constrained to an $m$-dimensional subspace, Fawzi et al. consider $r$ to be semi-random noise. In the extreme case, if $m = 1$, $r$ is random noise. In this case, we can intuitively think of $S$ as a randomly chosen one dimensional subspace – i.e. a random direction in multi-dimensional space. Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/). |

Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks

Ren, Shaoqing and He, Kaiming and Girshick, Ross B. and Sun, Jian

Neural Information Processing Systems Conference - 2015 via Local Bibsonomy

Keywords: dblp

Ren, Shaoqing and He, Kaiming and Girshick, Ross B. and Sun, Jian

Neural Information Processing Systems Conference - 2015 via Local Bibsonomy

Keywords: dblp

[link]
**Object detection** is the task of drawing one bounding box around each instance of the type of object one wants to detect. Typically, image classification is done before object detection. With neural networks, the usual procedure for object detection is to train a classification network, replace the last layer with a regression layer which essentially predicts pixel-wise if the object is there or not. An bounding box inference algorithm is added at last to make a consistent prediction (see [Deep Neural Networks for Object Detection](http://papers.nips.cc/paper/5207-deep-neural-networks-for-object-detection.pdf)). The paper introduces RPNs (Region Proposal Networks). They are end-to-end trained to generate region proposals.They simoultaneously regress region bounds and bjectness scores at each location on a regular grid. RPNs are one type of fully convolutional networks. They take an image of any size as input and output a set of rectangular object proposals, each with an objectness score. ## See also * [R-CNN](http://www.shortscience.org/paper?bibtexKey=conf/iccv/Girshick15#joecohen) * [Fast R-CNN](http://www.shortscience.org/paper?bibtexKey=conf/iccv/Girshick15#joecohen) * [Faster R-CNN](http://www.shortscience.org/paper?bibtexKey=conf/nips/RenHGS15#martinthoma) * [Mask R-CNN](http://www.shortscience.org/paper?bibtexKey=journals/corr/HeGDG17) |

Comparing Rewinding and Fine-tuning in Neural Network Pruning

Renda, Alex and Frankle, Jonathan and Carbin, Michael

International Conference on Learning Representations - 2020 via Local Bibsonomy

Keywords: dblp

Renda, Alex and Frankle, Jonathan and Carbin, Michael

International Conference on Learning Representations - 2020 via Local Bibsonomy

Keywords: dblp

[link]
This is an interestingly pragmatic paper that makes a super simple observation. Often, we may want a usable network with fewer parameters, to make our network more easily usable on small devices. It's been observed (by these same authors, in fact), that pruned networks can achieve comparable weights to their fully trained counterparts if you rewind and retrain from early in the training process, to compensate for the loss of the (not ultimately important) pruned weights. This observation has been dubbed the "Lottery Ticket Hypothesis", after the idea that there's some small effective subnetwork you can find if you sample enough networks. Given these two facts - the usefulness of pruning, and the success of weight rewinding - the authors explore the effectiveness of various ways to train after pruning. Current standard practice is to prune low-magnitude weights, and then continue training remaining weights from values they had at pruning time, keeping the final learning rate of the network constant. The authors find that: 1. Weight rewinding, where you rewind weights to *near* their starting value, and then retrain using the learning rates of early in training, outperforms fine tuning from the place weights were when you pruned but, also 2. Learning rate rewinding, where you keep weights as they are, but rewind learning rates to what they were early in training, are actually the most effective for a given amount of training time/search cost To me, this feels a little bit like burying the lede: the takeaway seems to be that when you prune, it's beneficial to make your network more "elastic" (in the metaphor-to-neuroscience sense) so it can more effectively learn to compensate for the removed neurons. So, what was really valuable in weight rewinding was the ability to "heat up" learning on a smaller set of weights, so they could adapt more quickly. And the fact that learning rate rewinding works better than weight rewinding suggests that there is value in the learned weights after all, that value is just outstripped by the benefit of rolling back to old learning rates. All in all, not a super radical conclusion, but a useful and practical one to have so clearly laid out in a paper. |

Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation

Yonghui Wu and Mike Schuster and Zhifeng Chen and Quoc V. Le and Mohammad Norouzi and Wolfgang Macherey and Maxim Krikun and Yuan Cao and Qin Gao and Klaus Macherey and Jeff Klingner and Apurva Shah and Melvin Johnson and Xiaobing Liu and Łukasz Kaiser and Stephan Gouws and Yoshikiyo Kato and Taku Kudo and Hideto Kazawa and Keith Stevens et al.

arXiv e-Print archive - 2016 via Local arXiv

Keywords: cs.CL, cs.AI, cs.LG

**First published:** 2016/09/26 (3 years ago)

**Abstract:** Neural Machine Translation (NMT) is an end-to-end learning approach for
automated translation, with the potential to overcome many of the weaknesses of
conventional phrase-based translation systems. Unfortunately, NMT systems are
known to be computationally expensive both in training and in translation
inference. Also, most NMT systems have difficulty with rare words. These issues
have hindered NMT's use in practical deployments and services, where both
accuracy and speed are essential. In this work, we present GNMT, Google's
Neural Machine Translation system, which attempts to address many of these
issues. Our model consists of a deep LSTM network with 8 encoder and 8 decoder
layers using attention and residual connections. To improve parallelism and
therefore decrease training time, our attention mechanism connects the bottom
layer of the decoder to the top layer of the encoder. To accelerate the final
translation speed, we employ low-precision arithmetic during inference
computations. To improve handling of rare words, we divide words into a limited
set of common sub-word units ("wordpieces") for both input and output. This
method provides a good balance between the flexibility of "character"-delimited
models and the efficiency of "word"-delimited models, naturally handles
translation of rare words, and ultimately improves the overall accuracy of the
system. Our beam search technique employs a length-normalization procedure and
uses a coverage penalty, which encourages generation of an output sentence that
is most likely to cover all the words in the source sentence. On the WMT'14
English-to-French and English-to-German benchmarks, GNMT achieves competitive
results to state-of-the-art. Using a human side-by-side evaluation on a set of
isolated simple sentences, it reduces translation errors by an average of 60%
compared to Google's phrase-based production system.
more
less

Yonghui Wu and Mike Schuster and Zhifeng Chen and Quoc V. Le and Mohammad Norouzi and Wolfgang Macherey and Maxim Krikun and Yuan Cao and Qin Gao and Klaus Macherey and Jeff Klingner and Apurva Shah and Melvin Johnson and Xiaobing Liu and Łukasz Kaiser and Stephan Gouws and Yoshikiyo Kato and Taku Kudo and Hideto Kazawa and Keith Stevens et al.

arXiv e-Print archive - 2016 via Local arXiv

Keywords: cs.CL, cs.AI, cs.LG

[link]
This is a very techincal paper and I only covered items that interested me * Model * Encoder * 8 layers LSTM * bi-directional only first encoder layer * top 4 layers add input to output (residual network) * Decoder * same as encoder except all layers are just forward direction * encoder state is not passed as a start point to Decoder state * Attention * energy computed using NN with one hidden layer as appose to dot product or the usual practice of no hidden layer and $\tanh$ activation at the output layer * computed from output of 1st decoder layer * pre-feed to all layers * Training has two steps: ML and RL * ML (cross-entropy) training: * common wisdom, initialize all trainable parameters uniformly between [-0.04, 0.04] * clipping=5, batch=128 * Adam (lr=2e-4) 60K steps followed by SGD (lr=.5 which is probably a typo!) 1.2M steps + 4x(lr/=2 200K steps) * 12 async machines, each machine with 8 GPUs (K80) on which the model is spread X 6days * [dropout](http://www.shortscience.org/paper?bibtexKey=journals/corr/ZarembaSV14) 0.2-0.3 (higher for smaller datasets) * RL - [Reinforcement Learning](http://www.shortscience.org/paper?bibtexKey=journals/corr/RanzatoCAZ15) * sequence score, $\text{GLEU} = r = \min(\text{precision}, \text{recall})$ computed on n-grams of size 1-4 * mixed loss $\alpha \text{ML} + \text{RL}, \alpha =0.25$ * mean $r$ computed from $m=15$ samples * SGD, 400K steps, 3 days, no drouput * Prediction (i.e. Decoder) * beam search (3 beams) * A normalized score is computed to every beam that ended (died) * did not normalize beam score by $\text{beam_length}^\alpha , \alpha \in [0.6-0.7]$ * normalized with similar formula in which 5 is add to length and a coverage factor is added, which is the sum-log of attention weight of every input word (i.e. after summing over all output words) * Do a second pruning using normalized scores |

Simple Black-Box Adversarial Perturbations for Deep Networks

Nina Narodytska and Shiva Prasad Kasiviswanathan

arXiv e-Print archive - 2016 via Local arXiv

Keywords: cs.LG, cs.CR, stat.ML

**First published:** 2016/12/19 (3 years ago)

**Abstract:** Deep neural networks are powerful and popular learning models that achieve
state-of-the-art pattern recognition performance on many computer vision,
speech, and language processing tasks. However, these networks have also been
shown susceptible to carefully crafted adversarial perturbations which force
misclassification of the inputs. Adversarial examples enable adversaries to
subvert the expected system behavior leading to undesired consequences and
could pose a security risk when these systems are deployed in the real world.
In this work, we focus on deep convolutional neural networks and demonstrate
that adversaries can easily craft adversarial examples even without any
internal knowledge of the target network. Our attacks treat the network as an
oracle (black-box) and only assume that the output of the network can be
observed on the probed inputs. Our first attack is based on a simple idea of
adding perturbation to a randomly selected single pixel or a small set of them.
We then improve the effectiveness of this attack by carefully constructing a
small set of pixels to perturb by using the idea of greedy local-search. Our
proposed attacks also naturally extend to a stronger notion of
misclassification. Our extensive experimental results illustrate that even
these elementary attacks can reveal a deep neural network's vulnerabilities.
The simplicity and effectiveness of our proposed schemes mean that they could
serve as a litmus test for designing robust networks.
more
less

Nina Narodytska and Shiva Prasad Kasiviswanathan

arXiv e-Print archive - 2016 via Local arXiv

Keywords: cs.LG, cs.CR, stat.ML

[link]
Narodytska and Kasiviswanathan propose a local search-based black.box adversarial attack against deep networks. In particular, they address the problem of k-misclassification defined as follows: Definition (k-msiclassification). A neural network k-misclassifies an image if the true label is not among the k likeliest labels. To this end, they propose a local search algorithm which, in each round, randomly perturbs individual pixels in a local search area around the last perturbation. If a perturbed image satisfies the k-misclassificaiton condition, it is returned as adversarial perturbation. While the approach is very simple, it is applicable to black-box models where gradients and or internal representations are not accessible but only the final score/probability is available. Still the approach seems to be quite inefficient, taking up to one or more seconds to generate an adversarial example. Unfortunately, the authors do not discuss qualitative results and do not give examples of multiple adversarial examples (except for the four in Figure 1). https://i.imgur.com/RAjYlaQ.png Figure 1: Examples of adversarial attacks. Top: original image, bottom: perturbed image. |

Evaluating the visualization of what a Deep Neural Network has learned

Samek, Wojciech and Binder, Alexander and Montavon, Grégoire and Bach, Sebastian and Müller, Klaus-Robert

arXiv e-Print archive - 2015 via Local Bibsonomy

Keywords: dblp

Samek, Wojciech and Binder, Alexander and Montavon, Grégoire and Bach, Sebastian and Müller, Klaus-Robert

arXiv e-Print archive - 2015 via Local Bibsonomy

Keywords: dblp

[link]
Layer-wise Relevance Propagation (LRP) is a novel technique has been used by authors in multiple use-cases (apart from this publication) to demonstrate the robustness and advantage of a *decomposition* method over other heatmap generation methods. Such heatmap generation methods are very crucial for increasing interpretability of Deep Learning models as such. Apart from LRP relevance, authors also discuss quantitative ways to measure the accuracy of the heatmap generated. ### LRP & Alternatives What is LRP ? LRP is a principled approach to decompose a classification decision into pixel-wise relevances indicating the contributions of a pixel to the overall classification score. The approach is derived from a layer-wise conservation principle , which forces the propagated quantity (e.g. evidence for a predicted class) to be preserved between neurons of two adjacent layers. Denoting by R(l) [i] the relevance associated to the ith neuron of layer and by R (l+1) [j] the relevance associated to the jth neuron in the next layer, the conservation principle requires that ![](https://i.imgur.com/GQxrnCT.png) where R(l) [i] is given as ![](https://i.imgur.com/FD7AAfF.png) where z[i,j] is the activation of jth neuron because of input from ith neuron As per authors this is not necssarily the only relevance funtion which is conserved. The intuition behind using such a function is that lower-layer neurons that mostly contribute to the activation of the higher-layer neuron receive a larger share of the relevance Rj of the neuron j. A downside of this propagation rule (at least if *epsilon* = 0) is that the denominator may tend to zero if lower-level contributions to neuron j cancel each other out. The numerical instability can be overcome by setting *epsilon* > 0. However in that case, the conservation idea is relaxated in order to gain better numerical properties. To conserve relevance, it can be formulated as sum of positive and negative activations ![](https://i.imgur.com/lo7f8AI.png) such that *alpha* - *beta* = 1 #### Alternatives to LRP for heatmap **Senstiivity measurement** In such methods of generating heamaps, gradient of the output with respect to input is used for generating heatmap. This quantity measures how much small changes in the pixel value locally affect the network output. ##### Disadvantages Given most models use ReLU as activation function, the gradient flows only through activation with positive output - thereby making makes the backward mapping discontinuous, and consequently strongly local. Also same applies for maxpool activations - wherein gradients only flow through neurons with maximum intensity in local neighbourhood. Also, given most of these methods use absolute impact on prediction cause by changes in pixel intensities, the granularity of whether the pixel intensity was in favour or against evidence is lost. **Deconvolutional Networks** ##### Disadvantages Here the backward discontinuity problem of sensitivity based methods are absent, hence global features can be captured. However, since the method only takes in activation from final layer (which learns the presence or absence of features mostly) , using this for generating heatmaps is likely to yield avergae maps, lacking image specific localisation effects LRP is able to counter the effects nicely because of the way it uses relevance #### Performance of heatmaps Few concerns that the authors raise are - A heatmap is not a segmentation mask on the contrary missing evidence or the context may be very important for classification - Salient features represent average explanations of what distinguishes one image category from another. For individual images these explanations may be meaningless or even wrong. For instance, salient features for the class ‘bicycle’ may be the wheels and the handlebar. However, in some images a bicycle may be partly occluded so that these parts of a bike are not visible. In these images salient features fail to explain the classifier’s decision (which still may be correct). Authors propose a novel method (MoRF - *Most Relevant First* ) of objectively quantifying quality of a heatmap. A good detailed idea of the measure can best be obtained from the paper. To give an idea, the most reliable method should ideally rank the most relevant regions in the same order even if small perturbations in pixel intensities are observed (in non-relevant areas. The quantity of interest in this case is the area over the MoRF perturbation curve (AOPC). #### Observation Most of the sensitivity based methods answer to the question - *what change would make the image more or less belong to the category car* which isn't really the classifier's question. LRP plans to answer the real classifier question *what speaks for the presence of a car in the image* An image below would be a good example of how LRPs can denoise heatmaps generated on the basis of sensitivity. ![](https://i.imgur.com/Sq0b5yg.png) |

Scan Order in Gibbs Sampling: Models in Which it Matters and Bounds on How Much

He, Bryan D. and Sa, Christopher De and Mitliagkas, Ioannis and Ré, Christopher

Neural Information Processing Systems Conference - 2016 via Local Bibsonomy

Keywords: dblp

He, Bryan D. and Sa, Christopher De and Mitliagkas, Ioannis and Ré, Christopher

Neural Information Processing Systems Conference - 2016 via Local Bibsonomy

Keywords: dblp

[link]
A study of how scan orders influence Mixing time in Gibbs sampling. This paper is interested in comparing the mixing rates of Gibbs sampling using either systematic scan or random updates. The basic contributions are two: First, in Section 2, a set of cases where 1) systematic scan is polynomially faster than random updates. Together with a previously known case where it can be slower this contradicts a conjecture that the speeds of systematic and random updates are similar. Secondly, (In Theorem 1) a set of mild conditions under which the mixing times of systematic scan and random updates are not "too" different (roughly within squares of each other). First, following from a recent paper by Roberts and Rosenthal, the authors construct several examples which do not satisfy the commonly held belief that systematic scan is never more than a constant factor slower and a log factor faster than random scan. The authors then provide a result Theorem 1 which provides weaker bounds, which however they verify at least under some conditions. In fact the Theorem compares random scan to a lazy version of the systematic scan and shows that and obtains bounds in terms of various other quantities, like the minimum probability, or the minimum holding probability. MCMC is at the heart of many applications of modern machine learning and statistics. It is thus important to understand the computational and theoretical performance under various conditions. The present paper focused on examining systematic Gibbs sampling in comparison to random scan Gibbs. They do so first though the construction of several examples which challenge the dominant intuitions about mixing times, and develop theoretical bounds which are much wider than previously conjectured. |

Dynamic Routing Between Capsules.

Sara Sabour and Nicholas Frosst and Geoffrey E. Hinton

Neural Information Processing Systems Conference - 2017 via Local dblp

Keywords:

Sara Sabour and Nicholas Frosst and Geoffrey E. Hinton

Neural Information Processing Systems Conference - 2017 via Local dblp

Keywords:

[link]
An attention-based architecture for combining information from different convolutional layers. The attention values are calculated using an iterative process, making use of a custom squashing function. The evaluations on MNIST show robustness to affine transformations. |

Joint Training of a Convolutional Network and a Graphical Model for Human Pose Estimation

Tompson, Jonathan J. and Jain, Arjun and LeCun, Yann and Bregler, Christoph

Neural Information Processing Systems Conference - 2014 via Local Bibsonomy

Keywords: dblp

Tompson, Jonathan J. and Jain, Arjun and LeCun, Yann and Bregler, Christoph

Neural Information Processing Systems Conference - 2014 via Local Bibsonomy

Keywords: dblp

[link]
* They describe a model for human pose estimation, i.e. one that finds the joints ("skeleton") of a person in an image. * They argue that part of their model resembles a Markov Random Field (but in reality its implemented as just one big neural network). ### How * They have two components in their network: * Part-Detector: * Finds candidate locations for human joints in an image. * Pretty standard ConvNet. A few convolutional layers with pooling and ReLUs. * They use two branches: A fine and a coarse one. Both branches have practically the same architecture (convolutions, pooling etc.). The coarse one however receives the image downscaled by a factor of 2 (half width/height) and upscales it by a factor of 2 at the end of the branch. * At the end they merge the results of both branches with more convolutions. * The output of this model are 4 heatmaps (one per joint? unclear), each having lower resolution than the original image. * Spatial-Model: * Takes the results of the part detector and tries to remove all detections that were false positives. * They derive their architecture from a fully connected Markov Random Field which would be solved with one step of belief propagation. * They use large convolutions (128x128) to resemble the "fully connected" part. * They initialize the weights of the convolutions with joint positions gathered from the training set. * The convolutions are followed by log(), element-wise additions and exp() to resemble an energy function. * The end result are the input heatmaps, but cleaned up. ### Results * Beats all previous models (with and without spatial model). * Accuracy seems to be around 90% (with enough (16px) tolerance in pixel distance from ground truth). * Adding the spatial model adds a few percentage points of accuracy. * Using two branches instead of one (in the part detector) adds a bit of accuracy. Adding a third branch adds a tiny bit more. ![Results](https://raw.githubusercontent.com/aleju/papers/master/neural-nets/images/Joint_Training_of_a_ConvNet_and_a_PGM_for_HPE__results.png?raw=true "Results") *Example results.* ![Part Detector](https://raw.githubusercontent.com/aleju/papers/master/neural-nets/images/Joint_Training_of_a_ConvNet_and_a_PGM_for_HPE__part_detector.png?raw=true "Part Detector") *Part Detector network.* ![Spatial Model](https://raw.githubusercontent.com/aleju/papers/master/neural-nets/images/Joint_Training_of_a_ConvNet_and_a_PGM_for_HPE__spatial_model.png?raw=true "Spatial Model") *Spatial Model (apparently only for two input heatmaps).* ------------------------- # Rough chapter-wise notes * (1) Introduction * Human Pose Estimation (HPE) from RGB images is difficult due to the high dimensionality of the input. * Approaches: * Deformable-part models: Traditionally based on hand-crafted features. * Deep-learning based disciminative models: Recently outperformed other models. However, it is hard to incorporate priors (e.g. possible joint- inter-connectivity) into the model. * They combine: * A part-detector (ConvNet, utilizes multi-resolution feature representation with overlapping receptive fields) * Part-based Spatial-Model (approximates loopy belief propagation) * They backpropagate through the spatial model and then the part-detector. * (3) Model * (3.1) Convolutional Network Part-Detector * This model locates possible positions of human key joints in the image ("part detector"). * Input: RGB image. * Output: 4 heatmaps, one per key joint (per pixel: likelihood). * They use a fully convolutional network. * They argue that applying convolutions to every pixel is similar to moving a sliding window over the image. * They use two receptive field sizes for their "sliding window": A large but coarse/blurry one, a small but fine one. * To implement that, they use two branches. Both branches are mostly identical (convolutions, poolings, ReLU). They simply feed a downscaled (half width/height) version of the input image into the coarser branch. At the end they upscale the coarser branch once and then merge both branches. * After the merge they apply 9x9 convolutions and then 1x1 convolutions to get it down to 4xHxW (H=60, W=90 where expected input was H=320, W=240). * (3.2) Higher-level Spatial-Model * This model takes the detected joint positions (heatmaps) and tries to remove those that are probably false positives. * It is a ConvNet, which tries to emulate (1) a Markov Random Field and (2) solving that MRF approximately via one step of belief propagation. * The raw MRF formula would be something like `<likelihood of joint A per px> = normalize( <product over joint v from joints V> <probability of joint A per px given a> * <probability of joint v at px?> + someBiasTerm)`. * They treat the probabilities as energies and remove from the formula the partition function (`normalize`) for various reasons (e.g. because they are only interested in the maximum value anyways). * They use exp() in combination with log() to replace the product with a sum. * They apply SoftPlus and ReLU so that the energies are always positive (and therefore play well with log). * Apparently `<probability of joint v at px?>` are the input heatmaps of the part detector. * Apparently `<probability of joint A per px given a>` is implemented as the weights of a convolution. * Apparently `someBiasTerm` is implemented as the bias of a convolution. * The convolutions that they use are large (128x128) to emulate a fully connected graph. * They initialize the convolution weights based on histograms gathered from the dataset (empirical distribution of joint displacements). * (3.3) Unified Models * They combine the part-based model and the spatial model to a single one. * They first train only the part-based model, then only the spatial model, then both. * (4) Results * Used datasets: FLIC (4k training images, 1k test, mostly front-facing and standing poses), FLIC-plus (17k, 1k ?), extended-LSP (10k, 1k). * FLIC contains images showing multiple persons with only one being annotated. So for FLIC they add a heatmap of the annotated body torso to the input (i.e. the part-detector does not have to search for the person any more). * The evaluation metric roughly measures, how often predicted joint positions are within a certain radius of the true joint positions. * Their model performs significantly better than competing models (on both FLIC and LSP). * Accuracy seems to be at around 80%-95% per joint (when choosing high enough evaluation tolerance, i.e. 10px+). * Adding the spatial model to the part detector increases the accuracy by around 10-15 percentage points. * Training the part detector and the spatial model jointly adds ~3 percentage points accuracy over training them separately. * Adding the second filter bank (coarser branch in the part detector) adds around 5 percentage points accuracy. Adding a third filter bank adds a tiny bit more accuracy. |

About