Welcome to ShortScience.org! |

- ShortScience.org is a platform for post-publication discussion aiming to improve accessibility and reproducibility of research ideas.
- The website has over 600 summaries, mostly in machine learning, written by the community and organized by paper, conference, and year.
- Reading summaries of papers is useful to obtain the perspective and insight of another reader, why they liked or disliked it, and their attempt to demystify complicated sections.
- Also, writing summaries is a good exercise to understand the content of a paper because you are forced to challenge your assumptions when explaining it.
- Finally, you can keep up to date with the flood of research by reading the latest summaries on our Twitter and Facebook pages.

AdaBatch: Adaptive Batch Sizes for Training Deep Neural Networks

Aditya Devarakonda and Maxim Naumov and Michael Garland

arXiv e-Print archive - 2017 via Local arXiv

Keywords: cs.LG, cs.CV, cs.DC, stat.ML, 68T05, , I.2.6; I.5.0

**First published:** 2017/12/06 (1 month ago)

**Abstract:** Training deep neural networks with Stochastic Gradient Descent, or its
variants, requires careful choice of both learning rate and batch size. While
smaller batch sizes generally converge in fewer training epochs, larger batch
sizes offer more parallelism and hence better computational efficiency. We have
developed a new training approach that, rather than statically choosing a
single batch size for all epochs, adaptively increases the batch size during
the training process. Our method delivers the convergence rate of small batch
sizes while achieving performance similar to large batch sizes. We analyse our
approach using the standard AlexNet, ResNet, and VGG networks operating on the
popular CIFAR-10, CIFAR-100, and ImageNet datasets. Our results demonstrate
that learning with adaptive batch sizes can improve performance by factors of
up to 6.25 on 4 NVIDIA Tesla P100 GPUs while changing accuracy by less than 1%
relative to training with fixed batch sizes.
more
less

Aditya Devarakonda and Maxim Naumov and Michael Garland

arXiv e-Print archive - 2017 via Local arXiv

Keywords: cs.LG, cs.CV, cs.DC, stat.ML, 68T05, , I.2.6; I.5.0

**TL;DR**: You can increase batch size in advanced phases of training without hurting accuracy and gaining some speedup. You should multiply the learning rate by the same value you multiplied batch size. **Long version**: Authors propose to increase batch size gradually, starting with a small batch size $r$, and then progressively increase the batch size while adapting the learning rate $\alpha$ so that the ratio $\alpha/r$ remains constant (without taking in account scheduled LR reduce). In this paper, they double the batch size by schedule. At the same time, learning rate is decayed and then multiplied by 2 to compensate batch size increase: if in baseline lr is multiplied by $0.375$, it is multiplied by $0.75$ now. The experiments on CIFAR-100 dataset show that the gradual increase of batch size allows to converge to the same values as constant small batch size. However, bigger batches allow faster training, providing $\times 1.5$ speedup on AlexNet, and around $\times 1.2$ speedup on ResNet and VGG, for both forward and backward passes on single GPU. On multiple GPUs the approach allows to further increase batchsize. On fortunate setups authors manage to get up to $\times 1.6$ speedup compared to constant batch size equal to initial value, while the error is almost unchanged. For bigger batch sizes lr warmup is used. For ImageNet, same behavior is shown for accuracy: gradual increase of batch size converges to same values as setup with initial batch size. Since authors haven't access to a system capable of processing large batches on ImageNet, no performance results are reported. |

Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering

Peter Anderson and Xiaodong He and Chris Buehler and Damien Teney and Mark Johnson and Stephen Gould and Lei Zhang

arXiv e-Print archive - 2017 via Local arXiv

Keywords: cs.CV

**First published:** 2017/07/25 (5 months ago)

**Abstract:** Top-down visual attention mechanisms have been used extensively in image
captioning and visual question answering (VQA) to enable deeper image
understanding through fine-grained analysis and even multiple steps of
reasoning. In this work, we propose a combined bottom-up and top-down attention
mechanism that enables attention to be calculated at the level of objects and
other salient image regions. This is the natural basis for attention to be
considered. Within our approach, the bottom-up mechanism (based on Faster
R-CNN) proposes image regions, each with an associated feature vector, while
the top-down mechanism determines feature weightings. Applying this approach to
image captioning, our results on the MSCOCO test server establish a new
state-of-the-art for the task, improving the best published result in terms of
CIDEr score from 114.7 to 117.9 and BLEU-4 from 35.2 to 36.9. Demonstrating the
broad applicability of the method, applying the same approach to VQA we obtain
first place in the 2017 VQA Challenge.
more
less

Peter Anderson and Xiaodong He and Chris Buehler and Damien Teney and Mark Johnson and Stephen Gould and Lei Zhang

arXiv e-Print archive - 2017 via Local arXiv

Keywords: cs.CV

This paper solves two tasks: Image Captioning and VQA. The main idea is to use Faster R-CNN to embed images (kx2048 from k bounding boxes) instead of ResNet (14x14x2048) and apply attention over k vectors. For **VQA**, this is basically (Faster R-CNN + ShowAttendAskAnswer). SAAA(ShowAskAttendAnswer) calculates a 2D attention map from the concatenation of a text vector (2048-dim from LSTM) and image tensor (2048x14x14 from ResNet). This image feature can be thought as a collection of 2048-dim feature vectors. This paper uses Faster R-CNN to get k bounding boxes. Each bounding box is a 2048-dim vector so we have kx2048, which is fed to SAAA. **SAAA**: https://i.imgur.com/2FnPXi0.png **This paper (VQA)**: https://i.imgur.com/xib77Iy.png For **Image Captioning**, it uses 2-layer LSTM. The first layer gets the average of k 2048-dim vectors. The output is used to calculate the attention weights over k vectors. The second layer gets the weight-averaged 2048-dim vector and the output of the first layer. https://i.imgur.com/GeXaC30.png |

Geometric Deep Learning: Going beyond Euclidean data

Bronstein, Michael M. and Bruna, Joan and LeCun, Yann and Szlam, Arthur and Vandergheynst, Pierre

IEEE Signal Process. Mag. - 2017 via Local Bibsonomy

Keywords: dblp

Bronstein, Michael M. and Bruna, Joan and LeCun, Yann and Szlam, Arthur and Vandergheynst, Pierre

IEEE Signal Process. Mag. - 2017 via Local Bibsonomy

Keywords: dblp

This paper surveys progress on adapting deep learning techniques to non-Euclidean data and suggests future directions. One of the strengths (and weaknesses) of deep learning--specifically exploited by convolutional neural networks--is that the data is assumed to exhibit translation invariance/equivariance and invariance to local deformations. Hence, long-range dependencies can be learned with multi-scale, hierarchical techniques where spatial resolution is reduced. However, this means that any information about the data that can't be learned when spatial resolution is reduced can get lost (I believe that residual networks aim to address this by the skip connections that are able to learn an identity operation; also, in computer vision, multi-scale versions of the data are often fed to CNNs). Key areas where this assumption about the data appears to be true is computer vision and speech recognition. #### Some quick background The *Laplacian*, a self-adjoint (symmetric) positive semi-definite operator, which is defined for smooth manifolds and graphs in this paper, can be thought of as the difference between the local average of a function around a point and the value of the function at the point itself. It's generally defined as $\triangle = -\text{div} \nabla$. When discretizing a continuous, smooth manifold with a *mesh*, note that the graph Laplacian might not converge to the continuous Laplacian operator with increasing sampling density. To be consistent, need to create a triangular mesh, i.e., represent the manifold as a polyhedral surface. ### Spectral methods Fourier analysis on non-Euclidean domains is possible by considering the eigendecomposition of the Laplacian operator. A possible transformation of the Convolution Theorem to functions on manifolds and graphs is discussed, but is noted as not being shift-invariant. The Spectral CNN can be defined by introducing a spectral convolutional layer acting on the vertices of the graph and using filters in the frequency domain and the eigenvectors of the Laplacian. However, the spectral filter coefficients will be dependent on the particular eigenvectors (basis) - domain dependency == bad for generalization! The non-Euclidean analogy of pooling is *graph coarsening*- only a fraction of the vertices of the graph are retained. Strided convolutions can be generalized to the spectral construction by only keeping the low-frequency components - must recompute the graph Laplacian after applying the nonlinearity in the spatial domain, however. Performing matrix multiplications on the eigendecomposition of the Laplacian is expensive! ### Spectrum-free Methods **A polynomial of the Laplacian acts as a polynomial on the eigenvalues**. ChebNet (Defferrard et al.) and Graph Convolutional Networks (Kipf et al.) boil down to applying simple filters acting on the r- or 1-hop neighborhood of the graph in the spatial domain. Some examples of generalizations of CNNs that define weighting functions for a locally Euclidean coordinate system around a point on a manifold are the * Geodesic CNN * Anisotropic CNN * Mixture Model network (MoNet) #### What problems are being solved with these methods? * Ranking and community detection on social networks * Recommender systems * 3D geometric data in Computer Vision/Graphics * Shape classification * Feature correspondence for 3D shapes * Behavior of N-particle systems (particle physics, LHC) * Molecule design * Medical imaging ### Open Problems * *Generalization* spectral analogues of convolution learned on one graph cannot be readily applied to other ones (domain dependency). Spatial methods generalize across different domains, but come with their own subtleties * *Time-varying domains* * *Directed graphs* non-symmetric Laplacian that do not have orthogonal eigendecompositions for interpretable spectral-domain constructions * *Synthesis problems* generative models * *Computation* extending deep learning frameworks for non-Euclidean data |

Bayesian dark knowledge

Balan, Anoop Korattikara and Rathod, Vivek and Murphy, Kevin P. and Welling, Max

Neural Information Processing Systems Conference - 2015 via Local Bibsonomy

Keywords: dblp

Balan, Anoop Korattikara and Rathod, Vivek and Murphy, Kevin P. and Welling, Max

Neural Information Processing Systems Conference - 2015 via Local Bibsonomy

Keywords: dblp

It's not clear to me how predicting the variance with a neural network is a robust estimator of uncertainty. We all know the adversarial examples where we can simply fool a neural network with an example that is a little off. By a same argument, we could make adversarial examples to _fool_ the uncertainty estimator. I would like to see more work on this |

Representation Learning by Rotating Your Faces

Tran, Luan and Yin, Xi and Liu, Xiaoming

arXiv e-Print archive - 2017 via Local Bibsonomy

Keywords: dblp

Tran, Luan and Yin, Xi and Liu, Xiaoming

arXiv e-Print archive - 2017 via Local Bibsonomy

Keywords: dblp

This paper gets a face image and changes its pose or rotates it (to any desired pose) by passing the target pose as the input to the model. https://i.imgur.com/AGNOag5.png They use a GAN (named DR-GAN) for face rotation. The gan has an encoder and a decoder. The encoder takes the image and gets a high-level feature representation. The decoder gets high-level features, the target pose, and some noise to generate the output image with rotated face. The generated image is then passed to a discriminator where it says whether the image is real or fake. The disc also has two other outputs: 1- it estimates the pose of the generated image, 2) it estimated the identity of the person. no direct loss is applied to the generator, it is trained by the gradient that it gets through discriminator to minimize the three objects: 1- gan loss (to fool disc) 2-pose estimation 3- identity estimation. They use two tricks to improve the model: 1- using the same parameters for encoder of generator (gen-enc) and the discriminator (they observe this helps better identity recognition) 2- passing two images to gen-enc and interpolating between their high-level features (gen-enc output) and then applying two costs on it: 1) gan loss 2) pose loss. These losses are applied through disc, similar to above. The first trick improves gen-enc and second trick improves gen-dec, both help on identification. Their model can also leverage multiple image of the same identity if the dataset provides that to get better latent representation in gen-enc for a given identity. https://i.imgur.com/23Tckqc.png These are some samples on face frontalization: https://i.imgur.com/zmCODXe.png and these are some samples on interpolating different features in latent space: (sub-fig a) interpolating f(x) between the latent space of two images, (sub-fig b) interpolating pose (c), (sub-fig c) interpolating noise: https://i.imgur.com/KlkVyp9.png I find these positive aspects about the paper: 1) face rotation is applied on the images in the wild, 2) It is not required to have paired data. 3) multiple source images of the same identity can be used if provided, 4) identity and pose are used smartly in the discriminator to guide the generator, 5) model can specify the target pose (it is not only face-frontalization). Negative aspects: 1) face has many artifacts, similar to artifacts of some other gan models. 2) The identity is not well-preserved and the faces seem sometime distorted compared to the original person. They show the models performance on identity recognition and face rotation and demonstrate compelling results. |

Learning to Diagnose with LSTM Recurrent Neural Networks

Lipton, Zachary Chase and Kale, David C. and Elkan, Charles and Wetzel, Randall C.

arXiv e-Print archive - 2015 via Local Bibsonomy

Keywords: dblp

Lipton, Zachary Chase and Kale, David C. and Elkan, Charles and Wetzel, Randall C.

arXiv e-Print archive - 2015 via Local Bibsonomy

Keywords: dblp

#### Goal + Predict 128 diagnoses for intensive pediatric care patients. #### Dataset: + Children's Hospital LA. + Episode is a multivariate time series that describes the stay of one patient in the intensive care unit. Dataset properties | Value ---------|---------- Number of episodes | 10,401 Duration of episodes | From 12h to several months Time series variables | Systolic blood pressure, Diastolic blood pressure, Peripheral capillary refill rate, End tidal CO2, Fraction of inspired O2, Glasgow coma scale, Blood glucose, Heart rate, pH, Respiratory rate, Blood O2 Saturation, Body temperature, Urine output. + Resampling and missing values: + Irregularly sampled time-series that is resampled to an hourly rate. + Mean measurement within each hour window is taken. + Forward- and back-filling are used to fill gaps created by the resampling. + When variable time series is missing entirely: imputation with a clinically *normal* value defined by domain experts. + This paper is followed by [Modeling Missing Data in Clinical Time Series with RNNs](http://www.shortscience.org/paper?bibtexKey=journals/corr/LiptonKW16) from the same research group. + Labels: + Each episode is associated with 0 or more diagnoses. (in-house taxonomy, ICD-9 based). + Dataset contains 429 diagnoses. The paper focuses on the 128 most frequent diagnoses that appear 50 or more times in the dataset. #### Architecture: + LSTM with Target Replication: ![Architecture](https://raw.githubusercontent.com/tiagotvv/ml-papers/master/clinical-data/images/Lipton2016a_target.png?raw=true "Target Replication") + Loss function: + For the model with target replication, output y is generated at every sequence step. The loss function is then a convex combination of the final loss (log-loss in the case of this paper) and the average of the losses over all steps where T is the number of sequence steps and alpha is a hyperparameter. ![Loss function](https://raw.githubusercontent.com/tiagotvv/ml-papers/master/clinical-data/images/Lipton2016a_loss.png?raw=true "Loss function") #### Experiments and Results: **Methodology**: + Split dataset: 80% training, 10% validation, 10% test + LTSM trained for 100 epochs via gradient stochastic gradient (with momentum). + Regularization L2: 1e-6, obtained via validation dataset. + LSTM: 2 hidden layers with 64 cells or 128 cells (and 50% dropout) + Multiple combinations: target replication / auxiliary target variables (trained using the other 301 diagnoses and other clinical information as a target. Inferences are made only for the 128 major diagnoses. + Baselines for comparison: + Logistic Regression - L2 regularized + MLP with 3 hidden layers - ReLU - dropout 50%. + Baselines tested in the raw time-series and in a feature engineering version made by domain experts. *Metrics*: + Micro AUC, Micro F1: calculated by adding the TPs, FPs, TNs and FNs for the entire dataset and for all classes. + Macro AUC, Macro F1: Arithmetic mean of AUCs and F1 scores for each of the classes. + Precision at 10: Fraction of correct diagnoses among the top 10 predictions of the model. + The upper bound for precision at 10 is 0.2281 since in the test set there are on average 2.281 diagnoses per patient. *Results*: ![All Results](https://raw.githubusercontent.com/tiagotvv/ml-papers/master/clinical-data/images/Lipton2016a_allresults.png?raw=true "Performance metrics across all labels") *Results for selected diagnoses*: ![Results for Selected Diseases](https://raw.githubusercontent.com/tiagotvv/ml-papers/master/clinical-data/images/Lipton2016a_selected.png?raw=true "Performance for selected diagnoses") #### Discussion: + Auxiliary outputs improve performance at the expense of increased training time. Very unbalanced dataset for some of the remaining 301 labels makes it spend an entire epoch only to learn that one of the target variables can take values â€‹â€‹other than 0. + Real-Time Predictions: In the future, the authors expect that the proposed solution could be used to make continuously updated real-time alerts and diagnoses. |

Rotation equivariant vector field networks

Diego Marcos and Michele Volpi and Nikos Komodakis and Devis Tuia

arXiv e-Print archive - 2016 via Local arXiv

Keywords: cs.CV

**First published:** 2016/12/29 (1 year ago)

**Abstract:** We propose a method to encode rotation equivariance or invariance into
convolutional neural networks (CNNs). Each convolutional filter is applied with
several orientations and returns a vector field that represents the magnitude
and angle of the highest scoring rotation at the given spatial location. To
propagate information about the main orientation of the different features to
each layer in the network, we propose an enriched orientation pooling, i.e. max
and argmax operators over the orientation space, allowing to keep the
dimensionality of the feature maps low and to propagate only useful
information. We name this approach RotEqNet. We apply RotEqNet to three
datasets: first, a rotation invariant classification problem, the MNIST-rot
benchmark, in which we improve over the state-of-the-art results. Then, a
neuron membrane segmentation benchmark, where we show that RotEqNet can be
applied successfully to obtain equivariance to rotation with a simple fully
convolutional architecture. Finally, we improve significantly the
state-of-the-art on the problem of estimating cars' absolute orientation in
aerial images, a problem where the output is required to be covariant with
respect to the object's orientation.
more
less

Diego Marcos and Michele Volpi and Nikos Komodakis and Devis Tuia

arXiv e-Print archive - 2016 via Local arXiv

Keywords: cs.CV

This work deals with rotation equivariant convolutional filters. The idea is that when you rotate an image you should not need to relearn new filters to deal with this rotation. First we can look at how convolutions typically handle rotation and how we would expect a rotation invariant solution to perform below: | | | | - | - | | https://i.imgur.com/cirTi4S.png | https://i.imgur.com/iGpUZDC.png | | | | | The method computes all possible rotations of the filter which results in a list of activations where each element represents a different rotation. From this list the maximum is taken which results in a two dimensional output for every pixel (rotation, magnitude). This happens at the pixel level so the result is a vector field over the image. https://i.imgur.com/BcnuI1d.png We can visualize their degree selection method with a figure from https://arxiv.org/abs/1603.04392 which determined the rotation of a building: https://i.imgur.com/hPI8J6y.png We can also think of this approach as attention \cite{1409.0473} where they attend over the possible rotations to obtain a score for each possible rotation value to pass on. The network can learn to adjust the rotation value to be whatever value the later layers will need. ------------------------ Results on [Rotated MNIST](http://www.iro.umontreal.ca/~lisa/twiki/bin/view.cgi/Public/MnistVariations) show an impressive improvement in training speed and generalization error: https://i.imgur.com/YO3poOO.png |

Generating Images with Perceptual Similarity Metrics based on Deep Networks

Alexey Dosovitskiy and Thomas Brox

arXiv e-Print archive - 2016 via Local arXiv

Keywords: cs.LG, cs.CV, cs.NE

**First published:** 2016/02/08 (1 year ago)

**Abstract:** Image-generating machine learning models are typically trained with loss
functions based on distance in the image space. This often leads to
over-smoothed results. We propose a class of loss functions, which we call deep
perceptual similarity metrics (DeePSiM), that mitigate this problem. Instead of
computing distances in the image space, we compute distances between image
features extracted by deep neural networks. This metric better reflects
perceptually similarity of images and thus leads to better results. We show
three applications: autoencoder training, a modification of a variational
autoencoder, and inversion of deep convolutional networks. In all cases, the
generated images look sharp and resemble natural images.
more
less

Alexey Dosovitskiy and Thomas Brox

arXiv e-Print archive - 2016 via Local arXiv

Keywords: cs.LG, cs.CV, cs.NE

* The authors define in this paper a special loss function (DeePSiM), mostly for autoencoders. * Usually one would use a MSE of euclidean distance as the loss function for an autoencoder. But that loss function basically always leads to blurry reconstructed images. * They add two new ingredients to the loss function, which results in significantly sharper looking images. ### How * Their loss function has three components: * Euclidean distance in image space (i.e. pixel distance between reconstructed image and original image, as usually used in autoencoders) * Euclidean distance in feature space. Another pretrained neural net (e.g. VGG, AlexNet, ...) is used to extract features from the original and the reconstructed image. Then the euclidean distance between both vectors is measured. * Adversarial loss, as usually used in GANs (generative adversarial networks). The autoencoder is here treated as the GAN-Generator. Then a second network, the GAN-Discriminator is introduced. They are trained in the typical GAN-fashion. The loss component for DeePSiM is the loss of the Discriminator. I.e. when reconstructing an image, the autoencoder would learn to reconstruct it in a way that lets the Discriminator believe that the image is real. * Using the loss in feature space alone would not be enough as that tends to lead to overpronounced high frequency components in the image (i.e. too strong edges, corners, other artefacts). * To decrease these high frequency components, a "natural image prior" is usually used. Other papers define some function by hand. This paper uses the adversarial loss for that (i.e. learns a good prior). * Instead of training a full autoencoder (encoder + decoder) it is also possible to only train a decoder and feed features - e.g. extracted via AlexNet - into the decoder. ### Results * Using the DeePSiM loss with a normal autoencoder results in sharp reconstructed images. * Using the DeePSiM loss with a VAE to generate ILSVRC-2012 images results in sharp images, which are locally sound, but globally don't make sense. Simple euclidean distance loss results in blurry images. * Using the DeePSiM loss when feeding only image space features (extracted via AlexNet) into the decoder leads to high quality reconstructions. Features from early layers will lead to more exact reconstructions. * One can again feed extracted features into the network, but then take the reconstructed image, extract features of that image and feed them back into the network. When using DeePSiM, even after several iterations of that process the images still remain semantically similar, while their exact appearance changes (e.g. a dog's fur color might change, counts of visible objects change). ![Generated images](https://raw.githubusercontent.com/aleju/papers/master/neural-nets/images/Generating_Images_with_Perceptual_Similarity_Metrics_based_on_Deep_Networks__generated_images.png?raw=true "Generated images") *Images generated with a VAE using DeePSiM loss.* ![Reconstructed images](https://raw.githubusercontent.com/aleju/papers/master/neural-nets/images/Generating_Images_with_Perceptual_Similarity_Metrics_based_on_Deep_Networks__reconstructed.png?raw=true "Reconstructed images") *Images reconstructed from features fed into the network. Different AlexNet layers (conv5 - fc8) were used to generate the features. Earlier layers allow more exact reconstruction.* ![Iterated reconstruction](https://raw.githubusercontent.com/aleju/papers/master/neural-nets/images/Generating_Images_with_Perceptual_Similarity_Metrics_based_on_Deep_Networks__reconstructed_multi.png?raw=true "Iterated reconstruction") *First, images are reconstructed from features (AlexNet, layers conv5 - fc8 as columns). Then, features of the reconstructed images are fed back into the network. That is repeated up to 8 times (rows). Images stay semantically similar, but their appearance changes.* -------------------- ### Rough chapter-wise notes * (1) Introduction * Using a MSE of euclidean distances for image generation (e.g. autoencoders) often results in blurry images. * They suggest a better loss function that cares about the existence of features, but not as much about their exact translation, rotation or other local statistics. * Their loss function is based on distances in suitable feature spaces. * They use ConvNets to generate those feature spaces, as these networks are sensitive towards important changes (e.g. edges) and insensitive towards unimportant changes (e.g. translation). * However, naively using the ConvNet features does not yield good results, because the networks tend to project very different images onto the same feature vectors (i.e. they are contractive). That leads to artefacts in the generated images. * Instead, they combine the feature based loss with GANs (adversarial loss). The adversarial loss decreases the negative effects of the feature loss ("natural image prior"). * (3) Model * A typical choice for the loss function in image generation tasks (e.g. when using an autoencoders) would be squared euclidean/L2 loss or L1 loss. * They suggest a new class of losses called "DeePSiM". * We have a Generator `G`, a Discriminator `D`, a feature space creator `C` (takes an image, outputs a feature space for that image), one (or more) input images `x` and one (or more) target images `y`. Input and target image can be identical. * The total DeePSiM loss is a weighted sum of three components: * Feature loss: Squared euclidean distance between the feature spaces of (1) input after fed through G and (2) the target image, i.e. `||C(G(x))-C(y)||^2_2`. * Adversarial loss: A discriminator is introduced to estimate the "fakeness" of images generated by the generator. The losses for D and G are the standard GAN losses. * Pixel space loss: Classic squared euclidean distance (as commonly used in autoencoders). They found that this loss stabilized their adversarial training. * The feature loss alone would create high frequency artefacts in the generated image, which is why a second loss ("natural image prior") is needed. The adversarial loss fulfills that role. * Architectures * Generator (G): * They define different ones based on the task. * They all use up-convolutions, which they implement by stacking two layers: (1) a linear upsampling layer, then (2) a normal convolutional layer. * They use leaky ReLUs (alpha=0.3). * Comparators (C): * They use variations of AlexNet and Exemplar-CNN. * They extract the features from different layers, depending on the experiment. * Discriminator (D): * 5 convolutions (with some striding; 7x7 then 5x5, afterwards 3x3), into average pooling, then dropout, then 2x linear, then 2-way softmax. * Training details * They use Adam with learning rate 0.0002 and normal momentums (0.9 and 0.999). * They temporarily stop the discriminator training when it gets too good. * Batch size was 64. * 500k to 1000k batches per training. * (4) Experiments * Autoencoder * Simple autoencoder with an 8x8x8 code layer between encoder and decoder (so actually more values than in the input image?!). * Encoder has a few convolutions, decoder a few up-convolutions (linear upsampling + convolution). * They train on STL-10 (96x96) and take random 64x64 crops. * Using for C AlexNet tends to break small structural details, using Exempler-CNN breaks color details. * The autoencoder with their loss tends to produce less blurry images than the common L2 and L1 based losses. * Training an SVM on the 8x8x8 hidden layer performs significantly with their loss than L2/L1. That indicates potential for unsupervised learning. * Variational Autoencoder * They replace part of the standard VAE loss with their DeePSiM loss (keeping the KL divergence term). * Everything else is just like in a standard VAE. * Samples generated by a VAE with normal loss function look very blurry. Samples generated with their loss function look crisp and have locally sound statistics, but still (globally) don't really make any sense. * Inverting AlexNet * Assume the following variables: * I: An image * ConvNet: A convolutional network * F: The features extracted by a ConvNet, i.e. ConvNet(I) (feaures in all layers, not just the last one) * Then you can invert the representation of a network in two ways: * (1) An inversion that takes an F and returns roughly the I that resulted in F (it's *not* key here that ConvNet(reconstructed I) returns the same F again). * (2) An inversion that takes an F and projects it to *some* I so that ConvNet(I) returns roughly the same F again. * Similar to the autoencoder cases, they define a decoder, but not encoder. * They feed into the decoder a feature representation of an image. The features are extracted using AlexNet (they try the features from different layers). * The decoder has to reconstruct the original image (i.e. inversion scenario 1). They use their DeePSiM loss during the training. * The images can be reonstructed quite well from the last convolutional layer in AlexNet. Chosing the later fully connected layers results in more errors (specifially in the case of the very last layer). * They also try their luck with the inversion scenario (2), but didn't succeed (as their loss function does not care about diversity). * They iteratively encode and decode the same image multiple times (probably means: image -> features via AlexNet -> decode -> reconstructed image -> features via AlexNet -> decode -> ...). They observe, that the image does not get "destroyed", but rather changes semantically, e.g. three apples might turn to one after several steps. * They interpolate between images. The interpolations are smooth. |

Pixel Recurrent Neural Networks

Oord, AÃ¤ron Van Den and Kalchbrenner, Nal and Kavukcuoglu, Koray

arXiv e-Print archive - 2016 via Local Bibsonomy

Keywords: dblp

Oord, AÃ¤ron Van Den and Kalchbrenner, Nal and Kavukcuoglu, Koray

arXiv e-Print archive - 2016 via Local Bibsonomy

Keywords: dblp

*Note*: This paper felt rather hard to read. The summary might not have hit exactly what the authors tried to explain. * The authors describe multiple architectures that can model the distributions of images. * These networks can be used to generate new images or to complete existing ones. * The networks are mostly based on RNNs. ### How * They define three architectures: * Row LSTM: * Predicts a pixel value based on all previous pixels in the image. * It applies 1D convolutions (with kernel size 3) to the current and previous rows of the image. * It uses the convolution results as features to predict a pixel value. * Diagonal BiLSTM: * Predicts a pixel value based on all previous pixels in the image. * Instead of applying convolutions in a row-wise fashion, they apply them to the diagonals towards the top left and top right of the pixel. * Diagonal convolutions can be applied by padding the n-th row with `n-1` pixels from the left (diagonal towards top left) or from the right (diagonal towards the top right), then apply a 3x1 column convolution. * PixelCNN: * Applies convolutions to the region around a pixel to predict its values. * Uses masks to zero out pixels that follow after the target pixel. * They use no pooling layers. * While for the LSTMs each pixel is conditioned on all previous pixels, the dependency range of the CNN is bounded. * They use up to 12 LSTM layers. * They use residual connections between their LSTM layers. * All architectures predict pixel values as a softmax over 255 distinct values (per channel). According to the authors that leads to better results than just using one continuous output (i.e. sigmoid) per channel. * They also try a multi-scale approach: First, one network generates a small image. Then a second networks generates the full scale image while being conditioned on the small image. ### Results * The softmax layers learn reasonable distributions. E.g. neighboring colors end up with similar probabilities. Values 0 and 255 tend to have higher probabilities than others, especially for the very first pixel. * In the 12-layer LSTM row model, residual and skip connections seem to have roughly the same effect on the network's results. Using both yields a tiny improvement over just using one of the techniques alone. * They achieve a slightly better result on MNIST than DRAW did. * Their negative log likelihood results for CIFAR-10 improve upon previous models. The diagonal BiLSTM model performs best, followed by the row LSTM model, followed by PixelCNN. * Their generated images for CIFAR-10 and Imagenet capture real local spatial dependencies. The multi-scale model produces better looking results. The images do not appear blurry. Overall they still look very unreal. ![Generated ImageNet images](https://raw.githubusercontent.com/aleju/papers/master/neural-nets/images/Pixel_Recurrent_Neural_Networks__imagenet_multiscale.png?raw=true "Generated ImageNet images") *Generated ImageNet 64x64 images.* ![Image completion](https://raw.githubusercontent.com/aleju/papers/master/neural-nets/images/Pixel_Recurrent_Neural_Networks__occlusion.png?raw=true "Image completion") *Completing partially occluded images.* |

Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks

Alec Radford and Luke Metz and Soumith Chintala

arXiv e-Print archive - 2015 via Local arXiv

Keywords: cs.LG, cs.CV

**First published:** 2015/11/19 (2 years ago)

**Abstract:** In recent years, supervised learning with convolutional networks (CNNs) has
seen huge adoption in computer vision applications. Comparatively, unsupervised
learning with CNNs has received less attention. In this work we hope to help
bridge the gap between the success of CNNs for supervised learning and
unsupervised learning. We introduce a class of CNNs called deep convolutional
generative adversarial networks (DCGANs), that have certain architectural
constraints, and demonstrate that they are a strong candidate for unsupervised
learning. Training on various image datasets, we show convincing evidence that
our deep convolutional adversarial pair learns a hierarchy of representations
from object parts to scenes in both the generator and discriminator.
Additionally, we use the learned features for novel tasks - demonstrating their
applicability as general image representations.
more
less

Alec Radford and Luke Metz and Soumith Chintala

arXiv e-Print archive - 2015 via Local arXiv

Keywords: cs.LG, cs.CV

* DCGANs are just a different architecture of GANs. * In GANs a Generator network (G) generates images. A discriminator network (D) learns to differentiate between real images from the training set and images generated by G. * DCGANs basically convert the laplacian pyramid technique (many pairs of G and D to progressively upscale an image) to a single pair of G and D. ### How * Their D: Convolutional networks. No linear layers. No pooling, instead strided layers. LeakyReLUs. * Their G: Starts with 100d noise vector. Generates with linear layers 1024x4x4 values. Then uses fractionally strided convolutions (move by 0.5 per step) to upscale to 512x8x8. This is continued till Cx32x32 or Cx64x64. The last layer is a convolution to 3x32x32/3x64x64 (Tanh activation). * The fractionally strided convolutions do basically the same as the progressive upscaling in the laplacian pyramid. So it's basically one laplacian pyramid in a single network and all upscalers are trained jointly leading to higher quality images. * They use Adam as their optimizer. To decrease instability issues they decreased the learning rate to 0.0002 (from 0.001) and the momentum/beta1 to 0.5 (from 0.9). ![Architecture of G](https://raw.githubusercontent.com/aleju/papers/master/neural-nets/images/Unsupervised_Representation_Learning_with_Deep_Convolutional_Generative_Adversarial_Networks__G.png?raw=true "Architecture of G") *Architecture of G using fractionally strided convolutions to progressively upscale the image.* ### Results * High quality images. Still with distortions and errors, but at first glance they look realistic. * Smooth interpolations between generated images are possible (by interpolating between the noise vectors and feeding these interpolations into G). * The features extracted by D seem to have some potential for unsupervised learning. * There seems to be some potential for vector arithmetics (using the initial noise vectors) similar to the vector arithmetics with wordvectors. E.g. to generate mean with sunglasses via `vector(men) + vector(sunglasses)`. ![Example images (bedrooms)](https://raw.githubusercontent.com/aleju/papers/master/neural-nets/images/Unsupervised_Representation_Learning_with_Deep_Convolutional_Generative_Adversarial_Networks__bedrooms.png?raw=true "Example images (bedrooms)") *Generated images, bedrooms.* ![Example images (faces)](https://raw.githubusercontent.com/aleju/papers/master/neural-nets/images/Unsupervised_Representation_Learning_with_Deep_Convolutional_Generative_Adversarial_Networks__faces.png?raw=true "Example images (faces)") *Generated images, faces.* ### Rough chapter-wise notes * Introduction * For unsupervised learning, they propose to use to train a GAN and then reuse the weights of D. * GANs have traditionally been hard to train. * Approach and model architecture * They use for D an convnet without linear layers, withput pooling layers (only strides), LeakyReLUs and Batch Normalization. * They use for G ReLUs (hidden layers) and Tanh (output). * Details of adversarial training * They trained on LSUN, Imagenet-1k and a custom dataset of faces. * Minibatch size was 128. * LeakyReLU alpha 0.2. * They used Adam with a learning rate of 0.0002 and momentum of 0.5. * They note that a higher momentum lead to oscillations. * LSUN * 3M images of bedrooms. * They use an autoencoder based technique to filter out 0.25M near duplicate images. * Faces * They downloaded 3M images of 10k people. * They extracted 350k faces with OpenCV. * Empirical validation of DCGANs capabilities * Classifying CIFAR-10 GANs as a feature extractor * They train a pair of G and D on Imagenet-1k. * D's top layer has `512*4*4` features. * They train an SVM on these features to classify the images of CIFAR-10. * They achieve a score of 82.8%, better than unsupervised K-Means based methods, but worse than Exemplar CNNs. * Classifying SVHN digits using GANs as a feature extractor * They reuse the same pipeline (D trained on CIFAR-10, SVM) for the StreetView House Numbers dataset. * They use 1000 SVHN images (with the features from D) to train the SVM. * They achieve 22.48% test error. * Investigating and visualizing the internals of the networks * Walking in the latent space * The performs walks in the latent space (= interpolate between input noise vectors and generate several images for the interpolation). * They argue that this might be a good way to detect overfitting/memorizations as those might lead to very sudden (not smooth) transitions. * Visualizing the discriminator features * They use guided backpropagation to visualize what the feature maps in D have learned (i.e. to which images they react). * They can show that their LSUN-bedroom GAN seems to have learned in an unsupervised way what beds and windows look like. * Forgetting to draw certain objects * They manually annotated the locations of objects in some generated bedroom images. * Based on these annotations they estimated which feature maps were mostly responsible for generating the objects. * They deactivated these feature maps and regenerated the images. * That decreased the appearance of these objects. It's however not as easy as one feature map deactivation leading to one object disappearing. They deactivated quite a lot of feature maps (200) and they objects were often still quite visible or replaced by artefacts/errors. * Vector arithmetic on face samples * Wordvectors can be used to perform semantic arithmetic (e.g. `king - man + woman = queen`). * The unsupervised representations seem to be useable in a similar fashion. * E.g. they generated images via G. They then picked several images that showed men with glasses and averaged these image's noise vectors. They did with same with men without glasses and women without glasses. Then they performed on these vectors `men with glasses - mean without glasses + women without glasses` to get `woman with glasses |

A Neural Algorithm of Artistic Style

Leon A. Gatys and Alexander S. Ecker and Matthias Bethge

arXiv e-Print archive - 2015 via Local arXiv

Keywords: cs.CV, cs.NE, q-bio.NC

**First published:** 2015/08/26 (2 years ago)

**Abstract:** In fine art, especially painting, humans have mastered the skill to create
unique visual experiences through composing a complex interplay between the
content and style of an image. Thus far the algorithmic basis of this process
is unknown and there exists no artificial system with similar capabilities.
However, in other key areas of visual perception such as object and face
recognition near-human performance was recently demonstrated by a class of
biologically inspired vision models called Deep Neural Networks. Here we
introduce an artificial system based on a Deep Neural Network that creates
artistic images of high perceptual quality. The system uses neural
representations to separate and recombine content and style of arbitrary
images, providing a neural algorithm for the creation of artistic images.
Moreover, in light of the striking similarities between performance-optimised
artificial neural networks and biological vision, our work offers a path
forward to an algorithmic understanding of how humans create and perceive
artistic imagery.
more
less

Leon A. Gatys and Alexander S. Ecker and Matthias Bethge

arXiv e-Print archive - 2015 via Local arXiv

Keywords: cs.CV, cs.NE, q-bio.NC

* The paper describes a method to separate content and style from each other in an image. * The style can then be transfered to a new image. * Examples: * Let a photograph look like a painting of van Gogh. * Improve a dark beach photo by taking the style from a sunny beach photo. ### How * They use the pretrained 19-layer VGG net as their base network. * They assume that two images are provided: One with the *content*, one with the desired *style*. * They feed the content image through the VGG net and extract the activations of the last convolutional layer. These activations are called the *content representation*. * They feed the style image through the VGG net and extract the activations of all convolutional layers. They transform each layer to a *Gram Matrix* representation. These Gram Matrices are called the *style representation*. * How to calculate a *Gram Matrix*: * Take the activations of a layer. That layer will contain some convolution filters (e.g. 128), each one having its own activations. * Convert each filter's activations to a (1-dimensional) vector. * Pick all pairs of filters. Calculate the scalar product of both filter's vectors. * Add the scalar product result as an entry to a matrix of size `#filters x #filters` (e.g. 128x128). * Repeat that for every pair to get the Gram Matrix. * The Gram Matrix roughly represents the *texture* of the image. * Now you have the content representation (activations of a layer) and the style representation (Gram Matrices). * Create a new image of the size of the content image. Fill it with random white noise. * Feed that image through VGG to get its content representation and style representation. (This step will be repeated many times during the image creation.) * Make changes to the new image using gradient descent to optimize a loss function. * The loss function has two components: * The mean squared error between the new image's content representation and the previously extracted content representation. * The mean squared error between the new image's style representation and the previously extracted style representation. * Add up both components to get the total loss. * Give both components a weight to alter for more/less style matching (at the expense of content matching). ![Examples](https://raw.githubusercontent.com/aleju/papers/master/neural-nets/images/A_Neural_Algorithm_for_Artistic_Style__examples.jpg?raw=true "Examples") *One example input image with different styles added to it.* ------------------------- ### Rough chapter-wise notes * Page 1 * A painted image can be decomposed in its content and its artistic style. * Here they use a neural network to separate content and style from each other (and to apply that style to an existing image). * Page 2 * Representations get more abstract as you go deeper in networks, hence they should more resemble the actual content (as opposed to the artistic style). * They call the feature responses in higher layers *content representation*. * To capture style information, they use a method that was originally designed to capture texture information. * They somehow build a feature space on top of the existing one, that is somehow dependent on correlations of features. That leads to a "stationary" (?) and multi-scale representation of the style. * Page 3 * They use VGG as their base CNN. * Page 4 * Based on the extracted style features, they can generate a new image, which has equal activations in these style features. * The new image should match the style (texture, color, localized structures) of the artistic image. * The style features become more and more abtstract with higher layers. They call that multi-scale the *style representation*. * The key contribution of the paper is a method to separate style and content representation from each other. * These representations can then be used to change the style of an existing image (by changing it so that its content representation stays the same, but its style representation matches the artwork). * Page 6 * The generated images look most appealing if all features from the style representation are used. (The lower layers tend to reflect small features, the higher layers tend to reflect larger features.) * Content and style can't be separated perfectly. * Their loss function has two terms, one for content matching and one for style matching. * The terms can be increased/decreased to match content or style more. * Page 8 * Previous techniques work only on limited or simple domains or used non-parametric approaches (see non-photorealistic rendering). * Previously neural networks have been used to classify the time period of paintings (based on their style). * They argue that separating content from style might be useful and many other domains (other than transfering style of paintings to images). * Page 9 * The style representation is gathered by measuring correlations between activations of neurons. * They argue that this is somehow similar to what "complex cells" in the primary visual system (V1) do. * They note that deep convnets seem to automatically learn to separate content from style, probably because it is helpful for style-invariant classification. * Page 9, Methods * They use the 19 layer VGG net as their basis. * They use only its convolutional layers, not the linear ones. * They use average pooling instead of max pooling, as that produced slightly better results. * Page 10, Methods * The information about the image that is contained in layers can be visualized. To do that, extract the features of a layer as the labels, then start with a white noise image and change it via gradient descent until the generated features have minimal distance (MSE) to the extracted features. * The build a style representation by calculating Gram Matrices for each layer. * Page 11, Methods * The Gram Matrix is generated in the following way: * Convert each filter of a convolutional layer to a 1-dimensional vector. * For a pair of filters i, j calculate the value in the Gram Matrix by calculating the scalar product of the two vectors of the filters. * Do that for every pair of filters, generating a matrix of size #filters x #filters. That is the Gram Matrix. * Again, a white noise image can be changed with gradient descent to match the style of a given image (i.e. minimize MSE between two Gram Matrices). * That can be extended to match the style of several layers by measuring the MSE of the Gram Matrices of each layer and giving each layer a weighting. * Page 12, Methods * To transfer the style of a painting to an existing image, proceed as follows: * Start with a white noise image. * Optimize that image with gradient descent so that it minimizes both the content loss (relative to the image) and the style loss (relative to the painting). * Each distance (content, style) can be weighted to have more or less influence on the loss function. |

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

Sergey Ioffe and Christian Szegedy

arXiv e-Print archive - 2015 via Local arXiv

Keywords: cs.LG

**First published:** 2015/02/11 (2 years ago)

**Abstract:** Training Deep Neural Networks is complicated by the fact that the
distribution of each layer's inputs changes during training, as the parameters
of the previous layers change. This slows down the training by requiring lower
learning rates and careful parameter initialization, and makes it notoriously
hard to train models with saturating nonlinearities. We refer to this
phenomenon as internal covariate shift, and address the problem by normalizing
layer inputs. Our method draws its strength from making normalization a part of
the model architecture and performing the normalization for each training
mini-batch. Batch Normalization allows us to use much higher learning rates and
be less careful about initialization. It also acts as a regularizer, in some
cases eliminating the need for Dropout. Applied to a state-of-the-art image
classification model, Batch Normalization achieves the same accuracy with 14
times fewer training steps, and beats the original model by a significant
margin. Using an ensemble of batch-normalized networks, we improve upon the
best published result on ImageNet classification: reaching 4.9% top-5
validation error (and 4.8% test error), exceeding the accuracy of human raters.
more
less

Sergey Ioffe and Christian Szegedy

arXiv e-Print archive - 2015 via Local arXiv

Keywords: cs.LG

### What is BN: * Batch Normalization (BN) is a normalization method/layer for neural networks. * Usually inputs to neural networks are normalized to either the range of [0, 1] or [-1, 1] or to mean=0 and variance=1. The latter is called *Whitening*. * BN essentially performs Whitening to the intermediate layers of the networks. ### How its calculated: * The basic formula is $x^* = (x - E[x]) / \sqrt{\text{var}(x)}$, where $x^*$ is the new value of a single component, $E[x]$ is its mean within a batch and `var(x)` is its variance within a batch. * BN extends that formula further to $x^{**} = gamma * x^* +$ beta, where $x^{**}$ is the final normalized value. `gamma` and `beta` are learned per layer. They make sure that BN can learn the identity function, which is needed in a few cases. * For convolutions, every layer/filter/kernel is normalized on its own (linear layer: each neuron/node/component). That means that every generated value ("pixel") is treated as an example. If we have a batch size of N and the image generated by the convolution has width=P and height=Q, we would calculate the mean (E) over `N*P*Q` examples (same for the variance). ### Theoretical effects: * BN reduces *Covariate Shift*. That is the change in distribution of activation of a component. By using BN, each neuron's activation becomes (more or less) a gaussian distribution, i.e. its usually not active, sometimes a bit active, rare very active. * Covariate Shift is undesirable, because the later layers have to keep adapting to the change of the type of distribution (instead of just to new distribution parameters, e.g. new mean and variance values for gaussian distributions). * BN reduces effects of exploding and vanishing gradients, because every becomes roughly normal distributed. Without BN, low activations of one layer can lead to lower activations in the next layer, and then even lower ones in the next layer and so on. ### Practical effects: * BN reduces training times. (Because of less Covariate Shift, less exploding/vanishing gradients.) * BN reduces demand for regularization, e.g. dropout or L2 norm. (Because the means and variances are calculated over batches and therefore every normalized value depends on the current batch. I.e. the network can no longer just memorize values and their correct answers.) * BN allows higher learning rates. (Because of less danger of exploding/vanishing gradients.) * BN enables training with saturating nonlinearities in deep networks, e.g. sigmoid. (Because the normalization prevents them from getting stuck in saturating ranges, e.g. very high/low values for sigmoid.) ![MNIST and neuron activations](https://raw.githubusercontent.com/aleju/papers/master/neural-nets/images/Batch_Normalization__performance_and_activations.png?raw=true "MNIST and neuron activations") *BN applied to MNIST (a), and activations of a randomly selected neuron over time (b, c), where the middle line is the median activation, the top line is the 15th percentile and the bottom line is the 85th percentile.* ------------------------- ### Rough chapter-wise notes * (2) Towards Reducing Covariate Shift * Batch Normalization (*BN*) is a special normalization method for neural networks. * In neural networks, the inputs to each layer depend on the outputs of all previous layers. * The distributions of these outputs can change during the training. Such a change is called a *covariate shift*. * If the distributions stayed the same, it would simplify the training. Then only the parameters would have to be readjusted continuously (e.g. mean and variance for normal distributions). * If using sigmoid activations, it can happen that one unit saturates (very high/low values). That is undesired as it leads to vanishing gradients for all units below in the network. * BN fixes the means and variances of layer inputs to specific values (zero mean, unit variance). * That accomplishes: * No more covariate shift. * Fixes problems with vanishing gradients due to saturation. * Effects: * Networks learn faster. (As they don't have to adjust to covariate shift any more.) * Optimizes gradient flow in the network. (As the gradient becomes less dependent on the scale of the parameters and their initial values.) * Higher learning rates are possible. (Optimized gradient flow reduces risk of divergence.) * Saturating nonlinearities can be safely used. (Optimized gradient flow prevents the network from getting stuck in saturated modes.) * BN reduces the need for dropout. (As it has a regularizing effect.) * How BN works: * BN normalizes layer inputs to zero mean and unit variance. That is called *whitening*. * Naive method: Train on a batch. Update model parameters. Then normalize. Doesn't work: Leads to exploding biases while distribution parameters (mean, variance) don't change. * A proper method has to include the current example *and* all previous examples in the normalization step. * This leads to calculating in covariance matrix and its inverse square root. That's expensive. The authors found a faster way. * (3) Normalization via Mini-Batch Statistics * Each feature (component) is normalized individually. (Due to cost, differentiability.) * Normalization according to: `componentNormalizedValue = (componentOldValue - E[component]) / sqrt(Var(component))` * Normalizing each component can reduce the expressitivity of nonlinearities. Hence the formula is changed so that it can also learn the identiy function. * Full formula: `newValue = gamma * componentNormalizedValue + beta` (gamma and beta learned per component) * E and Var are estimated for each mini batch. * BN is fully differentiable. Formulas for gradients/backpropagation are at the end of chapter 3 (page 4, left). * (3.1) Training and Inference with Batch-Normalized Networks * During test time, E and Var of each component can be estimated using all examples or alternatively with moving averages estimated during training. * During test time, the BN formulas can be simplified to a single linear transformation. * (3.2) Batch-Normalized Convolutional Networks * Authors recommend to place BN layers after linear/fully-connected layers and before the ninlinearities. * They argue that the linear layers have a better distribution that is more likely to be similar to a gaussian. * Placing BN after the nonlinearity would also not eliminate covariate shift (for some reason). * Learning a separate bias isn't necessary as BN's formula already contains a bias-like term (beta). * For convolutions they apply BN equally to all features on a feature map. That creates effective batch sizes of m\*pq, where m is the number of examples in the batch and p q are the feature map dimensions (height, width). BN for linear layers has a batch size of m. * gamma and beta are then learned per feature map, not per single pixel. (Linear layers: Per neuron.) * (3.3) Batch Normalization enables higher learning rates * BN normalizes activations. * Result: Changes to early layers don't amplify towards the end. * BN makes it less likely to get stuck in the saturating parts of nonlinearities. * BN makes training more resilient to parameter scales. * Usually, large learning rates cannot be used as they tend to scale up parameters. Then any change to a parameter amplifies through the network and can lead to gradient explosions. * With BN gradients actually go down as parameters increase. Therefore, higher learning rates can be used. * (something about singular values and the Jacobian) * (3.4) Batch Normalization regularizes the model * Usually: Examples are seen on their own by the network. * With BN: Examples are seen in conjunction with other examples (mean, variance). * Result: Network can't easily memorize the examples any more. * Effect: BN has a regularizing effect. Dropout can be removed or decreased in strength. * (4) Experiments * (4.1) Activations over time ** They tested BN on MNIST with a 100x100x10 network. (One network with BN before each nonlinearity, another network without BN for comparison.) ** Batch Size was 60. ** The network with BN learned faster. Activations of neurons (their means and variances over several examples) seemed to be more consistent during training. ** Generalization of the BN network seemed to be better. * (4.2) ImageNet classification ** They applied BN to the Inception network. ** Batch Size was 32. ** During training they used (compared to original Inception training) a higher learning rate with more decay, no dropout, less L2, no local response normalization and less distortion/augmentation. ** They shuffle the data during training (i.e. each batch contains different examples). ** Depending on the learning rate, they either achieve the same accuracy (as in the non-BN network) in 14 times fewer steps (5x learning rate) or a higher accuracy in 5 times fewer steps (30x learning rate). ** BN enables training of Inception networks with sigmoid units (still a bit lower accuracy than ReLU). ** An ensemble of 6 Inception networks with BN achieved better accuracy than the previously best network for ImageNet. * (5) Conclusion ** BN is similar to a normalization layer suggested by GÃ¼lcehre and Bengio. However, they applied it to the outputs of nonlinearities. ** They also didn't have the beta and gamma parameters (i.e. their normalization could not learn the identity function). |

Adam: A Method for Stochastic Optimization

Kingma, Diederik P. and Ba, Jimmy

arXiv e-Print archive - 2014 via Local Bibsonomy

Keywords: dblp

Kingma, Diederik P. and Ba, Jimmy

arXiv e-Print archive - 2014 via Local Bibsonomy

Keywords: dblp

* They suggest a new stochastic optimization method, similar to the existing SGD, Adagrad or RMSProp. * Stochastic optimization methods have to find parameters that minimize/maximize a stochastic function. * A function is stochastic (non-deterministic), if the same set of parameters can generate different results. E.g. the loss of different mini-batches can differ, even when the parameters remain unchanged. Even for the same mini-batch the results can change due to e.g. dropout. * Their method tends to converge faster to optimal parameters than the existing competitors. * Their method can deal with non-stationary distributions (similar to e.g. SGD, Adadelta, RMSProp). * Their method can deal with very sparse or noisy gradients (similar to e.g. Adagrad). ### How * Basic principle * Standard SGD just updates the parameters based on `parameters = parameters - learningRate * gradient`. * Adam operates similar to that, but adds more "cleverness" to the rule. * It assumes that the gradient values have means and variances and tries to estimate these values. * Recall here that the function to optimize is stochastic, so there is some randomness in the gradients. * The mean is also called "the first moment". * The variance is also called "the second (raw) moment". * Then an update rule very similar to SGD would be `parameters = parameters - learningRate * means`. * They instead use the update rule `parameters = parameters - learningRate * means/sqrt(variances)`. * They call `means/sqrt(variances)` a 'Signal to Noise Ratio'. * Basically, if the variance of a specific parameter's gradient is high, it is pretty unclear how it should be changend. So we choose a small step size in the update rule via `learningRate * mean/sqrt(highValue)`. * If the variance is low, it is easier to predict how far to "move", so we choose a larger step size via `learningRate * mean/sqrt(lowValue)`. * Exponential moving averages * In order to approximate the mean and variance values you could simply save the last `T` gradients and then average the values. * That however is a pretty bad idea, because it can lead to high memory demands (e.g. for millions of parameters in CNNs). * A simple average also has the disadvantage, that it would completely ignore all gradients before `T` and weight all of the last `T` gradients identically. In reality, you might want to give more weight to the last couple of gradients. * Instead, they use an exponential moving average, which fixes both problems and simply updates the average at every timestep via the formula `avg = alpha * avg + (1 - alpha) * avg`. * Let the gradient at timestep (batch) `t` be `g`, then we can approximate the mean and variance values using: * `mean = beta1 * mean + (1 - beta1) * g` * `variance = beta2 * variance + (1 - beta2) * g^2`. * `beta1` and `beta2` are hyperparameters of the algorithm. Good values for them seem to be `beta1=0.9` and `beta2=0.999`. * At the start of the algorithm, `mean` and `variance` are initialized to zero-vectors. * Bias correction * Initializing the `mean` and `variance` vectors to zero is an easy and logical step, but has the disadvantage that bias is introduced. * E.g. at the first timestep, the mean of the gradient would be `mean = beta1 * 0 + (1 - beta1) * g`, with `beta1=0.9` then: `mean = 0.9 * g`. So `0.9g`, not `g`. Both the mean and the variance are biased (towards 0). * This seems pretty harmless, but it can be shown that it lowers the convergence speed of the algorithm by quite a bit. * So to fix this pretty they perform bias-corrections of the mean and the variance: * `correctedMean = mean / (1-beta1^t)` (where `t` is the timestep). * `correctedVariance = variance / (1-beta2^t)`. * Both formulas are applied at every timestep after the exponential moving averages (they do not influence the next timestep). ![Algorithm](https://raw.githubusercontent.com/aleju/papers/master/neural-nets/images/Adam__algorithm.png?raw=true "Algorithm") |

Directly Modeling Missing Data in Sequences with RNNs: Improved Classification of Clinical Time Series

Lipton, Zachary Chase and Kale, David C. and Wetzel, Randall C.

arXiv e-Print archive - 2016 via Local Bibsonomy

Keywords: dblp

Lipton, Zachary Chase and Kale, David C. and Wetzel, Randall C.

arXiv e-Print archive - 2016 via Local Bibsonomy

Keywords: dblp

#### Motivation: + Take advantage of the fact that missing values can be very informative about the label. + Sampling a time series generates many missing values. ![Sampling](https://raw.githubusercontent.com/tiagotvv/ml-papers/master/clinical-data/images/Lipton2016_motivation.png?raw=true) #### Model (indicator flag): + Indicator of occurrence of missing value. ![Indicator](https://raw.githubusercontent.com/tiagotvv/ml-papers/master/clinical-data/images/Lipton2016_indicator.png?raw=true) + An RNN can learn about missing values and their importance only by using the indicator function. The nonlinearity from this type of model helps capturing these dependencies. + If one wants to use a linear model, feature engineering is needed to overcome its limitations. + indicator for whether a variable was measured at all + mean and standard deviation of the indicator + frequency with which a variable switches from measured to missing and vice-versa. #### Architecture: + RNN with target replication following the work "Learning to Diagnose with LSTM Recurrent Neural Networks" by the same authors. ![Architecture](https://raw.githubusercontent.com/tiagotvv/ml-papers/master/clinical-data/images/Lipton2016_architecture.png?raw=true) #### Dataset: + Children's Hospital LA + Episode is a multivariate time series that describes the stay of one patient in the intensive care unit Dataset properties | Value ---------|---------- Number of episodes | 10,401 Duration of episodes | From 12h to several months Time series variables | Systolic blood pressure, Diastolic blood pressure, Peripheral capillary refill rate, End tidal CO2, Fraction of inspired O2, Glasgow coma scale, Blood glucose, Heart rate, pH, Respiratory rate, Blood O2 Saturation, Body temperature, Urine output. #### Experiments and Results: **Goal** + Predict 128 diagnoses. + Multilabel: patients can have more than one diagnose. **Methodology** + Split: 80% training, 10% validation, 10% test. + Normalized data to be in the range [0,1]. + LSTM RNN: + 2 hidden layers with 128 cells. Dropout = 0.5, L2-regularization: 1e-6 + Training for 100 epochs. Parameters chosen correspond to the time that generated the smallest error in the validation dataset. + Baselines: + Logistic Regression (L2 regularization) + MLP with 3 hidden layers and 500 hidden neurons / layer (parameters chosen via validation set) + Tested with raw-features and hand-engineered features. + Strategies for missing values: + Zeroing + Impute via forward / backfilling + Impute with zeros and use indicator function + Impute via forward / backfilling and use indicator function + Use indicator function only #### Results + Metrics: + Micro AUC, Micro F1: calculated by adding the TPs, FPs, TNs and FNs for the entire dataset and for all classes. + Macro AUC, Macro F1: Arithmetic mean of AUCs and F1 scores for each of the classes. + Precision at 10: Fraction of correct diagnostics among the top 10 predictions of the model. + The upper bound for precision at 10 is 0.2281 since in the test set there are on average 2.281 diagnoses per patient. ![Results](https://raw.githubusercontent.com/tiagotvv/ml-papers/master/clinical-data/images/Lipton2016_results.png?raw=true) #### Discussion: + Predictive model based on data collected following a given routine. This routine can change if the model is put into practice. Will the model predictions in this new routine remain valid? + Missing values in a way give an indication of the type of treatment being followed. + Trade-off between complex models operating on raw features and very complex features operating on more interpretable models. |

Generative Adversarial Nets

Goodfellow, Ian J. and Pouget-Abadie, Jean and Mirza, Mehdi and Xu, Bing and Warde-Farley, David and Ozair, Sherjil and Courville, Aaron C. and Bengio, Yoshua

Neural Information Processing Systems Conference - 2014 via Local Bibsonomy

Keywords: dblp

Goodfellow, Ian J. and Pouget-Abadie, Jean and Mirza, Mehdi and Xu, Bing and Warde-Farley, David and Ozair, Sherjil and Courville, Aaron C. and Bengio, Yoshua

Neural Information Processing Systems Conference - 2014 via Local Bibsonomy

Keywords: dblp

* GANs are based on adversarial training. * Adversarial training is a basic technique to train generative models (so here primarily models that create new images). * In an adversarial training one model (G, Generator) generates things (e.g. images). Another model (D, discriminator) sees real things (e.g. real images) as well as fake things (e.g. images from G) and has to learn how to differentiate the two. * Neural Networks are models that can be trained in an adversarial way (and are the only models discussed here). ### How * G is a simple neural net (e.g. just one fully connected hidden layer). It takes a vector as input (e.g. 100 dimensions) and produces an image as output. * D is a simple neural net (e.g. just one fully connected hidden layer). It takes an image as input and produces a quality rating as output (0-1, so sigmoid). * You need a training set of things to be generated, e.g. images of human faces. * Let the batch size be B. * G is trained the following way: * Create B vectors of 100 random values each, e.g. sampled uniformly from [-1, +1]. (Number of values per components depends on the chosen input size of G.) * Feed forward the vectors through G to create new images. * Feed forward the images through D to create ratings. * Use a cross entropy loss on these ratings. All of these (fake) images should be viewed as label=0 by D. If D gives them label=1, the error will be low (G did a good job). * Perform a backward pass of the errors through D (without training D). That generates gradients/errors per image and pixel. * Perform a backward pass of these errors through G to train G. * D is trained the following way: * Create B/2 images using G (again, B/2 random vectors, feed forward through G). * Chose B/2 images from the training set. Real images get label=1. * Merge the fake and real images to one batch. Fake images get label=0. * Feed forward the batch through D. * Measure the error using cross entropy. * Perform a backward pass with the error through D. * Train G for one batch, then D for one (or more) batches. Sometimes D can be too slow to catch up with D, then you need more iterations of D per batch of G. ### Results * Good looking images MNIST-numbers and human faces. (Grayscale, rather homogeneous datasets.) * Not so good looking images of CIFAR-10. (Color, rather heterogeneous datasets.) ![Generated Faces](https://raw.githubusercontent.com/aleju/papers/master/neural-nets/images/Generative_Adversarial_Networks__faces.jpg?raw=true "Generated Faces") *Faces generated by MLP GANs. (Rightmost column shows examples from the training set.)* ------------------------- ### Rough chapter-wise notes * Introduction * Discriminative models performed well so far, generative models not so much. * Their suggested new architecture involves a generator and a discriminator. * The generator learns to create content (e.g. images), the discriminator learns to differentiate between real content and generated content. * Analogy: Generator produces counterfeit art, discriminator's job is to judge whether a piece of art is a counterfeit. * This principle could be used with many techniques, but they use neural nets (MLPs) for both the generator as well as the discriminator. * Adversarial Nets * They have a Generator G (simple neural net) * G takes a random vector as input (e.g. vector of 100 random values between -1 and +1). * G creates an image as output. * They have a Discriminator D (simple neural net) * D takes an image as input (can be real or generated by G). * D creates a rating as output (quality, i.e. a value between 0 and 1, where 0 means "probably fake"). * Outputs from G are fed into D. The result can then be backpropagated through D and then G. G is trained to maximize log(D(image)), so to create a high value of D(image). * D is trained to produce only 1s for images from G. * Both are trained simultaneously, i.e. one batch for G, then one batch for D, then one batch for G... * D can also be trained multiple times in a row. That allows it to catch up with G. * Theoretical Results * Let * pd(x): Probability that image `x` appears in the training set. * pg(x): Probability that image `x` appears in the images generated by G. * If G is now fixed then the best possible D classifies according to: `D(x) = pd(x) / (pd(x) + pg(x))` * It is proofable that there is only one global optimum for GANs, which is reached when G perfectly replicates the training set probability distribution. (Assuming unlimited capacity of the models and unlimited training time.) * It is proofable that G and D will converge to the global optimum, so long as D gets enough steps per training iteration to model the distribution generated by G. (Again, assuming unlimited capacity/time.) * Note that these things are proofed for the general principle for GANs. Implementing GANs with neural nets can then introduce problems typical for neural nets (e.g. getting stuck in saddle points). * Experiments * They tested on MNIST, Toronto Face Database (TFD) and CIFAR-10. * They used MLPs for G and D. * G contained ReLUs and Sigmoids. * D contained Maxouts. * D had Dropout, G didn't. * They use a Parzen Window Estimate aka KDE (sigma obtained via cross validation) to estimate the quality of their images. * They note that KDE is not really a great technique for such high dimensional spaces, but its the only one known. * Results on MNIST and TDF are great. (Note: both grayscale) * CIFAR-10 seems to match more the texture but not really the structure. * Noise is noticeable in CIFAR-10 (a bit in TFD too). Comes from MLPs (no convolutions). * Their KDE score for MNIST and TFD is competitive or better than other approaches. * Advantages and Disadvantages * Advantages * No Markov Chains, only backprob * Inference-free training * Wide variety of functions can be incorporated into the model (?) * Generator never sees any real example. It only gets gradients. (Prevents overfitting?) * Can represent a wide variety of distributions, including sharp ones (Markov chains only work with blurry images). * Disadvantages * No explicit representation of the distribution modeled by G (?) * D and G must be well synchronized during training * If G is trained to much (i.e. D can't catch up), it can collapse many components of the random input vectors to the same output ("Helvetica") |

Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks

Salimans, Tim and Kingma, Diederik P.

Neural Information Processing Systems Conference - 2016 via Local Bibsonomy

Keywords: dblp

Salimans, Tim and Kingma, Diederik P.

Neural Information Processing Systems Conference - 2016 via Local Bibsonomy

Keywords: dblp

* Weight Normalization (WN) is a normalization technique, similar to Batch Normalization (BN). * It normalizes each layer's weights. ### Differences to BN * WN normalizes based on each weight vector's orientation and magnitude. BN normalizes based on each weight's mean and variance in a batch. * WN works on each example on its own. BN works on whole batches. * WN is more deterministic than BN (due to not working an batches). * WN is better suited for noisy environment (RNNs, LSTMs, reinforcement learning, generative models). (Due to being more deterministic.) * WN is computationally simpler than BN. ### How its done * WN is a module added on top of a linear or convolutional layer. * If that layer's weights are `w` then WN learns two parameters `g` (scalar) and `v` (vector, identical dimension to `w`) so that `w = gv / ||v||` is fullfilled (`||v||` = euclidean norm of v). * `g` is the magnitude of the weights, `v` are their orientation. * `v` is initialized to zero mean and a standard deviation of 0.05. * For networks without recursions (i.e. not RNN/LSTM/GRU): * Right after initialization, they feed a single batch through the network. * For each neuron/weight, they calculate the mean and standard deviation after the WN layer. * They then adjust the bias to `-mean/stdDev` and `g` to `1/stdDev`. * That makes the network start with each feature being roughly zero-mean and unit-variance. * The same method can also be applied to networks without WN. ### Results: * They define BN-MEAN as a variant of BN which only normalizes to zero-mean (not unit-variance). * CIFAR-10 image classification (no data augmentation, some dropout, some white noise): * WN, BN, BN-MEAN all learn similarly fast. Network without normalization learns slower, but catches up towards the end. * BN learns "more" per example, but is about 16% slower (time-wise) than WN. * WN reaches about same test error as no normalization (both ~8.4%), BN achieves better results (~8.0%). * WN + BN-MEAN achieves best results with 7.31%. * Optimizer: Adam * Convolutional VAE on MNIST and CIFAR-10: * WN learns more per example und plateaus at better values than network without normalization. (BN was not tested.) * Optimizer: Adamax * DRAW on MNIST (heavy on LSTMs): * WN learns significantly more example than network without normalization. * Also ends up with better results. (Normal network might catch up though if run longer.) * Deep Reinforcement Learning (Space Invaders): * WN seemed to overall acquire a bit more reward per epoch than network without normalization. Variance (in acquired reward) however also grew. * Results not as clear as in DRAW. * Optimizer: Adamax ### Extensions * They argue that initializing `g` to `exp(cs)` (`c` constant, `s` learned) might be better, but they didn't get better test results with that. * Due to some gradient effects, `||v||` currently grows monotonically with every weight update. (Not necessarily when using optimizers that use separate learning rates per parameters.) * That grow effect leads the network to be more robust to different learning rates. * Setting a small hard limit/constraint for `||v||` can lead to better test set performance (parameter updates are larger, introducing more noise). ![CIFAR-10 results](https://raw.githubusercontent.com/aleju/papers/master/neural-nets/images/Weight_Normalization__cifar10.png?raw=true "CIFAR-10 results") *Performance of WN on CIFAR-10 compared to BN, BN-MEAN and no normalization.* ![DRAW, DQN results](https://raw.githubusercontent.com/aleju/papers/master/neural-nets/images/Weight_Normalization__draw_dqn.png?raw=true "DRAW, DQN results") *Performance of WN for DRAW (left) and deep reinforcement learning (right).* |

Joint Training of a Convolutional Network and a Graphical Model for Human Pose Estimation

Tompson, Jonathan J. and Jain, Arjun and LeCun, Yann and Bregler, Christoph

Neural Information Processing Systems Conference - 2014 via Local Bibsonomy

Keywords: dblp

Tompson, Jonathan J. and Jain, Arjun and LeCun, Yann and Bregler, Christoph

Neural Information Processing Systems Conference - 2014 via Local Bibsonomy

Keywords: dblp

* They describe a model for human pose estimation, i.e. one that finds the joints ("skeleton") of a person in an image. * They argue that part of their model resembles a Markov Random Field (but in reality its implemented as just one big neural network). ### How * They have two components in their network: * Part-Detector: * Finds candidate locations for human joints in an image. * Pretty standard ConvNet. A few convolutional layers with pooling and ReLUs. * They use two branches: A fine and a coarse one. Both branches have practically the same architecture (convolutions, pooling etc.). The coarse one however receives the image downscaled by a factor of 2 (half width/height) and upscales it by a factor of 2 at the end of the branch. * At the end they merge the results of both branches with more convolutions. * The output of this model are 4 heatmaps (one per joint? unclear), each having lower resolution than the original image. * Spatial-Model: * Takes the results of the part detector and tries to remove all detections that were false positives. * They derive their architecture from a fully connected Markov Random Field which would be solved with one step of belief propagation. * They use large convolutions (128x128) to resemble the "fully connected" part. * They initialize the weights of the convolutions with joint positions gathered from the training set. * The convolutions are followed by log(), element-wise additions and exp() to resemble an energy function. * The end result are the input heatmaps, but cleaned up. ### Results * Beats all previous models (with and without spatial model). * Accuracy seems to be around 90% (with enough (16px) tolerance in pixel distance from ground truth). * Adding the spatial model adds a few percentage points of accuracy. * Using two branches instead of one (in the part detector) adds a bit of accuracy. Adding a third branch adds a tiny bit more. ![Results](https://raw.githubusercontent.com/aleju/papers/master/neural-nets/images/Joint_Training_of_a_ConvNet_and_a_PGM_for_HPE__results.png?raw=true "Results") *Example results.* ![Part Detector](https://raw.githubusercontent.com/aleju/papers/master/neural-nets/images/Joint_Training_of_a_ConvNet_and_a_PGM_for_HPE__part_detector.png?raw=true "Part Detector") *Part Detector network.* ![Spatial Model](https://raw.githubusercontent.com/aleju/papers/master/neural-nets/images/Joint_Training_of_a_ConvNet_and_a_PGM_for_HPE__spatial_model.png?raw=true "Spatial Model") *Spatial Model (apparently only for two input heatmaps).* ------------------------- # Rough chapter-wise notes * (1) Introduction * Human Pose Estimation (HPE) from RGB images is difficult due to the high dimensionality of the input. * Approaches: * Deformable-part models: Traditionally based on hand-crafted features. * Deep-learning based disciminative models: Recently outperformed other models. However, it is hard to incorporate priors (e.g. possible joint- inter-connectivity) into the model. * They combine: * A part-detector (ConvNet, utilizes multi-resolution feature representation with overlapping receptive fields) * Part-based Spatial-Model (approximates loopy belief propagation) * They backpropagate through the spatial model and then the part-detector. * (3) Model * (3.1) Convolutional Network Part-Detector * This model locates possible positions of human key joints in the image ("part detector"). * Input: RGB image. * Output: 4 heatmaps, one per key joint (per pixel: likelihood). * They use a fully convolutional network. * They argue that applying convolutions to every pixel is similar to moving a sliding window over the image. * They use two receptive field sizes for their "sliding window": A large but coarse/blurry one, a small but fine one. * To implement that, they use two branches. Both branches are mostly identical (convolutions, poolings, ReLU). They simply feed a downscaled (half width/height) version of the input image into the coarser branch. At the end they upscale the coarser branch once and then merge both branches. * After the merge they apply 9x9 convolutions and then 1x1 convolutions to get it down to 4xHxW (H=60, W=90 where expected input was H=320, W=240). * (3.2) Higher-level Spatial-Model * This model takes the detected joint positions (heatmaps) and tries to remove those that are probably false positives. * It is a ConvNet, which tries to emulate (1) a Markov Random Field and (2) solving that MRF approximately via one step of belief propagation. * The raw MRF formula would be something like `<likelihood of joint A per px> = normalize( <product over joint v from joints V> <probability of joint A per px given a> * <probability of joint v at px?> + someBiasTerm)`. * They treat the probabilities as energies and remove from the formula the partition function (`normalize`) for various reasons (e.g. because they are only interested in the maximum value anyways). * They use exp() in combination with log() to replace the product with a sum. * They apply SoftPlus and ReLU so that the energies are always positive (and therefore play well with log). * Apparently `<probability of joint v at px?>` are the input heatmaps of the part detector. * Apparently `<probability of joint A per px given a>` is implemented as the weights of a convolution. * Apparently `someBiasTerm` is implemented as the bias of a convolution. * The convolutions that they use are large (128x128) to emulate a fully connected graph. * They initialize the convolution weights based on histograms gathered from the dataset (empirical distribution of joint displacements). * (3.3) Unified Models * They combine the part-based model and the spatial model to a single one. * They first train only the part-based model, then only the spatial model, then both. * (4) Results * Used datasets: FLIC (4k training images, 1k test, mostly front-facing and standing poses), FLIC-plus (17k, 1k ?), extended-LSP (10k, 1k). * FLIC contains images showing multiple persons with only one being annotated. So for FLIC they add a heatmap of the annotated body torso to the input (i.e. the part-detector does not have to search for the person any more). * The evaluation metric roughly measures, how often predicted joint positions are within a certain radius of the true joint positions. * Their model performs significantly better than competing models (on both FLIC and LSP). * Accuracy seems to be at around 80%-95% per joint (when choosing high enough evaluation tolerance, i.e. 10px+). * Adding the spatial model to the part detector increases the accuracy by around 10-15 percentage points. * Training the part detector and the spatial model jointly adds ~3 percentage points accuracy over training them separately. * Adding the second filter bank (coarser branch in the part detector) adds around 5 percentage points accuracy. Adding a third filter bank adds a tiny bit more accuracy. |

Accurate Image Super-Resolution Using Very Deep Convolutional Networks

Kim, Jiwon and Lee, Jung Kwon and Lee, Kyoung Mu

Conference and Computer Vision and Pattern Recognition - 2016 via Local Bibsonomy

Keywords: dblp

Kim, Jiwon and Lee, Jung Kwon and Lee, Kyoung Mu

Conference and Computer Vision and Pattern Recognition - 2016 via Local Bibsonomy

Keywords: dblp

* They describe a model that upscales low resolution images to their high resolution equivalents ("Single Image Super Resolution"). * Their model uses a deeper architecture than previous models and has a residual component. ### How * Their model is a fully convolutional neural network. * Input of the model: The image to upscale, *already upscaled to the desired size* (but still blurry). * Output of the model: The upscaled image (without the blurriness). * They use 20 layers of padded 3x3 convolutions with size 64xHxW with ReLU activations. (No pooling.) * They have a residual component, i.e. the model only learns and outputs the *change* that has to be applied/added to the blurry input image (instead of outputting the full image). That change is applied to the blurry input image before using the loss function on it. (Note that this is a bit different from the currently used "residual learning".) * They use a MSE between the "correct" upscaling and the generated upscaled image (input image + residual). * They use SGD starting with a learning rate of 0.1 and decay it 3 times by a factor of 10. * They use weight decay of 0.0001. * During training they use a special gradient clipping adapted to the learning rate. Usually gradient clipping restricts the gradient values to `[-t, t]` (`t` is a hyperparameter). Their gradient clipping restricts the values to `[-t/lr, t/lr]` (where `lr` is the learning rate). * They argue that their special gradient clipping allows the use of significantly higher learning rates. * They train their model on multiple scales, e.g. 2x, 3x, 4x upscaling. (Not really clear how. They probably feed their upscaled image again into the network or something like that?) ### Results * Higher accuracy upscaling than all previous methods. * Can handle well upscaling factors above 2x. * Residual network learns significantly faster than non-residual network. ![Architecture](https://raw.githubusercontent.com/aleju/papers/master/neural-nets/images/Accurate_Image_Super-Resolution__architecture.png?raw=true "Architecture") *Architecture of the model.* ![Examples](https://raw.githubusercontent.com/aleju/papers/master/neural-nets/images/Accurate_Image_Super-Resolution__examples.png?raw=true "Examples") *Super-resolution quality of their model (top, bottom is a competing model).* |

Combining Markov Random Fields and Convolutional Neural Networks for Image Synthesis

Li, Chuan and Wand, Michael

Conference and Computer Vision and Pattern Recognition - 2016 via Local Bibsonomy

Keywords: dblp

Li, Chuan and Wand, Michael

Conference and Computer Vision and Pattern Recognition - 2016 via Local Bibsonomy

Keywords: dblp

* They describe a method that applies the style of a source image to a target image. * Example: Let a normal photo look like a van Gogh painting. * Example: Let a normal car look more like a specific luxury car. * Their method builds upon the well known artistic style paper and uses a new MRF prior. * The prior leads to locally more plausible patterns (e.g. less artifacts). ### How * They reuse the content loss from the artistic style paper. * The content loss was calculated by feed the source and target image through a network (here: VGG19) and then estimating the squared error of the euclidean distance between one or more hidden layer activations. * They use layer `relu4_2` for the distance measurement. * They replace the original style loss with a MRF based style loss. * Step 1: Extract from the source image `k x k` sized overlapping patches. * Step 2: Perform step (1) analogously for the target image. * Step 3: Feed the source image patches through a pretrained network (here: VGG19) and select the representations `r_s` from specific hidden layers (here: `relu3_1`, `relu4_1`). * Step 4: Perform step (3) analogously for the target image. (Result: `r_t`) * Step 5: For each patch of `r_s` find the best matching patch in `r_t` (based on normalized cross correlation). * Step 6: Calculate the sum of squared errors (based on euclidean distances) of each patch in `r_s` and its best match (according to step 5). * They add a regularizer loss. * The loss encourages smooth transitions in the synthesized image (i.e. few edges, corners). * It is based on the raw pixel values of the last synthesized image. * For each pixel in the synthesized image, they calculate the squared x-gradient and the squared y-gradient and then add both. * They use the sum of all those values as their loss (i.e. `regularizer loss = <sum over all pixels> x-gradient^2 + y-gradient^2`). * Their whole optimization problem is then roughly `image = argmin_image MRF-style-loss + alpha1 * content-loss + alpha2 * regularizer-loss`. * In practice, they start their synthesis with a low resolution image and then progressively increase the resolution (each time performing some iterations of optimization). * In practice, they sample patches from the style image under several different rotations and scalings. ### Results * In comparison to the original artistic style paper: * Less artifacts. * Their method tends to preserve style better, but content worse. * Can handle photorealistic style transfer better, so long as the images are similar enough. If no good matches between patches can be found, their method performs worse. ![Non-photorealistic example images](https://raw.githubusercontent.com/aleju/papers/master/neural-nets/images/Combining_MRFs_and_CNNs_for_Image_Synthesis__examples.png?raw=true "Non-photorealistic example images") *Non-photorealistic example images. Their method vs. the one from the original artistic style paper.* ![Photorealistic example images](https://raw.githubusercontent.com/aleju/papers/master/neural-nets/images/Combining_MRFs_and_CNNs_for_Image_Synthesis__examples_real.png?raw=true "Photorealistic example images") *Photorealistic example images. Their method vs. the one from the original artistic style paper.* |

Multi-Scale Context Aggregation by Dilated Convolutions

Yu, Fisher and Koltun, Vladlen

arXiv e-Print archive - 2015 via Local Bibsonomy

Keywords: dblp

Yu, Fisher and Koltun, Vladlen

arXiv e-Print archive - 2015 via Local Bibsonomy

Keywords: dblp

* They describe a variation of convolutions that have a differently structured receptive field. * They argue that their variation works better for dense prediction, i.e. for predicting values for every pixel in an image (e.g. coloring, segmentation, upscaling). ### How * One can image the input into a convolutional layer as a 3d-grid. Each cell is a "pixel" generated by a filter. * Normal convolutions compute their output per cell as a weighted sum of the input cells in a dense area. I.e. all input cells are right next to each other. * In dilated convolutions, the cells are not right next to each other. E.g. 2-dilated convolutions skip 1 cell between each input cell, 3-dilated convolutions skip 2 cells etc. (Similar to striding.) * Normal convolutions are simply 1-dilated convolutions (skipping 0 cells). * One can use a 1-dilated convolution and then a 2-dilated convolution. The receptive field of the second convolution will then be 7x7 instead of the usual 5x5 due to the spacing. * Increasing the dilation factor by 2 per layer (1, 2, 4, 8, ...) leads to an exponential increase in the receptive field size, while every cell in the receptive field will still be part in the computation of at least one convolution. * They had problems with badly performing networks, which they fixed using an identity initialization for the weights. (Sounds like just using resdiual connections would have been easier.) ![Receptive field](https://raw.githubusercontent.com/aleju/papers/master/neural-nets/images/Multi-Scale_Context_Aggregation_by_Dilated_Convolutions__receptive.png?raw=true "Receptive field") *Receptive fields of a 1-dilated convolution (1st image), followed by a 2-dilated conv. (2nd image), followed by a 4-dilated conv. (3rd image). The blue color indicates the receptive field size (notice the exponential increase in size). Stronger blue colors mean that the value has been used in more different convolutions.* ### Results * They took a VGG net, removed the pooling layers and replaced the convolutions with dilated ones (weights can be kept). * They then used the network to segment images. * Their results were significantly better than previous methods. * They also added another network with more dilated convolutions in front of the VGG one, again improving the results. ![Segmentation performance](https://raw.githubusercontent.com/aleju/papers/master/neural-nets/images/Multi-Scale_Context_Aggregation_by_Dilated_Convolutions__segmentation.png?raw=true "Segmentation performance") *Their performance on a segmentation task compared to two competing methods. They only used VGG16 without pooling layers and with convolutions replaced by dilated convolutions.* |

About