Welcome to ShortScience.org! 
[link]
* They describe a method that applies the style of a source image to a target image. * Example: Let a normal photo look like a van Gogh painting. * Example: Let a normal car look more like a specific luxury car. * Their method builds upon the well known artistic style paper and uses a new MRF prior. * The prior leads to locally more plausible patterns (e.g. less artifacts). ### How * They reuse the content loss from the artistic style paper. * The content loss was calculated by feed the source and target image through a network (here: VGG19) and then estimating the squared error of the euclidean distance between one or more hidden layer activations. * They use layer `relu4_2` for the distance measurement. * They replace the original style loss with a MRF based style loss. * Step 1: Extract from the source image `k x k` sized overlapping patches. * Step 2: Perform step (1) analogously for the target image. * Step 3: Feed the source image patches through a pretrained network (here: VGG19) and select the representations `r_s` from specific hidden layers (here: `relu3_1`, `relu4_1`). * Step 4: Perform step (3) analogously for the target image. (Result: `r_t`) * Step 5: For each patch of `r_s` find the best matching patch in `r_t` (based on normalized cross correlation). * Step 6: Calculate the sum of squared errors (based on euclidean distances) of each patch in `r_s` and its best match (according to step 5). * They add a regularizer loss. * The loss encourages smooth transitions in the synthesized image (i.e. few edges, corners). * It is based on the raw pixel values of the last synthesized image. * For each pixel in the synthesized image, they calculate the squared xgradient and the squared ygradient and then add both. * They use the sum of all those values as their loss (i.e. `regularizer loss = <sum over all pixels> xgradient^2 + ygradient^2`). * Their whole optimization problem is then roughly `image = argmin_image MRFstyleloss + alpha1 * contentloss + alpha2 * regularizerloss`. * In practice, they start their synthesis with a low resolution image and then progressively increase the resolution (each time performing some iterations of optimization). * In practice, they sample patches from the style image under several different rotations and scalings. ### Results * In comparison to the original artistic style paper: * Less artifacts. * Their method tends to preserve style better, but content worse. * Can handle photorealistic style transfer better, so long as the images are similar enough. If no good matches between patches can be found, their method performs worse. ![Nonphotorealistic example images](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/Combining_MRFs_and_CNNs_for_Image_Synthesis__examples.png?raw=true "Nonphotorealistic example images") *Nonphotorealistic example images. Their method vs. the one from the original artistic style paper.* ![Photorealistic example images](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/Combining_MRFs_and_CNNs_for_Image_Synthesis__examples_real.png?raw=true "Photorealistic example images") *Photorealistic example images. Their method vs. the one from the original artistic style paper.* 
[link]
FaceNet directly maps face images to $\mathbb{R}^{128}$ where distances directly correspond to a measure of face similarity. They use a triplet loss function. The triplet is (face of person A, other face of person A, face of person which is not A). Later, this is called (anchor, positive, negative). The loss function is learned and inspired by LMNN. The idea is to minimize the distance between the two images of the same person and maximize the distance to the other persons image. ## LMNN Large Margin Nearest Neighbor (LMNN) is learning a pseudometric $$d(x, y) = (x y) M (x y)^T$$ where $M$ is a positivedefinite matrix. The only difference between a pseudometric and a metric is that $d(x, y) = 0 \Leftrightarrow x = y$ does not hold. ## Curriculum Learning: Triplet selection Show simple examples first, then increase the difficulty. This is done by selecting the triplets. They use the triplets which are *hard*. For the positive example, this means the distance between the anchor and the positive example is high. For the negative example this means the distance between the anchor and the negative example is low. They want to have $$f(x_i^a)  f(x_i^p)_2^2 + \alpha < f(x_i^a)  f(x_i^n)_2^2$$ where $\alpha$ is a margin and $x_i^a$ is the anchor, $x_i^p$ is the positive face example and $x_i^n$ is the negative example. They increase $\alpha$ over time. It is crucial that $f$ maps the images not in the complete $\mathbb{R}^{128}$, but on the unit sphere. Otherwise one could double $\alpha$ by simply making $f' = 2 \cdot f$. ## Tasks * **Face verification**: Is this the same person? * **Face recognition**: Who is this person? ## Datasets * 99.63% accuracy on Labeled FAces in the Wild (LFW) * 95.12% accuracy on YouTube Faces DB ## Network Two models are evaluated: The [Zeiler & Fergus model](http://www.shortscience.org/paper?bibtexKey=journals/corr/ZeilerF13) and an architecture based on the [Inception model](http://www.shortscience.org/paper?bibtexKey=journals/corr/SzegedyLJSRAEVR14). ## See also * [DeepFace](http://www.shortscience.org/paper?bibtexKey=conf/cvpr/TaigmanYRW14#martinthoma) 
[link]
The last two years have seen a number of improvements in the field of language model pretraining, and BERT  Bidirectional Encoder Representations from Transformers  is the most recent entry into this canon. The general problem posed by language model pretraining is: can we leverage huge amounts of raw text, which aren’t labeled for any specific classification task, to help us train better models for supervised language tasks (like translation, question answering, logical entailment, etc)? Mechanically, this works by either 1) training word embeddings and then using those embeddings as input feature representations for supervised models, or 2) treating the problem as a transfer learning problem, and finetune to a supervised task  similar to how you’d finetune a model trained on ImageNet by carrying over parameters, and then training on your new task. Even though the text we’re learning on is strictly speaking unsupervised (lacking a supervised label), we need to design a task on which we calculate gradients in order to train our representations. For unsupervised language modeling, that task is typically structured as predicting a word in a sequence given prior words in that sequence. Intuitively, in order for a model to do a good job at predicting the word that comes next in a sentence, it needs to have learned patterns about language, both on grammatical and semantic levels. A notable change recently has been the shift from learning unconditional word vectors (where the word’s representation is the same globally) to contextualized ones, where the representation of the word is dependent on the sentence context it’s found in. All the baselines discussed here are of this second type. The two main baselines that the BERT model compares itself to are OpenAI’s GPT, and Peters et al’s ELMo. The GPT model uses a selfattentionbased Transformer architecture, going through each word in the sequence, and predicting the next word by calculating an attentionweighted representation of all prior words. (For those who aren’t familiar, attention works by multiplying a “query” vector with every word in a variablelength sequence, and then putting the outputs of those multiplications into a softmax operator, which inherently gets you a weighting scheme that adds to one). ELMo uses models that gather context in both directions, but in a fairly simple way: it learns one deep LSTM that goes from left to right, predicting word k using words 0k1, and a second LSTM that goes from right to left, predicting word k using words k+1 onward. These two predictions are combined (literally: just summed together) to get a representation for the word at position k. https://i.imgur.com/2329e3L.png BERT differs from prior work in this area in several small ways, but one primary one: instead of representing a word using only information from words before it, or a simple sum of prior information and subsequent information, it uses the full context from before and after the word in each of its multiple layers. It also uses an attentionbased Transformer structure, but instead of incorporating just prior context, it pulls in information from the full sentence. To allow for a model that actually uses both directions of context at a time in its unsupervised prediction task, the authors of BERT slightly changed the nature of that task: it replaces the word being predicted with the “mask” token, so that even with multiple layers of context aggregation on both sides, the model doesn’t have any way of knowing what the token is. By contrast, if it weren’t masked, after the first layer of context aggregation, the representations of other words in the sequence would incorporate information about the predicted word k, making it trivial, if another layer were applied on top of that first one, for the model to directly have access to the value it’s trying to predict. This problem can either be solved by using multiple layers, each of which can only see prior context (like GPT), by learning fully separate LR and RL models, and combining them at the final layer (like ELMo) or by masking tokens, and predicting the value of the masked tokens using the full remainder of the context. This task design crucially allows for a multilayered bidirectional architecture, and consequently a much richer representation of context in each word’s pretrained representation. BERT demonstrates dramatic improvements over prior work when fine tuned on a small amount of supervised data, suggesting that this change added substantial value. 
[link]
**Dropout for layers** sums it up pretty well. The authors built on the idea of [deep residual networks](http://arxiv.org/abs/1512.03385) to use identity functions to skip layers. The main advantages: * Training speedups by about 25% * Huge networks without overfitting ## Evaluation * [CIFAR10](https://www.cs.toronto.edu/~kriz/cifar.html): 4.91% error ([SotA](https://martinthoma.com/sota/#imageclassification): 2.72 %) Training Time: ~15h * [CIFAR100](https://www.cs.toronto.edu/~kriz/cifar.html): 24.58% ([SotA](https://martinthoma.com/sota/#imageclassification): 17.18 %) Training time: < 16h * [SVHN](http://ufldl.stanford.edu/housenumbers/): 1.75% ([SotA](https://martinthoma.com/sota/#imageclassification): 1.59 %)  trained for 50 epochs, begging with a LR of 0.1, divided by 10 after 30 epochs and 35. Training time: < 26h 
[link]
This method is based on improving the speed of RCNN \cite{conf/cvpr/GirshickDDM14} 1. Where RCNN would have two different objective functions, Fast RCNN combines localization and classification losses into a "multitask loss" in order to speed up training. 2. It also uses a pooling method based on \cite{journals/pami/HeZR015} called the RoI pooling layer that scales the input so the images don't have to be scaled before being set an an input image to the CNN. "RoI max pooling works by dividing the $h \times w$ RoI window into an $H \times W$ grid of subwindows of approximate size $h/H \times w/W$ and then maxpooling the values in each subwindow into the corresponding output grid cell." 3. Backprop is performed for the RoI pooling layer by taking the argmax of the incoming gradients that overlap the incoming values. This method is further improved by the paper "Faster RCNN" \cite{conf/nips/RenHGS15} 
[link]
The paper proposes a method to perform joint instance and semantic segmentation. The method is fast as it is meant to run in an embedded environment (such as a robot). While the semantic map may seem redundant given the instance one, it is not as semantic segmentation is a key part of obtaining the instance map. # Architecture ![image](https://userimages.githubusercontent.com/8659132/6318795924cdb380c02e11e9912177e0923e91c6.png) The image is first put through a typical CNN encoder (specifically a ResNet derivative), followed by 3 separate decoders. The output of the decoder is at a low resolution for faster processing. Decoders:  Semantic segmentation: coupled with the encoder, it's UNetlike. The output is a segmentation map.  Instance center: for each pixel, outputs the confidence that it is the center of an object.  Embedding: for each pixel, computes a 32 dimensional embedding. This embedding must have a low distance to embedding of other pixels of the same instance, and high distance to embedding of other pixels. To obtain the instance map, the segmentation map is used to mask the other 2 decoder outputs to separate the embeddings and centers of each class. Centers are thresholded at 0.7, and centers with embedding distances lower than a set amount are discarded, as they are considered duplicates. Then for each class, a similarity matrix is computed between all pixels from that class and centers from that class. Pixels are assigned to their closest centers, which represent different instances of the class. Finally, the segmentation and instance maps are upsampled using the SLIC algorithm. # Loss There is one loss for each decoder head.  Semantic segmentation: weighted crossentropy  Instance center: crossentropy term modulated by a $\gamma$ parameter to counter the overrepresentation of the background over the target classes. ![image](https://userimages.githubusercontent.com/8659132/6328648522659680c28611e99134f1b823a34217.png)  Embedding: composed of 3 parts, an attracting force between embeddings of the same instance, a repelling force between embeddings of different instances, and a l2 regularization on the embedding. ![image](https://userimages.githubusercontent.com/8659132/63286399f1856180c28511e99136feb6c4a555e5.png) ![image](https://userimages.githubusercontent.com/8659132/63286411fcd88d00c28511e9939f0771579d8263.png) $\hat{e}$ are the embeddings, $\delta_a$ is an hyperparameter defining "close enough", and $\delta_b$ defines "far enough" The whole model is trained jointly using a weighted sum of the 3 losses. # Experiments and results The authors test their method on the Cityscape dataset, which is composed of 5000 annotated images and 8 instance classes. They compare their methods both for semantic segmentation and instance segmentation. ![image](https://userimages.githubusercontent.com/8659132/63287573a882dc80c28811e983e0b352e43bdf28.png) For semantic segmentation, their method is ok, though ENet for example performs better on average and is much faster. ![image](https://userimages.githubusercontent.com/8659132/63287643d700b780c28811e99d405bcaf695a744.png) On the other hand, for instance segmentation, their method is much faster than the other while still performing well. Not SOTA on performance, but considering the realtime constraint, it's much better. # Comments  Most instance segmentation methods tend to be sluggish and overly complicated. This approach is much more elegant in my opinion.  If they removed the aggressive down/up sampling, I wonder if they would beat MaskRCNN and PANet.  I'm not sure what's the point of upsampling the semantic map given that we already have the instance map. 
[link]
Zhang et al. propose CROWN, a method for certifying adversarial robustness based on bounding activations functions using linear functions. Informally, the main result can be stated as follows: if the activation functions used in a deep neural network can be bounded above and below by linear functions (the activation function may also be segmented first), the network output can also be bounded by linear functions. These linear functions can be computed explicitly, as stated in the paper. Then, given an input example $x$ and a set of allowed perturbations, usually constrained to a $L_p$ norm, these bounds can be used to obtain a lower bound on the robustness of networks. Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/). 
[link]
Mask RCNN takes off from where Faster RCNN left, with some augmentations aimed at bettering instance segmentation (which was out of scope for FRCNN). Instance segmentation was achieved remarkably well in *DeepMask* , *SharpMask* and later *Feature Pyramid Networks* (FPN). Faster RCNN was not designed for pixeltopixel alignment between network inputs and outputs. This is most evident in how RoIPool , the de facto core operation for attending to instances, performs coarse spatial quantization for feature extraction. Mask RCNN fixes that by introducing RoIAlign in place of RoIPool. #### Methodology Mask RCNN retains most of the architecture of Faster RCNN. It adds the a third branch for segmentation. The third branch takes the output from RoIAlign layer and predicts binary class masks for each class. ##### Major Changes and intutions **Mask prediction** Mask prediction segmentation predicts a binary mask for each RoI using fully convolution  and the stark difference being usage of *sigmoid* activation for predicting final mask instead of *softmax*, implies masks don't compete with each other. This *decouples* segmentation from classification. The class prediction branch is used for class prediction and for calculating loss, the mask of predicted loss is used calculating Lmask. Also, they show that a single class agnostic mask prediction works almost as effective as separate mask for each class, thereby supporting their method of decoupling classification from segmentation **RoIAlign** RoIPool first quantizes a floatingnumber RoI to the discrete granularity of the feature map, this quantized RoI is then subdivided into spatial bins which are themselves quantized, and finally feature values covered by each bin are aggregated (usually by max pooling). Instead of quantization of the RoI boundaries or bin bilinear interpolation is used to compute the exact values of the input features at four regularly sampled locations in each RoI bin, and aggregate the result (using max or average). **Backbone architecture** Faster RCNN uses a VGG like structure for extracting features from image, weights of which were shared among RPN and region detection layers. Herein, authors experiment with 2 backbone architectures  ResNet based VGG like in FRCNN and ResNet based [FPN](http://www.shortscience.org/paper?bibtexKey=journals/corr/LinDGHHB16) based. FPN uses convolution feature maps from previous layers and recombining them to produce pyramid of feature maps to be used for prediction instead of singlescale feature layer (final output of conv layer before connecting to fc layers was used in Faster RCNN) **Training Objective** The training objective looks like this ![](https://i.imgur.com/snUq73Q.png) Lmask is the addition from Faster RCNN. The method to calculate was mentioned above #### Observation Mask RCNN performs significantly better than COCO instance segmentation winners *without any bells and whiskers*. Detailed results are available in the paper 
[link]
At NIPS 2017, Ali Rahimi was invited on stage to give a keynote after a paper he was on received the “Test of Time” award. While there, in front of several thousand researchers, he gave an impassioned argument for more rigor: more small problems to validate our assumptions, more visibility into why our optimization algorithms work the way they do. The nowfamous catchphrase of the talk was “alchemy”; he argued that the machine learning community has been effective at finding things that work, but less effective at understanding why the techniques we use work. A central example he used in his talk is that of Batch Normalization: a now nearlyuniversal step in optimizing deep nets, but one where our accepted explanation of “reducing internal covariate shift” is less rigorous than one might hope. With apologies for the long preamble, this is the context in which today’s paper is such a welcome push in the direction of what Rahimi was advocating for  small, focused experimentation that tries to build up knowledge from principles, and, specifically, asks the question: “Does Batch Norm really work via reducing covariate shift”. To answer the question of whether internal covariate shift is a likely mechanism of the  empirically very solid  improved performance of Batch Norm, the authors do a few simple experience. First, and most straightforwardly, they train a basic convolutional net with and without BatchNorm, pick a layer, and visualize the activation distribution of that layer over time, both in the Batch Norm and nonBatch Norm case. While they saw the expected performance boost, the Batch Norm case didn’t seem to be meaningfully more stable over time, relative to the normal case. Second, the authors tested what would happen if they added nonzeromean random noise *after* Batch Norm in the network. The upshot of this was that they were explicitly engineering internal covariate shift, and, if control thereof was the primary useful purpose of Batch Norm, you would expect that to neutralize BN’s good performance. In this experiment, while the authors did indeed see noisier, less stable activation distributions in the noise + BN case (in particular: look at layer 13 activations in the attached image), but noisy BN performed nearly as well as nonnoisy, and meaningfully better than the standard model without noise, but also without BN. As a final test, they approached the idea of “internal covariate shift” from a different definitional standpoint. Maybe a better way of thinking about it is in terms of stability of your gradients, in the face of updates made by lower layers of the network. That is to say: each parameter of the network pushes itself in the direction of lower loss all else held equal, but in practice, you change lowerlevel parameters simultaneously, which could cause the directional change the higherlayer parameter thought it needed to be off. So, the authors calculated the “gradient delta” between the gradient the model trains on, and what the gradient would be if you estimated it *after* all of the lower layers of the model had updated, such that the distribution of inputs to that layer has changed. Although the expectation would be that this gradient delta is smaller for batch norm, in fact, the authors found that, if anything, the opposite was true. So, in the face of none of these ideas panning out, the authors then introduce the best idea they’ve found for what motivates BN’s improved performance: a smoothing out of the loss function that SGD is optimizing. A smoother curve means, generally speaking, that the magnitudes of your gradients will be smaller, and also that the value of the gradient will change more slowly (i.e. low second derivative). As support for this idea, they show really different results for BN vs standard models in terms of, for example, how predictive a gradient at one point is of a gradient taken after you take a step in the direction of the first gradient. BN has meaningfully more predictive gradients, tied to lower variance in the values of the loss function in the direction of the gradient. The logic for why the mechanism of BN would cause this outcome is a bit tied up in math that’s hard to explain without LaTeX visuals, but basically comes from the idea that Batch Norm decreases the magnitude of the gradient of each layer output with respect to individual weight parameters, by averaging out those magnitudes over the batch. As Rahimi said in his initial talk, a lot of modern modeling is “applying brittle optimization techniques to loss surfaces we don’t understand.” And, by and large, that is in fact true: it’s devilishly difficult to get a good handle on what loss surfaces are doing when they’re doing it in severalmilliondimensional space. But, it being hard doesn’t mean we should just give up on searching for principles we can build our understanding on, and I think this paper is a really fantastic example of how that can be done well.
1 Comments

[link]
Subpixel CNN proposes a novel architecture for solving the illposed problem of superresolution (SR). Superresolution involves the recovery of a high resolution (HR) image or video from its low resolution (LR) counter part and is topic of great interest in digital image processing. #### Background & problems with current approaches LR images can be understood as lowpass filtered versions of HR images. A key assumption that underlies many SR techniques is that much of the highfrequency spatial data is redundant and thus can be accurately reconstructed from low frequency components. There are 2 kinds of SR techniques  one which assumes multiple LR images as different instances of HR images and uses CV techniques like registration and fusion to construct HR images. Apart from constraints, of requiring multiple images, they are inaccurate as well. The other one single image superresolution (SISR) techniques learn implicit redundancy that is present in natural data to recover missing HR information from a single LR instance. Among SISR techniques, the ones which deployed deep learning, most methods would tend to upsample LR images (using bicubic interpolation, with learnable filters) and learn the filters from the upsampled image. The problem with the methods are, normally creasing the resolution of the LR images before the image enhancement step increases the computational complexity. This is especially problematic for convolutional networks, where the processing speed directly depends on the input image resolution. Secondly, interpolation methods typically used to accomplish the task, such as bicubic interpolation do not bring additional information to solve the illposed reconstruction problem. Also, among other techniques like deconvolution, the pixels are filled with zero values between pixels, hence hese zero values have no gradient information that can be backpropagated through. #### Innovation In subpixel CNN authors propose an architecture (ESPCNN) that learns all the filters in 2 convolutional layers with the resolutions in LR space. Only in the last layer, a convolutional layer is implemented which transforms into HR space, using subpixel convolutional layer. The layer, in order to tide over problems from deconvolution, uses something known as Periodic Shuffling (PS) operator for upsampling. The idea behind periodic shuffling is derived from the fact that activations between pixel shouldn't be zero for upsamling. The kernel in subpixel CNN does a rearrangement following the equation ![PSoperator](https://i.imgur.com/OJtHL3w.png) The image explains the architecture of Subpixel CNN. The colouring in the last layer explains the methodology for PS ![ESPCNN_structure](https://i.imgur.com/py0vceQ.png) #### Results The ESPCNN was trained and tested on ImageNet along with other databases . As is evident from the image, ESPCNN performs significantly better than SuperResolution Convolutional Neural Network (SRCNN) & TRNN, which are currently the best performing approach published, as of date of publication of ESPCNN. ![ESPCNN_results](https://i.imgur.com/hOEjSeE.png) 