This paper builds on previous work in gradient-based meta-learning. The objective of meta-learning is to find meta-parameters ($\theta$) that can be "adapted" to yield "task-specific" parameters ($\phi$); $\theta$ and $\phi$ therefore lie in the same parameter space. A meta-learning problem involves several tasks, each specified by its own training and test datasets. At inference time, gradient-based meta-learning methods perform a few gradient-descent (GD) steps at the start of each task, initialized from the meta-parameters, to obtain the task-specific parameters. The goal is to find $\theta$ such that GD on each task's training data yields parameters that generalize well to its test data. Thus, the meta-learning objective is the average over tasks of the loss on each task's test dataset ($\mathcal{L}_{i}(\phi)$), where the parameters of that task ($\phi$) are obtained by running GD on its training data, initialized from the meta-parameters ($\theta$). \begin{equation} F(\theta) = \frac{1}{M}\sum_{i=1}^{M} \mathcal{L}_i(\phi) \end{equation}

To propagate the gradient of this task-specific loss back to the meta-parameters, one needs to backpropagate through both the loss ($\mathcal{L}_{i}$) and the GD steps (or whichever optimization algorithm was used) that produced $\phi$. Because GD is a series of steps, the whole sequence of updates applied to $\theta$ must be considered during backpropagation. Past approaches have therefore focused on RNN + BPTT or truncated BPTT. The authors show, however, that adding a proximal term to the task-specific optimization (also called the inner optimization) makes it possible to obtain the gradients without considering the entire trajectory of the parameters. The authors call these implicit gradients.
The idea is to constrain $\phi$ to stay close to $\theta$ by means of a proximal term, which resembles an L2-regularization penalty. Because of this constraint, $\phi$ satisfies an implicit equation in terms of $\theta$: \begin{equation} \phi = \theta - \frac{1}{\lambda}\nabla\mathcal{L}_i(\phi) \end{equation} Differentiating this gives the implicit gradients \begin{equation} \frac{d\phi}{d\theta} = \big( \mathbf{I} + \frac{1}{\lambda}\nabla^{2} \mathcal{L}_i(\phi) \big)^{-1} \end{equation} and the contribution of the gradient from $\mathcal{L}_i$ is thus \begin{equation} \big( \mathbf{I} + \frac{1}{\lambda}\nabla^{2} \mathcal{L}_i(\phi) \big)^{-1} \nabla \mathcal{L}_i(\phi) \end{equation} Forming the Hessian in the expression above is memory-expensive and becomes infeasible for deep neural networks. The authors therefore approximate this term by minimizing a quadratic with the conjugate gradient method, which requires only Hessian-vector products (cheaply available via reverse-mode backpropagation): \begin{equation} \min_{\mathbf{w}} \frac{1}{2}\mathbf{w}^\intercal \big( \mathbf{I} + \frac{1}{\lambda}\nabla^{2} \mathcal{L}_i(\phi) \big) \mathbf{w} - \mathbf{w}^\intercal \nabla \mathcal{L}_i(\phi) \end{equation} The minimizer $\mathbf{w}$ of this quadratic is exactly the matrix-inverse-times-gradient term above. Thus, the paper introduces a computationally cheap, constant-memory gradient computation for meta-learning.
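To make the conjugate-gradient step concrete, here is a minimal sketch that solves $(\mathbf{I} + \frac{1}{\lambda}\nabla^2\mathcal{L}_i)\,\mathbf{w} = \nabla\mathcal{L}_i$ using only Hessian-vector products. The names `implicit_grad_cg` and `toy_hvp`, and the fixed 2x2 Hessian, are assumptions made for illustration; in a real network the Hessian-vector product would come from automatic differentiation, and this is not the authors' implementation.

```python
def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


def toy_hvp(v):
    """Hessian-vector product for a toy quadratic loss (hypothetical Hessian)."""
    hessian = [[2.0, 0.5], [0.5, 1.0]]
    return [dot(row, v) for row in hessian]


def implicit_grad_cg(grad, hvp, lam, iters=20, tol=1e-20):
    """Approximately solve (I + H / lam) w = grad with conjugate gradient.

    Only products H @ v are needed, so the Hessian is never formed or inverted.
    """
    def matvec(v):
        return [vi + hi / lam for vi, hi in zip(v, hvp(v))]

    w = [0.0] * len(grad)
    r = list(grad)          # residual grad - A w, with w = 0
    p = list(r)             # search direction
    rs = dot(r, r)
    for _ in range(iters):
        if rs < tol:        # squared residual small enough
            break
        ap = matvec(p)
        alpha = rs / dot(p, ap)
        w = [wi + alpha * pi for wi, pi in zip(w, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return w
```

Calling `implicit_grad_cg([1.0, -2.0], toy_hvp, lam=2.0)` returns a vector `w` that satisfies the linear system up to floating-point error. Memory use does not depend on how many inner GD steps produced $\phi$, which is the property the summary highlights.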
**Problem Setting:** Sequence-to-sequence learning (seq2seq) is one of the most successful techniques in machine learning today. The basic idea is to encode a sequence into a vector (or a sequence of vectors, if using an attention-based encoder) and then use a recurrent decoder to decode the target sequence conditioned on the encoder output. While researchers have explored various architectural changes to this basic encoder-decoder model, the standard way of training such seq2seq models is to maximize the likelihood of each successive target word conditioned on the input sequence and the *gold* history of target words. This is also known as *teacher-forcing* in the RNN literature. Such an approach has three major issues:

1. **Exposure Bias:** Since we teacher-force the model with the *gold* history during training, the model is never exposed to its own errors. At test time we do not have access to the *gold* history, so we feed in the history generated by the model. If that history is erroneous, the model has no clue how to rectify it.
2. **Loss-Evaluation Mismatch:** While we evaluate the model using sequence-level metrics (such as BLEU for machine translation), we train it with a word-level cross-entropy loss.
3. **Label Bias:** Since the word probabilities are normalized at each time step (by a softmax over the final layer of the decoder), label bias can result if the number of possible candidates varies from step to step. More about this later.

**Solution:** This paper proposes an alternative training procedure for seq2seq models that attempts to solve all three issues listed above. The idea is to pose seq2seq learning as a beam-search optimization problem. The authors begin by removing the final softmax activation function from the decoder, so that instead of a probability distribution we get a score for each possible next word.
The training procedure is then changed as follows. At every time step $t$, they maintain a set $S_t$ of $K$ candidate sequences of length $t$. The loss function is defined with the following characteristics:

1. If the *gold* subsequence of length $t$ is in $S_t$ and its score exceeds the score of the $K$-th ranked candidate by a margin, the model incurs no loss. The candidates for the next time step are chosen as in regular beam search with beam size $K$.
2. If the *gold* subsequence of length $t$ is in $S_t$ but is itself the $K$-th ranked candidate, the loss pushes the *gold* sequence up by increasing its score. The candidates for the next time step are chosen as in the first case.
3. If the *gold* subsequence of length $t$ is NOT in $S_t$, the loss pushes the score of the *gold* sequence to be higher than that of the $K$-th ranked candidate by a margin. In this case, the candidates for the next step are chosen by considering only the *gold* word at time $t$ and taking its top-$K$ successors.
4. Further, since we want the full *gold* sequence to be at the top of the beam at the end of the search, when $t=T$ the loss is modified to require the score of the *gold* sequence to exceed the score of the *highest* ranked incorrect prediction by a margin.

This non-probabilistic training method has several advantages:

* The model is trained in much the same way as it is tested, since beam search is used during both training and testing. This helps eliminate exposure bias.
* The score-based loss can easily be scaled by a mistake-specific cost function. For example, in MT one could use a cost function that is inversely proportional to the BLEU score. So there is no loss-evaluation mismatch.
* Each time step can have a different set of successor words, based on any hard constraints in the problem. Because the model is non-probabilistic, this varying successor function does not introduce any label bias.
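The four loss cases above collapse into a single hinge term, sketched below under simplifying assumptions. The function name `beam_margin_loss` and its arguments are hypothetical, the mistake-specific cost scaling is left out, and the scores are plain numbers rather than decoder outputs, so this illustrates the idea rather than the paper's exact loss.

```python
def beam_margin_loss(gold_score, incorrect_scores, beam_size, margin=1.0,
                     final_step=False):
    """Hinge loss on the gold prefix score at one time step.

    gold_score       -- score of the gold subsequence of length t
    incorrect_scores -- scores of the non-gold candidates considered at step t
    beam_size        -- K; needs len(incorrect_scores) + 1 >= K
    final_step       -- at t = T, compare against the best incorrect sequence
    """
    if final_step:
        # Gold must beat the highest-ranked incorrect prediction.
        reference = max(incorrect_scores)
    else:
        # Gold must beat the K-th ranked candidate. If gold is itself K-th,
        # the reference equals gold_score and the loss equals the margin.
        ranked = sorted(incorrect_scores + [gold_score], reverse=True)
        reference = ranked[beam_size - 1]
    return max(0.0, margin - gold_score + reference)
```

With `incorrect_scores=[4.0, 3.0, 1.0]` and `beam_size=3`, a gold score of 5.0 clears the margin and costs nothing, a gold score of 2.0 sits exactly at rank $K$ and is penalized by the margin, and a gold score of 0.5 has fallen off the beam and is penalized more.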
Refer to [this set of slides][1] for an excellent illustration of label bias.

The cost of forward propagation grows linearly with the beam size $K$, although a GPU implementation should help reduce this cost. The authors propose a clever way of doing BPTT that makes backpropagation cost almost the same as in ordinary seq2seq training.

**Additional Tricks**

1. The authors pretrain the seq2seq model with the regular word-level cross-entropy loss. This is crucial, since training from random initialization did not work.
2. The authors use a "curriculum beam" strategy in training: they start with a beam size of 2 and increase it by 1 every 2 epochs until it reaches the required beam size. The training beam size must be at least the test beam size plus 1 (i.e. $K_{tr} \geq K_{te} + 1$).
3. When using dropout, you need to be careful to use the same dropout mask during backprop. The authors do this by sharing a single dropout mask across all sequences in a time step.

**Experiments** The authors compare the proposed model against a basic seq2seq model on word ordering, dependency parsing and MT tasks. The proposed model achieves a significant improvement over this strong baseline.

**Related Work:** The whole idea of the paper is based on the [learning as search optimization (LaSO) framework][2] of Daume III and Marcu (2005). Other notable related work includes training seq2seq models with a mix of cross-entropy and REINFORCE, called [MIXER][3], and [an actor-critic based seq2seq training][4]. The authors compare with MIXER and do significantly better.

**My two cents:** This is one of the important research directions in my opinion. While other recent methods attempt to use reinforcement learning to avoid the issues of word-level cross-entropy training, this paper proposes a really simple score-based solution that works very well.
While most language generation research is stuck with the probabilistic framework (I am saying this w.r.t. Deep NLP research), this paper highlights the benefits of non-probabilistic generation models. I see this as one potential way of avoiding the nasty scalability issues that come with softmax-based generative models.

[1]: http://www.cs.stanford.edu/~nmramesh/crf
[2]: https://www.isi.edu/~marcu/papers/daume05laso.pdf
[3]: http://arxiv.org/pdf/1511.06732v7.pdf
[4]: https://arxiv.org/pdf/1607.07086v2.pdf
# Summary

This paper presents state-of-the-art methods for both image caption generation and visual question answering (VQA). The authors build on previous methods by adding what they call a "bottom-up" approach to existing "top-down" attention mechanisms. With this approach they obtain SOTA results on both image captioning (MSCOCO) and visual question answering (the 2017 VQA challenge), and they propose a specific network configuration for each task. Their biggest contribution is using Faster R-CNN to retrieve the "important" parts of an image to focus on in both models.

## Top Down

Up until this paper, the traditional approach was "top-down": the last feature map layer of a CNN is used to obtain a latent representation of the input image. These features, along with the context of the caption being generated, were used to generate attention weights, which in turn were used to predict the next word of the caption. The network would learn to focus its attention on the regions of the feature map that matter most. This is the approach used in previous SOTA methods like [Show, Attend and Tell: Neural Image Caption Generation with Visual Attention](https://arxiv.org/abs/1502.03044).

## Bottom-up

The authors argue that the feature map of a CNN is too generic: attending over it amounts to operating on a uniform, grid-like layout. In other words, there is no particular reason to think that the feature map generated by a CNN gives optimal regions to attend to. Choosing the dimensions of the feature map is also fairly arbitrary. To fix this, the authors propose incorporating object detection in a *bottom-up* approach, using Faster R-CNN to identify regions of interest in an image. Given an input image, Faster R-CNN identifies bounding boxes that likely correspond to objects of a given category and simultaneously computes a feature vector for each bounding box.
Figure 1 shows the difference between the bottom-up and top-down approaches.

![image](https://userimages.githubusercontent.com/18450628/618172632683cd00ae1c11e9971ad3b531dbbd98.png)

## Combining the two

In this paper, the authors suggest using the bottom-up approach, via Faster R-CNN (FRCNN), to compute the salient regions of the image that the network should focus on. FRCNN is carefully pretrained on both ImageNet and the Visual Genome dataset. It is then frozen and only used to generate bounding boxes of regions with high confidence of being of interest. The top-down approach is then applied to the features obtained from the bottom-up approach.

In order to "enhance" the FRCNN performance, they initialize their FRCNN with a ResNet-101 pretrained on ImageNet. They then train it on the Visual Genome dataset, adding to the loss function the attributes available in that dataset, such as color (black, white, gold, etc.) and state (open, close, dark, bright, etc.). A sample of FRCNN outputs is shown in Figure 2. It is important to stress that only the feature representations, and not the actual outputs (i.e. not the labels), are used in their model.

![image](https://userimages.githubusercontent.com/18450628/61817487aca01380ae1c11e990fa134033b95bb0.png)

## Caption Generation

Figure 3 provides a high-level overview of the model used for image caption generation. The image is first passed through FRCNN, which produces a set of image features *V*. In their specific implementation, *V* consists of *k* vectors of size 1x2048. Their model consists of two LSTM blocks, one for attention and the other for language generation.

![image](https://userimages.githubusercontent.com/18450628/61818488effb8180ae1e11e98ae414355115429a.png)

The first block of their model is a top-down attention LSTM layer. It takes as input the mean-pooled features *V*, i.e. $\frac{1}{k}\sum_i v_i$, concatenated with the previous time step's hidden representation of the language LSTM as well as the word embedding of the previously generated word. The word embedding is learned and not pretrained. The output of the first LSTM is used to compute the attention weight for each feature vector using an MLP and a softmax:

![image](https://userimages.githubusercontent.com/18450628/6181998221298100ae2211e980a999640896413d.png)

The attention-weighted image feature is then used as an input to the language LSTM, concatenated with the output of the top-down attention LSTM, and a softmax is used to predict the next word in the sequence. The loss function seeks to minimize the cross-entropy of the generated sentence.

## VQA Model

The VQA task differs from caption generation in that a text-based question accompanies the input image and the network must produce an answer. The proposed VQA model is different from the caption generation model described above, but both use the same bottom-up approach, based on the FRCNN architecture, to generate the feature vectors of the image. A high-level overview of the architecture of the VQA model is presented in Figure 4.

![image](https://userimages.githubusercontent.com/18450628/618219888da67f00ae2611e984563c9e5ec60787.png)

Each word of the question is converted to a learned word embedding, which is used as input to a GRU. The number of words per question is limited to 14 for computational efficiency. The output of the GRU is concatenated with each of the *k* image features, and an attention weight is computed for each of the *k* features using an MLP and a softmax, similar to the attention used for caption generation. The weighted sum of the feature vectors is then passed through a linear layer so that its shape is compatible with the GRU output, and the Hadamard product (element-wise product) of the GRU output and the attention-weighted image feature representation is computed.
Finally, a tanh nonlinear activation is used. This results in a "gated tanh", which has been shown empirically to outperform both ReLU and tanh. At the output, a softmax probability distribution is generated, which selects an answer among all possible candidate answers.

## Results and experiments

### Resnet Baseline

To demonstrate that the bottom-up mechanism actually improves results, the authors use a ResNet trained on ImageNet as a baseline for generating the image feature vectors (they resize the final CNN layers using bilinear interpolation when needed). They consistently obtain better results with the bottom-up approach than with the ResNet approach, in both caption generation and VQA.

## MSCOCO

The authors demonstrate that they outperform all previous results on all metrics on the MSCOCO test server.

![image](https://userimages.githubusercontent.com/18450628/618241574f5f8e80ae2b11e98d90657db453e26e.png)

They also show that using the bottom-up approach instead of ResNet consistently scores higher on detecting instances of objects, attributes, relations, etc.:

![image](https://userimages.githubusercontent.com/18450628/618242387fa72d00ae2b11e981b3b5a7f80153f3.png)

The authors, like their predecessors, insist on demonstrating their network's frisbee ability:

![image](https://userimages.githubusercontent.com/18450628/61824344bed57e00ae2b11e987cd597568587e1d.png)

## VQA Results

They also demonstrate that adding bottom-up attention improves results over a ResNet baseline.

![image](https://userimages.githubusercontent.com/18450628/6182450028ee2300ae2c11e990162120a91917e4.png)

They also show that their model outperformed all other submissions in the VQA challenge. They mention using an ensemble of 30 models for their submission.

![image](https://userimages.githubusercontent.com/18450628/6182463483877f00ae2c11e98d849589e0ea2be2.png)

A sample of what is attended to in an image, given a correct answer, is shown in Figure 6.
![image](https://userimages.githubusercontent.com/18450628/61824608736f9f80ae2c11e99d4e8cb6bd0a1a92.png)

# Comments

The authors introduce a new way to select the portions of the image on which to focus attention. The idea is very original and came at a time when object detection was making significant progress (i.e. FRCNN). A few comments:

* This method might not generalize well to other types of data. It requires pretraining on larger datasets (Visual Genome, ImageNet, etc.) whose categories overlap with those of both the MSCOCO and VQA datasets (i.e. cars, people, etc.). It would be interesting to see an end-to-end model that does not rely on pretraining on other similar datasets.
* No insight is given into the computational complexity, the inference time or the training time. I imagine that FRCNN is resource intensive, and having to do a forward pass of FRCNN for every pass of the network must be a computational bottleneck. Not to mention that they ensembled 30 of them!
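The soft attention step shared by the caption and VQA models (a softmax over per-region scores, followed by a weighted sum of the region features) can be sketched as follows. This is a simplified illustration: the names `softmax` and `attend` are hypothetical, and the scores are passed in directly, whereas in the paper they come from an MLP over each region feature and the LSTM or GRU state.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of raw attention scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(features, scores):
    """Return attention weights and the attention-weighted feature vector.

    features -- k region feature vectors (lists of equal length)
    scores   -- k unnormalized attention scores, one per region (assumed given)
    """
    weights = softmax(scores)
    dim = len(features[0])
    pooled = [sum(w * f[d] for w, f in zip(weights, features))
              for d in range(dim)]
    return weights, pooled
```

Equal scores reduce this to the mean-pooling used as input to the attention LSTM, while one dominant score makes the pooled vector approach that single region's feature.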
# Object detection system overview

https://i.imgur.com/vd2YUy3.png

The system:

1. takes an input image,
2. extracts around 2000 bottom-up region proposals,
3. computes features for each proposal using a large convolutional neural network (CNN), and then
4. classifies each region using class-specific linear SVMs.

* R-CNN achieves a mean average precision (mAP) of 53.7% on PASCAL VOC 2010.
* On the 200-class ILSVRC2013 detection dataset, R-CNN’s mAP is 31.4%, a large improvement over OverFeat, which had the previous best result at 24.3%.

## Two challenges faced in object detection

1. the localization problem
2. labeling the data (labeled data is scarce)

1 Localization problem:

* One approach frames localization as a regression problem. Prior work taking this route reports a mAP of 30.5% on VOC 2007, compared to the 58.5% achieved by R-CNN.
* An alternative is to build a sliding-window detector. The authors considered this approach, but units high up in their network, which has five convolutional layers, have very large receptive fields (195 x 195 pixels) and strides (32 x 32 pixels) in the input image, which makes precise localization within the sliding-window paradigm an open technical challenge.

2 Labeling the data:

* The conventional solution to this problem is unsupervised pretraining followed by supervised fine-tuning.
* R-CNN instead uses supervised pretraining on a large auxiliary dataset (ILSVRC), followed by domain-specific fine-tuning on a small dataset (PASCAL).
* Fine-tuning for detection improves mAP by 8 percentage points.
* Stochastic gradient descent via backpropagation was shown to be effective for training convolutional neural networks (CNNs).

## Object detection with R-CNN

This system consists of three modules:

* The first generates category-independent region proposals. These proposals define the set of candidate detections available to the detector.
* The second module is a large convolutional neural network that extracts a fixed-length feature vector from each region.
* The third module is a set of class-specific linear SVMs.
Module design

1 Region proposals

* Earlier work detects mitotic cells by applying a CNN to regularly-spaced square crops.
* R-CNN uses the selective search method in fast mode (Capture All Scales, Diversification, Fast to Compute).
* Computing region proposals and features takes 13s/image on a GPU or 53s/image on a CPU.

2 Feature extraction

* Extract a 4096-dimensional feature vector from each region proposal using the Caffe implementation of the CNN.
* Features are computed by forward propagating a mean-subtracted 227x227 RGB image through five convolutional layers and two fully connected layers.
* To fit a proposal to this input, warp all pixels in a tight bounding box around it to the required size.
* The feature matrix is typically 2000x4096.

3 Test-time detection

* At test time, run selective search on the test image to extract around 2000 region proposals (selective search’s “fast mode” is used in all experiments).
* Warp each proposal and forward propagate it through the CNN in order to compute features. Then, for each class, score each extracted feature vector using the SVM trained for that class.
* Given all scored regions in an image, apply a greedy non-maximum suppression (for each class independently) that rejects a region if it has an intersection-over-union (IoU) overlap with a higher scoring selected region larger than a learned threshold.

## Training

1 Supervised pretraining

* The CNN is pretrained on a large auxiliary dataset (ILSVRC2012 classification) using image-level annotations only (bounding box labels are not available for this data).

2 Domain-specific fine-tuning

* Continue stochastic gradient descent (SGD) training of the CNN parameters using only warped region proposals, with a learning rate of 0.001.

3 Object category classifiers

* Use an intersection-over-union (IoU) overlap threshold of 0.3 to label a region.
* Once features are extracted and training labels are applied, optimize one linear SVM per class.
* Adopt the standard hard negative mining method, since the training data is too large to fit in memory.

### Results on PASCAL VOC 2010-12

1 VOC 2010

* Compared against four strong baselines: SegDPM, DPM, UVA, and Regionlets.
* Achieves a large improvement in mAP, from 35.1% to 53.7%, while also being much faster.

https://i.imgur.com/0dGX9b7.png

2 ILSVRC2013 detection

* Ran R-CNN on the 200-class ILSVRC2013 detection dataset.
* R-CNN achieves a mAP of 31.4%.

https://i.imgur.com/GFbULx3.png

#### Performance layer-by-layer, without fine-tuning

1 Layer pool5

* This is the max-pooled output of the network’s fifth and final convolutional layer.
* The pool5 feature map is 6 x 6 x 256 = 9216-dimensional.
* Each pool5 unit has a receptive field of 195x195 pixels in the original 227x227 pixel input.

2 Layer fc6

* Fully connected to pool5.
* It multiplies a 4096x9216 weight matrix by the pool5 feature map (reshaped as a 9216-dimensional vector) and then adds a vector of biases.

3 Layer fc7

* It is implemented by multiplying the features computed by fc6 by a 4096 x 4096 weight matrix, and similarly adding a vector of biases and applying half-wave rectification.

#### Performance layer-by-layer, with fine-tuning

* The CNN’s parameters are fine-tuned on PASCAL.
* Fine-tuning increases mAP by 8.0 percentage points to 54.2%.

### Network architectures

* A 16-layer deep network, consisting of 13 layers of 3 x 3 convolution kernels, with five max pooling layers interspersed, and topped with three fully-connected layers. This network is referred to as “O-Net” for OxfordNet and the baseline as “T-Net” for TorontoNet.
* R-CNN with O-Net substantially outperforms R-CNN with T-Net, increasing mAP from 58.5% to 66.0%.
* The drawback is compute time: O-Net is considerably slower than T-Net.

1 The ILSVRC2013 detection dataset

* The dataset is split into three sets: train (395,918), val (20,121), and test (40,152).

#### CNN features for segmentation
* full R-CNN: The first strategy (full) ignores the region’s shape and computes CNN features directly on the warped window. Two regions might have very similar bounding boxes while having very little overlap.
* fg R-CNN: The second strategy (fg) computes CNN features only on a region’s foreground mask. The background is replaced with the mean input so that background regions are zero after mean subtraction.
* full+fg R-CNN: The third strategy (full+fg) simply concatenates the full and fg features.

https://i.imgur.com/n1bhmKo.png
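The greedy non-maximum suppression used at test time in the R-CNN pipeline can be sketched as below. The box format `(x1, y1, x2, y2)`, the function names, and the default threshold of 0.5 are assumptions for illustration; the summary only states that the threshold is learned.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, scores, iou_threshold=0.5):
    """Keep boxes in descending score order, rejecting any box whose IoU
    with an already kept (higher scoring) box exceeds the threshold.

    Returns the indices of the kept boxes. Run once per class.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep
```

For example, two heavily overlapping boxes and one distant box leave the higher scoring of the overlapping pair plus the distant box.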
* Usually GANs transform a noise vector `z` into images. `z` might be sampled from a normal or uniform distribution.
  * The effect of this is that the components in `z` are deeply entangled.
  * Changing a single component has hardly any influence on the generated images. One has to change multiple components to affect the image.
  * The components end up not being interpretable. Ideally one would like to have meaningful components, e.g. for human faces one that controls the hair length and a categorical one that controls the eye color.
* They suggest a change to GANs based on Mutual Information, which leads to interpretable components.
  * E.g. for MNIST a component that controls the stroke thickness and a categorical component that controls the digit identity (1, 2, 3, ...).
  * These components are learned in a (mostly) unsupervised fashion.

### How

* The latent code `c`
  * "Normal" GANs parameterize the generator as `G(z)`, i.e. G receives a noise vector and transforms it into an image.
  * This is changed to `G(z, c)`, i.e. G now receives a noise vector `z` and a latent code `c` and transforms both into an image.
  * `c` can contain multiple variables following different distributions, e.g. in MNIST a categorical variable for the digit identity and a gaussian one for the stroke thickness.
* Mutual Information
  * If using a latent code via `G(z, c)`, nothing forces the generator to actually use `c`. It can easily ignore it and just degenerate to `G(z)`.
  * To prevent that, they force G to generate images `x` in a way that `c` must be recoverable. So, if you have an image `x` you must be able to reliably tell which latent code `c` it has, which means that G must use `c` in a meaningful way.
  * This relationship can be expressed with mutual information, i.e. the mutual information between `x` and `c` must be high.
  * The mutual information between two variables X and Y is defined as `I(X; Y) = entropy(X) - entropy(X|Y) = entropy(Y) - entropy(Y|X)`.
  * If the mutual information between X and Y is high, then knowing Y helps you to decently predict the value of X (and the other way round).
  * If the mutual information between X and Y is low, then knowing Y doesn't tell you much about the value of X (and the other way round).
  * The new GAN loss becomes `old loss - lambda * I(G(z, c); c)`, i.e. the higher the mutual information, the lower the result of the loss function.
* Variational Mutual Information Maximization
  * In order to maximize `I(G(z, c); c)`, one has to know the distribution `P(c|x)` (from image to latent code), which however is unknown.
  * So instead they create `Q(c|x)`, which is an approximation of `P(c|x)`.
  * `I(G(z, c); c)` is then optimized by maximizing a lower bound, similar to the one in variational autoencoders (called "Variational Information Maximization", hence the name "InfoGAN").
  * Basic equation: `LowerBoundOfMutualInformation(G, Q) = E[log Q(c|x)] + H(c) <= I(G(z, c); c)`
    * `c` is the latent code.
    * `x` is the generated image.
    * `H(c)` is the entropy of the latent codes (constant throughout the optimization).
    * Optimization w.r.t. Q is done directly.
    * Optimization w.r.t. G is done via the reparameterization trick.
  * If `Q(c|x)` approximates `P(c|x)` *perfectly*, the lower bound becomes the mutual information ("the lower bound becomes tight").
  * In practice, `Q(c|x)` is implemented as a neural network. Both Q and D have to process the generated images, which means that they can share many convolutional layers, significantly reducing the extra cost of training Q.

### Results

* MNIST
  * They use for `c` one categorical variable (10 values) and two continuous ones (uniform between -1 and +1).
  * InfoGAN learns to associate the categorical one with the digit identity and the continuous ones with rotation and width.
  * Applying `Q(c|x)` to an image and then classifying only on the categorical variable (i.e. fully unsupervised) yields 95% accuracy.
  * Sampling new images with exaggerated continuous variables in the range `[-2, +2]` yields sound images (i.e. the network generalizes well).
  * ![MNIST examples](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/InfoGAN__mnist.png?raw=true "MNIST examples")
* 3D face images
  * InfoGAN learns to represent the faces via pose, elevation, lighting.
  * They used five uniform variables for `c`. (So two of them apparently weren't associated with anything sensible? They are not mentioned.)
* 3D chair images
  * InfoGAN learns to represent the chairs via identity (categorical) and rotation or width (apparently they did two experiments).
  * They used one categorical variable (four values) and one continuous variable (uniform `[-1, +1]`).
* SVHN
  * InfoGAN learns to represent lighting and to spot the center digit.
  * They used four categorical variables (10 values each) and two continuous variables (uniform `[-1, +1]`). (Again, a few variables were apparently not associated with anything sensible?)
* CelebA
  * InfoGAN learns to represent pose, presence of sunglasses (not perfectly), hair style and emotion (in the sense of "smiling or not smiling").
  * They used 10 categorical variables (10 values each). (Again, a few variables were apparently not associated with anything sensible?)
  * ![CelebA examples](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/InfoGAN__celeba.png?raw=true "CelebA examples")
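As a small numerical check of the lower bound `E[log Q(c|x)] + H(c) <= I(x; c)`, the sketch below works on a tiny discrete joint distribution instead of images. The table layout (`joint[c][x] = P(c, x)`, `q[x][c] = Q(c|x)`) and the function names are assumptions chosen for illustration; in InfoGAN itself the expectation is estimated from samples and `Q` is a neural network.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def mutual_information(joint):
    """Exact I(c; x) from a joint table with joint[c][x] = P(c, x)."""
    p_c = [sum(row) for row in joint]
    p_x = [sum(row[x] for row in joint) for x in range(len(joint[0]))]
    return sum(p * math.log(p / (p_c[c] * p_x[x]))
               for c, row in enumerate(joint)
               for x, p in enumerate(row) if p > 0)


def variational_lower_bound(joint, q):
    """E[log Q(c|x)] + H(c), with q[x][c] = Q(c|x)."""
    p_c = [sum(row) for row in joint]
    expected_log_q = sum(p * math.log(q[x][c])
                         for c, row in enumerate(joint)
                         for x, p in enumerate(row) if p > 0)
    return expected_log_q + entropy(p_c)
```

With `joint = [[0.4, 0.1], [0.1, 0.4]]`, plugging in the exact posterior `[[0.8, 0.2], [0.2, 0.8]]` makes the bound equal the mutual information, while a less accurate `Q` such as `[[0.6, 0.4], [0.4, 0.6]]` gives a strictly smaller value.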
This paper presents a novel neural network approach (though see [here](https://www.facebook.com/hugo.larochelle.35/posts/172841743130126?pnref=story) for a discussion on prior work) to density estimation, with a focus on image modeling. At its core, it exploits the following property of the densities of random variables. Let $x$ and $z$ be two random variables of equal dimensionality such that $x = g(z)$, where $g$ is some bijective and deterministic function (we'll denote its inverse by $f = g^{-1}$). Then the change of variable formula gives us this relationship between the densities of $x$ and $z$:

$p_X(x) = p_Z(z) \left|{\rm det}\left(\frac{\partial g(z)}{\partial z}\right)\right|^{-1}$

Moreover, since the determinant of the Jacobian matrix of the inverse $f$ of a function $g$ is simply the inverse of the determinant of the Jacobian of $g$, we can also write:

$p_X(x) = p_Z(f(x)) \left|{\rm det}\left(\frac{\partial f(x)}{\partial x}\right)\right|$

where we've replaced $z$ by its deterministically inferred value $f(x)$ from $x$.

So, the core of the proposed model is a design for bijective functions $g$ (actually, they design its inverse $f$, from which $g$ can be derived by inversion) that are easily invertible and have an easy-to-compute Jacobian determinant. Specifically, the authors propose to construct $f$ from various modules that all preserve these properties and allow constructing highly non-linear $f$ functions. Then, assuming a simple choice for the density $p_Z$ (they use a multidimensional Gaussian), it becomes possible both to compute $p_X(x)$ tractably and to sample from that density, by first sampling $z\sim p_Z$ and then computing $x=g(z)$.

The building blocks for constructing $f$ are the following:

**Coupling layers**: This is perhaps the most important piece.
It simply computes as its output $b\odot x + (1-b) \odot (x \odot \exp(l(b\odot x)) + m(b\odot x))$, where $b$ is a binary mask (with half of its values set to 0 and the others to 1) over the input of the layer $x$, while $l$ and $m$ are arbitrarily complex neural networks with input and output layers of equal dimensionality. In brief, for dimensions for which $b_i = 1$ it simply copies the input value into the output. As for the other dimensions (for which $b_i = 0$), it linearly transforms them as $x_i * \exp(l(b\odot x)_i) + m(b\odot x)_i$. Crucially, the bias ($m(b\odot x)_i$) and coefficient ($\exp(l(b\odot x)_i)$) of the linear transformation are non-linear transformations (i.e. the output of neural networks) that only have access to the masked input (i.e. the non-transformed dimensions). While this layer might seem odd, it has the important property that it is invertible and the determinant of its Jacobian is simply $\exp(\sum_i (1-b_i) l(b\odot x)_i)$. See Section 3.3 for more details on that.

**Alternating masks**: One important property of coupling layers is that they can be stacked (i.e. composed), and the resulting composition is still a bijection and is invertible (since each layer is individually a bijection) and has a tractable determinant for its Jacobian (since the Jacobian of a composition of functions is simply the product of each function's Jacobian matrix, and the determinant of a product of square matrices is the product of the determinants of each matrix). This is true even if the mask $b$ of each layer is different. Thus, the authors propose using masks that alternate across layers, by masking a different subset of (half of) the dimensions. For images, they propose using masks with a checkerboard pattern (see Figure 3). Intuitively, alternating masks are better because after at least 2 layers, all dimensions have been transformed at least once.
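The coupling layer described above can be sketched in a few lines. The functions `toy_l` and `toy_m` below are hypothetical stand-ins for the neural networks $l$ and $m$, and vectors are plain lists, so this illustrates only the algebra (forward pass, inverse, and log-determinant), not the paper's architecture.

```python
import math


def toy_l(v):
    """Hypothetical log-scale network l: any function of the masked input."""
    return [math.tanh(sum(v))] * len(v)


def toy_m(v):
    """Hypothetical shift network m: any function of the masked input."""
    return [2.0 * sum(v)] * len(v)


def coupling_forward(x, b, l, m):
    """y = b*x + (1-b)*(x*exp(l(b*x)) + m(b*x)), plus the log-determinant."""
    masked = [bi * xi for bi, xi in zip(b, x)]
    log_scale, shift = l(masked), m(masked)
    y = [bi * xi + (1 - bi) * (xi * math.exp(si) + ti)
         for bi, xi, si, ti in zip(b, x, log_scale, shift)]
    log_det = sum((1 - bi) * si for bi, si in zip(b, log_scale))
    return y, log_det


def coupling_inverse(y, b, l, m):
    """Invert the layer. The masked dimensions were copied unchanged, so
    b*y equals b*x and l, m can be re-evaluated without inverting them."""
    masked = [bi * yi for bi, yi in zip(b, y)]
    log_scale, shift = l(masked), m(masked)
    return [bi * yi + (1 - bi) * (yi - ti) * math.exp(-si)
            for bi, yi, si, ti in zip(b, y, log_scale, shift)]
```

Note that the inverse never needs to invert `l` or `m`, which is why they can be arbitrarily complex. Stacking a second layer with the complementary mask (`[0, 1, 0, 1]` after `[1, 0, 1, 0]`) transforms the dimensions that the first layer copied.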
**Squeezing operations**: Squeezing operations correspond to a reorganization of a 2D spatial layout of dimensions into 4 sets of feature maps with spatial resolutions reduced by half (see Figure 3). This exposes multiple scales of resolution to the model. Moreover, after a squeezing operation, instead of using a checkerboard pattern for masking, the authors propose to use a per-channel masking pattern, so that "the resulting partitioning is not redundant with the previous checkerboard masking". See Figure 3 for an illustration.

Overall, the models used in the experiments usually stack a few of the following "chunks" of layers: 1) a few coupling layers with alternating checkerboard masks, 2) followed by squeezing, 3) followed by a few coupling layers with alternating channel-wise masks. Since the output of each chunk of layers must technically be of the same size as the input image, this could become expensive in terms of computation and space when using a lot of layers. Thus, the authors propose to explicitly pass on (copy) to the very last layer ($z$) half of the dimensions after each chunk of layers, adding another chunk of layers only on the other half. This is illustrated in Figure 4b.

Experiments on CIFAR-10, and on 32x32 and 64x64 versions of ImageNet, show that the proposed model (coined the real-valued non-volume preserving transformation, or Real NVP) has competitive performance (in bits per dimension), though slightly worse than the Pixel RNN.

**My Two Cents**

The proposed approach is quite unique and thought-provoking. Most interestingly, it is the only powerful generative model I know of that combines A) a tractable likelihood, B) an efficient / one-pass sampling procedure and C) the explicit learning of a latent representation. While achieving this required a model definition that is somewhat unintuitive, it is nonetheless mathematically really beautiful!

I wonder to what extent Real NVP is penalized in its results by the fact that it models pixels as real-valued observations.
First, it implies that its estimate of bits/dimension is an upper bound on what it could be if the uniform sub-pixel noise was integrated out (see Equations 3-4-5 of [this paper](http://arxiv.org/pdf/1511.01844v3.pdf)). Moreover, the authors had to apply a nonlinear transformation (${\rm logit}(\alpha + (1-\alpha)\odot x)$) to the pixels, to spread the $[0,255]$ interval further over the reals. Since the Pixel RNN models pixels as discrete observations directly, the Real NVP might be at a disadvantage.

I'm also curious to know how easy it would be to do conditional inference with the Real NVP. One could imagine doing approximate MAP conditional inference, by clamping the observed dimensions and doing gradient ascent on the log-likelihood with respect to the values of the remaining dimensions. This could be interesting for image completion, or for structured output prediction with real-valued outputs in general. I also wonder how expensive that would be.

In all cases, I'm looking forward to seeing interesting applications and variations of this model in the future!
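As a rough illustration of the pixel pre-processing discussed in this summary, the sketch below applies ${\rm logit}(\alpha + (1-\alpha)\odot x)$ to integer pixels and then undoes it. Several details are assumptions made for the example rather than facts taken from the summary: uniform noise in $[0,1)$ is added to each pixel and the result is divided by 256 so that the argument of the logit stays inside $(0,1)$, $\alpha = 0.05$ is an illustrative value, and the function names `preprocess` and `postprocess` are hypothetical.

```python
import numpy as np

def preprocess(pixels, alpha=0.05, rng=None):
    """Map integer pixels in [0, 255] to the reals via logit(alpha + (1 - alpha) * x)."""
    rng = np.random.default_rng(0) if rng is None else rng
    # Assumption: dequantize with uniform noise, then rescale to [0, 1).
    x = (pixels + rng.uniform(size=pixels.shape)) / 256.0
    p = alpha + (1 - alpha) * x
    return np.log(p) - np.log1p(-p)  # logit(p)

def postprocess(y, alpha=0.05):
    """Inverse map: sigmoid, undo the alpha squeeze, and return to the [0, 256) scale."""
    p = 1.0 / (1.0 + np.exp(-y))
    return 256.0 * (p - alpha) / (1 - alpha)

pixels = np.array([[0, 128, 255]])
y = preprocess(pixels)
recovered = np.floor(postprocess(y)).astype(int)
```

Because the transformation is an invertible elementwise map, its log-determinant can be added to the model's log-likelihood in the same way as for any other layer.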
[link]
This paper describes an architecture designed for generating class predictions based on a set of features in situations where you may only have a few examples per class, or even where you see entirely new classes at test time. Some prior work has approached this problem in ridiculously complex fashion, up to and including training a network to predict the gradient outputs of a meta-network that it thinks would best optimize loss, given a new class. The method of Prototypical Networks prides itself on being much simpler and more intuitive, so I hope I’ll be able to convey that in this explanation. In order to think about this problem properly, it makes sense to take a few steps back and think about some fundamental assumptions that underlie machine learning.

https://i.imgur.com/Q45w0QT.png

One very basic one is that you need some notion of similarity between observations in your training set and potential new observations in your test set in order to properly generalize. To put it very simplistically, if a test example is very similar to examples of class A that we saw in training, we might predict it to be of class A at testing. But what does it *mean* for two observations to be similar to one another? If you’re using a method like K Nearest Neighbors, you calculate a point’s class identity based on the closest training-set observations to it in Euclidean space, and you assume that nearness in that space corresponds to the likelihood of two data points having come from the same class. This is useful for the use case of having new classes show up after training, since, well, there isn’t really a training period: the strategy for KNN is just carrying your whole training set around and, whenever a new test point comes along, calculating its closest neighbors among those training-set points.
If you see a new class in the wild, all you need to do is add the examples of that class to your group of training-set points, and then after a few examples, if your assumptions hold, you’ll be able to predict that class by (hopefully) finding those two or three points as neighbors.

But what if some dimensions of your feature space matter much more than others for differentiating between classes? In a simplistic example, you could have twenty features, but, unbeknownst to you, only one is actually useful for separating out your classes, and the other 19 are random. If you use the naive KNN assumption, you wouldn’t expect to perform well here, because the distances along these 19 meaningless directions will spread out your points, due to randomness, more than the meaningful dimension spreads them out due to their belonging to different classes. And what if you want to be able to learn nonlinear relationships between your features, which the composability of multi-layer neural networks lends itself well to? In cases like those, the features you were handed may be a woefully suboptimal metric space in which to calculate a kind of similarity that corresponds to differences in class identity, so you’ll just have to strike out for the territories and create a metric space for yourself.

That is, at a very high level, what this paper seeks to do: learn a transformation between input features and some vector space, such that distances in that vector space correspond as well as possible to probabilities of belonging to a given output class. You may notice me using “vector space” and “embedding” interchangeably; they are the same idea: the result of that learned transformation, which represents your input observations as dense vectors in some p-dimensional space, where p is a chosen hyperparameter.

What are the concrete learning steps this architecture goes through?

1. During each training episode, sample a subset of classes, and then divide the examples of those classes into training examples and query examples.
2. Using a set of weights that are being learned by the network, map the input features of each training example into a vector space.
3. Once all training examples are mapped into the space, calculate a “mean vector” for class A by averaging all of the embeddings of training examples that belong to class A. This is the “prototype” for class A, and once we have it, we can forget the values of the embedded examples that were averaged to create it. This is a nice update on the KNN approach, since the number of parameters we need to carry around to evaluate is only (num-dimensions) * (num-classes), rather than (num-dimensions) * (num-training-examples).
4. Then, for each query example, map it into the embedding space, and use a distance metric in that space to create a softmax over possible classes. (You can just think of a softmax as a network’s predicted probability: it’s a set of floats that add up to 1.)
5. Then, you can calculate the (cross-entropy) error between the true output and that softmax prediction vector in the same way as you would for any classification network.
6. Add up the prediction loss for all the query examples, and then backpropagate through the network to update your weights.

The overall effect of this process is to incentivize your network to learn, not necessarily a good prediction function, but a good metric space. The idea is that, if the metric space is good enough, and the classes are conceptually similar to each other (e.g. car vs chair, as opposed to car vs the-meaning-of-life), a space that does well at causing similar observed classes to be close to one another will do the same for classes not seen during training.
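The forward pass of one training episode, as described in the steps above, can be sketched in a few lines of NumPy. This is a toy illustration under stated assumptions, not the paper's code: the embedding is a single random linear map `W` standing in for the learned network, the data are synthetic Gaussian blobs, all names and sizes are hypothetical, and the backpropagation step is omitted because it would need an autodiff framework.

```python
import numpy as np

rng = np.random.default_rng(0)
n_classes, n_support, n_query, n_features, p = 3, 5, 4, 20, 8

# Hypothetical embedding: one linear map standing in for the learned network.
W = rng.normal(scale=0.1, size=(n_features, p))

def embed(x):
    return x @ W

# Toy episode: each sampled class is a Gaussian blob around its own random center.
centers = rng.normal(size=(n_classes, n_features))
support = centers[:, None, :] + 0.1 * rng.normal(size=(n_classes, n_support, n_features))
query = centers[:, None, :] + 0.1 * rng.normal(size=(n_classes, n_query, n_features))
query_labels = np.repeat(np.arange(n_classes), n_query)

# Step 3: one prototype per class = mean of the embedded training (support) examples.
prototypes = embed(support).mean(axis=1)                 # (n_classes, p)

# Step 4: softmax over negative squared Euclidean distances to each prototype.
q = embed(query.reshape(-1, n_features))                 # (n_classes * n_query, p)
sq_dist = ((q[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=-1)
logits = -sq_dist
logits = logits - logits.max(axis=1, keepdims=True)      # numerical stability
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

# Steps 5-6: cross-entropy, summed over all query examples.
loss = -np.log(probs[np.arange(len(q)), query_labels]).sum()
```

At test time the same prototype computation is applied to the few labelled examples of each new class, so no weights need to change to accommodate unseen classes.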
I admit to not being sufficiently familiar with the datasets used for testing to have a sense for how well this method compares to more fully supervised classification schemes; if anyone does, definitely let me know! But the paper claims to get state-of-the-art results compared to other approaches in this domain of few-shot learning (matching networks, and the aforementioned meta-learning). One interesting note is that the authors found that squared Euclidean distance, when applied within the embedded space, worked meaningfully better than cosine distance (which is a more standard way of measuring distances between vectors, since it measures only angle, rather than magnitude). They suspect that this is because squared Euclidean distance, but not cosine distance, belongs to a category of divergence/distance metrics (called Bregman divergences) that have a special set of properties such that the point closest on aggregate to all points in a cluster is the average of all those points. If you want to dive way deep into the minutiae on this point, I found this blog post quite good: http://mark.reid.name/blog/meetthebregmandivergences.html
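For the squared Euclidean case, the "closest on aggregate" property can be checked directly with a short derivation (added here as a sketch; the notation $x_1, \dots, x_n$ for the embedded points of one class and $c$ for a candidate prototype is mine, not the paper's). Setting the gradient of the total squared distance to zero gives \begin{equation} J(c) = \sum_{i=1}^{n} \|x_i - c\|^2, \qquad \nabla_c J(c) = -2\sum_{i=1}^{n} (x_i - c) = 0 \;\Rightarrow\; c = \frac{1}{n}\sum_{i=1}^{n} x_i \end{equation} and since $J$ is convex in $c$, the class mean is the minimizer. Cosine distance depends only on the direction of $c$, so the plain average of the points is in general not its aggregate minimizer, which is consistent with the authors' suspicion.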
[link]
If you were to survey researchers, and ask them to name the 5 most broadly influential ideas in Machine Learning from the last 5 years, I’d bet good money that Batch Normalization would be somewhere on everyone’s lists. Before Batch Norm, training meaningfully deep neural networks was an unstable process, and one that often took a long time to converge to success. When we added Batch Norm to models, it allowed us to increase our learning rates substantially (leading to quicker training) without the risk of activations either collapsing or blowing up in value. It had this effect because it addressed one of the key difficulties of deep networks: internal covariate shift.

To understand this, imagine the smaller problem of a one-layer model that’s trying to classify based on a set of input features. Now, imagine that, over the course of training, the input distribution of features moved around, so that, perhaps, a value that was at the 70th percentile of the data distribution initially is now at the 30th. We have an obvious intuition that this would make the model quite hard to train, because it would learn some mapping between feature values and class at the beginning of training, but that mapping would become invalid by the end. This is, fundamentally, the problem faced by higher layers of deep networks, since, if the distribution of activations in a lower layer changes even by a small amount, that can cause a “butterfly effect” style outcome, where the activation distributions of higher layers change more dramatically. Batch Normalization, which takes each feature “channel” a network learns and normalizes it (normalize = subtract the mean, divide by the standard deviation) using the mean and variance of that feature over spatial locations and over all the observations in a given batch, helps solve this problem because it ensures that, throughout the course of training, the distribution of inputs that a given layer sees stays roughly constant, no matter what the lower layers get up to.
On the whole, Batch Norm has been wildly successful at stabilizing training, and is now canonized, along with the likes of ReLU and Dropout, as one of the default sensible training procedures for any given network. However, it does have its difficulties and downsides. One salient one comes about when you train using very small batch sizes, in the range of 2-16 examples per batch. Under these circumstances, the mean and variance calculated off of that batch are noisy and high variance (for the general reason that statistics calculated off of small sample sizes are noisy and high variance), which takes away from the stability that Batch Norm is trying to provide.

One proposed alternative to Batch Norm that doesn’t run into this problem of small sample sizes is Layer Normalization. This operates under the assumption that the activations of all feature “channels” within a given layer hopefully have roughly similar distributions, and so you can normalize all of them by taking the aggregate mean and variance over all channels, *for a given observation*, and using those as the mean and variance you normalize by. Because there are typically many channels in a given layer, this means that you have many “samples” that go into the mean and variance. However, this assumption, that the distributions for each feature channel are roughly the same, can be an incorrect one.

A useful model I have for thinking about the distinction between these two approaches is the idea that both are calculating approximations of an underlying abstract notion: the in-the-limit mean and variance of a single feature channel, at a given point in time. Batch Normalization is an approximation of that insofar as it only has a small sample of points to work with, and so its estimate will tend to be high variance.
Layer Normalization is an approximation insofar as it makes the assumption that feature distributions are aligned across channels: if this turns out not to be the case, individual channels will have normalizations that are biased, due to being pulled towards the mean and variance calculated over an aggregate of channels that are different from them.

Group Norm tries to find a balance point between these two approaches, one that uses multiple channels and normalizes within a given instance (to avoid the problems of small batch size), but, instead of calculating the mean and variance over all channels, calculates them over a group of channels that represents a subset. The inspiration for this idea comes from the fact that, in old-school computer vision, it was typical to have parts of your feature vector that, for example, represented a histogram of some value (say: localized contrast) over the image. These multiple values all corresponded to a larger shared “group” feature. If a group of features all represent a similar idea, then their distributions will be more likely to be aligned, and therefore you have less of the bias issue.

One confusing element of this paper for me was that the motivation part of the paper strongly implied that the reason group norm is sensible is that you are able to combine statistically dependent channels into a group together. However, as far as I can tell, there’s no actual clustering or similarity analysis of channels that is done to place certain channels into certain groups; it’s just done semi-randomly, based on the index location within the feature channel vector. So, under this implementation, it seems like the benefits of group norm come less from any explicit seeking out of dependent channels, and more from the fact that having fewer channels in each group means that each individual channel makes up more of the weight in its group, which does something to reduce the bias effect anyway.
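One way to see how the three schemes relate is that they differ only in which axes the mean and variance are computed over. The NumPy sketch below shows this for activations of shape (batch, channels, height, width). It is a simplified illustration under stated assumptions: the learned per-channel scale and shift are omitted, the function names and tensor sizes are hypothetical, and groups are formed from contiguous channel indices, matching the index-based grouping described above.

```python
import numpy as np

def normalize(x, axes, eps=1e-5):
    """Subtract the mean and divide by the standard deviation computed over `axes`."""
    mean = x.mean(axis=axes, keepdims=True)
    var = x.var(axis=axes, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def batch_norm(x):
    # Per channel: statistics over the batch and all spatial locations.
    return normalize(x, axes=(0, 2, 3))

def layer_norm(x):
    # Per observation: statistics over all channels and spatial locations.
    return normalize(x, axes=(1, 2, 3))

def group_norm(x, num_groups):
    # Per observation and per group of contiguous channels.
    n, c, h, w = x.shape
    assert c % num_groups == 0, "channels must divide evenly into groups"
    g = x.reshape(n, num_groups, c // num_groups, h, w)
    return normalize(g, axes=(2, 3, 4)).reshape(n, c, h, w)

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(2, 8, 4, 4))  # (batch, channels, height, width)
y = group_norm(x, num_groups=4)
```

With `num_groups=1` this reduces to the layer-norm statistics, and with one channel per group every channel of every observation is normalized on its own, which makes the bias/variance trade-off described above a matter of choosing the group size.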
The upshot of the Group Norm paper, results-wise, is that Group Norm performs better than both Batch Norm and Layer Norm at very low batch sizes. This is useful if you’re training on very dense data (e.g. high-res video), where it might be difficult to store more than a few observations in memory at a time. However, once you get to batch sizes of ~24, Batch Norm starts to do better, presumably since that’s a large enough sample size to reduce variance, and you get to the point where the variance of BN is preferable to the bias of GN.
[link]
The main contribution of this paper is introducing a new transformation that the authors call Batch Normalization (BN). The need for BN comes from the fact that during the training of deep neural networks (DNNs) the distribution of each layer’s inputs changes. This phenomenon is called internal covariate shift (ICS).

#### What is BN?

Normalize each (scalar) feature independently with respect to the mean and variance of the mini-batch. Scale and shift the normalized values with two new parameters (per activation) that will be learned. BN consists of making normalization part of the model architecture.

#### What do we gain?

According to the authors, the use of BN provides a great speed-up in the training of DNNs. In particular, the gains are greater when it is combined with higher learning rates. In addition, BN works as a regularizer for the model, which allows the use of less dropout or less L2 regularization. Furthermore, since the distribution of the inputs is normalized, it also allows sigmoids to be used as activation functions without the saturation problem.

#### What follows?

This seems to be especially promising for training recurrent neural networks (RNNs). The vanishing and exploding gradient problems \cite{journals/tnn/BengioSF94} have their origin in the iteration of transformations that scale up or down the activations in certain directions (eigenvectors). It seems that this regularization would be especially useful in this context since this would allow the gradient to flow more easily. When we unroll RNNs, we usually have ultra-deep networks.

#### Like

* Simple idea that seems to improve training.
* Makes training faster.
* Simple to implement. Probably.
* You can be less careful with initialization.

#### Dislike

* Does not work with pure stochastic gradient descent (mini-batch size = 1).
* This could reduce the parallelism of the algorithm since now all the examples in a mini-batch are tied.
* Results on an ensemble of networks for ImageNet make it harder to evaluate the relevance of BN by itself. (Although they do mention the performance of a single model.)
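The transformation described under "What is BN?" can be sketched as follows for a fully connected layer with inputs of shape (batch, features). This is a minimal training-time forward pass under stated assumptions: `gamma` and `beta` are the learned scale and shift, here simply initialized to ones and zeros, the small constant `eps` is added for numerical stability, and the function name is hypothetical.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the mini-batch, then scale and shift."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(32, 4))
gamma, beta = np.ones(4), np.zeros(4)

y = batch_norm_forward(x, gamma, beta)

# With a single example the batch mean equals the example itself,
# so the normalized value is zero and the output collapses to beta.
y_single = batch_norm_forward(x[:1], gamma, beta)
```

The last two lines illustrate the first "Dislike" item: with a mini-batch of size 1 the output no longer depends on the input at all.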
[link]
Tanay and Griffin introduce the boundary tilting perspective as an alternative to the “linear explanation” for adversarial examples. Specifically, they argue that it is not reasonable to assume that the linearity in deep neural networks causes the existence of adversarial examples. Originally, Goodfellow et al. [1] explained the impact of adversarial examples by considering a linear classifier: $w^T x' = w^Tx + w^T\eta$ where $\eta$ is the adversarial perturbation. In high dimensions, the second term might result in a significant shift of the neuron's activation. Tanay and Griffin, in contrast, argue that the dimensionality does not have an impact; although the impact of $w^T\eta$ grows with the dimensionality, so does $w^Tx$, such that the ratio should be preserved. Additionally, they show (by giving a counterexample) that linearity is not sufficient for the existence of adversarial examples.

Instead, they offer a different perspective on the existence of adversarial examples, which is formalized in the course of the paper. Their main idea is that the training samples live on a manifold in the actual input space. The claim is that on the manifold there are no adversarial examples (meaning that the classes are well separated on the manifold and it is hard to find adversarial examples for most training samples). However, the decision boundary extends beyond the manifold and might lie close to the manifold, such that adversarial examples leaving the manifold can be found easily. This idea is illustrated in Figure 1.

https://i.imgur.com/SrviKgm.png

Figure 1: Illustration of the underlying idea of the boundary tilting perspective, see the text for details.

[1] Ian J. Goodfellow, Jonathon Shlens, Christian Szegedy: Explaining and Harnessing Adversarial Examples. CoRR abs/1412.6572 (2014)

Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/).