First published: 2018/10/04 (1 year ago) Abstract: A central goal of unsupervised learning is to acquire representations from
unlabeled data or experience that can be used for more effective learning of
downstream tasks from modest amounts of labeled data. Many prior unsupervised
learning works aim to do so by developing proxy objectives based on
reconstruction, disentanglement, prediction, and other metrics. Instead, we
develop an unsupervised learning method that explicitly optimizes for the
ability to learn a variety of tasks from small amounts of data. To do so, we
construct tasks from unlabeled data in an automatic way and run meta-learning
over the constructed tasks. Surprisingly, we find that, when integrated with
meta-learning, relatively simple mechanisms for task design, such as clustering
unsupervised representations, lead to good performance on a variety of
downstream tasks. Our experiments across four image datasets indicate that our
unsupervised meta-learning approach acquires a learning algorithm without any
labeled data that is applicable to a wide range of downstream classification
tasks, improving upon the representation learned by four prior unsupervised
This recent paper, a collaboration involving some of the authors of MAML, proposes an intriguing application of techniques developed in the field of meta learning to the problem of unsupervised learning - specifically, the problem of developing representations without labeled data, which can then be used to learn quickly from a small amount of labeled data. As a reminder, the idea behind meta learning is that you train models on multiple different tasks, using only a small amount of data from each task, and update the model based on the test set performance of the model.
The conceptual advance proposed by this paper is to adopt the broad strokes of the meta learning framework, but apply it to unsupervised data, i.e. data with no pre-defined supervised tasks. The goal of such a project is, as so often is the case with unsupervised learning, to learn representations, specifically, representations we believe might be useful over a whole distribution of supervised tasks. However, to apply traditional meta learning techniques, we need that aforementioned distribution of tasks, and we’ve defined our problem as being over unsupervised data. How exactly are we supposed to construct the former out of the latter? This may seem a little circular, or strange, or definitionally impossible: how can we generate supervised tasks without supervised labels?
The artificial tasks created by this paper are rooted in mechanically straightforward operations, but conceptually interesting ones all the same: it uses an off the shelf unsupervised learning algorithm to generate a fixed-width vector embedding of your input data (say, images), and then generates multiple different clusterings of the embedded data, and then uses those cluster IDs as labels in a faux-supervised task. It manages to get multiple different tasks, rather than just one - remember, the premise of meta learning is in models learned over multiple tasks - by randomly up and down-scaling dimensions of the embedding before clustering is applied. Different scalings of dimensions means different points close to one another, which means the partition of the dataset into different clusters.
With this distribution of “supervised” tasks in hand, the paper simply applies previously proposed meta learning techniques - like MAML, which learns a model which can be quickly fine tuned on a new task, or prototypical networks, which learn an embedding space in which observations from the same class, across many possible class definitions are close to one another.
An interesting note from the evaluation is that this method - which is somewhat amusingly dubbed “CACTUs” - performs best relative to alternative baselines in cases where the true underlying class distribution on which the model is meta-trained is the most different from the underlying class distribution on which the model is tested. Intuitively, this makes reasonable sense: meta learning is designed to trade off knowledge of any given specific task against the flexibility to be performant on a new class division, and so it gets the most value from trade off where a genuinely dissimilar class split is seen during testing.
One other quick thing I’d like to note is the set of implicit assumptions this model builds on, in the way it creates its unsupervised tasks. First, it leverages the smoothness assumptions of classes - that is, it assumes that the kinds of classes we might want our model to eventually perform on are close together, in some idealized conceptual space. While not a perfect assumption (there’s a reason we don’t use KNN over embeddings for all of our ML tasks) it does have a general reasonableness behind it, since rarely are the kinds of classes very conceptually heterogenous. Second, it assumes that a truly unsupervised learning method can learn a representation that, despite being itself sub-optimal as a basis for supervised tasks, is a well-enough designed feature space for the general heuristic of “nearby things are likely of the same class” to at least approximately hold. I find this set of assumptions interesting because they are so simplifying that it’s a bit of a surprise that they actually work: even if the “classes” we meta-train our model on are defined with simple Euclidean rules, optimizing to be able to perform that separation using little data does indeed seem to transfer to the general problem of “separating real world, messier-in-embedding-space classes using little data”.
In response to increasing calls for ways to explain and interpret the predictions of neural networks, one major genre of explanation has been the construction of salience maps for image-based tasks. These maps assign a relevance or saliency score to every pixel in the image, according to various criteria by which the value of a pixel can be said to have influenced the final prediction of the network. This paper is an interesting blend of ideas from the saliency mapping literature with ones from adversarial examples: it essentially shows that you can create adversarial examples whose goal isn't to change the output of a classifier, but instead to keep the output of the classifier fixed, but radically change the explanation (their term for the previously-described pixel saliency map that results from various explanation-finding methods) to resemble some desired target explanation. This is basically a targeted adversarial example, but targeting a different property of the network (the calculated explanation) while keeping an additional one fixed (keeping the output of the original network close to the original output, as well as keeping the input image itself in a norm ball around the original image. This is done in a pretty standard way: by just defining a loss function incentivizing closeness to the original output and also closeness of the explanation to the desired target, and performing gradient descent to modify pixels until this loss was low.
The authors do a decent job of showing such targeted perturbations are possible: by my assessment of their results their strongest successes at inducing an actual targeted explanation are with Layerwise Relevance Propogation and Pattern Attribution (two of the 6 tested explanation methods). With the other methods, I definitely buy that they're able to induce an explanation that's very unlike the true/original explanation, but it's not as clear they can reach an arbitrary target. This is a bit of squinting, but it seems like they have more success in influencing propogation methods (where the effect size of the output is propogated backwards through the network, accounting for ReLus) than they do with gradient ones (where you're simply looking at the gradient of the output class w.r.t each pixel.
In the theory section of the paper, the authors do a bit of differential geometry that I'll be up front and say I did not have the niche knowledge to follow, but which essentially argues that the manipulability of an explanation has to do with the curvature of the output manifold for a constant output. That is to say: how much can you induce a large change in the gradient of the output, while moving a small distance along the manifold of a constant output value. They then go on to argue that ReLU activations, because they are by definition discontinuous, induce sharp changes in gradient for points nearby one another, and this increase the ability for networks to be manipulated. They propose a softplus activation instead, where instead of a sharp discontinuity, the ReLU shape becomes more curved, and show relatively convincingly that at low values of Beta (more curved) you can mostly eliminate the ability of a perturbation to induce an adversarially targeted explanation.
For all that I didn't have a completely solid grasp of some of the theory sections here, I think this is a neat proof of concept paper in showing that neural networks can be small-perturbation fragile on a lot of different axes: we've known this for a while in the area of adversarial examples, but this is a neat generalization of that fact to a new area.
Recently, DeepMind released a new paper showing strong performance on board game tasks using a mechanism similar to the Value Prediction Network one in this paper, which inspired me to go back and get a grounding in this earlier work.
A goal of this paper is to design a model-based RL approach that can scale to complex environment spaces, but can still be used to run simulations and do explicit planning. Traditional, model-based RL has worked by learning a dynamics model of the environment - predicting the next observation state given the current one and an action, and then using that model of the world to learn values and plan with. In addition to the advantages of explicit planning, a hope is that model-based systems generalize better to new environments, because they predict one-step changes in local dynamics in a way that can be more easily separated from long-term dynamics or reward patterns.
However, a downside of MBRL is that it can be hard to train, especially when your observation space is high-dimensional, and learning a straight model of your environment will lead to you learning details that aren't actually unimportant for planning or creating policies.
The synthesis proposed by this paper is the Value Prediction Network. Rather than predicting observed state at the next step, it learns a transition model in latent space, and then learns to predict next-step reward and future value from that latent space vector. Because it learns to encode latent-space state from observations, and also learns a transition model from one latent state to another, the model can be used for planning, by simulating multiple transitions between latent state. However, unlike a normal dynamics model, whose training signal comes from a loss against observational prediction, the signal for training both latent → reward/value/discount predictions, and latent → latent transitions comes from using this pipeline to predict reward values. This means that if an aspect of the environment isn't useful for predicting reward, it won't generally be encoded into latent state, meaning you don't waste model capacity predicting irrelevant detail.
Once this model exists, it can be used for generating a policy through a tree-search planning approach: simulating future trajectories and aggregating the predicted reward along those trajectories, and then taking the highest-value one.
The authors find that their model is able to do better than both model-free and model-based methods on the tasks they tested on. In particular, they find that it has many of the benefits of a model that predicts full observations, but that the Value Prediction Network learns more quickly, and is more robust to stochastic environments where there's an inherent ceiling on how well a next-step observation prediction can work.
My main question coming into this paper is: how is this different from simply a value estimator like those used in DQN or A2C, and my impression is that the difference comes from this model's ability to do explicit state simulation in latent space, and then predict a value off of the *latent* state, whereas a value network predicts value from observational state.
This paper presents a combination of the inception architecture
with residual networks. This is done by adding a shortcut connection
to each inception module. This can alternatively be seen as a resnet where
the 2 conv layers are replaced by a (slightly modified) inception module.
The paper (claims to) provide results against the hypothesis that adding residual
connections improves training, rather increasing the model size is what makes the difference.
This paper presents state-of-the-art methods for both caption generation of images and visual question answering (VQA). The authors build on previous methods by adding what they call a "bottom-up" approach to previous "top-down" attention mechanisms. They show that using their approach they obtain SOTA on both Image captioning (MSCOCO) and the Visual Question and Answering (2017 VQA challenge). They propose a specific network configurations for each. Their biggest contribution is using Faster-R-CNN to retrieve the "important" parts of an image to focus on in both models.
## Top Down
Up until this paper, the traditional approach was to use a "top-down" approach, in which the last feature map layer of a CNN is used to obtain a latent representation of the given input image. These features, along with the context of the caption being generated, were used to generate attention weights that were used to predict the next sequence in the context of caption generation. The network would learn to focus its attention on regions of the feature map that matters most. This is the approach used in previous SOTA methods like [Show, Attend and Tell: Neural Image Caption Generation with Visual Attention](https://arxiv.org/abs/1502.03044).
The authors argue that the feature map of a CNN is too generic and can be thought of operating on a uniform, grid-like feature map. In other words, there is no particular reason to think that the feature map of generated by a CNN would give optimal regions to attend to. Also, carefully choosing the dimensions of the feature map can be very arbitrary.
In order to fix this, the authors propose combining object detection methods in a *bottom-up* approach. To do so, the authors propose using Faster-R-CNN to identify regions of interest in an image. Given an input image, Faster-R-CNN will identify bounding boxes of the image that likely correspond to objects of a given category and simultaneously compute a feature vector of that bounding box. Figure 1 shows the difference between the Bottom-up and Top-Down approach.
## Combining the two
In this paper, the authors suggest using the bottom-up approach to compute the salient regions of the image the network should focus on using Faster-R-CNN. FRCNN is carefully pretrained on both imagenet and the Visual Genome dataset. It is then frozen and only used to generate bounding boxes of regions with high confidence of being of interest. The top-down approach is then used on the features obtained from the bottom-up approach. In order to "enhance" the FRCNN performance, they initialize their FRCNN with a ResNet-101 pre-trained on imagenet. They train their FRCNN on the Visual Genome dataset, adding attributes to the loss function that are available from the Visual Genome dataset, attributes such as color (black, white, gold etc.), state (open, close, dark, bright, etc.). A sample of FRCNN outputs are shown in figure 2. It is important to stress that only the feature representations and not the actual outputs (i.e. not the labels) are used in their model.
## Caption Generation
Figure 3 provides a high-level overview of the model being used for caption generation for images. The image is first passed through FRCNN which produces a set of image features *V*. In their specific implementation, *V* consists of *k* vectors of size 1x2048. Their model consists of two LSTM blocks, one for attention and the other for language generation.
The first block of their model is a Top-Down Attention LSTM layer. It takes as input the mean-pooled features *V* , i.e. 1/k*sum(v_i), concatenated with the previous timestep's hidden representation of the language LSTM as well as the word embedding of the previously generated word. The word embedding is learned and not pretrained.
The output of the first LSTM is used to compute the attention for each vector using an MLP and softmax:
The attention weighted image feature is then used as an input to the language LSTM model, concatenated with the output from the top-down Attention LSTM and a softmax is used to predict the next word in the sequence. The loss function seeks to minimize the cross-entropy of the generated sentence.
## VQA Model
The VQA task differs to the image generation in that a text-based question accompanies an input image and the network must produce an answer. The VQA model proposed is different to that of the caption generation model previously described, however they both use the same bottom-up approach to generate the feature vectors of the image based on the FRCNN architecture. A high-level overview of the architecture for the VQA model is presented in Figure 4.
Each word from the question is converted to a learned word embedding which is used as input to a GRU. The number of words for each question is limited to 14 for computational efficiency. The output from the GRU is concatenated with each of the *k* image features, and attention weights are computed for each *k*th feature using an MLP and softmax, similar to what is done in the attention for caption generation. The weighted sum of the feature vectors is then passed through an linear layer such that its shape is compatible with the gru output, and the Hadamard product (element-wise product) is computed over the GRU output and attention-weighted image feature representation. Finally, a tanh non-linear activation is used. This results in a "gated tanh", which have been shown empirically to outperform both ReLU and tanh. Finally, a softmax probability distribution is generated at the output which selects a candidate answer among all possible candidate answers.
## Results and experiments
### Resnet Baseline
To demonstrate that their contribution of bottom-up mechanism actually improves on results, the authors use a ResNet trained on imagenet as a baseline for generating the image feature vectors (they resize the final CNN layers using bilinear interpolation when needed). They consistently obtain better results when using the bottom-up approach over the ResNet approach in both caption generation and VQA.
The authors demonstrate that they outperform all results on all metrics on the MSCOCO test server.
They also show how using the bottom-up approach over ResNet consistently scores them higher on detecting instances of objects, attributes, relations, etc:
The authors, like their predecessors, insist on demonstrating their network's frisbee ability:
## VQA Results
They also demonstrate that the addition of bottom-up attention improves results over a ResNet baseline.
They also show that their model outperformed all other submissions on the VQA submission. They mention using an ensemble of 30 models for their submission.
A sample image of what is attended in an image given a proper answer is shown in figure 6.
The authors introduce a new way to select portions of the image on which to focus attention. The idea is very original and came at a time when object detection was making significant progress (i.e. FRCNN).
A few comments:
* This method might not generalize well to other types of data. It requires pre-training on larger datasets (visual genome, imagenet, etc.) which consist of categories that overlap with both the MSCOCO and VQA datasets (i.e. cars, people, etc.). It would be interesting to see an end-to-end model that does not rely on pre-training on other similar datasets.
* No insight is given to the computational complexity nor to the inference time or training time. I imagine that FRCNN is resource intensive, and having to do a forward pass of FRCNN for every pass of the network must be a computational bottleneck. Not to mention that they ensembled 30 of them!
They get multilingual alignments from dictionaries, then train a Bilstm pos tagger in source language, then automatically tag many tokens in the target language, then manually annotate 1000 tokens in target language, then train a system with combined loss over distant tagging and gold tagging. They add an additional output layer that is learned for the gold annotations.
This paper is a top-down (i.e. requires person detection separately) pose estimation method with a focus on improving high-resolution representations (features) to make keypoint detection easier.
During the training stage, this method utilizes annotated bounding boxes of person class to extract ground truth images and keypoints. The data augmentations include random rotation, random scale, flipping, and [half body augmentations](http://presentations.cocodataset.org/ECCV18/COCO18-Keypoints-Megvii.pdf) (feeding upper or lower part of the body separately). Heatmap learning is performed in a typical for this task approach of applying L2 loss between predicted keypoint locations and ground truth locations (generated by applying 2D Gaussian with std = 1).
During the inference stage, pre-trained object detector is used to provide bounding boxes. The final heatmap is obtained by averaging heatmaps obtained from the original and flipped images. The pixel location of the keypoint is determined by $argmax$ heatmap value with a quarter offset in the direction to the second-highest heatmap value.
While the pipeline described in this paper is a common practice for pose estimation methods, this method can achieve better results by proposing a network design to extract better representations. This is done through having several parallel sub-networks of different resolutions (next one is half the size of the previous one) while repeatedly fusing branches between each other:
The fusion process varies depending on the scale of the sub-network and its location in relation to others:
* They describe a model that upscales low resolution images to their high resolution equivalents ("Single Image Super Resolution").
* Their model uses a deeper architecture than previous models and has a residual component.
* Their model is a fully convolutional neural network.
* Input of the model: The image to upscale, *already upscaled to the desired size* (but still blurry).
* Output of the model: The upscaled image (without the blurriness).
* They use 20 layers of padded 3x3 convolutions with size 64xHxW with ReLU activations. (No pooling.)
* They have a residual component, i.e. the model only learns and outputs the *change* that has to be applied/added to the blurry input image (instead of outputting the full image). That change is applied to the blurry input image before using the loss function on it. (Note that this is a bit different from the currently used "residual learning".)
* They use a MSE between the "correct" upscaling and the generated upscaled image (input image + residual).
* They use SGD starting with a learning rate of 0.1 and decay it 3 times by a factor of 10.
* They use weight decay of 0.0001.
* During training they use a special gradient clipping adapted to the learning rate. Usually gradient clipping restricts the gradient values to `[-t, t]` (`t` is a hyperparameter). Their gradient clipping restricts the values to `[-t/lr, t/lr]` (where `lr` is the learning rate).
* They argue that their special gradient clipping allows the use of significantly higher learning rates.
* They train their model on multiple scales, e.g. 2x, 3x, 4x upscaling. (Not really clear how. They probably feed their upscaled image again into the network or something like that?)
* Higher accuracy upscaling than all previous methods.
* Can handle well upscaling factors above 2x.
* Residual network learns significantly faster than non-residual network.
*Architecture of the model.*
*Super-resolution quality of their model (top, bottom is a competing model).*
Cisse et al. propose parseval networks, deep neural networks regularized to learn orthonormal weight matrices. Similar to the work by Hein et al. , the mean idea is to constrain the Lipschitz constant of the network – which essentially means constraining the Lipschitz constants of each layer independently. For weight matrices, this can be achieved by constraining the matrix-norm. However, this (depending on the norm used) is often intractable during gradient descent training. Therefore, Cisse et al. propose to use a per-layer regularizer of the form:
$R(W) = \|W^TW – I\|$
where $I$ is the identity matrix. During training, this regularizer is supposed to ensure that the learned weigth matrices are orthonormal – an efficient alternative to regular matrix manifold optimization techniques (see the paper).
 Matthias Hein, Maksym Andriushchenko: Formal Guarantees on the Robustness of a Classifier against Adversarial Manipulation. CoRR abs/1705.08475 (2017)
Also see this summary at [davidstutz.de](https://davidstutz.de/category/reading/).
The paper introduces a sequential variational auto-encoder that generates complex images iteratively. The authors also introduce a new spatial attention mechanism that allows the model to focus on small subsets of the image. This new approach for image generation produces images that can’t be distinguished from the training data.
#### What is DRAW:
The deep recurrent attention writer (DRAW) model has two differences with respect to other variational auto-encoders. First, the encoder and the decoder are recurrent networks. Second, it includes an attention mechanism that restricts the input region observed by the encoder and the output region observed by the decoder.
#### What do we gain?
The resulting images are greatly improved by allowing a conditional and sequential generation. In addition, the spatial attention mechanism can be used in other contexts to solve the “Where to look?” problem.
#### What follows?
A possible extension to this model would be to use a convolutional architecture in the encoder or the decoder. Although this might be less useful since we are already restricting the input of the network.
* As observed in the samples generated by the model, the attention mechanism works effectively by reconstructing images in a local way.
* The attention model is fully differentiable.
* I think a better exposition of the attention mechanism would improve this paper.