[link]
TLDR; The authors propose a Contextual LSTM (CLSTM) model that appends a context vector to the input words when making predictions. They evaluate the model on Language Modeling, next-sentence selection, and next-topic prediction tasks, beating standard LSTM baselines.

#### Key Points

- The topic vector comes from an internal classifier system, i.e. it is supervised data. Topics could also be estimated using unsupervised techniques.
- The topic can be computed from the previous words of the current sentence (SentSegTopic), all words of the previous sentence (PrevSegTopic), or the current paragraph (ParaSegTopic). The best CLSTM uses all of them.
- English Wikipedia dataset: 1400M words train, 177M words validation, 178M words test. 129k vocabulary.
- When the current segment topic is present, the topic of the previous sentence doesn't matter.
- The authors couldn't compare to other models that incorporate topics because those don't scale to large-scale datasets.
- LSTMs form a long chain and the authors don't reset the hidden state at sentence boundaries. So a sentence has implicit access to the previous sentence's information, but explicitly modeling the topic still makes a difference.

#### Notes/Thoughts

- Increasing the number of hidden units seems to have a *much* larger impact on performance (perplexity) than adding the topic information. The simple word-based LSTM with more hidden units significantly outperforms the more complex CLSTM. This makes me question the practical usefulness of this model.
- IMO the comparisons are somewhat unfair, because by using an external classifier to obtain topic labels you are bringing in external data that the baseline models didn't have access to.
- What about using other unsupervised sentence embeddings as context vectors, e.g. seq2seq autoencoders or paragraph vectors (PV)?
- If the LSTM were perfect at modeling long-range dependencies we wouldn't need to feed extra topic vectors. What about residual connections?
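To make the core idea concrete (appending a context vector to each input word), here is a minimal PyTorch sketch. The class name, dimensions, and the simplification of using a single topic vector per sequence are my own assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class ContextualLSTM(nn.Module):
    """Minimal sketch of the CLSTM idea: the topic/context vector is
    concatenated to every word embedding before it enters the LSTM."""

    def __init__(self, vocab_size, embed_dim, topic_dim, hidden_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim + topic_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, word_ids, topic_vec):
        # word_ids: (batch, seq_len); topic_vec: (batch, topic_dim)
        emb = self.embed(word_ids)                          # (batch, seq_len, embed_dim)
        topics = topic_vec.unsqueeze(1).expand(-1, emb.size(1), -1)
        h, _ = self.lstm(torch.cat([emb, topics], dim=-1))  # append context at every step
        return self.out(h)                                  # next-word logits
```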
[link]
## **Keywords**

Progressive GAN, high-resolution generator

---

## **Summary**

1. **Introduction**
    1. **Goal of the paper**
        1. Generation of very high quality images by progressively increasing the size of the generator and discriminator.
        1. Improved training and stability of GANs.
        1. A new metric for evaluating GAN results.
        1. A high-quality version of the CELEBA-HQ dataset.
    1. **Previous Research**
        1. Generative methods help to produce new samples from higher-dimensional data distributions such as images.
        1. The common approaches for generative methods are:
            1. Autoregressive models: produce sharp images but are slow to evaluate, e.g. PixelCNN.
            1. Variational autoencoders: easy to train but produce blurry images.
            1. Generative adversarial networks: produce sharp images at small resolutions but are highly unstable.
1. **Method**
    1. **Basic GAN architecture**
        1. A GAN consists of two major parts:
            1. _Generator_: creates a sample image from a latent code that looks very close to the training images.
            1. _Discriminator_: trained to assess how close the sample image looks to the training images.
        1. To measure the overlap between the training and the generated distributions, several measures are used, such as the Jensen-Shannon divergence, the least-squares divergence, and the Wasserstein distance.
        1. Generating at larger resolutions causes problems because it becomes easier to tell the generated images apart from the training images, amplifying the gradient problem. Larger resolutions also require more memory, which can cause problems.
        1. A mechanism is also proposed to stop the generator from participating in the escalation that causes the mode-collapse problem.
    1. **Progressive growing of GANs**
        1. The primary method for the GAN training is to start from a low-resolution image and add extra layers at each step of the training process.
        1. Lower-resolution images are more stable to train on as they contain less class information; as the resolution increases, progressively finer details and features are added to the image.
        1. This leads to a smooth increase in image quality instead of the network having to learn a lot of detail in one single step.
    1. **Mini-batch separation**
        1. GANs tend to capture only a very small set of features from the image.
        1. "Minibatch discrimination" is used to generate a feature vector for each individual image along with one for the whole mini-batch of images.
1. **Conclusion**
    1. Higher-resolution images can be generated robustly and efficiently.
    1. The quality of the generated images is improved.
    1. Training time is reduced for comparable output quality and resolution.

---

## **Notes**

* Gradient problem: at higher resolutions it becomes easier to tell the difference between the training and the generated images [1], which amplifies the gradients passed to the generator. This is referred to as the gradient problem.
* Mode collapse: the generator is incapable of creating a large variety of samples and gets stuck.

## **Open research questions**

1. Improved methods for generating truly photorealistic images.
1. Improved semantic sensibility and improved understanding of the dataset.

## **References**

1. [https://blog.acolyer.org/2018/05/10/progressive-growing-of-gans-for-improved-quality-stability-and-variation/](https://blog.acolyer.org/2018/05/10/progressive-growing-of-gans-for-improved-quality-stability-and-variation/)
1. [https://medium.com/@jonathan_hui/gan-why-it-is-so-hard-to-train-generative-advisory-networks-819a86b3750b](https://medium.com/@jonathan_hui/gan-why-it-is-so-hard-to-train-generative-advisory-networks-819a86b3750b)
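As a rough illustration of the progressive-growing step, here is a minimal sketch of the fade-in used when a new, higher-resolution block is added. Names are my own; the actual implementation also involves equalized learning rates, pixelwise normalization, and a mirrored fade-in on the discriminator side.

```python
import torch
import torch.nn.functional as F

def grow_generator_output(low_res_rgb, new_block_rgb, alpha):
    """Blend the upsampled output of the previous (stable) resolution with
    the output of the newly added block:
        out = (1 - alpha) * upsample(old) + alpha * new
    alpha is ramped linearly from 0 to 1 during the transition phase, so the
    new layers are introduced smoothly instead of in one disruptive step.
    """
    upsampled = F.interpolate(low_res_rgb, scale_factor=2, mode="nearest")
    return (1.0 - alpha) * upsampled + alpha * new_block_rgb
```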
[link]
This method is based on improving the speed of R-CNN \cite{conf/cvpr/GirshickDDM14}

1. Where R-CNN had two different objective functions, Fast R-CNN combines the localization and classification losses into a "multi-task loss" in order to speed up training.
2. It also uses a pooling method based on \cite{journals/pami/HeZR015} called the RoI pooling layer, which converts the features inside any region of interest into a fixed-size feature map, so that input images don't have to be rescaled to a fixed size before being fed to the CNN. "RoI max pooling works by dividing the $h \times w$ RoI window into an $H \times W$ grid of sub-windows of approximate size $h/H \times w/W$ and then max-pooling the values in each sub-window into the corresponding output grid cell."
3. Backprop through the RoI pooling layer routes each output gradient back to the input position that was the argmax of its pooling sub-window.

This method is further improved by the paper "Faster R-CNN" \cite{conf/nips/RenHGS15}
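A minimal NumPy sketch of the quoted RoI max pooling operation, under my own naming and a simplified feature-map layout; it is illustrative, not the paper's implementation.

```python
import numpy as np

def roi_max_pool(feature_map, roi, output_size=(7, 7)):
    """feature_map: (H_feat, W_feat, C) conv feature map.
    roi: (y0, x0, y1, x1) region of interest in feature-map coordinates.
    Returns a fixed-size (H, W, C) output regardless of the RoI's size."""
    H, W = output_size
    y0, x0, y1, x1 = roi
    h, w = y1 - y0, x1 - x0
    out = np.zeros((H, W, feature_map.shape[2]), dtype=feature_map.dtype)
    for i in range(H):
        for j in range(W):
            # sub-window boundaries of approximate size h/H x w/W
            ys, ye = y0 + (i * h) // H, y0 + ((i + 1) * h + H - 1) // H
            xs, xe = x0 + (j * w) // W, x0 + ((j + 1) * w + W - 1) // W
            out[i, j] = feature_map[ys:ye, xs:xe].max(axis=(0, 1))
    return out
```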
[link]
The paper introduces a sequential variational auto-encoder that generates complex images iteratively. The authors also introduce a new spatial attention mechanism that allows the model to focus on small subsets of the image. This new approach to image generation produces images that can't be distinguished from the training data.

#### What is DRAW:

The Deep Recurrent Attentive Writer (DRAW) model has two differences with respect to other variational auto-encoders. First, the encoder and the decoder are recurrent networks. Second, it includes an attention mechanism that restricts the input region observed by the encoder and the output region modified by the decoder.

#### What do we gain?

The resulting images are greatly improved by allowing conditional and sequential generation. In addition, the spatial attention mechanism can be used in other contexts to solve the "Where to look?" problem.

#### What follows?

A possible extension to this model would be to use a convolutional architecture in the encoder or the decoder, although this might be less useful since we are already restricting the input of the network.

#### Like:

* As observed in the samples generated by the model, the attention mechanism works effectively by reconstructing images in a local way.
* The attention model is fully differentiable.

#### Dislike:

* I think a better exposition of the attention mechanism would improve this paper.
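A minimal PyTorch sketch of the sequential (decoder-side) generation loop: the image is built up on a canvas over several steps. The layer names and sizes are illustrative, and the attention-parameterized write operation is replaced by a plain linear "write" for brevity; the real model also has an encoder RNN and a read operation.

```python
import torch
import torch.nn as nn

T, batch, z_dim, dec_dim, img_dim = 10, 8, 20, 256, 28 * 28
decoder = nn.LSTMCell(z_dim, dec_dim)
write = nn.Linear(dec_dim, img_dim)   # attention-less "write" for simplicity

canvas = torch.zeros(batch, img_dim)
h_dec = torch.zeros(batch, dec_dim)
c_dec = torch.zeros(batch, dec_dim)
for t in range(T):
    z_t = torch.randn(batch, z_dim)            # sample a latent at each step
    h_dec, c_dec = decoder(z_t, (h_dec, c_dec))
    canvas = canvas + write(h_dec)             # iteratively refine the canvas
image = torch.sigmoid(canvas)                  # final generated image
```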
[link]
disclaimer: I'm the first author of the paper

## TL;DR

We have made a lot of progress on catastrophic forgetting within the standard evaluation protocol, i.e. sequentially learning a stream of tasks and testing our models' capacity to remember them all. We think it's time for a new approach to Continual Learning (CL), coined OSAKA, which is more aligned with real-life applications of CL. It brings CL closer to Online Learning and Open-World Learning.

The main modifications we propose:

- bring CL closer to Online Learning, i.e. at test time the model is continually learning and evaluated on its online predictions
- it's fine to forget, as long as you can quickly remember (just like we humans do)
- we allow pre-training (because you wouldn't deploy an untrained CL system, right?), but at test time the model will have to quickly learn new out-of-distribution (OoD) tasks (because the world is full of surprises)
- the task distribution is actually a hidden Markov chain. This implies:
    - new and old tasks can re-occur (just like in real life). Better remember them quickly if you want a good total performance!
    - tasks have different lengths
    - task boundaries are unknown (task-agnostic setting)

### Bonus:

We provide a unifying framework explaining the space of machine learning settings {supervised learning, meta learning, continual learning, meta-continual learning, continual-meta learning} in case it was starting to get confusing :p

## Motivation

We imagine an agent, embedded or not, first pre-trained in a controlled environment and later deployed in the real world, where it faces new or unexpected situations. This scenario is relevant for many applications. For instance, in robotics, the agent is pre-trained in a factory and deployed in homes or in manufacturing plants, where it will need to adapt to new domains and maybe solve new tasks. Likewise, a virtual assistant can be pre-trained on static datasets and deployed in a user's life to fit its personal needs. Further motivations can be found in time series forecasting, e.g., market prediction, game playing, autonomous customer service, recommendation systems, and autonomous driving, to name a few.

In this scenario, we are interested in the cumulative performance of the agent throughout its lifetime, whereas standard CL reports the agent's final performance on all tasks at the end of its life. In order to succeed in this scenario, agents need the ability to learn new tasks as well as to quickly remember old ones.

## Unifying Framework

We propose a unifying framework explaining the space of machine learning settings {supervised learning, meta learning, continual learning, meta-continual learning, continual-meta learning} with meta-learning terminology.

https://i.imgur.com/U16kHXk.png

(easier to digest with the accompanying text)

## OSAKA

The main features of the evaluation framework are:

- task agnosticism
- pre-training is allowed, but OoD tasks appear at test time
- task revisiting
- controllable non-stationarity
- online evaluation

(see the paper for the motivations behind these features)

## Continual-MAML: an initial baseline

A simple extension of MAML that is better suited than previous methods to the proposed setting.

https://i.imgur.com/C86WUc8.png

Its features are:

- fast adaptation
- dynamic representation
- task boundary detection
- computational efficiency

## Experiments

We provide a suite of 3 benchmarks to test algorithms in the new setting. The first uses the Omniglot, MNIST, and FashionMNIST datasets. The second and third use the Synbols (Lacoste et al. 2018) and TieredImageNet datasets, respectively.

The first set of experiments shows that the baseline outperforms previous approaches, i.e., supervised learning, meta learning, continual learning, meta-continual learning, and continual-meta learning, in the new setting.

https://i.imgur.com/IQ1WYTp.png

The second and third experiments lead us to similar conclusions.

code: https://github.com/ElementAI/osaka
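A very rough sketch of what an online loop in the spirit of Continual-MAML could look like. This is my own simplification, not the authors' code: a slow copy of the weights plays the role of the meta-learned initialization, a fast copy adapts online, and a loss spike acts as a cheap task-boundary detector; the consolidation step shown here is an assumption.

```python
import copy
import torch

def continual_maml_loop(model, stream, loss_fn, inner_lr=0.1, boundary_threshold=1.0):
    phi = copy.deepcopy(model)                       # slow / consolidated weights
    theta = copy.deepcopy(phi)                       # fast weights, adapted online
    opt = torch.optim.SGD(theta.parameters(), lr=inner_lr)
    prev_loss = None

    for x, y in stream:                              # online, task-agnostic stream
        loss = loss_fn(theta(x), y)                  # evaluated on the online prediction
        if prev_loss is not None and loss.item() - prev_loss > boundary_threshold:
            # suspected task boundary: consolidate theta into phi, restart from phi
            for p_phi, p_theta in zip(phi.parameters(), theta.parameters()):
                p_phi.data.lerp_(p_theta.data, 0.5)  # crude consolidation (assumption)
            theta.load_state_dict(phi.state_dict())
            opt = torch.optim.SGD(theta.parameters(), lr=inner_lr)
            prev_loss = None                         # skip the update on the boundary step
            continue
        opt.zero_grad()
        loss.backward()
        opt.step()
        prev_loss = loss.item()
    return phi, theta
```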
[link]
Zhao et al. propose a generative adversarial network (GAN) based approach to generate meaningful and natural adversarial examples for images and text. With natural adversarial examples, the authors refer to meaningful changes in the image content instead of adding seemingly random/adversarial noise – as illustrated in Figure 1. These natural adversarial examples can be crafted by first learning a generative model of the data, e.g., using a GAN together with an inverter (similar to an encoder), see Figure 2. Then, given an image $x$ and its latent code $z$, adversarial examples $\tilde{z} = z + \delta$ can be found within the latent space. The hope is that these adversarial examples will correspond to meaningful, naturally looking adversarial examples in the image space.

https://i.imgur.com/XBhHJuY.png

Figure 1: Illustration of natural adversarial examples in comparison to regular, FGSM adversarial examples.

https://i.imgur.com/HT2StGI.png

Figure 2: Generative model (GAN) together with the required inverter.

In practice, e.g., on MNIST, any black-box classifier can be attacked by randomly sampling possible perturbations $\delta$ in the latent space (with increasing norm) until an adversarial perturbation is found. Here, the inverter from Figure 2 is trained on top of the critic of the GAN (although specific details are missing in the paper).

Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/).
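A minimal sketch of the random latent-space search described above. The interface (`classifier`, `generator`, `inverter` as plain callables) and the hyperparameters are my own assumptions, not the authors' code.

```python
import numpy as np

def natural_adversarial_search(x, classifier, generator, inverter,
                               n_samples=100, delta_step=0.01, max_radius=1.0):
    """Sample perturbations of increasing norm in latent space until the
    black-box classifier's label flips; return the decoded natural example."""
    z = inverter(x)
    original_label = classifier(x)
    radius = delta_step
    while radius <= max_radius:
        for _ in range(n_samples):
            delta = np.random.randn(*z.shape)
            delta *= radius / np.linalg.norm(delta)   # sample on a sphere of given norm
            x_tilde = generator(z + delta)
            if classifier(x_tilde) != original_label:
                return x_tilde                        # natural adversarial example
        radius += delta_step                          # grow the search radius
    return None
```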
[link]
This paper models object detection as a regression problem for bounding boxes and object class probabilities with a single pass through the CNN. The main contribution is the idea of dividing the image into a 7x7 grid and having each cell predict a distribution over class labels as well as a bounding box for the object whose center falls into it. It's much faster than R-CNN and Fast R-CNN, as the additional step of extracting region proposals has been removed.

## Strengths

- Works in real time. The base model runs at 45fps and a faster version goes up to 150fps, and they claim that it's more than twice as fast as other works on real-time detection.
- End-to-end model; localization and classification errors can be jointly optimized.
- YOLO makes more localization errors and fewer background mistakes than Fast R-CNN, so using YOLO to eliminate false background detections from Fast R-CNN results in a ~3% mAP gain (without adding much computation time, since YOLO is much faster than Fast R-CNN).

## Weaknesses / Notes

- Results fall short of state-of-the-art: 57.9% vs. 70.4% mAP (Faster R-CNN).
- Performs worse at detecting small objects, as at most one object per grid cell can be detected.
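A small sketch of the grid parameterization, using the values from the paper (S=7 grid, B=2 boxes per cell, C=20 classes on PASCAL VOC); the helper function name is my own.

```python
# The network's final layer predicts a tensor of shape (S, S, B * 5 + C):
# for each cell, B boxes (x, y, w, h, confidence) plus C class probabilities.
S, B, C = 7, 2, 20
output_shape = (S, S, B * 5 + C)   # (7, 7, 30)

def responsible_cell(x_center, y_center, img_w, img_h, S=7):
    """Return the (row, col) of the grid cell responsible for an object,
    i.e. the cell containing the object's center."""
    col = min(int(x_center / img_w * S), S - 1)
    row = min(int(y_center / img_h * S), S - 1)
    return row, col
```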
[link]
This paper explores the use of convolutional (PixelCNN) and recurrent units (PixelRNN) for modeling the distribution of images, in the framework of autoregressive distribution estimation. In this framework, the input distribution $p(x)$ is factorized into a product of conditionals $\prod_i p(x_i | x_{<i})$, where each pixel is conditioned on all previous pixels. Previous work has shown that very good models can be obtained by using a neural network parametrization of the conditionals (e.g. see our work on NADE \cite{journals/jmlr/LarochelleM11}). Moreover, unlike other approaches based on latent stochastic units that are directed or undirected, the autoregressive approach is able to compute log-probabilities tractably. So in this paper, by considering the specific case of $x$ being an image, they exploit the topology of pixels and investigate appropriate architectures for this.

Among the paper's contributions are:

1. They propose Diagonal BiLSTM units for the PixelRNN, which are efficient (thanks to the use of convolutions) while making it possible to, in effect, condition a pixel's distribution on all the pixels above it (see Figure 2 for an illustration).
2. They demonstrate that residual connections (a form of skip connections, from hidden layer $i-1$ to layer $i+1$) are very effective at learning very deep distribution estimators (they go as deep as 12 layers).
3. They show that it is possible to successfully model the distribution over pixel intensities (effectively an integer between 0 and 255) using a softmax of 256 units.
4. They propose a multi-scale extension of their model, which they apply to larger 64x64 images.

The experiments show that the PixelRNN model based on Diagonal BiLSTM units achieves state-of-the-art performance on the binarized MNIST benchmark, in terms of log-likelihood. They also report excellent log-likelihood on the CIFAR-10 dataset, comparing to previous work based on real-valued density models. Finally, they show that their model is able to generate high quality image samples.
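A minimal sketch of the masked convolution that enforces the autoregressive ordering in the PixelCNN variant, simplified to ignore the paper's special handling of the RGB channel ordering. Mask type 'A' (first layer) hides the center pixel so a pixel never sees itself; type 'B' (later layers) allows it; both hide everything below and to the right.

```python
import torch
import torch.nn as nn

class MaskedConv2d(nn.Conv2d):
    """Conv2d whose kernel is masked so each output only depends on pixels
    above and to the left of (and, for type 'B', including) the current one."""

    def __init__(self, mask_type, *args, **kwargs):
        super().__init__(*args, **kwargs)
        _, _, kH, kW = self.weight.shape
        mask = torch.ones(kH, kW)
        mask[kH // 2, kW // 2 + (mask_type == 'B'):] = 0  # right of (and maybe incl.) center
        mask[kH // 2 + 1:, :] = 0                          # all rows below the center
        self.register_buffer("mask", mask)

    def forward(self, x):
        self.weight.data *= self.mask                      # re-apply the mask before each conv
        return super().forward(x)

# usage sketch: first layer uses mask 'A', subsequent layers use mask 'B'
layer = MaskedConv2d('A', in_channels=1, out_channels=16, kernel_size=7, padding=3)
```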
[link]
How can we learn causal relationships that explain data? We can learn from non-stationary distributions. If we experiment with different factorizations of the relationships between variables, we can observe which ones provide better sample complexity when adapting to distributional shift and are therefore likely to be causal.

If we consider the variables $A$ and $B$, we can factor them in two ways:

$P(A,B) = P(A)P(B|A)$, representing a causal graph like $A \rightarrow B$

$P(A,B) = P(A|B)P(B)$, representing a causal graph like $A \leftarrow B$

The idea is that if we train a model with one of these structures, then when adapting to a new, shifted data distribution it will take longer to adapt if it does not have the correct inductive bias. For example, let's say that the true relationship is $A$ = Raining causes $B$ = Open Umbrella (and not vice versa). Changing the marginal probability of Raining (say because the weather changed) does not change the mechanism that relates $A$ and $B$ (captured by $P(B|A)$), but it will have an impact on the marginal $P(B)$. So after this distributional shift, the function that modeled $P(B|A)$ will not need to change because the relationship is the same; only the function that modeled $P(A)$ will need to change. Under the incorrect factorization $P(B)P(A|B)$, adaptation to the change will be slow because both $P(B)$ and $P(A|B)$ need to be modified to account for the change in $P(A)$ (due to Bayes' rule).

Here a difference in sample complexity can be observed when modeling the joint of the shifted distribution; $B \rightarrow A$ takes longer to adapt:

https://i.imgur.com/B9FEmA7.png

The idea is that the sample complexity of adapting to a new distribution of data is a heuristic that informs us which causal-graph inductive bias is correct. Experimentally this works, and they also observe that as models get more capacity, the difference between the models seems to grow.

This summary was written with the help of Yoshua Bengio.
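A toy sketch of the experiment described above, with binary variables and my own simplified parameterization (not the paper's exact setup): both factorizations are fit on the original distribution, then only $P(A)$ is shifted and the adaptation loss is tracked.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
N, steps, lr = 1000, 50, 0.1

def sample(p_a, n):
    """True mechanism: A ~ Bernoulli(p_a), B = A with prob 0.9 (P(B|A) fixed)."""
    a = (torch.rand(n) < p_a).long()
    flip = (torch.rand(n) < 0.1).long()
    return a, (a + flip) % 2

class Factorization(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.logit_first = torch.nn.Parameter(torch.zeros(1))  # P(first variable)
        self.logit_cond = torch.nn.Parameter(torch.zeros(2))   # P(second | first)

    def nll(self, first, second):
        lp_first = F.logsigmoid(self.logit_first) * first + F.logsigmoid(-self.logit_first) * (1 - first)
        logit = self.logit_cond[first]
        lp_second = F.logsigmoid(logit) * second + F.logsigmoid(-logit) * (1 - second)
        return -(lp_first + lp_second).mean()

def pretrain_then_adapt(order):
    model, losses = Factorization(), []
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    a, b = sample(p_a=0.5, n=10 * N)                 # training distribution
    first, second = (a, b) if order == "A->B" else (b, a)
    for _ in range(500):
        opt.zero_grad(); model.nll(first, second).backward(); opt.step()
    a, b = sample(p_a=0.9, n=N)                      # shifted: only P(A) changed
    first, second = (a, b) if order == "A->B" else (b, a)
    for _ in range(steps):
        loss = model.nll(first, second)
        opt.zero_grad(); loss.backward(); opt.step()
        losses.append(loss.item())
    return losses

# The A->B factorization typically recovers a low joint NLL in fewer steps than B->A.
print(pretrain_then_adapt("A->B")[:5])
print(pretrain_then_adapt("B->A")[:5])
```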
[link]
**Object detection** is the task of drawing one bounding box around each instance of the type of object one wants to detect. Typically, image classification is done before object detection. With neural networks, the usual procedure for object detection is to train a classification network and replace the last layer with a regression layer which essentially predicts pixel-wise whether the object is there or not. A bounding box inference algorithm is added at the end to make a consistent prediction (see [Deep Neural Networks for Object Detection](http://papers.nips.cc/paper/5207-deep-neural-networks-for-object-detection.pdf)).

The paper introduces RPNs (Region Proposal Networks). They are trained end-to-end to generate region proposals, simultaneously regressing region bounds and objectness scores at each location on a regular grid. RPNs are one type of fully convolutional network: they take an image of any size as input and output a set of rectangular object proposals, each with an objectness score.

## See also

* [R-CNN](http://www.shortscience.org/paper?bibtexKey=conf/iccv/Girshick15#joecohen)
* [Fast R-CNN](http://www.shortscience.org/paper?bibtexKey=conf/iccv/Girshick15#joecohen)
* [Faster R-CNN](http://www.shortscience.org/paper?bibtexKey=conf/nips/RenHGS15#martinthoma)
* [Mask R-CNN](http://www.shortscience.org/paper?bibtexKey=journals/corr/HeGDG17)
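A minimal PyTorch sketch of an RPN head, as described in the paper (a small conv slides over the backbone feature map; each location predicts, for each of k anchors, 2 objectness scores and 4 box-regression offsets). Channel sizes and names here are my own simplification.

```python
import torch
import torch.nn as nn

class RPNHead(nn.Module):
    def __init__(self, in_channels=512, mid_channels=512, num_anchors=9):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, mid_channels, kernel_size=3, padding=1)
        self.cls = nn.Conv2d(mid_channels, num_anchors * 2, kernel_size=1)  # object / not object
        self.reg = nn.Conv2d(mid_channels, num_anchors * 4, kernel_size=1)  # box deltas per anchor

    def forward(self, feature_map):
        h = torch.relu(self.conv(feature_map))
        return self.cls(h), self.reg(h)   # objectness scores and proposals at every grid location

# usage sketch on a dummy VGG-style feature map
scores, deltas = RPNHead()(torch.randn(1, 512, 38, 50))
```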