Welcome to ShortScience.org!

- ShortScience.org is a platform for post-publication discussion aiming to improve accessibility and reproducibility of research ideas.
- The website has 1478 public summaries, mostly in machine learning, written by the community and organized by paper, conference, and year.
- Reading summaries of papers is useful to obtain the perspective and insight of another reader, why they liked or disliked it, and their attempt to demystify complicated sections.
- Also, writing summaries is a good exercise to understand the content of a paper because you are forced to challenge your assumptions when explaining it.
- Finally, you can keep up to date with the flood of research by reading the latest summaries on our Twitter and Facebook pages.

Group Normalization

Yuxin Wu and Kaiming He

arXiv e-Print archive - 2018 via Local arXiv

Keywords: cs.CV, cs.LG

**First published:** 2018/03/22 (2 years ago)

**Abstract:** Batch Normalization (BN) is a milestone technique in the development of deep
learning, enabling various networks to train. However, normalizing along the
batch dimension introduces problems --- BN's error increases rapidly when the
batch size becomes smaller, caused by inaccurate batch statistics estimation.
This limits BN's usage for training larger models and transferring features to
computer vision tasks including detection, segmentation, and video, which
require small batches constrained by memory consumption. In this paper, we
present Group Normalization (GN) as a simple alternative to BN. GN divides the
channels into groups and computes within each group the mean and variance for
normalization. GN's computation is independent of batch sizes, and its accuracy
is stable in a wide range of batch sizes. On ResNet-50 trained in ImageNet, GN
has 10.6% lower error than its BN counterpart when using a batch size of 2;
when using typical batch sizes, GN is comparably good with BN and outperforms
other normalization variants. Moreover, GN can be naturally transferred from
pre-training to fine-tuning. GN can outperform its BN-based counterparts for
object detection and segmentation in COCO, and for video classification in
Kinetics, showing that GN can effectively replace the powerful BN in a variety
of tasks. GN can be easily implemented by a few lines of code in modern
libraries.

If you were to survey researchers, and ask them to name the 5 most broadly influential ideas in Machine Learning from the last 5 years, I’d bet good money that Batch Normalization would be somewhere on everyone’s lists. Before Batch Norm, training meaningfully deep neural networks was an unstable process, and one that often took a long time to converge to success. When we added Batch Norm to models, it allowed us to increase our learning rates substantially (leading to quicker training) without the risk of activations either collapsing or blowing up in values. It had this effect because it addressed one of the key difficulties of deep networks: internal covariate shift. To understand this, imagine the smaller problem, of a one-layer model that’s trying to classify based on a set of input features. Now, imagine that, over the course of training, the input distribution of features moved around, so that, perhaps, a value that was at the 70th percentile of the data distribution initially is now at the 30th. We have an obvious intuition that this would make the model quite hard to train, because it would learn some mapping between feature values and class at the beginning of training, but that would become invalid by the end. This is, fundamentally, the problem faced by higher layers of deep networks, since, if the distribution of activations in a lower layer changed even by a small amount, that can cause a “butterfly effect” style outcome, where the activation distributions of higher layers change more dramatically. Batch Normalization - which takes each feature “channel” a network learns, and normalizes [normalize = subtract mean, divide by standard deviation] it by the mean and variance of that feature over spatial locations and over all the observations in a given batch - helps solve this problem because it ensures that, throughout the course of training, the distribution of inputs that a given layer sees stays roughly constant, no matter what the lower layers get up to.
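As a concrete sketch of the normalization step just described (a minimal version that omits Batch Norm's learnable scale/shift parameters and running statistics), the per-channel statistics are pooled over the batch and spatial dimensions:

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    # x: (N, C, H, W) activations; statistics are per channel,
    # pooled over the batch (N) and spatial (H, W) dimensions.
    mean = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

x = np.random.randn(8, 4, 5, 5) * 3.0 + 2.0
y = batch_norm(x)
# each channel of y now has roughly zero mean and unit variance across the batch
```

The small batch-size problem discussed below is visible here: with `N = 2`, `mean` and `var` are computed from very few values per channel and become noisy.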
On the whole, Batch Norm has been wildly successful at stabilizing training, and is now canonized - along with the likes of ReLU and Dropout - as one of the default sensible training procedures for any given network. However, it does have its difficulties and downsides. One salient one of these comes about when you train using very small batch sizes - in the range of 2-16 examples per batch. Under these circumstances, the mean and variance calculated off of that batch are noisy and high variance (for the general reason that statistics calculated off of small sample sizes are noisy and high variance), which takes away from the stability that Batch Norm is trying to provide. One proposed alternative to Batch Norm, that didn’t run into this problem of small sample sizes, is Layer Normalization. This operates under the assumption that the activations of all feature “channels” within a given layer hopefully have roughly similar distributions, and, so, you can normalize all of them by taking the aggregate mean over all channels, *for a given observation*, and use that as the mean and variance you normalize by. Because there are typically many channels in a given layer, this means that you have many “samples” that go into the mean and variance. However, this assumption - that the distributions for each feature channel are roughly the same - can be an incorrect one. A useful model I have for thinking about the distinction between these two approaches is the idea that both are calculating approximations of an underlying abstract notion: the in-the-limit mean and variance of a single feature channel, at a given point in time. Batch Normalization is an approximation of that insofar as it only has a small sample of points to work with, and so its estimate will tend to be high variance.
Layer Normalization is an approximation insofar as it makes the assumption that feature distributions are aligned across channels: if this turns out not to be the case, individual channels will have normalizations that are biased, due to being pulled towards the mean and variance calculated over an aggregate of channels that are different than them. Group Norm tries to find a balance point between these two approaches, one that uses multiple channels, and normalizes within a given instance (to avoid the problems of small batch size), but, instead of calculating the mean and variance over all channels, calculates them over a group of channels that represents a subset. The inspiration for this idea comes from the fact that, in old school computer vision, it was typical to have parts of your feature vector that - for example - represented a histogram of some value (say: localized contrast) over the image. Since these multiple values all corresponded to a larger shared “group” feature, it was natural to treat them as a unit when normalizing. If a group of features all represent a similar idea, then their distributions will be more likely to be aligned, and therefore you have less of the bias issue. One confusing element of this paper for me was that the motivation part of the paper strongly implied that the reason group norm is sensible is that you are able to combine statistically dependent channels into a group together. However, as far as I can tell, there’s no actual clustering or similarity analysis of channels that is done to place certain channels into certain groups; it’s just done semi-randomly based on the index location within the feature channel vector. So, under this implementation, it seems like the benefits of group norm are less because of any explicit seeking out of dependent channels, and more that just having fewer channels in each group means that each individual channel makes up more of the weight in its group, which does something to reduce the bias effect anyway.
The upshot of the Group Norm paper, results-wise, is that Group Norm performs better than both Batch Norm and Layer Norm at very low batch sizes. This is useful if you’re training on very dense data (e.g. high res video), where it might be difficult to store more than a few observations in memory at a time. However, once you get to batch sizes of ~24, Batch Norm starts to do better, presumably since that’s a large enough sample size to reduce variance, and you get to the point where the variance of BN is preferable to the bias of GN.
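The grouping logic described above really is just a few lines of array manipulation, matching the paper's claim that GN is easy to implement. Here is a minimal NumPy sketch (omitting the learnable per-channel scale and shift):

```python
import numpy as np

def group_norm(x, num_groups, eps=1e-5):
    # x: (N, C, H, W); statistics are per sample, per group of channels,
    # so nothing here depends on the batch size N.
    n, c, h, w = x.shape
    g = x.reshape(n, num_groups, c // num_groups, h, w)
    mean = g.mean(axis=(2, 3, 4), keepdims=True)
    var = g.var(axis=(2, 3, 4), keepdims=True)
    g = (g - mean) / np.sqrt(var + eps)
    return g.reshape(n, c, h, w)

x = np.random.randn(2, 8, 4, 4)   # small batch of 2, where BN would struggle
y = group_norm(x, num_groups=4)   # groups of 2 adjacent channels
```

Setting `num_groups=1` recovers Layer Norm and `num_groups=C` recovers Instance Norm, which matches the framing of GN as a balance point between per-layer and per-channel statistics.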

Online Fast Adaptation and Knowledge Accumulation: a New Approach to Continual Learning

Massimo Caccia and Pau Rodriguez and Oleksiy Ostapenko and Fabrice Normandin and Min Lin and Lucas Caccia and Issam Laradji and Irina Rish and Alexandre Lacoste and David Vazquez and Laurent Charlin

arXiv e-Print archive - 2020 via Local arXiv

Keywords: cs.AI, cs.LG

**First published:** 2020/03/12 (2 weeks ago)

**Abstract:** Learning from non-stationary data remains a great challenge for machine
learning. Continual learning addresses this problem in scenarios where the
learning agent faces a stream of changing tasks. In these scenarios, the agent
is expected to retain its highest performance on previous tasks without
revisiting them while adapting well to the new tasks. Two new recent
continual-learning scenarios have been proposed. In meta-continual learning,
the model is pre-trained to minimize catastrophic forgetting when trained on a
sequence of tasks. In continual-meta learning, the goal is faster remembering,
i.e., focusing on how quickly the agent recovers performance rather than
measuring the agent's performance without any adaptation. Both scenarios have
the potential to propel the field forward. Yet in their original formulations,
they each have limitations. As a remedy, we propose a more general scenario
where an agent must quickly solve (new) out-of-distribution tasks, while also
requiring fast remembering. We show that current continual learning, meta
learning, meta-continual learning, and continual-meta learning techniques fail
in this new scenario. Accordingly, we propose a strong baseline:
Continual-MAML, an online extension of the popular MAML algorithm. In our
empirical experiments, we show that our method is better suited to the new
scenario than the methodologies mentioned above, as well as standard continual
learning and meta learning approaches.

disclaimer: I'm the first author of the paper

## TL;DR

We have made a lot of progress on catastrophic forgetting within the standard evaluation protocol, i.e. sequentially learning a stream of tasks and testing our models' capacity to remember them all. We think it's time for a new approach to Continual Learning (CL), coined OSAKA, which is more aligned with real-life applications of CL. It brings CL closer to Online Learning and Open-World Learning.

The main modifications we propose:
- bring CL closer to Online Learning, i.e. at test time the model is continually learning and evaluated on its online predictions
- it's fine to forget, as long as you can quickly remember (just like we humans do)
- we allow pre-training (because you wouldn't deploy an untrained CL system, right?), but at test time the model will have to quickly learn new out-of-distribution (OoD) tasks (because the world is full of surprises)
- the task distribution is actually a hidden Markov chain. This implies:
  - new and old tasks can re-occur (just like in real life). Better remember them quickly if you want to get a good total performance!
  - tasks have different lengths
  - the task boundaries are unknown (task-agnostic setting)

### Bonus

We provide a unifying framework explaining the space of machine learning settings {supervised learning, meta learning, continual learning, meta-continual learning, continual-meta learning}, in case it was starting to get confusing :p

## Motivation

We imagine an agent, embedded or not, first pre-trained in a controlled environment and later deployed in the real world, where it faces new or unexpected situations. This scenario is relevant for many applications. For instance, in robotics, the agent is pre-trained in a factory and deployed in homes or in manufacturing plants, where it will need to adapt to new domains and maybe solve new tasks. Likewise, a virtual assistant can be pre-trained on static datasets and deployed in a user’s life to fit its personal needs.
Further motivations can be found in time-series forecasting, e.g. market prediction, game playing, autonomous customer service, recommendation systems, autonomous driving, to name a few. In this scenario, we are interested in the cumulative performance of the agent throughout its lifetime. Differently, standard CL reports the agent’s final performance on all tasks at the end of its life. In order to succeed in this scenario, agents need the ability to learn new tasks as well as to quickly remember old ones.

## Unifying Framework

We propose a unifying framework explaining the space of machine learning settings {supervised learning, meta learning, continual learning, meta-continual learning, continual-meta learning} with meta-learning terminology.

https://i.imgur.com/U16kHXk.png

(easier to digest with the accompanying text)

## OSAKA

The main features of the evaluation framework are:
- task agnosticism
- pre-training is allowed, but OoD tasks at test time
- task revisiting
- controllable non-stationarity
- online evaluation

(see the paper for the motivations of these features)

## Continual-MAML: an initial baseline

A simple extension of MAML that is better suited than previous methods to the proposed setting.

https://i.imgur.com/C86WUc8.png

Its features are:
- fast adaptation
- dynamic representation
- task-boundary detection
- computational efficiency

## Experiments

We provide a suite of 3 benchmarks to test algorithms in the new setting. The first includes the Omniglot, MNIST and FashionMNIST datasets. The second and third use the Synbols (Lacoste et al. 2018) and TieredImageNet datasets, respectively. The first set of experiments shows that the baseline outperforms previous approaches, i.e. supervised learning, meta learning, continual learning, meta-continual learning, and continual-meta learning, in the new setting.

https://i.imgur.com/IQ1WYTp.png

The second and third experiments lead us to similar conclusions.

code: https://github.com/ElementAI/osaka
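The Continual-MAML loop can be sketched roughly as follows. This is my own toy simplification on a linear model, not the paper's code: the function names, the loss-spike boundary detector, and the update rules are all illustrative assumptions. The point is the interplay of a slow meta-learned initialization `phi` and a fast copy `w` that adapts online, with detected task boundaries folding knowledge back into `phi`.

```python
import numpy as np

def continual_maml(stream, phi, lr_fast=0.5, lr_slow=0.1, threshold=1.0):
    # stream: iterable of (x, y) pairs arriving online; phi: slow init weights.
    w = phi.copy()              # fast weights, adapted at every step
    prev_loss = None
    for x, y in stream:
        pred = w @ x            # online prediction (this is what gets evaluated)
        loss = (pred - y) ** 2
        if prev_loss is not None and loss - prev_loss > threshold:
            # crude task-boundary detection: a sudden loss spike
            phi = phi + lr_slow * (w - phi)   # accumulate knowledge slowly
            w = phi.copy()                    # fast re-adaptation restarts from phi
        grad = 2 * (pred - y) * x             # gradient of the squared error
        w = w - lr_fast * grad                # fast adaptation step
        prev_loss = loss
    return phi, w
```

On a stream that switches between two linear tasks, `w` re-converges to each task within a step or two of the detected boundary, which is the "fast remembering" behaviour the scenario rewards.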

Relational Forward Models for Multi-Agent Learning

Andrea Tacchetti and H. Francis Song and Pedro A. M. Mediano and Vinicius Zambaldi and Neil C. Rabinowitz and Thore Graepel and Matthew Botvinick and Peter W. Battaglia

arXiv e-Print archive - 2018 via Local arXiv

Keywords: cs.LG, cs.AI, cs.MA, stat.ML

**First published:** 2018/09/28 (1 year ago)

**Abstract:** The behavioral dynamics of multi-agent systems have a rich and orderly
structure, which can be leveraged to understand these systems, and to improve
how artificial agents learn to operate in them. Here we introduce Relational
Forward Models (RFM) for multi-agent learning, networks that can learn to make
accurate predictions of agents' future behavior in multi-agent environments.
Because these models operate on the discrete entities and relations present in
the environment, they produce interpretable intermediate representations which
offer insights into what drives agents' behavior, and what events mediate the
intensity and valence of social interactions. Furthermore, we show that
embedding RFM modules inside agents results in faster learning systems compared
to non-augmented baselines. As more and more of the autonomous systems we
develop and interact with become multi-agent in nature, developing richer
analysis tools for characterizing how and why agents make decisions is
increasingly necessary. Moreover, developing artificial agents that quickly and
safely learn to coordinate with one another, and with humans in shared
environments, is crucial.

One of the dominant narratives of the deep learning renaissance has been the value of well-designed inductive bias - structural choices that shape what a model learns. The biggest example of this can be found in convolutional networks, where models achieve a dramatic parameter reduction by having features maps learn local patterns, which can then be re-used across the whole image. This is based on the prior belief that patterns in local images are generally locally contiguous, and so having feature maps that focus only on small (and gradually larger) local areas is a good fit for that prior. This paper operates in a similar spirit, except its input data isn’t in the form of an image, but a graph: the social graph of multiple agents operating within a multi-agent RL setting. In some sense, a graph is just a more general form of a pixel image: where a pixel within an image has a fixed number of neighbors, which have fixed discrete relationships to it (up, down, left, right), nodes within graphs have an arbitrary number of neighbors, which can have arbitrary numbers and types of attributes attached to that relationship. The authors of this paper use graph networks as a sort of auxiliary information processing system alongside a more typical policy learning framework, on tasks that require group coordination and knowledge sharing to complete successfully. For example, each agent might be rewarded based on the aggregate reward of all agents together, and, in the stag hunt, it might require collaborative effort by multiple agents to successfully “capture” a reward. Because of this, you might imagine that it would be valuable to be able to predict what other agents within the game are going to do under certain circumstances, so that you can shape your strategy accordingly.
The graph network used in this model represents both agents and objects in the environment as nodes, which have attributes including their position, whether they’re available or not (for capture-able objects), and what their last action was. As best I can tell, all agents start out with directed connections going both ways to all other agents, and to all objects in the environment, with the only edge attribute being whether the players are on the same team, for competitive environments. Given this setup, the graph network works through a sort of “diffusion” of information, analogous to a message passing algorithm. At each iteration (analogous to a layer), the edge features pull in information from their past value and sender and receiver nodes, as well as from a “global feature”. Then, all of the nodes pull in information from their edges, and their own past value. Finally, this “global attribute” gets updated based on summations over the newly-updated node and edge information. (If you were predicting attributes that were graph-level attributes, this global attribute might be where you’d do that prediction. However, in this case, we’re just interested in predicting agent-level actions). https://i.imgur.com/luFlhfJ.png All of this has the effect of explicitly modeling agents as entities that both have information, and have connections to other entities. One benefit the authors claim of this structure is that it allows them more interpretability: when they “play out” the values of their graph network, which they call a Relational Forward Model or RFM, they observe edge values for two agents go up if those agents are about to collaborate on an action, and observe edge values for an agent and an object go up before that object is captured. Because this information is carefully shaped and structured, it makes it easier for humans to understand, and, in the tests the authors ran, appears to also help agents do better in collaborative games. 
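The diffusion step described above can be sketched as follows. This is a hypothetical, stripped-down version: a real RFM uses learned neural networks for the edge, node, and global update functions, whereas here each update is a fixed illustrative rule, just to make the flow of information concrete.

```python
import numpy as np

def gn_step(nodes, edges, senders, receivers, u):
    # nodes: (N, D); edges: (E, D); u: (D,) global feature.
    # 1) edges pull from their past value, sender, receiver, and the global feature
    new_edges = 0.5 * (edges + nodes[senders] + nodes[receivers] + u)
    # 2) nodes pull from their past value and their incoming edge messages
    new_nodes = nodes.copy()
    for i, r in enumerate(receivers):
        new_nodes[r] = new_nodes[r] + new_edges[i]
    # 3) the global attribute updates from sums over the new nodes and edges
    new_u = u + new_nodes.sum(axis=0) + new_edges.sum(axis=0)
    return new_nodes, new_edges, new_u

nodes = np.zeros((3, 2))                        # e.g. 2 agents + 1 object
edges = np.ones((2, 2))                         # 2 directed edges
senders, receivers = np.array([0, 1]), np.array([1, 2])
nodes, edges, u = gn_step(nodes, edges, senders, receivers, np.zeros(2))
```

Stacking several such steps (the "iterations" in the summary) lets information from one agent's node diffuse to agents several hops away, which is what makes the edge activations interpretable as pairwise interaction strength.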
https://i.imgur.com/BCKSmIb.png

While I find graph networks quite interesting, and multi-agent learning quite interesting, I’m a little more uncertain about the inherent “graphiness” of this problem, since there aren’t really meaningful inherent edges between agents. One thing I am curious about here is how methods like these would work in situations of sparser graphs, or, places where a node’s connectivity to its neighbors is more distinct from its connectivity to the average other node in the graph. Here, every node is connected to every other node, so the explicit information localization function of graph networks is less pronounced. I might naively think that - to whatever extent the graph is designed in a way that captures information meaningful to the task - explicit graph methods would have an even greater comparative advantage in that setting.

Deep Forest: Towards An Alternative to Deep Neural Networks

Zhou, Zhi-Hua and Feng, Ji

International Joint Conference on Artificial Intelligence - 2017 via Local Bibsonomy

Keywords: dblp

https://i.imgur.com/QxHktQC.png

The fundamental question the paper sets out to answer is whether deep learning can be realized with prediction models other than neural networks. The authors propose deep forest, a realization of deep learning using random forests (gcForest). The idea is simple and was inspired by representation learning in deep neural networks, which mostly relies on the layer-by-layer processing of raw features.

https://i.imgur.com/Wh6xAbP.png

Importance: Deep Neural Networks (DNNs) have several drawbacks. They need a lot of data to train. They have many hyper-parameters to tune. Moreover, not everyone has access to GPUs to build and train them. Training a DNN is mostly an art rather than a scientific/engineering task. Finally, theoretical analysis of DNNs is extremely difficult. The aim of the paper is to propose a model that addresses these issues and at the same time achieves performance competitive with deep neural networks. Model: The proposed model consists of two parts. The first part is a deep forest ensemble with a cascade structure similar to the layer-by-layer architecture of a DNN. Each level is an ensemble of random forests, and to encourage diversity a combination of completely-random forests and typical random forests is employed (the number of trees in each forest is a hyper-parameter). The estimated class distribution, obtained by k-fold CV from the forests, forms a class vector, which is then concatenated with the original feature vector as input to the next level of the cascade. The second part is multi-grained scanning for representation learning, where spatial and sequential relationships are captured using a sliding-window scan (with various window sizes) over the raw features, similar to the convolutional and recurrent layers of a DNN. Those features are then passed to a completely-random forest and a typical random forest in order to generate transformed features.
When transformed feature vectors are too long to be accommodated, feature sampling can be performed. Benefits: gcForest has far fewer hyper-parameters than deep neural networks. The number of cascade levels is determined adaptively, so the model complexity is set automatically: if growing a new level does not improve performance, the growth of the cascade terminates. Its performance is quite robust to hyper-parameter settings, such that in most cases, and across data from different domains, it is able to get excellent performance using the default settings. gcForest achieves performance highly competitive with deep neural networks, while its training time cost is smaller than that of a DNN. Experimental results: the authors compared the performance of gcForest and DNNs by fixing an architecture for gcForest and testing various architectures for the DNNs, while assuming some fixed hyper-parameters for the DNNs, such as the activation, the loss function, and the dropout rate. They used MNIST (digit image recognition), ORL (face recognition), GTZAN (music classification), sEMG (hand movement recognition), IMDB (movie review sentiment analysis), and some low-dimensional datasets. gcForest got the best results in these experiments, sometimes by significant margins. My Opinions: The main goal of the paper is interesting; however, one concern is the amount of effort they put into finding the best CNN for the experiments, given that they themselves note that finding a good configuration is an art rather than scientific work. For instance, they could have used deep recurrent layers instead of an MLP for the sentiment analysis dataset, which is typically a better option for this task. As for the time complexity of the method, they only reported it for one experiment, not all.
More importantly, the CIFAR-10 result in the supplementary materials shows a big gap between the best deep learning results and gcForest's result, although the authors argue that gcForest can be tuned to get better results. gcForest was also compared to non-deep-learning methods such as random forests and SVMs, and showed superior results there. It would have been good to have the time-complexity comparison for them as well. In my view, the paper is good as a starting point for answering the original question; however, the proposed method and the experimental results are not convincing enough. Github link: https://github.com/kingfengji/gcForest
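The cascade idea described above (each level's class vectors concatenated with the original features and fed to the next level) can be sketched as below. `StubForest` is a hypothetical stand-in for a trained forest that outputs class probabilities; real gcForest fits completely-random and typical random forests with k-fold CV at every level, which is omitted here.

```python
import numpy as np

class StubForest:
    """Stand-in for a trained random forest that outputs class probabilities."""
    def __init__(self, n_classes, seed):
        self.n_classes = n_classes
        self.rng = np.random.default_rng(seed)
    def predict_proba(self, X):
        # placeholder: random class-probability vectors, one row per sample
        p = self.rng.random((X.shape[0], self.n_classes))
        return p / p.sum(axis=1, keepdims=True)

def cascade_predict(X, levels):
    # Each level's class vectors are concatenated with the ORIGINAL features
    # and fed to the next level -- the layer-by-layer idea of gcForest.
    features, last_probs = X, None
    for forests in levels:
        probas = [f.predict_proba(features) for f in forests]
        last_probs = np.mean(probas, axis=0)    # average the level's class vectors
        features = np.hstack([X] + probas)
    return last_probs.argmax(axis=1)

X = np.random.default_rng(1).random((10, 5))
levels = [[StubForest(3, seed=s) for s in (0, 1)] for _ in range(2)]
preds = cascade_predict(X, levels)
```

The adaptive-depth behaviour described in the summary would correspond to stopping the loop when a level's validation accuracy stops improving.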

Layer Normalization

Jimmy Lei Ba and Jamie Ryan Kiros and Geoffrey E. Hinton

arXiv e-Print archive - 2016 via Local arXiv

Keywords: stat.ML, cs.LG

**First published:** 2016/07/21 (3 years ago)

**Abstract:** Training state-of-the-art, deep neural networks is computationally expensive.
One way to reduce the training time is to normalize the activities of the
neurons. A recently introduced technique called batch normalization uses the
distribution of the summed input to a neuron over a mini-batch of training
cases to compute a mean and variance which are then used to normalize the
summed input to that neuron on each training case. This significantly reduces
the training time in feed-forward neural networks. However, the effect of batch
normalization is dependent on the mini-batch size and it is not obvious how to
apply it to recurrent neural networks. In this paper, we transpose batch
normalization into layer normalization by computing the mean and variance used
for normalization from all of the summed inputs to the neurons in a layer on a
single training case. Like batch normalization, we also give each neuron its
own adaptive bias and gain which are applied after the normalization but before
the non-linearity. Unlike batch normalization, layer normalization performs
exactly the same computation at training and test times. It is also
straightforward to apply to recurrent neural networks by computing the
normalization statistics separately at each time step. Layer normalization is
very effective at stabilizing the hidden state dynamics in recurrent networks.
Empirically, we show that layer normalization can substantially reduce the
training time compared with previously published techniques.

TLDR; The authors propose a new normalization scheme called "Layer Normalization" that works especially well for recurrent networks. Layer Normalization is similar to Batch Normalization, but only depends on a single training case. As such, it's well suited for variable length sequences or small batches. In Layer Normalization each hidden unit shares the same normalization term. The authors show through experiments that Layer Normalization converges faster, and sometimes to better solutions, than batch- or unnormalized RNNs. Batch normalization still performs better for CNNs.
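Concretely, the per-example normalization reads like this minimal NumPy sketch, including the per-unit gain and bias the paper applies after normalization but before the non-linearity:

```python
import numpy as np

def layer_norm(h, gain, bias, eps=1e-5):
    # h: (batch, hidden); every sample is normalized by its OWN mean/variance
    # over the hidden units, so nothing depends on the batch size.
    mean = h.mean(axis=-1, keepdims=True)
    std = h.std(axis=-1, keepdims=True)
    return gain * (h - mean) / (std + eps) + bias

h = np.random.randn(3, 16) * 4.0 + 1.0
out = layer_norm(h, gain=np.ones(16), bias=np.zeros(16))
```

Because the statistics involve no batch dimension, exactly the same computation runs at training and test time, and applying it per time step in an RNN is straightforward.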

Using Fast Weights to Attend to the Recent Past

Jimmy Ba and Geoffrey Hinton and Volodymyr Mnih and Joel Z. Leibo and Catalin Ionescu

arXiv e-Print archive - 2016 via Local arXiv

Keywords: stat.ML, cs.LG, cs.NE

**First published:** 2016/10/20 (3 years ago)

**Abstract:** Until recently, research on artificial neural networks was largely restricted
to systems with only two types of variable: Neural activities that represent
the current or recent input and weights that learn to capture regularities
among inputs, outputs and payoffs. There is no good reason for this
restriction. Synapses have dynamics at many different time-scales and this
suggests that artificial neural networks might benefit from variables that
change slower than activities but much faster than the standard weights. These
"fast weights" can be used to store temporary memories of the recent past and
they provide a neurally plausible way of implementing the type of attention to
the past that has recently proved very helpful in sequence-to-sequence models.
By using fast weights we can avoid the need to store copies of neural activity
patterns.

This paper presents a recurrent neural network architecture in which some of the recurrent weights dynamically change during the forward pass, using a hebbian-like rule. They correspond to the matrices $A(t)$ in the figure below: ![Fast weights RNN figure](http://i.imgur.com/DCznSf4.png) These weights $A(t)$ are referred to as *fast weights*. Comparatively, the recurrent weights $W$ are referred to as slow weights, since they are only changing due to normal training and are otherwise kept constant at test time. More specifically, the proposed fast weights RNN computes a series of hidden states $h(t)$ over time steps $t$, but, unlike regular RNNs, the transition from $h(t)$ to $h(t+1)$ consists of multiple ($S$) recurrent layers $h_1(t+1), \dots, h_{S-1}(t+1), h_S(t+1)$, defined as follows: $$h_{s+1}(t+1) = f(W h(t) + C x(t) + A(t) h_s(t+1))$$ where $f$ is an element-wise non-linearity such as the ReLU activation. The next hidden state $h(t+1)$ is simply defined as the last "inner loop" hidden state $h_S(t+1)$, before moving to the next time step. As for the fast weights $A(t)$, they too change between time steps, using the hebbian-like rule: $$A(t+1) = \lambda A(t) + \eta h(t) h(t)^T$$ where $\lambda$ acts as a decay rate (to partially forget some of what's in the past) and $\eta$ as the fast weights' "learning rate" (not to be confused with the learning rate used during backprop). Thus, the role played by the fast weights is to rapidly adjust to the recent hidden states and remember the recent past. In fact, the authors show an explicit relation between these fast weights and memory-augmented architectures that have recently been popular. Indeed, by recursively applying and expanding the equation for the fast weights, one obtains $$A(t) = \eta \sum_{\tau = 1}^{\tau = t-1}\lambda^{t-\tau-1} h(\tau) h(\tau)^T$$ *(note the difference with Equation 3 of the paper...
I think there was a typo)* which implies that when computing the $A(t) h_s(t+1)$ term in the expression to go from $h_s(t+1)$ to $h_{s+1}(t+1)$, this term actually corresponds to $$A(t) h_s(t+1) = \eta \sum_{\tau =1}^{\tau = t-1} \lambda^{t-\tau-1} h(\tau) (h(\tau)^T h_s(t+1))$$ i.e. $A(t) h_s(t+1)$ is a weighted sum of all previous hidden states $h(\tau)$, with each hidden state weighted by an "attention weight" $h(\tau)^T h_s(t+1)$. The difference with many recent memory-augmented architectures is thus that the attention weights aren't computed using a softmax non-linearity. Experimentally, they find it beneficial to use [layer normalization](https://arxiv.org/abs/1607.06450). Good values for $\eta$ and $\lambda$ seem to be 0.5 and 0.9 respectively. I'm not 100% sure, but I also understand that using $S=1$, i.e. using the fast weights only once per time step, was usually found to be optimal. Also see Figure 3 for the architecture used on the image classification datasets, which is slightly more involved. The authors present a series of 4 experiments, comparing with regular RNNs (IRNNs, which are RNNs with ReLU units and whose recurrent weights are initialized to a scaled identity matrix) and LSTMs (as well as an associative LSTM for a synthetic associative retrieval task and ConvNets for the two image datasets). Generally, experiments illustrate that the fast weights RNN tends to train faster (in number of updates) and better than the other recurrent architectures. Surprisingly, the fast weights RNN can even be competitive with a ConvNet on the two image classification benchmarks, where the RNN traverses glimpses from the image using a fixed policy. **My two cents** This is a very thought provoking paper which, based on the comparison with LSTMs, suggests that fast weights RNNs might be a very good alternative. I'd be quite curious to see what would happen if one was to replace LSTMs with them in the myriad of papers using LSTMs (e.g. all the Seq2Seq work).
Intuitively, LSTMs seem to be able to do more than just attending to the recent past. But, for a given task, if one was to observe that fast weights RNNs are competitive with LSTMs, it would suggest that the LSTM isn't doing something much more complex. So it would be interesting to determine for which tasks the extra capacity of an LSTM is actually valuable and exploitable. Hopefully the authors will release some code, to facilitate this exploration. The discussion at the end of Section 3 on how exploiting the "memory augmented" view of fast weights is useful to allow the use of minibatches is interesting. However, it also suggests that computations in the fast weights RNN scale quadratically with the sequence size (since in this view, the RNN technically must attend to all previous hidden states, since the beginning of the sequence). This is something to keep in mind if one was to consider applying this to very long sequences (i.e. much longer than the hidden state dimensionality). Also, I don't quite get the argument that the "memory augmented" view of fast weights is more amenable to mini-batch training. I understand that having an explicit weight matrix $A(t)$ for each minibatch sequence complicates things. However, in the memory-augmented view, we also have a "memory matrix" that is different for each sequence, and yet we can handle that fine. The problem I can imagine is that storing a *sequence of arbitrary weight matrices* for each sequence might be storage demanding (and thus perhaps make it impossible to store a forward/backward pass for more than one sequence at a time), while the implicit memory matrix only requires appending a new row at each time step. Perhaps the argument to be made here is more that there's already mini-batch compatible code out there for dealing with the use of a memory matrix of stored previous memory states. 
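To make the memory-augmented view concrete, here is a small NumPy sketch (my own, using the summary's notation) that computes the $A(t) h_s(t+1)$ term directly from the stored hidden states, without ever materializing $A(t)$:

```python
import numpy as np

def fast_weight_term_memory_view(hs_past, h_s, eta=0.5, lam=0.9):
    """A(t) h_s computed as an attention-like weighted sum over the stored
    hidden states h(1), ..., h(t-1), instead of materializing A(t)."""
    t = len(hs_past) + 1
    out = np.zeros_like(h_s)
    for tau, h_tau in enumerate(hs_past, start=1):
        # the "attention weight" is the scalar product h(tau)^T h_s
        out += lam ** (t - tau - 1) * (h_tau @ h_s) * h_tau
    return eta * out
```

Since this loops over all previous hidden states at every step, the cost over a whole sequence is quadratic in its length, which is exactly the scaling concern mentioned above.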
This work bears some (partial) resemblance to other recent work, which may serve as food for thought here. The use of possibly multiple computation layers between time steps reminds me of [Adaptive Computation Time (ACT) RNN](http://www.shortscience.org/paper?bibtexKey=journals/corr/Graves16). Also, expressing a backpropable architecture that involves updates to weights (here, Hebbian-like updates) reminds me of recent work that does backprop through the updates of a gradient descent procedure (for instance as in [this work](http://www.shortscience.org/paper?bibtexKey=conf/icml/MaclaurinDA15)). Finally, while I was familiar with the notion of fast weights from the work on [Using Fast Weights to Improve Persistent Contrastive Divergence](http://people.ee.duke.edu/~lcarin/FastGibbsMixing.pdf), I didn't realize that this concept dated as far back as the late 80s. So, for young researchers out there looking for inspiration for research ideas, this paper confirms that looking at the older neural network literature for inspiration is probably a very good strategy :-) To sum up, this is really nice work, and I'm looking forward to the NIPS 2016 oral presentation of it!
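For concreteness, the fast-weights recurrence described above can be sketched in a few lines of NumPy (a minimal sketch with my own shapes and default hyperparameters; layer normalization is omitted):

```python
import numpy as np

def fast_weights_step(h, x, A, W, C, S=1, eta=0.5, lam=0.9):
    """One transition from h(t) to h(t+1), with S inner-loop applications
    of the fast weights A(t), followed by the Hebbian-like update of A."""
    relu = lambda z: np.maximum(z, 0.0)
    slow_part = W @ h + C @ x             # slow-weight term, fixed within the step
    h_s = relu(slow_part)                 # initial inner-loop state
    for _ in range(S):
        # h_{s+1}(t+1) = f(W h(t) + C x(t) + A(t) h_s(t+1))
        h_s = relu(slow_part + A @ h_s)
    A_next = lam * A + eta * np.outer(h, h)   # A(t+1) = lambda A(t) + eta h(t) h(t)^T
    return h_s, A_next                    # h(t+1) = h_S(t+1)
```

A full forward pass just iterates this step over the sequence, carrying both $h$ and $A$.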

Characterizing Adversarial Subspaces Using Local Intrinsic Dimensionality

Xingjun Ma and Bo Li and Yisen Wang and Sarah M. Erfani and Sudanthi Wijewickrema and Grant Schoenebeck and Dawn Song and Michael E. Houle and James Bailey

arXiv e-Print archive - 2018 via Local arXiv

Keywords: cs.LG, cs.CR, cs.CV

**First published:** 2018/01/08 (2 years ago)

**Abstract:** Deep Neural Networks (DNNs) have recently been shown to be vulnerable against
adversarial examples, which are carefully crafted instances that can mislead
DNNs to make errors during prediction. To better understand such attacks, a
characterization is needed of the properties of regions (the so-called
'adversarial subspaces') in which adversarial examples lie. We tackle this
challenge by characterizing the dimensional properties of adversarial regions,
via the use of Local Intrinsic Dimensionality (LID). LID assesses the
space-filling capability of the region surrounding a reference example, based
on the distance distribution of the example to its neighbors. We first provide
explanations about how adversarial perturbation can affect the LID
characteristic of adversarial regions, and then show empirically that LID
characteristics can facilitate the distinction of adversarial examples
generated using state-of-the-art attacks. As a proof-of-concept, we show that a
potential application of LID is to distinguish adversarial examples, and the
preliminary results show that it can outperform several state-of-the-art
detection measures by large margins for five attack strategies considered in
this paper across three benchmark datasets. Our analysis of the LID
characteristic for adversarial regions not only motivates new directions of
effective adversarial defense, but also opens up more challenges for developing
new attacks to better understand the vulnerabilities of DNNs.

Ma et al. detect adversarial examples based on their estimated local intrinsic dimensionality. I want to note that this work is also similar to [1] – in both publications, local intrinsic dimensionality is used to analyze adversarial examples. Specifically, the intrinsic dimensionality of a sample is estimated based on the radii $r_i(x)$ of the $k$ nearest neighbors around a sample $x$: $$\widehat{\text{LID}}(x) = - \left(\frac{1}{k} \sum_{i = 1}^k \log \frac{r_i(x)}{r_k(x)}\right)^{-1}.$$ For details regarding the original, theoretical formulation of local intrinsic dimensionality I refer to the paper. In experiments, the authors show that adversarial examples exhibit a significantly higher intrinsic dimensionality than training samples or randomly perturbed examples. This observation allows the detection of adversarial examples. A proper interpretation of this finding is, however, missing. It would be interesting to investigate what this finding implies about the properties of adversarial examples.
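The estimator above is easy to implement; here is a small NumPy sketch (my own; it assumes the query point itself is not a row of `data`):

```python
import numpy as np

def lid_from_radii(r):
    """Maximum-likelihood LID estimate from sorted k-nearest-neighbor radii r,
    with r[-1] = r_k(x): -(1/k * sum_i log(r_i / r_k))^(-1)."""
    r = np.asarray(r, dtype=float)
    return -1.0 / np.mean(np.log(r / r[-1]))

def lid_estimate(x, data, k=20):
    """LID of x with respect to a reference set `data` (one sample per row)."""
    dists = np.linalg.norm(data - x, axis=1)
    return lid_from_radii(np.sort(dists)[:k])
```

As a sanity check, for radii following the distance distribution of points sampled uniformly in a $d$-dimensional ball, the estimate approaches $d$ as $k$ grows.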

Directly Modeling Missing Data in Sequences with RNNs: Improved Classification of Clinical Time Series

Lipton, Zachary Chase and Kale, David C. and Wetzel, Randall C.

arXiv e-Print archive - 2016 via Local Bibsonomy

Keywords: dblp

#### Motivation:
+ Take advantage of the fact that missing values can be very informative about the label.
+ Sampling a time series generates many missing values.

![Sampling](https://raw.githubusercontent.com/tiagotvv/ml-papers/master/clinical-data/images/Lipton2016_motivation.png?raw=true)

#### Model (indicator flag):
+ Indicator of the occurrence of a missing value.

![Indicator](https://raw.githubusercontent.com/tiagotvv/ml-papers/master/clinical-data/images/Lipton2016_indicator.png?raw=true)

+ An RNN can learn about missing values and their importance only by using the indicator function. The nonlinearity of this type of model helps capture these dependencies.
+ If one wants to use a linear model, feature engineering is needed to overcome its limitations:
  + an indicator for whether a variable was measured at all
  + mean and standard deviation of the indicator
  + frequency with which a variable switches from measured to missing and vice versa.

#### Architecture:
+ RNN with target replication, following the work "Learning to Diagnose with LSTM Recurrent Neural Networks" by the same authors.

![Architecture](https://raw.githubusercontent.com/tiagotvv/ml-papers/master/clinical-data/images/Lipton2016_architecture.png?raw=true)

#### Dataset:
+ Children's Hospital LA.
+ An episode is a multivariate time series that describes the stay of one patient in the intensive care unit.

Dataset properties | Value
---------|----------
Number of episodes | 10,401
Duration of episodes | From 12h to several months
Time series variables | Systolic blood pressure, Diastolic blood pressure, Peripheral capillary refill rate, End tidal CO2, Fraction of inspired O2, Glasgow coma scale, Blood glucose, Heart rate, pH, Respiratory rate, Blood O2 saturation, Body temperature, Urine output

#### Experiments and Results:
**Goal**
+ Predict 128 diagnoses.
+ Multilabel: patients can have more than one diagnosis.

**Methodology**
+ Split: 80% training, 10% validation, 10% test.
+ Data normalized to the range [0,1].
+ LSTM RNN:
  + 2 hidden layers with 128 cells. Dropout = 0.5, L2 regularization: 1e-6.
  + Training for 100 epochs. The chosen parameters correspond to the epoch with the smallest error on the validation dataset.
+ Baselines:
  + Logistic regression (L2 regularization)
  + MLP with 3 hidden layers and 500 hidden neurons per layer (parameters chosen via the validation set)
  + Both tested with raw features and hand-engineered features.
+ Strategies for missing values:
  + Zeroing
  + Impute via forward-filling / backfilling
  + Impute with zeros and use the indicator function
  + Impute via forward-filling / backfilling and use the indicator function
  + Use the indicator function only

#### Results
+ Metrics:
  + Micro AUC, Micro F1: calculated by adding the TPs, FPs, TNs and FNs over the entire dataset and all classes.
  + Macro AUC, Macro F1: arithmetic mean of the AUCs and F1 scores of each class.
  + Precision at 10: fraction of correct diagnoses among the top 10 predictions of the model.
  + The upper bound for precision at 10 is 0.2281, since the test set contains on average 2.281 diagnoses per patient.

![Results](https://raw.githubusercontent.com/tiagotvv/ml-papers/master/clinical-data/images/Lipton2016_results.png?raw=true)

#### Discussion:
+ The predictive model is based on data collected following a given routine. This routine can change if the model is put into practice. Will the model's predictions remain valid under the new routine?
+ Missing values in a way give an indication of the type of treatment being followed.
+ Trade-off between complex models operating on raw features and very complex features operating on more interpretable models.
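As a concrete illustration of the "impute and use indicator function" strategies compared in the paper, here is a small pandas sketch (my own; the column name is illustrative):

```python
import numpy as np
import pandas as pd

def add_missingness_features(df):
    """Forward-fill, then backfill, then zero-impute each variable, and append a
    binary indicator column per variable marking where the value was missing."""
    indicators = df.isna().astype(float).add_suffix("_missing")
    imputed = df.ffill().bfill().fillna(0.0)
    return pd.concat([imputed, indicators], axis=1)
```

The RNN then consumes both the imputed values and the indicators, letting it learn that *when* a variable is measured can itself be predictive.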

Videos as Space-Time Region Graphs

Xiaolong Wang and Abhinav Gupta

Computer Vision – ECCV 2018 - 2018 via Local CrossRef

Keywords:

This paper tackles the challenge of action recognition by representing a video as space-time graphs: a **similarity graph** captures the relationships between correlated objects in the video, while a **spatial-temporal graph** captures the interactions between objects. The algorithm is composed of several modules:

![Overview](https://i.imgur.com/DGacPVo.png)

1. **Inflated 3D (I3D) network**. In essence, this is a usual 2D CNN (e.g. ResNet-50) converted to a 3D CNN by copying the 2D weights along an additional dimension and renormalizing them. The network takes a *batch x 3 x 32 x 224 x 224* tensor as input and outputs a *batch x 16 x 14 x 14* tensor.
2. **Region Proposal Network (RPN)**. This is the same RPN used to predict initial bounding boxes in two-stage detectors like Faster R-CNN. Specifically, it predicts a predefined number of bounding boxes on every other frame of the input (the input has 32 frames, so 16 frames are used) to match the temporal dimension of the I3D network's output. Then, the I3D output features and the bounding boxes projected onto them are passed to RoIAlign to obtain temporal features for each object proposal. Conveniently, PyTorch comes with a [Faster R-CNN pretrained on MS COCO](https://pytorch.org/tutorials/intermediate/torchvision_tutorial.html), which can easily be cut down to RPN-only functionality.
3. **Similarity graph**. This graph represents the feature similarity between different objects in a video. Given features $x_i$ extracted by RPN+RoIAlign for every bounding box prediction in a video, the similarity between any pair of objects is computed as $F(x_i, x_j) = (wx_i)^T (w'x_j)$, where $w$ and $w'$ are learnable transformation weights. Softmax normalization is performed over the edges connected to each node $i$. The graph convolutional network consists of several graph-convolution layers with ReLU activations in between. Graph construction and convolutions can be conveniently implemented using [PyTorch Geometric](https://github.com/rusty1s/pytorch_geometric).
4. **Spatial-temporal graph**. This graph captures the spatial and temporal relationships between objects in neighboring frames. To construct the graph $G_{i,j}^{front}$, we iterate through every bounding box in frame $t$ and compute its Intersection over Union (IoU) with every object in frame $t+1$. The IoU value serves as the weight of the edge connecting nodes $i$ and $j$ (whose features are the RoI-aligned RPN features). The edge values are normalized so that the weights of the edges connected to proposal $i$ sum to 1. In a similar manner, the backward graph $G_{i,j}^{back}$ is defined by analyzing frames $t$ and $t-1$.
5. **Classification head**. The classification head takes two inputs. One comes from the average-pooled features of the I3D model, resulting in a *1 x 512* tensor. The other is the pooled sum of features (also a *1 x 512* tensor) from the graph convolutional networks defined above. Both inputs are concatenated and fed to a fully-connected (FC) layer to perform the final multi-label (or multi-class) classification.

**Dataset**. The authors have tested the proposed algorithm on the [Something-Something](https://20bn.com/datasets/something-something) and [Charades](https://allenai.org/plato/charades/) datasets. For the first dataset, a softmax loss function is used, while the second one uses a binary sigmoid loss to handle its multi-label nature. The input data is sampled at 6 fps, covering about 5 seconds of a video input.

**My take**. I think this paper is a great engineering effort. While the paper is easy to understand at a high level, implementing it is much harder, partially due to unclear/misleading writing. I have challenged myself with [reproducing this paper](https://github.com/BAILOOL/Videos-as-Space-Time-Region-Graphs). It is work in progress, so be careful not to damage your PC and eyes :-)
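The two graph constructions can be sketched directly in NumPy (a minimal sketch under my own assumptions about shapes; in the real model $w$ and $w'$ are learned end to end):

```python
import numpy as np

def similarity_adjacency(x, w, w_prime):
    """Softmax-normalized similarity graph: score F(x_i, x_j) = (w x_i)^T (w' x_j).
    x: (n, d) proposal features; w, w_prime: (d', d) projection matrices."""
    logits = (x @ w.T) @ (x @ w_prime.T).T       # (n, n) pairwise scores
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(logits)
    return e / e.sum(axis=1, keepdims=True)      # each row sums to 1

def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2),
    used as the edge weight in the spatial-temporal graph."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda bx: (bx[2] - bx[0]) * (bx[3] - bx[1])
    return inter / (area(a) + area(b) - inter)

def gcn_layer(adj, x, weight):
    """One graph-convolution layer: neighborhood aggregation, projection, ReLU."""
    return np.maximum(adj @ x @ weight, 0.0)
```

Stacking a few `gcn_layer` calls on each adjacency matrix and pooling the node features gives the graph branch of the classification head.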

Near-optimal probabilistic RNA-seq quantification

Nicolas L Bray and Harold Pimentel and Páll Melsted and Lior Pachter

Nature Biotechnology - 2016 via Local CrossRef

Keywords:

This paper from 2016 introduced a new k-mer-based method to estimate isoform abundance from RNA-Seq data, called kallisto. The method provided a significant improvement in speed and memory usage compared to the previously used methods while yielding similar accuracy. In fact, kallisto is able to quantify expression in a matter of minutes instead of hours. The standard (previous) methods for quantifying expression rely on mapping, i.e. on the alignment of the sequenced transcriptome reads to a reference genome. Reads are assigned to a position in the genome, and the gene or isoform expression values are derived by counting the number of reads overlapping the features of interest. The idea behind kallisto is to rely on a pseudoalignment which does not attempt to identify the positions of the reads in the transcripts, only the potential transcripts of origin. Thus, it avoids doing an alignment of each read to a reference genome. In fact, kallisto only uses the transcriptome sequences (not the whole genome) in its first step, which is the generation of the kallisto index. Kallisto builds a colored de Bruijn graph (T-DBG) from all the k-mers found in the transcriptome. Each node of the graph corresponds to a k-mer (a short sequence of k nucleotides) and retains the information about the transcripts in which it can be found in the form of a color. Linear stretches having the same coloring in the graph correspond to transcripts. Once the T-DBG is built, kallisto stores a hash table mapping each k-mer to its transcript(s) of origin, along with the position within the transcript(s). This step is done only once and depends on a provided annotation file (containing the sequences of all the transcripts in the transcriptome). Then, for a given sequenced sample, kallisto decomposes each read into its k-mers and uses those k-mers to find a path covering in the T-DBG. 
This path covering of the transcriptome graph, where a path corresponds to a transcript, generates k-compatibility classes for each k-mer, i.e. sets of potential transcripts of origin on the nodes. The potential transcripts of origin for a read can be obtained using the intersection of the k-compatibility classes of its k-mers. To make the pseudoalignment faster, kallisto skips redundant k-mers, since neighboring k-mers often belong to the same transcripts. Figure 1, from the paper, summarizes these different steps. https://i.imgur.com/eNH2kuO.png **Figure 1**. Overview of kallisto. The input consists of a reference transcriptome and reads from an RNA-seq experiment. (a) An example of a read (in black) and three overlapping transcripts with exonic regions as shown. (b) An index is constructed by creating the transcriptome de Bruijn Graph (T-DBG) where nodes (v1, v2, v3, ...) are k-mers, each transcript corresponds to a colored path as shown, and the path cover of the transcriptome induces a k-compatibility class for each k-mer. (c) Conceptually, the k-mers of a read are hashed (black nodes) to find the k-compatibility class of a read. (d) Skipping (black dashed lines) uses the information stored in the T-DBG to skip k-mers that are redundant because they have the same k-compatibility class. (e) The k-compatibility class of the read is determined by taking the intersection of the k-compatibility classes of its constituent k-mers. [From Bray et al., Near-optimal probabilistic RNA-seq quantification, Nature Biotechnology, 2016.] Then, kallisto optimizes the following RNA-Seq likelihood function using the expectation-maximization (EM) algorithm. 
$$L(\alpha) \propto \prod_{f \in F} \sum_{t \in T} y_{f,t} \frac{\alpha_t}{l_t} = \prod_{e \in E}\left( \sum_{t \in e} \frac{\alpha_t}{l_t} \right )^{c_e}$$ In this function, $F$ is the set of fragments (or reads), $T$ is the set of transcripts, $l_t$ is the (effective) length of transcript $t$ and $y_{f,t}$ is a compatibility matrix defined as 1 if fragment $f$ is compatible with $t$ and 0 otherwise. The parameters $\alpha_t$ are the probabilities of selecting reads from a transcript $t$. These $\alpha_t$ are the parameters of interest since they represent the isoform abundances or relative expressions. To make things faster, the compatibility matrix is collapsed (factorized) into equivalence classes. An equivalence class consists of all the reads compatible with the same subset of transcripts. The EM algorithm is applied to equivalence classes (not to reads). Each $\alpha_t$ is optimized to maximize the likelihood of transcript abundances given the observations of the equivalence classes. The speed of the method makes it possible to evaluate the uncertainty of the abundance estimates for each RNA-Seq sample using a bootstrap technique. For a given sample containing $N$ reads, a bootstrap sample is generated by sampling $N$ counts from a multinomial distribution over the equivalence classes derived from the original sample. The EM algorithm is applied to those sampled equivalence class counts to estimate transcript abundances. The bootstrap information is then used in downstream analyses such as determining which genes are differentially expressed. Practically, we can illustrate the different steps involved in kallisto using a small example. Starting from a tiny genome with 3 transcripts, assume that the RNA-Seq experiment produced 4 reads, as depicted in the image below. https://i.imgur.com/5JDpQO8.png The first step is to build the T-DBG graph and the kallisto index. 
All transcript sequences are decomposed into k-mers (here k=5) to construct the colored de Bruijn graph. Not all nodes are represented in the following drawing. The idea is that each different transcript will lead to a different path in the graph. The strand is not taken into account; kallisto is strand-agnostic. https://i.imgur.com/4oW72z0.png Once the index is built, the four reads of the sequenced sample can be analyzed. They are decomposed into k-mers (k=5 here too) and the pre-built index is used to determine the k-compatibility class of each k-mer. Then, the k-compatibility class of each read is computed. For example, for read 1, the intersection of all the k-compatibility classes of its k-mers suggests that it might come from transcript 1 or transcript 2. https://i.imgur.com/woektCH.png This is done for the four reads, enabling the construction of the compatibility matrix $y_{f,t}$, which is part of the RNA-Seq likelihood function. In this equation, the $\alpha_t$ are the parameters that we want to estimate. https://i.imgur.com/Hp5QJvH.png The EM algorithm being too slow to be applied on millions of reads, the compatibility matrix $y_{f,t}$ is factorized into equivalence classes, and a count is computed for each class (how many reads are represented by this equivalence class). The EM algorithm uses this collapsed information to maximize the new equivalent RNA-Seq likelihood function and optimize the $\alpha_t$. https://i.imgur.com/qzsEq8A.png The EM algorithm stops when, for every transcript $t$ such that $\alpha_t N > 0.01$ (where $N$ is the total number of reads), the change in $\alpha_t$ is less than 1%.
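To make the pipeline concrete, here is a toy Python sketch (my own simplification: exact k-mer lookups, no skipping) of the two core steps, the k-compatibility intersection and the EM over equivalence classes:

```python
import numpy as np

def read_compatibility(read_kmers, kmer_index):
    """k-compatibility class of a read: intersection of the transcript sets
    of its k-mers (kmer_index maps k-mer -> set of transcript ids)."""
    classes = [kmer_index[k] for k in read_kmers if k in kmer_index]
    return set.intersection(*classes) if classes else set()

def em_quantify(eq_classes, counts, lengths, n_iter=200):
    """EM over equivalence classes: eq_classes[e] is an array of compatible
    transcript ids, counts[e] the number of reads in class e, lengths[t]
    the effective length of transcript t. Returns the abundances alpha."""
    n_t = len(lengths)
    alpha = np.full(n_t, 1.0 / n_t)
    for _ in range(n_iter):
        new = np.zeros(n_t)
        for ts, c in zip(eq_classes, counts):
            w = alpha[ts] / lengths[ts]   # E-step: responsibility of each transcript
            new[ts] += c * w / w.sum()    # distribute the class's reads
        alpha = new / new.sum()           # M-step: renormalize
    return alpha
```

Running the EM on the tiny 3-transcript / 4-read example above amounts to listing each read's k-compatibility class, collapsing identical classes into counts, and iterating the two steps until the $\alpha_t$ stabilize.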
