An Empirical Investigation of Catastrophic Forgetting in Gradient-Based Neural Networks on ShortScience.org


An Empirical Investigation of Catastrophic Forgetting in Gradient-Based Neural Networks
Ian J. Goodfellow and Mehdi Mirza and Da Xiao and Aaron Courville and Yoshua Bengio
arXiv e-Print archive - 2013 via Local arXiv
Keywords: stat.ML, cs.LG, cs.NE


Summary by Andrea Walter Ruggerini 4 years ago

The paper discusses and empirically investigates "catastrophic forgetting" (**CF**), i.e. the inability of a model to perform a task it was previously trained on once it has been retrained on a second task.

An illuminating example is what happens in ML systems with convex objectives: regardless of the initialization (i.e. of what was learnt on the first task), training on the second task will always end in the global minimum of the second task's objective, thus totally "forgetting" the first one.
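The convex case above can be illustrated with a minimal sketch. The two quadratic objectives, their minimisers (-2 and 3), the learning rate and the step count are all hypothetical choices made for illustration and are not taken from the paper. The sketch assumes a strictly convex objective and training run to convergence.

```python
def gradient_descent(grad, w0, lr=0.1, steps=500):
    """Plain gradient descent on a single scalar parameter."""
    w = w0
    for _ in range(steps):
        w = w - lr * grad(w)
    return w


# Hypothetical convex objectives: task A is minimised at w = -2, task B at w = 3.
def loss_a(w):
    return (w + 2.0) ** 2


def grad_a(w):
    return 2.0 * (w + 2.0)


def grad_b(w):
    return 2.0 * (w - 3.0)


# Train on task A first, then continue training on task B.
w_after_a = gradient_descent(grad_a, w0=0.0)
w_b_from_a = gradient_descent(grad_b, w0=w_after_a)

# Train on task B only, from an unrelated starting point.
w_b_from_scratch = gradient_descent(grad_b, w0=10.0)

print("task B solution after task A:", w_b_from_a)
print("task B solution from scratch:", w_b_from_scratch)
print("task A loss before / after training on B:",
      loss_a(w_after_a), loss_a(w_b_from_a))
```

Both runs end at essentially the same parameter value, so nothing learnt on task A survives training on task B. In non-convex models such as neural networks the end point can depend on the starting point, which is why the question is worth studying empirically.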

Neuroscientific evidence (and common sense) suggest that the outcome of the experiment is deeply influenced by the similarity of the tasks involved. Three cases are distinguished: (i) the two tasks are *functionally identical but the input is presented in a different format*, (ii) the *tasks are similar*, and (iii) the *tasks are dissimilar*.

Examples of each case are, respectively: (i) performing the same image classification task on two different image representations, such as RGB and HSL; (ii) performing two semantically similar image classification tasks, such as classifying two similar animals; and (iii) performing text classification followed by image classification.
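As a small illustration of case (i), the sketch below re-encodes the same pixel in a second format using Python's standard `colorsys` module. Note that `colorsys` orders the components as HLS rather than HSL; the sample pixel and the helper name `reformat_pixel` are assumptions made for illustration only.

```python
import colorsys


def reformat_pixel(rgb):
    """Re-encode an (r, g, b) pixel with components in [0, 1] as (h, l, s)."""
    r, g, b = rgb
    return colorsys.rgb_to_hls(r, g, b)


# Hypothetical input: a pure red pixel.
pixel_rgb = (1.0, 0.0, 0.0)
pixel_hls = reformat_pixel(pixel_rgb)

# The label to predict is unchanged; only the input encoding differs.
print("RGB:", pixel_rgb)
print("HLS:", pixel_hls)

# The mapping is invertible, so both formats carry the same information.
print("back to RGB:", colorsys.hls_to_rgb(*pixel_hls))
```

A model trained on one encoding and then on the other solves the same underlying task twice, yet sees very different input values.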

The problem is investigated through an empirical study covering two training methods (standard SGD training and dropout training) combined with four activation functions (logistic sigmoid, ReLU, LWTA, maxout). A random search over hyperparameters is carried out for each combination.
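For readers unfamiliar with the four activation functions, here is a minimal sketch of each, applied to a list of pre-activations. The group size `k=2` and the tie-breaking rule in `lwta` are assumptions made for illustration and are not taken from the paper.

```python
import math


def sigmoid(z):
    """Logistic sigmoid, applied elementwise."""
    return [1.0 / (1.0 + math.exp(-v)) for v in z]


def relu(z):
    """Rectified linear unit, applied elementwise."""
    return [max(0.0, v) for v in z]


def _blocks(z, k):
    """Split z into consecutive groups of k pre-activations."""
    if len(z) % k != 0:
        raise ValueError("len(z) must be a multiple of k")
    return [z[i:i + k] for i in range(0, len(z), k)]


def maxout(z, k=2):
    """Maxout: each group of k pre-activations is reduced to its maximum."""
    return [max(block) for block in _blocks(z, k)]


def lwta(z, k=2):
    """Local winner-take-all: within each group of k, the largest value
    is kept and the others are set to zero (first maximum wins ties)."""
    out = []
    for block in _blocks(z, k):
        winner = block.index(max(block))
        out.extend(v if i == winner else 0.0 for i, v in enumerate(block))
    return out


z = [1.0, -3.0, 0.5, 2.0]
print("relu  :", relu(z))
print("maxout:", maxout(z))
print("lwta  :", lwta(z))
```

Note that maxout reduces the number of outputs by a factor of `k`, whereas LWTA keeps the output size and zeroes the losing units in each group.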

From a practitioner's point of view, it is interesting to note that the dropout probability is fixed at 0.5 for hidden units and 0.2 for visible (input) units, since these values are reasonably well known to work.
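A minimal sketch of how these two dropout rates could be applied at training time is shown below. The "inverted dropout" rescaling (dividing kept units by the keep probability) is an assumption made here for convenience; the paper's implementation may instead rescale at test time. The function name, layer sizes and seed are hypothetical.

```python
import random


def dropout(values, drop_prob, rng):
    """Zero each unit independently with probability drop_prob and
    rescale the surviving units by 1 / (1 - drop_prob)."""
    keep = 1.0 - drop_prob
    return [v / keep if rng.random() < keep else 0.0 for v in values]


rng = random.Random(0)  # fixed seed so the sketch is repeatable

visible = [1.0] * 10  # hypothetical input layer
hidden = [1.0] * 10   # hypothetical hidden layer

print("visible, drop 0.2:", dropout(visible, 0.2, rng))
print("hidden,  drop 0.5:", dropout(hidden, 0.5, rng))
```

At test time no units are dropped; with the rescaling used in this sketch, activations need no further adjustment.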

## Why the paper is important
It is apparently the first paper to provide a systematic empirical analysis of CF. It establishes a framework and baselines for tackling the problem.

## Key conclusions, takeaways and modelling remarks
* dropout helps in preventing CF
* dropout seems to increase the optimal model size compared with the same model trained without dropout
* the choice of activation function has a less consistent effect than the choice between dropout and no dropout
* the dissimilar task experiment provides a notable exception
* the previous hypothesis that the LWTA activation is particularly resistant to CF is rejected (although it performs best on the new task in the dissimilar task pair, its behaviour is inconsistent)
* the choice of activation function should always be cross-validated
* if computational resources are insufficient for cross-validation, the combination of dropout and the maxout activation function is recommended
