The paper presents a model-agnostic strategy for few-shot learning that takes advantage of prior knowledge acquired during multitask learning. Such prior knowledge takes the form of priors over generalized model parameters (e.g. weights or hyperparameters) learned by the Model-Agnostic Meta-Learning (MAML) algorithm. The strategy can be applied to any model trained with gradient descent (not only neural networks), making it more general, and in some settings more effective, than transfer learning. It can loosely be described as "learning to learn".
## Why this is interesting
* Suitable in combination with any technique trained by gradient descent (supervised learning, reinforcement learning)
* Interesting idea: instead of further optimizing an existing model for performance, search for a representation that can subsequently be tuned to new tasks
* when only few, diverse data points are available, multiple tasks can be defined to harness the meta-model's ability to learn while preserving generalization (see Experiments)
The key idea is to perform the meta-learner update on a different data batch from the one used for the per-task parameter update(s). This yields (formally) the same update procedure for both the learning and meta-learning phases of the algorithm (see Figure below) and provides a general framework for MAML.
Image from [Lilian Weng's post](https://lilianweng.github.io/lil-log/2018/11/30/meta-learning.html)
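The two nested updates can be sketched on a toy problem. Below is a minimal, self-contained illustration (not the paper's implementation): a hypothetical family of 1-D linear-regression tasks with analytic gradients and Hessians, so the full second-order meta-gradient can be written explicitly. All task ranges and learning rates are arbitrary choices for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_slope():
    # Hypothetical task family: y = a * x, with the slope a varying per task.
    return rng.uniform(0.5, 2.0)

def loss_grad_hess(w, X, y):
    # Squared loss L(w) = mean((X w - y)^2), with analytic gradient and Hessian.
    n = len(y)
    r = X @ w - y
    return np.mean(r ** 2), 2.0 * X.T @ r / n, 2.0 * X.T @ X / n

alpha, beta = 0.05, 0.05   # inner (task) and outer (meta) learning rates
w = np.zeros(2)            # meta-parameters: weights for features [x, 1]

for step in range(500):
    a = sample_slope()
    # Support and query batches come from the SAME task but are DIFFERENT
    # samples: the meta-update is evaluated on data not used for adaptation.
    Xs = np.c_[rng.uniform(-1, 1, 10), np.ones(10)]
    Xq = np.c_[rng.uniform(-1, 1, 10), np.ones(10)]
    ys, yq = a * Xs[:, 0], a * Xq[:, 0]

    _, gs, Hs = loss_grad_hess(w, Xs, ys)
    w_adapted = w - alpha * gs                   # one inner gradient step
    _, gq, _ = loss_grad_hess(w_adapted, Xq, yq)
    # Full MAML meta-gradient: differentiate through the inner step,
    # which brings in the Hessian of the support loss.
    w -= beta * (np.eye(2) - alpha * Hs) @ gq
```

After meta-training, one inner gradient step from `w` should fit a new task from this family better than the same step taken from an untrained initialization.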
As [clearly worked out here](https://lilianweng.github.io/lil-log/2018/11/30/meta-learning.html), the method requires computing second derivatives for the outer-loop update. Surprisingly enough, omitting them and performing first-order MAML does not noticeably affect the results in the reported experiments. The authors hypothesize this is because ReLU networks are almost locally linear (and hence the effect depends on the actual network architecture).
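A one-dimensional quadratic makes the dropped term explicit. This is a made-up toy (none of the constants come from the paper): the exact meta-gradient carries a factor of (1 − α·L″) from differentiating through the inner step, and first-order MAML simply discards it.

```python
# Toy 1-D example (all constants made up): inner loss L_s(w) = c * w^2,
# query loss L_q(w) = (w - t)^2, one inner step of size alpha.
c, t, alpha, w = 1.5, 2.0, 0.1, 1.0

grad_s = 2 * c * w               # d L_s / d w
w_adapt = w - alpha * grad_s     # adapted parameter after one inner step
grad_q = 2 * (w_adapt - t)       # d L_q / d w, evaluated at w_adapt

fomaml_grad = grad_q                       # first-order: treat w_adapt as constant w.r.t. w
maml_grad = (1 - alpha * 2 * c) * grad_q   # exact: chain rule through the inner step (2c = L_s'')
```

The gap between the two depends on how far the inner-loop Hessian (here the constant 2c) is from zero, which is consistent with the near-linearity hypothesis above.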
### Supervised learning
1. regression from input to output of a sine wave. Amplitude and phase are varied among tasks. It is shown that MAML leads to good results and can generalize better than fine-tuning in the experiment conditions ("due to the often contradictory outputs on pre-training tasks". See Figure 2 in the paper)
2. few-shot image classification on the Omniglot and MiniImageNet datasets (N-way classification over unseen classes, with K training instances per class). MAML achieves state-of-the-art performance on the first dataset and better-than-SOTA results on the second, where first-order MAML is also tested.
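The sine-wave benchmark is straightforward to reproduce as a task sampler. The ranges below (amplitude in [0.1, 5], phase in [0, π], inputs in [−5, 5]) are the commonly cited ones for this benchmark, but treat them as assumptions of this sketch.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_sine_task(k_shot=10):
    # Each task is a sine wave with its own amplitude and phase; a model sees
    # only k_shot (x, y) pairs from it at adaptation time.
    amplitude = rng.uniform(0.1, 5.0)
    phase = rng.uniform(0.0, np.pi)
    x = rng.uniform(-5.0, 5.0, size=(k_shot, 1))
    y = amplitude * np.sin(x + phase)
    return x, y, (amplitude, phase)

x, y, (amp, ph) = sample_sine_task(k_shot=5)
```

Because amplitude and phase vary per task, averaging over tasks (as plain pretraining does) tends toward contradictory targets, which is the failure mode the fine-tuning baseline exhibits in Figure 2.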
### Reinforcement learning
1. 2D navigation: a point agent must move to different goal positions (tasks). The model trained with MAML performs better for the same number of gradient steps (Figure 4)
2. Locomotion: two simulated robots are given a set of locomotion tasks. MAML learns a model that adapts much faster to new tasks (a case where standard pretraining is detrimental)
## Related work and resources
* official [GitHub repo](https://github.com/cbfinn/maml)
* [videos](https://sites.google.com/view/maml) of the learned policies in MAML paper
* paper appendix: the part on the multi-task baseline is interesting
* [How to Train Your MAML](https://arxiv.org/abs/1810.09502): discusses various modifications to MAML to stabilize it and improve performance
* [Rapid Learning or Feature Reuse? Towards Understanding the Effectiveness of MAML](https://arxiv.org/abs/1909.09157)
First published: 2013/12/21

Abstract: Catastrophic forgetting is a problem faced by many machine learning models and algorithms. When trained on one task, then trained on a second task, many machine learning models "forget" how to perform the first task. This is widely believed to be a serious problem for neural networks. Here, we investigate the extent to which the catastrophic forgetting problem occurs for modern neural networks, comparing both established and recent gradient-based training algorithms and activation functions. We also examine the effect of the relationship between the first task and the second task on catastrophic forgetting. We find that it is always best to train using the dropout algorithm -- the dropout algorithm is consistently best at adapting to the new task, remembering the old task, and has the best tradeoff curve between these two extremes. We find that different tasks and relationships between tasks result in very different rankings of activation function performance. This suggests the choice of activation function should always be cross-validated.
The paper discusses and empirically investigates the effect of "catastrophic forgetting" (**CF**), i.e. the inability of a model to perform a task it was previously trained on after being retrained to perform a second task.
An illuminating example is what happens in ML systems with convex objectives: regardless of the initialization (i.e. of what was learned on the first task), training on the second task always ends in its global minimum, thus totally "forgetting" the first one.
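This can be checked in a few lines with ordinary least squares, a convex problem; the data sizes and learning rate below are arbitrary choices for the sketch. Gradient descent on task B converges to the same global minimum whether it is warm-started from the task-A solution or from scratch, so after retraining the task-A loss is as bad as that of a fresh model.

```python
import numpy as np

rng = np.random.default_rng(0)

def gd(X, y, w0, lr=0.05, steps=3000):
    # Plain gradient descent on the (convex) squared loss.
    w = w0.copy()
    for _ in range(steps):
        w -= lr * 2.0 * X.T @ (X @ w - y) / len(y)
    return w

X = rng.normal(size=(100, 3))
w_task_a = rng.normal(size=3)      # ground-truth weights for task A
w_task_b = rng.normal(size=3)      # ground-truth weights for task B
y_a, y_b = X @ w_task_a, X @ w_task_b

w_a = gd(X, y_a, np.zeros(3))      # learn task A first
# Retrain on task B, warm-started from the task-A solution ...
w_warm = gd(X, y_b, w_a)
# ... versus training on task B from scratch:
w_cold = gd(X, y_b, np.zeros(3))
# Convexity means both runs end at the same global minimum, so the
# warm-started model has "forgotten" task A completely.
```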
Neuroscientific evidence (and common sense) suggests that the outcome of such an experiment is deeply influenced by the similarity of the tasks involved. Three cases are considered: (i) the two tasks are *functionally identical but the input is presented in a different format*, (ii) the *tasks are similar*, and (iii) the *tasks are dissimilar*.
Respective examples: (i) performing the same image classification task starting from two different image representations, such as RGB or HSL, (ii) performing semantically similar image classification tasks, such as classifying two similar animals, and (iii) performing text classification followed by image classification.
The problem is investigated in an empirical study covering two training methods ("SGD" and "dropout") combined with four activation functions (logistic sigmoid, ReLU, LWTA, maxout). A random search is carried out over the remaining hyperparameters.
From a practitioner's point of view, it is interesting to note that the dropout probability was fixed to 0.5 for hidden units and 0.2 for the input layer, since these are reasonably well-established values.
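Concretely, those rates correspond to keep-probabilities of 0.5 (hidden) and 0.8 (input) in a standard inverted-dropout forward pass; the layer sizes below are arbitrary and only illustrate the mechanism.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(h, keep_prob):
    # Inverted dropout: zero units with probability (1 - keep_prob) and
    # rescale the survivors so the expected activation is unchanged.
    mask = rng.random(h.shape) < keep_prob
    return h * mask / keep_prob

x = rng.normal(size=(4, 8))
x_drop = dropout(x, keep_prob=0.8)   # input layer: drop probability 0.2
h = np.maximum(x_drop @ rng.normal(size=(8, 16)), 0.0)  # ReLU hidden layer
h_drop = dropout(h, keep_prob=0.5)   # hidden layer: drop probability 0.5
```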
## Why the paper is important
It is apparently the first paper to provide a systematic empirical analysis of CF, and it establishes a framework and baselines for tackling the problem.
## Key conclusions, takeaways and modelling remarks
* dropout helps in preventing CF
* dropout seems to increase the optimal model size compared to the same model without dropout
* the choice of activation function has a less consistent effect than the dropout/no-dropout choice
* the dissimilar-task experiment provides a notable exception to this pattern
* the previous hypothesis that the LWTA activation is particularly resistant to CF is rejected (even though it performs best on the new task in the dissimilar-task pair, its behaviour is inconsistent)
* choice of activation function should always be cross-validated
* if computational resources are insufficient for cross-validation, the combination of dropout + maxout activation is recommended