First published: 2018/05/24 (5 years ago) Abstract: Experiments used in current continual learning research do not faithfully
assess fundamental challenges of learning continually. Instead of assessing
performance on challenging and representative experiment designs, recent
research has focused on increased dataset difficulty, while still using flawed
experiment set-ups. We examine standard evaluations and show why these
evaluations make some continual learning approaches look better than they are.
We introduce desiderata for continual learning evaluations and explain why
their absence creates misleading comparisons. Based on our desiderata we then
propose new experiment designs which we demonstrate with various continual
learning approaches and datasets. Our analysis calls for a reprioritization of
research effort by the community.
Through a likelihood-focused derivation of a variational inference (VI) loss, Variational Generative Experience Replay (VGER) provides the closest appropriate likelihood-focused alternative to Variational Continual Learning (VCL), the state-of-the-art prior-focused approach to continual learning.
In non-continual learning, the aim is to learn parameters $\omega$ from labelled training data $\mathcal{D}$ in order to infer $p(y|\omega, x)$. In the continual learning context, by contrast, the data are not independently and identically distributed (i.i.d.): they arrive split into separate tasks $\mathcal{D}_t = (X_t, Y_t)$, whose examples $x_t^{n_t}$ and $y_t^{n_t}$ are assumed i.i.d. within each task.
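For reference, the prior-focused (VCL) baseline that VGER is compared against follows the standard recursive Bayesian update, in which the previous posterior becomes the new prior (standard form; notation as above):

\[
p(\omega \mid \mathcal{D}_{1:t}) \;\propto\; p(\omega \mid \mathcal{D}_{1:t-1})\, p(\mathcal{D}_t \mid \omega),
\]

approximated variationally at each task by

\[
q_t(\omega) \;=\; \arg\min_{q} \;\mathrm{KL}\big(q(\omega)\,\|\,q_{t-1}(\omega)\big) \;-\; \mathbb{E}_{q(\omega)}\big[\log p(\mathcal{D}_t \mid \omega)\big].
\]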
In \cite{Farquhar18}, because the loss at time $t$ cannot be estimated on previously discarded datasets, VGER approximates the distribution of past data $p_t(x, y)$ by training a GAN $q_t(x, y)$ to produce $(\hat{x}, \hat{y})$ pairs for each class of each dataset as it arrives; the generator is kept while the data are discarded once each dataset has been used. The variational free energy $\mathcal{F}_T$ is then minimized on dataset $\mathcal{D}_T$ augmented with samples from the past generators, and the prior is set to the posterior approximation from the previous task.
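The replay mechanism can be sketched as follows. This is a toy illustration, not the paper's implementation: the per-class GAN is replaced by a diagonal-Gaussian generator, the variational classifier training step is elided, and all names (`GaussianGenerator`, `make_task`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

class GaussianGenerator:
    """Toy stand-in for VGER's per-class GAN q_t(x, y): fits a diagonal
    Gaussian to each class and samples labelled (x_hat, y_hat) pairs."""
    def __init__(self, X, y):
        self.stats = {}
        for c in np.unique(y):
            Xc = X[y == c]
            self.stats[c] = (Xc.mean(axis=0), Xc.std(axis=0) + 1e-6)

    def sample(self, n_per_class):
        xs, ys = [], []
        for c, (mu, sd) in self.stats.items():
            xs.append(rng.normal(mu, sd, size=(n_per_class, mu.shape[0])))
            ys.append(np.full(n_per_class, c))
        return np.vstack(xs), np.concatenate(ys)

def make_task(class_means, labels, n=100):
    """Synthetic 2-D task: one Gaussian blob per class."""
    X = np.vstack([rng.normal(m, 0.3, size=(n, 2)) for m in class_means])
    y = np.concatenate([np.full(n, c) for c in labels])
    return X, y

tasks = [
    make_task([(0, 0), (2, 2)], labels=[0, 1]),
    make_task([(4, 0), (0, 4)], labels=[2, 3]),
]

generators = []  # generators are kept; raw data is discarded after use
for X, y in tasks:
    # Augment the current task with samples from all past generators,
    # approximating the discarded datasets p_t(x, y).
    Xs, ys = [X], [y]
    for g in generators:
        x_hat, y_hat = g.sample(len(X) // 2)
        Xs.append(x_hat)
        ys.append(y_hat)
    train_X, train_y = np.vstack(Xs), np.concatenate(ys)
    # ... minimize the variational free energy F_T on (train_X, train_y),
    #     with the prior set to the previous task's posterior ...
    generators.append(GaussianGenerator(X, y))  # keep generator, drop data
```

After the second task, the training set contains replayed examples of the first task's classes alongside the new classes, which is what lets a single classifier be trained on all classes seen so far without storing past data.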