[link]
This paper proposes a simple method for sequentially training on new tasks while avoiding catastrophic forgetting. It starts from the Bayesian formulation of learning a model:

$$ \log P(\theta \mid D) = \log P(D \mid \theta) + \log P(\theta) - \log P(D) $$

By replacing the prior with the posterior of the previous task(s), we have

$$ \log P(\theta \mid D) = \log P(D \mid \theta) + \log P(\theta \mid D_{prev}) - \log P(D) $$

The paper uses the following Gaussian approximation for the posterior,

$$ P(\theta \mid D_{prev}) \approx N(\theta^{prev*}, \mathrm{diag}(F)^{-1}) $$

where $F$ is the Fisher Information matrix $E_x[\nabla_\theta \log P(x \mid \theta)\,(\nabla_\theta \log P(x \mid \theta))^T]$, whose diagonal entries act as per-parameter precisions. The resulting objective function is

$$ L(\theta) = L_{new}(\theta) + \frac{\lambda}{2}\sum_i F_{ii} (\theta_i - \theta^{prev*}_i)^2 $$

where $L_{new}$ is the loss on the new task and $\theta^{prev*}$ is the best parameter found on the previous task. The penalty can be viewed as a distance that uses the Fisher Information matrix to properly scale each dimension, and the experiments further show that this Fisher scaling matters by comparing against a plain $L_2$ distance.
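The penalty above is straightforward to compute once per-example gradients are available. A minimal NumPy sketch (the names `diagonal_fisher` and `ewc_loss` are hypothetical helpers, not from the paper) could look like:

```python
import numpy as np

def diagonal_fisher(grads):
    """Diagonal of the empirical Fisher E_x[g g^T].

    grads: array of shape (n_examples, n_params), each row the gradient
    of log P(x | theta) for one example x. The diagonal of E[g g^T] is
    simply the per-dimension mean of squared gradients.
    """
    return np.mean(grads ** 2, axis=0)

def ewc_loss(new_task_loss, theta, theta_prev, fisher_diag, lam):
    """EWC objective: new-task loss plus the Fisher-weighted quadratic
    penalty (lambda/2) * sum_i F_ii (theta_i - theta_prev_i)^2."""
    penalty = 0.5 * lam * np.sum(fisher_diag * (theta - theta_prev) ** 2)
    return new_task_loss + penalty
```

Note that when `theta == theta_prev` the penalty vanishes, and parameters with larger Fisher values (i.e. those the previous task was more sensitive to) are pulled back toward `theta_prev` more strongly.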