Overcoming catastrophic forgetting in neural networks
Paper summary This paper proposes a simple method for training on new tasks sequentially while avoiding catastrophic forgetting. It starts from the Bayesian formulation of learning a model, $$ \log P(\theta | D) = \log P(D | \theta) + \log P(\theta) - \log P(D) $$ Replacing the prior with the posterior from the previous task(s) gives $$ \log P(\theta | D) = \log P(D | \theta) + \log P(\theta | D_{prev}) - \log P(D) $$ where $D$ now denotes the data of the new task. The paper approximates this posterior with a Gaussian, $$ P(\theta | D_{prev}) \approx N(\theta^{prev*}, diag(F)^{-1}) $$ where $\theta^{prev*}$ is the best parameter found on the previous task(s) and $F$ is the Fisher Information matrix $E_x[ \nabla_\theta \log P(x|\theta) (\nabla_\theta \log P(x|\theta))^T]$ evaluated at $\theta^{prev*}$, so that $diag(F)$ acts as the precision of the Gaussian. Taking the log of this Gaussian turns the prior term into a quadratic penalty, and the resulting objective function is $$ L(\theta) = L_{new}(\theta) + \frac{\lambda}{2}\sum_i F_{ii} (\theta_i - \theta^{prev*}_i)^2 $$ where $L_{new}$ is the loss on the new task and $\lambda$ weights the importance of the previous task(s) relative to the new one. The penalty can be viewed as a squared distance in which the Fisher Information matrix scales each dimension, and the experiments show that this scaling matters by comparing against a plain $L_2$ distance.
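The penalty term in the objective above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the paper's implementation: the function names `diagonal_fisher` and `ewc_penalty` are hypothetical, and the diagonal of $F$ is assumed to be estimated as the mean of squared per-sample gradients of $\log P(x|\theta)$, with made-up gradient values standing in for a real model.

```python
import numpy as np


def diagonal_fisher(per_sample_grads):
    """Estimate diag(F) as the mean of squared per-sample gradients.

    per_sample_grads has shape (num_samples, num_params); each row is
    assumed to hold grad_theta log P(x | theta) for one sample x.
    """
    grads = np.asarray(per_sample_grads, dtype=float)
    return np.mean(grads ** 2, axis=0)


def ewc_penalty(theta, theta_prev, fisher_diag, lam):
    """Compute (lam / 2) * sum_i F_ii * (theta_i - theta_prev_i)^2."""
    diff = np.asarray(theta, dtype=float) - np.asarray(theta_prev, dtype=float)
    return 0.5 * lam * np.sum(fisher_diag * diff ** 2)


# Toy values (assumed for illustration): the first parameter has large
# gradients on the old task, the second has none.
grads = np.array([[1.0, 0.0],
                  [3.0, 0.0]])
fisher = diagonal_fisher(grads)

theta_prev = np.array([0.0, 0.0])
theta = np.array([1.0, 1.0])

# Fisher-weighted penalty: only movement along the first parameter is charged.
fisher_penalty = ewc_penalty(theta, theta_prev, fisher, lam=2.0)

# Plain L2 baseline: every parameter is charged equally (F replaced by ones).
l2_penalty = ewc_penalty(theta, theta_prev, np.ones_like(theta), lam=2.0)

print(fisher, fisher_penalty, l2_penalty)
```

In this toy setting the second parameter is free to move without cost under the Fisher-weighted penalty, whereas the $L_2$ baseline penalizes both parameters equally, which is the difference the summary's final comparison refers to.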

Summary by luyuchen 4 weeks ago