The unreasonable effectiveness of the forget gate on ShortScience.org

arxiv.org
arxiv-vanity.com
scholar.google.com

The unreasonable effectiveness of the forget gate
Jos van der Westhuizen and Joan Lasenby
arXiv e-Print archive - 2018 via Local arXiv
Keywords: cs.NE, cs.LG, stat.ML
more

Summaries/Notes 1

[link] Summary by CodyWild 5 years ago

I have a lot of fondness for this paper as a result of its impulse towards clear explanations, simplicity, and pushing back against complexity for complexity’s sake. The goal of the paper is pretty straightforward. Long Short Term Memory networks (LSTM) work by having a memory vector, and pulling information into and out of that vector through a gating system. These gates take as input the context of the network at a given timestep (the prior hidden state, and the current input), apply weight matrices and a sigmoid activation, and produce “mask” vectors with values between 0 and 1. A typical LSTM learns three separate gates: a “forget” gate that controls how much of the old memory vector is remembered, an “input” gate that controls how much new contextual information is added to the memory, an “output” gate that controls how much of the output (a sum of the gated memory information, and the gated input information) is passed outward into a hidden state context that’s visible to the rest of the network. Note here that “hidden” is an unfortunate word here, since this is actually the state that is visible to the rest of the network, whereas the “memory” vector is only visible to the next-step memory updating calculations. Also note that “forget gate” is an awkward name insofar as the higher the value of the forget gate, the more that the model *remembers* of its past memory. This is confusing, but we appear to be stuck with this terminology

The Gated Recurrent Unit, or GRU, did away with the output gate. In this system, the difference between “hidden” and “memory” vectors is removed, and so the network no longer has separate information channels for communicating with subsequent layers, and simple memory passed to future timesteps. On a wide range of problems, the GRU has performed comparably to the LSTM. This makes the authors ask: if a two-gate model can do as well, can a single gate model? In particular: how well does a LSTM-style model perform, if it only has a forget gate.

The answer, to not bury the probably-obvious lede, is: quite well. Models that only have a forget gate perform comparably to or better than traditional LSTM models for the tasks at which they were tried. On a mechanical level, not having an input gate means that, instead of having individual scaling for “how much old memory do you remember” and “how much new context do you take in”, so that those values could be, for example, 0.2 and 0.15, these numbers are defined as a convex combination of a single value, which is the forget gate. That’s a fancy way of saying: we calculate some x between 0 and 1, and that’s the weight on the forget gate, and then (1-x) is the weight on the input gate. This model, for reasons that are entirely unjustified, and obviously the result of some In Joke, is called JANET, because with a single gate, it’s Just Another NETwork. Image is attached to prove I’m Not Making This Shit Up.

The authors go down a few pathways of explaining why this forget-only model performs well, of which the most compelling is that it gives the model an easier and more efficient way to learn a skip connection, where information is passed down more or less intact to a future point in the model. It’s more straightforward to learn because the “skip-ness” of the connection, or, how strongly the information wants to propogate into the future, is just controlled by one set of parameters, and not a complex interaction of input, forget, and output. An interesting side investigation they perform is how the initialization of the bias term in the forget gate (which is calculated by applying weights to the input and former hidden state, and then adding a constant bias term) effects a model’s ability to learn long term dependencies. In particular, they discuss the situation where the model gets some signal, and then a long string of 0 values. If the bias term of the model is quite low, then all of those 0 values being used to calculate the forget gate will mean that only the bias is left, and the more times the bias is multiplied by itself, the smaller and closer to 0 it gets. The paper suggests initializing the bias of the forget gate according to the longest dependencies you expect the model to have, with the idea that you should more strongly bias your model towards remembering old information, regardless of what new information comes in, if you expect long term dependencies to be strongly relevant.

Your comment:

Write your summary here (You can use $\LaTeX$ and markdown syntax):

Anon Private