# Neural GPUs Learn Algorithms Authors: Łukasz Kaiser, Ilya Sutskever. http://arxiv.org/abs/1511.08228 This originally appeared here: https://github.com/ProofByConstruction/betterexplanations/blob/master/summaries/1511.08228.md ## Short Version Using convolutions on embedded symbols in a recurrentlooking setting allows training of what is essentially a cellular automaton which can perform various algorithms and generalize to sequences of very long lengths. ## Problem Setting We want to teach neural nets to learn algorithms (e.g. copy, reverse, binary addition or multiplication, etc.). Other approaches to this include sequence to sequence modeling (usually via some form of [LSTM](http://colah.github.io/posts/201508UnderstandingLSTMs/)). Related work of seqseq approaches includes [Neural Turing Machines](http://arxiv.org/abs/1410.5401), [Stack RNNs](http://arxiv.org/abs/1503.01007), [Pointer Networks](http://arxiv.org/abs/1506.03134), and [Grid LSTM](http://arxiv.org/abs/1507.01526). Related but different, see [Neural ProgrammerInterpreters](http://arxiv.org/abs/1511.06279). These types of problems are particularly difficult for neural approaches because as sequence length grows, so too does the necessary computation. Furthermore generalization can be difficult, since it's possible for networks to effectively learn lengthdependent rules that break down on longer sequences. This mostly comes from the fact that it's difficult to encode a symbolic, rulebased algorithmic approach in a neural net (hence all of the different approaches above and the active research in this area). The Neural GPU is somewhat different in its approach compared to other sequencesequence models in that it don't exactly take in or generate a sequence, but reads in an entire sequence as a static object and generates a fixed size image length equal to that of the input (with special padding characters to indicate gaps). These inputs and outputs are sized dependent on the problem (longer sequences get bigger representations). In this way, the Neural GPU looks much more like a traditional feedforward convolutional net with variable input/output size (and variable computational depth, but we'll get to that later). ## Architecture The three stages are  an embedder which takes a sequence of symbols to an input "image"  a stack of convolutional [GRUs](http://arxiv.org/abs/1412.3555) (Gated Recurrent Units) which at each step progressively process their input  a decoder which effectively does the reverse of the embedder (takes output representation to symbols) ### Encoder The input is a sequence of length n over set of symbols (e.g. {0, 1, +, PAD (=P)}, input is (in reversebinary) 1010+0111, output, constrained to be the same size, should be 11001PPPP) of size I. We first map the symbols to vectors of length m (which will be the number of channels of our mental image) by looking up each symbol in an embedding matrix of size I*m. Call these embeddings {e_i}. We then create an initial "mental image" which is a rectangular volume with width w (set to be 4), height n (the length of the input sequence) and depth m (the embedding size). In the first column we insert the e_i's depthwise, and set all other cells in the volume to be 0. Note that the size (height) of the input volume is dependent on sequence length. This could be a problem in other architectures, but since everything is done by convolution (as we'll see shortly) this architecture is able to exploit an adaptivesize input rather than reading in sequences (as I imagine convolutional text models might) ### CGRU The processing done by the network is over n (the length of the input sequence) time steps using L layers (L=2) of [GRUs](http://arxiv.org/abs/1412.3555), however this is a bit misleading because we only feed input in once (encoded as state), and if we unroll the computation for the fixed length of n time steps, it looks like a purely feedforward net. We might ask why use the GRU architecture at all (instead of only weight sharing across layers) and my guess is that the update and reset gates help in training over long unrollings (ie without them we might experience vanishing gradients  though it might be worth trying relu activated convolutional layers with shared weights across layers). Additionally, we structure the operation occurring in the GRU by forcing it to be a convolution (instead of e.g. fully connected). Intuitively what this CGRU (Convolutional GRU) is doing is processing the "mental image" (which remains the same shape over all time steps) in the same way at each point in time, thus bearing resemblance to cellular automata. ### Decoder We consider the output much like the input and read out only the first column (of height n, the output sequence length, and depth m, the embedding size). We use a decoder matrix O, of shape m*I which takes the mdimensionally represented characters and maps to I logit probabilities. $$l_k = O s_{fin}[0, k, :] = O c_k = [l_k^1, l_k^2, \dots, l_k^I]$$ The output at each slot in the output sequence is then whichever character is most probable. For each character, its loss is the probability of the target character (where the probability is softmax over the logits). The loss is then the sum of the logs (~product) of these over all characters $$L =  \sum_{k \in [1, n]} \log p(c_k^{target}) =  \sum_{k \in 1, n} \log \frac{e^{l_k^{target}}}{\sum_j e^{l_k^j}}$$ ## Training There are a bunch of tricks employed in training these.  Dropout (between 6%13.5%), noise added to gradients (gaussian with mean 0, variance ~ 1/sqrt(step number)), gradient clipping  Grid search for hyperparameter tuning  Curriculum design  the distribution over lengths of presented examples during training shifts towards more difficult problems as mastery of easier ones increases (note that this curriculum is designed by hand)  "Parameter sharing relaxation": each GRU is allowed to have different weights at each time step for 6 steps, then cycles that same set of weights (step 7 uses weights at step 1). This allows for more variation in the weights so that the network can achieve a better fit. To get a single weight matrix there's a penalty that progressively increases which is proportional to the differences of weights from the mean. ## Other Notes  The number of steps the network is run for is equal to the length of the input sequence (for algorithms which require more computational steps, this architecture is therefore insufficient for perfect calculation). I'm not sure if this is a trick, or just a decision for architectural simplicity. Regardless, it removes the need to learn a stopping mechanism.
Your comment:
