TLDR; The authors show that seq2seq LSTM networks (2 layers, 400dims) can learn to evaluate short Python programs (loops, conditionals, addition, subtraction, multiplication). The program code is fed one character at a time, and the LSTM is tasked with generating an output number (12 character vocab). The authors also present a new curriculum learning strategy, where the network is fed with a sensible mixture of easy and increasingly difficult examples, allowing it to gradually build up the concepts required to evaluate these programs. #### Key Points  LSTM unrolled for 50 steps, 2 layer, 400 cells per layer, ~2.5M parameters. Gradient norm constrained to 5.  3 Curriculum Learning strategies: 1. Naive (increase example difficulty) 2. Mixed: Randomly sample easy and hard problems, 3. Combined: Sample from Naive and Mixed strategy. Mixed or Combined almost always performs better.  Output Vocabulary: 10 digits, minus, dot  For evaluation teacher forcing is used: Feed correct output when generating target sequence  Evaluation Tasks: Program Evaluation, Addition, Memorization  Tricks: Reverse Input sequence, Double input sequence. Seem to make big difference.  Nesting loops makes the tasks difficult since LSTMs can't deal with compositionality.  Feeding easy examples and before hard examples may require the LSTM to restructure its memory. #### Notes / Questions  I wonder if there's a relation between regularization/dropout and curriculum learning. The authors propose that mixing example difficulty forces a more general representation. Shouldn't dropout be doing a similar thing?
Your comment:

## Problem Statement * Evaluating if LSTMs can express and learn short, simple programs (linear time, constant memory) in the sequencetosequence framework. * [Link to paper](http://arxiv.org/pdf/1410.4615v3.pdf) ## Approach * Formulate program evaluation task as a sequencetosequence learning problem using RNNs. * Train on short programs that can be evaluated in linear time and constant memory  RNN can perform only a single pass over the data and its memory is limited. * Two parameters to control the difficulty of the program: * `length` : Number of digits in the integer that appears in the program. * `nesting` : Number of times operations can be combined with each other. * LSTM reads the input program, one character at a time and produces output, one character at a time. ### Additional Learning Tasks * **Addition Task**  Given two numbers, the model learns to add them. This task becomes the baseline for comparing performance on other tasks. * **Memorization Task**  Give a random number, the model memorizes it and outputs it. Following techniques enhance the accuracy of the model: * **Input reversing**  Reversing the order of input, while keeping the output fixed introduces many shortterm dependencies that help LSTM in learning the process. * **Input doubling**  Presenting the same input to the network twice enhances the performance as the model gets to look at the input twice. ## Curriculum learning Gradually increase the difficulty of the program fed to the system. * **No Curriculum (baseline)**  Fixed `length` and fixed `nesting` programs are fed to the system. * **Naive Curriculum**  Start with `length` = 1 and `nesting` = 1 and keep increasing the values iteratively. * **Mix Strategy**  Randomly choose `length` and `nesting` to generate a mix of easy and difficult examples. * **Combined Strategy**  Each training example is obtained either by Naive curriculum strategy or mix strategy. ## Network Architecture * 2 layers, unrolled for 50 steps. * 400 cells per layer. * Parameters initialized uniformly in [0.08, 0.08] * minibatch size 100 * norm of gradient normalized to be less than 5 * start with learning rate = 0.5, further decreased by 0.8 after reaching target accuracy of 95% ## Observations Teacher forcing technique used for computing accuracy ie when predicting the $i_{th}$ digit, the correct first i1 digits of the output are provided as input to the LSTM. The general trend is (combine, mix) > (naive, baseline). In certain cases for program evaluation, baseline performs better than naive curriculum strategy. Intuitively, the model would use all its memory to store patterns for a given size input. Now when a higher size input is provided, the model would have to restructure its memory patterns to learn the output for this new class of inputs. The process of memory restructuring may be causing the degraded performance of the naive strategy. The combined strategy combines the naive and mix strategy and hence reduces the need to restructure the memory patterns. While LSTMs can learn to map the character level representation of simple programs to their correct output, the idea can not extend to arbitrary programs due to the runtime limitations of conventional RNNs and LSTM. Moreover, while learning is essential, the optimal curriculum strategy needs to be understood further. 