For solving sequence modeling problems, recurrent architectures have been historically the most commonly used solution, but, recently, temporal convolution networks, especially with dilations to help capture longer term dependencies, have gained prominence. RNNs have theoretically much larger capacity to learn long sequences, but also have a lot of difficulty propagating signal forward through long chains of recurrent operations. This paper, which suggests the approach of Trellis Networks, places itself squarely in the middle of the debate between these two paradigms. TrellisNets are designed to be a theoretical bridge between between temporal convolutions and RNNs - more specialized than the former, but more generalized than the latter. https://i.imgur.com/J2xHYPx.png The architecture of TrellisNets is very particular, and, unfortunately, somewhat hard to internalize without squinting at diagrams and equations for awhile. Fundamentally: - At each layer in a TrellisNet, the network creates a “candidate pre-activation” by combining information from the input and the layer below, for both the current and former time step. - This candidate pre-activation is then non-linearly combined with the prior layer, prior-timestep hidden state - This process continues for some desired number of layers. https://i.imgur.com/f96QgT8.png At first glance, this structure seems pretty arbitrary: a lot of quantities connected together, but without a clear mechanic for what’s happening. However, there are a few things interesting to note here, which will help connect these dots, to view TrellisNet as either a kind of RNN or a kind of CNN: - TrellisNet uses the same weight matrices to process prior and current timestep inputs/hidden states, no matter which timestep or layer it’s on. This is strongly reminiscent of a recurrent architecture, which uses the same calculation loop at each timestep - TrellisNets also re-insert the model’s input at each layer. This also gives it more of a RNN-like structure, where the prior layer’s values act as a kind of “hidden state”, which are then combined with an input value - At a given layer, each timestep only needs access to two elements of the prior layer (in addition to the input); it does not require access to all the prior-timestep values of it’s own layer. This is important because it means that you can calculate an entire layer’s values at once, given the values of the prior layer: this means these models can be more easily parallelized for training Seeing TrellisNets as a kind of Temporal CNN is fairly straightforward: each timestep’s value, at a given layer, is based on a “filter” of the lower-layer value at the current and prior timestep, and this filter is shared across the whole sequence. Framing them as a RNN is certainly trickier, and anyone wanting to understand it in full depth is probably best served by returning to the paper’s equations. At at high level, the authors show that TrellisNets can represent a specific kind of RNN: a truncated RNN, where each timestep only uses history from the prior M time steps, rather than the full sequence. This works by sort of imagining the RNN chains as existing along the diagonals of a TrellisNet architecture diagram: as you reach higher levels, you can also reach farther back in time. Specifically, a TrellisNet that wants to represent a depth K truncated RNN, which is allowed to unroll through M steps of history, can do so using M + K - 1 layers. Essentially, by using a fixed operation across layers and timesteps, the TrellisNet authors blur the line between layer and timestep: any chain of operations, across layers, is fundamentally a series of the same operation, performed many times, and is in that way RNN-like. The authors have not yet taken a stab at translation, but tested their model on a number of word and character-level language modeling tasks (predicting the next word or character, given prior ones), and were able to successfully beat SOTA on many of them. I’d be curious to see more work broadly in this domain, and also gain a better understanding of areas in which a fixed, recurrently-used layer operation, like the ones used in RNNs and this paper, is valuables, and areas (like a “normal” CNN) where having specific weights for different levels of the hierarchy is valable.