Last year, a machine translation paper came out, with an unfortunately un-memorable name (the Transformer network) and a dramatic proposal for sequence modeling that eschewed both Recurrent NNN and Convolutional NN structures, and, instead, used self-attention as its mechanism for “remembering” or aggregating information from across an input. Earlier this month, the same authors released an extension of that earlier paper, called Image Transformer, that applies the same attention-only approach to image generation, and also achieved state of the art performance there. The recent paper offers a framing of attention that I find valuable and compelling, and that I’ll try to explicate here. They describe attention as being a middle ground between the approaches of CNNs and RNNs, and one that, to use an over-abused cliche, gets the best of both worlds. CNNs are explicitly local: each convolutional filter only gathers information from the cells that fall in specific locations along some predefined grid. And, because convolutional filters have a unique parameter for every relative location in the grid they’re applied to, increasing the size of any given filter’s receptive field would engender an exponential increase in parameters: to go from a 3x3 grid to a 4x4 one, you go from 9 parameters to 16. Convolutional networks typically increase their receptive field through the mechanism of adding additional layers, but there is still this fundamental limitation that for a given number of layers, CNNs will be fairly constrained in their receptive field. On the other side of the receptive field balance, we have RNNs. RNNs have an effectively unlimited receptive field, because they just apply one operation again and again: take in a new input, and decide to incorporate that information into the hidden state. This gives us the theoretical ability to access things from the distant past, because they’re stored somewhere in the hidden state. However, each element is only seen once and needs to be stored in the hidden state in a way that sort of “averages over” all of the ways it’s useful for various points in the decoding/translation process. (My mental image basically views RNN hidden state as packing for a long trip in a small suitcase: you have to be very clever about what you decide to pack, averaging over all the possible situations you might need to be prepared for. You can’t go back and pull different things into your suitcase as a function of the situation you face; you had to have chosen to add them at the time you encountered them). All in all, RNNs are tricky both because they have difficulty storing information efficiently over long time frames, and also because they can be monstrously slow to train, since you have to run through the full sequence to built up hidden state, and can’t chop it into localized bits the way you can with CNNs. So, between CNN - with its locally-specific hidden state - and RNN - with its large receptive field but difficulty in information storage - the self-attention approach interposes itself. Attention works off of three main objects: a query, and a set of keys, each one is attached to a value. In general, all of these objects take the form of vectors. For a given query, you calculate its similarity with each key, and then normalize those into a distribution (a set of weights, all of which sum to 1) that is used as the weights in calculating a weighted average of the values. As a motivating example, think of a model that is “unrolling” or decoding a translated sentence. In order to translate a sentence properly, the model needs to “remember” not only the conceptual content of the sentence, but what it has already generated. So, at each given point in the unrolling, the model can “query” the past and get a weighted distribution over what’s relevant to it in its current context. In the original Transformer, and also in the new one, the models use “multi-headed attention”, which I think is best compared to convolution filters: in the same way that you learn different convolution filters, each with different parameters, to pick up on different features, you learn different “heads” of the attention apparatus for the same purpose. To go back to our CNN - Attention - RNN schematic from earlier: Attention makes it a lot easier to query a large receptive field, since you don’t need an additional set of learned parameters for each location you expand to; you just use the same query weights and key weights you use for every other key and query. And, it allows you to contextually extract information from the past, depending on the needs you have right now. That said, it’s still the case that it becomes infeasible to make the length of the past you calculate your attention distribution over excessively long, but that cost is in terms of computation, not additional parameters, and thus is a question of training time, rather than essential model complexity, the way additional parameters is. Jumping all the way back up the stack, to the actual most recent image paper, this question of how best to limit the receptive field is one of the more salient questions, since it still is the case that conducting attention over every prior pixel would be a very large number of calculations. The Image Transformer paper solves this in a slightly hacky way: by basically subdividing the image into chunks, and having each chunk operate over the same fixed memory region (rather than scrolling the memory region with each pixel shift) to take better advantage of the speed of batched big matrix multiplies. Overall, this paper showed an advantage for the Image Transformer approach relevative to PixelCNN autoregressive generation models, and cited the ability for a larger receptive field during generation - without explosion in number of parameters - as the most salient reason why.