This paper explores the use of convolutional (PixelCNN) and recurrent (PixelRNN) units for modeling the distribution of images, in the framework of autoregressive distribution estimation. In this framework, the input distribution $p(x)$ is factorized into a product of conditionals $\prod_i p(x_i \mid x_{<i})$. Previous work has shown that very good models can be obtained by using a neural network parametrization of the conditionals (e.g. see our work on NADE \cite{journals/jmlr/LarochelleM11}). Moreover, unlike other approaches based on latent stochastic units, whether directed or undirected, the autoregressive approach is able to compute log-probabilities tractably. So in this paper, by considering the specific case of $x$ being an image, they exploit the topology of pixels and investigate appropriate architectures for this.

Among the paper's contributions are:

1. They propose Diagonal BiLSTM units for the PixelRNN, which are efficient (thanks to the use of convolutions) while making it possible to, in effect, condition a pixel's distribution on all the pixels above it (see Figure 2 for an illustration).
2. They demonstrate that the use of residual connections (a form of skip connections, from hidden layer $i-1$ to layer $i+1$) is very effective at learning very deep distribution estimators (they go as deep as 12 layers).
3. They show that it is possible to successfully model the distribution over the pixel intensities (effectively an integer between 0 and 255) using a softmax of 256 units.
4. They propose a multi-scale extension of their model, which they apply to larger 64x64 images.

The experiments show that the PixelRNN model based on Diagonal BiLSTM units achieves state-of-the-art performance on the binarized MNIST benchmark, in terms of log-likelihood. They also report excellent log-likelihood on the CIFAR-10 dataset, compared to previous work based on real-valued density models. Finally, they show that their model is able to generate high-quality image samples.
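
As a rough illustration of the factorization and the 256-way softmax described above, the log-likelihood of an image is simply the sum of the log-probabilities that each conditional assigns to the observed pixel value. The sketch below assumes the per-pixel logits have already been produced by some network that respects the pixel ordering; that network is not shown, and the names `log_softmax` and `image_log_likelihood` are hypothetical helpers, not taken from the paper.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def image_log_likelihood(logits, pixels):
    """Sum over i of log p(x_i | x_<i), in nats.

    logits: (n, 256) array, one row per pixel value in generation order.
            Assumed to come from a model whose row i depends only on x_<i.
    pixels: (n,) integer array with values in [0, 255].
    """
    log_probs = log_softmax(logits)
    # Pick, for each position, the log-probability of the observed intensity.
    return log_probs[np.arange(pixels.size), pixels].sum()

# A model that outputs constant logits is uniform over the 256 intensities.
uniform_logits = np.zeros((4, 256))
observed = np.array([0, 1, 2, 255])
print(image_log_likelihood(uniform_logits, observed))
```

With constant logits every conditional is uniform, so the result equals $-n \log 256$ for $n$ pixel values, which is a convenient sanity check for any implementation.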

#### Introduction

* Problem: Building an expressive, tractable and scalable image model which can be used in downstream tasks like image generation, reconstruction, compression etc.
* [Link to the paper](https://arxiv.org/abs/1601.06759)

#### Model

* Scan the image one row at a time, and one pixel at a time within each row.
* Given the scanned content, predict the distribution over the possible values for the next pixel.
* Joint distribution over the pixel values is factorised into a product of conditional distributions, thus casting the problem as a sequence problem.
* Parameters used in prediction are shared across all the pixel positions.
* Since each pixel is jointly determined by 3 values (3 colour channels), each channel may be conditioned on other channels as well.

##### Pixel as discrete value

* The conditional distributions are multinomial (with each channel variable taking 1 of 256 discrete values).
* This discrete representation is simpler and easier to learn.

#### Pixel RNN

##### Row LSTM

* Unidirectional layer that processes the image row by row.
* Uses one-dimensional convolution (kernel of size k x 1, k >= 3).
* Refer to image 2 in the [paper](https://arxiv.org/abs/1601.06759).
* Weight sharing in convolution ensures translation invariance of the computed features along each row.
* For the LSTM, the input-to-state component is computed for the entire 2D input map and then masked to include only the valid context.
* For equations related to the state-to-state component, refer to equation 3 in the [paper](https://arxiv.org/abs/1601.06759).

##### Diagonal BiLSTM

* Bidirectional layer that processes the image in a diagonal fashion.
* Input map is skewed by offsetting each row of the image by one position with respect to the previous row.
* Refer to image 3 in the [paper](https://arxiv.org/abs/1601.06759).
* For both directions, the input-to-state component is a 1 x 1 convolution while the state-to-state recurrent component is computed with a column-wise convolution using kernel size 2x1.
* Kernel size of 2x1 processes minimal information at each step, yielding a highly non-linear computation.
* Output map is skewed back by removing the offset positions.
* To prevent layers from seeing future pixels, the right output map is shifted down by one row and added to the left output map.

##### Residual Connections

* Residual connections (or skip connections) are used to increase convergence speed and to propagate signals more explicitly.
* Refer to image 4 in the [paper](https://arxiv.org/abs/1601.06759).

##### Masked Convolutions

* Masks are used to enforce certain restrictions on the connections in the network (e.g. when predicting values for the R channel, values of the B channel cannot be used).
* Mask A is applied to the first convolution layer and restricts connections to only those neighbouring pixels and colour channels that have already been seen.
* Mask B is applied to all subsequent input-to-state convolution transitions and additionally allows connections from a colour channel to itself.
* Refer to image 4 in the [paper](https://arxiv.org/abs/1601.06759).

##### PixelCNN

* Uses multiple convolution layers that preserve spatial resolution.
* Makes the receptive field large but not unbounded.
* Masks are used to avoid seeing the future context.
* Faster than PixelRNN at training or evaluation time (as convolutions can be parallelized easily).

##### Multi-Scale PixelRNN

* Composed of one unconditional PixelRNN and multiple conditional PixelRNNs.
* Unconditional network generates a smaller s x s image, which is fed as input to the conditional PixelRNN that generates the larger n x n image (n is a multiple of s).
* Conditional PixelRNN is a standard PixelRNN with layers biased with an upsampled version of the s x s image.
* For upsampling, a convolution network with deconvolution layers constructs an enlarged feature map of size c x n x n.
* For biasing, the c x n x n map is mapped to a 4h x n x n map (using a 1 x 1 unmasked convolution) and added to the input-to-state map.
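
Returning to the masked convolutions described above, the difference between Mask A and Mask B can be sketched as a binary array that is multiplied into the convolution weights before the convolution is applied. This is a minimal NumPy sketch under stated assumptions, not the paper's implementation: the function name `make_mask` is hypothetical, the kernel is assumed to be square with odd size, and feature map `c` is assumed to belong to colour `c % 3` (R, G, B in that order), which is only one possible way of assigning feature maps to colours.

```python
import numpy as np

def make_mask(kind, k, in_ch, out_ch, n_colours=3):
    """Binary mask of shape (out_ch, in_ch, k, k) for a k x k kernel.

    kind: "A" (first layer) or "B" (subsequent layers).
    Assumption: feature map c belongs to colour c % n_colours.
    """
    assert kind in ("A", "B") and k % 2 == 1
    mask = np.zeros((out_ch, in_ch, k, k), dtype=np.float32)
    centre = k // 2
    mask[:, :, :centre, :] = 1.0        # all rows above the current pixel
    mask[:, :, centre, :centre] = 1.0   # pixels to the left in the current row

    # At the current pixel, only already-predicted colours may be used.
    out_colour = np.arange(out_ch)[:, None] % n_colours
    in_colour = np.arange(in_ch)[None, :] % n_colours
    if kind == "A":
        allowed = in_colour < out_colour    # strictly earlier colours only
    else:
        allowed = in_colour <= out_colour   # earlier colours and the colour itself
    mask[:, :, centre, centre] = allowed
    return mask

mask_a = make_mask("A", k=3, in_ch=3, out_ch=3)
mask_b = make_mask("B", k=3, in_ch=3, out_ch=3)
print(mask_a[1, :, 1, 1])  # centre connections into the G output under mask A
print(mask_b[1, :, 1, 1])  # centre connections into the G output under mask B
```

In use, the kernel weights would be multiplied elementwise by the mask before every forward pass, so that masked connections contribute nothing regardless of what the optimizer does to the underlying weights.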
#### Training and Evaluation

* For continuous density models, pixel values are dequantized by adding real-valued noise, so that the log-likelihoods of continuous and discrete models can be compared.
* Update rule: RMSProp.
* Batch size: 16 for MNIST and CIFAR-10, and 32 (or 64) for ImageNet.
* Residual connections are as effective as skip connections; in fact, the two can be used together as well.
* PixelRNN outperforms other models on binarized MNIST and CIFAR-10.
* For CIFAR-10, Diagonal BiLSTM > Row LSTM > PixelCNN. This is also the order of receptive field size for the 3 architectures, and the observation underlines the importance of having a large receptive field.
* The paper also provides new benchmarks for generative image modelling on the ImageNet dataset.
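
The dequantization step mentioned above can be sketched as follows. The sketch assumes uniform noise in [0, 1), which is the setting in which discrete and continuous log-likelihoods become directly comparable. It also includes a conversion from nats to bits per dimension, a common reporting convention that these notes do not spell out, so treat it as an assumption. The names `dequantize` and `nats_to_bits_per_dim` are hypothetical.

```python
import numpy as np

def dequantize(pixels, rng):
    """Turn integer intensities into real values by adding uniform noise in [0, 1)."""
    noise = rng.uniform(0.0, 1.0, size=pixels.shape)
    return pixels.astype(np.float64) + noise

def nats_to_bits_per_dim(total_nll_nats, n_dims):
    """Normalise a total negative log-likelihood (in nats) by dimensionality, in bits."""
    return total_nll_nats / (n_dims * np.log(2.0))

rng = np.random.default_rng(0)
pixels = np.array([0, 17, 255])
print(dequantize(pixels, rng))

# Sanity check: a uniform model over 256 intensities costs log(256) nats per value.
n_dims = 32 * 32 * 3
print(nats_to_bits_per_dim(n_dims * np.log(256.0), n_dims))
```

A model that is uniform over the 256 intensities therefore sits at 8 bits per dimension, which gives a baseline against which reported numbers can be read.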