#### Introduction * The paper explores the domain of conditional image generation by adopting and improving PixelCNN architecture. * [Link to the paper](https://arxiv.org/abs/1606.05328) #### Based on PixelRNN and PixelCNN * Models image pixel by pixel by decomposing the joint image distribution as a product of conditionals. * PixelRNN uses two-dimensional LSTM while PixelCNN uses convolutional networks. * PixelRNN gives better results but PixelCNN is faster to train. #### Gated PixelCNN * PixelRNN outperforms PixelCNN due to the larger receptive field and because they contain multiplicative units, LSTM gates, which allow modelling more complex interactions. * To account for these, deeper models and gated activation units (equation 2 in the [paper](https://arxiv.org/abs/1606.05328)) can be used respectively. * Masked convolutions can lead to blind spots in the receptive fields. * These can be removed by combining 2 convolutional network stacks: * Horizontal stack - conditions on the current row. * Vertical stack - conditions on all rows above the current row. * Every layer in the horizontal stack takes as input the output of the previous layer as well as that of the vertical stack. * Residual connections are used in the horizontal stack and not in the vertical stack (as they did not seem to improve results in the initial settings). #### Conditional PixelCNN * Model conditional distribution of image, given the high-level description of the image, represented using the latent vector h (equation 4 in the [paper](https://arxiv.org/abs/1606.05328)) * This conditioning does not depend on the location of the pixel in the image. * To consider the location as well, map h to spatial representation $s = m(h)$ (equation 5 in the the [paper](https://arxiv.org/abs/1606.05328)) #### PixelCNN Auto-Encoders * Start with a traditional auto-encoder architecture and replace the deconvolutional decoder with PixelCNN and train the network end-to-end. #### Experiments * For unconditional modelling, Gated PixelCNN either outperforms PixelRNN or performs almost as good and takes much less time to train. * In the case of conditioning on ImageNet classes, the log likelihood measure did not improve a lot but the visual quality of the generated sampled was significantly improved. * Paper also included sample images generated by conditioning on human portraits and by training a PixelCNN auto-encoder on ImageNet patches.