Summary by Shagun Sodhani 3 years ago
#### Introduction
* Problem: Building an expressive, tractable and scalable image model which can be used in downstream tasks like image generation, reconstruction, compression etc.
* [Link to the paper](https://arxiv.org/abs/1601.06759)
#### Model
* Scan the image, one row at a time and one pixel at a time (within each row).
* Given the scanned content, predict the distribution over the possible values for the next pixel.
* The joint distribution over the pixel values is factorised into a product of conditional distributions, thus casting the problem as a sequence problem.
* Parameters used in prediction are shared across all the pixel positions.
* Since each pixel is jointly determined by 3 values (one per colour channel), each channel may also be conditioned on the previously generated channels of the same pixel (G on R, and B on both R and G).
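
The factorisation described in this section can be written out explicitly. This is a sketch in notation assumed here: an n x n image is flattened row by row into pixels $x_1, \ldots, x_{n^2}$, $\mathbf{x}_{<i}$ denotes all pixels before pixel $i$, and $x_{i,R}, x_{i,G}, x_{i,B}$ are the channel values of pixel $i$.

$$
p(\mathbf{x}) = \prod_{i=1}^{n^2} p(x_i \mid x_1, \ldots, x_{i-1})
$$

$$
p(x_i \mid \mathbf{x}_{<i}) = p(x_{i,R} \mid \mathbf{x}_{<i}) \; p(x_{i,G} \mid \mathbf{x}_{<i}, x_{i,R}) \; p(x_{i,B} \mid \mathbf{x}_{<i}, x_{i,R}, x_{i,G})
$$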
##### Pixel as discrete value
* The conditional distributions are multinomial (with channel variable taking 1 of 256 discrete values).
* This discrete representation is simpler and easier to learn.
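
To make the discrete formulation concrete, the following is a minimal NumPy sketch of a 256-way softmax negative log-likelihood over channel values. The function name `pixel_nll` and the array layout (logits with a trailing axis of size 256) are assumptions made for illustration, not the paper's code.

```python
import numpy as np

def pixel_nll(logits, targets):
    """Negative log-likelihood (in nats) of discrete channel values.

    logits  : float array of shape (..., 256), unnormalised scores
    targets : int array of shape (...), true values in 0..255
    """
    # Subtract the max for numerical stability before exponentiating.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Pick the log-probability assigned to each true value.
    picked = np.take_along_axis(log_probs, targets[..., None], axis=-1)[..., 0]
    return -picked.sum()
```

With all-zero logits the predicted distribution is uniform, so each channel value costs ln(256) nats.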
#### Pixel RNN
##### Row LSTM
* Unidirectional layer that processes the image row by row.
* Uses one-dimensional convolution (kernel of size k x 1, k >= 3).
* Refer to image 2 in the [paper](https://arxiv.org/abs/1601.06759).
* Weight sharing in the convolution ensures translation invariance of the computed features along each row.
* For the LSTM, the input-to-state component is computed for the entire 2-d input map and is then masked to include only the valid context.
* For equations related to the state-to-state component, refer to equation 3 in the [paper](https://arxiv.org/abs/1601.06759).
##### Diagonal BiLSTM
* Bidirectional layer that processes the image in the diagonal fashion.
* The input map is skewed by offsetting each row of the image by one position with respect to the previous row.
* Refer to image 3 in the [paper](https://arxiv.org/abs/1601.06759).
* For both directions, the input-to-state component is a 1 x 1 convolution while the state-to-state recurrent component is computed with column wise convolution using kernel size 2x1.
* Kernel size of 2x1 processes minimal information yielding a highly non-linear computation.
* Output map is skewed back by removing the offset positions.
* To prevent the layer from seeing future pixels, the right output map is shifted down by one row and added to the left output map.
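
The skewing and unskewing steps described above can be sketched in NumPy as follows. The helper names `skew` and `unskew` are hypothetical, and the sketch handles a single 2-d map rather than a batch of feature maps.

```python
import numpy as np

def skew(x):
    """Offset row r of a 2-d map by r positions, padding with zeros.

    An (n_rows, n_cols) map becomes (n_rows, n_cols + n_rows - 1), so that
    each diagonal of the input lines up as a column of the output.
    """
    n_rows, n_cols = x.shape
    out = np.zeros((n_rows, n_cols + n_rows - 1), dtype=x.dtype)
    for r in range(n_rows):
        out[r, r:r + n_cols] = x[r]
    return out

def unskew(y, n_cols):
    """Invert skew() by dropping the offset positions of every row."""
    n_rows = y.shape[0]
    return np.stack([y[r, r:r + n_cols] for r in range(n_rows)])
```

After skewing, a column-wise 2 x 1 convolution over the skewed map combines values that lie on adjacent diagonals of the original image, which is what lets the recurrence proceed one diagonal at a time.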
##### Residual Connections
* Residual connections (or skip connections) are used to increase convergence speed and to propagate signals more explicitly.
* Refer to image 4 in the [paper](https://arxiv.org/abs/1601.06759).
##### Masked Convolutions
* Masks are used to enforce certain restrictions on the connections in the network (e.g. when predicting values for the R channel, values of the B channel cannot be used).
* Mask A is applied to the first convolution layer and restricts connections to only those neighbouring pixels and colour channels that have already been seen.
* Mask B is applied to all subsequent input-to-state convolutional transitions and relaxes mask A by also allowing connections from a colour channel to itself.
* Refer to image 4 in the [paper](https://arxiv.org/abs/1601.06759).
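
As an illustration of the two mask types, here is a minimal NumPy sketch that builds a binary mask for a k x k convolution kernel. The function name `build_mask`, the `(c_out, c_in, k, k)` weight layout, and the rule that feature channel `i` belongs to colour `i % 3` (0 = R, 1 = G, 2 = B) are all assumptions made for this example; the paper does not prescribe a particular channel layout.

```python
import numpy as np

def build_mask(k, c_in, c_out, mask_type, n_colors=3):
    """Binary mask of shape (c_out, c_in, k, k) for a masked convolution.

    mask_type 'A': the centre pixel only connects strictly earlier colours.
    mask_type 'B': the centre pixel also connects a colour to itself.
    """
    if mask_type not in ("A", "B"):
        raise ValueError("mask_type must be 'A' or 'B'")
    mask = np.zeros((c_out, c_in, k, k), dtype=np.float32)
    centre = k // 2
    mask[:, :, :centre, :] = 1.0        # all rows above the current pixel
    mask[:, :, centre, :centre] = 1.0   # pixels to the left in the same row
    # Assumed layout: channel i carries colour i % n_colors.
    in_color = np.arange(c_in) % n_colors
    out_color = np.arange(c_out) % n_colors
    if mask_type == "A":
        allowed = out_color[:, None] > in_color[None, :]
    else:
        allowed = out_color[:, None] >= in_color[None, :]
    mask[:, :, centre, centre] = allowed
    return mask
```

In use, the mask would be multiplied element-wise with the convolution weights before every forward pass, so that the masked connections stay at zero throughout training.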
##### PixelCNN
* Uses multiple convolution layers that preserve spatial resolution.
* Makes receptive field large but not unbounded.
* Masks are used to avoid seeing the future context.
* Faster than PixelRNN at training or evaluation time (as convolutions can be parallelized easily); image generation remains sequential.
##### Multi-Scale PixelRNN
* Composed of one unconditional PixelRNN and multiple conditional PixelRNNs.
* The unconditional network generates a smaller s x s image, which is fed as input to the conditional PixelRNN; the conditional network then generates the larger n x n image (n is a multiple of s).
* Conditional PixelRNN is a standard PixelRNN with layers biased with an upsampled version of the s x s image.
* For upsampling, a convolution network with deconvolution layers constructs an enlarged feature map of size c x n x n.
* For biasing, the c x n x n map is mapped to a 4h x n x n map (using a 1 x 1 unmasked convolution) and added to the input-to-state map.
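
The biasing step can be sketched in NumPy by treating the 1 x 1 unmasked convolution as a per-pixel linear map over channels. The function name `bias_input_to_state` and the `(4h, c)` weight layout are assumptions for illustration; the upsampling network that produces the c x n x n map is not shown.

```python
import numpy as np

def bias_input_to_state(input_to_state, upsampled, weight):
    """Add an upsampled conditioning map to the input-to-state map.

    input_to_state : (4h, n, n) map computed by the conditional PixelRNN layer
    upsampled      : (c, n, n) enlarged feature map of the s x s image
    weight         : (4h, c) weights of a 1 x 1 unmasked convolution
    """
    # A 1 x 1 convolution mixes channels independently at every position.
    bias = np.einsum("oc,cyx->oyx", weight, upsampled)
    return input_to_state + bias
```

No mask is needed here because the whole s x s image is already available before the conditional network starts generating.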
#### Training and Evaluation
* Pixel values are dequantized using real-valued noise so that the log-likelihoods of continuous and discrete models can be compared.
* Update rule - RMSProp
* Batch size - 16 for MNIST and CIFAR-10 and 32 (or 64) for ImageNet.
* Residual connections are as effective as skip connections; in fact, the two can be used together as well.
* PixelRNN outperforms other models for Binary MNIST and CIFAR-10.
* For CIFAR-10, Diagonal BiLSTM > Row LSTM > PixelCNN. This is also the order of receptive field size for the 3 architectures, and the observation underlines the importance of having a large receptive field.
* The paper also provides new benchmarks for generative image modelling on the ImageNet dataset.
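
The dequantization step mentioned in this section can be sketched as below. Uniform noise in [0, 1) is an assumption here (the summary only says "real-valued noise"), and both helper names are hypothetical. The second helper converts a log-likelihood in nats into bits per dimension, a common unit for reporting such results.

```python
import numpy as np

def dequantize(pixels, rng):
    """Turn integer pixel values (0..255) into real values by adding noise."""
    # Assumption: noise is drawn uniformly from [0, 1).
    noise = rng.uniform(0.0, 1.0, size=pixels.shape)
    return pixels.astype(np.float64) + noise

def bits_per_dim(nll_nats, num_dims):
    """Convert a total negative log-likelihood in nats to bits per dimension."""
    return nll_nats / (num_dims * np.log(2.0))
```

For example, a model that assigns uniform probability to all 256 values of a single channel has a negative log-likelihood of ln(256) nats, which is 8 bits per dimension.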
