Conditional Image Generation with PixelCNN Decoders on ShortScience.org

Conditional Image Generation with PixelCNN Decoders
Aaron van den Oord and Nal Kalchbrenner and Oriol Vinyals and Lasse Espeholt and Alex Graves and Koray Kavukcuoglu
arXiv e-Print archive - 2016 via Local arXiv
Keywords: cs.CV, cs.LG

Summary by Shagun Sodhani 7 years ago

#### Introduction

* The paper explores conditional image generation by adapting and improving the PixelCNN architecture.
* [Link to the paper](https://arxiv.org/abs/1606.05328)

#### Based on PixelRNN and PixelCNN

* Both models generate an image pixel by pixel, decomposing the joint distribution over pixels into a product of conditionals.
* PixelRNN uses two-dimensional LSTM layers, while PixelCNN uses masked convolutional layers.
* PixelRNN gives better results, but PixelCNN is faster to train.
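
The factorisation above, $p(x) = \prod_i p(x_i \mid x_1, \dots, x_{i-1})$, can be illustrated with a minimal sketch. The `toy_conditional` function below is a hypothetical stand-in for the network (it is not from the paper) and works on binary pixels only; the point is just to show how the chain rule yields both a likelihood and a sequential sampling procedure.

```python
import math
import random


def toy_conditional(previous_pixels):
    """Hypothetical stand-in for the network: P(x_i = 1 | x_1..x_{i-1})."""
    if not previous_pixels:
        return 0.5
    # Toy rule (an assumption): lean towards the mean of the pixels seen so far.
    mean = sum(previous_pixels) / len(previous_pixels)
    return 0.25 + 0.5 * mean


def log_likelihood(pixels, conditional=toy_conditional):
    """Chain rule: log p(x) is the sum of the per-pixel conditional log-probs."""
    total = 0.0
    for i, x in enumerate(pixels):
        p_one = conditional(pixels[:i])  # only earlier pixels are visible
        total += math.log(p_one if x == 1 else 1.0 - p_one)
    return total


def sample(n_pixels, conditional=toy_conditional, rng=None):
    """Generate pixels one at a time, each conditioned on those before it."""
    rng = rng or random.Random(0)
    pixels = []
    for _ in range(n_pixels):
        p_one = conditional(pixels)
        pixels.append(1 if rng.random() < p_one else 0)
    return pixels
```

Evaluating the likelihood needs only the known pixels, so a convolutional model can compute all conditionals in parallel during training. Sampling stays sequential because each pixel depends on the ones already drawn.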

#### Gated PixelCNN

* PixelRNN outperforms PixelCNN for two reasons: its recurrent layers have a larger receptive field, and its LSTM gates are multiplicative units that can model more complex interactions.
* To close these gaps, the paper uses deeper models (to enlarge the receptive field) and gated activation units (equation 2 in the [paper](https://arxiv.org/abs/1606.05328)), respectively.
* Masked convolutions can lead to blind spots in the receptive field.
* These can be removed by combining two convolutional network stacks:
    * Horizontal stack: conditions on the current row (the pixels to the left of the current pixel).
    * Vertical stack: conditions on all rows above the current row.
* Every layer in the horizontal stack takes as input the output of the previous layer as well as that of the vertical stack.
* Residual connections are used in the horizontal stack and not in the vertical stack (as they did not seem to improve results in the initial settings).
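
The blind spot mentioned above can be made concrete with a small sketch. It assumes 3x3 masked filters that read the three pixels in the row above and the one pixel directly to the left; the function names and the grid-walking approach are illustrative and not taken from the paper. The code follows those offsets through a stack of layers and reports which earlier pixels can never influence the target.

```python
# Offsets (row, col) a 3x3 masked filter may read, relative to the output
# pixel: the three pixels in the row above and the pixel directly to the left.
# (Assumed mask layout; the centre pixel is handled separately below.)
MASKED_OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, -1)]


def valid_context(height, width, target):
    """Pixels that precede `target` in raster-scan order."""
    tr, tc = target
    return {(r, c) for r in range(height) for c in range(width)
            if r < tr or (r == tr and c < tc)}


def receptive_field(height, width, target, n_layers):
    """Input pixels that can influence `target` after n_layers masked convs."""
    reached = set()
    frontier = {target}
    for _ in range(n_layers):
        step = {(r + dr, c + dc)
                for (r, c) in frontier
                for (dr, dc) in MASKED_OFFSETS
                if 0 <= r + dr < height and 0 <= c + dc < width}
        reached |= step
        # Layers after the first may also read their own centre position,
        # so positions found earlier stay in the frontier.
        frontier |= step
    return reached


def blind_spot(height, width, target, n_layers):
    """Earlier pixels that the masked stack can never see."""
    return (valid_context(height, width, target)
            - receptive_field(height, width, target, n_layers))
```

For a 5x5 image, `blind_spot(5, 5, (2, 2), n_layers=10)` returns `{(1, 4)}`. Each step up a row can move at most one column to the right, so pixels far to the upper right stay invisible no matter how many layers are stacked.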

#### Conditional PixelCNN

* Models the conditional distribution of an image given a high-level description of it, represented by a latent vector $h$ (equation 4 in the [paper](https://arxiv.org/abs/1606.05328)).
* This conditioning does not depend on the location of the pixel in the image.
* To take location into account as well, map $h$ to a spatial representation $s = m(h)$ (equation 5 in the [paper](https://arxiv.org/abs/1606.05328)).
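
For reference, the gated activations referred to in this summary can be written out as below. This is a sketch of the notation as used in the paper and should be checked against the original: $*$ is a convolution, $\odot$ an element-wise product, $\sigma$ the sigmoid, $k$ the layer index, and $W$, $V$ learned weights with subscripts $f$ (filter) and $g$ (gate).

The unconditional gated unit (equation 2):

$$
\mathbf{y} = \tanh(W_{k,f} * \mathbf{x}) \odot \sigma(W_{k,g} * \mathbf{x})
$$

Conditioning on $h$ adds the same term at every pixel position (equation 4):

$$
\mathbf{y} = \tanh(W_{k,f} * \mathbf{x} + V_{k,f}^{T} \mathbf{h}) \odot \sigma(W_{k,g} * \mathbf{x} + V_{k,g}^{T} \mathbf{h})
$$

With the spatial map $s = m(h)$, the added term varies with position (equation 5), where $V_{k,f} * \mathbf{s}$ and $V_{k,g} * \mathbf{s}$ are unmasked $1 \times 1$ convolutions:

$$
\mathbf{y} = \tanh(W_{k,f} * \mathbf{x} + V_{k,f} * \mathbf{s}) \odot \sigma(W_{k,g} * \mathbf{x} + V_{k,g} * \mathbf{s})
$$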

#### PixelCNN Auto-Encoders

* Start with a traditional auto-encoder architecture, replace the deconvolutional decoder with a conditional PixelCNN, and train the network end-to-end.

#### Experiments

* For unconditional modelling, Gated PixelCNN either outperforms PixelRNN or performs almost as well, and it takes much less time to train.
* When conditioning on ImageNet classes, the log-likelihood did not improve much, but the visual quality of the generated samples improved significantly.
* The paper also includes sample images generated by conditioning on human portraits and by training a PixelCNN auto-encoder on ImageNet patches.
