#### Introduction

This [paper](https://github.com/lucastheis/ride) introduces the *recurrent image density estimator* (RIDE), a generative model that combines a *multidimensional* recurrent neural network with mixtures of experts to model the distribution of natural images. The authors use *spatial* LSTMs (SLSTMs) to capture image semantics in the form of hidden states; these hidden vectors are then fed into a *factorized mixture of conditional Gaussian scale mixtures* (MCGSM) to predict the state of the corresponding pixels.

##### __1. Spatial long short-term memory (SLSTM)__

This is a straightforward extension of the multidimensional RNN that captures long-range interactions. Let $\mathbf{x}$ be a grayscale image patch and $x_{ij}$ be the intensity of the pixel at location ${ij}$. At each location $ij$, each LSTM unit performs the following operations:

$\mathbf{c}_{ij} = \mathbf{g}_{ij} \odot \mathbf{i}_{ij} + \mathbf{c}_{i,j-1} \odot \mathbf{f}^c_{ij} + \mathbf{c}_{i-1,j} \odot \mathbf{f}^r_{ij}$

$\mathbf{h}_{ij} = \tanh(\mathbf{c}_{ij} \odot \mathbf{o}_{ij})$

$\begin{pmatrix} \mathbf{g}_{ij} \\ \mathbf{o}_{ij} \\ \mathbf{i}_{ij} \\ \mathbf{f}_{ij}^r \\ \mathbf{f}_{ij}^c \end{pmatrix} = \begin{pmatrix} \tanh \\ \sigma \\ \sigma \\ \sigma \\ \sigma \end{pmatrix} T_{\mathbf{A,b}} \begin{pmatrix} \mathbf{x}_{<ij} \\ \mathbf{h}_{i,j-1} \\ \mathbf{h}_{i-1,j} \end{pmatrix}$

where $\mathbf{c}_{ij}$ and $\mathbf{h}_{ij}$ are the memory units and hidden units, respectively. Note that there are two different forget gates, $\mathbf{f}^c_{ij}$ and $\mathbf{f}^r_{ij}$, for the two preceding memory states $\mathbf{c}_{i,j-1}$ and $\mathbf{c}_{i-1,j}$. Also note that $\mathbf{x}_{<ij}$ here denotes the *causal neighborhood* of the pixel, which the Markov assumption restricts to a small region.

![ride_1](http://i.imgur.com/W8ugGvl.png)

As shown in Fig. C, although the prediction of a pixel depends only on its neighborhood (green) through feedforward connections, there is an indirect connection to a much larger region (red) via recurrent connections.

##### __2. Factorized mixtures of conditional Gaussian scale mixtures__

Using the chain rule, a generative model can usually be expressed as $p(\mathbf{x};\mathbf{\theta}) = \prod_{i,j} p(x_{ij} \mid \mathbf{x}_{<ij}; \mathbf{\theta})$. One way to improve the representational power of a model is to introduce a different set of parameters for each pixel, i.e. $p(\mathbf{x}; \{ \mathbf{\theta}_{ij} \}) = \prod_{i,j} p(x_{ij} \mid \mathbf{x}_{<ij}; \mathbf{\theta}_{ij})$. However, untying shared parameters leads to a drastic increase in the number of parameters. Therefore, the authors apply two simple, commonly used assumptions:

1. __Markov assumption__: $\mathbf{x}_{<ij}$ is limited to a small neighborhood around $x_{ij}$ (the causal neighborhood).
2. __Stationarity and shift invariance__: the same set of parameters $\mathbf{\theta}_{ij}$ is used for every location ${ij}$, which corresponds to the recurrent structure of an RNN.

Therefore, the hidden vector from the SLSTM can be fed into the MCGSM to predict the state of the corresponding pixel, i.e. $p(x_{ij} \mid \mathbf{x}_{<ij}) = p(x_{ij} \mid \mathbf{h}_{ij})$. The conditional distribution in the MCGSM is represented as a mixture of experts:

$p(x_{ij} \mid \mathbf{x}_{<ij}; \mathbf{\theta}_{ij}) = \sum_{c,s} p(c, s \mid \mathbf{x}_{<ij}, \mathbf{\theta}_{ij}) \, p(x_{ij} \mid \mathbf{x}_{<ij}, c, s, \mathbf{\theta}_{ij})$

where the first and second factors correspond to the gates and the experts, respectively. To further reduce the number of parameters, the authors propose a *factorized* MCGSM, which makes it feasible to use larger neighborhoods and more mixture components.
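
To make the two sections above more concrete, here is a minimal pure-Python sketch of the SLSTM recurrence followed by a mixture-of-experts conditional density over a pixel. It is not the code from the RIDE repository. Several things are assumptions made only for illustration: the causal neighborhood is reduced to the left and upper pixel, the weights are random, and the head is a plain mixture of Gaussians with softmax gates standing in for the MCGSM (the names `slstm_forward`, `mixture_density` and `head` are hypothetical).

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def slstm_forward(image, A, b, n):
    """Hidden states h_ij of a one-layer spatial LSTM with n units.

    Assumption for this sketch: the causal neighborhood x_{<ij} is only the
    left and upper pixel, with zeros outside the image.
    """
    rows, cols = len(image), len(image[0])
    # Index 0 in each direction is a zero border, so c[i][j] belongs to pixel (i-1, j-1).
    c = [[[0.0] * n for _ in range(cols + 1)] for _ in range(rows + 1)]
    h = [[[0.0] * n for _ in range(cols + 1)] for _ in range(rows + 1)]
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            left = image[i - 1][j - 2] if j > 1 else 0.0
            up = image[i - 2][j - 1] if i > 1 else 0.0
            inputs = [left, up] + h[i][j - 1] + h[i - 1][j]
            # Affine map T_{A,b}, then split the result into the five gate blocks.
            pre = [sum(w * v for w, v in zip(row, inputs)) + bias
                   for row, bias in zip(A, b)]
            g = [math.tanh(z) for z in pre[0:n]]
            o = [sigmoid(z) for z in pre[n:2 * n]]
            i_gate = [sigmoid(z) for z in pre[2 * n:3 * n]]
            f_r = [sigmoid(z) for z in pre[3 * n:4 * n]]
            f_c = [sigmoid(z) for z in pre[4 * n:5 * n]]
            # c_ij = g * i + c_{i,j-1} * f^c + c_{i-1,j} * f^r
            c[i][j] = [g[k] * i_gate[k] + c[i][j - 1][k] * f_c[k]
                       + c[i - 1][j][k] * f_r[k] for k in range(n)]
            # h_ij = tanh(c_ij * o_ij)
            h[i][j] = [math.tanh(c[i][j][k] * o[k]) for k in range(n)]
    return [[h[i][j] for j in range(1, cols + 1)] for i in range(1, rows + 1)]

def mixture_density(x, hidden, head):
    """p(x | h) = sum_c gate_c(h) * Normal(x; mean_c(h), sigma_c^2)."""
    logits = [sum(w * v for w, v in zip(ws, hidden)) for ws in head["gate_weights"]]
    top = max(logits)
    gates = [math.exp(l - top) for l in logits]  # softmax numerators
    total = sum(gates)
    density = 0.0
    for gate, us, sigma in zip(gates, head["mean_weights"], head["sigmas"]):
        mean = sum(u * v for u, v in zip(us, hidden))
        normal = math.exp(-0.5 * ((x - mean) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))
        density += (gate / total) * normal
    return density

rng = random.Random(0)
n = 3  # number of hidden units (arbitrary choice)
A = [[rng.gauss(0.0, 0.5) for _ in range(2 + 2 * n)] for _ in range(5 * n)]
b = [0.0] * (5 * n)
image = [[rng.random() for _ in range(4)] for _ in range(4)]
hiddens = slstm_forward(image, A, b, n)

head = {
    "gate_weights": [[rng.gauss(0.0, 0.5) for _ in range(n)] for _ in range(3)],
    "mean_weights": [[rng.gauss(0.0, 0.5) for _ in range(n)] for _ in range(3)],
    "sigmas": [0.5, 1.0, 2.0],
}
print(mixture_density(image[2][2], hiddens[2][2], head))
```

Because `h[i][j]` is computed from the hidden states to the left and above, a change to pixel (0, 0) alters the hidden state at (2, 2) even though that pixel lies outside the two-pixel neighborhood. This is the indirect, recurrent dependence illustrated in the figure.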
(*__Remarks__: I am not too sure about the exact training procedure of the MCGSM, but as far as I understand, the MCGSM is first trained end-to-end with the SLSTM using SGD with momentum and then fine-tuned using L-BFGS after each epoch, with the SLSTM parameters held fixed.*)

* For training:

```
for n in range(num_epochs):
    for b in range(0, inputs.shape[0] - batch_size + 1, batch_size):
        # compute gradients
        f, df = f_df(params, b)
        loss.append(f / log(2.) / self.num_channels)

        # update SLSTM parameters
        for l in train_layers:
            for key in params['slstm'][l]:
                diff['slstm'][l][key] = momentum * diff['slstm'][l][key] - df['slstm'][l][key]
                params['slstm'][l][key] = params['slstm'][l][key] + learning_rate * diff['slstm'][l][key]

        # update MCGSM parameters
        diff['mcgsm'] = momentum * diff['mcgsm'] - df['mcgsm']
        params['mcgsm'] = params['mcgsm'] + learning_rate * diff['mcgsm']
```

* Fine-tuning (part of the code):

```
for l in range(self.num_layers):
    self.slstm[l] = SLSTM(
        num_rows=hiddens.shape[1],
        num_cols=hiddens.shape[2],
        num_channels=hiddens.shape[3],
        num_hiddens=self.num_hiddens,
        batch_size=min([hiddens.shape[0], self.MAX_BATCH_SIZE]),
        nonlinearity=self.nonlinearity,
        extended=self.extended,
        slstm=self.slstm[l],
        verbosity=self.verbosity)
    hiddens = self.slstm[l].forward(hiddens)

# finetune with early stopping based on validation performance
return self.mcgsm.train(
    hiddens_train, outputs_train,
    hiddens_valid, outputs_valid,
    parameters={
        'verbosity': self.verbosity,
        'train_means': train_means,
        'max_iter': max_iter})
```
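
The momentum update in the training excerpt depends on names defined elsewhere in the repository (`f_df`, `params`, `diff`), so it cannot be run on its own. Below is a self-contained toy version of the same update rule, `diff = momentum * diff - df` followed by `params = params + learning_rate * diff`. It is only a sketch: the quadratic loss, the scalar stand-ins for the parameter arrays, the hyperparameter values and the names `momentum_step` and `toy_loss_and_grad` are all assumptions, not part of RIDE.

```python
def momentum_step(params, velocity, grads, learning_rate=0.01, momentum=0.9):
    """Apply one in-place momentum update to every entry of a flat parameter dict."""
    for key in params:
        velocity[key] = momentum * velocity[key] - grads[key]
        params[key] = params[key] + learning_rate * velocity[key]

def toy_loss_and_grad(params, target):
    """Hypothetical stand-in for f_df: squared distance to a target and its gradient."""
    loss = sum((params[k] - target[k]) ** 2 for k in params)
    grads = {k: 2.0 * (params[k] - target[k]) for k in params}
    return loss, grads

# Scalars stand in for the SLSTM and MCGSM parameter arrays.
target = {"slstm": 1.5, "mcgsm": -0.5}
params = {"slstm": 0.0, "mcgsm": 0.0}
velocity = {key: 0.0 for key in params}

losses = []
for step in range(300):
    loss, grads = toy_loss_and_grad(params, target)
    losses.append(loss)
    momentum_step(params, velocity, grads)

print(losses[0], losses[-1])
```

The velocity term (`diff` in the excerpt) accumulates past gradients, so both parameter groups keep moving in a consistent direction even when individual minibatch gradients are noisy.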
