#### Problem addressed:
Fully visible Bayesian network learning
This work extends the original NADE paper. As opposed to using a single fixed, randomly chosen variable ordering for the fully visible Bayesian network (FVBN), they train over all factorially many orderings by optimizing a stochastic version of the objective, which is an unbiased estimator of the loss over orderings. The resulting model supports any type of conditional inference easily; in addition, since it is trained on all orderings, ensembles of NADE models come essentially for free. Training proceeds by masking out the variables one wants to predict and maximizing the likelihood of those missing variables given the observed ones, which makes the model very similar to a denoising autoencoder with Bernoulli-type noise on the input. One drawback of this masking is that the model cannot distinguish a masked-out variable from a variable whose value is 0. To overcome this, they supply the mask as an additional input to the network and show that this is an important ingredient for the model to work.
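The masked training step above can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the network sizes, weight names, and the loss rescaling are toy assumptions; the key points are that a random subset of variables is hidden, the mask is concatenated to the input, and only the hidden variables are scored.

```python
import numpy as np

rng = np.random.default_rng(0)

D, H = 8, 16                          # input dimension, hidden units (toy sizes)
W = rng.normal(0, 0.1, (H, 2 * D))    # encoder sees [masked x, mask]
b = np.zeros(H)
V = rng.normal(0, 0.1, (D, H))        # decoder predicts Bernoulli means
c = np.zeros(D)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def masked_nll(x):
    """One stochastic evaluation of the order-agnostic objective:
    hide a random subset of variables and score the model's
    predictions for the hidden ones. Feeding the mask as an extra
    input lets the network tell a masked variable from a true 0."""
    d = rng.integers(1, D)              # how many variables stay observed
    observed = rng.permutation(D)[:d]
    mask = np.zeros(D)
    mask[observed] = 1.0
    h = sigmoid(W @ np.concatenate([x * mask, mask]) + b)
    p = sigmoid(V @ h + c)
    hidden = mask == 0
    # mean NLL over the masked-out variables (rescaling simplified here)
    return -np.mean(x[hidden] * np.log(p[hidden])
                    + (1 - x[hidden]) * np.log(1 - p[hidden]))

x = rng.integers(0, 2, D).astype(float)
loss = masked_nll(x)
```

Averaging this loss over many sampled masks approximates training on all orderings at once.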
Proposed order-agnostic NADE, which overcomes several drawbacks of the original NADE.
Inference at test time is somewhat expensive.
UCI, binary MNIST
#### Additional remarks:
The first author provides an implementation on his website.
TLDR; The authors present Paragraph Vector, which learns fixed-length, semantically meaningful vector representations for text of any length (sentences, paragraphs, documents, etc.). The algorithm works by training a word vector model with an additional paragraph embedding vector as an input. This paragraph embedding is fixed for each paragraph, but varies across paragraphs. Similar to word2vec, PV comes in two flavors:
- A Distributed Memory Model (PV-DM) that predicts the next word based on the paragraph and preceding words
- A Distributed Bag of Words model (PV-DBOW) that predicts context words for a given paragraph
A notable property of PV is that inference (embedding a new paragraph) requires training a new vector by gradient descent, which can be slow. The learned embeddings can be used as input to other models. In their experiments the authors train both variants and concatenate the resulting vectors. The authors evaluate PV on classification and information-retrieval tasks and achieve new state-of-the-art results.
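The PV-DM forward pass can be sketched as follows. This is a toy NumPy sketch, not the paper's implementation: vocabulary size, embedding dimension, and variable names are assumptions, and the real model uses hierarchical softmax rather than the full softmax shown here. It illustrates the core idea that the paragraph vector is concatenated with the context word vectors before predicting the next word.

```python
import numpy as np

rng = np.random.default_rng(1)

V_words, n_para, dim, window = 100, 10, 8, 3   # toy sizes (assumed)
word_emb = rng.normal(0, 0.1, (V_words, dim))
para_emb = rng.normal(0, 0.1, (n_para, dim))
# output layer over the vocabulary; its input is the concatenation of
# the paragraph vector and the `window` preceding word vectors
U = rng.normal(0, 0.1, (V_words, dim * (window + 1)))

def pv_dm_probs(para_id, context_word_ids):
    """PV-DM forward pass: concatenate (not average) the paragraph
    vector with the context word vectors, then softmax over the
    vocabulary to predict the next word. Concatenation preserves
    word-order information."""
    h = np.concatenate([para_emb[para_id]] +
                       [word_emb[w] for w in context_word_ids])
    logits = U @ h
    e = np.exp(logits - logits.max())
    return e / e.sum()

probs = pv_dm_probs(0, [3, 17, 42])
```

At inference time, `word_emb` and `U` would be frozen and a fresh row of `para_emb` optimized by gradient descent for the new paragraph, which is why embedding unseen text is slow.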
#### Data Sets / Results
Stanford Sentiment Treebank Polar error: 12.2%
Stanford Sentiment Treebank Fine-Grained error: 51.3%
IMDB Polar error: 7.42%
Query-based search result retrieval (internal) error: 3.82%
#### Key Points
- Authors use 400-dimensional paragraph and word embeddings. The window size is a hyperparameter chosen on the validation set; values from 5-12 seem to work well. On IMDB, varying the window size changed the error by ~0.7%.
- PV-DM performs well on its own, but concatenating PV-DM and PV-DBOW consistently leads to (small) improvements.
- When training the PV-DM model, use concatenation instead of averaging to combine word and paragraph vectors (this preserves word-order information).
- Hierarchical Softmax is used to deal with large vocabularies.
- For final classification, authors use LR or MLP, depending on the task (see below)
- IMDB training (25k documents, 230 words average length) takes 30 minutes on a 16-core machine, presumably on CPU.
#### Notes / Questions
- How did the authors choose the final classification model? Did they cross-validate this? The authors mention that the NN performs better than LR on the IMDB data, but they don't show how large the gap is. Does PV perhaps perform significantly worse with a simpler model?
- I wonder if we can train hierarchical representations of words, sentences, paragraphs, and documents, keeping the vectors fixed at each layer, and predict sequences using RNNs.
- I wonder how PV compares to an attention-based RNN autoencoder approach. When training PV you are in a way attending to specific parts of the paragraph to predict the missing parts.
This paper extends supervised embedding models by combining them multiplicatively,
i.e. $f'(x,y) = G(x,y)\,f(x,y)$.
It considers two types of models, dot product in the *embedding* space and kernel density estimation in the *embedding* space, where the kernel is restricted to factorize as
$k((x,y),(x',y')) = k(x-x')\,k(y-y')$.
It proposes an iterative algorithm that alternates between updating the parameters of $f$ and those of $G$.
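The alternating scheme can be sketched as follows. This is a toy NumPy sketch under assumed specifics (both $f$ and $G$ as dot-product embedding models, a squared-error loss against synthetic target scores, plain SGD), not the paper's algorithm: it only illustrates the alternation of holding $G$ fixed while stepping $f$'s parameters and vice versa.

```python
import numpy as np

rng = np.random.default_rng(2)

n_x, n_y, dim = 20, 20, 5
# two dot-product embedding models: f(x,y) = U[x].V[y], G(x,y) = A[x].B[y]
U = rng.normal(0, 0.5, (n_x, dim)); V = rng.normal(0, 0.5, (n_y, dim))
A = rng.normal(0, 0.5, (n_x, dim)); B = rng.normal(0, 0.5, (n_y, dim))

# toy supervision: target scores for observed (x, y) pairs (assumed setup)
pairs = [(rng.integers(n_x), rng.integers(n_y)) for _ in range(200)]
targets = {p: rng.normal() for p in pairs}

def score(x, y):
    return (A[x] @ B[y]) * (U[x] @ V[y])     # f'(x,y) = G(x,y) f(x,y)

def loss():
    return np.mean([(score(x, y) - targets[(x, y)]) ** 2 for x, y in pairs])

lr = 0.02
initial = loss()
for it in range(50):
    update_f = it % 2 == 0      # alternate: even passes step f, odd passes step G
    for x, y in pairs:
        err = score(x, y) - targets[(x, y)]
        if update_f:            # hold G fixed, descend on f's embeddings
            g = A[x] @ B[y]
            gu, gv = err * g * V[y], err * g * U[x]
            U[x] -= lr * gu; V[y] -= lr * gv
        else:                   # hold f fixed, descend on G's embeddings
            f = U[x] @ V[y]
            ga, gb = err * f * B[y], err * f * A[x]
            A[x] -= lr * ga; B[y] -= lr * gb

final = loss()
```

Freezing one factor turns each half-step into an ordinary supervised-embedding update, which is what makes the alternation tractable.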