Infinite Dimensional Word EmbeddingsInfinite Dimensional Word EmbeddingsNalisnick, Eric T. and Ravi, Sachin2015

Paper summaryhlarochelleThis paper introduces a version of the skipgram word embeddings learning algorithm that can also learn the size (nb. of dimensions) of these embeddings. The method, coined infinite skipgram (iSG), is inspired from my work with Marc-Alexandre Côté on the infinite RBM, in which we describe a mathematical trick for learning the size of a latent representation. This is done by introducing an additional latent variable $z$ representing the number of dimensions effectively involved in the energy function. Moreover, a term penalizing increasing values for $z$ is also incorporated, such that the infinite sum over $z$ is converging.
In this paper, the authors extend the probabilistic model behind skipgram with such a variable $z$, now corresponding to the number of dimensions involved in the dot product between word embeddings. They also propose a few approximations required to allow for an efficient training algorithm. Mainly they optimize an upper bound on the regular skipgram objective (see Section 3.2) and they approximate the computation of the conditional over $z$ for a given word $w$, which requires summing over all possible context words $c$, by summing only over the words observed in the immediate current context of $w$ (thus this sum will very across training example of the same word $w$).
Experiments show that the iSG better learns to exploit different dimensions to model different senses of words, better than the original skipgram model. Quantitatively, the iSG seems to provide better probabilities to context words.

This paper introduces a version of the skipgram word embeddings learning algorithm that can also learn the size (nb. of dimensions) of these embeddings. The method, coined infinite skipgram (iSG), is inspired from my work with Marc-Alexandre Côté on the infinite RBM, in which we describe a mathematical trick for learning the size of a latent representation. This is done by introducing an additional latent variable $z$ representing the number of dimensions effectively involved in the energy function. Moreover, a term penalizing increasing values for $z$ is also incorporated, such that the infinite sum over $z$ is converging.
In this paper, the authors extend the probabilistic model behind skipgram with such a variable $z$, now corresponding to the number of dimensions involved in the dot product between word embeddings. They also propose a few approximations required to allow for an efficient training algorithm. Mainly they optimize an upper bound on the regular skipgram objective (see Section 3.2) and they approximate the computation of the conditional over $z$ for a given word $w$, which requires summing over all possible context words $c$, by summing only over the words observed in the immediate current context of $w$ (thus this sum will very across training example of the same word $w$).
Experiments show that the iSG better learns to exploit different dimensions to model different senses of words, better than the original skipgram model. Quantitatively, the iSG seems to provide better probabilities to context words.