There are mathematicians, still today, who look at deep learning, and get real salty over the lack of convex optimization. That is to say: convex functions are ones where you have an actual guarantees that gradient descent will converge, and mathematicians of olden times (i.e. 2006) spent reams of paper arguing that this or that function had convex properties, and thus could be guaranteed to converge, under this or that set of arcane conditions. And then, Deep Learning came along, with its huge, nonlinear, very much nonconvex objective functions, that it was nonetheless trying to optimize via gradient descent. From the perspective of an optimization theorist, this had the whiff of heresy, but exceptionally effective heresy. And, so, the field of DL has half-exploded, half-stumbled along, showcasing a portfolio of very impressive achievements, but with theory very much a secondary priority relative to performance. Something else that gradient descent isn’t supposed to be able to do is learn models that include discrete (i.e. non-continuous) operators. Without continuous gradients, the functions don’t have an obvious way to “push” in a certain direction, to modulate the loss at the end of the network. Discrete nodes mean that the value just jumps from being in one state, to being in the other, with no intermediate values. This has historically posed a problem for algorithms fueled by gradient descent. The authors of this paper came up with a solution that is 60% cleverness, and 40% just guessing that “even if we ignore the theory, things will probably work well enough”. But, first, their overall goal: to create a Variational Auto Encoder where the latent states, the compressed internal representation that is typically an array of continuous values, is instead an array of categorical values. The goal of this was 1) to have a representation type that was a better match for the discrete nature of data types like speech (which has distinct phonemes we might like to discretely capture), and, 2) to have a more compressed latent space that would (of necessity) focus on more global information, and leave local pixel-level information to be learned by the expressive PixelCNN decoder. The way they do this is remarkably simple. First, they learn a typical VAE encoder, mapping from the input pixels to a continuous z space. (An interesting sidenote here is that this paper uses spatially organized z; instead of using one single z vector to represent the whole image, they may have 32x32 spatial locations, each of which has its own z vector, to represent at 128x128 image). Then, for each of the spatial regions, they take the continuous vector produced by the network, and compare it to a fixed set of “embedding” vectors, of the same shape. That spatial location is then lumped into the category of the embedding that it’s closest to, meaning that you end up with a compressed layer of 32x32 (in this case) spatial regions, each of which is represented by a categorical number between 0 and max-num-categories. Then, the network passes forward the embedding that this input vector was just “snapped” to being, Then, the decoder uses the full spatial location set of embeddings to do its decoding. https://i.imgur.com/P8LQRYJ.png The clever thing here comes when you ask how to train the encoder to produce a different embedding, when there was this discrete “jump” that happened. The authors choose to just avoid the problem, more or less. They do that by just taking the gradient signals that come back from the end of the network to the embedding, and just pass those directly to the vector that was used to nearest-neighbors-lookup the embedding. Basically, they pretend that they passed the vector through the rest of the network, rather than the embedding. The embeddings are then trained in a K Means Clustering kind of way; with the embeddings being iteratively updated to be closer to the points that were assigned to their embedding in each round of training. This is the “Vector Quantization” part of VQ-VAE Overall, this seems to perform quite well: with the low capacity of the latente space meaning that it is incentivized to handle more global structure, while leaving low level pixel details to the decoder. It is also much easier to fit after-the-fact distributions over; once we’ve trained a VQ-VAE, we can easily learn a global model that represents the location by location dependencies between the categories (i.e. a 1 in this corner means at 5 in this other corner is more probable). This gives us the ability to have an analytically specified distribution, in latent space, that actually represents the structure of how these “concept level categories” relate to each other. By contrast, with most continuous latent spaces, it’s intractable to learn an explicit density function after the fact, and thus if we want to be able to sample we need to specify and enforce a prior distribution over z ahead of time.