This recent paper, a collaboration involving some of the authors of MAML, proposes an intriguing application of techniques developed in the field of meta learning to the problem of unsupervised learning - specifically, the problem of developing representations without labeled data, which can then be used to learn quickly from a small amount of labeled data. As a reminder, the idea behind meta learning is that you train models on multiple different tasks, using only a small amount of data from each task, and update the model based on the test set performance of the model. The conceptual advance proposed by this paper is to adopt the broad strokes of the meta learning framework, but apply it to unsupervised data, i.e. data with no pre-defined supervised tasks. The goal of such a project is, as so often is the case with unsupervised learning, to learn representations, specifically, representations we believe might be useful over a whole distribution of supervised tasks. However, to apply traditional meta learning techniques, we need that aforementioned distribution of tasks, and we’ve defined our problem as being over unsupervised data. How exactly are we supposed to construct the former out of the latter? This may seem a little circular, or strange, or definitionally impossible: how can we generate supervised tasks without supervised labels? https://i.imgur.com/YaU1y1k.png The artificial tasks created by this paper are rooted in mechanically straightforward operations, but conceptually interesting ones all the same: it uses an off the shelf unsupervised learning algorithm to generate a fixed-width vector embedding of your input data (say, images), and then generates multiple different clusterings of the embedded data, and then uses those cluster IDs as labels in a faux-supervised task. It manages to get multiple different tasks, rather than just one - remember, the premise of meta learning is in models learned over multiple tasks - by randomly up and down-scaling dimensions of the embedding before clustering is applied. Different scalings of dimensions means different points close to one another, which means the partition of the dataset into different clusters. With this distribution of “supervised” tasks in hand, the paper simply applies previously proposed meta learning techniques - like MAML, which learns a model which can be quickly fine tuned on a new task, or prototypical networks, which learn an embedding space in which observations from the same class, across many possible class definitions are close to one another. https://i.imgur.com/BRcg6n7.png An interesting note from the evaluation is that this method - which is somewhat amusingly dubbed “CACTUs” - performs best relative to alternative baselines in cases where the true underlying class distribution on which the model is meta-trained is the most different from the underlying class distribution on which the model is tested. Intuitively, this makes reasonable sense: meta learning is designed to trade off knowledge of any given specific task against the flexibility to be performant on a new class division, and so it gets the most value from trade off where a genuinely dissimilar class split is seen during testing. One other quick thing I’d like to note is the set of implicit assumptions this model builds on, in the way it creates its unsupervised tasks. First, it leverages the smoothness assumptions of classes - that is, it assumes that the kinds of classes we might want our model to eventually perform on are close together, in some idealized conceptual space. While not a perfect assumption (there’s a reason we don’t use KNN over embeddings for all of our ML tasks) it does have a general reasonableness behind it, since rarely are the kinds of classes very conceptually heterogenous. Second, it assumes that a truly unsupervised learning method can learn a representation that, despite being itself sub-optimal as a basis for supervised tasks, is a well-enough designed feature space for the general heuristic of “nearby things are likely of the same class” to at least approximately hold. I find this set of assumptions interesting because they are so simplifying that it’s a bit of a surprise that they actually work: even if the “classes” we meta-train our model on are defined with simple Euclidean rules, optimizing to be able to perform that separation using little data does indeed seem to transfer to the general problem of “separating real world, messier-in-embedding-space classes using little data”.