Unsupervised representation learning is a funny thing: our aspiration in learning representations from data is typically that they’ll be useful for future tasks, but, since we (by definition) don’t have access to labels, our approach has historically been to define heuristics, such as representing the data distribution in a low-dimensional space, and hope that those heuristics translate to useful learned representations. And, to a fair extent, they have. However, this paper’s goal is to attach this problem more directly, by explicitly meta-learning an unsupervised update rule so that performs well in future tasks. They do this by: https://i.imgur.com/EEkpW9g.png 1) Defining a parametrized weight update function, to slot into the role that Stochastic Gradient Descent on a label-defined loss function would play in a supervised network. This function calculates a “hidden state”, is defined for each neuron in each layer, and takes in the pre and post-nonlinearity activations for that batch, the hidden state of the next layer, and a set of learned per-layer “backwards weights”. The weight update for that neuron is then calculated using the current hidden state, the last batch's hidden state, and the current value of the weight. In the traditional way of people in this field who want to define some generic function, they instantiate these functions as a MLP. 2) Using that update rule on the data from a new task, taking the representing resulting from applying the update rule, and using it in a linear regression with a small number of samples. The generalization performance from this k-shot regression, taken in expectation over multiple tasks, acts as our meta training objective. By back-propagating from this objective, to the weight values of the representation, and from there to the parameters of the optimization step, they incentivize their updater to learn representations that are useful across some distribution of tasks. A slightly weird thing about this paper is that they train on image datasets, but shuffle the pixels and use a fully connected network rather than a conv net. I presume this has to do with the complexities of defining a weight update rule for a convolution, but it does make it harder to meaningfully compare with other image-based unsupervised systems, which are typically done using convolution. An interesting thing they note is that, early in meta-training on images, their update rules generalize fairly well to text data. However, later in training the update rules seem to have specialized to images, and generalize more poorly to images.