This paper is ultimately relatively straightforward, for all that it's embedded in the somewhat new-to-me literature around graph-based Neural Architecture Search - the problem of iterating through options to find a graph representing an optimized architecture. The authors want to understand whether in this problem, as in many others in deep learning, we can benefit from building our supervised models off of representations learned during an unsupervised pretraining step. In this case, the unsupervised pretraining is conceptually simple - a variational autoencoder - even though the components of the VAE are more complex by dint of being made up of graph networks. This autoencoder, termed arch2vec, is trained on a simple reconstruction loss, and uses the Graph Isomorphism Network (or, GIN) architecture in its encoder, rather than a more typical Graph Convolutional Network. I don't feel like I fully follow the intuitive difference between these two structures, but have the general sense that GIN architectures are simpler; calculating a weighted sum of current central node features with the features of neighboring nodes, rather than learning a function of the full concatenated (current_node, sum_of_neighbors) vector. First, the authors investigate the quality of their embedding space, compared to the representation implicitly learned by doing end-to-end supervised (i.e. with accuracies as labels) NAS. They show that (1) distances in their continuous embedding space correlate more strongly with the edit distance between graphs, compared to the embedding learned by the supervised model, and that (2) their embedding fills more of the space (as would be expected from the KL regularization term) and leads to high-performing networks being naturally concentrated within the space. https://i.imgur.com/SavZnce.png Looking into their representation as an initialization point, they demonstrate that their initializations do lead to lower long-term regret over the course of the architecture search process, particularly differentiating themselves from random initializations at the end of training. https://i.imgur.com/4DG7lZd.png The authors argue that this is because the representations learned by the supervised methods are "biased towards weight-free operations, which are often preferred in the early stage of the search process, resulting in lower final accuracies." I admit I don't fully understand why this would be true, though they do cite a few papers they say demonstrate it. My initial thought was that weight-free architectures would overperform early in the training of each individual network, but my understanding was that the dataset used here is a labeled static dataset of architectures and accuracies, so the within-training-run dynamics wouldn't obviously play a role. Nevertheless, there does seem to be empirical benefit that comes from using these pretrained representations, even if I don't follow the intuition behind it fully.