[link]
In January of this year (2020), DeepMind released a model called AlphaFold, which uses convolutional networks atop sequence-based and evolutionary features to predict protein folding structure. In particular, their model was designed to predict a distribution over how far apart each pair of amino acids will be in the final folded structure. Given such a trained model, you can score a candidate structure according to how likely it is under the model, and - if your process for generating candidates is differentiable, as it is in this case - you can directly optimize the structure to increase its probability.

https://i.imgur.com/9ZBhqRo.png

The distance-prediction model takes as input two main categories of feature:

1. Per-residue features characterizing which amino acid each residue is, based on techniques that produce either a one-hot amino acid type or a distribution over amino acids.
2. Residue-pair features based on parameters of Multiple Sequence Alignment (MSA) models. I don't deeply understand the details of how these specific models work, but at a high level: MSA features are based on the evolutionary intuition that residues that make contact within a protein will likely evolve in correlation with one another, and that you can detect these evolutionary correlations by comparing highly similar proteins (which were likely close in evolutionary time).

https://i.imgur.com/h16lPwU.png

These features are stacked in an LxL grid, with the per-residue-pair features differing at each point in the grid and the per-residue features staying constant for a full row or column (since they correspond to a given i for all j). One relevant note here is that proteins can be hundreds or thousands of residues long, so you can't actually construct a full LxL matrix, on either the input or output end. Instead, the notional full LxL grid is subdivided into a coarser grid of 64-residue square regions, and a single one of these 64x64 regions (which could be either adjacent or far apart in the protein) is passed into the model at a time.

Given this 64x64x<features> input, the model performs several layers of dilated convolutions - which allow features at a given point in the grid to be informed by information farther away - still in a 2D arrangement. The model then outputs a 64x64 grid (one element for each [i, j] amino acid pair), where each element in the grid is a 64-deep discretized probability distribution over the distance between those two residues. When I say "discretized probability distribution," what I actually mean is "histogram". This discretization of the output distribution, where you predict how much probability mass falls in each possible distance bin, allows for more flexible and finer-grained predicted distributions than, for example, you could get with a continuous Gaussian model centered around a single point. Amusingly, because the model predicts distance histograms for each residue pair, the authors term the output a "distogram".

During training, the next-to-last layer of the model is also used to predict per-residue auxiliary features: the accessible surface area of the residue in the folded structure, and the secondary structure type (helix, strand, etc) that the residue will be a part of. However, these are just used to provide more signal during training, and aren't used for either protein structure optimization or calculation of test scores.
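As a minimal sketch of the output side of this setup (not DeepMind's code - the real network is far deeper, and the channel counts, bin count, and layer names here are my own placeholders), a dilated-convolution trunk over a 64x64 crop of pairwise features can end in a per-pair softmax over distance bins, i.e. a distogram:

```python
# Sketch only: a small dilated-conv trunk over a 64x64 crop of pairwise features,
# ending in a per-residue-pair softmax over distance bins (the "distogram").
import torch
import torch.nn as nn

class DistogramHead(nn.Module):
    def __init__(self, in_channels=128, n_bins=64):
        super().__init__()
        # A few dilated conv layers; dilation lets each grid point see farther away.
        self.trunk = nn.Sequential(
            nn.Conv2d(in_channels, 128, kernel_size=3, padding=1, dilation=1),
            nn.ELU(),
            nn.Conv2d(128, 128, kernel_size=3, padding=2, dilation=2),
            nn.ELU(),
            nn.Conv2d(128, 128, kernel_size=3, padding=4, dilation=4),
            nn.ELU(),
        )
        self.to_bins = nn.Conv2d(128, n_bins, kernel_size=1)

    def forward(self, pair_features):
        # pair_features: (batch, in_channels, 64, 64) crop of the notional LxL grid
        h = self.trunk(pair_features)
        logits = self.to_bins(h)                 # (batch, n_bins, 64, 64)
        return torch.log_softmax(logits, dim=1)  # log-probability per distance bin

# Training would then use a cross-entropy loss against the binned true distances.
```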
To actually generate predicted fold structures, the authors construct a generative model of fold structure where each amino acid is assigned two torsion angles that govern its connection to its neighbor. By setting these torsion angles to different values, you can twist and reshape the protein as a whole. Given this generative model, things proceed as you might suspect: you generate a candidate, calculate the resulting inter-residue distances, calculate the likelihood of those distances under the model you've learned, and send back a gradient to change your torsion angles to make that likelihood higher.

Empirically, the DeepMind authors evaluated on a competition dataset, and specifically compared themselves against other approaches that (like theirs) didn't make predictions for a new protein by comparing against similar templates (Template Modeling, or TM), but instead modeled from raw features (Free Modeling, or FM). AlphaFold was able to achieve high accuracy on 24 out of the 43 test domains (where a domain is a cluster of highly related proteins), compared to the next best method, which only got 14 out of the 43. Definitely still not perfect, since almost half of the test proteins were out of its reach, but fairly compelling evidence that there's value to DeepMind's approach.
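The optimization loop itself can be sketched roughly as below. This is not AlphaFold's code: `angles_to_coords` (some differentiable geometry routine mapping torsion angles to backbone coordinates) and `distance_log_prob` (the summed log-likelihood of a distance matrix under the predicted distograms) are assumed placeholders.

```python
# Sketch of gradient-based structure optimization against a learned distogram.
import torch

def pairwise_distances(coords):
    # coords: (L, 3) backbone coordinates -> (L, L) distance matrix
    diff = coords.unsqueeze(0) - coords.unsqueeze(1)
    return diff.norm(dim=-1)

def optimize_structure(angles_to_coords, distance_log_prob, L, steps=1000, lr=1e-2):
    # Two torsion angles per residue, optimized directly by gradient ascent.
    torsions = torch.zeros(L, 2, requires_grad=True)
    opt = torch.optim.Adam([torsions], lr=lr)
    for _ in range(steps):
        coords = angles_to_coords(torsions)   # assumed differentiable geometry
        dists = pairwise_distances(coords)
        loss = -distance_log_prob(dists)      # maximize likelihood of the distances
        opt.zero_grad()
        loss.backward()
        opt.step()
    return torsions.detach()
```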
[link]
This is an interesting paper, investigating (with a team that includes the original authors of the Lottery Ticket paper) whether the initializations that result from BERT pretraining have Lottery Ticket-esque properties with respect to their role as initializations for downstream transfer tasks.

As background context, the Lottery Ticket Hypothesis came out of an observation that trained networks can be pruned to remove low-magnitude weights (according to a particular iterative pruning strategy that is a bit more complex than just "prune everything at the end of training") down to high levels of sparsity (5-40% of original weights), and that those pruned networks not only perform well at the end of training, but can also be "rewound" back to their initialization values (or, in some cases, values from early in training) and retrained in isolation - with the weights you pruned out of the trained network still set to 0 - to a comparable level of accuracy. This is thought of as a "winning ticket" because the hypothesis Frankle and Carbin generated is that the reason we benefit from massively overparametrized neural networks is that we are essentially sampling a large number of small subnetworks within the larger ones, and the more samples we get, the likelier it is we find a "winning ticket" that starts our optimization in a place conducive to further training.

In this particular work, the authors investigate a slightly odd variant of the LTH. Instead of looking at training runs that start from random initializations, they look at transfer tasks that start their learning from a massively pretrained BERT language model. They try to find out:

1) Whether you can find "winning tickets" as subsets of the BERT initialization for a given downstream task
2) Whether those winning tickets generalize, i.e. whether a ticket/pruning mask for one downstream task also performs well on another

If that were the case, it would indicate that much of the value of a BERT initialization for transfer tasks could be captured by transferring only a small percentage of BERT's (many) weights, which would be beneficial for compression and mobile applications.

An interesting wrinkle in the LTH literature is the question of whether true "winning tickets" can be found (in the sense of the network being able to retrain purely from the masked random initialization), or whether the network can only retrain to a comparable accuracy by rewinding to an early stage in training, but not to the absolute beginning. Historically, the former has been difficult, and sometimes impossible, to find in more complex tasks and networks.

https://i.imgur.com/pAF08H3.png

One finding of this paper is that, when your starting point is the BERT initialization, you can indeed find "winning tickets" in the first sense: you can rewind the full way back to the beginning of (downstream task) training and retrain from there. (You can see this above with the results for IMP, Iterative Magnitude Pruning, rolling back to theta-0.) This is a bit of an odd finding to parse, since BERT isn't really a random initialization itself, but it does suggest that part of the value of BERT is that it contains subnetworks that, from the start of training, sit in notional optimization basins that facilitate future training.
A negative result in this paper is that, by and large, winning tickets on downstream tasks don't transfer from one to another, and, to the extent that they do transfer, it mostly seems to depend on how many training samples were used in the downstream mask-finding process, rather than on any qualitative properties of the task. The one exception was further training on the original BERT objective, Masked Language Modeling, treated as a "downstream task": the winning ticket mask from that training did transfer to other tasks. This is some validation of the premise that MLM is an unusually good training task in terms of its transfer properties.

An important thing to note here is that, even though this hypothesis is intriguing, it's currently quite computationally expensive to find "winning tickets", requiring an iterative pruning and retraining process that takes far longer than the original training run would have. The real goal - which this is another small, hopeful step toward - is being able to analytically specify subnetworks with valuable optimization properties, without having to learn them from data each time (which somewhat defeats the point if they're only applicable to the task they were trained on, though is potentially useful if they do transfer to other tasks, as has been shown within a set of image-prediction tasks).
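For reference, the iterative magnitude pruning procedure with rewinding described above can be sketched roughly as follows. This is a conceptual outline, not the paper's implementation: `train` is an assumed helper that trains the model with the mask applied, and the pruning here operates per-parameter-tensor for simplicity.

```python
# Rough sketch of iterative magnitude pruning (IMP) with rewinding to theta-0.
import copy
import torch

def imp_with_rewind(model, train, prune_fraction=0.2, rounds=10):
    theta_0 = copy.deepcopy(model.state_dict())       # initialization to rewind to
    mask = {n: torch.ones_like(p) for n, p in model.named_parameters()}

    for _ in range(rounds):
        train(model, mask)                            # train with the current mask applied
        # Prune the lowest-magnitude fraction of the still-surviving weights.
        for name, param in model.named_parameters():
            surviving = param[mask[name].bool()].abs()
            if surviving.numel() == 0:
                continue
            threshold = surviving.quantile(prune_fraction)
            mask[name] = mask[name] * (param.abs() > threshold).float()
        # Rewind the surviving weights back to their initial values.
        model.load_state_dict(theta_0)

    return mask  # the "winning ticket" mask
```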
[link]
### Key points

- Instead of just focusing on supervised learning, a self-critique and adapt (SCA) network provides an unsupervised learning approach to improving overall generalization. It does this via transductive learning, learning a label-free loss function from the validation set to improve the base model.
- The SCA framework helps a learning algorithm be more robust by learning more relevant features and improving during the training phase.

### Ideas

1. Combine deep learning models with SCA to help improve generalization when data is fed into these large networks.
2. Build an SCA that focuses not on learning a label-free loss function but on learning the quality of a concept.

### Review

Overall, the paper presents a novel idea that offers an unsupervised learning method to assist a supervised learning model in improving its performance. Implementation of this SCA framework is straightforward and demonstrates promising results. This approach finally contributes to the actual theory of the meta-learning and learning-to-learn research field. The SCA framework is a new step towards self-adaptive learning systems. Unfortunately, the experimentation is insufficient and provides little insight into how this framework can help in cases where task domains vary in distribution or in concept.
[link]
This is a nice, compact paper testing a straightforward idea: can we use the contrastive loss structure so widespread in unsupervised learning as a framework for generating and training against adversarial examples?

In the context of the adversarial examples literature, adversarial training - that is, training against examples that were adversarially generated so as to maximize the loss of the model you're training - is the primary strategy used to train robust models (robust here in the sense of not being susceptible to said adversarial attacks). Typically, these attacks are generated with the use of class labels, since they are meant to attack supervised classifiers that assign a class label to an image; the goal of the adversarial attack is therefore to push down the probability of the correct class label (either in favor of a specific alternate class, or just in favor of any class that isn't the true one). However, labels are hard and expensive, so one wonders: in the same way that you can learn representations from unlabeled data, can you also make those representations (otherwise referred to as "embeddings") robust in a similarly label-free way?

This paper tests an approach that does so in a quite simple way, by just generating adversarial examples against your contrastive loss target. This works by:

1) Taking an image, and generating two augmentations (or transformations) of it. This is part of the standard contrastive pipeline
2) Applying an adversarial perturbation to one of those transformations, where the perturbation is optimized to maximize the contrastive loss (the ability to differentiate an augmented version of the same image from augmented versions of other images)
3) Training on that adversarial sample to generate more robustness

https://i.imgur.com/ttF6k1A.png

And this simple approach appears to work quite well! They find that, in defending against supervised adversarial attacks, it performs comparably to supervised adversarial training, and that it has the added benefits of (1) slightly higher accuracy on clean examples (in general, robustness is known to decrease your clean-sample accuracy), and (2) better robustness against attack types other than the attack type used for the adversarial training. It also achieves better transfer performance (that is, adversarially training on one dataset, and then evaluating robustness on another) than a supervised method, when evaluated on both CIFAR10 → CIFAR100 and CIFAR100 → CIFAR10. This does make pretty good sense to me, since instance-level stability does seem like it's getting at a more fundamental set of invariances that would transfer better to different distributions of classes.
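The core move - perturbing one view so as to maximize a contrastive loss - can be sketched as below. This is a simplified, single-step (FGSM-style) illustration under my own assumptions, not the paper's method or code: `encoder` is an assumed model, and the contrastive loss here is a simplified one-directional InfoNCE.

```python
# Sketch: perturb one augmented view to maximize the contrastive loss, then train on it.
import torch
import torch.nn.functional as F

def simple_contrastive_loss(z1, z2, temperature=0.5):
    # Simplified InfoNCE: each row's positive is the matching row of the other view.
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature                 # (B, B) similarity matrix
    targets = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, targets)

def adversarial_view(encoder, view1, view2, epsilon=8 / 255):
    # One gradient step on the input that *increases* the contrastive loss.
    delta = torch.zeros_like(view2, requires_grad=True)
    loss = simple_contrastive_loss(encoder(view1), encoder(view2 + delta))
    loss.backward()
    adv = view2 + epsilon * delta.grad.sign()
    encoder.zero_grad()                                # discard gradients from this pass
    return adv.clamp(0, 1).detach()

# A training step would then compute the contrastive loss on (view1, adversarial view)
# and backpropagate into the encoder as usual.
```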
[link]
The premise of contrastive loss is that we want to push together the representations of objects that are similar, and push dissimilar representations farther apart. However, in an unlabeled setting, we don't generally have class labels to tell us which images (or objects in general) are supposed to be similar or dissimilar along the axes that matter to us, so we use the shortcut of defining some transformation on a given anchor frame that gets us a frame we're confident is related enough to that anchor to be considered a "positive", or target, similarity-wise. Some of these transformations are data augmentations performed on a frame, or choices of temporally adjacent frames in a video sequence (which, since the real world evolves smoothly, are assumed to be similar).

Anyhow, all of this is well and good, except for the fact that, especially in an image classification setting like CIFAR or ImageNet, sampling randomly from the other images in a given batch doesn't give you a set of things that are entirely "negatives" in the sense of being dissimilar to the anchor image. It is true that most of the objects you get by sampling randomly are negatives (especially in a many-class setting), but some of them will be other samples from the same class. By treating all of them as negatives, we penalize the model for having representations of them that are close to our anchor representation, even though, for many downstream tasks, we'd probably prefer elements of the same class to have more similar representations. However, the whole premise of the unsupervised setting is that we don't have class labels, so we don't know, for a given sample from the batch (of things that aren't specifically transformations of the anchor), whether it's an actual negative or secretly a positive (i.e. of the same class).

That's true, but this paper argues that, even if you can't identify which specific elements in a batch are secret positives, you can try to account for them in aggregate, if you have some reasonably good estimate of the overall class probabilities, which tells you how many positives you should expect to find in a given batch. Given that, they reformulate the loss to be "debiased". They do this by noticing that the expectation over negatives in the denominator is actually a sample over the full p(x), not just the distribution over negatives, and trying to make it a better estimate of the actual distribution over negatives.

https://i.imgur.com/URN4RBF.png

They accomplish this by writing out the full p(x) as a weighted combination of the distributions over positives and negatives (where "negative" here means "every class that doesn't match the anchor"), as shown above, and noticing that you can recover the negative part of the distribution by taking the full distribution and subtracting out the positive distribution (which we have an estimator for by construction, via our transformations), weighted by the prior over how frequent positives are in the overall distribution.

https://i.imgur.com/5IgGIhu.png

This leads to a modified estimator: we estimate the similarity between the anchor and positives (which we already have in the numerator, but which we can also calculate with more augmentations/positive samples to get a better estimate) and do a (weighted) subtraction of that from the similarity over "negative" examples.
Intuitively, we keep the part where we penalize similarity with negatives (by adding magnitude to the denominator), but reduce that penalty in accordance with how much we think that "similarity with negatives" is actually similarity with other positives in the batch, which we would like to keep around.

https://i.imgur.com/kUGoemA.png

https://i.imgur.com/5Gitdi7.png

In terms of experimental results, my read is that this is most useful on problems - like CIFAR10 and STL10 - that don't have many classes (they each, per their names, have 10). The results there are meaningfully stronger than for the 200-class ImageNet. And that makes pretty good intuitive sense, since you would expect the scale of the "secret positives in our random sample of images" bias problem to be a lot more acute in a setting where we've got a 1-in-10 chance of sampling a same-class image, compared to a 1-in-200 chance.
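As a rough sketch of how this debiasing looks in code (my own simplification, not the authors' implementation - in particular I use the mean over positive views in the numerator and a clamp at the estimator's assumed theoretical minimum, so treat the details as illustrative):

```python
# Sketch of a debiased contrastive objective: subtract the expected "secret positive"
# contribution from the negatives term, weighted by the assumed class prior tau_plus.
import math
import torch

def debiased_contrastive_loss(pos_sim, neg_sim, tau_plus=0.1, temperature=0.5):
    # pos_sim: (B, M) similarities between each anchor and its M positive views
    # neg_sim: (B, N) similarities between each anchor and N other batch samples
    pos_exp = torch.exp(pos_sim / temperature)
    neg_exp = torch.exp(neg_sim / temperature)
    N = neg_sim.size(1)

    # Debiased negatives-only estimate: full-batch term minus the prior-weighted
    # positive term, rescaled so it estimates the true-negative distribution.
    ng = (neg_exp.mean(dim=1) - tau_plus * pos_exp.mean(dim=1)) / (1.0 - tau_plus)
    ng = torch.clamp(ng, min=math.exp(-1.0 / temperature))  # keep the estimate positive

    return -torch.log(pos_exp.mean(dim=1) / (pos_exp.mean(dim=1) + N * ng)).mean()
```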
[link]
Large-scale transformers on unsupervised text data have been wildly successful in recent years; arguably, they're the most successful single idea in the last ~3 years of machine learning. Given that, it's understandable that different domains within ML want to take their shot at seeing whether the same formula will work for them as well. This paper applies the principles of (1) transformers and (2) large-scale unlabeled data to the problem of learning informative embeddings of molecular graphs. Labeling is a problem in much of machine learning - it's costly, and narrowly defined in terms of a certain task - but that problem is even more exacerbated when it comes to labeling properties of molecules, since they typically require wet-lab chemistry to measure empirically. Given that, and also given the fact that we often want to predict new properties - like effectiveness against a newly targetable drug receptor - that we don't yet have data for, finding a way to learn and transfer from unsupervised data has the potential to be quite valuable in the molecular learning sphere.

There are two main conceptual parts to this paper and its method - named GROVER, in true-to-ML-form tortured acronym style. The first is the actual architecture of the model itself, which combines both a message-passing Graph Neural Network to aggregate local information, and a Transformer to aggregate global information. The paper was a bit vague here, but the way I understand it is:

https://i.imgur.com/JY4vRdd.png

- There are parallel GNN + Transformer stacks for both edges and nodes, each of which outputs both a node and an edge embedding, for four embeddings total. I'll describe the one for nodes; the parallel stack for edges operates the same way, except that hidden states live on edges rather than nodes, and attention is conducted over edges rather than nodes.
- In the NodeTransformer version, a message-passing NN (of I'm not sure how many layers) performs neighborhood aggregation (aggregating the hidden states of neighboring nodes and edges, then weight-transforming them, then aggregating again) until each node has a representation that has "absorbed" information from a few hops out in its surrounding neighborhood. My understanding is that there is a separate MPNN for queries, keys, and values, so each node ends up with three different vectors for these three things.
- Multi-headed attention is then performed over these node representations, in the normal way, where all keys and queries are dot-product-ed together and put into a softmax to calculate a weighted average over the values.
- We now have node-level representations that combine both local and global information. These node representations are then aggregated into both node and edge representations, and each is put into an MLP layer and Layer Norm before finally outputting a node-based node and edge representation. This is then joined by an edge-based node and edge representation from the parallel stack. These are aggregated on a full-graph level to predict graph-level properties.

https://i.imgur.com/NNl6v4Y.png

The other component of the GROVER model is the way this architecture is actually trained - without explicit supervised labels. The authors use two tasks - one local, and one global.
The local task constructs labels based on local contextual properties of a given atom - for example, that the atom has one double-bonded nitrogen and one single-bonded oxygen in its local environment - and tries to predict those labels given the representation of that atom (or node). The global task uses RDKit (an analytically constructed molecular analysis kit) to identify 85 different motifs or functional groups in the molecule, and encodes those into an 85-long one-hot vector that is predicted at the graph level.

https://i.imgur.com/jzbYchA.png

With these two components, GROVER is pretrained on 10 million unlabeled molecules, and then evaluated in transfer settings where its representations are fine-tuned on small amounts of labeled data. The results are pretty impressive - it achieves new SOTA performance by relatively large margins on all tasks, even relative to existing semi-supervised pretraining methods that similarly have access to more data. The authors perform ablations to show that it's important to do the graph-aggregation step before the transformer (the alternative being just running a transformer on raw node and edge features), and also show that their architecture without pretraining (just used directly on downstream tasks) performs worse. One thing I wish they'd directly ablated was the value-add of the local (also referred to as "contextual") and global semi-supervised tasks. Naively, I'd guess that most of the performance gain came from the global task, but it's hard to know without them having done the test directly.
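To make the local "contextual property" task concrete, here is a toy sketch of how such a label might be constructed from an atom's immediate neighborhood. This is my own illustration with a made-up graph representation, not GROVER's featurization or RDKit code.

```python
# Toy sketch: build a node-level "contextual property" label by counting
# (neighbor element, bond type) pairs around an atom.
from collections import Counter

def contextual_property_label(atom_idx, atoms, bonds):
    # atoms: list of element symbols, e.g. ["C", "N", "O"]
    # bonds: list of (i, j, bond_type) tuples, e.g. (0, 1, "DOUBLE")
    neighborhood = Counter()
    for i, j, bond_type in bonds:
        if i == atom_idx:
            neighborhood[(atoms[j], bond_type)] += 1
        elif j == atom_idx:
            neighborhood[(atoms[i], bond_type)] += 1
    # Each distinct serialized neighborhood becomes one class in the node-level task.
    return "_".join(f"{elem}-{btype}x{count}"
                    for (elem, btype), count in sorted(neighborhood.items()))

# Example: an atom with one double-bonded N and one single-bonded O would map to
# a label like "N-DOUBLEx1_O-SINGLEx1".
```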
[link]
I tried my best, but I'm really confused by the central methodology of this paper. Here are the things I do understand:

1. The goal of the method is to learn disentangled representations and, specifically, to learn representations that correspond to factors of variation in the environment that are selected by humans. That means we ask humans whether a given image is higher or lower on a particular relevant axis, and aggregate those rankings into a vector, where a particular index of the vector corresponds to a particular factor. Given a small amount of supervision, the hope is to learn an encoder that takes in an image and produces a z code that encodes where the image is on that particular axis.
2. With those disentangled representations, the authors hope they can learn goal-conditioned policies, where the distance between the current image's representation and the goal image's representation can serve as a reward. In particular, they're trying to show that their weakly supervised disentangled representation performs better as a metric space in which to do goal-conditioning distance calculations, relative to other learned spaces.
3. The approach uses a GAN-based design, where a generator generates the images that correspond to a given z1 and z2, and the discriminator tries to tell the difference between two real images paired with their supervision vector, and two generated images with their fake supervision vector.

[Here is the relevant equation, along with some notation-explaining text]

https://i.imgur.com/XNbxK6i.png

The thing I'm confused by is the actual mechanism for why (3) gets you disentangled representations. To my understanding, the thing the generator should be trying to do is generate images whose relationship to one another is governed by the relationship between z1 and z2; if z is really capturing your factors of variation, the two images should differ in places and in ways governed by where those z values are different. Based on this, I'd expect the fake supervision vector here to be some kind of binarized element-wise difference between the two (randomly sampled) vectors, z1 and z2. But the authors claim that the fake supervision vector the generator is trying to replicate is just the zero vector. That seems like it would just result in the generator trying to generate images that don't differ on any axes, even with two different z vectors as input.
[link]
This is a really cool paper that posits a relatively simple explanation for the strange phenomenon known as double descent - both the fact of seeing it in the first place, and the difficulty in robustly causing it to appear.

In the classical wisdom of statistics, increasing model complexity too far will lead to an increase in variance, and thus an increase in test error (or "test risk" or "empirical risk"), producing a U-shaped test error curve as a function of model complexity. Double descent is the name given to the observation that, in modern neural networks, we tend not to see this phenomenon, and, in fact, sometimes see test error first increase and then descend again below its initial minimum. Test error going up, and then back down again: double descent. However, this phenomenon proved to be a bit elusive: often, in order to see it, you had to add artificial noise to your labels.

This paper provides a cohesive theory for both the existence of double descent and the fact that it sometimes can only be elicited with added label noise. The authors empirically estimate the bias and variance components of test error for a range of neural nets on a range of datasets, and show that when they directly estimate bias and variance this way, they see bias decreasing (or, at least, not increasing) monotonically with model complexity, as expected. But they also see variance, rather than strictly increasing with model complexity, exhibiting unimodal behavior: it first increases, and then decreases, as a function of model complexity.

Taking a step back: bias is here understood as the component of your test error that comes from the difference between your expected learned estimator and the true underlying function. Variance is the expected squared difference between the expected learned estimator (that is, the one you get if you average over different splits of the data) and the estimator learned on each particular split. The actual estimator you get is a function of both your average estimator and the particular estimator you draw from the distribution around that average, which is characterized by the variance.

The authors empirically measure bias and variance by conducting k different N-way splits of their datasets, and averaging the resulting k*N estimators to get an average or expected estimator. Given that, they can (as shown below) take the variance to be the average squared difference between the k*N individual estimators and that average. Since test error is composed of bias + variance, they can then simply calculate bias as whatever remains of test error once variance has been accounted for.

https://i.imgur.com/VPzujaZ.png

This provides an elegant explanation for the different relationships we see between complexity and test error. In regimes where the decrease in bias from additional complexity is much larger than the increase in variance - which the authors argue is the case in modern deep networks - we don't see double descent, because the "bump" due to the variance peak is overshadowed by the continuing decrease in bias. However, in regimes where the overall scale of variance (at all levels of complexity) is higher, the increasing variance overwhelms the decreasing bias, and test error increases (before, ultimately, going down again, after the variance peaks).
This explains why double descent has previously appeared preferentially in cases of injected label noise: more label noise means higher irreducible variability in the model learned from different sets of data, which makes the scale of the variance peak more pronounced relative to the bias drop.

In addition to their empirical work, the authors also analytically analyze a two-layer linear neural network, and show that you would theoretically expect a peaked variance shape in that setting. In a certain sense, this just pushes the problem down the road, since the paper doesn't explain why, in any conceptual or statistical sense, we would expect variance to be unimodal in this way. (They do offer a conjecture, but it was not the main thrust of the paper, and I didn't fully follow it.) However, it does offer conceptual clarity into a previously somewhat murkier empirical phenomenon, and hopefully will let us focus on understanding why variance behaves this way.
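For concreteness, the estimation procedure described above, in a squared-error setting, looks roughly like the following sketch (my own simplification of the idea, not the paper's code):

```python
# Sketch: estimate variance as the average squared deviation of each split's
# predictor from the mean predictor, and attribute the remaining test error to bias.
import numpy as np

def bias_variance_estimate(predictions, y_test):
    # predictions: shape (k * N, n_test) -- test-set predictions from models trained
    #              on each of the k*N different training splits.
    # y_test: shape (n_test,) -- true test targets.
    mean_pred = predictions.mean(axis=0)                 # the "expected" predictor
    variance = ((predictions - mean_pred) ** 2).mean()   # averaged over splits and points
    test_error = ((predictions - y_test) ** 2).mean()    # average squared test error
    bias_term = test_error - variance                    # whatever error remains
    return bias_term, variance
```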
[link]
Offline reinforcement learning is a potentially high-value thing for the machine learning community to learn to do well, because there are many applications where it'd be useful to generate a learnt policy for responding to a dynamic environment, but where it'd be too unsafe or expensive to learn in an on-policy or online way, where we continually evaluate our actions in the environment to test their value. In such settings, we'd like to be able to take a batch of existing data - collected from a human demonstrator, or from some other algorithm - and learn a policy from those pre-collected transitions, without being able to query the environment further by taking arbitrary actions.

There are two broad strategies for learning a policy from pre-collected transitions. One is to simply learn to mimic the action policy used by the demonstrator, predicting the action the demonstrator would take in a given state, without making use of reward data at all. This is Behavioral Cloning, and it has the advantage of being somewhat more conservative (in terms of not experimenting with possibly-unsafe-or-low-reward actions the demonstrator never took), but that is also a disadvantage, because it's not possible to get higher reward than the demonstrator themselves got if you're simply copying their behavior. Another approach is to learn a Q function - estimating the value of a given action in a given state - using the reward data from the pre-collected transitions. This can also have downsides, mostly in the direction of overconfidence. Q-value temporal difference learning works by using the current reward, added to the max Q value over possible next actions, as the target for the current-state Q estimate. This tends to lead to overestimates, because regression-to-the-mean effects mean that the highest-value Q estimates are disproportionately likely to be noisy (possibly because they correspond to an action with little data in the demonstrator dataset). In on-policy Q learning, this is less problematic, because the agent can take the action associated with its noisily inaccurate estimate and, as a result, gather more data for that action and get a less noisy estimate in the future. But when we're in a fully offline setting, all our learning is completed before we actually start taking actions with our policy, so taking high-uncertainty actions isn't a valuable source of new information; it's just risky.

The approach suggested by this DeepMind paper - Critic Regularized Regression, or CRR - is essentially a synthesis of these two approaches. The method learns a Q function as normal, using temporal difference methods. The distinction comes from how to get a policy given a learned Q function. Rather than simply taking the action your Q estimate says is highest-value at a particular point, CRR optimizes a policy according to the formula shown below. The f() function is a stand-in for various potential functions, all of which are monotonic with respect to the Q function, meaning they increase when the Q function does.

https://i.imgur.com/jGmhYdd.png

This basically amounts to a form of behavioral cloning loss (the part that maximizes the probability, under your policy, of the actions sampled from the demonstrator dataset), but weighted or, as the paper terms it, filtered by the learned Q function. The higher the estimated Q value for a transition, the more weight is placed on that transition from the demo dataset having high probability under your policy.
Rather than trying to mimic all of the actions of the demonstrator, the policy preferentially tries to mimic the demonstrator actions that it estimates were particularly high-quality. Different f() functions lead to different kinds of filtering. The `binary` version is an indicator function for the advantage of an action (the Q value for that action at that state, minus some reference value for the state, describing how much better the action is than other alternatives at that state) being greater than zero. Another, `exp`, uses exponential weightings, which do a "softer" upweighting or downweighting of transitions based on advantage, rather than the sharp binary cutoff at zero advantage.

The authors demonstrate that, on multiple environments from three different environment suites, CRR outperforms other off-policy baselines - either more pure behavioral cloning, or more pure RL - and in many cases does so quite dramatically. They find that the sharper binary weighting scheme does better on simpler tasks, where the trade-off of fewer but higher-quality samples to learn from works well. However, on more complex tasks, the policy benefits from the exp weighting, which still uses and learns from more samples (albeit at lower weights); this introduces some potential mimicking of lower-quality transitions, but in exchange for a larger effective dataset size to learn from.
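A rough sketch of this filtered behavioral cloning objective is below. This is an illustration of the idea rather than DeepMind's implementation: the `policy`, `q_fn`, and the single-sample advantage reference are all assumed placeholders, and the clipping value on the exp weights is my own choice for stability.

```python
# Sketch of "filtered" behavioral cloning: weight the BC log-likelihood of each
# demonstrator action by a function of its estimated advantage.
import torch

def crr_policy_loss(policy, q_fn, states, actions, mode="exp", beta=1.0):
    with torch.no_grad():
        q_sa = q_fn(states, actions)
        # Reference value for the state, approximated here with one policy sample.
        reference = q_fn(states, policy.sample(states))
        advantage = q_sa - reference

        if mode == "binary":
            weights = (advantage > 0).float()          # keep only better-than-reference actions
        elif mode == "exp":
            weights = torch.clamp(torch.exp(advantage / beta), max=20.0)  # softer re-weighting
        else:
            raise ValueError(mode)

    # Weighted behavioral cloning: maximize log-probability of demonstrator actions,
    # scaled by how good the critic thinks each action was.
    log_probs = policy.log_prob(states, actions)
    return -(weights * log_probs).mean()
```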
[link]
This paper is ultimately relatively straightforward, for all that it's embedded in the somewhat new-to-me literature around graph-based Neural Architecture Search (NAS) - the problem of iterating through options to find a graph representing an optimized architecture. The authors want to understand whether, in this problem as in many others in deep learning, we can benefit from building our supervised models on top of representations learned during an unsupervised pretraining step.

In this case, the unsupervised pretraining is conceptually simple - a variational autoencoder - even though the components of the VAE are more complex by dint of being made up of graph networks. This autoencoder, termed arch2vec, is trained on a simple reconstruction loss, and uses the Graph Isomorphism Network (or GIN) architecture in its encoder, rather than a more typical Graph Convolutional Network. I don't feel like I fully follow the intuitive difference between these two structures, but I have the general sense that GIN architectures are simpler: they calculate a weighted sum of the current central node's features with the features of neighboring nodes, rather than learning a function of the full concatenated (current_node, sum_of_neighbors) vector.

First, the authors investigate the quality of their embedding space, compared to the representation implicitly learned by doing end-to-end supervised (i.e. with accuracies as labels) NAS. They show that (1) distances in their continuous embedding space correlate more strongly with the edit distance between graphs, compared to the embedding learned by the supervised model, and that (2) their embedding fills more of the space (as would be expected from the KL regularization term) and leads to high-performing networks being naturally concentrated within the space.

https://i.imgur.com/SavZnce.png

Looking at their representation as an initialization point, they demonstrate that their initializations do lead to lower long-term regret over the course of the architecture search process, particularly differentiating themselves from random initializations at the end of training.

https://i.imgur.com/4DG7lZd.png

The authors argue that this is because the representations learned by the supervised methods are "biased towards weight-free operations, which are often preferred in the early stage of the search process, resulting in lower final accuracies." I admit I don't fully understand why this would be true, though they do cite a few papers they say demonstrate it. My initial thought was that weight-free architectures would overperform early in the training of each individual network, but my understanding is that the dataset used here is a static, labeled dataset of architectures and accuracies, so within-training-run dynamics wouldn't obviously play a role. Nevertheless, there does seem to be empirical benefit from using these pretrained representations, even if I don't fully follow the intuition behind it.
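For reference, a single GIN-style aggregation step of the kind described above looks roughly like this (a minimal sketch using a dense adjacency matrix for clarity; real implementations use sparse message passing):

```python
# Sketch of one GIN aggregation step: combine a weighted copy of each node's own
# features with the sum of its neighbors' features, then apply an MLP.
import torch
import torch.nn as nn

class GINLayer(nn.Module):
    def __init__(self, dim, eps=0.0):
        super().__init__()
        self.eps = nn.Parameter(torch.tensor(eps))  # learnable weight on the central node
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, node_feats, adjacency):
        # node_feats: (n_nodes, dim); adjacency: (n_nodes, n_nodes) 0/1 matrix
        neighbor_sum = adjacency @ node_feats
        return self.mlp((1.0 + self.eps) * node_feats + neighbor_sum)
```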