Generalization is, if not the central, then at least one of the central mysteries of deep learning. We are somehow able to able to train high-capacity, overparametrized models, that empirically have the capacity to fit to random data - meaning that they have the capacity to memorize the labeled data we give them - and which yet still manage to train functions that generalize to test data. People have tried to come up with generalization bounds - that is, bounds on the expected test error of a model class - but that have all been vacuous, which here means that their upper bound is so far above the actual observed test set error that it's meaningless for the purpose of predicting which changes will enhance or detract from generalization. This paper builds on - and somewhat critiques - an earlier paper, Jiang et al, which takes the approach of assessing generalization bounds empirically. The central approach taken by both papers is to compare the empirical test error of two networks that are identical except for one axis which is varied, and test whether the ranking of the predicted generalization errors for the two networks, resulting from a particular analytical bound, aligns with the ranking of actual, empirical test error. Said succinctly: the goal is to measure how good a generalization bound is at predicting which networks will actually generalize, across the kinds of hyperparameter changes we'd be likely to experiment with in practice. An important note here is that this kind of rank-based measurement is insensitive to whether the actual magnitude of the generalization bound is; it only cares about relative bounds for different model configurations. For a given pair of environments (or pairs of hyperparameter settings), the experimental framework trains multiple seeds and averages the sign error across them. If the two models in the pair were close to one another in generalization error, they were downweighted in the overall average, or removed from the estimation if they were too close, to reduce noise. A difference in methodologies between the Jiang paper and this one is that this one puts a lot of emphasis on the need to rank generalization measures not just by their average performance over a suite of different hyperparameter perturbations, but also by a metric capturing how robust the measure is, for which they suggest the max error rather than average error. Their rationale is that simply looking at an average obscures cases where a measure performs poorly in a particular region of hyperparameter space, in a way that might tell us interesting things about its failure modes. For example, beyond just being able to say that generalization bounds based on Frobenius norms performed poorly on average at predicting the effects of changes to training set size, they were able to look at the particular settings where it performed the worst, which turn out to be on small network sizes. The plot below shows the results from all of the tested measures aggregated together. Each row represents a different axis that was being varied, and, for each measure, a number of different settings were sampled over (for the hyperparameters that were being held fixed across pairs, rather than being varied. Each distribution rectangle represents the average sign error across all of the pairs that were sampled for that measure, and that axis of variation. The measures are listed from left to right according to their average performance across all environments and all axes of variation. https://i.imgur.com/Tg3wdA3.png Some conclusions from this experiment were: - Generalization measures seem to not perform well on changes made to width, however, the authors note this was mostly because changes to width tended to not change the generalization performance in consistent ways, and so the difference in test error between the networks in the pair was more often within the range of noise - Most but not all generalization bounds correctly predict that more training data should result in better generalization - No bound does particularly well on predicting the generalization effects of changes in depth Overall, I found this paper delightfully well written, and a real pleasure to read. My one critique is that the authors explicitly point out that an important piece of data for comparing generalization bounds is the set of features they depend on. That is, if a generalization bound can only make predictions with access to the learned weights (in addition to the model class and data characteristics), it's a lot less practically useful, in terms of model design, than one that doesn't. I wish they had followed through on that and represented the dependencies of the different bounds on some way in their central figure, so that it was easier to compare them "fairly," or accounting for the information they had access to.