This is another paper that was a bit of a personal-growth test for me to parse, since it's definitely heavier on analytical theory than I'm used to. I think I've been able to get something from it, even though I'll be the first to say I didn't understand it entirely.

The question of this paper is: why does training a neural network on a data distribution, but with the supervised labels randomly sampled, seem to afford some advantage when you later fine-tune that randomly trained network with correct labels? What is it that these networks learn from random labels that gives them a head start on future training?

To try to answer this, the authors focus on analyzing the first-layer weights of a network, and frame both the input data and the learned weights (after random-label training) as random variables, each with some mean and covariance matrix. The central argument made by the paper is: after training with random labels, the weights come to have a distribution whose covariance matrix is "aligned" with the covariance matrix of the data. "Aligned" here means alignment on the level of eigenvectors. Formally, it is defined as a situation where every eigenspace of the data covariance matrix is contained in (is a subset of) an eigenspace of the weight covariance matrix. Intuitively, it means that the principal components of the weight distribution (the axes that define its principal directions of variation) are aligned, in a linear-algebra sense, with those of the data, up to a difference in scaling (that is, the eigenvalues may differ between the two).
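As a way of making the alignment definition concrete, here is a minimal NumPy sketch that checks whether every eigenspace of one symmetric matrix sits inside an eigenspace of another. The function names (`eigenspaces`, `is_aligned`), the tolerances, and the toy matrices are my own assumptions for illustration, not anything taken from the paper.

```python
import numpy as np

def eigenspaces(B, tol=1e-8):
    """Group orthonormal eigenvectors of symmetric B by numerically equal eigenvalue."""
    vals, vecs = np.linalg.eigh(B)  # eigenvalues in ascending order
    groups, start = [], 0
    for i in range(1, len(vals) + 1):
        # Close the current group at the end, or when the eigenvalue jumps.
        if i == len(vals) or vals[i] - vals[i - 1] > tol:
            groups.append(vecs[:, start:i])
            start = i
    return groups

def is_aligned(A, B, tol=1e-6):
    """True if every eigenspace of B lies inside a single eigenspace of A."""
    for V in eigenspaces(B):
        # If span(V) sits in one eigenspace of A, then A acts on it as a scalar c.
        c = np.trace(V.T @ A @ V) / V.shape[1]
        if np.linalg.norm(A @ V - c * V) > tol:
            return False
    return True

# Toy example (hypothetical numbers): shared eigenvectors, different eigenvalues.
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))    # random orthonormal basis
sigma_x = Q @ np.diag([3.0, 2.0, 1.0, 0.5]) @ Q.T   # stand-in "data" covariance
sigma_w = Q @ np.diag([1.0, 1.0, 4.0, 2.0]) @ Q.T   # stand-in "weight" covariance

print(is_aligned(sigma_w, sigma_x))  # weight covariance aligned with data covariance?
print(is_aligned(sigma_x, sigma_w))  # the relation is not symmetric in general
```

In the toy example, two distinct data eigenspaces share a single eigenvalue in `sigma_w`, so they merge into one larger weight eigenspace. That is why the relation can hold in one direction and fail in the other.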
They do show some empirical evidence of this being the case, by calculating the actual covariance matrices of both the data and the learned weights, and showing that you see a high degree of similarity between the eigenspaces of the two (though sometimes only after adding eigenspaces of the data together to match an eigenspace of the weights).

https://i.imgur.com/TB5JM6z.png

They also show some indication that this property drives the advantage in fine-tuning. They do this by taking their analytical model of what they believe is happening during training (that weights come to be drawn from a distribution governed by a covariance matrix aligned with the data covariance matrix) and sampling weights from a normal distribution that has that property. They show, in the plot below, that this accounts for most of the advantage that has been observed in subsequent training after training on random labels (other than the previously discovered effect of "all training increases the scale of the weights, which helps in future training," which they account for by normalizing).

https://i.imgur.com/cnT27HI.png

Unfortunately, beyond this central intuition of covariance matrix alignment, I wasn't able to get much else from the paper. Some other things they mentioned that I didn't follow were:

* The actual proof for why you'd expect this property of alignment in the case of random-label training
* An analysis of the way that "the first layer [of a network] effectively learns a function which maps each eigenvalue of the data covariance matrix to the corresponding eigenvalue of the weight covariance matrix." I understand that their notion of alignment predicts that there should be some relationship between these two sets of eigenvalues, but I don't fully follow which part of the first layer of a neural network will *learn* that function, or produce it as an output
* An analysis of how this framework explains both the cases where you get positive transfer from random-label training (i.e. the pretrained network trains better subsequently) and the cases where you get negative transfer
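The sampling experiment described above (drawing first-layer weights from a normal distribution whose covariance is aligned with the data covariance) can be sketched roughly as follows. This is only an illustration under stated assumptions: `sample_aligned_weights` is a hypothetical helper, and the eigenvalue map `f` is a placeholder that I pick arbitrarily, whereas the paper derives or measures that mapping. The weight-scale normalization the authors apply is left out.

```python
import numpy as np

def sample_aligned_weights(X, n_units, f, rng):
    """Draw rows from N(0, Sigma_w), where Sigma_w = U diag(f(lam)) U^T
    and (lam, U) is the eigendecomposition of the data covariance."""
    sigma_x = np.cov(X, rowvar=False)        # (d, d) data covariance
    lam, U = np.linalg.eigh(sigma_x)         # eigenvectors are the columns of U
    lam = np.clip(lam, 0.0, None)            # guard against tiny negative eigenvalues
    w_var = f(lam)                           # assumed eigenvalue map, must be >= 0
    Z = rng.standard_normal((n_units, X.shape[1]))
    # Scale each eigen-direction by its standard deviation, then rotate into data axes.
    return (Z * np.sqrt(w_var)) @ U.T

# Toy usage with made-up data; np.sqrt is an arbitrary stand-in for f.
rng = np.random.default_rng(0)
X = rng.standard_normal((2000, 3)) * np.array([3.0, 1.0, 0.3])
W = sample_aligned_weights(X, n_units=64, f=np.sqrt, rng=rng)
print(W.shape)  # one row of first-layer weights per hidden unit
```

Because the sampled covariance reuses the data eigenvectors `U` and only changes the eigenvalues, it is aligned with the data covariance by construction.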
