Training Neural Networks with Local Error Signals
# Paper summary

This paper was presented at ICML 2019. Do you remember greedy layer-wise training? Are you curious what a modern take on the idea can achieve? Then this is the paper for you. It has its own very good summary:

> We use standard convolutional and fully connected network architectures, but instead of globally back-propagating errors, each weight layer is trained by a local learning signal that is not back-propagated down the network. The learning signal is provided by two separate single-layer sub-networks, each with their own distinct loss function. One sub-network is trained with a standard cross-entropy loss, and the other with a similarity matching loss.

If it's a bit unclear, this figure might help:

![local_error_signal]()

The cross-entropy loss is the standard classification loss. The similarity matching loss compares the output of the layer with the one-hot encoded labels:

$$ L_{\mathrm{sim}} = \left\| S(\operatorname{NeuralNet}(H)) - S(Y) \right\|_{F}^{2} $$

Here $S$ is a cosine similarity matrix whose elements are:

$$ s_{ij} = s_{ji} = \frac{\tilde{\mathbf{x}}_{i}^{T} \tilde{\mathbf{x}}_{j}}{\|\tilde{\mathbf{x}}_{i}\|_{2} \|\tilde{\mathbf{x}}_{j}\|_{2}} $$

The method is used to train VGG-like models on MNIST, Fashion-MNIST, CIFAR-10, CIFAR-100, SVHN and STL-10. While it gets near-SOTA results up to CIFAR-10, it is not there yet for more complex datasets: it reaches 80% accuracy on CIFAR-100, where SOTA is 90%. Still, this is better than a standard ResNet, for example.

Why would we prefer a local loss to a global loss? A big advantage is that the weights can be updated during the forward pass, which avoids storing the activations in memory.
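As a rough sketch, the similarity matching loss can be written in a few lines of NumPy. This is my own illustration, not the authors' code: the single-layer `NeuralNet` projection from the paper is omitted, and the hidden activations are compared to the labels directly.

```python
import numpy as np

def cosine_similarity_matrix(x):
    # x: (batch, features); each row is one example's flattened representation
    x = x.reshape(x.shape[0], -1)
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    x_tilde = x / np.maximum(norms, 1e-12)  # avoid division by zero
    # s_ij = cosine similarity between examples i and j
    return x_tilde @ x_tilde.T

def similarity_matching_loss(hidden, one_hot_labels):
    # Squared Frobenius norm || S(hidden) - S(Y) ||_F^2
    diff = cosine_similarity_matrix(hidden) - cosine_similarity_matrix(one_hot_labels)
    return np.sum(diff ** 2)
```

Intuitively, the loss pushes pairs of examples with the same label toward similar hidden representations, and pairs with different labels apart, using only information available at that layer.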
There was another paper on a similar topic, which I didn't read: [Greedy Layerwise Learning Can Scale to ImageNet]()

# Comments

- While this is clearly not ready to replace standard backprop, I find this line of work very interesting, as it casts doubt on one of the assumptions of backprop: that we need a global signal to learn complex functions.
- Though not mentioned in the paper, wouldn't a local loss naturally avoid vanishing and exploding gradients?
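The memory argument from the summary can be made concrete. Below is a hypothetical PyTorch sketch (my own, not the paper's code, and using only a cross-entropy head per layer): each layer and its local classifier head get their own optimizer, the local loss is backpropagated through that layer only, and `detach()` cuts the graph so no global error signal flows down the network.

```python
import torch
import torch.nn as nn

# Hypothetical two-layer MLP; sizes are illustrative, not from the paper
layers = nn.ModuleList([nn.Linear(784, 256), nn.Linear(256, 128)])
heads = nn.ModuleList([nn.Linear(256, 10), nn.Linear(128, 10)])
opts = [torch.optim.SGD(list(l.parameters()) + list(h.parameters()), lr=0.1)
        for l, h in zip(layers, heads)]

def local_train_step(x, y):
    h = x
    for layer, head, opt in zip(layers, heads, opts):
        h = torch.relu(layer(h))
        # Local loss: gradients reach only this layer and its head
        loss = nn.functional.cross_entropy(head(h), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
        # Block the gradient here: no global backprop, and the graph
        # (with its stored activations) can be freed immediately
        h = h.detach()
    return h
```

Because each layer is updated as soon as its forward computation finishes, activations never need to be kept alive for a later global backward pass.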

Summary by Hadrien Bertrand

