Early Stopping without a Validation Set
Maren Mahsereci, Lukas Balles, Christoph Lassner, Philipp Hennig
arXiv e-Print archive, 2017
Keywords:
cs.LG, stat.ML
First published: 2017/03/28

Abstract: Early stopping is a widely used technique to prevent poor generalization
performance when training an over-expressive model by means of gradient-based
optimization. To find a good point to halt the optimizer, a common practice is
to split the dataset into a training and a smaller validation set to obtain an
ongoing estimate of the generalization performance. We propose a novel early
stopping criterion that is based on fast-to-compute, local statistics of the
computed gradients and entirely removes the need for a held-out validation set. Our
experiments show that this is a viable approach in the setting of least-squares
and logistic regression, as well as neural networks.
Summary from [reddit](https://www.reddit.com/r/MachineLearning/comments/623oq4/r_early_stopping_without_a_validation_set/dfjzwqq/):
We want to minimize the expected risk (loss) but that's a mean over the real distribution of the data, which we don't know. We approximate that by using a finite dataset and try to minimize the empirical risk instead.
The gradients for the empirical risk are an approximation to the gradients for the expected risk.
The idea is that the true gradients carry pure signal, whereas the approximate gradients carry signal plus noise. The noise comes from using a finite dataset to approximate the true distribution of the data.
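This signal-plus-noise view is easy to check numerically. The sketch below uses a toy one-parameter least-squares problem (all names and constants are invented for illustration, not from the paper): it resamples datasets of different sizes and shows the empirical gradient scattering around the true gradient, with a spread that shrinks roughly as 1/sqrt(n):

```python
import numpy as np

rng = np.random.default_rng(42)
w = 2.0          # current parameter value
true_mean = 0.5  # E[z]; the expected-risk gradient of (1/2)E[(w - z)^2] is w - E[z] = 1.5

def empirical_gradient(n):
    """Gradient of the empirical risk (1/2) * mean((w - z_i)^2) on a fresh sample of size n."""
    z = rng.normal(true_mean, 1.0, size=n)
    return w - z.mean()

for n in (10, 1000, 100000):
    grads = np.array([empirical_gradient(n) for _ in range(200)])
    # mean stays near the true gradient 1.5; std shrinks roughly as 1/sqrt(n)
    print(n, grads.mean(), grads.std())
```

The noise never changes the gradient on average, but for small datasets it can easily dominate the signal, which is exactly what the stopping criterion below tries to detect.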
By computing local statistics of the gradients, the authors can detect when the gradients no longer carry information about the expected risk and only noise remains. Optimizing past that point leads to overfitting.
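A minimal numpy sketch of such a test, loosely following the paper's evidence-based criterion (stop when 1 - (B/D) * Σ_d g_d² / Ŝ_d > 0, where g is the mini-batch mean gradient, Ŝ_d the per-coordinate sample variance of the per-example gradients, B the batch size and D the number of parameters). The function name and the variance floor are illustrative choices, not the authors' code:

```python
import numpy as np

def eb_stopping_criterion(per_example_grads):
    """Evidence-based early-stopping test on one mini-batch.

    per_example_grads: array of shape (batch_size, num_params), one
    gradient per training example.

    Returns True when the mean gradient is statistically indistinguishable
    from zero given the observed gradient noise, i.e. when, averaged over
    coordinates, the squared mean gradient falls below its sampling variance.
    """
    B, D = per_example_grads.shape
    g = per_example_grads.mean(axis=0)         # mini-batch gradient
    s = per_example_grads.var(axis=0, ddof=1)  # per-coordinate gradient variance
    s = np.maximum(s, 1e-12)                   # guard against zero variance
    # Stop when 1 - (B/D) * sum_d g_d^2 / s_d > 0
    return 1.0 - (B / D) * np.sum(g ** 2 / s) > 0.0
```

Early in training the mean gradient dominates its noise and the test returns False; once the per-example gradients look like zero-mean noise, it returns True and training can be halted without ever touching a validation set.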