Early Stopping without a Validation Set
Maren Mahsereci, Lukas Balles, Christoph Lassner, and Philipp Hennig (2017)

Paper summary by martinthoma
Summary from [reddit](https://www.reddit.com/r/MachineLearning/comments/623oq4/r_early_stopping_without_a_validation_set/dfjzwqq/):
We want to minimize the expected risk (loss), but that is an expectation over the true data distribution, which we don't know. We approximate it with a finite dataset and minimize the empirical risk instead.
The gradients of the empirical risk are an approximation to the gradients of the expected risk.
The idea is that the true gradients contain only information, whereas the approximated gradients contain information plus noise. The noise comes from using a finite dataset to approximate the true data distribution.
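To make the "information plus noise" picture concrete, here is a short worked formulation. The symbols ($\ell$ for the per-example loss, $p$ for the data distribution, $M$ for the dataset size, $\Sigma$ for the per-example gradient covariance) are notation introduced here for illustration, not taken from the summary, and the decomposition assumes i.i.d. samples with finite gradient variance.

```latex
% Expected risk (unknown) and empirical risk (computable)
\mathcal{L}(w) = \mathbb{E}_{x \sim p}\big[\ell(w, x)\big],
\qquad
L_M(w) = \frac{1}{M} \sum_{i=1}^{M} \ell(w, x_i), \quad x_i \sim p \ \text{i.i.d.}

% The empirical gradient is the true gradient plus zero-mean sampling noise
\nabla L_M(w) = \nabla \mathcal{L}(w) + \varepsilon,
\qquad
\mathbb{E}[\varepsilon] = 0,
\qquad
\operatorname{cov}(\varepsilon) = \frac{\Sigma(w)}{M}
```

Early in training $\nabla \mathcal{L}(w)$ is typically large compared with $\varepsilon$. Near a minimizer of the expected risk it shrinks toward zero, while the noise term stays at the scale set by $\Sigma(w)/M$.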
By computing local statistics of the gradients, the authors can determine when the gradients no longer carry information about the expected risk and what's left is just noise. If we keep optimizing past that point, we're going to overfit.
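The summary does not give the statistic itself, so the following is only a minimal sketch of one possible signal-to-noise test, not the authors' exact criterion. It assumes per-example gradients are available for a batch and that the sampling noise is roughly Gaussian, so that under pure noise `B * g_k**2 / var_k` has expectation about 1 in every coordinate. The function names `eb_criterion` and `should_stop`, and the `eps` safeguard, are hypothetical choices made for this example.

```python
from statistics import mean, variance


def eb_criterion(per_example_grads, eps=1e-12):
    """Return 1 - (B / D) * sum_k g_k**2 / var_k for a batch of gradients.

    per_example_grads: list of B gradient vectors, each a list of D floats.
    g_k is the batch-mean gradient and var_k the sample variance of the
    per-example gradients in coordinate k. Values above 0 suggest the mean
    gradient is no larger than what sampling noise alone would produce.
    This is an illustrative statistic, not a verified copy of the paper's.
    """
    batch_size = len(per_example_grads)
    if batch_size < 2:
        raise ValueError("need at least two per-example gradients")
    columns = list(zip(*per_example_grads))  # D tuples, each of length B
    dim = len(columns)
    snr_sum = 0.0
    for col in columns:
        g = mean(col)        # batch-mean gradient in this coordinate
        var = variance(col)  # unbiased sample variance (divides by B - 1)
        snr_sum += g * g / (var + eps)  # eps guards against zero variance
    return 1.0 - (batch_size / dim) * snr_sum


def should_stop(per_example_grads):
    """Stop once the criterion is positive (gradient looks like pure noise)."""
    return eb_criterion(per_example_grads) > 0.0


# Consistent per-example gradients: clear signal, keep training.
signal_batch = [[1.0, 2.0], [1.1, 2.1], [0.9, 1.9]]
print(should_stop(signal_batch))  # False

# Per-example gradients that cancel out: only noise is left, stop.
noise_batch = [[1.0, -1.0], [-1.0, 1.0], [1.0, -1.0], [-1.0, 1.0]]
print(should_stop(noise_batch))  # True
```

In a real training loop this check would run on the per-example gradients of each minibatch (or the full training set), and in practice one would probably smooth the statistic over several steps before halting, since a single batch can be noisy.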

First published: 2017/03/28
Abstract: Early stopping is a widely used technique to prevent poor generalization
performance when training an over-expressive model by means of gradient-based
optimization. To find a good point to halt the optimizer, a common practice is
to split the dataset into a training and a smaller validation set to obtain an
ongoing estimate of the generalization performance. In this paper we propose a
novel early stopping criterion which is based on fast-to-compute, local
statistics of the computed gradients and entirely removes the need for a
held-out validation set. Our experiments show that this is a viable approach in
the setting of least-squares and logistic regression as well as neural
networks.
