SoftTarget Regularization: An Effective Technique to Reduce Over-Fitting in Neural Networks
Armen Aghajanyan, 2016

Paper summary (felipe)

This paper introduces a new regularization technique that aims to reduce over-fitting without reducing the capacity of a model. It draws on the claim that models start to over-fit when co-label similarities start to disappear, for example when the model output no longer shows that similar dog breeds such as German shepherd and Belgian shepherd are related.
The idea is that models in an early training phase *do* show these similarities. To keep this information in the model, the target labels $Y^t_c$ for training step $t$ are formed by mixing the ground-truth labels with an exponential moving average $\hat{Y}^t$ of the network's outputs from previous training steps:
$$
\begin{aligned}
\hat{Y}^t &= \beta \hat{Y}^{t-1} + (1-\beta)F(X, W),\\
Y_c^t &= \gamma\hat{Y}^t + (1-\gamma)Y,
\end{aligned}
$$
where $F(X,W)$ is the current network's output and $Y$ are the ground-truth labels. This way, the network should retain which classes are similar to each other. The paper shows that training with the proposed regularization scheme preserves co-label similarities (compared to an over-fitted model) to a degree similar to dropout, which is consistent with the intuition the method is based on.
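To make the two update equations above concrete, here is a minimal NumPy sketch of a single update. It is an illustration under stated assumptions, not the author's implementation: the function name `soft_target_step` and the toy numbers are hypothetical, and the arrays are assumed to hold per-sample class probabilities of shape `(samples, classes)`.

```python
import numpy as np

def soft_target_step(y_hat_prev, preds, y_true, beta, gamma):
    """Apply the two update equations once (hypothetical helper).

    y_hat_prev: previous running average of predictions, Y_hat^{t-1}
    preds:      current network output, F(X, W)
    y_true:     ground-truth (hard) labels, Y
    """
    # Exponential moving average of the network's predictions.
    y_hat = beta * y_hat_prev + (1.0 - beta) * preds
    # Mix the averaged predictions with the hard labels.
    y_c = gamma * y_hat + (1.0 - gamma) * y_true
    return y_hat, y_c

# Toy example: one sample, three classes (numbers chosen for illustration).
y_true = np.array([[1.0, 0.0, 0.0]])
y_hat_prev = np.array([[0.6, 0.3, 0.1]])
preds = np.array([[0.8, 0.15, 0.05]])

y_hat, y_c = soft_target_step(y_hat_prev, preds, y_true, beta=0.5, gamma=0.5)
print(y_hat)  # approximately [[0.7, 0.225, 0.075]]
print(y_c)    # approximately [[0.85, 0.1125, 0.0375]]
```

Because both steps are convex combinations of probability vectors, the resulting target still sums to one, and the non-target classes keep some of the mass the network assigned to them earlier.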
The method introduces several new hyper-parameters:
- $\beta$, defining the exponential decay parameter for averaging old predictions
- $\gamma$, defining the weight of the soft targets relative to the ground-truth targets
- $n_b$, the number of 'burn-in' epochs, in which the network is trained with hard targets only
- $n_t$, the number of epochs between soft-target updates
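How the four hyper-parameters might interact can be sketched as a schedule. The following is one possible reading of the list above, not the paper's code: it assumes that the first $n_b$ epochs use hard targets, that the running average is seeded with the first predictions taken after burn-in, and that the soft targets are refreshed every $n_t$ epochs from then on. The names `soft_target_schedule` and `predict` are hypothetical, and `predict` stands in for evaluating a real network.

```python
import numpy as np

def soft_target_schedule(y_true, predict, n_epochs, n_b, n_t, beta, gamma):
    """Return the targets used in each epoch under the assumed schedule."""
    y_hat = None              # running average, unset during burn-in
    targets = y_true.copy()   # start with hard targets
    used = []
    for epoch in range(n_epochs):
        used.append(targets.copy())
        # A real loop would train the network on `targets` for one epoch here.
        done = epoch + 1                  # epochs completed so far
        if done < n_b:
            continue                      # still in burn-in: keep hard targets
        if (done - n_b) % n_t != 0:
            continue                      # not a soft-target update epoch
        preds = predict(epoch)            # stand-in for F(X, W)
        # Assumption: seed the average with the first post-burn-in predictions.
        if y_hat is None:
            y_hat = preds
        else:
            y_hat = beta * y_hat + (1.0 - beta) * preds
        targets = gamma * y_hat + (1.0 - gamma) * y_true
    return used

y_true = np.array([[1.0, 0.0]])
fake_predict = lambda epoch: np.array([[0.8, 0.2]])  # constant toy predictions
used = soft_target_schedule(y_true, fake_predict, n_epochs=6,
                            n_b=2, n_t=2, beta=0.5, gamma=0.5)
for epoch, t in enumerate(used):
    print(epoch, t)
```

In this toy run the first two epochs see the hard label `[1, 0]`, and later epochs see a softened target of about `[0.9, 0.1]`.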
Results on MNIST, CIFAR-10 and SVHN are encouraging, as networks with soft-target regularization achieve lower losses on almost all configurations. However, as of today, the paper does not show how this translates to classification accuracy. Also, it seems that the results are from one training run only, so it is difficult to assess if this improvement is systematic.

First published: 2016/09/21

Abstract: Deep neural networks are learning models with a very high capacity and
therefore prone to over-fitting. Many regularization techniques such as
Dropout, DropConnect, and weight decay all attempt to solve the problem of
over-fitting by reducing the capacity of their respective models (Srivastava et
al., 2014), (Wan et al., 2013), (Krogh & Hertz, 1992). In this paper we
introduce a new form of regularization that guides the learning problem in a
way that reduces over-fitting without sacrificing the capacity of the model.
The mistakes that models make in early stages of training carry information
about the learning problem. By adjusting the labels of the current epoch of
training through a weighted average of the real labels, and an exponential
average of the past soft-targets we achieved a regularization scheme as
powerful as Dropout without necessarily reducing the capacity of the model, and
simplified the complexity of the learning problem. SoftTarget regularization
proved to be an effective tool in various neural network architectures.

Can you expand on "target labels are augmented with a exponential mean of output labels"? Why is the moving average effective? What impact does it have on training?

Thanks for your comment. I have added detail on how the training labels are changed. However, the paper does not explicitly show why the moving average is effective. I also could not find the effect on the training process itself, only on the training result (lower test loss).

Thanks! That is really clear now. So I think SoftTarget is making it harder for a weight update to make drastic changes to the output, so only persistent changes to the output are adopted. The slowly changing output will cause the loss, and in turn the gradient, to shrink and grow slowly. This feels like a momentum that takes into account the individual outputs of the loss.
