Summary by felipe 4 years ago
This paper introduces a new regularization technique that aims to reduce over-fitting without reducing the capacity of a model. It builds on the claim that models start to over-fit when co-label similarities start to disappear, e.g. when the model output no longer shows that dogs of similar breeds, such as German shepherd and Belgian shepherd, are similar.
The idea is that models in an early training phase *do* show these similarities. To keep this information in the model, the target labels $Y^t_c$ for training step $t$ are formed by mixing the ground truth with $\hat{Y}^t$, an exponential moving average of the network's predictions from previous training steps:
$$
\begin{aligned}
\hat{Y}^t &= \beta \hat{Y}^{t-1} + (1-\beta)F(X, W),\\
Y_c^t &= \gamma\hat{Y}^t + (1-\gamma)Y,
\end{aligned}
$$
where $F(X,W)$ is the current network's output and $Y$ are the ground-truth labels. This way, the network should remember which classes are similar to each other. The paper shows that, compared to an over-fitted model, training with the proposed regularization scheme preserves co-label similarities in a way similar to dropout. This supports the intuition the proposed method is based on.
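The two update equations can be sketched in a few lines of plain Python. This is only an illustration for a single example: the function name `update_soft_targets`, the three-class numbers, and the choice $\beta = \gamma = 0.5$ are assumptions made here, not values from the paper.

```python
def update_soft_targets(y_hat_prev, prediction, y_true, beta, gamma):
    """One soft-target update for a single example (illustrative sketch).

    y_hat_prev : running average of past predictions (Y-hat at step t-1)
    prediction : current network output F(X, W), e.g. softmax probabilities
    y_true     : one-hot ground-truth label Y
    Returns (y_hat, y_target): the new running average and the mixed target.
    """
    # Exponential moving average of the network's predictions.
    y_hat = [beta * a + (1.0 - beta) * p for a, p in zip(y_hat_prev, prediction)]
    # Convex combination of the averaged predictions and the hard label.
    y_target = [gamma * a + (1.0 - gamma) * y for a, y in zip(y_hat, y_true)]
    return y_hat, y_target


# Hypothetical 3-class example in which classes 0 and 1 are similar breeds.
y_true = [1.0, 0.0, 0.0]
y_hat_prev = [0.6, 0.3, 0.1]
prediction = [0.8, 0.15, 0.05]
y_hat, y_target = update_soft_targets(
    y_hat_prev, prediction, y_true, beta=0.5, gamma=0.5
)
```

Because both steps are convex combinations of probability vectors, the resulting target still sums to one, and the similar class keeps more mass than the unrelated one.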
The method introduces several new hyper-parameters:
- $\beta$, defining the exponential decay parameter for averaging old predictions
- $\gamma$, defining the weight of the soft targets relative to the ground-truth targets
- $n_b$, the number of 'burn-in' epochs, in which the network is trained with hard targets only
- $n_t$, the number of epochs between soft-target updates
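How $n_b$ and $n_t$ interact can be pictured as a per-epoch schedule. The sketch below is one plausible reading, not the paper's implementation: the helper name `soft_target_schedule` is hypothetical, and it assumes the first soft-target refresh happens right after burn-in (epoch $n_b$, counting from zero) and then every $n_t$ epochs.

```python
def soft_target_schedule(num_epochs, n_b, n_t):
    """Return per-epoch flags (use_soft_targets, refresh_soft_targets).

    Assumption: epochs 0 .. n_b-1 train on hard targets only; the first
    refresh happens at epoch n_b, then every n_t epochs after that.
    """
    schedule = []
    for epoch in range(num_epochs):
        use_soft = epoch >= n_b
        refresh = use_soft and (epoch - n_b) % n_t == 0
        schedule.append((use_soft, refresh))
    return schedule


# Hypothetical setting: 8 epochs, 2 burn-in epochs, refresh every 3 epochs.
plan = soft_target_schedule(num_epochs=8, n_b=2, n_t=3)
refresh_epochs = [epoch for epoch, (_, refresh) in enumerate(plan) if refresh]
```

With these illustrative values, the soft targets are refreshed at epochs 2 and 5, and the targets stay fixed in the epochs between refreshes.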
Results on MNIST, CIFAR-10 and SVHN are encouraging, as networks with soft-target regularization achieve lower losses on almost all configurations. However, as of this writing, the paper does not show how this translates to classification accuracy. Also, the results appear to come from a single training run, so it is difficult to assess whether the improvement is systematic.
