This paper suggests a novel explanation for why dropout training is helpful: it corresponds to an adaptive data augmentation method. Indeed, the authors point out that sampling a mask over the hidden units of a network (effectively setting the corresponding units to 0) has the same effect as feeding in an input example tailored to yield activations of 0 for these units and unchanged activations for all other units. Since this "ghost" example has to differ from the original example, and since each mask corresponds to a different ghost example, mask sampling is effectively similar to data augmentation.

While in practice it might not be possible to find a ghost example that exactly replicates the dropout hidden activations, the authors show that an "approximate" ghost example, found by minimizing a distance between the target dropout activations and the deterministic activations of the ghost example, works well. Indeed, they show that training a deep neural net on additional data generated by this procedure yields results at least as good as regular dropout on MNIST and CIFAR10. (The deterministic neural net still uses regular dropout at the input layer, but they do show that the additional ghost examples are necessary to match the neural net trained with dropout at all layers.)

The authors then use this interpretation to justify a variation of dropout in which the dropout rate isn't fixed but is itself randomly sampled from some range for each example. Indeed, if we think of dropout at a fixed rate as adding a specific class of ghost data, varying the dropout rate corresponds to further enriching the pool of ghost data. The experiments show that this can help, though not by much.

Finally, the authors propose an explanation of a property of dropout: that it tends to generate sparser hidden representations.
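To make the ghost-example search from this summary concrete, here is a minimal sketch of my own, not the authors' code. It assumes a single ReLU hidden layer, a squared Euclidean distance, and plain gradient descent on the input; the names `hidden`, `ghost_loss` and `find_ghost`, the layer sizes, and the step size are all hypothetical choices for illustration. It uses only the standard library.

```python
import random

def hidden(W, b, x):
    """ReLU activations of one hidden layer: max(0, W x + b)."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x)) + bi)
            for row, bi in zip(W, b)]

def ghost_loss(W, b, x_ghost, target):
    """Squared distance between the ghost's deterministic activations and the target."""
    return sum((h - t) ** 2 for h, t in zip(hidden(W, b, x_ghost), target))

def find_ghost(W, b, x, mask, steps=200, lr=0.05):
    """Search for an input whose plain activations approximate the masked ones.

    Starts from x, runs gradient descent on the input, and returns the
    lowest-loss iterate seen together with the target activations.
    """
    target = [m * h for m, h in zip(mask, hidden(W, b, x))]
    current = list(x)
    best, best_loss = list(x), ghost_loss(W, b, x, target)
    for _ in range(steps):
        h = hidden(W, b, current)
        # Gradient w.r.t. the pre-activations; it is zero for inactive ReLU units.
        delta = [2.0 * (hi - ti) if hi > 0.0 else 0.0 for hi, ti in zip(h, target)]
        # Chain rule back to the input: grad_j = sum_i delta_i * W[i][j].
        grad = [sum(d * row[j] for d, row in zip(delta, W)) for j in range(len(current))]
        current = [cj - lr * gj for cj, gj in zip(current, grad)]
        loss = ghost_loss(W, b, current, target)
        if loss < best_loss:
            best, best_loss = list(current), loss
    return best, target

# Toy setup (assumed sizes): 4 inputs, 6 hidden units, dropout rate 0.5.
rng = random.Random(0)
W = [[rng.gauss(0.0, 0.5) for _ in range(4)] for _ in range(6)]
b = [0.0] * 6
x = [rng.gauss(0.0, 1.0) for _ in range(4)]
mask = [1.0 if rng.random() < 0.5 else 0.0 for _ in range(6)]

x_ghost, target = find_ghost(W, b, x, mask)
print("loss at original input:", ghost_loss(W, b, x, target))
print("loss at ghost input:   ", ghost_loss(W, b, x_ghost, target))
```

Because the function keeps the best iterate, the ghost's loss is never worse than that of the original input. An exact match is generally not reachable, which is consistent with the "approximate" ghost examples described in the summary.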
Again, the authors rely on their interpretation of dropout as data augmentation. The explanation goes as follows. Training on the ghost data distribution might make the classification problem significantly harder. Specifically, it is quite possible that the addition of ghost examples generates new isolated class clusters in input space that the model must now learn to discriminate. They hypothesize that the generation of such additional clusters encourages sparsity.

To test this hypothesis, the authors simulate this scenario synthetically by sampling data on a circle, clustered into small arcs that are each assigned to one of 10 possible classes in cycling order. Decreasing the arc length thus increases the number of arcs, i.e. class clusters. They show that training deep networks on datasets with an increasing number of class clusters does yield increasingly sparse representations. This suggests that dropout might indeed be equivalent to modifying the input distribution by adding such isolated class-specific clusters in input space.

One assumption behind this analysis is that the sparsity patterns (i.e. the sets of nonzero dimensions) play an important role in classification and carry most of the discriminative class information. This assumption is also confirmed in experiments: replacing the ReLU activation function with a binary activation (1 if the pre-activation is positive and 0 otherwise) after training still yields a network with good performance (though slightly worse).

#### My two cents

This is a really original and thought-provoking paper. One interpretation I make of these results is that the inductive bias corresponding to using a deep neural network with ReLU activations is more valuable than one might have thought, and that the usefulness of deep neural networks goes beyond just being black boxes that can learn data-dependent representations.
Otherwise, it's not clear to me why the ghost data implicitly generated by the architecture would be useful at all. This also suggests an experiment in which such ghost samples are fed to another type of classifier, such as an SVM, to test whether the data augmentation is useful in itself and reflects meaningful structure in the data, as opposed to being somehow useful only for neural nets.

I note that the results are mostly specific to architectures based on ReLU activations (not that this is a problem, but one should keep it in mind).

I'd really like to see what the ghost samples look like. Do they correspond to interpretable images? The authors also mention that it would be interesting to investigate how the samples change with training, and I agree.

Finally, I think there might be a typo in Figure 1. While the labels of a) and b) state that the arc length is smaller for a) than for b), the plot clearly shows otherwise.
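Since the Figure 1 remark hinges on how arc length relates to the number of class clusters, here is a small sketch of the synthetic circle data as I understand it from the paper's description. The function name `sample_arc_dataset`, the use of the unit circle, uniform sampling of the angle, and contiguous arcs with no gaps between them are my assumptions, not details confirmed by the paper.

```python
import math
import random

def sample_arc_dataset(n_points, arc_length, n_classes=10, seed=0):
    """Sample labelled points on the unit circle.

    The circle is cut into arcs of (approximately) the given angular length,
    and consecutive arcs are assigned class labels 0, 1, ..., n_classes - 1
    in cycling order. Returns the number of arcs and a list of ((x, y), label).
    """
    rng = random.Random(seed)
    n_arcs = int(round(2 * math.pi / arc_length))
    data = []
    for _ in range(n_points):
        theta = rng.uniform(0.0, 2 * math.pi)
        # Index of the arc containing theta, clamped in case theta == 2*pi.
        arc_index = min(int(theta / (2 * math.pi) * n_arcs), n_arcs - 1)
        label = arc_index % n_classes
        data.append(((math.cos(theta), math.sin(theta)), label))
    return n_arcs, data

# Halving the arc length doubles the number of arcs, i.e. class clusters.
n_coarse, coarse = sample_arc_dataset(200, 2 * math.pi / 20)
n_fine, fine = sample_arc_dataset(200, 2 * math.pi / 40)
print("arcs with the longer arc length: ", n_coarse)
print("arcs with the shorter arc length:", n_fine)
```

Under these assumptions, the panel with the smaller arc length should show more, and shorter, class segments around the circle, which is the check I applied to Figure 1.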
