Welcome to ShortScience.org! |
[link]
The proposed approach consists in corrupting the training targets with a noise derived from the task reward while doing maximum likelihood training. This simple but specific smoothing of the target distribution allows to significantly boost the performance of neural structured output prediction as showcased on TIMIT phone and translation tasks. The link between this approach and RL-based expected reward maximization is also made clear by the paper, Prior work has chosen either maximum likelihood learning, which is relatively tractable but assumes a log likelihood loss, or reinforcement learning, which can be performed for a task-specific loss function but requires sampling many predictions to estimate gradients. The proposed objective bridges the gap with "reward-augmented maximum likelihood," which is similar to maximum likelihood but estimates the expected loss with samples that are drawn in proportion to their distance from the ground truth. Empirical results show good improvements with LSTM-based predictors on speech recognition and machine translation benchmarks relative to maximum likelihood training. This work is inspired by recent advancement in reinforcement learning and likelihood learning. The authors suggest to learn parameters so as to minimize the KL divergence between CRFs and a probability model that is proportional to the reward function (which the authors call payoff distribution, see Equation 4). The authors suggest an optimization algorithm for the KL-divergence minimization that depends on sampling from the payoff distribution. Current methods to learn a model for structured prediction include max margin optimisation and reinforcement learning. However, the max margin approach only optimises a bound on the true reward, and requires loss augmented inference to obtain gradients, which can be expensive. On the other hand, reinforcement learning does not make use of available supervision, and can therefore struggle when the reward is sparse, and furthermore the gradients can have high variance. The paper proposes a novel approach to learning for problems that involve structured prediction. They relate their approach to simple maximum likelihood (ML) learning and reinforcement learning (RL): ML optimises the KL divergence of a delta distribution relative to the model distribution, and RL optimises the KL divergence of the model distribution relative to the exponentiated reward distribution. They propose reward-augmented maximum likelihood learning, which optimises the KL divergence of the exponentiated reward distribution relative to the model distribution. Compared to RL, the arguments of the KL divergence are swapped. Compared to ML, the delta distribution is generalised to the exponentiated reward distribution. Training is cheap in RML learning. It is only necessary to sample from the output set according to the exponentiated reward distribution. All experiments are performed in speech recognition and machine translation, where the structure over the output set is defined by the edit distance. An improvement is demonstrated over simple ML. |
[link]
When humans classify images, we tend to use high-level information about the shape and position of the object. However, when convolutional neural networks classify images,, they tend to use low-level, or textural, information more than high-level shape information. This paper tries to understand what factors lead to higher shape bias or texture bias. To investigate this, the authors look at three datasets with disagreeing shape and texture labels. The first is GST, or Geirhos Style Transfer. In this dataset, style transfer is used to render the content of one class in the style of another (for example, a cat shape in the texture of an elephant). In the Navon dataset, a large-scale letter is rendered by tiling smaller letters. And, in the ImageNet-C dataset, a given class is rendered with a particular kind of distortion; here the distortion is considered to be the "texture label". In the rest of the paper, "shape bias" refers to the extent to which a model trained on normal images will predict the shape label rather than the texture label associated with a GST image. The other datasets are used in experiments where a model explicitly tries to learn either shape or texture. https://i.imgur.com/aw1MThL.png To start off, the authors try to understand whether CNNs are inherently more capable of learning texture information rather than shape information. To do this, they train models on either the shape or the textural label on each of the three aforementioned datasets. On GST and Navon, shape labels can be learned faster and more efficiently than texture ones. On ImageNet-C (i.e. distorted ImageNet), it seems to be easier to learn texture than texture, but recall here that texture corresponds to the type of noise, and I imagine that the cardinality of noise types is far smaller than that of ImageNet images, so I'm not sure how informative this comparison is. Overall, this experiment suggests that CNNs are able to learn from shape alone without low-level texture as a clue, in cases where the two sources of information disagree The paper moves on to try to understand what factors about a normal ImageNet model give it higher or lower shape bias - that is, a higher or lower likelihood of classifying a GST image according to its shape rather than texture. Predictably, data augmentations have an effect here. When data is augmented with aggressive random cropping, this increases texture bias relative to shape bias, presumably because when large chunks of an object are cropped away, its overall shape becomes a less useful feature. Center cropping is better for shape bias, probably because objects are likely to be at the center of the image, so center cropping has less of a chance of distorting them. On the other hand, more "naturalistic" augmentations like adding Gaussian noise or distorting colors lead to a higher shape bias in the resulting networks, up to 60% with all the modifications. However, the authors also find that pushing the shape bias up has the result of dropping final test accuracy. https://i.imgur.com/Lb6RMJy.png Interestingly, while the techniques that increase shape bias seem to also harm performance, the authors also find that higher-performing models tend to have higher shape bias (though with texture bias still outweighing shape) suggesting that stronger models learn how to use shape more effectively, but also that handicapping models' ability to use texture in order to incentivize them to use shape tends to hurt performance overall. Overall, my take from this paper is that texture-level data is actually statistically informative and useful for classification - even in terms of generalization - even if is too high-resolution to be useful as a visual feature for humans. CNNs don't seem inherently incapable of learning from shape, but removing their ability to rely on texture seems to lead to a notable drop in accuracy, suggesting there was real signal there that we're losing out on.
1 Comments
|
[link]
This paper starts by introducing a trick to reduce the variance of stochastic gradient variational Bayes (SGVB) estimators. In neural networks, SGVB consists in learning a variational (e.g. diagonal Gaussian) posterior over the weights and biases of neural networks, through a procedure that (for the most part) alternates between adding (Gaussian) noise to the model's parameters and then performing a model update with backprop. The authors present a local reparameterization trick, which exploits the fact that the Gaussian noise added into the weights could instead be added directly into the pre-activation (i.e. before the activation fonction) vectors during forward propagation. This is due to the fact that computing the pre-activation is a linear operation, thus noise at that level is also Gaussian. The advantage of doing so is that, in the context of minibatch training, one can efficiently then add independent noise to the pre-activation vectors for each example of the minibatch. The nature of the local reparameterization trick implies that this is equivalent to using one corrupted version of the weights for each example in the minibatch, something that wouldn't be practical computationally otherwise. This is in fact why, in normal SGVB, previous work would normally use a single corrupted version of the weights for all the minibatch. The authors demonstrate that using the local reparameterization trick yields stochastic gradients with lower variance, which should improve the speed of convergence. Then, the authors demonstrate that the Gaussian version of dropout (one that uses multiplicative Gaussian noise, instead of 0-1 masking noise) can be seen as the local reparameterization trick version of a SGVB objective, with some specific prior and variational posterior. In this SGVB view of Gaussian dropout, the dropout rate is an hyper-parameter of this prior, which can now be tuned by optimizing the variational lower bound of SGVB. In other words, we now have a method to also train the dropout rate! Moreover, it becomes possible to tune an individual dropout rate parameter for each layer, or even each parameter of the model. Experiments on MNIST confirm that tuning that parameter works and allows to reach good performance of various network sizes, compared to using a default dropout rate. ##### My two cents This is another thought provoking connection between Bayesian learning and dropout. Indeed, while Deep GPs have allowed to make a Bayesian connection with regular (binary) dropout learning \cite{journals/corr/GalG15}, this paper sheds light on a neat Bayesian connection for the Gaussian version of dropout. This is great, because it suggests that Gaussian dropout training is another legit way of modeling uncertainty in the parameters of neural networks. It's also nice that that connection also yielded a method for tuning the dropout rate automatically. I hope future work (by the authors or by others) can evaluate the quality of the corresponding variational posterior in terms of estimating uncertainty in the network and, in particular, in obtaining calibrated output probabilities. Little detail: I couldn't figure out whether the authors tuned a single dropout rate for the whole network, or used many rates, for instance one per parameter, as they suggest can be done. |
[link]
Teo et al. propose a convex, robust learning framework allowing to integrate invariances into SVM training. In particular, they consider a set of valid transformations and define the cost of a training sample (i.e., pair of data and label) as the loss under the worst case transformation – this definition is very similar to robust optimization or adversarial training. Then, a convex upper bound on this cost is derived. Given, that the worst case transformation can be found efficiently, two different algorithms for minimizing this loss are proposed and discussed. The framework is evaluated on MNIST. However, considering error rate, the approach does not outperform data augmentation significantly (here, data augmentation means adding “virtual, i.e., transformed, samples to the training set). However, the proposed method seems to find sparser solutions, requiring less support vectors. Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/). |
[link]
#### Introduction * Presents WikiQA - a publicly available set of question and sentence pairs for open-domain question answering. * [Link to the paper](https://www.microsoft.com/en-us/research/publication/wikiqa-a-challenge-dataset-for-open-domain-question-answering/) #### Dataset * 3047 questions sampled from Bing query logs. * Each question associated with a Wikipedia page. * All sentences in the summary paragraph of the page become the candidate answers. * Only 1/3rd questions have a correct answer in the candidate answer set. * Solutions crowdsourced through MTurk like platform. * Answer sentences are associated with *answer phrases* (shortest substring of a sentence that answers the question) though this annotation is not used in the experiments reported by the paper. #### Other Datasets * [QASent datset](http://homes.cs.washington.edu/~nasmith/papers/wang+smith+mitamura.emnlp07.pdf) * Uses questions from TREC-QA dataset (questions from both query logs and human editors) and selects sentences which share at least one non-stopword from the question. * Lexical overlap makes QA task easier. * Does not support evaluating for *answer triggering* (detecting if the correct answer even exists in the candidate sentences). #### Experiments ##### Baseline Systems * **Word Count** - Counts the number of non-stopwords common to question and answer sentences. * **Weighted Word Count** - Re-weight word counts by the IDF values of the question words. * **[LCLR](https://www.microsoft.com/en-us/research/publication/question-answering-using-enhanced-lexical-semantic-models/)** - Uses rich lexical semantic features like WordNet and vector-space lexical semantic models. * **Paragraph Vectors** - Considers cosine similarity between question vector and sentence vector. * **Convolutional Neural Network (CNN)** - Bigram CNN model with average pooling. * **PV-Cnt** and **CNN-Cnt** - Logistic regression classifier combining PV (and CNN) models and Word Count models. ##### Metrics * MAP and MRR for answer selection problem. * Precision, recall and F1 scores for answer triggering problem. #### Observations * CNN-cnt outperforms all other models on both the tasks. * Three additional features, namely the length of the question (QLen), the length of sentence (SLen), and the class of the question (QClass) are added to track question hardness and sentence comprehensiveness. * Adding QLen improves performance significantly while adding SLen (QClass) improves (degrades) performance marginally. * For the same model, the performance on the WikiQA dataset is inferior to that on the QASent dataset. * Note: The dataset is very small to train end-to-end networks. |