HellaSwag: Can a Machine Really Finish Your Sentence?
Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi (2019)
Paper summary by decodyng
[Machine learning is a nuanced, complicated, intellectually serious topic...but sometimes it's refreshing to let ourselves be a bit less serious, especially when that's accompanied by clear, cogent writing on a topic. This paper is a particularly delightful example of good-natured silliness - from the dataset name HellaSwag to figures containing cartoons of BERT and ELMo representing language models - combined with interesting science.]
This paper tackles the problem of natural language comprehension, which asks: okay, our models can generate plausible-looking text, but do they actually exhibit what we would consider true understanding of language? One natural task structure for this is to take questions or "contexts" and, given a set of possible endings or completions, pick the correct one. Positive examples are relatively easy to come by: adjacent video captions and question/answer pairs from WikiHow are two sources used in this paper. However, it's more difficult to come up with *negative* examples. Even though our incorrect endings won't be a meaningful continuation of the sentence, we want them to be "close enough" that we can feel comfortable treating a model's ability to pick the correct answer as evidence of some meaningful kind of comprehension. As an obvious failure mode, if the alternative multiple-choice options were all the same word repeated ten times, they would be recognizable as not being real language, and it would be easy for a model to select the answer with the distributional statistics of real language, but that wouldn't prove much. Typical failure modes aren't this egregious, but the overall intuition still holds, and will inform the rest of the paper: your ability to test comprehension on a given dataset is a function of how contextually relevant and realistic your negative examples are.
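To make the task format concrete, here is a minimal sketch of what one multiple-choice item and an accuracy metric might look like. The context and correct ending are taken from the paper's abstract; the three wrong endings and the `accuracy` helper are invented here for illustration.

```python
# A SWAG/HellaSwag-style item: one context, four candidate endings,
# exactly one of which is the true continuation. The correct ending is
# the abstract's example; the distractors are made up for this sketch.
item = {
    "context": "A woman sits at a piano.",
    "endings": [
        "She sets her fingers on the keys.",     # correct
        "She lifts the piano over her head.",    # invented distractor
        "The piano walks out of the room.",      # invented distractor
        "She boils the piano in a pot of soup.", # invented distractor
    ],
    "label": 0,
}

def accuracy(predict, items):
    """Fraction of items where the model picks the gold ending.
    `predict(context, endings)` returns the index of the chosen ending."""
    return sum(predict(it["context"], it["endings"]) == it["label"]
               for it in items) / len(items)
```

A model that always picks the first ending would score 1.0 on this single item and 25% in expectation on a shuffled dataset, which is the random baseline the summary refers to.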
Previous work (by many of the same authors as this paper) proposed a technique called Adversarial Filtering to try to solve this problem. In Adversarial Filtering, a generative language model is used to generate many possible endings conditioned on the input context, to be used as negative examples. Then, a discriminator is trained to predict the correct ending given the context. The generated samples that the discriminator had the highest confidence classifying as negative are deemed not challenging enough, and they're thrown out and replaced with others from the pool of initially generated samples. Eventually, once we've iterated through this process, we have a dataset with hopefully realistic negative examples. The negative examples are then given to humans to verify that none of them is actually a meaningful ending to the sentence. The dataset released by the earlier paper, which used a relatively simple LSTM model as its generator, was called SWAG.
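The filtering loop described above can be sketched in a few lines. This is a simplified illustration, not the paper's implementation: `generator(context, k)` and `train_discriminator` stand in for the actual language model and classifier, and the 0.5 threshold is an arbitrary choice for the sketch.

```python
import random

def adversarial_filtering(contexts, generator, train_discriminator,
                          n_candidates=50, n_keep=3, n_rounds=10):
    """Sketch of an Adversarial Filtering loop (hypothetical interfaces).
    generator(context, k) -> k candidate endings (strings).
    train_discriminator(contexts, negatives) -> score(context, ending),
    where a higher score means the ending looks machine-generated."""
    # Generate a large pool of machine-written endings per context.
    pools = {c: generator(c, n_candidates) for c in contexts}
    # Start with an arbitrary subset as the current negative set.
    negatives = {c: random.sample(pool, n_keep) for c, pool in pools.items()}
    for _ in range(n_rounds):
        score = train_discriminator(contexts, negatives)
        for c in contexts:
            # Negatives the discriminator confidently rejects are too easy:
            # throw them out and draw replacements from the pool.
            kept = [e for e in negatives[c] if score(c, e) < 0.5]
            while len(kept) < n_keep:
                kept.append(random.choice(pools[c]))
            negatives[c] = kept
    return negatives
```

The dynamic the authors discuss falls out of this structure: the surviving negatives are exactly the ones the discriminator cannot distinguish from real endings, so their quality is bounded by how good the generator's candidates were in the first place.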
However, the authors came to notice that the performance of new language models (most centrally BERT) on this dataset might not be quite what it appears: BERT's success rate of 86% only goes down to 76% if you don't give the classifier access to the input context, which means it can get 76% (off a random baseline of 25%, with 4 options) simply by evaluating which endings are coherent as standalone bits of natural language, without actually having to understand or even see the context. Shuffling the words within the possible endings had a similarly small effect: the authors are able to get BERT to perform at 60% accuracy on the SWAG dataset with no context and with shuffled words in the possible answers, meaning it's selecting purely based on the distribution of words in the answer, rather than on the meaningfully-ordered sequence of words.
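These probes are straightforward dataset transformations. Below is a minimal sketch, assuming the item format `{"context": str, "endings": [str, ...], "label": int}` from above; the function name and interface are invented for illustration.

```python
import random

def ablate(item, drop_context=False, shuffle_endings=False, seed=0):
    """Build an ablated copy of an item for the two probes:
    hide the context, and/or shuffle the words inside each candidate
    ending so only bag-of-words statistics survive."""
    rng = random.Random(seed)  # seeded so the ablation is reproducible
    endings = item["endings"]
    if shuffle_endings:
        def scramble(text):
            words = text.split()
            rng.shuffle(words)
            return " ".join(words)
        endings = [scramble(e) for e in endings]
    context = "" if drop_context else item["context"]
    return {"context": context, "endings": endings, "label": item["label"]}
```

Evaluating a model on `ablate(item, drop_context=True, shuffle_endings=True)` for every item is the condition under which BERT still reached 60% on SWAG, far above the 25% random baseline.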
The authors' overall conclusion is that this adversarial filtering method is only as good as the generator, and, more specifically, the training dynamic between the generator that produces candidate endings and the discriminator that filters them. If the generator is too weak, the negative examples can be easily detected as fake by a stronger model, but if the generator is too strong, the discriminator can't get good enough to usefully weed samples out. They demonstrate this by creating a new version of SWAG, which they call HellaSwag (for the expected acronym-optimization reasons), with a GPT generator rather than the simpler one used before: on this new dataset, all existing models get relatively poor results (30-40% accuracy). However, the authors' overall point wasn't "we've solved it, this new dataset is the end of the line," but rather a call to be wary going forward, and generally aware that with benchmarks like these, especially ones with generated negative examples, evaluation will be a constantly moving target as generation systems get better.
HellaSwag: Can a Machine Really Finish Your Sentence?
arXiv e-Print archive, 2019
First published: 2019/05/19
Abstract: Recent work by Zellers et al. (2018) introduced a new task of commonsense natural language inference: given an event description such as "A woman sits at a piano," a machine must select the most likely followup: "She sets her fingers on the keys." With the introduction of BERT, near human-level performance was reached. Does this mean that machines can perform human-level commonsense inference?

In this paper, we show that commonsense inference still proves difficult for even state-of-the-art models, by presenting HellaSwag, a new challenge dataset. Though its questions are trivial for humans (>95% accuracy), state-of-the-art models struggle (<48%). We achieve this via Adversarial Filtering (AF), a data collection paradigm wherein a series of discriminators iteratively select an adversarial set of machine-generated wrong answers. AF proves to be surprisingly robust. The key insight is to scale up the length and complexity of the dataset examples towards a critical 'Goldilocks' zone wherein generated text is ridiculous to humans, yet often misclassified by state-of-the-art models.

Our construction of HellaSwag, and its resulting difficulty, sheds light on the inner workings of deep pretrained models. More broadly, it suggests a new path forward for NLP research, in which benchmarks co-evolve with the evolving state-of-the-art in an adversarial way, so as to present ever-harder challenges.