I should say from the outset: I have a lot of fondness for this paper. It goes upstream of a lot of research-community incentives: It’s not methodologically flashy, it’s not about beating the State of the Art with a bigger, better model (though, those papers certainly also have their place). The goal of this paper was, instead, to dive into a test set used to evaluate performance of models, and try to understand to what extent it’s really providing a rigorous test of what we want out of model behavior. Test sets are the often-invisible foundation upon which ML research is based, but like real-world foundations, if there are weaknesses, the research edifice built on top can suffer. Specifically, this paper discusses the Winograd Schema, a clever test set used to test what the NLP community calls “common sense reasoning”. An example Winograd Schema sentence is: The delivery truck zoomed by the school bus because it was going so fast. A model is given this task, and asked to predict which token the underlined “it” refers to. These cases are specifically chosen because of their syntactic ambiguity - nothing structural about the order of the sentence requires “it” to refer to the delivery truck here. However, the underlying meaning of the sentence is only coherent under that parsing. This is what is meant by “common-sense” reasoning: the ability to understand meaning of a sentence in a way deeper than that allowed by simple syntactic parsing and word co-occurrence statistics. Taking the existing Winograd examples (and, when I said tiny, there are literally 273 of them) the authors of this paper surface some concerns about ways these examples might not be as difficult or representative of “common sense” abilities as we might like. - First off, there is the basic, previously mentioned fact that there are so few examples that it’s possible to perform well simply by random chance, especially over combinatorially large hyperparameter optimization spaces. This isn’t so much an indictment of the set itself as it is indicative of the work involved in creating it. - One of the two distinct problems the paper raises is that of “associativity”. This refers to situations where simple co-occurance counts between the description and the correct entity can lead the model to the correct term, without actually having to parse the sentence. An example here is: “I’m sure that my map will show this building; it is very famous.” Treasure maps aside, “famous buildings” are much more generally common than “famous maps”, and so being able to associate “it” with a building in this case doesn’t actually require the model to understand what’s going on in this specific sentence. The authors test this by creating a threshold for co-occurance, and, using that threshold, call about 40% of the examples “associative” - The second problem is that of predictable structure - the fact that the “hinge” adjective is so often the last word in the sentence, making it possible that the model is brittle, and just attending to that, rather than the sentence as a whole The authors perform a few tests - examining results on associative vs non-associative examples, and examining results if you switch the ordering (in cases like “Emma did not pass the ball to Janie although she saw that she was open,” where it’s syntactically possible), to ensure the model is not just anchoring on the identity of the correct entity, regardless of its place in the sentence. Overall, they found evidence that some of the state of the art language models perform well on the Winograd Schema as a whole, but do less well (and in some cases even less well than the baselines they otherwise outperform) on these more rigorous examples. Unfortunately, these tests don’t lead us automatically to a better solution - design of examples like this is still tricky and hard to scale - but does provide valuable caution and food for thought.