In the area of explaining model predictions over images, there are two main strains of technique: methods that look for pixels that have the highest gradient effect on the output class, and assign those as the “reason” for the class, and approaches that ask which pixel regions are most responsible for a given classification, in the sense that the classification would change the most if they were substituted with some uninformative reference value. The tricky thing about the second class of methods is that you need to decide what to use as your uninformative fill-in value. It’s easy enough to conceptually pose the problem of “what would our model predict if it couldn’t see this region of pixels,” but as a practical matter, these models take in full images, and you have to put *something* to give the classifier in a region, if you’re testing what the score would be if you removed the information contained in the pixels in that region. What should you fill in instead? The simplest answers are things like “zeros”, or “a constant value” or “white noise”. But all of these are very off-distribution for the model; it wouldn’t have typically seen images that resemble white noise, or all zeros, or all a single value. So if you measure the change in your model score from an off-distribution baseline to your existing pixels, you may not be getting the marginal value of the pixels, so much as the marginal disutility of having something so different from what the model has previously seen. There are other, somewhat more sensible approaches, like blurring out the areas around the pixel region of interest, but these experience less intense forms of the same issue. This paper proposes instead, using generative models to fill in the regions conditioned on the surrounding pixels, and use that as a reference. The notion here is that a conditioned generative model, like a GAN or VAE, can take into account the surrounding pixels, and “imagine” a fill-in that flows smoothly from the surrounding pixels, and looks generally like an image, but which doesn’t contain the information from the pixels in the region being tested, since it wasn’t conditioned on that. https://i.imgur.com/2fKnY0M.png Using this approach, the authors run two types of test: one where they optimize to find the smallest region they can remove from the image, and have it switch class (Smallest Deletion Region, or SDR), and also the smallest informative region that can be added to an otherwise uninformative image, and have the model predict the class connected to that region. They find that regions calculated using their generative model fill in, and specifically with GANs, find a smaller and more compact pixel region as their explanation for the prediction, which is consistent with both human intuitions and also with a higher qualitative sensibleness of the explanations found.