Interpretability Beyond Feature Attribution: Quantitative Testing with Concept Activation Vectors (TCAV)
Been Kim, Martin Wattenberg, Justin Gilmer, Carrie Cai, James Wexler, Fernanda Viegas and Rory Sayres, 2017
Paper summary by davidstutz
Kim et al. propose Concept Activation Vectors (CAVs), which represent directions in feature space corresponding to human-interpretable concepts. In particular, given a network for a classification task, a concept is defined by a set of example images exhibiting that concept. A linear classifier is then trained, on the activations of a chosen feature layer, to distinguish images with the concept from random images without it. The normal of the resulting linear decision boundary is the learned Concept Activation Vector (CAV). Taking the directional derivative along this direction for a given input quantifies how well the input aligns with the chosen concept. This way, images can be ranked and the model's sensitivity to particular concepts can be quantified. The idea is also illustrated in Figure 1.
Figure 1: Process of constructing Concept Activation Vectors (CAVs).
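The procedure described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the activations and gradients are random placeholders standing in for the outputs of a real network's chosen layer, and a scikit-learn logistic regression plays the role of the linear classifier whose boundary normal becomes the CAV.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Placeholder activations at a chosen layer: rows are flattened feature
# vectors for concept images vs. random counterexamples (assumed data).
concept_acts = rng.normal(1.0, 1.0, size=(50, 64))
random_acts = rng.normal(0.0, 1.0, size=(50, 64))

X = np.vstack([concept_acts, random_acts])
y = np.array([1] * 50 + [0] * 50)

# Linear classifier separating concept from random activations.
clf = LogisticRegression(max_iter=1000).fit(X, y)

# The CAV is the (unit-normalized) normal of the decision boundary.
cav = clf.coef_[0] / np.linalg.norm(clf.coef_[0])

# Sensitivity of one input: directional derivative of the class logit
# along the CAV, i.e. grad . cav, where grad = d(logit)/d(activation)
# would come from backpropagation; a random placeholder is used here.
grad = rng.normal(size=64)
sensitivity = float(np.dot(grad, cav))

# TCAV score over a set of inputs: fraction with positive sensitivity.
grads = rng.normal(size=(100, 64))
tcav_score = float(np.mean(grads @ cav > 0))
```

In practice, `concept_acts`, `random_acts`, and `grads` would be computed by running the real network on concept images, random images, and class examples respectively.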
Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/).
arXiv e-Print archive, 2017
First published: 2017/11/30

Abstract: The interpretation of deep learning models is a challenge due to their size,
complexity, and often opaque internal state. In addition, many systems, such as
image classifiers, operate on low-level features rather than high-level
concepts. To address these challenges, we introduce Concept Activation Vectors
(CAVs), which provide an interpretation of a neural net's internal state in
terms of human-friendly concepts. The key idea is to view the high-dimensional
internal state of a neural net as an aid, not an obstacle. We show how to use
CAVs as part of a technique, Testing with CAVs (TCAV), that uses directional
derivatives to quantify the degree to which a user-defined concept is important
to a classification result--for example, how sensitive a prediction of "zebra"
is to the presence of stripes. Using the domain of image classification as a
testing ground, we describe how CAVs may be used to explore hypotheses and
generate insights for a standard image classification network as well as a
medical application focusing on diabetic retinopathy.