On Calibration of Modern Neural Networks

Guo, Chuan and Pleiss, Geoff and Sun, Yu and Weinberger, Kilian Q., 2017

Paper summary by elbaro

## Task
A neural network for classification typically has a **softmax** layer and outputs the class with the max probability. However, this max probability, taken as the **confidence** of the prediction, does not necessarily reflect how likely the prediction is to be correct. If the average confidence (average of max probs) over a dataset matches the accuracy, the model is called **well-calibrated**. Older models like LeNet (1998) were well-calibrated, but modern networks like ResNet (2016) no longer are. This paper investigates what causes this and compares various calibration methods.
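To make the definition concrete, here is a minimal sketch that compares average confidence with accuracy. The toy logits and labels are made up for illustration, and the helper names (`softmax`, `confidence_and_accuracy`) are hypothetical, not from the paper.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def confidence_and_accuracy(logits_batch, labels):
    """Return (average confidence, accuracy) over a batch of samples."""
    confidences, hits = [], []
    for logits, label in zip(logits_batch, labels):
        probs = softmax(logits)
        pred = max(range(len(probs)), key=probs.__getitem__)  # argmax class
        confidences.append(probs[pred])                       # max probability
        hits.append(pred == label)
    n = len(labels)
    return sum(confidences) / n, sum(hits) / n

# Hypothetical toy data: every prediction is very confident, one is wrong.
logits_batch = [[4.0, 0.0, 0.0], [0.0, 4.0, 0.0], [0.0, 0.0, 4.0], [4.0, 0.0, 0.0]]
labels = [0, 1, 2, 1]

avg_conf, acc = confidence_and_accuracy(logits_batch, labels)
print(f"average confidence = {avg_conf:.3f}, accuracy = {acc:.3f}")
```

In this toy batch the average confidence is well above the accuracy, which is the overconfident pattern the summary describes for modern networks.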
## Figure - Confidence Histogram
https://i.imgur.com/dMtdWsL.png
The bottom row: the samples are grouped into bins by confidence (max probabilities), and the accuracy (# correct / bin size) is calculated within each bin.
- ECE (Expected Calibration Error): average of |accuracy - confidence| over the bins, with each bin weighted by its share of the samples
- MCE (Maximum Calibration Error): max of |accuracy - confidence| over the bins
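The two metrics can be sketched in a few lines. This is an illustrative implementation under assumptions: equal-width bins, empty bins skipped, and ECE weighting each bin by its fraction of the samples. The function name `calibration_errors` and the toy inputs are hypothetical.

```python
import math

def calibration_errors(confidences, correct, n_bins=10):
    """Return (ECE, MCE) using equal-width confidence bins."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # Bin index m covers (m / n_bins, (m + 1) / n_bins];
        # a confidence of exactly 0 is placed in bin 0.
        idx = min(max(math.ceil(conf * n_bins) - 1, 0), n_bins - 1)
        bins[idx].append((conf, hit))

    ece, mce = 0.0, 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        bin_conf = sum(c for c, _ in members) / len(members)
        bin_acc = sum(1 for _, h in members if h) / len(members)
        gap = abs(bin_acc - bin_conf)
        ece += (len(members) / n) * gap  # weight by share of samples
        mce = max(mce, gap)
    return ece, mce

# Hypothetical toy data: four samples at 0.95 confidence, two at 0.65.
confidences = [0.95, 0.95, 0.95, 0.95, 0.65, 0.65]
correct = [True, True, True, False, True, False]

ece, mce = calibration_errors(confidences, correct)
print(f"ECE = {ece:.4f}, MCE = {mce:.4f}")
```

Here the high-confidence bin has accuracy 0.75 against confidence 0.95, and the other bin has accuracy 0.5 against confidence 0.65, so MCE is the larger gap and ECE is the sample-weighted mean of the two.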
## Analysis - What
The paper examines how different factors affect miscalibration: (1) model capacity, (2) batch norm, (3) weight decay, (4) NLL.
## Solution - Calibration Methods
Many calibration methods for binary and multi-class classification are evaluated. The method that performed best is **temperature scaling**, which simply divides the logits by a single constant $T$ (equivalently, multiplies them by $1/T$) before the softmax. The paper uses the validation set to choose the best constant.
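As a small illustration of what rescaling the logits does, the sketch below applies a softmax at several temperatures to one made-up logit vector. The function name `softmax_with_temperature` and the example values are assumptions for illustration only.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax of logits divided by a positive temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [3.0, 1.0, 0.2]  # hypothetical logits for one sample

for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(f"T = {t}: {[round(p, 3) for p in probs]}")
```

Because all logits are divided by the same positive constant, their ordering is preserved: the predicted class (and hence the accuracy) stays the same, and only the confidence changes.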
Guo et al. study calibration of deep neural networks as a post-processing step. Here, calibration means correcting the predicted confidence scores, which are commonly overconfident in recent deep networks. They consider several state-of-the-art post-processing methods for calibration, but show that a simple linear mapping, or even a single scaling, works surprisingly well. So if $z_i$ are the logits of the network, then (with the network fixed) a parameter $T$ is found such that
$\sigma(\frac{z_i}{T})$
is calibrated, by minimizing the NLL loss on a held-out validation set. Here, the temperature $T$ either softens ($T > 1$) or sharpens ($T < 1$) the probability distribution over classes. Interestingly, finding $T$ by optimizing the same loss that was used for training helps to reduce over-confidence.
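The fitting step can be sketched as a one-dimensional search for the $T$ that minimizes the NLL on held-out data. This is only an illustration under assumptions: the golden-section search is swapped in to keep the example dependency-free and is not claimed to be the paper's optimizer, and the names `nll`, `fit_temperature`, the search bounds, and the toy validation data are all hypothetical.

```python
import math

def nll(logits_batch, labels, temperature):
    """Mean negative log-likelihood of softmax(logits / temperature)."""
    total = 0.0
    for logits, label in zip(logits_batch, labels):
        scaled = [z / temperature for z in logits]
        m = max(scaled)
        log_norm = m + math.log(sum(math.exp(z - m) for z in scaled))
        total += log_norm - scaled[label]
    return total / len(labels)

def fit_temperature(logits_batch, labels, lo=0.05, hi=20.0, iters=80):
    """Golden-section search for the T in [lo, hi] that minimizes the NLL."""
    inv_phi = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    for _ in range(iters):
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if nll(logits_batch, labels, c) < nll(logits_batch, labels, d):
            b = d  # minimum lies in [a, d]
        else:
            a = c  # minimum lies in [c, b]
    return (a + b) / 2

# Hypothetical validation set: confident logits, 3 of 4 predictions correct.
val_logits = [[4.0, 0.0, 0.0], [0.0, 4.0, 0.0], [0.0, 0.0, 4.0], [4.0, 0.0, 0.0]]
val_labels = [0, 1, 2, 1]

T = fit_temperature(val_logits, val_labels)
print(f"fitted T = {T:.3f}")
print(f"NLL at T=1: {nll(val_logits, val_labels, 1.0):.4f}, "
      f"NLL at fitted T: {nll(val_logits, val_labels, T):.4f}")
```

On this toy set the fitted temperature is above 1, so the scaled softmax is softer than the original one, matching the overconfident case described above. The network weights are never touched; only the single scalar $T$ is learned.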
Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/).