Guo et al. study the calibration of deep neural networks as a post-processing step. Here, calibration means correcting the predicted confidence scores, since recent deep networks are commonly overconfident. They consider several state-of-the-art post-processing methods for calibration, but show that a simple linear mapping, or even plain scaling, works surprisingly well. If $z_i$ are the logits of the network and $\sigma$ is the softmax, then (with the network fixed) a single parameter $T$ is found such that $\sigma(\frac{z_i}{T})$ is calibrated, by minimizing the NLL loss on a held-out validation set. Here, the temperature $T$ either softens ($T > 1$) or sharpens ($T < 1$) the probability distribution over classes. Interestingly, finding $T$ by optimizing the same loss that was used for training helps to reduce overconfidence. Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/).
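The fitting step described above can be sketched in a few lines of plain Python. This is a minimal illustration and not the authors' implementation: the function names (`nll`, `fit_temperature`, `calibrated_probs`), the search range for $T$, and the toy validation data are all assumptions made for the example. It uses a golden-section search over $\log T$, which is valid here because the NLL is convex in $1/T$ and therefore has a single minimum in $T$.

```python
import math

def nll(logits_batch, labels, T):
    """Mean negative log-likelihood of the true labels for softmax(z / T)."""
    total = 0.0
    for logits, y in zip(logits_batch, labels):
        scaled = [z / T for z in logits]
        m = max(scaled)  # subtract the max for numerical stability
        log_norm = m + math.log(sum(math.exp(s - m) for s in scaled))
        total += log_norm - scaled[y]
    return total / len(labels)

def fit_temperature(logits_batch, labels, t_min=0.05, t_max=20.0, iters=80):
    """Golden-section search over log T for the T that minimizes validation NLL.

    t_min and t_max are hypothetical bounds chosen for this sketch.
    """
    inv_phi = (math.sqrt(5) - 1) / 2
    a, b = math.log(t_min), math.log(t_max)
    for _ in range(iters):
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if nll(logits_batch, labels, math.exp(c)) < nll(logits_batch, labels, math.exp(d)):
            b = d  # the minimum lies in [a, d]
        else:
            a = c  # the minimum lies in [c, b]
    return math.exp((a + b) / 2)

def calibrated_probs(logits, T):
    """Softmax of the logits divided by the temperature T."""
    scaled = [z / T for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    norm = sum(exps)
    return [e / norm for e in exps]

# Toy held-out set (made up): very confident logits, but only 4 of 6 are correct.
val_logits = [[8.0, 0.0, 0.0], [0.0, 9.0, 0.0], [7.0, 0.0, 0.0],
              [0.0, 0.0, 8.0], [0.0, 8.0, 0.0], [9.0, 0.0, 0.0]]
val_labels = [0, 1, 1, 2, 2, 0]

T = fit_temperature(val_logits, val_labels)
print("fitted temperature:", T)
print("calibrated probabilities:", calibrated_probs(val_logits[0], T))
```

Because dividing all logits by the same positive $T$ does not change their ordering, the predicted class, and hence the accuracy, stays the same; only the confidence changes.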

## Task

A neural network for classification typically ends in a **softmax** layer and outputs the class with the maximum probability. However, this probability does not necessarily represent the true **confidence**, that is, how likely the prediction is to be correct. If the average confidence (average of the max probabilities) on a dataset matches the accuracy, the model is called **well-calibrated**. Older models such as LeNet (1998) were well-calibrated, but modern networks such as ResNet (2016) no longer are. This paper explains what caused this and compares various calibration methods.

## Figure - Confidence Histogram

https://i.imgur.com/dMtdWsL.png

The bottom row: group the samples by confidence (max probabilities) into bins, and calculate the accuracy (# correct / bin size) within each bin.

- ECE (Expected Calibration Error): average of |accuracy - confidence| over the bins, weighted by bin size
- MCE (Maximum Calibration Error): maximum of |accuracy - confidence| over the bins

## Analysis - What

The paper examines how miscalibration varies with different factors: (1) model capacity, (2) batch norm, (3) weight decay, (4) NLL.

## Solution - Calibration Methods

Many calibration methods for binary and multiclass classification are evaluated. The method that performed best is **temperature scaling**, which simply divides the logits before the softmax by a single constant, the temperature (equivalently, multiplies them by its inverse). The paper uses the validation set to choose the best constant.
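The binning behind the confidence histogram, and the ECE and MCE computed from it, can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `calibration_errors`, the choice of 10 equal-width bins, and the toy inputs are hypothetical. Bin edges here are half-open on the right, which may differ from the paper's convention for samples that fall exactly on an edge.

```python
def calibration_errors(confidences, correct, n_bins=10):
    """Return (ECE, MCE) using equal-width confidence bins over [0, 1].

    confidences: max softmax probability per sample
    correct:     1 if the prediction was right, 0 otherwise
    """
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, hit))

    n = len(confidences)
    ece, mce = 0.0, 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        accuracy = sum(hit for _, hit in members) / len(members)
        avg_conf = sum(conf for conf, _ in members) / len(members)
        gap = abs(accuracy - avg_conf)
        ece += len(members) / n * gap  # weight each gap by the bin's share of samples
        mce = max(mce, gap)
    return ece, mce

# Toy data (made up): four samples at 0.95 confidence, two at 0.55.
confidences = [0.95, 0.95, 0.95, 0.95, 0.55, 0.55]
correct = [1, 1, 1, 0, 1, 0]
ece, mce = calibration_errors(confidences, correct)
print("ECE:", ece, "MCE:", mce)
```

In this toy case the 0.95 bin has accuracy 0.75 and the 0.55 bin has accuracy 0.5, so the model is overconfident in both bins, and the larger gap determines the MCE.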