Summary
In the realm of machine learning, obtaining accurate predictions is crucial for making informed decisions. However, equally important is the confidence we have in these predictions. Calibration is a fundamental concept that helps assess the reliability of a machine learning model's predictions. I plan to write more on the topic of Uncertainty.
In this blog post, we will derive one of these evaluation metrics, the Spiegelhalter Z-statistic, which provides a simple yet powerful way to evaluate a model's calibration. A Python implementation is open-sourced in the splinator package.
Important References
- Rufibach, Kaspar. "Use of Brier score to assess binary predictions." Journal of Clinical Epidemiology 63.8 (2010): 938–939.
- Huang, Yingxiang, et al. "A tutorial on calibration measurements and calibration models for clinical prediction models." Journal of the American Medical Informatics Association 27.4 (2020): 621–633. https://doi.org/10.1093/jamia/ocz228
- Spiegelhalter, D. J. "Probabilistic prediction in patient management and clinical trials." Statistics in Medicine 5.5 (1986): 421–433. doi:10.1002/sim.4780050506
- Brier score decomposition, Stata manual: https://www.stata.com/manuals/rbrier.pdf
Brier Score
Before we get started on the Spiegelhalter Z-statistic, let's talk about the Brier score. It is frequently used in the literature, in tutorials, and in open-source libraries as a metric for calibration evaluation.
The Brier score is defined as the average squared difference between the predicted probabilities and the actual binary outcomes over $N$ data points:

$$
B = \frac{1}{N}\sum_{i=1}^{N}\left(p_i - y_i\right)^2, \tag{1}
$$

where $p_i$ is the predicted probability and $y_i \in \{0, 1\}$ is the observed outcome for the $i$-th data point.
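As a quick illustration, here is a minimal NumPy sketch of the Brier score (the function name and example values are hypothetical, not part of any library):

import numpy as np

def brier_score(labels, scores):
    # mean squared difference between predicted probabilities and binary outcomes
    return float(np.mean((scores - labels) ** 2))

print(brier_score(np.array([1, 0]), np.array([0.9, 0.2])))  # ((0.1)^2 + (0.2)^2) / 2 = 0.025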
Spiegelhalter Z-statistic
The Brier score can be decomposed into two parts, leveraging the fact that $y_i^2 = y_i$ for binary outcomes:

$$
B = \frac{1}{N}\sum_{i=1}^{N}(y_i - p_i)(1 - 2p_i) + \frac{1}{N}\sum_{i=1}^{N}p_i(1 - p_i). \tag{2}
$$
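For completeness, here is the algebraic check behind the decomposition; the only non-trivial step uses $y_i^2 = y_i$:

$$
\begin{aligned}
(y_i - p_i)(1 - 2p_i) + p_i(1 - p_i)
&= y_i - 2p_i y_i - p_i + 2p_i^2 + p_i - p_i^2 \\
&= y_i^2 - 2p_i y_i + p_i^2 \\
&= (p_i - y_i)^2 .
\end{aligned}
$$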
Here, the first term measures calibration and the second term is "sharpness", which is irrelevant to the realization $y_i$.
The null hypothesis is perfect calibration, that is, $y_i \sim \mathrm{Bernoulli}(p_i)$, so $\mathbb{E}[y_i] = p_i$.
In other words, under the null the calibration term has zero expectation, since $\mathbb{E}\!\left[(y_i - p_i)(1 - 2p_i)\right] = (1 - 2p_i)\left(\mathbb{E}[y_i] - p_i\right) = 0$, and the expected value of the Brier score comes entirely from sharpness:

$$
\mathbb{E}[B] = \frac{1}{N}\sum_{i=1}^{N}p_i(1 - p_i). \tag{3}
$$

Let's calculate the variance! A friendly reminder: for any random variable $X$, the variance of $X$ is the expected value of the squared difference between $X$ and its expected value:

$$
\mathrm{Var}(X) = \mathbb{E}\!\left[(X - \mathbb{E}[X])^2\right].
$$
With this in mind, we can start calculating the variance of the Brier score. Only the calibration term is random, and the $y_i$ are independent, so:

$$
\mathrm{Var}(B) = \mathrm{Var}\!\left(\frac{1}{N}\sum_{i=1}^{N}(y_i - p_i)(1 - 2p_i)\right) = \frac{1}{N^2}\sum_{i=1}^{N}(1 - 2p_i)^2\,\mathrm{Var}(y_i). \tag{4}
$$
Because $y_i \sim \mathrm{Bernoulli}(p_i)$ under the null (and the variance of a Bernoulli distribution is $p_i(1 - p_i)$), we can further simplify (4):

$$
\mathrm{Var}(B) = \frac{1}{N^2}\sum_{i=1}^{N}(1 - 2p_i)^2\,p_i(1 - p_i). \tag{5}
$$
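A quick Monte Carlo sanity check of (3) and (5), written as a standalone sketch unrelated to the splinator code (all names and values below are made up): draw $y_i \sim \mathrm{Bernoulli}(p_i)$ many times and compare the empirical mean and variance of the Brier score against the formulas.

import numpy as np

rng = np.random.default_rng(0)
p = rng.uniform(0.05, 0.95, size=50)            # hypothetical predicted probabilities
y = rng.binomial(1, p, size=(20_000, p.size))   # outcomes simulated under perfect calibration
brier = np.mean((p - y) ** 2, axis=1)           # one Brier score per simulated dataset

print(brier.mean(), np.mean(p * (1 - p)))                                  # should match equation (3)
print(brier.var(), np.sum((1 - 2 * p) ** 2 * p * (1 - p)) / p.size ** 2)   # should match equation (5)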
The Z-statistic standardizes the Brier score using (3) and (5); note that the $1/N$ factors cancel between the numerator and the denominator:

$$
Z = \frac{B - \mathbb{E}[B]}{\sqrt{\mathrm{Var}(B)}} = \frac{\sum_{i=1}^{N}(y_i - p_i)(1 - 2p_i)}{\sqrt{\sum_{i=1}^{N}(1 - 2p_i)^2\,p_i(1 - p_i)}}. \tag{6}
$$
The corresponding p-value is given by the upper-tail probability of $Z$ under the standard normal distribution, $p = 1 - \Phi(Z)$. The null hypothesis of calibration, that is, $\mathbb{E}[y_i] = p_i$, is rejected at the significance level $\alpha$ if $Z > z_{1-\alpha}$, where $z_{1-\alpha}$ is the $(1-\alpha)$-quantile of the standard normal distribution.
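Putting the pieces together, here is a minimal from-scratch sketch of (6) and its upper-tail p-value, written directly from the formulas above; it is not the splinator implementation, and the function name and inputs are only for illustration:

import numpy as np
from scipy.stats import norm

def spiegelhalter_z(labels, scores):
    # numerator: the calibration term of the Brier score decomposition (times N)
    numerator = np.sum((labels - scores) * (1 - 2 * scores))
    # denominator: square root of N^2 * Var(B) under the null, equation (5)
    denominator = np.sqrt(np.sum((1 - 2 * scores) ** 2 * scores * (1 - scores)))
    z = numerator / denominator
    p_value = norm.sf(z)  # upper-tail probability under the standard normal
    return z, p_value

labels = np.array([1, 0, 1, 1, 0])
scores = np.array([0.8, 0.3, 0.6, 0.9, 0.2])
print(spiegelhalter_z(labels, scores))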
Implementation
You can find the NumPy implementation in the Splinator library: https://github.com/Affirm/splinator/blob/main/src/splinator/metrics.py
It can be called with a few lines of Python:
import numpy as np
from splinator.metrics import spiegelhalters_z_statistic

labels = np.array([1, 0])          # observed binary outcomes
scores_a = np.array([0.2, 0.2])    # predicted probabilities, model A
scores_b = np.array([0.4, 0.5])    # predicted probabilities, model B

szs_a = spiegelhalters_z_statistic(labels, scores_a)
szs_b = spiegelhalters_z_statistic(labels, scores_b)
With this function, we can recreate Table 1 from Rufibach (2010), "Use of Brier score to assess binary predictions."