Summary
In the realm of machine learning, obtaining accurate predictions is crucial for making informed decisions. However, equally important is the confidence we have in these predictions. Calibration is a fundamental concept that helps assess the reliability of a machine learning model's predictions. I plan to write more on the topic of Uncertainty.
In this blog post, we will derive one of these evaluation metrics, the Spiegelhalter Z-statistic, which provides a simple yet powerful way to evaluate a model's calibration. A Python implementation is open-sourced in the splinator package.
Important References
- Rufibach, Kaspar. "Use of Brier score to assess binary predictions." Journal of Clinical Epidemiology 63.8 (2010): 938–939.
- Huang, Yingxiang, et al. "A tutorial on calibration measurements and calibration models for clinical prediction models." Journal of the American Medical Informatics Association 27.4 (2020): 621–633. https://doi.org/10.1093/jamia/ocz228
- Spiegelhalter, D. J. "Probabilistic prediction in patient management and clinical trials." Statistics in Medicine 5.5 (1986): 421–433. doi:10.1002/sim.4780050506
- Brier score decomposition, Stata manual: https://www.stata.com/manuals/rbrier.pdf
Brier Score
Before we get started on the Spiegelhalter Z-statistic, let's talk about the Brier score. It is frequently used in the literature, in tutorials, and in open-source libraries as a metric for calibration evaluation.
The Brier score is defined as the average squared difference between the predicted probabilities and the actual binary outcomes over $N$ data points:

$$
B = \frac{1}{N}\sum_{i=1}^{N}\left(p_i - y_i\right)^2, \tag{1}
$$

where $p_i$ is the predicted probability and $y_i \in \{0, 1\}$ is the observed outcome for the $i$-th data point.
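As a quick illustration, here is a minimal NumPy sketch of the Brier score (the function name and example values are hypothetical, not part of any library):

import numpy as np

def brier_score(labels, scores):
    # mean squared difference between predicted probabilities and binary outcomes
    return float(np.mean((scores - labels) ** 2))

print(brier_score(np.array([1, 0]), np.array([0.9, 0.2])))  # ((0.1)^2 + (0.2)^2) / 2 = 0.025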
Spiegelhalter Z-statistic
The Brier score can be decomposed into two parts, leveraging the fact that $y_i^2 = y_i$ for binary outcomes:

$$
B = \frac{1}{N}\sum_{i=1}^{N}(y_i - p_i)(1 - 2p_i) + \frac{1}{N}\sum_{i=1}^{N}p_i(1 - p_i). \tag{2}
$$
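For completeness, here is the algebraic check behind the decomposition; the only non-trivial step uses $y_i^2 = y_i$:

$$
\begin{aligned}
(y_i - p_i)(1 - 2p_i) + p_i(1 - p_i)
&= y_i - 2p_i y_i - p_i + 2p_i^2 + p_i - p_i^2 \\
&= y_i^2 - 2p_i y_i + p_i^2 \\
&= (p_i - y_i)^2 .
\end{aligned}
$$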
Here, the first term measures calibration and the second term is "sharpness", which is irrelevant to the realization $y_i$.
The null hypothesis is perfect calibration, that is, $y_i \sim \mathrm{Bernoulli}(p_i)$, so $\mathbb{E}[y_i] = p_i$.
In other words, under the null the calibration term has zero expectation, since $\mathbb{E}\!\left[(y_i - p_i)(1 - 2p_i)\right] = (1 - 2p_i)\left(\mathbb{E}[y_i] - p_i\right) = 0$, and the expected value of the Brier score comes entirely from sharpness:

$$
\mathbb{E}[B] = \frac{1}{N}\sum_{i=1}^{N}p_i(1 - p_i). \tag{3}
$$

Let's calculate the variance! A friendly reminder: for any random variable $X$, the variance of $X$ is the expected value of the squared difference between $X$ and its expected value:

$$
\mathrm{Var}(X) = \mathbb{E}\!\left[(X - \mathbb{E}[X])^2\right].
$$
With this in mind, we can start calculating the variance of the Brier score. Only the calibration term is random, and the $y_i$ are independent, so:

$$
\mathrm{Var}(B) = \mathrm{Var}\!\left(\frac{1}{N}\sum_{i=1}^{N}(y_i - p_i)(1 - 2p_i)\right) = \frac{1}{N^2}\sum_{i=1}^{N}(1 - 2p_i)^2\,\mathrm{Var}(y_i). \tag{4}
$$
Because $y_i \sim \mathrm{Bernoulli}(p_i)$ under the null (and the variance of a Bernoulli distribution is $p_i(1 - p_i)$), we can further simplify (4):

$$
\mathrm{Var}(B) = \frac{1}{N^2}\sum_{i=1}^{N}(1 - 2p_i)^2\,p_i(1 - p_i). \tag{5}
$$
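A quick Monte Carlo sanity check of (3) and (5), written as a standalone sketch unrelated to the splinator code (all names and values below are made up): draw $y_i \sim \mathrm{Bernoulli}(p_i)$ many times and compare the empirical mean and variance of the Brier score against the formulas.

import numpy as np

rng = np.random.default_rng(0)
p = rng.uniform(0.05, 0.95, size=50)            # hypothetical predicted probabilities
y = rng.binomial(1, p, size=(20_000, p.size))   # outcomes simulated under perfect calibration
brier = np.mean((p - y) ** 2, axis=1)           # one Brier score per simulated dataset

print(brier.mean(), np.mean(p * (1 - p)))                                  # should match equation (3)
print(brier.var(), np.sum((1 - 2 * p) ** 2 * p * (1 - p)) / p.size ** 2)   # should match equation (5)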
The Z-statistic standardizes the Brier score using (3) and (5); note that the $1/N$ factors cancel between the numerator and the denominator:

$$
Z = \frac{B - \mathbb{E}[B]}{\sqrt{\mathrm{Var}(B)}} = \frac{\sum_{i=1}^{N}(y_i - p_i)(1 - 2p_i)}{\sqrt{\sum_{i=1}^{N}(1 - 2p_i)^2\,p_i(1 - p_i)}}. \tag{6}
$$
The corresponding p-value is given by the upper-tail probability of $Z$ under the standard normal distribution, $p = 1 - \Phi(Z)$. The null hypothesis of calibration, that is, $\mathbb{E}[y_i] = p_i$, is rejected at the significance level $\alpha$ if $Z > z_{1-\alpha}$, where $z_{1-\alpha}$ is the $(1-\alpha)$-quantile of the standard normal distribution.
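Putting the pieces together, here is a minimal from-scratch sketch of (6) and its upper-tail p-value, written directly from the formulas above; it is not the splinator implementation, and the function name and inputs are only for illustration:

import numpy as np
from scipy.stats import norm

def spiegelhalter_z(labels, scores):
    # numerator: the calibration term of the Brier score decomposition (times N)
    numerator = np.sum((labels - scores) * (1 - 2 * scores))
    # denominator: square root of N^2 * Var(B) under the null, equation (5)
    denominator = np.sqrt(np.sum((1 - 2 * scores) ** 2 * scores * (1 - scores)))
    z = numerator / denominator
    p_value = norm.sf(z)  # upper-tail probability under the standard normal
    return z, p_value

labels = np.array([1, 0, 1, 1, 0])
scores = np.array([0.8, 0.3, 0.6, 0.9, 0.2])
print(spiegelhalter_z(labels, scores))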
Implementation
You can find the NumPy implementation in the Splinator library: https://github.com/Affirm/splinator/blob/main/src/splinator/metrics.py
It can be called with a few lines of Python:
import numpy as np
from splinator.metrics import spiegelhalters_z_statistic

labels = np.array([1, 0])          # observed binary outcomes
scores_a = np.array([0.2, 0.2])    # predicted probabilities, model A
scores_b = np.array([0.4, 0.5])    # predicted probabilities, model B

szs_a = spiegelhalters_z_statistic(labels, scores_a)
szs_b = spiegelhalters_z_statistic(labels, scores_b)
With this function, we can recreate Table 1 from Rufibach (2010), "Use of Brier score to assess binary predictions."