Calibration with Spiegelhalter Z-statistic (Uncertainty Series)

Summary

In the realm of machine learning, obtaining accurate predictions is crucial for making informed decisions. However, equally important is the confidence we have in these predictions. Calibration is a fundamental concept that helps assess the reliability of a machine learning model's predictions. I plan to write more on the topic of Uncertainty.

In this blog post, we will derive one of these evaluation metrics, the Spiegelhalter Z-statistic, which provides a simple yet powerful way to evaluate a model's calibration. A Python implementation is open-sourced in the splinator package.

Important References

  • Rufibach, Kaspar. "Use of Brier score to assess binary predictions." Journal of Clinical Epidemiology 63.8 (2010): 938–939.
  • Huang, Yingxiang, et al. "A tutorial on calibration measurements and calibration models for clinical prediction models." Journal of the American Medical Informatics Association 27.4 (2020): 621–633. https://doi.org/10.1093/jamia/ocz228
  • Spiegelhalter, D. J. "Probabilistic prediction in patient management and clinical trials." Statistics in Medicine 5.5 (1986): 421–433. doi:10.1002/sim.4780050506
  • Brier score decomposition from Stata: https://www.stata.com/manuals/rbrier.pdf

Brier Score

Before we get started on the Spiegelhalter Z-statistic, let’s talk about the Brier score. It’s a score frequently used in the literature, tutorials, and open-source libraries as a metric for calibration evaluation.

The Brier score is defined as the average squared difference between the predicted probabilities $\hat{p}_i$ and the actual binary outcomes $y_i$ over $n$ data points:

$$\text{Brier Score (BS)} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{p}_i)^2$$
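The definition translates directly to numpy. Here `brier_score` is a small illustrative helper (not part of splinator), evaluated on made-up labels and predictions:

```python
import numpy as np

def brier_score(y, p_hat):
    """Mean squared difference between binary outcomes and predicted probabilities."""
    y, p_hat = np.asarray(y, dtype=float), np.asarray(p_hat, dtype=float)
    return np.mean((y - p_hat) ** 2)

labels = np.array([1, 0, 1, 1, 0])
preds = np.array([0.9, 0.1, 0.8, 0.6, 0.3])
print(brier_score(labels, preds))  # lower is better; 0 means perfect predictions
```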

Spiegelhalter Z-statistic

The Brier score can be decomposed into two parts, leveraging the fact that $y_i^2 = y_i$ for binary outcomes:

$$\text{Brier Score (BS)} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{p}_i)(1 - 2\hat{p}_i) + \frac{1}{n} \sum_{i=1}^{n} \hat{p}_i(1 - \hat{p}_i)$$
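This decomposition is a pure algebraic identity for binary $y_i$, so we can sanity-check it numerically on simulated data (a sketch, independent of splinator):

```python
import numpy as np

rng = np.random.default_rng(0)
p_hat = rng.uniform(0.05, 0.95, size=1000)          # predicted probabilities
y = rng.binomial(1, p_hat).astype(float)            # binary outcomes

bs = np.mean((y - p_hat) ** 2)                      # Brier score, direct form
calibration = np.mean((y - p_hat) * (1 - 2 * p_hat))  # first term
sharpness = np.mean(p_hat * (1 - p_hat))              # second term

print(bs, calibration + sharpness)  # identical up to floating-point error
```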

Here, the first term measures calibration, and the second term is “sharpness”, which does not depend on the realization $y$.

The null hypothesis is perfect calibration, that is, $\mathbf{E}[y_i] = \hat{p}_i$.

$$\mathbf{E}[BS] = \mathbf{E}\left[\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{p}_i)(1 - 2\hat{p}_i) + \frac{1}{n} \sum_{i=1}^{n} \hat{p}_i(1 - \hat{p}_i)\right] = \frac{1}{n} \sum_{i=1}^{n} \hat{p}_i(1 - \hat{p}_i)$$
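We can see this in simulation: if outcomes are drawn from a perfectly calibrated world, $y_i \sim \text{Bernoulli}(\hat{p}_i)$, the average Brier score over many draws should match the sharpness term. A minimal sketch with made-up probabilities:

```python
import numpy as np

rng = np.random.default_rng(42)
p_hat = rng.uniform(0.1, 0.9, size=50)
sharpness = np.mean(p_hat * (1 - p_hat))

# Draw outcomes from a perfectly calibrated world many times and
# average the Brier score across the draws.
n_trials = 20_000
y = rng.binomial(1, p_hat, size=(n_trials, p_hat.size))
bs = np.mean((y - p_hat) ** 2, axis=1)  # one Brier score per draw

print(bs.mean(), sharpness)  # the two values should nearly coincide
```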

In other words, the expected value comes entirely from the sharpness term. Let’s calculate the variance! A friendly reminder: for any random variable $X$, the variance of $X$ is the expected value of the squared difference between $X$ and its expected value:

$$Var[BS] = \mathbf{E}[(BS - \mathbf{E}[BS])^2]$$

With this in mind, we can start calculating the variance:

$$\begin{align} Var[BS] &= \mathbf{E}\left[\left[\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{p}_i)(1 - 2\hat{p}_i)\right]^2\right] \\ &= \frac{1}{n^2} \mathbf{E}\left[\left[\sum_{i=1}^{n} (y_i - \hat{p}_i)(1 - 2\hat{p}_i)\right]^2\right] \\ &= \frac{1}{n^2} \sum_{i=1}^{n} \mathbf{E}\left[\left[(y_i - \hat{p}_i)(1 - 2\hat{p}_i)\right]^2\right] \\ &= \frac{1}{n^2} \sum_{i=1}^{n} (1 - 2\hat{p}_i)^2\, \mathbf{E}\left[(y_i - \hat{p}_i)^2\right] \end{align}$$

The step from (2) to (3) relies on the outcomes being independent: under the null, each $y_i - \hat{p}_i$ has mean zero, so all cross terms in the expanded square vanish.

Because $\mathbf{E}[y_i] = \hat{p}_i$, each $\mathbf{E}[(y_i - \hat{p}_i)^2]$ is the variance of a Bernoulli distribution, $Var(y_i) = \hat{p}_i(1 - \hat{p}_i)$, so we can further simplify (4):

$$Var[BS] = \frac{1}{n^2} \sum_{i=1}^{n} \hat{p}_i(1 - \hat{p}_i)(1 - 2\hat{p}_i)^2$$

The z-statistic can be written as:

$$\begin{align} Z &= \frac{BS - \mathbf{E}[BS]}{\sqrt{Var[BS]}} \\ &= \frac{\sum_{i=1}^{n}(y_i - \hat{p}_i)(1 - 2\hat{p}_i)}{\sqrt{\sum_{i=1}^{n}(1 - 2\hat{p}_i)^2\, \hat{p}_i(1 - \hat{p}_i)}} \end{align}$$

Note that the factors of $1/n$ cancel between the numerator and the denominator.

The corresponding p-value is the two-sided tail probability of $Z$ under the standard normal distribution. The null hypothesis of calibration, $\mathbf{E}[y_i] = \hat{p}_i$, is rejected at significance level $\alpha$ if $|Z| > q_{1-\alpha/2}$, where $q_{\alpha}$ is the $\alpha$-quantile of the standard normal distribution.
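Putting the pieces together, the closed form above is straightforward to compute with numpy. The helpers below are an illustrative sketch (independent of splinator's implementation), with the standard normal CDF built from `math.erf` to avoid extra dependencies:

```python
import math
import numpy as np

def spiegelhalter_z(y, p_hat):
    """Z-statistic from the closed form: the 1/n factors have cancelled."""
    y, p_hat = np.asarray(y, dtype=float), np.asarray(p_hat, dtype=float)
    num = np.sum((y - p_hat) * (1 - 2 * p_hat))
    den = np.sqrt(np.sum((1 - 2 * p_hat) ** 2 * p_hat * (1 - p_hat)))
    return num / den

def two_sided_p_value(z):
    # Standard normal CDF via erf; p = 2 * (1 - Phi(|z|)).
    phi = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))
    return 2 * (1 - phi)

y = np.array([1, 0, 1, 1, 0, 1, 0, 0])
p_hat = np.array([0.8, 0.2, 0.7, 0.9, 0.3, 0.6, 0.4, 0.1])
z = spiegelhalter_z(y, p_hat)
print(z, two_sided_p_value(z))  # |Z| well below q_0.975, so no evidence of miscalibration
```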

Implementation

You can find the numpy implementation in the Splinator library here: https://github.com/Affirm/splinator/blob/main/src/splinator/metrics.py

It can be used with simple Python code:

from splinator.metrics import spiegelhalters_z_statistic
import numpy as np

# Two candidate models scored against the same labels
labels = np.array([1, 0])
scores_a = np.array([0.2, 0.2])
scores_b = np.array([0.4, 0.5])

szs_a = spiegelhalters_z_statistic(labels, scores_a)
szs_b = spiegelhalters_z_statistic(labels, scores_b)

We can recreate Table 1 from Rufibach (2010), "Use of Brier score to assess binary predictions."

[Image: recreated Table 1 from Rufibach (2010)]