Scoring forecasts directly

Nikos Bosse

2023-11-29

A variety of metrics and scoring rules can also be accessed directly through the scoringutils package.

The following gives an overview of (most of) the implemented metrics.

Bias

The function bias_sample determines bias from predictive Monte-Carlo samples, automatically recognising whether forecasts are continuous or integer valued.

For continuous forecasts, Bias is measured as \[B_t (P_t, x_t) = 1 - 2 \cdot (P_t (x_t))\]

where \(P_t\) is the empirical cumulative distribution function of the prediction for the true value \(x_t\). Computationally, \(P_t (x_t)\) is just calculated as the fraction of predictive samples for \(x_t\) that are smaller than \(x_t\).

For integer valued forecasts, Bias is measured as

\[B_t (P_t, x_t) = 1 - (P_t (x_t) + P_t (x_t - 1))\]

to adjust for the integer nature of the forecasts. In both cases, Bias can assume values between -1 and 1 and is 0 ideally.
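As a rough illustration of the continuous case, the following base-R sketch computes the bias as one minus twice the fraction of predictive samples below the true value. The helper name bias_continuous_sketch is hypothetical and not part of scoringutils; it assumes predictions is a matrix with one row per forecast and one column per sample.

```r
# Hypothetical helper (not the scoringutils implementation).
# predictions: numeric matrix, one row per forecast, one column per sample.
bias_continuous_sketch <- function(true_values, predictions) {
  # Fraction of samples in each row that are smaller than the true value;
  # the vector of true values is recycled down the rows of the matrix.
  p_x <- rowMeans(predictions < true_values)
  1 - 2 * p_x
}

# Two forecasts for a true value of 0, four samples each
predictions <- matrix(c(-1, 1, 2, 3,
                         1, 2, 3, 4), nrow = 2, byrow = TRUE)
bias_continuous_sketch(c(0, 0), predictions)
```

A positive value means most samples lie above the true value (over-prediction), a negative value means most lie below it.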

## integer valued forecasts
true_values <- rpois(30, lambda = 1:30)
predictions <- replicate(200, rpois(n = 30, lambda = 1:30))
bias_sample(true_values, predictions)
#>  [1] -0.150  0.470 -0.475  0.410 -0.900  0.325  0.290 -0.260 -0.465 -0.085
#> [11]  0.300  0.705  0.735 -0.230  0.605  0.800  0.470 -0.435 -0.400  0.845
#> [21] -0.030  0.755 -0.305  0.500 -0.020 -0.185 -0.435  0.145  0.265 -0.090

## continuous forecasts
true_values <- rnorm(30, mean = 1:30)
predictions <- replicate(200, rnorm(30, mean = 1:30))
bias_sample(true_values, predictions)
#>  [1]  1.00  0.32  0.12  0.21  0.62 -0.14 -0.95 -0.45 -0.02 -0.27  0.07 -0.72
#> [13]  0.40 -0.13  0.47  0.37 -0.92  0.88  0.81  0.81 -0.20  0.99  0.95 -0.59
#> [25]  0.50 -0.64 -0.02  0.34 -0.25 -0.56

Sharpness

Sharpness is the ability of the model to generate predictions within a narrow range. It is a data-independent measure, and is purely a feature of the forecasts themselves.

The sharpness (dispersion) of the predictive samples for a single true value is measured as the normalised median absolute deviation about the median of those samples. For details, see ?stats::mad.
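The definition can be spelled out in a few lines of base R. This is only a sketch: mad_sketch is a hypothetical name, and the constant 1.4826 is the default normalisation used by stats::mad.

```r
# Hypothetical helper: row-wise normalised median absolute deviation.
# predictions: numeric matrix, one row per forecast, one column per sample.
mad_sketch <- function(predictions) {
  apply(predictions, 1, function(x) {
    # 1.4826 is the default scaling constant of stats::mad()
    1.4826 * median(abs(x - median(x)))
  })
}

predictions <- matrix(c(1, 2, 3, 4, 5,
                        2, 2, 2, 2, 10), nrow = 2, byrow = TRUE)
mad_sketch(predictions)
```

Note that the true values do not appear anywhere in the calculation, which is what makes the measure data-independent.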

predictions <- replicate(200, rpois(n = 30, lambda = 1:30))
mad_sample(predictions)
#>  [1] 1.4826 1.4826 1.4826 1.4826 2.2239 2.9652 2.2239 2.9652 2.9652 2.9652
#> [11] 2.9652 4.4478 4.4478 4.4478 4.4478 2.9652 4.4478 4.4478 4.4478 2.9652
#> [21] 4.4478 5.9304 4.4478 4.4478 4.4478 4.4478 5.9304 4.4478 4.4478 6.6717

Calibration

Calibration or reliability of forecasts is the ability of a model to correctly identify its own uncertainty in making predictions. In a model with perfect calibration, the observed data at each time point look as if they came from the predictive probability distribution at that time.

Equivalently, one can inspect the probability integral transform of the predictive distribution at time t,

\[u_t = F_t (x_t)\]

where \(x_t\) is the observed data point at time \(t \in \{t_1, \ldots, t_n\}\), \(n\) being the number of forecasts, and \(F_t\) is the (continuous) predictive cumulative probability distribution at time \(t\). If the true probability distribution of outcomes at time \(t\) is \(G_t\), then the forecasts \(F_t\) are said to be ideal if \(F_t = G_t\) at all times \(t\). In that case, the probabilities \(u_t\) are distributed uniformly.
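For sample-based forecasts, \(F_t\) is not available in closed form, but it can be approximated by the empirical distribution of the samples. The sketch below assumes a matrix of predictive samples with one row per time point; pit_continuous_sketch is a hypothetical helper name.

```r
# Hypothetical helper: empirical PIT value for each forecast.
# predictions: numeric matrix, one row per time point, one column per sample.
pit_continuous_sketch <- function(true_values, predictions) {
  # Empirical CDF of each row's samples, evaluated at that row's observation
  rowMeans(predictions <= true_values)
}

set.seed(1)
true_values <- rnorm(30, mean = 1:30)
predictions <- replicate(200, rnorm(30, mean = 1:30))
u <- pit_continuous_sketch(true_values, predictions)
range(u)
```

Because the observations and the samples here come from the same distribution, the values in u should look roughly uniform on the unit interval.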

In the case of discrete outcomes such as incidence counts, the PIT is no longer uniform even when forecasts are ideal. In that case a randomised PIT can be used instead:

\[u_t = P_t(k_t - 1) + v \cdot (P_t(k_t) - P_t(k_t - 1) )\]

where \(k_t\) is the observed count, \(P_t(k)\) is the predictive cumulative probability of observing an incidence of at most \(k\) at time \(t\), \(P_t (-1) = 0\) by definition and \(v\) is standard uniform and independent of \(k_t\). If \(P_t\) is the true cumulative probability distribution, then \(u_t\) is standard uniform.
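A minimal sketch of a randomised PIT for sample-based integer forecasts is given below. It draws \(u_t\) uniformly between \(P_t(k_t - 1)\) and \(P_t(k_t)\), with both probabilities estimated from the samples. The helper name and the argument v (which allows fixing the uniform draw) are assumptions made for illustration.

```r
# Hypothetical helper: randomised PIT for integer-valued sample forecasts.
# predictions: integer matrix, one row per time point, one column per sample.
# v: uniform random numbers, one per time point (exposed so it can be fixed).
randomised_pit_sketch <- function(true_values, predictions,
                                  v = runif(length(true_values))) {
  p_k   <- rowMeans(predictions <= true_values)      # estimate of P_t(k_t)
  p_km1 <- rowMeans(predictions <= true_values - 1)  # estimate of P_t(k_t - 1)
  # Uniform draw between P_t(k_t - 1) and P_t(k_t)
  p_km1 + v * (p_k - p_km1)
}

set.seed(1)
true_values <- rpois(30, lambda = 1:30)
predictions <- replicate(200, rpois(n = 30, lambda = 1:30))
u <- randomised_pit_sketch(true_values, predictions)
```

Since no count can fall below zero, the estimate of \(P_t(-1)\) is automatically 0, in line with the definition above.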

The function checks whether integer or continuous forecasts were provided. It then applies the (randomised) probability integral transform and tests the values \(u_t\) for uniformity using the Anderson-Darling test.

As a rule of thumb, there is no evidence that a forecasting model is miscalibrated if the p-value is \(p \geq 0.1\), some evidence of miscalibration if \(0.01 < p < 0.1\), and good evidence of miscalibration if \(p \leq 0.01\). Note, though, that uniformity of the PIT is a necessary but not a sufficient condition for calibration. Note also that the test only works given sufficient samples; otherwise the null hypothesis will often be rejected outright.
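The rule of thumb can be written down as a small helper. In the sketch below, pit_evidence_sketch is a hypothetical name, and the p-value comes from a Kolmogorov-Smirnov test (stats::ks.test), which is swapped in here only because it ships with base R; it is not the Anderson-Darling test described above.

```r
# Hypothetical helper: translate a uniformity-test p-value into the
# rule-of-thumb categories described in the text.
pit_evidence_sketch <- function(p_value) {
  if (p_value >= 0.1) {
    "no evidence of miscalibration"
  } else if (p_value > 0.01) {
    "some evidence of miscalibration"
  } else {
    "good evidence of miscalibration"
  }
}

set.seed(1)
u <- runif(100)  # stand-in for PIT values of a well calibrated model
# Kolmogorov-Smirnov test as a base-R stand-in for the Anderson-Darling test
p_value <- stats::ks.test(u, "punif")$p.value
pit_evidence_sketch(p_value)
```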

Continuous Ranked Probability Score (CRPS)

Wrapper around the crps_sample() function from the scoringRules package. For more information look at the manuals from the scoringRules package. The function can be used for continuous as well as integer valued forecasts. Smaller values are better.
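To give an idea of what a sample-based CRPS looks like, the sketch below implements one common estimator based on the identity \(\text{CRPS} = E|X - y| - \frac{1}{2} E|X - X'|\), with both expectations replaced by averages over the samples. The name crps_edf_sketch is hypothetical, and the estimator used internally by scoringRules may differ in detail.

```r
# Hypothetical helper: empirical CRPS for a single observation.
# y: the observed value; samples: numeric vector of predictive samples.
crps_edf_sketch <- function(y, samples) {
  # Average distance between the samples and the observation
  term_obs <- mean(abs(samples - y))
  # Average distance between all pairs of samples (including i == j)
  term_spread <- mean(abs(outer(samples, samples, "-")))
  term_obs - 0.5 * term_spread
}

crps_edf_sketch(y = 1, samples = c(0, 2))
```

The first term rewards forecasts close to the observation, while the second term ensures that spread is only penalised when it is not needed to cover the observation.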

true_values <- rpois(30, lambda = 1:30)
predictions <- replicate(200, rpois(n = 30, lambda = 1:30))
crps_sample(true_values, predictions)
#>  [1] 0.476875 0.372775 1.034975 1.823100 0.765150 2.802975 1.424225 1.390800
#>  [9] 1.114350 2.339300 2.048150 0.829075 0.989300 1.755150 3.959275 1.113175
#> [17] 1.747650 5.084725 1.747275 1.825600 2.956050 1.347325 2.541700 1.410350
#> [25] 3.252025 4.293650 1.200000 3.094325 1.637475 1.719850

Dawid-Sebastiani Score (DSS)

Wrapper around the dss_sample() function from the scoringRules package. For more information look at the manuals from the scoringRules package. The function can be used for continuous as well as integer valued forecasts. Smaller values are better.
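The Dawid-Sebastiani score depends on the forecast only through its mean \(\mu\) and standard deviation \(\sigma\): \(\text{DSS} = \left(\frac{y - \mu}{\sigma}\right)^2 + 2 \log \sigma\). The sketch below plugs in the sample mean and sample standard deviation; dss_sketch is a hypothetical name and the variance estimator is an assumption that may not match scoringRules exactly.

```r
# Hypothetical helper: Dawid-Sebastiani score for a single observation.
# y: the observed value; samples: numeric vector of predictive samples.
dss_sketch <- function(y, samples) {
  mu <- mean(samples)
  sigma <- stats::sd(samples)  # sample standard deviation (n - 1 denominator)
  ((y - mu) / sigma)^2 + 2 * log(sigma)
}

dss_sketch(y = 1, samples = c(0, 2))
```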

true_values <- rpois(30, lambda = 1:30)
predictions <- replicate(200, rpois(n = 30, lambda = 1:30))
dss_sample(true_values, predictions)
#>  [1] 5.390323 1.490260 2.694861 1.977603 1.809438 2.610020 3.216831 3.259631
#>  [9] 4.005781 2.629087 2.744518 2.492800 3.050290 3.580582 2.596429 3.443596
#> [17] 2.877230 3.764766 3.046372 3.467957 3.647532 4.607031 3.452120 3.077591
#> [25] 3.850796 3.490482 3.900866 4.237623 3.396307 3.651876

Log Score

Wrapper around the logs_sample() function from the scoringRules package. For more information look at the manuals from the scoringRules package. The function should not be used for integer valued forecasts. While Log Scores are in principle possible for integer valued forecasts, they require a kernel density estimate, which is not well defined for discrete values. Smaller values are better.
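The role of the kernel density estimate can be seen in a small sketch: the predictive density at the observation is estimated from the samples, and the score is its negative logarithm. The helper name, the Gaussian kernel and the default bandwidth rule (stats::bw.nrd0) are all assumptions made for illustration and may differ from what scoringRules does.

```r
# Hypothetical helper: log score from a Gaussian kernel density estimate.
# y: the observed value; samples: numeric vector of predictive samples;
# bw: kernel bandwidth (the default rule is an assumption).
logs_kde_sketch <- function(y, samples, bw = stats::bw.nrd0(samples)) {
  # KDE at y: average of Gaussian kernels centred on each sample
  density_at_y <- mean(stats::dnorm(y, mean = samples, sd = bw))
  -log(density_at_y)
}

set.seed(1)
samples <- rnorm(200, mean = 5)
logs_kde_sketch(y = 5.2, samples = samples)
```

For integer valued samples, the estimated density depends heavily on the bandwidth, which is one way to see why the score is not recommended in that case.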

true_values <- rnorm(30, mean = 1:30)
predictions <- replicate(200, rnorm(n = 30, mean = 1:30))
logs_sample(true_values, predictions)
#>  [1] 0.9573384 1.1475535 1.7024900 1.0497181 2.1651228 0.9885190 0.9894431
#>  [8] 0.9973930 1.0407186 2.8570565 1.7821696 1.5504730 2.1092420 1.6287750
#> [15] 1.3066877 0.9842722 1.4316806 0.9921830 1.1226407 2.1558309 0.8443488
#> [22] 0.9987385 1.2022842 1.0951447 1.0339494 1.1971997 1.1002311 1.3276218
#> [29] 0.9297171 1.2927543

Brier Score

The Brier Score is a proper scoring rule that assesses the accuracy of probabilistic binary predictions. The outcomes can be either 0 or 1, and each prediction must be the probability that the true outcome will be 1.

The Brier Score is then computed as the mean squared error between the probabilistic prediction and the true outcome.

\[\text{Brier_Score} = \frac{1}{N} \sum_{t = 1}^{N} (\text{prediction}_t - \text{outcome}_t)^2\]
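In base R the calculation is a one-liner. The sketch below uses a hypothetical helper, brier_sketch, that returns the squared error of each prediction; averaging those values gives the mean in the formula above.

```r
# Hypothetical helper: squared error of each probabilistic binary prediction.
# true_values: vector of 0/1 outcomes; predictions: probabilities of outcome 1.
brier_sketch <- function(true_values, predictions) {
  stopifnot(
    all(true_values %in% c(0, 1)),
    all(predictions >= 0 & predictions <= 1)
  )
  (predictions - true_values)^2
}

squared_errors <- brier_sketch(true_values = c(1, 0), predictions = c(0.8, 0.3))
squared_errors        # per-observation scores
mean(squared_errors)  # mean over all observations
```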

true_values <- sample(c(0, 1), size = 30, replace = TRUE)
predictions <- runif(n = 30, min = 0, max = 1)

brier_score(true_values, predictions)
#>  [1] 0.8096126635 0.0449297386 0.9245047003 0.0305868085 0.0292332056
#>  [6] 0.1415321053 0.7831543972 0.0885676800 0.0150624390 0.1758647063
#> [11] 0.0241157044 0.8061056126 0.6109336407 0.2175827065 0.8359301891
#> [16] 0.0980825666 0.0004857228 0.0032267834 0.6902970477 0.7991360993
#> [21] 0.2134666736 0.0054909096 0.0086763781 0.5453461627 0.0567205146
#> [26] 0.0316702829 0.4707347976 0.4319744911 0.2765174978 0.8500592011

Interval Score

The Interval Score is a Proper Scoring Rule to score quantile predictions, following Gneiting and Raftery (2007). Smaller values are better.

The score is computed as

\[ \text{score} = (\text{upper} - \text{lower}) + \\ \frac{2}{\alpha} \cdot (\text{lower} - \text{true_value}) \cdot 1(\text{true_value} < \text{lower}) + \\ \frac{2}{\alpha} \cdot (\text{true_value} - \text{upper}) \cdot 1(\text{true_value} > \text{upper})\]

where \(1()\) is the indicator function and \(\alpha\) is the share of probability mass that lies outside the prediction interval, expressed as a decimal. To improve usability, the user provides the interval range in percentage terms, e.g. interval_range = 90 for a 90 percent prediction interval. The corresponding bounds are then the 5% and 95% quantiles, and \(\alpha = 0.1\). No specific distribution is assumed, but the interval has to be symmetric (i.e. you can’t use the 0.1 quantile as the lower bound and the 0.7 quantile as the upper bound). Setting weigh = TRUE weighs the score by \(\frac{\alpha}{2}\), so that the Interval Score converges to the CRPS as the number of quantiles increases.
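The formula translates directly into base R. The sketch below is not the package implementation; interval_score_sketch is a hypothetical name and the argument names simply mirror those used in the text.

```r
# Hypothetical helper: interval score following the formula in the text.
# interval_range is given in percent, e.g. 90 for a 90% prediction interval.
interval_score_sketch <- function(true_value, lower, upper, interval_range,
                                  weigh = FALSE) {
  alpha <- (100 - interval_range) / 100
  width <- upper - lower
  # Penalties apply only when the observation falls outside the interval
  under <- 2 / alpha * (lower - true_value) * (true_value < lower)
  over  <- 2 / alpha * (true_value - upper) * (true_value > upper)
  score <- width + under + over
  if (weigh) {
    score <- score * alpha / 2
  }
  score
}

# One observation inside the interval [0, 10], one above it, one below it
interval_score_sketch(c(5, 12, -1), lower = 0, upper = 10, interval_range = 90)
```

For the observation inside the interval the score is just the interval width; the other two observations incur an extra penalty proportional to how far they fall outside.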

true_values <- rnorm(30, mean = 1:30)
interval_range <- 90
alpha <- (100 - interval_range) / 100
lower <- qnorm(alpha / 2, rnorm(30, mean = 1:30))
upper <- qnorm((1 - alpha / 2), rnorm(30, mean = 1:30))

interval_score(
  true_values = true_values,
  lower = lower,
  upper = upper,
  interval_range = interval_range
)
#>  [1] 0.26510092 0.22679750 2.87025782 0.11820362 0.15672788 0.11394204
#>  [7] 0.22712963 0.23710624 0.17807795 0.11350379 1.23965665 0.13375215
#> [13] 0.39490675 0.28270055 0.09282983 0.31741923 0.19783311 0.14548430
#> [19] 0.26778352 0.20879105 0.15698975 0.21255938 0.17338694 0.21030246
#> [25] 1.28231869 0.15597653 0.21108737 1.28822490 0.10298824 0.16554946