# Performance metrics

#### 2022-12-16

When we train a model, we usually want to know how good the model is. Model performance is assessed using different metrics that quantify how well a model discriminates cases, stratifies groups or predicts values. familiar implements metrics that are typically used to assess the performance of categorical, regression and survival models.

# Categorical outcomes

Performance metrics for models with categorical outcomes, i.e. binomial and multinomial, are listed below.

Overview of performance metrics for categorical outcomes. Some contingency table-based metrics require averaging for multinomial outcomes, as their original definition only covers binomial problems with two classes.

| method                                   | tag                                          | averaging |
|:-----------------------------------------|:---------------------------------------------|:---------:|
| accuracy                                  | accuracy                                      |           |
| area under the receiver-operating curve   | auc, auc_roc                                  |           |
| balanced accuracy                         | bac, balanced_accuracy                        |           |
| balanced error rate                       | ber, balanced_error_rate                      |           |
| Brier score                               | brier                                         |           |
| Cohen’s kappa                             | kappa, cohen_kappa                            |           |
| f1 score                                  | f1_score                                      | ×         |
| false discovery rate                      | fdr, false_discovery_rate                     | ×         |
| informedness                              | informedness                                  | ×         |
| markedness                                | markedness                                    | ×         |
| Matthews’ correlation coefficient         | mcc, matthews_correlation_coefficient         |           |
| negative predictive value                 | npv                                           | ×         |
| positive predictive value                 | precision, ppv                                | ×         |
| recall                                    | recall, sensitivity, tpr, true_positive_rate  | ×         |
| specificity                               | specificity, tnr, true_negative_rate          | ×         |
| Youden’s J                                | youden_j, youden_index                        | ×         |

## Area under the receiver-operating curve

The area under the receiver-operating curve is quite literally that. It is the area under the curve created by plotting the true positive rate (sensitivity) against the false positive rate (1-specificity). TPR and FPR are derived from a contingency table, which is created by comparing predicted class probabilities against a threshold. The receiver-operating curve is created by iterating over threshold values. The AUC of a model that predicts perfectly is $$1.0$$, while $$0.5$$ indicates predictions that are no better than random.

The implementation in familiar does not use the ROC curve to compute the AUC. Instead, an algebraic equation by Hand and Till (2001) is used. For multinomial outcomes the AUC is computed for each pairwise comparison of outcome classes, and averaged (Hand and Till 2001).
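
As an illustration of this approach (a Python sketch, not familiar's actual implementation), the pairwise AUC $$\hat{A}(i|j)$$ can be computed from rank sums and then averaged over all unordered class pairs:

```python
import numpy as np
from itertools import combinations
from scipy.stats import rankdata


def _auc_pair(p_i, y, class_i, class_j):
    # A(i|j) from Hand and Till (2001): rank the samples of classes i and j
    # by the predicted probability of class i, then use the rank sum of the
    # class-i samples. Average ranks handle ties in the probabilities.
    mask = (y == class_i) | (y == class_j)
    ranks = rankdata(p_i[mask])
    is_i = (y[mask] == class_i)
    n_i, n_j = is_i.sum(), (~is_i).sum()
    return (ranks[is_i].sum() - n_i * (n_i + 1) / 2) / (n_i * n_j)


def auc_hand_till(prob, y, classes):
    # Multiclass AUC: average A(i, j) = (A(i|j) + A(j|i)) / 2 over all
    # unordered pairs of outcome classes. prob holds one column per class.
    pair_auc = [
        (_auc_pair(prob[:, i], y, classes[i], classes[j])
         + _auc_pair(prob[:, j], y, classes[j], classes[i])) / 2
        for i, j in combinations(range(len(classes)), 2)
    ]
    return float(np.mean(pair_auc))
```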

## Brier score

The Brier score (Brier 1950) is a measure of deviation of predicted probabilities from the ideal situation where the probability for class a is $$1.0$$ if the observed class is a and $$0.0$$ if it is not a. Hence, it can be viewed as a measure of calibration as well. A value of $$0.0$$ is optimal.

The implementation in familiar iterates over all outcome classes in a one-versus-all approach, as originally devised by Brier (1950). For binomial outcomes, the score is divided by 2, so that it falls in the $$[0.0, 1.0]$$ range.
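
A minimal Python sketch of this one-versus-all computation (illustrative only; prob is assumed to be a matrix of predicted class probabilities with one column per class):

```python
import numpy as np


def brier_score(prob, y, classes):
    # Mean squared deviation between the predicted probability matrix and
    # the one-hot encoding of the observed classes (Brier 1950).
    onehot = (np.asarray(y)[:, None] == np.asarray(classes)[None, :]).astype(float)
    score = np.mean(np.sum((prob - onehot) ** 2, axis=1))
    # For binomial outcomes, halve the score so it falls in [0.0, 1.0].
    if len(classes) == 2:
        score = score / 2.0
    return float(score)
```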

## Contingency table-based metrics

A contingency table or confusion matrix displays the observed and predicted classes. When dealing with two classes, e.g. a and b, one of the classes is usually termed ‘positive’ and the other ‘negative’. For example, let b be the ‘positive’ class. Then we can define the following four categories:

• True positive ($$TP$$): b is predicted and observed.
• True negative ($$TN$$): a is predicted and observed.
• False negative ($$FN$$): b was observed, but a was predicted.
• False positive ($$FP$$): a was observed, but b was predicted.

A contingency table contains the occurrence of each of the four cases. If a model is good, most samples will be either true positive or true negative. Models that are not as good may have larger numbers of false positives and/or false negatives.

Metrics based on the contingency table use two or more of the four categories to characterise model performance. The extension from two classes (binomial) to more (multinomial) is often not trivial. For many metrics, familiar uses a one-versus-all approach, in which each class is iteratively used as the ‘positive’ class, with the remaining classes grouped as ‘negative’. Three options can be used to obtain performance values for multinomial problems, with an implementation similar to that of scikit-learn:

• micro: The number of true positives, true negatives, false positives and false negatives are computed for each class iteration, and then summed over all classes. The score is calculated afterwards.

• macro: A score is computed for each class iteration, and then averaged.

• weighted: A score is computed for each class iteration, and then averaged with a weight corresponding to the number of samples with the observed ‘positive’ class, i.e. the prevalence.

By default, familiar uses macro averaging, but the averaging procedure may be selected by appending _micro, _macro or _weighted to the name of the metric. For example, recall_micro will compute the recall metric using micro averaging.

Averaging only applies to multinomial outcomes. No averaging is performed for binomial problems with two classes. In this case familiar will always consider the second class level to correspond to the ‘positive’ class.
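
To make the three schemes concrete, here is a minimal Python sketch of one-versus-all averaging. It is an illustration only, not familiar's implementation; the score_fn argument and function names are hypothetical.

```python
import numpy as np


def averaged_score(y_true, y_pred, classes, score_fn, average="macro"):
    # One-versus-all: each class takes a turn as the 'positive' class.
    counts = []
    for positive in classes:
        tp = np.sum((y_pred == positive) & (y_true == positive))
        tn = np.sum((y_pred != positive) & (y_true != positive))
        fp = np.sum((y_pred == positive) & (y_true != positive))
        fn = np.sum((y_pred != positive) & (y_true == positive))
        counts.append((tp, tn, fp, fn))

    if average == "micro":
        # Sum the contingency table counts over all classes, then score once.
        tp, tn, fp, fn = np.sum(counts, axis=0)
        return score_fn(tp, tn, fp, fn)

    scores = np.array([score_fn(*c) for c in counts])
    if average == "macro":
        # Unweighted mean of the per-class scores.
        return float(np.mean(scores))
    if average == "weighted":
        # Weight per-class scores by the prevalence of the 'positive' class.
        weights = np.array([np.sum(y_true == positive) for positive in classes])
        return float(np.sum(weights * scores) / np.sum(weights))
    raise ValueError(f"unknown averaging scheme: {average}")


# Example: recall (sensitivity) with macro averaging.
recall = lambda tp, tn, fp, fn: tp / (tp + fn)
# averaged_score(y_true, y_pred, classes, recall, average="macro")
```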

### Accuracy

Accuracy quantifies the fraction of correctly predicted classes: $$s_{acc}=(TP + TN) / (TP + TN + FP + FN)$$. The extension to more than 2 classes is trivial. No averaging is performed for the accuracy metric.

### Balanced accuracy

Accuracy is known to be sensitive to imbalances in the class distribution. A balanced accuracy was therefore defined (Brodersen et al. 2010), which is the averaged within-class true positive rate (also known as recall or sensitivity): $$s_{bac}=0.5 (TP / (TP + FN) + TN / (TN + FP))$$.

The extension to more than 2 classes involves summation of in-class true positive rate $$TP / (TP + FN)$$ for each positive class and subsequent division by the number of classes. No averaging is performed for balanced accuracy.

### Balanced error rate

The balanced error rate is closely related to balanced accuracy, i.e. instead of the in-class true positive rate, the in-class false negative rate is used: $$s_{ber}=0.5 (FN / (TP + FN) + FP / (TN + FP))$$.

The extension to more than 2 classes involves summation of in-class false negative rate $$FN / (TP + FN)$$ for each positive class and subsequent division by the number of classes. No averaging is performed for balanced error rate.

### F1 score

The F1 score is the harmonic mean of precision and sensitivity: $$s_{f1} = 2 \; TP / (2 \; TP + FP + FN)$$.

The metric is not invariant to class permutation. Averaging is therefore performed for multinomial outcomes.

### False discovery rate

The false discovery rate quantifies the proportion of false positives, i.e. Type I errors, among all predicted positives: $$s_{fdr} = FP / (TP + FP)$$.

The metric is not invariant to class permutation. Averaging is therefore performed for multinomial outcomes.

### Informedness

Informedness is a generalisation of Youden’s J statistic (Powers 2011). Informedness can be extended to multiple classes, and no averaging is therefore required.

For binomial problems, informedness and the Youden J statistic are the same.

### Cohen’s kappa

Cohen’s kappa coefficient is a measure of correspondence between the observed and predicted classes (Cohen 1960). It is invariant to class permutations, and no averaging is performed.

### Markedness

Markedness is related to the precision or positive predictive value (Powers 2011).

### Matthews’ correlation coefficient

Matthews’ correlation coefficient measures the correlation between observed and predicted classes (Matthews 1975).

An extension to multiple classes, i.e. multinomial outcomes, was devised by Gorodkin (2004).
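
Gorodkin’s generalisation can be computed directly from the confusion matrix. The sketch below is a Python illustration, not familiar's code:

```python
import numpy as np


def multiclass_mcc(conf):
    # Gorodkin (2004): correlation between observed and predicted classes,
    # computed from a K x K confusion matrix `conf` with observed classes
    # in rows and predicted classes in columns.
    conf = np.asarray(conf, dtype=float)
    n = conf.sum()                 # total number of samples
    trace = np.trace(conf)         # correctly classified samples
    t = conf.sum(axis=1)           # samples observed per class
    p = conf.sum(axis=0)           # samples predicted per class
    numerator = trace * n - np.dot(t, p)
    denominator = np.sqrt((n**2 - np.dot(p, p)) * (n**2 - np.dot(t, t)))
    return numerator / denominator if denominator > 0 else 0.0
```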

### Negative predictive value

The negative predictive value (NPV) is the fraction of predicted negative classes that were also observed to be negative: $$s_{npv} = TN / (TN + FN)$$.

The NPV is not invariant to class permutations. Averaging is performed for multinomial outcomes.

### Positive predictive value

The positive predictive value (PPV) is the fraction of predicted positive classes that were also observed to be positive: $$s_{ppv} = TP / (TP + FP)$$.

The PPV is also referred to as precision. The PPV is not invariant to class permutations, and averaging is performed for multinomial outcomes; micro-averaging effectively computes the accuracy.

### Recall

Recall, also known as sensitivity or true positive rate, is the fraction of observed positive classes that were also predicted to be positive: $$s_{recall} = TP / (TP + FN)$$.

Recall is not invariant to class permutations and averaging is performed for multinomial outcomes. Both micro and weighted averaging effectively compute the accuracy.

### Specificity

Specificity, also known as the true negative rate, is the fraction of observed negative classes that were also predicted to be negative: $$s_{spec} = TN / (TN + FP)$$.

Specificity is not invariant to class permutations and averaging is performed for multinomial outcomes.

### Youden’s J statistic

Youden’s J statistic (Youden 1950) is the sum of recall and specificity minus 1: $$s_{youden} = TP / (TP + FN) + TN / (TN + FP) - 1$$.

Youden’s J statistic is not invariant to class permutations and averaging is performed for multinomial outcomes.

For binomial problems, informedness and the Youden J statistic are the same.

# Regression outcomes

Performance metrics for models with regression outcomes, i.e. count and continuous, are listed below.

| method                       | tag                               |
|:-----------------------------|:----------------------------------|
| explained variance           | explained_variance                |
| mean absolute error          | mae, mean_absolute_error          |
| relative absolute error      | rae, relative_absolutive_error    |
| mean log absolute error      | mlae, mean_log_absolute_error     |
| mean squared error           | mse, mean_squared_error           |
| relative squared error       | rse, relative_squared_error       |
| mean squared log error       | msle, mean_squared_log_error      |
| median absolute error        | medae, median_absolute_error      |
| R2 score                     | r2_score, r_squared               |
| root mean square error       | rmse, root_mean_square_error      |
| root relative squared error  | rrse, root_relative_squared_error |
| root mean square log error   | rmsle, root_mean_square_log_error |

Each of the above metrics can be made more robust against rare outliers by appending _winsor or _trim as a suffix to the metric name. This respectively performs winsorising (clipping) and trimming (truncating) based on the absolute prediction error, prior to computing the metric. Winsorising clips the predicted values for the 5% of instances with the most extreme absolute errors, whereas trimming removes these instances. For example, winsorised and trimmed versions of the mean squared error metric are specified as mse_winsor and mse_trim, respectively.
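
One plausible reading of this procedure is sketched below in Python; the function name is hypothetical and the exact clipping rule used by familiar may differ:

```python
import numpy as np


def robust_errors(y, y_hat, mode="winsor", fraction=0.05):
    # Winsorise or trim the `fraction` of instances with the most extreme
    # absolute prediction errors, then return the adjusted errors.
    error = np.asarray(y) - np.asarray(y_hat)
    cutoff = np.quantile(np.abs(error), 1.0 - fraction)
    if mode == "winsor":
        # Clip: extreme errors are pulled back to the cutoff value.
        return np.clip(error, -cutoff, cutoff)
    if mode == "trim":
        # Truncate: instances with extreme errors are removed entirely.
        return error[np.abs(error) <= cutoff]
    raise ValueError(f"unknown mode: {mode}")


# Example: a winsorised mean squared error, cf. the mse_winsor tag.
# mse_winsor = np.mean(robust_errors(y, y_hat, mode="winsor") ** 2)
```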

Let $$y$$ be the set of observed values, $$\hat{y}$$ the corresponding predicted values, and $$\bar{y}$$ the mean of the observed values. The error is then $$\epsilon = y-\hat{y}$$.

## Explained variance

The explained variance is defined as $$1 - \text{Var}\left(\epsilon\right) / \text{Var}\left(y\right)$$. Because the variance is unaffected by a constant shift, this metric is not sensitive to differences in offset between observed and predicted values.

## Mean absolute error

The mean absolute error is defined as $$1/N \sum_i^N \left|\epsilon_i\right|$$, with $$N$$ the number of samples.

## Relative absolute error

The relative absolute error is defined as $$\sum_i^N \left|\epsilon_i\right|/ \sum_i^N \left|y_i - \bar{y}\right|$$.

## Mean log absolute error

The mean log absolute error is defined as $$1/N \sum_i^N \log(\left|\epsilon_i\right| + 1)$$.

## Mean squared error

The mean squared error is defined as $$1/N \sum_i^N \left(\epsilon_i \right)^2$$.

## Relative squared error

The relative squared error is defined as $$\sum_i^N \left(\epsilon_i\right)^2/ \sum_i^N \left(y_i - \bar{y}\right)^2$$.

## Mean squared log error

Mean squared log error is defined as $$1/N \sum_i^N \left(\log \left(y_i + 1\right) - \log\left(\hat{y}_i + 1\right)\right)^2$$. Note that this score only applies to observed and predicted values in the positive domain. It is not defined for negative values.

## Median absolute error

The median absolute error is the median of the absolute errors $$\left|\epsilon\right|$$.

## R2 score

The R2 score is defined as $$R^2 = 1 - \frac{\sum_i^N(\epsilon_i)^2}{\sum_i^N(y_i - \bar{y})^2}$$.

## Root mean square error

The root mean square error is defined as $$\sqrt{1/N \sum_i^N \left(\epsilon_i \right)^2}$$.

## Root relative squared error

The root relative squared error is defined as $$\sqrt{\sum_i^N \left(\epsilon_i\right)^2/ \sum_i^N \left(y_i - \bar{y}\right)^2}$$.

## Root mean square log error

The root mean square log error is defined as $$\sqrt{1/N \sum_i^N \left(\log \left(y_i + 1\right) - \log\left(\hat{y}_i + 1\right)\right)^2}$$. Note that this score only applies to observed and predicted values in the positive domain. It is not defined for negative values.
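
The definitions above translate almost directly into code. The following is an illustrative Python transcription of several of these metrics (not familiar's implementation):

```python
import numpy as np


def regression_scores(y, y_hat):
    # Direct transcription of the definitions above, with error e = y - y_hat.
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    e = y - y_hat
    return {
        "mae": np.mean(np.abs(e)),
        "mse": np.mean(e**2),
        "rmse": np.sqrt(np.mean(e**2)),
        "medae": np.median(np.abs(e)),
        "rse": np.sum(e**2) / np.sum((y - y.mean())**2),
        "r2_score": 1.0 - np.sum(e**2) / np.sum((y - y.mean())**2),
        # Only valid when observed and predicted values are non-negative.
        "msle": np.mean((np.log1p(y) - np.log1p(y_hat))**2),
    }
```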

# Survival outcomes

Performance metrics for models with survival outcomes, i.e. survival, are listed below.

| method            | tag                                                                     |
|:------------------|:------------------------------------------------------------------------|
| concordance index | concordance_index, c_index, concordance_index_harrell, c_index_harrell |

## Concordance index

The concordance index assesses ordering between observed and predicted values. Let $$T$$ be observed times, $$c$$ the censoring status ($$0$$: no observed event; $$1$$: event observed) and $$\hat{T}$$ predicted times. Concordance between all pairs of values is determined as follows (Pencina and D’Agostino 2004):

• Concordant: a pair is concordant if $$T_i < T_j$$ and $$\hat{T}_i < \hat{T}_j$$ (provided $$c_i=1$$), or if $$T_i > T_j$$ and $$\hat{T}_i > \hat{T}_j$$ (provided $$c_j=1$$).
• Discordant: a pair is discordant if $$T_i < T_j$$ and $$\hat{T}_i > \hat{T}_j$$ (provided $$c_i=1$$), or if $$T_i > T_j$$ and $$\hat{T}_i < \hat{T}_j$$ (provided $$c_j=1$$).
• Tied: a pair is tied if $$\hat{T}_i = \hat{T}_j$$, provided that $$c_i=c_j=1$$.
• Not comparable: otherwise. This occurs, for example, if sample $$i$$ was censored before an event was observed in sample $$j$$, or both samples were censored.

The concordance index is then computed as $$ci = \frac{n_{concord} + 0.5\, n_{tied}}{n_{concord} + n_{discord} + n_{tied}}$$.
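
These rules can be transcribed directly. The Python sketch below is illustrative only and assumes that larger predicted values correspond to longer predicted survival times:

```python
def concordance_index(time, event, pred):
    # Count concordant, discordant and tied pairs following the rules above.
    # `event` is the censoring status: 1 if an event was observed, 0 if not.
    concordant = discordant = tied = 0
    n = len(time)
    for i in range(n):
        for j in range(i + 1, n):
            if time[i] == time[j]:
                continue  # equal observed times: not covered by the rules
            # Let a be the sample with the shorter observed time.
            a, b = (i, j) if time[i] < time[j] else (j, i)
            if event[a] != 1:
                continue  # not comparable: the shorter time is censored
            if pred[a] == pred[b]:
                if event[b] == 1:
                    tied += 1  # tied predictions, both events observed
            elif pred[a] < pred[b]:
                concordant += 1
            else:
                discordant += 1
    return (concordant + 0.5 * tied) / (concordant + discordant + tied)
```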

# References

Brier, Glenn W. 1950. “Verification of Forecasts Expressed in Terms of Probability.” Mon. Weather Rev. 78 (1): 1–3.
Brodersen, K H, C S Ong, K E Stephan, and J M Buhmann. 2010. “The Balanced Accuracy and Its Posterior Distribution.” In 2010 20th International Conference on Pattern Recognition, 3121–24.
Cohen, Jacob. 1960. “A Coefficient of Agreement for Nominal Scales.” Educ. Psychol. Meas. 20 (1): 37–46.
Gorodkin, J. 2004. “Comparing Two k-Category Assignments by a k-Category Correlation Coefficient.” Comput. Biol. Chem. 28 (5-6): 367–74.
Hand, David J, and Robert J Till. 2001. “A Simple Generalisation of the Area Under the ROC Curve for Multiple Class Classification Problems.” Mach. Learn. 45 (2): 171–86.
Matthews, B W. 1975. “Comparison of the Predicted and Observed Secondary Structure of T4 Phage Lysozyme.” Biochim. Biophys. Acta 405 (2): 442–51.
Pencina, Michael J, and Ralph B D’Agostino. 2004. “Overall C as a Measure of Discrimination in Survival Analysis: Model Specific Population Value and Confidence Interval Estimation.” Stat. Med. 23 (13): 2109–23.
Powers, David Martin. 2011. “Evaluation: From Precision, Recall and f-Measure to ROC, Informedness, Markedness and Correlation.” International Journal of Machine Learning Technology 2 (1): 37–63.
Youden, W J. 1950. “Index for Rating Diagnostic Tests.” Cancer 3 (1): 32–35.