This vignette shows how to use the performance package to check the fit of a model, how to detect misspecification and how to improve your model. The basic workflow of the performance package can be summarized as follows:
In the following, we will demonstrate this workflow using a model with a count response variable. We will fit a Poisson regression model to the Salamanders dataset from the glmmTMB package. The dataset contains counts of salamanders in different sites, along with information on the number of mines and the species of salamanders. We will check the model fit and assess the model fit indices.
Problems that may arise with count response variables are zero inflation and overdispersion. Zero inflation occurs when there are more zeros in the data than expected under the Poisson distribution. Overdispersion occurs when the variance of the data is greater than the mean, which violates the assumption of equidispersion in the Poisson distribution.
We will check for these problems and suggest ways to improve the model fit, i.e. if necessary, we will fit another model that could potentially improve the fit. Finally, we will compare the model fit indices and perform statistical tests to determine which model is the best fit.
We start with a generalized mixed effects model, using a Poisson distribution.
library(performance)
model1 <- glmmTMB::glmmTMB(
count ~ mined + spp + (1 | site),
family = poisson,
data = glmmTMB::Salamanders
)
First, let us look at the summary of the model.
library(parameters)
model_parameters(model1)
#> # Fixed Effects
#>
#> Parameter | Log-Mean | SE | 95% CI | z | p
#> ---------------------------------------------------------------
#> (Intercept) | -1.62 | 0.24 | [-2.10, -1.15] | -6.76 | < .001
#> mined [no] | 2.26 | 0.28 | [ 1.72, 2.81] | 8.08 | < .001
#> spp [PR] | -1.39 | 0.22 | [-1.81, -0.96] | -6.44 | < .001
#> spp [DM] | 0.23 | 0.13 | [-0.02, 0.48] | 1.79 | 0.074
#> spp [EC-A] | -0.77 | 0.17 | [-1.11, -0.43] | -4.50 | < .001
#> spp [EC-L] | 0.62 | 0.12 | [ 0.39, 0.86] | 5.21 | < .001
#> spp [DES-L] | 0.68 | 0.12 | [ 0.45, 0.91] | 5.75 | < .001
#> spp [DF] | 0.08 | 0.13 | [-0.18, 0.34] | 0.60 | 0.549
#>
#> # Random Effects
#>
#> Parameter | Coefficient | 95% CI
#> -------------------------------------------------
#> SD (Intercept: site) | 0.58 | [0.38, 0.87]
#>
#> Uncertainty intervals (equal-tailed) and p-values (two-tailed) computed
#> using a Wald z-distribution approximation.
#>
#> The model has a log- or logit-link. Consider using `exponentiate =
#> TRUE` to interpret coefficients as ratios.
We see a lot of statistically significant estimates here. No matter, which philosophy you follow in terms of interpreting statistical test results, our conclusions we draw from our regression models will be inaccurate if our modeling assumptions are a poor fit for the situation. Hence, checking model fit is essential.
In performance, we can conduct a comprehensive visual
inspection of our model fit using check_model()
. We won’t
go into details of all the plots here, but you can find more information
on all created diagnostic plots in the dedicated
vignette.
For now, we want to focus on the posterior predictive checks, dispersion and zero-inflation as well as the Q-Q plot (uniformity of residuals).
Note that unlike plot()
, which is a base R function to
create diagnostic plots, check_model()
relies on
simulated residuals for the Q-Q plot, which is more accurate
for non-Gaussian models. See this
vignette and the documentation of simulate_residuals()
for further details.
The above plot suggests that we may have issues with overdispersion
and/or zero-inflation. We can check for these problems using
check_overdispersion()
and
check_zeroinflation()
, which will perform statistical tests
(based on simulated residuals). These tests can additionally be used
beyond the visual inspection.
check_overdispersion(model1)
#> # Overdispersion test
#>
#> dispersion ratio = 2.324
#> Pearson's Chi-Squared = 1475.875
#> p-value = < 0.001
#> Overdispersion detected.
check_zeroinflation(model1)
#> # Check for zero-inflation
#>
#> Observed zeros: 387
#> Predicted zeros: 311
#> Ratio: 0.80
#> Model is underfitting zeros (probable zero-inflation).
As we can see, our model seems to suffer both from overdispersion and zero-inflation.
We can try to improve the model fit by fitting a model with zero-inflation component:
model2 <- glmmTMB::glmmTMB(
count ~ mined + spp + (1 | site),
ziformula = ~ mined + spp,
family = poisson,
data = glmmTMB::Salamanders
)
check_model(model2)
#> `check_outliers()` does not yet support models of class `glmmTMB`.
Looking at the above plots, the zero-inflation seems to be addressed
properly (see especially posterior predictive checks and
uniformity of residuals, the Q-Q plot). However, the
overdispersion still could be present. We can check for these problems
using check_overdispersion()
and
check_zeroinflation()
again.
check_overdispersion(model2)
#> # Overdispersion test
#>
#> dispersion ratio = 1.679
#> p-value = 0.008
#> Overdispersion detected.
check_zeroinflation(model2)
#> # Check for zero-inflation
#>
#> Observed zeros: 387
#> Predicted zeros: 387
#> Ratio: 1.00
#> Model seems ok, ratio of observed and predicted zeros is within the
#> tolerance range (p > .999).
Indeed, the overdispersion is still present.
We can try to address this issue by fitting a negative binomial model instead of using a Poisson distribution.