discnorm: Detecting and adjusting for underlying non-normality in ordinal datasets

Njål Foldnes and Steffen Grønneberg

2022-05-25

library(discnorm)
library(lavaan)

The discnorm package uses bootstrapping to help determine whether the routinely made assumption of underlying normality is tenable for an ordinal dataset. Researchers who want to proceed with ordinal SEM based on polychoric correlations should first check that the normal copula assumption is not violated. Also, when underlying normal marginals are not tenable, researchers may specify other marginal distributions and obtain adjusted polychoric correlations using catLSadj().

Example of bootTest()

The test is implemented in bootTest(), which takes an ordinal dataset and returns a p-value associated with the null hypothesis of underlying normality. Let us first apply the test to a dataset produced by discretizing a multivariate normal vector.

# let us discretize an underlying normal vector
# with moderate correlation
rho <- 0.3
Sigma <- diag(5)
Sigma[Sigma != 1] <- rho
set.seed(1234)
norm.sample <- MASS::mvrnorm(n = 200, mu = rep(0, 5), Sigma = Sigma)
# discretize each variable into 4 categories
disc.sample <- apply(norm.sample, 2, cut, breaks = c(-Inf, -1, 1, 2, Inf), labels = FALSE)

#check for underlying normality
pvalue <- bootTest(disc.sample, B=500)
#> Progress 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
print(pvalue)
#> [1] 0.488
# we have no evidence against the null hypothesis of underlying normality

Next, let us discretize a non-normal dataset, where the fifth variable is the product of the first two:

nonnorm.sample <- data.frame(norm.sample[, 1:4], norm.sample[,1]*norm.sample[,2])
disc.sample2 <- apply(nonnorm.sample, 2, cut, breaks=c(-Inf, -1, 1, 2, Inf), labels=FALSE)
pvalue <- bootTest(disc.sample2, B=500)
#> Progress 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
print(pvalue)
#> [1] 0
# rejected!
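Rejection is expected here: the fifth column is the product of two correlated normal variables, so the underlying joint distribution is not multivariate normal. As a quick illustration (not part of the package), we can test the continuous product variable directly:

# the product of two correlated standard normals is heavy-tailed and skewed
prod.var <- norm.sample[, 1] * norm.sample[, 2]
shapiro.test(prod.var) # normality should be clearly rejected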

The procedure is fully described in Foldnes and Grønneberg (2019).

Example of adjusted polychoric correlation: catLSadj()

First, we generate a large dataset with non-normal marginals by transforming the marginals of a normal dataset:

shape <- 2
scale <- 1/sqrt(shape)
# marginal 1: chi-square with 1 degree of freedom (CDF F, quantile function qF, standard deviation sd)
m1 <- list(F = function(x) pchisq(x, df = 1), qF = function(x) qchisq(x, df = 1), sd = sqrt(2))
# marginal 2: a gamma distribution shifted to have mean zero
G3 <- function(x) pgamma(x + shape * scale, shape = shape, scale = scale)
# marginal 3: the mirror image of marginal 2, i.e., left-skewed
G3flip <- function(x) 1 - G3(-x)
# the corresponding quantile functions
qG3 <- function(x) qgamma(x, shape = shape, scale = scale) - shape * scale
qG3flip <- function(x) -qG3(1 - x)
marginslist <- list(m1, list(F = G3, qF = qG3), list(F = G3flip, qF = qG3flip))
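Each element of marginslist supplies a marginal CDF F and its quantile function qF (and optionally the standard deviation sd, which catLSadj() otherwise approximates, as the verbose output below indicates). As a quick sanity check, added for illustration, each quantile function should invert its CDF, and the shifted gamma should be centered at zero:

u <- c(0.1, 0.5, 0.9)
all.equal(G3(qG3(u)), u)          # should be TRUE
all.equal(G3flip(qG3flip(u)), u)  # should be TRUE
mean(qG3(runif(10^5)))            # approximately 0: the gamma is shifted by its mean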
                    
Sigma <- diag(3)
Sigma[Sigma==0] <- 0.5
Sigma
#>      [,1] [,2] [,3]
#> [1,]  1.0  0.5  0.5
#> [2,]  0.5  1.0  0.5
#> [3,]  0.5  0.5  1.0

set.seed(1)
norm.data <- MASS::mvrnorm(10^5, rep(0,3), Sigma)
colnames(norm.data) <- c("x1", "x2", "x3")
#With normal marginals, the correlation matrix is (approximately)
#Sigma.

#Transform the marginals to follow the elements in marginslist:
nonnorm.data <- data.frame(x1=marginslist[[1]]$qF(pnorm(norm.data[, 1])), 
                     x2=marginslist[[2]]$qF(pnorm(norm.data[, 2])),
                     x3=marginslist[[3]]$qF(pnorm(norm.data[, 3])))

cor(nonnorm.data)
#>           x1        x2        x3
#> x1 1.0000000 0.4424008 0.3552593
#> x2 0.4424008 1.0000000 0.4240402
#> x3 0.3552593 0.4240402 1.0000000
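The transformation has made the marginals markedly skewed, which we can verify directly (an illustrative check, not part of the original code):

# sample skewness of each marginal; a normal marginal would give roughly 0
skew <- function(x) mean((x - mean(x))^3) / sd(x)^3
apply(nonnorm.data, 2, skew) # x1 and x2 are right-skewed, x3 is left-skewed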

Next, we fit a one-factor model to both the normal and the non-normal dataset (the model fits both perfectly) and inspect the factor loadings:


head(standardizedsolution(cfa("F=~ x1+x2+x3", norm.data)),3)
#>   lhs op rhs est.std    se       z pvalue ci.lower ci.upper
#> 1   F =~  x1   0.713 0.002 287.400      0    0.708    0.717
#> 2   F =~  x2   0.708 0.002 285.130      0    0.703    0.713
#> 3   F =~  x3   0.707 0.002 284.592      0    0.702    0.712

head(standardizedsolution(cfa("F=~ x1+x2+x3", nonnorm.data)),3)
#>   lhs op rhs est.std    se       z pvalue ci.lower ci.upper
#> 1   F =~  x1   0.609 0.003 192.345      0    0.603    0.615
#> 2   F =~  x2   0.727 0.003 220.536      0    0.720    0.733
#> 3   F =~  x3   0.584 0.003 185.327      0    0.577    0.590

Then we discretize the non-normal dataset and confirm that the polychoric correlations are strongly biased:

disc.data <- data.frame(x1 = cut(nonnorm.data[, 1], breaks = c(-Inf, 0.1, 1, Inf), labels = FALSE),
                        x2 = cut(nonnorm.data[, 2], breaks = c(-Inf, -0.7, 0, 1, Inf), labels = FALSE),
                        x3 = cut(nonnorm.data[, 3], breaks = c(-Inf, -1, 0, 1, Inf), labels = FALSE))

lavaan::lavCor(disc.data, ordered=colnames(disc.data))
#>    x1    x2    x3   
#> x1 1.000            
#> x2 0.504 1.000      
#> x3 0.506 0.503 1.000
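The polychoric estimates of roughly 0.5 recover Sigma, the correlation matrix of the underlying normal copula, rather than the correlations of the non-normal data that were actually discretized. The bias can be displayed directly (an illustrative check):

# difference between the conventional polychorics and the correlations
# of the continuous non-normal data
lavaan::lavCor(disc.data, ordered = colnames(disc.data)) - cor(nonnorm.data)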

Next, we compute the adjusted polychoric correlations together with their asymptotic covariance matrix, and confirm that the adjusted correlations are close to the correlations of the original non-normal dataset:

adjusted <- catLSadj(disc.data, marginslist, verbose = TRUE)
#> Standard deviation approximated to be: 1 
#> Standard deviation approximated to be: 1 
#>    x1    x2    x3   
#> x1 1.000            
#> x2 0.504 1.000      
#> x3 0.506 0.503 1.000
#>    x1    x2    x3   
#> x1 1.000            
#> x2 0.441 1.000      
#> x3 0.357 0.427 1.000


adjusted[[1]]
#>    x1    x2    x3   
#> x1 1.000            
#> x2 0.441 1.000      
#> x3 0.357 0.427 1.000
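As an illustrative check, the adjustment has removed most of this bias:

# elementwise differences should now be close to zero
adjusted[[1]] - cor(nonnorm.data)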

Running conventional ordinal factor analysis leads to biased factor loadings:

head(standardizedsolution(fcat <- cfa("F=~ x1+x2+x3", disc.data, ordered=colnames(disc.data))),3)
#>   lhs op rhs est.std    se       z pvalue ci.lower ci.upper
#> 1   F =~  x1   0.712 0.003 223.534      0    0.706    0.719
#> 2   F =~  x2   0.708 0.003 227.355      0    0.702    0.714
#> 3   F =~  x3   0.710 0.003 225.393      0    0.704    0.716

These parameter estimates are close to the parameters of the continuous model fitted to the normal data, and not to the parameters obtained from the continuous non-normal dataset. To get consistent estimates of the latter parameters, we need to use the adjusted polychoric correlations.


# extract the estimated thresholds (and their indices) from the conventional fit
sample.th <- lavInspect(fcat, "sampstat")$th
attr(sample.th, "th.idx") <- lavInspect(fcat, "th.idx")
# the asymptotic covariance matrix of the adjusted polychorics:
gamma.adj <- adjusted[[2]]
# diagonal weight matrix, as in diagonally weighted least squares
WLS.V.new <- diag(1/diag(gamma.adj))

# refit the factor model, replacing the conventional polychorics with the adjusted ones
fcat.adj <- cfa("F=~ x1+x2+x3", sample.cov = adjusted[[1]],
                sample.nobs = nrow(disc.data), sample.th = sample.th,
                NACOV = gamma.adj, WLS.V = WLS.V.new)
head(standardizedsolution(fcat.adj), 3)
#>   lhs op rhs est.std    se       z pvalue ci.lower ci.upper
#> 1   F =~  x1   0.608 0.003 225.770      0    0.602    0.613
#> 2   F =~  x2   0.726 0.003 225.753      0    0.720    0.732
#> 3   F =~  x3   0.588 0.002 237.061      0    0.583    0.592

These estimates closely match the model parameters obtained from the continuous non-normal dataset.
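For convenience, the two sets of loadings can be placed side by side, reusing the fitted objects from above (a sketch added for illustration; fcont refits the continuous non-normal model):

fcont <- cfa("F=~ x1+x2+x3", nonnorm.data)
loadings.compare <- cbind(continuous = standardizedsolution(fcont)$est.std[1:3],
                          adjusted = standardizedsolution(fcat.adj)$est.std[1:3])
rownames(loadings.compare) <- c("x1", "x2", "x3")
loadings.compare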

The procedure is fully described in Grønneberg and Foldnes (2022).

References

Foldnes, N., and S. Grønneberg. 2019. “Pernicious Polychorics: The Impact and Detection of Underlying Non-Normality.” Structural Equation Modeling: A Multidisciplinary Journal.
Grønneberg, S., and N. Foldnes. 2022. “Factor Analyzing Ordinal Items Requires Substantive Knowledge of Response Marginals.” Psychological Methods.