Multivariate Sex Differences in Personality with Regularized Regression in multid -package

Ville-Juhani Ilmarinen

2024-02-15

This vignette outlines the procedures used to estimate multivariate sex differences in personality traits using regularized logistic regression and the gender diagnosticity approach with the multid package.

This approach can be used to measure multivariate differences between any two groups, but in this example, we examine differences between females and males in their “answers to the Big Five Personality Test, constructed with items from the International Personality Item Pool” (which can be obtained from https://openpsychometrics.org/).

The regularized method enables the examination of group-membership probabilities and their distributions across individuals. In the context of statistical predictions of sex, these distributions are an updated version of the gender-typicality distributions used in the gender diagnosticity methodology (Lippa & Connelly, 1990).

For more information and empirical examples of how multivariate sex difference estimation and gender diagnosticity methodology can be integrated, please see Ilmarinen, Vainikainen, and Lönnqvist (2022) and Lönnqvist and Ilmarinen (2021).

Preparations

Packages

You can install the released version of multid from CRAN with:

install.packages("multid")

You can install the development version from GitHub with:

install.packages("devtools")
devtools::install_github("vjilmari/multid")

Load multid and other packages needed for the vignette

library(rio)
library(multid)
library(dplyr)
library(overlapping)

Data

This vignette uses openly available data from the “Big Five Personality Test,” which is constructed with items from the International Personality Item Pool. The dataset can be downloaded directly from https://openpsychometrics.org (by approach presented here https://stackoverflow.com/a/66351588/8995170), and there is no need to download it to your drive in order to reproduce the analyses described in this vignette.

# create a temporary directory
td <- tempdir()

# create a temporary file
tf <- tempfile(tmpdir=td, fileext=".zip")

# download file from internet into temporary location
download.file("http://openpsychometrics.org/_rawdata/BIG5.zip", tf)

# list zip archive
file_names <- unzip(tf, list=TRUE)

# extract files from zip file
unzip(tf, exdir=td, overwrite=TRUE)

# use when zip file has only one file
dat <- import(file.path(td, "BIG5/data.csv"))

# delete the files and directories
unlink(td)

Variable transformations

# save names of personality items to a vector
per.items<-names(dat)[which(names(dat)=="E1"):
                        which(names(dat)=="O10")]

# code item response 0 (zero) to not available (NA) on all these items
dat[,per.items]<-
  sapply(dat[,per.items],function(x){ifelse(x==0,NA,x)})

# check that the numerical range makes sense
range(dat[,per.items],na.rm=T)
#> [1] 1 5
# calculate sum scores for Big Five (reverse coded items subtracted from 6)
# see codebook from https://openpsychometrics.org for item content and keys

dat$N<-rowMeans(
  cbind(dat[,c("N1","N3","N5","N6","N7","N8","N9","N10")],
        6-dat[,c("N2","N4")]),na.rm=T)

dat$E<-rowMeans(
  cbind(dat[,c("E1","E3","E5","E7","E9")],
        6-dat[,c("E2","E4","E6","E8","E10")]),na.rm=T)

dat$O<-rowMeans(
  cbind(dat[,c("O1","O3","O5","O7","O8","O9","O10")],
        6-dat[,c("O2","O4","O6")]),na.rm=T)

dat$A<-rowMeans(
  cbind(dat[,c("A2","A4","A6","A8","A9","A10")],
        6-dat[,c("A1","A3","A5","A7")]),na.rm=T)

dat$C<-rowMeans(
  cbind(dat[,c("C1","C3","C5","C7","C9","C10")],
        6-dat[,c("C2","C4","C6","C8")]),na.rm=T)

For the purposes of this example, we will only analyze responses from the United States to examine sex differences. Additionally, we will only include participants who identified as male (gender=1) or female (gender=2), and who did not have any missing responses.

US.dat<-
  na.omit(dat[dat$country=="US" & (dat$gender==1 | dat$gender==2),
              c("gender",per.items,"N","E","O","A","C")])

# code categorical gender variables
US.dat$gender.cat<-
  ifelse(US.dat$gender==1,"Male","Female")

Sex differences

This section demonstrates several different ways to estimate sex differences in personality. We begin with typical univariate differences in the broad Big Five domains, followed by multivariate differences in both domains and using personality items as a multivariate set. We will pay special attention to regularized methods because this approach enables us to examine both sex differences and the underlying distributions that give rise to these mean differences.

Univariate differences in trait domain sum scores using the d_pooled_sd -function

The d_pooled_sd function is used to calculate Cohen’s d with a pooled standard deviation across the male and female groups for each Big Five trait separately. The function takes several arguments, including the analyzed Big Five trait (var), sex as the binary grouping variable (group.var), the values that represent these groups (group.values), and a logical value indicating whether statistical inference is desired (infer). All arguments except for statistical inference are also included in the multivariate function D_regularized, which is presented further below.

d.N<-d_pooled_sd(data = US.dat,var = "N",
            group.var = "gender.cat",group.values = c("Male","Female"),
            infer=T)

d.E<-d_pooled_sd(data = US.dat,var = "E",
            group.var = "gender.cat",group.values = c("Male","Female"),
            infer=T)

d.O<-d_pooled_sd(data = US.dat,var = "O",
            group.var = "gender.cat",group.values = c("Male","Female"),
            infer=T)

d.A<-d_pooled_sd(data = US.dat,var = "A",
            group.var = "gender.cat",group.values = c("Male","Female"),
            infer=T)

d.C<-d_pooled_sd(data = US.dat,var = "C",
            group.var = "gender.cat",group.values = c("Male","Female"),
            infer=T)

# compile to a table

uni.d<-rbind(d.N,d.E,d.O,d.A,d.C)
rownames(uni.d)<-c("N","E","O","A","C")

round(uni.d,2)
#>   n.Male n.Female m.Male m.Female sd.Male sd.Female pooled.sd  diff     D
#> N   2945     5739   2.79     3.16    0.88      0.86      0.87 -0.37 -0.42
#> E   2945     5739   3.00     3.06    0.96      0.96      0.96 -0.07 -0.07
#> O   2945     5739   4.07     3.88    0.58      0.63      0.62  0.18  0.30
#> A   2945     5739   3.68     4.03    0.75      0.68      0.71 -0.35 -0.50
#> C   2945     5739   3.38     3.44    0.72      0.75      0.74 -0.07 -0.09
#>   t_stat      df p
#> N -18.45 5845.38 0
#> E  -3.09 5919.79 0
#> O  13.44 6458.37 0
#> A -21.33 5455.81 0
#> C  -3.95 6119.69 0

The column labeled “D” indicates Cohen’s d. Females score higher in Neuroticism (N; d = -0.42) and Agreeableness (A; d = -0.50), while males score higher in Openness (O; d = 0.30). Differences in Extraversion (E) and Conscientiousness (C) are statistically significant (p < .001), but their magnitudes are negligible (d < |.10|).

The average absolute difference across the Big Five traits is therefore d = 0.27. However, this average absolute difference does not provide information about the overall distance between males and females. For examining questions regarding the overall distance in a certain domain, such as personality, one should use multivariate methods

Other columns in the output are:

Multivariate difference using trait sum scores as variables

Although the univariate differences presented above estimate sex differences in each broad Big Five trait separately, these single Cohen’s d estimates do not provide a clear picture of the overall distance between males and females. To understand overall distances, multivariate methods that use multiple variables and account for their covariability (intercorrelations) should be used. These methods provide D values, which can be interpreted in the same way as Cohen’s d, that is, as standardized distances between males and females.

Mahalanobis’ D

Mahalanobis’ D is the standardized distance between the male and female group centroids in multivariate space. To calculate Mahalanobis’ D, you need the correlation matrix of the multivariate set and a vector of standardized mean differences. The ordering of the variables should be the same in both inputs.

# let's use the previously obtained univariate d-values
d_vector<-uni.d[,"D"]
# calculate intercorrelations
cor_mat<-cor(US.dat[,c("N","E","O","A","C")])
# test that the ordering of these variables matches
names(d_vector)==colnames(cor_mat)
#> [1] TRUE TRUE TRUE TRUE TRUE
# print the correlation matrix
round(cor_mat,2)
#>       N     E     O     A     C
#> N  1.00 -0.25 -0.10 -0.10 -0.26
#> E -0.25  1.00  0.15  0.31  0.10
#> O -0.10  0.15  1.00  0.10  0.05
#> A -0.10  0.31  0.10  1.00  0.17
#> C -0.26  0.10  0.05  0.17  1.00
# Calculate D
D_maha<-sqrt(t(d_vector) %*% solve(cor_mat) %*% d_vector)
D_maha
#>           [,1]
#> [1,] 0.7681502

Mahalanobis’ D is, of course, larger than any of the univariate d’s for single variables, here the standardized distance between the group centroids of males and females in the multivariate space is D = 0.77

Regularized D with D_regularized -function

In the D_regularized function, a predictive approach is used to calculate the regularized variant of D. This approach involves predicting the binary sex variable using a multivariate set of variables with regularized logistic regression. The predicted values for each individual represent the logits of probabilities of being male or female, and are hereafter referred to as femininity-masculinity scores (FM).

FM scores can be used in the same way as any other variable to calculate standardized distances between the mean FM scores of males and females with pooled standard deviations. In addition to estimating mean differences, there are other informative features of the multivariate/FM distributions.

The D_regularized function uses separate partitions of the original data for regulation (i.e., training) and prediction (i.e., testing). During regulation, the beta-coefficients for variables in the multivariate set are penalized through a specified function, such as a elastic net penalty with 10-fold cross-validation (default). The resulting coefficient weights are then used in the prediction part of the data as measures of FM for each individual. The arguments out (a logical value indicating whether predictions are made out-of-bag) and size (the size of the training partition of the data, equal for males and females) are used to specify that a partition is required. The multivariate set is supplied with the mv.vars argument.

It is also possible to request pcc (probability of correctly classified), auc (area under the receiver operating characteristics), and frequencies indicating the proportions of the sample that fall within certain cutoffs of predicted probabilities from 0 to 1 for each group (pred.prob) by using these arguments within D_regularized call.

# set seed number for reproducibility
set.seed(37127)

# decide the size of the regularization (training data fold)
# here, half of the smaller of the male and female groups is used
size=round(min(table(US.dat$gender.cat))/2)

# run D_regularized
D_out<-
  D_regularized(data=US.dat,
                mv.vars=c("N","E","O","A","C"),
                group.var="gender.cat",
                group.values = c("Male","Female"),
                out = T,
                size=size,
                pcc = T,
                auc = T,
                pred.prob = T)

# print the output
round(D_out$D,2)
#>      n.Male n.Female m.Male m.Female sd.Male sd.Female pooled.sd diff    D
#> [1,]   1473     4267   0.34     -0.3    0.86      0.79      0.81 0.64 0.79
#>      pcc.Male pcc.Female pcc.total  auc
#> [1,]     0.63       0.67      0.66 0.71

The output shows that the sample sizes are smaller compared to the univariate estimates because they only reflect the samples in the prediction/testing dataset. The standardized multivariate difference between males and females is estimated at D = 0.79, which is very similar in magnitude to the multivariate sex difference indexed by Mahalanobis’ D.

There are additional features in the output that may be of interest. Similar to univariate d estimates, descriptive statistics for both groups on the FM-score metric are provided, as well as an estimate of the pooled standard deviation used to calculate D. The probability of correctly classified individuals (pcc) and the area under the receiver operating characteristics (auc) are also included, which are commonly used indices of predictive performance. In this example, they indicate the performance of the broad Big Five domains in predicting sex. The predicted probabilities of being male across certain cutoffs (defaults were used in this example) for both groups can be obtained from the P.table part of the output.

round(100*D_out$P.table,1)
#>         
#>          [0,0.2) [0.2,0.4) [0.4,0.6) [0.6,0.8) [0.8,1]
#>   Female     7.2      39.0      37.0      14.1     2.7
#>   Male       1.2      18.1      37.1      32.6    11.0

The table shows that majority of males are nevertheless not very clearly classified as males based on the Big Five personality traits (11% in .80-1 range), and same is true for classifying females as females (7% in 0-0.20 range).

To examine the regularized beta weights obtained during the training phase, the coefficients function can be called on the cv.mod object of the output. The s argument specifies the value of the penalty parameter lambda used to calculate predictions. The default value of “lambda.min” is used to obtain the values where the mean cross-validated error is minimized, while “lambda.1se” uses the largest value of lambda such that the error is within one standard error of the minimum. It is also possible to plot the cross-validation curve to visualize how “lambda.min” and “lambda.1se” were obtained. For more information on regularized regression conducted with the glmnet package, please refer to https://glmnet.stanford.edu/articles/glmnet.html.

coefficients(D_out$cv.mod,
             s = "lambda.min")
#> 6 x 1 sparse Matrix of class "dgCMatrix"
#>                      s1
#> (Intercept)  3.37297381
#> N           -0.54865910
#> E           -0.02070834
#> O            0.64983681
#> A           -0.88854735
#> C           -0.23751055
coefficients(D_out$cv.mod,
             s = "lambda.1se")
#> 6 x 1 sparse Matrix of class "dgCMatrix"
#>                     s1
#> (Intercept)  2.3670214
#> N           -0.3900418
#> E            .        
#> O            0.4626928
#> A           -0.6930196
#> C           -0.1053781
plot(D_out$cv.mod)

Overlap

Differences and similarities are two sides of the same coin. In the case of sex differences, visual inspection of the distributions of FM-scores can reveal both. Specifically, by assessing the overlap between the distributions of FM-scores of males and females, we can observe differences between the two groups. However, we can also identify similarities by examining the degree of overlap between the distributions.

# Obtain the D-estimate
D<-unname(D_out$D[,"D"])

# Overlap relative to a single distribution (OVL)
OVL<-2*pnorm((-D/2))

# Proportion of overlap relative to the joint distribution
OVL2<-OVL/(2-OVL)

# non-parametric overlap with overlapping package

np.overlap<-
  overlap(x = list(D_out$pred.dat[
  D_out$pred.dat$group=="Male","pred"],
  D_out$pred.dat[
  D_out$pred.dat$group=="Female","pred"]),
  plot=T)

# this corresponds to Proportion of overlap relative to the joint distribution (OVL2)
np.OVL2<-unname(np.overlap$OV)

# from which Proportion of overlap relative to a single distribution (OVL) is approximated at
np.OVL<-(2*np.OVL2)/(1+np.OVL2)

# comparison of different overlap-estimates
round(cbind(OVL,np.OVL,OVL2,np.OVL2),2)
#>       OVL np.OVL OVL2 np.OVL2
#> [1,] 0.69   0.82 0.53     0.7

In this example, the non-parametric overlap (np) estimates to these logistic distributions are almost identical to the parametric overlap estimates.

Bootstrap confidence intervals for Regularized D

To obtain confidence intervals for D-estimates using regularized techniques, a bootstrap approach can be employed. This involves estimating D with multiple bootstrap samples drawn with replacement from the original sample and using the distribution of these D values to construct a percentile confidence interval around the estimate. In this example, only 10 redraws with replacement are used, but in practice, at least 1000 or 5000 redraws should be used for more accurate results.

# number of redraws
reps=10

# placeholder for D-estimates
D_out_boot<-rep(NA,reps)

# repeat the procedure to obtain a distribution

t1<-Sys.time()

for (i in 1:reps){
  # draw replaced samples with the same size as the original data
  boot.dat<-sample_n(US.dat,size=nrow(US.dat),replace=T)
  
  # run D_regularized for each sample
  D_out_boot[i]<-
    D_regularized(data=boot.dat,
                mv.vars=c("N","E","O","A","C"),
              group.var="gender.cat",
              group.values = c("Male","Female"),
              out = T,
              size=size)$D[,"D"]
  
}
t2<-Sys.time()
t2-t1
#> Time difference of 6.333763 secs
# print 95% CI percentile interval

round(quantile(D_out_boot,c(.025,.975)),2)
#>  2.5% 97.5% 
#>  0.77  0.83

After obtaining the bootstrap confidence interval, it can be reported that the multivariate sex difference D = 0.79, 95% CI [0.77, 0.83].

Multivariate difference using item scores as variables

Recently, there have been numerous studies in the field of personality psychology emphasizing the significance of narrow personality traits in predicting criteria, beyond the commonly used broad trait domains such as the Big Five (e.g., Seeboth & Mõttus, 2018). These narrow personality traits are usually either facets or nuances that are commonly modeled as being part of a certain Big Five trait, although their unique predictive utility suggests that they have trait-like properties (stability, cross-informant agreement etc.) that go beyond the broad Big Five domains (e.g., Mõttus et al., 2019). Consequently, narrower traits can be regarded as distinct traits. Thus, it is interesting to investigate whether using item scores as the variable set in multivariate estimation would provide better predictions of sex and different estimates of sex differences (see Ilmarinen et al., 2022).

However, due to the increased number of variables (from five to fifty in this dataset), regularization methods and different data partitions for training and testing are highly recommended to prevent overfitting of the data. For illustrative purposes, Mahalanobis’ D is also calculated below.

Mahalanobis’ D

# place-holder vector for univariate Cohen d's for items
d_vector_items<-rep(NA,length(per.items))

# loop through the items and obtain d for each item

for (i in 1:length(per.items)){
  d_vector_items[i]<-
    d_pooled_sd(data=US.dat,
              var=per.items[i],
            group.var="gender.cat",
            group.values=c("Male","Female"))[,"D"]
  
}

cor_mat_items<-cor(US.dat[,per.items])

# Calculate D
D_maha_items<-
  sqrt(t(d_vector_items) %*% solve(cor_mat_items) %*% d_vector_items)
D_maha_items
#>           [,1]
#> [1,] 0.9212846

Item-based Mahalanobis’ D is somewhat larger than D (Mahalanobis’ or regularized) based on Big Five domains. The standardized distance between the group centroids of males and females in a space comprised of personality items is D = 0.92

Regularized D with D_regularized -function

It is straightforward to use D_regularized -function even with large variable sets (especially when the variable names are save in a vector first; here “per.items”). Separate partitions of data are again used for regularization and prediction. Domain-based results are printed for comparison.

D_items_out<-
  D_regularized(data=US.dat,
                mv.vars=per.items,
              group.var="gender.cat",
              group.values = c("Male","Female"),
              out = T,
              size=size,pcc = T,auc = T,pred.prob = T)

# Item-based D
round(D_items_out$D,2)
#>      n.Male n.Female m.Male m.Female sd.Male sd.Female pooled.sd diff    D
#> [1,]   1473     4267   0.46     -0.4     0.9      0.86      0.87 0.86 0.99
#>      pcc.Male pcc.Female pcc.total  auc
#> [1,]      0.7       0.69      0.69 0.76
# Big Five domain-based D
round(D_out$D,2)
#>      n.Male n.Female m.Male m.Female sd.Male sd.Female pooled.sd diff    D
#> [1,]   1473     4267   0.34     -0.3    0.86      0.79      0.81 0.64 0.79
#>      pcc.Male pcc.Female pcc.total  auc
#> [1,]     0.63       0.67      0.66 0.71

With items as the multivariate set, D = 0.99 is slightly larger than with Big Five trait domains, D = 0.79.

Differences in predictive performance can also be illustrated from comparing the predicted probabilities.

round(100*D_items_out$P.table,1)
#>         
#>          [0,0.2) [0.2,0.4) [0.4,0.6) [0.6,0.8) [0.8,1]
#>   Female    11.6      38.5      33.4      14.3     2.2
#>   Male       1.6      15.1      32.9      35.4    15.0
round(100*D_out$P.table,1)
#>         
#>          [0,0.2) [0.2,0.4) [0.4,0.6) [0.6,0.8) [0.8,1]
#>   Female     7.2      39.0      37.0      14.1     2.7
#>   Male       1.2      18.1      37.1      32.6    11.0

The analysis reveals that the distribution of the FM-scores shows more individuals in the extremes. However, despite this, the overall predictive performance based on the pcc and auc values is not substantially improved. It is noteworthy that over 30% of the data falls in the 40%-60% range, indicating that their personality profiles are rather indifferent in regards to sex.

Overlap

Overlaps give similar information but in another format:

# Obtain the D-estimate
D_items<-unname(D_items_out$D[,"D"])

# Overlap relative to a single distribution (OVL)
OVL_items<-2*pnorm((-D_items/2))

# Proportion of overlap relative to the joint distribution
OVL2_items<-OVL_items/(2-OVL_items)

# non-parametric overlap

np.overlap_items<-
  overlap(x = list(D_items_out$pred.dat[
  D_items_out$pred.dat$group=="Male","pred"],
  D_items_out$pred.dat[
  D_items_out$pred.dat$group=="Female","pred"]),
  plot=T)

# this corresponds to Proportion of overlap relative to the joint distribution (OVL2)
np.OVL2_items<-unname(np.overlap_items$OV)

# from which Proportion of overlap relative to a single distribution (OVL) is approximated at
np.OVL_items<-(2*np.OVL2_items)/(1+np.OVL2_items)

# compare the different overlap-estimates

round(cbind(OVL_items,np.OVL_items,
            OVL2_items,np.OVL2_items),2)
#>      OVL_items np.OVL_items OVL2_items np.OVL2_items
#> [1,]      0.62         0.77       0.45          0.63
# print Big Five domain based overlaps for comparison

round(cbind(OVL,np.OVL,OVL2,np.OVL2),2)
#>       OVL np.OVL OVL2 np.OVL2
#> [1,] 0.69   0.82 0.53     0.7

The overlap between FM-distributions of males and females is somewhat smaller when individual items are used as the multivariate set rather than broad Big Five domains.

References

Session Information

sessionInfo()
#> R version 4.3.2 (2023-10-31 ucrt)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 19045)
#> 
#> Matrix products: default
#> 
#> 
#> locale:
#> [1] LC_COLLATE=C                     LC_CTYPE=Finnish_Finland.utf8   
#> [3] LC_MONETARY=Finnish_Finland.utf8 LC_NUMERIC=C                    
#> [5] LC_TIME=Finnish_Finland.utf8    
#> 
#> time zone: Europe/Helsinki
#> tzcode source: internal
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] overlapping_2.1 testthat_3.2.1  ggplot2_3.4.4   dplyr_1.1.4    
#> [5] multid_1.0.0    rio_0.5.29     
#> 
#> loaded via a namespace (and not attached):
#>  [1] sass_0.4.7        utf8_1.2.4        generics_0.1.3    shape_1.4.6      
#>  [5] lattice_0.21-9    stringi_1.8.3     pROC_1.18.5       hms_1.1.3        
#>  [9] digest_0.6.34     magrittr_2.0.3    evaluate_0.23     grid_4.3.2       
#> [13] iterators_1.0.14  fastmap_1.1.1     plyr_1.8.9        foreach_1.5.2    
#> [17] Matrix_1.6-5      glmnet_4.1-8      cellranger_1.1.0  jsonlite_1.8.8   
#> [21] zip_2.3.0         brio_1.1.4        survival_3.5-7    fansi_1.0.6      
#> [25] scales_1.3.0      codetools_0.2-19  jquerylib_0.1.4   cli_3.6.2        
#> [29] rlang_1.1.3       splines_4.3.2     munsell_0.5.0     withr_3.0.0      
#> [33] cachem_1.0.8      yaml_2.3.7        tools_4.3.2       colorspace_2.1-0 
#> [37] forcats_1.0.0     curl_5.0.2        vctrs_0.6.5       R6_2.5.1         
#> [41] lifecycle_1.0.4   foreign_0.8-85    pkgconfig_2.0.3   pillar_1.9.0     
#> [45] bslib_0.5.1       openxlsx_4.2.5.2  gtable_0.3.4      data.table_1.14.8
#> [49] glue_1.7.0        Rcpp_1.0.12       haven_2.5.2       xfun_0.39        
#> [53] tibble_3.2.1      tidyselect_1.2.0  rstudioapi_0.15.0 knitr_1.44       
#> [57] farver_2.1.1      htmltools_0.5.5   labeling_0.4.3    rmarkdown_2.25   
#> [61] compiler_4.3.2    readxl_1.4.2