lboxcox: an R package for the logistic Box-Cox model

Li Xing, Shiyu Xu, Jing Wang, Kohlton Booth, Xuekui Zhang, Igor Burstyn, Paul Gustafson

2023-12-12

Introduction

The lboxcox package is developed based on our recent research publication (Xing et. al [2]). We proposed a logistic Box-Cox model by adding a shape parameter to a routine logistic regression model to handle a more flexible relationship between log odds and its systematic component. Below we provide instructions for users on how to fit the logistic Box-Cox model using our lboxcox package.

Data Example

library(lboxcox)

In the library, there is a dataset called depress. It was from the National Health and Nutrition Examination Survey (NHANES) (Xing et. al [2]). The outcome is a binary indicator for depression in the data, and the main predictor is the blood mercury level, which requires an extra shape parameter to accommodate the nonlinear association. And the other variables are age, gender, and weight. The weight variable contains sampling weights, which would be used to adjust for sampling bias.

data(depress)
head(depress)
#>   depression mercury age gender    weight
#> 1          0    7.13  37      1 18979.728
#> 2          0    1.40  60      0 19877.753
#> 3          0    2.46  60      0 16671.291
#> 4          0    6.24  55      0  3749.384
#> 5          0   11.20  64      1 66567.646
#> 6          0    1.48  51      1  5494.848

The logistic Box-Cox model is fitted using the function, trained_model().

MaxLik model

trained_model = lbc_maxlik(
  depression ~ mercury + # response variable, and primary predictor
  age + factor(gender), # covariates
  weight_column_name="weight", # the column name which contains the information about survey weight 
  data=depress, # data comes from package
  svy_lambda_vector = seq(0, 2, length = 25),
  init_lambda_vector = seq(0, 2, length = 100),
  num_cores=2, # since vignettes can only work with a max of two processes
  seed = 1,
  iterlim = 100,
  timelim = 10
)
summary(trained_model)
#> --------------------------------------------
#> Maximum Likelihood estimation
#> Newton-Raphson maximisation, 5 iterations
#> Return code 1: gradient close to zero (gradtol)
#> Log-Likelihood: -41.14436 
#> 5  free parameters
#> Estimates:
#>                   Estimate Std. error t value Pr(> t)
#> Beta_0           -2.206758  12.233684  -0.180   0.857
#> Beta_1           -0.365960   4.470466  -0.082   0.935
#> Lambda            0.205635  16.899802   0.012   0.990
#> age              -0.003513   0.231210  -0.015   0.988
#> factor(gender)_1 -0.492110   8.073192  -0.061   0.951
#> --------------------------------------------

The output shows the coefficients and the standard error values for all the predictors. Categorical variables are automatically one-hot encoded during the model fitting process.

The above is the intended use case for when we do not have initial values for the coefficients. We use svyglm from the survey package [1] to estimate the initial values. The training process for svyglm utilizes a lambda parameter that we search over a reasonable range to optimize the model.

We incorporate parallelized computing option in the model by setting the num_cores param to the number of cores we want to use for training. Please use the default num_cores value (= 1) if you do not wish to parallelize.

However, if you have an initial guess for the coefficients, you can include it as a vector and bypass the initial value estimation described above entirely! We do this with the below.

init = c(0.0984065403, -0.0227734374, 0.0000000000, -0.0002426025, -0.0316484585)
trained_model = lbc_maxlik(
  depression ~ mercury + # response variable, and primary predictor
  age + factor(gender), # covariates
  weight_column_name="weight", # the column name which contains the information about survey weight
  data=depress, # data comes from package
  svy_lambda_vector = seq(0, 2, length = 25),
  init_lambda_vector = seq(0, 2, length = 100),
  num_cores=2, # since vignettes can only work with a max of two processes
  seed = 2023,
  iterlim = 100,
  timelim = 10
)
summary(trained_model)
#> --------------------------------------------
#> Maximum Likelihood estimation
#> Newton-Raphson maximisation, 5 iterations
#> Return code 1: gradient close to zero (gradtol)
#> Log-Likelihood: -41.14436 
#> 5  free parameters
#> Estimates:
#>                   Estimate Std. error t value Pr(> t)
#> Beta_0           -2.206758  12.233684  -0.180   0.857
#> Beta_1           -0.365960   4.470466  -0.082   0.935
#> Lambda            0.205635  16.899802   0.012   0.990
#> age              -0.003513   0.231210  -0.015   0.988
#> factor(gender)_1 -0.492110   8.073192  -0.061   0.951
#> --------------------------------------------

Citations

[1] Lumley, T. (2011). Complex surveys: a guide to analysis using R (Vol. 565). John Wiley & Sons.

[2] Xing, L., Zhang, X., Burstyn, I., & Gustafson, P. (2021). On logistic Box–Cox regression for flexibly estimating the shape and strength of exposure‐disease relationships. Canadian Journal of Statistics, 49(3), 808-825.