mixgb: Multiple Imputation Through XGBoost

Yongshi Deng

2023-02-16

Introduction

Mixgb offers a scalable solution for imputing large datasets using XGBoost, subsampling and predictive mean matching. Our method utilizes the capabilities of XGBoost, a highly efficient implementation of gradient boosted trees, to capture interactions and non-linear relations automatically. Moreover, we have integrated subsampling and predictive mean matching to minimize bias and reflect appropriate imputation variability. Our package supports various types of variables and offers flexible settings for subsampling and predictive mean matching. We also include diagnostic tools for evaluating the quality of the imputed values.

Impute missing values with mixgb

We first load the mixgb package and the nhanes3_newborn dataset, which contains 16 variables of various types (integer/numeric/factor/ordinal factor). There are 9 variables with missing values.

library(mixgb)
str(nhanes3_newborn)
#> tibble [2,107 × 16] (S3: tbl_df/tbl/data.frame)
#>  $ HSHSIZER: int [1:2107] 4 3 5 4 4 3 5 3 3 3 ...
#>  $ HSAGEIR : int [1:2107] 2 5 10 10 8 3 10 7 2 7 ...
#>  $ HSSEX   : Factor w/ 2 levels "1","2": 2 1 2 2 1 1 2 2 2 1 ...
#>  $ DMARACER: Factor w/ 3 levels "1","2","3": 1 1 2 1 1 1 2 1 2 2 ...
#>  $ DMAETHNR: Factor w/ 3 levels "1","2","3": 3 1 3 3 3 3 3 3 3 3 ...
#>  $ DMARETHN: Factor w/ 4 levels "1","2","3","4": 1 3 2 1 1 1 2 1 2 2 ...
#>  $ BMPHEAD : num [1:2107] 39.3 45.4 43.9 45.8 44.9 42.2 45.8 NA 40.2 44.5 ...
#>   ..- attr(*, "label")= chr "Head circumference (cm)"
#>  $ BMPRECUM: num [1:2107] 59.5 69.2 69.8 73.8 69 61.7 74.8 NA 64.5 70.2 ...
#>   ..- attr(*, "label")= chr "Recumbent length (cm)"
#>  $ BMPSB1  : num [1:2107] 8.2 13 6 8 8.2 9.4 5.2 NA 7 5.9 ...
#>   ..- attr(*, "label")= chr "First subscapular skinfold (mm)"
#>  $ BMPSB2  : num [1:2107] 8 13 5.6 10 7.8 8.4 5.2 NA 7 5.4 ...
#>   ..- attr(*, "label")= chr "Second subscapular skinfold (mm)"
#>  $ BMPTR1  : num [1:2107] 9 15.6 7 16.4 9.8 9.6 5.8 NA 11 6.8 ...
#>   ..- attr(*, "label")= chr "First triceps skinfold (mm)"
#>  $ BMPTR2  : num [1:2107] 9.4 14 8.2 12 8.8 8.2 6.6 NA 10.9 7.6 ...
#>   ..- attr(*, "label")= chr "Second triceps skinfold (mm)"
#>  $ BMPWT   : num [1:2107] 6.35 9.45 7.15 10.7 9.35 7.15 8.35 NA 7.35 8.65 ...
#>   ..- attr(*, "label")= chr "Weight (kg)"
#>  $ DMPPIR  : num [1:2107] 3.186 1.269 0.416 2.063 1.464 ...
#>   ..- attr(*, "label")= chr "Poverty income ratio"
#>  $ HFF1    : Factor w/ 2 levels "1","2": 2 2 1 1 1 2 2 1 2 1 ...
#>  $ HYD1    : Ord.factor w/ 5 levels "1"<"2"<"3"<"4"<..: 1 3 1 1 1 1 1 1 2 1 ...
colSums(is.na(nhanes3_newborn))
#> HSHSIZER  HSAGEIR    HSSEX DMARACER DMAETHNR DMARETHN  BMPHEAD BMPRECUM 
#>        0        0        0        0        0        0      124      114 
#>   BMPSB1   BMPSB2   BMPTR1   BMPTR2    BMPWT   DMPPIR     HFF1     HYD1 
#>      161      169      124      167      117      192        7        0

To impute this dataset, we can use the default settings. The default number of imputed datasets is m = 5. Note that we do not need to convert the data into dgCMatrix or one-hot encoded format, as the package does this conversion automatically. Variables should be of one of the following types: numeric, integer, factor or ordinal factor.

# use mixgb with default settings
imputed.data <- mixgb(data = nhanes3_newborn, m = 5)
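
As a quick sanity check after imputation, we can confirm how many datasets were returned and that no missing values remain. This is a minimal sketch that assumes mixgb() returns a plain list of m imputed datasets when models are not saved (save.models = FALSE); if your version returns a different structure, adjust the indexing accordingly.

```r
library(mixgb)

# Impute with the default settings, as in the call above
imputed.data <- mixgb(data = nhanes3_newborn, m = 5)

# Assumption: the result is a list with one element per imputed dataset
length(imputed.data)

# Count the remaining missing values per variable in the first imputed dataset
colSums(is.na(imputed.data[[1]]))
```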

Customize imputation settings

We can also customize imputation settings:

# Use mixgb with chosen settings
params <- list(
  max_depth = 3,
  gamma = 0,
  eta = 0.3,
  min_child_weight = 1,
  subsample = 0.7,
  colsample_bytree = 1,
  colsample_bylevel = 1,
  colsample_bynode = 1,
  nthread = 2,
  tree_method = "auto",
  gpu_id = 0,
  predictor = "auto"
)

imputed.data <- mixgb(
  data = nhanes3_newborn, m = 5, maxit = 1,
  ordinalAsInteger = FALSE, bootstrap = FALSE,
  pmm.type = "auto", pmm.k = 5, pmm.link = "prob",
  initial.num = "normal", initial.int = "mode", initial.fac = "mode",
  save.models = FALSE, save.vars = NULL,
  xgb.params = params, nrounds = 100, early_stopping_rounds = 10,
  print_every_n = 10L, verbose = 0
)

Tune hyperparameters

Imputation performance can be affected by the hyperparameter settings. Although tuning a large set of hyperparameters may appear intimidating, it is often possible to narrow down the search space because many hyperparameters are correlated. In our package, the function mixgb_cv() can be used to tune the number of boosting rounds, nrounds. There is no default nrounds value in XGBoost, so users are required to specify this value themselves. The default nrounds in mixgb() is 100. However, we recommend using mixgb_cv() to find the optimal nrounds first.

params <- list(max_depth = 3, subsample = 0.7, nthread = 2)
cv.results <- mixgb_cv(data = nhanes3_newborn, nrounds = 100, xgb.params = params, verbose = FALSE)
cv.results$evaluation.log
#>     iter train_rmse_mean train_rmse_std test_rmse_mean test_rmse_std
#>  1:    1       5.3744890    0.014675460      5.3750613    0.07086852
#>  2:    2       3.8332703    0.010838631      3.8372278    0.07087161
#>  3:    3       2.7718411    0.006272944      2.7770853    0.07691548
#>  4:    4       2.0595957    0.007274550      2.0664484    0.07566479
#>  5:    5       1.5868738    0.008584994      1.6054670    0.07819587
#>  6:    6       1.2907648    0.014248482      1.3210073    0.07759826
#>  7:    7       1.1071555    0.015394644      1.1530031    0.07783167
#>  8:    8       1.0000161    0.017887745      1.0566439    0.07895215
#>  9:    9       0.9414638    0.018404997      1.0082380    0.07867945
#> 10:   10       0.9074870    0.018933432      0.9829059    0.08001215
#> 11:   11       0.8876951    0.018986953      0.9682910    0.07752943
#> 12:   12       0.8764532    0.018322576      0.9609722    0.07684140
#> 13:   13       0.8670131    0.018055405      0.9576967    0.07822358
#> 14:   14       0.8604551    0.017868182      0.9551112    0.07878126
#> 15:   15       0.8545978    0.017994667      0.9556937    0.07906311
#> 16:   16       0.8497766    0.017346718      0.9574317    0.07809297
#> 17:   17       0.8456010    0.017452824      0.9579252    0.07793869
#> 18:   18       0.8412693    0.017763551      0.9566392    0.07777018
#> 19:   19       0.8369451    0.017050940      0.9582819    0.07699266
#> 20:   20       0.8329889    0.017898987      0.9579895    0.07783339
#> 21:   21       0.8292042    0.018045147      0.9609148    0.07802547
#> 22:   22       0.8261493    0.018352210      0.9629216    0.07725943
#> 23:   23       0.8218315    0.018426677      0.9660504    0.07674019
#> 24:   24       0.8174190    0.018241518      0.9668980    0.07467530
#>     iter train_rmse_mean train_rmse_std test_rmse_mean test_rmse_std
cv.results$response
#> [1] "BMPSB2"
cv.results$best.nrounds
#> [1] 14
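
In the output above, best.nrounds matches the iteration with the lowest mean test RMSE in the evaluation log. The following base R check illustrates this on rows 12 to 16 copied from that log. The names eval.log and best.iter are illustrative only and are not part of mixgb, and the sketch does not claim to show how mixgb_cv() computes the value internally.

```r
# Rows 12 to 16 of the evaluation log printed above
eval.log <- data.frame(
  iter = 12:16,
  test_rmse_mean = c(0.9609722, 0.9576967, 0.9551112, 0.9556937, 0.9574317)
)

# Iteration with the smallest mean test RMSE
best.iter <- eval.log$iter[which.min(eval.log$test_rmse_mean)]
best.iter
#> [1] 14
```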

By default, mixgb_cv() randomly chooses an incomplete variable as the response and builds an XGBoost model with the other variables as explanatory variables, using the complete cases of the dataset. Therefore, each run of mixgb_cv() will likely return different results. Users can also specify the response and the covariates through the arguments response and select_features, respectively.

cv.results <- mixgb_cv(
  data = nhanes3_newborn, nfold = 10, nrounds = 100, early_stopping_rounds = 1,
  response = "BMPHEAD",
  select_features = c(
    "HSAGEIR", "HSSEX", "DMARETHN", "BMPRECUM", "BMPSB1",
    "BMPSB2", "BMPTR1", "BMPTR2", "BMPWT"
  ),
  xgb.params = params, verbose = FALSE
)

cv.results$best.nrounds
#> [1] 18

We can then set nrounds = cv.results$best.nrounds in mixgb() to obtain 5 imputed datasets.

imputed.data <- mixgb(data = nhanes3_newborn, m = 5, nrounds = cv.results$best.nrounds)

Inspect multiply imputed values

The mixgb package provides the following visual diagnostics functions:

  1. Single variable: plot_hist(), plot_box(), plot_bar();

  2. Two variables: plot_2num(), plot_2fac(), plot_1num1fac();

  3. Three variables: plot_2num1fac(), plot_1num2fac().

Each function will return m+1 panels to compare the observed data with m sets of actual imputed values.
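
For instance, a single-variable comparison for head circumference might look like the sketch below. The argument names imputation.list, var.name and original.data are assumptions based on the package version available at the time of writing; please check ?plot_hist in your installed version before relying on them.

```r
library(mixgb)

# Obtain m = 5 imputed datasets with the default settings
imputed.data <- mixgb(data = nhanes3_newborn, m = 5)

# Assumed arguments: the list of imputed datasets, the variable to inspect,
# and the original data containing the missing values
plot_hist(
  imputation.list = imputed.data,
  var.name = "BMPHEAD",
  original.data = nhanes3_newborn
)
```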

For more details, please see the vignette "Visual diagnostics for multiply imputed values" on GitHub.