slideimp

{slideimp} is a lightweight R package for fast K-NN and PCA imputation of missing values in high-dimensional numeric matrices.

Core functions

knn_imp(): Full-matrix K-NN imputation with multi-core parallelization, {mlpack} KD/Ball-Tree nearest neighbor implementation (for data with very low missing rates and extremely high dimensions), and optional subset imputation (ideal for epigenetic clock calculations).
pca_imp(): Optimized version of missMDA::imputePCA() for high-dimensional numeric matrices.
slide_imp(): Sliding window K-NN or PCA imputation for extremely high-dimensional numeric matrices with ordered features (i.e., by genomic position).
group_imp(): Parallelizable group-wise (e.g., by chromosomes or column clusters) K-NN or PCA imputation with optional auxiliary features and group-wise parameters.
- group_features(): group_imp()’s helper function to create groups based on a mapping data.frame (i.e., Illumina manifests). See {slideimp.extra} on GitHub for tools to process common Illumina manifests.
tune_imp(): Parallelizable hyperparameter tuning with repeated cross-validation; works with built-in or custom imputation functions.

Installation

The stable version of {slideimp} can be installed from CRAN using:

install.packages("slideimp")

You can install the development version of {slideimp} with:

pak::pkg_install("hhp94/slideimp")

Workflow

Let’s simulate some DNA methylation (DNAm) microarray data from 2 chromosomes. All {slideimp} functions expect the input to be a numeric matrix where variables are stored in the columns.

library(slideimp)
# Simulate data from 2 chromosomes
set.seed(1234)
sim_obj <- sim_mat(m = 20, n = 50, perc_NA = 0.3, perc_col_NA = 1, nchr = 2)
# Here we see that variables are stored in rows
sim_obj$input[1:5, 1:5]
#>              s1        s2        s3        s4        s5
#> feat1 0.2391314 0.0000000 0.5897476 0.4201222        NA
#> feat2        NA 0.2810446 0.3677927        NA 0.6387734
#> feat3 0.7203854 0.1600776 0.5027545        NA 0.5556735
#> feat4 0.0000000 0.1816453 0.3608640 0.3356484 0.6394179
#> feat5 0.5827582 0.3774313 0.2801131 0.5047049 0.5761809

# So we t() to put the variables in columns
obj <- t(sim_obj$input)

We can optionally estimate the prediction accuracy of different methods and tune hyperparameters prior to imputation with tune_imp().

For custom functions (.f argument), the parameters data.frame must include the columns corresponding to the arguments passed to the custom function. The custom function must accept obj as the first argument and return a matrix with the same dimensions as obj.

We tune the results using 2 repeats (rep = 2) for illustration (increase in actual analyses).

knn_params <- tibble::tibble(k = c(5, 20))
# Parallelization is controlled by `cores` only for knn or slideimp knn
tune_knn <- tune_imp(obj, parameters = knn_params, cores = 2, rep = 2)
#> Tuning knn_imp
#> Step 1/2: Injecting NA
#> Running in parallel...
#> Step 2/2: Tuning
compute_metrics(tune_knn)
#> # A tibble: 12 × 7
#>        k cores param_set   rep .metric .estimator .estimate
#>    <dbl> <dbl>     <int> <int> <chr>   <chr>          <dbl>
#>  1     5     2         1     1 mae     standard     0.178  
#>  2     5     2         1     1 rmse    standard     0.225  
#>  3     5     2         1     1 rsq     standard     0.00454
#>  4    20     2         2     1 mae     standard     0.149  
#>  5    20     2         2     1 rmse    standard     0.190  
#>  6    20     2         2     1 rsq     standard     0.0172 
#>  7     5     2         1     2 mae     standard     0.202  
#>  8     5     2         1     2 rmse    standard     0.259  
#>  9     5     2         1     2 rsq     standard     0.00960
#> 10    20     2         2     2 mae     standard     0.172  
#> 11    20     2         2     2 rmse    standard     0.219  
#> 12    20     2         2     2 rsq     standard     0.0850

For PCA and custom functions, setup parallelization with mirai::daemons().

mirai::daemons(2) # 2 Cores
# Note, for PCA and custom functions, cores is controlled by the `mirai::daemons()`
# and the `cores` argument is ignored.

# PCA imputation. Specified by the `ncp` column in the `pca_params` tibble.
pca_params <- tibble::tibble(ncp = c(1, 5))
tune_pca <- tune_imp(obj, parameters = pca_params, rep = 2)

# The parameters have `mean` and `sd` columns.
custom_params <- tibble::tibble(mean = 1, sd = 0)
# This function impute data with rnorm values of different `mean` and `sd`.
custom_function <- function(obj, mean, sd) {
  missing <- is.na(obj)
  obj[missing] <- rnorm(sum(missing), mean = mean, sd = sd)
  return(obj)
}
tune_custom <- tune_imp(obj, parameters = custom_params, .f = custom_function, rep = 2)

mirai::daemons(0) # Close daemons

Then, preferably perform imputation by group with group_imp() if the variables can be meaningfully grouped (e.g., by chromosomes).

group_imp() allows imputation to be performed separately within defined groups (e.g., by chromosome), which significantly reduces runtime and can increase accuracy for both K-NN and PCA imputation.
group_imp() requires a group tibble, preferably created with group_features(), with three list-columns:
- features: required – a list-column where each element is a character vector of variable names to be imputed together.
- aux: optional – auxiliary variables to include in each group.
- parameters: optional – group-specific imputation parameters.
In this example, we have data from 2 chromosomes so the group tibble should have two rows (one per chromosome), with the corresponding variables listed in the features column for each row.

PCA-based imputation with group_imp() can be parallelized using the {mirai} package, similar to how parallelization is done with tune_imp().

# Use the `group_features()` helper function
group_df <- group_features(obj, sim_obj$group_feature)
group_df

# We choose K-NN imputation, k = 5, from the `tune_imp` results.
knn_group_results <- group_imp(obj, group = group_df, k = 5, cores = 2)

# Similar to `tune_imp`, parallelization is controlled by `mirai::daemons()`
mirai::daemons(2)
knn_group_results <- group_imp(obj, group = group_df, ncp = 3)
mirai::daemons(0)

Alternatively, full matrix imputation can be performed using knn_imp() or pca_imp().

full_knn_results <- knn_imp(obj = obj, k = 5)
full_pca_results <- pca_imp(obj = obj, ncp = 5)

Sliding Window Imputation

Sliding window imputation can be performed using slide_imp(). Note: DNAm WGBS/EM-seq data should be grouped by chromosomes and converted into either beta or M values before sliding window imputation. See vignette for more details.

chr1_beta <- t(sim_mat(m = 10, n = 2000, perc_NA = 0.3, perc_col_NA = 1, nchr = 1)$input)
dim(chr1_beta)
#> [1]   10 2000
chr1_beta[1:5, 1:5]
#>        feat1     feat2     feat3     feat4     feat5
#> s1        NA 0.7297743        NA        NA 0.3968039
#> s2 0.7346970        NA 0.5669140 0.3236858 0.3932419
#> s3        NA        NA        NA 0.3108793        NA
#> s4 0.5401526 0.5779956 0.4271064        NA 0.3309645
#> s5 0.6457875        NA 0.7308792 0.4803642 0.5929590

# From the tune results, choose window size of 50, overlap of size 5 between windows,
# K-NN imputation using k = 10. Specify `ncp` for sliding window PCA imputation.
slide_imp(obj = chr1_beta, n_feat = 50, n_overlap = 5, k = 10, cores = 2, .progress = FALSE)
#> ImputedMatrix (KNN)
#> Dimensions: 10 x 2000
#> 
#>        feat1     feat2     feat3     feat4     feat5
#> s1 0.5067435 0.7297743 0.5884198 0.5063839 0.3968039
#> s2 0.7346970 0.4551576 0.5669140 0.3236858 0.3932419
#> s3 0.5625864 0.4790436 0.5316400 0.3108793 0.5234974
#> s4 0.5401526 0.5779956 0.4271064 0.5551127 0.3309645
#> s5 0.6457875 0.4006866 0.7308792 0.4803642 0.5929590
#> 
#> # Showing [1:5, 1:5] of full matrix