The goal of the stabiliser package is to provide a flexible method of applying stability selection (Meinshausen and Buhlmann, 2010) with various model types, and the framework for triangulating the results for multiple models (Lima et al., 2021).

Installation

You can install this package from github using the devtools package as follows:

install.packages("stabiliser")

Or to get the most recent development version:

devtools::install_github("roberthyde/stabiliser")

Usage

The stabiliser_example dataset is a simulated example with the following properties:

library(stabiliser)
#> Registered S3 methods overwritten by 'expss':
#>   method                 from 
#>   [.labelled             Hmisc
#>   as.data.frame.labelled base 
#>   print.labelled         Hmisc
data("stabiliser_example")

stabilise()

To attempt to identify which variables are truly “causal” in this dataset using a selection stability approach, use the stabilise() function as follows:

set.seed(8141)
stable_enet <- stabilise(data = stabiliser_example,
                         outcome = "y")
#> Recipe
#> 
#> Inputs:
#> 
#>       role #variables
#>    outcome          1
#>  predictor         99
#> 
#> Operations:
#> 
#> Centering and scaling for all_numeric_predictors()
#> Dummy variables from all_nominal_predictors()
#> K-nearest neighbor imputation for all_predictors()

Access the stability (percentage of 100 bootstrap resamples where a given variable was selected by a given model) results for elastic net as follows:

stable_enet$enet$stability
#> # A tibble: 99 × 7
#>    variable mean_coefficient ci_lower ci_upper bootstrap_p stability stable
#>    <chr>               <dbl>    <dbl>    <dbl>       <dbl>     <dbl> <chr> 
#>  1 causal1              2.79   0.223      5.92           0      84   *     
#>  2 causal2              2.90   0.393      6.59           0      75   <NA>  
#>  3 causal3              2.27   0.308      5.27           0      75   <NA>  
#>  4 junk14               2.53   0.219      6.75           0      67   <NA>  
#>  5 junk88               2.50   0.156      5.19           0      62.5 <NA>  
#>  6 junk86               1.51   0.0164     4.32           0      60   <NA>  
#>  7 junk13               2.27   0.214      5.34           0      59.5 <NA>  
#>  8 junk45               2.02   0.110      5.48           0      59.5 <NA>  
#>  9 junk90               1.74   0.0673     5.60           0      47.5 <NA>  
#> 10 junk26               1.67   0.0564     5.22           0      43   <NA>  
#> # … with 89 more rows

This ranks the variables by stability, and displays the mean coefficients, 95% confidence interval and bootstrap p-value. It also displays whether the variable is deemed “stable”, in this case 2 out of the 4 truly causal variables are identified as “stable”, with no false positives.

By default, this implements an elastic net algorithm over 100 bootstrap resamples of the dataset. Stability of each variable is then calculated as the proportion of bootstrap repeats where that variable is selected in the model.

stabilise() also permutes the outcome several times (5 by default) and performs the same process on each permuted dataset (20 bootstrap resamples for each by default).

This allows a permutation threshold to be calculated. Variables with a non-permuted stability % above this threshold are deemed “stable” as they were selected in a higher proportion of bootstrap resamples than in the permuted datasets, where we know there is no association between variables and the outcome.

The permutation threshold is available as follows:

stable_enet$enet$perm_thresh
#> [1] 81

triangulate()

Our confidence that a given variable is truly associated with a given outcome might be increased if it is identified in multiple model types.

Specify multiple model types (elastic net, mbic and mcp) for comparison using the stabilise() function as follows:

set.seed(8141)
stable_combi <- stabilise(data = stabiliser_example,
                         outcome = "y",
                         models = c("enet",
                                    "mbic",
                                    "mcp"))
#> Recipe
#> 
#> Inputs:
#> 
#>       role #variables
#>    outcome          1
#>  predictor         99
#> 
#> Operations:
#> 
#> Centering and scaling for all_numeric_predictors()
#> Dummy variables from all_nominal_predictors()
#> K-nearest neighbor imputation for all_predictors()

The stability of variables is available for all model types as before.

The stability results from these three models stored within stable_combi can be combined using triangulate as follows:

triangulated <- triangulate(stable_combi)
triangulated
#> $combi
#> $combi$stability
#> # A tibble: 99 × 7
#>    variable mean_coefficient ci_lower ci_upper bootstrap_p stability stable
#>    <chr>               <dbl>    <dbl>    <dbl>       <dbl>     <dbl> <chr> 
#>  1 causal1              4.24   0.379    8.75       0            56.3 *     
#>  2 causal2              4.12   0.344    9.64       0            45.2 <NA>  
#>  3 causal3              3.36   0.276    8.59       0.00778      42.8 <NA>  
#>  4 junk88               3.86   0.201    8.14       0            41.5 <NA>  
#>  5 junk14               3.64   0.221    9.13       0.00463      36   <NA>  
#>  6 junk13               3.37   0.244    8.59       0            32.2 <NA>  
#>  7 junk86               1.98   0.0190   7.53       0.00578      28.8 <NA>  
#>  8 junk45               2.40   0.114    6.89       0            24.8 <NA>  
#>  9 junk90               2.36   0.0727   8.39       0            21.2 <NA>  
#> 10 junk65              -1.90  -6.77    -0.0917     0            19.5 <NA>  
#> # … with 89 more rows
#> 
#> $combi$intercept
#> # A tibble: 1 × 7
#>   variable    mean_coefficient ci_lower ci_upper bootstrap_p stability stable
#>   <chr>                  <dbl>    <dbl>    <dbl>       <dbl>     <dbl> <chr> 
#> 1 (Intercept)             2.14    -1.56     6.38       0.132       100 *     
#> 
#> $combi$perm_thresh
#> [1] 51

This shows variables consistently being identified as being stable across multiple model types, and consequently increasing our confidence that they are truly associated with the outcome.

Triangulating the results for stability selection across multiple model types generally provides better performance at identifying truly causal variables than single model approaches (Lima et al., 2021). As shown in this example, the triangulated approach identifies 3 out of the 4 truly causal variables as being “stable”, and being highly likely to be associated with the outcome y, in contrast to the 2 variables identified when using elastic net alone.

stab_plot()

Both stabilise() and triangulate() outputs can be plotted using stab_plot() as follows, with causal and junk variables highlighted for this example:

stab_plot(stabiliser_object = triangulated)

plot of chunk stab_plot

References

Lima, E., Hyde, R., Green, M., 2021. Model selection for inferential models with high dimensional data: synthesis and graphical representation of multiple techniques. Sci. Rep. 11, 412. https://doi.org/10.1038/s41598-020-79317-8

Meinshausen, N., Buhlmann, P., 2010. Stability selection. J. R. Stat. Soc. Ser. B (Statistical Methodol. 72, 417–473. https://doi.org/10.1111/j.1467-9868.2010.00740.x