sharp: Stability-enHanced Approaches using Resampling Procedures

Description

In stability selection and consensus clustering, resampling techniques are used to enhance the reliability of the results. In this package, hyper-parameters are calibrated by maximising model stability, which is measured under the null hypothesis that all selection (or co-membership) probabilities are identical. Ready-to-use functions are provided for stability selection with LASSO regression, sparse PCA, sparse (group) PLS or graphical LASSO, and for consensus clustering with hierarchical clustering, partitioning around medoids, K-means or Gaussian mixture models.
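
To make the idea of selection probabilities concrete, the following base R sketch estimates selection proportions over subsamples. It is a toy illustration only, not the procedure implemented in sharp: the selection rule (thresholding absolute marginal correlations) is a stand-in for LASSO regression, the threshold is fixed rather than calibrated, and all object names are hypothetical.

```r
# Toy illustration of selection proportions (base R only, not sharp's algorithm)
set.seed(1)
n <- 200
p <- 10
x <- matrix(rnorm(n * p), nrow = n, ncol = p)
colnames(x) <- paste0("var", seq_len(p))
y <- x[, 1] - x[, 2] + rnorm(n) # only var1 and var2 contribute to the outcome

K <- 100         # number of subsamples
tau <- 0.5       # subsample size as a proportion of n
threshold <- 0.3 # assumed, fixed cut-off on the absolute marginal correlation

# One row per subsample, one column per variable (1 if selected, 0 otherwise)
selected <- matrix(0, nrow = K, ncol = p, dimnames = list(NULL, colnames(x)))
for (k in seq_len(K)) {
  ids <- sample(n, size = floor(tau * n))
  selected[k, ] <- abs(cor(x[ids, ], y[ids])) >= threshold
}

# Selection proportion: share of subsamples in which each variable is selected
selprop <- colMeans(selected)
print(round(selprop, 2))
```

Variables that are selected in most subsamples are the ones considered stable. In sharp, both the sparsity parameter and the threshold in selection proportion are calibrated rather than fixed by hand.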

Installation

The released version of the package can be installed from CRAN with:

install.packages("sharp")

The development version can be installed from GitHub:

remotes::install_github("barbarabodinier/sharp")

Example datasets

To illustrate the use of the main functions implemented in sharp, four artificial datasets are created:

library(sharp)

# Dataset for regression
set.seed(1)
data_reg <- SimulateRegression(n = 200, pk = 10)
x_reg <- data_reg$xdata
y_reg <- data_reg$ydata

# Dataset for structural equation modelling
set.seed(1)
data_sem <- SimulateStructural(n = 200, pk = c(5, 2, 3))
x_sem <- data_sem$data

# Dataset for graphical modelling
set.seed(1)
data_ggm <- SimulateGraphical(n = 200, pk = 20)
x_ggm <- data_ggm$data

# Dataset for clustering
set.seed(1)
data_clust <- SimulateClustering(n = c(10, 10, 10))
x_clust <- data_clust$data

Check out the R package fake for more details on these data simulation models.

Main functions

Variable selection

In a regression context, stability selection is done using LASSO regression as implemented in the R package glmnet.

stab_reg <- VariableSelection(xdata = x_reg, ydata = y_reg)
SelectedVariables(stab_reg)
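
Beyond the set of stably selected variables, it can be useful to look at the selection proportions and at the calibrated parameters. The sketch below assumes the accessor functions SelectionProportions() and Argmax() and the arguments K and verbose behave as in recent versions of sharp; the data are simulated again so that the example runs on its own.

```r
library(sharp)

# Simulated regression data (same settings as above)
set.seed(1)
simul <- SimulateRegression(n = 200, pk = 10)

# Stability selection with fewer subsamples to keep the example quick
stab <- VariableSelection(
  xdata = simul$xdata, ydata = simul$ydata,
  K = 50, verbose = FALSE
)

# Selection proportion of each predictor at the calibrated penalty
selprop <- SelectionProportions(stab)
print(round(selprop, 2))

# Calibrated penalty parameter and threshold in selection proportion
print(Argmax(stab))
```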

Structural equation modelling

In a structural equation modelling context, stability selection is done using a series of LASSO regressions as implemented in the R package glmnet.

dag <- LayeredDAG(layers = c(5, 2, 3))
stab_sem <- StructuralEquations(xdata = x_sem, adjacency = dag)
LinearSystemMatrix(vect = Stable(stab_sem), adjacency = dag)

Graphical modelling

In a graphical modelling context, stability selection is done using the graphical LASSO as implemented in the R package glassoFast.

stab_ggm <- GraphicalModel(xdata = x_ggm)
Adjacency(stab_ggm)
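
The adjacency matrix returned by Adjacency() can be summarised with base R. As a minimal sketch, assuming the output is a symmetric binary matrix with one row and column per variable, the number of stably selected edges and the degree of each node can be obtained as follows:

```r
library(sharp)

# Simulated data for graphical modelling (same settings as above)
set.seed(1)
simul <- SimulateGraphical(n = 200, pk = 20)

# Stability selection graphical model
stab <- GraphicalModel(xdata = simul$data, verbose = FALSE)
adj <- Adjacency(stab)

# Each edge appears twice in a symmetric matrix, so count the upper triangle only
n_edges <- sum(adj[upper.tri(adj)])
print(n_edges)

# Number of stably selected neighbours of each node
degree <- rowSums(adj)
print(degree)
```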

Clustering

Consensus clustering is done using hierarchical clustering as implemented in the R package stats.

stab_clust <- Clustering(xdata = x_clust)
Clusters(stab_clust)
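
The stable cluster memberships can be inspected alongside the co-membership proportions they are derived from. The sketch below assumes that ConsensusMatrix() returns the matrix of co-membership proportions for the calibrated number of clusters, as in recent versions of sharp.

```r
library(sharp)

# Simulated data with three groups of 10 observations (same settings as above)
set.seed(1)
simul <- SimulateClustering(n = c(10, 10, 10))

# Consensus clustering with the default hierarchical clustering
stab <- Clustering(xdata = simul$data, verbose = FALSE)

# Stable cluster membership of each observation and resulting cluster sizes
membership <- Clusters(stab)
print(table(membership))

# Co-membership proportions between all pairs of observations
coprop <- ConsensusMatrix(stab)
print(dim(coprop))
```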

Extraction and visualisation of the results

It is strongly recommended to check the calibration of the hyper-parameters using the function CalibrationPlot() on the output from any of the main functions listed above. The functions print(), summary() and plot() can also be used on the outputs from the main functions.
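
As an illustration, the functions mentioned above can be applied to the output of VariableSelection(). This is a minimal sketch on simulated data; the same calls are expected to work on the outputs of the other main functions.

```r
library(sharp)

# Simulated regression data and stability selection
set.seed(1)
simul <- SimulateRegression(n = 200, pk = 10)
stab <- VariableSelection(
  xdata = simul$xdata, ydata = simul$ydata,
  verbose = FALSE
)

# Stability score across the grid of hyper-parameters
CalibrationPlot(stab)

# Text summaries of the run and of the calibrated model
print(stab)
summary(stab)

# Default plot of the results
plot(stab)
```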

Parametrisation

Stability selection and consensus clustering can in principle be done by aggregating the results from any selection (or clustering) algorithm on subsamples of the data. The underlying algorithm is specified in the argument implementation of the main functions. Consensus clustering using partitioning around medoids, K-means or Gaussian mixture models is also supported in sharp:

stab_clust <- Clustering(xdata = x_clust, implementation = PAMClustering)
stab_clust <- Clustering(xdata = x_clust, implementation = KMeansClustering)
stab_clust <- Clustering(xdata = x_clust, implementation = GMMClustering)

Other algorithms can be used by defining a wrapper function to be called in implementation. Check out the documentation of GraphicalModel() for an example using a shrunk estimate of the partial correlation instead of the graphical LASSO.

References