A beginner’s guide to creating a bulkAnalyseR app from a GEO dataset

In this short tutorial we showcase a simple pipeline for creating a bulkAnalyseR app from a publicly available dataset in the Gene Expression Omnibus (GEO). There are no prerequisites: installing bulkAnalyseR and downloading the data are both covered below.

The example app described in this vignette can be found here.

Installation

First, install the latest version of bulkAnalyseR, starting with the CRAN and Bioconductor dependencies:

packages.cran <- c(
  "ggplot2", "shiny", "shinythemes", "gprofiler2", "stats", "ggrepel",
  "utils", "RColorBrewer", "circlize", "shinyWidgets", "shinyjqui",
  "dplyr", "magrittr", "ggforce", "rlang", "glue", "matrixStats",
  "noisyr", "tibble", "ggnewscale", "ggrastr", "visNetwork", "shinyLP",
  "grid", "DT", "scales", "shinyjs", "tidyr", "UpSetR", "ggVennDiagram"
)
new.packages.cran <- packages.cran[!(packages.cran %in% installed.packages()[, "Package"])]
if(length(new.packages.cran))
  install.packages(new.packages.cran)

packages.bioc <- c(
  "edgeR", "DESeq2", "preprocessCore", "GENIE3", "ComplexHeatmap"
)
new.packages.bioc <- packages.bioc[!(packages.bioc %in% installed.packages()[,"Package"])]
if(length(new.packages.bioc)){
  if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
  BiocManager::install(new.packages.bioc)
}

install.packages("bulkAnalyseR")
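
Before moving on, it can be worth confirming that the installation succeeded. The following is a minimal sketch in base R; `missing_packages` is a hypothetical helper name introduced here for illustration and is not part of bulkAnalyseR. The list of packages checked is only a small subset of the dependencies above.

```r
# Hypothetical helper: return the names of packages that cannot be found
# in the current library paths (system.file() returns "" when a package
# is not installed).
missing_packages <- function(pkgs) {
  is_installed <- vapply(
    pkgs,
    function(p) nzchar(system.file(package = p)),
    logical(1)
  )
  pkgs[!is_installed]
}

to_check <- c("bulkAnalyseR", "edgeR", "DESeq2", "noisyr")
still_missing <- missing_packages(to_check)
if (length(still_missing) > 0) {
  message("Not installed: ", paste(still_missing, collapse = ", "))
} else {
  message("All checked packages are available.")
}
```

If any package is reported as missing, re-running the corresponding installation step above is the simplest remedy.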

Download data and create app

Get the expression matrix

We start by downloading and reading in the expression matrix, in which rows represent genes/features and columns represent samples. Running the code below requires an internet connection. The matrix comes from a 2022 study of the stem cell transcriptional response to microglia-conditioned media. For illustrative purposes we keep only four of its samples: two control and two microglia samples (columns 1, 2, 19 and 20).

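# Build the destination with file.path() so the file lands inside tempdir()
download_path <- file.path(tempdir(), "expression_matrix.csv.gz")
download.file(
  "https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE178620&format=file&file=GSE178620%5Fraw%5Fabundances%2Ecsv%2Egz",
  download_path,
  mode = "wb"  # binary mode, so the gzipped file is not corrupted on Windows
)
# Keep two control and two microglia samples
exp <- as.matrix(read.csv(download_path, row.names = 1))[, c(1, 2, 19, 20)]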
head(exp)
##>                 control_G322_G322_1 control_G322_G322_2 microglia_067MG_G322_1
##> ENSG00000223972                   0                   0                      0
##> ENSG00000227232                  51                  45                     25
##> ENSG00000278267                   6                   0                      0
##> ENSG00000243485                   0                   0                      0
##> ENSG00000284332                   0                   0                      0
##> ENSG00000237613                   0                   0                      0
##>                 microglia_067MG_G322_2
##> ENSG00000223972                      0
##> ENSG00000227232                     40
##> ENSG00000278267                      0
##> ENSG00000243485                      0
##> ENSG00000284332                      0
##> ENSG00000237613                      0
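
A few quick sanity checks on the matrix can catch problems early, for example missing values or duplicated sample names. The sketch below uses a toy matrix built from the first three rows of the output above, so that it runs without the download; `check_expression_matrix` is a hypothetical helper introduced here, and the particular checks are our own suggestion rather than requirements documented by bulkAnalyseR.

```r
# Toy matrix reproducing the first three rows of the output shown above
toy <- matrix(
  c(0, 51, 6, 0, 45, 0, 0, 25, 0, 0, 40, 0),
  nrow = 3,
  dimnames = list(
    c("ENSG00000223972", "ENSG00000227232", "ENSG00000278267"),
    c("control_G322_G322_1", "control_G322_G322_2",
      "microglia_067MG_G322_1", "microglia_067MG_G322_2")
  )
)

# Hypothetical helper: basic sanity checks for a raw count matrix.
# Stops with an error if any check fails.
check_expression_matrix <- function(mat) {
  stopifnot(
    is.matrix(mat),                 # a matrix, not a data frame
    is.numeric(mat),                # numeric counts
    !anyNA(mat),                    # no missing values
    all(mat >= 0),                  # counts cannot be negative
    !is.null(rownames(mat)),        # gene identifiers present
    !anyDuplicated(colnames(mat))   # sample names are unique
  )
  invisible(TRUE)
}

check_expression_matrix(toy)
dim(toy)
```

The same helper can be applied to the real matrix with `check_expression_matrix(exp)`.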

Defining metadata

We use a very simple metadata table with just the main condition in the experiment. Detailed metadata is available for all GEO datasets and can be downloaded and used instead.

meta <- data.frame(
  name = colnames(exp),
  condition = sapply(colnames(exp), USE.NAMES = FALSE, function(nm){
    strsplit(nm, "_")[[1]][1]
  })
)
meta
##>                     name condition
##> 1    control_G322_G322_1   control
##> 2    control_G322_G322_2   control
##> 3 microglia_067MG_G322_1 microglia
##> 4 microglia_067MG_G322_2 microglia
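
The same table can also be built with a vectorised regular expression, which may be easier to read than the `sapply()` call above. This is a sketch under two assumptions: that the condition label is always the text before the first underscore, and that the sample names are those shown in the output above (they are hard-coded here so the example runs on its own). The name `meta_alt` is introduced only for illustration.

```r
# Sample names copied from the output above
sample_names <- c(
  "control_G322_G322_1", "control_G322_G322_2",
  "microglia_067MG_G322_1", "microglia_067MG_G322_2"
)

# Drop everything from the first underscore onwards to obtain the condition
meta_alt <- data.frame(
  name = sample_names,
  condition = sub("_.*$", "", sample_names),
  stringsAsFactors = FALSE
)

# The sample names in the metadata should match the matrix columns, in order
stopifnot(identical(meta_alt$name, sample_names))

# Number of samples per condition
table(meta_alt$condition)
```

With the real data, `sample_names` would simply be `colnames(exp)`.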

Pre-processing

We can now denoise and normalise the data using bulkAnalyseR:

exp.proc <- bulkAnalyseR::preprocessExpressionMatrix(exp, output.plot = TRUE)
##> >>> noisyR counts approach pipeline <<<
##> The input matrix has 60671 rows and 4 cols
##>     number of genes: 60671
##>     number of samples: 4
##> Calculating the number of elements per window
##>     the number of elements per window is 6067
##>     the step size is 303
##>     the selected similarity metric is correlation_pearson
##>   Working with sample 1
##>   Working with sample 2
##>   Working with sample 3
##>   Working with sample 4
##> Calculating noise thresholds for 4 samples...
##>     similarity.threshold = 0.25
##>     method.chosen = Boxplot-IQR
##> Denoising expression matrix...
##>     removing noisy genes
##>     adjusting matrix
##> >>> Done! <<<
##> Performing quantile normalisation...
##> Done!
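
The last step reported in the log is quantile normalisation, which forces every sample to share the same distribution of values. To illustrate the idea only, here is a simplified base-R sketch; it is not the implementation used by bulkAnalyseR, the function name `quantile_normalise_sketch` is hypothetical, and ties are broken by order of appearance rather than averaged.

```r
# Simplified quantile normalisation (illustration only)
quantile_normalise_sketch <- function(mat) {
  # Sort each column, then average across columns at each rank
  sorted <- apply(mat, 2, sort)
  target <- unname(rowMeans(sorted))
  # Replace each value by the target value at its within-column rank
  out <- apply(mat, 2, function(col) {
    target[rank(col, ties.method = "first")]
  })
  dimnames(out) <- dimnames(mat)
  out
}

# Toy data: 3 genes by 3 samples (values are made up)
toy <- matrix(
  c(5, 2, 3, 4, 1, 4, 3, 4, 6),
  nrow = 3,
  dimnames = list(paste0("gene", 1:3), paste0("sample", 1:3))
)
toy_norm <- quantile_normalise_sketch(toy)

# After normalisation, every column contains the same set of values
apply(toy_norm, 2, sort)
```

Because each column ends up as a permutation of the same target values, differences in overall scale between samples are removed while the within-sample ordering of genes is preserved.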

Creating the shiny app

Finally, we can create the shiny app. The call below writes the app files to the directory given by shiny.dir (here, shiny_GEO). This example app can be found here.

bulkAnalyseR::generateShinyApp(
  shiny.dir = "shiny_GEO",
  app.title = "Shiny app for visualisation of GEO data",
  modality = "RNA",
  expression.matrix = exp.proc,
  metadata = meta,
  organism = "hsapiens",
  org.db = "org.Hs.eg.db"
)
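
Once the files have been generated, the app can be started locally with `shiny::runApp()`. The wrapper below is a small sketch: `launch_app` is a hypothetical helper, and it only launches the app in an interactive session so that the code is safe to run in scripts. It assumes the directory name matches the `shiny.dir` value used above.

```r
# Hypothetical helper: launch the generated app if its directory exists.
# Returns FALSE (invisibly) when the directory is missing, TRUE otherwise.
launch_app <- function(app_dir = "shiny_GEO") {
  if (!dir.exists(app_dir)) {
    message("Directory not found: ", app_dir,
            ". Run generateShinyApp() first.")
    return(invisible(FALSE))
  }
  if (interactive()) {
    # Blocks until the app is closed in the browser or the session is stopped
    shiny::runApp(app_dir)
  }
  invisible(TRUE)
}

launch_app("shiny_GEO")
```

In an interactive R session this opens the app in a browser window; in a non-interactive run it only checks that the directory is present.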

The session information used to build this vignette is:

sessionInfo()
##> R version 4.2.2 (2022-10-31 ucrt)
##> Platform: x86_64-w64-mingw32/x64 (64-bit)
##> Running under: Windows 10 x64 (build 22621)
##> 
##> Matrix products: default
##> 
##> locale:
##> [1] LC_COLLATE=C                           
##> [2] LC_CTYPE=English_United Kingdom.utf8   
##> [3] LC_MONETARY=English_United Kingdom.utf8
##> [4] LC_NUMERIC=C                           
##> [5] LC_TIME=English_United Kingdom.utf8    
##> 
##> attached base packages:
##> [1] stats     graphics  grDevices utils     datasets  methods   base     
##> 
##> loaded via a namespace (and not attached):
##>  [1] tidyselect_1.2.0      xfun_0.35             bslib_0.4.1          
##>  [4] lattice_0.20-45       splines_4.2.2         colorspace_2.0-3     
##>  [7] vctrs_0.5.1           generics_0.1.3        htmltools_0.5.4      
##> [10] yaml_2.3.6            mgcv_1.8-41           utf8_1.2.2           
##> [13] noisyr_1.0.0          rlang_1.0.6           jquerylib_0.1.4      
##> [16] pillar_1.8.1          later_1.3.0           glue_1.6.2           
##> [19] withr_2.5.0           DBI_1.1.3             foreach_1.5.2        
##> [22] lifecycle_1.0.3       stringr_1.5.0         munsell_0.5.0        
##> [25] gtable_0.3.1          codetools_0.2-18      evaluate_0.19        
##> [28] labeling_0.4.2        knitr_1.41            fastmap_1.1.0        
##> [31] httpuv_1.6.7          fansi_1.0.3           highr_0.9            
##> [34] preprocessCore_1.60.0 Rcpp_1.0.9            xtable_1.8-4         
##> [37] scales_1.2.1          promises_1.2.0.1      cachem_1.0.6         
##> [40] jsonlite_1.8.4        bulkAnalyseR_1.1.0    farver_2.1.1         
##> [43] mime_0.12             ggplot2_3.4.0         digest_0.6.31        
##> [46] stringi_1.7.8         dplyr_1.0.10          shiny_1.7.3          
##> [49] grid_4.2.2            cli_3.4.1             tools_4.2.2          
##> [52] magrittr_2.0.3        philentropy_0.7.0     sass_0.4.4           
##> [55] tibble_3.1.8          pkgconfig_2.0.3       Matrix_1.5-1         
##> [58] ellipsis_0.3.2        assertthat_0.2.1      rmarkdown_2.18       
##> [61] rstudioapi_0.14       iterators_1.0.14      R6_2.5.1             
##> [64] nlme_3.1-160          compiler_4.2.2