Creating Linear Regression Matrices from Segment Data

James Dalgleish

July 25, 2018

In our previous work, we created a small input matrix with segmented 1Mb regions as row labels and sample names from TARGET data as column labels. We can load it with the following code:

library(CNVScope)
nbl_input_matrix[1:5,1:5]
##                      PASEGA   PASRFS   PARHYL   PARVME PATGJU
## chr1_1_1000000         0.77 1.226667 0.990000 1.304444   1.09
## chr1_1000001_2000000   0.77 1.000000 1.053333 1.180000   1.09
## chr1_2000001_3000000   0.77 1.000000 0.850000 1.000000   1.09
## chr1_3000001_4000000   0.77 1.000000 0.850000 1.000000   1.09
## chr1_4000001_5000000   0.77 1.000000 0.850000 1.000000   1.09

calcVecLMs() comes standard in the CNVScope package. It can calculate the matrix with parallel processing using mclapply, but larger matrices require more computing power, so we use slurm_apply() from the rslurm package to distribute the work over multiple nodes. Our particular institution has a limit of approximately 1,000 jobs, so it is best not to use more than that unless your cluster supports it; conversely, use fewer nodes if your cluster cannot accept that many individual jobs in a single job array. In this particular example, I have removed rows with no variation in segmentation data across samples (zero row standard deviation), using rowSds() from the matrixStats package.

library(parallel)
library(matrixStats) # provides rowSds(), used to drop zero-variance rows
nbl_slurm_object_test_zero_removed <- calcVecLMs(
  bin_data = as.data.frame(t(nbl_input_matrix[which(rowSds(as.matrix(nbl_input_matrix)) != 0.0), ])),
  use_slurm = T, n_nodes = 975, memory_per_node = "32g", walltime = "04:00:00",
  cpus_on_each_node = 2, job_finished = F, slurmjob = NULL)
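
For smaller matrices that fit on a single machine, the same calculation can in principle be run locally by setting use_slurm = FALSE. The sketch below is hedged: it uses only arguments shown in the call above, and the local result may still need to be coerced into a matrix, as in the retrieval step later in this tutorial.

# Minimal local sketch, restricted to the first 25 nonzero-variance bins to keep the run small.
nonzero_rows <- which(rowSds(as.matrix(nbl_input_matrix)) != 0.0)
nbl_local_lm <- calcVecLMs(
  bin_data = as.data.frame(t(nbl_input_matrix[nonzero_rows[1:25], ])),
  use_slurm = FALSE)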

Saving the slurm object is essential as it will be required when you retrieve your results.

saveRDS(nbl_slurm_object_test_zero_removed,"nbl_slurm_object_test_zero_removed.rds")

Retrieving the data is as simple as calling rslurm::get_slurm_out() on the saved slurm object and coercing the result into a matrix with the original number of columns. The slurm object must first be read back in with readRDS(), or still be present in the current session. For this tutorial, we have chosen to work with a small version of the whole matrix in order to meet the 5 MB CRAN documentation size limit. Previous versions of the tutorial included the whole matrix, but we now leave its construction to the user. For reproducibility, the original full data matrix can be found at https://github.com/jamesdalg/CNVScope_public_data.
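
If you are collecting results in a new session, re-load the saved slurm object first (the filename matches the saveRDS() call above):

library(rslurm) # provides get_slurm_out()
nbl_slurm_object_test_zero_removed <- readRDS("nbl_slurm_object_test_zero_removed.rds")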

library(matrixStats)
nbl_result_matrix <- matrix(
  as.numeric(unlist(get_slurm_out(nbl_slurm_object_test_zero_removed))),
  ncol = ncol(as.data.frame(t(nbl_input_matrix[which(rowSds(as.matrix(nbl_input_matrix)) != 0.0), ]))))

saveRDS(nbl_result_matrix,"nbl_result_matrix_full.rds")
saveRDS(nbl_result_matrix[1:25,1:25],"nbl_result_matrix_small.rds")

You’ll notice that there are no signs in this matrix (the values are negative log p-values, which are always positive). We will assign signs using the correlation matrix next, and then chunk the large matrix into smaller, flattened matrices that the shiny app can handle. For lower-capacity machines or clusters, an alternative may be to use the cor() function.
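
As a rough illustration of that alternative (a sketch only, since correlations are not the same statistic as the negative log p-values above), a signed Pearson correlation matrix between bins can be computed directly:

# Hedged sketch: plain Pearson correlations between 1Mb bins (samples as rows).
# These values are signed correlations, not -log10 p-values.
nonzero_rows <- which(rowSds(as.matrix(nbl_input_matrix)) != 0.0)
bin_cor_matrix <- cor(t(nbl_input_matrix[nonzero_rows, ]), method = "pearson")
bin_cor_matrix[1:3, 1:3]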

In order to perform sign correction, fix the Inf values to a viewable value, and restore column and row names, postProcessLinRegMatrix() can be applied, yielding a final full matrix of the entire genome (chr1 through chrX on both axes). A replacement value of 300 has been used here, although something a bit smaller will reduce saturation issues, depending on how far the lowest values in the matrix fall below 300. We’ll plot the result below using ComplexHeatmap and a custom-designed function, signedRescale(), which takes large, asymmetric distributions of values and maps them into the [0,1] colorspace: 0.5 (white) corresponds to zero, values between 0 and 0.5 correspond to negative values, and values between 0.5 and 1 correspond to positive values.

nbl_result_matrix_small<-readRDS("nbl_result_matrix_small.rds")
nbl_result_matrix_small[1:5,1:5]
##          [,1]       [,2]       [,3]       [,4]       [,5]
## [1,]      Inf   4.766959   3.743363   3.800939   3.800939
## [2,] 4.766959        Inf 199.883186 176.388907 176.388907
## [3,] 3.743363 199.883186        Inf 299.519384 299.519384
## [4,] 3.800939 176.388907 299.519384        Inf        Inf
## [5,] 3.800939 176.388907 299.519384        Inf        Inf
nbl_result_matrix_sign_corrected <- postProcessLinRegMatrix(
  input_matrix = nbl_input_matrix[1:25, 1:25], LM_mat = nbl_result_matrix_small,
  cor_type = "pearson", inf_replacement_val = 300)
nbl_result_matrix_sign_corrected[1:5,1:5]
##                      chr1_1_1000000 chr1_1000001_2000000 chr1_2000001_3000000
## chr1_1_1000000           300.000000             4.766959             3.743363
## chr1_1000001_2000000       4.766959           300.000000           199.883186
## chr1_2000001_3000000       3.743363           199.883186           300.000000
## chr1_3000001_4000000       3.800939           176.388907           299.519384
## chr1_4000001_5000000       3.800939           176.388907           299.519384
##                      chr1_3000001_4000000 chr1_4000001_5000000
## chr1_1_1000000                   3.800939             3.800939
## chr1_1000001_2000000           176.388907           176.388907
## chr1_2000001_3000000           299.519384           299.519384
## chr1_3000001_4000000           300.000000           300.000000
## chr1_4000001_5000000           300.000000           300.000000
if (requireNamespace("ComplexHeatmap", quietly = TRUE) &&
    requireNamespace("circlize", quietly = TRUE)) {
  ComplexHeatmap::Heatmap(signedRescale(as.matrix(nbl_result_matrix_sign_corrected)),
                          col = circlize::colorRamp2(c(0, 0.5, 1), c("blue", "white", "red")),
                          cluster_rows = F, cluster_columns = F,
                          show_heatmap_legend = F,
                          show_column_names = F,
                          show_row_names = F)
} else {
  print("ComplexHeatmap not installed. Please install ComplexHeatmap in order to create this plot.")
}
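
For intuition, here is a minimal sketch of the rescaling idea; it is not the packaged signedRescale() implementation, just the mapping concept: negative values are squeezed into [0, 0.5), zero maps to 0.5, and positive values into (0.5, 1], each scaled by the most extreme value of that sign.

# Hedged sketch of the mapping concept only; use CNVScope::signedRescale() in practice.
signedRescaleSketch <- function(m) {
  out <- matrix(0.5, nrow = nrow(m), ncol = ncol(m)) # 0.5 (white) corresponds to zero
  pos <- m > 0
  neg <- m < 0
  if (any(pos)) out[pos] <- 0.5 + 0.5 * m[pos] / max(m[pos]) # positives -> (0.5, 1]
  if (any(neg)) out[neg] <- 0.5 - 0.5 * m[neg] / min(m[neg]) # negatives -> [0, 0.5)
  out
}
signedRescaleSketch(matrix(c(-300, -5, 0, 5, 300), nrow = 1))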

Finally, the whole-genome matrix is too big to plot interactively with the plotly package without crashing most browsers, so we’ll need to break things apart a bit. A final function writes chromosomal pair heatmaps to disk, with genes from Ensembl (hg19 coordinates) encoded for each square in the matrix. Please only use this function on the WHOLE matrix, not on the small subset we have provided in the documentation.

if(!dir.exists("nbl_matrix_set")){dir.create("nbl_matrix_set")}
#setwd("nbl_matrix_set")
doMC::registerDoMC()
#use ONLY the whole matrix with chromosomes 1-X, not the small subset provided for documentation purposes.
createChromosomalMatrixSet(whole_genome_mat=nbl_result_matrix_sign_corrected,output_dir="nbl_matrix_set",prefix="nbl_")
list.files("nbl_matrix_set")

There should be 529 of these files (23 × 23 chromosome pairs) upon running the code. If some are missing, which can happen on a cluster, don’t hesitate to run the code again; the function detects chromosomal matrices that have already been written to disk and skips them.
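
A quick sanity check on the output directory used above:

# Count the chromosome-pair files written by createChromosomalMatrixSet().
length(list.files("nbl_matrix_set")) # expected: 529 (23 x 23 chromosome pairs)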