The aim of this vignette is to introduce the basic steps involved in fitting GLM-PCA model to single-cell RNA-seq data using fastglmpca. (See Townes et al 2019 or Collins et al 2001 for a detailed description of the GLM-PCA model.)

To begin, load the packages that are needed.

library(Matrix)
library(fastglmpca)
library(ggplot2)
library(cowplot)

Set the seed so that the results can be reproduced.

set.seed(1)

The example data set

We will illustrate fastglmpca using a single-cell RNA-seq data set from Zheng et al (2017). These data are reference transcriptome profiles from 10 bead-enriched subpopulations of peripheral blood mononuclear cells (PBMCs). The original data set is much larger; for this introduction, we have taken a subset of roughly 3,700 cells.

The data we will analyze are unique molecular identifier (UMI) counts. These data are stored as an \(n \times m\) sparse matrix, where \(n\) is the number of genes and \(m\) is the number of cells:

data(pbmc_facs)
dim(pbmc_facs$counts)
# [1] 16791  3774

The UMI counts are “sparse”—that is, most of the counts are zero. Indeed, over 95% of the UMI counts are zero:

mean(pbmc_facs$counts > 0)
# [1] 0.04265257

For the purposes of this vignette only, we randomly subset the data further to reduce the running time:

counts <- pbmc_facs$counts
n      <- nrow(counts)
rows   <- sample(n,3000)
counts <- counts[rows,]

Now we have a 3,000 x 3,774 counts matrix:

dim(counts)
# [1] 3000 3774

Initializating and fitting the GLM-PCA model

Since no preprocessing of UMI counts is needed (e.g., a log-transformation), the first step is to initialize the model fit using init_glmpca_pois(). This function has many input arguments and options, but here we will keep all settings at the defaults, and we set K, the rank of the matrix factorization, to 2:

fit0 <- init_glmpca_pois(counts,K = 2)

By default, init_glmpca_pois() adds gene- (or row-) specific intercept, and a fixed cell- (or column-) specific size-factor. This is intended to mimic the defaults in glmpca. init_glmpca_pois() has many other options which we do not demonstrate here.

Once we have initialized the model, we are ready to run the optimization algorithm to fit the model (i.e., estimate the model parameters). This is accomplished by a call to fit_glmpca_pois():

fit <- fit_glmpca_pois(counts,fit0 = fit0)

If you prefer not to wait for the model optimization (it may take several minutes to run), you are welcome to load the previously fitted model (which is the output from the fit_glmpca_pois call above):

fit <- pbmc_facs$fit

The return value of fit_glmpca_pois() resembles the output of svd() and similar functions, with a few other outputs giving additional information about the model:

names(fit)
#  [1] "U"            "V"            "fixed_b_cols" "fixed_w_cols" "loglik"      
#  [6] "progress"     "X"            "B"            "Z"            "W"           
# [11] "d"

In particular, the outputs that are capital letters the low-rank reconstruction of the counts matrix:

fitted_counts <- with(fit,
  exp(tcrossprod(U %*% diag(d),V) +
      tcrossprod(X,B) +
      tcrossprod(W,Z)))

Let’s compare (a random subset of) the reconstructed (“fitted”) counts versus the observed counts:

i <- sample(prod(dim(counts)),2e4)
pdat <- data.frame(obs    = as.matrix(counts)[i],
                   fitted = fitted_counts[i])
ggplot(pdat,aes(x = obs,y = fitted)) +
  geom_point() +
  geom_abline(intercept = 0,slope = 1,color = "magenta",linetype = "dotted") +
  labs(x = "observed count",y = "fitted count") +
  theme_cowplot(font_size = 12)

The U and V outputs in particular are interesting because they give low-dimensional (in this case, 2-d) embeddings of the genes and cells, respectively. Let’s compare this 2-d embedding of the cells the provided cell-type labels:

celltype_colors <- c("forestgreen","dodgerblue","darkmagenta",
                     "gray","hotpink","red")
celltype <- as.character(pbmc_facs$samples$celltype)
celltype[celltype == "CD4+/CD25 T Reg" |
         celltype == "CD4+ T Helper2" |
         celltype == "CD8+/CD45RA+ Naive Cytotoxic" |
         celltype == "CD4+/CD45RA+/CD25- Naive T" |
         celltype == "CD4+/CD45RO+ Memory"] <- "T cell"
celltype <- factor(celltype)
pdat <- data.frame(celltype = celltype,
                   pc1 = fit$V[,1],
                   pc2 = fit$V[,2])
ggplot(pdat,aes(x = pc1,y = pc2,color = celltype)) +
  geom_point() +
  scale_color_manual(values = celltype_colors) +
  theme_cowplot(font_size = 10)