MAnorm2 for Normalizing and Comparing ChIP-seq Samples

Shiqi Tu

2022-10-28

Abstract

Eukaryotic gene transcription is regulated by a large cohort of chromatin associated proteins, and inferring their differential binding sites between cellular contexts requires a rigorous comparison of the corresponding ChIP-seq samples. The package MAnorm2 is primarily developed for quantitatively comparing groups of ChIP-seq samples (e.g., groups of biological replicates corresponding to different cellular contexts). Technically, MAnorm2 uses a hierarchical strategy for normalization of ChIP-seq samples, and it assesses within-group variability of ChIP-seq signals under an empirical Bayes framework, in which MAnorm2 considers the abundance of differential ChIP-seq signals between groups of samples and the possibility of different within-group variability between groups. Another capability of MAnorm2 is to identify hypervariable ChIP-seq signals across samples, which, for example, is essential to dissecting the epigenetic heterogeneity across cancer patients as well as revealing potential sub-structures associated with them. This vignette explains the working principle of MAnorm2 1.2.2 and demonstrates the use of it.

Introduction
MAnorm2 Capability
MAnorm2 Model Formulation
Citation
Acknowledgments
Session Info
References

Introduction

The System of MAnorm2 Machinery

The capabilities of MAnorm2 result primarily from two basic utilities implemented in it. One is for the normalization of ChIP-seq samples, and the other is for modeling the mean-variance trend associated with normalized ChIP-seq signal intensities. Several downstream analyses could be performed based on these two utilities, including

differential ChIP-seq analysis between individual samples or groups of samples;
identification of genomic regions with hypervariable ChIP-seq signals across individual samples or groups of samples;
clustering of individual ChIP-seq samples or groups of samples.

We’d like to emphasize here that each of the above analyses takes advantage of the observed mean-variance trend to improve the assessment of within-group variability (i.e., the variation of ChIP-seq signals across samples of the same group). In practice, the strategy could compensate for the lack of sufficient replicates for accurately assessing within-group variability.

Input Data

For employing the machinery implemented in MAnorm2, you need to prepare a table that profiles the ChIP-seq signal in each of a list of genomic intervals for each of a set of ChIP-seq samples. The H3K27Ac dataset bundled with MAnorm2 provides such an instance:

library(MAnorm2)

#> MAnorm2 1.2.2 2022-10-28

head(H3K27Ac)

#>   chrom  start    end GM12890_H3K27Ac_1.read_cnt GM12891_H3K27Ac_1.read_cnt
#> 1  chr1  29023  29346                          8                         16
#> 2  chr1 712983 715873                        440                        498
#> 3  chr1 740056 741095                         81                         54
#> 4  chr1 751252 753001                          2                        139
#> 5  chr1 760716 764177                        234                        329
#> 6  chr1 800556 801985                          4                         26
#>   GM12891_H3K27Ac_2.read_cnt GM12892_H3K27Ac_1.read_cnt GM12892_H3K27Ac_2.read_cnt
#> 1                          9                         23                         22
#> 2                        477                        439                        508
#> 3                         39                         44                         40
#> 4                         84                         11                         11
#> 5                        350                        326                        376
#> 6                         59                         16                         19
#>   GM12890_H3K27Ac_1.occupancy GM12891_H3K27Ac_1.occupancy GM12891_H3K27Ac_2.occupancy
#> 1                           0                           0                           0
#> 2                           1                           1                           1
#> 3                           1                           0                           0
#> 4                           0                           1                           0
#> 5                           1                           1                           1
#> 6                           0                           0                           1
#>   GM12892_H3K27Ac_1.occupancy GM12892_H3K27Ac_2.occupancy
#> 1                           1                           1
#> 2                           1                           1
#> 3                           1                           0
#> 4                           0                           0
#> 5                           1                           1
#> 6                           0                           0

To be specific, each row of the above table represents a genomic interval; each of the read_cnt variables corresponds to a ChIP-seq sample and records the numbers of reads from the sample that fall within the genomic intervals (i.e., the raw read counts for the sample); the occupancy variables correspond to the read_cnt variables one by one and specify the occupancy status of each interval in each sample (an occupancy status of 1 indicates that the interval is enriched with reads in the sample). In practice, the occupancy status of a genomic interval in a certain ChIP-seq sample could be determined by its overlap with the peaks (Zhang et al. 2008) of the sample. Note also that MAnorm2 refers to an interval as occupied by a sample if the interval is enriched with reads in the sample.

MAnorm2_utils is specifically designed to coordinate with MAnorm2, and we strongly recommend using it to create input tables of MAnorm2.

Note also that, although the above table records raw read counts, MAnorm2 does not impose a restriction that the input measurements of ChIP-seq signals must be integers (see also the section of Continuous Distribution below).

Application Scope

Although MAnorm2 has been designed to process ChIP-seq data, it could be applied in principle to the analysis of any type of data with a similar structure, including DNase-seq, ATAC-seq and RNA-seq data. The only problem associated with such extensions is how to naturally define “peaks” for specific data types.

Most of the peak callers originally devised for ChIP-seq data (e.g., MACS 1.4) also work for DNase-seq and ATAC-seq data. For RNA-seq data, each row of the input table should stand for a gene, and we recommend setting a cutoff (e.g., 20) of raw read count to define “peak” genes.

Continuous Distribution

In spite of the discrete nature of read counts, MAnorm2 uses continuous distribution to model ChIP-seq data by first transforming raw read counts into raw signal intensities. By default, MAnorm2 completes the transformation by simply adding an offset count to each raw count and taking a base-2 logarithm. Practical ChIP-seq data sets, however, may be associated with various confounding factors, including batch effects, local sequence compositions and background signals measured by input samples. On this account, the MAnorm2 machinery has been designed to be independent of the specific transformation employed. And any methods for correcting for confounding factors could be applied before invoking MAnorm2, as long as the resulting signal intensities could be approximately modeled as following the normal distribution (in particular, consider carefully whether it is necessary to apply a logarithmic transformation in the final step). In the extreme case, you can even accomplish the normalization of ChIP-seq data by yourself and use MAnorm2, for example, only for the following differential analysis.

The primary reason for which MAnorm2 models ChIP-seq signals as continuous random variables is that the mathematical theory of count distributions is far less tractable than that of the normal distribution. For example, current statistical methods based on the negative binomial distribution are frequently relied on approximations of various kinds. Specifically, variance (or dispersion) estimates for individual genomic intervals are typically treated as known parameters, and their uncertainty can hardly be incorporated into the statistical tests for identifying differential signals.

Besides, after an extensive correction for confounding factors, the resulting data range is almost certainly not limited to non-negative integers, and the data may have lost their discrete nature and be more akin to a continuous distribution. Moreover, transforming read counts towards the normal distribution unlocks the application of a large repository of mature statistical methods that are initially developed for analyzing continuous measurements (e.g., intensity data from microarray experiments). Refer to the voom article (Law et al. 2014) for a detailed discussion of this topic.

MAnorm2 Capability

This section explains in detail the working principle of MAnorm2 and demonstrates the use of it for various analyses. Note that all demonstrations are based on the H3K27Ac dataset bundled with MAnorm2 (see also the section of Input Data):

library(MAnorm2)
head(H3K27Ac)

#>   chrom  start    end GM12890_H3K27Ac_1.read_cnt GM12891_H3K27Ac_1.read_cnt
#> 1  chr1  29023  29346                          8                         16
#> 2  chr1 712983 715873                        440                        498
#> 3  chr1 740056 741095                         81                         54
#> 4  chr1 751252 753001                          2                        139
#> 5  chr1 760716 764177                        234                        329
#> 6  chr1 800556 801985                          4                         26
#>   GM12891_H3K27Ac_2.read_cnt GM12892_H3K27Ac_1.read_cnt GM12892_H3K27Ac_2.read_cnt
#> 1                          9                         23                         22
#> 2                        477                        439                        508
#> 3                         39                         44                         40
#> 4                         84                         11                         11
#> 5                        350                        326                        376
#> 6                         59                         16                         19
#>   GM12890_H3K27Ac_1.occupancy GM12891_H3K27Ac_1.occupancy GM12891_H3K27Ac_2.occupancy
#> 1                           0                           0                           0
#> 2                           1                           1                           1
#> 3                           1                           0                           0
#> 4                           0                           1                           0
#> 5                           1                           1                           1
#> 6                           0                           0                           1
#>   GM12892_H3K27Ac_1.occupancy GM12892_H3K27Ac_2.occupancy
#> 1                           1                           1
#> 2                           1                           1
#> 3                           1                           0
#> 4                           0                           0
#> 5                           1                           1
#> 6                           0                           0

This dataset profiles H3K27Ac ChIP-seq signals on a genome wide scale for three human lymphoblastoid cell lines (LCLs), each derived from a separate Caucasian individual (associated ChIP-seq data obtained from (Kasowski et al. 2013)). For meta information regarding these cell lines, type

attr(H3K27Ac, "metaInfo")

#>   individual population gender                 cellType
#> 1    GM12890  Caucasian female lymphoblastoid cell line
#> 2    GM12891  Caucasian   male lymphoblastoid cell line
#> 3    GM12892  Caucasian female lymphoblastoid cell line

For details about the generation of this dataset, type ?H3K27Ac.

Comparing Groups of ChIP-seq Samples

Quick Start

Here we show the standard workflow for a differential ChIP-seq analysis between two groups of samples. We use the comparison between the H3K27Ac ChIP-seq samples for GM12891 LCL and those for GM12892 LCL as example:

# Perform within-group normalization.
norm <- normalize(H3K27Ac, count = 5:6, occupancy = 10:11)
norm <- normalize(norm, count = 7:8, occupancy = 12:13)

# Construct a bioCond for each group of samples.
conds <- list(GM12891 = bioCond(norm[5:6], norm[10:11], name = "GM12891"),
              GM12892 = bioCond(norm[7:8], norm[12:13], name = "GM12892"))

# Perform between-group normalization.
# Restrict common peak regions to autosomes only when the two groups
# being compared are associated with different genders.
autosome <- !(H3K27Ac$chrom %in% c("chrX", "chrY"))
conds <- normBioCond(conds, common.peak.regions = autosome)

# Fit a mean-variance curve.
# If the following function call raises an error,
# set init.coef = c(0.1, 10) or try method = "local".
conds <- fitMeanVarCurve(conds, method = "parametric", occupy.only = TRUE)

#> During the parametric fit procedure:
#> After iteration 1: coef = (0.0133814, 5.43493); 971 (0.86%) outlier(s) detected
#> After iteration 2: coef = (0.013118, 5.45959); 971 (0.86%) outlier(s) detected
#> After iteration 3: coef = (0.0131117, 5.45988); 971 (0.86%) outlier(s) detected
#> Converged.

# Perform differential tests.
res <- diffTest(conds[[1]], conds[[2]])
head(res)

#>   GM12891.mean GM12892.mean       Mval   Mval.se     Mval.t         pval         padj
#> 1     2.952943     4.378686  1.4257425 0.6959293  2.0486888 4.049255e-02 7.281479e-02
#> 2     8.594610     8.842037  0.2474271 0.1687811  1.4659641 1.426581e-01 2.131498e-01
#> 3     4.970712     5.282421  0.3117088 0.4302032  0.7245617 4.687210e-01 5.638290e-01
#> 4     6.279446     3.357143 -2.9223031 0.4751332 -6.1504920 7.724294e-10 5.889993e-09
#> 5     8.038011     8.401049  0.3630385 0.1853013  1.9591797 5.009175e-02 8.753281e-02
#> 6     4.733793     4.015697 -0.7180965 0.5494122 -1.3070268 1.912036e-01 2.723929e-01

Note: rows of the above table of differential analysis results correspond to the genomic intervals in H3K27Ac one by one with the same order. See also a detailed explanation below.

Step-by-step Explanation and Visualization

Normalization

MAnorm2 normalizes two individual ChIP-seq samples by removing the overall M-A trend (M and A values refer to log2 fold change and average log2 read count, respectively) associated with their common peak regions, which are the genomic intervals occupied by both of them. For normalization of a set of any number of ChIP-seq samples, MAnorm2 selects one of the samples as baseline and normalizes each other sample against it. Taking the comparison of H3K27Ac ChIP-seq signals between GM12891 and GM12892 LCLs as example, you may choose to normalize all the related samples once for all, by supplying raw read counts and occupancy states associated with the samples:

# One-step normalization.
autosome <- !(H3K27Ac$chrom %in% c("chrX", "chrY"))
norm <- normalize(H3K27Ac, count = 5:8, occupancy = 10:13,
                  common.peak.regions = autosome)

Note: here we exclude the genomic intervals in sex chromosomes from common peak regions, since these ChIP-seq samples are associated with different genders.

By default, MAnorm2 uses the median-ratio strategy (Anders and Huber 2010) to estimate the size factor of each ChIP-seq sample and selects the sample whose log2 size factor is closest to 0 as baseline. In practice, you can avoid using a sample with bad quality as baseline by explicitly specifying the baseline argument of normalize(). Besides, when the number of samples to be normalized is large (e.g., >5), you can reduce the variability of normalization results by setting baseline to "pseudo-reference", in which case MAnorm2 constructs a pseudo ChIP-seq profile as baseline by “averaging” all the samples (type ?normalize for details).

Check some information regarding the performed normalization by

names(attributes(norm))

#> [1] "names"       "row.names"   "metaInfo"    "class"       "size.factor" "baseline"   
#> [7] "norm.coef"   "MA.cor"

attributes(norm)[5:8]

#> $size.factor
#> GM12891_H3K27Ac_1.read_cnt GM12891_H3K27Ac_2.read_cnt GM12892_H3K27Ac_1.read_cnt 
#>                  1.0914779                  1.0854059                  0.8836196 
#> GM12892_H3K27Ac_2.read_cnt 
#>                  0.9574468 
#> 
#> $baseline
#> [1] "GM12892_H3K27Ac_2.read_cnt"
#> 
#> $norm.coef
#>                                slope  intercept common.peak.regions
#> GM12891_H3K27Ac_1.read_cnt 1.0211083 -0.4409282               41759
#> GM12891_H3K27Ac_2.read_cnt 1.0264489 -0.4604073               42752
#> GM12892_H3K27Ac_1.read_cnt 0.9562856  0.4758901               46259
#> GM12892_H3K27Ac_2.read_cnt 1.0000000  0.0000000               51554
#> 
#> $MA.cor
#>                            GM12891_H3K27Ac_1.read_cnt GM12891_H3K27Ac_2.read_cnt
#> GM12891_H3K27Ac_1.read_cnt                0.000000000                -0.03017988
#> GM12891_H3K27Ac_2.read_cnt                0.007119437                 0.00000000
#> GM12892_H3K27Ac_1.read_cnt                0.016129812                 0.01551055
#> GM12892_H3K27Ac_2.read_cnt                0.000000000                 0.00000000
#>                            GM12892_H3K27Ac_1.read_cnt GM12892_H3K27Ac_2.read_cnt
#> GM12891_H3K27Ac_1.read_cnt                  0.1035251                 0.03699763
#> GM12891_H3K27Ac_2.read_cnt                  0.1179751                 0.04865985
#> GM12892_H3K27Ac_1.read_cnt                  0.0000000                -0.20926760
#> GM12892_H3K27Ac_2.read_cnt                  0.0000000                 0.00000000

# This statement requires the gplots package (>= 3.0.1).
plot(attr(norm, "MA.cor"), symbreaks = TRUE, margins = c(8, 8),
     cexRow = 1, cexCol = 1)

The norm.coef attribute records the linear transformation applied to the log2 read counts of each ChIP-seq sample as well as the number of common peak regions between each sample and the baseline. The MA.cor attribute is a matrix recording the Pearson correlation coefficient (PCC) between M and A values across the common peak regions of each pair of samples. The upper and lower triangles of this matrix are calculated from raw and normalized log2 read counts, respectively. In general, it indicates a good normalization performance that the M-A PCCs are all close to 0 after normalization.

We can also draw MA plots to visualize the normalization effects. Here we use the two biological replicates of GM12892 LCL as example:

# Before normalization.
raw <- log(H3K27Ac[7:8] + 0.5, base = 2)
MAplot(raw[[1]], raw[[2]], norm[[12]], norm[[13]], ylim = c(-2, 2),
       main = "Before normalization")
abline(h = 0, lwd = 2, lty = 5)

# After normalization.
MAplot(norm[[7]], norm[[8]], norm[[12]], norm[[13]], ylim = c(-2, 2),
       main = "After normalization")
abline(h = 0, lwd = 2, lty = 5)

In comparison to this one-step normalization, we prefer to adopt a hierarchical normalization process that takes advantage of the similarity structure among samples. Specifically, we first separately normalize the samples of each LCL:

# Within-group normalization.
norm <- normalize(H3K27Ac, count = 5:6, occupancy = 10:11)
norm <- normalize(norm, count = 7:8, occupancy = 12:13)

The key data type designed by MAnorm2 is bioCond, which is for grouping ChIP-seq samples belonging to the same biological condition. We next construct a bioCond object for each LCL to group its biological replicates, by supplying their normalized signal intensities and occupancy states:

# Construct a bioCond for each LCL.
conds <- list(GM12891 = bioCond(norm[5:6], norm[10:11], name = "GM12891"),
              GM12892 = bioCond(norm[7:8], norm[12:13], name = "GM12892"))

Note: bioCond objects do not design a data field to specifically record the associated genomic intervals. The list and order of genomic intervals in a bioCond are the same as with the signal intensities for constructing it and will never be changed. In principle, all the MAnorm2 functions accepting multiple bioCond objects expect them to be associated with the same list and order of genomic intervals (e.g., normBioCond()), and it is your job to make sure of that. Note also that all the MAnorm2 functions applying a statistical test to each individual genomic interval generate a table that is associated with the same list and order of intervals as with the supplied bioCond(s) (e.g., diffTest()).

We can summarize a bioCond by

summary(conds$GM12891)

#> 
#> Biological condition GM12891:
#> 73828 genomic intervals with 2 profiles
#> 
#> Occupancy states:
#> 57577 (78.0%) of the genomic intervals occupied by the condition
#> 
#> Summary of mean signal intensities at the genomic intervals:
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#> -0.9807  6.1390  7.3980  7.2557  8.6744 12.1563 
#> 
#> Summary of standard deviations of signal intensities at the genomic intervals:
#>      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
#> 0.0000072 0.0719361 0.1573988 0.2179675 0.2927150 3.0115431 
#> 
#> The genomic intervals are associated with the same structure matrix:
#>                            GM12891_H3K27Ac_1.read_cnt GM12891_H3K27Ac_2.read_cnt
#> GM12891_H3K27Ac_1.read_cnt                          1                          0
#> GM12891_H3K27Ac_2.read_cnt                          0                          1
#> 
#> The condition is not yet associated with a mean-variance curve.

Note: as indicated in the summary, MAnorm2 defines the occupancy states of genomic intervals in each bioCond, which are determined by the number of samples in the bioCond occupying each interval (see the occupy.num argument of bioCond()). The occupancy states of genomic intervals in bioCond objects matter for the following between-group normalization and mean-variance curve (MVC) fitting. When the samples to be grouped into a bioCond are biological replicates for the same experiment, we recommend using the default setting, which is occupy.num = 1.

Finally, we normalize the resulting bioCond objects to make them comparable between each other:

# Between-group normalization.
autosome <- !(H3K27Ac$chrom %in% c("chrX", "chrY"))
conds <- normBioCond(conds, common.peak.regions = autosome)

We can also draw MA plots on bioCond objects to visualize the normalization results:

MAplot(conds[[1]], conds[[2]], ylim = c(-12, 12), main = "GM12891 vs. GM12892")
abline(h = 0, lwd = 2, lty = 5)

Modeling Mean-Variance Trend

After normalization, MAnorm2 next models the relationship between mean signal intensities of genomic intervals and their variances by fitting a smooth mean-variance curve (MVC):

# Fit an MVC.
# The "parametric" method sometimes requires the users to explicitly specify
# initial coefficients. Try setting init.coef = c(0.1, 10) in these cases.
conds <- fitMeanVarCurve(conds, method = "parametric", occupy.only = TRUE)

#> During the parametric fit procedure:
#> After iteration 1: coef = (0.0133814, 5.43493); 971 (0.86%) outlier(s) detected
#> After iteration 2: coef = (0.013118, 5.45959); 971 (0.86%) outlier(s) detected
#> After iteration 3: coef = (0.0131117, 5.45988); 971 (0.86%) outlier(s) detected
#> Converged.

This function call associates an MVC with each bioCond in conds. After the call, all bioCond objects in conds are associated with the same MVC, as indicated by the MVC ID:

summary(conds$GM12891)

#> 
#> Biological condition GM12891:
#> 73828 genomic intervals with 2 profiles
#> 
#> Occupancy states:
#> 57577 (78.0%) of the genomic intervals occupied by the condition
#> 
#> Summary of mean signal intensities at the genomic intervals:
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>  -2.005   5.619   6.968   6.815   8.334  12.063 
#> 
#> Summary of standard deviations of signal intensities at the genomic intervals:
#>     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
#> 0.000008 0.077034 0.168554 0.233415 0.313460 3.224977 
#> 
#> The genomic intervals are associated with the same structure matrix:
#>                            GM12891_H3K27Ac_1.read_cnt GM12891_H3K27Ac_2.read_cnt
#> GM12891_H3K27Ac_1.read_cnt                          1                          0
#> GM12891_H3K27Ac_2.read_cnt                          0                          1
#> 
#> Function call for associating a mean-variance curve with the condition:
#>    fitMeanVarCurve(conds = conds, method = "parametric", occupy.only = TRUE)
#> 
#> Summary of the mean-variance curve:
#> Fit method = "parametric fit"
#> ID = "2022-10-28 16:14:58 Q4JXWP1BK1"
#> Number of prior dfs = Inf
#> Variance ratio factor = 1.19

summary(conds$GM12892)

#> 
#> Biological condition GM12892:
#> 73828 genomic intervals with 2 profiles
#> 
#> Occupancy states:
#> 55567 (75.3%) of the genomic intervals occupied by the condition
#> 
#> Summary of mean signal intensities at the genomic intervals:
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>  -1.269   5.554   6.901   6.860   8.350  12.057 
#> 
#> Summary of standard deviations of signal intensities at the genomic intervals:
#>     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
#> 0.000003 0.070686 0.155027 0.216504 0.290956 3.270204 
#> 
#> The genomic intervals are associated with the same structure matrix:
#>                            GM12892_H3K27Ac_1.read_cnt GM12892_H3K27Ac_2.read_cnt
#> GM12892_H3K27Ac_1.read_cnt                          1                          0
#> GM12892_H3K27Ac_2.read_cnt                          0                          1
#> 
#> Function call for associating a mean-variance curve with the condition:
#>    fitMeanVarCurve(conds = conds, method = "parametric", occupy.only = TRUE)
#> 
#> Summary of the mean-variance curve:
#> Fit method = "parametric fit"
#> ID = "2022-10-28 16:14:58 Q4JXWP1BK1"
#> Number of prior dfs = Inf
#> Variance ratio factor = 0.997

Note: for each bioCond that has been associated with an MVC, MAnorm2 uses a variance ratio factor to quantify its global within-group variability (relative to the MVC). When samples in each bioCond are biological replicates, the bioCond objects associated with the same MVC should have similar variance ratio factors. Otherwise, there might be batch effects and/or samples with bad quality.

To improve the unbiasedness of MVC fitting, MAnorm2 calculates observed means and variances of genomic intervals separately for each bioCond, and it pools the resulting mean-variance pairs from different bioCond objects (after adjusting for different global within-group variability) into a regression process. Currently, we provide two candidate methods for the regression process, which are "parametric fit" and "local regression". We also design the argument occupy.only to control whether to use all genomic intervals or only the occupied ones from each bioCond for the regression process. In cases where the samples in each bioCond are biological replicates, the underlying variance structure could be expected to be very regular, and we recommend using the "parametric fit" method, with setting occupy.only to TRUE to further enhance the data regularity. See the section of Combining Replicates and Using Local Regression below for an application scenario of local regression.

The number of prior degrees of freedom is used to assess the overall goodness of fit of the associated MVC. You can also visualize the mean-variance trend along with the MVC by

# Plot only occupied genomic intervals,
# as only these intervals have been used to fit the MVC.
plotMeanVarCurve(conds, subset = "occupied", ylim = c(-7, 0.5))

In practice, number of prior degrees of freedom amounts to the number of additional samples gained by borrowing information between genomic intervals, and it should be large (relative to the number of real samples) when samples in each bioCond are biological replicates.

Differential Tests

Finally, we call genomic intervals with differential signal intensities between the two bioCond objects by

res <- diffTest(conds[[1]], conds[[2]])

This function call performs a statistical test separately for each genomic interval, with the null hypothesis that the interval is non-differential between the supplied two bioCond objects. It returns a data frame that records the test results for each interval by each row:

head(res)

#>   GM12891.mean GM12892.mean       Mval   Mval.se     Mval.t         pval         padj
#> 1     2.952943     4.378686  1.4257425 0.6959293  2.0486888 4.049255e-02 7.281479e-02
#> 2     8.594610     8.842037  0.2474271 0.1687811  1.4659641 1.426581e-01 2.131498e-01
#> 3     4.970712     5.282421  0.3117088 0.4302032  0.7245617 4.687210e-01 5.638290e-01
#> 4     6.279446     3.357143 -2.9223031 0.4751332 -6.1504920 7.724294e-10 5.889993e-09
#> 5     8.038011     8.401049  0.3630385 0.1853013  1.9591797 5.009175e-02 8.753281e-02
#> 6     4.733793     4.015697 -0.7180965 0.5494122 -1.3070268 1.912036e-01 2.723929e-01

In this data frame, Mval could be interpreted as log2 fold change; pval assesses the statistical significance of each test; padj refers to adjusted p-value with the "BH" method.

You can visualize the overall test results by

MAplot(res, padj = 0.001)
abline(h = 0, lwd = 2, lty = 5, col = "green3")

We can see from this figure that differential ChIP-seq signals could be abundant even between very similar cellular contexts.

When No Replicates Are Available

MAnorm2 compares two individual ChIP-seq samples by treating them as replicates and fitting an MVC based on their common peak regions. This strategy is basically the same as used by DESeq (Anders and Huber 2010). Here we give the standard workflow for comparing two individual ChIP-seq samples. We use the comparison of the first replicates of GM12891 and GM12892 LCLs as example:

# Perform normalization and create bioConds to represent the two LCLs.
autosome <- !(H3K27Ac$chrom %in% c("chrX", "chrY"))
norm <- normalize(H3K27Ac, c(5, 7), c(10, 12), common.peak.regions = autosome)
conds <- list(GM12891 = bioCond(norm[5], norm[10], name = "GM12891"),
              GM12892 = bioCond(norm[7], norm[12], name = "GM12892"))

# Create a "blind" bioCond that treats the two samples as replicates and fit an
# MVC accordingly. Only common peak regions of the two samples are considered
# to be occupied by the "blind" bioCond, and only these regions are used to fit
# the MVC. This setting is for capturing underlying non-differential intervals
# as accurately as possible and avoiding over-estimation of prior variances
# (i.e., variances read from MVC).
conds$blind <- bioCond(norm[c(5, 7)], norm[c(10, 12)], occupy.num = 2,
                       name = "blind")
conds <- fitMeanVarCurve(conds, method = "parametric",
                         occupy.only = TRUE, init.coef = c(0.1, 10))

#> During the parametric fit procedure:
#> After iteration 1: coef = (0.209957, 17.6178); 586 (1.40%) outlier(s) detected
#> After iteration 2: coef = (0.179383, 20.7846); 598 (1.43%) outlier(s) detected
#> After iteration 3: coef = (0.172811, 21.6638); 601 (1.43%) outlier(s) detected
#> After iteration 4: coef = (0.171272, 21.8803); 601 (1.43%) outlier(s) detected
#> After iteration 5: coef = (0.171015, 21.92); 601 (1.43%) outlier(s) detected
#> Converged.

# Note the dramatic decrease of number of prior degrees of freedom.
summary(conds$blind)

#> 
#> Biological condition blind:
#> 73828 genomic intervals with 2 profiles
#> 
#> Occupancy states:
#> 41919 (56.8%) of the genomic intervals occupied by the condition
#> 
#> Summary of mean signal intensities at the genomic intervals:
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>  -1.463   5.558   6.841   6.839   8.271  12.060 
#> 
#> Summary of standard deviations of signal intensities at the genomic intervals:
#>     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
#> 0.000001 0.195670 0.467627 0.719394 0.960443 8.185305 
#> 
#> The genomic intervals are associated with the same structure matrix:
#>                            GM12891_H3K27Ac_1.read_cnt GM12892_H3K27Ac_1.read_cnt
#> GM12891_H3K27Ac_1.read_cnt                          1                          0
#> GM12892_H3K27Ac_1.read_cnt                          0                          1
#> 
#> Function call for associating a mean-variance curve with the condition:
#>    fitMeanVarCurve(conds = conds, method = "parametric", occupy.only = TRUE, 
#>     init.coef = c(0.1, 10))
#> 
#> Summary of the mean-variance curve:
#> Fit method = "parametric fit"
#> ID = "2022-10-28 16:15:27 LZIRE6PJ4A"
#> Number of prior dfs = 4.24
#> Variance ratio factor = 0.597

# Visualize mean-variance trend along with the fitted MVC.
plotMeanVarCurve(conds[3], subset = "occupied", ylim = c(-7, 1))

# Perform differential tests.
res2 <- diffTest(conds[[1]], conds[[2]])
head(res2)

#>   GM12891.mean GM12892.mean        Mval   Mval.se     Mval.t       pval      padj
#> 1     3.413499     4.554589  1.14109009 1.3627709  0.8373308 0.44702852 0.8272570
#> 2     8.617834     8.779719  0.16188555 0.5167515  0.3132754 0.76888679 0.9482468
#> 3     5.238002     5.475733  0.23773185 0.9178155  0.2590192 0.80771764 0.9582934
#> 4     6.673159     3.523562 -3.14959737 0.9836902 -3.2018183 0.03025285 0.4549266
#> 5     7.985621     8.350939  0.36531860 0.5431482  0.6725947 0.53610033 0.8689649
#> 6     4.136960     4.044394 -0.09256607 1.3188512 -0.0701869 0.94723279 0.9882270

# Visualize the overall test results.
# We use a cutoff of raw p-value here to select significant intervals.
MAplot(res2, pval = 0.01)
abline(h = 0, lwd = 2, lty = 5, col = "green3")

We can see from the last figure a dramatic decrease of statistical power for identifying differential genomic intervals, owing to the lack of true replicates. If you rank intervals in order of statistical significance, however, you will find that this differential analysis without replicates lead to very similar rankings to those from the previous analysis with replicates:

cor(res$pval, res2$pval, method = "spearman")

#> [1] 0.9094989

plot(-log10(res$pval), -log10(res2$pval), col = "#0000FF14", pch = 20,
     xlab = "With Reps", ylab = "Without Reps")

Simultaneous Comparison of Any Number of Groups

MAnorm2 can also be used to simultaneously compare more than two groups of ChIP-seq samples. Here we give the standard workflow for the cases where at least one of the groups to be compared contains two or more samples. We use the comparison of H3K27Ac ChIP-seq signals among GM12890, GM12891 and GM12892 LCLs as example:

# Perform within-group normalization.
norm <- normalize(H3K27Ac, count = 4, occupancy = 9)
norm <- normalize(norm, count = 5:6, occupancy = 10:11)
norm <- normalize(norm, count = 7:8, occupancy = 12:13)

# Construct a bioCond for each group of samples.
conds <- list(GM12890 = bioCond(norm[4], norm[9], name = "GM12890"),
              GM12891 = bioCond(norm[5:6], norm[10:11], name = "GM12891"),
              GM12892 = bioCond(norm[7:8], norm[12:13], name = "GM12892"))

# Perform between-group normalization.
autosome <- !(H3K27Ac$chrom %in% c("chrX", "chrY"))
conds <- normBioCond(conds, common.peak.regions = autosome)

# Fit an MVC.
conds <- fitMeanVarCurve(conds, method = "parametric", occupy.only = TRUE)

#> During the parametric fit procedure:
#> After iteration 1: coef = (0.0093174, 6.04374); 967 (0.85%) outlier(s) detected
#> After iteration 2: coef = (0.00909686, 6.07412); 968 (0.86%) outlier(s) detected
#> After iteration 3: coef = (0.00908926, 6.07399); 968 (0.86%) outlier(s) detected
#> Converged.

# Perform differential tests.
res <- aovBioCond(conds)
head(res)

#>   conds.mean between.ms   within.ms  prior.var posterior.var      mod.f         pval         padj
#> 1   3.873144 1.25318002 0.155228311 0.42360573    0.42360573  2.9583642 5.190375e-02 6.438846e-02
#> 2   8.837172 0.02174483 0.003556575 0.02236996    0.02236996  0.9720553 3.783047e-01 4.075928e-01
#> 3   5.845167 0.22961287 0.073940389 0.11474760    0.11474760  2.0010255 1.351966e-01 1.570769e-01
#> 4   3.976759 8.99272048 0.124250239 0.39487900    0.39487900 22.7733572 1.287231e-10 4.506744e-10
#> 5   8.231352 0.15434149 0.004612264 0.02930040    0.02930040  5.2675562 5.156196e-03 7.371500e-03
#> 6   3.993976 2.96216974 0.272820052 0.39030242    0.39030242  7.5894219 5.057734e-04 8.262211e-04

We can see that this workflow is basically the same as that for two-group comparisons, except the final procedure for performing differential tests (aovBioCond() is called instead of diffTest()).

You can visualize the overall test results by

plot(res, padj = 1e-6)

In cases where each group contains a single ChIP-seq sample, try the strategy of constructing “blind” bioCond objects (see the section of When No Replicates Are Available) or calling hypervariable ChIP-seq signals across the samples (see the section of Identifying Hypervariable ChIP-seq Signals below).

Combining Replicates and Using Local Regression

In practice, chances are that you want to combine biological replicates to get a reference ChIP-seq profile for each biological condition. For example, with ChIP-seq samples for tissues or cells obtained from different individuals, you can classify the individuals according to the age, gender, health status or disease subtype of each of them, and then perform a differential analysis between groups of individuals to identify differential ChIP-seq signals associated with the group characteristics. Suppose that each individual is associated with multiple biological replicates. A reasonable analysis strategy is to separately create a reference profile for each individual by combining the associated biological replicates.

Here we use the comparison of H3K27Ac ChIP-seq signals between male and female LCLs as example to demonstrate how to use MAnorm2 to perform such analyses:

# Use the regular routine for normalization and construction of bioConds.
norm <- normalize(H3K27Ac, count = 4, occupancy = 9)
norm <- normalize(norm, count = 5:6, occupancy = 10:11)
norm <- normalize(norm, count = 7:8, occupancy = 12:13)
conds <- list(GM12890 = bioCond(norm[4], norm[9], name = "GM12890"),
              GM12891 = bioCond(norm[5:6], norm[10:11], name = "GM12891"),
              GM12892 = bioCond(norm[7:8], norm[12:13], name = "GM12892"))
autosome <- !(H3K27Ac$chrom %in% c("chrX", "chrY"))
conds <- normBioCond(conds, common.peak.regions = autosome)

# Group LCLs into different genders.
genders <- list(male = cmbBioCond(conds[2], name = "male"),
                female = cmbBioCond(conds[c(1, 3)], name = "female"))

# Fit an MVC by using local regression.
genders <- fitMeanVarCurve(genders, method = "local", occupy.only = TRUE)

#> During the local regression procedure:
#> After iteration 1: 677 (1.04%) outlier(s) detected
#> After iteration 2: 676 (1.03%) outlier(s) detected
#> After iteration 3: 678 (1.04%) outlier(s) detected
#> After iteration 4: 678 (1.04%) outlier(s) detected
#> After iteration 5: 678 (1.04%) outlier(s) detected
#> Converged.

summary(genders$female)

#> 
#> Biological condition female:
#> 73828 genomic intervals with 2 profiles
#> 
#> Occupancy states:
#> 65370 (88.5%) of the genomic intervals occupied by the condition
#> 
#> Summary of mean signal intensities at the genomic intervals:
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>  -0.648   5.806   7.026   7.051   8.390  11.725 
#> 
#> Summary of standard deviations of signal intensities at the genomic intervals:
#>     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
#> 0.000006 0.238225 0.567852 0.833277 1.175785 8.283881 
#> 
#> The genomic intervals are associated with the same structure matrix:
#>         GM12890 GM12892
#> GM12890       1       0
#> GM12892       0       1
#> 
#> Function call for associating a mean-variance curve with the condition:
#>    fitMeanVarCurve(conds = genders, method = "local", occupy.only = TRUE)
#> 
#> Summary of the mean-variance curve:
#> Fit method = "local regression"
#> ID = "2022-10-28 16:16:16 PMGG5M6NNK"
#> Number of prior dfs = 13.7
#> Variance ratio factor = 0.849

plotMeanVarCurve(genders, subset = "occupied")

# Perform differential tests.
res <- diffTest(genders[[1]], genders[[2]])
head(res)

#>   male.mean female.mean       Mval   Mval.se      Mval.t      pval      padj
#> 1  3.655242    3.982095  0.3268530 3.2854629  0.09948461 0.9220941 0.9999813
#> 2  8.763611    8.873952  0.1103400 0.6733359  0.16387073 0.8720590 0.9999813
#> 3  5.482275    6.026613  0.5443382 1.6150539  0.33704026 0.7408384 0.9999813
#> 4  6.667296    2.631490 -4.0358058 2.4714076 -1.63299887 0.1236452 0.9999813
#> 5  8.259626    8.217214 -0.0424119 0.8176631 -0.05186965 0.9593291 0.9999813
#> 6  5.267752    3.357088 -1.9106638 2.7693612 -0.68992945 0.5009591 0.9999813

MAplot(res, pval = 0.01)
abline(h = 0, lwd = 2, lty = 5, col = "green3")

Note: cmbBioCond() is designed for combining a list of bioCond objects into a single bioCond, such that each of the supplied bioCond objects serves as an individual ChIP-seq sample in the combined bioCond. Technically, the function integrates the ChIP-seq samples contained in each bioCond into a reference ChIP-seq profile and groups the resulting profiles into a new bioCond object. Note that the argument list of bioCond objects to cmbBioCond() must be normalized to be comparable with each other before calling the function.

Here we use local regression to adaptively capture the mean-variance trend, as the dependence of ChIP-seq signal variability across individual LCLs on mean signal intensity is not as regular as in the previous case for modeling the variability across biological replicates. The above settings for employing local regression should be flexible enough to handle most practical cases. The following are some considerations regarding advanced usage of local regression.

In practice, good chances are that the underlying extrapolation algorithm of local regression results in overestimated prior variances for non-occupied genomic intervals. Naturally, performing the regression on all genomic intervals (rather than only occupied intervals) can avoid the problem:

genders2 <- fitMeanVarCurve(genders, method = "local", occupy.only = FALSE)

#> During the local regression procedure:
#> After iteration 1: 797 (1.08%) outlier(s) detected
#> After iteration 2: 797 (1.08%) outlier(s) detected
#> After iteration 3: 798 (1.08%) outlier(s) detected
#> After iteration 4: 798 (1.08%) outlier(s) detected
#> Converged.

plotMeanVarCurve(genders, subset = "non-occupied",
                 main = "Use occupied intervals")
plotMeanVarCurve(genders2, subset = "non-occupied",
                 main = "Use all intervals")

However, using all genomic intervals to fit MVC may considerably reduce the estimated number of prior degrees of freedom as well as the statistical power for identifying differential intervals, owing to the fact that ChIP-seq signal measurements in non-occupied intervals are generally of less data regularity compared with those in occupied intervals:

genders[[1]]$fit.info$df.prior

#> [1] 13.73788

genders2[[1]]$fit.info$df.prior

#> [1] 5.921202

To split the difference, you can perform local regression on all genomic intervals and re-estimate the number of prior degrees of freedom using only occupied intervals:

genders3 <- estimatePriorDf(genders2, occupy.only = TRUE)
plotMeanVarCurve(genders3, subset = "non-occupied",
                 main = "Re-estimate prior df")

genders3[[1]]$fit.info$df.prior

#> [1] 8.238743

Note: in fact, when calling fitMeanVarCurve() to fit an MVC, estimatePriorDf() is automatically invoked for the associated parameter estimation. There is also a robust version of estimatePriorDf(), named estimatePriorDfRobust(). It renders the estimation procedure robust to potential outliers by applying the Winsorization technique (Phipson et al. 2016). Type ?estimatePriorDfRobust for details.

Identifying Hypervariable ChIP-seq Signals

Since MAnorm2 1.1.0, one of the new changes is the implementation of HyperChIP, which is a method developed for identifying genomic intervals with hypervariable ChIP-seq signals across a set of samples. Compared with the old workflow shown in the vignette for MAnorm2 1.0.0, HyperChIP has made specific efforts to increase the statistical power for identifying hypervariable intervals (refer to ?estParamHyperChIP for details). Here we use all the H3K27Ac ChIP-seq samples as an example to demonstrate the standard workflow of HyperChIP:

# Normalize all ChIP-seq samples once for all.
# Considering the number of samples in a hypervariable ChIP-seq analysis is
# typically large, HyperChIP uses a pseudo-reference profile as baseline in the
# MA normalization process to reduce the variability of normalization results.
autosome <- !(H3K27Ac$chrom %in% c("chrX", "chrY"))
norm <- normalize(H3K27Ac, count = 4:8, occupancy = 9:13,
                  baseline = "pseudo-reference",
                  common.peak.regions = autosome)

# Construct a bioCond to group all the samples.
cond <- bioCond(norm[4:8], norm[9:13], occupy.num = 1,
                name = "all")

# Fit an MVC by using local regression.
# Set "nn = 1" to increase the smoothness of the resulting MVC.
cond <- fitMeanVarCurve(list(cond), method = "local",
                        occupy.only = TRUE, args.lp = list(nn = 1))[[1]]

#> During the local regression procedure:
#> After iteration 1: 97 (0.13%) outlier(s) detected
#> After iteration 2: 112 (0.15%) outlier(s) detected
#> After iteration 3: 116 (0.16%) outlier(s) detected
#> After iteration 4: 116 (0.16%) outlier(s) detected
#> Converged.

summary(cond)

#> 
#> Biological condition all:
#> 73828 genomic intervals with 5 profiles
#> 
#> Occupancy states:
#> 73821 (100.0%) of the genomic intervals occupied by the condition
#> 
#> Summary of mean signal intensities at the genomic intervals:
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>  0.6994  5.9455  7.0825  7.1337  8.3771 11.7396 
#> 
#> Summary of standard deviations of signal intensities at the genomic intervals:
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#> 0.01114 0.33134 0.62790 0.80305 1.09181 5.44976 
#> 
#> The genomic intervals are associated with the same structure matrix:
#>                            GM12890_H3K27Ac_1.read_cnt GM12891_H3K27Ac_1.read_cnt
#> GM12890_H3K27Ac_1.read_cnt                          1                          0
#> GM12891_H3K27Ac_1.read_cnt                          0                          1
#> GM12891_H3K27Ac_2.read_cnt                          0                          0
#> GM12892_H3K27Ac_1.read_cnt                          0                          0
#> GM12892_H3K27Ac_2.read_cnt                          0                          0
#>                            GM12891_H3K27Ac_2.read_cnt GM12892_H3K27Ac_1.read_cnt
#> GM12890_H3K27Ac_1.read_cnt                          0                          0
#> GM12891_H3K27Ac_1.read_cnt                          0                          0
#> GM12891_H3K27Ac_2.read_cnt                          1                          0
#> GM12892_H3K27Ac_1.read_cnt                          0                          1
#> GM12892_H3K27Ac_2.read_cnt                          0                          0
#>                            GM12892_H3K27Ac_2.read_cnt
#> GM12890_H3K27Ac_1.read_cnt                          0
#> GM12891_H3K27Ac_1.read_cnt                          0
#> GM12891_H3K27Ac_2.read_cnt                          0
#> GM12892_H3K27Ac_1.read_cnt                          0
#> GM12892_H3K27Ac_2.read_cnt                          1
#> 
#> Function call for associating a mean-variance curve with the condition:
#>    fitMeanVarCurve(conds = list(cond), method = "local", occupy.only = TRUE, 
#>     args.lp = list(nn = 1))
#> 
#> Summary of the mean-variance curve:
#> Fit method = "local regression"
#> ID = "2022-10-28 16:16:47 CSFGM08NNO"
#> Number of prior dfs = 2.87
#> Variance ratio factor = 0.446

# Apply the parameter estimation framework of HyperChIP.
# Note the dramatic increase in the estimated number of prior degrees of
# freedom.
cond <- estParamHyperChIP(cond)
summary(cond)

#> 
#> Biological condition all:
#> 73828 genomic intervals with 5 profiles
#> 
#> Occupancy states:
#> 73821 (100.0%) of the genomic intervals occupied by the condition
#> 
#> Summary of mean signal intensities at the genomic intervals:
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>  0.6994  5.9455  7.0825  7.1337  8.3771 11.7396 
#> 
#> Summary of standard deviations of signal intensities at the genomic intervals:
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#> 0.01114 0.33134 0.62790 0.80305 1.09181 5.44976 
#> 
#> The genomic intervals are associated with the same structure matrix:
#>                            GM12890_H3K27Ac_1.read_cnt GM12891_H3K27Ac_1.read_cnt
#> GM12890_H3K27Ac_1.read_cnt                          1                          0
#> GM12891_H3K27Ac_1.read_cnt                          0                          1
#> GM12891_H3K27Ac_2.read_cnt                          0                          0
#> GM12892_H3K27Ac_1.read_cnt                          0                          0
#> GM12892_H3K27Ac_2.read_cnt                          0                          0
#>                            GM12891_H3K27Ac_2.read_cnt GM12892_H3K27Ac_1.read_cnt
#> GM12890_H3K27Ac_1.read_cnt                          0                          0
#> GM12891_H3K27Ac_1.read_cnt                          0                          0
#> GM12891_H3K27Ac_2.read_cnt                          1                          0
#> GM12892_H3K27Ac_1.read_cnt                          0                          1
#> GM12892_H3K27Ac_2.read_cnt                          0                          0
#>                            GM12892_H3K27Ac_2.read_cnt
#> GM12890_H3K27Ac_1.read_cnt                          0
#> GM12891_H3K27Ac_1.read_cnt                          0
#> GM12891_H3K27Ac_2.read_cnt                          0
#> GM12892_H3K27Ac_1.read_cnt                          0
#> GM12892_H3K27Ac_2.read_cnt                          1
#> 
#> Function call for associating a mean-variance curve with the condition:
#>    fitMeanVarCurve(conds = list(cond), method = "local", occupy.only = TRUE, 
#>     args.lp = list(nn = 1))
#> 
#> Subsequent function call for adjusting related parameters:
#>    estParamHyperChIP(cond = cond)
#> 
#> Summary of the mean-variance curve:
#> Fit method = "local regression"
#> ID = "2022-10-28 16:16:47 CSFGM08NNO"
#> Number of prior dfs = Inf
#> Variance ratio factor = 1.02

# Perform statistical tests.
res <- varTestBioCond(cond)
head(res)

#>   observed.mean observed.var prior.var fold.change        pval      padj
#> 1      4.123172   0.62767325 3.3719506  0.18614545 0.108561188 0.3391233
#> 2      8.835486   0.01362584 0.2844984  0.04789425 0.008610023 0.1013005
#> 3      5.794082   0.16260797 1.4398955  0.11293040 0.043945034 0.2164067
#> 4      4.580327   4.82749982 2.6687559  1.80889526 0.247900708 0.5171521
#> 5      8.302970   0.07726829 0.3872625  0.19952434 0.122594081 0.3598123
#> 6      4.440024   1.65003664 2.8669925  0.57552876 0.639231994 0.8224108

The fold.change variable gives the ratio of the observed variance of each genomic interval to its prior variance. Note that the pval variable gives two-sided p-values. Therefore, for the genomic intervals with small p-values, those associated with a fold.change larger than 1 suggest hypervariable ChIP-seq signals across samples, and the others suggest lowly variable or so-called invariant ChIP-seq signals:

# Visualize the overall test results.
hist(res$pval, breaks = 100, col = "red")
plot(res, padj = 0.01)

You can get one-sided p-values for identifying hypervariable ChIP-seq signals by

df <- attr(res, "df")
df

#> observed    prior 
#>        4      Inf

one.sided.pval <- pf(res$fold.change, df[1], df[2], lower.tail = FALSE)

Compared with the genomic intervals occupied by all the ChIP-seq samples, those intervals occupied only by part of the samples should be associated with higher signal variability and, thus, tend to have more significant p-values:

n <- rowSums(norm[9:13])
x <- list(All = -log10(one.sided.pval[n == 5]),
          Partially = -log10(one.sided.pval[n > 0 & n < 5]))
wilcox.test(x$All, x$Partially, alternative = "less")

#> 
#>  Wilcoxon rank sum test with continuity correction
#> 
#> data:  x$All and x$Partially
#> W = 445213772, p-value < 2.2e-16
#> alternative hypothesis: true location shift is less than 0

boxplot(x, ylab = "-Log10(p-value)")
boxplot(x, ylab = "-Log10(p-value)", outline = FALSE)

In practice, you can also call hypervariable signals across groups of ChIP-seq samples, by first using cmbBioCond() to integrate the samples of each group into a reference profile (see also the section of Combining Replicates and Using Local Regression).

Clustering ChIP-seq Samples

Here we give the standard workflow for performing hierarchical clustering on a set of ChIP-seq samples. We use all the H3K27Ac ChIP-seq samples as an example:

# Normalize all ChIP-seq samples once for all.
autosome <- !(H3K27Ac$chrom %in% c("chrX", "chrY"))
norm <- normalize(H3K27Ac, count = 4:8, occupancy = 9:13,
                  baseline = "pseudo-reference",
                  common.peak.regions = autosome)

# Construct a bioCond to group all the samples.
cond <- bioCond(norm[4:8], norm[9:13], occupy.num = 1,
                name = "all")

# Fit an MVC by using local regression.
cond <- fitMeanVarCurve(list(cond), method = "local",
                        occupy.only = TRUE, args.lp = list(nn = 1))[[1]]

#> During the local regression procedure:
#> After iteration 1: 97 (0.13%) outlier(s) detected
#> After iteration 2: 112 (0.15%) outlier(s) detected
#> After iteration 3: 116 (0.16%) outlier(s) detected
#> After iteration 4: 116 (0.16%) outlier(s) detected
#> Converged.

# Measure the distance between each pair of samples.
d <- distBioCond(cond, method = "prior")
d

#>                            GM12890_H3K27Ac_1.read_cnt GM12891_H3K27Ac_1.read_cnt
#> GM12891_H3K27Ac_1.read_cnt                  1.3505179                           
#> GM12891_H3K27Ac_2.read_cnt                  1.3832427                  0.2790962
#> GM12892_H3K27Ac_1.read_cnt                  1.1447463                  0.8673543
#> GM12892_H3K27Ac_2.read_cnt                  1.1442555                  0.8925351
#>                            GM12891_H3K27Ac_2.read_cnt GM12892_H3K27Ac_1.read_cnt
#> GM12891_H3K27Ac_1.read_cnt                                                      
#> GM12891_H3K27Ac_2.read_cnt                                                      
#> GM12892_H3K27Ac_1.read_cnt                  0.8403184                           
#> GM12892_H3K27Ac_2.read_cnt                  0.8474437                  0.2623903

# Perform hierarchical clustering.
plot(hclust(d, method = "average"), hang = -1)

Note: distBioCond() quantifies the distance between each pair of samples contained in the supplied bioCond. Each distance derived by distBioCond() could be interpreted as absolute difference in signal intensity between two samples averaged across genomic intervals. In the above example, the average fold change of H3K27Ac ChIP-seq signal between the 1st and 2nd replicates of GM12891 is about \(2^{0.2790962} \approx 1.2\), while the average fold change between the 1st replicates of GM12891 and GM12892 is about \(2^{0.8673543} \approx 1.8\). Technically, distBioCond() calculates a p-norm distance for each pair of samples while using the reciprocal of variance to weight each genomic interval. Suppose \(x_{i}\) and \(y_{i}\) represent the signal intensities of interval \(i\) in two samples. \(w_{i}\) is the reciprocal of the variance of interval \(i\). The function derives the distance between the two samples by

\[ d = \sqrt[p]{\frac{ \sum_{i}w_{i}|y_{i} - x_{i}|^{p} }{ \sum_{i}w_{i} }} \]

By default, distBioCond() uses \(p=2\) and the "prior" method to calculate the variance of each interval (type ?distBioCond for details).

In practice, you may want to use only the genomic intervals that are associated with hypervariable signal intensities across samples to perform clustering, as such intervals should be most helpful for distinguishing between samples:

# Select hypervariable genomic intervals.
cond <- estParamHyperChIP(cond)
res <- varTestBioCond(cond)
f <- res$fold.change > 1 & res$padj < 0.01

# The hierarchical structure among samples remains unmodified,
# but note the change of magnitude of the distances between cell lines.
d2 <- distBioCond(cond, subset = f, method = "prior")
d2

#>                            GM12890_H3K27Ac_1.read_cnt GM12891_H3K27Ac_1.read_cnt
#> GM12891_H3K27Ac_1.read_cnt                  4.0412182                           
#> GM12891_H3K27Ac_2.read_cnt                  4.0706951                  0.3131988
#> GM12892_H3K27Ac_1.read_cnt                  3.3901795                  2.0668405
#> GM12892_H3K27Ac_2.read_cnt                  3.3263697                  2.0723909
#>                            GM12891_H3K27Ac_2.read_cnt GM12892_H3K27Ac_1.read_cnt
#> GM12891_H3K27Ac_1.read_cnt                                                      
#> GM12891_H3K27Ac_2.read_cnt                                                      
#> GM12892_H3K27Ac_1.read_cnt                  2.0365856                           
#> GM12892_H3K27Ac_2.read_cnt                  2.0339763                  0.2795365

plot(hclust(d2, method = "average"), hang = -1)

You can also perform hierarchical clustering on groups of ChIP-seq samples, by first using cmbBioCond() to integrate the samples of each group into a reference profile (see also the section of Combining Replicates and Using Local Regression).

MAnorm2 Model Formulation

Here we provide a formal description of the statistical model designed in MAnorm2. Suppose \(X_{j}\) is an \(n \times m_{j}\) matrix recording normalized ChIP-seq signal intensities (by default, normalized signal intensities derived by MAnorm2 are normalized log2 read counts) at \(n\) genomic intervals for \(m_{j}\) samples belonging to group \(j\). Let \(X_{i,j}\) be a column vector representing the transpose of row \(i\) of \(X_{j}\). We assume

\[ X_{i,j}|\sigma^{2}_{i,j} \sim MVN(1\mu_{i,j}, (\gamma_{j}\sigma^{2}_{i,j})S_{i,j}) \\ \frac{1}{\sigma^{2}_{i,j}} \sim \frac{1}{f(\mu_{i,j})} \cdot \frac{\chi^{2}_{d_{0}}}{d_{0}} \]

Here \(MVN\) refers to the multivariate normal distribution. \(\mu_{i,j}\) and \(\sigma^{2}_{i,j}\) are two unknown scalars that parametrize the mean signal intensity of interval \(i\) in group \(j\) and the associated signal variability, respectively. \(1\) is a column vector of ones. \(\gamma_{j}\), termed variance ratio factor, is a scalar that quantifies the global within-group variability of group \(j\). \(S_{i,j}\), termed structure matrix, is an \(m_{j} \times m_{j}\) matrix designed for modeling precision weights of signal measurements from different samples as well as correlations among the measurements (by default, MAnorm2 simply sets each structure matrix to an identity matrix). \(f(\cdot)\) refers to an unscaled MVC common to different groups of samples and \(f(\mu_{i,j})\) is called the prior variance of interval \(i\) in group \(j\). \(d_{0}\), termed number of prior degrees of freedom, is a hyperparameter that assesses how well in general the variance of an individual interval could be predicted by its mean signal intensity. \(\chi^{2}_{d_{0}}\) refers to the chi-squared distribution with \(d_{0}\) degrees of freedom. For the convenience of devising statistical tests for identifying differential genomic intervals between groups of samples, we further assume that unscaled variances of non-differential intervals remain invariant across groups. Formally, we assume that \(\sigma^{2}_{i,j_{1}}\) equals \(\sigma^{2}_{i,j_{2}}\) with a probability of one (i.e., they refer to the same random variable) whenever \(\mu_{i,j_{1}} = \mu_{i,j_{2}}\). This assumption is consistent with the fact that \(\sigma^{2}_{i,j_{1}}\) and \(\sigma^{2}_{i,j_{2}}\) follow the same prior distribution on condition that \(\mu_{i,j_{1}} = \mu_{i,j_{2}}\).

Overall, the above model is similar to limma-trend (Sartor et al. 2006; Law et al. 2014), except that MAnorm2 allows for different global within-group variability between groups of samples.

Citation

To cite the MAnorm2 package in publications, please use

Tu, S., et al., MAnorm2 for quantitatively comparing groups of ChIP-seq samples. Genome Res, 2021. 31(1): p. 131-145.

If you have performed MA normalization with a pseudo-reference profile as baseline, or have employed a Winsorization-based robust parameter estimation framework, or have performed a hypervariable analysis, please cite additionally

Chen, H., et al., HyperChIP for identifying hypervariable signals across ChIP/ATAC-seq samples. bioRxiv, 2021: p. 2021.07.27.453915.

Acknowledgments

In devising the underlying statistical methods of MAnorm2, we have learned a lot from limma, limma-trend, voom, DESeq and DESeq2 (Smyth 2004; Sartor et al. 2006; Law et al. 2014; Anders and Huber 2010; Love, Huber, and Anders 2014). Special thanks to the authors of these fantastic tools.

We would also like to sincerely thank the following individuals, for their help and feedback on the MAnorm2 package:

Zhen Shao, Yijing Zhang, Mushan Li, Haojie Chen, Fengxiang Tan.

Session Info

sessionInfo()

#> R version 4.1.0 (2021-05-18)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 19044)
#> 
#> Matrix products: default
#> 
#> locale:
#> [1] LC_COLLATE=C                               LC_CTYPE=Chinese (Simplified)_China.936   
#> [3] LC_MONETARY=Chinese (Simplified)_China.936 LC_NUMERIC=C                              
#> [5] LC_TIME=Chinese (Simplified)_China.936    
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] MAnorm2_1.2.2
#> 
#> loaded via a namespace (and not attached):
#>  [1] knitr_1.33         magrittr_2.0.1     munsell_0.5.0      statmod_1.4.36     lattice_0.20-44   
#>  [6] colorspace_2.0-2   R6_2.5.0           rlang_0.4.11       fastmap_1.1.0      highr_0.9         
#> [11] stringr_1.4.0      caTools_1.18.2     tools_4.1.0        grid_4.1.0         xfun_0.25         
#> [16] KernSmooth_2.23-20 jquerylib_0.1.4    htmltools_0.5.2    gtools_3.9.2       yaml_2.2.1        
#> [21] digest_0.6.27      lifecycle_1.0.0    farver_2.1.0       sass_0.4.0         bitops_1.0-7      
#> [26] evaluate_0.14      rmarkdown_2.10     stringi_1.7.3      compiler_4.1.0     bslib_0.3.1       
#> [31] gplots_3.1.1       scales_1.1.1       locfit_1.5-9.4     jsonlite_1.7.2

References

Anders, Simon, and Wolfgang Huber. 2010. “Differential Expression Analysis for Sequence Count Data.” Genome Biology 11 (10): R106.

Kasowski, Maya, Sofia Kyriazopoulou-Panagiotopoulou, Fabian Grubert, Judith B Zaugg, Anshul Kundaje, Yuling Liu, Alan P Boyle, et al. 2013. “Extensive Variation in Chromatin States Across Humans.” Science 342 (6159): 750–52.

Law, Charity W, Yunshun Chen, Wei Shi, and Gordon K Smyth. 2014. “Voom: Precision Weights Unlock Linear Model Analysis Tools for RNA-Seq Read Counts.” Genome Biology 15 (2): R29.

Love, Michael I, Wolfgang Huber, and Simon Anders. 2014. “Moderated Estimation of Fold Change and Dispersion for RNA-Seq Data with DESeq2.” Genome Biology 15 (12): 550.

Phipson, Belinda, Stanley Lee, Ian J Majewski, Warren S Alexander, and Gordon K Smyth. 2016. “Robust Hyperparameter Estimation Protects Against Hypervariable Genes and Improves Power to Detect Differential Expression.” The Annals of Applied Statistics 10 (2): 946.

Sartor, Maureen A, Craig R Tomlinson, Scott C Wesselkamper, Siva Sivaganesan, George D Leikauf, and Mario Medvedovic. 2006. “Intensity-Based Hierarchical Bayes Method Improves Testing for Differentially Expressed Genes in Microarray Experiments.” BMC Bioinformatics 7 (1): 538.

Smyth, G. K. 2004. “Linear Models and Empirical Bayes Methods for Assessing Differential Expression in Microarray Experiments.” Stat Appl Genet Mol Biol 3: Article3.

Zhang, Yong, Tao Liu, Clifford A Meyer, Jérôme Eeckhoute, David S Johnson, Bradley E Bernstein, Chad Nusbaum, et al. 2008. “Model-Based Analysis of ChIP-Seq (MACS).” Genome Biology 9 (9): R137.