Introduction

The System of MAnorm2 Machinery

The capabilities of MAnorm2 result primarily from two basic utilities implemented in it. One is for the normalization of ChIP-seq samples, and the other is for modeling the mean-variance trend associated with normalized ChIP-seq signal intensities. Several downstream analyses could be performed based on these two utilities, including

  • differential ChIP-seq analysis between individual samples or groups of samples;
  • identification of genomic regions with hypervariable ChIP-seq signals across individual samples or groups of samples;
  • clustering of individual ChIP-seq samples or groups of samples.

We’d like to emphasize here that each of the above analyses takes advantage of the observed mean-variance trend to improve the assessment of within-group variability (i.e., the variation of ChIP-seq signals across samples of the same group). In practice, the strategy could compensate for the lack of sufficient replicates for accurately assessing within-group variability.

Input Data

For employing the machinery implemented in MAnorm2, you need to prepare a table that profiles the ChIP-seq signal in each of a list of genomic intervals for each of a set of ChIP-seq samples. The H3K27Ac dataset bundled with MAnorm2 provides such an instance:

library(MAnorm2)
#> MAnorm2 1.2.2 2022-10-28
head(H3K27Ac)
#>   chrom  start    end GM12890_H3K27Ac_1.read_cnt GM12891_H3K27Ac_1.read_cnt
#> 1  chr1  29023  29346                          8                         16
#> 2  chr1 712983 715873                        440                        498
#> 3  chr1 740056 741095                         81                         54
#> 4  chr1 751252 753001                          2                        139
#> 5  chr1 760716 764177                        234                        329
#> 6  chr1 800556 801985                          4                         26
#>   GM12891_H3K27Ac_2.read_cnt GM12892_H3K27Ac_1.read_cnt GM12892_H3K27Ac_2.read_cnt
#> 1                          9                         23                         22
#> 2                        477                        439                        508
#> 3                         39                         44                         40
#> 4                         84                         11                         11
#> 5                        350                        326                        376
#> 6                         59                         16                         19
#>   GM12890_H3K27Ac_1.occupancy GM12891_H3K27Ac_1.occupancy GM12891_H3K27Ac_2.occupancy
#> 1                           0                           0                           0
#> 2                           1                           1                           1
#> 3                           1                           0                           0
#> 4                           0                           1                           0
#> 5                           1                           1                           1
#> 6                           0                           0                           1
#>   GM12892_H3K27Ac_1.occupancy GM12892_H3K27Ac_2.occupancy
#> 1                           1                           1
#> 2                           1                           1
#> 3                           1                           0
#> 4                           0                           0
#> 5                           1                           1
#> 6                           0                           0

To be specific, each row of the above table represents a genomic interval; each of the read_cnt variables corresponds to a ChIP-seq sample and records the numbers of reads from the sample that fall within the genomic intervals (i.e., the raw read counts for the sample); the occupancy variables correspond to the read_cnt variables one by one and specify the occupancy status of each interval in each sample (an occupancy status of 1 indicates that the interval is enriched with reads in the sample). In practice, the occupancy status of a genomic interval in a certain ChIP-seq sample could be determined by its overlap with the peaks (Zhang et al. 2008) of the sample. Note also that MAnorm2 refers to an interval as occupied by a sample if the interval is enriched with reads in the sample.

MAnorm2_utils is specifically designed to coordinate with MAnorm2, and we strongly recommend using it to create input tables of MAnorm2.

Note also that, although the above table records raw read counts, MAnorm2 does not impose a restriction that the input measurements of ChIP-seq signals must be integers (see also the section of Continuous Distribution below).

Application Scope

Although MAnorm2 has been designed to process ChIP-seq data, it could be applied in principle to the analysis of any type of data with a similar structure, including DNase-seq, ATAC-seq and RNA-seq data. The only problem associated with such extensions is how to naturally define “peaks” for specific data types.

Most of the peak callers originally devised for ChIP-seq data (e.g., MACS 1.4) also work for DNase-seq and ATAC-seq data. For RNA-seq data, each row of the input table should stand for a gene, and we recommend setting a cutoff (e.g., 20) of raw read count to define “peak” genes.

Continuous Distribution

In spite of the discrete nature of read counts, MAnorm2 uses continuous distribution to model ChIP-seq data by first transforming raw read counts into raw signal intensities. By default, MAnorm2 completes the transformation by simply adding an offset count to each raw count and taking a base-2 logarithm. Practical ChIP-seq data sets, however, may be associated with various confounding factors, including batch effects, local sequence compositions and background signals measured by input samples. On this account, the MAnorm2 machinery has been designed to be independent of the specific transformation employed. And any methods for correcting for confounding factors could be applied before invoking MAnorm2, as long as the resulting signal intensities could be approximately modeled as following the normal distribution (in particular, consider carefully whether it is necessary to apply a logarithmic transformation in the final step). In the extreme case, you can even accomplish the normalization of ChIP-seq data by yourself and use MAnorm2, for example, only for the following differential analysis.

The primary reason for which MAnorm2 models ChIP-seq signals as continuous random variables is that the mathematical theory of count distributions is far less tractable than that of the normal distribution. For example, current statistical methods based on the negative binomial distribution are frequently relied on approximations of various kinds. Specifically, variance (or dispersion) estimates for individual genomic intervals are typically treated as known parameters, and their uncertainty can hardly be incorporated into the statistical tests for identifying differential signals.

Besides, after an extensive correction for confounding factors, the resulting data range is almost certainly not limited to non-negative integers, and the data may have lost their discrete nature and be more akin to a continuous distribution. Moreover, transforming read counts towards the normal distribution unlocks the application of a large repository of mature statistical methods that are initially developed for analyzing continuous measurements (e.g., intensity data from microarray experiments). Refer to the voom article (Law et al. 2014) for a detailed discussion of this topic.

MAnorm2 Capability

This section explains in detail the working principle of MAnorm2 and demonstrates the use of it for various analyses. Note that all demonstrations are based on the H3K27Ac dataset bundled with MAnorm2 (see also the section of Input Data):

library(MAnorm2)
head(H3K27Ac)
#>   chrom  start    end GM12890_H3K27Ac_1.read_cnt GM12891_H3K27Ac_1.read_cnt
#> 1  chr1  29023  29346                          8                         16
#> 2  chr1 712983 715873                        440                        498
#> 3  chr1 740056 741095                         81                         54
#> 4  chr1 751252 753001                          2                        139
#> 5  chr1 760716 764177                        234                        329
#> 6  chr1 800556 801985                          4                         26
#>   GM12891_H3K27Ac_2.read_cnt GM12892_H3K27Ac_1.read_cnt GM12892_H3K27Ac_2.read_cnt
#> 1                          9                         23                         22
#> 2                        477                        439                        508
#> 3                         39                         44                         40
#> 4                         84                         11                         11
#> 5                        350                        326                        376
#> 6                         59                         16                         19
#>   GM12890_H3K27Ac_1.occupancy GM12891_H3K27Ac_1.occupancy GM12891_H3K27Ac_2.occupancy
#> 1                           0                           0                           0
#> 2                           1                           1                           1
#> 3                           1                           0                           0
#> 4                           0                           1                           0
#> 5                           1                           1                           1
#> 6                           0                           0                           1
#>   GM12892_H3K27Ac_1.occupancy GM12892_H3K27Ac_2.occupancy
#> 1                           1                           1
#> 2                           1                           1
#> 3                           1                           0
#> 4                           0                           0
#> 5                           1                           1
#> 6                           0                           0

This dataset profiles H3K27Ac ChIP-seq signals on a genome wide scale for three human lymphoblastoid cell lines (LCLs), each derived from a separate Caucasian individual (associated ChIP-seq data obtained from (Kasowski et al. 2013)). For meta information regarding these cell lines, type

attr(H3K27Ac, "metaInfo")
#>   individual population gender                 cellType
#> 1    GM12890  Caucasian female lymphoblastoid cell line
#> 2    GM12891  Caucasian   male lymphoblastoid cell line
#> 3    GM12892  Caucasian female lymphoblastoid cell line

For details about the generation of this dataset, type ?H3K27Ac.

Comparing Groups of ChIP-seq Samples

Quick Start

Here we show the standard workflow for a differential ChIP-seq analysis between two groups of samples. We use the comparison between the H3K27Ac ChIP-seq samples for GM12891 LCL and those for GM12892 LCL as example:

# Perform within-group normalization.
norm <- normalize(H3K27Ac, count = 5:6, occupancy = 10:11)
norm <- normalize(norm, count = 7:8, occupancy = 12:13)

# Construct a bioCond for each group of samples.
conds <- list(GM12891 = bioCond(norm[5:6], norm[10:11], name = "GM12891"),
              GM12892 = bioCond(norm[7:8], norm[12:13], name = "GM12892"))

# Perform between-group normalization.
# Restrict common peak regions to autosomes only when the two groups
# being compared are associated with different genders.
autosome <- !(H3K27Ac$chrom %in% c("chrX", "chrY"))
conds <- normBioCond(conds, common.peak.regions = autosome)

# Fit a mean-variance curve.
# If the following function call raises an error,
# set init.coef = c(0.1, 10) or try method = "local".
conds <- fitMeanVarCurve(conds, method = "parametric", occupy.only = TRUE)
#> During the parametric fit procedure:
#> After iteration 1: coef = (0.0133814, 5.43493); 971 (0.86%) outlier(s) detected
#> After iteration 2: coef = (0.013118, 5.45959); 971 (0.86%) outlier(s) detected
#> After iteration 3: coef = (0.0131117, 5.45988); 971 (0.86%) outlier(s) detected
#> Converged.
# Perform differential tests.
res <- diffTest(conds[[1]], conds[[2]])
head(res)
#>   GM12891.mean GM12892.mean       Mval   Mval.se     Mval.t         pval         padj
#> 1     2.952943     4.378686  1.4257425 0.6959293  2.0486888 4.049255e-02 7.281479e-02
#> 2     8.594610     8.842037  0.2474271 0.1687811  1.4659641 1.426581e-01 2.131498e-01
#> 3     4.970712     5.282421  0.3117088 0.4302032  0.7245617 4.687210e-01 5.638290e-01
#> 4     6.279446     3.357143 -2.9223031 0.4751332 -6.1504920 7.724294e-10 5.889993e-09
#> 5     8.038011     8.401049  0.3630385 0.1853013  1.9591797 5.009175e-02 8.753281e-02
#> 6     4.733793     4.015697 -0.7180965 0.5494122 -1.3070268 1.912036e-01 2.723929e-01

Note: rows of the above table of differential analysis results correspond to the genomic intervals in H3K27Ac one by one with the same order. See also a detailed explanation below.

Step-by-step Explanation and Visualization

Normalization

MAnorm2 normalizes two individual ChIP-seq samples by removing the overall M-A trend (M and A values refer to log2 fold change and average log2 read count, respectively) associated with their common peak regions, which are the genomic intervals occupied by both of them. For normalization of a set of any number of ChIP-seq samples, MAnorm2 selects one of the samples as baseline and normalizes each other sample against it. Taking the comparison of H3K27Ac ChIP-seq signals between GM12891 and GM12892 LCLs as example, you may choose to normalize all the related samples once for all, by supplying raw read counts and occupancy states associated with the samples:

# One-step normalization.
autosome <- !(H3K27Ac$chrom %in% c("chrX", "chrY"))
norm <- normalize(H3K27Ac, count = 5:8, occupancy = 10:13,
                  common.peak.regions = autosome)

Note: here we exclude the genomic intervals in sex chromosomes from common peak regions, since these ChIP-seq samples are associated with different genders.

By default, MAnorm2 uses the median-ratio strategy (Anders and Huber 2010) to estimate the size factor of each ChIP-seq sample and selects the sample whose log2 size factor is closest to 0 as baseline. In practice, you can avoid using a sample with bad quality as baseline by explicitly specifying the baseline argument of normalize(). Besides, when the number of samples to be normalized is large (e.g., >5), you can reduce the variability of normalization results by setting baseline to "pseudo-reference", in which case MAnorm2 constructs a pseudo ChIP-seq profile as baseline by “averaging” all the samples (type ?normalize for details).

Check some information regarding the performed normalization by

names(attributes(norm))
#> [1] "names"       "row.names"   "metaInfo"    "class"       "size.factor" "baseline"   
#> [7] "norm.coef"   "MA.cor"
attributes(norm)[5:8]
#> $size.factor
#> GM12891_H3K27Ac_1.read_cnt GM12891_H3K27Ac_2.read_cnt GM12892_H3K27Ac_1.read_cnt 
#>                  1.0914779                  1.0854059                  0.8836196 
#> GM12892_H3K27Ac_2.read_cnt 
#>                  0.9574468 
#> 
#> $baseline
#> [1] "GM12892_H3K27Ac_2.read_cnt"
#> 
#> $norm.coef
#>                                slope  intercept common.peak.regions
#> GM12891_H3K27Ac_1.read_cnt 1.0211083 -0.4409282               41759
#> GM12891_H3K27Ac_2.read_cnt 1.0264489 -0.4604073               42752
#> GM12892_H3K27Ac_1.read_cnt 0.9562856  0.4758901               46259
#> GM12892_H3K27Ac_2.read_cnt 1.0000000  0.0000000               51554
#> 
#> $MA.cor
#>                            GM12891_H3K27Ac_1.read_cnt GM12891_H3K27Ac_2.read_cnt
#> GM12891_H3K27Ac_1.read_cnt                0.000000000                -0.03017988
#> GM12891_H3K27Ac_2.read_cnt                0.007119437                 0.00000000
#> GM12892_H3K27Ac_1.read_cnt                0.016129812                 0.01551055
#> GM12892_H3K27Ac_2.read_cnt                0.000000000                 0.00000000
#>                            GM12892_H3K27Ac_1.read_cnt GM12892_H3K27Ac_2.read_cnt
#> GM12891_H3K27Ac_1.read_cnt                  0.1035251                 0.03699763
#> GM12891_H3K27Ac_2.read_cnt                  0.1179751                 0.04865985
#> GM12892_H3K27Ac_1.read_cnt                  0.0000000                -0.20926760
#> GM12892_H3K27Ac_2.read_cnt                  0.0000000                 0.00000000
# This statement requires the gplots package (>= 3.0.1).
plot(attr(norm, "MA.cor"), symbreaks = TRUE, margins = c(8, 8),
     cexRow = 1, cexCol = 1)

The norm.coef attribute records the linear transformation applied to the log2 read counts of each ChIP-seq sample as well as the number of common peak regions between each sample and the baseline. The MA.cor attribute is a matrix recording the Pearson correlation coefficient (PCC) between M and A values across the common peak regions of each pair of samples. The upper and lower triangles of this matrix are calculated from raw and normalized log2 read counts, respectively. In general, it indicates a good normalization performance that the M-A PCCs are all close to 0 after normalization.

We can also draw MA plots to visualize the normalization effects. Here we use the two biological replicates of GM12892 LCL as example:

# Before normalization.
raw <- log(H3K27Ac[7:8] + 0.5, base = 2)
MAplot(raw[[1]], raw[[2]], norm[[12]], norm[[13]], ylim = c(-2, 2),
       main = "Before normalization")
abline(h = 0, lwd = 2, lty = 5)

# After normalization.
MAplot(norm[[7]], norm[[8]], norm[[12]], norm[[13]], ylim = c(-2, 2),
       main = "After normalization")
abline(h = 0, lwd = 2, lty = 5)