Pipeline Overview

The VariTAS pipeline is an R package for processing amplicon-based targeted sequencing. It supports alignment, somatic variant calling (with and without matched normal), and variant annotation, and the pipeline can start from any stage.

Both Illumina sequencing (typically MiniSeq) and Ion Torrent systems are supported by the pipeline, but they require different configurations. For Illumina runs, the FASTQ files are used to start the pipeline at the alignment stage. For Ion Torrent sequencing, the aligned BAM files from the machine are used as input.

The pipeline is designed to be fully automated. Once the pipeline is launched, cluster jobs will be submitted for all tasks. In the case that some jobs depend on others, these job dependencies will be included in the script and handled by the cluster.

Each stage of the pipeline is associated with a file specification data frame. This data frame contains paths to the files to be processed at that stage, and information on any job dependencies. In turn, each pipeline stage will return a data frame that can be used for the next stage in the pipeline.

File paths, run parameters, HPC settings, and other options are controlled by a config file. See the Updating Settings section below for more details.

To start using the pipeline quickly, see the Examples section.

Third-Party Software

There are several essential programs that the VariTAS pipeline requires. The table below provides essential information about each of them. The version number indicates the latest version tested with the pipeline.

Program Version Download Link
BWA 0.7.12 http://bio-bwa.sourceforge.net/
bedtools 2.25.0 https://bedtools.readthedocs.io/en/latest/
Samtools 1.5 http://www.htslib.org/
Picard 2.1.0 https://broadinstitute.github.io/picard/
Vardict (Java) 1.4.6 https://github.com/AstraZeneca-NGS/VarDictJava
FastQC 0.11.4 https://www.bioinformatics.babraham.ac.uk/projects/fastqc/

Directory Structure

Note that only the top output directory needs to be manually created

  .                           # Supplied output directory
   |-2018-11-12-plots         # Contains generated plots used in report
   |---sample-coverage        # Coverage plots generated per-sample
   |-2018-11-12-variant-data  # Final output files, including the PDF report
   |-78                       # Directory for each sample, containing intermediary files
   |---mutect                 # Files produced by MuTect for each sample
   |---vardict                # Files produced by VarDict for each sample
   |-code                     # Bash scripts used to submit jobs to HPC scheduler
   |-log                      # stdout and stderr for each job

Stages

There are four stages to the VariTAS pipeline: alignment, variant calling, annotation, and merging.

Stage Description
Alignment Align FASTQ files to reference genome
Variant Calling Run variant callers on aligned BAM files
Annotation Annotate variants with ANNOVAR
Merging Merge files from all variant callers and produce reports/ plots

Alignment

Alignment consists of two main steps: alignment with bwa, and coverage quality control.

For Illumina sequencing runs, both steps will typically be necessary. For Proton runs, the machine does the alignment against UCSC hg19. While the machine also outputs FASTQ files, realigning these yourself is not recommended as read quality information is lost[1].

The main function for running alignment is run.alignment(). It takes a FASTQ specification data frame as input, submits one alignment job per sample, and returns a BAM specification data frame.

library(varitas);

output.directory <- '';

fastq.specification <- data.frame(
    sample.id = c('A', 'B', 'C', 'D'), 
    patient.id = c('X', 'X', 'Y', 'Y'), 
    tissue = c('tumour', 'normal', 'tumour', 'normal'), 
    reads = c('A_1.fq', 'B_1.fq', 'C_1.fq', 'D_1.fq'),
    mates = c('A_2.fq', 'B_2.fq', 'C_2.fq', 'D_2.fq')
    );

print(fastq.specification);
##   sample.id patient.id tissue  reads  mates
## 1         A          X tumour A_1.fq A_2.fq
## 2         B          X normal B_1.fq B_2.fq
## 3         C          Y tumour C_1.fq C_2.fq
## 4         D          Y normal D_1.fq D_2.fq

The FASTQ specification must have columns sample.id and reads. Optionally, it can contain a column mates (for paired end reads), and columns patient.ID and tissue. If provided, the patient ID and tissue information will be used to do matched normal somatic variant calling in later stages of the pipeline.

After creating the FASTQ specification data frame, we are ready to run the alignment step of the pipeline.

matched.bam.specification <- run.alignment(
    fastq.specification = fastq.specification, 
    output.directory = output.directory, 
    paired.end = TRUE,
    quiet = TRUE # only for testing, does not submit jobs to cluster
    );

The alignment step returns a BAM specification data frame that can be used for the variant calling. When patient ID and tissue information is provided in the input data frame, the output data frame will contain tumour and normal BAM files for each tumour sample. When no patient ID/ tissue information is provided, all samples are assumed to be tumours, and variant calling without matched normal is performed in the subsequent step.

print(matched.bam.specification);
##   sample.id                   tumour.bam                   normal.bam
## 1         A /A/A.sorted.bam.ontarget.bam /B/B.sorted.bam.ontarget.bam
## 2         C /C/C.sorted.bam.ontarget.bam /D/D.sorted.bam.ontarget.bam
##    job.dependency
## 1 align_A align_B
## 2 align_C align_D

Variant Calling

Variant calling is performed through the run.variant.calling() function. The form of the input BAM specification depends on whether matched normals are available. If no matched normals are available, the only two required columns are sample.id and tumour.bam.

unmatched.bam.specification <- data.frame(
    sample.id = c('Z', 'Y'), 
    tumour.bam = c('Z.bam', 'Y.bam')
    );

print(unmatched.bam.specification);
##   sample.id tumour.bam
## 1         Z      Z.bam
## 2         Y      Y.bam

In addition to the bam specification data frame, run.variant.calling() takes the variant callers as an argument. To run VarDict and MuTect 2 on the previous matched normal example, you can use the following code.

vcf.specification <- run.variant.calling(
    matched.bam.specification, 
    output.directory = output.directory, 
    variant.caller = c('vardict', 'mutect'),
    quiet = TRUE # only for testing, does not submit jobs to cluster
    );
print(vcf.specification);
##           sample.id                              vcf job.dependency  caller
## vardict-A         A /A/vardict/A.passed.ontarget.vcf      vardict_A vardict
## vardict-C         C /C/vardict/C.passed.ontarget.vcf      vardict_C vardict
## mutect-A          A  /A/mutect/A.passed.ontarget.vcf       mutect_A  mutect
## mutect-C          C  /C/mutect/C.passed.ontarget.vcf       mutect_C  mutect

The VCF specification includes information on the variant caller used to produce the VCF file. This is needed for downstream filtering steps, and used to create unique job names for annotation jobs.

VarDict

VarDict [@vardict] is a variant caller optimized for deep sequencing. As performance scales linearly with depth, downsampling reads is not necessary, and VarDict has greater sensitivity for detecting variants present at low allele frequencies compared to other callers.

MuTect

MuTect [@mutect] is most commonly used for calling variants from whole genome and whole exome sequencing data. It is not optimized for amplicon data, and downsamples to depth 1,000 when it encounters deep sequencing data. When detecting variants in circulating DNA, this downsampling can result in mutations being lost, and running MuTect is not recommended. However, when sequencing solid tumours the variant allele frequencies are higher and there is less concern about losing mutations.

Annotation

Variant file annotation is done with ANNOVAR, and annotated variants are saved to a tab-separated file. The config file specifies the fields to be included in the final tab-separated file. More fields can be added as long as they are included in the ANNOVAR databases.

variant.specification <- run.annotation(
    vcf.specification, 
    output.directory = output.directory, 
    quiet = TRUE # testing only
    );
print(variant.specification);

Merging

The main function for submitting the post-processing job to the cluster is run.post.processing(). Similar to the alignment, variant calling, and variant annotation stages, this function will submit a cluster job with job dependencies as specified by the variant specification.

However, unlike the other stages, the post processing stage does not rely on any command line tools. If there are no job dependencies, the post-processing stage can be run directly through the post.processing() function.

run.post.processing(
    variant.specification = variant.specification, 
    output.directory = output.directory, 
    quiet = TRUE
    );

There are three main parts to the post-processing stage:

  1. Variant merging

  2. Summary plots and PDF report

  3. Quality control Excel sheet

The output is split between two date-stamped subdirectories of the project directory. The variant-data directory contains files that are meant to be sent to collaborators: filtered variants in Excel and text formats, coverage statistics in Excel format, and a PDF report. Additionally, the PNG format plots are saved to the plots directory.

The final page of the PDF report contains details on the pipeline run, including the path to the directory on scratch where the rest of the files can be found.

## VariTAS version 0.7.0
##  Date: 2018-04-26
##  User: username

## Raw/intermediate files can be found in
##  /data/analysis_dir

Running the Full Pipeline

In most cases, all steps in the pipeline can be executed with a single function call. run.varitas.pipeline() is the main function for launching the full pipeline.

By default, this will run all stages from alignment to post-processing. To start the pipeline at a later stage, adjust the start.stage argument of the function. Whatever start stage you provide must match the files provided in the file.details data frame. For example, if starting the pipeline at the variant annotation stage, the file.details data frame should contain paths to VCF files containing the variant calls, and be formatted in a way that passes the verify.variant.specification() check.

Running the run.varitas.pipeline() function will submit jobs for all stages at once, with appropriate job dependencies. To see which jobs that would be submitted, run run.varitas.pipeline() with the argument quiet = TRUE. This will print out all of the Perl calls instead of submitting them as system calls. Each Perl call corresponds to one job submitted to the cluster.

When starting the pipeline at a later stage, earlier jobs are dropped and job dependencies are adjusted accordingly.

vcf.specification$job.dependency <- NULL;

run.varitas.pipeline(
    file.details = vcf.specification,
    output.directory = output.directory,
    start.stage = 'annotation',
    quiet = TRUE
    );

The merging stage of the pipeline supports email notifications. As merging is the last stage of the pipeline, the email notification can be used to let you know when the pipeline run finishes.

run.varitas.pipeline(
    file.details = vcf.specification,
    output.directory = output.directory,
    start.stage = 'annotation',
    email = 'sid@pid.ac.uk',
    quiet = TRUE
    );

Updating Settings {#settings}

The VariTAS pipeline comes with a set of default options specified in the config.yaml file. These are loaded into R by default, and will enable you to run the pipeline. The settings include both cluster-specific settings that are unlikely to change once they have been set for your HPC system and run-specific settings that are more likely to change. Examples of run-specific settings are the target panel, sequencing platform, and variant filters.

In most cases you will want to make changes to the default settings. There are two ways of doing this.

  1. Create your own config file, and overwrite all config options with the overwrite.varitas.options() function.

  2. Update individual options with the set.varitas.options() function.

Variant Filters

Variant filters are specified as part of the settings. All these settings should start with the prefix filters (e.g. be nested under filters in the YAML file), and be further grouped by variant caller. For example, to set a MuTect-specific filter FILTER_NAME, use the command set.varitas.options(filters.mutect.FILTER_NAME = TRUE).

To specify a filter for all variant callers, list them under default in the config YAML file. These filters are set first and overwritten by any caller-specific filters. For example, the YAML code below would set the remove_exac filter for all variant callers and a min_tumour_depth filter of 10 for all callers except VarDict. The VarDict minimum tumour depth filter is set to 20.

filters:
  default:
    min_tumour_depth: 10
    remove_exac: true
  vardict:
    min_tumour_depth: 20

The set.varitas.options() function currently does not support default filters. These must be specified through a config YAML file that's passed to the overwrite.varitas.options() function.

The table below describes all filters currently supported. Variants that do not meet all of these criteria will be filtered out. Note that filters with “normal” in the name are only applied if the samples are paired tumour/normal.

Name Value Description
min_tumour_variant_reads numeric Minimum number of reads supporting a variant
max_normal_variant_reads numeric Maximum number of reads in supporting a variant in normal
min_tumour_depth numeric Minimum depth in tumour
min_normal_depth numeric Minimum depth in normal
min_tumour_allele_frequency numeric Minimum tumour allele frequency
max_normal_allele_frequency numeric Maximum normal allele frequency
indel_min_tumour_allele_frequency numeric Minimum tumour allele frequency for indels
min_quality numeric Minimum base quality
ct_min_tumour_allele_frequency numeric Minimum tumour allele frequency for C>T mutations. Intended as an FFPE filter
remove_1000_genomes logical Flag for removing all variants found in 1,000 genomes[2]
remove_exac logical Flag for removing variants found at AF>0.01 in the Exome Aggregation Consortium
remove_germline_status logical Flag for removing all variants with a status field set to “Germline”. Intended to be used with VarDict

To make it easier to specify filters, the pipeline comes with different sets of default options. These are split into defaults for ctDNA and solid tumours, and can be set by mode: ctdna and mode: tumour, respectively. Any filters specified separately will take precedence over the mode default settings.

For example, the following YAML code will use the ctDNA default settings, but update the min_tumour_variant_reads filter to 20 for all callers.

mode: ctDNA
filters:
  default:
    min_tumour_variant_reads: 20

Solid Tumour Mode

The default settings for the solid tumour mode can be found in the tumour_defaults.yaml file in the package directory.

filters:
  default:
    min_normal_depth: 5
    min_tumour_variant_reads: 5
    min_tumour_allele_frequency: 0.05
    max_normal_allele_frequency: 0.02
    ct_min_tumour_allele_frequency: 0.1
    indel_min_tumour_allele_frequency: 0.1
    remove_1000_genomes: true
    remove_exac: true
  vardict:
    remove_germline_status: true

ctDNA Mode

Defaults for variant calling on ctDNA can be found in the ctdna_defaults.yaml file. Due to low purity, variant allele frequencies in circulating DNA will typically be much lower than those in solid tumour samples. To allow for this, the minimum allele frequency filters are decreased.

filters:
  default:
    min_tumour_variant_reads: 5
    min_tumour_allele_frequency: 0.01
    ct_min_tumour_allele_frequency: 0.05
    indel_min_tumour_allele_frequency: 0.05
    min_normal_depth: 5
    max_normal_allele_frequency: 0
    remove_1000_genomes: true
    remove_exac: true
  pgm:
    indel_min_tumour_allele_frequency: 0.02
  vardict:
    remove_germline_status: true
  isis:
    indel_min_tumour_allele_frequency: 0.02

Examples and Use Cases

Generic Wrapper Script {#examples}

Any call to the VariTAS pipeline requires data to be passed in the form of a dataframe, so the easiest way to interact with it is to create a simple wrapper R script. The goals of the wrapper are to collect the relevant input files in a dataframe, change any necessary VariTAS options, and call the relevant pipeline function.

We can start by arranging the FASTQ files:

library(varitas)
output.directory <- '.'

fastq.directory <- 'inst/extdata/fastq'
fastq.files <- list.files(
  pattern = 'R1.*\\.fastq', 
  path = fastq.directory, 
  full.names = TRUE
  )
fastq.mate.files <- list.files(
  pattern = 'R2.*\\.fastq', 
  path = fastq.directory, 
  full.names = TRUE
  )

fastq.specification <- data.frame(
  # Extract the sample ID from the filename
  sample.id = gsub('.*Sample0(\\d\\d).*', '\\1', basename(fastq.files)),
  reads = fastq.files,
  mates = fastq.mate.files,
  stringsAsFactors = FALSE
  )

print(fastq.specification)
##   sample.id                                      reads
## 1        01 inst/extdata/fastq/2018_Sample001_R1.fastq
## 2        02 inst/extdata/fastq/2018_Sample002_R1.fastq
## 3        03 inst/extdata/fastq/2018_Sample003_R1.fastq
## 4        04 inst/extdata/fastq/2018_Sample004_R1.fastq
##                                        mates
## 1 inst/extdata/fastq/2018_Sample001_R2.fastq
## 2 inst/extdata/fastq/2018_Sample002_R2.fastq
## 3 inst/extdata/fastq/2018_Sample003_R2.fastq
## 4 inst/extdata/fastq/2018_Sample004_R2.fastq

Often, you will need to change settings in the VariTAS config file. As shown in the Introduction, this can be done in one of two ways. The first is to use set.varitas.options() within your wrapper script like so:

set.varitas.options(filters.vardict.min_tumour_depth = 10)

This is suitable for smaller changes, but it is usually more convenient to have a copy of the VariTAS config file for each project or run of the pipeline. This way, all of the settings that are unlikely to change can be easily set and other users will be able to clearly see the config options you used.

config <- 'inst/extdata/varitas_config.yaml'
overwrite.varitas.options(config)

Once the above steps are completed, you are ready to call the main function of the pipeline.

run.varitas.pipeline(
    file.details = fastq.specification,
    output.directory = output.directory,
    variant.callers = c('mutect', 'vardict'),
    quiet = FALSE,
    run.name = 'EXAMPLE',
    email = 'sid@pid.ac.uk'
    )

And those are all the necessary steps to run the pipeline. It will notify you by email when it is finished if you provide an address. On the first attempt, it is advisable to set the quiet parameter to TRUE, which prevents any of the tasks from running. This way, any potential problems can be fixed before a large number of jobs are created.

A full wrapper script template is provided below for completeness and ease of copying-and-pasting.

###############################################################################
## VariTAS Wrapper Script
##
###############################################################################
## Author:
## Adam Mills
###############################################################################
## Libraries:
library(varitas)
###############################################################################
## Main

output.directory <- '.'

fastq.directory <- 'inst/extdata/fastq'
fastq.files <- list.files(
  pattern = 'R1.*\\.fastq', 
  path = fastq.directory, 
  full.names = TRUE
  )
fastq.mate.files <- list.files(
  pattern = 'R2.*\\.fastq', 
  path = fastq.directory, 
  full.names = TRUE
  )

fastq.specification <- data.frame(
  sample.id = gsub('.*Sample0(\\d\\d).*', '\\1', basename(fastq.files)),
  reads = fastq.files,
  mates = fastq.mate.files,
  stringsAsFactors = FALSE
  )

config <- 'inst/extdata/varitas_config.yaml'
overwrite.varitas.options(config)

run.varitas.pipeline(
  file.details = fastq.specification,
  output.directory = output.directory,
  variant.callers = c('mutect', 'vardict'),
  quiet = FALSE,
  run.name = 'EXAMPLE',
  email = 'sid@pid.ac.uk'
  )

Variant Calling with Matched Normal

Data from normal tissue can be used for matched somatic variant calling in the pipeline. When creating your FASTQ specification dataframe, include the columns patient.id and tissue and the pipeline will submit matched normal data to the variant callers.

fastq.specification <- data.frame(
  sample.id = gsub('.*Sample0(\\d\\d).*', '\\1', basename(fastq.files)),
  patient.id = c('X', 'X', 'Y', 'Y'),
  tissue = c('tumour', 'normal', 'tumour', 'normal'),
  reads = fastq.files,
  mates = fastq.mate.files,
  stringsAsFactors = FALSE
  )

print(fastq.specification)
##   sample.id patient.id tissue                                      reads
## 1        01          X tumour inst/extdata/fastq/2018_Sample001_R1.fastq
## 2        02          X normal inst/extdata/fastq/2018_Sample002_R1.fastq
## 3        03          Y tumour inst/extdata/fastq/2018_Sample003_R1.fastq
## 4        04          Y normal inst/extdata/fastq/2018_Sample004_R1.fastq
##                                        mates
## 1 inst/extdata/fastq/2018_Sample001_R2.fastq
## 2 inst/extdata/fastq/2018_Sample002_R2.fastq
## 3 inst/extdata/fastq/2018_Sample003_R2.fastq
## 4 inst/extdata/fastq/2018_Sample004_R2.fastq

Ion PGM Data

Data produced by an Ion PGM system can also be processed by this pipeline using a different function. If you'd like to incorporate the variants called by the machine in the pipeline, simply pass both the BAM files and the VCF files into run.varitas.pipeline.hybrid(). Data from Ion Proton systems can be used in the same way by setting the proton parameter to TRUE.

bam.directory <- 'inst/extdata/bam'
bam.files <- list.files(
  pattern = 'Sample.*\\.bam', 
  path = bam.directory, 
  full.names = TRUE
  )
vcf.directory <- 'inst/extdata/vcf'
vcf.files <- list.files(
  pattern = 'Sample.*\\.vcf', 
  path = vcf.directory, 
  full.names = TRUE
  )

bam.specification <- data.frame(
  sample.id = gsub('^Sample_(\\d+).*', '\\1', basename(bam.files)),
  tumour.bam = bam.files,
  stringsAsFactors = FALSE
  )
vcf.specification <- data.frame(
  sample.id = gsub('^Sample_(\\d+).*', '\\1', basename(vcf.files)),
  vcf = vcf.files,
  caller = rep('pgm', length(vcf.files)),
  stringsAsFactors = FALSE
  )
print(bam.specification)
##   sample.id                    tumour.bam
## 1         1 inst/extdata/bam/Sample_1.bam
## 2         2 inst/extdata/bam/Sample_2.bam
print(vcf.specification)
##   sample.id                           vcf caller
## 1         1 inst/extdata/vcf/Sample_1.vcf    pgm
## 2         2 inst/extdata/vcf/Sample_2.vcf    pgm
run.varitas.pipeline.hybrid(
    bam.specification = bam.specification,
    vcf.specification = vcf.specification,
    output.directory = 'inst/extdata/output/',
    proton = TRUE,
    run.name = 'EXAMPLE',
    quiet = FALSE,
    email = 'sid@pid.ac.uk'
    );

In this version of the pipeline, the alignment stage is skipped and the Ion PGM variant data will be incorporated into the final reports.

MiniSeq Data

To enable users to quickly build file specifications for MiniSeq runs, the VariTAS pipeline has a function prepare.miniseq.specifications(). When passed a MiniSeq sample sheet and the path to a MiniSeq directory, the function will parse through the directory and look for FASTQ/ BAM/ VCF files for each of the samples. By default the Sample_ID column of the MiniSeq sample sheet, up to the first dash, is taken as the sample ID.

prepare.miniseq.specifications() returns a list with elements corresponding to the different file types that have been found. For example, if VCF files were present in the VCF directory, a VCF specification will be named as vcf in the result. Note that you will have to add a column caller to the VCF specification before it can be used in the pipeline.

miniseq.sheet <- 'inst/extdata/miniseq/Example_template.csv'
miniseq.directory <- 'inst/extdata/miniseq'

miniseq.info <- prepare.miniseq.specifications(miniseq.sheet, miniseq.directory)
## Found directories fastq vcf
fastq.specification <- miniseq.info[[ 1 ]]
vcf.specification <- miniseq.info[[ 2 ]]
vcf.specification['caller'] <- rep('miniseq', nrow(vcf.specification))

print(fastq.specification)
##     sample.id                                                   reads
## 001       001 inst/extdata/miniseq/fastq/001_S1_L001_R1_example.fastq
## 002       002 inst/extdata/miniseq/fastq/002_S1_L001_R1_example.fastq
##                                                       mates
## 001 inst/extdata/miniseq/fastq/001_S1_L001_R2_example.fastq
## 002 inst/extdata/miniseq/fastq/002_S1_L001_R2_example.fastq
print(vcf.specification)
##     sample.id                                          vcf  caller
## 001       001 inst/extdata/miniseq/isis/001_S1_example.vcf miniseq
## 002       002 inst/extdata/miniseq/isis/002_S1_example.vcf miniseq

Incorporating MiniSeq Variant Calls

The dataframes generated by the prepare.miniseq.specifications function can be fed into the standard pipeline, or they can be used in the hybrid pipeline. In the latter case, you are able to pass the VCF files much like the Ion PGM scenario in Example 2. By doing so, the pipeline will include the MiniSeq variant calls in the final output.

run.varitas.pipeline.hybrid(
    fastq.specification = fastq.specification,
    vcf.specification = vcf.specification,
    output.directory = 'inst/extdata/output/',
    run.name = 'EXAMPLE',
    quiet = FALSE,
    email = 'sid@pid.ac.uk'
    )

References

[1]: Ion machines use SFF files, which are then converted back to FASTQ. This results in the loss of information on read quality. Another problem with aligning the Ion Torrent FASTQs is the distinct homopolymer error profiles of ion semiconductor sequencing. This is accounted for with the machine aligner, but not by BWA. [2]: Data from phase 3 of the 1,000 genomes project is obtained through ANNOVAR.