Converting vcfR objects to other forms

Brian J. Knaus

2023-12-07

Once we have finished examining our data in vcfR, we’ll want to format it so that other softwares can utilize it. A straightforward path is to create a *.vcf.gz format file. One downside to this path is that it creates an intermediate file. When working on large datasets this intermediate file may be rather large. If your path remains in R, it may be preferable to convert your vcfR objects to objects defined by other packages. Here we explore examples of these paths.

Data import

We’ll use two datasets to illustrate data conversion. The function vcfR2genind calls adegenet::df2genind, a function which predates high throughput sequencing. This path currently doesn’t scale well to large datasets. So we’ll begin with the vcfR example dataset. This dataset consists of 19 samples with 2,533 variants. Later we’ll use the example dataset from the package pinfsc50 which includes the same samples, but with 22,0331 variants.

library(vcfR)
## 
##    *****       ***   vcfR   ***       *****
##    This is vcfR 1.15.0 
##      browseVignettes('vcfR') # Documentation
##      citation('vcfR') # Citation
##    *****       *****      *****       *****
data(vcfR_example)

Creating *.vcf.gz format files.

The function write.vcf() can be used to create *.vcf.gz files (gzipped VCF files) from objects of class vcfR or chromR. These VCF files can be used for any downstream analysis which uses VCF files as input.

write.vcf(vcf, "test.vcf.gz")
unlink("test.vcf.gz") # Clean up after our example is done.

Creating genind objects

The packages adegenet and poppr use objects of class genind. We can create genind objects with the function vcfR2genind().

my_genind <- vcfR2genind(vcf)
## Loading required namespace: adegenet
class(my_genind)
## [1] "genind"
## attr(,"package")
## [1] "adegenet"
my_genind
## /// GENIND OBJECT /////////
## 
##  // 18 individuals; 2,533 loci; 5,083 alleles; size: 1.9 Mb
## 
##  // Basic content
##    @tab:  18 x 5083 matrix of allele counts
##    @loc.n.all: number of alleles per locus (range: 1-5)
##    @loc.fac: locus factor for the 5083 columns of @tab
##    @all.names: list of allele names for each locus
##    @ploidy: ploidy of each individual  (range: 2-2)
##    @type:  codom
##    @call: adegenet::df2genind(X = t(x), sep = sep)
## 
##  // Optional content
##    - empty -

The warning is because our example dataset has uninteresting locus names (they’re all NULL). Adegenet replaces these names with slightly more interesting, unique names.

The function vcfR2genind calls extract.gt to create a matrix of genotypes. This matrix is converted into a genind object with the adegenet function df2genind.

Currently, this function does not scale well to large quantities of data. This appears to be due a call to the function adegenet::df2genind (this function was produced prior to high throughput sequencing).

Creating genclone objects

The package poppr uses objects of class genclone as well as genind. Once a genind object has been created, it is fairly straight forward to create a genclone object.

my_genclone <- poppr::as.genclone(my_genind)
## Registered S3 method overwritten by 'pegas':
##   method      from
##   print.amova ade4
class(my_genclone)
## [1] "genclone"
## attr(,"package")
## [1] "poppr"
my_genclone
## 
## This is a genclone object
## -------------------------
## Genotype information:
## 
##      16 original multilocus genotypes 
##      18 diploid individuals
##    2533 codominant loci
## 
## Population information:
## 
##       0 strata. 
##       0 populations defined.

Creating genlight objects

The genlight object is used by adegenet and poppr. It was designed specifically to handle high-throughput genotype data. At present it appears to only support two alleles at a locus, but varying levels of ploidy. Variant callers such as FreeBayes and the GATK’s haplotype caller currently support more than two alleles per locus. To address this incompatibility, vcfR2genelight omits loci that include more than two alleles. The benefit of the genlight object is that the genlight object is much more efficient to use than the genind object as it was designed with high throughput sequencing in mind. When verbose is set to TRUE the function vcfR2genlight will throw a warning and report how many loci it has omitted. When verbose is set to FALSE the loci will be omitted silently.

vcf_file <- system.file("extdata", "pinf_sc50.vcf.gz", package = "pinfsc50")
vcf <- read.vcfR(vcf_file, verbose = FALSE)
x <- vcfR2genlight(vcf)
## Warning in vcfR2genlight(vcf): Found 312 loci with more than two alleles.
## Objects of class genlight only support loci with two alleles.
## 312 loci will be omitted from the genlight object.
x
##  /// GENLIGHT OBJECT /////////
## 
##  // 18 genotypes,  21,719 binary SNPs, size: 2.2 Mb
##  31239 (7.99 %) missing data
## 
##  // Basic content
##    @gen: list of 18 SNPbin
## 
##  // Optional content
##    @ind.names:  18 individual labels
##    @loc.names:  21719 locus labels
##    @chromosome: factor storing chromosomes of the SNPs
##    @position: integer storing positions of the SNPs
##    @other: a list containing: elements without names

Creating snpclone objects

The genlight object is extended by the snpclone object for analysis of clonal and partially clonal populations in poppr. The genlight object can be converted to a snpclone object with functions in the poppr package.

library(poppr)
## Loading required package: adegenet
## Loading required package: ade4
## 
##    /// adegenet 2.1.10 is loaded ////////////
## 
##    > overview: '?adegenet'
##    > tutorials/doc/questions: 'adegenetWeb()' 
##    > bug reports/feature requests: adegenetIssues()
## This is poppr version 2.9.4. To get started, type package?poppr
## OMP parallel support: available
x <- as.snpclone(x)
x
##  ||| SNPCLONE OBJECT |||||||||
## 
##  || 18 genotypes,  21,719 binary SNPs, size: 2.2 Mb
##  31239 (7.99 %) missing data
## 
##  || Basic content
##    @gen: list of 18 SNPbin
##    @mlg: 16 original multilocus genotypes
##    @ploidy: ploidy of each individual  (range: 2-2)
## 
##  || Optional content
##    @ind.names:  18 individual labels
##    @loc.names:  21719 locus labels
##    @chromosome: factor storing chromosomes of the SNPs
##    @position: integer storing positions of the SNPs
##    @other: a list containing: elements without names 
## 
## NULL

Note that we now have a mlg slot to hold multilocus genotype indicators.

Creating DNAbin objects

The package ape handles sequence data using objects of class DNAbin. The VCF file only contains information on variant positions. Omitting invariant data provides for a more efficient representation of the data than including the invariant sites. Converting VCF data to sequence data presents a challenge in that these invariant sites may need to be included. This means that these objects can easily occupy large amounts of memory, and may exceed the physical memory when long sequences with many samples are included. In order to accomodate these issues, we’ve taken an approach which attempts to create DNAbin objects from portions of a chomosome, such as a gene. This means we’ll need a little more information than we’ve needed for other conversions. First, we’ll need to locate and read in our VCF file, a reference sequence and a gff file that has the coordinates for a gene.

# Find the files.
vcf_file <- system.file("extdata", "pinf_sc50.vcf.gz", package = "pinfsc50")
dna_file <- system.file("extdata", "pinf_sc50.fasta", package = "pinfsc50")
gff_file <- system.file("extdata", "pinf_sc50.gff", package = "pinfsc50")
# Read in data.
vcf <- read.vcfR(vcf_file, verbose = FALSE)
dna <- ape::read.dna(dna_file, format="fasta")
gff <- read.table(gff_file, sep="\t", quote = "")

We can use information from the annotation file (gff) to extract a gene. Here we have specifically chosen one which has variants. We can use IUPAC ambiguity codes to convert heterozygous sites into a one character encoding. This results in a single sequence per individual. Alternatively, we can create two haplotypes for each diploid sample, resulting in two sequences per individual.

record <- 130
# my_dnabin1 <- vcfR2DNAbin(vcf, consensus = TRUE, extract.haps = FALSE,
# gt.split='|', ref.seq=dna[,gff[record,4]:gff[record,5]],
# start.pos=gff[record,4], verbose=FALSE)
my_dnabin1 <- vcfR2DNAbin(vcf, consensus = TRUE, extract.haps = FALSE, ref.seq = dna[,
    gff[record, 4]:gff[record, 5]], start.pos = gff[record, 4], verbose = FALSE)
my_dnabin1
## 18 DNA sequences in binary format stored in a matrix.
## 
## All sequences of same length: 1043 
## 
## Labels:
## BL2009P4_us23
## DDR7602
## IN2009T1_us22
## LBUS5
## NL07434
## P10127
## ...
## 
## Base composition:
##     a     c     g     t 
## 0.238 0.295 0.272 0.194 
## (Total: 18.77 kb)

We can visualize the variable sites using tools from the package ‘ape.’

par(mar=c(5,8,4,2))
ape::image.DNAbin(my_dnabin1[,ape::seg.sites(my_dnabin1)])

par(mar=c(5,4,4,2))

Here, the ambiguous sites are visualized as ‘other.’ While the DNAbin object can include the ambiguity codes, not all downstream software handle these codes well. So the user should excercise prudence when using this option.

If we instead create two haplotypes for each diploid sample, it results in a DNAbin object which includes only unambiguous nucleotides(A, C, G and T). This typically requires the data to be phased (I use beagle4). In VCF files this is indicated by delimiting the alleles of the genotype with a pipe (‘|’) for phased data, while unphased data are delimited with a forward slash (‘/’).

# my_dnabin1 <- vcfR2DNAbin(vcf, consensus=FALSE, extract.haps=TRUE,
# gt.split='|', ref.seq=dna[,gff[record,4]:gff[record,5]],
# start.pos=gff[record,4], verbose=FALSE)
my_dnabin1 <- vcfR2DNAbin(vcf, consensus = FALSE, extract.haps = TRUE, ref.seq = dna[,
    gff[record, 4]:gff[record, 5]], start.pos = gff[record, 4], verbose = FALSE)
par(mar=c(5,8,4,2))
ape::image.DNAbin(my_dnabin1[,ape::seg.sites(my_dnabin1)])

par(mar=c(5,4,4,2))

Once we have a DNAbin object, it can be analysed in a number of R packages, such as ape and pegas. We can also output a fasta file for other softwares to use.

write.dna( my_dnabin1, file = 'my_gene.fasta', format = 'fasta' )
unlink('my_gene.fasta') # Clean up after we're done with the example.

Also see:

Creating loci objects

The package pegas uses objects of class loci. We can use the function vcfR2loci to convert our vcfR object to one of class loci.

system.time( my_loci <- vcfR2loci(vcf) )
class(my_loci)

This takes a noticable amount of time to execute but is effective. We can now proceed to downstream analyses.

Conclusion

The use of vcfR is an intermediary point in an analysis. Once VCF data are obtained, vcfR provides an interactive way to scrutinize and filter the data. A number of paths have been provided that take the results of VCF format data from exploration and filtering to downstream analyses by other software that uses VCF files as input or several R packages.