Examples

moimport(file,...)

###Basic data import

The function is called with moimport(<file>,...), where the mandatory argument <file> is a string specifying the path of the file to be imported. Consider the following dataset:

provided by the package in the Excel file ‘testDatabasic.xlsx’. The first column contains the sample IDs followed by marker columns. Marker labels are in the first row. All marker columns are of type ‘STR’ (default for argument molecular). The code below imports and transforms this dataset to standard format:

infile <- system.file("extdata", "testDatabasic.xlsx", package = "MLMOI")
Outfile <- moimport(file = infile)
#> Warning: There are no warnings!

The first six rows of the imported data are printed using the command:

head(Outfile)
#>   Sample IDs  M1   M2   M3  M4
#> 1   s-id1005 258  200  200 175
#> 2   s-id1009 258  200  197 175
#> 3   s-id1009 260 <NA> <NA> 177
#> 4   s-id1016 258  200  197 175
#> 5   s-id1018 258  202  197 177
#> 6   s-id1024 258  200  197 175

###Exporting the dataset in standard format

If export is not an absolute path, the dataset will be stored in the current working directory:

Outfile <- moimport(infile, export = "testDatabasicOut.xlsx")

###Importing datasets with metadata {#meta}

The number of metadata columns are specified via argument nummtd. If Users want to retain metadata, keepmtd = TRUE need to be set. The dataset ‘testDatametadata.xlsx’ contains three metadata variables, as shown below:

The appropriate code for importing this dataset is:

infile <- system.file("extdata", "testDatametadata.xlsx", package = "MLMOI")
Outfile <- moimport(infile, nummtd = 3, keepmtd = TRUE)
#> Warning: There are no warnings!

The first six rows of the imported data are printed using the command:

head(Outfile)
#>   Sample IDs     Location year  clonality locus1 locus2 locus3 locus4
#> 1      sam73 Lower Maroni 2006 polyclonal    188    256    102    316
#> 2      sam73 Lower Maroni 2006 polyclonal    166   <NA>   <NA>    336
#> 3      sam11 Lower Maroni 2008 monoclonal    188    250    102    336
#> 4      sam26 Lower Maroni 2010 polyclonal    160    256    102    320
#> 5      sam26 Lower Maroni 2010 polyclonal   <NA>   <NA>    110   <NA>
#> 6     sam427 Lower Maroni 2010 polyclonal    160    250    102    300

If users forget to set nummtd, the import function regards the metadata columns as marker columns. This usually causes various warnings.

###Importing different types of molecular markers

The Excel file ‘testDatamt.xlsx’ provided by the package contains different types of molecular data. This dataset is shown below

Marker columns start from third column. They are from left to right of data type ‘STR’, ‘amino-acid’, ‘SNP’, ‘SNP’, ‘amino-acid’ and ‘codon’. The argument molecular needs to be set as a vector:

infile <- system.file("extdata", "testDatamt.xlsx", package = "MLMOI")
Outfile <- moimport(infile, nummtd = 1, molecular = c('str','amino','snp','snp','amino','codon'))
#> Warning: There are no warnings!

The first six rows of the imported data are printed using the command:

head(Outfile)
#>   Sample IDs m1STR m1Amino m1SNP m2SNP m2Amino m1Codon
#> 1       sam1   132     HIS     C     A     ARG     AGC
#> 2       sam1   136    <NA>  <NA>  <NA>    <NA>    <NA>
#> 3       sam2   136     HIS     C     T     CYS     AGG
#> 4       sam3   140     HIS     G     T     GLY     GGG
#> 5       sam3  <NA>    <NA>  <NA>  <NA>     ARG    <NA>
#> 6       sam4   140     ASP     G     A     ARG     GGG

Since the coding class of markers in this dataset are the default ones, the argument coding is not specified.

###Different molecular types with different coding classes

If a dataset contains a molecular marker whose coding is not assumed as default in moimport, the argument coding needs to be set. Consider the following dataset:

called ‘testDatamtc.xlsx’ provided by the package. The molecular markers in columns 3 (‘SNP’), 5 (‘STR’) and 6 (‘amino-acid’) are not in default coding. Therefore, the argument coding needs to be set:

infile <- system.file("extdata", "testDatamtc.xlsx", package = "MLMOI")
Outfile <- moimport(infile, nummtd = 1,
         molecular = c('snp', 'str', 'str', 'amino', 'snp', 'codon'),
         coding = c('iupac', 'integer', 'nearest', 'full', '4let', 'triplet'))
#> Warning: The cell on row 13 and marker 'snp-iupac' contains an unidentified
#> entry: 'Z'.

Here a warning is issued because the entry ‘z’ is not in coding ‘iupac’. The last six rows of the imported data are printed using the command:

tail(Outfile)
#>    Sample IDs snp-iupac ms-integer ms-real aa-amino snp-nucleo aa-codon
#> 9        xid6      <NA>        148     192      ALA       <NA>      ATT
#> 10       xid7         T        148     188      ALA          G      ATT
#> 11       xid8         A        164     182      GLY          C      ATT
#> 12       xid8         C       <NA>     192      ALA          G      ATG
#> 13       xid9         T        160     188      GLY          C      ATG
#> 14      xid10         A        160     180      ALA          C      ATG

It can be seen that the entry ‘z’ is deleted.

###Data from several worksheets

The file ‘testDatacomplex.xlsx’ provided by the package contains datasets in multiple worksheets as shown below:

The numbering determines the order of worksheets in the Excel file. Clearly, this is the case of multiple worksheet dataset. Hence the option multsheets = TRUE needs to be set. Notice that worksheet (2) is entered in transposed format. Thus, the option transposed needs to be set as a logical vector, specifying which worksheets are in transposed format. The following code imports the dataset:

infile <- system.file("extdata", "testDatacomplex.xlsx", package = "MLMOI")
Outfile <- moimport(infile, multsheets = TRUE, keepmtd = TRUE, nummtd = c(0, 2, 2, 0), transposed = c(FALSE, TRUE, FALSE, FALSE),
         molecular = list(c('str', 'str', 'snp', 'snp'),
                          c('amino', 'codon', 'snp', 'amino', 'codon'),
                          c('codon', 'str'),
                          c('str', 'amino', 'snp', 'str')),
         coding = list(rep(c('integer', '4let'), each = 2), 
                       c('1let', 'triplet', 'iupac', '3let', 'triplet'),
                       c('triplet', 'integer'),
                       c('integer', '1let', 'iupac', 'integer')))
#> Warning: In worksheet 1: The cell in row 3 and marker 'x1tr' contains the
#> unexpected character '/'.
#> Warning: In worksheet 1: The cell in row 5 and marker 'x1tr' contains the
#> unexpected character '/'.
#> Warning: In worksheet 1: The cell in row 5 and marker 'y1tr' contains the
#> unexpected character '/'.
#> Warning: In worksheet 1: The cell in row 7 and marker 'y1tr' contains the
#> unexpected character '/'.
#> Warning: In worksheet 1: The cell in row 4 and marker 'x1np' contains the
#> unexpected character '/'.
#> Warning: In worksheet 1: The cell in row 4 and marker 'y1np' contains the
#> unexpected character '/'.
#> Warning: In worksheet 2: The cell in column 5 and marker 'x2mino' contains the
#> unexpected character '?'.
#> Warning: In worksheet 2: The cell in column 7 and marker 'x2odon' contains the
#> unexpected character '?'.
#> Warning: In worksheet 2: The cell in column 5 and marker 'x2odon' contains the
#> unexpected character '?'.
#> Warning: In worksheet 2: The cell in column 6 and marker 'y2mino' contains the
#> unexpected character '?'.

Several warnings are issued because data in the first and second worksheets have several entry errors. The first seven rows of the imported data are printed using the command:

head(Outfile, n = 7)
#>   Sample IDs  site testDate year x1tr y1tr x1np y1np x2mino x2odon x2np y2mino
#> 1       sid4 Aioun    41794 1997  124   95    A    A    TYR   <NA>    C    CYS
#> 2       sid4 Aioun    41794 1997  134   98 <NA> <NA>   <NA>   <NA>    G    PHE
#> 3       sid4 Aioun    41794 1997 <NA> <NA> <NA> <NA>   <NA>   <NA> <NA>   <NA>
#> 4       sid8 Aioun     <NA> 1997 <NA>   95    A    A    TYR    CCG    A   <NA>
#> 5       sid8 Aioun     <NA> 1997 <NA> <NA> <NA> <NA>   <NA>    CGA    G   <NA>
#> 6       sid3  <NA>    41795 1997  120  102 <NA> <NA>   <NA>   <NA> <NA>   <NA>
#> 7       sid3  <NA>    41795 1997 <NA> <NA> <NA> <NA>   <NA>   <NA> <NA>   <NA>
#>   y2odon x3codon x3tr x4tr x4amino x4np y4tr
#> 1    AAA     CTA   95  280     THR    G  230
#> 2   <NA>     TTA  112  260    <NA>    T <NA>
#> 3   <NA>     CTT <NA> <NA>    <NA> <NA> <NA>
#> 4    AAG    <NA> <NA>  320     GLU    A  230
#> 5   <NA>    <NA> <NA> <NA>     THR <NA> <NA>
#> 6   <NA>     CCT  119  320     GLU    A  190
#> 7   <NA>    <NA> <NA>  280    <NA> <NA>  205

First warning of worksheet (2), addresses an entry with consecutive amino acids. Notice that, in spite of the warning the consecutive amino acids got converted to their equivalent three-letter designations after all.

moimerge(file1, file2,...)

###Merging datasets {#merge}

Consider the datasets ‘testDatamerge1.xlsx’ and ‘testDatamerge2.xlsx’ from the package:

These datasets are already in standard format. Running the following code imports and merges the datasets:

infile1 <- system.file("extdata", "testDatamerge1.xlsx", package = "MLMOI")
infile2 <- system.file("extdata", "testDatamerge2.xlsx", package = "MLMOI")
outfile <- moimerge(infile1, infile2, nummtd1 = 1, nummtd2 = 2, keepmtd = TRUE)
#> Warning:  Metadata of the following sample is not unique:
#>  'village' of sample 'samid10'.

A warning is issued because the metadata information of sample ‘samid10’ differs in two datasets, implying a data entry error. The first six rows of the imported data are printed using the command:

head(outfile, n = 6)
#>   Sample IDs village patient marker1 marker2 marker3 position1 position2
#> 1 "samid10"  "1"     "idy"   "223"   "155"   "Lys"   "A"       "CCC"    
#> 2 "samid12"  "1"     "idx"   "223"   "202"   "Gln"   "T"       "CCA"    
#> 3 "samid12"  "1"     "idx"   NA      NA      NA      NA        "CCC"    
#> 4 "samid23"  "2"     NA      "223"   "109"   "Gln"   NA        NA       
#> 5 "samid23"  "2"     NA      NA      NA      "Lys"   NA        NA       
#> 6 "samid37"  "1"     "idz"   NA      NA      NA      "T"       "CCC"

Users can set export = <Outfile> to export the merged dataset.

moimle(file,...)

###Estimating parameters

Consider again the dataset ‘testDatametadata.xlsx’ in section Importing datasets with metadata. The following code derives the maximum-likelihood estimates (MLE) of the MOI parameter and lineage frequencies for each marker separately

infile <- system.file("extdata", "testDatametadata.xlsx", package = "MLMOI")
mle <- moimle(file = infile, nummtd = 3)
mle$locus1
#> $`Sample Size`
#>       
#> N =  8
#> 
#> $`Allele Prevalence Counts`
#>       160 166 188
#> Nk =    4   2   3
#> 
#> $`Observed Prevalences`
#>         160  166   188
#> Nk/N =  0.5 0.25 0.375
#> 
#> $`Log likelihood at MLE`
#>                            
#> log likelihood =  -11.46237
#> 
#> $`MOI Parameter MLEstimate and Average MOI`
#>           lambda      psi
#> MLE =  0.3848811 1.204755
#> 
#> $`Lineage Frequencies MLEstimates`
#>                160       166       188
#> MLE_p =  0.4521839 0.2162672 0.3315488

###Constraining parameters

Consider again the dataset ‘testDatasetmerge2’ from section Merging datasets.This dataset contains pathological data on marker ‘position2’, resulting in meaningless estimate \(\lambda = 0\). By setting boundaries via argument bounds, the MLE for lineage frequencies is derived from profiling at the bounds using \(\lambda_{min}\) or \(\lambda_{max}\)

infile <- system.file("extdata", "testDatamerge2.xlsx", package = "MLMOI")
moimle(file = infile, nummtd = 2, bounds = c(1.5, 2))
#> Warning: The MLE of MOI parameter at marker 'position1'(1.3863) is lower than
#> minimum lambda.
#> $position1
#> $position1$`Sample Size`
#>       
#> N =  6
#> 
#> $position1$`Allele Prevalence Counts`
#>       A T
#> Nk =  4 4
#> 
#> $position1$`Observed Prevalences`
#>                 A         T
#> Nk/N =  0.6666667 0.6666667
#> 
#> $position1$`Log likelihood at MLE`
#>                            
#> log likelihood =  -6.599933
#> 
#> $position1$`MOI Parameter MLEstimate and Average MOI`
#>          lambda      psi
#> Fixed =     1.5 1.930825
#> 
#> $position1$`Lineage Frequencies MLEstimates`
#>            A   T
#> MLE_p =  0.5 0.5
#> 
#> 
#> $position2
#> $position2$`Sample Size`
#>       
#> N =  6
#> 
#> $position2$`Allele Prevalence Counts`
#>       CCA CCC CGC
#> Nk =    1   6   1
#> 
#> $position2$`Observed Prevalences`
#>               CCA CCC       CGC
#> Nk/N =  0.1666667   1 0.1666667
#> 
#> $position2$`Log likelihood at MLE`
#>                            
#> log likelihood =  -5.779715
#> 
#> $position2$`MOI Parameter MLEstimate and Average MOI`
#>          lambda      psi
#> Fixed =       2 2.313035
#> 
#> $position2$`Lineage Frequencies MLEstimates`
#>                 CCA       CCC        CGC
#> MLE_p =  0.07333523 0.8533295 0.07333523