Introduction to outbreaker2

Thibaut Jombart

2022-05-23

This tutorial provides a worked example of outbreak reconstruction using outbreaker2. For installation guidelines, a general overview of the package’s functionalities as well as other resources, see the ‘overview’ vignette:

vignette("Overview", package = "outbreaker2")

We will be analysing a small simulated outbreak distributed with the package, fake_outbreak. This dataset contains simulated dates of onsets, partial contact tracing data and pathogen genome sequences for 30 cases:

library(ape)
#> Warning: package 'ape' was built under R version 4.1.3
#> 
#> Attaching package: 'ape'
#> The following objects are masked from 'package:igraph':
#> 
#>     degree, edges, mst, ring
library(outbreaker2)

col <- "#6666cc"
fake_outbreak
#> $onset
#>  [1]  0  2  4  4  6  6  6  7  7  8  8  8  8  9  9  9  9 10 10 10 10 10 10 10 10
#> [26] 10 10 10 11 11
#> 
#> $sample
#>  [1]  3  5  6  6  7  9  8  9  9  9 11 10 10 10 10 11 11 12 11 13 12 13 11 12 11
#> [26] 11 13 12 14 14
#> 
#> $dna
#> 30 DNA sequences in binary format stored in a matrix.
#> 
#> All sequences of same length: 10000 
#> 
#> Labels:
#> ...
#> 
#> Base composition:
#>     a     c     g     t 
#> 0.251 0.242 0.251 0.256 
#> (Total: 300 kb)
#> 
#> $w
#> [1] 0.04255319 0.21276596 0.42553191 0.31914894
#> 
#> $ances
#>  [1] NA  1  2 NA  3  4  4  5  6  6  7  8  9  5  5  7  7  8  9 10 11 11 13 13 13
#> [26] 17 17 NA 10 13
#> 
#> $ctd
#>     i  j
#> 1   4  7
#> 2  10 29
#> 3  13 30
#> 4  11 22
#> 5  13 25
#> 6  16  7
#> 7  13 23
#> 8  10  6
#> 9  18  8
#> 10 14  5
#> 11 17 26
#> 12 11 21
#> 13  3  5
#> 14 17 27
#> 15  2  3
#> 16 17  7
#> 17 19  9
#> 18 12  8
#> 19 22  6
#> 20 30  9
#> 21 17 20
#> 22 21 29
#> 23 11 14
#> 24 18  9
#> 25 25 29
#> 26  1 17
#> 27 23  7
#> 28 25  3
#> 29  3  9
#> 30  1  4
#> 31 13  4
#> 32 21 27
#> 33 16  2
#> 34 29  3
#> 35 21  5

Here, we will use the dates of case isolation $sample, DNA sequences $dna, contact tracing data $ctd and the empirical distribution of the generation time $w, which can be visualised as:


plot(fake_outbreak$w, type = "h", xlim = c(0, 5), 
     lwd = 30, col = col, lend = 2, 
     xlab = "Days after infection", 
     ylab = "p(new case)", 
     main = "Generation time distribution")


Running the analysis with defaults

By default, outbreaker2 uses the settings defined by create_config(); see the documentation of this function for details. Note that the main function of outbreaker2 is called outbreaker (without number). The function’s arguments are:

args(outbreaker)
#> function (data = outbreaker_data(), config = create_config(), 
#>     priors = custom_priors(), likelihoods = custom_likelihoods(), 
#>     moves = custom_moves()) 
#> NULL

The only mandatory input really is the data. For most cases, customising the method will be done through config and the function create_config(), which creates default and alters settings such as prior parameters, length and rate of sampling from the MCMC, and definition of which parameters should be estimated (‘moved’). The last arguments of outbreaker are used to specify custom prior, likelihood, and movement functions, and are detailed in the ‘Customisation’ vignette.

Let us run the analysis with default settings:


dna <- fake_outbreak$dna
dates <- fake_outbreak$sample
ctd <- fake_outbreak$ctd
w <- fake_outbreak$w
data <- outbreaker_data(dna = dna, dates = dates, ctd = ctd, w_dens = w)

## we set the seed to ensure results won't change
set.seed(1)

res <- outbreaker(data = data)

This analysis will take around 40 seconds on a modern computer. Note that outbreaker2 is slower than outbreaker for the same number of iterations, but the two implementations are actually different. In particular, outbreaker2 performs many more moves than the original package for each iteration of the MCMC, resulting in more efficient mixing. In short: outbreaker2 is slower, but it requires far less iterations.

Results are stored in a data.frame with the special class outbreaker_chains:


class(res)
#> [1] "outbreaker_chains" "data.frame"
dim(res)
#> [1] 201  98
res
#> 
#> 
#>  ///// outbreaker results ///
#> 
#> class:  outbreaker_chains data.frame
#> dimensions 201 rows,  98 columns
#> ancestries not shown: alpha_1 - alpha_30
#> infection dates not shown: t_inf_1 - t_inf_30
#> intermediate generations not shown: kappa_1 - kappa_30
#> 
#> /// head //
#>   step       post       like    prior           mu        pi       eps
#> 1    1 -1198.0669 -1199.4211 1.354240 0.0001000000 0.9000000 0.5000000
#> 2   50  -576.3898  -578.4358 2.046006 0.0001322360 0.9719080 0.6238018
#> 3  100  -579.1822  -580.9235 1.741283 0.0001218749 0.9395509 0.5810956
#>       lambda
#> 1 0.05000000
#> 2 0.06582716
#> 3 0.10193745
#> 
#> ...
#> /// tail //
#>      step      post      like    prior           mu        pi       eps
#> 199  9900 -563.6538 -565.2708 1.616983 0.0001236515 0.9266639 0.7634021
#> 200  9950 -567.0488 -569.0806 2.031785 0.0001179255 0.9703719 0.7057554
#> 201 10000 -564.4037 -565.6457 1.242090 0.0001456115 0.8888590 0.6301657
#>         lambda
#> 199 0.07471296
#> 200 0.07418330
#> 201 0.08802858

Each row of res contains a sample from the MCMC. For each, informations about the step (iteration of the MCMC), log-values of posterior, likelihood and priors, and all parameters and augmented data are returned. Ancestries (i.e. indices of the most recent ancestral case for a given case), are indicated by alpha_[index of the case], dates of infections by t_inf_[index of the case], and number of generations between cases and their infector / ancestor by kappa_[index of the case]:


names(res)
#>  [1] "step"     "post"     "like"     "prior"    "mu"       "pi"      
#>  [7] "eps"      "lambda"   "alpha_1"  "alpha_2"  "alpha_3"  "alpha_4" 
#> [13] "alpha_5"  "alpha_6"  "alpha_7"  "alpha_8"  "alpha_9"  "alpha_10"
#> [19] "alpha_11" "alpha_12" "alpha_13" "alpha_14" "alpha_15" "alpha_16"
#> [25] "alpha_17" "alpha_18" "alpha_19" "alpha_20" "alpha_21" "alpha_22"
#> [31] "alpha_23" "alpha_24" "alpha_25" "alpha_26" "alpha_27" "alpha_28"
#> [37] "alpha_29" "alpha_30" "t_inf_1"  "t_inf_2"  "t_inf_3"  "t_inf_4" 
#> [43] "t_inf_5"  "t_inf_6"  "t_inf_7"  "t_inf_8"  "t_inf_9"  "t_inf_10"
#> [49] "t_inf_11" "t_inf_12" "t_inf_13" "t_inf_14" "t_inf_15" "t_inf_16"
#> [55] "t_inf_17" "t_inf_18" "t_inf_19" "t_inf_20" "t_inf_21" "t_inf_22"
#> [61] "t_inf_23" "t_inf_24" "t_inf_25" "t_inf_26" "t_inf_27" "t_inf_28"
#> [67] "t_inf_29" "t_inf_30" "kappa_1"  "kappa_2"  "kappa_3"  "kappa_4" 
#> [73] "kappa_5"  "kappa_6"  "kappa_7"  "kappa_8"  "kappa_9"  "kappa_10"
#> [79] "kappa_11" "kappa_12" "kappa_13" "kappa_14" "kappa_15" "kappa_16"
#> [85] "kappa_17" "kappa_18" "kappa_19" "kappa_20" "kappa_21" "kappa_22"
#> [91] "kappa_23" "kappa_24" "kappa_25" "kappa_26" "kappa_27" "kappa_28"
#> [97] "kappa_29" "kappa_30"


Analysing the results

Graphics

Results can be visualised using plot, which has several options and can be used to derive various kinds of graphics (see ?plot.outbreaker_chains). The basic plot shows the trace of the log-posterior values, which is useful to assess mixing:


plot(res)

The second argument of plot can be used to visualise traces of any other column in res:


plot(res, "prior")

plot(res, "mu")

plot(res, "t_inf_15")

burnin can be used to discard the first iterations prior to mixing:


## compare this to plot(res)
plot(res, burnin = 2000)

type indicates the type of graphic to plot; roughly:

Here are a few examples:


plot(res, "mu", "hist", burnin = 2000)
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.


plot(res, "mu", "density", burnin = 2000)


plot(res, type = "alpha", burnin = 2000)
#> Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
#> "none")` instead.


plot(res, type = "t_inf", burnin = 2000)
#> Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
#> "none")` instead.


plot(res, type = "kappa", burnin = 2000)
#> Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
#> "none")` instead.


plot(res, type = "network", burnin = 2000, min_support = 0.01)

Using summary

The summary of results derives various distributional statistics for posterior, likelihood and prior densities, as well as for the quantitative parameters. It also builds a consensus tree, by finding for each case the most frequent infector / ancestor in the posterior samples. The corresponding frequencies are reported as ‘support’. The most frequent value of kappa is also reported as ‘generations’:


summary(res)
#> $step
#>    first     last interval  n_steps 
#>        1    10000       50      201 
#> 
#> $post
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#> -1198.1  -564.5  -561.8  -565.2  -559.2  -553.3 
#> 
#> $like
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#> -1199.4  -566.4  -563.7  -567.1  -561.2  -555.4 
#> 
#> $prior
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>  0.6565  1.7928  2.0318  1.9224  2.1461  2.2984 
#> 
#> $mu
#>      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
#> 9.733e-05 1.301e-04 1.448e-04 1.475e-04 1.646e-04 2.108e-04 
#> 
#> $pi
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>  0.8329  0.9449  0.9704  0.9593  0.9828  0.9995 
#> 
#> $tree
#>    from to time   support generations
#> 1    NA  1   -1        NA          NA
#> 2     1  2    1 1.0000000           1
#> 3     2  3    3 1.0000000           1
#> 4    NA  4    2        NA          NA
#> 5     3  5    4 1.0000000           1
#> 6    10  6    8 0.9950249           1
#> 7     4  7    5 1.0000000           1
#> 8     5  8    6 0.9601990           1
#> 9     4  9    5 0.9502488           1
#> 10    9 10    6 0.9950249           1
#> 11    7 11    7 0.8109453           1
#> 12    8 12    7 0.8358209           1
#> 13    9 13    7 1.0000000           1
#> 14    5 14    7 0.9900498           1
#> 15    5 15    7 0.7263682           1
#> 16    7 16    8 0.9950249           1
#> 17    7 17    7 0.9701493           1
#> 18    8 18    9 0.9651741           1
#> 19    9 19    8 1.0000000           1
#> 20   10 20   10 1.0000000           1
#> 21   11 21   10 1.0000000           1
#> 22   11 22   10 1.0000000           1
#> 23   13 23    9 1.0000000           1
#> 24   13 24    9 1.0000000           1
#> 25   13 25    9 1.0000000           1
#> 26   17 26    9 0.9751244           1
#> 27   17 27   10 1.0000000           1
#> 28   NA 28    9        NA          NA
#> 29   10 29   10 1.0000000           1
#> 30   13 30   10 1.0000000           1


Customising settings and priors

As said before, most customisation can be achieved via create_config. In the following, we make the following changes to the defaults:


config2 <- create_config(n_iter = 3e4,
                         sample_every = 20,
                 init_tree ="star",
             move_kappa = FALSE,
             prior_mu = 10)

set.seed(1)

res2 <- outbreaker(data, config2)
plot(res2)

plot(res2, burnin = 2000)

We can see that the burnin is around 2,500 iterations (i.e. after the initial step corresponding to a local optimum). We get the consensus tree from the new results, and compare the inferred tree to the actual ancestries stored in the dataset (fake_outbreak$ances):


summary(res2, burnin = 3000)
#> $step
#>    first     last interval  n_steps 
#>     3020    30000       20     1350 
#> 
#> $post
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>  -543.7  -525.3  -523.1  -523.3  -520.8  -513.7 
#> 
#> $like
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>  -548.0  -529.6  -527.5  -527.6  -525.2  -518.1 
#> 
#> $prior
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>   2.456   4.277   4.434   4.364   4.528   4.604 
#> 
#> $mu
#>      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
#> 7.752e-05 1.219e-04 1.358e-04 1.382e-04 1.541e-04 2.270e-04 
#> 
#> $pi
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>  0.7877  0.9643  0.9814  0.9740  0.9917  1.0000 
#> 
#> $tree
#>    from to time   support generations
#> 1    NA  1   -1        NA          NA
#> 2     1  2    1 0.9992593           1
#> 3     2  3    3 0.9992593           1
#> 4    NA  4    2        NA          NA
#> 5     3  5    4 0.9970370           1
#> 6     4  6    5 0.9614815           1
#> 7     4  7    5 1.0000000           1
#> 8     5  8    6 0.9400000           1
#> 9     6  9    6 1.0000000           1
#> 10    6 10    7 1.0000000           1
#> 11    7 11    7 0.8311111           1
#> 12    8 12    8 0.8311111           1
#> 13    9 13    7 1.0000000           1
#> 14    5 14    7 0.9844444           1
#> 15    5 15    7 0.7281481           1
#> 16    7 16    8 0.9992593           1
#> 17    7 17    7 0.9881481           1
#> 18    8 18    9 0.9829630           1
#> 19    9 19    9 1.0000000           1
#> 20   10 20   10 0.9755556           1
#> 21   11 21   10 1.0000000           1
#> 22   11 22   10 1.0000000           1
#> 23   13 23    9 1.0000000           1
#> 24   13 24   10 1.0000000           1
#> 25   13 25    9 1.0000000           1
#> 26   17 26    9 1.0000000           1
#> 27   17 27   10 1.0000000           1
#> 28   NA 28    9        NA          NA
#> 29   10 29   11 1.0000000           1
#> 30   13 30   11 1.0000000           1
tree2 <- summary(res2, burnin = 3000)$tree

comparison <- data.frame(case = 1:30,
                         inferred = paste(tree2$from),
             true = paste(fake_outbreak$ances),
             stringsAsFactors = FALSE)
             
comparison$correct <- comparison$inferred == comparison$true
comparison
#>    case inferred true correct
#> 1     1       NA   NA    TRUE
#> 2     2        1    1    TRUE
#> 3     3        2    2    TRUE
#> 4     4       NA   NA    TRUE
#> 5     5        3    3    TRUE
#> 6     6        4    4    TRUE
#> 7     7        4    4    TRUE
#> 8     8        5    5    TRUE
#> 9     9        6    6    TRUE
#> 10   10        6    6    TRUE
#> 11   11        7    7    TRUE
#> 12   12        8    8    TRUE
#> 13   13        9    9    TRUE
#> 14   14        5    5    TRUE
#> 15   15        5    5    TRUE
#> 16   16        7    7    TRUE
#> 17   17        7    7    TRUE
#> 18   18        8    8    TRUE
#> 19   19        9    9    TRUE
#> 20   20       10   10    TRUE
#> 21   21       11   11    TRUE
#> 22   22       11   11    TRUE
#> 23   23       13   13    TRUE
#> 24   24       13   13    TRUE
#> 25   25       13   13    TRUE
#> 26   26       17   17    TRUE
#> 27   27       17   17    TRUE
#> 28   28       NA   NA    TRUE
#> 29   29       10   10    TRUE
#> 30   30       13   13    TRUE
mean(comparison$correct)
#> [1] 1

Let’s visualise the posterior trees:


plot(res2, type = "network", burnin = 3000, min_support = 0.01)