Estimating sample size for transmission linkage from the false discovery rate

In this vignette we describe how the linkage scenario functionality can be used in reverse, to calculate the sample size needed to achieve a prespecified false discovery rate. For example, this calculation would be useful if a researcher is planning a study for which specimens have yet to be collected, but wants to ensure at least 75% confidence in identified links. To calculate the required sample size, the user needs to provide the sensitivity and specificity of the linkage criteria used, the final size of the outbreak, the average reproductive number, and the desired minimum true discovery rate (\(1-\text{False Discovery Rate}\)):

| Param | Variable Name | Description |
|---|---|---|
| \(\eta\) | sensitivity | the sensitivity of the linkage criteria for identifying transmission links |
| \(\chi\) | specificity | the specificity of the linkage criteria for identifying transmission links |
| \(N\) | N | the final size of the outbreak (total number of infections) |
| \(R\) | R | the average reproductive number (also denoted \(R_\text{pop}\), see below) |
| \(\phi\) | tdr | the desired minimum true discovery rate |

Imagine we are interested in collecting enough samples from an outbreak of 100 cases to ensure that the identified links are correct at least 75% of the time. The phylogenetic criteria we are using to identify links have a sensitivity of 99% and a specificity of 99.5%, and we assume multiple transmissions and multiple links are possible (i.e., we use the default value of the assumption argument, mtml).

library(phylosamp)
translink_samplesize(sensitivity=0.99, specificity=0.995, 
                     N=100, R=1, tdr=0.75)
## [1] 10

Although 10 samples will ensure a false discovery rate of no more than 25%, 10 samples may not provide enough data for analysis. In this case, we can use the optional min_pairs parameter to require that the expected number of links (calculated using the translink_expected_links_obs() function) is at least some minimum value:

translink_samplesize(sensitivity=0.99, specificity=0.995, 
                     N=100, R=1, tdr=0.75, min_pairs=30)
## [1] 50
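
The min_pairs requirement can be checked directly. The following is a minimal sketch, assuming translink_expected_links_obs() accepts the same sensitivity, specificity, rho, M, and R arguments as translink_tdr(); that argument list is an assumption rather than something shown in this vignette. With 50 of the 100 cases sampled, the sampling proportion is 0.5:

```r
library(phylosamp)

# 50 of the 100 cases are sampled, so rho = 50 / 100.
# Argument names are assumed to mirror those of translink_tdr().
expected_links <- translink_expected_links_obs(sensitivity = 0.99,
                                               specificity = 0.995,
                                               rho = 50 / 100, M = 50, R = 1)

# TRUE if the expected number of observed links meets the min_pairs target
expected_links >= 30
```

If the comparison returns FALSE, the argument names or the package version probably differ from what is assumed here.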

In another example, it may be crucial that links are identified with high certainty, so we increase the minimum true discovery rate to 95%:

translink_samplesize(sensitivity=0.99, specificity=0.995, 
                     N=100, R=1, tdr=0.95)
## Error in translink_samplesize(sensitivity = 0.99, specificity = 0.995, : Input values do no produce a viable solution

This result suggests it is not possible to obtain such high certainty with these linkage criteria. We can confirm this with the translink_tdr() function, which shows that even with 100% sampling of this outbreak, we will correctly identify transmission links only about 80% of the time.

translink_tdr(sensitivity=0.99, specificity=0.995, rho=1, M=100, R=1)
## Calculating true discovery rate assuming multiple-transmission and multiple-linkage
## [1] 0.8032454
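
This value can also be reproduced by hand. The sketch below assumes that, under the multiple-transmission and multiple-linkage assumption, the true discovery rate has the closed form \(\phi = \eta\rho(R+1) / [\eta\rho(R+1) + (1-\chi)(M - \rho(R+1) - 1)]\), where \(\rho\) is the sampling proportion and \(M\) is the number of samples. The helper tdr_mtml_sketch() is hypothetical and not part of phylosamp; it is included only to make the arithmetic explicit.

```r
# Hypothetical helper (not part of phylosamp): assumed closed-form true
# discovery rate under multiple-transmission, multiple-linkage.
tdr_mtml_sketch <- function(sensitivity, specificity, rho, M, R) {
  # Term driven by correctly identified links
  true_term <- sensitivity * rho * (R + 1)
  # Term driven by incorrectly identified links
  false_term <- (1 - specificity) * (M - rho * (R + 1) - 1)
  true_term / (true_term + false_term)
}

# Full sampling of a 100-case outbreak: 1.98 / (1.98 + 0.005 * 97)
full_sampling <- tdr_mtml_sketch(sensitivity = 0.99, specificity = 0.995,
                                 rho = 1, M = 100, R = 1)
print(full_sampling)
## [1] 0.8032454
```

With these inputs, the false-link term (0.485) is large enough relative to the true-link term (1.98) to keep the true discovery rate near 80%.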

Further exploration of the translink_samplesize() function reveals that only a limited range of sensitivity and specificity combinations produces reasonable sample sizes, and that specificity in particular affects the minimum false discovery rate that can be obtained. Therefore, correctly estimating sensitivity and specificity is of key importance when using these functions to understand transmission.
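
One way to carry out this kind of exploration is to call translink_samplesize() over a grid of specificity values and record where no viable solution exists. This is a minimal sketch: the grid values are arbitrary choices for illustration, and safe_samplesize() is a hypothetical wrapper, not a phylosamp function. It uses tryCatch() to turn the "no viable solution" error into NA so the loop can continue.

```r
library(phylosamp)

# Illustrative grid of specificity values (chosen for this sketch only)
spec_grid <- c(0.99, 0.995, 0.999, 0.9999)

# Hypothetical wrapper: return NA instead of stopping when the inputs
# do not produce a viable sample size
safe_samplesize <- function(spec) {
  tryCatch(
    suppressMessages(
      translink_samplesize(sensitivity = 0.99, specificity = spec,
                           N = 100, R = 1, tdr = 0.95)
    ),
    error = function(e) NA_real_
  )
}

sizes <- sapply(spec_grid, safe_samplesize)
data.frame(specificity = spec_grid, sample_size = sizes)
```

Rows with NA mark specificity values for which a 95% true discovery rate cannot be reached for this outbreak.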