Using Optimall

Introduction

When a study population is composed of heterogeneous subpopulations, stratified random sampling techniques are often employed to obtain more precise estimates of population characteristics. Efficiently allocating samples to strata under this method is a crucial step in the study design, especially when sampling is expensive.

optimall offers a collection of functions that are designed to streamline the process of optimum sample allocation, for a single wave or an adaptive, multi-wave approach. Its main functions allow users to:

Define, split, and merge strata based on values or percentiles of other variables.
Calculate the optimum number of samples to allocate to each stratum in a given study in order to minimize the variance of the target sample mean.
Select specific IDs to sample based on a stratified sampling design.
Optimally allocate a fixed number of samples to a sampling wave based on results from a prior wave.

When used together, these functions can automate most of the sampling workflow. This vignette will outline these package features one by one (moving from simple to more complex), introduce the theoretical framework behind the functions, and finally demonstrate how they can be used collectively to perform multi-wave sampling in R. The final section then details how to use optimall_shiny() to efficiently make decisions about splitting strata. For a more detailed description of optimall, see the article published in the Journal of Statistical Software doi:10.18637/jss.v114.i10.

Defining and Refining Strata

In stratified sampling, strata are typically defined on values or quantiles of inexpensive variables available for the entire population. Given a dataset with one row per sampling unit and at least one column containing a variable that can be used to construct strata, optimall allows users to easily define, split, and merge strata.

To demonstrate the simple process of refining strata in optimall, we use the iris dataset from the R package datasets. Suppose that we have defined three strata based on "Species", but we want to split two out of the three, setosa and virginica, in half based on the within-stratum median of "Sepal.Width". In optimall, we can do this by calling the split_strata() function:

library(optimall)
library(datasets)

iris <- datasets::iris
table(iris$Species)
#> 
#>     setosa versicolor  virginica 
#>         50         50         50
iris2 <- split_strata(data = iris,
                     strata = "Species",
                     split = c("setosa", "virginica"), 
                     split_var = "Sepal.Width",
                     split_at = c(0.5), type = "local quantile")
table(iris2$new_strata)
#> 
#>  setosa.Sepal.Width_(3.4,4.4]  setosa.Sepal.Width_[2.3,3.4] 
#>                            22                            28 
#>                    versicolor virginica.Sepal.Width_(3,3.8] 
#>                            50                            17 
#> virginica.Sepal.Width_[2.2,3] 
#>                            33

Similarly, we can merge strata using the function merge_strata():

iris3 <- merge_strata(data = iris,
                     strata = "Species",
                     merge = c("setosa", "versicolor"), 
                     name = "set_or_vers")
table(iris3$new_strata)
#> 
#> set_or_vers   virginica 
#>         100          50

The split_strata and merge_strata() functions are quite versatile. See function documentation for more information.

Optimum Allocation

Assuming that the per-unit sampling cost is the same in each stratum and that \(S_h\), the standard deviation of the variable of interest within each stratum, can be estimated, Neyman (1934) presented the following solution to optimally allocate \(n\) samples among \(H\) strata in order to minimize the variance of the sample mean:

\[n_h = n \frac{N_hS_h}{\sum_{i=1}^H N_iS_i}.\] This formula is known as Neyman allocation and is one of the functions available in optimall. Neyman allocation offers the advantage of outputting sampling fractions that can later be multiplied by \(n\) or taken on their own if \(n\) is not known.

While Neyman allocation has a strong theoretical backing, Wright (2014) points out some limitations that make it sub-optimal in practice.

Neyman does not require that the solution to \(n_h\) is an integer, and thus it rarely is. When taking a fraction of a sample is not practical, researchers are forced to stray from the theory by rounding the sample sizes in ways that are not always optimal.
Closely related to the first issue, the rounded results for \(n_h\) are not guaranteed to sum to \(n\). This is clearly sub-optimal.

Wright offers alternative algorithms that solve the optimal problem for a discrete allocation that sums exactly to \(n\), which can also be implemented in optimall. Essentially, his approaches use linear constraints to optimize the allocation of samples over a space of integer values. He makes use of within-stratum variance and population stratum size to generate priority values, which in turn dictate how many samples should be taken from each stratum. To learn more about the specifics of these algorithms, see Wright (2014).

The optimall package allows users to select between Neyman allocation, Wright Algorithm I, and Wright Algorithm II in the method argument of the optimum_allocation() function. The optimum_allocation() function defaults to using Wright Algorithm II. This algorithm requires that at least 2 samples are taken from each stratum. In optimall, stratum sampling sizes for both Wright algorithms are also constrained from above at \(N_h\), the population stratum size, using the methods for constraints that Wright details in Algorithm III.

Below is an example of how optimum_allocation() could be called to optimally allocate 40 samples among "Species" in the iris dataset, minimizing the variance of the "Sepal.Width" sample mean.

sampling_design <- optimum_allocation(data = iris, strata = "Species", 
                                      y = "Sepal.Width",
                                      nsample = 40, method = "WrightII")
sampling_design
#>       strata npop   sd  n_sd stratum_fraction stratum_size
#> 1     setosa   50 0.38 18.95             0.38           15
#> 2 versicolor   50 0.31 15.69             0.30           12
#> 3  virginica   50 0.32 16.12             0.32           13

If we have a dataframe that holds the \(N_h\) and \(sd_h\) for each stratum instead of data for each individual unit, we can still use optimum_allocation():

iris_summary <- data.frame(strata = unique(iris$Species),
                           size = c(50, 50, 50),
                           sd = c(0.3791, 0.3138, 0.3225))

optimum_allocation(data = iris_summary, strata = "strata",
                   sd_h = "sd", N_h = "size", 
                   nsample = 40, method = "WrightII")
#>       strata npop   sd  n_sd stratum_fraction stratum_size
#> 1     setosa   50 0.38 18.95             0.38           15
#> 2 versicolor   50 0.31 15.69             0.30           12
#> 3  virginica   50 0.32 16.12             0.32           13

Selecting IDs to Sample

With the number of units to sample per stratum specified in a dataframe, optimall can select the IDs of the units to be sampled using simple random sampling within strata with the function sample_strata():

iris$id <- 1:150
set.seed(743)

iris <- sample_strata(data = iris, strata = "Species", id = "id",
                               design_data = sampling_design, design_strata = "strata",
                               n_allocated = "stratum_size")

The output of sample_strata() is the same input dataframe with a new column called "sample_indicator" that holds a binary indicator for whether each unit should be sampled in the specified wave:

head(iris)
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species id sample_indicator
#> 1          5.1         3.5          1.4         0.2  setosa  1                0
#> 2          4.9         3.0          1.4         0.2  setosa  2                0
#> 3          4.7         3.2          1.3         0.2  setosa  3                0
#> 4          4.6         3.1          1.5         0.2  setosa  4                1
#> 5          5.0         3.6          1.4         0.2  setosa  5                0
#> 6          5.4         3.9          1.7         0.4  setosa  6                1

From this dataframe, we can easily extract a vector of the ids to sample:

ids_to_sample <- iris$id[iris$sample_indicator == 1]
head(ids_to_sample)
#> [1]  4  6  8 11 14 15
length(ids_to_sample)
#> [1] 40

Note that design_data is the output of optimum_allocation() in this example, but in practice it can be any dataframe that has one row per stratum and one column specifying each stratum’s desired sample size. As such, any method of allocating samples to strata can be implemented by sample_strata() as long as the design dataframe is specified and simple random sampling within strata is used.

Adaptive, Multi-Wave Sampling

When measuring variables of interest is expensive or difficult, it can be advantageous to employ a multi-phase design, where cheaper variables are collected from the entire population and expensive variables are collected from subsamples selected through adaptive, multi-wave sampling. This approach, which is documented in McIsaac and Cook (2015), involves multiple waves of sampling where information from prior waves is used to inform the optimum sampling design of subsequent ones. With the understanding that both Neyman and Wright’s optimum allocation methods depend heavily on standard deviation estimates for the variable of interest, the benefit of multi-wave sampling is clear to see. Sample allocations based on prior waves will use updated estimates of nuisance parameters that incorporate data accumulated in the prior sampling waves. The optimal sampling proportion is updated at the end of each wave, and this update guides the sampling of the next wave. As the phase II data accumulate, the necessary within-strata SD estimates, and thus the estimated optimal sampling proportions, are expected to be closer to their true value.

In the design described by McIsaac and Cook, a large phase-I sample is first taken to measure the inexpensive covariates and/or outcome. The results of this phase will define strata which are then sampled non-optimally (through proportional or balanced sampling) for measurement of the expensive variable in phase-IIa. The phase-IIa results are used to estimate the standard deviation required to optimally allocate the next wave of samples. This process is iterated until the desired sample size for the variable of interest is achieved. Below is an outline of the workflow, which is facilitated by optimall:

optimall allows users to input phase-I data and iteratively allocate samples for subsequent waves with the function allocate_wave(). This function runs integer-valued optimum_allocation() on a dataset according to Wright’s Algorithm II, but it takes into account previous sampling waves to determine how the current wave of samples should be allocated.

For a simple example, suppose that we again want to allocate 40 samples to minimize the variance of "Sepal.Width" in the iris dataset, but 30 out of the 40 have already been sampled. We assume that the strata are still defined only by "Species", and that 16 of the first 30 samples were taken from the virginica species, 7 from setosa, and 7 from versicolor. Assuming we only have those 30 samples to base our within-stratum standard deviation estimates on, we can allocate the next 10 samples using allocate_wave():

# Set up Wave 1
wave1_design <- data.frame(strata = c("setosa",
                                       "virginica",
                                      "versicolor"),
                           stratum_size = c(7, 16, 7))

# Collect Sepal.Width from only the 30 samples
phase1_data <- subset(datasets::iris, select =  -Sepal.Width)

phase1_data$id <- 1:nrow(phase1_data) #Add id column

set.seed(234)
phase1_data <- sample_strata(data = phase1_data, 
                               strata = "Species", id = "id",
                               design_data = wave1_design,
                               design_strata = "strata",
                               n_allocated = "stratum_size")

wave1_ids <- iris$id[phase1_data$sample_indicator == 1]

wave1_sampled_data <- iris[iris$id %in% wave1_ids, c("id","Sepal.Width")]

wave1_data <- merge(phase1_data, wave1_sampled_data, by = "id", 
                    no.dups =  TRUE, all.x = TRUE)

# We have our 30 samples
table(is.na(wave1_data$Sepal.Width), wave1_data$Species)
#>        
#>         setosa versicolor virginica
#>   FALSE      7          7        16
#>   TRUE      43         43        34

# Now, allocate the next 10:
wave2_design <- allocate_wave(data = wave1_data,
                              strata = "Species", 
                              y = "Sepal.Width",
                              already_sampled = "sample_indicator",
                              nsample = 10, 
                              detailed = TRUE)

wave2_design
#>       strata npop nsample_optimal nsample_actual nsample_prior n_to_sample   sd
#> 1     setosa   50              19             15             7           8 0.47
#> 2 versicolor   50              12              9             7           2 0.29
#> 3  virginica   50               9             16            16           0 0.24

Notice that in this case, "nsample_optimal" does not match "nsample_actual" for every stratum in the outputted design dataframe. This occurred because we oversampled from the virginica stratum in wave 1, meaning that the optimum stratum sample size among 40 total samples was smaller than the amount of samples already taken in that group. When oversampling occurs, allocate_wave() recalculates the optimum allocation among the non-oversampled strata and uses that result to allocate the specified number of samples as optimally as possible. When no oversampling occurs, "nsample_optimal" will match "nsample_actual" for every stratum.

Now we can easily get the 10 new ids to sample using the wave2_design as the design_data in sample_strata(). Notice that we do not have to specify design_strata or n_allocated because the default arguments are the column names of the allocate_wave() output:

 # Run sample_strata
wave2_data <- sample_strata(data = wave1_data, strata = "Species",
                            id = "id", already_sampled = "sample_indicator",
                            design_data = wave2_design)

 # Extract the 10 ids to sample
wave2_ids <- iris$id[wave2_data$sample_indicator == 1]

wave2_ids
#>  [1]  8 15 22 24 26 32 40 47 57 68

Examples: Using `optimall` to Determine Study Design and Sample Data

Overview

We use the simulated dataset MatWgt_Sim to demonstrate how the functions in optimall can be used to carry out study design and sampling tasks in R. This section includes three examples, which build upon eachother during the design and execution of a multi-wave survey:

Example 1: Uses the function split_strata() to define and refine strata.
Example 2: Uses the function sample_strata() to randomly select units to be sampled based on a specified sampling design.
Example 3: Demonstrates how to conduct an adaptive, multi-wave sampling design that uses an optimal Neyman allocation scheme at each step to allocate the next wave across strata.

The dataset contains simulated data based on a real study of the association between maternal weight gain during pregnancy and the risk of childhood obesity after controlling for a number of clinical and demographic covariates. In this hypothetical example, the study data are obtained from electronic health records, which are known to be error-prone and do not have all the variables of interested available without manual chart review. In a perfect world, chart review would be used to validate and obtain the necessary variables on every observation in our sample, but chart review is an expensive and difficult task. Researchers determined that they could only reasonably afford to validate 750 out of the 10,335 child-mother pairs. We refer to the 10,335 as the phase 1 sample, for which we have error-prone observations on all, and we refer to the 750 as the phase II sample, which will be designed using tools in the optimall package. For simplicity, suppose that we want to use these 750 samples to estimate the true population mean of maternal weight change during pregnancy with minimal variance. This goal will be accomplished in our last example through an adaptive, multi-wave sampling design.

Data Set Up

The simulated data used in this example is included with optimall. The dataset MatWgt_Sim contains 10335 rows, one for each mother-child pair, and 6 columns containing ID numbers and covariates which we see below:

data(MatWgt_Sim, package = "optimall")
head(MatWgt_Sim)
#>     id mat_weight_est  race diabetes obesity mat_weight_true
#> 1 7667      12.954711 Asian        0       0       10.768935
#> 2 8554       9.607124 Asian        0       1        8.892049
#> 3 6324      10.998284 Asian        0       0       12.075861
#> 4 8320      10.702210 Asian        0       0       10.887584
#> 5 8543       6.485988 Asian        1       0        7.331542
#> 6  127      13.338445 Asian        0       1       12.936826

Three of the covariates, child race, diabetes, and estimated maternal weight, are inexpensive to collect for all subjects, but determining the true maternal weight requires a tedious validation process. This full version of the simulated dataset contains the true weight change for all 10335 mothers, which may or may not have a relationship with the other covariates.

For the purpose of our examples, we suppose that we do not have access to this full dataset. Instead, we must sample from it to optimally estimate the mean of the true maternal weight changes.

During phase-I, we assume that all 10335 mother-child pairs have been sampled for the inexpensive covariates. Accordingly, we define our phase 1 dataset to be 10335 x 5 with every column of the full MatWgt_Sim data excluding the expensive true weight change. In the examples that follow, we will assume the validated maternal weight variable mat_weight_true will only be available through phase 2 sampling.


# Get phase 1 data, remove mat_weight_true which we assume is unknown
phase1 <- subset(MatWgt_Sim, select =  -mat_weight_true)
dim(phase1)
#> [1] 10335     5

Example 1: Defining Sample Strata

We suspect that the true mean maternal weight change may vary with some of the inexpensive covariates in the phase 1 dataset, so we decide to split our population into 9 non-overlapping strata. Each stratum will be defined by a unique combination of race and global percentile of maternal weight gain (≤25th, 25th - 75th, >75th). We can accomplish this quickly in optimall using the split_strata() function.

phase1$strata <- phase1$race #initialize a strata column first

# Merge the smallest race categories

phase1 <- merge_strata(data = phase1, 
                          strata = "race",
                          merge = c("Other","Asian"),
                          name = "Other")

phase1 <- split_strata(data = phase1, strata = "new_strata", split = NULL, 
                       split_var = "mat_weight_est", 
                       type = "global quantile", 
                       split_at = c(0.25,0.75),
                       trunc = "MWC_est")
# Trunc argument specifies how to refer to mat_weight_est in new strata names

We now have the same phase1 data with a new column specifying the strata we have defined.

head(phase1)
#>                   new_strata old_strata   id mat_weight_est diabetes obesity
#> 1 Other.MWC_est_(9.75,15.06]      Other 7667      12.954711        0       0
#> 2 Other.MWC_est_[-5.39,9.75]      Other 8554       9.607124        0       1
#> 3 Other.MWC_est_(9.75,15.06]      Other 6324      10.998284        0       0
#> 4 Other.MWC_est_(9.75,15.06]      Other 8320      10.702210        0       0
#> 5 Other.MWC_est_[-5.39,9.75]      Other 8543       6.485988        1       0
#> 6 Other.MWC_est_(9.75,15.06]      Other  127      13.338445        0       1
#>   strata
#> 1  Asian
#> 2  Asian
#> 3  Asian
#> 4  Asian
#> 5  Asian
#> 6  Asian
table(phase1$new_strata) # 9 strata
#> 
#> Black.MWC_est_(15.06,38.46]  Black.MWC_est_(9.75,15.06] 
#>                         628                        1154 
#> Black.MWC_est_[-30.21,9.75] Other.MWC_est_(15.06,30.94] 
#>                         745                         325 
#>  Other.MWC_est_(9.75,15.06]  Other.MWC_est_[-5.39,9.75] 
#>                         929                         456 
#> White.MWC_est_(15.06,51.69]  White.MWC_est_(9.75,15.06] 
#>                        1631                        3084 
#> White.MWC_est_[-25.68,9.75] 
#>                        1383

Example 2: Create an Initial Phase 2 Subsample: Phase-IIa

With the strata now defined by the inexpensive variables sampled in phase-I, we are ready to begin auditing patient records for validation of the true maternal weight changes. Without any validated data to define the optimum phase-IIa sample allocation, we decide to use proportional stratified sampling for the first 250 out of our 750 audits. Conveniently, optimall can select this random sample for us with sample_strata().

The sample_strata() function requires as input two dataframes, one containing the stratum membership of each individual unit (data) and a second specifying the sampling design (design_data). The design_data dataframe must contain at least two columns:

design_strata: A column holding strata names
n_allocated: The total \(n\) allocated to each stratum.

In our example, we want the proportion of samples randomly taken from each stratum to reflect its population proportion and a total of 250 samples.

phase2a_design <- data.frame(
  strata_name = names(table(phase1$new_strata)), 
  strata_prop = as.vector(table(phase1$new_strata))/10335,
  strata_n = round(250.3*as.vector(table(phase1$new_strata))/10335)
  ) # 250.3 to make sure 250 samples after rounding
sum(phase2a_design$strata_n) 
#> [1] 250
phase2a_design
#>                   strata_name strata_prop strata_n
#> 1 Black.MWC_est_(15.06,38.46]  0.06076439       15
#> 2  Black.MWC_est_(9.75,15.06]  0.11165941       28
#> 3 Black.MWC_est_[-30.21,9.75]  0.07208515       18
#> 4 Other.MWC_est_(15.06,30.94]  0.03144654        8
#> 5  Other.MWC_est_(9.75,15.06]  0.08988873       22
#> 6  Other.MWC_est_[-5.39,9.75]  0.04412192       11
#> 7 White.MWC_est_(15.06,51.69]  0.15781326       40
#> 8  White.MWC_est_(9.75,15.06]  0.29840348       75
#> 9 White.MWC_est_[-25.68,9.75]  0.13381713       33

We can now call sample_strata() to randomly draw the specified number samples from each stratum. It will output the same phase1 dataframe with an extra column indicating which units should be sampled. We can then extract the ids to sample based on this indicator.

phase1 <- sample_strata(data = phase1, strata = "new_strata", 
                        id = "id", design_data = phase2a_design, 
                        design_strata = "strata_name", 
                        n_allocated = "strata_n")
ids_to_sample2a <- phase1[phase1$sample_indicator == 1,"id"]
length(ids_to_sample2a) # 250 ids to sample
#> [1] 250

We submit these 250 ids for validation. Hypothetically, this involves the hard work of trained nurses, but for our example it involves only a few lines of code.

phase2a_samples <- MatWgt_Sim[MatWgt_Sim$id %in% ids_to_sample2a, 
                              c("id","mat_weight_true")]

phase2a <- merge(phase1, phase2a_samples, by = "id",
                 no.dups =  TRUE, all.x = TRUE)
names(phase2a)
#> [1] "id"               "new_strata"       "old_strata"       "mat_weight_est"  
#> [5] "diabetes"         "obesity"          "strata"           "sample_indicator"
#> [9] "mat_weight_true"
table(is.na(phase2a$mat_weight_true)) 
#> 
#> FALSE  TRUE 
#>   250 10085

We notice that all of the units that were not validated have NA values in the new columns. We will use the non-missing validated data to inform optimum allocation in our future sampling waves.

Example 3: Optimally Allocate 2nd and 3rd Waves of Phase 2 Sample

In phase-IIa, we allocated samples to strata using proportional sampling rather than optimum allocation because we had no prior samples of the true maternal weight change from which to estimate within-stratum variances. Now that we have completed the first wave of validation, we can estimate these variances and will thus use optimum allocation for each of the subsequent sampling waves.

In phase-IIb (wave 2), we will optimally allocate 250 more samples, raising our total number of validated samples to 500. The allocate_wave() function makes this step simple by calculating the optimum allocation for 500 samples, determining how many units have already been sampled in previous waves (only Phase-IIa in this case), and allocating the 250 samples of the current wave to make up the difference. The output is a design dataframe summarizing the results for each stratum.

phase2b <- phase2a

# Add indicator for units that were already sampled
phase2b$already_sampled <- ifelse(phase2b$id %in% ids_to_sample2a, 1, 0)

# Make design 
phase2b_design <- allocate_wave(data = phase2b, strata = "new_strata",
                             y = "mat_weight_true",
                             already_sampled = "already_sampled", 
                             nsample = 250,
                             detailed = TRUE)
phase2b_design
#>                        strata npop nsample_optimal nsample_actual nsample_prior
#> 1 Black.MWC_est_(15.06,38.46]  628              46             46            15
#> 2  Black.MWC_est_(9.75,15.06] 1154              38             38            28
#> 3 Black.MWC_est_[-30.21,9.75]  745              52             52            18
#> 4 Other.MWC_est_(15.06,30.94]  325              12             12             8
#> 5  Other.MWC_est_(9.75,15.06]  929              34             34            22
#> 6  Other.MWC_est_[-5.39,9.75]  456              30             30            11
#> 7 White.MWC_est_(15.06,51.69] 1631              83             83            40
#> 8  White.MWC_est_(9.75,15.06] 3084             117            117            75
#> 9 White.MWC_est_[-25.68,9.75] 1383              88             88            33
#>   n_to_sample   sd
#> 1          31 3.51
#> 2          10 1.58
#> 3          34 3.32
#> 4           4 1.75
#> 5          12 1.73
#> 6          19 3.18
#> 7          43 2.44
#> 8          42 1.80
#> 9          55 3.04

Looking at the output of allocate_wave(), we notice that the optimum sample sizes for the next wave vary greatly between strata, but all are greater than one. Because we set detailed = TRUE, we can also see what the optimum allocation for 500 samples would have been ignoring prior wave sizes in the column "nsample_optimal". This column matches "nsample_actual" because we did not oversample any strata in phase-IIa.

In other cases when "n_to_sample" = 0 for some strata, we may not be so lucky. If the optimum sample size in a stratum is smaller than the amount it was allocated in previous waves ("nsample_prior"), we say that that stratum has been oversampled. When oversampling occurs, allocate_wave() “closes” the oversampled strata and re-allocates the remaining samples optimally among the open strata. Under these circumstances, the total sampling allocation is no longer optimal, but optimall will output the most optimal allocation from the non-oversampled strata for the next wave.

Although we don’t see evidence of oversampling, these results suggest that our strata are relatively imbalanced. We decide to split the largest stratum into three similarly-sized strata along mat_weight_est using the local quantile option of split_strata().

phase2b <- split_strata(data = phase2b, 
                        split = "White.MWC_est_(9.75,15.06]",
                        strata = "new_strata", 
                        split_var = "mat_weight_est",
                        type = "local quantile", split_at = c(1/3,2/3))

With our strata now more balanced in size, we can re-run allocate_wave() to make a new design dataframe for sample_strata() to use to determine which ids to sample in phase-IIb.

phase2b_design <- allocate_wave(data = phase2b, 
                             strata = "new_strata",
                             y = "mat_weight_true",
                             already_sampled = "already_sampled",
                             nsample = 250)
phase2b_design
#>                                                     strata npop nsample_actual
#> 1                              Black.MWC_est_(15.06,38.46]  628             49
#> 2                               Black.MWC_est_(9.75,15.06] 1154             40
#> 3                              Black.MWC_est_[-30.21,9.75]  745             55
#> 4                              Other.MWC_est_(15.06,30.94]  325             13
#> 5                               Other.MWC_est_(9.75,15.06]  929             35
#> 6                               Other.MWC_est_[-5.39,9.75]  456             32
#> 7                              White.MWC_est_(15.06,51.69] 1631             88
#> 8  White.MWC_est_(9.75,15.06].mat_weight_est_(11.44,12.99] 1028             29
#> 9  White.MWC_est_(9.75,15.06].mat_weight_est_(12.99,15.05] 1028             39
#> 10  White.MWC_est_(9.75,15.06].mat_weight_est_[9.75,11.44] 1028             28
#> 11                             White.MWC_est_[-25.68,9.75] 1383             92
#>    nsample_prior n_to_sample
#> 1             15          34
#> 2             28          12
#> 3             18          37
#> 4              8           5
#> 5             22          13
#> 6             11          21
#> 7             40          48
#> 8             29           0
#> 9             21          18
#> 10            25           3
#> 11            33          59

# Extract IDs to sample
ids_to_sample2b <- sample_strata(data = phase2b, 
                                 strata = "new_strata", 
                                 id = "id", 
                                 already_sampled = "already_sampled", 
                                 design_data = phase2b_design,
                                 design_strata = "strata", 
                                 n_allocated = "n_to_sample")
ids_to_sample2b <- 
  ids_to_sample2b[ids_to_sample2b$sample_indicator == 1,"id"]
length(ids_to_sample2b)
#> [1] 250

We can now get the validated maternal weight changes for these 250 subjects.

# Take samples from true data
phase2b_samples <- MatWgt_Sim[MatWgt_Sim$id %in% ids_to_sample2b,
                              c("id","mat_weight_true")]

# Merge extracted samples with our data.
# Note we use dplyr's coalesce here to merge sampled mat_weight_true with 
# the mat_weight_true solumn already present in phase2b, but manner of 
# merging of samples may vary from project to project.
# For simpler merges, see 'Multiwave Object' vignette.

phase2b <- merge(phase2b, phase2b_samples, by = "id", 
                 no.dups = TRUE, all.x = TRUE)

library(dplyr)

phase2b$mat_weight_true <- 
  dplyr::coalesce(phase2b$mat_weight_true.x,
                  phase2b$mat_weight_true.y)
phase2b <- subset(phase2b, select = -c(mat_weight_true.x,
                                       mat_weight_true.y))

table(is.na(phase2b$mat_weight_true)) # All are NA besides already sampled.
#> 
#> FALSE  TRUE 
#>   500  9835

With these results in, we can move onto our final wave.

Phase-IIc

After auditing the 250 additional samples in Phase-IIb, we only have 250 left to validate until we reach our goal of 750. We will combine what we have learned in the previous waves of phase-II to optimally assign these final 250 samples in phase-IIc.

Following the same steps as in phase-IIb, we use allocate_wave() to find the optimum sample sizes in each stratum based on estimates of the variance using our previous samples. We then determine which ids to sample using sample_strata().

phase2c <- phase2b

# Add indicator for units that were already sampled
phase2c$already_sampled <- ifelse(phase2b$id %in% c(ids_to_sample2a,
                                                    ids_to_sample2b), 
                                  1, 0)

# Make Design
phase2c_design <- allocate_wave(data = phase2c, strata = "new_strata",
                             y = "mat_weight_true",
                             already_sampled = "already_sampled", 
                             nsample = 250,
                             detailed = TRUE)
phase2c_design
#>                                                     strata npop nsample_optimal
#> 1                              Black.MWC_est_(15.06,38.46]  628              60
#> 2                               Black.MWC_est_(9.75,15.06] 1154              56
#> 3                              Black.MWC_est_[-30.21,9.75]  745              99
#> 4                              Other.MWC_est_(15.06,30.94]  325              18
#> 5                               Other.MWC_est_(9.75,15.06]  929              50
#> 6                               Other.MWC_est_[-5.39,9.75]  456              35
#> 7                              White.MWC_est_(15.06,51.69] 1631             176
#> 8  White.MWC_est_(9.75,15.06].mat_weight_est_(11.44,12.99] 1028              38
#> 9  White.MWC_est_(9.75,15.06].mat_weight_est_(12.99,15.05] 1028              48
#> 10  White.MWC_est_(9.75,15.06].mat_weight_est_[9.75,11.44] 1028              37
#> 11                             White.MWC_est_[-25.68,9.75] 1383             133
#>    nsample_actual nsample_prior n_to_sample   sd
#> 1              60            49          11 3.07
#> 2              56            40          16 1.56
#> 3              99            55          44 4.29
#> 4              18            13           5 1.83
#> 5              50            35          15 1.74
#> 6              35            32           3 2.47
#> 7             176            88          88 3.50
#> 8              38            29           9 1.20
#> 9              48            39           9 1.50
#> 10             37            28           9 1.16
#> 11            133            92          41 3.12

# Find IDs to sample
ids_to_sample2c <- sample_strata(data = phase2c, 
                                 strata = "new_strata", 
                                 id = "id", 
                                 already_sampled = "already_sampled", 
                                 design_data = phase2c_design,
                                 design_strata = "strata", 
                                 n_allocated = "n_to_sample")
ids_to_sample2c <- 
  ids_to_sample2c[ids_to_sample2c$sample_indicator == 1,"id"]
length(ids_to_sample2c)
#> [1] 250

# Sample
phase2c_samples <- MatWgt_Sim[MatWgt_Sim$id %in% ids_to_sample2c,
                              c("id","mat_weight_true")]

# Add samples to phase2c dataset
phase2c <- merge(phase2c, phase2c_samples, by = "id", 
                 no.dups = TRUE, all.x = TRUE)
phase2c$mat_weight_true <- 
  dplyr::coalesce(phase2c$mat_weight_true.x,
                  phase2c$mat_weight_true.y)
phase2c <- subset(phase2c, select = -c(mat_weight_true.x,
                                       mat_weight_true.y))

We have now validated 750 samples!

As this example shows, there are quite a few moving parts to keep track of in the multi-wave sampling workflow. When performing a multi-wave sampling design, you may also want to consider using optimall's multiwave object to organize and visualize the process. This feature is described in detail in the package vignette titled “The Multiwave Object”.

Splitting Strata Efficiently with `optimall_shiny()`

In many cases, deciding which strata to split and where is a difficult task. split_strata() makes this job easier, but it is designed for situations where the strata and split values have already been decided by the user. Running it iteratively to experiment with different splits is possible yet tedious.

To help users make these difficult decisions, optimall includes an R Shiny application that reacts in real time to user selections of splitting parameters. It can be called using optimall_shiny(). See the screenshot below:

Each time the user updates an input, the parameters of the split_strata() function are updated accordingly, and the resulting dataframe showing the optimum allocation of samples among the new strata is displayed. Once the user is satsfied with a set of inputs, they can confirm the split and the underlying data will be updated. The code used to perform the split will also be displayed.

Note that the data in the Shiny App has to be loaded in from a file, and it is thus separate from data being used in the optimall workflow. This means that the user has to return to their R session to make the changes determined in the Shiny app. The app, however, makes this easy by printing the code to perform all of the confirmed changes, so making them in R is as easy as copying and pasting. You may have to update the "data" argument.

For more information on optimall_shiny see the package vignette titled “Splitting Strata with Optimall Shiny”.

References

McIsaac MA, Cook RJ. Adaptive sampling in two‐phase designs: a biomarker study for progression in arthritis. Statistics in Medicine. 2015 Sep 20;34(21):2899-912.

Wright, T. A simple method of exact optimal sample allocation under stratification with any mixed constraint patterns.2014; Statistics, 07.

Using Optimall

Jasper Yang and Pamela Shaw

Last updated: January 28, 2026

Introduction

Defining and Refining Strata

Optimum Allocation

Selecting IDs to Sample

Adaptive, Multi-Wave Sampling

Examples: Using `optimall` to Determine Study Design and Sample Data

Overview

Data Set Up

Example 1: Defining Sample Strata

Example 2: Create an Initial Phase 2 Subsample: Phase-IIa

Example 3: Optimally Allocate 2nd and 3rd Waves of Phase 2 Sample

Phase-IIc

Splitting Strata Efficiently with `optimall_shiny()`

References

Using Optimall

Jasper Yang and Pamela Shaw

Last updated: January 28, 2026

Introduction

Defining and Refining Strata

Optimum Allocation

Selecting IDs to Sample

Adaptive, Multi-Wave Sampling

Examples: Using optimall to Determine Study Design and Sample Data

Overview

Data Set Up

Example 1: Defining Sample Strata

Example 2: Create an Initial Phase 2 Subsample: Phase-IIa

Example 3: Optimally Allocate 2nd and 3rd Waves of Phase 2 Sample

Phase-IIc

Splitting Strata Efficiently with optimall_shiny()

References

Examples: Using `optimall` to Determine Study Design and Sample Data

Splitting Strata Efficiently with `optimall_shiny()`