Determining sample sizes with ssize

library(whSample)

The ssize function is part of the whSample package of utilities for sampling. It is used by the sampler function to estimate the minimum sample size necessary to achieve statistical requirements using a Normal Approximation to the Hypergeometric Distribution. It also can be used as a standalone to determine sample sizes under various conditions.

This approximation to the hypergeometric distribution spans the probabilities of yes/no-type responses without replacement. Its parameters are:

N, the population size.
ci, the required confidence interval. The default is 95%.
me, the required level of precision, or margin of error. The default is +/- 7%.
p, the anticipated rate of occurrence. The default is 50%.

ssize requires only the N argument. Used as a standalone, it can explore sample sizes under other conditions.

The full range of command-line options is:

ssize(N, ci=0.95, me=0.07, p=0.5)

After loading the whSample package, ssize can be run with the defaults given only a population size:

ssize(10000)
#> [1] 193

This says a random sample of 193 of the 10,000 items in the population is estimated to be the minimum necessary to satisfy a confidence level of 95% with a precision level of +/- 7%. The anticipated rate of occurence by default is 50-50, which will produce the highest sample size.

The parameter most likely to change is the anticipated rate of occurrence. Many sampling situations provide opportunities to update the rate of occurrence during the sampling process. Let’s say that after a reasonable number of sample tests (somewhere in excess of 30 or 50), we find the rate of occurrence of the item we’re testing shows “positives” in 60 percent of the trials. If the conduct of a single trial is expensive in resources, we may want to re-estimate the necessary sample size:

ssize(10000, p=0.6)
#> [1] 185

A new target of 185 would save eight trials. Since the goal of sampling is to get the most reliable results using the least amount of resources, this may be nice. Three words of caution come to mind:

More samples mean more reliable results.
Continuing to test toward the new 185 target may show the true rate of occurrence may be higher or lower, so continuing to check periodically may be a good idea.
Sample size estimators do only that: they estimate. Only post-sampling analysis can confirm if the confidence and precision requirements were actually achieved.