To allocate a sample among different stages of sampling, the
contributions of the different stages to the variance of an estimator
must be considered. These components of variance generally depend on the
analysis variable and also on the form of the estimator. This vignette
covers some basic variance results for linear estimators in two-stage
and three-stage sampling and how the components can be estimated with
functions in `PracTools`

. Technical background is in Valliant, Dever, and Kreuter (2018b), ch.9.
First, the package must be loaded with

`library(PracTools)`

Alternatively, `require(PracTools)`

can be used.

Consider a two-stage sample design in which the first-stage units are
selected using \(\pi ps\) sampling,
i.e., with varying probabilities and without replacement. We will also
refer to this as *ppswor* sampling. Elements are selected at the
second stage via simple random sampling without replacement
(*srswor*). Quite a bit of notation is needed, even in this
fairly simple case:

\(\small U\) = universe of PSUs

\(\small M\) = number of PSUs in universe

\(\small U_{i}\) = universe of elements in PSU \(\small i\)

\(\small N_{i}\) = number of elements in the population for PSU \(\small i\)

\(\small N=\sum _{i\in U}N_{i}\) is the total number of elements in the population

\(\small \pi_{i}\) = selection probability of PSU \(\small i\)

\(\small \pi_{ij}\) = joint selection probability of PSUs \(\small i\) and \(\small j\)

\(\small m\) = number of sample PSUs

\(\small n_{i}\) = number of sample elements in PSU \(\small i\)

\(\small s\) = set of sample PSUs

\(\small s_{i}\) = set of sample elements in PSU \(\small i\)

\(\small y_{k}\) = analysis variable for element \(\small k\) in PSU \(\small i\) (subscript \(\small i\) is implied)

\(\small \bar{y}_{U}\) = mean per element in the population

\(\small \bar{y}_{Ui}\) = mean per element in the population in PSU \(\small i\)

The \(\small \pi\)-estimator of the population total, \(\small t_{U} =\sum _{i\in U}\sum _{k\in U_{i} }y_{k}\), of an analysis variable \(\small y\) is

\[\small \hat{t}_{\pi } =\sum _{i\in s}\frac{\hat{t}_{i} }{\pi _{i} }\]

where \(\small \hat{t}_{i} =\left({N_{i}/ n_{i} } \right)\sum _{k\in s_{i} }y_{k}\), which is the estimate of the total for PSU \(\small i\) with a simple random sample. The design variance of the estimated total can be written as the sum of two components: \[\begin{equation} \label{eq:pivar} \small {V}\left(\hat{t}_{\pi } \right)=\sum _{i\in U}\sum _{j\in U}\left(\pi _{ij} -\pi _{i} \pi _{j} \right)\frac{t_{i} }{\pi _{i} } \frac{t_{j} }{\pi _{j} } +\sum _{i\in U}\frac{N_{i} ^{2} }{\pi _{i} n_{i} } \left(1-\frac{n_{i} }{N_{i} } \right)S_{U2i}^{2} \end{equation}\] where

\[\small S_{U2i}^{2} = \sum_{k\in U_{i} }\left(y_{k} -\bar{y}_{Ui} \right)^{2} / \left(N_{i} -1\right) \]

is the unit variance of \(\small y\) among the elements in PSU \(\small i\).

Formula \(\small \eqref{eq:pivar}\)
is difficult or impossible to use for sample size computations because
the number of PSUs in the sample is not exposed. Another is to analyze
*srswor* sampling of PSUs and SSUs as in Example 1 below.
Determining sample sizes this way does not mean that you are necessarily
locked into selecting PSUs and elements within PSUs via *srswor*
or *srswr*. Basing sample sizes on a design that is less
complicated than the one that will actually be used is a common
approach, although it can be deceptive for some analysis variables.

Suppose the first stage is an *srswor* of \(\small m\) out of \(\small M\) PSUs and the second stage is a
sample of \(\small n_{i}\) elements
selected by *srswor* from the population of \(\small N_{i}\). The \(\small \pi\)-estimator is

\[ \small \hat{t}_{\pi} = \frac{M}{m} \sum_{i \in s} \frac{N_{i} }{n_{i} } \sum_{k \in s_{i} } y_{k} \]

Its variance is equal to \[\begin{equation} \label{srs.relvar.gen} \small V\left(\hat{t}_{\pi } \right)=\frac{M^{2} }{m} \frac{M-m}{M} S_{U1}^{2} +\frac{M}{m} \sum _{i\in U}\frac{N_{i}^{2} }{n_{i} } \frac{N_{i} -n_{i} }{N_{i} } S_{U2i}^{2} \end{equation}\] where \(\small S_{U1}^{2} =\frac{\sum _{i\in U}\left(t_{i} -\bar{t}_{U} \right)^{2}}{M-1}\) with \(t_{i}\) being the population total of \(\small y\) in PSU \(\small i\) and \(\small \bar{t}_{U} ={\sum _{i\in U}t_{i} / M}\) is the mean total per PSU.

If \(\bar{n}\) elements are selected in each PSU and the sampling fractions of PSUs and elements within PSUs are all small, then the relvariance can be written as \[\begin{equation} \label{eq:relvarsrs} \small \frac{V(\hat{t}_{\pi})}{t_{U}^2} = \frac{B^2}{m} + \frac{W^2}{m\bar{n}} \end{equation}\]

where \(\small B^{2} ={S_{U1}^{2} /
\bar{t}_{U}^{2} = M^{2}S_{U1}^2/t_{U}^2 }\) is the unit
relvariance among PSU totals and \(\small
W^{2} = M \sum _{i\in U}N_{i}^{2} S_{U2i}^{2}/t_{U}^2\). The term
\(\small B^{2}\) is called the “between
(PSU) component” while \(\small W^{2}\)
is the “within component”. Expression \(\small
\eqref{eq:relvarsrs}\) is the form used in the R function,
`BW2stageSRS`

. Textbooks often list a specialized form of
\(\small \eqref{eq:relvarsrs}\) that
requires that all PSUs have the same size, \(\small N_i \equiv \bar{N}\), and that \(\small \bar{n}\) elements are selected in
each. In that case, the second-stage sampling fraction is \({\bar{n} / \bar{N}}\). This implies that
the sample is self-weighting: \(\small \pi
_{i} \pi _{k\left|i\right. } = m\bar{n}/M\bar{N}\). The
relvariance based on \(\small
\eqref{srs.relvar.gen}\) then simplifies to the less general form
\[\begin{equation} \notag
\small \frac{V(\hat{t}_{\pi})}{t_{U}^2} = \frac{1}{m} \frac{M-m}{M}
B^{2} +\frac{1}{m\bar{n}} \frac{\bar{N}-\bar{n}}{\bar{N}} W^{2} \notag
\end{equation}\]

where \(\small W^{2} =\frac{1}{M\bar{y}_{U}^{2} } \sum _{i\in U}S_{U2i}^{2}\).

Assuming that \(\small \bar{n}\) elements are selected in each sample PSU, and \(\small m/M\) and \(\small \bar{n}/N_{i}\) are both small, the more general form of the relvariance in \(\small \eqref{eq:relvarsrs}\) can also be written in terms of a measure of homogeneity \(\small \delta\) as follows:

\[\begin{equation} \label{eq:deltavar.2st} \small \frac{V\left(\hat{t}_{\pi } \right)}{t_{U}^{2} } \doteq \frac{\tilde{V}}{m\bar{n}} k \left[1+\delta \left(\bar{n}-1\right)\right] \end{equation}\] where \(\small \tilde{V}= S_{U}^2/\bar{y}_{U}^{2}\), \(\small k=(B^2 + W^2)/\tilde{V}\), and

\[\begin{equation} \label{eq:delta} \small \delta = \frac{B^{2} }{B^{2} +W^{2} }. \end{equation}\] With some effort, it can be shown that when \(\small N_{i}=\bar{N}\) and both \(\small M\) and \(\small \bar{N}\) are large,

\[\begin{equation} \small \frac{S_{U}^{2} }{\bar{y}_{U}^{2} } = {\frac{1}{\bar{y}_{U}^{2} } \frac{\sum _{i\in U}\sum _{k\in U_{i}}\left(y_{k} -\bar{y}_{U} \right)^{2} }{\left(N-1\right)} } \notag \\ \small {\doteq B^{2} +W^{2} } \notag \end{equation}\] i.e., the population relvariance can be written as the sum of between and within relvariances. If \(\small k=1\), \(\small \eqref{eq:deltavar.2st}\) equals the expression found in many textbooks. However, when the population count of elements per cluster varies, \(\small k\) may be far from 1, as will be illustrated in an example below. In those cases, \(\small \eqref{eq:deltavar.2st}\) with an estimate of the actual \(\small k\) should be used for determining sample sizes and computing advance estimates of coefficients of variation.

Expressions \(\small
\eqref{eq:relvarsrs}\) and \(\small
\eqref{eq:deltavar.2st}\) are useful for sample size calculation
since the number of sample PSUs and sample units per PSU are explicit in
the formula. Equation \(\small
\eqref{eq:deltavar.2st}\) also connects the variance of the
estimated total to the variance that would be obtained from a simple
random sample since \(\small
\tilde{V}/m\bar{n}\) is the relvariance of the estimated total in
an *srswor* of size \(\small
m\bar{n}\) when the sampling fraction is small. The product \(\small k[1 + \delta(\bar{n}-1)]\) is a type
of design effect. When \(\small k=1\),
the term \(\small 1+\delta
\left(\bar{n}-1\right)\) is the approximate design effect found
in many textbooks.

The next example uses the `MDarea.pop`

from
`PracTools`

. This dataset is based on the U.S. Census counts
from the year 2000 for Anne Arundel County in the US state of Maryland.
The geographic divisions used in this dataset are called tracts and
block groups. Tracts are constructed by the US Census Bureau to have a
desired population size of 4,000 people. Block groups (BGs) are smaller
with a target size of 1,500 people. Counts of persons in the dataset are
the same for most tracts and block groups as in the 2000 Census.

**Example. Between and within variance components in**The R function*srs/srs*design`BW2stageSRS`

will calculate the unit relvariance of a population, \(\small B^{2} + W^{2}\) for comparison, the ratio \(\small k=(B^2+W^2)/(S_{U}^{2}/\bar{y}_{U}^2)\), and the full version of \(\small \delta\) in \(\small \eqref{eq:delta}\). The function assumes that the entire sampling frame is an input. The full R code for this example is in the file`Example 9.2.R`

, available at Valliant, Dever, and Kreuter (2018a). We first compute the results using the`PSU`

and`SSU`

variables as clusters. These fields are created so that all`PSU`

s have the same size; likewise, all`SSU`

s have the same size. For the variable`y1`

in the Maryland population, the code is

```
require(PracTools)
data(MDarea.pop)
BW2stageSRS(MDarea.pop$y1, psuID=MDarea.pop$PSU)
#> B2 W2 unit relvar B2+W2 k delta
#> 0.007859299 1.455263645 1.462741186 1.463122944 1.000260988 0.005371592
BW2stageSRS(MDarea.pop$y1, psuID=MDarea.pop$SSU)
#> B2 W2 unit relvar B2+W2 k delta
#> 0.03653178 1.42770886 1.46274119 1.46424065 1.00102510 0.02494930
```

The values of \(\small \delta\) are
0.005 for `PSU`

and 0.025 for `SSU`

. Next, to
illustrate the dramatic effect that varying sizes of clusters can have,
we compute the same statistics as above using tracts and block groups
(BGs) within tracts as clusters. These vary substantially in the number
of persons in each cluster. A new variable called `trtBG`

is
computed since the values of the variable, `BLKGROUP`

, are
nested within each tract:

```
<- 10*MDarea.pop$TRACT + MDarea.pop$BLKGROUP
trtBG BW2stageSRS(MDarea.pop$y1, psuID=MDarea.pop$TRACT)
#> B2 W2 unit relvar B2+W2 k delta
#> 0.2604683 1.8390286 1.4627412 2.0994969 1.4353167 0.1240623
BW2stageSRS(MDarea.pop$y1, psuID=trtBG)
#> B2 W2 unit relvar B2+W2 k delta
#> 0.3488622 1.9498600 1.4627412 2.2987221 1.5715167 0.1517635
```

The value of \(\small \delta\) is
0.124 `TRACT`

s are clusters and 0.152 when `trtBG`

defines clusters. The measures of homogeneity increase substantially
when tracts or BGs are clusters compared to the `PSU`

and
`SSU`

results. This is entirely due to the increase in \(\small B^{2}\) when units with highly
variable sizes are used and an *srs* is selected. For example,
\(\small B^{2} =0.0079\) for
`y1`

when `PSU`

is a cluster but is 0.2605 when
`TRACT`

is a cluster.

Variances of estimators in two-stage designs more complicated than
simple random sampling at each stage can be written as a sum of
components. However, these have limited usefulness in determining sample
sizes for the same reason that \(\small
\eqref{eq:pivar}\) is not. A more convenient formulation is the
case where PSUs are selected with varying probabilities but with
replacement, and the sample within each PSU is selected by
*srswor*. With-replacement designs may not often be used in
practice but have simple variance formulae. The *pwr*-estimator
of a total (**SSW.92?**) is
\[
\small \hat{t}_{pwr} =\frac{1}{m} \sum _{i\in s}\frac{\hat{t}_{i}
}{p_{i} }
\] where \(\small \hat{t}_{i}
=\frac{N_{i} }{n_{i} } \sum _{k\in s_{i} }y_{k}\) is the
estimated total for PSU \(\small i\)
from a simple random sample and \(\small
p_{i}\) is the one-draw selection probability of PSU \(\small i\). The variance of \(\small \hat{t}_{pwr}\) is

\[\begin{equation} \label{eq:v2st} \small V\left(\hat{t}_{pwr} \right)=\frac{1}{m} \sum _{i\in U}p_{i} \left(\frac{t_{i} }{p_{i} } -t_{U} \right)^{2} +\sum _{i\in U}\frac{N_{i} ^{2} }{mp_{i} n_{i} } \left(1-\frac{n_{i} }{N_{i} } \right)S_{U2i}^{2}. \end{equation}\]

Making the assumption that \(\small \bar{n}\) elements are selected in each PSU, the variance reduces to \[ \small V\left(\hat{t}_{pwr} \right)=\frac{S_{U1\left(pwr\right)}^{2} }{m} +\frac{1}{m\bar{n}} \sum _{i\in U}\left(1-\frac{\bar{n}}{N_{i} } \right)\frac{N_{i}^{2} S_{U2i}^{2} }{p_{i} } \] where, in this case, \(S_{U1\left(pwr\right)}^{2} =\sum _{i\in U}p_{i} \left(\frac{t_{i} }{p_{i} } -t_{U} \right)^{2}\). Dividing this by \(t_{U}^{2}\) and assuming that the within-PSU sampling fraction, \(\bar{n} / N_{i}\), is negligible, we obtain the relvariance of \(\hat{t}_{pwr}\) as, approximately, \[\begin{equation} \label{eq:vpwr.2st} \small \frac{V\left(\hat{t}_{pwr} \right)}{t_{U}^{2} } \doteq \frac{B^{2} }{m} +\frac{W^{2} }{m\bar{n}} =\frac{\tilde{V}}{m\bar{n}} k \left[1+\delta \left(\bar{n}-1\right)\right] \end{equation}\]

with \(\small \tilde{V}=S_{U}^2/\bar{y}_{U}^2\), \(\small k=(B^2 + W^2)/\tilde{V}\), \[\begin{equation} \label{eq:B2.2st.pwr} \small B^{2} =\frac{S_{U1\left(pwr\right)}^{2} }{t_{U}^{2} } , \end{equation}\]

\[\begin{equation} \label{eq:W2.2st.pwr} \small W^{2} =\frac{1}{t_{U}^{2} } \sum _{i\in U}N_{i}^{2} \frac{S_{U2i}^{2} }{p_{i} }, \end{equation}\]

\[\begin{equation} \label{eq:delta.2st.pwr} \small \delta =B^{2} \left/ \left(B^{2} +W^{2} \right) \right. \end{equation}\]

Expression \(\small
\eqref{eq:vpwr.2st}\) has the same form as \(\small \eqref{eq:deltavar.2st}\) but with
different definitions of \(\small B^2\)
and \(\small W^2\). Expression \(\small \eqref{eq:vpwr.2st}\) also has the
interpretation of an *srs* variance of an unclustered variance,
\(\small \tilde{V}/m\bar{n}\), times a
design effect, \(\small k[1 +
\delta(\bar{n}-1)]\), in the same way that \(\small \eqref{eq:deltavar.2st}\) did.

**Example. Between and within variance components in**This example repeats the calculations in the example above for the variables in the Maryland area population. Assume that clusters will be selected proportional to the count of persons in each cluster. The function*ppswr/srs*design`BW2stagePPS`

computes the population values of \(\small B^{2}\), \(\small W^{2}\), and \(\small \delta\) shown in \(\small \eqref{eq:B2.2st.pwr}\), \(\small \eqref{eq:W2.2st.pwr}\), and \(\small \eqref{eq:delta.2st.pwr}\) which are appropriate for*ppswr*sampling of clusters. The code for`y1`

using`PSU`

or`SSU`

as clusters is shown below. The variables,`pp.PSU`

and`pp.SSU`

, hold the one-draw probabilities \(\small p_{i}\) that appear in \(\small \eqref{eq:v2st}\):

```
<- table(MDarea.pop$PSU) / nrow(MDarea.pop)
pp.PSU <- table(MDarea.pop$SSU) / nrow(MDarea.pop)
pp.SSU BW2stagePPS(MDarea.pop$y1, pp=pp.PSU, psuID=MDarea.pop$PSU)
#> B2 W2 unit relvar B2+W2 k delta
#> 0.007762335 1.455263403 1.462741186 1.463025738 1.000194534 0.005305672
BW2stagePPS(MDarea.pop$y1, pp=pp.SSU, psuID=MDarea.pop$SSU)
#> B2 W2 unit relvar B2+W2 k delta
#> 0.03644120 1.42770995 1.46274119 1.46415115 1.00096392 0.02488896
```

The code for PSUs that are tracts and block groups is

```
<- table(MDarea.pop$TRACT) / nrow(MDarea.pop)
pp.trt <- table(trtBG) / nrow(MDarea.pop)
pp.BG BW2stagePPS(MDarea.pop$y1, pp=pp.trt, psuID=MDarea.pop$TRACT)
#> B2 W2 unit relvar B2+W2 k delta
#> 0.009171403 1.453908596 1.462741186 1.463079999 1.000231629 0.006268559
BW2stagePPS(MDarea.pop$y1, pp=pp.BG, psuID=trtBG)
#> B2 W2 unit relvar B2+W2 k delta
#> 0.01602891 1.44780622 1.46274119 1.46383513 1.00074787 0.01094994
```

The between term when clusters are defined by `PSU`

is
about the same as when clusters are selected by *srs* because
`PSU`

’s all have the same size. With PSUs being either tracts
or block groups in the *ppswr/srswor* design, the between term is
much smaller than the within, compared to the results in the
*srs/srs* example. For example, with `y1`

and
*srs* sampling of tracts, \(\small
B^2=0.2604\) but for *pps* sampling of tracts \(\small B^2=0.0091\).

When clusters are selected by *srs*, \(\small S_{U1}^{2}\) is the variance of the
cluster totals around the average cluster total. In contrast, with
*pps* sampling of clusters, \(\small
S_{U1\left(pwr\right)}^{2}\) is the variance of the estimated
population totals, \(\small t_{i} \left/ p_{i}
\right.\) around the population total, \(\small t_{U}\). When clusters are selected
with probability proportional to \(\small
N_{i}\), then \(\small t_{i} \left/
p_{i} \right. = N_{i} \bar{y}_{Ui}\). If these one-cluster
estimates of the population total are fairly accurate, as they are here,
the \(\small B^{2}\) term can be quite
small. This leads to much smaller values of \(\small \delta\) in *pps* sampling of
clusters. This implies that the negative effect of clustering on the
variance is lessened for a design that selects clusters with \(\small pp(N_i)\). This kind of comparison
explains most practitioners’ preference for *pps* sampling of
clusters, especially when the clusters vary in population size.

In the case of with-replacement sampling of PSUs with varying probabilities and srswor at the second and third stages, the relvariance can be written (with a few assumptions) in a form useful for sample size calculations. Treating the case where SSUs are selected via srs (either with or without replacement) is not too unrealistic since SSUs (like block groups) are often created to have about the same population sizes.

The variance formulae for a three-stage design with *ppswor*
selection of first-stage units is complex enough that it is not useful
for sample size planning. See Valliant, Dever,
and Kreuter (2018b), sec. 9.2.4 for details. To obtain a simpler
formula, suppose that \(\small
\bar{n}\) SSUs are sampled in each sample PSU, the sampling
fractions of SSUs in each PSU, \(\small
\bar{n} \left/ N_{i} \right.\), are small, and \(\small \bar{\bar{q}}\) elements are
selected in each sample SSU. The relvariance of the
*pwr*-estimator is then

\[\begin{equation} \label{ZEqnNum518915} \small \frac{V\left(\hat{t}_{pwr} \right)}{t_{U}^{2} } = \frac{B^{2} }{m} +\frac{W_{2}^{2} }{m\bar{n}} + \frac{W_{3}^{2} }{m\bar{n}{\kern 1pt} \bar{\bar{q}}}, \end{equation}\]

where \(\small B^{2} =S_{U1\left(pwr\right)}^{2} \left/ t_{U}^{2} \right.\) is given by \(\small \eqref{eq:B2.2st.pwr}\),

\[\begin{equation} \label{ZEqnNum344483} \small W_{2}^{2} = \frac{1}{t_{U}^{2} } \sum_{i\in U} N_{i}^{2} S_{U2i}^{2} \left/ p_{i} \right. ; \end{equation}\]

\[\begin{equation} \label{ZEqnNum290646} \small W_{3}^{2} = \frac{1}{t_{U}^{2} } \sum _{i\in U} \frac{N_{i} }{p_{i} } \sum_{j\in U_{i}} Q_{ij}^{2} S_{U3ij}^{2} . \end{equation}\]

The relvariance can also be written in terms of two measures of homogeneity:

\[\begin{equation} \label{ZEqnNum836300} \small \frac{V\left(\hat{t}_{pwr} \right)}{t_{U}^{2} } = \frac{\tilde{V}}{m\bar{n}\bar{\bar{q}}} \left\{k_{1} \delta _{1} \bar{n}\bar{\bar{q}}+k_{2} \left[1+\delta _{2} \left(\bar{\bar{q}}-1\right)\right]\right\} \end{equation}\] where

\(\small k_{1} = (B^2+W^2)/\tilde{V}\) with \(\small \tilde{V} = \frac{1}{Q-1} \sum_{i\in U} \sum_{j \in U_{i} } \sum_{k\in U_{ij} } \left(y_{k} - \bar{y}_{U} \right)^2 \left/ \bar{y}_U^2 \right.\) is the unit relvariance of \(\small y\) in the population.

\(\small k_{2} = (W_{2}^2 + W_{3}^2)/\tilde{V}\)

\(\small\delta _{1} = B^2/(B^2 + W^2)\)

\(\small W^{2} = \frac{1}{t_{U}^{2} } \sum_{i\in U} Q_{i}^{2} S_{U3i}^{2} \left/ p_{i} \right.\) with \(\small S_{U3i}^{2} =\frac{1}{Q_{i} -1} \sum _{j\in U_{i} }\sum _{k\in U_{ij} }\left(y_{k} -\bar{y}_{Ui} \right)^{2}\) and \(\small \bar{y}_{Ui} = \sum _{j\in U_{i} } \sum_{k\in U_{ij} }y_{k} \left/ Q_{i} \right.\), i.e., \(\small S_{U3i}^{2}\) is the element-level variance among all elements in PSU \(\small i\)

\(\small \delta _{2} = W_{2}^2/(W_{2}^2 + W_{3}^2)\)

Note that the term \(\small W^{2}\) in \(\small \delta_{1}\) does not enter the variance in \(\small \eqref{ZEqnNum518915}\) but is defined by analogy to the term in two-stage sampling. If elements were selected directly from the sample PSUs (instead of first sampling SSUs), then \(\small W^{2}\) above would be the appropriate within-PSU component.

The term \(\small \delta _{1}\) is a
measure of the homogeneity among the PSU totals. If the estimate of the
population total from each PSU total, \(\small
t_{i} \left/ p_{i} \right.\), was exactly equal to the population
total, \(\small t_{U}\), then \(\small B^{2} =0\) and \(\small \delta _{1} = 0\). That is, if the
variation within PSUs is much larger than the variation among PSU
totals, then \(\small \delta _{1}\)
will be small; this is the typical situation in household surveys *if
PSUs all have about the same number of elements*. As we saw in the
earlier example, the condition of equal-sized PSUs can be critically
important to insure that \(\small
B^{2}\) is small.

If the SSUs all have about the same totals, \(\small t_{ij}\), then \(\small W_{2}^{2}\) will be small and \(\small \delta _{2} \doteq 0\). Although attempts may be made to create SSUs that have about the same number of elements \(\small Q_{ij}\), the totals \(\small t_{ij}\) of other variables tend to vary, leading to values of \(\small \delta _{2}\) that are larger than those of \(\small \delta _{1}\).

The R function, `BW3stagePPS`

, will calculate \(\small B^{2}\), \(\small W^{2}\), \(\small W_{2}^{2}\), \(\small W_{3}^{2}\), \(\small \delta _{1}\), and \(\small \delta _{2}\) defined above for
*ppswr/srs/srs* and *srswr/srs/srs* sampling. The function
is appropriate if an entire frame is available and takes the following
parameters:

Parameter | Description |
---|---|

`X` |
data vector; length is the number of elements in the population. |

`pp` |
vector of one-draw probabilities for the PSUs; length is number of PSUs in population. |

`psuID` |
vector of PSU identification numbers. This vector must be as long as X. Each element in a given PSU should have the same value in psuID. PSUs must be in the same order as in X. |

`ssuID` |
vector of SSU identification numbers. This vector must
be as long as X. Each element in a given SSU should have the same value
in ssuID. PSUs and SSUs must be in the same order as in X. ssuID should
have the form `psuID||(ssuID within PSU)` . |

**Example. Variance components in three stage**. In the Maryland population suppose that suppose that tracts and BGs within tracts are the first- and second-stage units, and that persons are elements in a three-stage design. All three stages are selected by*srswr/srs/srs*and*ppswr/srs/srs*designs*srs*. The call to`BW3stagePPS`

for the variable`y1`

in an*srswr/srs/srs*design is:

```
<- length(unique(MDarea.pop$TRACT))
M <- 10*MDarea.pop$TRACT + MDarea.pop$BLKGROUP
trtBG <- rep(1/M,M)
pp.trt BW3stagePPS(X=MDarea.pop$y1, pp=pp.trt,
psuID=MDarea.pop$TRACT, ssuID=trtBG)
#> B W W2 W3 unit relvar k1
#> 0.2577266 1.8390286 0.2707206 2.1083645 1.4627412 1.4334423
#> k2 delta1 delta2
#> 1.6264566 0.1229169 0.1137919
```

We repeat the calculation but assuming *ppswr* sampling of
PSUs. The calculation for `y1`

using tracts and block groups
as the first- and second-stage sampling units is done via this call:

```
<- 10*MDarea.pop$TRACT + MDarea.pop$BLKGROUP
trtBG <- table(MDarea.pop$TRACT) / nrow(MDarea.pop)
pp.trt BW3stagePPS(X=MDarea.pop$y1, pp=pp.trt,
psuID=MDarea.pop$TRACT, ssuID=trtBG)
#> B W W2 W3 unit relvar k1
#> 0.009171403 1.453908596 0.271630074 1.687254100 1.462741186 1.000231629
#> k2 delta1 delta2
#> 1.339187133 0.006268559 0.138665715
```

Notice that \(\small \delta_1 =
0.123\) with *srs* sampling of tracts but is 0.006 when
tracts are sampled proportional to their population sizes.

An important practical, sample design problem that we do not cover in
this vignette is how to estimate variance components and measures of
homogeneity from a complex, multistage sample. This topic is covered in
detail in section 9.4 of Valliant, Dever, and
Kreuter (2018b). The `PracTools`

package includes a
variety of other functions relevant to two- and three-stage sampling
that are also not discussed in this vignette:

Function | Description |
---|---|

BW2stagePPSe | Estimate components of relvariance for a sample design
where primary sampling units (PSUs) are selected with pps and
elements are selected via srs. The input is a sample selected
in this way. |

BW3stagePPSe | Estimate components of relvariance for a sample design where primary sampling units (PSUs) are selected with probability proportional to size with replacement (ppswr) and secondary sampling units (SSUs) and elements within SSUs are selected via simple random sampling (srs). The input is a sample selected in this way. |

clusOpt2 | Compute the sample sizes that minimize the variance of
the pwr-estimator of a total in a two-stage sample. |

clusOpt2fixedPSU | Compute the optimum number of sample elements per primary sampling unit (PSU) for a fixed set of PSUs. |

clusOpt3 | Compute the sample sizes that minimize the variance of
the pwr-estimator of a total in a three-stage sample. |

clusOpt3fixedPSU | Compute the sample sizes that minimize the variance of
the pwr-estimator of a total in a three-stage sample when the
PSU sample is fixed. |

CVcalc2 | Compute the coefficient of variation of an estimated
total in a two-stage design. Primary sampling units (PSUs) can be
selected either with probability proportional to size (pps) or
with equal probability. Elements are selected via simple random sampling
(srs). |

CVcalc3 | Compute the coefficient of variation of an estimated
total in a three-stage design. Primary sampling units (PSUs) can be
selected either with probability proportional to size (pps) or
with equal probability. Secondary units and elements within SSUs are
selected via simple random sampling (srs). |

deff | Compute the Kish, Henry, Spencer, or Chen-Rust design effects. |

Valliant, R., J. A. Dever, and F. Kreuter. 2018a. *Files for 2nd
Edition of Practical Tools for Designing and Weighting Survey
Samples*. https://umd.app.box.com/v/PracTools2ndEdition.

———. 2018b. *Practical Tools for Designing and Weighting Survey
Samples*. 2nd ed. New York: Springer-Verlag.