Basic unit-level models

The basic unit-level model (Battese, Harter, and Fuller 1988; Rao and Molina 2015) is given by \[ y_j = \beta' x_j + v_{i[j]} + \epsilon_j\,,\\ v_i \stackrel{\mathrm{iid}}{\sim} {\cal N}(0, \sigma_v^2) \qquad \epsilon_j \stackrel{\mathrm{iid}}{\sim} {\cal N}(0, \sigma^2) \] where \(j\) runs from 1 to \(n\), the number of unit-level observations, \(\beta\) is a vector of regression coefficients for given covariates \(x_j\), and \(v_i\) are random area intercepts.

We use the api dataset included in packages survey.

library(survey)
data(api)
apipop$cname <- as.factor(apipop$cname)
apisrs$cname <- factor(apisrs$cname, levels=levels(apipop$cname))

The apipop data frame contains the complete population whereas apisrs is a simple random sample from it. The variable cname is the county name, and we will be interested in estimation at the county level. Not all counties in the population are sampled. In order to be able to make predictions for out-of-sample areas we make sure that the levels of the sample’s cname variable match those of its population counterpart.

The basic unit-level model with county random area effects is fit as follows

library(mcmcsae)
mod <- api00 ~ 
  reg(~ ell + meals + stype + hsg + col.grad + grad.sch, name="beta") +
  gen(factor = ~ iid(cname), name="v")
sampler <- create_sampler(mod, data=apisrs)
sim <- MCMCsim(sampler, store.all=TRUE, verbose=FALSE)
(summary(sim))
## llh_ :
##       Mean   SD t-value  MCSE q0.05  q0.5 q0.95 n_eff R_hat
## llh_ -1090 4.49    -243 0.122 -1098 -1090 -1083  1345     1
## 
## sigma_ :
##        Mean   SD t-value   MCSE q0.05 q0.5 q0.95 n_eff R_hat
## sigma_ 56.2 3.06    18.4 0.0695  51.3 56.1  61.4  1940     1
## 
## beta :
##                 Mean     SD t-value    MCSE    q0.05     q0.5    q0.95 n_eff R_hat
## (Intercept)  801.807 25.152   31.88 0.57776  760.611  801.613 844.5010  1895 1.003
## ell           -1.966  0.301   -6.53 0.00685   -2.448   -1.967  -1.4810  1935 0.999
## meals         -1.956  0.282   -6.94 0.00631   -2.410   -1.956  -1.4882  1994 1.001
## stypeH      -106.809 13.445   -7.94 0.27453 -129.597 -106.542 -84.8748  2399 1.000
## stypeM       -60.085 11.519   -5.22 0.23834  -79.277  -60.028 -41.2435  2336 1.000
## hsg           -0.609  0.407   -1.50 0.00799   -1.266   -0.610   0.0486  2591 1.000
## col.grad       0.587  0.495    1.19 0.01149   -0.214    0.583   1.3837  1858 1.001
## grad.sch       2.131  0.483    4.41 0.00976    1.327    2.136   2.9210  2451 1.001
## 
## v_sigma :
##         Mean   SD t-value  MCSE q0.05 q0.5 q0.95 n_eff R_hat
## v_sigma 23.1 6.71    3.45 0.222    13 22.6  34.9   916     1
## 
## v :
##                 Mean   SD  t-value  MCSE q0.05    q0.5 q0.95 n_eff R_hat
## Alameda      -25.266 15.3 -1.65309 0.365 -50.9 -24.962 -1.15  1751 1.001
## Amador         0.318 24.7  0.01287 0.471 -39.7   0.545 41.01  2751 0.999
## Butte         -0.837 24.1 -0.03478 0.440 -40.0  -0.476 38.79  3000 1.002
## Calaveras      7.660 22.5  0.34085 0.451 -27.9   6.642 46.40  2479 1.001
## Colusa         0.318 24.6  0.01295 0.449 -39.1   0.167 39.85  3000 1.000
## Contra Costa  -9.761 15.1 -0.64615 0.340 -34.9  -9.922 14.55  1979 1.002
## Del Norte      0.387 24.1  0.01605 0.440 -39.4   0.263 40.58  3000 1.000
## El Dorado     -0.140 24.0 -0.00583 0.438 -39.5   0.129 37.90  3000 1.000
## Fresno         0.106 15.6  0.00677 0.312 -25.3   0.131 25.35  2499 1.000
## Glenn         -0.327 23.9 -0.01370 0.437 -39.1  -0.420 39.65  2997 0.999
## ... 47 elements suppressed ...

We want to estimate the area population means \[ \theta_i = \frac{1}{N_i}\sum_{j \in U_i} y_j\,, \] where \(U_i\) is the set of units in area \(i\) of size \(N_i\). The MCMC output in variable sim can be used to obtain draws from the posterior distribution for \(\theta_i\). The \(r\)th draw can be expressed as \[ \theta_{i;r} = \frac{1}{N_i} \left(n_i \bar{y}_i + \beta_r'(t_{x;i} - n_i \bar{x}_i) + (N_i - n_i)v_{i;r} + \sum_{j \in U_i\setminus s_i} \epsilon_{j;r} \right)\,, \] where \(\bar{y}_i\) is the sample mean of \(y\) in area \(i\) and \(t_{x;i}\) is a vector of population totals for area \(i\).

N <- table(apipop$cname)
samplesums <- tapply(apisrs$api00, apisrs$cname, sum)
samplesums[is.na(samplesums)] <- 0  # substitute 0 for out-of-sample areas
m <- match(apisrs$cds, apipop$cds)  # population units in the sample
res <- predict(sim, newdata=apipop, labels=names(N),
         fun=function(x, p) (samplesums + tapply(x[-m], apipop$cname[-m], sum ))/N,
         show.progress=FALSE)
(summ <- summary(res))
##                 Mean    SD t-value  MCSE q0.05 q0.5 q0.95 n_eff R_hat
## Alameda          679 14.92    45.5 0.339   654  679   703  1933 1.001
## Amador           739 31.74    23.3 0.637   688  739   790  2482 1.000
## Butte            678 26.22    25.9 0.498   635  678   720  2777 1.001
## Calaveras        743 27.60    26.9 0.530   699  742   790  2715 1.000
## Colusa           552 31.94    17.3 0.640   499  552   604  2489 1.000
## Contra Costa     727 14.55    50.0 0.288   702  727   751  2554 1.000
## Del Norte        679 32.45    20.9 0.612   625  679   733  2807 1.000
## El Dorado        754 26.58    28.4 0.510   710  754   797  2716 1.000
## Fresno           594 15.23    39.0 0.278   570  594   619  3000 0.999
## Glenn            625 31.15    20.1 0.591   575  624   677  2773 1.000
## Humboldt         702 26.65    26.3 0.523   659  701   746  2593 1.001
## Imperial         558 24.40    22.9 0.479   519  557   599  2596 0.999
## Inyo             707 32.52    21.7 0.629   654  707   761  2673 1.000
## Kern             596 14.89    40.0 0.319   572  597   620  2183 1.004
## Kings            595 22.86    26.0 0.468   556  595   631  2388 1.000
## Lake             662 25.67    25.8 0.481   622  661   704  2848 1.000
## Lassen           706 26.45    26.7 0.497   661  706   749  2827 1.001
## Los Angeles      636  8.26    76.9 0.153   622  636   649  2914 1.002
## Madera           615 20.41    30.1 0.432   582  615   649  2238 1.000
## Marin            816 20.91    39.0 0.387   783  816   852  2920 0.999
## Mariposa         723 35.54    20.4 0.737   666  722   783  2327 1.001
## Mendocino        663 27.27    24.3 0.529   619  662   707  2654 1.000
## Merced           578 23.80    24.3 0.485   539  577   618  2405 1.001
## Modoc            672 30.57    22.0 0.641   623  671   723  2272 0.999
## Mono             705 41.43    17.0 0.770   637  705   773  2892 1.000
## Monterey         628 18.26    34.4 0.339   598  628   658  2911 1.001
## Napa             701 21.79    32.2 0.413   665  701   737  2781 1.004
## Nevada           799 30.27    26.4 0.568   750  798   848  2842 1.000
## Orange           692 15.20    45.5 0.299   668  692   718  2579 0.999
## Placer           772 23.63    32.7 0.456   734  772   811  2681 1.001
## Plumas           691 31.54    21.9 0.604   640  690   742  2723 1.000
## Riverside        636 14.38    44.2 0.284   612  636   659  2571 1.000
## Sacramento       668 15.27    43.7 0.285   643  668   693  2873 1.003
## San Benito       710 30.24    23.5 0.619   658  710   759  2387 0.999
## San Bernardino   647 12.98    49.8 0.237   626  647   669  3000 1.000
## San Diego        704 14.36    49.0 0.319   681  704   728  2031 1.003
## San Francisco    638 19.67    32.4 0.388   606  638   670  2573 1.000
## San Joaquin      630 17.50    36.0 0.373   600  631   657  2203 1.000
## San Luis Obispo  753 24.17    31.2 0.463   714  754   794  2722 1.000
## San Mateo        734 21.57    34.0 0.432   700  733   770  2494 1.000
## Santa Barbara    678 19.46    34.9 0.368   646  678   710  2794 0.999
## Santa Clara      733 15.65    46.8 0.286   707  733   758  3000 0.999
## Santa Cruz       679 20.18    33.7 0.386   644  680   711  2726 1.001
## Shasta           697 22.13    31.5 0.420   661  696   735  2772 1.000
## Sierra           706 41.08    17.2 0.750   637  706   774  3000 1.000
## Siskiyou         697 25.25    27.6 0.491   655  697   739  2641 1.000
## Solano           711 20.78    34.2 0.406   677  711   744  2617 0.999
## Sonoma           725 23.01    31.5 0.464   687  725   762  2457 1.001
## Stanislaus       661 19.64    33.6 0.384   628  661   692  2610 1.001
## Sutter           652 25.29    25.8 0.465   610  652   693  2957 1.001
## Tehama           660 29.84    22.1 0.601   613  660   710  2467 1.000
## Trinity          650 37.93    17.1 0.707   589  650   714  2880 1.000
## Tulare           583 21.62    27.0 0.451   548  583   617  2302 1.001
## Tuolumne         720 30.67    23.5 0.570   669  720   769  2900 1.000
## Ventura          710 17.36    40.9 0.325   682  709   739  2850 1.000
## Yolo             687 24.57    28.0 0.512   646  688   726  2305 1.000
## Yuba             627 28.48    22.0 0.548   582  627   675  2700 1.001
theta <- c(tapply(apipop$api00, apipop$cname, mean))  # true population quantities
plot_coef(summ, list(est=theta), n.se=2, est.names=c("mcmcsae", "true"), maxrows=30)

Binomial Unit-Level Model

A model with binomial likelihood can also be fit. We now model the target variable sch.wide, a binary variable indicating whether a school-wide growth target has been met. We use the same mean model structure as above for the linear model, but now using a logistic link function, \[ y_j \stackrel{\mathrm{iid}}{\sim} {\cal Be}(p_j)\,,\\ \mathrm{logit}(p_j) = \beta' x_j + v_{i[j]}\,,\\ v_i \stackrel{\mathrm{iid}}{\sim} {\cal N}(0, \sigma_v^2) \]

apisrs$target.met <- as.numeric(apisrs$sch.wide == "Yes")
sampler <- create_sampler(update(mod, target.met ~ .), family="binomial", data=apisrs)
sim <- MCMCsim(sampler, store.all=TRUE, verbose=FALSE)
summary(sim)
## llh_ :
##       Mean   SD t-value   MCSE q0.05  q0.5 q0.95 n_eff R_hat
## llh_ -83.4 2.59   -32.3 0.0641 -87.9 -83.2 -79.5  1626     1
## 
## beta :
##                 Mean     SD t-value     MCSE   q0.05      q0.5   q0.95 n_eff R_hat
## (Intercept)  4.53494 1.5372  2.9501 0.055004  2.2171  4.411463  7.2657   781  1.00
## ell         -0.03588 0.0151 -2.3708 0.000470 -0.0612 -0.035722 -0.0114  1038  1.00
## meals        0.00094 0.0141  0.0665 0.000440 -0.0219  0.000536  0.0249  1034  1.00
## stypeH      -2.81340 0.6322 -4.4504 0.020063 -3.8896 -2.790293 -1.8015   993  1.00
## stypeM      -1.64635 0.5756 -2.8601 0.016546 -2.5829 -1.649127 -0.6934  1210  1.00
## hsg         -0.02527 0.0224 -1.1301 0.000710 -0.0622 -0.024838  0.0116   993  1.00
## col.grad    -0.03672 0.0271 -1.3557 0.000898 -0.0827 -0.036046  0.0078   909  1.01
## grad.sch     0.01284 0.0325  0.3951 0.001154 -0.0392  0.012184  0.0664   793  1.01
## 
## v_sigma :
##          Mean    SD t-value   MCSE  q0.05  q0.5 q0.95 n_eff R_hat
## v_sigma 0.355 0.275    1.29 0.0114 0.0275 0.299 0.888   581     1
## 
## v :
##                  Mean    SD  t-value    MCSE  q0.05      q0.5 q0.95 n_eff R_hat
## Alameda      -0.17109 0.425 -0.40291 0.01263 -0.990 -6.21e-02 0.348  1130 1.001
## Amador       -0.00485 0.472 -0.01026 0.00968 -0.714 -1.23e-04 0.696  2381 1.000
## Butte         0.00379 0.453  0.00838 0.00826 -0.672 -6.63e-04 0.730  3000 1.001
## Calaveras     0.03977 0.453  0.08776 0.00885 -0.630  8.80e-03 0.776  2620 1.000
## Colusa       -0.00220 0.437 -0.00505 0.00797 -0.689 -3.98e-04 0.702  3000 1.000
## Contra Costa  0.02002 0.388  0.05163 0.00806 -0.584  2.01e-03 0.665  2316 1.000
## Del Norte     0.00102 0.464  0.00220 0.00889 -0.736 -7.45e-05 0.726  2720 1.000
## El Dorado    -0.00879 0.441 -0.01994 0.00805 -0.719 -8.61e-04 0.700  3000 0.999
## Fresno       -0.04837 0.372 -0.12991 0.00841 -0.706 -1.66e-02 0.547  1959 1.001
## Glenn        -0.00599 0.468 -0.01279 0.00855 -0.717 -2.57e-03 0.693  3000 1.001
## ... 47 elements suppressed ...

To predict the population fractions of schools that meet the growth target by county,

samplesums <- tapply(apisrs$target.met, apisrs$cname, sum)
samplesums[is.na(samplesums)] <- 0  # substitute 0 for out-of-sample areas
res <- predict(sim, newdata=apipop, labels=names(N),
         fun=function(x, p) (samplesums + tapply(x[-m], apipop$cname[-m], sum ))/N,
         show.progress=FALSE)
(summ <- summary(res))
##                  Mean     SD t-value     MCSE q0.05  q0.5 q0.95 n_eff R_hat
## Alameda         0.786 0.0602   13.06 0.001633 0.670 0.796 0.867  1359 1.000
## Amador          0.842 0.1175    7.16 0.002399 0.600 0.900 1.000  2400 1.000
## Butte           0.835 0.0747   11.18 0.001598 0.708 0.833 0.938  2185 1.001
## Calaveras       0.872 0.0956    9.13 0.001943 0.700 0.900 1.000  2420 1.000
## Colusa          0.646 0.1591    4.06 0.003063 0.333 0.667 0.889  2696 1.000
## Contra Costa    0.817 0.0562   14.54 0.001240 0.720 0.821 0.899  2056 0.999
## Del Norte       0.882 0.1096    8.05 0.002071 0.625 0.875 1.000  2802 0.999
## El Dorado       0.849 0.0735   11.54 0.001619 0.725 0.850 0.950  2062 1.000
## Fresno          0.782 0.0591   13.22 0.001439 0.677 0.785 0.871  1689 1.000
## Glenn           0.788 0.1425    5.53 0.002777 0.556 0.778 1.000  2633 1.000
## Humboldt        0.852 0.0730   11.67 0.001577 0.725 0.850 0.950  2145 1.000
## Imperial        0.698 0.1068    6.53 0.002342 0.525 0.700 0.875  2080 1.001
## Inyo            0.782 0.1528    5.12 0.003006 0.571 0.857 1.000  2583 1.001
## Kern            0.806 0.0515   15.65 0.001124 0.716 0.811 0.878  2096 1.002
## Kings           0.849 0.0806   10.53 0.001705 0.720 0.840 0.960  2236 1.000
## Lake            0.827 0.0927    8.92 0.001836 0.682 0.818 0.955  2549 0.999
## Lassen          0.837 0.1083    7.73 0.002298 0.636 0.818 1.000  2220 1.001
## Los Angeles     0.799 0.0426   18.75 0.001028 0.728 0.799 0.870  1717 1.000
## Madera          0.818 0.0764   10.71 0.001541 0.677 0.839 0.935  2457 0.999
## Marin           0.835 0.0721   11.58 0.001748 0.700 0.840 0.940  1701 1.001
## Mariposa        0.857 0.1537    5.58 0.002994 0.600 0.800 1.000  2634 1.000
## Mendocino       0.814 0.0908    8.96 0.001820 0.640 0.800 0.960  2492 1.000
## Merced          0.789 0.0824    9.58 0.001861 0.651 0.794 0.905  1959 1.001
## Modoc           0.832 0.1585    5.25 0.002893 0.600 0.800 1.000  3000 0.999
## Mono            0.745 0.2463    3.03 0.004703 0.333 0.667 1.000  2743 1.000
## Monterey        0.802 0.0670   11.97 0.001503 0.687 0.807 0.904  1988 1.000
## Napa            0.846 0.0791   10.70 0.001622 0.704 0.852 0.963  2376 1.000
## Nevada          0.866 0.0940    9.22 0.001989 0.714 0.857 1.000  2234 1.001
## Orange          0.774 0.0607   12.76 0.001361 0.677 0.775 0.871  1988 1.000
## Placer          0.814 0.0795   10.24 0.002015 0.682 0.818 0.924  1557 1.000
## Plumas          0.737 0.1509    4.88 0.002886 0.444 0.778 1.000  2735 1.000
## Riverside       0.824 0.0521   15.81 0.001268 0.729 0.831 0.898  1692 1.000
## Sacramento      0.846 0.0475   17.80 0.001268 0.767 0.847 0.924  1404 1.000
## San Benito      0.820 0.1258    6.52 0.002610 0.636 0.818 1.000  2321 1.000
## San Bernardino  0.861 0.0417   20.66 0.001120 0.793 0.862 0.928  1385 1.000
## San Diego       0.847 0.0418   20.27 0.000961 0.775 0.850 0.909  1893 1.000
## San Francisco   0.762 0.0739   10.30 0.001492 0.630 0.770 0.870  2457 1.001
## San Joaquin     0.831 0.0552   15.05 0.001207 0.734 0.831 0.912  2094 1.000
## San Luis Obispo 0.853 0.0708   12.06 0.001629 0.725 0.850 0.950  1887 1.001
## San Mateo       0.797 0.0786   10.14 0.002090 0.650 0.811 0.895  1413 1.000
## Santa Barbara   0.831 0.0629   13.22 0.001317 0.716 0.840 0.926  2279 0.999
## Santa Clara     0.785 0.0686   11.45 0.002233 0.656 0.796 0.871   944 1.003
## Santa Cruz      0.787 0.0758   10.39 0.001629 0.653 0.796 0.898  2162 1.000
## Shasta          0.817 0.0760   10.75 0.001719 0.675 0.825 0.925  1957 1.000
## Sierra          0.723 0.2297    3.15 0.004459 0.333 0.667 1.000  2653 1.001
## Siskiyou        0.839 0.0977    8.59 0.002065 0.667 0.867 1.000  2241 1.001
## Solano          0.865 0.0621   13.94 0.001438 0.750 0.867 0.951  1862 1.001
## Sonoma          0.846 0.0598   14.16 0.001238 0.741 0.852 0.935  2329 1.000
## Stanislaus      0.816 0.0703   11.60 0.001737 0.692 0.824 0.912  1638 0.999
## Sutter          0.830 0.0910    9.12 0.001731 0.650 0.850 0.950  2760 1.001
## Tehama          0.844 0.0990    8.52 0.002056 0.647 0.882 1.000  2320 0.999
## Trinity         0.749 0.2058    3.64 0.003959 0.500 0.750 1.000  2702 1.000
## Tulare          0.832 0.0627   13.27 0.001426 0.718 0.836 0.927  1933 1.000
## Tuolumne        0.881 0.0929    9.48 0.001882 0.750 0.917 1.000  2438 1.000
## Ventura         0.838 0.0489   17.13 0.000975 0.758 0.839 0.913  2523 1.000
## Yolo            0.851 0.0746   11.41 0.001513 0.714 0.857 0.971  2430 1.001
## Yuba            0.796 0.1020    7.81 0.001978 0.632 0.789 0.947  2661 1.000
theta <- c(tapply(apipop$sch.wide == "Yes", apipop$cname, mean))  # true population quantities
plot_coef(summ, list(est=theta), n.se=2, est.names=c("mcmcsae", "true"), maxrows=30)

References

Battese, G. E., R. M. Harter, and W. A. Fuller. 1988. “An Error-Components Model for Prediction of County Crop Areas Using Survey and Satellite Data.” Journal of the American Statistical Association 83 (401): 28–36.
Rao, J. N. K., and I. Molina. 2015. Small Area Estimation. John Wiley & Sons.