The basic unit-level model (Battese, Harter, and Fuller 1988; Rao and Molina 2015) is given by \[ y_j = \beta' x_j + v_{i[j]} + \epsilon_j\,,\\ v_i \stackrel{\mathrm{iid}}{\sim} {\cal N}(0, \sigma_v^2) \qquad \epsilon_j \stackrel{\mathrm{iid}}{\sim} {\cal N}(0, \sigma^2) \] where \(j\) runs from 1 to \(n\), the number of unit-level observations, \(\beta\) is a vector of regression coefficients for given covariates \(x_j\), and \(v_i\) are random area intercepts.
We use the api
dataset included in packages survey.
library(survey)
data(api)
apipop$cname <- as.factor(apipop$cname)
apisrs$cname <- factor(apisrs$cname, levels=levels(apipop$cname))
The apipop
data frame contains the complete population
whereas apisrs
is a simple random sample from it. The
variable cname
is the county name, and we will be
interested in estimation at the county level. Not all counties in the
population are sampled. In order to be able to make predictions for
out-of-sample areas we make sure that the levels of the sample’s
cname
variable match those of its population
counterpart.
The basic unit-level model with county random area effects is fit as follows
library(mcmcsae)
mod <- api00 ~
reg(~ ell + meals + stype + hsg + col.grad + grad.sch, name="beta") +
gen(factor = ~ iid(cname), name="v")
sampler <- create_sampler(mod, data=apisrs)
sim <- MCMCsim(sampler, store.all=TRUE, verbose=FALSE)
(summary(sim))
## llh_ :
## Mean SD t-value MCSE q0.05 q0.5 q0.95 n_eff R_hat
## llh_ -1090 4.49 -243 0.122 -1098 -1090 -1083 1345 1
##
## sigma_ :
## Mean SD t-value MCSE q0.05 q0.5 q0.95 n_eff R_hat
## sigma_ 56.2 3.06 18.4 0.0695 51.3 56.1 61.4 1940 1
##
## beta :
## Mean SD t-value MCSE q0.05 q0.5 q0.95 n_eff R_hat
## (Intercept) 801.807 25.152 31.88 0.57776 760.611 801.613 844.5010 1895 1.003
## ell -1.966 0.301 -6.53 0.00685 -2.448 -1.967 -1.4810 1935 0.999
## meals -1.956 0.282 -6.94 0.00631 -2.410 -1.956 -1.4882 1994 1.001
## stypeH -106.809 13.445 -7.94 0.27453 -129.597 -106.542 -84.8748 2399 1.000
## stypeM -60.085 11.519 -5.22 0.23834 -79.277 -60.028 -41.2435 2336 1.000
## hsg -0.609 0.407 -1.50 0.00799 -1.266 -0.610 0.0486 2591 1.000
## col.grad 0.587 0.495 1.19 0.01149 -0.214 0.583 1.3837 1858 1.001
## grad.sch 2.131 0.483 4.41 0.00976 1.327 2.136 2.9210 2451 1.001
##
## v_sigma :
## Mean SD t-value MCSE q0.05 q0.5 q0.95 n_eff R_hat
## v_sigma 23.1 6.71 3.45 0.222 13 22.6 34.9 916 1
##
## v :
## Mean SD t-value MCSE q0.05 q0.5 q0.95 n_eff R_hat
## Alameda -25.266 15.3 -1.65309 0.365 -50.9 -24.962 -1.15 1751 1.001
## Amador 0.318 24.7 0.01287 0.471 -39.7 0.545 41.01 2751 0.999
## Butte -0.837 24.1 -0.03478 0.440 -40.0 -0.476 38.79 3000 1.002
## Calaveras 7.660 22.5 0.34085 0.451 -27.9 6.642 46.40 2479 1.001
## Colusa 0.318 24.6 0.01295 0.449 -39.1 0.167 39.85 3000 1.000
## Contra Costa -9.761 15.1 -0.64615 0.340 -34.9 -9.922 14.55 1979 1.002
## Del Norte 0.387 24.1 0.01605 0.440 -39.4 0.263 40.58 3000 1.000
## El Dorado -0.140 24.0 -0.00583 0.438 -39.5 0.129 37.90 3000 1.000
## Fresno 0.106 15.6 0.00677 0.312 -25.3 0.131 25.35 2499 1.000
## Glenn -0.327 23.9 -0.01370 0.437 -39.1 -0.420 39.65 2997 0.999
## ... 47 elements suppressed ...
We want to estimate the area population means \[
\theta_i = \frac{1}{N_i}\sum_{j \in U_i} y_j\,,
\] where \(U_i\) is the set of
units in area \(i\) of size \(N_i\). The MCMC output in variable
sim
can be used to obtain draws from the posterior
distribution for \(\theta_i\). The
\(r\)th draw can be expressed as \[
\theta_{i;r} = \frac{1}{N_i} \left(n_i \bar{y}_i + \beta_r'(t_{x;i}
- n_i \bar{x}_i) + (N_i - n_i)v_{i;r} + \sum_{j \in U_i\setminus s_i}
\epsilon_{j;r} \right)\,,
\] where \(\bar{y}_i\) is the
sample mean of \(y\) in area \(i\) and \(t_{x;i}\) is a vector of population totals
for area \(i\).
N <- table(apipop$cname)
samplesums <- tapply(apisrs$api00, apisrs$cname, sum)
samplesums[is.na(samplesums)] <- 0 # substitute 0 for out-of-sample areas
m <- match(apisrs$cds, apipop$cds) # population units in the sample
res <- predict(sim, newdata=apipop, labels=names(N),
fun=function(x, p) (samplesums + tapply(x[-m], apipop$cname[-m], sum ))/N,
show.progress=FALSE)
(summ <- summary(res))
## Mean SD t-value MCSE q0.05 q0.5 q0.95 n_eff R_hat
## Alameda 679 14.92 45.5 0.339 654 679 703 1933 1.001
## Amador 739 31.74 23.3 0.637 688 739 790 2482 1.000
## Butte 678 26.22 25.9 0.498 635 678 720 2777 1.001
## Calaveras 743 27.60 26.9 0.530 699 742 790 2715 1.000
## Colusa 552 31.94 17.3 0.640 499 552 604 2489 1.000
## Contra Costa 727 14.55 50.0 0.288 702 727 751 2554 1.000
## Del Norte 679 32.45 20.9 0.612 625 679 733 2807 1.000
## El Dorado 754 26.58 28.4 0.510 710 754 797 2716 1.000
## Fresno 594 15.23 39.0 0.278 570 594 619 3000 0.999
## Glenn 625 31.15 20.1 0.591 575 624 677 2773 1.000
## Humboldt 702 26.65 26.3 0.523 659 701 746 2593 1.001
## Imperial 558 24.40 22.9 0.479 519 557 599 2596 0.999
## Inyo 707 32.52 21.7 0.629 654 707 761 2673 1.000
## Kern 596 14.89 40.0 0.319 572 597 620 2183 1.004
## Kings 595 22.86 26.0 0.468 556 595 631 2388 1.000
## Lake 662 25.67 25.8 0.481 622 661 704 2848 1.000
## Lassen 706 26.45 26.7 0.497 661 706 749 2827 1.001
## Los Angeles 636 8.26 76.9 0.153 622 636 649 2914 1.002
## Madera 615 20.41 30.1 0.432 582 615 649 2238 1.000
## Marin 816 20.91 39.0 0.387 783 816 852 2920 0.999
## Mariposa 723 35.54 20.4 0.737 666 722 783 2327 1.001
## Mendocino 663 27.27 24.3 0.529 619 662 707 2654 1.000
## Merced 578 23.80 24.3 0.485 539 577 618 2405 1.001
## Modoc 672 30.57 22.0 0.641 623 671 723 2272 0.999
## Mono 705 41.43 17.0 0.770 637 705 773 2892 1.000
## Monterey 628 18.26 34.4 0.339 598 628 658 2911 1.001
## Napa 701 21.79 32.2 0.413 665 701 737 2781 1.004
## Nevada 799 30.27 26.4 0.568 750 798 848 2842 1.000
## Orange 692 15.20 45.5 0.299 668 692 718 2579 0.999
## Placer 772 23.63 32.7 0.456 734 772 811 2681 1.001
## Plumas 691 31.54 21.9 0.604 640 690 742 2723 1.000
## Riverside 636 14.38 44.2 0.284 612 636 659 2571 1.000
## Sacramento 668 15.27 43.7 0.285 643 668 693 2873 1.003
## San Benito 710 30.24 23.5 0.619 658 710 759 2387 0.999
## San Bernardino 647 12.98 49.8 0.237 626 647 669 3000 1.000
## San Diego 704 14.36 49.0 0.319 681 704 728 2031 1.003
## San Francisco 638 19.67 32.4 0.388 606 638 670 2573 1.000
## San Joaquin 630 17.50 36.0 0.373 600 631 657 2203 1.000
## San Luis Obispo 753 24.17 31.2 0.463 714 754 794 2722 1.000
## San Mateo 734 21.57 34.0 0.432 700 733 770 2494 1.000
## Santa Barbara 678 19.46 34.9 0.368 646 678 710 2794 0.999
## Santa Clara 733 15.65 46.8 0.286 707 733 758 3000 0.999
## Santa Cruz 679 20.18 33.7 0.386 644 680 711 2726 1.001
## Shasta 697 22.13 31.5 0.420 661 696 735 2772 1.000
## Sierra 706 41.08 17.2 0.750 637 706 774 3000 1.000
## Siskiyou 697 25.25 27.6 0.491 655 697 739 2641 1.000
## Solano 711 20.78 34.2 0.406 677 711 744 2617 0.999
## Sonoma 725 23.01 31.5 0.464 687 725 762 2457 1.001
## Stanislaus 661 19.64 33.6 0.384 628 661 692 2610 1.001
## Sutter 652 25.29 25.8 0.465 610 652 693 2957 1.001
## Tehama 660 29.84 22.1 0.601 613 660 710 2467 1.000
## Trinity 650 37.93 17.1 0.707 589 650 714 2880 1.000
## Tulare 583 21.62 27.0 0.451 548 583 617 2302 1.001
## Tuolumne 720 30.67 23.5 0.570 669 720 769 2900 1.000
## Ventura 710 17.36 40.9 0.325 682 709 739 2850 1.000
## Yolo 687 24.57 28.0 0.512 646 688 726 2305 1.000
## Yuba 627 28.48 22.0 0.548 582 627 675 2700 1.001
theta <- c(tapply(apipop$api00, apipop$cname, mean)) # true population quantities
plot_coef(summ, list(est=theta), n.se=2, est.names=c("mcmcsae", "true"), maxrows=30)
A model with binomial likelihood can also be fit. We now model the
target variable sch.wide
, a binary variable indicating
whether a school-wide growth target has been met. We use the same mean
model structure as above for the linear model, but now using a logistic
link function, \[
y_j \stackrel{\mathrm{iid}}{\sim} {\cal Be}(p_j)\,,\\
\mathrm{logit}(p_j) = \beta' x_j + v_{i[j]}\,,\\
v_i \stackrel{\mathrm{iid}}{\sim} {\cal N}(0, \sigma_v^2)
\]
apisrs$target.met <- as.numeric(apisrs$sch.wide == "Yes")
sampler <- create_sampler(update(mod, target.met ~ .), family="binomial", data=apisrs)
sim <- MCMCsim(sampler, store.all=TRUE, verbose=FALSE)
summary(sim)
## llh_ :
## Mean SD t-value MCSE q0.05 q0.5 q0.95 n_eff R_hat
## llh_ -83.4 2.59 -32.3 0.0641 -87.9 -83.2 -79.5 1626 1
##
## beta :
## Mean SD t-value MCSE q0.05 q0.5 q0.95 n_eff R_hat
## (Intercept) 4.53494 1.5372 2.9501 0.055004 2.2171 4.411463 7.2657 781 1.00
## ell -0.03588 0.0151 -2.3708 0.000470 -0.0612 -0.035722 -0.0114 1038 1.00
## meals 0.00094 0.0141 0.0665 0.000440 -0.0219 0.000536 0.0249 1034 1.00
## stypeH -2.81340 0.6322 -4.4504 0.020063 -3.8896 -2.790293 -1.8015 993 1.00
## stypeM -1.64635 0.5756 -2.8601 0.016546 -2.5829 -1.649127 -0.6934 1210 1.00
## hsg -0.02527 0.0224 -1.1301 0.000710 -0.0622 -0.024838 0.0116 993 1.00
## col.grad -0.03672 0.0271 -1.3557 0.000898 -0.0827 -0.036046 0.0078 909 1.01
## grad.sch 0.01284 0.0325 0.3951 0.001154 -0.0392 0.012184 0.0664 793 1.01
##
## v_sigma :
## Mean SD t-value MCSE q0.05 q0.5 q0.95 n_eff R_hat
## v_sigma 0.355 0.275 1.29 0.0114 0.0275 0.299 0.888 581 1
##
## v :
## Mean SD t-value MCSE q0.05 q0.5 q0.95 n_eff R_hat
## Alameda -0.17109 0.425 -0.40291 0.01263 -0.990 -6.21e-02 0.348 1130 1.001
## Amador -0.00485 0.472 -0.01026 0.00968 -0.714 -1.23e-04 0.696 2381 1.000
## Butte 0.00379 0.453 0.00838 0.00826 -0.672 -6.63e-04 0.730 3000 1.001
## Calaveras 0.03977 0.453 0.08776 0.00885 -0.630 8.80e-03 0.776 2620 1.000
## Colusa -0.00220 0.437 -0.00505 0.00797 -0.689 -3.98e-04 0.702 3000 1.000
## Contra Costa 0.02002 0.388 0.05163 0.00806 -0.584 2.01e-03 0.665 2316 1.000
## Del Norte 0.00102 0.464 0.00220 0.00889 -0.736 -7.45e-05 0.726 2720 1.000
## El Dorado -0.00879 0.441 -0.01994 0.00805 -0.719 -8.61e-04 0.700 3000 0.999
## Fresno -0.04837 0.372 -0.12991 0.00841 -0.706 -1.66e-02 0.547 1959 1.001
## Glenn -0.00599 0.468 -0.01279 0.00855 -0.717 -2.57e-03 0.693 3000 1.001
## ... 47 elements suppressed ...
To predict the population fractions of schools that meet the growth target by county,
samplesums <- tapply(apisrs$target.met, apisrs$cname, sum)
samplesums[is.na(samplesums)] <- 0 # substitute 0 for out-of-sample areas
res <- predict(sim, newdata=apipop, labels=names(N),
fun=function(x, p) (samplesums + tapply(x[-m], apipop$cname[-m], sum ))/N,
show.progress=FALSE)
(summ <- summary(res))
## Mean SD t-value MCSE q0.05 q0.5 q0.95 n_eff R_hat
## Alameda 0.786 0.0602 13.06 0.001633 0.670 0.796 0.867 1359 1.000
## Amador 0.842 0.1175 7.16 0.002399 0.600 0.900 1.000 2400 1.000
## Butte 0.835 0.0747 11.18 0.001598 0.708 0.833 0.938 2185 1.001
## Calaveras 0.872 0.0956 9.13 0.001943 0.700 0.900 1.000 2420 1.000
## Colusa 0.646 0.1591 4.06 0.003063 0.333 0.667 0.889 2696 1.000
## Contra Costa 0.817 0.0562 14.54 0.001240 0.720 0.821 0.899 2056 0.999
## Del Norte 0.882 0.1096 8.05 0.002071 0.625 0.875 1.000 2802 0.999
## El Dorado 0.849 0.0735 11.54 0.001619 0.725 0.850 0.950 2062 1.000
## Fresno 0.782 0.0591 13.22 0.001439 0.677 0.785 0.871 1689 1.000
## Glenn 0.788 0.1425 5.53 0.002777 0.556 0.778 1.000 2633 1.000
## Humboldt 0.852 0.0730 11.67 0.001577 0.725 0.850 0.950 2145 1.000
## Imperial 0.698 0.1068 6.53 0.002342 0.525 0.700 0.875 2080 1.001
## Inyo 0.782 0.1528 5.12 0.003006 0.571 0.857 1.000 2583 1.001
## Kern 0.806 0.0515 15.65 0.001124 0.716 0.811 0.878 2096 1.002
## Kings 0.849 0.0806 10.53 0.001705 0.720 0.840 0.960 2236 1.000
## Lake 0.827 0.0927 8.92 0.001836 0.682 0.818 0.955 2549 0.999
## Lassen 0.837 0.1083 7.73 0.002298 0.636 0.818 1.000 2220 1.001
## Los Angeles 0.799 0.0426 18.75 0.001028 0.728 0.799 0.870 1717 1.000
## Madera 0.818 0.0764 10.71 0.001541 0.677 0.839 0.935 2457 0.999
## Marin 0.835 0.0721 11.58 0.001748 0.700 0.840 0.940 1701 1.001
## Mariposa 0.857 0.1537 5.58 0.002994 0.600 0.800 1.000 2634 1.000
## Mendocino 0.814 0.0908 8.96 0.001820 0.640 0.800 0.960 2492 1.000
## Merced 0.789 0.0824 9.58 0.001861 0.651 0.794 0.905 1959 1.001
## Modoc 0.832 0.1585 5.25 0.002893 0.600 0.800 1.000 3000 0.999
## Mono 0.745 0.2463 3.03 0.004703 0.333 0.667 1.000 2743 1.000
## Monterey 0.802 0.0670 11.97 0.001503 0.687 0.807 0.904 1988 1.000
## Napa 0.846 0.0791 10.70 0.001622 0.704 0.852 0.963 2376 1.000
## Nevada 0.866 0.0940 9.22 0.001989 0.714 0.857 1.000 2234 1.001
## Orange 0.774 0.0607 12.76 0.001361 0.677 0.775 0.871 1988 1.000
## Placer 0.814 0.0795 10.24 0.002015 0.682 0.818 0.924 1557 1.000
## Plumas 0.737 0.1509 4.88 0.002886 0.444 0.778 1.000 2735 1.000
## Riverside 0.824 0.0521 15.81 0.001268 0.729 0.831 0.898 1692 1.000
## Sacramento 0.846 0.0475 17.80 0.001268 0.767 0.847 0.924 1404 1.000
## San Benito 0.820 0.1258 6.52 0.002610 0.636 0.818 1.000 2321 1.000
## San Bernardino 0.861 0.0417 20.66 0.001120 0.793 0.862 0.928 1385 1.000
## San Diego 0.847 0.0418 20.27 0.000961 0.775 0.850 0.909 1893 1.000
## San Francisco 0.762 0.0739 10.30 0.001492 0.630 0.770 0.870 2457 1.001
## San Joaquin 0.831 0.0552 15.05 0.001207 0.734 0.831 0.912 2094 1.000
## San Luis Obispo 0.853 0.0708 12.06 0.001629 0.725 0.850 0.950 1887 1.001
## San Mateo 0.797 0.0786 10.14 0.002090 0.650 0.811 0.895 1413 1.000
## Santa Barbara 0.831 0.0629 13.22 0.001317 0.716 0.840 0.926 2279 0.999
## Santa Clara 0.785 0.0686 11.45 0.002233 0.656 0.796 0.871 944 1.003
## Santa Cruz 0.787 0.0758 10.39 0.001629 0.653 0.796 0.898 2162 1.000
## Shasta 0.817 0.0760 10.75 0.001719 0.675 0.825 0.925 1957 1.000
## Sierra 0.723 0.2297 3.15 0.004459 0.333 0.667 1.000 2653 1.001
## Siskiyou 0.839 0.0977 8.59 0.002065 0.667 0.867 1.000 2241 1.001
## Solano 0.865 0.0621 13.94 0.001438 0.750 0.867 0.951 1862 1.001
## Sonoma 0.846 0.0598 14.16 0.001238 0.741 0.852 0.935 2329 1.000
## Stanislaus 0.816 0.0703 11.60 0.001737 0.692 0.824 0.912 1638 0.999
## Sutter 0.830 0.0910 9.12 0.001731 0.650 0.850 0.950 2760 1.001
## Tehama 0.844 0.0990 8.52 0.002056 0.647 0.882 1.000 2320 0.999
## Trinity 0.749 0.2058 3.64 0.003959 0.500 0.750 1.000 2702 1.000
## Tulare 0.832 0.0627 13.27 0.001426 0.718 0.836 0.927 1933 1.000
## Tuolumne 0.881 0.0929 9.48 0.001882 0.750 0.917 1.000 2438 1.000
## Ventura 0.838 0.0489 17.13 0.000975 0.758 0.839 0.913 2523 1.000
## Yolo 0.851 0.0746 11.41 0.001513 0.714 0.857 0.971 2430 1.001
## Yuba 0.796 0.1020 7.81 0.001978 0.632 0.789 0.947 2661 1.000
theta <- c(tapply(apipop$sch.wide == "Yes", apipop$cname, mean)) # true population quantities
plot_coef(summ, list(est=theta), n.se=2, est.names=c("mcmcsae", "true"), maxrows=30)