Zero-inflated Poisson model for complex survey data

Thomas Lumley

2023-03-30

The Zero-Inflated Poisson model is a model for count data with excess zeros. The response distribution is a mixture of a point mass at zero and a Poisson distribution: if \(Z\) is Bernoulli with probability \(1-p_0\) and \(P\) is Poisson with mean \(\lambda\), then \(Y=Z+(1-Z)P\) is zero-inflated Poisson. The ZIP is a latent-class model; we can have \(Y=0\) either because \(Z=0\) ('structural' zeroes) or because \(P=0\). That’s natural in some ecological examples: if you didn’t see any salmon it could be because the area is salmon-free (it’s Eden Park) or because you just randomly didn’t see any. To turn this into a regression model we typically put a logistic regression structure on \(Z\) and a Poisson regression structure on \(P\).

There isn’t (as far as I know) existing software in R for design-based inference in zero-inflated Poisson models, so it’s a good example for the benefits of svyVGAM. The pscl package (Zeileis et al) fits zero-inflated models, and so does VGAM, so we can compare the model fitted with svyVGAM to both of those and to other work-arounds.

I’ll do an example with data on number of sexual partners, from NHANES 2003-2004. We will look at questions SXQ200 and SXQ100: number of male sexual partners. Combining these gives a ‘real’ zero-inflated variable: based on sexual orientation the zeroes would divide into 'never' and 'not yet'.

Here's how I created the dataset, from two NHANES files. It's data(nhanes_sxq) in the package

library(foreign)
setwd("~/nhanes")
demo = read.xport("demo_c.xpt")
sxq = read.xport("sxq_c.xpt")
merged = merge(demo, sxq, by='SEQN')
merged$total = with(merged, ifelse(RIAGENDR==2, SXQ100+SXQ130, SXQ170+SXQ200))
merged$total[merged$SXQ020==2] = 0
merged$total[merged$total>2000] = NA
merged$age = merged$RIDAGEYR/25
merged$malepartners=with(merged, ifelse(RIAGENDR==2,SXQ100,SXQ200))
merged$malepartners[merged$malepartners>200]=NA
nhanes_sxq<-merged[, c("SDMVPSU","SDMVSTRA","WTINT2YR","RIDAGEYR","RIDRETH1","DMDEDUC","malepartners")]

Start off by loading the packages and setting up a survey design

library(svyVGAM)
## Loading required package: VGAM
## Loading required package: stats4
## Loading required package: splines
## Loading required package: survey
## Loading required package: grid
## Loading required package: Matrix
## Loading required package: survival
## 
## Attaching package: 'survey'
## The following object is masked from 'package:VGAM':
## 
##     calibrate
## The following object is masked from 'package:graphics':
## 
##     dotchart
library(pscl)
## Classes and Methods for R developed in the
## Political Science Computational Laboratory
## Department of Political Science
## Stanford University
## Simon Jackman
## hurdle and zeroinfl functions by Achim Zeileis
data(nhanes_sxq)
des = svydesign(id=~SDMVPSU,strat=~SDMVSTRA,weights=~WTINT2YR, nest=TRUE, data=nhanes_sxq)

First, we'll fit the model just ignoring the survey design, using both pscl::zeroinfl and VGAM::vglm. These models use the same variables in a logistic regression for \(Z\) and a Poisson regression for \(P\). In VGAM you would make the models different by constraining coefficients to be zero in one of the models; in pscl you would specify different models before and after the |.

unwt = zeroinfl(malepartners~RIDAGEYR+factor(RIDRETH1)+DMDEDUC|RIDAGEYR+factor(RIDRETH1)+DMDEDUC, data=nhanes_sxq)
summary(unwt)
## 
## Call:
## zeroinfl(formula = malepartners ~ RIDAGEYR + factor(RIDRETH1) + DMDEDUC | 
##     RIDAGEYR + factor(RIDRETH1) + DMDEDUC, data = nhanes_sxq)
## 
## Pearson residuals:
##     Min      1Q  Median      3Q     Max 
## -1.0204 -0.9433 -0.7871  0.1503 29.2567 
## 
## Count model coefficients (poisson with log link):
##                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        1.6666228  0.0506660  32.894  < 2e-16 ***
## RIDAGEYR          -0.0055102  0.0008969  -6.143 8.08e-10 ***
## factor(RIDRETH1)2 -0.0394019  0.0779480  -0.505    0.613    
## factor(RIDRETH1)3  0.6508821  0.0345734  18.826  < 2e-16 ***
## factor(RIDRETH1)4  0.6675311  0.0365963  18.240  < 2e-16 ***
## factor(RIDRETH1)5  0.5642954  0.0594928   9.485  < 2e-16 ***
## DMDEDUC            0.0094256  0.0135180   0.697    0.486    
## 
## Zero-inflation model coefficients (binomial with logit link):
##                    Estimate Std. Error z value Pr(>|z|)   
## (Intercept)        0.188125   0.187079   1.006  0.31461   
## RIDAGEYR          -0.002938   0.003629  -0.810  0.41823   
## factor(RIDRETH1)2 -0.079636   0.242311  -0.329  0.74242   
## factor(RIDRETH1)3  0.118369   0.116120   1.019  0.30803   
## factor(RIDRETH1)4  0.143300   0.127764   1.122  0.26203   
## factor(RIDRETH1)5  0.259515   0.223032   1.164  0.24460   
## DMDEDUC           -0.148881   0.053337  -2.791  0.00525 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Number of iterations in BFGS optimization: 18 
## Log-likelihood: -9518 on 14 Df
vglm(malepartners~RIDAGEYR+factor(RIDRETH1)+DMDEDUC, zipoisson(), data = nhanes_sxq, crit = "coef")
## 
## Call:
## vglm(formula = malepartners ~ RIDAGEYR + factor(RIDRETH1) + DMDEDUC, 
##     family = zipoisson(), data = nhanes_sxq, crit = "coef")
## 
## 
## Coefficients:
##       (Intercept):1       (Intercept):2          RIDAGEYR:1          RIDAGEYR:2 
##         0.188125349         1.666622759        -0.002937819        -0.005510160 
## factor(RIDRETH1)2:1 factor(RIDRETH1)2:2 factor(RIDRETH1)3:1 factor(RIDRETH1)3:2 
##        -0.079635992        -0.039401949         0.118369301         0.650882145 
## factor(RIDRETH1)4:1 factor(RIDRETH1)4:2 factor(RIDRETH1)5:1 factor(RIDRETH1)5:2 
##         0.143300364         0.667531080         0.259515415         0.564295398 
##           DMDEDUC:1           DMDEDUC:2 
##        -0.148881313         0.009425589 
## 
## Degrees of Freedom: 5050 Total; 5036 Residual
## Log-likelihood: -9517.556

Re-scaling the weights

A traditional work-around for regression models is to rescale the weights to sum to the sample size and then pretend they are precision weights or frequency weights.

nhanes_sxq$scaledwt<-nhanes_sxq$WTINT2YR/mean(nhanes_sxq$WTINT2YR)

wt= zeroinfl(malepartners~RIDAGEYR+factor(RIDRETH1)+DMDEDUC|RIDAGEYR+factor(RIDRETH1)+DMDEDUC, data=nhanes_sxq, weights=scaledwt)
## Warning in eval(family$initialize): non-integer #successes in a binomial glm!
summary(wt)
## 
## Call:
## zeroinfl(formula = malepartners ~ RIDAGEYR + factor(RIDRETH1) + DMDEDUC | 
##     RIDAGEYR + factor(RIDRETH1) + DMDEDUC, data = nhanes_sxq, weights = scaledwt)
## 
## Pearson residuals:
##     Min      1Q  Median      3Q     Max 
## -1.5864 -0.8418 -0.5430  0.1324 31.9106 
## 
## Count model coefficients (poisson with log link):
##                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        1.8340681  0.0614994  29.823  < 2e-16 ***
## RIDAGEYR          -0.0073881  0.0009059  -8.155 3.49e-16 ***
## factor(RIDRETH1)2 -0.1073312  0.0853535  -1.257   0.2086    
## factor(RIDRETH1)3  0.6551367  0.0481679  13.601  < 2e-16 ***
## factor(RIDRETH1)4  0.6358148  0.0529173  12.015  < 2e-16 ***
## factor(RIDRETH1)5  0.4774124  0.0666607   7.162 7.96e-13 ***
## DMDEDUC           -0.0237389  0.0143070  -1.659   0.0971 .  
## 
## Zero-inflation model coefficients (binomial with logit link):
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        0.660504   0.214458   3.080 0.002071 ** 
## RIDAGEYR          -0.007833   0.003673  -2.133 0.032959 *  
## factor(RIDRETH1)2 -0.116789   0.252451  -0.463 0.643636    
## factor(RIDRETH1)3  0.101971   0.151531   0.673 0.500987    
## factor(RIDRETH1)4 -0.160804   0.181429  -0.886 0.375444    
## factor(RIDRETH1)5  0.106779   0.230339   0.464 0.642954    
## DMDEDUC           -0.202397   0.057624  -3.512 0.000444 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Number of iterations in BFGS optimization: 18 
## Log-likelihood: -9766 on 14 Df
wtv= vglm(malepartners~RIDAGEYR+factor(RIDRETH1)+DMDEDUC, zipoisson(), data = nhanes_sxq, crit = "coef",weights=scaledwt)
summary(wtv)
## 
## Call:
## vglm(formula = malepartners ~ RIDAGEYR + factor(RIDRETH1) + DMDEDUC, 
##     family = zipoisson(), data = nhanes_sxq, weights = scaledwt, 
##     crit = "coef")
## 
## Coefficients: 
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept):1        0.6605042  0.2144354   3.080 0.002069 ** 
## (Intercept):2        1.8340681  0.0614568  29.843  < 2e-16 ***
## RIDAGEYR:1          -0.0078333  0.0036726  -2.133 0.032934 *  
## RIDAGEYR:2          -0.0073881  0.0008995  -8.214  < 2e-16 ***
## factor(RIDRETH1)2:1 -0.1167894  0.2527278  -0.462 0.643999    
## factor(RIDRETH1)2:2 -0.1073312  0.0847641  -1.266 0.205430    
## factor(RIDRETH1)3:1  0.1019712  0.1515002   0.673 0.500899    
## factor(RIDRETH1)3:2  0.6551367  0.0481359  13.610  < 2e-16 ***
## factor(RIDRETH1)4:1 -0.1608040  0.1814098  -0.886 0.375395    
## factor(RIDRETH1)4:2  0.6358147  0.0529738  12.002  < 2e-16 ***
## factor(RIDRETH1)5:1  0.1067789  0.2303235   0.464 0.642931    
## factor(RIDRETH1)5:2  0.4774124  0.0663590   7.194 6.27e-13 ***
## DMDEDUC:1           -0.2023967  0.0576221  -3.512 0.000444 ***
## DMDEDUC:2           -0.0237389  0.0146964  -1.615 0.106249    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Names of linear predictors: logitlink(pstr0), loglink(lambda)
## 
## Log-likelihood: -9765.52 on 5036 degrees of freedom
## 
## Number of Fisher scoring iterations: 8 
## 
## No Hauck-Donner effect found in any of the estimates
## repwts
repdes = as.svrepdesign(des,type="Fay",fay.rho=0.2)
rep1 = withReplicates(repdes, quote( 
    coef(zeroinfl(malepartners~RIDAGEYR+factor(RIDRETH1)+DMDEDUC|RIDAGEYR+factor(RIDRETH1)+DMDEDUC, weights=.weights))
    ))
rep1
##                              theta     SE
## count_(Intercept)        1.8335175 0.1350
## count_RIDAGEYR          -0.0073709 0.0028
## count_factor(RIDRETH1)2 -0.1071380 0.2471
## count_factor(RIDRETH1)3  0.6552029 0.1858
## count_factor(RIDRETH1)4  0.6361156 0.1438
## count_factor(RIDRETH1)5  0.4769791 0.2501
## count_DMDEDUC           -0.0238310 0.0797
## zero_(Intercept)         0.6606269 0.2582
## zero_RIDAGEYR           -0.0078221 0.0039
## zero_factor(RIDRETH1)2  -0.1156275 0.2854
## zero_factor(RIDRETH1)3   0.1015741 0.0984
## zero_factor(RIDRETH1)4  -0.1620363 0.0859
## zero_factor(RIDRETH1)5   0.1065392 0.2120
## zero_DMDEDUC            -0.2025776 0.0586
repv = withReplicates(repdes, quote( 
    coef(vglm(malepartners~RIDAGEYR+factor(RIDRETH1)+DMDEDUC, zipoisson(), data = nhanes_sxq, crit = "coef",weights=.weights))
    ))
repv
##                          theta     SE
## (Intercept):1        0.6605042 0.2582
## (Intercept):2        1.8340681 0.1350
## RIDAGEYR:1          -0.0078333 0.0039
## RIDAGEYR:2          -0.0073881 0.0028
## factor(RIDRETH1)2:1 -0.1167894 0.2854
## factor(RIDRETH1)2:2 -0.1073312 0.2471
## factor(RIDRETH1)3:1  0.1019712 0.0983
## factor(RIDRETH1)3:2  0.6551367 0.1857
## factor(RIDRETH1)4:1 -0.1608040 0.0859
## factor(RIDRETH1)4:2  0.6358147 0.1438
## factor(RIDRETH1)5:1  0.1067789 0.2120
## factor(RIDRETH1)5:2  0.4774124 0.2501
## DMDEDUC:1           -0.2023967 0.0586
## DMDEDUC:2           -0.0237389 0.0797

svymle

Another way to fit the model using just the survey package is with svymle. This takes the log-likelihood and its derivative as arguments, and adds linear predictors to some or all of those arguments. That is, we specify the log-likelihood in terms of the Bernoulli parameter \(p_0\) and the Poisson mean \(\lambda\) -- actually \(\mathrm{logit} p_0\) and \(\eta=\log\lambda\), and also give the derivative with respect to these two parameters. The software does the necessary additional work to put linear predictors on the parameters and give us the zero-inflated model. In fact, svymle is very similar in underlying approach to vglm; the difference is that vglm comes with a large collection of predefined models.

In defining the loglikelihood I'm going to take advantage of the Poisson pmf being available in R. Let's call it \(\digamma(y,\lambda)\). The loglikelihood is \[\ell(y; \mu,p_0)=\log\left(p_0\{y==0\}+(1-p)\digamma(y,\mu)\right)\] only we want it in terms of \(\mathrm{logit} p_0\) and \(\eta=\log \lambda\)

loglike = function(y,eta,logitp){
    mu = exp(eta)
    p = exp(logitp)/(1+exp(logitp))
    log(p*(y==0)+(1-p)*dpois(y,mu))
}

We also need the derivatives with respect to \(\mathrm{logit} p_0\) and \(\eta=\log \lambda\)

dlogitp = function(y,eta,logitp){
    mu = exp(eta)
    p = exp(logitp)/(1+exp(logitp))
    dexpit = p/(1+p)^2
    num = dexpit*(y==0)-dexpit*dpois(y,mu)
    denom = p*(y==0)+(1-p)*dpois(y,mu)
    num/denom
    }   
    
deta = function(y,eta,logitp){
    mu = exp(eta)
    p = exp(logitp)/(1+exp(logitp))
    dmutoy = 0*y
    dmutoy[y>0] = exp(-mu[y>0])*mu[y>0]^(y[y>0]-1)/factorial(y[y>0]-1)
    num = (1-p)*(-dpois(y,mu)+dmutoy)
    denom = p*(y==0)+(1-p)*dpois(y,mu)
    num/denom
    }   

score = function(y,eta,logitp) cbind(deta(y,eta,logitp), dlogitp(y,eta,logitp))

And now we call svymle giving the linear predictors for both parameters. One of the formulas needs to include the response variable \(Y\).

nlmfit = svymle(loglike=loglike, grad=score, design=des, 
        formulas=list(eta=malepartners~RIDAGEYR + factor(RIDRETH1) + DMDEDUC, 
        logitp=~RIDAGEYR + factor(RIDRETH1) + DMDEDUC),
      start=coef(unwt), na.action="na.omit",method="BFGS")

summary(nlmfit)
## Survey-sampled mle: 
## svymle(loglike = loglike, gradient = score, design = des, formulas = list(eta = malepartners ~ 
##     RIDAGEYR + factor(RIDRETH1) + DMDEDUC, logitp = ~RIDAGEYR + 
##     factor(RIDRETH1) + DMDEDUC), start = coef(unwt), na.action = "na.omit", 
##     method = "BFGS")
##                                  Coef          SE p.value
## eta.(Intercept)           1.826825789 0.154214277 < 0.001
## eta.RIDAGEYR             -0.007800690 0.003014997 0.00967
## eta.factor(RIDRETH1)2    -0.119694280 0.235192596 0.61081
## eta.factor(RIDRETH1)3     0.639831600 0.165176912 < 0.001
## eta.factor(RIDRETH1)4     0.615167292 0.117750580 < 0.001
## eta.factor(RIDRETH1)5     0.465555942 0.213462405 0.02919
## eta.DMDEDUC              -0.008130865 0.072679440 0.91092
## logitp.(Intercept)        0.578310169 0.246782567 0.01911
## logitp.RIDAGEYR          -0.006077533 0.004017016 0.13029
## logitp.factor(RIDRETH1)2 -0.033440316 0.280701007 0.90517
## logitp.factor(RIDRETH1)3  0.124435365 0.095140203 0.19090
## logitp.factor(RIDRETH1)4 -0.151762524 0.086322705 0.07873
## logitp.factor(RIDRETH1)5  0.119530077 0.209380275 0.56808
## logitp.DMDEDUC           -0.209112828 0.053553191 < 0.001
## Stratified 1 - level Cluster Sampling design (with replacement)
## With (30) clusters.
## svydesign(id = ~SDMVPSU, strat = ~SDMVSTRA, weights = ~WTINT2YR, 
##     nest = TRUE, data = nhanes_sxq)

svyVGAM

Finally, we use svy_vglm, with variances by linearisation

library(svyVGAM)
svy_vglm(malepartners~RIDAGEYR+factor(RIDRETH1)+DMDEDUC, zipoisson(), design=des, crit = "coef")
## Stratified 1 - level Cluster Sampling design (with replacement)
## With (30) clusters.
## svydesign(id = ~SDMVPSU, strat = ~SDMVSTRA, weights = ~WTINT2YR, 
##     nest = TRUE, data = nhanes_sxq)
## 
## Call:
## vglm(formula = formula, family = family, data = surveydata, weights = .survey.prob.weights, 
##     crit = "coef")
## 
## 
## Coefficients:
##       (Intercept):1       (Intercept):2          RIDAGEYR:1          RIDAGEYR:2 
##         0.660504243         1.834068104        -0.007833317        -0.007388071 
## factor(RIDRETH1)2:1 factor(RIDRETH1)2:2 factor(RIDRETH1)3:1 factor(RIDRETH1)3:2 
##        -0.116789371        -0.107331190         0.101971159         0.655136725 
## factor(RIDRETH1)4:1 factor(RIDRETH1)4:2 factor(RIDRETH1)5:1 factor(RIDRETH1)5:2 
##        -0.160804047         0.635814748         0.106778915         0.477412443 
##           DMDEDUC:1           DMDEDUC:2 
##        -0.202396659        -0.023738881 
## 
## Degrees of Freedom: 5050 Total; 5036 Residual
## Log-likelihood: -493703966

and by replicate weights

svy_vglm(malepartners~RIDAGEYR+factor(RIDRETH1)+DMDEDUC, zipoisson(), design=repdes, crit = "coef")