Zero-inflated Poisson model for complex survey data

The Zero-Inflated Poisson model is a model for count data with excess zeros. The response distribution is a mixture of a point mass at zero and a Poisson distribution: if \(Z\) is Bernoulli with probability \(1-p_0\) and \(P\) is Poisson with mean \(\lambda\), then \(Y=Z+(1-Z)P\) is zero-inflated Poisson. The ZIP is a latent-class model; we can have \(Y=0\) either because \(Z=0\) ('structural' zeroes) or because \(P=0\). That’s natural in some ecological examples: if you didn’t see any salmon it could be because the area is salmon-free (it’s Eden Park) or because you just randomly didn’t see any. To turn this into a regression model we typically put a logistic regression structure on \(Z\) and a Poisson regression structure on \(P\).

There isn’t (as far as I know) existing software in R for design-based inference in zero-inflated Poisson models, so it’s a good example for the benefits of svyVGAM. The pscl package (Zeileis et al) fits zero-inflated models, and so does VGAM, so we can compare the model fitted with svyVGAM to both of those and to other work-arounds.

I’ll do an example with data on number of sexual partners, from NHANES 2003-2004. We will look at questions SXQ200 and SXQ100: number of male sexual partners. Combining these gives a ‘real’ zero-inflated variable: based on sexual orientation the zeroes would divide into 'never' and 'not yet'.

Here's how I created the dataset, from two NHANES files. It's data(nhanes_sxq) in the package

First, we'll fit the model just ignoring the survey design, using both pscl::zeroinfl and VGAM::vglm. These models use the same variables in a logistic regression for \(Z\) and a Poisson regression for \(P\). In VGAM you would make the models different by constraining coefficients to be zero in one of the models; in pscl you would specify different models before and after the |.

Re-scaling the weights

A traditional work-around for regression models is to rescale the weights to sum to the sample size and then pretend they are precision weights or frequency weights.

nhanes_sxq$scaledwt<-nhanes_sxq$WTINT2YR/mean(nhanes_sxq$WTINT2YR)

wt= zeroinfl(malepartners~RIDAGEYR+factor(RIDRETH1)+DMDEDUC|RIDAGEYR+factor(RIDRETH1)+DMDEDUC, data=nhanes_sxq, weights=scaledwt)

## Warning in eval(family$initialize): non-integer #successes in a binomial glm!

summary(wt)

## 
## Call:
## zeroinfl(formula = malepartners ~ RIDAGEYR + factor(RIDRETH1) + DMDEDUC | 
##     RIDAGEYR + factor(RIDRETH1) + DMDEDUC, data = nhanes_sxq, weights = scaledwt)
## 
## Pearson residuals:
##     Min      1Q  Median      3Q     Max 
## -1.5864 -0.8418 -0.5430  0.1324 31.9106 
## 
## Count model coefficients (poisson with log link):
##                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        1.8340666  0.0614994  29.823  < 2e-16 ***
## RIDAGEYR          -0.0073881  0.0009059  -8.155 3.49e-16 ***
## factor(RIDRETH1)2 -0.1073243  0.0853534  -1.257   0.2086    
## factor(RIDRETH1)3  0.6551385  0.0481679  13.601  < 2e-16 ***
## factor(RIDRETH1)4  0.6358167  0.0529173  12.015  < 2e-16 ***
## factor(RIDRETH1)5  0.4774152  0.0666607   7.162 7.96e-13 ***
## DMDEDUC           -0.0237391  0.0143070  -1.659   0.0971 .  
## 
## Zero-inflation model coefficients (binomial with logit link):
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        0.660511   0.214458   3.080 0.002071 ** 
## RIDAGEYR          -0.007833   0.003673  -2.133 0.032957 *  
## factor(RIDRETH1)2 -0.116798   0.252451  -0.463 0.643610    
## factor(RIDRETH1)3  0.101968   0.151531   0.673 0.500999    
## factor(RIDRETH1)4 -0.160809   0.181429  -0.886 0.375430    
## factor(RIDRETH1)5  0.106776   0.230339   0.464 0.642964    
## DMDEDUC           -0.202397   0.057624  -3.512 0.000444 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Number of iterations in BFGS optimization: 20 
## Log-likelihood: -9766 on 14 Df

wtv= vglm(malepartners~RIDAGEYR+factor(RIDRETH1)+DMDEDUC, zipoisson(), data = nhanes_sxq, crit = "coef",weights=scaledwt)
summary(wtv)

## 
## Call:
## vglm(formula = malepartners ~ RIDAGEYR + factor(RIDRETH1) + DMDEDUC, 
##     family = zipoisson(), data = nhanes_sxq, weights = scaledwt, 
##     crit = "coef")
## 
## Coefficients: 
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept):1        0.6605042  0.2144354   3.080 0.002069 ** 
## (Intercept):2        1.8340681  0.0614568  29.843  < 2e-16 ***
## RIDAGEYR:1          -0.0078333  0.0036726  -2.133 0.032934 *  
## RIDAGEYR:2          -0.0073881  0.0008995  -8.214  < 2e-16 ***
## factor(RIDRETH1)2:1 -0.1167894  0.2527278  -0.462 0.643999    
## factor(RIDRETH1)2:2 -0.1073312  0.0847641  -1.266 0.205430    
## factor(RIDRETH1)3:1  0.1019712  0.1515002   0.673 0.500899    
## factor(RIDRETH1)3:2  0.6551367  0.0481359  13.610  < 2e-16 ***
## factor(RIDRETH1)4:1 -0.1608040  0.1814098  -0.886 0.375395    
## factor(RIDRETH1)4:2  0.6358147  0.0529738  12.002  < 2e-16 ***
## factor(RIDRETH1)5:1  0.1067789  0.2303235   0.464 0.642931    
## factor(RIDRETH1)5:2  0.4774124  0.0663590   7.194 6.27e-13 ***
## DMDEDUC:1           -0.2023967  0.0576221  -3.512 0.000444 ***
## DMDEDUC:2           -0.0237389  0.0146964  -1.615 0.106249    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Names of linear predictors: logitlink(pstr0), loglink(lambda)
## 
## Log-likelihood: -9765.52 on 5036 degrees of freedom
## 
## Number of Fisher scoring iterations: 8

## repwts
repdes = as.svrepdesign(des,type="Fay",fay.rho=0.2)
rep1 = withReplicates(repdes, quote( 
    coef(zeroinfl(malepartners~RIDAGEYR+factor(RIDRETH1)+DMDEDUC|RIDAGEYR+factor(RIDRETH1)+DMDEDUC, weights=.weights))
    ))
rep1

##                              theta     SE
## count_(Intercept)        1.8340684 0.1350
## count_RIDAGEYR          -0.0073881 0.0028
## count_factor(RIDRETH1)2 -0.1073314 0.2471
## count_factor(RIDRETH1)3  0.6551358 0.1857
## count_factor(RIDRETH1)4  0.6358143 0.1438
## count_factor(RIDRETH1)5  0.4774122 0.2501
## count_DMDEDUC           -0.0237386 0.0797
## zero_(Intercept)         0.6605043 0.2582
## zero_RIDAGEYR           -0.0078333 0.0039
## zero_factor(RIDRETH1)2  -0.1167895 0.2854
## zero_factor(RIDRETH1)3   0.1019712 0.0983
## zero_factor(RIDRETH1)4  -0.1608040 0.0859
## zero_factor(RIDRETH1)5   0.1067790 0.2120
## zero_DMDEDUC            -0.2023967 0.0586

repv = withReplicates(repdes, quote( 
    coef(vglm(malepartners~RIDAGEYR+factor(RIDRETH1)+DMDEDUC, zipoisson(), data = nhanes_sxq, crit = "coef",weights=.weights))
    ))
repv

##                          theta     SE
## (Intercept):1        0.6605042 0.2582
## (Intercept):2        1.8340681 0.1350
## RIDAGEYR:1          -0.0078333 0.0039
## RIDAGEYR:2          -0.0073881 0.0028
## factor(RIDRETH1)2:1 -0.1167894 0.2854
## factor(RIDRETH1)2:2 -0.1073312 0.2471
## factor(RIDRETH1)3:1  0.1019712 0.0983
## factor(RIDRETH1)3:2  0.6551367 0.1857
## factor(RIDRETH1)4:1 -0.1608040 0.0859
## factor(RIDRETH1)4:2  0.6358147 0.1438
## factor(RIDRETH1)5:1  0.1067789 0.2120
## factor(RIDRETH1)5:2  0.4774124 0.2501
## DMDEDUC:1           -0.2023967 0.0586
## DMDEDUC:2           -0.0237389 0.0797

svymle

Another way to fit the model using just the survey package is with svymle. This takes the log-likelihood and its derivative as arguments, and adds linear predictors to some or all of those arguments. That is, we specify the log-likelihood in terms of the Bernoulli parameter \(p_0\) and the Poisson mean \(\lambda\) -- actually \(\mathrm{logit} p_0\) and \(\eta=\log\lambda\), and also give the derivative with respect to these two parameters. The software does the necessary additional work to put linear predictors on the parameters and give us the zero-inflated model. In fact, svymle is very similar in underlying approach to vglm; the difference is that vglm comes with a large collection of predefined models.

In defining the loglikelihood I'm going to take advantage of the Poisson pmf being available in R. Let's call it \(\digamma(y,\lambda)\). The loglikelihood is \[\ell(y; \mu,p_0)=\log\left(p_0\{y==0\}+(1-p)\digamma(y,\mu)\right)\] only we want it in terms of \(\mathrm{logit} p_0\) and \(\eta=\log \lambda\)

loglike = function(y,eta,logitp){
    mu = exp(eta)
    p = exp(logitp)/(1+exp(logitp))
    log(p*(y==0)+(1-p)*dpois(y,mu))
}

We also need the derivatives with respect to \(\mathrm{logit} p_0\) and \(\eta=\log \lambda\)

dlogitp = function(y,eta,logitp){
    mu = exp(eta)
    p = exp(logitp)/(1+exp(logitp))
    dexpit = p/(1+p)^2
    num = dexpit*(y==0)-dexpit*dpois(y,mu)
    denom = p*(y==0)+(1-p)*dpois(y,mu)
    num/denom
    }   
    
deta = function(y,eta,logitp){
    mu = exp(eta)
    p = exp(logitp)/(1+exp(logitp))
    dmutoy = 0*y
    dmutoy[y>0] = exp(-mu[y>0])*mu[y>0]^(y[y>0]-1)/factorial(y[y>0]-1)
    num = (1-p)*(-dpois(y,mu)+dmutoy)
    denom = p*(y==0)+(1-p)*dpois(y,mu)
    num/denom
    }   

score = function(y,eta,logitp) cbind(deta(y,eta,logitp), dlogitp(y,eta,logitp))

And now we call svymle giving the linear predictors for both parameters. One of the formulas needs to include the response variable \(Y\).

nlmfit = svymle(loglike=loglike, grad=score, design=des, 
        formulas=list(eta=malepartners~RIDAGEYR + factor(RIDRETH1) + DMDEDUC, 
        logitp=~RIDAGEYR + factor(RIDRETH1) + DMDEDUC),
      start=coef(unwt), na.action="na.omit",method="BFGS")

summary(nlmfit)

## Survey-sampled mle: 
## svymle(loglike = loglike, gradient = score, design = des, formulas = list(eta = malepartners ~ 
##     RIDAGEYR + factor(RIDRETH1) + DMDEDUC, logitp = ~RIDAGEYR + 
##     factor(RIDRETH1) + DMDEDUC), start = coef(unwt), na.action = "na.omit", 
##     method = "BFGS")
##                                  Coef          SE p.value
## eta.(Intercept)           1.826824375 0.154214263 < 0.001
## eta.RIDAGEYR             -0.007800674 0.003014996 0.00967
## eta.factor(RIDRETH1)2    -0.119695528 0.235192608 0.61080
## eta.factor(RIDRETH1)3     0.639831707 0.165176844 < 0.001
## eta.factor(RIDRETH1)4     0.615167757 0.117750557 < 0.001
## eta.factor(RIDRETH1)5     0.465556579 0.213462435 0.02919
## eta.DMDEDUC              -0.008130626 0.072679434 0.91093
## logitp.(Intercept)        0.578310456 0.246782589 0.01911
## logitp.RIDAGEYR          -0.006077530 0.004017016 0.13029
## logitp.factor(RIDRETH1)2 -0.033442030 0.280701222 0.90517
## logitp.factor(RIDRETH1)3  0.124434633 0.095140201 0.19090
## logitp.factor(RIDRETH1)4 -0.151763506 0.086322700 0.07873
## logitp.factor(RIDRETH1)5  0.119530402 0.209380277 0.56808
## logitp.DMDEDUC           -0.209112732 0.053553190 < 0.001
## Stratified 1 - level Cluster Sampling design (with replacement)
## With (30) clusters.
## svydesign(id = ~SDMVPSU, strat = ~SDMVSTRA, weights = ~WTINT2YR, 
##     nest = TRUE, data = nhanes_sxq)

svyVGAM

Finally, we use svy_vglm, with variances by linearisation

library(svyVGAM)
svy_vglm(malepartners~RIDAGEYR+factor(RIDRETH1)+DMDEDUC, zipoisson(), design=des, crit = "coef")

## Stratified 1 - level Cluster Sampling design (with replacement)
## With (30) clusters.
## svydesign(id = ~SDMVPSU, strat = ~SDMVSTRA, weights = ~WTINT2YR, 
##     nest = TRUE, data = nhanes_sxq)
## 
## Call:
## vglm(formula = formula, family = family, data = surveydata, weights = .survey.prob.weights, 
##     crit = "coef")
## 
## 
## Coefficients:
##       (Intercept):1       (Intercept):2          RIDAGEYR:1          RIDAGEYR:2 
##         0.660504243         1.834068104        -0.007833317        -0.007388071 
## factor(RIDRETH1)2:1 factor(RIDRETH1)2:2 factor(RIDRETH1)3:1 factor(RIDRETH1)3:2 
##        -0.116789371        -0.107331190         0.101971159         0.655136725 
## factor(RIDRETH1)4:1 factor(RIDRETH1)4:2 factor(RIDRETH1)5:1 factor(RIDRETH1)5:2 
##        -0.160804047         0.635814748         0.106778915         0.477412443 
##           DMDEDUC:1           DMDEDUC:2 
##        -0.202396659        -0.023738881 
## 
## Degrees of Freedom: 5050 Total; 5036 Residual
## Log-likelihood: -493703966

and by replicate weights

svy_vglm(malepartners~RIDAGEYR+factor(RIDRETH1)+DMDEDUC, zipoisson(), design=repdes, crit = "coef")

Zero-inflated Poisson model for complex survey data

Thomas Lumley

2026-01-16

Re-scaling the weights

svymle

svyVGAM