Recursive Two-Stage Models to Address Endogeneity

**Table 1. Recursive Two-Stage Models Supported by the Endogeneity Package**
Model	First Stage	Second Stage	Endogenous Variable	Outcome Variable
biprobit	probit	probit	binary	binary
biprobit_latent	probit	probit	binary (unobserved)	binary
biprobit_partial	probit	probit	binary (partially observed)	binary
probit_linear	probit	linear	binary	continuous
probit_linear_latent	probit	linear	binary (unobserved)	continuous
probit_linear_partial	probit	linear	binary (partially observed)	continuous
probit_linearRE	probit	linearRE	binary	continuous
pln_linear	pln	linear	count	continuous
pln_probit	pln	probit	count	binary

2. Models

Let M and Y denote the endogenous variable and the outcome variable, respectively. The models listed in Table 1 are specified as follows.

2.1. biprobit

This model can be used when the endogenous variable and the outcome variable are both binary. The first and second stages of the model are given by:

First stage (Probit): \[m_i=1(\alpha'w_i+u_i>0)\]

Second stage (Probit):

\[y_i=1(\beta'x_i+\gamma m_i+v_i>0)\]

Endogeneity structure:

\[\begin{pmatrix} u_i \\ v_i \end{pmatrix}\sim N\left(\begin{pmatrix} 0 \\ 0 \end{pmatrix},\begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}\right). \]

where \(w_i\) represents the set of covariates influencing the endogenous variable \(m_i\), and \(x_i\) denotes the set of covariates influencing the outcome variable \(y_i\). \(u_i\) and \(v_i\) are assumed to follow a standard bivariate normal distribution with correlation \(\rho\); a nonzero \(\rho\) means that \(m_i\) is endogenous in the outcome equation. As is customary in a Probit model, the variance of the error term is normalized to one in both stages so that the parameter estimates are unique.
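The data generating process above can be sketched in base R. This is a minimal simulation under assumed parameter values (all coefficients, the sample size, and the variable names are illustrative and not taken from the package); it only shows how the two stages and the correlated errors fit together, and how a single-equation probit that treats \(m_i\) as exogenous can be fit for comparison.

```r
# Simulate the biprobit DGP with base R only (all parameter values are illustrative).
set.seed(1)
N   <- 5000
rho <- -0.5                                  # assumed correlation between u and v
w   <- rnorm(N)                              # covariate entering the first stage only
x   <- rnorm(N)                              # covariate entering both stages

# Draw (u, v) from a standard bivariate normal with correlation rho
u <- rnorm(N)
v <- rho * u + sqrt(1 - rho^2) * rnorm(N)

m <- as.numeric(1 + x + w + u > 0)           # first stage (probit)
y <- as.numeric(0.5 + x + m + v > 0)         # second stage (probit), true gamma = 1

# Single-equation probit that ignores the correlation between u and v
naive <- glm(y ~ x + m, family = binomial(link = "probit"))
coef(naive)["m"]
```

Because the naive fit ignores \(\rho\), its coefficient on m is not a consistent estimate of \(\gamma\) when \(\rho \neq 0\); the recursive model estimates both stages and \(\rho\) jointly.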

2.2. biprobit_latent and biprobit_partial

These two models can be used when the endogenous variable and the outcome variable are both binary, but the endogenous variable is unobserved or partially observed. A typical example is a mediator that researchers cannot measure, or can measure for only some units.

The first and second stages of the biprobit_latent model are given by:

First stage (Latent Probit): \[m_i^*=1(\alpha'w_i+u_i>0)\]

Second stage (Probit):

\[y_i=1(\beta'x_i+\gamma m_i^*+v_i>0)\]

Endogeneity structure:

\[\begin{pmatrix} u_i \\ v_i \end{pmatrix}\sim N\left(\begin{pmatrix} 0 \\ 0 \end{pmatrix},\begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}\right). \]

where \(w_i\) represents the set of covariates influencing the unobserved endogenous variable \(m_i^*\), and \(x_i\) denotes the set of covariates influencing the outcome variable \(y_i\). \(u_i\) and \(v_i\) are assumed to follow a standard bivariate normal distribution. To ensure that the estimates of the above model are unique, \(\gamma\) is restricted to be positive. Even with this constraint, the identification of this model can still be weak.

The only difference between biprobit_latent and biprobit_partial is that the latter allows the endogenous variable M to be partially observed. Compared to the case when M is fully unobserved, measuring M for 10% to 20% of units can substantially improve the identification of the model.
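A partially observed M can be mimicked in a simulation by hiding most of its values. The sketch below uses base R only; coding unmeasured units as NA and the name `observed_share` are assumptions made for illustration, so the package documentation should be checked for the input that biprobit_partial actually expects.

```r
# Hide M for most units to mimic a partially observed endogenous variable.
set.seed(1)
N <- 1000
m <- rbinom(N, 1, 0.5)                  # stand-in for the true binary endogenous variable

observed_share <- 0.2                   # hypothetical: M is measured for 20% of units
hide <- sample.int(N, size = round(N * (1 - observed_share)))

m_partial <- m
m_partial[hide] <- NA                   # NA marks units where M was not measured (assumed coding)

mean(!is.na(m_partial))                 # share of units with M observed
```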

2.3. probit_linear

This model can be used when the endogenous variable is binary and the outcome variable is continuous. The first and second stages of the model are given by:

First stage (Probit): \[m_i=1(\alpha'w_i+u_i>0)\]

Second stage (Linear):

\[y_i=\beta'x_i+\gamma m_i+\sigma v_i\]

Endogeneity structure:

\[\begin{pmatrix} u_i \\ v_i \end{pmatrix}\sim N\left(\begin{pmatrix} 0 \\ 0 \end{pmatrix},\begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}\right). \]

where \(w_i\) represents the set of covariates influencing the endogenous variable \(m_i\), and \(x_i\) denotes the set of covariates influencing the outcome variable \(y_i\). \(u_i\) and \(v_i\) are assumed to follow a standard bivariate normal distribution. \(\sigma^2\) represents the variance of the error term in the outcome equation.
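The reason a regression of \(y_i\) on \(x_i\) and \(m_i\) alone is biased when \(\rho \neq 0\) can be made explicit. The expressions below are not stated in the source; they follow from the model equations and standard results for the truncated normal distribution, with \(\phi\) and \(\Phi\) denoting the standard normal density and distribution function:

\[E[v_i\mid m_i=1,w_i]=\rho\,\frac{\phi(\alpha'w_i)}{\Phi(\alpha'w_i)},\qquad E[v_i\mid m_i=0,w_i]=-\rho\,\frac{\phi(\alpha'w_i)}{1-\Phi(\alpha'w_i)}\]

\[E[y_i\mid x_i,w_i,m_i=1]-E[y_i\mid x_i,w_i,m_i=0]=\gamma+\sigma\rho\,\frac{\phi(\alpha'w_i)}{\Phi(\alpha'w_i)\left(1-\Phi(\alpha'w_i)\right)}\]

The second term vanishes only when \(\rho=0\), so the observed difference between the two groups mixes the effect \(\gamma\) with a selection term whose sign is the sign of \(\rho\).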

2.4. probit_linear_latent and probit_linear_partial

These two models can be used when the outcome variable is continuous and the endogenous variable is an unobserved or partially observed binary variable. A typical example is a mediator that researchers cannot measure, or can measure for only some units.

The first and second stages of the probit_linear_latent model are given by:

First stage (Latent Probit): \[m_i^*=1(\alpha'w_i+u_i>0)\]

Second stage (Linear):

\[y_i=\beta'x_i+\gamma m_i^*+\sigma v_i\]

Endogeneity structure:

\[\begin{pmatrix} u_i \\ v_i \end{pmatrix}\sim N\left(\begin{pmatrix} 0 \\ 0 \end{pmatrix},\begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}\right). \]

The only difference between probit_linear_latent and probit_linear_partial is that the latter allows the endogenous variable M to be partially observed. Compared to the case when M is fully unobserved, measuring M for 10% to 20% of units can substantially improve the identification of the model.
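When \(m_i^*\) is never observed, each observation contributes a two-component mixture to the likelihood. The density below is derived from the probit_linear_latent equations rather than taken from the package source, and \(r_{ik}=(y_i-\beta'x_i-\gamma k)/\sigma\) is shorthand introduced here for the standardized residual under \(m_i^*=k\):

\[f(y_i\mid x_i,w_i)=\sum_{k\in\{0,1\}}\frac{1}{\sigma}\,\phi(r_{ik})\,\Phi\left((2k-1)\,\frac{\alpha'w_i+\rho\,r_{ik}}{\sqrt{1-\rho^2}}\right)\]

If \(x_i\) includes an intercept, relabeling the two components (flipping the signs of \(\alpha\), \(\gamma\) and \(\rho\) and shifting the intercept by \(\gamma\)) leaves this density unchanged. This suggests why a sign restriction such as the one on \(\gamma\) in Section 2.2 helps pin down the estimates, and why observing M for even a subset of units adds information.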

2.5. probit_linearRE

This model is an extension of the probit_linear model to panel data. The outcome variable is a time-variant continuous variable, and the endogenous variable is a time-invariant binary variable. The first and second stages of the model are given by:

First stage (Probit): \[m_i=1(\alpha'w_i+u_i>0)\]

Second stage (Panel linear model with individual-level random effects):

\[y_{it}=\beta'x_{it}+\gamma m_i+\lambda v_i+\sigma \varepsilon_{it}\]

Endogeneity structure:

\[\begin{pmatrix} u_i \\ v_i \end{pmatrix}\sim N\left(\begin{pmatrix} 0 \\ 0 \end{pmatrix},\begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}\right). \]

where \(w_i\) represents the set of covariates influencing the time-invariant endogenous variable \(m_i\), and \(x_{it}\) denotes the set of covariates influencing the outcome variable \(y_{it}\). \(v_i\) represents the individual-level random effect and is assumed to follow a standard bivariate normal distribution with \(u_i\), so \(\lambda^2\) is the variance of the random effect. \(\sigma^2\) represents the variance of the idiosyncratic error term in the outcome equation.
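The panel structure can be illustrated with a short base R simulation. All parameter values, the panel dimensions, and the names `id`, `Tn` and `panel` are assumptions chosen for illustration; the sketch only shows that \(m_i\) and \(v_i\) are drawn once per individual while \(x_{it}\) and \(\varepsilon_{it}\) vary over time.

```r
# Simulate the probit_linearRE DGP in long format (illustrative values only).
set.seed(1)
N  <- 500                                   # individuals
Tn <- 5                                     # periods per individual
rho <- -0.5; lambda <- 1; sigma <- 1

id <- rep(seq_len(N), each = Tn)            # individual index for each row

# Individual-level draws: covariate, first-stage error, random effect
w <- rnorm(N)
u <- rnorm(N)
v <- rho * u + sqrt(1 - rho^2) * rnorm(N)   # corr(u, v) = rho
m <- as.numeric(1 + w + u > 0)              # time-invariant binary endogenous variable

# Row-level draws: time-varying covariate and idiosyncratic error
x   <- rnorm(N * Tn)
eps <- rnorm(N * Tn)
y   <- 1 + x + m[id] + lambda * v[id] + sigma * eps

panel <- data.frame(id = id, x = x, w = w[id], m = m[id], y = y)
head(panel)
```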

2.6. pln_linear

This model can be used when the endogenous variable is a count measure and the outcome variable is continuous. The first and second stages of the model are given by:

First stage (Poisson lognormal): \[E[m_i|w_i,u_i]=\exp(\alpha'w_i+\lambda u_i)\]

Second stage (linear):

\[y_i=\beta'x_i+\gamma m_i+\sigma v_i\]

Endogeneity structure:

\[\begin{pmatrix} u_i \\ v_i \end{pmatrix}\sim N\left(\begin{pmatrix} 0 \\ 0 \end{pmatrix},\begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}\right). \]

where \(w_i\) represents the set of covariates influencing the endogenous variable \(m_i\), and \(x_i\) denotes the set of covariates influencing the outcome variable \(y_i\). \(u_i\) and \(v_i\) are assumed to follow a standard bivariate normal distribution. \(\lambda^2\) and \(\sigma^2\) represent the variance of the error terms in the first and second stages, respectively.
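A count-valued endogenous variable of this kind can be simulated in base R by drawing from a Poisson distribution whose log-mean contains the normal error \(u_i\). The parameter values and variable names below are illustrative assumptions, not package defaults.

```r
# Simulate the pln_linear DGP (illustrative values only).
set.seed(1)
N <- 5000
rho <- -0.5; lambda <- 0.5; sigma <- 1

w <- rnorm(N)
x <- rnorm(N)
u <- rnorm(N)
v <- rho * u + sqrt(1 - rho^2) * rnorm(N)   # corr(u, v) = rho

mu <- exp(0.5 + 0.5 * w + lambda * u)       # conditional mean of the count
m  <- rpois(N, mu)                          # first stage (Poisson lognormal)
y  <- 1 + x + m + sigma * v                 # second stage (linear), true gamma = 1

c(mean = mean(m), variance = var(m))        # compare the two to gauge overdispersion
```

Replacing the last stage with `as.numeric(1 + x + m + v > 0)` would give a binary outcome in the spirit of pln_probit.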

2.7. pln_probit

This model can be used when the endogenous variable is a count measure and the outcome variable is binary. The first and second stages of the model are given by:

First stage (Poisson lognormal): \[E[m_i|w_i,u_i]=\exp(\alpha'w_i+\lambda u_i)\]

Second stage (Probit):

\[y_i=1(\beta'x_i+\gamma m_i+v_i>0)\]

Endogeneity structure:

\[\begin{pmatrix} u_i \\ v_i \end{pmatrix}\sim N\left(\begin{pmatrix} 0 \\ 0 \end{pmatrix},\begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}\right). \]

where \(w_i\) represents the set of covariates influencing the endogenous variable \(m_i\), and \(x_i\) denotes the set of covariates influencing the outcome variable \(y_i\). \(u_i\) and \(v_i\) are assumed to follow a standard bivariate normal distribution. \(\lambda^2\) represents the variance of the error term in the first stage. The variance of the error term in the second stage Probit model is normalized to 1.
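The role of \(\lambda\) in the Poisson lognormal first stage, which pln_linear and pln_probit share, can be seen by integrating out \(u_i\). Assuming \(m_i\) is Poisson conditional on \(w_i\) and \(u_i\), standard lognormal moments (not stated in the source) give:

\[E[m_i\mid w_i]=\exp\left(\alpha'w_i+\frac{\lambda^2}{2}\right),\qquad Var[m_i\mid w_i]=E[m_i\mid w_i]+\left(e^{\lambda^2}-1\right)E[m_i\mid w_i]^2\]

The variance exceeds the mean whenever \(\lambda \neq 0\), so the first stage accommodates overdispersed counts, and the same \(u_i\) that generates the overdispersion carries the correlation with \(v_i\).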

3. Examples

After loading the endogeneity package, type “example(model_name)” to see sample code for each model. For example, the code below runs the probit_linear model on a simulated dataset with the following data generating process (DGP):

\[m_i=1(1+x_i+z_i+u_i>0)\]

\[y_i=1+x_i+z_i+m_i+v_i\]

\[\begin{pmatrix} u_i \\ v_i \end{pmatrix}\sim N\left(\begin{pmatrix} 0 \\ 0 \end{pmatrix},\begin{pmatrix} 1 & -0.5 \\ -0.5 & 1 \end{pmatrix}\right). \]

library(endogeneity)
example(probit_linear, prompt.prefix=NULL)
#> 
#> > library(MASS)
#> 
#> > N = 2000
#> 
#> > rho = -0.5
#> 
#> > set.seed(1)
#> 
#> > x = rbinom(N, 1, 0.5)
#> 
#> > z = rnorm(N)
#> 
#> > e = mvrnorm(N, mu=c(0,0), Sigma=matrix(c(1,rho,rho,1), nrow=2))
#> 
#> > e1 = e[,1]
#> 
#> > e2 = e[,2]
#> 
#> > m = as.numeric(1 + x + z + e1 > 0)
#> 
#> > y = 1 + x + z + m + e2
#> 
#> > est = probit_linear(m~x+z, y~x+z+m)
#> ==== Converged after 129 iterations, LL=-3424.12, gtHg=0.000067 ****
#> LR test of rho=0, chi2(1)=20.632, p-value=0.0000
#> Time difference of 0.115402 secs
#> 
#> > print(est$estimates, digits=3)
#>                    estimate     se     z        p    lci    uci
#> probit.(Intercept)    0.971 0.1232  7.88 3.44e-15  0.729  1.212
#> probit.x              0.996 0.0527 18.92 0.00e+00  0.893  1.099
#> probit.z              0.971 0.0338 28.68 0.00e+00  0.904  1.037
#> linear.(Intercept)    1.045 0.1567  6.67 2.51e-11  0.738  1.352
#> linear.x              1.019 0.0549 18.55 0.00e+00  0.911  1.127
#> linear.z              0.948 0.0853 11.11 0.00e+00  0.781  1.115
#> linear.m              0.984 0.0497 19.77 0.00e+00  0.886  1.081
#> sigma                 1.034 0.0206 50.08 0.00e+00  0.994  1.075
#> rho                  -0.487 0.0773 -6.31 2.80e-10 -0.624 -0.322

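The estimates are close to the true values in the DGP (all coefficients equal to 1, \(\sigma=1\), \(\rho=-0.5\)), and every interval between lci and uci contains the corresponding true value.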

Jing Peng

2023-08-20
