Modeling heterogeneity

In the vignette on the model definition, we pointed out that the probit model can capture choice behavior heterogeneity by imposing a mixing distribution on the coefficient vector. The implementation in {RprobitB} is explained in this vignette1.

Estimating a joint normal mixing distribution

The {mlogit} package (Croissant 2020) contains the data set Electricity, in which residential electricity customers were asked to decide between four hypothetical electricity suppliers. The suppliers differed in 6 characteristics:

  1. their fixed price pf per kWh,
  2. their contract length cf,
  3. a boolean loc, indicating whether the supplier is a local company,
  4. a boolean wk, indicating whether the supplier is a well known company,
  5. a boolean tod, indicating whether the supplier offers a time-of-day electricity price (which is higher during the day and lower during the night), and
  6. a boolean seas, indicating whether the supplier’s price is seasonal dependent.

This constitutes a choice situation where choice behavior heterogeneity is expected: some customers might prefer a time-of-day electricity price (because they may be not at home during the day), while others can have the opposite preference. Ideally these differences in preferences should be modeled using characteristics of the deciders. In many cases (as in this data set) we do not have adequate information. Instead these differences in taste can be captured by means of a mixing distribution for the tod coefficient. This corresponds to the assumption of a random coefficient from the underlying mixing distribution to be drawn for each decider. We can use the estimated mixing distribution to determine for example the share of deciders that have a positive versus negative preference towards time-of-day electricity prices.

Additionally, we expect correlations between the random coefficients to certain covariates, for example a positive correlation between the influence of loc and wk: deciders that prefer local suppliers might also prefer well known companies due to recommendations and past experiences, although they might be more expensive than unknown suppliers. The fitted multivariate normal distribution will reveal these correlations.

The following lines prepare the Electricity data set for estimation. We use the convenience function as_cov_names() that relabels the data columns for alternative specific covariates into the required format “<covariate>_<alternative>”, compare to the vignette on choice data. Via the re = c("cl","loc","wk","tod","seas") argument, we specify that we want to model random effects for all but the price coefficient, which we will fix to -1 to interpret the other estimates as monetary values.

data("Electricity", package = "mlogit")
Electricity <- as_cov_names(Electricity, c("pf", "cl", "loc", "wk", "tod", "seas"), 1:4)
data <- prepare_data(
  form = choice ~ pf + cl + loc + wk + tod + seas | 0,
  choice_data = Electricity,
  re = c("cl", "loc", "wk", "tod", "seas")
)
model_elec <- fit_model(data, R = 1000, scale = "pf := -1")

Calling the coef() method on the estimated model also returns the estimated (marginal) variances of the mixing distribution besides the average mean effects:

coef(model_elec)
#>         Estimate   (sd) Variance   (sd)
#> 1   pf     -1.00 (0.00)       NA   (NA)
#> 2   cl     -0.27 (0.04)     0.33 (0.04)
#> 3  loc      2.89 (0.27)     7.82 (1.60)
#> 4   wk      2.14 (0.22)     4.15 (0.97)
#> 5  tod     -9.91 (0.25)    13.01 (2.29)
#> 6 seas     -9.96 (0.21)     6.73 (1.45)

By the sign of the estimates we can for example deduce, that the existence of the time-of-day electricity price tod in the contract has a negative effect. However, the deciders are very heterogeneous here, because the estimated variance of this coefficient is large. The same holds for the contract length cl. In particular, the estimated share of the population that prefers to have a longer contract length equals:

cl_mu <- coef(model_elec)["cl", "mean"]
cl_sd <- sqrt(coef(model_elec)["cl", "var"])
pnorm(cl_mu / cl_sd)
#> [1] 0.3185047

The correlation between the covariates can be accessed as follows:2

cov_mix(model_elec, cor = TRUE)
#>               cl        loc          wk         tod         seas
#> cl    1.00000000 0.09161238 0.058215654 -0.05419413 -0.100685302
#> loc   0.09161238 1.00000000 0.792512872  0.13606295  0.027063294
#> wk    0.05821565 0.79251287 1.000000000  0.13506495  0.003499879
#> tod  -0.05419413 0.13606295 0.135064950  1.00000000  0.544640146
#> seas -0.10068530 0.02706329 0.003499879  0.54464015  1.000000000

Here, we see the confirmation of our initial assumption about a high correlation between loc and wk. The pairwise mixing distributions can be visualized via calling the plot() method with the additional argument type = mixture:

plot(model_elec, type = "mixture")

Estimating latent classes

More generally, {RprobitB} allows to specify a Gaussian mixture as the mixing distribution. In particular,

\[ \beta \sim \sum_{c=1}^C \text{MVN} (b_c,\Omega_c).\] This specification allows for a) a better approximation of the true underlying mixing distribution and b) a preference based classification of the deciders.

To estimate a latent mixture, specify a named list latent_classes with the following arguments and submit it to the estimation routine fit_model:

Weight-based update of the latent classes

The following weight-based updating scheme is analogue to Bauer, Büscher, and Batram (2019) and executed within the burn-in period:

These rules contain choices on the values for \(\varepsilon_{\text{min}}\), \(\varepsilon_{\text{max}}\) and \(\varepsilon_{\text{distmin}}\). The adequate value for \(\varepsilon_{\text{distmin}}\) depends on the scale of the parameters. Per default, {RprobitB} sets

These values can be adapted through the latent_class list.

Dirichlet process-based update of the latent classes

As an alternative to the weight-based updating scheme to determine the correct number \(C\) of latent classes, {RprobitB} implements the Dirichlet process.4 The Dirichlet Process is a Bayesian nonparametric method, where nonparametric means that the number of model parameters can theoretically grow to infinity. The method allows to add more mixture components to the mixing distribution if needed for a better approximation, see Neal (2000) for a documentation of the general case. The literature offers many representations of the method, including the Chinese Restaurant Process (Aldous 1985), the stick-braking metaphor (Sethuraman 1994), and the Polya Urn model (Blackwell and MacQueen 1973).

In our case, we face the situation to find a distribution \(g\) that explains the decider-specific coefficients \((\beta_n)_{n = 1,\dots,N}\), where \(g\) is supposed to be a mixture of an unknown number \(C\) of Gaussian densities, i.e. \(g = \sum_{c = 1,\dots,C} s_c \text{MVN}(b_c, \Omega_c)\).

Let \(z_n \in \{1,\dots,C\}\) denote the class membership of \(\beta_n\). A priori, the mixture weights \((s_c)_c\) are given a Dirichlet prior with concentration parameter \(\delta/C\), i.e. \((s_c)_c \mid \delta \sim \text{D}_C(\delta/C,\dots,\delta/C)\). Rasmussen (1999) shows that

\[ \Pr((z_n)_n\mid \delta) = \frac{\Gamma(\delta)}{\Gamma(N+\delta)} \prod_{c=1}^C \frac{\Gamma(m_c + \delta/C)}{\Gamma(\delta/C)}, \] where \(\Gamma(\cdot)\) denotes the gamma function and \(m_c = \#\{n:z_n = c\}\) the number of elements that are currently allocated to class \(c\). Crucially, the last equation is independent of the class weights \((s_c)_c\), yet it still depends on the finite number \(C\) of latent classes. However, Li, Schofield, and Gönen (2019) shows that

\[ \Pr(z_n = c \mid z_{-n}, \delta) = \frac{m_{c,-n} + \delta/C}{N-1+\delta},\] where the notation \(-n\) means excluding the \(n\)th element. We can let \(C\) approach infinity to derive:

\[ \Pr(z_n = c \mid z_{-n}, \delta) \to \frac{m_{c,-n}}{N-1+\delta}. \]

Note that the allocation probabilities do not sum to 1, instead

\[ \sum_{c = 1}^C \frac{m_{c,-n}}{N-1+\delta} = \frac{N-1}{N-1+\delta}. \]

The difference to 1 equals

\[ \Pr(z_n \neq z_m ~ \forall ~ m \neq n \mid z_{-n}, \delta) = \frac{\delta}{N-1+\delta} \]

and constitutes the probability that a new cluster for observation \(n\) is created. Neal (2000) points out that this probability is proportional to the prior parameter \(\delta\): A greater value for \(\delta\) encourages the creation of new clusters, a smaller value for \(\delta\) increases the probability of an allocation to an already existing class.

In summary, the Dirichlet process updates the allocation of each \(\beta\) coefficient vector one at a time, dependent on the other allocations. The number of clusters can theoretically rise to infinity, however, as we delete unoccupied clusters, \(C\) is bounded by \(N\). As a final step after the allocation update, we update the class means \(b_c\) and covariance matrices \(\Omega_c\) by means of their posterior predictive distribution. The mean and covariance matrix for a new generated cluster is drawn from the prior predictive distribution. The corresponding formulas are given in Li, Schofield, and Gönen (2019).

The Dirichlet process directly integrates into our existing Gibbs sampler. Given \(\beta\) values, it updated the class means \(b_c\) and class covariance matrices \(\Omega_c\). The Dirichlet process updating scheme is implemented in the function update_classes_dp(). In the following, we give a small example in the bivariate case P_r = 2. We sample true class means b_true and class covariance matrices Omega_true for C_true = 3 true latent classes. The true (unbalanced) class sizes are given by the vector N, and z_true denotes the true allocations.

set.seed(1)
P_r <- 2
C_true <- 3
N <- c(100, 70, 30)
(b_true <- matrix(replicate(C_true, rnorm(P_r)), nrow = P_r, ncol = C_true))
#>            [,1]       [,2]       [,3]
#> [1,] -0.6264538 -0.8356286  0.3295078
#> [2,]  0.1836433  1.5952808 -0.8204684
(Omega_true <- matrix(replicate(C_true, rwishart(P_r + 1, 0.1 * diag(P_r))$W, simplify = TRUE),
  nrow = P_r * P_r, ncol = C_true
))
#>           [,1]        [,2]       [,3]
#> [1,] 0.3093652  0.14358543  0.2734617
#> [2,] 0.1012729 -0.07444148 -0.1474941
#> [3,] 0.1012729 -0.07444148 -0.1474941
#> [4,] 0.2648235  0.05751780  0.2184029
beta <- c()
for (c in 1:C_true) {
  for (n in 1:N[c]) {
    beta <- cbind(beta, rmvnorm(mu = b_true[, c, drop = F], Sigma = matrix(Omega_true[, c, drop = F], ncol = P_r)))
  }
}
z_true <- rep(1:3, times = N)

We specify the following prior parameters (for their definition see the vignette on model fitting):

delta <- 0.1
xi <- numeric(P_r)
D <- diag(P_r)
nu <- P_r + 2
Theta <- diag(P_r)

Initially, we start with C = 1 latent classes. The class mean b is set to zero, the covariance matrix Omega to the identity matrix:

z <- rep(1, ncol(beta))
C <- 1
b <- matrix(0, nrow = P_r, ncol = C)
Omega <- matrix(rep(diag(P_r), C), nrow = P_r * P_r, ncol = C)

The following call to update_classes_dp() updates the latent classes in 100 iterations. Note that we specify the arguments Cmax and s_desc. The former denotes the maximum number of latent classes. This specification is not a requirement for the Dirichlet process per se, but rather for its implementation. Knowing the maximum possible class number, we can allocate the required memory space, which leads to a speed improvement. We later can verify that we won’t exceed the number of Cmax = 10 latent classes at any point of the Dirichlet process. Setting s_desc = TRUE ensures that the classes are ordered by their weights in a descending order to ensure identifiability.

for (r in 1:100) {
  dp <- RprobitB:::update_classes_dp(
    Cmax = 10, beta, z, b, Omega, delta, xi, D, nu, Theta, s_desc = TRUE
  )
  z <- dp$z
  b <- dp$b
  Omega <- dp$Omega
}

The Dirichlet process was able to infer the true number C_true = 3 of latent classes:

par(mfrow = c(1, 2))
plot(t(beta), xlab = bquote(beta[1]), ylab = bquote(beta[2]), pch = 19)
RprobitB:::plot_class_allocation(beta, z, b, Omega, r = 100, perc = 0.95)

References

Aldous, David J. 1985. “Exchangeability and Related Topics.” In École d’Été de Probabilités de Saint-Flour XIII — 1983, edited by P. L. Hennequin, 1–198. Berlin, Heidelberg: Springer Berlin Heidelberg.
Bauer, D., S. Büscher, and M. Batram. 2019. “Non-Parameteric Estiation of Mixed Discrete Choice Models.” Second International Choice Modelling Conference in Kobe.
Blackwell, D., and J. MacQueen. 1973. “Ferguson Distributions via Polya Urn Schemes.” The Annals of Statistics 1: 353–55.
Burda, M., M. Harding, and J. Hausman. 2008. “A Bayesian Mixed Logit-Probit Model for Multinomial Choice.” Journal of Econometrics 147 (2). https://doi.org/10.1016/j.jeconom.2008.09.029.
Croissant, Y. 2020. “Estimation of Random Utility Models in R: The Mlogit Package.” Journal of Statistical Software 95 (11). https://doi.org/10.18637/jss.v095.i11.
Li, Y., E. Schofield, and M. Gönen. 2019. “A Tutorial on Dirichlet Process Mixture Modeling.” Journal of Mathematical Psychology 91. https://doi.org/10.1016/j.jmp.2019.04.004.
Neal, R. M. 2000. “Markov Chain Sampling Methods for Dirichlet Process Mixture Models.” Journal of Computational and Graphical Statistics 9 (2). https://doi.org/10.2307/1390653.
Rasmussen, Carl. 1999. “The Infinite Gaussian Mixture Model.” Advances in Neural Information Processing Systems 12.
Sethuraman, J. 1994. “A Constructive Definition of Dirichlet Priors.” Statistica Sinica 4 (2): 639–50. http://www.jstor.org/stable/24305538.

  1. This vignette is built using R 4.1.2 with the {RprobitB} 1.1.4 package.↩︎

  2. Setting cor = FALSE instead returns the estimated covariance matrix.↩︎

  3. If either weight_update = TRUE or dp_update = TRUE (i.e. if classes are updated), C equals the initial number of latent classes.↩︎

  4. A similar approach in the context of discrete choice can be found in Burda, Harding, and Hausman (2008), where the Dirichlet process is applied to estimate a mixed logit-probit model.↩︎