This vignette was one of a series of `demo()`

files in the
package. It is still there as `demo("housing")`

, but is now
presented here with additional commentary and analysis, designed to
highlight some aspects of analysis of categorical data and graphical
display.

I’ll use the following packages in this vignette.

```
library(vcdExtra)
library(MASS)
library(effects)
```

The content here is the dataset `MASS::housing`

, giving a
4-way, \(3 \times 3 \times 4 \times 2\)
frequency table of 1681 individuals from the *Copenhagen Housing
Conditions Survey*, classified by their:

- Satisfaction (
`Sat`

) with their housing circumstances (low, medium or high), `Type`

of rental dwelling (Tower, Apartment, Atrium or Terrace)- perceived influence (
`Infl`

) on management of the property (low, medium, high), and - degree of contact (
`Cont`

) with other residents (low or high)

Load the data:

```
data(housing, package="MASS")
str(housing)
## 'data.frame': 72 obs. of 5 variables:
## $ Sat : Ord.factor w/ 3 levels "Low"<"Medium"<..: 1 2 3 1 2 3 1 2 3 1 ...
## $ Infl: Factor w/ 3 levels "Low","Medium",..: 1 1 1 2 2 2 3 3 3 1 ...
## $ Type: Factor w/ 4 levels "Tower","Apartment",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ Cont: Factor w/ 2 levels "Low","High": 1 1 1 1 1 1 1 1 1 1 ...
## $ Freq: int 21 21 28 34 22 36 10 11 36 61 ...
```

Satisfaction (`Sat`

) of these householders with their
present housing circumstances is the **outcome variable**
here.

For purposes of analysis, note that `Sat`

is an ordered
factor with levels `"Low" < "Medium" < "High"`

.

Note also that Influence, with the same levels is just a “Factor”, not an ordered one.

I consider here just models using
`glm(..., family=poisson)`

or the equivalent in
`MASS::loglm()`

.

The ordering of factor levels is important in graphical displays. We
don’t want to see them ordered alphabetically, “High”, “Low”, “Medium”.
The `housing`

data.frame was constructed so that the levels
of `Sat`

and `Infl`

appear in the dataset in their
appropriate order.

```
levels(housing$Sat)
## [1] "Low" "Medium" "High"
levels(housing$Infl)
## [1] "Low" "Medium" "High"
```

Other models, e.g., the **proportional odds** model, fit
using `MASS:polr()`

can take the ordinal nature of
satisfaction into account. In `glm()`

one could re-assign
`Infl`

as an ordered factor and examine linear vs. non-linear
associations for this factor. But I don’t do this here.

The most ignorant model asserts that all the table factors are
mutually independent. In symbolic notation, this is
`[S] [I] [T] [C]`

where all terms in separate
`[ ]`

are supposed to be independent. This is
`Freq ~ Sat + Infl + Type + Cont`

as a formula for
`glm()`

.

```
<- glm(Freq ~ Sat + Infl + Type + Cont, family = poisson,
house.null data = housing)
```

When `Sat`

is the outcome variable, a minimal
**baseline model** should allow for all associations among
the predictors, symbolized as `[S] [I T C]`

. That is,
Influence, Type and Contact may be associated in arbitrary ways, just as
multiple predictors can be correlated in regression models.

In this framework, what remains to be explained is whether/how
`Sat`

depends on the combinations of the other variables. The
baseline model therefore includes the full three-way term for the
predictors.

```
<- glm(Freq ~ Sat + Infl*Type*Cont, family = poisson,
house.glm0 data = housing)
```

Both of these models fit terribly, but we can always use
`anova(mod1, mod2,...)`

to compare the *relative* fits
of **nested** models.

```
anova(house.null, house.glm0, test = "Chisq")
## Analysis of Deviance Table
##
## Model 1: Freq ~ Sat + Infl + Type + Cont
## Model 2: Freq ~ Sat + Infl * Type * Cont
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 63 295.35
## 2 46 217.46 17 77.896 9.045e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

The baseline model is shown in the mosaic plot below. Note that this
is applied not to the `housing`

data, but rather to the
`house.glm0`

object (of class `glm`

) resulting to
a call to `vcdExtra::mosaic.glm()`

.

With four variables in the mosaic, labeling of the variable names and
factor levels is a bit tricky, because labels must appear on all four
sides of the plot. The `labeling_args`

argument can be used
to set more informative variable names and abbreviate factor levels
where necessary.

```
# labeling_args for mosaic()
<- list(set_varnames = c(
largs Infl="Influence on management",
Cont="Contact among residents",
Type="Type of dwelling",
Sat="Satisfaction"),
abbreviate=c(Type=3))
mosaic(house.glm0,
labeling_args=largs,
main='Baseline model: [ITC][Sat]')
## Warning: no formula provided, assuming ~Sat + Infl + Type + Cont
```

In this plot we can see largish positive residuals in the blocks corresponding to (low satisfaction, low influence) and (high satisfaction, high influence) and clusters of largish negative residuals in the opposite corners.

By default, variables are used in the mosaic display in their order
in the data table or frequency data.frame. The warning reminds us that the order of
conditioning used is `~Sat + Infl + Type + Cont`

.

For `mosaic.glm()`

, the conditioning order of variables in
the mosaic can be set using the `formula`

argument. Here, I
rearrange the variables to put `Sat`

as the last variable in
the splitting / conditioning sequence. I also use
`vcdExtra::modFit()`

to add the LR \(G^2\) fit statistic to the plot title.

```
mosaic(house.glm0,
formula = ~ Type + Infl + Cont + Sat,
labeling_args=largs,
main=paste('Baseline model: [ITC][Sat],', modFit(house.glm0))
)
```

Clearly, satisfaction depends on one or more of the predictors,
`Infl`

, `Type`

and `Cont`

and possibly
their interactions.

As a first step it is useful to consider sequentially adding the
association terms `Infl:Sat`

, `Type:Sat`

,
`Cont:Sat`

one at a time. This analysis is carried out using
`MASS::addterm()`

.

```
::addterm(house.glm0,
MASS~ . + Sat:(Infl + Type + Cont),
test = "Chisq")
## Single term additions
##
## Model:
## Freq ~ Sat + Infl * Type * Cont
## Df Deviance AIC LRT Pr(Chi)
## <none> 217.46 610.43
## Sat:Infl 4 111.08 512.05 106.371 < 2.2e-16 ***
## Sat:Type 6 156.79 561.76 60.669 3.292e-11 ***
## Sat:Cont 2 212.33 609.30 5.126 0.07708 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

Based on this, it is useful to consider a “main-effects” model for satisfaction, adding all three two-way terms involving satisfaction.

The `update()`

method provides an easy way to add (or
subtract) terms from a fitted model object. In the model formula,
`.`

stands for whatever was on the left side
(`Freq`

) or on the right side
(`Sat + Infl*Type*Cont`

) of the model
(`house.glm0`

) that is being updated.

```
<- update(house.glm0,
house.glm1 ~ . + Sat*(Infl + Type + Cont)) .
```

For comparison, we note that the same model can be fit using the
iterative proportional scaling algorithm of
`MASS::loglm()`

.

```
<- MASS::loglm(Freq ~ Infl * Type * Cont +
(house.loglm1 *(Infl + Type + Cont), data = housing))
Sat## Call:
## MASS::loglm(formula = Freq ~ Infl * Type * Cont + Sat * (Infl +
## Type + Cont), data = housing)
##
## Statistics:
## X^2 df P(> X^2)
## Likelihood Ratio 38.66222 34 0.2671359
## Pearson 38.90831 34 0.2582333
```

As before, `anova()`

tests the added contribution of each
more complex model over the one before. The residual deviance \(G^2\) has been reduced from \(G^2 (46) = 217.46\) for the baseline model
`house.glm0`

to \(G^2 (34) =
38.66\) for the revised model `house.glm1`

. The
difference, \(G^2(M1 | M0) = G^2 (12) =
178.79\) tests the collective additional fit provided by the
two-way association of satisfaction with the predictors.

```
anova(house.glm0, house.glm1, test="Chisq")
## Analysis of Deviance Table
##
## Model 1: Freq ~ Sat + Infl * Type * Cont
## Model 2: Freq ~ Sat + Infl + Type + Cont + Infl:Type + Infl:Cont + Type:Cont +
## Sat:Infl + Sat:Type + Sat:Cont + Infl:Type:Cont
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 46 217.456
## 2 34 38.662 12 178.79 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

`glm1`

The model `house.glm1`

fits reasonably well,
G^2(34)=38.66, so most residuals are small. In the mosaic below, I use
`gp=shading_Friendly`

to shade the tiles so that positive and
negative residuals are distinguished by color, and they are filled when
the absolute value of the residual is outside \(\pm 2, 4\).

```
mosaic(house.glm1,
labeling_args=largs,
main=paste('Model [IS][TS][CS],', modFit(house.glm1) ),
gp=shading_Friendly)
```

One cell is highlighted here: The combination of medium influence, low contact and tower type, is more likely to give low satisfaction than the model predicts. Is this just an outlier, or is there something that can be interpreted and perhaps improve the model fit? It is hard tell, but the virtues of mosaic displays are that they help to:

- diagnose overall patterns of associations,
- spot unusual cells in relation to lack of fit of a given model.

When we add terms using `MASS::addterm()`

, they are added
sequentially. It might be the case that once some term is added, a
previously added term is no longer important. Running
`MASS::dropterm()`

on the `housel.glm1`

model
checks for this.

```
::dropterm(house.glm1, test = "Chisq")
MASS## Single term deletions
##
## Model:
## Freq ~ Sat + Infl + Type + Cont + Infl:Type + Infl:Cont + Type:Cont +
## Sat:Infl + Sat:Type + Sat:Cont + Infl:Type:Cont
## Df Deviance AIC LRT Pr(Chi)
## <none> 38.662 455.63
## Sat:Infl 4 147.780 556.75 109.117 < 2.2e-16 ***
## Sat:Type 6 100.889 505.86 62.227 1.586e-11 ***
## Sat:Cont 2 54.722 467.69 16.060 0.0003256 ***
## Infl:Type:Cont 6 43.952 448.92 5.290 0.5072454
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

Note that the three-way term `Infl:Type:Cont`

is not
significant. However, with `Sat`

as the response, the
associations of all predictors must be included in the model.

The model so far says that each of influence, type and control have
separate, additive effects on the level of satisfaction, what I called a
“main-effects” model. It might be the case that some of the predictors
have *interaction* effects, e.g., that the effect of influence on
satisfaction might vary with the type of dwelling or the level of
control.

An easy way to test for these is to update the main-effects model,
adding all possible two-way interactions for `Sat`

, one at a
time, with `addterm()`

.

```
::addterm(house.glm1,
MASS~. + Sat:(Infl + Type + Cont)^2,
test = "Chisq")
## Single term additions
##
## Model:
## Freq ~ Sat + Infl + Type + Cont + Infl:Type + Infl:Cont + Type:Cont +
## Sat:Infl + Sat:Type + Sat:Cont + Infl:Type:Cont
## Df Deviance AIC LRT Pr(Chi)
## <none> 38.662 455.63
## Sat:Infl:Type 12 16.107 457.08 22.5550 0.03175 *
## Sat:Infl:Cont 4 37.472 462.44 1.1901 0.87973
## Sat:Type:Cont 6 28.256 457.23 10.4064 0.10855
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

The result shows that adding the term `Infl:Type:Sat`

reduces the deviance \(G^2\) from 38.66
to 16.11. The difference, \(G^2(M1 + ITS | M1)
= G^2 (12) = 22.55\) reflects a substantial improvement. The
remaining two-way interaction terms reduce the deviance by smaller and
non-significant amounts, relative to `house.glm1`

.

Model fitting should be guided by substance, not just statistical
machinery. Nonetheless, it seems arguably sensible to add one two-way
term to the model, giving `house.glm2`

.

```
<- update(house.glm1,
house.glm2 ~ . + Sat:Infl:Type) .
```

Adding more association terms to a model will always improve it. The question is, whether that is “worth it”?

“Worth it” concerns the trade-off between model fit and parsimony. Sometimes we might prefer a model with fewer parameters to one that has a slightly better fit, but requires more model terms and parameters.

The AIC and BIC statistics are designed to adjust our assessment of
model fit by penalizing it for using more parameters. Equivalently, they
deduct from the likelihood ratio \(G^2\) a term proportional to the residual
\(\text{df}\) of the model. In any case
– **smaller is better** for both AIC and BIC.

\[AIC = G^2 - 2 \: \text{df}\] \[BIC = G^2 - \log(n) \: \text{df}\]

These measures are provided by `AIC()`

,
`BIC()`

, and can be used to compare models using
`vcdExtra::LRstats()`

.

```
LRstats(house.glm0, house.glm1, house.glm2)
## Likelihood summary table:
## AIC BIC LR Chisq Df Pr(>Chisq)
## house.glm0 610.43 669.62 217.456 46 <2e-16 ***
## house.glm1 455.63 542.15 38.662 34 0.2671
## house.glm2 457.08 570.91 16.107 22 0.8105
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

By these metrics, model `house.glm1`

is best on both AIC
and BIC. The increased goodness-of-fit (smaller \(G^2\)) of model `house.glm2`

is
not worth the extra cost of parameters in the `house.glm2`

model.