Using missRanger

Usage

library(missRanger)

set.seed(3)

iris_NA <- generateNA(iris, p = 0.1)
head(iris_NA)
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1          5.1         3.5          1.4         0.2  setosa
#> 2          4.9         3.0          1.4          NA  setosa
#> 3          4.7         3.2          1.3         0.2  setosa
#> 4          4.6         3.1          1.5         0.2    <NA>
#> 5           NA         3.6          1.4         0.2  setosa
#> 6          5.4         3.9          1.7         0.4    <NA>
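
Before imputing, it can help to check how much is actually missing. The following is a small sketch using base R only on top of the data generated above; the name `na_share` is made up for illustration.

```r
library(missRanger)

set.seed(3)
iris_NA <- generateNA(iris, p = 0.1)

# Share of missing values per column
na_share <- colMeans(is.na(iris_NA))
round(na_share, 2)

# Number of rows with at least one missing value
sum(!complete.cases(iris_NA))
```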
 
imp <- missRanger(iris_NA, num.trees = 100)
#> 
#> Variables to impute:     Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species
#> Variables used to impute:    Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species
#> 
#> iter 1 
#>   |======================================================================| 100%
#> iter 2 
#>   |======================================================================| 100%
#> iter 3 
#>   |======================================================================| 100%
#> iter 4 
#>   |======================================================================| 100%
head(imp)
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1     5.100000         3.5          1.4   0.2000000  setosa
#> 2     4.900000         3.0          1.4   0.1608667  setosa
#> 3     4.700000         3.2          1.3   0.2000000  setosa
#> 4     4.600000         3.1          1.5   0.2000000  setosa
#> 5     5.061255         3.6          1.4   0.2000000  setosa
#> 6     5.400000         3.9          1.7   0.4000000  setosa

Predictive mean matching

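It worked, but the imputed values look overly precise: Petal.Width in row 2 is imputed as 0.1608667, although the column is recorded to one decimal place. To avoid this, we can add predictive mean matching (PMM) to the OOB predictions. With pmm.k = 5, each prediction is replaced by an observed value drawn from the five cases with the closest predictions: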

imp <- missRanger(iris_NA, pmm.k = 5, num.trees = 100, verbose = 0)
head(imp)
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1          5.1         3.5          1.4         0.2  setosa
#> 2          4.9         3.0          1.4         0.2  setosa
#> 3          4.7         3.2          1.3         0.2  setosa
#> 4          4.6         3.1          1.5         0.2  setosa
#> 5          5.4         3.6          1.4         0.2  setosa
#> 6          5.4         3.9          1.7         0.4  setosa
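
To make the idea behind PMM concrete, here is a bare-bones sketch in base R. It is a simplified illustration, not the code missRanger runs internally; `pmm_sketch()` and the toy vectors are hypothetical names and data introduced only for this example.

```r
# Hypothetical helper, not part of missRanger: a bare-bones version of PMM.
# xtrain: predictions for rows where the variable is observed
# ytrain: the observed values belonging to xtrain
# xtest:  predictions for rows where the variable is missing
pmm_sketch <- function(xtrain, ytrain, xtest, k = 5) {
  vapply(
    xtest,
    function(p) {
      # Indices of the k observed cases whose predictions are closest to p
      nearest <- order(abs(xtrain - p))[seq_len(min(k, length(xtrain)))]
      # Draw one donor among them and return its observed value
      ytrain[nearest[sample.int(length(nearest), 1)]]
    },
    FUN.VALUE = numeric(1)
  )
}

set.seed(1)
observed <- c(0.2, 0.2, 0.4, 1.3, 1.5, 2.1)      # toy observed values
pred_obs <- c(0.25, 0.22, 0.35, 1.28, 1.55, 2.0) # toy predictions for them
pred_mis <- c(0.16, 1.4)                         # toy predictions for missing cells

out <- pmm_sketch(pred_obs, observed, pred_mis, k = 2)
out
```

Because every imputed value is copied from an observed one, the result keeps the rounding and value range of the original column.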

Controlling the random forests

missRanger() offers many options to control the random forests. For example, to use one feature per split (mtry = 1) and 200 trees:

imp <- missRanger(iris_NA, pmm.k = 5, num.trees = 200, mtry = 1, verbose = 0)

Extended output

Setting data_only = FALSE (or keep_forests = TRUE) returns a “missRanger” object. With keep_forests = TRUE, this allows for out-of-sample applications:

imp <- missRanger(
  iris_NA, pmm.k = 5, num.trees = 100, keep_forests = TRUE, verbose = 0
)
imp
#> missRanger object. Extract imputed data via $data
#> - best iteration: 3 
#> - best average OOB imputation error: 0.1468982

summary(imp)
#> missRanger object. Extract imputed data via $data
#> - best iteration: 3 
#> - best average OOB imputation error: 0.1468982 
#> 
#> Sequence of OOB prediction errors:
#> 
#>      Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
#> [1,]    1.0000000   1.1108502   0.39671941  0.18322253 0.06666667
#> [2,]    0.2224743   0.5371919   0.06000731  0.05568752 0.03703704
#> [3,]    0.1732113   0.4517314   0.02408501  0.05583381 0.02962963
#> [4,]    0.1796650   0.4715697   0.02106975  0.05502143 0.03703704
#> 
#> Mean performance per iteration:
#> [1] 0.5514918 0.1824796 0.1468982 0.1528726
#> 
#> First rows of imputed data:
#> 
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1          5.1         3.5          1.4         0.2  setosa
#> 2          4.9         3.0          1.4         0.2  setosa
#> 3          4.7         3.2          1.3         0.2  setosa

# Out-of-sample application
# saveRDS(imp, file = "imputation_model.rds")
# imp <- readRDS("imputation_model.rds")
predict(imp, head(iris_NA))
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1          5.1         3.5          1.4         0.2  setosa
#> 2          4.9         3.0          1.4         0.2  setosa
#> 3          4.7         3.2          1.3         0.2  setosa
#> 4          4.6         3.1          1.5         0.2  setosa
#> 5          5.1         3.6          1.4         0.2  setosa
#> 6          5.4         3.9          1.7         0.4  setosa
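
As a minimal sketch of working with the returned object, the following extracts the imputed data via `$data` and applies the stored forests to a few rows that were altered by hand. The names `completed`, `new_rows` and `filled` are made up for illustration.

```r
library(missRanger)

set.seed(3)
iris_NA <- generateNA(iris, p = 0.1)

imp <- missRanger(
  iris_NA, pmm.k = 5, num.trees = 100, keep_forests = TRUE, verbose = 0
)

# The imputed data set is stored in the $data element
completed <- imp$data
anyNA(completed)

# Out-of-sample: take complete rows and blank out two cells by hand
new_rows <- iris[c(1, 51, 101), ]
new_rows$Sepal.Length[1] <- NA
new_rows$Species[2] <- NA

filled <- predict(imp, new_rows)
filled
```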

Formulas

By default, missRanger() uses all columns to impute all columns with missing values.

This can be modified by passing a formula: the left-hand side specifies the variables to be imputed, while the right-hand side lists the variables used for imputation.

# Impute all variables with all (default)
m <- missRanger(iris_NA, formula = . ~ ., pmm.k = 5, num.trees = 100, verbose = 0)

# Don't use Species for imputation
m <- missRanger(iris_NA, . ~ . - Species, pmm.k = 5, num.trees = 100, verbose = 0)

# Impute Sepal.Length by Species (or not?)
m <- missRanger(iris_NA, Sepal.Length ~ Species, pmm.k = 5, num.trees = 100)
#> 
#> Variables to impute:     Sepal.Length
#> Variables used to impute:    
#> 
#> iter 1 
#>   |======================================================================| 100%
head(m)
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1          5.1         3.5          1.4         0.2  setosa
#> 2          4.9         3.0          1.4          NA  setosa
#> 3          4.7         3.2          1.3         0.2  setosa
#> 4          4.6         3.1          1.5         0.2    <NA>
#> 5          6.2         3.6          1.4         0.2  setosa
#> 6          5.4         3.9          1.7         0.4    <NA>

# Only univariate imputation was done! Why? Because Species contains missing values
# itself and needs to appear on the LHS as well:
m <- missRanger(iris_NA, Sepal.Length + Species ~ Species, pmm.k = 5, num.trees = 100)
#> 
#> Variables to impute:     Sepal.Length, Species
#> Variables used to impute:    Species
#> 
#> iter 1 
#>   |======================================================================| 100%
head(m)
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
#> 1          5.1         3.5          1.4         0.2     setosa
#> 2          4.9         3.0          1.4          NA     setosa
#> 3          4.7         3.2          1.3         0.2     setosa
#> 4          4.6         3.1          1.5         0.2 versicolor
#> 5          6.5         3.6          1.4         0.2     setosa
#> 6          5.4         3.9          1.7         0.4 versicolor

# Impute all variables univariately
m <- missRanger(iris_NA, . ~ 1, verbose = 0)
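
A quick way to see what a formula actually did is to count the missing values per column before and after imputation. The sketch below repeats the `Sepal.Length ~ Species` call from above and uses base R only for the comparison.

```r
library(missRanger)

set.seed(3)
iris_NA <- generateNA(iris, p = 0.1)

# Only Sepal.Length is on the left-hand side, so only it gets imputed
m <- missRanger(
  iris_NA, Sepal.Length ~ Species, pmm.k = 5, num.trees = 100, verbose = 0
)

# Missing values per column before and after imputation
rbind(
  before = colSums(is.na(iris_NA)),
  after = colSums(is.na(m))
)
```

Columns that do not appear on the left-hand side keep their missing values.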

Speed-up things

missRanger() fits a random forest per variable and iteration, so imputation can take a long time. Some tweaks:

Use fewer trees, e.g., num.trees = 100.
Use a smaller tree depth, e.g., max.depth = 6.
Use large leaves, e.g., min.node.size = 100.
Use smaller bootstrap samples, e.g., sample.fraction = 0.2.
Use fewer iterations, e.g., maxiter = 3.

The first three items also help to greatly reduce the size of the models, which might become relevant in out-of-sample applications with keep_forests = TRUE.
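
The tweaks can be combined in one call. The sketch below uses values chosen for the small iris data (150 rows) rather than the ones listed above, and it assumes that the iteration cap is spelled `maxiter` in the function signature. The timing depends on the machine, so no output is shown.

```r
library(missRanger)

set.seed(3)
iris_NA <- generateNA(iris, p = 0.1)

timing <- system.time(
  fast <- missRanger(
    iris_NA,
    num.trees = 50,         # fewer trees
    max.depth = 6,          # shallower trees
    min.node.size = 10,     # larger leaves (iris has only 150 rows)
    sample.fraction = 0.5,  # smaller bootstrap samples
    maxiter = 3,            # assumed spelling of the iteration cap
    verbose = 0
  )
)

timing[["elapsed"]]
anyNA(fast)
```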

Trick: Use `case.weights` to reduce impact of rows with many missings

Using the case.weights argument, you can pass case weights to the imputation models. For instance, this lets you reduce the contribution of rows with many missing values:

m <- missRanger(
  iris_NA,
  num.trees = 100,
  pmm.k = 5,
  case.weights = rowSums(!is.na(iris_NA))
)
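
To see what these weights look like, the following sketch computes them separately and inspects their distribution. The name `w` is made up for illustration.

```r
library(missRanger)

set.seed(3)
iris_NA <- generateNA(iris, p = 0.1)

# Weight of a row = number of non-missing cells in that row
w <- rowSums(!is.na(iris_NA))

# Distribution of the weights across rows
table(w)

# Rows with the lowest weights contribute least to the forests
head(iris_NA[order(w), ])
```

A fully observed row gets weight 5 here, and each missing cell lowers the weight by one.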

Using missRanger

2024-12-07

Overview

Installation

Usage

Predictive mean matching

Controlling the random forests

Extended output

Formulas

Speed-up things

Trick: Use `case.weights` to reduce impact of rows with many missings

References
