Linear regression provides the historical and conceptual basis for
much of the practical art of predictive modeling. Key advantages of
linear regression models are that they are both easy to fit to data and
easy to interpret and explain to end users. Unfortunately, these models
often predict poorly relative to newer approaches from the machine
learning literature like support vector machines, gradient boosting
machines, or random forests. A significant price paid for the sometimes
dramatically better performance of these alternative models is a greater
difficulty of interpretation and explanation. To address one aspect of
this problem, this vignette considers the problem of assessing variable
importance for a prediction model of arbitrary type, adopting the
well-known random permutation-based approach, and extending it to
consensus-based measures computed from results for a large collection of
models. In particular, this vignette uses the **datarobot**
*R* package, which allows *R* users to interact with the
DataRobot modeling engine, a massively parallel architecture for
simultaneously fitting many models to a single dataset. These models
cover a wide range of types, including constrained and unconstrained
linear regression models, machine learning models like the ones listed
above, and ensemble models that combine the predictions of the most
successful of these individual models in several different ways. To
demonstrate the practical utility of the ideas presented here, this
approach is applied to a simulation-based dataset generated using the
**mlbench.friedman1** function from the *R* package
**mlbench**, where it is possible to judge the
reasonableness of the results relative to a “correct answer.” For a more
detailed introduction to the **datarobot** package and its
capabilities, refer to the companion vignette, “Introduction to the
**DataRobot** *R* Package.”

To help understand the results obtained from complex machine learning
models like random forests or gradient boosting machines, a number of
model-specific variable importance measures have been developed. As
Grömping (2009) notes, however, there is
no general consensus on the “best” way to compute - or even define - the
concept of variable importance. (For an illuminating list of possible
approaches, consult the **help** file for the
**varImp** function in the **caret** package
(Kuhn 2015): this list does not cover all
methods that have been proposed, or even all methods available in
*R*, but it clearly illustrates the range of methods available.)
For the application of primary interest here - i.e., understanding the
aggregate behavior of a collection of models of different types, like
those generated by the DataRobot modeling engine - a “good” variable
importance measure should exhibit the following two characteristics:

- it should be applicable to any model type;
- it should be sensible to combine or compare results across different model types.

One strategy that meets both of these criteria is that suggested by Friedman (2001), applying a random permutation to each covariate and thus effectively removing it from the model, and then examining the impact of this change on the predictive performance of the resulting model. The basic idea is that if we remove an important covariate from consideration, the best achievable model performance should suffer significantly, while if we remove an unimportant covariate, we should see little or no change in performance.
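
As an aside, this basic idea is easy to demonstrate outside the DataRobot workflow. The following sketch uses base *R*, with a linear model standing in for an arbitrary fitted model; all variable names here are illustrative only:

```r
# Illustrative sketch of permutation-based importance (not the DataRobot
# procedure developed below): x1 drives the response, x2 is irrelevant
set.seed(99)
n <- 1000
x1 <- runif(n)
x2 <- runif(n)
y <- 3 * x1 + rnorm(n, sd = 0.1)
fit <- lm(y ~ x1 + x2)
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))
baseRMSE <- rmse(y, predict(fit))
permShift <- function(var) {
  permuted <- data.frame(x1 = x1, x2 = x2)
  permuted[[var]] <- sample(permuted[[var]])  # break the link to y
  rmse(y, predict(fit, newdata = permuted)) - baseRMSE
}
permShift("x1")  # large positive RMSE shift: x1 is important
permShift("x2")  # near zero: x2 is irrelevant
```

Permuting x1 degrades the fit substantially, while permuting x2 leaves the RMSE essentially unchanged; this is exactly the signal that the consensus measures developed below aggregate across many models.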

Ideally, we would like to implement the strategy just described by applying many random permutations and combining the results, but this ideal approach is computationally expensive. In his application to gradient boosting machines, Friedman (2001) could “borrow strength” by combining the random covariate permutations with the random subsampling inherent in the model fitting procedure. In characterizing arbitrary models, however, we cannot exploit their internal structure, but we can borrow strength by combining results across the different models we are fitting. Thus, the approach taken here consists of the following steps:

- Run a DataRobot modeling project based on the original data and retrieve the results;
- For each covariate whose influence we are interested in assessing:
  - Generate a modified dataset, replacing the covariate with its random permutation and leaving all other covariates and the target variable unmodified;
  - Set up and run a new modeling project based on this modified dataset;
  - Retrieve the project information and compute the performance differences between the models in this project and the same models in the original project.
- Characterize each covariate in one of the following three ways:
  - Select an individual model of particular interest (e.g., the model that performs best in the original project) and compute its performance degradation;
  - Compute the average performance degradation over all project models;
  - Compute a performance-weighted average of the degradation, like that defined in Section 5.

The following sections apply the procedure just described to a simulation-based example where we can make reasonable judgements about the relative importance of the covariates.

The basis for the example considered here is the simulation function
**mlbench.friedman1** from the *R* package
**mlbench** (Leisch and Dimitriadou
2010). This function simulates 10 independent, uniformly
distributed random variables on the unit interval, \(x_1\) through \(x_{10}\), and generates the response
variable \(y\) defined by the first
five of these covariates:

\[ y = 10 \sin (\pi x_1 x_2) + 20 (x_3 - 0.5)^2 + 10 x_4 + 5 x_5. \]

The user of the **mlbench.friedman1** simulation
function specifies the number \(n\) of
samples to be generated, and the standard deviation \(\sigma\) of a zero-mean, independent,
identically distributed, Gaussian noise sequence added to the simulated
\(y\) values. In the example considered
here, \(n = 5000\), and the noise
standard deviation had the default value, \(\sigma = 1\). To guarantee repeatability,
**set.seed(33)** was called first to initialize the random number
generator; the resulting dataset was saved in the CSV file
**Friedman1.csv**.
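
For reference, a script along the following lines regenerates this dataset (a sketch, assuming the **mlbench** package is installed; the column names are chosen to match those used in the projects below):

```r
library(mlbench)

# Simulate 5000 records from the Friedman #1 benchmark, default noise sd = 1
set.seed(33)
sim <- mlbench.friedman1(n = 5000, sd = 1)
friedman <- data.frame(sim$x, Y = sim$y)
colnames(friedman) <- c(paste("X", 1:10, sep = ""), "Y")
write.csv(friedman, "Friedman1.csv", row.names = FALSE)
```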

Based on the structure of this simulator, we may classify the covariates into an influential subset (\(x_1\) through \(x_5\)) and an irrelevant subset (\(x_6\) through \(x_{10}\)). Also, from the model coefficients and the functions involved, we expect \(x_4\) to be the most important variable, probably followed by \(x_1\) and \(x_2\), both comparably important, with \(x_5\) probably less important. The influence of \(x_3\) is somewhat more difficult to assess due to its quadratic dependence, but it seems likely that this nonlinearity will suppress this variable’s influence since the total range of this term is from \(0\) to \(5\), the same as the \(x_5\) term.

To explore the utility of the random permutation-based strategy described in Section 3, eleven DataRobot projects were created. The first was based on the unmodified simulation dataset, specifying \(y\) as the target variable and \(x_1\) through \(x_{10}\) as prediction covariates. This project minimizes root mean square prediction error (RMSE), the default fitting metric chosen by DataRobot:

```
friedman <- read.csv(system.file("extdata", "Friedman1.csv.gz", package = "datarobot"))
originalProject <- StartProject(friedman, "OriginalProject", target = "Y", wait = TRUE)
originalModels <- ListModels(originalProject)
```

As discussed in the companion vignette, “Introduction to the
DataRobot R Package,” the above sequence of model-building steps begins
with the **StartProject** function, which uploads the
modeling dataset **Friedman1.csv**, creates the project,
gives it the name “OriginalProject”, and specifies that the response
variable to be predicted by all project models is **Y**.
Because the model-building process does require some time to complete,
the parameter **wait = TRUE** is used to delay the call to
**ListModels**, requesting all project model information,
until the model-building process has completed.

All other projects were set up similarly, except that a single random
permutation was applied to covariate \(x_i\) for the \(i^{th}\) of these
additional projects. The random permutations
were generated using the *R* function
**PermuteColumn**, which has three required parameters:
**originalFile**, the name of the CSV file containing the
unmodified data (here, **Friedman1.csv**);
**colName**, the name of the column in this file containing
the variable whose influence we wish to assess; and
**permutedFile**, the name of the CSV file where the
modified data will be stored. The *R* code is:

```
PermuteColumn <- function(originalFile, colName, permutedFile, iseed = 317) {
  set.seed(iseed)
  originalFile <- system.file("extdata", originalFile, package = "datarobot")
  dframe <- read.csv(originalFile)
  varNames <- colnames(dframe)
  colIndex <- which(varNames == colName)
  x <- dframe[, colIndex]
  y <- sample(x)
  outFrame <- dframe
  outFrame[, colIndex] <- y
  write.csv(outFrame, permutedFile, row.names = FALSE)
}
```

To compute the permutation-based variable importances, all project
results are saved as elements of a composite list,
**modelList**, whose first element is the original project
model list and all subsequent elements are the permutation-based model
lists. The *R* code to generate this list follows (note that
because this process is building over 500 models, it takes some time to
complete):

```
modelList <- vector("list", 11)   # one slot for the original project, ten for permutations
modelList[[1]] <- originalModels
permFile <- tempfile(fileext = "permFile.csv")
for (i in 1:10) {
  varName <- paste("X", i, sep = "")
  PermuteColumn("Friedman1.csv.gz", varName, permFile)
  projName <- paste("PermProject", varName, sep = "")
  permProject <- StartProject(permFile, projectName = projName, target = "Y", wait = TRUE)
  modelList[[i + 1]] <- ListModels(permProject)
}
```

To obtain the desired permutation shifts, it is useful to convert the summary list generated by this code into a dataframe, with the following columns:

- **blueprintId**, an alphanumeric character string that uniquely identifies the model structure and any preprocessing applied;
- **modelType**, a character string describing the model structure;
- **expandedModel**, a character string describing the model structure and any preprocessing applied;
- **samplePct**, the percentage of the training dataset used in fitting the model;
- a column giving the value of the fitting metric computed from the validation set data for the original model;
- one column for each covariate in the original model, giving the validation fitting metric value for the model constructed after the random permutation was applied to that covariate.

This conversion is accomplished by the
**PermutationMerge** function listed below, which requires
the summary list, the **samplePct** value for the models to
be matched, and a vector giving the names for the fitting metric columns
in the resulting dataframe. Matching here is done on the basis of
**blueprintId** and **samplePct**, which
guarantees that fit comparisons are made between original and randomized
fits to the same model for the same sample size, giving an
“apples-to-apples” comparison of the effects of the random permutations
applied to the covariates. In particular, note that DataRobot fits the
same model to several different sample sizes, so it is necessary to
specify both **samplePct** and **blueprintId**
to uniquely identify a model.

```
PermutationMerge <- function(compositeList, matchPct = NULL, metricNames, matchMetric = NULL) {
  df <- as.data.frame(compositeList[[1]], simple = FALSE)
  if (is.null(matchPct)) {
    index <- seq(1, nrow(df), 1)
  } else {
    index <- which(round(df$samplePct) == matchPct)
  }
  if (is.null(matchMetric)) {
    projectMetric <- compositeList[[1]][[1]]$projectMetric
    matchMetric <- paste(projectMetric, "validation", sep = ".")
  }
  getCols <- c("modelType", "expandedModel", "samplePct", "blueprintId", matchMetric)
  outFrame <- df[index, getCols]
  keepCols <- getCols
  keepCols[5] <- metricNames[1]
  colnames(outFrame) <- keepCols
  n <- length(compositeList)
  for (i in 2:n) {
    df <- as.data.frame(compositeList[[i]], simple = FALSE)
    index <- which(round(df$samplePct) == matchPct)
    upFrame <- df[index, c("blueprintId", matchMetric)]
    colnames(upFrame) <- c("blueprintId", metricNames[i])
    outFrame <- merge(outFrame, upFrame, by = "blueprintId")
  }
  outFrame
}
```

Because it gives the largest number of models to compare, the results
for the 16% training sample are considered here, and the
**metricNames** parameter required by the
**PermutationMerge** function is defined as indicated in
the following *R* code:

```
metricNames <- c("originalRMSE", paste("X", seq(1, 10, 1), "RMSE", sep = ""))
mergeFrame <- PermutationMerge(modelList, 16, metricNames)
```

The resulting dataframe gives the original validation set RMSE value
and the permutation RMSE values for all covariates for the 25 models fit
to the 16% data sample (here, this sample consists of 800 records, large
enough to give reasonable comparison results). Figure 1 shows a
*beanplot* summary of these results, comparing the RMSE values
for the 25 common models built for each of the eleven datasets just
described. This figure was generated using the *R* package
**beanplot** (Kampstra 2008),
and the “beans” or red lines in these plots represent the RMSE values
for the individual models from each project. Beanplots are analogous to
the more familiar boxplots, combining the boxplot range information with
a nonparametric density estimate to give a more complete view of how
each data subset is distributed. This additional information is
particularly useful in cases like the one considered here, where these
distributions differ significantly in form between the different
subsets. These plots *suggest* that random permutations applied
to the relevant variables X1 through X5 increase the RMSE values for
most models, while permutations applied to the irrelevant variables X6
through X10 seem to have little impact on these values, but to be
certain we need to make direct comparisons.

Figure 2 presents a more direct beanplot summary of the differences
in RMSE values between the random permutation-based models and the
original model. Here, each beanplot corresponds to one of the covariates
and the individual “beans” in these plots now represent the increase in
validation set RMSE value that results when the random permutation is
applied to the corresponding variable. The dataframe on which this
summary is based was computed using the function
**ComputeDeltas**:

```
ComputeDeltas <- function(mergeFrame, refCol, permNames, shiftNames) {
  allNames <- colnames(mergeFrame)
  refIndex <- which(allNames == refCol)
  xRef <- mergeFrame[, refIndex]
  permCols <- which(allNames %in% permNames)
  xPerm <- mergeFrame[, permCols]
  deltas <- xPerm - xRef
  colnames(deltas) <- shiftNames
  deltas$New <- xRef
  newIndex <- which(colnames(deltas) == "New")
  colnames(deltas)[newIndex] <- refCol
  deltas
}
```

To use this function, we need to specify: **refCol**,
the name of the reference column with the original model RMSE values;
**permNames**, the names of the columns containing the
permutation-based RMSE values; and **shiftNames**, the
names of the permutation-induced shift columns in the dataframe returned
by the function. Here, the following code generates the desired
permutation shift dataframe:

```
allNames <- colnames(mergeFrame)
refCol <- allNames[5]
permNames <- allNames[6:15]
shiftNames <- paste("X", seq(1, 10, 1), sep = "")
deltaFrame <- ComputeDeltas(mergeFrame, refCol, permNames, shiftNames)
```

The beanplot shown in Figure 2 includes solid blue lines at the average RMSE shift for each covariate and a dashed reference line at zero RMSE shift; also, the results for the best model for this example (RuleFit Regressor) are shown as solid green circles. It is clear from these results that random permutations applied to the relevant variables X1 through X5 cause significant RMSE increases for most models, while random permutations applied to the irrelevant variables X6 through X10 have essentially no effect. An important exception is the trivial mean response model: this model consistently exhibits the worst RMSE performance in Figure 1, but because its response is independent of all covariates, it shows zero RMSE shift in response to all covariate permutations in Figure 2.

While graphical summaries like those shown in Figures 1 and 2 are useful, we often want numerical importance measures. The RMSE shifts presented here provide a basis for computing these measures, which can be defined in several different ways. The simplest is to average the RMSE shifts over all models, giving a consensus-based relative importance value for each variable, corresponding to the blue lines shown in Figure 2. A potential disadvantage of this approach is that it gives equal weights to all models, including those like the mean response model for which the RMSE shifts are always zero. At the other extreme, we could take the RMSE shift in one individual model - the best predictor is an obvious choice here, corresponding to the green points in Figure 2 - but this has the potential disadvantage of being model-specific and it does not make full use of the performance shift data from all of the available models. An intermediate course is to form a performance-weighted average of the individual model results, giving greater weight to better-performing models. A specific implementation of this idea is the following performance-weighted average:

\[ \nu(x) = \frac{\sum_{i=1}^{M} \; (\mbox{RMSE}_i^x - \mbox{RMSE}_i^0)/\mbox{RMSE}_i^0}{\sum_{i=1}^{M} \; 1/\mbox{RMSE}_i^0}, \]

where \(\mbox{RMSE}_i^x\) represents
the RMSE value for the \(i^{th}\) model
in the collection fit from the dataset with a random permutation applied
to the covariate \(x\), and \(\mbox{RMSE}_i^0\) is the RMSE value for
this model when fit to the original dataset. For the example considered
here, this approach gives the RMSE shift for the best model approximately
five times the weight of that for the poorest model. A simple
*R* function to compute this performance-weighted average from
the shift summary dataframes described above is
**varImpSummary**:

```
varImpSummary <- function(deltaFrame, refCol, oneIndex) {
  vars <- colnames(deltaFrame)
  refIndex <- which(vars == refCol)
  refValue <- deltaFrame[, refIndex]
  wts <- 1 / refValue   # Performance-weights = reciprocal fitting measure
  deltasOnly <- deltaFrame[, -refIndex]
  thisModel <- as.numeric(deltasOnly[oneIndex, ])
  avg <- apply(deltasOnly, MARGIN = 2, mean)
  WtAvgFunction <- function(x, w) { sum(w * x) / sum(w) }
  wtAvg <- apply(deltasOnly, MARGIN = 2, WtAvgFunction, wts)
  varImpFrame <- data.frame(average = avg,
                            weightedAverage = wtAvg,
                            oneModel = thisModel)
  varImpFrame
}
```

Here, **deltaFrame** is the dataframe returned by the
function **ComputeDeltas** listed above, **refCol** names the
column of **deltaFrame** containing the original RMSE values (the
\(\mbox{RMSE}_i^0\) values in the above equation), and
**oneIndex** is the row number in **deltaFrame** describing
the performance of a single model of particular interest. In this
example, **oneIndex** specifies the best non-blender model in the
original DataRobot project with no permutations applied to any variable.

The table shown below gives a summary of the three numerical variable importance measures just described for the simulation-based example considered here. Specifically, this table lists the average RMSE shift, the performance-weighted average RMSE shift defined above, and the shift associated with the best model. Based on these results, the relevant variables X1 through X5 are consistently ranked as much more important than the irrelevant variables, X6 through X10. Also, all three of the measures considered here give the same relative ranking to the relevant variables: X4 is most important, followed by X2 and X1, with similar shift values, then X5 and finally X3. Recall that this corresponds to the approximate ordering anticipated on the basis of the structure of the simulation model generating the data.

```
## Avg WtdAvg Best
## 1 1.024 1.137 1.745
## 2 1.105 1.230 1.926
## 3 0.374 0.448 0.706
## 4 1.508 1.636 2.167
## 5 0.543 0.599 0.903
## 6 -0.001 0.000 -0.014
## 7 0.000 -0.001 -0.020
## 8 0.010 0.011 -0.020
## 9 0.002 0.001 -0.013
## 10 0.002 0.004 -0.025
```

This note has described the use of random permutations to assess variable importance for predictive models of arbitrary type. The DataRobot modeling engine fits many different model types to the same dataset, building models that predict the same target variable from the same covariates, and the random permutation approach described here can be applied to all of these models, yielding variable importance assessments for any covariate and each model in the DataRobot modeling project. Since there is nothing model-specific about this random permutation strategy, the results can be compared and combined across all of these models, leading to consensus-based variable importance measures like those discussed in Section 5. These ideas were illustrated here for the simulation-based dataset described in Section 2, making it possible to compare the results with expectations based on the simulation model structure. For this example, all three of the numerical measures proposed in Section 5 gave results that were in agreement with these expectations. The companion vignette, “Using Many Models to Compare Datasets” presents the corresponding variable importance results for two real datasets.

Friedman, J. H. 2001. “Greedy Function Approximation: A Gradient
Boosting Machine.” *Annals of Statistics* 29: 1189–1232.

Grömping, U. 2009. “Variable Importance Assessment in Regression:
Linear Regression Versus Random Forest.” *The American
Statistician* 63: 308–19.

Kampstra, Peter. 2008. “Beanplot: A Boxplot Alternative for Visual
Comparison of Distributions.” *Journal of Statistical
Software, Code Snippets* 28 (1): 1–9.

Kuhn, M. 2015. *Caret: Classification and Regression Training*.

Leisch, Friedrich, and Evgenia Dimitriadou. 2010. *Mlbench: Machine
Learning Benchmark Problems*.