The **Rborist** package implements the Random Forest (TM) algorithm, with particular emphasis on high performance. The package is an **R**-language spinoff of the **Arborist** project, a multi-language effort targeting a variety of decision-tree methods. Look and feel owes a large debt to Liaw’s original **randomForest** package.

The interpretation of the phrase “high performance” will vary among users. We claim that the **Rborist** is a high-performance package primarily because it either does, or has the potential to, take advantage of the acceleration offered by commodity parallel hardware. We also expect performance to scale in accordance with algorithmic complexity, decaying gracefully as resource limitations are approached.

Particular attention has been paid to minimizing data movement and, especially, toward maximizing *data locality*. We believe that this has been a key contributing factor to performance and scalability and will continue to play a major role in efforts to extend support more broadly.

The current implementation is limited to in-memory execution on multicore and multi-socket hardware. We envision the following improvements in the near to medium term:

- Separate training of tree blocks over multiple compute nodes.

- Training of significant portions of individual trees on vector coprocessors, such as GPUs.

- Pipelined training over out-of-memory workloads.

The simplest way to invoke the package is to pass it a design matrix and response vector, of conforming dimensions. For appropriately typed design *x* and response *y*, then, it suffices to call:

`rs <- Rborist(x, y)`

The design can be either a *data.frame*, a numeric matrix or an integer matrix. In the case of a frame, the columns (predictors) must, individually, have either numeric or factor value. Integer matrices are coerced internally to their numeric counterparts. The response type may be either numeric, yielding a regression forest, or categorical, yielding a classification forest.
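As a minimal sketch of these conventions, the following codelet trains on a mixed-type frame with a synthetic data set (the predictor and response values here are purely illustrative):

```
# Frame with one numeric and one factor-valued predictor.
x <- data.frame(temp   = runif(100),
                region = factor(sample(c("a", "b"), 100, replace = TRUE)))

# Numeric response yields a regression forest.
y <- rnorm(100)

rs <- Rborist(x, y)
```

A factor-valued *y* of the same length would instead yield a classification forest.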

The return value (here *rs*) is of class *Rborist*. The full layout of an *Rborist* object is described by the **help()** files. A very useful example is the *validation* object, which summarizes testing on the out-of-bag samples.

In regression training, the *validation* object’s members include the mean absolute and mean squared errors, as well as the r-squared statistic. Continuing from the previous codelet, these are obtainable as follows:

```
rs$validation$mae
rs$validation$mse
rs$validation$rsq
```

These statistics are derived from the original training response (*y*, above) and the derived out-of-bag response. The out-of-bag response itself can also be obtained from the *validation* object:

`rs$validation$yPred`

*validation* also contains member *qPred*, for use with quantile regression. This member will be described in the next section.

In classification training, the *validation* object also presents the out-of-bag response in member *yPred*. Its other members, however, are specialized for classification:

- The misprediction rate is reported for each classification category by field *misprediction*.

- A confusion matrix is reported by field *confusion*.

- The out-of-bag error rate is given by *oobError*.

- The *census* field is a matrix giving, for each row, the number of times each response category is predicted for the row.

- The *prob* field reports a normalized version of *census* and can be interpreted as the probability of predicting a given category at a given row.
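Continuing the earlier pattern, these members are obtainable directly from the return value (here assuming a classification forest *rs*, with member names as listed above):

```
rs$validation$misprediction  # per-category misprediction rates
rs$validation$confusion      # confusion matrix
rs$validation$oobError       # out-of-bag error rate
rs$validation$census         # per-row category prediction counts
rs$validation$prob           # normalized census: per-row category probabilities
```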

In addition to *validation*, an *Rborist* object contains several other members. Most of these are used to coordinate subsequent operations, such as prediction or feature contribution, and do not directly convey summary information. An exception is the *training* member, which summarizes training statistics not directly related to validation. *training* currently has a single member, *info*:

`rs$training$info`

For each predictor the *info* vector reports the sum, over all tree nodes in the forest for which the predictor defines a splitting condition, of the information value precipitating the respective split. Although these values depend on the information metric employed, they do provide a relative measure of each predictor’s importance in the model.

Validation can be suppressed as follows:

`rs <- Rborist(x, y, noValidate = TRUE)`

This option is primarily for the use of package maintainers, but may prove useful, for example, in instances when model fit is not a main concern. The *validation* field will then have value NULL.

When predicting over a regression forest, Rborist provides quantiles simply for the asking. Leaf information, by default, is rich enough to make their computation quite easy. Quantiles can be requested in either of two ways. The simplest is to set option *quantiles* to *TRUE*:

`rs <- Rborist(x, y, quantiles = TRUE)`

This default quantile vector consists of quartiles, which are given by the *qPred* member mentioned above:

`rs$validation$qPred`

Quantile prediction also estimates the quantile position at which the predicted mean, *yPred*, lies. This estimate is reported by the *qEst* value, as follows:

`rs$validation$qEst`

Explicitly specifying the quantile vector, *quantVec*, yields quantiles of any desired type. Deciles, for example, can be requested as follows:

`rs <- Rborist(x, y, quantVec = seq(0.1, 1.0, by = 0.1))`

The algorithm employed to compute the quantiles is approximate, as binning is employed to improve performance.

If, on the other hand, when training a regression forest, quantiles are not desired and it is intended that quantiles will not subsequently be requested *after* training, space can be saved by representing leaves in a sparser form:

`rs <- Rborist(x, y, thinLeaves = TRUE)`

The simplest way to affect forest size is to specify the number of trees. The following codelet requests 100 trees:

`rs <- Rborist(x, y, nTree = 100)`

This also affects training time, which is expected to scale linearly with the number of trees.

Training of individual trees can be constrained according to several parameters. The *minNode* option places a lower bound on node sizes, i.e., the number of distinct samples subsumed by each node. A value of 1, for example, allows splitting to proceed until purity. A larger value results in a smaller tree, as well as faster training. To ensure that splitting does not proceed below a node size of 20, for example:

`rs <- Rborist(x, y, minNode = 20)`

Another way to control tree size is to specify its maximal depth. The option *nLevel* sets an upper bound on the number of levels over which trees can be trained. The following codelet causes tree construction to halt after the root is created, resulting in a forest of singleton trees.

`rs <- Rborist(x, y, nLevel = 1)`

As with several other size-based constraints, constraining level depth also results in faster execution.

Option *maxLeaf* limits the number of leaves (terminals) present in a tree. Unlike the other size constraints described so far, *maxLeaf* is enforced *after* training, so as not to bias tree shape. Although this pruning operation is quite fast, limiting the number of leaves therefore will not speed the training process.

`rs <- Rborist(x, y, maxLeaf = 13)`

Option *minInfo* sets a splitting threshold based on relative information gain. A splitting candidate is rejected if the ratio of information content between a node and its potential successors lies below the threshold.

`rs <- Rborist(x, y, minInfo = 0.1)`

This option should be applied with care and should probably be avoided at low predictor count.

Performance as well as storage economy can in some cases both be enhanced by abstracting away repeated instances of the same predictor value. This is the idea behind sparse representations and, in particular, one-hot encodings, in which repeated values of zero are represented implicitly. The Arborist employs a similar strategy, but on a *per-predictor* basis, representing high-frequency observations of a given predictor implicitly. A plurality threshold above which to compress repeated values is specified by the *autoCompress* option:

`rs <- Rborist(x, y, autoCompress = 0.4)`

As *autoCompress* specifies a plurality threshold, only a single set of repeated values undergoes compression for a given predictor. Resolution of ties, in particular, is implementation-dependent. A threshold frequency of 1.0, the maximum, ensures that no compression takes place, while a threshold frequency of 0.0, the minimum, ensures some value is compressed for each predictor, regardless of frequency.

As a complement to the *thinLeaves* option for training, the *Streamline* command can be applied following training to reduce a forest’s memory footprint. *Streamline* clears fields employed by validation, quantile regression and feature contribution, so should not be employed if any of these operations are expected to be performed subsequently.

```
rs <- Rborist(x, y)
...
rs <- Streamline(rs)
```

Several options affect the behavior of the various sampling mechanisms used by the package.

Option *nSamp* dictates the number of bagged samples defining the root of each tree. A smaller sample count may result in smaller trees, as fewer opportunities arise to split.

Option *withRepl* specifies whether bag sampling is to be done with replacement.
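These two options can be combined in a single invocation. The following sketch, with an illustrative sample count, draws 500 root samples per tree without replacement:

```
rs <- Rborist(x, y, nSamp = 500, withRepl = FALSE)
```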

Vector *rowWeight* weights the probability of bagging a given row. The following invocation gives each row identical weight, which is also the default behavior:

`rs <- Rborist(x, y, rowWeight = rep(1/length(y), length(y)))`

Vector *predWeight* weights the probability of selecting a given predictor as splitting candidate. For example, this invocation selects the first predictor half as often as each of the remaining predictors:

`rs <- Rborist(x, y, predWeight = c(0.5, rep(1.0, ncol(x) - 1)))`

Option *predProb* is the uniform probability of selecting a predictor for splitting. That is, the value applies uniformly to all predictors. Hence the number of predictors tried at a given split will have a binomial distribution.

Option *predFixed* is the actual number of predictors to test for a given split, and so calls for sampling predictors without replacement. *predFixed* and *predProb* cannot both be specified within a single training operation.
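The two candidate-selection schemes can be sketched as follows (the probability and count shown are illustrative; specify only one of the two per training call):

```
# Binomial scheme: each predictor is tried with probability 0.4.
rs <- Rborist(x, y, predProb = 0.4)

# Fixed scheme: exactly 3 predictors tried per split, sampled without replacement.
rs2 <- Rborist(x, y, predFixed = 3)
```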

For regression training, vector *regMono* specifies a (signed) rejection probability to enforce monotonicity with respect to a given predictor. Negative values specify rejection of nondecreasing splitting candidates with a given probability, while positive values stipulate a rejection probability for nonincreasing predictors. A value of zero indicates that no monotonicity constraint is enforced. Values assigned to factor predictors are ignored.

The following example rejects nonincreasing predictors as splitting candidates with probability one:

`rs <- Rborist(x, y, regMono = rep(1.0, ncol(x)))`

Classification training can be fine-tuned by weighting individual categories. The option *classWeight* permits weights to be specified, by category, for the objective function used to split. This may be useful, for example, when the training response is unbalanced.

The following example employs a non-normalized weighting vector to give the first category twice as much weight as the others. Note that the category encoding reflected by `levels()` is not portable:

```
lv <- levels(y)
rs <- Rborist(x, y, classWeight = c(2.0, rep(1.0, length(lv) - 1)))
```

By default, when a numerical predictor is chosen to split a node, the Arborist assigns its split value as that corresponding to the mean rank, with respect to the *full* set of observations on the predictor, between the two split boundary points. That is, the splitting criterion attempts to reflect the *ECDF* of the entire sample. This contrasts with other implementations, which typically select either the midpoint value or one of the two boundary values. In particular, depending upon how the observations are distributed, the midrank can correspond to a value arbitrarily close to either of the two boundaries.

The vector *splitQuant* allows a (sub)quantile value to be interpolated, so that the split value can be manipulated more finely with respect to the two endpoints. For example, the following codelet expresses the default behavior, which is to select the middle rank (i.e., 0.5 quantile) for all numerical predictors (if any):

`rs <- Rborist(x, y, splitQuant = rep(0.5, ncol(x)))`

Similarly, this example chooses the left endpoint for all relevant predictors:

`rs <- Rborist(x, y, splitQuant = rep(0.0, ncol(x)))`

The Arborist represents predictors internally using a format streamlined for subsequent training. A “preformatted” representation of the training data can be generated directly and trained separately as follows:

```
pf <- Preformat(x)
rs <- Rborist(pf, y)
```

Separate preformatting can result in a slight performance improvement in workflows with iterated training, such as under *Caret*. This is simply because the sorting and packing performed at the start of training can be cached instead of repeatedly computed.

A better motivation for preformatting arises when the training data can be represented sparsely. Suppose, for example, that *B* is a large data set consisting chiefly of predictors with highly repetitive observation values. As the Arborist is able to identify and compress repetitive observations, storage might be conserved by first preformatting *B*, then deleting it and finally training on the preformatted representation:

```
pf <- Preformat(B)
rm(B)
rs <- Rborist(pf, y)
```

Unlike many implementations, which employ sorting to order observations within a node, the Arborist employs a presort followed by linear updates. The updates are a form of stable partition, which we refer to as *restaging*. Restaging reduces the algorithmic complexity from the oft-cited \(\mathcal{O}(n \log^2 n)\), where *n* represents the number of training samples, to \(\mathcal{O}(n \log n)\).

Restaging is not without its problems, however. As currently implemented, a double buffer caches outcome data corresponding to every cell in the design. Hence the restaging containers can grow quite large. While some users may find this an acceptable price for scalability, others may find the storage requirements too high for their application. Scalability, moreover, may not be an important concern at low sample or row count.

For high-dimensional problems, the **Ranger** package may provide a suitable alternative. Similarly, for some problems, the package **Xgboost** offers excellent storage characteristics. In the meantime, we envision several improvements to the Arborist’s parsimony with storage, to appear in upcoming versions.