Random forests for linguistic analysis


In this tutorial, we’ll go over the basics of how to use random forests, which are an extension of the classification and regression tree (CART) methods discussed in the CART tutorial. In the next section I explain a bit about how random forests work, but if you want to just get to it, you can skip ahead past the next section.

Packages you will need for this tutorial:

install.packages(c(
  "here", # for project file management
  "tidyverse", # data wrangling
  "patchwork", # for arranging plots
  "ranger", "party", # for random forests
  "caret", "permimp", "vip", "flashlight", # for model evaluation
  "mlr", "tuneRanger" # for tuning
))
# to reproduce the markdown you will also need "rmdformats" and "fontawesome"

What are Random Forests?

Random forests are a powerful extension of single CART models, where we generate predictions based on many trees as opposed to a single tree (hence a “forest”). Random forests are therefore one of a family of techniques that use so-called ensemble methods to generate predictions. These methods rely on aggregating and summarising over lots of models and subsets of the data rather than relying on one single model to derive predictions. Basically each tree in the forest makes a prediction and then we average over all the trees in the forest. For classification problems, each tree predicts, or “votes for”, a specific outcome, and the forest chooses the outcome having the most “votes” over all the trees in the forest. In the case of regression, the forest takes the average of the predicted value over all the different trees.
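The aggregation step is simple enough to show by hand. Here is a toy illustration (not a real forest, just made-up predictions from five "trees") of how the two cases are combined:

```r
# Classification: each tree "votes" and the majority class wins
tree_votes <- c("of", "s", "of", "of", "s")
names(which.max(table(tree_votes)))   # "of" (3 votes to 2)

# Regression: the forest prediction is the mean over trees
tree_preds <- c(2.1, 1.8, 2.4, 2.0, 1.9)
mean(tree_preds)                      # 2.04
```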

To understand why we might want to use random forests, it’s helpful to recall the kinds of problems the method helps solve. One well-known problem with single tree models is that the tree structure is highly sensitive to the particulars of your dataset, because each split (after the initial one) depends on the previous ones. So slight changes to the dataset may have a big impact on the tree model. Trees are also prone to overfitting, which makes drawing inferences about the population harder and less trustworthy. Random forests avoid these problems by summarising over lots of different trees, grown on many different subsets of the data, rather than relying on one single model to derive predictions. They do this by using a couple of clever tricks.

Bagging

Random forests rely on a process known as Bootstrap Aggregating, or ‘bagging’ (wikipedia). Bagging is a technique used to reduce the variance of our predictions from individual trees by combining the result of many hundreds or thousands of trees trained on different random sub-sets of the data (the “in-bag” sample), and generating predictions based on data not included in the sample (the “out-of-bag” sample). In this way, bagging is designed to improve the reliability and accuracy of a model as well as reduce variance and potential overfitting. Strictly speaking, bootstrapping involves sampling with replacement, where the “in-bag” sample may contain copies of some observations. Sampling with replacement is not necessary for random forests, however, and as with many things the optimal approach will often depend on the data (Probst, Wright & Boulesteix 2019).
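To make the bagging idea concrete, here is a minimal base-R sketch of drawing one bootstrap "bag" from a toy dataset of 10 rows and identifying the out-of-bag rows (the row count is arbitrary):

```r
set.seed(1)
n <- 10
in_bag <- sample(n, size = n, replace = TRUE)  # bootstrap sample; duplicates allowed
oob <- setdiff(seq_len(n), in_bag)             # rows never drawn: the OOB sample
```

With sampling with replacement, on average about a third of the rows (roughly 1/e, or 36.8%) end up out of bag for any given tree.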

Random predictor sampling

A second aspect of random forests that makes them different from standard bagging methods is that when growing the individual trees, the method adds a second element of randomness in the selection of predictors that are included in each tree. When we have k input predictors, the random forest will randomly select a subset of m predictors at each node, in each tree. The best split on these m predictors is used to split the node.
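The predictor-sampling step can also be sketched in a couple of lines. Here the predictor names are hypothetical stand-ins, with k = 9 and m = 3:

```r
# Sketch: randomly choosing m candidate predictors at one node
preds <- c("animacy", "sibilant", "relation", "expression",
           "genre", "corpus", "pm_length", "pr_length", "thematicity")
m <- 3
sample(preds, m)  # only these 3 candidates are considered for this split
```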

To sum up, instead of using a single tree, Random Forests grow many many trees (a ‘forest’), each of which is grown on random sub-samples of the data using randomly selected subsets of predictors. This results in a healthy mix of very diverse trees, which compensates for sparse values in the data while at the same time giving fuller consideration to possible effects of all predictors. This is one of the major advantages of this method.

note: Because we’re using methods that rely on (pseudo-)randomness, our results may vary across runs. This is why it’s sometimes helpful to set the random seed prior to running our models so that we can replicate our results. Almost all computer programs use algorithms to simulate randomness, and R is no different. We can take advantage of this fact to make sure we get the same results every time we run a (pseudo-)random process. We do this by setting the initial state of the random number generator to a predefined value with set.seed(). Which value you choose doesn’t matter, but starting from that seed will always produce the same results.

set.seed(43214)

Random forests in R

Libraries

There are two packages I use for working with random forests: the {ranger} package and the {party} package, which is the predecessor of {partykit}. You may also come across the {randomForest} package, an older package that uses the same methods as {ranger}; however, it runs a bit slower and has been largely superseded by {ranger}. As usual, I’m also working with tidyverse functions for data wrangling and plotting.

Let’s load our libraries.

library(here) # for file paths
library(tidyverse)
library(patchwork) # for plotting
# for random forests
library(ranger)
library(party)

Datasets

We’ll use two of the datasets we looked at before in the CART tutorial. I’ll focus only on the classification cases here.

gens <- here("data", "brown_genitives.txt") %>% # change as needed.
  read_delim(
    delim = "\t",
    trim_ws = TRUE,
    col_types = cols()
  )

obj_rels <- here("data", "brown_object_relativizers.txt") %>%
  read_delim(
    delim = "\t",
    trim_ws = TRUE,
    col_types = cols()
  )

Setting hyperparameters

Before starting, we need to specify two important hyperparameters for our models. In simple terms, a parameter is a variable whose value is learned (estimated) from the data by the model, while a hyperparameter is a variable used to control how the model learns. (We already saw examples of hyperparameters in the CART tutorial section on tuning.)

In random forests, the two most commonly discussed hyperparameters are:

  • The number of input predictors randomly selected at each node of a given tree in the forest (\(m\)).
  • The total number of trees to grow in the forest.

As a rule of thumb, Breiman (2001) recommends setting \(m\) equal to \(\sqrt{k}\) for a model with \(k\) predictors, and this is good for a first pass. The number of trees is a bit harder to determine, but the general consensus seems to be that more trees (> 1000) is better (Oshiro, Perez & Baranauskas 2012; Probst, Wright & Boulesteix 2019). Kuhn & Silge (2022) suggest that searching for the optimal number of trees is unnecessary provided the initial number is large enough (in the thousands, see Chapter 12.2), and this post provides some empirical evidence to support Kuhn & Silge’s argument. Nonetheless, this is still an area of very active research.
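Breiman's rule of thumb is a one-liner. For the nine-predictor genitives model we fit below, it gives:

```r
# Rule-of-thumb starting value for m (mtry): sqrt(k), rounded down
k <- 9          # number of predictors
floor(sqrt(k))  # 3
```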

Anyway, the number of trees and number of predictors are the two hyperparameters that all random forest methods ask us to stipulate beforehand. There are other hyperparameters in addition to these, and we’ll see methods for determining empirically what the optimal settings should be (a process known as “tuning”) in the Tuning random forests section below.

Now let’s get modeling!

{ranger} package

First we’ll fit a forest with the {ranger} package (Wright & Ziegler 2017). This is a very popular package that uses by default the Gini impurity measure for splitting trees. This method is arguably the standard approach for random forests.

Classification with 2 outcomes

Let’s start with a model of genitive choice involving nine predictors. We’ll set the number of predictors (mtry) to 3 and the number of trees (num.trees) to 1000. We’ll convert the columns to factors like last time (this will be necessary for using {party}).

gen_fmla2 <- Type ~ Possessor.Animacy2 + Final.Sibilant +
  SemanticRelation + Possessor.Expression.Type +
  Genre + Corpus + Possessum.Length + Possessor.Length +
  PossessorThematicity

# convert first 7 variables to factors
gens <- gens %>%
  mutate(across(all.vars(gen_fmla2)[1:7], as.factor))

Now fit the forest.

system.time(
  gen_ranger1 <- ranger(gen_fmla2,
    data = gens,
    num.trees = 1000,
    mtry = 3
  )
)
   user  system elapsed 
   3.47    0.06    1.08 

A great thing about {ranger} is that it is very fast! Let’s check the output.

gen_ranger1
Ranger result

Call:
 ranger(gen_fmla2, data = gens, num.trees = 1000, mtry = 3) 

Type:                             Classification 
Number of trees:                  1000 
Sample size:                      5098 
Number of independent variables:  9 
Mtry:                             3 
Target node size:                 1 
Variable importance mode:         none 
Splitrule:                        gini 
OOB prediction error:             10.08 % 

The printout tells us a number of things. It gives us the hyperparameter settings (Target node size refers to the minimum number of observations allowed in a terminal node), along with the number of observations (Sample size), the variable importance measure (more on this below), and the splitting method.

It also gives an estimate of accuracy based on the out-of-bag (OOB) error rates. “Out-of-bag” refers to those observations that are left out of the randomly sampled “bag” of data used to train each tree. For each tree, we test its predictions on these OOB observations and derive an error rate. This gives us a measure of how accurate the tree is at predicting the outcome for data that it has never seen before. Using OOB accuracy is a way to safeguard against building an overfitting model, i.e. one that is too finely tuned to the specific dataset.

The average error over all the trees is the estimated prediction error for the forest as a whole. In other words, the predictive accuracy is 1 - the OOB error rate. So here we can say that the model correctly predicts genitives with an accuracy of 89.9%.
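The conversion from error rate to accuracy is just subtraction (in a fitted {ranger} object the OOB error is also stored in the prediction.error slot, so you don't have to read it off the printout):

```r
# accuracy = 1 - OOB error (value taken from the printout above)
oob_error <- 0.1008
round(1 - oob_error, 3)  # 0.899
```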

Classification with more than 2 outcomes

Random forests can also be used straightforwardly when you have more than 2 discrete classes as possible outcomes (this is not as simple with logistic regression models). Let’s return to the relative clause data and define a more complex formula and model.

rel_fmla2 <- relativizer ~ time + variety + genre + ant_given + ant_definite +
  ant_head_to_RC + noun_verb_ratio + ant_length + RC_length +
  passive_active_ratio + mean_sentence_length + personal_pron +
  split_infinitive + shall_will

obj_rels <- obj_rels %>%
  mutate(across(all.vars(rel_fmla2)[1:7], as.factor))

# fit model
rel_ranger <- ranger(
  rel_fmla2,
  data = obj_rels,
  mtry = 3,
  num.trees = 1000,
  classification = TRUE
)

rel_ranger
Ranger result

Call:
 ranger(rel_fmla2, data = obj_rels, mtry = 3, num.trees = 1000,      classification = TRUE) 

Type:                             Classification 
Number of trees:                  1000 
Sample size:                      3959 
Number of independent variables:  14 
Mtry:                             3 
Target node size:                 1 
Variable importance mode:         none 
Splitrule:                        gini 
OOB prediction error:             31.35 % 

Here the prediction error is much higher (the accuracy is lower), but this is partly because we have more classes to choose from. We’ll look at some better measures for evaluating models in the Evaluation section below.

Probability forests

NOTE: I’ve found that probability forests are generally more accurate than simple classification, so I recommend always using this method by default. There are no downsides to doing so, to my knowledge.

Normally {ranger} fits classification forests whose predictions are discrete class labels.

head(gen_ranger1$predictions)
[1] of of s  of s  s 
Levels: of s

But if we want to generate predicted probabilities of the classes we can do that by setting the probability = TRUE argument.

gen_ranger_prob <- ranger(
  gen_fmla2,
  data = gens,
  num.trees = 1000,
  mtry = 3,
  probability = TRUE
)
gen_ranger_prob
Ranger result

Call:
 ranger(gen_fmla2, data = gens, num.trees = 1000, mtry = 3, probability = TRUE) 

Type:                             Probability estimation 
Number of trees:                  1000 
Sample size:                      5098 
Number of independent variables:  9 
Mtry:                             3 
Target node size:                 10 
Variable importance mode:         none 
Splitrule:                        gini 
OOB prediction error (Brier s.):  0.07464502 

So here the main difference is the OOB prediction error which is not the accuracy, but the Brier score. Generally, the closer to 0 the better the model is.
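To see what the Brier score actually measures, here is a hand computation with toy numbers for the binary case: it is the mean squared difference between the predicted probability and the observed outcome coded as 0/1.

```r
obs <- c(1, 0, 1, 1)           # observed outcomes coded 0/1
p   <- c(0.9, 0.2, 0.6, 0.95)  # predicted probabilities of outcome 1
mean((p - obs)^2)              # 0.053125; closer to 0 is better
```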

We can create a probability forest for our relative clause model too.

# fit model
rel_ranger_prob <- ranger(
  rel_fmla2,
  data = obj_rels,
  mtry = 3,
  num.trees = 1000,
  classification = TRUE,
  probability = TRUE,
  importance = "permutation" # more on this below
)
rel_ranger_prob
Ranger result

Call:
 ranger(rel_fmla2, data = obj_rels, mtry = 3, num.trees = 1000,      classification = TRUE, probability = TRUE, importance = "permutation") 

Type:                             Probability estimation 
Number of trees:                  1000 
Sample size:                      3959 
Number of independent variables:  14 
Mtry:                             3 
Target node size:                 10 
Variable importance mode:         permutation 
Splitrule:                        gini 
OOB prediction error (Brier s.):  0.2515699 

Again, we’ll look at some more measures for model fit below, but first we’ll look at another common package.

{party} package

Next we’ll try random forests with the {party} package (Hothorn, Hornik & Zeileis 2006; Strobl et al. 2007; Strobl et al. 2008). For this package we use the cforest() function. Unlike other methods, this method uses conditional inference trees, which are argued to be less biased than standard CARTs. In my very limited experience, I find that this method is slightly more accurate than {ranger}, but I don’t know of any published comparisons of the two methods in detail.

First we should make sure to detach {partykit} and load {party}. This is important to avoid conflicts between the two packages which share many of the same functions. {party} relies on conditional inference trees similar to those in {partykit}.

# only necessary if you've already loaded `partykit`
detach("package:partykit", unload = TRUE)
library(party)

For {party}, we need to set the hyperparameters inside the cforest_control() function. Once we’ve set our hyperparameters, we can fit our tree with cforest(). We will start with a simpler model, because this method is much slower than {ranger}.

ctrl <- cforest_control(mtry = 3, ntree = 1000) # same settings as above

# notice the simpler formula
system.time(
  gen_party1 <- party::cforest(
    Type ~ Possessor.Animacy2 + Final.Sibilant +
      Genre + Corpus + Possessum.Length + Possessor.Length,
    data = gens,
    controls = ctrl
  )
)
   user  system elapsed 
  41.28    0.89   42.60 

So this runs quite a bit slower than ranger(), and keep in mind that our model is a bit simpler than the other one. It’s really when we start doing other things with these {party} forests that the difference becomes very clear.

The output of the cforest() function is annoyingly not very informative.

print(gen_party1)

     Random Forest using Conditional Inference Trees

Number of trees:  1000 

Response:  Type 
Inputs:  Possessor.Animacy2, Final.Sibilant, Genre, Corpus, Possessum.Length, Possessor.Length 
Number of observations:  5098 

In the next section(s) we’ll see how to get more information from the models.

note: Random forest objects, especially {party} forests, are very large objects (often well over 100 MB), so they can be slow to load and can bog down your system if you’re fitting lots of models in one work session.

Evaluation

We can obtain the model predictions for each observation directly with the predict() function, and measure the accuracy ourselves based on this. This applies to all packages.

For {ranger} forests this is simple, as the predictions are automatically included in the model object.

# predict(gen_ranger1, gens) # also works
gen_ranger1$predictions %>%
  head(10)
 [1] of of s  of s  s  s  of s  s 
Levels: of s

For probability forests we have a dataframe containing the predicted probabilities for each possible outcome.

gen_ranger_prob$predictions %>%
  head(10)
               of          s
 [1,] 0.909477495 0.09052251
 [2,] 0.913402030 0.08659797
 [3,] 0.446553632 0.55344637
 [4,] 0.912078896 0.08792110
 [5,] 0.020294622 0.97970538
 [6,] 0.452312165 0.54768784
 [7,] 0.012198095 0.98780190
 [8,] 0.534232888 0.46576711
 [9,] 0.187230959 0.81276904
[10,] 0.002043494 0.99795651

We can get predictions from {party} forests with the predict() function, though this takes some time. DO NOT RUN THIS IN THE LIVE SESSION!

Here I have specified that I want the predicted probabilities, which are automatically available from the {party} forest (i.e. we didn’t specify a probability forest).

system.time(
  gen_party1_preds <- predict(gen_party1, newdata = gens, type = "prob")
)
   user  system elapsed 
 533.11    0.78  557.42 

So that takes about ~9-10 minutes on my machine. I recommend running these in a separate job in RStudio.

Check the predictions.

head(gen_party1_preds)
$`1`
       Type.of    Type.s
[1,] 0.8266243 0.1733757

$`2`
       Type.of    Type.s
[1,] 0.9359857 0.0640143

$`3`
       Type.of    Type.s
[1,] 0.5129722 0.4870278

$`4`
       Type.of    Type.s
[1,] 0.8266243 0.1733757

$`5`
        Type.of    Type.s
[1,] 0.04999138 0.9500086

$`6`
       Type.of    Type.s
[1,] 0.5129722 0.4870278

Note that the output is a list, so we have to convert this to a dataframe.

gen_party1_preds <- do.call("rbind", gen_party1_preds) %>%
  as.data.frame()
gen_party1_preds

For evaluating models, I like the {caret} package, which has a number of functions for computing performance measures for all kinds of models. But first we’ll add columns to our prediction dataframes containing the predicted outcomes as well as the observed data.

# ranger forest
gen_ranger_prob_preds <- gen_ranger_prob$predictions %>%
  as.data.frame() %>%
  mutate(
    obs = gens$Type,
    pred = as.factor(if_else(of > .5, "of", "s"))
  )
gen_ranger_prob_preds
# party forest
gen_party1_preds <- gen_party1_preds %>%
  rename(of = "Type.of", s = "Type.s") %>% # change names
  mutate(
    obs = gens$Type,
    pred = as.factor(if_else(of > .5, "of", "s"))
  )
gen_party1_preds

Now I’ll use the twoClassSummary() function to get a quick assessment of the performance of the models. Here I’ll focus on the area under the ROC curve (ROC), which is a reasonable measure of how well the model discriminates between two outcomes. The scores range from .5 (chance) to 1 (perfect discrimination). Values above .8 are considered a good fit. This measure is also sometimes known as the index of concordance, C.

caret::twoClassSummary(gen_ranger_prob_preds,
  lev = levels(gen_ranger_prob_preds$obs)
) %>%
  round(3)
  ROC  Sens  Spec 
0.959 0.927 0.856 
caret::twoClassSummary(gen_party1_preds,
  lev = levels(gen_party1_preds$obs)
) %>%
  round(3)
  ROC  Sens  Spec 
0.956 0.916 0.850 

The {party} forest does a bit worse than the {ranger} forest, but that’s expected given that the latter includes more predictors.

For classification models it can also be helpful to view the confusion matrix, which is a cross-tabulation of the predicted outcome vs. the observed data. The diagonal of the matrix shows the observations predicted correctly. Again, the {caret} package has lots of useful evaluation metrics (see the documentation here). We can use the prediction dataframes we created above.

# ranger
caret::confusionMatrix(
  data = gen_ranger_prob_preds$pred,
  reference = gen_ranger_prob_preds$obs,
  mode = "everything"
)
Confusion Matrix and Statistics

          Reference
Prediction   of    s
        of 2878  287
        s   225 1708
                                         
               Accuracy : 0.8996         
                 95% CI : (0.891, 0.9077)
    No Information Rate : 0.6087         
    P-Value [Acc > NIR] : < 2.2e-16      
                                         
                  Kappa : 0.788          
                                         
 Mcnemar's Test P-Value : 0.007021       
                                         
            Sensitivity : 0.9275         
            Specificity : 0.8561         
         Pos Pred Value : 0.9093         
         Neg Pred Value : 0.8836         
              Precision : 0.9093         
                 Recall : 0.9275         
                     F1 : 0.9183         
             Prevalence : 0.6087         
         Detection Rate : 0.5645         
   Detection Prevalence : 0.6208         
      Balanced Accuracy : 0.8918         
                                         
       'Positive' Class : of             
                                         

This is a very good model.

# party
caret::confusionMatrix(
  data = gen_party1_preds$pred,
  reference = gen_party1_preds$obs,
  mode = "everything"
)
Confusion Matrix and Statistics

          Reference
Prediction   of    s
        of 2842  299
        s   261 1696
                                          
               Accuracy : 0.8902          
                 95% CI : (0.8812, 0.8986)
    No Information Rate : 0.6087          
    P-Value [Acc > NIR] : <2e-16          
                                          
                  Kappa : 0.7686          
                                          
 Mcnemar's Test P-Value : 0.1179          
                                          
            Sensitivity : 0.9159          
            Specificity : 0.8501          
         Pos Pred Value : 0.9048          
         Neg Pred Value : 0.8666          
              Precision : 0.9048          
                 Recall : 0.9159          
                     F1 : 0.9103          
             Prevalence : 0.6087          
         Detection Rate : 0.5575          
   Detection Prevalence : 0.6161          
      Balanced Accuracy : 0.8830          
                                          
       'Positive' Class : of              
                                          

Now what about our relative clause model? Evaluating these kinds of models is a bit trickier than for binary outcomes. We can get the predictions in the usual way, but then we have to do some clever coding to get the outcome that has the highest predicted probability.

rel_ranger_prob_preds <- rel_ranger_prob$predictions %>%
  as.data.frame()

# get column with the highest predicted probability
rel_ranger_prob_preds$pred <- names(rel_ranger_prob_preds)[apply(rel_ranger_prob_preds, 1, which.max)] %>%
  as.factor()

# add the observed
rel_ranger_prob_preds <- rel_ranger_prob_preds %>%
  mutate(obs = obj_rels$relativizer)
rel_ranger_prob_preds

We can again use the {caret} package to get some further measures. The multiClassSummary() function in {caret} gives us some info, such as the overall accuracy and Kappa statistics, and averages of “one versus all” statistics such as sensitivity, specificity, the AUC, etc.

caret::multiClassSummary(rel_ranger_prob_preds, lev = levels(rel_ranger_prob_preds$obs))
               logLoss                    AUC                  prAUC 
             0.7160254              0.7967564              0.5960713 
              Accuracy                  Kappa                Mean_F1 
             0.6956302              0.3415933              0.5358227 
      Mean_Sensitivity       Mean_Specificity    Mean_Pos_Pred_Value 
             0.5178335              0.7706591              0.6087291 
   Mean_Neg_Pred_Value         Mean_Precision            Mean_Recall 
             0.8210040              0.6087291              0.5178335 
   Mean_Detection_Rate Mean_Balanced_Accuracy 
             0.2318767              0.6442463 

The confusion matrix also provides more information.

caret::confusionMatrix(
  data = rel_ranger_prob_preds$pred, # the model predictions
  reference = rel_ranger_prob_preds$obs, # the observed data
  mode = "everything"
)
Confusion Matrix and Statistics

          Reference
Prediction THAT WHICH ZERO
     THAT   156    57   92
     WHICH   98   310  125
     ZERO   470   363 2288

Overall Statistics
                                         
               Accuracy : 0.6956         
                 95% CI : (0.681, 0.7099)
    No Information Rate : 0.6327         
    P-Value [Acc > NIR] : < 2.2e-16      
                                         
                  Kappa : 0.3416         
                                         
 Mcnemar's Test P-Value : < 2.2e-16      

Statistics by Class:

                     Class: THAT Class: WHICH Class: ZERO
Sensitivity              0.21547       0.4247      0.9134
Specificity              0.95394       0.9309      0.4271
Pos Pred Value           0.51148       0.5816      0.7331
Neg Pred Value           0.84455       0.8774      0.7411
Precision                0.51148       0.5816      0.7331
Recall                   0.21547       0.4247      0.9134
F1                       0.30321       0.4909      0.8134
Prevalence               0.18287       0.1844      0.6327
Detection Rate           0.03940       0.0783      0.5779
Detection Prevalence     0.07704       0.1346      0.7883
Balanced Accuracy        0.58471       0.6778      0.6702

This model is… not great. It still predicts relativizers at a rate significantly higher than chance, but it doesn’t do so equally for all possible outcomes. We can see that it is fairly good when it comes to predicting the ZERO case, but not so good at that or which. In fact, for that and which, it gets more wrong than it does right. The statistic to look at here is Cohen’s Kappa, which tells us how much better the model is compared to one that simply guesses at random according to the frequency of each outcome (this statistic is also used for inter-rater reliability). This measure is better for imbalanced data like we have here. A value of 0.34 indicates a fair, but still rather crappy model, so there is a lot of room for improvement.

Predictor importance

Unlike methods such as regression models, random forests don’t give you direct effect size measures comparable to regression coefficients. Interpreting random forest models is not easy because a single predictor can show up in many different (non-nested) trees with different co-varying predictors (and interactions).

However, random forests make up for this shortcoming in being a great tool for evaluating the relative predictive importance of individual predictors. Due to the random (sub-)sampling and bagging processes, they are very good at determining the independent contributions of individual predictors to the model, even when those predictors are highly correlated with one another. And since random forests rely on many, many trees, they can model non-linearities and complex ‘interactions’ quite naturally (though this should perhaps be treated with some caution, see Gries 2019).

To assess the contribution of each predictor, most random forest methods compute a predictor importance score (also called “variable importance” or “feature importance”). I’ll go over how to do this below.

{ranger}

The {ranger} package allows us to use several importance measures. It’s generally recommended to use permutation-based methods, as permutation methods have been shown to be less biased than other measures (Strobl, Malley & Tutz 2009; Altmann et al. 2010). Permutation variable importance works fairly straightforwardly:

To evaluate predictor importance, a tree is grown in the first step, and the prediction accuracy in the OOB observations is calculated. In the second step, any association between the variable of interest xi and the outcome is broken by permuting the values of all individuals for xi, and the prediction accuracy is computed again. The difference between the two accuracy values is the permutation importance for xi from a single tree. The average of all tree importance values in a random forest then gives the random forest permutation importance of this variable. The procedure is repeated for all [predictors] of interest. (Wright, Ziegler & König 2016:3)

In other words, we…

  1. Grow a tree predicting a response from a set of predictors.
  2. Take one predictor and randomly shuffle the values of that predictor. This breaks the association between the response and that predictor.
  3. Calculate the difference in model accuracy before and after shuffling the predictor.
  4. Repeat for all trees and all predictors in the forest.

The reasoning is as follows. If the predictor does not have a meaningful relationship with the response, shuffling its values should produce very little change in model accuracy. On the other hand, if the predictor was strongly associated with the response, permutation should create a large drop in accuracy. We do this for every predictor in the model, and rank them according to their importance.
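The core of the procedure can be sketched in a few lines. This is a simplified stand-in, not {ranger}'s internals: acc_fun is any hypothetical function that scores model accuracy on a dataset.

```r
# Sketch of one permutation-importance computation
permutation_importance <- function(acc_fun, data, predictor) {
  baseline <- acc_fun(data)                       # accuracy before shuffling
  data[[predictor]] <- sample(data[[predictor]])  # break predictor-response link
  baseline - acc_fun(data)                        # drop in accuracy = importance
}
```

A large positive drop marks an important predictor; a drop near zero (or slightly negative, by chance) marks an irrelevant one.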

Permutation based importance measures can be obtained for {ranger} models easily by specifying the importance argument in the initial function.

gen_ranger_perm <- ranger(
  gen_fmla2,
  data = gens,
  num.trees = 1000,
  mtry = 3,
  importance = "permutation"
)

The values can be obtained with the importance() wrapper function, or you can get them directly with gen_ranger_perm$variable.importance

importance(gen_ranger_perm) %>%
  sort()
                   Corpus          SemanticRelation            Final.Sibilant 
              0.005628278               0.007009083               0.007444319 
     PossessorThematicity                     Genre Possessor.Expression.Type 
              0.018289273               0.025269293               0.038955258 
         Possessor.Length        Possessor.Animacy2          Possessum.Length 
              0.046063122               0.062183864               0.136032892 

The {vip} package works nicely with {ranger} for quick plotting.

vip::vip(gen_ranger_perm)

A nice thing about {ranger} is that we can calculate significance tests for which predictors are important, following methods of Janitza, Celik & Boulesteix (2016) and Altmann et al. (2010). The first method is faster, but requires there to be a negative value (see below). By chance, the random reordering of values sometimes results in a configuration that affords a slightly better model. This may arise when the values of an irrelevant predictor are reshuffled. In other words, the random reshuffling created a new version of the predictor that is actually more strongly associated with the outcome than the original predictor. This has the effect of creating a negative importance score.

By mirroring the minimal score to the right of zero, we obtain an interval that characterizes irrelevant predictors. Variables can be considered informative and important if their variable importance value lies above (to the right of) this region (Strobl, Malley & Tutz 2009; Janitza, Celik & Boulesteix 2016). In some cases, this may include all predictors. This is the “janitza” method.

# this will throw an error if there are no negative importance values
system.time(
  gen_ranger_varimp.pvals <- importance_pvalues(
    gen_ranger_perm,
    formula = gen_fmla2,
    data = gens,
    method = "janitza"
  )
)

The second method, “altmann”, can take a while depending on how large the model is and how many permutations you use. DO NOT RUN THIS IN THE LIVE SESSION!

system.time(
  gen_ranger_varimp.pvals <- importance_pvalues(
    gen_ranger_perm,
    formula = gen_fmla2,
    data = gens,
    method = "altmann",
    num.permutations = 100
  )
)
   user  system elapsed 
 990.58    7.84  281.07 

gen_ranger_varimp.pvals %>%
  as.data.frame() %>%
  rownames_to_column("pred") %>%
  arrange(desc(importance))

Here it seems that all predictors but Corpus (Brown ~ Frown) contribute to the model significantly.

{party}

The {party} package uses permutation importance measures as well, and it offers two measures, the decrease in prediction accuracy (varimp()) and the decrease in the area under the ROC curve (varimpAUC()). The latter has been shown to be a more reliable measure (Janitza, Strobl & Boulesteix 2013), so we’ll use that.
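For reference, the direct {party} calls would look something like this (not run here, because they can take a very long time):

```r
# permutation importance computed directly with {party} (slow!)
gen_party1_varimp_acc <- party::varimp(gen_party1)    # decrease in accuracy
gen_party1_varimp_auc <- party::varimpAUC(gen_party1) # decrease in AUC
```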

The only drawback is that both of these are very time-consuming (like everything in {party}…). So we’ll use the {permimp} package, which was designed to be much faster, though still not as fast as {ranger}.

library(permimp)
# track the time it takes to run
system.time(
  gen_party1_varimp <- permimp(
    gen_party1,
    AUC = TRUE,
    progressBar = FALSE
  )
)
   user  system elapsed 
  74.84    0.34   76.87 

Once we have our variable importance object, we can plot the results.

gen_party1_varimp$values %>%
  as.data.frame() %>%
  rename(varimp = ".") %>%
  mutate(pred = rownames(.) %>% factor()) %>%
  ggplot(aes(fct_reorder(pred, varimp), varimp)) +
  geom_segment(aes(xend = pred, yend = 0)) +
  geom_point(size = 3) +
  labs(x = "", y = "permutation predictor importance\n(mean decrease in AUC)") +
  geom_hline(yintercept = 0, color = "red") +
  coord_flip()

Conditional predictor importance

There are often many predictors that are correlated with one another to varying degrees, and when this is the case, the individual importance scores may not be reliable. The {permimp} package offers a method for dealing with this by using conditional variable importance measures (Strobl et al. 2008). From help(permimp):

If conditional = TRUE, the importance of each variable is computed by permuting within a grid defined by the covariates that are associated (with 1 - p-value greater than threshold) to the variable of interest.

This can be calculated as follows; be warned, however, that it can take much longer (though it is not so bad here).

# track the time it takes to run
system.time(
  gen_party1_varimp_cond <- permimp(
    gen_party1,
    AUC = TRUE,
    progressBar = FALSE,
    conditional = TRUE
  )
)
   user  system elapsed 
 292.91    1.25  305.22 

We can compare the unconditional and conditional importances, and see in this case there is relatively little difference.

p1 <- gen_party1_varimp$values %>%
  as.data.frame() %>%
  rename(varimp = ".") %>%
  mutate(pred = rownames(.) %>% factor()) %>%
  ggplot(aes(fct_reorder(pred, varimp), varimp)) +
  geom_point() +
  labs(x = "", y = "permutation predictor importance\n(mean decrease in AUC)") +
  ggtitle("Unconditional importance") +
  geom_hline(yintercept = 0, color = "red") +
  coord_flip()
p2 <- gen_party1_varimp_cond$values %>%
  as.data.frame() %>%
  rename(varimp = ".") %>%
  mutate(pred = rownames(.) %>% factor()) %>%
  ggplot(aes(fct_reorder(pred, varimp), varimp)) +
  geom_point() +
  labs(x = "", y = "permutation predictor importance\n(mean decrease in AUC)") +
  ggtitle("Conditional importance") +
  geom_hline(yintercept = 0, color = "red") +
  coord_flip()

p1 + p2

As a final note, it’s important to remember that these variable importance scores should only be interpreted relative to each other as a ranking of the model’s predictors. The absolute values of the scores should not be interpreted or compared across studies (see Strobl et al. 2008:336).

Tuning random forests

As I mentioned before, the hyperparameters in a random forest model can have an impact on how it performs, so it is advised to use the optimal settings for your data. The process of finding the right hyperparameters for the data is known as tuning, and it works by simply fitting models with many different values for each hyperparameter and seeing which ones maximize model performance. Probst, Wright & Boulesteix (2019) recommend tuning mtry (the number of predictors considered at each split), the number of observations drawn for each tree, the minimum terminal node size, and whether the in-bag samples are drawn with replacement. We’ll focus on the first three here.

There are a few good methods for tuning hyperparameters in R, and we’ll use the {tuneRanger} package here. I also recommend the {tidymodels} package, which works with many different kinds of models, including {party} forests.

Load our libraries.

library(mlr)
library(tuneRanger)

First we make a classification task with the {mlr} package. You’ll get a warning, but don’t worry about that for now.

obj_rel_task <- makeClassifTask(
  data = as.data.frame(obj_rels[, all.vars(rel_fmla2)]),
  target = "relativizer"
)

Now we tune our hyperparameters. The tuneRanger() function will consider many different values for the forest by searching through reasonable ranges for each hyperparameter (you can find out more about these methods here). This can take time. DO NOT RUN THIS IN THE LIVE SESSION!

obj_rel_tune <- tuneRanger(
  obj_rel_task,
  num.trees = 1000,
  num.threads = 4,
  tune.parameters = c("mtry", "min.node.size", "sample.fraction"),
  iters = 100,
  show.info = FALSE
)

The recommended values are here:

obj_rel_tune$recommended.par

We can then refit the model with these values.

# fit model
rel_ranger_tuned <- ranger(
  rel_fmla2,
  data = obj_rels,
  mtry = obj_rel_tune$recommended.par$mtry,
  min.node.size = obj_rel_tune$recommended.par$min.node.size,
  sample.fraction = obj_rel_tune$recommended.par$sample.fraction,
  num.trees = 1000,
  classification = TRUE,
  probability = TRUE,
  importance = "permutation"
)
rel_ranger_tuned
Ranger result

Call:
 ranger(rel_fmla2, data = obj_rels, mtry = obj_rel_tune$recommended.par$mtry,      min.node.size = obj_rel_tune$recommended.par$min.node.size,      sample.fraction = obj_rel_tune$recommended.par$sample.fraction,      num.trees = 1000, classification = TRUE, probability = TRUE,      importance = "permutation") 

Type:                             Probability estimation 
Number of trees:                  1000 
Sample size:                      3959 
Number of independent variables:  14 
Mtry:                             3 
Target node size:                 3 
Variable importance mode:         permutation 
Splitrule:                        gini 
OOB prediction error (Brier s.):  0.2528225 

Now we can evaluate the model.

rel_ranger_tuned_preds <- rel_ranger_tuned$predictions %>%
  as.data.frame()

# get column with the highest predicted probability
rel_ranger_tuned_preds$pred <- names(rel_ranger_tuned_preds)[apply(rel_ranger_tuned_preds, 1, which.max)] %>%
  as.factor()

# add the observed
rel_ranger_tuned_preds <- rel_ranger_tuned_preds %>%
  mutate(obs = obj_rels$relativizer)

caret::multiClassSummary(rel_ranger_tuned_preds, lev = levels(rel_ranger_tuned_preds$obs))
               logLoss                    AUC                  prAUC 
             0.7152652              0.7988903              0.6010658 
              Accuracy                  Kappa                Mean_F1 
             0.6943673              0.3262892              0.5207066 
      Mean_Sensitivity       Mean_Specificity    Mean_Pos_Pred_Value 
             0.5047823              0.7647187              0.6097549 
   Mean_Neg_Pred_Value         Mean_Precision            Mean_Recall 
             0.8243307              0.6097549              0.5047823 
   Mean_Detection_Rate Mean_Balanced_Accuracy 
             0.2314558              0.6347505 

Compare that to the untuned model and we can see that the improvement is marginal. This level of improvement is actually typical according to the available literature (e.g. Probst, Wright & Boulesteix 2019).

caret::multiClassSummary(rel_ranger_prob_preds, lev = levels(rel_ranger_prob_preds$obs))
               logLoss                    AUC                  prAUC 
             0.7160254              0.7967564              0.5960713 
              Accuracy                  Kappa                Mean_F1 
             0.6956302              0.3415933              0.5358227 
      Mean_Sensitivity       Mean_Specificity    Mean_Pos_Pred_Value 
             0.5178335              0.7706591              0.6087291 
   Mean_Neg_Pred_Value         Mean_Precision            Mean_Recall 
             0.8210040              0.6087291              0.5178335 
   Mean_Detection_Rate Mean_Balanced_Accuracy 
             0.2318767              0.6442463 

For completeness we can compare the predictor importance scores too. There is some minor variation, but nothing too concerning.

p1 <- vip::vip(rel_ranger_prob) + ggtitle("No tuning")
p2 <- vip::vip(rel_ranger_tuned) + ggtitle("With tuning")
p1 + p2

Interaction effects

One problem with the predictor importance plots above is that they only tell us about the individual effects, and say nothing about the combined influence of 2 or more predictors, i.e. interaction effects. For investigating interaction effects, Friedman & Popescu (2008) suggest a method which measures the relative influence of interactions vs. main effects on the variation in a model’s prediction. In simplest terms, the approach compares the observed joint influence of two (or more) predictors in the model to their influence under the assumption that there are no interactions. The no-interaction assumption is that the joint influence of predictors A and B is simply equal to the sum of their individual influences (see Molnar 2021:sec. 5.4). Large differences between the observed influence and the ‘no-interaction’ partial dependence suggest that the two predictors’ effects are not independent. The resulting H statistic represents the share of the variance in a predictor’s effect that is explained by interactions. An H value of 0 for a predictor means there is no interaction (i.e. only a main effect), and an H value of 1 means that the predictor only has an effect via interactions with other predictors (i.e. there is no main effect).
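To make the logic concrete, here is a minimal sketch of the pairwise statistic (my own illustration, not the {flashlight} implementation), assuming we already have mean-centered partial dependence values for predictors j and k evaluated at the same observations:

```r
# pd_jk: joint partial dependence of predictors j and k (mean-centered)
# pd_j, pd_k: their individual partial dependences (mean-centered)
H2_pairwise <- function(pd_jk, pd_j, pd_k) {
  # share of variance in the joint effect not explained by the additive parts
  sum((pd_jk - pd_j - pd_k)^2) / sum(pd_jk^2)
}
```

If the joint effect is purely additive, the numerator is 0 and so is H; if the joint effect cannot be decomposed into the two individual effects at all, H approaches 1.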

To do this in R takes a bit of work, but it’s worth it in the end. We’ll use the {flashlight} package, which is one of several recent packages built for doing all kinds of neat things with predictive models (others include {DALEX}, {iml}, and {tidymodels}). The online documentation is here.

library(flashlight)

There are a couple of things we need to do first. We’ll start by refitting our model with a binary outcome rather than a factor (don’t ask me why, but most tools require this…). So we’ll just convert the Type column in our genitives data to 0s and 1s, then refit the model.

gens2 <- gens %>%
  mutate(Type = as.numeric(Type) - 1)

gen_ranger_prob2 <- ranger(
  gen_fmla2,
  data = gens2,
  num.trees = 1000,
  mtry = 3,
  probability = TRUE,
  classification = TRUE,
  importance = "permutation"
)

Now we create a flashlight container object with the ranger forest and the dataset.

gen_fl <- flashlight(
  model = gen_ranger_prob2,
  label = "gen_ranger",
  y = "Type",
  data = gens2[, all.vars(gen_fmla2)],
  metrics = list(AUC = MetricsWeighted::AUC), # this is used to evaluate the model
  predict_function = function(mod, X) predict(mod, X)$predictions[, 2]
)
light_performance(gen_fl)

I am an object of class light_performance 

data.frames:

 data 
# A tibble: 1 x 3
  label      metric value
  <chr>      <fct>  <dbl>
1 gen_ranger AUC    0.985

gen_interactions <- gen_fl %>%
  light_interaction(n_max = 200)

We can plot these and see which predictors are involved in the strongest interactions.
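{flashlight} objects come with ggplot-based plot methods, so (assuming the defaults) something like the following should show the overall interaction strength for each predictor:

```r
# overall interaction strength (Friedman's H) per predictor
plot(gen_interactions) +
  labs(x = "", y = "Friedman's H")
```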

It’s also possible to look at the pairwise effects. For example we can consider the interactions with Genre in the model.

gen_ints_pairs <- gen_fl %>%
  light_interaction(
    pairwise = TRUE, # compute pairwise H statistics
    n_max = 100
  )
gen_ints_pairs$data %>%
  filter(str_detect(variable, "Genre")) %>%
  ggplot(aes(reorder(variable, value), value)) +
  geom_col() +
  coord_flip() +
  labs(y = "Friedman's H", x = "") +
  ggtitle("Interactions with Genre")

So it seems that there are some interesting interactions with Genre. In particular, the influence of whether the possessor is a proper or common noun seems to vary across genres.

Partial dependence profiles

One more useful technique is to look at the partial dependence of the probability of the outcome on the values of a given predictor or predictors. The partial dependence is estimated by examining the model’s predictions as if all our observations had the same value for the predictor(s) in question, then averaging over the predictions for all observations. These are very easy to get with the light_profile() function.
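Under the hood the idea is simple. Here is a hypothetical sketch (the helper pd_at() is my own, not part of {flashlight}), assuming a {ranger} probability forest like the one above:

```r
# partial dependence at a single value v of predictor var: give every
# observation that value, predict, and average over the predictions
pd_at <- function(mod, data, var, v) {
  data[[var]] <- v
  preds <- predict(mod, data)$predictions # matrix of class probabilities
  mean(preds[, 2]) # average predicted probability of the second class
}
```

light_profile() essentially repeats this for a grid of values of the predictor (and, with by =, within each level of a grouping variable).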

So we saw that there is an interaction of Genre and possessor NP type, and we can inspect the effects of these predictors in the model like so.

gen_fl %>%
  light_profile(v = "Possessor.Expression.Type", by = "Genre") %>%
  plot() +
  labs(y = "predicted probability of the s-genitive")

From this it seems that the biggest difference is in the Press genre, where proper noun possessors appear to be especially likely to occur in the s-genitive compared to the other genres. This makes some sense if we think about the nature of news texts.

For more on these and other techniques, see the package vignette online. There are lots of other things that you can do with {flashlight} and similar packages.

What about mixed-effects random forests?

Adapting random forest methods to handle multi-level datasets, like when observations are clustered by speakers in an experiment or text files in a corpus, is currently a very active area of research. Unfortunately, however, the available methods are not easy to implement in R. There is a relatively easy-to-use implementation of mixed-effects random forests (MERFs) in Python available at https://pypi.org/project/merf/, but this only works for continuous response variables (Hajjem, Bellavance & Larocque 2014). More about this can be found here.

Other methods for categorical responses are in development, e.g. Speiser et al. (2018), but nothing has been implemented in R yet to my knowledge.


Summary

Advantages of random forests

  • Small n, big p: Random forests can handle small datasets with relatively large sets of predictors, a setting where many other methods struggle. They are also better suited for classification with highly unbalanced datasets, i.e. where one class occurs less than 10% of the time.
  • Very accurate: Random forests are typically much more accurate than single trees and even regression models.
  • Random forests are also generally less sensitive to problems of multicollinearity due to their bagging and random variable selection procedures; however, proper care should still be taken (e.g. using conditional variable importance measures).
  • Random forests deal well with nonlinear effects

Disadvantages of random forests

  • Cannot handle non-independence well (yet): They don’t have a way of incorporating clustered or hierarchical data structures, i.e. “random” effects for speaker, text, etc. “Mixed-effects” random forests are still in development.
  • Computationally demanding: RF methods can have potentially long computing time, especially for variable importance and partial dependence measures.
  • Hard to interpret: They do not provide traditional measures of effect size, direction, or significance, and exploring “interaction effects” is not always straightforward (Gries 2019; Wright, Ziegler & König 2016). I discuss how to deal with these issues in another tutorial.

Citation & Session Info

Grafmiller, Jason. 2022. Random forests for linguistic analysis. University of Birmingham. url: https://jasongrafmiller.netlify.app/tutorials/tutorial_random_forests.html (version 2022.06.03).

The following is my current setup on my machine.

sessionInfo()
R version 4.2.0 (2022-04-22 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 16299)

Matrix products: default

locale:
[1] LC_COLLATE=English_United Kingdom.1252 
[2] LC_CTYPE=English_United Kingdom.1252   
[3] LC_MONETARY=English_United Kingdom.1252
[4] LC_NUMERIC=C                           
[5] LC_TIME=English_United Kingdom.1252    

attached base packages:
 [1] parallel  stats4    grid      stats     graphics  grDevices utils    
 [8] datasets  methods   base     

other attached packages:
 [1] flashlight_0.8.0  tuneRanger_0.5    lhs_1.1.5         lubridate_1.8.0  
 [5] mlrMBO_1.1.5      smoof_1.6.0.2     checkmate_2.1.0   mlr_2.19.0       
 [9] ParamHelpers_1.14 party_1.3-10      strucchange_1.5-2 sandwich_3.0-1   
[13] zoo_1.8-10        modeltools_0.2-23 mvtnorm_1.1-3     ranger_0.13.1    
[17] patchwork_1.1.1   fontawesome_0.2.2 forcats_0.5.1     stringr_1.4.0    
[21] dplyr_1.0.9       purrr_0.3.4       readr_2.1.2       tidyr_1.2.0      
[25] tibble_3.1.7      ggplot2_3.3.6     tidyverse_1.3.1   knitr_1.39       
[29] here_1.0.1       

loaded via a namespace (and not attached):
  [1] readxl_1.4.0          backports_1.4.1       fastmatch_1.1-3      
  [4] plyr_1.8.7            lazyeval_0.2.2        splines_4.2.0        
  [7] listenv_0.8.0         TH.data_1.1-1         digest_0.6.29        
 [10] foreach_1.5.2         htmltools_0.5.2       fansi_1.0.3          
 [13] magrittr_2.0.3        MLmetrics_1.1.1       BBmisc_1.12          
 [16] ROCR_1.0-11           tzdb_0.3.0            recipes_0.2.0        
 [19] globals_0.15.0        modelr_0.1.8          gower_1.0.0          
 [22] matrixStats_0.62.0    R.utils_2.11.0        vroom_1.5.7          
 [25] hardhat_0.2.0         rmdformats_1.0.4      colorspace_2.0-3     
 [28] vip_0.3.2             rvest_1.0.2           haven_2.5.0          
 [31] xfun_0.31             DiceKriging_1.6.0     tcltk_4.2.0          
 [34] crayon_1.5.1          jsonlite_1.8.0        libcoin_1.0-9        
 [37] survival_3.3-1        iterators_1.0.14      glue_1.6.2           
 [40] gtable_0.3.0          ipred_0.9-12          R.cache_0.15.0       
 [43] future.apply_1.9.0    scales_1.2.0          DBI_1.1.2            
 [46] Rcpp_1.0.8.3          viridisLite_0.4.0     bit_4.0.4            
 [49] proxy_0.4-26          lava_1.6.10           prodlim_2019.11.13   
 [52] htmlwidgets_1.5.4     httr_1.4.3            RColorBrewer_1.1-3   
 [55] ellipsis_0.3.2        pkgconfig_2.0.3       R.methodsS3_1.8.1    
 [58] farver_2.1.0          nnet_7.3-17           sass_0.4.1           
 [61] dbplyr_2.1.1          RJSONIO_1.3-1.6       utf8_1.2.2           
 [64] caret_6.0-92          tidyselect_1.1.2      labeling_0.4.2       
 [67] rlang_1.0.2           reshape2_1.4.4        munsell_0.5.0        
 [70] cellranger_1.1.0      tools_4.2.0           cli_3.3.0            
 [73] generics_0.1.2        broom_0.8.0           evaluate_0.15        
 [76] fastmap_1.1.0         yaml_2.3.5            rematch2_2.1.2       
 [79] ModelMetrics_1.2.2.2  bit64_4.0.5           fs_1.5.2             
 [82] coin_1.4-2            future_1.26.1         nlme_3.1-157         
 [85] R.oo_1.24.0           xml2_1.3.3            compiler_4.2.0       
 [88] rstudioapi_0.13       plotly_4.10.0         e1071_1.7-9          
 [91] reprex_2.0.1          bslib_0.3.1           stringi_1.7.6        
 [94] highr_0.9             plot3D_1.4            lattice_0.20-45      
 [97] Matrix_1.4-1          styler_1.7.0          vctrs_0.4.1          
[100] pillar_1.7.0          lifecycle_1.0.1       jquerylib_0.1.4      
[103] cowplot_1.1.1         data.table_1.14.2     R6_2.5.1             
[106] bookdown_0.26         rpart.plot_3.1.1      gridExtra_2.3        
[109] parallelly_1.31.1     codetools_0.2-18      MASS_7.3-56          
[112] assertthat_0.2.1      rprojroot_2.0.3       withr_2.5.0          
[115] multcomp_1.4-19       hms_1.1.1             rpart_4.1.16         
[118] timeDate_3043.102     class_7.3-20          mco_1.15.6           
[121] misc3d_0.9-1          rmarkdown_2.14        parallelMap_1.5.1    
[124] MetricsWeighted_0.5.4 pROC_1.18.0          

References

Altmann, André, Laura Toloşi, Oliver Sander & Thomas Lengauer. 2010. Permutation importance: A corrected feature importance measure. Bioinformatics 26(10). 1340–1347. doi:cm7h6d.
Breiman, Leo. 2001. Random Forests. Machine Learning 41. 5–32.
Friedman, Jerome H. & Bogdan E. Popescu. 2008. Predictive learning via rule ensembles. The Annals of Applied Statistics 2(3). Institute of Mathematical Statistics. 916–954. doi:10.1214/07-AOAS148.
Gries, Stefan Th. 2019. On classification trees and random forests in corpus linguistics: Some words of caution and suggestions for improvement. Corpus Linguistics and Linguistic Theory 0(0). doi:10.1515/cllt-2018-0078.
Hajjem, Ahlem, François Bellavance & Denis Larocque. 2014. Mixed-effects random forest for clustered data. Journal of Statistical Computation and Simulation 84(6). 1313–1328. doi:10.1080/00949655.2012.741599.
Hothorn, Torsten, Kurt Hornik & Achim Zeileis. 2006. Unbiased recursive partitioning: A conditional inference framework. Journal of Computational and Graphical Statistics 15(3). 651–674. doi:10.1198/106186006X133933.
Janitza, Silke, Ender Celik & Anne-Laure Boulesteix. 2016. A computationally fast variable importance test for random forests for high-dimensional data. Advances in Data Analysis and Classification. doi:gdj8zn.
Janitza, Silke, Carolin Strobl & Anne-Laure Boulesteix. 2013. An AUC-based permutation variable importance measure for random forests. BMC Bioinformatics 14(1). 119. doi:f4s4cx.
Kuhn, Max & Julia Silge. 2022. Tidy Modeling with R. O’Reilly.
Molnar, Christoph. 2021. Interpretable Machine Learning. Leanpub.
Oshiro, Thais Mayumi, Pedro Santoro Perez & José Augusto Baranauskas. 2012. How many trees in a Random Forest? In David Hutchison, Takeo Kanade, Josef Kittler, Jon M. Kleinberg, Friedemann Mattern, John C. Mitchell, Moni Naor, et al. (eds.), Machine Learning and Data Mining in Pattern Recognition, vol. 7376, 154–168. Berlin, Heidelberg: Springer Berlin Heidelberg. doi:10.1007/978-3-642-31537-4_13.
Probst, Philipp, Marvin N. Wright & Anne-Laure Boulesteix. 2019. Hyperparameters and tuning strategies for random forest. WIREs Data Mining and Knowledge Discovery 9(3). e1301. doi:10.1002/widm.1301.
Speiser, Jaime Lynn, Bethany J. Wolf, Dongjun Chung, Constantine J. Karvellas, David G. Koch & Valerie L. Durkalski. 2018. BiMM tree: A decision tree method for modeling clustered and longitudinal binary outcomes. Communications in Statistics - Simulation and Computation 0(0). 1–20. doi:10.1080/03610918.2018.1490429.
Strobl, Carolin, Anne-Laure Boulesteix, Thomas Kneib, Thomas Augustin & Achim Zeileis. 2008. Conditional variable importance for random forests. BMC Bioinformatics 9(1). 307. doi:10.1186/1471-2105-9-307.
Strobl, Carolin, Anne-Laure Boulesteix, Achim Zeileis & Torsten Hothorn. 2007. Bias in random forest variable importance measures: Illustrations, sources and a solution. BMC Bioinformatics 8(1). 25. doi:10.1186/1471-2105-8-25.
Strobl, Carolin, James Malley & Gerhard Tutz. 2009. An introduction to recursive partitioning: Rationale, application, and characteristics of classification and regression trees, bagging, and random forests. Psychological Methods 14(4). 323–348. doi:10.1037/a0016973.
Wright, Marvin N. & Andreas Ziegler. 2017. ranger: A fast implementation of random forests for high dimensional data in C++ and R. Journal of Statistical Software 77(1). doi:10.18637/jss.v077.i01.
Wright, Marvin N., Andreas Ziegler & Inke R. König. 2016. Do little interactions get lost in dark random forests? BMC Bioinformatics 17(1). doi:b5t7.