Random forests for linguistic analysis
In this tutorial, we’ll go over the basics of how to use random forests, which are an extension of the classification and regression tree (CART) methods discussed in the previous tutorial. In the next section I explain a bit about how random forests work, but if you want to just get to it, you can skip ahead to the Random forests in R section.
Packages you will need for this tutorial:
install.packages(
"here", # for project file management
"tidyverse", # data wrangling
"patchwork", # for arranging plots
"ranger", "party", # for random forests
"caret", "permimp", "vip", "flashlight", # for model evaluation
"mlr", "tuneRanger" # for tuning
)
# to reproduce the markdown you will also need "rmdformats" and "fontawesome"
What are Random Forests?
Random forests are a powerful extension of single CART models, where we generate predictions based on many trees as opposed to a single tree (hence a “forest”). Random forests are therefore one of a family of techniques that use so-called ensemble methods to generate predictions. These methods rely on aggregating and summarising over lots of models and subsets of the data rather than relying on one single model to derive predictions. Basically each tree in the forest makes a prediction and then we average over all the trees in the forest. For classification problems, each tree predicts, or “votes for”, a specific outcome, and the forest chooses the outcome having the most “votes” over all the trees in the forest. In the case of regression, the forest takes the average of the predicted value over all the different trees.
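In miniature, the two aggregation rules look like this (toy predictions, not real model output):

```r
# Toy sketch of how a forest aggregates its trees' predictions
tree_votes <- c("of", "s", "of") # classification: three trees "vote"
names(which.max(table(tree_votes))) # majority vote wins: "of"

tree_preds <- c(4.2, 3.8, 4.6) # regression: three trees predict a value
mean(tree_preds) # the forest averages them: 4.2
```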
To understand why we might want to use random forests, it’s helpful to recall the kinds of problems the method helps solve. One well-known problem with single tree models is that the tree structure is highly sensitive to the particulars of your dataset, because each split (after the initial one) depends on previous ones. So slight changes to the dataset may have a big impact on the tree model. Trees are also prone to overfitting, which makes drawing inferences about the population harder and less trustworthy. Random forests avoid these problems by summarising over lots of different trees, and many different subsets of the data, rather than relying on one single model to derive predictions. They do this by using a couple of clever tricks.
Bagging
Random forests rely on a process known as Bootstrap Aggregating, or ‘bagging’ (wikipedia). Bagging is a technique used to reduce the variance of our predictions from individual trees by combining the results of many hundreds or thousands of trees trained on different random subsets of the data (the “in-bag” sample), and generating predictions based on data not included in the sample (the “out-of-bag” sample). In this way, bagging is designed to improve the reliability and accuracy of a model as well as reduce variance and potential overfitting. Strictly speaking, bootstrapping involves sampling with replacement, where the “in-bag” sample may contain copies of some observations. Sampling with replacement is not necessary for random forests, however, and as with many things the optimal approach will often depend on the data (Probst, Wright & Boulesteix 2019).
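One bootstrap draw can be sketched in a few lines of base R (a toy 10-row dataset, with row indices standing in for observations):

```r
set.seed(1) # for reproducibility
n <- 10
in_bag <- sample(n, size = n, replace = TRUE) # bootstrap: some rows repeat
out_bag <- setdiff(seq_len(n), in_bag) # rows never drawn: the OOB sample
```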
Random predictor sampling
A second aspect of random forests that makes them different from standard bagging methods is that when growing the individual trees, the method adds a second element of randomness in the selection of predictors that are included in each tree. When we have k input predictors, the random forest will randomly select a subset of m predictors at each node, in each tree. The best split on these m predictors is used to split the node.
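The predictor-sampling step amounts to something like this (hypothetical predictor names, with \(k = 5\) and \(m = 3\)):

```r
# Toy sketch: at each node, only m of the k predictors compete for the split
k_preds <- c("Animacy", "Sibilant", "Genre", "Corpus", "Length") # k = 5
m <- 3
set.seed(2)
candidates <- sample(k_preds, m) # only these are considered at this node
candidates
```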
To sum up, instead of using a single tree, Random Forests grow many many trees (a ‘forest’), each of which is grown on random sub-samples of the data using randomly selected subsets of predictors. This results in a healthy mix of very diverse trees, which compensates for sparse values in the data while at the same time giving fuller consideration to possible effects of all predictors. This is one of the major advantages of this method.
note: Because we’re using methods that rely on (pseudo-)randomness, our results may vary across runs. This is why it’s sometimes helpful to set the random seed prior to running our models so that we can replicate our results again. Almost all computer programs use algorithms to simulate randomness, and R is no different. We can take advantage of this fact to make sure we get the same results every time we run a (pseudo-)random process. We do this by setting the initial state of the random number generator to a predefined value with
set.seed(). Which value you choose doesn’t matter, but starting at that seed will always produce the same results.
set.seed(43214)
Random forests in R
Libraries
There are two packages I use for working with random forests: the
{ranger} package and the {party} package,
which is a progenitor of {partykit}. You may also come
across the {randomForest} package, which is an older
package that uses the same methods as {ranger}, however it
runs a bit slower and has been largely superseded by
{ranger} now. As usual, I’m also working with tidyverse functions for data
wrangling and plotting.
Let’s load our libraries.
library(here) # for locating the data files
library(tidyverse)
library(patchwork) # for plotting
# for random forests
library(ranger)
library(party)
Datasets
We’ll use two of the datasets we looked at before in the CART tutorial. I’ll focus only on the classification cases here.
gens <- here("data", "brown_genitives.txt") %>% # change as needed.
read_delim(
delim = "\t",
trim_ws = TRUE,
col_types = cols()
)
obj_rels <- here("data", "brown_object_relativizers.txt") %>%
read_delim(
delim = "\t",
trim_ws = TRUE,
col_types = cols()
)
Setting hyperparameters
Before starting, we need to specify two important hyperparameters for our models. In simple terms, a parameter is a variable whose value is learned (estimated) from the data by the model, while a hyperparameter is a variable used to control how the model learns. (We already saw examples of hyperparameters in the CART tutorial section on tuning.)
In random forests, the two most commonly discussed hyperparameters are:
- The number of input predictors randomly selected at each node of a given tree in the forest (\(m\)).
- The total number of trees to grow in the forest.
As a rule of thumb, Breiman (2001) recommends setting \(m\) equal to \(\sqrt{k}\) for a model with \(k\) predictors, and this is good for a first pass. The number of trees is a bit harder to determine, but the general consensus seems to be that more trees (> 1000) is better (Oshiro, Perez & Baranauskas 2012; Probst, Wright & Boulesteix 2019). Kuhn & Silge (2022) suggest that searching for the optimal number of trees is unnecessary provided the initial number is large enough (in the thousands, see Chapter 12.2), and this post provides some empirical evidence to support Kuhn & Silge’s argument. Nonetheless, this is still an area of very active research.
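For the nine-predictor genitive model we fit below, the rule of thumb gives:

```r
k <- 9 # number of predictors in the genitive model
floor(sqrt(k)) # Breiman's rule of thumb for m (mtry): 3
```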
Anyway, the number of trees and the number of predictors are the two hyperparameters that all random forest methods ask us to stipulate beforehand. There are other hyperparameters in addition to these, and we’ll see methods for determining empirically what the optimal settings should be (a process known as “tuning”) in the Tuning random forests section below.
Now let’s get modeling!
{ranger} package
First we’ll fit a forest with the {ranger} package (Wright & Ziegler 2017). This is a
very popular package that uses by default the Gini impurity measure for
splitting trees. This method is arguably the standard approach for
random forests.
Classification with 2 outcomes
Let’s start with a model of genitive choice involving nine
predictors. We’ll set the number of predictors (mtry) to 3
and the number of trees (num.tree) to 1000. We’ll convert
the columns like last time (this will be necessary for using
{party}).
gen_fmla2 <- Type ~ Possessor.Animacy2 + Final.Sibilant +
SemanticRelation + Possessor.Expression.Type +
Genre + Corpus + Possessum.Length + Possessor.Length +
PossessorThematicity
# convert first 7 variables to factors
gens <- gens %>%
mutate(across(all.vars(gen_fmla2)[1:7], as.factor))
Now fit the forest.
system.time(
gen_ranger1 <- ranger(gen_fmla2,
data = gens,
num.trees = 1000,
mtry = 3
)
)
   user  system elapsed
3.47 0.06 1.08
A great thing about {ranger} is that it is very fast!
Let’s check the output.
gen_ranger1
Ranger result
Call:
ranger(gen_fmla2, data = gens, num.trees = 1000, mtry = 3)
Type: Classification
Number of trees: 1000
Sample size: 5098
Number of independent variables: 9
Mtry: 3
Target node size: 1
Variable importance mode: none
Splitrule: gini
OOB prediction error: 10.08 %
The printout tells us a number of things. It gives us the
hyperparameter settings (Target node size refers to the
minimum number of observations allowed in a terminal node), along with
the number of observations (Sample size), the variable
importance measure (more on this below), and the splitting method.
It also gives an estimate of accuracy based on the out-of-bag (OOB) error rates. “Out-of-bag” refers to those observations that are left out of the randomly sampled “bag” of data used to train each tree. For each tree, we test its predictions on these OOB observations and derive an error rate. This gives us a measure of how accurate the tree is at predicting the outcome for data that it has never seen before. Using OOB accuracy is a way to safeguard against building an overfitting model, i.e. one that is too finely tuned to the specific dataset.
The average error over all the trees is the estimated prediction error for the forest as a whole. In other words, the predictive accuracy is 1 - the OOB error rate. So here we can say that the model correctly predicts genitives with an accuracy of 89.9%.
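To make the arithmetic explicit:

```r
oob_error <- 0.1008 # OOB prediction error from the printout above
1 - oob_error # predictive accuracy: 0.8992, i.e. ~89.9%
```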
Classification with more than 2 outcomes
Random forests can also be used straightforwardly when you have more than 2 discrete classes as possible outcomes (this is not as simple with logistic regression models). Let’s return to the relative clause data and define a more complex formula and model.
rel_fmla2 <- relativizer ~ time + variety + genre + ant_given + ant_definite +
ant_head_to_RC + noun_verb_ratio + ant_length + RC_length +
passive_active_ratio + mean_sentence_length + personal_pron +
split_infinitive + shall_will
obj_rels <- obj_rels %>%
mutate(across(all.vars(rel_fmla2)[1:7], as.factor))
# fit model
rel_ranger <- ranger(
rel_fmla2,
data = obj_rels,
mtry = 3,
num.trees = 1000,
classification = TRUE
)
rel_ranger
Ranger result
Call:
ranger(rel_fmla2, data = obj_rels, mtry = 3, num.trees = 1000, classification = TRUE)
Type: Classification
Number of trees: 1000
Sample size: 3959
Number of independent variables: 14
Mtry: 3
Target node size: 1
Variable importance mode: none
Splitrule: gini
OOB prediction error: 31.35 %
Here the prediction error is much higher (the accuracy is lower), but this is partly because we have more classes to choose from. We’ll look at some better measures for evaluating models in the Evaluation section below.
Probability forests
NOTE: I’ve found that probability forests are generally more accurate than simple classification, so I recommend always using this method by default. There are no downsides to doing so, to my knowledge.
Normally {ranger} fits classification forests whose
predictions are discrete class labels.
head(gen_ranger1$predictions)
[1] of of s  of s  s
Levels: of s
But if we want to generate predicted probabilities of the
classes we can do that by setting the probability = TRUE
argument.
gen_ranger_prob <- ranger(
gen_fmla2,
data = gens,
num.trees = 1000,
mtry = 3,
probability = TRUE
)
gen_ranger_prob
Ranger result
Call:
ranger(gen_fmla2, data = gens, num.trees = 1000, mtry = 3, probability = TRUE)
Type: Probability estimation
Number of trees: 1000
Sample size: 5098
Number of independent variables: 9
Mtry: 3
Target node size: 10
Variable importance mode: none
Splitrule: gini
OOB prediction error (Brier s.): 0.07464502
So here the main difference is the OOB prediction error, which is now the Brier score rather than a misclassification rate. Generally, the closer to 0, the better the model.
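To see what the Brier score measures, here is the calculation by hand for three made-up observations (predicted probabilities vs. outcomes coded 1/0):

```r
p_s <- c(0.9, 0.2, 0.6) # predicted P(s-genitive) for three observations
obs_s <- c(1, 0, 1) # observed outcome coded 1 = s, 0 = of
mean((p_s - obs_s)^2) # Brier score: mean squared error of the probabilities
```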
We can create a probability forest for our relative clause model too.
# fit model
rel_ranger_prob <- ranger(
rel_fmla2,
data = obj_rels,
mtry = 3,
num.trees = 1000,
classification = TRUE,
probability = TRUE,
importance = "permutation" # more on this below
)
rel_ranger_prob
Ranger result
Call:
ranger(rel_fmla2, data = obj_rels, mtry = 3, num.trees = 1000, classification = TRUE, probability = TRUE, importance = "permutation")
Type: Probability estimation
Number of trees: 1000
Sample size: 3959
Number of independent variables: 14
Mtry: 3
Target node size: 10
Variable importance mode: permutation
Splitrule: gini
OOB prediction error (Brier s.): 0.2515699
Again, we’ll look at some more measures for model fit below, but first we’ll look at another common package.
{party} package
Next we’ll try random forests with the {party} package
(Hothorn, Hornik & Zeileis 2006; Strobl et al.
2007; Strobl et al. 2008). For this package we
use the cforest() function. Unlike other methods, this
method uses conditional inference trees, which are argued to be less
biased than standard CARTs. In my very limited experience, I find that
this method is slightly more accurate than {ranger}, but I
don’t know of any published comparisons of the two methods in
detail.
First we should make sure to detach {partykit} and load
{party}. This is important to avoid conflicts between the
two packages which share many of the same functions.
{party} relies on conditional inference trees similar to
those in {partykit}.
# only necessary if you've already loaded `partykit`
detach("package:partykit", unload = TRUE)
library(party)
For {party}, we need to set the hyperparameters inside
the cforest_control() function. Once we’ve set our
hyperparameters, we can fit our tree with cforest(). We
will start with a simpler model, because this method is much
slower than {ranger}.
ctrl <- cforest_control(mtry = 3, ntree = 1000) # same settings as above
# notice the simpler formula
system.time(
gen_party1 <- party::cforest(
Type ~ Possessor.Animacy2 + Final.Sibilant +
Genre + Corpus + Possessum.Length + Possessor.Length,
data = gens,
controls = ctrl
)
)
   user  system elapsed
41.28 0.89 42.60
So this runs a little slower than ranger(), but not too
much so, though keep in mind that our model is a bit simpler than the
other one. It’s really when we start doing other things with these
{party} forests that the difference becomes very clear.
The output of the cforest() function is annoyingly not
very informative.
print(gen_party1)
Random Forest using Conditional Inference Trees
Number of trees: 1000
Response: Type
Inputs: Possessor.Animacy2, Final.Sibilant, Genre, Corpus, Possessum.Length, Possessor.Length
Number of observations: 5098
In the next section(s) we’ll see how to get more information from the models.
note: Random forest objects, especially
{party} forests, are very large (often > 100 MB), so they can be slow to load and can bog down your system if you’re fitting lots of models in one work session.
Evaluation
We can obtain the model predictions for each observation directly
with the predict() function, and measure the accuracy
ourselves based on this. This applies to all packages.
For {ranger} forests this is simple, as the predictions
are automatically included in the model object.
# predict(gen_ranger1, gens) # also works
gen_ranger1$predictions %>%
head(10)
 [1] of of s  of s  s  s  of s  s
Levels: of s
For probability forests we have a dataframe containing the predicted probabilities for each possible outcome.
gen_ranger_prob$predictions %>%
head(10)
              of          s
[1,] 0.909477495 0.09052251
[2,] 0.913402030 0.08659797
[3,] 0.446553632 0.55344637
[4,] 0.912078896 0.08792110
[5,] 0.020294622 0.97970538
[6,] 0.452312165 0.54768784
[7,] 0.012198095 0.98780190
[8,] 0.534232888 0.46576711
[9,] 0.187230959 0.81276904
[10,] 0.002043494 0.99795651
We can get predictions from {party} forests with the
predict() function, though this takes some time.
DO NOT RUN THIS IN THE LIVE SESSION!
Here I have specified that I want the predicted probabilities, which
are automatically available from the {party} forest
(i.e. we didn’t specify a probability forest).
system.time(
gen_party1_preds <- predict(gen_party1, newdata = gens, type = "prob")
)
   user  system elapsed
533.11 0.78 557.42
So that takes about 9-10 minutes on my machine. I recommend running these in a separate job in RStudio.
Check the predictions.
head(gen_party1_preds)
$`1`
Type.of Type.s
[1,] 0.8266243 0.1733757
$`2`
Type.of Type.s
[1,] 0.9359857 0.0640143
$`3`
Type.of Type.s
[1,] 0.5129722 0.4870278
$`4`
Type.of Type.s
[1,] 0.8266243 0.1733757
$`5`
Type.of Type.s
[1,] 0.04999138 0.9500086
$`6`
Type.of Type.s
[1,] 0.5129722 0.4870278
Note that the output is a list, so we have to convert this to a dataframe.
gen_party1_preds <- do.call("rbind", gen_party1_preds) %>%
as.data.frame()
gen_party1_preds
For evaluating models, I like the {caret} package,
which has a number of functions for computing performance measures for
all kinds of models. But first we’ll add columns to our prediction
dataframes containing the predicted outcomes as well as the observed
data.
# ranger forest
gen_ranger_prob_preds <- gen_ranger_prob$predictions %>%
as.data.frame() %>%
mutate(
obs = gens$Type,
pred = as.factor(if_else(of > .5, "of", "s"))
)
gen_ranger_prob_preds
# party forest
gen_party1_preds <- gen_party1_preds %>%
rename(of = "Type.of", s = "Type.s") %>% # change names
mutate(
obs = gens$Type,
pred = as.factor(if_else(of > .5, "of", "s"))
)
gen_party1_preds
Now I’ll use the twoClassSummary() function to get a
quick assessment of the performance of the models. Here I’ll focus on
the area under the ROC curve (ROC), which is a reasonable
measure of how well the model discriminates between two outcomes. The
scores range from .5 (chance) to 1 (perfect discrimination). Values
above .8 are generally considered good. This measure is sometimes also
known as the index of concordance C.
caret::twoClassSummary(gen_ranger_prob_preds,
lev = levels(gen_ranger_prob_preds$obs)
) %>%
round(3)
  ROC  Sens  Spec
0.959 0.927 0.856
caret::twoClassSummary(gen_party1_preds,
lev = levels(gen_party1_preds$obs)
) %>%
round(3)
  ROC  Sens  Spec
0.956 0.916 0.850
The {party} forest does a bit worse than the
{ranger} forest, but that’s expected given that the latter
includes more predictors.
For classification models it can also be helpful to view the
confusion matrix, which is a cross-tabulation of the
predicted outcome vs. the observed data. The diagonal of the matrix
shows the observations predicted correctly. Again, the
{caret} package has lots of useful evaluation metrics (see
the documentation here).
We can use the prediction dataframes we created above.
# ranger
caret::confusionMatrix(
data = gen_ranger_prob_preds$pred,
reference = gen_ranger_prob_preds$obs,
mode = "everything"
)
Confusion Matrix and Statistics
Reference
Prediction of s
of 2878 287
s 225 1708
Accuracy : 0.8996
95% CI : (0.891, 0.9077)
No Information Rate : 0.6087
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.788
Mcnemar's Test P-Value : 0.007021
Sensitivity : 0.9275
Specificity : 0.8561
Pos Pred Value : 0.9093
Neg Pred Value : 0.8836
Precision : 0.9093
Recall : 0.9275
F1 : 0.9183
Prevalence : 0.6087
Detection Rate : 0.5645
Detection Prevalence : 0.6208
Balanced Accuracy : 0.8918
'Positive' Class : of
This is a very good model.
# party
caret::confusionMatrix(
data = gen_party1_preds$pred,
reference = gen_party1_preds$obs,
mode = "everything"
)
Confusion Matrix and Statistics
Reference
Prediction of s
of 2842 299
s 261 1696
Accuracy : 0.8902
95% CI : (0.8812, 0.8986)
No Information Rate : 0.6087
P-Value [Acc > NIR] : <2e-16
Kappa : 0.7686
Mcnemar's Test P-Value : 0.1179
Sensitivity : 0.9159
Specificity : 0.8501
Pos Pred Value : 0.9048
Neg Pred Value : 0.8666
Precision : 0.9048
Recall : 0.9159
F1 : 0.9103
Prevalence : 0.6087
Detection Rate : 0.5575
Detection Prevalence : 0.6161
Balanced Accuracy : 0.8830
'Positive' Class : of
Now what about our relative clause model? Evaluating these kinds of models is a bit trickier than for binary outcomes. We can get the predictions in the usual way, but then we have to do some clever coding to get the outcome that has the highest predicted probability.
rel_ranger_prob_preds <- rel_ranger_prob$predictions %>%
as.data.frame()
# get column with the highest predicted probability
rel_ranger_prob_preds$pred <- names(rel_ranger_prob_preds)[apply(rel_ranger_prob_preds, 1, which.max)] %>%
as.factor()
# add the observed
rel_ranger_prob_preds <- rel_ranger_prob_preds %>%
mutate(obs = obj_rels$relativizer)
rel_ranger_prob_preds
We can again use the {caret} package to get some further
measures. The multiClassSummary() function in
{caret} gives us some info, such as the overall accuracy
and Kappa statistics, and averages of “one versus all” statistics such
as sensitivity, specificity, the AUC, etc.
caret::multiClassSummary(rel_ranger_prob_preds, lev = levels(rel_ranger_prob_preds$obs))
  logLoss       AUC     prAUC
0.7160254 0.7967564 0.5960713
Accuracy Kappa Mean_F1
0.6956302 0.3415933 0.5358227
Mean_Sensitivity Mean_Specificity Mean_Pos_Pred_Value
0.5178335 0.7706591 0.6087291
Mean_Neg_Pred_Value Mean_Precision Mean_Recall
0.8210040 0.6087291 0.5178335
Mean_Detection_Rate Mean_Balanced_Accuracy
0.2318767 0.6442463
The confusion matrix also provides more information.
caret::confusionMatrix(
data = rel_ranger_prob_preds$pred, # the model predictions
reference = rel_ranger_prob_preds$obs, # the observed data
mode = "everything"
)
Confusion Matrix and Statistics
Reference
Prediction THAT WHICH ZERO
THAT 156 57 92
WHICH 98 310 125
ZERO 470 363 2288
Overall Statistics
Accuracy : 0.6956
95% CI : (0.681, 0.7099)
No Information Rate : 0.6327
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.3416
Mcnemar's Test P-Value : < 2.2e-16
Statistics by Class:
Class: THAT Class: WHICH Class: ZERO
Sensitivity 0.21547 0.4247 0.9134
Specificity 0.95394 0.9309 0.4271
Pos Pred Value 0.51148 0.5816 0.7331
Neg Pred Value 0.84455 0.8774 0.7411
Precision 0.51148 0.5816 0.7331
Recall 0.21547 0.4247 0.9134
F1 0.30321 0.4909 0.8134
Prevalence 0.18287 0.1844 0.6327
Detection Rate 0.03940 0.0783 0.5779
Detection Prevalence 0.07704 0.1346 0.7883
Balanced Accuracy 0.58471 0.6778 0.6702
This model is… not great. It still predicts relativizers at a rate
significantly higher than chance, but it doesn’t do so equally for all
possible outcomes. We can see that it is fairly good when it comes to
predicting the ZERO case, but not so good at that or
which. In fact, for that and which, it gets
more wrong than it does right. The statistic to look at here is Cohen’s
Kappa, which tells us how much better the model is compared
to one that simply guesses at random according to the frequency of each
outcome (this statistic is also used for inter-rater reliability). This
measure is better for imbalanced data like we have here. A value of 0.34
indicates a fair, but still rather crappy model, so there is a lot of
room for improvement.
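If you want to see where the Kappa value comes from, it can be computed by hand from the confusion matrix above:

```r
# Kappa = (observed accuracy - chance accuracy) / (1 - chance accuracy)
cm <- matrix(
  c(156, 98, 470, 57, 310, 363, 92, 125, 2288), # counts from the matrix above
  nrow = 3,
  dimnames = list(pred = c("THAT", "WHICH", "ZERO"),
                  obs = c("THAT", "WHICH", "ZERO"))
)
p_o <- sum(diag(cm)) / sum(cm) # observed accuracy (0.6956)
p_e <- sum(rowSums(cm) * colSums(cm)) / sum(cm)^2 # accuracy expected by chance
round((p_o - p_e) / (1 - p_e), 4) # Cohen's Kappa: 0.3416
```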
Predictor importance
Unlike methods such as regression models, random forests don’t give you direct effect size measures comparable to regression coefficients. Interpreting random forest models is not easy because a single predictor can show up in many different (non-nested) trees with different co-varying predictors (and interactions).
However, random forests make up for this shortcoming in being a great tool for evaluating the relative predictive importance of individual predictors. Due to the random (sub-)sampling and bagging processes, they are very good at determining the independent contributions of individual predictors to the model, even when those predictors are highly correlated with one another. And since random forests rely on many, many trees, they can model non-linearities and complex ‘interactions’ quite naturally (though this should perhaps be treated with some caution, see Gries 2019).
To assess the contribution of each predictor, most random forest methods compute a predictor importance score (also called “variable importance” or “feature importance”). I’ll go over how to do this below.
{ranger}
The {ranger} package allows us to use several importance
measures. It’s generally recommended to use permutation-based methods,
as permutation methods have been shown to be less biased than other
measures (Strobl, Malley & Tutz 2009; Altmann et al.
2010). Permutation variable importance works fairly
straightforwardly:
To evaluate predictor importance, a tree is grown in the first step, and the prediction accuracy in the OOB observations is calculated. In the second step, any association between the variable of interest xi and the outcome is broken by permuting the values of all individuals for xi, and the prediction accuracy is computed again. The difference between the two accuracy values is the permutation importance for xi from a single tree. The average of all tree importance values in a random forest then gives the random forest permutation importance of this variable. The procedure is repeated for all [predictors] of interest. (Wright, Ziegler & König 2016:3)
In other words, we…
- Grow a tree predicting a response from a set of predictors.
- Take one predictor and randomly shuffle its values. This breaks the association between the response and that predictor.
- Calculate the difference in model accuracy before and after shuffling the predictor.
- Repeat for all trees and all predictors in the forest.
The reasoning is as follows. If the predictor does not have a meaningful relationship with the response, shuffling its values should produce very little change in model accuracy. On the other hand, if the predictor was strongly associated with the response, permutation should create a large drop in accuracy. We do this for every predictor in the model, and rank them according to their importance.
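Here is the shuffling logic in miniature, using a trivial hand-coded “model” on made-up data rather than a real forest:

```r
set.seed(123)
x <- rep(c("a", "b"), each = 50) # made-up predictor
y <- ifelse(x == "a", "of", "s") # outcome fully determined by x
acc <- function(x, y) mean(ifelse(x == "a", "of", "s") == y) # toy "model"
before <- acc(x, y) # perfect accuracy: 1
after <- acc(sample(x), y) # accuracy collapses once x is shuffled
before - after # the drop is x's permutation importance
```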
Permutation based importance measures can be obtained for
{ranger} models easily by specifying the
importance argument in the initial function.
gen_ranger_perm <- ranger(
gen_fmla2,
data = gens,
num.trees = 1000,
mtry = 3,
importance = "permutation"
)
The values can be obtained with the importance() wrapper
function, or you can get them directly with
gen_ranger_perm$variable.importance
importance(gen_ranger_perm) %>%
sort()
              Corpus     SemanticRelation       Final.Sibilant
0.005628278 0.007009083 0.007444319
PossessorThematicity Genre Possessor.Expression.Type
0.018289273 0.025269293 0.038955258
Possessor.Length Possessor.Animacy2 Possessum.Length
0.046063122 0.062183864 0.136032892
The {vip} package works nicely with
{ranger} for quick plotting.
vip::vip(gen_ranger_perm)
A nice thing about {ranger} is that we can calculate
significance tests for which predictors are important, following methods
of Janitza, Celik & Boulesteix (2016) and Altmann
et al. (2010). The first method is faster, but
requires there to be a negative value (see below). By chance, the random
reordering of values sometimes results in a configuration that affords a
slightly better model. This may arise when the values of an
irrelevant predictor are reshuffled. In other words, the random
reshuffling created a new version of the predictor that is actually
more strongly associated with the outcome than the original
predictor. This has the effect of creating a negative importance
score.
By mirroring the minimal score to the right of zero, we obtain an interval that characterizes irrelevant predictors. Variables can be considered informative and important if their variable importance value lies above (to the right of) this region (Strobl, Malley & Tutz 2009; Janitza, Celik & Boulesteix 2016). In some cases, this may include all predictors. This is the “janitza” method.
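As a toy illustration of the mirroring idea (made-up importance scores, not from our model):

```r
imp <- c(A = 0.12, B = 0.04, C = -0.01, D = 0.002) # hypothetical scores
null_bound <- -min(imp) # mirror the most negative score around zero
names(imp[imp > null_bound]) # predictors clearly above the "irrelevant" band
```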
# this will give an error
system.time(
gen_ranger_varimp.pvals <- importance_pvalues(
gen_ranger_perm,
formula = gen_fmla2,
data = gens,
method = "janitza"
)
)
The second method, “altmann”, can take a while depending on how large the model is and how many permutations you use. DO NOT RUN THIS IN THE LIVE SESSION!
system.time(
gen_ranger_varimp.pvals <- importance_pvalues(
gen_ranger_perm,
formula = gen_fmla2,
data = gens,
method = "altmann",
num.permutations = 100
)
)
   user  system elapsed
990.58 7.84 281.07
gen_ranger_varimp.pvals %>%
as.data.frame() %>%
rownames_to_column("pred") %>%
arrange(desc(importance))
Here it seems that all predictors but Corpus (Brown ~
Frown) contribute to the model significantly.
{party}
The {party} package uses permutation importance measures
as well, and it offers two measures, the decrease in prediction accuracy
(varimp()) and the decrease in the area under the ROC curve
(varimpAUC()). The latter has been shown to be a more
reliable measure (Janitza, Strobl
& Boulesteix 2013), so we’ll use that.
The only drawback is that these are both also very time consuming
(like everything in {party}…). So we’ll use the
{permimp} package which was designed to be much faster,
though still not as fast as {ranger}.
library(permimp)
# track the time it takes to run
system.time(
gen_party1_varimp <- permimp(
gen_party1,
AUC = TRUE,
progressBar = FALSE
)
)
   user  system elapsed
74.84 0.34 76.87
Once we have our variable importance object, we can plot the results.
gen_party1_varimp$values %>%
as.data.frame() %>%
rename(varimp = ".") %>%
mutate(pred = rownames(.) %>% factor()) %>%
ggplot(aes(fct_reorder(pred, varimp), varimp)) +
geom_segment(aes(xend = pred, yend = 0)) +
geom_point(size = 3) +
labs(x = "", y = "permutation predictor importance\n(mean decrease in AUC)") +
geom_hline(yintercept = 0, color = "red") +
coord_flip()
Conditional predictor importance
There are often many predictors that are correlated with one another
to varying degrees, and when this is the case, the individual importance
scores may not be reliable. The {permimp} package offers a
method for dealing with this by using conditional variable
importance measures (Strobl et al.
2008). From help(permimp):
If
conditional = TRUE, the importance of each variable is computed by permuting within a grid defined by the covariates that are associated (with 1 - p-value greater than threshold) to the variable of interest.
This can be calculated as follows, however, be warned this can take much longer (though not so bad here).
# track the time it takes to run
system.time(
gen_party1_varimp_cond <- permimp(
gen_party1,
AUC = TRUE,
progressBar = FALSE,
conditional = TRUE
)
)
   user  system elapsed
292.91 1.25 305.22
We can compare the unconditional and conditional importances, and see in this case there is relatively little difference.
p1 <- gen_party1_varimp$values %>%
as.data.frame() %>%
rename(varimp = ".") %>%
mutate(pred = rownames(.) %>% factor()) %>%
ggplot(aes(fct_reorder(pred, varimp), varimp)) +
geom_point() +
labs(x = "", y = "permutation predictor importance\n(mean decrease in AUC)") +
ggtitle("Unconditional importance") +
geom_hline(yintercept = 0, color = "red") +
coord_flip()
p2 <- gen_party1_varimp_cond$values %>%
as.data.frame() %>%
rename(varimp = ".") %>%
mutate(pred = rownames(.) %>% factor()) %>%
ggplot(aes(fct_reorder(pred, varimp), varimp)) +
geom_point() +
labs(x = "", y = "permutation predictor importance\n(mean decrease in AUC)") +
ggtitle("Conditional importance") +
geom_hline(yintercept = 0, color = "red") +
coord_flip()
p1 + p2
As a final note, it’s important to remember that these variable importance scores should only be interpreted relative to each other as a ranking of the model’s predictors. The absolute values of the scores should not be interpreted or compared across studies (see Strobl et al. 2008:336).
Tuning random forests
As I mentioned before, the hyperparameters in a random forest model
can have an impact on how it performs, and so it is advised to use the
optimal settings for your data. The process of finding the right
hyperparameters for the data is known as tuning, and it
works by simply fitting models with many different values for each
hyperparameter and seeing which ones maximize the model performance.
Probst, Wright & Boulesteix (2019) recommend tuning mtry (the number of predictors considered at each split), the number of observations drawn for each tree, the minimum terminal node size, and whether the in-bag samples are drawn with replacement. We’ll focus on the first three here.
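To see what tuning is doing under the hood, here is a minimal hand-rolled sketch: a simple grid search over mtry using ranger’s built-in out-of-bag (OOB) prediction error as the performance criterion. The formula and data (rel_fmla2 and obj_rels) are the ones used in the models above; in practice {tuneRanger} searches the parameter space far more efficiently than this exhaustive loop.

```r
library(ranger)
# candidate values for mtry (number of predictors tried at each split)
mtry_grid <- 2:6
oob_error <- sapply(mtry_grid, function(m) {
  fit <- ranger(rel_fmla2, data = obj_rels,
                num.trees = 500, mtry = m, seed = 42)
  fit$prediction.error # OOB error, computed for free during fitting
})
# the best candidate is the one with the lowest OOB error
mtry_grid[which.min(oob_error)]
```

The same idea extends to the other hyperparameters by looping over combinations of values, which is exactly why proper tuning can get computationally expensive.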
There are a few good methods for tuning hyperparameters in R, and
we’ll use the {tuneRanger} package here. I also recommend
the {tidymodels} package, which works with many different kinds of models, including {party} forests.
Load our libraries.
library(mlr)
library(tuneRanger)
First we make a classification task with the {mlr}
package. You’ll get a warning, but don’t worry about that for now.
obj_rel_task <- makeClassifTask(
data = as.data.frame(obj_rels[, all.vars(rel_fmla2)]),
target = "relativizer"
)
Now we tune our hyperparameters. The tuneRanger() function will try many different values for the forest by searching through reasonable values for each hyperparameter (you can find out more about these methods here).
This can take time. DO NOT RUN THIS IN THE LIVE
SESSION
obj_rel_tune <- tuneRanger(
obj_rel_task,
num.trees = 1000,
num.threads = 4,
tune.parameters = c("mtry", "min.node.size", "sample.fraction"),
iters = 100,
show.info = FALSE
)
The recommended values are here:
obj_rel_tune$recommended.par
We can then refit the model with these values.
# fit model
rel_ranger_tuned <- ranger(
rel_fmla2,
data = obj_rels,
mtry = obj_rel_tune$recommended.par$mtry,
min.node.size = obj_rel_tune$recommended.par$min.node.size,
sample.fraction = obj_rel_tune$recommended.par$sample.fraction,
num.trees = 1000,
classification = TRUE,
probability = TRUE,
importance = "permutation"
)
rel_ranger_tuned
Ranger result
Call:
ranger(rel_fmla2, data = obj_rels, mtry = obj_rel_tune$recommended.par$mtry, min.node.size = obj_rel_tune$recommended.par$min.node.size, sample.fraction = obj_rel_tune$recommended.par$sample.fraction, num.trees = 1000, classification = TRUE, probability = TRUE, importance = "permutation")
Type: Probability estimation
Number of trees: 1000
Sample size: 3959
Number of independent variables: 14
Mtry: 3
Target node size: 3
Variable importance mode: permutation
Splitrule: gini
OOB prediction error (Brier s.): 0.2528225
Now we can evaluate the model.
rel_ranger_tuned_preds <- rel_ranger_tuned$predictions %>%
as.data.frame()
# get column with the highest predicted probability
rel_ranger_tuned_preds$pred <- names(rel_ranger_tuned_preds)[apply(rel_ranger_tuned_preds, 1, which.max)] %>%
as.factor()
# add the observed
rel_ranger_tuned_preds <- rel_ranger_tuned_preds %>%
mutate(obs = obj_rels$relativizer)
caret::multiClassSummary(rel_ranger_tuned_preds, lev = levels(rel_ranger_tuned_preds$obs))
logLoss AUC prAUC
0.7152652 0.7988903 0.6010658
Accuracy Kappa Mean_F1
0.6943673 0.3262892 0.5207066
Mean_Sensitivity Mean_Specificity Mean_Pos_Pred_Value
0.5047823 0.7647187 0.6097549
Mean_Neg_Pred_Value Mean_Precision Mean_Recall
0.8243307 0.6097549 0.5047823
Mean_Detection_Rate Mean_Balanced_Accuracy
0.2314558 0.6347505
Compare that to the untuned model and we can see that the improvement is marginal. This level of improvement is typical according to the available literature (e.g. Probst, Wright & Boulesteix 2019).
caret::multiClassSummary(rel_ranger_prob_preds, lev = levels(rel_ranger_prob_preds$obs))
logLoss AUC prAUC
0.7160254 0.7967564 0.5960713
Accuracy Kappa Mean_F1
0.6956302 0.3415933 0.5358227
Mean_Sensitivity Mean_Specificity Mean_Pos_Pred_Value
0.5178335 0.7706591 0.6087291
Mean_Neg_Pred_Value Mean_Precision Mean_Recall
0.8210040 0.6087291 0.5178335
Mean_Detection_Rate Mean_Balanced_Accuracy
0.2318767 0.6442463
For completeness we can compare the predictor importance scores too. There is some minor variation, but nothing too concerning.
p1 <- vip::vip(rel_ranger_prob) + ggtitle("No tuning")
p2 <- vip::vip(rel_ranger_tuned) + ggtitle("With tuning")
p1 + p2
Interaction effects
One limitation of the predictor importance plots above is that they only tell us about the individual effects of predictors, and say nothing about the combined influence of 2 or more predictors, i.e. interaction effects. For investigating interaction effects, Friedman & Popescu (2008) suggest a method which measures the relative influence of interactions vs. main effects on the variation in a model’s predictions. In simplest terms, the approach compares the observed joint influence of two (or more) predictors in the model to their influence under the assumption that there are no interactions. The assumption here is that when there are no interactions, the joint influence of predictors A and B is simply equal to the sum of their individual influences (see Molnar 2021:sec. 5.4). Large differences between the observed influence and the ‘no-interaction’ partial dependence suggest that the two predictors’ effects are not independent. The resulting H statistic represents the proportion of the variance in the joint partial dependence that is explained by the interaction. An H value of 0 for a predictor means there is no interaction (i.e. only a main effect), and an H value of 1 means that the predictor has an effect only via interactions with other predictors (i.e. there is no main effect).
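For reference, Friedman & Popescu’s pairwise statistic for two predictors $x_j$ and $x_k$ can be written as follows, where $PD$ denotes a partial dependence function (all of them mean-centred over the $n$ observations; see Molnar 2021:sec. 5.4 for details):

$$
H^2_{jk} = \frac{\sum_{i=1}^{n}\left[PD_{jk}\big(x_j^{(i)}, x_k^{(i)}\big) - PD_j\big(x_j^{(i)}\big) - PD_k\big(x_k^{(i)}\big)\right]^2}{\sum_{i=1}^{n} PD_{jk}\big(x_j^{(i)}, x_k^{(i)}\big)^2}
$$

The numerator captures the part of the joint partial dependence that is not accounted for by the two individual effects, and the denominator normalizes by the total variation in the joint partial dependence, which is what gives H its 0–1 interpretation.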
To do this in R takes a bit of work, but it’s worth it in the end.
We’ll use the {flashlight} package, which is one of several
recent packages built for doing all kinds of neat things with predictive
models (others include {DALEX}, {iml}, and
{tidymodels}). The online documentation is here
library(flashlight)
There are a couple things we need to do first. We’ll start by
refitting our model with a binary outcome rather than a factor (don’t
ask me why but most tools require this…). So we’ll just convert the
Type column in our genitives data to 0s and 1s, then refit
the model
gens2 <- gens %>%
mutate(Type = as.numeric(Type) - 1)
gen_ranger_prob2 <- ranger(
gen_fmla2,
data = gens2,
num.trees = 1000,
mtry = 3,
probability = TRUE,
classification = TRUE,
importance = "permutation"
)
Now we create a flashlight container object with the
ranger forest and the dataset.
gen_fl <- flashlight(
model = gen_ranger_prob2,
label = "gen_ranger",
y = "Type",
data = gens2[, all.vars(gen_fmla2)],
metrics = list(AUC = MetricsWeighted::AUC), # this is used to evaluate the model
predict_function = function(mod, X) predict(mod, X)$predictions[, 2]
)
light_performance(gen_fl)
I am an object of class light_performance
data.frames:
data
# A tibble: 1 x 3
label metric value
<chr> <fct> <dbl>
1 gen_ranger AUC 0.985
gen_interactions <- gen_fl %>%
light_interaction(n_max = 200)
We can plot these and see which predictors show the strongest interaction effects.
It’s also possible to look at the pairwise effects. For example we
can consider the interactions with Genre in the model.
gen_ints_pairs <- gen_fl %>%
light_interaction(
pairwise = TRUE, # compute pairwise interaction statistics
n_max = 100
)
gen_ints_pairs$data %>%
filter(str_detect(variable, "Genre")) %>%
ggplot(aes(reorder(variable, value), value)) +
geom_col() +
coord_flip() +
labs(y = "Friedman's H", x = "") +
ggtitle("Interactions with Genre")
So it seems that there are some interesting interactions with Genre. In particular, the influence of whether the possessor is a proper or common noun seems to vary across genres.
Partial dependence profiles
One more useful technique is to look at the partial
dependence of the probability of the outcome on the values of a
given predictor or predictors. The partial dependence is estimated by
examining the model predictions if all our observations had the same
value for the predictor(s) in question, then averaging over the
predictions for all observations. These are very easy to get with the
light_profile() function.
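To make the idea concrete, here is a hand-rolled sketch of the same computation for a single predictor, reusing the gen_ranger_prob2 forest and gens2 data from above (this assumes Possessor.Expression.Type is stored as a factor in gens2; light_profile() handles all of this for us):

```r
# Partial dependence by hand: for each level of the predictor, set every
# observation to that level, predict, and average over the predictions.
pd_levels <- levels(gens2$Possessor.Expression.Type)
pd <- sapply(pd_levels, function(lev) {
  tmp <- gens2
  tmp$Possessor.Expression.Type <- factor(lev, levels = pd_levels)
  mean(predict(gen_ranger_prob2, data = tmp)$predictions[, 2])
})
pd # average predicted probability at each level of the predictor
```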
So we saw that there is an interaction of Genre and possessor NP type, and we can inspect the effects of these predictors in the model like so.
gen_fl %>%
light_profile(v = "Possessor.Expression.Type", by = "Genre") %>%
plot() +
labs(y = "predicted probability of the s-genitive")
From this it seems that the biggest difference is in the Press genre, where proper noun possessors appear to be especially likely to occur in the s-genitive compared to the other genres. This makes some sense if we think about the nature of news texts.
For more on these and other techniques, see the package vignette
online. There are lots of other things that you can do
with {flashlight} and similar packages.
What about mixed-effects random forests?
Adapting random forest methods to handle multi-level datasets, e.g. when observations are clustered by speakers in an experiment or text files in a corpus, is currently a very active area of research. Unfortunately, the methods available are not yet easy to implement in R. There is a relatively easy-to-use implementation of mixed-effects random forests (MERFs) in Python, available at https://pypi.org/project/merf/, but this only works for continuous response variables (Hajjem, Bellavance & Larocque 2014). More about this can be found here.
Other methods for categorical responses are in development, e.g. Speiser et al. (2018), but nothing has been implemented in R yet to my knowledge.
Summary
Advantages of random forests
- Small n, big p: Random forests can handle small datasets with relatively large sets of predictors. This is a problem for other methods. They are also better to use for classification with highly unbalanced datasets, i.e. where one class occurs less than 10% of the time.
- Very accurate: Random forests are typically much more accurate than single trees and even regression models.
- Robust to multicollinearity: Random forests are generally less sensitive to problems of multicollinearity, due to their bagging and random variable selection procedures; however, proper care should still be taken (e.g. using conditional variable importance measures).
- Nonlinearity: Random forests deal well with nonlinear effects.
Disadvantages of random forests
- Cannot handle non-independence well (yet): They don’t have a way of incorporating clustered or hierarchical data structures, i.e. “random” effects for speaker, text, etc. “Mixed-effects” random forests are still in development.
- Computationally demanding: RF methods can have potentially long computing time, especially for variable importance and partial dependence measures.
- Hard to interpret: They do not provide traditional measures of effect size, direction, or significance, and exploring “interaction effects” is not always straightforward (Gries 2019; Wright, Ziegler & König 2016). I discuss how to deal with these issues in another tutorial.
Citation & Session Info
Grafmiller, Jason. 2022. Random forests for linguistic analysis. University of Birmingham. url: https://jasongrafmiller.netlify.app/tutorials/tutorial_random_forests.html (version 2022.06.03).
The following is my current setup on my machine.
sessionInfo()
R version 4.2.0 (2022-04-22 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 16299)
Matrix products: default
locale:
[1] LC_COLLATE=English_United Kingdom.1252
[2] LC_CTYPE=English_United Kingdom.1252
[3] LC_MONETARY=English_United Kingdom.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United Kingdom.1252
attached base packages:
[1] parallel stats4 grid stats graphics grDevices utils
[8] datasets methods base
other attached packages:
[1] flashlight_0.8.0 tuneRanger_0.5 lhs_1.1.5 lubridate_1.8.0
[5] mlrMBO_1.1.5 smoof_1.6.0.2 checkmate_2.1.0 mlr_2.19.0
[9] ParamHelpers_1.14 party_1.3-10 strucchange_1.5-2 sandwich_3.0-1
[13] zoo_1.8-10 modeltools_0.2-23 mvtnorm_1.1-3 ranger_0.13.1
[17] patchwork_1.1.1 fontawesome_0.2.2 forcats_0.5.1 stringr_1.4.0
[21] dplyr_1.0.9 purrr_0.3.4 readr_2.1.2 tidyr_1.2.0
[25] tibble_3.1.7 ggplot2_3.3.6 tidyverse_1.3.1 knitr_1.39
[29] here_1.0.1
loaded via a namespace (and not attached):
[1] readxl_1.4.0 backports_1.4.1 fastmatch_1.1-3
[4] plyr_1.8.7 lazyeval_0.2.2 splines_4.2.0
[7] listenv_0.8.0 TH.data_1.1-1 digest_0.6.29
[10] foreach_1.5.2 htmltools_0.5.2 fansi_1.0.3
[13] magrittr_2.0.3 MLmetrics_1.1.1 BBmisc_1.12
[16] ROCR_1.0-11 tzdb_0.3.0 recipes_0.2.0
[19] globals_0.15.0 modelr_0.1.8 gower_1.0.0
[22] matrixStats_0.62.0 R.utils_2.11.0 vroom_1.5.7
[25] hardhat_0.2.0 rmdformats_1.0.4 colorspace_2.0-3
[28] vip_0.3.2 rvest_1.0.2 haven_2.5.0
[31] xfun_0.31 DiceKriging_1.6.0 tcltk_4.2.0
[34] crayon_1.5.1 jsonlite_1.8.0 libcoin_1.0-9
[37] survival_3.3-1 iterators_1.0.14 glue_1.6.2
[40] gtable_0.3.0 ipred_0.9-12 R.cache_0.15.0
[43] future.apply_1.9.0 scales_1.2.0 DBI_1.1.2
[46] Rcpp_1.0.8.3 viridisLite_0.4.0 bit_4.0.4
[49] proxy_0.4-26 lava_1.6.10 prodlim_2019.11.13
[52] htmlwidgets_1.5.4 httr_1.4.3 RColorBrewer_1.1-3
[55] ellipsis_0.3.2 pkgconfig_2.0.3 R.methodsS3_1.8.1
[58] farver_2.1.0 nnet_7.3-17 sass_0.4.1
[61] dbplyr_2.1.1 RJSONIO_1.3-1.6 utf8_1.2.2
[64] caret_6.0-92 tidyselect_1.1.2 labeling_0.4.2
[67] rlang_1.0.2 reshape2_1.4.4 munsell_0.5.0
[70] cellranger_1.1.0 tools_4.2.0 cli_3.3.0
[73] generics_0.1.2 broom_0.8.0 evaluate_0.15
[76] fastmap_1.1.0 yaml_2.3.5 rematch2_2.1.2
[79] ModelMetrics_1.2.2.2 bit64_4.0.5 fs_1.5.2
[82] coin_1.4-2 future_1.26.1 nlme_3.1-157
[85] R.oo_1.24.0 xml2_1.3.3 compiler_4.2.0
[88] rstudioapi_0.13 plotly_4.10.0 e1071_1.7-9
[91] reprex_2.0.1 bslib_0.3.1 stringi_1.7.6
[94] highr_0.9 plot3D_1.4 lattice_0.20-45
[97] Matrix_1.4-1 styler_1.7.0 vctrs_0.4.1
[100] pillar_1.7.0 lifecycle_1.0.1 jquerylib_0.1.4
[103] cowplot_1.1.1 data.table_1.14.2 R6_2.5.1
[106] bookdown_0.26 rpart.plot_3.1.1 gridExtra_2.3
[109] parallelly_1.31.1 codetools_0.2-18 MASS_7.3-56
[112] assertthat_0.2.1 rprojroot_2.0.3 withr_2.5.0
[115] multcomp_1.4-19 hms_1.1.1 rpart_4.1.16
[118] timeDate_3043.102 class_7.3-20 mco_1.15.6
[121] misc3d_0.9-1 rmarkdown_2.14 parallelMap_1.5.1
[124] MetricsWeighted_0.5.4 pROC_1.18.0