Classification and regression trees for linguistic analysis


In this tutorial we’ll go over the basics of how to use classification and regression trees (CARTs). The basic methods have been around for a while (see Breiman 1998), but they are (relatively) new to linguistics. One of the first papers to utilize this technique (and the related technique of random forests) in linguistics was Tagliamonte & Baayen (2012), but the method has been gaining ground in recent years as a useful alternative to other methods (e.g. Bernaisch, Gries & Mukherjee 2014; Szmrecsanyi et al. 2016).

In the first section I’ll explain a bit about how CART models work, and then we’ll move on to discuss how to use them in R. Feel free to skip ahead to the CARTs in R section below if you’d just like to get started!

Packages you will need for this tutorial:

install.packages(c(
  "here", # for project file management
  "tidyverse", # data wrangling
  "patchwork", # for arranging plots
  "partykit" # for trees
))
# to reproduce the markdown you will also need "rmdformats" and "fontawesome"

Let’s get started!

What are classification and regression trees?

Classification and regression trees are similar in spirit to regression models in that they’re used to predict a response \(y\) from a set of predictors \(x_1, x_2,..., x_n\). There are generally two types of trees:

  • Classification trees are used when you want to predict a categorical response/outcome. That is, the dependent variable can be assigned to 2 or more discrete classes, values, or labels.
  • Regression trees are used when you want to predict a continuous response/outcome. That is, the dependent variable can take any of an infinite range of numerical values.

One of the major differences between tree models and standard regression models is that the latter are ‘global’ models, in the sense that the prediction formula is assumed to hold over the entire data space. This means that in the simple case of a regression formula without any interactions,

\[y = \beta_1x_1 + \beta_2x_2 + \beta_3x_3\]

the effect of predictor \(x_1\) on the response \(y\) is assumed to be the same no matter what the values of predictors \(x_2\) and \(x_3\) may be. This is not so with tree models.

The idea behind CART models is to divide the data space into smaller and smaller non-overlapping partitions, to which simpler models can be applied. Tree models use a top-down approach, in which we begin at the top of the tree where all observations are included in a single region and successively split the data space into new branches (subregions) down the tree. Most tree algorithms are ‘greedy’ in that they consider only the best split for the current region of the data. That is, they don’t care about what splits have come before, or what splits may come after the current node. Splitting continues until some threshold is reached, and how that threshold is defined can have a major impact on the results.
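To make the greedy search concrete, here is a minimal base-R sketch (my own illustration, not the algorithm {partykit} uses) that finds the single best split point for one numeric predictor by minimizing the summed squared error of the two resulting subregions:

```r
# Hypothetical greedy split search for one numeric predictor x and a numeric
# response y: try every candidate cut point and keep the one that minimizes
# the total squared error of the two subregions. Real CART algorithms repeat
# this search recursively within each new subregion, across all predictors.
best_split <- function(x, y) {
  cuts <- sort(unique(x))
  cuts <- head(cuts, -1) # a cut at the maximum would leave one side empty
  sse <- sapply(cuts, function(cut) {
    left <- y[x <= cut]
    right <- y[x > cut]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  })
  cuts[which.min(sse)]
}

set.seed(1)
x <- runif(100)
y <- ifelse(x < 0.4, 0, 3) + rnorm(100, 0, 0.5) # true step at x = 0.4
best_split(x, y) # should land very close to 0.4
```

Note that the greedy search recovers the true break point here because the data really do contain a step; as we'll see later, the algorithm will happily find splits even when no such break exists.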

There are several packages in R for computing CARTs, but in this tutorial we’ll focus on the {partykit} package (Hothorn & Zeileis 2015).

Terminology

A bit of terminology:

  • Nodes: Points at which a splitting decision is made. Each node represents a (sub)section of the dataset. The Root node is the node at the top of the tree which represents the entire sample prior to any splitting.
  • Branches: Subsections of the entire tree (also called sub-trees)
  • Leaves / Terminal nodes: Nodes at the bottom of the tree where no further split is made.
  • Parent / Child nodes: Super- and subordinate nodes are referred to as parent and child nodes respectively.

CART terminology

In all CARTs, the leaves of the tree (terminal nodes) give us predictions about our response for the subregion of the dataset that the leaves represent.

  • For classification trees, the leaves of the tree (terminal nodes) give us predictions about which class is most likely for that subregion of the data. This is measured simply as the class with the most observations, i.e. the class that makes up the largest proportion of the data.
  • For regression trees, the prediction is just the mean value of the response for the observations in a given subregion. So if a data point falls into that region, our prediction for its response will be the average response in that region.
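In code, the two kinds of terminal-node predictions amount to a majority vote and a mean. A toy base-R sketch with made-up leaf data:

```r
# Hypothetical observations falling into one leaf of a classification tree:
leaf_classes <- c("A", "A", "B", "A", "C")
names(which.max(table(leaf_classes))) # predicted class: "A" (the majority)

# Hypothetical reaction times falling into one leaf of a regression tree:
leaf_rt <- c(6.41, 6.52, 6.38, 6.47)
mean(leaf_rt) # predicted value: 6.445
```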

For this tutorial, we will focus on methods that use binary splits, though there are methods for producing trees with more than two branches, e.g. with J48() in the {RWeka} package.

An illustration

Suppose we have a dataset of hypothetical speakers from three different regions, A, B, or C. We measured the proportions of two linguistic features used by these participants, feature F1 and feature F2, and we want to see how well we can predict a participant’s region based on F1 and F2. The values of F1 and F2 are proportions ranging from 0 to 1.

We plot the regions based on F1 and F2 and get this:

We can model this with a decision tree like so.

There are five terminal nodes in the resulting tree, which represent distinct partitions (subsets) of the data. These can be represented in the scatterplot accordingly:

For each partition of the data, the model makes a prediction based on the proportion of A, B, or C participants found in that partition. Simply put, the most frequent region found in the data partition is the winner. To get a prediction for a participant’s region then, we simply find the F1 and F2 values of that participant, locate the partition of the data they fall into, and get the most frequent response in that partition.
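That lookup procedure is just a sequence of nested binary decisions. As a sketch (the thresholds and region labels here are made up, not the actual splits from the illustration):

```r
# Hypothetical decision rules mirroring a small fitted tree:
predict_region <- function(F1, F2) {
  if (F1 < 0.4) {
    if (F2 < 0.5) "A" else "B"
  } else {
    if (F2 < 0.3) "C" else "B"
  }
}

predict_region(F1 = 0.2, F2 = 0.7) # falls in the F1 < 0.4, F2 >= 0.5 partition: "B"
```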

Splitting criteria

One of the most important factors affecting the accuracy of tree models is the method that the tree-growing algorithm uses to determine where and when to make a split. The idea is to find the “best” split at any given point, so the question becomes how to determine the “best” split. Decision trees use different algorithms to find the best split, and I’ll briefly mention the most common here.

  • For classification trees, the most common criteria are Gini impurity and information gain. In practice, I’ve found both methods to produce nearly identical results.
  • For regression trees, the standard method relies on finding the split that results in the greatest reduction in variance (mean squared error, MSE) when comparing the variance of the parent node to the average variance among the child nodes.
  • Lastly, there are conditional inference trees. Conditional inference trees derive splits using permutation-based significance tests. This method is designed explicitly to avoid known biases of other methods which tend to favor variables that have many possible splits or many missing values (Hothorn, Hornik & Zeileis 2006). This method works equally well with discrete and continuous outcomes.
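For the curious, the two classic criteria can be sketched in a few lines of base R. These are illustrations of the textbook formulas, not what ctree() computes (it uses permutation tests instead):

```r
# Gini impurity of a node: sum over classes of p * (1 - p),
# equivalently 1 - sum(p^2). Lower = purer node.
gini <- function(classes) {
  p <- prop.table(table(classes))
  sum(p * (1 - p))
}
gini(c("s", "s", "s", "of")) # 0.375 (fairly pure)
gini(c("s", "of", "s", "of")) # 0.5 (maximally impure for two classes)

# Variance reduction for a candidate split of a numeric response:
# MSE of the parent node minus the weighted average MSE of the children.
variance_reduction <- function(y, in_left) {
  mse <- function(v) mean((v - mean(v))^2)
  left <- y[in_left]
  right <- y[!in_left]
  mse(y) - (length(left) * mse(left) + length(right) * mse(right)) / length(y)
}
```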

The {partykit} package we’ll be using relies on the conditional inference method.

CARTs in R

About R

Since we’re working in R, some familiarity with R and RStudio is a necessity. This tutorial assumes you are familiar with a few of the core aspects of the “tidy” coding style, particularly the use of pipes, i.e. %>% or the native |>. If you are new to R, I recommend the following to get started:

  • swirl. This is a tutorial library that works directly in R. There really isn’t a better way to learn R!
  • R for Data Science by Hadley Wickham and Garrett Grolemund (Wickham & Grolemund 2016). This covers all the basics for working with R using the “tidy” approach to programming.

Libraries

Before starting, make sure you have the following packages installed, and load them into your workspace. The code here makes use of the standard {tidyverse} packages and functions. You can find more information via the Tidyverse website, or through the much more extensive R for Data Science book, which is also available online. I also use the {here} package for managing file paths in my projects. For the trees, we’ll use {partykit}.

library(here)
library(tidyverse)
library(partykit)

Datasets

For classification we’ll use two datasets from studies of syntactic alternations in English, and for illustrating regression trees a third dataset of lexical decision times. You can download all three datasets directly from my GitHub repository. I recommend storing them in a separate data folder.

gens <- here("data", "brown_genitives.txt") %>% # change as needed.
  read_delim(
    delim = "\t",
    trim_ws = TRUE,
    col_types = cols()
  )

obj_rels <- here("data", "brown_object_relativizers.txt") %>%
  read_delim(
    delim = "\t",
    trim_ws = TRUE,
    col_types = cols()
  )

lexdec <- here("data", "english_lexical_decision.csv") %>%
  read_csv(
    trim_ws = TRUE,
    col_types = cols()
  )

Below is some info on the datasets.

Genitive alternation

This dataset contains data from 5 sections of the Brown corpus used in Grafmiller (2014), as well as a complementary dataset from the Frown corpus.

  1. the president’s assertion [s-genitive]
  2. the contour of her face [of-genitive]

The English genitive alternation is known to be correlated with a number of features, including…

  • the animacy of the possessor
  • the length of the possessor and possessum
  • the presence of a sibilant at the end of the possessor (Bush’s prestige)
  • semantic relation between possessor and possessum (kinship, ownership, body-part, etc.)
  • the ‘thematicity’ (text frequency) of the possessor
  • the genre of the text (Newspaper, academic, fiction, etc.)

The goal of this study was to investigate the factors that co-determine (or at least correlate with) the choice of genitive variant.

# inspect the data
head(gens)

English Relativizers

This is the dataset of English relativizers originally compiled by Hinrichs, Szmrecsanyi & Bohmann (2015) and used in Grafmiller, Szmrecsanyi & Hinrichs (2016).

  1. a doctrine that nobody either does or need hold
  2. mistakes which others manage to avoid
  3. the boundary conditions Ø we impose

The choice among relative pronouns in English is known to be correlated with a number of features, including…

  • the length of the relative clause RC_length
  • the length of the antecedent NP (both this and the above are related to the complexity of the RC context)
  • the part of speech of the antecedent
  • the number of the antecedent
  • the formality of the text
  • prior use of a particular pronoun (‘structural persistence/priming’)
  • the predictability of an upcoming RC, given the preceding material

As with the genitive alternation, the goal of these studies was to investigate the factors that co-determine/correlate with the choice of relativizer, particularly that vs. which vs. ZERO.

# inspect the data
head(obj_rels)

English lexical decision times

For illustrating regression trees we’ll use a dataset that comes from the {languageR} package (Baayen 2008). This gives mean visual lexical decision latencies and word naming latencies for 2284 monomorphemic English nouns and verbs, averaged for old and young subjects, with various other predictor variables. You can find more information about this dataset by consulting the documentation with ?languageR::english.

head(lexdec)

Generally speaking, CARTs are not optimal for working with continuous outcomes, so I include this mainly for illustration. Most of this tutorial will focus on cases where we’re modeling categorical outcomes.


Classification trees

Classification trees are similar to logistic regression in that they are used to predict the probability of a set of two or more possible responses or outcomes. This is perhaps the most intuitive use of tree models.

For this tutorial, we’ll use the {partykit} package, which is an extension of the earlier {party} package (do NOT load both at the same time). The main function we’ll be using is ctree().

Watch for function conflicts! It’s important to note that the {party} and {partykit} packages use functions with the same names, most importantly ctree() and cforest(). If you load both packages into your workspace, you will see warnings about certain functions being masked. This means that the function from the first package that was loaded will no longer be called by default. For this reason, it’s usually a good idea to load only one of these packages at a given time. Alternatively, you can call functions from specific packages explicitly with the syntax package::function().

Binary outcomes

Let’s start by fitting a simple classification tree for a binary outcome and plot it. We’ll consider the effect of possessor animacy (as a binary variable) and final sibilant on the choice of English genitive variant.

First we’ll define the formula for quick calling. Here we are trying to predict the Type of genitive construction based on whether the possessor is animate or not (Possessor.Animacy2), and whether the possessor ends in a sibilant sound (Final.Sibilant).

gen_fmla1 <- Type ~ Possessor.Animacy2 + Final.Sibilant

One thing to look out for is that {party} and {partykit} don’t work with character vectors, so we’ll need to convert our columns here to factors.

gens %>%
  select(all_of(all.vars(gen_fmla1))) %>%
  glimpse()
Rows: 5,098
Columns: 3
$ Type               <chr> "of", "of", "of", "of", "s", "s", "s", "s", "s", "s~
$ Possessor.Animacy2 <chr> "animate", "animate", "animate", "animate", "animat~
$ Final.Sibilant     <chr> "N", "Y", "N", "N", "N", "N", "N", "N", "N", "N", "~

These should all be factors, so we can convert them like so.

gens <- gens %>%
  mutate(across(all_of(all.vars(gen_fmla1)), as.factor))

Fit the tree and plot it.

gen_ctree1 <- ctree(gen_fmla1, data = gens)
plot(gen_ctree1)

Easy! We see from Node 1 that initially, the best split is between animate and inanimate possessors. Then we look at the corresponding subsets and see that the presence of a final sibilant also has an effect, in both the animate and inanimate subsets. The terminal nodes show the proportion of genitive constructions in the resulting sub-regions of the data. So for instance, when the possessor is animate and does not have a final sibilant (Node 3), the proportion of s-genitives is around 75-80%.

We can generate predictions from the tree with the predict() function.

ctree_predict <- predict(gen_ctree1)
head(ctree_predict, n = 20)
 3  4  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3 
 s of  s  s  s  s  s  s  s  s  s  s  s  s  s  s  s  s  s  s 
Levels: of s

Measure the accuracy:

# proportion of observations correctly predicted by tree:
sum(gens$Type == ctree_predict) / nrow(gens)
[1] 0.7826599
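Beyond overall accuracy, a cross-tabulation of predicted against observed outcomes shows which variant the tree tends to get wrong. A sketch with short hypothetical vectors standing in for predict(gen_ctree1) and gens$Type:

```r
# Hypothetical observed and predicted outcomes:
observed <- factor(c("s", "s", "of", "of", "s", "of"))
predicted <- factor(c("s", "of", "of", "of", "s", "s"))

# Confusion matrix: rows = observed, columns = predicted
table(observed, predicted)

# Accuracy is the proportion of cases on the diagonal
mean(observed == predicted) # 4 of 6 correct, i.e. ~0.67
```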

Multi-class (more than 2) outcomes

Let’s try to predict English relativizers with a simplified model, predicting one of three possible options (the book that/which/∅ you wrote). Here we’ll use a simple model with the variety (AmE ~ BrE) and antecedent part-of-speech (noun ~ other) as predictors.

rc_fmla1 <- relativizer ~ variety + ant_POS

# convert columns to factors
obj_rels <- obj_rels %>%
  mutate(across(all_of(all.vars(rc_fmla1)), as.factor))

Fit the tree.

rc_ctree1 <- ctree(
  rc_fmla1,
  data = obj_rels
)
# adjust font size for readability
plot(rc_ctree1, gp = gpar(cex = .9))

Here the terminal nodes show the proportions of each outcome for the respective part of the data. So there is a slight tendency for BrE to use overt relativizers more than AmE in general, and within BrE, we find greater use of which vs. ZERO when the antecedent is a noun.


Regression trees

Now let’s see what a regression tree looks like. Here we’ll try to predict the lexical decision time based on the word’s category (noun or verb) and the age of the participant (‘old’ or ‘young’).

# convert to factors
lexdec <- lexdec %>%
  mutate(
    AgeSubject = as.factor(AgeSubject),
    WordCategory = as.factor(WordCategory)
  )

lex_dec_fmla1 <- RTlexdec ~ AgeSubject + WordCategory

lex_dec_tree1 <- ctree(lex_dec_fmla1, lexdec)
plot(lex_dec_tree1)

The terminal nodes here are boxplots showing the distribution of the response times in the respective sub-regions. From this we can see that the tree does show a significant effect of WordCategory among both old and young speakers, but this difference is small compared to that between the different age groups. So it seems age has a bigger impact than category for this data.

The predictions are simply the average value of the respective data partition.

# look at a random sample of 20 observations
# the names indicate the terminal node
predict(lex_dec_tree1) %>%
  sample(20)
       6        6        7        3        7        6        4        4 
6.444560 6.444560 6.429946 6.664955 6.429946 6.444560 6.653983 6.653983 
       4        7        3        6        7        6        3        4 
6.653983 6.429946 6.664955 6.444560 6.429946 6.444560 6.664955 6.653983 
       4        3        6        3 
6.653983 6.664955 6.444560 6.664955 

Pruning and tuning

Now let’s consider a more complex situation. Let’s define a more complex formula for genitive choice using more predictors. These are mostly categorical predictors, so the model should be relatively simple (we hope).

gen_fmla2 <- Type ~ Possessor.Animacy2 + Final.Sibilant +
  SemanticRelation + Possessor.Expression.Type +
  Genre + Corpus + Possessor.Length + Possessum.Length + PossessorThematicity

Check our data. We have some character columns that need to be changed to factors (numeric ones are fine).

gens[, all.vars(gen_fmla2)] %>%
  glimpse()
Rows: 5,098
Columns: 10
$ Type                      <fct> of, of, of, of, s, s, s, s, s, s, s, of, s, ~
$ Possessor.Animacy2        <fct> animate, animate, animate, animate, animate,~
$ Final.Sibilant            <fct> N, Y, N, N, N, N, N, N, N, N, N, N, N, N, N,~
$ SemanticRelation          <chr> "BOD", "BOD", "BOD", "BOD", "BOD", "BOD", "B~
$ Possessor.Expression.Type <chr> "CommonN", "CommonN", "ProperN", "CommonN", ~
$ Genre                     <chr> "General Fiction", "General Fiction", "Gener~
$ Corpus                    <chr> "Brown", "Brown", "Brown", "Brown", "Brown",~
$ Possessor.Length          <dbl> 3, 4, 2, 3, 2, 2, 2, 2, 1, 1, 2, 5, 2, 2, 1,~
$ Possessum.Length          <dbl> 2, 2, 2, 2, 1, 2, 1, 4, 6, 1, 1, 2, 1, 2, 1,~
$ PossessorThematicity      <dbl> -3.301030, -3.301030, -3.301030, -3.301030, ~
all.vars(gen_fmla2)
 [1] "Type"                      "Possessor.Animacy2"       
 [3] "Final.Sibilant"            "SemanticRelation"         
 [5] "Possessor.Expression.Type" "Genre"                    
 [7] "Corpus"                    "Possessor.Length"         
 [9] "Possessum.Length"          "PossessorThematicity"     

The first 7 must be factors.

gens <- gens %>%
  mutate(across(all_of(all.vars(gen_fmla2)[1:7]), as.factor))

Now we fit a tree using this formula.

gen_ctree2 <- ctree(gen_fmla2, data = gens)
# create a party plot
gen_ctree2 %>%
  plot(gp = gpar(cex = .7))

OMG! We can tweak the graphical parameters some (see below), but this will only get us so far. Really, this is a sign that our model may not be the best model for the purposes of inference, i.e. understanding what is going on in the larger population our data is sampled from. The problem is that the tree-building algorithm will look for any and all possible splits in the data, regardless of whether those divisions are likely to be replicated if we were to take different samples of genitive constructions. This is the problem of overfitting: our tree model is too finely tuned to our specific dataset, thus it does not likely represent a good model of genitive variation in general.

This is a much more complex tree! One challenge with CARTs is that while small trees are fairly easy to interpret, they can get unwieldy very quickly. It can be very hard to summarise a tree like the one above in a way that helps us understand what may be going on, and it is far too complicated to interpret in any useful way.

Adjusting for overfitting can be done post-hoc by “pruning” the tree, or it can be done prior to fitting the model by tuning the control settings of the function. Pruning is important for CARTs fit with traditional methods, but it is generally not something that is done with conditional inference trees (indeed, the technique was designed partly to make pruning unnecessary). What we should do instead, is decide how to control the growth of the tree beforehand, in an unbiased way. There are a number of ways to do this.

Adjusting p-value

One way is to adjust the level of statistical significance the tree requires a test to meet before making a split. Splits are only made when the global null hypothesis of independence can be rejected, as determined by a chosen p-value (by default, this is the standard α = 0.05). You can adjust this under the ctree_control() options.

Let’s set it to require a p-value of 0.001 or below:

gen_ctree2b <- ctree(gen_fmla2,
  data = gens,
  control = ctree_control(mincriterion = .999)
) # now p < .001
# Note the value for 'mincriterion' is 1 minus the desired level: 1 - .001 = .999
plot(gen_ctree2b)

The tree is simpler now, but still pretty unmanageable.

Limiting branching depth

Another way to simplify the tree is to limit the maximum depth of the branching with the maxdepth argument. The default is maxdepth = Inf, but you can tell it to stop splitting after a certain number of levels is reached.

gen_ctree2c <- ctree(gen_fmla2,
  data = gens,
  control = ctree_control(mincriterion = .999, maxdepth = 3)
)
plot(gen_ctree2c)

This is better, but we may be losing some predictive power. The tree is likely to make less accurate predictions, but it is at least interpretable.

Split and node size

Other options can be adjusted as needed/desired. These include minsplit, the number of data points necessary to consider splitting (if a node contains fewer data points than this value, the model will not try to find a split), and minbucket, the minimum number of data points that the resulting subsets must contain (if a given split results in one or more partitions with fewer data points than this value, the model will ignore that split).

gen_ctree2d <- ctree(gen_fmla2,
  data = gens,
  control = ctree_control(mincriterion = .999, minsplit = 1000L)
)
plot(gen_ctree2d, gp = gpar(cex = .7))

gen_ctree2e <- ctree(gen_fmla2,
  data = gens,
  control = ctree_control(mincriterion = .999, minsplit = 1000L, minbucket = 200L)
)
plot(gen_ctree2e, gp = gpar(cex = .7))

See the documentation for ctree() and ctree_control() for more.

A word of caution about CARTs

CART models seem quite useful; however, it is easy to be led astray by them. For example, it’s often assumed that trees are good at representing interaction effects, but there are cases in which this assumption cannot be maintained. This is sometimes referred to as the “exclusive or” (XOR) problem,

which describes a situation where two variables show no main effect but a perfect interaction. In this case, because of the lack of a marginally detectable main effect, none of the variables may be selected in the first split of a classification tree, and the interaction may never be discovered. (Strobl, Malley & Tutz 2009:341)

I won’t illustrate this here, but a recent simulation study by Gries (2019) illustrates this problem quite nicely (though I don’t know how often it’s likely to occur IRL). The point is that when we don’t know the actual relationship between our predictors and the outcome, we should be extra careful about making claims regarding (the absence of) interactions in the data.

A second word of caution about trees, which to my knowledge has not been raised in the literature, involves the use of splits to identify inflection points (non-linearities) in continuous predictors. For example, Tagliamonte, D’Arcy & Louro (2016) use ctrees to identify what they refer to as “shock points” in the developmental timeline of the English quotative system, specifically the rapid global increase in the use of quotative like (as in he was like, “What happened?”). They illustrate this trend with the tree below:

Tagliamonte et al. (2016:833)

They argue that trees like this reveal the points in time where the use of quotative like began to significantly accelerate (or decelerate), and these points in time naturally open themselves up to interpretation and speculation. Presumably there is some reason why this word shows substantial changes at these particular times.

This seems reasonable; however, we need to be aware that split points in trees like these may not actually be very meaningful. Recall that tree models are designed to try to split the data wherever they can, so we might wonder: what will happen when we have a truly linear effect in our data? How would a CART, constrained to make binary partitions in the data, represent a truly linear effect of a predictor x on some outcome y?

Suppose, for instance, that we had data such as illustrated below. Predictor x clearly has a strong (linear) effect on y, but there is no reason to suspect that any particular values of x split the data better than any others. In other words, there is no obvious curvature in the data here, so there are no “inflection points” suggested by the data. But what is likely to happen if we fit a tree model to such a dataset? How many splits would we get? How are they distributed? Where might the tree split, and why?

Simulated linear effect


We can test this with a small simulation study. Doing so reveals how branching in the tree models can suggest non-linearities in a misleading way. We’ll simulate a dataset predicting choice of quotative like vs. the standard say based on the date of birth (DOB) of a speaker.

set.seed(43214)
# simulate a dataset
DOB <- ceiling(rnorm(200, 1960, 15))
y <- scale(DOB) * 1.5 + rnorm(200, 0, 2) # add some noise
# y is a set of outcome probabilities on the log odds scale
# We'll use log odds because they can range from -Inf to Inf
df <- data.frame(DOB = DOB, y = y) %>%
  mutate(
    prob = gtools::inv.logit(y), # convert log odds to probabilities
    resp = factor(if_else(prob > .5, "like", "say")),
    bin = as.numeric(resp) - 1
  )
summary(df)
      DOB             y                 prob            resp    
 Min.   :1920   Min.   :-6.29318   Min.   :0.001845   like:101  
 1st Qu.:1950   1st Qu.:-1.95962   1st Qu.:0.123523   say : 99  
 Median :1960   Median : 0.06341   Median :0.515842             
 Mean   :1961   Mean   :-0.07830   Mean   :0.488721             
 3rd Qu.:1971   3rd Qu.: 1.64026   3rd Qu.:0.837569             
 Max.   :2006   Max.   : 5.69999   Max.   :0.996665             
      bin       
 Min.   :0.000  
 1st Qu.:0.000  
 Median :0.000  
 Mean   :0.495  
 3rd Qu.:1.000  
 Max.   :1.000  

From a conditional density plot, it’s clear that the effect is about as linear as we could want. There is very little curvature in the line dividing the two halves of the plot.

par(mar = c(5, 4, 4, 2)) # increase top margin
cdplot(resp ~ DOB, df, main = "CD plot of simulated linear effect of DOB\non use of quotative 'like' over 'say'")

But, if we fit tree models to the data, they nonetheless suggest some “shock points”, which we might be tempted to interpret as meaningful in some way.

ctree(resp ~ DOB, df) %>%
  plot(main = "Ctree simulated linear effect of DOB\non use of quotative 'like' over 'say'")

The lesson here is that conditional inference trees (and other CARTs) are capable of capturing linear effects, to a certain degree, but they are constrained in ways that other methods are not. We should therefore be cautious when trying to read too much into the individual split points of continuous predictors derived solely from a tree. It’s always a good idea to verify such patterns using other techniques, such as conditional density plots for categorical outcomes or simple scatterplots for continuous outcomes. If you don’t see much of a pattern in these plots, you probably should not make much of the specific split points in your tree models.
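One such cross-check for the simulated data above: fit a logistic regression and confirm that a single smooth slope for DOB accounts for the pattern. This re-simulates the data with base R’s plogis() (equivalent to gtools::inv.logit()) so the snippet is self-contained:

```r
# Re-simulate data like the example above (base R only):
set.seed(43214)
DOB <- ceiling(rnorm(200, 1960, 15))
y <- scale(DOB) * 1.5 + rnorm(200, 0, 2)
resp <- factor(ifelse(plogis(y) > .5, "like", "say"))

# A logistic regression models the effect of DOB as one smooth coefficient,
# with no cut points anywhere along the timeline:
m <- glm(resp ~ DOB, family = binomial)
coef(summary(m))
```

If the linear model fits well and its residuals show no systematic pattern, the tree’s split points are more plausibly artifacts of the binary-partitioning constraint than genuine “shock points”.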

Summary

Advantages of tree models:

  • Non-parametric. Tree models don’t make any assumptions about the distribution of the data, which means they require very little data preparation (unlike parametric methods such as regression).
  • Computationally quick and simple. They can work on very large datasets in reasonable amounts of time.
  • Easy to understand. In simple cases, trees are relatively easy to understand and interpret even for people without much background in statistics.

Disadvantages of tree models:

  • Overfitting. Tree models tend to overfit the data, and require pruning or tuning.
  • Sensitive to particularities of your data. This problem is similar in spirit to overfitting. Slight changes can result in different trees, which can give different predictions. This makes the results of the tree less likely to generalize to new data.
  • Prone to misinterpretation. Trees with many interacting predictors do not always accurately represent the true patterns in the data (Gries 2019). Trees can suggest spurious non-linearities and inflate effects of individual predictors.
  • Overly complex trees. With large datasets and many predictors, we often get very large trees that are difficult to interpret.
  • Low accuracy. Trees are generally less accurate than other methods, e.g. regression.
  • Not ideal for continuous data. Tree models tend to lose too much information when trying to model continuous outcomes. In these cases, regression models are usually preferred.
  • Cannot handle non-independence well (yet): Most current methods don’t have a way of incorporating clustered or hierarchical data structures. There is a package {glmertree} available for fitting mixed effects trees (Fokkema et al. 2018), but this method is still in development and I am not very familiar with it.

For further reading see chapter 14 in Levshina (2015) and chapter 9.2 in Hastie, Tibshirani & Friedman (2009).

Graphical parameters

Large trees can be unwieldy to plot, and {partykit} offers ways to adjust the graphical settings of your trees to help with presentation.

help("party-plot")

Basic global parameters are specified with a gpar() object.

?gpar

You can see the current settings like so.

str(get.gpar())
List of 14
 $ fill      : chr "white"
 $ col       : chr "black"
 $ lty       : chr "solid"
 $ lwd       : num 1
 $ cex       : num 1
 $ fontsize  : num 12
 $ lineheight: num 1.2
 $ font      : int 1
 $ fontfamily: chr ""
 $ alpha     : num 1
 $ lineend   : chr "round"
 $ linejoin  : chr "round"
 $ linemitre : num 10
 $ lex       : num 1
 - attr(*, "class")= chr "gpar"

Fonts

Font size, face (bold, italic), and family (serif, Times, etc.) can be changed with the following parameters.

  • cex: Multiplier applied to fontsize
  • fontsize: The size of text (in points)
  • fontface: The specification of fontface can be an integer or a string. If an integer, then it follows the R base graphics standard: 1 = plain, 2 = bold, 3 = italic, 4 = bold italic. If a string, then valid values are: “plain”, “bold”, “italic”, “oblique”, and “bold.italic”.
  • fontfamily: Changes to the fontfamily may be ignored by some devices. The fontfamily may be used to specify one of the Hershey Font families (e.g., HersheySerif) and this specification will be honoured on all devices.

# cex is a multiplier applied to fontsize
plot(gen_ctree1, gp = gpar(cex = .8, fontfamily = "Times New Roman"))

plot(rc_ctree1, gp = gpar(cex = .8, fontface = "italic"))

Line type & width

Line type (solid, dashed, etc.) and width can be adjusted with the lty and lwd or lex arguments respectively (lex is a multiplier similar to cex).

Line type can be specified using either text (“blank”, “solid”, “dashed”, “dotted”, “dotdash”, “longdash”, “twodash”) or number (0, 1, 2, 3, 4, 5, 6). Note that lty = “solid” is identical to lty = 1.

plot(gen_ctree1, gp = gpar(cex = .8, lty = 2)) # dashed lines

plot(gen_ctree1, gp = gpar(cex = .8, lwd = 2))

Colors

plot(gen_ctree1, gp = gpar(cex = .8, col = "blue"))

Changing panels

The panels and edges themselves can be formatted as well (as we saw above).

?panelfunctions
plot(gen_ctree1,
  inner_panel = node_barplot, # draw the inner panels as barplots
  ip_args = list(id = T), # show node IDs in the inner panels
  tp_args = list(fill = c("palegreen4", "palegreen1"))
)

plot(rc_ctree1,
  gp = gpar(cex = .8),
  ip_args = list(id = F, pval = F), # remove IDs and pvals from inner panels
  tp_args = list(id = F)
)

# color bars
plot(rc_ctree1,
  tp_args = list(fill = heat.colors(3))
)


Citation & Session Info

Grafmiller, Jason. 2022. Classification and regression trees for linguistic analysis. University of Birmingham. url: https://jasongrafmiller.netlify.app/tutorials/tutorial_carts_ctrees.html (version 2022.06.03).

The following is my current setup on my machine.

sessionInfo()
R version 4.2.0 (2022-04-22 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 16299)

Matrix products: default

locale:
[1] LC_COLLATE=English_United Kingdom.1252 
[2] LC_CTYPE=English_United Kingdom.1252   
[3] LC_MONETARY=English_United Kingdom.1252
[4] LC_NUMERIC=C                           
[5] LC_TIME=English_United Kingdom.1252    

attached base packages:
[1] grid      stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] patchwork_1.1.1   partykit_1.2-15   mvtnorm_1.1-3     libcoin_1.0-9    
 [5] forcats_0.5.1     stringr_1.4.0     dplyr_1.0.9       purrr_0.3.4      
 [9] readr_2.1.2       tidyr_1.2.0       tibble_3.1.7      ggplot2_3.3.6    
[13] tidyverse_1.3.1   fontawesome_0.2.2 here_1.0.1        knitr_1.39       

loaded via a namespace (and not attached):
 [1] nlme_3.1-157      fs_1.5.2          lubridate_1.8.0   bit64_4.0.5      
 [5] httr_1.4.3        rprojroot_2.0.3   R.cache_0.15.0    tools_4.2.0      
 [9] backports_1.4.1   bslib_0.3.1       utf8_1.2.2        R6_2.5.1         
[13] rpart_4.1.16      mgcv_1.8-40       DBI_1.1.2         colorspace_2.0-3 
[17] withr_2.5.0       tidyselect_1.1.2  bit_4.0.4         compiler_4.2.0   
[21] cli_3.3.0         rvest_1.0.2       xml2_1.3.3        labeling_0.4.2   
[25] bookdown_0.26     sass_0.4.1        scales_1.2.0      digest_0.6.29    
[29] rmarkdown_2.14    R.utils_2.11.0    pkgconfig_2.0.3   htmltools_0.5.2  
[33] styler_1.7.0      dbplyr_2.1.1      fastmap_1.1.0     highr_0.9        
[37] rlang_1.0.2       readxl_1.4.0      rstudioapi_0.13   jquerylib_0.1.4  
[41] generics_0.1.2    farver_2.1.0      jsonlite_1.8.0    gtools_3.9.2.1   
[45] vroom_1.5.7       R.oo_1.24.0       magrittr_2.0.3    Formula_1.2-4    
[49] Matrix_1.4-1      munsell_0.5.0     fansi_1.0.3       lifecycle_1.0.1  
[53] R.methodsS3_1.8.1 stringi_1.7.6     yaml_2.3.5        inum_1.0-4       
[57] parallel_4.2.0    crayon_1.5.1      lattice_0.20-45   haven_2.5.0      
[61] splines_4.2.0     hms_1.1.1         pillar_1.7.0      reprex_2.0.1     
[65] glue_1.6.2        evaluate_0.15     modelr_0.1.8      vctrs_0.4.1      
[69] rmdformats_1.0.4  tzdb_0.3.0        cellranger_1.1.0  gtable_0.3.0     
[73] rematch2_2.1.2    assertthat_0.2.1  xfun_0.31         broom_0.8.0      
[77] survival_3.3-1    ellipsis_0.3.2   

References

Bernaisch, Tobias, Stefan Th. Gries & Joybrato Mukherjee. 2014. The dative alternation in South Asian English(es): Modelling predictors and predicting prototypes. English World-Wide 35. 7–31.
Breiman, Leo. 1998. Classification and regression trees. Boca Raton: Chapman & Hall.
Fokkema, M., N. Smits, A. Zeileis, T. Hothorn & H. Kelderman. 2018. Detecting treatment-subgroup interactions in clustered data with generalized linear mixed-effects model trees. Behavior Research Methods 50(5). 2016–2034. doi:10.3758/s13428-017-0971-x.
Grafmiller, Jason. 2014. Variation in English genitives across modality and genres. English Language and Linguistics 18(03). 471–496. doi:10.1017/S1360674314000136.
Grafmiller, Jason, Benedikt Szmrecsanyi & Lars Hinrichs. 2016. Restricting the restrictive relativizer. Corpus Linguistics and Linguistic Theory 0(0). doi:10.1515/cllt-2016-0015.
Gries, Stefan Th. 2019. On classification trees and random forests in corpus linguistics: Some words of caution and suggestions for improvement. Corpus Linguistics and Linguistic Theory 0(0). doi:10.1515/cllt-2018-0078.
Hastie, Trevor, Robert Tibshirani & J. H. Friedman. 2009. The elements of statistical learning: Data mining, inference, and prediction. 2nd ed. (Springer Series in Statistics). New York, NY: Springer.
Hinrichs, Lars, Benedikt Szmrecsanyi & Axel Bohmann. 2015. Which-hunting and the Standard English relative clause. Language 91(4). 806–836.
Hothorn, Torsten, Kurt Hornik & Achim Zeileis. 2006. Unbiased recursive partitioning: A conditional inference framework. Journal of Computational and Graphical Statistics 15(3). 651–674. doi:10.1198/106186006X133933.
Hothorn, Torsten & Achim Zeileis. 2015. Partykit: A modular toolkit for recursive partytioning in R. Journal of Machine Learning Research 16. 3905–3909.
Levshina, Natalia. 2015. How to do linguistics with R: Data exploration and statistical analysis. Amsterdam & Philadelphia: John Benjamins Publishing Company.
Strobl, Carolin, James Malley & Gerhard Tutz. 2009. An introduction to recursive partitioning: Rationale, application, and characteristics of classification and regression trees, bagging, and random forests. Psychological Methods 14(4). 323–348. doi:10.1037/a0016973.
Szmrecsanyi, Benedikt, Jason Grafmiller, Benedikt Heller & Melanie Röthlisberger. 2016. Around the world in three alternations: Modeling syntactic variation in varieties of English. English World-Wide 37(2). 109–137. doi:10.1075/eww.37.2.01szm.
Tagliamonte, Sali A., Alexandra D’Arcy & Celeste Rodríguez Louro. 2016. Outliers, impact, and rationalization in linguistic change. Language 92(4). 824–849. doi:gdg6vt.
Tagliamonte, Sali & Harald Baayen. 2012. Models, forests and trees of York English: Was/Were variation as a case study for statistical practice. Language Variation and Change 24(2). 135–178. doi:10.1017/S0954394512000129.
Wickham, Hadley & Garrett Grolemund. 2016. R for data science: Import, tidy, transform, visualize, and model data. First edition. Sebastopol, CA: O’Reilly.

  1. My aim here is not to criticize the work of Tagliamonte et al. (2016)—indeed the effects they find appear to be quite robust and genuinely non-linear, which we’d expect of the usual s-curve patterns observed in language change. In fact, if you go back to their data and plot them with a CD plot, it is clear that the “shock” points they observe in their trees likely represent genuine points of substantial change in the community.↩︎