Classification and regression trees for linguistic analysis
In this tutorial we’ll go over the basics of how to use classification and regression trees (CARTs). The basic methods have been around for a while (see Breiman 1998), but they are (relatively) new to linguistics. One of the first papers to utilize this technique (and the related technique of random forests) in linguistics was Tagliamonte & Baayen (2012), but the method has been gaining ground in recent years as a useful alternative to other methods (e.g. Bernaisch, Gries & Mukherjee 2014; Szmrecsanyi et al. 2016).
In the first section I’ll explain a bit about how CART models work, and then we’ll move on to discuss how to use them in R. Feel free to skip ahead to the CARTs in R section below if you’d just like to get started!
Packages you will need for this tutorial:
install.packages(c(
  "here",       # for project file management
  "tidyverse",  # data wrangling
  "patchwork",  # for arranging plots
  "partykit"    # for trees
))
# to reproduce the markdown you will also need "rmdformats" and "fontawesome"

Let’s get started!
What are classification and regression trees?
Classification and regression trees are similar in spirit to regression models in that they’re used to predict a response \(y\) from a set of predictors \(x_1, x_2,..., x_n\). There are generally two types of trees:
- Classification trees are used when you want to predict a categorical response/outcome. That is, the dependent variable can be assigned to 2 or more discrete classes, values, or labels.
- Regression trees are used when you want to predict a continuous response/outcome. That is, the dependent variable can take any of an infinite range of numerical values.
One of the major differences between tree models and standard regression models is that the latter are ‘global’ models, in the sense that the prediction formula is assumed to hold over the entire data space. This means that in the simple case of a regression formula without any interactions,
\[y = \beta_1x_1 + \beta_2x_2 + \beta_3x_3\]
the effect of predictor \(x_1\) on the response \(y\) is assumed to be the same no matter what the values of predictors \(x_2\) and \(x_3\) may be. This is not so with tree models.
The idea behind CART models is to divide the data space into smaller and smaller non-overlapping partitions, to which simpler models can be applied. Tree models use a top-down approach, in which we begin at the top of the tree where all observations are included in a single region and successively split the data space into new branches (subregions) down the tree. Most tree algorithms are ‘greedy’ in that they consider only the best split for the current region of the data. That is, they don’t care about what splits have come before, or what splits may come after the current node. Splitting continues until some threshold is reached, and how that threshold is defined can have a major impact on the results.
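To make the ‘greedy’ idea concrete, here is a rough sketch (purely for illustration; this is not the actual algorithm any package uses) of how a single split on one numeric predictor might be chosen, by trying every candidate cut point and keeping the one that minimizes the total within-region sum of squares:

```r
# rough sketch of one greedy split on a numeric predictor:
# try every candidate cut point and keep the one that minimizes the
# summed within-region sum of squares of the response
best_split <- function(x, y) {
  cuts <- sort(unique(x))
  cuts <- head(cuts, -1) # cutting at the max would leave one region empty
  sse <- sapply(cuts, function(cut) {
    left  <- y[x <= cut]
    right <- y[x > cut]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  })
  cuts[which.min(sse)]
}

set.seed(1)
x <- runif(100)
y <- ifelse(x < .6, 0, 5) + rnorm(100, 0, .5) # true break at x = .6
best_split(x, y) # should land near .6
```

A real tree algorithm repeats this search over every predictor at every node, which is why trees can grow so quickly.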
There are several packages in R for computing CARTs, but in this
tutorial we’ll focus on the {partykit} package (Hothorn & Zeileis 2015).
Terminology
A bit of terminology:
- Nodes: Points at which a splitting decision is made. Each node represents a (sub)section of the dataset. The Root node is the node at the top of the tree which represents the entire sample prior to any splitting.
- Branches: Subsections of the entire tree (also called sub-trees)
- Leaves / Terminal nodes: Nodes at the bottom of the tree where no further split is made.
- Parent / Child nodes: Super- and subordinate nodes are referred to as parent and child nodes respectively.
CART terminology
In all CARTs, the leaves of the tree (terminal nodes) give us predictions about our response for the subregion of the dataset that the leaves represent.
- For classification trees, the leaves of the tree (terminal nodes) give us predictions about which class is most likely for that subregion of the data. This is measured simply as the class with the most observations, i.e. the class that makes up the largest proportion of the data.
- For regression trees, the prediction is just the mean value of the response for the observations in a given subregion. So if a data point falls into that region, our prediction for its response will be the average response in that region.
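As a toy illustration (with made-up data), both kinds of terminal-node prediction can be computed with a simple grouped summary:

```r
# toy illustration (hypothetical data): terminal-node predictions are
# the majority class (classification) or the mean response (regression)
library(dplyr)

toy <- tibble(
  node  = c(1, 1, 1, 2, 2),          # which terminal node each row falls in
  class = c("s", "s", "of", "of", "of"),
  rt    = c(2, 4, 6, 10, 12)
)

toy %>%
  group_by(node) %>%
  summarise(
    pred_class = names(which.max(table(class))), # classification: majority vote
    pred_mean  = mean(rt)                        # regression: node average
  )
```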
For this tutorial, we will focus on methods that use binary
splits, though there are methods for producing trees with more
than two branches, e.g. with J48() in the
{RWeka} package.
An illustration
Suppose we have a dataset of hypothetical speakers from three different regions, A, B, or C. We measured the proportions of two linguistic features used by these participants, feature F1 and feature F2, and we want to see how well we can predict a participant’s region based on F1 and F2. The values of F1 and F2 are proportions ranging from 0 to 1.
We plot the regions based on F1 and F2 and get this:
We can model this with a decision tree like so.
There are five terminal nodes in the resulting tree, which represent distinct partitions (subsets) of the data. These can be represented in the scatterplot accordingly:
For each partition of the data, the model makes a prediction based on the proportion of A, B, or C participants found in that partition. Simply put, the most frequent region found in the data partition is the winner. To get a prediction for a participant’s region then, we simply find the F1 and F2 values of that participant, locate the partition of the data they fall into, and get the most frequent response in that partition.
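We can mimic this hypothetical setup with simulated data (the assignment rule below is arbitrary, just for illustration) and let ctree() recover the partitions:

```r
# simulate the hypothetical regions data and fit a small tree
library(partykit)
set.seed(1)

df <- data.frame(
  F1 = runif(300),
  F2 = runif(300)
)
# region depends on F1 and F2 (arbitrary rule, for illustration only)
df$region <- factor(ifelse(df$F1 < .4, "A",
                    ifelse(df$F2 < .5, "B", "C")))

region_tree <- ctree(region ~ F1 + F2, data = df)
plot(region_tree)

# predict the region for a new hypothetical participant
predict(region_tree, newdata = data.frame(F1 = .8, F2 = .3))
```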
Splitting criteria
One of the most important factors affecting the accuracy of tree models is the method that the tree-growing algorithm uses to determine where and when to make a split. The idea is to find the “best” split at any given point, so the question becomes how to determine the “best” split. Decision trees use different algorithms to find the best split, and I’ll briefly mention the most common here.
- For classification trees, the most common criteria are Gini impurity and information gain. Details can be found here. In practice, I’ve found both methods to produce nearly identical results.
- For regression trees, the standard method relies on finding the split that results in the greatest reduction in variance (mean squared error, MSE) when comparing the variance of the parent node to the average variance among the child nodes.
- Lastly, there are conditional inference trees, which derive splits using permutation-based significance tests. This method is designed explicitly to avoid known biases of other methods, which tend to favor variables that have many possible splits or many missing values (Hothorn, Hornik & Zeileis 2006). It works equally well with discrete and continuous outcomes.
The {partykit} package we’ll be using relies on the
conditional inference method.
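To give a feel for the first two criteria above, here are rough sketches of Gini impurity and variance reduction (for illustration only; ctree() uses the conditional inference method, not these measures):

```r
# rough sketches of two splitting criteria (illustration only;
# ctree() itself uses conditional inference, not these measures)

# Gini impurity: 1 - sum(p_k^2), where p_k are the class proportions
# in a node; 0 = perfectly pure node, higher = more mixed
gini <- function(classes) {
  p <- prop.table(table(classes))
  1 - sum(p^2)
}
gini(c("s", "s", "s", "s"))   # pure node: 0
gini(c("s", "s", "of", "of")) # maximally mixed two-class node: 0.5

# variance reduction: parent variance minus the weighted average
# variance of the two child nodes created by a split
var_reduction <- function(y, left_idx) {
  y_left  <- y[left_idx]
  y_right <- y[!left_idx]
  w_left  <- length(y_left) / length(y)
  var(y) - (w_left * var(y_left) + (1 - w_left) * var(y_right))
}
```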
CARTs in R
About R
Since we’re working in R, some familiarity with R and RStudio is a
necessity. This tutorial assumes you are familiar with a few of the core
aspects of the “tidy” coding style, particularly the use of pipes,
i.e. %>% or the native |>. If you are new to R, I
recommend the following to get started:
- swirl. This is a tutorial library that works directly in R. There really isn’t a better way to learn R!
- R for Data Science by Hadley Wickham and Garrett Grolemund (Wickham & Grolemund 2016). This covers all the basics for working with R using the “tidy” approach to programming.
Libraries
Before starting, make sure you have the following packages installed,
and load them into your workspace. The code here makes use of the
standard {tidyverse} packages and functions. You can find
more information via the Tidyverse
website, or through the much more extensive R for Data
Science book, which is also available online. I also use the
{here} package for managing file paths in my projects. For
the trees, we’ll use {partykit}.
library(here)
library(tidyverse)
library(partykit)

Datasets
For classification we’ll use two datasets from studies of syntactic alternations in English, plus a third dataset for the regression examples. You can download the datasets directly from my GitHub repository here, here, and here. I recommend storing them in a separate data folder.
gens <- here("data", "brown_genitives.txt") %>% # change as needed.
read_delim(
delim = "\t",
trim_ws = TRUE,
col_types = cols()
)
obj_rels <- here("data", "brown_object_relativizers.txt") %>%
read_delim(
delim = "\t",
trim_ws = TRUE,
col_types = cols()
)
lexdec <- here("data", "english_lexical_decision.csv") %>%
read_csv(
trim_ws = TRUE,
col_types = cols()
)

Below is some info on the datasets.
Genitive alternation
This dataset contains data from 5 sections of the Brown corpus (wikipedia) used in Grafmiller (2014), as well as a complementary dataset from the Frown corpus.
- the president’s assertion [s-genitive]
- the contour of her face [of-genitive]
The English genitive alternation is known to be correlated with a number of features, including…
- the animacy of the possessor
- the length of the possessor and possessum
- the presence of a sibilant at the end of the possessor (Bush’s prestige)
- semantic relation between possessor and possessum (kinship, ownership, body-part, etc.)
- the ‘thematicity’ (text frequency) of the possessor
- the genre of the text (Newspaper, academic, fiction, etc.)
The goal of this study was to investigate the factors that co-determine (or at least correlate with) the choice of genitive variant.
# inspect the data
head(gens)

English Relativizers
This is the dataset of English relativizers originally compiled by Hinrichs, Szmrecsanyi & Bohmann (2015) and used in Grafmiller, Szmrecsanyi & Hinrichs (2016).
- a doctrine that nobody either does or need hold
- mistakes which others manage to avoid
- the boundary conditions Ø we impose
The choice among relative pronouns in English is known to be correlated with a number of features, including…
- the length of the relative clause (RC_length)
- the length of the antecedent NP (both this and the above are related to the complexity of the RC context)
- the part of speech of the antecedent
- the number of the antecedent
- the formality of the text
- prior use of a particular pronoun (‘structural persistence/priming’)
- the predictability of an upcoming RC, given the preceding material
As with the genitive study, the goal here was to investigate the factors that co-determine/correlate with the choice of relativizer, particularly that vs. which vs. ZERO.
# inspect the data
head(obj_rels)

English lexical decision times
For illustrating regression trees we’ll use a dataset that comes from
the {languageR} package (Baayen 2008). This gives mean
visual lexical decision latencies and word naming latencies for 2284
monomorphemic English nouns and verbs, averaged for old and young
subjects, with various other predictor variables. You can find more
information about this dataset by consulting the documentation with
?languageR::english.
head(lexdec)

Generally speaking, CARTs are not optimal for working with continuous outcomes, so I include this mainly for illustration. Most of this tutorial will focus on cases where we’re modeling categorical outcomes.
Classification trees
Classification trees are similar to logistic regression in that they are used to predict the probability of a set of two or more possible responses or outcomes. This is perhaps the most intuitive use of tree models.
For this tutorial, we’ll use the {partykit} package,
which is an extension of the earlier {party} package
(do NOT load both at the same time). The main function we’ll be
using is ctree().
Watch for function conflicts! It’s important to note that the
{party}and{partykit}packages use functions with the same names, most importantlyctree(),cforest(). If you load both packages into your workspace, you will see warnings about certain functions being masked. This means that the function from the first package that was loaded will no longer be called by default. For this reason, it’s usually a good idea to load only one of these package at a given time. Alternatively, you can call functions from specific packages explicitly with the syntaxpackage::function()
Binary outcomes
Let’s start by fitting a simple classification tree for a binary outcome and plot it. We’ll consider the effect of possessor animacy (as a binary variable) and final sibilant on the choice of English genitive variant.
First we’ll define the formula for quick calling. Here we are trying
to predict the Type of genitive construction based on
whether the possessor is animate or not
(Possessor.Animacy2), and whether the possessor ends in a
sibilant sound (Final.Sibilant).
gen_fmla1 <- Type ~ Possessor.Animacy2 + Final.Sibilant

One thing to look out for is that {party} and
{partykit} don’t work with character vectors, so we’ll need
to convert our columns here to factors.
gens %>%
select(all.vars(gen_fmla1)) %>%
  glimpse()

Rows: 5,098
Columns: 3
$ Type <chr> "of", "of", "of", "of", "s", "s", "s", "s", "s", "s~
$ Possessor.Animacy2 <chr> "animate", "animate", "animate", "animate", "animat~
$ Final.Sibilant <chr> "N", "Y", "N", "N", "N", "N", "N", "N", "N", "N", "~
These should all be factors, so we can convert them like so.
gens <- gens %>%
  mutate(across(all.vars(gen_fmla1), as.factor))

Fit the tree and plot it.
gen_ctree1 <- ctree(gen_fmla1, data = gens)
plot(gen_ctree1)

Easy! We see from Node 1 that initially, the best split is between animate and inanimate possessors. Then we look at the corresponding subsets and see that the presence of a final sibilant also has an effect, in both the animate and inanimate subsets. The terminal nodes show the proportion of genitive constructions in the resulting sub-regions of the data. So for instance, when the possessor is animate and does not have a final sibilant (Node 3), the proportion of s-genitives is roughly 75–80%.
We can generate predictions from the tree with the
predict() function.
ctree_predict <- predict(gen_ctree1)
head(ctree_predict, n = 20)

 3  4  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3 
 s of  s  s  s  s  s  s  s  s  s  s  s  s  s  s  s  s  s  s 
Levels: of s
Measure the accuracy:
# proportion of observations correctly predicted by tree:
sum(gens$Type == ctree_predict) / nrow(gens)

[1] 0.7826599
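Raw accuracy can hide which outcome the tree gets wrong, so it’s often worth cross-tabulating observed and predicted classes as well. A quick sketch using base R’s table(), assuming the gens and ctree_predict objects from above:

```r
# confusion matrix: rows = observed genitive type, cols = predicted
# (sketch; assumes `gens` and `ctree_predict` from above)
conf_mat <- table(observed = gens$Type, predicted = ctree_predict)
conf_mat

# per-class accuracy: proportion of each observed class predicted correctly
prop.table(conf_mat, margin = 1)
```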
Multi-class (more than 2) outcomes
Let’s try to predict English relativizers with a simplified model, predicting one of three possible options (the book that/which/∅ you wrote). Here we’ll use a simple model with the variety (AmE ~ BrE) and antecedent part-of-speech (noun ~ other) as predictors.
rc_fmla1 <- relativizer ~ variety + ant_POS
# convert columns to factors
obj_rels <- obj_rels %>%
  mutate(across(all.vars(rc_fmla1), as.factor))

Fit the tree.
rc_ctree1 <- ctree(
rc_fmla1,
data = obj_rels
)
# adjust font size for readability
plot(rc_ctree1, gp = gpar(cex = .9))

Here the terminal nodes show the proportions of each outcome for the respective part of the data. So there is a slight tendency for BrE to use overt relativizers more than AmE in general, and within BrE, we find greater use of which vs. ZERO when the antecedent is a noun.
Regression trees
Now let’s see what a regression tree looks like. Here we’ll try to predict the lexical decision time based on the word’s category (noun or verb) and the age of the participant (‘old’ or ‘young’).
# convert to factors
lexdec <- lexdec %>%
mutate(
AgeSubject = as.factor(AgeSubject),
WordCategory = as.factor(WordCategory)
)
lex_dec_fmla1 <- RTlexdec ~ AgeSubject + WordCategory
lex_dec_tree1 <- ctree(lex_dec_fmla1, lexdec)
plot(lex_dec_tree1)

The terminal nodes here are boxplots showing the distribution of the
response times in the respective sub-regions. From this we can see that
the tree does show a significant effect of WordCategory
among both old and young speakers, but this difference is small compared
to that between the different age groups. So it seems age has a bigger
impact than category for this data.
The predictions are simply the average value of the respective data partition.
# look at a random sample of 20 observations
# the names indicate the terminal node
predict(lex_dec_tree1) %>%
  sample(20)

       6        6        7        3        7        6        4        4 
6.444560 6.444560 6.429946 6.664955 6.429946 6.444560 6.653983 6.653983 
       4        7        3        6        7        6        3        4 
6.653983 6.429946 6.664955 6.444560 6.429946 6.444560 6.664955 6.653983 
       4        3        6        3 
6.653983 6.664955 6.444560 6.664955 
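For a regression tree, predictive performance is usually summarized by error rather than proportion correct. A quick sketch computing the root mean squared error of the fitted tree, assuming the lexdec and lex_dec_tree1 objects from above:

```r
# root mean squared error of the regression tree's predictions
# (sketch; assumes `lexdec` and `lex_dec_tree1` from above)
preds <- predict(lex_dec_tree1)
rmse <- sqrt(mean((lexdec$RTlexdec - preds)^2))
rmse
```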
Pruning and tuning
Now let’s consider a more complex situation. Let’s define a more complex formula for genitive choice using more predictors. These are mostly categorical predictors, so the model should be relatively simple (we hope).
gen_fmla2 <- Type ~ Possessor.Animacy2 + Final.Sibilant +
SemanticRelation + Possessor.Expression.Type +
  Genre + Corpus + Possessor.Length + Possessum.Length + PossessorThematicity

Check our data. We have some character columns that need to be changed to factors (numeric ones are fine).
gens[, all.vars(gen_fmla2)] %>%
  glimpse()

Rows: 5,098
Columns: 10
$ Type <fct> of, of, of, of, s, s, s, s, s, s, s, of, s, ~
$ Possessor.Animacy2 <fct> animate, animate, animate, animate, animate,~
$ Final.Sibilant <fct> N, Y, N, N, N, N, N, N, N, N, N, N, N, N, N,~
$ SemanticRelation <chr> "BOD", "BOD", "BOD", "BOD", "BOD", "BOD", "B~
$ Possessor.Expression.Type <chr> "CommonN", "CommonN", "ProperN", "CommonN", ~
$ Genre <chr> "General Fiction", "General Fiction", "Gener~
$ Corpus <chr> "Brown", "Brown", "Brown", "Brown", "Brown",~
$ Possessor.Length <dbl> 3, 4, 2, 3, 2, 2, 2, 2, 1, 1, 2, 5, 2, 2, 1,~
$ Possessum.Length <dbl> 2, 2, 2, 2, 1, 2, 1, 4, 6, 1, 1, 2, 1, 2, 1,~
$ PossessorThematicity <dbl> -3.301030, -3.301030, -3.301030, -3.301030, ~
all.vars(gen_fmla2)

 [1] "Type"                      "Possessor.Animacy2"       
[3] "Final.Sibilant" "SemanticRelation"
[5] "Possessor.Expression.Type" "Genre"
[7] "Corpus" "Possessor.Length"
[9] "Possessum.Length" "PossessorThematicity"
The first 7 must be factors.
gens <- gens %>%
  mutate(across(all.vars(gen_fmla2)[1:7], as.factor))

Now we fit a tree using this formula.
gen_ctree2 <- ctree(gen_fmla2, data = gens)
# create a party plot
gen_ctree2 %>%
  plot(gp = gpar(cex = .7))

OMG! We can tweak the graphical parameters some (see below), but this will only get us so far. Really, this is a sign that our model may not be the best model for the purposes of inference, i.e. understanding what is going on in the larger population our data is sampled from. The problem is that the tree-building algorithm will look for any and all possible splits in the data, regardless of whether those divisions are likely to be replicated if we were to take different samples of genitive constructions. This is the problem of overfitting: our tree model is too finely tuned to our specific dataset, and thus does not likely represent a good model of genitive variation in general.
This is also a much more complex tree! One challenge with CARTs is that while small trees are fairly easy to interpret, they can get unwieldy very quickly. A tree this large is very hard to summarise in a way that helps us understand what may be going on, and it is far too complicated to interpret in any useful way.
Adjusting for overfitting can be done post hoc by “pruning” the tree, or it can be done prior to fitting the model by tuning the control settings of the function. Pruning is important for CARTs fit with traditional methods, but it is generally not something that is done with conditional inference trees (indeed, the technique was designed partly to make pruning unnecessary). What we should do instead is decide how to control the growth of the tree beforehand, in an unbiased way. There are a number of ways to do this.
Adjusting p-value
One way is to adjust the level of statistical
significance the tree requires a test to meet before making a
split. A split is only made when a global null hypothesis of
independence can be rejected at a chosen p-value (by default, this
is the standard α = 0.05). You can adjust this under the
ctree_control() options.
Let’s set it to require a p-value of 0.001 or below:
gen_ctree2b <- ctree(gen_fmla2,
data = gens,
control = ctree_control(mincriterion = .999)
) # now p < .001
# Note the value for 'mincriterion' is 1 minus the desired level: 1 - .001 = .999
plot(gen_ctree2b)

The tree is simpler now, but still pretty unmanageable.
Limiting branching depth
Another way to simplify the tree is to limit the maximum depth of the
branching with the maxdepth argument. The default is
maxdepth = Inf, but you can tell it to stop splitting after
a certain number of levels is reached.
gen_ctree2c <- ctree(gen_fmla2,
data = gens,
control = ctree_control(mincriterion = .999, maxdepth = 3)
)
plot(gen_ctree2c)

This is better, but we may be losing some predictive power. The tree is likely to make less accurate predictions, but it is at least interpretable.
Split and node size
Other options can be adjusted as needed/desired. These include the
number of data points necessary to consider splitting
(minsplit = if there are fewer data points than this value,
the model will not try to find any split) and the minimum number of data
points that the resulting subsets must contain (minbucket =
if a given split results in one or more partitions with fewer data
points than this value, the model will ignore this split).
gen_ctree2d <- ctree(gen_fmla2,
data = gens,
control = ctree_control(mincriterion = .999, minsplit = 1000L)
)
plot(gen_ctree2d, gp = gpar(cex = .7))

gen_ctree2e <- ctree(gen_fmla2,
data = gens,
control = ctree_control(mincriterion = .999, minsplit = 1000L, minbucket = 200L)
)
plot(gen_ctree2e, gp = gpar(cex = .7))

See the documentation for ctree() and
ctree_control() for more.
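These options can of course be combined. A sketch pulling the settings above together (the specific values here are arbitrary choices for illustration, not recommendations):

```r
# combining control settings: stricter alpha, capped depth, and
# minimum node sizes (values are arbitrary, for illustration only)
gen_ctree2f <- ctree(gen_fmla2,
  data = gens,
  control = ctree_control(
    mincriterion = .999, # require p < .001 for a split
    maxdepth     = 4,    # no more than 4 levels of splits
    minsplit     = 500L, # don't try to split nodes smaller than this
    minbucket    = 100L  # each child node must contain at least this many
  )
)
plot(gen_ctree2f, gp = gpar(cex = .7))
```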
A word of caution about CARTs
CART models seem quite useful; however, it is easy to be led astray by them. For example, it’s often assumed that trees are good at representing interaction effects, but there are cases in which this assumption cannot be maintained. This is sometimes referred to as the “exclusive or” (XOR) problem,
which describes a situation where two variables show no main effect but a perfect interaction. In this case, because of the lack of a marginally detectable main effect, none of the variables may be selected in the first split of a classification tree, and the interaction may never be discovered. (Strobl, Malley & Tutz 2009:341)
I won’t illustrate this here, but a recent simulation study by Gries (2019) illustrates this problem quite nicely (though I don’t know how often it’s likely to occur IRL). The point is that when we don’t know the actual relationship between our predictors and the outcome, we should be extra careful about making claims regarding (the absence of) interactions in the data.
A second word of caution about trees, which to my knowledge has not been raised in the literature, involves the use of splits to identify inflection points (non-linearities) in continuous predictors. For example, Tagliamonte, D’Arcy & Louro (2016) use ctrees to identify what they refer to as “shock points” in the developmental timeline of the English quotative system, specifically the rapid global increase in the use of quotative like (as in he was like, “What happened?”). They illustrate this trend with the tree below:
Tagliamonte et al. (2016:833)
They argue that trees like this reveal the points in time where the use of quotative like began to significantly accelerate (or decelerate), and these points in time naturally open themselves up to interpretation and speculation. Presumably there is some reason why this word shows substantial changes at these particular times.
This seems reasonable, however we need to be aware that split points in trees like these may not actually be very meaningful. Recall that tree models are designed to try to split the data wherever they can, so we might wonder, what will happen when we have a truly linear effect in our data? How would a CART, constrained to make binary partitions in the data, represent a truly linear effect of a predictor x on some outcome y?
Suppose for instance that we had data such as illustrated below. Predictor x clearly has a strong (linear) effect on y, but there is no reason to suspect that any particular values of x split the data better than any others. In other words, there is no obvious curvature in the data here, so there are no “inflection points” suggested by the data. But what is likely to happen if we fit a tree model to such a dataset? How many splits would we get? How are they distributed? Where might the tree split, and why?
Simulated linear effect
We can test this with a small simulation study. Doing so reveals how branching in the tree models can suggest non-linearities in a misleading way. We’ll simulate a dataset predicting choice of quotative like vs. the standard say based on the date of birth (DOB) of a speaker.
set.seed(43214)
# simulate a dataset
DOB <- ceiling(rnorm(200, 1960, 15))
y <- scale(DOB) * 1.5 + rnorm(200, 0, 2) # add some noise
# y is a set of outcome probabilities on the log odds scale
# We'll use log odds because they can range from -Inf to Inf
df <- data.frame(DOB = DOB, y = y) %>%
mutate(
prob = gtools::inv.logit(y), # convert log odds to probabilities
resp = factor(if_else(prob > .5, "like", "say")),
bin = as.numeric(resp) - 1
)
summary(df)

      DOB             y                 prob            resp    
Min. :1920 Min. :-6.29318 Min. :0.001845 like:101
1st Qu.:1950 1st Qu.:-1.95962 1st Qu.:0.123523 say : 99
Median :1960 Median : 0.06341 Median :0.515842
Mean :1961 Mean :-0.07830 Mean :0.488721
3rd Qu.:1971 3rd Qu.: 1.64026 3rd Qu.:0.837569
Max. :2006 Max. : 5.69999 Max. :0.996665
bin
Min. :0.000
1st Qu.:0.000
Median :0.000
Mean :0.495
3rd Qu.:1.000
Max. :1.000
From a conditional density plot, it’s clear that the effect is about as linear as we could want. There is very little curvature in the line dividing the two halves of the plot.
par(mar = c(5, 4, 4, 2)) # increase top margin
cdplot(resp ~ DOB, df, main = "CD plot of simulated linear effect of DOB\non use of quotative 'like' over 'say'")

But, if we fit tree models to the data, they nonetheless suggest some “shock points”, which we might be tempted to interpret as meaningful in some way.
ctree(resp ~ DOB, df) %>%
  plot(main = "Ctree simulated linear effect of DOB\non use of quotative 'like' over 'say'")

The lesson here is that conditional inference trees (and other CARTs) are capable of capturing linear effects, to a certain degree, but they are constrained in ways that other methods are not. We should therefore be cautious about reading too much into the individual split points of continuous predictors derived solely from a tree. It’s always a good idea to verify such patterns using other techniques, such as conditional density plots for categorical outcomes or simple scatterplots for continuous outcomes. If you don’t see much of a pattern in these plots, you probably should not make much of the specific split points in your tree models.1
Summary
Advantages of tree models:
- Non-parametric. Tree models don’t make any assumptions about the distribution of the data, which means they require very little data preparation (unlike parametric methods such as regression).
- Computationally quick and simple. They can work on very large datasets in reasonable amounts of time.
- Easy to understand. In simple cases, trees are relatively easy to understand and interpret even for people without much background in statistics.
Disadvantages of tree models:
- Overfitting. Tree models tend to overfit the data, and require pruning or tuning.
- Sensitive to particularities of your data. This problem is similar in spirit to overfitting. Slight changes can result in different trees, which can give different predictions. This makes the results of the tree less likely to generalize to new data.
- Prone to misinterpretation. Trees with many interacting predictors do not always accurately represent the true patterns in the data (Gries 2019). Trees can suggest spurious non-linearities and inflate effects of individual predictors.
- Overly complex trees. With large datasets and many predictors, we often get very large trees that are difficult to interpret.
- Low accuracy. Trees are generally less accurate than other methods, e.g. regression.
- Not ideal for continuous data. Tree models tend to lose too much information when trying to model continuous outcomes. In these cases, regression models are usually preferred.
- Cannot handle non-independence well (yet): Most current methods don’t have a way of incorporating clustered or hierarchical data structures. There is a package {glmertree} available for fitting mixed effects trees (Fokkema et al. 2018), but this method is still in development and I am not very familiar with it.
For further reading see chapter 14 in Levshina (2015) and chapter 9.2 in Hastie, Tibshirani & Friedman (2009).
Graphical parameters
Large trees can be unwieldy to plot, and {partykit}
offers ways to adjust the graphical settings of your trees to help with
presentation.
help("party-plot")

Basic global parameters are specified with a gpar() object.
?gpar

You can see the current settings like so.

str(get.gpar())

List of 14
$ fill : chr "white"
$ col : chr "black"
$ lty : chr "solid"
$ lwd : num 1
$ cex : num 1
$ fontsize : num 12
$ lineheight: num 1.2
$ font : int 1
$ fontfamily: chr ""
$ alpha : num 1
$ lineend : chr "round"
$ linejoin : chr "round"
$ linemitre : num 10
$ lex : num 1
- attr(*, "class")= chr "gpar"
Fonts
Font size, face (bold, italic), and family (serif, Times, etc.) can be changed with the following parameters.
- cex: Multiplier applied to fontsize
- fontsize: The size of text (in points)
- fontface: The specification of fontface can be an integer or a string. If an integer, then it follows the R base graphics standard: 1 = plain, 2 = bold, 3 = italic, 4 = bold italic. If a string, then valid values are: “plain”, “bold”, “italic”, “oblique”, and “bold.italic”.
- fontfamily: Changes to the fontfamily may be ignored by some devices. The fontfamily may be used to specify one of the Hershey Font families (e.g., HersheySerif) and this specification will be honoured on all devices.
# (cex is a multiplier)
plot(gen_ctree1, gp = gpar(cex = .8, fontfamily = "Times New Roman"))

plot(rc_ctree1, gp = gpar(cex = .8, fontface = "italic"))

Line type & width
Line type (solid, dashed, etc.) and width can be adjusted with the
lty and lwd or lex arguments
respectively (lex is a multiplier similar to
cex).
Line type can be specified using either text (“blank”, “solid”,
“dashed”, “dotted”, “dotdash”, “longdash”, “twodash”) or number (0, 1,
2, 3, 4, 5, 6). Note that lty = “solid” is identical to lty
= 1.
plot(gen_ctree1, gp = gpar(cex = .8, lty = 2)) # dashed lines

plot(gen_ctree1, gp = gpar(cex = .8, lwd = 2))

Colors
plot(gen_ctree1, gp = gpar(cex = .8, col = "blue"))

Changing panels
The panels and edges themselves can be formatted as well (as we saw above)
?panelfunctions

plot(gen_ctree1,
  inner_panel = node_barplot, # draw the inner panels as barplots
  ip_args = list(id = TRUE), # keep node IDs in the inner panels
  tp_args = list(fill = c("palegreen4", "palegreen1"))
)

plot(rc_ctree1,
  gp = gpar(cex = .8),
  ip_args = list(id = FALSE, pval = FALSE), # remove IDs and p-values from inner panels
  tp_args = list(id = FALSE)
)

# color bars
plot(rc_ctree1,
  tp_args = list(fill = heat.colors(3))
)

Citation & Session Info
Grafmiller, Jason. 2022. Classification and regression trees for linguistic analysis. University of Birmingham. url: https://jasongrafmiller.netlify.app/tutorials/tutorial_carts_ctrees.html (version 2022.06.03).
The following is my current setup on my machine.
sessionInfo()

R version 4.2.0 (2022-04-22 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 16299)
Matrix products: default
locale:
[1] LC_COLLATE=English_United Kingdom.1252
[2] LC_CTYPE=English_United Kingdom.1252
[3] LC_MONETARY=English_United Kingdom.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United Kingdom.1252
attached base packages:
[1] grid stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] patchwork_1.1.1 partykit_1.2-15 mvtnorm_1.1-3 libcoin_1.0-9
[5] forcats_0.5.1 stringr_1.4.0 dplyr_1.0.9 purrr_0.3.4
[9] readr_2.1.2 tidyr_1.2.0 tibble_3.1.7 ggplot2_3.3.6
[13] tidyverse_1.3.1 fontawesome_0.2.2 here_1.0.1 knitr_1.39
loaded via a namespace (and not attached):
[1] nlme_3.1-157 fs_1.5.2 lubridate_1.8.0 bit64_4.0.5
[5] httr_1.4.3 rprojroot_2.0.3 R.cache_0.15.0 tools_4.2.0
[9] backports_1.4.1 bslib_0.3.1 utf8_1.2.2 R6_2.5.1
[13] rpart_4.1.16 mgcv_1.8-40 DBI_1.1.2 colorspace_2.0-3
[17] withr_2.5.0 tidyselect_1.1.2 bit_4.0.4 compiler_4.2.0
[21] cli_3.3.0 rvest_1.0.2 xml2_1.3.3 labeling_0.4.2
[25] bookdown_0.26 sass_0.4.1 scales_1.2.0 digest_0.6.29
[29] rmarkdown_2.14 R.utils_2.11.0 pkgconfig_2.0.3 htmltools_0.5.2
[33] styler_1.7.0 dbplyr_2.1.1 fastmap_1.1.0 highr_0.9
[37] rlang_1.0.2 readxl_1.4.0 rstudioapi_0.13 jquerylib_0.1.4
[41] generics_0.1.2 farver_2.1.0 jsonlite_1.8.0 gtools_3.9.2.1
[45] vroom_1.5.7 R.oo_1.24.0 magrittr_2.0.3 Formula_1.2-4
[49] Matrix_1.4-1 munsell_0.5.0 fansi_1.0.3 lifecycle_1.0.1
[53] R.methodsS3_1.8.1 stringi_1.7.6 yaml_2.3.5 inum_1.0-4
[57] parallel_4.2.0 crayon_1.5.1 lattice_0.20-45 haven_2.5.0
[61] splines_4.2.0 hms_1.1.1 pillar_1.7.0 reprex_2.0.1
[65] glue_1.6.2 evaluate_0.15 modelr_0.1.8 vctrs_0.4.1
[69] rmdformats_1.0.4 tzdb_0.3.0 cellranger_1.1.0 gtable_0.3.0
[73] rematch2_2.1.2 assertthat_0.2.1 xfun_0.31 broom_0.8.0
[77] survival_3.3-1 ellipsis_0.3.2
References
My aim here is not to criticize the work of Tagliamonte et al. (2016)—indeed the effects they find appear to be quite robust and genuinely non-linear, which we’d expect of the usual s-curve patterns observed in language change. In fact, if you go back to their data and plot them with a CD plot, it is clear that the “shock” points they observe in their trees likely represent genuine points of substantial change in the community.↩︎