25.2. Decision Trees#

A decision tree is an algorithm that arrives at a prediction by asking a series of questions about the features. It works like a flowchart, making decisions step by step based on feature values.

Let us look at an example. Suppose you are trying to decide whether to watch a movie tonight, a classification task where the possible outcomes are “Movie Tonight!” or “No Movie Tonight”. Your decision depends on three factors: whether you finished your homework, the ticket price, and the movie’s rating. Based on data from your past experiences, we can generate the following decision tree:

Figure: Example of a Classification Decision Tree

How do we interpret this tree?

The tree reveals that the most important factor in your decision is whether you finished your homework. If you have not, you decide against the movie. If you have finished, you consider the ticket price next. If the ticket costs $10 or less, you go to the movie. If it costs more than $10, you check the movie’s rating: if the rating is 6 or lower, you skip the movie; otherwise, you go.

Notice how interpretable this algorithm is compared to other methods we have seen. We can clearly understand which factors matter, in what order, and exactly how they influence the final decision.
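
To make this concrete, here is a minimal sketch that fits such a tree with scikit-learn on a small, made-up version of the movie data. The feature values, the labels, and the exact tree that gets learned are purely illustrative and need not match the figure above.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical past experiences: [homework_done (0/1), ticket_price ($), rating (0-10)]
X = [
    [0, 8, 7], [0, 12, 9], [1, 9, 5], [1, 9, 8],
    [1, 14, 5], [1, 14, 8], [1, 11, 4], [1, 10, 6],
]
# 1 = "Movie Tonight!", 0 = "No Movie Tonight"
y = [0, 0, 1, 1, 0, 1, 0, 1]

tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X, y)

# Print the learned tree as a flowchart of yes/no questions
print(export_text(tree, feature_names=["homework_done", "ticket_price", "rating"]))
```

The printed rules read much like the interpretation above: each line asks a yes/no question about one feature, and the indented lines beneath it follow the two possible answers.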

Structure of a Decision Tree#

Decision trees consist of a root node, internal nodes, branches, and leaf nodes. The root node is the topmost node in a decision tree. It represents the first question or the first split of the feature space. Internal nodes represent subsequent splits made as we move down the tree. Branches are the connections between nodes that lead us from one decision to the next. The leaf nodes (or leaves) are the terminal nodes that provide the final prediction. For regression tasks, the leaves contain predicted numerical values, and for classification tasks, the leaves contain class labels.

Notice that decision trees are drawn upside down compared to trees in nature: the root is at the top and the leaves are at the bottom.

For our example above:

  • Root node: “Homework Completed”

  • Internal nodes: “Ticket price ≤ $10”, “Movie Rating ≤ 6”

  • Leaf nodes: The gray nodes with no further splits: “Movie Tonight!” or “No Movie Tonight”

  • Branches: The lines connecting these nodes based on Yes/No answers

Now, let us look at an example of a decision tree performing a regression task: predicting the average number of hours of sleep for college students based on three factors: daily caffeine intake, stress level (Low, Medium, High), and daily screen time.

Figure: Example of a Regression Decision Tree

At the root of the tree, the first split is on caffeine intake < 150 mg, suggesting that caffeine intake is a particularly important factor in predicting sleep duration. If caffeine intake is less than 150 mg, the tree next considers the student’s stress level. For students with low stress, the model predicts an average of 8 hours of sleep, whereas for those with medium or high stress, it predicts 6 hours. If caffeine intake is 150 mg or more, the tree again checks stress level. For students with low or medium stress, the predicted average sleep duration is 6.5 hours. However, if stress level is high, the tree further splits on screen time. Students with less than 3 hours of screen time are predicted to sleep 5.5 hours on average, while those with 3 or more hours of screen time are predicted to sleep 4 hours.
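
This prediction logic can be written out directly as nested yes/no questions. The function below is a hand-coded sketch of the tree just described; the thresholds and leaf values are taken from the description above, not learned from data.

```python
def predict_sleep_hours(caffeine_mg: float, stress: str, screen_hours: float) -> float:
    """Follow the regression tree described above to predict average sleep hours."""
    if caffeine_mg < 150:
        # Low-caffeine branch: split on stress level
        return 8.0 if stress == "Low" else 6.0
    # High-caffeine branch: split on stress level again
    if stress in ("Low", "Medium"):
        return 6.5
    # High stress: split further on screen time
    return 5.5 if screen_hours < 3 else 4.0

print(predict_sleep_hours(100, "Low", 5))   # 8.0
print(predict_sleep_hours(200, "High", 4))  # 4.0
```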

Notice that stress level is a categorical variable with multiple levels (Low, Medium, High). Modern decision trees handle such variables using binary splits, in which categories are grouped into two sets. As this example illustrates, the same categorical variable can be split in different ways at different points in the tree. For instance, it may be split as {Low} versus {Medium, High} at one node and as {Low, Medium} versus {High} at another.

Note

For nominal categorical variables with multiple levels, decision trees make binary splits by grouping categories into two subsets that best separate the response variable. The same categorical variable may be split differently at different points in the tree.

While some older decision tree algorithms allowed multi-way splits (splitting into more than two branches at once), modern tree-based methods such as CART, Random Forests, and Gradient Boosting rely exclusively on binary splits, so multi-way splits are now uncommon in practice. In this chapter, we therefore focus on decision trees with binary splits, which underlie all modern tree-based methods.
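
As a rough illustration of what grouping categories into two subsets means, the sketch below enumerates every possible binary split of the stress-level variable. A tree-growing algorithm would score each grouping with its splitting criterion (discussed in the next section) and keep the best one; the helper function here is purely illustrative and not part of any particular library.

```python
from itertools import combinations

def binary_groupings(categories):
    """List all ways to split a set of categories into two non-empty groups."""
    cats = list(categories)
    splits = []
    # Keep only subsets containing the first category so each split
    # is listed once (A | B is the same split as B | A).
    for size in range(1, len(cats)):
        for left in combinations(cats, size):
            if cats[0] in left:
                right = tuple(c for c in cats if c not in left)
                splits.append((left, right))
    return splits

for left, right in binary_groupings(["Low", "Medium", "High"]):
    print(left, "vs", right)
# ('Low',) vs ('Medium', 'High')
# ('Low', 'Medium') vs ('High',)
# ('Low', 'High') vs ('Medium',)
```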

How are Decision Trees “Grown”?#

Trees are “grown” by asking a yes/no question about one of the features at each node. The answers lead us to the next nodes down the tree. Essentially, at each step we are dividing the predictor space into smaller regions. We continue this process until some stopping criterion is reached. We make a prediction for a new data point based on the training observations that fall into the region containing that point.

The major steps in the construction of a decision tree are:

Step 1: Start with the dataset. We need a labeled dataset with features and a target variable.

Step 2: Choose the best feature and split point. At each node, we evaluate all possible splits and choose the one that best separates the data into homogeneous groups.

To understand what “splitting” means, consider the case of continuous numeric features. Dividing the feature space means literally drawing boundaries to form regions. Theoretically, these regions could take any shape, but we choose to divide the space into rectangular regions for simplicity and interpretability. Suppose we have features \(X_1, X_2, \ldots, X_p\). We divide the predictor space into \(m\) distinct regions: \(R_1, R_2, \ldots, R_m\). The figure below shows a split of a two-dimensional feature space with continuous features \(X_1\) and \(X_2\) into the rectangles \(R_1\), \(R_2\), \(R_3\), and \(R_4\). If the feature space is three-dimensional with features \(X_1\), \(X_2\), and \(X_3\), then each \(R_j\) is a three-dimensional rectangle (a cuboid), and so on.

Figure: Splitting feature space with two continuous features into simpler rectangular regions
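
Because each split compares a single feature to a threshold, assigning a point to a region never requires more than a sequence of threshold checks. The sketch below uses two made-up thresholds (not taken from the figure) to show one simple way of carving a two-dimensional feature space into four rectangles.

```python
# Assumed, purely illustrative thresholds
T1 = 5.0  # cut point on X1
T2 = 3.0  # cut point on X2

def region(x1: float, x2: float) -> str:
    """Assign a point to one of four rectangular regions via axis-aligned cuts."""
    if x1 <= T1:
        return "R1" if x2 <= T2 else "R2"
    return "R3" if x2 <= T2 else "R4"

print(region(2.0, 1.0))  # R1
print(region(7.5, 4.0))  # R4
```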

But how exactly do we decide which split is “best”? Unfortunately, it is computationally infeasible to consider all possible partitions of the feature space. For this reason, we take a top-down, greedy approach known as recursive binary splitting.

The process is called top-down because it begins at the root node (where all observations belong to a single region) and successively splits the data as we move down the tree. It is called greedy because at each step of the tree-building process, the best split is made at that particular step, rather than looking ahead and picking a split that will lead to a better tree in some future step. Finally, it is called binary splitting because at each node, the data is split into exactly two branches based on a single feature and threshold.

To evaluate which split is best, we use different criteria depending on whether we are doing classification or regression:

  • For Classification Trees: We use Gini impurity to measure how mixed the classes are in a node: \(\text{Gini} = 1 - \sum_{i=1}^{K} p_i^2\), where \(K\) is the number of classes and \(p_i\) is the proportion of training observations in the node that belong to class \(i\). A Gini value of 0 indicates perfect purity (all observations belong to one class), while higher values indicate more mixing of classes. We choose the split that results in the largest reduction in Gini impurity.

  • For Regression Trees: We use Residual Sum of Squares (RSS) to measure the total squared deviation of the responses from the node’s predicted value (the mean): \(\text{RSS} = \sum_{i \in R} (y_i - \bar{y}_R)^2\), where \(R\) represents a region (node), \(y_i\) are the target values of observations in that region, and \(\bar{y}_R\) is the mean of those target values. Lower RSS means observations have similar target values. We choose the split that results in the largest reduction in RSS.
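
Here is a minimal, from-scratch sketch of how these criteria drive the choice of split for a single feature. A real implementation (for example, the one inside scikit-learn) is far more efficient and searches over all features, but the idea is the same: try every candidate threshold and keep the split with the largest reduction in the criterion.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a node: 1 minus the sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def rss(values):
    """Residual sum of squares of a node around its mean prediction."""
    values = np.asarray(values, dtype=float)
    return np.sum((values - values.mean()) ** 2)

def best_split(x, y, criterion):
    """Greedy search over one feature: try every threshold and keep the split
    that most reduces the chosen criterion (Gini impurity or RSS)."""
    x, y = np.asarray(x), np.asarray(y)
    parent = criterion(y)
    best = None
    for t in np.unique(x)[:-1]:              # candidate thresholds
        left, right = y[x <= t], y[x > t]
        n = len(y)
        if criterion is gini:
            # Weight each child's impurity by its share of the observations
            child = len(left) / n * gini(left) + len(right) / n * gini(right)
        else:
            # For RSS, the children's sums simply add up
            child = rss(left) + rss(right)
        reduction = parent - child
        if best is None or reduction > best[1]:
            best = (t, reduction)
    return best  # (threshold, reduction in criterion)

# Regression: split on caffeine intake to reduce RSS (made-up data)
caffeine = [80, 100, 120, 160, 200, 220]
sleep    = [8.0, 7.5, 8.0, 6.0, 5.5, 5.0]
print(best_split(caffeine, sleep, rss))   # best threshold is 120, large RSS drop

# Classification: split on ticket price to reduce Gini impurity (made-up data)
price = [8, 9, 10, 12, 14, 15]
go    = [1, 1, 1, 0, 0, 0]
print(best_split(price, go, gini))        # best threshold is 10, impurity drops to 0
```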

Step 3: Split the data. Create two child nodes based on the chosen split.

Step 4: Repeat recursively. Apply Steps 2 and 3 to each child node until a stopping criterion is met.

Step 5: Stop when criteria are met. Common stopping criteria include reaching a maximum tree depth, having too few observations in a node, or when all observations in a node have the same target value (a pure leaf).
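
In most libraries, these stopping criteria appear as hyperparameters of the tree implementation. As an illustration, in scikit-learn they correspond to arguments such as max_depth, min_samples_split, and min_samples_leaf; the specific values below are arbitrary, and the same arguments exist for DecisionTreeClassifier.

```python
from sklearn.tree import DecisionTreeRegressor

# Common stopping criteria expressed as scikit-learn hyperparameters
tree = DecisionTreeRegressor(
    max_depth=4,            # stop once the tree reaches depth 4
    min_samples_split=10,   # do not split nodes with fewer than 10 observations
    min_samples_leaf=5,     # every leaf must keep at least 5 observations
    random_state=0,
)
```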

Making Predictions#

During tree construction, each leaf node is assigned a prediction value based on the training observations in that region. For regression tasks, the prediction is the average of the target values of all training samples in that region. For classification tasks, the prediction is the mode (most frequently occurring class) among the training data points in that region.

When making predictions for new data, we simply determine which leaf the new data point falls into and return that leaf’s assigned value.
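
A minimal sketch of this rule, assuming we already know which training targets ended up in the same leaf as the new data point (the values below are made up):

```python
from statistics import mean, mode

# Training targets that fell into the same leaf as the new data point
leaf_targets_regression = [7.5, 8.0, 8.0, 7.0]
leaf_targets_classification = ["Movie", "Movie", "No Movie"]

print(mean(leaf_targets_regression))      # regression: average of the leaf -> 7.625
print(mode(leaf_targets_classification))  # classification: most common class -> 'Movie'
```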

Tree Pruning#

When decision trees grow to their maximum depth, they tend to overfit by capturing noise in the training data rather than true patterns. This results in poor performance on new, unseen data.

Just as we prune natural trees to maintain their health, tree pruning limits unnecessary growth, either by preventing weak branches from forming or by removing branches that contribute little to predictive accuracy. This creates simpler trees that often generalize better to new data.

There are two main pruning approaches:

  • Pre-pruning (Early Stopping): Stop the tree from growing too deep by setting stricter stopping criteria (like those in Step 5). This prevents the tree from capturing noise during construction.

  • Post-pruning: First, grow the tree fully until no more splits are possible. Then, look back and remove branches that add complexity without meaningfully improving predictions on new data.
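
As an illustration, scikit-learn implements post-pruning via cost-complexity pruning: each value of ccp_alpha prunes away progressively more of the weak branches, and we can compare the resulting trees on held-out data. The dataset here (the classic iris data) is just a convenient stand-in.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Grow a full tree, then ask scikit-learn for its pruning path:
# each ccp_alpha value corresponds to pruning away another weak branch.
full_tree = DecisionTreeClassifier(random_state=0)
path = full_tree.cost_complexity_pruning_path(X_train, y_train)

# Refit one tree per alpha; larger alphas give smaller trees, and a
# moderate alpha often generalizes best on the held-out test set.
for alpha in path.ccp_alphas:
    pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    print(f"alpha={alpha:.4f}  leaves={pruned.get_n_leaves()}  "
          f"test accuracy={pruned.score(X_test, y_test):.3f}")
```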