25.4. Ensemble Learning#

An ensemble method is an approach that combines multiple simple “building block” models to obtain a single, potentially more powerful model. The key idea is that by aggregating predictions from multiple models, we can often achieve better performance than any single model alone. These simple building block models are sometimes known as weak learners. In this section, we will learn about ensemble methods such as bagging, random forests, and boosting, where the building blocks are decision trees. We will focus on the underlying intuition behind each approach here and turn to practical implementation in Python in the next section.

Bagging#

Decision trees suffer from high variance, meaning that if we change the training data slightly, the resulting tree can be very different. In contrast, a statistical model with low variance will yield similar results when applied to different training sets. Bagging is a general technique for reducing the variance of a statistical learning method.

The key idea behind bagging is simple: train the building block model on several training datasets to obtain several prediction models, then combine their results. For regression problems, we average the predictions; for classification problems, we take a majority vote (the most commonly occurring prediction). But we usually have only one training dataset, so how can we generate several training datasets from it? This is where bootstrap sampling comes in, a technique we learned in Chapter 13. Bootstrap sampling means sampling from the training dataset with replacement. For example, if our training set has 100 data points, we create a bootstrapped dataset by randomly sampling 100 times from the original data, where the same data point can be selected multiple times. The word “bagging” is short for bootstrap aggregating.
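To make the bootstrap step concrete, here is a minimal sketch using NumPy; the ten-point toy dataset and the random seed are made up purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy "training set" of 10 observations (indices stand in for rows).
data = np.arange(10)

# One bootstrap sample: draw 10 points *with replacement*.
bootstrap_sample = rng.choice(data, size=len(data), replace=True)
print(bootstrap_sample)  # some points repeat, others are left out entirely
```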

Applying this to decision trees, we bootstrap hundreds or thousands of training datasets from our original data, train a decision tree on each, then combine predictions through averaging or majority voting. See the figure below.

Figure: Bagging with Decision Trees

For example, suppose we want to predict house prices. With bagging, we might create 500 bootstrap samples and train 500 different decision trees. When predicting the price of a new house, the trees might give different predictions, say $320K, $335K, $315K, and so on. Our final prediction would be the average of all 500 predictions. For classification, consider spam detection with 1000 bagged trees. When a new email arrives, if 650 trees classify it as spam and 350 as not spam, we take the majority vote: spam. This combined approach typically achieves significantly better accuracy than any single decision tree.
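The following sketch shows what bagging for regression could look like by hand, assuming scikit-learn is available; the synthetic data from `make_regression` simply stands in for a real housing dataset:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

n_trees = 100
trees = []
for _ in range(n_trees):
    # Bootstrap: sample row indices with replacement.
    idx = rng.integers(0, len(X), size=len(X))
    tree = DecisionTreeRegressor()
    tree.fit(X[idx], y[idx])
    trees.append(tree)

# Bagged prediction: average the individual tree predictions.
x_new = X[:1]
prediction = np.mean([t.predict(x_new)[0] for t in trees])
print(prediction)
```

For classification, we would replace the final average with a majority vote over the trees' predicted classes.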

Random Forests#

Random forests are an ensemble method that improves upon bagging. While bagging reduces variance by averaging many trees, it has a limitation: if the trees are too similar to each other, averaging them does not provide much benefit. Consider a scenario where your dataset contains one very strong predictor along with several moderately strong predictors. In standard bagging, most or all of the trees will likely use that strong predictor for the top split, making the trees highly correlated. Random forests solve this problem with a simple but effective tweak that makes the trees more diverse.

As in bagging, we build decision trees on bootstrapped training samples. However, when building each tree, each time a split is considered, we randomly select only \(m\) predictors as split candidates from the full set of \(p\) predictors. The split must use one of those \(m\) predictors. This constraint gives moderately strong predictors a chance to be selected even for top splits, creating more diverse trees that, when combined, produce better predictions.

The main difference between bagging and random forests is this predictor subset size \(m\). A fresh random sample of \(m\) predictors is taken at each split, and typically we choose \(m \approx \sqrt{p}\), that is, the number of predictors considered at each split is approximately equal to the square root of the total number of predictors. Note that bagging is a special case of random forests where \(m = p\).

For example, suppose we are predicting house prices with 16 predictors such as square footage, bedrooms, neighborhood quality, and school district. Intuitively, square footage is likely the strongest predictor since larger homes typically cost more. With random forests using \(m = \sqrt{16} = 4\), at each split in each tree, we randomly select only 4 predictors to consider. Perhaps for the first tree, at the first split we randomly select {square footage, bedrooms, lot size, age}, and square footage wins as the best split. At the next split in this tree, we again randomly select 4 predictors (it could be a completely different set, or some of the same predictors might appear again). The process continues until the tree is complete. For the second tree, built on a different bootstrap sample, we repeat the same process. At the first split of this second tree, the 4 randomly selected features might not include square footage. Perhaps we get {neighborhood quality, school district, bathrooms, garage size}, so one of these must be chosen instead. This randomness ensures that even the strongest predictor won’t dominate every tree, allowing other important features to shape different trees and creating a more diverse ensemble.
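In practice the per-split predictor sampling is handled by the library. Here is a hedged sketch with scikit-learn's `RandomForestRegressor`, using synthetic data as a stand-in for the 16 housing predictors:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic placeholder for a dataset with p = 16 predictors.
X, y = make_regression(n_samples=500, n_features=16, random_state=0)

# max_features controls m, the number of predictors considered at each split.
# "sqrt" gives m = sqrt(16) = 4 here; max_features=None uses all p predictors
# at every split, which reduces the forest to plain bagging.
forest = RandomForestRegressor(n_estimators=500, max_features="sqrt", random_state=0)
forest.fit(X, y)

print(forest.predict(X[:1]))  # ensemble prediction: average over the 500 trees
```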

Boosting#

Like bagging, boosting is a general approach that can be applied to many statistical learning methods for regression or classification. The term “boosting” comes from the idea of boosting weak learners into a strong learner, i.e., taking simple models that perform only slightly better than random guessing and combining them to create a powerful predictive model. While both bagging and boosting combine multiple trees, they differ fundamentally in how those trees are built. In bagging, trees are generated independently on bootstrapped training datasets. In boosting, the key idea is sequential learning: trees are trained one after another, with each new tree focusing on correcting the mistakes made by the previous ones.

The general boosting process works as follows. We start by training a first tree on the original data. This tree makes some predictions, some correct and some incorrect. The next tree is then trained to focus specifically on what the previous tree got wrong. This continues iteratively, with each new tree targeting the weaknesses of the combined model so far. See the figure below. Finally, we combine all trees to make predictions, typically giving more weight to better-performing trees. Note that in boosting, unlike bagging, the construction of each tree depends strongly on the trees already grown.

Figure: Boosting with Decision Trees

Different boosting algorithms implement this general idea in different ways. We will discuss two main approaches: AdaBoost and Gradient Boosting.

AdaBoost#

AdaBoost, short for Adaptive Boosting, was one of the first practical boosting algorithms. The key idea is to adaptively adjust the importance of training examples based on how well previous trees classified them. While AdaBoost is a general ensemble approach that can be applied to many statistical learning methods, it is most commonly used with decision trees.

In AdaBoost, the trees are typically very shallow, usually consisting of a single split: one root node and two leaves. A tree this small is called a stump. Stumps are not great at making accurate classifications because each one uses only a single feature to make its decision, which is what makes them weak learners.

The algorithm works as follows. Initially, all training examples are given equal weight, and we train the first stump on this weighted dataset. After making predictions, we identify which examples were misclassified and increase their weights while decreasing the weights of correctly classified examples. This forces the next stump to focus more on the difficult examples. We repeat this process for a specified number of iterations, with each new stump targeting the examples that the combined ensemble so far still gets wrong. Finally, when making predictions, AdaBoost combines all stumps through weighted voting, where stumps that performed better during training receive higher weights in the final vote.
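A minimal sketch of this procedure using scikit-learn's `AdaBoostClassifier`, where each weak learner is explicitly a stump (a depth-1 tree); the synthetic data is only a placeholder:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Hypothetical binary classification data standing in for, e.g., spam features.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Each weak learner is a stump: a decision tree with a single split.
stump = DecisionTreeClassifier(max_depth=1)

# In scikit-learn >= 1.2 the argument is `estimator`; older versions use `base_estimator`.
ada = AdaBoostClassifier(estimator=stump, n_estimators=200, random_state=0)
ada.fit(X, y)

print(ada.predict(X[:5]))  # weighted vote over the 200 stumps
```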

For example, suppose we are classifying emails as spam or not spam. The first stump might correctly identify obvious spam but miss more subtle cases. AdaBoost increases the weights on these subtle spam emails, so the second stump focuses specifically on learning patterns in these misclassified cases. Perhaps it learns that certain sophisticated phishing patterns indicate spam. The third stump then focuses on whatever examples the first two stumps combined still struggle with. When a new email arrives, all stumps vote, with better-performing stumps receiving more weight in the final decision, resulting in a robust classification model.

While we have focused on classification here, AdaBoost can also be adapted for regression problems, though other boosting methods like Gradient Boosting (discussed next) are more commonly used for regression.

Gradient Boosting#

Gradient Boosting is another powerful boosting algorithm that works for both regression and classification problems, though we will focus on regression here because it makes the intuition clearer. Unlike AdaBoost, which adjusts weights on training examples, Gradient Boosting takes a different approach: it repeatedly fits new trees to the residuals (errors) of the current model. The trees in Gradient Boosting are typically shallow but usually larger than the stumps used in AdaBoost.

The key idea is simple. We start by making an initial prediction, often just the average of all target values. This initial model will have errors: some predictions will be too high, some too low. These errors are called residuals (actual value minus predicted value). Instead of trying to predict the original target values, the next tree tries to predict these residuals. We then add this tree’s predictions to our current model, creating an improved model. We calculate new residuals from this improved model and train the next decision tree to predict these new residuals. This process continues for a specified number of iterations, with each tree focusing on correcting the errors of the model so far. After building many trees, the final prediction for a new example is the initial prediction plus the sum of all the tree predictions.
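The loop below is a bare-bones sketch of this residual-fitting process, assuming scikit-learn; the shrinkage factor (`learning_rate`) is a standard detail that keeps each correction a small step:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=5, noise=15.0, random_state=0)

learning_rate = 0.1                       # shrink each tree's contribution (small steps)
n_trees = 100
prediction = np.full(len(y), y.mean())    # initial model: the average target value
trees = []

for _ in range(n_trees):
    residuals = y - prediction                 # errors of the current model
    tree = DecisionTreeRegressor(max_depth=3)  # shallow tree fit to the residuals
    tree.fit(X, residuals)
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)

def predict(x_new):
    # Final prediction: initial average plus the (scaled) sum of all tree predictions.
    return y.mean() + learning_rate * sum(t.predict(x_new) for t in trees)

print(predict(X[:1]))
```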

For example, suppose we want to predict house prices. Our initial model might predict $300K for all houses (the average). For a house actually worth $350K, the residual is $50K. For a house worth $280K, the residual is -$20K. We train a decision tree to predict these residuals using the house features. Maybe this tree learns that houses with large square footage have positive residuals. We add this tree's predictions to our model. Now our model might predict $330K for the first house and $290K for the second. The residuals are now $20K and -$10K, which are smaller than before. We train another tree to predict these new, smaller residuals, continuing to refine our predictions. Each tree makes the model slightly better by correcting the remaining errors. The main idea is that by taking many small steps, with each step correcting the errors of the previous steps, we gradually build up to much better predictions.

Gradient Boosting has become one of the most popular machine learning algorithms due to its strong predictive performance. Modern implementations like XGBoost, LightGBM, and CatBoost have made it even more powerful and efficient. These are widely used in data science competitions and real-world applications.
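As a rough illustration of how these libraries are typically called (assuming scikit-learn and the separate xgboost package are installed; the data and hyperparameter values are placeholders):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from xgboost import XGBRegressor  # requires the xgboost package

X, y = make_regression(n_samples=500, n_features=10, random_state=0)

# scikit-learn's built-in gradient boosting
gbr = GradientBoostingRegressor(n_estimators=200, learning_rate=0.1, max_depth=3)
gbr.fit(X, y)

# XGBoost exposes a similar fit/predict interface; LightGBM (LGBMRegressor)
# and CatBoost (CatBoostRegressor) follow the same pattern.
xgb = XGBRegressor(n_estimators=200, learning_rate=0.1, max_depth=3)
xgb.fit(X, y)

print(gbr.predict(X[:1]), xgb.predict(X[:1]))
```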