In supervised machine learning, a model is typically trained on labeled data with the goal of generalizing its predictive ability to unseen examples. However, individual models often suffer from limitations. For example, a decision tree may be prone to high variance, making it sensitive to slight changes in the training data, while a linear model may exhibit high bias if the relationship between features and outcome is non-linear. This tradeoff — known as the bias-variance tradeoff — is central to understanding the limitations of single models.
Ensemble learning addresses this challenge by combining multiple models to form a single, more robust predictive model. The fundamental idea is that by aggregating the outputs of several models, each of which may capture different aspects of the data, the ensemble will produce more accurate and stable predictions than any single model could achieve alone.
From a theoretical perspective, ensemble methods improve generalization by reducing variance, bias, or both, depending on how the individual models are constructed and aggregated. Suppose we have a function \(f(x)\) that maps input \(x\) to a target variable \(y\), and we aim to approximate this function using a learning algorithm \(\hat{f}(x)\). The expected prediction error of \(\hat{f}(x)\) can be decomposed as:

\[ \mathbb{E}\left[\left(y - \hat{f}(x)\right)^2\right] = \underbrace{\left(\mathbb{E}[\hat{f}(x)] - f(x)\right)^2}_{\text{bias}^2} + \underbrace{\mathbb{E}\left[\left(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\right)^2\right]}_{\text{variance}} + \underbrace{\sigma^2}_{\text{irreducible error}} \]

where \(\sigma^2\) is the variance of the noise in \(y\).
Ensemble methods seek to reduce the variance term by averaging out the fluctuations of individual learners, or reduce the bias by combining weak learners that each capture a portion of the signal.
To illustrate this, consider a binary classification problem where a set of weak classifiers each perform slightly better than random guessing. While each classifier on its own may be unreliable, aggregating their predictions through majority voting can produce a stronger classifier with reasonably high accuracy, provided the errors are not too strongly correlated.
An essential concept that underlies the success of ensemble methods is diversity among the base learners. Naturally, if all learners make the same errors, combining them adds little to no value. However, if their errors are uncorrelated, the ensemble can cancel out individual mistakes and converge toward the correct prediction. This principle motivates the use of different sampling strategies, model types, or training objectives when building ensemble models – the key is diversity.
The ensemble approach is not only empirically successful but is also theoretically grounded. A well-known result in ensemble theory, the Condorcet Jury Theorem, suggests that under certain conditions, a majority vote of many weak learners can be more accurate than any individual learner, provided each has an accuracy slightly above 50% and makes independent errors.
In practice, ensemble models such as random forests, gradient boosting machines, and stacked generalizations have become integral to competitive machine learning pipelines. They form the backbone of many winning solutions in data science competitions and are widely adopted in real-world systems where predictive accuracy and robustness are paramount.
The short video below summarizes the key concepts behind ensemble learners.
With this foundational understanding, we now turn to a more formal classification of ensemble models based on the homogeneity or heterogeneity of their base learners, which will serve as a conceptual framework for the rest of the chapter.
Motivating Example
Before diving into the background and justification for ensemble learners, let’s build a simple homogeneous ensemble ourselves. A homogeneous ensemble is one where all models within the ensemble are from the same algorithm but trained on different subsets of the labeled data. We will use regression models as our base models for the built-in mtcars dataset. The structure of the dataset is as follows:
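One way to view that structure is with the base R function str(), which prints each column's type along with its first few values:

# Inspect the structure of the built-in mtcars dataset
str(mtcars)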
Let’s start by creating a multiple linear regression model that predicts the target feature mpg from all other variables. The dataset has 32 observations and 10 features. We will use a random subset of 70% of the available data.
# create training subset of 70% of randomly selected rows
train.sample <- sample(x = 1:nrow(mtcars), size = round(nrow(mtcars) * 0.7, 0), replace = F)
train.df <- mtcars[train.sample, ]

# train the regression model on the training subset
full.model <- lm(formula = mpg ~ ., data = train.df)
summary(full.model)
The regression model is overall statistically significant (p < 0.05), although not all of the coefficients are statistically significant – an issue that we will ignore for expediency. The adjusted \(R^2\) is quite high. The code below calculates the RMSE (root mean square error) for the model based on the testing data that was held back. Recall that the first column is mpg, the target feature.
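A minimal sketch of that calculation, assuming the train.sample vector created above (the rows not used for training form the test set):

# hold-back test set: rows not selected for training
test.df <- mtcars[-train.sample, ]

# predict mpg for the test rows and compute the RMSE
pred.mpg <- predict(full.model, newdata = test.df)
rmse <- sqrt(mean((test.df$mpg - pred.mpg)^2))
rmse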
Of course, if we had chosen a different subset of observations for the training data, the results would have been a bit different. In a large dataset that is imbalanced and may contain imputed missing values as well as outliers, the results could be more stark. To combat this issue, we train multiple regression models on different subsets of the data. For a smaller dataset it may also make sense to create bootstrap samples with replacement.
# Load required package
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
# Number of partitions and sample size
n.partitions <- 10
sample.size <- round(0.7 * nrow(mtcars))

# Create a list to hold the partitions
partitions <- list()
out.of.bag <- list()

# Generate partitions with replacement
for (i in 1:n.partitions) {
  sampled.indices <- sample(1:nrow(mtcars), size = sample.size, replace = TRUE)
  partitions[[i]] <- mtcars[sampled.indices, ]
  out.of.bag[[i]] <- mtcars[-sampled.indices, ]
}

# Inspect the first partition
head(partitions[[1]], 3)
Once we have the training samples, we can train the models:
# Create a list to hold the models
models <- list()

# Fit a linear regression model on each training partition
for (i in 1:length(partitions)) {
  train.data <- partitions[[i]]
  model <- lm(mpg ~ ., data = train.data)
  models[[i]] <- model
}
We are now ready to create the ensemble. For this, we create a function lm.ensemble() that takes a list of models as input plus a dataframe with new data. The ensemble then makes a prediction with each model in the list of models and returns the average of the predictions for each row of the new data.
lm.ensemble <- function(models, new.data) {
  # Check that models is a list and new.data is a data frame
  if (!is.list(models)) stop("models must be a list")
  if (!is.data.frame(new.data)) stop("new.data must be a data frame")

  # Predict with each model and store results in a matrix
  predictions <- sapply(models, function(model) {
    predict(model, newdata = new.data)
  })

  # If only one row in new.data, predictions will be a vector; convert to matrix
  if (is.vector(predictions)) {
    predictions <- matrix(predictions, nrow = 1)
  }

  # Return the row-wise average of predictions
  return(rowMeans(predictions))
}
Let’s use the ensemble to make some predictions for some new data.
# Create one row of synthetic input data
new.data <- data.frame(cyl = 6, disp = 220, hp = 110, drat = 3.6, wt = 2.8,
                       qsec = 17.0, vs = 1, am = 1, gear = 4, carb = 2)

# Predict using the linear model ensemble
predicted.mpg <- lm.ensemble(models, new.data)

# Display the prediction
cat("Ensemble Prediction: ", predicted.mpg)
## Ensemble Prediction: 23.25162
We can also calculate the MSE or RMSE for the ensemble using the out-of-bag data that we held back for each sample. We will leave that as an exercise.
Rather than calculating the simple average, we can improve the ensemble by weighting each base learner according to its performance – using RMSE as the evaluation metric.
lm.ensemble <- function(models, new.data) {
  if (!is.list(models)) stop("models must be a list")
  if (!is.data.frame(new.data)) stop("new.data must be a data frame")

  # Compute RMSE for each model on its training data
  rmses <- sapply(models, function(model) {
    y.true <- model$model$mpg
    y.pred <- predict(model, newdata = model$model)
    sqrt(mean((y.true - y.pred)^2))
  })

  # Convert RMSEs to weights: inverse of RMSE (higher weight for lower error)
  inv.rmses <- 1 / rmses
  weights <- inv.rmses / sum(inv.rmses)

  # Predict with each model on new data
  predictions <- sapply(models, function(model) {
    predict(model, newdata = new.data)
  })

  # Ensure predictions are in matrix form
  if (is.vector(predictions)) {
    predictions <- matrix(predictions, nrow = 1)
  }

  # Apply weights to compute weighted average prediction
  weighted.preds <- predictions %*% weights
  as.vector(weighted.preds)
}
# Create one row of synthetic input data
new.data <- data.frame(cyl = 6, disp = 220, hp = 110, drat = 3.6, wt = 2.8,
                       qsec = 17.0, vs = 1, am = 1, gear = 4, carb = 2)

# Predict using the weighted linear model ensemble
predicted.mpg <- lm.ensemble(models, new.data)

# Display the prediction
cat("Ensemble Prediction: ", predicted.mpg)
## Ensemble Prediction: 23.93473
The prediction of the weighted ensemble is different and likely more accurate. We could confirm that by calculating an evaluation metric for the weighted ensemble and comparing it to the simple-average ensemble that we built first.
Now that we see how to construct an ensemble, let’s explore ensemble methods and the underlying theory in more detail.
Background and Justification
Condorcet Jury Theorem
The Condorcet Jury Theorem is a classical result from political science and probability theory that offers a powerful theoretical justification for ensemble learning, particularly in the context of voting-based classification models.
Originally formulated by the 18th-century French philosopher and mathematician Marquis de Condorcet, the theorem was intended to describe the behavior of juries in democratic decision-making. However, its implications extend naturally to machine learning, where multiple classifiers (analogous to jurors) cast votes to determine a predicted outcome.
The basic version of the Condorcet Jury Theorem can be stated as follows:
Suppose there is a binary decision problem (e.g., deciding whether an individual is guilty or not). Each of \(n\) independent voters (or classifiers) makes a decision, and each has a probability \(p > 0.5\) of being correct. Then, as the number of voters \(n\) increases, the probability that the majority vote is correct approaches 1.
In other words, if each voter is better than random guessing, and their decisions are independent, then the probability that the collective decision is correct increases with the number of voters. The implication is profound: many weak but slightly competent decision-makers can be aggregated into a highly competent decision system through majority voting.
Let us formalize the intuition a bit. Suppose we have \(n\) independent base classifiers, each of which has a probability \(p > 0.5\) of correctly classifying an instance. Let \(X_i\) be the Bernoulli random variable indicating whether the \(i\)-th classifier makes the correct prediction. Then the sum \(S_n = \sum_{i=1}^n X_i\) represents the total number of correct predictions. The majority vote will be correct if \(S_n > n/2\).
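For an odd number of voters \(n\), the probability that the majority is correct can be written explicitly using the binomial distribution:

\[ P\left(S_n > \frac{n}{2}\right) = \sum_{k = \lceil n/2 \rceil}^{n} \binom{n}{k} p^k (1 - p)^{n - k} \]

For instance, with \(n = 3\) and \(p = 0.6\), the majority is correct with probability \(0.6^3 + 3(0.6^2)(0.4) = 0.648\) – already better than any single voter.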
By the law of large numbers and the central limit theorem, as \(n \to \infty\), the proportion \(S_n/n\) converges to \(p\), and the distribution of \(S_n\) becomes more sharply peaked around \(np\). Because \(p > 0.5\), the mass of the distribution increasingly lies above \(n/2\), making the majority vote more likely to be correct.
In the context of ensemble learning, each base classifier corresponds to a juror. If each classifier has an individual accuracy slightly above 50%, and the errors are independent, then aggregating their predictions by majority voting yields a strong classifier with accuracy approaching 100% as more classifiers are added. This theorem provides a strong motivation for ensemble methods such as bagging, where weak learners (like shallow decision trees) are combined in large numbers to form a powerful model like the random forest.
However, the key assumptions of the theorem—independence of decisions and uniform competence (i.e., \(p > 0.5\))—must be carefully examined in practice. In machine learning, base classifiers are often correlated due to being trained on similar data or using similar algorithms. If classifiers make similar errors, the ensemble gains little from their combination. Thus, promoting diversity among base models is essential to approaching the theoretical guarantees suggested by the Condorcet result.
In short, the Condorcet Jury Theorem mathematically supports the intuition that many weak learners, when combined properly and when sufficiently diverse, can yield a strong learner. It formalizes the idea that ensemble learning can, under the right conditions, drastically improve the reliability of predictions over any single model.
Delphi Method
While the Condorcet Jury Theorem and the Delphi Method, often used for estimation, arise from different intellectual traditions — probability theory and expert elicitation, respectively — they share a foundational premise: aggregating judgments from multiple individuals can lead to more accurate or reliable decisions than relying on a single individual. However, their mechanisms and assumptions differ in important ways.
The Condorcet Jury Theorem applies to binary decision-making. It asserts that if a group of individuals (jurors or classifiers) each independently makes a correct decision with a probability greater than 0.5, then the majority vote becomes increasingly likely to be correct as the number of individuals increases. The key assumptions are individual competence above chance and independence of decisions.
The Delphi Method, in contrast, is a structured forecasting and estimation technique used to aggregate expert opinion. It involves multiple rounds of anonymous feedback among a panel of experts. After each round, a facilitator summarizes the estimates and reasoning, and experts are encouraged to revise their estimates based on the group response. The process iterates until convergence or diminishing changes. The method does not rely on majority voting but rather on iterative refinement and convergence of expert judgment.
Points of Connection
Collective Intelligence: Both approaches operate on the principle that group decision-making can outperform individual decision-making, particularly when individual inputs are aggregated in some structured manner. The Condorcet theorem provides theoretical backing for this idea under probabilistic assumptions, while the Delphi method operationalizes it in real-world expert settings.
Improvement through Aggregation: In both methods, the final outcome is intended to be superior to that of any single contributor. The Condorcet theorem shows improvement via mathematical aggregation (voting); the Delphi method achieves improvement through social and cognitive refinement across rounds.
Role of Diversity: Diversity of opinion is crucial in both approaches. In the Condorcet framework, if all voters are identical or make the same errors, there is no benefit to aggregation. Similarly, in the Delphi method, a diversity of expertise, perspectives, and initial judgments enriches the iterative convergence process.
Reducing Noise: Both techniques serve as noise-reduction mechanisms. The Condorcet Jury Theorem reduces stochastic error through statistical averaging, while the Delphi method reduces subjective error by allowing individuals to reflect on and adjust their estimates in light of the group’s feedback.
Important Differences
Type of Problem: The Condorcet Jury Theorem is strictly concerned with binary decisions—correct versus incorrect—whereas the Delphi Method is suited for complex estimation problems that may involve continuous values, future events, or uncertainty that cannot be reduced to binary correctness.
Aggregation Method: Condorcet uses majority rule and depends on probabilistic models; Delphi uses iterative consensus informed by qualitative reasoning, with no explicit voting or majority rule.
Independence Assumption: The Condorcet model assumes independence among decision-makers. Delphi explicitly violates independence after the first round by encouraging convergence through structured feedback. In fact, Delphi’s effectiveness often depends on interdependence and iterative correction.
Goal of Process: The Condorcet theorem aims for objective accuracy based on probabilistic correctness, while the Delphi method seeks subjective consensus or informed judgment under uncertainty, especially when data are scarce or incomplete.
To summarize, while the Condorcet Jury Theorem and the Delphi Method originate from different frameworks—one from formal decision theory and the other from behavioral forecasting—they both embody a central belief in the wisdom of crowds, albeit through different lenses. Condorcet offers a mathematical idealization of ensemble voting, useful for machine learning and classification theory. Delphi offers a real-world approach to combining expert opinion, often used in forecasting, policy planning, and risk assessment.
In the context of ensemble learning, we might venture to say that Condorcet justifies voting-based classifier aggregation, while Delphi inspires meta-learning frameworks where model outputs are refined iteratively or integrated based on expert knowledge, perhaps by informing model stacking or ensemble weight optimization.
Simulation of the Condorcet Jury Theorem in R
To illustrate the approach, let’s walk through a short R simulation that demonstrates that aggregating the decisions of multiple weak learners through majority voting can lead to dramatically improved accuracy, provided each learner is better than chance and their errors are independent.
We will simulate an ensemble of \(n\) binary classifiers, each with individual accuracy \(p > 0.5\), and observe how the accuracy of the majority vote changes as we increase \(n\). We will run many trials to estimate the expected ensemble accuracy at each ensemble size.
Here is the complete R code for the simulation:
# Configure simulation parameters
set.seed(77643)  # for reproducibility
n_trials <- 10000                       # Number of simulated cases per ensemble size
ensemble_sizes <- seq(1, 101, by = 2)   # Ensemble sizes (odd numbers only)
individual_accuracy <- 0.6              # Each classifier has 60% chance of being correct

# Function to simulate one trial of majority voting
simulate_ensemble_accuracy <- function(n_classifiers, p_correct, n_trials) {
  # Generate n_trials x n_classifiers matrix of predictions: 1 = correct, 0 = incorrect
  predictions <- matrix(rbinom(n_classifiers * n_trials, 1, p_correct),
                        nrow = n_trials, ncol = n_classifiers)

  # Calculate row-wise majority vote: correct if majority of classifiers are correct
  majority_correct <- rowSums(predictions) > (n_classifiers / 2)

  # Return average accuracy of majority vote
  mean(majority_correct)
}

# Run simulation across all ensemble sizes
ensemble_accuracies <- sapply(ensemble_sizes, function(n) {
  simulate_ensemble_accuracy(n, individual_accuracy, n_trials)
})

# Plot the result
plot(ensemble_sizes, ensemble_accuracies, type = "o", lwd = 2, pch = 16,
     xlab = "Ensemble Size (Number of Classifiers)",
     ylab = "Accuracy of Majority Vote",
     main = "Condorcet Jury Theorem in Action")
abline(h = individual_accuracy, col = "red", lty = 2)
legend("bottomright", legend = c("Individual Classifier Accuracy"),
       col = "red", lty = 2, bty = "n")
After we run the simulation we plot the ensemble accuracy as a function of the number of classifiers. Each classifier has a fixed individual accuracy of 60%, and their errors are simulated as independent Bernoulli trials. As the ensemble size increases, you will observe a clear rise in the accuracy of the majority vote, demonstrating the principle at the heart of the Condorcet Jury Theorem.
The dashed red line represents the accuracy of any single classifier. The curve that rises above it shows how the majority vote of many such classifiers performs better than any of them individually — and improves with more voters.
You can experiment with the simulation by changing the individual_accuracy value to see how sensitive the result is to the competence of the base classifiers. If you set it to 0.5 (random guessing), ensemble accuracy stays flat. If you push it just slightly above 0.5, the ensemble benefit begins to emerge, and it improves more rapidly if the accuracy of the base learners approaches 1.
Bagging and Random Forests
In this section, we turn to the next major ensemble learning technique: bagging, with a focus on its most prominent implementation — random forests. This section explains the theoretical foundation of bagging, its ability to reduce variance, and how random forests extend the method to further improve predictive performance. We will also include a detailed R implementation using the randomForest package.
Before proceeding, you may wish to watch the short video below that explains the key concepts behind the random forest algorithm.
Bagging, short for bootstrap aggregating, is one of the earliest and most influential ensemble methods in supervised learning. Introduced by Breiman (1996), bagging was designed to reduce the variance of unstable learning algorithms — models that tend to change significantly with small changes in the training data. Decision trees, especially unpruned ones, are a classic example of high-variance learners and are thus particularly well-suited to bagging.
The short video below provides a visual introduction that you may wish to watch before proceeding.
The basic idea of bagging is quite simple. Given a training set of size \(n\), bagging constructs multiple training datasets by sampling with replacement from the original dataset. Each of these bootstrap samples is used to train a separate model, and their predictions are aggregated — by majority voting for classification or averaging for regression. This aggregation stabilizes the prediction by smoothing out fluctuations caused by the peculiarities of any single training sample.
Mathematically, let \(\mathfrak{D}\) be the original training set, and let \(\mathfrak{D}_1, \mathfrak{D}_2, \ldots, \mathfrak{D}_B\) be \(B\) bootstrap samples drawn from \(\mathfrak{D}\). We train a base learner \(f_b(x)\) on each \(\mathfrak{D}_b\), and construct the final predictor as:

\[ \hat{f}_{\text{bag}}(x) = \frac{1}{B} \sum_{b=1}^{B} f_b(x) \]

for regression, or as the majority vote of \(f_1(x), \ldots, f_B(x)\) for classification.
While bagging reduces variance, it does not reduce bias, because all base models are trained on data drawn from the same distribution and are of the same type. To enhance diversity among base learners and introduce some bias-variance tradeoff, random forests modify the bagging algorithm by adding a second layer of randomness: when constructing each decision tree, only a random subset of features is considered for splitting at each node. This encourages more diverse trees, as they no longer always favor the strongest predictors globally.
If \(m\) denotes the number of features considered at each split, then choosing a small \(m\) relative to the total number of features increases the diversity of the trees, often improving generalization. In practice, \(m\) is typically set to \(\sqrt{p}\) for classification problems and \(p/3\) for regression, where \(p\) is the number of input features.
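As a quick check of these heuristics (a sketch; the randomForest package used below applies similar floor-based defaults for its mtry argument):

# Heuristic values of m for p = 10 predictors
p <- 10
floor(sqrt(p))         # classification: 3 features per split
max(floor(p / 3), 1)   # regression: 3 features per split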
Let’s look at the construction and evaluation of a random forest model in R using the built-in iris dataset for classification. The code uses the randomForest() function of the randomForest package.
# Load necessary package
library(randomForest)
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
# Load the iris dataset
data(iris)

# Set seed for reproducibility
set.seed(42)

# Train a random forest classifier
rf_model <- randomForest(Species ~ ., data = iris,
                         ntree = 100,       # Number of trees
                         mtry = 2,          # Number of features to consider at each split
                         importance = TRUE) # Enable variable importance measure

# Print the model summary
print(rf_model)
##
## Call:
## randomForest(formula = Species ~ ., data = iris, ntree = 100, mtry = 2, importance = TRUE)
## Type of random forest: classification
## Number of trees: 100
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 4.67%
## Confusion matrix:
## setosa versicolor virginica class.error
## setosa 50 0 0 0.00
## versicolor 0 47 3 0.06
## virginica 0 4 46 0.08
# Evaluate prediction accuracy using out-of-bag (OOB) estimate
cat("OOB estimate of error rate:", rf_model$err.rate[100, "OOB"], "\n")
## OOB estimate of error rate: 0.04666667
Let’s plot the error rate as a function of the number of trees.
# Plot error rate as a function of number of trees
plot(rf_model)
In the above code example, we first trained a random forest model on the iris dataset with 100 trees and a split feature count of 2. The randomForest() function automatically performs internal out-of-bag (OOB) validation: about one-third of the training data is left out of each bootstrap sample and used to estimate predictive accuracy. The resulting OOB error is a reliable indicator of model performance without needing a separate validation set.
The importance() function returns various measures of how influential each feature is in the prediction process, and varImpPlot() visualizes these measures. This interpretability aspect is one of the advantages of random forests over more opaque models like boosting or neural networks.
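For example, a minimal sketch using the model trained above:

# Numeric importance measures per feature (mean decrease in accuracy and Gini)
importance(rf_model)

# Dot plots of the importance measures
varImpPlot(rf_model)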
To summarize, bagging is a variance-reduction technique that stabilizes high-variance models by averaging over many resampled versions of the training data. Random forests extend bagging by injecting additional randomness in the feature selection process, yielding powerful and robust ensemble models. In the next section, we will explore boosting, which takes a fundamentally different approach: it reduces both bias and variance by sequentially training base models to focus on the errors of their predecessors.
Boosting and Gradient-Based Ensemble Methods
Boosting is an ensemble strategy in which base learners are trained sequentially, with each new model focused on the mistakes made by the models before it. Unlike bagging, which treats all base models independently, boosting adapts to the weaknesses of previous learners by modifying the distribution of the training data or the loss function. The central idea is to gradually build a strong learner by combining many weak learners, each of which might only be slightly better than random guessing.
This section will first present the theoretical intuition behind boosting, followed by practical implementation examples in R using the gbm and xgboost packages.
The theoretical foundation of boosting originates from the work of Freund and Schapire (1997), who introduced AdaBoost (short for Adaptive Boosting). In AdaBoost, weights are assigned to each training instance, and these weights are updated after each round of learning so that misclassified instances receive higher weights in the next round. The final prediction is a weighted vote of the base learners, where each model’s contribution is proportional to its accuracy.
Formally, let us suppose we are performing binary classification with labels \(y_i \in \{-1, +1\}\). At each iteration \(t\), AdaBoost fits a weak learner \(f_t(x)\) to the weighted training data, and assigns it a weight \(\alpha_t = \frac{1}{2} \ln \frac{1 - \varepsilon_t}{\varepsilon_t}\) based on its error \(\varepsilon_t\). The final ensemble prediction is given by:

\[ F(x) = \operatorname{sign}\left( \sum_{t=1}^{T} \alpha_t f_t(x) \right) \]
Boosting has been extended from this original formulation into a broader class of gradient-based methods, collectively known as Gradient Boosting Machines (GBM). In GBM, the ensemble is constructed by performing gradient descent in function space, minimizing a differentiable loss function such as squared error or logistic loss. Each learner fits to the negative gradient of the loss function with respect to the current model’s predictions. This approach allows boosting to be applied to both regression and classification problems in a unified framework.
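In symbols, a sketch of one gradient boosting step (omitting details such as the line search over step sizes): each iteration \(m\) computes pseudo-residuals as the negative gradient of the loss at the current predictions, fits a learner \(h_m(x)\) to them, and updates the model with a learning rate \(\nu\):

\[ r_i^{(m)} = -\left.\frac{\partial L\left(y_i, F(x_i)\right)}{\partial F(x_i)}\right|_{F = F_{m-1}}, \qquad F_m(x) = F_{m-1}(x) + \nu\, h_m(x) \]

For squared-error loss \(L = \tfrac{1}{2}(y - F(x))^2\), the pseudo-residuals are simply the ordinary residuals \(y_i - F_{m-1}(x_i)\), so each new learner fits what the current ensemble still gets wrong.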
A more recent and efficient implementation of gradient boosting is the XGBoost algorithm (Extreme Gradient Boosting), which improves performance through regularization, parallelization, and optimized tree learning procedures. XGBoost has become a cornerstone of high-performance machine learning systems due to its speed and accuracy.
Let’s take a look at the use of gbm and xgboost in R for both regression and classification tasks.
Boosting in R with gbm: Classification Example
This example uses the gbm package to fit a gradient boosted model for binary classification. The model is trained using a logistic loss function, and 5-fold cross-validation is used to select the optimal number of boosting iterations.
# Load necessary packages
library(gbm)
## Loaded gbm 2.2.2
## This version of gbm is no longer under development. Consider transitioning to gbm3, https://github.com/gbm-developers/gbm3
library(caret)

# Load and prepare the data
data(iris)
set.seed(123)

# Convert Species to binary classification (e.g., setosa vs. not setosa)
iris_binary <- iris
iris_binary$Species <- ifelse(iris$Species == "setosa", 1, 0)

# Fit a Gradient Boosting Model using gbm
gbm_model <- gbm(Species ~ .,
                 data = iris_binary,
                 distribution = "bernoulli",
                 n.trees = 100,
                 interaction.depth = 2,
                 shrinkage = 0.1,
                 n.minobsinnode = 10,
                 cv.folds = 5,
                 verbose = FALSE)
Let’s find the optimal number of trees using cross-validation.
# Determine the optimal number of trees
best_iter <- gbm.perf(gbm_model, method = "cv")
Now, we can use the tuned model to make predictions for the training data set.
# Predict on the training set
pred_probs <- predict(gbm_model, iris_binary, n.trees = best_iter, type = "response")
pred_class <- ifelse(pred_probs > 0.5, 1, 0)
Finally, let’s calculate the overall accuracy of the boosting ensemble.
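A minimal sketch of that calculation, comparing the thresholded predictions to the 0/1 labels created earlier:

# Proportion of correctly classified training instances
accuracy <- mean(pred_class == iris_binary$Species)
cat("Training accuracy:", accuracy, "\n")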
Boosting in R with xgboost: Multiclass Classification Example
This example uses xgboost to train a multiclass classifier on the full iris dataset. The label encoding ensures compatibility with XGBoost’s internal loss functions, and the accuracy is computed by comparing predictions with true labels.
# Load package
library(xgboost)

# Convert iris dataset to matrix form for xgboost
iris_matrix <- as.matrix(iris[, -5])
iris_labels <- as.numeric(as.factor(iris$Species)) - 1  # Convert to 0-based labels
dtrain <- xgb.DMatrix(data = iris_matrix, label = iris_labels)

# Set parameters for multiclass classification
params.list <- list(booster = "gbtree",
                    objective = "multi:softmax",
                    num_class = 3,
                    eta = 0.1,
                    max_depth = 3,
                    eval_metric = "merror")

# Train xgboost model
set.seed(123)
xgb_model <- xgb.train(params = params.list,
                       data = dtrain,
                       nrounds = 100,
                       verbose = 0)
# Predict on training data
preds <- predict(xgb_model, iris_matrix)
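The accuracy can then be computed by comparing the predicted class indices with the true 0-based labels:

# Proportion of correct multiclass predictions on the training data
accuracy <- mean(preds == iris_labels)
cat("Training accuracy:", accuracy, "\n")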
As we saw in this section, boosting differs fundamentally from bagging in its sequential, adaptive training process. It constructs an additive model by combining weak learners trained to minimize a specified loss function, allowing it to reduce both bias and variance. While boosting is often more powerful than bagging for structured tabular data, it requires careful tuning of hyperparameters—such as learning rate, tree depth, and number of iterations—to avoid overfitting.
In the next section, we will examine stacking, a meta-learning ensemble method that combines predictions from diverse base models using a secondary model, and demonstrate its implementation in R using the caretEnsemble package.
Stacking and Meta-Learning
Stacking, or stacked generalization, is an ensemble method that differs fundamentally from both bagging and boosting. Rather than relying on repeated sampling of the data or iterative reweighting of instances, stacking aims to combine diverse base models by training a meta-learner to optimally integrate their predictions. The method was first introduced by Wolpert (1992) as a general framework for reducing the generalization error of supervised learning algorithms by leveraging a higher-level model trained on predictions made by lower-level models.
In a typical stacking architecture, we begin by training multiple base learners on the training data. These base learners can be homogeneous (all decision trees, for example) or heterogeneous (e.g., logistic regression, random forest, k-nearest neighbors, and support vector machines). Once trained, each base model generates predictions on the input data. These predictions are then used as features for a new model—called the meta-learner or level-1 model—which learns to predict the final output from the base predictions. Naturally, the base learners can themselves be improved via bagging and boosting.
To avoid overfitting, stacking generally requires that the meta-learner be trained on out-of-fold predictions rather than on the same data used to train the base learners. This is typically accomplished using cross-validation. The base learners are each trained on a subset of the training data and make predictions on the held-out portion. These out-of-fold predictions are then used to train the meta-learner, which ensures that the second-level model does not simply memorize the training responses.
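To make the idea concrete, here is a minimal sketch of generating out-of-fold predictions for a single base learner by hand; the caretEnsemble package, used below, automates this bookkeeping:

# Assign each row of iris to one of 5 folds
set.seed(123)
folds <- sample(rep(1:5, length.out = nrow(iris)))

# For each fold, train on the other folds and predict the held-out rows
oof.preds <- numeric(nrow(iris))
for (k in 1:5) {
  fit <- lm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width,
            data = iris[folds != k, ])
  oof.preds[folds == k] <- predict(fit, newdata = iris[folds == k, ])
}

# oof.preds now contains predictions made only on rows the model never saw,
# suitable as a feature column for training a meta-learner
head(oof.preds)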
Mathematically, stacking can be expressed as a two-level function approximation:

\[ \hat{y} = h\left(f_1(x), f_2(x), \ldots, f_K(x)\right) \]
where \(f_1, \ldots, f_K\) are base learners and \(h\) is the meta-learner. The meta-model \(h\) is trained to minimize prediction error based on the outputs of the base models, and it can be any supervised learning algorithm suited to the prediction task.
The strength of stacking lies in its ability to combine models that capture different aspects of the data. A linear model may be effective for identifying additive relationships, while a decision tree may capture interactions or non-linearities. By learning how to weight and combine their predictions, stacking often improves generalization performance.
Stacking in R with caretEnsemble
Let’s look at how to implement a stacked ensemble in R using the caret and caretEnsemble packages.
# Load required packages
library(caret)
library(caretEnsemble)

# Prepare the dataset
set.seed(123)
data(iris)

# Create a binary classification task: setosa vs. not setosa
iris_binary <- iris
iris_binary$Species <- factor(ifelse(iris$Species == "setosa", "setosa", "other"))

# Define trainControl for stacking with cross-validation
ctrl <- trainControl(method = "cv",
                     number = 5,
                     savePredictions = "final",
                     classProbs = TRUE,
                     index = createFolds(iris_binary$Species, k = 5))

# Train multiple base learners
base_models <- caretList(Species ~ .,
                         data = iris_binary,
                         trControl = ctrl,
                         methodList = c("rpart", "glm", "knn"))

# Create the stacked ensemble using glm as meta-learner
stacked_model <- caretStack(base_models,
                            method = "glm",
                            trControl = trainControl(method = "cv", number = 5,
                                                     classProbs = TRUE))

# Print performance
print(stacked_model)
In this example, we create three base learners: a decision tree (rpart), a logistic regression model (glm), and a k-nearest neighbors model (knn). These models are trained using 5-fold cross-validation. Their out-of-fold predictions are used to train a logistic regression meta-learner. The caretStack() function manages the entire process, ensuring proper training and evaluation.
Stacking is especially valuable when base learners are diverse and complementary, and when the training set is large enough to support the additional layer of learning. It is widely used in machine learning competitions and real-world systems where maximizing predictive accuracy is critical.
Summary
This lesson presented the theory and practice of ensemble methods for supervised machine learning. Beginning with the motivation for ensemble learning, we emphasized the inherent limitations of single models, particularly the bias-variance tradeoff, and showed how ensembles can overcome these limitations by combining multiple predictive models.
We introduced the Condorcet Jury Theorem to provide a theoretical foundation for majority voting among weak learners, illustrating how ensemble accuracy can exceed that of individual models when learners are both competent and diverse. This result helps justify the core intuition behind ensemble techniques.
We then classified ensemble models into homogeneous and heterogeneous ensembles. Homogeneous ensembles use repeated instances of the same learning algorithm (e.g., decision trees), and achieve diversity through data resampling or randomization. Heterogeneous ensembles combine structurally different models and leverage their complementary strengths. This distinction provides a useful framework for understanding how different ensemble techniques operate and why they succeed.
To summarize, the three major ensemble paradigms we presented were:
Bagging (Bootstrap Aggregating) was introduced as a variance reduction technique. We demonstrated how it stabilizes high-variance learners using repeated bootstrapped datasets. Its most prominent implementation, the Random Forest algorithm, was presented along with an R example using the randomForest package.
Boosting was discussed as a sequential approach that reduces both bias and variance by training models to correct the mistakes of their predecessors. We implemented both Gradient Boosting Machines using the gbm package and XGBoost using the xgboost package, illustrating their power in structured classification tasks.
Stacking was described as a meta-learning strategy where the predictions of base learners are used as input to a higher-level model. This method was shown to be especially useful when combining diverse models.
From a practical standpoint, ensemble methods are essential tools for any data scientist or machine learning practitioner seeking robust predictive performance. The choice of ensemble method should depend on the specific characteristics of the task at hand:
Use bagging (e.g., random forests) when the base learner is unstable and the goal is to reduce variance.
Use boosting (e.g., gbm, xgboost) when the data exhibit complex patterns and bias reduction is important.
Use stacking when multiple model types are available and likely to capture different aspects of the data.
Each ensemble method requires careful hyperparameter tuning. For example, the number of trees, learning rate, maximum tree depth, and feature subset size all influence the model’s generalization performance. Cross-validation should be used consistently to avoid overfitting and to assess true out-of-sample performance.
Finally, model interpretability remains an important consideration. While ensembles often sacrifice transparency for accuracy, tools like variable importance plots (for random forests and boosting) or SHAP values (for tree-based models) can offer post-hoc insight into how the ensemble arrives at its predictions.
At the end of the day, ensemble methods are not merely heuristics; they are backed by sound statistical reasoning and have become central to modern machine learning. When used appropriately, they can significantly enhance the reliability, accuracy, and stability of predictive models.
Dietterich, T. G. (2000). Ensemble methods in machine learning. In Multiple classifier systems (pp. 1–15). Springer. https://doi.org/10.1007/3-540-45014-9_1
Freund, Y., & Schapire, R. E. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1), 119–139. https://doi.org/10.1006/jcss.1997.1504
Opitz, D., & Maclin, R. (1999). Popular ensemble methods: An empirical study. Journal of Artificial Intelligence Research, 11, 169–198. https://doi.org/10.1613/jair.614
Zhou, Z.-H. (2012). Ensemble methods: Foundations and algorithms. CRC Press.
Acknowledgements
Initial portions of this lesson were written with the assistance of ChatGPT 4o.
Review Questions
Bias-Variance Tradeoff
Explain how ensemble methods, particularly bagging and boosting, affect the bias and variance components of model error. Why is random forest often effective at reducing variance but not bias?
Theoretical Justification
Describe the Condorcet Jury Theorem and its relevance to ensemble classification. What assumptions must hold for the theorem’s conclusion to apply? How do these assumptions translate to ensemble learning in practice?
Homogeneous vs. Heterogeneous Ensembles
Contrast homogeneous and heterogeneous ensembles. Provide examples of when each type might be preferred. What are the trade-offs in terms of diversity, interpretability, and model complexity?
Boosting Strategy
In gradient boosting, what is the role of the learning rate and the number of trees? How do these parameters affect model performance and the risk of overfitting?
Meta-Learning and Overfitting
In stacking, why is it important to use out-of-fold predictions when training the meta-learner? What are the risks if this step is omitted?
Comparison of Methods
Under what conditions would you expect boosting to outperform bagging? Conversely, when might bagging be the more appropriate choice?
Model Evaluation
Discuss the role of cross-validation and out-of-bag error estimates in evaluating ensemble models. When is each method appropriate, and what are their limitations?
Practice Exercises (with R)
Random Forest Tuning
Load the Sonar dataset from the mlbench package and train a random forest classifier using different values of mtry and ntree. Use cross-validation to select the best combination. Plot the out-of-bag error across different configurations.
AdaBoost Simulation
Implement a simple AdaBoost routine in R using decision stumps (depth-1 trees) as weak learners. Track how the training error and test error evolve with the number of boosting rounds.
Feature Importance
Train a gbm model on the Boston dataset from the MASS package. Extract and interpret the relative importance of the features. How do the results compare with those from a random forest?
Stacking Exercise
Using the caretEnsemble package, stack at least three classifiers (e.g., logistic regression, k-nearest neighbors, and decision trees) on the PimaIndiansDiabetes dataset. Evaluate the performance of the ensemble versus the individual models using accuracy, sensitivity, and specificity.
XGBoost Experimentation
Train an xgboost model on a binary classification task of your choice. Vary the eta, max_depth, and nrounds parameters, and visualize their effect on test error. Use early stopping with a validation set to prevent overfitting.
Error Correlation Analysis
Train several base learners on the same dataset and compute the pairwise correlation of their prediction errors. Use this analysis to reason about the expected performance of a bagged vs. stacked ensemble composed of these models.