Intuiting Ensemble Methods

This post is part of a series.

One big disadvantage of decision trees is that they have a high variance (i.e. they are unstable, not robust to noise in the data). A slight change in the data can cause a different split to occur, giving rise to different child nodes and splits all the way down the tree, potentially leading to different predictions. However, we can make the predictions more stable by averaging them across an ensemble of many different decision trees, called a random forest. Random forests grow a variety of decision trees by forcing each split attribute to be selected from a random subset of candidates. They also train each tree using a random subset of the available training data, which is known as bootstrap aggregating or “bagging” for short.

In general, bagging constructs an ensemble of models that reduces model variance, making it suitable for complex models (low bias, high variance). For simple models (high bias, low variance), another ensemble model called gradient boosting can be used to reduce model bias. Gradient boosting performs gradient descent on a cost function by building the ensemble as a sequence of “error-correcting” models, where each model is trained on a subset of the training data that emphasizes instances that were misclassified by the preceding model. The output of the ensemble is a weighted average, where the weight given to a model’s prediction depends on the model’s accuracy.

Another example of an ensemble that we encountered earlier was the Bayes optimal ensemble, which averages over all sets of models with a given set of parameters. Since the average is difficult to compute, we settled for choosing the single model which contributed most to the ensemble (MAP/MLE). However, are ways to approximate the average through sampling, known as Bayesian Model Combination (BMC).

The type of ensemble model that wins most data science competitions, perhaps surprisingly, is not the Bayes optimal ensemble. Rather, it is the stacked model, which consists of an ensemble of entirely different species of models together with some combiner algorithm (usually chosen as a logistic regression) that is trained to make a final prediction using the predictions of the models within the ensemble as additional inputs. Although the Bayes optimal ensemble performs at least as well as (and often better than) stacking when the correct data-generating model is on the list of models under consideration, the correct data-generating model for data difficult enough to warrant a competition is often too complex to be approximated by models in the Bayes optimal ensemble. In such cases, it is advantageous to have a diverse basis of models from which to approximate the data-generating model.

This post is part of a series.