21. Cross Validation and Regularization
Satadisha Saha Bhowmick
In the previous chapter we motivated the ideas of Feature Engineering and Feature Selection using issues of model complexity and overfitting. Here we will discuss both of these issues in greater detail and examine how we can further address them.
As stated before, the purpose of a machine learning model is to learn patterns from a sample of observed data and use them to make predictions on previously unseen data points. This ability is known as generalization. Our priority when building a model is to maximize predictive performance on data that the model has not been trained on; this is what it means for a model to generalize well. Model complexity and generalizability are closely linked. Modern classification and regression models can capture complex relationships between data and outcomes, but doing so often produces ‘complex’ models that rely on a large number of features. A model with many features in turn needs a correspondingly larger number of training samples, so that it observes enough of the patterns relating the features to the outcome variable to produce reliable parameter estimates that generalize across a broad range of possible data points.
However, in the absence of adequate data, a model can overemphasize a handful of specific patterns found in the limited training set and follow the noise in that set too closely. This compromises its generalizability, i.e. its predictive performance, when it projects the patterns it has learnt onto a wider range of test data. In other words, the model learns to predict outcomes on the training data all too well but fails to reproduce that performance on previously unseen data points. This phenomenon is called overfitting, and it is often a major reason for model underperformance at test time.
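To make the gap between training and test performance concrete, here is a minimal sketch rather than a definitive experiment. Everything in it is an assumption chosen for illustration: the ground truth `sin(2*pi*x)`, the noise level, the sample sizes, the polynomial degrees and the helper names `make_data` and `mse`. It fits polynomials of increasing degree to a deliberately small training set and compares the mean squared error on the training data with the error on held-out data.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable


def make_data(n):
    """Draw n noisy samples from an assumed ground truth y = sin(2*pi*x)."""
    x = rng.uniform(0.0, 1.0, size=n)
    y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=n)
    return x, y


# A deliberately small training set and a larger held-out test set.
x_train, y_train = make_data(12)
x_test, y_test = make_data(200)


def mse(model, x, y):
    """Mean squared error of a fitted polynomial on the points (x, y)."""
    return float(np.mean((model(x) - y) ** 2))


# Map each polynomial degree to its (training MSE, test MSE).
results = {}
for degree in (1, 3, 9):
    # Least-squares polynomial fit; a higher degree means a more complex model.
    model = Polynomial.fit(x_train, y_train, deg=degree)
    results[degree] = (mse(model, x_train, y_train), mse(model, x_test, y_test))

for degree, (train_err, test_err) in results.items():
    print(f"degree={degree}  train MSE={train_err:.3f}  test MSE={test_err:.3f}")
```

Because each higher-degree polynomial contains the lower-degree ones as special cases, the training error can only fall as the degree grows. The test error carries no such guarantee, and a test error that sits well above the training error for the most complex model is the signature of overfitting described above.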
Now that we know overfitting happens, we need strategies to detect whether a model is overfitted before it is released to production. In this chapter we will take a closer look at overfitting and then delve into those strategies in greater detail.