
19. Feature Engineering and Feature Selection

Satadisha Saha Bhowmick

Statistical models have become ubiquitous in our efforts to understand and predict a variety of phenomena in modern society. A model is created by learning, from existing data, a mathematical representation that can generalize over and explain the observed outcomes. These models serve a wider purpose: drawing inferences from, or making predictions about, future unobserved data as accurately as possible.

Once data is collected, the measurable variables in the observed data that are fed to a model are called features (also called predictors or independent variables in certain contexts). The quantity being modelled is the response or dependent variable. Features represent the observed examples the model learns from and are key to its success. Different sets of features can be used to perform the predictive task at hand, through a process called model fitting, with varying degrees of effectiveness depending on their association with the outcome variable. If the features have no relationship with the outcome, they are redundant and the resulting data representation is irrelevant for the purpose of modelling.
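To make these terms concrete, here is a minimal sketch that separates a small table into a feature matrix and a response vector. The housing columns and values are hypothetical and chosen only to illustrate the roles of features and response.

```python
import pandas as pd

# A toy dataset: each row is one observed example (hypothetical values).
data = pd.DataFrame({
    "square_feet": [1400, 1600, 1700, 1875, 1100],
    "num_bedrooms": [3, 3, 4, 4, 2],
    "age_years": [20, 15, 18, 5, 40],
    "sale_price": [245000, 312000, 279000, 398000, 199000],
})

# Features (predictors / independent variables): the measurable inputs.
X = data[["square_feet", "num_bedrooms", "age_years"]]

# Response (dependent variable): the quantity being modelled.
y = data["sale_price"]

print(X.shape, y.shape)  # (5, 3) (5,)
```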

The observed examples fed to a model can be represented in different ways, and the choice of representation affects the model's performance. This leads to the notion of Feature Engineering: a process that uses domain knowledge to transform the raw data collected for a predictive problem into features that better represent the underlying patterns in the data and lead to more effective models. Oftentimes, this also involves modifying or combining existing features into new ones that yield simpler models and more interpretable features.
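As a small illustration, the sketch below engineers new columns from the hypothetical housing table used above: a ratio that combines two raw measurements, a log transform of a numeric column, and a one-hot encoding of a categorical column. The column names and transformations are assumptions chosen for illustration, not prescriptions for any particular dataset.

```python
import numpy as np
import pandas as pd

# Continuing the hypothetical housing table from the previous sketch.
raw = pd.DataFrame({
    "square_feet": [1400, 1600, 1700, 1875, 1100],
    "num_bedrooms": [3, 3, 4, 4, 2],
    "age_years": [20, 15, 18, 5, 40],
    "neighborhood": ["north", "south", "north", "east", "south"],
})

engineered = raw.copy()

# Combine two raw measurements into one, more interpretable feature.
engineered["sqft_per_bedroom"] = raw["square_feet"] / raw["num_bedrooms"]

# Transform a numeric feature to compress its range.
engineered["log_age"] = np.log1p(raw["age_years"])

# Encode a categorical variable as indicator (one-hot) columns.
engineered = pd.get_dummies(engineered, columns=["neighborhood"], prefix="nbhd")

print(engineered.columns.tolist())
```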

Although several feature combinations can be created from any given dataset and used for model training, it is important to select carefully which ones to use. In machine learning, an overabundance of features requires an adequately large number of datapoints for the resulting model to be properly fitted. Training a model that uses a large number of features on insufficient data (relative to the number of predictors) can lead to ‘overfitting’ and result in suboptimal performance. We will discuss overfitting in greater detail in a later chapter, but it is a common problem in machine learning and a major reason why feature selection is necessary. Feature Selection is the process of choosing the most predictive features from a large feature space in order to make the training process more efficient and to develop models that are more effective at the task at hand. Approaches to feature selection can be supervised, where different feature combinations are evaluated in conjunction with a trained model and the model’s performance in predicting the target variable is used to assess the effectiveness of the selection. On the other hand, unsupervised approaches aim to identify similarities or redundancies among features without the use of any model trained on labeled data.
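The sketch below illustrates both flavors on synthetic data: a supervised filter that scores each feature by its univariate association with the target (scikit-learn's SelectKBest with an ANOVA F-test), and an unsupervised filter that drops a feature when it is highly correlated with another feature, without ever looking at the labels. The generated dataset, the appended near-duplicate column, and the 0.9 correlation threshold are arbitrary choices made for demonstration.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data: 200 examples, 8 features, only 3 of them informative.
X, y = make_classification(n_samples=200, n_features=8, n_informative=3,
                           random_state=0)

# Append a near-duplicate of the first feature so the unsupervised step
# below has an obviously redundant column to find.
rng = np.random.default_rng(0)
X = np.column_stack([X, X[:, 0] + 0.01 * rng.normal(size=X.shape[0])])

# Supervised selection: keep the 3 features with the strongest univariate
# association (ANOVA F-test) with the target y.
selector = SelectKBest(score_func=f_classif, k=3).fit(X, y)
print("supervised keeps columns:", np.flatnonzero(selector.get_support()))

# Unsupervised selection: drop any feature that is highly correlated with
# an earlier one, without consulting the labels at all.
corr = pd.DataFrame(X).corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
print("unsupervised drops columns:", to_drop)
```

In practice, supervised selection is often wrapped around the eventual model itself (for example, recursive feature elimination), while unsupervised filters such as variance or correlation thresholds are inexpensive pre-processing steps that can be applied before any model is trained.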

In this chapter, we will have dedicated sections on approaches to feature engineering and feature selection. Section 15.5 refers to the data lifecycle as a pipeline that consists of numerous steps including, but not limited to, data collection, processing, and analysis. The techniques described here fall under data pre-processing and are essential for extracting features from data to build intelligent learning systems.