20.2. Feature Selection#

An overabundance of features can impact a model’s generalizability and interpretability. Here’s how:

  1. Generalizability: A model that works with a large number of features requires proportionately more training data. Otherwise, it can overfit and yield sub-optimal performance on test data. While we will discuss overfitting more in Chapter 21, it can be intuitively understood as the model learning to predict the outcomes of a specific training dataset so well that it compromises generalizability, i.e. the model’s predictive ability across other, previously unseen, datapoints.

  2. Interpretability: A model with a high number of features results in a complex decision boundary – a surface that separates the feature space of a classification problem into regions representing datapoints from different classes. This makes it difficult to decipher the relationship between individual predictors and their outcomes, thereby affecting the model’s interpretability.

To mitigate these issues, we explore strategies that can help gauge the relative importance of features used to represent datapoints for a machine learning problem and eliminate the redundant ones that can have undesirable effects on the resulting model.

Removing low variance features#

A feature that has little to no variation in the distribution of the values it assumes is called a zero or near-zero variance predictor. Features that assume only one or a handful of unique values are uninformative with regard to the dependent variable. Some models, like tree-based models (to be covered in Chapter 25), can be naturally resistant to the presence of such features in the data. For example, if such features are never used to split the data into subgroups at any point during the construction of models like decision trees, then the underlying decision boundary remains functionally independent of them. However, other models, like regression models, can find them problematic since they estimate parameters that weigh every feature in the data when making a prediction. Hence such low variance features need to be eliminated during feature selection. Feature selection can be approached through a few different techniques, with the choice depending on what the problem or the model being used demands.
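
To make this concrete, here is a minimal sketch, assuming scikit-learn is available, of filtering out a near-zero variance column with VarianceThreshold on a small synthetic dataset; the threshold value is an arbitrary illustrative choice.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Synthetic data: the third column is constant (zero variance).
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.normal(size=100),            # informative, varies freely
    rng.integers(0, 5, size=100),    # varies across a handful of values
    np.repeat(1.0, 100),             # constant -> zero variance
])

# Drop features whose variance falls below the chosen threshold.
selector = VarianceThreshold(threshold=1e-3)
X_reduced = selector.fit_transform(X)

print("Kept feature indices:", selector.get_support(indices=True))
print("Shape before/after:", X.shape, X_reduced.shape)
```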

Feature Selection Methods#

Feature selection methods are intended to remove uninformative features from the model’s consideration. Techniques that do not take into account the outcome of the prediction problem when eliminating features are unsupervised in nature. Filtering out predictors that are highly correlated with other predictors, or that have sparse and unbalanced distributions like those with near-zero variance, is a good example of unsupervised feature selection. In supervised methods, predictors are selected based on their impact on prediction accuracy or model complexity.

In general, apart from models with built-in feature selection mechanisms (like decision trees that will be covered in upcoming chapters), we can group feature selection methods into three main categories.

  1. Filter Methods

  2. Wrapper Methods

  3. Embedded Methods

We will now look into each category closely.

Filter Methods#

Filter methods evaluate the importance of features outside of their relevance to a specific predictive model and only move forward with the ones that pass certain statistical criteria. For example, each predictor available in the data could be independently checked for its relationship with the dependent variable. Only predictors that exhibit a strong relationship will be selected and passed to the model.
The upside of filter methods is that they are computationally inexpensive, which allows quick evaluation of features and elimination of redundant or correlated ones. However, the downside lies in the selection criterion not being directly related to the effectiveness of the chosen model. In other words, filter methods work independently of the machine learning model chosen for a prediction task. Also, since filter methods evaluate predictors (or subsets of predictors) separately, important interactions between variables are often not quantified during these processes.
Let us look into some of these methods now.

Univariate Feature Selection#

These feature selection techniques evaluate the presence of a distinct relationship between the dependent variable and each predictor individually, based on statistical tests appropriate for their respective types. For example, in the case of classification problems, predictors can be filtered out by conducting tests to verify whether their means vary between different classes. Only predictors that show statistically significant differences between classes based on these tests are selected to be used for modeling. Some examples of univariate feature selection methods include the correlation test, Analysis of Variance (ANOVA), the chi-squared test, and mutual information; each is described below, followed by a consolidated code sketch.

  • Pearson’s Correlation Coefficient : A popular feature selection method for regression problems with numerical predictors is testing the Pearson’s correlation coefficient between each such feature and the target variable. The features that exhibit high absolute correlation are considered the most relevant and are selected. The correlation coefficient assumes values between +1 (perfect positive correlation) and -1 (perfect negative correlation). Section 17.2 covers this metric in more detail.

  • Chi-squared test : For feature selection over categorical variables, we use Pearson’s chi-squared statistical test of independence. It tests the independence of a categorical feature from the target variable to establish its usefulness for classification tasks. We start with a contingency table that enumerates all possible values of either variable whose mutual independence is being tested and records the frequency of the different combinations of values between these two variables. The following formula is used to calculate the chi-squared statistic between two categorical variables:

    \(\chi^{2} = \sum_{i=1}^{k}\sum_{j=1}^{m}\frac{(O_{ij}-E_{ij})^2}{E_{ij}}\) where,

    • \(i\) indexes the \(k\) categories of the feature variable

    • \(j\) indexes the \(m\) categories of the target variable

    • \(O_{ij}\) denotes the observed frequency in the contingency table

    • \(E_{ij}\) denotes the expected frequency in the contingency table, under the assumption of independence.

    We can calculate the chi-squared statistic of each feature with respect to the target variable and rank features by their chi-squared values, with larger values indicating stronger dependence on the target, prior to selecting a subset of these features for model fitting.

  • Information Gain : Another technique for feature selection is the entropy based measure of information gain. Derived from the field of information theory, information gain between two variables calculates the reduction in uncertainty for one variable if the value of the other is known. Uncertainty is mathematically represented through the measure of entropy. Entropy of a random variable quantifies the average level of uncertainty inherent to the possible outcomes of the variable. Entropy for a target variable \(Y\) assuming \(n\) different values is given by the following formula:

    \(H(Y) = -\sum_{y=1}^{n} p_{y}\log p_{y}\), where \(p_{y}\) is the probability of the target variable \(Y\) assuming value \(y\)

    In addition, we also need the conditional entropy of the target variable \(Y\), given the value of a feature \(X\), \(H(Y|X)\). Assuming that the variable \(X\) takes \(m\) different values, the conditional entropy can be calculated using the following formula:

    \(H(Y|X) = -\sum_{x=1}^{m}\sum_{y=1}^{n} p_{x,y}\log \frac{p_{x,y}}{p_{x}}\)

    Finally the information gain over the target variable \(Y\) given the feature variable \(X\) can be calculated as:

    \(IG(Y,X) = H(Y) - H(Y|X)\)

    Mutual information is a measure of dependence: a higher value of \(IG(Y,X)\) implies that knowing the feature \(X\) helps in predicting the target variable \(Y\). Hence, after calculating the mutual information of the target variable with respect to every feature in the dataset, we can select the features that reflect higher mutual information, indicating a strong dependence of the target variable on them, which makes them the most appropriate for the predictive task at hand.

  • ANOVA : Another statistical test that can be used for selection of categorical features with respect to numerical response variables is ANOVA, which stands for Analysis of Variance. ANOVA checks whether the mean of the response variable is the same across the groups of the population defined by the different values of a categorical feature, by comparing the variation between groups to the variation within them. If the group means do not differ, the feature in question has no detectable impact on the response variable and therefore need not be considered in training. We carry out the following steps to perform one-way ANOVA between a categorical feature and a response variable.

    • Define the null and the alternate hypothesis, with the null hypothesis being the assumption that all group means are equal.

    • Calculate the sum of squares between and within groups to capture inter-group and intra-group variation of the response variable. Sum of squares is a metric used to determine the dispersion of data points around a mean. Given \(k\) groups, where group \(i\) contains \(n_{i}\) data points \(x_{ij}\) with group mean \(\bar{x}_{i}\), and \(\bar{x}\) is the overall mean, we have:

      Sum of squares between groups \(s_{1} = \sum_{i=1}^{k} n_{i}(\bar{x}_{i}-\bar{x})^2\)

      Sum of squares within groups \(s_{2} = \sum_{i=1}^{k}\sum_{j=1}^{n_{i}} (x_{ij}-\bar{x}_{i})^2\)

    • Calculate the degrees of freedom between and within groups.

      1. Degrees of freedom between groups \(df\_between\): the variation between \(k\) groups explained by the different group means. Knowing the overall mean, only \(k-1\) of the group means are free to vary, since the remaining one can be inferred:

        \(df\_between = k-1\)

      2. Degrees of freedom within groups \(df\_within\): the variation of datapoints within each group, where \(N\) is the total number of datapoints and \(k\) is the number of groups. After calculating the \(k\) group means, \(N-k\) data points remain free to vary within their respective groups:

        \(df\_within = N-k\)

    • Calculate the F-value to compare the variance between the groups and the variance within them. It is the ratio of the sum of squares between groups over the sum of squares within groups, each adjusted by their respective degrees of freedom.

      \(F = \frac{s_{1}/df\_between}{s_{2}/df\_within}\)

    • Finally, the F-value, upon comparison with the F-critical value for the corresponding degrees of freedom, indicates whether to reject or fail to reject the null hypothesis. Should the null hypothesis be rejected, it means that at least one group mean differs from the others, which in turn confirms that the corresponding feature impacts the response variable.
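
The univariate tests discussed above are available off the shelf. The sketch below is a minimal illustration, assuming scikit-learn and a synthetic classification dataset, of scoring features with the chi-squared statistic, the ANOVA F-test, and mutual information, then keeping the top k features with SelectKBest; the value of k is an arbitrary illustrative choice.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, chi2, f_classif, mutual_info_classif

# Synthetic classification data: 8 features, only 3 of which are informative.
X, y = make_classification(n_samples=500, n_features=8, n_informative=3,
                           n_redundant=2, random_state=0)

# The chi-squared test expects non-negative (e.g. count-like) features,
# so shift the data for that scorer only.
X_nonneg = X - X.min(axis=0)

scorers = {
    "chi2": (chi2, X_nonneg),
    "ANOVA F-test": (f_classif, X),
    "mutual information": (mutual_info_classif, X),
}

for name, (score_func, X_in) in scorers.items():
    # Keep the k=3 highest-scoring features according to each univariate test.
    selector = SelectKBest(score_func=score_func, k=3).fit(X_in, y)
    print(f"{name:>20}: kept feature indices {selector.get_support(indices=True)}")
```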

Wrapper Methods#

Wrapper methods are a family of feature selection techniques that employ a search algorithm to explore combinations of features and evaluate each feature subset based on the performance of the model trained on it. The feature set which yields the best model is deemed to be the most relevant to the problem. There are tradeoffs to using wrapper methods for feature selection. The advantages include the fact that these techniques are not model agnostic and can therefore directly customize the feature selection process to the model of choice for a predictive problem. Wrapper methods can better capture interactions between features and find the optimal choice of feature subset. On the downside, exhaustively searching for the optimal feature subset through all possible feature combinations can prove to be computationally infeasible. Wrapper methods involve multiple rounds of training to choose the optimal feature set which, aside from computational expense, can also result in higher dependence on the model and data and, in consequence, overfitting. Here we look into a few popular wrapper methods:

  • Forward Selection : In classic forward selection, the predictors are evaluated and added into consideration one at a time, using a statistical hypothesis test, subject to passing some pre-defined significance threshold. For example, in regression models, we can conduct stepwise forward selection of features by using the p-value to test the statistical significance of the variation in the outcome explained by the model resulting from the inclusion of certain features.
    Note that there are a few issues with this process:

    1. The forward selection search procedure takes a greedy approach where new features that produce the highest statistical significance, given the current state of the model, are progressively added. However, these are locally optimal choices which do not guarantee arriving at the globally optimal solution, since past choices are not reevaluated down the line.

    2. During forward selection, the objective function being optimized is statistical significance rather than the performance-related metrics used to evaluate the model. However, this can be addressed by using a metric like \(R^2\) for regression models (instead of p-values), which measures the percentage of variance in the response variable explained by a linear model. In this case, features that result in the highest increase of this metric keep getting added iteratively. That said, if the objective is to understand the relationship between the features and the outcome variable in the interest of preparing a simpler model, the p-value approach would be more appropriate.

    3. The model trained after feature selection can be overfitted since the computation of parameter estimates and subsequent performance metrics does not take the selection process into account.

    A hierarchical regression process with forward selection would proceed as laid out in Algorithm 20.1.

Algorithm 20.1 (Forward Selection with Linear Regression)

Inputs Training Data

Output Optimal subset of features \(S_{i}\) after feature selection

  1. Create an initial model with only the intercept term. Set a significance level defined by a p-value threshold.

  2. do

    1. for each predictor not in the current model do

      1. Train a candidate model by adding the predictor to the current model

      2. Use hypothesis testing and estimate the statistical significance of the new predictor by calculating the p-value

    2. end

    3. if the smallest of the p-values corresponding to all the candidate predictors evaluated in the current iteration is less than the pre-defined threshold then

      1. Update the model by adding the feature that yielded the lowest p-value.

    4. else

      1. Stop the search.

  3. while statistically significant predictors (producing acceptable p-values) remain outside of the current model
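
As a rough illustration of Algorithm 20.1, the sketch below implements greedy forward selection with p-values from statsmodels; the helper function name, the 0.05 threshold, and the synthetic data are illustrative assumptions rather than a prescribed implementation.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

def forward_select(X: pd.DataFrame, y: pd.Series, p_threshold: float = 0.05):
    """Greedy forward selection driven by each candidate predictor's p-value."""
    selected = []
    remaining = list(X.columns)
    while remaining:
        # p-value of each candidate when added to the current model
        pvals = {}
        for col in remaining:
            design = sm.add_constant(X[selected + [col]])
            model = sm.OLS(y, design).fit()
            pvals[col] = model.pvalues[col]
        best = min(pvals, key=pvals.get)
        if pvals[best] < p_threshold:
            selected.append(best)
            remaining.remove(best)
        else:
            break  # no remaining predictor is statistically significant
    return selected

# Illustrative data: y depends on x1 and x2 only; x3 is pure noise.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 3)), columns=["x1", "x2", "x3"])
y = 2.0 * X["x1"] - 1.5 * X["x2"] + rng.normal(scale=0.5, size=200)

print(forward_select(X, y))   # expected to pick x1 and x2, but not x3
```

For a score-driven (rather than p-value-driven) variant of this greedy search, scikit-learn’s SequentialFeatureSelector offers a similar loop based on cross-validated performance, in line with the \(R^2\)-style alternative mentioned above.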

  • Backward Elimination : The opposite approach to forward selection starts with a full model that includes all the features available in the dataset. Much like forward selection, backward elimination is a greedy approach that iteratively removes predictors, one at a time, depending on their statistical significance in explaining the variation in the outcome variable. Additionally, it has all the same issues as forward selection.
    A hierarchical regression process with backward elimination would proceed as laid out in Algorithm 20.2.

Algorithm 20.2 (Backward Elimination with Linear Regression)

Inputs Training Data

Output Optimal subset of features \(S_{i}\) after feature elimination

  1. Set a significance level defined by a p-value threshold.

  2. do

    1. Train a model with all the predictors currently under consideration.

    2. for each predictor in the current model do

      1. Use hypothesis testing and estimate the statistical significance of the predictor by calculating the p-value

    3. end

    4. if the highest of the p-values corresponding to all the predictors in the current model is above the pre-defined threshold then

      1. Update the model by removing the feature that yielded the highest p-value.

    5. else

      1. Stop the search.

  3. while more predictors, that can be removed using the p-value criterion, remain
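
A corresponding sketch of Algorithm 20.2, again leaning on statsmodels for the p-values; the function name, the threshold, and the synthetic data are the same kind of illustrative assumptions as in the forward selection sketch.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

def backward_eliminate(X: pd.DataFrame, y: pd.Series, p_threshold: float = 0.05):
    """Greedy backward elimination: drop the least significant predictor each pass."""
    selected = list(X.columns)
    while selected:
        model = sm.OLS(y, sm.add_constant(X[selected])).fit()
        pvals = model.pvalues.drop("const")   # ignore the intercept's p-value
        worst = pvals.idxmax()
        if pvals[worst] > p_threshold:
            selected.remove(worst)            # remove the weakest predictor
        else:
            break                             # every remaining predictor is significant
    return selected

# Illustrative data: y depends on x1 and x2 only; x3 is pure noise.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 3)), columns=["x1", "x2", "x3"])
y = 2.0 * X["x1"] - 1.5 * X["x2"] + rng.normal(scale=0.5, size=200)

print(backward_eliminate(X, y))   # expected to drop the noise feature x3
```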

  • Recursive Feature Elimination : Classical forward selection involves refitting several models at each step of the search for the ideal feature subset, and the overall procedure is costly. To avoid this, an alternative is a backward selection algorithm called recursive feature elimination. Here we begin by building a model on the entire set of predictors and compute a feature importance score for every predictor. With the least important predictor(s) removed, the model is refit with the remaining features and importance scores are recomputed. In practice, the number of features to be retained must be specified in advance, and this number itself becomes a tuning parameter for recursive feature elimination: the feature subset size is tuned by optimizing the performance metrics.
    The whole process is captured in Algorithm 20.3.

Algorithm 20.3 (Backward Selection with Recursive Feature Elimination)

Inputs Training Data and model of choice

Output Optimal subset of features \(S_{i}\) after feature selection

  1. Train the model of choice on training data using all available features \(P\)

  2. Calculate the trained model’s performance

  3. Rank features as per some pre-defined feature importance metric

  4. for each subset size \(i\), where \(i = 1...|P|\) do

    1. Keep the \(i\) most important features \(\rightarrow\) \(S_{i}\)

    2. Train the model on the training set using these \(S_{i}\) features

    3. Calculate and record the model performance

  5. end

  6. Examine the performance profile over all iterations of \(S_{i}\)

  7. Determine the appropriate number of predictors : the feature set \(S_{i}\) producing the best performance

  8. Fit and return the final model based on the optimal feature set \(S_{i}\)
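
For recursive feature elimination, scikit-learn provides ready-made RFE and RFECV implementations. The sketch below uses RFECV, which tunes the subset size by cross-validated performance along the lines of Algorithm 20.3; the logistic regression estimator and the synthetic dataset are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

# Synthetic data: 10 features, of which only 4 carry signal.
X, y = make_classification(n_samples=400, n_features=10, n_informative=4,
                           n_redundant=2, random_state=0)

# Recursive feature elimination with the subset size tuned by cross-validation.
estimator = LogisticRegression(max_iter=1000)
rfecv = RFECV(estimator=estimator, step=1, cv=5, scoring="accuracy")
rfecv.fit(X, y)

print("Optimal number of features:", rfecv.n_features_)
print("Selected feature mask:", rfecv.support_)
print("Feature ranking (1 = selected):", rfecv.ranking_)
```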

Embedded Methods#

A third category of feature selection methods has the selection mechanism embedded within the model building phase itself. In a sense, embedded methods combine the concept of picking features based on importance, as in the filter methods, with integrating the selection process into the model success criterion, as seen in the wrapper methods. This leads to the upside of having a computationally inexpensive as well as model-aware feature selection process. The downside is that embedded methods are only compatible with some specific learning models. Some popular embedded methods include:

  • Lasso Regularization : This particular feature selection technique works with linear regression models. These models predict outcomes by learning a linear combination of the features available in the data. The parameters of this model are feature coefficients learnt during training by minimizing an error criterion, typically in the form of the distance between the predicted outcomes and the actual outcomes. While training focuses on minimizing this error term, the purpose of regularization is to add another term that considers the magnitude of these feature coefficients. Lasso regularization seeks to shrink or minimize these feature coefficients alongside the error term. The aim of shrinking coefficients is to minimize the effect of some features or, ideally, drop them altogether (those whose coefficients shrink to zero), and in turn produce a ‘simpler’ model. Such a regularization technique is known to be effective when combating overfitting, which is one of the main objectives of feature selection. We will cover this in greater detail in the regularization chapter. A short code sketch at the end of this section illustrates the coefficients shrinking to zero.

  • Feature Importance with Decision Trees : Tree based models like decision trees, random forests etc. inherently perform feature selection. These models make sequential partitions of the data based on specific feature values until homogeneous data subsets – subsets where all datapoints belong to the same class – remain. At each step of constructing a decision tree, the feature chosen to split the data is the one that minimizes the “impurity”, or heterogeneity, of the resulting partitions. Impurity in these models is computed using metrics like Gini impurity or entropy, with the reduction in impurity from a split often referred to as information gain. This reduction is in fact a representation of a feature’s importance within the problem setup. Features that result in a greater reduction of impurity are considered more relevant or important, and are therefore repeatedly selected over others during the construction of the many partitions (nodes) within the tree. Tree based models will be covered in more detail in Chapter 25.
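
As a closing illustration of both embedded approaches, the sketch below, assuming scikit-learn and a synthetic regression dataset, fits a Lasso model and a random forest and inspects which features each effectively retains: coefficients shrunk to zero for the former, impurity-based importances for the latter. The regularization strength alpha is an arbitrary illustrative choice.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso

# Synthetic regression data: 8 features, only 3 of which drive the target.
X, y = make_regression(n_samples=300, n_features=8, n_informative=3,
                       noise=5.0, random_state=0)

# Lasso shrinks uninformative coefficients, ideally all the way to zero.
lasso = Lasso(alpha=1.0).fit(X, y)
print("Lasso coefficients:", np.round(lasso.coef_, 2))
print("Features kept by Lasso:", np.flatnonzero(lasso.coef_))

# Tree ensembles expose impurity-based importances as a byproduct of training.
forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
print("Impurity-based importances:", np.round(forest.feature_importances_, 3))
```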