Various Data Pre-processing (Feature Selection/Elimination) Tasks Using Python

ARJUN RUPAVATIA
Jun 26, 2021

What is Feature Selection?

Feature selection is a pre-processing step that chooses a subset of the original features according to certain evaluation criteria, in order to achieve goals such as removing redundant data, reducing dimensionality and increasing learning accuracy.

Why Feature Selection?

The “garbage in, garbage out” principle applies to every step of pre-processing. We perform feature selection to feed less garbage (irrelevant data) to our model and achieve the highest possible accuracy.

Top reasons to use feature selection are:

  • It enables the machine learning algorithm to train faster.
  • It reduces the complexity of a model and makes it easier to interpret.
  • It improves the accuracy of a model if the right subset is chosen.
  • It reduces overfitting.

Different methods of Feature Selection

Pearson’s product moment coefficient :

It is used as a measure for quantifying linear dependence between two continuous variables X and Y. Its value ranges from -1 to +1. Pearson’s correlation is given by r = cov(X, Y) / (σX · σY), that is, the covariance of X and Y divided by the product of their standard deviations.
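
As a quick illustration, here is a minimal sketch (assuming a pandas DataFrame read from a hypothetical data.csv with a numeric target column named ‘target’) that ranks features by their Pearson correlation with the target:

```python
import pandas as pd

# Hypothetical file and column names, used only for illustration.
df = pd.read_csv("data.csv")

# Pearson correlation of every numeric feature with the target column.
correlations = df.corr(numeric_only=True)["target"].drop("target")

# Keep features whose absolute correlation exceeds an arbitrary cutoff of 0.3.
selected = correlations[correlations.abs() > 0.3].index.tolist()

print(correlations.sort_values(ascending=False))
print("Selected features:", selected)
```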

Chi Square Method:

It is a statistical test applied to groups of categorical features to evaluate the likelihood of correlation or association between them using their frequency distributions.
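
As a minimal sketch of the idea, the association between two categorical columns can be tested with SciPy. The file name and the ‘Gender’ and ‘Loan_Status’ column names are assumptions borrowed from the loan dataset used later:

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Assumed file and column names, for illustration only.
df = pd.read_csv("loan_data_set.csv")

# Contingency table of observed frequencies for two categorical features.
table = pd.crosstab(df["Gender"], df["Loan_Status"])

# Chi-square test of independence: a small p-value suggests the features are associated.
chi2_stat, p_value, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2_stat:.3f}, p-value = {p_value:.4f}")
```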

Forward Selection:

Forward selection is an iterative method in which we start with no features in the model. In each iteration, we add the feature that best improves our model, until adding a new variable no longer improves the model’s performance.
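
One possible sketch uses scikit-learn’s SequentialFeatureSelector with direction='forward'; the estimator, the built-in example dataset and the number of features to keep are arbitrary choices:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)  # example dataset, not the loan data

# Start with no features and greedily add the one that most improves the CV score.
forward = SequentialFeatureSelector(
    LogisticRegression(max_iter=5000),  # arbitrary estimator
    n_features_to_select=5,             # arbitrary target size
    direction="forward",
    cv=5,
)
forward.fit(X, y)

print("Selected feature indices:", forward.get_support(indices=True))
```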

Backward Elimination:

In backward elimination, we start with all the features and remove the least significant feature at each iteration, which improves the performance of the model. We repeat this until no improvement is observed on removing a feature.
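
The same scikit-learn API covers backward elimination by setting direction='backward'; a brief sketch with the same arbitrary estimator and example dataset:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)  # example dataset

# Start with all features and greedily drop the one whose removal costs the least CV score.
backward = SequentialFeatureSelector(
    LogisticRegression(max_iter=5000),  # arbitrary estimator
    n_features_to_select=5,             # arbitrary target size
    direction="backward",
    cv=5,
)
backward.fit(X, y)

print("Remaining feature indices:", backward.get_support(indices=True))
```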

Recursive Feature Elimination:

It is a greedy optimization algorithm which aims to find the best-performing feature subset. It repeatedly creates models and sets aside the best- or worst-performing feature at each iteration. It then constructs the next model with the remaining features until all the features are exhausted, and finally ranks the features based on the order of their elimination.
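
scikit-learn ships this as RFE; a minimal sketch on a built-in dataset (the estimator and the number of features to keep are arbitrary choices):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)  # example dataset

# Repeatedly fit the model, discard the weakest feature, and rank features by elimination order.
rfe = RFE(estimator=LogisticRegression(max_iter=5000), n_features_to_select=5)
rfe.fit(X, y)

print("Selected mask:", rfe.support_)
print("Feature ranking (1 = selected):", rfe.ranking_)
```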

Key Differences:

The main differences between the filter and wrapper methods for feature selection are:

  • Filter methods measure the relevance of features by their correlation with the dependent variable, while wrapper methods measure the usefulness of a subset of features by actually training a model on it.
  • Filter methods are much faster than wrapper methods because they do not involve training models; wrapper methods, in contrast, are computationally very expensive.
  • Filter methods use statistical tests to evaluate a subset of features, while wrapper methods use cross-validation.
  • Filter methods may fail to find the best subset of features on many occasions, whereas wrapper methods can always provide the best subset of features.
  • Using the subset of features from wrapper methods makes the model more prone to overfitting compared to using the subset of features from filter methods.

Filtering Techniques

Variance Threshold:

This method removes features whose variance falls below a certain cutoff.

The idea is when a feature doesn’t vary much within itself, it generally has very little predictive power.

Variance Threshold doesn’t consider the relationship of features with the target variable.
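
A minimal sketch with scikit-learn’s VarianceThreshold on a toy matrix; the 0.1 cutoff is an arbitrary example:

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Toy matrix: the second column barely varies, so it should be dropped.
X = np.array([
    [1.0, 0.01, 10.0],
    [2.0, 0.02, 20.0],
    [3.0, 0.01, 15.0],
    [4.0, 0.02, 25.0],
])

selector = VarianceThreshold(threshold=0.1)  # arbitrary cutoff
X_reduced = selector.fit_transform(X)

print("Kept column indices:", selector.get_support(indices=True))
print(X_reduced)
```

Note that the target variable never appears in this snippet, which is exactly the limitation mentioned above.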

Recursive Feature Elimination:

As the name suggests, this method eliminates the worst-performing features for a particular model one after the other until the best subset of features is known.

For data with n features:

In the first round, candidate models are built with n-1 features each (leaving out one feature at a time), and the worst-performing feature is removed.

In the second round, models are built with n-2 features by removing another feature, and so on.

Wrapper methods promise the best set of features through an extensive greedy search.

But the main drawback of wrapper methods is the sheer number of models that need to be trained. They are computationally very expensive and become infeasible with a large number of features.
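
One way to avoid hand-picking the number of features is scikit-learn’s cross-validated variant, RFECV, which runs the same recursive elimination but lets cross-validation decide how many features to keep; a hedged sketch with an arbitrary estimator and example dataset:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)  # example dataset

# Recursive elimination, with cross-validation choosing how many features survive.
rfecv = RFECV(estimator=LogisticRegression(max_iter=5000), step=1, cv=5)
rfecv.fit(X, y)

print("Optimal number of features:", rfecv.n_features_)
print("Selected feature indices:", rfecv.get_support(indices=True))
```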

PCA:

Principal component analysis (PCA) is a mathematical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of uncorrelated variables called principal components.

The number of principal components is less than or equal to the number of original variables.

This transformation is defined in such a way that the first principal component has as high a variance as possible (that is, it accounts for as much of the variability in the data as possible), and each succeeding component in turn has the highest possible variance under the constraint that it be orthogonal to (uncorrelated with) the preceding components.
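
A minimal sketch with scikit-learn’s PCA; the dataset and the number of components are arbitrary, and the features are standardized first so that the variance comparisons are meaningful:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_breast_cancer(return_X_y=True)  # example dataset

# Standardize so each feature contributes on a comparable scale.
X_scaled = StandardScaler().fit_transform(X)

# Keep the first 3 orthogonal components that capture the most variance.
pca = PCA(n_components=3)  # arbitrary number of components
X_pca = pca.fit_transform(X_scaled)

print("Explained variance ratio:", pca.explained_variance_ratio_)
print("Reduced shape:", X_pca.shape)
```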

Implementation

About Data

We have taken a well-known loan prediction dataset with 12 attributes: 7 categorical attributes, 4 numeric continuous attributes and 1 primary-key attribute. Our dataset has 96 entries.

https://www.kaggle.com/burak3ergun/loan-data-set

Our target attribute is binary.

Feature Reduction using Chi Square:

You can see the scores for each attribute and the 4 attributes chosen (those with the highest scores): ‘ApplicantIncome’, ‘CoapplicantIncome’, ‘LoanAmount’ and ‘Credit_History’. These scores will help you further in determining the best features for training your model.

Training with only the selected, most influential features can improve the accuracy of the model.

Code for Reference
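
The original code is embedded in the source post; as a stand-in, here is a minimal sketch of chi-square feature selection on the loan dataset. The CSV file name, the ‘Loan_ID’ and ‘Loan_Status’ column names, the label encoding of categorical columns and the choice of k=4 are all assumptions for illustration:

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import LabelEncoder

# Assumed file name for the Kaggle loan dataset.
df = pd.read_csv("loan_data_set.csv")

# Drop the assumed primary-key column and rows with missing values.
df = df.drop(columns=["Loan_ID"]).dropna()

# Label-encode categorical columns so chi2 receives non-negative numeric values.
for col in df.select_dtypes(include="object").columns:
    df[col] = LabelEncoder().fit_transform(df[col])

X = df.drop(columns=["Loan_Status"])  # assumed binary target column
y = df["Loan_Status"]

# Score every feature against the target and keep the 4 highest-scoring ones.
selector = SelectKBest(score_func=chi2, k=4)
selector.fit(X, y)

scores = pd.Series(selector.scores_, index=X.columns).sort_values(ascending=False)
print(scores)
print("Selected features:", X.columns[selector.get_support()].tolist())
```

If the scores match the run described above, the four attributes listed earlier should come out on top.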

Originally published at https://arjunrupavatia.blogspot.com.
