Aim: Perform the following data pre-processing (feature selection/elimination) tasks using Python
What is feature selection?
In machine learning and statistics, feature selection, also known as variable selection, attribute selection or variable subset selection, is the process of selecting a subset of relevant features for use in model construction.
Why is it important?
- It improves model performance.
- It leads to faster machine learning models.
- It prevents overfitting. With more columns than rows, a model could fit the training data perfectly without learning anything that generalizes.
- It removes irrelevant ("garbage") features.
Variance threshold: This method removes features whose variance falls below a certain cutoff. The idea is that when a feature barely varies across samples, it generally has very little predictive power.
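As a minimal sketch of this idea with scikit-learn's `VarianceThreshold` (the toy data and the 0.2 cutoff are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Toy matrix: the first column is nearly constant, the others vary.
X = np.array([
    [0, 2.0, 1.0],
    [0, 1.5, 3.0],
    [0, 3.2, 0.5],
    [1, 0.8, 2.2],
])

# Drop features whose variance is below the cutoff (0.2 here).
selector = VarianceThreshold(threshold=0.2)
X_reduced = selector.fit_transform(X)

print(selector.variances_)  # per-feature variances
print(X_reduced.shape)      # the near-constant column is removed
```

Note that variance depends on scale, so the cutoff only makes sense if the features are on comparable scales.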
Univariate feature selection: Univariate feature selection works by choosing the best features according to a univariate statistical test such as chi-square. It tests each feature independently to assess the strength of the feature's relationship with the response variable.
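A short sketch of univariate selection with `SelectKBest` and the chi-square test, using the iris data purely as a convenient non-negative example (chi-square requires non-negative feature values):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)

# Score each feature independently against the target with the
# chi-square test and keep the two highest-scoring features.
selector = SelectKBest(score_func=chi2, k=2)
X_new = selector.fit_transform(X, y)

print(selector.scores_)  # per-feature chi-square statistics
print(X_new.shape)       # (150, 2)
```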
Recursive feature elimination: Recursive feature elimination starts by fitting a model on the entire set of features and computing an importance score for each predictor. The weakest features are then removed, the model is re-fitted, and importance scores are computed again, repeating until the specified number of features remains.
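A minimal sketch of this loop with scikit-learn's `RFE`; the synthetic dataset and the choice of logistic regression as the base estimator are assumptions for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data: 10 features, only 3 of which are informative.
X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=3, random_state=0)

# RFE repeatedly fits the estimator, ranks features by the model's
# coefficients, and drops the weakest until 3 remain.
estimator = LogisticRegression(max_iter=1000)
rfe = RFE(estimator, n_features_to_select=3)
rfe.fit(X, y)

print(rfe.support_)  # boolean mask of the selected features
print(rfe.ranking_)  # rank 1 marks a selected feature
```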
PCA: Principal component analysis is a dimensionality-reduction method that is often used to reduce the dimensionality of large data sets, transforming a large set of variables into a smaller one that still contains most of the information in the original set.
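A brief sketch with scikit-learn's `PCA`, again using the iris data as a convenient example; the choice of two components is arbitrary:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)

# Project the 4-dimensional iris data onto its 2 leading
# principal components.
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

print(X_pca.shape)                    # (150, 2)
print(pca.explained_variance_ratio_)  # variance captured per component
```

Unlike the selection methods above, PCA builds new combined features rather than keeping a subset of the original columns, so the transformed features are harder to interpret.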
Correlation: Correlation is a statistical term that, in common usage, refers to how close two variables are to having a linear relationship with each other. Highly correlated features carry nearly the same information, so one of each such pair can usually be dropped.
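A minimal sketch of correlation-based inspection with pandas; the tiny DataFrame is made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],  # b = 2*a, perfectly correlated with "a"
    "c": [5, 3, 4, 1, 2],
})

# Pairwise Pearson correlation matrix; pairs with |r| close to 1
# are nearly redundant, and one feature of each pair can be dropped.
corr = df.corr()
print(corr)
```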
Importing Packages
Dataset (Heart Failure)
Univariate Selection
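The univariate-selection step above can be sketched as follows. A small synthetic stand-in is used here instead of the actual heart-failure file; the column names mirror the UCI heart-failure clinical records dataset, but the values (and the choice of `k=2`) are assumptions for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

# Synthetic stand-in for the heart-failure data: made-up values,
# column names modeled on the UCI heart-failure clinical records set.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.integers(40, 90, 100),
    "ejection_fraction": rng.integers(15, 70, 100),
    "serum_creatinine": rng.uniform(0.5, 5.0, 100),
    "DEATH_EVENT": rng.integers(0, 2, 100),
})

X = df.drop(columns="DEATH_EVENT")
y = df["DEATH_EVENT"]

# Keep the 2 features with the highest chi-square score vs. the target.
selector = SelectKBest(score_func=chi2, k=2)
X_best = selector.fit_transform(X, y)

selected = X.columns[selector.get_support()]
print(list(selected))
```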
Python Code Link