Aim: Perform the following data pre-processing (feature selection/elimination) tasks using Python
What is feature selection?
In machine learning and statistics, feature selection, also known as variable selection, attribute selection or variable subset selection, is the process of selecting a subset of relevant features for use in model construction.
Why is it important?
- It improves model performance.
- It leads to faster machine learning models.
- It prevents overfitting. With more columns than rows, a model could fit the training data perfectly without learning anything that generalizes.
- It removes irrelevant ("garbage") features.
Variance threshold: This method removes features whose variance falls below a certain cutoff. The idea is that when a feature barely varies across samples, it generally has very little predictive power.
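As a minimal sketch of this idea with scikit-learn's `VarianceThreshold` (the toy data and the 0.2 cutoff are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Toy matrix: the first column is nearly constant, the others vary.
X = np.array([
    [0, 2.0, 1.0],
    [0, 1.5, 3.0],
    [0, 3.2, 0.5],
    [1, 0.8, 2.2],
])

# Drop features whose variance is below the cutoff (0.2 here).
selector = VarianceThreshold(threshold=0.2)
X_reduced = selector.fit_transform(X)

print(selector.variances_)  # per-feature variances
print(X_reduced.shape)      # the near-constant column is removed
```

Note that variance depends on scale, so the cutoff only makes sense if the features are on comparable scales.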
Univariate feature selection: Univariate feature selection works by choosing the best features according to a univariate statistical test such as chi-square. It tests each feature independently to assess the strength of the feature's relationship with the response variable.
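A short sketch of univariate selection with `SelectKBest` and the chi-square test, using the iris data purely as a convenient non-negative example (chi-square requires non-negative feature values):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)

# Score each feature independently against the target with the
# chi-square test and keep the two highest-scoring features.
selector = SelectKBest(score_func=chi2, k=2)
X_new = selector.fit_transform(X, y)

print(selector.scores_)  # per-feature chi-square statistics
print(X_new.shape)       # (150, 2)
```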
Recursive feature elimination: Recursive feature elimination starts by fitting a model on the entire set of features and computing an importance score for each predictor. The weakest features are then removed, the model is re-fitted, and importance scores are computed again, repeating until the specified number of features remains.
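A minimal sketch of this loop with scikit-learn's `RFE`; the synthetic dataset and the choice of logistic regression as the base estimator are assumptions for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data: 10 features, only 3 of which are informative.
X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=3, random_state=0)

# RFE repeatedly fits the estimator, ranks features by the model's
# coefficients, and drops the weakest until 3 remain.
estimator = LogisticRegression(max_iter=1000)
rfe = RFE(estimator, n_features_to_select=3)
rfe.fit(X, y)

print(rfe.support_)  # boolean mask of the selected features
print(rfe.ranking_)  # rank 1 marks a selected feature
```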
PCA: Principal component analysis is a dimensionality-reduction method that is often used to reduce the dimensionality of large data sets, transforming a large set of variables into a smaller one that still contains most of the information in the original set.
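A brief sketch with scikit-learn's `PCA`, again using the iris data as a convenient example; the choice of two components is arbitrary:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)

# Project the 4-dimensional iris data onto its 2 leading
# principal components.
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

print(X_pca.shape)                    # (150, 2)
print(pca.explained_variance_ratio_)  # variance captured per component
```

Unlike the selection methods above, PCA builds new combined features rather than keeping a subset of the original columns, so the transformed features are harder to interpret.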
Correlation: Correlation is a statistical term that, in common usage, refers to how close two variables are to having a linear relationship with each other. Highly correlated features carry nearly the same information, so one of each such pair can usually be dropped.
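A minimal sketch of correlation-based inspection with pandas; the tiny DataFrame is made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],  # b = 2*a, perfectly correlated with "a"
    "c": [5, 3, 4, 1, 2],
})

# Pairwise Pearson correlation matrix; pairs with |r| close to 1
# are nearly redundant, and one feature of each pair can be dropped.
corr = df.corr()
print(corr)
```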
Importing Packages
Dataset (Heart Failure)
Univariate Selection
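The univariate-selection step above can be sketched as follows. A small synthetic stand-in is used here instead of the actual heart-failure file; the column names mirror the UCI heart-failure clinical records dataset, but the values (and the choice of `k=2`) are assumptions for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

# Synthetic stand-in for the heart-failure data: made-up values,
# column names modeled on the UCI heart-failure clinical records set.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.integers(40, 90, 100),
    "ejection_fraction": rng.integers(15, 70, 100),
    "serum_creatinine": rng.uniform(0.5, 5.0, 100),
    "DEATH_EVENT": rng.integers(0, 2, 100),
})

X = df.drop(columns="DEATH_EVENT")
y = df["DEATH_EVENT"]

# Keep the 2 features with the highest chi-square score vs. the target.
selector = SelectKBest(score_func=chi2, k=2)
X_best = selector.fit_transform(X, y)

selected = X.columns[selector.get_support()]
print(list(selected))
```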
Python Code Link