Practical - 1

 Aim: Data preprocessing using the scikit-learn Python library


What is data pre-processing?

Data preprocessing is an important step in the data mining process. The phrase "garbage in, garbage out" is particularly applicable to data mining and machine learning projects. Data-gathering methods are often loosely controlled, resulting in out-of-range values, impossible data combinations, and missing values. A model trained on data with these problems can produce misleading results, so the data is cleaned and transformed first.

Why is it needed?

Data preprocessing is required because real-world data are generally:
  • Incomplete: missing attribute values, missing certain attributes of importance, or containing only aggregate data
  • Noisy: containing errors or outliers
  • Inconsistent: containing discrepancies in codes or names

Data pre-processing Techniques

Standardization: Data standardization is the method by which one or more attributes are rescaled so that they have a mean of 0 and a standard deviation of 1.
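
As a minimal sketch of standardization, the snippet below applies scikit-learn's StandardScaler to a small made-up array. The array values are assumptions chosen only for illustration and are not taken from the practical's dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two features on very different scales.
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])

scaler = StandardScaler()
X_std = scaler.fit_transform(X)  # subtract column mean, divide by column std

print(X_std.mean(axis=0))  # each column mean is (approximately) 0
print(X_std.std(axis=0))   # each column standard deviation is (approximately) 1
```

After the transform both columns hold the same values, because they differed only in scale.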

Normalization: The aim of normalization is to bring the numeric columns of the dataset onto a common scale, such as the range 0 to 1, without distorting the differences in the ranges of values.
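
One common way to do this is min-max scaling. The sketch below assumes that is the kind of normalization meant here and uses scikit-learn's MinMaxScaler on made-up values.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data: two features on very different scales.
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])

scaler = MinMaxScaler()            # default feature_range is (0, 1)
X_norm = scaler.fit_transform(X)   # (x - column min) / (column max - column min)

print(X_norm)
```

Each column now runs from 0 to 1, and the relative spacing of the values inside a column is unchanged.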

One-hot Encoding: One-hot encoding transforms categorical data into a numeric form that ML algorithms can accept. Each category becomes its own binary column, which is 1 when the sample belongs to that category and 0 otherwise.
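
A small sketch of one-hot encoding with scikit-learn's OneHotEncoder follows. The colour column is a made-up example, not a feature from the practical's dataset.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical feature with three distinct values.
colors = np.array([["red"], ["green"], ["blue"], ["green"]])

encoder = OneHotEncoder()
# The encoder returns a sparse matrix by default; convert it for display.
encoded = encoder.fit_transform(colors).toarray()

print(encoder.categories_[0])  # categories in sorted order: blue, green, red
print(encoded)                 # one binary column per category
```

Every row contains exactly one 1, in the column of its category.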

Discretization: Discretization refers to converting continuous attributes, features, or variables into discrete or nominal ones by partitioning their range into intervals.
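
The sketch below shows one possible way to discretize a feature, using scikit-learn's KBinsDiscretizer with equal-width bins. The values and the choice of three bins are assumptions for illustration.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Hypothetical continuous feature (a single column).
values = np.array([[1.0], [12.0], [25.0], [38.0], [47.0], [59.0]])

# Three equal-width bins over [1, 59]; output is the bin index per sample.
discretizer = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")
binned = discretizer.fit_transform(values)

print(discretizer.bin_edges_[0])  # the bin boundaries that were learned
print(binned.ravel())             # bin index (0, 1 or 2) for each value
```

Other strategies, such as "quantile" or "kmeans", place the bin edges differently and may suit skewed data better.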

Imputation: Imputation fills in missing data with reasonable estimates, such as the column mean. It is most useful when only a small share of the data is missing.
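
As a minimal sketch, mean imputation can be done with scikit-learn's SimpleImputer. The array and the choice of the mean strategy are assumptions made for this example.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical data with one missing value in each column.
X = np.array([[1.0, 10.0],
              [np.nan, 20.0],
              [3.0, np.nan]])

imputer = SimpleImputer(strategy="mean")  # replace NaN with the column mean
X_filled = imputer.fit_transform(X)

print(X_filled)
```

The missing entry in the first column becomes 2.0 and the one in the second column becomes 15.0, the means of the observed values.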

Importing dataset



Data Exploration



Applying the KNN Model
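
A rough sketch of this step is given below. It assumes scikit-learn's bundled wine dataset as a stand-in, because the practical's dataset is not named in this post, and the split ratio and n_neighbors=5 are also assumptions. The accuracy it prints will therefore not match the figure reported in this post.

```python
from sklearn.datasets import load_wine
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Stand-in dataset (assumption): not the dataset used in the practical.
X, y = load_wine(return_X_y=True)

# Hold out 25% of the rows for testing; fixed seed for repeatability.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

y_pred = knn.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy score:", accuracy)
```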



Output accuracy is 0.75

Applying feature standardization
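
One way to combine standardization with KNN is sketched below. It again assumes the bundled wine dataset as a stand-in for the practical's data, so its accuracy is not the figure reported in this post. The pipeline fits the scaler on the training rows only, which keeps information from the test set out of the scaling.

```python
from sklearn.datasets import load_wine
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in dataset (assumption): not the dataset used in the practical.
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# The scaler is fitted inside the pipeline, using X_train only.
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
model.fit(X_train, y_train)

scaled_accuracy = accuracy_score(y_test, model.predict(X_test))
print("Accuracy score:", scaled_accuracy)
```

KNN relies on distances between samples, so features with large numeric ranges can dominate unless they are rescaled. Whether scaling helps or hurts still depends on the dataset.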



Accuracy score: 0.614583333

Applying One-hot Encoding


Accuracy score: 0.75

Question and Answer

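1) How do you decide the variance threshold in data reduction?
Ans. The choice of variance threshold depends on the probability distribution of the feature in question, so no single value suits every dataset. For a Boolean feature, for instance, the variance is p(1 - p), so a threshold of 0.8 × (1 - 0.8) = 0.16 removes features that take the same value in more than 80% of the samples.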
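
As an illustration, scikit-learn's VarianceThreshold drops every feature whose variance does not exceed a chosen threshold. The data and the threshold of 0.01 below are assumptions picked for the example, not recommended values.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Hypothetical data: a constant column, a nearly constant column,
# and a column that varies a lot.
X = np.array([[0.0, 1.0, 10.0],
              [0.0, 1.1, 20.0],
              [0.0, 0.9, 30.0],
              [0.0, 1.0, 40.0]])

selector = VarianceThreshold(threshold=0.01)
X_reduced = selector.fit_transform(X)

print(selector.variances_)     # variance of each original column
print(selector.get_support())  # True for the columns that are kept
print(X_reduced.shape)
```

Only the third column survives here. Because variance depends on the scale of a feature, the threshold should be chosen with the units of each feature in mind.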

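2) Is the output the same when the model is applied to encoded data as when it is applied to the original data?
Ans. Not necessarily. Most machine learning algorithms require numerical input and output variables, so categorical features must be encoded before such a model can use them, and a suitable encoding can improve accuracy. In this practical, however, the accuracy score was 0.75 both on the original data and after one-hot encoding.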

Python code Link
