Meet Mistry 17IT053 - Data Science: Practical

Aim: Data preprocessing using the scikit-learn Python library

What is data pre-processing?

Data preprocessing is an important step in the data mining process. The phrase "garbage in, garbage out" is particularly applicable to data mining and machine learning projects. Data-gathering methods are often loosely controlled, resulting in out-of-range values, impossible data combinations, and missing values.

Why is it needed?

Data preprocessing is required because real-world data are generally:

Incomplete: missing attribute values, lacking certain attributes of interest, or containing only aggregate data
Noisy: containing errors or outliers
Inconsistent: containing discrepancies in codes or names

Data pre-processing Techniques

Standardization: Data standardization is the method by which one or more attributes are rescaled so that they have a mean value of 0 and a standard deviation of 1.
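As a minimal sketch of this rescaling, the snippet below applies scikit-learn's StandardScaler to a small made-up matrix. The values in X are illustrative assumptions and are not taken from the practical's dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrix (hypothetical values): two columns on very different scales.
X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0],
              [4.0, 400.0]])

# Learn each column's mean and standard deviation, then rescale the columns.
scaler = StandardScaler()
X_std = scaler.fit_transform(X)

print("Column means after scaling:", X_std.mean(axis=0))
print("Column std devs after scaling:", X_std.std(axis=0))
```

After the transform, each column should have a mean close to 0 and a standard deviation close to 1, regardless of its original scale.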

Normalization: The aim of normalization is to bring the values of the numeric columns in the dataset onto a common scale, without distorting the differences in the ranges of values.
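One common way to do this is min-max scaling, sketched below with scikit-learn's MinMaxScaler. This is an assumption about which normalization is meant here: scikit-learn also has a separate Normalizer class that rescales each row to unit norm, which is a different operation. The values in X are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy feature matrix (hypothetical values) with columns on different ranges.
X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0],
              [5.0, 500.0]])

# Map each column linearly onto the range [0, 1].
scaler = MinMaxScaler(feature_range=(0, 1))
X_norm = scaler.fit_transform(X)

print(X_norm)
```

Each column's smallest value becomes 0 and its largest becomes 1, while the relative spacing between values within a column is preserved.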

One-hot Encoding: One-hot encoding transforms categorical data into a numeric form that ML algorithms can work with, by turning each category into its own binary (0/1) column.
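A short sketch of the idea using scikit-learn's OneHotEncoder is given below. The colour column is a hypothetical example and is not part of the practical's dataset.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical column with three distinct categories.
colors = np.array([["red"], ["green"], ["blue"], ["green"]])

encoder = OneHotEncoder()
# fit_transform returns a sparse matrix by default; convert it for display.
encoded = encoder.fit_transform(colors).toarray()

print("Categories:", encoder.categories_[0])
print(encoded)
```

Each row of the result contains exactly one 1, in the column that corresponds to that row's category.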

Discretization: Discretization refers to the method of converting continuous attributes, features or variables into discrete or nominal ones by partitioning their range into intervals.
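The sketch below illustrates one possible approach, equal-width binning with scikit-learn's KBinsDiscretizer. The ages are made-up values, and the choice of three bins is an arbitrary assumption for the example.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Hypothetical continuous attribute: ages in years.
ages = np.array([[5.0], [18.0], [25.0], [40.0], [62.0], [80.0]])

# Split the observed range into 3 equal-width intervals and
# replace each value with the index of the interval it falls into.
discretizer = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")
binned = discretizer.fit_transform(ages)

print("Bin edges:", discretizer.bin_edges_[0])
print("Bin index per value:", binned.ravel())
```

Other strategies ("quantile", "kmeans") place the bin edges differently; which one suits a dataset depends on how the values are distributed.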

Imputation: For missing data, imputation fills in reasonable estimates of the missing values. It is most beneficial when the amount of missing data is small.
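As an illustration, the sketch below fills missing entries with the column mean using scikit-learn's SimpleImputer. Mean imputation is only one possible strategy, chosen here as an assumption, and the matrix is made up for the example.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy matrix (hypothetical values) with two missing entries marked as NaN.
X = np.array([[1.0, 2.0],
              [np.nan, 4.0],
              [3.0, np.nan]])

# Replace each NaN with the mean of the observed values in its column.
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
X_filled = imputer.fit_transform(X)

print(X_filled)
```

The strategy parameter also accepts "median", "most_frequent" and "constant", which may be more appropriate for skewed or categorical columns.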

Importing dataset

Data Exploration

Applying the KNN Model

Output accuracy is 0.75
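For readers without access to the notebook, a baseline KNN run could look roughly like the sketch below. The dataset here is synthetic (generated with make_classification) and is only a stand-in, and the split ratio and n_neighbors=5 are assumptions, so the printed accuracy will not match the 0.75 reported above.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption): 500 samples, 8 numeric features.
X, y = make_classification(n_samples=500, n_features=8, random_state=42)

# Hold out 25% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Fit a k-nearest-neighbours classifier on the raw features.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

accuracy = accuracy_score(y_test, knn.predict(X_test))
print("Accuracy:", accuracy)
```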

Applying feature standardization

Accuracy score: 0.614583333
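The sketch below shows one way to combine standardization with KNN so that the scaler is fitted on the training split only. It uses synthetic stand-in data, with one feature deliberately inflated to mimic mixed scales. These are assumptions for illustration, so the scores will differ from the one reported above. Whether scaling helps depends on the data, so both scores are printed side by side.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption), not the practical's dataset.
X, y = make_classification(n_samples=500, n_features=8, random_state=42)
X[:, 0] *= 1000.0  # inflate one feature to mimic columns on different scales

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# KNN on the raw features.
raw_knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
acc_raw = accuracy_score(y_test, raw_knn.predict(X_test))

# KNN on standardized features; the pipeline fits the scaler on the
# training data only and reuses those statistics for the test data.
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
scaled_knn.fit(X_train, y_train)
acc_scaled = accuracy_score(y_test, scaled_knn.predict(X_test))

print("Accuracy without scaling:", acc_raw)
print("Accuracy with scaling:", acc_scaled)
```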

Applying One-hot Encoding

Accuracy score: 0.75

Question and Answer

1) How to decide the variance threshold in data reduction?

ans. The variance threshold depends on the probability distribution of each feature, so it is chosen for the dataset at hand rather than fixed in advance.
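As a concrete case, a Boolean (Bernoulli) feature that equals 1 with probability p has variance p(1 - p). The sketch below uses that formula with scikit-learn's VarianceThreshold to drop Boolean features that take the same value in more than 80% of the samples. The 80% cut-off and the small matrix are illustrative assumptions.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Toy Boolean feature matrix (hypothetical values): 6 samples, 3 features.
X = np.array([[0, 0, 1],
              [0, 1, 0],
              [1, 0, 0],
              [0, 1, 1],
              [0, 1, 0],
              [0, 1, 1]])

# For a Bernoulli feature, Var = p * (1 - p). A feature that is constant in
# more than 80% of the samples has variance below 0.8 * (1 - 0.8) = 0.16.
threshold = 0.8 * (1 - 0.8)
selector = VarianceThreshold(threshold=threshold)
X_reduced = selector.fit_transform(X)

print("Per-feature variances:", selector.variances_)
print("Kept features (mask):", selector.get_support())
print("Shape after reduction:", X_reduced.shape)
```

Here the first column is 0 in five of six samples, so its variance falls below the threshold and it is removed, while the other two columns are kept.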

2) Is the output the same when the model is applied to the encoded data as when it is applied to the original data?

ans. No, because most machine learning algorithms require numerical input and output variables. So, applying one-hot encoding can increase the accuracy of an ML algorithm.

Python code Link
