PRACTICAL - 1

  

Data Preprocessing in Python using Scikit Learn

What is Data Preprocessing?

Data Preprocessing is a technique used to convert raw data into a clean data set. In other words, whenever data is gathered from different sources, it arrives in a raw format that is not feasible for analysis.

Therefore, certain steps are executed to convert the data into a small, clean data set. This technique is performed before iterative analysis begins. The set of steps is known as Data Pre-processing. It includes –
  • Data Cleaning
  • Data Integration
  • Data Transformation
  • Data Reduction

Need for Data Preprocessing

The data must be in a proper format to obtain better outcomes from the implemented model in Machine Learning and Deep Learning projects; this is where data preparation comes in.

Some Machine Learning and Deep Learning models need information in a specific format. For example, the Random Forest algorithm does not support null values, so before running it the null values have to be managed in the original raw data set.
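As a minimal sketch of the problem (using a hypothetical toy frame, not this practical's data set), fitting a Random Forest on data containing NaN raises an error until the nulls are handled:

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical toy data with one missing value
X = pd.DataFrame({'age': [25, 32, np.nan, 41], 'chol': [210, 180, 250, 190]})
y = [0, 1, 1, 0]

clf = RandomForestClassifier()
# clf.fit(X, y)                  # raises ValueError: input contains NaN
clf.fit(X.fillna(X.mean()), y)   # works once the nulls are managed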

Various data pre-processing techniques:

Standardization:
Data standardization is the method by which one or more attributes are rescaled such that they have a mean value of 0 and a standard deviation of 1.
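For example, scikit-learn's StandardScaler performs exactly this rescaling (a minimal sketch with made-up height and weight numbers):

import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[170.0, 60.0], [160.0, 75.0], [180.0, 90.0]])  # made-up height, weight

scaler = StandardScaler()
X_std = scaler.fit_transform(X)
print(X_std.mean(axis=0))  # approximately 0 for each column
print(X_std.std(axis=0))   # 1 for each column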

Normalization:
The aim of normalization is to rescale the numeric columns of the dataset to a common scale, without distorting differences in the ranges of values.
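As an illustrative sketch (values invented for the example), MinMaxScaler rescales a column to the [0, 1] range by default:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[120.0], [150.0], [200.0]])  # invented blood-pressure readings

scaler = MinMaxScaler()                  # default feature_range is (0, 1)
print(scaler.fit_transform(X).ravel())   # [0.    0.375 1.   ]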

One-hot Encoding:
One hot encoding is a process that transforms categorical data into a form that can be given to ML algorithms, which accept only numerical input, to help them do a better prediction job. A Label Encoder can first transform each category into a numeric code; one-hot encoding then expands those codes into binary columns, one per category.
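A small sketch of both steps, on a made-up colour column:

import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

colors = np.array(['red', 'green', 'blue', 'green'])  # invented categorical column

# LabelEncoder turns each category into an integer code
print(LabelEncoder().fit_transform(colors))           # [2 1 0 1]

# OneHotEncoder expands the column into one binary column per category
enc = OneHotEncoder(sparse=False)
print(enc.fit_transform(colors.reshape(-1, 1)))       # e.g. 'red' becomes [0. 0. 1.]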

Discretization:
Discretization refers to the method of converting or partitioning continuous attributes, features, or variables into discretized or nominal attributes / features / variables / intervals.
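For instance (a sketch with invented ages), KBinsDiscretizer can cut a continuous column into interval bins:

import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

ages = np.array([[22.0], [35.0], [47.0], [58.0], [63.0], [71.0]])  # invented ages

# Cut the continuous ages into 3 equal-width interval bins with ordinal labels
disc = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='uniform')
print(disc.fit_transform(ages).ravel())  # [0. 0. 1. 2. 2. 2.]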

Imputation:
The imputation technique develops fair guesses for missing data. It is most beneficial when the amount of missing data is small. If the portion of missing information is too large, imputation removes the natural variance in the data that an efficient model needs.
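As a brief sketch (toy values, mean strategy assumed), SimpleImputer fills each missing entry with a column statistic:

import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[25.0, 210.0], [np.nan, 180.0], [41.0, np.nan]])  # toy values

# Replace each missing entry with the mean of its column (a "fair guess")
imp = SimpleImputer(strategy='mean')
print(imp.fit_transform(X))
# [[ 25. 210.]
#  [ 33. 180.]
#  [ 41. 195.]]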

What is Scikit Learn?

Scikit-learn is a Python library that provides a broad range of algorithms for supervised and unsupervised learning.

Scikit Learn is built on top of several common Python libraries for data and math. This design makes the integration between them all very simple: you can pass NumPy arrays and pandas data frames straight to Scikit's ML algorithms (see the short sketch after this list). It uses the following libraries:
  • NumPy: For any work with matrices, especially math operations
  • SciPy: Scientific and technical computing
  • Matplotlib: Data visualization
  • IPython: Interactive console for Python
  • Sympy: Symbolic mathematics
  • Pandas: Data handling, manipulation, and analysis  
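For instance, a toy data frame (columns and values invented for this sketch) can be fed directly to a scikit-learn classifier:

import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

# A pandas data frame can be passed straight to a scikit-learn estimator
df = pd.DataFrame({'x1': [1, 2, 3, 4], 'x2': [5, 3, 6, 1], 'label': [0, 0, 1, 1]})

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(df[['x1', 'x2']], df['label'])
print(knn.predict(df[['x1', 'x2']]))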

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import scale
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder
Fig 1 Importing required packages


df=pd.read_csv('data.csv',na_values=['?'])
df
Fig 2 Importing dataset

After data cleaning, create the training and testing data sets:

x=df[['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach','exang', 'oldpeak', 'slope', 'thal']]
y = df['num']
# Split on the selected feature columns x, not the whole frame df (which still contains the target)
X_train,X_test,Y_train,Y_test = train_test_split(x,y,test_size=0.2)
Fig 3 Creating training & testing dataset

min_max=MinMaxScaler()
X_train_minmax=min_max.fit_transform(X_train[['trestbps','chol','thalach']])
# Reuse the scaler fitted on the training data; do not re-fit on the test set
X_test_minmax=min_max.transform(X_test[['trestbps','chol','thalach']])
Fig 4 Applying feature scaling

enc=OneHotEncoder(sparse=False)
X_train_1=X_train
X_test_1=X_test
columns=['sex', 'cp','fbs', 'restecg', 'exang', 'oldpeak', 'slope', 'thal']
for col in columns:
    # Fitting One Hot Encoding on the combined train and test values,
    # so the encoder sees every category that occurs in either split
    data=pd.concat([X_train[[col]],X_test[[col]]])
    enc.fit(data)
    temp = enc.transform(X_train[[col]])
    # Changing the encoded features into a data frame with new column names
    temp=pd.DataFrame(temp,columns=[(col+"_"+str(i)) for i in data[col].value_counts().index])
    # In side by side concatenation the index values should be the same,
    # so set the index values similar to the X_train data frame
    temp=temp.set_index(X_train.index.values)
    # Adding the new One Hot Encoded variables to the train data frame
    X_train_1=pd.concat([X_train_1,temp],axis=1)
    # Applying the same (already fitted) One Hot Encoding to the test data
    temp = enc.transform(X_test[[col]])
    temp=pd.DataFrame(temp,columns=[(col+"_"+str(i)) for i in data[col].value_counts().index])
    temp=temp.set_index(X_test.index.values)
    # Adding the new One Hot Encoded variables to the test data frame
    X_test_1=pd.concat([X_test_1,temp],axis=1)
Fig 5 Applying One Hot Encoding


# Instantiate and train the classifier before computing its accuracy
log=LogisticRegression()
X_train_scale=scale(X_train_1)
X_test_scale=scale(X_test_1)
log.fit(X_train_scale,Y_train)
accuracy_score(Y_test,log.predict(X_test_scale))
Fig 6 Displaying the accuracy score

Accuracy Score = 0.97 (a 26% increase)

Conclusion:- With the use of existing libraries in Python, we can do much better data pre-processing (a crucial part of any ML project).

