PRACTICAL - 1
Data Preprocessing in Python using Scikit Learn
What is Data Preprocessing?
Data preprocessing is a set of techniques used to convert raw data into a clean data set. Data gathered from different sources arrives in raw format, which is rarely suitable for analysis as it stands.
Therefore, a series of steps is applied to turn the raw data into a clean data set before the iterative analysis is run. This set of steps is known as data preprocessing. It includes:
- Data Cleaning
- Data Integration
- Data Transformation
- Data Reduction
Need for Data Preprocessing
Machine Learning and Deep Learning models produce better results when the data is in a proper format; this is where data preparation comes in.
Some Machine Learning and Deep Learning models need data in a specific format. For example, many Random Forest implementations do not accept null values, so null values in the raw data set must be handled before the algorithm can be run.
Various data pre-processing techniques:
Standardization:
Data standardization is the method by which one or more attributes are rescaled such that they have a mean value of 0 and a standard deviation of 1.
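As a small illustration of the definition above, here is a minimal sketch using scikit-learn's StandardScaler. The toy array is hypothetical and only serves to show that each column ends up with mean 0 and standard deviation 1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two features on very different scales.
X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0]])

scaler = StandardScaler()
# Each column is rescaled as z = (x - column mean) / column std.
X_std = scaler.fit_transform(X)

col_means = X_std.mean(axis=0)
col_stds = X_std.std(axis=0)
print(X_std)
print(col_means, col_stds)
```

After the transform, both columns hold the same values even though the raw scales differed by a factor of 100.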
Normalization:
The aim of normalization is to adjust the numeric column values to a standard scale in the dataset, without distorting the variations in the value ranges.
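One common form of normalization is min-max scaling, which maps each column to the range 0 to 1. The sketch below uses made-up numbers and assumes MinMaxScaler with its default settings; the scaler is fitted on the training values only and then reused on the test values.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data with a single feature.
X_train = np.array([[10.0], [20.0], [30.0]])
X_test = np.array([[15.0], [40.0]])

scaler = MinMaxScaler()
# Learn min and max from the training data: (x - min) / (max - min).
X_train_norm = scaler.fit_transform(X_train)
# Reuse the training min and max on the test data.
X_test_norm = scaler.transform(X_test)

print(X_train_norm.ravel())
print(X_test_norm.ravel())
```

Note that a test value outside the training range (40 here) lands outside 0 to 1, which is expected when the scaler is not refitted on the test data.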
One-hot Encoding:
One-hot encoding transforms categorical data into a form that ML algorithms can use to make better predictions. Most ML algorithms accept only numerical input, so each category is turned into its own binary (0/1) column. A Label Encoder, by contrast, maps each category to a single integer code.
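To make the difference between the two encoders concrete, here is a minimal sketch on a hypothetical list of colour names. LabelEncoder returns one integer per value, while OneHotEncoder returns one binary column per category.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical categorical feature.
colors = np.array(["red", "green", "blue", "green"])

# LabelEncoder: one integer code per category (classes are sorted).
label_enc = LabelEncoder()
codes = label_enc.fit_transform(colors)

# OneHotEncoder expects 2-D input and returns a sparse matrix by default,
# so it is converted to a dense array here for display.
onehot_enc = OneHotEncoder()
onehot = onehot_enc.fit_transform(colors.reshape(-1, 1)).toarray()

print(label_enc.classes_, codes)
print(onehot_enc.categories_[0])
print(onehot)
```

Integer codes imply an order (blue < green < red) that the data does not have, which is why one-hot columns are usually preferred for nominal features. The scikit-learn documentation describes LabelEncoder as intended for target labels rather than input features.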
Discretization:
Discretization is the process of converting continuous attributes, features, or variables into discretized or nominal ones by partitioning their values into intervals.
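A short sketch of discretization, assuming scikit-learn's KBinsDiscretizer with equal-width bins. The ages are made up, and the choice of three bins is arbitrary.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Hypothetical continuous feature (ages).
ages = np.array([[5.0], [18.0], [33.0], [47.0], [62.0], [80.0]])

# Three equal-width bins, returned as ordinal bin numbers 0, 1, 2.
binner = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")
age_bins = binner.fit_transform(ages)

print(binner.bin_edges_[0])
print(age_bins.ravel())
```

With a minimum of 5 and a maximum of 80, the bin edges fall at 5, 30, 55, and 80, so the six ages land in bins 0, 0, 1, 1, 2, 2.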
Imputation:
Imputation fills in missing data with reasonable estimates. It is most useful when only a small share of the data is missing. If too much is missing, the imputed values hide the natural variance of the data, and the resulting model is unlikely to be effective.
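As an illustration, the sketch below fills missing values with the column mean using scikit-learn's SimpleImputer. The array is hypothetical, and mean imputation is only one of several possible strategies.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical data with one missing value in each column.
X = np.array([[1.0, 10.0],
              [np.nan, 20.0],
              [3.0, np.nan]])

# Replace each NaN with the mean of the observed values in its column.
imputer = SimpleImputer(strategy="mean")
X_filled = imputer.fit_transform(X)

print(X_filled)
```

The missing entry in the first column becomes 2.0 (mean of 1 and 3), and the one in the second column becomes 15.0 (mean of 10 and 20).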
What is Scikit Learn?
Scikit-learn is a Python library that provides a broad range of algorithms for supervised and unsupervised learning.
Scikit-learn is built on top of several common Python data and math libraries, which makes integration between them simple. You can pass NumPy arrays and pandas data frames straight to scikit-learn's ML algorithms. It works with the following libraries:
- NumPy: For any work with matrices, especially math operations
- SciPy: Scientific and technical computing
- Matplotlib: Data visualization
- IPython: Interactive console for Python
- Sympy: Symbolic mathematics
- Pandas: Data handling, manipulation, and analysis
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import scale
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder
Fig 1 Importing required packages
# Treat '?' entries as missing values while loading
df = pd.read_csv('data.csv', na_values=['?'])
df
Fig 2 Importing dataset
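The cleaning step itself is not shown in this practical. As a rough sketch of what it could look like, the example below counts missing values and then either drops or fills them. The inline CSV text is a hypothetical stand-in for data.csv, and both options are assumptions rather than the method used for the results reported here.

```python
import io
import pandas as pd

# Hypothetical stand-in for data.csv; '?' marks a missing value.
csv_text = """age,trestbps,chol,num
63,145,233,0
67,160,?,1
41,?,204,0
56,120,236,1
"""
df_demo = pd.read_csv(io.StringIO(csv_text), na_values=["?"])

# Count missing values per column.
missing_per_column = df_demo.isna().sum()

# Option 1: drop every row that contains a missing value.
df_dropped = df_demo.dropna()

# Option 2: fill numeric gaps with the column median.
df_filled = df_demo.fillna(df_demo.median(numeric_only=True))

print(missing_per_column)
print(df_dropped.shape, df_filled.shape)
```

Dropping rows is simplest but loses data; filling keeps every row at the cost of introducing estimated values.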
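After cleaning the data, split it into training and testing sets.
# Feature columns and target
x = df[['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach', 'exang', 'oldpeak', 'slope', 'thal']]
y = df['num']
# Split the selected features x (not the full df, which still contains the target column)
X_train, X_test, Y_train, Y_test = train_test_split(x, y, test_size=0.2)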
Fig 3 Creating training & testing dataset
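Because train_test_split shuffles at random, the split changes on every run. The sketch below shows two optional arguments that may help: random_state for a reproducible split and stratify to keep the class balance similar in both parts. The toy arrays and the seed value 42 are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 samples, 2 features, balanced binary target.
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

# random_state fixes the shuffle; stratify keeps class proportions.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(X_tr.shape, X_te.shape)
print(sorted(y_te.tolist()))
```

With a balanced target and a test size of 2, stratification places one sample of each class in the test set.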
min_max = MinMaxScaler()
# Fit the scaler on the training data only, then reuse it on the test data
X_train_minmax = min_max.fit_transform(X_train[['trestbps', 'chol', 'thalach']])
X_test_minmax = min_max.transform(X_test[['trestbps', 'chol', 'thalach']])
Fig 4 Applying feature scaling
# sparse_output requires scikit-learn 1.2 or newer; older versions use sparse=False
enc = OneHotEncoder(sparse_output=False)
X_train_1 = X_train
X_test_1 = X_test
columns = ['sex', 'cp', 'fbs', 'restecg', 'exang', 'oldpeak', 'slope', 'thal']
for col in columns:
    # Collect every category seen in either split
    # (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
    data = pd.concat([X_train[[col]], X_test[[col]]])
    enc.fit(data)
    # Name the new columns in the encoder's own category order
    names = [col + "_" + str(i) for i in enc.categories_[0]]
    # Encode the train data; reuse X_train's index so side-by-side concatenation aligns
    temp = pd.DataFrame(enc.transform(X_train[[col]]), columns=names, index=X_train.index)
    # Add the new one-hot encoded variables to the train data frame
    X_train_1 = pd.concat([X_train_1, temp], axis=1)
    # Encode the test data and add the new variables to the test data frame
    temp = pd.DataFrame(enc.transform(X_test[[col]]), columns=names, index=X_test.index)
    X_test_1 = pd.concat([X_test_1, temp], axis=1)
Fig 5 Applying One Hot Encoding
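X_train_scale = scale(X_train_1)
X_test_scale = scale(X_test_1)
# The model was never created in the original listing; default settings are assumed here
log = LogisticRegression()
# Fit the model before using it to predict
log.fit(X_train_scale, Y_train)
accuracy_score(Y_test, log.predict(X_test_scale))
Fig 6 Displaying the accuracy score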
Accuracy Score = 0.97 (26 % increase)
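The scaling, encoding, and model-fitting steps above can also be chained so that every transformer is fitted on the training data only. The sketch below is one possible arrangement using ColumnTransformer and Pipeline. It runs on synthetic data with a made-up target, reuses only a few of the column names from this practical, and is not the code behind the score reported above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Hypothetical synthetic data; values are random, not real patient records.
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "trestbps": rng.integers(90, 180, n),
    "chol": rng.integers(150, 350, n),
    "sex": rng.integers(0, 2, n),
    "cp": rng.integers(1, 5, n),
})
# Made-up target so the example has something to learn.
demo["num"] = (demo["chol"] > 250).astype(int)

X = demo.drop(columns="num")
y = demo["num"]
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Scale the numeric columns and one-hot encode the categorical ones.
preprocess = ColumnTransformer([
    ("scale", MinMaxScaler(), ["trestbps", "chol"]),
    ("onehot", OneHotEncoder(handle_unknown="ignore"), ["sex", "cp"]),
])
model = Pipeline([
    ("preprocess", preprocess),
    ("classifier", LogisticRegression(max_iter=1000)),
])

# fit() fits the transformers on the training data only;
# score() applies the same fitted transformers to the test data.
model.fit(X_tr, y_tr)
test_accuracy = model.score(X_te, y_te)
print(test_accuracy)
```

Keeping preprocessing inside the pipeline avoids accidentally refitting a scaler or encoder on the test set.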
Reference Links:-
- https://towardsdatascience.com/an-introduction-to-scikit-learn-the-gold-standard-of-python-machine-learning-e2b9238a98ab
- https://www.kaggle.com/sanskrutipanda/heart-disease-prediction
Colab Link:-
- https://colab.research.google.com/drive/1t560Rd9ILuEUneewyxzFDcpjpUg0jmMG?usp=sharing