Data Pre-processing in Python
Preparation is the foundation of doing any task well, whether it is a football match or an online exam: you have to prepare thoroughly before you can expect a good result. In the same way, preparing the data before modelling and exploratory data analysis is a basic task that every data science professional needs to do to get accurate results. In this blog post I am going to discuss, in detail, how to prepare a data-set before looking for interesting insights.
Why is pre-processing needed?
When we deal with real-world data, there are many kinds of inconsistencies, such as:
- Incomplete data: missing values due to improper collection of data
- Noisy data: outliers or errors introduced while collecting data
- Imbalanced data: for example, in a classification task the data-set may contain far more examples of one class than the other. If we train a machine learning model on such imbalanced data, its predictions will tend to be biased towards the abundant class, because the model never learned the rare class properly. So we have to balance the data-set before modelling.
Data pre-processing is required to remove these inconsistencies and put the data into a usable format before it can be analysed. I will walk through the pre-processing steps one by one using the Melbourne Housing data-set.
Steps involved in pre-processing
Let us start the pre-processing steps by loading the required Python packages and the data-set.
#import the required libraries
import pandas as pd
import numpy as np
#load the dataset and check the number of rows and columns
original_df = pd.read_csv('S:/media/melb_data.csv')
original_df.shape
(13580, 21)
We can see that there are 13,580 rows and 21 columns in our data-set. Let us look at five random rows.
original_df.sample(5)
| | Suburb | Address | Rooms | Type | Price | Method | SellerG | Date | Distance | Postcode | … | Bathroom | Car | Landsize | BuildingArea | YearBuilt | CouncilArea | Lattitude | Longtitude | Regionname | Propertycount |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 4081 | Moonee Ponds | 6/7 York St | 1 | u | 360000.0 | SP | Brad | 10/12/2016 | 6.9 | 3039.0 | … | 1.0 | 1.0 | 0.0 | 54.0 | 1965.0 | Moonee Valley | -37.76520 | 144.91160 | Western Metropolitan | 6232.0 |
| 8937 | Rowville | 11 Jindalee Cl | 5 | h | 870000.0 | S | Harcourts | 1/07/2017 | 23.5 | 3178.0 | … | 2.0 | 2.0 | 733.0 | 217.0 | 1986.0 | Knox | -37.91894 | 145.24305 | South-Eastern Metropolitan | 11667.0 |
| 3473 | Keilor East | 564 Buckley St | 3 | h | 630000.0 | S | Nelson | 16/04/2016 | 12.8 | 3033.0 | … | 2.0 | 1.0 | 221.0 | NaN | NaN | Moonee Valley | -37.75040 | 144.85940 | Western Metropolitan | 5629.0 |
| 9115 | Craigieburn | 195 Newbury Bvd | 4 | h | 667000.0 | S | Ray | 3/06/2017 | 20.6 | 3064.0 | … | 2.0 | 2.0 | 480.0 | 214.0 | 2012.0 | Hume | -37.57201 | 144.90897 | Northern Metropolitan | 15510.0 |
| 5405 | Richmond | 22a Stanley St | 3 | h | 1600000.0 | S | Biggin | 24/09/2016 | 2.6 | 3121.0 | … | 2.0 | 2.0 | 80.0 | 144.0 | 1850.0 | Yarra | -37.82330 | 144.99470 | Northern Metropolitan | 14949.0 |
5 rows × 21 columns
It is evident from this sample that there are missing values in the “BuildingArea” and “YearBuilt” columns, since “NaN” values are present. Let us count the number of missing values in every column of the data-set.
#the below code counts the number of missing values in each column
original_df.isna().sum()
Suburb 0
Address 0
Rooms 0
Type 0
Price 0
Method 0
SellerG 0
Date 0
Distance 0
Postcode 0
Bedroom2 0
Bathroom 0
Car 62
Landsize 0
BuildingArea 6450
YearBuilt 5375
CouncilArea 1369
Lattitude 0
Longtitude 0
Regionname 0
Propertycount 0
dtype: int64
We can see that “Car”, “BuildingArea”, “YearBuilt” and “CouncilArea” definitely have missing values. There are three common options for handling missing values.
- Drop the columns with missing values: This is a very simple approach and is generally not recommended, as a lot of “important” information can be lost by simply dropping the columns. Columns with missing values can be dropped as follows:
dataset_without_missing_column = original_df.dropna(axis=1)
dataset_without_missing_column.isna().sum()
Suburb 0
Address 0
Rooms 0
Type 0
Price 0
Method 0
SellerG 0
Date 0
Distance 0
Postcode 0
Bedroom2 0
Bathroom 0
Landsize 0
Lattitude 0
Longtitude 0
Regionname 0
Propertycount 0
dtype: int64
We can see that the columns having missing values have been dropped and the data-set now has no missing values. Let's look at the next option.
- Drop the rows with missing values: This is another very simple approach and, again, is not recommended, since a lot of “important” information can be lost. Rows with missing values can be dropped as follows:
dataset_without_missing_row = original_df.dropna()
dataset_without_missing_row.isna().sum()
Suburb 0
Address 0
Rooms 0
Type 0
Price 0
Method 0
SellerG 0
Date 0
Distance 0
Postcode 0
Bedroom2 0
Bathroom 0
Car 0
Landsize 0
BuildingArea 0
YearBuilt 0
CouncilArea 0
Lattitude 0
Longtitude 0
Regionname 0
Propertycount 0
dtype: int64
We can see that there are no missing values left in the data-set. This approach is only reasonable when the data-set is large enough that losing some rows does not matter. Let us look at the third option.
- Imputation: As the name suggests, we can impute values into the data-set to replace the missing ones. This approach tends to work well in practice. Imputation means replacing missing values with the mean or median in numerical columns, and with the mode in categorical columns. Let me use Scikit-learn’s SimpleImputer class to impute mean values into the numerical columns.
from sklearn.impute import SimpleImputer
#SimpleImputer replaces missing values with the column mean by default
imputer = SimpleImputer()
imputed_numerical = imputer.fit_transform(original_df.select_dtypes(exclude=['object']))
imputed_numerical = pd.DataFrame(imputed_numerical, columns=original_df.select_dtypes(exclude=['object']).columns)
imputed_numerical.isna().sum()
Rooms 0
Price 0
Distance 0
Postcode 0
Bedroom2 0
Bathroom 0
Car 0
Landsize 0
BuildingArea 0
YearBuilt 0
Lattitude 0
Longtitude 0
Propertycount 0
dtype: int64
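As mentioned above, the median is sometimes preferred to the mean when a numerical column is skewed or contains outliers. Switching SimpleImputer to a median strategy is a one-line change; here is a small optional sketch (imputed_numerical_median is a name I am introducing purely for illustration):
#impute numerical columns with the column median instead of the mean
median_imputer = SimpleImputer(strategy="median")
imputed_numerical_median = pd.DataFrame(
    median_imputer.fit_transform(original_df.select_dtypes(exclude=['object'])),
    columns=original_df.select_dtypes(exclude=['object']).columns)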
Either way, the numerical columns are now imputed and no missing values remain. Let's see how the categorical columns can be imputed.
imputer = SimpleImputer(strategy="most_frequent")
imputed_categorical = imputer.fit_transform(original_df.select_dtypes(include=['object']))
imputed_categorical = pd.DataFrame(imputed_categorical, columns = original_df.select_dtypes(include=['object']).columns)
imputed_categorical.isna().sum()
Suburb 0
Address 0
Type 0
Method 0
SellerG 0
Date 0
CouncilArea 0
Regionname 0
dtype: int64
Here we can see that the categorical columns are imputed with the most frequent value, i.e. the mode of each column.
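If a single, fully imputed data frame is needed for later steps, the two imputed pieces can be stitched back together. A minimal sketch (imputed_df is a name I am introducing here; note that the column order will differ from original_df):
#recombine the separately imputed numerical and categorical columns into one data frame
imputed_df = pd.concat([imputed_numerical, imputed_categorical], axis=1)
imputed_df.isna().sum().sum()   #should be 0 - no missing values remain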
In the discussion above we only used mean and mode imputation. More complex imputation techniques exist, such as k-nearest-neighbour imputation, although in practice simple imputation often performs just as well, so added complexity does not guarantee better accuracy. A brief sketch of the KNN approach is shown below; after that, we will close the discussion on missing-value treatment and move on to feature scaling.
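For reference, here is what one of those more complex approaches could look like, using scikit-learn's KNNImputer on the numerical columns. This is a hedged sketch, not something the rest of the post depends on:
from sklearn.impute import KNNImputer

#each missing value is filled with the mean of that column over the 5 nearest rows,
#where distances are computed on the columns that are not missing
knn_imputer = KNNImputer(n_neighbors=5)
numerical_cols = original_df.select_dtypes(exclude=['object'])
imputed_knn = pd.DataFrame(knn_imputer.fit_transform(numerical_cols),
                           columns=numerical_cols.columns)
imputed_knn.isna().sum()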
Feature Scaling
Feature scaling is a very important task during pre-processing of the data-set. It is the process of bringing the independent variables (predictors) onto a common, standardized scale. There are two main ways to do this: min-max normalization and standardization.
Min-Max Normalization
It is the simplest method: it re-scales the data into the range between 0 and 1. Here is the formula for min-max normalization:
x’ = (x – min(x))/(max(x)-min(x))
Let us apply min-max normalization in Python and look at the result.
#create a simple pandas dataframe
s1 = pd.Series([1, 2, 3, 4, 5, 6, 7], index=(range(7)))
s2 = pd.Series([10, 9, 8, 7, 6, 5, 4], index=(range(7)))
s3 = pd.Series([100,200,300,500,500,789,458], index=(range(7)))
df = pd.DataFrame({'s1': s1,'s2':s2,'s3':s3})
df
| | s1 | s2 | s3 |
| --- | --- | --- | --- |
| 0 | 1 | 10 | 100 |
| 1 | 2 | 9 | 200 |
| 2 | 3 | 8 | 300 |
| 3 | 4 | 7 | 500 |
| 4 | 5 | 6 | 500 |
| 5 | 6 | 5 | 789 |
| 6 | 7 | 4 | 458 |
We have created a simple data-set whose columns have different ranges of values. Let us apply min-max normalization using mlxtend.
from mlxtend.preprocessing import minmax_scaling
min_max_normalized = minmax_scaling(df, columns=df.columns)
min_max_normalized
| | s1 | s2 | s3 |
| --- | --- | --- | --- |
| 0 | 0.000000 | 1.000000 | 0.000000 |
| 1 | 0.166667 | 0.833333 | 0.145138 |
| 2 | 0.333333 | 0.666667 | 0.290276 |
| 3 | 0.500000 | 0.500000 | 0.580552 |
| 4 | 0.666667 | 0.333333 | 0.580552 |
| 5 | 0.833333 | 0.166667 | 1.000000 |
| 6 | 1.000000 | 0.000000 | 0.519594 |
min_max_normalized.describe()
| | s1 | s2 | s3 |
| --- | --- | --- | --- |
| count | 7.000000 | 7.000000 | 7.000000 |
| mean | 0.500000 | 0.500000 | 0.445159 |
| std | 0.360041 | 0.360041 | 0.332220 |
| min | 0.000000 | 0.000000 | 0.000000 |
| 25% | 0.250000 | 0.250000 | 0.217707 |
| 50% | 0.500000 | 0.500000 | 0.519594 |
| 75% | 0.750000 | 0.750000 | 0.580552 |
| max | 1.000000 | 1.000000 | 1.000000 |
We can see that every column has been scaled into the range 0 to 1.
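As a sanity check, the same result can be reproduced with plain pandas by writing the formula out directly; a small sketch:
#apply x' = (x - min(x)) / (max(x) - min(x)) column by column
manual_min_max = (df - df.min()) / (df.max() - df.min())
np.allclose(manual_min_max, min_max_normalized)   #True, up to floating-point precision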
Standardization
Standardization re-scales the data so that each column has zero mean and unit standard deviation. It is also called z-score normalization. Let us apply standardization to the simple data-set we created above.
from mlxtend.preprocessing import standardize
standardized = standardize(df,columns=df.columns)
standardized
| | s1 | s2 | s3 |
| --- | --- | --- | --- |
| 0 | -1.5 | 1.5 | -1.447314 |
| 1 | -1.0 | 1.0 | -0.975437 |
| 2 | -0.5 | 0.5 | -0.503560 |
| 3 | 0.0 | 0.0 | 0.440194 |
| 4 | 0.5 | -0.5 | 0.440194 |
| 5 | 1.0 | -1.0 | 1.803919 |
| 6 | 1.5 | -1.5 | 0.242006 |
standardized.describe()
| | s1 | s2 | s3 |
| --- | --- | --- | --- |
| count | 7.000000 | 7.000000 | 7.000000e+00 |
| mean | 0.000000 | 0.000000 | -2.775558e-17 |
| std | 1.080123 | 1.080123 | 1.080123e+00 |
| min | -1.500000 | -1.500000 | -1.447314e+00 |
| 25% | -0.750000 | -0.750000 | -7.394988e-01 |
| 50% | 0.000000 | 0.000000 | 2.420055e-01 |
| 75% | 0.750000 | 0.750000 | 4.401939e-01 |
| max | 1.500000 | 1.500000 | 1.803919e+00 |
We can see that all the columns now have a mean of 0 (up to floating-point error) and a standard deviation of approximately 1. (The reported value of about 1.08 rather than exactly 1 arises because describe() reports the sample standard deviation, whereas standardize() scales by the population standard deviation.)
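Again as a sanity check, the same transformation can be written out with plain pandas. mlxtend's standardize() divides by the population standard deviation (ddof=0), so we do the same here; a small sketch:
#apply z = (x - mean(x)) / std(x) using the population standard deviation
manual_standardized = (df - df.mean()) / df.std(ddof=0)
np.allclose(manual_standardized, standardized)   #True, up to floating-point precision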
Standardization and normalization are important concepts and should be applied before modelling. Let me now discuss another important topic: handling imbalanced data-sets.
Imbalanced data-set
As the name suggests, an imbalanced data-set has many examples of one class and very few of another. If we train a classification model on such a data-set, we may get very high accuracy on the abundant class but very poor accuracy on the rare one. The model cannot generalize well and is heavily biased towards the abundant class. So we have to handle this problem, i.e. make the data-set balanced before using it for modelling.
There are two ways to balance a data-set: under-sampling and over-sampling.
- Under-sampling: Reduce the number of examples of the abundant class by randomly selecting a subset of the rows belonging to that class, while keeping all the samples of the rare class intact. This is typically used when the data-set is large enough to spare some data.
- Over-sampling: Generate additional samples of the rare class by repetition, boot-strapping or SMOTE (Synthetic Minority Over-sampling Technique). In this way the number of rare samples is increased and the data-set can be balanced. (A brief SMOTE sketch appears after this list.)
Neither technique has an inherent advantage over the other; which one to use depends on the use-case and the data-set we are working with.
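Later in this post, over-sampling is done by simple repetition with scikit-learn's resample(). For reference, the SMOTE variant mentioned above would look roughly like this; a hedged sketch that assumes the third-party imbalanced-learn package is installed and that a numeric feature matrix X and label vector y have already been separated (as is done further below):
from imblearn.over_sampling import SMOTE

#SMOTE synthesizes new minority-class rows by interpolating between
#neighbouring minority samples instead of simply repeating them
X_resampled, y_resampled = SMOTE(random_state=42).fit_resample(X, y)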
Let us see in Python how we can apply these techniques to the Balance Scale data-set.
imbalanced_df = pd.read_csv('S:/xxxx/balance-scale.data',names=['balance', 'v1', 'v2', 'v3', 'v4'])
imbalanced_df.shape
(625, 5)
The data-set has 625 rows and 5 columns. Let us look at the class distribution of the target column.
imbalanced_df['balance'].value_counts()
R 288
L 288
B 49
Name: balance, dtype: int64
imbalanced_df['balance'].value_counts().plot(kind='bar')
We can see that the data-set is imbalanced, with “B” being the rare class. Let's build a logistic regression model on this data-set as it is.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# Separate predictor features (X) and target (y)
y = imbalanced_df.balance
X = imbalanced_df.drop('balance', axis=1)
# Train model
logistic_regression_model = LogisticRegression().fit(X, y)
# Predict on training set
pred_imbalanced = logistic_regression_model.predict(X)
# Check the accuracy on the training set
print( accuracy_score(y, pred_imbalanced) )
0.8848
from sklearn.metrics import confusion_matrix
import seaborn as sns
conf_matrix = confusion_matrix(y, pred_imbalanced)
sns.heatmap(conf_matrix, annot=True,cmap='Blues')
From the confusion matrix we can see that class “B” (shown as label “0” on the heatmap) is not being predicted properly: only 5 of its 49 samples are predicted correctly. So it is evident that the model performs poorly on the rare class; a per-class report (sketched below) makes this explicit.
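An optional way to quantify the per-class performance, rather than reading it off the heatmap, is scikit-learn's classification_report:
from sklearn.metrics import classification_report

#per-class precision, recall and F1; recall for "B" will be very low here (about 5/49)
print(classification_report(y, pred_imbalanced))
With that confirmed, let us now balance the data-set by over-sampling.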
from sklearn.utils import resample
#Separate majority and minority classes
df_major = imbalanced_df[imbalanced_df.balance.isin(['L','R'])]
df_minor = imbalanced_df[imbalanced_df.balance=='B']
# Upsample minority class
df_minor_oversampled = resample(df_minor, replace=True, n_samples=288,random_state=123)
# Combine the majority class with upsampled minority class
df_oversampled = pd.concat ([df_major, df_minor_oversampled])
# Display new class counts
df_oversampled.balance.value_counts()
L 288
B 288
R 288
Name: balance, dtype: int64
# Separate predictor features (X) and target (y)
y = df_oversampled.balance
X = df_oversampled.drop('balance', axis=1)
# Train model
logistic_regression_model = LogisticRegression().fit(X, y)
# Predict on training set
pred_balanced = logistic_regression_model.predict(X)
# Check the accuracy on the training set
print( accuracy_score(y, pred_balanced) )
conf_matrix = confusion_matrix(y, pred_balanced)
sns.heatmap(conf_matrix, annot=True,cmap='Blues')
We can see that the model accuracy has increased, and the confusion matrix shows that the model now generalizes across all three classes. Since we do not have much data, applying under-sampling here would be impractical, so we skip that step. You can certainly try it yourself and let me know how it performs; a minimal sketch of what it could look like is shown below.
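Here is that sketch, using the same resample() helper, this time without replacement (the names df_L_under, df_R_under and df_undersampled are introduced purely for illustration):
#keep all 49 "B" rows and randomly draw 49 rows (without replacement) from each majority class
df_L_under = resample(imbalanced_df[imbalanced_df.balance == 'L'],
                      replace=False, n_samples=49, random_state=123)
df_R_under = resample(imbalanced_df[imbalanced_df.balance == 'R'],
                      replace=False, n_samples=49, random_state=123)
df_undersampled = pd.concat([df_L_under, df_R_under, df_minor])
df_undersampled.balance.value_counts()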
Final Thoughts – Data Pre-Processing using Python
In this blog I have tried to discuss data pre-processing and the handling of imbalanced data-sets in detail. These are basic tasks that data science professionals need to perform in their day-to-day work. In my next blog I will write about Exploratory Data Analysis and the interesting insights that can be found with it. Please stay tuned for further updates.