Exploratory data analysis in Python
Hello World!!! Here I am with another interesting blog, and you can surely guess what it is about: Exploratory Data Analysis (EDA) in Python.
EDA and data mining are closely related, but not the same. In EDA we explore the data-set thoroughly and try to extract interesting insights from it. Using EDA we can also derive new variables from the existing ones, which can help us predict more accurate values. In data mining, we try to find patterns in the data-set; using these patterns we can understand the relationships between variables and use them to improve business decisions.
EDA is the most important step before building a machine learning model. A detailed EDA can surface important insights that improve business decisions. In EDA we break a data-set down and represent it visually, so that even a non-technical person can understand what is going on in the business and how it can be improved. Some of the basic steps that need to be done are:
- Analyse each variable individually and check whether it follows a normal distribution.
- Identify any missing values or outliers and fix them (as discussed in my previous blog).
- Represent the variables graphically and look for important patterns.
- Check whether the variables exhibit multi-collinearity (more details below).
In this blog I am going to walk through EDA in Python using the Melbourne housing data-set. Python has a lot of scientific packages related to machine learning; some of the packages I will be using are Pandas, Numpy, Matplotlib, Scikit-learn and Seaborn. I will try to keep it as simple as possible, but very detailed.
Let us start by importing the required packages and the dataset.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings

# Suppress warning messages for cleaner output
warnings.filterwarnings("ignore")
# Show all columns when displaying DataFrames
pd.set_option('display.max_columns', None)

# Load the Melbourne housing data-set
original_dataset = pd.read_csv('S:/media/melb_data.csv')
original_dataset.sample(5)
 | Suburb | Address | Rooms | Type | Price | Method | SellerG | Date | Distance | Postcode | Bedroom2 | Bathroom | Car | Landsize | BuildingArea | YearBuilt | CouncilArea | Lattitude | Longtitude | Regionname | Propertycount
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
1135 | Brighton | 112 Asling St | 4 | h | 2000000.0 | S | Marshall | 25/02/2017 | 11.2 | 3186.0 | 4.0 | 2.0 | 2.0 | 471.0 | NaN | NaN | Bayside | -37.89560 | 145.00370 | Southern Metropolitan | 10579.0 |
10375 | Knoxfield | 8 Markhill Pl | 4 | h | 1218000.0 | S | Ray | 27/05/2017 | 23.3 | 3180.0 | 4.0 | 2.0 | 2.0 | 900.0 | 281.0 | 1994.0 | Knox | -37.87695 | 145.25470 | Eastern Metropolitan | 2949.0 |
12696 | Coburg | 7 Kewarren Ct | 4 | h | 975000.0 | SP | Nelson | 16/09/2017 | 6.7 | 3058.0 | 4.0 | 2.0 | 2.0 | 301.0 | 209.0 | 2005.0 | NaN | -37.73697 | 144.97568 | Northern Metropolitan | 11204.0 |
1021 | Braybrook | 19 Kannan Bvd | 4 | h | 882000.0 | S | Village | 10/12/2016 | 10.8 | 3019.0 | 4.0 | 2.0 | 2.0 | 393.0 | 216.0 | 2008.0 | Maribyrnong | -37.77630 | 144.85430 | Western Metropolitan | 3589.0 |
797 | Bentleigh East | 1/41 Elizabeth St | 3 | u | 1025000.0 | S | Woodards | 4/03/2017 | 13.9 | 3165.0 | 3.0 | 2.0 | 2.0 | 384.0 | 120.0 | 1999.0 | Glen Eira | -37.92290 | 145.05570 | Southern Metropolitan | 10969.0 |
I have imported the data-set and displayed 5 random samples from it. Now let me check the dimensions of the data-set.
original_dataset.shape
(13580, 21)
The data-set has 13580 rows and 21 columns. Pandas has a built-in function, describe(), that summarizes the numerical columns in a data-set and tells us about the mean, standard deviation, count, etc. This function only considers the numerical columns. So let me use it to generate a summary of the numerical columns in the data-set.
original_dataset.describe()
 | Rooms | Price | Distance | Postcode | Bedroom2 | Bathroom | Car | Landsize | BuildingArea | YearBuilt | Lattitude | Longtitude | Propertycount
---|---|---|---|---|---|---|---|---|---|---|---|---|---
count | 13580.000000 | 1.358000e+04 | 13580.000000 | 13580.000000 | 13580.000000 | 13580.000000 | 13518.000000 | 13580.000000 | 7130.000000 | 8205.000000 | 13580.000000 | 13580.000000 | 13580.000000 |
mean | 2.937997 | 1.075684e+06 | 10.137776 | 3105.301915 | 2.914728 | 1.534242 | 1.610075 | 558.416127 | 151.967650 | 1964.684217 | -37.809203 | 144.995216 | 7454.417378 |
std | 0.955748 | 6.393107e+05 | 5.868725 | 90.676964 | 0.965921 | 0.691712 | 0.962634 | 3990.669241 | 541.014538 | 37.273762 | 0.079260 | 0.103916 | 4378.581772 |
min | 1.000000 | 8.500000e+04 | 0.000000 | 3000.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1196.000000 | -38.182550 | 144.431810 | 249.000000 |
25% | 2.000000 | 6.500000e+05 | 6.100000 | 3044.000000 | 2.000000 | 1.000000 | 1.000000 | 177.000000 | 93.000000 | 1940.000000 | -37.856822 | 144.929600 | 4380.000000 |
50% | 3.000000 | 9.030000e+05 | 9.200000 | 3084.000000 | 3.000000 | 1.000000 | 2.000000 | 440.000000 | 126.000000 | 1970.000000 | -37.802355 | 145.000100 | 6555.000000 |
75% | 3.000000 | 1.330000e+06 | 13.000000 | 3148.000000 | 3.000000 | 2.000000 | 2.000000 | 651.000000 | 174.000000 | 1999.000000 | -37.756400 | 145.058305 | 10331.000000 |
max | 10.000000 | 9.000000e+06 | 48.100000 | 3977.000000 | 20.000000 | 8.000000 | 10.000000 | 433014.000000 | 44515.000000 | 2018.000000 | -37.408530 | 145.526350 | 21650.000000 |
From the above table, we can learn a lot about the numerical columns. The function returns the count, mean, standard deviation, quartiles, minimum and maximum value for each column. If we look at the counts, we can see there are lots of missing values in the "Car", "BuildingArea" and "YearBuilt" columns. I discussed the handling of missing values in my previous post.
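To see the exact counts, we can sum the null values per column (a quick sketch; I have omitted the output here):

# Count the missing values in each column
original_dataset.isnull().sum()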
This data-set has a "YearBuilt" column; using it, I will add another column, "PropertyAge", and use it for further analysis. We will also add a "PropertyType" column, such that any property 100 years old or more is considered "Historical" and anything younger is "Modern". So let's do that.
# Property age relative to 2020, the year of this analysis
original_dataset['PropertyAge'] = 2020 - original_dataset['YearBuilt']
# Properties 100 years old or more are "Historical", the rest "Modern"
original_dataset['PropertyType'] = np.where(original_dataset['PropertyAge'] >= 100, 'Historical', 'Modern')
Let us use the describe() function again and look at the updated data-set description.
original_dataset.describe()
 | Rooms | Price | Distance | Postcode | Bedroom2 | Bathroom | Car | Landsize | BuildingArea | YearBuilt | Lattitude | Longtitude | Propertycount | PropertyAge
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
count | 13580.000000 | 1.358000e+04 | 13580.000000 | 13580.000000 | 13580.000000 | 13580.000000 | 13518.000000 | 13580.000000 | 7130.000000 | 8205.000000 | 13580.000000 | 13580.000000 | 13580.000000 | 8205.000000 |
mean | 2.937997 | 1.075684e+06 | 10.137776 | 3105.301915 | 2.914728 | 1.534242 | 1.610075 | 558.416127 | 151.967650 | 1964.684217 | -37.809203 | 144.995216 | 7454.417378 | 55.315783 |
std | 0.955748 | 6.393107e+05 | 5.868725 | 90.676964 | 0.965921 | 0.691712 | 0.962634 | 3990.669241 | 541.014538 | 37.273762 | 0.079260 | 0.103916 | 4378.581772 | 37.273762 |
min | 1.000000 | 8.500000e+04 | 0.000000 | 3000.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1196.000000 | -38.182550 | 144.431810 | 249.000000 | 2.000000 |
25% | 2.000000 | 6.500000e+05 | 6.100000 | 3044.000000 | 2.000000 | 1.000000 | 1.000000 | 177.000000 | 93.000000 | 1940.000000 | -37.856822 | 144.929600 | 4380.000000 | 21.000000 |
50% | 3.000000 | 9.030000e+05 | 9.200000 | 3084.000000 | 3.000000 | 1.000000 | 2.000000 | 440.000000 | 126.000000 | 1970.000000 | -37.802355 | 145.000100 | 6555.000000 | 50.000000 |
75% | 3.000000 | 1.330000e+06 | 13.000000 | 3148.000000 | 3.000000 | 2.000000 | 2.000000 | 651.000000 | 174.000000 | 1999.000000 | -37.756400 | 145.058305 | 10331.000000 | 80.000000 |
max | 10.000000 | 9.000000e+06 | 48.100000 | 3977.000000 | 20.000000 | 8.000000 | 10.000000 | 433014.000000 | 44515.000000 | 2018.000000 | -37.408530 | 145.526350 | 21650.000000 | 824.000000 |
We can see that the "YearBuilt" column has been treated as numerical, but it is really categorical (each year is a label rather than a quantity we want to average). So I will change the data type of this column.
original_dataset['YearBuilt'] = original_dataset['YearBuilt'].astype('category')
Now let us look at the data types of the columns using the info() function from pandas.
original_dataset.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13580 entries, 0 to 13579
Data columns (total 23 columns):
Suburb 13580 non-null object
Address 13580 non-null object
Rooms 13580 non-null int64
Type 13580 non-null object
Price 13580 non-null float64
Method 13580 non-null object
SellerG 13580 non-null object
Date 13580 non-null object
Distance 13580 non-null float64
Postcode 13580 non-null float64
Bedroom2 13580 non-null float64
Bathroom 13580 non-null float64
Car 13518 non-null float64
Landsize 13580 non-null float64
BuildingArea 7130 non-null float64
YearBuilt 8205 non-null category
CouncilArea 12211 non-null object
Lattitude 13580 non-null float64
Longtitude 13580 non-null float64
Regionname 13580 non-null object
Propertycount 13580 non-null float64
PropertyAge 8205 non-null float64
PropertyType 13580 non-null object
dtypes: category(1), float64(12), int64(1), object(9)
memory usage: 2.3+ MB
We can see that "YearBuilt" got converted to categorical. From the earlier summary we also learned that the oldest property was built in 1196 (a year that predates Melbourne's founding, so this is almost certainly a data-entry error, but let us take the data at face value here) and the newest in 2018. Let us find the regions where the oldest and newest properties are located, along with their prices.
original_dataset.loc[(original_dataset['YearBuilt']==1196)|(original_dataset['YearBuilt']==2018),['YearBuilt','Price','Regionname']]
 | YearBuilt | Price | Regionname
---|---|---|---
1234 | 2018.0 | 1310000.0 | Southern Metropolitan |
9968 | 1196.0 | 1200000.0 | Eastern Metropolitan |
From the above we now know about the oldest and newest properties. Next I will plot a histogram and a box-plot for all the numerical columns and try to get some insights.
# Generate box-plots for all numerical columns
original_dataset.select_dtypes(exclude=['object']).boxplot(figsize=(20, 10))
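The histograms can be generated in a similar way; a minimal sketch using pandas' built-in hist() (figure output not shown):

# Generate histograms for all numerical columns
original_dataset.select_dtypes(exclude=['object']).hist(figsize=(20, 10), bins=50)
plt.show()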
Some of the inferences that can be drawn from the above plots are:
- The "Price" column seems to have outliers, as is evident from the box-plot. But many of these are not true outliers, since it is perfectly normal for a wealthy person to own a house far more expensive than those of other citizens. From the histogram we can see that the column is not normally distributed, so we will apply a transformation and try to make it normal.
- "Rooms", "Postcode", "Bedroom2", "Bathroom", "Car" and "Propertycount" have discrete values.
Now I will look into the "Price" column in more detail and examine its distribution using the Seaborn package.
import seaborn as sns
sns.distplot(original_dataset['Price'], color='g', bins=100, hist_kws={'alpha': 0.4})
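A small note: distplot has been deprecated in recent Seaborn releases (0.11+). If you are on a newer version, histplot gives a roughly equivalent figure:

# Rough equivalent on Seaborn 0.11+, where distplot is deprecated
sns.histplot(original_dataset['Price'], color='g', bins=100, alpha=0.4, kde=True)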
From the figure, we can see that the distribution is skewed. Let's check the amount of skewness.
original_dataset['Price'].skew()
2.239624312529873
We can see that the skewness value is positive, which means the data is right-skewed. We will apply a log transform using Numpy and plot the histogram again to check whether the skewness decreases.
sns.distplot(np.log(original_dataset['Price']), color='g', bins=100, hist_kws={'alpha': 0.4})
Whoa!!! We can clearly see that the distribution is much more symmetrical. So let's check the skewness value.
np.log(original_dataset['Price']).skew()
0.18065988655954393
We can see the log transform worked like magic: the skewness is now close to zero, i.e., the log of "Price" is approximately normally distributed.
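For later modelling we could keep this transformed version around; a minimal sketch (the variable name is my own):

# Keep the log-transformed price for later modelling (hypothetical variable name)
log_price = np.log(original_dataset['Price'])
# Predictions made on the log scale can be mapped back to dollars with np.exp()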
Let us now check the correlation between the variables. Correlation measures how strongly one column is related to another; for example, a tall person usually weighs more than a short person, so height and weight are correlated. If a column is strongly correlated with another column, we typically drop one of them during modelling, since both columns provide essentially the same information, and keeping both can lead to over-fitting. Let us generate a correlation matrix and get some insights.
# Correlation matrix of the numerical columns only
correlation_matrix = original_dataset.select_dtypes(exclude=['object']).corr()
plt.figure(figsize=(10, 10))
# Heatmap with the correlation value annotated in each cell
sns.heatmap(correlation_matrix, annot=True)
plt.show()
We can get the following interesting insights from the heatmap:
- A strong positive correlation can be observed between "Rooms" and "Bedroom2". That means an increase in the number of rooms generally comes with an increase in the number of bedrooms, so during modelling we can drop either one of the columns (see the sketch after this list).
- There is a positive correlation between "PropertyAge" and "Price", i.e., newer properties tend to be priced lower than older ones.
- There is a positive correlation between "Rooms" and "Price": as the number of rooms increases, the price also increases.
- We can also find a positive correlation between "Car" and "Rooms". A person whose property has many rooms also tends to have space for many cars, which suggests the person is wealthy.
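As mentioned in the first point above, here is a minimal sketch of dropping one column of a highly correlated pair; which one to drop is a judgement call (the variable name below is my own):

# Drop "Bedroom2" and keep "Rooms" before modelling
modelling_dataset = original_dataset.drop(columns=['Bedroom2'])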
So far we have dealt with the numerical columns and extracted some meaningful insights from them. Now let us deal with the categorical columns.
categorical_dataset = original_dataset.select_dtypes(include=['object'])
categorical_dataset.sample(5)
 | Suburb | Address | Type | Method | SellerG | Date | CouncilArea | Regionname | PropertyType
---|---|---|---|---|---|---|---|---|---
845 | Bentleigh East | 37 Orange St | h | S | Buxton | 12/11/2016 | Glen Eira | Southern Metropolitan | Modern |
3903 | Malvern East | 1c Wilmot St | t | S | Jellis | 30/07/2016 | Stonnington | Southern Metropolitan | Modern |
3087 | Hampton | 1/514 Bluff Rd | t | VB | Buxton | 27/11/2016 | Bayside | Southern Metropolitan | Modern |
10121 | Balwyn North | 17 Cityview Rd | h | S | hockingstuart | 27/05/2017 | Boroondara | Southern Metropolitan | Modern |
8576 | Ashburton | 11 Poulter St | h | PI | Buxton | 29/04/2017 | Boroondara | Southern Metropolitan | Modern |
Let us now visualize which region has the most properties.
categorical_dataset['Regionname'].value_counts().plot(kind='bar')
Most of the properties are in Southern Metropolitan, while Western Victoria has the fewest. Let us now check the average price of properties by region.
original_dataset.groupby(['Regionname'])['Price'].mean().sort_values(ascending=False)
Regionname
Southern Metropolitan 1.372963e+06
Eastern Metropolitan 1.104080e+06
South-Eastern Metropolitan 9.229438e+05
Northern Metropolitan 8.981711e+05
Western Metropolitan 8.664205e+05
Eastern Victoria 6.999808e+05
Northern Victoria 5.948293e+05
Western Victoria 3.975234e+05
Name: Price, dtype: float64
We can clearly see that Southern Metropolitan has the highest average property price. Interestingly, Eastern Metropolitan is second on the list even though it holds far fewer properties. That suggests most of the properties in Eastern Metropolitan are quite expensive, and we can consider this a region where wealthy people live. Interesting business decisions could be based on this finding.
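To see the counts and the average prices side by side, we can aggregate both in one call (a quick sketch; output omitted):

# Property count and mean price per region, sorted by mean price
original_dataset.groupby('Regionname')['Price'].agg(['count', 'mean']).sort_values('mean', ascending=False)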
We did a lot of EDA and found interesting patterns. Let us now do some feature engineering on the categorical columns. In my previous blog I discussed feature scaling of numerical columns, but did not cover the handling of categorical features. So let us do that here.
Handling of categorical features
Categorical features cannot be used directly for modelling: the text values need to be converted to a numerical representation. There are two main ways to do this, one-hot encoding and label encoding. As a newbie, one easily gets confused about which encoding method to use, so I will explain when to use each.
One-hot encoding: In this method, a categorical column is converted into multiple columns of 0s and 1s, one per category. Let us look at an example using pandas, converting the "Regionname" column into dummy variables.
pd.get_dummies(original_dataset['Regionname'], prefix="Region").sample(5)
 | Region_Eastern Metropolitan | Region_Eastern Victoria | Region_Northern Metropolitan | Region_Northern Victoria | Region_South-Eastern Metropolitan | Region_Southern Metropolitan | Region_Western Metropolitan | Region_Western Victoria
---|---|---|---|---|---|---|---|---
6202 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
11589 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
3955 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
6231 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1347 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
We can see that the categorical column has been converted into numerical columns. Using the same method, we can convert all the categorical columns into dummy variables, concatenate them together and proceed to modelling.
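For example, a possible way to replace "Regionname" with its dummy columns could look like this (a sketch; the variable names are mine):

# Build dummy columns for "Regionname" and attach them to the rest of the data-set
region_dummies = pd.get_dummies(original_dataset['Regionname'], prefix='Region')
encoded_dataset = pd.concat([original_dataset.drop(columns=['Regionname']), region_dummies], axis=1)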
Label encoding: The other main method is label encoding. It is generally used when the column is ordinal, i.e., when the categories have a natural order; for example, a "Rating" column where people answer "Satisfied", "Dis-satisfied", etc.
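Before turning to Scikit-learn, here is a minimal sketch of an order-preserving encoding for such a hypothetical "Rating" column (the column and the mapping below are purely illustrative and not part of our data-set):

# Hypothetical ordinal column with an explicit order-preserving mapping
rating = pd.Series(['Dis-satisfied', 'Neutral', 'Satisfied', 'Neutral'])
rating_order = {'Dis-satisfied': 0, 'Neutral': 1, 'Satisfied': 2}
rating_encoded = rating.map(rating_order)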
We will treat "Type" in the data-set as ordinal and apply label encoding using the Scikit-learn package. Before applying label encoding, let us check the value counts of the "Type" column.
original_dataset['Type'].value_counts()
h 9449
u 3017
t 1114
Name: Type, dtype: int64
Now let us apply label encoding.
from sklearn import preprocessing

# Fit the encoder on "Type" and transform its values to integer codes
label_encoder = preprocessing.LabelEncoder()
encoded = label_encoder.fit_transform(original_dataset['Type'])
encoded = pd.DataFrame(encoded, columns=['Type'])
encoded.sample(5)
 | Type
---|---
1795 | 0 |
3457 | 0 |
3518 | 2 |
8355 | 2 |
231 | 0 |
We can see that the "Type" column has been converted to label-encoded data. Let's check the value counts and verify that the encoding is correct.
encoded['Type'].value_counts()
0 9449
2 3017
1 1114
Name: Type, dtype: int64
Hurray!!! You can see that "h" got converted to 0, "u" to 2 and "t" to 1, with the same counts as before. So we can conclude that the encoding is correct.
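As a bonus, LabelEncoder remembers the original categories, so the integer codes can always be mapped back (a quick sketch):

# label_encoder.classes_ holds the categories in code order: ['h', 't', 'u']
# Map the integer codes back to the original labels
decoded = label_encoder.inverse_transform(encoded['Type'])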
Let us now visualize the number of “Historical” and “Modern” properties and get some insights.
original_dataset['PropertyType'].value_counts().plot(kind='bar')
There are very few Historical properties compared to Modern ones. Now let us find the average price for the two kinds of properties.
original_dataset.groupby(['PropertyType'])['Price'].mean()
PropertyType
Historical 1.545792e+06
Modern 1.023792e+06
Name: Price, dtype: float64
We can see that the Historical properties are costlier than the Modern ones, consistent with what we inferred earlier from the correlation between "PropertyAge" and "Price".
Final thoughts – Exploratory Data Analysis in Python
In this blog I tried to explain EDA in Python in detail, and we got some really meaningful insights. We can keep exploring the data-set and make our analysis even more detailed; I encourage my readers to dig further and let me know about any cool findings. In my next blog, I am going to discuss logistic regression in Python and build a classification model from scratch. Please stay tuned for further updates.