What is Classification in Machine Learning? A Use Case in the Automobile Domain Using Python.
Introduction to Classification in Python
Hello world!!! Here I am with a new blog, and this time it will be different from the previous ones. This blog is about different classification algorithms, which I will implement on a real-life data-set. The algorithms I am going to discuss are Decision Tree, K-Nearest Neighbour (KNN), Naive Bayes, Random Forest, Support Vector Machine (SVM) and XGBoost. These are very important algorithms for a data science enthusiast to learn. I will discuss the working of each algorithm intuitively and implement them using the scikit-learn package in Python.
What is Classification?
Classification is one of the most important tasks in machine learning. A classification algorithm learns from labelled examples how to assign a class label to a new observation, for example whether an e-mail is spam or not.
There are plenty of algorithms in machine learning that can perform this task. A classification task requires a data-set with a good number of training examples, and these examples must be processed properly before modelling. I have covered EDA and pre-processing in detail in my previous blogs. I will start by explaining K-NN.
K-Nearest Neighbour (K-NN)
K-NN is one of the simplest algorithms and is widely used for classification tasks. It works on the principle that similar things belong to the same class. The training samples are vectors in a multi-dimensional space, each with a class label. The training phase consists simply of storing the feature vectors and class labels of the training samples. In the classification phase, k is a user-defined constant, and an unlabelled vector (a test point) is assigned the label that is most frequent among the k training samples nearest to that point.
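To make the voting rule concrete, here is a minimal sketch of K-NN written from scratch (the function, the toy points and the variable names are purely illustrative and not part of the car example; later in the blog we will rely on scikit-learn's implementation instead).

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    # Euclidean distance from the new point to every stored training vector
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # indices of the k closest training samples
    nearest = np.argsort(distances)[:k]
    # majority vote among the labels of those k samples
    return Counter(y_train[nearest]).most_common(1)[0][0]

# tiny toy example with two features and two classes
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([1.1, 0.9]), k=3))  # prints 0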
Let me start by importing a car data-set, in which we have to classify whether a used car will be acceptable to another customer for its next purchase. It has 7 columns:
- buying price: very high, high, medium, low
- maintenance price: very high, high, medium, low
- number of doors: 2, 3, 4, 5 or more.
- number of persons: 2, 4, more.
- size of luggage boot: small, medium, big.
- safety: low, medium, high.
- car acceptability : unacceptable, acceptable, good, very good
import pandas as pd
import numpy as np
colnames = ['buying_price', 'maintenance_price', 'doors', 'person', 'luggage_boot', 'safety', 'car']
# the CSV file has no header row, so pass the column names explicitly
data = pd.read_csv('car.csv', names=colnames)
data.sample(5)
|      | buying_price | maintenance_price | doors | person | luggage_boot | safety | car   |
|------|--------------|-------------------|-------|--------|--------------|--------|-------|
| 290  | vhigh        | med               | 4     | more   | small        | high   | acc   |
| 19   | vhigh        | vhigh             | 2     | more   | big          | med    | unacc |
| 1214 | med          | low               | 2     | more   | big          | high   | vgood |
| 585  | high         | high              | 3     | more   | small        | low    | unacc |
| 522  | high         | vhigh             | 5more | 4      | small        | low    | unacc |
data.describe()
|        | buying_price | maintenance_price | doors | person | luggage_boot | safety | car   |
|--------|--------------|-------------------|-------|--------|--------------|--------|-------|
| count  | 1728         | 1728              | 1728  | 1728   | 1728         | 1728   | 1728  |
| unique | 4            | 4                 | 4     | 3      | 3            | 3      | 4     |
| top    | med          | med               | 3     | more   | med          | med    | unacc |
| freq   | 432          | 432               | 432   | 576    | 576          | 576    | 1210  |
We can see that there are 1728 examples in the data-set, that all the columns are categorical, and that no values are missing. We have to convert all the textual columns into a numerical representation, split the data-set into features and labels, and then divide it into training and test sets. Let me first encode the ordinal columns 'buying_price', 'maintenance_price', 'luggage_boot' and 'safety' as numbers.
# map the ordered categories to integers, preserving their natural order
data['buying_price'] = data['buying_price'].replace({'low': 0, 'med': 1, 'high': 2, 'vhigh': 3})
data['maintenance_price'] = data['maintenance_price'].replace({'low': 0, 'med': 1, 'high': 2, 'vhigh': 3})
data['luggage_boot'] = data['luggage_boot'].replace({'small': 0, 'med': 1, 'big': 2})
data['safety'] = data['safety'].replace({'low': 0, 'med': 1, 'high': 2})
data.head()
|   | buying_price | maintenance_price | doors | person | luggage_boot | safety | car   |
|---|--------------|-------------------|-------|--------|--------------|--------|-------|
| 0 | 3            | 3                 | 2     | 2      | 0            | 0      | unacc |
| 1 | 3            | 3                 | 2     | 2      | 0            | 1      | unacc |
| 2 | 3            | 3                 | 2     | 2      | 0            | 2      | unacc |
| 3 | 3            | 3                 | 2     | 2      | 1            | 0      | unacc |
| 4 | 3            | 3                 | 2     | 2      | 1            | 1      | unacc |
Let us check the unique values in the 'doors' and 'person' columns.
data['doors'].value_counts()
3 432
5more 432
2 432
4 432
Name: doors, dtype: int64
data['person'].value_counts()
more 576
2 576
4 576
Name: person, dtype: int64
Let us replace '5more' in the 'doors' column and 'more' in the 'person' column with 5.
data['doors'] = data['doors'].replace({'5more': 5})
data['person'] = data['person'].replace({'more': 5})
# note: the other values in these two columns are still stored as strings;
# we will convert them to a numeric type later, before fitting XGBoost
data.head()
|   | buying_price | maintenance_price | doors | person | luggage_boot | safety | car   |
|---|--------------|-------------------|-------|--------|--------------|--------|-------|
| 0 | 3            | 3                 | 2     | 2      | 0            | 0      | unacc |
| 1 | 3            | 3                 | 2     | 2      | 0            | 1      | unacc |
| 2 | 3            | 3                 | 2     | 2      | 0            | 2      | unacc |
| 3 | 3            | 3                 | 2     | 2      | 1            | 0      | unacc |
| 4 | 3            | 3                 | 2     | 2      | 1            | 1      | unacc |
Finally, I will convert the labels, i.e., the 'car' column, into numerical labels.
from sklearn import preprocessing

# LabelEncoder assigns integers in alphabetical order: acc=0, good=1, unacc=2, vgood=3
label_encoder = preprocessing.LabelEncoder()
encoded = label_encoder.fit_transform(data['car'])
label = pd.DataFrame(encoded, columns=['car'])
label['car'].value_counts()
2 1210
0 384
1 69
3 65
Name: car, dtype: int64
# every column except the last ('car') is a feature
feature = data.iloc[:, :-1]
label = label['car']
from sklearn.model_selection import train_test_split
# hold out 30% of the examples as a test set
feature_train, feature_test, label_train, label_test = train_test_split(feature, label, test_size=0.3, random_state=1)
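One optional point worth noting (my own suggestion, not part of the original workflow): the label counts above are quite imbalanced (1210 'unacc' examples against only 65 'vgood'), so you may prefer a stratified split that keeps the class proportions identical in the training and test sets. A minimal sketch:

# optional alternative: preserve the class proportions in both splits
feature_train, feature_test, label_train, label_test = train_test_split(
    feature, label, test_size=0.3, random_state=1, stratify=label)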
We have completed all the pre-processing steps. Now I will implement K-NN on the dataset.
from sklearn.neighbors import KNeighborsClassifier

# k = 5: each test point is labelled by a majority vote of its 5 nearest training points
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(feature_train, label_train)
y_pred = knn.predict(feature_test)

from sklearn import metrics
print("Accuracy:", metrics.accuracy_score(label_test, y_pred))
Accuracy: 0.9556840077071291
We can see that our model performs very well, with an accuracy of 95.57%. Let us have a look at the confusion matrix.
from sklearn.metrics import confusion_matrix
import seaborn as sns

# rows of the confusion matrix are the true labels, columns are the predicted labels
conf_matrix = confusion_matrix(label_test, y_pred)
sns.heatmap(conf_matrix, annot=True, cmap='Blues')
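Since k is a user-defined constant, it is worth checking a few values rather than settling on 5 straight away; this also ties in with the hyper-parameter tuning mentioned at the end of the blog. A small sketch (the exact accuracies will depend on the split):

# try several values of k and report the test accuracy for each
for k in [1, 3, 5, 7, 9, 11]:
    knn_k = KNeighborsClassifier(n_neighbors=k)
    knn_k.fit(feature_train, label_train)
    acc = metrics.accuracy_score(label_test, knn_k.predict(feature_test))
    print("k =", k, "accuracy =", round(acc, 4))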
Next we will look into the Decision Tree algorithm and implement it.
Decision Tree
A decision tree is a flowchart in which each internal node represents a 'test' on a feature (e.g. whether the sky is overcast or not), each branch represents an outcome of that test, and each leaf node represents a class label (the decision reached after evaluating all the attributes along the path). The path from the root to a leaf represents a classification rule. Below is a simple decision tree flowchart.
The flowchart decides whether a particular day is suitable for a picnic. If the weather is cloudy, there won't be any picnic. On the other hand, if the day is sunny but the temperature is more than 20 degrees, there won't be any picnic either. This is the simple, intuitive structure of a decision tree.
Some of the advantages of decision trees are:
- Simple to understand and interpret.
- Able to handle multi-class problems.
- Does not require a large amount of data.
Let me now apply a decision tree to our car data-set. We will reuse the same encoded features and train/test split as before; no further conversion is needed.
from sklearn import tree
clf = tree.DecisionTreeClassifier()
clf = clf.fit(feature_train, label_train)
y_pred = clf.predict(feature_test)
print("Accuracy:",metrics.accuracy_score(label_test, y_pred))
Accuracy: 0.9614643545279383
from sklearn.metrics import confusion_matrix
import seaborn as sns
conf_matrix = confusion_matrix(label_test, y_pred)
sns.heatmap(conf_matrix, annot=True,cmap='Blues')
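Since one of the advantages listed above is interpretability, it is worth mentioning that scikit-learn can print the learned rules directly with export_text. A quick sketch (the full tree is fairly deep, so only the start of the output is shown):

from sklearn.tree import export_text

# print the if/else rules the fitted tree has learned, using our column names
rules = export_text(clf, feature_names=list(feature.columns))
print(rules[:500])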
We can see that with a decision tree the accuracy has increased further. Now I will move on to Random Forest and implement it.
Random Forest
Random Forest is an ensemble learning method for classification, regression and other tasks. It operates by constructing many decision trees at training time and outputting the class that is the mode of the classes predicted by the individual trees (for classification) or their mean prediction (for regression).
For example, suppose that on a binary classification task a random forest constructs 5 different decision trees (DTs). Each decision tree predicts whether a particular observation belongs to class A or class B. If 3 DTs predict class A and 2 DTs predict class B, the final prediction is class A, by majority vote.
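The voting step itself is simple; here is a tiny illustrative sketch with made-up per-tree predictions (the values are hypothetical, purely to show the idea):

from collections import Counter

# hypothetical predictions from 5 individual decision trees for one observation
tree_votes = ['A', 'A', 'B', 'A', 'B']
final_prediction = Counter(tree_votes).most_common(1)[0][0]
print(final_prediction)  # 'A', the class with the most votes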
Random Forest can also be used to find the importance of each variable. Let me now implement it on our car data-set.
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier()
clf = clf.fit(feature_train, label_train)
y_pred = clf.predict(feature_test)
print("Accuracy:",metrics.accuracy_score(label_test, y_pred))
Accuracy: 0.9614643545279383
from sklearn.metrics import confusion_matrix
import seaborn as sns
conf_matrix = confusion_matrix(label_test, y_pred)
sns.heatmap(conf_matrix, annot=True,cmap='Blues')
import matplotlib.pyplot as plt

importances = clf.feature_importances_
# spread of each feature's importance across the individual trees in the forest
std = np.std([est.feature_importances_ for est in clf.estimators_], axis=0)
indices = np.argsort(importances)[::-1]

print("Feature ranking:")
for feat, importance in zip(feature.columns, clf.feature_importances_):
    print('feature: {f}, importance: {i}'.format(f=feat, i=importance))

plt.figure()
plt.title("Feature importances")
# bar heights are the importances (sorted), error bars show the spread across trees,
# and the x-axis labels are the column indices in order of importance
plt.bar(range(feature_train.shape[1]), importances[indices],
        color="r", yerr=std[indices], align="center")
plt.xticks(range(feature_train.shape[1]), indices)
plt.xlim([-1, feature_train.shape[1]])
plt.show()
Feature ranking:
feature: buying_price, importance: 0.15805893468088125
feature: maintenance_price, importance: 0.14284411023417876
feature: doors, importance: 0.06261586615016933
feature: person, importance: 0.23607380372770673
feature: luggage_boot, importance: 0.09643509879528141
feature: safety, importance: 0.3039721864117825
From the output above we can see that safety is the most important feature for this classification task, followed by the number of persons the car can carry. Now we will move on to Naive Bayes.
Naive Bayes
Naive Bayes is not a single algorithm but a family of classifiers based on a common principle: all naive Bayes classifiers assume that the value of a particular feature is independent of the values of the other features, given the class. For example, a fruit may be considered to be an orange if it is orange in colour, round, and about 10 cm in diameter. A naive Bayes classifier treats each of these features as contributing independently to the probability that the fruit is an orange, regardless of any possible correlations between the colour, roundness and diameter features.
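In symbols, the naive assumption means that the score for a class is simply the prior probability of the class multiplied by the per-feature likelihoods, each treated independently. A minimal sketch for the fruit example with made-up numbers (they are purely illustrative):

# made-up probabilities for the fruit example
p_orange = 0.3                 # prior P(class = orange)
p_colour_given_orange = 0.8    # P(colour is orange | class = orange)
p_round_given_orange = 0.9     # P(shape is round | class = orange)
p_size_given_orange = 0.7      # P(diameter is about 10 cm | class = orange)

# naive Bayes score: prior times the product of the per-feature likelihoods
score_orange = p_orange * p_colour_given_orange * p_round_given_orange * p_size_given_orange
print(score_orange)  # compare against the same product computed for the other classes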
Naive Bayes classifiers require only a small amount of training data and are widely used in text classification. In this blog, however, I will apply one to our car data-set and see how it performs.
from sklearn.naive_bayes import GaussianNB

# GaussianNB assumes each feature is normally distributed within each class
gnb = GaussianNB()
gnb = gnb.fit(feature_train, label_train)
y_pred = gnb.predict(feature_test)
print("Accuracy:", metrics.accuracy_score(label_test, y_pred))
Accuracy: 0.7822736030828517
from sklearn.metrics import confusion_matrix
import seaborn as sns
conf_matrix = confusion_matrix(label_test, y_pred)
sns.heatmap(conf_matrix, annot=True,cmap='Blues')
We can see that the Gaussian Naive Bayes model did not perform as well on this task, but we now have an understanding of how to implement it using scikit-learn.
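One hedged aside before moving on: all of our features are really ordinal category codes, so the Gaussian variant (which assumes normally distributed features) is arguably a mismatch. scikit-learn also provides CategoricalNB, which models each feature as a categorical distribution; whether it actually does better here needs to be checked. A sketch:

from sklearn.naive_bayes import CategoricalNB

# CategoricalNB expects non-negative integer category codes,
# so cast the remaining string values to integers first
cnb = CategoricalNB()
cnb.fit(feature_train.astype(int), label_train)
print("Accuracy:", metrics.accuracy_score(label_test, cnb.predict(feature_test.astype(int))))

Next we will move on to implementing a Support Vector Machine on this data-set and check how it performs.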
Support Vector Machine (SVM)
Support Vector Machine is a supervised machine learning algorithm that can be used for both regression and classification, although it is generally used for classification. In SVM, we start by plotting each data point in an n-dimensional space, where n is the number of features. We then look for a hyper-plane that separates the data points according to their labels. The hyper-plane must be chosen so that its distance to the nearest data points of each class (the margin) is as large as possible.
In the diagram above there are two classes, and three candidate hyper-planes have been drawn. Of the three, hyper-plane A is too close to the red dots, while C is too close to the blue stars. So we choose B as our hyper-plane, because it separates the two classes well and its distance to the nearest points of both classes is maximal. This is the intuition behind how SVM works. Now let me apply it to our data-set.
from sklearn import svm

# SVC uses the RBF kernel by default
clf = svm.SVC()
clf = clf.fit(feature_train, label_train)
y_pred = clf.predict(feature_test)
print("Accuracy:", metrics.accuracy_score(label_test, y_pred))
Accuracy: 0.9229287090558767
from sklearn.metrics import confusion_matrix
import seaborn as sns
conf_matrix = confusion_matrix(label_test, y_pred)
sns.heatmap(conf_matrix, annot=True,cmap='Blues')
We can see that the SVM model performed fairly well, though not quite as well as the tree-based models.
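Because the hyper-plane is chosen by maximising distances, SVMs are sensitive to the scale of the features. One common remedy, sketched below as a suggestion rather than a guaranteed improvement, is to standardise the features inside a pipeline before fitting the SVM; whether it helps on this particular data-set (where all features already lie on small integer scales) needs to be checked.

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# scale every feature to zero mean and unit variance, then fit the SVM
scaled_svm = make_pipeline(StandardScaler(), SVC())
scaled_svm.fit(feature_train, label_train)
print("Accuracy:", metrics.accuracy_score(label_test, scaled_svm.predict(feature_test)))

With that noted, let us move on to our last and final topic, XGBoost.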
XGBoost
XGBoost is another ensemble-based algorithm built from "weak learners". It is one of the more recent algorithms and has proved to be highly efficient and fast. In boosting, the individual models are not built on completely random subsets of the data; they are built sequentially, with more weight put on the instances the current model gets wrong. In gradient boosting this "learning from previous mistakes" is made precise: for each observation the predicted result is compared with the actual result, the gradient of the loss function is computed, and the next weak learner is fitted so as to reduce that error. I will write about Neural Networks and Gradient Descent in much more detail in my next blog.
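To make the "learning from previous mistakes" idea concrete, here is a toy sketch of gradient boosting on a small regression problem: each new shallow tree is fitted to the residual errors left by the model so far. This is an illustration of the principle only, not a faithful reproduction of everything XGBoost does internally.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = np.linspace(0, 6, 100).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=100)

prediction = np.zeros_like(y)   # start from a constant (zero) prediction
learning_rate = 0.3

for step in range(20):
    residual = y - prediction                              # the mistakes made so far
    weak_learner = DecisionTreeRegressor(max_depth=2)      # a shallow "weak" tree
    weak_learner.fit(X, residual)                          # fit it to the residuals
    prediction += learning_rate * weak_learner.predict(X)  # nudge the model towards the truth

print("mean squared error after boosting:", np.mean((y - prediction) ** 2))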
I will now implement XGBoost on our car data-set. Before that, I will convert the 'doors' and 'person' columns to a numeric (int64) type, since XGBoost expects numerical input.
feature_train['doors'] = pd.to_numeric(feature_train['doors'])
feature_train['person'] = pd.to_numeric(feature_train['person'])
feature_test['doors'] = pd.to_numeric(feature_test['doors'])
feature_test['person'] = pd.to_numeric(feature_test['person'])
from xgboost import XGBClassifier
clf = XGBClassifier()
clf = clf.fit(feature_train, label_train)
y_pred = clf.predict(feature_test)
print("Accuracy:",metrics.accuracy_score(label_test, y_pred))
Accuracy: 0.9884393063583815
from sklearn.metrics import confusion_matrix
import seaborn as sns
conf_matrix = confusion_matrix(label_test, y_pred)
sns.heatmap(conf_matrix, annot=True,cmap='Blues')
It is quite evident how powerful XGBoost is; it produced the most accurate predictions so far on this data-set (98.84%).
Final Thoughts
We have covered a lot of algorithms in this blog and implemented all of them in Python. Readers can take other data-sets and try implementing all of the algorithms to get a better understanding of the concepts. Hyper-parameter tuning can also be explored. We saw that the XGBoost algorithm produced the most accurate results on our car data-set. Supervised learning for classification is a vast field, and a lot of research is going on in the area of Neural Networks; my next blog will cover their detailed working. If you have any doubts, feel free to comment and I will get back to you with an explanation.