Simple classification problem with sklearn iris flower dataset

Discover how neural networks can solve classification problems and revolutionize data analysis

Victor Bona

“Information is the oil of the 21st century, and analytics is the combustion engine.” - Peter Sondergaard

Classification problems

Let’s say your boss asks you to code a way to classify a store’s customers into classes based on their attributes (shopping rate, income, age, family size, etc.), or a school hires you to build something that helps teachers understand what kinds of students they have based on school-related attributes. All of these problems can be solved using machine learning and neural networks.

Basically, we can create a machine learning model to classify something based on data.

As an example, we will try to identify the species of a plant based on its flower attributes (sepal length, sepal width, petal length and petal width). Using this data, we will create a model that predicts which species the plant belongs to.

The data

The Iris data set is commonly used as the “Hello world” data set for starting machine learning studies.

“The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimeters. Based on the combination of these four features, Fisher developed a linear discriminant model to distinguish the species from each other.” (Wikipedia)

Summary Statistics

It’s important to remember that this data set will be loaded from the samples bundled with sklearn.datasets, so we don’t need to clean or preprocess the data in any way.
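
A quick way to get a feel for the data is to load it and print some summary statistics. This is only a minimal sketch: it builds the DataFrame the same way the article does later, and the variable names `summary` and `class_counts` are just illustrative.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris_dataset = load_iris()
data = pd.DataFrame(iris_dataset.data, columns=iris_dataset.feature_names)
data['target'] = iris_dataset.target

# Count, mean, std, min, quartiles and max for each of the four features
summary = data.drop('target', axis=1).describe()
print(summary)

# Number of samples per species label (0, 1, 2)
class_counts = data['target'].value_counts().sort_index()
print(class_counts)
```

The class counts confirm the 50 samples per species mentioned above.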

The target column, which represents the plant species, is stored as a number, so we will use this dict as a reference:

ref = {
    0: 'Iris-Setosa',
    1: 'Iris-Versicolour',
    2: 'Iris-Virginica'
}
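
As a small illustration of how this dict can be used, the sketch below translates a few numeric labels into species names. The chosen row positions (0, 75, 149) are arbitrary picks for the example; sklearn also ships its own names in `iris_dataset.target_names`.

```python
from sklearn.datasets import load_iris

ref = {0: 'Iris-Setosa', 1: 'Iris-Versicolour', 2: 'Iris-Virginica'}

iris_dataset = load_iris()

# Translate the numeric labels of three arbitrary rows into species names
labels = [int(iris_dataset.target[i]) for i in (0, 75, 149)]
names = [ref[label] for label in labels]
print(names)

# The names bundled with the data set, for comparison
print(list(iris_dataset.target_names))
```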

The model

There are several kinds of classification models we could use for this problem, and almost all of them would probably perform very well. After some tests, I found that the MLP classifier suits this case very well, so we will use it as our example.
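
To give an idea of what such a comparison could look like, here is a rough sketch that cross-validates a few sklearn classifiers on the same data. The choice of candidates and their settings are my own assumptions for illustration, not the exact tests behind this article.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Illustrative candidates; the hyperparameters are assumptions, not tuned values
candidates = {
    'logistic regression': LogisticRegression(max_iter=1000),
    'k-nearest neighbors': KNeighborsClassifier(n_neighbors=5),
    'decision tree': DecisionTreeClassifier(random_state=0),
    'MLP': MLPClassifier(max_iter=1000, random_state=0),
}

scores = {}
for name, clf in candidates.items():
    # Mean accuracy over 5 cross-validation folds
    scores[name] = cross_val_score(clf, X, y, cv=5).mean()
    print(f'{name}: {scores[name]:.3f}')
```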

What actually is the MLP classifier?

According to Wikipedia, “A multilayer perceptron (MLP) is a class of feedforward artificial neural network (ANN). The term MLP is used ambiguously, sometimes loosely to any feedforward ANN, sometimes strictly to refer to networks composed of multiple layers of perceptrons (with threshold activation). Multilayer perceptrons are sometimes colloquially referred to as “vanilla” neural networks, especially when they have a single hidden layer.”

If we read something like this without having studied the subject, it will probably be a little confusing, but it is not that hard in simpler words: a multilayer perceptron is a supervised neural network that uses nonlinear activation functions and is trained with a technique called backpropagation. It is often used to solve classification problems.
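
To make the idea a bit more concrete, here is a minimal sketch of what a single forward pass through a tiny MLP looks like in NumPy. Everything here is an assumption for illustration: the layer sizes, the ReLU and softmax choices, and the random (untrained) weights. Backpropagation, which would adjust those weights, is not shown.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 4 input features, 8 hidden units, 3 output classes
W1, b1 = rng.normal(size=(4, 8)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 3)), np.zeros(3)

def forward(x):
    """Run one sample through a hidden layer and an output layer."""
    hidden = np.maximum(0, x @ W1 + b1)   # ReLU: the nonlinear activation
    logits = hidden @ W2 + b2
    exp = np.exp(logits - logits.max())   # softmax, shifted for numerical stability
    return exp / exp.sum()

# One flower: sepal length, sepal width, petal length, petal width (cm)
x = np.array([5.1, 3.5, 1.4, 0.2])
probs = forward(x)
print(probs)
```

Because the weights are random, the printed probabilities are meaningless; training is what turns them into useful predictions.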

Ok, let’s code it

Coding it is very easy with sklearn, but remember that the data we are using is already clean and ready to go. In real-world problems, the data will need to be processed and the hyperparameters tuned. Data processing is one of the most important parts of creating an ML model.

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
from sklearn.neural_network import MLPClassifier

# Load the data set into a DataFrame and attach the numeric species label
iris_dataset = load_iris()
data = pd.DataFrame(iris_dataset.data, columns=iris_dataset.feature_names)
data['target'] = iris_dataset.target

# Set 5 random rows aside; they are not used for training or testing
validation = data.sample(5)
data.drop(validation.index, axis=0, inplace=True)

# Split the remaining rows into training (75%) and test (25%) sets
features = data.drop('target', axis=1)
x_train, x_test, y_train, y_test = train_test_split(features, data['target'], random_state=42, test_size=0.25)

# Train the multilayer perceptron, allowing up to 1000 iterations
model = MLPClassifier(max_iter=1000)
model.fit(x_train, y_train)

As you can see, the code is easy to write and easy to understand. Now we can predict a plant’s species by inputting its attributes. Let’s get a sample and try it.

Then we predict it using our model. We expect the result 1.
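
The two steps above (getting a sample, then predicting it) could look roughly like the sketch below. It is self-contained, so it retrains the model, skips the 5-row hold-out for brevity, and takes the sample from the test split; the fixed `random_state` on the classifier is my addition for repeatability.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

iris_dataset = load_iris()
data = pd.DataFrame(iris_dataset.data, columns=iris_dataset.feature_names)
data['target'] = iris_dataset.target

features = data.drop('target', axis=1)
x_train, x_test, y_train, y_test = train_test_split(
    features, data['target'], random_state=42, test_size=0.25)

model = MLPClassifier(max_iter=1000, random_state=42)
model.fit(x_train, y_train)

# Take one row from the test split and keep its true label for comparison
sample = x_test.iloc[[0]]      # double brackets keep it a one-row DataFrame
expected = y_test.iloc[0]
predicted = model.predict(sample)[0]
print('expected:', expected, 'predicted:', predicted)
```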

As we can see, we got the correct result. Checking our reference, we see that the sample is an Iris-Versicolour. Another thing we can do is check our model’s score using the test data we created previously.
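
Checking the score can be sketched as follows. Again, this is a self-contained approximation of the article’s setup (no 5-row hold-out, fixed `random_state` on the classifier as an assumption); `score` returns the mean accuracy on the data it is given.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_iris(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, random_state=42, test_size=0.25)

model = MLPClassifier(max_iter=1000, random_state=42)
model.fit(x_train, y_train)

# Mean accuracy on rows used for training vs. rows held out by the split
train_accuracy = model.score(x_train, y_train)
test_accuracy = model.score(x_test, y_test)
print('train accuracy:', train_accuracy)
print('test accuracy:', test_accuracy)
```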

Our model has an impressive accuracy score. The test split was not used for training, so this score already reflects data the model has not seen, but it comes from a single split. For one more check on unknown data, we set a few samples aside before splitting, which is exactly why we did this previously:

validation = data.sample(5)
data.drop(validation.index, axis=0, inplace=True)

This part of the code took 5 random samples out of the data set to use as validation data. We do this to see how the model performs on unknown data and to get a real idea of its accuracy. So, let’s try it out.
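
A sketch of that validation step is shown below. It repeats the whole pipeline so it can run on its own; the `random_state` values on `sample` and on the classifier are assumptions added for repeatability, so the 5 rows it picks will differ from the ones drawn in the article.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

ref = {0: 'Iris-Setosa', 1: 'Iris-Versicolour', 2: 'Iris-Virginica'}

iris_dataset = load_iris()
data = pd.DataFrame(iris_dataset.data, columns=iris_dataset.feature_names)
data['target'] = iris_dataset.target

# Set 5 rows aside before splitting (random_state is an assumption)
validation = data.sample(5, random_state=0)
data = data.drop(validation.index)

features = data.drop('target', axis=1)
x_train, x_test, y_train, y_test = train_test_split(
    features, data['target'], random_state=42, test_size=0.25)

model = MLPClassifier(max_iter=1000, random_state=42)
model.fit(x_train, y_train)

# Predict the held-out rows and print each prediction next to the true species
predictions = model.predict(validation.drop('target', axis=1))
for true_label, predicted_label in zip(validation['target'], predictions):
    print('actual:', ref[int(true_label)], '| predicted:', ref[int(predicted_label)])
```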

There we are, we got results with our validation data.

Conclusion

We saw a very condensed implementation of a neural network using sklearn, with data that was pre-formatted and ready to go. Real problems are rarely that easy to solve and usually demand hours of analysis and hard work.
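
As a hint of what that extra work can involve, here is a rough sketch that adds feature scaling and a small hyperparameter search on top of the same classifier. The pipeline step names, the grid values, and `max_iter=2000` are hypothetical choices for illustration, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, random_state=42, test_size=0.25)

# Scale the features, then train the MLP; scaling is fitted on training folds only
pipeline = Pipeline([
    ('scale', StandardScaler()),
    ('mlp', MLPClassifier(max_iter=2000, random_state=42)),
])

# Hypothetical, deliberately small grid of hyperparameters to try
param_grid = {
    'mlp__hidden_layer_sizes': [(10,), (50,)],
    'mlp__alpha': [1e-4, 1e-2],
}

search = GridSearchCV(pipeline, param_grid, cv=3)
search.fit(x_train, y_train)

best_params = search.best_params_
test_accuracy = search.score(x_test, y_test)
print(best_params)
print('test accuracy:', test_accuracy)
```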

The purpose of this article is just to give a simple example of implementing your first neural network and to present some concepts and links to guide and focus your studies.

References & Useful links