Demystifying machine learning for beginners

If you're a confused beginner, like I was when I started out with machine learning in Python, then stick around, because today I'll be doing my best to demystify and simplify machine learning for you!

To start off, I presume that you would like to learn machine learning for the following reasons:

  1. Working with datasets
  2. Visualizing data
  3. Predicting data
  4. Classifying data

In this tutorial, we're going to write a Python script that will:

  • Load a dataset
  • Visualize the dataset
  • Classify a new piece of data given the dataset

Let's get started!

First, let's import the required libraries:

import pandas
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.neighbors import KNeighborsClassifier
from sklearn import preprocessing

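If you don't have some of these installed, you can install them with pip install or pip3 install, for example: pip install pandas matplotlib seaborn scikit-learn (note that sklearn is installed under the package name scikit-learn).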

Next, we're going to load the dataset that we'll be using for this project:

import pandas
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.neighbors import KNeighborsClassifier
from sklearn import preprocessing

df = pandas.read_csv('IRIS.csv')

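For this project, we're going to be using the classic Iris dataset, which you can download here. Save it as IRIS.csv in the folder you run your script from, since that's the path we pass to read_csv.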
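Before going further, it can help to confirm that the data loaded with the column names the rest of this tutorial relies on. Here's a minimal sketch of that check. To keep it runnable without the download, it reads a few rows from an in-memory string standing in for IRIS.csv. The column layout and the name sample_df are assumptions made for illustration, so compare them against your own copy of the file:

```python
import io
import pandas

# A tiny stand-in for IRIS.csv so this snippet runs on its own.
# Assumption: the real file uses these same column names.
sample_csv = io.StringIO(
    "sepal_length,sepal_width,petal_length,petal_width,species\n"
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "7.0,3.2,4.7,1.4,Iris-versicolor\n"
    "6.3,3.3,6.0,2.5,Iris-virginica\n"
)
sample_df = pandas.read_csv(sample_csv)

# Columns the rest of the tutorial relies on
required = ["sepal_length", "sepal_width", "species"]
missing = [name for name in required if name not in sample_df.columns]
if missing:
    raise KeyError(f"Dataset is missing columns: {missing}")

print(sample_df.head())                     # first few rows
print(sample_df["species"].value_counts())  # number of rows per species
```

With your real file, you'd swap sample_csv for 'IRIS.csv' and run the same checks on df.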

Now comes the tricky bit...

Add these lines of code to your Python script:

model = KNeighborsClassifier(n_neighbors=3)

features = list(zip(df["sepal_length"], df["sepal_width"]))

model.fit(features,df["species"])

Let me explain...

  • First, we define our model and tell it to look at the 3 nearest neighbors of any new piece of data (n_neighbors=3). The new piece of data is given whichever class is most common among those 3 neighbors. The fact that there are also 3 Iris species is a coincidence; the classes themselves come from the data we fit the model with.
  • We then define the "features" variable, which pairs up each row's "sepal_length" and "sepal_width" values. These are the characteristics that the model will compare in order to classify new pieces of data.
  • Finally, we fit our model with these features, along with the species name for each row.
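To make the role of n_neighbors more concrete, here's a rough sketch of the idea behind a k-nearest-neighbors classifier in plain Python. This is not how scikit-learn implements it internally; knn_predict is a hypothetical helper and the measurements are made up purely for illustration:

```python
from collections import Counter
import math

def knn_predict(points, labels, query, k=3):
    """Classify `query` by majority vote among its k nearest points."""
    # Distance from the query to every known point, paired with that point's label
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(points, labels)
    ]
    # Keep the k closest points
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # The most common label among those neighbors wins
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Made-up (sepal_length, sepal_width) pairs, for illustration only
points = [(5.0, 3.4), (4.8, 3.0), (5.1, 3.8), (6.4, 2.9), (6.6, 3.0), (6.9, 3.1)]
labels = ["Iris-setosa"] * 3 + ["Iris-versicolor"] * 3

print(knn_predict(points, labels, (5.0, 3.5)))  # prints Iris-setosa
```

The query point sits right next to the three "Iris-setosa" points, so all 3 of its nearest neighbors vote for that species.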

Before we start predicting new pieces of data, let's graph our dataset using a scatter graph. In our graph, the X axis will represent the "sepal_length" and the Y axis will represent the "sepal_width". We're also going to color-code the different species of Iris flowers by adding hue='species'. Finally, we'll tell seaborn to graph our Iris dataset by adding data=df to the end:

sns.scatterplot(x='sepal_length', y='sepal_width',
                hue='species', data=df)

# Pin the legend's upper-right corner (loc=1) to the upper-right corner of the plot
plt.legend(bbox_to_anchor=(1, 1), loc=1)

plt.show()

Here's how the scatter graph should look:
[Image: scatter graph of sepal_width against sepal_length, colored by species]

To start classifying new pieces of data, first comment out the plotting code from the last snippet (here by wrapping it in triple quotes), like so:

import pandas
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.neighbors import KNeighborsClassifier
from sklearn import preprocessing


df = pandas.read_csv('IRIS.csv')
model = KNeighborsClassifier(n_neighbors=3)

features = list(zip(df["sepal_length"], df["sepal_width"]))

model.fit(features,df["species"])

"""sns.scatterplot(x='sepal_length', y='sepal_width',
                hue='species', data=df)

# Pin the legend's upper-right corner (loc=1) to the upper-right corner of the plot
plt.legend(bbox_to_anchor=(1, 1), loc=1)

plt.show()
"""

Then add these 2 lines of code to the end of your script:

predicted = model.predict([[4.6, 5.8]])
print(predicted)

This simply predicts the species of an Iris flower that has a sepal_length of 4.6 and a sepal_width of 5.8.

Now if you run your code, your output should look like this:

['Iris-setosa']

This means that our new mystery Iris flower has been classified as an "Iris-setosa".
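If you'd like to classify several flowers at once, or peek at how the neighbors voted, something like the sketch below should work. It uses a handful of made-up measurements rather than IRIS.csv so that it runs on its own; predict_proba and kneighbors are standard KNeighborsClassifier methods, while the data and the new_flowers name are just for illustration:

```python
from sklearn.neighbors import KNeighborsClassifier

# Made-up (sepal_length, sepal_width) pairs, for illustration only
features = [(5.0, 3.4), (4.8, 3.0), (5.1, 3.8), (6.4, 2.9), (6.6, 3.0), (6.9, 3.1)]
species = ["Iris-setosa"] * 3 + ["Iris-versicolor"] * 3

model = KNeighborsClassifier(n_neighbors=3)
model.fit(features, species)

# predict accepts a list of rows, so several flowers can be classified in one call
new_flowers = [[5.0, 3.5], [6.5, 3.0]]
print(model.predict(new_flowers))

# Share of the 3 nearest neighbors that voted for each class,
# with columns in the same order as model.classes_
print(model.classes_)
print(model.predict_proba(new_flowers))

# Distances to, and row indices of, the 3 nearest training points
distances, indices = model.kneighbors(new_flowers)
print(indices)
```

Each row of the predict_proba output corresponds to one new flower, so a row like 1.0 and 0.0 means all 3 neighbors agreed on the first class.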

Congratulations!

You've made your first machine learning project!

You can now experiment with this code, as well as try out some new datasets (you can find lots of great ones on https://www.kaggle.com/).

If you're a beginner who likes discovering new things about Python, try my weekly Python newsletter

Byeeeee👋
