Nearest neighbor algorithm with Python and Numpy

In this tutorial we will learn how to implement the nearest neighbor algorithm using only Numpy. We will also cover how to measure the distance between two vectors using the Euclidean distance formula.

This is a very simple algorithm, but it is an essential one to learn as you begin your machine learning journey.

Before we get started on the code, let’s take a look at the Euclidean distance formula.

\large D(A,B) = \sqrt{(a_{1}-b_{1})^{2}+(a_{2}-b_{2})^{2}+...+(a_{n}-b_{n})^2}

In simple terms, if we want to calculate the distance (D) between two vectors (A, B), we perform the following steps.

  • Firstly, we subtract each of the corresponding pairs of elements in the vectors, e.g. (a_{1}-b_{1}).
  • Next, we square that value, e.g. (a_{1}-b_{1})^{2}, and we repeat that process for each dimension of the vectors.
  • Lastly, we sum the squared differences and take the square root of that sum.

Let’s work through an example using two 4-dimensional vectors.

A = [5.5, 2.4, 3.8, 1.1]
B = [6.7, 2.5, 5.8, 1.8]

D(A,B) = \sqrt{(5.5-6.7)^{2}+(2.4-2.5)^{2}+(3.8-5.8)^2+(1.1-1.8)^2}

D(A,B) = \sqrt{5.94}

D(A,B) \approx 2.44

Therefore, the Euclidean distance between these two vectors is roughly 2.44. That is pretty straightforward.
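We can sanity-check this result with a few lines of NumPy, following the same subtract, square, sum, and square-root steps described above.

import numpy as np

A = np.array([5.5, 2.4, 3.8, 1.1])
B = np.array([6.7, 2.5, 5.8, 1.8])

# subtract, square, sum, then take the square root
d = np.sqrt(np.sum(np.power(A - B, 2)))
print(f"{d:.2f}")  # 2.44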

Finally, to determine the nearest neighbor we simply perform the same calculation against every training example and select the one with the smallest distance. We then predict that our test example belongs to the same class as that closest neighbor, as sketched below.
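As a quick sketch of that prediction rule (the distances and classes below are made up purely for illustration):

import numpy as np

# hypothetical distances from a test point to three training examples
distances = np.array([2.44, 0.91, 1.57])
classes = np.array(["setosa", "versicolor", "virginica"])

# the nearest neighbour is the one with the smallest distance
print(classes[np.argmin(distances)])  # versicolor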

Implementing with Numpy

Now for the fun part, let’s implement this using Python and Numpy. To make things a little more interesting we will use the Iris dataset from the scikit-learn library.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Nearest neighbor algorithm using euclidean distance
class NearestNeighbor(object):
    def __init__(self, X_train, y_train, labels):
        self.X_train = X_train
        self.y_train = y_train
        self.labels = labels
    
    def query(self, x):
        distances = np.zeros((self.X_train.shape[0]), dtype=np.float64)

        # calculate euclidean distance
        product = np.power(np.subtract(x, self.X_train), 2)

        for i, row in enumerate(product):
            distances[i] = np.sqrt(np.sum(row))
        
        # Index to closest neighbour
        min_index = np.argmin(distances)
        
        # Return predicted class
        return self.labels[self.y_train[min_index]]


# Lets use the Iris dataset, split the data into train and test splits
data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.2)

nn = NearestNeighbor(X_train, y_train, data.target_names)

# results will hold the predicted classes
results = []
for example in X_test:
    results.append(nn.query(example))

# Labels for test data
val = [data.target_names[i] for i in y_test]

# Check how many predicted classes match with the actual labels
correct = 0
for i, res in enumerate(results):
    if res == val[i]:
        correct += 1
        
acc = (correct/len(results))*100

print(f"accuracy: {acc}%")

As you can see the implementation is very straightforward. The portion of code that does all the work is the distance calculation inside the query method, shown again below.

        # calculate euclidean distance
        product = np.power(np.subtract(x, self.X_train), 2)

        for i, row in enumerate(product):
            distances[i] = np.sqrt(np.sum(row))
        
        # Index to closest neighbour
        min_index = np.argmin(distances)

In the first line of this snippet, we compute the squared difference for every dimension of every training example, e.g. (a_{1}-b_{1})^{2}.

Next, the loop sums each row of squared differences and takes the square root, giving the Euclidean distance to each training example.

Finally, we take the index of the smallest distance and use it to look up and return the appropriate class.
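As an aside, the loop is not strictly required; NumPy can sum each row and take the square root in a single vectorized step. Here is a minimal sketch of the same distance calculation as a standalone function, where x is one test vector and X_train is the training matrix:

import numpy as np

def euclidean_distances(x, X_train):
    # squared differences between x and every training row
    product = np.power(np.subtract(x, X_train), 2)
    # sum across each row (axis=1), then take the square root
    return np.sqrt(np.sum(product, axis=1))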

Although the nearest neighbor algorithm has limited practical use because it does not scale well to large amounts of data, it is still a useful mechanism to understand and worth trying on smaller datasets.

Bonus – k-Nearest Neighbors

Since we have covered how the nearest neighbor algorithm works, I thought I would also cover the k-Nearest Neighbors (KNN) algorithm. The only difference here is that KNN returns the k closest neighbors rather than just the single closest one.

Using the existing code this is very easy to implement, see the adjustment to the NearestNeighbor class.

# k-nearest neighbor algorithm using euclidean distance
class KNearestNeighbor(object):
    def __init__(self, X_train, y_train, labels):
        self.X_train = X_train
        self.y_train = y_train
        self.labels = labels
    
    def query(self, x, k):
        distances = np.zeros((self.X_train.shape[0]), dtype=np.float64)

        # calculate euclidean distance
        product = np.power(np.subtract(x, self.X_train), 2)

        for i, row in enumerate(product):
            distances[i] = np.sqrt(np.sum(row))

        # Sort by distance while keeping the pairing between
        # distances and training labels intact
        k_indices = np.argsort(distances)[:k]

        # Return the k values, here we return both the distances and the classes
        return distances[k_indices], self.labels[self.y_train[k_indices]]
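With k > 1, the usual next step is a majority vote over the returned classes. That vote is not part of the class above, but a minimal sketch using Python’s collections.Counter might look like this:

from collections import Counter

knn = KNearestNeighbor(X_train, y_train, data.target_names)

# query the 3 closest neighbours of the first test example
dists, classes = knn.query(X_test[0], k=3)

# predict the most common class among the k neighbours
prediction = Counter(classes).most_common(1)[0][0]
print(prediction)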
