
In this tutorial we will learn how to implement the nearest neighbor algorithm using only Numpy. Additionally, we will cover how to measure the distance between two vectors using the Euclidean distance formula.
This is a very simple algorithm and an essential one to learn as you begin your machine learning journey.
Before we get started on the code, let's take a look at the Euclidean distance formula.
D(A,B) = \sqrt{(a_{1}-b_{1})^{2}+(a_{2}-b_{2})^{2}+\dots+(a_{n}-b_{n})^{2}}
In simple terms, if we want to calculate the distance (D) between two vectors (A and B), we perform the following steps.
- Firstly, we subtract each pair of corresponding elements in the vectors, e.g. (a_{1}-b_{1}).
- Next, we square that value, e.g. (a_{1}-b_{1})^{2}, and repeat the process for each dimension of the vectors.
- Lastly, we sum all of the squared differences and take the square root of that sum.
Let's work through an example using two 4-dimensional vectors.
A = [5.5, 2.4, 3.8, 1.1]
B = [6.7, 2.5, 5.8, 1.8]
D(A,B) = \sqrt{(5.5-6.7)^{2}+(2.4-2.5)^{2}+(3.8-5.8)^{2}+(1.1-1.8)^{2}}
D(A,B) = \sqrt{1.44+0.01+4+0.49}
D(A,B) = \sqrt{5.94}
D(A,B) \approx 2.44
Therefore, the Euclidean distance between these two vectors is approximately 2.44. That is pretty straightforward.
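As a quick sanity check, here is a minimal Numpy snippet that reproduces the calculation above (A and B are just the example vectors):

import numpy as np

A = np.array([5.5, 2.4, 3.8, 1.1])
B = np.array([6.7, 2.5, 5.8, 1.8])

# Square the element-wise differences, sum them, then take the square root
distance = np.sqrt(np.sum((A - B) ** 2))
print(distance)  # roughly 2.4372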
Finally, in order to determine our nearest neighbor, we simply do the same calculation across all the training examples and select the one with the lowest distance. We then predict that our test example belongs to the same class as that closest neighbor.
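Before we get to the full implementation, here is a toy sketch of that selection step, using made-up distances and classes:

import numpy as np

# Hypothetical distances from one test example to five training examples
distances = np.array([3.1, 0.9, 2.4, 1.7, 4.2])
train_classes = np.array(["setosa", "versicolor", "setosa", "virginica", "versicolor"])

# The nearest neighbour is the training example with the smallest distance
nearest = np.argmin(distances)
print(train_classes[nearest])  # versicolor (distance 0.9)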
Implementing with Numpy
Now for the fun part, let's implement this using Python and Numpy. To make things a little more interesting, we will use the Iris dataset from the scikit-learn library.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Nearest neighbor algorithm using euclidean distance
class NearestNeighbor(object):
    def __init__(self, X_train, y_train, labels):
        self.X_train = X_train
        self.y_train = y_train
        self.labels = labels

    def query(self, x):
        distances = np.zeros((self.X_train.shape[0]), dtype=np.float64)

        # calculate euclidean distance
        product = np.power(np.subtract(x, self.X_train), 2)
        for i, row in enumerate(product):
            distances[i] = np.sqrt(np.sum(row))

        # Index to closest neighbour
        min_index = np.argmin(distances)

        # Return predicted class
        return self.labels[self.y_train[min_index]]

# Let's use the Iris dataset, split the data into train and test splits
data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.2)

nn = NearestNeighbor(X_train, y_train, data.target_names)

# results will hold the predicted classes
results = []
for i, example in enumerate(X_test):
    results.append(nn.query(example))

# Labels for test data
val = [data.target_names[i] for i in y_test]

# Check how many predicted classes match with the actual labels
correct = 0
for i, res in enumerate(results):
    if res == val[i]:
        correct += 1

acc = (correct / len(results)) * 100
print(f"accuracy: {acc}%")
As you can see, the implementation is very straightforward. The portion of code that does all the work is the distance calculation inside the query method, shown again below.
# calculate euclidean distance
product = np.power(np.subtract(x, self.X_train), 2)
for i, row in enumerate(product):
    distances[i] = np.sqrt(np.sum(row))

# Index to closest neighbour
min_index = np.argmin(distances)
On the second line of this snippet, we compute the squared differences between x and every training example, i.e. each (a_{i}-b_{i})^{2} term.
Next, inside the loop, we sum each row of squared differences and take the square root, which gives us the Euclidean distance from x to each training example.
Finally, we use np.argmin to get the index of the smallest distance and return the class of the corresponding training example.
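As a side note, the loop is not strictly necessary; Numpy can compute all the distances at once. Here is a minimal, self-contained sketch of that vectorised alternative (np.linalg.norm is my substitution, the original code does not use it), with made-up example data:

import numpy as np

# Made-up example data: two training examples and one query point
X_train = np.array([[5.1, 3.5, 1.4, 0.2],
                    [6.7, 3.0, 5.2, 2.3]])
x = np.array([5.5, 2.4, 3.8, 1.1])

# np.linalg.norm with axis=1 computes the Euclidean norm of each row,
# i.e. the distance from x to every training example in one call
distances = np.linalg.norm(X_train - x, axis=1)
min_index = np.argmin(distances)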
Although the nearest neighbor algorithm has limited practical use because it does not scale well to large amounts of data, it is still a useful algorithm to understand and worth trying on smaller datasets.
Bonus – K-Nearest Neighbors
Since we have covered how the nearest neighbor algorithm works, I thought I would also cover the K-Nearest Neighbors algorithm (KNN). The only difference here is that KNN returns more than one neighbor (the k closest neighbors).
Using the existing code, this is very easy to implement; see the adjustments to the NearestNeighbor class below.
# KNearest neighbor algorithm using euclidean distance
class KNearestNeighbor(object):
    def __init__(self, X_train, y_train, labels):
        self.X_train = X_train
        self.y_train = y_train
        self.labels = labels

    def query(self, x, k):
        distances = np.zeros((self.X_train.shape[0]), dtype=np.float64)

        # calculate euclidean distance
        product = np.power(np.subtract(x, self.X_train), 2)
        for i, row in enumerate(product):
            distances[i] = np.sqrt(np.sum(row))

        # Indices of the k closest neighbours, sorted by distance.
        # np.argsort keeps the link between distances and training examples,
        # which a plain in-place sort would destroy.
        min_indices = np.argsort(distances)[0:k]

        # Return the k values, here we return both the distances and classes
        return distances[min_indices], self.labels[self.y_train[min_indices]]
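The class above returns the k classes and leaves the final prediction to the caller. A common way to turn those k classes into a single prediction is a simple majority vote; here is a minimal sketch, reusing the X_train, y_train, data and X_test variables from the earlier example:

from collections import Counter

knn = KNearestNeighbor(X_train, y_train, data.target_names)

# Query the 3 nearest neighbours of the first test example
dists, classes = knn.query(X_test[0], 3)

# Predict the class that appears most often among the k neighbours
prediction = Counter(classes).most_common(1)[0][0]
print(prediction)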