Introduction
K-Nearest Neighbour (KNN) is a basic supervised classification algorithm in Machine Learning. It is often used to solve classification problems in industry and is widely applied in pattern recognition, data mining, etc. It stores all the available cases from the training dataset and classifies new cases based on a distance function.
I will explain the KNN algorithm with the help of the "Euclidean Distance" formula.
Euclidean Distance
The Euclidean distance formula measures the straight-line distance between two points in the plane. It is the most common way to get the distance between two points.
Let (x1, y1) and (x2, y2) be two points in 2-dimensional space. By the Pythagorean theorem, the Euclidean distance between (x1, y1) and (x2, y2) is,
d = √((x2 - x1)² + (y2 - y1)²)
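As a quick check of the formula, the distance between the (made-up) points (1, 2) and (4, 6) works out to 5:

```python
from math import sqrt

# Hypothetical example points
x1, y1 = 1.0, 2.0
x2, y2 = 4.0, 6.0

d = sqrt((x2 - x1) ** 2 + (y2 - y1) ** 2)
print(d)  # 5.0, since sqrt(3**2 + 4**2) = sqrt(25)
```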
Data Classification based on Euclidean Distance Formula
We have two different kinds of data, mentioned below.
- Training Data
This set of data contains the x and y values of each point together with its classification type.
- Test Data
This set of data contains only the x and y values of each point. Its classification type is not given; it will be predicted based on the training data.
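The exact file contents are not shown in the article, but the code below expects two CSV files: the training file with three columns (x, y, classification) and the test file with two columns (x, y). A plausible layout, with made-up values, would be:

```
# training_data.csv (hypothetical rows: x,y,classification)
1.0,1.0,A
2.0,1.5,A
8.0,9.0,B

# test_data.csv (hypothetical rows: x,y)
1.2,1.4
8.5,8.8
```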
Implementation
Import the below libraries. (The original listing also imported `sys`, but it is never used, so it is dropped here.)

```python
import csv
from collections import Counter
from math import sqrt
```
Import the training data set.

```python
x = []
y = []
z = []
with open('training_data.csv', 'rt') as f:
    reader = csv.reader(f)
    for row in reader:
        x.append(float(row[0]))
        y.append(float(row[1]))
        z.append(row[2])

# Map each (x, y) coordinate to its classification label
coordinates = list(zip(x, y))
input_data = {coordinates[i]: z[i] for i in range(len(coordinates))}
```
Import the test data set.

```python
test_x = []
test_y = []
with open('test_data.csv', 'rt') as f:
    reader = csv.reader(f)
    for row in reader:
        test_x.append(float(row[0]))
        test_y.append(float(row[1]))

test_coordinates = list(zip(test_x, test_y))
print(test_coordinates)
```
Define the Euclidean distance function. (The original returned an error string on mismatched lengths, which would silently flow into later comparisons; raising an exception is safer.)

```python
def euclidean_distance(x, y):
    if len(x) != len(y):
        raise ValueError("euclidean_distance expects equal-length vectors")
    return sqrt(sum((x[i] - y[i]) ** 2 for i in range(len(y))))
```
KNN classifier. The predicted label is the most common label among the nearest neighbours.

```python
def knn_classifier(neighbors, input_data):
    # Majority vote over the labels of the nearest neighbours
    knn = Counter(input_data[i] for i in neighbors)
    classifier, _ = knn.most_common(1)[0]
    return classifier
```
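The majority vote relies on `Counter.most_common`; a small standalone illustration with made-up neighbour labels:

```python
from collections import Counter

# Hypothetical labels of the 3 nearest neighbours
knn = Counter(['A', 'B', 'A'])
classifier, count = knn.most_common(1)[0]
print(classifier, count)  # A 2
```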
Generate the neighbours: compute the distance from the new point to every training point, then keep the k closest.

```python
def neighbors(k, trained_points, new_point):
    neighbor_distances = {}

    # Distance from the new point to every training point
    for point in trained_points:
        if point not in neighbor_distances:
            neighbor_distances[point] = euclidean_distance(point, new_point)

    # Sort by distance and keep the k nearest points
    nearest = sorted(neighbor_distances.items(), key=lambda item: item[1])
    k_nearest_neighbors = list(zip(*nearest[:k]))

    return list(k_nearest_neighbors[0])
```
Print the results.

```python
results = {}
for item in test_coordinates:
    # Classify each test point from its 3 nearest training neighbours
    results[item] = knn_classifier(neighbors(3, input_data.keys(), item), input_data)

print(results)
```
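To sanity-check the whole pipeline without the CSV files, here is a compact, self-contained sketch of the same steps on made-up points (all coordinates and labels below are hypothetical):

```python
from collections import Counter
from math import sqrt

def euclidean_distance(a, b):
    return sqrt(sum((a[i] - b[i]) ** 2 for i in range(len(a))))

def neighbors(k, trained_points, new_point):
    # Sort training points by distance to the new point, keep the k closest
    return sorted(trained_points, key=lambda p: euclidean_distance(p, new_point))[:k]

def knn_classifier(nbrs, input_data):
    # Majority vote over the neighbours' labels
    return Counter(input_data[p] for p in nbrs).most_common(1)[0][0]

# Two made-up clusters labelled 'A' and 'B'
input_data = {
    (1.0, 1.0): 'A', (1.5, 2.0): 'A', (2.0, 1.0): 'A',
    (8.0, 8.0): 'B', (8.5, 9.0): 'B', (9.0, 8.0): 'B',
}

print(knn_classifier(neighbors(3, list(input_data), (1.2, 1.4)), input_data))  # A
print(knn_classifier(neighbors(3, list(input_data), (8.4, 8.6)), input_data))  # B
```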
Output
Here, the x and y data have been classified into different groups. I have attached the zipped Python code. Python 3 or later is required to execute it.
Conclusion
The K-Nearest Neighbour algorithm is a simple but important supervised learning algorithm in Machine Learning.