The k-nearest neighbours is an algorithm that can be used for both classification and regression.
This is a supervised algorithm which means that we first provide the algorithm with training records and the correctly classified labels. The algorithm then trains itself on the input and output and learns the characteristics of the data.
After training, we test how well the algorithm learned, by providing it with only test records and then check the accuracy of the test labels provided by the algorithm (check how many records were given the right labels and vice versa).
The KNN algorithm is convenient in a way that it takes little or no training time. Training involves only loading the data into the variables and the actual work begins when the test variables are introduced.
Before we go into the details of the algorithm, let us take a leap and work on a simple example. In the later sections, we will discuss the concept in detail and implement the algorithm from scratch.
A Simple Example
To illustrate KNN, we will use the famous Iris Dataset. Each record in this dataset corresponds to 3 classes of iris plant flower’s details. There are 4 features that describe each flower (sepal length in cm, sepal width in cm, petal length in cm and petal width in cm).
The following pieces of code can be used to import the KNN class from the scikit learn, create a simple model and test the results.
- Set up data
# Import libraries from sklearn import datasets from sklearn.model_selection import train_test_split
# Load data iris = datasets.load_iris()
# Prepare data X,y = iris.data, iris.target X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1234)
2. Create the model
# Import KNN from sklearn.neighbors import KNeighborsClassifier
# Create a model - model_0 model_0 = KNeighborsClassifier(n_neighbors=3)
3. Train the model
# Train the model model_0.fit(X_train,y_train)
4. Test the model
# Test the model print('Predicted class labels:',model_0.predict(X_test)) print('Accuracy of predictions:',model_0.score(X_test,y_test))
Predicted class labels: [1 1 2 0 1 0 0 0 1 2 1 0 2 1 0 1 2 0 2 1 1 1 1 1 2 0 2 1 2 0] Accuracy of predictions: 1.0
The accuracy of our model is shown to be 100%. That is, the model classified all our test records flawlessly into 3 classes.
As we mentioned in the introduction, the training only involves loading the features (X_train) and labels (y_train) into their respective variables.
The algorithm does its job during testing. When a test point is introduced (a row in X_test), it calculates the distance between this point and all the records in the training data (X_train).
In our example, we have 150 training records and 30 test records. The algorithm calculates the distance (Euclidean or Manhatten) between each test record and 150 training records.
While instantiating the KNN class, we had provided with the parameter n_neighbors = 3. Therefore, out of these 150 distances, the algorithm selects the shortest 3 distances. That is the 3 distances that are closest to the test record.
The algorithm then identifies the classes of these records, finds the most commonly occurring class and classifies the test record as belonging to this class.
The following picture illustrates a training record TR. The algorithm simply selects the training points that are closest to TR. In this case, TP belongs to the circle class since out of all its 3 closest neighbours, the most common is are circles.
Since we are looking at the neighbours of the data point, the accuracy of our result heavily depends on the distribution of features and the number of neighbours we choose to zero in on the class.
Before creating the model it is important to ensure that the features are normalized. That is, all the features have the same range of data and there are no outliers (preferably between 0 and 1).
The number of neighbours we choose also has an impact on the accuracy of the results.
Let us incorporate these two aspects to our example and observe the results.
# Import from sklearn import datasets from sklearn.model_selection import train_test_split from sklearn.preprocessing import StandardScaler from sklearn.neighbors import KNeighborsClassifier import matplotlib.pyplot as plt
# Load data iris = datasets.load_iris()
# Scale data scaler = StandardScaler() scaler.fit(iris.data) iris_scaled = scaler.transform(iris.data)
# Prepare data X,y = iris_scaled, iris.target X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1234)
#For each neighbour value, calculate the error rate error_rate =  for i in range(1,40): model = KNeighborsClassifier(n_neighbors=i) model.fit(X_train,y_train) pred_i = model.predict(X_test) error_rate.append(np.mean(pred_i != y_test))
plt.figure(figsize=(10,6)) plt.plot(range(1,40),error_rate,color='blue', linestyle='dashed', marker='o', markerfacecolor='red', markersize=10) plt.title('Error Rate vs. K Value') plt.xlabel('K') plt.ylabel('Error Rate')
If the number of neighbors is verhy low, for example n_neighbors = 1, the decision boundary (the boundary that seperates different classes) becomes tighter and the complexity of the model increases. That is, the model will overfit data.
When the number of neighbours are more the decision boundary that seperates classes loosens up and the model becomes simpler and generalizes better for new data.
Now that we have an understanding of how the algorithm works, let’s try and implement it from scratch.
For our basic class, we can have three standard methods – fit(), predict() and score().
In the case of KNN, the fit() method simply loads the data onto the variables.
The predict() method takes each row of test features (X_test), calculates the distance between this row and every row of the training features (X_train) using the support functions _calculate(), _top_indices(), _find_classes() and _predictions().
The predict() method returns the predictions of the test features (X_test).
The score() method provides us with the accuracy of the model, given the test classes (y_test).
class KNearestNeighbors: def __init__(self,n): self.n = n def fit(self,X,y): self.X_train = X self.y_train = y def predict(self,x): distances = [self._calculate(x) for x in X_test] top_indices = [self._top_indices(d) for d in distances] classes = [self._find_classes(indices) for indices in top_indices] predictions = [self._predictions(c) for c in classes] return predictions def _calculate(self,x): distances = [np.sqrt(np.sum((x - x_train)**2)) for x_train in X_train] return distances def _top_indices(self,d): top_indices = np.argsort(d)[:self.n] return top_indices def _find_classes(self,indices): classes = [self.y_train[i] for i in indices] return classes def _predictions(self,classes): p = Counter(classes).most_common(1) predictions = p return predictions def score(self,X_test,y_test): predictions = self.predict(self.X_test) accuracy = np.sum(predictions == y_test)/len(y_test) print("Accuracy: ",accuracy)
Given above is a basic implementation of the the K Nearest Neighbour class. However, it is strongly suggested that the learner get creative with it and add more functionalities or implement it in a different way for the concept to really sink in.
If you found this article useful, join our She Drives Data community on SHEROES (iOS and Android) to connect with thousands of women data enthusiasts across the world, hear from data science experts and get updated on the latest tech news every day.