ML Lesson 23 – KNN Algorithm | Dataplexa

K-Nearest Neighbors (KNN)

In the previous lesson, we studied Support Vector Machines and saw how a model can draw a strong decision boundary by focusing on the most critical data points. SVM was all about margins and mathematical optimization.

In this lesson, we move to a completely different way of thinking. There is no training phase, no complex equations, and no model building in advance.

Welcome to K-Nearest Neighbors, commonly known as KNN.


The Core Idea Behind KNN

KNN works exactly the way humans think in everyday life. When you meet a new person, you often compare them with people you already know.

If most of the similar people you know behave in a certain way, you assume this new person might behave the same way.

KNN applies this idea to data. To predict the class of a new data point, it looks at the K most similar points in the training data and lets them vote.

The class with the majority vote becomes the final prediction.
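
To make the voting idea concrete, here is a minimal from-scratch sketch. The stored points, their labels, and the new point are all made up for illustration: the function computes the Euclidean distance to every stored point, keeps the K closest, and returns the majority class among them.

import math
from collections import Counter

# Toy stored data: (income in $1000s, age) -> loan approved (1) or not (0)
points = [((45, 23), 0), ((80, 40), 1), ((60, 35), 1), ((30, 22), 0), ((75, 50), 1)]

def knn_predict(new_point, k=3):
    # Distance from the new point to every stored point
    distances = [(math.dist(new_point, features), label) for features, label in points]
    # Keep the k closest points and let their labels vote
    nearest = sorted(distances)[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

print(knn_predict((65, 38)))  # the 3 nearest neighbors vote on the class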


Why KNN Is Called a Lazy Algorithm

Unlike most machine learning algorithms, KNN does not learn anything during training.

It simply stores the entire dataset in memory. When a new prediction is required, it performs all calculations at that moment.

Because of this behavior, KNN is called a lazy learner.


Using Our Dataset

We continue using the same dataset introduced earlier so that learning feels continuous and realistic.

Dataplexa ML Housing & Customer Dataset

Our task remains to predict whether a loan will be approved.


Preparing the Data

KNN is entirely based on distance. If one feature has larger values than others, it will dominate the distance calculation.

That is why feature scaling is not optional for KNN. It is mandatory.
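
A quick made-up example shows how badly an unscaled feature can dominate. Income differences are measured in thousands while age differences are measured in single years, so the raw Euclidean distance is driven almost entirely by income (the numbers below are invented for illustration):

import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical customers: [annual_income, age]
X_toy = np.array([[52000, 30],
                  [57000, 27],
                  [48000, 45]])

# Unscaled: the income gap (5000) swamps the age gap (3)
print(np.linalg.norm(X_toy[0] - X_toy[1]))   # roughly 5000

# Standardized: both features now contribute on comparable scales
X_toy_scaled = StandardScaler().fit_transform(X_toy)
print(np.linalg.norm(X_toy_scaled[0] - X_toy_scaled[1]))

With that in mind, let us prepare the actual dataset.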

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# Load the dataset
df = pd.read_csv("dataplexa_ml_housing_customer_dataset.csv")

# Separate the features from the target column
X = df.drop("loan_approved", axis=1)
y = df["loan_approved"]

# Hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit the scaler on the training data only, then apply it to both splits
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

Training the KNN Model

Although we say KNN does not train in the traditional sense, we still create a model object and define the value of K.

The value of K controls how many neighbors participate in the voting.

# K = 5: each prediction is decided by the five nearest neighbors
model = KNeighborsClassifier(n_neighbors=5)

# For KNN, fitting mostly just stores the scaled training data
model.fit(X_train, y_train)

A small K makes the model sensitive to noise and outliers, while a large K produces a smoother decision boundary but can blur genuine local patterns and reduce accuracy.
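
How should we pick K in practice? A common approach is to try several values and compare cross-validated accuracy. This is only a sketch: it reuses the scaled X_train and y_train from above, and the candidate values are arbitrary.

from sklearn.model_selection import cross_val_score

# Odd values of K avoid ties when two classes vote
for k in [1, 3, 5, 7, 9, 15, 21]:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X_train, y_train, cv=5)
    print(f"K={k:2d}  mean CV accuracy = {scores.mean():.3f}")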


Making Predictions

Now the model compares each test point with all training points to find its nearest neighbors.

# Predict a class for every test point and preview the first ten results
y_pred = model.predict(X_test)
y_pred[:10]
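
If you want to see which training points actually voted for a prediction, scikit-learn exposes them through the kneighbors method. A small sketch for the first test point:

# Distances to, and row indices of, the 5 nearest training points
distances, indices = model.kneighbors(X_test[:1])

print("Neighbor distances:", distances[0])
print("Neighbor labels:   ", y_train.iloc[indices[0]].values)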

Evaluating the Model

Let us evaluate how well KNN performs on unseen data.

from sklearn.metrics import accuracy_score, classification_report

# Overall accuracy plus per-class precision, recall, and F1
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

KNN often performs well on small to medium datasets, but it struggles as the data grows: every prediction must compute distances to all stored training points, so prediction becomes slow on very large datasets.
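
One practical way to soften that slowdown is to let scikit-learn index the training points in a tree structure instead of scanning every row for each query. This is a hedged sketch; tree-based search helps most when the number of features is modest.

# Build a KD-tree index instead of brute-force distance scans
fast_knn = KNeighborsClassifier(n_neighbors=5, algorithm="kd_tree")
fast_knn.fit(X_train, y_train)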


Real-World Intuition

Imagine a bank officer reviewing a new loan application. Instead of using rules or formulas, the officer looks at five similar past customers.

If most of them repaid their loans successfully, the officer approves the new application.

This is exactly how KNN behaves.


Mini Practice

Suppose K is set to 1. The model only looks at the single closest customer.

Now imagine K is set to 20. The decision becomes more stable, but individual patterns may get diluted.

This trade-off is central to KNN.
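
You can check the trade-off directly on our split. This quick experiment reuses the scaled data and the imports from earlier in the lesson:

for k in [1, 20]:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    print(f"K={k:2d}  test accuracy = {knn.score(X_test, y_test):.3f}")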


Exercises

Exercise 1:
Why must features be scaled for KNN?

Because KNN is based entirely on distance calculations, and features with larger numeric ranges would dominate those distances if left unscaled.

Exercise 2:
What happens if K is too small?

The model becomes sensitive to noise and may overfit.

Quick Quiz

Q1. Does KNN learn parameters during training?

No. KNN stores the data and computes distances at prediction time.

In the next lesson, we will study Naive Bayes and understand how probability-based learning works.