ML Lesson 25 – Clustering (K-Means) | Dataplexa

K-Means Clustering

Until now, all the algorithms we studied were supervised learning models. They learned from labeled data, where the outcome was already known.

In this lesson, we enter a new phase of machine learning called unsupervised learning. Here, the data has no labels, and the model must discover patterns on its own.

The first and most important unsupervised algorithm is K-Means Clustering.


The Core Idea Behind K-Means

K-Means tries to group similar data points together without knowing anything about the final outcome.

Imagine a bank wants to understand its customers better. It does not want loan approval predictions. Instead, it wants to identify different types of customers based on income, age, credit score, and spending behavior.

K-Means solves this problem by dividing customers into K groups, called clusters, where customers within the same group are more similar to each other than to customers in other groups.


How K-Means Works Conceptually

K-Means starts by placing K points in the feature space, commonly by picking K random data points. These points are called centroids.

Each data point is assigned to the nearest centroid. After that, each centroid is moved to the average position of all the points assigned to it.

This process repeats until the centroids stop moving (or move less than a small tolerance). At that point, the clusters are considered stable.
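The loop described above can be sketched directly in NumPy. This is a minimal illustration on a toy two-blob dataset, not the implementation scikit-learn uses internally:

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy 2-D data: two loose blobs centered at 0 and 3
points = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(20, 2)),
    rng.normal(loc=3.0, scale=0.5, size=(20, 2)),
])

k = 2
# Step 1: pick k random data points as the initial centroids
centroids = points[rng.choice(len(points), size=k, replace=False)]

for _ in range(100):
    # Step 2: assign each point to its nearest centroid
    distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)
    # Step 3: move each centroid to the mean of its assigned points
    new_centroids = np.array([points[labels == i].mean(axis=0) for i in range(k)])
    # Step 4: stop once the centroids no longer move
    if np.allclose(new_centroids, centroids):
        break
    centroids = new_centroids
```

With two well-separated blobs, the loop typically converges in a handful of iterations, with one centroid settling in each blob.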


Using Our Dataset

Even though our dataset contains a loan approval label, we will temporarily ignore it.

Dataplexa ML Housing & Customer Dataset

Our goal is to discover natural customer groups based on their financial characteristics.


Preparing the Data

K-Means is entirely distance-based. If features are not scaled, features with large numeric ranges (such as income) dominate the distance calculation and distort the clusters.

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

df = pd.read_csv("dataplexa_ml_housing_customer_dataset.csv")

# Drop the label so the model sees only the features
X = df.drop("loan_approved", axis=1)

# Standardize every feature to mean 0 and standard deviation 1
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

Choosing the Number of Clusters

Choosing the value of K is not automatic. It requires understanding the business problem.

For demonstration, we will start with K equal to 3, which may represent low-risk, medium-risk, and high-risk customers.
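A common way to guide this choice is the elbow method: fit K-Means for several values of K and watch the inertia (the sum of squared distances from each point to its nearest centroid). Where the decrease levels off is a reasonable K. The sketch below uses synthetic blob data as a stand-in, since the Dataplexa CSV is not bundled here:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the customer features
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

inertias = {}
for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    km.fit(X_scaled)
    # inertia_ = sum of squared distances of points to their nearest centroid
    inertias[k] = km.inertia_

for k, inertia in inertias.items():
    print(f"k={k}: inertia={inertia:.1f}")
```

Inertia always falls as K grows, so we look for the "elbow" where adding another cluster stops paying off, rather than the minimum itself.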


Training the K-Means Model

# n_init=10 runs K-Means from 10 different random starts and keeps the best result
model = KMeans(n_clusters=3, n_init=10, random_state=42)
model.fit(X_scaled)

At this point, the model has discovered three clusters based purely on feature similarity.


Analyzing Cluster Assignments

Each customer is now assigned to a cluster.

df["cluster"] = model.labels_
df.head()

These clusters can now be studied to understand customer behavior.
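One simple way to study the clusters is to average the original (unscaled) features within each group. The sketch below builds a small synthetic customer table with illustrative column names, since the real dataset is not available here:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the customer table (column names are illustrative)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "income": rng.normal(60_000, 15_000, 200),
    "age": rng.integers(21, 65, 200),
    "credit_score": rng.normal(650, 60, 200),
})

X_scaled = StandardScaler().fit_transform(df)
model = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X_scaled)
df["cluster"] = model.labels_

# Average feature values per cluster show what each group looks like
profile = df.groupby("cluster").mean()
print(profile.round(1))

# Cluster sizes
print(df["cluster"].value_counts().sort_index())
```

Comparing the per-cluster averages against each other is what turns anonymous cluster numbers into interpretable segments such as "younger, lower-income customers".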


Real-World Interpretation

Banks use clustering to segment customers. One cluster may represent young professionals with moderate income. Another may represent high-income customers with strong credit scores.

These insights help businesses design better products, target marketing campaigns, and manage risk.


Mini Practice

Try changing the number of clusters from 3 to 4. Observe how customers get regrouped.

This shows how sensitive K-Means can be to the choice of K.
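Beyond eyeballing the regrouping, the silhouette score gives a number to compare different K values: it ranges from -1 to 1, and higher means tighter, better-separated clusters. A sketch on synthetic blob data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with three natural groups
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

scores = {}
for k in (2, 3, 4, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_scaled)
    # Silhouette compares each point's distance to its own cluster vs the next-closest one
    scores[k] = silhouette_score(X_scaled, labels)
    print(f"k={k}: silhouette={scores[k]:.3f}")
```

On data with clear natural groups, the silhouette score tends to peak at the true number of clusters, making it a useful cross-check on the elbow method.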


Exercises

Exercise 1:
Why must features be scaled before applying K-Means?

Because K-Means uses distance calculations, and unscaled features distort cluster formation.

Exercise 2:
What happens if K is chosen incorrectly?

Clusters may not represent meaningful real-world groups.

Quick Quiz

Q1. Does K-Means use labeled data?

No. K-Means is an unsupervised learning algorithm.

In the next lesson, we will study Hierarchical Clustering and understand how clusters can be formed step by step.