AI Lesson 33 – K-Means Clustering | Dataplexa

K-Means Clustering

K-Means Clustering is an unsupervised learning algorithm used to group similar data points into clusters. Unlike supervised learning, K-Means works without labeled output data.

The goal of K-Means is simple: group data points so that points in the same cluster are more similar to each other than to points in other clusters.

Why Do We Need Clustering?

In many real-world problems, we do not know the correct output in advance. We only have raw data and want to discover hidden patterns or structures.

Examples include:

  • Customer segmentation in marketing
  • Grouping similar news articles
  • Image compression
  • Finding usage patterns in applications

Real-World Example

Suppose an e-commerce company wants to group customers based on age and income. The company does not know the groups beforehand but wants to identify patterns like low-income, medium-income, and high-income customers.

K-Means automatically finds these groups based on similarity.

What Does “K” Mean?

The letter K represents the number of clusters we want to form. This value is chosen by the user.

For example:

  • K = 2 → 2 clusters
  • K = 3 → 3 clusters
  • K = 5 → 5 clusters

How K-Means Works

K-Means follows an iterative process:

  • Choose K initial centroids randomly
  • Assign each data point to the nearest centroid
  • Recalculate centroids based on cluster averages
  • Repeat until centroids stop changing

This process minimizes the sum of squared distances between data points and their assigned cluster centroids (a quantity called inertia).
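The four steps above can be sketched directly in NumPy. This is a minimal from-scratch illustration (the function name and parameters are for illustration only; the lesson's real example uses scikit-learn below):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Plain K-Means: random init, assign, update, repeat."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    # 1. Choose K initial centroids randomly from the data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # 2. Assign each point to the nearest centroid
        dists = np.linalg.norm(X[:, None] - centroids[None, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Recalculate each centroid as the mean of its cluster
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # 4. Repeat until centroids stop changing
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```

Calling `kmeans([[0, 0], [0, 1], [10, 10], [10, 11]], 2)` groups the two low points together and the two high points together, whichever cluster IDs they receive.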

K-Means Clustering Example

Below is a simple example using customer data with age and income.


from sklearn.cluster import KMeans

# Sample data: [Age, Income]
X = [
    [25, 30000],
    [30, 40000],
    [35, 60000],
    [40, 65000],
    [45, 80000],
    [50, 90000]
]

# Create K-Means model
model = KMeans(n_clusters=2, random_state=42)

# Train model
model.fit(X)

# Predict cluster labels
labels = model.predict(X)
print(labels)
  
[0 0 1 1 1 1]

The output shows which cluster each customer belongs to. Customers with similar age and income are grouped together automatically.

Understanding the Output

Each number represents a cluster ID. Data points with the same label belong to the same cluster.

Cluster numbers themselves have no meaning — only the grouping matters.
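One detail worth noticing in the example above: income values (tens of thousands) dwarf age values, so Euclidean distance is driven almost entirely by income. A common refinement, sketched here with scikit-learn's StandardScaler (an addition to the lesson's example, which clusters the raw values), is to standardize features before clustering:

```python
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

X = [[25, 30000], [30, 40000], [35, 60000],
     [40, 65000], [45, 80000], [50, 90000]]

# Rescale each feature to zero mean and unit variance
X_scaled = StandardScaler().fit_transform(X)

# Cluster the scaled data so age and income contribute comparably
labels = KMeans(n_clusters=2, random_state=42, n_init=10).fit_predict(X_scaled)
print(labels)
```

After scaling, both age and income influence the grouping instead of income alone.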

Choosing the Right Value of K

Choosing the correct number of clusters is important. A common method is the Elbow Method.

In this method, we run K-Means for several values of K, record the inertia (within-cluster sum of squared distances) for each, and pick the "elbow" point where increasing K stops reducing inertia significantly.
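A minimal sketch of the Elbow Method, reusing the customer data from the example above (printing the values rather than plotting them, to keep the sketch dependency-free):

```python
from sklearn.cluster import KMeans

X = [[25, 30000], [30, 40000], [35, 60000],
     [40, 65000], [45, 80000], [50, 90000]]

# inertia_ = sum of squared distances from points to their nearest centroid
inertias = []
for k in range(1, 6):
    model = KMeans(n_clusters=k, random_state=42, n_init=10).fit(X)
    inertias.append(model.inertia_)

for k, inertia in zip(range(1, 6), inertias):
    print(f"K={k}  inertia={inertia:.0f}")
```

Inertia always decreases as K grows; the "elbow" is the K after which the decrease flattens out, which is usually the value chosen.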

Advantages of K-Means

  • Simple and easy to understand
  • Fast and scalable
  • Works well for large datasets

Limitations of K-Means

  • Must choose K in advance
  • Sensitive to outliers
  • Works best with spherical clusters

Practice Questions

Practice 1: K-Means belongs to which type of learning?



Practice 2: What does K-Means try to form?



Practice 3: What represents the center of each cluster?



Quick Quiz

Quiz 1: What does the value K represent?


Quiz 2: K-Means works using which process?


Quiz 3: Which method helps choose the value of K?


Coming up next: Hierarchical Clustering — building clusters step by step without choosing K in advance.