AI Course
K-Means Clustering
K-Means Clustering is an unsupervised learning algorithm used to group similar data points into clusters. Unlike supervised learning, K-Means works without labeled output data.
The goal of K-Means is simple: group data points so that points in the same cluster are more similar to each other than to points in other clusters.
Why Do We Need Clustering?
In many real-world problems, we do not know the correct output in advance. We only have raw data and want to discover hidden patterns or structures.
Examples include:
- Customer segmentation in marketing
- Grouping similar news articles
- Image compression
- Finding usage patterns in applications
Real-World Example
Suppose an e-commerce company wants to group customers based on age and income. The company does not know the groups beforehand but wants to identify patterns like low-income, medium-income, and high-income customers.
K-Means automatically finds these groups based on similarity.
What Does “K” Mean?
The letter K represents the number of clusters we want to form. This value is chosen by the user.
For example:
- K = 2 → 2 clusters
- K = 3 → 3 clusters
- K = 5 → 5 clusters
How K-Means Works
K-Means follows an iterative process:
- Choose K initial centroids randomly
- Assign each data point to the nearest centroid
- Recalculate each centroid as the average of the points assigned to it
- Repeat until the centroids stop changing (or a maximum number of iterations is reached)
This process minimizes the total squared distance between data points and their assigned cluster centers (a quantity often called inertia).
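The steps above can be sketched from scratch. This minimal NumPy version (the function name `kmeans` is our own, and it ignores edge cases such as a cluster losing all of its points) mirrors the four steps directly:

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal K-Means sketch: returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Step 1: choose K of the data points as initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 2: assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its cluster
        new_centroids = np.array(
            [X[labels == j].mean(axis=0) for j in range(k)]
        )
        # Step 4: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```

This is only a teaching sketch; in practice scikit-learn's implementation (used below) adds smarter initialization and multiple restarts.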
K-Means Clustering Example
Below is a simple example using scikit-learn with customer data (age and income). Note that in practice, features with very different ranges, such as age and income here, are usually scaled first so that one feature does not dominate the distance calculation.
from sklearn.cluster import KMeans

# Sample data: each row is one customer as [Age, Income]
X = [
    [25, 30000],
    [30, 40000],
    [35, 60000],
    [40, 65000],
    [45, 80000],
    [50, 90000]
]

# Create the K-Means model with K = 2 clusters
model = KMeans(n_clusters=2, n_init=10, random_state=42)

# Train the model (find the centroids)
model.fit(X)

# Predict the cluster label for each customer
labels = model.predict(X)
print(labels)
The output shows which cluster each customer belongs to. Customers with similar age and income are grouped together automatically.
Understanding the Output
Each number represents a cluster ID. Data points with the same label belong to the same cluster.
Cluster numbers themselves carry no meaning; only the grouping matters.
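To see where each cluster's center lies, scikit-learn exposes the fitted centroids through the `cluster_centers_` attribute. Repeating the example above:

```python
from sklearn.cluster import KMeans

# Same sample data: [Age, Income] per customer
X = [[25, 30000], [30, 40000], [35, 60000],
     [40, 65000], [45, 80000], [50, 90000]]

model = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)

# One row per cluster: the [mean age, mean income] of its members
print(model.cluster_centers_)
```

Each centroid is simply the average of the customers assigned to that cluster, which makes the groups easy to interpret (for example, a "younger, lower-income" cluster versus an "older, higher-income" cluster).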
Choosing the Right Value of K
Choosing the correct number of clusters is important. A common method is the Elbow Method: run K-Means for several values of K, record the total within-cluster distance for each, and choose the K at which the curve bends, where further increases bring only small improvement.
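As a sketch of the Elbow Method on the same sample data, we can record the model's `inertia_` (the total within-cluster squared distance) for several values of K and look for where it stops dropping quickly:

```python
from sklearn.cluster import KMeans

# Same sample data: [Age, Income] per customer
X = [[25, 30000], [30, 40000], [35, 60000],
     [40, 65000], [45, 80000], [50, 90000]]

# inertia_ always decreases as K grows, so we look for the
# "elbow": the K after which the drop becomes small
inertias = []
for k in range(1, 6):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(model.inertia_)

for k, inertia in zip(range(1, 6), inertias):
    print(k, round(inertia))
```

In a real analysis you would plot inertia against K (for example with matplotlib) and pick the K at the visible bend of the curve.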
Advantages of K-Means
- Simple and easy to understand
- Fast and scalable
- Works well for large datasets
Limitations of K-Means
- Must choose K in advance
- Sensitive to outliers
- Works best with spherical clusters
Practice Questions
Practice 1: K-Means belongs to which type of learning?
Practice 2: What does K-Means try to form?
Practice 3: What represents the center of each cluster?
Quick Quiz
Quiz 1: What does the value K represent?
Quiz 2: K-Means works using which process?
Quiz 3: Which method helps choose the value of K?
Coming up next: Hierarchical Clustering — building clusters step by step without choosing K in advance.