AI Lesson 33 – K-Means Clustering | Dataplexa

K-Means Clustering

K-Means Clustering is an unsupervised learning algorithm used to group similar data points into clusters. Unlike supervised learning, K-Means works without labeled output data.

The goal of K-Means is simple: group data points so that points in the same cluster are more similar to each other than to points in other clusters.

Why Do We Need Clustering?

In many real-world problems, we do not know the correct output in advance. We only have raw data and want to discover hidden patterns or structures.

Examples include:

  • Customer segmentation in marketing
  • Grouping similar news articles
  • Image compression
  • Finding usage patterns in applications

Real-World Example

Suppose an e-commerce company wants to group customers based on age and income. The company does not know the groups beforehand but wants to identify patterns like low-income, medium-income, and high-income customers.

K-Means automatically finds these groups based on similarity.

What Does “K” Mean?

The letter K represents the number of clusters we want to form. This value is chosen by the user.

For example:

  • K = 2 → 2 clusters
  • K = 3 → 3 clusters
  • K = 5 → 5 clusters

How K-Means Works

K-Means follows an iterative process:

  • Choose K initial centroids randomly
  • Assign each data point to the nearest centroid
  • Recalculate centroids based on cluster averages
  • Repeat until centroids stop changing

This process minimizes the sum of squared distances between data points and their assigned cluster centroids (a quantity called inertia).
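The four steps above can be sketched directly in NumPy. This is a minimal from-scratch illustration (the function name and parameters are for illustration only; the lesson's real example uses scikit-learn below):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Plain K-Means: random init, assign, update, repeat."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    # 1. Choose K initial centroids randomly from the data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # 2. Assign each point to the nearest centroid
        dists = np.linalg.norm(X[:, None] - centroids[None, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Recalculate each centroid as the mean of its cluster
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # 4. Repeat until centroids stop changing
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```

Calling `kmeans([[0, 0], [0, 1], [10, 10], [10, 11]], 2)` groups the two low points together and the two high points together, whichever cluster IDs they receive.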

K-Means Clustering Example

Below is a simple example using customer data with age and income.


from sklearn.cluster import KMeans

# Sample data: [Age, Income]
X = [
    [25, 30000],
    [30, 40000],
    [35, 60000],
    [40, 65000],
    [45, 80000],
    [50, 90000]
]

# Create K-Means model
model = KMeans(n_clusters=2, random_state=42)

# Train model
model.fit(X)

# Predict cluster labels
labels = model.predict(X)
print(labels)
  
[0 0 1 1 1 1]

The output shows which cluster each customer belongs to. Customers with similar age and income are grouped together automatically.

Understanding the Output

Each number represents a cluster ID. Data points with the same label belong to the same cluster.

Cluster numbers themselves have no meaning — only the grouping matters.
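One detail worth noticing in the example above: income values (tens of thousands) dwarf age values, so Euclidean distance is driven almost entirely by income. A common refinement, sketched here with scikit-learn's StandardScaler (an addition to the lesson's example, which clusters the raw values), is to standardize features before clustering:

```python
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

X = [[25, 30000], [30, 40000], [35, 60000],
     [40, 65000], [45, 80000], [50, 90000]]

# Rescale each feature to zero mean and unit variance
X_scaled = StandardScaler().fit_transform(X)

# Cluster the scaled data so age and income contribute comparably
labels = KMeans(n_clusters=2, random_state=42, n_init=10).fit_predict(X_scaled)
print(labels)
```

After scaling, both age and income influence the grouping instead of income alone.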

Choosing the Right Value of K

Choosing the correct number of clusters is important. A common method is the Elbow Method.

In this method, we run K-Means for several values of K, record the inertia (within-cluster sum of squared distances) for each, and pick the "elbow" point where increasing K stops reducing inertia significantly.
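A minimal sketch of the Elbow Method, reusing the customer data from the example above (printing the values rather than plotting them, to keep the sketch dependency-free):

```python
from sklearn.cluster import KMeans

X = [[25, 30000], [30, 40000], [35, 60000],
     [40, 65000], [45, 80000], [50, 90000]]

# inertia_ = sum of squared distances from points to their nearest centroid
inertias = []
for k in range(1, 6):
    model = KMeans(n_clusters=k, random_state=42, n_init=10).fit(X)
    inertias.append(model.inertia_)

for k, inertia in zip(range(1, 6), inertias):
    print(f"K={k}  inertia={inertia:.0f}")
```

Inertia always decreases as K grows; the "elbow" is the K after which the decrease flattens out, which is usually the value chosen.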

Advantages of K-Means

  • Simple and easy to understand
  • Fast and scalable
  • Works well for large datasets

Limitations of K-Means

  • Must choose K in advance
  • Sensitive to outliers
  • Works best with spherical clusters

Practice Questions

Practice 1: K-Means belongs to which type of learning?



Practice 2: What does K-Means try to form?



Practice 3: What represents the center of each cluster?



Quick Quiz

Quiz 1: What does the value K represent?


Quiz 2: K-Means works using which process?


Quiz 3: Which method helps choose the value of K?


Coming up next: Hierarchical Clustering — building clusters step by step without choosing K in advance.