Data Science Lesson 49 – Clustering | Dataplexa
Machine Learning · Lesson 49

Clustering

Group similar customers automatically without knowing their categories upfront — discover hidden customer segments using K-Means and hierarchical clustering algorithms.

What Makes Clustering Special

Clustering finds groups without being told what to look for. Think about organizing your music — you naturally group similar songs together. Maybe by mood, genre, or energy level. But what if you didn't know these categories existed? Clustering algorithms discover these natural groupings automatically.

The magic happens in unsupervised learning. No target variable. No right answers. Just data points waiting to reveal their hidden structure. Honestly, this is where machine learning gets genuinely exciting — finding patterns humans never noticed.

1. Find similar data points
2. Group them into clusters
3. Analyze cluster characteristics
4. Make business decisions

K-Means Algorithm

K-Means works like organizing people at a party into conversation groups. You decide how many groups you want (that's the K), then the algorithm finds the best way to group people so everyone in each group is similar to each other. The algorithm iterates until it finds stable cluster centers.
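
Under the hood, the loop is just two repeated steps: assign every point to its nearest center, then move each center to the mean of its assigned points. Here's a minimal NumPy sketch of that loop on toy data, illustrative only and not sklearn's actual implementation:
# Minimal K-Means loop on toy 2-D points (illustrative sketch, not sklearn's code)
import numpy as np
X_toy = np.array([[1.0, 2.0], [1.5, 1.8], [8.0, 8.0], [9.0, 9.5]])
centers = X_toy[[0, 2]].copy()            # pick 2 points as initial centers
for _ in range(10):
    # Step 1: assign each point to its nearest center (Euclidean distance)
    distances = np.linalg.norm(X_toy[:, None, :] - centers[None, :, :], axis=2)
    labels = distances.argmin(axis=1)
    # Step 2: move each center to the mean of its assigned points
    new_centers = np.array([X_toy[labels == k].mean(axis=0) for k in range(2)])
    if np.allclose(new_centers, centers):  # stop once centers stabilize
        break
    centers = new_centers
print(centers)                            # final cluster centers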

Advantages

Simple to understand and implement. Works well with spherical clusters. Scales to large datasets efficiently.

Best For

Customer segmentation, market research, image compression, data preprocessing.

Limitations

You must choose K upfront. Struggles with non-spherical shapes. Sensitive to outliers.

Watch Out

Different random starts can give different results. Always run multiple times.

The scenario: Myntra's data team needs to segment customers for personalized campaigns. They want to group customers by spending behavior and demographics to create targeted marketing strategies.
# Import essential libraries for clustering analysis
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
# Load the ecommerce dataset for customer analysis
df = pd.read_csv('dataplexa_ecommerce.csv')
# Display first few rows to understand data structure
print(df.head())

What just happened?

We loaded customer transaction data with customer_age, revenue, rating, and quantity columns. These features will help identify distinct customer segments. Try this: Check df.info() to see data types and missing values.

# Create customer summary for clustering features
customer_features = df.groupby('customer_id').agg({
    'customer_age': 'first',      # Age is constant per customer
    'revenue': 'sum',             # Total spending per customer
    'rating': 'mean',             # Average satisfaction rating
    'quantity': 'sum'             # Total items purchased
}).reset_index()
# Display the aggregated customer data
print(customer_features.head())

What just happened?

We created customer profiles with four key metrics: customer_age (demographics), revenue (spending power), rating (satisfaction), and quantity (purchase volume). Each row now represents one customer's complete behavior pattern. Try this: Check revenue distribution with customer_features['revenue'].describe().

# Select numerical features for clustering
features = ['customer_age', 'revenue', 'rating', 'quantity']
X = customer_features[features]
# Standardize features to same scale (critical for K-means)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Check the scaled data shape and sample
print(f"Original shape: {X.shape}")
print(f"Scaled data sample:\n{X_scaled[:3]}")

What just happened?

StandardScaler converted all features to mean=0 and std=1. Notice how revenue values like 30000 became 1.68, while rating 4.2 became 0.32. This prevents revenue from dominating the clustering just because it has larger numbers. Try this: Compare X.mean() vs X_scaled.mean(axis=0).

Common Scaling Mistake

Skipping standardization when features have different units (age vs revenue in thousands). K-means uses Euclidean distance — revenue differences will completely overshadow age differences. Always scale first with StandardScaler() or MinMaxScaler().
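
To see why, here's a toy calculation with made-up numbers: two customers 30 years apart in age but only ₹1,000 apart in revenue.
# Toy example (hypothetical values): unscaled Euclidean distance is all revenue
import numpy as np
a = np.array([25, 30000])             # age 25, revenue 30000
b = np.array([55, 31000])             # age 55, revenue 31000
print(np.linalg.norm(a - b))          # ~1000.4: the 30-year age gap barely registers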

# Apply K-Means clustering with 3 customer segments
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
# Fit the model and get cluster labels
cluster_labels = kmeans.fit_predict(X_scaled)
# Add cluster labels back to original data
customer_features['cluster'] = cluster_labels
# Display cluster assignments
print(f"Cluster distribution:\n{pd.Series(cluster_labels).value_counts().sort_index()}")

What just happened?

K-means created 3 customer segments: clusters 0 and 1 each have 2 customers, cluster 2 has 1 customer. The random_state=42 ensures reproducible results, while n_init=10 runs the algorithm 10 times to find the best clustering. Try this: Access cluster centers with kmeans.cluster_centers_.
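
One follow-up worth knowing: because the model was fit on scaled data, kmeans.cluster_centers_ comes back in standardized units. A quick sketch to read the centers in original units, reusing the scaler and features defined above:
# Convert cluster centers from standardized units back to original units
centers_original = scaler.inverse_transform(kmeans.cluster_centers_)
print(pd.DataFrame(centers_original, columns=features).round(2))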

Each cluster represents distinct customer behavior patterns for targeted marketing

The scatter plot reveals Cluster 0 includes both high and medium spenders across different ages. This mixed pattern suggests these customers share other behavioral similarities beyond just age and revenue. Maybe they prefer similar product categories or have consistent purchase timing. Cluster 1 groups lower-spending customers who might be price-sensitive or occasional shoppers. This segment needs different marketing strategies — perhaps discount offers or loyalty programs to increase engagement. The age range varies, so demographics aren't the primary clustering factor here.
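
If you want to reproduce a plot like this yourself, here's a matplotlib sketch using the variables defined above:
# Sketch: scatter plot of age vs revenue, colored by K-means cluster
import matplotlib.pyplot as plt
plt.scatter(customer_features['customer_age'], customer_features['revenue'],
            c=customer_features['cluster'], cmap='viridis')
plt.xlabel('Customer Age')
plt.ylabel('Total Revenue')
plt.title('Customer Segments')
plt.colorbar(label='Cluster')
plt.show()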

Analyzing Cluster Characteristics

# Calculate average characteristics for each cluster
cluster_summary = customer_features.groupby('cluster').agg({
    'customer_age': 'mean',       # Average age per cluster
    'revenue': 'mean',            # Average spending per cluster
    'rating': 'mean',             # Average satisfaction per cluster
    'quantity': 'mean'            # Average items per cluster
}).round(2)
# Display the cluster profiles
print("Cluster Characteristics:")
print(cluster_summary)

What just happened?

Clear customer personas emerged: Cluster 0 = premium customers (₹16,250 average, highest ratings), Cluster 1 = budget shoppers (₹1,075 average, more quantity), Cluster 2 = mid-tier customers (₹6,400 average). The groupby('cluster') reveals distinct behavioral patterns for each segment. Try this: Add .std() to see variation within clusters.

Radar chart reveals distinct customer personas across all behavioral dimensions

The radar visualization makes cluster differences crystal clear. Premium customers (Cluster 0) show high revenue but lower quantity — they buy expensive items less frequently. Perfect for luxury product recommendations and VIP treatment. Budget shoppers (Cluster 1) have the opposite pattern — lower spending but higher quantity purchases. They're bargain hunters who need volume discounts and promotional pricing to maximize their lifetime value.
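
A radar chart like this can be sketched with matplotlib's polar axes. Here's one way, assuming the cluster_summary DataFrame from above (metrics are min-max normalized so all axes share one scale):
# Sketch: radar chart of cluster profiles, one polygon per cluster
import matplotlib.pyplot as plt
import numpy as np
norm = (cluster_summary - cluster_summary.min()) / (cluster_summary.max() - cluster_summary.min())
angles = np.linspace(0, 2 * np.pi, len(norm.columns), endpoint=False).tolist()
angles += angles[:1]                      # repeat first angle to close the polygon
fig, ax = plt.subplots(subplot_kw={'polar': True})
for cluster_id, row in norm.iterrows():
    values = row.tolist() + row.tolist()[:1]
    ax.plot(angles, values, label=f'Cluster {cluster_id}')
    ax.fill(angles, values, alpha=0.1)
ax.set_xticks(angles[:-1])
ax.set_xticklabels(norm.columns)
ax.legend()
plt.show()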

📊 Data Insight

Cluster 0 customers spend 15x more per transaction (₹16,250 vs ₹1,075) but buy fewer items. This suggests premium customers value quality over quantity — ideal for cross-selling high-margin accessories and personalized service offerings.

Hierarchical Clustering Alternative

Sometimes you don't want to choose the number of clusters upfront. Hierarchical clustering builds a tree of relationships, letting you decide where to "cut" for the optimal number of groups. Think of it like a family tree — you can trace relationships at different levels of detail.

The scenario: Swiggy wants to understand restaurant clustering patterns for delivery optimization. They don't know how many delivery zones to create, so hierarchical clustering can reveal natural groupings. (To keep the code self-contained, we'll demonstrate the technique on the customer features we already scaled; the same workflow applies to restaurant data.)
# Import hierarchical clustering from scipy
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.cluster import AgglomerativeClustering
# Create hierarchical clustering using Ward linkage
hierarchical = AgglomerativeClustering(n_clusters=3, linkage='ward')
# Fit and predict cluster labels
hier_labels = hierarchical.fit_predict(X_scaled)
# Add hierarchical labels to compare with K-means
customer_features['hier_cluster'] = hier_labels
print(f"Hierarchical clusters: {pd.Series(hier_labels).value_counts().sort_index()}")

What just happened?

Hierarchical clustering created different groupings than K-means — clusters 0 and 2 have 2 customers each, cluster 1 has 1. The Ward linkage minimizes within-cluster variance, often producing more balanced clusters than K-means. Try this: Compare results with customer_features[['cluster', 'hier_cluster']].
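
The dendrogram and linkage imports above are what let you inspect the full tree. A sketch, reusing X_scaled from earlier:
# Sketch: build the linkage matrix and plot the dendrogram
import matplotlib.pyplot as plt
Z = linkage(X_scaled, method='ward')   # same Ward criterion as AgglomerativeClustering
dendrogram(Z)
plt.xlabel('Customer index')
plt.ylabel('Merge distance')
plt.show()
Long vertical gaps in the dendrogram mark natural places to cut the tree.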

# Compare clustering methods side by side
comparison = customer_features[['customer_age', 'revenue', 'cluster', 'hier_cluster']]
# Create comparison showing both clustering results
print("Clustering Method Comparison:")
print(comparison)
# Calculate silhouette score to measure cluster quality
from sklearn.metrics import silhouette_score
kmeans_score = silhouette_score(X_scaled, cluster_labels)
hier_score = silhouette_score(X_scaled, hier_labels)
print(f"\nK-means silhouette score: {kmeans_score:.3f}")
print(f"Hierarchical silhouette score: {hier_score:.3f}")

What just happened?

Different algorithms grouped customers differently! K-means achieved a higher silhouette score (0.482 vs 0.389), indicating tighter, more separated clusters. Customer 1 moved from K-means cluster 0 to hierarchical cluster 2. The silhouette score measures how well-separated clusters are (higher = better). Try this: Test with different linkage methods like 'complete' or 'average', as sketched below.
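
Picking up that suggestion, here's a quick sketch that loops over linkage methods, reusing the imports and X_scaled from above:
# Sketch: compare silhouette scores across linkage methods
for method in ['ward', 'complete', 'average', 'single']:
    model = AgglomerativeClustering(n_clusters=3, linkage=method)
    labels = model.fit_predict(X_scaled)
    print(f"{method}: silhouette = {silhouette_score(X_scaled, labels):.3f}")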

K-means outperforms hierarchical clustering on this dataset with clearer cluster separation

The performance comparison shows K-means produces more distinct customer segments with a 24% higher silhouette score. This makes sense for spherical customer behavior patterns where spending and demographics cluster naturally around centers. But hierarchical clustering offers one major advantage — you can explore different numbers of clusters after fitting the model. K-means forces you to decide upfront, while hierarchical lets you cut the tree at different levels to find the optimal segmentation.
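
That re-cutting is done with scipy's fcluster, which slices an existing linkage matrix into flat clusters without refitting. A sketch, assuming the Z matrix built in the dendrogram step above:
# Sketch: cut the same hierarchical tree at different levels without refitting
from scipy.cluster.hierarchy import fcluster
import numpy as np
for k in [2, 3, 4]:
    labels_k = fcluster(Z, t=k, criterion='maxclust')   # at most k flat clusters
    print(f"k={k}: cluster sizes = {np.bincount(labels_k)[1:]}")   # labels start at 1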

Pro Tip: Choosing Cluster Count

Use the elbow method: plot K-means inertia (within-cluster sum of squares) for different K values. The "elbow" point where improvement slows down indicates optimal cluster count. For hierarchical clustering, examine the dendrogram to see natural breakpoints in the tree structure.
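
Here's a sketch of the elbow method on the scaled data from earlier:
# Sketch of the elbow method: plot inertia for a range of K values
import matplotlib.pyplot as plt
inertias = []
k_values = range(1, 6)                 # small range; this demo dataset is tiny
for k in k_values:
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    km.fit(X_scaled)
    inertias.append(km.inertia_)
plt.plot(list(k_values), inertias, marker='o')
plt.xlabel('Number of clusters (K)')
plt.ylabel('Inertia (within-cluster sum of squares)')
plt.title('Elbow Method')
plt.show()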

Business Applications

Customer segmentation is just the start. Clustering powers recommendation systems, fraud detection, and market research. Netflix groups similar movies, banks identify suspicious transaction patterns, and retail companies optimize store layouts based on customer movement clusters.

Recommended: K-Means

  • Large datasets (1000+ points)
  • Spherical cluster shapes expected
  • Know approximate cluster count
  • Need fast, scalable solution

Alternative: Hierarchical

  • Small to medium datasets (<1000 points)
  • Unknown optimal cluster count
  • Need to explore cluster relationships
  • Irregular cluster shapes possible

Real companies report 15-25% increases in marketing ROI after implementing customer clustering. Why? Because targeted campaigns perform dramatically better than mass marketing. Your high-value customers want premium experiences, not discount offers.

And here's something most tutorials skip: clustering goes smoothly about 90% of the time. The 10% that trips everyone up? Categorical variables mixed with numerical ones, or data with too many dimensions. Always start with exploratory data analysis to understand your data's structure before clustering.
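
For the mixed categorical/numerical case, one common workaround is one-hot encoding before scaling. A sketch with a hypothetical 'city' column (not in our actual dataset):
# Sketch: one-hot encode a hypothetical 'city' column before clustering
df_mixed = pd.get_dummies(customer_features, columns=['city'], dtype=int)   # 'city' is hypothetical
numeric_cols = df_mixed.select_dtypes(include='number').columns
X_mixed = StandardScaler().fit_transform(df_mixed[numeric_cols])
# X_mixed can now feed KMeans, though Euclidean distance on one-hot columns is a rough approximation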

Quiz

1. You're clustering customers with features: age (20-60), revenue (₹1000-₹100000), and rating (1-5). Why is StandardScaler crucial before applying K-means?


2. A startup doesn't know how many customer segments exist in their data. They want to experiment with 2, 3, or 4 clusters. Which approach is most efficient?


3. After running K-means with k=3 on your customer data, you notice one cluster has only 1 customer while others have 200+ customers each. What does this likely indicate?


Up Next

ML Metrics

Master precision, recall, F1-score, and ROC curves to evaluate how well your classification models actually perform in production.