Hierarchical Clustering
Hierarchical Clustering is an unsupervised learning technique that groups data points by building a hierarchy of clusters. Unlike K-Means, it does not require us to specify the number of clusters in advance.
Instead of forming clusters in one step, Hierarchical Clustering creates a tree-like structure that shows how data points are merged or split step by step.
Why Hierarchical Clustering?
In many real-world problems, we do not know how many clusters exist in the data. Choosing the wrong value of K in K-Means can lead to poor grouping.
Hierarchical Clustering solves this by allowing us to explore the structure of data first and then decide how many clusters make sense.
Real-World Example
Think about organizing files on your computer. You first group files into folders, then folders into categories, and sometimes categories into broader groups.
This layered grouping is similar to how Hierarchical Clustering organizes data.
Types of Hierarchical Clustering
There are two main approaches:
- Agglomerative (bottom-up): starts with each data point as its own cluster and repeatedly merges the closest pair
- Divisive (top-down): starts with all data points in one cluster and recursively splits it
Agglomerative clustering is the most commonly used approach.
How Agglomerative Clustering Works
The algorithm follows these steps:
- Start with each data point as a separate cluster
- Find the two closest clusters
- Merge them into one cluster
- Repeat until all points are in a single cluster
The result is a hierarchy that can be visualized using a dendrogram.
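To make the merge sequence concrete, here is a minimal sketch using SciPy's linkage function (the sample points below are made up for illustration). Each row of the returned matrix records one merge: the indices of the two clusters joined, the distance between them, and the size of the new cluster.
from scipy.cluster.hierarchy import linkage

# Six small 2-D points (illustrative values only)
points = [[1, 2], [1, 3], [5, 8], [6, 8], [9, 1], [9, 2]]

# Ward linkage: repeatedly merge the two "closest" clusters
Z = linkage(points, method="ward")

# Each row of Z: [cluster_a, cluster_b, merge_distance, new_size]
print(Z)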
Hierarchical Clustering Example
Below is a simple example using agglomerative clustering.
from sklearn.cluster import AgglomerativeClustering

# Sample data: each row could represent [age, annual income]
X = [
    [25, 30000],
    [30, 40000],
    [35, 60000],
    [40, 65000],
    [45, 80000],
    [50, 90000],
]

# Create the model (two clusters, Ward linkage by default)
model = AgglomerativeClustering(n_clusters=2)

# Fit the model and assign a cluster label to each point
labels = model.fit_predict(X)
print(labels)
The output shows how the data points are grouped by similarity. Unlike K-Means, agglomerative clustering is deterministic, so no random initialization is involved.
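Note that "closest clusters" depends on a linkage criterion. scikit-learn's AgglomerativeClustering uses Ward linkage by default, which minimizes within-cluster variance; as a quick sketch, other criteria such as "average" or "complete" can be swapped in via the linkage parameter:
# Average linkage uses the mean distance between all pairs of
# points across two clusters; results may differ from Ward's
model_avg = AgglomerativeClustering(n_clusters=2, linkage="average")
print(model_avg.fit_predict(X))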
Understanding Dendrograms
A dendrogram is a tree diagram that shows how clusters are merged at different distances.
By cutting the dendrogram at a chosen height, we can decide the number of clusters that best represent the data.
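As a minimal sketch of both steps, assuming the same X as in the earlier example, SciPy can draw the dendrogram and then "cut" it into a chosen number of flat clusters:
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

# Build the merge hierarchy with Ward linkage
Z = linkage(X, method="ward")

# Draw the dendrogram; the y-axis shows the distance at each merge
dendrogram(Z)
plt.xlabel("Data point index")
plt.ylabel("Merge distance")
plt.show()

# Cut the tree so that at most two clusters remain
flat_labels = fcluster(Z, t=2, criterion="maxclust")
print(flat_labels)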
Advantages of Hierarchical Clustering
- No need to specify number of clusters initially
- Produces interpretable hierarchical structure
- Works well for small to medium datasets
Limitations of Hierarchical Clustering
- Computationally expensive for large datasets
- Greedy: once two clusters are merged, the decision cannot be undone
- Sensitive to noise and outliers
Practice Questions
Practice 1: Hierarchical Clustering belongs to which learning type?
Practice 2: What diagram is used to visualize hierarchical clustering?
Practice 3: Which hierarchical method starts with individual data points?
Quick Quiz
Quiz 1: Does Hierarchical Clustering require choosing K beforehand?
Quiz 2: What structure does Hierarchical Clustering produce?
Quiz 3: How does agglomerative clustering build its hierarchy?
Coming up next: Dimensionality Reduction — understanding how to reduce features while preserving information.