Cluster Analysis
In many analytical problems, the goal is neither prediction nor hypothesis testing but grouping similar observations.
Cluster analysis is an unsupervised learning technique that groups cases so that observations within a cluster are similar to each other and different from those in other clusters.
Why Cluster Analysis Is Used
Cluster analysis helps answer questions such as:
- Which customers have similar buying behavior?
- Can employees be grouped by performance patterns?
- Are there natural segments in the data?
Unlike regression or classification, cluster analysis has no target variable.
Key Idea Behind Clustering
Clustering is based on distance or similarity.
Observations that are close to each other (in terms of variable values) are placed in the same cluster.
The goal is to:
- Maximize similarity within clusters
- Maximize difference between clusters
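To make "close to each other" concrete, here is a minimal Python sketch of Euclidean distance, one common distance measure. The three customers and their values are hypothetical and are used only for illustration.

```python
import math

# Hypothetical customers: (annual income in thousands, spending score)
customers = {
    "A": (80.0, 85.0),
    "B": (78.0, 90.0),
    "C": (25.0, 20.0),
}

def euclidean(p, q):
    """Straight-line distance between two cases measured on the same variables."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

d_ab = euclidean(customers["A"], customers["B"])
d_ac = euclidean(customers["A"], customers["C"])

# A is much closer to B than to C, so A and B are natural cluster mates.
print(f"A-B: {d_ab:.2f}")
print(f"A-C: {d_ac:.2f}")
```

A small distance means two cases have similar values on every variable, which is what a clustering method looks for when it forms groups.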
Types of Cluster Analysis in SPSS
SPSS mainly supports:
- Hierarchical Clustering
- K-Means Clustering
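Each method serves a different purpose: hierarchical clustering suits exploring small datasets, while K-Means suits larger datasets when the number of clusters is chosen in advance.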
Hierarchical Clustering
Hierarchical clustering:
- Does not require pre-specifying the number of clusters
- Builds clusters step by step
- Produces a dendrogram
It is useful for exploratory analysis and small datasets.
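The step-by-step merging can be sketched in a few lines of Python. This is an illustrative sketch, not SPSS's implementation: it assumes average linkage (the distance between two clusters is the mean distance between their members) and uses five hypothetical points. The merge history it records is the information a dendrogram draws.

```python
import math

def agglomerate(points):
    """Average-linkage agglomerative clustering; returns the merge history."""
    clusters = [[i] for i in range(len(points))]  # every case starts alone
    history = []
    while len(clusters) > 1:
        best = None
        # Find the pair of clusters with the smallest average pairwise distance.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                dists = [math.dist(points[i], points[j])
                         for i in clusters[a] for j in clusters[b]]
                avg = sum(dists) / len(dists)
                if best is None or avg < best[0]:
                    best = (avg, a, b)
        avg, a, b = best
        history.append((clusters[a], clusters[b], round(avg, 3)))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return history

# Hypothetical (income, spending) values, assumed to be on a comparable scale.
points = [(1.0, 1.0), (1.2, 1.1), (5.0, 5.0), (5.3, 4.8), (9.0, 1.0)]
history = agglomerate(points)
for left, right, height in history:
    print(left, "+", right, "merged at distance", height)
```

Because every pair of clusters is compared at every step, the work grows quickly with the number of cases, which is one reason the method is usually applied to small datasets.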
K-Means Clustering
K-Means clustering:
- Requires specifying the number of clusters (k)
- Works well with large datasets
- Minimizes within-cluster variance
It is commonly used in business applications.
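The core loop of K-Means can be sketched as follows. This is a minimal illustration under stated assumptions, not SPSS's own procedure: the six data points are hypothetical, the starting centers are picked by hand, and the helper name kmeans is introduced here only for the example.

```python
import math

def kmeans(points, centers, max_iter=10):
    """Plain k-means: alternate between assigning points and updating centers."""
    centers = [tuple(c) for c in centers]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest center.
        labels = [
            min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        # Update step: each center moves to the mean of its assigned points.
        new_centers = []
        for j, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centers.append(old)  # keep an empty cluster's center in place
        if new_centers == centers:
            break  # converged: no center moved
        centers = new_centers
    return labels, centers

# Hypothetical standardized (income, spending) values for six customers.
points = [(1.2, 1.1), (1.0, 1.3), (1.1, -1.0), (0.9, -1.2), (-1.3, -0.9), (-1.1, -1.1)]
# Assumption: the first, third, and fifth points serve as starting centers.
labels, centers = kmeans(points, centers=[points[0], points[2], points[4]])
print(labels)
print(centers)
```

Each pass can only lower or keep the total squared distance from points to their centers, which is how the method reduces within-cluster variance. The result depends on the starting centers, so different starts can give different clusters.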
Example Scenario
A retail company collects data on customers:
- Annual income
- Spending score
Cluster analysis can segment customers into groups such as:
- High income – high spenders
- High income – low spenders
- Low income – low spenders
Preparing Data for Clustering
Before clustering:
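- Standardize variables, for example as z-scores (important)
- Check for extreme outliers and remove them where justified, since they can distort cluster centers
- Use numeric variables only, because distances are computed from variable values
Standardization ensures all variables contribute equally. Without it, a variable measured in large units, such as annual income, dominates the distance calculation.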
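The effect of standardization can be seen in a short Python sketch. The income and spending values are hypothetical, and z-scores computed with the sample standard deviation are assumed as the standardization method.

```python
import statistics

# Hypothetical raw data: income in dollars, spending score on a much smaller scale
income = [25000.0, 40000.0, 60000.0, 85000.0, 90000.0]
spending = [20.0, 35.0, 50.0, 80.0, 90.0]

def z_scores(values):
    """Rescale values to mean 0 and standard deviation 1 (sample SD, n - 1)."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

z_income = z_scores(income)
z_spending = z_scores(spending)

# Before: the income gap dwarfs the spending gap. After: both are on the same scale.
print("raw gaps:", income[-1] - income[0], spending[-1] - spending[0])
print("z gaps:  ", round(z_income[-1] - z_income[0], 2), round(z_spending[-1] - z_spending[0], 2))
```

On the raw values, a distance between two customers would be driven almost entirely by income. After rescaling, a one-unit difference means one standard deviation on either variable.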
Running K-Means Clustering (Menu)
To run K-Means in SPSS:
- Go to Analyze → Classify → K-Means Cluster
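- Move the clustering variables into the Variables box
- Specify the number of clusters
- Click Save and select Cluster membership
- Click OK
SPSS assigns each observation to a cluster and, with Cluster membership selected, stores the cluster number as a new variable in the dataset.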
SPSS Syntax Example
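* K-Means with 3 clusters on income and spending score.
* The SAVE subcommand stores the cluster number of each case as a new variable.
QUICK CLUSTER Income Spending_Score
  /CRITERIA=CLUSTER(3)
  /METHOD=KMEANS
  /SAVE CLUSTER.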
Interpreting Cluster Output
When interpreting clusters:
- Examine cluster centers (means)
- Understand characteristics of each cluster
- Assign meaningful labels
Clusters must be interpreted in their business or research context.
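A cluster center is simply the mean of each variable within the cluster. The Python sketch below shows the calculation on hypothetical cases whose cluster numbers are assumed to be already assigned.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical cases: (cluster number, annual income in thousands, spending score)
cases = [
    (1, 85.0, 88.0), (1, 90.0, 80.0),
    (2, 88.0, 15.0), (2, 82.0, 25.0),
    (3, 22.0, 18.0), (3, 28.0, 12.0),
]

groups = defaultdict(list)
for cluster, income, spending in cases:
    groups[cluster].append((income, spending))

# A cluster center is the mean of each variable over the cluster's members.
centers = {
    cluster: (mean(m[0] for m in members), mean(m[1] for m in members))
    for cluster, members in groups.items()
}

for cluster, (inc, spend) in sorted(centers.items()):
    print(f"Cluster {cluster}: n={len(groups[cluster])}, income={inc:.1f}, spending={spend:.1f}")
```

Reading the centers side by side suggests labels: here cluster 1 would be high income and high spending, cluster 2 high income and low spending, and cluster 3 low income and low spending.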
Common Mistakes
Typical errors include:
- Not standardizing variables
- Choosing the wrong number of clusters
- Over-interpreting clusters that may reflect noise (K-Means returns k clusters even when the data have no real structure)
Clustering is exploratory, not definitive.
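One common heuristic for judging the number of clusters is to compare the within-cluster sum of squares across several values of k and look for the point where the improvement levels off. The sketch below is only an illustration: the points are hypothetical and the partitions are made by hand rather than produced by a clustering algorithm.

```python
def wcss(clusters):
    """Total squared distance from each point to its own cluster's mean."""
    total = 0.0
    for members in clusters:
        center = [sum(col) / len(members) for col in zip(*members)]
        total += sum((x - c) ** 2 for p in members for x, c in zip(p, center))
    return total

# Hypothetical standardized points forming two visibly separate groups.
a = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.2)]
b = [(-1.0, -1.0), (-1.2, -0.8), (-0.8, -1.2)]

# Hand-made partitions for k = 1, 2, 3 (assumed, not produced by an algorithm).
partitions = {
    1: [a + b],
    2: [a, b],
    3: [a, b[:2], b[2:]],
}
scores = {k: wcss(p) for k, p in partitions.items()}
for k, s in scores.items():
    print(k, round(s, 3))
```

The value always falls as k grows, so a lower number alone is not evidence of a better solution. A large drop followed by a small one, as between k = 2 and k = 3 here, suggests the extra cluster adds little.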
Quiz 1
Is cluster analysis supervised or unsupervised?
Unsupervised.
Quiz 2
Does clustering require a dependent variable?
No.
Quiz 3
Which method produces a dendrogram?
Hierarchical clustering.
Quiz 4
Why should variables be standardized?
To ensure equal contribution of variables.
Quiz 5
Is clustering mainly exploratory?
Yes.
Mini Practice
Create a dataset with customer income and spending data.
Apply K-Means clustering with three clusters and describe each cluster.
Standardize variables first, then interpret cluster centers.
What’s Next
In the next lesson, you will learn about Discriminant Analysis, which is used to classify observations into predefined groups.