EDA Course
Univariate Analysis
You've completed the Beginner section. Univariate analysis is where Intermediate begins — and it's deeper than it sounds. Analysing one variable at a time isn't a warm-up exercise. It's how you build an accurate mental model of each column before you start comparing them.
What Univariate Analysis Actually Is
Univariate means one variable. The goal is to fully understand the distribution, shape, and character of each column in isolation — before you introduce any relationships or comparisons. You're building a portrait of the column: where are the values concentrated, how spread out are they, are there unusual values, what does the shape tell you?
The analysis looks different depending on whether the column is numeric or categorical. Numeric columns have distributions, spread, and shape. Categorical columns have frequency, cardinality, and dominance. Each needs its own toolkit.
Numeric Column Checklist
- Central tendency — mean, median, mode
- Spread — std dev, IQR, range
- Shape — skewness, kurtosis
- Outliers — IQR rule, z-score
- Missing value rate
- Distribution type (normal? uniform? bimodal?)
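The whole numeric checklist maps onto a handful of pandas Series methods. A minimal sketch, using a small made-up Series rather than the course dataset:

```python
import pandas as pd

# Hypothetical numeric column, used only to walk the checklist
s = pd.Series([2, 3, 3, 4, 5, 5, 6, 7, 9, 40])

mean, median = s.mean(), s.median()                  # central tendency
q1, q3 = s.quantile(0.25), s.quantile(0.75)
std, iqr = s.std(), q3 - q1                          # spread
skew, kurt = s.skew(), s.kurt()                      # shape
outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]  # IQR rule
missing_rate = s.isnull().mean()                     # missing value rate

print(f"mean={mean}, median={median}, iqr={iqr}, outliers={len(outliers)}")
```

The extreme value 40 drags the mean (8.4) far above the median (5.0), and the IQR rule flags it as the single outlier, which is the checklist working as intended.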
Categorical Column Checklist
- Cardinality — how many unique values?
- Frequency — which value dominates?
- Rare categories — values with very few counts
- Missing value rate
- Ordinal or nominal? (is there an order?)
- Encoding strategy needed?
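The categorical checklist is just as mechanical. A sketch on a made-up column (the values here are illustrative, not from the patient dataset):

```python
import pandas as pd

# Hypothetical categorical column with one rare label and one missing value
c = pd.Series(['A', 'A', 'A', 'B', 'B', 'C', None])

cardinality = c.nunique()                 # distinct values, NaN excluded
freq = c.value_counts()                   # frequency table, most common first
dominant = freq.index[0]                  # the dominating category
rare = freq[freq < 2].index.tolist()      # categories seen fewer than twice
missing = c.isnull().sum()                # missing value count

print(cardinality, dominant, rare, missing)
```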
Full Numeric Profile in One Function
The scenario: You're a data analyst at a private healthcare company. The clinical team has handed you a patient dataset to prepare for a readmission risk model. Before anything else, your lead data scientist wants a full statistical profile of every numeric column — not just the defaults from .describe(), but skewness, kurtosis, IQR, outlier count, and missing rate all in one clean table. She wants to be able to read one row per column and understand the entire distribution.
import pandas as pd # pandas: Python's core data table library — all statistical methods and DataFrame ops
import numpy as np # numpy: numerical library — np.nan for missing values, np.percentile for IQR
# Patient dataset — 12 rows, realistic clinical fields
df = pd.DataFrame({
    'patient_id':     range(1001, 1013),
    'age':            [45, 62, 38, 71, 55, 48, 67, 29, 83, 52, 44, 61],
    'bmi':            [24.1, 31.8, 22.4, 28.6, np.nan, 35.2, 27.9, 21.3, 33.1, 26.5, np.nan, 29.4],
    'systolic_bp':    [118, 145, 122, 158, 134, 142, 128, 115, 172, 136, 119, 141],
    'num_admissions': [1, 4, 1, 6, 2, 3, 5, 1, 8, 2, 1, 3],
    'los_days':       [3, 7, 2, 12, 4, 6, 9, 2, 15, 5, 3, 6]  # length of stay in days
})
def numeric_profile(dataframe):
    """Produces a full univariate statistical profile for every numeric column."""
    results = []
    for col in dataframe.select_dtypes(include='number').columns:  # only numeric columns
        s = dataframe[col].dropna()  # drop NaN for clean stats — missing rate is reported separately
        q1 = s.quantile(0.25)  # first quartile
        q3 = s.quantile(0.75)  # third quartile
        iqr = q3 - q1          # interquartile range — the middle 50% spread
        # IQR outlier rule: anything outside [Q1 - 1.5×IQR, Q3 + 1.5×IQR] is a candidate outlier
        lower_fence = q1 - 1.5 * iqr
        upper_fence = q3 + 1.5 * iqr
        n_outliers = ((s < lower_fence) | (s > upper_fence)).sum()  # count rows outside the fences
        results.append({
            'column': col,
            'count': s.count(),  # non-null count
            'missing_pct': round(dataframe[col].isnull().mean() * 100, 1),  # % missing from original column
            'mean': round(s.mean(), 2),
            'median': round(s.median(), 2),
            'std': round(s.std(), 2),
            'iqr': round(iqr, 2),
            'min': s.min(),
            'max': s.max(),
            'skewness': round(s.skew(), 3),  # .skew() — adjusted Fisher-Pearson coefficient
            'kurtosis': round(s.kurt(), 3),  # .kurt() — excess kurtosis (normal = 0)
            'n_outliers': int(n_outliers)
        })
    return pd.DataFrame(results).set_index('column')
profile = numeric_profile(df)
print(profile.to_string())
                count  missing_pct     mean   median    std    iqr     min     max  skewness  kurtosis  n_outliers
column
patient_id         12          0.0  1006.50  1006.50   3.61   5.50  1001.0  1012.0     0.000    -1.200           0
age                12          0.0    54.58    53.50  15.11  18.50    29.0    83.0     0.199    -0.162           0
bmi                10         16.7    28.03    28.25   4.57   6.50    21.3    35.2     0.011    -0.945           0
systolic_bp        12          0.0   135.83   135.00  17.16  21.50   115.0   172.0     0.793     0.276           0
num_admissions     12          0.0     3.08     2.50   2.27   3.25     1.0     8.0     1.045     0.376           0
los_days           12          0.0     6.17     5.50   4.06   4.50     2.0    15.0     1.123     0.669           1
What just happened?
pandas provides every statistical method in the profile. .quantile(0.25) and .quantile(0.75) return Q1 and Q3 — the 25th and 75th percentiles. The IQR (Q3 − Q1) is the width of the middle 50% of values. The fences at Q1 − 1.5×IQR and Q3 + 1.5×IQR mark the boundary beyond which a value is flagged as a potential outlier — the same rule used by box plot whiskers.
numpy is present as the numerical backbone — all pandas operations ultimately use numpy arrays under the hood.
Reading the profile: bmi has 16.7% missing — flag that immediately. num_admissions (skew 1.045) and los_days (skew 1.123) are both strongly right-skewed — a few patients with many admissions or very long stays drag the mean above the median, and los_days has one flagged outlier (the 15-day stay). systolic_bp (skew 0.793) is moderately right-skewed, pulled up by the 172 reading. patient_id has kurtosis −1.2, which is expected — it's a uniform sequence of IDs, not a natural distribution at all.
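The checklist also named the z-score rule as an alternative outlier test. A minimal sketch, recreating los_days as a standalone Series (the ±3 cutoff is a common convention, not a fixed law):

```python
import pandas as pd

# los_days values from the scenario, as a standalone Series
los = pd.Series([3, 7, 2, 12, 4, 6, 9, 2, 15, 5, 3, 6])

# z-score: how many standard deviations each value sits from the mean
z = (los - los.mean()) / los.std()
print(z.round(2).tolist())

# |z| > 3 is the classic cutoff; with only 12 rows nothing reaches it,
# which is one reason the IQR rule is often preferred on small samples
extreme = los[z.abs() > 3]
print(len(extreme))  # 0 here — even the 15-day stay only scores z ≈ 2.2
```

The IQR rule flags the 15-day stay; the z-score rule does not. On small, skewed samples the mean and standard deviation are themselves distorted by the extreme value, which blunts the z-score test.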
Reading a Distribution — The Five Patterns
Every numeric distribution falls into one of five broad shapes. Knowing which one you're looking at tells you how to handle the column downstream.
Normal
Bell curve, symmetric, skew ≈ 0
Right Skewed
Long right tail, skew > 0.5
Left Skewed
Long left tail, skew < −0.5
Uniform
Flat — all values equally likely
Bimodal
Two peaks — may indicate two subgroups
Bimodal is the one that surprises people most. If you see two peaks, you might have two different populations mixed together — e.g. a dataset with both adults and children, or weekday vs weekend behaviour.
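One way to make the five shapes concrete is to simulate each one and see what skewness it produces. A sketch using seeded numpy random draws (simulated data, not the patient dataset):

```python
import pandas as pd
import numpy as np

rng = np.random.default_rng(42)  # seeded so the draws are reproducible

shapes = {
    'normal':  pd.Series(rng.normal(50, 10, 2000)),          # bell curve
    'right':   pd.Series(rng.exponential(10, 2000)),         # long right tail
    'left':    pd.Series(100 - rng.exponential(10, 2000)),   # long left tail
    'uniform': pd.Series(rng.uniform(0, 100, 2000)),         # flat
    'bimodal': pd.Series(np.concatenate([rng.normal(30, 5, 1000),
                                         rng.normal(70, 5, 1000)])),  # two peaks
}
for name, s in shapes.items():
    print(f"{name:8s} skew={s.skew():6.3f}")
```

Notice that the bimodal series scores a near-zero skew, just like the normal one. Skewness alone cannot distinguish one peak from two, which is exactly the trap explored later in this lesson.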
Categorical Univariate Analysis — Frequency and Cardinality
The scenario: The patient dataset also has categorical columns: admission type (emergency, elective, urgent), primary diagnosis (ICD-10 code category), and ward. Your lead wants to know which categories dominate, which are rare enough to be problematic, and whether any columns have so many unique values they'll explode a one-hot encoding. These are cardinality and frequency questions — the categorical equivalent of skewness and spread.
import pandas as pd # pandas: data library — .value_counts(), .nunique(), and categorical analysis
import numpy as np # numpy: numerical library — standard import for EDA scripts
# Extend the patient dataset with categorical columns
df['admission_type'] = ['Emergency', 'Elective', 'Emergency', 'Emergency', 'Urgent',
                        'Elective', 'Emergency', 'Elective', 'Emergency', 'Urgent',
                        'Emergency', 'Elective']
df['primary_diagnosis'] = ['Cardiac', 'Respiratory', 'Orthopaedic', 'Cardiac', 'Diabetes',
                           'Respiratory', 'Cardiac', 'Orthopaedic', 'Cardiac', 'Diabetes',
                           'Respiratory', 'Cardiac']
df['ward'] = ['Ward A', 'Ward B', 'Ward C', 'Ward A', 'Ward B', 'Ward C',
              'Ward A', 'Ward B', 'Ward A', 'Ward C', 'Ward B', 'Ward A']
cat_cols = ['admission_type', 'primary_diagnosis', 'ward'] # columns to profile
for col in cat_cols:
    n_unique = df[col].nunique()                                  # .nunique() — count of distinct values
    top_value = df[col].value_counts().index[0]                   # most frequent category
    top_pct = df[col].value_counts(normalize=True).iloc[0] * 100  # normalize=True gives proportions
    # Rare category threshold: any category with fewer than 2 occurrences
    rare = df[col].value_counts()
    rare_cats = rare[rare < 2].index.tolist()                     # list of category names with count < 2
    print(f"--- {col} ---")
    print(f"  Unique values (cardinality): {n_unique}")
    print(f"  Most frequent: '{top_value}' ({top_pct:.1f}% of rows)")
    print(f"  Rare categories (<2 rows): {rare_cats if rare_cats else 'None'}")
    print(f"  Missing: {df[col].isnull().sum()}")
    print()
    print(df[col].value_counts().to_string())                     # full frequency table
    print()
--- admission_type ---
  Unique values (cardinality): 3
  Most frequent: 'Emergency' (58.3% of rows)
  Rare categories (<2 rows): None
  Missing: 0

admission_type
Emergency    7
Elective     3
Urgent       2

--- primary_diagnosis ---
  Unique values (cardinality): 4
  Most frequent: 'Cardiac' (41.7% of rows)
  Rare categories (<2 rows): None
  Missing: 0

primary_diagnosis
Cardiac        5
Respiratory    3
Orthopaedic    2
Diabetes       2

--- ward ---
  Unique values (cardinality): 3
  Most frequent: 'Ward A' (41.7% of rows)
  Rare categories (<2 rows): None
  Missing: 0

ward
Ward A    5
Ward B    4
Ward C    3
What just happened?
pandas provides the two key categorical analysis methods here. .nunique() counts the number of distinct values in a column — this is cardinality. A column with 500 unique values in 600 rows is probably a free-text field or an ID, not a useful categorical feature. .value_counts(normalize=True) returns proportions (0.0–1.0) instead of raw counts, making it easy to see percentage dominance.
The rare category check matters for modelling: if "Urgent" only appears once in 1,000 rows, one-hot encoding will create a column that's 999 zeros and 1 one. That column carries almost no predictive signal and wastes memory. In practice you'd either merge rare categories into an "Other" bucket or drop them.
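A sketch of that merge, on a hypothetical diagnosis column (the labels and the <2 threshold are illustrative):

```python
import pandas as pd

# Hypothetical column with two rare categories
diag = pd.Series(['Cardiac'] * 6 + ['Respiratory'] * 4 + ['Renal', 'Hepatic'])

counts = diag.value_counts()
rare = counts[counts < 2].index                  # categories appearing fewer than twice
merged = diag.where(~diag.isin(rare), 'Other')   # replace rare labels with 'Other'

print(merged.value_counts())
```

Series.where keeps values where the condition holds and substitutes 'Other' elsewhere, so 'Renal' and 'Hepatic' collapse into a single bucket with two rows, and one-hot encoding now produces three columns instead of four.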
Emergency admissions dominate at 58.3%. Cardiac is the most common diagnosis at 41.7%. Both are class imbalance signals — if you're building a classifier predicting any of these labels, the model will be biased toward the majority class without rebalancing.
Detecting Bimodality — When One Peak Isn't Enough
The scenario: Your lead asks you to look more carefully at the age distribution. Something feels off — the mean and median are close (54.6 vs 53.5) but the standard deviation is high for a clinical dataset. She suspects there may be two distinct patient age groups in this dataset: a younger surgical cohort and an older chronic disease cohort. A single mean masks this entirely. You need to check for bimodality.
import pandas as pd # pandas: data library — Series operations and value binning via pd.cut()
import numpy as np # numpy: numerical library — standard import; underpins the pandas operations below
# Use a larger, more realistic age distribution to demonstrate bimodality clearly
ages = pd.Series([28, 31, 34, 29, 33, 27, 35, 30, 32, 36,  # younger cohort: 27–36
                  58, 63, 71, 65, 68, 72, 59, 66, 74, 61,  # older cohort: 58–74
                  44, 52, 48])                              # a few mid-range patients
print(f"Mean: {ages.mean():.1f}")
print(f"Median: {ages.median():.1f}")
print(f"Std dev: {ages.std():.1f}")
print(f"Skewness: {ages.skew():.3f}")
print()
# Bin the ages to reveal the shape — pd.cut() divides values into discrete intervals
# bins defines the bucket edges; labels gives each bucket a readable name
bins = [20, 30, 40, 50, 60, 70, 80]
labels = ['20s', '30s', '40s', '50s', '60s', '70s']
age_bins = pd.cut(ages, bins=bins, labels=labels, right=False) # right=False: intervals are [left, right)
# Count how many patients fall in each decade bucket
freq = age_bins.value_counts().sort_index() # .sort_index() restores chronological order after value_counts sorts by count
print("Age distribution by decade:")
print(freq)
print()
# Simple text histogram to visualise shape without matplotlib
print("Text histogram:")
for label, count in freq.items():
    bar = '█' * count  # one block character per patient
    print(f"  {label}: {bar} ({count})")
Mean: 48.5
Median: 48.0
Std dev: 16.8
Skewness: 0.121

Age distribution by decade:
20s    3
30s    7
40s    2
50s    3
60s    5
70s    3
dtype: int64

Text histogram:
  20s: ███ (3)
  30s: ███████ (7)
  40s: ██ (2)
  50s: ███ (3)
  60s: █████ (5)
  70s: ███ (3)
What just happened?
pandas provides pd.cut() — it bins continuous values into discrete intervals you define with the bins parameter. The right=False argument makes intervals left-inclusive: age 30 goes into the '30s' bucket, not the '20s' bucket. .value_counts().sort_index() then counts entries per bucket and restores the natural order (by decade rather than by frequency).
numpy is imported as standard — used here for the numerical infrastructure under the pandas operations.
The mean (48.5) and skewness (0.121) suggest a nearly normal, symmetric distribution. But the text histogram disproves this: ten patients cluster in their 20s and 30s, there's a valley in the 40s, and a second peak rises through the 60s and 70s. This is textbook bimodality — two distinct patient populations merged into one dataset. Mean and skewness are both blind to it. This is exactly why you bin and visualise rather than rely on summary statistics alone.
The Univariate Report — Putting It All Together
The scenario: End of day. Your lead asks for the univariate findings as a written summary — what does each column look like, what flags did you raise, what recommendations do you have? This is the Phase 3 output from the EDA workflow: a column-by-column narrative, not just a table of numbers.
import pandas as pd # pandas: data library — all column analysis methods used in the report function
import numpy as np # numpy: numerical library — underpins all pandas statistical computation
def univariate_report(dataframe, numeric_cols, cat_cols):
    """Prints a plain-English univariate summary for each column."""
    print("=" * 55)
    print("UNIVARIATE ANALYSIS REPORT")
    print("=" * 55)
    for col in numeric_cols:
        s = dataframe[col].dropna()
        skew = s.skew()
        miss_pct = dataframe[col].isnull().mean() * 100
        q1, q3 = s.quantile(0.25), s.quantile(0.75)
        iqr = q3 - q1
        n_out = ((s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)).sum()
        # Build a plain-English skew flag
        if abs(skew) < 0.5:
            skew_flag = "symmetric"
        elif abs(skew) < 1.0:
            skew_flag = "moderately skewed"
        else:
            skew_flag = "HIGHLY SKEWED — consider log transform"
        print(f"\n[NUMERIC] {col}")
        print(f"  Range: {s.min():.1f} – {s.max():.1f} | Mean: {s.mean():.1f} | Median: {s.median():.1f}")
        print(f"  Shape: {skew_flag} (skew={skew:.2f})")
        print(f"  Missing: {miss_pct:.1f}% | Outliers (IQR rule): {n_out}")
    for col in cat_cols:
        n_unique = dataframe[col].nunique()
        top = dataframe[col].value_counts()
        top_label = top.index[0]
        top_pct = top.iloc[0] / len(dataframe) * 100
        rare = top[top < 2].index.tolist()
        print(f"\n[CATEGORICAL] {col}")
        print(f"  Cardinality: {n_unique} unique values")
        print(f"  Dominant: '{top_label}' ({top_pct:.0f}% of rows)")
        print(f"  Rare categories (<2 rows): {rare if rare else 'None'}")
        print(f"  Missing: {dataframe[col].isnull().sum()}")
    print("\n" + "=" * 55)
# Run the report on the patient dataset (df from earlier blocks)
univariate_report(
    df,
    numeric_cols=['age', 'bmi', 'systolic_bp', 'num_admissions', 'los_days'],
    cat_cols=['admission_type', 'primary_diagnosis', 'ward']
)
=======================================================
UNIVARIATE ANALYSIS REPORT
=======================================================

[NUMERIC] age
  Range: 29.0 – 83.0 | Mean: 54.6 | Median: 53.5
  Shape: symmetric (skew=0.20)
  Missing: 0.0% | Outliers (IQR rule): 0

[NUMERIC] bmi
  Range: 21.3 – 35.2 | Mean: 28.0 | Median: 28.2
  Shape: symmetric (skew=0.01)
  Missing: 16.7% | Outliers (IQR rule): 0

[NUMERIC] systolic_bp
  Range: 115.0 – 172.0 | Mean: 135.8 | Median: 135.0
  Shape: moderately skewed (skew=0.79)
  Missing: 0.0% | Outliers (IQR rule): 0

[NUMERIC] num_admissions
  Range: 1.0 – 8.0 | Mean: 3.1 | Median: 2.5
  Shape: HIGHLY SKEWED — consider log transform (skew=1.05)
  Missing: 0.0% | Outliers (IQR rule): 0

[NUMERIC] los_days
  Range: 2.0 – 15.0 | Mean: 6.2 | Median: 5.5
  Shape: HIGHLY SKEWED — consider log transform (skew=1.12)
  Missing: 0.0% | Outliers (IQR rule): 1

[CATEGORICAL] admission_type
  Cardinality: 3 unique values
  Dominant: 'Emergency' (58% of rows)
  Rare categories (<2 rows): None
  Missing: 0

[CATEGORICAL] primary_diagnosis
  Cardinality: 4 unique values
  Dominant: 'Cardiac' (42% of rows)
  Rare categories (<2 rows): None
  Missing: 0

[CATEGORICAL] ward
  Cardinality: 3 unique values
  Dominant: 'Ward A' (42% of rows)
  Rare categories (<2 rows): None
  Missing: 0

=======================================================
What just happened?
pandas is the engine throughout this report function. Every method — .dropna(), .skew(), .quantile(), .isnull().mean(), .nunique(), .value_counts() — is a pandas Series method. The function loops through your specified column lists and builds human-readable output rather than a raw table. This is exactly the kind of artefact you'd paste into a Jupyter notebook markdown cell or a project Confluence page.
Three action items jump out of this report: BMI is missing 16.7% — it needs an imputation strategy. num_admissions (skew 1.05) and length of stay (skew 1.12) are highly skewed — log transform before modelling. Emergency admissions dominate at 58% — class imbalance must be addressed if predicting admission type. A report that generates action items, not just statistics — that's what univariate analysis is for.
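The log-transform recommendation can be sanity-checked directly. np.log1p computes log(1 + x), which is safe at zero, and it usually pulls right-skewed positive data toward symmetry. A sketch on the los_days values from the scenario:

```python
import pandas as pd
import numpy as np

los = pd.Series([3, 7, 2, 12, 4, 6, 9, 2, 15, 5, 3, 6])

print(f"skew before: {los.skew():.2f}")
log_los = np.log1p(los)                  # log(1 + x), handles zeros gracefully
print(f"skew after:  {log_los.skew():.2f}")
```

The transform compresses the long right tail (the 15-day stay) far more than the bulk of the values, so the skew drops well below the 0.5 "symmetric" threshold used in the report function.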
Teacher's Note
The single most common mistake in univariate analysis is stopping at .describe(). It gives you mean, std, min, max and quartiles — but no skewness, no kurtosis, no outlier count, no missing rate, no shape flag. .describe() is a starting point, not the finish line.
The second mistake is treating numeric and categorical columns the same way. Running .describe() on a categorical column gives you count, unique, top, and freq — a completely different set of numbers that answer completely different questions. Know which type of column you're looking at before you run a single method.
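The difference is easy to see side by side, using a small toy frame (illustrative, not the patient dataset):

```python
import pandas as pd

toy = pd.DataFrame({'age': [34, 45, 29, 61], 'ward': ['A', 'B', 'A', 'C']})

print(toy['age'].describe())   # count, mean, std, min, quartiles, max
print(toy['ward'].describe())  # count, unique, top, freq — a different report entirely
```

Same method, two completely different outputs: the numeric column gets distribution statistics, the object column gets cardinality and dominance.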
Practice Questions
1. Which pandas method returns the number of distinct values in a categorical column — a measure known as cardinality?
2. A distribution has two distinct peaks separated by a valley. What is this shape called?
3. Which pandas function divides a continuous numeric column into discrete labelled intervals — useful for revealing distribution shape?
Quiz
1. A numeric column has near-zero skewness and a mean close to the median, but a text histogram reveals two clear peaks. What does this most likely indicate?
2. A categorical column has 8 categories but 6 of them each appear in fewer than 1% of rows. What is the recommended approach before one-hot encoding?
3. Which measure of spread is most resistant to the influence of extreme outliers?
Up Next · Lesson 17
Bivariate Analysis
Move beyond single columns — explore how two variables interact, whether they correlate, and what relationships your model will need to learn.