EDA Course
Univariate Analysis
You've completed the Beginner section. Univariate analysis is where Intermediate begins — and it's deeper than it sounds. Analysing one variable at a time isn't a warm-up exercise. It's how you build an accurate mental model of each column before you start comparing them.
What Univariate Analysis Actually Is
Univariate means one variable. The goal is to fully understand the distribution, shape, and character of each column in isolation — before you introduce any relationships or comparisons. You're building a portrait of the column: where are the values concentrated, how spread out are they, are there unusual values, what does the shape tell you?
The analysis looks different depending on whether the column is numeric or categorical. Numeric columns have distributions, spread, and shape. Categorical columns have frequency, cardinality, and dominance. Each needs its own toolkit.
Numeric Column Checklist
- Central tendency — mean, median, mode
- Spread — std dev, IQR, range
- Shape — skewness, kurtosis
- Outliers — IQR rule, z-score
- Missing value rate
- Distribution type (normal? uniform? bimodal?)
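The whole numeric checklist maps onto a handful of pandas Series methods. A minimal sketch, using a small made-up Series rather than the course dataset:

```python
import pandas as pd

# Hypothetical numeric column, used only to walk the checklist
s = pd.Series([2, 3, 3, 4, 5, 5, 6, 7, 9, 40])

mean, median = s.mean(), s.median()                  # central tendency
q1, q3 = s.quantile(0.25), s.quantile(0.75)
std, iqr = s.std(), q3 - q1                          # spread
skew, kurt = s.skew(), s.kurt()                      # shape
outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]  # IQR rule
missing_rate = s.isnull().mean()                     # missing value rate

print(f"mean={mean}, median={median}, iqr={iqr}, outliers={len(outliers)}")
```

The extreme value 40 drags the mean (8.4) far above the median (5.0), and the IQR rule flags it as the single outlier, which is the checklist working as intended.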
Categorical Column Checklist
- Cardinality — how many unique values?
- Frequency — which value dominates?
- Rare categories — values with very few counts
- Missing value rate
- Ordinal or nominal? (is there an order?)
- Encoding strategy needed?
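The categorical checklist is just as mechanical. A sketch on a made-up column (the values here are illustrative, not from the patient dataset):

```python
import pandas as pd

# Hypothetical categorical column with one rare label and one missing value
c = pd.Series(['A', 'A', 'A', 'B', 'B', 'C', None])

cardinality = c.nunique()                 # distinct values, NaN excluded
freq = c.value_counts()                   # frequency table, most common first
dominant = freq.index[0]                  # the dominating category
rare = freq[freq < 2].index.tolist()      # categories seen fewer than twice
missing = c.isnull().sum()                # missing value count

print(cardinality, dominant, rare, missing)
```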
Full Numeric Profile in One Function
The scenario: You're a data analyst at a private healthcare company. The clinical team has handed you a patient dataset to prepare for a readmission risk model. Before anything else, your lead data scientist wants a full statistical profile of every numeric column — not just the defaults from .describe(), but skewness, kurtosis, IQR, outlier count, and missing rate all in one clean table. She wants to be able to read one row per column and understand the entire distribution.
import pandas as pd # pandas: Python's core data table library — all statistical methods and DataFrame ops
import numpy as np # numpy: numerical library — np.nan for missing values, np.percentile for IQR
# Patient dataset — 12 rows, realistic clinical fields
df = pd.DataFrame({
    'patient_id':     range(1001, 1013),
    'age':            [45, 62, 38, 71, 55, 48, 67, 29, 83, 52, 44, 61],
    'bmi':            [24.1, 31.8, 22.4, 28.6, np.nan, 35.2, 27.9, 21.3, 33.1, 26.5, np.nan, 29.4],
    'systolic_bp':    [118, 145, 122, 158, 134, 142, 128, 115, 172, 136, 119, 141],
    'num_admissions': [1, 4, 1, 6, 2, 3, 5, 1, 8, 2, 1, 3],
    'los_days':       [3, 7, 2, 12, 4, 6, 9, 2, 15, 5, 3, 6]  # length of stay in days
})
def numeric_profile(dataframe):
    """Produces a full univariate statistical profile for every numeric column."""
    results = []
    for col in dataframe.select_dtypes(include='number').columns:  # only numeric columns
        s = dataframe[col].dropna()  # drop NaN for clean stats — missing rate is reported separately
        q1 = s.quantile(0.25)  # first quartile
        q3 = s.quantile(0.75)  # third quartile
        iqr = q3 - q1          # interquartile range — the middle 50% spread
        # IQR outlier rule: anything outside [Q1 - 1.5×IQR, Q3 + 1.5×IQR] is a candidate outlier
        lower_fence = q1 - 1.5 * iqr
        upper_fence = q3 + 1.5 * iqr
        n_outliers = ((s < lower_fence) | (s > upper_fence)).sum()  # count rows outside the fences
        results.append({
            'column': col,
            'count': s.count(),  # non-null count
            'missing_pct': round(dataframe[col].isnull().mean() * 100, 1),  # % missing from original column
            'mean': round(s.mean(), 2),
            'median': round(s.median(), 2),
            'std': round(s.std(), 2),
            'iqr': round(iqr, 2),
            'min': s.min(),
            'max': s.max(),
            'skewness': round(s.skew(), 3),  # .skew() — adjusted Fisher-Pearson coefficient
            'kurtosis': round(s.kurt(), 3),  # .kurt() — excess kurtosis (normal = 0)
            'n_outliers': int(n_outliers)
        })
    return pd.DataFrame(results).set_index('column')
profile = numeric_profile(df)
print(profile.to_string())
                count  missing_pct     mean   median    std    iqr     min     max  skewness  kurtosis  n_outliers
column
patient_id         12          0.0  1006.50  1006.50   3.61   5.50  1001.0  1012.0     0.000    -1.200           0
age                12          0.0    54.58    53.50  15.11  18.50    29.0    83.0     0.199    -0.162           0
bmi                10         16.7    28.03    28.25   4.57   6.50    21.3    35.2     0.011    -0.945           0
systolic_bp        12          0.0   135.83   135.00  17.16  21.50   115.0   172.0     0.793     0.276           0
num_admissions     12          0.0     3.08     2.50   2.27   3.25     1.0     8.0     1.045     0.376           0
los_days           12          0.0     6.17     5.50   4.06   4.50     2.0    15.0     1.123     0.669           1
What just happened?
pandas provides every statistical method in the profile. .quantile(0.25) and .quantile(0.75) return Q1 and Q3 — the 25th and 75th percentiles. The IQR (Q3 − Q1) is the width of the middle 50% of values. The fences at Q1 − 1.5×IQR and Q3 + 1.5×IQR mark the boundary beyond which a value is flagged as a potential outlier — the same rule used by box plot whiskers.
numpy is present as the numerical backbone — all pandas operations ultimately use numpy arrays under the hood.
Reading the profile: bmi has 16.7% missing — flag that immediately. num_admissions (skew 1.045) and los_days (skew 1.123) are both strongly right-skewed — a few patients with many admissions or very long stays drag the mean above the median, and los_days has one flagged outlier (the 15-day stay). systolic_bp (skew 0.793) is moderately right-skewed, pulled up by the 172 reading. patient_id has kurtosis −1.2, which is expected — it's a uniform sequence of IDs, not a natural distribution at all.
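The checklist also named the z-score rule as an alternative outlier test. A minimal sketch, recreating los_days as a standalone Series (the ±3 cutoff is a common convention, not a fixed law):

```python
import pandas as pd

# los_days values from the scenario, as a standalone Series
los = pd.Series([3, 7, 2, 12, 4, 6, 9, 2, 15, 5, 3, 6])

# z-score: how many standard deviations each value sits from the mean
z = (los - los.mean()) / los.std()
print(z.round(2).tolist())

# |z| > 3 is the classic cutoff; with only 12 rows nothing reaches it,
# which is one reason the IQR rule is often preferred on small samples
extreme = los[z.abs() > 3]
print(len(extreme))  # 0 here — even the 15-day stay only scores z ≈ 2.2
```

The IQR rule flags the 15-day stay; the z-score rule does not. On small, skewed samples the mean and standard deviation are themselves distorted by the extreme value, which blunts the z-score test.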
Reading a Distribution — The Five Patterns
Every numeric distribution falls into one of five broad shapes. Knowing which one you're looking at tells you how to handle the column downstream.
Normal
Bell curve, symmetric, skew ≈ 0
Right Skewed
Long right tail, skew > 0.5
Left Skewed
Long left tail, skew < −0.5
Uniform
Flat — all values equally likely
Bimodal
Two peaks — may indicate two subgroups
Bimodal is the one that surprises people most. If you see two peaks, you might have two different populations mixed together — e.g. a dataset with both adults and children, or weekday vs weekend behaviour.
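One way to make the five shapes concrete is to simulate each one and see what skewness it produces. A sketch using seeded numpy random draws (simulated data, not the patient dataset):

```python
import pandas as pd
import numpy as np

rng = np.random.default_rng(42)  # seeded so the draws are reproducible

shapes = {
    'normal':  pd.Series(rng.normal(50, 10, 2000)),          # bell curve
    'right':   pd.Series(rng.exponential(10, 2000)),         # long right tail
    'left':    pd.Series(100 - rng.exponential(10, 2000)),   # long left tail
    'uniform': pd.Series(rng.uniform(0, 100, 2000)),         # flat
    'bimodal': pd.Series(np.concatenate([rng.normal(30, 5, 1000),
                                         rng.normal(70, 5, 1000)])),  # two peaks
}
for name, s in shapes.items():
    print(f"{name:8s} skew={s.skew():6.3f}")
```

Notice that the bimodal series scores a near-zero skew, just like the normal one. Skewness alone cannot distinguish one peak from two, which is exactly the trap explored later in this lesson.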
Categorical Univariate Analysis — Frequency and Cardinality
The scenario: The patient dataset also has categorical columns: admission type (emergency, elective, urgent), primary diagnosis (ICD-10 code category), and ward. Your lead wants to know which categories dominate, which are rare enough to be problematic, and whether any columns have so many unique values they'll explode a one-hot encoding. These are cardinality and frequency questions — the categorical equivalent of skewness and spread.
import pandas as pd # pandas: data library — .value_counts(), .nunique(), and categorical analysis
import numpy as np # numpy: numerical library — standard import for EDA scripts
# Extend the patient dataset with categorical columns
df['admission_type'] = ['Emergency', 'Elective', 'Emergency', 'Emergency', 'Urgent',
                        'Elective', 'Emergency', 'Elective', 'Emergency', 'Urgent',
                        'Emergency', 'Elective']
df['primary_diagnosis'] = ['Cardiac', 'Respiratory', 'Orthopaedic', 'Cardiac', 'Diabetes',
                           'Respiratory', 'Cardiac', 'Orthopaedic', 'Cardiac', 'Diabetes',
                           'Respiratory', 'Cardiac']
df['ward'] = ['Ward A', 'Ward B', 'Ward C', 'Ward A', 'Ward B', 'Ward C',
              'Ward A', 'Ward B', 'Ward A', 'Ward C', 'Ward B', 'Ward A']
cat_cols = ['admission_type', 'primary_diagnosis', 'ward'] # columns to profile
for col in cat_cols:
    n_unique = df[col].nunique()                                  # .nunique() — count of distinct values
    top_value = df[col].value_counts().index[0]                   # most frequent category
    top_pct = df[col].value_counts(normalize=True).iloc[0] * 100  # normalize=True gives proportions
    # Rare category threshold: any category with fewer than 2 occurrences
    rare = df[col].value_counts()
    rare_cats = rare[rare < 2].index.tolist()                     # list of category names with count < 2
    print(f"--- {col} ---")
    print(f"  Unique values (cardinality): {n_unique}")
    print(f"  Most frequent: '{top_value}' ({top_pct:.1f}% of rows)")
    print(f"  Rare categories (<2 rows): {rare_cats if rare_cats else 'None'}")
    print(f"  Missing: {df[col].isnull().sum()}")
    print()
    print(df[col].value_counts().to_string())                     # full frequency table
    print()
--- admission_type ---
  Unique values (cardinality): 3
  Most frequent: 'Emergency' (58.3% of rows)
  Rare categories (<2 rows): None
  Missing: 0

admission_type
Emergency    7
Elective     3
Urgent       2

--- primary_diagnosis ---
  Unique values (cardinality): 4
  Most frequent: 'Cardiac' (41.7% of rows)
  Rare categories (<2 rows): None
  Missing: 0

primary_diagnosis
Cardiac        5
Respiratory    3
Orthopaedic    2
Diabetes       2

--- ward ---
  Unique values (cardinality): 3
  Most frequent: 'Ward A' (41.7% of rows)
  Rare categories (<2 rows): None
  Missing: 0

ward
Ward A    5
Ward B    4
Ward C    3
What just happened?
pandas provides the two key categorical analysis methods here. .nunique() counts the number of distinct values in a column — this is cardinality. A column with 500 unique values in 600 rows is probably a free-text field or an ID, not a useful categorical feature. .value_counts(normalize=True) returns proportions (0.0–1.0) instead of raw counts, making it easy to see percentage dominance.
The rare category check matters for modelling: if "Urgent" only appears once in 1,000 rows, one-hot encoding will create a column that's 999 zeros and 1 one. That column carries almost no predictive signal and wastes memory. In practice you'd either merge rare categories into an "Other" bucket or drop them.
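A sketch of that merge, on a hypothetical diagnosis column (the labels and the <2 threshold are illustrative):

```python
import pandas as pd

# Hypothetical column with two rare categories
diag = pd.Series(['Cardiac'] * 6 + ['Respiratory'] * 4 + ['Renal', 'Hepatic'])

counts = diag.value_counts()
rare = counts[counts < 2].index                  # categories appearing fewer than twice
merged = diag.where(~diag.isin(rare), 'Other')   # replace rare labels with 'Other'

print(merged.value_counts())
```

Series.where keeps values where the condition holds and substitutes 'Other' elsewhere, so 'Renal' and 'Hepatic' collapse into a single bucket with two rows, and one-hot encoding now produces three columns instead of four.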
Emergency admissions dominate at 58.3%. Cardiac is the most common diagnosis at 41.7%. Both are class imbalance signals — if you're building a classifier predicting any of these labels, the model will be biased toward the majority class without rebalancing.
Detecting Bimodality — When One Peak Isn't Enough
The scenario: Your lead asks you to look more carefully at the age distribution. Something feels off — the mean and median are close (54.6 vs 53.5) but the standard deviation is high for a clinical dataset. She suspects there may be two distinct patient age groups in this dataset: a younger surgical cohort and an older chronic disease cohort. A single mean masks this entirely. You need to check for bimodality.
import pandas as pd # pandas: data library — Series operations and value binning via pd.cut()
import numpy as np # numpy: numerical library — standard import; underpins the pandas operations below
# Use a larger, more realistic age distribution to demonstrate bimodality clearly
ages = pd.Series([28, 31, 34, 29, 33, 27, 35, 30, 32, 36,  # younger cohort: 27–36
                  58, 63, 71, 65, 68, 72, 59, 66, 74, 61,  # older cohort: 58–74
                  44, 52, 48])                              # a few mid-range patients
print(f"Mean: {ages.mean():.1f}")
print(f"Median: {ages.median():.1f}")
print(f"Std dev: {ages.std():.1f}")
print(f"Skewness: {ages.skew():.3f}")
print()
# Bin the ages to reveal the shape — pd.cut() divides values into discrete intervals
# bins defines the bucket edges; labels gives each bucket a readable name
bins = [20, 30, 40, 50, 60, 70, 80]
labels = ['20s', '30s', '40s', '50s', '60s', '70s']
age_bins = pd.cut(ages, bins=bins, labels=labels, right=False) # right=False: intervals are [left, right)
# Count how many patients fall in each decade bucket
freq = age_bins.value_counts().sort_index() # .sort_index() restores chronological order after value_counts sorts by count
print("Age distribution by decade:")
print(freq)
print()
# Simple text histogram to visualise shape without matplotlib
print("Text histogram:")
for label, count in freq.items():
    bar = '█' * count  # one block character per patient
    print(f"  {label}: {bar} ({count})")
Mean: 48.5
Median: 48.0
Std dev: 16.8
Skewness: 0.121

Age distribution by decade:
20s    3
30s    7
40s    2
50s    3
60s    5
70s    3
dtype: int64

Text histogram:
  20s: ███ (3)
  30s: ███████ (7)
  40s: ██ (2)
  50s: ███ (3)
  60s: █████ (5)
  70s: ███ (3)
What just happened?
pandas provides pd.cut() — it bins continuous values into discrete intervals you define with the bins parameter. The right=False argument makes intervals left-inclusive: age 30 goes into the '30s' bucket, not the '20s' bucket. .value_counts().sort_index() then counts entries per bucket and restores the natural order (by decade rather than by frequency).
numpy is imported as standard — used here for the numerical infrastructure under the pandas operations.
The mean (48.5) and skewness (0.121) suggest a nearly normal, symmetric distribution. But the text histogram disproves this: ten patients cluster in their 20s and 30s, there's a valley in the 40s, and a second peak rises through the 60s and 70s. This is textbook bimodality — two distinct patient populations merged into one dataset. Mean and skewness are both blind to it. This is exactly why you bin and visualise rather than rely on summary statistics alone.
The Univariate Report — Putting It All Together
The scenario: End of day. Your lead asks for the univariate findings as a written summary — what does each column look like, what flags did you raise, what recommendations do you have? This is the Phase 3 output from the EDA workflow: a column-by-column narrative, not just a table of numbers.
import pandas as pd # pandas: data library — all column analysis methods used in the report function
import numpy as np # numpy: numerical library — underpins all pandas statistical computation
def univariate_report(dataframe, numeric_cols, cat_cols):
    """Prints a plain-English univariate summary for each column."""
    print("=" * 55)
    print("UNIVARIATE ANALYSIS REPORT")
    print("=" * 55)
    for col in numeric_cols:
        s = dataframe[col].dropna()
        skew = s.skew()
        miss_pct = dataframe[col].isnull().mean() * 100
        q1, q3 = s.quantile(0.25), s.quantile(0.75)
        iqr = q3 - q1
        n_out = ((s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)).sum()
        # Build a plain-English skew flag
        if abs(skew) < 0.5:
            skew_flag = "symmetric"
        elif abs(skew) < 1.0:
            skew_flag = "moderately skewed"
        else:
            skew_flag = "HIGHLY SKEWED — consider log transform"
        print(f"\n[NUMERIC] {col}")
        print(f"  Range: {s.min():.1f} – {s.max():.1f} | Mean: {s.mean():.1f} | Median: {s.median():.1f}")
        print(f"  Shape: {skew_flag} (skew={skew:.2f})")
        print(f"  Missing: {miss_pct:.1f}% | Outliers (IQR rule): {n_out}")
    for col in cat_cols:
        n_unique = dataframe[col].nunique()
        top = dataframe[col].value_counts()
        top_label = top.index[0]
        top_pct = top.iloc[0] / len(dataframe) * 100
        rare = top[top < 2].index.tolist()
        print(f"\n[CATEGORICAL] {col}")
        print(f"  Cardinality: {n_unique} unique values")
        print(f"  Dominant: '{top_label}' ({top_pct:.0f}% of rows)")
        print(f"  Rare categories (<2 rows): {rare if rare else 'None'}")
        print(f"  Missing: {dataframe[col].isnull().sum()}")
    print("\n" + "=" * 55)
# Run the report on the patient dataset (df from earlier blocks)
univariate_report(
    df,
    numeric_cols=['age', 'bmi', 'systolic_bp', 'num_admissions', 'los_days'],
    cat_cols=['admission_type', 'primary_diagnosis', 'ward']
)
=======================================================
UNIVARIATE ANALYSIS REPORT
=======================================================

[NUMERIC] age
  Range: 29.0 – 83.0 | Mean: 54.6 | Median: 53.5
  Shape: symmetric (skew=0.20)
  Missing: 0.0% | Outliers (IQR rule): 0

[NUMERIC] bmi
  Range: 21.3 – 35.2 | Mean: 28.0 | Median: 28.2
  Shape: symmetric (skew=0.01)
  Missing: 16.7% | Outliers (IQR rule): 0

[NUMERIC] systolic_bp
  Range: 115.0 – 172.0 | Mean: 135.8 | Median: 135.0
  Shape: moderately skewed (skew=0.79)
  Missing: 0.0% | Outliers (IQR rule): 0

[NUMERIC] num_admissions
  Range: 1.0 – 8.0 | Mean: 3.1 | Median: 2.5
  Shape: HIGHLY SKEWED — consider log transform (skew=1.05)
  Missing: 0.0% | Outliers (IQR rule): 0

[NUMERIC] los_days
  Range: 2.0 – 15.0 | Mean: 6.2 | Median: 5.5
  Shape: HIGHLY SKEWED — consider log transform (skew=1.12)
  Missing: 0.0% | Outliers (IQR rule): 1

[CATEGORICAL] admission_type
  Cardinality: 3 unique values
  Dominant: 'Emergency' (58% of rows)
  Rare categories (<2 rows): None
  Missing: 0

[CATEGORICAL] primary_diagnosis
  Cardinality: 4 unique values
  Dominant: 'Cardiac' (42% of rows)
  Rare categories (<2 rows): None
  Missing: 0

[CATEGORICAL] ward
  Cardinality: 3 unique values
  Dominant: 'Ward A' (42% of rows)
  Rare categories (<2 rows): None
  Missing: 0

=======================================================
What just happened?
pandas is the engine throughout this report function. Every method — .dropna(), .skew(), .quantile(), .isnull().mean(), .nunique(), .value_counts() — is a pandas Series method. The function loops through your specified column lists and builds human-readable output rather than a raw table. This is exactly the kind of artefact you'd paste into a Jupyter notebook markdown cell or a project Confluence page.
Three action items jump out of this report: BMI is missing 16.7% — it needs an imputation strategy. num_admissions (skew 1.05) and length of stay (skew 1.12) are highly skewed — log transform before modelling. Emergency admissions dominate at 58% — class imbalance must be addressed if predicting admission type. A report that generates action items, not just statistics — that's what univariate analysis is for.
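The log-transform recommendation can be sanity-checked directly. np.log1p computes log(1 + x), which is safe at zero, and it usually pulls right-skewed positive data toward symmetry. A sketch on the los_days values from the scenario:

```python
import pandas as pd
import numpy as np

los = pd.Series([3, 7, 2, 12, 4, 6, 9, 2, 15, 5, 3, 6])

print(f"skew before: {los.skew():.2f}")
log_los = np.log1p(los)                  # log(1 + x), handles zeros gracefully
print(f"skew after:  {log_los.skew():.2f}")
```

The transform compresses the long right tail (the 15-day stay) far more than the bulk of the values, so the skew drops well below the 0.5 "symmetric" threshold used in the report function.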
Teacher's Note
The single most common mistake in univariate analysis is stopping at .describe(). It gives you mean, std, min, max and quartiles — but no skewness, no kurtosis, no outlier count, no missing rate, no shape flag. .describe() is a starting point, not the finish line.
The second mistake is treating numeric and categorical columns the same way. Running .describe() on a categorical column gives you count, unique, top, and freq — a completely different set of numbers that answer completely different questions. Know which type of column you're looking at before you run a single method.
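The difference is easy to see side by side, using a small toy frame (illustrative, not the patient dataset):

```python
import pandas as pd

toy = pd.DataFrame({'age': [34, 45, 29, 61], 'ward': ['A', 'B', 'A', 'C']})

print(toy['age'].describe())   # count, mean, std, min, quartiles, max
print(toy['ward'].describe())  # count, unique, top, freq — a different report entirely
```

Same method, two completely different outputs: the numeric column gets distribution statistics, the object column gets cardinality and dominance.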
Practice Questions
1. Which pandas method returns the number of distinct values in a categorical column — a measure known as cardinality?
2. A distribution has two distinct peaks separated by a valley. What is this shape called?
3. Which pandas function divides a continuous numeric column into discrete labelled intervals — useful for revealing distribution shape?
Quiz
1. A numeric column has near-zero skewness and a mean close to the median, but a text histogram reveals two clear peaks. What does this most likely indicate?
2. A categorical column has 8 categories but 6 of them each appear in fewer than 1% of rows. What is the recommended approach before one-hot encoding?
3. Which measure of spread is most resistant to the influence of extreme outliers?
Up Next · Lesson 17
Bivariate Analysis
Move beyond single columns — explore how two variables interact, whether they correlate, and what relationships your model will need to learn.