EDA Course
EDA Workflow
You've spent thirteen lessons building individual tools. This lesson is where they click into a system. A repeatable, step-by-step EDA workflow means you never stare at a fresh dataset wondering where to start — you just run the playbook.
Why Workflow Matters More Than Individual Skills
Most analysts learn EDA techniques in isolation. They know how to check for nulls. They know how to plot a histogram. But ask them to hand over their EDA process as a document another analyst could follow — and there's nothing. No sequence, no standard, no way to reproduce it.
A documented workflow does three things: it makes your EDA reproducible (someone else can run it on a new dataset), defensible (you can explain every decision you made), and fast (you stop reinventing the wheel on every project).
The 6-Phase EDA Workflow
Phase 1 — Ingest & Inspect
Load the data, check its shape, column names, dtypes, and the first and last few rows. The goal is a mental model of what you're working with before you touch anything.
Phase 2 — Data Quality Audit
Find and quantify missing values, duplicates, and inconsistent data types. This phase produces a written report of problems — you don't fix them yet, you document them.
Phase 3 — Univariate Analysis
Analyse each column in isolation. Distribution shape, central tendency, spread, skewness, outliers. One column at a time — no relationships yet.
Phase 4 — Bivariate & Multivariate Analysis
Explore relationships between columns. Correlations, group comparisons, scatter patterns. This is where hypotheses start forming.
Phase 5 — Data Cleaning & Transformation
Fix the problems documented in Phase 2. Handle missing values, remove or cap outliers, transform skewed columns, encode categoricals, scale features.
Phase 6 — Document & Summarise
Write up what you found, what decisions you made, and what the data looks like now vs when you started. Every finding that matters should survive in a shareable artefact.
These six phases are not always perfectly linear. You'll often loop back — a surprising correlation in Phase 4 might send you back to Phase 2 to check for data quality issues you missed. That's normal. The workflow is a guide, not a cage.
Phase 1 — Ingest & Inspect
The scenario: You've just joined a food delivery startup as their first dedicated data analyst. On day one, your manager drops a CSV of the last three months of order data on your desk and says "just tell me what's in there before end of day." No context, no data dictionary, no previous analysis. This is Phase 1 — your first ten minutes with a new dataset.
import pandas as pd # pandas: Python's core data table library — the foundation of every EDA workflow
import numpy as np # numpy: numerical library — standard import alongside pandas for any analysis script
# Simulate three months of food delivery order data — 12 rows, realistic field names
df = pd.DataFrame({
'order_id': [1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009, 1010, 1011, 1012],
'customer_id': [201, 202, 201, 203, 204, 202, 205, 203, 201, 206, 204, 205],
'restaurant_name': ['Burger Barn','Sushi Spot','Burger Barn','Pizza Palace','Curry Corner',
'Sushi Spot','Burger Barn','Pizza Palace','Curry Corner','Burger Barn',
'Pizza Palace','Curry Corner'],
'order_value': [18.50, 34.20, 22.10, 15.80, 28.40, 41.00, 19.90, np.nan, 31.20, 17.60, 24.80, 29.50],
'delivery_mins': [28, 42, 31, 19, 55, 38, 27, 33, np.nan, 24, 41, 36],
'rating': [4, 5, 4, 3, 5, 4, np.nan, 3, 5, 4, 4, 5],
'order_date': pd.to_datetime(['2024-01-05','2024-01-07','2024-01-12','2024-01-15',
'2024-02-01','2024-02-09','2024-02-14','2024-02-20',
'2024-03-03','2024-03-11','2024-03-18','2024-03-25'])
})
# --- STEP 1: Dimensions ---
print(f"Shape: {df.shape[0]} rows × {df.shape[1]} columns") # always the very first thing you check
print()
# --- STEP 2: Column names and data types ---
print("Column dtypes:")
print(df.dtypes) # dtypes tells you if dates loaded as strings, numbers loaded as objects, etc.
print()
# --- STEP 3: First 5 rows — sanity check the data looks right ---
print("First 5 rows:")
print(df.head()) # .head() default is 5 rows — enough to spot obvious formatting issues
Shape: 12 rows × 7 columns

Column dtypes:
order_id                    int64
customer_id                 int64
restaurant_name            object
order_value               float64
delivery_mins             float64
rating                    float64
order_date         datetime64[ns]
dtype: object

First 5 rows:
   order_id  customer_id restaurant_name  order_value  delivery_mins  rating order_date
0      1001          201     Burger Barn         18.5           28.0     4.0 2024-01-05
1      1002          202      Sushi Spot         34.2           42.0     5.0 2024-01-07
2      1003          201     Burger Barn         22.1           31.0     4.0 2024-01-12
3      1004          203    Pizza Palace         15.8           19.0     3.0 2024-01-15
4      1005          204    Curry Corner         28.4           55.0     5.0 2024-02-01
What just happened?
pandas is Python's core data table library. Three of its most essential inspection methods appear here: .shape returns a tuple of (rows, columns) — the very first number you need; .dtypes returns the data type of every column as a Series — critical for catching dates stored as strings or numbers stored as text; and .head() returns the first N rows (default 5) so you can eyeball the actual values before running any statistics.
Already we can see delivery_mins and rating loaded as float64 instead of int — that's because they contain NaN values, and pandas can't store integers with NaN, so it promotes the column to float. That's a data quality signal. We'll quantify it in Phase 2.
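The promotion is easy to see on a toy Series — and pandas' nullable Int64 extension dtype (note the capital I) is the standard workaround when you need to keep integer semantics alongside missing values:

```python
import pandas as pd
import numpy as np

# A Series with a missing value: pandas promotes int to float64,
# because the default int64 dtype has no way to represent NaN
s_float = pd.Series([4, 5, np.nan])
print(s_float.dtype)  # float64

# The nullable Int64 extension dtype keeps integers and stores
# the missing value as pd.NA instead of NaN
s_int = pd.Series([4, 5, np.nan], dtype='Int64')
print(s_int.dtype)  # Int64
```

Whether to switch to Int64 is a judgment call — the default float promotion is harmless for analysis, but the nullable dtype matters if downstream code expects whole numbers.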
Phase 2 — Data Quality Audit
The scenario: You've got a mental model of the dataset. Now you need a written audit — exactly how many values are missing in each column, are there any duplicate orders, and does every column contain what it claims to? Your manager needs this as a bullet-point summary before you touch a single value.
import pandas as pd # pandas: data table library — .isnull(), .duplicated(), and .describe() all live here
import numpy as np # numpy: numerical library — used here for np.nan in the dataset
# (df created in Phase 1 — same dataset, continuing the workflow)
# --- MISSING VALUES ---
# .isnull().sum() counts NaN in each column; divide by len(df) for the percentage rate
missing_count = df.isnull().sum() # absolute count per column
missing_pct = (df.isnull().sum() / len(df) * 100).round(1) # percentage, 1 decimal place
missing_report = pd.DataFrame({
'missing_count': missing_count,
'missing_pct': missing_pct
})
missing_report = missing_report[missing_report['missing_count'] > 0] # filter to only columns with issues
print("=== MISSING VALUE REPORT ===")
print(missing_report)
print()
# --- DUPLICATES ---
# .duplicated() returns a boolean Series — True for every row that is a duplicate of an earlier row
n_dupes = df.duplicated().sum() # count of duplicate rows
print(f"Duplicate rows: {n_dupes}")
print()
# --- STATISTICAL SUMMARY ---
# .describe() gives count, mean, std, min, quartiles, max for all numeric columns in one shot
print("=== NUMERIC SUMMARY ===")
print(df.describe().round(2))
=== MISSING VALUE REPORT ===
missing_count missing_pct
order_value 1 8.3
delivery_mins 1 8.3
rating 1 8.3
Duplicate rows: 0
=== NUMERIC SUMMARY ===
       order_id  customer_id  order_value  delivery_mins  rating
count     12.00        12.00        11.00          11.00   11.00
mean    1006.50       203.08        25.73          34.00    4.18
std        3.61         1.73         7.88           9.97    0.75
min     1001.00       201.00        15.80          19.00    3.00
25%     1003.75       201.75        19.20          27.50    4.00
50%     1006.50       203.00        24.80          33.00    4.00
75%     1009.25       204.25        30.35          39.50    5.00
max     1012.00       206.00        41.00          55.00    5.00
What just happened?
pandas is doing all the heavy lifting in this phase. .isnull().sum() chains two methods: .isnull() converts every cell to True/False (True = missing), then .sum() adds up the Trues per column — since True counts as 1. Dividing by len(df) converts to a percentage. We then construct a new DataFrame from the results to make it readable at a glance.
.duplicated().sum() scans every row for exact matches to a previous row — zero here, which is good. On real datasets with millions of rows, duplicate detection is often one of the biggest cleaning wins.
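Exact-row matching is the strictest form of duplicate detection. In practice you often want the `subset=` variant — the toy rows and the customer/date subset choice below are illustrative, not from the delivery dataset above:

```python
import pandas as pd

# Two rows that share a customer and an order date but differ in
# order_id may be a double-submitted order worth investigating
df = pd.DataFrame({
    'order_id':    [1001, 1002, 1003],
    'customer_id': [201, 202, 201],
    'order_date':  ['2024-01-05', '2024-01-07', '2024-01-05'],
})

# subset= restricts the comparison to the named columns;
# keep=False flags every member of a duplicate group, not just later ones
suspicious = df[df.duplicated(subset=['customer_id', 'order_date'], keep=False)]
print(suspicious)
```

Here rows with order_id 1001 and 1003 are flagged: same customer, same day, different order IDs.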
.describe() shows count dropping to 11 for the three columns with missing values — that's the quickest way to spot missing data in a summary. The mean delivery time is 34 minutes and average order value is £25.73 — the kind of headline numbers your manager will ask for.
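.describe() defaults to numeric columns, but it has a categorical mode too — a quick sketch on a cut-down frame:

```python
import pandas as pd

df = pd.DataFrame({
    'restaurant_name': ['Burger Barn', 'Sushi Spot', 'Burger Barn', 'Pizza Palace'],
    'order_value': [18.5, 34.2, 22.1, 15.8],
})

# include='object' summarises text columns instead: count, number of
# unique values, the most frequent value (top), and its frequency (freq)
cat_summary = df.describe(include='object')
print(cat_summary)
```

For an audit, `unique` and `freq` are the interesting rows — a column with far more unique values than expected often signals inconsistent spellings.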
Phase 3 — Univariate Analysis
The scenario: Audit complete. Now you analyse each column on its own terms. Your manager wants to know: what does the distribution of order values look like, are delivery times skewed, and which restaurant is getting the most orders? One column at a time — no relationships yet.
import pandas as pd # pandas: data library — .skew(), .value_counts(), .median(), and column selection
import numpy as np # numpy: numerical library — underpins all pandas column calculations
# --- NUMERIC COLUMNS: shape statistics ---
numeric_cols = ['order_value', 'delivery_mins', 'rating'] # columns to analyse individually
for col in numeric_cols: # loop — same analysis repeated per column
print(f"--- {col} ---")
print(f" Mean: {df[col].mean():.2f}") # central tendency — mean
print(f" Median: {df[col].median():.2f}") # central tendency — median (robust to outliers)
print(f" Std dev: {df[col].std():.2f}") # spread — standard deviation
print(f" Skewness: {df[col].skew():.3f}") # shape — from Lesson 12
print(f" Min: {df[col].min():.1f} Max: {df[col].max():.1f}") # range
print()
# --- CATEGORICAL COLUMN: frequency counts ---
print("--- restaurant_name (frequency) ---")
print(df['restaurant_name'].value_counts()) # .value_counts() counts unique entries, sorted by frequency
--- order_value ---
  Mean:     25.73
  Median:   24.80
  Std dev:  7.88
  Skewness: 0.566
  Min: 15.8   Max: 41.0

--- delivery_mins ---
  Mean:     34.00
  Median:   33.00
  Std dev:  9.97
  Skewness: 0.645
  Min: 19.0   Max: 55.0

--- rating ---
  Mean:     4.18
  Median:   4.00
  Std dev:  0.75
  Skewness: -0.329
  Min: 3.0   Max: 5.0

--- restaurant_name (frequency) ---
restaurant_name
Burger Barn     4
Pizza Palace    3
Curry Corner    3
Sushi Spot      2
Name: count, dtype: int64
What just happened?
pandas is providing every method here: .mean(), .median(), .std(), .skew(), .min(), .max() — all called directly on a Series (a single column). The loop pattern keeps the code clean: same analysis runs on every numeric column without repeating code.
.value_counts() is pandas' go-to method for categorical columns — it returns a Series sorted by frequency, showing you at a glance which categories dominate. Burger Barn leads with 4 orders out of 12.
All three numeric columns show only mild skewness (|skew| < 1) — no transformation needed. Mean and median sit close together for order_value and delivery_mins, consistent with roughly symmetric distributions. Rating has a mild negative skew (-0.329): more high ratings than low, which is typical for consumer apps.
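A related trick worth knowing: value_counts(normalize=True) returns proportions instead of raw counts — the same ranking, but immediately readable as market share. A standalone sketch recreating the restaurant column:

```python
import pandas as pd

restaurants = pd.Series(
    ['Burger Barn', 'Sushi Spot', 'Burger Barn', 'Pizza Palace', 'Curry Corner',
     'Sushi Spot', 'Burger Barn', 'Pizza Palace', 'Curry Corner', 'Burger Barn',
     'Pizza Palace', 'Curry Corner'],
    name='restaurant_name',
)

# normalize=True divides each count by the total, giving the share
# of all orders that each restaurant accounts for
share = restaurants.value_counts(normalize=True).round(3)
print(share)
```

Burger Barn's 4 orders out of 12 become 0.333 — a third of all orders, which is often the more useful number in a report.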
Univariate Snapshot — Visual Summary
Here's what the restaurant order distribution and the order value spread look like side by side. Quick visual confirms the numbers.
[Charts: "Orders by Restaurant" (bar chart) and "Order Value Distribution" (histogram, mean £25.73)]
Phase 4 — Bivariate Analysis
The scenario: Your manager comes back with a follow-up question: "Does the restaurant affect how long delivery takes, and do higher-value orders get better ratings?" These are relationship questions — two variables at once. Phase 4 starts here.
import pandas as pd # pandas: data library — .groupby(), .corr(), and aggregation methods all live here
import numpy as np # numpy: numerical library — present as standard; underpins pandas operations
# --- GROUPED ANALYSIS: average delivery time per restaurant ---
# .groupby() splits the DataFrame by restaurant, then .agg() computes stats per group
delivery_by_restaurant = (
df.groupby('restaurant_name')['delivery_mins']
.agg(mean_mins='mean', count='count') # named aggregations: mean and count in one step
.round(1)
.sort_values('mean_mins', ascending=False) # sort so slowest is at the top
)
print("=== DELIVERY TIME BY RESTAURANT ===")
print(delivery_by_restaurant)
print()
# --- CORRELATION: order value vs rating ---
# .corr() computes Pearson correlation between two numeric Series
# Returns a value from -1 (perfect negative) to +1 (perfect positive); 0 = no linear relationship
corr_value = df['order_value'].corr(df['rating'])
print(f"Correlation — order value vs rating: {corr_value:.3f}")
print()
# --- GROUPED ANALYSIS: average order value per restaurant ---
value_by_restaurant = (
df.groupby('restaurant_name')['order_value']
.mean()
.round(2)
.sort_values(ascending=False)
)
print("=== AVG ORDER VALUE BY RESTAURANT ===")
print(value_by_restaurant)
=== DELIVERY TIME BY RESTAURANT ===
                 mean_mins  count
restaurant_name
Curry Corner          45.5      2
Sushi Spot            40.0      2
Pizza Palace          31.0      3
Burger Barn           27.5      4

Correlation — order value vs rating: 0.584

=== AVG ORDER VALUE BY RESTAURANT ===
restaurant_name
Sushi Spot      37.60
Curry Corner    29.70
Pizza Palace    20.30
Burger Barn     19.52
Name: order_value, dtype: float64
What just happened?
pandas is the engine throughout Phase 4. .groupby() is one of pandas' most powerful methods — it splits the DataFrame into groups by a categorical column, applies a function to each group, and combines the results. The named aggregation syntax (mean_mins='mean') gives you clean column names in one step instead of renaming afterwards.
.corr() computes the Pearson correlation coefficient between two Series. It automatically ignores rows where either value is NaN — no extra handling needed.
The findings are genuinely interesting: Curry Corner and Sushi Spot average well over ten minutes slower than Burger Barn. But Sushi Spot also charges the most (avg £37.60). The correlation between order value and rating is 0.584 — a moderate positive relationship, though with only ten complete order–rating pairs this is a hypothesis to check on more data, not a conclusion. These are the insights your manager actually wants.
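When you have several numeric columns, DataFrame.corr() extends pairwise Series.corr() to the full matrix in one call. A sketch using the same three numeric columns:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'order_value':   [18.5, 34.2, 22.1, 15.8, 28.4, 41.0, 19.9, np.nan, 31.2, 17.6, 24.8, 29.5],
    'delivery_mins': [28, 42, 31, 19, 55, 38, 27, 33, np.nan, 24, 41, 36],
    'rating':        [4, 5, 4, 3, 5, 4, np.nan, 3, 5, 4, 4, 5],
})

# .corr() builds the full pairwise Pearson matrix, dropping NaN rows
# pair-by-pair exactly as Series.corr() does for a single pair
corr_matrix = df.corr().round(3)
print(corr_matrix)
```

The diagonal is always 1.0 (each column correlates perfectly with itself), and the matrix is symmetric — so in practice you only read one triangle of it.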
Phases 5 & 6 — Clean, Transform, Document
Phase 5 applies everything from Lessons 6–13: handle the missing values you found in Phase 2, transform any skewed columns, encode categoricals. Phase 6 writes it all down. Here's a compact version of both applied to the delivery dataset.
import pandas as pd # pandas: core data library — fillna(), get_dummies(), and all column ops
import numpy as np # numpy: numerical library — np.nan for dataset construction
df_clean = df.copy() # always work on a copy — never mutate the original DataFrame in-place
# --- PHASE 5: CLEAN & TRANSFORM ---
# Missing values: fill numeric columns with median (robust to outliers, safe for small samples)
# We use median rather than mean because with only 12 rows, one extreme value could skew the fill
for col in ['order_value', 'delivery_mins', 'rating']:
median_val = df_clean[col].median() # compute median ignoring NaN
df_clean[col] = df_clean[col].fillna(median_val) # .fillna() replaces NaN with the given value
print(f"Filled {col} NaN with median: {median_val:.2f}")
print()
# Encode restaurant_name with one-hot encoding (nominal — no inherent order between restaurants)
df_clean = pd.get_dummies(df_clean, columns=['restaurant_name'], prefix='rest', drop_first=True) # drop_first avoids multicollinearity
# --- PHASE 6: DOCUMENT ---
print("=== EDA SUMMARY REPORT ===")
print(f"Rows: {len(df_clean)}")
print(f"Columns (final): {df_clean.shape[1]}")
print(f"Missing values: 0 (filled with column medians)")
print(f"Duplicates: 0")
print(f"Skewed columns: None (all |skew| < 1)")
print(f"Encoding: restaurant_name one-hot encoded (3 new columns, original dropped)")
print(f"Key finding: Curry Corner slowest delivery (45.5 min avg); Sushi Spot highest value (£37.60 avg)")
print()
print("Final column list:")
print(list(df_clean.columns))
Filled order_value NaN with median: 24.80
Filled delivery_mins NaN with median: 33.00
Filled rating NaN with median: 4.00

=== EDA SUMMARY REPORT ===
Rows: 12
Columns (final): 9
Missing values: 0 (filled with column medians)
Duplicates: 0
Skewed columns: None (all |skew| < 1)
Encoding: restaurant_name one-hot encoded (3 new columns, original dropped)
Key finding: Curry Corner slowest delivery (45.5 min avg); Sushi Spot highest value (£37.60 avg)

Final column list:
['order_id', 'customer_id', 'order_value', 'delivery_mins', 'rating', 'order_date', 'rest_Curry Corner', 'rest_Pizza Palace', 'rest_Sushi Spot']
What just happened?
pandas provides both cleaning tools used here. .fillna() replaces NaN values with a scalar — we pass in the column's median, computed with .median(). pd.get_dummies() with columns= specified encodes the named column and drops the original in the DataFrame it returns — cleaner than the separate pd.concat() approach when you want to transform and replace in one step.
The summary report at the end is Phase 6 in miniature. In a real project this would be a Markdown cell or a written section of your Jupyter notebook. The key discipline: every decision gets one sentence of justification. "Filled with median because..." not just "filled."
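One lightweight way to enforce that discipline is a running decision log — append an entry the moment you make each choice, then dump it at write-up time. The structure below is an illustration of the pattern, not a standard:

```python
import pandas as pd

# Append one entry per cleaning decision as you make it — not at the end
decisions = []
decisions.append({'phase': 5, 'action': 'fillna(median) on order_value',
                  'why': 'only 8.3% missing; median is robust on a small sample'})
decisions.append({'phase': 5, 'action': 'one-hot encode restaurant_name',
                  'why': 'nominal categorical; drop_first avoids multicollinearity'})

# At write-up time the log is already a table — ready for the report
log = pd.DataFrame(decisions)
print(log.to_string(index=False))
```

Because each entry pairs an action with a one-line justification, the Phase 6 write-up becomes a formatting exercise instead of an act of memory.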
Teacher's Note
The six-phase workflow isn't gospel — experienced analysts adapt it based on dataset size, business context, and time pressure. But when you're new to a dataset or new to the job, having a default sequence to fall back on is genuinely valuable. It stops you spinning your wheels and gives your manager something predictable to expect from you.
One thing that never changes across projects: document as you go, not at the end. Every decision you delay writing down is a decision you'll misremember or forget to include. The five-minute discipline of writing one line of justification per choice saves hours of confusion in code review.
Practice Questions
1. In the 6-phase EDA workflow, which phase number involves finding and quantifying missing values and duplicates — without fixing them yet?
2. Which pandas method returns the frequency of each unique value in a categorical column, sorted by count?
3. Which pandas method splits a DataFrame into groups by a categorical column so you can compute statistics per group?
Quiz
1. You fit a scaler on your entire dataset before splitting into train and test sets. What is the problem with this approach?
2. What is the correct order of analysis in phases 3 and 4 of the EDA workflow?
3. Which result indicates effectively no linear relationship between two variables?
Up Next · Lesson 15
Data Quality Issues
A deep dive into the full spectrum of real-world data problems — inconsistent formats, wrong types, impossible values, and how to systematically hunt them down.