EDA Course
EDA Workflow
You've spent thirteen lessons building individual tools. This lesson is where they click into a system. A repeatable, step-by-step EDA workflow means you never stare at a fresh dataset wondering where to start — you just run the playbook.
Why Workflow Matters More Than Individual Skills
Most analysts learn EDA techniques in isolation. They know how to check for nulls. They know how to plot a histogram. But ask them to hand over their EDA process as a document another analyst could follow — and there's nothing. No sequence, no standard, no way to reproduce it.
A documented workflow does three things: it makes your EDA reproducible (someone else can run it on a new dataset), defensible (you can explain every decision you made), and fast (you stop reinventing the wheel on every project).
The 6-Phase EDA Workflow
Phase 1 — Ingest & Inspect
Load the data, check its shape, column names, dtypes, and the first and last few rows. The goal is a mental model of what you're working with before you touch anything.
Phase 2 — Data Quality Audit
Find and quantify missing values, duplicates, and inconsistent data types. This phase produces a written report of problems — you don't fix them yet, you document them.
Phase 3 — Univariate Analysis
Analyse each column in isolation. Distribution shape, central tendency, spread, skewness, outliers. One column at a time — no relationships yet.
Phase 4 — Bivariate & Multivariate Analysis
Explore relationships between columns. Correlations, group comparisons, scatter patterns. This is where hypotheses start forming.
Phase 5 — Data Cleaning & Transformation
Fix the problems documented in Phase 2. Handle missing values, remove or cap outliers, transform skewed columns, encode categoricals, scale features.
Phase 6 — Document & Summarise
Write up what you found, what decisions you made, and what the data looks like now vs when you started. Every finding that matters should survive in a shareable artefact.
These six phases are not always perfectly linear. You'll often loop back — a surprising correlation in Phase 4 might send you back to Phase 2 to check for data quality issues you missed. That's normal. The workflow is a guide, not a cage.
Phase 1 — Ingest & Inspect
The scenario: You've just joined a food delivery startup as their first dedicated data analyst. On day one, your manager drops a CSV of the last three months of order data on your desk and says "just tell me what's in there before end of day." No context, no data dictionary, no previous analysis. This is Phase 1 — your first ten minutes with a new dataset.
import pandas as pd # pandas: Python's core data table library — the foundation of every EDA workflow
import numpy as np # numpy: numerical library — standard import alongside pandas for any analysis script
# Simulate three months of food delivery order data — 12 rows, realistic field names
df = pd.DataFrame({
'order_id': [1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009, 1010, 1011, 1012],
'customer_id': [201, 202, 201, 203, 204, 202, 205, 203, 201, 206, 204, 205],
'restaurant_name': ['Burger Barn','Sushi Spot','Burger Barn','Pizza Palace','Curry Corner',
'Sushi Spot','Burger Barn','Pizza Palace','Curry Corner','Burger Barn',
'Pizza Palace','Curry Corner'],
'order_value': [18.50, 34.20, 22.10, 15.80, 28.40, 41.00, 19.90, np.nan, 31.20, 17.60, 24.80, 29.50],
'delivery_mins': [28, 42, 31, 19, 55, 38, 27, 33, np.nan, 24, 41, 36],
'rating': [4, 5, 4, 3, 5, 4, np.nan, 3, 5, 4, 4, 5],
'order_date': pd.to_datetime(['2024-01-05','2024-01-07','2024-01-12','2024-01-15',
'2024-02-01','2024-02-09','2024-02-14','2024-02-20',
'2024-03-03','2024-03-11','2024-03-18','2024-03-25'])
})
# --- STEP 1: Dimensions ---
print(f"Shape: {df.shape[0]} rows × {df.shape[1]} columns") # always the very first thing you check
print()
# --- STEP 2: Column names and data types ---
print("Column dtypes:")
print(df.dtypes) # dtypes tells you if dates loaded as strings, numbers loaded as objects, etc.
print()
# --- STEP 3: First 5 rows — sanity check the data looks right ---
print("First 5 rows:")
print(df.head()) # .head() default is 5 rows — enough to spot obvious formatting issues
Shape: 12 rows × 7 columns

Column dtypes:
order_id                    int64
customer_id                 int64
restaurant_name            object
order_value               float64
delivery_mins             float64
rating                    float64
order_date         datetime64[ns]
dtype: object

First 5 rows:
   order_id  customer_id restaurant_name  order_value  delivery_mins  rating order_date
0      1001          201     Burger Barn         18.5           28.0     4.0 2024-01-05
1      1002          202      Sushi Spot         34.2           42.0     5.0 2024-01-07
2      1003          201     Burger Barn         22.1           31.0     4.0 2024-01-12
3      1004          203    Pizza Palace         15.8           19.0     3.0 2024-01-15
4      1005          204    Curry Corner         28.4           55.0     5.0 2024-02-01
What just happened?
pandas is Python's core data table library. Three of its most essential inspection methods appear here: .shape returns a tuple of (rows, columns) — the very first number you need; .dtypes returns the data type of every column as a Series — critical for catching dates stored as strings or numbers stored as text; and .head() returns the first N rows (default 5) so you can eyeball the actual values before running any statistics.
Already we can see delivery_mins and rating loaded as float64 instead of int — that's because they contain NaN values, and pandas can't store integers with NaN, so it promotes the column to float. That's a data quality signal. We'll quantify it in Phase 2.
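The promotion is easy to see on a toy Series — and pandas' nullable Int64 extension dtype (note the capital I) is the standard workaround when you need to keep integer semantics alongside missing values:

```python
import pandas as pd
import numpy as np

# A Series with a missing value: pandas promotes int to float64,
# because the default int64 dtype has no way to represent NaN
s_float = pd.Series([4, 5, np.nan])
print(s_float.dtype)  # float64

# The nullable Int64 extension dtype keeps integers and stores
# the missing value as pd.NA instead of NaN
s_int = pd.Series([4, 5, np.nan], dtype='Int64')
print(s_int.dtype)  # Int64
```

Whether to switch to Int64 is a judgment call — the default float promotion is harmless for analysis, but the nullable dtype matters if downstream code expects whole numbers.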
Phase 2 — Data Quality Audit
The scenario: You've got a mental model of the dataset. Now you need a written audit — exactly how many values are missing in each column, are there any duplicate orders, and does every column contain what it claims to? Your manager needs this as a bullet-point summary before you touch a single value.
import pandas as pd # pandas: data table library — .isnull(), .duplicated(), and .describe() all live here
import numpy as np # numpy: numerical library — used here for np.nan in the dataset
# (df created in Phase 1 — same dataset, continuing the workflow)
# --- MISSING VALUES ---
# .isnull().sum() counts NaN in each column; divide by len(df) for the percentage rate
missing_count = df.isnull().sum() # absolute count per column
missing_pct = (df.isnull().sum() / len(df) * 100).round(1) # percentage, 1 decimal place
missing_report = pd.DataFrame({
'missing_count': missing_count,
'missing_pct': missing_pct
})
missing_report = missing_report[missing_report['missing_count'] > 0] # filter to only columns with issues
print("=== MISSING VALUE REPORT ===")
print(missing_report)
print()
# --- DUPLICATES ---
# .duplicated() returns a boolean Series — True for every row that is a duplicate of an earlier row
n_dupes = df.duplicated().sum() # count of duplicate rows
print(f"Duplicate rows: {n_dupes}")
print()
# --- STATISTICAL SUMMARY ---
# .describe() gives count, mean, std, min, quartiles, max for all numeric columns in one shot
print("=== NUMERIC SUMMARY ===")
print(df.describe().round(2))
=== MISSING VALUE REPORT ===
missing_count missing_pct
order_value 1 8.3
delivery_mins 1 8.3
rating 1 8.3
Duplicate rows: 0
=== NUMERIC SUMMARY ===
       order_id  customer_id  order_value  delivery_mins  rating
count     12.00        12.00        11.00          11.00   11.00
mean    1006.50       203.08        25.73          34.00    4.18
std        3.61         1.73         7.88           9.97    0.75
min     1001.00       201.00        15.80          19.00    3.00
25%     1003.75       201.75        19.20          27.50    4.00
50%     1006.50       203.00        24.80          33.00    4.00
75%     1009.25       204.25        30.35          39.50    5.00
max     1012.00       206.00        41.00          55.00    5.00
What just happened?
pandas is doing all the heavy lifting in this phase. .isnull().sum() chains two methods: .isnull() converts every cell to True/False (True = missing), then .sum() adds up the Trues per column — since True counts as 1. Dividing by len(df) converts to a percentage. We then construct a new DataFrame from the results to make it readable at a glance.
.duplicated().sum() scans every row for exact matches to a previous row — zero here, which is good. On real datasets with millions of rows, duplicate detection is often one of the biggest cleaning wins.
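Exact-row matching is the strictest form of duplicate detection. In practice you often want the `subset=` variant — the toy rows and the customer/date subset choice below are illustrative, not from the delivery dataset above:

```python
import pandas as pd

# Two rows that share a customer and an order date but differ in
# order_id may be a double-submitted order worth investigating
df = pd.DataFrame({
    'order_id':    [1001, 1002, 1003],
    'customer_id': [201, 202, 201],
    'order_date':  ['2024-01-05', '2024-01-07', '2024-01-05'],
})

# subset= restricts the comparison to the named columns;
# keep=False flags every member of a duplicate group, not just later ones
suspicious = df[df.duplicated(subset=['customer_id', 'order_date'], keep=False)]
print(suspicious)
```

Here rows with order_id 1001 and 1003 are flagged: same customer, same day, different order IDs.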
.describe() shows count dropping to 11 for the three columns with missing values — that's the quickest way to spot missing data in a summary. The mean delivery time is 34 minutes and average order value is £25.73 — the kind of headline numbers your manager will ask for.
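.describe() defaults to numeric columns, but it has a categorical mode too — a quick sketch on a cut-down frame:

```python
import pandas as pd

df = pd.DataFrame({
    'restaurant_name': ['Burger Barn', 'Sushi Spot', 'Burger Barn', 'Pizza Palace'],
    'order_value': [18.5, 34.2, 22.1, 15.8],
})

# include='object' summarises text columns instead: count, number of
# unique values, the most frequent value (top), and its frequency (freq)
cat_summary = df.describe(include='object')
print(cat_summary)
```

For an audit, `unique` and `freq` are the interesting rows — a column with far more unique values than expected often signals inconsistent spellings.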
Phase 3 — Univariate Analysis
The scenario: Audit complete. Now you analyse each column on its own terms. Your manager wants to know: what does the distribution of order values look like, are delivery times skewed, and which restaurant is getting the most orders? One column at a time — no relationships yet.
import pandas as pd # pandas: data library — .skew(), .value_counts(), .median(), and column selection
import numpy as np # numpy: numerical library — underpins all pandas column calculations
# --- NUMERIC COLUMNS: shape statistics ---
numeric_cols = ['order_value', 'delivery_mins', 'rating'] # columns to analyse individually
for col in numeric_cols: # loop — same analysis repeated per column
print(f"--- {col} ---")
print(f" Mean: {df[col].mean():.2f}") # central tendency — mean
print(f" Median: {df[col].median():.2f}") # central tendency — median (robust to outliers)
print(f" Std dev: {df[col].std():.2f}") # spread — standard deviation
print(f" Skewness: {df[col].skew():.3f}") # shape — from Lesson 12
print(f" Min: {df[col].min():.1f} Max: {df[col].max():.1f}") # range
print()
# --- CATEGORICAL COLUMN: frequency counts ---
print("--- restaurant_name (frequency) ---")
print(df['restaurant_name'].value_counts()) # .value_counts() counts unique entries, sorted by frequency
--- order_value ---
  Mean:     25.73
  Median:   24.80
  Std dev:  7.88
  Skewness: 0.566
  Min: 15.8   Max: 41.0

--- delivery_mins ---
  Mean:     34.00
  Median:   33.00
  Std dev:  9.97
  Skewness: 0.645
  Min: 19.0   Max: 55.0

--- rating ---
  Mean:     4.18
  Median:   4.00
  Std dev:  0.75
  Skewness: -0.329
  Min: 3.0   Max: 5.0

--- restaurant_name (frequency) ---
restaurant_name
Burger Barn     4
Pizza Palace    3
Curry Corner    3
Sushi Spot      2
Name: count, dtype: int64
What just happened?
pandas is providing every method here: .mean(), .median(), .std(), .skew(), .min(), .max() — all called directly on a Series (a single column). The loop pattern keeps the code clean: same analysis runs on every numeric column without repeating code.
.value_counts() is pandas' go-to method for categorical columns — it returns a Series sorted by frequency, showing you at a glance which categories dominate. Burger Barn leads with 4 orders out of 12.
All three numeric columns show only mild skewness (|skew| < 1) — no transformation needed. Mean and median sit close together for order_value and delivery_mins, consistent with roughly symmetric distributions. Rating has a mild negative skew (-0.329): more high ratings than low, which is typical for consumer apps.
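A related trick worth knowing: value_counts(normalize=True) returns proportions instead of raw counts — the same ranking, but immediately readable as market share. A standalone sketch recreating the restaurant column:

```python
import pandas as pd

restaurants = pd.Series(
    ['Burger Barn', 'Sushi Spot', 'Burger Barn', 'Pizza Palace', 'Curry Corner',
     'Sushi Spot', 'Burger Barn', 'Pizza Palace', 'Curry Corner', 'Burger Barn',
     'Pizza Palace', 'Curry Corner'],
    name='restaurant_name',
)

# normalize=True divides each count by the total, giving the share
# of all orders that each restaurant accounts for
share = restaurants.value_counts(normalize=True).round(3)
print(share)
```

Burger Barn's 4 orders out of 12 become 0.333 — a third of all orders, which is often the more useful number in a report.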
Univariate Snapshot — Visual Summary
Here's what the restaurant order distribution and the order value spread look like side by side. Quick visual confirms the numbers.
[Charts: "Orders by Restaurant" (bar chart) and "Order Value Distribution" (histogram, mean £25.73)]
Phase 4 — Bivariate Analysis
The scenario: Your manager comes back with a follow-up question: "Does the restaurant affect how long delivery takes, and do higher-value orders get better ratings?" These are relationship questions — two variables at once. Phase 4 starts here.
import pandas as pd # pandas: data library — .groupby(), .corr(), and aggregation methods all live here
import numpy as np # numpy: numerical library — present as standard; underpins pandas operations
# --- GROUPED ANALYSIS: average delivery time per restaurant ---
# .groupby() splits the DataFrame by restaurant, then .agg() computes stats per group
delivery_by_restaurant = (
df.groupby('restaurant_name')['delivery_mins']
.agg(mean_mins='mean', count='count') # named aggregations: mean and count in one step
.round(1)
.sort_values('mean_mins', ascending=False) # sort so slowest is at the top
)
print("=== DELIVERY TIME BY RESTAURANT ===")
print(delivery_by_restaurant)
print()
# --- CORRELATION: order value vs rating ---
# .corr() computes Pearson correlation between two numeric Series
# Returns a value from -1 (perfect negative) to +1 (perfect positive); 0 = no linear relationship
corr_value = df['order_value'].corr(df['rating'])
print(f"Correlation — order value vs rating: {corr_value:.3f}")
print()
# --- GROUPED ANALYSIS: average order value per restaurant ---
value_by_restaurant = (
df.groupby('restaurant_name')['order_value']
.mean()
.round(2)
.sort_values(ascending=False)
)
print("=== AVG ORDER VALUE BY RESTAURANT ===")
print(value_by_restaurant)
=== DELIVERY TIME BY RESTAURANT ===
                 mean_mins  count
restaurant_name
Curry Corner          45.5      2
Sushi Spot            40.0      2
Pizza Palace          31.0      3
Burger Barn           27.5      4

Correlation — order value vs rating: 0.584

=== AVG ORDER VALUE BY RESTAURANT ===
restaurant_name
Sushi Spot      37.60
Curry Corner    29.70
Pizza Palace    20.30
Burger Barn     19.52
Name: order_value, dtype: float64
What just happened?
pandas is the engine throughout Phase 4. .groupby() is one of pandas' most powerful methods — it splits the DataFrame into groups by a categorical column, applies a function to each group, and combines the results. The named aggregation syntax (mean_mins='mean') gives you clean column names in one step instead of renaming afterwards.
.corr() computes the Pearson correlation coefficient between two Series. It automatically ignores rows where either value is NaN — no extra handling needed.
The findings are genuinely interesting: Curry Corner and Sushi Spot average well over ten minutes slower than Burger Barn. But Sushi Spot also charges the most (avg £37.60). The correlation between order value and rating is 0.584 — a moderate positive relationship, though with only ten complete order–rating pairs this is a hypothesis to check on more data, not a conclusion. These are the insights your manager actually wants.
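When you have several numeric columns, DataFrame.corr() extends pairwise Series.corr() to the full matrix in one call. A sketch using the same three numeric columns:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'order_value':   [18.5, 34.2, 22.1, 15.8, 28.4, 41.0, 19.9, np.nan, 31.2, 17.6, 24.8, 29.5],
    'delivery_mins': [28, 42, 31, 19, 55, 38, 27, 33, np.nan, 24, 41, 36],
    'rating':        [4, 5, 4, 3, 5, 4, np.nan, 3, 5, 4, 4, 5],
})

# .corr() builds the full pairwise Pearson matrix, dropping NaN rows
# pair-by-pair exactly as Series.corr() does for a single pair
corr_matrix = df.corr().round(3)
print(corr_matrix)
```

The diagonal is always 1.0 (each column correlates perfectly with itself), and the matrix is symmetric — so in practice you only read one triangle of it.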
Phases 5 & 6 — Clean, Transform, Document
Phase 5 applies everything from Lessons 6–13: handle the missing values you found in Phase 2, transform any skewed columns, encode categoricals. Phase 6 writes it all down. Here's a compact version of both applied to the delivery dataset.
import pandas as pd # pandas: core data library — fillna(), get_dummies(), and all column ops
import numpy as np # numpy: numerical library — np.nan for dataset construction
df_clean = df.copy() # always work on a copy — never mutate the original DataFrame in-place
# --- PHASE 5: CLEAN & TRANSFORM ---
# Missing values: fill numeric columns with median (robust to outliers, safe for small samples)
# We use median rather than mean because with only 12 rows, one extreme value could skew the fill
for col in ['order_value', 'delivery_mins', 'rating']:
median_val = df_clean[col].median() # compute median ignoring NaN
df_clean[col] = df_clean[col].fillna(median_val) # .fillna() replaces NaN with the given value
print(f"Filled {col} NaN with median: {median_val:.2f}")
print()
# Encode restaurant_name with one-hot encoding (nominal — no inherent order between restaurants)
df_clean = pd.get_dummies(df_clean, columns=['restaurant_name'], prefix='rest', drop_first=True) # drop_first avoids multicollinearity
# --- PHASE 6: DOCUMENT ---
print("=== EDA SUMMARY REPORT ===")
print(f"Rows: {len(df_clean)}")
print(f"Columns (final): {df_clean.shape[1]}")
print(f"Missing values: 0 (filled with column medians)")
print(f"Duplicates: 0")
print(f"Skewed columns: None (all |skew| < 1)")
print(f"Encoding: restaurant_name one-hot encoded (3 new columns, original dropped)")
print(f"Key finding: Curry Corner slowest delivery (45.5 min avg); Sushi Spot highest value (£37.60 avg)")
print()
print("Final column list:")
print(list(df_clean.columns))
Filled order_value NaN with median: 24.80
Filled delivery_mins NaN with median: 33.00
Filled rating NaN with median: 4.00

=== EDA SUMMARY REPORT ===
Rows: 12
Columns (final): 9
Missing values: 0 (filled with column medians)
Duplicates: 0
Skewed columns: None (all |skew| < 1)
Encoding: restaurant_name one-hot encoded (3 new columns, original dropped)
Key finding: Curry Corner slowest delivery (45.5 min avg); Sushi Spot highest value (£37.60 avg)

Final column list:
['order_id', 'customer_id', 'order_value', 'delivery_mins', 'rating', 'order_date', 'rest_Curry Corner', 'rest_Pizza Palace', 'rest_Sushi Spot']
What just happened?
pandas provides both cleaning tools used here. .fillna() replaces NaN values with a scalar — we pass in the column's median, computed with .median(). pd.get_dummies() with columns= specified encodes the named column and drops the original in the DataFrame it returns — cleaner than the separate pd.concat() approach when you want to transform and replace in one step.
The summary report at the end is Phase 6 in miniature. In a real project this would be a Markdown cell or a written section of your Jupyter notebook. The key discipline: every decision gets one sentence of justification. "Filled with median because..." not just "filled."
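One lightweight way to enforce that discipline is a running decision log — append an entry the moment you make each choice, then dump it at write-up time. The structure below is an illustration of the pattern, not a standard:

```python
import pandas as pd

# Append one entry per cleaning decision as you make it — not at the end
decisions = []
decisions.append({'phase': 5, 'action': 'fillna(median) on order_value',
                  'why': 'only 8.3% missing; median is robust on a small sample'})
decisions.append({'phase': 5, 'action': 'one-hot encode restaurant_name',
                  'why': 'nominal categorical; drop_first avoids multicollinearity'})

# At write-up time the log is already a table — ready for the report
log = pd.DataFrame(decisions)
print(log.to_string(index=False))
```

Because each entry pairs an action with a one-line justification, the Phase 6 write-up becomes a formatting exercise instead of an act of memory.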
Teacher's Note
The six-phase workflow isn't gospel — experienced analysts adapt it based on dataset size, business context, and time pressure. But when you're new to a dataset or new to the job, having a default sequence to fall back on is genuinely valuable. It stops you spinning your wheels and gives your manager something predictable to expect from you.
One thing that never changes across projects: document as you go, not at the end. Every decision you delay writing down is a decision you'll misremember or forget to include. The five-minute discipline of writing one line of justification per choice saves hours of confusion in code review.
Practice Questions
1. In the 6-phase EDA workflow, which phase number involves finding and quantifying missing values and duplicates — without fixing them yet?
2. Which pandas method returns the frequency of each unique value in a categorical column, sorted by count?
3. Which pandas method splits a DataFrame into groups by a categorical column so you can compute statistics per group?
Quiz
1. You fit a scaler on your entire dataset before splitting into train and test sets. What is the problem with this approach?
2. What is the correct order of analysis in phases 3 and 4 of the EDA workflow?
3. Which result indicates effectively no linear relationship between two variables?
Up Next · Lesson 15
Data Quality Issues
A deep dive into the full spectrum of real-world data problems — inconsistent formats, wrong types, impossible values, and how to systematically hunt them down.