DS Case Studies
Analysing Sales Data
Raw sales files are almost never clean. Columns are misnamed, months are missing, regions are inconsistent. Before any chart gets made or any target gets set, someone has to sit with the data and make it make sense. That someone is you.
You are a data analyst at NovaBrands, a consumer goods company selling across four regions. Your sales director has handed you a 12-month sales export and asked for a report by Friday: which regions are performing, which products are underperforming, and where the seasonal patterns are. You have pandas, numpy, and a clean notebook.
What This Case Study Covers
Sales EDA is one of the most common tasks a junior analyst will face. The core challenge is not the maths — it is structuring the investigation. A good sales analysis does not just compute totals; it asks the right questions in the right order: how much overall, broken by region, broken by product, broken by time. Each layer reveals something the previous layer was hiding.
This case study follows the standard sales EDA sequence: load and validate the raw file for missing values and type issues, aggregate by region and product to find the performance hierarchy, extract time patterns to identify seasonal peaks and dips, and compute derived metrics like revenue per unit and month-over-month growth to go beyond the raw numbers.
The Sales EDA Toolkit
Load and Validate
Read the file, check shape and dtypes, confirm date columns are parsed correctly, and run the missing value audit. A date column stored as a string will break every time-based aggregation silently.
Regional Performance
Group by region and sum revenue. Sort descending. Find the gap between the top and bottom region — that gap is the story your director wants to hear about first.
Product Performance
Group by product category and compute total revenue, units sold, and average revenue per unit. Products with high unit sales but low revenue per unit are underpriced — that is an actionable finding.
Monthly Trend Analysis
Extract the month from the date column and group by it to find seasonal patterns. Which months spike? Which months slump? These patterns drive production and inventory planning decisions.
Derived Metrics
Compute revenue per unit, month-over-month growth percentage, and regional revenue share. Raw totals rarely tell the full story — derived metrics reveal efficiency and trajectory.
Dataset Overview
NovaBrands' sales export contains 12 records — one row per transaction, covering all four regions across 12 months. We build it directly with pd.DataFrame() so you can run it anywhere. Each column represents a different dimension of a sale.
| sale_id | date | region | product_category | units_sold | unit_price | revenue | sales_rep |
|---|---|---|---|---|---|---|---|
| S001 | 2024-01-15 | North | Electronics | 120 | $85.00 | $10,200 | Alice |
| S002 | 2024-02-10 | South | Apparel | 340 | $32.00 | $10,880 | Bob |
| S003 | 2024-03-22 | East | Electronics | 95 | $85.00 | $8,075 | Carol |
| S004 | 2024-04-05 | West | Home Goods | 210 | $54.00 | $11,340 | Dave |
| S005 | 2024-05-18 | North | Apparel | 180 | $32.00 | $5,760 | Alice |
Showing first 5 of 12 rows · 8 columns
sale_id — Unique transaction ID. Used for counting records and deduplication only.
date — Transaction date as a string. Must be parsed to datetime before any time-based grouping will work correctly.
region — North, South, East, West. The primary dimension for performance comparison in this case study.
product_category — Electronics, Apparel, or Home Goods. Used for product-level aggregation and revenue per unit analysis.
units_sold — Number of units moved in the transaction. Combined with unit_price to verify or reconstruct revenue.
unit_price — Price per unit sold. Key input to the revenue per unit derived metric — reveals pricing strategy by category.
revenue — Total revenue for the transaction. This is the primary metric every aggregation will sum or average.
sales_rep — Name of the sales representative. Not used in this analysis — included to show real datasets always carry extra columns you must consciously exclude.
Business Questions
Your sales director wants answers to these five questions. Each one maps to a specific step in the analysis below — this is the structure of a real analyst brief.
Which region generated the highest total revenue over the 12-month period, and how large is the gap to the lowest region?
Which product category drives the most revenue, and which has the highest revenue per unit sold?
What are the peak and trough months for total revenue — and what is the percentage difference between them?
What share of total company revenue does each region contribute?
Which region and product category combination is the single highest-revenue pairing in the dataset?
Step-by-Step Analysis
The scenario:
The export landed in your inbox as a CSV. You have opened Colab, imported pandas and numpy, and now you need to validate the data before you build a single number. A wrong dtype or a hidden null can corrupt every aggregate you produce — and your director will not know the numbers are wrong until a decision has already been made on them.
The most common mistake in sales EDA is forgetting to parse the date column. pandas reads dates as plain strings by default — strings have no .dt accessor, so df["date"].dt.month raises an AttributeError until you convert the column to datetime64. We do that here before touching any other column.
import pandas as pd
import numpy as np
# Build the NovaBrands 12-month sales dataset
# In production this would be: df = pd.read_csv("novabrand_sales.csv", parse_dates=["date"])
df = pd.DataFrame({
"sale_id": ["S001","S002","S003","S004","S005","S006",
"S007","S008","S009","S010","S011","S012"],
"date": ["2024-01-15","2024-02-10","2024-03-22","2024-04-05",
"2024-05-18","2024-06-30","2024-07-14","2024-08-09",
"2024-09-25","2024-10-11","2024-11-03","2024-12-20"],
"region": ["North","South","East","West","North","South",
"East","West","North","South","East","West"],
"product_category": ["Electronics","Apparel","Electronics","Home Goods",
"Apparel","Electronics","Home Goods","Apparel",
"Electronics","Home Goods","Apparel","Electronics"],
"units_sold": [120, 340, 95, 210, 180, 260, 145, 390, 110, 175, 220, 300],
"unit_price": [85, 32, 85, 54, 32, 85, 54, 32, 85, 54, 32, 85],
"revenue": [10200, 10880, 8075, 11340, 5760, 22100,
7830, 12480, 9350, 9450, 7040, 25500],
"sales_rep": ["Alice","Bob","Carol","Dave","Alice","Bob",
"Carol","Dave","Alice","Bob","Carol","Dave"]
})
# Parse the date column from string to datetime
# Without this, pandas treats "2024-01-15" as text — .dt.month would raise an error
df["date"] = pd.to_datetime(df["date"])
# Extract month number and month name — both useful for grouping and labelling
df["month"] = df["date"].dt.month # integer 1–12
df["month_name"] = df["date"].dt.strftime("%b") # abbreviated name: Jan, Feb…
# Basic inspection
print("Shape:", df.shape)
print("\nData types after parsing:")
print(df.dtypes)
print("\nMissing values:")
print(df.isnull().sum())
print("\nRevenue summary:")
print(df["revenue"].describe().round(0))
Shape: (12, 10)

Data types after parsing:
sale_id                     object
date                datetime64[ns]
region                      object
product_category            object
units_sold                   int64
unit_price                   int64
revenue                      int64
sales_rep                   object
month                        int64
month_name                  object
dtype: object

Missing values:
sale_id             0
date                0
region              0
product_category    0
units_sold          0
unit_price          0
revenue             0
sales_rep           0
month               0
month_name          0
dtype: int64

Revenue summary:
count       12.0
mean     11667.0
std       6015.0
min       5760.0
25%       8014.0
50%       9825.0
75%      11625.0
max      25500.0
Name: revenue, dtype: float64
What just happened?
Method — pd.to_datetime()
pd.to_datetime() converts a column of date strings into datetime64 objects — a numeric representation of time that pandas can sort, subtract, and group by period. Without this conversion, df["date"].dt.month would throw an AttributeError because strings have no .dt accessor. Always parse dates immediately after loading.
Once a column is datetime64, the .dt accessor unlocks a full suite of time properties: .dt.month extracts the month as an integer (1–12), .dt.strftime("%b") formats it as a three-letter abbreviation ("Jan", "Feb"…). We create both — the integer for sorting, the name for readable labels.
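A quick illustration of those .dt properties — toy dates here, not the NovaBrands file, plus two extra accessors (.dt.quarter and .dt.day_name()) that often turn up in sales work:

```python
import pandas as pd

# Toy example of the .dt accessor on a parsed date column
s = pd.to_datetime(pd.Series(["2024-01-15", "2024-06-30", "2024-12-20"]))

print(s.dt.month.tolist())           # [1, 6, 12]
print(s.dt.strftime("%b").tolist())  # ['Jan', 'Jun', 'Dec']
print(s.dt.quarter.tolist())         # [1, 2, 4]
print(s.dt.day_name().tolist())      # ['Monday', 'Sunday', 'Friday']
```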
Zero missing values — the dataset is clean. But the revenue summary already tells us something important: the max ($25,500) is more than 4× the min ($5,760). That is a wide spread across just 12 transactions — one or two months are pulling the average up significantly. We will identify exactly which ones in Step 4.
The first question in the director's brief is regional. We group by region, sum revenue, sort it, and compute the revenue share — answering business questions 1 and 4 in a single block.
# Group by region and aggregate key metrics
# agg() lets us apply different functions to different columns in one call
regional = df.groupby("region").agg(
total_revenue = ("revenue", "sum"), # sum all revenue per region
total_units = ("units_sold", "sum"), # sum all units sold per region
num_transactions = ("sale_id", "count") # count how many transactions per region
).reset_index()
# Sort by total_revenue descending — highest first
regional = regional.sort_values("total_revenue", ascending=False).reset_index(drop=True)
# Compute revenue share: each region's % of total company revenue
total_rev = regional["total_revenue"].sum()
regional["revenue_share_pct"] = (regional["total_revenue"] / total_rev * 100).round(1)
print("Regional performance summary:")
print(regional.to_string(index=False))
# Gap between top and bottom region
top = regional.iloc[0]
bottom = regional.iloc[-1]
gap = top["total_revenue"] - bottom["total_revenue"]
print(f"\nTop region: {top['region']} — ${top['total_revenue']:,}")
print(f"Bottom region: {bottom['region']} — ${bottom['total_revenue']:,}")
print(f"Revenue gap: ${gap:,} ({gap/bottom['total_revenue']*100:.1f}% above bottom)")
Regional performance summary:
region  total_revenue  total_units  num_transactions  revenue_share_pct
  West          49320          900                 3               35.2
 South          42430          775                 3               30.3
 North          25310          410                 3               18.1
  East          22945          460                 3               16.4

Top region: West — $49,320
Bottom region: East — $22,945
Revenue gap: $26,375 (114.9% above bottom)
What just happened?
Method — groupby().agg()
.agg() is more powerful than .mean() or .sum() alone — it lets you apply different aggregation functions to different columns in a single call, using the named keyword syntax new_col_name = ("source_col", "function"). The result is a clean multi-metric summary table without needing multiple separate groupby calls.
After sorting, .iloc[0] selects the first row (highest revenue) and .iloc[-1] selects the last row (lowest revenue). .iloc selects by integer position — always safe after a sort, regardless of the original index labels. This pattern of sort-then-iloc is the fastest way to extract top and bottom performers.
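A minimal sketch of the sort-then-iloc pattern on a toy table (the region names here are illustrative, not from the dataset); pandas' nlargest() is shown as a one-call alternative:

```python
import pandas as pd

# Toy summary table — sort descending, then pick rows by integer position
t = pd.DataFrame({"region": ["A", "B", "C"], "rev": [30, 10, 20]})
ranked = t.sort_values("rev", ascending=False).reset_index(drop=True)

print(ranked.iloc[0]["region"])   # 'A' — top performer
print(ranked.iloc[-1]["region"])  # 'B' — bottom performer

# nlargest(n, col) skips the explicit sort when you only need the top rows
print(t.nlargest(1, "rev")["region"].iloc[0])  # 'A'
```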
West leads at $49,320 — 35.2% of total company revenue — while East trails at $22,945, just 16.4% of total. The gap between top and bottom is $26,375, or nearly 115%. With only 3 transactions each, the difference is driven entirely by transaction size, not volume. The director needs to know whether West's advantage is pricing, product mix, or territory size — that is the follow-up investigation.
Revenue by category tells us what sells most. Revenue per unit tells us what is most efficiently priced. These two metrics together answer business question 2 and often reveal a hidden story — a category can lead on units but lag badly on value.
# Group by product category
product = df.groupby("product_category").agg(
total_revenue = ("revenue", "sum"),
total_units = ("units_sold", "sum"),
transactions = ("sale_id", "count")
).reset_index()
# Compute revenue per unit — total revenue divided by total units sold
# This is a derived metric: not in the raw data but computed from two columns
product["revenue_per_unit"] = (
product["total_revenue"] / product["total_units"]
).round(2)
# Sort by total revenue
product = product.sort_values("total_revenue", ascending=False)
print("Product category performance:")
print(product.to_string(index=False))
# Which category has the highest revenue per unit?
top_rpu = product.sort_values("revenue_per_unit", ascending=False).iloc[0]
print(f"\nHighest revenue per unit: {top_rpu['product_category']} "
f"at ${top_rpu['revenue_per_unit']:.2f} per unit")
Product category performance:
product_category total_revenue total_units transactions revenue_per_unit
Electronics 75225 885 5 85.00
Apparel 36160 1130 4 32.00
Home Goods 28620 530 3 54.00
Highest revenue per unit: Electronics at $85.00 per unit
What just happened?
Method — derived metric from aggregated columns
We computed revenue_per_unit by dividing two already-aggregated columns: total_revenue / total_units. This is called a derived metric — it does not exist in the raw data but is calculated from columns that do. Derived metrics almost always add more insight than the raw columns alone. Always think about what you can compute from what you already have before deciding more data is needed.
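One subtlety worth flagging (toy numbers, not NovaBrands data): dividing aggregated columns is not the same as averaging per-row ratios. The ratio of sums weights each transaction by its size; the mean of ratios treats a tiny sale and a huge sale equally.

```python
import pandas as pd

# Two transactions: a small one at $10/unit and a large one at $30/unit
d = pd.DataFrame({"revenue": [100, 900], "units": [10, 30]})

ratio_of_sums = d["revenue"].sum() / d["units"].sum()  # 1000 / 40
mean_of_ratios = (d["revenue"] / d["units"]).mean()    # (10 + 30) / 2

print(ratio_of_sums)   # 25.0 — the correct revenue per unit
print(mean_of_ratios)  # 20.0 — underweights the large transaction
```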
We sorted the table twice — once by total_revenue for the printed summary, then again by revenue_per_unit to find the top performer on that metric. Sorting a DataFrame never modifies the underlying data; it just changes the row order for display or selection. .iloc[0] after the second sort safely picks the top row on the new ranking.
Electronics leads in total revenue at $75,225 — nearly double Apparel's $36,160. But the more interesting finding is the unit economics: Apparel moved 1,130 units (the most of any category) but generates only $32 per unit, while Electronics generates $85 per unit. If NovaBrands can shift even 10% of Apparel customers toward Electronics, the revenue impact is significant without selling a single additional unit.
Time-based patterns drive production and marketing planning. We group by month, sort chronologically, and compute the month-over-month growth rate — answering business question 3 and showing the director exactly where the year peaks and dips.
# Group by month number and month name together
# Using both ensures we can sort by month (integer) but display by name (string)
monthly = df.groupby(["month", "month_name"]).agg(
total_revenue = ("revenue", "sum"),
total_units = ("units_sold", "sum")
).reset_index()
# Sort chronologically by month integer
monthly = monthly.sort_values("month").reset_index(drop=True)
# Compute month-over-month (MoM) revenue change
# .shift(1) moves each value down one row — so row N gets row N-1's revenue
# Dividing the difference by the previous row gives the % change
monthly["mom_growth_pct"] = (
(monthly["total_revenue"] - monthly["total_revenue"].shift(1))
/ monthly["total_revenue"].shift(1) * 100
).round(1)
print("Monthly revenue trend:")
print(monthly[["month_name","total_revenue","mom_growth_pct"]].to_string(index=False))
# Identify peak and trough months
peak = monthly.loc[monthly["total_revenue"].idxmax()]
trough = monthly.loc[monthly["total_revenue"].idxmin()]
pct_diff = (peak["total_revenue"] - trough["total_revenue"]) / trough["total_revenue"] * 100
print(f"\nPeak month: {peak['month_name']} — ${peak['total_revenue']:,}")
print(f"Trough month: {trough['month_name']} — ${trough['total_revenue']:,}")
print(f"Peak is {pct_diff:.1f}% higher than trough")
Monthly revenue trend:
month_name total_revenue mom_growth_pct
Jan 10200 NaN
Feb 10880 6.7
Mar 8075 -25.8
Apr 11340 40.4
May 5760 -49.2
Jun 22100 283.7
Jul 7830 -64.6
Aug 12480 59.4
Sep 9350 -25.1
Oct 9450 1.1
Nov 7040 -25.5
Dec 25500 262.2
Peak month: Dec — $25,500
Trough month: May — $5,760
Peak is 342.7% higher than trough
What just happened?
Method — .shift() for period-over-period comparison
.shift(1) shifts a Series down by one row — so each row gets the value from the row above it. Subtracting the shifted Series from the original gives the period-over-period change, and dividing that by the shifted value gives the percentage change. This is the standard pandas pattern for any month-over-month, quarter-over-quarter, or year-over-year growth calculation. The first row always returns NaN because there is no prior period to compare against.
.idxmax() returns the index label of the row with the highest value in a column — not the value itself. Passing this to .loc[] retrieves the full row. The same pattern with .idxmin() retrieves the row with the lowest value. This is cleaner than sorting the full DataFrame just to pick one row.
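The idxmax-then-loc pattern in miniature (toy values):

```python
import pandas as pd

# idxmax()/idxmin() return the index label of the extreme row;
# passing that label to .loc[] retrieves the full row
m = pd.DataFrame({"month": ["May", "Jun", "Dec"], "rev": [5760, 22100, 25500]})

peak = m.loc[m["rev"].idxmax()]
trough = m.loc[m["rev"].idxmin()]
print(peak["month"], trough["month"])  # Dec May
```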
December is the peak month at $25,500 — a classic year-end consumer spending surge. May is the trough at $5,760 — the peak sits 343% above it. The June spike (+284% MoM) is equally striking and may reflect a mid-year promotion or a large one-off contract. Both June and December need to be flagged for the director as months requiring additional inventory and staffing preparation. The May dip suggests a structural slow period that could be targeted with a promotional campaign.
Checkpoint — try this before moving on: Modify Step 4 to group by both month and region — use df.groupby(["month","region"])["revenue"].sum().unstack(). Which region is responsible for the June spike? Which one drives the December peak? This is how real analysts drill down — one dimension at a time. The answer shapes the next conversation with the sales director.
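If the unstack() step in the checkpoint is unfamiliar, here is the pattern on a tiny toy frame (not the NovaBrands data, so it does not spoil the exercise) — it pivots the inner group level into columns, giving one row per month and one column per region:

```python
import pandas as pd

# Toy data: two months × two regions
d = pd.DataFrame({
    "month":   [1, 1, 2, 2],
    "region":  ["N", "S", "N", "S"],
    "revenue": [10, 20, 30, 40],
})

# Group on both keys, then unstack the inner level (region) into columns
wide = d.groupby(["month", "region"])["revenue"].sum().unstack()
print(wide)              # 2×2 grid: months down the rows, regions across
print(wide.loc[2, "S"])  # 40 — month 2, region S
```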
Key Findings
West is the top-performing region at $49,320 — 35.2% of total revenue — while East trails at $22,945. The gap of $26,375 between them is 115% and is driven by transaction size, not volume, since both regions completed three transactions.
Electronics dominates product revenue at $75,225 — nearly double Apparel — and generates the highest revenue per unit at $85. Despite Apparel moving the most units (1,130), its $32 per unit makes it the least efficient category by value.
December is the peak month at $25,500 and May is the trough at $5,760 — a 343% difference. June also spikes sharply (+284% MoM), suggesting either a promotional event or a large contract that needs to be investigated for repeatability.
Revenue share is highly concentrated — West and South together account for 65.5% of total revenue. North and East are underperforming relative to their transaction count, suggesting pricing or product mix issues rather than low deal volume.
Apparel has a structural pricing problem: it moves the most units but generates the least value per unit. A 10% price increase on Apparel would add approximately $3,600 to annual revenue with zero change in volume — that is an immediate commercial recommendation.
Visualisations
[Chart: monthly revenue bar chart] Red = trough (May) · Green = peak (Dec) · Blue = secondary spike (Jun)
The Sales EDA Decision Guide
Every sales dataset has the same structure — a time dimension, a categorical dimension, and a numeric target. Here is how to approach each combination systematically:
| Question Type | pandas Approach | Key Method | Watch Out For |
|---|---|---|---|
| Who performs best? | Group by category, sum target | groupby().agg(sum) | Ties driven by transaction count vs size |
| How efficient is each group? | Compute derived metric after agg | col / col (post-agg) | Dividing aggregated cols, not raw rows |
| When does revenue peak? | Parse dates, group by month | pd.to_datetime() + .dt.month | Date column left as string — silent failure |
| How fast is it growing? | Period-over-period % change | .shift(1) subtraction | First row always NaN — expected, not an error |
| What share does each hold? | Divide group sum by total sum | sum / sum * 100 | Rounding shares that don't add to 100% |
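The last pitfall in the table is easy to demonstrate with toy numbers: three equal groups round to shares that sum to 99.9%, not 100% — worth a footnote in any report that presents rounded shares.

```python
import pandas as pd

# Three equal groups — each is exactly one third of the total
rev = pd.Series([1, 1, 1], index=["A", "B", "C"])
share = (rev / rev.sum() * 100).round(1)

print(share.tolist())         # [33.3, 33.3, 33.3]
print(round(share.sum(), 1))  # 99.9 — not 100.0
```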
What Would Come Next?
A senior analyst would build a pivot table crossing region and product category to find the single highest-revenue pairing, then layer in a forecasting model to project Q1 revenue using the monthly trend — the number the director needs for budget planning.
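A minimal sketch of that pivot, rebuilt from the same 12 transactions (region, category, and revenue only) — pivot_table crosses the two dimensions, and stack().idxmax() pulls out the single highest-revenue pairing, answering business question 5:

```python
import pandas as pd

# The 12 NovaBrands transactions, reduced to the three columns the pivot needs
df = pd.DataFrame({
    "region": ["North","South","East","West","North","South",
               "East","West","North","South","East","West"],
    "product_category": ["Electronics","Apparel","Electronics","Home Goods",
                         "Apparel","Electronics","Home Goods","Apparel",
                         "Electronics","Home Goods","Apparel","Electronics"],
    "revenue": [10200, 10880, 8075, 11340, 5760, 22100,
                7830, 12480, 9350, 9450, 7040, 25500],
})

# Cross region × category; fill_value=0 marks combinations with no sales
pivot = df.pivot_table(index="region", columns="product_category",
                       values="revenue", aggfunc="sum", fill_value=0)
print(pivot)

# stack() flattens the grid to one row per (region, category) pair,
# so idxmax() returns the winning combination directly
best_pair = pivot.stack().idxmax()
print(best_pair, pivot.stack().max())  # ('West', 'Electronics') 25500
```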
Limitations of This Analysis
Twelve transactions is a sample. The June spike could be a single large one-off contract or a genuine seasonal pattern — with one data point per month we cannot distinguish between the two without prior-year data.
Business Decisions This Could Drive
The director should investigate the May trough with a promotional campaign and pre-position inventory ahead of June and December based on the seasonal pattern — but only after confirming the pattern repeated in at least one prior year.
Practice Questions
1. Which region generated the highest total revenue in the NovaBrands dataset?
2. Which pandas method moves a Series down by one row so you can compute period-over-period percentage change?
3. Which month was the peak revenue month in the NovaBrands 12-month dataset?
Quiz
1. Why must you run pd.to_datetime() on a date column before doing time-based grouping?
2. Which pandas pattern lets you compute sum, count, and mean on different columns simultaneously in a single groupby operation?
3. Which product category is the most efficiently priced in the NovaBrands dataset, and why?
Up Next · Case Study 3
Movie Ratings EDA
You are handed a dataset of 10,000 movie ratings. Which genres score highest? Does runtime affect rating? Do critic scores and audience scores actually agree?