DS Case Study 2 – Sales Data Exploration | Dataplexa
Beginner Case Study · CS 2

Analysing Sales Data

Raw sales files are almost never clean. Columns are misnamed, months are missing, regions are inconsistent. Before any chart gets made or any target gets set, someone has to sit with the data and make it make sense. That someone is you.

You are a data analyst at NovaBrands, a consumer goods company selling across four regions. Your sales director has handed you a 12-month sales export and asked for a report by Friday: which regions are performing, which products are underperforming, and where the seasonal patterns are. You have pandas, numpy, and a clean notebook.

Industry: Consumer Goods
Technique: EDA · Aggregation
Libraries: pandas · numpy
Difficulty: Beginner
Est. Time: 40–50 min
Overview

What This Case Study Covers

Sales EDA is one of the most common tasks a junior analyst will face. The core challenge is not the maths — it is structuring the investigation. A good sales analysis does not just compute totals; it asks the right questions in the right order: how much overall, broken by region, broken by product, broken by time. Each layer reveals something the previous layer was hiding.

This case study follows the standard sales EDA sequence: load and validate the raw file for missing values and type issues, aggregate by region and product to find the performance hierarchy, extract time patterns to identify seasonal peaks and dips, and compute derived metrics like revenue per unit and month-over-month growth to go beyond the raw numbers.

The Sales EDA Toolkit

1

Load and Validate

Read the file, check shape and dtypes, confirm date columns are parsed correctly, and run the missing value audit. A date column stored as a string will break every time-based aggregation silently.
2

Regional Performance

Group by region and sum revenue. Sort descending. Find the gap between the top and bottom region — that gap is the story your director wants to hear about first.
3

Product Performance

Group by product category and compute total revenue, units sold, and average revenue per unit. Products with high unit sales but low revenue per unit are underpriced — that is an actionable finding.
4

Monthly Trend Analysis

Extract the month from the date column and group by it to find seasonal patterns. Which months spike? Which months slump? These patterns drive production and inventory planning decisions.
5

Derived Metrics

Compute revenue per unit, month-over-month growth percentage, and regional revenue share. Raw totals rarely tell the full story — derived metrics reveal efficiency and trajectory.
01

Dataset Overview

NovaBrands' sales export contains 12 records — one row per transaction, covering all four regions across 12 months. We build it directly with pd.DataFrame() so you can run it anywhere. Each column represents a different dimension of a sale.

sale_id  date        region  product_category  units_sold  unit_price  revenue  sales_rep
S001     2024-01-15  North   Electronics              120      $85.00  $10,200  Alice
S002     2024-02-10  South   Apparel                  340      $32.00  $10,880  Bob
S003     2024-03-22  East    Electronics               95      $85.00   $8,075  Carol
S004     2024-04-05  West    Home Goods               210      $54.00  $11,340  Dave
S005     2024-05-18  North   Apparel                  180      $32.00   $5,760  Alice

Showing first 5 of 12 rows · 8 columns

sale_id · string · unique identifier

Unique transaction ID. Used for counting records and deduplication only.

date · object → datetime64

Transaction date as a string. Must be parsed to datetime before any time-based grouping will work correctly.

region · object · 4 categories

North, South, East, West. The primary dimension for performance comparison in this case study.

product_category · object · 3 categories

Electronics, Apparel, or Home Goods. Used for product-level aggregation and revenue per unit analysis.

units_sold · int64 · count

Number of units moved in the transaction. Combined with unit_price to verify or reconstruct revenue.

unit_price · int64 · USD

Price per unit sold. Key input to the revenue per unit derived metric — reveals pricing strategy by category.

revenue · int64 · USD · target metric

Total revenue for the transaction. This is the primary metric every aggregation will sum or average.

sales_rep · object · name

Name of the sales representative. Not used in this analysis — included to show real datasets always carry extra columns you must consciously exclude.

02

Business Questions

Your sales director wants answers to these five questions. Each one maps to a specific step in the analysis below — this is the structure of a real analyst brief.

1

Which region generated the highest total revenue over the 12-month period, and how large is the gap to the lowest region?

2

Which product category drives the most revenue, and which has the highest revenue per unit sold?

3

What are the peak and trough months for total revenue — and what is the percentage difference between them?

4

What share of total company revenue does each region contribute?

5

Which region and product category combination is the single highest-revenue pairing in the dataset?

03

Step-by-Step Analysis

The scenario:

The export landed in your inbox as a CSV. You have opened Colab, imported pandas and numpy, and now you need to validate the data before you build a single number. A wrong dtype or a hidden null can corrupt every aggregate you produce — and your director will not know the numbers are wrong until a decision has already been made on them.

Step 1 · Load, Inspect, and Parse Dates

The most common mistake in sales EDA is forgetting to parse the date column. pandas reads dates as plain strings by default — meaning df["date"].dt.month raises an AttributeError, and there is no month column to group by, until you convert the column to datetime64. We do that here before touching any other column.

import pandas as pd
import numpy as np

# Build the NovaBrands 12-month sales dataset
# In production this would be: df = pd.read_csv("novabrand_sales.csv", parse_dates=["date"])
df = pd.DataFrame({
    "sale_id":          ["S001","S002","S003","S004","S005","S006",
                         "S007","S008","S009","S010","S011","S012"],
    "date":             ["2024-01-15","2024-02-10","2024-03-22","2024-04-05",
                         "2024-05-18","2024-06-30","2024-07-14","2024-08-09",
                         "2024-09-25","2024-10-11","2024-11-03","2024-12-20"],
    "region":           ["North","South","East","West","North","South",
                         "East","West","North","South","East","West"],
    "product_category": ["Electronics","Apparel","Electronics","Home Goods",
                         "Apparel","Electronics","Home Goods","Apparel",
                         "Electronics","Home Goods","Apparel","Electronics"],
    "units_sold":       [120, 340, 95, 210, 180, 260, 145, 390, 110, 175, 220, 300],
    "unit_price":       [85, 32, 85, 54, 32, 85, 54, 32, 85, 54, 32, 85],
    "revenue":          [10200, 10880, 8075, 11340, 5760, 22100,
                         7830, 12480, 9350, 9450, 7040, 25500],
    "sales_rep":        ["Alice","Bob","Carol","Dave","Alice","Bob",
                         "Carol","Dave","Alice","Bob","Carol","Dave"]
})

# Parse the date column from string to datetime
# Without this, pandas treats "2024-01-15" as text — groupby month will fail
df["date"] = pd.to_datetime(df["date"])

# Extract month number and month name — both useful for grouping and labelling
df["month"]      = df["date"].dt.month       # integer 1–12
df["month_name"] = df["date"].dt.strftime("%b")  # abbreviated name: Jan, Feb…

# Basic inspection
print("Shape:", df.shape)
print("\nData types after parsing:")
print(df.dtypes)
print("\nMissing values:")
print(df.isnull().sum())
print("\nRevenue summary:")
print(df["revenue"].describe().round(0))
Shape: (12, 10)

Data types after parsing:
sale_id                    object
date               datetime64[ns]
region                     object
product_category           object
units_sold                  int64
unit_price                  int64
revenue                     int64
sales_rep                  object
month                       int64
month_name                 object
dtype: object

Missing values:
sale_id             0
date                0
region              0
product_category    0
units_sold          0
unit_price          0
revenue             0
sales_rep           0
month               0
month_name          0
dtype: int64

Revenue summary:
count       12.0
mean     11667.0
std       6015.0
min       5760.0
25%       8014.0
50%       9825.0
75%      11625.0
max      25500.0

What just happened?

Method — pd.to_datetime()

pd.to_datetime() converts a column of date strings into datetime64 objects — a numeric representation of time that pandas can sort, subtract, and group by period. Without this conversion, df["date"].dt.month would throw an AttributeError because strings have no .dt accessor. Always parse dates immediately after loading.

Method — .dt accessor

Once a column is datetime64, the .dt accessor unlocks a full suite of time properties: .dt.month extracts the month as an integer (1–12), .dt.strftime("%b") formats it as a three-letter abbreviation ("Jan", "Feb"…). We create both — the integer for sorting, the name for readable labels.
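The pattern is easy to exercise in isolation — a throwaway Series (not the case-study frame) is enough to see both properties side by side:

```python
import pandas as pd

# A throwaway Series to exercise the .dt accessor on its own
dates = pd.to_datetime(pd.Series(["2024-01-15", "2024-06-30", "2024-12-20"]))

print(dates.dt.month.tolist())           # [1, 6, 12]
print(dates.dt.strftime("%b").tolist())  # ['Jan', 'Jun', 'Dec']
```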

Business Insight

Zero missing values — the dataset is clean. But the revenue summary already tells us something important: the max ($25,500) is more than 4× the min ($5,760). That is a wide spread across just 12 transactions — one or two months are pulling the average up significantly. We will identify exactly which ones in Step 4.

Step 2 · Regional Revenue Analysis

The first question in the director's brief is regional. We group by region, sum revenue, sort it, and compute the revenue share — answering business questions 1 and 4 in a single block.

# Group by region and aggregate key metrics
# agg() lets us apply different functions to different columns in one call
regional = df.groupby("region").agg(
    total_revenue  = ("revenue",    "sum"),    # sum all revenue per region
    total_units    = ("units_sold", "sum"),    # sum all units sold per region
    num_transactions = ("sale_id",  "count")   # count how many transactions per region
).reset_index()

# Sort by total_revenue descending — highest first
regional = regional.sort_values("total_revenue", ascending=False).reset_index(drop=True)

# Compute revenue share: each region's % of total company revenue
total_rev = regional["total_revenue"].sum()
regional["revenue_share_pct"] = (regional["total_revenue"] / total_rev * 100).round(1)

print("Regional performance summary:")
print(regional.to_string(index=False))

# Gap between top and bottom region
top    = regional.iloc[0]
bottom = regional.iloc[-1]
gap    = top["total_revenue"] - bottom["total_revenue"]
print(f"\nTop region:    {top['region']}  — ${top['total_revenue']:,}")
print(f"Bottom region: {bottom['region']} — ${bottom['total_revenue']:,}")
print(f"Revenue gap:   ${gap:,}  ({gap/bottom['total_revenue']*100:.1f}% above bottom)")
Regional performance summary:
 region  total_revenue  total_units  num_transactions  revenue_share_pct
   West          49320          900                 3               35.2
  South          42430          775                 3               30.3
  North          25310          410                 3               18.1
   East          22945          460                 3               16.4

Top region:    West  — $49,320
Bottom region: East — $22,945
Revenue gap:   $26,375  (114.9% above bottom)

What just happened?

Method — groupby().agg()

.agg() is more powerful than .mean() or .sum() alone — it lets you apply different aggregation functions to different columns in a single call, using the named keyword syntax new_col_name = ("source_col", "function"). The result is a clean multi-metric summary table without needing multiple separate groupby calls.
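A toy frame (not the NovaBrands data) isolates the named-aggregation syntax:

```python
import pandas as pd

# Toy frame with two regions, just to show the named-aggregation syntax
toy = pd.DataFrame({
    "region":  ["North", "North", "South"],
    "revenue": [100, 200, 50],
    "sale_id": ["A", "B", "C"],
})

summary = toy.groupby("region").agg(
    total_revenue    = ("revenue", "sum"),
    num_transactions = ("sale_id", "count"),
).reset_index()

print(summary)
#   region  total_revenue  num_transactions
# 0  North            300                 2
# 1  South             50                 1
```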

Method — .iloc[] for row selection

After sorting, .iloc[0] selects the first row (highest revenue) and .iloc[-1] selects the last row (lowest revenue). .iloc selects by integer position — always safe after a sort, regardless of the original index labels. This pattern of sort-then-iloc is the fastest way to extract top and bottom performers.
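The positional behaviour is easiest to see with deliberately scrambled index labels — a toy example, not the case-study table:

```python
import pandas as pd

# Non-default index labels (7, 3) to show .iloc selects by position, not label
t = pd.DataFrame({"region": ["East", "West"], "revenue": [22945, 49320]},
                 index=[7, 3])
t = t.sort_values("revenue", ascending=False)

print(t.iloc[0]["region"])   # West, first row by position after the sort
print(t.iloc[-1]["region"])  # East, last row by position
```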

Business Insight

West leads at $49,320 — 35.2% of total company revenue — while East trails at $22,945, just 16.4% of total. The gap between top and bottom is $26,375, or nearly 115%. With only 3 transactions each, the difference is driven entirely by transaction size, not volume. The director needs to know whether West's advantage is pricing, product mix, or territory size — that is the follow-up investigation.

Step 3 · Product Category Performance

Revenue by category tells us what sells most. Revenue per unit tells us what is most efficiently priced. These two metrics together answer business question 2 and often reveal a hidden story — a category can lead on units but lag badly on value.

# Group by product category
product = df.groupby("product_category").agg(
    total_revenue = ("revenue",    "sum"),
    total_units   = ("units_sold", "sum"),
    transactions  = ("sale_id",    "count")
).reset_index()

# Compute revenue per unit — total revenue divided by total units sold
# This is a derived metric: not in the raw data but computed from two columns
product["revenue_per_unit"] = (
    product["total_revenue"] / product["total_units"]
).round(2)

# Sort by total revenue
product = product.sort_values("total_revenue", ascending=False)

print("Product category performance:")
print(product.to_string(index=False))

# Which category has the highest revenue per unit?
top_rpu = product.sort_values("revenue_per_unit", ascending=False).iloc[0]
print(f"\nHighest revenue per unit: {top_rpu['product_category']} "
      f"at ${top_rpu['revenue_per_unit']:.2f} per unit")
Product category performance:
 product_category  total_revenue  total_units  transactions  revenue_per_unit
      Electronics          75225          885             5             85.00
          Apparel          36160         1130             4             32.00
       Home Goods          28620          530             3             54.00

Highest revenue per unit: Electronics at $85.00 per unit

What just happened?

Method — derived metric from aggregated columns

We computed revenue_per_unit by dividing two already-aggregated columns: total_revenue / total_units. This is called a derived metric — it does not exist in the raw data but is calculated from columns that do. Derived metrics almost always add more insight than the raw columns alone. Always think about what you can compute from what you already have before deciding more data is needed.
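Dividing the *aggregated* columns matters: sum-over-sum gives a volume-weighted rate, which generally differs from averaging per-row ratios. Toy numbers (not from the dataset) make the distinction concrete:

```python
import pandas as pd

# Two rows with very different per-unit rates and volumes
rows = pd.DataFrame({"revenue": [100, 900], "units": [10, 30]})

weighted   = rows["revenue"].sum() / rows["units"].sum()   # 1000 / 40 = 25.0
unweighted = (rows["revenue"] / rows["units"]).mean()      # (10 + 30) / 2 = 20.0

print(weighted, unweighted)  # 25.0 20.0
```

In the NovaBrands data the two happen to agree because unit price is constant within each category; in messier data they will not.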

Method — chaining sort_values() twice

We sorted the table twice — once by total_revenue for the printed summary, then again by revenue_per_unit to find the top performer on that metric. sort_values() returns a new, reordered copy by default (unless you pass inplace=True), so the underlying data is never modified. .iloc[0] after the second sort safely picks the top row on the new ranking.
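A three-line check confirms the copy behaviour:

```python
import pandas as pd

# sort_values() returns a new frame; the original keeps its row order
df_demo = pd.DataFrame({"v": [3, 1, 2]})
ordered = df_demo.sort_values("v")

print(df_demo["v"].tolist())  # [3, 1, 2], untouched
print(ordered["v"].tolist())  # [1, 2, 3]
```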

Business Insight

Electronics leads in total revenue at $75,225 — nearly double Apparel's $36,160. But the more interesting finding is the unit economics: Apparel moved 1,130 units (the most of any category) but generates only $32 per unit, while Electronics generates $85 per unit. If NovaBrands can shift even 10% of Apparel customers toward Electronics, the revenue impact is significant without selling a single additional unit.

Step 4 · Monthly Trend and Seasonality

Time-based patterns drive production and marketing planning. We group by month, sort chronologically, and compute the month-over-month growth rate — answering business question 3 and showing the director exactly where the year peaks and dips.

# Group by month number and month name together
# Using both ensures we can sort by month (integer) but display by name (string)
monthly = df.groupby(["month", "month_name"]).agg(
    total_revenue = ("revenue",    "sum"),
    total_units   = ("units_sold", "sum")
).reset_index()

# Sort chronologically by month integer
monthly = monthly.sort_values("month").reset_index(drop=True)

# Compute month-over-month (MoM) revenue change
# .shift(1) moves each value down one row — so row N gets row N-1's revenue
# Dividing the difference by the previous row gives the % change
monthly["mom_growth_pct"] = (
    (monthly["total_revenue"] - monthly["total_revenue"].shift(1))
    / monthly["total_revenue"].shift(1) * 100
).round(1)

print("Monthly revenue trend:")
print(monthly[["month_name","total_revenue","mom_growth_pct"]].to_string(index=False))

# Identify peak and trough months
peak   = monthly.loc[monthly["total_revenue"].idxmax()]
trough = monthly.loc[monthly["total_revenue"].idxmin()]
pct_diff = (peak["total_revenue"] - trough["total_revenue"]) / trough["total_revenue"] * 100

print(f"\nPeak month:   {peak['month_name']}  — ${peak['total_revenue']:,}")
print(f"Trough month: {trough['month_name']} — ${trough['total_revenue']:,}")
print(f"Peak is {pct_diff:.1f}% higher than trough")
Monthly revenue trend:
 month_name  total_revenue  mom_growth_pct
        Jan          10200             NaN
        Feb          10880             6.7
        Mar           8075           -25.8
        Apr          11340            40.4
        May           5760           -49.2
        Jun          22100           283.7
        Jul           7830           -64.6
        Aug          12480            59.4
        Sep           9350           -25.1
        Oct           9450             1.1
        Nov           7040           -25.5
        Dec          25500           262.2

Peak month:   Dec  — $25,500
Trough month: May  — $5,760
Peak is 342.7% higher than trough

What just happened?

Method — .shift() for period-over-period comparison

.shift(1) shifts a Series down by one row — so each row gets the value from the row above it. Subtracting the shifted Series from the original gives the period-over-period change, and dividing that by the shifted value gives the percentage change. This is the standard pandas pattern for any month-over-month, quarter-over-quarter, or year-over-year growth calculation. The first row always returns NaN because there is no prior period to compare against.
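pandas also ships a shorthand, Series.pct_change(), that encapsulates exactly this shift-subtract-divide pattern. A small sketch shows the two agree:

```python
import pandas as pd

rev = pd.Series([100.0, 120.0, 90.0])

# The manual shift pattern from the step above...
manual = (rev - rev.shift(1)) / rev.shift(1) * 100
# ...and the built-in shorthand for the same computation
builtin = rev.pct_change() * 100

print(manual.round(1).tolist())   # [nan, 20.0, -25.0]
print(builtin.round(1).tolist())  # [nan, 20.0, -25.0]
```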

Method — idxmax() and idxmin()

.idxmax() returns the index label of the row with the highest value in a column — not the value itself. Passing this to .loc[] retrieves the full row. The same pattern with .idxmin() retrieves the row with the lowest value. This is cleaner than sorting the full DataFrame just to pick one row.
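A three-month toy frame shows the label-then-lookup pattern:

```python
import pandas as pd

trend = pd.DataFrame({"month_name":    ["Jan", "Feb", "Mar"],
                      "total_revenue": [10200, 10880, 8075]})

# idxmax() returns the index *label* of the max row; .loc fetches the full row
peak = trend.loc[trend["total_revenue"].idxmax()]
print(peak["month_name"])  # Feb
```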

Business Insight

December is the peak month at $25,500 — a classic year-end consumer spending surge. May is the trough at $5,760; the peak sits 343% above it. The June spike (+284% MoM) is equally striking and may reflect a mid-year promotion or a large one-off contract. Both June and December need to be flagged for the director as months requiring additional inventory and staffing preparation. The May dip suggests a structural slow period that could be targeted with a promotional campaign.

Checkpoint — try this before moving on: Modify Step 4 to group by both month and region — use df.groupby(["month","region"])["revenue"].sum().unstack(). Which region is responsible for the June spike? Which one drives the December peak? This is how real analysts drill down — one dimension at a time. The answer shapes the next conversation with the sales director.

04

Key Findings

01

West is the top-performing region at $49,320 — 35.2% of total revenue — while East trails at $22,945. The gap of $26,375 between them is 115% and is driven by transaction size, not volume, since both regions completed three transactions.

02

Electronics dominates product revenue at $75,225 — nearly double Apparel — and generates the highest revenue per unit at $85. Despite Apparel moving the most units (1,130), its $32 per unit makes it the least efficient category by value.

03

December is the peak month at $25,500 and May is the trough at $5,760 — a 343% difference. June also spikes sharply (+284% MoM), suggesting either a promotional event or a large contract that needs to be investigated for repeatability.

04

Revenue share is highly concentrated — West and South together account for 65.5% of total revenue. North and East are underperforming relative to their transaction count, suggesting pricing or product mix issues rather than low deal volume.

05

Apparel has a structural pricing problem: it moves the most units but generates the least value per unit. A 10% price increase on Apparel would add approximately $3,600 to annual revenue with zero change in volume — that is an immediate commercial recommendation.

05

Visualisations

Total Revenue by Region
12-month cumulative revenue per region · USD
West   $49,320
South  $42,430
North  $25,310
East   $22,945

Revenue by Product Category
Total revenue vs units sold · highlights value efficiency
Electronics  $75,225
Apparel      $36,160
Home Goods   $28,620

Revenue Per Unit by Category
Higher = more value extracted per item sold
Electronics  $85/unit
Home Goods   $54/unit
Apparel      $32/unit

Monthly Revenue Trend — Full Year
Each bar represents total revenue for that month · highlights seasonal peaks
Jan $10,200 · Feb $10,880 · Mar $8,075 · Apr $11,340 · May $5,760 · Jun $22,100
Jul $7,830 · Aug $12,480 · Sep $9,350 · Oct $9,450 · Nov $7,040 · Dec $25,500

Red = trough (May) · Green = peak (Dec) · Blue = secondary spike (Jun)

06

The Sales EDA Decision Guide

Every sales dataset has the same structure — a time dimension, a categorical dimension, and a numeric target. Here is how to approach each combination systematically:

Question Type                 pandas Approach                     Key Method                     Watch Out For
Who performs best?            Group by category, sum target       groupby().agg("sum")           Ties driven by transaction count vs size
How efficient is each group?  Compute derived metric after agg    col / col (post-agg)           Dividing aggregated cols, not raw rows
When does revenue peak?       Parse dates, group by month         pd.to_datetime() + .dt.month   Date column left as string — silent failure
How fast is it growing?       Period-over-period % change         .shift(1) subtraction          First row always NaN — expected, not an error
What share does each hold?    Divide group sum by total sum       sum / sum * 100                Rounded shares may not add to exactly 100%
07

Analyst's Note


What Would Come Next?

A senior analyst would build a pivot table crossing region and product category to find the single highest-revenue pairing, then layer in a forecasting model to project Q1 revenue using the monthly trend — the number the director needs for budget planning.
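That pivot can be sketched with pivot_table(). The frame below is a three-row toy so the snippet runs standalone — swap in the full 12-row df from Step 1 to answer business question 5 for real:

```python
import pandas as pd

# Toy frame; replace with the full NovaBrands df from Step 1
toy = pd.DataFrame({
    "region":           ["West", "West", "South"],
    "product_category": ["Electronics", "Apparel", "Electronics"],
    "revenue":          [25500, 12480, 22100],
})

# Cross region (rows) with product_category (columns), summing revenue
pivot = toy.pivot_table(index="region", columns="product_category",
                        values="revenue", aggfunc="sum", fill_value=0)

# stack() flattens back to one value per (region, category) pair;
# idxmax() then returns the highest-revenue pairing as a tuple
top_pair = pivot.stack().idxmax()
print(top_pair)  # ('West', 'Electronics')
```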

Limitations of This Analysis

Twelve transactions is a sample. The June spike could be a single large one-off contract or a genuine seasonal pattern — with one data point per month we cannot distinguish between the two without prior-year data.

Business Decisions This Could Drive

The director should investigate the May trough with a promotional campaign and pre-position inventory ahead of June and December based on the seasonal pattern — but only after confirming the pattern repeated in at least one prior year.

Practice Questions

1. Which region generated the highest total revenue in the NovaBrands dataset?



2. Which pandas method moves a Series down by one row so you can compute period-over-period percentage change?



3. Which month was the peak revenue month in the NovaBrands 12-month dataset?



Quiz

1. Why must you run pd.to_datetime() on a date column before doing time-based grouping?


2. Which pandas pattern lets you compute sum, count, and mean on different columns simultaneously in a single groupby operation?


3. Which product category is the most efficiently priced in the NovaBrands dataset, and why?


Up Next · Case Study 3

Movie Ratings EDA

You are handed a dataset of 10,000 movie ratings. Which genres score highest? Does runtime affect rating? Do critic scores and audience scores actually agree?