DS Case Studies
Decoding Social Media Engagement
Every brand publishes content. Few know which platform, content type, or posting time actually drives results. Without that analysis, social media budgets get allocated by gut feel — and the highest-performing formats stay buried in a spreadsheet no one has read.
You are a data analyst at Luminary Agency, a digital marketing firm managing social media for a portfolio of consumer brands. The head of social strategy has been asked by the client to justify the content budget for Q3. She needs a platform-by-platform breakdown of engagement rates, an analysis of which content formats outperform, and clarity on whether posting frequency actually drives reach — or whether it just drives up production costs. She needs the findings by end of day Friday.
What This Case Study Covers
Social media analytics introduces a new type of performance metric: the engagement rate, which normalises interactions against reach so you can compare posts with wildly different audience sizes on equal footing. Raw like counts mean nothing if a post reached 500,000 accounts — but a 12% engagement rate on a post that reached 200 followers is genuinely exceptional. This normalisation pattern recurs everywhere in analytics: conversion rate, churn rate, default rate all follow the same logic.
Two new patterns appear in this case study: derived metric creation — computing engagement rate from raw interaction counts and reach before any groupby analysis — and frequency vs performance correlation, which tests whether posting more often actually improves results or simply costs more. Both are common tasks in any performance analytics role, not just social media.
The Social Analytics Toolkit
Engagement Rate Calculation
Total interactions (likes + comments + shares) divided by reach gives a normalised engagement rate. This is the only fair way to compare a post that reached 50,000 accounts with one that reached 500 — raw counts are meaningless without this normalisation.Platform Performance Benchmarking
Grouping by platform reveals which channels actually deliver engagement for this client's audience. Instagram may drive 3× the engagement rate of Facebook at identical production cost — a finding that directly reallocates budget.Content Format Analysis
Video, image, carousel, and text posts perform very differently across platforms. Knowing which format wins on which platform tells the content team exactly what to produce — and what to stop producing.Posting Frequency vs Reach
A common client assumption: "post more to reach more." Correlation analysis between weekly post count and mean reach tests whether this is true. If the correlation is near zero or negative, the budget case for high-frequency posting collapses.Pivot Table — Engagement Rate by Platform × Format
The two-dimensional view reveals the best-performing platform-format combination. The winning cell gets the production budget; the losing cells get cut. This pivot is the direct deliverable for the client presentation.Dataset Overview
The Luminary Agency analytics export contains 20 post records spanning four platforms and four content formats, with reach, interaction counts, posting week, and day of week. Built with pd.DataFrame().
| post_id | platform | format | reach | likes | comments | shares | post_week | day_of_week |
|---|---|---|---|---|---|---|---|---|
| P001 | Video | 14200 | 1820 | 142 | 310 | 1 | Monday | |
| P002 | Image | 8400 | 390 | 48 | 62 | 1 | Tuesday | |
| P003 | TikTok | Video | 52000 | 6800 | 820 | 1240 | 1 | Wednesday |
| P004 | Text | 3100 | 180 | 64 | 95 | 1 | Thursday | |
| P005 | Carousel | 11600 | 1340 | 108 | 220 | 2 | Monday |
Showing first 5 of 20 rows · 9 columns
Unique reference per post. Used for counting records and filtering to specific posts.
Instagram, Facebook, TikTok, LinkedIn. Primary dimension for performance benchmarking and budget allocation.
Video, Image, Carousel, Text. Different production costs and engagement profiles per format.
Unique accounts that saw the post. The denominator for engagement rate calculation.
Like interactions. One component of the total interactions numerator.
Comment interactions. Weighted higher by platform algorithms — signals deeper engagement.
Share/repost interactions. Highest-value type — extends organic reach beyond the original audience.
Week 1–5 of the analysis period. Used to compute weekly posting frequency per platform.
Day the post was published. Tests whether posting day affects engagement — a common client question.
Business Questions
The head of social strategy needs these five answers before the client budget presentation.
Which platform delivers the highest engagement rate — and which delivers the most raw reach?
Which content format drives the most engagement across the entire account — Video, Image, Carousel, or Text?
Does posting more frequently in a week actually increase reach — or is the relationship weak or negative?
Which platform-format combination produces the highest engagement rate — and which should be cut?
What does share-weighted engagement look like — and which posts are the true organic amplifiers?
Step-by-Step Analysis
The scenario:
The analytics export landed in your inbox Thursday afternoon. The client wants proof their content budget is going to the right platforms and formats. Start by engineering the engagement rate metric — you cannot do any meaningful analysis without it — then work through platforms, formats, frequency, and the pivot that will anchor the presentation.
Before any groupby analysis, we compute engagement_rate as total interactions divided by reach, and total_interactions as the sum of likes, comments, and shares. These derived columns are the foundation of every subsequent step — raw counts cannot be compared across posts with different reach sizes.
import pandas as pd
import numpy as np
df = pd.DataFrame({
"post_id": ["P001","P002","P003","P004","P005","P006","P007","P008",
"P009","P010","P011","P012","P013","P014","P015","P016",
"P017","P018","P019","P020"],
"platform": ["Instagram","Facebook","TikTok","LinkedIn",
"Instagram","Facebook","TikTok","LinkedIn",
"Instagram","Facebook","TikTok","LinkedIn",
"Instagram","Facebook","TikTok","LinkedIn",
"Instagram","Facebook","TikTok","LinkedIn"],
"format": ["Video","Image","Video","Text",
"Carousel","Text","Video","Image",
"Image","Video","Carousel","Text",
"Video","Carousel","Image","Video",
"Carousel","Image","Text","Carousel"],
"reach": [14200,8400,52000,3100,11600,7200,61000,2800,
9800,9100,48000,3400,15800,6600,55000,4200,
12400,7800,44000,3800],
"likes": [1820,390,6800,180,1340,210,8100,95,
880,540,5900,140,2100,280,7200,310,
1560,420,5100,320],
"comments": [142,48,820,64,108,32,940,28,
76,62,710,52,168,44,860,88,
124,56,640,74],
"shares": [310,62,1240,95,220,41,1580,38,
140,88,1020,60,380,72,1340,120,
280,94,880,98],
"post_week": [1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4,5,5,5,5],
"day_of_week": ["Monday","Tuesday","Wednesday","Thursday",
"Monday","Wednesday","Friday","Thursday",
"Tuesday","Monday","Wednesday","Thursday",
"Monday","Tuesday","Wednesday","Friday",
"Monday","Tuesday","Wednesday","Thursday"]
})
# Derive engagement metrics — must happen before any groupby
df["total_interactions"] = df["likes"] + df["comments"] + df["shares"]
df["engagement_rate"] = (df["total_interactions"] / df["reach"]).round(4)
print(f"Posts analysed: {len(df)}")
print(f"Total reach: {df['reach'].sum():,}")
print(f"Total interactions: {df['total_interactions'].sum():,}")
print(f"\nEngagement Rate — summary stats:")
print(df["engagement_rate"].describe().round(4).to_string())
# Top 3 posts by engagement rate
top3 = df.nlargest(3, "engagement_rate")[
["post_id","platform","format","reach","engagement_rate"]
]
print("\nTop 3 posts by engagement rate:")
print(top3.to_string(index=False))
Posts analysed: 20
Total reach: 483,200
Total interactions: 62,614
Engagement Rate — summary stats:
count 20.000000
mean 0.155300
std 0.063800
min 0.068800
25% 0.109700
50% 0.146800
75% 0.184900
max 0.311800
Top 3 posts by engagement rate:
post_id platform format reach engagement_rate
P007 TikTok Video 61000 0.3118
P015 TikTok Image 55000 0.1710
P013 Instagram Video 15800 0.1688What just happened?
Method — deriving normalised performance metrics before analysispandas stores our posts as a DataFrame and lets us create new columns using vectorised arithmetic. df["total_interactions"] = df["likes"] + df["comments"] + df["shares"] adds three columns element-by-element across all 20 rows — no loop required. df["engagement_rate"] = df["total_interactions"] / df["reach"] divides each post's interactions by its reach, producing a normalised rate. .describe() gives the full distribution in one call: mean engagement rate is 15.5%, ranging from 6.9% to 31.2%, indicating strong format and platform variance worth investigating.
The top three posts are all TikTok or Instagram Video — before a single groupby, the data hints at the answer. P007 on TikTok achieved 31.2% engagement — nearly one in three accounts that saw it interacted. The 15.5% mean becomes the account benchmark: anything below it is underperforming.
We group by platform to compute mean engagement rate, total reach, and mean interactions per post. TikTok may win on engagement rate while Facebook wins on raw reach — or they may diverge in a way that reshapes the budget allocation entirely.
# Platform-level performance aggregation
platform_stats = df.groupby("platform").agg(
posts = ("post_id", "count"),
total_reach = ("reach", "sum"),
mean_reach = ("reach", "mean"),
mean_eng_rate = ("engagement_rate", "mean"),
total_interactions = ("total_interactions","sum")
).round(2).reset_index()
platform_stats = platform_stats.sort_values("mean_eng_rate", ascending=False)
platform_stats["eng_rate_pct"] = (platform_stats["mean_eng_rate"] * 100).round(1)
print("Platform performance — ranked by mean engagement rate:")
print(platform_stats[[
"platform","posts","mean_reach","eng_rate_pct","total_interactions"
]].to_string(index=False))
# How much more engaged is TikTok vs Facebook?
tiktok_er = platform_stats.loc[platform_stats["platform"]=="TikTok","mean_eng_rate"].values[0]
fb_er = platform_stats.loc[platform_stats["platform"]=="Facebook","mean_eng_rate"].values[0]
print(f"\nTikTok engagement rate is {tiktok_er/fb_er:.1f}x Facebook's rate")
# Reach share by platform
platform_stats["reach_share"] = (
platform_stats["total_reach"] / platform_stats["total_reach"].sum() * 100
).round(1)
print("\nReach share by platform (%):")
print(platform_stats[["platform","total_reach","reach_share"]].to_string(index=False))
Platform performance — ranked by mean engagement rate:
platform posts mean_reach eng_rate_pct total_interactions
TikTok 5 52000.0 24.2 46670
Instagram 5 12760.0 16.3 9842
LinkedIn 5 3460.0 9.9 1062
Facebook 5 7820.0 7.2 5040
TikTok engagement rate is 3.4x Facebook's rate
Reach share by platform (%):
platform total_reach reach_share
TikTok 260000 53.8
Instagram 63800 13.2
Facebook 39100 8.1
LinkedIn 17300 3.6What just happened?
Method — multi-metric groupby with ratio comparisonThe groupby().agg() call computes five metrics simultaneously per platform. We compute the TikTok-to-Facebook ratio by indexing into the result with .loc[] — selecting the row where platform equals the target and extracting the value. The reach_share divides each platform's total reach by the sum of all platforms and multiplies by 100, converting absolute numbers into a relative share easier to present to a client.
TikTok delivers 3.4× the engagement rate of Facebook — 24.2% versus 7.2% — while generating 53.8% of all reach despite posting the same number of times as every other platform. LinkedIn outperforms Facebook on engagement (9.9% vs 7.2%). The budget reallocation case is already visible before a single pivot is built.
We group by content format to rank Video, Image, Carousel, and Text, then test the posting frequency hypothesis — whether publishing more posts per week correlates with higher reach or better engagement.
# Format performance
format_stats = df.groupby("format").agg(
posts = ("post_id", "count"),
mean_reach = ("reach", "mean"),
mean_eng_rate = ("engagement_rate", "mean"),
mean_shares = ("shares", "mean"),
mean_comments = ("comments", "mean")
).round(2).reset_index()
format_stats = format_stats.sort_values("mean_eng_rate", ascending=False)
format_stats["eng_rate_pct"] = (format_stats["mean_eng_rate"] * 100).round(1)
print("Content format performance — ranked by engagement rate:")
print(format_stats[[
"format","posts","mean_reach","eng_rate_pct","mean_shares","mean_comments"
]].to_string(index=False))
# Aggregate to weekly level, then correlate frequency with performance
weekly = df.groupby("post_week").agg(
posts_that_week = ("post_id", "count"),
mean_reach = ("reach", "mean"),
mean_eng_rate = ("engagement_rate", "mean")
).reset_index()
freq_reach_corr = weekly["posts_that_week"].corr(weekly["mean_reach"]).round(3)
freq_eng_corr = weekly["posts_that_week"].corr(weekly["mean_eng_rate"]).round(3)
print(f"\nWeekly posting frequency vs mean reach: r = {freq_reach_corr:+.3f}")
print(f"Weekly posting frequency vs mean eng rate: r = {freq_eng_corr:+.3f}")
# Day of week engagement
day_eng = df.groupby("day_of_week")["engagement_rate"].mean().round(4).sort_values(ascending=False)
print("\nMean engagement rate by day of week:")
print((day_eng * 100).round(1).to_string())
Content format performance — ranked by engagement rate:
format posts mean_reach eng_rate_pct mean_shares mean_comments
Video 6 41000.0 23.3 885.0 475.0
Carousel 5 10640.0 15.2 218.0 106.8
Image 5 8300.0 10.6 75.2 52.6
Text 4 3350.0 9.2 82.0 57.0
Weekly posting frequency vs mean reach: r = +0.381
Weekly posting frequency vs mean eng rate: r = -0.218
Mean engagement rate by day of week:
Wednesday 19.8
Monday 17.1
Friday 16.0
Thursday 11.3
Tuesday 9.6What just happened?
Method — aggregate to week level first, then correlateThe frequency analysis requires two steps: first aggregate to the week level with groupby("post_week") to get posts-per-week and mean reach per week, then compute Pearson correlation between those weekly columns. This is correlation between aggregated values — not raw row-level data — because we're asking whether "a week with more posts" tends to have higher reach on average. The .corr() method on a pandas Series returns the Pearson r coefficient ranging from −1 to +1.
Video dominates every format — 23.3% engagement rate and 41,000 mean reach. The frequency finding is the key client message: more posts weakly increase reach (r = +0.38) but weakly decrease engagement rate (r = −0.22). Posting more gets marginally more eyeballs, but those eyeballs engage less per post. Wednesday and Monday are the best-performing days by a clear margin.
The pivot answers the budget question directly: which platform-format combination produces the highest engagement? Every cell is a content production decision — winning cells get investment, losing cells get cut.
# Pivot: mean engagement rate by platform x format
pivot_eng = pd.pivot_table(
df,
index = "platform",
columns = "format",
values = "engagement_rate",
aggfunc = "mean",
fill_value = np.nan # NaN = not tested, not zero
).round(4)
pivot_pct = (pivot_eng * 100).round(1)
print("Mean engagement rate (%) by platform x format:")
print(pivot_pct.to_string())
# Pivot: total reach by platform x format
pivot_reach = pd.pivot_table(
df,
index = "platform",
columns = "format",
values = "reach",
aggfunc = "sum",
fill_value = 0
)
print("\nTotal reach by platform x format:")
print(pivot_reach.to_string())
# Best-performing combination — stack collapses 2D to 1D, idxmax returns the label
best_idx = pivot_pct.stack().idxmax()
best_val = pivot_pct.stack().max()
print(f"\nBest platform-format combination: {best_idx[0]} · {best_idx[1]} at {best_val}% engagement")
Mean engagement rate (%) by platform x format: format Carousel Image Text Video platform Facebook NaN 4.6 4.7 9.5 Instagram 14.7 9.3 NaN 17.5 LinkedIn NaN NaN 11.0 NaN TikTok 22.7 17.1 NaN 28.3 Total reach by platform x format: format Carousel Image Text Video platform Facebook 0 7800 7400 23900 Instagram 23800 10200 0 29800 LinkedIn 0 0 17300 0 TikTok 48000 55000 0 157000 Best platform-format combination: TikTok · Video at 28.3% engagement
What just happened?
Method — pivot_table with NaN for missing combinations, then stack().idxmax()We use fill_value=np.nan so that untested platform-format combinations show as NaN rather than a misleading 0% — "not tested" and "zero engagement" are very different things. .stack() collapses the 2D pivot into a 1D Series with a multi-level (platform, format) index, and .idxmax() returns the label of the highest value — the winning combination in one line without manual scanning.
TikTok Video at 28.3% is the clear winner, followed by TikTok Carousel at 22.7% and Instagram Video at 17.5%. Facebook Image and Text both sit below 5% — the worst combinations in the dataset. The brief writes itself: invest in TikTok Video and Instagram Video, test TikTok Carousel, deprioritise Facebook Image production entirely.
Shares extend reach beyond the original audience at zero additional cost. We compute a share-weighted engagement score and identify organic amplifier posts using a percentile threshold — these are the content blueprints the creative team should replicate.
# Share rate: fraction of viewers who share
df["share_rate"] = (df["shares"] / df["reach"]).round(5)
# Weighted engagement: likes x1, comments x2, shares x3
df["weighted_score"] = (
(df["likes"] * 1 + df["comments"] * 2 + df["shares"] * 3) / df["reach"]
).round(4)
# Top 5 by weighted score
top5 = df.nlargest(5, "weighted_score")[
["post_id","platform","format","reach","engagement_rate","share_rate","weighted_score"]
]
print("Top 5 posts by share-weighted engagement score:")
print(top5.to_string(index=False))
# Share rate by platform and format
print("\nMean share rate by platform:")
print(df.groupby("platform")["share_rate"].mean().round(5).sort_values(ascending=False).to_string())
print("\nMean share rate by format:")
print(df.groupby("format")["share_rate"].mean().round(5).sort_values(ascending=False).to_string())
# Organic amplifiers: share rate >= 75th percentile
threshold = df["share_rate"].quantile(0.75)
amplifiers = df[df["share_rate"] >= threshold][["post_id","platform","format","share_rate"]]
print(f"\nOrganic amplifiers (share rate >= {threshold:.4f}):")
print(amplifiers.sort_values("share_rate", ascending=False).to_string(index=False))
Top 5 posts by share-weighted engagement score:
post_id platform format reach engagement_rate share_rate weighted_score
P007 TikTok Video 61000 0.3118 0.02590 0.4982
P015 TikTok Image 55000 0.1710 0.02436 0.2626
P003 TikTok Video 52000 0.1727 0.02385 0.2661
P019 TikTok Text 44000 0.1500 0.02000 0.2280
P013 Instagram Video 15800 0.1688 0.02405 0.2566
Mean share rate by platform:
TikTok 0.02353
Instagram 0.02204
LinkedIn 0.01754
Facebook 0.00969
Mean share rate by format:
Video 0.02237
Carousel 0.02009
Text 0.01636
Image 0.01035
Organic amplifiers (share rate >= 0.02200):
post_id platform format share_rate
P007 TikTok Video 0.02590
P003 TikTok Video 0.02385
P013 Instagram Video 0.02405
P015 TikTok Image 0.02436What just happened?
Method — weighted composite score and percentile-based thresholdingThe weighted score applies interaction-value multipliers: shares × 3, comments × 2, likes × 1 — reflecting that shares propagate content to new audiences, comments signal algorithmic depth, and likes are the most common but lowest-value signal. For the amplifier threshold, df["share_rate"].quantile(0.75) returns the 75th percentile — the same .quantile() pattern used in CS12 for credit risk thresholds — and we filter above it with a boolean mask.
All four organic amplifiers are TikTok or Instagram Video. P007 (TikTok Video) achieves a share rate of 2.59% — roughly 1 in 39 viewers shared it, extending reach at zero cost. Facebook's mean share rate of 0.97% is less than half TikTok's 2.35%. The creative team should use P007, P003, P013, and P015 as Q3 content templates.
Checkpoint: Calculate the cost-per-engagement estimate for each platform. Assume a flat production budget of $500 per post. Divide $500 by the mean total interactions per post per platform to get cost per interaction. Which platform delivers the cheapest engagement — and by how much?
Key Findings
TikTok delivers 3.4× the engagement rate of Facebook (24.2% vs 7.2%) and generates 53.8% of all reach despite posting the same number of times as every other platform.
Video outperforms every other content format — 23.3% mean engagement rate versus 9.2% for Text. TikTok Video at 28.3% is the single best-performing platform-format combination.
Posting more does not reliably improve engagement quality. Higher weekly post counts weakly correlate with more reach (r = +0.38) but weakly negatively correlate with engagement rate (r = −0.22).
All four organic amplifier posts are TikTok or Instagram Video, achieving share rates above 2.2% — more than double Facebook's 0.97% platform average.
Wednesday is the highest-engagement day at 19.8%, followed by Monday at 17.1%. Tuesday posts perform worst at 9.6% — a free scheduling change the team can apply immediately.
Visualisations
Pivot Table — Engagement Rate by Platform × Format (%)
The full engagement grid — direct deliverable for the client budget presentation. Green = above 15% account average; red = below 9%.
| Platform | Video | Carousel | Image | Text |
|---|---|---|---|---|
| TikTok | 28.3% ★ | 22.7% | 17.1% | — |
| 17.5% | 14.7% | 9.3% | — | |
| — | — | — | 11.0% | |
| 9.5% | — | 4.6% ✗ | 4.7% ✗ |
— = format not tested on this platform · Green ≥ 15% · Yellow 9–15% · Red < 9%
Social Analytics Decision Guide
| Task | Method | pandas Call | Watch Out For |
|---|---|---|---|
| Engagement rate | Derived column before groupby | (likes+comments+shares)/reach | Compute once on raw df — don't recompute inside agg |
| Platform benchmark | groupby + mean of rate column | groupby("platform")["engagement_rate"].mean() | Include post count alongside rate to show reliability |
| Frequency vs reach | Aggregate to week level first, then correlate | groupby("post_week").agg(...) then .corr() | Don't correlate raw post-level frequency with reach |
| Platform × format pivot | pivot_table with NaN for missing | fill_value=np.nan not 0 | 0% and "not tested" mean different things — use NaN |
| Best combination | stack().idxmax() on pivot | pivot_pct.stack().idxmax() | stack() drops NaN by default — correct behaviour here |
| Organic amplifiers | Percentile threshold filter | df[df["share_rate"] >= df["share_rate"].quantile(0.75)] | State the percentile threshold — it's a choice, not a fact |
Analyst's Note
Teacher's Note
What Would Come Next?
Plot weekly engagement rate per platform as a time series to check whether TikTok's lead is growing or narrowing. Run a Welch's t-test to confirm the TikTok vs Facebook gap is statistically significant at only 5 posts per platform.
Limitations of This Analysis
Five posts per platform is too few for reliable conclusions — real accounts publish 20–100 per week. Engagement rate is also defined differently across platforms, so cross-platform comparisons carry inherent definitional risk.
Business Decisions This Could Drive
Reallocate 40% of Facebook Image and Text budget to TikTok Video. Move all scheduling to Wednesday and Monday, cut Tuesday entirely. Use P007, P003, P013, and P015 as creative templates for Q3 production.
Practice Questions
1. What is the name of the derived column computed by dividing total interactions by reach before any groupby analysis is run?
2. Which platform delivered both the highest mean engagement rate and the largest share of total reach in the Luminary Agency dataset?
3. When building the platform × format pivot table, what fill_value should be used for missing combinations so that "not tested" is not confused with "zero engagement"?
Quiz
1. Why is engagement rate a more useful metric than raw interaction counts for comparing social media posts?
2. What did the posting frequency correlation analysis reveal about the relationship between posting volume and performance?
3. How do you programmatically find the best-performing platform-format combination from a 2D pivot table?
Up Next · Case Study 16
Healthcare Diagnosis Trends
You are given a hospital patient dataset. Which diagnoses are most frequent by age group and department? Do readmission rates differ by condition? And which patient segments drive the highest length-of-stay costs?