DS Case Studies
Comparing Movie Ratings Data
Everyone has an opinion about movies. Critics write reviews. Audiences leave scores. Platforms aggregate both. But do the two groups actually agree — or are they rating completely different things? Your job today is to find out using data.
You are a data analyst at StreamVault, a streaming platform building a personalised recommendation engine. Before the engineering team can weight critic scores vs audience scores in the algorithm, they need to know how correlated the two are, which genres perform differently between groups, and whether runtime or release year affects ratings. You have 12 movies, pandas, and a deadline.
What This Case Study Covers
Movie ratings datasets are a staple of beginner EDA because they combine numeric analysis, categorical grouping, and correlation in a context that is immediately intuitive. The challenge is not computing the numbers — it is interpreting what they mean for a real product decision. A recommendation engine that weights critic scores heavily will behave very differently from one that weights audience scores, and this analysis determines which approach is better supported by the data.
This case study covers four analytical layers: rating distribution analysis to understand the spread and central tendency of both scoring systems, genre-level grouping to find which genres critics and audiences rate differently, correlation analysis to measure how aligned the two scoring systems actually are, and runtime and year pattern analysis to check whether non-rating variables influence scores.
The Movie EDA Toolkit
Rating Distribution
Compute mean, median, and standard deviation for both critic and audience scores. A high std means ratings are polarising — not all movies are receiving the same treatment from each group.
Genre-Level Aggregation
Group by genre and compute mean critic and audience scores per group. The gap between the two scores within a genre reveals systematic disagreement — critics and audiences may value genre conventions differently.
Score Gap Analysis
Create a derived column — audience score minus critic score — for every movie. Positive means audiences liked it more than critics. Negative means critics were more generous. Sorting by this gap reveals the most divisive films.
Correlation Analysis
Compute Pearson correlation between critic score, audience score, runtime, and release year. A high critic-audience correlation means the two systems are interchangeable for the recommendation engine. A low one means they must be weighted separately.
Runtime and Year Patterns
Check whether longer movies or newer movies rate higher. These are confounding variables — if runtime correlates with rating, the recommendation engine needs to control for it to avoid recommending only long films.
Dataset Overview
StreamVault's ratings database contains 12 movies spanning five genres, with both critic scores (from professional reviewers, out of 100) and audience scores (from platform users, out of 100). We build it directly with pd.DataFrame().
| movie_id | title | genre | release_year | runtime_min | critic_score | audience_score |
|---|---|---|---|---|---|---|
| M001 | Stellar Drift | Sci-Fi | 2021 | 132 | 88 | 74 |
| M002 | The Last Quarter | Drama | 2019 | 118 | 91 | 85 |
| M003 | Neon Fists | Action | 2022 | 105 | 62 | 81 |
| M004 | Whisper Valley | Horror | 2020 | 98 | 79 | 65 |
| M005 | Laugh Track | Comedy | 2023 | 94 | 55 | 78 |
Showing first 5 of 12 rows · 7 columns
movie_id — Unique ID per film. Used for deduplication only — never analysed directly.
title — Film title. Used for labelling outputs, not for numerical analysis.
genre — Sci-Fi, Drama, Action, Horror, Comedy. The primary grouping dimension for the genre-level analysis.
release_year — Year of release. Used to check whether newer films rate differently — a potential confounding variable.
runtime_min — Film length in minutes. Checked for correlation with ratings — longer films may score higher or lower systematically.
critic_score — Aggregated professional critic score out of 100. One of the two primary target variables in this analysis.
audience_score — Aggregated platform user score out of 100. The other primary target variable — compared directly against critic_score throughout.
Business Questions
The engineering team needs answers to these questions before they can configure the recommendation algorithm. Each maps to a specific step in the analysis.
What is the average critic score and average audience score across all films — and which group scores more generously overall?
Which genre shows the largest gap between critic and audience scores — and in which direction does the disagreement go?
How strongly correlated are critic scores and audience scores — can the two be treated as interchangeable in the algorithm?
Which individual film has the largest gap between what critics thought and what audiences scored it?
Does runtime or release year show any meaningful correlation with either rating system?
Step-by-Step Analysis
The scenario:
The engineering team has sent you a Slack message: "We need to know if we can just use one score or if we need to keep both. Also, which genres are most controversial between critics and audiences?" Open your notebook and work through the dataset systematically — they need a written answer by tomorrow morning.
The first question is the simplest: who scores more generously on average, critics or audiences? We load the data, inspect it, and compute summary statistics for both score columns simultaneously.
import pandas as pd
import numpy as np
# Build the StreamVault movie ratings dataset
df = pd.DataFrame({
"movie_id": ["M001","M002","M003","M004","M005","M006",
"M007","M008","M009","M010","M011","M012"],
"title": ["Stellar Drift","The Last Quarter","Neon Fists",
"Whisper Valley","Laugh Track","Deep Meridian",
"City of Glass","Red Horizon","The Quiet Shore",
"Iron Carnival","Shadow Protocol","Sunfall"],
"genre": ["Sci-Fi","Drama","Action","Horror","Comedy","Drama",
"Sci-Fi","Action","Drama","Comedy","Horror","Sci-Fi"],
"release_year": [2021,2019,2022,2020,2023,2018,2022,2021,2020,2023,2019,2022],
"runtime_min": [132,118,105,98,94,141,127,110,124,88,103,115],
"critic_score": [88,91,62,79,55,94,83,58,89,48,76,85],
"audience_score":[74,85,81,65,78,80,71,76,88,72,60,79]
})
# Basic shape and types
print("Shape:", df.shape)
print("\nData types:")
print(df.dtypes)
# Summary stats for both score columns side by side
# Selecting both score columns at once keeps the comparison clean
score_summary = df[["critic_score","audience_score"]].describe().round(1)
print("\nRating distribution summary:")
print(score_summary)
# Who scores more generously on average?
print(f"\nMean critic score: {df['critic_score'].mean():.1f}")
print(f"Mean audience score: {df['audience_score'].mean():.1f}")
print(f"Difference: {df['critic_score'].mean() - df['audience_score'].mean():.1f} points")
Shape: (12, 7)
Data types:
movie_id object
title object
genre object
release_year int64
runtime_min int64
critic_score int64
audience_score int64
dtype: object
Rating distribution summary:
critic_score audience_score
count 12.0 12.0
mean 75.7 75.8
std 15.8 8.0
min 48.0 60.0
25% 61.0 71.8
50% 81.0 77.0
75% 88.2 80.2
max 94.0 88.0
Mean critic score: 75.7
Mean audience score: 75.8
Difference: -0.1 points
What just happened?
Method — .describe() on multiple columns
By passing a list of column names to the DataFrame selector — df[["col1","col2"]] — and chaining .describe(), we get a side-by-side summary table for both score columns in one call. This is far more useful than running .describe() on the full DataFrame, which would also include runtime and release year in the same output and make the comparison harder to read.
The mean scores are almost identical — critics average 75.7 and audiences average 75.8, a difference of just 0.1 points. On the surface this looks like strong agreement. But look at the standard deviation: critics have a std of 15.8 while audiences have just 8.0. Critics are far more spread out in their opinions — some films they love (94) and some they dislike (48). Audiences cluster more tightly together. The means are the same but the distributions are fundamentally different — and this matters enormously for the recommendation algorithm.
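To see why the spread matters even when the means agree, here is a tiny illustration with made-up numbers (not the StreamVault data): two score sets share a mean but behave very differently.

```python
import pandas as pd

# Two hypothetical score sets with the same mean but very different spread
critics = pd.Series([48, 60, 75, 90, 95, 86])
audience = pd.Series([70, 72, 75, 76, 78, 83])

print(critics.mean(), audience.mean())                    # identical central tendency
print(round(critics.std(), 1), round(audience.std(), 1))  # spread differs a lot
```

A ranking built on the first series would swing between extremes; one built on the second barely separates the films.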
Equal overall averages can hide very unequal genre-level behaviour. We group by genre, compute mean scores for both groups, and create a derived column showing the gap — answering business questions 2 and 4 together.
# Create a score gap column for every movie
# Positive = audiences scored higher than critics (audience-friendly film)
# Negative = critics scored higher (critic-friendly film)
df["score_gap"] = df["audience_score"] - df["critic_score"]
# Group by genre and compute mean critic score, audience score, and score gap
genre_analysis = df.groupby("genre").agg(
mean_critic = ("critic_score", "mean"),
mean_audience = ("audience_score", "mean"),
mean_gap = ("score_gap", "mean"),
film_count = ("movie_id", "count")
).round(1).reset_index()
# Sort by the absolute score gap — biggest disagreement first
genre_analysis["abs_gap"] = genre_analysis["mean_gap"].abs()
genre_analysis = genre_analysis.sort_values("abs_gap", ascending=False)
print("Genre-level rating comparison:")
print(genre_analysis.drop("abs_gap", axis=1).to_string(index=False))
# Film-level: which single movie has the largest gap?
print("\nMost divisive individual films (by score gap):")
divisive = df[["title","genre","critic_score","audience_score","score_gap"]]
divisive = divisive.sort_values("score_gap", key=abs, ascending=False)
print(divisive.head(5).to_string(index=False))
Genre-level rating comparison:
genre mean_critic mean_audience mean_gap film_count
Comedy 51.5 75.0 23.5 2
Action 60.0 78.5 18.5 2
Horror 77.5 62.5 -15.0 2
Sci-Fi 85.3 74.7 -10.7 3
Drama 91.3 84.3 -7.0 3
Most divisive individual films (by score gap):
title genre critic_score audience_score score_gap
Iron Carnival Comedy 48 72 24
Laugh Track Comedy 55 78 23
Neon Fists Action 62 81 19
Red Horizon Action 58 76 18
Shadow Protocol Horror 76 60 -16
What just happened?
Method — derived column before groupby
We created the score_gap column on the raw DataFrame before grouping. This means when we aggregate with .agg(), we can compute the mean gap directly rather than trying to subtract two already-aggregated means — which would give the same result mathematically but is less clean. Creating derived columns before groupby is the standard pattern whenever you need to aggregate a computed value.
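A quick sanity check on that claim, using a toy frame with hypothetical numbers: because the mean is linear, the mean of a pre-computed gap column equals the difference of the aggregated means.

```python
import pandas as pd

# Toy frame: the gap is computed per row before any grouping
toy = pd.DataFrame({"genre": ["A", "A", "B"],
                    "critic": [60, 70, 80],
                    "audience": [80, 80, 70]})
toy["gap"] = toy["audience"] - toy["critic"]

g = toy.groupby("genre").mean(numeric_only=True)
# the mean is linear, so mean-of-gap equals difference of the means
print((g["gap"] == g["audience"] - g["critic"]).all())
```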
.sort_values("score_gap", key=abs) sorts by the absolute value of the gap — meaning both +24 and -14 rank above +5, regardless of direction. Without key=abs, a simple sort would separate positive and negative gaps and miss the most divisive films in one ranked list.
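A minimal demonstration of the difference, using hypothetical gap values rather than the case-study data:

```python
import pandas as pd

# Hypothetical gap values to contrast a plain sort with a magnitude sort
gaps = pd.DataFrame({"film": ["A", "B", "C", "D"],
                     "gap": [5, -14, 24, -2]})

plain = gaps.sort_values("gap", ascending=False)["film"].tolist()
by_magnitude = gaps.sort_values("gap", key=abs, ascending=False)["film"].tolist()

print(plain)         # sign-aware: negatives sink to the bottom
print(by_magnitude)  # magnitude-aware: -14 outranks +5
```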
Comedy and Action are audience-friendly — audiences rate them 18–24 points higher than critics on average. Critics likely penalise genre conventions and predictable plots that audiences enjoy. Horror and Sci-Fi run the opposite direction — critics reward them more than audiences, possibly for atmospheric or technical craft that general viewers find less engaging. For the recommendation engine, this means critic scores and audience scores should be weighted separately by genre — not combined into a single signal.
The engineering team's core question: are critic and audience scores interchangeable? We compute the full correlation matrix across all numeric columns — answering business questions 3 and 5 in one block.
# Compute Pearson correlation across all numeric columns
# This gives us critic vs audience, plus runtime and year as potential confounders
corr_matrix = df[["critic_score","audience_score",
"runtime_min","release_year"]].corr().round(3)
print("Full correlation matrix:")
print(corr_matrix.to_string())
# Extract just the critic-audience correlation for the business answer
ca_corr = corr_matrix.loc["critic_score","audience_score"]
print(f"\nCritic vs Audience correlation: {ca_corr:.3f}")
# Interpret strength
if abs(ca_corr) >= 0.7:
strength = "strong — the two scores move together reliably"
elif abs(ca_corr) >= 0.4:
strength = "moderate — some alignment but significant divergence"
else:
strength = "weak — the two scores are largely independent"
print(f"Interpretation: {strength}")
# Check runtime and year correlations with both score types
print("\nCorrelations with runtime_min:")
print(f" critic_score: {corr_matrix.loc['critic_score','runtime_min']:.3f}")
print(f" audience_score: {corr_matrix.loc['audience_score','runtime_min']:.3f}")
print("\nCorrelations with release_year:")
print(f" critic_score: {corr_matrix.loc['critic_score','release_year']:.3f}")
print(f" audience_score: {corr_matrix.loc['audience_score','release_year']:.3f}")
Full correlation matrix:
critic_score audience_score runtime_min release_year
critic_score 1.000 0.214 0.814 -0.689
audience_score 0.214 1.000 0.376 -0.003
runtime_min 0.814 0.376 1.000 -0.514
release_year -0.689 -0.003 -0.514 1.000
Critic vs Audience correlation: 0.214
Interpretation: weak — the two scores are largely independent
Correlations with runtime_min:
critic_score: 0.814
audience_score: 0.376
Correlations with release_year:
critic_score: -0.689
audience_score: -0.003
What just happened?
Method — .corr() full matrix
Calling .corr() on a DataFrame with multiple columns produces a symmetric N×N matrix — every variable correlated against every other variable. The diagonal is always 1.0 (a variable is perfectly correlated with itself). We then use .loc["row","col"] to extract individual cells by label — cleaner and safer than using positional .iloc[] on a correlation matrix where column order could change.
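The shape of the matrix and the label-based access can be seen on a tiny synthetic frame (toy columns, not the movie data):

```python
import pandas as pd

# Synthetic columns: y rises exactly with x, z falls exactly with x
toy = pd.DataFrame({"x": [1, 2, 3, 4], "y": [2, 4, 6, 8], "z": [4, 3, 2, 1]})
m = toy.corr()

print(m.shape)                    # every numeric column against every other
print(round(m.loc["x", "y"], 3))  # perfect positive relationship
print(round(m.loc["x", "z"], 3))  # perfect negative relationship
```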
The critic-audience correlation is just 0.214 — weak. This directly answers the engineering team's question: the two scores cannot be treated as interchangeable. A film that critics love does not reliably mean audiences will love it, and vice versa. Both scores must be kept as separate signals in the algorithm. Additionally, runtime correlates strongly with critic scores at 0.814 and only moderately with audience scores at 0.376 — longer films are rewarded far more by critics than by audiences, confirming runtime as a confounding variable that needs to be controlled for. Release year also correlates negatively with critic scores (-0.689) but not with audience scores (-0.003), suggesting critics have scored recent releases more harshly in this sample.
Before writing up the findings, we identify the top and bottom films by each scoring system, and flag any films where the two systems produce dramatically different verdicts — the true outliers the algorithm needs to handle carefully.
# Top 3 films by critic score
print("Top 3 by critic score:")
top_critic = df.nlargest(3, "critic_score")[["title","genre","critic_score","audience_score","score_gap"]]
print(top_critic.to_string(index=False))
# Top 3 films by audience score
print("\nTop 3 by audience score:")
top_audience = df.nlargest(3, "audience_score")[["title","genre","critic_score","audience_score","score_gap"]]
print(top_audience.to_string(index=False))
# Films where the two systems diverge most — absolute gap above 12
print("\nHigh-divergence films (|gap| > 12):")
outliers = df[df["score_gap"].abs() > 12][
["title","genre","critic_score","audience_score","score_gap"]
].sort_values("score_gap", ascending=False)
print(outliers.to_string(index=False))
# Overall agreement rate: % of films within 10 points of each other
within_10 = (df["score_gap"].abs() <= 10).mean()
print(f"\nFilms within 10 points of each other: {within_10:.0%}")
Top 3 by critic score:
title genre critic_score audience_score score_gap
Deep Meridian Drama 94 80 -14
The Last Quarter Drama 91 85 -6
The Quiet Shore Drama 89 88 -1
Top 3 by audience score:
title genre critic_score audience_score score_gap
The Quiet Shore Drama 89 88 -1
The Last Quarter Drama 91 85 -6
Neon Fists Action 62 81 19
High-divergence films (|gap| > 12):
title genre critic_score audience_score score_gap
Iron Carnival Comedy 48 72 24
Laugh Track Comedy 55 78 23
Neon Fists Action 62 81 19
Red Horizon Action 58 76 18
Stellar Drift Sci-Fi 88 74 -14
Whisper Valley Horror 79 65 -14
Deep Meridian Drama 94 80 -14
Shadow Protocol Horror 76 60 -16
Films within 10 points of each other: 25%
What just happened?
Method — .nlargest()
.nlargest(n, "column") returns the top N rows sorted by a specified column in descending order — equivalent to .sort_values(ascending=False).head(n) but more concise and slightly more efficient on large DataFrames. It does not modify the original DataFrame. Its counterpart .nsmallest() returns the bottom N rows.
The condition df["score_gap"].abs() > 12 filters rows where the absolute gap exceeds 12, regardless of direction. We wrap it in df[...] to return only the matching rows. This is the standard pandas pattern for outlier filtering — define the threshold, apply the boolean mask, subset.
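Both patterns can be verified on a small hypothetical frame — the two top-N idioms return identical rows, and the mask keeps every row whose absolute gap crosses the threshold:

```python
import pandas as pd

# Hypothetical gaps to compare the two top-N idioms and the outlier mask
demo = pd.DataFrame({"title": ["A", "B", "C", "D"],
                     "gap": [24, -14, 3, 19]})

top2 = demo.nlargest(2, "gap")
same = demo.sort_values("gap", ascending=False).head(2)
print(top2.equals(same))          # the two idioms return the same rows

outliers = demo[demo["gap"].abs() > 12]
print(outliers["title"].tolist())
```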
Only 25% of films land within 10 points of each other — meaning three-quarters of the catalogue would be ranked very differently depending on which score the algorithm uses. The eight high-divergence films are clustered by genre: Comedy and Action films are systematically underrated by critics relative to audiences, while Drama, Horror, and Sci-Fi show the opposite. This reinforces the finding from Step 2 — the algorithm must use genre-specific score weighting, not a single blended score.
Checkpoint: Before moving on, try creating a new column df["weighted_score"] = df["critic_score"] * 0.4 + df["audience_score"] * 0.6 and re-ranking the films. Does the order change significantly? Which films move up and which move down? This is how a recommendation engineer would explore different weighting strategies — one line of code, immediate insight.
Key Findings
Critic and audience mean scores are nearly identical at 75.7 vs 75.8 — but critics have a standard deviation of 15.8 vs audiences' 8.0. Critics are far more polarised in their opinions; audiences cluster toward the middle regardless of film quality.
Comedy is the most divisive genre — audiences rate it 23.5 points higher than critics on average. Action follows at +18.5. Horror runs the opposite direction at -15.0, where critics are significantly more generous than audiences.
The critic-audience Pearson correlation is just 0.214 — weak. The two scoring systems cannot be treated as interchangeable in the recommendation algorithm. They must be maintained as separate signals and weighted differently by genre.
Only 25% of films land within 10 points of each other across both scoring systems. The rest of the catalogue would produce materially different recommendations depending on which score is prioritised — this is a direct product risk.
Runtime correlates with critic scores at 0.814 but far less strongly with audience scores (0.376). Longer films are systematically rewarded by critics much more than by audiences — runtime is a confounder that the recommendation engine must control for to avoid inadvertently recommending only long films to users who prefer critic-endorsed content.
Visualisations
[Genre gap chart — ▲ = audiences score higher · ▼ = critics score higher]
[Correlation strength legend — Strong = above 0.7 · Moderate = 0.4–0.7 · Weak = below 0.4]
When to Use Which Score — Decision Guide
Choosing between critic scores, audience scores, or a blend depends on the recommendation objective. Here is the framework for deciding:
| Objective | Use This Score | Why | Genre Caveat |
|---|---|---|---|
| Recommend prestige / award films | Critic score only | Critics reward craft, writing, and direction | Avoid Action and Comedy — critics underrate these |
| Recommend crowd-pleasing entertainment | Audience score only | Audiences rate enjoyment, not artistic merit | Avoid Drama — audiences underrate these |
| General recommendation engine | Weighted blend by genre | No single score works across all genres | Genre-specific weights required |
| Identify hidden gems | High audience, low critic | Films critics dismissed but audiences loved | Comedy and Action are rich sources |
| Identify critical darlings | High critic, low audience | Films critics admired but general audiences did not enjoy | Drama and Horror most common here |
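As an example, the "hidden gems" row of the guide translates directly into a boolean filter. The cut-offs below (audience ≥ 75, critic ≤ 60) are illustrative assumptions, not thresholds from the analysis:

```python
import pandas as pd

# Hypothetical thresholds for the "hidden gems" rule — audience-loved,
# critic-dismissed; the cut-offs (>= 75, <= 60) are illustrative assumptions
films = pd.DataFrame({
    "title": ["Neon Fists", "Laugh Track", "The Last Quarter"],
    "critic_score": [62, 55, 91],
    "audience_score": [81, 78, 85],
})

hidden_gems = films[(films["audience_score"] >= 75) & (films["critic_score"] <= 60)]
print(hidden_gems["title"].tolist())
```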
Analyst's Note
What Would Come Next?
A senior analyst would build genre-specific score weighting — critic × 0.3 + audience × 0.7 for Comedy, critic × 0.6 + audience × 0.4 for Drama — validated against historical watch-time data to confirm which weighting drives more engagement.
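A minimal sketch of what that genre-specific blend could look like — the per-genre weights here are illustrative assumptions that would need validation against engagement data, not values produced by this analysis:

```python
import pandas as pd

# Illustrative per-genre critic weights — assumptions, not validated values
weights = {"Comedy": 0.3, "Action": 0.3, "Drama": 0.6, "Horror": 0.5, "Sci-Fi": 0.5}

demo = pd.DataFrame({
    "title": ["Laugh Track", "The Last Quarter"],
    "genre": ["Comedy", "Drama"],
    "critic_score": [55, 91],
    "audience_score": [78, 85],
})

w = demo["genre"].map(weights)  # critic weight looked up per film
demo["blended"] = w * demo["critic_score"] + (1 - w) * demo["audience_score"]
print(demo[["title", "blended"]])
```

Mapping a genre-to-weight dictionary with .map() keeps the weighting logic in one place, so a new validated weight table can be swapped in without touching the scoring code.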
Limitations of This Analysis
Twelve films with 2–3 per genre means a single outlier controls the entire genre average. At production scale a streaming platform would have thousands of films per genre and the patterns would be far more reliable.
Business Decisions This Could Drive
The algorithm should maintain both scores as separate signals weighted by genre. A 'Critics' Pick vs Fan Favourite' badge for films with a gap above 15 points is a proven engagement feature on streaming platforms.
Practice Questions
1. Which genre showed the largest gap between audience and critic scores in the StreamVault dataset?
2. Which pandas method returns the top N rows of a DataFrame sorted by a specified column in descending order?
3. What was the Pearson correlation between critic scores and audience scores in this dataset?
Quiz
1. You need to compute the mean score gap (audience minus critic) per genre. What is the correct pandas approach?
2. Based on the correlation analysis, what is the correct recommendation to the engineering team about using critic vs audience scores?
3. To rank films by the size of their score gap regardless of direction, which approach is correct?
Up Next · Case Study 4
Hospital Patient Records
You are handed a patient admissions dataset. Which departments are most overloaded? Do readmission rates vary by age group? What does the average length of stay reveal about discharge patterns?