DS Case Study 3 – Movie Ratings EDA | Dataplexa
Beginner Case Study · CS 3

Comparing Movie Ratings Data

Everyone has an opinion about movies. Critics write reviews. Audiences leave scores. Platforms aggregate both. But do the two groups actually agree — or are they rating completely different things? Your job today is to find out using data.

You are a data analyst at StreamVault, a streaming platform building a personalised recommendation engine. Before the engineering team can weight critic scores vs audience scores in the algorithm, they need to know how correlated the two are, which genres perform differently between groups, and whether runtime or release year affects ratings. You have 12 movies, pandas, and a deadline.

Industry: Entertainment
Technique: EDA · Correlation
Libraries: pandas · numpy
Difficulty: Beginner
Est. Time: 40–50 min
Overview

What This Case Study Covers

Movie ratings datasets are a staple of beginner EDA because they combine numeric analysis, categorical grouping, and correlation in a context that is immediately intuitive. The challenge is not computing the numbers — it is interpreting what they mean for a real product decision. A recommendation engine that weights critic scores heavily will behave very differently from one that weights audience scores, and this analysis determines which approach is better supported by the data.

This case study covers four analytical layers: rating distribution analysis to understand the spread and central tendency of both scoring systems, genre-level grouping to find which genres critics and audiences rate differently, correlation analysis to measure how aligned the two scoring systems actually are, and runtime and year pattern analysis to check whether non-rating variables influence scores.

The Movie EDA Toolkit

1. Rating Distribution

Compute mean, median, and standard deviation for both critic and audience scores. A high standard deviation means a group's ratings are polarised: its scores swing widely from film to film.

2. Genre-Level Aggregation

Group by genre and compute mean critic and audience scores per group. The gap between the two scores within a genre reveals systematic disagreement — critics and audiences may value genre conventions differently.

3. Score Gap Analysis

Create a derived column — audience score minus critic score — for every movie. Positive means audiences liked it more than critics; negative means critics were more generous. Sorting by this gap reveals the most divisive films.

4. Correlation Analysis

Compute Pearson correlations between critic score, audience score, runtime, and release year. A high critic-audience correlation means the two systems are interchangeable for the recommendation engine; a low one means they must be weighted separately.

5. Runtime and Year Patterns

Check whether longer movies or newer movies rate higher. These are confounding variables — if runtime correlates with rating, the recommendation engine needs to control for it to avoid recommending only long films.
01 · Dataset Overview

StreamVault's ratings database contains 12 movies spanning five genres, with both critic scores (from professional reviewers, out of 100) and audience scores (from platform users, out of 100). We build it directly with pd.DataFrame().

movie_id  title             genre   release_year  runtime_min  critic_score  audience_score
M001      Stellar Drift     Sci-Fi          2021          132            88              74
M002      The Last Quarter  Drama           2019          118            91              85
M003      Neon Fists        Action          2022          105            62              81
M004      Whisper Valley    Horror          2020           98            79              65
M005      Laugh Track       Comedy          2023           94            55              78

Showing first 5 of 12 rows · 7 columns

movie_id · string · unique identifier
Unique ID per film. Used for deduplication only — never analysed directly.

title · string · label
Film title. Used for labelling outputs, not for numerical analysis.

genre · object · 5 categories
Sci-Fi, Drama, Action, Horror, Comedy. The primary grouping dimension for the genre-level analysis.

release_year · int64 · year
Year of release. Used to check whether newer films rate differently — a potential confounding variable.

runtime_min · int64 · minutes
Film length in minutes. Checked for correlation with ratings — longer films may score higher or lower systematically.

critic_score · int64 · 0–100
Aggregated professional critic score out of 100. One of the two primary target variables in this analysis.

audience_score · int64 · 0–100
Aggregated platform user score out of 100. The other primary target variable — compared directly against critic_score throughout.

02 · Business Questions

The engineering team needs answers to these questions before they can configure the recommendation algorithm. Each maps to a specific step in the analysis.

1. What is the average critic score and average audience score across all films — and which group scores more generously overall?

2. Which genre shows the largest gap between critic and audience scores — and in which direction does the disagreement go?

3. How strongly correlated are critic scores and audience scores — can the two be treated as interchangeable in the algorithm?

4. Which individual film has the largest gap between what critics thought and what audiences scored it?

5. Does runtime or release year show any meaningful correlation with either rating system?

03 · Step-by-Step Analysis

The scenario:

The engineering team has sent you a Slack message: "We need to know if we can just use one score or if we need to keep both. Also, which genres are most controversial between critics and audiences?" Open your notebook and work through the dataset systematically — they need a written answer by tomorrow morning.

Step 1: Load the Dataset and Inspect Rating Distributions

The first question is the simplest: who scores more generously on average, critics or audiences? We load the data, inspect it, and compute summary statistics for both score columns simultaneously.

import pandas as pd
import numpy as np

# Build the StreamVault movie ratings dataset
df = pd.DataFrame({
    "movie_id":      ["M001","M002","M003","M004","M005","M006",
                      "M007","M008","M009","M010","M011","M012"],
    "title":         ["Stellar Drift","The Last Quarter","Neon Fists",
                      "Whisper Valley","Laugh Track","Deep Meridian",
                      "City of Glass","Red Horizon","The Quiet Shore",
                      "Iron Carnival","Shadow Protocol","Sunfall"],
    "genre":         ["Sci-Fi","Drama","Action","Horror","Comedy","Drama",
                      "Sci-Fi","Action","Drama","Comedy","Horror","Sci-Fi"],
    "release_year":  [2021,2019,2022,2020,2023,2018,2022,2021,2020,2023,2019,2022],
    "runtime_min":   [132,118,105,98,94,141,127,110,124,88,103,115],
    "critic_score":  [88,91,62,79,55,94,83,58,89,48,76,85],
    "audience_score":[74,85,81,65,78,80,71,76,88,72,60,79]
})

# Basic shape and types
print("Shape:", df.shape)
print("\nData types:")
print(df.dtypes)

# Summary stats for both score columns side by side
# Selecting both score columns at once keeps the comparison clean
score_summary = df[["critic_score","audience_score"]].describe().round(1)
print("\nRating distribution summary:")
print(score_summary)

# Who scores more generously on average?
print(f"\nMean critic score:   {df['critic_score'].mean():.1f}")
print(f"Mean audience score: {df['audience_score'].mean():.1f}")
print(f"Difference:          {df['critic_score'].mean() - df['audience_score'].mean():.1f} points")
Shape: (12, 7)

Data types:
movie_id          object
title             object
genre             object
release_year       int64
runtime_min        int64
critic_score       int64
audience_score     int64
dtype: object

Rating distribution summary:
       critic_score  audience_score
count          12.0            12.0
mean           75.7            75.8
std            15.8             8.0
min            48.0            60.0
25%            61.0            71.8
50%            81.0            77.0
75%            88.2            80.2
max            94.0            88.0

Mean critic score:   75.7
Mean audience score: 75.8
Difference:          -0.1 points

What just happened?

Method — .describe() on multiple columns

By passing a list of column names to the DataFrame selector — df[["col1","col2"]] — and chaining .describe(), we get a side-by-side summary table for both score columns in one call. This is far more useful than running .describe() on the full DataFrame, which would also include runtime and release year in the same output and make the comparison harder to read.
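If you only need specific statistics rather than the full .describe() table, .agg() accepts a list of function names and returns just those rows. A minimal sketch, reusing the df built above:

# Same side-by-side comparison, restricted to chosen statistics
custom_summary = df[["critic_score","audience_score"]].agg(
    ["mean","median","std","min","max"]
).round(1)
print(custom_summary)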

Business Insight

The mean scores are almost identical — critics average 75.7 and audiences average 75.8, a difference of just 0.1 points. On the surface this looks like strong agreement. But look at the standard deviation: critics have a std of 15.8 while audiences have just 8.0. Critics are far more spread out in their opinions — some films they love (94) and some they dislike (48). Audiences cluster more tightly together. The means are nearly the same but the distributions are fundamentally different — this matters enormously for the recommendation algorithm.

Step 2: Genre-Level Score Gap Analysis

Equal overall averages can hide very unequal genre-level behaviour. We group by genre, compute mean scores for both groups, and create a derived column showing the gap — answering business questions 2 and 4 together.

# Create a score gap column for every movie
# Positive = audiences scored higher than critics (audience-friendly film)
# Negative = critics scored higher (critic-friendly film)
df["score_gap"] = df["audience_score"] - df["critic_score"]

# Group by genre and compute mean critic score, audience score, and score gap
genre_analysis = df.groupby("genre").agg(
    mean_critic   = ("critic_score",   "mean"),
    mean_audience = ("audience_score", "mean"),
    mean_gap      = ("score_gap",      "mean"),
    film_count    = ("movie_id",       "count")
).round(1).reset_index()

# Sort by the absolute score gap — biggest disagreement first
genre_analysis["abs_gap"] = genre_analysis["mean_gap"].abs()
genre_analysis = genre_analysis.sort_values("abs_gap", ascending=False)

print("Genre-level rating comparison:")
print(genre_analysis.drop("abs_gap", axis=1).to_string(index=False))

# Film-level: which single movie has the largest gap?
print("\nMost divisive individual films (by score gap):")
divisive = df[["title","genre","critic_score","audience_score","score_gap"]]
divisive = divisive.sort_values("score_gap", key=abs, ascending=False)
print(divisive.head(5).to_string(index=False))
Genre-level rating comparison:
  genre  mean_critic  mean_audience  mean_gap  film_count
 Comedy         51.5           75.0      23.5           2
 Action         60.0           78.5      18.5           2
 Horror         77.5           62.5     -15.0           2
  Sci-Fi        85.3           74.7     -10.7           3
  Drama         91.3           84.3      -7.0           3

Most divisive individual films (by score gap):
          title   genre  critic_score  audience_score  score_gap
  Iron Carnival  Comedy            48              72         24
    Laugh Track  Comedy            55              78         23
     Neon Fists  Action            62              81         19
    Red Horizon  Action            58              76         18
Shadow Protocol  Horror            76              60        -16

What just happened?

Method — derived column before groupby

We created the score_gap column on the raw DataFrame before grouping. This means when we aggregate with .agg(), we can compute the mean gap directly rather than trying to subtract two already-aggregated means — which would give the same result mathematically but is less clean. Creating derived columns before groupby is the standard pattern whenever you need to aggregate a computed value.
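To see why both routes agree, here is the subtract-after-aggregating version as a quick sketch. Because the mean is linear, mean(audience − critic) equals mean(audience) − mean(critic):

# Alternative: aggregate first, then subtract the aggregated means.
# Mathematically identical to aggregating the score_gap column directly.
by_genre = df.groupby("genre")[["critic_score","audience_score"]].mean()
gap_check = (by_genre["audience_score"] - by_genre["critic_score"]).round(1)
print(gap_check)   # matches mean_gap in genre_analysis above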

Method — sort by absolute value with key=abs

.sort_values("score_gap", key=abs) sorts by the absolute value of the gap — meaning both +24 and -14 rank above +5, regardless of direction. Without key=abs, a simple sort would separate positive and negative gaps and miss the most divisive films in one ranked list.
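A toy sketch of the difference on a standalone Series:

import pandas as pd

gaps = pd.Series([24, -14, 5, -16, 19])
print(gaps.sort_values(ascending=False).tolist())            # [24, 19, 5, -14, -16]
print(gaps.sort_values(key=abs, ascending=False).tolist())   # [24, 19, -16, -14, 5]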

Business Insight

Comedy and Action are audience-friendly — audiences rate them 23.5 and 18.5 points higher than critics on average, respectively. Critics likely penalise genre conventions and predictable plots that audiences enjoy. Horror and Sci-Fi run the opposite direction — critics reward them more than audiences, possibly for atmospheric or technical craft that general viewers find less engaging. For the recommendation engine, this means critic scores and audience scores should be weighted separately by genre — not combined into a single signal.

Step 3: Critic vs Audience Correlation

The engineering team's core question: are critic and audience scores interchangeable? We compute the full correlation matrix across all numeric columns — answering business questions 3 and 5 in one block.

# Compute Pearson correlation across all numeric columns
# This gives us critic vs audience, plus runtime and year as potential confounders
corr_matrix = df[["critic_score","audience_score",
                   "runtime_min","release_year"]].corr().round(3)

print("Full correlation matrix:")
print(corr_matrix.to_string())

# Extract just the critic-audience correlation for the business answer
ca_corr = corr_matrix.loc["critic_score","audience_score"]
print(f"\nCritic vs Audience correlation: {ca_corr:.3f}")

# Interpret strength
if abs(ca_corr) >= 0.7:
    strength = "strong — the two scores move together reliably"
elif abs(ca_corr) >= 0.4:
    strength = "moderate — some alignment but significant divergence"
else:
    strength = "weak — the two scores are largely independent"
print(f"Interpretation: {strength}")

# Check runtime and year correlations with both score types
print("\nCorrelations with runtime_min:")
print(f"  critic_score:   {corr_matrix.loc['critic_score','runtime_min']:.3f}")
print(f"  audience_score: {corr_matrix.loc['audience_score','runtime_min']:.3f}")

print("\nCorrelations with release_year:")
print(f"  critic_score:   {corr_matrix.loc['critic_score','release_year']:.3f}")
print(f"  audience_score: {corr_matrix.loc['audience_score','release_year']:.3f}")
Full correlation matrix:
                critic_score  audience_score  runtime_min  release_year
critic_score           1.000           0.214        0.814        -0.689
audience_score         0.214           1.000        0.376        -0.003
runtime_min            0.814           0.376        1.000        -0.514
release_year          -0.689          -0.003       -0.514         1.000

Critic vs Audience correlation: 0.214
Interpretation: weak — the two scores are largely independent

Correlations with runtime_min:
  critic_score:   0.814
  audience_score: 0.376

Correlations with release_year:
  critic_score:   -0.689
  audience_score: -0.003

What just happened?

Method — .corr() full matrix

Calling .corr() on a DataFrame with multiple columns produces a symmetric N×N matrix — every variable correlated against every other variable. The diagonal is always 1.0 (a variable is perfectly correlated with itself). We then use .loc["row","col"] to extract individual cells by label — cleaner and safer than using positional .iloc[] on a correlation matrix where column order could change.
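If you would rather read a ranked list of pairs than scan the matrix cell by cell, one common pattern is to mask the lower triangle and stack what remains. A sketch, reusing corr_matrix from above:

import numpy as np

# Keep only the upper triangle above the diagonal, so each pair appears once
mask = np.triu(np.ones(corr_matrix.shape, dtype=bool), k=1)
pairs = corr_matrix.where(mask).stack().dropna()   # keep only the unmasked cells
print(pairs.reindex(pairs.abs().sort_values(ascending=False).index))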

Business Insight

The critic-audience correlation is just 0.214 — weak by the thresholds above. This directly answers the engineering team's question: the two scores cannot be treated as interchangeable. A film that critics love does not reliably mean audiences will love it, and vice versa. Both scores must be kept as separate signals in the algorithm. Runtime is also a serious confounder: it correlates with critic scores at 0.814 but far less with audience scores (0.376), so longer films are systematically rewarded by critics much more than by audiences. Release year shows the same asymmetry (-0.689 with critic scores, essentially zero with audiences): in this sample, older films earn higher critic scores. Both variables need to be controlled for.
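One caution worth adding: with only 12 films, a single extreme film can move Pearson r substantially. A rank-based coefficient is a cheap sanity check; a sketch on the same two columns:

# Spearman rank correlation is less sensitive to outliers than Pearson
spearman_ca = df["critic_score"].corr(df["audience_score"], method="spearman")
print(f"Spearman critic vs audience: {spearman_ca:.3f}")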

Step 4: Ranking and Outlier Identification

Before writing up the findings, we identify the top and bottom films by each scoring system, and flag any films where the two systems produce dramatically different verdicts — the true outliers the algorithm needs to handle carefully.

# Top 3 films by critic score
print("Top 3 by critic score:")
top_critic = df.nlargest(3, "critic_score")[["title","genre","critic_score","audience_score","score_gap"]]
print(top_critic.to_string(index=False))

# Top 3 films by audience score
print("\nTop 3 by audience score:")
top_audience = df.nlargest(3, "audience_score")[["title","genre","critic_score","audience_score","score_gap"]]
print(top_audience.to_string(index=False))

# Films where the two systems diverge most — absolute gap above 12 in either direction
print("\nHigh-divergence films (|gap| > 12):")
outliers = df[df["score_gap"].abs() > 12][
    ["title","genre","critic_score","audience_score","score_gap"]
].sort_values("score_gap", ascending=False)
print(outliers.to_string(index=False))

# Overall agreement rate: % of films within 10 points of each other
within_10 = (df["score_gap"].abs() <= 10).mean()
print(f"\nFilms within 10 points of each other: {within_10:.0%}")
Top 3 by critic score:
           title  genre  critic_score  audience_score  score_gap
   Deep Meridian  Drama            94              80        -14
The Last Quarter  Drama            91              85         -6
 The Quiet Shore  Drama            89              88         -1

Top 3 by audience score:
           title   genre  critic_score  audience_score  score_gap
 The Quiet Shore   Drama            89              88         -1
The Last Quarter   Drama            91              85         -6
      Neon Fists  Action            62              81         19

High-divergence films (|gap| > 12):
           title   genre  critic_score  audience_score  score_gap
   Iron Carnival  Comedy            48              72         24
     Laugh Track  Comedy            55              78         23
      Neon Fists  Action            62              81         19
     Red Horizon  Action            58              76         18
   Stellar Drift  Sci-Fi            88              74        -14
  Whisper Valley  Horror            79              65        -14
   Deep Meridian   Drama            94              80        -14
 Shadow Protocol  Horror            76              60        -16

Films within 10 points of each other: 25%

What just happened?

Method — .nlargest()

.nlargest(n, "column") returns the top N rows sorted by a specified column in descending order — equivalent to .sort_values(ascending=False).head(n) but more concise and slightly more efficient on large DataFrames. It does not modify the original DataFrame. Its counterpart .nsmallest() returns the bottom N rows.
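The counterpart works the same way; a one-line sketch for the lowest-rated films:

# Bottom 3 films by critic score
print(df.nsmallest(3, "critic_score")[["title","genre","critic_score","audience_score"]]
      .to_string(index=False))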

Method — boolean mask with .abs()

The condition df["score_gap"].abs() > 12 filters rows where the absolute gap exceeds 12, regardless of direction. We wrap it in df[...] to return only the matching rows. This is the standard pandas pattern for outlier filtering — define the threshold, apply the boolean mask, subset.
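When the cutoff is likely to be tuned later, naming it and writing the subset with .loc keeps the filter readable. A small sketch of the same selection:

# Same filter with a named threshold, written with .loc
GAP_THRESHOLD = 12   # illustrative name; same cutoff as above
divergent = df.loc[df["score_gap"].abs() > GAP_THRESHOLD,
                   ["title","genre","score_gap"]]
print(f"{len(divergent)} films exceed the threshold")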

Business Insight

Only 50% of films land within 10 points of each other — meaning half the catalogue would be ranked very differently depending on which score the algorithm uses. The six high-divergence films are clustered by genre: Comedy and Action films are systematically underrated by critics relative to audiences, while Drama and Horror show the opposite. This reinforces the finding from Step 2 — the algorithm must use genre-specific score weighting, not a single blended score.

Checkpoint: Before moving on, try creating a new column df["weighted_score"] = df["critic_score"] * 0.4 + df["audience_score"] * 0.6 and re-ranking the films. Does the order change significantly? Which films move up and which move down? This is how a recommendation engineer would explore different weighting strategies — one line of code, immediate insight.
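A worked sketch of that checkpoint. The 0.4/0.6 split is purely illustrative, not a validated weighting:

# Blend the two scores and re-rank; weights are illustrative only
df["weighted_score"] = df["critic_score"] * 0.4 + df["audience_score"] * 0.6
reranked = df.sort_values("weighted_score", ascending=False)
print(reranked[["title","genre","weighted_score"]].round(1).to_string(index=False))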

04 · Key Findings

01. Critic and audience mean scores are nearly identical at 75.7 vs 75.8 — but critics have a standard deviation of 15.8 vs audiences' 8.0. Critics are far more polarised in their opinions; audiences cluster toward the middle regardless of film quality.

02. Comedy is the most divisive genre — audiences rate it 23.5 points higher than critics on average. Action follows at +18.5. Horror runs the opposite direction at -15.0, where critics are significantly more generous than audiences.

03. The critic-audience Pearson correlation is 0.214 — weak. The two scoring systems cannot be treated as interchangeable in the recommendation algorithm. They must be maintained as separate signals and weighted differently by genre.

04. Only 25% of films land within 10 points of each other across both scoring systems. The other three-quarters of the catalogue would produce materially different recommendations depending on which score is prioritised — this is a direct product risk.

05. Runtime correlates with critic scores at 0.814 but far more weakly with audience scores (0.376). Longer films are systematically rewarded by critics but much less so by audiences — runtime is a confounder that the recommendation engine must control for to avoid inadvertently recommending only long films to users who prefer critic-endorsed content.

05 · Visualisations

Mean Score by Genre — Critics vs Audiences
Mean critic score per genre · gap to the audience mean in brackets
Drama 91.3 (▼ 7.0) · Sci-Fi 85.3 (▼ 10.7) · Horror 77.5 (▼ 15.0) · Action 60.0 (▲ 18.5) · Comedy 51.5 (▲ 23.5)

▲ = audiences score higher · ▼ = critics score higher

Score Gap per Film
Audience score minus critic score · positive = audience-friendly
Iron Carnival +24 · Laugh Track +23 · Neon Fists +19 · Quiet Shore −1 · Deep Meridian −14 · Stellar Drift −14
Correlation Matrix — Key Variables
Pearson r · closer to ±1.0 = stronger relationship · negative = inverse
critic ↔ audience 0.214 · critic ↔ runtime 0.814 · critic ↔ year −0.689 · audience ↔ runtime 0.376

Weak = below 0.4 · Moderate = 0.4–0.7 · Strong = above 0.7

06 · When to Use Which Score — Decision Guide

Choosing between critic scores, audience scores, or a blend depends on the recommendation objective. Here is the framework for deciding:

- Recommend prestige / award films → Critic score only. Critics reward craft, writing, and direction. Genre caveat: avoid Action and Comedy — critics underrate these.
- Recommend crowd-pleasing entertainment → Audience score only. Audiences rate enjoyment, not artistic merit. Genre caveat: avoid Drama — audiences underrate these.
- General recommendation engine → Weighted blend by genre. No single score works across all genres. Genre-specific weights required.
- Identify hidden gems → High audience, low critic. Films critics dismissed but audiences loved. Comedy and Action are rich sources (see the filter sketch after this list).
- Identify critical darlings → High critic, low audience. Films critics admired but general audiences did not enjoy. Drama and Horror are most common here.
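As referenced in the hidden-gems row above, this kind of rule is a one-line filter in pandas. A sketch with illustrative cutoffs (the 75/60 thresholds are assumptions, not tuned values):

# Hidden gems: films audiences scored well but critics dismissed
# Thresholds are illustrative assumptions
hidden_gems = df[(df["audience_score"] >= 75) & (df["critic_score"] <= 60)]
print(hidden_gems[["title","genre","critic_score","audience_score"]]
      .to_string(index=False))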
07 · Analyst's Note


What Would Come Next?

A senior analyst would build genre-specific score weighting — critic × 0.3 + audience × 0.7 for Comedy, critic × 0.6 + audience × 0.4 for Drama — validated against historical watch-time data to confirm which weighting drives more engagement.
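A minimal sketch of that weighting scheme. The Comedy and Drama weights come from the note above; the Action, Horror, and Sci-Fi weights are placeholder assumptions that would need validating against watch-time data:

# Genre-specific critic weight; the audience weight is the complement.
# Comedy 0.3 and Drama 0.6 follow the note above; the rest are placeholders.
critic_w = {"Comedy": 0.3, "Action": 0.3, "Drama": 0.6, "Horror": 0.5, "Sci-Fi": 0.5}
w = df["genre"].map(critic_w)
df["genre_weighted_score"] = df["critic_score"] * w + df["audience_score"] * (1 - w)
print(df.sort_values("genre_weighted_score", ascending=False)
        [["title","genre","genre_weighted_score"]].round(1).head().to_string(index=False))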

Limitations of This Analysis

Twelve films with 2–3 per genre means a single outlier controls the entire genre average. At production scale a streaming platform would have thousands of films per genre and the patterns would be far more reliable.

Business Decisions This Could Drive

The algorithm should maintain both scores as separate signals weighted by genre. A "Critics' Pick vs Fan Favourite" badge for films with a gap above 15 points is a proven engagement feature on streaming platforms.

Practice Questions

1. Which genre showed the largest gap between audience and critic scores in the StreamVault dataset?



2. Which pandas method returns the top N rows of a DataFrame sorted by a specified column in descending order?



3. What was the Pearson correlation between critic scores and audience scores in this dataset?



Quiz

1. You need to compute the mean score gap (audience minus critic) per genre. What is the correct pandas approach?


2. Based on the correlation analysis, what is the correct recommendation to the engineering team about using critic vs audience scores?


3. To rank films by the size of their score gap regardless of direction, which approach is correct?


Up Next · Case Study 4

Hospital Patient Records

You are handed a patient admissions dataset. Which departments are most overloaded? Do readmission rates vary by age group? What does the average length of stay reveal about discharge patterns?