DS Case Studies
Exploring Customer Churn Data
Churn data is the most financially consequential signal a subscription business generates. A model cannot save customers it cannot identify. This case study is the translation layer: raw customer records in, actionable risk segments out.
You are a junior data analyst at ConnectTel, a mid-size telecommunications company. Your VP of Customer Success just dropped a brief on your desk: churn is up 4% this quarter and nobody knows why. Your job is to dig into the customer data, find the patterns, and tell the team exactly which customers are at risk — before they cancel.
All code in this case study runs in Google Colab — free, browser-based, no installation needed. Go to colab.research.google.com, create a new notebook, and paste each block as you follow along. Locally, Jupyter Notebook via Anaconda works identically.
What This Case Study Covers
Customer churn — the rate at which customers stop doing business with a company — is one of the most financially damaging problems a subscription business can face. Acquiring a new customer costs 5–7× more than retaining an existing one, which means every churned customer is a compounding loss. Before any machine learning model can predict churn, a data analyst must first understand the data: what patterns exist, which variables separate churned from retained, and which customer segments are highest risk.
Every churn EDA pipeline does the same three things: inspect and clean the raw customer data to catch type issues and missing values before they corrupt results, compute the churn rate overall and by segment to find where the problem is concentrated, and compare profiles of churned vs retained customers across key numeric variables to quantify the signal strength of each feature.
The EDA Toolkit for Churn Analysis
Data Inspection and Type Audit
Load the dataset, check shapes, inspect dtypes, and confirm there are no silent type mismatches — like a numeric column stored as object — that would corrupt downstream calculations.
Missing Value Audit
Count nulls per column and compute the missing percentage. A column that is 40% empty cannot be trusted for group comparisons. Knowing this before computing group means is not optional — it is step one.
Churn Rate by Segment
Compute overall churn rate, then break it down by contract type using groupby(). Segment-level rates reveal where the problem is concentrated and what the retention team should fix first.
Profile Comparison
Compute average tenure, complaint count, and monthly charge for churned vs retained customers. The bigger the gap between groups, the stronger the signal. These gaps become your model's most important features.
Correlation Analysis
Calculate Pearson correlation between each numeric variable and the binary churn flag. Anything above ±0.7 is a strong linear signal. Below ±0.3, the feature probably will not help the model much.
Dataset Overview
ConnectTel's customer database contains 10 records — a representative sample we will build directly with pd.DataFrame(). Each row is one customer. Each column captures a different signal about their relationship with the company. Here are the first five rows:
| customer_id | age | tenure_months | monthly_charge | contract_type | num_complaints | churned |
|---|---|---|---|---|---|---|
| C001 | 34 | 12 | 65 | Month-to-Month | 2 | Yes |
| C002 | 52 | 48 | 85 | Annual | 0 | No |
| C003 | 28 | 6 | 45 | Month-to-Month | 5 | Yes |
| C004 | 61 | 60 | 95 | Two-Year | 1 | No |
| C005 | 23 | 3 | 55 | Month-to-Month | 8 | Yes |
Showing first 5 of 10 rows · 7 columns
customer_id: Unique string ID per customer. Used for joining only — never analysed directly.
age: Customer age in years. Ranges 18–65. Younger customers may show different churn behaviour at renewal time.
tenure_months: How long the customer has been with ConnectTel. Key loyalty signal — longer tenure means a more stable customer.
monthly_charge: Amount paid per month in dollars. Higher charges can drive dissatisfaction if perceived value does not match.
contract_type: Month-to-Month, Annual, or Two-Year. This will turn out to be the single most important variable in the dataset.
num_complaints: Number of support tickets raised by the customer. A high complaint count is the strongest positively correlated numeric churn predictor.
churned: Yes or No. This is what we are trying to understand — and what a model would eventually be trained to predict.
Business Questions
Your manager has asked for answers to these specific questions by end of day. Every step in the analysis below maps directly to one of these deliverables — this is how real analyst work is structured.
What is the overall churn rate, and how does it compare to the 15% industry benchmark?
Do Month-to-Month customers churn at a significantly higher rate than Annual or Two-Year contract holders?
What is the average tenure of churned vs retained customers, and how large is the gap in months?
Is there a meaningful linear relationship between number of complaints filed and churn behaviour?
Which specific customer segment should the retention team prioritise first for outreach?
Step-by-Step Analysis
The scenario:
You have just received the raw customer export from ConnectTel's CRM. You will build the dataset in Python, inspect it for quality issues, then work through the analysis step-by-step — answering each of the five business questions above using pandas and numpy alone. Open your Colab notebook and follow along.
Before we analyse anything, we need data in memory. In real projects you would read a CSV with pd.read_csv() — here we build the dataset directly with pd.DataFrame() so you can run it anywhere without a file download. We then inspect shape, dtypes, and the first five rows before touching a single number.
# Import the two libraries we'll use throughout this case study
import pandas as pd # pandas: the core data manipulation library in Python
import numpy as np # numpy: numerical computing — needed for NaN values and stats later
# Build the ConnectTel dataset directly in code
# In a real project this would be: df = pd.read_csv("customers.csv")
df = pd.DataFrame({
"customer_id": ["C001","C002","C003","C004","C005",
"C006","C007","C008","C009","C010"],
"age": [34, 52, 28, 61, 23, 45, 37, 29, 55, 41],
"tenure_months": [12, 48, 6, 60, 3, 36, 24, 9, 54, 18],
"monthly_charge": [65, 85, 45, 95, 55, 75, 80, 50, 90, 70],
"contract_type": ["Month-to-Month","Annual","Month-to-Month","Two-Year",
"Month-to-Month","Annual","Annual","Month-to-Month",
"Two-Year","Month-to-Month"],
"num_complaints": [2, 0, 5, 1, 8, 1, 0, 6, 0, 3],
"churned": ["Yes","No","Yes","No","Yes","No","No","Yes","No","Yes"]
})
# .shape returns (rows, columns) — confirms we loaded what we expected
print("Dataset shape:", df.shape)
# .dtypes shows the data type pandas assigned to each column
# object = text/mixed int64 = integer float64 = decimal
print("\nColumn data types:")
print(df.dtypes)
# .head() previews the first 5 rows — always do this first to sanity-check
print("\nFirst 5 rows:")
print(df.head())
Dataset shape: (10, 7)

Column data types:
customer_id       object
age                int64
tenure_months      int64
monthly_charge     int64
contract_type     object
num_complaints     int64
churned           object
dtype: object

First 5 rows:
  customer_id  age  tenure_months  monthly_charge   contract_type  num_complaints churned
0        C001   34             12              65  Month-to-Month               2     Yes
1        C002   52             48              85          Annual               0      No
2        C003   28              6              45  Month-to-Month               5     Yes
3        C004   61             60              95        Two-Year               1      No
4        C005   23              3              55  Month-to-Month               8     Yes
What just happened?
Library — pandas
pandas is the foundational data manipulation library in Python. It introduces the DataFrame — a table-like structure with labelled rows and columns, equivalent to a spreadsheet or SQL table. Almost every data science workflow in Python starts here. We import it as pd by convention so every call is shorter to write.
numpy is Python's numerical computing engine. We are not doing heavy math yet, but we will need it for np.nan (representing missing values) and for numeric operations in later steps. Imported as np by convention.
We passed a Python dictionary where each key became a column name and each list became that column's values. The result is a fully-formed table confirmed by .shape returning (10, 7) — 10 rows, 7 columns.
The dtype output flags something important: contract_type and churned are stored as object — text. We cannot run numeric calculations on them directly. We will handle this in Step 3 using a boolean comparison. Also notice: the first three churned rows are all Month-to-Month customers. That is a pattern visible before any formal analysis even starts.
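That eyeballed pattern can be checked in one line with pd.crosstab, which counts the rows in every combination of two columns. A quick sketch using the same two columns (the dataset is rebuilt here so the block runs on its own):

```python
import pandas as pd

# Same contract_type and churned columns as the ConnectTel dataset above
df = pd.DataFrame({
    "contract_type": ["Month-to-Month", "Annual", "Month-to-Month", "Two-Year",
                      "Month-to-Month", "Annual", "Annual", "Month-to-Month",
                      "Two-Year", "Month-to-Month"],
    "churned": ["Yes", "No", "Yes", "No", "Yes", "No", "No", "Yes", "No", "Yes"],
})

# pd.crosstab counts how many rows fall into each (contract_type, churned) pair
ct = pd.crosstab(df["contract_type"], df["churned"])
print(ct)
```

Even before the formal analysis in Step 3, the cross-tab shows every churned customer sitting in the Month-to-Month row.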
Before calculating any group means or rates, we need to confirm the data is complete. Missing values silently corrupt averages — a group mean computed with 3 missing rows out of 10 is wrong but will not throw an error. This check takes 30 seconds and saves hours later.
# .isnull() returns True/False for every cell in the DataFrame
# .sum() counts the Trues per column — True counts as 1, False as 0
missing = df.isnull().sum()
print("Missing values per column:")
print(missing)
# Calculate missing as a percentage of total rows
missing_pct = (df.isnull().sum() / len(df)) * 100
print("\nMissing percentage:")
print(missing_pct)
# .describe() computes 8 summary statistics for every numeric column at once:
# count, mean, std (standard deviation), min, 25th pct, median, 75th pct, max
# It is the fastest single method for spotting outliers and understanding spread
print("\nNumeric summary statistics:")
print(df.describe().round(2))
Missing values per column:
customer_id 0
age 0
tenure_months 0
monthly_charge 0
contract_type 0
num_complaints 0
churned 0
dtype: int64
Missing percentage:
customer_id       0.0
age               0.0
tenure_months     0.0
monthly_charge    0.0
contract_type     0.0
num_complaints    0.0
churned           0.0
dtype: float64
Numeric summary statistics:
         age  tenure_months  monthly_charge  num_complaints
count  10.00          10.00           10.00           10.00
mean   40.50          27.00           71.00            2.60
std    12.62          21.07           17.13            2.84
min    23.00           3.00           45.00            0.00
25%    30.25           9.75           57.50            0.25
50%    39.00          21.00           72.50            1.50
75%    50.25          45.00           83.75            4.50
max    61.00          60.00           95.00            8.00

What just happened?
Method — .isnull().sum()
.isnull() creates a boolean mask — a DataFrame of the same shape where every cell is either True (missing) or False (present). Chaining .sum() collapses each column into a count by treating True as 1. The two-step pattern df.isnull().sum() is one of the most common checks in all of data science. Run it on every dataset before anything else.
.describe() runs eight statistics simultaneously on every numeric column. By default it skips object columns like churned automatically — no filtering required. We chained .round(2) to keep the decimal places readable. The std row shows standard deviation — how spread out the values are around the mean.
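.describe() can also summarise the text columns it skips by default — pass include="object" to get count, unique values, the most frequent value (top), and its frequency. A quick sketch on the two categorical columns:

```python
import pandas as pd

# Just the two object (text) columns from the ConnectTel dataset
df = pd.DataFrame({
    "contract_type": ["Month-to-Month", "Annual", "Month-to-Month", "Two-Year",
                      "Month-to-Month", "Annual", "Annual", "Month-to-Month",
                      "Two-Year", "Month-to-Month"],
    "churned": ["Yes", "No", "Yes", "No", "Yes", "No", "No", "Yes", "No", "Yes"],
})

# include="object" reports count, unique, top (mode), and freq for text columns
desc = df.describe(include="object")
print(desc)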
Zero missing values — the dataset is clean and we can proceed without imputation. But the summary statistics already reveal something worth flagging: num_complaints has a mean of 2.6 but a max of 8 — a wide spread suggesting some customers file dramatically more tickets than others. Those outliers are likely the churned customers. Tenure has a high std (21.1) relative to its mean (27.0) — indicating customers are either very new or very long-tenured, with little in between.
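That hunch about the outliers is testable right now with a boolean filter — a sketch that keeps only the customers with more than 4 complaints (above the 75th percentile):

```python
import pandas as pd

# Minimal rebuild of the relevant ConnectTel columns
df = pd.DataFrame({
    "customer_id": ["C001", "C002", "C003", "C004", "C005",
                    "C006", "C007", "C008", "C009", "C010"],
    "num_complaints": [2, 0, 5, 1, 8, 1, 0, 6, 0, 3],
    "churned": ["Yes", "No", "Yes", "No", "Yes", "No", "No", "Yes", "No", "Yes"],
})

# Boolean filter: keep only rows where the condition inside df[...] is True
heavy = df[df["num_complaints"] > 4]
print(heavy)
```

All three heavy complainers (C003, C005, C008) churned — the outliers and the churn signal are the same customers.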
The VP asked for the churn rate first. This is the single most important number in the analysis. We calculate it overall, compare it to the 15% industry benchmark, then break it down by contract type — answering business questions 1 and 2 in one code block.
# Calculate overall churn rate using the boolean mean trick
# (df["churned"] == "Yes") creates a boolean Series — True where churned, False otherwise
# .mean() on a boolean Series returns the proportion of Trues
# because Python treats True = 1 and False = 0 in arithmetic
churn_rate = (df["churned"] == "Yes").mean()
print(f"Overall churn rate: {churn_rate:.1%}")
print(f"Industry benchmark: 15.0%")
print(f"Variance from benchmark: {(churn_rate - 0.15):+.1%}")
# .value_counts() counts how many rows have each unique value in a column
print("\nChurn counts:")
print(df["churned"].value_counts())
# Group by contract type, then calculate churn rate per group
# groupby("contract_type") splits the DataFrame into one sub-group per contract type
# apply() runs the lambda function independently on each sub-group
# The lambda uses the same boolean mean trick to get one churn rate per group
churn_by_contract = df.groupby("contract_type").apply(
lambda grp: (grp["churned"] == "Yes").mean()
).round(3)
print("\nChurn rate by contract type:")
print(churn_by_contract.sort_values(ascending=False))
Overall churn rate: 50.0%
Industry benchmark: 15.0%
Variance from benchmark: +35.0%

Churn counts:
churned
No     5
Yes    5
Name: count, dtype: int64

Churn rate by contract type:
contract_type
Month-to-Month    1.000
Annual            0.000
Two-Year          0.000
dtype: float64
What just happened?
Method — boolean mean trick
When you compare a column to a value using == "Yes", pandas returns a boolean Series — one True or False per row. Calling .mean() on it works because Python treats True as 1 and False as 0, so the mean gives you the proportion of Trues directly. This is cleaner and faster than filtering with df[df["churned"] == "Yes"] and counting manually. You will see this pattern constantly in production EDA code.
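The trick in isolation — a two-line sketch you can run anywhere:

```python
import pandas as pd

flags = pd.Series([True, False, True, True])
# True is treated as 1 and False as 0, so the mean is the share of Trues
print(flags.mean())  # 3 Trues out of 4 values -> 0.75
```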
groupby("contract_type") splits the DataFrame into three separate sub-groups — one per contract type. apply(lambda grp: ...) runs our churn calculation independently inside each group. The lambda grp syntax defines an anonymous function inline rather than requiring a separately named function. The result is one churn rate per contract type, returned as a labelled Series.
This is the finding that changes everything. ConnectTel's overall churn rate is 50% — more than three times the 15% industry benchmark. The contract split is absolute: every single Month-to-Month customer churned, while zero Annual or Two-Year customers left. This one finding should drive an immediate contract upgrade campaign targeting Month-to-Month customers before anything else.
We know who churns. Now we need to quantify how different they are from retained customers across the numeric variables. Tenure and complaint count are the two strongest signals. We also compute Pearson correlation to rank every variable by its linear relationship strength with churn — answering business questions 3 and 4.
# Group by churned status and compute the mean for three key numeric columns
# This gives us the "average profile" of a churned customer vs a retained one
profile = df.groupby("churned")[
["tenure_months", "monthly_charge", "num_complaints"]
].mean().round(1)
print("Average profile by churn status:")
print(profile)
# Compute the gap between groups
# .loc["Yes"] selects the row labelled "Yes", .loc["No"] selects "No"
print("\nDifference (Churned minus Retained):")
diff = profile.loc["Yes"] - profile.loc["No"]
print(diff)
# Create a binary churn flag: 1 = churned, 0 = retained
# .corr() needs a numeric column — the "Yes"/"No" string column will not work
df["churn_flag"] = (df["churned"] == "Yes").astype(int)
# .corr() computes the Pearson correlation coefficient between all numeric column pairs
# producing a full matrix — we extract just the churn_flag column
# Pearson r ranges from -1 (perfect inverse) to +1 (perfect direct relationship)
print("\nCorrelation with churn (Pearson r):")
corr = df[["age","tenure_months","monthly_charge",
"num_complaints","churn_flag"]].corr()["churn_flag"].drop("churn_flag")
print(corr.round(3).sort_values(ascending=False))
Average profile by churn status:
         tenure_months  monthly_charge  num_complaints
churned
No                44.4            85.0             0.4
Yes                9.6            57.0             4.8

Difference (Churned minus Retained):
tenure_months    -34.8
monthly_charge   -28.0
num_complaints     4.4
dtype: float64
Correlation with churn (Pearson r):
num_complaints    0.818
age              -0.794
monthly_charge   -0.862
tenure_months    -0.870
Name: churn_flag, dtype: float64

What just happened?
Method — groupby().mean()
Instead of apply(), we chained .mean() directly after groupby() — this is the most common aggregation pattern in pandas and works whenever you want the same function applied to every selected column. The double-bracket selector [["col1","col2","col3"]] restricts which columns get aggregated, keeping the output table clean and focused.
.corr() computes the Pearson correlation coefficient between every pair of numeric columns, producing a full N×N matrix. We first created churn_flag as a 1/0 integer column using .astype(int) — correlation requires numeric input and the original churned column is a string. We then extracted just the churn_flag column from the matrix and removed its self-correlation row with .drop("churn_flag").
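When you only need one coefficient, Series.corr computes a single pairwise Pearson r without building the full matrix — a self-contained sketch using the tenure column:

```python
import pandas as pd

# Minimal rebuild: tenure and the churn outcome
df = pd.DataFrame({
    "tenure_months": [12, 48, 6, 60, 3, 36, 24, 9, 54, 18],
    "churned": ["Yes", "No", "Yes", "No", "Yes", "No", "No", "Yes", "No", "Yes"],
})
df["churn_flag"] = (df["churned"] == "Yes").astype(int)

# Series.corr(other) returns one Pearson r instead of an N x N matrix
r = df["tenure_months"].corr(df["churn_flag"])
print(round(r, 3))
```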
Churned customers have 34.8 fewer months of tenure on average — they are leaving before they ever become loyal to the brand. They also file 4.4 more complaints than retained customers and pay $28 less per month, suggesting lower-tier plan holders are the most vulnerable segment. The correlation confirms the hierarchy: tenure_months at −0.870 is the strongest predictor (more tenure, much less churn), followed by num_complaints at +0.818 (more complaints, much more churn). Both clear the ±0.7 threshold that marks a strong linear signal — in fact, with only 10 rows, all four numeric variables do, a reminder that tiny samples exaggerate signal strength.
Checkpoint — before moving on: If you are following along in Colab or Jupyter, try modifying Step 3 to group by age bands instead of contract type. Create a new column df["age_band"] = pd.cut(df["age"], bins=[18,30,45,65], labels=["18-30","31-45","46-65"]) and run the same groupby. Which age group churns most? This is how analysts explore — one dimension at a time, systematically. Answer is in the next case study.
Key Findings
ConnectTel's churn rate is 50% — more than three times the 15% industry benchmark. This is a company-threatening level of customer loss that requires immediate intervention, not quarterly review.
Every single Month-to-Month contract customer churned (100% rate), while zero Annual or Two-Year customers left. Contract type is the single strongest categorical predictor of churn in this dataset.
Churned customers averaged 9.6 months of tenure vs 44.4 months for retained — a gap of nearly 35 months. Customers who reach three years with ConnectTel appear substantially more stable and less likely to leave.
Churned customers averaged 4.8 complaints vs 0.4 for retained customers. The Pearson correlation of +0.818 confirms this is the strongest positively correlated numeric churn predictor — every additional complaint filed materially increases churn risk.
The highest-risk segment is Month-to-Month customers in their first 12 months with 3 or more complaints. The retention team should target this group first with proactive outreach and contract upgrade offers — not the full customer base.
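The findings above translate directly into a target list. A sketch that combines the three risk criteria with boolean AND (&) — note that each condition needs its own parentheses:

```python
import pandas as pd

# Minimal rebuild of the ConnectTel columns the criteria need
df = pd.DataFrame({
    "customer_id": ["C001", "C002", "C003", "C004", "C005",
                    "C006", "C007", "C008", "C009", "C010"],
    "tenure_months": [12, 48, 6, 60, 3, 36, 24, 9, 54, 18],
    "contract_type": ["Month-to-Month", "Annual", "Month-to-Month", "Two-Year",
                      "Month-to-Month", "Annual", "Annual", "Month-to-Month",
                      "Two-Year", "Month-to-Month"],
    "num_complaints": [2, 0, 5, 1, 8, 1, 0, 6, 0, 3],
})

# Month-to-Month AND first 12 months AND 3+ complaints
at_risk = df[
    (df["contract_type"] == "Month-to-Month")
    & (df["tenure_months"] <= 12)
    & (df["num_complaints"] >= 3)
]
print(at_risk["customer_id"].tolist())  # ['C003', 'C005', 'C008']
```

In this sample the filter surfaces three customers — at scale, this same expression becomes the retention team's weekly outreach list.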
Visualisations
(Chart: Pearson correlation of each numeric variable with churn. Negative = more of this, less churn · Positive = more of this, more churn)
Choosing What to Explore — The Decision Guide
Choosing what to investigate in an EDA is as important as the code itself. Here is the framework for prioritising which variables to explore when the target is a binary outcome like churn:
| Variable Type | Best EDA Method | Look For | pandas Call |
|---|---|---|---|
| Categorical (contract, region) | Churn rate per group | One group dramatically higher than others | groupby().apply() |
| Continuous (tenure, charge) | Mean per churn class | Large gap between churned and retained means | groupby().mean() |
| Continuous (any numeric) | Pearson correlation | Absolute r above 0.7 = strong linear signal | .corr()["churn_flag"] |
| Count (complaints, logins) | Mean plus distribution | Skewed high values concentrated in one class | groupby().describe() |
| Binary target (churned) | Overall rate and class counts | Class imbalance — critical to know before modelling | .value_counts(normalize=True) |
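The last row of the table — the class-balance check — looks like this in practice:

```python
import pandas as pd

churned = pd.Series(["Yes", "No", "Yes", "No", "Yes", "No", "No", "Yes", "No", "Yes"])

# normalize=True turns raw counts into proportions of the whole Series
balance = churned.value_counts(normalize=True)
print(balance)
```

ConnectTel's perfectly balanced 50/50 split is rare; real churn datasets are usually heavily imbalanced, which changes how any downstream model must be trained and evaluated.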
What Would Come Next?
A senior analyst would build a logistic regression model using contract_type and num_complaints as features to score every customer by churn probability, then surface the top 20% highest-risk customers in a weekly dashboard. An A/B test — discounted Annual contracts offered to a random half of Month-to-Month customers — would validate whether the contract-type finding holds at scale.
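A minimal sketch of that next step, assuming scikit-learn is available — illustrative only, since 10 rows is far too few to train a real model:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# The same 10 ConnectTel customers, features plus outcome
df = pd.DataFrame({
    "contract_type": ["Month-to-Month", "Annual", "Month-to-Month", "Two-Year",
                      "Month-to-Month", "Annual", "Annual", "Month-to-Month",
                      "Two-Year", "Month-to-Month"],
    "num_complaints": [2, 0, 5, 1, 8, 1, 0, 6, 0, 3],
    "churned": ["Yes", "No", "Yes", "No", "Yes", "No", "No", "Yes", "No", "Yes"],
})

# One-hot encode the categorical feature; the model needs numeric input
X = pd.get_dummies(df[["contract_type", "num_complaints"]])
y = (df["churned"] == "Yes").astype(int)

# Fit, then score every customer with a churn probability between 0 and 1
model = LogisticRegression().fit(X, y)
df["churn_probability"] = model.predict_proba(X)[:, 1]
print(df.sort_values("churn_probability", ascending=False))
```

The sorted probabilities are exactly what a weekly risk dashboard would surface — the top of the list is the outreach queue.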
Limitations of This Analysis
Ten rows is illustrative, not statistically significant. The 100% Month-to-Month churn rate is almost certainly a small-sample artefact. Real-world churn analysis needs thousands of records, and we have no data on why customers complained — service outages and billing errors require completely different retention responses.
Business Decisions This Could Drive
At scale, ConnectTel could trigger an automatic escalation call when a customer files their third complaint within 90 days, and offer a first-year loyalty discount to lock new customers into Annual contracts before they form a cancellation habit.
Practice Questions
1. What is the overall churn rate calculated in the ConnectTel dataset?
2. Which column showed the strongest negative correlation with customer churn in this dataset?
3. What pandas method splits a DataFrame into sub-groups so you can compute statistics independently per group?
Quiz
1. When you call .mean() on a boolean Series in pandas, what does it return?
2. In the ConnectTel analysis, which contract type had a 100% churn rate?
3. What does df.describe() return by default?
Up Next · Case Study 2
Sales Data Exploration
You are handed a messy 12-month sales file. Find the top-performing regions, spot the seasonal dips, and build the report that gets you in front of the sales director.