DS Case Studies
Exploring Customer Churn Data
Churn data is the most financially consequential signal a subscription business generates. A model cannot save customers it cannot identify. This case study is the translation layer: raw customer records in, actionable risk segments out.
You are a junior data analyst at ConnectTel, a mid-size telecommunications company. Your VP of Customer Success just dropped a brief on your desk: churn is up 4% this quarter and nobody knows why. Your job is to dig into the customer data, find the patterns, and tell the team exactly which customers are at risk — before they cancel.
All code in this case study runs in Google Colab — free, browser-based, no installation needed. Go to colab.research.google.com, create a new notebook, and paste each block as you follow along. Locally, Jupyter Notebook via Anaconda works identically.
What This Case Study Covers
Customer churn — the rate at which customers stop doing business with a company — is one of the most financially damaging problems a subscription business can face. Acquiring a new customer costs 5–7× more than retaining an existing one, which means every churned customer is a compounding loss. Before any machine learning model can predict churn, a data analyst must first understand the data: what patterns exist, which variables separate churned from retained, and which customer segments are highest risk.
Every churn EDA pipeline does the same three things: inspect and clean the raw customer data to catch type issues and missing values before they corrupt results, compute the churn rate overall and by segment to find where the problem is concentrated, and compare profiles of churned vs retained customers across key numeric variables to quantify the signal strength of each feature.
The EDA Toolkit for Churn Analysis
Data Inspection and Type Audit
Load the dataset, check shapes, inspect dtypes, and confirm there are no silent type mismatches — like a numeric column stored as object — that would corrupt downstream calculations.
Missing Value Audit
Count nulls per column and compute the missing percentage. A column that is 40% empty cannot be trusted for group comparisons. Knowing this before computing group means is not optional — it is step one.
Churn Rate by Segment
Compute overall churn rate, then break it down by contract type using groupby(). Segment-level rates reveal where the problem is concentrated and what the retention team should fix first.
Profile Comparison
Compute average tenure, complaint count, and monthly charge for churned vs retained customers. The bigger the gap between groups, the stronger the signal. These gaps become your model's most important features.
Correlation Analysis
Calculate Pearson correlation between each numeric variable and the binary churn flag. Anything above ±0.7 is a strong linear signal. Below ±0.3, the feature probably will not help the model much.
Dataset Overview
ConnectTel's customer database contains 10 records — a representative sample we will build directly with pd.DataFrame(). Each row is one customer. Each column captures a different signal about their relationship with the company. Here are the first five rows:
| customer_id | age | tenure_months | monthly_charge | contract_type | num_complaints | churned |
|---|---|---|---|---|---|---|
| C001 | 34 | 12 | 65 | Month-to-Month | 2 | Yes |
| C002 | 52 | 48 | 85 | Annual | 0 | No |
| C003 | 28 | 6 | 45 | Month-to-Month | 5 | Yes |
| C004 | 61 | 60 | 95 | Two-Year | 1 | No |
| C005 | 23 | 3 | 55 | Month-to-Month | 8 | Yes |
Showing first 5 of 10 rows · 7 columns
customer_id: Unique string ID per customer. Used for joining only — never analysed directly.
age: Customer age in years. Ranges 18–65. Younger customers may show different churn behaviour at renewal time.
tenure_months: How long the customer has been with ConnectTel. Key loyalty signal — longer tenure means a more stable customer.
monthly_charge: Amount paid per month in dollars. Higher charges can drive dissatisfaction if perceived value does not match.
contract_type: Month-to-Month, Annual, or Two-Year. This will turn out to be the single most important variable in the dataset.
num_complaints: Number of support tickets raised by the customer. A high complaint count is the strongest positively correlated numeric churn predictor.
churned: Yes or No. This is what we are trying to understand — and what a model would eventually be trained to predict.
Business Questions
Your manager has asked for answers to these specific questions by end of day. Every step in the analysis below maps directly to one of these deliverables — this is how real analyst work is structured.
What is the overall churn rate, and how does it compare to the 15% industry benchmark?
Do Month-to-Month customers churn at a significantly higher rate than Annual or Two-Year contract holders?
What is the average tenure of churned vs retained customers, and how large is the gap in months?
Is there a meaningful linear relationship between number of complaints filed and churn behaviour?
Which specific customer segment should the retention team prioritise first for outreach?
Step-by-Step Analysis
The scenario:
You have just received the raw customer export from ConnectTel's CRM. You will build the dataset in Python, inspect it for quality issues, then work through the analysis step-by-step — answering each of the five business questions above using pandas and numpy alone. Open your Colab notebook and follow along.
Before we analyse anything, we need data in memory. In real projects you would read a CSV with pd.read_csv() — here we build the dataset directly with pd.DataFrame() so you can run it anywhere without a file download. We then inspect shape, dtypes, and the first five rows before touching a single number.
# Import the two libraries we'll use throughout this case study
import pandas as pd # pandas: the core data manipulation library in Python
import numpy as np # numpy: numerical computing — needed for NaN values and stats later
# Build the ConnectTel dataset directly in code
# In a real project this would be: df = pd.read_csv("customers.csv")
df = pd.DataFrame({
"customer_id": ["C001","C002","C003","C004","C005",
"C006","C007","C008","C009","C010"],
"age": [34, 52, 28, 61, 23, 45, 37, 29, 55, 41],
"tenure_months": [12, 48, 6, 60, 3, 36, 24, 9, 54, 18],
"monthly_charge": [65, 85, 45, 95, 55, 75, 80, 50, 90, 70],
"contract_type": ["Month-to-Month","Annual","Month-to-Month","Two-Year",
"Month-to-Month","Annual","Annual","Month-to-Month",
"Two-Year","Month-to-Month"],
"num_complaints": [2, 0, 5, 1, 8, 1, 0, 6, 0, 3],
"churned": ["Yes","No","Yes","No","Yes","No","No","Yes","No","Yes"]
})
# .shape returns (rows, columns) — confirms we loaded what we expected
print("Dataset shape:", df.shape)
# .dtypes shows the data type pandas assigned to each column
# object = text/mixed int64 = integer float64 = decimal
print("\nColumn data types:")
print(df.dtypes)
# .head() previews the first 5 rows — always do this first to sanity-check
print("\nFirst 5 rows:")
print(df.head())
Dataset shape: (10, 7)

Column data types:
customer_id       object
age                int64
tenure_months      int64
monthly_charge     int64
contract_type     object
num_complaints     int64
churned           object
dtype: object

First 5 rows:
  customer_id  age  tenure_months  monthly_charge   contract_type  num_complaints churned
0        C001   34             12              65  Month-to-Month               2     Yes
1        C002   52             48              85          Annual               0      No
2        C003   28              6              45  Month-to-Month               5     Yes
3        C004   61             60              95        Two-Year               1      No
4        C005   23              3              55  Month-to-Month               8     Yes
What just happened?
Library — pandas
pandas is the foundational data manipulation library in Python. It introduces the DataFrame — a table-like structure with labelled rows and columns, equivalent to a spreadsheet or SQL table. Almost every data science workflow in Python starts here. We import it as pd by convention so every call is shorter to write.
numpy is Python's numerical computing engine. We are not doing heavy math yet, but we will need it for np.nan (representing missing values) and for numeric operations in later steps. Imported as np by convention.
We passed a Python dictionary where each key became a column name and each list became that column's values. The result is a fully-formed table confirmed by .shape returning (10, 7) — 10 rows, 7 columns.
The dtype output flags something important: contract_type and churned are stored as object — text. We cannot run numeric calculations on them directly. We will handle this in Step 3 using a boolean comparison. Also notice: the first three churned rows are all Month-to-Month customers. That is a pattern visible before any formal analysis even starts.
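That eyeballed pattern can be checked in one line with pd.crosstab, which counts the rows in every combination of two columns. A quick sketch using the same two columns (the dataset is rebuilt here so the block runs on its own):

```python
import pandas as pd

# Same contract_type and churned columns as the ConnectTel dataset above
df = pd.DataFrame({
    "contract_type": ["Month-to-Month", "Annual", "Month-to-Month", "Two-Year",
                      "Month-to-Month", "Annual", "Annual", "Month-to-Month",
                      "Two-Year", "Month-to-Month"],
    "churned": ["Yes", "No", "Yes", "No", "Yes", "No", "No", "Yes", "No", "Yes"],
})

# pd.crosstab counts how many rows fall into each (contract_type, churned) pair
ct = pd.crosstab(df["contract_type"], df["churned"])
print(ct)
```

Even before the formal analysis in Step 3, the cross-tab shows every churned customer sitting in the Month-to-Month row.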
Before calculating any group means or rates, we need to confirm the data is complete. Missing values silently corrupt averages — a group mean computed with 3 missing rows out of 10 is wrong but will not throw an error. This check takes 30 seconds and saves hours later.
# .isnull() returns True/False for every cell in the DataFrame
# .sum() counts the Trues per column — True counts as 1, False as 0
missing = df.isnull().sum()
print("Missing values per column:")
print(missing)
# Calculate missing as a percentage of total rows
missing_pct = (df.isnull().sum() / len(df)) * 100
print("\nMissing percentage:")
print(missing_pct)
# .describe() computes 8 summary statistics for every numeric column at once:
# count, mean, std (standard deviation), min, 25th pct, median, 75th pct, max
# It is the fastest single method for spotting outliers and understanding spread
print("\nNumeric summary statistics:")
print(df.describe().round(2))
Missing values per column:
customer_id 0
age 0
tenure_months 0
monthly_charge 0
contract_type 0
num_complaints 0
churned 0
dtype: int64
Missing percentage:
customer_id       0.0
age               0.0
tenure_months     0.0
monthly_charge    0.0
contract_type     0.0
num_complaints    0.0
churned           0.0
dtype: float64
Numeric summary statistics:
         age  tenure_months  monthly_charge  num_complaints
count  10.00          10.00           10.00           10.00
mean   40.50          27.00           71.00            2.60
std    12.62          21.07           17.13            2.84
min    23.00           3.00           45.00            0.00
25%    30.25           9.75           57.50            0.25
50%    39.00          21.00           72.50            1.50
75%    50.25          45.00           83.75            4.50
max    61.00          60.00           95.00            8.00

What just happened?
Method — .isnull().sum()
.isnull() creates a boolean mask — a DataFrame of the same shape where every cell is either True (missing) or False (present). Chaining .sum() collapses each column into a count by treating True as 1. The two-step pattern df.isnull().sum() is one of the most common checks in all of data science. Run it on every dataset before anything else.
.describe() runs eight statistics simultaneously on every numeric column. By default it skips object columns like churned automatically — no filtering required. We chained .round(2) to keep the decimal places readable. The std row shows standard deviation — how spread out the values are around the mean.
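.describe() can also summarise the text columns it skips by default — pass include="object" to get count, unique values, the most frequent value (top), and its frequency. A quick sketch on the two categorical columns:

```python
import pandas as pd

# Just the two object (text) columns from the ConnectTel dataset
df = pd.DataFrame({
    "contract_type": ["Month-to-Month", "Annual", "Month-to-Month", "Two-Year",
                      "Month-to-Month", "Annual", "Annual", "Month-to-Month",
                      "Two-Year", "Month-to-Month"],
    "churned": ["Yes", "No", "Yes", "No", "Yes", "No", "No", "Yes", "No", "Yes"],
})

# include="object" reports count, unique, top (mode), and freq for text columns
desc = df.describe(include="object")
print(desc)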
Zero missing values — the dataset is clean and we can proceed without imputation. But the summary statistics already reveal something worth flagging: num_complaints has a mean of 2.6 but a max of 8 — a wide spread suggesting some customers file dramatically more tickets than others. Those outliers are likely the churned customers. Tenure has a high std (21.1) relative to its mean (27.0) — indicating customers are either very new or very long-tenured, with little in between.
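That hunch about the outliers is testable right now with a boolean filter — a sketch that keeps only the customers with more than 4 complaints (above the 75th percentile):

```python
import pandas as pd

# Minimal rebuild of the relevant ConnectTel columns
df = pd.DataFrame({
    "customer_id": ["C001", "C002", "C003", "C004", "C005",
                    "C006", "C007", "C008", "C009", "C010"],
    "num_complaints": [2, 0, 5, 1, 8, 1, 0, 6, 0, 3],
    "churned": ["Yes", "No", "Yes", "No", "Yes", "No", "No", "Yes", "No", "Yes"],
})

# Boolean filter: keep only rows where the condition inside df[...] is True
heavy = df[df["num_complaints"] > 4]
print(heavy)
```

All three heavy complainers (C003, C005, C008) churned — the outliers and the churn signal are the same customers.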
The VP asked for the churn rate first. This is the single most important number in the analysis. We calculate it overall, compare it to the 15% industry benchmark, then break it down by contract type — answering business questions 1 and 2 in one code block.
# Calculate overall churn rate using the boolean mean trick
# (df["churned"] == "Yes") creates a boolean Series — True where churned, False otherwise
# .mean() on a boolean Series returns the proportion of Trues
# because Python treats True = 1 and False = 0 in arithmetic
churn_rate = (df["churned"] == "Yes").mean()
print(f"Overall churn rate: {churn_rate:.1%}")
print(f"Industry benchmark: 15.0%")
print(f"Variance from benchmark: {(churn_rate - 0.15):+.1%}")
# .value_counts() counts how many rows have each unique value in a column
print("\nChurn counts:")
print(df["churned"].value_counts())
# Group by contract type, then calculate churn rate per group
# groupby("contract_type") splits the DataFrame into one sub-group per contract type
# apply() runs the lambda function independently on each sub-group
# The lambda uses the same boolean mean trick to get one churn rate per group
churn_by_contract = df.groupby("contract_type").apply(
lambda grp: (grp["churned"] == "Yes").mean()
).round(3)
print("\nChurn rate by contract type:")
print(churn_by_contract.sort_values(ascending=False))
Overall churn rate: 50.0%
Industry benchmark: 15.0%
Variance from benchmark: +35.0%

Churn counts:
churned
No     5
Yes    5
Name: count, dtype: int64

Churn rate by contract type:
contract_type
Month-to-Month    1.000
Annual            0.000
Two-Year          0.000
dtype: float64
What just happened?
Method — boolean mean trick
When you compare a column to a value using == "Yes", pandas returns a boolean Series — one True or False per row. Calling .mean() on it works because Python treats True as 1 and False as 0, so the mean gives you the proportion of Trues directly. This is cleaner and faster than filtering with df[df["churned"] == "Yes"] and counting manually. You will see this pattern constantly in production EDA code.
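The trick in isolation — a two-line sketch you can run anywhere:

```python
import pandas as pd

flags = pd.Series([True, False, True, True])
# True is treated as 1 and False as 0, so the mean is the share of Trues
print(flags.mean())  # 3 Trues out of 4 values -> 0.75
```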
groupby("contract_type") splits the DataFrame into three separate sub-groups — one per contract type. apply(lambda grp: ...) runs our churn calculation independently inside each group. The lambda grp syntax defines an anonymous function inline rather than requiring a separately named function. The result is one churn rate per contract type, returned as a labelled Series.
This is the finding that changes everything. ConnectTel's overall churn rate is 50% — more than three times the 15% industry benchmark. The contract split is absolute: every single Month-to-Month customer churned, while zero Annual or Two-Year customers left. This one finding should drive an immediate contract upgrade campaign targeting Month-to-Month customers before anything else.
We know who churns. Now we need to quantify how different they are from retained customers across the numeric variables. Tenure and complaint count are the two strongest signals. We also compute Pearson correlation to rank every variable by its linear relationship strength with churn — answering business questions 3 and 4.
# Group by churned status and compute the mean for three key numeric columns
# This gives us the "average profile" of a churned customer vs a retained one
profile = df.groupby("churned")[
["tenure_months", "monthly_charge", "num_complaints"]
].mean().round(1)
print("Average profile by churn status:")
print(profile)
# Compute the gap between groups
# .loc["Yes"] selects the row labelled "Yes", .loc["No"] selects "No"
print("\nDifference (Churned minus Retained):")
diff = profile.loc["Yes"] - profile.loc["No"]
print(diff)
# Create a binary churn flag: 1 = churned, 0 = retained
# .corr() needs a numeric column — the "Yes"/"No" string column will not work
df["churn_flag"] = (df["churned"] == "Yes").astype(int)
# .corr() computes the Pearson correlation coefficient between all numeric column pairs
# producing a full matrix — we extract just the churn_flag column
# Pearson r ranges from -1 (perfect inverse) to +1 (perfect direct relationship)
print("\nCorrelation with churn (Pearson r):")
corr = df[["age","tenure_months","monthly_charge",
"num_complaints","churn_flag"]].corr()["churn_flag"].drop("churn_flag")
print(corr.round(3).sort_values(ascending=False))
Average profile by churn status:
         tenure_months  monthly_charge  num_complaints
churned
No                44.4            85.0             0.4
Yes                9.6            57.0             4.8

Difference (Churned minus Retained):
tenure_months    -34.8
monthly_charge   -28.0
num_complaints     4.4
dtype: float64
Correlation with churn (Pearson r):
num_complaints    0.818
age              -0.794
monthly_charge   -0.862
tenure_months    -0.870
Name: churn_flag, dtype: float64

What just happened?
Method — groupby().mean()
Instead of apply(), we chained .mean() directly after groupby() — this is the most common aggregation pattern in pandas and works whenever you want the same function applied to every selected column. The double-bracket selector [["col1","col2","col3"]] restricts which columns get aggregated, keeping the output table clean and focused.
.corr() computes the Pearson correlation coefficient between every pair of numeric columns, producing a full N×N matrix. We first created churn_flag as a 1/0 integer column using .astype(int) — correlation requires numeric input and the original churned column is a string. We then extracted just the churn_flag column from the matrix and removed its self-correlation row with .drop("churn_flag").
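When you only need one coefficient, Series.corr computes a single pairwise Pearson r without building the full matrix — a self-contained sketch using the tenure column:

```python
import pandas as pd

# Minimal rebuild: tenure and the churn outcome
df = pd.DataFrame({
    "tenure_months": [12, 48, 6, 60, 3, 36, 24, 9, 54, 18],
    "churned": ["Yes", "No", "Yes", "No", "Yes", "No", "No", "Yes", "No", "Yes"],
})
df["churn_flag"] = (df["churned"] == "Yes").astype(int)

# Series.corr(other) returns one Pearson r instead of an N x N matrix
r = df["tenure_months"].corr(df["churn_flag"])
print(round(r, 3))
```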
Churned customers have 34.8 fewer months of tenure on average — they are leaving before they ever become loyal to the brand. They also file 4.4 more complaints than retained customers and pay $28 less per month, suggesting lower-tier plan holders are the most vulnerable segment. The correlation confirms the hierarchy: tenure_months at −0.870 is the strongest predictor (more tenure, much less churn), followed by num_complaints at +0.818 (more complaints, much more churn). Both clear the ±0.7 threshold that marks a strong linear signal — in fact, with only 10 rows, all four numeric variables do, a reminder that tiny samples exaggerate signal strength.
Checkpoint — before moving on: If you are following along in Colab or Jupyter, try modifying Step 3 to group by age bands instead of contract type. Create a new column df["age_band"] = pd.cut(df["age"], bins=[18,30,45,65], labels=["18-30","31-45","46-65"]) and run the same groupby. Which age group churns most? This is how analysts explore — one dimension at a time, systematically. Answer is in the next case study.
Key Findings
ConnectTel's churn rate is 50% — more than three times the 15% industry benchmark. This is a company-threatening level of customer loss that requires immediate intervention, not quarterly review.
Every single Month-to-Month contract customer churned (100% rate), while zero Annual or Two-Year customers left. Contract type is the single strongest categorical predictor of churn in this dataset.
Churned customers averaged 9.6 months of tenure vs 44.4 months for retained — a gap of nearly 35 months. Customers who reach three years with ConnectTel appear substantially more stable and less likely to leave.
Churned customers averaged 4.8 complaints vs 0.4 for retained customers. The Pearson correlation of +0.818 confirms this is the strongest positively correlated numeric churn predictor — every additional complaint filed materially increases churn risk.
The highest-risk segment is Month-to-Month customers in their first 12 months with 3 or more complaints. The retention team should target this group first with proactive outreach and contract upgrade offers — not the full customer base.
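The findings above translate directly into a target list. A sketch that combines the three risk criteria with boolean AND (&) — note that each condition needs its own parentheses:

```python
import pandas as pd

# Minimal rebuild of the ConnectTel columns the criteria need
df = pd.DataFrame({
    "customer_id": ["C001", "C002", "C003", "C004", "C005",
                    "C006", "C007", "C008", "C009", "C010"],
    "tenure_months": [12, 48, 6, 60, 3, 36, 24, 9, 54, 18],
    "contract_type": ["Month-to-Month", "Annual", "Month-to-Month", "Two-Year",
                      "Month-to-Month", "Annual", "Annual", "Month-to-Month",
                      "Two-Year", "Month-to-Month"],
    "num_complaints": [2, 0, 5, 1, 8, 1, 0, 6, 0, 3],
})

# Month-to-Month AND first 12 months AND 3+ complaints
at_risk = df[
    (df["contract_type"] == "Month-to-Month")
    & (df["tenure_months"] <= 12)
    & (df["num_complaints"] >= 3)
]
print(at_risk["customer_id"].tolist())  # ['C003', 'C005', 'C008']
```

In this sample the filter surfaces three customers — at scale, this same expression becomes the retention team's weekly outreach list.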
Visualisations
(Chart: Pearson correlation of each numeric variable with churn. Negative = more of this, less churn · Positive = more of this, more churn)
Choosing What to Explore — The Decision Guide
Choosing what to investigate in an EDA is as important as the code itself. Here is the framework for prioritising which variables to explore when the target is a binary outcome like churn:
| Variable Type | Best EDA Method | Look For | pandas Call |
|---|---|---|---|
| Categorical (contract, region) | Churn rate per group | One group dramatically higher than others | groupby().apply() |
| Continuous (tenure, charge) | Mean per churn class | Large gap between churned and retained means | groupby().mean() |
| Continuous (any numeric) | Pearson correlation | Absolute r above 0.7 = strong linear signal | .corr()["churn_flag"] |
| Count (complaints, logins) | Mean plus distribution | Skewed high values concentrated in one class | groupby().describe() |
| Binary target (churned) | Overall rate and class counts | Class imbalance — critical to know before modelling | .value_counts(normalize=True) |
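The last row of the table — the class-balance check — looks like this in practice:

```python
import pandas as pd

churned = pd.Series(["Yes", "No", "Yes", "No", "Yes", "No", "No", "Yes", "No", "Yes"])

# normalize=True turns raw counts into proportions of the whole Series
balance = churned.value_counts(normalize=True)
print(balance)
```

ConnectTel's perfectly balanced 50/50 split is rare; real churn datasets are usually heavily imbalanced, which changes how any downstream model must be trained and evaluated.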
What Would Come Next?
A senior analyst would build a logistic regression model using contract_type and num_complaints as features to score every customer by churn probability, then surface the top 20% highest-risk customers in a weekly dashboard. An A/B test — discounted Annual contracts offered to a random half of Month-to-Month customers — would validate whether the contract-type finding holds at scale.
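A minimal sketch of that next step, assuming scikit-learn is available — illustrative only, since 10 rows is far too few to train a real model:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# The same 10 ConnectTel customers, features plus outcome
df = pd.DataFrame({
    "contract_type": ["Month-to-Month", "Annual", "Month-to-Month", "Two-Year",
                      "Month-to-Month", "Annual", "Annual", "Month-to-Month",
                      "Two-Year", "Month-to-Month"],
    "num_complaints": [2, 0, 5, 1, 8, 1, 0, 6, 0, 3],
    "churned": ["Yes", "No", "Yes", "No", "Yes", "No", "No", "Yes", "No", "Yes"],
})

# One-hot encode the categorical feature; the model needs numeric input
X = pd.get_dummies(df[["contract_type", "num_complaints"]])
y = (df["churned"] == "Yes").astype(int)

# Fit, then score every customer with a churn probability between 0 and 1
model = LogisticRegression().fit(X, y)
df["churn_probability"] = model.predict_proba(X)[:, 1]
print(df.sort_values("churn_probability", ascending=False))
```

The sorted probabilities are exactly what a weekly risk dashboard would surface — the top of the list is the outreach queue.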
Limitations of This Analysis
Ten rows is illustrative, not statistically significant. The 100% Month-to-Month churn rate is almost certainly a small-sample artefact. Real-world churn analysis needs thousands of records, and we have no data on why customers complained — service outages and billing errors require completely different retention responses.
Business Decisions This Could Drive
At scale, ConnectTel could trigger an automatic escalation call when a customer files their third complaint within 90 days, and offer a first-year loyalty discount to lock new customers into Annual contracts before they form a cancellation habit.
Practice Questions
1. What is the overall churn rate calculated in the ConnectTel dataset?
2. Which column showed the strongest negative correlation with customer churn in this dataset?
3. What pandas method splits a DataFrame into sub-groups so you can compute statistics independently per group?
Quiz
1. When you call .mean() on a boolean Series in pandas, what does it return?
2. In the ConnectTel analysis, which contract type had a 100% churn rate?
3. What does df.describe() return by default?
Up Next · Case Study 2
Sales Data Exploration
You are handed a messy 12-month sales file. Find the top-performing regions, spot the seasonal dips, and build the report that gets you in front of the sales director.