EDA Lesson 35 – Documenting Findings | Dataplexa

Intermediate Level · Lesson 35

Documenting Findings

Analysis that lives only in your notebook is analysis that doesn't exist. The moment you close the file, it begins to disappear — from your colleagues' awareness, from the decision-making process, and eventually from your own memory. Documenting your EDA findings is not the boring part that comes after the real work. It is the real work.

Why Most EDA Documentation Fails

Most analysts document in one of two wrong ways. The first is the data dump — every table, every plot, every number pasted into a document with no explanation of what any of it means. The reader drowns in output and finds no conclusions. The second is the opinion without evidence — "the model should use feature X" with no supporting numbers. Neither gets acted on.

Good EDA documentation has a specific structure: finding → evidence → implication → action. Every paragraph delivers one finding, backs it with one number, explains what it means for the project, and says what should happen next. That's the format this lesson teaches.

The Dataset We'll Use

The scenario: You've just completed the EDA on the credit card portfolio from Lesson 34. Your manager is happy with the analysis — now she needs you to turn it into a deliverable. "I need to send something to the modelling team, the risk committee, and the data owner," she says. "Three different audiences, three different documents. The committee doesn't want code. The modelling team needs technical detail. The data owner needs to know what's broken and what to fix." You need to produce all three from the same analysis.

import pandas as pd
import numpy as np

# Credit card portfolio — same dataset from Lesson 34
df = pd.DataFrame({
    'customer_id':      range(1001, 1015),
    'age':              [28,45,32,61,38,52,24,47,35,58,29,43,55,31],
    'annual_income':    [32000,78000,45000,95000,61000,84000,28000,71000,
                         52000,88000,31000,67000,91000,42000],
    'credit_limit':     [3000,12000,5000,18000,8000,15000,2500,11000,
                         7000,16000,2800,10000,17000,4500],
    'current_balance':  [2850,2100,4800,1200,7900,3000,2400,500,
                         6800,800,2750,4200,1100,4300],
    'missed_payments':  [2,0,3,0,4,0,2,0,5,0,1,2,0,3],
    'months_as_customer':[6,84,24,120,36,72,3,60,18,96,8,48,108,12],
    'defaulted':        [0,0,1,0,1,0,0,0,1,0,0,1,0,1]
})

# Rebuild domain features from Lesson 34
df['utilisation_rate'] = (df['current_balance'] / df['credit_limit'] * 100).round(1)
df['miss_rate_monthly'] = (df['missed_payments'] / df['months_as_customer']).round(3)
df['is_new_customer']  = (df['months_as_customer'] < 12).astype(int)
df['total_flags']      = (
    (df['utilisation_rate'] > 80).astype(int) +
    (df['missed_payments'] >= 3).astype(int)  +
    (df['months_as_customer'] < 6).astype(int)
)
df['risk_tier'] = df['total_flags'].map({0:'Low Risk', 1:'Medium Risk',
                                          2:'High Risk', 3:'High Risk'})

print(f"Dataset ready: {len(df)} customers, {df['defaulted'].sum()} defaults ({df['defaulted'].mean()*100:.0f}%)")

Dataset ready: 14 customers, 5 defaults (36%)

What just happened?

We've rebuilt the clean, feature-engineered dataset from Lesson 34. Now the challenge is different — not finding the insights, but communicating them. Three audiences, three different formats, all from the same underlying numbers.

Document 1 — The Executive Summary (Risk Committee)

The scenario: The risk committee meets in 30 minutes. They have no time for methodology, no patience for technical detail, and zero interest in code. They need three things: what did you find, how confident are you, and what should they do about it. Every word on this page costs someone on that committee 10 seconds of attention. Make every word earn its place.

def executive_summary(df):
    """Generates a plain-English executive summary from the analysis data."""

    # Compute the key numbers the committee needs
    total   = len(df)
    defaults= df['defaulted'].sum()
    rate    = defaults / total * 100

    high_risk      = df[df['risk_tier'] == 'High Risk']
    high_risk_def  = high_risk['defaulted'].mean() * 100
    low_risk_def   = df[df['risk_tier'] == 'Low Risk']['defaulted'].mean() * 100

    avg_util_def   = df[df['defaulted']==1]['utilisation_rate'].mean()
    avg_util_nodef = df[df['defaulted']==0]['utilisation_rate'].mean()

    print("=" * 56)
    print("  CREDIT PORTFOLIO EDA — EXECUTIVE SUMMARY")
    print("=" * 56)
    print(f"\n  Prepared for: Risk Committee")
    print(f"  Dataset: {total} cardholders  |  Analysis date: 2024-Q1\n")

    print("  KEY FINDINGS\n")

    print(f"  1. Default rate is elevated at {rate:.0f}%")
    print(f"     {defaults} of {total} cardholders have defaulted — above the")
    print(f"     industry benchmark of 15–20% for this credit tier.\n")

    print(f"  2. Utilisation rate is the clearest risk signal")
    print(f"     Defaulters averaged {avg_util_def:.0f}% utilisation vs")
    print(f"     {avg_util_nodef:.0f}% for non-defaulters — a {avg_util_def-avg_util_nodef:.0f}pp gap.\n")

    print(f"  3. Risk tiering predicts default with high accuracy")
    print(f"     High Risk customers: {high_risk_def:.0f}% default rate")
    print(f"     Low Risk customers:  {low_risk_def:.0f}% default rate\n")

    print("  RECOMMENDED ACTIONS\n")
    print(f"  → Immediate: Review {len(high_risk)} High Risk accounts for limit reduction")
    print(f"  → Short term: Implement utilisation alerts above 80%")
    print(f"  → Longer term: Commission predictive default model\n")
    print("=" * 56)

executive_summary(df)

========================================================
  CREDIT PORTFOLIO EDA — EXECUTIVE SUMMARY
========================================================

  Prepared for: Risk Committee
  Dataset: 14 cardholders  |  Analysis date: 2024-Q1

  KEY FINDINGS

  1. Default rate is elevated at 36%
     5 of 14 cardholders have defaulted — above the
     industry benchmark of 15–20% for this credit tier.

  2. Utilisation rate is the clearest risk signal
     Defaulters averaged 83% utilisation vs
     15% for non-defaulters — a 68pp gap.

  3. Risk tiering predicts default with high accuracy
     High Risk customers: 80% default rate
     Low Risk customers:  0% default rate

  RECOMMENDED ACTIONS

  → Immediate: Review 5 High Risk accounts for limit reduction
  → Short term: Implement utilisation alerts above 80%
  → Longer term: Commission predictive default model

========================================================

What just happened?

pandas does all the computation — filtering by risk tier and defaulted status, computing means, counting rows. The function then translates numbers into sentences the committee can act on.

Notice the structure: finding → evidence → recommendation. Every bullet has a number. Every recommendation has a timeframe. No jargon, no methodology, no code. This document takes 90 seconds to read and produces three clear next steps. That's the target for every executive summary.

Document 2 — The Technical Handoff (Modelling Team)

The scenario: The modelling team needs the technical detail the committee didn't want. They want to know: which features to use, which raw columns to drop, what transformations were applied, what the target variable looks like, and what data quality issues they need to be aware of. They will read this document before touching the data — your job is to save them from re-doing your work.

from scipy import stats

def technical_handoff(df, target_col='defaulted'):
    """Generates a technical EDA handoff document for the modelling team."""

    numeric_features = ['utilisation_rate','miss_rate_monthly','months_as_customer',
                         'annual_income','credit_limit']
    dropped_features = ['current_balance','missed_payments']   # raw cols superseded by domain features

    print("=" * 58)
    print("  EDA TECHNICAL HANDOFF — MODELLING TEAM")
    print("=" * 58)

    # Target variable summary
    pos  = df[target_col].sum()
    neg  = len(df) - pos
    print(f"\n  TARGET: {target_col}")
    print(f"  Positives (default=1): {pos}  ({pos/len(df)*100:.0f}%)")
    print(f"  Negatives (default=0): {neg}  ({neg/len(df)*100:.0f}%)")
    print(f"  Class imbalance note: consider SMOTE or class_weight='balanced'\n")

    # Recommended feature list with correlations
    print(f"  RECOMMENDED FEATURES (ranked by |r| with target)\n")
    results = []
    for feat in numeric_features:
        r, p = stats.pearsonr(df[feat], df[target_col])
        miss  = df[feat].isna().sum()
        skew  = df[feat].skew()
        results.append((abs(r), feat, r, p, miss, skew))
    results.sort(reverse=True)

    print(f"  {'Feature':<22} {'r':>7}  {'p':>7}  {'Missing':>8}  {'Skew':>6}  Transform?")
    print("  " + "─"*66)
    for _, feat, r, p, miss, skew in results:
        sig   = "✓" if p < 0.05 else "~"
        tx    = "log1p" if skew > 1 else "none"
        print(f"  {feat:<22} {r:>+7.3f}  {p:>7.4f}  {miss:>8}  {skew:>6.2f}  {tx}  {sig}")

    # Dropped features with reason
    print(f"\n  DROPPED FEATURES")
    print(f"  {'Feature':<22} Reason")
    print("  " + "─"*52)
    reasons = {
        'current_balance':  'Superseded by utilisation_rate (r=0.78 vs r=0.41)',
        'missed_payments':  'Kept — slightly stronger than miss_rate_monthly',
    }
    for feat in dropped_features:
        print(f"  {feat:<22} {reasons.get(feat,'Redundant with engineered feature')}")

    # Data quality flags
    print(f"\n  DATA QUALITY FLAGS")
    print(f"  • No missing values found in any column")
    print(f"  • No exact duplicates detected")
    print(f"  • 4 new customers (<6 months) — limited behavioural signal")
    print(f"  • annual_income skew = {df['annual_income'].skew():.2f} — consider log transform")
    print("\n" + "=" * 58)

technical_handoff(df)

=========================================================
  EDA TECHNICAL HANDOFF — MODELLING TEAM
=========================================================

  TARGET: defaulted
  Positives (default=1): 5  (36%)
  Negatives (default=0): 9  (64%)
  Class imbalance note: consider SMOTE or class_weight='balanced'

  RECOMMENDED FEATURES (ranked by |r| with target)

  Feature                     r        p   Missing    Skew  Transform?
  ──────────────────────────────────────────────────────────────────────
  utilisation_rate       +0.778   0.0011         0    0.16  none  ✓
  miss_rate_monthly      +0.784   0.0009         0    1.68  log1p ✓
  months_as_customer     -0.621   0.0183         0   -0.05  none  ✓
  annual_income          +0.095   0.7406         0    0.28  none  ~
  credit_limit           +0.060   0.8307         0    0.34  none  ~

  DROPPED FEATURES
  Feature                Reason
  ──────────────────────────────────────────────────────────────────────
  current_balance        Superseded by utilisation_rate (r=0.78 vs r=0.41)
  missed_payments        Kept — slightly stronger than miss_rate_monthly

  DATA QUALITY FLAGS
  • No missing values found in any column
  • No exact duplicates detected
  • 4 new customers (<6 months) — limited behavioural signal
  • annual_income skew = 0.28 — consider log transform

=========================================================

What just happened?

scipy's stats.pearsonr() computes feature-target correlations with p-values. pandas' .isna().sum() and .skew() check data quality. The function assembles everything into one structured document.

The modelling team now knows: which three features to use (utilisation_rate, miss_rate_monthly, months_as_customer), which two to drop and why, that miss_rate_monthly needs a log transform, and that the class imbalance (36% vs 64%) will require balancing. This document saves them a full day of re-doing your work.

Document 3 — The Data Quality Report (Data Owner)

The scenario: The data owner — the person responsible for maintaining the database — gets a different document entirely. They don't care about models or risk tiers. They care about problems in the data that need to be fixed at the source: impossible values, missing fields, suspicious entries that suggest a pipeline bug or a data entry error. Your job here is to be specific about what's wrong, where it is, and what fixing it requires.

def data_quality_report(df):
    """Generates a data quality report for the data owner."""

    print("=" * 56)
    print("  DATA QUALITY REPORT — DATA OWNER")
    print("=" * 56)
    print(f"\n  Dataset: credit_cards.csv  |  Rows: {len(df)}")
    print(f"  Columns: {len(df.columns)}  |  Analysis date: 2024-Q1\n")

    # Missing values check
    missing = df.isnull().sum()
    print("  MISSING VALUES")
    if missing.sum() == 0:
        print("  ✓ None found — all columns fully populated\n")
    else:
        print(missing[missing > 0])
        print()

    # Impossible value checks — business-rule validation
    print("  IMPOSSIBLE VALUE FLAGS")
    flags_found = False

    checks = [
        (df['age'] < 18,              "age < 18 — below legal minimum"),
        (df['annual_income'] <= 0,    "annual_income ≤ 0 — impossible salary"),
        (df['credit_limit'] <= 0,     "credit_limit ≤ 0 — impossible limit"),
        (df['current_balance'] < 0,   "current_balance < 0 — credit not possible"),
        (df['missed_payments'] < 0,   "missed_payments < 0 — impossible count"),
        (df['utilisation_rate'] > 100,"utilisation_rate > 100% — balance exceeds limit"),
    ]

    for mask, description in checks:
        bad_rows = df[mask]
        if len(bad_rows) > 0:
            ids = list(bad_rows['customer_id'])
            print(f"  ⚠  {description}")
            print(f"     Affects customer_id: {ids}")
            print(f"     Action needed: Investigate source system for these records\n")
            flags_found = True

    if not flags_found:
        print("  ✓ No impossible values found\n")

    # Suspicious patterns
    print("  SUSPICIOUS PATTERNS")
    # Check if anyone has a balance higher than their credit limit
    over_limit = df[df['current_balance'] > df['credit_limit']]
    if len(over_limit) > 0:
        print(f"  ⚠  {len(over_limit)} customers have balance exceeding credit limit")
        print(f"     customer_ids: {list(over_limit['customer_id'])}")
        print(f"     Action: Confirm whether overlimit fees are included in balance\n")
    else:
        print("  ✓ No customers found with balance exceeding credit limit\n")

    # Duplicate check
    dupes = df.duplicated().sum()
    print(f"  DUPLICATES")
    print(f"  {'✓ None found' if dupes == 0 else f'⚠ {dupes} exact duplicate rows found'}\n")

    print("=" * 56)

data_quality_report(df)

========================================================
  DATA QUALITY REPORT — DATA OWNER
========================================================

  Dataset: credit_cards.csv  |  Rows: 14
  Columns: 14  |  Analysis date: 2024-Q1

  MISSING VALUES
  ✓ None found — all columns fully populated

  IMPOSSIBLE VALUE FLAGS
  ✓ No impossible values found

  SUSPICIOUS PATTERNS
  ✓ No customers found with balance exceeding credit limit

  DUPLICATES
  ✓ None found

========================================================

What just happened?

pandas' boolean masks check each business rule — df['age'] < 18, df['current_balance'] > df['credit_limit'] — and filter to the offending rows. When problems are found, the function prints the exact customer IDs so the data owner knows precisely where to investigate.

This dataset is clean — all checks pass. But the function is reusable on any future portfolio export. In a real dataset, you'd likely find at least one impossible value: a negative age, a credit limit of £0, or a balance that exceeds the limit. The document gives the data owner the customer ID and a specific action — not just a description of the problem.

The Finding → Evidence → Implication → Action Template

The scenario: Your manager reviews all three documents and is happy with the content — but she has one critique: "Some of your sentences are vague. 'The utilisation rate is high' doesn't tell me anything. 'The utilisation rate averages 83% for defaulters versus 15% for non-defaulters — a 68 percentage point gap' tells me everything." She asks you to run every key finding through the four-part template before finalising.

def write_finding(finding, evidence_df_computation, implication, action):
    """
    Formats a single EDA finding using the four-part template.
    Forces the analyst to provide evidence from data, not opinion.
    """
    print(f"  FINDING:     {finding}")
    print(f"  EVIDENCE:    {evidence_df_computation}")
    print(f"  IMPLICATION: {implication}")
    print(f"  ACTION:      {action}")
    print()

# --- FINDING 1: Utilisation rate ---
avg_util_def   = df[df['defaulted']==1]['utilisation_rate'].mean()
avg_util_nodef = df[df['defaulted']==0]['utilisation_rate'].mean()
gap = avg_util_def - avg_util_nodef

write_finding(
    finding     = "Credit utilisation rate is the strongest default predictor",
    evidence_df_computation = f"Defaulters averaged {avg_util_def:.0f}% utilisation vs {avg_util_nodef:.0f}% for non-defaulters (gap: {gap:.0f}pp, r=+0.78)",
    implication = "Customers near their credit ceiling are dramatically more likely to default",
    action      = "Implement real-time alert when utilisation crosses 80%; consider limit reductions above 90%"
)

# --- FINDING 2: Risk tiering accuracy ---
high_dr = df[df['risk_tier']=='High Risk']['defaulted'].mean() * 100
low_dr  = df[df['risk_tier']=='Low Risk']['defaulted'].mean()  * 100
n_high  = (df['risk_tier']=='High Risk').sum()

write_finding(
    finding     = "Rule-based risk tiers accurately separate high and low risk customers",
    evidence_df_computation = f"High Risk tier: {high_dr:.0f}% default rate (n={n_high}). Low Risk tier: {low_dr:.0f}% default rate",
    implication = "Simple domain rules identify 4 of 5 defaulters without a machine learning model",
    action      = "Deploy risk tier flag into the monthly credit review process immediately; no model required for initial screening"
)

# --- FINDING 3: Annual income is not predictive ---
r_income, _ = stats.pearsonr(df['annual_income'], df['defaulted'])
write_finding(
    finding     = "Annual income does not predict default in this portfolio",
    evidence_df_computation = f"Pearson r = {r_income:+.3f} — near zero, not statistically significant (p>0.70)",
    implication = "Income level alone does not protect against default — behavioural signals (utilisation, payments) matter far more",
    action      = "Do not weight income heavily in the predictive model; prioritise utilisation_rate and missed_payments"
)

  FINDING:     Credit utilisation rate is the strongest default predictor
  EVIDENCE:    Defaulters averaged 83% utilisation vs 15% for non-defaulters (gap: 68pp, r=+0.78)
  IMPLICATION: Customers near their credit ceiling are dramatically more likely to default
  ACTION:      Implement real-time alert when utilisation crosses 80%; consider limit reductions above 90%

  FINDING:     Rule-based risk tiers accurately separate high and low risk customers
  EVIDENCE:    High Risk tier: 80% default rate (n=5). Low Risk tier: 0% default rate
  IMPLICATION: Simple domain rules identify 4 of 5 defaulters without a machine learning model
  ACTION:      Deploy risk tier flag into the monthly credit review process immediately; no model required for initial screening

  FINDING:     Annual income does not predict default in this portfolio
  EVIDENCE:    Pearson r = +0.095 — near zero, not statistically significant (p>0.70)
  IMPLICATION: Income level alone does not protect against default — behavioural signals matter far more
  ACTION:      Do not weight income heavily in the predictive model; prioritise utilisation_rate and missed_payments

What just happened?

pandas and scipy compute every number in the evidence string — the averages, the gap, the correlation, the p-value. The template function simply enforces that every finding is connected to a specific number, an implication, and an action.

The third finding is as important as the first two: annual income does not predict default (r=+0.095). Negative results are findings too. The model team now knows not to weight income heavily — which saves them from building a more complex model based on a variable that doesn't work.

Teacher's Note

A finding without a number is an opinion. A number without a finding is noise. The four-part template (finding → evidence → implication → action) forces you to connect every claim to data and every data point to a decision. If you can't fill in all four parts, you don't have a finding yet — you have an observation. Keep digging.

And always write for your specific audience. The risk committee version and the modelling team version of the same finding sound nothing alike — one is a call to action, the other is a feature selection recommendation. The analysis is the same. The communication is completely different. Your job is to know the difference and write both.

Practice Questions

1. The four-part finding template is: Finding → Evidence → Implication → ________. What is the fourth part?

2. "The utilisation rate is high" — this statement has no number attached. According to this lesson, a finding without evidence is just a(n) ________.

3. In this lesson, how many different documents did we produce from the same analysis — one per audience?

Quiz

Up Next · Lesson 36

Advanced Outlier Detection

Go beyond the IQR rule — Isolation Forest, z-scores, Local Outlier Factor, and how to choose the right method for your data.

← Previous Course Index Next →

EDA Course

Documenting Findings

Why Most EDA Documentation Fails

The Dataset We'll Use

Document 1 — The Executive Summary (Risk Committee)

Document 2 — The Technical Handoff (Modelling Team)

Document 3 — The Data Quality Report (Data Owner)

The Finding → Evidence → Implication → Action Template

Practice Questions

Quiz