EDA Lesson 44 – Automated EDA Tools | Dataplexa
Advanced Level · Lesson 44

Automated EDA Tools

Automated EDA tools can generate in seconds a report that would take 30 minutes to write by hand. That speed is genuinely useful — but only if you understand what these tools check, what they miss, and why treating their output as a finished analysis is one of the most common mistakes in data science.

The Three Main Tools

ydata-profiling
  Generates: full HTML report covering distributions, correlations, missing values, and alerts
  Best for:  a first look at a new tabular dataset
  Install:   pip install ydata-profiling

sweetviz
  Generates: side-by-side comparison of two datasets or two class groups
  Best for:  train vs test comparison, before vs after cleaning
  Install:   pip install sweetviz

dtale
  Generates: interactive browser-based explorer (filter, sort, plot on the fly)
  Best for:  exploratory drilling without writing code
  Install:   pip install dtale

The Dataset We'll Use

The scenario: You've just joined a fintech startup as a data scientist. On your first day, a colleague drops a loan dataset on your desk and says: "We need to know if this data is good enough to model on. I ran it through ydata-profiling this morning — can you look at the report and tell me what to action?" Your job is to understand what the automated tool found, what it missed, and what still needs to be done by hand. You'll replicate the key parts of what these tools do — so you understand what's inside the black box.

import pandas as pd
import numpy as np

# Loan application dataset — deliberately messy
np.random.seed(7)
n = 20

df = pd.DataFrame({
    'loan_id':        range(1001, 1001+n),
    'age':            [28,45,np.nan,62,35,51,24,np.nan,38,55,
                       29,47,33,61,np.nan,52,41,36,58,27],
    'income':         [32000,78000,45000,95000,61000,84000,28000,71000,
                       52000,88000,31000,67000,45000,95000,61000,84000,
                       28000,52000,88000,31000],
    'loan_amount':    [5000,18000,8000,25000,12000,20000,4000,15000,
                       9000,22000,5500,16000,8000,25000,12000,20000,
                       4000,9000,22000,5500],
    'credit_score':   [620,780,np.nan,810,700,760,580,np.nan,650,790,
                       600,770,680,815,710,755,575,660,795,605],
    'employment_yrs': [2,12,5,np.nan,8,15,1,10,4,np.nan,
                       3,11,5,18,7,14,1,4,16,2],
    'loan_purpose':   ['car','home','personal','home','car','home','personal',
                       'car','personal','home','car','home','personal','home',
                       'car','home','personal','car','home','personal'],
    'approved':       [0,1,0,1,1,1,0,1,0,1,0,1,0,1,1,1,0,0,1,0]
})

print(f"Shape: {df.shape}")
print("\nMissing values:")
miss = df.isnull().sum()
print(miss[miss > 0])

What just happened?

Seven missing values across three columns — 15% of age values, 10% of credit scores, 10% of employment years. Automated tools will flag these immediately. But they won't tell you why they're missing, which is the question that determines the right fix.

What Automated Tools Do Well — Replicated in Code

The scenario: Your colleague says the ydata-profiling report flagged "high correlation" and "missing values" as alerts. You want to see exactly what it found. Rather than just reading the HTML report, you replicate the key checks in code — because understanding what the tool is doing means you can interpret its alerts correctly, rather than just acting on them blindly.

from scipy import stats

numeric_cols = ['age','income','loan_amount','credit_score','employment_yrs']

print("=== WHAT AUTOMATED TOOLS CHECK ===\n")

# CHECK 1: Missing values (every tool flags this)
print("1. Missing values:")
for col in numeric_cols:
    n_miss = df[col].isnull().sum()
    pct    = n_miss / len(df) * 100
    flag   = "⚠" if pct > 5 else "✓"
    print(f"   {flag} {col:<18} {n_miss} missing ({pct:.0f}%)")

print()

# CHECK 2: Distributions — skewness and outliers (tools show histograms)
print("2. Distribution shapes:")
for col in numeric_cols:
    s    = df[col].dropna()
    skew = s.skew()
    flag = "⚠ Skewed" if abs(skew) > 1 else "✓ OK"
    print(f"   {flag:<10} {col:<18} skew={skew:+.2f}  "
          f"mean={s.mean():.0f}  median={s.median():.0f}")

print()

# CHECK 3: Correlations between numeric features (tools show a heatmap)
print("3. High correlations (|r| > 0.75):")
corr = df[numeric_cols].corr()
found = False
for i in range(len(numeric_cols)):
    for j in range(i+1, len(numeric_cols)):
        r = corr.iloc[i,j]
        if abs(r) > 0.75:
            print(f"   ⚠ {numeric_cols[i]} × {numeric_cols[j]}  r={r:+.2f}")
            found = True
if not found:
    print("   ✓ No highly correlated pairs found")

What just happened?

pandas' .corr(), .isnull().sum(), and .skew() are the same statistics automated tools compute under the hood. The tools add a polished HTML report and sensible default alert thresholds on top, but the core checks are no more sophisticated than the lines above.

Six highly correlated feature pairs, including income and loan_amount at r=0.97 and credit_score and employment_yrs at r=0.94. The automated tool flagged these correctly, but it cannot tell you whether they are a problem. The correlations make sense for a lending business: people with higher incomes borrow more, have better credit, and have been working longer. These aren't errors; they're real-world relationships. A multicollinearity fix (dropping features) should be applied selectively, not mechanically because an alert appeared.
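Selective, in practice, can mean measuring redundancy before dropping anything. A minimal variance-inflation-factor sketch using only numpy and pandas; the vif helper and the demo columns a/b/c are illustrative, not part of the loan dataset:

```python
import numpy as np
import pandas as pd

def vif(frame, cols):
    """Variance inflation factor per column: 1/(1 - R^2), where R^2 comes
    from regressing each column on all the others (plus an intercept)."""
    X = frame[cols].dropna().to_numpy(dtype=float)
    out = {}
    for j, col in enumerate(cols):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(len(others)), others])  # add intercept
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        r2 = 1 - resid.var() / y.var()
        out[col] = 1 / (1 - r2) if r2 < 1 else float("inf")
    return out

# toy example: b is almost a copy of a, c is independent noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame({"a": a,
                     "b": a + rng.normal(scale=0.1, size=200),
                     "c": rng.normal(size=200)})
scores = vif(demo, ["a", "b", "c"])
for k, v in scores.items():
    print(f"{k}: VIF={v:.1f}")   # a and b come out high; c stays near 1
```

A common rule of thumb is to investigate features with VIF above 5 or 10, and even then to drop only when the redundancy has no business meaning.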

What Automated Tools Miss — The Manual Checks

The scenario: After reading the profiling report, you notice it flagged correlations and missing values but said nothing about three things you care about as a fintech analyst: whether the approval rate differs by loan purpose (a fairness concern), whether missing data is random or linked to a specific group (MCAR vs MAR), and whether the approved/rejected class balance is problematic for modelling. You run these three checks manually.

print("=== WHAT AUTOMATED TOOLS MISS ===\n")

# MANUAL CHECK 1: Class balance — automated tools rarely warn about this for binary targets
n_approved = df['approved'].sum()
n_rejected = len(df) - n_approved
baseline   = max(n_approved, n_rejected) / len(df)
print("1. Target class balance:")
print(f"   Approved: {n_approved} ({n_approved/len(df)*100:.0f}%)  "
      f"Rejected: {n_rejected} ({n_rejected/len(df)*100:.0f}%)")
print(f"   Naive baseline accuracy: {baseline*100:.0f}%")
imb = "⚠ Imbalanced" if baseline > 0.65 else "✓ Acceptable"
print(f"   {imb}\n")

# MANUAL CHECK 2: Missingness pattern — tools flag missing values but don't check WHY
print("2. Missingness pattern — is missing data linked to approval status?")
for col in ['age','credit_score','employment_yrs']:
    missing_flag = df[col].isnull().astype(int)
    if missing_flag.sum() > 0:
        # Are rows with missing values more likely to be rejected?
        miss_approval = df[missing_flag==1]['approved'].mean()
        present_approval = df[missing_flag==0]['approved'].mean()
        diff = miss_approval - present_approval
        flag = "⚠ MAR likely" if abs(diff) > 0.15 else "✓ Appears random"
        print(f"   {flag}: {col}  approval rate "
              f"(missing={miss_approval:.0%}, present={present_approval:.0%}, "
              f"diff={diff:+.0%})")

print()

# MANUAL CHECK 3: Approval rate by loan purpose — business fairness check
print("3. Approval rate by loan purpose (business logic / fairness check):")
purpose_rates = df.groupby('loan_purpose')['approved'].agg(['mean','count'])
purpose_rates.columns = ['approval_rate','n']
purpose_rates['approval_rate'] = (purpose_rates['approval_rate']*100).round(0)
print(purpose_rates.sort_values('approval_rate', ascending=False).to_string())

What just happened?

Plain pandas, boolean masks and .groupby().agg(), ran checks that no automated tool touched.

Three findings the profiling report missed entirely. First, the target split is 11 approved to 9 rejected (55% vs 45%), so the naive baseline is only 55% and no imbalance handling is needed. Second, rows missing employment_yrs have a 100% approval rate, against 50% where the value is present: the missing data is not random, it is associated with the outcome. This is MAR (Missing at Random, i.e. missingness tied to other observed variables), and it means imputing with the median would erase a signal that favours approved applicants. Third, home loans are approved at 100% while car loans sit at 50% and personal loans at 0%. That is a massive disparity that could be a legitimate risk signal, or a fairness problem worth raising with the business. None of this appears in a generic profiling report.
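When missingness carries signal like this, a common alternative to plain imputation is to record an explicit missing-indicator column before filling the value, so a downstream model can still see that the value was absent. A minimal sketch on a small hypothetical employment_yrs column, not the full loan DataFrame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"employment_yrs": [2.0, 12.0, np.nan, 8.0, np.nan, 15.0]})

# 1) record the fact of missingness before imputation destroys it
df["employment_yrs_missing"] = df["employment_yrs"].isna().astype(int)

# 2) then impute the median into the original column
median = df["employment_yrs"].median()
df["employment_yrs"] = df["employment_yrs"].fillna(median)

print(df)
```

The model can now learn a separate effect for "value was missing", which matters precisely because missingness here predicts approval.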

Running ydata-profiling in Practice

The scenario: Your colleague asks how to actually run the tool. Here's the exact code — three lines to generate the full interactive HTML report, and the specific sections to look at first when you open it.

# --- ydata-profiling (formerly pandas-profiling) ---
# pip install ydata-profiling

from ydata_profiling import ProfileReport

# Three lines to get a full EDA report
profile = ProfileReport(df, title="Loan Dataset EDA", explorative=True)
profile.to_file("loan_eda_report.html")
# Opens as a self-contained HTML file in any browser

# --- sweetviz: compare two subsets side by side ---
# pip install sweetviz
import sweetviz as sv

approved_df = df[df['approved']==1]
rejected_df = df[df['approved']==0]

# Generates a side-by-side comparison of approved vs rejected applicants.
# (No target_feat here: 'approved' is constant within each subset, so
#  sweetviz cannot use it as a target for this comparison.)
report = sv.compare([approved_df, "Approved"], [rejected_df, "Rejected"])
report.show_html("approved_vs_rejected.html")

# sv.compare_intra(df, df['approved'] == 1, ["Approved", "Rejected"])
# does the same split in a single call

# --- What to look at FIRST when you open a profiling report ---
# 1. ALERTS tab  → automated warnings about correlations, missing values, skewness
# 2. OVERVIEW     → dataset shape, missing cell count, duplicate rows
# 3. CORRELATIONS → which feature pairs are strongly linked (multicollinearity risk)
# 4. VARIABLES    → per-column histograms and stats — spot outliers and skew
# Note: always check whether alerts are actual problems or expected business relationships

What just happened?

ProfileReport(df) generates the full analysis; .to_file() saves it as a self-contained HTML file. sv.compare() takes two DataFrames and shows their feature distributions side by side, immediately revealing which features differ most between approved and rejected applicants.

Both tools are quick on datasets of this size, and ydata-profiling offers a minimal=True option that keeps reports tractable on hundreds of thousands of rows. The profiling report is most useful at the very start: a quick scan of the Alerts tab takes 2 minutes and points you at the columns worth investigating first. The sweetviz comparison is most useful after you've split by a target, because it shows you, visually, which features most separate your two groups.
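The third tool from the comparison table, dtale, produces no static report at all: it serves the DataFrame to an interactive browser UI. A minimal sketch; dtale.show() and open_browser() are its documented entry points, and the launch lines are commented out here because they start a local web server:

```python
import pandas as pd

# tiny stand-in frame; in practice, pass your real DataFrame
df = pd.DataFrame({"income": [32000, 78000, 45000], "approved": [0, 1, 0]})

try:
    import dtale  # noqa: F401
    available = True
    # d = dtale.show(df)   # starts a local server hosting the explorer
    # d.open_browser()     # opens the UI in your default browser
except ImportError:
    available = False

print("dtale installed:", available)
```

Once open, you can filter, sort, and chart columns without writing code, which is why the table above recommends it for exploratory drilling.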

When to Use Automated Tools vs Manual EDA

✓ Use automated tools when:

  • Getting a first overview of a new dataset
  • Checking data quality before starting analysis
  • Communicating basic stats to non-technical colleagues
  • Comparing train vs test sets for distribution shift
  • Quickly spotting the columns worth investigating first
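The train-vs-test bullet above is also cheap to verify by hand: a two-sample Kolmogorov-Smirnov test per numeric column flags distribution shift directly. A sketch on synthetic data; the series and the 0.05 cut-off are illustrative, not drawn from the loan dataset:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
train = rng.normal(loc=0.0, scale=1.0, size=500)
same_dist = rng.normal(loc=0.0, scale=1.0, size=500)   # no shift
shifted = rng.normal(loc=0.8, scale=1.0, size=500)     # mean has drifted

# KS test: small p-value means the two samples likely differ in distribution
for name, sample in [("same", same_dist), ("shifted", shifted)]:
    stat, p = stats.ks_2samp(train, sample)
    flag = "⚠ shift" if p < 0.05 else "✓ ok"
    print(f"{flag}  {name}: KS={stat:.3f}  p={p:.4f}")
```

sweetviz shows you the same shift visually; the KS test gives you a number you can put in a report or an automated data-quality gate.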

✗ Don't rely on automated tools for:

  • Understanding why values are missing
  • Interpreting whether correlations are real relationships or spurious
  • Domain-specific pattern detection (fairness, business logic)
  • Class balance checks for the target variable
  • Time-series specific checks (stationarity, seasonality)
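For the time-series bullet, a rough manual screen needs nothing beyond pandas: compare the mean across halves of the series against its overall spread. This is a crude drift check on a synthetic trending series, not a substitute for a formal test such as ADF:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# synthetic series with a steady upward trend -> clearly non-stationary
s = pd.Series(np.arange(100) * 0.05 + rng.normal(size=100))

half = len(s) // 2
mean_shift = abs(s.iloc[half:].mean() - s.iloc[:half].mean())
spread = s.std()

# crude screen: a half-to-half mean shift that is large relative to the
# overall spread suggests the mean is drifting over time
drifting = mean_shift > 0.5 * spread
print(f"mean shift={mean_shift:.2f}  overall std={spread:.2f}  drifting={drifting}")
```

No generic profiling tool runs this kind of check, because it only makes sense once you know the rows are ordered in time.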

Teacher's Note

Automated tools are a starting point, not a conclusion. The profiling report told us about missing values and correlations. It said nothing about the 100% vs 0% approval-rate disparity between home and personal loans, the most practically important finding in this dataset. That finding required domain knowledge (knowing that loan purpose is a fairness-sensitive feature in lending) and a deliberate, targeted check.

The danger of automated EDA tools is that they create the illusion of thoroughness. A 50-page profiling report feels comprehensive. But if it didn't check the questions that matter for your specific business problem — which no generic tool can know — you've done 20 minutes of automated work and skipped the 2 hours of thinking that actually matters. Use the tool to get your bearings. Then do the real analysis by hand.

Practice Questions

1. Which automated EDA tool is specifically designed to compare two DataFrames side by side — for example, showing how feature distributions differ between an approved and rejected loan group?



2. In a ydata-profiling HTML report, which tab should you look at first — the one that summarises all automated warnings about correlations, missing values, skewness, and duplicates?



3. Complete this sentence: Automated EDA tools are a ________, not a conclusion — they identify what to investigate, but cannot tell you what the findings mean for your specific business problem.



Quiz

1. A colleague generates a 50-page ydata-profiling report and declares the EDA complete. What is the risk with this approach?


2. A profiling report flags six high-correlation pairs including income × credit_score (r=0.88) and credit_score × employment_yrs (r=0.94). Should you immediately drop these features?


3. Your profiling report shows no alerts for the loan_purpose column. Does this mean there are no issues with that column?


Up Next · Lesson 45

EDA Case Study

End-to-end EDA on a real-world dataset — applying every technique from the course in sequence, making the decisions a working data scientist actually has to make, and producing a complete analysis document ready for a modelling team.