EDA Lesson 1 – Introduction to EDA | Dataplexa
Beginner Level · Lesson 1

Introduction to Exploratory Data Analysis

Before you build any model or write a single line of analysis code, you need to actually look at your data — that's exactly what Exploratory Data Analysis is, and this lesson walks you through what it means, why it matters, and how to do it in Python.

What EDA Actually Is

Exploratory Data Analysis — EDA for short — is the process of getting familiar with a dataset before doing anything formal with it. Think of it like reading a restaurant menu before ordering. You're not committing to anything yet. You're scanning, asking questions, and getting a feel for what's there.

The term was coined by statistician John Tukey in 1977. His core idea was simple: don't jump straight to hypothesis testing or modeling. First, explore. Look at the shapes, the patterns, the weird values, the things that don't make sense. The data will tell you things you didn't expect.

In practice, EDA means:

  • Looking at the first and last few rows of data
  • Checking column names, data types, and dimensions
  • Finding missing values and duplicates
  • Computing basic summary statistics (mean, median, min, max)
  • Visualising distributions and relationships between variables

None of this is fancy. But skipping it is one of the most common mistakes beginners make — and even experienced analysts get burned by it.
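Each of those checks is a one-liner in pandas. Here is a quick sketch, assuming your data is already loaded into a DataFrame named df (the name and the tiny example data are placeholders — in practice df comes from pd.read_csv() or a database query):

```python
import pandas as pd

# Placeholder DataFrame — stand-in for your real loaded data
df = pd.DataFrame({'a': [1, 2, None], 'b': ['x', 'y', 'y']})

print(df.head())               # first few rows
print(df.tail())               # last few rows
print(df.shape)                # (rows, columns)
print(df.dtypes)               # data type of each column
print(df.isnull().sum())       # missing values per column
print(df.duplicated().sum())   # count of duplicated rows
print(df.describe())           # summary stats for numeric columns
```

Every one of these gets its own section later in the course; this is just the shape of the routine.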

A Real-World Analogy

Imagine you are a new employee and someone hands you a spreadsheet of customer orders. Before you can answer any business question — "Who are our top customers?" or "Which products are underperforming?" — you need to understand the spreadsheet itself. How many rows are there? What does each column mean? Are there blanks? Are the dates formatted correctly? Is the revenue column in dollars or thousands of dollars? These are EDA questions.

EDA is that orientation phase. It saves you from building a beautiful chart on dirty data, or drawing conclusions from a column that turned out to be 40% empty.

The EDA Process at a Glance

Here is the typical EDA workflow — the steps flow from raw data all the way to insights:

Raw Data (CSV / DB / API) → Load & Inspect (shape, dtypes, head) → Clean & Check (nulls, duplicates) → Summarise (stats, distributions) → Visualise (charts, patterns) → Insights (decisions & next steps)

This course follows exactly this flow — we start with structure and loading, move through cleaning, statistics, and visualisation, and end with real-world case studies.

Your First EDA in Python

The scenario: You have just joined a small electronics retailer as a data analyst. Your manager hands you a sales dataset and says "tell me what's in here." There is no dashboard yet, no report template — just raw data. This is exactly where EDA begins.

We will build the dataset inline using pandas — no file needed — and run our first three inspection commands.

import pandas as pd    # pandas is the core data analysis library in Python
import numpy as np     # numpy handles numerical operations; pandas uses it internally

# Build a small electronics sales dataset directly in Python
# Each key is a column name, each list is that column's values
sales_df = pd.DataFrame({
    'order_id':    [1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008],   # unique order ID
    'customer':    ['Alice', 'Bob', 'Carol', 'Dave', 'Eve', 'Frank', 'Grace', 'Hank'],
    'product':     ['Laptop', 'Mouse', 'Keyboard', 'Laptop', 'Monitor', 'Mouse', 'Keyboard', 'Laptop'],
    'quantity':    [1, 3, 2, 1, 1, 5, 2, 1],                           # units sold per order
    'unit_price':  [999.99, 29.99, 79.99, 999.99, 349.99, 29.99, 79.99, 999.99],
    'revenue':     [999.99, 89.97, 159.98, 999.99, 349.99, 149.95, 159.98, 999.99],
    'region':      ['North', 'South', 'East', 'West', 'North', 'South', 'East', None]  # None = missing
})

# Step 1 – How big is this dataset? Returns a tuple (rows, columns)
print("Shape:", sales_df.shape)

# Step 2 – What does the data look like? Shows first 5 rows by default
print(sales_df.head())

# Step 3 – What data type is stored in each column?
print(sales_df.dtypes)
Shape: (8, 7)

   order_id customer   product  quantity  unit_price   revenue region
0      1001    Alice    Laptop         1      999.99    999.99  North
1      1002      Bob     Mouse         3       29.99     89.97  South
2      1003    Carol  Keyboard         2       79.99    159.98   East
3      1004     Dave    Laptop         1      999.99    999.99   West
4      1005      Eve   Monitor         1      349.99    349.99  North

order_id        int64
customer       object
product        object
quantity        int64
unit_price    float64
revenue       float64
region         object
dtype: object

What just happened?

pandas is the go-to Python library for working with tabular data — think of it as Excel inside Python. We used pd.DataFrame() to create a table from a Python dictionary. numpy was imported because pandas uses it under the hood for number crunching.

  • .shape told us we have 8 rows and 7 columns — the dataset's size at a glance
  • .head() showed the first 5 rows — a quick sanity check that data loaded correctly
  • .dtypes revealed column types: int64 for whole numbers, float64 for decimals, object for text
  • We already spotted something: the last row has None in region — a missing value we will need to handle

The describe() Function — Your Instant Summary

The scenario: Your manager asks — "Give me a quick statistical snapshot of the sales numbers." Instead of calculating mean, min, and max one by one for each column, pandas gives you a single command that handles everything at once. This is the fastest way to understand the numerical range and spread of your data.

import pandas as pd

# Rebuild the sales dataset (numeric columns only for this example)
sales_df = pd.DataFrame({
    'order_id':    [1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008],
    'quantity':    [1, 3, 2, 1, 1, 5, 2, 1],
    'unit_price':  [999.99, 29.99, 79.99, 999.99, 349.99, 29.99, 79.99, 999.99],
    'revenue':     [999.99, 89.97, 159.98, 999.99, 349.99, 149.95, 159.98, 999.99]
})

# describe() computes 8 statistics for every numeric column automatically
# .round(2) is chained on top to keep decimal places clean and readable
print(sales_df.describe().round(2))
       order_id  quantity  unit_price  revenue
count      8.00      8.00        8.00     8.00
mean    1004.50      2.00      446.24   488.73
std        2.45      1.41      469.65   429.84
min     1001.00      1.00       29.99    89.97
25%     1002.75      1.00       67.49   157.47
50%     1004.50      1.50      214.99   254.99
75%     1006.25      2.25      999.99   999.99
max     1008.00      5.00      999.99   999.99

What just happened?

describe() is a pandas method that automatically computes 8 statistics per numeric column: count, mean, std (standard deviation), min, 25th percentile, 50th percentile (median), 75th percentile, and max. We chained .round(2) to tidy the output.

Look at the revenue column: the mean is $488.73 but the median (50%) is only $254.99. That gap tells you the distribution is right-skewed — a few large Laptop orders are pulling the average up. If someone reports "average revenue per order is $488" that is technically true but misleading. EDA caught this in minute one.
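You can confirm the skew yourself with two calls — a quick sketch, with the revenue values copied from sales_df above:

```python
import pandas as pd

# Revenue values copied from the sales_df built earlier
revenue = pd.Series([999.99, 89.97, 159.98, 999.99, 349.99, 149.95, 159.98, 999.99])

mean = revenue.mean()      # pulled upward by the three $999.99 laptop orders
median = revenue.median()  # resistant to those extremes

print(f"mean:   {mean:.2f}")
print(f"median: {median:.2f}")
# A mean sitting well above the median signals a right-skewed distribution
print("right-skewed:", mean > median)
```

This mean-versus-median comparison is one of the cheapest skew checks there is, and it works on any numeric column.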

Spotting Missing Values Immediately

The scenario: Before any analysis, you need to know if your data has gaps. A column that looks complete might be 30% empty. If you skip this check and later group or average that column, your results will be silently wrong — no error message, just bad numbers. Here we find missing values and report them as a percentage, which is far more actionable than a raw count.

import pandas as pd

# Rebuild the full sales dataset — note the None in the last row's region column
sales_df = pd.DataFrame({
    'order_id':   [1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008],
    'customer':   ['Alice', 'Bob', 'Carol', 'Dave', 'Eve', 'Frank', 'Grace', 'Hank'],
    'product':    ['Laptop', 'Mouse', 'Keyboard', 'Laptop', 'Monitor', 'Mouse', 'Keyboard', 'Laptop'],
    'quantity':   [1, 3, 2, 1, 1, 5, 2, 1],
    'unit_price': [999.99, 29.99, 79.99, 999.99, 349.99, 29.99, 79.99, 999.99],
    'revenue':    [999.99, 89.97, 159.98, 999.99, 349.99, 149.95, 159.98, 999.99],
    'region':     ['North', 'South', 'East', 'West', 'North', 'South', 'East', None]
})

# isnull() scans every cell — returns True if value is missing, False if not
# .sum() then counts the True values per column = total missing per column
missing_count = sales_df.isnull().sum()

# Divide by total rows (len) and multiply by 100 to get a percentage per column
missing_pct = (sales_df.isnull().sum() / len(sales_df) * 100).round(1)

# Combine count and percentage into one clean summary DataFrame
missing_summary = pd.DataFrame({
    'missing_count': missing_count,
    'missing_pct':   missing_pct
})

# Filter to show only columns that actually have missing data
print(missing_summary[missing_summary['missing_count'] > 0])
        missing_count  missing_pct
region              1         12.5

What just happened?

Three pandas tools worked together: isnull() scanned every cell for missing values, sum() totalled them per column, and pd.DataFrame() packaged the count and percentage into a readable summary.

The output tells us only the region column has missing data — 1 row out of 8 (12.5%). That is not catastrophic, but it is there. Depending on the analysis you might fill it in, drop that row, or flag it. The key is — you know about it now, not three hours later when your regional breakdown chart is silently wrong.
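Each of those three options — fill, drop, or flag — is a single pandas call. Here is a sketch using just the two relevant columns of sales_df (the 'Unknown' label and the region_missing column name are choices for this example, not fixed conventions):

```python
import pandas as pd

sales_df = pd.DataFrame({
    'order_id': [1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008],
    'region':   ['North', 'South', 'East', 'West', 'North', 'South', 'East', None]
})

# Option 1 — fill the gap with a sentinel label
filled = sales_df.fillna({'region': 'Unknown'})

# Option 2 — drop any row containing a missing value
dropped = sales_df.dropna()

# Option 3 — keep the row but flag it in a new boolean column
flagged = sales_df.assign(region_missing=sales_df['region'].isnull())

print(filled['region'].iloc[-1])        # 'Unknown'
print(len(dropped))                     # 7 rows remain
print(flagged['region_missing'].sum())  # 1 flagged row
```

Which option is right depends on the analysis — later lessons on cleaning cover how to choose.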

A Quick Visual Check — Revenue by Product

Numbers tell part of the story. A chart can reveal the same insight in seconds. Here is a mockup of what a revenue-by-product bar chart looks like from our dataset — we will write the actual matplotlib and seaborn code to produce this in later lessons.

Total Revenue by Product — sales_df

  • Laptop: $2,999.97
  • Monitor: $349.99
  • Keyboard: $319.96
  • Mouse: $239.92

Laptops drive about 77% of total revenue ($2,999.97 of $3,909.84) despite being just one of four products — a pattern invisible in the raw numbers until you chart it.

This is the kind of insight EDA surfaces in the first ten minutes. Laptops dominate revenue, but mice move the most units (8 across two orders, versus 3 laptops). A business strategy built on unit volume looks completely different from one built on revenue — and EDA is what makes that distinction visible before you build anything on top of it.
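You can verify both rankings right now with a groupby, no chart required. This sketch rebuilds the three relevant columns of sales_df:

```python
import pandas as pd

sales_df = pd.DataFrame({
    'product':  ['Laptop', 'Mouse', 'Keyboard', 'Laptop', 'Monitor', 'Mouse', 'Keyboard', 'Laptop'],
    'quantity': [1, 3, 2, 1, 1, 5, 2, 1],
    'revenue':  [999.99, 89.97, 159.98, 999.99, 349.99, 149.95, 159.98, 999.99]
})

# Two rankings of the same four products — one by money, one by units
by_revenue = sales_df.groupby('product')['revenue'].sum().sort_values(ascending=False)
by_units   = sales_df.groupby('product')['quantity'].sum().sort_values(ascending=False)

print(by_revenue)  # Laptop leads on revenue
print(by_units)    # Mouse leads on units sold
```

The two rankings put different products on top — exactly the revenue-versus-volume distinction described above.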

Teacher's Note

EDA has no rigid end point — it is a mindset, not a checklist. You keep exploring until you feel confident you understand your data well enough to trust the analysis that follows. Beginners rush past EDA and spend hours debugging broken models. Experienced analysts treat it as the most important phase. Three commands — shape, describe(), and isnull() — already tell you more about a dataset than most people find out in an hour. That is the power of systematic exploration.

Practice Questions

1. Which pandas function gives you count, mean, std, min, and max for all numeric columns in one call?



2. To count missing values per column in a DataFrame called df, you write df.______().sum(). Fill in the blank.



3. To find out how many rows and columns a DataFrame has, you use the df.______ attribute.



Quiz

1. Who is credited with coining the term Exploratory Data Analysis?


2. In our sales_df, the revenue mean ($488.74) is much higher than the median ($254.99). What does this tell you?


3. What is the primary goal of EDA?


Up Next · Lesson 2

Dataset Structure

Learn how to read shape, dtypes, index, and column info like a pro using pandas.