EDA Course
Introduction to Exploratory Data Analysis
Before you build any model or write a single line of analysis code, you need to actually look at your data — that's exactly what Exploratory Data Analysis is, and this lesson walks you through what it means, why it matters, and how to do it in Python.
What EDA Actually Is
Exploratory Data Analysis — EDA for short — is the process of getting familiar with a dataset before doing anything formal with it. Think of it like reading a restaurant menu before ordering. You're not committing to anything yet. You're scanning, asking questions, and getting a feel for what's there.
The approach was championed by statistician John Tukey, whose 1977 book Exploratory Data Analysis gave the practice its name. His core idea was simple: don't jump straight to hypothesis testing or modeling. First, explore. Look at the shapes, the patterns, the weird values, the things that don't make sense. The data will tell you things you didn't expect.
In practice, EDA means:
- Looking at the first and last few rows of data
- Checking column names, data types, and dimensions
- Finding missing values and duplicates
- Computing basic summary statistics (mean, median, min, max)
- Visualising distributions and relationships between variables
None of this is fancy. But skipping it is one of the most common mistakes beginners make — and even experienced analysts get burned by it.
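The duplicate check in that list is the one item the rest of this lesson does not demonstrate, so here is a minimal sketch. It uses a hypothetical four-row table in which the last row repeats the first:

```python
import pandas as pd

# Hypothetical mini-dataset: the last row duplicates the first
df = pd.DataFrame({
    'order_id': [1001, 1002, 1003, 1001],
    'product': ['Laptop', 'Mouse', 'Keyboard', 'Laptop'],
})

# duplicated() marks each row that repeats an earlier row; sum() counts them
print(df.duplicated().sum())        # 1

# drop_duplicates() returns the table with repeats removed
print(df.drop_duplicates().shape)   # (3, 2)
```

One duplicate found, and the cleaned table keeps three of the four rows. Exact duplicates usually mean the same record was loaded twice, which silently inflates every count and sum downstream.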
A Real-World Analogy
Imagine you are a new employee and someone hands you a spreadsheet of customer orders. Before you can answer any business question — "Who are our top customers?" or "Which products are underperforming?" — you need to understand the spreadsheet itself. How many rows are there? What does each column mean? Are there blanks? Are the dates formatted correctly? Is the revenue column in dollars or thousands of dollars? These are EDA questions.
EDA is that orientation phase. It saves you from building a beautiful chart on dirty data, or drawing conclusions from a column that turned out to be 40% empty.
The EDA Process at a Glance
The typical EDA workflow follows a consistent sequence, from raw data all the way to insights: load the data, inspect its structure, clean it, summarise it with statistics, visualise it, and record what you learn.
This course follows exactly this flow — we start with structure and loading, move through cleaning, statistics, and visualisation, and end with real-world case studies.
Your First EDA in Python
The scenario: You have just joined a small electronics retailer as a data analyst. Your manager hands you a sales dataset and says "tell me what's in here." There is no dashboard yet, no report template — just raw data. This is exactly where EDA begins.
We will build the dataset inline using pandas — no file needed — and run our first three inspection commands.
import pandas as pd # pandas is the core data analysis library in Python
import numpy as np # numpy handles numerical operations; pandas uses it internally
# Build a small electronics sales dataset directly in Python
# Each key is a column name, each list is that column's values
sales_df = pd.DataFrame({
'order_id': [1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008], # unique order ID
'customer': ['Alice', 'Bob', 'Carol', 'Dave', 'Eve', 'Frank', 'Grace', 'Hank'],
'product': ['Laptop', 'Mouse', 'Keyboard', 'Laptop', 'Monitor', 'Mouse', 'Keyboard', 'Laptop'],
'quantity': [1, 3, 2, 1, 1, 5, 2, 1], # units sold per order
'unit_price': [999.99, 29.99, 79.99, 999.99, 349.99, 29.99, 79.99, 999.99],
'revenue': [999.99, 89.97, 159.98, 999.99, 349.99, 149.95, 159.98, 999.99],
'region': ['North', 'South', 'East', 'West', 'North', 'South', 'East', None] # None = missing
})
# Step 1 – How big is this dataset? Returns a tuple (rows, columns)
print("Shape:", sales_df.shape)
# Step 2 – What does the data look like? Shows first 5 rows by default
print(sales_df.head())
# Step 3 – What data type is stored in each column?
print(sales_df.dtypes)
Shape: (8, 7)
   order_id customer   product  quantity  unit_price  revenue region
0      1001    Alice    Laptop         1      999.99   999.99  North
1      1002      Bob     Mouse         3       29.99    89.97  South
2      1003    Carol  Keyboard         2       79.99   159.98   East
3      1004     Dave    Laptop         1      999.99   999.99   West
4      1005      Eve   Monitor         1      349.99   349.99  North
order_id        int64
customer       object
product        object
quantity        int64
unit_price    float64
revenue       float64
region         object
dtype: object
What just happened?
pandas is the go-to Python library for working with tabular data — think of it as Excel inside Python. We used pd.DataFrame() to create a table from a Python dictionary. numpy was imported because pandas uses it under the hood for number crunching.
- .shape told us we have 8 rows and 7 columns — the dataset's size at a glance
- head() showed the first 5 rows — a quick sanity check that the data loaded correctly
- dtypes revealed column types: int64 for whole numbers, float64 for decimals, object for text
- We already spotted something: the last row has None in region — a missing value we will need to handle
The describe() Function — Your Instant Summary
The scenario: Your manager asks — "Give me a quick statistical snapshot of the sales numbers." Instead of calculating mean, min, and max one by one for each column, pandas gives you a single command that handles everything at once. This is the fastest way to understand the numerical range and spread of your data.
import pandas as pd
# Rebuild the sales dataset (numeric columns only for this example)
sales_df = pd.DataFrame({
'order_id': [1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008],
'quantity': [1, 3, 2, 1, 1, 5, 2, 1],
'unit_price': [999.99, 29.99, 79.99, 999.99, 349.99, 29.99, 79.99, 999.99],
'revenue': [999.99, 89.97, 159.98, 999.99, 349.99, 149.95, 159.98, 999.99]
})
# describe() computes 8 statistics for every numeric column automatically
# .round(2) is chained on top to keep decimal places clean and readable
print(sales_df.describe().round(2))
       order_id  quantity  unit_price  revenue
count      8.00      8.00        8.00     8.00
mean    1004.50      2.00      446.24   488.73
std        2.45      1.41      469.65   429.84
min     1001.00      1.00       29.99    89.97
25%     1002.75      1.00       67.49   157.47
50%     1004.50      1.50      214.99   254.99
75%     1006.25      2.25      999.99   999.99
max     1008.00      5.00      999.99   999.99
What just happened?
describe() is a pandas method that automatically computes 8 statistics per numeric column: count, mean, std (standard deviation), min, 25th percentile, 50th percentile (median), 75th percentile, and max. We chained .round(2) to tidy the output.
Look at the revenue column: the mean is $488.73 but the median (50%) is only $254.99. That gap tells you the distribution is right-skewed — a few large Laptop orders are pulling the average up. If someone reports "average revenue per order is $488" that is technically true but misleading. EDA caught this in minute one.
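You can verify the skew directly rather than eyeballing the mean-versus-median gap. A minimal sketch on the same revenue values, using pandas' built-in skewness measure:

```python
import pandas as pd

# Revenue values from the sales_df built earlier in the lesson
revenue = pd.Series([999.99, 89.97, 159.98, 999.99, 349.99, 149.95, 159.98, 999.99])

print(revenue.mean())    # pulled upward by the three 999.99 Laptop orders
print(revenue.median())  # resistant to those extreme values
print(revenue.skew())    # a positive value confirms a right-skewed distribution
```

When mean > median and skew() returns a positive number, a handful of large values are dragging the average up, and the median is the more honest "typical order" figure.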
Spotting Missing Values Immediately
The scenario: Before any analysis, you need to know if your data has gaps. A column that looks complete might be 30% empty. If you skip this check and later group or average that column, your results will be silently wrong — no error message, just bad numbers. Here we find missing values and report them as a percentage, which is far more actionable than a raw count.
import pandas as pd
# Rebuild the full sales dataset — note the None in the last row's region column
sales_df = pd.DataFrame({
'order_id': [1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008],
'customer': ['Alice', 'Bob', 'Carol', 'Dave', 'Eve', 'Frank', 'Grace', 'Hank'],
'product': ['Laptop', 'Mouse', 'Keyboard', 'Laptop', 'Monitor', 'Mouse', 'Keyboard', 'Laptop'],
'quantity': [1, 3, 2, 1, 1, 5, 2, 1],
'unit_price': [999.99, 29.99, 79.99, 999.99, 349.99, 29.99, 79.99, 999.99],
'revenue': [999.99, 89.97, 159.98, 999.99, 349.99, 149.95, 159.98, 999.99],
'region': ['North', 'South', 'East', 'West', 'North', 'South', 'East', None]
})
# isnull() scans every cell — returns True if value is missing, False if not
# .sum() then counts the True values per column = total missing per column
missing_count = sales_df.isnull().sum()
# Divide by total rows (len) and multiply by 100 to get a percentage per column
missing_pct = (sales_df.isnull().sum() / len(sales_df) * 100).round(1)
# Combine count and percentage into one clean summary DataFrame
missing_summary = pd.DataFrame({
'missing_count': missing_count,
'missing_pct': missing_pct
})
# Filter to show only columns that actually have missing data
print(missing_summary[missing_summary['missing_count'] > 0])
        missing_count  missing_pct
region              1         12.5
What just happened?
Three pandas tools worked together: isnull() scanned every cell for missing values, sum() totalled them per column, and pd.DataFrame() packaged the count and percentage into a readable summary.
The output tells us only the region column has missing data — 1 row out of 8 (12.5%). That is not catastrophic, but it is there. Depending on the analysis you might fill it in, drop that row, or flag it. The key is — you know about it now, not three hours later when your regional breakdown chart is silently wrong.
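Those three options — fill, drop, or flag — each map to a one-line pandas call. A minimal sketch on just the region column:

```python
import pandas as pd

# The region column from sales_df, with its one missing value at the end
region = pd.Series(['North', 'South', 'East', 'West', 'North', 'South', 'East', None])

# Option 1 - drop the rows where region is missing
print(region.dropna().shape)              # (7,)

# Option 2 - fill the gap with an explicit placeholder label
print(region.fillna('Unknown').iloc[-1])  # Unknown

# Option 3 - keep the rows but flag them for later inspection
print(region.isnull().sum())              # 1
```

Which option is right depends on the analysis: dropping loses a whole order's revenue from any regional breakdown, while filling with a label like 'Unknown' keeps the row visible and makes the gap explicit in every chart.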
A Quick Visual Check — Revenue by Product
Numbers tell part of the story. A chart can reveal the same insight in seconds. Here is a mockup of what a revenue-by-product bar chart looks like from our dataset — we will write the actual matplotlib and seaborn code to produce this in later lessons.
Total Revenue by Product — sales_df
Laptops drive roughly 77% of total revenue despite being just one of four products — a pattern invisible in the raw numbers until you chart it.
This is the kind of insight EDA surfaces in the first ten minutes. Laptops dominate revenue, but mice are ordered more frequently. A business strategy built on order volume looks completely different from one built on revenue — and EDA is what makes that distinction visible before you build anything on top of it.
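The revenue-versus-volume contrast is easy to confirm numerically with a groupby, even before any chart. A minimal sketch using the same values from sales_df:

```python
import pandas as pd

# Same product, quantity, and revenue values as the sales_df built earlier
sales_df = pd.DataFrame({
    'product': ['Laptop', 'Mouse', 'Keyboard', 'Laptop', 'Monitor', 'Mouse', 'Keyboard', 'Laptop'],
    'quantity': [1, 3, 2, 1, 1, 5, 2, 1],
    'revenue': [999.99, 89.97, 159.98, 999.99, 349.99, 149.95, 159.98, 999.99],
})

# Aggregate both metrics per product in one call
summary = sales_df.groupby('product')[['quantity', 'revenue']].sum()
print(summary.sort_values('revenue', ascending=False))
```

Laptops top the revenue column while mice top the quantity column (8 units across two orders) — the same two-sided story the chart tells, recovered in three lines.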
Teacher's Note
EDA has no rigid end point — it is a mindset, not a checklist. You keep exploring until you feel confident you understand your data well enough to trust the analysis that follows. Beginners rush past EDA and spend hours debugging broken models. Experienced analysts treat it as the most important phase. Three commands — shape, describe(), and isnull() — already tell you more about a dataset than most people find out in an hour. That is the power of systematic exploration.
Practice Questions
1. Which pandas function gives you count, mean, std, min, and max for all numeric columns in one call?
2. To count missing values per column in a DataFrame called df, you write df.______().sum(). Fill in the blank.
3. To find out how many rows and columns a DataFrame has, you use the df.______ attribute.
Quiz
1. Who is credited with coining the term Exploratory Data Analysis?
2. In our sales_df, the revenue mean ($488.73) is much higher than the median ($254.99). What does this tell you?
3. What is the primary goal of EDA?
Up Next · Lesson 2
Dataset Structure
Learn how to read shape, dtypes, index, and column info like a pro using pandas.