EDA Course
Understanding Dataset Structure
Every time you open a new dataset, the very first thing you do is understand its structure — how many rows, how many columns, what each column holds, and how it's all organised. This lesson gives you the exact commands to do that in under two minutes.
What "Dataset Structure" Means
When analysts talk about dataset structure, they mean the skeleton of the data — before you look at any values, you want to know the shape of the container that holds them. This includes:
- Dimensions — how many rows (observations) and columns (features)
- Column names — what each variable is called
- Data types — whether each column stores numbers, text, dates, or booleans
- Index — how the rows are labelled (usually 0, 1, 2… by default)
- Memory usage — how much RAM the dataset is consuming
Understanding structure first means you never get surprised halfway through an analysis. You won't try to calculate the mean of a column that's actually stored as text, or wonder why your merge produced 10x more rows than expected.
Think of a dataset like a building blueprint. Before you move furniture in, you check how many rooms there are, what each room is for, and what the walls are made of. Dataset structure is exactly that — the blueprint before the work begins.
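To make the merge surprise mentioned above concrete, here is a minimal sketch. The two tables (orders and regions) and their values are hypothetical, invented for illustration only and not part of the lesson's dataset. It shows how a duplicated key in a lookup table can quietly inflate the row count, which a quick .shape check before and after the merge would reveal.

```python
import pandas as pd

# Hypothetical toy tables, used only for illustration
orders = pd.DataFrame({
    "country": ["USA", "UK", "USA"],
    "total": [999.98, 199.95, 44.97],
})
regions = pd.DataFrame({
    "country": ["USA", "USA", "UK"],  # 'USA' appears twice (assumed data-entry mistake)
    "region": ["Americas", "North America", "Europe"],
})

# Each 'USA' order matches two rows in regions, so it is duplicated
merged = orders.merge(regions, on="country", how="left")

print("Before merge:", orders.shape)  # (3, 2)
print("After merge :", merged.shape)  # (5, 3)
```

Comparing the row count before and after a merge is a cheap habit that catches this kind of duplication early.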
The Dataset We Are Using
Throughout this lesson we will work with a customer orders dataset from an online retail store. It has a mix of numeric, text, and date-like columns — exactly the kind of messy, real-world structure you will encounter on the job.
| order_id | customer_name | country | category | quantity | unit_price | total | is_member |
|---|---|---|---|---|---|---|---|
| 2001 | Alice Brown | USA | Electronics | 2 | 499.99 | 999.98 | True |
| 2002 | Bob Smith | UK | Clothing | 5 | 39.99 | 199.95 | False |
| 2003 | Carol White | Canada | Electronics | 1 | 899.00 | 899.00 | True |
| 2004 | Dave Lee | USA | Books | 3 | 14.99 | 44.97 | False |
| 2005 | Eve Patel | India | Clothing | 4 | 24.99 | 99.96 | True |
shape — Rows and Columns in One Shot
The scenario: Your team lead sends you this dataset and asks — "How big is it? How many records do we have?" The very first command every analyst runs is .shape. It tells you the dimensions of the entire DataFrame in a single tuple: (rows, columns).
import pandas as pd

# Build the customer orders dataset inline
customer_df = pd.DataFrame({
    'order_id': [2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010],
    'customer_name': ['Alice Brown', 'Bob Smith', 'Carol White', 'Dave Lee', 'Eve Patel',
                      'Frank Kim', 'Grace Luo', 'Hank Diaz', 'Ivy Chen', 'Jack Roy'],
    'country': ['USA', 'UK', 'Canada', 'USA', 'India', 'UK', 'Canada', 'USA', 'India', 'UK'],
    'category': ['Electronics', 'Clothing', 'Electronics', 'Books', 'Clothing',
                 'Electronics', 'Books', 'Clothing', 'Electronics', 'Books'],
    'quantity': [2, 5, 1, 3, 4, 1, 6, 2, 1, 3],
    'unit_price': [499.99, 39.99, 899.00, 14.99, 24.99, 649.00, 9.99, 59.99, 399.00, 19.99],
    'total': [999.98, 199.95, 899.00, 44.97, 99.96, 649.00, 59.94, 119.98, 399.00, 59.97],
    'is_member': [True, False, True, False, True, True, False, True, False, True]
})

# .shape returns a tuple: (number of rows, number of columns)
print("Shape:", customer_df.shape)

# You can also access rows and columns individually
print("Rows :", customer_df.shape[0])  # index 0 = row count
print("Columns:", customer_df.shape[1])  # index 1 = column count
Shape: (10, 8)
Rows : 10
Columns: 8
What just happened?
.shape is a pandas DataFrame attribute — not a function, so there are no parentheses. It returns a Python tuple with two values: row count first, column count second.
- We have 10 rows — 10 individual customer orders
- We have 8 columns — 8 variables describing each order
- .shape[0] and .shape[1] let you extract either number individually, which is useful when you want to print them in a report or use them in conditions
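As a small sketch of using the shape in a condition or a report line, the example below unpacks the tuple and checks it against an expected column count. The three-row frame and the EXPECTED_COLS value are assumptions made up for this illustration.

```python
import pandas as pd

# Small stand-in frame (a hypothetical subset, not the full lesson dataset)
customer_df = pd.DataFrame({
    "order_id": [2001, 2002, 2003],
    "quantity": [2, 5, 1],
})

# Tuple unpacking gives both numbers in one line
n_rows, n_cols = customer_df.shape

# EXPECTED_COLS is an assumed value for this sketch
EXPECTED_COLS = 2
if n_cols != EXPECTED_COLS:
    raise ValueError(f"Expected {EXPECTED_COLS} columns, found {n_cols}")
if n_rows == 0:
    raise ValueError("The dataset is empty")

# A one-line summary suitable for a report or log message
summary = f"{n_rows} rows x {n_cols} columns"
print(summary)
```

Failing early on an unexpected shape is usually cheaper than discovering a missing column several steps into an analysis.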
columns and index — Names and Labels
The scenario: You've confirmed the size. Now your manager asks — "What columns do we actually have, and how are the rows labelled?" These two attributes give you the full list of column names and the row index in one glance.
import pandas as pd

# Rebuild the dataset
customer_df = pd.DataFrame({
    'order_id': [2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010],
    'customer_name': ['Alice Brown', 'Bob Smith', 'Carol White', 'Dave Lee', 'Eve Patel',
                      'Frank Kim', 'Grace Luo', 'Hank Diaz', 'Ivy Chen', 'Jack Roy'],
    'country': ['USA', 'UK', 'Canada', 'USA', 'India', 'UK', 'Canada', 'USA', 'India', 'UK'],
    'category': ['Electronics', 'Clothing', 'Electronics', 'Books', 'Clothing',
                 'Electronics', 'Books', 'Clothing', 'Electronics', 'Books'],
    'quantity': [2, 5, 1, 3, 4, 1, 6, 2, 1, 3],
    'unit_price': [499.99, 39.99, 899.00, 14.99, 24.99, 649.00, 9.99, 59.99, 399.00, 19.99],
    'total': [999.98, 199.95, 899.00, 44.97, 99.96, 649.00, 59.94, 119.98, 399.00, 59.97],
    'is_member': [True, False, True, False, True, True, False, True, False, True]
})

# .columns returns all column names as a pandas Index object
print("Column names:")
print(customer_df.columns.tolist())  # .tolist() converts it to a plain Python list

# .index shows how the rows are labelled
print("\nRow index:")
print(customer_df.index)
Column names:
['order_id', 'customer_name', 'country', 'category', 'quantity', 'unit_price', 'total', 'is_member']

Row index:
RangeIndex(start=0, stop=10, step=1)
What just happened?
.columns is a pandas attribute that holds all column names. We chained .tolist(), a method of the pandas Index object, to convert it into a plain, readable Python list.
.index shows the row labels. Here it says RangeIndex(start=0, stop=10, step=1) — meaning rows are labelled 0 through 9, which is the default. In real-world datasets, the index is sometimes set to a customer ID or date, which changes how you select and merge data. Knowing the index type upfront prevents silent errors later.
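The sketch below illustrates how a custom index changes row selection. It assumes a three-row stand-in frame and uses order_id as the index purely as an example. With the default RangeIndex, label 0 and position 0 are the same row; once the index is replaced, .loc looks up labels and .iloc looks up positions, and the two no longer coincide.

```python
import pandas as pd

# Small stand-in frame (a hypothetical subset of the lesson dataset)
customer_df = pd.DataFrame({
    "order_id": [2001, 2002, 2003],
    "total": [999.98, 199.95, 899.00],
})

# Default RangeIndex: label 0 is also the first row
print(customer_df.loc[0, "total"])

# Use order_id as the row labels instead
by_order = customer_df.set_index("order_id")
print(by_order.index)

# .loc selects by label, .iloc selects by position
label_lookup = by_order.loc[2002, "total"]      # the row labelled 2002
position_lookup = by_order["total"].iloc[0]     # the first row, whatever its label

# The label 0 no longer exists, so by_order.loc[0] would raise a KeyError
has_label_zero = 0 in by_order.index
print(label_lookup, position_lookup, has_label_zero)
```

Checking .index before slicing or merging tells you which of these two behaviours to expect.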
dtypes — What Type of Data is in Each Column
The scenario: You want to calculate average spend. But before you do, you need to confirm that total and unit_price are actually stored as numbers, not as text strings that look like numbers. This is a surprisingly common issue in real datasets exported from Excel or databases.
import pandas as pd

# Rebuild the dataset
customer_df = pd.DataFrame({
    'order_id': [2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010],
    'customer_name': ['Alice Brown', 'Bob Smith', 'Carol White', 'Dave Lee', 'Eve Patel',
                      'Frank Kim', 'Grace Luo', 'Hank Diaz', 'Ivy Chen', 'Jack Roy'],
    'country': ['USA', 'UK', 'Canada', 'USA', 'India', 'UK', 'Canada', 'USA', 'India', 'UK'],
    'category': ['Electronics', 'Clothing', 'Electronics', 'Books', 'Clothing',
                 'Electronics', 'Books', 'Clothing', 'Electronics', 'Books'],
    'quantity': [2, 5, 1, 3, 4, 1, 6, 2, 1, 3],
    'unit_price': [499.99, 39.99, 899.00, 14.99, 24.99, 649.00, 9.99, 59.99, 399.00, 19.99],
    'total': [999.98, 199.95, 899.00, 44.97, 99.96, 649.00, 59.94, 119.98, 399.00, 59.97],
    'is_member': [True, False, True, False, True, True, False, True, False, True]
})

# .dtypes shows the data type stored in each column
print(customer_df.dtypes)
order_id           int64
customer_name     object
country           object
category          object
quantity           int64
unit_price       float64
total            float64
is_member           bool
dtype: object
What just happened?
.dtypes is a pandas attribute that returns the data type of every column. Here is what each type means:
- int64 — whole numbers (order_id, quantity)
- float64 — decimal numbers (unit_price, total)
- object — text / mixed (customer_name, country, category)
- bool — True/False values (is_member)
If total had come back as object instead of float64, that would be a red flag: it would usually mean the column contains text like "$999.98", and you cannot do arithmetic on it without cleaning first.
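Here is a minimal sketch of that red-flag case, assuming a hypothetical export in which total arrived as text with a leading dollar sign. The frame name raw and the new column total_num are illustrative. It checks whether the column is numeric, strips the symbol, and converts with pd.to_numeric; the exact dtype label pandas prints for text can differ between versions, so the check uses pd.api.types.is_numeric_dtype rather than comparing labels.

```python
import pandas as pd

# Hypothetical raw export: totals stored as text with a currency symbol
raw = pd.DataFrame({"total": ["$999.98", "$199.95", "$899.00"]})

# Check the type before doing any arithmetic
print(raw["total"].dtype)
is_num_before = pd.api.types.is_numeric_dtype(raw["total"])

# Strip the literal '$' and convert; errors="coerce" turns unparseable text into NaN
cleaned = raw["total"].str.replace("$", "", regex=False)
raw["total_num"] = pd.to_numeric(cleaned, errors="coerce")

is_num_after = pd.api.types.is_numeric_dtype(raw["total_num"])
print(raw["total_num"].dtype)
print("Average spend:", raw["total_num"].mean())
```

Using errors="coerce" is a design choice for this sketch: it keeps the conversion from failing outright, but any values it could not parse then show up as missing and should be counted before you trust the result.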
info() — The Complete Structural Report
The scenario: You want a single command that gives you everything at once — column names, data types, non-null counts, and memory usage — in one clean printout. That command is .info() and it is the single most useful structure command in pandas.
import pandas as pd

# Rebuild the dataset
customer_df = pd.DataFrame({
    'order_id': [2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010],
    'customer_name': ['Alice Brown', 'Bob Smith', 'Carol White', 'Dave Lee', 'Eve Patel',
                      'Frank Kim', 'Grace Luo', 'Hank Diaz', 'Ivy Chen', 'Jack Roy'],
    'country': ['USA', 'UK', 'Canada', 'USA', 'India', 'UK', 'Canada', 'USA', 'India', 'UK'],
    'category': ['Electronics', 'Clothing', 'Electronics', 'Books', 'Clothing',
                 'Electronics', 'Books', 'Clothing', 'Electronics', 'Books'],
    'quantity': [2, 5, 1, 3, 4, 1, 6, 2, 1, 3],
    'unit_price': [499.99, 39.99, 899.00, 14.99, 24.99, 649.00, 9.99, 59.99, 399.00, 19.99],
    'total': [999.98, 199.95, 899.00, 44.97, 99.96, 649.00, 59.94, 119.98, 399.00, 59.97],
    'is_member': [True, False, True, False, True, True, False, True, False, True]
})

# .info() prints a full structural summary of the DataFrame
# It shows: index type, column names, non-null count, data types, and memory usage
customer_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 8 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   order_id       10 non-null     int64
 1   customer_name  10 non-null     object
 2   country        10 non-null     object
 3   category       10 non-null     object
 4   quantity       10 non-null     int64
 5   unit_price     10 non-null     float64
 6   total          10 non-null     float64
 7   is_member      10 non-null     bool
dtypes: bool(1), float64(2), int64(2), object(3)
memory usage: 728.0 bytes
What just happened?
.info() is a pandas method that prints a concise structural summary. Unlike .dtypes, it also shows the Non-Null Count per column — this is how you spot missing values at a glance. If a column shows 8 non-null in a 10-row dataset, that means 2 values are missing.
- All 8 columns show 10 non-null — no missing values in this dataset
- The dtype summary at the bottom, bool(1), float64(2), int64(2), object(3), tells you the mix of column types at a glance
- Memory usage shows 728 bytes: tiny here, but worth checking on million-row datasets before you run memory-heavy operations on them
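The following sketch shows the two things .info() is used for above on a frame that does have gaps. The four-row frame and its missing values are invented for illustration. It counts missing values per column with isna().sum(), and compares the default memory figure with the deep=True figure, which also measures the contents of text columns and so gives a more realistic number for them.

```python
import pandas as pd

# Hypothetical frame with one missing value in each column
df = pd.DataFrame({
    "customer_name": ["Alice Brown", None, "Carol White", "Dave Lee"],
    "total": [999.98, 199.95, None, 44.97],
})

# Missing values per column: the complement of the Non-Null Count in .info()
missing_per_column = df.isna().sum()
print(missing_per_column)

# Default memory figure versus a deep measurement of text columns
shallow = df.memory_usage().sum()
deep = df.memory_usage(deep=True).sum()
print("Shallow bytes:", shallow)
print("Deep bytes   :", deep)

# .info() can report the deep figure directly
df.info(memory_usage="deep")
```

The exact byte counts depend on the pandas version and platform, so treat them as relative indicators rather than fixed values.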
Structure Commands Side by Side
Here is a quick visual reference showing what each structure command tells you and when to use it:
| Command | What It Returns | Use It When |
|---|---|---|
| df.shape | Tuple (rows, columns) | First thing: check dataset size |
| df.columns | All column names | Verify column names before selecting |
| df.index | Row label range / type | Before merging or slicing rows |
| df.dtypes | Data type per column | Before arithmetic or aggregation |
| df.info() | Full summary: types + nulls + memory | Complete first-look at any new dataset |
Putting It All Together
The scenario: A colleague sends you a new dataset and says — "Can you do a quick structural check before we start the analysis?" Here is the exact five-line inspection routine every analyst should run on any new dataset.
import pandas as pd

# Rebuild the full customer dataset
customer_df = pd.DataFrame({
    'order_id': [2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010],
    'customer_name': ['Alice Brown', 'Bob Smith', 'Carol White', 'Dave Lee', 'Eve Patel',
                      'Frank Kim', 'Grace Luo', 'Hank Diaz', 'Ivy Chen', 'Jack Roy'],
    'country': ['USA', 'UK', 'Canada', 'USA', 'India', 'UK', 'Canada', 'USA', 'India', 'UK'],
    'category': ['Electronics', 'Clothing', 'Electronics', 'Books', 'Clothing',
                 'Electronics', 'Books', 'Clothing', 'Electronics', 'Books'],
    'quantity': [2, 5, 1, 3, 4, 1, 6, 2, 1, 3],
    'unit_price': [499.99, 39.99, 899.00, 14.99, 24.99, 649.00, 9.99, 59.99, 399.00, 19.99],
    'total': [999.98, 199.95, 899.00, 44.97, 99.96, 649.00, 59.94, 119.98, 399.00, 59.97],
    'is_member': [True, False, True, False, True, True, False, True, False, True]
})

# --- THE STANDARD STRUCTURAL INSPECTION ROUTINE ---

# 1. Size: how many rows and columns?
print("1. Shape:", customer_df.shape)

# 2. Column names: what variables exist?
print("2. Columns:", customer_df.columns.tolist())

# 3. Data types: are numbers stored as numbers?
print("3. Data types:")
print(customer_df.dtypes)

# 4. First few rows: does the data look right?
print("4. First 3 rows:")
print(customer_df.head(3))

# 5. Full structural summary: types + nulls + memory
print("5. Full info:")
customer_df.info()
1. Shape: (10, 8)
2. Columns: ['order_id', 'customer_name', 'country', 'category', 'quantity', 'unit_price', 'total', 'is_member']
3. Data types:
order_id           int64
customer_name     object
country           object
category          object
quantity           int64
unit_price       float64
total            float64
is_member           bool
dtype: object
4. First 3 rows:
   order_id  customer_name  country     category  quantity  unit_price   total  is_member
0      2001    Alice Brown      USA  Electronics         2      499.99  999.98       True
1      2002      Bob Smith       UK     Clothing         5       39.99  199.95      False
2      2003    Carol White   Canada  Electronics         1      899.00  899.00       True
5. Full info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 8 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   order_id       10 non-null     int64
 1   customer_name  10 non-null     object
 2   country        10 non-null     object
 3   category       10 non-null     object
 4   quantity       10 non-null     int64
 5   unit_price     10 non-null     float64
 6   total          10 non-null     float64
 7   is_member      10 non-null     bool
dtypes: bool(1), float64(2), int64(2), object(3)
memory usage: 728.0 bytes
What just happened?
This is your standard structural inspection routine — five commands that together give you the complete picture of any dataset in under 30 seconds. Run these five lines on every dataset you ever receive before doing anything else. You will immediately know the size, the columns, the types, what the data looks like, and whether anything is missing. This is professional EDA practice from day one.
Teacher's Note
The most common beginner mistake is running .info() and seeing all columns show 10 non-null and assuming the data is clean. That just means there are no Python None or NaN values — but the data can still have placeholder garbage like "N/A", "unknown", "-", or 0 used to represent missing information. Structure checks are your first filter. Deeper quality checks come in Lessons 6 and 15.
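The point in the note can be sketched as follows. The five-row frame and the PLACEHOLDERS list are hypothetical and should be adapted to whatever markers your own data source uses. The frame reports zero nulls, yet three country values and two quantity values are placeholders standing in for missing information.

```python
import pandas as pd

# Hypothetical frame: no real nulls, but several placeholder values
df = pd.DataFrame({
    "country": ["USA", "N/A", "unknown", "UK", "-"],
    "quantity": [2, 0, 5, 0, 1],
})

# A null check alone reports nothing missing
null_counts = df.isna().sum()
print(null_counts)

# PLACEHOLDERS is an assumed list for this sketch
PLACEHOLDERS = ["N/A", "unknown", "-"]
placeholder_count = df["country"].isin(PLACEHOLDERS).sum()
zero_count = (df["quantity"] == 0).sum()
print("Placeholder countries:", placeholder_count)
print("Zero quantities      :", zero_count)

# Turn the placeholders into real missing values so later null checks see them
cleaned = df["country"].mask(df["country"].isin(PLACEHOLDERS))
print(cleaned.isna().sum())
```

Whether a 0 really means "missing" depends on the column, so a count like zero_count is a prompt to investigate rather than an automatic fix.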
Practice Questions
1. Which pandas method gives you column names, data types, non-null counts, and memory usage all in one output?
2. When a text column like customer_name is inspected with .dtypes, what data type label does pandas show for it?
3. To get both the row count and column count of a DataFrame in a single attribute (no parentheses), you use df.______.
Quiz
1. You run df.info() on a 100-row dataset and see one column shows 87 non-null. What does this mean?
2. A column storing decimal prices like 499.99 will have which pandas dtype?
3. For a 10-row DataFrame with no custom index set, what does df.index return?
Up Next · Lesson 3
Data Types in Detail
Go deeper into numeric, categorical, boolean, and datetime types — and learn how to convert between them when your data comes in the wrong format.