EDA Lesson 2 – Dataset Structure | Dataplexa
Beginner Level · Lesson 2

Understanding Dataset Structure

Every time you open a new dataset, the very first thing you do is understand its structure — how many rows, how many columns, what each column holds, and how it's all organised. This lesson gives you the exact commands to do that in under two minutes.

What "Dataset Structure" Means

When analysts talk about dataset structure, they mean the skeleton of the data — before you look at any values, you want to know the shape of the container that holds them. This includes:

  • Dimensions — how many rows (observations) and columns (features)
  • Column names — what each variable is called
  • Data types — whether each column stores numbers, text, dates, or booleans
  • Index — how the rows are labelled (usually 0, 1, 2… by default)
  • Memory usage — how much RAM the dataset is consuming

Understanding structure first means you never get surprised halfway through an analysis. You won't try to calculate the mean of a column that's actually stored as text, or wonder why your merge produced 10x more rows than expected.

Think of a dataset like a building blueprint. Before you move furniture in, you check how many rooms there are, what each room is for, and what the walls are made of. Dataset structure is exactly that — the blueprint before the work begins.
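The "mean of a column that's actually stored as text" surprise is easy to demonstrate. This is a minimal toy example (not the lesson dataset) showing why checking types first matters:

```python
import pandas as pd

# A column that looks numeric but is stored as text — a common trap
df = pd.DataFrame({'price': ['10.5', '20.0', '30.5']})

print(df['price'].dtype)   # object — these are strings, not numbers
print(df['price'].sum())   # string concatenation: '10.520.030.5'

# Converting first gives the arithmetic you actually wanted
print(pd.to_numeric(df['price']).mean())   # ≈ 20.33
```

The sum of an object column silently concatenates strings instead of raising an error, which is exactly why a structure check comes before any calculation.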

The Dataset We Are Using

Throughout this lesson we will work with a customer orders dataset from an online retail store. It has a mix of numeric, text, and date-like columns — exactly the kind of messy, real-world structure you will encounter on the job.

customer_df — Preview (first 5 rows)
order_id  customer_name  country  category     quantity  unit_price  total   is_member
2001      Alice Brown    USA      Electronics  2         499.99      999.98  True
2002      Bob Smith      UK       Clothing     5         39.99       199.95  False
2003      Carol White    Canada   Electronics  1         899.00      899.00  True
2004      Dave Lee       USA      Books        3         14.99       44.97   False
2005      Eve Patel      India    Clothing     4         24.99       99.96   True

shape — Rows and Columns in One Shot

The scenario: Your team lead sends you this dataset and asks — "How big is it? How many records do we have?" The very first command every analyst runs is .shape. It tells you the dimensions of the entire DataFrame in a single tuple: (rows, columns).

import pandas as pd

# Build the customer orders dataset inline
customer_df = pd.DataFrame({
    'order_id':       [2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010],
    'customer_name':  ['Alice Brown', 'Bob Smith', 'Carol White', 'Dave Lee', 'Eve Patel',
                       'Frank Kim', 'Grace Luo', 'Hank Diaz', 'Ivy Chen', 'Jack Roy'],
    'country':        ['USA', 'UK', 'Canada', 'USA', 'India', 'UK', 'Canada', 'USA', 'India', 'UK'],
    'category':       ['Electronics', 'Clothing', 'Electronics', 'Books', 'Clothing',
                       'Electronics', 'Books', 'Clothing', 'Electronics', 'Books'],
    'quantity':       [2, 5, 1, 3, 4, 1, 6, 2, 1, 3],
    'unit_price':     [499.99, 39.99, 899.00, 14.99, 24.99, 649.00, 9.99, 59.99, 399.00, 19.99],
    'total':          [999.98, 199.95, 899.00, 44.97, 99.96, 649.00, 59.94, 119.98, 399.00, 59.97],
    'is_member':      [True, False, True, False, True, True, False, True, False, True]
})

# .shape returns a tuple: (number of rows, number of columns)
print("Shape:", customer_df.shape)

# You can also access rows and columns individually
print("Rows   :", customer_df.shape[0])   # index 0 = row count
print("Columns:", customer_df.shape[1])   # index 1 = column count
Shape: (10, 8)
Rows   : 10
Columns: 8

What just happened?

.shape is a pandas DataFrame attribute — not a function, so there are no parentheses. It returns a Python tuple with two values: row count first, column count second.

  • We have 10 rows — 10 individual customer orders
  • We have 8 columns — 8 variables describing each order
  • .shape[0] and .shape[1] let you extract either number individually — useful when you want to print them in a report or use them in conditions
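Because .shape is a plain tuple, you can also unpack it in one line and use the counts in conditions. A small sketch with a throwaway frame (not the lesson dataset):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4.0, 5.0, 6.0]})

# Tuple unpacking works because .shape is (rows, columns)
rows, cols = df.shape
print(f"Dataset has {rows} rows and {cols} columns")

# Use the row count as a guard before an expensive step
if rows == 0:
    print("Warning: empty dataset — nothing to analyse")
```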

columns and index — Names and Labels

The scenario: You've confirmed the size. Now your manager asks — "What columns do we actually have, and how are the rows labelled?" These two attributes give you the full list of column names and the row index in one glance.

import pandas as pd

# Rebuild the dataset
customer_df = pd.DataFrame({
    'order_id':       [2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010],
    'customer_name':  ['Alice Brown', 'Bob Smith', 'Carol White', 'Dave Lee', 'Eve Patel',
                       'Frank Kim', 'Grace Luo', 'Hank Diaz', 'Ivy Chen', 'Jack Roy'],
    'country':        ['USA', 'UK', 'Canada', 'USA', 'India', 'UK', 'Canada', 'USA', 'India', 'UK'],
    'category':       ['Electronics', 'Clothing', 'Electronics', 'Books', 'Clothing',
                       'Electronics', 'Books', 'Clothing', 'Electronics', 'Books'],
    'quantity':       [2, 5, 1, 3, 4, 1, 6, 2, 1, 3],
    'unit_price':     [499.99, 39.99, 899.00, 14.99, 24.99, 649.00, 9.99, 59.99, 399.00, 19.99],
    'total':          [999.98, 199.95, 899.00, 44.97, 99.96, 649.00, 59.94, 119.98, 399.00, 59.97],
    'is_member':      [True, False, True, False, True, True, False, True, False, True]
})

# .columns returns all column names as a pandas Index object
print("Column names:")
print(customer_df.columns.tolist())   # .tolist() converts it to a plain Python list

# .index shows how the rows are labelled
print("\nRow index:")
print(customer_df.index)
Column names:
['order_id', 'customer_name', 'country', 'category', 'quantity', 'unit_price', 'total', 'is_member']

Row index:
RangeIndex(start=0, stop=10, step=1)

What just happened?

.columns is a pandas attribute that holds all column names as an Index object. We chained .tolist() — a method on that Index object — to convert it into a plain, readable Python list.

.index shows the row labels. Here it says RangeIndex(start=0, stop=10, step=1) — meaning rows are labelled 0 through 9, which is the default. In real-world datasets, the index is sometimes set to a customer ID or date, which changes how you select and merge data. Knowing the index type upfront prevents silent errors later.
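To see how a custom index changes row selection, here is a minimal sketch using a tiny stand-in frame (the column names mirror the lesson dataset, but the data is illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'order_id': [2001, 2002, 2003],
    'total':    [999.98, 199.95, 899.00]
})

print(df.index)   # RangeIndex(start=0, stop=3, step=1) — the default

# Promote order_id to the index (returns a new DataFrame)
df_by_order = df.set_index('order_id')
print(df_by_order.index)   # an integer Index of order IDs, named 'order_id'

# Now .loc selects by order ID, not by position
print(df_by_order.loc[2002, 'total'])   # 199.95
```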

dtypes — What Type of Data is in Each Column

The scenario: You want to calculate the average order total. But before you do, you need to confirm that total and unit_price are actually stored as numbers — not as text strings that look like numbers. This is a surprisingly common issue in real datasets exported from Excel or databases.

import pandas as pd

# Rebuild the dataset
customer_df = pd.DataFrame({
    'order_id':       [2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010],
    'customer_name':  ['Alice Brown', 'Bob Smith', 'Carol White', 'Dave Lee', 'Eve Patel',
                       'Frank Kim', 'Grace Luo', 'Hank Diaz', 'Ivy Chen', 'Jack Roy'],
    'country':        ['USA', 'UK', 'Canada', 'USA', 'India', 'UK', 'Canada', 'USA', 'India', 'UK'],
    'category':       ['Electronics', 'Clothing', 'Electronics', 'Books', 'Clothing',
                       'Electronics', 'Books', 'Clothing', 'Electronics', 'Books'],
    'quantity':       [2, 5, 1, 3, 4, 1, 6, 2, 1, 3],
    'unit_price':     [499.99, 39.99, 899.00, 14.99, 24.99, 649.00, 9.99, 59.99, 399.00, 19.99],
    'total':          [999.98, 199.95, 899.00, 44.97, 99.96, 649.00, 59.94, 119.98, 399.00, 59.97],
    'is_member':      [True, False, True, False, True, True, False, True, False, True]
})

# .dtypes shows the data type stored in each column
print(customer_df.dtypes)
order_id          int64
customer_name    object
country          object
category         object
quantity          int64
unit_price      float64
total           float64
is_member          bool
dtype: object

What just happened?

.dtypes is a pandas attribute that returns the data type of every column. Here is what each type means:

  • int64 — whole numbers (order_id, quantity)
  • float64 — decimal numbers (unit_price, total)
  • object — text / mixed (customer_name, country, category)
  • bool — True/False values (is_member)

If total had come back as object instead of float64, that would be a red flag — it would mean the column contains text like "$999.98" and you cannot do arithmetic on it without cleaning first.
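Here is a quick sketch of that cleaning step, using made-up currency strings to simulate the red-flag case:

```python
import pandas as pd

# A 'total' column exported with currency symbols — stored as text
df = pd.DataFrame({'total': ['$999.98', '$199.95', '$899.00']})
print(df['total'].dtype)   # object — arithmetic will not work yet

# Strip the symbol, then convert to float
df['total'] = df['total'].str.replace('$', '', regex=False).astype(float)
print(df['total'].dtype)   # float64
print(df['total'].mean())  # ≈ 699.64
```

Once the dtype is float64, the mean, sum, and every other aggregation behave as expected.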

info() — The Complete Structural Report

The scenario: You want a single command that gives you everything at once — column names, data types, non-null counts, and memory usage — in one clean printout. That command is .info() and it is the single most useful structure command in pandas.

import pandas as pd

# Rebuild the dataset
customer_df = pd.DataFrame({
    'order_id':       [2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010],
    'customer_name':  ['Alice Brown', 'Bob Smith', 'Carol White', 'Dave Lee', 'Eve Patel',
                       'Frank Kim', 'Grace Luo', 'Hank Diaz', 'Ivy Chen', 'Jack Roy'],
    'country':        ['USA', 'UK', 'Canada', 'USA', 'India', 'UK', 'Canada', 'USA', 'India', 'UK'],
    'category':       ['Electronics', 'Clothing', 'Electronics', 'Books', 'Clothing',
                       'Electronics', 'Books', 'Clothing', 'Electronics', 'Books'],
    'quantity':       [2, 5, 1, 3, 4, 1, 6, 2, 1, 3],
    'unit_price':     [499.99, 39.99, 899.00, 14.99, 24.99, 649.00, 9.99, 59.99, 399.00, 19.99],
    'total':          [999.98, 199.95, 899.00, 44.97, 99.96, 649.00, 59.94, 119.98, 399.00, 59.97],
    'is_member':      [True, False, True, False, True, True, False, True, False, True]
})

# .info() prints a full structural summary of the DataFrame
# It shows: index type, column names, non-null count, data types, and memory usage
customer_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 8 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   order_id       10 non-null     int64
 1   customer_name  10 non-null     object
 2   country        10 non-null     object
 3   category       10 non-null     object
 4   quantity       10 non-null     int64
 5   unit_price     10 non-null     float64
 6   total          10 non-null     float64
 7   is_member      10 non-null     bool
dtypes: bool(1), float64(2), int64(2), object(3)
memory usage: 728.0 bytes

What just happened?

.info() is a pandas method that prints a concise structural summary. Unlike .dtypes, it also shows the Non-Null Count per column — this is how you spot missing values at a glance. If a column shows 8 non-null in a 10-row dataset, that means 2 values are missing.

  • All 8 columns show 10 non-null — no missing values in this dataset
  • The dtype summary at the bottom — bool(1), float64(2), int64(2), object(3) — tells you the mix of column types at a glance
  • Memory usage shows 728 bytes — tiny here, but worth checking on million-row datasets, where a single DataFrame can quietly consume gigabytes of RAM
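To see what a missing-value gap looks like in practice, here is a small sketch with two deliberately missing totals (toy data, not the lesson dataset):

```python
import pandas as pd
import numpy as np

# Two NaN values planted in 'total' — .info() would report "total  3 non-null"
df = pd.DataFrame({
    'order_id': [1, 2, 3, 4, 5],
    'total':    [10.0, np.nan, 30.0, np.nan, 50.0]
})

# Count the gaps directly instead of reading them off the .info() printout
missing = df['total'].isna().sum()
print(f"Missing values in 'total': {missing}")   # 2
```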

Structure Commands Side by Side

Here is a quick visual reference showing what each structure command tells you and when to use it:

Structure Command Reference
Command      What It Returns                         Use It When
df.shape     Tuple of (rows, columns)                First thing — check dataset size
df.columns   All column names                        Verify column names before selecting
df.index     Row label range / type                  Before merging or slicing rows
df.dtypes    Data type per column                    Before arithmetic or aggregation
df.info()    Full summary: types + nulls + memory    Complete first look at any new dataset

Putting It All Together

The scenario: A colleague sends you a new dataset and says — "Can you do a quick structural check before we start the analysis?" Here is the exact five-step inspection routine every analyst should run on any new dataset.

import pandas as pd

# Rebuild the full customer dataset
customer_df = pd.DataFrame({
    'order_id':       [2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010],
    'customer_name':  ['Alice Brown', 'Bob Smith', 'Carol White', 'Dave Lee', 'Eve Patel',
                       'Frank Kim', 'Grace Luo', 'Hank Diaz', 'Ivy Chen', 'Jack Roy'],
    'country':        ['USA', 'UK', 'Canada', 'USA', 'India', 'UK', 'Canada', 'USA', 'India', 'UK'],
    'category':       ['Electronics', 'Clothing', 'Electronics', 'Books', 'Clothing',
                       'Electronics', 'Books', 'Clothing', 'Electronics', 'Books'],
    'quantity':       [2, 5, 1, 3, 4, 1, 6, 2, 1, 3],
    'unit_price':     [499.99, 39.99, 899.00, 14.99, 24.99, 649.00, 9.99, 59.99, 399.00, 19.99],
    'total':          [999.98, 199.95, 899.00, 44.97, 99.96, 649.00, 59.94, 119.98, 399.00, 59.97],
    'is_member':      [True, False, True, False, True, True, False, True, False, True]
})

# --- THE STANDARD STRUCTURAL INSPECTION ROUTINE ---

# 1. Size: how many rows and columns?
print("1. Shape:", customer_df.shape)

# 2. Column names: what variables exist?
print("2. Columns:", customer_df.columns.tolist())

# 3. Data types: are numbers stored as numbers?
print("3. Data types:")
print(customer_df.dtypes)

# 4. First few rows: does the data look right?
print("4. First 3 rows:")
print(customer_df.head(3))

# 5. Full structural summary: types + nulls + memory
print("5. Full info:")
customer_df.info()
1. Shape: (10, 8)
2. Columns: ['order_id', 'customer_name', 'country', 'category', 'quantity', 'unit_price', 'total', 'is_member']
3. Data types:
order_id          int64
customer_name    object
country          object
category         object
quantity          int64
unit_price      float64
total           float64
is_member          bool
dtype: object
4. First 3 rows:
   order_id customer_name country     category  quantity  unit_price    total  is_member
0      2001   Alice Brown     USA  Electronics         2      499.99   999.98       True
1      2002     Bob Smith      UK     Clothing         5       39.99   199.95      False
2      2003   Carol White  Canada  Electronics         1      899.00   899.00       True
5. Full info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 8 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   order_id       10 non-null     int64
 1   customer_name  10 non-null     object
 2   country        10 non-null     object
 3   category       10 non-null     object
 4   quantity       10 non-null     int64
 5   unit_price     10 non-null     float64
 6   total          10 non-null     float64
 7   is_member      10 non-null     bool
dtypes: bool(1), float64(2), int64(2), object(3)
memory usage: 728.0 bytes

What just happened?

This is your standard structural inspection routine — five commands that together give you the complete picture of any dataset in under 30 seconds. Run these five commands on every dataset you receive before doing anything else. You will immediately know the size, the columns, the types, what the data looks like, and whether anything is missing. This is professional EDA practice from day one.

Teacher's Note

The most common beginner mistake is running .info() and seeing all columns show 10 non-null and assuming the data is clean. That just means there are no Python None or NaN values — but the data can still have placeholder garbage like "N/A", "unknown", "-", or 0 used to represent missing information. Structure checks are your first filter. Deeper quality checks come in Lessons 6 and 15.
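As a quick illustration of that filter gap, here is a sketch that flags placeholder strings which .info() counts as perfectly valid. The placeholder list is an assumption — adapt it to whatever garbage your dataset actually uses:

```python
import pandas as pd

# All values are non-null, yet two are placeholder garbage
df = pd.DataFrame({'country': ['USA', 'N/A', 'UK', 'unknown', 'India']})
print(df['country'].notna().all())   # True — .info() would report 5 non-null

# Flag common placeholder strings (this list is an assumption, not a standard)
placeholders = ['N/A', 'unknown', '-', '']
suspect = df['country'].isin(placeholders).sum()
print(f"Placeholder values found: {suspect}")   # 2

# Convert them to real missing values so later checks can see them
df['country'] = df['country'].replace(placeholders, pd.NA)
print(df['country'].isna().sum())   # 2
```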

Practice Questions

1. Which pandas method gives you column names, data types, non-null counts, and memory usage all in one output?



2. When a text column like customer_name is inspected with .dtypes, what data type label does pandas show for it?



3. To get both the row count and column count of a DataFrame in a single attribute (no parentheses), you use df.______.



Quiz

1. You run df.info() on a 100-row dataset and see one column shows 87 non-null. What does this mean?


2. A column storing decimal prices like 499.99 will have which pandas dtype?


3. For a 10-row DataFrame with no custom index set, what does df.index return?


Up Next · Lesson 3

Data Types in Detail

Go deeper into numeric, categorical, boolean, and datetime types — and learn how to convert between them when your data comes in the wrong format.