Working with Large Datasets in Pandas
As datasets grow larger, performance and memory usage become critical. In real-world projects, you may work with millions of rows that cannot be handled efficiently using default settings.
In this lesson, you will learn practical techniques to work with large datasets efficiently using Pandas.
What is a Large Dataset?
A dataset is considered large when:
- It consumes significant system memory
- Operations become slow or unresponsive
- Loading the full dataset at once causes errors
Even moderately sized CSV files can cause issues if not handled correctly.
Checking Dataset Size
Before optimizing, always inspect your dataset.
import pandas as pd
sales = pd.read_csv("dataplexa_pandas_sales.csv")
sales.shape
This shows the number of rows and columns.
Inspecting Memory Usage
Use info() to understand memory consumption.
sales.info()
This output helps identify the columns that consume the most memory.
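For exact per-column byte counts, memory_usage(deep=True) is often more revealing than info(), because it measures the actual Python strings stored in object columns. A minimal sketch, using a small made-up DataFrame in place of the sales file (column names assumed):

```python
import pandas as pd

# Tiny sample frame standing in for the sales data (columns are assumptions).
sales = pd.DataFrame({
    "order_id": [1, 2, 3],
    "product": ["pen", "book", "pen"],
    "revenue": [2.5, 10.0, 2.5],
})

# deep=True counts the real bytes held by each column,
# including string objects that info() may only estimate.
per_column = sales.memory_usage(deep=True)
print(per_column)
```

Text columns usually dominate this report, which is a good hint about where to optimize first.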
Reading Only Required Columns
Instead of loading all columns, select only what you need.
sales_small = pd.read_csv(
"dataplexa_pandas_sales.csv",
usecols=["order_id", "product", "quantity", "revenue"]
)
This significantly reduces memory usage.
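You can verify the effect without a large file on disk. The sketch below simulates the CSV with an in-memory buffer (the column names and values are made up for illustration):

```python
import io
import pandas as pd

# Simulated CSV standing in for the sales file.
csv_data = io.StringIO(
    "order_id,product,quantity,revenue,notes\n"
    "1,pen,3,7.5,gift\n"
    "2,book,1,12.0,\n"
)

# usecols skips parsing and storing every column not in the list.
subset = pd.read_csv(csv_data, usecols=["order_id", "revenue"])
print(subset.columns.tolist())
```

Columns left out of usecols are never materialized, so the savings apply at load time, not just afterward.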
Loading Data in Chunks
For very large files, load data in smaller chunks.
chunks = pd.read_csv(
"dataplexa_pandas_sales.csv",
chunksize=3
)
for chunk in chunks:
print(chunk)
This approach processes the file piece by piece instead of loading everything at once. A chunksize of 3 is only for demonstration; for real files, values in the tens or hundreds of thousands of rows are more typical.
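The usual pattern is to aggregate as you go, so only one chunk is in memory at a time. A sketch with a simulated CSV buffer (file contents are made up):

```python
import io
import pandas as pd

# Simulated CSV; in practice this would be a large file on disk.
csv_data = io.StringIO(
    "order_id,revenue\n"
    "1,10.0\n2,20.0\n3,30.0\n4,40.0\n5,50.0\n"
)

# Sum revenue chunk by chunk; only `chunksize` rows live in memory at once.
total_revenue = 0.0
for chunk in pd.read_csv(csv_data, chunksize=2):
    total_revenue += chunk["revenue"].sum()

print(total_revenue)  # 150.0
```

Any reduction that combines per-chunk results (sums, counts, running maxima) fits this pattern.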
Optimizing Data Types
Using correct data types saves memory.
sales["quantity"] = sales["quantity"].astype("int16")
sales["revenue"] = sales["revenue"].astype("float32")
Smaller data types reduce memory consumption, but make sure the values fit the narrower range (int16 holds -32,768 to 32,767) and that the reduced float precision is acceptable.
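If you would rather not pick the width by hand, pd.to_numeric with downcast chooses the smallest type that can hold the values. A minimal sketch on made-up data:

```python
import pandas as pd

sales = pd.DataFrame({"quantity": [1, 5, 12], "revenue": [9.5, 20.0, 3.25]})

# downcast selects the narrowest dtype that still fits every value,
# so there is no risk of silently overflowing a hand-picked int16.
sales["quantity"] = pd.to_numeric(sales["quantity"], downcast="integer")
sales["revenue"] = pd.to_numeric(sales["revenue"], downcast="float")

print(sales.dtypes)
```

Here the small quantities fit in int8, and the floats shrink to float32.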
Using Categorical Data
For repeated text values, use categorical data types.
sales["region"] = sales["region"].astype("category")
This is especially useful for columns like region, category, or status.
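The savings come from storing each distinct label once and replacing the repeated strings with small integer codes. A quick comparison on a made-up column with a few repeated labels:

```python
import pandas as pd

# A column with many repeats of a few labels, typical for region/status.
region = pd.Series(["North", "South", "North", "East"] * 1000)

as_object = region.memory_usage(deep=True)
as_category = region.astype("category").memory_usage(deep=True)

print(as_object, as_category)
```

The more repetition a column has, the bigger the gap; a column of mostly unique strings gains little.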
Filtering Early
Filter data as early as possible to avoid unnecessary processing.
high_value_sales = sales[sales["revenue"] > 1000]
Smaller DataFrames are faster to work with.
Dropping Unused Columns
Remove columns that are not required for analysis.
sales = sales.drop(columns=["notes", "comments"])
Best Practices for Large Data
- Load only required columns
- Use chunks for very large files
- Optimize data types
- Filter early in the workflow
- Monitor memory using info()
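These practices combine naturally in a single read_csv call. A sketch using a simulated CSV buffer (the file contents and column names are assumptions for illustration):

```python
import io
import pandas as pd

# Simulated CSV standing in for the sales file.
csv_data = io.StringIO(
    "order_id,region,quantity,revenue\n"
    "1,North,2,500.0\n"
    "2,South,1,1500.0\n"
    "3,North,4,2500.0\n"
)

# Needed columns only, compact dtypes at load time, then an early filter.
sales = pd.read_csv(
    csv_data,
    usecols=["region", "quantity", "revenue"],
    dtype={"region": "category", "quantity": "int16", "revenue": "float32"},
)
high_value = sales[sales["revenue"] > 1000]
print(len(high_value))  # 2
```

Passing dtype to read_csv avoids ever holding the wider default types in memory, which matters more as the file grows.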
Practice Exercise
Try the following:
- Load the dataset using usecols
- Convert one text column to category
- Filter rows with high revenue
What’s Next?
In the next lesson, you will learn how to work with categorical data efficiently in Pandas.