Working with Large Datasets in Pandas
As datasets grow larger, performance and memory usage become critical. In real-world projects, you may work with millions of rows that cannot be handled efficiently using default settings.
In this lesson, you will learn practical techniques to work with large datasets efficiently using Pandas.
What is a Large Dataset?
A dataset is considered large when:
- It consumes significant system memory
- Operations become slow or unresponsive
- Loading the full dataset at once causes errors
Even moderately sized CSV files can cause issues if not handled correctly.
Checking Dataset Size
Before optimizing, always inspect your dataset.
import pandas as pd
sales = pd.read_csv("dataplexa_pandas_sales.csv")
sales.shape
This shows the number of rows and columns.
Inspecting Memory Usage
Use info() to understand memory consumption.
sales.info()
This output helps identify the columns that consume the most memory.
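For exact per-column byte counts, memory_usage(deep=True) is often more revealing than info(), because it measures the actual Python strings stored in object columns. A minimal sketch, using a small made-up DataFrame in place of the sales file (column names assumed):

```python
import pandas as pd

# Tiny sample frame standing in for the sales data (columns are assumptions).
sales = pd.DataFrame({
    "order_id": [1, 2, 3],
    "product": ["pen", "book", "pen"],
    "revenue": [2.5, 10.0, 2.5],
})

# deep=True counts the real bytes held by each column,
# including string objects that info() may only estimate.
per_column = sales.memory_usage(deep=True)
print(per_column)
```

Text columns usually dominate this report, which is a good hint about where to optimize first.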
Reading Only Required Columns
Instead of loading all columns, select only what you need.
sales_small = pd.read_csv(
"dataplexa_pandas_sales.csv",
usecols=["order_id", "product", "quantity", "revenue"]
)
This significantly reduces memory usage.
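You can verify the effect without a large file on disk. The sketch below simulates the CSV with an in-memory buffer (the column names and values are made up for illustration):

```python
import io
import pandas as pd

# Simulated CSV standing in for the sales file.
csv_data = io.StringIO(
    "order_id,product,quantity,revenue,notes\n"
    "1,pen,3,7.5,gift\n"
    "2,book,1,12.0,\n"
)

# usecols skips parsing and storing every column not in the list.
subset = pd.read_csv(csv_data, usecols=["order_id", "revenue"])
print(subset.columns.tolist())
```

Columns left out of usecols are never materialized, so the savings apply at load time, not just afterward.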
Loading Data in Chunks
For very large files, load data in smaller chunks.
chunks = pd.read_csv(
"dataplexa_pandas_sales.csv",
chunksize=3
)
for chunk in chunks:
print(chunk)
This approach processes the file piece by piece instead of loading everything at once. A chunksize of 3 is only for demonstration; for real files, values in the tens or hundreds of thousands of rows are more typical.
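The usual pattern is to aggregate as you go, so only one chunk is in memory at a time. A sketch with a simulated CSV buffer (file contents are made up):

```python
import io
import pandas as pd

# Simulated CSV; in practice this would be a large file on disk.
csv_data = io.StringIO(
    "order_id,revenue\n"
    "1,10.0\n2,20.0\n3,30.0\n4,40.0\n5,50.0\n"
)

# Sum revenue chunk by chunk; only `chunksize` rows live in memory at once.
total_revenue = 0.0
for chunk in pd.read_csv(csv_data, chunksize=2):
    total_revenue += chunk["revenue"].sum()

print(total_revenue)  # 150.0
```

Any reduction that combines per-chunk results (sums, counts, running maxima) fits this pattern.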
Optimizing Data Types
Using correct data types saves memory.
sales["quantity"] = sales["quantity"].astype("int16")
sales["revenue"] = sales["revenue"].astype("float32")
Smaller data types reduce memory consumption, but make sure the values fit the narrower range (int16 holds -32,768 to 32,767) and that the reduced float precision is acceptable.
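If you would rather not pick the width by hand, pd.to_numeric with downcast chooses the smallest type that can hold the values. A minimal sketch on made-up data:

```python
import pandas as pd

sales = pd.DataFrame({"quantity": [1, 5, 12], "revenue": [9.5, 20.0, 3.25]})

# downcast selects the narrowest dtype that still fits every value,
# so there is no risk of silently overflowing a hand-picked int16.
sales["quantity"] = pd.to_numeric(sales["quantity"], downcast="integer")
sales["revenue"] = pd.to_numeric(sales["revenue"], downcast="float")

print(sales.dtypes)
```

Here the small quantities fit in int8, and the floats shrink to float32.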
Using Categorical Data
For repeated text values, use categorical data types.
sales["region"] = sales["region"].astype("category")
This is especially useful for columns like region, category, or status.
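The savings come from storing each distinct label once and replacing the repeated strings with small integer codes. A quick comparison on a made-up column with a few repeated labels:

```python
import pandas as pd

# A column with many repeats of a few labels, typical for region/status.
region = pd.Series(["North", "South", "North", "East"] * 1000)

as_object = region.memory_usage(deep=True)
as_category = region.astype("category").memory_usage(deep=True)

print(as_object, as_category)
```

The more repetition a column has, the bigger the gap; a column of mostly unique strings gains little.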
Filtering Early
Filter data as early as possible to avoid unnecessary processing.
high_value_sales = sales[sales["revenue"] > 1000]
Smaller DataFrames are faster to work with.
Dropping Unused Columns
Remove columns that are not required for analysis.
sales = sales.drop(columns=["notes", "comments"])
Best Practices for Large Data
- Load only required columns
- Use chunks for very large files
- Optimize data types
- Filter early in the workflow
- Monitor memory using info()
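These practices combine naturally in a single read_csv call. A sketch using a simulated CSV buffer (the file contents and column names are assumptions for illustration):

```python
import io
import pandas as pd

# Simulated CSV standing in for the sales file.
csv_data = io.StringIO(
    "order_id,region,quantity,revenue\n"
    "1,North,2,500.0\n"
    "2,South,1,1500.0\n"
    "3,North,4,2500.0\n"
)

# Needed columns only, compact dtypes at load time, then an early filter.
sales = pd.read_csv(
    csv_data,
    usecols=["region", "quantity", "revenue"],
    dtype={"region": "category", "quantity": "int16", "revenue": "float32"},
)
high_value = sales[sales["revenue"] > 1000]
print(len(high_value))  # 2
```

Passing dtype to read_csv avoids ever holding the wider default types in memory, which matters more as the file grows.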
Practice Exercise
Try the following:
- Load the dataset using usecols
- Convert one text column to category
- Filter rows with high revenue
What’s Next?
In the next lesson, you will learn how to work with categorical data efficiently in Pandas.