Pandas Lesson 29 – Performance | Dataplexa

Performance Optimization in Pandas

When working with large datasets, performance becomes critical. Slow operations can waste time, memory, and computing resources.

In this lesson, you will learn how to write faster and more efficient Pandas code without changing the results.


Why Performance Matters in Pandas

Pandas is powerful, but inefficient usage can slow it down, especially when working with:

  • Large CSV files
  • Millions of rows
  • Complex transformations

Optimizing your code improves:

  • Execution speed
  • Memory usage
  • Scalability

Understand Data Size First

Before optimizing, always inspect the size of your data.

sales.shape

This tells you how many rows and columns you are processing.
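Alongside .shape, it is worth checking how much memory each column occupies. A minimal sketch, using a small synthetic DataFrame in place of the lesson's sales file:

```python
import pandas as pd

# Synthetic stand-in for the lesson's sales data
sales = pd.DataFrame({
    "region": ["East", "West", "East", "North"] * 250,
    "sales_amount": [120.5, 80.0, 99.9, 150.0] * 250,
})

print(sales.shape)                    # (rows, columns)
print(sales.memory_usage(deep=True))  # bytes used per column
```

With deep=True, memory_usage accounts for the actual contents of object (string) columns rather than just the pointer size.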


Avoid Loops — Use Vectorized Operations

Python-level loops are slow because each iteration runs interpreted code. Pandas is designed for vectorized operations that process an entire column at once.

❌ Slow approach using a loop:

total = 0
for value in sales["sales_amount"]:
    total += value

✅ Fast vectorized approach:

total = sales["sales_amount"].sum()
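The speed difference is easy to see with a simple timing sketch (synthetic data, timed with time.perf_counter; exact numbers will vary by machine):

```python
import time
import pandas as pd

sales = pd.DataFrame({"sales_amount": list(range(100_000))})

# Loop version: Python adds values one at a time
start = time.perf_counter()
total_loop = 0
for value in sales["sales_amount"]:
    total_loop += value
loop_time = time.perf_counter() - start

# Vectorized version: the sum runs in optimized C code
start = time.perf_counter()
total_vec = sales["sales_amount"].sum()
vec_time = time.perf_counter() - start

assert total_loop == total_vec  # same result
print(f"loop: {loop_time:.4f}s, vectorized: {vec_time:.4f}s")
```

On typical hardware the vectorized version is orders of magnitude faster, yet the result is identical.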

Use Built-in Pandas Functions

Built-in Pandas functions are optimized in C and run much faster.

Examples:

  • .sum()
  • .mean()
  • .groupby()
  • .value_counts()

Example:

sales.groupby("region")["sales_amount"].mean()

Reduce Memory Usage with Correct Data Types

Using the wrong data types wastes memory and slows down operations.

Check column data types:

sales.dtypes

Convert columns when possible:

sales["region"] = sales["region"].astype("category")

The category dtype stores each unique value only once, so it uses far less memory when a column contains many repeated values.
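The savings can be measured directly with memory_usage. A small sketch using a synthetic column of repeated region names:

```python
import pandas as pd

# Repetitive string column — a typical candidate for "category"
regions = pd.Series(["East", "West", "North", "South"] * 25_000)

as_object = regions.memory_usage(deep=True)
as_category = regions.astype("category").memory_usage(deep=True)

print(f"object:   {as_object:,} bytes")
print(f"category: {as_category:,} bytes")
```

The category version stores the four unique strings once plus a small integer code per row, so it is dramatically smaller than the object version.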


Read Only Required Columns

When loading large CSV files, avoid reading unnecessary columns.

sales = pd.read_csv(
    "dataplexa_pandas_sales.csv",
    usecols=["order_id", "region", "sales_amount"]
)

This reduces memory usage and speeds up loading.


Use Chunking for Very Large Files

For extremely large files, load data in chunks instead of all at once.

chunks = pd.read_csv(
    "dataplexa_pandas_sales.csv",
    chunksize=10000
)

for chunk in chunks:
    print(chunk["sales_amount"].mean())

Chunking keeps memory usage bounded, because only one chunk is held in memory at a time.
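Note that the loop above prints a separate mean for each chunk, which is not the same as the mean of the whole file. To aggregate across chunks, accumulate a running sum and count. A sketch using an in-memory CSV buffer in place of the large file:

```python
import io
import pandas as pd

# Stand-in CSV buffer; in practice this would be the large file's path
csv_data = "sales_amount\n" + "\n".join(str(v) for v in range(1, 101))

total = 0.0
count = 0
for chunk in pd.read_csv(io.StringIO(csv_data), chunksize=25):
    total += chunk["sales_amount"].sum()
    count += len(chunk)

overall_mean = total / count
print(overall_mean)  # mean of 1..100 = 50.5
```

This pattern computes the exact overall mean while never holding more than one chunk in memory.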


Avoid Unnecessary Copies

Some operations create full copies of a DataFrame. Combining steps into a single expression avoids intermediate copies.

❌ Inefficient:

sales = sales[sales["sales_amount"] > 100]
sales = sales.reset_index(drop=True)

✅ Better:

sales = sales.loc[sales["sales_amount"] > 100].reset_index(drop=True)

Use .loc Instead of Chained Indexing

Chained indexing, such as sales[sales["region"] == "East"]["sales_amount"], performs two separate lookups and may silently operate on a copy. A single .loc call is faster and unambiguous:

sales.loc[sales["region"] == "East", "sales_amount"]
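The difference matters most when assigning values. A short sketch with synthetic data showing the safe .loc form:

```python
import pandas as pd

sales = pd.DataFrame({
    "region": ["East", "West", "East"],
    "sales_amount": [120.0, 80.0, 95.0],
})

# Chained indexing would be two separate operations and may act on a copy:
#   sales[sales["region"] == "East"]["sales_amount"] = 0   # unreliable

# .loc selects and assigns in one operation on the original DataFrame
sales.loc[sales["region"] == "East", "sales_amount"] = 0
print(sales)
```

The chained form can raise a SettingWithCopyWarning and leave the original DataFrame unchanged; the .loc form always modifies it in place.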

Measure Performance

Always measure before and after optimization.

%timeit sales["sales_amount"].mean()

This helps confirm improvements.
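%timeit is an IPython/Jupyter magic command, so it will not work in a plain Python script. In scripts, time.perf_counter from the standard library works everywhere. A minimal sketch with synthetic data:

```python
import time
import pandas as pd

sales = pd.DataFrame({"sales_amount": range(1_000_000)})

start = time.perf_counter()
result = sales["sales_amount"].mean()
elapsed = time.perf_counter() - start

print(f"mean = {result}, took {elapsed:.6f} seconds")
```

Timing the same operation before and after an optimization gives concrete evidence of the improvement.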


Practice Exercise

Using the dataset:

  • Convert categorical columns to category type
  • Load only required columns
  • Compare performance before and after optimization

What’s Next?

In the final lesson, you will apply everything you learned to a complete Pandas Project using real-world data workflows.