Pandas Lesson 22 – Categories | Dataplexa

Working with Categorical Data in Pandas

In many real-world datasets, certain columns contain a small set of values that repeat many times, such as regions, categories, departments, or status labels. Storing these columns with the categorical type can reduce memory usage and speed up some operations.

In this lesson, you will learn how to work with categorical data efficiently using Pandas.


What is Categorical Data?

Categorical data represents values that belong to a limited set of categories. Examples include:

  • Region (North, South, East, West)
  • Product Category (Electronics, Clothing, Grocery)
  • Status (Completed, Pending, Cancelled)

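Because these values repeat frequently, Pandas does not need to store the full string in every row. A categorical column stores each distinct label once and keeps a small integer code per row that points to its label.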
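
As a minimal sketch of that storage model, the snippet below builds a small categorical Series and prints its labels and per-row codes. The sample values are made up for illustration and are not taken from the lesson's CSV file.

```python
import pandas as pd

# Hypothetical sample data, not taken from the lesson's CSV file
region = pd.Series(["North", "South", "North", "East", "North"], dtype="category")

# Each distinct label is stored once (inferred categories are sorted)
print(list(region.cat.categories))  # ['East', 'North', 'South']

# Every row holds a small integer code that points at its label
print(list(region.cat.codes))       # [1, 2, 1, 0, 1]
```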


Why Use Categorical Data?

Using categorical data provides several advantages:

  • Lower memory usage
  • Often faster grouping, sorting, and comparisons on large datasets
  • Clear definition of valid values
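
The last point can be illustrated with a short sketch: once a column is categorical, assigning a label outside its category list is rejected instead of silently accepted. The sample data is hypothetical, and the exact exception type has varied between Pandas versions, so the sketch catches both TypeError and ValueError.

```python
import pandas as pd

# Hypothetical status column with three valid labels
status = pd.Series(["Completed", "Pending", "Cancelled"], dtype="category")

try:
    # "Shipped" is not one of the defined categories
    status.iloc[0] = "Shipped"
    rejected = False
except (TypeError, ValueError):
    # The exception type depends on the Pandas version
    rejected = True

print(rejected)
print(status.tolist())  # the original values are left untouched
```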

Identifying Categorical Columns

Start by checking column data types.

import pandas as pd

sales = pd.read_csv("dataplexa_pandas_sales.csv")
sales.dtypes

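Columns with the object type (or a string type in newer Pandas versions) are candidates for categorical conversion, provided they contain only a few distinct values relative to the number of rows.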
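
One rough way to shortlist candidates is to compare the number of distinct values in each text column with the number of rows. The sketch below uses a made-up frame, and the 0.5 cutoff is an arbitrary assumption for illustration rather than a Pandas rule.

```python
import pandas as pd

# Hypothetical sample frame standing in for the lesson's CSV
df = pd.DataFrame({
    "order_id": ["A1", "A2", "A3", "A4", "A5", "A6", "A7", "A8"],
    "region": ["North", "South", "North", "East", "North", "South", "East", "North"],
    "amount": [120.0, 80.5, 42.0, 99.9, 15.0, 60.0, 33.3, 71.0],
})

# Share of distinct values per text column (1.0 means every row is unique)
ratios = {
    col: df[col].nunique() / len(df)
    for col in df.columns
    if pd.api.types.is_string_dtype(df[col])
}

# Assumed cutoff: keep columns where fewer than half of the values are distinct
candidates = [col for col, ratio in ratios.items() if ratio < 0.5]
print(candidates)  # ['region']
```

Here order_id is skipped because every value is unique, so converting it would not save anything.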


Converting a Column to Category

Convert frequently repeated text columns into categorical type.

sales["region"] = sales["region"].astype("category")

When the column has few distinct values, this reduces memory usage while the values still display as the original text.
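
When several columns qualify, astype also accepts a mapping from column name to dtype. The following sketch uses hypothetical column names and data to convert two columns in one call and to confirm that the visible values are unchanged.

```python
import pandas as pd

# Hypothetical sample data; the column names are assumptions, not the lesson's CSV schema
df = pd.DataFrame({
    "region": ["North", "South", "North", "East"],
    "status": ["Completed", "Pending", "Completed", "Cancelled"],
    "amount": [120.0, 80.5, 42.0, 99.9],
})
original_regions = df["region"].tolist()

# Convert several text columns in one call with a column-to-dtype mapping
df = df.astype({"region": "category", "status": "category"})

print(df.dtypes)
print(df["region"].tolist() == original_regions)  # True: values still read as text
```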


Checking Categories

You can inspect available categories.

sales["region"].cat.categories

Renaming Categories

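Categories can be renamed for clarity, for example when regions were recorded as abbreviations. Categories not listed in the mapping keep their current names.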

sales["region"] = sales["region"].cat.rename_categories({
    "N": "North",
    "S": "South"
})

This updates only the labels. The integer codes stored for each row stay the same, so the operation is cheap even on large columns.
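
A small self-contained sketch of the rename, using made-up abbreviated values, shows that the row codes are identical before and after:

```python
import pandas as pd

# Hypothetical column that stores regions as abbreviations
region = pd.Series(["N", "S", "N", "E"], dtype="category")
codes_before = region.cat.codes.tolist()

# Rename two labels; "E" is not in the mapping and keeps its name
region = region.cat.rename_categories({"N": "North", "S": "South"})

print(region.tolist())                            # ['North', 'South', 'North', 'E']
print(region.cat.codes.tolist() == codes_before)  # True
```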


Adding New Categories

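Sometimes new categories appear in incoming data. A categorical column rejects values that are not in its category list, so add the new category before assigning it.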

sales["region"] = sales["region"].cat.add_categories(["Central"])
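
As an illustration with hypothetical data, the sketch below registers "Central" and then assigns it to a row. New categories are appended to the end of the existing list.

```python
import pandas as pd

# Hypothetical column that initially knows only two regions
region = pd.Series(["North", "South", "North"], dtype="category")

# Register the new label first, then it can be assigned like any other value
region = region.cat.add_categories(["Central"])
region.iloc[2] = "Central"

print(region.tolist())              # ['North', 'South', 'Central']
print(list(region.cat.categories))  # ['North', 'South', 'Central']
```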

Removing Unused Categories

Filtering rows does not shrink the category list, so categories that no longer appear in any row may remain.

sales["region"] = sales["region"].cat.remove_unused_categories()

This keeps category lists clean and accurate.
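
The effect is easiest to see on a filtered frame. In this sketch, built on made-up data, the category list still contains all three regions after filtering to "North" and is trimmed only by remove_unused_categories.

```python
import pandas as pd

# Hypothetical frame with three regions
df = pd.DataFrame({"region": pd.Categorical(["North", "South", "East", "North"])})

# Keep only the North rows; copy so the column can be reassigned safely
north_only = df[df["region"] == "North"].copy()
before = list(north_only["region"].cat.categories)

# Drop categories that no longer occur in any row
north_only["region"] = north_only["region"].cat.remove_unused_categories()
after = list(north_only["region"].cat.categories)

print(before)  # ['East', 'North', 'South']
print(after)   # ['North']
```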


Sorting by Categorical Order

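You can define a custom category order and sort by it instead of alphabetically. Note that set_categories replaces any value missing from the new list with NaN, so include every category you want to keep.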

sales["region"] = sales["region"].cat.set_categories(
    ["North", "South", "East", "West"],
    ordered=True
)

sales.sort_values("region")
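
The following sketch, using hypothetical data, shows the custom order driving both the sort and an ordered comparison. An ordered categorical supports operators such as < against any label in its category list.

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({
    "region": ["West", "North", "East", "South", "North"],
    "amount": [10, 20, 30, 40, 50],
})

# Define the order North < South < East < West
df["region"] = df["region"].astype("category").cat.set_categories(
    ["North", "South", "East", "West"],
    ordered=True,
)

# Sorting follows the category order, not alphabetical order
sorted_regions = df.sort_values("region")["region"].tolist()
print(sorted_regions)  # ['North', 'North', 'South', 'East', 'West']

# Ordered categories also allow comparisons against a label
comes_before_east = (df["region"] < "East").tolist()
print(comes_before_east)  # [False, True, False, True, True]
```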

Memory Comparison

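Compare memory usage before and after conversion. Pass memory_usage="deep" so that Pandas measures the actual size of the stored strings rather than only the references to them.

sales.info(memory_usage="deep")

Run this once before and once after the conversion. The reported total should be noticeably lower once low-cardinality text columns are categorical.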
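
To measure a single column directly, Series.memory_usage(deep=True) returns its size in bytes. The sketch below uses synthetic data (four labels repeated 25,000 times), so the exact numbers are illustrative and will differ for real data.

```python
import pandas as pd

# Synthetic data: 100,000 rows drawn from only four labels
regions = ["North", "South", "East", "West"] * 25_000

as_text = pd.Series(regions, dtype="object")  # plain Python strings
as_cat = as_text.astype("category")           # integer codes plus four labels

text_bytes = as_text.memory_usage(deep=True)
cat_bytes = as_cat.memory_usage(deep=True)

print(f"object:   {text_bytes:,} bytes")
print(f"category: {cat_bytes:,} bytes")
print(cat_bytes < text_bytes)
```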


When to Use Categorical Data

  • Columns with repeated text values
  • Fixed number of known categories
  • Large datasets where memory matters

Practice Exercise

Try the following:

  • Convert a text column to category
  • Rename one category value
  • Remove unused categories

What’s Next?

In the next lesson, you will learn how to use window functions for advanced data analysis.