Working with Categorical Data in Pandas
In many real-world datasets, certain columns contain repeated values such as regions, categories, departments, or status labels. Storing these columns as a categorical type reduces memory usage and can speed up common operations.
In this lesson, you will learn how to work with categorical data efficiently using Pandas.
What is Categorical Data?
Categorical data represents values that belong to a limited set of categories. Examples include:
- Region (North, South, East, West)
- Product Category (Electronics, Clothing, Grocery)
- Status (Completed, Pending, Cancelled)
These values repeat frequently, so Pandas can store each distinct label once and keep only a small integer code per row instead of a full string.
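A minimal sketch of that storage layout, using a small made-up Series rather than the lesson's sales file. It shows the distinct labels and the per-row integer codes side by side.

```python
import pandas as pd

# Hypothetical sample values, not taken from the lesson's CSV file
region = pd.Series(["North", "South", "North", "East", "South", "North"])
region_cat = region.astype("category")

# Each distinct label is stored once (sorted alphabetically by default)
labels = list(region_cat.cat.categories)

# Each row stores only an integer code that indexes into the labels
codes = region_cat.cat.codes.tolist()

print(labels)
print(codes)
```

The codes are positions in the label list, so "East" is 0, "North" is 1, and "South" is 2 here.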
Why Use Categorical Data?
Using categorical data provides several advantages:
- Lower memory usage
- Often faster grouping, sorting, and comparisons on large datasets
- Clear definition of valid values
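The last point can be illustrated with a short sketch. The allowed labels and the raw input below are invented for illustration. When a column is converted with an explicit list of categories, any value outside that list becomes missing, which makes unexpected labels easy to spot.

```python
import pandas as pd

# Hypothetical allowed labels and raw input, chosen for illustration
status_dtype = pd.CategoricalDtype(categories=["Completed", "Pending", "Cancelled"])
raw_status = pd.Series(["Completed", "Pending", "Shipped", "Cancelled"])

# Values outside the declared categories become missing (NaN)
status = raw_status.astype(status_dtype)

# Look up the original text of the rows that did not match
unexpected = raw_status[status.isna()].tolist()
print(unexpected)
```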
Identifying Categorical Columns
Start by checking column data types.
import pandas as pd
sales = pd.read_csv("dataplexa_pandas_sales.csv")
sales.dtypes
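Text columns, which usually appear with the object type (or a string type, depending on your Pandas version), are candidates for categorical conversion when they hold few distinct values relative to the number of rows.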
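One possible way to shortlist candidates is to compare the number of distinct values in each text column with the number of rows. The sketch below uses a made-up DataFrame in place of the sales file, and the 0.5 cut-off is an arbitrary threshold picked for the example, not a Pandas rule.

```python
import pandas as pd

# Hypothetical stand-in for the sales data
sales = pd.DataFrame({
    "order_id": ["A1", "A2", "A3", "A4", "A5", "A6"],
    "region": ["North", "South", "North", "East", "South", "North"],
    "amount": [120.0, 80.5, 42.0, 310.0, 99.9, 15.0],
})

# Keep only the columns that hold text
text_cols = [c for c in sales.columns if pd.api.types.is_string_dtype(sales[c])]

# Share of distinct values per text column; lower means more repetition
distinct_ratio = sales[text_cols].nunique() / len(sales)

# The 0.5 cut-off is an arbitrary illustrative threshold
candidates = distinct_ratio[distinct_ratio <= 0.5].index.tolist()
print(candidates)
```

A column such as order_id, where every value is unique, gains nothing from conversion, while region repeats only three labels.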
Converting a Column to Category
Convert frequently repeated text columns into categorical type.
sales["region"] = sales["region"].astype("category")
This reduces memory usage when the column has few distinct values, and the values still display as readable text.
Checking Categories
You can inspect available categories.
sales["region"].cat.categories
Renaming Categories
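Categories can be renamed for clarity, for example when the data uses abbreviations such as N and S. Categories that are not listed in the mapping keep their current labels.
sales["region"] = sales["region"].cat.rename_categories({
    "N": "North",
    "S": "South"
})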
This updates the labels only; the integer codes stored for each row stay the same.
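A small sketch of the rename step on made-up abbreviated labels. It checks that the row codes are identical before and after the rename.

```python
import pandas as pd

# Hypothetical abbreviated labels, as the rename call above assumes
region = pd.Series(["N", "S", "N", "E"], dtype="category")
codes_before = region.cat.codes.tolist()

# Keys missing from the mapping ("E" here) keep their current label
renamed = region.cat.rename_categories({"N": "North", "S": "South"})
codes_after = renamed.cat.codes.tolist()

print(renamed.tolist())
print(codes_after == codes_before)
```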
Adding New Categories
Sometimes new categories appear in incoming data. A label must be registered as a category before it can be assigned to the column.
sales["region"] = sales["region"].cat.add_categories(["Central"])
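The sketch below, on a made-up Series, registers a new label and then assigns it to a row. It also shows that trying to add a label that already exists raises a ValueError.

```python
import pandas as pd

# Hypothetical sample column
region = pd.Series(["North", "South", "North"], dtype="category")

# Register the new label first; existing rows are unaffected
region = region.cat.add_categories(["Central"])

# Now the new label can be assigned to a row
region.iloc[0] = "Central"

# Adding a label that is already a category is rejected
try:
    region.cat.add_categories(["Central"])
    duplicate_rejected = False
except ValueError:
    duplicate_rejected = True

print(list(region.cat.categories))
print(region.tolist())
```

New labels are appended to the end of the category list rather than sorted into it.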
Removing Unused Categories
After filtering data, unused categories may remain.
sales["region"] = sales["region"].cat.remove_unused_categories()
This keeps category lists clean and accurate.
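To see why this step matters, here is a short sketch on a made-up DataFrame. Filtering keeps the full category list, and only the explicit cleanup call shrinks it.

```python
import pandas as pd

# Hypothetical stand-in for the sales data
sales = pd.DataFrame({
    "region": pd.Categorical(["North", "South", "East", "North"]),
    "amount": [100, 200, 150, 50],
})

# Filtering rows does not shrink the category list
north_only = sales[sales["region"] == "North"].copy()
before = list(north_only["region"].cat.categories)

# Drop the labels that no longer appear in any row
north_only["region"] = north_only["region"].cat.remove_unused_categories()
after = list(north_only["region"].cat.categories)

print(before)
print(after)
```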
Sorting by Categorical Order
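You can define a custom category order. Any value that is not in the list you pass becomes missing (NaN), so include every label that appears in the column.
sales["region"] = sales["region"].cat.set_categories(
    ["North", "South", "East", "West"],
    ordered=True
)
# Returns a new, sorted DataFrame; sales itself is unchanged
sales.sort_values("region")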
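A runnable sketch of the same idea on made-up data. Rows are sorted in the declared order instead of alphabetically, and an ordered column also supports comparisons against a label.

```python
import pandas as pd

# Hypothetical stand-in for the sales data
sales = pd.DataFrame({
    "region": ["West", "North", "East", "South", "North"],
    "amount": [10, 20, 30, 40, 50],
})

order = ["North", "South", "East", "West"]
sales["region"] = (
    sales["region"].astype("category").cat.set_categories(order, ordered=True)
)

# sort_values returns a new DataFrame; the original is left unchanged
sorted_regions = sales.sort_values("region")["region"].tolist()
print(sorted_regions)

# Ordered categories can be compared by their position in the order
before_east = (sales["region"] < "East").tolist()
print(before_east)
```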
Memory Comparison
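Compare memory usage before and after conversion. Pass memory_usage="deep" so that the contents of text columns are counted, not only the references to them.
sales.info(memory_usage="deep")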
You will notice reduced memory usage for categorical columns.
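The comparison can also be done per column. This sketch builds a made-up column of four labels repeated over 100,000 rows and measures it both ways. The exact byte counts depend on your Pandas version and platform, so none are quoted here.

```python
import pandas as pd

# Hypothetical column: four labels repeated across 100,000 rows
labels = ["North", "South", "East", "West"]
as_text = pd.Series(labels * 25_000)
as_category = as_text.astype("category")

# deep=True counts the string contents, not just references to them
text_bytes = as_text.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(text_bytes, category_bytes)
```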
When to Use Categorical Data
- Columns with repeated text values
- A fixed, small set of known categories
- Large datasets where memory matters
Practice Exercise
Try the following:
- Convert a text column to category
- Rename one category value
- Remove unused categories
What’s Next?
In the next lesson, you will learn how to use window functions for advanced data analysis.