GroupBy & Aggregations in Pandas
One of the most powerful features of Pandas is the ability to group data and perform calculations on those groups.
GroupBy allows you to summarize large datasets and answer real business questions such as:
- Total sales per product
- Average revenue per region
- Number of orders per customer
Loading the Dataset
We continue using the same dataset used throughout this Pandas course.
import pandas as pd
df = pd.read_csv("dataplexa_pandas_sales.csv")
df["order_date"] = pd.to_datetime(df["order_date"])
What Is GroupBy?
groupby() splits the data into groups based on one or more columns,
applies a function to each group, and then combines the results.
Think of it as:
- Split
- Apply
- Combine
Grouping by a Single Column
Let’s calculate the total sales amount for each product.
df.groupby("product")["sales_amount"].sum()
This groups all rows by product name and sums the sales.
Resetting the Index
GroupBy results return the grouping column as an index.
To convert it back into a regular column, use reset_index().
df.groupby("product")["sales_amount"].sum().reset_index()
Calculating Average Values
You can calculate averages using mean().
Example: average sales amount per region.
df.groupby("region")["sales_amount"].mean()
Counting Records in Each Group
To count how many records exist in each group, use count().
Example: number of orders per customer.
df.groupby("customer_id")["order_id"].count()
Grouping by Multiple Columns
You can group data using more than one column.
Example: total sales by region and product.
df.groupby(["region", "product"])["sales_amount"].sum()
Multiple Aggregations at Once
You can apply multiple aggregation functions using agg().
Example: total sales, average sales, and order count per product.
df.groupby("product").agg(
total_sales=("sales_amount", "sum"),
avg_sales=("sales_amount", "mean"),
order_count=("order_id", "count")
)
Grouping by Date
GroupBy is very powerful when combined with date columns.
Example: total sales per month.
df.groupby(df["order_date"].dt.month)["sales_amount"].sum()
Sorting Grouped Results
You can sort aggregated results to find top or lowest values.
df.groupby("product")["sales_amount"]
.sum()
.sort_values(ascending=False)
Why GroupBy Is Important
GroupBy is used in almost every real-world data analysis task.
It allows you to:
- Summarize large datasets
- Identify trends
- Prepare data for visualization
- Generate business reports
Practice Exercise
Using the dataset:
- Calculate total sales per region
- Find average sales per product
- Count orders per customer
- Group sales by month
What’s Next?
In the next lesson, you will learn how to create Pivot Tables in Pandas, which provide an Excel-style way to summarize grouped data.