Feature Engineering Course
Group-Based Features
Some of the most powerful features you'll ever create don't come from a single row — they come from comparing a row to its group. A customer's spend only becomes meaningful when you know how it compares to similar customers.
Group-based features — also called aggregation features — capture statistical context from a group and attach it back to each individual row. Instead of asking "how much did this customer spend?", you ask "how much did this customer spend compared to others in their region?" That relative signal is what models actually need.
Rows Without Context Are Almost Useless
Imagine you're handed a single row from a sales dataset. A customer spent $340. Is that high or low? You have no idea — until you know the average customer in that region spends $90. Now $340 is a massive outlier, and that fact is a valuable feature.
This is the core idea behind group-based features. You group rows by a categorical variable (city, department, product category, customer segment), compute a statistic (mean, median, std, count, min, max) for that group, and merge it back into the original DataFrame as a new column. The model can then see not just the raw value, but how that value relates to its peers.
Without Group Features
Each row contains only its own raw values. A model sees that a customer spent $340 with no way to know whether that's typical or extreme for their segment. Signal is buried in noise.
With Group Features
Each row also carries the group mean, group std, and a deviation score. The model instantly knows this customer spends 3.8× their segment average. That ratio is a real signal.
Four Group Aggregations That Actually Matter
Not all aggregations are equally useful. These four are the ones that show up in winning Kaggle notebooks and real production pipelines:
Group Mean
The average value for all rows in the group. Answers: what is typical for this category? Useful as a standalone feature, as the baseline you subtract to get deviation scores, and as the denominator for ratio features like spend divided by group mean.
Group Standard Deviation
How spread out is this group? A high std means the group is volatile; a low std means behavior is predictable. Useful on its own and as the denominator in z-score features.
Deviation Score
The difference between a row's value and its group mean. Negative means below average for the group; positive means above. This is one of the most model-friendly features you can create.
Z-Score Within Group
The deviation score divided by the group std. This standardises the deviation so groups with different scales become comparable. A z-score of 2.5 means the same thing whether you're in the "North" or "South" region.
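To see why the within-group z-score makes groups comparable, here's a minimal sketch with toy numbers (not the lesson's dataset): two regions whose raw scales differ by a factor of ten still produce identical z-scores for equally extreme rows.

```python
import pandas as pd

# Toy data: South spends are exactly 10x the North spends
df = pd.DataFrame({
    'region': ['North', 'North', 'North', 'South', 'South', 'South'],
    'spend':  [10, 20, 30, 100, 200, 300],
})

# Deviation from the group mean, divided by the group std
grp = df.groupby('region')['spend']
df['zscore'] = (df['spend'] - grp.transform('mean')) / grp.transform('std')

print(df)
# The middle row of each group scores 0.0 and the extremes score -1.0 and +1.0
# in BOTH regions, even though the raw dollar scales differ by 10x.
```

Once values are expressed in "standard deviations from my group's mean", the model can treat all groups on a single scale.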
Building Group Features with groupby and transform
The scenario:
You're a data scientist at a retail chain. The marketing team wants a churn model — but the dataset only has raw transaction amounts and store region. Your job is to add group context: for each transaction, compute how the amount compares to the regional average. A tree model will be able to split on deviation far more meaningfully than on raw spend alone.
# Import pandas and numpy
import pandas as pd
import numpy as np
# Create a realistic retail transactions DataFrame
churn_df = pd.DataFrame({
'customer_id': [101, 102, 103, 104, 105, 106, 107, 108, 109, 110], # unique customer IDs
'region': ['North','North','South','South','East','East','North','South','East','North'], # store region
'spend': [340, 90, 210, 80, 500, 120, 75, 310, 490, 95] # transaction amount in dollars
})
# Compute group mean spend per region using groupby + transform
# transform returns a Series aligned with the original index — no merge needed
churn_df['region_mean_spend'] = churn_df.groupby('region')['spend'].transform('mean')
# Compute group standard deviation per region
churn_df['region_std_spend'] = churn_df.groupby('region')['spend'].transform('std')
# Compute deviation: how far is this customer from their regional average?
churn_df['spend_deviation'] = churn_df['spend'] - churn_df['region_mean_spend']
# Compute z-score within group: deviation divided by group std
# Add a small epsilon (1e-9) to guard against division by zero when a group's std is 0 (all values identical)
# Note: a single-row group gives std = NaN (pandas defaults to ddof=1), so its z-score is NaN, not infinite
churn_df['spend_zscore'] = churn_df['spend_deviation'] / (churn_df['region_std_spend'] + 1e-9)
# Round for readability
churn_df = churn_df.round(2)
# Display results
print(churn_df.to_string(index=False))
customer_id region spend region_mean_spend region_std_spend spend_deviation spend_zscore
101 North 340 150.00 126.95 190.00 1.50
102 North 90 150.00 126.95 -60.00 -0.47
103 South 210 200.00 115.33 10.00 0.09
104 South 80 200.00 115.33 -120.00 -1.04
105 East 500 370.00 216.56 130.00 0.60
106 East 120 370.00 216.56 -250.00 -1.15
107 North 75 150.00 126.95 -75.00 -0.59
108 South 310 200.00 115.33 110.00 0.95
109 East 490 370.00 216.56 120.00 0.55
110 North 95 150.00 126.95 -55.00 -0.43
What just happened?
groupby().transform('mean') computed each region's average spend and returned a value for every row — no index reshuffling needed. Customer 101 spent $340 but the North region mean is only $150, giving a deviation of +$190 and a z-score of 1.50. Customer 106 spent $120 but their East group mean is $370, making them a strong negative outlier at −1.15. The raw spend alone would never reveal that story.
Multiple Aggregations in One Shot
You don't have to call transform four separate times. groupby().agg() lets you compute many aggregations at once, and then you merge them back in a single step. This is cleaner for production code and much faster on large DataFrames.
The scenario:
You're building a loan default model. Each row is one loan application. You need to enrich every application with group statistics per employment type — what's the average loan amount for self-employed applicants? How variable is it? How does this applicant compare to their peers? Gathering all of this in one clean pipeline is the professional approach.
# Import libraries
import pandas as pd
import numpy as np
# Create a loan applications DataFrame — 10 rows, realistic values
loan_df = pd.DataFrame({
'loan_id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], # unique loan IDs
'emp_type': ['Salaried','Self-Emp','Salaried','Self-Emp','Salaried',
'Salaried','Self-Emp','Salaried','Self-Emp','Salaried'], # employment type
'loan_amount': [25000, 80000, 30000, 55000, 28000, 22000, 95000, 35000, 70000, 40000] # loan in dollars
})
# Step 1: compute multiple group stats at once using agg()
group_stats = loan_df.groupby('emp_type')['loan_amount'].agg(
grp_mean='mean', # average loan amount for this employment type
grp_median='median', # median — more robust to outliers than mean
grp_std='std', # standard deviation — how spread is this group?
grp_count='count' # how many applicants in this group? (group size is itself a feature)
).reset_index() # bring emp_type back as a column for merging
# Step 2: merge group stats back into the original DataFrame on emp_type
loan_df = loan_df.merge(group_stats, on='emp_type', how='left')
# Step 3: compute deviation and z-score now that group stats are in the DataFrame
loan_df['loan_deviation'] = loan_df['loan_amount'] - loan_df['grp_mean'] # raw distance from group mean
loan_df['loan_zscore'] = loan_df['loan_deviation'] / (loan_df['grp_std'] + 1e-9) # standardised deviation
# Round everything to 1 decimal place for clean display
loan_df = loan_df.round(1)
# Print results
print(loan_df[['loan_id','emp_type','loan_amount','grp_mean','grp_std','loan_deviation','loan_zscore']].to_string(index=False))
loan_id emp_type loan_amount grp_mean grp_std loan_deviation loan_zscore
1 Salaried 25000 30000.0 6603.0 -5000.0 -0.8
2 Self-Emp 80000 75000.0 16832.5 5000.0 0.3
3 Salaried 30000 30000.0 6603.0 0.0 0.0
4 Self-Emp 55000 75000.0 16832.5 -20000.0 -1.2
5 Salaried 28000 30000.0 6603.0 -2000.0 -0.3
6 Salaried 22000 30000.0 6603.0 -8000.0 -1.2
7 Self-Emp 95000 75000.0 16832.5 20000.0 1.2
8 Salaried 35000 30000.0 6603.0 5000.0 0.8
9 Self-Emp 70000 75000.0 16832.5 -5000.0 -0.3
10 Salaried 40000 30000.0 6603.0 10000.0 1.5
What just happened?
The agg() call computed four statistics for each employment type in one pass, and the merge() joined them back on emp_type. Now every row carries its group context. Loan 10 looks large at $40,000 until you see it's only 1.5 standard deviations above the Salaried average — not an extreme outlier. But Loan 6 at $22,000 is 1.2 std below the Salaried mean, which could be a meaningful signal for default risk.
Group Count and Frequency Encoding
Not all group features need to be about a numeric column. Sometimes the most useful thing you can add is simply: how often does this category appear? This is called frequency encoding and it converts a categorical column into a numeric one without introducing arbitrary orderings.
The scenario:
You're a machine learning engineer at a property platform. The dataset has a neighbourhood column with 200 unique values. Label encoding would impose a fake ordering; one-hot encoding would add 200 columns and destroy your memory budget. Frequency encoding replaces each neighbourhood with the proportion of listings that belong to it — a meaningful, compact, model-ready number.
# Import pandas
import pandas as pd
# Create a housing listings DataFrame — 10 rows
housing_df = pd.DataFrame({
'listing_id': [1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009, 1010], # unique listing IDs
'neighbourhood': ['Midtown','Midtown','Uptown','Midtown','Riverside',
'Uptown','Midtown','Riverside','Midtown','Uptown'], # categorical: 3 unique values here
'price': [450000, 460000, 520000, 440000, 380000, 530000, 470000, 390000, 455000, 510000] # listing price
})
# Step 1: compute the total number of rows — needed for frequency calculation
total_rows = len(housing_df) # 10
# Step 2: compute raw count per neighbourhood using groupby + transform
housing_df['neighbourhood_count'] = housing_df.groupby('neighbourhood')['listing_id'].transform('count')
# Step 3: compute frequency (proportion) — count divided by total
housing_df['neighbourhood_freq'] = housing_df['neighbourhood_count'] / total_rows
# Step 4: compute the group mean price per neighbourhood — a bonus group feature
housing_df['neighbourhood_mean_price'] = housing_df.groupby('neighbourhood')['price'].transform('mean')
# Round for clarity
housing_df = housing_df.round(3)
# Print the result
print(housing_df.to_string(index=False))
listing_id neighbourhood price neighbourhood_count neighbourhood_freq neighbourhood_mean_price
1001 Midtown 450000 5 0.5 455000
1002 Midtown 460000 5 0.5 455000
1003 Uptown 520000 3 0.3 520000
1004 Midtown 440000 5 0.5 455000
1005 Riverside 380000 2 0.2 385000
1006 Uptown 530000 3 0.3 520000
1007 Midtown 470000 5 0.5 455000
1008 Riverside 390000 2 0.2 385000
1009 Midtown 455000 5 0.5 455000
1010 Uptown 510000 3 0.3 520000
What just happened?
Midtown is the dominant neighbourhood — it appears in 5 of 10 listings (frequency 0.5). Riverside is rare at 0.2. The model can now use neighbourhood_freq as a numeric feature without any ordinal assumptions. The neighbourhood_mean_price column also gives every listing access to the group average price, so the model can learn that Uptown listings tend to be more expensive independently of the individual listing's price.
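Incidentally, the same frequency feature can be built without groupby at all: value_counts(normalize=True) plus map is a common shorthand. A minimal sketch on the same three neighbourhoods:

```python
import pandas as pd

df = pd.DataFrame({
    'neighbourhood': ['Midtown', 'Midtown', 'Uptown', 'Midtown', 'Riverside',
                      'Uptown', 'Midtown', 'Riverside', 'Midtown', 'Uptown'],
})

# value_counts(normalize=True) returns each category's proportion of all rows
freq = df['neighbourhood'].value_counts(normalize=True)

# map() looks up every row's category in that proportions table
df['neighbourhood_freq'] = df['neighbourhood'].map(freq)

print(freq.to_dict())  # {'Midtown': 0.5, 'Uptown': 0.3, 'Riverside': 0.2}
```

Both routes give the same column; the groupby + transform version is handy when you're already computing other group stats in the same pipeline.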
The Target Leakage Trap — Group Features on Train vs Test
Group-based features have one nasty gotcha that kills models in production: if you compute group statistics on the full dataset before splitting into train and test, information from the test set leaks into your training features. Your model looks great in evaluation and collapses in deployment.
The Wrong Way
Compute group means on the full DataFrame, then split. Every row in your test set now carries statistics that were influenced by its own value. You've trained the model on information it would never have at prediction time.
The Right Way
Split first, then compute group stats on the training set only. Merge those train-derived stats into the test set. New categories that appear only in test get a fallback value (global mean is a safe choice).
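A minimal sketch of that split-first pattern, with placeholder column names of our own choosing:

```python
import pandas as pd

train_df = pd.DataFrame({'region': ['North', 'North', 'South'], 'spend': [100, 200, 50]})
test_df = pd.DataFrame({'region': ['North', 'West'], 'spend': [150, 90]})  # 'West' never appears in train

# Fit: compute group stats on the TRAINING rows only
region_means = (train_df.groupby('region')['spend'].mean()
                .rename('region_mean_spend').reset_index())
global_mean = train_df['spend'].mean()  # fallback for categories unseen in training

# Transform: merge the train-derived stats into both sets
train_df = train_df.merge(region_means, on='region', how='left')
test_df = test_df.merge(region_means, on='region', how='left')

# 'West' received NaN from the left merge — backfill with the global train mean
test_df['region_mean_spend'] = test_df['region_mean_spend'].fillna(global_mean)
```

The test rows never influence the statistics they carry, which mirrors exactly what the model would see at prediction time.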
Inside a Pipeline
When using sklearn pipelines or cross-validation, compute group stats inside a custom transformer that fits on training folds and transforms all folds. This is the production-safe pattern and it prevents leakage automatically.
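One way to package that pattern is a custom transformer. This is a sketch, not a sklearn built-in: the class name, its arguments, and the output column name are all our own choices.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class GroupMeanEncoder(BaseEstimator, TransformerMixin):
    """Adds a '<value_col>_group_mean' column learned from the training data only."""

    def __init__(self, group_col, value_col):
        self.group_col = group_col
        self.value_col = value_col

    def fit(self, X, y=None):
        # Runs only on the training fold, so test folds never shape these stats
        self.group_means_ = X.groupby(self.group_col)[self.value_col].mean()
        self.global_mean_ = X[self.value_col].mean()  # fallback for unseen categories
        return self

    def transform(self, X):
        X = X.copy()
        # map() yields NaN for categories never seen during fit; backfill with the global mean
        X[f'{self.value_col}_group_mean'] = (
            X[self.group_col].map(self.group_means_).fillna(self.global_mean_)
        )
        return X
```

Dropped into a Pipeline under cross-validation, fit sees only the training folds of each split, so the leakage protection comes for free.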
A Visual Look at What transform Returns
The magic of transform is that it broadcasts group-level statistics back to every individual row in the group, preserving the original index perfectly. Here's a visual of what happens under the hood:
| customer_id | region | spend | region_mean ← transform | deviation |
|---|---|---|---|---|
| 101 | North | 340 | 150 ← North avg | +190 |
| 102 | North | 90 | 150 ← North avg | −60 |
| 103 | South | 210 | 200 ← South avg | +10 |
| 105 | East | 500 | 370 ← East avg | +130 |
| 106 | East | 120 | 370 ← East avg | −250 |
Every row in the same group gets the same group mean, but the deviation is unique to each row. That uniqueness is the signal.
Teacher's Note
The difference between transform and agg trips up almost every beginner. Use transform when you want the result aligned to the original DataFrame — no merge needed. Use agg when you want a compact summary table, then merge it back yourself. Both produce the same final features; the choice is about readability and performance. For large DataFrames with many groups, agg + merge is often faster because pandas can optimise the aggregation pass independently.
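The shape difference is the quickest way to remember which is which. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'region': ['North', 'North', 'South'], 'spend': [100, 200, 50]})

agg_result = df.groupby('region')['spend'].agg('mean')          # one value PER GROUP
trans_result = df.groupby('region')['spend'].transform('mean')  # one value PER ROW

print(agg_result.shape)    # (2,) — compact summary table, needs a merge to rejoin
print(trans_result.shape)  # (3,) — aligned to df's index, assignable as a column
```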
Practice Questions
1. Which pandas method broadcasts a group statistic back to every row in that group without requiring a merge?
2. A feature computed as the difference between a row's value and its group mean is called a ________ score.
3. Replacing a categorical column with the proportion of rows that belong to each category is called __________.
Quiz
1. You are adding group-based features to a dataset. What is the correct approach to avoid target leakage?
2. A within-group z-score is more useful than a raw deviation score because:
3. What is the key practical difference between groupby().agg() and groupby().transform()?
Up Next · Lesson 33
Rolling Window Features
Capture trends, momentum, and volatility over sliding time windows — a must-have technique for any time-aware dataset.