Feature Engineering Course
Group-Based Features
Some of the most powerful features you'll ever create don't come from a single row — they come from comparing a row to its group. A customer's spend only becomes meaningful when you know how it compares to similar customers.
Group-based features — also called aggregation features — capture statistical context from a group and attach it back to each individual row. Instead of asking "how much did this customer spend?", you ask "how much did this customer spend compared to others in their region?" That relative signal is what models actually need.
Rows Without Context Are Almost Useless
Imagine you're handed a single row from a sales dataset. A customer spent $340. Is that high or low? You have no idea — until you know the average customer in that region spends $90. Now $340 is a massive outlier, and that fact is a valuable feature.
This is the core idea behind group-based features. You group rows by a categorical variable (city, department, product category, customer segment), compute a statistic (mean, median, std, count, min, max) for that group, and merge it back into the original DataFrame as a new column. The model can then see not just the raw value, but how that value relates to its peers.
Without Group Features
Each row contains only its own raw values. A model sees that a customer spent $340 with no way to know whether that's typical or extreme for their segment. Signal is buried in noise.
With Group Features
Each row also carries the group mean, group std, and a deviation score. The model instantly knows this customer spends 3.8× their segment average. That ratio is a real signal.
Four Group Aggregations That Actually Matter
Not all aggregations are equally useful. These four are the ones that show up in winning Kaggle notebooks and real production pipelines:
Group Mean
The average value for all rows in the group. Answers: what is typical for this category? Useful as a standalone feature, as the baseline you subtract to get deviation scores, and as the denominator for ratio features like spend divided by group mean.
Group Standard Deviation
How spread out is this group? A high std means the group is volatile; a low std means behavior is predictable. Useful on its own and as the denominator in z-score features.
Deviation Score
The difference between a row's value and its group mean. Negative means below average for the group; positive means above. This is one of the most model-friendly features you can create.
Z-Score Within Group
The deviation score divided by the group std. This standardises the deviation so groups with different scales become comparable. A z-score of 2.5 means the same thing whether you're in the "North" or "South" region.
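To see why the within-group z-score makes groups comparable, here's a minimal sketch with toy numbers (not the lesson's dataset): two regions whose raw scales differ by a factor of ten still produce identical z-scores for equally extreme rows.

```python
import pandas as pd

# Toy data: South spends are exactly 10x the North spends
df = pd.DataFrame({
    'region': ['North', 'North', 'North', 'South', 'South', 'South'],
    'spend':  [10, 20, 30, 100, 200, 300],
})

# Deviation from the group mean, divided by the group std
grp = df.groupby('region')['spend']
df['zscore'] = (df['spend'] - grp.transform('mean')) / grp.transform('std')

print(df)
# The middle row of each group scores 0.0 and the extremes score -1.0 and +1.0
# in BOTH regions, even though the raw dollar scales differ by 10x.
```

Once values are expressed in "standard deviations from my group's mean", the model can treat all groups on a single scale.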
Building Group Features with groupby and transform
The scenario:
You're a data scientist at a retail chain. The marketing team wants a churn model — but the dataset only has raw transaction amounts and store region. Your job is to add group context: for each transaction, compute how the amount compares to the regional average. A tree model will be able to split on deviation far more meaningfully than on raw spend alone.
# Import pandas and numpy
import pandas as pd
import numpy as np
# Create a realistic retail transactions DataFrame
churn_df = pd.DataFrame({
'customer_id': [101, 102, 103, 104, 105, 106, 107, 108, 109, 110], # unique customer IDs
'region': ['North','North','South','South','East','East','North','South','East','North'], # store region
'spend': [340, 90, 210, 80, 500, 120, 75, 310, 490, 95] # transaction amount in dollars
})
# Compute group mean spend per region using groupby + transform
# transform returns a Series aligned with the original index — no merge needed
churn_df['region_mean_spend'] = churn_df.groupby('region')['spend'].transform('mean')
# Compute group standard deviation per region
churn_df['region_std_spend'] = churn_df.groupby('region')['spend'].transform('std')
# Compute deviation: how far is this customer from their regional average?
churn_df['spend_deviation'] = churn_df['spend'] - churn_df['region_mean_spend']
# Compute z-score within group: deviation divided by group std
# Add a small epsilon (1e-9) to guard against division by zero when a group's std is 0 (all values identical)
# Note: a single-row group gives std = NaN (pandas defaults to ddof=1), so its z-score is NaN, not infinite
churn_df['spend_zscore'] = churn_df['spend_deviation'] / (churn_df['region_std_spend'] + 1e-9)
# Round for readability
churn_df = churn_df.round(2)
# Display results
print(churn_df.to_string(index=False))
customer_id region spend region_mean_spend region_std_spend spend_deviation spend_zscore
101 North 340 150.00 126.95 190.00 1.50
102 North 90 150.00 126.95 -60.00 -0.47
103 South 210 200.00 115.33 10.00 0.09
104 South 80 200.00 115.33 -120.00 -1.04
105 East 500 370.00 216.56 130.00 0.60
106 East 120 370.00 216.56 -250.00 -1.15
107 North 75 150.00 126.95 -75.00 -0.59
108 South 310 200.00 115.33 110.00 0.95
109 East 490 370.00 216.56 120.00 0.55
110 North 95 150.00 126.95 -55.00 -0.43
What just happened?
groupby().transform('mean') computed each region's average spend and returned a value for every row — no index reshuffling needed. Customer 101 spent $340 but the North region mean is only $150, giving a deviation of +$190 and a z-score of 1.50. Customer 106 spent $120 but their East group mean is $370, making them a strong negative outlier at −1.15. The raw spend alone would never reveal that story.
Multiple Aggregations in One Shot
You don't have to call transform four separate times. groupby().agg() lets you compute many aggregations at once, and then you merge them back in a single step. This is cleaner for production code and much faster on large DataFrames.
The scenario:
You're building a loan default model. Each row is one loan application. You need to enrich every application with group statistics per employment type — what's the average loan amount for self-employed applicants? How variable is it? How does this applicant compare to their peers? Gathering all of this in one clean pipeline is the professional approach.
# Import libraries
import pandas as pd
import numpy as np
# Create a loan applications DataFrame — 10 rows, realistic values
loan_df = pd.DataFrame({
'loan_id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], # unique loan IDs
'emp_type': ['Salaried','Self-Emp','Salaried','Self-Emp','Salaried',
'Salaried','Self-Emp','Salaried','Self-Emp','Salaried'], # employment type
'loan_amount': [25000, 80000, 30000, 55000, 28000, 22000, 95000, 35000, 70000, 40000] # loan in dollars
})
# Step 1: compute multiple group stats at once using agg()
group_stats = loan_df.groupby('emp_type')['loan_amount'].agg(
grp_mean='mean', # average loan amount for this employment type
grp_median='median', # median — more robust to outliers than mean
grp_std='std', # standard deviation — how spread is this group?
grp_count='count' # how many applicants in this group? (group size is itself a feature)
).reset_index() # bring emp_type back as a column for merging
# Step 2: merge group stats back into the original DataFrame on emp_type
loan_df = loan_df.merge(group_stats, on='emp_type', how='left')
# Step 3: compute deviation and z-score now that group stats are in the DataFrame
loan_df['loan_deviation'] = loan_df['loan_amount'] - loan_df['grp_mean'] # raw distance from group mean
loan_df['loan_zscore'] = loan_df['loan_deviation'] / (loan_df['grp_std'] + 1e-9) # standardised deviation
# Round everything to 1 decimal place for clean display
loan_df = loan_df.round(1)
# Print results
print(loan_df[['loan_id','emp_type','loan_amount','grp_mean','grp_std','loan_deviation','loan_zscore']].to_string(index=False))
loan_id emp_type loan_amount grp_mean grp_std loan_deviation loan_zscore
1 Salaried 25000 30000.0 6603.0 -5000.0 -0.8
2 Self-Emp 80000 75000.0 16832.5 5000.0 0.3
3 Salaried 30000 30000.0 6603.0 0.0 0.0
4 Self-Emp 55000 75000.0 16832.5 -20000.0 -1.2
5 Salaried 28000 30000.0 6603.0 -2000.0 -0.3
6 Salaried 22000 30000.0 6603.0 -8000.0 -1.2
7 Self-Emp 95000 75000.0 16832.5 20000.0 1.2
8 Salaried 35000 30000.0 6603.0 5000.0 0.8
9 Self-Emp 70000 75000.0 16832.5 -5000.0 -0.3
10 Salaried 40000 30000.0 6603.0 10000.0 1.5
What just happened?
The agg() call computed four statistics for each employment type in one pass, and the merge() joined them back on emp_type. Now every row carries its group context. Loan 10 looks large at $40,000 until you see it's only 1.5 standard deviations above the Salaried average — not an extreme outlier. But Loan 6 at $22,000 is 1.2 std below the Salaried mean, which could be a meaningful signal for default risk.
Group Count and Frequency Encoding
Not all group features need to be about a numeric column. Sometimes the most useful thing you can add is simply: how often does this category appear? This is called frequency encoding and it converts a categorical column into a numeric one without introducing arbitrary orderings.
The scenario:
You're a machine learning engineer at a property platform. The dataset has a neighbourhood column with 200 unique values. Label encoding would impose a fake ordering; one-hot encoding would add 200 columns and destroy your memory budget. Frequency encoding replaces each neighbourhood with the proportion of listings that belong to it — a meaningful, compact, model-ready number.
# Import pandas
import pandas as pd
# Create a housing listings DataFrame — 10 rows
housing_df = pd.DataFrame({
'listing_id': [1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009, 1010], # unique listing IDs
'neighbourhood': ['Midtown','Midtown','Uptown','Midtown','Riverside',
'Uptown','Midtown','Riverside','Midtown','Uptown'], # categorical: 3 unique values here
'price': [450000, 460000, 520000, 440000, 380000, 530000, 470000, 390000, 455000, 510000] # listing price
})
# Step 1: compute the total number of rows — needed for frequency calculation
total_rows = len(housing_df) # 10
# Step 2: compute raw count per neighbourhood using groupby + transform
housing_df['neighbourhood_count'] = housing_df.groupby('neighbourhood')['listing_id'].transform('count')
# Step 3: compute frequency (proportion) — count divided by total
housing_df['neighbourhood_freq'] = housing_df['neighbourhood_count'] / total_rows
# Step 4: compute the group mean price per neighbourhood — a bonus group feature
housing_df['neighbourhood_mean_price'] = housing_df.groupby('neighbourhood')['price'].transform('mean')
# Round for clarity
housing_df = housing_df.round(3)
# Print the result
print(housing_df.to_string(index=False))
listing_id neighbourhood price neighbourhood_count neighbourhood_freq neighbourhood_mean_price
1001 Midtown 450000 5 0.5 455000
1002 Midtown 460000 5 0.5 455000
1003 Uptown 520000 3 0.3 520000
1004 Midtown 440000 5 0.5 455000
1005 Riverside 380000 2 0.2 385000
1006 Uptown 530000 3 0.3 520000
1007 Midtown 470000 5 0.5 455000
1008 Riverside 390000 2 0.2 385000
1009 Midtown 455000 5 0.5 455000
1010 Uptown 510000 3 0.3 520000
What just happened?
Midtown is the dominant neighbourhood — it appears in 5 of 10 listings (frequency 0.5). Riverside is rare at 0.2. The model can now use neighbourhood_freq as a numeric feature without any ordinal assumptions. The neighbourhood_mean_price column also gives every listing access to the group average price, so the model can learn that Uptown listings tend to be more expensive independently of the individual listing's price.
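Incidentally, the same frequency feature can be built without groupby at all: value_counts(normalize=True) plus map is a common shorthand. A minimal sketch on the same three neighbourhoods:

```python
import pandas as pd

df = pd.DataFrame({
    'neighbourhood': ['Midtown', 'Midtown', 'Uptown', 'Midtown', 'Riverside',
                      'Uptown', 'Midtown', 'Riverside', 'Midtown', 'Uptown'],
})

# value_counts(normalize=True) returns each category's proportion of all rows
freq = df['neighbourhood'].value_counts(normalize=True)

# map() looks up every row's category in that proportions table
df['neighbourhood_freq'] = df['neighbourhood'].map(freq)

print(freq.to_dict())  # {'Midtown': 0.5, 'Uptown': 0.3, 'Riverside': 0.2}
```

Both routes give the same column; the groupby + transform version is handy when you're already computing other group stats in the same pipeline.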
The Target Leakage Trap — Group Features on Train vs Test
Group-based features have one nasty gotcha that kills models in production: if you compute group statistics on the full dataset before splitting into train and test, information from the test set leaks into your training features. Your model looks great in evaluation and collapses in deployment.
The Wrong Way
Compute group means on the full DataFrame, then split. Every row in your test set now carries statistics that were influenced by its own value. You've trained the model on information it would never have at prediction time.
The Right Way
Split first, then compute group stats on the training set only. Merge those train-derived stats into the test set. New categories that appear only in test get a fallback value (global mean is a safe choice).
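A minimal sketch of that split-first pattern, with placeholder column names of our own choosing:

```python
import pandas as pd

train_df = pd.DataFrame({'region': ['North', 'North', 'South'], 'spend': [100, 200, 50]})
test_df = pd.DataFrame({'region': ['North', 'West'], 'spend': [150, 90]})  # 'West' never appears in train

# Fit: compute group stats on the TRAINING rows only
region_means = (train_df.groupby('region')['spend'].mean()
                .rename('region_mean_spend').reset_index())
global_mean = train_df['spend'].mean()  # fallback for categories unseen in training

# Transform: merge the train-derived stats into both sets
train_df = train_df.merge(region_means, on='region', how='left')
test_df = test_df.merge(region_means, on='region', how='left')

# 'West' received NaN from the left merge — backfill with the global train mean
test_df['region_mean_spend'] = test_df['region_mean_spend'].fillna(global_mean)
```

The test rows never influence the statistics they carry, which mirrors exactly what the model would see at prediction time.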
Inside a Pipeline
When using sklearn pipelines or cross-validation, compute group stats inside a custom transformer that fits on training folds and transforms all folds. This is the production-safe pattern and it prevents leakage automatically.
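One way to package that pattern is a custom transformer. This is a sketch, not a sklearn built-in: the class name, its arguments, and the output column name are all our own choices.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class GroupMeanEncoder(BaseEstimator, TransformerMixin):
    """Adds a '<value_col>_group_mean' column learned from the training data only."""

    def __init__(self, group_col, value_col):
        self.group_col = group_col
        self.value_col = value_col

    def fit(self, X, y=None):
        # Runs only on the training fold, so test folds never shape these stats
        self.group_means_ = X.groupby(self.group_col)[self.value_col].mean()
        self.global_mean_ = X[self.value_col].mean()  # fallback for unseen categories
        return self

    def transform(self, X):
        X = X.copy()
        # map() yields NaN for categories never seen during fit; backfill with the global mean
        X[f'{self.value_col}_group_mean'] = (
            X[self.group_col].map(self.group_means_).fillna(self.global_mean_)
        )
        return X
```

Dropped into a Pipeline under cross-validation, fit sees only the training folds of each split, so the leakage protection comes for free.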
A Visual Look at What transform Returns
The magic of transform is that it broadcasts group-level statistics back to every individual row in the group, preserving the original index perfectly. Here's a visual of what happens under the hood:
| customer_id | region | spend | region_mean ← transform | deviation |
|---|---|---|---|---|
| 101 | North | 340 | 150 ← North avg | +190 |
| 102 | North | 90 | 150 ← North avg | −60 |
| 103 | South | 210 | 200 ← South avg | +10 |
| 105 | East | 500 | 370 ← East avg | +130 |
| 106 | East | 120 | 370 ← East avg | −250 |
Every row in the same group gets the same group mean, but the deviation is unique to each row. That uniqueness is the signal.
Teacher's Note
The difference between transform and agg trips up almost every beginner. Use transform when you want the result aligned to the original DataFrame — no merge needed. Use agg when you want a compact summary table, then merge it back yourself. Both produce the same final features; the choice is about readability and performance. For large DataFrames with many groups, agg + merge is often faster because pandas can optimise the aggregation pass independently.
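The shape difference is the quickest way to remember which is which. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'region': ['North', 'North', 'South'], 'spend': [100, 200, 50]})

agg_result = df.groupby('region')['spend'].agg('mean')          # one value PER GROUP
trans_result = df.groupby('region')['spend'].transform('mean')  # one value PER ROW

print(agg_result.shape)    # (2,) — compact summary table, needs a merge to rejoin
print(trans_result.shape)  # (3,) — aligned to df's index, assignable as a column
```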
Practice Questions
1. Which pandas method broadcasts a group statistic back to every row in that group without requiring a merge?
2. A feature computed as the difference between a row's value and its group mean is called a ________ score.
3. Replacing a categorical column with the proportion of rows that belong to each category is called __________.
Quiz
1. You are adding group-based features to a dataset. What is the correct approach to avoid target leakage?
2. A within-group z-score is more useful than a raw deviation score because:
3. What is the key practical difference between groupby().agg() and groupby().transform()?
Up Next · Lesson 33
Rolling Window Features
Capture trends, momentum, and volatility over sliding time windows — a must-have technique for any time-aware dataset.