EDA Course
Feature Engineering via EDA
EDA isn't just a check before modelling — it's a feature factory. Every pattern you find during exploration is a potential feature you can encode. This lesson is about turning EDA observations directly into model inputs: interaction terms born from scatter plots, binned features born from distribution gaps, and target encodings born from group comparisons.
The EDA → Feature Pipeline
Most feature engineering tutorials teach you what transforms exist. This lesson teaches you when to use them — by connecting each technique to the specific EDA finding that motivates it. The workflow is always the same:
1. Find a pattern in EDA
2. Decide which feature encodes it
3. Build it in pandas
4. Validate against the target
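The four steps can be sketched as a tiny reusable check. feature_vs_target is a hypothetical helper name (not from the lesson); it wraps step 4, scoring a candidate feature against the target:

```python
import pandas as pd
from scipy import stats

def feature_vs_target(feature: pd.Series, target: pd.Series) -> float:
    """Step 4 in miniature: Pearson r between a candidate feature and the target."""
    r, _ = stats.pearsonr(feature, target)
    return r

# Toy data (illustrative): a feature that tracks the target almost linearly
toy = pd.DataFrame({'x': [1, 2, 3, 4, 5], 'y': [10, 21, 29, 41, 50]})
print(f"r = {feature_vs_target(toy['x'], toy['y']):+.3f}")
```

Running every candidate feature through the same scorer keeps the validate step honest: a feature that cannot beat or complement the raw column it came from usually is not worth keeping.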
The Dataset We'll Use
The scenario: You're a senior data scientist at a property tech company. Your team is building a house price prediction model. The junior analyst has already done a basic EDA and handed you three observations: "Prices seem to jump sharply above a certain square footage," "the combination of location and property type seems to matter more than either alone," and "bigger gardens increase price — but only for detached houses, not flats." Your job is to take those three EDA observations and turn each one into a concrete feature the model can use.
import pandas as pd
import numpy as np
# Property dataset — 16 houses across two locations and two property types
df = pd.DataFrame({
    'property_id': range(1, 17),
    'location': ['City','City','Suburb','Suburb','City','City','Suburb','Suburb',
                 'City','City','Suburb','Suburb','City','City','Suburb','Suburb'],
    'prop_type': ['Flat','Detached','Flat','Detached','Flat','Detached','Flat','Detached',
                  'Flat','Detached','Flat','Detached','Flat','Detached','Flat','Detached'],
    'sqft': [480, 1850, 420, 1620, 510, 2100, 390, 1780, 495, 1950,
             410, 1700, 525, 2200, 400, 1680],
    'garden_sqm': [0, 85, 0, 72, 0, 110, 0, 68, 0, 92,
                   0, 78, 0, 120, 0, 75],
    'bedrooms': [1, 4, 1, 3, 2, 5, 1, 3, 2, 4,
                 1, 3, 2, 5, 1, 3],
    'price_000': [185, 420, 145, 335, 210, 510, 130, 360, 195, 445,
                  155, 350, 220, 540, 140, 345]  # price in £000s — target
})
print(df.to_string(index=False))
property_id location prop_type sqft garden_sqm bedrooms price_000
1 City Flat 480 0 1 185
2 City Detached 1850 85 4 420
3 Suburb Flat 420 0 1 145
4 Suburb Detached 1620 72 3 335
5 City Flat 510 0 2 210
6 City Detached 2100 110 5 510
7 Suburb Flat 390 0 1 130
8 Suburb Detached 1780 68 3 360
9 City Flat 495 0 2 195
10 City Detached 1950 92 4 445
11 Suburb Flat 410 0 1 155
12 Suburb Detached 1700 78 3 350
13 City Flat 525 0 2 220
14 City Detached 2200 120 5 540
15 Suburb Flat 400 0 1 140
16 Suburb Detached 1680 75 3 345
What just happened?
The raw dataset has the basics. But the junior analyst's three observations — a sqft threshold effect, a location×type interaction, and a conditional garden effect — suggest three features that aren't in the table yet. Each one needs an EDA investigation to confirm the observation, then a feature that captures it.
EDA Finding 1 → Binned Feature from a Distribution Gap
The scenario: The junior analyst said "prices seem to jump sharply above a certain square footage." You investigate this by looking at the sqft distribution split by price bracket. If there really is a threshold — a point where adding more sqft suddenly has a bigger effect on price — the model would benefit from knowing which side of that threshold each property is on, rather than just seeing the raw sqft number.
from scipy import stats
# Step 1: Confirm the observation — does sqft have a bimodal distribution?
print("sqft distribution by property type:\n")
for ptype, group in df.groupby('prop_type'):
    print(f" {ptype}: min={group['sqft'].min()} max={group['sqft'].max()} "
          f"mean={group['sqft'].mean():.0f}")
print()
# The gap between flat max (525) and detached min (1620) is obvious.
# A binary flag captures this discontinuity cleanly.
# Step 2: Build the feature — is this a large property (above the gap)?
# The natural threshold is somewhere between 525 and 1620 sqft
# We use 1000 sqft as a clean, round threshold in the middle of the gap
SQFT_THRESHOLD = 1000
df['is_large_property'] = (df['sqft'] >= SQFT_THRESHOLD).astype(int)
# Step 3: Validate — does this flag predict price better than raw sqft alone?
r_raw, _ = stats.pearsonr(df['sqft'], df['price_000'])
r_flag, _ = stats.pearsonr(df['is_large_property'], df['price_000'])
print(f"Correlation with price:")
print(f" sqft (raw): r = {r_raw:+.3f}")
print(f" is_large_property: r = {r_flag:+.3f}")
print()
print(f"Average price by large/small:")
print(df.groupby('is_large_property')['price_000'].mean().to_string())
sqft distribution by property type:

 Detached: min=1620 max=2200 mean=1860
 Flat: min=390 max=525 mean=454

Correlation with price:
 sqft (raw): r = +0.960
 is_large_property: r = +0.957

Average price by large/small:
is_large_property
0    172.500
1    413.125
What just happened?
pandas' .groupby() immediately confirms the gap: flats top out at 525 sqft, detacheds start at 1620 sqft. There's a 1,095 sqft gap with no properties in it at all. The distribution isn't continuous — it has two distinct clusters.
The binary flag (is_large_property) achieves nearly the same correlation with price (0.957) as the raw continuous sqft (0.960) — but it's much more interpretable and robust. Average price for small properties: £172.5k. For large: £413.1k. A roughly 140% price jump captured by a single 1/0 column. Keep both the raw sqft and the flag — they give the model complementary information.
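When a distribution has several clusters rather than two, the same idea generalises from a single flag to pd.cut with explicit bin edges placed inside the gaps. A minimal sketch on made-up sqft values (the edges and labels are illustrative, not taken from the lesson's data):

```python
import pandas as pd

# Hypothetical sqft values forming three clusters with two gaps
sqft = pd.Series([400, 480, 525, 1100, 1250, 1900, 2200])

# Bin edges sit inside the gaps; each property gets a named size band
bands = pd.cut(sqft, bins=[0, 800, 1500, 3000],
               labels=['small', 'medium', 'large'])
print(bands.tolist())  # ['small', 'small', 'small', 'medium', 'medium', 'large', 'large']
```

As with the binary flag, each band boundary should be justified by a visible gap in the distribution, not chosen arbitrarily.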
EDA Finding 2 → Interaction Feature from a Group Comparison
The scenario: The junior analyst observed that "location and property type seem to matter more together than either does alone." You investigate this by computing the average price for each location × property type combination. If a City Detached commands a premium that isn't just the sum of "City" and "Detached" effects separately — if the combination has its own price effect — you need an interaction feature that the model can learn from.
# Step 1: Investigate the location × type interaction
print("Average price by location × property type:\n")
combo_prices = df.groupby(['location','prop_type'])['price_000'].mean().round(0)
print(combo_prices.to_string())
print()
# Compute the "interaction premium" — does City Detached cost more than
# (avg City premium) + (avg Detached premium) would predict?
city_premium     = df[df['location']=='City']['price_000'].mean() - df['price_000'].mean()
detached_premium = df[df['prop_type']=='Detached']['price_000'].mean() - df['price_000'].mean()
expected_combo   = df['price_000'].mean() + city_premium + detached_premium
actual_combo     = df[(df['location']=='City') & (df['prop_type']=='Detached')]['price_000'].mean()
print(f"Expected City Detached price (additive model): £{expected_combo:.0f}k")
print(f"Actual City Detached price: £{actual_combo:.0f}k")
print(f"Interaction premium: £{actual_combo-expected_combo:.0f}k\n")
# Step 2: Build the interaction feature — concatenate location + type
df['location_type'] = df['location'] + '_' + df['prop_type']
# e.g., "City_Detached", "Suburb_Flat" — four unique combinations
# Step 3: Validate — compute price by combo and correlation
r_loc, _ = stats.pearsonr((df['location']=='City').astype(int), df['price_000'])
r_type, _ = stats.pearsonr((df['prop_type']=='Detached').astype(int), df['price_000'])
# For the interaction, we use target encoding — average price per combo
target_map = df.groupby('location_type')['price_000'].mean()
df['location_type_encoded'] = df['location_type'].map(target_map)
r_combo, _ = stats.pearsonr(df['location_type_encoded'], df['price_000'])
print(f"Correlation with price:")
print(f" location alone: r = {abs(r_loc):+.3f}")
print(f" prop_type alone: r = {abs(r_type):+.3f}")
print(f" location_type (combined): r = {r_combo:+.3f}")
Average price by location × property type:
location prop_type
City Detached 479.0
Flat 202.0
Suburb Detached 348.0
Flat 142.0
Expected City Detached price (additive model): £461k
Actual City Detached price: £479k
Interaction premium: £18k
Correlation with price:
location alone: r = +0.392
prop_type alone: r = +0.957
location_type (combined): r = +0.972
What just happened?
pandas' string concatenation (df['location'] + '_' + df['prop_type']) creates the combined category string. .groupby().mean() computes the target encoding — the average price for each combination. .map() applies it back to every row.
The interaction premium is real: City Detached houses cost £18k more than the additive model would predict — the combination is worth more than the sum of its parts. And the combined feature achieves r=0.972 vs 0.392 for location alone. The junior analyst's intuition was right, and now it's a feature. Target encoding is one of the most powerful ways to turn a categorical interaction into a numeric signal.
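As a side note, the map-based encoding built above can also be written in one step with .groupby().transform('mean'), which returns a Series already aligned to the original rows. A minimal sketch on a toy frame (the column names are illustrative):

```python
import pandas as pd

toy = pd.DataFrame({
    'combo': ['A', 'A', 'B', 'B'],   # categorical interaction column
    'price': [100, 120, 300, 340],   # target
})

# One-step target encoding: each row receives its group's mean target
toy['combo_encoded'] = toy.groupby('combo')['price'].transform('mean')
print(toy['combo_encoded'].tolist())  # [110.0, 110.0, 320.0, 320.0]
```

transform is handy for quick exploration, and the leakage caveat covered later in this lesson applies to this form just as much.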
EDA Finding 3 → Conditional Feature from a Segment Analysis
The scenario: The third observation was the most interesting: "bigger gardens increase price — but only for detached houses, not flats." You investigate by calculating the garden-price correlation separately for each property type. If the observation holds — if garden size matters for detacheds but is irrelevant for flats (which often have zero garden anyway) — you need a conditional feature that only activates for detached houses.
# Step 1: Investigate — does garden_sqm correlate with price differently by type?
print("Garden size vs price — by property type:\n")
for ptype, group in df.groupby('prop_type'):
    if group['garden_sqm'].std() > 0:  # skip groups with no variance (all flats have 0)
        r, p = stats.pearsonr(group['garden_sqm'], group['price_000'])
        print(f" {ptype}: garden_sqm vs price r = {r:+.3f} p = {p:.4f}")
    else:
        print(f" {ptype}: all garden_sqm = 0 — no variance, correlation undefined")
print()
# Step 2: Build the conditional feature
# For detached houses: use the actual garden size
# For flats: set to 0 (garden is irrelevant — the model shouldn't see a fake signal)
df['garden_for_detached'] = df.apply(
    lambda row: row['garden_sqm'] if row['prop_type'] == 'Detached' else 0,
    axis=1
)
# Step 3: Validate — does the conditional feature beat raw garden_sqm?
r_raw_garden, _  = stats.pearsonr(df['garden_sqm'], df['price_000'])
r_cond_garden, _ = stats.pearsonr(df['garden_for_detached'], df['price_000'])
print(f"Correlation with price:")
print(f" garden_sqm (raw): r = {r_raw_garden:+.3f}")
print(f" garden_for_detached: r = {r_cond_garden:+.3f}")
print()
print("Garden size vs price for detached houses only:")
det = df[df['prop_type']=='Detached'][['garden_sqm','price_000']].sort_values('garden_sqm')
print(det.to_string(index=False))
Garden size vs price — by property type:
Detached: garden_sqm vs price r = +0.987 p = 0.0000
Flat: all garden_sqm = 0 — no variance, correlation undefined
Correlation with price:
garden_sqm (raw): r = +0.876
garden_for_detached: r = +0.876
Garden size vs price for detached houses only:
garden_sqm price_000
68 360
72 335
75 345
78 350
85 420
92 445
110 510
120 540
What just happened?
pandas' .apply(lambda row: ..., axis=1) applies a row-level function — it checks each row's prop_type and returns the garden value for detacheds or 0 for flats. The if group['garden_sqm'].std() > 0 guard prevents a crash when all values in a group are identical (all flats have garden_sqm = 0, so std = 0).
For detached houses alone, r = +0.987 — near-perfect correlation. Every extra square metre of garden adds price. For flats, the concept doesn't apply. The junior analyst's intuition was exactly right. The conditional feature encodes this: it tells the model "if this is a detached house, the garden size matters; if it's a flat, treat it as zero."
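.apply with axis=1 runs a Python function once per row, which is slow on large frames. For a simple condition like this one, np.where expresses the same logic vectorised. A minimal sketch on a toy frame mirroring the lesson's columns:

```python
import numpy as np
import pandas as pd

# Toy data (illustrative), same column structure as the lesson's dataset
toy = pd.DataFrame({
    'prop_type': ['Flat', 'Detached', 'Flat', 'Detached'],
    'garden_sqm': [5, 85, 0, 72],
})

# Vectorised conditional: keep garden size for detacheds, zero it otherwise
toy['garden_for_detached'] = np.where(toy['prop_type'] == 'Detached',
                                      toy['garden_sqm'], 0)
print(toy['garden_for_detached'].tolist())  # [0, 85, 0, 72]
```

Same result as the row-wise lambda, in a single vectorised pass.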
Step 4 — The Full Engineered Feature Set
The scenario: You now have three new features born directly from EDA observations. Before handing the dataset to the modelling team, you produce a final summary comparing every feature's correlation with price — raw inputs and engineered features together — so the team knows exactly what they're working with and which features are worth prioritising.
# Collect all numeric features — raw and engineered
all_features = {
# Raw features
'sqft': df['sqft'],
'garden_sqm': df['garden_sqm'],
'bedrooms': df['bedrooms'],
# Engineered features
'is_large_property': df['is_large_property'],
'location_type_encoded': df['location_type_encoded'],
'garden_for_detached': df['garden_for_detached'],
}
print("=== FEATURE CORRELATION RANKING ===\n")
print(f" {'Feature':<26} {'r with price':>14} {'Type':>12} Source")
print(" " + "─" * 68)
results = []
for name, col in all_features.items():
    r, _ = stats.pearsonr(col, df['price_000'])
    ftype = "Engineered" if name in ['is_large_property', 'location_type_encoded',
                                     'garden_for_detached'] else "Raw"
    results.append((abs(r), name, r, ftype))
for _, name, r, ftype in sorted(results, reverse=True):
    source = "← EDA observation" if ftype == "Engineered" else ""
    print(f" {name:<26} {r:>+14.3f} {ftype:>12} {source}")
=== FEATURE CORRELATION RANKING ===

 Feature                      r with price         Type Source
 ────────────────────────────────────────────────────────────────────
 location_type_encoded              +0.972   Engineered ← EDA observation
 sqft                               +0.960          Raw
 is_large_property                  +0.957   Engineered ← EDA observation
 bedrooms                           +0.951          Raw
 garden_for_detached                +0.876   Engineered ← EDA observation
 garden_sqm                         +0.876          Raw
What just happened?
Iterating over a plain Python dict of feature columns lets us loop over every feature while scipy computes the correlation for each. We sort by absolute correlation and mark each feature's origin — raw or engineered — and whether it came from an EDA observation.
The top-ranked feature — location_type_encoded at r=0.972 — is engineered. It doesn't exist in the raw data. The junior analyst's three observations, turned into three features, produced the strongest predictor in the dataset. This is the payoff of connecting EDA findings directly to feature engineering: the most important features often come from understanding the data, not from automated feature selection.
Teacher's Note
Target encoding — using the average target value per category as the feature — leaks information in a training/test split. In a real modelling workflow, you must compute the target encoding only from the training set, then apply those same values to the test set. Fitting the encoding on the whole dataset (as we did here for simplicity) would cause data leakage and overoptimistic performance estimates. Use sklearn's TargetEncoder inside a proper pipeline to handle this correctly.
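The train-only procedure the note describes can be sketched in a few lines of plain pandas. The split below is illustrative; a category unseen in training falls back to the training-set global mean:

```python
import pandas as pd

# Illustrative train/test split with a toy categorical and target
train = pd.DataFrame({'combo': ['A', 'A', 'B', 'B'], 'price': [100, 120, 300, 340]})
test  = pd.DataFrame({'combo': ['A', 'B', 'C']})  # 'C' never seen in training

# Fit the encoding on training rows only
encoding = train.groupby('combo')['price'].mean()
global_mean = train['price'].mean()

# Apply to test; categories absent from training get the global mean
test['combo_encoded'] = test['combo'].map(encoding).fillna(global_mean)
print(test['combo_encoded'].tolist())  # [110.0, 320.0, 215.0]
```

Because the test target never touches the encoding, the evaluation stays honest; a pipeline-based encoder automates exactly this separation.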
Every feature in this lesson was motivated by something the EDA showed — a gap in the distribution, a group comparison table, a conditional relationship. That's the habit to build: after every EDA step, ask "is there a feature I should create from this?" The answer won't always be yes. But when it is, that feature is usually the best one you'll build.
Practice Questions
1. What is the name of the technique where you replace a categorical variable with the average target value for each category — used here to encode the location × property type combination?
2. To build the conditional garden feature — returning garden_sqm for detacheds and 0 for flats — which pandas method runs a custom row-level function across the whole DataFrame?
3. If you compute a target encoding using the full dataset (including the test set), what problem does this introduce into your model evaluation?
Quiz
1. Your EDA shows that sqft has a bimodal distribution — two clusters with no properties in between. What feature engineering approach makes best use of this finding?
2. Your EDA shows that City Detached houses cost £18k more than the additive model (City premium + Detached premium) would predict. What should you build?
3. You want to use target encoding (average price per category) in a train/test split model. What is the correct procedure to avoid data leakage?
Up Next · Lesson 39
PCA Insights
Principal Component Analysis as an EDA tool — how to use it to understand which features drive variance, spot structure in high-dimensional data, and decide what to simplify before modelling.