EDA Lesson 38 – Feature Engineering via EDA | Dataplexa
Advanced Level · Lesson 38

Feature Engineering via EDA

EDA isn't just a check before modelling — it's a feature factory. Every pattern you find during exploration is a potential feature you can encode. This lesson is about turning EDA observations directly into model inputs: interaction terms born from scatter plots, binned features born from distribution gaps, and target encodings born from group comparisons.

The EDA → Feature Pipeline

Most feature engineering tutorials teach you what transforms exist. This lesson teaches you when to use them — by connecting each technique to the specific EDA finding that motivates it. The workflow is always the same:

1. Find a pattern in EDA
2. Decide which feature encodes it
3. Build it in pandas
4. Validate against the target

The Dataset We'll Use

The scenario: You're a senior data scientist at a property tech company. Your team is building a house price prediction model. The junior analyst has already done a basic EDA and handed you three observations: "Prices seem to jump sharply above a certain square footage," "the combination of location and property type seems to matter more than either alone," and "bigger gardens increase price — but only for detached houses, not flats." Your job is to take those three EDA observations and turn each one into a concrete feature the model can use.

import pandas as pd
import numpy as np

# Property dataset — 16 houses across two locations and two property types
df = pd.DataFrame({
    'property_id':  range(1, 17),
    'location':     ['City','City','Suburb','Suburb','City','City','Suburb','Suburb',
                     'City','City','Suburb','Suburb','City','City','Suburb','Suburb'],
    'prop_type':    ['Flat','Detached','Flat','Detached','Flat','Detached','Flat','Detached',
                     'Flat','Detached','Flat','Detached','Flat','Detached','Flat','Detached'],
    'sqft':         [480, 1850, 420, 1620, 510, 2100, 390, 1780, 495, 1950,
                     410, 1700, 525, 2200, 400, 1680],
    'garden_sqm':   [0,   85,   0,   72,   0,   110,  0,   68,   0,   92,
                     0,   78,   5,   120,  0,   75  ],
    'bedrooms':     [1,   4,    1,   3,    2,   5,    1,   3,    2,   4,
                     1,   3,    2,   5,    1,   3   ],
    'price_000':    [185, 420, 145, 335, 210, 510, 130, 360, 195, 445,
                     155, 350, 220, 540, 140, 345]   # price in £000s — target
})

print(df.to_string(index=False))

What just happened?

The raw dataset has the basics. But the junior analyst's three observations — a sqft threshold effect, a location×type interaction, and a conditional garden effect — suggest three features that aren't in the table yet. Each one needs an EDA investigation to confirm the observation, then a feature that captures it.

EDA Finding 1 → Binned Feature from a Distribution Gap

The scenario: The junior analyst said "prices seem to jump sharply above a certain square footage." You investigate this by looking at the sqft distribution split by price bracket. If there really is a threshold — a point where adding more sqft suddenly has a bigger effect on price — the model would benefit from knowing which side of that threshold each property is on, rather than just seeing the raw sqft number.

from scipy import stats

# Step 1: Confirm the observation — does sqft have a bimodal distribution?
print("sqft distribution by property type:\n")
for ptype, group in df.groupby('prop_type'):
    print(f"  {ptype}: min={group['sqft'].min()}  max={group['sqft'].max()}  "
          f"mean={group['sqft'].mean():.0f}")

print()
# The gap between flat max (525) and detached min (1620) is obvious.
# A binary flag captures this discontinuity cleanly.

# Step 2: Build the feature — is this a large property (above the gap)?
# The natural threshold is somewhere between 525 and 1620 sqft
# We use 1000 sqft as a clean, round threshold in the middle of the gap
SQFT_THRESHOLD = 1000
df['is_large_property'] = (df['sqft'] >= SQFT_THRESHOLD).astype(int)

# Step 3: Validate — does this flag predict price better than raw sqft alone?
r_raw,  _ = stats.pearsonr(df['sqft'],             df['price_000'])
r_flag, _ = stats.pearsonr(df['is_large_property'], df['price_000'])

print(f"Correlation with price:")
print(f"  sqft (raw):           r = {r_raw:+.3f}")
print(f"  is_large_property:    r = {r_flag:+.3f}")
print()
print(f"Average price by large/small:")
print(df.groupby('is_large_property')['price_000'].mean().to_string())

What just happened?

pandas' .groupby() immediately confirms the gap: flats top out at 525 sqft, detacheds start at 1620 sqft. There's a 1,095 sqft gap with no properties in it at all. The distribution isn't continuous — it has two distinct clusters.

The binary flag (is_large_property) achieves nearly the same correlation with price (0.957) as the raw continuous sqft (0.960) — but it's much more interpretable and robust. Average price for small properties: £172k. For large: £413k — roughly a 140% price jump captured by a single 1/0 column. Keep both the raw sqft and the flag — they give the model complementary information.
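Here the gap suggested a single binary flag, but the same idea generalises: if EDA revealed several clusters rather than two, pd.cut turns the gaps into an ordered banded feature. A minimal sketch with hypothetical bin edges (not from the lesson's dataset):

```python
import pandas as pd

sqft = pd.Series([480, 1850, 420, 1620, 510, 2100])

# Hypothetical edges placed inside the gaps an EDA histogram revealed
bins = [0, 1000, 2000, float('inf')]
labels = ['small', 'medium', 'large']

# pd.cut assigns each value to the band its sqft falls into
size_band = pd.cut(sqft, bins=bins, labels=labels)
print(size_band.tolist())
```

As with the binary flag, the bands can be kept alongside raw sqft so the model sees both the threshold structure and the continuous signal.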

EDA Finding 2 → Interaction Feature from a Group Comparison

The scenario: The junior analyst observed that "location and property type seem to matter more together than either does alone." You investigate this by computing the average price for each location × property type combination. If a City Detached commands a premium that isn't just the sum of "City" and "Detached" effects separately — if the combination has its own price effect — you need an interaction feature that the model can learn from.

# Step 1: Investigate the location × type interaction
print("Average price by location × property type:\n")
combo_prices = df.groupby(['location','prop_type'])['price_000'].mean().round(0)
print(combo_prices.to_string())
print()

# Compute the "interaction premium" — does City Detached cost more than
# (avg City premium) + (avg Detached premium) would predict?
city_premium    = df[df['location']=='City']['price_000'].mean() - df['price_000'].mean()
detached_premium= df[df['prop_type']=='Detached']['price_000'].mean() - df['price_000'].mean()
expected_combo  = df['price_000'].mean() + city_premium + detached_premium
actual_combo    = df[(df['location']=='City') & (df['prop_type']=='Detached')]['price_000'].mean()

print(f"Expected City Detached price (additive model): £{expected_combo:.0f}k")
print(f"Actual City Detached price:                    £{actual_combo:.0f}k")
print(f"Interaction premium:                           £{actual_combo-expected_combo:.0f}k\n")

# Step 2: Build the interaction feature — concatenate location + type
df['location_type'] = df['location'] + '_' + df['prop_type']
# e.g., "City_Detached", "Suburb_Flat" — four unique combinations

# Step 3: Validate — compute price by combo and correlation
r_loc,  _ = stats.pearsonr(df['location']=='City',     df['price_000'])
r_type, _ = stats.pearsonr(df['prop_type']=='Detached', df['price_000'])

# For the interaction, we use target encoding — average price per combo
target_map = df.groupby('location_type')['price_000'].mean()
df['location_type_encoded'] = df['location_type'].map(target_map)
r_combo, _ = stats.pearsonr(df['location_type_encoded'], df['price_000'])

print("Correlation with price:")
print(f"  location alone:            |r| = {abs(r_loc):.3f}")
print(f"  prop_type alone:           |r| = {abs(r_type):.3f}")
print(f"  location_type (combined):   r  = {r_combo:+.3f}")

What just happened?

pandas' string concatenation (df['location'] + '_' + df['prop_type']) creates the combined category string. .groupby().mean() computes the target encoding — the average price for each combination. .map() applies it back to every row.

The interaction premium is real: City Detached houses cost £18k more than the additive model would predict — the combination is worth more than the sum of its parts. And the combined feature achieves r=0.972 vs 0.392 for location alone. The junior analyst's intuition was right, and now it's a feature. Target encoding is one of the most powerful ways to turn a categorical interaction into a numeric signal.
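Target encoding isn't the only way to hand the combination to a model. For low-cardinality interactions like this one, one-hot encoding the combined category with pd.get_dummies is a leakage-free alternative, at the cost of one column per combination. A minimal sketch on a toy frame:

```python
import pandas as pd

df = pd.DataFrame({
    'location':  ['City', 'City', 'Suburb', 'Suburb'],
    'prop_type': ['Flat', 'Detached', 'Flat', 'Detached'],
})

# Same string concatenation as in the lesson, then one column per combo
combo = df['location'] + '_' + df['prop_type']
dummies = pd.get_dummies(combo, prefix='loc_type')
print(dummies.columns.tolist())
```

With four combinations this adds four indicator columns; target encoding keeps it to one numeric column but needs the leakage handling described in the Teacher's Note.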

EDA Finding 3 → Conditional Feature from a Segment Analysis

The scenario: The third observation was the most interesting: "bigger gardens increase price — but only for detached houses, not flats." You investigate by calculating the garden-price correlation separately for each property type. If the observation holds — if garden size matters for detacheds but is irrelevant for flats (which often have zero garden anyway) — you need a conditional feature that only activates for detached houses.

# Step 1: Investigate — does garden_sqm correlate with price differently by type?
print("Garden size vs price — by property type:\n")

for ptype, group in df.groupby('prop_type'):
    if group['garden_sqm'].std() > 0:   # skip groups with no variance (all flats have 0)
        r, p = stats.pearsonr(group['garden_sqm'], group['price_000'])
        print(f"  {ptype}: garden_sqm vs price  r = {r:+.3f}  p = {p:.4f}")
    else:
        print(f"  {ptype}: all garden_sqm = 0 — no variance, correlation undefined")

print()

# Step 2: Build the conditional feature
# For detached houses: use the actual garden size
# For flats: set to 0 (garden is irrelevant — the model shouldn't see a fake signal)
df['garden_for_detached'] = df.apply(
    lambda row: row['garden_sqm'] if row['prop_type'] == 'Detached' else 0,
    axis=1
)

# Step 3: Validate — does the conditional feature beat raw garden_sqm?
r_raw_garden, _ = stats.pearsonr(df['garden_sqm'],          df['price_000'])
r_cond_garden, _= stats.pearsonr(df['garden_for_detached'],  df['price_000'])

print(f"Correlation with price:")
print(f"  garden_sqm (raw):           r = {r_raw_garden:+.3f}")
print(f"  garden_for_detached:        r = {r_cond_garden:+.3f}")
print()
print("Garden size vs price for detached houses only:")
det = df[df['prop_type']=='Detached'][['garden_sqm','price_000']].sort_values('garden_sqm')
print(det.to_string(index=False))

What just happened?

pandas' .apply(lambda row: ..., axis=1) applies a row-level function — it checks each row's prop_type and returns the garden value for detacheds or 0 for flats. The if group['garden_sqm'].std() > 0 guard prevents a crash when all values in a group are identical (all flats have garden_sqm = 0, so std = 0).

For detached houses alone, r = +0.987 — near-perfect correlation. Every extra square metre of garden adds price. For flats, the concept doesn't apply. The junior analyst's intuition was exactly right. The conditional feature encodes this: it tells the model "if this is a detached house, the garden size matters; if it's a flat, treat it as zero."
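The row-level .apply is the most readable way to express the condition, but on large frames the same logic is usually written vectorised with np.where, which avoids the per-row Python call. A sketch on a toy frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'prop_type':  ['Flat', 'Detached', 'Flat', 'Detached'],
    'garden_sqm': [0, 85, 5, 72],
})

# Same conditional as the lambda: garden size for detacheds, 0 otherwise
df['garden_for_detached'] = np.where(
    df['prop_type'] == 'Detached', df['garden_sqm'], 0
)
print(df['garden_for_detached'].tolist())
```

Both produce identical columns; np.where simply evaluates the condition once over the whole array.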

Step 4 — The Full Engineered Feature Set

The scenario: You now have three new features born directly from EDA observations. Before handing the dataset to the modelling team, you produce a final summary comparing every feature's correlation with price — raw inputs and engineered features together — so the team knows exactly what they're working with and which features are worth prioritising.

# Collect all numeric features — raw and engineered
all_features = {
    # Raw features
    'sqft':                    df['sqft'],
    'garden_sqm':              df['garden_sqm'],
    'bedrooms':                df['bedrooms'],
    # Engineered features
    'is_large_property':       df['is_large_property'],
    'location_type_encoded':   df['location_type_encoded'],
    'garden_for_detached':     df['garden_for_detached'],
}

print("=== FEATURE CORRELATION RANKING ===\n")
print(f"  {'Feature':<26} {'r with price':>14}  {'Type':>12}  Source")
print("  " + "─" * 68)

results = []
for name, col in all_features.items():
    r, _ = stats.pearsonr(col, df['price_000'])
    ftype = "Engineered" if name in ['is_large_property','location_type_encoded',
                                     'garden_for_detached'] else "Raw"
    results.append((abs(r), name, r, ftype))

for _, name, r, ftype in sorted(results, reverse=True):
    source = "← EDA observation" if ftype == "Engineered" else ""
    print(f"  {name:<26} {r:>+14.3f}  {ftype:>12}  {source}")

What just happened?

Iterating over a plain Python dict of feature columns lets us loop over all features, and scipy computes the correlation for each. We sort by absolute correlation and mark each feature's origin — raw or engineered — and whether it came from an EDA observation.

The top-ranked feature — location_type_encoded at r=0.972 — is engineered. It doesn't exist in the raw data. The junior analyst's three observations, turned into three features, produced the strongest predictor in the dataset. This is the payoff of connecting EDA findings directly to feature engineering: the most important features often come from understanding the data, not from automated feature selection.

Teacher's Note

Target encoding — using the average target value per category as the feature — leaks information in a training/test split. In a real modelling workflow, you must compute the target encoding only from the training set, then apply those same values to the test set. Fitting the encoding on the whole dataset (as we did here for simplicity) would cause data leakage and overoptimistic performance estimates. Use sklearn's TargetEncoder inside a proper pipeline to handle this correctly.
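The leakage-safe pattern can also be written by hand with the same groupby/map tools used in the lesson: fit the mapping on training rows only, then map it onto the test rows, falling back to the training mean for categories the training set never saw. A minimal sketch with toy data and illustrative names:

```python
import pandas as pd

train = pd.DataFrame({'combo': ['A', 'A', 'B', 'B'],
                      'price': [100, 120, 300, 340]})
test = pd.DataFrame({'combo': ['A', 'B', 'C']})   # 'C' is unseen in training

# Fit the encoding on the training set ONLY — the test target is never touched
target_map = train.groupby('combo')['price'].mean()
global_mean = train['price'].mean()

# Apply the same training-derived values to both sets;
# unseen categories fall back to the global training mean
train['combo_enc'] = train['combo'].map(target_map)
test['combo_enc'] = test['combo'].map(target_map).fillna(global_mean)
print(test['combo_enc'].tolist())
```

sklearn's TargetEncoder additionally applies cross-fitting and smoothing on the training side, which matters for small categories; the sketch above shows only the core train/test separation.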

Every feature in this lesson was motivated by something the EDA showed — a gap in the distribution, a group comparison table, a conditional relationship. That's the habit to build: after every EDA step, ask "is there a feature I should create from this?" The answer won't always be yes. But when it is, that feature is usually the best one you'll build.

Practice Questions

1. What is the name of the technique where you replace a categorical variable with the average target value for each category — used here to encode the location × property type combination?



2. To build the conditional garden feature — returning garden_sqm for detacheds and 0 for flats — which pandas method runs a custom row-level function across the whole DataFrame?



3. If you compute a target encoding using the full dataset (including the test set), what problem does this introduce into your model evaluation?



Quiz

1. Your EDA shows that sqft has a bimodal distribution — two clusters with no properties in between. What feature engineering approach makes best use of this finding?


2. Your EDA shows that City Detached houses cost £18k more than the additive model (City premium + Detached premium) would predict. What should you build?


3. You want to use target encoding (average price per category) in a train/test split model. What is the correct procedure to avoid data leakage?


Up Next · Lesson 39

PCA Insights

Principal Component Analysis as an EDA tool — how to use it to understand which features drive variance, spot structure in high-dimensional data, and decide what to simplify before modelling.