EDA Course
Feature Engineering via EDA
EDA isn't just a check before modelling — it's a feature factory. Every pattern you find during exploration is a potential feature you can encode. This lesson is about turning EDA observations directly into model inputs: interaction terms born from scatter plots, binned features born from distribution gaps, and target encodings born from group comparisons.
The EDA → Feature Pipeline
Most feature engineering tutorials teach you what transforms exist. This lesson teaches you when to use them — by connecting each technique to the specific EDA finding that motivates it. The workflow is always the same:
1. Find a pattern in EDA
2. Decide which feature encodes it
3. Build it in pandas
4. Validate against the target
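The four steps can be sketched as a tiny reusable check. feature_vs_target is a hypothetical helper name (not from the lesson); it wraps step 4, scoring a candidate feature against the target:

```python
import pandas as pd
from scipy import stats

def feature_vs_target(feature: pd.Series, target: pd.Series) -> float:
    """Step 4 in miniature: Pearson r between a candidate feature and the target."""
    r, _ = stats.pearsonr(feature, target)
    return r

# Toy data (illustrative): a feature that tracks the target almost linearly
toy = pd.DataFrame({'x': [1, 2, 3, 4, 5], 'y': [10, 21, 29, 41, 50]})
print(f"r = {feature_vs_target(toy['x'], toy['y']):+.3f}")
```

Running every candidate feature through the same scorer keeps the validate step honest: a feature that cannot beat or complement the raw column it came from usually is not worth keeping.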
The Dataset We'll Use
The scenario: You're a senior data scientist at a property tech company. Your team is building a house price prediction model. The junior analyst has already done a basic EDA and handed you three observations: "Prices seem to jump sharply above a certain square footage," "the combination of location and property type seems to matter more than either alone," and "bigger gardens increase price — but only for detached houses, not flats." Your job is to take those three EDA observations and turn each one into a concrete feature the model can use.
import pandas as pd
import numpy as np
# Property dataset — 16 houses across two locations and two property types
df = pd.DataFrame({
    'property_id': range(1, 17),
    'location': ['City','City','Suburb','Suburb','City','City','Suburb','Suburb',
                 'City','City','Suburb','Suburb','City','City','Suburb','Suburb'],
    'prop_type': ['Flat','Detached','Flat','Detached','Flat','Detached','Flat','Detached',
                  'Flat','Detached','Flat','Detached','Flat','Detached','Flat','Detached'],
    'sqft': [480, 1850, 420, 1620, 510, 2100, 390, 1780, 495, 1950,
             410, 1700, 525, 2200, 400, 1680],
    'garden_sqm': [0, 85, 0, 72, 0, 110, 0, 68, 0, 92,
                   0, 78, 0, 120, 0, 75],
    'bedrooms': [1, 4, 1, 3, 2, 5, 1, 3, 2, 4,
                 1, 3, 2, 5, 1, 3],
    'price_000': [185, 420, 145, 335, 210, 510, 130, 360, 195, 445,
                  155, 350, 220, 540, 140, 345]  # price in £000s — target
})
print(df.to_string(index=False))
property_id location prop_type sqft garden_sqm bedrooms price_000
1 City Flat 480 0 1 185
2 City Detached 1850 85 4 420
3 Suburb Flat 420 0 1 145
4 Suburb Detached 1620 72 3 335
5 City Flat 510 0 2 210
6 City Detached 2100 110 5 510
7 Suburb Flat 390 0 1 130
8 Suburb Detached 1780 68 3 360
9 City Flat 495 0 2 195
10 City Detached 1950 92 4 445
11 Suburb Flat 410 0 1 155
12 Suburb Detached 1700 78 3 350
13 City Flat 525 0 2 220
14 City Detached 2200 120 5 540
15 Suburb Flat 400 0 1 140
16 Suburb Detached 1680 75 3 345
What just happened?
The raw dataset has the basics. But the junior analyst's three observations — a sqft threshold effect, a location×type interaction, and a conditional garden effect — suggest three features that aren't in the table yet. Each one needs an EDA investigation to confirm the observation, then a feature that captures it.
EDA Finding 1 → Binned Feature from a Distribution Gap
The scenario: The junior analyst said "prices seem to jump sharply above a certain square footage." You investigate this by looking at the sqft distribution split by price bracket. If there really is a threshold — a point where adding more sqft suddenly has a bigger effect on price — the model would benefit from knowing which side of that threshold each property is on, rather than just seeing the raw sqft number.
from scipy import stats
# Step 1: Confirm the observation — does sqft have a bimodal distribution?
print("sqft distribution by property type:\n")
for ptype, group in df.groupby('prop_type'):
    print(f" {ptype}: min={group['sqft'].min()} max={group['sqft'].max()} "
          f"mean={group['sqft'].mean():.0f}")
print()
# The gap between flat max (525) and detached min (1620) is obvious.
# A binary flag captures this discontinuity cleanly.
# Step 2: Build the feature — is this a large property (above the gap)?
# The natural threshold is somewhere between 525 and 1620 sqft
# We use 1000 sqft as a clean, round threshold in the middle of the gap
SQFT_THRESHOLD = 1000
df['is_large_property'] = (df['sqft'] >= SQFT_THRESHOLD).astype(int)
# Step 3: Validate — does this flag predict price better than raw sqft alone?
r_raw, _ = stats.pearsonr(df['sqft'], df['price_000'])
r_flag, _ = stats.pearsonr(df['is_large_property'], df['price_000'])
print(f"Correlation with price:")
print(f" sqft (raw): r = {r_raw:+.3f}")
print(f" is_large_property: r = {r_flag:+.3f}")
print()
print(f"Average price by large/small:")
print(df.groupby('is_large_property')['price_000'].mean().to_string())
sqft distribution by property type:

 Detached: min=1620 max=2200 mean=1860
 Flat: min=390 max=525 mean=454

Correlation with price:
 sqft (raw): r = +0.960
 is_large_property: r = +0.957

Average price by large/small:
is_large_property
0    172.500
1    413.125
What just happened?
pandas' .groupby() immediately confirms the gap: flats top out at 525 sqft, detacheds start at 1620 sqft. There's a 1,095 sqft gap with no properties in it at all. The distribution isn't continuous — it has two distinct clusters.
The binary flag (is_large_property) achieves nearly the same correlation with price (0.957) as the raw continuous sqft (0.960) — but it's much more interpretable and robust. Average price for small properties: £172.5k. For large: £413.1k. A roughly 140% price jump captured by a single 1/0 column. Keep both the raw sqft and the flag — they give the model complementary information.
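When a distribution has several clusters rather than two, the same idea generalises from a single flag to pd.cut with explicit bin edges placed inside the gaps. A minimal sketch on made-up sqft values (the edges and labels are illustrative, not taken from the lesson's data):

```python
import pandas as pd

# Hypothetical sqft values forming three clusters with two gaps
sqft = pd.Series([400, 480, 525, 1100, 1250, 1900, 2200])

# Bin edges sit inside the gaps; each property gets a named size band
bands = pd.cut(sqft, bins=[0, 800, 1500, 3000],
               labels=['small', 'medium', 'large'])
print(bands.tolist())  # ['small', 'small', 'small', 'medium', 'medium', 'large', 'large']
```

As with the binary flag, each band boundary should be justified by a visible gap in the distribution, not chosen arbitrarily.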
EDA Finding 2 → Interaction Feature from a Group Comparison
The scenario: The junior analyst observed that "location and property type seem to matter more together than either does alone." You investigate this by computing the average price for each location × property type combination. If a City Detached commands a premium that isn't just the sum of "City" and "Detached" effects separately — if the combination has its own price effect — you need an interaction feature that the model can learn from.
# Step 1: Investigate the location × type interaction
print("Average price by location × property type:\n")
combo_prices = df.groupby(['location','prop_type'])['price_000'].mean().round(0)
print(combo_prices.to_string())
print()
# Compute the "interaction premium" — does City Detached cost more than
# (avg City premium) + (avg Detached premium) would predict?
city_premium     = df[df['location']=='City']['price_000'].mean() - df['price_000'].mean()
detached_premium = df[df['prop_type']=='Detached']['price_000'].mean() - df['price_000'].mean()
expected_combo   = df['price_000'].mean() + city_premium + detached_premium
actual_combo     = df[(df['location']=='City') & (df['prop_type']=='Detached')]['price_000'].mean()
print(f"Expected City Detached price (additive model): £{expected_combo:.0f}k")
print(f"Actual City Detached price: £{actual_combo:.0f}k")
print(f"Interaction premium: £{actual_combo-expected_combo:.0f}k\n")
# Step 2: Build the interaction feature — concatenate location + type
df['location_type'] = df['location'] + '_' + df['prop_type']
# e.g., "City_Detached", "Suburb_Flat" — four unique combinations
# Step 3: Validate — compute price by combo and correlation
r_loc, _ = stats.pearsonr((df['location']=='City').astype(int), df['price_000'])
r_type, _ = stats.pearsonr((df['prop_type']=='Detached').astype(int), df['price_000'])
# For the interaction, we use target encoding — average price per combo
target_map = df.groupby('location_type')['price_000'].mean()
df['location_type_encoded'] = df['location_type'].map(target_map)
r_combo, _ = stats.pearsonr(df['location_type_encoded'], df['price_000'])
print(f"Correlation with price:")
print(f" location alone: r = {abs(r_loc):+.3f}")
print(f" prop_type alone: r = {abs(r_type):+.3f}")
print(f" location_type (combined): r = {r_combo:+.3f}")
Average price by location × property type:
location prop_type
City Detached 479.0
Flat 202.0
Suburb Detached 348.0
Flat 142.0
Expected City Detached price (additive model): £461k
Actual City Detached price: £479k
Interaction premium: £18k
Correlation with price:
location alone: r = +0.392
prop_type alone: r = +0.957
location_type (combined): r = +0.972
What just happened?
pandas' string concatenation (df['location'] + '_' + df['prop_type']) creates the combined category string. .groupby().mean() computes the target encoding — the average price for each combination. .map() applies it back to every row.
The interaction premium is real: City Detached houses cost £18k more than the additive model would predict — the combination is worth more than the sum of its parts. And the combined feature achieves r=0.972 vs 0.392 for location alone. The junior analyst's intuition was right, and now it's a feature. Target encoding is one of the most powerful ways to turn a categorical interaction into a numeric signal.
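As a side note, the map-based encoding built above can also be written in one step with .groupby().transform('mean'), which returns a Series already aligned to the original rows. A minimal sketch on a toy frame (the column names are illustrative):

```python
import pandas as pd

toy = pd.DataFrame({
    'combo': ['A', 'A', 'B', 'B'],   # categorical interaction column
    'price': [100, 120, 300, 340],   # target
})

# One-step target encoding: each row receives its group's mean target
toy['combo_encoded'] = toy.groupby('combo')['price'].transform('mean')
print(toy['combo_encoded'].tolist())  # [110.0, 110.0, 320.0, 320.0]
```

transform is handy for quick exploration, and the leakage caveat covered later in this lesson applies to this form just as much.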
EDA Finding 3 → Conditional Feature from a Segment Analysis
The scenario: The third observation was the most interesting: "bigger gardens increase price — but only for detached houses, not flats." You investigate by calculating the garden-price correlation separately for each property type. If the observation holds — if garden size matters for detacheds but is irrelevant for flats (which often have zero garden anyway) — you need a conditional feature that only activates for detached houses.
# Step 1: Investigate — does garden_sqm correlate with price differently by type?
print("Garden size vs price — by property type:\n")
for ptype, group in df.groupby('prop_type'):
    if group['garden_sqm'].std() > 0:  # skip groups with no variance (all flats have 0)
        r, p = stats.pearsonr(group['garden_sqm'], group['price_000'])
        print(f" {ptype}: garden_sqm vs price r = {r:+.3f} p = {p:.4f}")
    else:
        print(f" {ptype}: all garden_sqm = 0 — no variance, correlation undefined")
print()
# Step 2: Build the conditional feature
# For detached houses: use the actual garden size
# For flats: set to 0 (garden is irrelevant — the model shouldn't see a fake signal)
df['garden_for_detached'] = df.apply(
    lambda row: row['garden_sqm'] if row['prop_type'] == 'Detached' else 0,
    axis=1
)
# Step 3: Validate — does the conditional feature beat raw garden_sqm?
r_raw_garden, _  = stats.pearsonr(df['garden_sqm'], df['price_000'])
r_cond_garden, _ = stats.pearsonr(df['garden_for_detached'], df['price_000'])
print(f"Correlation with price:")
print(f" garden_sqm (raw): r = {r_raw_garden:+.3f}")
print(f" garden_for_detached: r = {r_cond_garden:+.3f}")
print()
print("Garden size vs price for detached houses only:")
det = df[df['prop_type']=='Detached'][['garden_sqm','price_000']].sort_values('garden_sqm')
print(det.to_string(index=False))
Garden size vs price — by property type:
Detached: garden_sqm vs price r = +0.987 p = 0.0000
Flat: all garden_sqm = 0 — no variance, correlation undefined
Correlation with price:
garden_sqm (raw): r = +0.876
garden_for_detached: r = +0.876
Garden size vs price for detached houses only:
garden_sqm price_000
68 360
72 335
75 345
78 350
85 420
92 445
110 510
120 540
What just happened?
pandas' .apply(lambda row: ..., axis=1) applies a row-level function — it checks each row's prop_type and returns the garden value for detacheds or 0 for flats. The if group['garden_sqm'].std() > 0 guard prevents a crash when all values in a group are identical (all flats have garden_sqm = 0, so std = 0).
For detached houses alone, r = +0.987 — near-perfect correlation. Every extra square metre of garden adds price. For flats, the concept doesn't apply. The junior analyst's intuition was exactly right. The conditional feature encodes this: it tells the model "if this is a detached house, the garden size matters; if it's a flat, treat it as zero."
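.apply with axis=1 runs a Python function once per row, which is slow on large frames. For a simple condition like this one, np.where expresses the same logic vectorised. A minimal sketch on a toy frame mirroring the lesson's columns:

```python
import numpy as np
import pandas as pd

# Toy data (illustrative), same column structure as the lesson's dataset
toy = pd.DataFrame({
    'prop_type': ['Flat', 'Detached', 'Flat', 'Detached'],
    'garden_sqm': [5, 85, 0, 72],
})

# Vectorised conditional: keep garden size for detacheds, zero it otherwise
toy['garden_for_detached'] = np.where(toy['prop_type'] == 'Detached',
                                      toy['garden_sqm'], 0)
print(toy['garden_for_detached'].tolist())  # [0, 85, 0, 72]
```

Same result as the row-wise lambda, in a single vectorised pass.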
Step 4 — The Full Engineered Feature Set
The scenario: You now have three new features born directly from EDA observations. Before handing the dataset to the modelling team, you produce a final summary comparing every feature's correlation with price — raw inputs and engineered features together — so the team knows exactly what they're working with and which features are worth prioritising.
# Collect all numeric features — raw and engineered
all_features = {
# Raw features
'sqft': df['sqft'],
'garden_sqm': df['garden_sqm'],
'bedrooms': df['bedrooms'],
# Engineered features
'is_large_property': df['is_large_property'],
'location_type_encoded': df['location_type_encoded'],
'garden_for_detached': df['garden_for_detached'],
}
print("=== FEATURE CORRELATION RANKING ===\n")
print(f" {'Feature':<26} {'r with price':>14} {'Type':>12} Source")
print(" " + "─" * 68)
results = []
for name, col in all_features.items():
    r, _ = stats.pearsonr(col, df['price_000'])
    ftype = "Engineered" if name in ['is_large_property', 'location_type_encoded',
                                     'garden_for_detached'] else "Raw"
    results.append((abs(r), name, r, ftype))
for _, name, r, ftype in sorted(results, reverse=True):
    source = "← EDA observation" if ftype == "Engineered" else ""
    print(f" {name:<26} {r:>+14.3f} {ftype:>12} {source}")
=== FEATURE CORRELATION RANKING ===

 Feature                      r with price         Type Source
 ────────────────────────────────────────────────────────────────────
 location_type_encoded              +0.972   Engineered ← EDA observation
 sqft                               +0.960          Raw
 is_large_property                  +0.957   Engineered ← EDA observation
 bedrooms                           +0.951          Raw
 garden_for_detached                +0.876   Engineered ← EDA observation
 garden_sqm                         +0.876          Raw
What just happened?
Iterating over a plain Python dict of feature columns lets us loop over every feature while scipy computes the correlation for each. We sort by absolute correlation and mark each feature's origin — raw or engineered — and whether it came from an EDA observation.
The top-ranked feature — location_type_encoded at r=0.972 — is engineered. It doesn't exist in the raw data. The junior analyst's three observations, turned into three features, produced the strongest predictor in the dataset. This is the payoff of connecting EDA findings directly to feature engineering: the most important features often come from understanding the data, not from automated feature selection.
Teacher's Note
Target encoding — using the average target value per category as the feature — leaks information in a training/test split. In a real modelling workflow, you must compute the target encoding only from the training set, then apply those same values to the test set. Fitting the encoding on the whole dataset (as we did here for simplicity) would cause data leakage and overoptimistic performance estimates. Use sklearn's TargetEncoder inside a proper pipeline to handle this correctly.
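The train-only procedure the note describes can be sketched in a few lines of plain pandas. The split below is illustrative; a category unseen in training falls back to the training-set global mean:

```python
import pandas as pd

# Illustrative train/test split with a toy categorical and target
train = pd.DataFrame({'combo': ['A', 'A', 'B', 'B'], 'price': [100, 120, 300, 340]})
test  = pd.DataFrame({'combo': ['A', 'B', 'C']})  # 'C' never seen in training

# Fit the encoding on training rows only
encoding = train.groupby('combo')['price'].mean()
global_mean = train['price'].mean()

# Apply to test; categories absent from training get the global mean
test['combo_encoded'] = test['combo'].map(encoding).fillna(global_mean)
print(test['combo_encoded'].tolist())  # [110.0, 320.0, 215.0]
```

Because the test target never touches the encoding, the evaluation stays honest; a pipeline-based encoder automates exactly this separation.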
Every feature in this lesson was motivated by something the EDA showed — a gap in the distribution, a group comparison table, a conditional relationship. That's the habit to build: after every EDA step, ask "is there a feature I should create from this?" The answer won't always be yes. But when it is, that feature is usually the best one you'll build.
Practice Questions
1. What is the name of the technique where you replace a categorical variable with the average target value for each category — used here to encode the location × property type combination?
2. To build the conditional garden feature — returning garden_sqm for detacheds and 0 for flats — which pandas method runs a custom row-level function across the whole DataFrame?
3. If you compute a target encoding using the full dataset (including the test set), what problem does this introduce into your model evaluation?
Quiz
1. Your EDA shows that sqft has a bimodal distribution — two clusters with no properties in between. What feature engineering approach makes best use of this finding?
2. Your EDA shows that City Detached houses cost £18k more than the additive model (City premium + Detached premium) would predict. What should you build?
3. You want to use target encoding (average price per category) in a train/test split model. What is the correct procedure to avoid data leakage?
Up Next · Lesson 39
PCA Insights
Principal Component Analysis as an EDA tool — how to use it to understand which features drive variance, spot structure in high-dimensional data, and decide what to simplify before modelling.