EDA Lesson 21 – Feature Relationships | Dataplexa
Intermediate Level · Lesson 21

Feature Relationships

A machine learning model learns from features — the columns in your dataset. But features don't exist in isolation. Some explain the same thing twice over. Some matter a lot for predictions. Some matter only in certain situations. This lesson is about mapping that web of relationships before your model sees a single row.

What Is a Feature — And Why Do Relationships Matter?

In data science, a feature is just a column you use as an input to make a prediction. If you're predicting house prices, your features might be: number of bedrooms, square footage, distance from the city centre, age of the property.

Now imagine two features: "square footage" and "number of rooms." Both tell you roughly the same thing — bigger house. A model that sees both might get confused, weigh them against each other, and produce worse predictions than if you just used one. That's a redundant feature relationship.
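As a quick illustration (toy numbers, not the lesson's dataset), here is what a redundant pair looks like in code:

```python
import pandas as pd

# Toy data, made up for illustration: bigger houses have more rooms,
# so the two columns carry nearly the same information
houses = pd.DataFrame({
    'sq_footage': [620, 720, 850, 940, 1100, 1250],
    'num_rooms':  [3,   4,   5,   6,   7,    9],
})

# A correlation this high flags the pair as redundant
r = houses['sq_footage'].corr(houses['num_rooms'])
print(round(r, 3))
```

A correlation near 1.0 between two inputs is the signature of redundancy, which Step 2 below hunts for systematically.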

Feature relationship analysis asks three key questions before modelling:

1. Which features are related to the target?

If a feature has no relationship with what you're predicting, it adds noise, not signal. Find and remove it early.

2. Which features are redundant with each other?

Highly correlated features carry duplicate information. Keeping both wastes resources and can hurt linear models.

3. Do any features interact with each other?

Sometimes two weak features become powerful when combined. "Number of rooms" and "house age" might each weakly predict price — but old houses with many rooms might be a very specific, valuable signal.
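Question 3 is easiest to see with synthetic numbers. In this sketch (random data, assumed purely for illustration), the target is driven entirely by the product of two features: each feature alone looks useless, but the combination is a perfect predictor.

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.uniform(-1, 1, 200)
x2 = rng.uniform(-1, 1, 200)
y = x1 * x2                                     # target exists only in the interaction

print(round(np.corrcoef(x1, y)[0, 1], 2))       # small: weak on its own
print(round(np.corrcoef(x2, y)[0, 1], 2))       # small: weak on its own
print(round(np.corrcoef(x1 * x2, y)[0, 1], 2))  # the combination carries the full signal
```

This is why a correlation-with-target scan (Step 1) can miss genuinely useful features: it only measures each feature in isolation.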

Step 1 — Which Features Actually Matter?

The scenario: You're a data analyst at a property tech company. Your team is building a model to predict house sale prices in a UK city. You have eight potential features. Before building anything, your lead wants to know which of these features are actually worth including — and which ones are wasting everyone's time. You start by measuring how strongly each feature correlates with the sale price.

import pandas as pd      # pandas: our main data table tool — like a spreadsheet in Python
import numpy as np       # numpy: fast maths library — standard companion to pandas
from scipy import stats  # scipy: statistics library — gives us correlation with p-values

# House dataset — 12 properties with 8 potential features and the target (sale_price)
df = pd.DataFrame({
    'sale_price':       [285,320,195,410,265,375,220,340,290,430,255,310],  # £000s — this is what we predict
    'sq_footage':       [850,980,620,1200,780,1100,660,1020,870,1250,720,940],
    'num_bedrooms':     [3,  4,  2,  5,   3,  4,   2,  4,   3,  5,   2,  3 ],
    'num_bathrooms':    [1,  2,  1,  3,   1,  2,   1,  2,   2,  3,   1,  2 ],
    'property_age_yrs': [25, 8,  42, 3,   31, 12,  38, 6,   18, 2,   45, 15],
    'dist_city_km':     [4.2,2.1,8.5,1.2, 5.8,1.8, 7.3,2.5, 3.9,0.9, 9.1,3.2],
    'garden_size_sqm':  [45, 80, 20, 120, 35, 95,  25, 75,  50, 130, 15, 60 ],
    'crime_rate_score': [42, 28, 65, 18,  51, 22,  71, 31,  38, 15,  78, 35 ],  # higher = more crime
    'nearest_school_km':[0.8,1.2,2.1,0.5,1.5,0.9, 2.8,1.1, 0.7,0.4, 3.2,1.3],
})

# For each feature, compute its correlation with sale_price
# We use Spearman (rank-based) because some features may be skewed
print("=== FEATURE IMPORTANCE: Correlation with Sale Price ===")
print(f"{'Feature':<22} {'Spearman r':>11}  {'p-value':>9}  {'Significant?':>13}  Signal strength")
print("-" * 82)

results = []
for col in df.columns:
    if col == 'sale_price':
        continue   # skip the target itself
    r, p = stats.spearmanr(df[col], df['sale_price'])
    sig  = "Yes ✓" if p < 0.05 else "No  ✗"
    # Strength label — plain English, no jargon
    if   abs(r) >= 0.7: strength = "★★★ Strong"
    elif abs(r) >= 0.4: strength = "★★  Moderate"
    else:               strength = "★   Weak"
    results.append((col, r, p, sig, strength))

# Sort by absolute correlation — strongest relationships at the top
results.sort(key=lambda x: abs(x[1]), reverse=True)
for col, r, p, sig, strength in results:
    print(f"  {col:<20} {r:>+11.3f}  {p:>9.4f}  {sig:>13}  {strength}")

What just happened?

pandas is our data table library — it holds all the property data in rows and columns. We loop over each column with a simple for col in df.columns and skip the target column.

scipy's stats.spearmanr() gives us both the correlation strength (r) and the p-value. We sort by absolute correlation so the most useful features rise to the top automatically.

Every single feature is significantly correlated with sale price, which makes this a well-chosen dataset. Notice the negative correlations: for distance from city, farther away means lower price; for crime rate, more crime means lower price; and distance to the nearest school follows the same pattern. The sign of the correlation tells you the direction of the relationship, which is just as important as its strength.
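The choice of Spearman over Pearson in the code above comes down to ranks: Spearman only asks whether the relationship is monotonic, not whether it is linear. A minimal demonstration:

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.exp(x)   # heavily skewed, but y always rises with x

print(round(stats.pearsonr(x, y)[0], 3))    # below 1: the curve isn't a line
print(round(stats.spearmanr(x, y)[0], 3))   # 1.0: the ranks match perfectly
```

Because property data (prices, sizes, distances) is often skewed like this, the rank-based measure is the safer default here.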

Step 2 — Finding Redundant Features

The scenario: All eight features correlate well with sale price. But that doesn't mean you should use all eight. Some features might be strongly correlated with each other — essentially telling the model the same thing twice. Square footage and number of bedrooms, for example — bigger houses tend to have more bedrooms. Using both might not add any new information. You need to find these redundant pairs.

import pandas as pd      # pandas: data library — .corr() for the feature-to-feature correlation matrix
import numpy as np       # numpy: maths library — standard companion to pandas
from scipy import stats  # scipy: statistics library — spearmanr, used below to rank each pair against the target

features = ['sq_footage','num_bedrooms','num_bathrooms','property_age_yrs',
            'dist_city_km','garden_size_sqm','crime_rate_score','nearest_school_km']

# Build a correlation matrix — every feature vs every other feature
# We want to find pairs of FEATURES that are highly correlated with EACH OTHER
# (not with the target — that's step 1)
feat_corr = df[features].corr(method='spearman')   # Spearman again — robust to skew
print("=== FEATURE-TO-FEATURE CORRELATION MATRIX ===")
print(feat_corr.round(2).to_string())
print()

# Now find every pair where |r| > 0.85 — those are potential redundancies
REDUNDANCY_THRESHOLD = 0.85
print(f"=== REDUNDANT FEATURE PAIRS (|r| > {REDUNDANCY_THRESHOLD}) ===")
print("These pairs carry very similar information — consider dropping one from each pair.\n")

found_any = False
for i in range(len(features)):
    for j in range(i + 1, len(features)):       # upper triangle only — avoids duplicates
        r = feat_corr.iloc[i, j]
        if abs(r) > REDUNDANCY_THRESHOLD:
            fa, fb = features[i], features[j]
            # Which one to keep? Usually keep the one more correlated with the target
            r_a = abs(stats.spearmanr(df[fa], df['sale_price'])[0])
            r_b = abs(stats.spearmanr(df[fb], df['sale_price'])[0])
            keep = fa if r_a >= r_b else fb
            drop = fb if keep == fa else fa
            print(f"  {fa}  ×  {fb}")
            print(f"    r = {r:.3f}  → these features overlap significantly")
            print(f"    Recommendation: keep '{keep}' (r={max(r_a,r_b):.3f} with target), "
                  f"consider dropping '{drop}' (r={min(r_a,r_b):.3f} with target)\n")
            found_any = True

if not found_any:
    print("  No highly redundant pairs found above the threshold.")

What just happened?

pandas' .corr() builds a feature-vs-feature correlation matrix. We then loop through the upper triangle (skipping the diagonal and lower half, which are mirror images) looking for pairs above our redundancy threshold of 0.85.

The recommendation logic is practical: when two features overlap, keep the one that is more strongly correlated with the target. In this dataset, sq_footage correlates with price more strongly than num_bedrooms does, so it wins every pairing it appears in.

This is a surprisingly common situation in property data. Distance from city centre and crime rate are nearly redundant here: expensive central areas tend to have lower crime. They're measuring different things, but in this dataset they carry almost identical information. A linear model using both suffers from multicollinearity and cannot cleanly attribute the effect to either one; a decision tree would waste splits. This check saves you from that problem before modelling even starts.
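The nested index loop above works, but pandas can express the same upper-triangle scan more compactly. A sketch using a made-up 3×3 correlation matrix:

```python
import numpy as np
import pandas as pd

# Made-up correlation matrix, just to demonstrate the idiom
corr = pd.DataFrame(
    [[1.00, 0.95, 0.20],
     [0.95, 1.00, 0.30],
     [0.20, 0.30, 1.00]],
    index=['a', 'b', 'c'], columns=['a', 'b', 'c'],
)

# Keep only the upper triangle (k=1 drops the diagonal), then
# stack() turns the surviving cells into one labelled pair per row
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()
print(pairs[pairs.abs() > 0.85])   # only the ('a', 'b') pair survives
```

The explicit loop in the lesson is easier to read for beginners; this version scales better once you have dozens of features.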

Step 3 — Detecting Feature Interactions

The scenario: Your lead raises an interesting question: "Does property age matter the same way for all property sizes? Or does an old small flat behave very differently from an old large house?" This is an interaction — where the effect of one feature depends on the value of another. You split the data by property size groups and check whether age's relationship with price changes across them.

import pandas as pd      # pandas: data table library — pd.cut() for grouping, .groupby() for splitting
import numpy as np       # numpy: maths library — standard import
from scipy import stats  # scipy: statistics library — spearmanr for within-group correlations

# Create a size category — split properties into Small, Medium, Large based on sq_footage
# pd.cut() divides a continuous column into labelled buckets
df['size_group'] = pd.cut(
    df['sq_footage'],
    bins=[0, 800, 1000, 2000],          # 0–800 = Small, 800–1000 = Medium, 1000–2000 = Large
    labels=['Small', 'Medium', 'Large']
)

print("=== PROPERTIES BY SIZE GROUP ===")
print(df.groupby('size_group', observed=True).agg(
    count=('sale_price', 'count'),
    mean_price=('sale_price', 'mean'),
    mean_age=('property_age_yrs', 'mean')   # does property age differ by size group?
).round(1))
print()

# Check whether 'property_age_yrs' relates to price differently in each size group
# This reveals an INTERACTION: does age matter more for large houses than small ones?
print("=== INTERACTION: property_age_yrs × sale_price, by size group ===")
print("(Does property age affect price differently depending on house size?)\n")

for grp_label, grp_df in df.groupby('size_group', observed=True):
    if len(grp_df) < 3:
        print(f"  [{grp_label}]  n={len(grp_df)} — too few rows for reliable correlation")
        continue
    r, p = stats.spearmanr(grp_df['property_age_yrs'], grp_df['sale_price'])
    direction = "older → lower price" if r < 0 else "older → higher price"
    print(f"  [{grp_label}]  n={len(grp_df)}  r={r:.3f}  p={p:.4f}  → {direction}")

print()
print("Overall correlation (no split):  ", round(stats.spearmanr(df['property_age_yrs'], df['sale_price'])[0], 3))

What just happened?

pandas' pd.cut() divides a continuous column into labelled buckets — here splitting square footage into Small, Medium, and Large groups. We then use .groupby() as an iterator, running a fresh Spearman correlation inside each group.

scipy's stats.spearmanr() runs once per size group, giving a separate r value per slice. This is how you test for interactions without a machine learning model.

Interesting finding: the grouped summary shows large properties average under six years old, while small ones average 39 years. Age and size are already intertwined in this dataset — which is exactly the kind of structural relationship that affects modelling decisions. The small group sizes (n=4 each) mean we can't draw firm statistical conclusions here, but the pattern is clear enough to flag for the modelling team.
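One way to keep group sizes balanced (the caveat raised above) is pd.qcut, which cuts on quantiles rather than fixed boundaries. A sketch using the same square-footage values:

```python
import pandas as pd

sq = pd.Series([850, 980, 620, 1200, 780, 1100, 660, 1020, 870, 1250, 720, 940])

# qcut chooses the bin edges so each bucket gets (roughly) the same count
groups = pd.qcut(sq, q=3, labels=['Small', 'Medium', 'Large'])
print(groups.value_counts().sort_index())
```

The trade-off: pd.cut gives you interpretable, fixed boundaries (0–800 sq ft is always "Small"), while pd.qcut guarantees balanced groups but moves the boundaries with the data.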

Step 4 — Building a Feature Relationship Summary

The scenario: End of the analysis. You need a single, clean deliverable — a feature relationship report your team can use to make informed decisions about which features to keep, which to drop, and which interactions to investigate further. This is the kind of output that goes straight into the project documentation.

import pandas as pd      # pandas: data library — corr(), groupby, and DataFrame construction
import numpy as np       # numpy: maths library — standard companion to pandas
from scipy import stats  # scipy: statistics library — spearmanr for robust correlation

def feature_relationship_report(dataframe, features, target, redundancy_threshold=0.85):
    """
    Produces a plain-English feature relationship report covering:
    1. How strongly each feature correlates with the target
    2. Which feature pairs are redundant with each other
    3. A recommended feature shortlist
    """
    print("=" * 60)
    print("  FEATURE RELATIONSHIP REPORT")
    print("=" * 60)

    # --- SECTION 1: Feature–target correlations ---
    print("\n[1] RELATIONSHIP WITH TARGET: '{}'\n".format(target))
    target_corrs = []
    for feat in features:
        r, p = stats.spearmanr(dataframe[feat], dataframe[target])
        target_corrs.append((feat, r, p))
    # Sort by absolute correlation strength
    target_corrs.sort(key=lambda x: abs(x[1]), reverse=True)

    keep_set = set()   # features worth keeping — built as we go
    for feat, r, p in target_corrs:
        sig   = "✓ significant" if p < 0.05 else "✗ not significant"
        stars = "★★★" if abs(r)>=0.7 else "★★" if abs(r)>=0.4 else "★"
        action = "KEEP" if p < 0.05 and abs(r) >= 0.4 else "REVIEW"
        if action == "KEEP":
            keep_set.add(feat)
        print(f"  {feat:<22}  r={r:+.3f}  {sig:<18}  {stars}  → {action}")

    # --- SECTION 2: Redundancy check ---
    print("\n[2] REDUNDANT FEATURE PAIRS (|r| > {})\n".format(redundancy_threshold))
    feat_corr = dataframe[features].corr(method='spearman')
    drop_suggestions = set()   # features flagged for potential removal

    found = False
    for i in range(len(features)):
        for j in range(i + 1, len(features)):
            r_pair = feat_corr.iloc[i, j]
            if abs(r_pair) > redundancy_threshold:
                fa, fb = features[i], features[j]
                # Keep the one with stronger target correlation
                r_a = abs(stats.spearmanr(dataframe[fa], dataframe[target])[0])
                r_b = abs(stats.spearmanr(dataframe[fb], dataframe[target])[0])
                drop = fb if r_a >= r_b else fa
                drop_suggestions.add(drop)
                print(f"  {fa}  ×  {fb}  →  r={r_pair:.3f}  (suggest dropping '{drop}')")
                found = True
    if not found:
        print("  No redundant pairs found above threshold.")

    # --- SECTION 3: Recommended shortlist ---
    print("\n[3] RECOMMENDED FEATURE SHORTLIST\n")
    final_features = [f for f in features if f in keep_set and f not in drop_suggestions]
    review_features = [f for f in features if f in drop_suggestions]
    print("  Keep these features:")
    for f in final_features:
        print(f"    ✓ {f}")
    print("\n  Consider dropping (redundant with a stronger feature):")
    for f in review_features:
        print(f"    ✗ {f}")
    print("\n" + "=" * 60)

feature_relationship_report(
    df,
    features=['sq_footage','num_bedrooms','num_bathrooms','property_age_yrs',
              'dist_city_km','garden_size_sqm','crime_rate_score','nearest_school_km'],
    target='sale_price'
)

What just happened?

pandas and scipy work together throughout this function. dataframe[features].corr(method='spearman') builds the feature-vs-feature matrix. stats.spearmanr() gives us feature-vs-target scores with p-values. Python's built-in set() tracks which features to keep and which to flag for dropping — an efficient way to avoid duplicating decisions across the loop.

Starting from 8 features, the report recommends keeping only a small shortlist, headed by sq_footage. This isn't because the dropped features are bad predictors — they're all highly significant. It's because they're largely telling the model the same thing as the features we kept. A leaner feature set trains faster, generalises better, and is easier to explain to stakeholders.

The Feature Relationship Checklist

Use this checklist every time you get a new dataset before building any model:

Step 1: Is this feature correlated with the target?
  Method: spearmanr(feature, target)
  If yes: keep it. If not, consider dropping it.

Step 2: Is this feature highly correlated with another feature?
  Method: df[features].corr()
  If yes: keep the one more correlated with the target; drop the other.

Step 3: Does this feature's effect on the target change across groups?
  Method: groupby() + correlation per group
  If yes: flag the interaction; consider creating a combined feature.

Step 4: Are there features with near-zero correlation AND no significance?
  Method: p-value check
  If yes: drop them — they add noise, not signal.
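The "combined feature" in Step 3 of the checklist can be as simple as a product term. A minimal sketch (column names mirror the lesson's dataset; the small frame here is a stand-in):

```python
import pandas as pd

sample = pd.DataFrame({
    'property_age_yrs': [25, 8, 42, 3],
    'sq_footage':       [850, 980, 620, 1200],
})

# A product term lets a model treat 'old AND large' as its own signal,
# instead of having to learn the combination from the two raw columns
sample['age_x_size'] = sample['property_age_yrs'] * sample['sq_footage']
print(sample['age_x_size'].tolist())   # [21250, 7840, 26040, 3600]
```

Whether the product, a ratio, or a per-group dummy works best depends on the model; the point is that a flagged interaction becomes an explicit column the model can use directly.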

Teacher's Note

Don't fall into the trap of thinking "more features = better model." It's almost always the opposite. A model trained on 3 clean, independent, relevant features will nearly always beat one trained on 10 features where half are redundant and two are noise.

The exception: tree-based models (like Random Forest and XGBoost) handle redundant features better than linear models because they can simply ignore unhelpful splits. But even then, fewer features means faster training, easier debugging, and a model you can actually explain to your stakeholders. Less is almost always more.

Practice Questions

1. Two features in your dataset correlate with each other at r = 0.97. What word describes this kind of feature pair?



2. When two features are redundant with each other, which one should you keep — the one more correlated with the other feature, or the one more correlated with the target?



3. The effect of property age on price is much stronger for large houses than for small ones. What is this called — where the effect of one feature depends on the value of another?



Quiz

1. Two features — sq_footage (r=0.99 with price) and num_bedrooms (r=0.97 with price) — correlate with each other at r=0.97. What should you do?


2. Which of these features is most likely safe to drop before modelling?


3. Does removing redundant features matter equally for all model types?


Up Next · Lesson 22

Categorical Exploration

Dig deep into categorical columns — frequency, dominance, rare categories, ordinal vs nominal, and how to spot the encoding traps before they break your model.