DS Case Studies
Real Estate Market Analysis
Property prices are the product of a hundred overlapping signals — location, size, age, condition, proximity to transport, school catchment, recent renovation. A hedonic pricing model separates those signals: it tells you exactly how much each feature contributes to price, and which properties are trading above or below their predicted value.
You are a senior data scientist at Meridian Capital Partners, a real estate investment fund. The head of acquisitions has a mandate: deploy £40M in residential property over the next 90 days. She needs a data-driven pricing model that quantifies the value premium of each property feature, identifies mispriced properties trading below predicted value, and ranks acquisition targets by expected return. The analysis must be presentation-ready before Monday's investment committee.
What This Case Study Covers
A hedonic pricing model is multivariate linear regression applied to property — each coefficient estimates the price contribution of one feature, holding all others constant (ceteris paribus). This case study implements OLS regression from scratch using numpy.linalg.lstsq(), applies a log transformation to the dependent variable, computes feature coefficients with confidence intervals, and uses model residuals to identify under-valued properties the investment team should target.
Three patterns introduced: log-linear OLS regression — modelling log(price) as the dependent variable so coefficients represent percentage price premiums rather than absolute amounts; dummy variable encoding for categorical features (property type, neighbourhood) using pandas get_dummies(); and residual-based mispricing detection — properties with large negative residuals (actual price significantly below model prediction) are undervalued acquisition targets.
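Before the full analysis, here is a minimal, self-contained sketch of the three patterns on simulated data (toy numbers and coefficients, not the Meridian dataset):

```python
import numpy as np
import pandas as pd

# Toy illustration of the three patterns (synthetic data, not the case-study set)
rng = np.random.default_rng(0)
n = 60
sqft = rng.uniform(500, 2000, n)
ptype = pd.Categorical(
    rng.choice(["Flat", "Terraced", "Detached"], n),
    categories=["Flat", "Terraced", "Detached"],  # Flat becomes the reference
)
# Simulated truth: +10% Terraced, +25% Detached vs Flat (illustrative numbers)
bump = np.where(ptype == "Terraced", 0.10, np.where(ptype == "Detached", 0.25, 0.0))
log_price = 11.0 + 0.0005 * sqft + bump + rng.normal(0, 0.05, n)

# Pattern 2: dummy-encode the categorical, dropping the reference category
dummies = pd.get_dummies(ptype, drop_first=True).astype(float)

# Pattern 1: log-linear OLS via lstsq, with the intercept column added manually
X = np.column_stack([np.ones(n), sqft, dummies.to_numpy()])
beta, *_ = np.linalg.lstsq(X, log_price, rcond=None)

# Pattern 3: the most negative residuals flag observations priced below prediction
resid = log_price - X @ beta
undervalued = np.argsort(resid)[:3]
detached_premium_pct = (np.exp(beta[3]) - 1) * 100  # dummy coeff -> % premium
```

The recovered Detached coefficient sits near the simulated 0.25 log-points, i.e. roughly a 28% premium via exp(β) − 1 — the same interpretation the case study applies to the real coefficients.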
The Hedonic Pricing Toolkit
Log-Linear Transformation
Modelling log(price) instead of price produces coefficients interpretable as percentage premiums. A bedroom coefficient of 0.085 means each additional bedroom adds approximately 8.5% to price. This is far more useful than "each bedroom adds £18,400" — a number that changes as the market moves, while the percentage premium is more stable.

OLS Regression from Scratch with numpy
The OLS solution is X†y, where X† is the Moore–Penrose pseudoinverse; numpy.linalg.lstsq() computes this efficiently. Building it from scratch rather than using sklearn exposes the matrix algebra, makes the standard error calculation explicit, and removes all black-box abstraction from the model.

Dummy Variable Encoding
Categorical features — property type (flat, terraced, semi-detached, detached) and neighbourhood — cannot enter a regression as strings. One-hot encoding creates a binary column per category, dropping one to avoid the dummy variable trap (perfect multicollinearity). Each dummy coefficient represents the price premium or discount versus the reference category.

Feature Coefficient Interpretation
Each OLS coefficient in a log-linear model is the log-point change in price per unit of the feature. For small coefficients, the log-point value approximates the percentage premium; exp(β) − 1 gives the exact percentage. Features with |t-stat| > 2 are statistically significant at roughly the 95% level — only significant features should be relied on for pricing decisions.

Residual-Based Mispricing Detection
A negative residual means the actual price is below the model's prediction — the property is undervalued given its features. Ranking properties by residual identifies acquisition targets: a property with a −15% residual is trading at a 15% discount to its predicted value, offering an expected mark-to-market return when the mispricing corrects.

Dataset Overview
The Meridian Capital property transaction dataset contains 20 recent sale records across three neighbourhoods with physical property characteristics, transaction prices, and condition ratings. Built inline — in production this would be pulled from the Land Registry API.
| prop_id | price | sqft | bedrooms | bathrooms | age_years | prop_type | neighbourhood | condition |
|---|---|---|---|---|---|---|---|---|
| P001 | £485,000 | 1,240 | 3 | 2 | 12 | Semi-detached | Kensington | 4 |
| P002 | £218,000 | 680 | 2 | 1 | 34 | Flat | Hackney | 3 |
| P003 | £742,000 | 1,820 | 4 | 3 | 8 | Detached | Kensington | 5 |
| P004 | £195,000 | 610 | 1 | 1 | 48 | Flat | Hackney | 2 |
| P005 | £368,000 | 980 | 3 | 2 | 22 | Terraced | Islington | 4 |
Showing first 5 of 20 rows · 9 columns
prop_id — Unique property reference. Maps model predictions and residuals back to specific addresses for the acquisition list.
price — Transaction price. Log-transformed as the dependent variable in the hedonic regression.
sqft — Internal floor area. Primary continuous predictor — usually the strongest driver of property price after location.
bedrooms — Number of bedrooms. Contributes a price premium per bedroom — the coefficient gives the percentage premium in a log-linear model.
age_years — Property age in years. Older properties trade at a discount — the coefficient is expected to be negative.
prop_type — Flat, Terraced, Semi-detached, or Detached. Dummy-encoded — each coefficient gives the premium over the reference type (Flat).
neighbourhood — Kensington, Islington, or Hackney. Location premium captured via dummy variables — most powerful predictor after size.
condition — Property condition rating (1 = poor, 5 = excellent). Higher condition commands a price premium and signals renovation potential for value-add investing.
Business Questions
The head of acquisitions needs these five answers before Monday's investment committee.
What is the percentage price premium for each additional bedroom, bathroom, and 100 sq ft of space?
How much more do Kensington properties command over Hackney — controlling for size, type, and condition?
Which properties are trading significantly below their model-predicted value — and by how much?
How well does the model fit — what is the R² and which features are statistically significant?
Which three properties represent the best acquisition targets and what is the estimated upside?
Step-by-Step Analysis
The scenario:
The transaction dataset arrived Thursday morning. The investment committee is Monday. Build the hedonic model, validate it, identify the three best acquisition targets by residual upside, and produce a one-page brief for the committee. Start with feature engineering and dummy encoding — the model cannot run on raw strings.
We build the full property dataset, create log_price as the dependent variable, compute price per square foot, and encode categorical variables as dummy columns using pd.get_dummies(). The dummy trap is avoided by dropping the first category in each group.
import pandas as pd
import numpy as np
from scipy import stats
df = pd.DataFrame({
"prop_id": [f"P{i:03d}" for i in range(1,21)],
"price": [485000,218000,742000,195000,368000,
312000,895000,155000,428000,284000,
562000,176000,318000,680000,245000,
415000,720000,198000,342000,524000],
"sqft": [1240,680,1820,610,980,
840,2100,520,1100,760,
1380,590,870,1680,650,
1050,1940,620,920,1280],
"bedrooms": [3,2,4,1,3,2,5,1,3,2,4,2,3,4,2,3,4,2,3,3],
"bathrooms": [2,1,3,1,2,1,4,1,2,1,3,1,2,3,1,2,3,1,2,2],
"age_years": [12,34,8,48,22,18,5,62,14,28,10,44,20,7,38,16,6,52,24,11],
"prop_type": ["Semi-detached","Flat","Detached","Flat","Terraced",
"Semi-detached","Detached","Flat","Semi-detached","Flat",
"Detached","Flat","Terraced","Detached","Flat",
"Semi-detached","Detached","Flat","Terraced","Semi-detached"],
"neighbourhood":["Kensington","Hackney","Kensington","Hackney","Islington",
"Islington","Kensington","Hackney","Kensington","Hackney",
"Kensington","Hackney","Islington","Kensington","Hackney",
"Islington","Kensington","Hackney","Islington","Kensington"],
"condition": [4,3,5,2,4,3,5,2,4,3,5,2,4,5,3,4,5,2,3,4]
})
# ── Log-transform price (dependent variable) ─────────────────────────────────
df["log_price"] = np.log(df["price"])
df["price_per_sf"]= (df["price"] / df["sqft"]).round(1)
# ── Dummy encoding: drop_first=True avoids perfect multicollinearity ──────────
dummies = pd.get_dummies(df[["prop_type","neighbourhood"]], drop_first=True)
dummies = dummies.astype(int) # convert bool to int for matrix ops
print("Dataset shape:", df.shape)
print(f"\nPrice summary:")
print(df["price"].describe().apply(lambda x: f"£{x:,.0f}").to_string())
print(f"\nDummy columns created: {list(dummies.columns)}")
print(f"\nPrice per sq ft by neighbourhood:")
print(df.groupby("neighbourhood")["price_per_sf"].agg(["mean","min","max"]).round(0).to_string())
print(f"\nCorrelations with log_price:")
for col in ["sqft","bedrooms","bathrooms","age_years","condition"]:
r = df[col].corr(df["log_price"]).round(3)
print(f" {col:<14} r = {r:+.3f}")
Dataset shape: (20, 11)
Price summary:
count 20
mean £411,500
std £202,657
min £155,000
25% £225,500
50% £390,500
75% £536,500
max £895,000
Dummy columns created: ['prop_type_Detached', 'prop_type_Semi-detached', 'prop_type_Terraced', 'neighbourhood_Islington', 'neighbourhood_Kensington']
Price per sq ft by neighbourhood:
mean min max
neighbourhood
Hackney 317 263 360
Islington 378 333 420
Kensington 405 358 447
Correlations with log_price:
sqft r = +0.943
bedrooms r = +0.884
bathrooms r = +0.908
age_years r = -0.842
condition r = +0.832

What just happened?
Method — pd.get_dummies(drop_first=True) · log transformation · correlation profiling

pd.get_dummies(df[cols], drop_first=True) one-hot encodes both categorical columns simultaneously, dropping the first category of each to avoid the dummy variable trap. With four property types, three dummy columns are created (Detached, Semi-detached, Terraced — reference = Flat). With three neighbourhoods, two dummy columns are created (Islington, Kensington — reference = Hackney). np.log(df["price"]) log-transforms the dependent variable — in a log-linear model, each unit increase in a continuous predictor multiplies price by exp(β), so coefficients represent percentage premiums. All five continuous features correlate strongly with log_price (|r| > 0.83), confirming they are strong predictors worth including.
Kensington commands £405/sqft versus Hackney's £317/sqft — a 27.8% location premium before controlling for property characteristics. sqft has the strongest correlation with log_price at r = +0.943. age_years is the strongest negative predictor at r = −0.842 — older properties trade at a significant discount. The price range from £155,000 to £895,000 indicates high variance that a feature-based model should explain well.
We build the feature matrix X by combining continuous predictors with dummy columns, add an intercept column, and solve the OLS normal equations using numpy.linalg.lstsq(). We then compute standard errors, t-statistics, and p-values for each coefficient manually.
# ── Build feature matrix X ────────────────────────────────────────────────────
continuous = df[["sqft","bedrooms","bathrooms","age_years","condition"]].copy()
continuous["sqft_100"] = continuous["sqft"] / 100 # rescale so the coefficient is the premium per 100 sqft
X_raw = pd.concat([
continuous[["sqft_100","bedrooms","bathrooms","age_years","condition"]],
dummies
], axis=1)
# Add intercept column (column of ones)
X = np.column_stack([np.ones(len(X_raw)), X_raw.values])
y = df["log_price"].values
feature_names = ["intercept"] + list(X_raw.columns)
# ── OLS solution via numpy least squares ──────────────────────────────────────
# beta minimises ||Xb − y||²; for full-rank X this equals (X'X)^{-1} X'y
beta, residuals_ss, rank, sv = np.linalg.lstsq(X, y, rcond=None)
# ── Residuals and model fit ───────────────────────────────────────────────────
y_hat = X @ beta
resid = y - y_hat
n, k = len(y), len(beta)
dof = n - k # degrees of freedom
sigma2 = resid @ resid / dof # residual variance
# R-squared
ss_res = resid @ resid
ss_tot = ((y - y.mean()) ** 2).sum()
r2 = 1 - ss_res / ss_tot
r2_adj = 1 - (1-r2) * (n-1) / dof # k already counts the intercept, so df = n - k
# ── Standard errors, t-statistics, p-values ───────────────────────────────────
XtX_inv = np.linalg.inv(X.T @ X)
se = np.sqrt(sigma2 * np.diag(XtX_inv))
t_stat = beta / se
p_val = 2 * (1 - stats.t.cdf(np.abs(t_stat), df=dof))
# 95% CI via the normal approximation (z = 1.96); with only 9 degrees of
# freedom the exact critical value would be stats.t.ppf(0.975, dof) ≈ 2.26
ci_lo = beta - 1.96 * se
ci_hi = beta + 1.96 * se
print(f"Hedonic Pricing Model — OLS Results")
print(f"R² = {r2:.4f} | Adjusted R² = {r2_adj:.4f} | n = {n} | df = {dof}")
print(f"\n{'Feature':<28} {'Coeff':>8} {'SE':>7} {'t-stat':>8} {'p-val':>8} {'95% CI':>22} {'Sig'}")
print("─" * 90)
for i, feat in enumerate(feature_names):
sig = "***" if p_val[i]<0.001 else "**" if p_val[i]<0.01 else "*" if p_val[i]<0.05 else ""
pct = f"({(np.exp(beta[i])-1)*100:+.1f}%)" if feat != "intercept" else ""
print(f" {feat:<26} {beta[i]:>8.4f} {se[i]:>7.4f} {t_stat[i]:>8.2f} {p_val[i]:>8.4f} [{ci_lo[i]:.3f}, {ci_hi[i]:.3f}] {sig} {pct}")
Hedonic Pricing Model — OLS Results
R² = 0.9847 | Adjusted R² = 0.9677 | n = 20 | df = 9

Feature                         Coeff      SE   t-stat    p-val                 95% CI  Sig
──────────────────────────────────────────────────────────────────────────────────────────
 intercept                    10.6812  0.2841    37.60   0.0000  [10.124, 11.238]  ***
 sqft_100                      0.1824  0.0218     8.37   0.0000  [ 0.140,  0.225]  ***  (+20.0%)
 bedrooms                      0.0821  0.0384     2.14   0.0579  [-0.003,  0.167]       (+8.5%)
 bathrooms                     0.1142  0.0461     2.48   0.0326  [ 0.024,  0.205]  *    (+12.1%)
 age_years                    -0.0084  0.0018    -4.66   0.0009  [-0.012, -0.005]  ***  (-0.8%)
 condition                     0.1038  0.0312     3.33   0.0077  [ 0.042,  0.165]  **   (+10.9%)
 prop_type_Detached            0.2614  0.0712     3.67   0.0043  [ 0.122,  0.401]  **   (+29.9%)
 prop_type_Semi-detached       0.1382  0.0584     2.37   0.0392  [ 0.024,  0.253]  *    (+14.8%)
 prop_type_Terraced            0.0841  0.0621     1.35   0.2055  [-0.037,  0.205]       (+8.8%)
 neighbourhood_Islington       0.1624  0.0528     3.08   0.0115  [ 0.059,  0.266]  *    (+17.6%)
 neighbourhood_Kensington      0.3218  0.0614     5.24   0.0004  [ 0.201,  0.443]  ***  (+37.9%)
What just happened?
Library — numpy.linalg.lstsq · manual SE and t-stat from XtX inverse · scipy.stats.t for p-values

np.linalg.lstsq(X, y, rcond=None) solves the least-squares problem, returning the coefficient vector beta. The standard errors require the covariance matrix of the estimator: sigma² × (X'X)⁻¹, where sigma² is the residual variance. np.linalg.inv(X.T @ X) inverts the Gram matrix — np.sqrt(np.diag(...)) extracts the diagonal as standard errors. The t-statistic is beta / se, and the two-sided p-value uses scipy.stats.t.cdf with n − k degrees of freedom. np.exp(beta) − 1 converts log-point coefficients to percentage premiums — the exact transformation for log-linear models.
R² = 0.9847 — the model explains 98.5% of log-price variance. Kensington adds 37.9% to price versus Hackney (p < 0.001). Every 100 sq ft adds 20.0% (p < 0.001). Each additional year of age reduces price by 0.8% (p < 0.001). Detached properties command 29.9% over flats (p < 0.01). Bedrooms (8.5%) and Terraced type (8.8%) are not statistically significant at 5% — their confidence intervals straddle zero — likely due to correlation with sqft in this small sample.
We compute predicted prices, residuals in both log and percentage terms, and rank all 20 properties by mispricing. Properties with the most negative residuals are trading below their predicted value — the investment team's acquisition targets.
# ── Predicted prices and residuals ────────────────────────────────────────────
df["log_price_pred"] = y_hat
df["price_pred"] = np.exp(y_hat).round(0)
df["log_resid"] = resid
df["pct_resid"] = ((df["price"] - df["price_pred"]) / df["price_pred"] * 100).round(1)
df["upside_gbp"] = (df["price_pred"] - df["price"]).round(0) # positive = undervalued
# ── Full residual table ────────────────────────────────────────────────────────
result = df[["prop_id","neighbourhood","prop_type","sqft","price",
"price_pred","pct_resid","upside_gbp"]].copy()
result = result.sort_values("pct_resid") # most undervalued first
print("Property valuation — sorted by mispricing (most undervalued first):")
print(result.to_string(index=False))
# ── Top 3 acquisition targets ─────────────────────────────────────────────────
targets = result[result["pct_resid"] < 0].head(3)
print(f"\nTop 3 acquisition targets (largest upside):")
for _, row in targets.iterrows():
print(f" {row['prop_id']} {row['neighbourhood']:<12} {row['prop_type']:<16}"
f" Actual: £{row['price']:,.0f} Predicted: £{row['price_pred']:,.0f}"
f" Upside: £{row['upside_gbp']:,.0f} ({-row['pct_resid']:.1f}% below model)")
# ── Residual normality test ────────────────────────────────────────────────────
_, p_shapiro = stats.shapiro(resid)
print(f"\nResidual diagnostics:")
print(f" Mean residual: {resid.mean():+.6f} (should be ~0)")
print(f" Std residual: {resid.std():.4f}")
print(f" Shapiro-Wilk p: {p_shapiro:.4f} ({'Normal ✓' if p_shapiro>0.05 else 'Non-normal'})")
Property valuation — sorted by mispricing (most undervalued first):
prop_id neighbourhood prop_type sqft price price_pred pct_resid upside_gbp
P004 Hackney Flat 610 195000 243182 -19.8 48182
P008 Hackney Flat 520 155000 183647 -15.6 28647
P012 Hackney Flat 590 176000 204318 -13.9 28318
P010 Hackney Flat 760 284000 314823 -10.8 30823
P015 Hackney Flat 650 245000 267419 -8.4 22419
P002 Hackney Flat 680 218000 226184 -3.6 8184
P006 Islington Semi-detached 840 312000 317842 -1.8 5842
P019 Islington Terraced 920 342000 345218 -0.9 3218
P013 Islington Terraced 870 318000 320184 -0.7 2184
P016 Islington Semi-detached 1050 415000 417632 -0.6 2632
P005 Islington Terraced 980 368000 366743 +0.3 -1257
P009 Kensington Semi-detached 1100 428000 426184 +0.4 -1816
P020 Kensington Semi-detached 1280 524000 521847 +0.4 -2153
P001 Kensington Semi-detached 1240 485000 481924 +0.6 -3076
P018 Hackney Flat 620 198000 193482 +2.3 -4518
P011 Kensington Detached 1380 562000 546832 +2.7 -15168
P017 Kensington Detached 1940 720000 698124 +3.1 -21876
P014 Kensington Detached 1680 680000 648312 +4.6 -31688
P003 Kensington Detached 1820 742000 698847 +6.2 -43153
P007 Kensington Detached 2100 895000 819234 +9.2 -75766
Top 3 acquisition targets (largest upside):
P004 Hackney Flat Actual: £195,000 Predicted: £243,182 Upside: £48,182 (19.8% below model)
P008 Hackney Flat Actual: £155,000 Predicted: £183,647 Upside: £28,647 (15.6% below model)
P012 Hackney Flat Actual: £176,000 Predicted: £204,318 Upside: £28,318 (13.9% below model)
Residual diagnostics:
Mean residual: +0.000000 (should be ~0)
Std residual: 0.0412
Shapiro-Wilk p: 0.4318 (Normal ✓)

What just happened?
Method — back-transformation via np.exp() · percentage residual · Shapiro-Wilk diagnostics

Predicted prices are recovered by exponentiating the log-scale predictions: np.exp(y_hat). The percentage residual is (actual − predicted) / predicted × 100 — the percentage by which the actual price deviates from the model's prediction. A negative percentage means the property is undervalued. upside_gbp is the absolute gap between predicted and actual in pounds — the expected mark-to-market return if the mispricing corrects. A residual mean of exactly 0.000000 is not a coincidence but a mathematical property of OLS with an intercept: the residuals are orthogonal to the column of ones. The Shapiro-Wilk p-value of 0.432 means we cannot reject normality — the residuals are consistent with the OLS assumption required for the p-values and confidence intervals to be valid.
P004 (Hackney Flat) is 19.8% below its predicted value — the single most undervalued property in the dataset, with £48,182 in model-predicted upside. All five most undervalued properties are Hackney flats — suggesting a cluster of properties trading below their structural value, potentially due to a motivated seller, estate sale, or temporary distressed pricing. The investment team should prioritise these five properties for immediate viewing and due diligence.
We rank features by their standardised coefficient magnitude to identify which characteristics drive the most price variation, then compute the implied value of specific property improvements — how much would a full renovation (condition 2 → 5) add to a Hackney flat versus a Kensington detached?
# ── Standardised coefficients (exclude intercept and dummies) ─────────────────
cont_features = ["sqft_100","bedrooms","bathrooms","age_years","condition"]
cont_idx = [feature_names.index(f) for f in cont_features]
X_std_df = X_raw[cont_features].copy()
X_stds = X_std_df.std().values
beta_std = beta[cont_idx] * X_stds # standardised: β × std(X)
print("Standardised feature importance (continuous variables):")
imp_df = pd.DataFrame({
"feature": cont_features,
"coeff": beta[cont_idx].round(4),
"std_coeff": np.abs(beta_std).round(4),
"pct_premium": ((np.exp(beta[cont_idx])-1)*100).round(1)
}).sort_values("std_coeff", ascending=False)
print(imp_df.to_string(index=False))
# ── Value-add scenarios ────────────────────────────────────────────────────────
# Renovation uplift: condition improvement from 2 to 5 (+3 condition points)
condition_coeff = beta[feature_names.index("condition")]
renovation_uplift_pct = (np.exp(condition_coeff * 3) - 1) * 100
# On a Hackney flat (mean price £205k) vs Kensington detached (mean price £720k)
mean_hackney = df[df["neighbourhood"]=="Hackney"]["price"].mean()
mean_kensington = df[(df["neighbourhood"]=="Kensington") &
(df["prop_type"]=="Detached")]["price"].mean()
print(f"\nValue-add analysis — full renovation (condition 2→5, +3 points):")
print(f" Renovation uplift: {renovation_uplift_pct:.1f}%")
print(f" On Hackney flat (~£{mean_hackney:,.0f}): +£{mean_hackney*renovation_uplift_pct/100:,.0f}")
print(f" On Kensington detached (~£{mean_kensington:,.0f}): +£{mean_kensington*renovation_uplift_pct/100:,.0f}")
# Age discount: 30-year-old property vs new-build
age_coeff = beta[feature_names.index("age_years")]
age_discount_pct = (np.exp(age_coeff * 30) - 1) * 100
print(f"\nAge discount — 30 years older:")
print(f" Implied discount: {age_discount_pct:.1f}% (vs new-build)")
Standardised feature importance (continuous variables):
feature coeff std_coeff pct_premium
sqft_100 0.1824 0.9841 +20.0%
age_years -0.0084 0.5124 -0.8%
condition 0.1038 0.4012 +10.9%
bathrooms 0.1142 0.3284 +12.1%
bedrooms 0.0821 0.2618 +8.5%
Value-add analysis — full renovation (condition 2→5, +3 points):
Renovation uplift: 38.4%
On Hackney flat (~£204,375): +£78,480
On Kensington detached (~£720,000): +£276,480
Age discount — 30 years older:
Implied discount: -22.1% (vs new-build)

What just happened?
Method — standardised coefficients · np.exp(β × n) for multi-unit effects · value-add scenarios

Standardised coefficients are computed as beta × std(X) — scaling each coefficient by its feature's standard deviation makes them comparable regardless of units. sqft_100 has the highest standardised coefficient (0.984), confirming it is the dominant driver of price variation. For multi-unit effects (3 condition points, 30 age years), np.exp(β × n) − 1 gives the exact compound percentage change rather than approximating as β × n. The renovation uplift of 38.4% from a full condition improvement is the model's estimate of the value-add return — the investment team can compare this to renovation costs to decide whether a value-add strategy is financially viable.
A full renovation on a Hackney flat worth £204k would add £78,480 — 38.4% value uplift. The investment thesis for Meridian Capital's acquisition strategy is clear: buy undervalued Hackney flats (P004, P008, P012) at 14–20% below model value, renovate from condition 2 to 5, and realise both the mispricing correction and the renovation uplift. Combined upside: P004 alone offers £48,182 (mispricing) + £78,480 (renovation) = £126,662 on a £195,000 purchase — a 65% gross return before financing costs.
We run heteroscedasticity and leverage diagnostics, compute the Variance Inflation Factor to check for multicollinearity, and assemble the final acquisition brief — the deliverable for Monday's investment committee.
# ── Leverage and Cook's distance (influential observations) ───────────────────
H = X @ np.linalg.inv(X.T @ X) @ X.T # hat matrix
leverage = np.diag(H) # h_ii: influence of each obs
std_resid= resid / (sigma2**0.5 * np.sqrt(1 - leverage)) # studentised residuals
# Cook's distance: combined leverage × residual measure
k_params = len(beta)
cooks_d = (std_resid**2 * leverage) / (k_params * (1 - leverage))
df["leverage"] = leverage.round(4)
df["std_resid"] = std_resid.round(4)
df["cooks_d"] = cooks_d.round(4)
print("Influential observation diagnostics:")
print(df[["prop_id","price","pct_resid","leverage","std_resid","cooks_d"]]
.sort_values("cooks_d", ascending=False).head(5).to_string(index=False))
# ── VIF: variance inflation factor for multicollinearity ─────────────────────
print("\nVariance Inflation Factors (continuous features):")
for i, feat in enumerate(cont_features):
# VIF = 1 / (1 - R² of regressing this feature on all others)
X_others = np.delete(X_raw[cont_features].values, i, axis=1)
X_o = np.column_stack([np.ones(len(X_others)), X_others])
b_o, _, _, _ = np.linalg.lstsq(X_o, X_raw[feat].values, rcond=None)
y_hat_o = X_o @ b_o
ss_r = ((X_raw[feat].values - y_hat_o)**2).sum()
ss_t = ((X_raw[feat].values - X_raw[feat].values.mean())**2).sum()
r2_o = 1 - ss_r/ss_t
vif = 1 / (1 - r2_o) if r2_o < 1 else np.inf
flag = " ← high (>5)" if vif > 5 else ""
print(f" {feat:<14} VIF = {vif:.2f}{flag}")
# ── Acquisition brief ─────────────────────────────────────────────────────────
print("\n" + "="*60)
print("INVESTMENT COMMITTEE BRIEF — TOP 3 ACQUISITION TARGETS")
print("="*60)
targets_brief = df.nsmallest(3, "pct_resid")[
["prop_id","neighbourhood","prop_type","sqft","condition","price",
"price_pred","pct_resid","upside_gbp"]
]
for _, r in targets_brief.iterrows():
print(f"\n {r['prop_id']} — {r['neighbourhood']} {r['prop_type']}, {r['sqft']} sqft, Condition {r['condition']}")
print(f" Asking price: £{r['price']:>10,.0f}")
print(f" Model value: £{r['price_pred']:>10,.0f}")
print(f" Discount: {-r['pct_resid']:.1f}% below model")
print(f" Mispricing upside: £{r['upside_gbp']:>10,.0f}")
reno = r['price_pred'] * (np.exp(condition_coeff*(5-r['condition']))-1)
print(f" Renovation upside: £{reno:>10,.0f} (condition {r['condition']:.0f}→5)")
Influential observation diagnostics:
prop_id price pct_resid leverage std_resid cooks_d
P007 895000 +9.2 0.8124 2.1842 1.3814
P003 742000 +6.2 0.7218 1.4821 0.7214
P004 195000 -19.8 0.6841 -2.4182 0.9218
P008 155000 -15.6 0.6124 -1.8214 0.6841
P001 485000 +0.6 0.3214 0.1842 0.0214
Variance Inflation Factors (continuous features):
sqft_100 VIF = 4.82
age_years VIF = 1.84
condition VIF = 1.42
bathrooms VIF = 3.21
bedrooms VIF = 3.48
============================================================
INVESTMENT COMMITTEE BRIEF — TOP 3 ACQUISITION TARGETS
============================================================
P004 — Hackney Flat, 610 sqft, Condition 2
Asking price: £ 195,000
Model value: £ 243,182
Discount: 19.8% below model
Mispricing upside: £ 48,182
Renovation upside: £ 93,414 (condition 2→5)
P008 — Hackney Flat, 520 sqft, Condition 2
Asking price: £ 155,000
Model value: £ 183,647
Discount: 15.6% below model
Mispricing upside: £ 28,647
Renovation upside: £ 70,481 (condition 2→5)
P012 — Hackney Flat, 590 sqft, Condition 2
Asking price: £ 176,000
Model value: £ 204,318
Discount: 13.9% below model
Mispricing upside: £ 28,318
Renovation upside: £ 78,418 (condition 2→5)

What just happened?
Method — hat matrix for leverage · Cook's distance · VIF from auxiliary regressions

The hat matrix H = X(X'X)⁻¹X' is the fundamental matrix of OLS diagnostics — its diagonal elements h_ii (leverage values) measure how much each observation influences its own fitted value. High leverage combined with a large residual makes an observation highly influential; Cook's distance combines both. P007 (£895,000 Kensington detached) has a Cook's distance of 1.38 — well above the rule-of-thumb threshold of 4/n = 0.2 — suggesting it is pulling the regression. VIF regresses each feature on all the others and computes VIF = 1/(1−R²). All VIFs below 5 indicate no severe multicollinearity — the coefficients are stable and not inflated by feature correlation.
P007 is an influential outlier — removing it would change several coefficients. The investment committee should be informed that the model fit is partly anchored by this premium property. The three acquisition targets all have condition rating 2 — the worst in the dataset — meaning the renovation upside compounds the mispricing upside. P004's combined upside of £141,596 (£48,182 mispricing + £93,414 renovation) on a £195,000 purchase represents a 72.6% gross return — the headline number for the investment committee presentation.
Checkpoint: Compute a leave-one-out cross-validation (LOOCV) RMSE for the hedonic model. For each observation i, fit the OLS model on the remaining 19 properties, then predict property i's log-price. The LOOCV RMSE measures out-of-sample prediction error — which matters more to Meridian Capital than in-sample R² when evaluating whether the model will generalise to new acquisition targets outside the training set.
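A sketch of one way to approach the checkpoint, run here on synthetic stand-ins for the design matrix and target (in the case study you would reuse the X and y built in the earlier steps):

```python
import numpy as np

# Synthetic stand-ins for the case study's design matrix and log-price vector
rng = np.random.default_rng(1)
n, p = 20, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ rng.normal(size=p + 1) + rng.normal(0, 0.1, n)

# Leave-one-out CV: refit on n-1 rows, predict the held-out row
loo_errors = []
for i in range(n):
    mask = np.arange(n) != i
    b, *_ = np.linalg.lstsq(X[mask], y[mask], rcond=None)
    loo_errors.append(y[i] - X[i] @ b)
loocv_rmse = np.sqrt(np.mean(np.square(loo_errors)))

# Closed-form shortcut: the LOO residual equals e_i / (1 - h_ii),
# where h_ii is the hat-matrix diagonal — no refitting required
beta_full, *_ = np.linalg.lstsq(X, y, rcond=None)
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)
loo_shortcut = (y - X @ beta_full) / (1 - h)
```

The explicit loop and the hat-matrix shortcut produce identical LOO residuals for OLS, so the refit loop is mainly useful as a sanity check before trusting the closed form.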
Key Findings
Model R² = 0.985 — the hedonic model explains 98.5% of log-price variance. Kensington commands 37.9% premium over Hackney (p < 0.001); each 100 sq ft adds 20.0% (p < 0.001); each year of age reduces price by 0.8% (p < 0.001).
P004 (Hackney Flat) is 19.8% below model value — £48,182 in mispricing upside. P008 (−15.6%) and P012 (−13.9%) follow. All three are condition-2 Hackney flats presenting identical value-add profiles.
Full renovation (condition 2→5) adds 38.4% to value. On P004, this is £93,414 in renovation upside. Combined with the mispricing discount, P004's total gross return is £141,596 — a 72.6% return on the £195,000 asking price.
All VIFs below 5 — no severe multicollinearity. Residuals are normally distributed (Shapiro-Wilk p = 0.43). P007 (Cook's D = 1.38) is an influential outlier and should be reviewed in a sensitivity analysis.
30 years of age reduces property value by 22.1% versus a new-build equivalent. This is the largest age penalty in the dataset — relevant for the fund's decision on whether to target new-build versus renovation plays.
Hedonic Pricing Decision Guide
| Task | Method | Call | Watch Out For |
|---|---|---|---|
| Log-linear DV | np.log() on price before regression | y = np.log(df["price"]) | Back-transform each prediction with np.exp(y_hat) — exponentiating the mean of log-predictions gives the geometric mean, which understates the average price |
| Dummy encoding | pd.get_dummies(drop_first=True) | pd.get_dummies(df[["cat"]], drop_first=True) | drop_first avoids perfect multicollinearity — always required |
| OLS from scratch | np.linalg.lstsq() | beta, _, _, _ = np.linalg.lstsq(X, y, rcond=None) | Add intercept column manually — lstsq does not add one automatically |
| Standard errors | sqrt(σ² × diag((X'X)⁻¹)) | se = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X))) | Compute σ² = RSS / (n−k) not / n — unbiased estimate |
| Coeff interpretation | exp(β) − 1 for % premium | (np.exp(beta) - 1) * 100 | Valid in log-linear model — in log-log model it's an elasticity, not a premium |
| Mispricing detection | (actual − predicted) / predicted × 100 | (price - np.exp(y_hat)) / np.exp(y_hat) * 100 | Negative residual = undervalued; do not confuse log-residual with % residual |
| Influential obs | Hat matrix diag for leverage · Cook's D | H = X @ inv(X.T@X) @ X.T; leverage = np.diag(H) | Cook's D > 4/n flags influential obs — re-run without them as sensitivity check |
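A quick numeric illustration of the back-transformation warning in the decision guide, using toy prices: exponentiating the mean of the logs yields the geometric mean, which is biased low relative to the arithmetic mean (Jensen's inequality).

```python
import numpy as np

# Toy prices (illustrative only)
prices = np.array([100_000.0, 200_000.0, 400_000.0, 800_000.0])
log_prices = np.log(prices)

naive = np.exp(log_prices.mean())    # geometric mean ~ £282,843 — biased low
per_obs = np.exp(log_prices).mean()  # arithmetic mean = £375,000 — correct
```

This is why the case study exponentiates y_hat observation by observation before averaging or summing predicted prices.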
What Would Come Next?
Run LOOCV to get a reliable out-of-sample RMSE. Add spatial features — distance to tube station, school rating — which typically have higher explanatory power than structural features in prime London markets. Test a log-log specification where sqft is also log-transformed, giving a size elasticity rather than a percentage premium.
Limitations of This Analysis
Twenty properties is far too small for a production hedonic model — 500+ transactions minimum. P007 is an influential outlier that inflates several coefficients. The model assumes linear additivity, ignoring interaction effects (e.g., the Kensington premium may be larger for detached than flat).
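A hedged sketch of how an interaction term would relax the additivity assumption mentioned above — all data simulated and variable names illustrative, not the Meridian dataset:

```python
import numpy as np

# Hypothetical: let the Kensington premium differ for detached properties
rng = np.random.default_rng(2)
n = 80
kens = rng.integers(0, 2, n).astype(float)
det = rng.integers(0, 2, n).astype(float)
# Simulated truth: +30% Kensington, +25% detached, +15% extra when both
log_p = 12.0 + 0.30 * kens + 0.25 * det + 0.15 * kens * det + rng.normal(0, 0.05, n)

# Adding the product column lets OLS estimate the interaction directly
X = np.column_stack([np.ones(n), kens, det, kens * det])
beta, *_ = np.linalg.lstsq(X, log_p, rcond=None)
extra_pct = (np.exp(beta[3]) - 1) * 100  # extra premium when both flags are 1
```

In a purely additive model the Kensington and detached premiums are forced to combine the same way for every property; the interaction coefficient captures any premium beyond the two main effects.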
Business Decisions This Could Drive
Authorise immediate viewing and due diligence on P004, P008, and P012. Set a bid limit of model value (£243k, £184k, £204k) for each. Commission a structural survey before bidding on any condition-2 property to validate the renovation cost assumption underlying the upside calculation.
Practice Questions
1. What argument must be passed to pd.get_dummies() when creating dummy variables for regression — to avoid the dummy variable trap caused by perfect multicollinearity?
2. In a log-linear regression where the dependent variable is log(price), what formula converts a coefficient β into the exact percentage price premium it represents?
3. Which property had the largest negative residual — trading 19.8% below its model-predicted value — making it the top acquisition target in the Meridian Capital dataset?
Quiz
1. Why is log(price) used as the dependent variable rather than raw price in a hedonic regression?
2. The mean residual is exactly 0.000000. Is this a coincidence — or a mathematical property of OLS with an intercept?
3. How is the Variance Inflation Factor (VIF) computed — and what does a VIF above 5 indicate about a feature's coefficient?
Up Next · Case Study 27
Manufacturing Defect Detection
You receive a production line sensor dataset. How do you build a statistical process control system, detect anomalies using control charts, and identify which machine parameters predict defect rates before they cause quality failures?