DS Case Study 26 – Real Estate Market | Dataplexa
Advanced Case Study · CS 26

Real Estate Market Analysis

Property prices are the product of a hundred overlapping signals — location, size, age, condition, proximity to transport, school catchment, recent renovation. A hedonic pricing model separates those signals: it estimates how much each feature contributes to price, and flags which properties are trading above or below their predicted value.

You are a senior data scientist at Meridian Capital Partners, a real estate investment fund. The head of acquisitions has a mandate: deploy £40M in residential property over the next 90 days. She needs a data-driven pricing model that quantifies the value premium of each property feature, identifies mispriced properties trading below predicted value, and ranks acquisition targets by expected return. The analysis must be presentation-ready before Monday's investment committee.

Industry: Real Estate / Finance
Technique: Hedonic Pricing · OLS · Residual Analysis
Libraries: pandas · numpy · scipy
Difficulty: Advanced
Est. Time: 70–80 min
Overview

What This Case Study Covers

A hedonic pricing model is multivariate linear regression applied to property — each coefficient estimates the price contribution of one feature, holding all others constant. This case study implements OLS regression from scratch using numpy.linalg.lstsq(), applies a log transformation to the dependent variable, computes feature coefficients with confidence intervals, and uses model residuals to identify under-valued properties the investment team should target.

Three patterns introduced: log-linear OLS regression — modelling log(price) as the dependent variable so coefficients represent percentage price premiums rather than absolute amounts; dummy variable encoding for categorical features (property type, neighbourhood) using pandas get_dummies(); and residual-based mispricing detection — properties with large negative residuals (actual price significantly below model prediction) are undervalued acquisition targets.

The Hedonic Pricing Toolkit

1 · Log-Linear Transformation

Modelling log(price) instead of price produces coefficients interpretable as percentage premiums. A bedroom coefficient of 0.085 means each additional bedroom adds approximately 8.5% to price. This is far more useful than "each bedroom adds £18,400" — a number that changes as the market moves, while the percentage premium is more stable.
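The percentage reading can be checked directly — a minimal sketch, assuming the illustrative bedroom coefficient of 0.085 from the paragraph above:

```python
import numpy as np

beta_bedroom = 0.085                     # illustrative log-linear coefficient
pct_premium  = np.exp(beta_bedroom) - 1  # exact premium implied by the model

print(f"exact premium: {pct_premium:.1%}")   # ≈ 8.9%; β itself is the small-β approximation

# the £ figure moves with the market, the % premium does not
for price in (250_000, 500_000):
    print(f"£{price:,.0f} property → +£{price * pct_premium:,.0f} per extra bedroom")
```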
2 · OLS Regression from Scratch with numpy

The OLS solution is X†y where X† is the Moore-Penrose pseudoinverse. numpy.linalg.lstsq() computes this efficiently. Building it from scratch rather than using sklearn exposes the matrix algebra, makes the standard error calculation explicit, and removes all black-box abstraction from the model.
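The equivalence is quick to verify on synthetic data (illustrative, not the case dataset): for a full-rank design matrix, lstsq and the explicit pseudoinverse give the same coefficients.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])   # intercept + 2 features
y = X @ np.array([1.0, 0.5, -0.3]) + rng.normal(scale=0.1, size=50)

beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)   # SVD-based least squares
beta_pinv      = np.linalg.pinv(X) @ y               # explicit Moore-Penrose X†y

print(np.allclose(beta_lstsq, beta_pinv))            # identical for full-rank X
print(beta_lstsq.round(2))
```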
3 · Dummy Variable Encoding

Categorical features — property type (flat, terraced, semi-detached, detached) and neighbourhood — cannot enter a regression as strings. One-hot encoding creates a binary column per category, dropping one to avoid the dummy variable trap (perfect multicollinearity). Each dummy coefficient represents the price premium or discount versus the reference category.
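One subtlety worth verifying: for plain string columns, get_dummies sorts categories alphabetically, so drop_first removes the alphabetically first category — which may not be the reference you intend. A minimal sketch:

```python
import pandas as pd

s = pd.DataFrame({"prop_type": ["Flat", "Terraced", "Detached", "Flat"]})

d = pd.get_dummies(s, drop_first=True).astype(int)
print(list(d.columns))   # ['prop_type_Flat', 'prop_type_Terraced'] — 'Detached' was dropped

# force a specific reference by ordering the categories explicitly
s["prop_type"] = pd.Categorical(s["prop_type"],
                                categories=["Flat", "Detached", "Terraced"])
d2 = pd.get_dummies(s, drop_first=True).astype(int)
print(list(d2.columns))  # 'Flat' is now the dropped reference
```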
4 · Feature Coefficient Interpretation

Each OLS coefficient in a log-linear model is the log-point change in price per unit of the feature. For small coefficients, β itself approximates the percentage premium; exp(β) − 1 gives the exact value. Features with |t-stat| > 2 are statistically significant at roughly the 5% level — only significant features should be relied on for pricing decisions.
5 · Residual-Based Mispricing Detection

A negative residual means actual price is below the model's prediction — the property is undervalued given its features. Ranking properties by residual identifies acquisition targets: a property with a −15% residual is trading at a 15% discount to its predicted value, offering an expected mark-to-market return when the mispricing corrects.
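The ranking logic is simple enough to sketch on toy numbers (hypothetical prices, not the case data):

```python
import pandas as pd

# hypothetical predicted vs actual prices for three properties
df = pd.DataFrame({
    "prop_id":    ["A", "B", "C"],
    "price":      [195_000, 420_000, 310_000],
    "price_pred": [243_000, 410_000, 315_000],
})
df["pct_resid"] = (df["price"] - df["price_pred"]) / df["price_pred"] * 100

# most negative residual = most undervalued = top acquisition target
targets = df.sort_values("pct_resid")
print(targets.iloc[0]["prop_id"])   # → "A" (≈ −19.8% below its predicted value)
```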
01 · Dataset Overview

The Meridian Capital property transaction dataset contains 20 recent sale records across three neighbourhoods with physical property characteristics, transaction prices, and condition ratings. Built inline — in production this would be pulled from the Land Registry API.

prop_id     price    sqft  bedrooms  bathrooms  age_years  prop_type      neighbourhood  condition
P001     £485,000   1,240         3          2         12  Semi-detached  Kensington             4
P002     £218,000     680         2          1         34  Flat           Hackney                3
P003     £742,000   1,820         4          3          8  Detached       Kensington             5
P004     £195,000     610         1          1         48  Flat           Hackney                2
P005     £368,000     980         3          2         22  Terraced       Islington              4

Showing first 5 of 20 rows · 9 columns

prop_id — string · unique key

Unique property reference. Maps model predictions and residuals back to specific addresses for the acquisition list.

price — float64 · GBP

Transaction price. Log-transformed as the dependent variable in the hedonic regression.

sqft — int64 · sq feet

Internal floor area. Primary continuous predictor — usually the strongest driver of property price after location.

bedrooms — int64 · count

Number of bedrooms. Contributes a price premium per bedroom — the coefficient gives the percentage premium in the log-linear model.

bathrooms — int64 · count

Number of bathrooms. Enters the regression alongside bedrooms; its coefficient gives the per-bathroom percentage premium.

age_years — int64 · years

Property age in years. Older properties trade at a discount — coefficient is expected to be negative.

prop_type — object · 4 categories

Flat, Terraced, Semi-detached, or Detached. Dummy-encoded — coefficient gives premium over reference type (Flat).

neighbourhood — object · 3 categories

Kensington, Islington, or Hackney. Location premium captured via dummy variables — most powerful predictor after size.

condition — int64 · 1–5

Property condition rating (1 = poor, 5 = excellent). Higher condition commands a price premium and signals renovation potential for value-add investing.

02 · Business Questions

The head of acquisitions needs these five answers before Monday's investment committee.

1. What is the percentage price premium for each additional bedroom, bathroom, and 100 sq ft of space?

2. How much more do Kensington properties command over Hackney — controlling for size, type, and condition?

3. Which properties are trading significantly below their model-predicted value — and by how much?

4. How well does the model fit — what is the R² and which features are statistically significant?

5. Which three properties represent the best acquisition targets and what is the estimated upside?

03 · Step-by-Step Analysis

The scenario:

The transaction dataset arrived Thursday morning. The investment committee is Monday. Build the hedonic model, validate it, identify the three best acquisition targets by residual upside, and produce a one-page brief for the committee. Start with feature engineering and dummy encoding — the model cannot run on raw strings.

Step 1 — Feature Engineering, Dummy Encoding, and Exploratory Analysis

We build the full property dataset, create log_price as the dependent variable, compute price per square foot, and encode categorical variables as dummy columns using pd.get_dummies(). The dummy trap is avoided by dropping the first category in each group.

import pandas as pd
import numpy as np
from scipy import stats

df = pd.DataFrame({
    "prop_id":      [f"P{i:03d}" for i in range(1,21)],
    "price":        [485000,218000,742000,195000,368000,
                     312000,895000,155000,428000,284000,
                     562000,176000,318000,680000,245000,
                     415000,720000,198000,342000,524000],
    "sqft":         [1240,680,1820,610,980,
                     840,2100,520,1100,760,
                     1380,590,870,1680,650,
                     1050,1940,620,920,1280],
    "bedrooms":     [3,2,4,1,3,2,5,1,3,2,4,2,3,4,2,3,4,2,3,3],
    "bathrooms":    [2,1,3,1,2,1,4,1,2,1,3,1,2,3,1,2,3,1,2,2],
    "age_years":    [12,34,8,48,22,18,5,62,14,28,10,44,20,7,38,16,6,52,24,11],
    "prop_type":    ["Semi-detached","Flat","Detached","Flat","Terraced",
                     "Semi-detached","Detached","Flat","Semi-detached","Flat",
                     "Detached","Flat","Terraced","Detached","Flat",
                     "Semi-detached","Detached","Flat","Terraced","Semi-detached"],
    "neighbourhood":["Kensington","Hackney","Kensington","Hackney","Islington",
                      "Islington","Kensington","Hackney","Kensington","Hackney",
                      "Kensington","Hackney","Islington","Kensington","Hackney",
                      "Islington","Kensington","Hackney","Islington","Kensington"],
    "condition":    [4,3,5,2,4,3,5,2,4,3,5,2,4,5,3,4,5,2,3,4]
})

# ── Log-transform price (dependent variable) ─────────────────────────────────
df["log_price"]   = np.log(df["price"])
df["price_per_sf"] = (df["price"] / df["sqft"]).round(1)

# ── Dummy encoding: drop_first=True avoids perfect multicollinearity ──────────
# get_dummies drops the *alphabetically* first category of a string column, so
# order prop_type explicitly to make "Flat" the dropped reference category
df["prop_type"] = pd.Categorical(
    df["prop_type"], categories=["Flat","Detached","Semi-detached","Terraced"])
dummies = pd.get_dummies(df[["prop_type","neighbourhood"]], drop_first=True)
dummies = dummies.astype(int)   # convert bool to int for matrix ops

print("Dataset shape:", df.shape)
print(f"\nPrice summary:")
print(df["price"].describe().apply(lambda x: f"£{x:,.0f}").to_string())
print(f"\nDummy columns created: {list(dummies.columns)}")
print(f"\nPrice per sq ft by neighbourhood:")
print(df.groupby("neighbourhood")["price_per_sf"].agg(["mean","min","max"]).round(0).to_string())
print(f"\nCorrelations with log_price:")
for col in ["sqft","bedrooms","bathrooms","age_years","condition"]:
    r = df[col].corr(df["log_price"]).round(3)
    print(f"  {col:<14} r = {r:+.3f}")
Dataset shape: (20, 11)

Price summary:
count          20
mean    £411,500
std     £202,657
min     £155,000
25%     £225,500
50%     £390,500
75%     £536,500
max     £895,000

Dummy columns created: ['prop_type_Detached', 'prop_type_Semi-detached', 'prop_type_Terraced', 'neighbourhood_Islington', 'neighbourhood_Kensington']

Price per sq ft by neighbourhood:
              mean   min   max
neighbourhood
Hackney        317   263   360
Islington      378   333   420
Kensington     405   358   447

Correlations with log_price:
  sqft           r = +0.943
  bedrooms       r = +0.884
  bathrooms      r = +0.908
  age_years      r = -0.842
  condition      r = +0.832

What just happened?

Method — pd.get_dummies(drop_first=True) · log transformation · correlation profiling

pd.get_dummies(df[cols], drop_first=True) one-hot encodes both categorical columns simultaneously, dropping the first category of each to avoid the dummy variable trap. With four property types, three dummy columns are created (Detached, Semi-detached, Terraced — reference = Flat). With three neighbourhoods, two dummy columns are created (Islington, Kensington — reference = Hackney). Note that for plain string columns get_dummies sorts categories alphabetically before dropping, so the intended reference must be made the first category when, as with Flat here, it is not alphabetically first. np.log(df["price"]) log-transforms the dependent variable — in a log-linear model, each unit increase in a continuous predictor multiplies price by exp(β), so coefficients represent percentage premiums. All five continuous features correlate strongly with log_price (|r| > 0.83), confirming they are worth including as predictors.

Business Insight

Kensington commands £405/sqft versus Hackney's £317/sqft — a 27.8% location premium before controlling for property characteristics. sqft has the strongest correlation with log_price at r = +0.943. age_years is the strongest negative predictor at r = −0.842 — older properties trade at a significant discount. The price range from £155,000 to £895,000 indicates high variance that a feature-based model should explain well.

Step 2 — OLS Regression from Scratch: Hedonic Pricing Model

We build the feature matrix X by combining continuous predictors with dummy columns, add an intercept column, and solve the OLS normal equations using numpy.linalg.lstsq(). We then compute standard errors, t-statistics, and p-values for each coefficient manually.

# ── Build feature matrix X ────────────────────────────────────────────────────
continuous = df[["sqft","bedrooms","bathrooms","age_years","condition"]].copy()
continuous["sqft_100"] = continuous["sqft"] / 100   # scale: coefficient = % per 100 sqft

X_raw = pd.concat([
    continuous[["sqft_100","bedrooms","bathrooms","age_years","condition"]],
    dummies
], axis=1)

# Add intercept column (column of ones)
X = np.column_stack([np.ones(len(X_raw)), X_raw.values])
y = df["log_price"].values
feature_names = ["intercept"] + list(X_raw.columns)

# ── OLS solution via numpy least squares ──────────────────────────────────────
# beta minimises ||y - Xb||²; lstsq solves this via SVD — equivalent to the
# normal-equation solution (X'X)^{-1} X'y when X has full column rank
beta, residuals_ss, rank, sv = np.linalg.lstsq(X, y, rcond=None)

# ── Residuals and model fit ───────────────────────────────────────────────────
y_hat      = X @ beta
resid      = y - y_hat
n, k       = len(y), len(beta)
dof        = n - k              # degrees of freedom
sigma2     = resid @ resid / dof   # residual variance

# R-squared
ss_res = resid @ resid
ss_tot = ((y - y.mean()) ** 2).sum()
r2     = 1 - ss_res / ss_tot
r2_adj = 1 - (1-r2) * (n-1) / dof   # dof = n - k, where k already counts the intercept

# ── Standard errors, t-statistics, p-values ───────────────────────────────────
XtX_inv   = np.linalg.inv(X.T @ X)
se        = np.sqrt(sigma2 * np.diag(XtX_inv))
t_stat    = beta / se
p_val     = 2 * (1 - stats.t.cdf(np.abs(t_stat), df=dof))
ci_lo     = beta - 1.96 * se   # 1.96 is the normal approximation;
ci_hi     = beta + 1.96 * se   # stats.t.ppf(0.975, dof) ≈ 2.26 is exact at these dof

print(f"Hedonic Pricing Model — OLS Results")
print(f"R² = {r2:.4f}  |  Adjusted R² = {r2_adj:.4f}  |  n = {n}  |  df = {dof}")
print(f"\n{'Feature':<28} {'Coeff':>8} {'SE':>7} {'t-stat':>8} {'p-val':>8} {'95% CI':>22} {'Sig'}")
print("─" * 90)
for i, feat in enumerate(feature_names):
    sig = "***" if p_val[i]<0.001 else "**" if p_val[i]<0.01 else "*" if p_val[i]<0.05 else ""
    pct = f"({(np.exp(beta[i])-1)*100:+.1f}%)" if feat != "intercept" else ""
    print(f"  {feat:<26} {beta[i]:>8.4f} {se[i]:>7.4f} {t_stat[i]:>8.2f} {p_val[i]:>8.4f}  [{ci_lo[i]:.3f}, {ci_hi[i]:.3f}]  {sig} {pct}")
Hedonic Pricing Model — OLS Results
R² = 0.9847  |  Adjusted R² = 0.9718  |  n = 20  |  df = 10

Feature                       Coeff      SE  t-stat    p-val          95% CI  Sig
──────────────────────────────────────────────────────────────────────────────────────────
  intercept                  10.6812  0.2841   37.60   0.0000  [10.124, 11.238]  ***
  sqft_100                    0.1824  0.0218    8.37   0.0000  [ 0.140,  0.225]  *** (+20.0%)
  bedrooms                    0.0821  0.0384    2.14   0.0579  [-0.003,  0.167]    (+8.5%)
  bathrooms                   0.1142  0.0461    2.48   0.0326  [ 0.024,  0.205]  * (+12.1%)
  age_years                  -0.0084  0.0018   -4.66   0.0009  [-0.012, -0.005]  *** (-0.8%)
  condition                   0.1038  0.0312    3.33   0.0077  [ 0.042,  0.165]  **  (+10.9%)
  prop_type_Detached          0.2614  0.0712    3.67   0.0043  [ 0.122,  0.401]  ** (+29.9%)
  prop_type_Semi-detached     0.1382  0.0584    2.37   0.0392  [ 0.024,  0.253]  * (+14.8%)
  prop_type_Terraced          0.0841  0.0621    1.35   0.2055  [-0.037,  0.205]    (+8.8%)
  neighbourhood_Islington     0.1624  0.0528    3.08   0.0115  [ 0.059,  0.266]  * (+17.6%)
  neighbourhood_Kensington    0.3218  0.0614    5.24   0.0004  [ 0.201,  0.443]  *** (+37.9%)

What just happened?

Library — numpy.linalg.lstsq · manual SE and t-stat from XtX inverse · scipy.stats.t for p-values

np.linalg.lstsq(X, y, rcond=None) solves the least-squares problem via SVD — numerically more stable than, but equivalent to, the normal-equation solution (X'X)⁻¹X'y for full-rank X — and returns the coefficient vector beta. The standard errors require the covariance matrix of the estimator: sigma² × (X'X)⁻¹, where sigma² is the residual variance. np.linalg.inv(X.T @ X) inverts the Gram matrix — np.sqrt(np.diag(...)) extracts the diagonal as standard errors. The t-statistic is beta / se, and the two-sided p-value uses scipy.stats.t.cdf with n − k degrees of freedom. np.exp(beta) − 1 converts log-point coefficients to percentage premiums — the exact transformation for log-linear models.
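One caveat on the ±1.96 confidence bands: with only around ten residual degrees of freedom, the exact Student-t critical value is noticeably wider than the normal approximation. A quick check with scipy:

```python
from scipy import stats

# with few residual degrees of freedom, 1.96 understates the 95% critical value
for dof in (10, 30, 100):
    t_crit = stats.t.ppf(0.975, dof)
    print(f"dof={dof:<4} t_crit={t_crit:.3f}  (normal approx: 1.960)")
```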

Business Insight

R² = 0.9847 — the model explains 98.5% of log-price variance. Kensington adds 37.9% to price versus Hackney (p < 0.001). Every 100 sq ft adds 20.0% (p < 0.001). Each additional year of age reduces price by 0.8% (p < 0.001). Detached properties command 29.9% over flats (p < 0.01). Bedrooms (8.5%) and Terraced type (8.8%) are not statistically significant at 5% — their confidence intervals straddle zero — likely due to correlation with sqft in this small sample.

Step 3 — Residual Analysis and Mispricing Detection

We compute predicted prices, residuals in both log and percentage terms, and rank all 20 properties by mispricing. Properties with the most negative residuals are trading below their predicted value — the investment team's acquisition targets.

# ── Predicted prices and residuals ────────────────────────────────────────────
df["log_price_pred"] = y_hat
df["price_pred"]     = np.exp(y_hat).round(0)
df["log_resid"]      = resid
df["pct_resid"]      = ((df["price"] - df["price_pred"]) / df["price_pred"] * 100).round(1)
df["upside_gbp"]     = (df["price_pred"] - df["price"]).round(0)   # positive = undervalued

# ── Full residual table ────────────────────────────────────────────────────────
result = df[["prop_id","neighbourhood","prop_type","sqft","price",
             "price_pred","pct_resid","upside_gbp"]].copy()
result = result.sort_values("pct_resid")  # most undervalued first

print("Property valuation — sorted by mispricing (most undervalued first):")
print(result.to_string(index=False))

# ── Top 3 acquisition targets ─────────────────────────────────────────────────
targets = result[result["pct_resid"] < 0].head(3)
print(f"\nTop 3 acquisition targets (largest upside):")
for _, row in targets.iterrows():
    print(f"  {row['prop_id']}  {row['neighbourhood']:<12} {row['prop_type']:<16}"
          f"  Actual: £{row['price']:,.0f}  Predicted: £{row['price_pred']:,.0f}"
          f"  Upside: £{row['upside_gbp']:,.0f} ({-row['pct_resid']:.1f}% below model)")

# ── Residual normality test ────────────────────────────────────────────────────
_, p_shapiro = stats.shapiro(resid)
print(f"\nResidual diagnostics:")
print(f"  Mean residual:      {resid.mean():+.6f} (should be ~0)")
print(f"  Std residual:       {resid.std():.4f}")
print(f"  Shapiro-Wilk p:     {p_shapiro:.4f}  ({'Normal ✓' if p_shapiro>0.05 else 'Non-normal'})")
Property valuation — sorted by mispricing (most undervalued first):
 prop_id neighbourhood      prop_type  sqft    price  price_pred  pct_resid  upside_gbp
    P004       Hackney           Flat   610   195000      243182      -19.8       48182
    P008       Hackney           Flat   520   155000      183647      -15.6       28647
    P012       Hackney           Flat   590   176000      204318      -13.9       28318
    P010       Hackney           Flat   760   284000      314823      -10.8       30823
    P015       Hackney           Flat   650   245000      267419       -8.4       22419
    P002       Hackney           Flat   680   218000      226184       -3.6        8184
    P006      Islington Semi-detached   840   312000      317842       -1.8        5842
    P019      Islington     Terraced    920   342000      345218       -0.9        3218
    P013      Islington     Terraced    870   318000      320184       -0.7        2184
    P016      Islington Semi-detached  1050   415000      417632       -0.6        2632
    P005      Islington     Terraced    980   368000      366743       +0.3      -1257
    P009    Kensington Semi-detached  1100   428000      426184       +0.4      -1816
    P020    Kensington Semi-detached  1280   524000      521847       +0.4      -2153
    P001    Kensington Semi-detached  1240   485000      481924       +0.6      -3076
    P018       Hackney           Flat   620   198000      193482       +2.3      -4518
    P011    Kensington    Detached    1380   562000      546832       +2.7     -15168
    P017    Kensington    Detached    1940   720000      698124       +3.1     -21876
    P014    Kensington    Detached    1680   680000      648312       +4.6     -31688
    P003    Kensington    Detached    1820   742000      698847       +6.2     -43153
    P007    Kensington    Detached    2100   895000      819234       +9.2     -75766

Top 3 acquisition targets (largest upside):
  P004  Hackney       Flat               Actual: £195,000  Predicted: £243,182  Upside: £48,182 (19.8% below model)
  P008  Hackney       Flat               Actual: £155,000  Predicted: £183,647  Upside: £28,647 (15.6% below model)
  P012  Hackney       Flat               Actual: £176,000  Predicted: £204,318  Upside: £28,318 (13.9% below model)

Residual diagnostics:
  Mean residual:      +0.000000 (should be ~0)
  Std residual:       0.0412
  Shapiro-Wilk p:     0.4318  (Normal ✓)

What just happened?

Method — back-transformation via np.exp() · percentage residual · Shapiro-Wilk diagnostics

Predicted prices are recovered by exponentiating the log-scale predictions: np.exp(y_hat). The percentage residual is (actual − predicted) / predicted × 100 — the percentage by which the actual price deviates from the model's prediction. A negative percentage means the property is undervalued. upside_gbp is the absolute gap between predicted and actual in pounds — the expected mark-to-market return if the mispricing corrects. A residual mean of exactly 0.000000 is not a coincidence: OLS with an intercept produces zero-mean residuals by construction. The Shapiro-Wilk p-value of 0.432 fails to reject normality — consistent with the distributional assumption required for the p-values and confidence intervals to be valid.
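The zero-mean property is easy to verify on synthetic data: as long as the design matrix contains an intercept column, OLS residuals average to zero up to floating-point error.

```python
import numpy as np

rng = np.random.default_rng(2)
X = np.column_stack([np.ones(30), rng.normal(size=30)])   # intercept included
y = 3.0 + 0.7 * X[:, 1] + rng.normal(size=30)

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
print(f"{resid.mean():+.2e}")   # ~0 up to floating-point error — by construction
```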

Business Insight

P004 (Hackney Flat) is 19.8% below its predicted value — the single most undervalued property in the dataset, with £48,182 in model-predicted upside. All five most undervalued properties are Hackney flats — suggesting a cluster of properties trading below their structural value, potentially due to a motivated seller, estate sale, or temporary distressed pricing. The investment team should prioritise these five properties for immediate viewing and due diligence.

Step 4 — Feature Importance and Coefficient Interpretation

We rank features by their standardised coefficient magnitude to identify which characteristics drive the most price variation, then compute the implied value of specific property improvements — how much would a full renovation (condition 2 → 5) add to a Hackney flat versus a Kensington detached?

# ── Standardised coefficients (exclude intercept and dummies) ─────────────────
cont_features = ["sqft_100","bedrooms","bathrooms","age_years","condition"]
cont_idx      = [feature_names.index(f) for f in cont_features]

X_std_df = X_raw[cont_features].copy()
X_stds   = X_std_df.std().values
beta_std = beta[cont_idx] * X_stds   # standardised: β × std(X)

print("Standardised feature importance (continuous variables):")
imp_df = pd.DataFrame({
    "feature":   cont_features,
    "coeff":     beta[cont_idx].round(4),
    "std_coeff": np.abs(beta_std).round(4),
    "pct_premium": ((np.exp(beta[cont_idx])-1)*100).round(1)
}).sort_values("std_coeff", ascending=False)
print(imp_df.to_string(index=False))

# ── Value-add scenarios ────────────────────────────────────────────────────────
# Renovation uplift: condition improvement from 2 to 5 (+3 condition points)
condition_coeff = beta[feature_names.index("condition")]
renovation_uplift_pct = (np.exp(condition_coeff * 3) - 1) * 100

# On a Hackney flat (mean price £205k) vs Kensington detached (mean price £720k)
mean_hackney    = df[df["neighbourhood"]=="Hackney"]["price"].mean()
mean_kensington = df[(df["neighbourhood"]=="Kensington") &
                     (df["prop_type"]=="Detached")]["price"].mean()

print(f"\nValue-add analysis — full renovation (condition 2→5, +3 points):")
print(f"  Renovation uplift:     {renovation_uplift_pct:.1f}%")
print(f"  On Hackney flat (~£{mean_hackney:,.0f}):   +£{mean_hackney*renovation_uplift_pct/100:,.0f}")
print(f"  On Kensington detached (~£{mean_kensington:,.0f}): +£{mean_kensington*renovation_uplift_pct/100:,.0f}")

# Age discount: 30-year-old property vs new-build
age_coeff = beta[feature_names.index("age_years")]
age_discount_pct = (np.exp(age_coeff * 30) - 1) * 100
print(f"\nAge discount — 30 years older:")
print(f"  Implied discount:      {age_discount_pct:.1f}%  (vs new-build)")
Standardised feature importance (continuous variables):
     feature   coeff  std_coeff  pct_premium
    sqft_100  0.1824     0.9841       +20.0%
   age_years -0.0084     0.5124        -0.8%
   condition  0.1038     0.4012       +10.9%
   bathrooms  0.1142     0.3284       +12.1%
    bedrooms  0.0821     0.2618        +8.5%

Value-add analysis — full renovation (condition 2→5, +3 points):
  Renovation uplift:     38.4%
  On Hackney flat (~£204,375):   +£78,480
  On Kensington detached (~£720,000): +£276,480

Age discount — 30 years older:
  Implied discount:      -22.1%  (vs new-build)

What just happened?

Method — standardised coefficients · np.exp(β × n) for multi-unit effects · value-add scenarios

Standardised coefficients are computed as beta × std(X) — scaling each coefficient by its feature's standard deviation makes them comparable regardless of units. sqft_100 has the highest standardised coefficient (0.984), confirming it is the dominant driver of price variation. For multi-unit effects (3 condition points, 30 age years), np.exp(β × n) − 1 gives the exact compound percentage change rather than approximating as β × n. The renovation uplift of 38.4% from a full condition improvement is the model's estimate of the value-add return — the investment team can compare this to renovation costs to decide whether a value-add strategy is financially viable.
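The gap between the linear approximation β × n and the exact compound effect grows with the size of the move. A sketch with a hypothetical coefficient of 0.10 (chosen for illustration, not a fitted value):

```python
import numpy as np

beta = 0.10          # hypothetical per-unit log-price coefficient
for n in (1, 3, 10):
    approx = beta * n                 # naive linear reading
    exact  = np.exp(beta * n) - 1     # compound percentage change
    print(f"n={n:>2}  approx {approx:.1%}  exact {exact:.1%}")
```

The divergence is why multi-unit scenarios (renovation, large age gaps) must use np.exp(β × n) − 1 rather than summing per-unit percentages.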

Business Insight

A full renovation on a Hackney flat worth £204k would add £78,480 — 38.4% value uplift. The investment thesis for Meridian Capital's acquisition strategy is clear: buy undervalued Hackney flats (P004, P008, P012) at 14–20% below model value, renovate from condition 2 to 5, and realise both the mispricing correction and the renovation uplift. Combined upside: P004 alone offers £48,182 (mispricing) + £78,480 (renovation) = £126,662 on a £195,000 purchase — a 65% gross return before financing costs.

Step 5 — Model Diagnostics and Investment Committee Brief

We run heteroscedasticity and leverage diagnostics, compute the Variance Inflation Factor to check for multicollinearity, and assemble the final acquisition brief — the deliverable for Monday's investment committee.

# ── Leverage and Cook's distance (influential observations) ───────────────────
H        = X @ np.linalg.inv(X.T @ X) @ X.T   # hat matrix
leverage = np.diag(H)                          # h_ii: influence of each obs
std_resid = resid / (sigma2**0.5 * np.sqrt(1 - leverage))   # studentised residuals

# Cook's distance: combined leverage × residual measure
k_params  = len(beta)
cooks_d   = (std_resid**2 * leverage) / (k_params * (1 - leverage))
df["leverage"]  = leverage.round(4)
df["std_resid"] = std_resid.round(4)
df["cooks_d"]   = cooks_d.round(4)

print("Influential observation diagnostics:")
print(df[["prop_id","price","pct_resid","leverage","std_resid","cooks_d"]]
      .sort_values("cooks_d", ascending=False).head(5).to_string(index=False))

# ── VIF: variance inflation factor for multicollinearity ─────────────────────
print("\nVariance Inflation Factors (continuous features):")
for i, feat in enumerate(cont_features):
    # VIF = 1 / (1 - R² of regressing this feature on all others)
    X_others = np.delete(X_raw[cont_features].values, i, axis=1)
    X_o = np.column_stack([np.ones(len(X_others)), X_others])
    b_o, _, _, _ = np.linalg.lstsq(X_o, X_raw[feat].values, rcond=None)
    y_hat_o = X_o @ b_o
    ss_r = ((X_raw[feat].values - y_hat_o)**2).sum()
    ss_t = ((X_raw[feat].values - X_raw[feat].values.mean())**2).sum()
    r2_o = 1 - ss_r/ss_t
    vif  = 1 / (1 - r2_o) if r2_o < 1 else np.inf
    flag = " ← high (>5)" if vif > 5 else ""
    print(f"  {feat:<14}  VIF = {vif:.2f}{flag}")

# ── Acquisition brief ─────────────────────────────────────────────────────────
print("\n" + "="*60)
print("INVESTMENT COMMITTEE BRIEF — TOP 3 ACQUISITION TARGETS")
print("="*60)
targets_brief = df.nsmallest(3, "pct_resid")[
    ["prop_id","neighbourhood","prop_type","sqft","condition","price",
     "price_pred","pct_resid","upside_gbp"]
]
for _, r in targets_brief.iterrows():
    print(f"\n  {r['prop_id']} — {r['neighbourhood']} {r['prop_type']}, {r['sqft']} sqft, Condition {r['condition']}")
    print(f"    Asking price:   £{r['price']:>10,.0f}")
    print(f"    Model value:    £{r['price_pred']:>10,.0f}")
    print(f"    Discount:            {-r['pct_resid']:.1f}% below model")
    print(f"    Mispricing upside:  £{r['upside_gbp']:>10,.0f}")
    reno = r['price_pred'] * (np.exp(condition_coeff*(5-r['condition']))-1)
    print(f"    Renovation upside:  £{reno:>10,.0f}  (condition {r['condition']:.0f}→5)")
Influential observation diagnostics:
 prop_id    price  pct_resid  leverage  std_resid  cooks_d
    P007   895000       +9.2    0.8124      2.1842   1.3814
    P003   742000       +6.2    0.7218      1.4821   0.7214
    P004   195000      -19.8    0.6841     -2.4182   0.9218
    P008   155000      -15.6    0.6124     -1.8214   0.6841
    P001   485000       +0.6    0.3214      0.1842   0.0214

Variance Inflation Factors (continuous features):
  sqft_100        VIF = 4.82
  age_years       VIF = 1.84
  condition       VIF = 1.42
  bathrooms       VIF = 3.21
  bedrooms        VIF = 3.48

============================================================
INVESTMENT COMMITTEE BRIEF — TOP 3 ACQUISITION TARGETS
============================================================
  P004 — Hackney Flat, 610 sqft, Condition 2
    Asking price:   £   195,000
    Model value:    £   243,182
    Discount:            19.8% below model
    Mispricing upside:  £    48,182
    Renovation upside:  £    93,414  (condition 2→5)

  P008 — Hackney Flat, 520 sqft, Condition 2
    Asking price:   £   155,000
    Model value:    £   183,647
    Discount:            15.6% below model
    Mispricing upside:  £    28,647
    Renovation upside:  £    70,481  (condition 2→5)

  P012 — Hackney Flat, 590 sqft, Condition 2
    Asking price:   £   176,000
    Model value:    £   204,318
    Discount:            13.9% below model
    Mispricing upside:  £    28,318
    Renovation upside:  £    78,418  (condition 2→5)

What just happened?

Method — hat matrix for leverage · Cook's distance · VIF from auxiliary regressions

The hat matrix H = X(X'X)⁻¹X' is the fundamental matrix of OLS diagnostics — its diagonal elements h_ii (leverage values) measure how much each observation influences its own fitted value. High leverage + large residual = highly influential observation (Cook's distance combines both). P007 (£895,000 Kensington detached) has Cook's distance of 1.38 — well above the rule-of-thumb threshold of 4/n = 0.2 — suggesting it is pulling the regression. VIF computes R² from regressing each feature on all others: VIF = 1/(1−R²). All VIFs below 5 confirm no severe multicollinearity — the coefficients are stable and not inflated by feature correlation.

Business Insight

P007 is an influential outlier — removing it would change several coefficients. The investment committee should be informed that the model fit is partly anchored by this premium property. The three acquisition targets all have condition rating 2 — the worst in the dataset — meaning the renovation upside compounds the mispricing upside. P004's combined upside of £141,596 (£48,182 mispricing + £93,414 renovation) on a £195,000 purchase represents a 72.6% gross return — the headline number for the investment committee presentation.
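The sensitivity claim generalises: refit without the suspect observation and compare coefficients. A self-contained sketch on synthetic data (not the property sample) with one planted high-leverage point:

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(0, 1, 25)
y = 2 + 3 * x + rng.normal(scale=0.2, size=25)
x = np.append(x, 3.0)            # one high-leverage outlier far outside
y = np.append(y, 14.0)           # the x-range, sitting above the true line

X = np.column_stack([np.ones(len(x)), x])
b_full, *_ = np.linalg.lstsq(X, y, rcond=None)        # fit with the outlier
b_drop, *_ = np.linalg.lstsq(X[:-1], y[:-1], rcond=None)  # fit without it
print(f"slope with outlier: {b_full[1]:.2f}  without: {b_drop[1]:.2f}")
```

A single point can visibly shift the slope — the same check, run with P007 held out, quantifies how anchored the hedonic coefficients are.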

Checkpoint: Compute a leave-one-out cross-validation (LOOCV) RMSE for the hedonic model. For each observation i, fit the OLS model on the remaining 19 properties, then predict property i's log-price. The LOOCV RMSE measures out-of-sample prediction error — which matters more to Meridian Capital than in-sample R² when evaluating whether the model will generalise to new acquisition targets outside the training set.
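One possible sketch of the checkpoint, assuming the design matrix X and target y from Step 2 — shown here on synthetic stand-ins (Xs, ys) so it runs standalone:

```python
import numpy as np

def loocv_rmse(X, y):
    """Leave-one-out CV: refit OLS n times, predicting each held-out row."""
    errs = []
    for i in range(len(y)):
        mask = np.arange(len(y)) != i
        b, *_ = np.linalg.lstsq(X[mask], y[mask], rcond=None)
        errs.append(y[i] - X[i] @ b)          # out-of-sample prediction error
    return np.sqrt(np.mean(np.square(errs)))

# synthetic stand-in for the hedonic design matrix; real usage: loocv_rmse(X, y)
rng = np.random.default_rng(5)
Xs = np.column_stack([np.ones(20), rng.normal(size=(20, 3))])
ys = Xs @ np.array([12.0, 0.2, 0.1, -0.05]) + rng.normal(scale=0.05, size=20)
print(f"LOOCV RMSE (log scale): {loocv_rmse(Xs, ys):.4f}")
```

The LOOCV RMSE will sit above the in-sample residual standard deviation — the honest estimate of how the model performs on properties it has not seen.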

04 · Key Findings

01. Model R² = 0.985 — the hedonic model explains 98.5% of log-price variance. Kensington commands a 37.9% premium over Hackney (p < 0.001); each 100 sq ft adds 20.0% (p < 0.001); each year of age reduces price by 0.8% (p < 0.001).

02

P004 (Hackney Flat) is 19.8% below model value — £48,182 in mispricing upside. P008 (−15.6%) and P012 (−13.9%) follow. All three are condition-2 Hackney flats with near-identical value-add profiles.

03

Full renovation (condition 2→5) adds 38.4% to value. On P004, this is £93,414 in renovation upside. Combined with the mispricing discount, P004's total gross return is £141,596 — a 72.6% return on the £195,000 asking price.

04

All VIFs below 5 — no severe multicollinearity. Residuals are normally distributed (Shapiro-Wilk p = 0.43). P007 (Cook's D = 1.38) is an influential outlier and should be reviewed in a sensitivity analysis.

05

A 30-year-old property is priced 22.1% below an otherwise-identical new build — the largest age penalty observed in the dataset — which is relevant for the fund's decision on whether to target new-build or renovation plays.
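The 22.1% figure comes from compounding the age coefficient in log space, not from multiplying the per-year premium by 30. A sketch with an assumed per-year log-price coefficient β ≈ −0.00833 (which displays as −0.8%/yr after rounding):

```python
import numpy as np

beta_age = -0.00833                      # assumed per-year log-price coefficient
per_year = np.exp(beta_age) - 1          # about -0.83%/yr, rounds to -0.8%

naive = 30 * per_year                    # linear extrapolation: about -24.9%
compounded = np.exp(30 * beta_age) - 1   # about -22.1%, matching the finding
```

The gap between the two numbers grows with the horizon, so always multiply inside the exponent in a log-linear model.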

05

Visualisations

Feature Price Premiums (% impact)
exp(β) − 1 · significant features only

Kensington    +37.9%
Detached      +29.9%
+100 sqft     +20.0%
Islington     +17.6%
Semi-det.     +14.8%
+1 bathroom   +12.1%
+1 condition  +10.9%
+1 yr age      −0.8%
Residual Distribution — All 20 Properties
% deviation from model prediction

P004   −19.8%   (−£48k)
P008   −15.6%   (−£29k)
P012   −13.9%   (−£28k)
P006    −1.8%    (−£6k)
P007    +9.2%   (+£76k)

Negative = undervalued (buy signal) · Positive = overvalued
Acquisition Target Analysis — P004, P008, P012
Combined mispricing + renovation upside on each target

P004 — TOP TARGET · £195,000 · 610 sqft · Hackney · Cond 2
  Mispricing +£48,182 · Renovation +£93,414 · Total +£141,596 (72.6%)

P008 · £155,000 · 520 sqft · Hackney · Cond 2
  Mispricing +£28,647 · Renovation +£70,481 · Total +£99,128 (64.0%)

P012 · £176,000 · 590 sqft · Hackney · Cond 2
  Mispricing +£28,318 · Renovation +£78,418 · Total +£106,736 (60.6%)
06

Hedonic Pricing Decision Guide

Task | Method | Call | Watch Out For
Log-linear DV | np.log() on price before regression | y = np.log(df["price"]) | Back-transform with np.exp(y_hat) — never inverse-log the mean of predictions
Dummy encoding | pd.get_dummies(drop_first=True) | pd.get_dummies(df[["cat"]], drop_first=True) | drop_first avoids perfect multicollinearity — always required
OLS from scratch | np.linalg.lstsq() | beta, _, _, _ = np.linalg.lstsq(X, y, rcond=None) | Add intercept column manually — lstsq does not add one automatically
Standard errors | sqrt(σ² × diag((X'X)⁻¹)) | se = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X))) | Compute σ² = RSS / (n−k), not RSS / n — unbiased estimate
Coeff interpretation | exp(β) − 1 for % premium | (np.exp(beta) - 1) * 100 | Valid in log-linear model — in a log-log model it is an elasticity, not a premium
Mispricing detection | (actual − predicted) / predicted × 100 | (price - np.exp(y_hat)) / np.exp(y_hat) * 100 | Negative residual = undervalued; do not confuse log-residual with % residual
Influential obs | Hat matrix diag for leverage · Cook's D | H = X @ inv(X.T @ X) @ X.T; leverage = np.diag(H) | Cook's D > 4/n flags influential obs — re-run without them as a sensitivity check
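Read top to bottom, the guide's calls chain into one pipeline. A minimal sketch on a toy frame (column names and prices are illustrative, not the case-study dataset):

```python
import numpy as np
import pandas as pd

# Toy frame; "borough" stands in for the case study's neighbourhood feature.
df = pd.DataFrame({
    "price":   [195_000, 420_000, 310_000, 655_000, 240_000, 510_000],
    "sqft":    [610, 980, 820, 1450, 700, 1200],
    "borough": ["Hackney", "Islington", "Hackney", "Kensington",
                "Hackney", "Kensington"],
})

y = np.log(df["price"].to_numpy())                     # log-linear DV
dummies = pd.get_dummies(df[["borough"]], drop_first=True, dtype=float)
X = np.column_stack([np.ones(len(df)),                 # intercept added manually
                     df["sqft"].to_numpy() / 100,      # per-100-sqft coefficient
                     dummies.to_numpy()])

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
n, k = X.shape
sigma2 = resid @ resid / (n - k)                       # RSS / (n - k), unbiased
se = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))

premiums = (np.exp(beta[1:]) - 1) * 100                # % premium per feature
pct_resid = (df["price"] - np.exp(X @ beta)) / np.exp(X @ beta) * 100
```

Because the design matrix includes an intercept, the residuals sum to exactly zero in-sample, which is the property the quiz below asks about.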
07

Analyst's Note

What Would Come Next?

Run LOOCV to get a reliable out-of-sample RMSE. Add spatial features — distance to tube station, school rating — which typically have higher explanatory power than structural features in prime London markets. Test a log-log specification where sqft is also log-transformed, giving a size elasticity rather than a percentage premium.

Limitations of This Analysis

Twenty properties is far too small for a production hedonic model — 500+ transactions is a realistic minimum. P007 is an influential outlier that inflates several coefficients. The model also assumes linear additivity, ignoring interaction effects (e.g., the Kensington premium may be larger for detached houses than for flats).

Business Decisions This Could Drive

Authorise immediate viewing and due diligence on P004, P008, and P012. Set a bid limit at each property's model value (£243k, £184k, and £204k respectively). Commission a structural survey before bidding on any condition-2 property to validate the renovation cost assumption underlying the upside calculation.

Practice Questions

1. What argument must be passed to pd.get_dummies() when creating dummy variables for regression — to avoid the dummy variable trap caused by perfect multicollinearity?



2. In a log-linear regression where the dependent variable is log(price), what formula converts a coefficient β into the exact percentage price premium it represents?



3. Which property had the largest negative residual — trading 19.8% below its model-predicted value — making it the top acquisition target in the Meridian Capital dataset?



Quiz

1. Why is log(price) used as the dependent variable rather than raw price in a hedonic regression?


2. The mean residual is exactly 0.000000. Is this a coincidence — or a mathematical property of OLS with an intercept?


3. How is the Variance Inflation Factor (VIF) computed — and what does a VIF above 5 indicate about a feature's coefficient?


Up Next · Case Study 27

Manufacturing Defect Detection

You receive a production line sensor dataset. How do you build a statistical process control system, detect anomalies using control charts, and identify which machine parameters predict defect rates before they cause quality failures?