DS Case Studies
Exploring House Price Data
A property listing tells you the asking price. The data tells you whether that price makes sense. Size, location, age, and condition each pull the number in a different direction — and only an analyst who understands the interaction between them can say which properties are fairly priced and which are not.
You are a data analyst at NestView, a property technology company building a valuation tool for first-time buyers. The product team needs to understand which features drive house prices most strongly, how price per square foot varies by neighbourhood, and whether any properties in the current listings look significantly under or overvalued relative to their characteristics. Your dataset is 12 recent listings from a single city.
What This Case Study Covers
House price analysis is one of the most data-rich domains an analyst will encounter. Every property has dozens of attributes — and the challenge is finding which ones actually move the price needle. This case study focuses on the fundamentals: size, neighbourhood, age, and condition. These four variables account for the majority of price variance in most residential markets and form the foundation of any property valuation model.
This case study covers four layers: price distribution analysis to understand the range and spread of listing prices, price per square foot as a normalised comparison metric across properties of different sizes, neighbourhood comparison to quantify how much location alone affects value, and feature correlation to rank which property attributes most strongly predict price.
The House Price EDA Toolkit
Price Distribution Analysis
Compute mean, median, and standard deviation of listing prices. A high std relative to the mean signals a heterogeneous market — not all properties are comparable. The median is more robust than the mean when a few luxury listings inflate the average.
Price Per Square Foot
Divide price by floor area to get a normalised value metric. This allows fair comparison between a 3-bed house and a 5-bed house in the same street — size is controlled for, so the remaining price variation reflects quality, location, and condition.
Neighbourhood Comparison
Group by neighbourhood and compute mean price and mean price per square foot. The difference in price per sqft across neighbourhoods is the purest measure of location premium — stripped of size effects.
Feature Correlation with Price
Compute Pearson correlation between price and each numeric feature — sqft, bedrooms, age, condition score. This ranks the features by predictive power before any model is built. High-correlation features go into the model first.
Outlier Detection
Flag properties whose price deviates significantly from the neighbourhood average price per sqft. Properties more than one standard deviation below average may be undervalued — these are the leads the product team wants to surface for buyers.
Dataset Overview
NestView's listings dataset contains 12 property records covering three neighbourhoods with price, size, age, bedrooms, and condition data. Built with pd.DataFrame().
| property_id | neighbourhood | price | sqft | bedrooms | age_years | condition |
|---|---|---|---|---|---|---|
| P001 | Riverside | $420,000 | 1,850 | 3 | 12 | 8 |
| P002 | Oldtown | $285,000 | 1,420 | 3 | 45 | 5 |
| P003 | Riverside | $565,000 | 2,300 | 4 | 8 | 9 |
| P004 | Greenhill | $340,000 | 1,650 | 3 | 22 | 7 |
| P005 | Oldtown | $210,000 | 1,100 | 2 | 60 | 4 |
Showing first 5 of 12 rows · 7 columns
Unique listing reference. Used for labelling outlier flags — never aggregated directly.
Riverside, Greenhill, or Oldtown. The primary grouping dimension for the location premium analysis.
Listing price in dollars. This is the primary metric we are trying to understand and eventually predict.
Total floor area. The strongest individual predictor of price in most residential markets — bigger generally costs more.
Number of bedrooms. Correlated with sqft — larger homes have more bedrooms — so may be partially redundant as a predictor.
Property age in years. Older properties tend to cost less, though location can override this in premium neighbourhoods.
Surveyor condition score from 1 (poor) to 10 (excellent). Captures renovation state and structural quality beyond age alone.
Business Questions
The NestView product team needs these five questions answered to configure the valuation tool's weighting model.
What is the price distribution across the dataset — mean, median, and standard deviation of listing prices?
Which neighbourhood has the highest average price per square foot — and how large is the location premium between the most and least expensive areas?
Which feature — sqft, bedrooms, age, or condition — has the strongest linear correlation with listing price?
Which properties appear undervalued relative to their neighbourhood's average price per square foot?
Does property age correlate negatively with price — and does condition score partially compensate for age?
Step-by-Step Analysis
The scenario:
The listings export has just arrived from the data team. The product manager needs the analysis before the end of the sprint so the engineering team can begin building the valuation weighting model. Work through it step by step — every number here will feed directly into a product feature used by real homebuyers.
The first step in any property analysis is understanding the overall price landscape. Before comparing neighbourhoods or correlating features, we need to know how prices are distributed — and whether the mean or median is the more reliable central measure for this dataset.
import pandas as pd
import numpy as np
# Build the NestView listings dataset
df = pd.DataFrame({
"property_id": ["P001","P002","P003","P004","P005","P006",
"P007","P008","P009","P010","P011","P012"],
"neighbourhood": ["Riverside","Oldtown","Riverside","Greenhill","Oldtown","Greenhill",
"Riverside","Greenhill","Oldtown","Riverside","Greenhill","Oldtown"],
"price": [420000,285000,565000,340000,210000,375000,
490000,310000,255000,620000,395000,230000],
"sqft": [1850,1420,2300,1650,1100,1580,
2050,1720,1280,2480,1900,1200],
"bedrooms": [3,3,4,3,2,3,4,3,2,4,3,2],
"age_years": [12,45,8,22,60,18,5,15,52,3,10,58],
"condition": [8,5,9,7,4,7,9,8,5,10,8,4]
})
# Inspect
print("Shape:", df.shape)
print("Missing values:", df.isnull().sum().sum())
# Price distribution
price_stats = df["price"].describe()
print("\nPrice distribution:")
print(price_stats.round(0))
# Mean vs median — how skewed is the distribution?
mean_price = df["price"].mean()
median_price = df["price"].median()
skew = (mean_price - median_price) / df["price"].std()
print(f"\nMean price: ${mean_price:,.0f}")
print(f"Median price: ${median_price:,.0f}")
print(f"Skew index: {skew:.3f} ({'right-skewed' if skew > 0 else 'left-skewed'})")
print(f"Price range: ${df['price'].min():,} to ${df['price'].max():,}")
Shape: (12, 7)
Missing values: 0

Price distribution:
count        12.0
mean     374583.0
std      130723.0
min      210000.0
25%      277500.0
50%      357500.0
75%      437500.0
max      620000.0
Name: price, dtype: float64

Mean price: $374,583
Median price: $357,500
Skew index: 0.131 (right-skewed)
Price range: $210,000 to $620,000
What just happened?
Method — .describe() and skew index
.describe() gives us the full distribution in one call. We then computed a simple skew index — the difference between mean and median divided by standard deviation. A positive value means the mean is pulled above the median by high-priced outliers. This is important for the product team: in right-skewed markets, displaying the median price gives buyers a more realistic expectation than the mean, which is inflated by premium listings.
The market has a wide price range — from $210,000 to $620,000, a spread of $410,000 across just 12 listings. The std of $130,723 is roughly 35% of the mean, confirming this is a heterogeneous dataset. The mean ($374,583) sits above the median ($357,500) — a slight right skew caused by the premium Riverside listings. The NestView valuation tool should use median-based benchmarks for the buyer-facing interface, not mean-based ones.
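As a sanity check on the simple mean-median index, pandas' built-in Series.skew() — a moment-based skewness estimate — should agree on the direction. A minimal sketch on the twelve listing prices from the dataset above:

```python
import pandas as pd

# The 12 NestView listing prices
prices = pd.Series([420000, 285000, 565000, 340000, 210000, 375000,
                    490000, 310000, 255000, 620000, 395000, 230000])

# The simple index from the walkthrough: (mean - median) / std
simple_skew = (prices.mean() - prices.median()) / prices.std()

# pandas' built-in moment-based skewness estimate
moment_skew = prices.skew()

print(f"Simple skew index: {simple_skew:.3f}")
print(f"Series.skew():     {moment_skew:.3f}")
# Both are positive, i.e. both point to a right-skewed market
```

The two measures use different formulas and will not match numerically, but for a quick EDA the sign is what matters: both confirm the right skew.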
Raw price comparisons between properties of different sizes are misleading. Price per square foot normalises for size and reveals the true location premium — what you are paying purely for the neighbourhood, independent of how big the home is.
# Compute price per square foot — the normalised value metric
df["price_per_sqft"] = (df["price"] / df["sqft"]).round(2)
print("Price per sqft for all listings:")
print(df[["property_id","neighbourhood","price",
"sqft","price_per_sqft"]].sort_values(
"price_per_sqft", ascending=False
).to_string(index=False))
# Group by neighbourhood — mean price, mean price per sqft, listing count
neighbourhood = df.groupby("neighbourhood").agg(
listings = ("property_id", "count"),
mean_price = ("price", "mean"),
mean_price_sqft = ("price_per_sqft","mean"),
mean_sqft = ("sqft", "mean"),
mean_condition = ("condition", "mean")
).round(2).reset_index()
neighbourhood = neighbourhood.sort_values(
"mean_price_sqft", ascending=False
)
print("\nNeighbourhood performance summary:")
print(neighbourhood.to_string(index=False))
# Location premium: gap between most and least expensive per sqft
top = neighbourhood.iloc[0]
bottom = neighbourhood.iloc[-1]
premium_pct = (top["mean_price_sqft"] - bottom["mean_price_sqft"]) / bottom["mean_price_sqft"] * 100
print(f"\nLocation premium ({top['neighbourhood']} vs {bottom['neighbourhood']}): {premium_pct:.1f}%")
Price per sqft for all listings:
property_id neighbourhood price sqft price_per_sqft
P010 Riverside 620000 2480 250.00
P003 Riverside 565000 2300 245.65
P007 Riverside 490000 2050 239.02
P006 Greenhill 375000 1580 237.34
P001 Riverside 420000 1850 227.03
P011 Greenhill 395000 1900 207.89
P004 Greenhill 340000 1650 206.06
P002 Oldtown 285000 1420 200.70
P009 Oldtown 255000 1280 199.22
P012 Oldtown 230000 1200 191.67
P005 Oldtown 210000 1100 190.91
P008 Greenhill 310000 1720 180.23
Neighbourhood performance summary:
neighbourhood listings mean_price mean_price_sqft mean_sqft mean_condition
Riverside 4 523750.0 240.43 2170.0 9.0
Greenhill 4 355000.0 207.88 1712.5 7.5
Oldtown 4 245000.0 195.63 1250.0 4.5
Location premium (Riverside vs Oldtown): 22.9%
What just happened?
Method — derived metric before groupby
We computed price_per_sqft on the raw DataFrame before grouping — so when we aggregate by neighbourhood, the mean of the per-property rates flows naturally into the summary. This is the same pre-aggregation derivation pattern used in CS5 (Retail) — always compute derived metrics on raw rows first, then aggregate, rather than trying to derive them from already-aggregated totals.
Riverside commands $240.43 per square foot versus $195.63 in Oldtown — a location premium of 22.9%. For the same floor area, a Riverside buyer pays roughly $45 more per square foot purely for the address. The neighbourhood summary also reveals why: Riverside properties have a mean condition score of 9.0 versus 4.5 in Oldtown — the location premium is partly justified by significantly better property condition in Riverside.
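The warning about deriving metrics from already-aggregated totals is worth seeing concretely. A small sketch using the four Greenhill listings: the mean of the per-row price/sqft values and total price divided by total sqft give different answers, because the totals version implicitly weights by size.

```python
import pandas as pd

# The four Greenhill listings from the dataset above
greenhill = pd.DataFrame({
    "price": [340000, 375000, 310000, 395000],
    "sqft": [1650, 1580, 1720, 1900],
})

# Pattern used in the walkthrough: derive per row, then average
mean_of_ratios = (greenhill["price"] / greenhill["sqft"]).mean()

# Tempting shortcut: derive from already-aggregated totals
ratio_of_means = greenhill["price"].sum() / greenhill["sqft"].sum()

print(f"Mean of per-row price/sqft: {mean_of_ratios:.2f}")
print(f"Total price / total sqft:   {ratio_of_means:.2f}")
# They differ: the totals version is a sqft-weighted average,
# which gives the larger homes more influence
```

Neither number is "wrong" — they answer different questions — but only the per-row version matches the unweighted price_per_sqft comparison the walkthrough uses.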
Before the engineering team can build the weighting model, they need to know which features matter most. Pearson correlation gives us a ranked list — the foundation for deciding which variables go into the model and how much weight they should carry.
# Compute Pearson correlation between price and all numeric features
features = ["sqft","bedrooms","age_years","condition","price_per_sqft"]
print("Correlation with price (Pearson r):")
for feat in features:
r = df["price"].corr(df[feat]).round(3)
bar = "#" * int(abs(r) * 20)
print(f" {feat:<18} r = {r:+.3f} {bar}")
# Full correlation matrix for all numeric columns
numeric_cols = ["price","sqft","bedrooms","age_years","condition","price_per_sqft"]
corr_matrix = df[numeric_cols].corr().round(3)
print("\nFull correlation matrix:")
print(corr_matrix.to_string())
# Does condition compensate for age?
age_cond_corr = df["age_years"].corr(df["condition"]).round(3)
print(f"\nAge vs condition correlation: {age_cond_corr:.3f}")
print("Interpretation: older properties tend to have",
"worse" if age_cond_corr < 0 else "better", "condition scores")
Correlation with price (Pearson r):
sqft r = +0.976 ###################
bedrooms r = +0.923 ##################
age_years r = -0.855 #################
condition r = +0.916 ##################
price_per_sqft r = +0.891 #################
Full correlation matrix:
price sqft bedrooms age_years condition price_per_sqft
price 1.000 0.976 0.923 -0.855 0.916 0.891
sqft 0.976 1.000 0.930 -0.908 0.962 0.780
bedrooms 0.923 0.930 1.000 -0.875 0.903 0.786
age_years -0.855 -0.908 -0.875 1.000 -0.977 -0.688
condition 0.916 0.962 0.903 -0.977 1.000 0.719
Age vs condition correlation: -0.977
Interpretation: older properties tend to have worse condition scores
What just happened?
Method — correlation loop with visual bar
We looped over the feature list and computed each correlation independently using series.corr(series). The ASCII bar — "#" * int(abs(r) * 20) — gives an instant visual sense of correlation strength without needing a chart. This inline visualisation trick is common in quick analyst notebooks where speed matters more than aesthetics. The :+.3f format specifier forces the sign to always display, making positive and negative correlations immediately distinguishable.
Sqft is the strongest predictor at r = +0.976 — nearly a perfect linear relationship. Bedrooms (r = +0.923), condition (r = +0.916), and age (r = −0.855) follow. The critical finding for the engineering team is the age-condition correlation of −0.977 — the two variables are almost perfectly inversely related. This means age and condition are largely redundant — including both in a model would introduce multicollinearity. The team should use condition score and drop age, since condition captures the age signal while also reflecting renovation state.
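series.corr(series) implements the standard Pearson formula. For intuition, here is a sketch computing r for price against sqft directly from the definition — covariance divided by the product of the spreads — and checking it against np.corrcoef:

```python
import numpy as np

# price and sqft for the 12 NestView listings
price = np.array([420000, 285000, 565000, 340000, 210000, 375000,
                  490000, 310000, 255000, 620000, 395000, 230000], dtype=float)
sqft = np.array([1850, 1420, 2300, 1650, 1100, 1580,
                 2050, 1720, 1280, 2480, 1900, 1200], dtype=float)

def pearson_r(x, y):
    # Sum of co-deviations over the product of the deviation norms;
    # the 1/n factors cancel, so no explicit division by n is needed
    xd, yd = x - x.mean(), y - y.mean()
    return (xd * yd).sum() / np.sqrt((xd ** 2).sum() * (yd ** 2).sum())

r_manual = pearson_r(price, sqft)
r_numpy = np.corrcoef(price, sqft)[0, 1]
print(f"Manual Pearson r:  {r_manual:.3f}")
print(f"np.corrcoef check: {r_numpy:.3f}")
```

Writing the formula out makes the scale-invariance obvious: multiplying every price by a constant changes neither the numerator nor the denominator's ratio, which is why r is comparable across features with very different units.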
The product team's key feature request: surface properties that appear undervalued relative to their neighbourhood. A property priced below the neighbourhood average price per sqft by more than one standard deviation is a statistical outlier — potentially a buying opportunity for NestView's users.
# Compute neighbourhood-level mean and std of price_per_sqft
# We will use these to flag outliers within each neighbourhood
neighbourhood_stats = df.groupby("neighbourhood")["price_per_sqft"].agg(
mean_psf = "mean",
std_psf = "std"
).reset_index()
# Merge neighbourhood stats back onto the main DataFrame
df = df.merge(neighbourhood_stats, on="neighbourhood")
# Compute z-score: how many std deviations each property is from its
# neighbourhood mean price per sqft
df["psf_zscore"] = (
(df["price_per_sqft"] - df["mean_psf"]) / df["std_psf"]
).round(2)
# Flag undervalued properties: z-score below -1.0
# (more than 1 std below the neighbourhood mean price per sqft)
undervalued = df[df["psf_zscore"] < -1.0][
["property_id","neighbourhood","price","sqft",
"price_per_sqft","mean_psf","psf_zscore","condition"]
].sort_values("psf_zscore")
print("Undervalued properties (psf z-score < -1.0):")
print(undervalued.to_string(index=False))
# Show full z-score table for context
print("\nAll properties with neighbourhood z-scores:")
print(df[["property_id","neighbourhood","price_per_sqft",
"mean_psf","psf_zscore"]].sort_values(
["neighbourhood","psf_zscore"]
).to_string(index=False))
Undervalued properties (psf z-score < -1.0):
property_id neighbourhood price sqft price_per_sqft mean_psf psf_zscore condition
P001 Riverside 420000 1850 227.03 240.43 -1.34 8
P008 Greenhill 310000 1720 180.23 207.88 -1.18 8
All properties with neighbourhood z-scores:
property_id neighbourhood price_per_sqft mean_psf psf_zscore
P008 Greenhill 180.23 207.88 -1.18
P004 Greenhill 206.06 207.88 -0.08
P011 Greenhill 207.89 207.88 0.00
P006 Greenhill 237.34 207.88 1.26
P005 Oldtown 190.91 195.63 -0.93
P012 Oldtown 191.67 195.63 -0.78
P009 Oldtown 199.22 195.63 0.71
P002 Oldtown 200.70 195.63 1.00
P001 Riverside 227.03 240.43 -1.34
P007 Riverside 239.02 240.43 -0.14
P003 Riverside 245.65 240.43 0.52
P010 Riverside 250.00 240.43 0.96
What just happened?
Method — .merge() to bring group stats back onto individual rowsdf.merge(neighbourhood_stats, on="neighbourhood") joins the neighbourhood-level mean and std back onto every row in the original DataFrame — so each property row now also carries the statistics for its neighbourhood. This is the standard pattern for computing within-group z-scores: aggregate to get group stats, then merge back to enable row-level comparison. Without the merge, you would need to loop over groups or use .transform().
The z-score formula — (value - group_mean) / group_std — expresses each property's price per sqft in units of standard deviations from its neighbourhood mean. A z-score of −1.18 means the property is 1.18 standard deviations cheaper than its neighbourhood average on a per-sqft basis. The −1.0 threshold is a common first-pass outlier boundary — you can tighten it to −1.5 or −2.0 for stricter flagging.
Two properties are flagged as statistically undervalued, and both are in good shape. P008 in Greenhill at $180.23/sqft is $27.65 below the Greenhill average — with a condition score of 8, this is the most interesting flag for buyers: a well-maintained Greenhill property priced like an Oldtown one. P001 in Riverside sits 1.34 standard deviations below its neighbourhood average, also with a condition score of 8. Contrast these with P005 in Oldtown, which at z = −0.93 just misses the threshold: its condition score of 4 explains its low price — it is priced appropriately for its poor condition, not genuinely undervalued.
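The method note mentions .transform() as the merge-free alternative, and it is worth seeing: .transform() broadcasts each group's statistic straight back onto the group's rows. A sketch reproducing the same z-scores, with the per-property price-per-sqft values hardcoded so the block runs on its own:

```python
import pandas as pd

# Per-property price-per-sqft values from the step above
listings = pd.DataFrame({
    "property_id": ["P001","P002","P003","P004","P005","P006",
                    "P007","P008","P009","P010","P011","P012"],
    "neighbourhood": ["Riverside","Oldtown","Riverside","Greenhill","Oldtown","Greenhill",
                      "Riverside","Greenhill","Oldtown","Riverside","Greenhill","Oldtown"],
    "price_per_sqft": [227.03, 200.70, 245.65, 206.06, 190.91, 237.34,
                       239.02, 180.23, 199.22, 250.00, 207.89, 191.67],
})

# .transform() returns a Series aligned to the original rows, so the
# z-score is a one-liner with no intermediate stats frame and no merge
grp = listings.groupby("neighbourhood")["price_per_sqft"]
listings["psf_zscore"] = (
    (listings["price_per_sqft"] - grp.transform("mean")) / grp.transform("std")
).round(2)

print(listings[listings["psf_zscore"] < -1.0].to_string(index=False))
```

The merge pattern is still worth knowing — it scales to joining many group-level columns at once — but for a single within-group statistic, .transform() is the more idiomatic pandas.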
Checkpoint: Try relaxing the outlier threshold to df[df["psf_zscore"] < -0.5] to surface more candidates. Then cross-reference with condition score — only flag properties with condition >= 7. This two-condition filter — below-average price AND good condition — is the foundation of every value-investing screen used in real estate analytics platforms.
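The checkpoint's two-condition screen might look like the sketch below (values hardcoded from the walkthrough so the block is standalone; the −0.5 and condition ≥ 7 thresholds are the ones suggested above):

```python
import pandas as pd

# price-per-sqft plus condition for the 12 listings
listings = pd.DataFrame({
    "property_id": ["P001","P002","P003","P004","P005","P006",
                    "P007","P008","P009","P010","P011","P012"],
    "neighbourhood": ["Riverside","Oldtown","Riverside","Greenhill","Oldtown","Greenhill",
                      "Riverside","Greenhill","Oldtown","Riverside","Greenhill","Oldtown"],
    "price_per_sqft": [227.03, 200.70, 245.65, 206.06, 190.91, 237.34,
                       239.02, 180.23, 199.22, 250.00, 207.89, 191.67],
    "condition": [8, 5, 9, 7, 4, 7, 9, 8, 5, 10, 8, 4],
})

# Within-neighbourhood z-score on price per sqft
grp = listings.groupby("neighbourhood")["price_per_sqft"]
listings["psf_zscore"] = (
    listings["price_per_sqft"] - grp.transform("mean")
) / grp.transform("std")

# Two-condition value screen: meaningfully below the neighbourhood
# average AND in good shape
screen = listings[(listings["psf_zscore"] < -0.5) & (listings["condition"] >= 7)]
print(screen[["property_id", "neighbourhood", "psf_zscore", "condition"]]
      .to_string(index=False))
```

Note how the condition filter does the interpretive work automatically: cheap-but-shabby listings pass the price test and then fall out at the condition test, leaving only the genuine value candidates.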
Key Findings
The market spans $210,000 to $620,000 with a median of $357,500. The distribution is slightly right-skewed — the product team should display median-based benchmarks to buyers rather than means, which are inflated by premium Riverside listings.
Riverside commands a 22.9% location premium over Oldtown on a per-sqft basis ($240.43 vs $195.63). The premium is partly explained by Riverside's dramatically better average condition score (9.0 vs 4.5 in Oldtown).
Sqft is the strongest price predictor at r = +0.976, followed by bedrooms (+0.923) and condition (+0.916); age correlates at −0.855. Age and condition are almost perfectly inversely correlated at −0.977 — they are largely redundant. The model should use condition and drop age to avoid multicollinearity.
P008 in Greenhill and P001 in Riverside are flagged as undervalued — 1.18 and 1.34 standard deviations below their neighbourhood averages per sqft respectively, both with a condition score of 8. In neither case is the low price explained by poor condition, which makes both genuine value candidates, with P008 the standout for buyers priced out of Riverside.
Bedrooms (r = +0.923) and sqft (r = +0.976) are highly intercorrelated (+0.930 with each other) — larger homes have more bedrooms. Including both in a model would introduce multicollinearity. The valuation model should use sqft as the primary size feature and omit bedrooms.
Visualisations
P008 highlighted in amber — statistically undervalued at 1.18 std below the Greenhill mean
House Price EDA Decision Guide
Property datasets have a standard analytical shape. Here is the framework for approaching any residential price EDA systematically:
| Question | Metric | pandas Method | Watch Out For |
|---|---|---|---|
| What is the price landscape? | Mean, median, std, range | df["price"].describe() | Median is more robust than mean in skewed markets |
| Fair size-adjusted comparison? | Price per square foot | df["price"] / df["sqft"] | Always normalise before comparing differently sized homes |
| Location premium? | Mean price/sqft by neighbourhood | groupby().agg(mean) | Condition score can inflate neighbourhood averages |
| Which features drive price? | Pearson correlation with price | series.corr(series) | High inter-feature correlation = multicollinearity risk |
| Undervalued listings? | Within-group z-score on price/sqft | merge() + (val - mean) / std | Cross-reference condition — low price may be fair for poor condition |
Analyst's Note
What Would Come Next?
A senior analyst would build a multiple linear regression model using sqft and condition as primary features (dropping age and bedrooms due to multicollinearity), then encode neighbourhood as a target-encoded location premium score for the model to learn directly.
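As a sketch of that next step — illustrative only, since twelve rows is far too small for a trustworthy fit, as the limitations note — an ordinary least squares model on sqft and condition via numpy:

```python
import numpy as np

# sqft, condition, and price for the 12 NestView listings
sqft = np.array([1850, 1420, 2300, 1650, 1100, 1580,
                 2050, 1720, 1280, 2480, 1900, 1200], dtype=float)
condition = np.array([8, 5, 9, 7, 4, 7, 9, 8, 5, 10, 8, 4], dtype=float)
price = np.array([420000, 285000, 565000, 340000, 210000, 375000,
                  490000, 310000, 255000, 620000, 395000, 230000], dtype=float)

# Design matrix: intercept column, sqft, condition
X = np.column_stack([np.ones_like(sqft), sqft, condition])
coef, _, _, _ = np.linalg.lstsq(X, price, rcond=None)

# Goodness of fit: 1 - residual variance / total variance
pred = X @ coef
r2 = 1 - ((price - pred) ** 2).sum() / ((price - price.mean()) ** 2).sum()

print(f"Intercept:             {coef[0]:,.0f}")
print(f"$ per extra sqft:      {coef[1]:,.2f}")
print(f"$ per condition point: {coef[2]:,.0f}")
print(f"R^2: {r2:.3f}")
```

With only two predictors this stays interpretable — the sqft coefficient is a direct "dollars per square foot holding condition constant" figure the product team can sanity-check against the per-sqft analysis above.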
Limitations of This Analysis
Twelve listings produce unrealistically strong correlations. Real property markets have dozens of confounders including school catchment areas, proximity to transport, and hyperlocal supply dynamics. Z-score outlier detection also requires at least 30–50 listings per neighbourhood to be statistically meaningful.
Business Decisions This Could Drive
Use sqft and condition as the primary valuation inputs, display neighbourhood median price per sqft as the buyer benchmark, and surface P008 and P001 as featured opportunities — both are priced more than one standard deviation below their neighbourhood averages despite condition scores of 8.
Practice Questions
1. Which feature had the strongest Pearson correlation with house price in the NestView dataset?
2. Which pandas method was used to bring neighbourhood-level statistics back onto individual property rows so we could compute within-group z-scores?
3. Which Greenhill property was flagged as undervalued — priced 1.18 standard deviations below its neighbourhood average despite having a condition score of 8?
Quiz
1. Why is price per square foot a better comparison metric than raw listing price when comparing properties across a neighbourhood?
2. Why should the NestView valuation model use condition score but drop age_years, even though both correlate strongly with price?
3. What is the correct sequence for computing within-group z-scores in pandas?
Up Next · Case Study 9
Exploring Ecommerce User Behaviour
You are handed a session-level dataset from an online store. Which traffic sources convert best? Where do users drop off in the funnel? And which device type drives the highest average order value?