DS Case Studies
Exploring House Price Data
A property listing tells you the asking price. The data tells you whether that price makes sense. Size, location, age, and condition each pull the number in a different direction — and only an analyst who understands the interaction between them can say which properties are fairly priced and which are not.
You are a data analyst at NestView, a property technology company building a valuation tool for first-time buyers. The product team needs to understand which features drive house prices most strongly, how price per square foot varies by neighbourhood, and whether any properties in the current listings look significantly under or overvalued relative to their characteristics. Your dataset is 12 recent listings from a single city.
What This Case Study Covers
House price analysis is one of the most data-rich domains an analyst will encounter. Every property has dozens of attributes — and the challenge is finding which ones actually move the price needle. This case study focuses on the fundamentals: size, neighbourhood, age, and condition. These four variables account for the majority of price variance in most residential markets and form the foundation of any property valuation model.
This case study covers four layers: price distribution analysis to understand the range and spread of listing prices, price per square foot as a normalised comparison metric across properties of different sizes, neighbourhood comparison to quantify how much location alone affects value, and feature correlation to rank which property attributes most strongly predict price.
The House Price EDA Toolkit
Price Distribution Analysis
Compute mean, median, and standard deviation of listing prices. A high std relative to the mean signals a heterogeneous market — not all properties are comparable. The median is more robust than the mean when a few luxury listings inflate the average.
Price Per Square Foot
Divide price by floor area to get a normalised value metric. This allows fair comparison between a 3-bed house and a 5-bed house in the same street — size is controlled for, so the remaining price variation reflects quality, location, and condition.
Neighbourhood Comparison
Group by neighbourhood and compute mean price and mean price per square foot. The difference in price per sqft across neighbourhoods is the purest measure of location premium — stripped of size effects.
Feature Correlation with Price
Compute Pearson correlation between price and each numeric feature — sqft, bedrooms, age, condition score. This ranks the features by predictive power before any model is built. High-correlation features go into the model first.
Outlier Detection
Flag properties whose price deviates significantly from the neighbourhood average price per sqft. Properties more than one standard deviation below average may be undervalued — these are the leads the product team wants to surface for buyers.
Dataset Overview
NestView's listings dataset contains 12 property records covering three neighbourhoods with price, size, age, bedrooms, and condition data. Built with pd.DataFrame().
| property_id | neighbourhood | price | sqft | bedrooms | age_years | condition |
|---|---|---|---|---|---|---|
| P001 | Riverside | $420,000 | 1,850 | 3 | 12 | 8 |
| P002 | Oldtown | $285,000 | 1,420 | 3 | 45 | 5 |
| P003 | Riverside | $565,000 | 2,300 | 4 | 8 | 9 |
| P004 | Greenhill | $340,000 | 1,650 | 3 | 22 | 7 |
| P005 | Oldtown | $210,000 | 1,100 | 2 | 60 | 4 |
Showing first 5 of 12 rows · 7 columns
Unique listing reference. Used for labelling outlier flags — never aggregated directly.
Riverside, Greenhill, or Oldtown. The primary grouping dimension for the location premium analysis.
Listing price in dollars. This is the primary metric we are trying to understand and eventually predict.
Total floor area. The strongest individual predictor of price in most residential markets — bigger generally costs more.
Number of bedrooms. Correlated with sqft — larger homes have more bedrooms — so may be partially redundant as a predictor.
Property age in years. Older properties tend to cost less, though location can override this in premium neighbourhoods.
Surveyor condition score from 1 (poor) to 10 (excellent). Captures renovation state and structural quality beyond age alone.
Business Questions
The NestView product team needs these five questions answered to configure the valuation tool's weighting model.
What is the price distribution across the dataset — mean, median, and standard deviation of listing prices?
Which neighbourhood has the highest average price per square foot — and how large is the location premium between the most and least expensive areas?
Which feature — sqft, bedrooms, age, or condition — has the strongest linear correlation with listing price?
Which properties appear undervalued relative to their neighbourhood's average price per square foot?
Does property age correlate negatively with price — and does condition score partially compensate for age?
Step-by-Step Analysis
The scenario:
The listings export has just arrived from the data team. The product manager needs the analysis before the end of the sprint so the engineering team can begin building the valuation weighting model. Work through it step by step — every number here will feed directly into a product feature used by real homebuyers.
The first step in any property analysis is understanding the overall price landscape. Before comparing neighbourhoods or correlating features, we need to know how prices are distributed — and whether the mean or median is the more reliable central measure for this dataset.
import pandas as pd
import numpy as np
# Build the NestView listings dataset
df = pd.DataFrame({
"property_id": ["P001","P002","P003","P004","P005","P006",
"P007","P008","P009","P010","P011","P012"],
"neighbourhood": ["Riverside","Oldtown","Riverside","Greenhill","Oldtown","Greenhill",
"Riverside","Greenhill","Oldtown","Riverside","Greenhill","Oldtown"],
"price": [420000,285000,565000,340000,210000,375000,
490000,310000,255000,620000,395000,230000],
"sqft": [1850,1420,2300,1650,1100,1580,
2050,1720,1280,2480,1900,1200],
"bedrooms": [3,3,4,3,2,3,4,3,2,4,3,2],
"age_years": [12,45,8,22,60,18,5,15,52,3,10,58],
"condition": [8,5,9,7,4,7,9,8,5,10,8,4]
})
# Inspect
print("Shape:", df.shape)
print("Missing values:", df.isnull().sum().sum())
# Price distribution
price_stats = df["price"].describe()
print("\nPrice distribution:")
print(price_stats.round(0))
# Mean vs median — how skewed is the distribution?
mean_price = df["price"].mean()
median_price = df["price"].median()
skew = (mean_price - median_price) / df["price"].std()
print(f"\nMean price: ${mean_price:,.0f}")
print(f"Median price: ${median_price:,.0f}")
print(f"Skew index: {skew:.3f} ({'right-skewed' if skew > 0 else 'left-skewed'})")
print(f"Price range: ${df['price'].min():,} to ${df['price'].max():,}")
Shape: (12, 7)
Missing values: 0

Price distribution:
count        12.0
mean     374583.0
std      130723.0
min      210000.0
25%      277500.0
50%      357500.0
75%      437500.0
max      620000.0
Name: price, dtype: float64

Mean price: $374,583
Median price: $357,500
Skew index: 0.131 (right-skewed)
Price range: $210,000 to $620,000
What just happened?
Method — .describe() and skew index
.describe() gives us the full distribution in one call. We then computed a simple skew index — the difference between mean and median divided by standard deviation. A positive value means the mean is pulled above the median by high-priced outliers. This is important for the product team: in right-skewed markets, displaying the median price gives buyers a more realistic expectation than the mean, which is inflated by premium listings.
The market has a wide price range — from $210,000 to $620,000, a spread of $410,000 across just 12 listings. The std of $130,723 is roughly 35% of the mean, confirming this is a heterogeneous dataset. The mean ($374,583) sits above the median ($357,500) — a slight right skew caused by the premium Riverside listings. The NestView valuation tool should use median-based benchmarks for the buyer-facing interface, not mean-based ones.
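As a sanity check on the simple mean-median index, pandas' built-in Series.skew() — a moment-based skewness estimate — should agree on the direction. A minimal sketch on the twelve listing prices from the dataset above:

```python
import pandas as pd

# The 12 NestView listing prices
prices = pd.Series([420000, 285000, 565000, 340000, 210000, 375000,
                    490000, 310000, 255000, 620000, 395000, 230000])

# The simple index from the walkthrough: (mean - median) / std
simple_skew = (prices.mean() - prices.median()) / prices.std()

# pandas' built-in moment-based skewness estimate
moment_skew = prices.skew()

print(f"Simple skew index: {simple_skew:.3f}")
print(f"Series.skew():     {moment_skew:.3f}")
# Both are positive, i.e. both point to a right-skewed market
```

The two measures use different formulas and will not match numerically, but for a quick EDA the sign is what matters: both confirm the right skew.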
Raw price comparisons between properties of different sizes are misleading. Price per square foot normalises for size and reveals the true location premium — what you are paying purely for the neighbourhood, independent of how big the home is.
# Compute price per square foot — the normalised value metric
df["price_per_sqft"] = (df["price"] / df["sqft"]).round(2)
print("Price per sqft for all listings:")
print(df[["property_id","neighbourhood","price",
"sqft","price_per_sqft"]].sort_values(
"price_per_sqft", ascending=False
).to_string(index=False))
# Group by neighbourhood — mean price, mean price per sqft, listing count
neighbourhood = df.groupby("neighbourhood").agg(
listings = ("property_id", "count"),
mean_price = ("price", "mean"),
mean_price_sqft = ("price_per_sqft","mean"),
mean_sqft = ("sqft", "mean"),
mean_condition = ("condition", "mean")
).round(2).reset_index()
neighbourhood = neighbourhood.sort_values(
"mean_price_sqft", ascending=False
)
print("\nNeighbourhood performance summary:")
print(neighbourhood.to_string(index=False))
# Location premium: gap between most and least expensive per sqft
top = neighbourhood.iloc[0]
bottom = neighbourhood.iloc[-1]
premium_pct = (top["mean_price_sqft"] - bottom["mean_price_sqft"]) / bottom["mean_price_sqft"] * 100
print(f"\nLocation premium ({top['neighbourhood']} vs {bottom['neighbourhood']}): {premium_pct:.1f}%")
Price per sqft for all listings:
property_id neighbourhood price sqft price_per_sqft
P010 Riverside 620000 2480 250.00
P003 Riverside 565000 2300 245.65
P007 Riverside 490000 2050 239.02
P006 Greenhill 375000 1580 237.34
P001 Riverside 420000 1850 227.03
P011 Greenhill 395000 1900 207.89
P004 Greenhill 340000 1650 206.06
P002 Oldtown 285000 1420 200.70
P009 Oldtown 255000 1280 199.22
P012 Oldtown 230000 1200 191.67
P005 Oldtown 210000 1100 190.91
P008 Greenhill 310000 1720 180.23
Neighbourhood performance summary:
neighbourhood listings mean_price mean_price_sqft mean_sqft mean_condition
Riverside 4 523750.0 240.43 2170.0 9.0
Greenhill 4 355000.0 207.88 1712.5 7.5
Oldtown 4 245000.0 195.63 1250.0 4.5
Location premium (Riverside vs Oldtown): 22.9%
What just happened?
Method — derived metric before groupby
We computed price_per_sqft on the raw DataFrame before grouping — so when we aggregate by neighbourhood, the mean of the per-property rates flows naturally into the summary. This is the same pre-aggregation derivation pattern used in CS5 (Retail) — always compute derived metrics on raw rows first, then aggregate, rather than trying to derive them from already-aggregated totals.
Riverside commands $240.43 per square foot versus $195.63 in Oldtown — a location premium of 22.9%. For the same floor area, a Riverside buyer pays roughly $45 more per square foot purely for the address. The neighbourhood summary also reveals why: Riverside properties have a mean condition score of 9.0 versus 4.5 in Oldtown — the location premium is partly justified by significantly better property condition in Riverside.
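The warning about deriving metrics from already-aggregated totals is worth seeing concretely. A small sketch using the four Greenhill listings: the mean of the per-row price/sqft values and total price divided by total sqft give different answers, because the totals version implicitly weights by size.

```python
import pandas as pd

# The four Greenhill listings from the dataset above
greenhill = pd.DataFrame({
    "price": [340000, 375000, 310000, 395000],
    "sqft": [1650, 1580, 1720, 1900],
})

# Pattern used in the walkthrough: derive per row, then average
mean_of_ratios = (greenhill["price"] / greenhill["sqft"]).mean()

# Tempting shortcut: derive from already-aggregated totals
ratio_of_means = greenhill["price"].sum() / greenhill["sqft"].sum()

print(f"Mean of per-row price/sqft: {mean_of_ratios:.2f}")
print(f"Total price / total sqft:   {ratio_of_means:.2f}")
# They differ: the totals version is a sqft-weighted average,
# which gives the larger homes more influence
```

Neither number is "wrong" — they answer different questions — but only the per-row version matches the unweighted price_per_sqft comparison the walkthrough uses.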
Before the engineering team can build the weighting model, they need to know which features matter most. Pearson correlation gives us a ranked list — the foundation for deciding which variables go into the model and how much weight they should carry.
# Compute Pearson correlation between price and all numeric features
features = ["sqft","bedrooms","age_years","condition","price_per_sqft"]
print("Correlation with price (Pearson r):")
for feat in features:
r = df["price"].corr(df[feat]).round(3)
bar = "#" * int(abs(r) * 20)
print(f" {feat:<18} r = {r:+.3f} {bar}")
# Full correlation matrix for all numeric columns
numeric_cols = ["price","sqft","bedrooms","age_years","condition","price_per_sqft"]
corr_matrix = df[numeric_cols].corr().round(3)
print("\nFull correlation matrix:")
print(corr_matrix.to_string())
# Does condition compensate for age?
age_cond_corr = df["age_years"].corr(df["condition"]).round(3)
print(f"\nAge vs condition correlation: {age_cond_corr:.3f}")
print("Interpretation: older properties tend to have",
"worse" if age_cond_corr < 0 else "better", "condition scores")
Correlation with price (Pearson r):
sqft r = +0.976 ###################
bedrooms r = +0.923 ##################
age_years r = -0.855 #################
condition r = +0.916 ##################
price_per_sqft r = +0.891 #################
Full correlation matrix:
price sqft bedrooms age_years condition price_per_sqft
price 1.000 0.976 0.923 -0.855 0.916 0.891
sqft 0.976 1.000 0.930 -0.908 0.962 0.780
bedrooms 0.923 0.930 1.000 -0.875 0.903 0.786
age_years -0.855 -0.908 -0.875 1.000 -0.977 -0.688
condition 0.916 0.962 0.903 -0.977 1.000 0.719
Age vs condition correlation: -0.977
Interpretation: older properties tend to have worse condition scores
What just happened?
Method — correlation loop with visual bar
We looped over the feature list and computed each correlation independently using series.corr(series). The ASCII bar — "#" * int(abs(r) * 20) — gives an instant visual sense of correlation strength without needing a chart. This inline visualisation trick is common in quick analyst notebooks where speed matters more than aesthetics. The :+.3f format specifier forces the sign to always display, making positive and negative correlations immediately distinguishable.
Sqft is the strongest predictor at r = +0.976 — nearly a perfect linear relationship. Bedrooms (r = +0.923), condition (r = +0.916), and age (r = −0.855) follow. The critical finding for the engineering team is the age-condition correlation of −0.977 — the two variables are almost perfectly inversely related. This means age and condition are largely redundant — including both in a model would introduce multicollinearity. The team should use condition score and drop age, since condition captures the age signal while also reflecting renovation state.
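series.corr(series) implements the standard Pearson formula. For intuition, here is a sketch computing r for price against sqft directly from the definition — covariance divided by the product of the spreads — and checking it against np.corrcoef:

```python
import numpy as np

# price and sqft for the 12 NestView listings
price = np.array([420000, 285000, 565000, 340000, 210000, 375000,
                  490000, 310000, 255000, 620000, 395000, 230000], dtype=float)
sqft = np.array([1850, 1420, 2300, 1650, 1100, 1580,
                 2050, 1720, 1280, 2480, 1900, 1200], dtype=float)

def pearson_r(x, y):
    # Sum of co-deviations over the product of the deviation norms;
    # the 1/n factors cancel, so no explicit division by n is needed
    xd, yd = x - x.mean(), y - y.mean()
    return (xd * yd).sum() / np.sqrt((xd ** 2).sum() * (yd ** 2).sum())

r_manual = pearson_r(price, sqft)
r_numpy = np.corrcoef(price, sqft)[0, 1]
print(f"Manual Pearson r:  {r_manual:.3f}")
print(f"np.corrcoef check: {r_numpy:.3f}")
```

Writing the formula out makes the scale-invariance obvious: multiplying every price by a constant changes neither the numerator nor the denominator's ratio, which is why r is comparable across features with very different units.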
The product team's key feature request: surface properties that appear undervalued relative to their neighbourhood. A property priced below the neighbourhood average price per sqft by more than one standard deviation is a statistical outlier — potentially a buying opportunity for NestView's users.
# Compute neighbourhood-level mean and std of price_per_sqft
# We will use these to flag outliers within each neighbourhood
neighbourhood_stats = df.groupby("neighbourhood")["price_per_sqft"].agg(
mean_psf = "mean",
std_psf = "std"
).reset_index()
# Merge neighbourhood stats back onto the main DataFrame
df = df.merge(neighbourhood_stats, on="neighbourhood")
# Compute z-score: how many std deviations each property is from its
# neighbourhood mean price per sqft
df["psf_zscore"] = (
(df["price_per_sqft"] - df["mean_psf"]) / df["std_psf"]
).round(2)
# Flag undervalued properties: z-score below -1.0
# (more than 1 std below the neighbourhood mean price per sqft)
undervalued = df[df["psf_zscore"] < -1.0][
["property_id","neighbourhood","price","sqft",
"price_per_sqft","mean_psf","psf_zscore","condition"]
].sort_values("psf_zscore")
print("Undervalued properties (psf z-score < -1.0):")
print(undervalued.to_string(index=False))
# Show full z-score table for context
print("\nAll properties with neighbourhood z-scores:")
print(df[["property_id","neighbourhood","price_per_sqft",
"mean_psf","psf_zscore"]].sort_values(
["neighbourhood","psf_zscore"]
).to_string(index=False))
Undervalued properties (psf z-score < -1.0):
property_id neighbourhood price sqft price_per_sqft mean_psf psf_zscore condition
P001 Riverside 420000 1850 227.03 240.43 -1.34 8
P008 Greenhill 310000 1720 180.23 207.88 -1.18 8
All properties with neighbourhood z-scores:
property_id neighbourhood price_per_sqft mean_psf psf_zscore
P008 Greenhill 180.23 207.88 -1.18
P004 Greenhill 206.06 207.88 -0.08
P011 Greenhill 207.89 207.88 0.00
P006 Greenhill 237.34 207.88 1.26
P005 Oldtown 190.91 195.63 -0.93
P012 Oldtown 191.67 195.63 -0.78
P009 Oldtown 199.22 195.63 0.71
P002 Oldtown 200.70 195.63 1.00
P001 Riverside 227.03 240.43 -1.34
P007 Riverside 239.02 240.43 -0.14
P003 Riverside 245.65 240.43 0.52
P010 Riverside 250.00 240.43 0.96
What just happened?
Method — .merge() to bring group stats back onto individual rowsdf.merge(neighbourhood_stats, on="neighbourhood") joins the neighbourhood-level mean and std back onto every row in the original DataFrame — so each property row now also carries the statistics for its neighbourhood. This is the standard pattern for computing within-group z-scores: aggregate to get group stats, then merge back to enable row-level comparison. Without the merge, you would need to loop over groups or use .transform().
The z-score formula — (value - group_mean) / group_std — expresses each property's price per sqft in units of standard deviations from its neighbourhood mean. A z-score of −1.18 means the property is 1.18 standard deviations cheaper than its neighbourhood average on a per-sqft basis. The −1.0 threshold is a common first-pass outlier boundary — you can tighten it to −1.5 or −2.0 for stricter flagging.
Two properties are flagged as statistically undervalued, and both are in good shape. P008 in Greenhill at $180.23/sqft is $27.65 below the Greenhill average — with a condition score of 8, this is the most interesting flag for buyers: a well-maintained Greenhill property priced like an Oldtown one. P001 in Riverside sits 1.34 standard deviations below its neighbourhood average, also with a condition score of 8. Contrast these with P005 in Oldtown, which at z = −0.93 just misses the threshold: its condition score of 4 explains its low price — it is priced appropriately for its poor condition, not genuinely undervalued.
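The method note mentions .transform() as the merge-free alternative, and it is worth seeing: .transform() broadcasts each group's statistic straight back onto the group's rows. A sketch reproducing the same z-scores, with the per-property price-per-sqft values hardcoded so the block runs on its own:

```python
import pandas as pd

# Per-property price-per-sqft values from the step above
listings = pd.DataFrame({
    "property_id": ["P001","P002","P003","P004","P005","P006",
                    "P007","P008","P009","P010","P011","P012"],
    "neighbourhood": ["Riverside","Oldtown","Riverside","Greenhill","Oldtown","Greenhill",
                      "Riverside","Greenhill","Oldtown","Riverside","Greenhill","Oldtown"],
    "price_per_sqft": [227.03, 200.70, 245.65, 206.06, 190.91, 237.34,
                       239.02, 180.23, 199.22, 250.00, 207.89, 191.67],
})

# .transform() returns a Series aligned to the original rows, so the
# z-score is a one-liner with no intermediate stats frame and no merge
grp = listings.groupby("neighbourhood")["price_per_sqft"]
listings["psf_zscore"] = (
    (listings["price_per_sqft"] - grp.transform("mean")) / grp.transform("std")
).round(2)

print(listings[listings["psf_zscore"] < -1.0].to_string(index=False))
```

The merge pattern is still worth knowing — it scales to joining many group-level columns at once — but for a single within-group statistic, .transform() is the more idiomatic pandas.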
Checkpoint: Try relaxing the outlier threshold to df[df["psf_zscore"] < -0.5] to surface more candidates. Then cross-reference with condition score — only flag properties with condition >= 7. This two-condition filter — below-average price AND good condition — is the foundation of every value-investing screen used in real estate analytics platforms.
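The checkpoint's two-condition screen might look like the sketch below (values hardcoded from the walkthrough so the block is standalone; the −0.5 and condition ≥ 7 thresholds are the ones suggested above):

```python
import pandas as pd

# price-per-sqft plus condition for the 12 listings
listings = pd.DataFrame({
    "property_id": ["P001","P002","P003","P004","P005","P006",
                    "P007","P008","P009","P010","P011","P012"],
    "neighbourhood": ["Riverside","Oldtown","Riverside","Greenhill","Oldtown","Greenhill",
                      "Riverside","Greenhill","Oldtown","Riverside","Greenhill","Oldtown"],
    "price_per_sqft": [227.03, 200.70, 245.65, 206.06, 190.91, 237.34,
                       239.02, 180.23, 199.22, 250.00, 207.89, 191.67],
    "condition": [8, 5, 9, 7, 4, 7, 9, 8, 5, 10, 8, 4],
})

# Within-neighbourhood z-score on price per sqft
grp = listings.groupby("neighbourhood")["price_per_sqft"]
listings["psf_zscore"] = (
    listings["price_per_sqft"] - grp.transform("mean")
) / grp.transform("std")

# Two-condition value screen: meaningfully below the neighbourhood
# average AND in good shape
screen = listings[(listings["psf_zscore"] < -0.5) & (listings["condition"] >= 7)]
print(screen[["property_id", "neighbourhood", "psf_zscore", "condition"]]
      .to_string(index=False))
```

Note how the condition filter does the interpretive work automatically: cheap-but-shabby listings pass the price test and then fall out at the condition test, leaving only the genuine value candidates.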
Key Findings
The market spans $210,000 to $620,000 with a median of $357,500. The distribution is slightly right-skewed — the product team should display median-based benchmarks to buyers rather than means, which are inflated by premium Riverside listings.
Riverside commands a 22.9% location premium over Oldtown on a per-sqft basis ($240.43 vs $195.63). The premium is partly explained by Riverside's dramatically better average condition score (9.0 vs 4.5 in Oldtown).
Sqft is the strongest price predictor at r = +0.976, followed by bedrooms (+0.923) and condition (+0.916); age correlates at −0.855. Age and condition are almost perfectly inversely correlated at −0.977 — they are largely redundant. The model should use condition and drop age to avoid multicollinearity.
P008 in Greenhill and P001 in Riverside are flagged as undervalued — 1.18 and 1.34 standard deviations below their neighbourhood averages per sqft respectively, both with a condition score of 8. In neither case is the low price explained by poor condition, which makes both genuine value candidates, with P008 the standout for buyers priced out of Riverside.
Bedrooms (r = +0.923) and sqft (r = +0.976) are highly intercorrelated (+0.930 with each other) — larger homes have more bedrooms. Including both in a model would introduce multicollinearity. The valuation model should use sqft as the primary size feature and omit bedrooms.
Visualisations
P008 highlighted in amber — statistically undervalued at 1.18 std below the Greenhill mean
House Price EDA Decision Guide
Property datasets have a standard analytical shape. Here is the framework for approaching any residential price EDA systematically:
| Question | Metric | pandas Method | Watch Out For |
|---|---|---|---|
| What is the price landscape? | Mean, median, std, range | df["price"].describe() | Median is more robust than mean in skewed markets |
| Fair size-adjusted comparison? | Price per square foot | df["price"] / df["sqft"] | Always normalise before comparing differently sized homes |
| Location premium? | Mean price/sqft by neighbourhood | groupby().agg(mean) | Condition score can inflate neighbourhood averages |
| Which features drive price? | Pearson correlation with price | series.corr(series) | High inter-feature correlation = multicollinearity risk |
| Undervalued listings? | Within-group z-score on price/sqft | merge() + (val - mean) / std | Cross-reference condition — low price may be fair for poor condition |
Analyst's Note
What Would Come Next?
A senior analyst would build a multiple linear regression model using sqft and condition as primary features (dropping age and bedrooms due to multicollinearity), then encode neighbourhood as a target-encoded location premium score for the model to learn directly.
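As a sketch of that next step — illustrative only, since twelve rows is far too small for a trustworthy fit, as the limitations note — an ordinary least squares model on sqft and condition via numpy:

```python
import numpy as np

# sqft, condition, and price for the 12 NestView listings
sqft = np.array([1850, 1420, 2300, 1650, 1100, 1580,
                 2050, 1720, 1280, 2480, 1900, 1200], dtype=float)
condition = np.array([8, 5, 9, 7, 4, 7, 9, 8, 5, 10, 8, 4], dtype=float)
price = np.array([420000, 285000, 565000, 340000, 210000, 375000,
                  490000, 310000, 255000, 620000, 395000, 230000], dtype=float)

# Design matrix: intercept column, sqft, condition
X = np.column_stack([np.ones_like(sqft), sqft, condition])
coef, _, _, _ = np.linalg.lstsq(X, price, rcond=None)

# Goodness of fit: 1 - residual variance / total variance
pred = X @ coef
r2 = 1 - ((price - pred) ** 2).sum() / ((price - price.mean()) ** 2).sum()

print(f"Intercept:             {coef[0]:,.0f}")
print(f"$ per extra sqft:      {coef[1]:,.2f}")
print(f"$ per condition point: {coef[2]:,.0f}")
print(f"R^2: {r2:.3f}")
```

With only two predictors this stays interpretable — the sqft coefficient is a direct "dollars per square foot holding condition constant" figure the product team can sanity-check against the per-sqft analysis above.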
Limitations of This Analysis
Twelve listings produce unrealistically strong correlations. Real property markets have dozens of confounders including school catchment areas, proximity to transport, and hyperlocal supply dynamics. Z-score outlier detection also requires at least 30–50 listings per neighbourhood to be statistically meaningful.
Business Decisions This Could Drive
Use sqft and condition as the primary valuation inputs, display neighbourhood median price per sqft as the buyer benchmark, and surface P008 and P001 as featured opportunities — both are priced more than one standard deviation below their neighbourhood averages despite condition scores of 8.
Practice Questions
1. Which feature had the strongest Pearson correlation with house price in the NestView dataset?
2. Which pandas method was used to bring neighbourhood-level statistics back onto individual property rows so we could compute within-group z-scores?
3. Which Greenhill property was flagged as undervalued — priced 1.18 standard deviations below its neighbourhood average despite having a condition score of 8?
Quiz
1. Why is price per square foot a better comparison metric than raw listing price when comparing properties across a neighbourhood?
2. Why should the NestView valuation model use condition score but drop age_years, even though both correlate strongly with price?
3. What is the correct sequence for computing within-group z-scores in pandas?
Up Next · Case Study 9
Exploring Ecommerce User Behaviour
You are handed a session-level dataset from an online store. Which traffic sources convert best? Where do users drop off in the funnel? And which device type drives the highest average order value?