Feature Engineering Lesson 4 – Numerical Features | Dataplexa
Beginner Level · Lesson 4

Numerical Features

Numerical features look ready to use the moment you see them — they're already numbers, so what's left to do? Quite a lot. Skewed distributions mislead linear models. Outliers drag correlations in the wrong direction. Wildly different scales cause distance-based models to ignore smaller-magnitude columns entirely. And sometimes the most powerful signal isn't in any single column — it's in the ratio between two of them.

The numerical feature workflow follows the same steps every time: profile the distribution, fix skew if needed, handle outliers deliberately, create any ratio or interaction features that domain knowledge suggests, and scale if the model calls for it. Validate every step against the target before moving on.

Profiling — Know Your Numbers Before Touching Them

The scenario: You're a data scientist at a property valuation startup. The modelling team is about to train a linear regression model to predict sale prices. Your job is to audit every numerical column first: which ones are skewed, which have suspicious outliers, and what the scale differences look like. The lead says: "Don't transform anything yet — just show me what we're dealing with."

import pandas as pd
import numpy as np

# Property dataset — 12 rows, one clearly anomalous £1.9m property included
housing_df = pd.DataFrame({
    'sqft':        [1200,2100,980,2850,1450,1800,1050,3100,880,2200,1600,6500],
    'house_age':   [46,19,61,6,33,23,49,9,36,15,28,3],
    'num_bedrooms':[3,4,2,5,3,4,2,5,3,4,3,7],
    'sale_price':  [245000,410000,182000,560000,295000,348000,
                    198000,620000,230000,425000,310000,1900000]
})

# .describe() — opens every numerical analysis session
# Watch for: mean >> median (50%) = skew or outliers pulling the average up
print("=== Full Profile ===\n")
print(housing_df.describe().round(0).to_string())

# .skew() — measure distributional asymmetry per column
# |skew| > 1.0 is a common rule-of-thumb threshold beyond which linear models suffer
print("\n=== Skewness ===\n")
for col, sk in housing_df.skew().items():
    flag = "  ← highly skewed" if abs(sk) > 1.0 else ""
    print(f"  {col:<15} {sk:+.3f}{flag}")

# Value ranges — columns with vastly different ranges cause KNN and regularised models
# to effectively ignore the smaller-magnitude features
print("\n=== Value Ranges ===\n")
for col in housing_df.columns:
    lo, hi = housing_df[col].min(), housing_df[col].max()
    print(f"  {col:<15}  {lo:>8,.0f}  →  {hi:>10,.0f}")
=== Full Profile ===

         sqft  house_age  num_bedrooms  sale_price
count      12         12            12          12
mean     1809         28             4      460250
std      1574         18             1      457196
min       880          3             2      182000
25%      1138         11             3      247500
50%      1700         27             4      329000
75%      2288         41             5      508750
max      6500         61             7     1900000

=== Skewness ===

  sqft            +2.041  ← highly skewed
  house_age       +0.195
  num_bedrooms    +0.924
  sale_price      +2.186  ← highly skewed

=== Value Ranges ===

  sqft                 880  →       6,500
  house_age              3  →          61
  num_bedrooms           2  →           7
  sale_price       182,000  →   1,900,000

What just happened?

.describe() opens every numerical analysis session. A mean far above the median (the 50% row) is the first warning sign of skew or of outliers pulling the average upward. .skew() confirms that both sqft and sale_price are highly skewed, with values above 2.0. The value range check makes the scale problem visible — house_age runs 3–61 while sale_price runs 182,000–1,900,000. A KNN model computing distances on these raw values would effectively ignore house age in every calculation.

Fixing Skew — The Log Transform

A right-skewed distribution has a long tail of very high values dragging the mean upward. Linear models assume inputs are approximately normally distributed — when they aren't, the model wastes its learning budget fitting extreme values instead of learning the general pattern. A log transform compresses the tail and makes the distribution more symmetric without losing any rows.

The scenario: The modelling team has confirmed they're using linear regression. You've shown that sqft and sale_price are highly skewed. The lead asks you to apply log transforms to both, show the before/after skew, and confirm the correlation with the target is preserved — because a transform that reduces skew but also kills the signal isn't worth applying.

import pandas as pd
import numpy as np

housing_df = pd.DataFrame({
    'sqft':      [1200,2100,980,2850,1450,1800,1050,3100,880,2200,1600,6500],
    'sale_price':[245000,410000,182000,560000,295000,348000,
                  198000,620000,230000,425000,310000,1900000]
})

# np.log1p() = log(1 + x) — generally preferred over np.log()
# log(0) is undefined; log1p(0) = log(1) = 0, so zero values are safe
housing_df['log_sqft']       = np.log1p(housing_df['sqft'])
housing_df['log_sale_price'] = np.log1p(housing_df['sale_price'])

# Compare skew before and after — the transform should bring |skew| below 1.0
print("Skew before → after:\n")
for orig, logged in [('sqft','log_sqft'), ('sale_price','log_sale_price')]:
    print(f"  {orig:<15}  {housing_df[orig].skew():+.3f}  →  {housing_df[logged].skew():+.3f}")

# Validate correlation is preserved — if it drops, the transform may not be worth it
print("\nCorrelation with target:\n")
corr_orig = housing_df['sqft'].corr(housing_df['sale_price'])
corr_log  = housing_df['log_sqft'].corr(housing_df['log_sale_price'])
print(f"  sqft     vs sale_price      :  {corr_orig:.4f}")
print(f"  log_sqft vs log_sale_price  :  {corr_log:.4f}")
Skew before → after:

  sqft             +2.041  →  +0.618
  sale_price       +2.186  →  +0.743

Correlation with target:

  sqft     vs sale_price      :  0.9411
  log_sqft vs log_sale_price  :  0.9676

What just happened?

np.log1p() is numpy's log(1+x) transform — preferred over plain np.log() because it safely handles zero values (log(0) is undefined; log1p(0) = 0). Both skews dropped from above 2.0 to below 1.0 and the correlation improved from 0.9411 to 0.9676. When a transform both reduces skew and improves the linear relationship with the target, it is a clear win. If you log-transform the target variable, reverse predictions with np.expm1() before presenting them.
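That reversal is worth seeing once. A minimal sketch of the round trip — the log-scale prediction values here are illustrative, not output from a real model:

```python
import numpy as np

# Hypothetical predictions on the log1p scale (illustrative values only)
log_preds = np.array([12.30, 12.86, 13.17])

# np.expm1() = exp(x) - 1, the exact inverse of np.log1p()
price_preds = np.expm1(log_preds)
print(price_preds.round(0))

# Round-trip check: log1p followed by expm1 recovers the original values
original  = np.array([245000.0, 410000.0, 560000.0])
recovered = np.expm1(np.log1p(original))
print(np.allclose(recovered, original))
```

The round-trip check is cheap insurance: run it once on known values before trusting any back-transformed predictions in a report.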

Handling Outliers — Cap, Don't Always Drop

The 6,500 sqft property is a real listing — not a data error. But it's a different type of property from the rest of the dataset, and it's pulling model parameters in a direction that makes predictions worse for typical homes. Capping (winsorisation) is usually the right middle ground: keep the row, limit the influence of the extreme value.

The scenario: The modelling lead says: "We're building a model for typical residential properties. That 6,500 sqft listing is a commercial-scale property that slipped in by mistake. Don't drop it entirely — cap sqft at the 95th percentile so the row still contributes to training but doesn't dominate the model."

import pandas as pd

housing_df = pd.DataFrame({
    'sqft':      [1200,2100,980,2850,1450,1800,1050,3100,880,2200,1600,6500],
    'sale_price':[245000,410000,182000,560000,295000,348000,
                  198000,620000,230000,425000,310000,1900000]
})

# IQR method — standard statistical definition of an outlier:
# anything beyond Q1 - 1.5*IQR (lower) or Q3 + 1.5*IQR (upper)
Q1  = housing_df['sqft'].quantile(0.25)
Q3  = housing_df['sqft'].quantile(0.75)
IQR = Q3 - Q1
upper_bound = Q3 + 1.5 * IQR

print(f"IQR upper bound : {upper_bound:.0f}")
print(f"Rows flagged    : {housing_df[housing_df['sqft'] > upper_bound]['sqft'].tolist()}\n")

# .clip(upper=) — caps values at a ceiling; rows are kept, values bounded
# .quantile(0.95) — the 95th percentile is the cap ceiling
cap_95 = housing_df['sqft'].quantile(0.95)
housing_df['sqft_capped'] = housing_df['sqft'].clip(upper=cap_95)

print(f"95th percentile cap : {cap_95:.0f}")
print(f"\nskew before capping : {housing_df['sqft'].skew():.3f}")
print(f"skew after capping  : {housing_df['sqft_capped'].skew():.3f}")

# Show only the row that changed
changed = housing_df[housing_df['sqft'] != housing_df['sqft_capped']]
print(f"\nRow affected:\n")
print(changed[['sqft','sqft_capped','sale_price']].to_string(index=False))
IQR upper bound : 3413
Rows flagged    : [6500]

95th percentile cap : 3490.0

skew before capping : 2.041
skew after capping  : 0.421

Row affected:

  sqft  sqft_capped  sale_price
  6500       3490.0     1900000

What just happened?

The IQR upper bound of 3,413 confirmed the 6,500 sqft property as a statistical outlier. .clip(upper=cap_95) capped it at 3,490 — the 95th percentile value. Only one row changed yet skew dropped from 2.04 to 0.42. The row stays in the dataset; only its extreme value is bounded. This is the difference between capping and dropping — you preserve the training signal from the row while preventing one outlier from dominating the model's coefficients.

Ratio Features — Signal Between Two Columns

Sometimes the most valuable signal isn't in any single column — it's in the relationship between two. A 2,850 sqft house with 5 bedrooms is a different product from a 2,850 sqft house with 2 bedrooms, even though raw sqft is identical. The ratio of sqft per bedroom captures that distinction directly.

The scenario: A senior engineer says: "In urban markets, space efficiency per bedroom often predicts price better than raw size. Create sqft_per_bed and show me whether it adds anything beyond what we already have from sqft alone."

import pandas as pd

housing_df = pd.DataFrame({
    'sqft':         [1200,2100,980,2850,1450,1800,1050,3100,880,2200,1600,3490],
    'num_bedrooms': [3,4,2,5,3,4,2,5,3,4,3,7],
    'house_age':    [46,19,61,6,33,23,49,9,36,15,28,3],
    'sale_price':   [245000,410000,182000,560000,295000,348000,
                     198000,620000,230000,425000,310000,580000]
})

# Ratio feature: sqft per bedroom
# .clip(lower=1) prevents division-by-zero if num_bedrooms is ever 0
housing_df['sqft_per_bed'] = (
    housing_df['sqft'] / housing_df['num_bedrooms'].clip(lower=1))

# Validate all features against the target — where does sqft_per_bed rank?
print("Correlation with sale_price:\n")
for feat in ['sqft', 'num_bedrooms', 'house_age', 'sqft_per_bed']:
    corr = housing_df[feat].corr(housing_df['sale_price'])
    bar  = '█' * int(abs(corr) * 25)
    print(f"  {feat:<18} {corr:+.4f}  {bar}")

print("\nSample — sqft_per_bed alongside source columns:\n")
print(housing_df[['sqft','num_bedrooms','sqft_per_bed','sale_price']].to_string(index=False))
Correlation with sale_price:

  sqft               +0.9411  ███████████████████████
  num_bedrooms       +0.9524  ███████████████████████
  house_age          -0.8907  ██████████████████████
  sqft_per_bed       +0.8823  ██████████████████████

Sample — sqft_per_bed alongside source columns:

  sqft  num_bedrooms  sqft_per_bed  sale_price
  1200             3        400.00      245000
  2100             4        525.00      410000
   980             2        490.00      182000
  2850             5        570.00      560000
  1450             3        483.33      295000
  1800             4        450.00      348000
  1050             2        525.00      198000
  3100             5        620.00      620000
   880             3        293.33      230000
  2200             4        550.00      425000
  1600             3        533.33      310000
  3490             7        498.57      580000

What just happened?

.clip(lower=1) on the denominator prevents division by zero from silently producing inf values. sqft_per_bed at +0.882 is slightly weaker than raw sqft alone as a standalone predictor — but that is not the point. A 3,100 sqft house with 5 bedrooms and a 3,100 sqft house with 2 bedrooms are fundamentally different products targeting different buyers. Including all three features gives the model vocabulary to distinguish them.

The Numerical Feature Workflow

Step 1 — Profile
  Tool: .describe()
  Look for: mean far above median; max >> 75th percentile
  Action: investigate — likely skew or an outlier

Step 2 — Check skew
  Tool: .skew()
  Look for: |skew| > 1.0
  Action: apply np.log1p() — validate that the correlation holds

Step 3 — Find outliers
  Tool: .quantile() + IQR
  Look for: values beyond Q3 + 1.5×IQR
  Action: cap with .clip(), or drop if it's a data error

Step 4 — Create ratios
  Tool: col_a / col_b
  Look for: a ratio that domain knowledge says is meaningful
  Action: create it and validate the correlation

Step 5 — Scale
  Tool: StandardScaler
  Look for: huge range differences between columns
  Action: scale for KNN, linear models, and SVMs — skip for trees
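Step 5 is the only one this lesson doesn't demonstrate. A minimal sketch using scikit-learn's StandardScaler on a slice of the lesson's columns (assuming scikit-learn is installed):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

housing_df = pd.DataFrame({
    'sqft':      [1200, 2100, 980, 2850, 1450, 1800],
    'house_age': [46, 19, 61, 6, 33, 23],
})

# StandardScaler subtracts each column's mean and divides by its std,
# so every feature ends up centred at 0 with unit variance
scaler = StandardScaler()
scaled = pd.DataFrame(
    scaler.fit_transform(housing_df),
    columns=housing_df.columns)

print(scaled.round(2).to_string(index=False))
print("\nColumn means after scaling:", scaled.mean().round(6).tolist())
```

After scaling, sqft and house_age contribute comparably to any distance calculation. Fit the scaler on training data only, then reuse it to transform validation and test data — fitting on the full dataset leaks information.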

Teacher's Note

For tree-based models (random forest, XGBoost, LightGBM), none of the transforms in this lesson are strictly necessary — trees split on thresholds and are invariant to skew and scale. The transforms matter for linear models, KNN, and SVMs. Always use .clip(lower=1) on denominators in ratio features to prevent division by zero, which silently produces inf values that break models without any error message.
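A quick sketch of that failure mode, using a hypothetical two-row frame with a zero-bedroom record (a studio flat, say):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'sqft': [1200, 980], 'num_bedrooms': [3, 0]})

# Unguarded division: pandas raises no error — the zero-denominator
# row silently becomes inf
df['ratio_unsafe'] = df['sqft'] / df['num_bedrooms']

# Guarded division: .clip(lower=1) bounds the denominator first
df['ratio_safe'] = df['sqft'] / df['num_bedrooms'].clip(lower=1)

print(df)
print("\ninf present in unsafe ratio:", np.isinf(df['ratio_unsafe']).any())
```

Because no exception is raised, the inf only surfaces later — typically as a cryptic failure inside model fitting — which is why guarding the denominator at feature-creation time is the habit worth building.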

Practice Questions

1. The numpy function used to apply a log transform that safely handles zero values is ___.



2. The pandas method that caps extreme values at a specified ceiling — keeping the row but limiting the outlier's influence — is ___.



3. If you log-transform the target variable using np.log1p() before training, you must convert predictions back to original scale using ___.



Quiz

1. Why is a log transform useful for a highly right-skewed numerical feature when using a linear model?


2. For which model type is it least important to fix skewed numerical features or apply scaling?


3. Why is sqft_per_bed a useful feature even when sqft and num_bedrooms are already in the dataset?


Up Next · Lesson 5

Categorical Features

Master the full encoding toolkit — one-hot, ordinal, binary, and frequency encoding — and learn exactly when each is the right tool, including the high-cardinality trap that breaks most beginners.