Feature Engineering Lesson 37 – FE for Regression | Dataplexa
Advanced Level · Lesson 37

Feature Engineering for Regression

Regression models predict a number. That sounds simple — but the relationship between your features and that number is almost never a straight line. Feature engineering for regression is the art of reshaping your inputs until the model can actually find the pattern.

Linear regression assumes a linear relationship between each feature and the target. When that assumption breaks — and it almost always does — your model underperforms not because the algorithm is wrong, but because the features are in the wrong shape. Transformations, interactions, and polynomial terms reshape the input space so a linear model can fit what is actually a curved, multiplicative, or threshold-driven relationship.

The Five Regression Feature Engineering Moves

1. Log Transform the Target

When the target variable is right-skewed — house prices, salaries, revenue — a log transform pulls in the long tail and makes the distribution approximately normal. Linear regression fits skewed targets poorly; log-transformed targets fit cleanly. Remember to exponentiate predictions back to the original scale.

2. Log Transform Skewed Features

Features like income, page views, or transaction counts are typically right-skewed. Log-transforming them linearises their relationship with the target and reduces the leverage of extreme values that would otherwise distort regression coefficients.

3. Polynomial Features

Adding squared and cubed versions of a feature lets a linear model fit curved relationships. A house's price doesn't increase linearly with size — it curves upward. sqft² gives the model the flexibility to capture that curve without switching to a nonlinear algorithm.

4. Interaction Terms

The effect of one feature on the target often depends on another. Bedrooms matter more in larger houses. Discount rate matters more for high-volume products. Multiplying two features together creates an interaction term that captures this conditional relationship explicitly.

5. Ratio and Per-Unit Features

Price per square foot. Revenue per employee. Clicks per impression. Dividing one feature by another creates a normalised signal that often has a cleaner linear relationship with the target than either raw feature alone.
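Ratio features are one line of pandas, but the denominator needs a guard: a zero produces inf, which will break most regression fits. A minimal sketch, using hypothetical revenue and employees columns (not part of this lesson's dataset):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'revenue':   [120000, 450000, 260000, 98000],
    'employees': [4, 15, 13, 0],   # note the zero — raw division would yield inf
})

# Replace 0 in the denominator with NaN so the ratio becomes NaN instead of inf,
# which downstream imputation or row-dropping can handle cleanly
df['revenue_per_employee'] = df['revenue'] / df['employees'].replace(0, np.nan)

print(df)
```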

Log Transforming a Skewed Target and Feature

The scenario:

You're a data scientist at a real estate platform building a house price model. The target — sale price — is heavily right-skewed: most houses sell between $200k and $500k, but a handful sell for $2M+. The feature lot_size_sqft is similarly skewed. Your job is to log-transform both, confirm the skewness reduction numerically, and add the transformed versions to the training DataFrame.

# Import pandas, numpy, and scipy for skewness calculation
import pandas as pd
import numpy as np
from scipy.stats import skew  # measures asymmetry of a distribution

# Create a house price DataFrame — 10 rows, realistic skewed values
housing_df = pd.DataFrame({
    'house_id':      [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],                               # unique IDs
    'bedrooms':      [3, 4, 2, 5, 3, 4, 3, 2, 4, 6],                                # number of bedrooms
    'bathrooms':     [2, 3, 1, 4, 2, 3, 2, 1, 3, 5],                                # number of bathrooms
    'lot_size_sqft': [5000, 8500, 3200, 22000, 6100, 9800, 5500, 4200, 11000, 48000],# lot size — right-skewed
    'sale_price':    [285000, 410000, 195000, 890000, 320000,
                      475000, 305000, 210000, 520000, 1950000]                        # target — right-skewed
})

# Measure skewness of the raw values — positive skew means long right tail
raw_price_skew   = skew(housing_df['sale_price'])       # skewness of original sale price
raw_lotsize_skew = skew(housing_df['lot_size_sqft'])    # skewness of original lot size

# Apply log1p transform: log(1 + x) — safer than log(x) because it handles zeros without error
# log1p is the standard choice for financial and count-based features
housing_df['log_sale_price']    = np.log1p(housing_df['sale_price'])       # log-transformed target
housing_df['log_lot_size_sqft'] = np.log1p(housing_df['lot_size_sqft'])    # log-transformed feature

# Measure skewness after transformation — should be much closer to zero
log_price_skew   = skew(housing_df['log_sale_price'])      # skewness after log transform
log_lotsize_skew = skew(housing_df['log_lot_size_sqft'])   # skewness after log transform

# Report the before/after skewness comparison
print(f"sale_price   skewness: raw = {raw_price_skew:.3f}   →   log-transformed = {log_price_skew:.3f}")
print(f"lot_size_sqft skewness: raw = {raw_lotsize_skew:.3f}  →   log-transformed = {log_lotsize_skew:.3f}\n")

# Show the transformed columns alongside the originals
print(housing_df[['house_id','sale_price','log_sale_price','lot_size_sqft','log_lot_size_sqft']].round(3).to_string(index=False))
sale_price   skewness: raw = 2.052   →   log-transformed = 1.019
lot_size_sqft skewness: raw = 2.024  →   log-transformed = 0.928

 house_id  sale_price  log_sale_price  lot_size_sqft  log_lot_size_sqft
        1      285000          12.560           5000              8.517
        2      410000          12.924           8500              9.048
        3      195000          12.181           3200              8.071
        4      890000          13.699          22000              9.999
        5      320000          12.676           6100              8.716
        6      475000          13.071           9800              9.190
        7      305000          12.628           5500              8.613
        8      210000          12.255           4200              8.343
        9      520000          13.162          11000              9.306
       10     1950000          14.483          48000             10.779

What just happened?

The raw sale_price had a skewness of 2.052 — a strongly right-skewed distribution dominated by the $1.95M outlier. After log1p, skewness dropped to 1.019 — still mildly right-skewed, but far closer to symmetric. The $1.95M house now maps to 14.483 rather than sitting as a massive outlier at 1,950,000, so the model can fit it without distorting the coefficient. The same compression happened to lot_size_sqft: skewness fell from 2.024 to 0.928. When you make predictions, remember to reverse the transform: np.expm1(prediction) converts log-scale predictions back to dollar values.
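Closing the loop on that last point: a minimal sketch — assuming scikit-learn's LinearRegression, which is not part of this lesson's code — of training on the log-transformed columns and converting a prediction back to dollars with np.expm1:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

housing_df = pd.DataFrame({
    'lot_size_sqft': [5000, 8500, 3200, 22000, 6100, 9800, 5500, 4200, 11000, 48000],
    'sale_price':    [285000, 410000, 195000, 890000, 320000,
                      475000, 305000, 210000, 520000, 1950000],
})

# Fit on the log scale: both feature and target are log1p-transformed
X = np.log1p(housing_df[['lot_size_sqft']])
y = np.log1p(housing_df['sale_price'])
model = LinearRegression().fit(X, y)

# The model predicts log-prices — expm1 reverses log1p to recover dollars
log_pred    = model.predict(np.log1p(pd.DataFrame({'lot_size_sqft': [7500]})))
dollar_pred = np.expm1(log_pred)[0]
print(f"Predicted sale price for a 7,500 sqft lot: ${dollar_pred:,.0f}")
```

Forgetting the expm1 step is a classic bug: the model quietly returns values around 13 instead of values around $400,000.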

Polynomial Features and Interaction Terms

The scenario:

The model's residuals show a curve — predictions are too low at small and large house sizes and too high in the middle. This is a classic sign that the relationship between sqft and sale_price is nonlinear. You'll add a squared term to capture the curve, and an interaction term between bedrooms and sqft to let the model learn that bedrooms matter more in larger houses — both manually and using sklearn's PolynomialFeatures.

# Import libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import PolynomialFeatures  # generates polynomial and interaction terms

# Create a clean house size DataFrame — 10 rows
housing_df2 = pd.DataFrame({
    'house_id':  [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'sqft':      [1200, 1850, 950, 3100, 1600, 2200, 1400, 1050, 2600, 4200],  # house size in sqft
    'bedrooms':  [2, 3, 2, 5, 3, 4, 2, 2, 4, 6],                               # number of bedrooms
    'sale_price':[285000, 410000, 195000, 890000, 320000, 475000, 305000, 210000, 520000, 1050000]
})

# --- Manual polynomial and interaction features ---

# Polynomial term: sqft squared — captures the curve in price vs size
housing_df2['sqft_sq'] = housing_df2['sqft'] ** 2  # sqft raised to power 2

# Interaction term: bedrooms × sqft — captures that bedrooms matter more in larger houses
housing_df2['bed_x_sqft'] = housing_df2['bedrooms'] * housing_df2['sqft']

# Ratio feature: price per sqft — a direct normalised signal
housing_df2['price_per_sqft'] = housing_df2['sale_price'] / housing_df2['sqft']

# Print manual features
print("Manual polynomial and interaction features:")
print(housing_df2[['house_id','sqft','sqft_sq','bedrooms','bed_x_sqft','price_per_sqft']].to_string(index=False))

# --- sklearn PolynomialFeatures for systematic generation ---
# degree=2 generates: [1, sqft, bedrooms, sqft^2, sqft*bedrooms, bedrooms^2]
poly = PolynomialFeatures(degree=2, include_bias=False)  # include_bias=False drops the constant column

# Fit and transform only the two base features
X_base  = housing_df2[['sqft', 'bedrooms']].values   # numpy array of the two original features
X_poly  = poly.fit_transform(X_base)                  # generates all degree-2 terms

# Create a named DataFrame of the polynomial output
poly_df = pd.DataFrame(X_poly, columns=poly.get_feature_names_out(['sqft', 'bedrooms']))
print("\nsklearn PolynomialFeatures output (degree=2):")
print(poly_df.astype(int).to_string(index=False))
Manual polynomial and interaction features:
 house_id  sqft     sqft_sq  bedrooms  bed_x_sqft  price_per_sqft
        1  1200   1440000         2      2400          237.50
        2  1850   3422500         3      5550          221.62
        3   950    902500         2      1900          205.26
        4  3100   9610000         5     15500          287.10
        5  1600   2560000         3      4800          200.00
        6  2200   4840000         4      8800          215.91
        7  1400   1960000         2      2800          217.86
        8  1050   1102500         2      2100          200.00
        9  2600   6760000         4     10400          200.00
       10  4200  17640000         6     25200          250.00

sklearn PolynomialFeatures output (degree=2):
  sqft  bedrooms  sqft^2    sqft bedrooms  bedrooms^2
  1200         2  1440000        2400           4
  1850         3  3422500        5550           9
   950         2   902500        1900           4
  3100         5  9610000       15500          25
  1600         3  2560000        4800           9
  2200         4  4840000        8800          16
  1400         2  1960000        2800           4
  1050         2  1102500        2100           4
  2600         4  6760000       10400          16
  4200         6 17640000       25200          36

What just happened?

The manual block created three targeted features: sqft_sq for the curve, bed_x_sqft for the bedroom-size interaction (house 10 scores 25,200 — bedroom count matters far more there than in a 1,200 sqft house scoring 2,400), and price_per_sqft as a normalised signal. The sklearn block showed how PolynomialFeatures(degree=2) systematically generates every combination — sqft², sqft×bedrooms, bedrooms² — in one call. For many base features, the systematic approach saves time; for two or three features, manual engineering gives you more control over which terms to include.

Binning Continuous Features for Regression

Sometimes a continuous feature has a non-monotonic relationship with the target — it rises, then falls, or is flat in the middle with steep edges. Binning converts the continuous feature into ordered categories that a model can handle without assuming linearity across the full range.

The scenario:

The housing model's residual analysis shows that house_age has a U-shaped relationship with price — very new houses and very old (vintage) houses command premiums, while middle-aged houses are discounted. A linear feature for age would miss this entirely. You'll bin age into meaningful segments and add both the bin label and a derived "vintage flag" for houses over 80 years old.

# Import pandas and numpy
import pandas as pd
import numpy as np

# Create a house age and price DataFrame — 10 rows
housing_df3 = pd.DataFrame({
    'house_id':   [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'house_age':  [2, 15, 35, 55, 72, 88, 105, 8, 45, 120],    # age in years — wide spread
    'sale_price': [520000, 350000, 290000, 265000, 280000,       # U-shaped: new and very old = higher price
                   340000, 410000, 495000, 275000, 450000]
})

# Step 1: bin house_age into meaningful segments using pd.cut
# Bins are defined by the domain knowledge of the U-shaped relationship
age_bins   = [0, 10, 30, 60, 90, 200]                            # bin edges in years
age_labels = ['New (0-10)', 'Modern (11-30)', 'Mature (31-60)',   # human-readable labels
              'Older (61-90)', 'Vintage (91+)']

housing_df3['age_bin'] = pd.cut(
    housing_df3['house_age'],   # the column to bin
    bins=age_bins,              # the bin edges
    labels=age_labels,          # the label for each bin
    right=True                  # intervals are (left, right] — left-exclusive, right-inclusive
)

# Step 2: encode the age bin as an ordered integer for models that need numeric input
# Map each label to an integer rank based on its position in the list
age_rank_map = {label: i for i, label in enumerate(age_labels)}  # {'New':0, 'Modern':1, ...}
housing_df3['age_bin_rank'] = housing_df3['age_bin'].map(age_rank_map)  # integer encoding

# Step 3: binary vintage flag — houses older than 80 years may command a premium
housing_df3['is_vintage'] = (housing_df3['house_age'] > 80).astype(int)  # 1 if vintage, else 0

# Print results
print(housing_df3[['house_id','house_age','age_bin','age_bin_rank','is_vintage','sale_price']].to_string(index=False))
 house_id  house_age         age_bin  age_bin_rank  is_vintage  sale_price
        1          2      New (0-10)             0           0      520000
        2         15  Modern (11-30)             1           0      350000
        3         35  Mature (31-60)             2           0      290000
        4         55  Mature (31-60)             2           0      265000
        5         72   Older (61-90)             3           0      280000
        6         88   Older (61-90)             3           1      340000
        7        105   Vintage (91+)             4           1      410000
        8          8      New (0-10)             0           0      495000
        9         45  Mature (31-60)             2           0      275000
       10        120   Vintage (91+)             4           1      450000

What just happened?

pd.cut sliced house_age into five meaningful segments. The U-shape is now visible in the data: New (rank 0) houses sell for $495k–$520k, Mature (rank 2) houses sell for $265k–$290k, and Vintage (rank 4) houses bounce back to $410k–$450k. A linear feature for age would have produced a monotone coefficient that misses this pattern entirely. The is_vintage binary flag gives tree models an additional, ultra-clean split point: vintage vs not-vintage explains a large portion of the price premium for the oldest houses.
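One caveat worth making explicit: the integer age_bin_rank still looks monotone to a linear model, so a single coefficient on it cannot express the U-shape. One-hot encoding the bins — sketched here with pd.get_dummies on the same bin edges — gives each segment its own coefficient:

```python
import pandas as pd

ages = pd.Series([2, 15, 35, 55, 72, 88, 105, 8, 45, 120], name='house_age')
age_bins   = [0, 10, 30, 60, 90, 200]
age_labels = ['New (0-10)', 'Modern (11-30)', 'Mature (31-60)',
              'Older (61-90)', 'Vintage (91+)']
age_bin = pd.cut(ages, bins=age_bins, labels=age_labels)

# One 0/1 column per bin: a linear model can now give 'New' and 'Vintage'
# large coefficients and 'Mature' a small one — the U-shape becomes expressible
dummies = pd.get_dummies(age_bin, prefix='age')
print(dummies.sum())   # number of houses in each bin
```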

A Reference Guide — Which Transform for Which Problem

Different data shapes call for different transformations. Here's the decision map:

Situation                        Transform                 Code
Target is right-skewed           log1p                     np.log1p(y)
Feature is right-skewed          log1p                     np.log1p(X['col'])
Curved relationship with target  Polynomial (degree 2)     X['col']**2
Two features interact            Interaction term          X['a'] * X['b']
Non-monotonic / U-shaped         Binning                   pd.cut(X['col'], bins=...)
Need scale-free comparison       Ratio feature             X['a'] / X['b']
Very heavy tails (both sides)    Box-Cox or Yeo-Johnson    PowerTransformer(method='yeo-johnson')
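For the last row of the table, a minimal sketch of sklearn's PowerTransformer on made-up values. Yeo-Johnson accepts zeros and negatives; Box-Cox (method='box-cox') requires strictly positive input:

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

# A heavy-tailed feature that includes a negative value — this rules out Box-Cox
X = np.array([[-500.0], [120.0], [900.0], [4500.0], [38000.0]])

# Yeo-Johnson estimates the transform parameter from the data;
# by default the output is also standardised to zero mean, unit variance
pt  = PowerTransformer(method='yeo-johnson')
X_t = pt.fit_transform(X)

print(X_t.round(3).ravel())
```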

Teacher's Note

Polynomial features grow fast. With 10 base features and degree=2, PolynomialFeatures generates 65 output columns. With degree=3 that jumps to 285. Most of those terms will be noise — and adding noise to a regression model inflates variance without reducing bias. Two rules of thumb: first, only add polynomial terms for features you already suspect have a curved relationship with the target based on residual plots or domain knowledge. Second, always pair polynomial features with regularisation — Ridge or Lasso — to penalise the coefficients of noisy terms back toward zero.
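The "pair polynomial features with regularisation" advice is straightforward to follow with a scikit-learn Pipeline. A sketch on synthetic data (the 10-feature, 65-column numbers match the note above; the data itself is invented for illustration):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import Ridge

rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(200, 10))                           # 10 base features
y = 2 * X[:, 0]**2 + X[:, 1] * X[:, 2] + rng.normal(0, 5, 200)   # curve + interaction + noise

model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),  # expands 10 features to 65
    StandardScaler(),                                  # Ridge's penalty is scale-sensitive
    Ridge(alpha=1.0),                                  # shrinks coefficients of noisy terms
)
model.fit(X, y)
print(model.named_steps['polynomialfeatures'].n_output_features_)  # 65
```

Standardising between the expansion and the penalty matters: without it, Ridge penalises sqft² (huge scale) far less than bedrooms² (tiny scale).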

Practice Questions

1. Which numpy function is the standard choice for log-transforming features or targets that may contain zero values, because it computes log(1 + x) safely?



2. A feature created by multiplying two columns together — to capture that the effect of one variable depends on the level of another — is called an ________ ________.



3. After predicting on a log1p-transformed target, you must reverse the transform using np.________(prediction) to convert back to the original scale.



Quiz

1. Why does log-transforming a right-skewed regression target improve model performance?


2. You add degree-2 polynomial features to a regression model with 15 base features, generating 135 total features. The model overfits badly. The best corrective action is:


3. Which pandas function converts a continuous feature into labelled, ordered bins using manually specified bin edges?


Up Next · Lesson 38

Feature Engineering for Classification

Encoding strategies, decision boundary features, and probability calibration techniques that give classification models the clearest possible signal.