Feature Engineering Course
Feature Engineering for Regression
Regression models predict a number. That sounds simple — but the relationship between your features and that number is almost never a straight line. Feature engineering for regression is the art of reshaping your inputs until the model can actually find the pattern.
Linear regression assumes a linear relationship between each feature and the target. When that assumption breaks — and it almost always does — your model underperforms not because the algorithm is wrong, but because the features are in the wrong shape. Transformations, interactions, and polynomial terms reshape the input space so a linear model can fit what is actually a curved, multiplicative, or threshold-driven relationship.
The Five Regression Feature Engineering Moves
Log Transform the Target
When the target variable is right-skewed — house prices, salaries, revenue — a log transform pulls in the long tail and makes the distribution approximately normal. Linear regression fits skewed targets poorly; log-transformed targets fit cleanly. Remember to exponentiate predictions back to original scale.
Log Transform Skewed Features
Features like income, page views, or transaction counts are typically right-skewed. Log-transforming them linearises their relationship with the target and reduces the leverage of extreme values that would otherwise distort regression coefficients.
Polynomial Features
Adding squared and cubed versions of a feature lets a linear model fit curved relationships. A house's price doesn't increase linearly with size — it curves upward. sqft² gives the model the flexibility to capture that curve without switching to a nonlinear algorithm.
Interaction Terms
The effect of one feature on the target often depends on another. Bedrooms matter more in larger houses. Discount rate matters more for high-volume products. Multiplying two features together creates an interaction term that captures this conditional relationship explicitly.
Ratio and Per-Unit Features
Price per square foot. Revenue per employee. Clicks per impression. Dividing one feature by another creates a normalised signal that often has a cleaner linear relationship with the target than either raw feature alone.
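One practical wrinkle with ratio features: real data often contains zero denominators, and a raw division produces inf values that break downstream models. A minimal sketch — with made-up revenue and headcount numbers, not data from this lesson — that guards against this by mapping zero denominators to NaN first:

```python
import pandas as pd
import numpy as np

# Toy data — one company has no recorded employees
df = pd.DataFrame({
    'revenue':   [120000, 450000, 80000],
    'employees': [4, 15, 0],
})

# Replace 0 in the denominator with NaN so the ratio comes out missing, not inf
df['revenue_per_employee'] = df['revenue'] / df['employees'].replace(0, np.nan)
print(df)
```

The resulting NaN can then be handled by whatever missing-value strategy the rest of the pipeline uses, rather than silently propagating infinities into the model.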
Log Transforming a Skewed Target and Feature
The scenario:
You're a data scientist at a real estate platform building a house price model. The target — sale price — is heavily right-skewed: most houses sell between $200k and $500k, but a handful sell for $2M+. The feature lot_size_sqft is similarly skewed. Your job is to log-transform both, confirm the skewness reduction numerically, and add the transformed versions to the training DataFrame.
# Import pandas, numpy, and scipy for skewness calculation
import pandas as pd
import numpy as np
from scipy.stats import skew # measures asymmetry of a distribution
# Create a house price DataFrame — 10 rows, realistic skewed values
housing_df = pd.DataFrame({
'house_id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], # unique IDs
'bedrooms': [3, 4, 2, 5, 3, 4, 3, 2, 4, 6], # number of bedrooms
'bathrooms': [2, 3, 1, 4, 2, 3, 2, 1, 3, 5], # number of bathrooms
'lot_size_sqft': [5000, 8500, 3200, 22000, 6100, 9800, 5500, 4200, 11000, 48000],  # lot size — right-skewed
'sale_price': [285000, 410000, 195000, 890000, 320000,
475000, 305000, 210000, 520000, 1950000] # target — right-skewed
})
# Measure skewness of the raw values — positive skew means long right tail
raw_price_skew = skew(housing_df['sale_price']) # skewness of original sale price
raw_lotsize_skew = skew(housing_df['lot_size_sqft']) # skewness of original lot size
# Apply log1p transform: log(1 + x) — safer than log(x) because log1p(0) = 0, while log(0) returns -inf
# log1p is the standard choice for financial and count-based features
housing_df['log_sale_price'] = np.log1p(housing_df['sale_price']) # log-transformed target
housing_df['log_lot_size_sqft'] = np.log1p(housing_df['lot_size_sqft']) # log-transformed feature
# Measure skewness after transformation — should be much closer to zero
log_price_skew = skew(housing_df['log_sale_price']) # skewness after log transform
log_lotsize_skew = skew(housing_df['log_lot_size_sqft']) # skewness after log transform
# Report the before/after skewness comparison
print(f"sale_price skewness: raw = {raw_price_skew:.3f} → log-transformed = {log_price_skew:.3f}")
print(f"lot_size_sqft skewness: raw = {raw_lotsize_skew:.3f} → log-transformed = {log_lotsize_skew:.3f}\n")
# Show the transformed columns alongside the originals
print(housing_df[['house_id','sale_price','log_sale_price','lot_size_sqft','log_lot_size_sqft']].round(3).to_string(index=False))
sale_price skewness: raw = 2.052 → log-transformed = 1.019
lot_size_sqft skewness: raw = 2.024 → log-transformed = 0.928
house_id sale_price log_sale_price lot_size_sqft log_lot_size_sqft
1 285000 12.560 5000 8.517
2 410000 12.924 8500 9.048
3 195000 12.181 3200 8.071
4 890000 13.699 22000 9.999
5 320000 12.676 6100 8.716
6 475000 13.071 9800 9.190
7 305000 12.628 5500 8.613
8 210000 12.255 4200 8.343
9 520000 13.162 11000 9.306
10 1950000 14.483 48000 10.779
What just happened?
The raw sale_price had a skewness of 2.052 — a strongly right-skewed distribution dominated by the $1.95M outlier. After log1p, skewness dropped to 1.019: still positive on a sample of only ten houses, but roughly halved and far less dominated by the tail. The $1.95M house now maps to 14.483 rather than sitting at 1,950,000, a full order of magnitude above the median — the model can fit it without distorting the coefficients. The same compression happened to lot_size_sqft: skewness fell from 2.024 to 0.928. When you make predictions, remember to reverse the transform: np.expm1(prediction) converts log-scale predictions back to dollar values.
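Two quick sanity checks on the transform itself, as a standalone snippet rather than part of the platform's pipeline: np.expm1 exactly inverts np.log1p, and the skewness reduction can be re-verified on the same ten sale prices:

```python
import numpy as np
from scipy.stats import skew

prices = np.array([285000, 410000, 195000, 890000, 320000,
                   475000, 305000, 210000, 520000, 1950000], dtype=float)

log_prices = np.log1p(prices)          # forward transform used for training
recovered = np.expm1(log_prices)       # reverse transform applied to predictions
print(np.allclose(recovered, prices))  # round trip is exact: True

print(round(skew(prices), 3), round(skew(log_prices), 3))  # skewness before vs after
```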
Polynomial Features and Interaction Terms
The scenario:
The model's residuals show a curve — predictions are too low at small and large house sizes and too high in the middle. This is a classic sign that the relationship between sqft and sale_price is nonlinear. You'll add a squared term to capture the curve, and an interaction term between bedrooms and sqft to let the model learn that bedrooms matter more in larger houses — both manually and using sklearn's PolynomialFeatures.
# Import libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import PolynomialFeatures # generates polynomial and interaction terms
# Create a clean house size DataFrame — 10 rows
housing_df2 = pd.DataFrame({
'house_id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
'sqft': [1200, 1850, 950, 3100, 1600, 2200, 1400, 1050, 2600, 4200], # house size in sqft
'bedrooms': [2, 3, 2, 5, 3, 4, 2, 2, 4, 6], # number of bedrooms
'sale_price':[285000, 410000, 195000, 890000, 320000, 475000, 305000, 210000, 520000, 1050000]
})
# --- Manual polynomial and interaction features ---
# Polynomial term: sqft squared — captures the curve in price vs size
housing_df2['sqft_sq'] = housing_df2['sqft'] ** 2 # sqft raised to power 2
# Interaction term: bedrooms × sqft — captures that bedrooms matter more in larger houses
housing_df2['bed_x_sqft'] = housing_df2['bedrooms'] * housing_df2['sqft']
# Ratio feature: price per sqft — a direct normalised signal
housing_df2['price_per_sqft'] = housing_df2['sale_price'] / housing_df2['sqft']
# Print manual features
print("Manual polynomial and interaction features:")
print(housing_df2[['house_id','sqft','sqft_sq','bedrooms','bed_x_sqft','price_per_sqft']].round(2).to_string(index=False))
# --- sklearn PolynomialFeatures for systematic generation ---
# degree=2 with include_bias=False generates: [sqft, bedrooms, sqft^2, sqft*bedrooms, bedrooms^2]
poly = PolynomialFeatures(degree=2, include_bias=False) # include_bias=False drops the constant column
# Fit and transform only the two base features
X_base = housing_df2[['sqft', 'bedrooms']].values # numpy array of the two original features
X_poly = poly.fit_transform(X_base) # generates all degree-2 terms
# Create a named DataFrame of the polynomial output
poly_df = pd.DataFrame(X_poly, columns=poly.get_feature_names_out(['sqft', 'bedrooms']))
print("\nsklearn PolynomialFeatures output (degree=2):")
print(poly_df.astype(int).to_string(index=False))
Manual polynomial and interaction features:
house_id sqft sqft_sq bedrooms bed_x_sqft price_per_sqft
1 1200 1440000 2 2400 237.50
2 1850 3422500 3 5550 221.62
3 950 902500 2 1900 205.26
4 3100 9610000 5 15500 287.10
5 1600 2560000 3 4800 200.00
6 2200 4840000 4 8800 215.91
7 1400 1960000 2 2800 217.86
8 1050 1102500 2 2100 200.00
9 2600 6760000 4 10400 200.00
10 4200 17640000 6 25200 250.00
sklearn PolynomialFeatures output (degree=2):
sqft bedrooms sqft^2 sqft bedrooms bedrooms^2
1200 2 1440000 2400 4
1850 3 3422500 5550 9
950 2 902500 1900 4
3100 5 9610000 15500 25
1600 3 2560000 4800 9
2200 4 4840000 8800 16
1400 2 1960000 2800 4
1050 2 1102500 2100 4
2600 4 6760000 10400 16
4200 6 17640000 25200 36
What just happened?
The manual block created three targeted features: sqft_sq for the curve, bed_x_sqft for the bedroom-size interaction (house 10 scores 25,200 — bedroom count matters far more there than in a 1,200 sqft house scoring 2,400), and price_per_sqft as a normalised signal. The sklearn block showed how PolynomialFeatures(degree=2) systematically generates every combination — sqft², sqft×bedrooms, bedrooms² — in one call. For many base features, the systematic approach saves time; for two or three features, manual engineering gives you more control over which terms to include.
Binning Continuous Features for Regression
Sometimes a continuous feature has a non-monotonic relationship with the target — it rises, then falls, or is flat in the middle with steep edges. Binning converts the continuous feature into ordered categories that a model can handle without assuming linearity across the full range.
The scenario:
The housing model's residual analysis shows that house_age has a U-shaped relationship with price — very new houses and very old (vintage) houses command premiums, while middle-aged houses are discounted. A linear feature for age would miss this entirely. You'll bin age into meaningful segments and add both the bin label and a derived "vintage flag" for houses over 80 years old.
# Import pandas and numpy
import pandas as pd
import numpy as np
# Create a house age and price DataFrame — 10 rows
housing_df3 = pd.DataFrame({
'house_id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
'house_age': [2, 15, 35, 55, 72, 88, 105, 8, 45, 120], # age in years — wide spread
'sale_price': [520000, 350000, 290000, 265000, 280000, # U-shaped: new and very old = higher price
340000, 410000, 495000, 275000, 450000]
})
# Step 1: bin house_age into meaningful segments using pd.cut
# Bins are defined by the domain knowledge of the U-shaped relationship
age_bins = [0, 10, 30, 60, 90, 200] # bin edges in years
age_labels = ['New (0-10)', 'Modern (11-30)', 'Mature (31-60)', # human-readable labels
'Older (61-90)', 'Vintage (91+)']
housing_df3['age_bin'] = pd.cut(
housing_df3['house_age'], # the column to bin
bins=age_bins, # the bin edges
labels=age_labels, # the label for each bin
right=True # intervals are (left, right] — left-exclusive, right-inclusive
)
# Step 2: encode the age bin as an ordered integer for models that need numeric input
# Map each label to an integer rank based on its position in the list
age_rank_map = {label: i for i, label in enumerate(age_labels)} # {'New':0, 'Modern':1, ...}
housing_df3['age_bin_rank'] = housing_df3['age_bin'].map(age_rank_map) # integer encoding
# Step 3: binary vintage flag — houses older than 80 years may command a premium
housing_df3['is_vintage'] = (housing_df3['house_age'] > 80).astype(int) # 1 if vintage, else 0
# Print results
print(housing_df3[['house_id','house_age','age_bin','age_bin_rank','is_vintage','sale_price']].to_string(index=False))
house_id house_age age_bin age_bin_rank is_vintage sale_price
1 2 New (0-10) 0 0 520000
2 15 Modern (11-30) 1 0 350000
3 35 Mature (31-60) 2 0 290000
4 55 Mature (31-60) 2 0 265000
5 72 Older (61-90) 3 0 280000
6 88 Older (61-90) 3 1 340000
7 105 Vintage (91+) 4 1 410000
8 8 New (0-10) 0 0 495000
9 45 Mature (31-60) 2 0 275000
10 120 Vintage (91+) 4 1 450000
What just happened?
pd.cut sliced house_age into five meaningful segments. The U-shape is now visible in the data: New (rank 0) houses sell for $495k–$520k, Mature (rank 2) houses sell for $265k–$290k, and Vintage (rank 4) houses bounce back to $410k–$450k. A linear feature for age would have produced a monotone coefficient that misses this pattern entirely. The is_vintage binary flag gives tree models an additional, ultra-clean split point: vintage vs not-vintage explains a large portion of the price premium for the oldest houses.
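The bin edges above came from domain knowledge. When you have no such prior, pd.qcut is the quantile-based counterpart to pd.cut: it picks the edges from the data so each bin holds roughly the same number of rows. A quick sketch on the same ten ages:

```python
import pandas as pd

ages = pd.Series([2, 15, 35, 55, 72, 88, 105, 8, 45, 120])

# q=4: quartile bins — edges come from the data, not from hand-picked thresholds
age_qbin = pd.qcut(ages, q=4, labels=['Q1', 'Q2', 'Q3', 'Q4'])
counts = age_qbin.value_counts().sort_index()
print(counts)
```

Quantile bins guarantee balanced bin populations, but their edges rarely carry the domain meaning that hand-picked cutoffs like the vintage threshold do.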
A Reference Guide — Which Transform for Which Problem
Different data shapes call for different transformations. Here's the decision map:
| Situation | Transform | Code |
|---|---|---|
| Target is right-skewed | log1p | np.log1p(y) |
| Feature is right-skewed | log1p | np.log1p(X['col']) |
| Curved relationship with target | Polynomial (degree 2) | X['col']**2 |
| Two features interact | Interaction term | X['a'] * X['b'] |
| Non-monotonic / U-shaped | Binning | pd.cut(X['col'], bins=...) |
| Need scale-free comparison | Ratio feature | X['a'] / X['b'] |
| Very heavy tails (both sides) | Box-Cox or Yeo-Johnson | PowerTransformer(method='yeo-johnson') |
Teacher's Note
Polynomial features grow fast. With 10 base features and degree=2, PolynomialFeatures generates 65 output columns. With degree=3 that jumps to 285. Most of those terms will be noise — and adding noise to a regression model inflates variance without reducing bias. Two rules of thumb: first, only add polynomial terms for features you already suspect have a curved relationship with the target based on residual plots or domain knowledge. Second, always pair polynomial features with regularisation — Ridge or Lasso — to penalise the coefficients of noisy terms back toward zero.
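The note's second rule can be sketched as a pipeline on synthetic data — the feature ranges, coefficients, and alpha=1.0 below are all made up for illustration, not tuned values:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import Ridge

rng = np.random.default_rng(42)
X = rng.uniform(800, 4200, size=(200, 2))                           # synthetic sqft, lot size
y = 50 * X[:, 0] + 0.01 * X[:, 0] ** 2 + rng.normal(0, 5000, 200)   # curved target + noise

# Polynomial expansion → scaling → Ridge: regularisation shrinks noisy terms toward zero
model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    StandardScaler(),
    Ridge(alpha=1.0),
)
model.fit(X, y)
print(round(model.score(X, y), 3))  # R² on the training data
```

In practice alpha is chosen by cross-validation (e.g. RidgeCV), and the scaler matters: polynomial terms span wildly different magnitudes, and Ridge penalises all coefficients on the same scale.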
Practice Questions
1. Which numpy function is the standard choice for log-transforming features or targets that may contain zero values, because it computes log(1 + x) safely?
2. A feature created by multiplying two columns together — to capture that the effect of one variable depends on the level of another — is called an ________ ________ .
3. After predicting on a log1p-transformed target, you must reverse the transform using np.________(prediction) to convert back to the original scale.
Quiz
1. Why does log-transforming a right-skewed regression target improve model performance?
2. You add degree-2 polynomial features to a regression model with 15 base features, generating 135 total features. The model overfits badly. The best corrective action is:
3. Which pandas function converts a continuous feature into labelled, ordered bins using manually specified bin edges?
Up Next · Lesson 38
Feature Engineering for Classification
Encoding strategies, decision boundary features, and probability calibration techniques that give classification models the clearest possible signal.