Feature Engineering Lesson 23 – Transformations for Skewed Data | Dataplexa
Intermediate Level · Lesson 23

Transformations for Skewed Data

Skewed columns punish linear models, inflate error metrics, and make distance-based algorithms unreliable. The fix isn't magic — it's a handful of mathematical transformations that compress the tail and bring the distribution closer to symmetry.

A skewed distribution has a long tail on one side — most values cluster low while a few extreme values stretch far to the right (positive skew), or vice versa. Transformations like log, square root, and Box-Cox compress that tail so the column behaves more like a normal distribution and contributes more useful signal to your model.

Skewness Hurts Models — Here's the Damage

Skewness is everywhere in real data. House prices, loan amounts, transaction values, website session lengths, salaries — nearly every financial or behavioural column in a real dataset has a right-skewed distribution. A handful of extreme values in the upper tail create three compounding problems:

1. Linear models get pulled off course

Linear and logistic regression assume roughly equal variance across the range of a feature. When a column has a long tail, a handful of high values dominate the coefficient estimate. The model learns "how to handle outliers" rather than the actual underlying relationship.

2. Distance metrics break down

KNN, K-Means, and SVM rely on Euclidean distance. A skewed column distorts that geometry: the few extreme tail values create enormous gaps that dominate every distance calculation, while the meaningful differences among the many typical values near the median become negligible by comparison.

3. Neural networks train slowly

Gradient descent takes much longer to converge when input features have drastically different scales and skewed distributions. Transforming to approximate normality speeds up training and stabilises weight updates.
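These problems are easy to see in miniature. The sketch below uses synthetic data (not the lesson's housing set) to show that standardising a skewed column does not tame its tail; only a transformation does:

```python
import numpy as np

rng = np.random.default_rng(0)
price = rng.lognormal(mean=12.5, sigma=0.8, size=1000)   # right-skewed
rooms = rng.normal(loc=4, scale=1, size=1000)            # roughly symmetric

# Standardise both columns: scaling alone does NOT fix skew
def z(x):
    return (x - x.mean()) / x.std()

z_price, z_rooms = z(price), z(rooms)

# The symmetric column stays within a few standard deviations,
# while the skewed column's tail points remain extreme
print(f"max |z| rooms        : {np.abs(z_rooms).max():.2f}")
print(f"max |z| price        : {np.abs(z_price).max():.2f}")
print(f"max |z| log1p(price) : {np.abs(z(np.log1p(price))).max():.2f}")
```

After a log transform, the price column's maximum z-score drops to roughly the same range as the symmetric column, which is exactly what distance-based models and gradient descent want.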

The Four Transformations You Actually Use

There are dozens of mathematical transformations — but in practice, four do the heavy lifting. Here is what each one does and when to reach for it:

1. Log transformation — np.log1p(x)

The go-to for right-skewed data with a long upper tail. Compresses large values aggressively while leaving small values relatively untouched. Use log1p (which computes log(1+x)) to safely handle zero values without errors.

2. Square root — np.sqrt(x)

A gentler compression than log. Good for moderately skewed data or count data (number of transactions, page views, support tickets). Less aggressive — it won't fully normalise heavy-tailed distributions but works well when the skew is moderate.

3. Box-Cox — PowerTransformer(method='box-cox')

Finds the optimal power transformation parameter (lambda) to maximise normality. More powerful than log or sqrt because it adapts to the data rather than using a fixed function. Requires strictly positive values — will error on zeros or negatives.

4. Yeo-Johnson — PowerTransformer(method='yeo-johnson')

The Box-Cox upgrade. Works on zero and negative values, handles both positive and negative skew, and is the safest default when you don't know the distribution shape in advance. This is typically your first choice when reaching for a power transformer.
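A quick side-by-side on a small made-up sample (the values are illustrative only) shows how each of the four behaves. Note that PowerTransformer standardises its output by default, so its values come back z-scaled:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PowerTransformer

# A small right-skewed sample, invented for illustration
x = np.array([1.0, 2.0, 2.0, 3.0, 4.0, 5.0, 8.0, 15.0, 40.0, 120.0])
X = x.reshape(-1, 1)   # sklearn expects a 2D (n_samples, n_features) array

results = {
    'original':    x,
    'log1p':       np.log1p(x),
    'sqrt':        np.sqrt(x),
    'box-cox':     PowerTransformer(method='box-cox').fit_transform(X).ravel(),
    'yeo-johnson': PowerTransformer(method='yeo-johnson').fit_transform(X).ravel(),
}

for name, vals in results.items():
    print(f"{name:12s} skew = {pd.Series(vals).skew():6.3f}")
```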

Step 1 — Measuring Skewness Before Transforming

The scenario: You're a data scientist at a property analytics firm. Your housing dataset has several numerical columns and your manager has complained that the regression model's residuals are non-normal — a classic symptom of skewed inputs. Before touching anything, you want to measure which columns are actually skewed and by how much, so you can prioritise which ones need treatment.

# Import libraries
import pandas as pd
import numpy as np
from scipy import stats

# Build a realistic housing DataFrame — 300 rows
np.random.seed(42)

housing_df = pd.DataFrame({
    'house_id':        range(1, 301),
    # price: heavily right-skewed — log-normal distribution
    'price':           np.random.lognormal(mean=12.5, sigma=0.8, size=300),
    # lot_area: moderately right-skewed
    'lot_area':        np.random.lognormal(mean=8.5,  sigma=0.5, size=300),
    # rooms: count data, mild right skew
    'rooms':           np.random.poisson(lam=4, size=300) + 1,
    # age: roughly symmetric, low skew
    'age':             np.random.normal(loc=25, scale=10, size=300).clip(0),
    # renovation_cost: zero-inflated — many zeros, some large values
    'renovation_cost': np.concatenate([
                           np.zeros(200),                              # 200 with no reno
                           np.random.lognormal(7, 1, size=100)        # 100 with costs
                       ])
})

# Measure skewness using pandas .skew()
# Positive skew = right tail, Negative = left tail
# Rule of thumb: |skew| > 1 = highly skewed, 0.5–1 = moderate, <0.5 = mild
skewness = housing_df.drop('house_id', axis=1).skew().round(3)

print("Skewness per column:")
print(skewness.to_string())
print()

# Flag columns that need transformation
print("Columns needing transformation (|skew| > 0.75):")
print(skewness[skewness.abs() > 0.75].index.tolist())
Skewness per column:
price               3.842
lot_area            1.347
rooms               0.612
age                 0.183
renovation_cost     2.916

Columns needing transformation (|skew| > 0.75):
['price', 'lot_area', 'renovation_cost']

What just happened?

.skew() measured asymmetry for each column. price and renovation_cost scored above 2.5 — heavily skewed. lot_area sits at 1.35 — moderate. age and rooms are below 0.75 — no treatment needed. This diagnostic step tells you exactly which columns to act on and which to leave alone.
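For the curious: pandas' .skew() is the bias-adjusted Fisher-Pearson coefficient, the same statistic scipy computes with bias=False. A quick check against the raw-moment formula (synthetic data, not the housing frame):

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(42)
x = pd.Series(rng.lognormal(mean=12.5, sigma=0.8, size=300))

# pandas and scipy (bias=False) agree on the adjusted coefficient
print(f"pandas : {x.skew():.6f}")
print(f"scipy  : {stats.skew(x, bias=False):.6f}")

# The same statistic from raw central moments:
# G1 = g1 * sqrt(n(n-1)) / (n-2), where g1 = m3 / m2^1.5
n = len(x)
m2 = ((x - x.mean()) ** 2).mean()
m3 = ((x - x.mean()) ** 3).mean()
G1 = (m3 / m2 ** 1.5) * np.sqrt(n * (n - 1)) / (n - 2)
print(f"manual : {G1:.6f}")
```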

Step 2 — Log and Square Root Transformations

The scenario: You want to apply log transformation to price and square root to lot_area. The renovation_cost column has zeros — plain np.log() will fail. You need log1p which adds 1 before taking the log, making it safe for zero values. After each transformation you measure skewness again to confirm improvement.

# --- Log transformation for price ---
# np.log1p = log(1 + x), handles zeros safely
housing_df['price_log'] = np.log1p(housing_df['price'])

# --- Square root transformation for lot_area ---
housing_df['lot_area_sqrt'] = np.sqrt(housing_df['lot_area'])

# --- Log1p for renovation_cost (has zeros) ---
# Plain np.log() on a zero gives -inf and crashes
# log1p handles it cleanly: log(1 + 0) = log(1) = 0
housing_df['renovation_cost_log'] = np.log1p(housing_df['renovation_cost'])

# Compare skewness before and after
results = pd.DataFrame({
    'column':       ['price',     'lot_area',       'renovation_cost'],
    'original':     [housing_df['price'].skew(),
                     housing_df['lot_area'].skew(),
                     housing_df['renovation_cost'].skew()],
    'transformed':  [housing_df['price_log'].skew(),
                     housing_df['lot_area_sqrt'].skew(),
                     housing_df['renovation_cost_log'].skew()],
    'method':       ['log1p',     'sqrt',           'log1p']
}).round(3)

print(results.to_string(index=False))
          column  original  transformed  method
           price     3.842        0.142   log1p
        lot_area     1.347        0.311    sqrt
renovation_cost     2.916        0.087   log1p

What just happened?

All three columns dropped well below the 0.5 mild-skew threshold. price went from 3.84 to 0.14, lot_area from 1.35 to 0.31, and renovation_cost from 2.92 to 0.09. Using log1p on renovation_cost avoided the log(0) crash — zeros in the original column became 0 after transformation.
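One practical reason to prefer log1p: np.expm1 is its exact inverse, which matters whenever you log-transform a target and need predictions back on the original scale. A minimal round-trip check:

```python
import numpy as np

# np.expm1(x) computes exp(x) - 1, the exact inverse of np.log1p
prices = np.array([0.0, 150_000.0, 420_000.0, 1_250_000.0])

logged    = np.log1p(prices)
recovered = np.expm1(logged)

print(np.allclose(recovered, prices))   # True
```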

Step 3 — Box-Cox and Yeo-Johnson with sklearn

The scenario: You want to hand the transformation step to sklearn so it fits cleanly inside a Pipeline. You also want to compare Box-Cox and Yeo-Johnson on the same columns and verify that Yeo-Johnson handles the zero-heavy renovation_cost column without issues — while Box-Cox requires you to add a small constant first.

from sklearn.preprocessing import PowerTransformer

# Isolate the three skewed columns as a 2D array for sklearn
skewed_cols = ['price', 'lot_area', 'renovation_cost']
X_skewed = housing_df[skewed_cols].values

# --- Yeo-Johnson: works on zeros and negatives, safe default ---
yj = PowerTransformer(method='yeo-johnson', standardize=False)
X_yj = yj.fit_transform(X_skewed)   # fit learns lambda; transform applies it

# --- Box-Cox: requires strictly positive values ---
# renovation_cost has zeros, so we add 1 before passing to Box-Cox
X_pos = X_skewed.copy()
X_pos[:, 2] = X_pos[:, 2] + 1   # shift renovation_cost so min = 1

bc = PowerTransformer(method='box-cox', standardize=False)
X_bc = bc.fit_transform(X_pos)

# Compare skewness across methods
for i, col in enumerate(skewed_cols):
    print(f"{col}:")
    print(f"  Original    : {housing_df[col].skew():.3f}")
    print(f"  Yeo-Johnson : {pd.Series(X_yj[:, i]).skew():.3f}")
    print(f"  Box-Cox     : {pd.Series(X_bc[:, i]).skew():.3f}")
    print()

# Inspect the lambda values learned by Yeo-Johnson
print("Yeo-Johnson lambdas (one per column):")
for col, lam in zip(skewed_cols, yj.lambdas_):
    print(f"  {col}: lambda = {lam:.4f}")
price:
  Original    : 3.842
  Yeo-Johnson : 0.031
  Box-Cox     : 0.028

lot_area:
  Original    : 1.347
  Yeo-Johnson : 0.104
  Box-Cox     : 0.097

renovation_cost:
  Original    : 2.916
  Yeo-Johnson : 0.062
  Box-Cox     : 0.055

Yeo-Johnson lambdas (one per column):
  price: lambda = 0.0821
  lot_area: lambda = 0.3147
  renovation_cost: lambda = 0.1203

What just happened?

Both power transformers matched or bettered the manual log/sqrt results. The lambdas_ attribute reveals the optimal power found for each column — values near 0 approximate a log transform, near 0.5 approximate a square root. Yeo-Johnson handled the zero-filled renovation_cost natively; Box-Cox needed a +1 shift first.
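Because PowerTransformer stores its fitted lambdas, it can also undo the transformation via inverse_transform, useful when a model predicts in transformed space. A sketch with synthetic data (not the housing frame above):

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(0)
X = rng.lognormal(mean=7, sigma=1, size=(200, 1))   # right-skewed column

pt = PowerTransformer(method='yeo-johnson')
X_t = pt.fit_transform(X)

# inverse_transform applies the stored lambda in reverse,
# recovering the original values up to floating-point error
X_back = pt.inverse_transform(X_t)
print(np.allclose(X_back, X))
```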

Step 4 — Applying Transformations Pipeline-Style

The scenario: Your team wants the transformation step baked into a full sklearn Pipeline so it fits on training data and transforms validation data correctly — with no risk of data leakage from refitting on the test set. You'll use ColumnTransformer to apply Yeo-Johnson only to the skewed columns, leaving the rest untouched.

from sklearn.pipeline         import Pipeline
from sklearn.compose          import ColumnTransformer
from sklearn.preprocessing    import PowerTransformer, StandardScaler
from sklearn.linear_model     import LinearRegression
from sklearn.model_selection  import train_test_split

# Create a simple numerical target to predict
housing_df['value_score'] = (
    housing_df['price'] / 1e5 +
    housing_df['rooms'] * 2 +
    np.random.normal(0, 3, 300)
).round(2)

X = housing_df[['price', 'lot_area', 'rooms', 'renovation_cost']]
y = housing_df['value_score']

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Columns that need Yeo-Johnson transformation
to_transform = ['price', 'lot_area', 'renovation_cost']
# Columns that are fine as-is (pass through unchanged)
passthrough   = ['rooms']

# Build a ColumnTransformer
preprocessor = ColumnTransformer(transformers=[
    ('yj',   PowerTransformer(method='yeo-johnson'), to_transform),
    ('pass', 'passthrough',                          passthrough)
])

# Wrap in a full Pipeline with a regression model
pipeline = Pipeline(steps=[
    ('preprocess', preprocessor),
    ('model',      LinearRegression())
])

# Fit on training data — Yeo-Johnson fits here and nowhere else
pipeline.fit(X_train, y_train)

# Evaluate on test data — transformer applies stored lambdas, never refits
train_r2 = pipeline.score(X_train, y_train)
test_r2  = pipeline.score(X_test,  y_test)

print(f"Train R²: {train_r2:.4f}")
print(f"Test  R²: {test_r2:.4f}")
Train R²: 0.9871
Test  R²: 0.9844

What just happened?

The ColumnTransformer applied Yeo-Johnson only to the three skewed columns and passed rooms through untouched. The transformer fitted its lambdas on X_train only and applied the same stored values to X_test — no refitting, no leakage. Train and test R² are high and nearly identical, which is the signature of a pipeline that generalises cleanly rather than leaking test information.
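The same leakage-free behaviour carries over to cross-validation: because the transformer lives inside the pipeline, each fold refits the lambdas on its own training split. A sketch with freshly generated synthetic data (column names mirror the lesson's, but the numbers are new):

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import PowerTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
n = 300
X = pd.DataFrame({
    'price': rng.lognormal(12.5, 0.8, n),    # right-skewed feature
    'rooms': rng.poisson(4, n) + 1,          # mild skew, left as-is
})
# Synthetic target that depends on log(price), so the transform helps
y = 2 * np.log(X['price']) + 2 * X['rooms'] + rng.normal(0, 1, n)

pipe = Pipeline([
    ('pre', ColumnTransformer([
        ('yj',   PowerTransformer(method='yeo-johnson'), ['price']),
        ('pass', 'passthrough',                          ['rooms']),
    ])),
    ('model', LinearRegression()),
])

# Each fold fits the lambdas on its own training split only
scores = cross_val_score(pipe, X, y, cv=5, scoring='r2')
print(f"fold R²: {scores.round(3)}  mean: {scores.mean():.3f}")
```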

Choosing the Right Transformation

Situation → Recommended transform — Why

Heavy right skew, no zeros → log or Box-Cox — aggressive compression of the upper tail; Box-Cox finds the optimal lambda
Heavy right skew, with zeros → log1p or Yeo-Johnson — log1p shifts the domain to avoid log(0); Yeo-Johnson handles zeros natively
Moderate right skew, count data → sqrt — gentler compression; avoids over-squashing count columns like number of rooms
Negative values or unknown shape → Yeo-Johnson — the only power transformer that handles negatives; adapts to both left and right skew
Left skew (negative tail) → reflect then log, or Yeo-Johnson — reflect: subtract all values from (max + 1), then apply log; Yeo-Johnson handles left skew automatically
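The table can be folded into a small helper. suggest_transform is a hypothetical name, and the thresholds encode this lesson's rules of thumb rather than any library function:

```python
import pandas as pd

def suggest_transform(col: pd.Series, skew_threshold: float = 0.75) -> str:
    """Hypothetical helper: map a column to a suggested transform
    using the decision table's rules of thumb."""
    skew = col.skew()
    if abs(skew) <= skew_threshold:
        return 'none'
    if (col < 0).any():
        return 'yeo-johnson'   # only power transformer that handles negatives
    if skew > 0:
        if (col == 0).any():
            return 'log1p'     # safe with zeros
        return 'box-cox'       # strictly positive, heavy right skew
    return 'yeo-johnson'       # left skew: handled automatically

print(suggest_transform(pd.Series([1, 2, 2, 3, 50, 400.0])))   # right skew, no zeros
print(suggest_transform(pd.Series([0, 0, 1, 2, 30, 200.0])))   # right skew, with zeros
print(suggest_transform(pd.Series([-5, -1, 0, 1, 2, 40.0])))   # contains negatives
```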

The rubber band analogy

Imagine your data is a rubber band stretched far to the right by a few enormous values. The log transformation shortens the stretched end while barely touching the compressed end near zero. The result is a band of roughly even tension — which is what linear models need to work properly. Box-Cox and Yeo-Johnson do the same thing but find the exact stretch level that maximises symmetry, rather than using a fixed function.

Tree models don't care about skew

Decision trees and ensemble methods like Random Forest and XGBoost split on rank order, not absolute value. A skewed column and its log-transformed version produce identical splits. Applying these transformations to tree models wastes your time and makes your features harder to interpret. Save them for linear models, KNN, SVM, and neural networks.
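That monotonicity claim is easy to verify: fit the same tree on a skewed feature and on its log, and compare predictions. A sketch with synthetic data (with continuous values and no ties, the partitions match):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
x = rng.lognormal(mean=12.5, sigma=0.8, size=300)     # skewed feature
y = 0.5 * np.log(x) + rng.normal(0, 0.1, size=300)    # some target

raw = DecisionTreeRegressor(max_depth=4, random_state=0)
raw.fit(x.reshape(-1, 1), y)

logged = DecisionTreeRegressor(max_depth=4, random_state=0)
logged.fit(np.log1p(x).reshape(-1, 1), y)

# log1p preserves the ordering of x, so both trees choose the same
# partitions and make identical predictions
same = np.allclose(raw.predict(x.reshape(-1, 1)),
                   logged.predict(np.log1p(x).reshape(-1, 1)))
print(same)
```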

Teacher's Note

Measuring skewness before and after transformation is not optional — it is the job. It is very easy to transform a column, feel satisfied, and move on without checking whether the transformation actually helped. Always compute .skew() on both the original and transformed columns and confirm the absolute skewness dropped below 0.5. If it didn't, try a different method. A column with skewness of 1.8 that you "log-transformed" but that now sits at 1.2 is still a problem — you've just made it a slightly smaller problem and called it solved. Verify. Always verify.

Practice Questions

1. Which numpy function should you use to apply a log transformation to a column that contains zero values?



2. Which power transformer method works on columns that contain negative values? (hyphenated, lowercase)



3. Skewness transformations are generally not needed for which family of models? (one word)



Quiz

1. You have a column with zero values and want to apply Box-Cox. What problem will you encounter?


2. What does the PowerTransformer learn during its .fit() step?


3. Which transformation is described as a "gentler compression" and is best suited for moderately skewed count data?


Up Next · Lesson 24

Feature Selection Basics

More features is not always better — learn the core principles behind choosing which columns to keep and which to cut before your model ever trains.