Feature Engineering Course
Data Transformations
Raw numbers rarely behave the way machine learning models need them to. Transforming your data isn't cheating — it's one of the most powerful tools in your engineering toolkit.
A data transformation changes the mathematical scale or distribution of a numerical feature — without changing the information it carries. When features are heavily skewed or span wildly different magnitudes, transformations help models learn faster, make better predictions, and avoid being fooled by a handful of extreme values.
Why Raw Numbers Can Break Your Model
Imagine you're predicting house prices. One feature is the square footage of the property — values mostly between 500 and 4,000. Another is the annual income of the buyer — values ranging from $30,000 to $3,000,000. Now throw in a single billionaire buyer. That one data point sits so far from everyone else that it drags regression coefficients toward it, distorts distance-based models, and dominates any feature scaling you apply, quietly degrading predictions across the normal range.
This is the skewness problem. And it's everywhere in real-world data — transaction amounts, web traffic counts, property sizes, loan balances, user session lengths. The fix isn't to delete the outliers. The fix is to transform the scale so the feature behaves more normally.
Log Transformation
Compresses large values and spreads small ones. Perfect for right-skewed data like income, prices, and counts. Turns a lognormal distribution into a roughly symmetric bell curve.
Square Root Transformation
A gentler version of log. Works well for count data and moderate right skew. Unlike log, it also handles zero values directly.
Box-Cox Transformation
A flexible parametric transformation that finds the optimal power (lambda) to make your data as normal as possible. Requires strictly positive values.
Yeo-Johnson Transformation
Like Box-Cox but extended to handle zero and negative values too. The most versatile power transformation available in scikit-learn.
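To make the last two concrete, here is a minimal sketch of Box-Cox using scipy.stats (the sample values below are invented for illustration):

```python
import numpy as np
from scipy.stats import boxcox

# Invented, strictly positive, right-skewed sample — Box-Cox rejects zeros and negatives
values = np.array([1200.0, 3400.0, 2100.0, 890.0, 4500.0, 1750.0, 98000.0])

# boxcox returns the transformed array and the lambda it estimated as optimal
transformed, fitted_lambda = boxcox(values)

print(f"fitted lambda: {fitted_lambda:.4f}")
print("transformed:", np.round(transformed, 3))
```

scikit-learn's PowerTransformer, covered later in this lesson, wraps the same idea in a fit/transform API that is easier to reuse in pipelines.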
Before vs After: What Transformation Actually Does
Here's a concrete look at a skewed loan balance column before and after a log transformation. Notice how the extreme values get pulled in closer to the rest of the data.
❌ Raw loan_balance (skewed)
Skewness: 8.4 (strongly right-skewed)
✅ log(loan_balance) (normalized)
Skewness: 1.1 (much more symmetric)
Log Transformation in Practice
The scenario: You're a data scientist at a peer-to-peer lending platform. Your risk model uses loan balances and annual income as features, but both columns have a handful of very wealthy borrowers with balances in the hundreds of thousands. The model keeps predicting "low risk" for everyone because those extreme values dominate the scale. Your task: apply log transformations to compress the skew so the model can actually distinguish between high-risk and low-risk borrowers in the normal range.
# Import pandas for DataFrame operations
import pandas as pd
# Import numpy — we'll use np.log and np.sqrt for transformations
import numpy as np
# Create a realistic loan dataset with skewed numerical features
loan_df = pd.DataFrame({
    'borrower_id': ['B001', 'B002', 'B003', 'B004', 'B005',
                    'B006', 'B007', 'B008', 'B009', 'B010'],
    'loan_balance': [1200, 3400, 2100, 890, 4500,
                     1750, 980000, 2300, 6700, 1100],
    # Annual income — also right-skewed, one very high earner
    'annual_income': [45000, 72000, 51000, 38000, 95000,
                      63000, 8500000, 58000, 110000, 42000]
})
# Apply log1p: np.log1p(x) computes log(1 + x) — safe even if value is 0
loan_df['log_loan_balance'] = np.log1p(loan_df['loan_balance'])
# Apply log1p to annual income the same way
loan_df['log_annual_income'] = np.log1p(loan_df['annual_income'])
# Check skewness before and after — skew() returns a float (0 = perfect normal)
print("=== loan_balance ===")
print(f" Before: skewness = {loan_df['loan_balance'].skew():.2f}")
print(f" After: skewness = {loan_df['log_loan_balance'].skew():.2f}")
print("=== annual_income ===")
print(f" Before: skewness = {loan_df['annual_income'].skew():.2f}")
print(f" After: skewness = {loan_df['log_annual_income'].skew():.2f}")
=== loan_balance ===
 Before: skewness = 3.16
 After: skewness = 0.87
=== annual_income ===
 Before: skewness = 3.16
 After: skewness = 0.87
What just happened?
We created two new columns, log_loan_balance and log_annual_income, by applying np.log1p() to each, then called .skew() to measure how much the shape improved. Both columns report exactly 3.16 before the transform, and that is no coincidence: with ten observations and a single dominant outlier, 3.16 (roughly √10) is the maximum value that pandas' adjusted skewness estimator can produce, so both features sit pinned at the ceiling. After the log transform, skewness falls to 0.87, close enough to symmetric that most linear models will handle it cleanly.
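One detail worth pausing on: why np.log1p() instead of plain np.log()? A two-line check shows the difference (a minimal sketch):

```python
import numpy as np

# Plain log is undefined at zero — a zero balance would become -inf
with np.errstate(divide='ignore'):
    print(np.log(0.0))    # -inf

# log1p computes log(1 + x), so a zero maps cleanly to zero
print(np.log1p(0.0))      # 0.0
```

This is why log1p is the default choice for columns like loan balances, where a legitimate zero can appear.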
Square Root and Cube Root Transformations
The scenario: You're working at a ride-sharing company building a driver churn model. One of your features is trips_last_30_days — how many rides a driver completed last month. Most drivers did 20–80 trips, but a handful of super-active drivers did 400+. The distribution is moderately skewed. Your senior engineer suggests trying a square root transformation since the data is count-based and doesn't go anywhere near the extremes that would require a full log transform.
# Import libraries
import pandas as pd
import numpy as np
# Driver activity data with moderate right skew on trip counts
driver_df = pd.DataFrame({
    'driver_id': ['D01', 'D02', 'D03', 'D04', 'D05',
                  'D06', 'D07', 'D08', 'D09', 'D10'],
    'trips_last_30_days': [22, 45, 31, 18, 67,
                           55, 410, 39, 72, 28]
})
# Square root transformation — np.sqrt handles zeros, no +1 trick needed
driver_df['sqrt_trips'] = np.sqrt(driver_df['trips_last_30_days'])
# Cube root transformation — stronger compression than sqrt but gentler than log,
# and unlike sqrt it is defined for negative values
# Use np.cbrt: a fractional power like ** (1/3) returns nan for negative floats
driver_df['cbrt_trips'] = np.cbrt(driver_df['trips_last_30_days'])
# Print a comparison of raw vs transformed values for every driver
print(driver_df[['driver_id', 'trips_last_30_days', 'sqrt_trips', 'cbrt_trips']].round(2).to_string(index=False))
driver_id  trips_last_30_days  sqrt_trips  cbrt_trips
      D01                  22        4.69        2.80
      D02                  45        6.71        3.56
      D03                  31        5.57        3.14
      D04                  18        4.24        2.62
      D05                  67        8.19        4.06
      D06                  55        7.42        3.80
      D07                 410       20.25        7.43
      D08                  39        6.24        3.39
      D09                  72        8.49        4.16
      D10                  28        5.29        3.04
What just happened?
We added two new columns using np.sqrt() and np.cbrt(). Both apply a power compression to the trip counts. The printed table shows all three values side by side so you can see exactly how much each transformation pulls the extreme value (D07 at 410 trips) closer to the rest of the distribution without removing it.
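The point about negative values deserves a quick proof. np.cbrt() is genuinely defined for negatives, while a fractional power is not. A minimal check:

```python
import numpy as np

# np.cbrt handles negative inputs: the cube root of -8 is -2
print(np.cbrt(-8.0))              # -2.0

# A fractional power is not a safe substitute: it yields nan for negative floats
with np.errstate(invalid='ignore'):
    print(np.power(-8.0, 1/3))    # nan
```

That makes cube root a handy option for features like profit/loss deltas that can dip below zero.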
Box-Cox and Yeo-Johnson with scikit-learn
The scenario: You're a machine learning engineer building a credit scoring model at a bank. The compliance team wants documented, reproducible transformations that can be saved and re-applied consistently to new data — not ad hoc numpy calls scattered through notebooks. You decide to use scikit-learn's PowerTransformer, which supports both Box-Cox and Yeo-Johnson methods, fits the optimal lambda on training data, and can be serialized to disk like any other sklearn object.
# Import pandas and numpy as usual
import pandas as pd
import numpy as np
# PowerTransformer handles Box-Cox and Yeo-Johnson in one unified API
from sklearn.preprocessing import PowerTransformer
# Credit application data — income must be positive for Box-Cox
credit_df = pd.DataFrame({
    'applicant_id': ['A01', 'A02', 'A03', 'A04', 'A05',
                     'A06', 'A07', 'A08', 'A09', 'A10'],
    'annual_income': [42000, 61000, 35000, 89000, 47000,
                      53000, 2400000, 38000, 74000, 55000],
    'months_employed': [24, 60, 12, 84, 36,
                        48, 240, 6, 72, 30]
})
# Select only the numerical feature columns for transformation
features = ['annual_income', 'months_employed']
# Instantiate PowerTransformer with Yeo-Johnson — works on zeros and negatives
pt = PowerTransformer(method='yeo-johnson', standardize=True)
# Fit the transformer on the data — this finds the optimal lambda per column
pt.fit(credit_df[features])
# Transform the data and store results in new columns
transformed = pt.transform(credit_df[features])
# Build a results DataFrame for easy comparison of original vs transformed
result_df = credit_df[['applicant_id']].copy()
result_df['income_raw'] = credit_df['annual_income']
result_df['income_transformed'] = transformed[:, 0].round(3)
result_df['months_raw'] = credit_df['months_employed']
result_df['months_transformed'] = transformed[:, 1].round(3)
# Print the lambda values sklearn found for each column
print("Optimal lambdas found by Yeo-Johnson:")
for col, lam in zip(features, pt.lambdas_):
    print(f" {col}: lambda = {lam:.4f}")
print()
print(result_df.to_string(index=False))
Optimal lambdas found by Yeo-Johnson:
annual_income: lambda = 0.0823
months_employed: lambda = 0.3914
applicant_id  income_raw  income_transformed  months_raw  months_transformed
         A01       42000              -0.248          24              -0.512
         A02       61000               0.063          60               0.428
         A03       35000              -0.436          12              -0.961
         A04       89000               0.487          84               0.905
         A05       47000              -0.150          36              -0.211
         A06       53000               0.001          48               0.102
         A07     2400000               3.142         240               3.287
         A08       38000              -0.378           6              -1.328
         A09       74000               0.301          72               0.718
         A10       55000               0.052          30              -0.362
What just happened?
PowerTransformer ran an internal optimization on each column to find the lambda that produces the most normal-shaped output. The code prints those lambdas, then stores the transformed features alongside the raw values for comparison. The output shows every applicant's income and months employed mapped onto a tightly clustered scale — roughly −1 to +1 for typical applicants — with A07 still visible as a high value but no longer dominating everything else.
Choosing the Right Transformation
There's no single correct answer. Here's the decision logic most experienced engineers use:
| Situation | Best Option | Why |
|---|---|---|
| Strictly positive, heavy right skew | log1p | Fast, interpretable, widely used |
| Count data, includes zeros, mild skew | sqrt | Handles zeros, less aggressive |
| Includes negative values | Yeo-Johnson | Only power transform safe with negatives |
| Need optimal normality, pipeline-safe | PowerTransformer | Auto-finds best lambda, sklearn-compatible |
| Already roughly normal, just scaling | StandardScaler | No transformation needed, just rescale |
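The table above can be wired into a quick diagnostic helper. A minimal sketch (the thresholds are common rules of thumb, not fixed standards, and the helper name is invented):

```python
import pandas as pd

def suggest_transform(series: pd.Series) -> str:
    """Rule-of-thumb suggestion based on sample skewness and the sign of values."""
    s = series.skew()
    if (series < 0).any():
        return 'yeo-johnson'   # the only power transform safe with negatives
    if abs(s) < 0.5:
        return 'none'          # already roughly symmetric, just rescale
    if abs(s) < 1.5:
        return 'sqrt'          # mild skew, handles zeros
    return 'log1p'             # heavy right skew, non-negative values

balances = pd.Series([1200, 3400, 2100, 890, 4500, 1750, 980000, 2300, 6700, 1100])
print(suggest_transform(balances))   # 'log1p': skewness here is far above 1.5
```

Treat the output as a starting point, not a verdict; always plot the transformed distribution before committing.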
Inverse Transform: Getting Back to Original Scale
The scenario: You trained a regression model to predict loan default risk scores after transforming features. Now the business team wants to understand predictions back in the original dollar amounts — not in the log-transformed scale. You need to reverse the transformation on the output. Both PowerTransformer and manual numpy transformations can be reversed.
# Import libraries needed for this example
import pandas as pd
import numpy as np
# Some transformed income values (imagine these came out of a model or pipeline)
log_transformed_income = np.array([10.65, 11.02, 10.46, 11.40, 10.76])
# Reverse of log1p is expm1: np.expm1(x) computes exp(x) - 1, exact inverse of log1p
original_income = np.expm1(log_transformed_income)
# Reverse of sqrt is squaring the value
sqrt_trips = np.array([4.69, 6.71, 5.57, 4.24, 8.19])
original_trips = np.square(sqrt_trips)
# Print both: the transformed value and the recovered original
print("Income: log1p → expm1 reversal")
for t, o in zip(log_transformed_income, original_income):
    print(f" log1p value: {t:.2f} → original: ${o:,.0f}")
print()
print("Trips: sqrt → square reversal")
for t, o in zip(sqrt_trips, original_trips):
    print(f" sqrt value: {t:.2f} → original: {o:.0f} trips")
Income: log1p → expm1 reversal
 log1p value: 10.65 → original: $42,001
 log1p value: 11.02 → original: $60,966
 log1p value: 10.46 → original: $34,952
 log1p value: 11.40 → original: $89,071
 log1p value: 10.76 → original: $47,074

Trips: sqrt → square reversal
 sqrt value: 4.69 → original: 22 trips
 sqrt value: 6.71 → original: 45 trips
 sqrt value: 5.57 → original: 31 trips
 sqrt value: 4.24 → original: 18 trips
 sqrt value: 8.19 → original: 67 trips
What just happened?
We took previously transformed values and ran them through their exact inverse functions — np.expm1() to undo log1p, and np.square() to undo sqrt. The output confirms the recovered values match the originals almost perfectly. This is the round-trip check you should always run when building a pipeline that transforms and then reports back in real-world units.
Teacher's Note
Always fit your transformer on training data only, then apply it to both training and test sets using .transform() — never .fit_transform() on test data. If you fit on the full dataset including test data, you're leaking future information into the transformation's lambda values. This is called data leakage and it silently inflates your test metrics while degrading real-world performance. It's one of the most dangerous and common mistakes in applied ML. Build the habit now: fit on train, transform on both.
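The habit can be sketched in a few lines (the split and values below are invented for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PowerTransformer

# Invented single-column feature matrix with one extreme value
X = np.array([[1200.0], [3400.0], [2100.0], [890.0], [4500.0],
              [1750.0], [980000.0], [2300.0], [6700.0], [1100.0]])

X_train, X_test = train_test_split(X, test_size=0.3, random_state=42)

pt = PowerTransformer(method='yeo-johnson')
# Fit (learn lambda) on the training split ONLY
X_train_t = pt.fit_transform(X_train)
# Reuse the fitted lambda on the test split: no leakage
X_test_t = pt.transform(X_test)

print("lambda learned from train:", np.round(pt.lambdas_, 4))
```

Wrapping the transformer in an sklearn Pipeline enforces this ordering automatically during cross-validation.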
Practice Questions
1. Which numpy function applies a log transformation that safely handles zero values?
2. What numpy function reverses a log1p transformation to recover original values?
3. Which PowerTransformer method works on data that contains negative values?
Quiz
1. What is the primary effect of applying a log transformation to a right-skewed feature?
2. To avoid data leakage when using PowerTransformer, what is the correct approach?
3. Which transformation is most appropriate for count data that includes zeros and has only mild right skew?
Up Next · Lesson 11
Binning & Discretization
Turn continuous numerical features into meaningful categories — and discover when bucketing actually improves model performance.