Feature Engineering Lesson 10 – Data Transformations | Dataplexa
Beginner Level · Lesson 10

Data Transformations

Raw numbers rarely behave the way machine learning models need them to. Transforming your data isn't cheating — it's one of the most powerful tools in your engineering toolkit.

A data transformation changes the mathematical scale or distribution of a numerical feature — without changing the information it carries. When features are heavily skewed or span wildly different magnitudes, transformations help models learn faster, make better predictions, and avoid being fooled by a handful of extreme values.

Why Raw Numbers Can Break Your Model

Imagine you're predicting house prices. One feature is the square footage of the property — values mostly between 500 and 4,000. Another feature is the annual income of the buyer — values ranging from $30,000 to $3,000,000. Now throw in a single billionaire buyer. That one data point sits so far from everyone else that it drags regression coefficients toward it, warps distance-based models, and dominates squared-error losses, so the model spends its capacity on one point while underfitting every ordinary buyer.

This is the skewness problem. And it's everywhere in real-world data — transaction amounts, web traffic counts, property sizes, loan balances, user session lengths. The fix isn't to delete the outliers. The fix is to transform the scale so the feature behaves more normally.
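To make the pull of a single extreme value concrete, here is a tiny sketch (all numbers invented for illustration) of how one outlier distorts the mean while the median stays put:

```python
import numpy as np

# Five invented buyer incomes: four ordinary, one extreme
incomes = np.array([30_000, 45_000, 52_000, 61_000, 3_000_000])

# The outlier drags the mean far above every typical buyer...
print(np.mean(incomes))    # 637600.0
# ...while the median stays anchored in the normal range
print(np.median(incomes))  # 52000.0
```

Models built on squared error respond to that mean-like pull, which is why one extreme row can reshape the whole fit.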

1. Log Transformation

Compresses large values and spreads small ones. Perfect for right-skewed data like income, prices, and counts. Turns a log-normal distribution into something roughly bell-shaped.

2. Square Root Transformation

A gentler version of log. Works well for count data and moderate right skew. Unlike log, it also handles zero values directly.

3. Box-Cox Transformation

A flexible parametric transformation that finds the optimal power (lambda) to make your data as normal as possible. Requires strictly positive values.

4. Yeo-Johnson Transformation

Like Box-Cox but extended to handle zero and negative values too. The most versatile power transformation available in scikit-learn.
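The positivity rule is easy to see in code. Here's a minimal sketch using scipy.stats.boxcox (scipy isn't used elsewhere in this lesson, and the sample values are invented): Box-Cox fits a lambda on strictly positive data and raises an error the moment a zero appears — exactly the gap Yeo-Johnson closes.

```python
import numpy as np
from scipy import stats

# Strictly positive, right-skewed sample (values invented for illustration)
prices = np.array([120.0, 340.0, 210.0, 89.0, 450.0, 98_000.0])

# scipy's boxcox fits the optimal lambda and returns the transformed data
transformed, fitted_lambda = stats.boxcox(prices)
print(f"Box-Cox lambda = {fitted_lambda:.3f}")

# A single zero is enough for Box-Cox to refuse the data outright
try:
    stats.boxcox(np.array([0.0, 1.0, 2.0]))
except ValueError as err:
    print(f"Box-Cox rejected the data: {err}")
```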

Before vs After: What Transformation Actually Does

Here's a concrete look at a skewed loan balance column before and after a log transformation. Notice how the extreme values get pulled in closer to the rest of the data.

❌ Raw loan_balance (skewed)

$1,200
$3,400
$2,100
$890
$4,500
$1,750
$980,000 ← extreme outlier
$2,300

Skewness: 2.8 (strongly right-skewed — near the maximum possible for eight values)

✅ log(loan_balance) (normalized)

7.09
8.13
7.65
6.79
8.41
7.47
13.79 ← still higher, but reasonable
7.74

Skewness: 2.5 (still skewed in this tiny sample, but the outlier falls from roughly 400× the typical value to under 2×)

Log Transformation in Practice

The scenario: You're a data scientist at a peer-to-peer lending platform. Your risk model uses loan balances and annual income as features, but both columns have a handful of very wealthy borrowers with balances in the hundreds of thousands. The model keeps predicting "low risk" for everyone because those extreme values dominate the scale. Your task: apply log transformations to compress the skew so the model can actually distinguish between high-risk and low-risk borrowers in the normal range.

# Import pandas for DataFrame operations
import pandas as pd

# Import numpy — we'll use np.log and np.sqrt for transformations
import numpy as np

# Create a realistic loan dataset with skewed numerical features
loan_df = pd.DataFrame({
    'borrower_id': ['B001', 'B002', 'B003', 'B004', 'B005',
                     'B006', 'B007', 'B008', 'B009', 'B010'],
    'loan_balance': [1200, 3400, 2100, 890, 4500,
                    1750, 980000, 2300, 6700, 1100],
    # Annual income — also right-skewed, one very high earner
    'annual_income': [45000, 72000, 51000, 38000, 95000,
                     63000, 8500000, 58000, 110000, 42000]
})

# Apply log1p: np.log1p(x) computes log(1 + x) — safe even if value is 0
loan_df['log_loan_balance'] = np.log1p(loan_df['loan_balance'])

# Apply log1p to annual income the same way
loan_df['log_annual_income'] = np.log1p(loan_df['annual_income'])

# Check skewness before and after — skew() returns a float (0 = perfectly symmetric)
print("=== loan_balance ===")
print(f"  Before: skewness = {loan_df['loan_balance'].skew():.2f}")
print(f"  After:  skewness = {loan_df['log_loan_balance'].skew():.2f}")

print("=== annual_income ===")
print(f"  Before: skewness = {loan_df['annual_income'].skew():.2f}")
print(f"  After:  skewness = {loan_df['log_annual_income'].skew():.2f}")
=== loan_balance ===
  Before: skewness = 3.16
  After:  skewness = 2.61
=== annual_income ===
  Before: skewness = 3.16
  After:  skewness = 2.90

What just happened?

We created two new columns — log_loan_balance and log_annual_income — by applying np.log1p() to each, then called .skew() to measure how much the shape improved. With only ten rows and one extreme borrower, the improvement is partial: loan_balance drops from 3.16 to 2.61 and annual_income from 3.16 to 2.90. On realistically sized datasets, the same transformation typically pulls features like these close enough to symmetric that linear models handle them cleanly.
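One detail worth spelling out: we reached for np.log1p rather than plain np.log because real balance columns often contain zeros. A minimal illustration (invented values):

```python
import numpy as np

# A balance column that contains a zero (invented values)
balances = np.array([0.0, 1200.0, 3400.0])

# Plain log blows up on zero: the first entry becomes -inf
with np.errstate(divide='ignore'):
    print(np.log(balances))    # first entry is -inf

# log1p computes log(1 + x), so zero maps cleanly to 0.0
print(np.log1p(balances))      # first entry is 0.0
```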

Square Root and Cube Root Transformations

The scenario: You're working at a ride-sharing company building a driver churn model. One of your features is trips_last_30_days — how many rides a driver completed last month. Most drivers did 20–80 trips, but a handful of super-active drivers did 400+. The distribution is moderately skewed. Your senior engineer suggests trying a square root transformation since the data is count-based and doesn't go anywhere near the extremes that would require a full log transform.

# Import libraries
import pandas as pd
import numpy as np

# Driver activity data with moderate right skew on trip counts
driver_df = pd.DataFrame({
    'driver_id': ['D01', 'D02', 'D03', 'D04', 'D05',
                  'D06', 'D07', 'D08', 'D09', 'D10'],
    'trips_last_30_days': [22, 45, 31, 18, 67,
                          55, 410, 39, 72, 28]
})

# Square root transformation — np.sqrt handles zeros, no +1 trick needed
driver_df['sqrt_trips'] = np.sqrt(driver_df['trips_last_30_days'])

# Cube root transformation — gentler than log; np.cbrt() also handles
# negative inputs, which naive exponentiation with ** (1/3) does not
driver_df['cbrt_trips'] = np.cbrt(driver_df['trips_last_30_days'])

# Print raw vs transformed values for every driver, rounded to 2 decimals
comparison = driver_df[['driver_id', 'trips_last_30_days', 'sqrt_trips', 'cbrt_trips']]
print(comparison.round(2).to_string(index=False))
 driver_id  trips_last_30_days  sqrt_trips  cbrt_trips
       D01                  22        4.69        2.80
       D02                  45        6.71        3.56
       D03                  31        5.57        3.14
       D04                  18        4.24        2.62
       D05                  67        8.19        4.06
       D06                  55        7.42        3.80
       D07                 410       20.25        7.43
       D08                  39        6.24        3.39
       D09                  72        8.49        4.16
       D10                  28        5.29        3.04

What just happened?

We added two new columns using np.sqrt() and np.cbrt(). Both apply a power compression to the trip counts. The printed table shows all three values side by side so you can see exactly how much each transformation pulls the extreme value (D07 at 410 trips) closer to the rest of the distribution without removing it.
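A quick aside on the cube root comment above: np.cbrt is the function to reach for precisely because naive exponentiation breaks on negative numpy floats.

```python
import numpy as np

# np.cbrt takes real cube roots for any sign of input
print(np.cbrt([-8.0, 27.0]))   # [-2.  3.]

# Naive exponentiation is NOT safe on negative numpy floats: the result is nan
with np.errstate(invalid='ignore'):
    print(np.array([-8.0]) ** (1/3))   # [nan]
```

This matters for features like profit/loss or temperature deltas, where negative values are legitimate data.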

Box-Cox and Yeo-Johnson with scikit-learn

The scenario: You're a machine learning engineer building a credit scoring model at a bank. The compliance team wants documented, reproducible transformations that can be saved and re-applied consistently to new data — not ad hoc numpy calls scattered through notebooks. You decide to use scikit-learn's PowerTransformer, which supports both Box-Cox and Yeo-Johnson methods, fits the optimal lambda on training data, and can be serialized to disk like any other sklearn object.

# Import pandas and numpy as usual
import pandas as pd
import numpy as np

# PowerTransformer handles Box-Cox and Yeo-Johnson in one unified API
from sklearn.preprocessing import PowerTransformer

# Credit application data — values here happen to be positive, though Yeo-Johnson doesn't require it
credit_df = pd.DataFrame({
    'applicant_id': ['A01', 'A02', 'A03', 'A04', 'A05',
                     'A06', 'A07', 'A08', 'A09', 'A10'],
    'annual_income': [42000, 61000, 35000, 89000, 47000,
                     53000, 2400000, 38000, 74000, 55000],
    'months_employed': [24, 60, 12, 84, 36,
                       48, 240, 6, 72, 30]
})

# Select only the numerical feature columns for transformation
features = ['annual_income', 'months_employed']

# Instantiate PowerTransformer with Yeo-Johnson — works on zeros and negatives
pt = PowerTransformer(method='yeo-johnson', standardize=True)

# Fit the transformer on the data — this finds the optimal lambda per column
pt.fit(credit_df[features])

# Transform the data and store results in new columns
transformed = pt.transform(credit_df[features])

# Build a results DataFrame for easy comparison of original vs transformed
result_df = credit_df[['applicant_id']].copy()
result_df['income_raw'] = credit_df['annual_income']
result_df['income_transformed'] = transformed[:, 0].round(3)
result_df['months_raw'] = credit_df['months_employed']
result_df['months_transformed'] = transformed[:, 1].round(3)

# Print the lambda values sklearn found for each column
print("Optimal lambdas found by Yeo-Johnson:")
for col, lam in zip(features, pt.lambdas_):
    print(f"  {col}: lambda = {lam:.4f}")

print()
print(result_df.to_string(index=False))
Optimal lambdas found by Yeo-Johnson:
  annual_income: lambda = 0.0823
  months_employed: lambda = 0.3914

 applicant_id  income_raw  income_transformed  months_raw  months_transformed
          A01       42000              -0.491          24              -0.603
          A02       61000              -0.221          60               0.260
          A03       35000              -0.620          12              -1.070
          A04       89000               0.062          84               0.667
          A05       47000              -0.410          36              -0.260
          A06       53000              -0.323          48               0.020
          A07     2400000               2.939         240               2.347
          A08       38000              -0.562           6              -1.414
          A09       74000              -0.077          72               0.474
          A10       55000              -0.297          30              -0.422

What just happened?

PowerTransformer ran an internal optimization on each column to find the lambda that produces the most normal-shaped output. It then printed those lambdas, transformed both features, and we stored the results alongside the raw values. The output shows every applicant's income and months employed mapped onto a tight standardized scale, with ordinary applicants clustered within about 1.5 standard deviations of zero and A07 still the highest value, but no longer dominating everything else.
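The scenario mentioned that the fitted transformer can be saved and re-applied later. Here's a minimal sketch of that workflow using joblib (the file name and training values are invented for illustration):

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.preprocessing import PowerTransformer

# Fit a transformer on invented training data
train = pd.DataFrame({'annual_income': [42000.0, 61000.0, 35000.0, 89000.0, 55000.0]})
pt = PowerTransformer(method='yeo-johnson').fit(train)

# Persist the fitted transformer like any other sklearn estimator
path = os.path.join(tempfile.mkdtemp(), 'power_transformer.joblib')
joblib.dump(pt, path)

# Reload later (e.g. in a scoring service) and re-apply the exact same lambdas
pt_loaded = joblib.load(path)
print(pt_loaded.lambdas_ == pt.lambdas_)   # element-wise True
```

Because the lambdas travel with the object, new applications are transformed exactly the way the training data was — the reproducibility the compliance team asked for.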

Choosing the Right Transformation

There's no single correct answer. Here's the decision logic most experienced engineers use:

Situation                                  Best Option         Why
Strictly positive, heavy right skew        log1p               Fast, interpretable, widely used
Count data, includes zeros, mild skew      sqrt                Handles zeros, less aggressive
Includes negative values                   Yeo-Johnson         Only power transform safe with negatives
Need optimal normality, pipeline-safe      PowerTransformer    Auto-finds best lambda, sklearn-compatible
Already roughly normal, just scaling       StandardScaler      No transformation needed, just rescale
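The decision logic above can be sketched as a small helper function. Treat this as a heuristic with illustrative thresholds, not an official recipe — the function name and cutoffs are invented:

```python
import pandas as pd

def suggest_transform(series: pd.Series, skew_threshold: float = 1.0) -> str:
    """Rough heuristic mirroring the decision table; thresholds are illustrative."""
    skew = series.skew()
    if series.min() < 0:
        return "yeo-johnson"          # only power transform safe with negatives
    if abs(skew) < 0.5:
        return "standard-scale only"  # already roughly symmetric, just rescale
    if skew < skew_threshold:
        return "sqrt"                 # mild right skew, zeros are fine
    return "log1p"                    # heavy right skew, non-negative values

# The skewed loan balances from earlier in the lesson
s = pd.Series([1200, 3400, 2100, 890, 4500, 1750, 980000, 2300, 6700, 1100])
print(suggest_transform(s))   # log1p
```

In practice you'd still eyeball a histogram before and after — no single skewness cutoff is right for every model.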

Inverse Transform: Getting Back to Original Scale

The scenario: You trained a regression model to predict loan default risk scores after transforming features. Now the business team wants to understand predictions back in the original dollar amounts — not in the log-transformed scale. You need to reverse the transformation on the output. Both PowerTransformer and manual numpy transformations can be reversed.

# Import libraries needed for this example
import pandas as pd
import numpy as np

# Some transformed income values (imagine these came out of a model or pipeline)
log_transformed_income = np.array([10.65, 11.02, 10.46, 11.40, 10.76])

# Reverse of log1p is expm1: np.expm1(x) computes exp(x) - 1, exact inverse of log1p
original_income = np.expm1(log_transformed_income)

# Reverse of sqrt is squaring the value
sqrt_trips = np.array([4.69, 6.71, 5.57, 4.24, 8.19])
original_trips = np.square(sqrt_trips)

# Print both: the transformed value and the recovered original
print("Income: log1p → expm1 reversal")
for t, o in zip(log_transformed_income, original_income):
    print(f"  log1p value: {t:.2f}  →  original: ${o:,.0f}")

print()
print("Trips: sqrt → square reversal")
for t, o in zip(sqrt_trips, original_trips):
    print(f"  sqrt value: {t:.2f}  →  original: {o:.0f} trips")
Income: log1p → expm1 reversal
  log1p value: 10.65  →  original: $42,192
  log1p value: 11.02  →  original: $61,083
  log1p value: 10.46  →  original: $34,891
  log1p value: 11.40  →  original: $89,321
  log1p value: 10.76  →  original: $47,098

Trips: sqrt → square reversal
  sqrt value: 4.69  →  original: 22 trips
  sqrt value: 6.71  →  original: 45 trips
  sqrt value: 5.57  →  original: 31 trips
  sqrt value: 4.24  →  original: 18 trips
  sqrt value: 8.19  →  original: 67 trips

What just happened?

We took previously transformed values and ran them through their inverse functions — np.expm1() to undo log1p, and np.square() to undo sqrt. The recovered values land within a fraction of a percent of the originals; the small gaps come from rounding the transformed inputs to two decimal places before reversing. This is the round-trip check you should always run when building a pipeline that transforms and then reports back in real-world units.

Teacher's Note

Always fit your transformer on training data only, then apply it to both training and test sets using .transform() — never .fit_transform() on test data. If you fit on the full dataset including test data, you're leaking future information into the transformation's lambda values. This is called data leakage and it silently inflates your test metrics while degrading real-world performance. It's one of the most dangerous and common mistakes in applied ML. Build the habit now: fit on train, transform on both.
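Here's a minimal sketch of that habit in code (invented data; the split is illustrative):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PowerTransformer

# Invented income data for illustration
df = pd.DataFrame({'annual_income': [42000, 61000, 35000, 89000, 47000,
                                     53000, 2400000, 38000, 74000, 55000]})
train, test = train_test_split(df, test_size=0.3, random_state=42)

pt = PowerTransformer(method='yeo-johnson')
# Fit ONLY on the training rows: lambda is learned from train data alone
train_t = pt.fit_transform(train[['annual_income']])
# Re-use the already-fitted transformer on the test rows (no refitting)
test_t = pt.transform(test[['annual_income']])
print(train_t.shape, test_t.shape)   # (7, 1) (3, 1)
```

The test rows never influence the fitted lambda, so the test metrics reflect how the pipeline will behave on genuinely unseen data.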

Practice Questions

1. Which numpy function applies a log transformation that safely handles zero values?



2. What numpy function reverses a log1p transformation to recover original values?



3. Which PowerTransformer method works on data that contains negative values?



Quiz

1. What is the primary effect of applying a log transformation to a right-skewed feature?


2. To avoid data leakage when using PowerTransformer, what is the correct approach?


3. Which transformation is most appropriate for count data that includes zeros and has only mild right skew?


Up Next · Lesson 11

Binning & Discretization

Turn continuous numerical features into meaningful categories — and discover when bucketing actually improves model performance.