EDA Lesson 13 – Data Transformation | Dataplexa
Beginner Level · Lesson 13

Data Transformation

Raw data rarely arrives in the shape your model wants. Transformation is the step that fixes that — turning skewed columns into symmetric ones, squashing massive value ranges, and converting categories into numbers a machine can actually read.

Why Transformation Exists

In Lesson 12 you learned to detect skewness and heavy tails. Now we fix them. But transformation isn't only about shape — it's also about scale. A dataset with one column ranging 0–1 and another ranging 0–1,000,000 will cause gradient-based models to obsess over the big-number column and almost ignore the small one. Transformation solves both problems.

There are three families of transformation you need to know as a working data analyst: power transforms (fix shape), scaling transforms (fix range), and encoding transforms (fix categorical data). This lesson covers all three.

Family 1 — Power Transforms (Fix Shape)

Power transforms change the shape of a distribution by applying a mathematical function to every value. The goal is to pull in a long tail and push the data closer to symmetric. Think of it as squeezing the outliers toward the centre without deleting them.

Log Transform

np.log1p(x)

The workhorse. Dramatically compresses right tails. Use log1p (log of 1+x) instead of plain log so zero values don't blow up.

Square Root Transform

np.sqrt(x)

Gentler than log. Good for moderate skew. Can handle zero but not negative values. Often used on count data.

Box-Cox Transform

stats.boxcox(x)

Finds the optimal power (λ) automatically. When λ=0 it's a log transform. Requires strictly positive values.

Yeo-Johnson Transform

stats.yeojohnson(x)

Like Box-Cox but works with zero and negative values too. The safer default when you're not sure about your data range.
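Log, Box-Cox, and Yeo-Johnson all get full walkthroughs below; square root doesn't, so here's a quick sketch on invented count data (the ticket counts are made up purely for illustration):

```python
import numpy as np       # np.sqrt applies element-wise
import pandas as pd      # pd.Series gives us .skew()

# Hypothetical daily support-ticket counts — moderately right-skewed count data
tickets = pd.Series([2, 3, 1, 4, 2, 5, 3, 2, 16, 25], dtype=float)

print("Skewness before sqrt:", round(tickets.skew(), 3))

# Square root: gentler than log — handles zeros, produces NaN on negatives
tickets_sqrt = np.sqrt(tickets)
print("Skewness after sqrt: ", round(tickets_sqrt.skew(), 3))
```

The drop is milder than a log transform would give on the same data — which is the point: sqrt is the gentler tool, suited to moderate skew.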

Log Transform in Action

The scenario: You're a junior data scientist at a fintech company. The team is building a credit risk model and you've been handed a dataset of customer annual incomes. Your senior engineer just ran the skewness check from Lesson 12 and messaged you: "Income column is skewed at 2.3 — log transform it before you do anything else." You've never done this before. Here's exactly how it works.

import pandas as pd      # pandas: Python's core data table library — used for DataFrame creation, column operations, and .skew()
import numpy as np       # numpy: numerical Python library — provides np.log1p() and np.sqrt() for element-wise transforms

# Simulate annual income data (£) for 12 customers — right-skewed due to high earners
df = pd.DataFrame({
    'customer_id': range(1, 13),
    'annual_income': [22000, 28000, 31000, 35000, 29000, 41000,
                      38000, 26000, 33000, 250000, 420000, 27000]  # last two are high earners causing skew
})

# Check the skewness before doing anything
print("Skewness BEFORE transform:", round(df['annual_income'].skew(), 3))  # .skew() from pandas — adjusted (bias-corrected) Fisher-Pearson coefficient

# Apply log1p transform: np.log1p(x) = log(1 + x)
# We use log1p rather than np.log() so that if any value were 0, we don't get -infinity
df['income_log'] = np.log1p(df['annual_income'])   # creates a new column — never overwrite the original

# Check skewness after
print("Skewness AFTER log transform:", round(df['income_log'].skew(), 3))

print()
print(df[['customer_id', 'annual_income', 'income_log']].to_string(index=False))  # compare side by side

What just happened?

pandas is Python's core data table library. We used it to build the DataFrame and to call .skew() on the column — which computes the Fisher-Pearson skewness coefficient, the same number we explored in Lesson 12.

numpy is Python's numerical computing library. np.log1p() applies the natural log of (1 + x) element-wise to the entire column in one shot — no loop needed. This is vastly faster than Python loops on large datasets because numpy operations run in compiled C under the hood.

Run it and compare the two printed skewness values. With only 12 rows — two of which are extreme earners — the log transform reduces the skewness but can't erase it entirely: the high earners remain outliers even in log units. On a realistically sized income dataset, where high earners are a small fraction of rows, a log transform typically pulls skewness from well above 1 down to near zero.

Look at the output values: the two high earners (£250,000 and £420,000) went from roughly 8× and 13× the typical (median) income to about 12.4 and 12.9 in log units — still higher than the rest, but no longer extreme enough to dominate a model's gradients. That's the power of log.

Before vs After — Visualised

This is what the log transform actually does to a distribution. The right tail gets pulled back in, and the bulk of the data spreads out.

[Figure: side-by-side histograms. Before — raw income, with the bulk of customers in the 20–50k bands and isolated bars at 100k+ and 400k+ forming a long right tail. After — log-transformed values clustered between roughly 9.9 and 10.6, with the high earners near 12.4+; the tail is visibly pulled in.]

The outliers haven't been removed — they've been compressed. All 12 values are still in the dataset.

Box-Cox and Yeo-Johnson — When Log Isn't Enough

The scenario: You're now working on a different column in the same dataset — number of late payments. Some customers have zero, some have one or two, and one customer has 14. Log won't work cleanly here because of the zeros. Your team lead says to try Box-Cox and Yeo-Johnson and compare. You want to know which one lands closer to symmetric.

import pandas as pd               # pandas: data table library — DataFrame and column skewness via .skew()
import numpy as np                # numpy: numerical library — array creation with np.array()
from scipy import stats           # scipy.stats: scientific statistics module — provides boxcox() and yeojohnson()

# Late payment counts — contains zeros, so plain log() would produce -infinity
late_payments = np.array([0, 1, 0, 2, 0, 1, 3, 0, 0, 2, 1, 14], dtype=float)  # dtype=float required by scipy transforms

print("Skewness BEFORE:", round(pd.Series(late_payments).skew(), 3))  # wrap in Series so .skew() works
print()

# Box-Cox: requires strictly positive values — add a tiny offset to handle zeros
# boxcox() returns (transformed_array, optimal_lambda)
bc_transformed, bc_lambda = stats.boxcox(late_payments + 1)   # +1 shifts all zeros to 1 — smallest safe offset
print(f"Box-Cox lambda (optimal power): {bc_lambda:.3f}")      # when lambda ≈ 0, Box-Cox ≈ log transform
print(f"Skewness after Box-Cox:         {round(pd.Series(bc_transformed).skew(), 3)}")
print()

# Yeo-Johnson: handles zeros AND negatives natively — no offset needed
# yeojohnson() returns (transformed_array, optimal_lambda)
yj_transformed, yj_lambda = stats.yeojohnson(late_payments)   # works directly on the original array including zeros
print(f"Yeo-Johnson lambda:             {yj_lambda:.3f}")
print(f"Skewness after Yeo-Johnson:     {round(pd.Series(yj_transformed).skew(), 3)}")

What just happened?

scipy is Python's scientific computing library, built on top of numpy. scipy.stats is its statistics submodule. We used two of its power transform functions here: boxcox() and yeojohnson(). Both find the optimal lambda (λ) — the power that makes your data most symmetric — and return both the transformed array and that lambda value. You need to save the lambda so you can apply the exact same transform to new data later.

pandas appears here when we wrap the numpy array in pd.Series() — because numpy arrays don't have a .skew() method, but pandas Series do. This is a common pattern: compute with numpy/scipy, check statistics with pandas.

Both transforms pull the skewness down sharply — run the code and compare the printed before/after values. Box-Cox finds a small optimal lambda, well below 1 (λ = 0 would be exactly the log transform), which confirms aggressive tail compression was the right direction. Yeo-Johnson reaches a similar result without needing the +1 offset. In practice, prefer Yeo-Johnson unless you're certain your data will always be positive — it's the safer choice.
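Here is what "save the lambda" looks like in code — a sketch reusing the late-payments numbers, plus a hypothetical batch of new customers. scipy lets you pass a fixed lmbda= so the transform is applied without being refitted:

```python
import numpy as np
from scipy import stats

# FIT on training data: with no lmbda argument, boxcox() finds the optimal lambda
train = np.array([0, 1, 0, 2, 0, 1, 3, 0, 0, 2, 1, 14], dtype=float)
train_bc, fitted_lambda = stats.boxcox(train + 1)

# APPLY to new data: pass the saved lambda — returns just the array, no refitting
new = np.array([0, 2, 5], dtype=float)               # hypothetical new customers
new_bc = stats.boxcox(new + 1, lmbda=fitted_lambda)  # same +1 offset, same lambda

print("Fitted lambda:", round(fitted_lambda, 3))
print("New values transformed:", np.round(new_bc, 3))
```

yeojohnson() works the same way — both return the fitted lambda precisely so you can replay the identical transform on data the model sees later.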

Family 2 — Scaling Transforms (Fix Range)

Scaling doesn't change the shape of your distribution — it just moves and stretches the values so they all live in a comparable range. This matters enormously for distance-based models (KNN, SVM), gradient descent (neural nets, linear regression), and any model that computes feature importance from raw coefficient sizes.
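A toy sketch of that effect (all numbers invented): two customers whose credit scores differ by a huge 200 points contribute almost nothing to the raw Euclidean distance, because the income column's units drown them out — until you scale.

```python
import numpy as np

# Two hypothetical customers: (annual_income, credit_score)
a = np.array([30000.0, 600.0])
b = np.array([32000.0, 800.0])

# Raw Euclidean distance: the £2,000 income gap dwarfs the 200-point score gap
raw_dist = np.sqrt(np.sum((a - b) ** 2))
print("Raw distance:", round(raw_dist, 1))        # ~99% of it comes from income

# Min-max style rescale using assumed feature ranges (illustrative, not fitted)
def minmax(x, lo, hi):
    return (x - lo) / (hi - lo)

a_s = np.array([minmax(a[0], 0, 500000), minmax(a[1], 300, 850)])
b_s = np.array([minmax(b[0], 0, 500000), minmax(b[1], 300, 850)])

scaled_dist = np.sqrt(np.sum((a_s - b_s) ** 2))
print("Scaled distance:", round(scaled_dist, 3))  # now the score gap dominates
```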

Min-Max Scaling

(x − min) / (max − min)

Squeezes everything into [0, 1]. Simple and interpretable. Problem: one extreme outlier can compress all other values into a tiny range.

Standard Scaling (Z-score)

(x − mean) / std

Centres at 0, scales to unit variance. The default choice for most models. Caveat: the mean and standard deviation are themselves pulled by extreme values, so it's less distorted than min-max but not genuinely outlier-resistant — for that, use robust scaling.

Robust Scaling

(x − median) / IQR

Uses median and interquartile range instead of mean and std. Best when outliers are extreme and you can't or don't want to remove them.

Scaling in Practice

The scenario: The credit risk model now has three input features: annual income (£22k–420k), credit score (300–850), and number of credit accounts (1–24). Before training, your ML engineer has asked you to deliver a scaled version where all three features are on a comparable range. She specifically asks for standard scaling, but wants you to show min-max alongside it so the team can compare.

import pandas as pd    # pandas: core data library — DataFrame creation and display
import numpy as np     # numpy: numerical library — powers the arithmetic in manual scaling formulas

# Build a small dataset with three very different feature scales
df = pd.DataFrame({
    'annual_income':   [22000, 28000, 31000, 35000, 41000, 250000, 420000, 29000, 33000, 27000],
    'credit_score':    [620,   710,   680,   740,   790,   810,    650,    700,   720,   690  ],
    'num_accounts':    [3,     7,     5,     9,     12,    18,     4,      6,     11,    8    ]
})

features = ['annual_income', 'credit_score', 'num_accounts']  # columns to scale

# --- MIN-MAX SCALING ---
# Formula: (x - min) / (max - min) — compresses to [0, 1]
df_minmax = df[features].copy()                               # copy so we don't touch the original
for col in features:
    col_min = df[col].min()                                   # find the column minimum
    col_max = df[col].max()                                   # find the column maximum
    df_minmax[col] = (df[col] - col_min) / (col_max - col_min)  # apply formula element-wise

# --- STANDARD SCALING (Z-SCORE) ---
# Formula: (x - mean) / std — centres at 0, unit variance
df_standard = df[features].copy()
for col in features:
    col_mean = df[col].mean()                                 # pandas .mean() — arithmetic average
    col_std  = df[col].std()                                  # pandas .std() — sample standard deviation (ddof=1)
    df_standard[col] = (df[col] - col_mean) / col_std        # z-score normalisation

print("=== MIN-MAX SCALED (range: 0 to 1) ===")
print(df_minmax.round(3).to_string(index=False))
print()
print("=== STANDARD SCALED (mean=0, std=1) ===")
print(df_standard.round(3).to_string(index=False))

What just happened?

pandas is doing the column arithmetic here. When you write df[col] - col_min, pandas broadcasts the scalar subtraction across every row in the column automatically. The .mean() and .std() methods return single numbers that then get used in element-wise column operations — all without a single Python loop in your logic.

numpy underpins all of these operations at the C level, making them fast even on millions of rows.

Look at min-max: the two high-income outliers (£250k and £420k) score about 0.57 and 1.000, while everyone else is squashed between 0.000 and roughly 0.05. That's the outlier problem with min-max in action. The standard scaled version spreads everyone out much more evenly. The £420k customer still stands out at around 2.4 standard deviations above the mean, but it's no longer distorting every other value.
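Robust scaling, the third method from the cards above, wasn't in the comparison — here's a sketch on the same made-up income column, using pandas' .median() and .quantile() to build the IQR:

```python
import pandas as pd

# Same illustrative income column as the scaling example
income = pd.Series([22000, 28000, 31000, 35000, 41000,
                    250000, 420000, 29000, 33000, 27000], dtype=float)

# Robust scaling: (x - median) / IQR — median and IQR barely move when outliers appear
med = income.median()
iqr = income.quantile(0.75) - income.quantile(0.25)   # interquartile range
income_robust = (income - med) / iqr

print(income_robust.round(3).to_string(index=False))
```

Notice the trade-off: the ordinary customers land in a tight band around zero, while the £250k and £420k rows stay visibly extreme instead of squashing everyone else into a sliver — exactly what you want when the outliers are real and must stay.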

Family 3 — Encoding Transforms (Fix Categories)

Machine learning models can't read text. A column containing "low", "medium", "high" means nothing to a regression model. Encoding converts categories into numbers — but the method you choose matters a lot.

Label Encoding — low→0, medium→1, high→2. Use for ordinal categories with a real order. Watch out: it implies high = 2× medium, which is only safe if that ordering is genuinely numeric-like.

One-Hot Encoding — creates a binary column per category. Use for nominal categories with no order. Watch out: many unique values means many columns (the curse of dimensionality).

Map / Replace — manual dictionary mapping. Use when you know the exact mapping and want full control. Watch out: it breaks on unseen categories at prediction time.

Encoding in Practice

The scenario: The same credit risk dataset has two categorical columns: risk_band (low / medium / high — ordinal, there's a real order) and employment_type (employed / self-employed / unemployed — nominal, no meaningful order). Your ML engineer needs both converted to numbers. Naively label-encoding the nominal column would make the model treat one employment type as numerically twice another — which is nonsense. You need the right tool for each.

import pandas as pd    # pandas: core data library — DataFrame, .map(), and pd.get_dummies() all come from here
import numpy as np     # numpy: numerical library — standard import for any EDA/ML prep script

# Dataset with two categorical columns
df = pd.DataFrame({
    'customer_id':     [1, 2, 3, 4, 5, 6, 7, 8],
    'risk_band':       ['low', 'high', 'medium', 'low', 'high', 'medium', 'low', 'medium'],
    'employment_type': ['employed', 'self-employed', 'unemployed', 'employed',
                        'employed', 'self-employed', 'unemployed', 'employed']
})

# --- LABEL ENCODING for ordinal column (risk_band) ---
# We use a manual dictionary map — never use sklearn LabelEncoder for ordinal data
# because it assigns numbers alphabetically, not by your intended order
risk_order = {'low': 0, 'medium': 1, 'high': 2}            # define the order explicitly
df['risk_band_encoded'] = df['risk_band'].map(risk_order)   # .map() applies the dict to every row

# --- ONE-HOT ENCODING for nominal column (employment_type) ---
# pd.get_dummies() creates a new binary column for each category
# drop_first=True removes one column to avoid perfect multicollinearity (the "dummy variable trap")
dummies = pd.get_dummies(df['employment_type'], prefix='emp', drop_first=True)  # prefix for readable column names
df = pd.concat([df, dummies], axis=1)                       # axis=1 means "add as new columns, not rows"
df.drop(columns=['employment_type'], inplace=True)          # remove original text column — model doesn't need it

print(df.to_string(index=False))

What just happened?

pandas is the engine for both encoding methods. .map() applies a dictionary to a Series row-by-row — it's the cleanest way to do manual ordinal encoding because you control the exact mapping. pd.get_dummies() automates one-hot encoding: it scans the unique values, creates a binary column for each, and returns a new DataFrame you can concatenate with pd.concat().

The drop_first=True argument in get_dummies is important: with 3 employment types (employed, self-employed, unemployed), you only need 2 columns to encode all the information. If both new columns are False, the customer is "employed" — it's implied. Keeping all 3 creates perfect multicollinearity which breaks linear models.

The output DataFrame is now fully numeric — ready to be passed directly to a scikit-learn model. No text anywhere.
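One caveat before handing this over: pd.get_dummies() builds columns from whatever categories it happens to see, so a new batch of data missing one employment type would come back with fewer columns and break the model's input shape. A common fix — sketched here with a hypothetical known_types list — is to declare the categories up front with pd.Categorical:

```python
import pandas as pd

known_types = ['employed', 'self-employed', 'unemployed']  # agreed list, fixed in advance

train = pd.Series(['employed', 'self-employed', 'employed'])
new = pd.Series(['unemployed', 'employed'])   # note: no 'self-employed' in this batch

# pd.Categorical with explicit categories forces identical dummy columns every time
train_d = pd.get_dummies(pd.Categorical(train, categories=known_types),
                         prefix='emp', drop_first=True)
new_d = pd.get_dummies(pd.Categorical(new, categories=known_types),
                       prefix='emp', drop_first=True)

print(list(train_d.columns))
print(list(new_d.columns))   # same columns despite the missing category
```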

Teacher's Note

Always transform training data and test data separately. A classic beginner mistake is fitting the scaler on the entire dataset — including test rows — before splitting. This leaks information from the test set into your training process and makes your model look more accurate than it is. The correct workflow: fit (learn mean/std/min/max/lambda) on training data only, then apply (transform) to both training and test data using those saved parameters.

And always save the original column. Transformed values are for modelling. The original column is for interpretation, debugging, and explaining results to stakeholders. Never overwrite — always create a new column.
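The teacher's note in code form — a minimal hand-rolled sketch (made-up incomes, with the train/test split assumed already done) of fitting on training rows only and applying those saved parameters to both sets:

```python
import pandas as pd

train = pd.Series([22000, 28000, 31000, 250000, 29000], dtype=float)  # training rows
test  = pd.Series([35000, 420000], dtype=float)                       # held-out rows

# FIT: learn the parameters from TRAINING data only
train_mean = train.mean()
train_std  = train.std()

# APPLY: transform both sets with the SAME saved parameters
train_scaled = (train - train_mean) / train_std
test_scaled  = (test - train_mean) / train_std   # never recompute mean/std on test

print("Train mean after scaling:", round(train_scaled.mean(), 6))  # 0 by construction
print("Test values:", test_scaled.round(3).tolist())               # not centred — expected
```

The test values are not guaranteed to have mean 0 or std 1 — and that's correct: they were scaled with the training set's parameters, exactly as new data will be in production.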

Practice Questions

1. Which numpy function applies a log transform that is safe when values include zero?



2. You have a column containing city names: London, Paris, Berlin, Tokyo. There is no meaningful order. Which encoding method should you use?



3. Which power transform works on columns that contain zero and negative values?



Quiz

1. Your dataset has a feature with extreme outliers that you cannot remove. Which scaling method is most appropriate?


2. Why should you use drop_first=True when one-hot encoding with pd.get_dummies()?


3. What is the correct workflow for applying scaling to a dataset that has been split into training and test sets?


Up Next · Lesson 14

EDA Workflow

Pull everything together into a repeatable end-to-end EDA process you can apply to any new dataset from day one.