Feature Engineering Course
Lag Features
Rolling windows summarise the recent past. Lag features do something more direct — they hand the model the exact value from N steps ago and say: here, learn from this specific moment in history.
A lag feature is a copy of a column shifted forward in time by N steps. lag_1 is yesterday's value sitting in today's row. lag_7 is last week's value. The model can now directly compare today against specific past moments — not just a smoothed average, but the actual number from that day.
The Core Intuition — Giving the Model a Rearview Mirror
Think about how you forecast tomorrow's sales. You don't just look at today's number in isolation — you compare it to yesterday, to the same day last week, to the same day last month. You have a mental model built on specific historical reference points. Lag features replicate that instinct in machine-learnable form.
Without lag features, a time-series model has amnesia. Each row is a stranger to the model — it has no idea that yesterday's value was 20% higher, or that this same weekday last week showed a dip. Lag features cure that amnesia by making the past an explicit, numerical part of the current row.
Model Without Lag Features
Sees today's value. Has no idea whether it's higher or lower than yesterday. Cannot detect acceleration, reversal, or weekly seasonality. Every row looks like an isolated snapshot.
Model With Lag Features
Sees today, yesterday, last week, and last month simultaneously. Can learn that a value 20% above lag_7 predicts high demand. Can detect reversals, cycles, and momentum directly from structured features.
Choosing the Right Lag Depths
The lag depths you choose should match the natural cycles in your data. There's no single right answer — it depends entirely on the domain:
lag_1 — Immediate Previous Step
Yesterday's value, last hour's reading, the prior transaction. Captures short-term momentum and is almost always worth including. If today's value is substantially higher than lag_1, that delta is a powerful signal.
lag_7 — Same Day Last Week
Captures weekly seasonality. Monday tends to look like last Monday. This lag is essential for any retail, traffic, or consumer-behaviour dataset that has a 7-day cycle.
lag_30 — Same Period Last Month
Monthly seasonality, billing cycles, payroll effects. Useful in finance, subscription businesses, and utility demand forecasting where month-over-month patterns are strong.
lag_365 — Same Day Last Year
Annual seasonality — holidays, weather cycles, budget cycles. The gold standard for retail demand and energy forecasting. Requires at least one full year of data before it becomes usable.
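The four depths above don't need to be hand-written one column at a time — a loop over the chosen depths produces them all at once. A minimal sketch (the DataFrame and column names here are illustrative, not from the grocery example below):

```python
import pandas as pd

# Illustrative series — in practice this is your sorted daily time series
df = pd.DataFrame({'value': range(400)})

# One lag column per depth; deeper lags need more history before they populate
for depth in [1, 7, 30, 365]:
    df[f'lag_{depth}'] = df['value'].shift(depth)

# The first `depth` rows of each lag column are NaN —
# lag_365 is entirely NaN until a full year of data exists
```

Note that lag_365 only starts producing values at row 366, which is why a full year of history is the entry ticket for annual-seasonality features.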
Creating Lag Features with shift()
The scenario:
You're a data scientist at a grocery chain. The demand forecasting team needs a model to predict tomorrow's sales. The raw dataset has one row per day with total units sold. Your job is to add lag_1 (yesterday), lag_3 (three days ago), and lag_7 (same day last week) as features — the exact reference points a human forecaster would instinctively reach for.
# Import pandas and numpy
import pandas as pd
import numpy as np
# Create a daily sales DataFrame — 10 rows
sales_df = pd.DataFrame({
    'date': pd.date_range(start='2024-06-01', periods=10, freq='D'),  # 10 consecutive days
    'units_sold': [200, 210, 195, 220, 230, 250, 240, 260, 270, 255]  # daily units sold
})
# Sort by date — always required before shift operations
sales_df = sales_df.sort_values('date').reset_index(drop=True)
# lag_1: yesterday's units sold — shift the column DOWN by 1 row
# The first row gets NaN because there is no "yesterday" for it
sales_df['lag_1'] = sales_df['units_sold'].shift(1)
# lag_3: units sold 3 days ago — shift down by 3 rows
# First 3 rows get NaN
sales_df['lag_3'] = sales_df['units_sold'].shift(3)
# lag_7: units sold same day last week — shift down by 7 rows
# First 7 rows get NaN — this is expected and normal
sales_df['lag_7'] = sales_df['units_sold'].shift(7)
# Derived feature: delta vs yesterday — how much did sales change since lag_1?
# This is often more predictive than the raw lag value itself
sales_df['delta_lag_1'] = sales_df['units_sold'] - sales_df['lag_1']
# Round for display
sales_df = sales_df.round(1)
# Print results
print(sales_df.to_string(index=False))
      date  units_sold  lag_1  lag_3  lag_7  delta_lag_1
2024-06-01         200    NaN    NaN    NaN          NaN
2024-06-02         210  200.0    NaN    NaN         10.0
2024-06-03         195  210.0    NaN    NaN        -15.0
2024-06-04         220  195.0  200.0    NaN         25.0
2024-06-05         230  220.0  210.0    NaN         10.0
2024-06-06         250  230.0  195.0    NaN         20.0
2024-06-07         240  250.0  220.0    NaN        -10.0
2024-06-08         260  240.0  230.0  200.0         20.0
2024-06-09         270  260.0  250.0  210.0         10.0
2024-06-10         255  270.0  240.0  195.0        -15.0
What just happened?
shift(1) moves the entire column down by one row, so each row now contains yesterday's value in lag_1. The NaNs at the top are expected — there is simply no "yesterday" for June 1. By June 8, all three lags are populated and the model can compare today's 260 units against lag_1 (240), lag_3 (230), and lag_7 (200), seeing a clear upward trend across all three reference points. The delta_lag_1 column makes the day-over-day change explicit — June 10 shows −15, alerting the model to a potential reversal.
Lag Features Per Group — The Multi-Entity Pattern
Just like rolling features, lags must be computed within each entity separately when your DataFrame contains multiple time series stacked together. A naive shift(1) on the full DataFrame will make the last row of Customer A bleed into the first row of Customer B — a silent, catastrophic error.
The scenario:
You're building a churn model at a SaaS company. The dataset has weekly login counts per user, all stacked in one DataFrame. You need lag_1 (last week's logins) and lag_2 (two weeks ago) per user. The product team suspects that a sharp drop in weekly logins is the strongest early churn signal — you need the lag features to compute that drop explicitly.
# Import pandas and numpy
import pandas as pd
import numpy as np
# Create a weekly login DataFrame — 2 users, 5 weeks each, all in one flat table
churn_df = pd.DataFrame({
    'user_id': ['U1','U1','U1','U1','U1', 'U2','U2','U2','U2','U2'],  # two users
    'week': [1, 2, 3, 4, 5, 1, 2, 3, 4, 5],  # week number
    'logins': [12, 14, 13, 3, 2, 8, 9, 10, 11, 10]  # U1 drops off; U2 is stable
})
# Sort by user then week — required so shift operates in correct time order
churn_df = churn_df.sort_values(['user_id', 'week']).reset_index(drop=True)
# Compute lag_1 PER USER: last week's logins for each user
# groupby + shift keeps the shift operation isolated within each user's rows
churn_df['lag_1'] = churn_df.groupby('user_id')['logins'].shift(1)
# Compute lag_2 PER USER: logins from two weeks ago
churn_df['lag_2'] = churn_df.groupby('user_id')['logins'].shift(2)
# Compute week-over-week change: current logins minus last week's logins
churn_df['wow_change'] = churn_df['logins'] - churn_df['lag_1']
# Compute percentage change vs lag_1 — normalises the delta by baseline activity
# Multiply by 100 to express as a percentage
churn_df['wow_pct_change'] = (churn_df['wow_change'] / (churn_df['lag_1'] + 1e-9)) * 100
# Round for clean display
churn_df = churn_df.round(1)
# Print results
print(churn_df.to_string(index=False))
user_id week logins lag_1 lag_2 wow_change wow_pct_change
U1 1 12 NaN NaN NaN NaN
U1 2 14 12.0 NaN 2.0 16.7
U1 3 13 14.0 12.0 -1.0 -7.1
U1 4 3 13.0 14.0 -10.0 -76.9
U1 5 2 3.0 13.0 -1.0 -33.3
U2 1 8 NaN NaN NaN NaN
U2 2 9 8.0 NaN 1.0 12.5
U2 3 10 9.0 8.0 1.0 11.1
U2 4 11 10.0 9.0 1.0 10.0
     U2     5      10   11.0   10.0        -1.0            -9.1
What just happened?
U1's week 4 shows a wow_pct_change of −76.9% — logins collapsed from 13 to 3. That single number is an extremely loud churn signal. U2's worst week is only −9.1%, which is normal variation. The groupby before shift ensured U2 week 1's NaN did not pick up U1 week 5's value of 2 — which is exactly the silent bug that kills models in production when you forget the groupby.
Combining Lags with Rolling Features
In practice, you use lags and rolling features together. Lags give the model specific historical snapshots; rolling features give it smoothed summaries. A model that has both sees the same thing a seasoned analyst sees: the exact value at key moments in the past, plus a sense of the overall recent trend.
The scenario:
You're on a data science team at an energy company. Daily power consumption data feeds into a demand forecasting model. The operations team wants both specific lag reference points and a smoothed recent average — together they make the model robust to single-day noise while still being sensitive to sharp changes.
# Import pandas and numpy
import pandas as pd
import numpy as np
# Create a daily power consumption DataFrame — 10 rows (MWh)
energy_df = pd.DataFrame({
    'date': pd.date_range(start='2024-09-01', periods=10, freq='D'),  # 10 days
    'consumption': [310, 315, 308, 320, 490, 485, 480, 312, 318, 310]  # MWh; spike on days 5-7
})
# Sort by date — always required
energy_df = energy_df.sort_values('date').reset_index(drop=True)
# Lag features: specific historical reference points
energy_df['lag_1'] = energy_df['consumption'].shift(1) # yesterday's consumption
energy_df['lag_7'] = energy_df['consumption'].shift(7) # same day last week
# Rolling feature: 3-day smoothed average — reduces day-to-day noise
energy_df['roll_mean_3d'] = energy_df['consumption'].rolling(window=3, min_periods=1).mean()
# Derived feature: today vs rolling mean — how far is today from the recent smooth baseline?
energy_df['deviation_from_trend'] = energy_df['consumption'] - energy_df['roll_mean_3d']
# Derived feature: today vs lag_1 — raw day-over-day delta
energy_df['delta_vs_yesterday'] = energy_df['consumption'] - energy_df['lag_1']
# Round everything to 1 decimal place
energy_df = energy_df.round(1)
# Print selected columns for clarity
print(energy_df[['date','consumption','lag_1','roll_mean_3d','deviation_from_trend','delta_vs_yesterday']].to_string(index=False))
      date  consumption  lag_1  roll_mean_3d  deviation_from_trend  delta_vs_yesterday
2024-09-01          310    NaN         310.0                   0.0                 NaN
2024-09-02          315  310.0         312.5                   2.5                 5.0
2024-09-03          308  315.0         311.0                  -3.0                -7.0
2024-09-04          320  308.0         314.3                   5.7                12.0
2024-09-05          490  320.0         372.7                 117.3               170.0
2024-09-06          485  490.0         431.7                  53.3                -5.0
2024-09-07          480  485.0         485.0                  -5.0                -5.0
2024-09-08          312  480.0         425.7                -113.7              -168.0
2024-09-09          318  312.0         370.0                 -52.0                 6.0
2024-09-10          310  318.0         313.3                  -3.3                -8.0
What just happened?
Sep 5 is a major anomaly — consumption spikes to 490 MWh, 170 units above lag_1 and 117.3 above the rolling mean. Both features independently flag this as extreme. Sep 8 shows the mirror image: consumption drops back to 312 while the rolling mean is still elevated at 425 (dragged up by the spike days), producing a deviation_from_trend of −113.7. The combination of lag and rolling features gives the model a much richer description of this event than either feature type alone.
The Leakage Rule for Lag Features
Lag features have a leakage rule that is slightly different from group-based features. The rule is simple: the lag depth must be at least as large as the forecast horizon.
Predicting tomorrow (horizon = 1 day)
lag_1 is safe — at prediction time, yesterday's value is already known. lag_0 would be leakage — that's today's value, which is the target itself.
Predicting 7 days ahead (horizon = 7 days)
lag_1 through lag_6 are leakage — those days haven't happened yet at prediction time. lag_7 is the minimum safe lag. lag_14 and lag_21 are also safe.
The general rule
Minimum safe lag = forecast horizon. Any lag shallower than the horizon uses data that would not yet exist when the model runs in production. Violating this rule produces models that look great offline and fail completely in deployment.
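The rule can be enforced mechanically by parameterising lag generation on the horizon — start at the horizon and only go deeper, never shallower. A sketch, assuming a 7-day horizon and an illustrative sales series:

```python
import pandas as pd

horizon = 7  # predicting 7 days ahead

# Illustrative daily sales series
df = pd.DataFrame({'sales': range(100, 130)})

# Only lags >= horizon are safe: at prediction time, anything shallower
# would reference days that have not yet happened
for depth in [horizon, horizon * 2, horizon * 3]:  # lag_7, lag_14, lag_21
    df[f'lag_{depth}'] = df['sales'].shift(depth)

# Sanity check: no lag column is shallower than the horizon
assert all(int(c.split('_')[1]) >= horizon
           for c in df.columns if c.startswith('lag_'))
```

Baking the assertion into the pipeline means a teammate who later adds lag_3 "because it improved validation score" gets an immediate failure instead of a model that silently leaks.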
A Visual — What shift() Does to a Column
This table makes the mechanics of shift() completely transparent. Each lag column is the original column slid down by N rows:
| Date | Original | lag_1 shift(1) | lag_3 shift(3) | delta (orig − lag_1) |
|---|---|---|---|---|
| Jun 1 | 200 | NaN | NaN | NaN |
| Jun 2 | 210 | 200 | NaN | +10 |
| Jun 3 | 195 | 210 | NaN | −15 |
| Jun 4 | 220 | 195 | 200 | +25 |
| Jun 5 | 230 | 220 | 210 | +10 |
NaN values appear wherever there is no historical data to look back to. For lag_3, the first three rows will always be NaN because there are no rows 3 steps in the past.
Teacher's Note
The NaN rows created by lag features are a decision point, not just a nuisance. You have three options: drop them (simplest, but you lose early rows), fill them with a global or group mean (keeps the rows but introduces a small bias), or use them as a boolean mask feature — a column called lag_1_available that is 1 when lag_1 is not NaN and 0 otherwise. The third option is surprisingly useful in production where partial history is common for new customers or new products and you want the model to explicitly know whether the lag was real or imputed.
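The three options from the note can be sketched side by side (the tiny series and column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'logins': [12, 14, 13, 3, 2]})
df['lag_1'] = df['logins'].shift(1)

# Option 1: drop rows where the lag is missing — simplest, loses early rows
dropped = df.dropna(subset=['lag_1'])

# Option 3: availability mask — computed BEFORE imputing, so the model
# knows whether lag_1 was a real observation or a filled-in value
df['lag_1_available'] = df['lag_1'].notna().astype(int)

# Option 2: fill with the column mean — keeps all rows, introduces a small bias
df['lag_1'] = df['lag_1'].fillna(df['lag_1'].mean())
```

Ordering matters here: the mask must be built before the fill, otherwise every row reports the lag as available and the mask feature carries no information.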
Practice Questions
1. Which pandas method is used to create lag features by moving column values down by N rows?
2. To avoid leakage, the minimum safe lag depth must be at least as large as the ________ ________.
3. When multiple entities share a single DataFrame, you must use ________ before shift() to prevent one entity's last row from leaking into the next entity's first row.
Quiz
1. Your model predicts demand 7 days in advance. Which of these lag features are safe to use without causing leakage?
2. What is the consequence of calling shift(1) on a multi-entity DataFrame without using groupby first?
3. What is the main practical difference between lag features and rolling window features?
Up Next · Lesson 35
Feature Engineering for Imbalanced Data
When 99% of your rows are class 0, raw features lie. Learn how to engineer features that actually separate the rare minority class from the noise.