Feature Engineering Course
Lag Features
Rolling windows summarise the recent past. Lag features do something more direct — they hand the model the exact value from N steps ago and say: here, learn from this specific moment in history.
A lag feature is a copy of a column shifted forward in time by N steps. lag_1 is yesterday's value sitting in today's row. lag_7 is last week's value. The model can now directly compare today against specific past moments — not just a smoothed average, but the actual number from that day.
The Core Intuition — Giving the Model a Rearview Mirror
Think about how you forecast tomorrow's sales. You don't just look at today's number in isolation — you compare it to yesterday, to the same day last week, to the same day last month. You have a mental model built on specific historical reference points. Lag features replicate that instinct in machine-learnable form.
Without lag features, a time-series model has amnesia. Each row is a stranger to the model — it has no idea that yesterday's value was 20% higher, or that this same weekday last week showed a dip. Lag features cure that amnesia by making the past an explicit, numerical part of the current row.
Model Without Lag Features
Sees today's value. Has no idea whether it's higher or lower than yesterday. Cannot detect acceleration, reversal, or weekly seasonality. Every row looks like an isolated snapshot.
Model With Lag Features
Sees today, yesterday, last week, and last month simultaneously. Can learn that a value 20% above lag_7 predicts high demand. Can detect reversals, cycles, and momentum directly from structured features.
Choosing the Right Lag Depths
The lag depths you choose should match the natural cycles in your data. There's no single right answer — it depends entirely on the domain:
lag_1 — Immediate Previous Step
Yesterday's value, last hour's reading, the prior transaction. Captures short-term momentum and is almost always worth including. If today's value is substantially higher than lag_1, that delta is a powerful signal.
lag_7 — Same Day Last Week
Captures weekly seasonality. Monday tends to look like last Monday. This lag is essential for any retail, traffic, or consumer-behaviour dataset that has a 7-day cycle.
lag_30 — Same Period Last Month
Monthly seasonality, billing cycles, payroll effects. Useful in finance, subscription businesses, and utility demand forecasting where month-over-month patterns are strong.
lag_365 — Same Day Last Year
Annual seasonality — holidays, weather cycles, budget cycles. The gold standard for retail demand and energy forecasting. Requires at least one full year of data before it becomes usable.
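The four depths above don't need to be hand-written one column at a time — a loop over the chosen depths produces them all at once. A minimal sketch (the DataFrame and column names here are illustrative, not from the grocery example below):

```python
import pandas as pd

# Illustrative series — in practice this is your sorted daily time series
df = pd.DataFrame({'value': range(400)})

# One lag column per depth; deeper lags need more history before they populate
for depth in [1, 7, 30, 365]:
    df[f'lag_{depth}'] = df['value'].shift(depth)

# The first `depth` rows of each lag column are NaN —
# lag_365 is entirely NaN until a full year of data exists
```

Note that lag_365 only starts producing values at row 366, which is why a full year of history is the entry ticket for annual-seasonality features.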
Creating Lag Features with shift()
The scenario:
You're a data scientist at a grocery chain. The demand forecasting team needs a model to predict tomorrow's sales. The raw dataset has one row per day with total units sold. Your job is to add lag_1 (yesterday), lag_3 (three days ago), and lag_7 (same day last week) as features — the exact reference points a human forecaster would instinctively reach for.
# Import pandas and numpy
import pandas as pd
import numpy as np
# Create a daily sales DataFrame — 10 rows
sales_df = pd.DataFrame({
    'date': pd.date_range(start='2024-06-01', periods=10, freq='D'),  # 10 consecutive days
    'units_sold': [200, 210, 195, 220, 230, 250, 240, 260, 270, 255]  # daily units sold
})
# Sort by date — always required before shift operations
sales_df = sales_df.sort_values('date').reset_index(drop=True)
# lag_1: yesterday's units sold — shift the column DOWN by 1 row
# The first row gets NaN because there is no "yesterday" for it
sales_df['lag_1'] = sales_df['units_sold'].shift(1)
# lag_3: units sold 3 days ago — shift down by 3 rows
# First 3 rows get NaN
sales_df['lag_3'] = sales_df['units_sold'].shift(3)
# lag_7: units sold same day last week — shift down by 7 rows
# First 7 rows get NaN — this is expected and normal
sales_df['lag_7'] = sales_df['units_sold'].shift(7)
# Derived feature: delta vs yesterday — how much did sales change since lag_1?
# This is often more predictive than the raw lag value itself
sales_df['delta_lag_1'] = sales_df['units_sold'] - sales_df['lag_1']
# Round for display
sales_df = sales_df.round(1)
# Print results
print(sales_df.to_string(index=False))
      date  units_sold  lag_1  lag_3  lag_7  delta_lag_1
2024-06-01         200    NaN    NaN    NaN          NaN
2024-06-02         210  200.0    NaN    NaN         10.0
2024-06-03         195  210.0    NaN    NaN        -15.0
2024-06-04         220  195.0  200.0    NaN         25.0
2024-06-05         230  220.0  210.0    NaN         10.0
2024-06-06         250  230.0  195.0    NaN         20.0
2024-06-07         240  250.0  220.0    NaN        -10.0
2024-06-08         260  240.0  230.0  200.0         20.0
2024-06-09         270  260.0  250.0  210.0         10.0
2024-06-10         255  270.0  240.0  195.0        -15.0
What just happened?
shift(1) moves the entire column down by one row, so each row now contains yesterday's value in lag_1. The NaNs at the top are expected — there is simply no "yesterday" for June 1. By June 8, all three lags are populated and the model can compare today's 260 units against lag_1 (240), lag_3 (230), and lag_7 (200), seeing a clear upward trend across all three reference points. The delta_lag_1 column makes the day-over-day change explicit — June 10 shows −15, alerting the model to a potential reversal.
Lag Features Per Group — The Multi-Entity Pattern
Just like rolling features, lags must be computed within each entity separately when your DataFrame contains multiple time series stacked together. A naive shift(1) on the full DataFrame will make the last row of Customer A bleed into the first row of Customer B — a silent, catastrophic error.
The scenario:
You're building a churn model at a SaaS company. The dataset has weekly login counts per user, all stacked in one DataFrame. You need lag_1 (last week's logins) and lag_2 (two weeks ago) per user. The product team suspects that a sharp drop in weekly logins is the strongest early churn signal — you need the lag features to compute that drop explicitly.
# Import pandas and numpy
import pandas as pd
import numpy as np
# Create a weekly login DataFrame — 2 users, 5 weeks each, all in one flat table
churn_df = pd.DataFrame({
    'user_id': ['U1','U1','U1','U1','U1', 'U2','U2','U2','U2','U2'],  # two users
    'week': [1, 2, 3, 4, 5, 1, 2, 3, 4, 5],  # week number
    'logins': [12, 14, 13, 3, 2, 8, 9, 10, 11, 10]  # U1 drops off; U2 is stable
})
# Sort by user then week — required so shift operates in correct time order
churn_df = churn_df.sort_values(['user_id', 'week']).reset_index(drop=True)
# Compute lag_1 PER USER: last week's logins for each user
# groupby + shift keeps the shift operation isolated within each user's rows
churn_df['lag_1'] = churn_df.groupby('user_id')['logins'].shift(1)
# Compute lag_2 PER USER: logins from two weeks ago
churn_df['lag_2'] = churn_df.groupby('user_id')['logins'].shift(2)
# Compute week-over-week change: current logins minus last week's logins
churn_df['wow_change'] = churn_df['logins'] - churn_df['lag_1']
# Compute percentage change vs lag_1 — normalises the delta by baseline activity
# Multiply by 100 to express as a percentage
churn_df['wow_pct_change'] = (churn_df['wow_change'] / (churn_df['lag_1'] + 1e-9)) * 100
# Round for clean display
churn_df = churn_df.round(1)
# Print results
print(churn_df.to_string(index=False))
user_id week logins lag_1 lag_2 wow_change wow_pct_change
U1 1 12 NaN NaN NaN NaN
U1 2 14 12.0 NaN 2.0 16.7
U1 3 13 14.0 12.0 -1.0 -7.1
U1 4 3 13.0 14.0 -10.0 -76.9
U1 5 2 3.0 13.0 -1.0 -33.3
U2 1 8 NaN NaN NaN NaN
U2 2 9 8.0 NaN 1.0 12.5
U2 3 10 9.0 8.0 1.0 11.1
U2 4 11 10.0 9.0 1.0 10.0
     U2     5      10   11.0   10.0        -1.0            -9.1
What just happened?
U1's week 4 shows a wow_pct_change of −76.9% — logins collapsed from 13 to 3. That single number is an extremely loud churn signal. U2's worst week is only −9.1%, which is normal variation. The groupby before shift ensured U2 week 1's NaN did not pick up U1 week 5's value of 2 — which is exactly the silent bug that kills models in production when you forget the groupby.
Combining Lags with Rolling Features
In practice, you use lags and rolling features together. Lags give the model specific historical snapshots; rolling features give it smoothed summaries. A model that has both sees the same thing a seasoned analyst sees: the exact value at key moments in the past, plus a sense of the overall recent trend.
The scenario:
You're on a data science team at an energy company. Daily power consumption data feeds into a demand forecasting model. The operations team wants both specific lag reference points and a smoothed recent average — together they make the model robust to single-day noise while still being sensitive to sharp changes.
# Import pandas and numpy
import pandas as pd
import numpy as np
# Create a daily power consumption DataFrame — 10 rows (MWh)
energy_df = pd.DataFrame({
    'date': pd.date_range(start='2024-09-01', periods=10, freq='D'),  # 10 days
    'consumption': [310, 315, 308, 320, 490, 485, 480, 312, 318, 310]  # MWh; spike on days 5-7
})
# Sort by date — always required
energy_df = energy_df.sort_values('date').reset_index(drop=True)
# Lag features: specific historical reference points
energy_df['lag_1'] = energy_df['consumption'].shift(1) # yesterday's consumption
energy_df['lag_7'] = energy_df['consumption'].shift(7) # same day last week
# Rolling feature: 3-day smoothed average — reduces day-to-day noise
energy_df['roll_mean_3d'] = energy_df['consumption'].rolling(window=3, min_periods=1).mean()
# Derived feature: today vs rolling mean — how far is today from the recent smooth baseline?
energy_df['deviation_from_trend'] = energy_df['consumption'] - energy_df['roll_mean_3d']
# Derived feature: today vs lag_1 — raw day-over-day delta
energy_df['delta_vs_yesterday'] = energy_df['consumption'] - energy_df['lag_1']
# Round everything to 1 decimal place
energy_df = energy_df.round(1)
# Print selected columns for clarity
print(energy_df[['date','consumption','lag_1','roll_mean_3d','deviation_from_trend','delta_vs_yesterday']].to_string(index=False))
      date  consumption  lag_1  roll_mean_3d  deviation_from_trend  delta_vs_yesterday
2024-09-01          310    NaN         310.0                   0.0                 NaN
2024-09-02          315  310.0         312.5                   2.5                 5.0
2024-09-03          308  315.0         311.0                  -3.0                -7.0
2024-09-04          320  308.0         314.3                   5.7                12.0
2024-09-05          490  320.0         372.7                 117.3               170.0
2024-09-06          485  490.0         431.7                  53.3                -5.0
2024-09-07          480  485.0         485.0                  -5.0                -5.0
2024-09-08          312  480.0         425.7                -113.7              -168.0
2024-09-09          318  312.0         370.0                 -52.0                 6.0
2024-09-10          310  318.0         313.3                  -3.3                -8.0
What just happened?
Sep 5 is a major anomaly — consumption spikes to 490 MWh, 170 units above lag_1 and 117.3 above the rolling mean. Both features independently flag this as extreme. Sep 8 shows the mirror image: consumption drops back to 312 while the rolling mean is still elevated at 425 (dragged up by the spike days), producing a deviation_from_trend of −113.7. The combination of lag and rolling features gives the model a much richer description of this event than either feature type alone.
The Leakage Rule for Lag Features
Lag features have a leakage rule that is slightly different from group-based features. The rule is simple: the lag depth must be at least as large as the forecast horizon.
Predicting tomorrow (horizon = 1 day)
lag_1 is safe — at prediction time, yesterday's value is already known. lag_0 would be leakage — that's today's value, which is the target itself.
Predicting 7 days ahead (horizon = 7 days)
lag_1 through lag_6 are leakage — those days haven't happened yet at prediction time. lag_7 is the minimum safe lag. lag_14 and lag_21 are also safe.
The general rule
Minimum safe lag = forecast horizon. Any lag shallower than the horizon uses data that would not yet exist when the model runs in production. Violating this rule produces models that look great offline and fail completely in deployment.
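The rule can be enforced mechanically by parameterising lag generation on the horizon — start at the horizon and only go deeper, never shallower. A sketch, assuming a 7-day horizon and an illustrative sales series:

```python
import pandas as pd

horizon = 7  # predicting 7 days ahead

# Illustrative daily sales series
df = pd.DataFrame({'sales': range(100, 130)})

# Only lags >= horizon are safe: at prediction time, anything shallower
# would reference days that have not yet happened
for depth in [horizon, horizon * 2, horizon * 3]:  # lag_7, lag_14, lag_21
    df[f'lag_{depth}'] = df['sales'].shift(depth)

# Sanity check: no lag column is shallower than the horizon
assert all(int(c.split('_')[1]) >= horizon
           for c in df.columns if c.startswith('lag_'))
```

Baking the assertion into the pipeline means a teammate who later adds lag_3 "because it improved validation score" gets an immediate failure instead of a model that silently leaks.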
A Visual — What shift() Does to a Column
This table makes the mechanics of shift() completely transparent. Each lag column is the original column slid down by N rows:
| Date | Original | lag_1 shift(1) | lag_3 shift(3) | delta (orig − lag_1) |
|---|---|---|---|---|
| Jun 1 | 200 | NaN | NaN | NaN |
| Jun 2 | 210 | 200 | NaN | +10 |
| Jun 3 | 195 | 210 | NaN | −15 |
| Jun 4 | 220 | 195 | 200 | +25 |
| Jun 5 | 230 | 220 | 210 | +10 |
NaN values appear wherever there is no historical data to look back to. For lag_3, the first three rows will always be NaN because there are no rows 3 steps in the past.
Teacher's Note
The NaN rows created by lag features are a decision point, not just a nuisance. You have three options: drop them (simplest, but you lose early rows), fill them with a global or group mean (keeps the rows but introduces a small bias), or use them as a boolean mask feature — a column called lag_1_available that is 1 when lag_1 is not NaN and 0 otherwise. The third option is surprisingly useful in production where partial history is common for new customers or new products and you want the model to explicitly know whether the lag was real or imputed.
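The three options from the note can be sketched side by side (the tiny series and column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'logins': [12, 14, 13, 3, 2]})
df['lag_1'] = df['logins'].shift(1)

# Option 1: drop rows where the lag is missing — simplest, loses early rows
dropped = df.dropna(subset=['lag_1'])

# Option 3: availability mask — computed BEFORE imputing, so the model
# knows whether lag_1 was a real observation or a filled-in value
df['lag_1_available'] = df['lag_1'].notna().astype(int)

# Option 2: fill with the column mean — keeps all rows, introduces a small bias
df['lag_1'] = df['lag_1'].fillna(df['lag_1'].mean())
```

Ordering matters here: the mask must be built before the fill, otherwise every row reports the lag as available and the mask feature carries no information.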
Practice Questions
1. Which pandas method is used to create lag features by moving column values down by N rows?
2. To avoid leakage, the minimum safe lag depth must be at least as large as the ________ ________.
3. When multiple entities share a single DataFrame, you must use ________ before shift() to prevent one entity's last row from leaking into the next entity's first row.
Quiz
1. Your model predicts demand 7 days in advance. Which of these lag features are safe to use without causing leakage?
2. What is the consequence of calling shift(1) on a multi-entity DataFrame without using groupby first?
3. What is the main practical difference between lag features and rolling window features?
Up Next · Lesson 35
Feature Engineering for Imbalanced Data
When 99% of your rows are class 0, raw features lie. Learn how to engineer features that actually separate the rare minority class from the noise.