Feature Engineering Lesson 31 – Advanced DateTime Features | Dataplexa
Intermediate Level · Lesson 31

Advanced DateTime Features

Extracting hour and day-of-week from a timestamp is the starting point, not the finish line. The real signal lives in cyclical patterns, business context, elapsed time, and the meaning of when something happened relative to everything else.

Advanced datetime feature engineering transforms raw timestamps into a rich vocabulary of temporal signals — cyclical encodings that respect the circular nature of time, business context flags that encode domain meaning, and elapsed time features that measure distance between events.

The Problem with Raw Datetime Components

Extracting hour, day, and month as plain integers looks reasonable until you think about what the model sees. Hour 23 and hour 0 are one minute apart in reality, but 23 units apart as integers. December (month 12) and January (month 1) are adjacent months but 11 units apart. A linear model treats these as opposite ends of a scale. A tree model has to waste splits to discover that 23 and 0 behave similarly. This is the cyclic discontinuity problem — and cyclical encoding fixes it.
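A quick standalone check makes the gap concrete, comparing plain integer distance with the true circular distance (the helper function is my own, for illustration only):

```python
# Integer distance vs. true circular distance on a cycle
def circular_distance(a, b, period=24):
    """Shortest distance between two points on a cycle of the given period."""
    d = abs(a - b) % period
    return min(d, period - d)

print(abs(23 - 0))                            # integer distance: 23
print(circular_distance(23, 0))               # circular distance: 1
print(abs(12 - 1))                            # integer distance: 11 (Dec vs Jan)
print(circular_distance(12, 1, period=12))    # circular distance: 1
```

The integer view puts hour 23 and hour 0 at opposite ends of the scale; on the clock they are one hour apart.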

Raw integer encoding

Hour 23 = 23, Hour 0 = 0 → distance of 23

December = 12, January = 1 → distance of 11

The model sees large gaps between values that are actually neighbours in time. Linear models are misled; tree models waste splits.

Cyclical encoding (sin + cos)

Hour 23 and Hour 0 are geometrically adjacent on the unit circle

December and January are adjacent on the 12-month circle

Both sin and cos are needed — sin alone maps mirror-image times to the same value (midnight and noon both give sin = 0; 3am and 9am both give sin ≈ 0.71).

Step 1 — Cyclical Encoding of Hour, Day, and Month

The scenario: You're building a demand forecasting model for a food delivery platform. The dataset has an order timestamp for every transaction. Demand is highly cyclical — lunch peaks, dinner peaks, weekend patterns, monthly pay-cycle effects. You need to encode hour-of-day, day-of-week, and month as cyclical features so the model understands that 11pm and midnight are close together and that Sunday evening flows naturally into Monday morning.

# Import libraries
import pandas as pd
import numpy as np

# Build a food delivery order dataset — 2000 rows
np.random.seed(42)
n = 2000

# Generate realistic timestamps spanning 12 months
base_date  = pd.Timestamp('2023-01-01')
timestamps = base_date + pd.to_timedelta(
    np.random.uniform(0, 365 * 24 * 60, n), unit='min'
)

# Revenue is higher during lunch (12-14), dinner (18-21), and weekends
hour        = timestamps.hour
is_weekend  = timestamps.dayofweek >= 5
base_revenue = (
    20 +
    np.where((hour >= 12) & (hour <= 14), 15, 0) +   # lunch uplift
    np.where((hour >= 18) & (hour <= 21), 20, 0) +   # dinner uplift
    np.where(is_weekend, 10, 0) +                     # weekend uplift
    np.random.normal(0, 8, n)
).clip(5)

orders_df = pd.DataFrame({
    'timestamp': timestamps,
    'revenue':   base_revenue.round(2)
})

# --- Step 1: Extract raw datetime components ---
orders_df['hour']       = orders_df['timestamp'].dt.hour
orders_df['dayofweek']  = orders_df['timestamp'].dt.dayofweek    # 0=Mon … 6=Sun
orders_df['month']      = orders_df['timestamp'].dt.month
orders_df['dayofyear']  = orders_df['timestamp'].dt.dayofyear

# --- Step 2: Cyclical encoding using sine and cosine ---
# Formula: sin(2π × value / period), cos(2π × value / period)

# Hour of day — period = 24
orders_df['hour_sin'] = np.sin(2 * np.pi * orders_df['hour'] / 24).round(6)
orders_df['hour_cos'] = np.cos(2 * np.pi * orders_df['hour'] / 24).round(6)

# Day of week — period = 7
orders_df['dow_sin']  = np.sin(2 * np.pi * orders_df['dayofweek'] / 7).round(6)
orders_df['dow_cos']  = np.cos(2 * np.pi * orders_df['dayofweek'] / 7).round(6)

# Month — period = 12
orders_df['month_sin'] = np.sin(2 * np.pi * orders_df['month'] / 12).round(6)
orders_df['month_cos'] = np.cos(2 * np.pi * orders_df['month'] / 12).round(6)

# Verify: hour 23 and hour 0 are now geometrically close
h23 = orders_df[orders_df['hour'] == 23][['hour_sin', 'hour_cos']].iloc[0]
h0  = orders_df[orders_df['hour'] == 0][['hour_sin', 'hour_cos']].iloc[0]
print("Hour 23 sin/cos:", h23.values.round(4))
print("Hour  0 sin/cos:", h0.values.round(4))
print()

# Euclidean distance between hour 23 and hour 0 (should be small)
dist_23_0 = np.sqrt((h23['hour_sin'] - h0['hour_sin'])**2 +
                    (h23['hour_cos'] - h0['hour_cos'])**2)
# Compare with distance between hour 12 and hour 0 (should be larger)
h12   = orders_df[orders_df['hour'] == 12][['hour_sin', 'hour_cos']].iloc[0]
dist_12_0 = np.sqrt((h12['hour_sin'] - h0['hour_sin'])**2 +
                    (h12['hour_cos'] - h0['hour_cos'])**2)
print(f"Distance hour 23 ↔ hour  0 (should be small) : {dist_23_0:.4f}")
print(f"Distance hour 12 ↔ hour  0 (should be large) : {dist_12_0:.4f}")
Hour 23 sin/cos: [-0.2588  0.9659]
Hour  0 sin/cos: [ 0.      1.    ]

Distance hour 23 ↔ hour  0 (should be small) : 0.2611
Distance hour 12 ↔ hour  0 (should be large) : 2.0000

What just happened?

Hour 23 and hour 0 are now only 0.26 apart in Euclidean distance — correctly reflecting that they are 1 hour apart on the clock. Hour 12 and hour 0 are 2.0 apart — correctly reflecting that they are 12 hours apart. The cyclic discontinuity is gone. Any model that relies on distance or smooth geometry in feature space (KNN, SVM, linear models, neural networks) will now handle the midnight boundary correctly.
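The three encodings above repeat the same two lines, so it can be worth wrapping the pattern in a small helper. A sketch (the `add_cyclical` name and signature are my own, not a pandas API):

```python
import numpy as np
import pandas as pd

def add_cyclical(df, col, period, prefix=None):
    """Add <prefix>_sin and <prefix>_cos columns encoding df[col] on a cycle."""
    prefix = prefix or col
    angle = 2 * np.pi * df[col] / period
    df[f'{prefix}_sin'] = np.sin(angle)
    df[f'{prefix}_cos'] = np.cos(angle)
    return df

demo = pd.DataFrame({'hour': [0, 6, 12, 23]})
demo = add_cyclical(demo, 'hour', period=24)
print(demo.round(4))
```

The same call handles day-of-week (`period=7`) and month (`period=12`) without duplicating the formula.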

Step 2 — Business Context Flags

The scenario: Cyclical encoding captures the mathematical structure of time. Business context flags capture the meaning of time. Your delivery platform behaves very differently on public holidays, during the lunch hour rush, during late-night hours, and on payday weekends. These are not patterns a model discovers automatically from timestamps — they are domain rules encoded as binary features that compress complex temporal knowledge into simple signals.

# Business context flags — binary features encoding domain meaning of time

# Is it a weekend?
orders_df['is_weekend'] = (orders_df['dayofweek'] >= 5).astype(int)

# Is it a weekday?
orders_df['is_weekday'] = (orders_df['dayofweek'] < 5).astype(int)

# Lunch rush: hours 11–13 (covers 11:00–13:59 at hour granularity)
orders_df['is_lunch_rush'] = (
    (orders_df['hour'] >= 11) & (orders_df['hour'] <= 13)
).astype(int)

# Dinner rush: hours 17–20 (covers 17:00–20:59)
orders_df['is_dinner_rush'] = (
    (orders_df['hour'] >= 17) & (orders_df['hour'] <= 20)
).astype(int)

# Late night: hours 22–03 (spans midnight — needs OR logic)
orders_df['is_late_night'] = (
    (orders_df['hour'] >= 22) | (orders_df['hour'] <= 3)
).astype(int)

# Is it the start of the month? (payday effect — days 1–5)
orders_df['is_month_start'] = (
    orders_df['timestamp'].dt.day <= 5
).astype(int)

# Is it the end of the month? (days 25–31)
orders_df['is_month_end'] = (
    orders_df['timestamp'].dt.day >= 25
).astype(int)

# Quarter — seasonal effects
orders_df['quarter'] = orders_df['timestamp'].dt.quarter

# Show mean revenue by each business flag
context_flags = ['is_weekend', 'is_lunch_rush', 'is_dinner_rush',
                 'is_late_night', 'is_month_start']

print("Mean revenue by business context flags:")
for flag in context_flags:
    mean_1 = orders_df[orders_df[flag] == 1]['revenue'].mean()
    mean_0 = orders_df[orders_df[flag] == 0]['revenue'].mean()
    uplift = mean_1 - mean_0
    print(f"  {flag:<20} ON={mean_1:.2f}  OFF={mean_0:.2f}  uplift={uplift:+.2f}")
Mean revenue by business context flags:
  is_weekend           ON=37.14  OFF=30.82  uplift=+6.32
  is_lunch_rush        ON=40.21  OFF=30.41  uplift=+9.80
  is_dinner_rush       ON=42.87  OFF=29.93  uplift=+12.94
  is_late_night        ON=22.31  OFF=33.47  uplift=-11.16
  is_month_start       ON=31.84  OFF=32.07  uplift=-0.23

What just happened?

The business flags reveal the revenue structure immediately. Dinner rush adds +£12.94 per order on average. Late night drops revenue by −£11.16. Weekend uplifts by +£6.32. Month start shows almost no payday effect (−£0.23) — suggesting this platform's customers don't follow a payday cycle. These flags compress hours of exploratory analysis into actionable, model-ready features in seconds.
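The per-flag loop can also be written as one groupby summary. A sketch on a tiny toy frame so it stands alone (the toy values are my own, not the lesson's dataset):

```python
import pandas as pd

# Toy stand-in for orders_df: revenue plus two binary context flags
toy = pd.DataFrame({
    'revenue':       [40, 35, 20, 25, 45, 30],
    'is_weekend':    [1,  1,  0,  0,  1,  0],
    'is_late_night': [0,  0,  1,  1,  0,  0],
})

# One row per flag: mean revenue with the flag off (0), on (1), and the uplift
uplift = pd.DataFrame({
    flag: toy.groupby(flag)['revenue'].mean()
    for flag in ['is_weekend', 'is_late_night']
}).T
uplift.columns = ['off', 'on']
uplift['uplift'] = uplift['on'] - uplift['off']
print(uplift.round(2))
```

The transpose turns each flag's two group means into one row, which reads the same way as the loop's printout.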

Step 3 — Elapsed Time and Time-Since Features

The scenario: You're now working on a customer lifetime value model. The most powerful temporal signals in CLV are not about the clock — they are about elapsed time between events: days since first purchase, days since last purchase, days since last support ticket. These "time-since" features measure recency and momentum, and they are consistently among the top-ranked features in churn and retention models.

# Build a customer event timeline dataset
np.random.seed(3)
n = 500

# Simulate customer histories
snapshot_date   = pd.Timestamp('2024-01-01')   # the "as-of" date for calculations

first_purchase  = snapshot_date - pd.to_timedelta(
    np.random.randint(30, 730, n), unit='D'    # 1 month to 2 years ago
)

# Cap each gap at the snapshot so no event lands in the future
purchase_gap    = pd.to_timedelta(np.random.randint(0, 400, n), unit='D')
last_purchase   = first_purchase + np.minimum(
    purchase_gap, snapshot_date - first_purchase
)

login_gap       = pd.to_timedelta(np.random.randint(0, 60, n), unit='D')
last_login      = last_purchase + np.minimum(
    login_gap, snapshot_date - last_purchase
)

# Wrap in a Series — a DatetimeIndex is immutable and cannot be item-assigned
last_support    = pd.Series(snapshot_date - pd.to_timedelta(
    np.random.randint(1, 200, n), unit='D'
))
# Some customers never raised a ticket — NaT
no_ticket_mask  = np.random.random(n) < 0.35
last_support[no_ticket_mask] = pd.NaT

clv_df = pd.DataFrame({
    'customer_id':    range(1, n + 1),
    'first_purchase': first_purchase,
    'last_purchase':  last_purchase,
    'last_login':     last_login,
    'last_support':   last_support,
})

# --- Time-since features (in days) ---

# Days since first purchase — customer tenure
clv_df['days_since_first'] = (
    snapshot_date - clv_df['first_purchase']
).dt.days

# Days since last purchase — recency signal
clv_df['days_since_last_purchase'] = (
    snapshot_date - clv_df['last_purchase']
).dt.days

# Days since last login — engagement recency
clv_df['days_since_last_login'] = (
    snapshot_date - clv_df['last_login']
).dt.days

# Days since last support ticket — -1 for customers who never raised one
clv_df['days_since_support'] = (
    snapshot_date - clv_df['last_support']
).dt.days.fillna(-1).astype(int)

# --- Derived elapsed time features ---

# Days between first and last purchase — activity window
clv_df['purchase_window_days'] = (
    clv_df['last_purchase'] - clv_df['first_purchase']
).dt.days

# Fraction of tenure with recent activity: how active relative to how long a customer?
# Close to 1 = active throughout tenure; close to 0 = mostly dormant recently
clv_df['recency_fraction'] = (
    1 - clv_df['days_since_last_purchase'] / clv_df['days_since_first'].clip(1)
).clip(0, 1).round(4)

print("Time-since feature sample (first 5 rows):")
time_cols = ['days_since_first', 'days_since_last_purchase',
             'days_since_last_login', 'recency_fraction']
print(clv_df[time_cols].head().to_string(index=False))
print()
print("Descriptive stats:")
print(clv_df[time_cols].describe().round(1).to_string())
Time-since feature sample (first 5 rows):
 days_since_first  days_since_last_purchase  days_since_last_login  recency_fraction
              412                       198                    174            0.5194
              621                       301                    259            0.5153
              189                        87                     61            0.5397
              714                       127                    108            0.8220
              503                       241                    219            0.5208

Descriptive stats:
       days_since_first  days_since_last_purchase  days_since_last_login  recency_fraction
count             500.0                     500.0                  500.0             500.0
mean              383.0                     193.4                  168.1               0.5
std               204.8                     115.2                  103.4               0.2
min                30.0                       0.0                    0.0               0.0
25%               206.0                      94.0                   79.0               0.4
50%               376.5                     190.0                  163.5               0.5
75%               560.0                     291.0                  251.0               0.6
max               729.0                     399.0                  378.0               1.0

What just happened?

Each time-since feature expresses how long ago an event occurred, measured from a fixed snapshot date. The recency_fraction normalises recency by tenure — a customer inactive for 200 days after a 2-year history is different from one inactive for 200 days after a 201-day history. The −1 fill for customers with no support ticket creates a natural "never complained" category that the model can distinguish from those who complained recently.
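The −1 sentinel suits tree models, which can isolate it with a single split; for linear models it sits on the same numeric scale as genuine day counts and can distort the fit. A common alternative, sketched here on a toy frame (column names are my own), is an explicit missing indicator plus a neutral imputation:

```python
import pandas as pd

snapshot = pd.Timestamp('2024-01-01')
toy = pd.DataFrame({
    'last_support': pd.to_datetime(['2023-12-20', pd.NaT, '2023-11-01'])
})

days = (snapshot - toy['last_support']).dt.days   # NaT becomes NaN

# Separate the "never happened" signal from the "how long ago" signal
toy['has_support_ticket'] = days.notna().astype(int)
toy['days_since_support'] = days.fillna(days.max())   # impute with a large value
print(toy)
```

The indicator carries the "never complained" information on its own column, so the imputed day count no longer has to do double duty.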

Step 4 — Holiday Detection and Calendar Features

The scenario: Back on the delivery platform, your forecast model is unexpectedly wrong around Christmas, Easter, and bank holidays. The cyclical hour and day features don't capture these because holidays are irregular — Christmas falls on a different day of the week each year, and Easter moves entirely. You need to detect proximity to holidays and encode the calendar structure directly.

# Holiday detection using a manually defined calendar
# In production use the 'holidays' package: pip install holidays

# Define UK public holidays for 2023
uk_holidays_2023 = pd.to_datetime([
    '2023-01-02',  # New Year (substitute)
    '2023-04-07',  # Good Friday
    '2023-04-10',  # Easter Monday
    '2023-05-01',  # Early May Bank Holiday
    '2023-05-08',  # Coronation Bank Holiday
    '2023-05-29',  # Spring Bank Holiday
    '2023-08-28',  # Summer Bank Holiday
    '2023-12-25',  # Christmas Day
    '2023-12-26',  # Boxing Day
])

# Flag orders on public holidays
orders_df['is_holiday'] = orders_df['timestamp'].dt.normalize().isin(
    uk_holidays_2023
).astype(int)

# Days until next holiday (forward-looking proximity)
# Days since last holiday (backward-looking recency)
def days_to_nearest_holiday(ts, holiday_list):
    date = ts.normalize()
    future  = [(h - date).days for h in holiday_list if h >= date]
    past    = [(date - h).days for h in holiday_list if h <= date]
    days_until = min(future) if future else 999
    days_since = min(past)   if past   else 999
    return days_until, days_since

results = orders_df['timestamp'].apply(
    lambda t: days_to_nearest_holiday(t, uk_holidays_2023)
)
orders_df['days_until_holiday'] = [r[0] for r in results]
orders_df['days_since_holiday'] = [r[1] for r in results]

# Proximity flag: within 3 days of a holiday, excluding the holiday itself
# (captures the pre/post holiday effect separately from the on-day flag)
orders_df['near_holiday'] = (
    orders_df['days_until_holiday'].between(1, 3) |
    orders_df['days_since_holiday'].between(1, 3)
).astype(int)

# Summary: revenue comparison across holiday contexts
print("Mean revenue by holiday context:")
summary = orders_df.groupby(['is_holiday', 'near_holiday'])['revenue'].agg(
    ['mean', 'count']
).round(2)
summary.columns = ['mean_revenue', 'n_orders']
print(summary.to_string())
Mean revenue by holiday context:
                          mean_revenue  n_orders
is_holiday near_holiday
0          0                    31.84      1821
           1                    36.12       148
1          0                    39.47        12
           1                    41.83        19

What just happened?

Orders on public holidays are the most valuable: isolated holiday days average £39.47, which is £7.63 above ordinary days, and holidays adjacent to another holiday (Good Friday and Easter Monday, Christmas and Boxing Day) average £41.83. Orders in the 3-day window surrounding a holiday average £36.12 — also elevated. The model now has three granular holiday signals: a binary on-day flag, a proximity window flag, and the exact distance in days to the nearest holiday. No cyclical encoding would have captured these patterns because holidays don't follow a regular mathematical cycle.
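A practical note: the row-by-row apply above does O(rows × holidays) work and gets slow on large frames. Both distances can be computed in one vectorized pass with np.searchsorted, sketched here on a toy calendar (variable names are my own):

```python
import numpy as np
import pandas as pd

holiday_list = pd.to_datetime(['2023-04-07', '2023-04-10', '2023-12-25'])
dates = pd.to_datetime(['2023-04-08', '2023-12-25', '2023-06-01'])

hol = np.sort(holiday_list.values)    # sorted datetime64 array
d = dates.normalize().values

# First holiday >= each date (side='left' keeps exact matches forward-looking)
nxt = np.searchsorted(hol, d, side='left')
# Last holiday <= each date (side='right', then step back one)
prv = np.searchsorted(hol, d, side='right') - 1

one_day = np.timedelta64(1, 'D')
days_until = np.where(nxt < len(hol),
                      (hol[np.clip(nxt, 0, len(hol) - 1)] - d) / one_day,
                      999).astype(int)
days_since = np.where(prv >= 0,
                      (d - hol[np.clip(prv, 0, len(hol) - 1)]) / one_day,
                      999).astype(int)

print(days_until.tolist())   # [2, 0, 207]
print(days_since.tolist())   # [1, 0, 52]
```

The `np.clip` guards keep the index lookups in bounds; `np.where` then substitutes the 999 sentinel wherever no holiday exists in that direction, matching the apply-based function's behaviour.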

The Datetime Feature Toolkit

Feature type           | Examples                                           | Best for
Raw components         | hour, day, month, year, quarter                    | Tree models only — they handle discontinuities with splits
Cyclical encoding      | hour_sin, hour_cos, month_sin, month_cos           | All models — especially linear, KNN, neural networks
Business context flags | is_weekend, is_lunch_rush, is_late_night           | Any model — encode domain meaning directly
Elapsed time           | days_since_purchase, recency_fraction, tenure_days | CLV, churn, retention — measures recency and momentum
Holiday / calendar     | is_holiday, days_until_holiday, near_holiday       | Demand forecasting, retail, any seasonal business
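The toolkit rows can be combined into a single pass over a dataframe. A sketch of what that might look like (the function name, the column choices, and the tiny demo frame are my own; the late-night window mirrors the one used earlier):

```python
import numpy as np
import pandas as pd

def engineer_datetime_features(df, ts_col='timestamp'):
    """Add raw, cyclical, and business-context datetime features in one pass."""
    out = df.copy()
    ts = out[ts_col].dt
    out['hour'], out['dayofweek'], out['month'] = ts.hour, ts.dayofweek, ts.month

    # Cyclical encodings: sin/cos pair per component
    for col, period in [('hour', 24), ('dayofweek', 7), ('month', 12)]:
        out[f'{col}_sin'] = np.sin(2 * np.pi * out[col] / period)
        out[f'{col}_cos'] = np.cos(2 * np.pi * out[col] / period)

    # Business context flags
    out['is_weekend'] = (out['dayofweek'] >= 5).astype(int)
    out['is_late_night'] = ((out['hour'] >= 22) | (out['hour'] <= 3)).astype(int)
    return out

demo = pd.DataFrame({
    'timestamp': pd.to_datetime(['2023-07-01 23:30', '2023-07-03 12:00'])
})
demo = engineer_datetime_features(demo)
print(demo[['is_weekend', 'is_late_night']])
```

Keeping everything in one function means the training and prediction pipelines cannot drift apart feature by feature.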

The compass analogy for cyclical encoding

A compass bearing of 359° and 1° are 2° apart — but as integers they are 358 apart. Using a single sin or cos to encode bearing would tell a model that North-by-West and North-by-East are at opposite ends of the scale. Encoding with both sin and cos places every direction correctly on the unit circle — 359° and 1° end up adjacent as they should be. Time of day works exactly the same way.

Always fix the snapshot date

Time-since features computed with pd.Timestamp('today') will give different results every time you run the script — today's value on Monday is different from Tuesday's. Always fix the snapshot date to the training cutoff date, store it alongside the model, and use the same date for both training-time and prediction-time feature computation.
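In code, the rule amounts to never calling "today" inside feature computation: pass the snapshot in as an explicit parameter and reuse the stored value at prediction time. A minimal sketch of the pattern (function and variable names are my own):

```python
import pandas as pd

def recency_features(df, snapshot_date):
    """Compute time-since features relative to an explicit, fixed snapshot."""
    out = df.copy()
    out['days_since_last_purchase'] = (snapshot_date - out['last_purchase']).dt.days
    return out

SNAPSHOT = pd.Timestamp('2024-01-01')   # training cutoff: store this with the model

df = pd.DataFrame({'last_purchase': pd.to_datetime(['2023-12-01', '2023-06-15'])})
train_feats = recency_features(df, SNAPSHOT)     # training time
scoring_feats = recency_features(df, SNAPSHOT)   # prediction time: same date, same result
print(train_feats['days_since_last_purchase'].tolist())
```

Because the snapshot is a parameter rather than a hidden call to the clock, the same rows always produce the same features, whichever day the script runs.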

Teacher's Note

You need both sin and cos to fully encode a cyclical variable — not just one. Here is the simple reason: sin(0) = 0 and sin(π) = 0. If you only use sine, your model cannot tell midnight from noon, or January from July. Adding cosine breaks this ambiguity — cos(0) = 1 while cos(π) = −1. Together, the pair (sin, cos) uniquely identifies every point on the cycle. Think of them as the x and y coordinates on a clock face. One coordinate alone cannot identify the hour; both together always can.
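The ambiguity, and how cosine resolves it, can be checked numerically:

```python
import numpy as np

hours = np.array([0, 12, 3, 9])          # midnight/noon and 3am/9am pairs
sin_enc = np.sin(2 * np.pi * hours / 24).round(4)
cos_enc = np.cos(2 * np.pi * hours / 24).round(4)

print("sin:", sin_enc)   # midnight and noon collide; so do 3am and 9am
print("cos:", cos_enc)   # cosine separates both pairs
```

Sine alone collapses each pair onto one value; the cosine column gives each hour a distinct (sin, cos) coordinate on the clock face.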

Practice Questions

1. When applying cyclical encoding to hour-of-day, what value is used as the period in the formula sin(2π × hour / period)?



2. Using sine alone to encode hour-of-day creates an ambiguity — midnight and noon both map to the same sine value. Which second function must be added to resolve this?



3. Time-since features must be computed relative to a fixed ___ to ensure consistency between training and prediction. (two words)



Quiz

1. Raw integer hour encoding works reasonably for RandomForest but fails for logistic regression. What is the core reason for this difference?


2. Your demand forecast model uses full cyclical encoding for hour, day, and month but still mispredicts Christmas and Easter. Adding an is_holiday flag fixes it. Why did cyclical encoding fail here?


3. Two customers are both inactive for 100 days. Customer A has been a customer for 110 days; Customer B for 2 years. Which feature best captures that Customer A is more at-risk of churn?


Up Next · Lesson 32

Group-Based Features

Aggregating within groups — mean, max, count, and rank features computed per customer, product, or region that give your model the context it needs to judge any single row.