Feature Engineering Lesson 6 – Date & Time Features | Dataplexa
Beginner Level · Lesson 6

Date & Time Features

A datetime column looks like one piece of data but it's actually a compressed bundle of signals — year, month, season, day of week, time since an event, and more. None of those signals exist as usable features until you extract them deliberately. This lesson teaches you to unpack a timestamp completely and validate which extracted features are worth keeping.

The datetime engineering workflow follows three steps: parse the text string into a real datetime object, extract every potentially useful signal, then validate each one against the target and drop the ones that don't carry signal. The raw date column itself gets dropped before training — most models can't use datetime objects directly.

Parsing and Extracting Standard Datetime Signals

The .dt accessor is the key that unlocks all datetime operations in pandas. Without it, calling .month on a column throws an AttributeError. With it, you get access to dozens of properties — month, year, day of week, quarter, week of year, hour, minute, and more.
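To see that failure mode concretely, here is a tiny sketch with toy strings (not the lesson's dataset):

```python
import pandas as pd

s = pd.Series(['2023-01-12', '2023-04-03'])  # still plain strings

try:
    s.month  # fails: the column isn't datetime yet, and there is no .dt
except AttributeError as exc:
    print(f"AttributeError: {exc}")

parsed = pd.to_datetime(s)       # convert to datetime64 first...
print(parsed.dt.month.tolist())  # ...then .dt unlocks the properties: [1, 4]
```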

The scenario: The property dataset has a sale_date column stored as text. The analytics team suspects a seasonal pattern — spring and summer listings might sell for more because buyers are more active. Your job is to parse the column, extract everything useful, and report back which signals actually correlate with sale price.

import pandas as pd

housing_df = pd.DataFrame({
    'sale_date':  ['2023-01-12','2023-04-03','2022-11-22','2023-06-15',
                   '2022-09-08','2023-03-28','2022-07-14','2023-08-01',
                   '2022-12-05','2023-05-19'],
    'sale_price': [245000,380000,210000,430000,
                   295000,360000,400000,450000,220000,410000]
})

# pd.to_datetime() converts text strings into datetime64 objects
# pandas treats '2023-01-12' as a plain string until this step — no date maths possible
housing_df['sale_date'] = pd.to_datetime(housing_df['sale_date'])

# .dt accessor — unlocks all date/time properties on a datetime Series
housing_df['sale_year']       = housing_df['sale_date'].dt.year
housing_df['sale_month']      = housing_df['sale_date'].dt.month       # 1=Jan, 12=Dec
housing_df['sale_quarter']    = housing_df['sale_date'].dt.quarter     # Q1–Q4
housing_df['sale_dow']        = housing_df['sale_date'].dt.dayofweek   # 0=Mon, 6=Sun
housing_df['sale_weekofyear'] = housing_df['sale_date'].dt.isocalendar().week.astype(int)

# Validate every extracted feature — only keep what correlates with the target
print("Correlation with sale_price:\n")
features = ['sale_year','sale_month','sale_quarter','sale_dow','sale_weekofyear']
for feat in features:
    corr = housing_df[feat].corr(housing_df['sale_price'])
    keep = "  ✓ keep" if abs(corr) > 0.30 else "  — weak, consider dropping"
    print(f"  {feat:<20}  {corr:+.4f}{keep}")

print("\nExtracted features:\n")
print(housing_df[['sale_date','sale_month','sale_quarter',
                  'sale_dow','sale_year','sale_price']].to_string(index=False))
Correlation with sale_price:

  sale_year             +0.5612  ✓ keep
  sale_month            -0.3380  ✓ keep
  sale_quarter          -0.3425  ✓ keep
  sale_dow              +0.2638  — weak, consider dropping
  sale_weekofyear       -0.3559  ✓ keep

Extracted features:

  sale_date  sale_month  sale_quarter  sale_dow  sale_year  sale_price
 2023-01-12           1             1         3       2023      245000
 2023-04-03           4             2         0       2023      380000
 2022-11-22          11             4         1       2022      210000
 2023-06-15           6             2         3       2023      430000
 2022-09-08           9             3         3       2022      295000
 2023-03-28           3             1         1       2023      360000
 2022-07-14           7             3         3       2022      400000
 2023-08-01           8             3         1       2023      450000
 2022-12-05          12             4         0       2022      220000
 2023-05-19           5             2         4       2023      410000

What just happened?

pd.to_datetime() converts text strings into datetime64 objects, a mandatory step before any date arithmetic. The .dt accessor unlocks .month, .quarter, .dayofweek, and .year on a Series. The correlation check immediately separates signal from noise: sale_year (+0.56), sale_month (−0.34), sale_quarter (−0.34), and sale_weekofyear (−0.36) all clear the 0.30 threshold, while sale_dow (+0.26) falls short and is dropped. Notice that the month and quarter correlations come out negative even though mid-year sales fetch the highest prices: the cheap sales in months 11, 12, and 1 sit at both ends of the numeric scale, so a straight line through the month numbers slopes downward. That is an early glimpse of the cyclicality problem tackled in the next section. Four useful features extracted from one text column.

Season Features — When Month Numbers Miss the Pattern

Month numbers work well for detecting seasonality, but they carry a hidden problem: month 12 (December) and month 1 (January) sit at opposite ends of the numeric scale even though December 31st and January 1st are consecutive days. A linear model cannot see the December–January continuity. Cyclical encoding or an explicit season mapping solves this.

The scenario: The analytics lead says: "Month as an integer is fine for tree models, but our linear model won't see that December and January are adjacent. Can you create an explicit season feature and also show the month-to-season mapping so the team can verify the groupings are sensible for our UK market?"

import pandas as pd

housing_df = pd.DataFrame({
    'sale_date':  ['2023-01-12','2023-04-03','2022-11-22','2023-06-15',
                   '2022-09-08','2023-03-28','2022-07-14','2023-08-01',
                   '2022-12-05','2023-05-19'],
    'sale_price': [245000,380000,210000,430000,
                   295000,360000,400000,450000,220000,410000]
})

housing_df['sale_date']  = pd.to_datetime(housing_df['sale_date'])
housing_df['sale_month'] = housing_df['sale_date'].dt.month

# Season mapping — group months into four meteorological seasons
# UK spring (Mar–May) and summer (Jun–Aug) are peak selling months
season_map = {
    12: 'winter', 1: 'winter',  2: 'winter',
     3: 'spring', 4: 'spring',  5: 'spring',
     6: 'summer', 7: 'summer',  8: 'summer',
     9: 'autumn', 10:'autumn', 11: 'autumn'
}
housing_df['season'] = housing_df['sale_month'].map(season_map)

# Ordinal encode the seasons — winter=0 through summer=2 (peak market activity)
season_order = {'winter': 0, 'spring': 1, 'summer': 2, 'autumn': 3}
housing_df['season_enc'] = housing_df['season'].map(season_order)

# Mean price by season — validate the encoding direction
print("Mean sale price by season:\n")
summary = (housing_df.groupby(['season_enc','season'])['sale_price']
           .agg(['mean','count']).round(0).reset_index()
           .sort_values('season_enc'))
summary.columns = ['enc','season','mean_price','count']
print(summary.to_string(index=False))

corr = housing_df['season_enc'].corr(housing_df['sale_price'])
print(f"\nCorrelation of season_enc with sale_price: {corr:+.4f}")
Mean sale price by season:

  enc  season  mean_price  count
    0  winter    232500.0      2
    1  spring    383333.0      3
    2  summer    426667.0      3
    3  autumn    252500.0      2

Correlation of season_enc with sale_price: +0.1427

What just happened?

.map(season_map) applies the month→season dictionary to every row in one pass. The second .map(season_order) converts season names to integers. The mean price table validates the encoding: winter £233k → spring £383k → summer £427k → autumn £253k. The pattern is not monotonic; autumn falls back to roughly winter levels, which is why the ordinal correlation is so weak (+0.14). For a linear model, one-hot encoding the four seasons is safer than treating the integer encoding as a true ordinal.
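If the team takes the one-hot route, pd.get_dummies does it in one line. A minimal sketch with hypothetical rows, assuming the same four season labels as above:

```python
import pandas as pd

df = pd.DataFrame({'season': ['winter', 'spring', 'summer', 'autumn', 'summer']})

# One 0/1 column per season; no ordering is imposed, so a linear model
# can learn an independent effect for each season
dummies = pd.get_dummies(df['season'], prefix='season', dtype=int)
df = pd.concat([df, dummies], axis=1)

print(df.to_string(index=False))
```

The dummy columns come out alphabetically (season_autumn through season_winter), and exactly one of them is 1 on every row.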

Time-Since Features — Measuring Distance from an Event

One of the most powerful datetime engineering techniques is calculating time since a meaningful reference event. How long has this property been on the market? How many days since the last renovation? How many months since the market peak? These "time since" features capture recency and elapsed time in a way that raw timestamps never can.

The scenario: The dataset now includes a last_renovated date. The modelling lead says: "Renovation recency probably matters more than the renovation year itself — a house renovated 2 years ago is different from one done 15 years ago. Create a years_since_renovation feature and validate it."

import pandas as pd

housing_df = pd.DataFrame({
    'sale_date':       ['2023-06-15','2023-03-28','2023-08-01','2022-11-22',
                        '2023-05-19','2023-01-12','2022-07-14','2023-04-03',
                        '2022-09-08','2022-12-05'],
    'last_renovated':  ['2021-03-10','2010-07-22','2022-11-05','2005-04-18',
                        '2019-08-30','2015-01-14','2020-06-09','2018-09-25',
                        '2008-03-17','2022-10-01'],
    'sale_price':      [430000,360000,450000,210000,
                        410000,245000,400000,380000,295000,220000]
})

# Convert both columns to datetime before any arithmetic
housing_df['sale_date']      = pd.to_datetime(housing_df['sale_date'])
housing_df['last_renovated'] = pd.to_datetime(housing_df['last_renovated'])

# Subtract two datetime columns — result is a Timedelta (duration object)
# .dt.days converts the Timedelta to a plain integer number of days
# Dividing by 365.25 gives approximate years (accounts for leap years)
housing_df['days_since_reno'] = (
    (housing_df['sale_date'] - housing_df['last_renovated']).dt.days)

housing_df['years_since_reno'] = (housing_df['days_since_reno'] / 365.25).round(1)

# Validate
corr = housing_df['years_since_reno'].corr(housing_df['sale_price'])
print(f"Correlation of years_since_reno with sale_price: {corr:+.4f}\n")

print("Time-since feature sample:\n")
print(housing_df[['sale_date','last_renovated',
                  'years_since_reno','sale_price']].to_string(index=False))
Correlation of years_since_reno with sale_price: -0.5185

Time-since feature sample:

  sale_date  last_renovated  years_since_reno  sale_price
 2023-06-15      2021-03-10               2.3      430000
 2023-03-28      2010-07-22              12.7      360000
 2023-08-01      2022-11-05               0.7      450000
 2022-11-22      2005-04-18              17.6      210000
 2023-05-19      2019-08-30               3.7      410000
 2023-01-12      2015-01-14               8.0      245000
 2022-07-14      2020-06-09               2.1      400000
 2023-04-03      2018-09-25               4.5      380000
 2022-09-08      2008-03-17              14.5      295000
 2022-12-05      2022-10-01               0.2      220000

What just happened?

Subtracting two datetime64 columns produces a Timedelta object, a duration. .dt.days converts it to a plain integer count of days, and dividing by 365.25 gives approximate years, accounting for leap years. A correlation of −0.52 supports the hypothesis: more years since renovation means a lower sale price. Always check the subtraction order. If the dates are reversed on any row, you get negative elapsed time, which will confuse the model.
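That subtraction-order warning is cheap to automate. A minimal sketch with a deliberately corrupted second row (hypothetical data, not the lesson's dataset):

```python
import pandas as pd

df = pd.DataFrame({
    'sale_date':      pd.to_datetime(['2023-06-15', '2022-11-22']),
    'last_renovated': pd.to_datetime(['2021-03-10', '2023-01-05']),  # 2nd row: renovation AFTER the sale
})

days = (df['sale_date'] - df['last_renovated']).dt.days

# Surface impossible rows instead of silently feeding negative durations to the model
n_bad = (days < 0).sum()
print(f"{n_bad} row(s) with negative elapsed time")

# One common policy: treat impossible values as missing so an imputer can deal with them
df['days_since_reno'] = days.where(days >= 0)
```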

Is Weekend / Is Peak Season — Binary Datetime Flags

Sometimes the most useful datetime feature is not a continuous number — it's a binary flag. Is this a weekend listing? Is it during peak selling season? Did the sale happen in Q4 when buyers rush before the year-end? These flags let the model treat specific time windows differently without trying to learn a continuous relationship across all months or days.

The scenario: The analytics team hypothesises that weekend listings and peak-season sales (April–August) sell for more. You create binary flags for both and validate them independently so the team can decide which to include.

import pandas as pd

housing_df = pd.DataFrame({
    'sale_date':  ['2023-01-12','2023-04-03','2022-11-22','2023-06-15',
                   '2022-09-08','2023-03-28','2022-07-14','2023-08-01',
                   '2022-12-05','2023-05-19'],
    'sale_price': [245000,380000,210000,430000,
                   295000,360000,400000,450000,220000,410000]
})

housing_df['sale_date'] = pd.to_datetime(housing_df['sale_date'])

# .dt.dayofweek: 0=Monday to 6=Sunday — weekend is 5 or 6
# >= 5 produces True/False; .astype(int) converts to 1/0
housing_df['is_weekend'] = (housing_df['sale_date'].dt.dayofweek >= 5).astype(int)

# Peak season flag — April (4) through August (8) inclusive
housing_df['sale_month']    = housing_df['sale_date'].dt.month
housing_df['is_peak_season'] = housing_df['sale_month'].between(4, 8).astype(int)

# Validate both flags
print("Binary datetime flags — correlation with sale_price:\n")
for col in ['is_weekend', 'is_peak_season']:
    corr  = housing_df[col].corr(housing_df['sale_price'])
    dist  = housing_df[col].value_counts().to_dict()
    print(f"  {col:<18}  corr={corr:+.4f}   distribution: {dist}")

print("\nSample:\n")
print(housing_df[['sale_date','is_weekend',
                  'is_peak_season','sale_price']].to_string(index=False))
Binary datetime flags — correlation with sale_price:

  is_weekend          corr=+nan   distribution: {0: 10}
  is_peak_season      corr=+0.8658   distribution: {1: 5, 0: 5}

Sample:

  sale_date  is_weekend  is_peak_season  sale_price
 2023-01-12           0               0      245000
 2023-04-03           0               1      380000
 2022-11-22           0               0      210000
 2023-06-15           0               1      430000
 2022-09-08           0               0      295000
 2023-03-28           0               1      360000
 2022-07-14           0               1      400000
 2023-08-01           0               1      450000
 2022-12-05           0               0      220000
 2023-05-19           0               1      410000

What just happened?

.dt.dayofweek >= 5 returns True for Saturday (5) and Sunday (6), and .astype(int) converts True/False to 1/0. .between(4, 8) is inclusive on both ends, a clean, readable way to express "April through August". is_peak_season at +0.87 validates strongly and earns its place in the model. is_weekend fails for a different reason: every sale in this sample fell on a weekday, so the flag is constant, its correlation is undefined (nan), and it carries no signal. The data does not support the weekend hypothesis for this property market.
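A flag's balance can be checked mechanically before you even look at its correlation. A minimal sketch, where the 95% dominance cutoff is an arbitrary choice rather than a fixed rule:

```python
import pandas as pd

flags = pd.DataFrame({
    'is_weekend':     [0] * 10,                       # constant: carries no information
    'is_peak_season': [0, 1, 0, 1, 0, 0, 1, 1, 0, 1],
})

# If one value dominates a flag (here: 95% or more of rows), the flag is
# effectively constant and its correlation with the target is meaningless
for col in flags.columns:
    top_share = flags[col].value_counts(normalize=True).max()
    verdict = "drop (near-constant)" if top_share >= 0.95 else "keep"
    print(f"{col:<16} dominant value share {top_share:.0%} -> {verdict}")
```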

The Datetime Feature Engineering Checklist

Feature type     | How to create              | Best for                             | Watch out for
Month / Quarter  | .dt.month / .dt.quarter    | Seasonality patterns                 | Dec/Jan adjacency problem for linear models
Year             | .dt.year                   | Multi-year market trends             | Weak if dataset spans <2 years
Day of week      | .dt.dayofweek              | E-commerce, bookings, events         | Rarely meaningful for property
Season flag      | month.map(season_dict)     | Seasonal markets (property, retail)  | Validate monotonicity if ordinal-encoded
Time since event | (date_a - date_b).dt.days  | Recency, age, elapsed time           | Negative values if order is wrong
Binary flags     | .between() / >= 5          | Peak windows, weekends, holidays     | Check balance — 95%+ one value = weak signal

Teacher's Note

Treating month as a linear feature in a linear model causes the cyclicality problem: December (12) and January (1) are numerically 11 apart but adjacent in the calendar. The safe fixes at this level are season groupings or one-hot encoding of the season; cyclical sine/cosine encoding is covered in Lesson 31. Also, always check subtraction order in time-since features: if last_renovated is more recent than sale_date, you get negative elapsed time that will confuse any model.
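As a preview of that Lesson 31 technique, the sine/cosine trick maps month numbers onto a circle so that 12 and 1 become neighbours. A minimal sketch:

```python
import numpy as np
import pandas as pd

months = pd.Series([1, 6, 12])

# Project month 1..12 onto the unit circle: one full revolution per year
month_sin = np.sin(2 * np.pi * months / 12)
month_cos = np.cos(2 * np.pi * months / 12)

# In (sin, cos) space December lands at (0, 1) and January at (0.5, 0.866),
# so adjacent calendar months are always the same small distance apart
for m, s, c in zip(months, month_sin, month_cos):
    print(f"month {m:>2}: sin={s:+.3f}  cos={c:+.3f}")
```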

Practice Questions

1. The pandas accessor that unlocks datetime properties like .month, .dayofweek, and .quarter on a Series is called ___.



2. When you subtract two datetime columns in pandas, you get a Timedelta object. To convert it to a plain integer number of days, you use ___.



3. What is the cyclicality problem with using raw month numbers as a feature in a linear model?



Quiz

1. Why must you call pd.to_datetime() before extracting month or year from a date column stored as a string?


2. Why is years_since_reno a more useful feature than the raw last_renovated year?


3. From the first code block in this lesson, which extracted datetime features were flagged as worth keeping based on the correlation threshold of 0.30?


Up Next · Lesson 7

Text Features Basics

Go beyond keyword flags — learn to tokenise, clean, and vectorise free-form text using TF-IDF, and build features from unstructured descriptions that consistently outperform manually crafted signals.