Time Series Lesson 32 – XGBoost | Dataplexa

XGBoost for Time Series Forecasting

XGBoost is a gradient-boosted decision-tree algorithm and one of the most widely used machine-learning methods in real-world forecasting systems. It appears throughout finance, supply-chain planning, energy forecasting, and demand prediction.

Unlike classical time-series models, XGBoost does not assume an explicit trend or seasonal structure. Instead, it learns these patterns directly from engineered features.


Real-World Scenario

Imagine you work for an e-commerce company. Your job is to forecast daily order volume so the warehouse can plan staff and inventory.

Order volume depends on:

  • Recent past demand
  • Weekly patterns
  • Promotions and campaigns
  • Non-linear relationships

This is exactly where XGBoost shines.


Step 1: Create a Sample Time Series

We’ll simulate daily order volume with trend, seasonality, and noise.

Python: Generate Time Series Data
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

np.random.seed(42)

days = 300
time = np.arange(days)

trend = time * 0.3
seasonality = 15 * np.sin(2 * np.pi * time / 7)
noise = np.random.normal(0, 5, days)

orders = 50 + trend + seasonality + noise

df = pd.DataFrame({
    "orders": orders
})

df.head()

What you should observe:

  • Overall upward trend (business growth)
  • Weekly demand cycle
  • Random fluctuations from day-to-day behavior
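To check these components visually, we can plot the series. A minimal sketch (it regenerates the Step 1 data so it runs on its own, and uses a non-interactive backend so it works headless — adjust to your environment):

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove if running interactively
import matplotlib.pyplot as plt

# Regenerate the Step 1 data so this snippet is self-contained
np.random.seed(42)
days = 300
time = np.arange(days)
orders = 50 + 0.3 * time + 15 * np.sin(2 * np.pi * time / 7) + np.random.normal(0, 5, days)
df = pd.DataFrame({"orders": orders})

# Trend, weekly cycle, and noise should all be visible in one line plot
df["orders"].plot(figsize=(10, 4), title="Simulated daily order volume")
plt.xlabel("Day")
plt.ylabel("Orders")
plt.tight_layout()
plt.savefig("orders.png")
```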

Step 2: Feature Engineering for XGBoost

XGBoost cannot understand time directly. We must convert the time series into a supervised learning problem.

We’ll create:

  • Lag features (previous days)
  • Rolling averages

Python: Feature Engineering
df["lag_1"] = df["orders"].shift(1)
df["lag_2"] = df["orders"].shift(2)
df["lag_7"] = df["orders"].shift(7)

df["rolling_mean_7"] = df["orders"].shift(1).rolling(7).mean()  # shift(1) so the window uses only past days, not today's target

df = df.dropna()
df.head()

Why these features matter:

  • Lag features capture recent demand memory
  • Weekly lag captures repeating patterns
  • Rolling mean smooths short-term noise

Step 3: Train-Test Split (Time-Aware)

We split the data while preserving time order.

Python: Train-Test Split
X = df.drop("orders", axis=1)
y = df["orders"]

split = int(len(df) * 0.8)

X_train, X_test = X.iloc[:split], X.iloc[split:]
y_train, y_test = y.iloc[:split], y.iloc[split:]

Step 4: Train XGBoost Model

Now we train XGBoost to learn non-linear relationships.

Python: Train XGBoost
from xgboost import XGBRegressor

model = XGBRegressor(
    n_estimators=300,
    learning_rate=0.05,
    max_depth=5,
    subsample=0.8,
    colsample_bytree=0.8,
    random_state=42
)

model.fit(X_train, y_train)

predictions = model.predict(X_test)

Step 5: Visualize Predictions vs Actual

This is one of the most important steps: a forecasting model should be evaluated visually as well as numerically.

What to observe carefully:

  • Predictions follow the overall trend
  • Weekly seasonality is learned well
  • Minor deviations are expected due to noise

This confirms XGBoost successfully captured complex patterns.


Error Analysis

We now look at the prediction error to understand model stability.

Key insights:

  • Errors are centered around zero
  • No long-term bias
  • Occasional spikes during high volatility days

Why XGBoost Works So Well for Time Series

  • Handles non-linear relationships
  • Robust to noise
  • Works well with engineered features
  • Scales to large datasets

This is why many production forecasting systems use XGBoost.


Practice Questions

Q1. Why do we create lag features for XGBoost?

Because XGBoost cannot understand time order directly. Lag features provide historical context.
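A quick illustration of what a lag feature looks like — `shift(1)` moves each value down one row, so every row sees the previous day's value:

```python
import pandas as pd

# A tiny series of daily orders
s = pd.Series([10, 12, 13, 15])

# lag_1: yesterday's value aligned with today's row
print(s.shift(1).tolist())  # [nan, 10.0, 12.0, 13.0]
```

The leading NaN is why the lesson calls `dropna()` after building the lag features.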

Q2. Can XGBoost replace ARIMA completely?

Not always. XGBoost is purely data-driven and depends on good feature engineering, while ARIMA offers interpretability, confidence intervals, and a well-understood statistical foundation. In practice the two are often complementary.

Next lesson: We will explore LightGBM forecasting and compare its behavior with XGBoost.