Time Series Lesson 32 – XGBoost | Dataplexa

XGBoost for Time Series Forecasting

XGBoost is a gradient-boosted decision-tree algorithm and one of the most widely used machine-learning methods in real-world forecasting systems. It appears throughout finance, supply-chain planning, energy forecasting, and demand prediction.

Unlike classical time-series models, XGBoost does not assume an explicit trend or seasonal structure. Instead, it learns these patterns directly from engineered features.


Real-World Scenario

Imagine you work for an e-commerce company. Your job is to forecast daily order volume so the warehouse can plan staff and inventory.

Order volume depends on:

  • Recent past demand
  • Weekly patterns
  • Promotions and campaigns
  • Non-linear relationships

This is exactly where XGBoost shines.


Step 1: Create a Sample Time Series

We’ll simulate daily order volume with trend, seasonality, and noise.

Python: Generate Time Series Data
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

np.random.seed(42)

days = 300
time = np.arange(days)

trend = time * 0.3
seasonality = 15 * np.sin(2 * np.pi * time / 7)
noise = np.random.normal(0, 5, days)

orders = 50 + trend + seasonality + noise

df = pd.DataFrame({
    "orders": orders
})

df.head()

What you should observe:

  • Overall upward trend (business growth)
  • Weekly demand cycle
  • Random fluctuations from day-to-day behavior
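To check these components visually, we can plot the series. A minimal sketch (it regenerates the Step 1 data so it runs on its own, and uses a non-interactive backend so it works headless — adjust to your environment):

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove if running interactively
import matplotlib.pyplot as plt

# Regenerate the Step 1 data so this snippet is self-contained
np.random.seed(42)
days = 300
time = np.arange(days)
orders = 50 + 0.3 * time + 15 * np.sin(2 * np.pi * time / 7) + np.random.normal(0, 5, days)
df = pd.DataFrame({"orders": orders})

# Trend, weekly cycle, and noise should all be visible in one line plot
df["orders"].plot(figsize=(10, 4), title="Simulated daily order volume")
plt.xlabel("Day")
plt.ylabel("Orders")
plt.tight_layout()
plt.savefig("orders.png")
```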

Step 2: Feature Engineering for XGBoost

XGBoost cannot understand time directly. We must convert the time series into a supervised learning problem.

We’ll create:

  • Lag features (previous days)
  • Rolling averages

Python: Feature Engineering
df["lag_1"] = df["orders"].shift(1)
df["lag_2"] = df["orders"].shift(2)
df["lag_7"] = df["orders"].shift(7)

df["rolling_mean_7"] = df["orders"].shift(1).rolling(7).mean()  # shift(1) so the window uses only past days, not today's target

df = df.dropna()
df.head()

Why these features matter:

  • Lag features capture recent demand memory
  • Weekly lag captures repeating patterns
  • Rolling mean smooths short-term noise

Step 3: Train-Test Split (Time-Aware)

We split the data while preserving time order.

Python: Train-Test Split
X = df.drop("orders", axis=1)
y = df["orders"]

split = int(len(df) * 0.8)

X_train, X_test = X.iloc[:split], X.iloc[split:]
y_train, y_test = y.iloc[:split], y.iloc[split:]

Step 4: Train XGBoost Model

Now we train XGBoost to learn non-linear relationships.

Python: Train XGBoost
from xgboost import XGBRegressor

model = XGBRegressor(
    n_estimators=300,
    learning_rate=0.05,
    max_depth=5,
    subsample=0.8,
    colsample_bytree=0.8,
    random_state=42
)

model.fit(X_train, y_train)

predictions = model.predict(X_test)

Step 5: Visualize Predictions vs Actual

This is one of the most important steps: a forecasting model should be evaluated visually as well as numerically.

What to observe carefully:

  • Predictions follow the overall trend
  • Weekly seasonality is learned well
  • Minor deviations are expected due to noise

This confirms XGBoost successfully captured complex patterns.


Error Analysis

We now look at the prediction error to understand model stability.

Key insights:

  • Errors are centered around zero
  • No long-term bias
  • Occasional spikes during high volatility days

Why XGBoost Works So Well for Time Series

  • Handles non-linear relationships
  • Robust to noise
  • Works well with engineered features
  • Scales to large datasets

This is why many production forecasting systems use XGBoost.


Practice Questions

Q1. Why do we create lag features for XGBoost?

Because XGBoost cannot understand time order directly. Lag features provide historical context.
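A quick illustration of what a lag feature looks like — `shift(1)` moves each value down one row, so every row sees the previous day's value:

```python
import pandas as pd

# A tiny series of daily orders
s = pd.Series([10, 12, 13, 15])

# lag_1: yesterday's value aligned with today's row
print(s.shift(1).tolist())  # [nan, 10.0, 12.0, 13.0]
```

The leading NaN is why the lesson calls `dropna()` after building the lag features.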

Q2. Can XGBoost replace ARIMA completely?

Not always. XGBoost is purely data-driven and depends on good feature engineering, while ARIMA offers interpretability, confidence intervals, and a well-understood statistical foundation. In practice the two are often complementary.

Next lesson: We will explore LightGBM forecasting and compare its behavior with XGBoost.