ML Lesson 21 – XGBoost | Dataplexa

XGBoost

In the previous lesson, we learned about Gradient Boosting and saw how a model can improve by learning from its past mistakes. That idea is powerful, but it comes with limitations such as slow training and sensitivity to noisy data.

In this lesson, we move one step further and study XGBoost, which stands for Extreme Gradient Boosting.

XGBoost is not just an algorithm. It is an engineering-optimized version of Gradient Boosting that is faster, more accurate, and more reliable for real-world machine learning systems.


Why XGBoost Exists

When Gradient Boosting started being used on large datasets, data scientists faced practical problems. Training was slow, memory usage was high, and models were difficult to tune properly.

XGBoost was created to solve these real engineering issues. It keeps the core idea of boosting, but improves how trees are built, optimized, and evaluated.

This is why XGBoost dominates machine learning competitions and is widely used in industry.


How XGBoost Learns

XGBoost still builds trees sequentially, just like Gradient Boosting. However, it is smarter about how errors are corrected.

Instead of only looking at the direction of each prediction error (the gradient of the loss), XGBoost also looks at how quickly that error changes around the current prediction (the second derivative, or curvature).

It combines this second-order information with a regularization term that penalizes overly complex trees, and uses both to decide how much each new tree should contribute to the final model.

This makes learning more controlled and more stable.
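To make the idea concrete, here is a tiny illustrative sketch, not XGBoost's actual implementation (which is optimized C++), of how a single leaf's output can be computed from the first derivatives (gradients) and second derivatives (Hessians) of the loss, with an L2 penalty lambda shrinking the result. The numbers are made up for illustration.

import numpy as np

# Current predictions and true targets for the samples that fall into one leaf
preds = np.array([0.6, 0.8, 0.3])
targets = np.array([1.0, 1.0, 0.0])

# For squared-error loss: gradient = prediction - target, hessian = 1
gradients = preds - targets
hessians = np.ones_like(preds)

lam = 1.0  # L2 regularization strength (lambda)

# Optimal leaf weight: -(sum of gradients) / (sum of hessians + lambda)
leaf_weight = -gradients.sum() / (hessians.sum() + lam)
print(leaf_weight)

Larger values of lambda pull the leaf weight toward zero, which is one way XGBoost keeps individual trees from overreacting to noise.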


Using Our Dataset

We continue using the same dataset that we introduced earlier and have been using consistently.

Dataplexa ML Housing & Customer Dataset

Our goal remains the same: predict whether a loan is approved or not.


Preparing the Data

The data preparation steps remain unchanged, which shows how reusable a good ML pipeline can be.

import pandas as pd
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# Load the dataset (XGBoost expects numeric features and a 0/1 target)
df = pd.read_csv("dataplexa_ml_housing_customer_dataset.csv")

# Separate the features from the target column
X = df.drop("loan_approved", axis=1)
y = df["loan_approved"]

# Hold out 20% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

Training an XGBoost Model

Now we train an XGBoost classifier. Notice how little code is needed despite the complexity behind the scenes.

model = XGBClassifier(
    n_estimators=100,   # number of boosting rounds (trees)
    learning_rate=0.1,  # contribution of each tree (shrinkage)
    max_depth=4,        # maximum depth of each individual tree
    random_state=42     # reproducible results
)

model.fit(X_train, y_train)

Behind this simple interface, XGBoost applies regularization, efficient tree pruning, and optimized computation.
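If you want to see some of those controls explicitly, the sketch below sets a few of the regularization and sampling parameters that XGBClassifier exposes. The values are illustrative, not tuned for this dataset.

# Illustrative settings; tune them with cross-validation on your own data
regularized_model = XGBClassifier(
    n_estimators=200,
    learning_rate=0.05,
    max_depth=4,
    reg_lambda=1.0,        # L2 penalty on leaf weights
    reg_alpha=0.0,         # L1 penalty on leaf weights
    gamma=0.1,             # minimum loss reduction required to make a split
    subsample=0.8,         # fraction of rows sampled for each tree
    colsample_bytree=0.8,  # fraction of columns sampled for each tree
    random_state=42
)
regularized_model.fit(X_train, y_train)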


Making Predictions

Predictions are generated by combining all trees into one final decision.

y_pred = model.predict(X_test)
y_pred[:10]
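Besides hard yes/no predictions, the classifier can also return probabilities, which is often what a loan approval system actually needs. This uses the model trained above.

# Probability of approval (class 1) for each test customer
y_proba = model.predict_proba(X_test)[:, 1]
y_proba[:10]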

Evaluating Performance

Now we evaluate how well the model performs on unseen data.

from sklearn.metrics import accuracy_score, classification_report

print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

On many structured (tabular) datasets, a well-tuned XGBoost model matches or outperforms a single Decision Tree, Random Forest, and plain Gradient Boosting, although the margin depends on the data and on how carefully each model is tuned.
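Rather than taking that on faith, you can compare the models on your own data. A minimal sketch, assuming the features are already numeric as in our pipeline:

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Mean cross-validated accuracy of plain Gradient Boosting vs. XGBoost
gb_scores = cross_val_score(GradientBoostingClassifier(random_state=42), X, y, cv=5)
xgb_scores = cross_val_score(XGBClassifier(random_state=42), X, y, cv=5)

print("Gradient Boosting:", gb_scores.mean())
print("XGBoost:", xgb_scores.mean())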


Real-World Use Case

Banks use XGBoost for credit scoring systems. Each tree captures a different risk pattern, and the final model balances all of them.

This helps banks approve good customers while minimizing financial risk.
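In practice, risk teams also want to know which features drive those decisions. One quick way to inspect that with our trained model (the column names depend on your feature set):

# Importance of each feature according to the trained model
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(10))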


Mini Practice

Imagine you are building a loan approval system. Some customers are clearly safe, some are clearly risky, and some fall in between.

XGBoost concentrates much of its effort on those borderline cases, which is a big part of why it works so well in finance.
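To see which customers sit in that middle zone, you can look at the predicted probabilities and flag the ones near the decision boundary. The 0.4–0.6 band below is an arbitrary illustrative threshold.

proba = model.predict_proba(X_test)[:, 1]

# Customers whose approval probability is close to 0.5 are the borderline cases
borderline = X_test[(proba > 0.4) & (proba < 0.6)]
print(len(borderline), "borderline applications out of", len(X_test))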


Exercises

Exercise 1:
Why is XGBoost faster than Gradient Boosting?

Because it parallelizes split finding across features, prunes trees efficiently, and uses cache-aware, optimized data structures for computation.

Exercise 2:
Does XGBoost reduce overfitting?

Yes. It adds regularization terms (L1 and L2 penalties on leaf weights, plus a penalty for extra splits) that control model complexity.

Quick Quiz

Q1. Is XGBoost suitable for tabular data?

Yes. XGBoost is especially strong for structured tabular datasets.

In the next lesson, we will explore Support Vector Machines and understand how they draw optimal decision boundaries.