XGBoost
In the previous lesson, we studied Gradient Boosting and saw how a model can improve by learning from its past mistakes. That idea is powerful, but it also comes with limitations such as slow training and sensitivity to noisy data.
In this lesson, we move one step further and study XGBoost, which stands for Extreme Gradient Boosting.
XGBoost is not just an algorithm. It is an engineering-optimized version of Gradient Boosting that is faster, typically more accurate, and more reliable for real-world machine learning systems.
Why XGBoost Exists
When Gradient Boosting started being used on large datasets, data scientists faced practical problems. Training was slow, memory usage was high, and models were difficult to tune properly.
XGBoost was created to solve these real engineering issues. It keeps the core idea of boosting, but improves how trees are built, optimized, and evaluated.
This is why XGBoost has become a staple of winning solutions in machine learning competitions and is widely used in industry.
How XGBoost Learns
XGBoost still builds trees sequentially, just like Gradient Boosting, but it is smarter about how each new tree corrects the errors of the previous ones.
Instead of looking only at the direction of each prediction error (the gradient of the loss), XGBoost also uses the curvature of the loss (the second derivative), which tells it how sensitive the loss is to each mistake; informally, how confident the model is about each prediction.
Combined with built-in regularization, this second-order information determines how much each tree, and each leaf within a tree, contributes to the final model.
This makes learning more controlled and more stable.
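To make this concrete, here is a minimal illustrative sketch, not XGBoost's actual implementation, of how the weight of a single leaf is chosen from the gradients and hessians of the log loss. The labels, probabilities, and the lam value below are made up for illustration.
import numpy as np
# Toy example: five customers that all fall into the same leaf
y = np.array([1, 0, 1, 1, 0])                # true labels (1 = approved)
p = np.array([0.9, 0.4, 0.6, 0.55, 0.5])     # current predicted probabilities
g = p - y            # gradient of the log loss: direction and size of each error
h = p * (1 - p)      # hessian of the log loss: largest for uncertain predictions
lam = 1.0            # L2 regularization strength (lambda)
# Optimal leaf weight from the regularized objective: -sum(g) / (sum(h) + lambda)
w = -g.sum() / (h.sum() + lam)
print("Leaf weight:", w)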
Using Our Dataset
We continue with the same dataset we introduced earlier and have been using throughout the course.
Dataplexa ML Housing & Customer Dataset
Our goal remains the same: predict whether a loan is approved or not.
Preparing the Data
The data preparation steps remain unchanged, which shows how reusable a good ML pipeline can be.
import pandas as pd
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
# Load the dataset and separate the features from the target column
df = pd.read_csv("dataplexa_ml_housing_customer_dataset.csv")
# XGBoost works on numeric data, so any text columns should be encoded beforehand
X = df.drop("loan_approved", axis=1)
y = df["loan_approved"]
# Hold out 20% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
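Before training, it can also help to glance at the split sizes and how balanced the target classes are. A quick check on the training labels:
# How many rows ended up in each split, and the share of each class
print("Training rows:", X_train.shape[0], "Test rows:", X_test.shape[0])
print(y_train.value_counts(normalize=True))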
Training an XGBoost Model
Now we train an XGBoost classifier. Notice how little code is needed despite the complexity behind the scenes.
model = XGBClassifier(
    n_estimators=100,     # number of boosted trees
    learning_rate=0.1,    # shrinkage applied to each tree's contribution
    max_depth=4,          # maximum depth of each individual tree
    random_state=42
)
model.fit(X_train, y_train)
Behind this simple interface, XGBoost applies regularization, efficient tree pruning, and optimized computation.
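Those controls are exposed as hyperparameters on the estimator. The sketch below shows the main regularization knobs with illustrative starting values, not tuned settings for our dataset:
# A second model with the regularization knobs written out explicitly
regularized_model = XGBClassifier(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=4,
    reg_lambda=1.0,         # L2 penalty on leaf weights
    reg_alpha=0.0,          # L1 penalty on leaf weights
    gamma=0.1,              # minimum loss reduction required to split a node
    subsample=0.8,          # fraction of rows sampled for each tree
    colsample_bytree=0.8,   # fraction of features sampled for each tree
    random_state=42
)
regularized_model.fit(X_train, y_train)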
Making Predictions
Predictions are generated by combining all trees into one final decision.
y_pred = model.predict(X_test)
y_pred[:10]
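If you need probability scores rather than hard labels, for example to rank applicants by approval likelihood, the classifier also provides predict_proba:
# Probability of the positive class (loan approved) for the first ten test rows
y_proba = model.predict_proba(X_test)[:, 1]
y_proba[:10]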
Evaluating Performance
Now we evaluate how well the model performs on unseen data.
from sklearn.metrics import accuracy_score, classification_report
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
On many real-world tabular datasets, a well-tuned XGBoost model matches or outperforms Decision Trees, Random Forests, and plain Gradient Boosting, although the margin depends on the data and on how carefully each model is tuned.
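Rather than taking that on faith, you can check it on our own split. The rough sketch below trains the scikit-learn models from earlier lessons on the same data; the exact numbers will depend on the dataset and on tuning:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
# Train each baseline on the same split and compare test accuracy
baselines = {
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
}
for name, clf in baselines.items():
    clf.fit(X_train, y_train)
    print(name, accuracy_score(y_test, clf.predict(X_test)))
print("XGBoost", accuracy_score(y_test, y_pred))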
Real-World Use Case
Banks use XGBoost for credit scoring systems. Each tree captures a different risk pattern, and the final model balances all of them.
This helps banks approve good customers while minimizing financial risk.
Mini Practice
Imagine you are building a loan approval system. Some customers are clearly safe, some are clearly risky, and some fall in between.
Each new tree in XGBoost concentrates on the customers the current model still gets wrong or is unsure about, which is exactly where those borderline cases sit. This focus is a big part of why it works so well in finance.
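One way to see those in-between customers is to look for predicted probabilities close to 0.5. A small sketch that reuses the model trained above (the 0.1 window is an arbitrary illustrative threshold):
import numpy as np
# Test customers whose approval probability sits near the decision boundary
proba = model.predict_proba(X_test)[:, 1]
borderline = np.where(np.abs(proba - 0.5) < 0.1)[0]
print("Borderline test customers:", len(borderline))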
Exercises
Exercise 1:
Why is XGBoost faster than Gradient Boosting?
Exercise 2:
Does XGBoost reduce overfitting?
Quick Quiz
Q1. Is XGBoost suitable for tabular data?
In the next lesson, we will explore Support Vector Machines and understand how they draw optimal decision boundaries.