ML Lesson 20 – Gradient Boosting

Gradient Boosting

In the previous lesson, we learned about Random Forest, where many trees are built independently and vote together.

In this lesson, we learn a smarter and more focused idea called Gradient Boosting.

Instead of building trees independently, Gradient Boosting builds trees one after another, and each new tree learns from the mistakes of the previous ones.


What Is Gradient Boosting?

Gradient Boosting is an ensemble learning technique that combines many weak models to create a strong model.

The key idea is simple:

Each new model focuses on correcting the errors made by the previous model.

This gradual improvement is why it is called boosting; the "gradient" part of the name comes from the fact that each correction step follows the gradient of a loss function.
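For example (illustrative numbers only), if the current model predicts a 0.70 chance of approval for a customer whose true label is 1, the remaining error is 0.30, and the next tree is trained to predict that 0.30.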


Real-World Intuition

Think of a student preparing for an exam.

After the first test, the teacher checks the mistakes, and the student focuses more on the weak topics.

After the second test, the student improves again, but still has a few weak areas.

Each round focuses only on mistakes from the previous round.

That is exactly how Gradient Boosting works.


How Gradient Boosting Learns

The learning process happens step by step:

1. Build a simple model and make predictions
2. Calculate the errors (residuals)
3. Train a new model to predict those errors
4. Add the new model to the existing model
5. Repeat this process many times

Each new tree is small and simple, but together they form a very powerful model.
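The loop above can be sketched by hand with shallow regression trees that fit residuals. The following is a simplified illustration on synthetic data, not the exact algorithm scikit-learn uses internally (which, for classification, works on gradients of the log loss):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy data for illustration only (synthetic, not the Dataplexa dataset)
rng = np.random.RandomState(42)
X_toy = rng.uniform(-3, 3, size=(200, 1))
y_toy = np.sin(X_toy).ravel() + rng.normal(0, 0.1, size=200)

# Step 1: start with a simple constant prediction (the mean)
prediction = np.full_like(y_toy, y_toy.mean())

learning_rate = 0.1
trees = []

for _ in range(100):
    residuals = y_toy - prediction                     # Step 2: errors of the current model
    tree = DecisionTreeRegressor(max_depth=2)          # Step 3: small tree learns those errors
    tree.fit(X_toy, residuals)
    prediction += learning_rate * tree.predict(X_toy)  # Step 4: add its scaled output
    trees.append(tree)                                 # Step 5: repeat many times

print("Training MSE after 100 rounds:", np.mean((y_toy - prediction) ** 2))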


Using Our Dataset

We continue using the same dataset introduced in Lesson 4 and used throughout the module.

Dataplexa ML Housing & Customer Dataset

Target variable:

loan_approved


Preparing the Data

The preprocessing steps remain unchanged.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier

df = pd.read_csv("dataplexa_ml_housing_customer_dataset.csv")

X = df.drop("loan_approved", axis=1)
y = df["loan_approved"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
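One note: scikit-learn's tree ensembles expect numeric inputs. If your copy of the dataset contains categorical columns (a hypothetical example would be a city or employment_type column), a quick encoding step before the split is one option:

# Hypothetical step: one-hot encode any categorical columns before splitting
# (skip this if the dataset is already fully numeric)
X = pd.get_dummies(X, drop_first=True)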

Training a Gradient Boosting Model

Now we train a Gradient Boosting classifier.

model = GradientBoostingClassifier(
    n_estimators=100,
    learning_rate=0.1,
    random_state=42
)

model.fit(X_train, y_train)

Here:

- n_estimators controls how many trees are built
- learning_rate controls how much each tree contributes
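A common recipe is to lower the learning rate and raise the number of trees, so each tree contributes less but the ensemble improves more smoothly. A quick sketch of that trade-off (the values below are illustrative, not tuned for this dataset):

# A slower-learning but longer-boosted variant for comparison
slow_model = GradientBoostingClassifier(
    n_estimators=300,     # more trees...
    learning_rate=0.05,   # ...each contributing less
    max_depth=3,          # shallow trees keep each learner weak
    random_state=42
)
slow_model.fit(X_train, y_train)
print("Lower-learning-rate model accuracy:", slow_model.score(X_test, y_test))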


Making Predictions

Predictions are made using the combined output of all trees.

y_pred = model.predict(X_test)
y_pred[:10]
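Besides hard class labels, the model can also return probabilities, which is often more useful for ranking loan applications. A small sketch, assuming loan_approved uses 0/1 labels:

# Probability of the positive class (approval) for each test sample
y_proba = model.predict_proba(X_test)[:, 1]
y_proba[:10]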

Evaluating the Model

Let us check how well Gradient Boosting performs.

from sklearn.metrics import accuracy_score, classification_report

accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

print(classification_report(y_test, y_pred))

On many real-world tabular datasets, a well-tuned Gradient Boosting model performs better than Random Forest.
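Whether that holds here is easy to check by training both models on the same split. A hedged sketch (the Random Forest settings are illustrative):

from sklearn.ensemble import RandomForestClassifier

# Train a Random Forest on the same split for a side-by-side comparison
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

print("Random Forest accuracy:    ", accuracy_score(y_test, rf_model.predict(X_test)))
print("Gradient Boosting accuracy:", accuracy)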


Why Gradient Boosting Is Powerful

- It focuses on hard-to-predict samples
- It reduces bias by repeatedly correcting errors (and a small learning rate helps keep variance in check)
- It learns complex patterns
- It works well with structured/tabular data


Limitations of Gradient Boosting

- Training can be slow, because trees are built one after another
- It is sensitive to noisy data
- It requires careful tuning of parameters
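Careful tuning usually means searching over the learning rate, tree depth, and number of trees together. A minimal grid-search sketch (the grid values are illustrative, and a full search can take a while):

from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators": [100, 200],
    "learning_rate": [0.05, 0.1],
    "max_depth": [2, 3],
}

# 3-fold cross-validated search over a small illustrative grid
search = GridSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_grid,
    cv=3,
    scoring="accuracy"
)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Best cross-validated accuracy:", search.best_score_)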


Mini Practice

Imagine a loan approval system.

Early models may misclassify borderline customers.

Gradient Boosting focuses more on those borderline cases in later stages.

This makes it extremely useful in finance and risk modeling.
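You can watch this stage-by-stage behaviour with staged_predict, which yields the ensemble's predictions after each boosting round:

# Test accuracy after each boosting stage (tree 1, trees 1-2, trees 1-3, ...)
staged_accuracy = [
    accuracy_score(y_test, stage_pred)
    for stage_pred in model.staged_predict(X_test)
]

print("Accuracy after 10 trees :", staged_accuracy[9])
print("Accuracy after 100 trees:", staged_accuracy[-1])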


Exercises

Exercise 1:
How is Gradient Boosting different from Random Forest?

Random Forest builds trees independently, while Gradient Boosting builds trees sequentially to correct mistakes.

Exercise 2:
What does the learning rate control?

It controls how much each tree contributes to the final model.

Exercise 3:
Why can Gradient Boosting overfit?

Because it keeps focusing on errors, including noise, if not tuned properly.

Quick Quiz

Q1. Is Gradient Boosting sensitive to noisy data?

Yes. Noisy data can mislead the boosting process.

Q2. What happens if the learning rate is too high?

Each tree over-corrects, so the model may overfit and fail to generalize well.

In the next lesson, we move to XGBoost, a highly optimized and industry-grade boosting algorithm.