Gradient Boosting
In the previous lesson, we learned about Random Forest, where many trees are built independently and vote together.
In this lesson, we learn a more focused idea called Gradient Boosting.
Instead of building trees independently, Gradient Boosting builds them one after another, and each new tree learns from the mistakes of the previous ones.
What Is Gradient Boosting?
Gradient Boosting is an ensemble learning technique that combines many weak models to create a strong model.
The key idea is simple:
Each new model focuses on correcting the errors made by the previous model.
This gradual improvement is why it is called boosting.
Real-World Intuition
Think of a student preparing for an exam.
After the first test, the teacher checks the mistakes and focuses more on weak topics.
After the second test, the student improves again, but still has a few weak areas.
Each round focuses only on mistakes from the previous round.
That is exactly how Gradient Boosting works.
How Gradient Boosting Learns
The learning process happens step by step:
1. Build a simple model and make predictions
2. Calculate the errors (residuals)
3. Train a new model to predict those errors
4. Add the new model to the existing model
5. Repeat this process many times
Each new tree is small and simple, but together they form a very powerful model.
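To make these steps concrete, here is a minimal hand-rolled sketch of the loop for a regression problem, using small decision trees as weak learners and synthetic data that is purely illustrative. It is not exactly what GradientBoostingClassifier does internally, but it captures the residual-fitting idea.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D regression data (illustrative only)
rng = np.random.RandomState(42)
X_demo = rng.uniform(0, 10, size=(200, 1))
y_demo = np.sin(X_demo).ravel() + rng.normal(scale=0.2, size=200)

learning_rate = 0.1

# Step 1: start from a simple model -- here, just the mean of the target
prediction = np.full_like(y_demo, y_demo.mean())

for _ in range(100):
    # Step 2: calculate the errors (residuals) of the current model
    residuals = y_demo - prediction
    # Step 3: train a small tree to predict those errors
    tree = DecisionTreeRegressor(max_depth=2)
    tree.fit(X_demo, residuals)
    # Step 4: add the new tree's scaled output to the existing model
    prediction += learning_rate * tree.predict(X_demo)
    # Step 5: repeat

print("Training MSE after boosting:", round(np.mean((y_demo - prediction) ** 2), 4))
For squared error, the residual is exactly the negative gradient of the loss; in general, each tree is fitted to the gradient of a loss function, which is where the name Gradient Boosting comes from.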
Using Our Dataset
We continue using the same dataset introduced in Lesson 4 and used throughout the module.
Dataplexa ML Housing & Customer Dataset
Target variable:
loan_approved
Preparing the Data
The preprocessing steps remain unchanged.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
# Load the dataset used throughout this module
df = pd.read_csv("dataplexa_ml_housing_customer_dataset.csv")

# Separate the features from the target column
X = df.drop("loan_approved", axis=1)
y = df["loan_approved"]

# Hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
Training a Gradient Boosting Model
Now we train a Gradient Boosting classifier.
model = GradientBoostingClassifier(
n_estimators=100,
learning_rate=0.1,
random_state=42
)
model.fit(X_train, y_train)
Here:
n_estimators controls how many trees are built
learning_rate controls how much each tree contributes
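As a rough sketch of how these two parameters interact, you can use the classifier's staged_predict method, which yields predictions after each added tree, and compare a small and a larger learning rate. The exact numbers depend on the dataset; the typical pattern is that a smaller learning rate needs more trees to reach a similar accuracy.
from sklearn.metrics import accuracy_score

for lr in (0.05, 0.3):
    gb = GradientBoostingClassifier(
        n_estimators=200, learning_rate=lr, random_state=42
    )
    gb.fit(X_train, y_train)
    # Test accuracy after 1, 2, ..., 200 trees
    stage_scores = [
        accuracy_score(y_test, stage_pred)
        for stage_pred in gb.staged_predict(X_test)
    ]
    best_round = stage_scores.index(max(stage_scores)) + 1
    print(f"learning_rate={lr}: best accuracy {max(stage_scores):.3f} "
          f"after {best_round} trees")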
Making Predictions
Predictions are made using the combined output of all trees.
y_pred = model.predict(X_test)
y_pred[:10]
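If you also want the model's confidence rather than only the predicted class, predict_proba returns the estimated probability of each class. Assuming the target is encoded as 0/1, the second column is the probability that a loan is approved.
# Estimated probability of class 1 (loan approved) for the first 10 test rows
proba = model.predict_proba(X_test)[:, 1]
print(proba[:10].round(3))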
Evaluating the Model
Let us check how well Gradient Boosting performs.
from sklearn.metrics import accuracy_score, classification_report
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
print(classification_report(y_test, y_pred))
On many real-world tabular datasets, a well-tuned Gradient Boosting model matches or outperforms Random Forest.
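A quick way to check this on our own dataset, rather than take it on faith, is a cross-validated comparison of the two models. This is only a sketch, and the scores you see will depend on the data.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

for name, clf in [
    ("Random Forest", RandomForestClassifier(n_estimators=100, random_state=42)),
    ("Gradient Boosting", GradientBoostingClassifier(n_estimators=100, random_state=42)),
]:
    scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
    print(f"{name}: mean accuracy {scores.mean():.3f}")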
Why Gradient Boosting Is Powerful
It focuses on hard-to-predict samples
It mainly reduces bias, and a small learning rate helps keep variance under control
It learns complex patterns
It works well with structured/tabular data
Limitations of Gradient Boosting
Training can be slow, because trees are built one after another
Sensitive to noisy data and outliers
Requires careful tuning of parameters (see the tuning sketch below)
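Because tuning matters, a small grid search is a common starting point. The grid below is only an illustrative sketch; in practice you would widen or narrow it based on the first results.
from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators": [100, 200],
    "learning_rate": [0.05, 0.1],
    "max_depth": [2, 3],
}
search = GridSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X_train, y_train)
print("Best parameters:", search.best_params_)
print("Best CV accuracy:", round(search.best_score_, 3))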
Mini Practice
Imagine a loan approval system.
Early models may misclassify borderline customers.
Gradient Boosting focuses more on those borderline cases in later stages.
This makes it extremely useful in finance and risk modeling.
Exercises
Exercise 1:
How is Gradient Boosting different from Random Forest?
Exercise 2:
What does the learning rate control?
Exercise 3:
Why can Gradient Boosting overfit?
Quick Quiz
Q1. Is Gradient Boosting sensitive to noisy data?
Q2. What happens if learning rate is too high?
In the next lesson, we move to XGBoost, a highly optimized and industry-grade boosting algorithm.