ML Lesson 19 – Random Forest | Dataplexa

Random Forest

In the previous lesson, we learned how a single Decision Tree works. You saw that it is easy to understand but can easily overfit the data.

In this lesson, we fix that weakness using one of the most powerful and widely used algorithms in Machine Learning — Random Forest.


What Is Random Forest?

Random Forest is an ensemble learning algorithm.

Instead of building one decision tree, it builds many decision trees and combines their results.

Each tree gives its own prediction, and the forest chooses the final answer by:

- Majority voting (for classification)
- Average prediction (for regression)

This simple idea makes Random Forest very accurate and stable.
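The combining step can be sketched in a few lines of plain Python. The tree votes and outputs below are made-up toy values, not real model predictions:

```python
from collections import Counter

# Hypothetical class votes from five individual trees for one sample
tree_votes = ["approve", "reject", "approve", "approve", "reject"]

# Classification: the forest returns the majority vote
final_class = Counter(tree_votes).most_common(1)[0][0]
print(final_class)  # approve

# Regression: the forest returns the average of the tree outputs (toy values)
tree_outputs = [210.0, 195.5, 204.0, 199.5]
final_value = sum(tree_outputs) / len(tree_outputs)
print(final_value)  # 202.25
```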


Real-World Analogy

Imagine you want to buy a house.

You ask one person for advice — that person may be biased or wrong.

Now imagine asking 100 different experts and choosing the most common opinion.

That is exactly how Random Forest works.


Why Random Forest Is Better Than a Single Tree

Each tree is trained on a slightly different version of the data: a bootstrap sample drawn with replacement.

At each split, a tree considers only a random subset of the features.

Because of this randomness:

Errors made by one tree are corrected by others.

Overfitting is reduced significantly.
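The two sources of randomness above can be sketched directly with NumPy. The sample and feature counts are arbitrary toy values chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples, n_features = 100, 8  # toy sizes for illustration

# Bagging: each tree sees a bootstrap sample (rows drawn with replacement),
# so some rows repeat and others are left out entirely.
bootstrap_idx = rng.integers(0, n_samples, size=n_samples)

# Feature randomness: at each split, only a random subset of features is
# considered (sqrt(n_features) is the common default for classification).
n_sub = int(np.sqrt(n_features))
candidate_features = rng.choice(n_features, size=n_sub, replace=False)

print("unique rows in bootstrap sample:", len(set(bootstrap_idx.tolist())))
print("features considered at this split:", sorted(candidate_features.tolist()))
```

Because every tree trains on a different bootstrap sample and splits on different feature subsets, the trees make different mistakes, and averaging their votes cancels much of that error out.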


Using Our Dataset

We continue using the same dataset used since Lesson 4:

Dataplexa ML Housing & Customer Dataset

Target variable:

loan_approved


Preparing the Data

We follow the same familiar preprocessing steps.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

df = pd.read_csv("dataplexa_ml_housing_customer_dataset.csv")

X = df.drop("loan_approved", axis=1)
y = df["loan_approved"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

Training a Random Forest Model

Now we train a Random Forest classifier.

model = RandomForestClassifier(
    n_estimators=100,
    random_state=42
)

model.fit(X_train, y_train)

Here:

n_estimators = number of trees in the forest


Making Predictions

Prediction works exactly as it does for any other scikit-learn model.

y_pred = model.predict(X_test)
y_pred[:10]

Evaluating the Model

Let us check how well the forest performs.

from sklearn.metrics import accuracy_score, classification_report

accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

print(classification_report(y_test, y_pred))

You will usually notice that Random Forest performs better than a single Decision Tree.
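You can see this for yourself on synthetic data. The sketch below uses scikit-learn's `make_classification` (with some label noise added via `flip_y`) rather than our housing dataset, so the exact scores will differ from what you get above:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Noisy synthetic data: label noise makes a single tree prone to overfitting
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=8, flip_y=0.05, random_state=0
)

# Compare a single tree against a 100-tree forest with 5-fold cross-validation
tree_acc = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5).mean()
forest_acc = cross_val_score(
    RandomForestClassifier(n_estimators=100, random_state=0), X, y, cv=5
).mean()

print(f"Single tree: {tree_acc:.3f}  Random Forest: {forest_acc:.3f}")
```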


Important Parameters in Random Forest

Random Forest has several important settings:

n_estimators – number of trees
max_depth – maximum depth of each tree
min_samples_split – minimum number of samples required to split a node
max_features – number of features considered at each split

We will tune these parameters later in hyperparameter tuning lessons.
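As a preview, here is what setting these parameters looks like. The values below are illustrative examples, not tuned choices for our dataset:

```python
from sklearn.ensemble import RandomForestClassifier

# An illustrative configuration; these values are examples, not tuned results
model = RandomForestClassifier(
    n_estimators=200,      # number of trees in the forest
    max_depth=10,          # cap tree depth to curb overfitting
    min_samples_split=4,   # require at least 4 samples before splitting a node
    max_features="sqrt",   # consider sqrt(n_features) features at each split
    random_state=42,
)
```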


Advantages of Random Forest

High accuracy on many tasks
Handles non-linear data well
Works with minimal preprocessing (no feature scaling required)
Less prone to overfitting than a single tree
Provides feature-importance estimates


Limitations of Random Forest

More computation than a single tree
Less interpretable than a single tree
Large forests can be slow to train and deploy


Mini Practice

Think about fraud detection in banking.

Each tree checks different patterns:

Transaction amount
Transaction time
Location
Customer behavior

The final decision comes from combining all trees.

This is why Random Forest is widely used in fraud detection.


Exercises

Exercise 1:
Why is Random Forest more stable than Decision Trees?

Because it combines predictions from many trees, reducing overfitting.

Exercise 2:
What does n_estimators represent?

It represents the number of decision trees in the forest.

Exercise 3:
Can Random Forest handle non-linear relationships?

Yes. Tree-based models partition the feature space into regions, so they capture non-linear patterns naturally.

Quick Quiz

Q1. Does Random Forest require feature scaling?

No. Tree splits compare feature values against thresholds, so only the ordering of the values matters; rescaling a feature does not change the resulting splits.

Q2. What happens if we increase the number of trees?

Accuracy may improve, but training time also increases.

In the next lesson, we move to Gradient Boosting, which improves models by learning from previous mistakes.