Random Forest
In the previous lesson, we learned how a single Decision Tree works. You saw that it is easy to interpret but prone to overfitting the training data.
In this lesson, we fix that weakness using one of the most powerful and widely used algorithms in Machine Learning — Random Forest.
What Is Random Forest?
Random Forest is an ensemble learning algorithm.
Instead of building one decision tree, it builds many decision trees and combines their results.
Each tree gives its own prediction, and the forest chooses the final answer by:
- Majority voting (for classification)
- Average prediction (for regression)
This simple idea makes Random Forest very accurate and stable.
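The voting and averaging step can be sketched with a handful of hypothetical tree predictions (the labels and values below are made up purely for illustration):

```python
from collections import Counter

# Hypothetical predictions from five trees for one sample (classification)
tree_votes = ["approve", "approve", "reject", "approve", "reject"]

# Majority voting: the class predicted by most trees wins
final_class = Counter(tree_votes).most_common(1)[0][0]
print(final_class)  # approve

# Hypothetical predictions from five trees for one sample (regression)
tree_outputs = [210_000, 195_000, 205_000, 220_000, 200_000]

# Averaging: the forest predicts the mean of the tree outputs
final_value = sum(tree_outputs) / len(tree_outputs)
print(final_value)  # 206000.0
```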
Real-World Analogy
Imagine you want to buy a house.
You ask one person for advice — that person may be biased or wrong.
Now imagine asking 100 different experts and choosing the most common opinion.
That is exactly how Random Forest works.
Why Random Forest Is Better Than a Single Tree
Each tree is trained on a bootstrap sample — a slightly different version of the data, drawn with replacement.
At each split, each tree considers only a random subset of the features.
Because of this randomness:
Errors made by one tree are corrected by others.
Overfitting is reduced significantly.
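This randomness can be sketched with NumPy. The sizes below are illustrative; for classification, scikit-learn's default feature subset size is the square root of the number of features:

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples, n_features = 8, 4

# Bootstrap sample: draw n_samples row indices *with replacement*,
# so each tree sees a slightly different version of the data
bootstrap_idx = rng.choice(n_samples, size=n_samples, replace=True)

# Random feature subset: sqrt(n_features) features considered at a split
n_sub = max(1, int(np.sqrt(n_features)))
feature_idx = rng.choice(n_features, size=n_sub, replace=False)

print(bootstrap_idx)  # some rows repeat, some are left out entirely
print(feature_idx)
```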
Using Our Dataset
We continue using the same dataset used since Lesson 4:
Dataplexa ML Housing & Customer Dataset
Target variable:
loan_approved
Preparing the Data
We follow the same familiar preprocessing steps.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# Load the dataset
df = pd.read_csv("dataplexa_ml_housing_customer_dataset.csv")

# Separate features and target
X = df.drop("loan_approved", axis=1)
y = df["loan_approved"]

# Hold out 20% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
Training a Random Forest Model
Now we train a Random Forest classifier.
model = RandomForestClassifier(
n_estimators=100,
random_state=42
)
model.fit(X_train, y_train)
Here:
n_estimators = number of trees in the forest
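You can verify this yourself: a fitted forest literally stores one decision tree per estimator. The snippet below uses a small synthetic dataset as a stand-in, since the Dataplexa CSV is not needed for this check:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (the lesson itself uses the Dataplexa CSV)
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=42)

forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_demo, y_demo)

# The fitted forest contains exactly n_estimators individual trees
print(len(forest.estimators_))  # 100
```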
Making Predictions
Prediction works exactly as with any other scikit-learn model.
y_pred = model.predict(X_test)
y_pred[:10]
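Because the forest is built from many voting trees, it can also report class probabilities: predict_proba returns, for each sample, the fraction of trees voting for each class. A quick sketch on synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (the lesson itself uses the Dataplexa CSV)
X_demo, y_demo = make_classification(n_samples=200, random_state=42)
forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_demo, y_demo)

# Each row: fraction of trees voting for class 0 vs class 1
proba = forest.predict_proba(X_demo[:3])
print(proba.shape)        # (3, 2)
print(proba.sum(axis=1))  # each row sums to 1.0
```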
Evaluating the Model
Let us check how well the forest performs.
from sklearn.metrics import accuracy_score, classification_report
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
print(classification_report(y_test, y_pred))
You will usually notice that Random Forest performs better than a single Decision Tree.
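You can see this tendency on a small experiment. The snippet below uses a noisy synthetic dataset (a stand-in, since results on the Dataplexa data will differ) and compares test accuracy of a single tree against a forest; on noisy data the forest typically scores higher, though this is not guaranteed for every dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Noisy synthetic data, where a single deep tree tends to overfit
X_demo, y_demo = make_classification(
    n_samples=500, n_features=20, n_informative=5, flip_y=0.1, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

tree = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)
forest = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)

tree_acc = accuracy_score(y_te, tree.predict(X_te))
forest_acc = accuracy_score(y_te, forest.predict(X_te))
print(f"tree: {tree_acc:.3f}  forest: {forest_acc:.3f}")
```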
Important Parameters in Random Forest
Random Forest has several important settings:
- n_estimators – number of trees in the forest
- max_depth – maximum depth of each tree
- min_samples_split – minimum samples required to split a node
- max_features – number of features considered per split
We will tune these parameters later in hyperparameter tuning lessons.
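As a sketch of how these parameters fit together, here is a forest configured with illustrative (not tuned) values for each setting:

```python
from sklearn.ensemble import RandomForestClassifier

# Illustrative values only — we will tune these properly in later lessons
model = RandomForestClassifier(
    n_estimators=200,      # more trees: more stable, but slower to train
    max_depth=10,          # cap tree depth to curb overfitting
    min_samples_split=5,   # require at least 5 samples to split a node
    max_features="sqrt",   # consider sqrt(n_features) features per split
    random_state=42,
)
print(model.get_params()["max_depth"])  # 10
```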
Advantages of Random Forest
- High accuracy on many problems
- Handles non-linear relationships well
- Works with minimal preprocessing (no feature scaling required)
- Much less prone to overfitting than a single tree
- Robust to noise and outliers, since errors average out across trees
Limitations of Random Forest
- More computation than a single tree
- Less interpretable than a single tree
- Large forests can be slow to deploy
Mini Practice
Think about fraud detection in banking.
Each tree checks different patterns:
- Transaction amount
- Transaction time
- Location
- Customer behavior
The final decision comes from combining all trees.
This is why Random Forest is widely used in fraud detection.
Exercises
Exercise 1:
Why is Random Forest more stable than Decision Trees?
Exercise 2:
What does n_estimators represent?
Exercise 3:
Can Random Forest handle non-linear relationships?
Quick Quiz
Q1. Does Random Forest require feature scaling?
Q2. What happens if we increase the number of trees?
In the next lesson, we move to Gradient Boosting, which improves models by learning from previous mistakes.