ML Lesson 24 – Naive Bayes | Dataplexa

Naive Bayes

In the previous lesson, we studied K-Nearest Neighbors and learned how predictions can be made purely based on similarity between data points. KNN worked by comparing new customers with existing ones.

In this lesson, we move to a very different philosophy of learning. Instead of distance or decision boundaries, we now use probability to make predictions.

This approach is called Naive Bayes. Despite the word “naive” in its name, this algorithm is surprisingly powerful and widely used.


The Core Idea Behind Naive Bayes

Naive Bayes is based on Bayes’ Theorem, which helps us calculate the probability of an event given that some other event has already occurred.
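As a quick illustration, here is Bayes' Theorem applied to a single piece of evidence. All the probabilities below are made up purely for demonstration; they do not come from our dataset.

```python
# All numbers below are invented for illustration only.
p_approved = 0.6                 # P(approved): the prior probability
p_evidence_given_approved = 0.8  # P(high income | approved): the likelihood
p_evidence = 0.5                 # P(high income): the overall evidence

# Bayes' Theorem: P(approved | high income)
#   = P(high income | approved) * P(approved) / P(high income)
posterior = p_evidence_given_approved * p_approved / p_evidence
print(round(posterior, 2))  # 0.96
```

Observing high income raised our belief in approval from the prior of 0.6 to a posterior of 0.96.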

In simple terms, Naive Bayes answers this question:

“What is the probability that a customer’s loan is approved, given their income, credit score, age, and employment status?”

The algorithm assumes that each feature contributes independently to the final outcome. This assumption is what makes the algorithm “naive”.

Even though this assumption is not always true in real life, the model still performs very well in practice.
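To see what the independence assumption buys us, here is a sketch with invented per-feature likelihoods: instead of modeling how features interact, the joint likelihood is approximated by a simple product.

```python
# Invented per-feature likelihoods for the "approved" class.
likelihoods = {
    "high_income": 0.8,
    "good_credit_score": 0.7,
    "employed": 0.9,
}
prior_approved = 0.6

# The "naive" step: treat the features as independent and multiply
joint_likelihood = 1.0
for p in likelihoods.values():
    joint_likelihood *= p

# Unnormalised posterior score for the "approved" class
score = prior_approved * joint_likelihood
print(round(score, 4))  # 0.3024
```

This product is cheap to compute no matter how many features we add, which is exactly why the algorithm is so fast.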


Why Naive Bayes Works So Well

Naive Bayes does not try to learn complex patterns. Instead, it focuses on probability distributions.

This makes it extremely fast, memory-efficient, and reliable even when the dataset is not very large.

It is commonly used in spam detection, document classification, sentiment analysis, and medical diagnosis systems.


Using Our Dataset

We continue using the same dataset throughout the ML module to ensure smooth learning progression.

Dataplexa ML Housing & Customer Dataset

Our task remains unchanged: predict whether a loan is approved.


Preparing the Data

Naive Bayes works best when each feature's distribution within a class can be described by a simple model.

For this lesson, we will use Gaussian Naive Bayes, which assumes that each numerical feature follows a normal (Gaussian) distribution within each class. Other variants exist for other data types, such as Multinomial Naive Bayes for word counts and Bernoulli Naive Bayes for binary features.
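Under this assumption, each feature's likelihood within a class comes from the normal density, which we can sketch by hand. The mean and variance below are invented, not taken from our dataset.

```python
import math

def gaussian_pdf(x, mean, var):
    # Normal probability density, the likelihood model used by Gaussian Naive Bayes
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Invented example: suppose income (in thousands) among approved loans
# has mean 70 and variance 100. How likely is an income of 65?
print(round(gaussian_pdf(65, 70.0, 100.0), 4))  # 0.0352
```

During training, the model simply estimates a mean and variance per feature per class; during prediction, it plugs the customer's values into densities like this one.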

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

df = pd.read_csv("dataplexa_ml_housing_customer_dataset.csv")

# Separate the features from the target column
X = df.drop("loan_approved", axis=1)
y = df["loan_approved"]

# Note: GaussianNB expects numeric inputs, so any categorical columns
# (such as employment status) must be encoded as numbers beforehand.

# Hold out 20% of the customers for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

Training the Naive Bayes Model

Training a Naive Bayes model is extremely fast, and GaussianNB has almost no hyperparameters to tune.

model = GaussianNB()
model.fit(X_train, y_train)

The model learns the probability distributions of each feature for approved and rejected loans.
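We can actually peek at what was learned. On a tiny synthetic stand-in for the customer data (two features with invented values), scikit-learn's GaussianNB stores the per-class feature means in `theta_` and the per-class variances in `var_`:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Tiny invented stand-in: columns are income (thousands) and credit score
X = np.array([[30, 550], [35, 580], [80, 720], [90, 750]], dtype=float)
y = np.array([0, 0, 1, 1])  # 0 = rejected, 1 = approved

model = GaussianNB().fit(X, y)

# theta_: per-class feature means; var_: per-class feature variances
print(model.theta_)  # row 0 = rejected class, row 1 = approved class
```

These means and variances are the entire "model" — prediction is just plugging new values into the corresponding Gaussian densities.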


Making Predictions

Predictions are made by combining probabilities from all features.

y_pred = model.predict(X_test)
print(y_pred[:10])  # first ten predictions
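Beyond hard class labels, `predict_proba` exposes the posterior probability of each class, which is often more useful for risk decisions. A minimal self-contained sketch with invented numbers:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Invented stand-in data: income (thousands) and credit score
X = np.array([[30, 550], [35, 580], [80, 720], [90, 750]], dtype=float)
y = np.array([0, 0, 1, 1])  # 0 = rejected, 1 = approved

model = GaussianNB().fit(X, y)

# Posterior probabilities for a new customer with a high income
# but a middling credit score
proba = model.predict_proba([[85, 700]])
print(proba.round(3))  # one row, one column per class
```

A bank could, for example, flag applications whose approval probability falls in an uncertain middle band for manual review.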

Evaluating the Model

Let us evaluate how well Naive Bayes performs on unseen customer data.

from sklearn.metrics import accuracy_score, classification_report

print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

Naive Bayes often performs slightly worse than advanced models, but its speed and simplicity make it very valuable.


Real-World Intuition

Think of Naive Bayes like a quick background check. A bank officer looks at income, credit score, and employment history individually and combines their likelihoods.

The officer does not try to model complex interactions. They simply estimate overall risk.


Mini Practice

Imagine a customer with high income but low credit score. Naive Bayes balances the probabilities from each feature before making a decision.

This explains why it can still perform well even with simple assumptions.
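We can sketch this balancing act with invented likelihoods: high income favours approval, a low credit score favours rejection, and the posterior weighs both against the priors.

```python
# Invented priors and per-feature likelihoods for one customer
# who has a high income but a low credit score.
unnormalised = {
    "approved": 0.6 * 0.8 * 0.2,  # prior * P(high income|appr) * P(low credit|appr)
    "rejected": 0.4 * 0.3 * 0.7,  # prior * P(high income|rej)  * P(low credit|rej)
}

# Normalise so the two posteriors sum to 1
total = sum(unnormalised.values())
posterior = {cls: score / total for cls, score in unnormalised.items()}
print({cls: round(p, 3) for cls, p in posterior.items()})
```

With these particular numbers the two signals nearly cancel out, leaving only a slight lean towards approval — a direct picture of how the model "balances" conflicting features.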


Exercises

Exercise 1:
Why is Naive Bayes called “naive”?

Because it assumes that features are independent of each other.

Exercise 2:
Why is Gaussian Naive Bayes suitable here?

Because the features are numerical and can be modeled using normal distributions.

Quick Quiz

Q1. Is Naive Bayes computationally expensive?

No. It is one of the fastest machine learning algorithms.

In the next lesson, we will move into Clustering and study the K-Means algorithm, where no labels are provided to the model.