Naive Bayes
In the previous lesson, we studied K-Nearest Neighbors and learned how predictions can be made purely based on similarity between data points. KNN worked by comparing new customers with existing ones.
In this lesson, we move to a very different philosophy of learning. Instead of distance or decision boundaries, we now use probability to make predictions.
This approach is called Naive Bayes. Despite the word “naive” in its name, this algorithm is surprisingly powerful and widely used.
The Core Idea Behind Naive Bayes
Naive Bayes is based on Bayes’ Theorem, which helps us calculate the probability of an event given that some other event has already occurred.
In simple terms, Naive Bayes answers this question:
“What is the probability that a customer’s loan is approved, given their income, credit score, age, and employment status?”
The algorithm assumes that the features are conditionally independent of one another given the outcome. This conditional-independence assumption is what makes the algorithm “naive”.
Even though this assumption is not always true in real life, the model still performs very well in practice.
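To make this concrete, here is a tiny hand-computed sketch. All the probabilities below are made up for illustration; they do not come from our dataset.

```python
# Hand-computed Naive Bayes with made-up numbers (illustration only).
# Prior: suppose 60% of past loans were approved.
p_approved, p_rejected = 0.6, 0.4

# Per-feature likelihoods, treated as independent given the class:
p_income_given_approved, p_income_given_rejected = 0.8, 0.3
p_credit_given_approved, p_credit_given_rejected = 0.7, 0.4

# Unnormalized posterior = prior * product of feature likelihoods.
score_approved = p_approved * p_income_given_approved * p_credit_given_approved
score_rejected = p_rejected * p_income_given_rejected * p_credit_given_rejected

# Normalize so the two posteriors sum to 1.
p_post = score_approved / (score_approved + score_rejected)
print(round(p_post, 3))  # → 0.875
```

Even with only two features, multiplying the per-feature likelihoods and the prior is all the “learning” Naive Bayes needs.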
Why Naive Bayes Works So Well
Naive Bayes does not try to learn complex patterns. Instead, it focuses on probability distributions.
This makes it extremely fast, memory-efficient, and reliable even when the dataset is not very large.
It is commonly used in spam detection, document classification, sentiment analysis, and medical diagnosis systems.
Using Our Dataset
We continue using the same dataset throughout the ML module to ensure smooth learning progression.
Dataplexa ML Housing & Customer Dataset
Our task remains unchanged: predict whether a loan is approved.
Preparing the Data
Naive Bayes works best when the features are numerical and follow well-behaved distributions.
For this lesson, we will use Gaussian Naive Bayes, which assumes that, within each class, every feature follows a normal (Gaussian) distribution.
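As a sketch of what “Gaussian” means here: the model scores each feature value using the normal probability density function. The mean and variance below are invented purely for illustration.

```python
import math

def gaussian_pdf(x, mean, var):
    # Normal probability density: how likely value x is under N(mean, var).
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Hypothetical: for approved loans, income has mean 75 and variance 100.
density_at_mean = gaussian_pdf(75.0, mean=75.0, var=100.0)
print(round(density_at_mean, 4))  # → 0.0399
```

Gaussian Naive Bayes evaluates one such density per feature per class, then multiplies them together exactly as in the hand-computed example earlier.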
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
df = pd.read_csv("dataplexa_ml_housing_customer_dataset.csv")
X = df.drop("loan_approved", axis=1)
y = df["loan_approved"]
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
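One practical caveat: GaussianNB expects purely numeric input, so if the CSV contains text columns such as an employment status, they must be encoded first. A minimal sketch using a hypothetical mini-frame (the column names here are assumptions, not the actual dataset schema):

```python
import pandas as pd

# Hypothetical mini-frame standing in for the real CSV (column names assumed).
df_demo = pd.DataFrame({
    "income": [45, 80, 62],
    "employment_status": ["salaried", "self_employed", "salaried"],
})

# One-hot encode the text column so every feature is numeric.
X_demo = pd.get_dummies(df_demo, columns=["employment_status"])
print(list(X_demo.columns))
```

If your copy of the dataset is already fully numeric, this step is unnecessary.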
Training the Naive Bayes Model
Training a Naive Bayes model is extremely fast: fitting only requires estimating a mean and a variance for each feature in each class, and there are almost no hyperparameters to tune.
model = GaussianNB()
model.fit(X_train, y_train)
The model learns the probability distributions of each feature for approved and rejected loans.
Making Predictions
Predictions are made by combining probabilities from all features.
y_pred = model.predict(X_test)
print(y_pred[:10])
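If you want to see the probabilities behind each prediction rather than just the final labels, predict_proba returns the per-class posteriors. A minimal synthetic sketch:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Tiny synthetic example: one feature, two well-separated classes.
X = np.array([[30.0], [35.0], [40.0], [70.0], [75.0], [80.0]])
y = np.array([0, 0, 0, 1, 1, 1])

model = GaussianNB().fit(X, y)

# predict_proba gives the posterior probability of each class;
# predict simply returns the class with the larger posterior.
probs = model.predict_proba(np.array([[72.0]]))
print(probs.round(3))
```

On our loan data, a row whose posterior is close to 0.5 is a borderline applicant, which can be more useful to know than the hard label alone.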
Evaluating the Model
Let us evaluate how well Naive Bayes performs on unseen customer data.
from sklearn.metrics import accuracy_score, classification_report
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
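A confusion matrix complements the accuracy score by showing where the errors occur. A quick sketch with toy labels (these numbers are made up, not real model output):

```python
from sklearn.metrics import confusion_matrix

# Toy labels standing in for real test results (0 = rejected, 1 = approved).
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_hat  = [1, 0, 0, 1, 0, 1, 1, 0]

# Rows are actual classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_hat)
print(cm)  # [[3 1]
           #  [1 3]]
```

For loan approval, the off-diagonal cells matter most: the top-right cell counts rejected loans the model wrongly approved, and the bottom-left counts approvals it wrongly rejected.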
Naive Bayes often performs slightly worse than advanced models, but its speed and simplicity make it very valuable.
Real-World Intuition
Think of Naive Bayes like a quick background check. A bank officer looks at income, credit score, and employment history individually and combines their likelihoods.
The officer does not try to model complex interactions. They simply estimate overall risk.
Mini Practice
Imagine a customer with high income but low credit score. Naive Bayes balances the probabilities from each feature before making a decision.
This explains why it can still perform well even with simple assumptions.
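A back-of-the-envelope sketch of this balancing act, with invented likelihoods:

```python
# Conflicting evidence, with invented likelihoods: high income favors
# approval, low credit score favors rejection.
prior = {"approved": 0.5, "rejected": 0.5}
p_high_income = {"approved": 0.9, "rejected": 0.2}
p_low_credit = {"approved": 0.1, "rejected": 0.6}

score_app = prior["approved"] * p_high_income["approved"] * p_low_credit["approved"]
score_rej = prior["rejected"] * p_high_income["rejected"] * p_low_credit["rejected"]

# With these numbers, the low credit score slightly outweighs the high income.
p_approved = score_app / (score_app + score_rej)
print(round(p_approved, 3))  # → 0.429
```

Neither feature decides the outcome alone; the product of the likelihoods does.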
Exercises
Exercise 1:
Why is Naive Bayes called “naive”?
Exercise 2:
Why is Gaussian Naive Bayes suitable here?
Quick Quiz
Q1. Is Naive Bayes computationally expensive?
In the next lesson, we will move into Clustering and study the K-Means algorithm, where no labels are provided to the model.