Support Vector Machines (SVM)
In the previous lesson, we studied XGBoost and saw how powerful ensemble models can become by combining many decision trees. While XGBoost works extremely well for many business problems, there is another class of algorithms that approaches classification from a completely different angle.
In this lesson, we learn about Support Vector Machines, commonly called SVM.
SVM is a mathematically elegant algorithm that focuses on drawing the best possible boundary between different classes.
The Core Idea Behind SVM
Imagine you are separating approved and rejected loan applications. Some customers are clearly safe, some are clearly risky, and some fall very close to the decision boundary.
SVM tries to find a line (or, in higher dimensions, a hyperplane) that separates these two groups in such a way that the distance between the boundary and the nearest data points on both sides is as large as possible.
This distance is called the margin. A wider margin means the boundary is less sensitive to individual data points, so the model tends to generalize better to unseen data.
The data points that lie closest to this boundary are known as support vectors. The boundary is determined entirely by these points: moving or removing any other training example leaves it unchanged.
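These ideas can be seen directly in code. The sketch below fits a linear SVM on a tiny hand-made 2-D dataset (the points and labels are illustrative, not from the lesson's dataset) and prints the support vectors and the margin width, which for a linear SVM equals 2 / ||w||.

```python
import numpy as np
from sklearn.svm import SVC

# Tiny illustrative dataset: two clearly separated groups
X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0],   # class 0
              [4.0, 4.0], [5.0, 4.5], [4.5, 5.0]])  # class 1
y = np.array([0, 0, 0, 1, 1, 1])

# A very large C approximates a hard margin (no misclassified points)
model = SVC(kernel="linear", C=1e6)
model.fit(X, y)

# The boundary is defined only by these points
print("Support vectors:\n", model.support_vectors_)

# For a linear SVM, the margin width is 2 / ||w||
w = model.coef_[0]
print("Margin width:", 2 / np.linalg.norm(w))
```

Notice that the printed support vectors are only the points nearest the boundary; the others play no role in defining it.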
Why SVM Is Powerful
SVM does not try to classify every point perfectly. Instead, it focuses on the most difficult cases near the boundary.
This makes SVM very effective when the dataset is clean, well-structured, and not extremely large.
It also explains why SVM is popular in fields like text classification, bioinformatics, and image recognition.
Using Our Dataset
We continue using the same dataset throughout this module to maintain a smooth learning flow.
Dataplexa ML Housing & Customer Dataset
Our goal remains unchanged: predict whether a loan will be approved.
Preparing the Data
SVM is sensitive to feature scales: a feature measured in the tens of thousands (such as income) would dominate one measured in the hundreds (such as a credit score). We therefore standardize the data before training the model.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Load the module dataset (assumes all feature columns are numeric)
df = pd.read_csv("dataplexa_ml_housing_customer_dataset.csv")
X = df.drop("loan_approved", axis=1)
y = df["loan_approved"]

# Hold out 20% of the data for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit the scaler on the training split only, then apply it to both splits
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
Standardization ensures that all features contribute fairly to the model.
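As a quick sanity check, standardized features should have roughly zero mean and unit standard deviation. The self-contained sketch below uses synthetic stand-in columns (hypothetical income and credit-score scales, not the lesson's dataset) to show the effect:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in features on very different scales
rng = np.random.default_rng(0)
X_demo = np.column_stack([
    rng.normal(50_000, 15_000, 200),  # e.g. income in dollars
    rng.normal(650, 50, 200),         # e.g. credit score
])

X_scaled = StandardScaler().fit_transform(X_demo)
print(X_scaled.mean(axis=0).round(6))  # approximately [0, 0]
print(X_scaled.std(axis=0).round(6))   # approximately [1, 1]
```

After scaling, neither column can dominate the distance calculations that SVM relies on.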
Training an SVM Model
We now train a Support Vector Classifier. The kernel parameter controls how the decision boundary is shaped.
model = SVC(kernel="rbf", C=1.0, gamma="scale")
model.fit(X_train, y_train)
The RBF kernel allows SVM to create curved decision boundaries, which is useful when the data is not linearly separable.
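The difference between a linear and an RBF kernel is easy to demonstrate on data that no straight line can separate. The sketch below uses scikit-learn's `make_circles` (two concentric rings, not the lesson's dataset) and compares training accuracy:

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric circles: impossible to separate with a straight line
X_c, y_c = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=42)

linear_acc = SVC(kernel="linear").fit(X_c, y_c).score(X_c, y_c)
rbf_acc = SVC(kernel="rbf", gamma="scale").fit(X_c, y_c).score(X_c, y_c)

print("Linear kernel accuracy:", linear_acc)  # near chance level
print("RBF kernel accuracy:", rbf_acc)        # near perfect
```

The RBF kernel implicitly maps the points into a higher-dimensional space where a separating hyperplane exists, which appears as a curved boundary in the original space.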
Making Predictions
Once trained, the model can predict loan approval decisions for unseen customers.
y_pred = model.predict(X_test)
print(y_pred[:10])  # first ten predicted approval decisions
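Beyond hard labels, an SVC also exposes `decision_function`, which returns each sample's signed distance to the boundary: the sign gives the class, and the magnitude tells us how borderline the case is. A self-contained sketch with toy one-feature data (illustrative, not the lesson's dataset):

```python
import numpy as np
from sklearn.svm import SVC

# Toy 1-feature data: values below ~3 are class 0, above are class 1
X_toy = np.array([[1.0], [2.0], [2.5], [3.5], [4.0], [5.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear").fit(X_toy, y_toy)

# Signed distance to the boundary for three new samples
scores = clf.decision_function([[0.5], [3.0], [5.5]])
print(scores)  # clearly negative, near zero, clearly positive
```

A score near zero flags exactly the borderline applicants that SVM cares most about.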
Evaluating the Model
Let us measure how well the model performs.
from sklearn.metrics import accuracy_score, classification_report
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
SVM often performs well when the classes are clearly separable and the dataset size is moderate.
Real-World Intuition
Think of SVM like a strict bank officer. Instead of considering every customer equally, the officer focuses on borderline applicants and sets rules that maximize safety.
This mindset helps SVM generalize well to new cases.
Mini Practice
Suppose two customers have very similar income and credit scores, but one gets approved and the other does not.
SVM concentrates heavily on such edge cases when forming its decision boundary.
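This focus on edge cases is visible in a fitted model: `support_` holds the indices of the training points kept as support vectors, and on well-separated data they are only a small fraction of the training set. A sketch on synthetic blobs (illustrative data, not the lesson's dataset):

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two well-separated synthetic clusters
X_b, y_b = make_blobs(n_samples=300, centers=2, cluster_std=1.0,
                      random_state=42)

svm = SVC(kernel="linear", C=1.0).fit(X_b, y_b)

# Only the borderline points end up defining the boundary
print("Support vectors per class:", svm.n_support_)
print("Fraction of training set:", svm.support_.size / len(X_b))
```

The remaining points could be deleted without changing the decision boundary at all.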
Exercises
Exercise 1:
Why is feature scaling important for SVM?
Exercise 2:
What role do support vectors play?
Quick Quiz
Q1. Is SVM suitable for extremely large datasets?
In the next lesson, we will study the K-Nearest Neighbors (KNN) algorithm and understand how similarity-based learning works.