ML Lesson 30 – Feature Engineering

Feature Engineering

In the previous lesson, we carefully selected the most important features from our dataset. We did not change the data itself. We only decided what to keep and what to remove.

Now we move to a more powerful idea. Instead of only choosing features, we will learn how to create better features. This process is called Feature Engineering.


What Feature Engineering Really Is

Feature engineering is the process of transforming raw data into features that help machine learning models learn patterns more effectively.

Raw data is rarely perfect. It may be messy, incomplete, or poorly expressed. A good feature engineer reshapes the data so that important signals become clear to the model.

In real-world machine learning, feature engineering often has more impact on performance than choosing a complex algorithm.


Why Feature Engineering Matters

Machine learning models do not understand meaning. They only understand numbers and patterns.

If we present data in a confusing way, the model struggles. If we present data in a meaningful way, even a simple model can perform very well.

This is why feature engineering is considered one of the most valuable skills for ML engineers.


Our Dataset and Objective

We continue using the same dataset introduced earlier:

Dataplexa ML Housing & Customer Dataset

Our goal remains the same: predict whether a loan is approved.

Now, instead of removing features, we will improve how features represent real-world behavior.


Creating Meaningful Features

Let us look at income and loan amount. Individually, these values are useful. But their relationship is often more important.

A person earning 100,000 requesting a 10,000 loan is very different from a person earning 20,000 requesting the same loan.

So we create a new feature that captures this relationship.

import pandas as pd

# Load the Dataplexa dataset used throughout this series
df = pd.read_csv("dataplexa_ml_housing_customer_dataset.csv")

# Ratio of annual income to requested loan amount: a direct affordability signal
df["income_to_loan_ratio"] = df["annual_income"] / df["loan_amount"]

df.head()

This new feature helps the model understand affordability, not just raw numbers.
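
One caveat worth guarding against: if any loan_amount happens to be zero, the division produces an infinite value, which downstream preprocessing and models handle badly. A minimal sketch of one way to guard against this, assuming the same column names:

import numpy as np

# Treat a zero loan amount as missing so the ratio becomes NaN instead of inf
df["income_to_loan_ratio"] = df["annual_income"] / df["loan_amount"].replace(0, np.nan)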


Handling Time-Based Features

Time-related features often hide important insights.

For example, employment length tells us more when categorized into experience levels rather than treated as a raw number.

df["experience_level"] = pd.cut(
    df["employment_years"],
    bins=[0, 2, 5, 10, 40],
    labels=["Junior", "Mid", "Senior", "Expert"]
)

df[["employment_years", "experience_level"]].head()

This transformation mirrors how people actually reason about experience, which makes the pattern easier for the model to detect.
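
After a binning step like this, it is worth confirming that every row actually landed in a bin; values outside the bin edges silently become NaN. A quick check, assuming the DataFrame above:

# dropna=False also counts any rows that fell outside the bin edges
df["experience_level"].value_counts(dropna=False)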


Encoding Categorical Meaning

Some features are categorical but still carry order.

For example, education levels represent progression, not just labels.

Encoding them carefully preserves this meaning.

# Map ordered education levels to integers that preserve their progression
education_map = {
    "High School": 1,
    "Bachelor": 2,
    "Master": 3,
    "PhD": 4
}

df["education_level_encoded"] = df["education_level"].map(education_map)

df[["education_level", "education_level_encoded"]].head()

Real-World Example

Banks rarely feed raw data directly into models.

They create features such as risk scores, income ratios, and stability indicators.

Feature engineering bridges the gap between raw data and business reasoning.


Mini Practice

Create a feature that represents loan amount as a percentage of annual income.

Think about why this might help a loan approval model.
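
If you want to compare your attempt, here is a minimal sketch of one possible solution; the column name loan_pct_of_income is just an illustrative choice:

# Loan amount expressed as a percentage of annual income
df["loan_pct_of_income"] = df["loan_amount"] / df["annual_income"] * 100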


Exercises

Exercise 1:
Why does feature engineering often improve model performance?

Answer: Because it presents data in a way that highlights meaningful patterns.

Exercise 2:
Is feature engineering model-specific?

Answer: Largely no. Good features usually improve many different models.

Quick Quiz

Q1. Can feature engineering reduce overfitting?

Answer: Yes. Well-designed features can reduce noise and improve generalization.

In the next lesson, we enter the advanced phase and begin learning about Hyperparameter Tuning.