Feature Engineering
Feature Engineering is the process of transforming raw data into meaningful inputs that help machine learning models perform better. A model is only as good as the features it learns from.
Even the most advanced algorithm can fail if the features are weak, noisy, or irrelevant. Strong feature engineering often improves performance more than changing the model itself.
Why Feature Engineering Matters
Raw data rarely comes in a form that machine learning algorithms can understand directly. Feature engineering bridges the gap between raw data and intelligent predictions.
- Improves model accuracy
- Reduces noise and redundancy
- Helps models learn patterns faster
- Increases interpretability
Real-World Connection
Consider predicting house prices. A raw address string on its own gives a model little to learn from. Converting that information into numerical features such as distance to the city center, number of rooms, or floor area makes prediction possible.
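As a minimal sketch of that conversion, assuming the address has already been geocoded into latitude/longitude (the coordinates and the haversine_km helper below are illustrative, not from any particular library):
import math

# Illustrative coordinates: a geocoded listing and the city center
listing = (40.7580, -73.9855)
city_center = (40.7128, -74.0060)

def haversine_km(a, b):
    # Great-circle distance between two (lat, lon) pairs, in kilometers
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    h = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * 6371 * math.asin(math.sqrt(h))

# A usable numeric feature derived from a raw address
print(round(haversine_km(listing, city_center), 2))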
Common Feature Engineering Techniques
- Handling missing values
- Encoding categorical variables
- Feature scaling
- Creating new features
Handling Missing Values
Missing data can confuse models. One common approach is replacing missing values with the mean or median.
import pandas as pd
import numpy as np

# Small dataset with one missing value in each column
data = pd.DataFrame({
    'age': [25, 30, np.nan, 40],
    'salary': [50000, 60000, 55000, np.nan]
})

# Replace each missing value with that column's mean
data_filled = data.fillna(data.mean())
print(data_filled)
Here, missing values are replaced with column averages, making the dataset usable for training.
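Salary-like columns are often skewed by outliers, in which case the median is a more robust fill value. The same idea, with one line changed:
# Median is less sensitive to outliers than the mean
data_filled = data.fillna(data.median())
print(data_filled)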
Encoding Categorical Data
Machine learning models work with numbers, not text. Categorical features must be encoded.
from sklearn.preprocessing import LabelEncoder

# Categorical feature: city names as raw strings
cities = ['New York', 'London', 'Paris', 'London']

# Map each unique city to an integer label
encoder = LabelEncoder()
encoded = encoder.fit_transform(cities)
print(encoded)
Each city is converted into a numeric label that models can process. Keep in mind that these integers imply an arbitrary ordering, which some models may misread as a ranking.
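One-hot encoding avoids that implied ordering by creating a separate indicator column per category. A minimal sketch using pandas (imported earlier):
# One indicator column per category; no artificial ordering
one_hot = pd.get_dummies(pd.Series(cities), prefix='city')
print(one_hot)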
Feature Scaling
Features with large values can dominate others. Scaling brings all features to a similar range.
from sklearn.preprocessing import StandardScaler

# StandardScaler centers each feature to mean 0 and unit variance
scaler = StandardScaler()
scaled_data = scaler.fit_transform([[1], [10], [100]])
print(scaled_data)
After scaling, features contribute more equally to model training.
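If a fixed output range is preferable to zero mean and unit variance, scikit-learn's MinMaxScaler rescales each feature to [0, 1] by default, with the same call pattern:
from sklearn.preprocessing import MinMaxScaler

# Rescale each feature to the [0, 1] range
min_max = MinMaxScaler()
print(min_max.fit_transform([[1], [10], [100]]))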
Creating New Features
Sometimes combining existing features creates more useful information.
# Raw measurements
data = pd.DataFrame({
    'length': [10, 20, 30],
    'width': [5, 10, 15]
})

# Derive a new feature by combining existing columns
data['area'] = data['length'] * data['width']
print(data)
The new feature “area” captures more meaningful information than length or width alone.
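Dates are another common source of new features. A minimal sketch, assuming a hypothetical order_time column, extracting calendar parts with pandas:
# Hypothetical timestamps; calendar parts often reveal useful patterns
orders = pd.DataFrame({
    'order_time': pd.to_datetime(['2024-01-05', '2024-03-16', '2024-07-01'])
})
orders['month'] = orders['order_time'].dt.month
orders['day_of_week'] = orders['order_time'].dt.dayofweek  # Monday = 0
print(orders)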
When Feature Engineering Is Critical
- Small datasets
- Business-driven predictions
- Interpretable models
- Competitive machine learning tasks
Practice Questions
Practice 1: What process converts raw data into useful inputs?
Practice 2: What technique brings features to similar ranges?
Practice 3: Converting text categories into numbers is called?
Quick Quiz
Quiz 1: Machine learning models primarily work with?
Quiz 2: StandardScaler performs which operation?
Quiz 3: Combining existing columns to create better inputs is called?
Coming up next: Feature Selection — choosing the most important features.