Machine Learning Pipelines
So far, we have been performing machine learning steps one by one. We cleaned data, scaled features, trained models, and tuned hyperparameters manually.
While this approach is useful for learning, real-world machine learning systems need something more organized, repeatable, and less error-prone.
This lesson introduces ML Pipelines, a powerful concept that connects all steps into a single automated workflow.
What Is an ML Pipeline?
An ML Pipeline is a structured sequence of steps that transforms raw data into predictions.
Each step performs a specific task, such as scaling, feature selection, or model training.
Once defined, the pipeline behaves like a single model.
This makes machine learning workflows cleaner, safer, and easier to maintain.
Why Pipelines Are Important
In earlier lessons, we manually applied preprocessing before training the model.
This creates a serious risk: the training and test data may not be processed in exactly the same way, and a transformer fitted on the full dataset can quietly leak test-set information into training.
Pipelines remove this risk by fitting every transformation on the training data only and then applying the same fitted transformations during both training and prediction.
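To make the risk concrete, here is a minimal sketch of the manual mistake (the names X, X_train, and X_test are illustrative; the real split is created later in this lesson):
from sklearn.preprocessing import StandardScaler
# Manual preprocessing: nothing stops us from fitting the scaler on ALL rows,
# so statistics from the future test rows leak into training
scaler = StandardScaler().fit(X)              # X still contains the test rows
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
# A pipeline removes this failure mode: its scaler is fitted only on the data
# passed to fit(), and the same fitted scaler is reused for every prediction.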
Dataset Continuity
We continue using the same dataset introduced earlier:
Dataplexa ML Housing & Customer Dataset
Using a single dataset throughout the course helps you understand how real ML systems evolve step by step.
Preparing the Dataset
We start by loading the dataset and separating features from the target variable.
import pandas as pd

# Load the dataset and separate the features (X) from the target (y)
df = pd.read_csv("dataplexa_ml_housing_customer_dataset.csv")
X = df.drop("loan_approved", axis=1)
y = df["loan_approved"]
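A quick, optional sanity check before building the pipeline (assuming the target is a simple label column):
# The feature matrix and the target should have matching row counts
print(X.shape, y.shape)
print(y.value_counts())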
Creating a Pipeline
In this example, we create a pipeline that performs feature scaling followed by logistic regression.
The order of the steps matters: each step passes its output to the next, and the final step is the estimator that makes the predictions.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
# A two-step pipeline: standardize the features, then fit the classifier
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000))
])
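Because the whole pipeline exposes the usual estimator interface, you can also look up its individual steps by the names given above (a small illustration):
# The pipeline is itself an estimator; each step is reachable by its name
print(pipeline.named_steps["scaler"])
print(pipeline.named_steps["model"])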
Training the Pipeline
The pipeline is trained just like a normal model.
Internally, it applies scaling first, then fits the classifier.
from sklearn.model_selection import train_test_split

# Hold out 20% of the rows for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fitting the pipeline fits the scaler on X_train, then trains the classifier
pipeline.fit(X_train, y_train)
Evaluating the Pipeline
Evaluation is simple and safe.
The same preprocessing is automatically applied to the test data.
# score() scales X_test with the scaler fitted on X_train, then computes accuracy
pipeline.score(X_test, y_test)
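If you want per-class metrics rather than a single accuracy number, a short sketch using scikit-learn's classification_report looks like this:
from sklearn.metrics import classification_report

# predict() also runs the fitted scaler first, then the classifier
y_pred = pipeline.predict(X_test)
print(classification_report(y_test, y_pred))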
Combining Pipelines with Hyperparameter Tuning
One of the biggest advantages of pipelines is their compatibility with Grid Search and Random Search.
We can tune model parameters without breaking the workflow. Inside a pipeline, a parameter is addressed as step name, double underscore, parameter name; for example, model__C targets the C parameter of the logistic regression step.
from sklearn.model_selection import GridSearchCV
# Try three regularization strengths for the logistic regression step
param_grid = {
    "model__C": [0.1, 1, 10]
}

# 5-fold cross-validated grid search over the whole pipeline
grid = GridSearchCV(
    pipeline,
    param_grid,
    cv=5,
    scoring="accuracy"
)
grid.fit(X_train, y_train)
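Once the search has finished, the winning settings and the refitted pipeline can be read from the fitted search object (standard scikit-learn attributes):
# Inspect the search results; best_estimator_ is a refitted copy of the whole pipeline
print(grid.best_params_)
print(grid.best_score_)
best_pipeline = grid.best_estimator_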
Real-World Example
In production systems, data arrives continuously from different sources.
Pipelines ensure that every incoming record is processed exactly like training data.
This prevents silent bugs that can destroy model performance.
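As a small sketch (using one held-out row as a stand-in for a newly arriving record), scoring new data is a single call on the fitted pipeline:
# The fitted pipeline re-applies the exact transformations learned during training
new_record = X_test.iloc[[0]]
print(pipeline.predict(new_record))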
Mini Practice
Add another preprocessing step, such as feature selection, to the pipeline and observe how its structure changes; one possible version is sketched below.
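One possible solution, assuming the features are numeric and a univariate score such as f_classif is appropriate (Pipeline, StandardScaler, and LogisticRegression were imported earlier in this lesson):
from sklearn.feature_selection import SelectKBest, f_classif

# A three-step pipeline: scale, keep the 5 highest-scoring features, then classify
# (k=5 is an arbitrary choice for illustration)
pipeline_fs = Pipeline([
    ("scaler", StandardScaler()),
    ("select", SelectKBest(score_func=f_classif, k=5)),
    ("model", LogisticRegression(max_iter=1000))
])
pipeline_fs.fit(X_train, y_train)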
Exercises
Exercise 1:
Why are pipelines safer than manual preprocessing?
Exercise 2:
Can pipelines be used with Random Search?
Quick Quiz
Q1. Does a pipeline replace the need for feature engineering?
In the next lesson, we move from model building to real-world usage by learning about Model Deployment.