Python Lesson 44 – ML with Python | Dataplexa

Machine Learning with Python

Machine learning is the practice of building systems that learn patterns from data rather than following explicitly programmed rules. Python dominates this space, and scikit-learn is the essential toolkit — a consistent, well-documented library covering the full supervised and unsupervised learning workflow: preprocessing, model selection, training, evaluation, and model persistence.

This lesson walks through the entire ML workflow from raw data to a production-ready model, covering classification, regression, clustering, and the evaluation techniques that separate good models from overfit ones.

The Machine Learning Workflow

Every ML project follows the same core sequence regardless of the algorithm used.

  • Collect and understand data — know your features, target, and data quality
  • Preprocess — handle missing values, encode categories, scale features
  • Split — separate training data from test data so evaluation is honest
  • Train — fit a model on training data
  • Evaluate — measure performance on held-out test data
  • Tune — improve the model through hyperparameter search
  • Deploy — save the model for production use

1. Loading Data and Splitting

scikit-learn ships several built-in datasets for learning. train_test_split is the first essential step — you must never evaluate a model on the data it trained on.

# Loading data and train/test split

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import pandas as pd

# Load the classic Iris dataset — 150 flowers, 4 features, 3 species
iris = load_iris()
X = iris.data     # features: sepal length/width, petal length/width
y = iris.target   # labels: 0=setosa, 1=versicolor, 2=virginica

print("Dataset shape:", X.shape)         # (150, 4)
print("Feature names:", iris.feature_names)
print("Classes:      ", iris.target_names)
print("Label counts: ", pd.Series(y).value_counts().to_dict())

# Split — 80% train, 20% test, stratified so class balance is preserved
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y        # preserves class proportions in both splits
)

print(f"\nTraining set: {X_train.shape[0]} samples")
print(f"Test set:     {X_test.shape[0]} samples")
Dataset shape: (150, 4)
Feature names: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Classes:       ['setosa' 'versicolor' 'virginica']
Label counts:  {0: 50, 1: 50, 2: 50}

Training set: 120 samples
Test set:     30 samples
  • X is the feature matrix — shape (n_samples, n_features); y is the target vector — shape (n_samples,)
  • random_state=42 makes the split reproducible — same split every run
  • stratify=y ensures each class appears proportionally in train and test sets — essential for imbalanced datasets
  • Never evaluate on training data — it gives an optimistic, misleading picture of real-world performance
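To see what stratification buys you, here is a minimal sketch on a synthetic 90/10 imbalanced label vector (the data is made up purely for illustration): with stratify=y, both halves keep the minority fraction exactly.

```python
# Stratified split on an imbalanced synthetic dataset — 90% class 0, 10% class 1
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(200).reshape(-1, 1)      # dummy single-feature matrix
y = np.array([0] * 180 + [1] * 20)     # 180 negatives, 20 positives

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Both splits preserve the 10% minority proportion exactly
print("Train class-1 fraction:", (y_tr == 1).mean())   # 0.1
print("Test class-1 fraction: ", (y_te == 1).mean())   # 0.1
```

Without stratify, a random 20% sample of only 20 positives can easily end up with 2 or 7 of them in the test set, skewing every metric computed on it.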

2. Preprocessing — Scaling Features

Many algorithms are sensitive to feature scale — a feature ranging from 0 to 100,000 dominates one ranging from 0 to 1. StandardScaler transforms features to zero mean and unit variance. The scaler is fit only on training data and then applied to both train and test sets.

# Feature scaling — StandardScaler

from sklearn.preprocessing import StandardScaler
import numpy as np

scaler = StandardScaler()

# Fit ONLY on training data — test data is unseen
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled  = scaler.transform(X_test)    # use the same scaler, do NOT refit

print("Before scaling — train feature means:", X_train.mean(axis=0).round(2))
print("After scaling  — train feature means:", X_train_scaled.mean(axis=0).round(2))
print("After scaling  — train feature stds: ", X_train_scaled.std(axis=0).round(2))
Before scaling — train feature means: [5.84 3.06 3.77 1.2 ]
After scaling  — train feature means: [ 0.  0. -0.  0.]
After scaling  — train feature stds:  [1. 1. 1. 1.]
  • fit_transform(X_train) — learns the mean and std from training data, then scales it
  • transform(X_test) — applies the training statistics to test data — never refit on test data
  • Fitting on test data is called data leakage — it gives the model information it would not have in production
  • Other scalers: MinMaxScaler (scales to 0–1), RobustScaler (uses median — better for outliers)
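As a quick illustration of the difference, this sketch (toy numbers, not the lesson's dataset) runs the same column with one large outlier through all three scalers — note how the outlier squashes the MinMaxScaler output while RobustScaler keeps the inliers spread out.

```python
# Comparing StandardScaler, MinMaxScaler and RobustScaler on an outlier-heavy column
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

data = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])   # 100 is an outlier

for scaler in (StandardScaler(), MinMaxScaler(), RobustScaler()):
    scaled = scaler.fit_transform(data).ravel()
    print(f"{type(scaler).__name__:15}", scaled.round(2))
```

MinMaxScaler maps the outlier to 1.0 and crushes the four inliers into [0, 0.03]; RobustScaler, which centres on the median and scales by the IQR, maps the inliers to -1, -0.5, 0, 0.5 and lets only the outlier land far out — which is why it is the better choice when outliers are present.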

3. Training a Classifier

scikit-learn's consistent API means every model has the same three methods: fit(), predict(), and score(). Learning one model means you know the interface for all of them.

Real-world use: email spam detection, customer churn prediction, medical diagnosis, image classification, fraud detection — all classification problems.

# Training a classifier — Random Forest

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
import numpy as np

# Train the model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train_scaled, y_train)

# Predict on unseen test data
y_pred = model.predict(X_test_scaled)

# Accuracy — fraction of correct predictions
accuracy = model.score(X_test_scaled, y_test)
print(f"Accuracy: {accuracy:.2%}\n")

# Classification report — precision, recall, F1 per class
print(classification_report(y_test, y_pred, target_names=iris.target_names))

# Confusion matrix — rows=actual, columns=predicted
cm = confusion_matrix(y_test, y_pred)
print("Confusion matrix:")
print(cm)
Accuracy: 100.00%

              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        10
  versicolor       1.00      1.00      1.00        10
   virginica       1.00      1.00      1.00        10

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30

Confusion matrix:
[[10  0  0]
 [ 0 10  0]
 [ 0  0 10]]
  • Precision — of all predicted positives, how many were actually positive
  • Recall — of all actual positives, how many were correctly predicted
  • F1-score — harmonic mean of precision and recall — the go-to metric for imbalanced classes
  • The confusion matrix diagonal shows correct predictions; off-diagonal shows which classes are confused with which
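The distinction between these metrics matters most on imbalanced data. A small illustrative example with made-up labels (where 1 might mean "fraud"): accuracy looks respectable, while precision and recall reveal that a third of the flagged cases are wrong and a third of the real positives are missed.

```python
# Precision vs recall on a tiny imbalanced example — 1 = positive (e.g. fraud)
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]   # 2 TP, 1 FP, 1 FN

print("Accuracy: ", accuracy_score(y_true, y_pred))   # 0.8 — looks fine
print("Precision:", precision_score(y_true, y_pred))  # 2/(2+1) ≈ 0.67
print("Recall:   ", recall_score(y_true, y_pred))     # 2/(2+1) ≈ 0.67
print("F1:       ", f1_score(y_true, y_pred))
```

A model that predicted all zeros here would score 70% accuracy with zero recall — which is exactly why F1, not accuracy, is the go-to metric when classes are imbalanced.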

4. Cross-Validation — Honest Evaluation

A single train/test split can be lucky or unlucky depending on which samples end up in each set. Cross-validation evaluates the model on multiple different splits and averages the scores — a much more reliable picture of real-world performance.

# K-Fold Cross-Validation — more reliable than a single split

from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
import numpy as np

iris = load_iris()
X, y = iris.data, iris.target

# Pipeline — chains preprocessing and model into one object
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model",  RandomForestClassifier(n_estimators=100, random_state=42))
])

# 5-fold cross-validation — splits data 5 ways, evaluates on each
scores = cross_val_score(pipe, X, y, cv=5, scoring="accuracy")

print("CV scores:  ", scores.round(4))
print(f"Mean:       {scores.mean():.4f}")
print(f"Std:        {scores.std():.4f}")
print(f"95% CI:     {scores.mean():.4f} ± {scores.std() * 2:.4f}")
CV scores:   [0.9667 1.     0.9333 0.9667 1.    ]
Mean:       0.9733
Std:        0.0249
95% CI:     0.9733 ± 0.0499
  • A Pipeline chains transformers and a model — the scaler is refit separately for each fold, preventing data leakage
  • 5-fold CV trains five models — each time using 80% of data for training and 20% for evaluation, rotating the held-out fold
  • Report mean ± std — a large std means the model is sensitive to which data it sees
  • Use cross_val_score early to sanity-check before committing to expensive hyperparameter tuning
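The tuning step from the workflow (step 6) can be sketched with GridSearchCV, which wraps this same pipeline, tries every combination in a parameter grid, and cross-validates each candidate. The grid values below are arbitrary choices for illustration; note the "step__param" naming convention for addressing parameters inside a Pipeline.

```python
# Hyperparameter tuning with GridSearchCV — exhaustive search, 5-fold CV per candidate
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

iris = load_iris()

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model",  RandomForestClassifier(random_state=42)),
])

# Pipeline step parameters are addressed as "<step_name>__<param_name>"
param_grid = {
    "model__n_estimators": [50, 100],
    "model__max_depth":    [None, 3],
}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(iris.data, iris.target)

print("Best params:     ", search.best_params_)
print(f"Best CV accuracy: {search.best_score_:.4f}")
```

After fitting, search.best_estimator_ is a ready-to-use pipeline refit on all the data with the winning parameters — the natural object to pass to joblib.dump later.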

5. Regression — Predicting Continuous Values

Regression predicts a number rather than a category. The workflow is identical — only the model and evaluation metrics change.

Real-world use: predicting house prices, forecasting sales, estimating delivery time, scoring credit risk.

# Regression — predicting house prices

from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_absolute_error, r2_score
import numpy as np

housing = fetch_california_housing()
X, y = housing.data, housing.target   # y = median house value in $100k

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s  = scaler.transform(X_test)

model = GradientBoostingRegressor(n_estimators=100, random_state=42)
model.fit(X_train_s, y_train)

y_pred = model.predict(X_test_s)

mae = mean_absolute_error(y_test, y_pred)
r2  = r2_score(y_test, y_pred)

print(f"Mean Absolute Error: ${mae * 100_000:,.0f}")
print(f"R² Score:            {r2:.4f}")   # 1.0 = perfect, 0 = predicts mean

# Feature importance — which features matter most
importances = model.feature_importances_
for name, imp in sorted(zip(housing.feature_names, importances), key=lambda x: -x[1]):
    print(f"  {name:20} {imp:.4f}")
Mean Absolute Error: $37,824
R² Score:            0.8162

  MedInc               0.5271
  Latitude             0.1204
  Longitude            0.0994
  AveOccup             0.0874
  HouseAge             0.0612
  AveRooms             0.0521
  Population           0.0295
  AveBedrms            0.0229
  • MAE (Mean Absolute Error) — average absolute difference between predictions and actual values — in the same units as the target
  • R² — proportion of variance explained; 1.0 is perfect, 0.0 means the model is no better than always predicting the mean
  • feature_importances_ shows which features the model relied on most — useful for understanding and simplifying models
  • Other regression metrics: MSE (mean squared error), RMSE (root MSE), MAPE (mean absolute percentage error)
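For concreteness, here is a toy calculation (made-up numbers) of those metrics alongside MAE — note how the single large error (30) pulls RMSE above MAE, since squaring penalises big misses more heavily.

```python
# MAE vs MSE/RMSE/MAPE on toy predictions — errors are 10, -10, 30, -20
import numpy as np
from sklearn.metrics import (mean_absolute_error, mean_squared_error,
                             mean_absolute_percentage_error)

y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 330.0, 380.0])

mae  = mean_absolute_error(y_true, y_pred)
mse  = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)                                    # same units as the target
mape = mean_absolute_percentage_error(y_true, y_pred)  # scale-independent

print(f"MAE:  {mae:.2f}")    # 17.50
print(f"MSE:  {mse:.2f}")    # 375.00
print(f"RMSE: {rmse:.2f}")   # 19.36
print(f"MAPE: {mape:.2%}")   # 7.50%
```

MAE and RMSE are in the target's units, so they are easy to explain to stakeholders; MAPE is a percentage, which makes it comparable across targets of different scales but unreliable when true values are near zero.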

6. Unsupervised Learning — KMeans Clustering

Unsupervised learning finds structure in data without labels. KMeans groups samples into k clusters based on similarity.

Real-world use: customer segmentation, anomaly detection, document grouping, image compression.

# KMeans clustering — customer segmentation example

from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
import numpy as np

# Synthetic customer data: [annual_spend, visit_frequency]
rng = np.random.default_rng(42)
customers = np.vstack([
    rng.multivariate_normal([1000, 2],  [[50000, 0], [0, 1]], 50),   # low spend
    rng.multivariate_normal([5000, 12], [[50000, 0], [0, 4]], 50),   # mid spend
    rng.multivariate_normal([12000, 25],[[50000, 0], [0, 9]], 50),   # high spend
])

scaler    = StandardScaler()
customers_scaled = scaler.fit_transform(customers)

kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
kmeans.fit(customers_scaled)
labels = kmeans.labels_

# Summarise each cluster
for cluster_id in range(3):
    mask    = labels == cluster_id
    segment = customers[mask]
    print(f"Cluster {cluster_id}: {mask.sum()} customers | "
          f"avg spend ${segment[:,0].mean():,.0f} | "
          f"avg visits {segment[:,1].mean():.1f}/month")
Cluster 0: 50 customers | avg spend $1,024 | avg visits 2.0/month
Cluster 1: 50 customers | avg spend $4,987 | avg visits 12.1/month
Cluster 2: 50 customers | avg spend $12,043 | avg visits 25.1/month
  • Scale features before clustering — KMeans uses Euclidean distance, so unscaled features with larger ranges dominate
  • Choose k using the elbow method: plot inertia (kmeans.inertia_) for k=1 to 10 and find where it stops falling sharply
  • kmeans.labels_ gives the cluster assignment for each sample
  • kmeans.cluster_centers_ gives the centroid of each cluster in the scaled feature space
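The elbow method from the first bullet can be sketched on the same synthetic customer data — no plot needed, just the inertia values for increasing k:

```python
# Elbow method sketch — inertia for k = 1..6 on the synthetic customer data above
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
customers = np.vstack([
    rng.multivariate_normal([1000, 2],  [[50000, 0], [0, 1]], 50),
    rng.multivariate_normal([5000, 12], [[50000, 0], [0, 4]], 50),
    rng.multivariate_normal([12000, 25],[[50000, 0], [0, 9]], 50),
])
customers_scaled = StandardScaler().fit_transform(customers)

inertias = []
for k in range(1, 7):
    km = KMeans(n_clusters=k, random_state=42, n_init=10).fit(customers_scaled)
    inertias.append(km.inertia_)
    print(f"k={k}: inertia={km.inertia_:.1f}")
# Inertia falls sharply up to k=3, then flattens — the "elbow" suggests k=3,
# matching the three groups the data was generated from.
```

Because the data is standardised, the k=1 inertia is exactly the number of samples times the number of features (150 × 2 = 300); every extra cluster can only reduce it, and the useful signal is where the reduction stops being dramatic.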

7. Saving and Loading Models

After training, save the model to disk so it can be loaded and used in production without retraining.

# Saving and loading models with joblib

import joblib
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model",  RandomForestClassifier(n_estimators=100, random_state=42))
])
pipe.fit(X_train, y_train)

# Save the entire pipeline (scaler + model) to disk
joblib.dump(pipe, "iris_pipeline.joblib")
print("Model saved.")

# Load in production — no need to retrain or refit scaler
loaded_pipe = joblib.load("iris_pipeline.joblib")
predictions = loaded_pipe.predict(X_test)
print("Loaded model accuracy:", loaded_pipe.score(X_test, y_test))
Model saved.
Loaded model accuracy: 1.0
  • Save the entire Pipeline — not just the model — so the scaler is saved alongside it
  • joblib is faster than Python's pickle for large NumPy arrays and is the scikit-learn recommendation
  • In production, load the pipeline once at startup and call predict() on incoming data

Summary Table

Step             Tool                              Key Call
Split data       train_test_split                  train_test_split(X, y, test_size=0.2, stratify=y)
Scale features   StandardScaler                    scaler.fit_transform(X_train)
Train model      Any estimator                     model.fit(X_train, y_train)
Evaluate         classification_report, r2_score   model.score(X_test, y_test)
Cross-validate   cross_val_score                   cross_val_score(pipe, X, y, cv=5)
Pipeline         Pipeline                          Pipeline([("scaler", ...), ("model", ...)])
Save / load      joblib                            joblib.dump(pipe, "model.joblib")

Practice Questions

Practice 1. Why must the scaler be fit only on training data and not on the full dataset?



Practice 2. What is the F1-score and when is it preferred over accuracy?



Practice 3. What does stratify=y do in train_test_split?



Practice 4. Why is saving a Pipeline preferable to saving just the model?



Practice 5. What does an R² score of 0.0 mean for a regression model?



Quiz

Quiz 1. What does model.fit(X_train, y_train) do?






Quiz 2. What advantage does 5-fold cross-validation have over a single train/test split?






Quiz 3. In the confusion matrix, what do the off-diagonal values represent?






Quiz 4. Why must features be scaled before running KMeans clustering?






Quiz 5. Which library is recommended over Python's pickle for saving scikit-learn models?






Next up — Mini Project: putting everything together in a complete end-to-end Python project.