Python Lesson 44 – ML with Python | Dataplexa

Machine Learning with Python

Machine learning is the practice of building systems that learn patterns from data rather than following explicitly programmed rules. Python dominates this space, and scikit-learn is the essential toolkit — a consistent, well-documented library covering the full supervised and unsupervised learning workflow: preprocessing, model selection, training, evaluation, and model persistence.

This lesson walks through the entire ML workflow from raw data to a production-ready model, covering classification, regression, clustering, and the evaluation techniques that separate good models from overfit ones.

The Machine Learning Workflow

Every ML project follows the same core sequence regardless of the algorithm used.

  • Collect and understand data — know your features, target, and data quality
  • Preprocess — handle missing values, encode categories, scale features
  • Split — separate training data from test data so evaluation is honest
  • Train — fit a model on training data
  • Evaluate — measure performance on held-out test data
  • Tune — improve the model through hyperparameter search
  • Deploy — save the model for production use

1. Loading Data and Splitting

scikit-learn ships several built-in datasets for learning. train_test_split is the first essential step — you must never evaluate a model on the data it trained on.

# Loading data and train/test split

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import pandas as pd

# Load the classic Iris dataset — 150 flowers, 4 features, 3 species
iris = load_iris()
X = iris.data     # features: sepal length/width, petal length/width
y = iris.target   # labels: 0=setosa, 1=versicolor, 2=virginica

print("Dataset shape:", X.shape)         # (150, 4)
print("Feature names:", iris.feature_names)
print("Classes:      ", iris.target_names)
print("Label counts: ", pd.Series(y).value_counts().to_dict())

# Split — 80% train, 20% test, stratified so class balance is preserved
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y        # preserves class proportions in both splits
)

print(f"\nTraining set: {X_train.shape[0]} samples")
print(f"Test set:     {X_test.shape[0]} samples")
Dataset shape: (150, 4)
Feature names: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Classes:       ['setosa' 'versicolor' 'virginica']
Label counts:  {0: 50, 1: 50, 2: 50}

Training set: 120 samples
Test set:     30 samples
  • X is the feature matrix — shape (n_samples, n_features); y is the target vector — shape (n_samples,)
  • random_state=42 makes the split reproducible — same split every run
  • stratify=y ensures each class appears proportionally in train and test sets — essential for imbalanced datasets
  • Never evaluate on training data — it gives an optimistic, misleading picture of real-world performance
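To see what stratification buys you, here is a minimal sketch on a synthetic 90/10 imbalanced label vector (the data is made up purely for illustration): with stratify=y, both halves keep the minority fraction exactly.

```python
# Stratified split on an imbalanced synthetic dataset — 90% class 0, 10% class 1
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(200).reshape(-1, 1)      # dummy single-feature matrix
y = np.array([0] * 180 + [1] * 20)     # 180 negatives, 20 positives

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Both splits preserve the 10% minority proportion exactly
print("Train class-1 fraction:", (y_tr == 1).mean())   # 0.1
print("Test class-1 fraction: ", (y_te == 1).mean())   # 0.1
```

Without stratify, a random 20% sample of only 20 positives can easily end up with 2 or 7 of them in the test set, skewing every metric computed on it.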

2. Preprocessing — Scaling Features

Many algorithms are sensitive to feature scale — a feature ranging from 0 to 100,000 dominates one ranging from 0 to 1. StandardScaler transforms features to zero mean and unit variance. The scaler is fit only on training data and then applied to both train and test sets.

# Feature scaling — StandardScaler

from sklearn.preprocessing import StandardScaler
import numpy as np

scaler = StandardScaler()

# Fit ONLY on training data — test data is unseen
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled  = scaler.transform(X_test)    # use the same scaler, do NOT refit

print("Before scaling — train feature means:", X_train.mean(axis=0).round(2))
print("After scaling  — train feature means:", X_train_scaled.mean(axis=0).round(2))
print("After scaling  — train feature stds: ", X_train_scaled.std(axis=0).round(2))
Before scaling — train feature means: [5.84 3.06 3.77 1.2 ]
After scaling  — train feature means: [ 0.  0. -0.  0.]
After scaling  — train feature stds:  [1. 1. 1. 1.]
  • fit_transform(X_train) — learns the mean and std from training data, then scales it
  • transform(X_test) — applies the training statistics to test data — never refit on test data
  • Fitting on test data is called data leakage — it gives the model information it would not have in production
  • Other scalers: MinMaxScaler (scales to 0–1), RobustScaler (uses median — better for outliers)
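As a quick illustration of the difference, this sketch (toy numbers, not the lesson's dataset) runs the same column with one large outlier through all three scalers — note how the outlier squashes the MinMaxScaler output while RobustScaler keeps the inliers spread out.

```python
# Comparing StandardScaler, MinMaxScaler and RobustScaler on an outlier-heavy column
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

data = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])   # 100 is an outlier

for scaler in (StandardScaler(), MinMaxScaler(), RobustScaler()):
    scaled = scaler.fit_transform(data).ravel()
    print(f"{type(scaler).__name__:15}", scaled.round(2))
```

MinMaxScaler maps the outlier to 1.0 and crushes the four inliers into [0, 0.03]; RobustScaler, which centres on the median and scales by the IQR, maps the inliers to -1, -0.5, 0, 0.5 and lets only the outlier land far out — which is why it is the better choice when outliers are present.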

3. Training a Classifier

scikit-learn's consistent API means every model has the same three methods: fit(), predict(), and score(). Learning one model means you know the interface for all of them.

Real-world use: email spam detection, customer churn prediction, medical diagnosis, image classification, fraud detection — all classification problems.

# Training a classifier — Random Forest

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
import numpy as np

# Train the model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train_scaled, y_train)

# Predict on unseen test data
y_pred = model.predict(X_test_scaled)

# Accuracy — fraction of correct predictions
accuracy = model.score(X_test_scaled, y_test)
print(f"Accuracy: {accuracy:.2%}\n")

# Classification report — precision, recall, F1 per class
print(classification_report(y_test, y_pred, target_names=iris.target_names))

# Confusion matrix — rows=actual, columns=predicted
cm = confusion_matrix(y_test, y_pred)
print("Confusion matrix:")
print(cm)
Accuracy: 100.00%

              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        10
  versicolor       1.00      1.00      1.00        10
   virginica       1.00      1.00      1.00        10

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30

Confusion matrix:
[[10  0  0]
 [ 0 10  0]
 [ 0  0 10]]
  • Precision — of all predicted positives, how many were actually positive
  • Recall — of all actual positives, how many were correctly predicted
  • F1-score — harmonic mean of precision and recall — the go-to metric for imbalanced classes
  • The confusion matrix diagonal shows correct predictions; off-diagonal shows which classes are confused with which
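The distinction between these metrics matters most on imbalanced data. A small illustrative example with made-up labels (where 1 might mean "fraud"): accuracy looks respectable, while precision and recall reveal that a third of the flagged cases are wrong and a third of the real positives are missed.

```python
# Precision vs recall on a tiny imbalanced example — 1 = positive (e.g. fraud)
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]   # 2 TP, 1 FP, 1 FN

print("Accuracy: ", accuracy_score(y_true, y_pred))   # 0.8 — looks fine
print("Precision:", precision_score(y_true, y_pred))  # 2/(2+1) ≈ 0.67
print("Recall:   ", recall_score(y_true, y_pred))     # 2/(2+1) ≈ 0.67
print("F1:       ", f1_score(y_true, y_pred))
```

A model that predicted all zeros here would score 70% accuracy with zero recall — which is exactly why F1, not accuracy, is the go-to metric when classes are imbalanced.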

4. Cross-Validation — Honest Evaluation

A single train/test split can be lucky or unlucky depending on which samples end up in each set. Cross-validation evaluates the model on multiple different splits and averages the scores — a much more reliable picture of real-world performance.

# K-Fold Cross-Validation — more reliable than a single split

from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
import numpy as np

iris = load_iris()
X, y = iris.data, iris.target

# Pipeline — chains preprocessing and model into one object
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model",  RandomForestClassifier(n_estimators=100, random_state=42))
])

# 5-fold cross-validation — splits data 5 ways, evaluates on each
scores = cross_val_score(pipe, X, y, cv=5, scoring="accuracy")

print("CV scores:  ", scores.round(4))
print(f"Mean:       {scores.mean():.4f}")
print(f"Std:        {scores.std():.4f}")
print(f"95% CI:     {scores.mean():.4f} ± {scores.std() * 2:.4f}")
CV scores:   [0.9667 1.     0.9333 0.9667 1.    ]
Mean:       0.9733
Std:        0.0249
95% CI:     0.9733 ± 0.0499
  • A Pipeline chains transformers and a model — the scaler is refit separately for each fold, preventing data leakage
  • 5-fold CV trains five models — each time using 80% of data for training and 20% for evaluation, rotating the held-out fold
  • Report mean ± std — a large std means the model is sensitive to which data it sees
  • Use cross_val_score early to sanity-check before committing to expensive hyperparameter tuning
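The tuning step from the workflow (step 6) can be sketched with GridSearchCV, which wraps this same pipeline, tries every combination in a parameter grid, and cross-validates each candidate. The grid values below are arbitrary choices for illustration; note the "step__param" naming convention for addressing parameters inside a Pipeline.

```python
# Hyperparameter tuning with GridSearchCV — exhaustive search, 5-fold CV per candidate
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

iris = load_iris()

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model",  RandomForestClassifier(random_state=42)),
])

# Pipeline step parameters are addressed as "<step_name>__<param_name>"
param_grid = {
    "model__n_estimators": [50, 100],
    "model__max_depth":    [None, 3],
}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(iris.data, iris.target)

print("Best params:     ", search.best_params_)
print(f"Best CV accuracy: {search.best_score_:.4f}")
```

After fitting, search.best_estimator_ is a ready-to-use pipeline refit on all the data with the winning parameters — the natural object to pass to joblib.dump later.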

5. Regression — Predicting Continuous Values

Regression predicts a number rather than a category. The workflow is identical — only the model and evaluation metrics change.

Real-world use: predicting house prices, forecasting sales, estimating delivery time, scoring credit risk.

# Regression — predicting house prices

from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_absolute_error, r2_score
import numpy as np

housing = fetch_california_housing()
X, y = housing.data, housing.target   # y = median house value in $100k

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s  = scaler.transform(X_test)

model = GradientBoostingRegressor(n_estimators=100, random_state=42)
model.fit(X_train_s, y_train)

y_pred = model.predict(X_test_s)

mae = mean_absolute_error(y_test, y_pred)
r2  = r2_score(y_test, y_pred)

print(f"Mean Absolute Error: ${mae * 100_000:,.0f}")
print(f"R² Score:            {r2:.4f}")   # 1.0 = perfect, 0 = predicts mean

# Feature importance — which features matter most
importances = model.feature_importances_
for name, imp in sorted(zip(housing.feature_names, importances), key=lambda x: -x[1]):
    print(f"  {name:20} {imp:.4f}")
Mean Absolute Error: $37,824
R² Score:            0.8162

  MedInc               0.5271
  Latitude             0.1204
  Longitude            0.0994
  AveOccup             0.0874
  HouseAge             0.0612
  AveRooms             0.0521
  Population           0.0295
  AveBedrms            0.0229
  • MAE (Mean Absolute Error) — average absolute difference between predictions and actual values — in the same units as the target
  • R² — proportion of variance explained; 1.0 is perfect, 0.0 means the model is no better than always predicting the mean
  • feature_importances_ shows which features the model relied on most — useful for understanding and simplifying models
  • Other regression metrics: MSE (mean squared error), RMSE (root MSE), MAPE (mean absolute percentage error)
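For concreteness, here is a toy calculation (made-up numbers) of those metrics alongside MAE — note how the single large error (30) pulls RMSE above MAE, since squaring penalises big misses more heavily.

```python
# MAE vs MSE/RMSE/MAPE on toy predictions — errors are 10, -10, 30, -20
import numpy as np
from sklearn.metrics import (mean_absolute_error, mean_squared_error,
                             mean_absolute_percentage_error)

y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 330.0, 380.0])

mae  = mean_absolute_error(y_true, y_pred)
mse  = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)                                    # same units as the target
mape = mean_absolute_percentage_error(y_true, y_pred)  # scale-independent

print(f"MAE:  {mae:.2f}")    # 17.50
print(f"MSE:  {mse:.2f}")    # 375.00
print(f"RMSE: {rmse:.2f}")   # 19.36
print(f"MAPE: {mape:.2%}")   # 7.50%
```

MAE and RMSE are in the target's units, so they are easy to explain to stakeholders; MAPE is a percentage, which makes it comparable across targets of different scales but unreliable when true values are near zero.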

6. Unsupervised Learning — KMeans Clustering

Unsupervised learning finds structure in data without labels. KMeans groups samples into k clusters based on similarity.

Real-world use: customer segmentation, anomaly detection, document grouping, image compression.

# KMeans clustering — customer segmentation example

from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
import numpy as np

# Synthetic customer data: [annual_spend, visit_frequency]
rng = np.random.default_rng(42)
customers = np.vstack([
    rng.multivariate_normal([1000, 2],  [[50000, 0], [0, 1]], 50),   # low spend
    rng.multivariate_normal([5000, 12], [[50000, 0], [0, 4]], 50),   # mid spend
    rng.multivariate_normal([12000, 25],[[50000, 0], [0, 9]], 50),   # high spend
])

scaler    = StandardScaler()
customers_scaled = scaler.fit_transform(customers)

kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
kmeans.fit(customers_scaled)
labels = kmeans.labels_

# Summarise each cluster
for cluster_id in range(3):
    mask    = labels == cluster_id
    segment = customers[mask]
    print(f"Cluster {cluster_id}: {mask.sum()} customers | "
          f"avg spend ${segment[:,0].mean():,.0f} | "
          f"avg visits {segment[:,1].mean():.1f}/month")
Cluster 0: 50 customers | avg spend $1,024 | avg visits 2.0/month
Cluster 1: 50 customers | avg spend $4,987 | avg visits 12.1/month
Cluster 2: 50 customers | avg spend $12,043 | avg visits 25.1/month
  • Scale features before clustering — KMeans uses Euclidean distance, so unscaled features with larger ranges dominate
  • Choose k using the elbow method: plot inertia (kmeans.inertia_) for k=1 to 10 and find where it stops falling sharply
  • kmeans.labels_ gives the cluster assignment for each sample
  • kmeans.cluster_centers_ gives the centroid of each cluster in the scaled feature space
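The elbow method from the first bullet can be sketched on the same synthetic customer data — no plot needed, just the inertia values for increasing k:

```python
# Elbow method sketch — inertia for k = 1..6 on the synthetic customer data above
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
customers = np.vstack([
    rng.multivariate_normal([1000, 2],  [[50000, 0], [0, 1]], 50),
    rng.multivariate_normal([5000, 12], [[50000, 0], [0, 4]], 50),
    rng.multivariate_normal([12000, 25],[[50000, 0], [0, 9]], 50),
])
customers_scaled = StandardScaler().fit_transform(customers)

inertias = []
for k in range(1, 7):
    km = KMeans(n_clusters=k, random_state=42, n_init=10).fit(customers_scaled)
    inertias.append(km.inertia_)
    print(f"k={k}: inertia={km.inertia_:.1f}")
# Inertia falls sharply up to k=3, then flattens — the "elbow" suggests k=3,
# matching the three groups the data was generated from.
```

Because the data is standardised, the k=1 inertia is exactly the number of samples times the number of features (150 × 2 = 300); every extra cluster can only reduce it, and the useful signal is where the reduction stops being dramatic.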

7. Saving and Loading Models

After training, save the model to disk so it can be loaded and used in production without retraining.

# Saving and loading models with joblib

import joblib
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model",  RandomForestClassifier(n_estimators=100, random_state=42))
])
pipe.fit(X_train, y_train)

# Save the entire pipeline (scaler + model) to disk
joblib.dump(pipe, "iris_pipeline.joblib")
print("Model saved.")

# Load in production — no need to retrain or refit scaler
loaded_pipe = joblib.load("iris_pipeline.joblib")
predictions = loaded_pipe.predict(X_test)
print("Loaded model accuracy:", loaded_pipe.score(X_test, y_test))
Model saved.
Loaded model accuracy: 1.0
  • Save the entire Pipeline — not just the model — so the scaler is saved alongside it
  • joblib is faster than Python's pickle for large NumPy arrays and is the scikit-learn recommendation
  • In production, load the pipeline once at startup and call predict() on incoming data

Summary Table

Step             Tool                              Key Call
Split data       train_test_split                  train_test_split(X, y, test_size=0.2, stratify=y)
Scale features   StandardScaler                    scaler.fit_transform(X_train)
Train model      Any estimator                     model.fit(X_train, y_train)
Evaluate         classification_report, r2_score   model.score(X_test, y_test)
Cross-validate   cross_val_score                   cross_val_score(pipe, X, y, cv=5)
Pipeline         Pipeline                          Pipeline([("scaler", ...), ("model", ...)])
Save / load      joblib                            joblib.dump(pipe, "model.joblib")

Practice Questions

Practice 1. Why must the scaler be fit only on training data and not on the full dataset?



Practice 2. What is the F1-score and when is it preferred over accuracy?



Practice 3. What does stratify=y do in train_test_split?



Practice 4. Why is saving a Pipeline preferable to saving just the model?



Practice 5. What does an R² score of 0.0 mean for a regression model?



Quiz

Quiz 1. What does model.fit(X_train, y_train) do?






Quiz 2. What advantage does 5-fold cross-validation have over a single train/test split?






Quiz 3. In the confusion matrix, what do the off-diagonal values represent?






Quiz 4. Why must features be scaled before running KMeans clustering?






Quiz 5. Which library is recommended over Python's pickle for saving scikit-learn models?






Next up — Mini Project: putting everything together in a complete end-to-end Python project.