Python Course
Machine Learning with Python
Machine learning is the practice of building systems that learn patterns from data rather than following explicitly programmed rules. Python dominates this space, and scikit-learn is the essential toolkit — a consistent, well-documented library covering the full supervised and unsupervised learning workflow: preprocessing, model selection, training, evaluation, and deployment.
This lesson walks through the entire ML workflow from raw data to a production-ready model, covering classification, regression, clustering, and the evaluation techniques that separate good models from overfit ones.
The Machine Learning Workflow
Every ML project follows the same core sequence regardless of the algorithm used.
- Collect and understand data — know your features, target, and data quality
- Preprocess — handle missing values, encode categories, scale features
- Split — separate training data from test data so evaluation is honest
- Train — fit a model on training data
- Evaluate — measure performance on held-out test data
- Tune — improve the model through hyperparameter search
- Deploy — save the model for production use
1. Loading Data and Splitting
scikit-learn ships several built-in datasets for learning. train_test_split is the first essential step — you must never evaluate a model on the data it trained on.
```python
# Loading data and train/test split
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import pandas as pd

# Load the classic Iris dataset — 150 flowers, 4 features, 3 species
iris = load_iris()
X = iris.data    # features: sepal length/width, petal length/width
y = iris.target  # labels: 0=setosa, 1=versicolor, 2=virginica

print("Dataset shape:", X.shape)  # (150, 4)
print("Feature names:", iris.feature_names)
print("Classes:", iris.target_names)
print("Label counts:", pd.Series(y).value_counts().to_dict())

# Split — 80% train, 20% test, stratified so class balance is preserved
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y,  # preserves class proportions in both splits
)

print(f"\nTraining set: {X_train.shape[0]} samples")
print(f"Test set: {X_test.shape[0]} samples")
```

Output:

```
Dataset shape: (150, 4)
Feature names: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Classes: ['setosa' 'versicolor' 'virginica']
Label counts: {0: 50, 1: 50, 2: 50}

Training set: 120 samples
Test set: 30 samples
```
- `X` is the feature matrix — shape `(n_samples, n_features)`; `y` is the target vector — shape `(n_samples,)`
- `random_state=42` makes the split reproducible — same split every run
- `stratify=y` ensures each class appears proportionally in train and test sets — essential for imbalanced datasets
- Never evaluate on training data — it gives an optimistic, misleading picture of real-world performance
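To see what stratification buys you, here is a small sketch on a deliberately imbalanced toy dataset (the 90/10 class split and the variable names are made up for illustration):

```python
# Stratified splitting on an imbalanced toy dataset: 90% class 0, 10% class 1
import numpy as np
from sklearn.model_selection import train_test_split

X_toy = np.arange(100).reshape(-1, 1)  # 100 samples, 1 feature
y_toy = np.array([0] * 90 + [1] * 10)  # the rare class has only 10 samples

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy
)

# Both splits keep the original 90/10 ratio: 8 rare samples in train, 2 in test
print("train class-1 fraction:", (y_tr == 1).mean())  # 0.1
print("test class-1 fraction: ", (y_te == 1).mean())  # 0.1
```

Without `stratify`, a shuffle could put zero or four of the ten rare samples into the test set, and the evaluation would be skewed accordingly.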
2. Preprocessing — Scaling Features
Many algorithms are sensitive to feature scale — a feature ranging from 0 to 100,000 dominates one ranging from 0 to 1. StandardScaler transforms features to zero mean and unit variance. The scaler is fit only on training data and then applied to both train and test sets.
```python
# Feature scaling — StandardScaler
from sklearn.preprocessing import StandardScaler
import numpy as np

scaler = StandardScaler()

# Fit ONLY on training data — test data is unseen
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # use the same scaler, do NOT refit

print("Before scaling — train feature means:", X_train.mean(axis=0).round(2))
print("After scaling — train feature means:", X_train_scaled.mean(axis=0).round(2))
print("After scaling — train feature stds: ", X_train_scaled.std(axis=0).round(2))
```

Output:

```
After scaling — train feature means: [ 0.  0. -0.  0.]
After scaling — train feature stds:  [1. 1. 1. 1.]
```
- `fit_transform(X_train)` — learns the mean and std from training data, then scales it
- `transform(X_test)` — applies the training statistics to test data; never refit on test data
- Fitting on test data is called data leakage — it gives the model information it would not have in production
- Other scalers: `MinMaxScaler` (scales to 0–1), `RobustScaler` (uses the median and IQR — better for outliers)
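The difference between the scalers is easiest to see on a tiny made-up array containing one outlier:

```python
# How MinMaxScaler and RobustScaler treat an outlier (values are illustrative)
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler

X_out = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])  # 100 is an outlier

mm = MinMaxScaler().fit_transform(X_out).ravel()
rs = RobustScaler().fit_transform(X_out).ravel()

print("MinMax:", mm.round(3))  # outlier squashes the normal values toward 0
print("Robust:", rs.round(3))  # median maps to 0, IQR to 1; normal values keep their spread
```

`MinMaxScaler` maps the outlier to 1 and everything else to nearly 0, while `RobustScaler` keeps the four normal values nicely spread between -1 and 0.5 because its statistics ignore the extremes.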
3. Training a Classifier
scikit-learn's consistent API means every model has the same three methods: fit(), predict(), and score(). Learning one model means you know the interface for all of them.
Real-world use: email spam detection, customer churn prediction, medical diagnosis, image classification, fraud detection — all classification problems.
```python
# Training a classifier — Random Forest
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
import numpy as np

# Train the model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train_scaled, y_train)

# Predict on unseen test data
y_pred = model.predict(X_test_scaled)

# Accuracy — fraction of correct predictions
accuracy = model.score(X_test_scaled, y_test)
print(f"Accuracy: {accuracy:.2%}\n")

# Classification report — precision, recall, F1 per class
print(classification_report(y_test, y_pred, target_names=iris.target_names))

# Confusion matrix — rows=actual, columns=predicted
cm = confusion_matrix(y_test, y_pred)
print("Confusion matrix:")
print(cm)
```

Output:

```
Accuracy: 100.00%

              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        10
  versicolor       1.00      1.00      1.00        10
   virginica       1.00      1.00      1.00        10

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30

Confusion matrix:
[[10  0  0]
 [ 0 10  0]
 [ 0  0 10]]
```
- Precision — of all predicted positives, how many were actually positive
- Recall — of all actual positives, how many were correctly predicted
- F1-score — harmonic mean of precision and recall — the go-to metric for imbalanced classes
- The confusion matrix diagonal shows correct predictions; off-diagonal shows which classes are confused with which
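All of the report's numbers can be derived by hand from a confusion matrix. A quick sketch, using a hypothetical two-class matrix rather than the perfect Iris one above:

```python
# Deriving precision, recall and F1 from a confusion matrix by hand
import numpy as np

# Hypothetical binary matrix: rows = actual class, columns = predicted class
cm = np.array([[50, 10],
               [ 5, 35]])

for c in range(cm.shape[0]):
    tp = cm[c, c]
    precision = tp / cm[:, c].sum()  # column sum = everything predicted as c
    recall = tp / cm[c, :].sum()     # row sum = everything actually c
    f1 = 2 * precision * recall / (precision + recall)
    print(f"class {c}: precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Here class 0 has precision 50/55 but recall only 50/60, because ten actual class-0 samples were misclassified; `classification_report` computes exactly these ratios.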
4. Cross-Validation — Honest Evaluation
A single train/test split can be lucky or unlucky depending on which samples end up in each set. Cross-validation evaluates the model on multiple different splits and averages the scores — a much more reliable picture of real-world performance.
```python
# K-Fold Cross-Validation — more reliable than a single split
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
import numpy as np

iris = load_iris()
X, y = iris.data, iris.target

# Pipeline — chains preprocessing and model into one object
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", RandomForestClassifier(n_estimators=100, random_state=42))
])

# 5-fold cross-validation — splits data 5 ways, evaluates on each
scores = cross_val_score(pipe, X, y, cv=5, scoring="accuracy")
print("CV scores: ", scores.round(4))
print(f"Mean: {scores.mean():.4f}")
print(f"Std: {scores.std():.4f}")
print(f"95% CI: {scores.mean():.4f} ± {scores.std() * 2:.4f}")
```

Output:

```
Mean: 0.9733
Std: 0.0249
95% CI: 0.9733 ± 0.0499
```
- A `Pipeline` chains transformers and a model — the scaler is refit separately for each fold, preventing data leakage
- 5-fold CV trains five models — each time using 80% of data for training and 20% for evaluation, rotating the held-out fold
- Report mean ± std — a large std means the model is sensitive to which data it sees
- Use `cross_val_score` early to sanity-check before committing to expensive hyperparameter tuning
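The tuning step from the workflow can build directly on this pipeline. A minimal sketch using `GridSearchCV` — the grid values below are arbitrary examples, not recommendations:

```python
# Hyperparameter tuning with GridSearchCV over a Pipeline
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

iris = load_iris()

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", RandomForestClassifier(random_state=42)),
])

# Pipeline parameters are addressed as "<step name>__<parameter>"
param_grid = {
    "model__n_estimators": [50, 100],
    "model__max_depth": [None, 3],
}

# Every grid combination is scored with 5-fold CV; the best one is refit on all data
search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(iris.data, iris.target)

print("Best params:", search.best_params_)
print(f"Best CV accuracy: {search.best_score_:.4f}")
# search.best_estimator_ is the refit pipeline, ready to predict
```

Because the search cross-validates the whole pipeline, the scaler is refit inside each fold, so tuning stays leakage-free too.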
5. Regression — Predicting Continuous Values
Regression predicts a number rather than a category. The workflow is identical — only the model and evaluation metrics change.
Real-world use: predicting house prices, forecasting sales, estimating delivery time, scoring credit risk.
```python
# Regression — predicting house prices
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_absolute_error, r2_score
import numpy as np

housing = fetch_california_housing()
X, y = housing.data, housing.target  # y = median house value in $100k

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

model = GradientBoostingRegressor(n_estimators=100, random_state=42)
model.fit(X_train_s, y_train)
y_pred = model.predict(X_test_s)

mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"Mean Absolute Error: ${mae * 100_000:,.0f}")
print(f"R² Score: {r2:.4f}")  # 1.0 = perfect, 0 = predicts mean

# Feature importance — which features matter most
importances = model.feature_importances_
for name, imp in sorted(zip(housing.feature_names, importances), key=lambda x: -x[1]):
    print(f"  {name:20} {imp:.4f}")
```

Output:

```
R² Score: 0.8162
  MedInc               0.5271
  Latitude             0.1204
  Longitude            0.0994
  AveOccup             0.0874
  HouseAge             0.0612
  AveRooms             0.0521
  Population           0.0295
  AveBedrms            0.0229
```
- MAE (Mean Absolute Error) — average absolute difference between predictions and actual values — in the same units as the target
- R² — proportion of variance explained; 1.0 is perfect, 0.0 means the model is no better than always predicting the mean
- `feature_importances_` shows which features the model relied on most — useful for understanding and simplifying models
- Other regression metrics: MSE (mean squared error), RMSE (root MSE), MAPE (mean absolute percentage error)
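How those extra metrics relate to each other, sketched on a few made-up numbers:

```python
# MSE, RMSE and MAPE on a tiny made-up example
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_percentage_error

y_true = np.array([2.0, 3.0, 5.0, 4.0])
y_pred = np.array([2.5, 2.5, 5.0, 5.0])

mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)  # RMSE is simply the square root of MSE
mape = mean_absolute_percentage_error(y_true, y_pred)

print(f"MSE:  {mse:.4f}")   # squared units — penalises large errors heavily
print(f"RMSE: {rmse:.4f}")  # back in the target's units, like MAE
print(f"MAPE: {mape:.2%}")  # scale-free percentage error
```

RMSE is usually preferred over MAE when large errors are disproportionately costly; MAPE is handy when targets span very different scales, but it breaks down near zero-valued targets.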
6. Unsupervised Learning — KMeans Clustering
Unsupervised learning finds structure in data without labels. KMeans groups samples into k clusters based on similarity.
Real-world use: customer segmentation, anomaly detection, document grouping, image compression.
```python
# KMeans clustering — customer segmentation example
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
import numpy as np

# Synthetic customer data: [annual_spend, visit_frequency]
rng = np.random.default_rng(42)
customers = np.vstack([
    rng.multivariate_normal([1000, 2],   [[50000, 0], [0, 1]], 50),  # low spend
    rng.multivariate_normal([5000, 12],  [[50000, 0], [0, 4]], 50),  # mid spend
    rng.multivariate_normal([12000, 25], [[50000, 0], [0, 9]], 50),  # high spend
])

scaler = StandardScaler()
customers_scaled = scaler.fit_transform(customers)

kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
kmeans.fit(customers_scaled)
labels = kmeans.labels_

# Summarise each cluster
for cluster_id in range(3):
    mask = labels == cluster_id
    segment = customers[mask]
    print(f"Cluster {cluster_id}: {mask.sum()} customers | "
          f"avg spend ${segment[:, 0].mean():,.0f} | "
          f"avg visits {segment[:, 1].mean():.1f}/month")
```

Output:

```
Cluster 1: 50 customers | avg spend $4,987 | avg visits 12.1/month
Cluster 2: 50 customers | avg spend $12,043 | avg visits 25.1/month
```
- Scale features before clustering — KMeans uses Euclidean distance, so unscaled features with larger ranges dominate
- Choose `k` using the elbow method: plot inertia (`kmeans.inertia_`) for k=1 to 10 and find where it stops falling sharply
- `kmeans.labels_` gives the cluster assignment for each sample
- `kmeans.cluster_centers_` gives the centroid of each cluster in the scaled feature space
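The elbow method can be sketched on the same synthetic customer data — here printing inertia values instead of plotting them:

```python
# Elbow method — inertia for k = 1..6 on the scaled customer data
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Regenerate the synthetic customers so this snippet is self-contained
rng = np.random.default_rng(42)
customers = np.vstack([
    rng.multivariate_normal([1000, 2],   [[50000, 0], [0, 1]], 50),
    rng.multivariate_normal([5000, 12],  [[50000, 0], [0, 4]], 50),
    rng.multivariate_normal([12000, 25], [[50000, 0], [0, 9]], 50),
])
X_scaled = StandardScaler().fit_transform(customers)

inertias = []
for k in range(1, 7):
    km = KMeans(n_clusters=k, random_state=42, n_init=10).fit(X_scaled)
    inertias.append(km.inertia_)
    print(f"k={k}: inertia={km.inertia_:.1f}")

# Inertia always falls as k grows; the "elbow" is where the drop levels off,
# which for this data should be around k=3, the true number of generated groups
```

Inertia alone never tells you to stop (it keeps shrinking all the way to k = n), which is why you look for the bend rather than the minimum; silhouette score is a common complementary check.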
7. Saving and Loading Models
After training, save the model to disk so it can be loaded and used in production without retraining.
```python
# Saving and loading models with joblib
import joblib
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", RandomForestClassifier(n_estimators=100, random_state=42))
])
pipe.fit(X_train, y_train)

# Save the entire pipeline (scaler + model) to disk
joblib.dump(pipe, "iris_pipeline.joblib")
print("Model saved.")

# Load in production — no need to retrain or refit scaler
loaded_pipe = joblib.load("iris_pipeline.joblib")
predictions = loaded_pipe.predict(X_test)
print("Loaded model accuracy:", loaded_pipe.score(X_test, y_test))
```

Output:

```
Model saved.
Loaded model accuracy: 1.0
```
- Save the entire `Pipeline` — not just the model — so the scaler is saved alongside it
- `joblib` is faster than Python's `pickle` for large NumPy arrays and is the scikit-learn recommendation
- In production, load the pipeline once at startup and call `predict()` on incoming data
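What the production call site might look like — a sketch that fits the pipeline in memory rather than loading it from disk, scoring one made-up flower measurement:

```python
# Scoring a single new sample with a fitted pipeline
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

iris = load_iris()
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", RandomForestClassifier(n_estimators=100, random_state=42)),
])
pipe.fit(iris.data, iris.target)

# One flower: sepal length, sepal width, petal length, petal width (cm).
# Note the 2-D shape: predict() expects (n_samples, n_features)
new_flower = np.array([[5.1, 3.5, 1.4, 0.2]])

pred = pipe.predict(new_flower)[0]
proba = pipe.predict_proba(new_flower)[0]
print("Predicted species:", iris.target_names[pred])
print("Class probabilities:", proba.round(3))
```

The pipeline applies the saved scaling statistics before predicting, so the caller never touches the scaler; with a `joblib`-loaded pipeline the call is identical.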
Summary Table
| Step | Tool | Key Call |
|---|---|---|
| Split data | `train_test_split` | `train_test_split(X, y, test_size=0.2, stratify=y)` |
| Scale features | `StandardScaler` | `scaler.fit_transform(X_train)` |
| Train model | Any estimator | `model.fit(X_train, y_train)` |
| Evaluate | `classification_report`, `r2_score` | `model.score(X_test, y_test)` |
| Cross-validate | `cross_val_score` | `cross_val_score(pipe, X, y, cv=5)` |
| Pipeline | `Pipeline` | `Pipeline([("scaler", ...), ("model", ...)])` |
| Save / load | `joblib` | `joblib.dump(pipe, "model.joblib")` |
Practice Questions
Practice 1. Why must the scaler be fit only on training data and not on the full dataset?
Practice 2. What is the F1-score and when is it preferred over accuracy?
Practice 3. What does stratify=y do in train_test_split?
Practice 4. Why is saving a Pipeline preferable to saving just the model?
Practice 5. What does an R² score of 0.0 mean for a regression model?
Quiz
Quiz 1. What does model.fit(X_train, y_train) do?
Quiz 2. What advantage does 5-fold cross-validation have over a single train/test split?
Quiz 3. In the confusion matrix, what do the off-diagonal values represent?
Quiz 4. Why must features be scaled before running KMeans clustering?
Quiz 5. Which library is recommended over Python's pickle for saving scikit-learn models?
Next up — Mini Project: putting everything together in a complete end-to-end Python project.