Data Science Lesson 45 – Scikit-Learn | Dataplexa
Machine Learning · Lesson 45

Scikit-Learn

Build machine learning models from scratch — classification, regression, clustering — using Python's most trusted ML library with real ecommerce data.

What Makes Scikit-Learn Special

Think of scikit-learn as the Swiss Army knife of machine learning. Where pandas cleans data and matplotlib shows it, scikit-learn actually predicts the future. Want to know which customers will return products? Build a classifier. Need to predict tomorrow's revenue? Train a regression model.

Real companies use this daily. Flipkart predicts which products you'll buy. Swiggy estimates delivery times. Ola calculates surge pricing. Systems like these are often built on libraries such as scikit-learn, whose consistent API design means every algorithm works the same way.

Core Philosophy

Scikit-learn follows a simple pattern: fit() trains the model on data, predict() makes forecasts. Whether it's linear regression or neural networks, the interface stays identical.
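The pattern is easiest to see side by side. Here is a minimal sketch with a tiny made-up dataset (four rows standing in for real orders): the exact same fit()/predict() calls work for two very different estimators.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

# Tiny made-up dataset: (price, quantity) -> rating
X = np.array([[100, 1], [200, 2], [300, 3], [400, 4]])
y = np.array([3.0, 3.5, 4.0, 4.5])

# The same two calls work for every scikit-learn estimator
for est in (LinearRegression(), RandomForestRegressor(n_estimators=10, random_state=0)):
    est.fit(X, y)             # learn from the data
    preds = est.predict(X)    # make forecasts
    print(type(est).__name__, "->", preds.round(2))
```

Swapping algorithms means changing one line, not rewriting the pipeline.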

Setting Up Your First Model

The scenario: Myntra's data team needs to predict customer ratings based on product price and quantity. Revenue depends on happy customers, so this prediction drives inventory decisions.

# Import the core libraries we need for machine learning
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Load our ecommerce dataset 
df = pd.read_csv('dataplexa_ecommerce.csv')
print("Dataset shape:", df.shape)
print("\nFirst few rows:")
print(df.head())

What just happened?

We loaded 5,000 ecommerce orders, with rating as our target variable and unit_price and quantity as features. Try this: Check for missing values with df.isnull().sum()

Now we prepare the data for machine learning. Scikit-learn expects features in X (capital) and target in y (lowercase). This naming convention is universal across all algorithms.

# Select features that might influence customer rating
X = df[['unit_price', 'quantity']]

# Target variable - what we want to predict
y = df['rating']

print("Features shape:", X.shape)
print("Target shape:", y.shape)
print("\nFeature summary:")
print(X.describe())

What just happened?

We created feature matrix X with (5000, 2) shape and target vector y with (5000,). Price ranges ₹450-₹25,000 while quantity stays 1-10 items. Try this: Visualize with plt.scatter(X['unit_price'], y)

Training Your First Model

Machine learning follows a golden rule: never test on training data. We split our dataset into training (80%) and testing (20%) portions. The model learns from training data, then we evaluate on unseen test data.

# Split data into train/test sets (80/20 split)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print("Training set size:", X_train.shape)
print("Test set size:", X_test.shape)
print("Training targets range:", y_train.min(), "to", y_train.max())

What just happened?

We split 5,000 rows into 4,000 training and 1,000 test samples. The random_state=42 ensures reproducible results. Try this: Change random_state to get different splits.

Time to create and train the model. Linear regression finds the best line through our data points. Behind the scenes, it's solving mathematical equations to minimize prediction errors.

# Create a linear regression model instance
model = LinearRegression()

# Train the model on our training data
model.fit(X_train, y_train)

# Check what the model learned
print("Model coefficients:", model.coef_)
print("Model intercept:", model.intercept_)
print("Training complete!")

What just happened?

The model learned that higher prices slightly decrease ratings (-1.2e-05) while higher quantities increase them (0.185). Base rating starts at 4.01. Try this: Multiply coefficients by actual values to see their impact.
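You can verify the "Try this" yourself: a linear model's prediction is just the intercept plus the coefficient-weighted sum of the features. A sketch on synthetic data (random stand-ins for unit_price and quantity, since the real CSV isn't bundled here):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for (unit_price, quantity) -> rating
rng = np.random.default_rng(42)
X = np.column_stack([rng.uniform(450, 25000, 200), rng.integers(1, 11, 200)])
y = 4.0 - 1e-5 * X[:, 0] + 0.2 * X[:, 1] + rng.normal(0, 0.1, 200)

model = LinearRegression().fit(X, y)

# A prediction is just: intercept + sum(coef * feature)
row = X[0]
manual = model.intercept_ + np.dot(model.coef_, row)
auto = model.predict(row.reshape(1, -1))[0]
print(f"manual={manual:.4f}  predict()={auto:.4f}")
```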

Making Predictions and Measuring Accuracy

The moment of truth. Does our model actually work on unseen data? We'll make predictions on the test set and compare them with actual ratings. This shows real-world performance.

# Make predictions on test data
y_pred = model.predict(X_test)

# Show first 10 predictions vs actual ratings
comparison = pd.DataFrame({
    'Actual': y_test[:10].values,
    'Predicted': y_pred[:10]
})
print("Actual vs Predicted ratings:")
print(comparison.round(2))

What just happened?

Our model predicted ratings reasonably close to actual values. For example, actual 4.20 vs predicted 4.03 shows 0.17 difference. Try this: Check which predictions have the largest errors.
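To find the worst predictions, compute the absolute error per row and sort. A sketch with hypothetical actual/predicted values standing in for y_test and y_pred:

```python
import numpy as np
import pandas as pd

# Hypothetical ratings standing in for y_test and y_pred
actual = pd.Series([4.2, 3.8, 4.5, 2.9, 4.0])
predicted = np.array([4.03, 4.1, 3.6, 3.5, 4.05])

# Absolute error per row, largest first
errors = (actual - predicted).abs()
worst = pd.DataFrame({'Actual': actual, 'Predicted': predicted,
                      'AbsError': errors}).sort_values('AbsError', ascending=False)
print(worst.head())
```

On the real test set, inspecting the top rows often reveals a pattern, such as errors clustering at extreme prices.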

Raw comparisons don't tell the full story. We need metrics to judge model performance. Mean Squared Error (MSE) penalizes large errors more heavily than small ones, which matters when big misses are the costly ones.

# Calculate model accuracy metrics
mse = mean_squared_error(y_test, y_pred)
rmse = mse ** 0.5

# Calculate R-squared score (how well model explains variance)
score = model.score(X_test, y_test)

print(f"Mean Squared Error: {mse:.4f}")
print(f"Root Mean Squared Error: {rmse:.4f}")
print(f"R-squared Score: {score:.4f}")
print(f"Typical prediction error: ±{rmse:.2f} stars")

What just happened?

Our model predicts ratings within ±0.30 stars on average. R-squared of 0.42 means we explain 42% of rating variance — decent for two simple features. Try this: Add more features to improve accuracy.

Points closer to the red diagonal line indicate more accurate predictions

The scatter plot reveals model behavior. Points clustering near the diagonal red line show accurate predictions. Outliers far from the line indicate where price and quantity alone don't explain customer satisfaction — maybe product quality or delivery speed matter more.

For business decisions, this means Myntra can roughly predict customer satisfaction from order details. But the model misses 58% of variance, suggesting other factors like product reviews or brand reputation drive ratings significantly.
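If you want to reproduce the scatter plot described above, here is a minimal matplotlib sketch (with synthetic stand-ins for y_test and y_pred, writing to a file instead of opening a window):

```python
import matplotlib
matplotlib.use('Agg')  # render off-screen so the script runs anywhere
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-ins for y_test and y_pred from the regression above
rng = np.random.default_rng(0)
y_test = rng.uniform(1, 5, 100)
y_pred = y_test + rng.normal(0, 0.3, 100)

plt.scatter(y_test, y_pred, alpha=0.5)
# Red diagonal: a perfect model would put every point on this line
plt.plot([1, 5], [1, 5], 'r--')
plt.xlabel('Actual rating')
plt.ylabel('Predicted rating')
plt.savefig('actual_vs_predicted.png')
```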

Classification Models

Regression predicts numbers. But what about yes/no questions? Will this customer return their order? That's classification territory. Same scikit-learn pattern, different algorithms.

The scenario: Zomato's operations team needs to predict food order returns based on price and rating. High return rates hurt restaurant partnerships and customer experience.

# Import classification algorithms and metrics
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# Prepare features and target for classification
X_class = df[['unit_price', 'rating']]
y_class = df['returned']  # Boolean: True/False

print("Return rate:", y_class.mean().round(3))
print("Total returns:", y_class.sum())

# Split data for classification task
X_train_c, X_test_c, y_train_c, y_test_c = train_test_split(
    X_class, y_class, test_size=0.2, random_state=42
)

# Create and train logistic regression classifier
clf = LogisticRegression(max_iter=1000)  # extra iterations help convergence with unscaled prices
clf.fit(X_train_c, y_train_c)

# Make predictions (returns probabilities and classes)
y_pred_c = clf.predict(X_test_c)
print("Classification model trained!")

# Evaluate classification performance
accuracy = accuracy_score(y_test_c, y_pred_c)
report = classification_report(y_test_c, y_pred_c)

print(f"Classification Accuracy: {accuracy:.3f}")
print("\nDetailed Report:")
print(report)

What just happened?

Our classifier achieves 82.4% accuracy predicting returns. It correctly identifies non-returns 93% of the time but only catches 50% of actual returns. Try this: Adjust class weights to catch more returns.

Model correctly predicts 824 out of 1,000 test cases (82.4% accuracy)

📊 Data Insight

The model struggles with returns because they're only 18.6% of orders. In imbalanced datasets like this, high accuracy can be misleading — it mostly just predicts "no return" for everything. Consider precision/recall tradeoffs for business decisions.
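One common fix for the imbalance is class_weight='balanced', which tells LogisticRegression to weight the rare class more heavily during training. A sketch on synthetic imbalanced data (roughly 19% positives, similar to the 18.6% return rate):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

# Synthetic imbalanced labels (~19% positives) standing in for 'returned'
rng = np.random.default_rng(42)
X = rng.normal(size=(2000, 2))
y = (X[:, 0] + rng.normal(0, 1.5, 2000) > 1.6).astype(int)

plain = LogisticRegression().fit(X, y)
balanced = LogisticRegression(class_weight='balanced').fit(X, y)

# The balanced model trades some accuracy for catching more positives
print("plain recall:   ", round(recall_score(y, plain.predict(X)), 3))
print("balanced recall:", round(recall_score(y, balanced.predict(X)), 3))
```

Expect overall accuracy to dip while recall on the minority class rises, which is usually the right trade for a returns model.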

Clustering and Unsupervised Learning

Sometimes you don't have target labels. You just want to find patterns. Clustering groups similar customers together without knowing the "right" answer beforehand.

The scenario: BigBasket wants to segment customers by purchasing behavior. No predefined categories — just find natural groups to tailor marketing campaigns.

# Import clustering algorithm
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Prepare features for clustering (price and quantity behavior)
X_cluster = df[['unit_price', 'quantity', 'rating']]

# Scale features so all have equal weight
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_cluster)
print("Features scaled for clustering")
print("Original shape:", X_cluster.shape)

# Create K-means clustering with 4 customer segments
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)  # explicit n_init keeps results stable across sklearn versions
clusters = kmeans.fit_predict(X_scaled)

# Add cluster labels back to original data
df['customer_segment'] = clusters

# Analyze each cluster's characteristics
segment_summary = df.groupby('customer_segment').agg({
    'unit_price': 'mean',
    'quantity': 'mean', 
    'rating': 'mean'
}).round(2)
print("Customer Segments:")
print(segment_summary)

# Check cluster sizes and distribution
cluster_counts = df['customer_segment'].value_counts().sort_index()
print("Customers per segment:")
print(cluster_counts)

# Calculate percentage distribution
percentages = (cluster_counts / len(df) * 100).round(1)
print("\nPercentage distribution:")
for i in range(4):
    print(f"Segment {i}: {percentages[i]}%")

What just happened?

K-means found 4 customer segments with roughly equal sizes (~25% each). However, the segments show very similar behavior patterns — this might indicate our features don't capture enough customer diversity. Try this: Add customer_age or product_category to improve segmentation.

Customer segments show minimal differences — additional features needed for meaningful segmentation

The clustering reveals an important lesson: not all data naturally clusters. These segments are too similar to drive different marketing strategies. In real projects, you'd add demographic data, purchase history, or seasonal patterns to find meaningful customer groups.

Common Clustering Mistake

Blindly choosing the number of clusters without domain knowledge. Use the "elbow method" to find optimal cluster counts, but always validate if segments make business sense. Four mathematical clusters mean nothing if customers behave identically.
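Here is the elbow method in practice: fit K-means for several values of k and watch the inertia (within-cluster sum of squares). This sketch uses synthetic data with three obvious blobs, so the elbow lands at k=3:

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data: three well-separated blobs standing in for scaled features
rng = np.random.default_rng(42)
blobs = [rng.normal(loc=c, scale=0.4, size=(100, 2)) for c in ((0, 0), (4, 0), (2, 4))]
X = np.vstack(blobs)

# Elbow method: inertia drops sharply until k matches the true group count
inertias = []
for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(km.inertia_)
    print(f"k={k}: inertia={km.inertia_:.1f}")
```

The big drops stop once k matches the real number of groups; past that, extra clusters only shave off noise.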

Model Selection and Cross-Validation

Which algorithm works best for your data? Random forests? Support Vector Machines? Cross-validation tests multiple models systematically without peeking at test data.

# Import different algorithms to compare
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.model_selection import cross_val_score
import numpy as np

# Test multiple models on the same dataset
models = {
    'Linear Regression': LinearRegression(),
    'Random Forest': RandomForestRegressor(random_state=42),
    'Support Vector Regression': SVR()
}

# Use 5-fold cross-validation to test each model
results = {}
for name, model in models.items():
    # Split data 5 ways, train on 4, test on 1, repeat
    scores = cross_val_score(model, X, y, cv=5, scoring='r2')
    results[name] = scores
    print(f"{name}: {scores.mean():.3f} ± {scores.std():.3f}")

print("\nBest model:", max(results, key=lambda x: results[x].mean()))

What just happened?

Cross-validation tested each model 5 times with different data splits. Random Forest achieved the highest R² score (0.456 ± 0.031), meaning it explains 45.6% of rating variance on average. The ± shows consistency across different data samples. Try this: Test with 10-fold CV for more robust estimates.

Random Forest won because it handles non-linear relationships better than linear regression. It creates multiple decision trees and averages their predictions, capturing complex patterns between price, quantity, and ratings that a straight line misses.
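You can see the averaging directly: a RandomForestRegressor exposes its trees via estimators_, and the forest's prediction equals the mean of the individual trees' predictions. A quick sketch on synthetic data:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic non-linear data: a straight line can't capture x0 * x1
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = X[:, 0] * X[:, 1] + rng.normal(0, 0.1, 200)

rf = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# The forest's prediction is the average of its individual trees
tree_preds = np.stack([tree.predict(X[:5]) for tree in rf.estimators_])
print("Forest:       ", rf.predict(X[:5]).round(3))
print("Mean of trees:", tree_preds.mean(axis=0).round(3))
```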

Cross-validation prevents a critical mistake: choosing models based on lucky train/test splits. By testing on multiple data combinations, we get realistic performance estimates. This matters when you're betting business decisions on model accuracy.

Pro Tip: Always cross-validate hyperparameters too. Grid search + cross-validation finds optimal settings like tree depth or regularization strength. Scikit-learn's GridSearchCV does this automatically.
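A minimal GridSearchCV sketch, again on synthetic stand-in data: you define a parameter grid, and scikit-learn cross-validates every combination, then refits the best one.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data standing in for the ecommerce features
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 2))
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(0, 0.1, 300)

# Every combination in the grid is scored with 5-fold cross-validation
param_grid = {'n_estimators': [50, 100], 'max_depth': [3, 5, None]}
search = GridSearchCV(RandomForestRegressor(random_state=42),
                      param_grid, cv=5, scoring='r2')
search.fit(X, y)

print("Best params:", search.best_params_)
print("Best CV R2: ", round(search.best_score_, 3))
```

After fitting, search.best_estimator_ is already retrained on the full data with the winning settings.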

Quiz

1. You're building a model to predict Ola ride prices. What's the correct sequence using scikit-learn?


2. Paytm wants to predict both transaction amounts (₹50 to ₹50,000) and fraud risk (Yes/No). Which algorithms should you choose?


3. Your Swiggy order cancellation model shows 95% accuracy, but only catches 10% of actual cancellations. What's the problem?


Up Next

Joblib & Pickle

Save and load your trained models for production deployment — no more retraining every time you restart your application.