Data Science Lesson 45 – Scikit-Learn | Dataplexa
Machine Learning · Lesson 45

Scikit-Learn

Build machine learning models from scratch — classification, regression, clustering — using Python's most trusted ML library with real ecommerce data.

What Makes Scikit-Learn Special

Think of scikit-learn as the Swiss Army knife of machine learning. Where pandas cleans data and matplotlib shows it, scikit-learn actually predicts the future. Want to know which customers will return products? Build a classifier. Need to predict tomorrow's revenue? Train a regression model.

Real companies use this daily. Flipkart predicts which products you'll buy. Swiggy estimates delivery times. Ola calculates surge pricing. Systems like these are often built on libraries such as scikit-learn, whose consistent API design means every algorithm works the same way.

Core Philosophy

Scikit-learn follows a simple pattern: fit() trains the model on data, predict() makes forecasts. Whether it's linear regression or neural networks, the interface stays identical.
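The pattern is easiest to see side by side. Here is a minimal sketch with a tiny made-up dataset (four rows standing in for real orders): the exact same fit()/predict() calls work for two very different estimators.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

# Tiny made-up dataset: (price, quantity) -> rating
X = np.array([[100, 1], [200, 2], [300, 3], [400, 4]])
y = np.array([3.0, 3.5, 4.0, 4.5])

# The same two calls work for every scikit-learn estimator
for est in (LinearRegression(), RandomForestRegressor(n_estimators=10, random_state=0)):
    est.fit(X, y)             # learn from the data
    preds = est.predict(X)    # make forecasts
    print(type(est).__name__, "->", preds.round(2))
```

Swapping algorithms means changing one line, not rewriting the pipeline.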

Setting Up Your First Model

The scenario: Myntra's data team needs to predict customer ratings based on product price and quantity. Revenue depends on happy customers, so this prediction drives inventory decisions.

# Import the core libraries we need for machine learning
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Load our ecommerce dataset 
df = pd.read_csv('dataplexa_ecommerce.csv')
print("Dataset shape:", df.shape)
print("\nFirst few rows:")
print(df.head())

What just happened?

We loaded 5,000 ecommerce orders, with rating as our target variable and unit_price and quantity as features. Try this: Check for missing values with df.isnull().sum()

Now we prepare the data for machine learning. Scikit-learn expects features in X (capital) and target in y (lowercase). This naming convention is universal across all algorithms.

# Select features that might influence customer rating
X = df[['unit_price', 'quantity']]

# Target variable - what we want to predict
y = df['rating']

print("Features shape:", X.shape)
print("Target shape:", y.shape)
print("\nFeature summary:")
print(X.describe())

What just happened?

We created feature matrix X with (5000, 2) shape and target vector y with (5000,). Price ranges ₹450-₹25,000 while quantity stays 1-10 items. Try this: Visualize with plt.scatter(X['unit_price'], y)

Training Your First Model

Machine learning follows a golden rule: never test on training data. We split our dataset into training (80%) and testing (20%) portions. The model learns from training data, then we evaluate on unseen test data.

# Split data into train/test sets (80/20 split)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print("Training set size:", X_train.shape)
print("Test set size:", X_test.shape)
print("Training targets range:", y_train.min(), "to", y_train.max())

What just happened?

We split 5,000 rows into 4,000 training and 1,000 test samples. The random_state=42 ensures reproducible results. Try this: Change random_state to get different splits.

Time to create and train the model. Linear regression finds the best line through our data points. Behind the scenes, it's solving mathematical equations to minimize prediction errors.

# Create a linear regression model instance
model = LinearRegression()

# Train the model on our training data
model.fit(X_train, y_train)

# Check what the model learned
print("Model coefficients:", model.coef_)
print("Model intercept:", model.intercept_)
print("Training complete!")

What just happened?

The model learned that higher prices slightly decrease ratings (-1.2e-05) while higher quantities increase them (0.185). Base rating starts at 4.01. Try this: Multiply coefficients by actual values to see their impact.
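You can verify the "Try this" yourself: a linear model's prediction is just the intercept plus the coefficient-weighted sum of the features. A sketch on synthetic data (random stand-ins for unit_price and quantity, since the real CSV isn't bundled here):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for (unit_price, quantity) -> rating
rng = np.random.default_rng(42)
X = np.column_stack([rng.uniform(450, 25000, 200), rng.integers(1, 11, 200)])
y = 4.0 - 1e-5 * X[:, 0] + 0.2 * X[:, 1] + rng.normal(0, 0.1, 200)

model = LinearRegression().fit(X, y)

# A prediction is just: intercept + sum(coef * feature)
row = X[0]
manual = model.intercept_ + np.dot(model.coef_, row)
auto = model.predict(row.reshape(1, -1))[0]
print(f"manual={manual:.4f}  predict()={auto:.4f}")
```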

Making Predictions and Measuring Accuracy

The moment of truth. Does our model actually work on unseen data? We'll make predictions on the test set and compare them with actual ratings. This shows real-world performance.

# Make predictions on test data
y_pred = model.predict(X_test)

# Show first 10 predictions vs actual ratings
comparison = pd.DataFrame({
    'Actual': y_test[:10].values,
    'Predicted': y_pred[:10]
})
print("Actual vs Predicted ratings:")
print(comparison.round(2))

What just happened?

Our model predicted ratings reasonably close to actual values. For example, actual 4.20 vs predicted 4.03 shows 0.17 difference. Try this: Check which predictions have the largest errors.
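To find the worst predictions, compute the absolute error per row and sort. A sketch with hypothetical actual/predicted values standing in for y_test and y_pred:

```python
import numpy as np
import pandas as pd

# Hypothetical ratings standing in for y_test and y_pred
actual = pd.Series([4.2, 3.8, 4.5, 2.9, 4.0])
predicted = np.array([4.03, 4.1, 3.6, 3.5, 4.05])

# Absolute error per row, largest first
errors = (actual - predicted).abs()
worst = pd.DataFrame({'Actual': actual, 'Predicted': predicted,
                      'AbsError': errors}).sort_values('AbsError', ascending=False)
print(worst.head())
```

On the real test set, inspecting the top rows often reveals a pattern, such as errors clustering at extreme prices.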

Raw comparisons don't tell the full story. We need metrics to judge model performance. Mean Squared Error (MSE) penalizes large errors more heavily than small ones, which matters when big misses are the costly ones.

# Calculate model accuracy metrics
mse = mean_squared_error(y_test, y_pred)
rmse = mse ** 0.5

# Calculate R-squared score (how well model explains variance)
score = model.score(X_test, y_test)

print(f"Mean Squared Error: {mse:.4f}")
print(f"Root Mean Squared Error: {rmse:.4f}")
print(f"R-squared Score: {score:.4f}")
print(f"Typical prediction error: ±{rmse:.2f} stars")

What just happened?

Our model predicts ratings within ±0.30 stars on average. R-squared of 0.42 means we explain 42% of rating variance — decent for two simple features. Try this: Add more features to improve accuracy.

Points closer to the red diagonal line indicate more accurate predictions

The scatter plot reveals model behavior. Points clustering near the diagonal red line show accurate predictions. Outliers far from the line indicate where price and quantity alone don't explain customer satisfaction — maybe product quality or delivery speed matter more.

For business decisions, this means Myntra can roughly predict customer satisfaction from order details. But the model misses 58% of variance, suggesting other factors like product reviews or brand reputation drive ratings significantly.
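If you want to reproduce the scatter plot described above, here is a minimal matplotlib sketch (with synthetic stand-ins for y_test and y_pred, writing to a file instead of opening a window):

```python
import matplotlib
matplotlib.use('Agg')  # render off-screen so the script runs anywhere
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-ins for y_test and y_pred from the regression above
rng = np.random.default_rng(0)
y_test = rng.uniform(1, 5, 100)
y_pred = y_test + rng.normal(0, 0.3, 100)

plt.scatter(y_test, y_pred, alpha=0.5)
# Red diagonal: a perfect model would put every point on this line
plt.plot([1, 5], [1, 5], 'r--')
plt.xlabel('Actual rating')
plt.ylabel('Predicted rating')
plt.savefig('actual_vs_predicted.png')
```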

Classification Models

Regression predicts numbers. But what about yes/no questions? Will this customer return their order? That's classification territory. Same scikit-learn pattern, different algorithms.

The scenario: Zomato's operations team needs to predict food order returns based on price and rating. High return rates hurt restaurant partnerships and customer experience.

# Import classification algorithms and metrics
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# Prepare features and target for classification
X_class = df[['unit_price', 'rating']]
y_class = df['returned']  # Boolean: True/False

print("Return rate:", y_class.mean().round(3))
print("Total returns:", y_class.sum())

# Split data for classification task
X_train_c, X_test_c, y_train_c, y_test_c = train_test_split(
    X_class, y_class, test_size=0.2, random_state=42
)

# Create and train logistic regression classifier
clf = LogisticRegression(max_iter=1000)  # extra iterations help convergence with unscaled prices
clf.fit(X_train_c, y_train_c)

# Make predictions (returns probabilities and classes)
y_pred_c = clf.predict(X_test_c)
print("Classification model trained!")

# Evaluate classification performance
accuracy = accuracy_score(y_test_c, y_pred_c)
report = classification_report(y_test_c, y_pred_c)

print(f"Classification Accuracy: {accuracy:.3f}")
print("\nDetailed Report:")
print(report)

What just happened?

Our classifier achieves 82.4% accuracy predicting returns. It correctly identifies non-returns 93% of the time but only catches 50% of actual returns. Try this: Adjust class weights to catch more returns.

Model correctly predicts 824 out of 1,000 test cases (82.4% accuracy)

📊 Data Insight

The model struggles with returns because they're only 18.6% of orders. In imbalanced datasets like this, high accuracy can be misleading — it mostly just predicts "no return" for everything. Consider precision/recall tradeoffs for business decisions.
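One common fix for the imbalance is class_weight='balanced', which tells LogisticRegression to weight the rare class more heavily during training. A sketch on synthetic imbalanced data (roughly 19% positives, similar to the 18.6% return rate):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

# Synthetic imbalanced labels (~19% positives) standing in for 'returned'
rng = np.random.default_rng(42)
X = rng.normal(size=(2000, 2))
y = (X[:, 0] + rng.normal(0, 1.5, 2000) > 1.6).astype(int)

plain = LogisticRegression().fit(X, y)
balanced = LogisticRegression(class_weight='balanced').fit(X, y)

# The balanced model trades some accuracy for catching more positives
print("plain recall:   ", round(recall_score(y, plain.predict(X)), 3))
print("balanced recall:", round(recall_score(y, balanced.predict(X)), 3))
```

Expect overall accuracy to dip while recall on the minority class rises, which is usually the right trade for a returns model.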

Clustering and Unsupervised Learning

Sometimes you don't have target labels. You just want to find patterns. Clustering groups similar customers together without knowing the "right" answer beforehand.

The scenario: BigBasket wants to segment customers by purchasing behavior. No predefined categories — just find natural groups to tailor marketing campaigns.

# Import clustering algorithm
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Prepare features for clustering (price and quantity behavior)
X_cluster = df[['unit_price', 'quantity', 'rating']]

# Scale features so all have equal weight
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_cluster)
print("Features scaled for clustering")
print("Original shape:", X_cluster.shape)

# Create K-means clustering with 4 customer segments
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)  # explicit n_init keeps results stable across sklearn versions
clusters = kmeans.fit_predict(X_scaled)

# Add cluster labels back to original data
df['customer_segment'] = clusters

# Analyze each cluster's characteristics
segment_summary = df.groupby('customer_segment').agg({
    'unit_price': 'mean',
    'quantity': 'mean', 
    'rating': 'mean'
}).round(2)
print("Customer Segments:")
print(segment_summary)

# Check cluster sizes and distribution
cluster_counts = df['customer_segment'].value_counts().sort_index()
print("Customers per segment:")
print(cluster_counts)

# Calculate percentage distribution
percentages = (cluster_counts / len(df) * 100).round(1)
print("\nPercentage distribution:")
for i in range(4):
    print(f"Segment {i}: {percentages[i]}%")

What just happened?

K-means found 4 customer segments with roughly equal sizes (~25% each). However, the segments show very similar behavior patterns — this might indicate our features don't capture enough customer diversity. Try this: Add customer_age or product_category to improve segmentation.

Customer segments show minimal differences — additional features needed for meaningful segmentation

The clustering reveals an important lesson: not all data naturally clusters. These segments are too similar to drive different marketing strategies. In real projects, you'd add demographic data, purchase history, or seasonal patterns to find meaningful customer groups.

Common Clustering Mistake

Blindly choosing the number of clusters without domain knowledge. Use the "elbow method" to find optimal cluster counts, but always validate if segments make business sense. Four mathematical clusters mean nothing if customers behave identically.
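Here is the elbow method in practice: fit K-means for several values of k and watch the inertia (within-cluster sum of squares). This sketch uses synthetic data with three obvious blobs, so the elbow lands at k=3:

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data: three well-separated blobs standing in for scaled features
rng = np.random.default_rng(42)
blobs = [rng.normal(loc=c, scale=0.4, size=(100, 2)) for c in ((0, 0), (4, 0), (2, 4))]
X = np.vstack(blobs)

# Elbow method: inertia drops sharply until k matches the true group count
inertias = []
for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(km.inertia_)
    print(f"k={k}: inertia={km.inertia_:.1f}")
```

The big drops stop once k matches the real number of groups; past that, extra clusters only shave off noise.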

Model Selection and Cross-Validation

Which algorithm works best for your data? Random forests? Support Vector Machines? Cross-validation tests multiple models systematically without peeking at test data.

# Import different algorithms to compare
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.model_selection import cross_val_score
import numpy as np

# Test multiple models on the same dataset
models = {
    'Linear Regression': LinearRegression(),
    'Random Forest': RandomForestRegressor(random_state=42),
    'Support Vector Regression': SVR()
}

# Use 5-fold cross-validation to test each model
results = {}
for name, model in models.items():
    # Split data 5 ways, train on 4, test on 1, repeat
    scores = cross_val_score(model, X, y, cv=5, scoring='r2')
    results[name] = scores
    print(f"{name}: {scores.mean():.3f} ± {scores.std():.3f}")

print("\nBest model:", max(results, key=lambda x: results[x].mean()))

What just happened?

Cross-validation tested each model 5 times with different data splits. Random Forest achieved the highest R² score (0.456 ± 0.031), meaning it explains 45.6% of rating variance on average. The ± shows consistency across different data samples. Try this: Test with 10-fold CV for more robust estimates.

Random Forest won because it handles non-linear relationships better than linear regression. It creates multiple decision trees and averages their predictions, capturing complex patterns between price, quantity, and ratings that a straight line misses.
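You can see the averaging directly: a RandomForestRegressor exposes its trees via estimators_, and the forest's prediction equals the mean of the individual trees' predictions. A quick sketch on synthetic data:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic non-linear data: a straight line can't capture x0 * x1
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = X[:, 0] * X[:, 1] + rng.normal(0, 0.1, 200)

rf = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# The forest's prediction is the average of its individual trees
tree_preds = np.stack([tree.predict(X[:5]) for tree in rf.estimators_])
print("Forest:       ", rf.predict(X[:5]).round(3))
print("Mean of trees:", tree_preds.mean(axis=0).round(3))
```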

Cross-validation prevents a critical mistake: choosing models based on lucky train/test splits. By testing on multiple data combinations, we get realistic performance estimates. This matters when you're betting business decisions on model accuracy.

Pro Tip: Always cross-validate hyperparameters too. Grid search + cross-validation finds optimal settings like tree depth or regularization strength. Scikit-learn's GridSearchCV does this automatically.
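A minimal GridSearchCV sketch, again on synthetic stand-in data: you define a parameter grid, and scikit-learn cross-validates every combination, then refits the best one.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data standing in for the ecommerce features
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 2))
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(0, 0.1, 300)

# Every combination in the grid is scored with 5-fold cross-validation
param_grid = {'n_estimators': [50, 100], 'max_depth': [3, 5, None]}
search = GridSearchCV(RandomForestRegressor(random_state=42),
                      param_grid, cv=5, scoring='r2')
search.fit(X, y)

print("Best params:", search.best_params_)
print("Best CV R2: ", round(search.best_score_, 3))
```

After fitting, search.best_estimator_ is already retrained on the full data with the winning settings.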

Quiz

1. You're building a model to predict Ola ride prices. What's the correct sequence using scikit-learn?


2. Paytm wants to predict both transaction amounts (₹50 to ₹50,000) and fraud risk (Yes/No). Which algorithms should you choose?


3. Your Swiggy order cancellation model shows 95% accuracy, but only catches 10% of actual cancellations. What's the problem?


Up Next

Joblib & Pickle

Save and load your trained models for production deployment — no more retraining every time you restart your application.