Data Science
Scikit-Learn
Build machine learning models from scratch — classification, regression, clustering — using Python's most trusted ML library with real ecommerce data.
What Makes Scikit-Learn Special
Think of scikit-learn as the Swiss Army knife of machine learning. Where pandas cleans data and matplotlib shows it, scikit-learn actually predicts the future. Want to know which customers will return products? Build a classifier. Need to predict tomorrow's revenue? Train a regression model.
Real companies run predictions like these daily. Flipkart predicts which products you'll buy. Swiggy estimates delivery times. Ola calculates surge pricing. Models like these can all be built on scikit-learn's consistent API design — every algorithm works the same way.
Core Philosophy
Scikit-learn follows a simple pattern: fit() trains the model on data, predict() makes forecasts. Whether it's linear regression or neural networks, the interface stays identical.
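Here is the pattern in miniature, on a tiny made-up dataset (hours studied vs. exam score) used purely for illustration:

```python
from sklearn.linear_model import LinearRegression

# Tiny illustrative dataset: hours studied -> exam score
X = [[1], [2], [3], [4]]   # features (2D: samples x features)
y = [52, 61, 70, 79]       # target (1D)

model = LinearRegression()
model.fit(X, y)                  # step 1: learn from data
preds = model.predict([[5]])     # step 2: forecast an unseen input
print(round(preds[0], 1))        # prints 88.0 (the +9-per-hour trend continues)
```

Swap LinearRegression for any other estimator and the two calls stay the same; that uniformity is the whole point.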
Setting Up Your First Model
The scenario: Myntra's data team needs to predict customer ratings based on product price and quantity. Revenue depends on happy customers, so this prediction drives inventory decisions.
# Import the core libraries we need for machine learning
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
# Load our ecommerce dataset
df = pd.read_csv('dataplexa_ecommerce.csv')
print("Dataset shape:", df.shape)
print("\nFirst few rows:")
df.head()

Dataset shape: (5000, 11)

First few rows:
   order_id        date  customer_age gender       city product_category  quantity  unit_price  revenue  rating  returned
0      1001  2023-01-05            28      M     Mumbai      Electronics         2     15000.0  30000.0     4.2     False
1      1002  2023-01-05            34      F      Delhi         Clothing         1      2500.0   2500.0     3.8     False
2      1003  2023-01-06            22      F  Bangalore             Food         3       450.0   1350.0     4.5     False
3      1004  2023-01-06            45      M    Chennai            Books         1       800.0    800.0     4.0     False
4      1005  2023-01-07            31      F       Pune             Home         2      8500.0  17000.0     3.9     False
What just happened?
We loaded 5,000 ecommerce orders with rating as our target variable and unit_price, quantity as features. Try this: Check for missing values with df.isnull().sum()
Now we prepare the data for machine learning. Scikit-learn expects features in X (capital) and target in y (lowercase). This naming convention is universal across all algorithms.
# Select features that might influence customer rating
X = df[['unit_price', 'quantity']]
# Target variable - what we want to predict
y = df['rating']
print("Features shape:", X.shape)
print("Target shape:", y.shape)
print("\nFeature summary:")
X.describe()

Features shape: (5000, 2)
Target shape: (5000,)
Feature summary:
       unit_price  quantity
count    5000.000  5000.000
mean     8750.450     2.840
std      6420.120     1.650
min       450.000     1.000
25%      2500.000     1.000
50%      7800.000     3.000
75%     15000.000     4.000
max     25000.000    10.000

What just happened?
We created feature matrix X with (5000, 2) shape and target vector y with (5000,). Price ranges ₹450-₹25,000 while quantity stays 1-10 items. Try this: Visualize with plt.scatter(X['unit_price'], y)
Training Your First Model
Machine learning follows a golden rule: never test on training data. We split our dataset into training (80%) and testing (20%) portions. The model learns from training data, then we evaluate on unseen test data.
# Split data into train/test sets (80/20 split)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
print("Training set size:", X_train.shape)
print("Test set size:", X_test.shape)
print("Training targets range:", y_train.min(), "to", y_train.max())

Training set size: (4000, 2)
Test set size: (1000, 2)
Training targets range: 1.0 to 5.0
What just happened?
We split 5,000 rows into 4,000 training and 1,000 test samples. The random_state=42 ensures reproducible results. Try this: Change random_state to get different splits.
Time to create and train the model. Linear regression finds the best line through our data points. Behind the scenes, it's solving mathematical equations to minimize prediction errors.
# Create a linear regression model instance
model = LinearRegression()
# Train the model on our training data
model.fit(X_train, y_train)
# Check what the model learned
print("Model coefficients:", model.coef_)
print("Model intercept:", model.intercept_)
print("Training complete!")

Model coefficients: [-1.2e-05 0.18534]
Model intercept: 4.01
Training complete!
What just happened?
The model learned that higher prices slightly decrease ratings (-1.2e-05) while higher quantities increase them (0.185). Base rating starts at 4.01. Try this: Multiply coefficients by actual values to see their impact.
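To act on that "try this", here is a quick sketch with the coefficients above hardcoded; the ₹15,000 order of 2 items is an invented example:

```python
# Coefficients copied from the fitted model above (hardcoded for illustration)
coef_price, coef_qty = -1.2e-05, 0.18534
intercept = 4.01

# Contribution of each feature for one hypothetical order
unit_price, quantity = 15000.0, 2
price_effect = coef_price * unit_price   # -0.18: expensive items drag ratings down
qty_effect = coef_qty * quantity         # +0.37: bulk buyers rate slightly higher
predicted_rating = intercept + price_effect + qty_effect
print(round(predicted_rating, 2))        # prints 4.2
```

The tiny price coefficient matters only because prices run into the tens of thousands; multiplied out, both features move the rating by a few tenths of a star.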
Making Predictions and Measuring Accuracy
The moment of truth. Does our model actually work on unseen data? We'll make predictions on the test set and compare them with actual ratings. This shows real-world performance.
# Make predictions on test data
y_pred = model.predict(X_test)
# Show first 10 predictions vs actual ratings
comparison = pd.DataFrame({
'Actual': y_test[:10].values,
'Predicted': y_pred[:10]
})
print("Actual vs Predicted ratings:")
print(comparison.round(2))

Actual vs Predicted ratings:
   Actual  Predicted
0    4.20       4.03
1    3.80       3.98
2    4.50       4.56
3    4.00       4.01
4    3.90       4.20
5    4.10       4.45
6    3.70       3.85
7    4.30       4.12
8    4.60       4.38
9    3.50       3.72
What just happened?
Our model predicted ratings reasonably close to actual values. For example, actual 4.20 vs predicted 4.03 shows 0.17 difference. Try this: Check which predictions have the largest errors.
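To act on that "try this", here is a sketch that ranks predictions by error size (stand-in values below; in the notebook use your real y_test and y_pred):

```python
import pandas as pd

# Stand-in values; replace with your actual y_test and y_pred
y_test = pd.Series([4.2, 3.8, 4.5, 4.0, 3.9])
y_pred = pd.Series([4.03, 3.98, 4.56, 4.01, 4.20])

comparison = pd.DataFrame({'Actual': y_test.values,
                           'Predicted': y_pred.values})
comparison['AbsError'] = (comparison['Actual'] - comparison['Predicted']).abs()

# Sort descending so the worst predictions surface first
worst = comparison.sort_values('AbsError', ascending=False)
print(worst.head(3))
```

Inspecting the worst rows often reveals a pattern, such as a price range or quantity the model consistently misjudges.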
Raw comparisons don't tell the full story. We need metrics to judge model performance. Mean Squared Error (MSE) penalizes large errors more than small ones — perfect for business decisions.
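You can see the "penalizes large errors" property directly: two prediction sets with the same total absolute error but different MSE (invented numbers):

```python
from sklearn.metrics import mean_squared_error

actual = [4.0, 4.0, 4.0, 4.0]
small_errors = [3.9, 4.1, 3.9, 4.1]   # four errors of 0.1 each
one_big_error = [4.0, 4.0, 4.0, 4.4]  # one error of 0.4, same 0.4 total

mse_small = mean_squared_error(actual, small_errors)  # 0.01
mse_big = mean_squared_error(actual, one_big_error)   # 0.04
print(mse_small, mse_big)
```

Squaring makes the single 0.4 miss cost four times as much as four 0.1 misses.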
# Calculate model accuracy metrics
mse = mean_squared_error(y_test, y_pred)
rmse = mse ** 0.5
# Calculate R-squared score (how well model explains variance)
score = model.score(X_test, y_test)
print(f"Mean Squared Error: {mse:.4f}")
print(f"Root Mean Squared Error: {rmse:.4f}")
print(f"R-squared Score: {score:.4f}")
print(f"Average prediction error: ±{rmse:.2f} stars")

Mean Squared Error: 0.0891
Root Mean Squared Error: 0.2985
R-squared Score: 0.4247
Average prediction error: ±0.30 stars
What just happened?
Our model predicts ratings within ±0.30 stars on average. R-squared of 0.42 means we explain 42% of rating variance — decent for two simple features. Try this: Add more features to improve accuracy.
Points closer to the red diagonal line indicate more accurate predictions
The scatter plot reveals model behavior. Points clustering near the diagonal red line show accurate predictions. Outliers far from the line indicate where price and quantity alone don't explain customer satisfaction — maybe product quality or delivery speed matter more.
For business decisions, this means Myntra can roughly predict customer satisfaction from order details. But the model misses 58% of variance, suggesting other factors like product reviews or brand reputation drive ratings significantly.
Classification Models
Regression predicts numbers. But what about yes/no questions? Will this customer return their order? That's classification territory. Same scikit-learn pattern, different algorithms.
The scenario: Zomato's operations team needs to predict food order returns based on price and rating. High return rates hurt restaurant partnerships and customer experience.
# Import classification algorithms and metrics
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
# Prepare features and target for classification
X_class = df[['unit_price', 'rating']]
y_class = df['returned'] # Boolean: True/False
print("Return rate:", y_class.mean().round(3))
print("Total returns:", y_class.sum())

Return rate: 0.186
Total returns: 930
# Split data for classification task
X_train_c, X_test_c, y_train_c, y_test_c = train_test_split(
X_class, y_class, test_size=0.2, random_state=42
)
# Create and train logistic regression classifier
clf = LogisticRegression()
clf.fit(X_train_c, y_train_c)
# Make predictions (returns probabilities and classes)
y_pred_c = clf.predict(X_test_c)
print("Classification model trained!")

Classification model trained!
# Evaluate classification performance
accuracy = accuracy_score(y_test_c, y_pred_c)
report = classification_report(y_test_c, y_pred_c)
print(f"Classification Accuracy: {accuracy:.3f}")
print("\nDetailed Report:")
print(report)

Classification Accuracy: 0.824

Detailed Report:
              precision    recall  f1-score   support

       False       0.86      0.93      0.89       814
        True       0.68      0.50      0.58       186

    accuracy                           0.82      1000
   macro avg       0.77      0.71      0.73      1000
weighted avg       0.82      0.82      0.82      1000

What just happened?
Our classifier achieves 82.4% accuracy predicting returns. It correctly identifies non-returns 93% of the time but only catches 50% of actual returns. Try this: Adjust class weights to catch more returns.
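To act on that "try this", here is a sketch using class_weight='balanced'; make_classification generates synthetic imbalanced data standing in for the real returns dataset:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data: ~18% positives, like the returns column
X, y = make_classification(n_samples=2000, weights=[0.82, 0.18],
                           random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          random_state=42)

# 'balanced' makes mistakes on the rare class cost proportionally more
plain = LogisticRegression().fit(X_tr, y_tr)
weighted = LogisticRegression(class_weight='balanced').fit(X_tr, y_tr)

recall_plain = recall_score(y_te, plain.predict(X_te))
recall_weighted = recall_score(y_te, weighted.predict(X_te))
print(f"recall plain:    {recall_plain:.2f}")
print(f"recall balanced: {recall_weighted:.2f}")
```

Expect recall on the rare class to rise, usually at the cost of some precision.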
Model correctly predicts 824 out of 1,000 test cases (82.4% accuracy)
📊 Data Insight
The model struggles with returns because they're only 18.6% of orders. In imbalanced datasets like this, high accuracy can be misleading — it mostly just predicts "no return" for everything. Consider precision/recall tradeoffs for business decisions.
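One way to explore that tradeoff is to move the decision threshold instead of retraining; a sketch on synthetic imbalanced data (make_classification stands in for the real dataset):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the returns data: ~18% positives
X, y = make_classification(n_samples=2000, weights=[0.82, 0.18],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                          random_state=0)
clf = LogisticRegression().fit(X_tr, y_tr)

# Lowering the threshold flags more orders as returns: recall up, precision down
proba = clf.predict_proba(X_te)[:, 1]
recalls = {}
for threshold in (0.5, 0.3):
    pred = (proba >= threshold).astype(int)
    recalls[threshold] = recall_score(y_te, pred)
    print(threshold,
          round(precision_score(y_te, pred), 2),
          round(recalls[threshold], 2))
```

A lower threshold catches more true returns by definition; whether the extra false alarms are worth it is a business call, not a math one.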
Clustering and Unsupervised Learning
Sometimes you don't have target labels. You just want to find patterns. Clustering groups similar customers together without knowing the "right" answer beforehand.
The scenario: BigBasket wants to segment customers by purchasing behavior. No predefined categories — just find natural groups to tailor marketing campaigns.
# Import clustering algorithm
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
# Prepare features for clustering (price and quantity behavior)
X_cluster = df[['unit_price', 'quantity', 'rating']]
# Scale features so all have equal weight
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_cluster)
print("Features scaled for clustering")
print("Original shape:", X_cluster.shape)

Features scaled for clustering
Original shape: (5000, 3)
# Create K-means clustering with 4 customer segments
kmeans = KMeans(n_clusters=4, random_state=42)
clusters = kmeans.fit_predict(X_scaled)
# Add cluster labels back to original data
df['customer_segment'] = clusters
# Analyze each cluster's characteristics
segment_summary = df.groupby('customer_segment').agg({
'unit_price': 'mean',
'quantity': 'mean',
'rating': 'mean'
}).round(2)
print("Customer Segments:")
print(segment_summary)

Customer Segments:
                  unit_price  quantity  rating
customer_segment
0                    8921.45      2.84    4.14
1                    8456.78      2.91    4.12
2                    8743.21      2.78    4.17
3                    8890.34      2.82    4.09

# Check cluster sizes and distribution
cluster_counts = df['customer_segment'].value_counts().sort_index()
print("Customers per segment:")
print(cluster_counts)
# Calculate percentage distribution
percentages = (cluster_counts / len(df) * 100).round(1)
print("\nPercentage distribution:")
for i in range(4):
    print(f"Segment {i}: {percentages[i]}%")

Customers per segment:
customer_segment
0    1247
1    1269
2    1234
3    1250
Name: count, dtype: int64

Percentage distribution:
Segment 0: 24.9%
Segment 1: 25.4%
Segment 2: 24.7%
Segment 3: 25.0%
What just happened?
K-means found 4 customer segments with roughly equal sizes (~25% each). However, the segments show very similar behavior patterns — this might indicate our features don't capture enough customer diversity. Try this: Add customer_age or product_category to improve segmentation.
Customer segments show minimal differences — additional features needed for meaningful segmentation
The clustering reveals an important lesson: not all data naturally clusters. These segments are too similar to drive different marketing strategies. In real projects, you'd add demographic data, purchase history, or seasonal patterns to find meaningful customer groups.
Common Clustering Mistake
Blindly choosing the number of clusters without domain knowledge. Use the "elbow method" to find optimal cluster counts, but always validate if segments make business sense. Four mathematical clusters mean nothing if customers behave identically.
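The elbow method in a sketch: fit K-means for several k values and watch where inertia (within-cluster spread) stops dropping sharply. Synthetic blobs with 3 genuine clusters make the elbow easy to spot:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 3 genuine clusters
X, _ = make_blobs(n_samples=600, centers=3, random_state=42)

inertias = []
for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(km.inertia_)

# Inertia always falls as k grows; the elbow is where the drop flattens
for k, inertia in zip(range(1, 8), inertias):
    print(k, round(inertia))
```

Expect the drop to flatten right after the true cluster count here; on the flat ecommerce segments above, no clear elbow would appear at all, which is itself a finding.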
Model Selection and Cross-Validation
Which algorithm works best for your data? Random forests? Support Vector Machines? Cross-validation tests multiple models systematically without peeking at test data.
# Import different algorithms to compare
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.model_selection import cross_val_score
import numpy as np
# Test multiple models on the same dataset
models = {
'Linear Regression': LinearRegression(),
'Random Forest': RandomForestRegressor(random_state=42),
'Support Vector Regression': SVR()
}
print("Models loaded for comparison")

Models loaded for comparison
# Use 5-fold cross-validation to test each model
results = {}
for name, model in models.items():
# Split data 5 ways, train on 4, test on 1, repeat
scores = cross_val_score(model, X, y, cv=5, scoring='r2')
results[name] = scores
print(f"{name}: {scores.mean():.3f} ± {scores.std():.3f}")
print("\nBest model:", max(results, key=lambda x: results[x].mean()))

Linear Regression: 0.419 ± 0.028
Random Forest: 0.456 ± 0.031
Support Vector Regression: 0.387 ± 0.025

Best model: Random Forest
What just happened?
Cross-validation tested each model 5 times with different data splits. Random Forest achieved the highest R² score (0.456 ± 0.031), meaning it explains 45.6% of rating variance on average. The ± shows consistency across different data samples. Try this: Test with 10-fold CV for more robust estimates.
Random Forest won because it handles non-linear relationships better than linear regression. It creates multiple decision trees and averages their predictions, capturing complex patterns between price, quantity, and ratings that a straight line misses.
Cross-validation prevents a critical mistake: choosing models based on lucky train/test splits. By testing on multiple data combinations, we get realistic performance estimates. This matters when you're betting business decisions on model accuracy.
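Why the tree ensemble wins is easiest to see on deliberately non-linear synthetic data, where y depends on the square of x and a straight line is nearly useless:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Non-linear ground truth: y = x^2 plus noise
rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(500, 1))
y = X[:, 0] ** 2 + rng.normal(0, 0.3, size=500)

lin_r2 = cross_val_score(LinearRegression(), X, y, cv=5, scoring='r2').mean()
rf_r2 = cross_val_score(RandomForestRegressor(random_state=42), X, y,
                        cv=5, scoring='r2').mean()
print(f"Linear Regression R²: {lin_r2:.3f}")  # near zero: no line fits a parabola
print(f"Random Forest R²:     {rf_r2:.3f}")   # high: trees carve the curve into steps
```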
Pro Tip: Always cross-validate hyperparameters too. Grid search + cross-validation finds optimal settings like tree depth or regularization strength. Scikit-learn's GridSearchCV does this automatically.
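A sketch of grid search plus cross-validation (make_regression supplies synthetic data; the grid values are illustrative, not recommendations):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data standing in for the ecommerce features
X, y = make_regression(n_samples=300, n_features=2, noise=10.0,
                       random_state=42)

# Every parameter combination is cross-validated; the best mean CV score wins
param_grid = {'n_estimators': [50, 100], 'max_depth': [3, 5, None]}
search = GridSearchCV(RandomForestRegressor(random_state=42),
                      param_grid, cv=5, scoring='r2')
search.fit(X, y)

print("Best params:", search.best_params_)
print("Best CV R²:", round(search.best_score_, 3))
```

After fitting, search.best_estimator_ is already retrained on the full data with the winning settings, ready for predict().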
Quiz
1. You're building a model to predict Ola ride prices. What's the correct sequence using scikit-learn?
2. Paytm wants to predict both transaction amounts (₹50 to ₹50,000) and fraud risk (Yes/No). Which algorithms should you choose?
3. Your Swiggy order cancellation model shows 95% accuracy, but only catches 10% of actual cancellations. What's the problem?
Up Next
Joblib & Pickle
Save and load your trained models for production deployment — no more retraining every time you restart your application.