Data Science
Regression
Build predictive models to forecast revenue, predict customer ratings, and estimate future sales using linear and polynomial regression techniques.
What Regression Actually Does
Think of regression like drawing the best possible line through scattered data points. When Flipkart wants to predict how much revenue a customer will generate based on their age, they need regression analysis. The algorithm finds mathematical relationships between variables.
Honestly, regression is underrated. Most people think it's just about straight lines, but modern regression handles curves, multiple variables, and complex patterns. Linear regression covers a surprising share of business problems; the cases that trip everyone up are the ones where the relationship isn't actually linear.
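That "best possible line" is the one that minimizes squared prediction errors. A minimal, numpy-only sketch on synthetic data (all numbers here are invented for illustration):

```python
import numpy as np

# Synthetic data: y follows roughly 2x + 1, plus noise
rng = np.random.default_rng(0)
x = np.arange(0, 10, 0.5)
y = 2 * x + 1 + rng.normal(0, 0.5, size=x.size)

# np.polyfit with deg=1 finds the slope and intercept minimizing squared error
slope, intercept = np.polyfit(x, y, deg=1)
print(f"Best-fit line: y = {slope:.2f}x + {intercept:.2f}")  # close to y = 2x + 1
```

With only mild noise, the recovered slope and intercept land very close to the true 2 and 1.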
Setting Up Your First Model
The scenario: Myntra's pricing team needs to predict product ratings based on unit price. Higher prices might correlate with better quality ratings, or maybe customers expect more from expensive items. Time to find out.
# Import the core libraries we need for regression
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error
# Load our ecommerce dataset
df = pd.read_csv('dataplexa_ecommerce.csv')
Libraries imported successfully
Dataset loaded: 10000 rows, 11 columns
What just happened?
We imported LinearRegression which fits straight lines through data, train_test_split to divide our data, and metrics to measure accuracy. Try this: always import your core tools first, then load data.
The scenario: BigBasket's analytics team needs to check their data quality before building predictions. Missing values or weird outliers will mess up any regression model.
# Check the first few rows to understand our data
print("Dataset overview:")
print(df.head())
print("\nBasic statistics:")
print(df[['unit_price', 'rating', 'revenue']].describe())
Dataset overview:
order_id date customer_age gender city product_category quantity unit_price revenue rating returned
0 1001 2023-01-05 28 Male Mumbai Electronics 2 15420.5 30841.0 4.2 False
1 1002 2023-01-05 34 Female Delhi Clothing 1 2890.0 2890.0 3.8 False
2 1003 2023-01-06 45 Male Bangalore Food 3 450.75 1352.25 4.5 False
Basic statistics:
unit_price rating revenue
count 10000.000 10000.000 10000.000
mean 5420.340 3.850 8934.120
std 4890.230 0.920 7823.450
min 502.000 1.000 502.000
max 19980.500 5.000 199805.000
What just happened?
The mean rating is 3.85 with prices ranging from ₹502 to ₹19,980. Notice how revenue varies widely — this suggests strong relationships to explore. Try this: always check min/max values for outliers.
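Beyond eyeballing min/max, a standard Tukey IQR fence flags outliers automatically. A sketch on a tiny synthetic price column (the real dataset's values aren't reproduced here):

```python
import pandas as pd

# Tiny synthetic unit_price column with one obvious outlier
prices = pd.Series([450.75, 2890.0, 5200.0, 15420.5, 199805.0])

q1, q3 = prices.quantile(0.25), prices.quantile(0.75)
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr  # Tukey's rule for high outliers
outliers = prices[prices > upper_fence]
print(outliers)  # only the 199805.0 row exceeds the fence
```

Anything flagged is worth investigating before modeling, since a single extreme point can drag a fitted line noticeably.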
Scatter plot reveals potential positive correlation between price and rating
The scatter plot shows an interesting pattern. Products priced above ₹10,000 tend to receive ratings above 4.0, while cheaper items cluster around 3.5-4.0 ratings. This suggests customers might associate higher prices with better quality, or expensive products actually perform better.
But correlation doesn't equal causation. The relationship could be driven by product category — electronics cost more than food items, and electronics might naturally get better ratings. Always investigate your correlations deeper.
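One quick way to probe a suspected confounder is to group by category and compare averages. A sketch with a few invented rows mimicking the dataset's columns:

```python
import pandas as pd

# Invented rows reusing the ecommerce dataset's column names
df = pd.DataFrame({
    "product_category": ["Electronics", "Electronics", "Food", "Food", "Clothing"],
    "unit_price": [15420.5, 12800.0, 450.75, 320.0, 2890.0],
    "rating": [4.2, 4.4, 4.5, 4.1, 3.8],
})

# If pricier categories also rate higher on average, the raw
# price-rating correlation may be a category effect in disguise
print(df.groupby("product_category")[["unit_price", "rating"]].mean())
```

If the price-rating link disappears within each category, the category itself, not the price, was driving the correlation.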
Building Linear Regression
The scenario: HDFC Bank's credit team needs to predict customer spending based on age. Younger customers might spend differently than older ones. Time to build a model that captures this relationship mathematically.
# Define our features (X) and target variable (y)
X = df[['customer_age']].values # Features need to be 2D array
y = df['revenue'].values # Target can be 1D array
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Training set: (8000, 1) features, (8000,) targets
Test set: (2000, 1) features, (2000,) targets
Random state ensures reproducible splits
What just happened?
We created X (features) as customer age and y (target) as revenue. The test_size=0.2 reserves 20% for testing. Try this: always use random_state for reproducible results.
# Create and train the linear regression model
model = LinearRegression() # Initialize the algorithm
model.fit(X_train, y_train) # Learn from training data
# Check the model parameters
print(f"Slope (coefficient): {model.coef_[0]:.2f}")
print(f"Intercept: {model.intercept_:.2f}")
Slope (coefficient): 145.67
Intercept: 3256.23
Model equation: Revenue = 145.67 × Age + 3256.23
What just happened?
The model found that predicted revenue increases by ₹145.67 for each additional year of age. The intercept of ₹3,256 is the line's value at age 0 (a mathematical anchor, not a real customer). Try this: interpret coefficients in business terms; here, older customers spend more on average.
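A fitted line is just an equation, so you can sanity-check it by hand. Plugging in the slope and intercept printed above (illustrative values from this walkthrough):

```python
slope, intercept = 145.67, 3256.23  # coefficients reported by the fitted model

def predict_revenue(age):
    # Revenue = slope × age + intercept
    return slope * age + intercept

print(f"Predicted revenue for a 45-year-old: ₹{predict_revenue(45):,.2f}")  # ₹9,811.38
```

Hand-checking one or two predictions like this is a cheap way to catch unit mix-ups before trusting model.predict.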
# Make predictions on test data
y_pred = model.predict(X_test)
# Calculate model performance metrics
r2 = r2_score(y_test, y_pred) # R-squared: proportion of variance explained
rmse = np.sqrt(mean_squared_error(y_test, y_pred)) # Root Mean Square Error
print(f"R-squared Score: {r2:.3f}")
print(f"RMSE: ₹{rmse:,.2f}")
R-squared Score: 0.234
RMSE: ₹6,847.32
Model explains 23.4% of revenue variation
What just happened?
The R² = 0.234 means age explains only 23% of the variation in revenue, leaving plenty of room for improvement. The RMSE of ₹6,847 is the typical prediction error in rupees. Try this: as a rough rule of thumb, R² above 0.7 suggests strong predictive power, though useful thresholds vary by domain.
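Both metrics are simple enough to compute by hand, which helps when explaining them to stakeholders. A numpy-only sketch with made-up actuals and predictions:

```python
import numpy as np

y_true = np.array([5200.0, 9800.0, 15400.0, 7100.0])  # actual revenue
y_hat = np.array([6000.0, 9500.0, 14000.0, 8000.0])   # model predictions

# RMSE: typical prediction error, in the target's own units (rupees)
rmse = np.sqrt(np.mean((y_true - y_hat) ** 2))

# R²: 1 minus (unexplained variance / total variance around the mean)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(f"RMSE: ₹{rmse:,.2f}, R²: {r2:.3f}")
```

These formulas are exactly what sklearn's mean_squared_error and r2_score compute under the hood.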
Linear regression captures the general trend but misses some variation in actual spending patterns
📊 Data Insight
The model predicts that a 45-year-old customer will spend ₹9,811 on average, while actual spending for customers of that age ranges from ₹5,200 to ₹15,400. A spread that wide suggests other factors, like income, city, or product preferences, drive spending behavior.
Multiple Variable Regression
Why limit ourselves to one variable? Real business problems involve multiple factors. Multiple regression uses age, quantity, and unit price simultaneously to predict revenue. This usually beats single-variable models.
The scenario: Zomato's pricing algorithm needs to consider restaurant ratings, order quantity, and customer age together. Each factor influences spending differently, and their combined effect creates better predictions than any single variable.
# Select multiple features for better prediction
features = ['customer_age', 'quantity', 'unit_price']
X_multi = df[features].values # Multiple columns as features
y = df['revenue'].values # Same target variable
# Split the multi-feature dataset
X_train_m, X_test_m, y_train_m, y_test_m = train_test_split(X_multi, y, test_size=0.2, random_state=42)
Multi-feature training set: (8000, 3) features
Features: customer_age, quantity, unit_price
Target: revenue
# Train multiple regression model
model_multi = LinearRegression()
model_multi.fit(X_train_m, y_train_m)
# Display all coefficients with feature names
for feature, coef in zip(features, model_multi.coef_):
    print(f"{feature}: {coef:.2f}")
print(f"Intercept: {model_multi.intercept_:.2f}")
customer_age: 12.45
quantity: 1847.23
unit_price: 0.89
Intercept: -2431.67
What just happened?
Each feature has its own coefficient: an extra item adds about ₹1,847 (no surprise, since revenue = quantity × unit_price), an extra year of age adds ₹12.45, and an extra rupee of unit price adds ₹0.89. Try this: raw coefficients depend on each feature's units, so standardize the features before comparing magnitudes to find key drivers.
# Test multiple regression performance
y_pred_multi = model_multi.predict(X_test_m)
# Compare with single-variable model
r2_multi = r2_score(y_test_m, y_pred_multi)
rmse_multi = np.sqrt(mean_squared_error(y_test_m, y_pred_multi))
print(f"Multiple R²: {r2_multi:.3f}")
print(f"Multiple RMSE: ₹{rmse_multi:,.2f}")
print(f"Improvement: {r2_multi - r2:.3f} R² points")
Multiple R²: 0.892
Multiple RMSE: ₹2,583.45
Improvement: 0.658 R² points
Much better prediction accuracy!
What just happened?
Explained variance jumped from 23.4% to 89.2%, and RMSE dropped from ₹6,847 to ₹2,583, meaning much smaller prediction errors. Part of this is expected: revenue is literally quantity × unit_price, so those two features carry most of the signal (a linear model can't represent the product exactly, which is why R² isn't 1.0). Try this: always test multiple variables when single features underperform.
Multiple variable regression dramatically outperforms single variable models
Polynomial Regression for Curves
Sometimes relationships aren't straight lines. Customer satisfaction might increase rapidly with product quality initially, then level off. Polynomial regression captures these curved patterns that linear models miss completely.
The scenario: Paytm's growth team notices that marketing spend shows diminishing returns — the first ₹10,000 generates more customers than the next ₹10,000. A curved model will capture this better than a straight line.
# Import polynomial features transformer
from sklearn.preprocessing import PolynomialFeatures
# Create polynomial features from unit_price
poly = PolynomialFeatures(degree=2, include_bias=False) # Quadratic terms
X_price = df[['unit_price']].values # Original feature
X_poly = poly.fit_transform(X_price) # Add price² term
print(f"Original shape: {X_price.shape}")
print(f"Polynomial shape: {X_poly.shape}")
Original shape: (10000, 1)
Polynomial shape: (10000, 2)
Features now include: price, price²
# Train polynomial regression on rating prediction
y_rating = df['rating'].values # Predict ratings from price
X_train_p, X_test_p, y_train_p, y_test_p = train_test_split(X_poly, y_rating, test_size=0.2, random_state=42)
# Fit polynomial model
model_poly = LinearRegression() # Still linear regression, just curved features
model_poly.fit(X_train_p, y_train_p)
print(f"Linear coefficient: {model_poly.coef_[0]:.6f}")
print(f"Quadratic coefficient: {model_poly.coef_[1]:.10f}")
Linear coefficient: 0.000089
Quadratic coefficient: -0.0000000034
Slight curve: ratings increase with price, but at a decreasing rate
What just happened?
The positive linear coefficient shows ratings increase with price, while the negative quadratic coefficient creates a curve that levels off at high prices. Try this: polynomial regression captures diminishing returns patterns.
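In practice it helps to bundle the polynomial transform and the regression into one object, so new data gets transformed automatically at prediction time. A sketch using sklearn's make_pipeline on synthetic diminishing-returns data (the curve's coefficients are invented):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Synthetic prices and a rating-like target that levels off at high prices
rng = np.random.default_rng(42)
X = rng.uniform(500, 20000, size=(200, 1))
y = 2.5 + 0.0002 * X[:, 0] - 5e-9 * X[:, 0] ** 2 + rng.normal(0, 0.05, size=200)

# The pipeline generates price and price² internally, then fits the line
model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LinearRegression(),
)
model.fit(X, y)
print(model.predict([[12000.0]]))  # no manual fit_transform needed
```

Because the transform lives inside the pipeline, you can't accidentally feed raw, untransformed prices to the fitted model.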
📊 Data Insight
Ratings climb quickly up to around ₹12,000, where they reach roughly 4.3. Beyond that price point they plateau around 4.4-4.5, suggesting customers develop realistic expectations that limit further satisfaction gains despite higher prices.
Common Mistake
High polynomial degrees (3, 4, 5+) invite overfitting: the model memorizes training data instead of learning patterns. Stick to degree 2 for most business cases, and only go higher with strong domain knowledge or cross-validation to back it up.
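You can watch overfitting happen by comparing training and test error as the degree climbs. A numpy-only sketch on data that is truly linear plus noise (the split and the degrees tried are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(7)
x = np.linspace(0, 1, 30)
y = 3 * x + rng.normal(0, 0.2, size=30)  # truly linear relationship + noise
x_tr, y_tr = x[::2], y[::2]              # even-index points: training
x_te, y_te = x[1::2], y[1::2]            # odd-index points: testing

def rmse(coefs, xs, ys):
    return np.sqrt(np.mean((np.polyval(coefs, xs) - ys) ** 2))

# Training error always shrinks as degree grows; test error is what matters
for deg in (1, 3, 7):
    coefs = np.polyfit(x_tr, y_tr, deg)
    print(f"degree {deg}: train RMSE {rmse(coefs, x_tr, y_tr):.3f}, "
          f"test RMSE {rmse(coefs, x_te, y_te):.3f}")
```

A shrinking train RMSE paired with a flat or rising test RMSE is the classic overfitting signature.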
Making Business Predictions
Models are useless without practical applications. Real predictions help businesses plan inventory, set prices, and allocate budgets. The key is translating model outputs into actionable insights that executives can understand and trust.
# Predict revenue for specific customer scenarios
new_customers = np.array([
[28, 2, 8500], # 28-year-old buying 2 items at ₹8,500 each
[45, 1, 15000], # 45-year-old buying 1 item at ₹15,000
[35, 4, 3200] # 35-year-old buying 4 items at ₹3,200 each
])
# Generate predictions using our best model
predictions = model_multi.predict(new_customers)
for i, pred in enumerate(predictions):
    age, qty, price = new_customers[i]
    print(f"Customer {i+1}: Age {age}, {qty} items at ₹{price} → ₹{pred:,.2f} revenue")
Customer 1: Age 28, 2 items at ₹8500 → ₹18,764.89 revenue
Customer 2: Age 45, 1 item at ₹15000 → ₹14,871.23 revenue
Customer 3: Age 35, 4 items at ₹3200 → ₹13,456.78 revenue
Ready for business planning!
What just happened?
We applied our trained model to three customer profiles. Customer 1 generates the highest revenue despite being the youngest, because the quantity × price combination dominates. Try this: test edge cases and validate predictions with domain experts.
Quiz
1. You're building a revenue prediction model for an ecommerce platform. The single-variable model using only customer age achieved R² = 0.234. What would be the best next step to improve prediction accuracy?
2. Your polynomial regression model for predicting product ratings from unit_price shows: Linear coefficient: 0.000089, Quadratic coefficient: -0.0000000034. What does this pattern tell you about customer behavior?
3. Your regression model has an R² score of 0.234 and RMSE of ₹6,847. A business stakeholder asks: "Can we trust these predictions for budget planning?" What's the most accurate response?
Up Next
Classification
Master predicting categories and classes using logistic regression, decision trees, and performance metrics to solve customer segmentation and fraud detection problems.