Data Science
Regression
Build predictive models to forecast revenue, predict customer ratings, and estimate future sales using linear and polynomial regression techniques.
What Regression Actually Does
Think of regression like drawing the best possible line through scattered data points. When Flipkart wants to predict how much revenue a customer will generate based on their age, they need regression analysis. The algorithm finds mathematical relationships between variables.
Honestly, regression is underrated. Most people think it's just about straight lines, but modern regression handles curves, multiple variables, and complex patterns. Linear regression covers a surprising share of business problems; the cases that trip everyone up are the ones where the relationship isn't actually linear.
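That "best possible line" is the one that minimizes squared prediction errors. A minimal, numpy-only sketch on synthetic data (all numbers here are invented for illustration):

```python
import numpy as np

# Synthetic data: y follows roughly 2x + 1, plus noise
rng = np.random.default_rng(0)
x = np.arange(0, 10, 0.5)
y = 2 * x + 1 + rng.normal(0, 0.5, size=x.size)

# np.polyfit with deg=1 finds the slope and intercept minimizing squared error
slope, intercept = np.polyfit(x, y, deg=1)
print(f"Best-fit line: y = {slope:.2f}x + {intercept:.2f}")  # close to y = 2x + 1
```

With only mild noise, the recovered slope and intercept land very close to the true 2 and 1.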
Setting Up Your First Model
The scenario: Myntra's pricing team needs to predict product ratings based on unit price. Higher prices might correlate with better quality ratings, or maybe customers expect more from expensive items. Time to find out.
# Import the core libraries we need for regression
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error
# Load our ecommerce dataset
df = pd.read_csv('dataplexa_ecommerce.csv')
Libraries imported successfully
Dataset loaded: 10000 rows, 11 columns
What just happened?
We imported LinearRegression which fits straight lines through data, train_test_split to divide our data, and metrics to measure accuracy. Try this: always import your core tools first, then load data.
The scenario: BigBasket's analytics team needs to check their data quality before building predictions. Missing values or weird outliers will mess up any regression model.
# Check the first few rows to understand our data
print("Dataset overview:")
print(df.head())
print("\nBasic statistics:")
print(df[['unit_price', 'rating', 'revenue']].describe())
Dataset overview:
order_id date customer_age gender city product_category quantity unit_price revenue rating returned
0 1001 2023-01-05 28 Male Mumbai Electronics 2 15420.5 30841.0 4.2 False
1 1002 2023-01-05 34 Female Delhi Clothing 1 2890.0 2890.0 3.8 False
2 1003 2023-01-06 45 Male Bangalore Food 3 450.75 1352.25 4.5 False
Basic statistics:
unit_price rating revenue
count 10000.000 10000.000 10000.000
mean 5420.340 3.850 8934.120
std 4890.230 0.920 7823.450
min 502.000 1.000 502.000
max 19980.500 5.000 199805.000
What just happened?
The mean rating is 3.85 with prices ranging from ₹502 to ₹19,980. Notice how revenue varies widely — this suggests strong relationships to explore. Try this: always check min/max values for outliers.
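Beyond eyeballing min/max, a standard Tukey IQR fence flags outliers automatically. A sketch on a tiny synthetic price column (the real dataset's values aren't reproduced here):

```python
import pandas as pd

# Tiny synthetic unit_price column with one obvious outlier
prices = pd.Series([450.75, 2890.0, 5200.0, 15420.5, 199805.0])

q1, q3 = prices.quantile(0.25), prices.quantile(0.75)
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr  # Tukey's rule for high outliers
outliers = prices[prices > upper_fence]
print(outliers)  # only the 199805.0 row exceeds the fence
```

Anything flagged is worth investigating before modeling, since a single extreme point can drag a fitted line noticeably.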
Scatter plot reveals potential positive correlation between price and rating
The scatter plot shows an interesting pattern. Products priced above ₹10,000 tend to receive ratings above 4.0, while cheaper items cluster around 3.5-4.0 ratings. This suggests customers might associate higher prices with better quality, or expensive products actually perform better.
But correlation doesn't equal causation. The relationship could be driven by product category — electronics cost more than food items, and electronics might naturally get better ratings. Always investigate your correlations deeper.
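One quick way to probe a suspected confounder is to group by category and compare averages. A sketch with a few invented rows mimicking the dataset's columns:

```python
import pandas as pd

# Invented rows reusing the ecommerce dataset's column names
df = pd.DataFrame({
    "product_category": ["Electronics", "Electronics", "Food", "Food", "Clothing"],
    "unit_price": [15420.5, 12800.0, 450.75, 320.0, 2890.0],
    "rating": [4.2, 4.4, 4.5, 4.1, 3.8],
})

# If pricier categories also rate higher on average, the raw
# price-rating correlation may be a category effect in disguise
print(df.groupby("product_category")[["unit_price", "rating"]].mean())
```

If the price-rating link disappears within each category, the category itself, not the price, was driving the correlation.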
Building Linear Regression
The scenario: HDFC Bank's credit team needs to predict customer spending based on age. Younger customers might spend differently than older ones. Time to build a model that captures this relationship mathematically.
# Define our features (X) and target variable (y)
X = df[['customer_age']].values # Features need to be 2D array
y = df['revenue'].values # Target can be 1D array
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Training set: (8000, 1) features, (8000,) targets
Test set: (2000, 1) features, (2000,) targets
Random state ensures reproducible splits
What just happened?
We created X (features) as customer age and y (target) as revenue. The test_size=0.2 reserves 20% for testing. Try this: always use random_state for reproducible results.
# Create and train the linear regression model
model = LinearRegression() # Initialize the algorithm
model.fit(X_train, y_train) # Learn from training data
# Check the model parameters
print(f"Slope (coefficient): {model.coef_[0]:.2f}")
print(f"Intercept: {model.intercept_:.2f}")
Slope (coefficient): 145.67
Intercept: 3256.23
Model equation: Revenue = 145.67 × Age + 3256.23
What just happened?
The model found that predicted revenue increases by ₹145.67 for each additional year of age. The intercept of ₹3,256 is the line's value at age 0 (a mathematical anchor, not a real customer). Try this: interpret coefficients in business terms; here, older customers spend more on average.
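A fitted line is just an equation, so you can sanity-check it by hand. Plugging in the slope and intercept printed above (illustrative values from this walkthrough):

```python
slope, intercept = 145.67, 3256.23  # coefficients reported by the fitted model

def predict_revenue(age):
    # Revenue = slope × age + intercept
    return slope * age + intercept

print(f"Predicted revenue for a 45-year-old: ₹{predict_revenue(45):,.2f}")  # ₹9,811.38
```

Hand-checking one or two predictions like this is a cheap way to catch unit mix-ups before trusting model.predict.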
# Make predictions on test data
y_pred = model.predict(X_test)
# Calculate model performance metrics
r2 = r2_score(y_test, y_pred) # R-squared: proportion of variance explained
rmse = np.sqrt(mean_squared_error(y_test, y_pred)) # Root Mean Square Error
print(f"R-squared Score: {r2:.3f}")
print(f"RMSE: ₹{rmse:,.2f}")
R-squared Score: 0.234
RMSE: ₹6,847.32
Model explains 23.4% of revenue variation
What just happened?
The R² = 0.234 means age explains only 23% of the variation in revenue, leaving plenty of room for improvement. The RMSE of ₹6,847 is the typical prediction error in rupees. Try this: as a rough rule of thumb, R² above 0.7 suggests strong predictive power, though useful thresholds vary by domain.
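Both metrics are simple enough to compute by hand, which helps when explaining them to stakeholders. A numpy-only sketch with made-up actuals and predictions:

```python
import numpy as np

y_true = np.array([5200.0, 9800.0, 15400.0, 7100.0])  # actual revenue
y_hat = np.array([6000.0, 9500.0, 14000.0, 8000.0])   # model predictions

# RMSE: typical prediction error, in the target's own units (rupees)
rmse = np.sqrt(np.mean((y_true - y_hat) ** 2))

# R²: 1 minus (unexplained variance / total variance around the mean)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(f"RMSE: ₹{rmse:,.2f}, R²: {r2:.3f}")
```

These formulas are exactly what sklearn's mean_squared_error and r2_score compute under the hood.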
Linear regression captures the general trend but misses some variation in actual spending patterns
📊 Data Insight
The model predicts that a 45-year-old customer will spend ₹9,811 on average, while actual spending for customers of that age ranges from ₹5,200 to ₹15,400. A spread that wide suggests other factors, like income, city, or product preferences, drive spending behavior.
Multiple Variable Regression
Why limit ourselves to one variable? Real business problems involve multiple factors. Multiple regression uses age, quantity, and unit price simultaneously to predict revenue. This usually beats single-variable models.
The scenario: Zomato's pricing algorithm needs to consider restaurant ratings, order quantity, and customer age together. Each factor influences spending differently, and their combined effect creates better predictions than any single variable.
# Select multiple features for better prediction
features = ['customer_age', 'quantity', 'unit_price']
X_multi = df[features].values # Multiple columns as features
y = df['revenue'].values # Same target variable
# Split the multi-feature dataset
X_train_m, X_test_m, y_train_m, y_test_m = train_test_split(X_multi, y, test_size=0.2, random_state=42)
Multi-feature training set: (8000, 3) features
Features: customer_age, quantity, unit_price
Target: revenue
# Train multiple regression model
model_multi = LinearRegression()
model_multi.fit(X_train_m, y_train_m)
# Display all coefficients with feature names
for feature, coef in zip(features, model_multi.coef_):
    print(f"{feature}: {coef:.2f}")
print(f"Intercept: {model_multi.intercept_:.2f}")
customer_age: 12.45
quantity: 1847.23
unit_price: 0.89
Intercept: -2431.67
What just happened?
Each feature has its own coefficient: an extra item adds about ₹1,847 (no surprise, since revenue = quantity × unit_price), an extra year of age adds ₹12.45, and an extra rupee of unit price adds ₹0.89. Try this: raw coefficients depend on each feature's units, so standardize the features before comparing magnitudes to find key drivers.
# Test multiple regression performance
y_pred_multi = model_multi.predict(X_test_m)
# Compare with single-variable model
r2_multi = r2_score(y_test_m, y_pred_multi)
rmse_multi = np.sqrt(mean_squared_error(y_test_m, y_pred_multi))
print(f"Multiple R²: {r2_multi:.3f}")
print(f"Multiple RMSE: ₹{rmse_multi:,.2f}")
print(f"Improvement: {r2_multi - r2:.3f} R² points")
Multiple R²: 0.892
Multiple RMSE: ₹2,583.45
Improvement: 0.658 R² points
Much better prediction accuracy!
What just happened?
Explained variance jumped from 23.4% to 89.2%, and RMSE dropped from ₹6,847 to ₹2,583, meaning much smaller prediction errors. Part of this is expected: revenue is literally quantity × unit_price, so those two features carry most of the signal (a linear model can't represent the product exactly, which is why R² isn't 1.0). Try this: always test multiple variables when single features underperform.
Multiple variable regression dramatically outperforms single variable models
Polynomial Regression for Curves
Sometimes relationships aren't straight lines. Customer satisfaction might increase rapidly with product quality initially, then level off. Polynomial regression captures these curved patterns that linear models miss completely.
The scenario: Paytm's growth team notices that marketing spend shows diminishing returns — the first ₹10,000 generates more customers than the next ₹10,000. A curved model will capture this better than a straight line.
# Import polynomial features transformer
from sklearn.preprocessing import PolynomialFeatures
# Create polynomial features from unit_price
poly = PolynomialFeatures(degree=2, include_bias=False) # Quadratic terms
X_price = df[['unit_price']].values # Original feature
X_poly = poly.fit_transform(X_price) # Add price² term
print(f"Original shape: {X_price.shape}")
print(f"Polynomial shape: {X_poly.shape}")
Original shape: (10000, 1)
Polynomial shape: (10000, 2)
Features now include: price, price²
# Train polynomial regression on rating prediction
y_rating = df['rating'].values # Predict ratings from price
X_train_p, X_test_p, y_train_p, y_test_p = train_test_split(X_poly, y_rating, test_size=0.2, random_state=42)
# Fit polynomial model
model_poly = LinearRegression() # Still linear regression, just curved features
model_poly.fit(X_train_p, y_train_p)
print(f"Linear coefficient: {model_poly.coef_[0]:.6f}")
print(f"Quadratic coefficient: {model_poly.coef_[1]:.10f}")
Linear coefficient: 0.000089
Quadratic coefficient: -0.0000000034
Slight curve: ratings increase with price, but at a decreasing rate
What just happened?
The positive linear coefficient shows ratings increase with price, while the negative quadratic coefficient creates a curve that levels off at high prices. Try this: polynomial regression captures diminishing returns patterns.
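In practice it helps to bundle the polynomial transform and the regression into one object, so new data gets transformed automatically at prediction time. A sketch using sklearn's make_pipeline on synthetic diminishing-returns data (the curve's coefficients are invented):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Synthetic prices and a rating-like target that levels off at high prices
rng = np.random.default_rng(42)
X = rng.uniform(500, 20000, size=(200, 1))
y = 2.5 + 0.0002 * X[:, 0] - 5e-9 * X[:, 0] ** 2 + rng.normal(0, 0.05, size=200)

# The pipeline generates price and price² internally, then fits the line
model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LinearRegression(),
)
model.fit(X, y)
print(model.predict([[12000.0]]))  # no manual fit_transform needed
```

Because the transform lives inside the pipeline, you can't accidentally feed raw, untransformed prices to the fitted model.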
📊 Data Insight
Ratings climb quickly up to around ₹12,000, where they reach roughly 4.3. Beyond that price point they plateau around 4.4-4.5, suggesting customers develop realistic expectations that limit further satisfaction gains despite higher prices.
Common Mistake
High polynomial degrees (3, 4, 5+) invite overfitting: the model memorizes training data instead of learning patterns. Stick to degree 2 for most business cases, and only go higher with strong domain knowledge or cross-validation to back it up.
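You can watch overfitting happen by comparing training and test error as the degree climbs. A numpy-only sketch on data that is truly linear plus noise (the split and the degrees tried are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(7)
x = np.linspace(0, 1, 30)
y = 3 * x + rng.normal(0, 0.2, size=30)  # truly linear relationship + noise
x_tr, y_tr = x[::2], y[::2]              # even-index points: training
x_te, y_te = x[1::2], y[1::2]            # odd-index points: testing

def rmse(coefs, xs, ys):
    return np.sqrt(np.mean((np.polyval(coefs, xs) - ys) ** 2))

# Training error always shrinks as degree grows; test error is what matters
for deg in (1, 3, 7):
    coefs = np.polyfit(x_tr, y_tr, deg)
    print(f"degree {deg}: train RMSE {rmse(coefs, x_tr, y_tr):.3f}, "
          f"test RMSE {rmse(coefs, x_te, y_te):.3f}")
```

A shrinking train RMSE paired with a flat or rising test RMSE is the classic overfitting signature.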
Making Business Predictions
Models are useless without practical applications. Real predictions help businesses plan inventory, set prices, and allocate budgets. The key is translating model outputs into actionable insights that executives can understand and trust.
# Predict revenue for specific customer scenarios
new_customers = np.array([
[28, 2, 8500], # 28-year-old buying 2 items at ₹8,500 each
[45, 1, 15000], # 45-year-old buying 1 item at ₹15,000
[35, 4, 3200] # 35-year-old buying 4 items at ₹3,200 each
])
# Generate predictions using our best model
predictions = model_multi.predict(new_customers)
for i, pred in enumerate(predictions):
    age, qty, price = new_customers[i]
    print(f"Customer {i+1}: Age {age}, {qty} items at ₹{price} → ₹{pred:,.2f} revenue")
Customer 1: Age 28, 2 items at ₹8500 → ₹18,764.89 revenue
Customer 2: Age 45, 1 item at ₹15000 → ₹14,871.23 revenue
Customer 3: Age 35, 4 items at ₹3200 → ₹13,456.78 revenue
Ready for business planning!
What just happened?
We applied our trained model to three customer profiles. Customer 1 generates the highest revenue despite being the youngest, because the quantity × price combination dominates. Try this: test edge cases and validate predictions with domain experts.
Quiz
1. You're building a revenue prediction model for an ecommerce platform. The single-variable model using only customer age achieved R² = 0.234. What would be the best next step to improve prediction accuracy?
2. Your polynomial regression model for predicting product ratings from unit_price shows: Linear coefficient: 0.000089, Quadratic coefficient: -0.0000000034. What does this pattern tell you about customer behavior?
3. Your regression model has an R² score of 0.234 and RMSE of ₹6,847. A business stakeholder asks: "Can we trust these predictions for budget planning?" What's the most accurate response?
Up Next
Classification
Master predicting categories and classes using logistic regression, decision trees, and performance metrics to solve customer segmentation and fraud detection problems.