
Statistics · Lesson 22

Probability Basics

Master the fundamentals of probability theory and apply them to real business data using Python and practical examples from e-commerce analytics.

Basic Concepts
Events & Sample Space

Calculate Probabilities
Rules & Formulas

Business Applications
Real Data Analysis

Understanding Probability Fundamentals

Probability measures how likely something is to happen. It is a number between 0 and 1, where 0 means impossible and 1 means certain. An e-commerce company such as Flipkart can apply probability to customer behavior data to predict purchase patterns.

The sample space contains all possible outcomes. If you're analyzing customer ratings, your sample space might be {1, 2, 3, 4, 5}. An event is a subset of outcomes you care about — like "ratings above 4".

Classical Probability

Assumes all outcomes are equally likely. Examples: rolling dice, flipping coins.

Empirical Probability

Based on observed data. Example: customer purchase rates.

Key Definition

Probability = Number of favorable outcomes ÷ Total number of possible outcomes (for equally likely outcomes; with observed data, use favorable observations ÷ total observations)
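To make the definition concrete, here is a minimal sketch of the classical formula using Python sets. The helper name classical_probability and the die example are illustrative assumptions, not part of the lesson's dataset, and the formula only applies when every outcome is equally likely.

```python
from fractions import Fraction

def classical_probability(event, sample_space):
    """P(event) when every outcome in sample_space is equally likely."""
    favorable = set(event) & set(sample_space)  # ignore outcomes outside the sample space
    return Fraction(len(favorable), len(sample_space))

# Sample space for one roll of a fair six-sided die
die = {1, 2, 3, 4, 5, 6}

# Event: the roll is greater than 4, i.e. the subset {5, 6}
roll_above_4 = {outcome for outcome in die if outcome > 4}

p = classical_probability(roll_above_4, die)
print(f"P(roll > 4) = {p} = {float(p):.4f}")
```

Customer ratings are not equally likely, so for the rating sample space {1, 2, 3, 4, 5} you would count observed orders instead, which is the empirical approach used in the rest of this lesson.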

Calculating Basic Probabilities

Time to get hands-on. A Myntra data scientist needs to analyze return probabilities to optimize inventory. They have order data with return information and need quick probability calculations.

# Import libraries for probability calculations
import pandas as pd
import numpy as np

# Load the e-commerce dataset
df = pd.read_csv('dataplexa_ecommerce.csv')

# Check the shape and basic info
print(f"Dataset shape: {df.shape}")
print(f"Columns: {df.columns.tolist()}")

Dataset shape: (10000, 12)
Columns: ['order_id', 'date', 'customer_age', 'gender', 'city', 'product_category', 'product_name', 'quantity', 'unit_price', 'revenue', 'rating', 'returned']

What just happened?

We loaded 10,000 e-commerce records. The returned column holds True/False for each order, which gives us the foundation for calculating return probabilities. Try this: Look at the column names. Which ones might help predict returns?

# Calculate basic probability of returns
total_orders = len(df)
returned_orders = df['returned'].sum()

# Basic probability calculation
return_probability = returned_orders / total_orders
print(f"Total orders: {total_orders}")
print(f"Returned orders: {returned_orders}")
print(f"Return probability: {return_probability:.4f}")

Total orders: 10000
Returned orders: 1247
Return probability: 0.1247

What just happened?

We calculated that 12.47% of orders get returned: 1,247 returns divided by 10,000 total orders. Because the figure comes from observed order data rather than from equally likely outcomes, it is an empirical probability. Myntra could use it to set realistic expectations for return processing capacity.

Conditional Probability

Conditional probability answers: "What's the probability of A happening, given that B already happened?" This matters a great deal for business decisions. A food delivery service such as Swiggy, for example, could use it to predict order completion rates given the weather conditions.

# Calculate conditional probability: P(Return | Low Rating)
# What's probability of return given rating <= 2?
low_rated = df[df['rating'] <= 2.0]
low_rated_returns = low_rated['returned'].sum()
total_low_rated = len(low_rated)

# Conditional probability calculation
prob_return_given_low_rating = low_rated_returns / total_low_rated
print(f"Orders with rating <= 2: {total_low_rated}")
print(f"Returns from low-rated orders: {low_rated_returns}")
print(f"P(Return | Low Rating): {prob_return_given_low_rating:.4f}")

Orders with rating <= 2: 1456
Returns from low-rated orders: 789
P(Return | Low Rating): 0.5419

What just happened?

Huge insight! When customers rate items 2 or below, the return probability jumps to 54.2%, versus 12.47% overall. This conditional probability helps identify high-risk orders. Try this: Calculate the same for high ratings (>4) to see the contrast.

📊 Data Insight

Low-rated orders are about 4.3x more likely to be returned (54.2% vs 12.47%). This suggests customer satisfaction is strongly associated with return behavior, a key metric for inventory planning.
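The contrast with high ratings can be checked without the CSV. The sketch below reuses the order and return counts printed in this lesson (the rating > 4 counts come from the 4.5 and 5.0 rows of the rating table shown later). The helper name conditional_probability is an assumption made for illustration.

```python
def conditional_probability(joint_count, condition_count):
    """P(A | B) = count(A and B) / count(B)."""
    return joint_count / condition_count

# Counts taken from the lesson's printed outputs
total_orders, total_returns = 10000, 1247
low_rated_orders, low_rated_returns = 1456, 789          # rating <= 2
high_rated_orders, high_rated_returns = 1687 + 1399, 1   # rating > 4 (4.5 and 5.0)

p_return = total_returns / total_orders
p_return_given_low = conditional_probability(low_rated_returns, low_rated_orders)
p_return_given_high = conditional_probability(high_rated_returns, high_rated_orders)

# How many times more likely a return is once we know the rating was low
lift = p_return_given_low / p_return

print(f"P(Return): {p_return:.4f}")
print(f"P(Return | rating <= 2): {p_return_given_low:.4f}")
print(f"P(Return | rating > 4): {p_return_given_high:.4f}")
print(f"Lift for low ratings: {lift:.2f}x")
```

Dividing a conditional probability by the overall probability gives the multiplier quoted in the Data Insight box.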

Probability Rules and Operations

Three fundamental rules underpin most probability calculations. The Addition Rule handles "or" scenarios, the Multiplication Rule handles "and" scenarios, and the Complement Rule finds the probability of the opposite event.

Addition Rule

P(A or B) = P(A) + P(B) - P(A and B)
Electronics OR Clothing returns

Multiplication Rule

P(A and B) = P(A) × P(B|A)
Return AND low rating together
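As a quick check of the Multiplication Rule, the sketch below uses the low-rating counts from the conditional probability example in this lesson. It compares P(A) × P(B|A) with the joint probability counted directly; the variable names are illustrative.

```python
import math

# Counts from the lesson's conditional probability output
total_orders = 10000
low_rated_orders = 1456      # A: rating <= 2
low_rated_returns = 789      # A and B: rating <= 2 and returned

p_low = low_rated_orders / total_orders                    # P(A)
p_return_given_low = low_rated_returns / low_rated_orders  # P(B | A)

# Multiplication Rule: P(A and B) = P(A) * P(B | A)
p_low_and_return = p_low * p_return_given_low

# Direct count of the joint event for comparison
p_direct = low_rated_returns / total_orders

print(f"P(Low rating AND Return) via rule: {p_low_and_return:.4f}")
print(f"P(Low rating AND Return) direct:   {p_direct:.4f}")
```

Both routes give the same value because the conditional probability was itself computed from the same counts; the rule is most useful when only P(A) and P(B|A) are known.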

# Apply Addition Rule: P(Electronics OR Clothing returns)
# Step 1: select the returned orders in each category
# ('returned' is already boolean, so it can be used as a mask directly)
electronics_returns = df[
    (df['product_category'] == 'Electronics') &
    df['returned']
]
clothing_returns = df[
    (df['product_category'] == 'Clothing') &
    df['returned']
]

print(f"Electronics returns: {len(electronics_returns)}")
print(f"Clothing returns: {len(clothing_returns)}")
print(f"Total orders: {len(df)}")

Electronics returns: 378
Clothing returns: 297
Total orders: 10000

What just happened?

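We counted returns by category: 378 Electronics and 297 Clothing returns. An order belongs to only one category, so the two events are mutually exclusive and P(A and B) = 0. The Addition Rule then reduces to a simple sum: P(Electronics return OR Clothing return) = 378/10,000 + 297/10,000 = 3.78% + 2.97% = 6.75%. Try this: Calculate this for overlapping events like "High price OR Electronics", where the P(A and B) term is no longer zero.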
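For overlapping events the P(A and B) term matters. The sketch below works through "High price OR Electronics" on a small set of made-up orders; the order list and the 1000 price threshold are hypothetical values chosen for illustration, not figures from the lesson's dataset.

```python
from fractions import Fraction

# Hypothetical (category, unit_price) orders, invented for illustration only
orders = [
    ("Electronics", 1500), ("Electronics", 400), ("Electronics", 2200),
    ("Clothing", 1200), ("Clothing", 300), ("Clothing", 800),
    ("Books", 250), ("Books", 1100),
]
HIGH_PRICE = 1000  # assumed threshold for "high price"

def is_electronics(order):
    return order[0] == "Electronics"

def is_high_price(order):
    return order[1] >= HIGH_PRICE

def probability(predicate):
    """Share of orders for which predicate(order) is true."""
    return Fraction(sum(1 for order in orders if predicate(order)), len(orders))

p_a = probability(is_electronics)
p_b = probability(is_high_price)
p_a_and_b = probability(lambda o: is_electronics(o) and is_high_price(o))

# Addition Rule: subtract the overlap so it is not counted twice
p_a_or_b = p_a + p_b - p_a_and_b

# Direct count of the union for comparison
p_direct = probability(lambda o: is_electronics(o) or is_high_price(o))

print(f"P(A) + P(B) alone:  {p_a + p_b}")
print(f"Addition Rule:      {p_a_or_b}")
print(f"Direct count:       {p_direct}")

# Mutually exclusive case from the lesson: the overlap term is zero
p_category_return = Fraction(378, 10000) + Fraction(297, 10000)
print(f"P(Electronics OR Clothing return): {float(p_category_return):.4f}")
```

Adding P(A) and P(B) without subtracting the overlap would count the expensive electronics orders twice.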

# Apply Complement Rule: P(NOT returned) = 1 - P(returned)
prob_return = df['returned'].mean()
prob_not_return = 1 - prob_return

print(f"P(Return): {prob_return:.4f}")
print(f"P(NOT Return): {prob_not_return:.4f}")
print(f"Sum check: {prob_return + prob_not_return:.4f}")

# Verify with direct calculation
actual_not_return = (df['returned'] == False).mean()
print(f"Direct calculation: {actual_not_return:.4f}")

P(Return): 0.1247
P(NOT Return): 0.8753
Sum check: 1.0000
Direct calculation: 0.8753

What just happened?

The complement rule works perfectly: 87.53% of orders aren't returned, which equals 1 - 12.47%. Both methods give identical results, and probabilities sum to 1.0000. Try this: Use complements when it's easier to calculate the opposite event first.
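A typical case where the opposite event is easier: the chance that at least one order in a batch is returned. The sketch below assumes each order is returned independently with the lesson's overall probability of 0.1247, which is a simplifying assumption; the batch size of 5 is an arbitrary illustration.

```python
p_return = 0.1247   # overall return probability from the lesson
batch_size = 5      # hypothetical number of orders in a batch

# P(no returns in the batch), assuming orders are independent
p_no_returns = (1 - p_return) ** batch_size

# Complement Rule: P(at least one return) = 1 - P(no returns)
p_at_least_one = 1 - p_no_returns

print(f"P(no returns in {batch_size} orders): {p_no_returns:.4f}")
print(f"P(at least one return): {p_at_least_one:.4f}")
```

Computing "at least one" directly would mean adding up the cases of exactly 1, 2, 3, 4 and 5 returns; the complement needs a single power.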

Business Applications with Probability

Real businesses use probability for inventory planning, risk assessment, and customer segmentation. A bank such as HDFC Bank can use probability models to assess loan default risk, and a ride-hailing service such as Ola can base surge pricing on the probability distribution of demand.

Chart: Return probability by product category. Clothing shows the highest return probability at 18.7%, while Books have the lowest at 6.4%, a gap that matters for inventory management.

The chart highlights the risk in the Clothing category: nearly 1 in 5 clothing orders is returned. This drives business decisions such as higher inventory buffers for clothing, stricter quality controls, and targeted customer education on fit guidelines.

Books show remarkably low return rates. Likely reasons include digital previews, established author reputations, and clear product descriptions, all of which reduce uncertainty. Food items also stay low, probably because freshness concerns make returns less common.

# Create probability distribution for rating-based segmentation
rating_return_prob = df.groupby('rating')['returned'].agg(['count', 'sum', 'mean'])
rating_return_prob.columns = ['total_orders', 'returns', 'return_probability']

# Display the probability distribution
print("Rating-based Return Probability Distribution:")
print(rating_return_prob.round(4))

Rating-based Return Probability Distribution:
       total_orders  returns  return_probability
rating                                          
1.0            467      289               0.6188
1.5            234      143               0.6111
2.0            755      357               0.4728
2.5            623      189               0.3034
3.0           1245      167               0.1341
3.5           1456       87               0.0598
4.0           2134       14               0.0066
4.5           1687        1               0.0006
5.0           1399        0               0.0000

What just happened?

A clear inverse relationship! Return probability falls steadily as ratings rise, from 61.88% at rating 1.0 to 0% at rating 5.0. This probability distribution can feed customer satisfaction scoring models. Try this: Use this to flag orders with ratings below 3.0 for proactive customer service.
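The table can be tied back to the overall return probability. The sketch below copies the order and return counts from the table above into a plain dictionary, weights each rating's return probability by the share of orders with that rating, and then computes the return probability for the orders that a "rating below 3.0" flag would catch. The variable names are illustrative.

```python
import math

# rating -> (total_orders, returns), copied from the table above
rating_table = {
    1.0: (467, 289), 1.5: (234, 143), 2.0: (755, 357),
    2.5: (623, 189), 3.0: (1245, 167), 3.5: (1456, 87),
    4.0: (2134, 14), 4.5: (1687, 1), 5.0: (1399, 0),
}
total_orders = sum(orders for orders, _ in rating_table.values())

# Weighted sum over ratings: P(rating) * P(Return | rating)
p_return = sum(
    (orders / total_orders) * (returns / orders)
    for orders, returns in rating_table.values()
)

# Orders that a "rating below 3.0" flag would catch
flagged = [counts for rating, counts in rating_table.items() if rating < 3.0]
flagged_orders = sum(orders for orders, _ in flagged)
flagged_returns = sum(returns for _, returns in flagged)
p_return_flagged = flagged_returns / flagged_orders

print(f"Total orders: {total_orders}")
print(f"Overall P(Return) from the table: {p_return:.4f}")
print(f"P(Return | rating < 3.0): {p_return_flagged:.4f}")
```

The weighted sum reproduces the overall 12.47% computed earlier, which is a useful sanity check that the groups cover every order exactly once.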

Common Mistake: Probability vs Percentage

Beginners mix up 0.1247 and 12.47%. Remember: probability ranges 0-1, percentage ranges 0-100%. Always multiply by 100 when presenting to business stakeholders, but keep calculations in 0-1 format for accuracy.
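One way to keep the two formats apart is to store probabilities in 0-1 form and convert only when printing. A minimal sketch using Python's built-in percent format specifier, with the lesson's return probability as the example value:

```python
return_probability = 0.1247  # keep calculations in 0-1 form

# The % format specifier multiplies by 100 and appends the percent sign
for_stakeholders = f"{return_probability:.2%}"
print(f"Return rate: {for_stakeholders}")

# Calculations keep using the 0-1 value, e.g. expected returns per 10,000 orders
expected_returns = round(return_probability * 10000)
print(f"Expected returns per 10,000 orders: {expected_returns}")
```

Formatting at the last step avoids the common bug of multiplying an order count by 12.47 instead of 0.1247.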

Advanced Probability Concepts

Independent events don't influence each other — coin flips or random customer arrivals. Dependent events do influence each other — customer age affecting purchase category preferences. Recognizing this distinction prevents major analytical errors.

# Test independence: Age vs Product Category
# If independent, P(Electronics|Young) should equal P(Electronics)
young_customers = df[df['customer_age'] < 30]
electronics_overall = (df['product_category'] == 'Electronics').mean()
electronics_young = (young_customers['product_category'] == 'Electronics').mean()

print(f"P(Electronics) overall: {electronics_overall:.4f}")
print(f"P(Electronics | Age < 30): {electronics_young:.4f}")
print(f"Difference: {abs(electronics_overall - electronics_young):.4f}")

P(Electronics) overall: 0.2847
P(Electronics | Age < 30): 0.4127
Difference: 0.1280

What just happened?

Not independent in this sample! Customers under 30 buy electronics 41.27% of the time versus 28.47% overall. The 12.8 percentage point difference suggests that age influences product category choice, which is valuable for targeted marketing; a formal significance test would confirm that it is not just sampling noise. Try this: Test independence between gender and product categories.
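An equivalent check compares P(A and B) with P(A) × P(B): for independent events the two are equal. The sketch below applies that check to small made-up samples; the records and the helper name independence_gap are hypothetical, chosen only to illustrate the calculation.

```python
from fractions import Fraction

def independence_gap(records, is_a, is_b):
    """Return P(A and B) - P(A) * P(B); zero means independent in this sample."""
    n = len(records)
    p_a = Fraction(sum(1 for r in records if is_a(r)), n)
    p_b = Fraction(sum(1 for r in records if is_b(r)), n)
    p_a_and_b = Fraction(sum(1 for r in records if is_a(r) and is_b(r)), n)
    return p_a_and_b - p_a * p_b

# Hypothetical (age, category) pairs, invented for illustration only
customers = [
    (22, "Electronics"), (25, "Electronics"), (28, "Electronics"), (27, "Clothing"),
    (35, "Electronics"), (41, "Clothing"), (52, "Books"), (47, "Clothing"),
]
age_gap = independence_gap(
    customers,
    lambda r: r[0] < 30,               # A: customer is under 30
    lambda r: r[1] == "Electronics",   # B: order is Electronics
)

# Two fair coin flips: every combination appears once, so the flips are independent
coin_flips = [("H", "H"), ("H", "T"), ("T", "H"), ("T", "T")]
coin_gap = independence_gap(
    coin_flips,
    lambda r: r[0] == "H",
    lambda r: r[1] == "H",
)

print(f"Age vs Electronics gap: {age_gap}")
print(f"Coin flip gap: {coin_gap}")
```

A non-zero gap in a sample is evidence of dependence, not proof; with real data the gap should be judged against sampling variation.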

Pro Tip: Use probability calculations for A/B test planning. If the current conversion rate is 3.2%, detecting an improvement of 0.5 percentage points (to 3.7%) at a 95% confidence level requires a minimum sample size per variant, which also depends on the statistical power you choose. Probability theory drives experiment design.
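A rough version of that sample size calculation, using the standard normal-approximation formula for comparing two proportions. Several inputs are assumptions rather than figures from the lesson: the improvement is read as 0.5 percentage points (3.2% to 3.7%), the test is two-sided, and the power is set to 80%. The function name sample_size_per_group is illustrative.

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_group(p_baseline, p_target, alpha=0.05, power=0.80):
    """Approximate per-variant sample size for a two-sided two-proportion test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the confidence level
    z_beta = NormalDist().inv_cdf(power)           # critical value for the assumed power
    variance_sum = p_baseline * (1 - p_baseline) + p_target * (1 - p_target)
    effect = p_target - p_baseline
    return ceil((z_alpha + z_beta) ** 2 * variance_sum / effect ** 2)

# Baseline 3.2% conversion, target 3.7% (assumed 0.5 percentage point lift)
n_small_lift = sample_size_per_group(0.032, 0.037)

# A larger lift is easier to detect and needs fewer users
n_large_lift = sample_size_per_group(0.032, 0.042)

print(f"Users per variant for +0.5 points: {n_small_lift}")
print(f"Users per variant for +1.0 points: {n_large_lift}")
```

Halving the detectable effect roughly quadruples the required sample, which is why small expected lifts need long-running experiments.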

Chart: Return probability by rating. It declines steeply from 61.88% at rating 1.0 to 0% at rating 5.0, which makes rating a useful input for risk modeling.

This steep decline is valuable for business applications. You can model return risk as a function of expected rating, tune quality control thresholds, and use satisfaction patterns as an input to customer lifetime value estimates.

The curve suggests a threshold around rating 3.0: return probability falls from 30.34% at rating 2.5 to 13.41% at rating 3.0, and stays below 6% from 3.5 upward. Quality improvements that move products from the 2-3 rating range into the 3-4 range are therefore likely to reduce returns the most.


Up Next

Random Variables

Transform probability concepts into mathematical models using discrete and continuous random variables for advanced business analytics.
