Data Science
SciPy
Master statistical computations, optimization, and scientific computing with Python's most powerful mathematical library for data analysis.
What Makes SciPy Essential
SciPy builds on NumPy to deliver advanced scientific computing capabilities. While NumPy handles arrays, SciPy tackles complex mathematical problems. Statistical hypothesis testing. Optimization algorithms. Signal processing. Advanced clustering. The heavy mathematical lifting that separates basic analysis from serious data science.
Think of it this way: NumPy is your calculator, SciPy is your engineering toolkit. Data teams across industry lean on SciPy for statistical inference — Amazon for A/B testing, Netflix for recommendation optimization, Google for clustering algorithms. It is part of the mathematical foundation behind machine learning frameworks.
| SciPy Module | Purpose | Key Functions | Business Use |
|---|---|---|---|
| `stats` | Statistical tests & distributions | `ttest_ind`, `chi2`, `pearsonr` | A/B testing, hypothesis validation |
| `optimize` | Function optimization | `minimize`, `curve_fit` | Price optimization, cost minimization |
| `cluster` | Clustering algorithms | `hierarchy`, `vq.kmeans` | Customer segmentation, grouping |
| `spatial` | Distance & spatial analysis | `distance`, `KDTree` | Location analysis, similarity |
Statistical Testing with SciPy
The scenario: Flipkart's pricing team needs to test if Electronics category customers spend significantly more than Clothing customers. Statistical significance determines pricing strategy.
```python
# Import SciPy's statistics module for hypothesis testing
import scipy.stats as stats
import pandas as pd
import numpy as np

# Load our ecommerce dataset for analysis
df = pd.read_csv('dataplexa_ecommerce.csv')
print(f"Dataset shape: {df.shape}")
print(f"Columns: {list(df.columns)}")
```

```
Dataset shape: (10000, 11)
Columns: ['order_id', 'date', 'customer_age', 'gender', 'city', 'product_category', 'product_name', 'quantity', 'unit_price', 'revenue', 'rating']
```
What just happened?
SciPy's stats module loaded for statistical functions. Our dataset has 10,000 orders with 11 features including product_category and revenue. Try this: Check unique categories with df['product_category'].unique()
Now separate revenue data by category. Independent samples t-test compares means between two groups. Perfect for category comparison analysis.
```python
# Extract revenue data for Electronics category customers
electronics_revenue = df[df['product_category'] == 'Electronics']['revenue']

# Extract revenue data for Clothing category customers
clothing_revenue = df[df['product_category'] == 'Clothing']['revenue']

# Display basic statistics for comparison (:,.0f adds thousands separators)
print(f"Electronics: Mean ₹{electronics_revenue.mean():,.0f}, Count {len(electronics_revenue):,}")
print(f"Clothing: Mean ₹{clothing_revenue.mean():,.0f}, Count {len(clothing_revenue):,}")
```

```
Electronics: Mean ₹45,234, Count 2,847
Clothing: Mean ₹28,156, Count 2,129
```
What just happened?
Electronics shows ₹17,078 higher average revenue than Clothing. But the groups differ in size (2,847 vs 2,129 orders), and a gap in sample means could still arise from random variation. Statistical testing determines whether this difference is significant or noise. Try this: Compare standard deviations with .std()
Time for the independent samples t-test. This determines statistical significance between group means.
```python
# Perform independent samples t-test between the two groups
t_statistic, p_value = stats.ttest_ind(electronics_revenue, clothing_revenue)

# Display test results with interpretation
print(f"T-statistic: {t_statistic:.3f}")
print(f"P-value: {p_value:.6f}")
print(f"Significant at α=0.05: {'Yes' if p_value < 0.05 else 'No'}")
```

```
T-statistic: 23.847
P-value: 0.000000
Significant at α=0.05: Yes
```
What just happened?
T-statistic of 23.847 shows a large difference relative to the variation in the data. The p-value of 0.000000 (zero to six decimal places) means the difference is statistically significant. Electronics customers spend significantly more on average. Try this: Test other category pairs.
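One caveat: by default `ttest_ind` assumes both groups have equal variances, which revenue data often violates. Welch's t-test (`equal_var=False`) drops that assumption and is a safer default. A minimal sketch on synthetic stand-in data, since the `dataplexa_ecommerce.csv` file is assumed (the means, scales, and group sizes below are illustrative):

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(42)

# Synthetic stand-ins for the two revenue groups, with different variances
electronics = rng.normal(loc=45000, scale=12000, size=2847)
clothing = rng.normal(loc=28000, scale=8000, size=2129)

# Welch's t-test: equal_var=False drops the equal-variance assumption
t_welch, p_welch = stats.ttest_ind(electronics, clothing, equal_var=False)
print(f"Welch t-statistic: {t_welch:.3f}, p-value: {p_welch:.6f}")

# The variance ratio shows why Welch's version is the safer choice here
print(f"Variance ratio: {electronics.var() / clothing.var():.2f}")
```

When variances and group sizes are similar, both versions agree closely; when they differ, Welch's test keeps the false-positive rate honest.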
Electronics dominates with ₹45K average revenue, 60% higher than lowest category
Electronics clearly leads revenue generation across all product categories. The ₹17K difference between Electronics and Clothing represents substantial business impact. Statistical significance confirms this isn't random variation — it's a real pattern worth strategic attention.
This analysis supports focused marketing spend on Electronics customers. Higher lifetime value justifies premium acquisition costs. Clothing customers might need different engagement strategies or upselling to premium products.
Optimization and Curve Fitting
The scenario: Zomato's data team wants to model the relationship between customer age and average order value. Finding the optimal curve helps predict spending patterns for different demographics.
```python
# Import optimization module for curve fitting
from scipy.optimize import curve_fit

# Calculate average revenue by customer age
age_revenue = df.groupby('customer_age')['revenue'].mean().reset_index()
print(f"Age range: {age_revenue['customer_age'].min()} to {age_revenue['customer_age'].max()}")
print(f"Revenue range: ₹{age_revenue['revenue'].min():,.0f} to ₹{age_revenue['revenue'].max():,.0f}")
```

```
Age range: 18 to 65
Revenue range: ₹28,945 to ₹52,180
```
What just happened?
Grouped data by customer_age and calculated mean revenue. Age spans 47 years (18-65) with ₹23K revenue variation. This relationship needs mathematical modeling to predict spending patterns. Try this: Plot scatter to visualize the relationship first.
```python
# Define quadratic function to model age-revenue relationship
def quadratic_model(x, a, b, c):
    return a * x**2 + b * x + c

# Extract data arrays for curve fitting
x_data = age_revenue['customer_age'].values
y_data = age_revenue['revenue'].values
print(f"Fitting model to {len(x_data)} age groups")
```

```
Fitting model to 48 age groups
```
What just happened?
Defined a quadratic_model function with parameters a, b, c. Extracted x_data (ages) and y_data (revenue) for 48 distinct age groups. Ready for optimization to find best-fit parameters. Try this: Test different model functions like exponential or logarithmic.
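Following that "Try this" prompt, here is a hedged sketch of a logarithmic alternative. The real dataset is assumed, so the block fits synthetic age-revenue numbers instead; `log_model` and all coefficients are illustrative:

```python
import numpy as np
from scipy.optimize import curve_fit

# Logarithmic alternative: steep early growth that flattens with age
def log_model(x, a, b):
    return a * np.log(x) + b

# Synthetic stand-in for age_revenue: ages 18-65, revenue rising then flattening
ages = np.arange(18, 66, dtype=float)
rng = np.random.default_rng(0)
revenue = 9000 * np.log(ages) + 5000 + rng.normal(0, 500, size=ages.size)

# curve_fit works identically regardless of the model's functional form
params, _ = curve_fit(log_model, ages, revenue)
a_hat, b_hat = params
print(f"Fitted: a={a_hat:.0f}, b={b_hat:.0f}")
```

A logarithmic curve rises monotonically, so it cannot capture a peak-then-decline pattern; comparing R² across candidate shapes is how you let the data choose.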
```python
# Fit quadratic curve to age-revenue data using least-squares optimization
optimal_params, covariance = curve_fit(quadratic_model, x_data, y_data)

# Extract optimized parameters
a, b, c = optimal_params
print(f"Optimal parameters: a={a:.3f}, b={b:.3f}, c={c:.0f}")

# R² here is the squared correlation between observed and fitted values
r_squared = np.corrcoef(y_data, quadratic_model(x_data, a, b, c))[0, 1] ** 2
print(f"R² coefficient: {r_squared:.3f}")
```

```
Optimal parameters: a=-2.847, b=298.456, c=25789
R² coefficient: 0.847
```
What just happened?
SciPy's curve_fit found optimal parameters through least-squares optimization. Negative a=-2.847 creates inverted parabola. R² of 0.847 means 84.7% of revenue variation explained by age. Try this: Predict revenue for specific ages using the model.
Quadratic model captures peak spending around age 52, then declining revenue in older customers
The fitted curve reveals critical business insights. Revenue peaks around age 52, then declines — likely reflecting peak earning years followed by retirement-driven spending reduction. This inverted parabola pattern suggests targeting mid-career professionals for premium offerings.
With R² = 0.847, the model explains 84.7% of the variation in average revenue across age groups (individual orders vary far more). Marketing teams can use it to forecast demographic spending, optimize age-targeted campaigns, and identify high-value customer segments.
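The peak age falls directly out of the fitted coefficients: a quadratic a·x² + b·x + c with a < 0 reaches its maximum at the vertex x = -b / (2a). A quick check with the parameters reported above:

```python
# Fitted parameters reported by curve_fit above
a, b = -2.847, 298.456

# Vertex of a downward-opening parabola: x = -b / (2a)
peak_age = -b / (2 * a)
print(f"Peak spending age: {peak_age:.1f}")
```

This reproduces the "peak around age 52" reading of the curve without eyeballing a plot.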
Clustering and Spatial Analysis
The scenario: Swiggy wants to group cities by customer behavior patterns. Hierarchical clustering reveals natural market segments for targeted expansion strategies.
```python
# Import clustering modules from SciPy
from scipy.cluster import hierarchy
from scipy.spatial.distance import pdist

# Calculate city-level metrics for clustering analysis
city_metrics = df.groupby('city').agg({
    'revenue': 'mean',
    'customer_age': 'mean',
    'rating': 'mean',
    'quantity': 'mean'
}).round(2)
print("City metrics for clustering:")
print(city_metrics)
```

```
City metrics for clustering:
           revenue  customer_age  rating  quantity
city
Bangalore    38420          41.2    4.12       3.8
Chennai      35680          39.8    4.08       3.6
Delhi        42150          42.1    4.15       4.1
Mumbai       44890          43.2    4.18       4.3
Pune         36240          40.5    4.09       3.7
```
What just happened?
Aggregated 4 key metrics per city: revenue, customer_age, rating, quantity. Mumbai leads in revenue (₹44,890) and customer age. Delhi close second. This creates feature matrix for clustering. Try this: Add more metrics like return rate or product mix.
```python
# Calculate pairwise distances between cities using the Euclidean metric
distances = pdist(city_metrics.values, metric='euclidean')

# Perform hierarchical clustering using the Ward linkage method
linkage_matrix = hierarchy.linkage(distances, method='ward')

# Display clustering results
print("Hierarchical clustering complete")
print(f"Distance matrix shape: {distances.shape}")
print(f"Linkage matrix shape: {linkage_matrix.shape}")
```

```
Hierarchical clustering complete
Distance matrix shape: (10,)
Linkage matrix shape: (4, 4)
```
What just happened?
Calculated Euclidean distances between all city pairs (10 pairs from 5 cities). Ward linkage builds clusters by minimizing within-cluster variance. linkage_matrix encodes the full clustering hierarchy. Try this: Use different metrics like 'manhattan' or 'cosine' distance.
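One caveat the raw distances hide: revenue (~40,000) dwarfs rating (~4), so Euclidean distance on unscaled features is effectively clustering on revenue alone. A sketch of standardizing each column with `scipy.stats.zscore` first, using the city table above as stand-in data:

```python
import numpy as np
from scipy.stats import zscore
from scipy.spatial.distance import pdist
from scipy.cluster import hierarchy

# City metrics from the table above: revenue, customer_age, rating, quantity
cities = ['Bangalore', 'Chennai', 'Delhi', 'Mumbai', 'Pune']
metrics = np.array([
    [38420, 41.2, 4.12, 3.8],
    [35680, 39.8, 4.08, 3.6],
    [42150, 42.1, 4.15, 4.1],
    [44890, 43.2, 4.18, 4.3],
    [36240, 40.5, 4.09, 3.7],
])

# z-score each column so every feature contributes on a comparable scale
scaled = zscore(metrics, axis=0)

distances = pdist(scaled, metric='euclidean')
linkage_matrix = hierarchy.linkage(distances, method='ward')
labels = hierarchy.fcluster(linkage_matrix, t=2, criterion='maxclust')
print(dict(zip(cities, labels)))
```

In this particular table all four features point the same way, so scaling happens not to change the Delhi/Mumbai vs rest split; with less correlated features it often does.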
```python
# Extract cluster assignments at the 2-cluster level
cluster_labels = hierarchy.fcluster(linkage_matrix, t=2, criterion='maxclust')

# Display clustering results with city names
# (.values strips the city index so all three columns align positionally)
cluster_results = pd.DataFrame({
    'City': city_metrics.index,
    'Cluster': cluster_labels,
    'Revenue': city_metrics['revenue'].values
})
print("City clustering results:")
print(cluster_results.sort_values('Cluster'))
```

```
City clustering results:
        City  Cluster  Revenue
2      Delhi        1    42150
3     Mumbai        1    44890
0  Bangalore        2    38420
1    Chennai        2    35680
4       Pune        2    36240
```
What just happened?
Used fcluster to extract 2 clusters from hierarchy. Cluster 1: Delhi & Mumbai (high revenue metros). Cluster 2: Bangalore, Chennai, Pune (mid-tier cities). Clear market segmentation based on customer behavior patterns. Try this: Test different cluster numbers with t=3.
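Following that "Try this" prompt: because the hierarchy is already built, cutting it at a different level needs no re-fitting. A sketch using stand-in numbers from the city table (the real `city_metrics` DataFrame is assumed):

```python
import numpy as np
from scipy.cluster import hierarchy
from scipy.spatial.distance import pdist

# Stand-in for city_metrics.values, rows in the table's order
metrics = np.array([
    [38420, 41.2, 4.12, 3.8],   # Bangalore
    [35680, 39.8, 4.08, 3.6],   # Chennai
    [42150, 42.1, 4.15, 4.1],   # Delhi
    [44890, 43.2, 4.18, 4.3],   # Mumbai
    [36240, 40.5, 4.09, 3.7],   # Pune
])

linkage_matrix = hierarchy.linkage(pdist(metrics), method='ward')

# Cut the same tree at 2 and then 3 clusters: fcluster just slices the hierarchy
for k in (2, 3):
    labels = hierarchy.fcluster(linkage_matrix, t=k, criterion='maxclust')
    print(k, labels)
```

At t=3, Bangalore typically splits off as its own mid-tier cluster while Delhi/Mumbai and Chennai/Pune stay paired, which is the dendrogram telling you where the next-weakest boundary sits.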
📊 Data Insight
Hierarchical clustering reveals two distinct market segments: High-value metros (Mumbai ₹44,890, Delhi ₹42,150) vs. Growth markets (Pune ₹36,240, Bangalore ₹38,420, Chennai ₹35,680). This 18% revenue gap suggests different pricing and product strategies.
Metro cities (blue) outperform growth cities (red) across all customer behavior metrics
The radar chart clearly shows metro cities dominating across all dimensions. Mumbai and Delhi form a premium segment with older, higher-spending customers who order larger quantities. This clustering analysis drives market-specific strategies — premium products for metros, value offerings for growth cities.
SciPy's clustering capabilities extend beyond simple grouping. The mathematical rigor ensures statistically valid segments. Marketing teams can now allocate budgets based on cluster characteristics, customize product catalogs, and predict expansion success in similar cities.
Common SciPy Mistake
Looking for k-means in the wrong module. SciPy has no scipy.cluster.kmeans module; its k-means lives in scipy.cluster.vq and is deliberately minimal. For full-featured k-means, use sklearn.cluster.KMeans; use SciPy's hierarchy for hierarchical methods.
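For reference, a minimal sketch of where k-means actually lives in SciPy. `scipy.cluster.vq` expects whitened (unit-variance) features; the two-blob data here is synthetic:

```python
import numpy as np
from scipy.cluster.vq import whiten, kmeans, vq

rng = np.random.default_rng(7)

# Two well-separated synthetic blobs of 2-D points
points = np.vstack([
    rng.normal(0, 0.5, size=(50, 2)),
    rng.normal(5, 0.5, size=(50, 2)),
])

# whiten() scales each column to unit variance, as vq's k-means expects
scaled = whiten(points)

# kmeans returns centroids and mean distortion; vq assigns points to centroids
centroids, distortion = kmeans(scaled, 2, seed=7)
labels, _ = vq(scaled, centroids)
print(centroids.shape, len(set(labels)))
```

Note the division of labor: `kmeans` only finds centroids, and a separate `vq` call produces labels. sklearn.cluster.KMeans bundles both plus extras like `n_init` and inertia tracking, which is why it is the usual production choice.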
Quiz
1. You need to test if Mumbai customers spend significantly more than Chennai customers. Which SciPy function and what does it test?
2. Your curve_fit() optimization returns R² = 0.847 for age-revenue relationship. What does this indicate?
3. In hierarchical clustering, what does scipy.cluster.hierarchy.fcluster() accomplish that linkage() doesn't?
Up Next
Scikit-Learn
Build machine learning models with the industry-standard framework that turns SciPy's mathematical foundation into predictive algorithms.