Statistics in Python
So far, we have learned statistical concepts and applied them using Excel.
When datasets grow larger or analysis becomes repetitive, manual tools become limiting.
Python allows us to perform statistical analysis in a scalable, reproducible, and automated way.
Why Use Python for Statistics?
- Handles large datasets efficiently
- Automates repetitive analysis
- Produces reproducible results
- Integrates analysis and visualization
Python is widely used in data science, analytics, and machine learning.
Core Python Libraries for Statistics
| Library | Purpose |
|---|---|
| NumPy | Numerical operations and arrays |
| Pandas | Data manipulation and summaries |
| SciPy | Statistical tests |
| Statsmodels | Regression and statistical modeling |
| Matplotlib / Seaborn | Visualization |
Descriptive Statistics Using Pandas
Pandas makes it easy to compute summary statistics.
import pandas as pd
data = pd.Series([10, 12, 15, 18, 20])
data.mean()    # 15.0
data.median()  # 15.0
data.std()     # about 4.12 (sample standard deviation)
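These methods return the mean, the median, and the standard deviation. Note that std() computes the sample standard deviation (it divides by n - 1) by default; pass ddof=0 for the population standard deviation, the equivalent of Excel's STDEV.P.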
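The sample and population versions of the standard deviation are easy to mix up, because Pandas and NumPy use different defaults. The following is a minimal sketch on the same five values; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

data = pd.Series([10, 12, 15, 18, 20])

mean = data.mean()
median = data.median()

# Pandas divides by n - 1 by default (ddof=1): sample standard deviation
sample_std = data.std()

# Passing ddof=0 divides by n instead: population standard deviation
population_std = data.std(ddof=0)

# NumPy's np.std uses ddof=0 by default, so it matches the population value
numpy_std = np.std(data.to_numpy())

print(mean, median, sample_std, population_std, numpy_std)
```

If results from Pandas, NumPy, and Excel disagree slightly, the ddof setting is usually the first thing to check.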
Exploring Data Quickly
The describe() method returns the most common summary statistics in one call.
data.describe()
For numeric data, this output includes the count, mean, standard deviation, minimum, maximum, and the 25th, 50th, and 75th percentiles.
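As a small sketch of how the result can be used, describe() on a Series returns another Series, so individual statistics can be read by label. The sample values are the same illustrative ones used earlier.

```python
import pandas as pd

data = pd.Series([10, 12, 15, 18, 20])

# describe() returns a Series indexed by statistic name
summary = data.describe()
print(summary)

# Individual statistics can be looked up by their label
q1 = summary["25%"]
q3 = summary["75%"]
iqr = q3 - q1  # interquartile range
```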
Correlation in Python
Correlation can be calculated with Pandas (DataFrame.corr()) or NumPy (np.corrcoef()). For a DataFrame df with numeric columns:
df.corr()
This returns a matrix of pairwise Pearson correlation coefficients between the columns, similar to the correlation matrix you saw in Excel.
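A minimal sketch of both approaches follows. The DataFrame and its column names (ad_spend, sales) are made up for illustration and are not from a real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical data: advertising spend and resulting sales
df = pd.DataFrame({
    "ad_spend": [10, 20, 30, 40, 50],
    "sales": [25, 41, 62, 79, 105],
})

# Pairwise Pearson correlations between all columns
corr_matrix = df.corr()
r_pandas = corr_matrix.loc["ad_spend", "sales"]

# The same coefficient from NumPy; corrcoef returns a 2x2 matrix
r_numpy = np.corrcoef(df["ad_spend"], df["sales"])[0, 1]

# Rank-based alternative when the relationship is monotonic but not linear
r_spearman = df.corr(method="spearman").loc["ad_spend", "sales"]

print(corr_matrix)
```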
Hypothesis Testing with SciPy
SciPy's stats module provides ready-made functions for common statistical tests.
from scipy import stats
stats.ttest_ind(group1, group2)
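ttest_ind() runs an independent two-sample t-test on two sets of observations, here group1 and group2. By default it assumes equal variances in the two groups; pass equal_var=False for Welch's t-test. The output includes a test statistic and a p-value, which you already know how to interpret.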
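To make the call concrete, here is a small sketch with two made-up groups of measurements. The numbers are illustrative only; the choice between the default test and Welch's version depends on whether equal variances are a reasonable assumption for your data.

```python
from scipy import stats

# Hypothetical measurements for two independent groups
group1 = [23.1, 25.4, 24.8, 26.0, 22.9, 25.1]
group2 = [27.3, 28.1, 26.5, 29.0, 27.8, 28.4]

# Default: Student's t-test, which assumes equal variances
student = stats.ttest_ind(group1, group2)

# Welch's t-test drops the equal-variance assumption
welch = stats.ttest_ind(group1, group2, equal_var=False)

print("Student:", student.statistic, student.pvalue)
print("Welch:  ", welch.statistic, welch.pvalue)
```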
Regression in Python
Statsmodels provides detailed regression output.
import statsmodels.api as sm
X = sm.add_constant(X)
model = sm.OLS(y, X).fit()
model.summary()
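Here y is the response variable and X holds the predictors. add_constant() adds the intercept column, which OLS does not include on its own. The summary includes coefficient estimates, standard errors, p-values, confidence intervals, and R-squared.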
Visualization for Statistical Insight
Visualization helps detect patterns and violations of assumptions.
- Histograms for distributions
- Box plots for outliers
- Scatter plots for relationships
Because plots are built from the same arrays and DataFrames used for the analysis, statistics and visualization stay in a single workflow.
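The three plot types listed above can be sketched with Matplotlib as follows. The data is randomly generated for illustration, and the layout choices (three panels side by side, 20 bins) are arbitrary.

```python
import matplotlib.pyplot as plt
import numpy as np

# Synthetic data for illustration
rng = np.random.default_rng(42)
x = rng.normal(50, 10, 200)
y = 2 * x + rng.normal(0, 15, 200)

fig, axes = plt.subplots(1, 3, figsize=(12, 4))

# Histogram: shape of the distribution
axes[0].hist(x, bins=20)
axes[0].set_title("Histogram")

# Box plot: median, quartiles, and potential outliers
axes[1].boxplot(x)
axes[1].set_title("Box plot")

# Scatter plot: relationship between two variables
axes[2].scatter(x, y, s=10)
axes[2].set_title("Scatter plot")

fig.tight_layout()
```

Call plt.show() to display the figure in an interactive session, or fig.savefig() with a file name of your choice to write it to disk.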
Real-World Example
A data analyst uses Python to:
- Clean raw customer data
- Compute descriptive statistics
- Run hypothesis tests
- Build regression models
Because every step is written as code, the entire workflow can be rerun on new data without manual rework.
Python vs Excel for Statistics
| Aspect | Excel | Python |
|---|---|---|
| Ease of use | High | Moderate |
| Automation | Limited | Excellent |
| Large datasets | Limited (1,048,576 rows per worksheet) | Strong (limited mainly by available memory) |
| Reproducibility | Low | High |
Common Mistakes to Avoid
- Running tests without understanding assumptions
- Blindly trusting library output
- Ignoring data cleaning
- Confusing correct syntax with correct statistics (code that runs without errors is not necessarily a valid analysis)
Quick Check
Which Python library is mainly used for hypothesis testing?
SciPy.
Practice Quiz
Question 1:
Which function provides a statistical summary in Pandas?
describe()
Question 2:
Which library provides detailed regression summaries?
Statsmodels.
Question 3:
Does Python replace statistical thinking?
No. It applies statistical thinking programmatically.
Mini Practice
You receive a CSV file containing product sales data.
- Which library would you use to load it?
- Which function would give a quick summary?
Pandas, using read_csv(), for loading. describe() for summary statistics.
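A minimal sketch of this answer is shown below. To keep it self-contained, the CSV content is an in-memory string with made-up columns (product, units, price); with a real file you would pass its path to pd.read_csv() instead.

```python
import io

import pandas as pd

# Stand-in for a CSV file on disk (hypothetical product sales data)
csv_text = """product,units,price
Widget,10,2.50
Gadget,4,10.00
Widget,7,2.50
Gizmo,12,1.25
"""

# read_csv accepts a file path or any file-like object
sales = pd.read_csv(io.StringIO(csv_text))

# By default describe() summarizes only the numeric columns
summary = sales.describe()
print(summary)
```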
What’s Next
In the next lesson, we will apply statistics using R, a language built specifically for statistical analysis.