Statistics for Machine Learning
In the previous lesson, we used visualizations to see patterns in data. In this lesson, we learn how to measure those patterns using statistics.
Machine Learning is not magic. It is applied statistics combined with computation. If you understand statistics, models stop feeling mysterious.
Why Statistics Is the Backbone of ML
Every Machine Learning model makes decisions based on numerical summaries: averages, spreads, probabilities, and deviations.
Statistics answers questions like:
What is a normal value?
How much variation is acceptable?
Is a change meaningful or just noise?
Without statistics, a model may appear accurate but fail badly in real-world situations.
We Continue With the Same Dataset
From Lesson 4 onwards, we consistently use:
Dataplexa ML Housing & Customer Dataset
This allows us to apply statistical thinking to the same data that will later be used for regression and classification.
import pandas as pd
df = pd.read_csv("dataplexa_ml_housing_customer_dataset.csv")
df.describe()
Measures of Central Tendency
Central tendency tells us where data is centered. In real life, we often ask: “What is the typical value?”
In our dataset, this might mean: What is the typical house price? What is the typical customer income?
The most common measures are mean, median, and mode.
df["house_price"].mean(), df["house_price"].median()
The mean is sensitive to extreme values, while the median is more robust.
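If the mean and median differ substantially, the data is likely skewed: a mean well above the median suggests a long right tail, and a mean well below it suggests a long left tail.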
Real-World Meaning
If a few luxury houses exist, the mean price increases, but the median still represents what most people can afford.
Models trained on skewed data without understanding this often make unrealistic predictions.
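The luxury-house effect can be illustrated with a minimal sketch. The prices below are made-up values (in thousands), not rows from the lesson's dataset, and the variable names are only illustrative.

```python
import pandas as pd

# Hypothetical house prices in thousands; not taken from the lesson's dataset.
typical = pd.Series([200, 220, 240, 260, 280])

# Add a single luxury house to the same neighborhood.
with_luxury = pd.concat([typical, pd.Series([3000])], ignore_index=True)

mean_before, median_before = typical.mean(), typical.median()
mean_after, median_after = with_luxury.mean(), with_luxury.median()

# One extreme value pulls the mean from 240 to 700,
# while the median only moves from 240 to 250.
print("before:", mean_before, median_before)
print("after: ", mean_after, median_after)
```

The median barely moves because it depends only on the middle of the sorted values, not on how large the largest value is.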
Measures of Spread (Variance & Standard Deviation)
Two datasets can have the same average but behave very differently.
Spread tells us how much values vary from the center.
df["house_price"].var(), df["house_price"].std()
Standard deviation answers an important question:
“How far are most values from the average?”
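Many models work best when features have comparable scales or roughly bell-shaped distributions. A feature with very high variance can dominate the others and make training less stable.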
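To see how two datasets can share a mean yet differ in spread, here is a small sketch. The two columns are hypothetical neighborhoods with made-up prices (in thousands), chosen so the arithmetic is easy to check by hand.

```python
import pandas as pd

# Hypothetical prices in thousands; both columns have a mean of 250.
prices = pd.DataFrame({
    "neighborhood_a": [240, 250, 260],  # tightly clustered
    "neighborhood_b": [50, 250, 450],   # widely spread
})

# pandas computes the sample variance and standard deviation (ddof=1) by default.
summary = prices.agg(["mean", "var", "std"])
print(summary)

# neighborhood_a: std of 10; neighborhood_b: std of 200, with the same mean.
```

A model that only saw the mean would treat these two neighborhoods as identical, even though a prediction of 250 is far less reliable in the second one.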
Understanding Distribution Shape
Statistics also helps us identify:
Skewness – whether data leans left or right
Kurtosis – whether data has heavy tails
df["house_price"].skew(), df["house_price"].kurt()
These values guide later preprocessing decisions, such as applying a log transformation to a strongly right-skewed column or normalizing features.
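As a rough illustration of why skewness matters for preprocessing, the sketch below measures skewness before and after a log transformation. The prices are made-up values, and using `np.log1p` is just one common choice assumed here, not necessarily the transformation this course applies later.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed prices in thousands: most are modest, one is extreme.
prices = pd.Series([100, 120, 130, 150, 160, 180, 200, 250, 400, 1500])

skew_before = prices.skew()

# log1p computes log(1 + x); it compresses large values more than small ones.
log_prices = np.log1p(prices)
skew_after = log_prices.skew()

print("skewness before log:", round(skew_before, 2))
print("skewness after log: ", round(skew_after, 2))
```

The transformed column is still right-skewed, but noticeably less so. A log transformation reduces the pull of extreme values rather than removing it entirely.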
Correlation and Statistical Relationships
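Correlation measures how strongly two variables move together. The default in pandas is the Pearson coefficient, which ranges from -1 to 1 and captures only linear relationships.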
For example: Does income rise with house price? Does age influence purchase decisions?
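# Keep only numeric columns; recent pandas versions raise an error if corr() receives text columns
df.select_dtypes("number").corr()["house_price"]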
Correlation does not imply causation, but it helps prioritize features.
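A minimal sketch of using correlation to rank candidate features follows. The rows are made up, and the `customer_age` and `city` columns are hypothetical names added for illustration; only `house_price` and `customer_income` are column names mentioned in this lesson.

```python
import pandas as pd

# Hypothetical sample; values are invented for illustration only.
sample = pd.DataFrame({
    "customer_income": [40, 50, 60, 70, 80],
    "customer_age": [25, 60, 33, 48, 41],
    "house_price": [200, 240, 300, 330, 400],
    "city": ["A", "B", "A", "C", "B"],  # text column, excluded from corr()
})

# Restrict to numeric columns before computing Pearson correlations.
numeric = sample.select_dtypes("number")

# Correlation of every numeric column with house_price, strongest first.
corr_with_price = numeric.corr()["house_price"].sort_values(ascending=False)
print(corr_with_price)
```

In this invented sample, income tracks price closely while age shows only a weak relationship, so income would be the first feature to investigate. The ranking says nothing about whether income causes higher prices.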
Statistics vs Visualization
Visualization reveals patterns. Statistics confirms them numerically.
A good ML engineer always uses both.
Mini Practice
Look at the mean and median of customer_income.
Ask yourself:
Is income evenly distributed?
Would a median-based approach be safer?
Exercises
Exercise 1:
Why is standard deviation important in ML?
Exercise 2:
What does skewness indicate?
Exercise 3:
Why is correlation useful before modeling?
Quick Quiz
Q1. Can two datasets have the same mean but different variance?
Q2. Should statistics be skipped if visualization looks good?
In the next lesson, we move from statistics into Linear Algebra fundamentals, which explains how models learn using vectors and matrices.