ML Lesson 8 – Statistics for ML | Dataplexa

Statistics for Machine Learning

In the previous lesson, we used visualizations to see patterns in data. In this lesson, we learn how to measure those patterns using statistics.

Machine Learning is not magic. It is applied statistics combined with computation. If you understand statistics, models stop feeling mysterious.


Why Statistics Is the Backbone of ML

Every Machine Learning model makes decisions based on numerical summaries: averages, spreads, probabilities, and deviations.

Statistics answers questions like:

What is a normal value?
How much variation is acceptable?
Is a change meaningful or just noise?

Without statistics, a model may appear accurate but fail badly in real-world situations.


We Continue With the Same Dataset

From Lesson 4 onwards, we consistently use:

Dataplexa ML Housing & Customer Dataset

This allows us to apply statistical thinking to the same data that will later be used for regression and classification.

import pandas as pd

df = pd.read_csv("dataplexa_ml_housing_customer_dataset.csv")
df.describe()  # summary statistics for every numeric column

Measures of Central Tendency

Central tendency tells us where data is centered. In real life, we often ask: “What is the typical value?”

In our dataset, this might mean: What is the typical house price? What is the typical customer income?

The most common measures are mean, median, and mode.

df["house_price"].mean()
df["house_price"].median()

The mean is sensitive to extreme values, while the median is more robust.

If mean and median differ significantly, it signals skewness in the data.
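A quick sketch makes the difference concrete. The numbers below are hypothetical, not from the course dataset: nine modest houses plus one luxury outlier.

```python
import pandas as pd

# Hypothetical prices: nine modest houses plus one luxury outlier
prices = pd.Series([200_000] * 9 + [2_000_000])

mean_price = prices.mean()      # pulled upward by the single outlier
median_price = prices.median()  # stays at the typical value

print(mean_price)    # 380000.0
print(median_price)  # 200000.0
```

One extreme value nearly doubles the mean, while the median does not move at all. That gap between the two is exactly the skewness signal described above.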


Real-World Meaning

If a few luxury houses exist, the mean price increases, but the median still represents what most people can afford.

Models trained on skewed data without understanding this often make unrealistic predictions.


Measures of Spread (Variance & Standard Deviation)

Two datasets can have the same average but behave very differently.

Spread tells us how much values vary from the center.

df["house_price"].var()
df["house_price"].std()

Standard deviation answers an important question:

“How far are most values from the average?”

Models assume certain distributions. High variance can reduce model stability.
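To see this with numbers, here is a small sketch using two made-up samples that share a mean of 100 but differ sharply in spread.

```python
import pandas as pd

# Two hypothetical samples with the same mean but very different spread
tight = pd.Series([98, 99, 100, 101, 102])
wide = pd.Series([60, 80, 100, 120, 140])

print(tight.mean(), wide.mean())  # 100.0 100.0
print(tight.std())                # about 1.58
print(wide.std())                 # about 31.62
```

An average alone would make these two samples look identical; the standard deviation is what tells them apart.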


Understanding Distribution Shape

Statistics also helps us identify:

Skewness – whether data leans left or right
Kurtosis – whether data has heavy tails

df["house_price"].skew()
df["house_price"].kurt()

These values guide later decisions, such as applying log transformations or normalization.
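As a sketch of why a log transform helps, the hypothetical prices below follow a multiplicative (doubling) pattern, which is strongly right-skewed on the raw scale but close to symmetric after `np.log1p`.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed prices: each roughly double the previous
prices = pd.Series([50, 100, 200, 400, 800, 1600, 3200, 6400])

raw_skew = prices.skew()            # strongly positive
log_skew = np.log1p(prices).skew()  # close to zero after the transform

print(raw_skew)
print(log_skew)
```

Checking skewness before and after a candidate transform like this is a common way to decide whether the transform is worth applying.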


Correlation and Statistical Relationships

Correlation measures how strongly two variables move together.

For example: Does income rise with house price? Does age influence purchase decisions?

df.corr(numeric_only=True)["house_price"]

Correlation does not imply causation, but it helps prioritize features.
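A minimal sketch of feature prioritization, using a small made-up frame in place of the course dataset (the column names here are illustrative): sort the correlations with the target so the strongest relationships surface first.

```python
import pandas as pd

# Hypothetical numeric columns standing in for the course dataset
df = pd.DataFrame({
    "house_price": [250, 300, 400, 500, 650],
    "income": [40, 50, 65, 80, 100],
    "age": [30, 45, 28, 52, 41],
})

# Correlations with the target, strongest first; numeric_only=True
# guards against non-numeric columns in newer pandas versions
corr = df.corr(numeric_only=True)["house_price"].sort_values(ascending=False)
print(corr)
```

Here income tracks house_price almost perfectly while age is only weakly related, so income would be the feature to examine first.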


Statistics vs Visualization

Visualization shows patterns visually. Statistics confirms them numerically.

A good ML engineer always uses both.


Mini Practice

Look at the mean and median of customer_income.

Ask yourself:

Is income evenly distributed?
Would a median-based approach be safer?
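A sketch of the comparison, using hypothetical incomes; on the course dataset you would run the same two calls on df["customer_income"] after loading the CSV.

```python
import pandas as pd

# Hypothetical incomes: five typical customers and one high earner
income = pd.Series([30_000, 35_000, 40_000, 42_000, 45_000, 250_000])

mean_income = income.mean()      # inflated by the one high earner
median_income = income.median()  # the typical customer

print(mean_income)    # about 73666.67
print(median_income)  # 41000.0
```

If the mean sits far above the median like this, the distribution is right-skewed and a median-based summary is the safer description of a typical customer.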


Exercises

Exercise 1:
Why is standard deviation important in ML?

It measures variability and helps understand model stability and data spread.

Exercise 2:
What does skewness indicate?

It indicates whether data is asymmetric around the mean.

Exercise 3:
Why is correlation useful before modeling?

It helps identify relationships and prioritize features.

Quick Quiz

Q1. Can two datasets have the same mean but different variance?

Yes. Variance measures spread, not central value.

Q2. Should statistics be skipped if visualization looks good?

No. Statistics confirms what visuals suggest.

In the next lesson, we move from statistics into Linear Algebra fundamentals, which explains how models learn using vectors and matrices.