ML Lesson 7 – Data Visualization for ML | Dataplexa

Data Visualization for Machine Learning

So far, we have done something very important without training a single model. We prepared the data step by step: cleaning it, scaling it, and fixing logical issues.

Before we allow a machine learning algorithm to learn from this data, we must first understand the data ourselves.

This is where data visualization plays a critical role. Good visualizations reveal patterns, relationships, and problems that numbers alone cannot show.


Why Visualization Matters Before Modeling

Machine Learning algorithms do not understand business logic or real life. They only understand mathematical patterns.

Visualization allows humans to see:

- Which features influence the target
- Which variables move together
- Where data is skewed
- Where hidden problems still exist

Skipping visualization often leads to poor model performance, even when advanced algorithms are used.


We Continue With the Same Dataset

From Lesson 4 onwards, we have been using a single dataset for the entire module:

Dataplexa ML Housing & Customer Dataset

This continuity is intentional. Real-world ML projects do not change datasets every day — they improve understanding of the same data.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("dataplexa_ml_housing_customer_dataset.csv")
df.head()
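Before plotting anything, it helps to confirm the dataset's shape and column types. The sketch below uses a small synthetic frame standing in for the Dataplexa CSV (the values are invented for illustration; the column names follow this lesson):

```python
import pandas as pd

# Small synthetic stand-in for the Dataplexa dataset (values invented)
df = pd.DataFrame({
    "house_size": [850, 1200, 1500, 2100, 3000],
    "house_price": [150000, 210000, 260000, 380000, 900000],
    "customer_income": [40000, 55000, 60000, 90000, 150000],
    "purchase_decision": ["no", "yes", "no", "yes", "yes"],
})

print(df.shape)       # (rows, columns)
print(df.dtypes)      # data type of each column
print(df.describe())  # summary statistics for numeric columns
```

Knowing which columns are numeric and which are categorical tells you which plot types apply to each.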

Understanding Feature Distribution

Before comparing features with targets, we must understand how individual features are distributed.

For example, house prices are rarely evenly distributed. Most houses fall into a mid-range, while a few are very expensive.

This skewness directly affects model behavior.

plt.figure(figsize=(7,4))
sns.histplot(df["house_price"], kde=True)
plt.title("Distribution of House Prices")
plt.show()

From this plot, we can visually confirm:

- Whether prices are normally distributed
- Whether extreme values still exist
- Whether transformations may be required later
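Skewness can also be checked numerically to back up what the histogram shows. A minimal sketch, using invented prices (in thousands) with a long upper tail:

```python
import pandas as pd

# Synthetic prices: most mid-range, a few very expensive (right-skewed)
prices = pd.Series([150, 180, 200, 210, 220, 240, 260, 300, 900, 1200])

skew = prices.skew()
print(f"skewness: {skew:.2f}")  # positive value -> long right tail

# A common rule of thumb: |skew| > 1 suggests a transformation may help
if abs(skew) > 1:
    print("consider a transformation before modeling")
```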


Real-World Interpretation

If most houses cluster around a certain price range, a regression model will learn that range very well.

If a few extreme values dominate, the model may overestimate prices for normal houses.

This is why we visualize before training.
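When a few extreme values dominate, a log transform is a common remedy because it compresses the upper tail. A minimal sketch, assuming invented price values for illustration:

```python
import numpy as np
import pandas as pd

# Invented prices with two dominant extreme values
prices = pd.Series([150_000, 180_000, 200_000, 240_000, 900_000, 1_500_000])

log_prices = np.log1p(prices)  # log(1 + x), safe even for zero values

# The transform pulls extreme values closer to the bulk of the data
print(f"raw skew: {prices.skew():.2f}, log skew: {log_prices.skew():.2f}")
```

Later lessons revisit when such transformations are appropriate; here the point is simply that the histogram tells you whether to consider one.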


Visualizing Relationships Between Features

Machine Learning is not about individual columns. It is about relationships between variables.

For example, larger houses usually cost more, but not always.

plt.figure(figsize=(7,4))
sns.scatterplot(x="house_size", y="house_price", data=df)
plt.title("House Size vs House Price")
plt.show()

This scatter plot helps us see:

- Whether the relationship is linear
- Whether noise exists
- Whether regression is suitable
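A quick numeric companion to the scatter plot is the Pearson correlation coefficient. The sketch below uses invented size/price pairs (prices in thousands):

```python
import pandas as pd

# Invented, roughly proportional size/price pairs
df = pd.DataFrame({
    "house_size":  [850, 1200, 1500, 2100, 2600, 3000],
    "house_price": [150, 210, 260, 380, 450, 520],
})

r = df["house_size"].corr(df["house_price"])  # Pearson by default
print(f"Pearson r: {r:.2f}")

# r near +1 or -1 supports a linear model; r near 0 suggests otherwise
```

Note that Pearson's r only measures linear association, so the scatter plot is still needed to spot curved or clustered patterns.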


Visualizing Categorical Impact

Our dataset includes a classification target: purchase_decision.

Visualization helps us see how numerical features influence categorical outcomes.

plt.figure(figsize=(7,4))
sns.boxplot(x="purchase_decision", y="customer_income", data=df)
plt.title("Income vs Purchase Decision")
plt.show()

From this plot, we can reason:

- Do higher-income customers purchase more often?
- Is income a strong predictor?
- Is there overlap between classes?

These insights guide algorithm selection later.
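The same comparison can be made numerically with a group-by summary, which mirrors what the boxplot shows. The values below are invented for illustration (incomes in thousands):

```python
import pandas as pd

# Invented incomes and purchase outcomes
df = pd.DataFrame({
    "customer_income":   [40, 45, 50, 70, 85, 95, 110, 120],
    "purchase_decision": ["no", "no", "no", "yes", "no", "yes", "yes", "yes"],
})

# Median and spread of income within each class
summary = df.groupby("purchase_decision")["customer_income"].agg(
    ["median", "min", "max"]
)
print(summary)
```

A clear gap between the class medians, with limited overlap in the ranges, is a sign that the feature carries predictive signal.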


Correlation Visualization

Correlation heatmaps help us understand how strongly features are related to each other.

Highly correlated features may be redundant and can confuse some models.

plt.figure(figsize=(8,6))
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap="coolwarm")
plt.title("Feature Correlation Heatmap")
plt.show()

This visualization prepares us for:

- Feature selection
- Dimensionality reduction
- Regularization techniques
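Redundant feature pairs can also be flagged programmatically rather than read off the heatmap by eye. A minimal sketch with invented columns (`num_rooms` is a hypothetical feature used only for this example):

```python
import numpy as np
import pandas as pd

# Invented features; house_size and num_rooms move almost in lockstep
df = pd.DataFrame({
    "house_size":  [850, 1200, 1500, 2100, 2600, 3000],
    "num_rooms":   [2, 3, 3, 4, 5, 6],
    "house_price": [150, 210, 260, 380, 450, 520],
})

corr = df.corr(numeric_only=True).abs()

# Keep only the strict upper triangle so each pair is counted once
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
high = corr.where(mask).stack()

# Flag pairs above a redundancy threshold (0.9 is a common choice)
high = high[high > 0.9]
print(high)
```

Pairs flagged this way are candidates for dropping or combining during feature selection.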


Why Visualization Comes Before Algorithms

A model trained without visualization is like driving blindfolded.

Visualization gives context, confidence, and clarity. It tells us what kind of model we should even try.


Mini Practice

Look at a scatter plot between house_size and house_price.

Ask yourself:

- Is the relationship strong enough for linear regression?
- Would a tree-based model perform better?


Exercises

Exercise 1:
Why is visualization done before model training?

Visualization helps humans understand patterns, errors, and relationships before machines learn them.

Exercise 2:
What does a skewed distribution indicate?

It indicates that data is not evenly spread and may require transformation.

Exercise 3:
Why are correlation heatmaps important?

They reveal relationships between features and help detect redundancy.

Quick Quiz

Q1. Should visualization replace modeling?

No. Visualization supports modeling but does not replace it.

Q2. Why do we keep using the same dataset?

Real-world ML projects improve understanding of the same dataset over time.

In the next lesson, we will connect visualization with Statistics for Machine Learning and see how numerical measures explain what visuals show.