Data Visualization for Machine Learning
So far, we have done something very important without training a single model: we prepared the data step by step by cleaning it, scaling it, and fixing logical issues.
Before we allow a machine learning algorithm to learn from this data, we must first understand the data ourselves.
This is where data visualization plays a critical role. Good visualizations reveal patterns, relationships, and problems that numbers alone cannot show.
Why Visualization Matters Before Modeling
Machine Learning algorithms do not understand business logic or real life. They only understand mathematical patterns.
Visualization lets us see which features influence the target, which variables move together, where the data is skewed, and where hidden problems still exist.
Skipping visualization often leads to poor model performance, even when advanced algorithms are used.
We Continue With the Same Dataset
From Lesson 4 onward, we use a single dataset for the entire module:
Dataplexa ML Housing & Customer Dataset
This continuity is intentional. Real-world ML projects rarely switch datasets from day to day; they keep deepening their understanding of the same data.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv("dataplexa_ml_housing_customer_dataset.csv")
df.head()
Understanding Feature Distribution
Before comparing features with targets, we must understand how individual features are distributed.
For example, house prices are rarely evenly distributed. Most houses fall into a mid-range, while a few are very expensive.
This skewness directly affects model behavior.
plt.figure(figsize=(7,4))
sns.histplot(df["house_price"], kde=True)
plt.title("Distribution of House Prices")
plt.show()
From this plot, we can check whether prices are roughly normally distributed, whether extreme values still exist, and whether a transformation may be required later.
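The visual impression of skew can also be backed by a number. The following is a minimal sketch, not part of the original lesson: it uses synthetic, right-skewed prices as a stand-in for df["house_price"] (the lognormal parameters are arbitrary assumptions) and compares skewness before and after a log transform, one common candidate transformation.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for df["house_price"]: right-skewed, as price columns often are.
# The distribution parameters are assumptions chosen only for illustration.
rng = np.random.default_rng(seed=0)
prices = pd.Series(rng.lognormal(mean=12.0, sigma=0.5, size=1000), name="house_price")

# Skewness near 0 suggests a roughly symmetric distribution;
# a clearly positive value indicates a long right tail.
raw_skew = prices.skew()

# log1p compresses large values and often reduces right skew.
log_prices = np.log1p(prices)
log_skew = log_prices.skew()

print(f"skewness before log transform: {raw_skew:.2f}")
print(f"skewness after log transform:  {log_skew:.2f}")
```

On the real dataset, the same two calls can be applied to df["house_price"] directly. Whether a log transform is appropriate still depends on the model and on how the histogram looks.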
Real-World Interpretation
If most houses cluster around a certain price range, a regression model will learn that range very well.
If a few extreme values dominate, the model may overestimate prices for typical houses.
This is why we visualize before training.
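To make the effect of extreme values concrete, here is a small sketch with made-up prices (in thousands). It relies on one known fact: the constant prediction that minimizes squared error is the mean. The numbers are hypothetical and do not come from the Dataplexa dataset.

```python
import numpy as np

# Hypothetical prices in thousands: seven ordinary houses and two luxury ones.
prices = np.array([180, 200, 210, 220, 230, 240, 250, 2500, 3000], dtype=float)

mean_price = prices.mean()        # what a squared-error constant baseline would predict
median_price = np.median(prices)  # a typical house in this sample

# Count the houses whose true price lies below the mean prediction.
overestimated = int((prices < mean_price).sum())

print(f"mean: {mean_price:.1f}, median: {median_price:.1f}")
print(f"houses priced below the mean: {overestimated} of {len(prices)}")
```

In this toy sample, the mean is about 781 while the median is 230, so a mean-based prediction overestimates 7 of the 9 houses. A histogram makes this kind of imbalance visible before any model is trained.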
Visualizing Relationships Between Features
Machine Learning is not about individual columns. It is about relationships between variables.
For example, larger houses usually cost more, but not always.
plt.figure(figsize=(7,4))
sns.scatterplot(x="house_size", y="house_price", data=df)
plt.title("House Size vs House Price")
plt.show()
This scatter plot helps us see whether the relationship is roughly linear, how much noise there is, and whether a regression model is suitable.
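What the scatter plot suggests can be cross-checked numerically. The sketch below is an illustration under stated assumptions: the data is synthetic (a linear trend plus noise, with invented coefficients), and only the column names house_size and house_price are taken from the lesson. It computes the Pearson correlation and fits a straight line.

```python
import numpy as np
import pandas as pd

# Synthetic data: price grows linearly with size, plus random noise.
# The coefficients (50000, 150) and the noise level are assumptions for illustration.
rng = np.random.default_rng(seed=1)
size = rng.uniform(500, 3500, size=300)
noise = rng.normal(0, 40000, size=300)
demo = pd.DataFrame({
    "house_size": size,
    "house_price": 50000 + 150 * size + noise,
})

# Pearson correlation (the pandas default) measures the strength of a linear relationship.
r = demo["house_size"].corr(demo["house_price"])

# Least-squares straight line; polyfit returns the highest-degree coefficient first.
slope, intercept = np.polyfit(
    demo["house_size"].to_numpy(), demo["house_price"].to_numpy(), deg=1
)

print(f"Pearson r: {r:.2f}")
print(f"fitted line: price = {intercept:.0f} + {slope:.1f} * size")
```

A correlation close to 1 or -1 supports a linear model. A weak correlation together with a clearly curved scatter plot suggests the relationship is non-linear rather than absent.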
Visualizing Categorical Impact
Our dataset includes a classification target: purchase_decision.
Visualization helps us see how a numerical feature is distributed within each class of a categorical outcome.
plt.figure(figsize=(7,4))
sns.boxplot(x="purchase_decision", y="customer_income", data=df)
plt.title("Income vs Purchase Decision")
plt.show()
From this plot, we can ask: Do higher-income customers purchase more often? Is income a strong predictor? Is there overlap between classes?
These insights guide algorithm selection later.
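The numbers a box plot draws can also be inspected as a table. This is a hedged sketch with synthetic data: the class labels "yes" and "no" and the income distributions are assumptions, since the lesson does not show the actual values of purchase_decision.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the dataset. The labels "no"/"yes" and the income
# parameters are hypothetical and chosen only for illustration.
rng = np.random.default_rng(seed=2)
demo = pd.DataFrame({
    "purchase_decision": ["no"] * 200 + ["yes"] * 200,
    "customer_income": np.concatenate([
        rng.normal(50000, 15000, size=200),
        rng.normal(60000, 15000, size=200),
    ]),
})

# describe() per group returns the quartiles that a box plot draws.
summary = demo.groupby("purchase_decision")["customer_income"].describe()
print(summary[["25%", "50%", "75%"]])

# The boxes overlap if the upper quartile of one class exceeds
# the lower quartile of the other.
boxes_overlap = summary.loc["no", "75%"] > summary.loc["yes", "25%"]
print("boxes overlap:", bool(boxes_overlap))
```

A higher median for buyers with heavily overlapping boxes would suggest that income carries some signal but cannot separate the classes on its own.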
Correlation Visualization
Correlation heatmaps help us understand how strongly features are related to each other.
Highly correlated features may be redundant, and in linear models they can make coefficient estimates unstable (multicollinearity).
plt.figure(figsize=(8,6))
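# numeric_only=True skips any non-numeric columns, which would otherwise raise an error in pandas 2.x
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap="coolwarm")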
plt.title("Feature Correlation Heatmap")
plt.show()
This visualization prepares us for feature selection, dimensionality reduction, and regularization techniques.
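As a first step toward feature selection, the strongly correlated pairs in the heatmap can be listed programmatically. The sketch below is illustrative only: the data is synthetic, the num_rooms column is a hypothetical feature invented to be redundant with house_size, and the 0.8 cutoff is an arbitrary assumption.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=3)
n = 500
house_size = rng.uniform(500, 3500, size=n)
demo = pd.DataFrame({
    "house_size": house_size,
    # Hypothetical column: nearly a rescaled copy of house_size, so it adds little information.
    "num_rooms": house_size / 400 + rng.normal(0, 0.3, size=n),
    "customer_income": rng.normal(60000, 15000, size=n),
})

# Absolute correlations, since strong negative correlation is also redundancy.
corr = demo.corr().abs()

# Keep only the upper triangle so each pair appears once and the diagonal is excluded.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

threshold = 0.8  # arbitrary cutoff for illustration
pairs = upper.stack()
high_pairs = pairs[pairs > threshold]
print(high_pairs)
```

Each pair listed this way is a candidate for dropping one feature, combining the two, or relying on regularization. The decision itself still needs domain judgment.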
Why Visualization Comes Before Algorithms
Training a model without first visualizing the data is like driving blindfolded.
Visualization gives context, confidence, and clarity. It tells us what kind of model we should even try.
Mini Practice
Look at a scatter plot of house_price against house_size.
Ask yourself: Is the relationship strong enough for linear regression? Would a tree-based model perform better?
Exercises
Exercise 1:
Why is visualization done before model training?
Exercise 2:
What does a skewed distribution indicate?
Exercise 3:
Why are correlation heatmaps important?
Quick Quiz
Q1. Should visualization replace modeling?
Q2. Why do we keep using the same dataset?
In the next lesson, we will connect visualization with Statistics for Machine Learning and see how numerical measures explain what visuals show.