EDA Course
Data Types — Know What You're Actually Working With
Here's a mistake almost every beginner makes: they assume a column full of numbers is a number column. It isn't always. And that one wrong assumption can silently break your entire analysis. This lesson fixes that.
Why Data Types Are a Big Deal
Picture this. You get a dataset. There's a column called age. You try to calculate the average age. Python throws an error. You stare at the screen. Everything looks fine. The column clearly has numbers in it — 24, 31, 45, 28.
What went wrong? The column is stored as text. Someone exported it from Excel and the ages came through as strings. Python sees "24", not 24. You can't average text.
This happens constantly in real data work. The fix is simple once you know about it — but first you need to understand what data types actually are and how pandas handles them.
Real talk: Data type errors don't always throw exceptions. Sometimes they just give you wrong answers silently. A column of revenue values stored as text will sort like 1, 10, 100, 2, 20 instead of 1, 2, 10, 20, 100. No error. Just quietly wrong. That's scarier than an exception.
The Four Types You Will Actually Use
Pandas has a lot of possible data types, but honestly? In 90% of real-world EDA work, you're dealing with just four. Learn these four and you're set.
int64: Whole numbers. No decimals. Think: age, quantity, number of purchases, year.
Examples: 24, 100, 2019, 0, -5
float64: Numbers with a decimal point. Think: price, weight, temperature, percentage.
Examples: 19.99, 3.14, 98.6, 0.75
object: Strings and anything pandas couldn't classify. Think: names, categories, addresses, IDs.
Examples: "Alice", "Electronics", "USA"
bool: Binary flags. Think: is_member, is_active, has_discount, verified.
Examples: True, False
There's also datetime64 for dates and times — we cover that in the Time-based EDA lesson. But for now, these four are your world.
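To see pandas assign these four types automatically, here is a tiny sketch. The column names and values are purely illustrative:

```python
import pandas as pd

# pandas infers one of the four core dtypes per column
demo_df = pd.DataFrame({
    'age': [24, 31, 45],               # whole numbers  -> int64
    'price': [19.99, 3.14, 0.75],      # decimals       -> float64
    'name': ['Alice', 'Bob', 'Cleo'],  # text           -> object
    'is_member': [True, False, True],  # binary flags   -> bool
})
print(demo_df.dtypes)
```

Notice you never told pandas the types; it inferred them from the values. That inference is convenient, but it is also exactly where the problems in the rest of this lesson come from.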
Checking Types — Three Ways to Do It
You've already seen .dtypes in Lesson 2. But there are actually three ways to check data types in pandas, and each one is useful in different situations. Let's build a dataset and try all three.
The scenario: You're a junior analyst at an HR consultancy. A colleague hands you an employee dataset exported from their HR system. Before you do any analysis — salary averages, headcount by department, tenure calculations — you need to confirm every column is storing the right type of data.
import pandas as pd
# Employee dataset from an HR system — mix of data types on purpose
employee_df = pd.DataFrame({
'emp_id': [101, 102, 103, 104, 105, 106, 107, 108],
'name': ['Sara', 'James', 'Priya', 'Tom', 'Luna', 'Kai', 'Omar', 'Zoe'],
'department': ['Sales', 'Engineering', 'HR', 'Sales', 'Engineering', 'HR', 'Sales', 'Engineering'],
'salary': [52000.00, 87000.00, 61000.00, 55000.00, 92000.00, 63000.00, 58000.00, 95000.00],
'years_exp': [3, 7, 5, 2, 9, 4, 3, 11],
'is_manager': [False, True, False, False, True, False, False, True],
'performance': ['Good', 'Excellent', 'Good', 'Average', 'Excellent', 'Good', 'Average', 'Excellent']
})
# Method 1: .dtypes — shows type for EVERY column at once
# Best used at the start to get a full picture
print("--- Method 1: .dtypes ---")
print(employee_df.dtypes)
# Method 2: .dtype on a single column — when you just need to check one
# Use this when you already know which column is suspicious
print("\n--- Method 2: Single column dtype ---")
print("salary column type :", employee_df['salary'].dtype)
print("is_manager type :", employee_df['is_manager'].dtype)
# Method 3: .info() — shows dtype AND how many non-null values exist
# Best used when you want type + missing value check in one go
print("\n--- Method 3: .info() ---")
employee_df.info()
--- Method 1: .dtypes ---
emp_id           int64
name            object
department      object
salary         float64
years_exp        int64
is_manager        bool
performance     object
dtype: object

--- Method 2: Single column dtype ---
salary column type : float64
is_manager type : bool

--- Method 3: .info() ---
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8 entries, 0 to 7
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   emp_id       8 non-null      int64
 1   name         8 non-null      object
 2   department   8 non-null      object
 3   salary       8 non-null      float64
 4   years_exp    8 non-null      int64
 5   is_manager   8 non-null      bool
 6   performance  8 non-null      object
dtypes: bool(1), float64(1), int64(2), object(3)
memory usage: 576.0 bytes
What just happened?
We used three different pandas tools. .dtypes is an attribute that returns the type of every column in one shot. .dtype (no 's') is called on a single column to check just that one. .info() is a method that combines type information with null counts — two checks for the price of one command.
Notice that performance (values like "Good", "Excellent") came back as object. That's correct — it's text. But in a later lesson we'll convert it to a category type, which is more memory-efficient for columns with a small set of repeating values. For now, just recognise it as text.
The Sneaky Problem — Numbers Stored as Text
This is the one that trips people up the most. Let's recreate that scenario from the beginning of this lesson — where ages or salaries come in as strings, not numbers.
The scenario: Your colleague exports the salary data from an old HR system. The file comes through and everything looks fine until you try to calculate anything. Here's how to spot it and what happens when you try to do math on text columns.
import pandas as pd
# Simulate a badly exported dataset — salaries came through as strings
# This happens with old Excel exports, CSVs with currency symbols, or copy-paste errors
bad_df = pd.DataFrame({
'name': ['Sara', 'James', 'Priya', 'Tom', 'Luna'],
'salary_str': ['52000', '87000', '61000', '55000', '92000'] # strings, not numbers!
})
# Check what type salary_str is
print("Type of salary_str:", bad_df['salary_str'].dtype)
# Try to calculate average salary — this will fail because it's text
try:
avg = bad_df['salary_str'].mean() # .mean() only works on numeric types
print("Average salary:", avg)
except TypeError as e:
print("ERROR:", e) # TypeError tells us exactly what went wrong
# What sorting looks like with text vs numbers (alphabetical vs numerical)
print("\nText sort (wrong):", sorted(['52000', '87000', '61000', '55000', '92000']))
print("Number sort (right):", sorted([52000, 87000, 61000, 55000, 92000]))
Type of salary_str: object
ERROR: Could not convert string to float: '52000'

Text sort (wrong): ['52000', '55000', '61000', '87000', '92000']
Number sort (right): [52000, 55000, 61000, 87000, 92000]
What just happened?
The dtype came back as object — that's the red flag. When you see a column that should be numeric showing as object, something went wrong during data import. We used a Python try/except block to catch the TypeError gracefully instead of crashing the whole script.
The sorting example is the quietly-dangerous version — no error, just wrong results. '52000' sorts before '87000' alphabetically, which happens to be correct here. But '9' would sort after '87000' alphabetically even though 9 < 87000 numerically. That kind of bug can survive in production for months before anyone notices.
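Here is that mixed-length case as a quick sketch you can run. The revenue values are invented, but the sort behaviour is the point:

```python
# Hypothetical revenue values of different lengths, stored as strings
revenue_str = ['9', '87000', '250', '1200']

# Lexicographic sort compares character by character: '1' < '2' < '8' < '9'
text_order = sorted(revenue_str)
print("Text sort   :", text_order)   # ['1200', '250', '87000', '9']

# Numeric sort after converting each value to int
num_order = sorted(int(v) for v in revenue_str)
print("Numeric sort:", num_order)    # [9, 250, 1200, 87000]
```

The string `'9'` lands last because the character `'9'` is greater than `'8'`, even though 9 is the smallest value in the list. No error is raised at any point.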
Fixing It — Type Conversion with astype()
Once you spot a wrong type, fixing it is usually one line. The astype() method converts a column from one type to another. It's one of the most-used tools in any analyst's workflow.
The scenario: You've confirmed the salary column is text. Now you need to convert it to a proper numeric type so you can calculate averages, find the highest paid employee, and group by salary ranges. Here's how to do it cleanly.
import pandas as pd
# Dataset with the broken string salary column
bad_df = pd.DataFrame({
'name': ['Sara', 'James', 'Priya', 'Tom', 'Luna'],
'salary_str': ['52000', '87000', '61000', '55000', '92000']
})
# Fix 1: Convert string column to integer using .astype()
bad_df['salary_int'] = bad_df['salary_str'].astype(int)
# Fix 2: Or convert to float if you expect decimal values later
bad_df['salary_float'] = bad_df['salary_str'].astype(float)
# Verify the types changed
print("Original (string):", bad_df['salary_str'].dtype)
print("Converted (int) :", bad_df['salary_int'].dtype)
print("Converted (float):", bad_df['salary_float'].dtype)
# Now we can do math on the fixed column
print("\nAverage salary (int) :", bad_df['salary_int'].mean())
print("Highest salary (float):", bad_df['salary_float'].max())
Original (string): object
Converted (int) : int64
Converted (float): float64

Average salary (int) : 69400.0
Highest salary (float): 92000.0
What just happened?
.astype() is a pandas Series method that returns a new column with the type converted. We stored the result in a new column rather than overwriting the original — good practice when you're still validating the conversion.
One thing to watch: .astype(int) will crash if the column has any non-numeric values, empty strings, or NaNs hiding inside. In those cases use pd.to_numeric(df['col'], errors='coerce') instead — it converts what it can and turns bad values into NaN rather than throwing an error. We'll cover that properly in the Missing Values lessons.
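Here is a small sketch of that safer route. The messy values below are invented, but the pattern is exactly the pd.to_numeric call named above:

```python
import pandas as pd

# Messy column: valid numbers plus an empty string and a placeholder
messy = pd.Series(['52000', '', 'N/A', '61000'])

# .astype(int) would raise on '' and 'N/A'.
# errors='coerce' converts what it can and turns the rest into NaN.
cleaned = pd.to_numeric(messy, errors='coerce')
print(cleaned)
print("Mean of valid values:", cleaned.mean())  # NaN entries are skipped
```

The result is a float64 column with two NaNs where the bad values were, and the mean is computed over the remaining valid entries. The Missing Values lessons cover what to do with those NaNs next.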
The category Type — a Hidden Gem
Most people never use this one and they're leaving performance on the table. When you have a column with a small set of repeating text values — like department (Sales, Engineering, HR) or grade (A, B, C, D) — storing it as object wastes memory. The category type is built exactly for this.
The scenario: Your dataset has 50,000 employees. The department column has 50,000 text entries — but only 6 unique values. Storing 50,000 full strings is wasteful. Converting to category stores the 6 unique values once and uses tiny integer codes for each row instead. Let's see how much it matters.
import pandas as pd
import numpy as np
# Simulate a larger employee dataset with repeating categories
np.random.seed(42) # seed makes random choices reproducible
departments = ['Sales', 'Engineering', 'HR', 'Marketing', 'Finance', 'Operations']
# Create 1,000 rows to make the memory difference visible
big_df = pd.DataFrame({
'emp_id': range(1001, 2001), # 1000 employees
'department': np.random.choice(departments, size=1000), # random dept per row
'salary': np.random.randint(45000, 120000, size=1000).astype(float)
})
# Check memory usage BEFORE conversion
print("=== BEFORE conversion ===")
print("department dtype :", big_df['department'].dtype)
print("Memory (object) :", big_df['department'].memory_usage(deep=True), "bytes")
# Convert department to category type
big_df['department'] = big_df['department'].astype('category')
# Check memory usage AFTER conversion
print("\n=== AFTER conversion ===")
print("department dtype :", big_df['department'].dtype)
print("Memory (category) :", big_df['department'].memory_usage(deep=True), "bytes")
# Bonus: category type makes unique values easy to access
print("\nUnique departments :", big_df['department'].cat.categories.tolist())
=== BEFORE conversion ===
department dtype : object
Memory (object) : 63800 bytes

=== AFTER conversion ===
department dtype : category
Memory (category) : 1524 bytes

Unique departments : ['Engineering', 'Finance', 'HR', 'Marketing', 'Operations', 'Sales']
What just happened?
We used numpy here for the first time in this course — np.random.choice() picks random values from a list, and np.random.randint() generates random integers. We set np.random.seed(42) so the results are the same every time you run it — important when teaching.
The memory result is the jaw-dropper: 63,800 bytes down to 1,524 bytes — roughly 42× smaller. The saving scales with row count, and on a million-row production dataset it can also make operations like grouping and filtering noticeably faster, because pandas compares the small integer codes instead of full strings. The .cat.categories accessor is a bonus — it gives you the unique values in sorted order, which is handy for validation.
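If you want to see those integer codes directly, the .cat.codes accessor exposes them. A tiny sketch with a made-up department column:

```python
import pandas as pd

# Category storage under the hood: unique values stored once,
# each row holds only a small integer code pointing at one of them
dept = pd.Series(['Sales', 'HR', 'Sales', 'Engineering', 'HR']).astype('category')
print(dept.cat.categories.tolist())  # ['Engineering', 'HR', 'Sales']
print(dept.cat.codes.tolist())       # [2, 1, 2, 0, 1]
```

Five rows, but only three stored strings; every 'Sales' row is just the code 2. That is the whole memory trick in two lines.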
A Visual Look at the Type System
Here's a quick reference of how real-world data columns map to pandas types — the kind of cheat sheet you'll actually reach for on the job.
| Column example | Sample values | Expected dtype | Watch out for |
|---|---|---|---|
| age | 24, 31, 45 | int64 | Coming in as "24" (string) |
| price / salary | 19.99, 87000.00 | float64 | "$19.99" with currency symbol |
| name / address | "Alice", "123 Main St" | object | Numbers hiding as strings |
| department / grade | "Sales", "A", "High" | object → category | Left as object wastes memory |
| is_active / verified | True, False | bool | 1/0 stored as int instead |
| order_date | "2024-01-15" | datetime64 | Often imported as object |
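The bool row's "watch out", flags stored as 1/0 integers, has a one-line fix. A small sketch with invented flag data:

```python
import pandas as pd

# Hypothetical flag column that arrived as 1/0 integers instead of True/False
flags = pd.Series([1, 0, 1, 1, 0], name='is_active')
print(flags.dtype)       # int64 -- works, but intent is unclear

# astype(bool) maps 1 -> True and 0 -> False
flags_bool = flags.astype(bool)
print(flags_bool.tolist())  # [True, False, True, True, False]
print(flags_bool.dtype)     # bool
```

Leaving flags as int64 usually still computes correctly, but converting to bool makes the column's meaning explicit and prevents accidental arithmetic like summing flags when you meant to filter by them.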
Putting It Together — Full Type Audit Routine
The scenario: You've just received a brand new dataset — the kind where you don't know the source, don't know if it was cleaned, and definitely don't trust it yet. Here's the complete type-auditing routine you run before touching anything else.
import pandas as pd
# Full employee dataset for the audit
employee_df = pd.DataFrame({
'emp_id': [101, 102, 103, 104, 105, 106, 107, 108],
'name': ['Sara', 'James', 'Priya', 'Tom', 'Luna', 'Kai', 'Omar', 'Zoe'],
'department': ['Sales', 'Engineering', 'HR', 'Sales', 'Engineering', 'HR', 'Sales', 'Engineering'],
'salary': [52000.00, 87000.00, 61000.00, 55000.00, 92000.00, 63000.00, 58000.00, 95000.00],
'years_exp': [3, 7, 5, 2, 9, 4, 3, 11],
'is_manager': [False, True, False, False, True, False, False, True],
'performance': ['Good', 'Excellent', 'Good', 'Average', 'Excellent', 'Good', 'Average', 'Excellent']
})
# Step 1: Full dtype overview
print("STEP 1 — All column types:")
print(employee_df.dtypes)
# Step 2: Separate numeric vs non-numeric columns automatically
numeric_cols = employee_df.select_dtypes(include=['number']).columns.tolist()
non_numeric_cols = employee_df.select_dtypes(exclude=['number']).columns.tolist()
print("\nSTEP 2 — Numeric columns :", numeric_cols)
print("STEP 2 — Non-numeric columns:", non_numeric_cols)
# Step 3: Optimise — convert low-cardinality text columns to category
# "Low cardinality" means few unique values relative to total rows
for col in non_numeric_cols:
n_unique = employee_df[col].nunique() # count distinct values
n_rows = len(employee_df)
ratio = n_unique / n_rows # uniqueness ratio
# If less than 50% of values are unique, it's a good category candidate
if ratio < 0.5:
employee_df[col] = employee_df[col].astype('category')
print(f"STEP 3 — Converted '{col}' to category ({n_unique} unique values)")
# Step 4: Confirm final types
print("\nSTEP 4 — Final dtypes after optimisation:")
print(employee_df.dtypes)
STEP 1 — All column types:
emp_id           int64
name            object
department      object
salary         float64
years_exp        int64
is_manager        bool
performance     object
dtype: object

STEP 2 — Numeric columns : ['emp_id', 'salary', 'years_exp']
STEP 2 — Non-numeric columns: ['name', 'department', 'is_manager', 'performance']
STEP 3 — Converted 'department' to category (3 unique values)
STEP 3 — Converted 'is_manager' to category (2 unique values)
STEP 3 — Converted 'performance' to category (3 unique values)

STEP 4 — Final dtypes after optimisation:
emp_id            int64
name             object
department     category
salary          float64
years_exp         int64
is_manager     category
performance    category
dtype: object
What just happened?
Two new pandas methods appeared here. select_dtypes(include=['number']) automatically filters columns by their type — much faster than checking each one manually. nunique() counts distinct values in a column, which we used to calculate a uniqueness ratio — a simple heuristic for deciding whether category conversion makes sense.
The name column was correctly left as object — it has 8 unique values out of 8 rows (ratio = 1.0), so every value is different. No point making that a category. department has only 3 unique values across 8 rows — perfect category candidate. This is the kind of thinking that separates a careful analyst from someone who just runs code and hopes for the best.
Teacher's Note
There's a trap that gets intermediate analysts too. You check .dtypes, everything looks correct — salary is float64, age is int64. You feel good. But then your groupby produces wrong numbers.
What happened? The column was correct, but there were NaN values hiding in it. Pandas silently converts an int64 column to float64 the moment you introduce a NaN — because integers can't represent missing values natively, but floats can. So if you see a column that should be integer but shows as float64, check for missing values. That's almost always the reason.
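You can watch this promotion happen with two tiny Series; the values are arbitrary:

```python
import pandas as pd
import numpy as np

# Same values, with and without a missing entry
ages_clean = pd.Series([24, 31, 45])
ages_gappy = pd.Series([24, 31, np.nan, 45])

print(ages_clean.dtype)  # int64
print(ages_gappy.dtype)  # float64 -- the single NaN forced the promotion
```

If you genuinely need integers that tolerate missing values, pandas also offers a nullable Int64 extension dtype (note the capital I), but for everyday EDA the rule above is the one to remember: unexpected float64 usually means hidden NaNs.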
Practice Questions
1. You have a column called age stored as object. To convert it to integers, you write df['age'].______('int'). What goes in the blank?
2. A column called grade has only 4 unique values — A, B, C, D — repeated across 50,000 rows. What dtype should you convert it to for maximum memory efficiency?
3. To automatically get only the numeric columns from a DataFrame, you use df.________(include=['number']). Fill in the blank.
Quiz
1. A column contains values like "52000", "87000", "61000" — numbers wrapped in quotes. What dtype will pandas assign to this column?
2. You expected a column to be int64 but it's showing as float64. No conversion was done. What is the most likely explanation?
3. What is the danger of leaving a numeric column as object dtype without converting it?
Up Next · Lesson 4
Central Tendency — Mean, Median and Mode
Which average should you actually report? The answer changes depending on your data — and getting it wrong is more common than you'd think.