EDA Course
Data Types — Know What You're Actually Working With
Here's a mistake almost every beginner makes: they assume a column full of numbers is a number column. It isn't always. And that one wrong assumption can silently break your entire analysis. This lesson fixes that.
Why Data Types Are a Big Deal
Picture this. You get a dataset. There's a column called age. You try to calculate the average age. Python throws an error. You stare at the screen. Everything looks fine. The column clearly has numbers in it — 24, 31, 45, 28.
What went wrong? The column is stored as text. Someone exported it from Excel and the ages came through as strings. Python sees "24", not 24. You can't average text.
This happens constantly in real data work. The fix is simple once you know about it — but first you need to understand what data types actually are and how pandas handles them.
Real talk: Data type errors don't always throw exceptions. Sometimes they just give you wrong answers silently. A column of revenue values stored as text will sort like 1, 10, 100, 2, 20 instead of 1, 2, 10, 20, 100. No error. Just quietly wrong. That's scarier than an exception.
The Four Types You Will Actually Use
Pandas has a lot of possible data types, but honestly? In 90% of real-world EDA work, you're dealing with just four. Learn these four and you're set.
int64: Whole numbers. No decimals. Think: age, quantity, number of purchases, year.
Examples: 24, 100, 2019, 0, -5
float64: Numbers with a decimal point. Think: price, weight, temperature, percentage.
Examples: 19.99, 3.14, 98.6, 0.75
object: Strings and anything pandas couldn't classify. Think: names, categories, addresses, IDs.
Examples: "Alice", "Electronics", "USA"
bool: Binary flags. Think: is_member, is_active, has_discount, verified.
Examples: True, False
There's also datetime64 for dates and times — we cover that in the Time-based EDA lesson. But for now, these four are your world.
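To see pandas assign these four types automatically, here is a tiny sketch. The column names and values are purely illustrative:

```python
import pandas as pd

# pandas infers one of the four core dtypes per column
demo_df = pd.DataFrame({
    'age': [24, 31, 45],               # whole numbers  -> int64
    'price': [19.99, 3.14, 0.75],      # decimals       -> float64
    'name': ['Alice', 'Bob', 'Cleo'],  # text           -> object
    'is_member': [True, False, True],  # binary flags   -> bool
})
print(demo_df.dtypes)
```

Notice you never told pandas the types; it inferred them from the values. That inference is convenient, but it is also exactly where the problems in the rest of this lesson come from.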
Checking Types — Three Ways to Do It
You've already seen .dtypes in Lesson 2. But there are actually three ways to check data types in pandas, and each one is useful in different situations. Let's build a dataset and try all three.
The scenario: You're a junior analyst at an HR consultancy. A colleague hands you an employee dataset exported from their HR system. Before you do any analysis — salary averages, headcount by department, tenure calculations — you need to confirm every column is storing the right type of data.
import pandas as pd
# Employee dataset from an HR system — mix of data types on purpose
employee_df = pd.DataFrame({
'emp_id': [101, 102, 103, 104, 105, 106, 107, 108],
'name': ['Sara', 'James', 'Priya', 'Tom', 'Luna', 'Kai', 'Omar', 'Zoe'],
'department': ['Sales', 'Engineering', 'HR', 'Sales', 'Engineering', 'HR', 'Sales', 'Engineering'],
'salary': [52000.00, 87000.00, 61000.00, 55000.00, 92000.00, 63000.00, 58000.00, 95000.00],
'years_exp': [3, 7, 5, 2, 9, 4, 3, 11],
'is_manager': [False, True, False, False, True, False, False, True],
'performance': ['Good', 'Excellent', 'Good', 'Average', 'Excellent', 'Good', 'Average', 'Excellent']
})
# Method 1: .dtypes — shows type for EVERY column at once
# Best used at the start to get a full picture
print("--- Method 1: .dtypes ---")
print(employee_df.dtypes)
# Method 2: .dtype on a single column — when you just need to check one
# Use this when you already know which column is suspicious
print("\n--- Method 2: Single column dtype ---")
print("salary column type :", employee_df['salary'].dtype)
print("is_manager type :", employee_df['is_manager'].dtype)
# Method 3: .info() — shows dtype AND how many non-null values exist
# Best used when you want type + missing value check in one go
print("\n--- Method 3: .info() ---")
employee_df.info()
--- Method 1: .dtypes ---
emp_id           int64
name            object
department      object
salary         float64
years_exp        int64
is_manager        bool
performance     object
dtype: object

--- Method 2: Single column dtype ---
salary column type : float64
is_manager type : bool

--- Method 3: .info() ---
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8 entries, 0 to 7
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   emp_id       8 non-null      int64
 1   name         8 non-null      object
 2   department   8 non-null      object
 3   salary       8 non-null      float64
 4   years_exp    8 non-null      int64
 5   is_manager   8 non-null      bool
 6   performance  8 non-null      object
dtypes: bool(1), float64(1), int64(2), object(3)
memory usage: 576.0 bytes
What just happened?
We used three different pandas tools. .dtypes is an attribute that returns the type of every column in one shot. .dtype (no 's') is called on a single column to check just that one. .info() is a method that combines type information with null counts — two checks for the price of one command.
Notice that performance (values like "Good", "Excellent") came back as object. That's correct — it's text. But in a later lesson we'll convert it to a category type, which is more memory-efficient for columns with a small set of repeating values. For now, just recognise it as text.
The Sneaky Problem — Numbers Stored as Text
This is the one that trips people up the most. Let's recreate that scenario from the beginning of this lesson — where ages or salaries come in as strings, not numbers.
The scenario: Your colleague exports the salary data from an old HR system. The file comes through and everything looks fine until you try to calculate anything. Here's how to spot it and what happens when you try to do math on text columns.
import pandas as pd
# Simulate a badly exported dataset — salaries came through as strings
# This happens with old Excel exports, CSVs with currency symbols, or copy-paste errors
bad_df = pd.DataFrame({
'name': ['Sara', 'James', 'Priya', 'Tom', 'Luna'],
'salary_str': ['52000', '87000', '61000', '55000', '92000'] # strings, not numbers!
})
# Check what type salary_str is
print("Type of salary_str:", bad_df['salary_str'].dtype)
# Try to calculate average salary — this will fail because it's text
try:
avg = bad_df['salary_str'].mean() # .mean() only works on numeric types
print("Average salary:", avg)
except TypeError as e:
print("ERROR:", e) # TypeError tells us exactly what went wrong
# What sorting looks like with text vs numbers (alphabetical vs numerical)
print("\nText sort (wrong):", sorted(['52000', '87000', '61000', '55000', '92000']))
print("Number sort (right):", sorted([52000, 87000, 61000, 55000, 92000]))
Type of salary_str: object
ERROR: Could not convert string to float: '52000'

Text sort (wrong): ['52000', '55000', '61000', '87000', '92000']
Number sort (right): [52000, 55000, 61000, 87000, 92000]
What just happened?
The dtype came back as object — that's the red flag. When you see a column that should be numeric showing as object, something went wrong during data import. We used a Python try/except block to catch the TypeError gracefully instead of crashing the whole script.
The sorting example is the quietly-dangerous version — no error, just wrong results. '52000' sorts before '87000' alphabetically, which happens to be correct here. But '9' would sort after '87000' alphabetically even though 9 < 87000 numerically. That kind of bug can survive in production for months before anyone notices.
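Here is that mixed-length case as a quick sketch you can run. The revenue values are invented, but the sort behaviour is the point:

```python
# Hypothetical revenue values of different lengths, stored as strings
revenue_str = ['9', '87000', '250', '1200']

# Lexicographic sort compares character by character: '1' < '2' < '8' < '9'
text_order = sorted(revenue_str)
print("Text sort   :", text_order)   # ['1200', '250', '87000', '9']

# Numeric sort after converting each value to int
num_order = sorted(int(v) for v in revenue_str)
print("Numeric sort:", num_order)    # [9, 250, 1200, 87000]
```

The string `'9'` lands last because the character `'9'` is greater than `'8'`, even though 9 is the smallest value in the list. No error is raised at any point.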
Fixing It — Type Conversion with astype()
Once you spot a wrong type, fixing it is usually one line. The astype() method converts a column from one type to another. It's one of the most-used tools in any analyst's workflow.
The scenario: You've confirmed the salary column is text. Now you need to convert it to a proper numeric type so you can calculate averages, find the highest paid employee, and group by salary ranges. Here's how to do it cleanly.
import pandas as pd
# Dataset with the broken string salary column
bad_df = pd.DataFrame({
'name': ['Sara', 'James', 'Priya', 'Tom', 'Luna'],
'salary_str': ['52000', '87000', '61000', '55000', '92000']
})
# Fix 1: Convert string column to integer using .astype()
bad_df['salary_int'] = bad_df['salary_str'].astype(int)
# Fix 2: Or convert to float if you expect decimal values later
bad_df['salary_float'] = bad_df['salary_str'].astype(float)
# Verify the types changed
print("Original (string):", bad_df['salary_str'].dtype)
print("Converted (int) :", bad_df['salary_int'].dtype)
print("Converted (float):", bad_df['salary_float'].dtype)
# Now we can do math on the fixed column
print("\nAverage salary (int) :", bad_df['salary_int'].mean())
print("Highest salary (float):", bad_df['salary_float'].max())
Original (string): object
Converted (int) : int64
Converted (float): float64

Average salary (int) : 69400.0
Highest salary (float): 92000.0
What just happened?
.astype() is a pandas Series method that returns a new column with the type converted. We stored the result in a new column rather than overwriting the original — good practice when you're still validating the conversion.
One thing to watch: .astype(int) will crash if the column has any non-numeric values, empty strings, or NaNs hiding inside. In those cases use pd.to_numeric(df['col'], errors='coerce') instead — it converts what it can and turns bad values into NaN rather than throwing an error. We'll cover that properly in the Missing Values lessons.
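Here is a small sketch of that safer route. The messy values below are invented, but the pattern is exactly the pd.to_numeric call named above:

```python
import pandas as pd

# Messy column: valid numbers plus an empty string and a placeholder
messy = pd.Series(['52000', '', 'N/A', '61000'])

# .astype(int) would raise on '' and 'N/A'.
# errors='coerce' converts what it can and turns the rest into NaN.
cleaned = pd.to_numeric(messy, errors='coerce')
print(cleaned)
print("Mean of valid values:", cleaned.mean())  # NaN entries are skipped
```

The result is a float64 column with two NaNs where the bad values were, and the mean is computed over the remaining valid entries. The Missing Values lessons cover what to do with those NaNs next.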
The category Type — a Hidden Gem
Most people never use this one and they're leaving performance on the table. When you have a column with a small set of repeating text values — like department (Sales, Engineering, HR) or grade (A, B, C, D) — storing it as object wastes memory. The category type is built exactly for this.
The scenario: Your dataset has 50,000 employees. The department column has 50,000 text entries — but only 6 unique values. Storing 50,000 full strings is wasteful. Converting to category stores the 6 unique values once and uses tiny integer codes for each row instead. Let's see how much it matters.
import pandas as pd
import numpy as np
# Simulate a larger employee dataset with repeating categories
np.random.seed(42) # seed makes random choices reproducible
departments = ['Sales', 'Engineering', 'HR', 'Marketing', 'Finance', 'Operations']
# Create 1,000 rows to make the memory difference visible
big_df = pd.DataFrame({
'emp_id': range(1001, 2001), # 1000 employees
'department': np.random.choice(departments, size=1000), # random dept per row
'salary': np.random.randint(45000, 120000, size=1000).astype(float)
})
# Check memory usage BEFORE conversion
print("=== BEFORE conversion ===")
print("department dtype :", big_df['department'].dtype)
print("Memory (object) :", big_df['department'].memory_usage(deep=True), "bytes")
# Convert department to category type
big_df['department'] = big_df['department'].astype('category')
# Check memory usage AFTER conversion
print("\n=== AFTER conversion ===")
print("department dtype :", big_df['department'].dtype)
print("Memory (category) :", big_df['department'].memory_usage(deep=True), "bytes")
# Bonus: category type makes unique values easy to access
print("\nUnique departments :", big_df['department'].cat.categories.tolist())
=== BEFORE conversion ===
department dtype : object
Memory (object) : 63800 bytes

=== AFTER conversion ===
department dtype : category
Memory (category) : 1524 bytes

Unique departments : ['Engineering', 'Finance', 'HR', 'Marketing', 'Operations', 'Sales']
What just happened?
We used numpy here for the first time in this course — np.random.choice() picks random values from a list, and np.random.randint() generates random integers. We set np.random.seed(42) so the results are the same every time you run it — important when teaching.
The memory result is the jaw-dropper: 63,800 bytes down to 1,524 bytes — roughly 42× smaller. The saving scales with row count, and on a million-row production dataset it can also make operations like grouping and filtering noticeably faster, because pandas compares the small integer codes instead of full strings. The .cat.categories accessor is a bonus — it gives you the unique values in sorted order, which is handy for validation.
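If you want to see those integer codes directly, the .cat.codes accessor exposes them. A tiny sketch with a made-up department column:

```python
import pandas as pd

# Category storage under the hood: unique values stored once,
# each row holds only a small integer code pointing at one of them
dept = pd.Series(['Sales', 'HR', 'Sales', 'Engineering', 'HR']).astype('category')
print(dept.cat.categories.tolist())  # ['Engineering', 'HR', 'Sales']
print(dept.cat.codes.tolist())       # [2, 1, 2, 0, 1]
```

Five rows, but only three stored strings; every 'Sales' row is just the code 2. That is the whole memory trick in two lines.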
A Visual Look at the Type System
Here's a quick reference of how real-world data columns map to pandas types — the kind of cheat sheet you'll actually reach for on the job.
| Column example | Sample values | Expected dtype | Watch out for |
|---|---|---|---|
| age | 24, 31, 45 | int64 | Coming in as "24" (string) |
| price / salary | 19.99, 87000.00 | float64 | "$19.99" with currency symbol |
| name / address | "Alice", "123 Main St" | object | Numbers hiding as strings |
| department / grade | "Sales", "A", "High" | object → category | Left as object wastes memory |
| is_active / verified | True, False | bool | 1/0 stored as int instead |
| order_date | "2024-01-15" | datetime64 | Often imported as object |
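The bool row's "watch out", flags stored as 1/0 integers, has a one-line fix. A small sketch with invented flag data:

```python
import pandas as pd

# Hypothetical flag column that arrived as 1/0 integers instead of True/False
flags = pd.Series([1, 0, 1, 1, 0], name='is_active')
print(flags.dtype)       # int64 -- works, but intent is unclear

# astype(bool) maps 1 -> True and 0 -> False
flags_bool = flags.astype(bool)
print(flags_bool.tolist())  # [True, False, True, True, False]
print(flags_bool.dtype)     # bool
```

Leaving flags as int64 usually still computes correctly, but converting to bool makes the column's meaning explicit and prevents accidental arithmetic like summing flags when you meant to filter by them.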
Putting It Together — Full Type Audit Routine
The scenario: You've just received a brand new dataset — the kind where you don't know the source, don't know if it was cleaned, and definitely don't trust it yet. Here's the complete type-auditing routine you run before touching anything else.
import pandas as pd
# Full employee dataset for the audit
employee_df = pd.DataFrame({
'emp_id': [101, 102, 103, 104, 105, 106, 107, 108],
'name': ['Sara', 'James', 'Priya', 'Tom', 'Luna', 'Kai', 'Omar', 'Zoe'],
'department': ['Sales', 'Engineering', 'HR', 'Sales', 'Engineering', 'HR', 'Sales', 'Engineering'],
'salary': [52000.00, 87000.00, 61000.00, 55000.00, 92000.00, 63000.00, 58000.00, 95000.00],
'years_exp': [3, 7, 5, 2, 9, 4, 3, 11],
'is_manager': [False, True, False, False, True, False, False, True],
'performance': ['Good', 'Excellent', 'Good', 'Average', 'Excellent', 'Good', 'Average', 'Excellent']
})
# Step 1: Full dtype overview
print("STEP 1 — All column types:")
print(employee_df.dtypes)
# Step 2: Separate numeric vs non-numeric columns automatically
numeric_cols = employee_df.select_dtypes(include=['number']).columns.tolist()
non_numeric_cols = employee_df.select_dtypes(exclude=['number']).columns.tolist()
print("\nSTEP 2 — Numeric columns :", numeric_cols)
print("STEP 2 — Non-numeric columns:", non_numeric_cols)
# Step 3: Optimise — convert low-cardinality text columns to category
# "Low cardinality" means few unique values relative to total rows
for col in non_numeric_cols:
n_unique = employee_df[col].nunique() # count distinct values
n_rows = len(employee_df)
ratio = n_unique / n_rows # uniqueness ratio
# If less than 50% of values are unique, it's a good category candidate
if ratio < 0.5:
employee_df[col] = employee_df[col].astype('category')
print(f"STEP 3 — Converted '{col}' to category ({n_unique} unique values)")
# Step 4: Confirm final types
print("\nSTEP 4 — Final dtypes after optimisation:")
print(employee_df.dtypes)
STEP 1 — All column types:
emp_id           int64
name            object
department      object
salary         float64
years_exp        int64
is_manager        bool
performance     object
dtype: object

STEP 2 — Numeric columns : ['emp_id', 'salary', 'years_exp']
STEP 2 — Non-numeric columns: ['name', 'department', 'is_manager', 'performance']
STEP 3 — Converted 'department' to category (3 unique values)
STEP 3 — Converted 'is_manager' to category (2 unique values)
STEP 3 — Converted 'performance' to category (3 unique values)

STEP 4 — Final dtypes after optimisation:
emp_id            int64
name             object
department     category
salary          float64
years_exp         int64
is_manager     category
performance    category
dtype: object
What just happened?
Two new pandas methods appeared here. select_dtypes(include=['number']) automatically filters columns by their type — much faster than checking each one manually. nunique() counts distinct values in a column, which we used to calculate a uniqueness ratio — a simple heuristic for deciding whether category conversion makes sense.
The name column was correctly left as object — it has 8 unique values out of 8 rows (ratio = 1.0), so every value is different. No point making that a category. department has only 3 unique values across 8 rows — perfect category candidate. This is the kind of thinking that separates a careful analyst from someone who just runs code and hopes for the best.
Teacher's Note
There's a trap that gets intermediate analysts too. You check .dtypes, everything looks correct — salary is float64, age is int64. You feel good. But then your groupby produces wrong numbers.
What happened? The column was correct, but there were NaN values hiding in it. Pandas silently converts an int64 column to float64 the moment you introduce a NaN — because integers can't represent missing values natively, but floats can. So if you see a column that should be integer but shows as float64, check for missing values. That's almost always the reason.
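You can watch this promotion happen with two tiny Series; the values are arbitrary:

```python
import pandas as pd
import numpy as np

# Same values, with and without a missing entry
ages_clean = pd.Series([24, 31, 45])
ages_gappy = pd.Series([24, 31, np.nan, 45])

print(ages_clean.dtype)  # int64
print(ages_gappy.dtype)  # float64 -- the single NaN forced the promotion
```

If you genuinely need integers that tolerate missing values, pandas also offers a nullable Int64 extension dtype (note the capital I), but for everyday EDA the rule above is the one to remember: unexpected float64 usually means hidden NaNs.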
Practice Questions
1. You have a column called age stored as object. To convert it to integers, you write df['age'].______('int'). What goes in the blank?
2. A column called grade has only 4 unique values — A, B, C, D — repeated across 50,000 rows. What dtype should you convert it to for maximum memory efficiency?
3. To automatically get only the numeric columns from a DataFrame, you use df.________(include=['number']). Fill in the blank.
Quiz
1. A column contains values like "52000", "87000", "61000" — numbers wrapped in quotes. What dtype will pandas assign to this column?
2. You expected a column to be int64 but it's showing as float64. No conversion was done. What is the most likely explanation?
3. What is the danger of leaving a numeric column as object dtype without converting it?
Up Next · Lesson 4
Central Tendency — Mean, Median and Mode
Which average should you actually report? The answer changes depending on your data — and getting it wrong is more common than you'd think.