EDA Course
Data Quality Issues
Missing values and outliers get all the attention. But the data problems that silently wreck real projects are subtler — dates stored as text, negative ages, mismatched category spellings, impossible values that pass every null check. This lesson is a systematic tour of what actually goes wrong in production data.
The Full Taxonomy of Data Quality Problems
Data quality issues fall into six categories. Missing values are just one of them — and often not the most damaging one. Here's the complete map:
1 · Missing Values
NaN, None, empty strings, placeholder text like "N/A" or "unknown" that wasn't caught at ingestion.
2 · Wrong Data Types
Dates as strings, numbers as objects, booleans as integers. Can't compute on them until you fix the type.
3 · Inconsistent Formatting
"London", "london", "LONDON" treated as three different cities. Extra spaces, mixed case, inconsistent punctuation.
4 · Impossible Values
Age of -5, percentage of 150%, a transaction dated year 1800. Technically not null — but factually wrong.
5 · Duplicates
Exact row duplicates, near-duplicates with minor differences, duplicate IDs pointing to different records.
6 · Referential Integrity
Customer IDs in an orders table that don't exist in the customers table. Broken foreign key relationships.
Hunting Hidden Missing Values
The scenario: You're a data analyst at an HR software company. A client has exported their employee database and sent it to you for cleaning before it's migrated into your platform. The client insists "the data is clean — we filled in every field." But you've seen this before. Strings like "N/A", "none", "-", and a single space are not the same as NaN — pandas won't detect them automatically. You need to find all of them.
import pandas as pd # pandas: Python's core data table library — isnull(), replace(), and column ops
import numpy as np # numpy: numerical library — np.nan used to standardise missing value representation
# Employee dataset — notice the disguised missing values hiding in plain sight
df = pd.DataFrame({
'employee_id': [101, 102, 103, 104, 105, 106, 107, 108],
'name': ['Alice Brown', 'Bob Smith', 'N/A', 'Diana Prince',
'Eve Carter', '-', 'Grace Lee', 'Henry Ford'],
'department': ['Engineering', 'Marketing', 'Sales', 'Engineering',
'none', 'Marketing', ' ', 'Sales'], # 'none' and ' ' (space) are hidden nulls
'salary': [72000, 58000, 45000, 81000, 63000, 55000, 0, 77000], # 0 might mean missing
'start_date': ['2021-03-15', '2020-07-01', '2022-01-10', 'unknown',
'2019-11-25', '2023-06-14', '2021-08-30', '2018-04-05'] # 'unknown' is hidden null
})
print("=== RAW DATA ===")
print(df.to_string(index=False))
print()
# Standard .isnull() will miss all of these — they look like strings, not NaN
print(f"Standard null count: {df.isnull().sum().sum()}") # total NaN across entire DataFrame — will show 0
print()
# Define all the strings your organisation uses as placeholders for missing data
MISSING_PLACEHOLDERS = ['N/A', 'n/a', 'NA', 'none', 'None', 'NONE', '-', '--', 'unknown',
'Unknown', 'UNKNOWN', ' ', '', 'null', 'NULL']
# Replace all placeholder strings with np.nan so pandas can handle them uniformly
df_clean = df.replace(MISSING_PLACEHOLDERS, np.nan) # .replace() with a list swaps any matching value to NaN
# Now also flag salary = 0 as suspicious — a real salary of £0 is almost certainly a data entry error
df_clean.loc[df_clean['salary'] == 0, 'salary'] = np.nan # .loc[] for conditional assignment
print("=== MISSING VALUE REPORT (after standardisation) ===")
missing = df_clean.isnull().sum()
missing = missing[missing > 0] # filter to only affected columns
print(missing)
=== RAW DATA ===
employee_id name department salary start_date
101 Alice Brown Engineering 72000 2021-03-15
102 Bob Smith Marketing 58000 2020-07-01
103 N/A Sales 45000 2022-01-10
104 Diana Prince Engineering 81000 unknown
105 Eve Carter none 63000 2019-11-25
106 - Marketing 55000 2023-06-14
107 Grace Lee                    0 2021-08-30
108 Henry Ford Sales 77000 2018-04-05
Standard null count: 0
=== MISSING VALUE REPORT (after standardisation) ===
name 2
department 2
salary 1
start_date 1
dtype: int64
What just happened?
pandas is Python's core data table library. The key method here is .replace() — when you pass it a list as the first argument, it replaces every occurrence of any value in that list with the second argument. We pass np.nan as the replacement so all disguised nulls become real NaN values that pandas can detect and handle uniformly.
numpy provides np.nan — the standard floating-point representation of a missing value. It's the format pandas uses internally for all missing numeric data.
Before standardisation: zero nulls detected. After: six missing values across four columns. The client was technically right that "every field was filled" — they just filled some fields with strings meaning "we don't know." This is one of the most common data quality traps in real projects. Always run a placeholder scan before trusting a null count.
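The placeholder list above was hand-built from experience. Before hardcoding one for a new dataset, a quick heuristic scan of each text column's unique values can surface candidates: blank or very short tokens are the usual suspects. A minimal sketch, using an illustrative series rather than the lesson's dataset:

```python
import pandas as pd

# Illustrative text column containing a few disguised nulls
s = pd.Series(['Engineering', 'N/A', 'Marketing', '-', 'Engineering', ' '])

# Flag unique values that are blank or suspiciously short, then review them by hand
candidates = [v for v in s.unique()
              if isinstance(v, str) and (v.strip() == '' or len(v.strip()) <= 3)]
print(candidates)  # ['N/A', '-', ' ']
```

Running value_counts(dropna=False) on each object column is the manual equivalent: genuine categories repeat, while placeholders tend to show up as short, low-frequency oddities.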
Wrong Data Types — Finding and Fixing Them
The scenario: The same HR dataset has a start_date column. It came in as plain strings. Your analysis needs to calculate how long each employee has been at the company — but you can't subtract strings from dates. You also need to validate that the salary column is truly numeric, not a mix of numbers and text that pandas silently stored as object type.
import pandas as pd # pandas: data library — pd.to_datetime(), pd.to_numeric(), and dtype checking
import numpy as np # numpy: numerical library — np.nan for missing value replacement
# Rebuild a clean version of the dataset after placeholder standardisation from the previous block
df = pd.DataFrame({
'employee_id': [101, 102, 103, 104, 105, 106, 107, 108],
'name': ['Alice Brown', 'Bob Smith', np.nan, 'Diana Prince',
'Eve Carter', np.nan, 'Grace Lee', 'Henry Ford'],
'department': ['Engineering', 'Marketing', 'Sales', 'Engineering',
np.nan, 'Marketing', np.nan, 'Sales'],
'salary': ['72000', '58000', '45000', '81000', # intentionally stored as strings
'63000', '55000', np.nan, '77000'], # to simulate a mixed-type column
'start_date': ['2021-03-15', '2020-07-01', '2022-01-10', np.nan,
'2019-11-25', '2023-06-14', '2021-08-30', '2018-04-05']
})
print("Dtypes BEFORE fixing:")
print(df.dtypes)
print()
# Fix 1: Convert salary from object (string) to numeric
# errors='coerce' turns anything that can't be parsed into NaN instead of raising an error
df['salary'] = pd.to_numeric(df['salary'], errors='coerce') # pd.to_numeric() handles mixed types safely
# Fix 2: Convert start_date from object (string) to datetime
# errors='coerce' turns unparseable dates (like 'unknown') into NaT — the datetime equivalent of NaN
df['start_date'] = pd.to_datetime(df['start_date'], errors='coerce') # pd.to_datetime() for date columns
print("Dtypes AFTER fixing:")
print(df.dtypes)
print()
# Now we can do date arithmetic — calculate tenure in days from start_date to a fixed reference date
reference_date = pd.Timestamp('2024-06-01') # fixed reference date for reproducibility
df['tenure_days'] = (reference_date - df['start_date']).dt.days # .dt.days extracts integer days from a Timedelta
print("Tenure calculation (days employed as of 2024-06-01):")
print(df[['employee_id', 'name', 'start_date', 'tenure_days']].to_string(index=False))
Dtypes BEFORE fixing:
employee_id int64
name object
department object
salary object
start_date object
dtype: object
Dtypes AFTER fixing:
employee_id int64
name object
department object
salary float64
start_date datetime64[ns]
tenure_days float64
dtype: object
Tenure calculation (days employed as of 2024-06-01):
employee_id name start_date tenure_days
101 Alice Brown 2021-03-15 1174.0
102 Bob Smith 2020-07-01 1431.0
103 NaN 2022-01-10 872.0
104 Diana Prince NaT NaN
105 Eve Carter 2019-11-25 1649.0
106 NaN 2023-06-14 352.0
107 Grace Lee 2021-08-30 1005.0
108 Henry Ford 2018-04-05 2248.0
What just happened?
pandas provides two essential type-conversion functions: pd.to_numeric() converts a column to a numeric type, and pd.to_datetime() converts a column to datetime. The errors='coerce' argument is critical — without it, a single unparseable value raises an exception and crashes your entire script. With it, bad values silently become NaN or NaT (Not a Time — datetime's version of NaN), and you can deal with them separately.
The .dt accessor unlocks datetime-specific methods on a Series — .dt.days extracts the integer day count from a Timedelta object, giving us clean tenure numbers.
Notice employee 104 (Diana Prince): her start_date was 'unknown', so it became NaT, and tenure_days became NaN. The fix didn't invent data — it preserved the uncertainty correctly.
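One caveat: errors='coerce' is convenient but silent — it never reports which raw values it discarded. A way to audit the conversion, sketched on a small illustrative series, is to compare missingness before and after parsing:

```python
import pandas as pd

raw = pd.Series(['2021-03-15', 'unknown', '2020-07-01', None])
parsed = pd.to_datetime(raw, errors='coerce')  # 'unknown' becomes NaT

# A value was coerced if it was present before parsing but missing after
coerced = raw[raw.notna() & parsed.isna()]
print(coerced)  # the raw strings that failed to parse — here, 'unknown'
```

Logging these coerced values alongside the cleaned dataset makes the conversion reviewable instead of invisible.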
Inconsistent Formatting — The Case Chaos Problem
The scenario: You're running a .groupby('department') to count headcount per team — and the results look completely wrong. Engineering shows up three times: "Engineering", "engineering", and "ENGINEERING". Marketing appears as "Marketing " (with a trailing space). These aren't different departments — they're the same department entered inconsistently by different people. A groupby treats them as separate groups, silently producing a broken report.
import pandas as pd # pandas: data library — .str accessor, .groupby(), and .value_counts()
import numpy as np # numpy: standard EDA import
# Dataset with inconsistent department formatting — a real-world nightmare
df = pd.DataFrame({
'employee_id': [101, 102, 103, 104, 105, 106, 107, 108, 109, 110],
'name': ['Alice', 'Bob', 'Carol', 'Diana', 'Eve',
'Frank', 'Grace', 'Henry', 'Iris', 'Jack'],
'department': ['Engineering', 'marketing ', 'Sales', 'ENGINEERING', 'Sales',
'Marketing', 'engineering', 'Sales ', 'Engineering', 'MARKETING']
# ^trailing ^all caps ^lowercase ^trailing ^all caps
})
# Broken groupby BEFORE cleaning — shows 6 "departments" when there are really 3
print("=== HEADCOUNT BEFORE CLEANING (broken) ===")
print(df.groupby('department')['employee_id'].count())
print()
# Fix all three problems in one chained .str operation:
# 1. .str.strip() — removes leading and trailing whitespace
# 2. .str.lower() — converts to all lowercase
# 3. .str.title() — capitalises first letter of each word (Title Case)
df['department'] = df['department'].str.strip().str.lower().str.title()
# ^whitespace ^normalise ^pretty format
# Correct groupby AFTER cleaning
print("=== HEADCOUNT AFTER CLEANING (correct) ===")
print(df.groupby('department')['employee_id'].count())
=== HEADCOUNT BEFORE CLEANING (broken) ===
department
ENGINEERING     1
Engineering     2
MARKETING       1
Marketing       1
Sales           2
Sales           1
engineering     1
marketing       1
Name: employee_id, dtype: int64

=== HEADCOUNT AFTER CLEANING (correct) ===
department
Engineering    4
Marketing      3
Sales          3
Name: employee_id, dtype: int64
What just happened?
pandas provides the .str accessor — a special interface that unlocks string methods on an entire Series column at once, without needing a Python loop. .str.strip() removes whitespace from both ends of every string, .str.lower() converts everything to lowercase, and .str.title() then capitalises the first letter of each word. Chaining them in sequence is clean and readable — each method returns a new Series, so the next method in the chain acts on the already-cleaned version.
Before: 8 apparent departments — some with only 1 employee. After: 3 real departments, correctly counted. Engineering had 4 employees the whole time. A headcount report generated from the raw data would have been completely wrong — and without this check, nobody would have noticed until a manager questioned the numbers in a meeting.
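Case and whitespace normalisation won't catch every variant. Abbreviations like 'Eng' or 'Mktg' (hypothetical examples, not present in the dataset above) survive .str.title() untouched, so they need an explicit alias map on top of the chain. A minimal sketch:

```python
import pandas as pd

dept = pd.Series(['engineering ', 'Eng', 'Mktg', 'MARKETING'])

# Normalise case and whitespace first, then collapse known aliases to canonical names
ALIASES = {'Eng': 'Engineering', 'Mktg': 'Marketing'}
dept_clean = dept.str.strip().str.title().replace(ALIASES)
print(dept_clean.value_counts())
```

The alias dictionary is dataset-specific and has to be built by eyeballing value_counts() output — there is no automatic way to know that 'Mktg' means Marketing.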
Impossible Values — Business Rule Validation
The scenario: Quality audit on the HR dataset — this time focused on impossible values. A salary of £0 is suspicious. An age under 18 violates the minimum working age rule. A start date in the future hasn't happened yet. A performance score above 100% is nonsensical. These values aren't null — they passed every standard null check. But they're wrong, and they'll corrupt any model or calculation that uses them.
import pandas as pd # pandas: data library — boolean indexing, .loc[], and .any() for validation
import numpy as np # numpy: numerical library — np.nan for flagging invalid values
# Dataset with impossible values — all fields are technically non-null but several are wrong
df = pd.DataFrame({
'employee_id': [101, 102, 103, 104, 105, 106, 107, 108],
'age': [34, 28, -5, 52, 17, 41, 200, 29], # -5 and 200 are impossible; 17 is under legal min
'salary': [72000, 58000, 45000, 81000, 0, 55000, 63000, 77000], # 0 is almost certainly wrong
'performance_score':[88, 94, 72, 110, 85, 91, 67, 78], # 110 exceeds the 100% maximum
'start_date': pd.to_datetime(['2021-03-15', '2020-07-01', '2022-01-10', '2023-08-14',
'2019-11-25', '2026-03-01', '2021-08-30', '2018-04-05'])
# 2026-03-01 is in the future relative to our reference date of 2024-06-01
})
reference_date = pd.Timestamp('2024-06-01') # fixed reference — "today" for this analysis
# Define business rules as boolean masks — True means the row VIOLATES the rule
# Each rule is one line: what's the condition that should never be true?
invalid_age_low = df['age'] < 18 # under minimum working age
invalid_age_high = df['age'] > 120 # biologically impossible
invalid_salary = df['salary'] <= 0 # salary must be positive
invalid_score = df['performance_score'] > 100 # percentage can't exceed 100
invalid_date = df['start_date'] > reference_date # future start date — hasn't happened
# Combine all flags into a summary DataFrame
validation = pd.DataFrame({
'age_too_low': invalid_age_low,
'age_too_high': invalid_age_high,
'salary_invalid': invalid_salary,
'score_over_100': invalid_score,
'future_start_date':invalid_date
})
# .any(axis=1) returns True for each row where at least one validation check failed
validation['any_issue'] = validation.any(axis=1) # axis=1 = check across columns (per row)
print("=== VALIDATION REPORT ===")
print(validation.to_string())
print()
print(f"Total rows with at least one issue: {validation['any_issue'].sum()} / {len(df)}")
=== VALIDATION REPORT ===
   age_too_low  age_too_high  salary_invalid  score_over_100  future_start_date  any_issue
0        False         False           False           False              False      False
1        False         False           False           False              False      False
2         True         False           False           False              False       True
3        False         False           False            True              False       True
4         True         False            True           False              False       True
5        False         False           False           False               True       True
6        False          True           False           False              False       True
7        False         False           False           False              False      False

Total rows with at least one issue: 5 / 8
What just happened?
pandas is handling the boolean logic throughout. When you write df['age'] < 16, pandas compares every value in the column against 16 and returns a boolean Series — one True/False per row. These boolean Series can be stored as columns, combined with & (and) or | (or), and summarised with .sum() (which counts Trues) or .any().
.any(axis=1) is the key aggregation: it checks each row across all columns and returns True if any single column is True for that row. axis=1 means "collapse across columns" — the opposite of the default axis=0 which collapses across rows.
5 out of 8 rows have at least one problem — and none of them would have been caught by a simple .isnull() check. This is why business rule validation is a separate, explicit step in a real data quality audit.
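Once violations are flagged, the usual remedy is to replace them with NaN and flag the rows for review, rather than delete whole records. pandas' .mask() applies that in one step — a minimal sketch on a cut-down version of the dataset (the thresholds here are our own example rules):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'age': [34, -5, 17, 200], 'salary': [72000, 45000, 0, 63000]})

# .mask(condition, value) replaces entries where the condition is True
df['age'] = df['age'].mask((df['age'] < 18) | (df['age'] > 120), np.nan)
df['salary'] = df['salary'].mask(df['salary'] <= 0, np.nan)
print(df)  # invalid entries are now NaN and will appear in any standard null report
```

This converts "invisible but wrong" values into ordinary missing values, so every downstream null check and imputation step handles them consistently.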
The Data Quality Issue Map
Here's a quick-reference visual of every issue type, the pandas method that catches it, and the standard fix.
| Issue Type | How to Detect | Standard Fix |
|---|---|---|
| True NaN / None | df.isnull().sum() | fillna(), dropna(), imputation |
| Hidden nulls ("N/A", "-") | df.replace(placeholders, np.nan) | replace() then handle as NaN |
| Wrong data type | df.dtypes | pd.to_numeric(), pd.to_datetime() |
| Case / whitespace chaos | value_counts() on text cols | .str.strip().str.lower().str.title() |
| Impossible values | Boolean masks + business rules | Replace with NaN, flag for review |
| Exact duplicates | df.duplicated().sum() | df.drop_duplicates() |
| Broken foreign keys | ~df['id'].isin(other_df['id']) | Drop orphaned rows or investigate source |
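Duplicates are the one category in this map without a worked example in this lesson, so here is a minimal sketch on a small hypothetical table. Note the distinction between exact duplicate rows (safe to drop) and duplicate IDs pointing at different records (which need investigation):

```python
import pandas as pd

df = pd.DataFrame({
    'employee_id': [101, 102, 102, 102, 103],
    'name': ['Alice', 'Bob', 'Bob', 'Robert', 'Carol']
})

print(df.duplicated().sum())  # exact duplicate rows (the second 'Bob')
df = df.drop_duplicates()     # keeps the first occurrence of each exact row

# Same ID, different record — investigate the source, don't auto-drop
conflicts = df[df.duplicated(subset='employee_id', keep=False)]
print(conflicts)
```

keep=False marks every member of a duplicated group, not just the repeats, so both conflicting 102 records appear in the report.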
Referential Integrity — The Silent Join Killer
The scenario: The HR platform stores employees in one table and their department assignments in another. Before building the headcount dashboard, you need to check that every employee ID in the assignments table actually exists in the employees table — and vice versa. Broken links mean rows will silently disappear when you join, and your dashboard will be wrong with no error message telling you why.
import pandas as pd # pandas: data library — .isin(), merge(), and boolean indexing
import numpy as np # numpy: standard numerical import
# Two related tables — employees master list and department assignments
employees = pd.DataFrame({
'employee_id': [101, 102, 103, 104, 105, 106],
'name': ['Alice', 'Bob', 'Carol', 'Diana', 'Eve', 'Frank']
})
assignments = pd.DataFrame({
'employee_id': [101, 102, 103, 107, 108], # 107 and 108 don't exist in employees table
'department': ['Engineering', 'Marketing', 'Sales', 'Engineering', 'Sales']
})
# Check 1: assignment rows with no matching employee (orphaned assignments)
# ~df.isin() = NOT in — finds values that are absent from the reference set
orphaned = assignments[~assignments['employee_id'].isin(employees['employee_id'])]
print("=== ORPHANED ASSIGNMENTS (IDs with no employee record) ===")
print(orphaned)
print()
# Check 2: employees with no department assignment (unassigned employees)
unassigned = employees[~employees['employee_id'].isin(assignments['employee_id'])]
print("=== UNASSIGNED EMPLOYEES (no department assignment) ===")
print(unassigned)
print()
# Check 3: demonstrate what an inner join loses vs a left join
inner_joined = pd.merge(employees, assignments, on='employee_id', how='inner') # only matched rows
left_joined = pd.merge(employees, assignments, on='employee_id', how='left') # all employees, NaN for unmatched
print(f"Inner join rows: {len(inner_joined)} (lost {len(employees) - len(inner_joined)} employees)")
print(f"Left join rows: {len(left_joined)} (all employees preserved, NaN for unassigned)")
=== ORPHANED ASSIGNMENTS (IDs with no employee record) ===
   employee_id   department
3          107  Engineering
4          108        Sales

=== UNASSIGNED EMPLOYEES (no department assignment) ===
   employee_id   name
3          104  Diana
4          105    Eve
5          106  Frank

Inner join rows: 3 (lost 3 employees)
Left join rows: 6 (all employees preserved, NaN for unassigned)
What just happened?
pandas provides .isin() — it returns a boolean Series that is True wherever a value appears in the provided collection. The ~ operator inverts it (NOT isin), giving us the rows that are absent. This is the standard pattern for referential integrity checks: "which IDs in table A don't exist in table B?"
pd.merge() with how='inner' silently drops all rows that don't match — no warning, no error. Three employees (Diana, Eve, Frank) would vanish from the headcount dashboard without a trace. Using how='left' keeps all employees and puts NaN in the department column for unassigned ones — making the problem visible instead of hiding it.
The referential integrity check should always run before any join in a production pipeline. Silent data loss is far more dangerous than a loud error.
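pd.merge() can also run the integrity check itself: passing indicator=True adds a _merge column that classifies each row as 'both', 'left_only', or 'right_only'. A sketch on two small hypothetical tables:

```python
import pandas as pd

employees = pd.DataFrame({'employee_id': [101, 102, 103],
                          'name': ['Alice', 'Bob', 'Carol']})
assignments = pd.DataFrame({'employee_id': [101, 104],
                            'department': ['Engineering', 'Sales']})

# how='outer' keeps rows from both sides; _merge labels each row's origin
audit = pd.merge(employees, assignments, on='employee_id',
                 how='outer', indicator=True)
print(audit[audit['_merge'] != 'both'])  # broken links from either side, in one report
```

This collapses both directions of the .isin() check into a single merge, which is convenient when auditing many table pairs.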
Teacher's Note
Data quality issues compound. A salary column stored as strings (wrong type) will appear to have no missing values — because "N/A" is a perfectly valid string. Fix the type first, and suddenly you discover hidden nulls. This is why the order in Phase 2 of the EDA workflow matters: type checking before null counting, placeholder standardisation before summary statistics.
Build a personal checklist for every new dataset: types → hidden nulls → impossible values → formatting → duplicates → referential integrity. Run it in that order every time. The fifteen minutes it takes will save you hours of debugging downstream.
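A checklist is easiest to run every time when it lives in code. A minimal sketch of such a helper — the function name and placeholder list are our own choices, and it covers only the checklist steps that generalise across datasets; impossible-value rules, formatting fixes, and referential checks are dataset-specific and need their own code:

```python
import pandas as pd
import numpy as np

PLACEHOLDERS = ['N/A', 'n/a', 'NA', 'none', 'None', '-', '--',
                'unknown', ' ', '', 'null', 'NULL']

def quality_audit(df):
    """Generic checklist steps: types, hidden nulls, exact duplicates.
    Business-rule validation and referential integrity are left to
    dataset-specific code."""
    std = df.replace(PLACEHOLDERS, np.nan)  # standardise hidden nulls first
    return {
        'dtypes': df.dtypes.astype(str).to_dict(),
        'missing_per_column': std.isnull().sum().to_dict(),
        'duplicate_rows': int(df.duplicated().sum()),
    }

df = pd.DataFrame({'id': [1, 2, 2], 'city': ['London', 'N/A', 'N/A']})
print(quality_audit(df))
```

Dropping a helper like this at the top of every new analysis notebook makes the first-fifteen-minutes audit a habit rather than an intention.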
Practice Questions
1. Which pandas method should you use to convert placeholder strings like "N/A", "-", and "none" into real NaN values?
2. When using pd.to_datetime() on a column that contains some unparseable strings, which argument prevents the function from crashing and instead converts bad values to NaT?
3. Which pandas method is used to check whether values in one column exist within another collection — useful for referential integrity checks?
Quiz
1. A dataset has a column where missing values were entered as the string "N/A". What does df.isnull().sum() return for this column?
2. A groupby('city') produces separate groups for "London", "london", and "London ". What is the correct fix?
3. You perform an inner join between two tables that have broken referential integrity. What happens?
Up Next · Lesson 16
Univariate Analysis
A deep dive into single-variable analysis — frequency tables, distribution shapes, and the full statistical portrait of one column at a time.