EDA Course
Data Quality Issues
Missing values and outliers get all the attention. But the data problems that silently wreck real projects are subtler — dates stored as text, negative ages, mismatched category spellings, impossible values that pass every null check. This lesson is a systematic tour of what actually goes wrong in production data.
The Full Taxonomy of Data Quality Problems
Data quality issues fall into six categories. Missing values are just one of them — and often not the most damaging one. Here's the complete map:
1 · Missing Values
NaN, None, empty strings, placeholder text like "N/A" or "unknown" that wasn't caught at ingestion.
2 · Wrong Data Types
Dates as strings, numbers as objects, booleans as integers. Can't compute on them until you fix the type.
3 · Inconsistent Formatting
"London", "london", "LONDON" treated as three different cities. Extra spaces, mixed case, inconsistent punctuation.
4 · Impossible Values
Age of -5, percentage of 150%, a transaction dated year 1800. Technically not null — but factually wrong.
5 · Duplicates
Exact row duplicates, near-duplicates with minor differences, duplicate IDs pointing to different records.
6 · Referential Integrity
Customer IDs in an orders table that don't exist in the customers table. Broken foreign key relationships.
Hunting Hidden Missing Values
The scenario: You're a data analyst at an HR software company. A client has exported their employee database and sent it to you for cleaning before it's migrated into your platform. The client insists "the data is clean — we filled in every field." But you've seen this before. Strings like "N/A", "none", "-", and a single space are not the same as NaN — pandas won't detect them automatically. You need to find all of them.
import pandas as pd # pandas: Python's core data table library — isnull(), replace(), and column ops
import numpy as np # numpy: numerical library — np.nan used to standardise missing value representation
# Employee dataset — notice the disguised missing values hiding in plain sight
df = pd.DataFrame({
'employee_id': [101, 102, 103, 104, 105, 106, 107, 108],
'name': ['Alice Brown', 'Bob Smith', 'N/A', 'Diana Prince',
'Eve Carter', '-', 'Grace Lee', 'Henry Ford'],
'department': ['Engineering', 'Marketing', 'Sales', 'Engineering',
'none', 'Marketing', ' ', 'Sales'], # 'none' and ' ' (space) are hidden nulls
'salary': [72000, 58000, 45000, 81000, 63000, 55000, 0, 77000], # 0 might mean missing
'start_date': ['2021-03-15', '2020-07-01', '2022-01-10', 'unknown',
'2019-11-25', '2023-06-14', '2021-08-30', '2018-04-05'] # 'unknown' is hidden null
})
print("=== RAW DATA ===")
print(df.to_string(index=False))
print()
# Standard .isnull() will miss all of these — they look like strings, not NaN
print(f"Standard null count: {df.isnull().sum().sum()}") # total NaN across entire DataFrame — will show 0
print()
# Define all the strings your organisation uses as placeholders for missing data
MISSING_PLACEHOLDERS = ['N/A', 'n/a', 'NA', 'none', 'None', 'NONE', '-', '--', 'unknown',
'Unknown', 'UNKNOWN', ' ', '', 'null', 'NULL']
# Replace all placeholder strings with np.nan so pandas can handle them uniformly
df_clean = df.replace(MISSING_PLACEHOLDERS, np.nan) # .replace() with a list swaps any matching value to NaN
# Now also flag salary = 0 as suspicious — a real salary of £0 is almost certainly a data entry error
df_clean.loc[df_clean['salary'] == 0, 'salary'] = np.nan # .loc[] for conditional assignment
print("=== MISSING VALUE REPORT (after standardisation) ===")
missing = df_clean.isnull().sum()
missing = missing[missing > 0] # filter to only affected columns
print(missing)
=== RAW DATA ===
employee_id name department salary start_date
101 Alice Brown Engineering 72000 2021-03-15
102 Bob Smith Marketing 58000 2020-07-01
103 N/A Sales 45000 2022-01-10
104 Diana Prince Engineering 81000 unknown
105 Eve Carter none 63000 2019-11-25
106 - Marketing 55000 2023-06-14
107 Grace Lee                    0 2021-08-30
108 Henry Ford Sales 77000 2018-04-05
Standard null count: 0
=== MISSING VALUE REPORT (after standardisation) ===
name 2
department 2
salary 1
start_date 1
dtype: int64
What just happened?
pandas is Python's core data table library. The key method here is .replace() — when you pass it a list as the first argument, it replaces every occurrence of any value in that list with the second argument. We pass np.nan as the replacement so all disguised nulls become real NaN values that pandas can detect and handle uniformly.
numpy provides np.nan — the standard floating-point representation of a missing value. It's the format pandas uses internally for all missing numeric data.
Before standardisation: zero nulls detected. After: six missing values across four columns. The client was technically right that "every field was filled" — they just filled some fields with strings meaning "we don't know." This is one of the most common data quality traps in real projects. Always run a placeholder scan before trusting a null count.
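The placeholder list above was hand-built from experience. Before hardcoding one for a new dataset, a quick heuristic scan of each text column's unique values can surface candidates: blank or very short tokens are the usual suspects. A minimal sketch, using an illustrative series rather than the lesson's dataset:

```python
import pandas as pd

# Illustrative text column containing a few disguised nulls
s = pd.Series(['Engineering', 'N/A', 'Marketing', '-', 'Engineering', ' '])

# Flag unique values that are blank or suspiciously short, then review them by hand
candidates = [v for v in s.unique()
              if isinstance(v, str) and (v.strip() == '' or len(v.strip()) <= 3)]
print(candidates)  # ['N/A', '-', ' ']
```

Running value_counts(dropna=False) on each object column is the manual equivalent: genuine categories repeat, while placeholders tend to show up as short, low-frequency oddities.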
Wrong Data Types — Finding and Fixing Them
The scenario: The same HR dataset has a start_date column. It came in as plain strings. Your analysis needs to calculate how long each employee has been at the company — but you can't subtract strings from dates. You also need to validate that the salary column is truly numeric, not a mix of numbers and text that pandas silently stored as object type.
import pandas as pd # pandas: data library — pd.to_datetime(), pd.to_numeric(), and dtype checking
import numpy as np # numpy: numerical library — np.nan for missing value replacement
# Rebuild a clean version of the dataset after placeholder standardisation from the previous block
df = pd.DataFrame({
'employee_id': [101, 102, 103, 104, 105, 106, 107, 108],
'name': ['Alice Brown', 'Bob Smith', np.nan, 'Diana Prince',
'Eve Carter', np.nan, 'Grace Lee', 'Henry Ford'],
'department': ['Engineering', 'Marketing', 'Sales', 'Engineering',
np.nan, 'Marketing', np.nan, 'Sales'],
'salary': ['72000', '58000', '45000', '81000', # intentionally stored as strings
'63000', '55000', np.nan, '77000'], # to simulate a mixed-type column
'start_date': ['2021-03-15', '2020-07-01', '2022-01-10', np.nan,
'2019-11-25', '2023-06-14', '2021-08-30', '2018-04-05']
})
print("Dtypes BEFORE fixing:")
print(df.dtypes)
print()
# Fix 1: Convert salary from object (string) to numeric
# errors='coerce' turns anything that can't be parsed into NaN instead of raising an error
df['salary'] = pd.to_numeric(df['salary'], errors='coerce') # pd.to_numeric() handles mixed types safely
# Fix 2: Convert start_date from object (string) to datetime
# errors='coerce' turns unparseable dates (like 'unknown') into NaT — the datetime equivalent of NaN
df['start_date'] = pd.to_datetime(df['start_date'], errors='coerce') # pd.to_datetime() for date columns
print("Dtypes AFTER fixing:")
print(df.dtypes)
print()
# Now we can do date arithmetic — calculate tenure in days from start_date to a fixed reference date
reference_date = pd.Timestamp('2024-06-01') # fixed reference date for reproducibility
df['tenure_days'] = (reference_date - df['start_date']).dt.days # .dt.days extracts integer days from a Timedelta
print("Tenure calculation (days employed as of 2024-06-01):")
print(df[['employee_id', 'name', 'start_date', 'tenure_days']].to_string(index=False))
Dtypes BEFORE fixing:
employee_id int64
name object
department object
salary object
start_date object
dtype: object
Dtypes AFTER fixing:
employee_id int64
name object
department object
salary float64
start_date datetime64[ns]
tenure_days float64
dtype: object
Tenure calculation (days employed as of 2024-06-01):
employee_id name start_date tenure_days
101 Alice Brown 2021-03-15 1174.0
102 Bob Smith 2020-07-01 1431.0
103 NaN 2022-01-10 872.0
104 Diana Prince NaT NaN
105 Eve Carter 2019-11-25 1649.0
106 NaN 2023-06-14 352.0
107 Grace Lee 2021-08-30 1005.0
108 Henry Ford 2018-04-05 2248.0
What just happened?
pandas provides two essential type-conversion functions: pd.to_numeric() converts a column to a numeric type, and pd.to_datetime() converts a column to datetime. The errors='coerce' argument is critical — without it, a single unparseable value raises an exception and crashes your entire script. With it, bad values silently become NaN or NaT (Not a Time — datetime's version of NaN), and you can deal with them separately.
The .dt accessor unlocks datetime-specific methods on a Series — .dt.days extracts the integer day count from a Timedelta object, giving us clean tenure numbers.
Notice employee 104 (Diana Prince): her start_date was 'unknown', so it became NaT, and tenure_days became NaN. The fix didn't invent data — it preserved the uncertainty correctly.
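One caveat: errors='coerce' is convenient but silent — it never reports which raw values it discarded. A way to audit the conversion, sketched on a small illustrative series, is to compare missingness before and after parsing:

```python
import pandas as pd

raw = pd.Series(['2021-03-15', 'unknown', '2020-07-01', None])
parsed = pd.to_datetime(raw, errors='coerce')  # 'unknown' becomes NaT

# A value was coerced if it was present before parsing but missing after
coerced = raw[raw.notna() & parsed.isna()]
print(coerced)  # the raw strings that failed to parse — here, 'unknown'
```

Logging these coerced values alongside the cleaned dataset makes the conversion reviewable instead of invisible.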
Inconsistent Formatting — The Case Chaos Problem
The scenario: You're running a .groupby('department') to count headcount per team — and the results look completely wrong. Engineering shows up three times: "Engineering", "engineering", and "ENGINEERING". Marketing appears as "Marketing " (with a trailing space). These aren't different departments — they're the same department entered inconsistently by different people. A groupby treats them as separate groups, silently producing a broken report.
import pandas as pd # pandas: data library — .str accessor, .groupby(), and .value_counts()
import numpy as np # numpy: standard EDA import
# Dataset with inconsistent department formatting — a real-world nightmare
df = pd.DataFrame({
'employee_id': [101, 102, 103, 104, 105, 106, 107, 108, 109, 110],
'name': ['Alice', 'Bob', 'Carol', 'Diana', 'Eve',
'Frank', 'Grace', 'Henry', 'Iris', 'Jack'],
'department': ['Engineering', 'marketing ', 'Sales', 'ENGINEERING', 'Sales',
'Marketing', 'engineering', 'Sales ', 'Engineering', 'MARKETING']
# ^trailing ^all caps ^lowercase ^trailing ^all caps
})
# Broken groupby BEFORE cleaning — shows 6 "departments" when there are really 3
print("=== HEADCOUNT BEFORE CLEANING (broken) ===")
print(df.groupby('department')['employee_id'].count())
print()
# Fix all three problems in one chained .str operation:
# 1. .str.strip() — removes leading and trailing whitespace
# 2. .str.lower() — converts to all lowercase
# 3. .str.title() — capitalises first letter of each word (Title Case)
df['department'] = df['department'].str.strip().str.lower().str.title()
# ^whitespace ^normalise ^pretty format
# Correct groupby AFTER cleaning
print("=== HEADCOUNT AFTER CLEANING (correct) ===")
print(df.groupby('department')['employee_id'].count())
=== HEADCOUNT BEFORE CLEANING (broken) ===
department
ENGINEERING     1
Engineering     2
MARKETING       1
Marketing       1
Sales           2
Sales           1
engineering     1
marketing       1
Name: employee_id, dtype: int64

=== HEADCOUNT AFTER CLEANING (correct) ===
department
Engineering    4
Marketing      3
Sales          3
Name: employee_id, dtype: int64
What just happened?
pandas provides the .str accessor — a special interface that unlocks string methods on an entire Series column at once, without needing a Python loop. .str.strip() removes whitespace from both ends of every string, .str.lower() converts everything to lowercase, and .str.title() then capitalises the first letter of each word. Chaining them in sequence is clean and readable — each method returns a new Series, so the next method in the chain acts on the already-cleaned version.
Before: 8 apparent departments — some with only 1 employee. After: 3 real departments, correctly counted. Engineering had 4 employees the whole time. A headcount report generated from the raw data would have been completely wrong — and without this check, nobody would have noticed until a manager questioned the numbers in a meeting.
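Case and whitespace normalisation won't catch every variant. Abbreviations like 'Eng' or 'Mktg' (hypothetical examples, not present in the dataset above) survive .str.title() untouched, so they need an explicit alias map on top of the chain. A minimal sketch:

```python
import pandas as pd

dept = pd.Series(['engineering ', 'Eng', 'Mktg', 'MARKETING'])

# Normalise case and whitespace first, then collapse known aliases to canonical names
ALIASES = {'Eng': 'Engineering', 'Mktg': 'Marketing'}
dept_clean = dept.str.strip().str.title().replace(ALIASES)
print(dept_clean.value_counts())
```

The alias dictionary is dataset-specific and has to be built by eyeballing value_counts() output — there is no automatic way to know that 'Mktg' means Marketing.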
Impossible Values — Business Rule Validation
The scenario: Quality audit on the HR dataset — this time focused on impossible values. A salary of £0 is suspicious. An age under 18 violates the minimum working age rule. A start date in the future hasn't happened yet. A performance score above 100% is nonsensical. These values aren't null — they passed every standard null check. But they're wrong, and they'll corrupt any model or calculation that uses them.
import pandas as pd # pandas: data library — boolean indexing, .loc[], and .any() for validation
import numpy as np # numpy: numerical library — np.nan for flagging invalid values
# Dataset with impossible values — all fields are technically non-null but several are wrong
df = pd.DataFrame({
'employee_id': [101, 102, 103, 104, 105, 106, 107, 108],
'age': [34, 28, -5, 52, 17, 41, 200, 29], # -5 and 200 are impossible; 17 is under legal min
'salary': [72000, 58000, 45000, 81000, 0, 55000, 63000, 77000], # 0 is almost certainly wrong
'performance_score':[88, 94, 72, 110, 85, 91, 67, 78], # 110 exceeds the 100% maximum
'start_date': pd.to_datetime(['2021-03-15', '2020-07-01', '2022-01-10', '2023-08-14',
'2019-11-25', '2026-03-01', '2021-08-30', '2018-04-05'])
# 2026-03-01 is in the future relative to our reference date of 2024-06-01
})
reference_date = pd.Timestamp('2024-06-01') # fixed reference — "today" for this analysis
# Define business rules as boolean masks — True means the row VIOLATES the rule
# Each rule is one line: what's the condition that should never be true?
invalid_age_low = df['age'] < 18 # under minimum working age
invalid_age_high = df['age'] > 120 # biologically impossible
invalid_salary = df['salary'] <= 0 # salary must be positive
invalid_score = df['performance_score'] > 100 # percentage can't exceed 100
invalid_date = df['start_date'] > reference_date # future start date — hasn't happened
# Combine all flags into a summary DataFrame
validation = pd.DataFrame({
'age_too_low': invalid_age_low,
'age_too_high': invalid_age_high,
'salary_invalid': invalid_salary,
'score_over_100': invalid_score,
'future_start_date':invalid_date
})
# .any(axis=1) returns True for each row where at least one validation check failed
validation['any_issue'] = validation.any(axis=1) # axis=1 = check across columns (per row)
print("=== VALIDATION REPORT ===")
print(validation.to_string())
print()
print(f"Total rows with at least one issue: {validation['any_issue'].sum()} / {len(df)}")
=== VALIDATION REPORT ===
   age_too_low  age_too_high  salary_invalid  score_over_100  future_start_date  any_issue
0        False         False           False           False              False      False
1        False         False           False           False              False      False
2         True         False           False           False              False       True
3        False         False           False            True              False       True
4         True         False            True           False              False       True
5        False         False           False           False               True       True
6        False          True           False           False              False       True
7        False         False           False           False              False      False

Total rows with at least one issue: 5 / 8
What just happened?
pandas is handling the boolean logic throughout. When you write df['age'] < 16, pandas compares every value in the column against 16 and returns a boolean Series — one True/False per row. These boolean Series can be stored as columns, combined with & (and) or | (or), and summarised with .sum() (which counts Trues) or .any().
.any(axis=1) is the key aggregation: it checks each row across all columns and returns True if any single column is True for that row. axis=1 means "collapse across columns" — the opposite of the default axis=0 which collapses across rows.
5 out of 8 rows have at least one problem — and none of them would have been caught by a simple .isnull() check. This is why business rule validation is a separate, explicit step in a real data quality audit.
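Once violations are flagged, the usual remedy is to replace them with NaN and flag the rows for review, rather than delete whole records. pandas' .mask() applies that in one step — a minimal sketch on a cut-down version of the dataset (the thresholds here are our own example rules):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'age': [34, -5, 17, 200], 'salary': [72000, 45000, 0, 63000]})

# .mask(condition, value) replaces entries where the condition is True
df['age'] = df['age'].mask((df['age'] < 18) | (df['age'] > 120), np.nan)
df['salary'] = df['salary'].mask(df['salary'] <= 0, np.nan)
print(df)  # invalid entries are now NaN and will appear in any standard null report
```

This converts "invisible but wrong" values into ordinary missing values, so every downstream null check and imputation step handles them consistently.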
The Data Quality Issue Map
Here's a quick-reference visual of every issue type, the pandas method that catches it, and the standard fix.
| Issue Type | How to Detect | Standard Fix |
|---|---|---|
| True NaN / None | df.isnull().sum() | fillna(), dropna(), imputation |
| Hidden nulls ("N/A", "-") | df.replace(placeholders, np.nan) | replace() then handle as NaN |
| Wrong data type | df.dtypes | pd.to_numeric(), pd.to_datetime() |
| Case / whitespace chaos | value_counts() on text cols | .str.strip().str.lower().str.title() |
| Impossible values | Boolean masks + business rules | Replace with NaN, flag for review |
| Exact duplicates | df.duplicated().sum() | df.drop_duplicates() |
| Broken foreign keys | ~df['id'].isin(other_df['id']) | Drop orphaned rows or investigate source |
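Duplicates are the one category in this map without a worked example in this lesson, so here is a minimal sketch on a small hypothetical table. Note the distinction between exact duplicate rows (safe to drop) and duplicate IDs pointing at different records (which need investigation):

```python
import pandas as pd

df = pd.DataFrame({
    'employee_id': [101, 102, 102, 102, 103],
    'name': ['Alice', 'Bob', 'Bob', 'Robert', 'Carol']
})

print(df.duplicated().sum())  # exact duplicate rows (the second 'Bob')
df = df.drop_duplicates()     # keeps the first occurrence of each exact row

# Same ID, different record — investigate the source, don't auto-drop
conflicts = df[df.duplicated(subset='employee_id', keep=False)]
print(conflicts)
```

keep=False marks every member of a duplicated group, not just the repeats, so both conflicting 102 records appear in the report.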
Referential Integrity — The Silent Join Killer
The scenario: The HR platform stores employees in one table and their department assignments in another. Before building the headcount dashboard, you need to check that every employee ID in the assignments table actually exists in the employees table — and vice versa. Broken links mean rows will silently disappear when you join, and your dashboard will be wrong with no error message telling you why.
import pandas as pd # pandas: data library — .isin(), merge(), and boolean indexing
import numpy as np # numpy: standard numerical import
# Two related tables — employees master list and department assignments
employees = pd.DataFrame({
'employee_id': [101, 102, 103, 104, 105, 106],
'name': ['Alice', 'Bob', 'Carol', 'Diana', 'Eve', 'Frank']
})
assignments = pd.DataFrame({
'employee_id': [101, 102, 103, 107, 108], # 107 and 108 don't exist in employees table
'department': ['Engineering', 'Marketing', 'Sales', 'Engineering', 'Sales']
})
# Check 1: assignment rows with no matching employee (orphaned assignments)
# ~df.isin() = NOT in — finds values that are absent from the reference set
orphaned = assignments[~assignments['employee_id'].isin(employees['employee_id'])]
print("=== ORPHANED ASSIGNMENTS (IDs with no employee record) ===")
print(orphaned)
print()
# Check 2: employees with no department assignment (unassigned employees)
unassigned = employees[~employees['employee_id'].isin(assignments['employee_id'])]
print("=== UNASSIGNED EMPLOYEES (no department assignment) ===")
print(unassigned)
print()
# Check 3: demonstrate what an inner join loses vs a left join
inner_joined = pd.merge(employees, assignments, on='employee_id', how='inner') # only matched rows
left_joined = pd.merge(employees, assignments, on='employee_id', how='left') # all employees, NaN for unmatched
print(f"Inner join rows: {len(inner_joined)} (lost {len(employees) - len(inner_joined)} employees)")
print(f"Left join rows: {len(left_joined)} (all employees preserved, NaN for unassigned)")
=== ORPHANED ASSIGNMENTS (IDs with no employee record) ===
   employee_id   department
3          107  Engineering
4          108        Sales

=== UNASSIGNED EMPLOYEES (no department assignment) ===
   employee_id   name
3          104  Diana
4          105    Eve
5          106  Frank

Inner join rows: 3 (lost 3 employees)
Left join rows: 6 (all employees preserved, NaN for unassigned)
What just happened?
pandas provides .isin() — it returns a boolean Series that is True wherever a value appears in the provided collection. The ~ operator inverts it (NOT isin), giving us the rows that are absent. This is the standard pattern for referential integrity checks: "which IDs in table A don't exist in table B?"
pd.merge() with how='inner' silently drops all rows that don't match — no warning, no error. Three employees (Diana, Eve, Frank) would vanish from the headcount dashboard without a trace. Using how='left' keeps all employees and puts NaN in the department column for unassigned ones — making the problem visible instead of hiding it.
The referential integrity check should always run before any join in a production pipeline. Silent data loss is far more dangerous than a loud error.
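pd.merge() can also run the integrity check itself: passing indicator=True adds a _merge column that classifies each row as 'both', 'left_only', or 'right_only'. A sketch on two small hypothetical tables:

```python
import pandas as pd

employees = pd.DataFrame({'employee_id': [101, 102, 103],
                          'name': ['Alice', 'Bob', 'Carol']})
assignments = pd.DataFrame({'employee_id': [101, 104],
                            'department': ['Engineering', 'Sales']})

# how='outer' keeps rows from both sides; _merge labels each row's origin
audit = pd.merge(employees, assignments, on='employee_id',
                 how='outer', indicator=True)
print(audit[audit['_merge'] != 'both'])  # broken links from either side, in one report
```

This collapses both directions of the .isin() check into a single merge, which is convenient when auditing many table pairs.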
Teacher's Note
Data quality issues compound. A salary column stored as strings (wrong type) will appear to have no missing values — because "N/A" is a perfectly valid string. Fix the type first, and suddenly you discover hidden nulls. This is why the order in Phase 2 of the EDA workflow matters: type checking before null counting, placeholder standardisation before summary statistics.
Build a personal checklist for every new dataset: types → hidden nulls → impossible values → formatting → duplicates → referential integrity. Run it in that order every time. The fifteen minutes it takes will save you hours of debugging downstream.
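A checklist is easiest to run every time when it lives in code. A minimal sketch of such a helper — the function name and placeholder list are our own choices, and it covers only the checklist steps that generalise across datasets; impossible-value rules, formatting fixes, and referential checks are dataset-specific and need their own code:

```python
import pandas as pd
import numpy as np

PLACEHOLDERS = ['N/A', 'n/a', 'NA', 'none', 'None', '-', '--',
                'unknown', ' ', '', 'null', 'NULL']

def quality_audit(df):
    """Generic checklist steps: types, hidden nulls, exact duplicates.
    Business-rule validation and referential integrity are left to
    dataset-specific code."""
    std = df.replace(PLACEHOLDERS, np.nan)  # standardise hidden nulls first
    return {
        'dtypes': df.dtypes.astype(str).to_dict(),
        'missing_per_column': std.isnull().sum().to_dict(),
        'duplicate_rows': int(df.duplicated().sum()),
    }

df = pd.DataFrame({'id': [1, 2, 2], 'city': ['London', 'N/A', 'N/A']})
print(quality_audit(df))
```

Dropping a helper like this at the top of every new analysis notebook makes the first-fifteen-minutes audit a habit rather than an intention.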
Practice Questions
1. Which pandas method should you use to convert placeholder strings like "N/A", "-", and "none" into real NaN values?
2. When using pd.to_datetime() on a column that contains some unparseable strings, which argument prevents the function from crashing and instead converts bad values to NaT?
3. Which pandas method is used to check whether values in one column exist within another collection — useful for referential integrity checks?
Quiz
1. A dataset has a column where missing values were entered as the string "N/A". What does df.isnull().sum() return for this column?
2. A groupby('city') produces separate groups for "London", "london", and "London ". What is the correct fix?
3. You perform an inner join between two tables that have broken referential integrity. What happens?
Up Next · Lesson 16
Univariate Analysis
A deep dive into single-variable analysis — frequency tables, distribution shapes, and the full statistical portrait of one column at a time.