Excel Course
Data Cleaning Project
Every real-world Excel project begins with the same unglamorous first step: cleaning the data. Raw data from CRMs, exports, legacy systems, and manual entry is almost never clean. It arrives with inconsistent casing, merged fields, blank rows, duplicate records, mistyped values, and date formats that Excel refuses to recognise. This project takes a deliberately messy HR dataset through a full end-to-end cleaning workflow — using Power Query for structural problems and Excel formulas for field-level fixes — and produces a clean, validated, analysis-ready table. Every technique used in this project came from Sections I through VII of this course.
Project Overview
The Raw Data — What We Start With
The raw dataset is a CSV export of employee records from a legacy HR system. It has 12 columns and approximately 200 rows. The problems are typical of real-world HR exports: inconsistent casing, merged name fields, date format issues, blank rows at the bottom, a few duplicated employee IDs, and salary values entered as text strings with currency symbols.
| EmpID | Full Name | department | Start Date | Salary | Status | Issue |
|---|---|---|---|---|---|---|
| E1042 | smith, john | sales | 01/03/2019 | £42,000 | Active | Name: wrong case, Last,First order · Dept: lowercase · Salary: text with £ |
| E1043 | TAYLOR SARAH | Marketing | 15-Jan-20 | 38500 | Active | Name: ALL CAPS, no comma · Date: different format (15-Jan-20) |
| E1044 | Brown, Mike | IT | 2021-06-01 | 51000 | active | Name: leading/trailing spaces · Status: lowercase "active" |
| E1042 | Smith, John | Sales | 01/03/2019 | 42000 | Active | DUPLICATE — EmpID E1042 appears twice |
| (blank) | (blank) | (blank) | (blank) | (blank) | (blank) | BLANK ROW — must be removed |
Phase 1 — Inspect and Audit
Before changing anything, document every problem. A cleaning audit prevents you from missing issues and gives you a checklist to verify against when the work is done. Run these checks on the raw data before opening Power Query.
Audit checklist — run these checks on the raw data:
1. Row count
=COUNTA(A:A)-1 → how many non-blank EmpID rows exist
Compare to expected headcount from HR system
2. Duplicate EmpIDs
=COUNTIF(A:A,A2) → copy down — any value >1 is a duplicate
Or: Data → Remove Duplicates → preview count (but don't apply yet)
3. Blank rows
=COUNTBLANK(A2:A201) → count blank cells in EmpID column
Any blank = a blank row to remove
4. Salary column type
=ISNUMBER(E2) → FALSE means it's stored as text
=ISTEXT(E2) → TRUE means it has a £ symbol or comma
5. Date column consistency
=ISNUMBER(D2) → TRUE means Excel has recognised it as a date
FALSE means it's a text string in an unusual format
6. Status unique values
=UNIQUE(F2:F201) → spill list of all distinct values
Should only contain "Active" and "Leaver" — any others are errors
7. Department unique values
=UNIQUE(C2:C201) → should show "Sales" "Marketing" "IT" "Finance" "HR"
Watch for: "sales" "SALES" "Slaes" (typo) as separate entries
Write these findings down before you touch anything. Your audit log is the record of what the data looked like and what you changed — it protects you if questions arise later about why certain rows were removed or modified. A one-page table with "Issue / Count / Fix Applied" is sufficient and takes five minutes to produce.
Phase 2 — Power Query: Structural Fixes
Load the CSV into Power Query:
Data → Get Data → From File → From Text/CSV
Preview → Transform Data (opens Power Query Editor)
Step 1: Remove blank rows
Home → Remove Rows → Remove Blank Rows
This removes any row where all columns are null/empty
Step 2: Remove duplicates on EmpID
Right-click the EmpID column header → Remove Duplicates
Power Query keeps the first occurrence and removes subsequent ones
Always check: is the first occurrence the correct/most recent one?
If not, sort by a date column first before removing duplicates
Step 3: Split the Full Name column
The raw data has "Smith, John" or "TAYLOR SARAH" — inconsistent
First: standardise to "Last, First" format using a formula helper (Phase 3)
Then in Power Query: right-click Full Name → Split Column → By Delimiter
Delimiter: comma → At each occurrence
Rename resulting columns: Last Name, First Name
Trim both columns: Transform → Format → Trim
Step 4: Fix the Salary column
The column contains values like "£42,000" and "38500" mixed
Select the Salary column → Transform → Replace Values
Replace "£" with "" (empty string)
Replace "," with "" (empty string)
Change column type: right-click header → Change Type → Whole Number
Step 5: Fix date columns
Select the Start Date column
Right-click → Change Type → Using Locale
Data type: Date · Locale: English (United Kingdom) for DD/MM/YYYY
Power Query parses multiple date formats automatically when locale is set correctly
Step 6: Close and Load
Home → Close & Load To... → Table → New worksheet
Name the sheet "Clean_Stage1"
| # | Applied Step | What It Does |
|---|---|---|
| 1 | Source | Connects to the CSV file |
| 2 | Promoted Headers | First row becomes column headers |
| 3 | Removed Blank Rows | 3 empty rows removed |
| 4 | Removed Duplicates | 3 duplicate EmpID rows removed |
| 5 | Split Column by Delimiter | Full Name split into Last Name + First Name |
| 6 | Replaced Values (x2) | £ and , removed from Salary |
| 7 | Changed Type | Salary to Whole Number · Start Date to Date |
| 8 | Trimmed Text | Leading/trailing spaces removed from all text columns |
Phase 3 — Standardise With Excel Formulas
On the Clean_Stage1 sheet, add helper columns to standardise text fields.
When satisfied, paste-as-values over the original columns and delete helpers.
Standardise First Name and Last Name to Proper Case:
=PROPER(B2) → "SARAH" becomes "Sarah" · "smith" becomes "Smith"
Standardise Department to Proper Case + fix typos:
=SWITCH(TRIM(LOWER(C2)),
"sales", "Sales",
"marketing", "Marketing",
"it", "IT",
"finance", "Finance",
"hr", "HR",
"slaes", "Sales", ← catch the typo
"Unknown" ← default for anything else
)
Standardise Status:
=PROPER(TRIM(F2))
"active" → "Active" · "ACTIVE" → "Active" · "leaver" → "Leaver"
Verify no unknowns remain:
=COUNTIF(G:G,"Unknown") → should be 0 after corrections
=UNIQUE(G2:G201) → should only show "Sales" "Marketing" "IT" "Finance" "HR"
The SWITCH with LOWER approach for departments is the most robust standardisation pattern. LOWER converts everything to lowercase first, so you only need to handle one version of each variant — no matter whether the original was "SALES", "Sales", or "sales", LOWER turns it into "sales" before SWITCH evaluates it. The typo catch ("slaes" → "Sales") comes from the audit you ran in Phase 1.
Phase 4 — Validate With Formulas
Add a "Validation" column that checks multiple rules and returns a status:
=IF(
OR(
B2="", ← blank First Name
C2="", ← blank Last Name
NOT(ISNUMBER(E2)), ← Salary not a number
E2 < 15000, ← Salary below minimum threshold
E2 > 250000, ← Salary above maximum threshold
NOT(ISNUMBER(D2)), ← Start Date not recognised
D2 > TODAY(), ← Start Date in the future
AND(F2<>"Active",F2<>"Leaver") ← Status not a valid value
),
"REVIEW",
"OK"
)
Count how many rows need review:
=COUNTIF(H:H,"REVIEW") → number of rows flagged
Filter to only the flagged rows for manual inspection:
=FILTER(A2:H201, H2:H201="REVIEW")
After reviewing each flagged row manually:
Either correct the value and change Validation to "OK"
Or document it as a known data issue and keep the REVIEW flag
Phase 5 — Enrich With Calculated Columns
Add these calculated columns to make the clean table analysis-ready:
Tenure (years in company):
=DATEDIF(D2, TODAY(), "Y") & " yrs"
Or as a number for analysis: =DATEDIF(D2, TODAY(), "Y")
Tenure band (for grouping):
=IFS(
DATEDIF(D2,TODAY(),"Y") < 1, "0-1 years",
DATEDIF(D2,TODAY(),"Y") < 3, "1-3 years",
DATEDIF(D2,TODAY(),"Y") < 5, "3-5 years",
DATEDIF(D2,TODAY(),"Y") < 10, "5-10 years",
TRUE, "10+ years"
)
Salary band:
=IFS(
E2 < 25000, "Band A: Under 25K",
E2 < 40000, "Band B: 25K-40K",
E2 < 60000, "Band C: 40K-60K",
E2 < 90000, "Band D: 60K-90K",
TRUE, "Band E: 90K+"
)
Start Year (for hire trend analysis):
=YEAR(D2)
Department Code (short code for dashboard use):
=SWITCH(C2,
"Sales", "SLS",
"Marketing", "MKT",
"IT", "IT",
"Finance", "FIN",
"HR", "HR",
"UNK"
)
Phase 6 — Final Output
Steps to produce the final clean output:
1. On the working sheet, select all helper formula columns
Copy → Paste Special → Values only
This converts all formulas to static values so the table is portable
2. Delete the original messy columns (Full Name, raw Salary, raw Status)
The clean versions in the helper columns are now the permanent values
3. Reorder columns to the agreed final structure:
EmpID | First Name | Last Name | Department | Dept Code |
Start Date | Start Year | Salary | Salary Band | Tenure | Tenure Band | Status | Validation
4. Convert to an Excel Table:
Select all data → Ctrl+T → tick "My table has headers" → OK
Name the table: Employees_Clean
5. Add a documentation sheet:
New sheet named "Cleaning Log"
Record: source file name, date cleaned, row count before/after,
issues found (from audit), issues fixed, issues remaining (REVIEW rows)
6. Final row count check:
Raw file rows: 206 (including 3 blanks, 3 duplicate extra rows, header)
After cleaning: 199 (200 employees minus 1 true duplicate confirmed with HR)
Validation OK: 188
Validation REVIEW: 11 (documented in Cleaning Log)
6 duplicate EmpIDs
Mixed case names
3 date formats
Salary as text (£42,000)
Status: "active" / "ACTIVE" / "Active"
No derived columns
No validation
Duplicates resolved
Proper Case throughout
Single consistent date format
Salary as integer (42000)
Status: "Active" / "Leaver" only
5 derived columns added
Validation flag on each row
🠐 Practice
Q1. Your audit finds that the Department column contains "Sales", "sales", "SALES", and "Slaes". Write the SWITCH formula pattern that standardises all four to "Sales".
Q2. In Power Query, you need to remove duplicates based on the EmpID column but want to keep the most recently updated record, not the first occurrence. What step do you do before removing duplicates?
Q3. Why should you paste-as-values over your formula helper columns before delivering the final clean table?
🟣 Quiz
Q1. What is the correct order of operations when cleaning a dataset with both structural issues (blank rows, duplicates) and field-level issues (wrong casing, text-as-numbers)?
Q2. You find 6 rows with EmpID E1042 in the raw data. After investigating, you confirm only one is correct. What is the risk of using Power Query's "Remove Duplicates" without checking first?
Q3. The Validation column flags 11 rows as "REVIEW". What should you do with these rows — delete them or keep them?
Next up — Sales Dashboard Project, where we take a clean multi-table sales dataset through Power Query, Power Pivot, DAX measures, and a full dashboard build to produce a polished interactive Sales Performance Dashboard.