Excel Lesson 46 – Data Cleaning Project | Dataplexa
Lesson 46 · Final Projects

Data Cleaning Project

Every real-world Excel project begins with the same unglamorous first step: cleaning the data. Raw data from CRMs, exports, legacy systems, and manual entry is almost never clean. It arrives with inconsistent casing, merged fields, blank rows, duplicate records, mistyped values, and date formats that Excel refuses to recognise. This project takes a deliberately messy HR dataset through a full end-to-end cleaning workflow — using Power Query for structural problems and Excel formulas for field-level fixes — and produces a clean, validated, analysis-ready table. Every technique used in this project came from Sections I through VII of this course.

Project Overview

Cleaning Workflow — Six Phases
Phase 1
Inspect & audit — identify all issues before touching anything
Phase 2
Power Query — structural fixes: remove blanks, duplicates, split columns
Phase 3
Standardise — text casing, trim whitespace, consistent date format
Phase 4
Validate — flag rows with missing or out-of-range values using formulas
Phase 5
Enrich — add calculated columns: tenure, age band, department code
Phase 6
Final output — load clean table to worksheet, document what changed

The Raw Data — What We Start With

The raw dataset is a CSV export of employee records from a legacy HR system. It has 12 columns and approximately 200 rows. The problems are typical of real-world HR exports: inconsistent casing, merged name fields, date format issues, blank rows at the bottom, a few duplicated employee IDs, and salary values entered as text strings with currency symbols.

Raw Data — Problems Annotated
EmpID | Full Name      | department | Start Date | Salary  | Status | Issue
E1042 | smith, john    | sales      | 01/03/2019 | £42,000 | Active | Name: wrong case, "Last, First" order · Dept: lowercase · Salary: text with £
E1043 | TAYLOR SARAH   | Marketing  | 15-Jan-20  | 38500   | Active | Name: ALL CAPS, no comma · Date: different format (15-Jan-20)
E1044 |  Brown, Mike   | IT         | 2021-06-01 | 51000   | active | Name: leading/trailing spaces · Status: lowercase "active"
E1042 | Smith, John    | Sales      | 01/03/2019 | 42000   | Active | DUPLICATE — EmpID E1042 appears twice
(blank) | (blank)      | (blank)    | (blank)    | (blank) | (blank) | BLANK ROW — must be removed
Field-level data quality issues and structural issues (duplicates, blank rows) are both represented · the full dataset has ~200 rows with these problems distributed throughout

Phase 1 — Inspect and Audit

Before changing anything, document every problem. A cleaning audit prevents you from missing issues and gives you a checklist to verify against when the work is done. Run these checks on the raw data before opening Power Query.

Audit checklist — run these checks on the raw data:

1. Row count
   =COUNTA(A:A)-1   → how many non-blank EmpID rows exist
   Compare to expected headcount from HR system

2. Duplicate EmpIDs
   =COUNTIF(A:A,A2)  → copy down — any value >1 is a duplicate
   Or: Data → Remove Duplicates → preview count (but don't apply yet)

3. Blank rows
   =COUNTBLANK(A2:A201)  → count blank cells in EmpID column
   Any blank = a blank row to remove

4. Salary column type
   =ISNUMBER(E2)   → FALSE means the value is stored as text
   =ISTEXT(E2)     → TRUE confirms text (here caused by £ symbols or commas)

5. Date column consistency
   =ISNUMBER(D2)   → TRUE means Excel has recognised it as a date
                       FALSE means it's a text string in an unusual format

6. Status unique values
   =UNIQUE(F2:F201)  → spill list of all distinct values
   Should only contain "Active" and "Leaver" — any others are errors

7. Department unique values
   =UNIQUE(C2:C201)  → should show "Sales" "Marketing" "IT" "Finance" "HR"
   Watch for: "sales" "SALES" "Slaes" (typo) as separate entries
Audit results — COUNTA: 202 non-blank rows (expected headcount: 200) · Duplicates: 3 EmpIDs appear twice · Blank rows: 3 · Salary: 18 rows stored as text · Dates: 24 rows not recognised · Status: "active", "Active", "ACTIVE" all present · Departments: "sales", "Sales", "Slaes" all present

Write these findings down before you touch anything. Your audit log is the record of what the data looked like and what you changed — it protects you if questions arise later about why certain rows were removed or modified. A one-page table with "Issue / Count / Fix Applied" is sufficient and takes five minutes to produce.
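The same audit checks can be sketched outside Excel. Here is a minimal Python version run against a handful of illustrative sample rows (the field layout and values are assumptions for demonstration, not the real dataset):

```python
from collections import Counter

# Illustrative sample of raw rows: (EmpID, Salary, Status)
rows = [
    ("E1042", "£42,000", "Active"),
    ("E1043", "38500", "Active"),
    ("E1044", "51000", "active"),
    ("E1042", "42000", "Active"),   # duplicate EmpID
    ("", "", ""),                   # blank row
]

# 1. Row count — non-blank EmpIDs (mirrors COUNTA minus the header)
non_blank = [r for r in rows if r[0] != ""]

# 2. Duplicate EmpIDs — mirrors COUNTIF(A:A,A2) > 1
counts = Counter(r[0] for r in non_blank)
duplicates = [emp for emp, n in counts.items() if n > 1]

# 3. Blank rows — mirrors COUNTBLANK on the EmpID column
blank_rows = len(rows) - len(non_blank)

# 4. Salary stored as text — mirrors ISNUMBER/ISTEXT
def is_numeric(s):
    return s.replace(".", "", 1).isdigit()

text_salaries = [r[1] for r in non_blank if not is_numeric(r[1])]

# 6. Distinct Status values — mirrors UNIQUE
statuses = sorted({r[2] for r in non_blank})

print(len(non_blank), duplicates, blank_rows, text_salaries, statuses)
```

On this sample the checks surface exactly the problem classes the audit looks for: one duplicated EmpID, one blank row, one text salary, and two casings of "Active".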

Phase 2 — Power Query: Structural Fixes

Load the CSV into Power Query:
  Data → Get Data → From File → From Text/CSV
  Preview → Transform Data (opens Power Query Editor)

Step 1: Remove blank rows
  Home → Remove Rows → Remove Blank Rows
  This removes any row where all columns are null/empty

Step 2: Remove duplicates on EmpID
  Right-click the EmpID column header → Remove Duplicates
  Power Query keeps the first occurrence and removes subsequent ones
  Always check: is the first occurrence the correct/most recent one?
  If not, sort by a date column first before removing duplicates

Step 3: Split the Full Name column
  The raw data has "Smith, John" or "TAYLOR SARAH" — inconsistent
  First: standardise to "Last, First" format using a formula helper (Phase 3)
  Then in Power Query: right-click Full Name → Split Column → By Delimiter
  Delimiter: comma → At each occurrence
  Rename resulting columns: Last Name, First Name
  Trim both columns: Transform → Format → Trim

Step 4: Fix the Salary column
  The column contains values like "£42,000" and "38500" mixed
  Select the Salary column → Transform → Replace Values
  Replace "£" with "" (empty string)
  Replace "," with "" (empty string)
  Change column type: right-click header → Change Type → Whole Number

Step 5: Fix date columns
  Select the Start Date column
  Right-click → Change Type → Using Locale
  Data type: Date · Locale: English (United Kingdom) for DD/MM/YYYY
  Power Query parses multiple date formats automatically when locale is set correctly

Step 6: Close and Load
  Home → Close & Load To... → Table → New worksheet
  Name the sheet "Clean_Stage1"
Power Query Applied Steps — What the M Code Records

#  Applied Step               What It Does
1  Source                     Connects to the CSV file
2  Promoted Headers           First row becomes column headers
3  Removed Blank Rows         3 empty rows removed
4  Removed Duplicates         3 duplicate EmpID rows removed
5  Split Column by Delimiter  Full Name split into Last Name + First Name
6  Replaced Values (x2)       £ and , removed from Salary
7  Changed Type               Salary to Whole Number · Start Date to Date
8  Trimmed Text               Leading/trailing spaces removed from all text columns

Each step is recorded in the Applied Steps panel · clicking any step previews the data at that point · steps can be deleted or reordered
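To see the whole structural pass end to end, the steps above can be sketched in Python; the sample records and field names are illustrative, and Power Query's actual behaviour (first occurrence kept on dedupe, locale-aware date parsing) is what the sketch imitates:

```python
import datetime

# Illustrative raw records after header promotion
raw = [
    {"EmpID": "E1042", "FullName": "smith, john", "StartDate": "01/03/2019", "Salary": "£42,000"},
    {"EmpID": "E1043", "FullName": "Taylor, Sarah", "StartDate": "15/01/2020", "Salary": "38500"},
    {"EmpID": "E1042", "FullName": "Smith, John", "StartDate": "01/03/2019", "Salary": "42000"},
    {},  # blank row
]

# Remove Blank Rows — drop rows where every field is missing/empty
rows = [r for r in raw if any(v for v in r.values())]

# Remove Duplicates on EmpID — keep the first occurrence, as Power Query does
seen, deduped = set(), []
for r in rows:
    if r["EmpID"] not in seen:
        seen.add(r["EmpID"])
        deduped.append(r)

# Split Column by Delimiter + Trim — "Last, First" into two trimmed fields
for r in deduped:
    last, _, first = r.pop("FullName").partition(",")
    r["LastName"], r["FirstName"] = last.strip(), first.strip()

# Replaced Values + Changed Type — strip £ and comma, then Whole Number
for r in deduped:
    r["Salary"] = int(r["Salary"].replace("£", "").replace(",", ""))

# Changed Type Using Locale — parse DD/MM/YYYY dates (UK locale equivalent)
for r in deduped:
    r["StartDate"] = datetime.datetime.strptime(r["StartDate"], "%d/%m/%Y").date()

print(deduped)
```

Note the ordering: structural removals happen before any field-level parsing, which is exactly why the project does Power Query first and formulas second.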

Phase 3 — Standardise With Excel Formulas

On the Clean_Stage1 sheet, add helper columns to standardise text fields.
When satisfied, paste-as-values over the original columns and delete helpers.

Standardise First Name and Last Name to Proper Case:
  =PROPER(B2)   → "SARAH" becomes "Sarah" · "smith" becomes "Smith"

Standardise Department to Proper Case + fix typos:
  =SWITCH(TRIM(LOWER(C2)),
    "sales",     "Sales",
    "marketing", "Marketing",
    "it",        "IT",
    "finance",   "Finance",
    "hr",        "HR",
    "slaes",     "Sales",     ← catch the typo
    "Unknown"                 ← default for anything else
  )

Standardise Status:
  =PROPER(TRIM(F2))
  "active" → "Active" · "ACTIVE" → "Active" · "leaver" → "Leaver"

Verify no unknowns remain:
  =COUNTIF(G:G,"Unknown")  → should be 0 after corrections
  =UNIQUE(G2:G201)         → should only show "Sales" "Marketing" "IT" "Finance" "HR"

The SWITCH with LOWER approach for departments is the most robust standardisation pattern. LOWER converts everything to lowercase first, so you only need to handle one version of each variant — no matter whether the original was "SALES", "Sales", or "sales", LOWER turns it into "sales" before SWITCH evaluates it. The typo catch ("slaes" → "Sales") comes from the audit you ran in Phase 1.
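The same normalise-then-map pattern translates directly into a dictionary lookup. A minimal Python sketch (the mapping table mirrors the SWITCH above; the extra test inputs are illustrative):

```python
# Mirror of SWITCH(TRIM(LOWER(...))): normalise first, then map one
# lowercase key per variant — known typos get an explicit entry, and
# anything unmapped falls through to "Unknown"
DEPT_MAP = {
    "sales": "Sales",
    "marketing": "Marketing",
    "it": "IT",
    "finance": "Finance",
    "hr": "HR",
    "slaes": "Sales",  # known typo found in the Phase 1 audit
}

def standardise_dept(raw: str) -> str:
    return DEPT_MAP.get(raw.strip().lower(), "Unknown")

print(standardise_dept("  SALES "))  # any casing/spacing collapses to one key
print(standardise_dept("Slaes"))     # typo corrected via its explicit entry
print(standardise_dept("Ops"))       # unmapped value → "Unknown" for review
```

As in the formula version, the default branch is the safety net: nothing silently passes through, so the COUNTIF check for "Unknown" catches any variant the map missed.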

Phase 4 — Validate With Formulas

Add a "Validation" column that checks multiple rules and returns a status:

=IF(
  OR(
    B2="",                         ← blank First Name
    C2="",                         ← blank Last Name
    NOT(ISNUMBER(E2)),             ← Salary not a number
    E2 < 15000,                    ← Salary below minimum threshold
    E2 > 250000,                   ← Salary above maximum threshold
    NOT(ISNUMBER(D2)),             ← Start Date not recognised
    D2 > TODAY(),                  ← Start Date in the future
    AND(F2<>"Active",F2<>"Leaver") ← Status not a valid value
  ),
  "REVIEW",
  "OK"
)

Count how many rows need review:
  =COUNTIF(H:H,"REVIEW")  → number of rows flagged

Filter to only the flagged rows for manual inspection:
  =FILTER(A2:H201, H2:H201="REVIEW")

After reviewing each flagged row manually:
  Either correct the value and change Validation to "OK"
  Or document it as a known data issue and keep the REVIEW flag
REVIEW count: 11 rows · Issues: 4 future start dates, 3 salaries below £15K (likely part-time — noted), 2 blank First Names, 2 unrecognised date values still remaining
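Outside Excel, the same rule set reads naturally as a small validation function. A sketch assuming a row dictionary with the cleaned field names used in this project (thresholds copied from the formula above):

```python
import datetime

def validate(row: dict) -> str:
    """Mirror of the Validation formula: OK unless any rule fails."""
    salary, start = row.get("Salary"), row.get("StartDate")
    problems = [
        not row.get("FirstName"),                        # blank First Name
        not row.get("LastName"),                         # blank Last Name
        not isinstance(salary, (int, float)),            # Salary not a number
        isinstance(salary, (int, float))
            and not 15000 <= salary <= 250000,           # Salary out of range
        not isinstance(start, datetime.date),            # Start Date not recognised
        isinstance(start, datetime.date)
            and start > datetime.date.today(),           # Start Date in the future
        row.get("Status") not in ("Active", "Leaver"),   # invalid Status value
    ]
    return "REVIEW" if any(problems) else "OK"

ok_row = {"FirstName": "John", "LastName": "Smith", "Salary": 42000,
          "StartDate": datetime.date(2019, 3, 1), "Status": "Active"}
bad_row = {**ok_row, "Salary": "£42,000"}  # salary still stored as text

print(validate(ok_row), validate(bad_row))
```

Like the OR in the formula, any single failed rule flips the whole row to REVIEW; the function does not say which rule failed, which is why flagged rows still need manual inspection.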

Phase 5 — Enrich With Calculated Columns

Add these calculated columns to make the clean table analysis-ready:

Tenure (years in company):
  =DATEDIF(D2, TODAY(), "Y") & " yrs"
  Or as a number for analysis: =DATEDIF(D2, TODAY(), "Y")

Tenure band (for grouping):
  =IFS(
    DATEDIF(D2,TODAY(),"Y") < 1,  "0-1 years",
    DATEDIF(D2,TODAY(),"Y") < 3,  "1-3 years",
    DATEDIF(D2,TODAY(),"Y") < 5,  "3-5 years",
    DATEDIF(D2,TODAY(),"Y") < 10, "5-10 years",
    TRUE,                          "10+ years"
  )

Salary band:
  =IFS(
    E2 < 25000,  "Band A: Under 25K",
    E2 < 40000,  "Band B: 25K-40K",
    E2 < 60000,  "Band C: 40K-60K",
    E2 < 90000,  "Band D: 60K-90K",
    TRUE,        "Band E: 90K+"
  )

Start Year (for hire trend analysis):
  =YEAR(D2)

Department Code (short code for dashboard use):
  =SWITCH(C2,
    "Sales",     "SLS",
    "Marketing", "MKT",
    "IT",        "IT",
    "Finance",   "FIN",
    "HR",        "HR",
    "UNK"
  )
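The IFS banding pattern, first match wins on ascending cutoffs, can be sketched the same way in Python (band labels copied from the formulas above; the DATEDIF equivalent is an approximation of completed whole years):

```python
import datetime

SALARY_BANDS = [
    (25000, "Band A: Under 25K"),
    (40000, "Band B: 25K-40K"),
    (60000, "Band C: 40K-60K"),
    (90000, "Band D: 60K-90K"),
]

def salary_band(salary: int) -> str:
    # Mirror of IFS: the first ascending cutoff the value falls under wins
    for cutoff, label in SALARY_BANDS:
        if salary < cutoff:
            return label
    return "Band E: 90K+"  # equivalent of the IFS TRUE catch-all

def tenure_years(start: datetime.date, today: datetime.date) -> int:
    # Rough equivalent of DATEDIF(start, today, "Y"): whole years completed
    years = today.year - start.year
    if (today.month, today.day) < (start.month, start.day):
        years -= 1  # anniversary not yet reached this year
    return years

print(salary_band(42000))
print(tenure_years(datetime.date(2019, 3, 1), datetime.date(2024, 2, 28)))
```

The ordering of cutoffs is load-bearing in both versions: IFS and the loop alike stop at the first true condition, so the thresholds must be ascending or the early bands swallow everything.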

Phase 6 — Final Output

Steps to produce the final clean output:

1. On the working sheet, select all helper formula columns
   Copy → Paste Special → Values only
   This converts all formulas to static values so the table is portable

2. Delete the original messy columns (Full Name, raw Salary, raw Status)
   The clean versions in the helper columns are now the permanent values

3. Reorder columns to the agreed final structure:
   EmpID | First Name | Last Name | Department | Dept Code |
   Start Date | Start Year | Salary | Salary Band | Tenure | Tenure Band | Status | Validation

4. Convert to an Excel Table:
   Select all data → Ctrl+T → tick "My table has headers" → OK
   Name the table: Employees_Clean

5. Add a documentation sheet:
   New sheet named "Cleaning Log"
   Record: source file name, date cleaned, row count before/after,
   issues found (from audit), issues fixed, issues remaining (REVIEW rows)

6. Final row count check:
   Raw file rows:   206  (including 3 blanks, 3 duplicate extra rows, header)
   After cleaning:  199  (200 employees minus 1 true duplicate confirmed with HR)
   Validation OK:   188
   Validation REVIEW: 11  (documented in Cleaning Log)
Before vs After — The Transformation
✗ Raw Data
206 rows (including blanks)
3 duplicated EmpIDs (6 rows affected)
Mixed case names
3 date formats
Salary as text (£42,000)
Status: "active" / "ACTIVE" / "Active"
No derived columns
No validation
✓ Clean Output
199 unique employee rows
Duplicates resolved
Proper Case throughout
Single consistent date format
Salary as integer (42000)
Status: "Active" / "Leaver" only
5 derived columns added
Validation flag on each row
💡 Teacher's Note
The single most important habit in data cleaning is never modifying the source file. Always work on a copy, always keep Power Query pointed at the original CSV, and always maintain the Applied Steps trail so every transformation is documented and reversible. If a stakeholder questions why a row was removed or a value was changed, you need to be able to show the step that did it and explain why. The Cleaning Log sheet completes that picture on the Excel side. Clean data is not just accurate data — it is documented data. Undocumented cleaning is just hidden guessing.

🠐 Practice

Q1. Your audit finds that the Department column contains "Sales", "sales", "SALES", and "Slaes". Write the SWITCH formula pattern that standardises all four to "Sales".




Q2. In Power Query, you need to remove duplicates based on the EmpID column but want to keep the most recently updated record, not the first occurrence. What step do you do before removing duplicates?




Q3. Why should you paste-as-values over your formula helper columns before delivering the final clean table?



🟣 Quiz

Q1. What is the correct order of operations when cleaning a dataset with both structural issues (blank rows, duplicates) and field-level issues (wrong casing, text-as-numbers)?







Q2. You find 6 rows with EmpID E1042 in the raw data. After investigating, you confirm only one is correct. What is the risk of using Power Query's "Remove Duplicates" without checking first?







Q3. The Validation column flags 11 rows as "REVIEW". What should you do with these rows — delete them or keep them?






Next up — Sales Dashboard Project, where we take a clean multi-table sales dataset through Power Query, Power Pivot, DAX measures, and a full dashboard build to produce a polished interactive Sales Performance Dashboard.