SPSS Lesson 6 – Data Cleaning Basics | Dataplexa

Data Cleaning Basics

Before performing any statistical analysis, data must be clean, consistent, and reliable. Data cleaning is the process of identifying and correcting problems that can distort results and lead to incorrect conclusions.

In real-world datasets, raw data is rarely perfect. Missing values, incorrect entries, and inconsistent formats are extremely common. SPSS provides powerful tools to identify and fix these issues.


Why Data Cleaning Is Critical

Statistical results are only as good as the data used. Even a small number of errors can significantly affect averages, relationships, and hypothesis tests.

Data cleaning helps you:

  • Improve accuracy of analysis
  • Reduce bias in results
  • Avoid misleading conclusions

Professionals often spend more time cleaning data than performing the actual analysis.


Common Data Quality Problems

Most data issues fall into a few well-known categories. Recognizing them early saves time and effort.

Typical problems include:

  • Missing values
  • Out-of-range values
  • Duplicate records
  • Inconsistent data formats

SPSS allows you to detect these problems using both visual inspection and statistical summaries.


Example Dataset with Errors

Consider the following dataset representing employee information:

Employee_ID   Age         Monthly_Salary
201           25          35000
202           (missing)   42000
203           150         30000
204           28          (missing)

This dataset contains missing values and an unrealistic age value. If left uncorrected, these issues will distort analysis results.
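To follow along, the example data can be typed straight into a syntax window. The sketch below is one possible way to do it. It assumes that employee 202 has no age and employee 204 has no salary, and the fixed column positions (1-3, 5-7, 9-13) are chosen only for this illustration. With fixed-format input, SPSS reads blank numeric fields as system-missing by default.

```spss
* Read the example data from fixed columns (positions are an assumption).
DATA LIST FIXED / Employee_ID 1-3 Age 5-7 Monthly_Salary 9-13.
BEGIN DATA
201  25 35000
202     42000
203 150 30000
204  28
END DATA.

* Print the cases to confirm they were read as intended.
LIST.
```

In the LIST output, the two missing entries should appear as periods, which is how SPSS displays system-missing numeric values.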


Identifying Missing Values in SPSS

SPSS helps detect missing data using descriptive statistics and frequency tables.

For numeric variables, a valid count (N) lower than the total number of cases indicates missing values. In Data View, system-missing numeric cells are displayed as a period.

Missing values can be:

  • System-missing (empty numeric cells, shown as a period in Data View)
  • User-missing (special codes like -99 or 999 that you define as missing)

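User-missing codes must be declared, either in the Missing column of Variable View or with the MISSING VALUES command. Otherwise SPSS treats a code such as 999 as a valid number and includes it in every calculation.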
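As a minimal sketch of both steps, the syntax below declares two codes as missing and then requests valid and missing counts. The codes -99 and 999 are only the illustrative ones mentioned above, and applying them to Age is an assumption; use whatever codes your own dataset actually contains.

```spss
* Declare -99 and 999 as user-missing codes for Age (assumed codes).
MISSING VALUES Age (-99, 999).

* Show valid and missing counts without printing full frequency tables.
FREQUENCIES VARIABLES=Age Monthly_Salary
  /FORMAT=NOTABLE.
```

The Statistics table in the output reports the valid and missing N for each variable, which is usually the quickest way to see how much data is absent.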


Handling Out-of-Range Values

Out-of-range values occur when data falls outside realistic or acceptable limits.

For example, an age value of 150 is not realistic. Such values may result from data entry errors.

SPSS allows you to:

  • Detect extreme values using Descriptives
  • Filter cases to inspect suspicious records
  • Correct or remove invalid entries
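The first two steps might look like the following sketch. The age limits (0 and 120) are assumptions for this example and should be replaced with limits that make sense for your data.

```spss
* Inspect the minimum and maximum to spot impossible values.
DESCRIPTIVES VARIABLES=Age Monthly_Salary
  /STATISTICS=MEAN STDDEV MIN MAX.

* List only the suspicious cases without changing the dataset.
TEMPORARY.
SELECT IF (Age <= 0 OR Age >= 120).
LIST VARIABLES=Employee_ID Age Monthly_Salary.
```

TEMPORARY limits the selection to the next procedure, so all cases remain available after LIST has run.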

Using SPSS Syntax for Cleaning Data

SPSS syntax provides precise control over data cleaning steps. This is especially useful for reproducible analysis.


* Set ages of 0 or below and ages of 120 or above to system-missing.
RECODE Age (LO THRU 0=SYSMIS) (120 THRU HI=SYSMIS).
EXECUTE.

This syntax replaces ages of 0 or below and ages of 120 or above with system-missing values, overwriting the original entries in the Age variable.

SPSS then excludes these values from statistical calculations that involve Age.
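Because RECODE changes the variable in place, a more cautious variant writes the result to a new variable and leaves the original untouched. The sketch below shows one way to do that. The name Age_clean and its label are hypothetical, chosen only for this example.

```spss
* Keep the original Age and store the cleaned values in a new variable.
RECODE Age (LO THRU 0=SYSMIS) (120 THRU HI=SYSMIS) (ELSE=COPY) INTO Age_clean.
VARIABLE LABELS Age_clean 'Age with out-of-range values set to missing'.
EXECUTE.
```

Keeping both variables makes it easy to compare them later and to trace which cases were changed.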


Checking Data After Cleaning

After cleaning, it is important to verify that issues have been resolved.

You should:

  • Re-run descriptive statistics
  • Check minimum and maximum values
  • Confirm missing values are handled correctly

This step ensures your dataset is ready for analysis.
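A short verification pass could look like the sketch below. The variable n_missing is a hypothetical helper introduced for this example. It counts how many of the two analysis variables are missing for each case.

```spss
* Confirm that the remaining ages fall inside the expected range.
DESCRIPTIVES VARIABLES=Age
  /STATISTICS=MIN MAX.

* Count missing values per case across the two analysis variables.
COMPUTE n_missing = NMISS(Age, Monthly_Salary).
FREQUENCIES VARIABLES=n_missing.
```

If the minimum and maximum are plausible and the frequency table of n_missing matches what you expect, the cleaning step did what it was meant to do.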


Quiz 1

Why is data cleaning important?

Because data errors can distort statistical results.


Quiz 2

What is a system-missing value in SPSS?

A blank or undefined value recognized by SPSS as missing.


Quiz 3

What problem does an age value of 150 represent?

An out-of-range value likely caused by a data entry error.


Quiz 4

Why is SPSS syntax useful for data cleaning?

It allows repeatable and consistent cleaning operations.


Quiz 5

What should you do after cleaning data?

Verify the data using descriptive statistics.


Mini Practice

Create a dataset with the following variables:

  • Employee_ID
  • Age
  • Monthly_Salary

Intentionally introduce:

  • One missing value
  • One unrealistic age value

Use SPSS tools or syntax to identify and correct these issues.

Use Descriptives to identify errors, then recode unrealistic values as missing.


What’s Next

In the next lesson, you will learn how to sort and filter data, which helps isolate specific cases and explore datasets more effectively.