SPSS Lesson 6 – Data Cleaning Basics | Dataplexa

Data Cleaning Basics

Before performing any statistical analysis, data must be clean, consistent, and reliable. Data cleaning is the process of identifying and correcting problems that can distort results and lead to incorrect conclusions.

In real-world datasets, raw data is rarely perfect. Missing values, incorrect entries, and inconsistent formats are extremely common. SPSS provides powerful tools to identify and fix these issues.

Why Data Cleaning Is Critical

Statistical results are only as good as the data used. Even a small number of errors can significantly affect averages, relationships, and hypothesis tests.

Data cleaning helps you:

Improve accuracy of analysis
Reduce bias in results
Avoid misleading conclusions

Professionals often spend more time cleaning data than performing the actual analysis.

Common Data Quality Problems

Most data issues fall into a few well-known categories. Recognizing them early saves time and effort.

Typical problems include:

Missing values
Out-of-range values
Duplicate records
Inconsistent data formats

SPSS allows you to detect these problems using both visual inspection and statistical summaries.
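As one illustration of a statistical check, duplicate records can be flagged with syntax. The sketch below assumes the dataset has an identifier variable named Employee_ID that should be unique (as in the example dataset used in this lesson); the flag variable name PrimaryFirst is only an illustrative choice.

```spss
* Sketch: flag cases that repeat an Employee_ID (assumes IDs should be unique).
SORT CASES BY Employee_ID.
* PrimaryFirst is 1 for the first case of each ID and 0 for any repeat.
MATCH FILES /FILE=* /BY Employee_ID /FIRST=PrimaryFirst.
* A count of 0s greater than zero means duplicate IDs exist.
FREQUENCIES VARIABLES=PrimaryFirst.
```

Cases where PrimaryFirst equals 0 are candidates for review, not automatic deletion, since a repeated ID may also signal a typing error in the ID itself.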

Example Dataset with Errors

Consider the following dataset representing employee information:

Employee_ID	Age	Monthly_Salary
201	25	35000
202	(blank)	42000
203	150	30000
204	28	(blank)

This dataset contains two missing values (Age for employee 202 and Monthly_Salary for employee 204) and one unrealistic age value (150 for employee 203). If left uncorrected, these issues will distort analysis results.
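To follow along, the four example cases can be typed into Data View or read in with syntax. The following is a minimal sketch, assuming all three variables are numeric; a lone period stands in for each blank cell so that SPSS reads it as system-missing.

```spss
* Sketch: read the example cases inline; a lone period is read as system-missing.
DATA LIST LIST / Employee_ID Age Monthly_Salary.
BEGIN DATA
201 25 35000
202 . 42000
203 150 30000
204 28 .
END DATA.
* Display whole numbers instead of the default two decimal places.
FORMATS Employee_ID Age Monthly_Salary (F8.0).
* Print the cases to confirm they were read as intended.
LIST.
```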

Identifying Missing Values in SPSS

SPSS helps detect missing data using descriptive statistics and frequency tables.

For numeric variables, a valid count (N) lower than the total number of cases, or blank cells in Data View, indicates missing values.

Missing values can be:

System-missing (blank numeric cells, shown as a period in Data View)
User-defined (special codes like -99 or 999)

Declaring user-defined codes as missing ensures SPSS excludes them from analysis instead of treating them as real values.
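The two ideas above might be sketched in syntax as follows. The variable names come from the example dataset; the -99 code is hypothetical here, since the example data contains no such code.

```spss
* Report valid and missing counts per variable without full frequency tables.
FREQUENCIES VARIABLES=Age Monthly_Salary
  /FORMAT=NOTABLE.

* Hypothetical: if -99 had been entered for "not reported", declare it user-missing.
MISSING VALUES Age Monthly_Salary (-99).
```

After MISSING VALUES runs, any -99 entries remain in the data but are left out of statistical procedures, which keeps the original code visible for auditing.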

Handling Out-of-Range Values

Out-of-range values occur when data falls outside realistic or acceptable limits.

For example, an age value of 150 is not realistic. Such values may result from data entry errors.

SPSS allows you to:

Detect extreme values using Descriptives
Filter cases to inspect suspicious records
Correct or remove invalid entries
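The first two steps in this list could look like the sketch below. The cutoffs of 0 and 120 are an assumption chosen to match the recode rule used later in this lesson; appropriate limits depend on the dataset.

```spss
* Minimum and maximum reveal values outside the plausible range.
DESCRIPTIVES VARIABLES=Age Monthly_Salary
  /STATISTICS=MEAN MIN MAX.

* List only the suspicious cases. TEMPORARY limits the selection to the next procedure.
TEMPORARY.
SELECT IF (Age <= 0 OR Age >= 120).
LIST VARIABLES=Employee_ID Age Monthly_Salary.
```

Because of TEMPORARY, no cases are dropped from the working file; the selection applies only to the LIST output.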

Using SPSS Syntax for Cleaning Data

SPSS syntax provides precise control over data cleaning steps. This is especially useful for reproducible analysis.


* Set ages of 0 or below and 120 or above to system-missing.
RECODE Age (LO THRU 0=SYSMIS) (120 THRU HI=SYSMIS).
EXECUTE.

This syntax replaces age values of 0 or below and 120 or above with system-missing values, overwriting the original entries in Age.

Once recoded, these values are excluded from statistical calculations.
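If the original entries should stay available for later review, a variant of the same rule can write the result to a new variable instead. This is a sketch rather than part of the lesson's own method, and the name Age_clean is a hypothetical choice.

```spss
* Variant: keep Age untouched and store the cleaned values in a new variable.
* Age_clean is a hypothetical name; ELSE=COPY carries all other values over unchanged.
RECODE Age (LO THRU 0=SYSMIS) (120 THRU HI=SYSMIS) (ELSE=COPY) INTO Age_clean.
EXECUTE.
```

Analyses would then use Age_clean, while Age preserves what was originally entered.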

Checking Data After Cleaning

After cleaning, it is important to verify that issues have been resolved.

You should:

Re-run descriptive statistics
Check minimum and maximum values
Confirm missing values are handled correctly

This step ensures your dataset is ready for analysis.
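A verification pass along these lines might be sketched as follows, using the variables from the example dataset.

```spss
* Re-check the range and spread of each variable after cleaning.
DESCRIPTIVES VARIABLES=Age Monthly_Salary
  /STATISTICS=MEAN STDDEV MIN MAX.

* Confirm how many values per variable are now valid versus missing.
FREQUENCIES VARIABLES=Age Monthly_Salary
  /FORMAT=NOTABLE.
```

If the RECODE from this lesson has been applied to the four example cases, Age should show 2 valid values with a minimum of 25 and a maximum of 28, and Monthly_Salary should show 3 valid values.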

Quiz 1

Why is data cleaning important?

Because data errors can distort statistical results.

Quiz 2

What is a system-missing value in SPSS?

A blank or undefined value recognized by SPSS as missing.

Quiz 3

What problem does an age value of 150 represent?

An out-of-range value likely caused by a data entry error.

Quiz 4

Why is SPSS syntax useful for data cleaning?

It allows repeatable and consistent cleaning operations.

Quiz 5

What should you do after cleaning data?

Verify the data using descriptive statistics.

Mini Practice

Create a dataset with the following variables:

Employee_ID
Age
Monthly_Salary

Intentionally introduce:

One missing value
One unrealistic age value

Use SPSS tools or syntax to identify and correct these issues.

Use Descriptives to identify errors, then recode unrealistic values as missing.

What’s Next

In the next lesson, you will learn how to sort and filter data, which helps isolate specific cases and explore datasets more effectively.
