Data cleaning is the most time-consuming part of any data analysis project. I’d estimate that 70-80% of my time as a data analyst goes to cleaning data, not analyzing it. Messy data = wrong conclusions. It’s that simple.
In this comprehensive guide, I’ll show you:
- How to identify and handle missing values (6 different methods)
- How to remove duplicate rows
- How to detect and treat outliers
- How to deal with inconsistent formats
- Common cleaning workflows
- Best practices I use in production code
By the end, you’ll have a complete toolkit for tackling messy real-world data.
Prerequisites
- R 3.6 or higher
- Familiarity with data frames and basic R syntax
- tidyverse packages (dplyr, tidyr) - optional but recommended
- Base R functions work too (we’ll cover both)
The Data Cleaning Pipeline
Here’s what professional data cleaning looks like:
Raw Data → Check Structure → Handle Missing Values → Remove Duplicates →
Fix Outliers → Standardize Formats → Validate → Clean Data Ready for Analysis
Let’s work through each step with real examples.
Understanding Your Data First
Always start by understanding what you’re working with:
# Create sample data with common problems
df <- data.frame(
  customer_id = c(101, 102, 102, 103, 104, NA, 105, 105),
  name = c("Alice", "Bob", "Bob", "Charlie", "David", "Eve", "Frank", "Frank"),
  purchase_amount = c(150, NA, 200, 75, 999, 85, 120, 120),
  purchase_date = c("2024-01-10", "2024-01-15", "2024-01-15", "2024-02-05",
                    "2024-02-12", "2024-03-01", "2024-03-05", "2024-03-05"),
  region = c("North", "south", "South", "NORTH", "East", "west", "North", "North")
)
# First, see what we're dealing with
str(df)
head(df)
# Check for missing values
sum(is.na(df)) # [1] 2
Notice the problems:
- Duplicate rows (customer 105 appears twice with identical values; customer 102 also repeats, though with a differing purchase_amount)
- Missing values (NA in customer_id for Eve, NA in purchase_amount for Bob)
- Extreme value (999 seems high for purchase_amount)
- Inconsistent format ("south" vs "South" vs "NORTH")
This is typical real-world data. Let’s fix it.
Method 1: Remove Rows With Any Missing Values - na.omit()
The simplest approach - delete entire rows that have ANY missing value:
# Using base R
df_clean <- na.omit(df)
# Or using the tidyverse (drop_na() comes from tidyr, not dplyr)
library(dplyr)
library(tidyr)
df_clean <- df %>% drop_na()
# Check what was removed
print(nrow(df) - nrow(df_clean)) # [1] 2 - both rows containing an NA are removed
When to Use na.omit()
Pros:
- Simple and fast
- Predictable - removes complete rows only
Cons:
- Can lose a lot of data if many NAs
- Not selective (removes entire row for just 1 missing value)
Use when: You have few missing values and your data is large enough that losing rows is acceptable.
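Before committing to na.omit(), it helps to check how much data you would actually lose. A quick base R sketch, rebuilding the two relevant columns of the sample data:

```r
# Minimal stand-in for the sample df above
df <- data.frame(
  customer_id     = c(101, 102, 102, 103, 104, NA, 105, 105),
  purchase_amount = c(150, NA, 200, 75, 999, 85, 120, 120)
)

# Fraction of rows with no missing values
complete_fraction <- mean(complete.cases(df))
complete_fraction  # 0.75 - two of the eight rows would be dropped

# Missing values per row - useful for spotting badly broken records
rowSums(is.na(df))
```

If that fraction is close to 1, dropping incomplete rows is usually safe; if it is low, prefer one of the imputation methods below.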
Method 2: Remove Rows With Missing Values in Specific Columns
More selective - only remove rows where certain columns are missing:
# Remove rows where customer_id is missing
df_clean <- df %>% drop_na(customer_id)
# Remove rows where customer_id OR purchase_amount is missing
df_clean <- df %>% drop_na(customer_id, purchase_amount)
# Base R approach
df_clean <- df[!is.na(df$customer_id), ]
Checking Which Rows Have Missing Values
# Find rows with any NA
df[!complete.cases(df), ]
# Find rows with NA in specific column
df[is.na(df$customer_id), ]
# Count missing values by column
sapply(df, function(x) sum(is.na(x)))
#     customer_id            name purchase_amount   purchase_date          region
#               1               0               1               0               0
Method 3: Replace Missing Values - Mean Imputation
Instead of deleting, replace missing values with the mean (or median, mode, etc.):
# Replace numeric NAs with mean
library(tidyr)
df_clean <- df %>%
  mutate(
    purchase_amount = replace_na(purchase_amount, mean(purchase_amount, na.rm = TRUE))
  )
# Multiple numeric columns at once - careful: this also imputes
# ID columns like customer_id, which rarely makes sense
df_clean <- df %>%
  mutate(
    across(where(is.numeric), ~replace_na(., mean(., na.rm = TRUE)))
  )
# Using base R
df$purchase_amount[is.na(df$purchase_amount)] <- mean(df$purchase_amount, na.rm = TRUE)
When to Use Mean Imputation
Pros:
- Keeps all rows (no data loss)
- Simple and interpretable
Cons:
- Biases data toward the mean
- Reduces variability
- Not ideal for categorical data
Use when: You have missing numeric data and only a small percentage of NAs.
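Swapping the mean for the median is a one-word change and is more robust when the column is skewed; in the sample data the 999 purchase pulls the mean up to roughly 250, while the median stays at 120:

```r
library(dplyr)
library(tidyr)

df <- data.frame(purchase_amount = c(150, NA, 200, 75, 999, 85, 120, 120))

# Median imputation: far less sensitive to the 999 outlier than the mean
df_clean <- df %>%
  mutate(purchase_amount = replace_na(purchase_amount,
                                      median(purchase_amount, na.rm = TRUE)))

df_clean$purchase_amount[2]  # 120, the median of the observed values
```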
Method 4: Replace With Group-Specific Value
More sophisticated - replace missing value with mean of its group:
# Replace purchase_amount NA with region's mean
df_clean <- df %>%
  group_by(region) %>%
  mutate(
    purchase_amount = replace_na(purchase_amount, mean(purchase_amount, na.rm = TRUE))
  ) %>%
  ungroup()
# Note: Bob's missing purchase_amount is filled with his region's mean.
# Standardize the region labels first ("south" vs "South"), otherwise
# "south" forms a one-row group whose mean is NaN.
This is better because it uses context-specific averages.
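One pitfall with group-wise imputation: if a group contains no observed values at all, its mean is NaN and the missing value effectively survives. A sketch of a fallback to the overall mean (the data here is illustrative):

```r
library(dplyr)

df <- data.frame(
  purchase_amount = c(150, NA, 200, NA),
  region          = c("North", "North", "South", "West")  # "West" has no observed value
)

# Overall mean as the fallback for groups with no data
overall_mean <- mean(df$purchase_amount, na.rm = TRUE)

df_clean <- df %>%
  group_by(region) %>%
  # Fill NAs with the group mean (NaN if the whole group is missing)
  mutate(purchase_amount = coalesce(purchase_amount,
                                    mean(purchase_amount, na.rm = TRUE))) %>%
  ungroup() %>%
  # Catch the all-missing groups and use the overall mean instead
  mutate(purchase_amount = ifelse(is.nan(purchase_amount),
                                  overall_mean, purchase_amount))
```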
Method 5: Replace Character NAs With a Default
For character columns, use a placeholder:
# Replace NA in name with "Unknown"
df_clean <- df %>%
  mutate(
    name = replace_na(name, "Unknown")
  )
# Or multiple columns
df_clean <- df %>%
  mutate(
    across(where(is.character), ~replace_na(., "Unknown"))
  )
Method 6: Delete Entire Columns With Too Many NAs
Sometimes a whole column is too incomplete to use:
# Remove columns with >50% missing values
df_clean <- df %>%
select(where(~sum(is.na(.)) / length(.) < 0.5))
# Or specific columns
df_clean <- df %>% select(-problematic_column)
Removing Duplicate Rows
Duplicates are common in real data. Here’s how to handle them:
Method 1: Remove Exact Duplicates
# Remove rows that are exact copies of an earlier row (keeps the first occurrence)
df_clean <- df %>% distinct()
# Using base R
df_clean <- df[!duplicated(df), ]
# Example - removes the second Frank row; Bob's rows differ in purchase_amount, so both stay
Method 2: Remove Duplicates Based on Specific Columns
# Keep first occurrence of each customer_id
df_clean <- df %>%
distinct(customer_id, .keep_all = TRUE)
# Keep first occurrence of each (customer_id, region) combination
df_clean <- df %>%
distinct(customer_id, region, .keep_all = TRUE)
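distinct() keeps the first row it encounters, so the sort order decides which duplicate survives. Sorting first lets you keep, say, the most recent record per customer; a sketch with illustrative data:

```r
library(dplyr)

df <- data.frame(
  customer_id   = c(102, 102, 105),
  purchase_date = as.Date(c("2024-01-15", "2024-02-01", "2024-03-05"))
)

# Sort so the newest purchase comes first, then keep one row per customer
df_latest <- df %>%
  arrange(customer_id, desc(purchase_date)) %>%
  distinct(customer_id, .keep_all = TRUE)

df_latest$purchase_date[df_latest$customer_id == 102]  # "2024-02-01"
```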
Identifying Duplicates Before Removing
# Find rows that duplicate an earlier row (first occurrences are not flagged)
duplicates <- df[duplicated(df), ]
print(duplicates)
# Find rows with duplicate customer_id
duplicates_by_id <- df[duplicated(df$customer_id), ]
# Count duplicates per customer
df %>%
group_by(customer_id) %>%
filter(n() > 1) %>%
arrange(customer_id)
Standardizing Formats
Real data often has inconsistent formats. Fix them before analysis:
Standardize Text (Case)
# Convert all regions to lowercase
df_clean <- df %>%
mutate(region = tolower(region))
# Convert to title case
df_clean <- df %>%
mutate(region = tools::toTitleCase(region))
# Or use stringr for more control
library(stringr)
df_clean <- df %>%
mutate(region = str_to_title(region))
Standardize Whitespace
# Remove leading/trailing whitespace
df_clean <- df %>%
mutate(name = trimws(name))
# Works on all columns
df_clean <- df %>%
mutate(across(where(is.character), trimws))
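trimws() only touches the ends of a string; runs of internal whitespace ("John  Smith") need a regex. A small base R helper (the name squish is my own):

```r
# Collapse any run of whitespace to a single space, then trim the ends
squish <- function(x) trimws(gsub("\\s+", " ", x))

squish("  John   Smith ")  # "John Smith"
```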
Standardize Dates
# Convert to standard Date format
df_clean <- df %>%
mutate(purchase_date = as.Date(purchase_date, format = "%Y-%m-%d"))
# Check if any dates are still character/invalid
class(df_clean$purchase_date)
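If you do not know in advance which format a column uses, as.Date() (R >= 3.6) accepts a tryFormats argument and uses the first format that parses. Note it picks one format for the whole column based on the first non-NA value, so genuinely mixed formats need element-wise handling; the formats here are illustrative:

```r
dates <- c("15/01/2024", "05/02/2024")

# Tries ISO format first, falls back to day/month/year
parsed <- as.Date(dates, tryFormats = c("%Y-%m-%d", "%d/%m/%Y"))
parsed  # "2024-01-15" "2024-02-05"
```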
Detecting and Handling Outliers
Outliers can distort analysis. Here’s how to find them:
Method 1: Statistical Detection - IQR Method
# Calculate IQR (Interquartile Range)
Q1 <- quantile(df$purchase_amount, 0.25, na.rm = TRUE)
Q3 <- quantile(df$purchase_amount, 0.75, na.rm = TRUE)
IQR <- Q3 - Q1
# Outliers are > 1.5 * IQR beyond quartiles
lower_bound <- Q1 - 1.5 * IQR
upper_bound <- Q3 + 1.5 * IQR
# Find outliers
outliers <- df %>%
filter(purchase_amount < lower_bound | purchase_amount > upper_bound)
print(outliers) # Shows the $999 purchase as outlier
Method 2: Z-Score Detection
# Calculate Z-scores (standard deviations from mean)
df_clean <- df %>%
  mutate(
    purchase_amount_z = abs(as.numeric(scale(purchase_amount))), # as.numeric: scale() returns a matrix
    is_outlier = purchase_amount_z > 3 # more than 3 standard deviations
  )
# Remove outliers (rows where the Z-score is NA are dropped by filter() too)
df_clean <- df_clean %>%
  filter(!is_outlier) %>%
  select(-purchase_amount_z, -is_outlier)
Handling Outliers (Choose One)
# Option 1: Remove outliers
df_clean <- df %>%
filter(purchase_amount >= lower_bound & purchase_amount <= upper_bound)
# Option 2: Cap outliers (replace with boundary value)
df_clean <- df %>%
  mutate(
    purchase_amount = pmin(purchase_amount, upper_bound),
    purchase_amount = pmax(purchase_amount, lower_bound)
  )
# Option 3: Replace outliers with the median (uses the IQR bounds computed above)
df_clean <- df %>%
  mutate(
    purchase_amount = ifelse(
      purchase_amount < lower_bound | purchase_amount > upper_bound,
      median(purchase_amount, na.rm = TRUE),
      purchase_amount
    )
  )
Complete Data Cleaning Workflow
Here’s a realistic end-to-end workflow:
library(dplyr)
library(stringr) # for str_to_title() in step 4
library(tidyr)
# 1. Load and inspect
df <- read.csv("raw_data.csv")
str(df)
# 2. Handle missing values
df_clean <- df %>%
  # Remove rows with missing customer_id (critical)
  drop_na(customer_id) %>%
  # Fill purchase_amount with group mean
  group_by(region) %>%
  mutate(purchase_amount = replace_na(purchase_amount,
                                      mean(purchase_amount, na.rm = TRUE))) %>%
  ungroup()
# 3. Remove duplicates
df_clean <- df_clean %>%
  distinct(customer_id, region, purchase_date, .keep_all = TRUE)
# 4. Standardize formats
df_clean <- df_clean %>%
  mutate(
    name = trimws(tolower(name)),
    region = str_to_title(region)
  ) %>%
  mutate(purchase_date = as.Date(purchase_date))
# 5. Remove outliers
Q1 <- quantile(df_clean$purchase_amount, 0.25, na.rm = TRUE)
Q3 <- quantile(df_clean$purchase_amount, 0.75, na.rm = TRUE)
IQR <- Q3 - Q1
df_clean <- df_clean %>%
filter(purchase_amount >= Q1 - 1.5*IQR & purchase_amount <= Q3 + 1.5*IQR)
# 6. Final validation
summary(df_clean)
sum(is.na(df_clean)) # Should be 0
print(df_clean)
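The final validation can be hardened into assertions that stop the script the moment an assumption fails, instead of relying on eyeballing summary(). A sketch using a tiny stand-in df_clean and illustrative checks:

```r
# Tiny stand-in for the cleaned data from the workflow above
df_clean <- data.frame(
  customer_id     = c(101, 103),
  purchase_amount = c(150, 75),
  purchase_date   = as.Date(c("2024-01-10", "2024-02-05"))
)

# Sanity checks: fail loudly rather than silently analyze bad data
stopifnot(
  sum(is.na(df_clean)) == 0,                 # no missing values remain
  !any(duplicated(df_clean)),                # no exact duplicate rows
  all(df_clean$purchase_amount > 0),         # amounts are positive
  all(df_clean$purchase_date <= Sys.Date())  # no future dates
)
```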
Common Data Quality Issues & Solutions
Problem 1: Numeric Column Stored as Character
# Symptoms: "150.50" instead of 150.50
df$purchase_amount <- as.numeric(df$purchase_amount)
# Check for NAs introduced by conversion
sum(is.na(df$purchase_amount))
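To see which values caused those NAs, compare the column before and after conversion; a sketch with illustrative bad values:

```r
raw <- c("150.50", "200", "N/A", "1,250")

# suppressWarnings: as.numeric() warns about each value it cannot parse
converted <- suppressWarnings(as.numeric(raw))

# Values that were present before conversion but failed to parse
raw[is.na(converted) & !is.na(raw)]  # "N/A" "1,250"
```

Here "1,250" fails because of the thousands separator; strip it with gsub(",", "", raw) before converting.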
Problem 2: Inconsistent Delimiters in Text
# Example: "North |South" vs "North | South"
df$regions <- gsub("\\s*\\|\\s*", "|", df$regions) # Standardize spaces
Problem 3: Empty Strings vs NA
# Empty strings don't show as NA
df[df$name == "", ]
# Replace empty strings with NA
df <- df %>%
mutate(across(where(is.character), ~na_if(., "")))
FAQ
Q: Should I always remove rows with missing values? A: No. Removing rows loses information. Only remove if missing data is rare, or use imputation for important columns.
Q: What’s better - mean or median imputation? A: Median is more robust to outliers. Use median for skewed data, mean for normal data.
Q: How do I know if data is clean enough? A: Check: No NAs (or documented reason), no duplicates, consistent formats, reasonable values (pass logic checks).
Q: Should I remove all outliers? A: Not automatically. Outliers can be legitimate. Only remove if they’re data entry errors. Document your decision.
Q: Can I automate cleaning for future data? A: Yes! Write a cleaning function and source it in every analysis script. This ensures consistency.
Q: What’s the best order for cleaning steps? A: Check structure → Handle NAs → Remove duplicates → Standardize formats → Fix outliers → Validate.
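The automation answer above can be sketched as a single function you source() from every analysis script. The column names follow the running example; the function name and the particular steps are illustrative, not a prescription:

```r
library(dplyr)
library(tidyr)

# Reusable cleaning pipeline for the purchase data in this guide
clean_purchases <- function(df) {
  df %>%
    drop_na(customer_id) %>%   # critical key must exist
    distinct() %>%             # drop exact duplicate rows
    mutate(
      region = tolower(trimws(region)),  # standardize text
      purchase_amount = replace_na(purchase_amount,
                                   median(purchase_amount, na.rm = TRUE))
    )
}

# Usage: df_clean <- clean_purchases(read.csv("raw_data.csv"))
```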
Best Practices
- Never modify raw data - always work on a copy
- Document every cleaning decision - why did you remove rows?
- Keep a cleaning log - track what was changed and why
- Validate after cleaning - run sanity checks
- Automate repeated cleanings - write reusable functions
- Check with domain experts - they know if values are reasonable
Related Topics
- R Data Frames - Master Guide - Working with data frame structures
- R Data Import/Export - Master Guide - Loading data correctly
- R String Operations - Complete Guide - Text cleaning
- R Data Transformation - Complete Guide - Transform cleaned data
Download R Script
Get all code examples from this tutorial: data-cleaning-examples.R