Data cleaning is the most time-consuming part of any data analysis project. I’d estimate that 70-80% of my time as a data analyst goes to cleaning data, not analyzing it. Messy data = wrong conclusions. It’s that simple.
In this comprehensive guide, I’ll show you:
- How to identify and handle missing values (6 different methods)
- How to remove duplicate rows
- How to detect and treat outliers
- How to deal with inconsistent formats
- Common cleaning workflows
- Best practices I use in production code
By the end, you’ll have a complete toolkit for tackling messy real-world data.
Prerequisites
- R 3.6 or higher
- Familiarity with data frames and basic R syntax
- tidyverse packages (dplyr, tidyr) - optional but recommended
- Base R functions work too (we’ll cover both)
The Data Cleaning Pipeline
Here’s what professional data cleaning looks like:
Raw Data → Check Structure → Handle Missing Values → Remove Duplicates →
Fix Outliers → Standardize Formats → Validate → Clean Data Ready for Analysis
Let’s work through each step with real examples.
Understanding Your Data First
Always start by understanding what you’re working with:
# Create sample data with common problems
df <- data.frame(
  customer_id = c(101, 102, 102, 103, 104, NA, 105, 105),
  name = c("Alice", "Bob", "Bob", "Charlie", "David", "Eve", "Frank", "Frank"),
  purchase_amount = c(150, NA, 200, 75, 999, 85, 120, 120),
  purchase_date = c("2024-01-10", "2024-01-15", "2024-01-15", "2024-02-05",
                    "2024-02-12", "2024-03-01", "2024-03-05", "2024-03-05"),
  region = c("North", "south", "South", "NORTH", "East", "west", "North", "North")
)
# First, see what we're dealing with
str(df)
head(df)
# Check for missing values
sum(is.na(df)) # [1] 2
Notice the problems:
- Duplicate rows (customer 105 appears twice with identical values; customer 102 also repeats, though with a differing purchase_amount)
- Missing values (NA in customer_id for Eve, NA in purchase_amount for Bob)
- Extreme value (999 seems high for purchase_amount)
- Inconsistent format ("south" vs "South" vs "NORTH")
This is typical real-world data. Let’s fix it.
Method 1: Remove Rows With Any Missing Values - na.omit()
The simplest approach - delete entire rows that have ANY missing value:
# Using base R
df_clean <- na.omit(df)
# Or using the tidyverse (drop_na() comes from tidyr, not dplyr)
library(dplyr)
library(tidyr)
df_clean <- df %>% drop_na()
# Check what was removed
print(nrow(df) - nrow(df_clean)) # [1] 2 - both rows containing an NA are removed
When to Use na.omit()
Pros:
- Simple and fast
- Predictable - removes complete rows only
Cons:
- Can lose a lot of data if many NAs
- Not selective (removes entire row for just 1 missing value)
Use when: You have few missing values and your data is large enough that losing rows is acceptable.
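Before committing to na.omit(), it helps to check how much data you would actually lose. A quick base R sketch, rebuilding the two relevant columns of the sample data:

```r
# Minimal stand-in for the sample df above
df <- data.frame(
  customer_id     = c(101, 102, 102, 103, 104, NA, 105, 105),
  purchase_amount = c(150, NA, 200, 75, 999, 85, 120, 120)
)

# Fraction of rows with no missing values
complete_fraction <- mean(complete.cases(df))
complete_fraction  # 0.75 - two of the eight rows would be dropped

# Missing values per row - useful for spotting badly broken records
rowSums(is.na(df))
```

If that fraction is close to 1, dropping incomplete rows is usually safe; if it is low, prefer one of the imputation methods below.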
Method 2: Remove Rows With Missing Values in Specific Columns
More selective - only remove rows where certain columns are missing:
# Remove rows where customer_id is missing
df_clean <- df %>% drop_na(customer_id)
# Remove rows where customer_id OR purchase_amount is missing
df_clean <- df %>% drop_na(customer_id, purchase_amount)
# Base R approach
df_clean <- df[!is.na(df$customer_id), ]
Checking Which Rows Have Missing Values
# Find rows with any NA
df[!complete.cases(df), ]
# Find rows with NA in specific column
df[is.na(df$customer_id), ]
# Count missing values by column
sapply(df, function(x) sum(is.na(x)))
#     customer_id            name purchase_amount   purchase_date          region
#               1               0               1               0               0
Method 3: Replace Missing Values - Mean Imputation
Instead of deleting, replace missing values with the mean (or median, mode, etc.):
# Replace numeric NAs with mean
library(tidyr)
df_clean <- df %>%
  mutate(
    purchase_amount = replace_na(purchase_amount, mean(purchase_amount, na.rm = TRUE))
  )
# Multiple numeric columns at once - careful: this also imputes
# ID columns like customer_id, which rarely makes sense
df_clean <- df %>%
  mutate(
    across(where(is.numeric), ~replace_na(., mean(., na.rm = TRUE)))
  )
# Using base R
df$purchase_amount[is.na(df$purchase_amount)] <- mean(df$purchase_amount, na.rm = TRUE)
When to Use Mean Imputation
Pros:
- Keeps all rows (no data loss)
- Simple and interpretable
Cons:
- Biases data toward the mean
- Reduces variability
- Not ideal for categorical data
Use when: You have missing numeric data and only a small percentage of NAs.
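Swapping the mean for the median is a one-word change and is more robust when the column is skewed; in the sample data the 999 purchase pulls the mean up to roughly 250, while the median stays at 120:

```r
library(dplyr)
library(tidyr)

df <- data.frame(purchase_amount = c(150, NA, 200, 75, 999, 85, 120, 120))

# Median imputation: far less sensitive to the 999 outlier than the mean
df_clean <- df %>%
  mutate(purchase_amount = replace_na(purchase_amount,
                                      median(purchase_amount, na.rm = TRUE)))

df_clean$purchase_amount[2]  # 120, the median of the observed values
```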
Method 4: Replace With Group-Specific Value
More sophisticated - replace missing value with mean of its group:
# Replace purchase_amount NA with region's mean
df_clean <- df %>%
  group_by(region) %>%
  mutate(
    purchase_amount = replace_na(purchase_amount, mean(purchase_amount, na.rm = TRUE))
  ) %>%
  ungroup()
# Note: Bob's missing purchase_amount is filled with his region's mean.
# Standardize the region labels first ("south" vs "South"), otherwise
# "south" forms a one-row group whose mean is NaN.
This is better because it uses context-specific averages.
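One pitfall with group-wise imputation: if a group contains no observed values at all, its mean is NaN and the missing value effectively survives. A sketch of a fallback to the overall mean (the data here is illustrative):

```r
library(dplyr)

df <- data.frame(
  purchase_amount = c(150, NA, 200, NA),
  region          = c("North", "North", "South", "West")  # "West" has no observed value
)

# Overall mean as the fallback for groups with no data
overall_mean <- mean(df$purchase_amount, na.rm = TRUE)

df_clean <- df %>%
  group_by(region) %>%
  # Fill NAs with the group mean (NaN if the whole group is missing)
  mutate(purchase_amount = coalesce(purchase_amount,
                                    mean(purchase_amount, na.rm = TRUE))) %>%
  ungroup() %>%
  # Catch the all-missing groups and use the overall mean instead
  mutate(purchase_amount = ifelse(is.nan(purchase_amount),
                                  overall_mean, purchase_amount))
```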
Method 5: Replace Character NAs With a Default
For character columns, use a placeholder:
# Replace NA in name with "Unknown"
df_clean <- df %>%
  mutate(
    name = replace_na(name, "Unknown")
  )
# Or multiple columns
df_clean <- df %>%
  mutate(
    across(where(is.character), ~replace_na(., "Unknown"))
  )
Method 6: Delete Entire Columns With Too Many NAs
Sometimes a whole column is too incomplete to use:
# Remove columns with >50% missing values
df_clean <- df %>%
select(where(~sum(is.na(.)) / length(.) < 0.5))
# Or specific columns
df_clean <- df %>% select(-problematic_column)
Removing Duplicate Rows
Duplicates are common in real data. Here’s how to handle them:
Method 1: Remove Exact Duplicates
# Remove rows that are exact copies of an earlier row (keeps the first occurrence)
df_clean <- df %>% distinct()
# Using base R
df_clean <- df[!duplicated(df), ]
# Example - removes the second Frank row; Bob's rows differ in purchase_amount, so both stay
Method 2: Remove Duplicates Based on Specific Columns
# Keep first occurrence of each customer_id
df_clean <- df %>%
distinct(customer_id, .keep_all = TRUE)
# Keep first occurrence of each (customer_id, region) combination
df_clean <- df %>%
distinct(customer_id, region, .keep_all = TRUE)
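distinct() keeps the first row it encounters, so the sort order decides which duplicate survives. Sorting first lets you keep, say, the most recent record per customer; a sketch with illustrative data:

```r
library(dplyr)

df <- data.frame(
  customer_id   = c(102, 102, 105),
  purchase_date = as.Date(c("2024-01-15", "2024-02-01", "2024-03-05"))
)

# Sort so the newest purchase comes first, then keep one row per customer
df_latest <- df %>%
  arrange(customer_id, desc(purchase_date)) %>%
  distinct(customer_id, .keep_all = TRUE)

df_latest$purchase_date[df_latest$customer_id == 102]  # "2024-02-01"
```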
Identifying Duplicates Before Removing
# Find rows that duplicate an earlier row (first occurrences are not flagged)
duplicates <- df[duplicated(df), ]
print(duplicates)
# Find rows with duplicate customer_id
duplicates_by_id <- df[duplicated(df$customer_id), ]
# Count duplicates per customer
df %>%
group_by(customer_id) %>%
filter(n() > 1) %>%
arrange(customer_id)
Standardizing Formats
Real data often has inconsistent formats. Fix them before analysis:
Standardize Text (Case)
# Convert all regions to lowercase
df_clean <- df %>%
mutate(region = tolower(region))
# Convert to title case
df_clean <- df %>%
mutate(region = tools::toTitleCase(region))
# Or use stringr for more control
library(stringr)
df_clean <- df %>%
mutate(region = str_to_title(region))
Standardize Whitespace
# Remove leading/trailing whitespace
df_clean <- df %>%
mutate(name = trimws(name))
# Works on all columns
df_clean <- df %>%
mutate(across(where(is.character), trimws))
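trimws() only touches the ends of a string; runs of internal whitespace ("John  Smith") need a regex. A small base R helper (the name squish is my own):

```r
# Collapse any run of whitespace to a single space, then trim the ends
squish <- function(x) trimws(gsub("\\s+", " ", x))

squish("  John   Smith ")  # "John Smith"
```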
Standardize Dates
# Convert to standard Date format
df_clean <- df %>%
mutate(purchase_date = as.Date(purchase_date, format = "%Y-%m-%d"))
# Check if any dates are still character/invalid
class(df_clean$purchase_date)
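If you do not know in advance which format a column uses, as.Date() (R >= 3.6) accepts a tryFormats argument and uses the first format that parses. Note it picks one format for the whole column based on the first non-NA value, so genuinely mixed formats need element-wise handling; the formats here are illustrative:

```r
dates <- c("15/01/2024", "05/02/2024")

# Tries ISO format first, falls back to day/month/year
parsed <- as.Date(dates, tryFormats = c("%Y-%m-%d", "%d/%m/%Y"))
parsed  # "2024-01-15" "2024-02-05"
```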
Detecting and Handling Outliers
Outliers can distort analysis. Here’s how to find them:
Method 1: Statistical Detection - IQR Method
# Calculate IQR (Interquartile Range)
Q1 <- quantile(df$purchase_amount, 0.25, na.rm = TRUE)
Q3 <- quantile(df$purchase_amount, 0.75, na.rm = TRUE)
IQR <- Q3 - Q1
# Outliers are > 1.5 * IQR beyond quartiles
lower_bound <- Q1 - 1.5 * IQR
upper_bound <- Q3 + 1.5 * IQR
# Find outliers
outliers <- df %>%
filter(purchase_amount < lower_bound | purchase_amount > upper_bound)
print(outliers) # Shows the $999 purchase as outlier
Method 2: Z-Score Detection
# Calculate Z-scores (standard deviations from mean)
df_clean <- df %>%
  mutate(
    purchase_amount_z = abs(as.numeric(scale(purchase_amount))), # as.numeric: scale() returns a matrix
    is_outlier = purchase_amount_z > 3 # more than 3 standard deviations
  )
# Remove outliers (rows where the Z-score is NA are dropped by filter() too)
df_clean <- df_clean %>%
  filter(!is_outlier) %>%
  select(-purchase_amount_z, -is_outlier)
Handling Outliers (Choose One)
# Option 1: Remove outliers
df_clean <- df %>%
filter(purchase_amount >= lower_bound & purchase_amount <= upper_bound)
# Option 2: Cap outliers (replace with boundary value)
df_clean <- df %>%
  mutate(
    purchase_amount = pmin(purchase_amount, upper_bound),
    purchase_amount = pmax(purchase_amount, lower_bound)
  )
# Option 3: Replace outliers with the median (uses the IQR bounds computed above)
df_clean <- df %>%
  mutate(
    purchase_amount = ifelse(
      purchase_amount < lower_bound | purchase_amount > upper_bound,
      median(purchase_amount, na.rm = TRUE),
      purchase_amount
    )
  )
Complete Data Cleaning Workflow
Here’s a realistic end-to-end workflow:
library(dplyr)
library(stringr) # for str_to_title() in step 4
library(tidyr)
# 1. Load and inspect
df <- read.csv("raw_data.csv")
str(df)
# 2. Handle missing values
df_clean <- df %>%
  # Remove rows with missing customer_id (critical)
  drop_na(customer_id) %>%
  # Fill purchase_amount with group mean
  group_by(region) %>%
  mutate(purchase_amount = replace_na(purchase_amount,
                                      mean(purchase_amount, na.rm = TRUE))) %>%
  ungroup()
# 3. Remove duplicates
df_clean <- df_clean %>%
  distinct(customer_id, region, purchase_date, .keep_all = TRUE)
# 4. Standardize formats
df_clean <- df_clean %>%
  mutate(
    name = trimws(tolower(name)),
    region = str_to_title(region)
  ) %>%
  mutate(purchase_date = as.Date(purchase_date))
# 5. Remove outliers
Q1 <- quantile(df_clean$purchase_amount, 0.25, na.rm = TRUE)
Q3 <- quantile(df_clean$purchase_amount, 0.75, na.rm = TRUE)
IQR <- Q3 - Q1
df_clean <- df_clean %>%
filter(purchase_amount >= Q1 - 1.5*IQR & purchase_amount <= Q3 + 1.5*IQR)
# 6. Final validation
summary(df_clean)
sum(is.na(df_clean)) # Should be 0
print(df_clean)
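The final validation can be hardened into assertions that stop the script the moment an assumption fails, instead of relying on eyeballing summary(). A sketch using a tiny stand-in df_clean and illustrative checks:

```r
# Tiny stand-in for the cleaned data from the workflow above
df_clean <- data.frame(
  customer_id     = c(101, 103),
  purchase_amount = c(150, 75),
  purchase_date   = as.Date(c("2024-01-10", "2024-02-05"))
)

# Sanity checks: fail loudly rather than silently analyze bad data
stopifnot(
  sum(is.na(df_clean)) == 0,                 # no missing values remain
  !any(duplicated(df_clean)),                # no exact duplicate rows
  all(df_clean$purchase_amount > 0),         # amounts are positive
  all(df_clean$purchase_date <= Sys.Date())  # no future dates
)
```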
Common Data Quality Issues & Solutions
Problem 1: Numeric Column Stored as Character
# Symptoms: "150.50" instead of 150.50
df$purchase_amount <- as.numeric(df$purchase_amount)
# Check for NAs introduced by conversion
sum(is.na(df$purchase_amount))
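To see which values caused those NAs, compare the column before and after conversion; a sketch with illustrative bad values:

```r
raw <- c("150.50", "200", "N/A", "1,250")

# suppressWarnings: as.numeric() warns about each value it cannot parse
converted <- suppressWarnings(as.numeric(raw))

# Values that were present before conversion but failed to parse
raw[is.na(converted) & !is.na(raw)]  # "N/A" "1,250"
```

Here "1,250" fails because of the thousands separator; strip it with gsub(",", "", raw) before converting.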
Problem 2: Inconsistent Delimiters in Text
# Example: "North |South" vs "North | South"
df$regions <- gsub("\\s*\\|\\s*", "|", df$regions) # Standardize spaces
Problem 3: Empty Strings vs NA
# Empty strings don't show as NA
df[df$name == "", ]
# Replace empty strings with NA
df <- df %>%
mutate(across(where(is.character), ~na_if(., "")))
FAQ
Q: Should I always remove rows with missing values? A: No. Removing rows loses information. Only remove if missing data is rare, or use imputation for important columns.
Q: What’s better - mean or median imputation? A: Median is more robust to outliers. Use median for skewed data, mean for normal data.
Q: How do I know if data is clean enough? A: Check: No NAs (or documented reason), no duplicates, consistent formats, reasonable values (pass logic checks).
Q: Should I remove all outliers? A: Not automatically. Outliers can be legitimate. Only remove if they’re data entry errors. Document your decision.
Q: Can I automate cleaning for future data? A: Yes! Write a cleaning function and source it in every analysis script. This ensures consistency.
Q: What’s the best order for cleaning steps? A: Check structure → Handle NAs → Remove duplicates → Standardize formats → Fix outliers → Validate.
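The automation answer above can be sketched as a single function you source() from every analysis script. The column names follow the running example; the function name and the particular steps are illustrative, not a prescription:

```r
library(dplyr)
library(tidyr)

# Reusable cleaning pipeline for the purchase data in this guide
clean_purchases <- function(df) {
  df %>%
    drop_na(customer_id) %>%   # critical key must exist
    distinct() %>%             # drop exact duplicate rows
    mutate(
      region = tolower(trimws(region)),  # standardize text
      purchase_amount = replace_na(purchase_amount,
                                   median(purchase_amount, na.rm = TRUE))
    )
}

# Usage: df_clean <- clean_purchases(read.csv("raw_data.csv"))
```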
Best Practices
- Never modify raw data - always work on a copy
- Document every cleaning decision - why did you remove rows?
- Keep a cleaning log - track what was changed and why
- Validate after cleaning - run sanity checks
- Automate repeated cleanings - write reusable functions
- Check with domain experts - they know if values are reasonable
Related Topics
- R Data Frames - Master Guide - Working with data frame structures
- R Data Import/Export - Master Guide - Loading data correctly
- R String Operations - Complete Guide - Text cleaning
- R Data Transformation - Complete Guide - Transform cleaned data
Download R Script
Get all code examples from this tutorial: data-cleaning-examples.R