If your dataset contains missing values (NA), you can handle them in R by either removing the affected rows or replacing the missing values with substitutes. You can also remove duplicate rows.
The following methods show the syntax for each approach.
Method 1: Use na.omit() Function From Base R
# na.omit() comes from the stats package (loaded by default); dplyr is loaded only for the %>% pipe
library(dplyr)
new_df <- df %>% na.omit()
Method 2: Replace Missing Values With the Column Mean
library(dplyr)
library(tidyr)
df %>% mutate(across(where(is.numeric), ~replace_na(., mean(., na.rm=TRUE))))
Method 3: Remove Duplicate Rows
library(dplyr)
df %>% distinct(.keep_all=TRUE)
The following examples show how you can perform data cleaning in R.
Let’s first create a data frame with missing (NA) values:
# Create data frame
df <- data.frame(Machine_name=c("A","B","C","D","E","F","G","H"),
Pressure=c(78.2, NA, 71.7, 80.21, 83.12, 82.56,NA, 79.50),
Temperature=c(31, 35, 36, 36, 38, 32, 33, NA),
Status=c(NA,TRUE,TRUE,FALSE,FALSE,FALSE,TRUE,TRUE))
# Show data frame
print(df)
Output:
  Machine_name Pressure Temperature Status
1            A    78.20          31     NA
2            B       NA          35   TRUE
3            C    71.70          36   TRUE
4            D    80.21          36  FALSE
5            E    83.12          38  FALSE
6            F    82.56          32  FALSE
7            G       NA          33   TRUE
8            H    79.50          NA   TRUE
The output shows a data frame with missing values (NA) in the Pressure, Temperature, and Status columns.
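Before choosing a cleaning method, it can help to see how many values are missing and where. The following is a minimal sketch using base R only; the variable names na_counts and rows_with_na are chosen here for illustration and are not part of the original example.

```r
# Same data frame as above
df <- data.frame(Machine_name=c("A","B","C","D","E","F","G","H"),
                 Pressure=c(78.2, NA, 71.7, 80.21, 83.12, 82.56, NA, 79.50),
                 Temperature=c(31, 35, 36, 36, 38, 32, 33, NA),
                 Status=c(NA,TRUE,TRUE,FALSE,FALSE,FALSE,TRUE,TRUE))

# Count missing values in each column
na_counts <- colSums(is.na(df))
print(na_counts)

# Count rows that contain at least one missing value
rows_with_na <- sum(!complete.cases(df))
print(rows_with_na)
```

For this data frame, the counts should show two missing values in Pressure, one in Temperature, and one in Status, spread across four rows.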
The following examples apply each cleaning method in turn.
Remove Rows With Missing Values Using na.omit()
Let’s take a look at an example:
# Import library
library(dplyr)
# Create data frame
df <- data.frame(Machine_name=c("A","B","C","D","E","F","G","H"),
Pressure=c(78.2, NA, 71.7, 80.21, 83.12, 82.56,NA, 79.50),
Temperature=c(31, 35, 36, 36, 38, 32, 33, NA),
Status=c(NA,TRUE,TRUE,FALSE,FALSE,FALSE,TRUE,TRUE))
# Remove NA values
new_df <- df %>% na.omit()
# Print new data frame
print(new_df)
Output:
  Machine_name Pressure Temperature Status
3            C    71.70          36   TRUE
4            D    80.21          36  FALSE
5            E    83.12          38  FALSE
6            F    82.56          32  FALSE
In the output, every row that contained at least one missing value has been removed. Only rows 3 to 6 remain, and they keep their original row numbers.
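Because na.omit() belongs to the stats package, it can also be called directly without dplyr. It removes a row if any column is missing, which may discard more data than intended. The sketch below shows the direct call and, as one possible alternative, a base R filter that drops rows only when Pressure is missing. The names complete_df and pressure_df are illustrative.

```r
# Same data frame as above
df <- data.frame(Machine_name=c("A","B","C","D","E","F","G","H"),
                 Pressure=c(78.2, NA, 71.7, 80.21, 83.12, 82.56, NA, 79.50),
                 Temperature=c(31, 35, 36, 36, 38, 32, 33, NA),
                 Status=c(NA,TRUE,TRUE,FALSE,FALSE,FALSE,TRUE,TRUE))

# Call na.omit() directly: drops every row with an NA in any column
complete_df <- na.omit(df)
print(complete_df)

# Drop rows only where Pressure is missing; NA in other columns is kept
pressure_df <- df[!is.na(df$Pressure), ]
print(pressure_df)
```

Here complete_df keeps four rows, while pressure_df keeps six (machines A, C, D, E, F, and H), including rows where Temperature or Status is still NA.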
Replace Missing Values With the Column Mean
The example below shows how to replace missing values in each numeric column with the mean of that column:
# Import library
library(dplyr)
library(tidyr)
# Create data frame
df <- data.frame(Machine_name=c("A","B","C","D","E","F","G","H"),
Pressure=c(78.2, NA, 71.7, 80.21, 83.12, 82.56,NA, 79.50),
Temperature=c(31, 35, 36, 36, 38, 32, 33, NA),
Status=c(NA,TRUE,TRUE,FALSE,FALSE,FALSE,TRUE,TRUE))
# Replace missing values in each numeric column with mean value of column
new_df <- df %>% mutate(across(where(is.numeric), ~replace_na(., mean(., na.rm=TRUE))))
# Print new data frame
print(new_df)
Output:
  Machine_name Pressure Temperature Status
1            A   78.200    31.00000     NA
2            B   79.215    35.00000   TRUE
3            C   71.700    36.00000   TRUE
4            D   80.210    36.00000  FALSE
5            E   83.120    38.00000  FALSE
6            F   82.560    32.00000  FALSE
7            G   79.215    33.00000   TRUE
8            H   79.500    34.42857   TRUE
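The output shows that missing values in the numeric columns are replaced with the column mean: 79.215 for Pressure and 34.42857 for Temperature. The NA in the Status column remains because Status is a logical column, which where(is.numeric) does not select.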
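To make the imputation step more concrete, the sketch below reproduces the same replacement using base R only, so the mean calculation is visible. This is one possible equivalent of the dplyr and tidyr version, not the method used above; the names num_cols and imputed_df are illustrative.

```r
# Same data frame as above
df <- data.frame(Machine_name=c("A","B","C","D","E","F","G","H"),
                 Pressure=c(78.2, NA, 71.7, 80.21, 83.12, 82.56, NA, 79.50),
                 Temperature=c(31, 35, 36, 36, 38, 32, 33, NA),
                 Status=c(NA,TRUE,TRUE,FALSE,FALSE,FALSE,TRUE,TRUE))

# Identify numeric columns (Pressure and Temperature)
num_cols <- sapply(df, is.numeric)

# Replace NA in each numeric column with that column's mean,
# computed from the non-missing values only
imputed_df <- df
imputed_df[num_cols] <- lapply(imputed_df[num_cols], function(x) {
  replace(x, is.na(x), mean(x, na.rm=TRUE))
})

print(imputed_df)
```

The means are computed with na.rm=TRUE, so each one is based only on the observed values (six for Pressure, seven for Temperature). Non-numeric columns such as Status are left untouched.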
Remove Duplicate Rows Using distinct()
The example below demonstrates how to remove duplicate rows using the distinct() function:
# Import library
library(dplyr)
# Create data frame
df <- data.frame(Machine_name=c("A","A","B","C","C","D","E","E","F","G","H","G"),
Pressure=c(78.2, 78.2, 71.7, 80.21, 80.21, 82.56, NA, NA, 72.12, NA, 73.85, NA),
Temperature=c(NA, NA, 35, 36, 36, 38, 32, 32, 31, NA, 34, NA))
# Remove duplicate rows
new_df <- df %>% distinct(.keep_all=TRUE)
# Print new data frame
print(new_df)
Output:
  Machine_name Pressure Temperature
1            A    78.20          NA
2            B    71.70          35
3            C    80.21          36
4            D    82.56          38
5            E       NA          32
6            F    72.12          31
7            G       NA          NA
8            H    73.85          34
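As the output shows, the duplicate rows are removed and only the first occurrence of each row is kept. Rows that contain NA are treated like any other rows when checking for duplicates.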
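When distinct() is called without column names, it compares whole rows, so the .keep_all argument has no visible effect. It matters only when specific columns are named. The sketch below illustrates the difference on the same data, and adds a base R equivalent using duplicated(); the names names_only, first_rows, and base_unique are illustrative.

```r
library(dplyr)

# Same data frame as above
df <- data.frame(Machine_name=c("A","A","B","C","C","D","E","E","F","G","H","G"),
                 Pressure=c(78.2, 78.2, 71.7, 80.21, 80.21, 82.56, NA, NA, 72.12, NA, 73.85, NA),
                 Temperature=c(NA, NA, 35, 36, 36, 38, 32, 32, 31, NA, 34, NA))

# Deduplicate on Machine_name only: just that column is returned
names_only <- df %>% distinct(Machine_name)

# Deduplicate on Machine_name but keep all columns of the first matching row
first_rows <- df %>% distinct(Machine_name, .keep_all=TRUE)

# Base R equivalent of distinct() over all columns
base_unique <- df[!duplicated(df), ]

print(names_only)
print(first_rows)
print(base_unique)
```

Deduplicating on a subset of columns keeps the first row for each key and silently drops later rows, even if their other columns differ, so it is worth checking that this is the intended behavior.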
Using these methods, you can handle a dataset that contains missing values or duplicate rows.