To calculate summary statistics by group in R, you can use the tapply() function, or combine the group_by() and summarize() functions from the dplyr package.
The following methods show the syntax for each approach.
Method 1: Use tapply() Function
tapply(df$column2, df$column1, summary)
Method 2: Use group_by() and summarize() from dplyr
library(dplyr)
d <- df %>%
  group_by(column1) %>%
  summarize(min = min(column2),
            q1 = quantile(column2, 0.25),
            median = median(column2),
            mean = mean(column2),
            q3 = quantile(column2, 0.75),
            max = max(column2))
The following examples show how to calculate summary statistics by group in R.
Use tapply() to Calculate Summary Statistics
The following example calculates summary statistics by group using the tapply() function:
# Create data frame
df <- data.frame(Machine_name=c("A","B","C","D","A","B","C","D"),
                 Pressure=c(78.2, 78.2, 71.7, 80.21, 80.21, 82.56, 72.12, 73.85),
                 Temperature=c(35, 36, 36, 38, 32, 32, 31, 34))
# Calculate summary statistics of 'Pressure' grouped by 'Machine_name'
s <- tapply(df$Pressure, df$Machine_name, summary)
# Print summary statistics
print(s)
Output:
$A
Min. 1st Qu. Median Mean 3rd Qu. Max.
78.20 78.70 79.20 79.20 79.71 80.21
$B
Min. 1st Qu. Median Mean 3rd Qu. Max.
78.20 79.29 80.38 80.38 81.47 82.56
$C
Min. 1st Qu. Median Mean 3rd Qu. Max.
71.70 71.81 71.91 71.91 72.02 72.12
$D
Min. 1st Qu. Median Mean 3rd Qu. Max.
73.85 75.44 77.03 77.03 78.62 80.21
The output shows the summary statistics of the Pressure column, grouped by the Machine_name column of the data frame.
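Because tapply() returns one summary per group, it can be convenient to stack those summaries into a single table. The following is a minimal sketch that assumes the same data frame as above; the object names summary_table and group_means are only illustrative.

```r
# Recreate the example data frame
df <- data.frame(Machine_name = c("A","B","C","D","A","B","C","D"),
                 Pressure = c(78.2, 78.2, 71.7, 80.21, 80.21, 82.56, 72.12, 73.85),
                 Temperature = c(35, 36, 36, 38, 32, 32, 31, 34))

# Apply summary() to 'Pressure' within each 'Machine_name' group
s <- tapply(df$Pressure, df$Machine_name, summary)

# Drop the summary class from each element, then bind the named vectors
# row-wise into one numeric matrix (one row per machine)
summary_table <- do.call(rbind, lapply(s, unclass))
print(summary_table)

# A function that returns a single value gives one value per group instead
group_means <- tapply(df$Pressure, df$Machine_name, mean)
print(group_means)
```

The matrix form makes it easy to pick out one statistic for one group, for example summary_table["A", "Mean"].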
Use group_by() and summarize() to Calculate Summary Statistics by Group
The following example uses the group_by() and summarize() functions from the dplyr package to calculate summary statistics by group:
# Import library
library(dplyr)
# Create data frame
df <- data.frame(Machine_name=c("A","B","C","D","A","B","C","D"),
                 Pressure=c(78.2, 78.2, 71.7, 80.21, 80.21, 82.56, 72.12, 73.85),
                 Temperature=c(35, 36, 36, 38, 32, 32, 31, 34))
# Calculate summary statistics of 'Temperature' grouped by 'Machine_name'
d <- df %>%
  group_by(Machine_name) %>%
  summarize(min = min(Temperature),
            q1 = quantile(Temperature, 0.25),
            median = median(Temperature),
            mean = mean(Temperature),
            q3 = quantile(Temperature, 0.75),
            max = max(Temperature))
# Print summary statistics as a plain data frame (shows unrounded values)
print(as.data.frame(d))
Output:
  Machine_name min    q1 median mean    q3 max
1            A  32 32.75   33.5 33.5 34.25  35
2            B  32 33.00   34.0 34.0 35.00  36
3            C  31 32.25   33.5 33.5 34.75  36
4            D  34 35.00   36.0 36.0 37.00  38
The output shows the summary statistics of the Temperature column, grouped by the Machine_name column of the data frame.
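If the same set of statistics is needed for several columns, the pipeline can be wrapped in a reusable function. The following is a minimal sketch, not a dplyr feature: summary_by_group() is a hypothetical helper name, and the sketch assumes a dplyr version that supports the {{ }} (embrace) operator, such as dplyr 1.0 or later.

```r
library(dplyr)

# Recreate the example data frame
df <- data.frame(Machine_name = c("A","B","C","D","A","B","C","D"),
                 Pressure = c(78.2, 78.2, 71.7, 80.21, 80.21, 82.56, 72.12, 73.85),
                 Temperature = c(35, 36, 36, 38, 32, 32, 31, 34))

# Hypothetical helper: group_col and value_col are unquoted column names
summary_by_group <- function(data, group_col, value_col) {
  data %>%
    group_by({{ group_col }}) %>%
    summarize(min = min({{ value_col }}),
              q1 = quantile({{ value_col }}, 0.25, names = FALSE),
              median = median({{ value_col }}),
              mean = mean({{ value_col }}),
              q3 = quantile({{ value_col }}, 0.75, names = FALSE),
              max = max({{ value_col }}))
}

# Reuse the helper for two different columns
temp_stats <- summary_by_group(df, Machine_name, Temperature)
pressure_stats <- summary_by_group(df, Machine_name, Pressure)

print(as.data.frame(temp_stats))
print(as.data.frame(pressure_stats))
```

Passing the column names through {{ }} lets the helper be called with bare column names, in the same way as group_by() and summarize() themselves.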