Descriptive statistics form the foundation of data analysis, providing summary measures that describe the main features of a dataset. Whether you’re exploring a dataset for the first time or preparing data for advanced analysis, understanding how to calculate and interpret descriptive statistics in R is essential.

In this comprehensive guide, we’ll explore all key descriptive statistics concepts and demonstrate how to compute them efficiently using R’s built-in functions and popular packages.

What Are Descriptive Statistics?

Descriptive statistics are numerical and graphical methods used to summarize and describe the characteristics of a dataset without making inferences about a larger population. They answer fundamental questions about your data:

  • What is the typical/average value? (Central Tendency)
  • How spread out are the values? (Dispersion)
  • What is the distribution shape? (Skewness)
  • Are there any extreme values? (Outliers)
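Each of these questions maps onto a line or two of base R. As a quick preview (using a small made-up vector), the four categories can be computed like this:

```r
# A small illustrative vector with one unusually large value
x <- c(12, 15, 14, 13, 16, 45, 14, 15)

mean(x); median(x)                 # central tendency
sd(x); IQR(x)                      # dispersion
mean((x - mean(x))^3) / sd(x)^3    # a simple moment-based skewness estimate
boxplot.stats(x)$out               # values flagged by the 1.5 * IQR rule
# [1] 45
```

All of these functions are covered in detail in the sections that follow.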

Why Descriptive Statistics Matter

Before applying any statistical models or machine learning algorithms, you should first understand your data through descriptive statistics. This helps you:

  1. Identify data quality issues (missing values, outliers)
  2. Understand the distribution and pattern of variables
  3. Detect anomalies or unusual patterns
  4. Guide selection of appropriate statistical methods
  5. Communicate findings to non-technical stakeholders

Measures of Central Tendency

Central tendency measures describe the “center” or “typical” value in your dataset.

Mean (Average)

The mean is the sum of all values divided by the number of observations. It’s the most commonly used measure of central tendency.

# Calculate mean of a vector
x <- c(2.39, 11.25, 12.15, 13.48, 13.78, 12.89, 12.21, 12.58)
mean_x <- mean(x)
print(mean_x)
# [1] 11.34125

# Calculate mean with NA values
y <- c(5, 10, NA, 15, 20)
mean_y <- mean(y, na.rm = TRUE)  # Remove NA values
print(mean_y)
# [1] 12.5

# Calculate mean of data frame columns
df <- data.frame(
  Sales = c(100, 150, 200, 175, 225),
  Profit = c(20, 35, 45, 40, 55)
)
mean(df$Sales)
# [1] 170

# Calculate means of multiple columns
colMeans(df)
#  Sales  Profit
#    170      39

When to use: The mean works well for normally distributed data without extreme outliers. It’s sensitive to extreme values, so consider alternatives if you have outliers.

Median

The median is the middle value when data is sorted. It’s robust to outliers and useful for skewed distributions.

# Calculate median
x <- c(2, 5, 8, 12, 15, 18, 22)
median_x <- median(x)
print(median_x)
# [1] 12

# Median vs mean with outliers
data_with_outlier <- c(1, 2, 3, 4, 5, 100)
mean(data_with_outlier)     # [1] 19.16667 (pulled up by the outlier)
median(data_with_outlier)   # [1] 3.5 (unaffected)

# Median of data frame column
median(df$Sales)
# [1] 175

When to use: Prefer the median when your data contains outliers or is skewed, such as income distributions or housing prices.

Mode

The mode is the value that appears most frequently. Base R doesn’t have a built-in mode function, but you can easily create one:

# Custom mode function
get_mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# Find mode
x <- c(1, 2, 2, 3, 3, 3, 4, 5, 5)
mode_x <- get_mode(x)
print(mode_x)
# [1] 3

# Using table to find frequency
table(x)
# x
# 1 2 3 4 5
# 1 2 3 1 2

When to use: The mode is useful for categorical data or identifying the most common category in your dataset.
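Since the mode is most natural for categorical data, the same helper works on character vectors too. A made-up example:

```r
# get_mode() as defined above, applied to categorical data
get_mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

colors <- c("red", "blue", "red", "green", "red", "blue")
get_mode(colors)
# [1] "red"

table(colors)  # full frequency table
# colors
#  blue green   red
#     2     1     3
```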

Measures of Dispersion

Dispersion measures describe how spread out the values are in your dataset.

Variance and Standard Deviation

Variance measures the average squared deviation from the mean. Standard deviation is the square root of variance and has the same units as the original data.

# Calculate variance and standard deviation
x <- c(10, 20, 30, 40, 50)
var_x <- var(x)      # Sample variance
sd_x <- sd(x)        # Sample standard deviation
print(var_x)
# [1] 250

print(sd_x)
# [1] 15.81139

# Verify relationship: sd = sqrt(var)
sqrt(var_x)
# [1] 15.81139

# Population variance (multiply by (n-1)/n)
n <- length(x)
pop_var <- var_x * (n - 1) / n
print(pop_var)
# [1] 200

# Standard deviation of data frame columns
sapply(df, sd)
#    Sales   Profit
# 48.08846 12.94218

Interpretation: A smaller standard deviation indicates values are clustered closely around the mean, while larger values indicate greater spread.
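To make that concrete, here are two made-up datasets with the same mean but very different spread:

```r
# Same center, different spread
a <- c(48, 49, 50, 51, 52)
b <- c(30, 40, 50, 60, 70)

mean(a)  # [1] 50
mean(b)  # [1] 50
sd(a)    # [1] 1.581139  (values cluster tightly around 50)
sd(b)    # [1] 15.81139  (values are spread widely around 50)
```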

Range and Interquartile Range (IQR)

The range is the difference between maximum and minimum values. The IQR (Interquartile Range) represents the spread of the middle 50% of data, calculated as Q3 - Q1. It’s resistant to outliers and useful for identifying unusual values.

# Calculate range
x <- c(5, 12, 15, 20, 25, 35, 40)
range_x <- max(x) - min(x)
print(range_x)
# [1] 35

# Using range() function
range(x)
# [1]  5 40

# Calculate IQR using IQR() function
iqr_x <- IQR(x)
print(iqr_x)
# [1] 16.5

# Manual IQR calculation using quantiles
q1 <- quantile(x, 0.25)
q3 <- quantile(x, 0.75)
manual_iqr <- unname(q3 - q1)  # drop the "75%" name
print(manual_iqr)
# [1] 16.5

# IQR for data frame columns
df <- data.frame(
  Score = c(78, 85, 84, 81, 79, 85, 85, 81, 78, 89, 84, 84, 80),
  Age = c(25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85)
)

# IQR for each column
IQR(df$Score)  # [1] 5
IQR(df$Age)    # [1] 30

# Using IQR to detect outliers in Score
q1 <- quantile(df$Score, 0.25)
q3 <- quantile(df$Score, 0.75)
outlier_threshold <- 1.5 * IQR(df$Score)
lower_bound <- q1 - outlier_threshold
upper_bound <- q3 + outlier_threshold
outliers <- df$Score[df$Score < lower_bound | df$Score > upper_bound]

Interpretation: The IQR spans the middle 50% of the data. Values below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR are conventionally flagged as outliers; this is the rule boxplots use to plot individual points.
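Because boxplots apply exactly this rule, boxplot.stats() can report the flagged points directly (note it uses Tukey's hinges, which can differ slightly from quantile()). A made-up score vector with one injected outlier:

```r
# boxplot.stats() applies the 1.5 * IQR rule (with Tukey's hinges)
scores <- c(78, 85, 84, 81, 79, 85, 85, 81, 78, 89, 84, 84, 80, 120)
boxplot.stats(scores)$out
# [1] 120
```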

Quantiles and Percentiles

Quantiles divide sorted data into equal groups. Percentiles express quantiles as percentages.

# Calculate quantiles
x <- c(1, 5, 8, 12, 15, 18, 22, 25, 28, 30)
quantile(x)
#    0%   25%   50%   75%  100%
#  1.00  9.00 16.50 24.25 30.00

# Calculate specific quantiles
quantile(x, c(0.1, 0.5, 0.9))
#  10%  50%  90%
#  4.6 16.5 28.2

# Percentile interpretation: 90th percentile
q90 <- quantile(x, 0.90)
print(q90)
#  90%
# 28.2
# Roughly 90% of values fall at or below 28.2

# Deciles (divide into 10 equal parts)
deciles <- quantile(x, seq(0, 1, 0.1))
print(deciles)

# Quartiles (divide into 4 equal parts)
quartiles <- quantile(x, c(0, 0.25, 0.5, 0.75, 1))
print(quartiles)

Five-Number Summary

The five-number summary provides a quick overview: minimum, Q1, median, Q3, maximum.

# Five-number summary
x <- c(10, 15, 20, 22, 25, 28, 30, 32, 35, 40)
summary(x)
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#   10.00   20.50   26.50   25.70   31.50   40.00

# Apply to data frame
df <- data.frame(
  Age = c(25, 30, 35, 40, 45, 50, 55),
  Salary = c(50000, 55000, 60000, 65000, 70000, 75000, 80000)
)
summary(df)
#      Age          Salary
# Min.   :25.0   Min.   :50000
# 1st Qu.:32.5   1st Qu.:57500
# Median :40.0   Median :65000
# Mean   :40.0   Mean   :65000
# 3rd Qu.:47.5   3rd Qu.:72500
# Max.   :55.0   Max.   :80000
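Base R also provides fivenum(), which returns exactly these five numbers. It uses Tukey's hinges, so the quartile values can differ slightly from summary() and quantile():

```r
# Tukey's five-number summary: min, lower hinge, median, upper hinge, max
x <- c(10, 15, 20, 22, 25, 28, 30, 32, 35, 40)
fivenum(x)
# [1] 10.0 20.0 26.5 32.0 40.0
```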

Trimmed Mean

The trimmed mean calculates the mean after removing a specified percentage of extreme values from both tails.

# Calculate a trimmed mean
x <- c(1, 2, 3, 4, 5, 100)  # Contains outlier
mean(x)              # [1] 19.16667 (affected by outlier)
mean(x, trim = 0.2)  # [1] 3.5 (one value trimmed from each end)

# Different trim proportions: floor(n * trim) values are removed from each tail
data <- c(5, 10, 15, 20, 25, 30, 1000)
mean(data)                # [1] 157.8571 (includes outlier)
mean(data, trim = 0.1)    # [1] 157.8571 (floor(7 * 0.1) = 0, nothing trimmed)
mean(data, trim = 0.2)    # [1] 20 (one value removed from each end)
mean(data, trim = 0.3)    # [1] 20 (two values removed from each end)

When to use: Trimmed means are robust to outliers and useful for heavily skewed distributions or datasets with measurement errors.

Standardization and Z-Scores

Z-scores standardize values to have mean 0 and standard deviation 1. This allows comparison across variables with different scales.

# Calculate z-scores
x <- c(100, 120, 140, 160, 180)
z_scores <- (x - mean(x)) / sd(x)
print(z_scores)
# [1] -1.2649111 -0.6324555  0.0000000  0.6324555  1.2649111

# Verify: standardized values should have mean≈0 and sd≈1
mean(z_scores)     # [1] 0 (or very close)
sd(z_scores)       # [1] 1

# Using scale() function
z_scores_scale <- scale(x)
print(z_scores_scale)
#            [,1]
# [1,] -1.2649111
# [2,] -0.6324555
# [3,]  0.0000000
# [4,]  0.6324555
# [5,]  1.2649111
# attr(,"scaled:center")
# [1] 140
# attr(,"scaled:scale")
# [1] 31.62278

# Identify outliers: |z| > 3 is a common rule of thumb, but z-scores
# in small samples can never get that large, so a lower cutoff is used here
y <- c(5, 10, 15, 20, 25, 100)
z_y <- scale(y)
outliers <- which(abs(z_y) > 2)
print(outliers)  # [1] 6 (position of the extreme value)

Summary Statistics by Groups

Often you need to calculate descriptive statistics separately for different groups in your data.

# Sample data with groups
df <- data.frame(
  Department = c("Sales", "Sales", "Sales", "IT", "IT", "IT", "HR", "HR"),
  Salary = c(50000, 55000, 60000, 70000, 75000, 80000, 45000, 48000),
  Experience = c(2, 3, 5, 3, 4, 6, 1, 2)
)

# Base R approach: aggregate()
aggregate(Salary ~ Department, data = df, FUN = mean)
#   Department  Salary
# 1         HR 46500.0
# 2         IT 75000.0
# 3      Sales 55000.0

# Multiple statistics with aggregate()
aggregate(df[c("Salary", "Experience")],
          by = list(Department = df$Department),
          FUN = function(x) c(mean = mean(x), sd = sd(x)))

# dplyr approach (modern and flexible)
library(dplyr)

df %>%
  group_by(Department) %>%
  summarize(
    mean_salary = mean(Salary),
    median_salary = median(Salary),
    sd_salary = sd(Salary),
    min_salary = min(Salary),
    max_salary = max(Salary),
    n = n()
  )

# Using psych::describeBy() for comprehensive statistics
library(psych)
describeBy(df[c("Salary", "Experience")], group = df$Department)

Covariance and Correlation in Descriptive Context

Covariance measures how two variables change together. Correlation standardizes covariance to a -1 to 1 scale.

# Create sample data
df <- data.frame(
  Age = c(25, 30, 35, 40, 45),
  Salary = c(50, 55, 65, 75, 85)
)

# Calculate covariance
cov_matrix <- cov(df)
print(cov_matrix)
#          Age Salary
# Age     62.5  112.5
# Salary 112.5  205.0

# Extract specific covariance
cov(df$Age, df$Salary)
# [1] 112.5

# Calculate correlation (covariance standardized)
cor_matrix <- cor(df)
print(cor_matrix)
#              Age    Salary
# Age    1.0000000 0.9938267
# Salary 0.9938267 1.0000000

# Correlation is covariance / (sd1 * sd2)
cov(df$Age, df$Salary) / (sd(df$Age) * sd(df$Salary))
# [1] 0.9938267

Outlier Detection

Identifying unusual values is crucial for data quality assessment.

# Detection using z-scores (cutoff lowered for this small sample,
# since |z| is bounded well below 3 when n is this small)
data <- c(10, 12, 15, 14, 13, 100)
z <- scale(data)
outliers_z <- which(abs(z) > 2)
print(outliers_z)
# [1] 6

# Detection using IQR method (most common)
iqr_outliers <- function(x) {
  q1 <- quantile(x, 0.25)
  q3 <- quantile(x, 0.75)
  iqr <- q3 - q1
  lower_bound <- q1 - 1.5 * iqr
  upper_bound <- q3 + 1.5 * iqr

  outliers <- which(x < lower_bound | x > upper_bound)
  return(outliers)
}

outliers <- iqr_outliers(data)
print(data[outliers])

# Mahalanobis distance for multivariate outliers
# (mahalanobis() is in the stats package, which is loaded by default)
df <- data.frame(
  x = c(1, 2, 3, 4, 5, 100),
  y = c(2, 5, 6, 9, 10, 180)  # not perfectly collinear with x, so cov(df) is invertible
)
distances <- mahalanobis(df, colMeans(df), cov(df))
# Note: with n observations, squared distances cannot exceed (n - 1)^2 / n,
# so very small samples may flag nothing at the chi-squared cutoff
outliers_mahal <- which(distances > qchisq(0.95, df = 2))
print(df[outliers_mahal, ])

Summary Statistics with dplyr

Modern R data analysis uses dplyr for efficient, readable code:

library(dplyr)

df <- data.frame(
  Product = rep(c("A", "B"), 5),
  Sales = c(100, 150, 120, 160, 110, 170, 130, 180, 140, 190),
  Quarter = rep(1:5, each = 2)  # pairs each quarter with one A row and one B row
)

# Comprehensive summary by group
df %>%
  group_by(Product) %>%
  summarize(
    n = n(),
    mean = mean(Sales),
    median = median(Sales),
    sd = sd(Sales),
    min = min(Sales),
    q1 = quantile(Sales, 0.25),
    q3 = quantile(Sales, 0.75),
    max = max(Sales),
    iqr = IQR(Sales),
    cv = (sd(Sales) / mean(Sales)) * 100  # Coefficient of variation
  )

# Using across() for multiple columns
df %>%
  group_by(Product) %>%
  summarize(across(
    Sales,
    list(
      mean = mean,
      median = median,
      sd = sd,
      min = min,
      max = max
    ),
    .names = "{.fn}_{.col}"
  ))

Complete Descriptive Statistics Example

Here’s a realistic example combining multiple techniques:

# Load built-in mtcars dataset
data(mtcars)

# Comprehensive analysis
library(dplyr)

mtcars %>%
  group_by(cyl) %>%
  summarize(
    n_cars = n(),
    mean_mpg = mean(mpg),
    median_mpg = median(mpg),
    sd_mpg = sd(mpg),
    cv_mpg = (sd(mpg) / mean(mpg)) * 100,
    min_mpg = min(mpg),
    max_mpg = max(mpg),
    iqr_mpg = IQR(mpg),
    skewness = mean((mpg - mean(mpg))^3) / (sd(mpg)^3),
    .groups = 'drop'
  )

# Identify unusual cars (by z-score)
mtcars$mpg_zscore <- as.numeric(scale(mtcars$mpg))  # as.numeric() drops matrix attributes
unusual_cars <- mtcars %>%
  filter(abs(mpg_zscore) > 2) %>%
  select(mpg, cyl, hp, wt)
print(unusual_cars)

# Visualize distribution
hist(mtcars$mpg, main = "MPG Distribution", breaks = 8)
boxplot(mpg ~ cyl, data = mtcars, main = "MPG by Cylinder")

Best Practices for Descriptive Statistics

  1. Always check for missing values before calculating statistics

    sum(is.na(df$column))
    
  2. Use appropriate measures for your data type:

    • Continuous: mean, median, variance, sd
    • Categorical: mode, frequency tables
    • Ordinal: median, quantiles
  3. Consider your data distribution when choosing statistics:

    • Normal distribution: use mean and sd
    • Skewed distribution: prefer median and IQR
    • Presence of outliers: use trimmed mean or median
  4. Report multiple statistics for a complete picture:

    • Always include measure of central tendency AND dispersion
    • Report sample size (n)
    • Note any outliers or unusual patterns
  5. Visualize alongside statistics:

    • Histograms show distribution shape
    • Boxplots show median, quartiles, and outliers
    • Q-Q plots reveal normality
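For the Q-Q plot mentioned in point 5, base R's qqnorm() and qqline() are enough, shown here on the built-in mtcars data:

```r
# Visual normality check: points near the line suggest approximate normality
x <- mtcars$mpg
qqnorm(x, main = "Normal Q-Q Plot of MPG")
qqline(x)
```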

Common Questions

Q: Should I use mean or median? A: Use mean for normally distributed data without outliers. Use median for skewed data or when outliers are present. Often report both.

Q: What’s the difference between sample and population statistics? A: R’s var() and sd() calculate sample statistics (divide by n-1). For population statistics, multiply by (n-1)/n.

Q: How do I detect outliers? A: Common methods include z-scores (|z| > 3), IQR method (1.5 × IQR), and Mahalanobis distance for multivariate data.

Q: What’s the coefficient of variation (CV)? A: CV = (standard deviation / mean) × 100. It’s useful for comparing spread across variables with different scales.
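In code, the CV is a one-line helper; the height/weight vectors below are made up purely for illustration:

```r
# Coefficient of variation: unit-free, so spreads on different scales compare
cv <- function(x) sd(x) / mean(x) * 100

heights_cm <- c(160, 165, 170, 175, 180)
weights_kg <- c(55, 60, 70, 80, 95)
cv(heights_cm)  # about 4.65
cv(weights_kg)  # about 22.3 (weights vary more, relative to their mean)
```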

Q: Why remove outliers? A: Not always! Outliers may be legitimate data points. Investigate them first to understand whether they’re errors or genuine extreme values.

Q: How do I handle missing values in descriptive statistics? A: Use na.rm = TRUE in functions: mean(x, na.rm = TRUE). First explore the pattern of missingness using summary() or is.na().

Q: What’s the difference between variance and standard deviation? A: Variance is the average squared deviation, standard deviation is its square root. SD is preferred for interpretation as it’s in original data units.

Q: How should I report descriptive statistics? A: Standard format: Mean ± SD (or Median with IQR). Always include sample size (n). Example: “Mean age = 35.4 ± 8.2 years (n=45)”.

Descriptive statistics form the foundation for more advanced analyses.

Download R Script

Get all the code examples from this tutorial in one convenient R script: descriptive-statistics-examples.R