Hypothesis testing is the foundation of statistical inference. It allows you to make decisions about populations based on sample data, answer research questions, and determine whether observed differences are statistically significant or due to chance. Mastering hypothesis testing in R is essential for data analysis, research, and evidence-based decision-making.

This comprehensive guide covers all major hypothesis tests with practical R implementations and interpretations.

Foundations of Hypothesis Testing

Hypothesis testing follows a structured framework:

  1. Null Hypothesis (H₀): No effect or difference exists
  2. Alternative Hypothesis (H₁): An effect or difference exists
  3. Test Statistic: Calculated from sample data
  4. P-value: Probability of results at least as extreme as those observed, assuming H₀ is true
  5. Significance Level (α): Threshold for decision (typically 0.05)
  6. Decision: Reject or fail to reject H₀

Key Concepts

# Understanding p-values and significance
alpha <- 0.05  # Significance level

# If p-value < alpha: Reject H₀ (statistically significant)
# If p-value >= alpha: Fail to reject H₀ (not significant)

# Type I Error (False Positive): Reject H₀ when it's true
# Type II Error (False Negative): Fail to reject H₀ when it's false

# Power: 1 - Type II Error rate (ability to detect true effects)
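
Power depends on sample size, effect size, and α; base R's power.t.test computes it directly. A quick illustrative sketch (the effect size and sample sizes below are arbitrary, chosen only for demonstration):

# Power of a two-sample t-test: n = 30 per group, true difference of
# 0.5 standard deviations (illustrative values)
power.t.test(n = 30, delta = 0.5, sd = 1, sig.level = 0.05)

# Omit n and specify power instead to solve for the required sample size
power.t.test(power = 0.80, delta = 0.5, sd = 1, sig.level = 0.05)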

One-Sample Tests

One-Sample T-Test

Tests if sample mean differs from a hypothesized population mean.

# One-sample t-test
data <- c(23, 25, 24, 26, 25, 23, 24, 25, 26, 24)

# Test if mean differs from 25
result <- t.test(data, mu = 25)
print(result)
# t = -1.4639, df = 9, p-value ≈ 0.177
# 95 percent confidence interval: [23.73, 25.27]
# sample estimates: mean of x = 24.5

# Interpretation
if (result$p.value < 0.05) {
  print("Statistically significant difference from 25")
} else {
  print("No statistically significant difference from 25")
}

# One-tailed test (mean > 25)
result_greater <- t.test(data, mu = 25, alternative = "greater")

# One-tailed test (mean < 25)
result_less <- t.test(data, mu = 25, alternative = "less")

One-Sample Wilcoxon Test

Non-parametric alternative to the one-sample t-test. It doesn’t assume normality, though it does assume the distribution is roughly symmetric.

# Wilcoxon signed-rank test
result <- wilcox.test(data, mu = 25)
print(result)
# V = 6 (R labels the signed-rank statistic V, not W); tied and zero
# differences trigger a warning that the exact p-value cannot be computed

# Extract components
p_value <- result$p.value
statistic <- result$statistic

Two-Sample Tests

Two-Sample T-Test

Compares means between two independent groups.

# Create two independent samples
group1 <- c(23, 25, 24, 26, 25)
group2 <- c(20, 22, 21, 19, 23)

# Welch's t-test (R's default; does not assume equal variances)
result <- t.test(group1, group2)
print(result)

# Student's t-test (assumes equal variances)
result_student <- t.test(group1, group2, var.equal = TRUE)

# One-tailed test
result_one_tail <- t.test(group1, group2, alternative = "greater")

Paired T-Test

Compares means between paired (dependent) observations.

# Paired measurements (before/after)
before <- c(120, 125, 130, 118, 122)
after <- c(115, 120, 128, 116, 119)

# Paired t-test
result <- t.test(before, after, paired = TRUE)
print(result)

# Calculate mean difference
mean_diff <- mean(before - after)
print(paste("Mean difference:", mean_diff))

Two-Sample Wilcoxon Test

Non-parametric alternative for comparing two groups.

# Mann-Whitney U test (Wilcoxon rank-sum test)
result <- wilcox.test(group1, group2)
print(result)
# W = 24.5, p-value ≈ 0.021 (ties trigger a normal approximation and a warning)

ANOVA (Analysis of Variance)

One-Way ANOVA

Compares means across three or more groups.

# Create data with multiple groups
control <- c(20, 22, 21, 23, 19)
treatment_a <- c(25, 27, 26, 28, 24)
treatment_b <- c(30, 32, 31, 33, 29)

# Combine into data frame
data <- data.frame(
  value = c(control, treatment_a, treatment_b),
  group = rep(c("Control", "Treatment_A", "Treatment_B"), each = 5)
)

# One-way ANOVA
result <- aov(value ~ group, data = data)
print(summary(result))

# Extract F-statistic and p-value
f_stat <- summary(result)[[1]]$`F value`[1]
p_value <- summary(result)[[1]]$`Pr(>F)`[1]
print(paste("F-statistic:", f_stat, "P-value:", p_value))

Post-Hoc Tests (Multiple Comparisons)

When ANOVA is significant, determine which groups differ.

# Tukey HSD (Honestly Significant Difference)
tukey_result <- TukeyHSD(result)
print(tukey_result)

# Plot Tukey results
plot(tukey_result)

# Pairwise t-tests with Bonferroni correction
pairwise.t.test(data$value, data$group, p.adjust.method = "bonferroni")

Kruskal-Wallis Test

Non-parametric alternative to ANOVA.

# Kruskal-Wallis test (non-parametric ANOVA)
result_kw <- kruskal.test(value ~ group, data = data)
print(result_kw)
# Kruskal-Wallis chi-squared = 12.5, df = 2, p-value = 0.00193

Chi-Square Test

Tests association between categorical variables.

# Create contingency table
contingency <- matrix(c(10, 15, 20, 25), nrow = 2, ncol = 2,
                      dimnames = list(c("Yes", "No"), c("Success", "Failure")))
print(contingency)

# Chi-square test (for 2x2 tables, chisq.test applies Yates' continuity
# correction by default; pass correct = FALSE to disable it)
result <- chisq.test(contingency)
print(result)

# Extract components
chi_stat <- result$statistic
p_value <- result$p.value
print(paste("Chi-square statistic:", chi_stat, "P-value:", p_value))

# Expected frequencies
print(result$expected)
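
To quantify the strength of the association rather than just its significance, a common effect size for contingency tables is Cramér's V. A minimal helper (not part of base R):

# Cramér's V = sqrt(chi-square / (n * (min(rows, cols) - 1)))
cramers_v <- function(tab) {
  chi2 <- chisq.test(tab, correct = FALSE)$statistic  # uncorrected statistic
  n <- sum(tab)
  k <- min(dim(tab)) - 1
  as.numeric(sqrt(chi2 / (n * k)))
}
print(cramers_v(contingency))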

Correlation Tests

Pearson Correlation Test

Tests linear relationship between two continuous variables.

# Two variables
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 6)

# Pearson correlation test
result <- cor.test(x, y, method = "pearson")
print(result)

# Extract correlation coefficient and p-value
correlation <- result$estimate
p_value <- result$p.value
print(paste("Correlation:", round(correlation, 3), "P-value:", p_value))

Spearman Correlation Test

Non-parametric correlation test.

# Spearman correlation test (rank-based)
result_spearman <- cor.test(x, y, method = "spearman")
print(result_spearman)
# Because y contains tied values, R warns that the exact p-value
# cannot be computed and uses an approximation instead

Other Important Tests

Fisher’s Exact Test

Tests association in 2×2 contingency tables; preferred over the chi-square test when expected cell counts are small.

# Fisher's exact test
contingency <- matrix(c(8, 2, 1, 9), nrow = 2, ncol = 2)
result <- fisher.test(contingency)
print(result)
# The output includes the odds ratio, its confidence interval, and the p-value

Effect Sizes

Quantify practical significance beyond p-values.

# Cohen's d (effect size for t-tests)
cohens_d <- function(x1, x2) {
  n1 <- length(x1)
  n2 <- length(x2)
  var1 <- var(x1)
  var2 <- var(x2)
  pooled_sd <- sqrt(((n1-1)*var1 + (n2-1)*var2) / (n1 + n2 - 2))
  (mean(x1) - mean(x2)) / pooled_sd
}

d <- cohens_d(group1, group2)
print(paste("Cohen's d:", round(d, 3)))

# Interpretation (Cohen's conventions): |d| ≈ 0.2 (small),
#                 |d| ≈ 0.5 (medium), |d| ≈ 0.8 (large)
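
If you prefer a packaged implementation over the hand-rolled function above, the effsize package provides one (an optional alternative, not used elsewhere in this guide):

# Packaged alternative; first run install.packages("effsize")
library(effsize)
cohen.d(group1, group2)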

Assumptions Testing

Normality Tests

# Shapiro-Wilk test for normality (run on the numeric vector;
# `data` is now the data frame built in the ANOVA section)
result <- shapiro.test(data$value)
print(result)

# Visual inspection: Q-Q plot
qqnorm(data$value)
qqline(data$value)

# Histogram
hist(data$value, main = "Histogram of value")
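
A histogram is easier to judge against a reference curve; you can overlay a normal density using the sample's own mean and standard deviation:

# Histogram with an overlaid normal curve
hist(data$value, freq = FALSE, main = "Histogram with normal curve")
curve(dnorm(x, mean = mean(data$value), sd = sd(data$value)),
      add = TRUE, col = "red", lwd = 2)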

Homogeneity of Variance

# Levene's test (robust to departures from normality; from the car package)
library(car)
result <- leveneTest(value ~ factor(group), data = data)
print(result)

# Bartlett's test (assumes normality)
result_bartlett <- bartlett.test(value ~ group, data = data)
print(result_bartlett)
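
Base R also ships a rank-based variance test that is robust to non-normality, with no extra package required:

# Fligner-Killeen test (non-parametric)
result_fligner <- fligner.test(value ~ group, data = data)
print(result_fligner)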

Complete Workflow Example

# Complete hypothesis testing workflow
library(dplyr)

# 1. Load and explore data
data(mtcars)
head(mtcars)

# 2. Define hypothesis
# H₀: Mean MPG for automatic and manual transmissions are equal
# H₁: Means differ

# 3. Check assumptions
# Normality
with(mtcars, {
  print(shapiro.test(mpg[am == 0]))  # Automatic
  print(shapiro.test(mpg[am == 1]))  # Manual
})

# Homogeneity of variance
with(mtcars, bartlett.test(mpg ~ am))

# 4. Perform test
auto <- mtcars$mpg[mtcars$am == 0]
manual <- mtcars$mpg[mtcars$am == 1]
result <- t.test(auto, manual)

# 5. Interpret results
print(result)
print(paste("Mean automatic:", round(mean(auto), 2)))
print(paste("Mean manual:", round(mean(manual), 2)))
print(paste("P-value:", round(result$p.value, 4)))
print(paste("Conclusion:", ifelse(result$p.value < 0.05,
                                   "Significant difference",
                                   "No significant difference")))

# 6. Calculate effect size
d <- cohens_d(auto, manual)
print(paste("Cohen's d:", round(d, 3)))

Best Practices

  1. State hypotheses first - Before seeing data
  2. Check assumptions - Verify test requirements
  3. Choose appropriate test - Match to data type and design
  4. Report effect sizes - Not just p-values
  5. Set α before testing - 0.05 is the standard convention
  6. Adjust for multiple comparisons - When doing many tests (see the p.adjust sketch after this list)
  7. Interpret in context - Statistical vs practical significance
  8. Visualize results - Plots aid interpretation
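
For item 6, base R's p.adjust applies the common corrections to a vector of p-values; the values below are made up purely for illustration:

# Adjust hypothetical p-values for multiple comparisons
p_values <- c(0.01, 0.04, 0.03, 0.20)
p.adjust(p_values, method = "bonferroni")  # conservative family-wise control
p.adjust(p_values, method = "BH")          # Benjamini-Hochberg (FDR)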

Common Questions

Q: What if my data isn’t normal? A: Use non-parametric tests (Wilcoxon, Kruskal-Wallis) or transform data

Q: What’s a p-value? A: The probability of observing data at least as extreme as yours, assuming the null hypothesis is true. Lower = stronger evidence against H₀

Q: Should I report p-values or effect sizes? A: Report both. P-values show significance, effect sizes show practical importance

Q: How many comparisons can I do? A: Use corrections (Bonferroni) for multiple tests to control error rate

Q: What’s the difference between statistical and practical significance? A: A result can be statistically significant (p < 0.05) yet too small to matter in practice; effect sizes capture the practical side

Download R Script

Get all code examples from this tutorial: hypothesis-testing-examples.R