Probability and combinatorics are foundational to statistics, data science, and machine learning. Whether you’re calculating the likelihood of events, generating random samples, simulating experiments, or understanding statistical distributions, R provides comprehensive tools for probability analysis. Mastering these concepts enables you to build robust statistical models and make data-driven decisions.

This comprehensive guide covers all essential probability and combinatorics concepts with practical R implementations.

Combinations and Permutations

Combinations

Combinations count the number of ways to choose items from a set where order doesn’t matter.

# Calculate combinations C(n, k) = n! / (k!(n-k)!)
choose(8, 5)           # [1] 56
# Ways to choose 5 items from 8

# More combinations
choose(10, 3)          # [1] 120
choose(5, 2)           # [1] 10
choose(52, 5)          # [1] 2598960 (5-card poker hand)

# Verify combination property
choose(5, 2)           # [1] 10
choose(5, 3)           # [1] 10 (choosing 3 is same as leaving 2)

# Generate all combinations (combn() is in base R's utils; no package needed)
combn(c("A", "B", "C"), 2)
#      [,1] [,2] [,3]
# [1,] "A"  "A"  "B"
# [2,] "B"  "C"  "C"

Permutations

Permutations count arrangements where order matters.

# Full arrangement count: P(n, n) = n!
factorial(5)           # [1] 120 (arrangements of 5 items)

# Permutations formula: P(n,k) = n! / (n-k)!
n <- 8
k <- 5
permutations <- factorial(n) / factorial(n - k)
print(permutations)    # [1] 6720

# Using choose and factorial relationship
# P(n,k) = C(n,k) × k!
choose(8, 5) * factorial(5)  # [1] 6720

# Practical example: PIN arrangements
# 4-digit PIN with no repeated digits, drawn from 10 digits (0-9)
factorial(10) / factorial(10 - 4)  # [1] 5040
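Beyond counting permutations, you can enumerate the arrangements themselves. A sketch using permn() from the combinat package (assumed to be installed from CRAN):

```r
# Enumerate all permutations of a small set with combinat::permn()
library(combinat)
perms <- permn(c("A", "B", "C"))
length(perms)   # [1] 6, which equals factorial(3)
perms[[1]]      # one arrangement, e.g. "A" "B" "C"
```

Each list element is one ordering, so the list length matches the factorial count computed above.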

Practical Applications

# Lottery probability
# Winning 6/49 lottery
ways_to_win <- choose(49, 6)
probability_jackpot <- 1 / ways_to_win
print(probability_jackpot)  # [1] 7.151124e-08 (about 1 in 13,983,816)

# Committee selection
# Choose 5 people from 12 for committee
committee_ways <- choose(12, 5)
print(committee_ways)  # [1] 792

# Seating arrangements
# Arrange 5 people in 5 seats
arrangements <- factorial(5)
print(arrangements)  # [1] 120

Probability Distributions

Binomial Distribution

Models the number of successes in a fixed number of independent trials.

# Binomial: n trials, p probability of success

# Probability mass function (PMF)
# P(X = k) for k successes in n trials
dbinom(x = 3, size = 10, prob = 0.5)  # P(X = 3)

# Cumulative distribution function (CDF)
# P(X ≤ k)
pbinom(q = 3, size = 10, prob = 0.5)  # P(X ≤ 3)

# Quantile function
# Smallest k such that P(X ≤ k) ≥ 0.5
qbinom(p = 0.5, size = 10, prob = 0.5)

# Generate random samples
rbinom(n = 100, size = 10, prob = 0.5)  # 100 samples

# Example: Coin flips
# 8 heads in 10 fair coin flips?
dbinom(8, size = 10, prob = 0.5)  # [1] 0.04394531

Normal Distribution

Models continuous data that follows the familiar bell curve.

# Normal distribution: mean μ, standard deviation σ

# Probability density function (PDF)
dnorm(x = 0, mean = 0, sd = 1)    # Standard normal at x=0

# Cumulative distribution function (CDF)
pnorm(q = 1.96, mean = 0, sd = 1)  # P(X ≤ 1.96) ≈ 0.975

# Quantile function
qnorm(p = 0.975, mean = 0, sd = 1)  # [1] 1.959964 (≈ 1.96)

# Generate random samples
rnorm(n = 100, mean = 100, sd = 15)

# 95% confidence interval for standard normal
qnorm(0.025, mean = 0, sd = 1)    # Lower bound ≈ -1.96
qnorm(0.975, mean = 0, sd = 1)    # Upper bound ≈ 1.96

# Example: IQ scores (mean=100, sd=15)
# What proportion have IQ > 130?
1 - pnorm(130, mean = 100, sd = 15)  # [1] 0.02275013

Poisson Distribution

Models the count of events in a fixed interval of time or space.

# Poisson: λ (lambda) = expected count

# PMF: P(X = k)
dpois(x = 3, lambda = 2)   # P(X = 3) with λ=2

# CDF: P(X ≤ k)
ppois(q = 3, lambda = 2)   # P(X ≤ 3)

# Quantile function
qpois(p = 0.5, lambda = 2)

# Random samples
rpois(n = 100, lambda = 5)

# Example: Defects per batch (average 3 per 100 units)
dpois(5, lambda = 3)  # P(exactly 5 defects)
ppois(5, lambda = 3)  # P(5 or fewer defects)
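The Poisson distribution also approximates the binomial when the number of trials is large and the success probability is small, with λ = n × p. A quick numeric check (the specific n and p here are illustrative choices):

```r
# Poisson approximation to the binomial: lambda = n * p
n <- 1000
p <- 0.003                               # many trials, rare success; lambda = 3
exact  <- dbinom(5, size = n, prob = p)  # exact binomial P(X = 5)
approx <- dpois(5, lambda = n * p)       # Poisson approximation
c(exact, approx)                         # the two values are very close
```

This is one reason Poisson is the go-to model for counts of rare events.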

Other Important Distributions

# Exponential (continuous waiting times)
rexp(n = 100, rate = 0.5)

# Uniform (equal probability over range)
runif(n = 100, min = 0, max = 1)

# Chi-square (test statistic distribution)
rchisq(n = 100, df = 5)

# t-distribution (small sample inference)
rt(n = 100, df = 10)

# F-distribution (comparing variances)
rf(n = 100, df1 = 5, df2 = 10)
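As a sanity check on the exponential's waiting-time interpretation, the sample mean of simulated waits should land near 1/rate (the seed and sample size here are arbitrary choices):

```r
# Mean of Exponential(rate) is 1 / rate; verify by simulation
set.seed(1)
waits <- rexp(n = 10000, rate = 0.5)  # e.g. times between arrivals
mean(waits)                           # close to 1 / 0.5 = 2
```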

Random Sampling

Random Number Generation

# Set seed for reproducibility
set.seed(123)

# Uniform random numbers [0, 1)
runif(5)  # [1] 0.2875775 0.7883051 0.4089769 0.8830174 0.9404673

# Reproducible results
set.seed(123)
runif(5)  # Same values

# Different range
runif(5, min = 0, max = 100)  # [0, 100]

# Normal random numbers
rnorm(5, mean = 0, sd = 1)

# Integer sampling
sample(1:10, size = 5)  # Random sample of 5 from 1-10 (without replacement by default)
sample(1:10, size = 5, replace = FALSE)  # Explicitly without replacement
sample(1:10, size = 15, replace = TRUE)  # With replacement (allows repeats)

Sampling from Data

# Sample from vector
x <- c(10, 20, 30, 40, 50)
sample(x, size = 3)        # Random 3 elements

# Weighted sampling
prob_weights <- c(0.1, 0.2, 0.3, 0.2, 0.2)
sample(x, size = 100, replace = TRUE, prob = prob_weights)

# Bootstrap sampling
data <- rnorm(50)
bootstrap_samples <- replicate(1000, mean(sample(data, replace = TRUE)))
hist(bootstrap_samples)  # Distribution of bootstrap means

Probability Calculations

Basic Probability

# Rolling dice
# Probability of rolling a 6
prob_six <- 1/6
print(prob_six)  # [1] 0.1666667

# Multiple events
# P(rolling 6 three times in a row)
prob_three_sixes <- (1/6)^3
print(prob_three_sixes)

# Probability at least one 6 in 3 rolls
prob_at_least_one <- 1 - (5/6)^3
print(prob_at_least_one)
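Analytic answers like the one above can be cross-checked by simulation. A minimal sketch (seed and simulation count are arbitrary):

```r
# Monte Carlo check of P(at least one 6 in 3 rolls)
set.seed(99)
n_sim <- 100000
# One row per simulated experiment of 3 rolls
rolls <- matrix(sample(1:6, 3 * n_sim, replace = TRUE), ncol = 3)
mean(rowSums(rolls == 6) > 0)  # close to 1 - (5/6)^3 ≈ 0.4213
```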

Conditional Probability

# P(A and B) / P(B) = P(A|B)
# Example: Disease testing
# P(Disease) = 0.01
# P(Positive | Disease) = 0.99
# P(Positive | No Disease) = 0.05

p_disease <- 0.01
p_pos_given_disease <- 0.99
p_pos_given_no_disease <- 0.05

# P(Positive)
p_positive <- (p_disease * p_pos_given_disease) +
              ((1 - p_disease) * p_pos_given_no_disease)

# P(Disease | Positive) - using Bayes' theorem
p_disease_given_positive <- (p_pos_given_disease * p_disease) / p_positive
print(p_disease_given_positive)  # Much lower than 0.99!
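The Bayes calculation can be confirmed by simulating the testing scenario directly; among simulated positives, the fraction who actually have the disease should match the analytic answer (seed and population size are arbitrary):

```r
# Simulate the disease-testing scenario to check the Bayes result
set.seed(7)
n <- 1e6
disease  <- rbinom(n, 1, 0.01)                              # 1% prevalence
positive <- rbinom(n, 1, ifelse(disease == 1, 0.99, 0.05))  # test outcome
mean(disease[positive == 1])   # close to the analytic value of about 0.167
```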

Set Operations

Basic Set Operations

# Create sets
A <- c(1, 2, 3, 4, 5)
B <- c(4, 5, 6, 7, 8)

# Union (A or B)
union(A, B)  # [1] 1 2 3 4 5 6 7 8

# Intersection (A and B)
intersect(A, B)  # [1] 4 5

# Difference (in A but not B)
setdiff(A, B)  # [1] 1 2 3

# Membership and subset tests
is.element(4, A)     # [1] TRUE (is 4 a member of A?)
all(c(1, 2) %in% A)  # [1] TRUE (is {1, 2} a subset of A?)
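Base R has no built-in symmetric-difference function, but one composes naturally from setdiff() and union(); setequal() compares sets while ignoring order and duplicates:

```r
A <- c(1, 2, 3, 4, 5)
B <- c(4, 5, 6, 7, 8)
# Symmetric difference: elements in exactly one of the two sets
union(setdiff(A, B), setdiff(B, A))  # [1] 1 2 3 6 7 8
# Compare sets ignoring order
setequal(c(1, 2, 3), c(3, 2, 1))     # [1] TRUE
```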

Practical Set Applications

# Students in classes
Math <- c("Alice", "Bob", "Charlie", "Diana")
Science <- c("Bob", "Diana", "Eve", "Frank")

# Students in both classes
intersect(Math, Science)  # [1] "Bob" "Diana"

# Students in either class
union(Math, Science)  # All unique students

# In Math but not Science
setdiff(Math, Science)  # [1] "Alice" "Charlie"

Simulations

Simple Simulation

# Simulate coin flips
set.seed(42)
flips <- rbinom(n = 10000, size = 1, prob = 0.5)
mean(flips)  # ≈ 0.5 (proportion of heads)

# Simulate rolling dice 1000 times
rolls <- sample(1:6, size = 1000, replace = TRUE)
table(rolls)  # Distribution of outcomes

Monte Carlo Simulation

# Estimate π using random sampling
set.seed(123)
n_points <- 100000

# Generate random points in unit square
x <- runif(n_points)
y <- runif(n_points)

# Distance from origin
distance <- sqrt(x^2 + y^2)

# Points inside unit circle
inside_circle <- sum(distance <= 1)

# π ≈ 4 × (points inside circle) / (total points)
pi_estimate <- 4 * inside_circle / n_points
print(pi_estimate)  # Close to 3.14159

Bootstrap Confidence Interval

# Data with unknown distribution
data <- c(23, 45, 12, 56, 34, 67, 28, 45, 56, 34)

# Bootstrap to estimate CI for mean
set.seed(42)
bootstrap_means <- replicate(10000, mean(sample(data, replace = TRUE)))

# 95% confidence interval
ci_lower <- quantile(bootstrap_means, 0.025)
ci_upper <- quantile(bootstrap_means, 0.975)
print(c(ci_lower, ci_upper))

Probability Distribution Functions Reference

# For each distribution type:
# d* = density/mass function (PDF/PMF)
# p* = cumulative distribution function (CDF)
# q* = quantile function (inverse CDF)
# r* = random number generator

# Binomial
dbinom(x, size, prob)
pbinom(q, size, prob)
qbinom(p, size, prob)
rbinom(n, size, prob)

# Normal
dnorm(x, mean, sd)
pnorm(q, mean, sd)
qnorm(p, mean, sd)
rnorm(n, mean, sd)

# Poisson
dpois(x, lambda)
ppois(q, lambda)
qpois(p, lambda)
rpois(n, lambda)

# Exponential
dexp(x, rate)
pexp(q, rate)
qexp(p, rate)
rexp(n, rate)

# Uniform
dunif(x, min, max)
punif(q, min, max)
qunif(p, min, max)
runif(n, min, max)

Best Practices

  1. Set seed for reproducibility - Use set.seed() for consistent random results
  2. Understand your distribution - Choose distribution matching your data
  3. Verify assumptions - Check normality before using normal distribution
  4. Use appropriate functions - d* for density, p* for probabilities, r* for sampling
  5. Large sample simulations - More iterations increase accuracy
  6. Document randomness - Note seed used for reproducibility
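Point 4 can be made concrete: for any distribution, the q* function inverts the p* function, which is a handy self-check when you are unsure which function you need:

```r
# q* is the inverse of p*: pnorm(qnorm(p)) returns p
p <- 0.975
q <- qnorm(p, mean = 0, sd = 1)  # roughly 1.959964
pnorm(q, mean = 0, sd = 1)       # [1] 0.975
```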

Common Questions

Q: What’s the difference between probability and likelihood? A: Probability describes how likely outcomes are when the parameters are fixed; likelihood measures how well a candidate parameter value explains data that have already been observed
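A small example makes the distinction concrete: the same dbinom() call computes a probability when the parameter is fixed, and a likelihood when the data are fixed and the parameter varies (the grid of p values here is illustrative):

```r
# Probability: parameter fixed at 0.5, evaluate an outcome
dbinom(8, size = 10, prob = 0.5)

# Likelihood: data fixed (8 heads in 10 flips), vary the parameter p
p_grid <- seq(0, 1, by = 0.1)
lik <- dbinom(8, size = 10, prob = p_grid)
p_grid[which.max(lik)]  # [1] 0.8, the maximum-likelihood estimate (8/10)
```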

Q: When should I use combinations vs permutations? A: Combinations when order doesn’t matter (choosing), permutations when order matters (arranging)

Q: How do I know which distribution to use? A: Binomial for successes in a fixed number of trials, Poisson for counts of rare events, Normal for continuous bell-shaped data

Q: Why set a seed? A: Ensures reproducible results when using random numbers in analysis

Q: What’s the difference between sampling with and without replacement? A: Without replacement: items get removed after selection; With replacement: items can be selected again

Download R Script

Get all code examples from this tutorial: probability-combinatorics-examples.R