If you’ve ever calculated a correlation in R with cor(), you know it returns only the correlation coefficients. But in real data analysis you usually need more: p-values that tell you whether those correlations are statistically significant.

That’s where rcorr() comes in. It’s part of the Hmisc package and gives you both the correlation matrix AND the p-values in one shot.

In this guide, I’ll show you:

  • How rcorr() works and when to use it
  • Calculating Pearson and Spearman correlations with p-values
  • Understanding the output format
  • Comparing rcorr() vs cor()
  • Visualizing correlation matrices
  • Common mistakes and troubleshooting

By the end, you’ll have a powerful tool for correlation analysis that goes beyond what cor() offers.

Prerequisites

  • R 3.0 or higher
  • Hmisc package (install with install.packages("Hmisc"))
  • Basic understanding of correlation and p-values
  • Familiarity with data frames

Why rcorr() Over cor()?

Let’s compare the two approaches:

Using cor() - Just Correlations

data(mtcars)

# cor() gives you just correlation coefficients
cor(mtcars[1:3])

# Output:
#           mpg       cyl      disp
# mpg    1.00000 -0.85216 -0.84755
# cyl   -0.85216  1.00000  0.90203
# disp  -0.84755  0.90203  1.00000

Notice: You get the numbers, but no p-values. You don’t know if -0.852 is statistically significant or just noise.

Using rcorr() - Correlations + P-Values

library(Hmisc)

# rcorr() gives correlation AND p-values
rcorr(as.matrix(mtcars[1:3]))

# Output:
#       mpg   cyl  disp
# mpg   1.00 -0.85 -0.85
# cyl  -0.85  1.00  0.90
# disp -0.85  0.90  1.00
#
# n= 32
#
# P
#      mpg cyl disp
# mpg       0   0
# cyl    0      0
# disp   0   0

Much better! Now you see:

  • Correlation values
  • P-values (all <0.0001, so very significant)
  • Sample size (n=32)

Installation

First, install the Hmisc package if you don’t have it:

# One-time installation
install.packages("Hmisc")

# Load the package
library(Hmisc)

Basic Usage: Pearson Correlation

The simplest approach:

library(Hmisc)

# Create sample data
data <- data.frame(
  height = c(170, 172, 168, 175, 180, 165, 178, 182),
  weight = c(70, 75, 65, 80, 90, 62, 88, 95),
  age = c(25, 28, 22, 30, 35, 20, 32, 38)
)

# Calculate correlation matrix with p-values
rcorr(as.matrix(data))

Important: rcorr() requires a matrix, not a data frame. Use as.matrix() to convert.

Understanding the Output

Let’s break down what rcorr() returns:

result <- rcorr(as.matrix(data))

# View correlation matrix
result$r

# Output:
#          height  weight     age
# height  1.00000 0.96890 0.99235
# weight  0.96890 1.00000 0.98361
# age     0.99235 0.98361 1.00000

# View p-value matrix
result$P

# Output:
#           height   weight      age
# height           0.000750 0.000042
# weight  0.000750          0.000088
# age     0.000042 0.000088

What this means:

  • result$r: Correlation coefficients (ranging from -1 to +1)
  • result$P: P-values for each correlation
  • Low p-values (<0.05) = statistically significant correlation
  • High p-values (>0.05) = correlation could be due to chance

In this example, all correlations are highly significant (p < 0.001).
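To act on these matrices programmatically, you can build a logical significance mask from result$P. A small sketch, reusing the sample data from above:

```r
library(Hmisc)

data <- data.frame(
  height = c(170, 172, 168, 175, 180, 165, 178, 182),
  weight = c(70, 75, 65, 80, 90, 62, 88, 95),
  age    = c(25, 28, 22, 30, 35, 20, 32, 38)
)

result <- rcorr(as.matrix(data))

# TRUE where p < 0.05; the diagonal stays NA because rcorr()
# does not compute a p-value for a variable with itself
significant <- result$P < 0.05
significant
```

The resulting matrix can then be used to filter the coefficients, e.g. result$r[which(significant)].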

Method 1: Pearson Correlation (Default)

Pearson measures linear relationships between continuous variables:

library(Hmisc)

# Sample data
sales_data <- data.frame(
  advertising_spend = c(1000, 2000, 1500, 3000, 2500, 1200, 2800, 3200),
  sales_revenue = c(5000, 12000, 8000, 18000, 15000, 6000, 16000, 19000),
  store_size = c(100, 200, 150, 300, 250, 120, 280, 320)
)

# Calculate Pearson correlations with p-values
result <- rcorr(as.matrix(sales_data), type = "pearson")

print(result)

# Output shows correlation matrix and p-values

Interpreting Results

result$r
#                   advertising_spend sales_revenue store_size
# advertising_spend            1.0000        0.9845     0.9982
# sales_revenue                0.9845        1.0000     0.9651
# store_size                   0.9982        0.9651     1.0000

result$P
#                   advertising_spend sales_revenue store_size
# advertising_spend                        0.00008  0.00000001
# sales_revenue                0.00008                  0.0021
# store_size                0.00000001       0.0021

Interpretation:

  • Advertising spend & sales revenue: r = 0.98, p < 0.0001 (Very strong positive correlation)
  • Advertising spend & store size: r = 0.998, p < 0.0001 (Nearly perfect correlation)
  • Store size & sales revenue: r = 0.97, p = 0.0021 (Strong positive correlation)

All relationships are statistically significant.

Method 2: Spearman Correlation

Use Spearman when data is non-normal or you have ordinal variables (ranks, ratings):

# Data with ordinal variables (ratings)
customer_data <- data.frame(
  satisfaction_rating = c(1, 2, 4, 5, 3, 5, 4, 2),  # 1-5 scale
  customer_service_rating = c(2, 3, 4, 5, 3, 5, 4, 2),
  likelihood_to_recommend = c(1, 2, 4, 5, 3, 5, 4, 1)
)

# Spearman correlation (for ordinal/non-normal data)
rcorr(as.matrix(customer_data), type = "spearman")

# Output includes both correlation and p-values

When to Use Spearman vs Pearson

Situation                           Use
Continuous, normally distributed    Pearson
Ordinal data (ranks, ratings)       Spearman
Non-normal continuous data          Spearman
Curved but monotonic relationships  Spearman
Outliers present                    Spearman

# Example: Non-normal data with outlier
income_data <- data.frame(
  income = c(30000, 35000, 40000, 45000, 50000, 55000, 60000, 500000),  # Last one is outlier
  education_level = c(1, 2, 2, 3, 3, 4, 4, 4)  # Years of education (ordinal)
)

# Pearson might be affected by outlier
pearson_result <- rcorr(as.matrix(income_data), type = "pearson")

# Spearman is more robust
spearman_result <- rcorr(as.matrix(income_data), type = "spearman")

# Spearman p-value is typically more reliable here

Method 3: Kendall Correlation (Not in rcorr())

Note: rcorr() only supports type = "pearson" and type = "spearman". For Kendall's tau, use base R instead:

# Kendall correlation matrix (coefficients only)
cor(sales_data, method = "kendall")

# Kendall correlation with a p-value, one pair at a time
cor.test(sales_data$advertising_spend, sales_data$sales_revenue,
         method = "kendall")

Kendall is more conservative than Spearman and is often preferred for very small samples or data with many ties.

Advanced: Extracting and Formatting Results

Extract Specific Correlations

result <- rcorr(as.matrix(sales_data))

# Get correlation between two variables
result$r["advertising_spend", "sales_revenue"]  # [1] 0.9845

# Get p-value between two variables
result$P["advertising_spend", "sales_revenue"]  # [1] 0.00008

Create a Formatted Table

library(Hmisc)

result <- rcorr(as.matrix(sales_data))

# Combine correlation and p-values
correlation_table <- data.frame(
  Variable1 = rep(rownames(result$r), ncol(result$r)),
  Variable2 = rep(colnames(result$r), each = nrow(result$r)),
  Correlation = as.vector(result$r),
  PValue = as.vector(result$P),
  Significant = ifelse(as.vector(result$P) < 0.05, "Yes", "No")
)

# Remove diagonal (correlation of variable with itself)
correlation_table <- correlation_table[correlation_table$Variable1 != correlation_table$Variable2, ]

print(correlation_table)

Round for Readability

# Round the correlation and p-value matrices for display
round(result$r, 3)
round(result$P, 4)

Handling Special Cases

Missing Values

# Data with missing values (NA)
data_with_na <- data.frame(
  x = c(1, 2, NA, 4, 5, 6, 7, 8),
  y = c(2, 4, 5, NA, 6, 8, 9, 10),
  z = c(1, 1, 2, 3, 3, 4, 5, 5)
)

# rcorr() handles NAs, but uses pairwise deletion by default
result <- rcorr(as.matrix(data_with_na))

# Check sample size for each correlation
# (it's shown as 'n' at the bottom of output)

Important: When you have missing data, rcorr() uses pairwise deletion - it removes only the missing pairs for each correlation, not entire rows. This can lead to different sample sizes for different correlations.
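You can inspect those varying sample sizes directly: rcorr() stores a pairwise-n matrix in the result. Continuing the data_with_na example from above:

```r
library(Hmisc)

data_with_na <- data.frame(
  x = c(1, 2, NA, 4, 5, 6, 7, 8),
  y = c(2, 4, 5, NA, 6, 8, 9, 10),
  z = c(1, 1, 2, 3, 3, 4, 5, 5)
)

result <- rcorr(as.matrix(data_with_na))

# result$n holds the number of complete pairs behind each correlation
result$n
# x-y is based on 6 pairs (x and y each have one NA, in different rows),
# while x-z and y-z are based on 7 pairs each
```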

Very Large Data Frames

For very large datasets, consider subsetting first:

# Select only numeric columns
numeric_cols <- sapply(large_df, is.numeric)
numeric_data <- large_df[, numeric_cols]

# Calculate correlation for subset
result <- rcorr(as.matrix(numeric_data))

Visualizing Correlations

Correlation Heatmap

library(Hmisc)
library(corrplot)  # Install if needed

result <- rcorr(as.matrix(sales_data))

# Create heatmap
corrplot(result$r, p.mat = result$P,
         sig.level = 0.05,  # cross out correlations with p >= 0.05
         method = "color",
         addCoef.col = "black",  # show coefficient values
         diag = FALSE)  # don't show the diagonal

Simple Visualization

# Plot correlations in simple format
plot_correlations <- function(result) {
  # Indices of the upper triangle (each variable pair appears once)
  upper_triangle <- upper.tri(result$r)
  idx <- which(upper_triangle, arr.ind = TRUE)

  correlation_data <- data.frame(
    var1 = rownames(result$r)[idx[, 1]],
    var2 = colnames(result$r)[idx[, 2]],
    correlation = result$r[upper_triangle],
    p_value = result$P[upper_triangle]
  )

  print(correlation_data)
}

plot_correlations(result)

Comparing rcorr() vs cor()

Feature              rcorr()   cor()
P-values             ✅ Yes    ❌ No
Pearson              ✅ Yes    ✅ Yes
Spearman             ✅ Yes    ✅ Yes
Input type           Matrix    Matrix or data frame
Handles NAs          ✅ Yes    ✅ Via the use= argument
Performance          Good      Faster
Easy interpretation  ✅ Yes    More setup

Use rcorr() when you need p-values or are already working with Hmisc functions. Use cor() when you only need the coefficients or want maximum speed.

Common Mistakes & Troubleshooting

Mistake 1: Forgetting as.matrix()

# WRONG - will cause error
rcorr(sales_data)  # Error!

# RIGHT
rcorr(as.matrix(sales_data))

rcorr() specifically requires a matrix, not a data frame.

Mistake 2: Correlating Non-Numeric Columns

# If data frame includes character columns
mixed_data <- data.frame(
  store_name = c("A", "B", "C"),
  revenue = c(1000, 2000, 1500),
  customers = c(50, 100, 75)
)

# WRONG
rcorr(as.matrix(mixed_data))  # Error - can't correlate text!

# RIGHT - select only numeric columns
numeric_cols <- sapply(mixed_data, is.numeric)
rcorr(as.matrix(mixed_data[, numeric_cols]))

Mistake 3: Misinterpreting P-Values

# If a p-value shows as NA, don't panic
# rcorr() leaves the diagonal (a variable with itself) as NA, and NA can
# also appear when a correlation is undefined (e.g., a zero-variance column)

result <- rcorr(as.matrix(data))
# Some p-values in result$P will be NA - this is normal

# Only interpret the non-NA p-values

Mistake 4: Forgetting to Install Hmisc

# This will fail if Hmisc isn't installed
library(Hmisc)  # Error: there is no package called 'Hmisc'

# Solution
install.packages("Hmisc")
library(Hmisc)

FAQ

Q: What’s the difference between Pearson and Spearman? A: Pearson measures linear relationships for continuous data. Spearman works with ordinal data or non-normal data by ranking values first.

Q: How do I know if a correlation is significant? A: Look at the p-value. If p < 0.05, it’s typically considered statistically significant. But also look at the correlation strength (0.9 is strong, 0.3 is weak).

Q: Can I use rcorr() with factors? A: No, factors need to be converted to numeric first: as.numeric(as.character(factor_column))
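For example, with a hypothetical factor of numeric labels, going through as.character avoids silently getting the internal level codes:

```r
# Hypothetical values stored as a factor
ratings <- factor(c("2", "10", "5"))

# WRONG: returns the internal level codes ("10" sorts before "2")
as.numeric(ratings)                # 2 1 3

# RIGHT: convert labels to character first, then to numeric
as.numeric(as.character(ratings))  # 2 10 5
```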

Q: What does “n=” mean at the bottom of rcorr output? A: It’s the sample size used for calculations. With missing data, this might vary by correlation (pairwise deletion).

Q: Why are some p-values NA? A: This happens when a correlation cannot be calculated - for example, on the diagonal, or when a column has zero variance (only one unique value).

Q: Can I correlate more than 3 variables at once? A: Yes! rcorr() works with any number of columns. You get a full correlation matrix.

Q: Is rcorr() more accurate than cor()? A: They calculate the same correlations. rcorr() just also gives you p-values, making it more informative.

Best Practices

  1. Always check for missing data - use sum(is.na(df)) first
  2. Choose the right correlation type - check if data is normal (use Pearson) or ordinal/non-normal (use Spearman)
  3. Consider correlation strength, not just p-values - a weak correlation can be significant with large samples
  4. Document your choice - which correlation type did you use and why?
  5. Watch out for outliers - they can distort Pearson; Spearman is more robust
  6. Report both correlation and p-value - “r = 0.85, p = 0.002” tells the full story
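Practices 1 and 6 together might look like this in code (a small sketch with made-up data):

```r
library(Hmisc)

df <- data.frame(
  x = c(2, 4, 6, 8, 10, 12, 14, 16),
  y = c(1, 3, 4, 7, 9, 11, 12, 15)
)

# 1. Check for missing data before correlating
sum(is.na(df))

# 6. Report correlation and p-value together
res <- rcorr(as.matrix(df))
sprintf("r = %.2f, p = %.4f", res$r["x", "y"], res$P["x", "y"])
```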

Download R Script

Get all code examples from this tutorial: rcorr-examples.R