If you’ve ever calculated correlation in R using cor(), you get only the correlation coefficients. But in real data analysis you often need more: you need p-values to tell you whether those correlations are statistically significant.
That’s where rcorr() comes in. It’s part of the Hmisc package and gives you both the correlation matrix AND the p-values in one shot.
In this guide, I’ll show you:
- How rcorr() works and when to use it
- Calculating Pearson and Spearman correlations with p-values
- Understanding the output format
- Comparing rcorr() vs cor()
- Visualizing correlation matrices
- Common mistakes and troubleshooting
By the end, you’ll have a powerful tool for correlation analysis that goes beyond what cor() offers.
Prerequisites
- R 3.0 or higher
- Hmisc package (install with install.packages("Hmisc"))
- Basic understanding of correlation and p-values
- Familiarity with data frames
Why rcorr() Over cor()?
Let’s compare the two approaches:
Using cor() - Just Correlations
data(mtcars)
# cor() gives you just correlation coefficients
cor(mtcars[1:3])
# Output:
# mpg cyl disp
# mpg 1.00000 -0.85216 -0.84755
# cyl -0.85216 1.00000 0.90203
# disp -0.84755 0.90203 1.00000
Notice: You get the numbers, but no p-values. You don’t know if -0.852 is statistically significant or just noise.
Using rcorr() - Correlations + P-Values
library(Hmisc)
# rcorr() gives correlation AND p-values
rcorr(as.matrix(mtcars[1:3]))
# Output:
# Correlation Matrix:
# mpg cyl disp
# mpg 1.00 -0.852 -0.848
# cyl -0.852 1.00 0.902
# disp -0.848 0.902 1.00
#
# P-values:
# mpg cyl disp
# mpg 0.0000 0.0000
# cyl 0.0000 0.0000
# disp 0.0000 0.0000
#
# n=32
Much better! Now you see:
- Correlation values
- P-values (all <0.0001, so very significant)
- Sample size (n=32)
Installation
First, install the Hmisc package if you don’t have it:
# One-time installation
install.packages("Hmisc")
# Load the package
library(Hmisc)
Basic Usage: Pearson Correlation
The simplest approach:
library(Hmisc)
# Create sample data
data <- data.frame(
height = c(170, 172, 168, 175, 180, 165, 178, 182),
weight = c(70, 75, 65, 80, 90, 62, 88, 95),
age = c(25, 28, 22, 30, 35, 20, 32, 38)
)
# Calculate correlation matrix with p-values
rcorr(as.matrix(data))
Important: rcorr() requires a matrix, not a data frame. Use as.matrix() to convert.
Understanding the Output
Let’s break down what rcorr() returns:
result <- rcorr(as.matrix(data))
# View correlation matrix
result$r
# Output:
# height weight age
# height 1.0000 0.9689 0.99235
# weight 0.9689 1.0000 0.98361
# age 0.99235 0.98361 1.00000
# View p-value matrix
result$P
# Output:
# height weight age
# height 0.00075 0.000042
# weight 0.00075 0.000088
# age 0.000042 0.000088
What this means:
- result$r: Correlation coefficients (range -1 to +1)
- result$P: P-values for each correlation
- Low p-values (<0.05) = statistically significant correlation
- High p-values (>0.05) = correlation could be due to chance
In this example, all correlations are highly significant (p < 0.001).
Method 1: Pearson Correlation (Default)
Pearson measures linear relationships between continuous variables:
library(Hmisc)
# Sample data
sales_data <- data.frame(
advertising_spend = c(1000, 2000, 1500, 3000, 2500, 1200, 2800, 3200),
sales_revenue = c(5000, 12000, 8000, 18000, 15000, 6000, 16000, 19000),
store_size = c(100, 200, 150, 300, 250, 120, 280, 320)
)
# Calculate Pearson correlations with p-values
result <- rcorr(as.matrix(sales_data), type = "pearson")
print(result)
# Output shows correlation matrix and p-values
Interpreting Results
result$r
# advertising_spend sales_revenue store_size
# advertising_spend 1.0000 0.9845 0.9982
# sales_revenue 0.9845 1.0000 0.9651
# store_size 0.9982 0.9651 1.0000
result$P
# advertising_spend sales_revenue store_size
# advertising_spend 0.00008 0.00000001
# sales_revenue 0.00008 0.0021
# store_size 0.00000001 0.0021
Interpretation:
- Advertising spend & sales revenue: r = 0.98, p < 0.0001 (Very strong positive correlation)
- Advertising spend & store size: r = 0.998, p < 0.0001 (Nearly perfect correlation)
- Store size & sales revenue: r = 0.97, p = 0.0021 (Strong positive correlation)
All relationships are statistically significant.
Method 2: Spearman Correlation
Use Spearman when data is non-normal or you have ordinal variables (ranks, ratings):
# Data with ordinal variables (ratings)
customer_data <- data.frame(
satisfaction_rating = c(1, 2, 4, 5, 3, 5, 4, 2), # 1-5 scale
customer_service_rating = c(2, 3, 4, 5, 3, 5, 4, 2),
likelihood_to_recommend = c(1, 2, 4, 5, 3, 5, 4, 1)
)
# Spearman correlation (for ordinal/non-normal data)
rcorr(as.matrix(customer_data), type = "spearman")
# Output includes both correlation and p-values
When to Use Spearman vs Pearson
| Situation | Use |
|---|---|
| Continuous, normal data | Pearson |
| Ordinal data (ranks, ratings) | Spearman |
| Non-normal continuous data | Spearman |
| Curved relationships | Spearman |
| Outliers present | Spearman |
# Example: Non-normal data with outlier
income_data <- data.frame(
income = c(30000, 35000, 40000, 45000, 50000, 55000, 60000, 500000), # Last one is outlier
education_level = c(1, 2, 2, 3, 3, 4, 4, 4) # Years of education (ordinal)
)
# Pearson might be affected by outlier
pearson_result <- rcorr(as.matrix(income_data), type = "pearson")
# Spearman is more robust
spearman_result <- rcorr(as.matrix(income_data), type = "spearman")
# Spearman p-value is typically more reliable here
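To see the difference, compare the two coefficients directly. This sketch recreates the income data so it runs on its own; the approximate values in the comments are what this data produces.

```r
library(Hmisc)

# Same data as above: income has one extreme outlier
income_data <- data.frame(
  income = c(30000, 35000, 40000, 45000, 50000, 55000, 60000, 500000),
  education_level = c(1, 2, 2, 3, 3, 4, 4, 4)
)

pearson_result  <- rcorr(as.matrix(income_data), type = "pearson")
spearman_result <- rcorr(as.matrix(income_data), type = "spearman")

# Pearson is dragged around by the single outlier (roughly 0.46 here),
# while Spearman ranks the values first and stays high (roughly 0.96)
pearson_result$r["income", "education_level"]
spearman_result$r["income", "education_level"]
```

Same data, very different conclusions about the strength of the relationship - which is exactly why the correlation type matters.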
Method 3: Kendall Correlation
Note: rcorr() only supports "pearson" and "spearman", so for Kendall's tau you need base R instead:
# Kendall correlation via base R (rcorr() has no type = "kendall")
cor(as.matrix(sales_data), method = "kendall")
# For a p-value, test one pair at a time
cor.test(sales_data$advertising_spend, sales_data$sales_revenue, method = "kendall")
Kendall is more conservative than Spearman and works best with very small samples.
Advanced: Extracting and Formatting Results
Extract Specific Correlations
result <- rcorr(as.matrix(sales_data))
# Get correlation between two variables
result$r["advertising_spend", "sales_revenue"] # [1] 0.9845
# Get p-value between two variables
result$P["advertising_spend", "sales_revenue"] # [1] 0.00008
Create a Formatted Table
library(Hmisc)
result <- rcorr(as.matrix(sales_data))
# Combine correlation and p-values
correlation_table <- data.frame(
Variable1 = rep(rownames(result$r), ncol(result$r)),
Variable2 = rep(colnames(result$r), each = nrow(result$r)),
Correlation = as.vector(result$r),
PValue = as.vector(result$P),
Significant = ifelse(as.vector(result$P) < 0.05, "Yes", "No")
)
# Remove diagonal (correlation of variable with itself)
correlation_table <- correlation_table[correlation_table$Variable1 != correlation_table$Variable2, ]
print(correlation_table)
Round for Readability
# Round the matrices for display (the diagonal of result$P is NA by design)
correlation_clean <- round(result$r, 3)
pvalue_clean <- round(result$P, 4)
print(correlation_clean)
print(pvalue_clean)
Handling Special Cases
Missing Values
# Data with missing values (NA)
data_with_na <- data.frame(
x = c(1, 2, NA, 4, 5, 6, 7, 8),
y = c(2, 4, 5, NA, 6, 8, 9, 10),
z = c(1, 1, 2, 3, 3, 4, 5, 5)
)
# rcorr() handles NAs, but uses pairwise deletion by default
result <- rcorr(as.matrix(data_with_na))
# Check the sample size used for each correlation:
# with missing data it is printed as an 'n' matrix at the bottom of the output
Important: When you have missing data, rcorr() uses pairwise deletion - it removes only the missing pairs for each correlation, not entire rows. This can lead to different sample sizes for different correlations.
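You can inspect those pairwise sample sizes directly via result$n, which rcorr() returns alongside result$r and result$P. Continuing the data_with_na example (repeated here so the block is self-contained):

```r
library(Hmisc)

data_with_na <- data.frame(
  x = c(1, 2, NA, 4, 5, 6, 7, 8),
  y = c(2, 4, 5, NA, 6, 8, 9, 10),
  z = c(1, 1, 2, 3, 3, 4, 5, 5)
)

result <- rcorr(as.matrix(data_with_na))

# result$n is a matrix of pairwise sample sizes:
# x-y uses only the 6 rows where both are present; x-z and y-z use 7 each
result$n
```

If the off-diagonal counts differ a lot, be cautious comparing correlations with each other - they were computed on different subsets of rows.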
Very Large Data Frames
For very large datasets, consider subsetting first:
# Select only numeric columns
numeric_cols <- sapply(large_df, is.numeric)
numeric_data <- large_df[, numeric_cols]
# Calculate correlation for subset
result <- rcorr(as.matrix(numeric_data))
Visualizing Correlations
Correlation Heatmap
library(Hmisc)
library(corrplot) # Install if needed
result <- rcorr(as.matrix(sales_data))
# Create heatmap
corrplot(result$r, p.mat = result$P,
sig.level = 0.05, # Mark significant correlations
method = "color",
addCoef.col = "black", # Show values
diag = FALSE) # Don't show diagonal
Simple Pairwise Summary
# Print each variable pair once with its correlation and p-value
summarize_correlations <- function(result) {
# Extract the upper triangle so each pair appears exactly once
upper_triangle <- upper.tri(result$r)
idx <- which(upper_triangle, arr.ind = TRUE)
correlation_data <- data.frame(
var1 = rownames(result$r)[idx[, 1]],
var2 = colnames(result$r)[idx[, 2]],
correlation = result$r[upper_triangle],
p_value = result$P[upper_triangle]
)
print(correlation_data)
}
summarize_correlations(result)
Comparing rcorr() vs cor()
| Feature | rcorr() | cor() |
|---|---|---|
| P-values | ✅ Yes | ❌ No |
| Pearson | ✅ Yes | ✅ Yes |
| Spearman | ✅ Yes | ✅ Yes |
| Input type | Matrix | Matrix or DF |
| Handles NAs | ✅ Pairwise deletion | ⚠️ Only via use= argument |
| Performance | Good | Faster |
| Easy interpretation | ✅ Yes | More setup |
Use rcorr() when: you need p-values or are already working with Hmisc functions.
Use cor() when: you only need the coefficients or want maximum speed.
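As a sanity check, the two functions agree on the coefficients themselves; rcorr() just adds the p-value and sample-size matrices on top. A quick sketch using mtcars (which ships with R):

```r
library(Hmisc)

m <- as.matrix(mtcars[1:3])

# Compare cor() and rcorr() Pearson coefficients
# (unname() drops dimnames so only the numbers are compared)
identical_coefs <- isTRUE(all.equal(unname(cor(m)), unname(rcorr(m)$r)))
identical_coefs # TRUE
```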
Common Mistakes & Troubleshooting
Mistake 1: Forgetting as.matrix()
# WRONG - will cause error
rcorr(sales_data) # Error!
# RIGHT
rcorr(as.matrix(sales_data))
rcorr() specifically requires a matrix, not a data frame.
Mistake 2: Correlating Non-Numeric Columns
# If data frame includes character columns
mixed_data <- data.frame(
store_name = c("A", "B", "C"),
revenue = c(1000, 2000, 1500),
customers = c(50, 100, 75)
)
# WRONG
rcorr(as.matrix(mixed_data)) # Error - can't correlate text!
# RIGHT - select only numeric columns
numeric_cols <- sapply(mixed_data, is.numeric)
rcorr(as.matrix(mixed_data[, numeric_cols]))
Mistake 3: Misinterpreting P-Values
# The diagonal of result$P is always NA (a variable correlated with itself)
# Off-diagonal NAs mean the p-value could not be computed,
# e.g. a zero-variance column or too few complete pairs
result <- rcorr(as.matrix(data))
# Only interpret the non-NA, off-diagonal p-values
Mistake 4: Forgetting to Install Hmisc
# This fails if the package was never installed
library(Hmisc) # Error: there is no package called 'Hmisc'
# Solution
install.packages("Hmisc")
library(Hmisc)
FAQ
Q: What’s the difference between Pearson and Spearman? A: Pearson measures linear relationships for continuous data. Spearman works with ordinal data or non-normal data by ranking values first.
Q: How do I know if a correlation is significant? A: Look at the p-value. If p < 0.05, it’s typically considered statistically significant. But also look at the correlation strength (0.9 is strong, 0.3 is weak).
Q: Can I use rcorr() with factors?
A: No, factors need to be converted to numeric first: as.numeric(as.character(factor_column))
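For instance (a minimal made-up example), a rating stored as a factor can be converted before calling rcorr():

```r
library(Hmisc)

survey <- data.frame(
  rating = factor(c("1", "3", "2", "5", "4")), # factor with numeric labels
  spend  = c(12, 30, 22, 48, 41)
)

# Going through as.character() matters: as.numeric() on a factor
# returns the internal level codes, not the labels
survey$rating <- as.numeric(as.character(survey$rating))

rcorr(as.matrix(survey), type = "spearman")
```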
Q: What does “n=” mean at the bottom of rcorr output? A: It’s the sample size used for calculations. With missing data, this might vary by correlation (pairwise deletion).
Q: Why are some p-values NA? A: The diagonal of result$P is always NA, and an off-diagonal p-value is NA when it cannot be calculated (e.g., a column with zero variance, i.e. only one unique value).
Q: Can I correlate more than 3 variables at once? A: Yes! rcorr() works with any number of columns. You get a full correlation matrix.
Q: Is rcorr() more accurate than cor()? A: They calculate the same correlations. rcorr() just also gives you p-values, making it more informative.
Best Practices
- Always check for missing data - use sum(is.na(df)) first
- Choose the right correlation type - check whether data is normal (use Pearson) or ordinal/non-normal (use Spearman)
- Consider correlation strength, not just p-values - a weak correlation can be significant with large samples
- Document your choice - which correlation type did you use and why?
- Watch out for outliers - they can distort Pearson; Spearman is more robust
- Report both correlation and p-value - “r = 0.85, p = 0.002” tells the full story
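The last point is easy to script. Here is a small helper (my own, not part of Hmisc) that formats one pair for reporting, collapsing tiny p-values to "p < 0.001":

```r
library(Hmisc)

result <- rcorr(as.matrix(mtcars[1:3]))

# Format "r = ..., p = ..." for a single variable pair
report_pair <- function(result, v1, v2) {
  r <- result$r[v1, v2]
  p <- result$P[v1, v2]
  p_text <- if (p < 0.001) "p < 0.001" else sprintf("p = %.3f", p)
  sprintf("r = %.2f, %s", r, p_text)
}

report_pair(result, "mpg", "cyl") # "r = -0.85, p < 0.001"
```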
Related Topics
- R Correlation Analysis - Complete Guide - Comprehensive correlation guide
- R Data Quality - Complete Guide - Data preparation
- R Regression Analysis - Complete Guide - Build on correlations
- Statistics: Hypothesis Testing - Understanding p-values
Download R Script
Get all code examples from this tutorial: rcorr-examples.R