Correlation measures how two variables move together; regression predicts one variable from another. These tools help answer questions like: Do variables move in tandem? Can we predict sales from advertising? What’s the relationship between age and income? Understanding correlation and regression is essential for analyzing relationships and making predictions.
This comprehensive guide covers correlation and regression analysis with interactive calculators and practical interpretations.
Understanding Relationships Between Variables
When analyzing data, we often need to understand how two (or more) variables relate.
Correlation: Do variables move together? (Association)
- Example: Do taller people weigh more?
- No causation implied
Regression: Can we predict one variable from another? (Prediction)
- Example: Predict weight from height
- Implies some dependence (but not necessarily causation)
Section 1: Correlation Analysis
Correlation measures the strength and direction of linear relationship between two variables.
Pearson Correlation Coefficient
The Pearson correlation coefficient (r) measures linear relationship between two continuous variables.
Formula:
r = Σ((x - x̄)(y - ȳ)) / √(Σ(x - x̄)² × Σ(y - ȳ)²)
Range: -1 to +1
Interpretation:
- r = +1: Perfect positive correlation (points fall exactly on an upward-sloping line)
- +0.7 ≤ r < +1: Strong positive correlation
- +0.3 ≤ r < +0.7: Moderate positive correlation
- 0 < r < +0.3: Weak positive correlation
- r = 0: No linear correlation
- -0.3 < r < 0: Weak negative correlation
- -0.7 < r ≤ -0.3: Moderate negative correlation
- -1 < r ≤ -0.7: Strong negative correlation
- r = -1: Perfect negative correlation
(These cutoffs are conventional guidelines, not strict rules.)
Coefficient of Determination (r²): Percentage of variance in one variable explained by the other.
Example: r = 0.8 → r² = 0.64 → 64% of variance explained
When to use Pearson:
- Both variables continuous
- Relationship appears linear
- No severe outliers
- Variables approximately normally distributed
Assumptions:
- Random sample
- Independence of observations
- Linearity
- No outliers
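The calculation is easy to run in software. Below is a minimal sketch in Python (NumPy and SciPy; the height/weight values are made up for illustration), computing r both from the formula above and with scipy.stats.pearsonr:

```python
import numpy as np
from scipy import stats

# Hypothetical heights (cm) and weights (kg)
x = np.array([160, 165, 170, 175, 180, 185])
y = np.array([55, 60, 63, 70, 72, 80])

# Pearson r straight from the formula
r_manual = np.sum((x - x.mean()) * (y - y.mean())) / np.sqrt(
    np.sum((x - x.mean()) ** 2) * np.sum((y - y.mean()) ** 2)
)

# Same result via SciPy, which also returns a p-value for H0: rho = 0
r, p_value = stats.pearsonr(x, y)

print(f"r = {r:.3f}, r² = {r**2:.3f}, p = {p_value:.4f}")
```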
Spearman Rank Correlation
Spearman correlation (ρ, “rho”) measures monotonic relationship using ranks (order).
When to use:
- Non-linear but monotonic relationship
- Ordinal data
- Non-normal distributions
- Outliers present
Advantage: Robust to outliers and non-linear monotonic relationships
Formula: Same as Pearson, but applied to the ranks of the data rather than the raw values (a shortcut formula based on rank differences exists when there are no ties).
Example: Convert each variable to ranks (e.g., Person A → rank 1, Person B → rank 5, Person C → rank 3) and compute the correlation on those ranks, not the original scores.
Kendall’s Tau Correlation
Another rank-based correlation; it is often preferred over Spearman for small samples or data with many tied ranks.
When to use:
- Extremely non-normal data
- Strong outliers
- Monotonic but not linear
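To see why the rank-based coefficients are more robust, here is a short sketch (Python/SciPy, with made-up monotonic data containing one extreme outlier) comparing all three coefficients on the same values:

```python
import numpy as np
from scipy import stats

# Monotonic but non-linear data; the last y value is an extreme outlier
x = np.array([1, 2, 3, 4, 5, 6, 7, 8])
y = np.array([1, 4, 9, 16, 25, 36, 49, 400])

pearson_r, _ = stats.pearsonr(x, y)      # pulled away from 1 by the outlier
spearman_rho, _ = stats.spearmanr(x, y)  # rank-based: 1.0 here
kendall_tau, _ = stats.kendalltau(x, y)  # rank-based: 1.0 here

print(f"Pearson r   = {pearson_r:.3f}")
print(f"Spearman ρ  = {spearman_rho:.3f}")
print(f"Kendall τ   = {kendall_tau:.3f}")
```

Because ranks ignore how extreme the last value is, Spearman and Kendall both report a perfect monotonic relationship while Pearson does not.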
Testing Correlation Significance
Null hypothesis: H₀: ρ = 0 (no correlation)
Test statistic:
t = r × √(n - 2) / √(1 - r²)
df = n - 2
Interpretation:
- Small p-value: Correlation is statistically significant
- Large p-value: No evidence of correlation
Note: Statistical significance depends on sample size
- Large sample: Even weak correlation can be significant
- Small sample: Strong correlation might not be significant
- Always check effect size (r-value), not just p-value
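As a small worked sketch (Python/SciPy, with illustrative values of r and n), the test statistic and two-sided p-value can be computed directly from the formula above:

```python
import numpy as np
from scipy import stats

r = 0.45   # observed correlation (illustrative)
n = 30     # sample size (illustrative)

# t-test for H0: rho = 0
t = r * np.sqrt(n - 2) / np.sqrt(1 - r ** 2)
df = n - 2
p_value = 2 * stats.t.sf(abs(t), df)   # two-sided p-value

print(f"t = {t:.3f}, df = {df}, p = {p_value:.4f}")
```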
Interactive Calculators: Correlation
[Interactive Calculator Placeholders]
- Pearson Correlation Calculator
- Spearman Correlation Calculator
- Testing Correlation Coefficient
- Testing Homogeneity of Two Correlations
Section 2: Correlation vs Causation
Critical Principle: Correlation does NOT imply causation
Why Correlation ≠ Causation
Possible Explanations for Correlation:
1. Direct Causation: X causes Y
   - Example: Studying harder → Better test scores
2. Reverse Causation: Y causes X
   - Example: Better health → More exercise
3. Common Cause: Both caused by Z
   - Example: Ice cream sales and drowning deaths both caused by summer heat
4. Confounding: Third variable affects both
   - Example: Shoe size correlates with reading ability (both caused by age)
5. Coincidence: No real relationship
   - Example: Nicolas Cage films released per year vs. swimming pool drownings
Identifying Causal Relationships
Requirements for causation:
- Correlation: Variables must be associated
- Temporal precedence: Cause must precede effect
- No alternative explanations: Other variables ruled out
Methods to establish causation:
- Randomized controlled experiments: Assign treatment randomly
- Natural experiments: Exploit naturally occurring variation
- Longitudinal studies: Follow subjects over time
- Causal diagrams: Map possible causal pathways
Section 3: Simple Linear Regression
Simple linear regression predicts one variable (Y) from another (X) using a straight line.
The Regression Line
Equation:
ŷ = a + b×x
where:
ŷ = predicted value of Y
a = y-intercept (value of Y when X=0)
b = slope (change in Y per unit change in X)
Calculating Regression Coefficients
Slope (b):
b = r × (s_y / s_x)
or equivalently:
b = Σ((x - x̄)(y - ȳ)) / Σ(x - x̄)²
Intercept (a):
a = ȳ - b × x̄
Example
Data: Advertising spending (X) vs Sales (Y)
Advertising ($1000s): 1 2 3 4 5
Sales ($1000s): 10 15 20 25 30
- x̄ = 3, ȳ = 20
- Slope: b = 5 (each $1000 ad spending → $5000 sales increase)
- Intercept: a = 20 - 5×3 = 5
Regression equation: ŷ = 5 + 5x
Prediction: Advertising = $3.5k → Predicted sales = 5 + 5(3.5) = $22.5k
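The same numbers can be reproduced with a few lines of NumPy; this is a minimal sketch using the advertising data above:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])       # advertising, $1000s
y = np.array([10, 15, 20, 25, 30])  # sales, $1000s

# Least-squares slope and intercept
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()
print(f"ŷ = {a:.1f} + {b:.1f}x")     # ŷ = 5.0 + 5.0x

# Prediction for $3,500 of advertising
print(f"Predicted sales at x = 3.5: {a + b * 3.5:.1f} (thousand dollars)")
```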
Interpreting the Slope
Slope = 5 means:
- For each 1-unit increase in X, Y increases by 5 units
- If X increases by 10 units, Y increases by 50 units
Caution: The slope interpretation is only reliable within the range of X observed in the data (extrapolation is unreliable)
R² (Coefficient of Determination)
R² measures how well regression line fits data.
Formula:
R² = r² (for simple regression)
R² = Explained Variation / Total Variation
Range: 0 to 1 (often expressed as %)
Interpretation:
- R² = 0.81: 81% of variation in Y explained by X
- R² = 0.20: 20% of variation explained (weak model)
- R² = 1: Perfect fit (rarely achieved)
Residuals
Residual = Observed value - Predicted value
e = y - ŷ
Residual Analysis:
- Should be randomly scattered around zero
- No pattern indicates good fit
- Patterns suggest model problems
Regression Assumptions
- Linearity: Relationship is linear
- Independence: Observations independent
- Homoscedasticity: Constant variance of residuals
- Normality: Residuals approximately normal
- No outliers: Extreme values investigated
Check with:
- Scatter plot with regression line
- Residual plot
- Q-Q plot
- Histogram of residuals
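A sketch of what the first two checks might look like in Python (assuming matplotlib is available; the data are randomly generated noisy values, not the advertising example):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = np.linspace(1, 5, 30)
y = 5 + 5 * x + rng.normal(0, 2, size=x.size)  # linear trend plus noise

# Fit the line and compute residuals e = y - ŷ
b, a = np.polyfit(x, y, 1)        # polyfit(deg=1) returns [slope, intercept]
residuals = y - (a + b * x)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].scatter(x, y)                      # linearity check
axes[0].plot(x, a + b * x, color="red")
axes[0].set_title("Data with fitted line")
axes[1].scatter(a + b * x, residuals)      # homoscedasticity check
axes[1].axhline(0, color="red")
axes[1].set_title("Residuals vs fitted")
plt.show()
```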
Interactive Calculators: Simple Regression
[Interactive Calculator Placeholders]
Section 4: Multiple Regression
Multiple regression predicts Y from multiple X variables.
Equation:
ŷ = a + b₁x₁ + b₂x₂ + ... + bₖxₖ
Example: Predict house price from:
- Square footage (x₁)
- Number of bedrooms (x₂)
- Age (x₃)
Multiple R²
Multiple R² measures overall model fit with all predictors.
Adjusted R²: Penalizes for adding too many variables.
Formula:
Adjusted R² = 1 - [(1 - R²) × (n - 1) / (n - p - 1)]
where p = number of predictors
Use Adjusted R² to compare models with different numbers of predictors
Coefficient Interpretation
In multiple regression:
- b₁ = change in Y per unit change in X₁, holding other X’s constant
- Assumes linear relationship
- Called “partial” effect
Example: ŷ = 50 + 2x₁ + 3x₂
- b₁ = 2: Increasing X₁ by 1 unit → Y increases by 2 (if X₂ held constant)
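A hedged sketch of fitting such a model in Python with statsmodels (the house-price data below are randomly generated and the coefficients are made up, purely for illustration):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 100
sqft = rng.uniform(800, 3000, n)       # x1: square footage
bedrooms = rng.integers(1, 6, n)       # x2: number of bedrooms
age = rng.uniform(0, 50, n)            # x3: age of house
price = 50_000 + 150 * sqft + 5_000 * bedrooms - 500 * age + rng.normal(0, 20_000, n)

X = sm.add_constant(np.column_stack([sqft, bedrooms, age]))  # adds intercept column
model = sm.OLS(price, X).fit()

print(model.params)   # a, b1, b2, b3 (each bk is a partial effect)
print(f"R² = {model.rsquared:.3f}, adjusted R² = {model.rsquared_adj:.3f}")
```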
Multicollinearity
Problem: Predictor variables correlated with each other
Consequences:
- Unreliable coefficient estimates
- Inflated standard errors
- Wide confidence intervals
Detection:
- Correlation matrix of predictors
- VIF (Variance Inflation Factor) > 10
- High R² but non-significant predictors
Solutions:
- Remove correlated predictors
- Combine related predictors
- Use regularization (Ridge, Lasso)
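For detection, statsmodels provides a VIF helper; here is a brief sketch on made-up data where two predictors are nearly identical:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(2)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)   # nearly collinear with x1
x3 = rng.normal(size=n)

X = sm.add_constant(np.column_stack([x1, x2, x3]))

# VIF for each predictor column (column 0 is the constant, so start at 1)
for i, name in enumerate(["x1", "x2", "x3"], start=1):
    print(f"VIF({name}) = {variance_inflation_factor(X, i):.1f}")
# x1 and x2 should show very large VIFs; x3 should be near 1
```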
Section 5: Making Predictions
Point Predictions
Single value: ŷ = a + bx
Example: Predicted sales at $3k advertising: ŷ = 5 + 5(3) = 20, i.e., $20k
Confidence Interval for Mean Response
Range of likely values for average Y at given X level.
Narrower than prediction interval
- Predicting average of many cases
- Less uncertainty
Prediction Interval for Individual Response
Range of likely values for single individual’s Y.
Wider than confidence interval
- Predicting individual case
- More uncertainty
Comparison:
- 95% CI for mean: e.g., [19.5, 20.5]
- 95% PI for individual: e.g., [15, 25]
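A minimal sketch contrasting the two intervals with statsmodels (the sales values below include a little noise so the intervals are visible; exact widths will differ from the illustrative numbers above):

```python
import numpy as np
import statsmodels.api as sm

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])       # advertising, $1000s
y = np.array([10.0, 14.0, 21.0, 24.0, 31.0])  # sales, $1000s (with noise)

model = sm.OLS(y, sm.add_constant(x)).fit()

# Predict at x = 3 and request both interval types
new_X = sm.add_constant(np.array([3.0]), has_constant="add")
pred = model.get_prediction(new_X).summary_frame(alpha=0.05)

print(pred[["mean",
            "mean_ci_lower", "mean_ci_upper",   # CI for the mean response
            "obs_ci_lower", "obs_ci_upper"]])   # PI for an individual response
```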
Extrapolation Risk
⚠️ Don’t predict outside X data range
Example:
- Data: Advertising $1k to $5k
- Safe: Predict for $3k
- Risky: Predict for $20k (far outside range)
- Relationship may not hold at extremes
Section 6: Partial Correlation
Partial correlation measures relationship between two variables while controlling for other variables.
Example: Correlate age and health, controlling for exercise
Formula involves residuals:
- Regress age on exercise, save residuals
- Regress health on exercise, save residuals
- Correlate the two residuals
Shows: Relationship independent of control variable
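A sketch of the residual approach in Python (the age/health/exercise values are randomly generated so that age and health are related only through exercise):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n = 200
exercise = rng.normal(size=n)
age = 0.5 * exercise + rng.normal(size=n)
health = 0.7 * exercise + rng.normal(size=n)

def residuals_after(y, control):
    """Residuals of y after regressing it on the control variable."""
    slope, intercept = np.polyfit(control, y, 1)
    return y - (intercept + slope * control)

# Partial correlation of age and health, controlling for exercise
r_partial, p = stats.pearsonr(residuals_after(age, exercise),
                              residuals_after(health, exercise))
print(f"partial r = {r_partial:.3f}, p = {p:.3f}")  # should be near zero here
```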
Section 7: Practical Examples
Example 1: House Price Prediction
Scenario: Real estate company predicts house prices
Data: 100 houses
- X = Square footage
- Y = Price
Analysis:
- Correlation: r = 0.92 (very strong positive)
- Regression: Price = 50,000 + 150 × Sqft
- R² = 0.85 (85% of price variation explained by size)
Interpretation:
- Each additional square foot → $150 price increase
- 85% of house price variation explained by size
- Other factors (location, condition, etc.) explain remaining 15%
Prediction: 2,000 sq ft house = $50,000 + 150(2,000) = $350,000
Example 2: Test Score Prediction
Scenario: Predict college GPA from high school SAT scores
Data: 500 students
- X = SAT score
- Y = College GPA
Analysis:
- Correlation: r = 0.45 (moderate positive)
- Regression: GPA = 0.50 + 0.002 × SAT
- R² = 0.20 (only 20% explained)
Interpretation:
- Weak to moderate relationship
- 100-point SAT increase → 0.2 GPA increase
- Other factors (motivation, major, etc.) explain 80%
- SAT alone insufficient for prediction
Example 3: Multiple Regression Example
Scenario: Predict employee salary from multiple factors
Model: Salary = 30,000 + 2,000×(Yrs Experience) + 5,000×(Degree Level) - 1,000×(Age)
Interpretation:
- Each year experience → $2,000 salary increase
- Each degree level → $5,000 increase
- Each year of age → $1,000 decrease (unexpected!)
- Suggests age is confounded with experience
- Need to reconsider model
Section 8: Regression Diagnostics
Checking Linearity
Plot: Scatter plot of Y vs X with regression line
Look for:
- ✅ Random scatter around line
- ❌ Curved pattern (use transformation)
- ❌ Non-linear relationship
Checking Homoscedasticity
Plot: Residuals vs Predicted Values
Look for:
- ✅ Constant variance (the spread of residuals stays roughly the same across fitted values)
- ❌ Funnel shape (increasing variance)
- ❌ Patterns
Fix: Transform Y variable (log, sqrt)
Checking Normality of Residuals
Plot: Q-Q plot of residuals
Look for:
- ✅ Points follow diagonal line
- ❌ Deviations at tails (heavy or light)
Test: Shapiro-Wilk test
Checking Independence
Context: How data was collected
- ✅ Random sample from population
- ❌ Time series data (autocorrelated)
- ❌ Hierarchical data (students in schools)
Plot: Residuals vs observation order
Checking for Outliers and Influential Points
Outlier: Point far from regression line
Influential point: Outlier that pulls line (high leverage)
Diagnostics:
- Standardized residuals beyond ±3 flag potential outliers
- Cook’s distance > 1 indicates influential points
- Leverage > 2p/n indicates high leverage
Action:
- Investigate cause
- Remove if error
- Use robust regression if legitimate
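These diagnostics are available from the fitted model's influence measures in statsmodels; here is a hedged sketch on generated data with one deliberately influential point:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
x = np.append(rng.uniform(0, 10, 30), 25.0)                 # one high-leverage x
y = np.append(2 + 3 * x[:30] + rng.normal(0, 1, 30), 10.0)  # with a poorly fitting y

model = sm.OLS(y, sm.add_constant(x)).fit()
influence = model.get_influence()

cooks_d = influence.cooks_distance[0]              # Cook's distance per observation
leverage = influence.hat_matrix_diag               # leverage (hat) values
std_resid = influence.resid_studentized_internal   # standardized residuals

p, n = 2, len(y)   # p = number of estimated parameters (intercept + slope)
flagged = np.where((cooks_d > 1) | (leverage > 2 * p / n) | (np.abs(std_resid) > 3))[0]
print("Flagged observations:", flagged)            # the injected point should appear
```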
Best Practices
Before Analysis
- ✅ Visualize: Create scatter plot
- ✅ Check linearity: Is relationship linear?
- ✅ Identify outliers: Any extreme points?
- ✅ Consider confounders: What other variables matter?
During Analysis
- ✅ Test significance: p-value < 0.05?
- ✅ Check R²: How much variance explained?
- ✅ Check assumptions: Use diagnostic plots
- ✅ Report uncertainty: Include CIs or SEs
When Interpreting
- ✅ Correlation ≠ Causation: Don’t claim cause
- ✅ Report effect size: Not just p-values
- ✅ Consider context: Does result make sense?
- ✅ Acknowledge limitations: What factors not measured?
Common Mistakes
- ❌ Claiming causation from correlation
- ❌ Ignoring outliers
- ❌ Extrapolating beyond data range
- ❌ Confusing correlation with regression
- ❌ Not checking assumptions
- ❌ Overfitting with too many predictors
- ❌ Ignoring multicollinearity
When to Use What
| Situation | Use |
|---|---|
| Measure relationship strength | Correlation |
| Predict one from another | Regression |
| Relationship non-linear | Transformation or polynomial regression |
| Data contain ranks/ordinal | Spearman correlation |
| Check causation | Experiments or causal inference methods |
| Multiple predictors | Multiple regression |
| Compare means across groups | ANOVA (equivalent to regression with a categorical predictor) |
Related Topics
- Previous: Hypothesis Testing - Test correlation significance
- Confidence Intervals - CI for regression coefficients
- Descriptive Statistics - Foundation
- Advanced: Causal Inference - Go beyond correlation
Summary
Correlation and regression enable you to:
- Measure relationships between variables
- Make data-driven predictions
- Understand how variables move together
- Identify important factors
- Communicate quantitative insights
Remember: correlation suggests patterns; experiments establish causation.