Correlation measures how two variables move together; regression predicts one variable from another. These tools help answer questions like: Do variables move in tandem? Can we predict sales from advertising? What’s the relationship between age and income? Understanding correlation and regression is essential for analyzing relationships and making predictions.

This comprehensive guide covers correlation and regression analysis with interactive calculators and practical interpretations.

Understanding Relationships Between Variables

When analyzing data, we often need to understand how two (or more) variables relate.

Correlation: Do variables move together? (Association)

  • Example: Do taller people weigh more?
  • No causation implied

Regression: Can we predict one variable from another? (Prediction)

  • Example: Predict weight from height
  • Implies some dependence (but not necessarily causation)

Section 1: Correlation Analysis

Correlation measures the strength and direction of linear relationship between two variables.

Pearson Correlation Coefficient

The Pearson correlation coefficient (r) measures linear relationship between two continuous variables.

Formula:

r = Σ((x - x̄)(y - ȳ)) / √(Σ(x - x̄)² × Σ(y - ȳ)²)

Range: -1 to +1

Interpretation:

  • r = +1: Perfect positive correlation (points fall exactly on an upward-sloping line)
  • +0.7 ≤ r < +1: Strong positive correlation
  • +0.3 ≤ r < +0.7: Moderate positive correlation
  • 0 < r < +0.3: Weak positive correlation
  • r = 0: No linear correlation
  • -0.3 < r < 0: Weak negative correlation
  • -0.7 < r ≤ -0.3: Moderate negative correlation
  • -1 < r ≤ -0.7: Strong negative correlation
  • r = -1: Perfect negative correlation

Coefficient of Determination (r²): Percentage of variance in one variable explained by the other.

Example: r = 0.8 → r² = 0.64 → 64% of variance explained
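
As a quick illustration, here is a minimal Python sketch computing r and r² with NumPy/SciPy; the height/weight numbers are invented purely for this example.

# Pearson correlation with SciPy (hypothetical height/weight data)
import numpy as np
from scipy import stats

x = np.array([160, 165, 170, 175, 180, 185])   # heights (cm), made up
y = np.array([55, 62, 66, 70, 74, 82])         # weights (kg), made up

r, p_value = stats.pearsonr(x, y)              # coefficient and two-sided p-value
print(f"r = {r:.3f}, r^2 = {r**2:.3f}, p = {p_value:.4f}")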

When to use Pearson:

  • Both variables continuous
  • Relationship appears linear
  • No severe outliers
  • Variables approximately normally distributed

Assumptions:

  • Random sample
  • Independence of observations
  • Linearity
  • No outliers

Spearman Rank Correlation

Spearman correlation (ρ, “rho”) measures monotonic relationship using ranks (order).

When to use:

  • Non-linear but monotonic relationship
  • Ordinal data
  • Non-normal distributions
  • Outliers present

Advantage: Robust to outliers and non-linear monotonic relationships

Formula: Same as Pearson but applied to ranks

Example: Given scores ranked as Person A (rank 1), Person B (rank 5), Person C (rank 3), compute the correlation on these ranks, not on the raw values.

Kendall’s Tau Correlation

Kendall’s tau (τ) is another rank-based correlation coefficient; it is generally considered more robust than Spearman’s ρ, especially for small samples or data with many tied ranks.

When to use:

  • Extremely non-normal data
  • Strong outliers
  • Monotonic but not linear
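
For both rank-based coefficients, a short SciPy sketch (with made-up, deliberately outlier-heavy data) looks like this:

# Spearman's rho and Kendall's tau on the same hypothetical data
from scipy import stats

x = [1, 2, 3, 4, 5, 6]
y = [2, 1, 4, 3, 6, 100]                     # roughly monotonic, one extreme value

rho, p_rho = stats.spearmanr(x, y)
tau, p_tau = stats.kendalltau(x, y)
print(f"Spearman rho = {rho:.3f} (p = {p_rho:.4f})")
print(f"Kendall tau  = {tau:.3f} (p = {p_tau:.4f})")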

Testing Correlation Significance

Null hypothesis: H₀: ρ = 0 (no correlation)

Test statistic:

t = r × √(n - 2) / √(1 - r²)

df = n - 2

Interpretation:

  • Small p-value: Correlation is statistically significant
  • Large p-value: No evidence of correlation

Note: Statistical significance depends on sample size

  • Large sample: Even weak correlation can be significant
  • Small sample: Strong correlation might not be significant
  • Always check effect size (r-value), not just p-value
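
The test itself is easy to reproduce; here is a small sketch assuming only r and n are known (the values below are hypothetical):

# t-test for H0: rho = 0, given a correlation r and sample size n
import math
from scipy import stats

r, n = 0.45, 30                        # assumed values for illustration
t = r * math.sqrt(n - 2) / math.sqrt(1 - r**2)
df = n - 2
p_value = 2 * stats.t.sf(abs(t), df)   # two-sided p-value
print(f"t = {t:.3f}, df = {df}, p = {p_value:.4f}")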

Interactive Calculators: Correlation

[Interactive Calculator Placeholders]


Section 2: Correlation vs Causation

Critical Principle: Correlation does NOT imply causation

Why Correlation ≠ Causation

Possible Explanations for Correlation:

  1. Direct Causation: X causes Y

    • Example: Studying harder → Better test scores
  2. Reverse Causation: Y causes X

    • Example: Better health → More exercise
  3. Common Cause: Both caused by Z

    • Example: Ice cream sales and drowning deaths both caused by summer heat
  4. Confounding: Third variable affects both

    • Example: Shoe size correlates with reading ability (both caused by age)
  5. Coincidence: No real relationship

    • Example: Nicolas Cage films released per year vs. swimming pool drownings

Identifying Causal Relationships

Requirements for causation:

  1. Correlation: Variables must be associated
  2. Temporal precedence: Cause must precede effect
  3. No alternative explanations: Other variables ruled out

Methods to establish causation:

  • Randomized controlled experiments: Assign treatment randomly
  • Natural experiments: Exploit naturally occurring variation
  • Longitudinal studies: Follow subjects over time
  • Causal diagrams: Map possible causal pathways

Section 3: Simple Linear Regression

Simple linear regression predicts one variable (Y) from another (X) using a straight line.

The Regression Line

Equation:

ŷ = a + b×x

where:
ŷ = predicted value of Y
a = y-intercept (value of Y when X=0)
b = slope (change in Y per unit change in X)

Calculating Regression Coefficients

Slope (b):

b = r × (s_y / s_x)

or equivalently:

b = Σ((x - x̄)(y - ȳ)) / Σ(x - x̄)²

Intercept (a):

a = ȳ - b × x̄

Example

Data: Advertising spending (X) vs Sales (Y)

Advertising ($1000s):  1    2    3    4    5
Sales ($1000s):       10   15   20   25   30
  • x̄ = 3, ȳ = 20
  • Slope: b = 5 (each $1000 ad spending → $5000 sales increase)
  • Intercept: a = 20 - 5×3 = 5

Regression equation: ŷ = 5 + 5x

Prediction: Advertising = $3.5k → Predicted sales = 5 + 5(3.5) = $22.5k
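
The same numbers can be reproduced with a few lines of NumPy (a sketch of the formulas above, not a full regression workflow):

# Slope, intercept, and a prediction for the advertising example
import numpy as np

x = np.array([1, 2, 3, 4, 5])          # advertising ($1000s)
y = np.array([10, 15, 20, 25, 30])     # sales ($1000s)

b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)  # slope
a = y.mean() - b * x.mean()                                                # intercept
print(f"y-hat = {a:.1f} + {b:.1f}x")                 # y-hat = 5.0 + 5.0x
print(f"prediction at x = 3.5: {a + b * 3.5:.1f}")   # 22.5, i.e. $22,500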

Interpreting the Slope

Slope = 5 means:

  • For each 1-unit increase in X, Y increases by 5 units
  • If X increases by 10 units, Y increases by 50 units

Caution: The slope interpretation is only reliable within the range of X observed in the data (extrapolation is unreliable)

R² (Coefficient of Determination)

R² measures how well the regression line fits the data.

Formula:

R² = r² (for simple regression)
R² = Explained Variation / Total Variation

Range: 0 to 1 (often expressed as %)

Interpretation:

  • R² = 0.81: 81% of variation in Y explained by X
  • R² = 0.20: 20% of variation explained (weak model)
  • R² = 1: Perfect fit (rarely achieved)

Residuals

Residual = Observed value - Predicted value

e = y - ŷ

Residual Analysis:

  • Should be randomly scattered around zero
  • No pattern indicates good fit
  • Patterns suggest model problems

Regression Assumptions

  1. Linearity: Relationship is linear
  2. Independence: Observations independent
  3. Homoscedasticity: Constant variance of residuals
  4. Normality: Residuals approximately normal
  5. No outliers: Extreme values investigated

Check with:

  • Scatter plot with regression line
  • Residual plot
  • Q-Q plot
  • Histogram of residuals
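
A rough sketch of these checks, reusing the advertising example with matplotlib and SciPy (a little noise is added to the y values so the residuals are not all zero), might look like this:

# Scatter + fitted line, residuals vs fitted, and Q-Q plot of residuals
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

x = np.array([1, 2, 3, 4, 5])
y = np.array([10, 14, 21, 24, 31])         # noisy version of the example data
b, a = np.polyfit(x, y, 1)                 # slope, intercept
residuals = y - (a + b * x)

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
axes[0].scatter(x, y)
axes[0].plot(x, a + b * x)                 # data with regression line
axes[0].set_title("Y vs X with fit")
axes[1].scatter(a + b * x, residuals)
axes[1].axhline(0)                         # residuals should scatter around zero
axes[1].set_title("Residuals vs fitted")
stats.probplot(residuals, plot=axes[2])    # Q-Q plot for normality
plt.tight_layout()
plt.show()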

Interactive Calculators: Simple Regression

[Interactive Calculator Placeholders]


Section 4: Multiple Regression

Multiple regression predicts Y from multiple X variables.

Equation:

ŷ = a + b₁x₁ + b₂x₂ + ... + bₖxₖ

Example: Predict house price from:

  • Square footage (x₁)
  • Number of bedrooms (x₂)
  • Age (x₃)

Multiple R²

Multiple R² measures overall model fit with all predictors.

Adjusted R²: Penalizes for adding too many variables.

Formula:

Adjusted R² = 1 - [(1 - R²) × (n - 1) / (n - p - 1)]

where p = number of predictors

Use Adjusted R² to compare models with different numbers of predictors
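
For example, a minimal sketch of fitting a multiple regression and computing adjusted R² with NumPy (the house data here is invented):

# Multiple regression via least squares, plus R^2 and adjusted R^2
import numpy as np

X = np.array([[1200, 2], [1500, 3], [1700, 3], [2000, 4], [2400, 4]])  # sqft, bedrooms (made up)
y = np.array([200, 250, 270, 320, 360])                                # price ($1000s, made up)

X1 = np.column_stack([np.ones(len(X)), X])       # add intercept column
coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
y_hat = X1 @ coef

n, p = len(y), X.shape[1]                        # p = number of predictors
r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(f"R^2 = {r2:.3f}, adjusted R^2 = {adj_r2:.3f}")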

Coefficient Interpretation

In multiple regression:

  • b₁ = change in Y per unit change in X₁, holding other X’s constant
  • Assumes linear relationship
  • Called “partial” effect

Example: ŷ = 50 + 2x₁ + 3x₂

  • b₁ = 2: Increasing X₁ by 1 unit → Y increases by 2 (if X₂ held constant)

Multicollinearity

Problem: Predictor variables correlated with each other

Consequences:

  • Unreliable coefficient estimates
  • Inflated standard errors
  • Wide confidence intervals

Detection:

  • Correlation matrix of predictors
  • VIF (Variance Inflation Factor) > 10
  • High R² but non-significant predictors
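
A minimal VIF check with statsmodels might look like the sketch below (the predictors are simulated, with x2 deliberately correlated with x1):

# Variance Inflation Factors for simulated predictors
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = 0.9 * x1 + rng.normal(scale=0.2, size=100)   # strongly correlated with x1
x3 = rng.normal(size=100)

X = sm.add_constant(np.column_stack([x1, x2, x3]))
for i, name in enumerate(["x1", "x2", "x3"], start=1):
    print(name, round(variance_inflation_factor(X, i), 2))   # > 10 is a warning sign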

Solutions:

  • Remove correlated predictors
  • Combine related predictors
  • Use regularization (Ridge, Lasso)

Section 5: Making Predictions

Point Predictions

Single value: ŷ = a + bx

Example: With $3k of advertising, predicted sales = 5 + 5(3) = $20k

Confidence Interval for Mean Response

Range of likely values for average Y at given X level.

Narrower than prediction interval

  • Predicting average of many cases
  • Less uncertainty

Prediction Interval for Individual Response

Range of likely values for single individual’s Y.

Wider than confidence interval

  • Predicting individual case
  • More uncertainty

Comparison:

  • 95% CI for mean: Maybe [19.5, 20.5]
  • 95% PI for individual: Maybe [15, 25]
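
Both intervals are straightforward to obtain from statsmodels; the sketch below uses advertising-style data with a little noise added so the intervals are not degenerate:

# Confidence interval for the mean response vs prediction interval for an individual
import numpy as np
import statsmodels.api as sm

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([10.0, 14.0, 21.0, 24.0, 31.0])     # noisy version of the earlier example

results = sm.OLS(y, sm.add_constant(x)).fit()
new_X = sm.add_constant(np.array([3.0]), has_constant="add")   # predict at x = 3

frame = results.get_prediction(new_X).summary_frame(alpha=0.05)
print(frame[["mean", "mean_ci_lower", "mean_ci_upper",   # CI for the mean response
             "obs_ci_lower", "obs_ci_upper"]])            # PI for an individual response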

Extrapolation Risk

⚠️ Don’t predict outside X data range

Example:

  • Data: Advertising $1k to $5k
  • Safe: Predict for $3k
  • Risky: Predict for $20k (far outside range)
  • Relationship may not hold at extremes

Section 6: Partial Correlation

Partial correlation measures relationship between two variables while controlling for other variables.

Example: Correlate age and health, controlling for exercise

Formula involves residuals:

  1. Regress age on exercise, save residuals
  2. Regress health on exercise, save residuals
  3. Correlate the two residuals

Shows: Relationship independent of control variable
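
A small sketch of this residual recipe in NumPy (the age/exercise/health vectors are simulated purely for illustration):

# Partial correlation of age and health, controlling for exercise
import numpy as np

rng = np.random.default_rng(1)
exercise = rng.normal(size=200)
age = 50 + 5 * exercise + rng.normal(size=200)       # related to exercise
health = 70 + 4 * exercise + rng.normal(size=200)    # related to exercise

def residuals(y, control):
    X = np.column_stack([np.ones_like(control), control])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ coef

r_age = residuals(age, exercise)        # step 1: remove exercise from age
r_health = residuals(health, exercise)  # step 2: remove exercise from health
partial_r = np.corrcoef(r_age, r_health)[0, 1]   # step 3: correlate the residuals
print(f"partial correlation = {partial_r:.3f}")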


Section 7: Practical Examples

Example 1: House Price Prediction

Scenario: Real estate company predicts house prices

Data: 100 houses

  • X = Square footage
  • Y = Price

Analysis:

  • Correlation: r = 0.92 (very strong positive)
  • Regression: Price = 50,000 + 150 × Sqft
  • R² = 0.85 (85% of price variation explained by size)

Interpretation:

  • Each additional square foot → $150 price increase
  • 85% of house price variation explained by size
  • Other factors (location, condition, etc.) explain remaining 15%

Prediction: 2,000 sq ft house = $50,000 + 150(2,000) = $350,000

Example 2: Test Score Prediction

Scenario: Predict college GPA from high school SAT scores

Data: 500 students

  • X = SAT score
  • Y = College GPA

Analysis:

  • Correlation: r = 0.45 (moderate positive)
  • Regression: GPA = 0.50 + 0.002 × SAT
  • R² = 0.20 (only 20% explained)

Interpretation:

  • Weak to moderate relationship
  • 100-point SAT increase → 0.2 GPA increase
  • Other factors (motivation, major, etc.) explain 80%
  • SAT alone insufficient for prediction

Example 3: Multiple Regression Example

Scenario: Predict employee salary from multiple factors

Model: Salary = 30,000 + 2,000×(Yrs Experience) + 5,000×(Degree Level) - 1,000×(Age)

Interpretation:

  • Each year experience → $2,000 salary increase
  • Each degree level → $5,000 increase
  • Each year of age → $1,000 decrease (unexpected!)
  • Suggests age is confounded with experience
  • Need to reconsider model

Section 8: Regression Diagnostics

Checking Linearity

Plot: Scatter plot of Y vs X with regression line

Look for:

  • ✅ Random scatter around line
  • ❌ Curved pattern (use transformation)
  • ❌ Non-linear relationship

Checking Homoscedasticity

Plot: Residuals vs Predicted Values

Look for:

  • ✅ Constant spread of residuals across fitted values (no widening cone)
  • ❌ Funnel shape (increasing variance)
  • ❌ Patterns

Fix: Transform Y variable (log, sqrt)

Checking Normality of Residuals

Plot: Q-Q plot of residuals

Look for:

  • ✅ Points follow diagonal line
  • ❌ Deviations at tails (heavy or light)

Test: Shapiro-Wilk test

Checking Independence

Context: How data was collected

  • ✅ Random sample from population
  • ❌ Time series data (autocorrelated)
  • ❌ Hierarchical data (students in schools)

Plot: Residuals vs observation order

Checking for Outliers and Influential Points

Outlier: Point far from regression line

Influential point: Outlier that pulls line (high leverage)

Diagnostics:

  • Standardized residuals beyond ±3 suggest outliers
  • Cook’s distance > 1 indicates an influential point
  • Leverage > 2p/n indicates high leverage (p = number of model parameters, n = sample size)

Action:

  • Investigate cause
  • Remove if error
  • Use robust regression if legitimate
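
These diagnostics are available directly from statsmodels; a sketch on simulated data with one planted outlier might look like this:

# Standardized residuals, leverage, and Cook's distance via OLS influence measures
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, size=50)
y = 3 + 2 * x + rng.normal(size=50)
y[0] += 15                                          # plant one outlier

results = sm.OLS(y, sm.add_constant(x)).fit()
influence = results.get_influence()

std_resid = influence.resid_studentized_internal    # flag |value| > 3
leverage = influence.hat_matrix_diag                # flag values > 2p/n
cooks_d = influence.cooks_distance[0]               # flag values > 1

flagged = np.where((np.abs(std_resid) > 3) | (cooks_d > 1))[0]
print("flagged observations:", flagged)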

Best Practices

Before Analysis

  1. Visualize: Create scatter plot
  2. Check linearity: Is relationship linear?
  3. Identify outliers: Any extreme points?
  4. Consider confounders: What other variables matter?

During Analysis

  1. Test significance: p-value < 0.05?
  2. Check R²: How much variance explained?
  3. Check assumptions: Use diagnostic plots
  4. Report uncertainty: Include CIs or SEs

When Interpreting

  1. Correlation ≠ Causation: Don’t claim cause
  2. Report effect size: Not just p-values
  3. Consider context: Does result make sense?
  4. Acknowledge limitations: What factors not measured?

Common Mistakes

  1. ❌ Claiming causation from correlation
  2. ❌ Ignoring outliers
  3. ❌ Extrapolating beyond data range
  4. ❌ Confusing correlation with regression
  5. ❌ Not checking assumptions
  6. ❌ Overfitting with too many predictors
  7. ❌ Ignoring multicollinearity

When to Use What

Situation                      Use
Measure relationship strength  Correlation
Predict one from another       Regression
Relationship non-linear        Transformation or polynomial regression
Data contain ranks/ordinal     Spearman correlation
Check causation                Experiments or causal inference methods
Multiple predictors            Multiple regression
Compare groups                 ANOVA, not regression


Summary

Correlation and regression enable you to:

  • Measure relationships between variables
  • Make data-driven predictions
  • Understand how variables move together
  • Identify important factors
  • Communicate quantitative insights

Remember: correlation suggests patterns; experiments establish causation.