Statistical significance (p-value < 0.05) tells you an effect exists; effect size tells you how big it is. Large samples can flag tiny, practically meaningless effects as "significant." This guide covers effect sizes and statistical power: two critical concepts often neglected in statistics education but essential for good research.
It walks through effect-size measures, power analysis, and practical study planning.
Understanding Effect Size
Why Effect Size Matters
Problem with p-values alone:
- Large sample → Small p-value even for tiny effect
- Small sample → Large p-value even for big effect
- P-value depends on both effect size AND sample size
Example:
- Study 1: 50 subjects, r = 0.30, p = 0.03 (significant)
- Study 2: 1000 subjects, r = 0.05, p = 0.02 (significant)
- Does Study 2 provide stronger evidence of practical importance? No
- Study 1 has the much larger effect (r = 0.30 vs r = 0.05)
Solution: Report effect size (not just p-value)
What is Effect Size?
Effect size = Magnitude of difference or relationship, independent of sample size
Characteristics:
- Standardized (unitless, comparable across studies)
- Independent of sample size
- Focuses on practical significance
- Enables meta-analysis and comparison
Section 1: Effect Sizes for Comparing Means
Cohen’s d
Most common effect size for comparing two group means.
Formula:
d = (m₁ - m₂) / s_pooled
where:
m₁, m₂ = group means
s_pooled = pooled standard deviation
Pooled SD:
s_pooled = √[((n₁-1)s₁² + (n₂-1)s₂²) / (n₁ + n₂ - 2)]
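A minimal Python sketch of this calculation (the sample data and the cohens_d helper name are illustrative, not from the guide):

```python
import numpy as np

def cohens_d(x, y):
    """Cohen's d for two independent samples, using the pooled SD."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n1, n2 = len(x), len(y)
    s1_sq, s2_sq = x.var(ddof=1), y.var(ddof=1)          # sample variances
    s_pooled = np.sqrt(((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / (n1 + n2 - 2))
    return (x.mean() - y.mean()) / s_pooled

treatment = [23, 25, 28, 30, 27, 26]   # hypothetical scores
control = [20, 22, 24, 25, 23, 21]
print(f"Cohen's d = {cohens_d(treatment, control):.2f}")
```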
Cohen’s d Interpretation
Standard benchmarks:
- d = 0.2: Small effect
- d = 0.5: Medium effect
- d = 0.8: Large effect
Context matters:
- What’s “large” varies by field
- Medical: Small effect might be clinically important
- Psychology: Large effects often hard to achieve
Examples:
- IQ increase of 3 points: d ≈ 0.2 (small)
- IQ increase of 7.5 points: d ≈ 0.5 (medium)
- IQ increase of 12 points: d ≈ 0.8 (large)
Hedges’ g
An adjusted Cohen’s d for small samples.
Interpreted the same way as Cohen’s d, but it corrects d’s slight upward bias when samples are small.
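As a sketch, one common approximation of the correction factor is J ≈ 1 − 3 / (4(n₁ + n₂) − 9); this builds on the cohens_d helper above:

```python
def hedges_g(x, y):
    """Cohen's d with the small-sample bias correction (Hedges' g)."""
    n1, n2 = len(x), len(y)
    correction = 1 - 3 / (4 * (n1 + n2) - 9)   # approximate correction factor J
    return cohens_d(x, y) * correction

print(f"Hedges' g = {hedges_g(treatment, control):.2f}")   # slightly smaller than d
```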
Effect Size for t-Tests
From sample means:
d = (x̄₁ - x̄₂) / √(((n₁-1)s₁² + (n₂-1)s₂²) / (n₁ + n₂ - 2))
From t-statistic:
d = t × √(1/n₁ + 1/n₂)
From p-value:
- Approximate using online calculators or conversion tables
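A one-line sketch of the t-to-d conversion above (the numbers are made up for illustration):

```python
import numpy as np

def d_from_t(t, n1, n2):
    """Convert an independent-samples t statistic to Cohen's d."""
    return t * np.sqrt(1 / n1 + 1 / n2)

print(round(d_from_t(2.5, 30, 30), 2))   # ≈ 0.65 for t = 2.5 with 30 per group
```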
Section 2: Effect Sizes for Categorical Data
Phi Coefficient (φ)
Effect size for 2×2 contingency table (binary × binary)
Formula:
φ = √(χ² / N)
where:
χ² = chi-square statistic
N = total sample size
Range: 0 to 1
Interpretation:
- φ = 0.1: Small association
- φ = 0.3: Medium association
- φ = 0.5: Large association
Cramer’s V
Effect size for larger contingency tables
Formula:
V = √(χ² / (N × (min(r,c) - 1)))
where:
r = number of rows
c = number of columns
Range: 0 to 1
Interpretation: Same benchmarks as phi (0.1, 0.3, 0.5) when the smaller table dimension is 2; the benchmarks shrink as min(r, c) grows
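A sketch of both measures from a contingency table using scipy (the counts are hypothetical; correction=False turns off the Yates correction so the result matches the formulas above):

```python
import numpy as np
from scipy.stats import chi2_contingency

table = np.array([[30, 20],     # hypothetical 2x2 counts
                  [15, 35]])

chi2, p, dof, expected = chi2_contingency(table, correction=False)
n = table.sum()
phi = np.sqrt(chi2 / n)                                   # for 2x2 tables
cramers_v = np.sqrt(chi2 / (n * (min(table.shape) - 1)))  # general tables

print(f"chi2 = {chi2:.2f}, phi = {phi:.2f}, Cramer's V = {cramers_v:.2f}")
```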
Eta (η) and Eta-Squared (η²) for ANOVA
Effect size for one-way ANOVA
Formula:
η² = SS_between / SS_total
η = √(SS_between / SS_total)
η² is the proportion of variance explained by group membership; η is its square root.
Interpretation (conventional benchmarks for η²):
- η² = 0.01: Small effect
- η² = 0.06: Medium effect
- η² = 0.14: Large effect
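A short sketch computing η² and η directly from sums of squares (the group data are hypothetical):

```python
import numpy as np

groups = [np.array([4.0, 5, 6, 5, 4]),   # hypothetical scores per group
          np.array([6.0, 7, 8, 7, 6]),
          np.array([5.0, 6, 7, 6, 5])]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()

ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_total = ((all_values - grand_mean) ** 2).sum()

eta_sq = ss_between / ss_total        # proportion of variance explained
eta = np.sqrt(eta_sq)
print(f"eta^2 = {eta_sq:.2f}, eta = {eta:.2f}")
```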
Section 3: Statistical Power
What is Power?
Power (1 - β) = Probability of rejecting null hypothesis when it’s false (detecting real effect)
Conversely: β = Type II error rate = Probability of missing a real effect
Relationship: Power = 1 - β
Standard goal: 80% power (20% Type II error)
Power Depends On
1. Effect Size
- Larger effect → Easier to detect → More power
- Smaller effect → Harder to detect → Less power
2. Sample Size
- Larger sample → More power
- Smaller sample → Less power
3. Significance Level (α)
- Stricter α (0.01 vs 0.05) → Less power
- More lenient α → More power
4. Type of Test
- One-tailed vs two-tailed
- Parametric vs non-parametric
- Specific statistical test used
Power Analysis Decision Tree
Before Study (A priori):
- Want to know: Required sample size
- Give: Effect size, power, α
- Find: n
After Study (Post hoc):
- Want to know: Achieved power
- Give: Effect size, sample size, α
- Find: power (rarely useful)
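Both branches of the decision tree can be sketched with statsmodels' power classes (values are approximate; the effect size and sample size below are illustrative):

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# A priori: given effect size, alpha, and power, solve for n per group.
n_per_group = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.80)
print(f"Required n per group ≈ {n_per_group:.1f}")      # ≈ 64 after rounding up

# Post hoc: given effect size, n per group, and alpha, compute achieved power.
achieved = analysis.power(effect_size=0.5, nobs1=40, alpha=0.05)
print(f"Power with 40 per group ≈ {achieved:.2f}")       # ≈ 0.60
```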
Section 4: Sample Size Planning
One-Sample t-Test
To detect difference from known value:
n = (z_α + z_β)² × σ² / δ²
where:
z_α = critical value for the significance level (use z_α/2 for a two-sided test)
z_β = critical value for the desired power
σ = population standard deviation
δ = raw difference to detect
Simpler using Cohen’s d:
n = ((z_α + z_β) / d)²
where d = δ / σ = Cohen's d
Example:
- Significance: two-sided α = 0.05 (z_α = 1.96)
- Power: 80% (z_β = 0.84)
- Expected effect: medium (d = 0.5)
- n = ((1.96 + 0.84) / 0.5)² ≈ 31.4 → 32 subjects (round up)
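The same z-approximation in a few lines of Python (a sketch; exact t-based software gives a slightly larger n):

```python
import math
from scipy.stats import norm

alpha, power, d = 0.05, 0.80, 0.5
z_alpha = norm.ppf(1 - alpha / 2)   # ≈ 1.96 for a two-sided test
z_beta = norm.ppf(power)            # ≈ 0.84

n = ((z_alpha + z_beta) / d) ** 2
print(math.ceil(n))                 # 32 after rounding up
```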
Two-Sample t-Test (Independent)
To detect difference between two groups:
n per group = 2 × ((z_α + z_β) / d)²
Same formula as one-sample, but doubled.
Example (from above):
- n = 2 × 31.4 ≈ 63 per group (about 126 total); exact t-test software gives 64 per group
Paired t-Test
Same formula as the one-sample test, applied to the difference scores (so d should be based on the standard deviation of the differences, which depends on the correlation between measurements)
Proportion Test
To detect difference in proportions:
n per group = 2 × p(1-p) × ((z_α + z_β) / (p₁ - p₂))²
where:
p = average proportion = (p₁ + p₂) / 2
p₁, p₂ = group proportions
Example:
- Detect a difference between 40% and 50%
- p = 0.45
- Two-sided α = 0.05, 80% power
- n ≈ 388 per group
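A sketch of the same calculation with statsmodels, which uses Cohen's h (an arcsine-based effect size for proportions) and lands close to the hand formula above:

```python
from statsmodels.stats.proportion import proportion_effectsize
from statsmodels.stats.power import NormalIndPower

h = proportion_effectsize(0.50, 0.40)    # Cohen's h ≈ 0.20
n_per_group = NormalIndPower().solve_power(effect_size=h, alpha=0.05, power=0.80)
print(f"h = {h:.2f}, n per group ≈ {n_per_group:.0f}")   # ≈ 387
```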
ANOVA (Multiple Groups)
Depends on:
- Number of groups (k)
- Effect size (f, related to η)
- Power and α
General rule: Need larger n with more groups
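A sketch of an ANOVA sample-size calculation with statsmodels, using Cohen's f (f = √(η² / (1 − η²)); f ≈ 0.25 is the conventional medium benchmark); the resulting numbers are approximate:

```python
from statsmodels.stats.power import FTestAnovaPower

# Total sample size for k = 3 groups, medium effect f = 0.25, alpha = 0.05, 80% power.
n_total = FTestAnovaPower().solve_power(effect_size=0.25, alpha=0.05,
                                        power=0.80, k_groups=3)
print(f"Total n ≈ {n_total:.0f} (about {n_total / 3:.0f} per group)")
```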
Section 5: Type I and Type II Error Tradeoff
The 2×2 Error Table
| True State | Reject H₀ | Fail to Reject H₀ |
|---|---|---|
| H₀ True | Type I error (α) | Correct |
| H₁ True (H₀ False) | Correct (Power = 1-β) | Type II error (β) |
Managing Errors
Type I Error (α):
- False positive
- Concluding effect exists when it doesn’t
- Controlled by choosing significance level
- Default: α = 0.05
Type II Error (β):
- False negative
- Missing real effect
- Reduced by increasing power
- Default aim: β = 0.20 (80% power)
Error Tradeoff
For fixed sample size:
- Decrease α → Increase β (narrower rejection region)
- Increase α → Decrease β (wider rejection region)
Solution: Increase sample size to reduce both simultaneously
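A quick numerical illustration of the tradeoff for a fixed design (d = 0.5, 50 per group; a sketch using statsmodels):

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
for alpha in (0.10, 0.05, 0.01):
    pwr = analysis.power(effect_size=0.5, nobs1=50, alpha=alpha)
    print(f"alpha = {alpha:.2f} -> power ≈ {pwr:.2f}")   # stricter alpha, lower power
```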
Choosing Error Rates
When α more critical (Type I error costly):
- Medical: False diagnosis serious
- Legal: False conviction serious
- Use α = 0.01
When β more critical (Type II error costly):
- Medical: Missing disease serious
- Job training: Missing effective program serious
- Use lower β (higher power, e.g., 90%)
Default: α = 0.05, β = 0.20
Section 6: Practical Power Analysis
Prospective (Pre-Study) Power Analysis
When to do: Before starting study
Question: How many subjects needed?
Steps:
- Define effect size (small, medium, large OR estimate from literature)
- Choose power (usually 80%)
- Choose α (usually 0.05)
- Calculate required n
Outcome: Tells you if study is feasible with available resources
Post-Hoc Power Analysis
When to do: After completing study
Question: What power did we actually have?
Use sparingly:
- If p > 0.05, post-hoc power typically low (expected)
- If p < 0.05, post-hoc power typically high (expected)
- Usually not informative
Better: Report effect size and confidence interval
Section 7: Achieving Adequate Power
Option 1: Increase Sample Size
Most direct method
- Costs money and time
- Usually feasible
- Straightforward calculation
Rule of thumb (see the sketch below):
- Quadrupling the sample size halves the standard error, so an effect half as large can be detected with the same power
- Doubling the sample size gives a smaller, diminishing gain in power
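A sketch showing how power grows with sample size for a fixed small-to-medium effect (d = 0.3), using statsmodels:

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
for n_per_group in (25, 50, 100, 200):
    pwr = analysis.power(effect_size=0.3, nobs1=n_per_group, alpha=0.05)
    print(f"n per group = {n_per_group:3d} -> power ≈ {pwr:.2f}")
```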
Option 2: Accept Smaller Effect Size
If power inadequate with current n:
- Maybe can detect only larger effects
- Still valuable if large effect is clinically meaningful
Example:
- Planned to detect d = 0.5, but only about 100 subjects (50 per group) are available
- That sample gives 80% power only for effects of roughly d = 0.6 or larger
- May be acceptable depending on context
Option 3: Relax Significance Level
Use α = 0.10 instead of 0.05
- Increases Type I error risk
- Only justified if Type II error more critical
Option 4: Use More Efficient Test
- A paired t-test is more powerful than an independent t-test (when measurements are positively correlated)
- Parametric tests are more powerful than non-parametric tests (when their assumptions hold)
- Stratified sampling is more efficient than simple random sampling
Option 5: Improve Measurement
- Reduce error in measurements
- Use more reliable instruments
- Decrease within-group variance
Section 8: Underpowered Studies
Problem: Many published studies are underpowered (power < 80%)
Consequences:
- True effects might be missed (Type II error)
- Published effects tend to be inflated (publication bias)
- Results don’t replicate in larger studies
Typical power to detect medium effects in published psychology studies: ~50% (Cohen, 1962)
Why Studies Underpowered?
- Budget constraints - Cost limits sample size
- Time constraints - Limited recruitment period
- Availability - Limited access to subjects
- Ignorance - Researchers don’t calculate power
- Multiple comparisons - Corrections require a stricter per-test α, which lowers power
Improving Situation
- ✅ Calculate power prospectively - Plan adequate n before starting
- ✅ Report effect sizes - Not just p-values
- ✅ Combine studies - Meta-analysis with multiple small studies
- ✅ Use sequential testing - Stop early if effect clear
- ✅ Register studies - Pre-registration reduces bias
Section 9: Practical Examples
Example 1: Medical Intervention Study
Scenario: Test new treatment vs placebo for blood pressure
Planning:
- Current treatment: Average BP 150 mmHg
- Goal: Detect 10 mmHg reduction (practical significance)
- Estimate SD: 20 mmHg
- Effect size: d = 10/20 = 0.5 (medium)
Power calculation:
- Two-sided α = 0.05
- 80% power
- n per group = 2 × ((1.96 + 0.84) / 0.5)² ≈ 63 by the z approximation; exact t-test software gives 64
Conclusion: Plan for about 128 subjects total (64 per group)
Example 2: Educational Intervention
Scenario: Compare two teaching methods
Planning:
- Current method: Average test score 75, SD = 15
- Goal: Detect 7-point improvement
- Effect size: d = 7/15 = 0.47 ≈ 0.5 (medium)
Power calculation:
- Two-sided α = 0.05, power = 0.80
- Paired design (same students measured twice), so the one-sample formula applies to the difference scores
- n = ((1.96 + 0.84) / 0.5)² ≈ 31.4 → 32 students
Conclusion: About 32 students suffice if the same students are measured twice, assuming d = 0.5 holds for the difference scores
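A quick check of this example with statsmodels' one-sample/paired power class (the exact t-based answer comes out slightly larger than the z approximation used above):

```python
from statsmodels.stats.power import TTestPower

n = TTestPower().solve_power(effect_size=0.5, alpha=0.05, power=0.80)
print(f"n ≈ {n:.1f}")   # ≈ 33.4, i.e. round up to 34 students
```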
Example 3: Survey of Proportions
Scenario: Compare voting preference between demographics
Planning:
- Current: 50% support
- Goal: Detect 10-point difference (45% vs 55%)
- Effect: Small (Cohen's h ≈ 0.2)
Power calculation:
- p = 0.50, so n per group = 2 × 0.25 × ((1.96 + 0.84) / 0.10)² ≈ 392
- Total: roughly 784 respondents
Section 10: Best Practices
Before Study
- ✅ Calculate required sample size - Based on effect size
- ✅ Justify effect size - From literature or pilot data
- ✅ Document power analysis - For grant proposal
- ✅ Plan for attrition - Oversample if subjects may drop out
- ✅ Use power analysis software - G*Power, R, or online tools
During Study
- ✅ Track progress - Interim analyses okay if pre-planned
- ✅ Monitor quality - Good data collection reduces error
- ✅ Document issues - Deviations from plan
Reporting Results
- ✅ Report effect sizes - With confidence intervals
- ✅ Report the planned (a priori) power analysis - Post-hoc power is rarely informative
- ✅ Discuss practical significance - Beyond statistical significance
- ✅ Acknowledge limitations - If underpowered
Interpretation
- ✅ Significant + large effect: Strong evidence
- ✅ Significant + small effect: Real but small effect
- ✅ Not significant + adequate power: Likely no effect of the size the study was designed to detect
- ❌ Not significant + low power: Inconclusive
- ❌ Not significant + wide CI: Uncertainty remains
Common Mistakes
- ❌ Confusing statistical and practical significance
- ❌ Not calculating power prospectively
- ❌ Assuming p < 0.05 means large effect
- ❌ Using post-hoc power to justify non-significant results
- ❌ Ignoring effect size in favor of p-values
- ❌ Not accounting for multiple comparisons
- ❌ Overstating small effects as meaningful
Power Analysis Software
Free and online tools:
- G*Power (free desktop software)
- ClinCalc (online calculators)
- StatCom
R Packages:
- pwr
- WebPower
Other:
- Stata power command
- SAS proc power
Related Topics
- Sample Size Planning - Practical implementation
- Hypothesis Testing - Understand Type I errors
- Confidence Intervals - Effect size in interval form
- Meta-Analysis - Combine effect sizes across studies
Summary
Key Takeaways:
- Report effect sizes - Not just p-values
- Plan adequate power - Prospectively calculate sample size
- Practical significance - Consider real-world importance
- Avoid underpowered studies - More likely to miss real effects
- Document assumptions - Be transparent about effect size estimates
Statistical significance tells you an effect exists; effect size tells you if you should care.