Statistical significance (p-value < 0.05) tells you an effect exists; effect size tells you how big it is. Large samples can flag tiny, practically meaningless effects as “significant.” This guide covers effect sizes and statistical power: two critical concepts often ignored in statistics education but essential for good research.

This comprehensive guide covers effect sizes, power analysis, and practical study planning.

Understanding Effect Size

Why Effect Size Matters

Problem with p-values alone:

  • Large sample → Small p-value even for tiny effect
  • Small sample → Large p-value even for big effect
  • P-value depends on both effect size AND sample size

Example:

  • Study 1: 50 subjects, r = 0.30, p = 0.03 (significant)
  • Study 2: 1000 subjects, r = 0.05, p = 0.02 (significant)
  • Does Study 2 provide stronger evidence of practical significance? No!
  • Study 1 has a much larger effect (r = 0.30 vs 0.05)

Solution: Report effect size (not just p-value)

What is Effect Size?

Effect size = Magnitude of difference or relationship, independent of sample size

Characteristics:

  • Standardized (unitless, comparable across studies)
  • Independent of sample size
  • Focuses on practical significance
  • Enables meta-analysis and comparison

Section 1: Effect Sizes for Comparing Means

Cohen’s d

Most common effect size for comparing two group means.

Formula:

d = (m₁ - m₂) / s_pooled

where:
m₁, m₂ = group means
s_pooled = pooled standard deviation

Pooled SD:

s_pooled = √[((n₁-1)s₁² + (n₂-1)s₂²) / (n₁ + n₂ - 2)]
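
A minimal Python sketch of this calculation (the function name and sample arrays are illustrative, not from the text):

```python
import numpy as np

def cohens_d(x1, x2):
    """Cohen's d for two independent samples, using the pooled SD."""
    x1, x2 = np.asarray(x1, dtype=float), np.asarray(x2, dtype=float)
    n1, n2 = len(x1), len(x2)
    s_pooled = np.sqrt(((n1 - 1) * x1.var(ddof=1) + (n2 - 1) * x2.var(ddof=1))
                       / (n1 + n2 - 2))
    return (x1.mean() - x2.mean()) / s_pooled

# Hypothetical data for two small groups
print(round(cohens_d([5.1, 6.2, 5.8, 6.0], [4.2, 4.9, 5.0, 4.4]), 2))
```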

Cohen’s d Interpretation

Standard benchmarks:

  • d = 0.2: Small effect
  • d = 0.5: Medium effect
  • d = 0.8: Large effect

Context matters:

  • What’s “large” varies by field
  • Medical: Small effect might be clinically important
  • Psychology: Large effects often hard to achieve

Examples:

  • IQ increase of 3 points: d ≈ 0.2 (small)
  • IQ increase of 7.5 points: d ≈ 0.5 (medium)
  • IQ increase of 12 points: d ≈ 0.8 (large)

Hedges’ g

Cohen’s d adjusted for small samples.

Same interpretation as Cohen’s d, but d is multiplied by a correction factor (approximately 1 − 3/(4(n₁ + n₂) − 9)) to remove the slight upward bias Cohen’s d has in small samples.
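
Reusing the cohens_d sketch above, the correction is a one-line adjustment (the 3/(4(n₁ + n₂) − 9) term is the common approximation to the exact bias correction):

```python
def hedges_g(x1, x2):
    """Hedges' g: Cohen's d scaled by an approximate small-sample bias correction."""
    n1, n2 = len(x1), len(x2)
    correction = 1 - 3 / (4 * (n1 + n2) - 9)   # approaches 1 as samples grow
    return cohens_d(x1, x2) * correction       # cohens_d defined in the sketch above
```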

Effect Size for t-Tests

From sample means:

d = (x̄₁ - x̄₂) / √(((n₁-1)s₁² + (n₂-1)s₂²) / (n₁ + n₂ - 2))

From t-statistic:

d = t × √(1/n₁ + 1/n₂)
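
The t-to-d conversion is equally short (a sketch; the t value and group sizes below are made up):

```python
import math

def d_from_t(t, n1, n2):
    """Convert an independent-samples t statistic to Cohen's d."""
    return t * math.sqrt(1 / n1 + 1 / n2)

print(round(d_from_t(2.5, 30, 30), 2))   # t = 2.5 with 30 per group -> d ≈ 0.65
```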

From p-value:

  • Approximate using online calculators or conversion tables

Section 2: Effect Sizes for Categorical Data

Phi Coefficient (φ)

Effect size for 2×2 contingency table (binary × binary)

Formula:

φ = √(χ² / N)

where:
χ² = chi-square statistic
N = total sample size

Range: 0 to 1

Interpretation:

  • φ = 0.1: Small association
  • φ = 0.3: Medium association
  • φ = 0.5: Large association

Cramer’s V

Effect size for larger contingency tables

Formula:

V = √(χ² / (N × (min(r,c) - 1)))

where:
r = number of rows
c = number of columns

Range: 0 to 1

Interpretation: Same as Phi (0.1, 0.3, 0.5)
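
A short sketch using scipy (the 2×2 table is hypothetical); the same function gives φ when the table is 2×2:

```python
import numpy as np
from scipy.stats import chi2_contingency

def cramers_v(table):
    """Cramer's V from a contingency table; equals phi for a 2x2 table."""
    table = np.asarray(table)
    chi2, p, dof, expected = chi2_contingency(table, correction=False)
    n = table.sum()
    return np.sqrt(chi2 / (n * (min(table.shape) - 1)))

# Hypothetical 2x2 table: treatment (rows) by outcome (columns)
print(round(cramers_v([[30, 20], [15, 35]]), 2))
```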

Eta (η) for ANOVA

Effect size for one-way ANOVA

Formula:

η = √(SS_between / SS_total)

η² (= SS_between / SS_total) is the proportion of variance explained by group membership; η is its square root.

Interpretation:

  • η = 0.1: Small effect
  • η = 0.3: Medium effect
  • η = 0.5: Large effect
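
A sketch of the calculation from raw group data (numpy assumed; the groups are made up):

```python
import numpy as np

def eta(groups):
    """Eta for one-way ANOVA: sqrt(SS_between / SS_total)."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    all_values = np.concatenate(groups)
    grand_mean = all_values.mean()
    ss_total = ((all_values - grand_mean) ** 2).sum()
    ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    return np.sqrt(ss_between / ss_total)

print(round(eta([[1, 2, 3], [2, 3, 4], [5, 6, 7]]), 2))
```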


Section 3: Statistical Power

What is Power?

Power (1 - β) = Probability of rejecting null hypothesis when it’s false (detecting real effect)

Alternative: β = Type II error rate = Probability of missing real effect

Relationship: Power = 1 - β

Standard goal: 80% power (20% Type II error)

Power Depends On

1. Effect Size

  • Larger effect → Easier to detect → More power
  • Smaller effect → Harder to detect → Less power

2. Sample Size

  • Larger sample → More power
  • Smaller sample → Less power

3. Significance Level (α)

  • Stricter α (0.01 vs 0.05) → Less power
  • More lenient α → More power

4. Type of Test

  • One-tailed vs two-tailed
  • Parametric vs non-parametric
  • Specific statistical test used

Power Analysis Decision Tree

Before Study (A priori):

  • Want to know: Required sample size
  • Give: Effect size, power, α
  • Find: n

After Study (Post hoc):

  • Want to know: Achieved power
  • Give: Effect size, sample size, α
  • Find: power (rarely useful)
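
Both branches can be run in a few lines with statsmodels, assuming an independent-samples t-test (the effect size and group size below are illustrative):

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# A priori: required n per group for d = 0.5, alpha = 0.05, 80% power
n_per_group = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.80)
print(round(n_per_group))        # about 64 per group

# Post hoc: power actually achieved with 40 subjects per group
achieved_power = analysis.solve_power(effect_size=0.5, nobs1=40, alpha=0.05)
print(round(achieved_power, 2))  # about 0.60
```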

Section 4: Sample Size Planning

One-Sample t-Test

To detect difference from known value:

n = (z_α + z_β)² × σ² / δ²

where:
z_α = critical value for the significance level (1.96 for a two-tailed test at α = 0.05)
z_β = critical value for the desired power (0.84 for 80% power)
σ = population standard deviation
δ = effect size (difference to detect)

Simpler using Cohen’s d:

n = ((z_α + z_β) / d)²

where d = δ / σ = Cohen's d
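
A minimal check of this formula via the normal approximation (scipy assumed):

```python
from scipy.stats import norm

def n_one_sample(d, alpha=0.05, power=0.80):
    """Normal-approximation sample size for a one-sample (or paired) t-test."""
    z_alpha = norm.ppf(1 - alpha / 2)   # two-tailed critical value
    z_beta = norm.ppf(power)
    return ((z_alpha + z_beta) / d) ** 2

print(n_one_sample(0.5))   # about 31.4 -> round up to 32
```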

Example:

  • α = 0.05, two-tailed (z_α = 1.96)
  • Want 80% power (z_β = 0.84)
  • Expect a medium effect (d = 0.5)
  • n = ((1.96 + 0.84) / 0.5)² ≈ 31.4, so round up to 32 subjects

Two-Sample t-Test (Independent)

To detect difference between two groups:

n per group = 2 × ((z_α + z_β) / d)²

Same formula as one-sample, but doubled.

Example (from above):

  • n = 2 × 31.4 ≈ 63 subjects per group (126 total); exact t-based calculations give about 64 per group

Paired t-Test

Same formula as the one-sample test, applied to the within-pair differences (d is the mean difference divided by the SD of the differences)

Proportion Test

To detect difference in proportions:

n per group = 2 × p(1-p) × ((z_α + z_β) / (p₁ - p₂))²

where:
p = average proportion = (p₁ + p₂) / 2
p₁, p₂ = group proportions

Example:

  • Detect a difference between 40% and 50%
  • p = 0.45
  • α = 0.05 (two-tailed), 80% power
  • n ≈ 388 per group
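
The hand calculation can be cross-checked with statsmodels, which uses the arcsine-based effect size (Cohen’s h) rather than the pooled-variance formula above, so the answer differs slightly:

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

h = proportion_effectsize(0.50, 0.40)   # Cohen's h for 50% vs 40%
n_per_group = NormalIndPower().solve_power(effect_size=h, alpha=0.05, power=0.80)
print(round(n_per_group))   # roughly 385-390 per group
```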

ANOVA (Multiple Groups)

Depends on:

  • Number of groups (k)
  • Effect size (f, related to η)
  • Power and α

General rule: Need larger n with more groups
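
statsmodels can solve for the total ANOVA sample size given Cohen’s f (f ≈ 0.25 is the conventional “medium” benchmark; the three-group setup is illustrative):

```python
from statsmodels.stats.power import FTestAnovaPower

# Total n for a one-way ANOVA with 3 groups, f = 0.25, alpha = 0.05, 80% power
n_total = FTestAnovaPower().solve_power(effect_size=0.25, alpha=0.05,
                                        power=0.80, k_groups=3)
print(round(n_total))   # roughly 155-160 total, i.e. about 52-53 per group
```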


Section 5: Type I and Type II Error Tradeoff

The 2×2 Error Table

True State              Reject H₀                  Fail to Reject H₀
H₀ true                 Type I error (α)           Correct
H₀ false (H₁ true)      Correct (power = 1 − β)    Type II error (β)

Managing Errors

Type I Error (α):

  • False positive
  • Concluding effect exists when it doesn’t
  • Controlled by choosing significance level
  • Default: α = 0.05

Type II Error (β):

  • False negative
  • Missing real effect
  • Reduced by increasing power
  • Default aim: β = 0.20 (80% power)

Error Tradeoff

For fixed sample size:

  • Decrease α → Increase β (narrower rejection region)
  • Increase α → Decrease β (wider rejection region)

Solution: Increase sample size to reduce both simultaneously

Choosing Error Rates

When α more critical (Type I error costly):

  • Medical: False diagnosis serious
  • Legal: False conviction serious
  • Use α = 0.01

When β more critical (Type II error costly):

  • Medical: Missing disease serious
  • Job training: Missing effective program serious
  • Use lower β (higher power, e.g., 90%)

Default: α = 0.05, β = 0.20


Section 6: Practical Power Analysis

Prospective (Pre-Study) Power Analysis

When to do: Before starting study

Question: How many subjects needed?

Steps:

  1. Define effect size (small, medium, large OR estimate from literature)
  2. Choose power (usually 80%)
  3. Choose α (usually 0.05)
  4. Calculate required n

Outcome: Tells you if study is feasible with available resources

Post-Hoc Power Analysis

When to do: After completing study

Question: What power did we actually have?

Use sparingly:

  • If p > 0.05, post-hoc power typically low (expected)
  • If p < 0.05, post-hoc power typically high (expected)
  • Usually not informative

Better: Report effect size and confidence interval


Section 7: Achieving Adequate Power

Option 1: Increase Sample Size

Most direct method

  • Costs money and time
  • Usually feasible
  • Straightforward calculation

Rule of thumb:

  • Power rises roughly with d × √n, so quadrupling the sample size halves the smallest effect detectable at a given power
  • Doubling the sample size increases power substantially, but with diminishing returns as power approaches 1

Option 2: Accept Smaller Effect Size

If power inadequate with current n:

  • Maybe can detect only larger effects
  • Still valuable if large effect is clinically meaningful

Example:

  • Planned to detect d = 0.5 with n = 100 (50 per group)
  • That sample achieves 80% power only for effects of roughly d = 0.6 or larger
  • May be acceptable depending on context

Option 3: Relax Significance Level

Use α = 0.10 instead of 0.05

  • Increases Type I error risk
  • Only justified if Type II error more critical

Option 4: Use More Efficient Test

  • A paired t-test is more powerful than an independent t-test when pairing induces positive correlation
  • Parametric tests are more powerful than non-parametric tests when their assumptions hold
  • Stratified sampling can be more efficient than simple random sampling

Option 5: Improve Measurement

  • Reduce error in measurements
  • Use more reliable instruments
  • Decrease within-group variance

Section 8: Underpowered Studies

Problem: Many published studies are underpowered (power < 80%)

Consequences:

  • True effects might be missed (Type II error)
  • Published effects tend to be inflated (publication bias)
  • Results don’t replicate in larger studies

Median power in published studies: ~50% (Cohen, 1962)

Why Studies Underpowered?

  1. Budget constraints - Cost limits sample size
  2. Time constraints - Limited recruitment period
  3. Availability - Limited access to subjects
  4. Ignorance - Researchers don’t calculate power
  5. Multiple comparisons - Corrections tighten the per-test α, which reduces power

Improving Situation

  1. Calculate power prospectively - Plan adequate n before starting
  2. Report effect sizes - Not just p-values
  3. Combine studies - Meta-analysis with multiple small studies
  4. Use sequential testing - Stop early if effect clear
  5. Register studies - Pre-registration reduces bias

Section 9: Practical Examples

Example 1: Medical Intervention Study

Scenario: Test new treatment vs placebo for blood pressure

Planning:

  • Current treatment: Average BP 150 mmHg
  • Goal: Detect 10 mmHg reduction (practical significance)
  • Estimate SD: 20 mmHg
  • Effect size: d = 10/20 = 0.5 (medium)

Power calculation:

  • α = 0.05 (two-tailed)
  • Want 80% power
  • n = 2 × ((1.96 + 0.84) / 0.5)² ≈ 63 per group; exact t-based calculations give 64

Conclusion: Need 128 subjects total (64 per group)

Example 2: Educational Intervention

Scenario: Compare two teaching methods

Planning:

  • Current method: Average test score 75, SD = 15
  • Goal: Detect 7-point improvement
  • Effect size: d = 7/15 = 0.47 ≈ 0.5 (medium)

Power calculation:

  • α = 0.05, power = 0.80
  • Paired design (same students measured twice), so the one-sample formula applies to the difference scores
  • n = ((1.96 + 0.84) / 0.5)² ≈ 31.4, round up to 32 students

Conclusion: About 32 students suffice if the same students are measured twice (assuming d ≈ 0.5 holds for the difference scores; an exact t-based check follows below)
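
As a check, statsmodels’ TTestPower (which covers one-sample and paired designs) gives a slightly larger n than the normal approximation used above:

```python
from statsmodels.stats.power import TTestPower

# Paired design: assume d = 0.5 for the difference scores
n_pairs = TTestPower().solve_power(effect_size=0.5, alpha=0.05, power=0.80)
print(round(n_pairs, 1))   # about 33.4 -> round up to 34 pairs
```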

Example 3: Survey of Proportions

Scenario: Compare voting preference between demographics

Planning:

  • Current: 50% support
  • Goal: Detect 10-point difference (45% vs 55%)
  • Effect: Small-to-medium

Power calculation:

  • n ≈ 392 per group
  • Total: ≈ 784 respondents

Section 10: Best Practices

Before Study

  1. Calculate required sample size - Based on effect size
  2. Justify effect size - From literature or pilot data
  3. Document power analysis - For grant proposal
  4. Plan for attrition - Oversample if subjects may drop out
  5. Use power analysis software - G*Power, R, or online tools

During Study

  1. Track progress - Interim analyses okay if pre-planned
  2. Monitor quality - Good data collection reduces error
  3. Document issues - Deviations from plan

Reporting Results

  1. Report effect sizes - With confidence intervals
  2. Report the a priori power analysis - More informative than post-hoc power
  3. Discuss practical significance - Beyond statistical significance
  4. Acknowledge limitations - If underpowered

Interpretation

  1. Significant + large effect: Strong evidence
  2. Significant + small effect: Real but small effect
  3. Not significant + adequate power: Likely no effect
  4. Not significant + low power: Inconclusive
  5. Non-significant + wide confidence interval: Uncertainty remains

Common Mistakes

  1. ❌ Confusing statistical and practical significance
  2. ❌ Not calculating power prospectively
  3. ❌ Assuming p < 0.05 means large effect
  4. ❌ Using post-hoc power to justify non-significant results
  5. ❌ Ignoring effect size in favor of p-values
  6. ❌ Not accounting for multiple comparisons
  7. ❌ Overstating small effects as meaningful

Power Analysis Software

Desktop and Online Tools:

  • G*Power (free desktop software)
  • ClinCalc (online calculators)
  • StatCom

R Packages:

  • pwr
  • WebPower

Other:

  • Stata power command
  • SAS proc power


Summary

Key Takeaways:

  1. Report effect sizes - Not just p-values
  2. Plan adequate power - Prospectively calculate sample size
  3. Practical significance - Consider real-world importance
  4. Avoid underpowered studies - More likely to miss real effects
  5. Document assumptions - Be transparent about effect size estimates

Statistical significance tells you an effect exists; effect size tells you if you should care.