Statistical significance (p-value < 0.05) tells you an effect exists; effect size tells you how big it is. Large samples can flag tiny, practically meaningless effects as "significant." This guide covers effect sizes and statistical power: two critical concepts often neglected in statistics education but essential for good research.
It walks through effect-size measures, power analysis, and practical study planning.
Understanding Effect Size
Why Effect Size Matters
Problem with p-values alone:
- Large sample → Small p-value even for tiny effect
- Small sample → Large p-value even for big effect
- P-value depends on both effect size AND sample size
Example:
- Study 1: 50 subjects, r = 0.30, p = 0.03 (significant)
- Study 2: 1000 subjects, r = 0.05, p = 0.02 (significant)
- Does Study 2 provide stronger evidence of practical importance? No
- Study 1 has the much larger effect (r = 0.30 vs r = 0.05)
Solution: Report effect size (not just p-value)
What is Effect Size?
Effect size = Magnitude of difference or relationship, independent of sample size
Characteristics:
- Standardized (unitless, comparable across studies)
- Independent of sample size
- Focuses on practical significance
- Enables meta-analysis and comparison
Section 1: Effect Sizes for Comparing Means
Cohen’s d
Most common effect size for comparing two group means.
Formula:
d = (m₁ - m₂) / s_pooled
where:
m₁, m₂ = group means
s_pooled = pooled standard deviation
Pooled SD:
s_pooled = √[((n₁-1)s₁² + (n₂-1)s₂²) / (n₁ + n₂ - 2)]
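A minimal Python sketch of this calculation (the sample data and the cohens_d helper name are illustrative, not from the guide):

```python
import numpy as np

def cohens_d(x, y):
    """Cohen's d for two independent samples, using the pooled SD."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n1, n2 = len(x), len(y)
    s1_sq, s2_sq = x.var(ddof=1), y.var(ddof=1)          # sample variances
    s_pooled = np.sqrt(((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / (n1 + n2 - 2))
    return (x.mean() - y.mean()) / s_pooled

treatment = [23, 25, 28, 30, 27, 26]   # hypothetical scores
control = [20, 22, 24, 25, 23, 21]
print(f"Cohen's d = {cohens_d(treatment, control):.2f}")
```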
Cohen’s d Interpretation
Standard benchmarks:
- d = 0.2: Small effect
- d = 0.5: Medium effect
- d = 0.8: Large effect
Context matters:
- What’s “large” varies by field
- Medical: Small effect might be clinically important
- Psychology: Large effects often hard to achieve
Examples:
- IQ increase of 3 points: d ≈ 0.2 (small)
- IQ increase of 7.5 points: d ≈ 0.5 (medium)
- IQ increase of 12 points: d ≈ 0.8 (large)
Hedges’ g
An adjusted Cohen’s d for small samples.
Interpreted the same way as Cohen’s d, but it corrects d’s slight upward bias when samples are small.
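As a sketch, one common approximation of the correction factor is J ≈ 1 − 3 / (4(n₁ + n₂) − 9); this builds on the cohens_d helper above:

```python
def hedges_g(x, y):
    """Cohen's d with the small-sample bias correction (Hedges' g)."""
    n1, n2 = len(x), len(y)
    correction = 1 - 3 / (4 * (n1 + n2) - 9)   # approximate correction factor J
    return cohens_d(x, y) * correction

print(f"Hedges' g = {hedges_g(treatment, control):.2f}")   # slightly smaller than d
```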
Effect Size for t-Tests
From sample means:
d = (x̄₁ - x̄₂) / √(((n₁-1)s₁² + (n₂-1)s₂²) / (n₁ + n₂ - 2))
From t-statistic:
d = t × √(1/n₁ + 1/n₂)
From p-value:
- Approximate using online calculators or conversion tables
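A one-line sketch of the t-to-d conversion above (the numbers are made up for illustration):

```python
import numpy as np

def d_from_t(t, n1, n2):
    """Convert an independent-samples t statistic to Cohen's d."""
    return t * np.sqrt(1 / n1 + 1 / n2)

print(round(d_from_t(2.5, 30, 30), 2))   # ≈ 0.65 for t = 2.5 with 30 per group
```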
Section 2: Effect Sizes for Categorical Data
Phi Coefficient (φ)
Effect size for 2×2 contingency table (binary × binary)
Formula:
φ = √(χ² / N)
where:
χ² = chi-square statistic
N = total sample size
Range: 0 to 1
Interpretation:
- φ = 0.1: Small association
- φ = 0.3: Medium association
- φ = 0.5: Large association
Cramer’s V
Effect size for larger contingency tables
Formula:
V = √(χ² / (N × (min(r,c) - 1)))
where:
r = number of rows
c = number of columns
Range: 0 to 1
Interpretation: Same benchmarks as phi (0.1, 0.3, 0.5) when the smaller table dimension is 2; the benchmarks shrink as min(r, c) grows
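A sketch of both measures from a contingency table using scipy (the counts are hypothetical; correction=False turns off the Yates correction so the result matches the formulas above):

```python
import numpy as np
from scipy.stats import chi2_contingency

table = np.array([[30, 20],     # hypothetical 2x2 counts
                  [15, 35]])

chi2, p, dof, expected = chi2_contingency(table, correction=False)
n = table.sum()
phi = np.sqrt(chi2 / n)                                   # for 2x2 tables
cramers_v = np.sqrt(chi2 / (n * (min(table.shape) - 1)))  # general tables

print(f"chi2 = {chi2:.2f}, phi = {phi:.2f}, Cramer's V = {cramers_v:.2f}")
```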
Eta (η) and Eta-Squared (η²) for ANOVA
Effect size for one-way ANOVA
Formula:
η² = SS_between / SS_total
η = √(SS_between / SS_total)
η² is the proportion of variance explained by group membership; η is its square root.
Interpretation (conventional benchmarks for η²):
- η² = 0.01: Small effect
- η² = 0.06: Medium effect
- η² = 0.14: Large effect
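A short sketch computing η² and η directly from sums of squares (the group data are hypothetical):

```python
import numpy as np

groups = [np.array([4.0, 5, 6, 5, 4]),   # hypothetical scores per group
          np.array([6.0, 7, 8, 7, 6]),
          np.array([5.0, 6, 7, 6, 5])]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()

ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_total = ((all_values - grand_mean) ** 2).sum()

eta_sq = ss_between / ss_total        # proportion of variance explained
eta = np.sqrt(eta_sq)
print(f"eta^2 = {eta_sq:.2f}, eta = {eta:.2f}")
```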
Section 3: Statistical Power
What is Power?
Power (1 - β) = Probability of rejecting null hypothesis when it’s false (detecting real effect)
Conversely: β = Type II error rate = Probability of missing a real effect
Relationship: Power = 1 - β
Standard goal: 80% power (20% Type II error)
Power Depends On
1. Effect Size
- Larger effect → Easier to detect → More power
- Smaller effect → Harder to detect → Less power
2. Sample Size
- Larger sample → More power
- Smaller sample → Less power
3. Significance Level (α)
- Stricter α (0.01 vs 0.05) → Less power
- More lenient α → More power
4. Type of Test
- One-tailed vs two-tailed
- Parametric vs non-parametric
- Specific statistical test used
Power Analysis Decision Tree
Before Study (A priori):
- Want to know: Required sample size
- Give: Effect size, power, α
- Find: n
After Study (Post hoc):
- Want to know: Achieved power
- Give: Effect size, sample size, α
- Find: power (rarely useful)
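Both branches of the decision tree can be sketched with statsmodels' power classes (values are approximate; the effect size and sample size below are illustrative):

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# A priori: given effect size, alpha, and power, solve for n per group.
n_per_group = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.80)
print(f"Required n per group ≈ {n_per_group:.1f}")      # ≈ 64 after rounding up

# Post hoc: given effect size, n per group, and alpha, compute achieved power.
achieved = analysis.power(effect_size=0.5, nobs1=40, alpha=0.05)
print(f"Power with 40 per group ≈ {achieved:.2f}")       # ≈ 0.60
```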
Section 4: Sample Size Planning
One-Sample t-Test
To detect difference from known value:
n = (z_α + z_β)² × σ² / δ²
where:
z_α = critical value for the significance level (use z_α/2 for a two-sided test)
z_β = critical value for the desired power
σ = population standard deviation
δ = raw difference to detect
Simpler using Cohen’s d:
n = ((z_α + z_β) / d)²
where d = δ / σ = Cohen's d
Example:
- Significance: two-sided α = 0.05 (z_α = 1.96)
- Power: 80% (z_β = 0.84)
- Expected effect: medium (d = 0.5)
- n = ((1.96 + 0.84) / 0.5)² ≈ 31.4 → 32 subjects (round up)
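The same z-approximation in a few lines of Python (a sketch; exact t-based software gives a slightly larger n):

```python
import math
from scipy.stats import norm

alpha, power, d = 0.05, 0.80, 0.5
z_alpha = norm.ppf(1 - alpha / 2)   # ≈ 1.96 for a two-sided test
z_beta = norm.ppf(power)            # ≈ 0.84

n = ((z_alpha + z_beta) / d) ** 2
print(math.ceil(n))                 # 32 after rounding up
```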
Two-Sample t-Test (Independent)
To detect difference between two groups:
n per group = 2 × ((z_α + z_β) / d)²
Same formula as one-sample, but doubled.
Example (from above):
- n = 2 × 31.4 ≈ 63 per group (about 126 total); exact t-test software gives 64 per group
Paired t-Test
Same formula as the one-sample test, applied to the difference scores (so d should be based on the standard deviation of the differences, which depends on the correlation between measurements)
Proportion Test
To detect difference in proportions:
n per group = 2 × p(1-p) × ((z_α + z_β) / (p₁ - p₂))²
where:
p = average proportion = (p₁ + p₂) / 2
p₁, p₂ = group proportions
Example:
- Detect a difference between 40% and 50%
- p = 0.45
- Two-sided α = 0.05, 80% power
- n ≈ 388 per group
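A sketch of the same calculation with statsmodels, which uses Cohen's h (an arcsine-based effect size for proportions) and lands close to the hand formula above:

```python
from statsmodels.stats.proportion import proportion_effectsize
from statsmodels.stats.power import NormalIndPower

h = proportion_effectsize(0.50, 0.40)    # Cohen's h ≈ 0.20
n_per_group = NormalIndPower().solve_power(effect_size=h, alpha=0.05, power=0.80)
print(f"h = {h:.2f}, n per group ≈ {n_per_group:.0f}")   # ≈ 387
```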
ANOVA (Multiple Groups)
Depends on:
- Number of groups (k)
- Effect size (f, related to η)
- Power and α
General rule: Need larger n with more groups
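A sketch of an ANOVA sample-size calculation with statsmodels, using Cohen's f (f = √(η² / (1 − η²)); f ≈ 0.25 is the conventional medium benchmark); the resulting numbers are approximate:

```python
from statsmodels.stats.power import FTestAnovaPower

# Total sample size for k = 3 groups, medium effect f = 0.25, alpha = 0.05, 80% power.
n_total = FTestAnovaPower().solve_power(effect_size=0.25, alpha=0.05,
                                        power=0.80, k_groups=3)
print(f"Total n ≈ {n_total:.0f} (about {n_total / 3:.0f} per group)")
```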
Section 5: Type I and Type II Error Tradeoff
The 2×2 Error Table
| True State | Reject H₀ | Fail to Reject H₀ |
|---|---|---|
| H₀ True | Type I error (α) | Correct |
| H₁ True (H₀ False) | Correct (Power = 1-β) | Type II error (β) |
Managing Errors
Type I Error (α):
- False positive
- Concluding effect exists when it doesn’t
- Controlled by choosing significance level
- Default: α = 0.05
Type II Error (β):
- False negative
- Missing real effect
- Reduced by increasing power
- Default aim: β = 0.20 (80% power)
Error Tradeoff
For fixed sample size:
- Decrease α → Increase β (narrower rejection region)
- Increase α → Decrease β (wider rejection region)
Solution: Increase sample size to reduce both simultaneously
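A quick numerical illustration of the tradeoff for a fixed design (d = 0.5, 50 per group; a sketch using statsmodels):

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
for alpha in (0.10, 0.05, 0.01):
    pwr = analysis.power(effect_size=0.5, nobs1=50, alpha=alpha)
    print(f"alpha = {alpha:.2f} -> power ≈ {pwr:.2f}")   # stricter alpha, lower power
```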
Choosing Error Rates
When α more critical (Type I error costly):
- Medical: False diagnosis serious
- Legal: False conviction serious
- Use α = 0.01
When β more critical (Type II error costly):
- Medical: Missing disease serious
- Job training: Missing effective program serious
- Use lower β (higher power, e.g., 90%)
Default: α = 0.05, β = 0.20
Section 6: Practical Power Analysis
Prospective (Pre-Study) Power Analysis
When to do: Before starting study
Question: How many subjects needed?
Steps:
- Define effect size (small, medium, large OR estimate from literature)
- Choose power (usually 80%)
- Choose α (usually 0.05)
- Calculate required n
Outcome: Tells you if study is feasible with available resources
Post-Hoc Power Analysis
When to do: After completing study
Question: What power did we actually have?
Use sparingly:
- If p > 0.05, post-hoc power typically low (expected)
- If p < 0.05, post-hoc power typically high (expected)
- Usually not informative
Better: Report effect size and confidence interval
Section 7: Achieving Adequate Power
Option 1: Increase Sample Size
Most direct method
- Costs money and time
- Usually feasible
- Straightforward calculation
Rule of thumb (see the sketch below):
- Quadrupling the sample size halves the standard error, so an effect half as large can be detected with the same power
- Doubling the sample size gives a smaller, diminishing gain in power
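A sketch showing how power grows with sample size for a fixed small-to-medium effect (d = 0.3), using statsmodels:

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
for n_per_group in (25, 50, 100, 200):
    pwr = analysis.power(effect_size=0.3, nobs1=n_per_group, alpha=0.05)
    print(f"n per group = {n_per_group:3d} -> power ≈ {pwr:.2f}")
```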
Option 2: Accept Smaller Effect Size
If power inadequate with current n:
- Maybe can detect only larger effects
- Still valuable if large effect is clinically meaningful
Example:
- Planned to detect d = 0.5, but only about 100 subjects (50 per group) are available
- That sample gives 80% power only for effects of roughly d = 0.6 or larger
- May be acceptable depending on context
Option 3: Relax Significance Level
Use α = 0.10 instead of 0.05
- Increases Type I error risk
- Only justified if Type II error more critical
Option 4: Use More Efficient Test
- A paired t-test is more powerful than an independent t-test (when measurements are positively correlated)
- Parametric tests are more powerful than non-parametric tests (when their assumptions hold)
- Stratified sampling is more efficient than simple random sampling
Option 5: Improve Measurement
- Reduce error in measurements
- Use more reliable instruments
- Decrease within-group variance
Section 8: Underpowered Studies
Problem: Many published studies are underpowered (power < 80%)
Consequences:
- True effects might be missed (Type II error)
- Published effects tend to be inflated (publication bias)
- Results don’t replicate in larger studies
Typical power to detect medium effects in published psychology studies: ~50% (Cohen, 1962)
Why Studies Underpowered?
- Budget constraints - Cost limits sample size
- Time constraints - Limited recruitment period
- Availability - Limited access to subjects
- Ignorance - Researchers don’t calculate power
- Multiple comparisons - Corrections require a stricter per-test α, which lowers power
Improving Situation
- ✅ Calculate power prospectively - Plan adequate n before starting
- ✅ Report effect sizes - Not just p-values
- ✅ Combine studies - Meta-analysis with multiple small studies
- ✅ Use sequential testing - Stop early if effect clear
- ✅ Register studies - Pre-registration reduces bias
Section 9: Practical Examples
Example 1: Medical Intervention Study
Scenario: Test new treatment vs placebo for blood pressure
Planning:
- Current treatment: Average BP 150 mmHg
- Goal: Detect 10 mmHg reduction (practical significance)
- Estimate SD: 20 mmHg
- Effect size: d = 10/20 = 0.5 (medium)
Power calculation:
- Two-sided α = 0.05
- 80% power
- n per group = 2 × ((1.96 + 0.84) / 0.5)² ≈ 63 by the z approximation; exact t-test software gives 64
Conclusion: Plan for about 128 subjects total (64 per group)
Example 2: Educational Intervention
Scenario: Compare two teaching methods
Planning:
- Current method: Average test score 75, SD = 15
- Goal: Detect 7-point improvement
- Effect size: d = 7/15 = 0.47 ≈ 0.5 (medium)
Power calculation:
- Two-sided α = 0.05, power = 0.80
- Paired design (same students measured twice), so the one-sample formula applies to the difference scores
- n = ((1.96 + 0.84) / 0.5)² ≈ 31.4 → 32 students
Conclusion: About 32 students suffice if the same students are measured twice, assuming d = 0.5 holds for the difference scores
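A quick check of this example with statsmodels' one-sample/paired power class (the exact t-based answer comes out slightly larger than the z approximation used above):

```python
from statsmodels.stats.power import TTestPower

n = TTestPower().solve_power(effect_size=0.5, alpha=0.05, power=0.80)
print(f"n ≈ {n:.1f}")   # ≈ 33.4, i.e. round up to 34 students
```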
Example 3: Survey of Proportions
Scenario: Compare voting preference between demographics
Planning:
- Current: 50% support
- Goal: Detect 10-point difference (45% vs 55%)
- Effect: Small (Cohen's h ≈ 0.2)
Power calculation:
- p = 0.50, so n per group = 2 × 0.25 × ((1.96 + 0.84) / 0.10)² ≈ 392
- Total: roughly 784 respondents
Section 10: Best Practices
Before Study
- ✅ Calculate required sample size - Based on effect size
- ✅ Justify effect size - From literature or pilot data
- ✅ Document power analysis - For grant proposal
- ✅ Plan for attrition - Oversample if subjects may drop out
- ✅ Use power analysis software - G*Power, R, or online tools
During Study
- ✅ Track progress - Interim analyses okay if pre-planned
- ✅ Monitor quality - Good data collection reduces error
- ✅ Document issues - Deviations from plan
Reporting Results
- ✅ Report effect sizes - With confidence intervals
- ✅ Report the planned (a priori) power analysis - Post-hoc power is rarely informative
- ✅ Discuss practical significance - Beyond statistical significance
- ✅ Acknowledge limitations - If underpowered
Interpretation
- ✅ Significant + large effect: Strong evidence
- ✅ Significant + small effect: Real but small effect
- ✅ Not significant + adequate power: Likely no effect of the size the study was designed to detect
- ❌ Not significant + low power: Inconclusive
- ❌ Not significant + wide CI: Uncertainty remains
Common Mistakes
- ❌ Confusing statistical and practical significance
- ❌ Not calculating power prospectively
- ❌ Assuming p < 0.05 means large effect
- ❌ Using post-hoc power to justify non-significant results
- ❌ Ignoring effect size in favor of p-values
- ❌ Not accounting for multiple comparisons
- ❌ Overstating small effects as meaningful
Power Analysis Software
Free and online tools:
- G*Power (free desktop software)
- ClinCalc (online calculators)
- StatCom
R Packages:
- pwr
- WebPower
Other:
- Stata power command
- SAS proc power
Related Topics
- Sample Size Planning - Practical implementation
- Hypothesis Testing - Understand Type I errors
- Confidence Intervals - Effect size in interval form
- Meta-Analysis - Combine effect sizes across studies
Summary
Key Takeaways:
- Report effect sizes - Not just p-values
- Plan adequate power - Prospectively calculate sample size
- Practical significance - Consider real-world importance
- Avoid underpowered studies - More likely to miss real effects
- Document assumptions - Be transparent about effect size estimates
Statistical significance tells you an effect exists; effect size tells you if you should care.