Introduction to Hypothesis Testing
Hypothesis testing is a statistical method for making decisions about populations based on sample data. It’s one of the most important tools in statistics for:
- Testing scientific claims and theories
- Evaluating business decisions
- Determining if treatments are effective
- Making evidence-based conclusions
Why Hypothesis Testing Matters
In reality, we rarely have access to entire populations. Instead, we collect sample data and use hypothesis testing to make inferences about the population. This systematic approach allows us to:
- Quantify uncertainty in our conclusions
- Control the probability of making errors
- Make objective, data-driven decisions
- Report results with statistical confidence
Section 1: Hypothesis Testing Fundamentals
Core Concepts
Null and Alternative Hypotheses
The null hypothesis (H₀) represents the status quo or “no effect” assumption:
- “There is no difference between groups”
- “Treatment has no effect”
- “Mean equals the claimed value”
The alternative hypothesis (H₁) represents the claim we seek evidence for:
- “There is a difference between groups”
- “Treatment has an effect”
- “Mean differs from the claimed value”
Types of Alternative Hypotheses
Two-Tailed Test: H₁: μ ≠ μ₀
- Testing if something is “different” (either direction)
- More conservative; splits significance level across two tails
- Most common in practice
Right-Tailed Test: H₁: μ > μ₀
- Testing if something is “greater than”
- Use when directional prediction exists
- p-value = P(T > t_obs)
Left-Tailed Test: H₁: μ < μ₀
- Testing if something is “less than”
- p-value = P(T < t_obs)
Type I and Type II Errors
| Scenario | H₀ True | H₀ False |
|---|---|---|
| Reject H₀ | Type I Error (α) | ✓ Correct |
| Fail to Reject H₀ | ✓ Correct | Type II Error (β) |
Type I Error (False Positive):
- Rejecting H₀ when it’s actually true
- Probability = α (significance level, typically 0.05)
- “Crying wolf” - claiming an effect that doesn’t exist
Type II Error (False Negative):
- Failing to reject H₀ when it’s actually false
- Probability = β (related to statistical power)
- Missing a real effect
Significance Level (α)
The significance level is the maximum probability we accept for making a Type I error:
- α = 0.05: Most common; 5% chance of false positive
- α = 0.01: More conservative; 1% chance of false positive (used for critical applications)
- α = 0.10: More liberal; 10% chance (sometimes used in exploratory research)
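The meaning of α can be checked by simulation: if H₀ is true and we test at α = 0.05, roughly 5% of tests should (wrongly) reject. A minimal sketch using NumPy and SciPy, with simulated data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
alpha = 0.05
n_sims, n = 10_000, 30

# Draw samples from N(0, 1), so H0 (mu = 0) is TRUE by construction,
# and count how often a one-sample t-test falsely rejects.
false_positives = 0
for _ in range(n_sims):
    sample = rng.normal(loc=0.0, scale=1.0, size=n)
    _, p = stats.ttest_1samp(sample, popmean=0.0)
    if p <= alpha:
        false_positives += 1

print(f"Empirical Type I error rate: {false_positives / n_sims:.3f}")
```

The empirical rejection rate lands near 0.05, matching the chosen significance level.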
Section 2: Statistical Significance & P-Values
What is a P-Value?
The p-value is the probability of observing test results as extreme or more extreme than what was actually observed, assuming the null hypothesis is true.
Mathematical Definition: $$\text{p-value} = P(\text{data at least as extreme as observed} \mid H_0 \text{ is true})$$
Interpreting P-Values
| P-Value | Interpretation | Decision |
|---|---|---|
| p < 0.001 | Extremely strong evidence against H₀ | Reject H₀ |
| p < 0.01 | Very strong evidence against H₀ | Reject H₀ |
| p < 0.05 | Strong evidence against H₀ | Reject H₀ (α = 0.05) |
| p = 0.05 | Borderline evidence | Decision depends on context |
| p > 0.05 | Weak evidence against H₀ | Fail to reject H₀ (α = 0.05) |
| p > 0.10 | Little to no evidence against H₀ | Fail to reject H₀ |
Common Misconceptions About P-Values
❌ Incorrect: “p = 0.03 means there’s a 3% chance H₀ is true”
✓ Correct: “p = 0.03 means if H₀ is true, there’s a 3% chance of observing data this extreme”
❌ Incorrect: “p-value measures the size of the effect”
✓ Correct: “Effect size measures the magnitude; p-value measures strength of evidence”
❌ Incorrect: “p ≤ 0.05 means ‘proven’”
✓ Correct: “p ≤ 0.05 means ‘sufficient evidence at this significance level’”
Decision Rule
Decision: Compare p-value to significance level (α)
- If p-value ≤ α: Reject H₀ (Statistically significant)
- If p-value > α: Fail to reject H₀ (Not statistically significant)
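The decision rule maps directly to code. This sketch uses SciPy’s one-sample t-test on made-up data, testing whether the mean differs from 100:

```python
from scipy import stats

alpha = 0.05
# Hypothetical sample: does the mean differ from 100?
sample = [102.1, 99.8, 104.5, 101.2, 98.7, 103.9, 100.4, 105.1, 99.2, 102.8]

t_stat, p_value = stats.ttest_1samp(sample, popmean=100)
decision = "Reject H0" if p_value <= alpha else "Fail to reject H0"
print(f"t = {t_stat:.3f}, p = {p_value:.4f} -> {decision}")
```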
Section 3: Statistical Power & Effect Size
Statistical Power
Power = Probability of rejecting H₀ when it’s actually false = 1 - β
- Power = 0.80 (most common) means 80% chance of detecting a real effect
- Power = 0.90 means 90% chance
- Power < 0.50 means test may not detect real effects
Factors Affecting Power
- Sample Size: Larger samples = higher power
- Effect Size: Larger effects = easier to detect
- Significance Level (α): Higher α = higher power (but more Type I errors)
- Test Type: One-tailed vs two-tailed affects power
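Power and required sample size can be computed with statsmodels; this sketch uses `TTestIndPower` for a two-sample t-test design, with illustrative numbers:

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Sample size per group to detect d = 0.5 with 80% power at alpha = 0.05
n_needed = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.80)
print(f"Required n per group: {n_needed:.1f}")  # roughly 64

# Power actually achieved with only n = 30 per group for the same effect
achieved = analysis.solve_power(effect_size=0.5, alpha=0.05, nobs1=30)
print(f"Power with n = 30: {achieved:.2f}")
```

`solve_power` solves for whichever parameter is left unset, which makes it useful both for planning (find n) and for post-hoc diagnosis (find power).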
Effect Size Measures
Effect size measures the practical magnitude of a difference, separate from statistical significance.
Cohen’s d (for means)
$$d = \frac{\text{mean difference}}{\text{standard deviation}}$$
Interpretation:
- |d| ≈ 0.2: Small effect (detectable but small)
- |d| ≈ 0.5: Medium effect (moderate)
- |d| ≈ 0.8: Large effect (substantial)
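SciPy has no built-in Cohen’s d, so here is a small hand-rolled version using the pooled SD, applied to hypothetical treatment/control data:

```python
import numpy as np

def cohens_d(x, y):
    """Cohen's d for two independent samples, using the pooled SD."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * x.var(ddof=1) + (ny - 1) * y.var(ddof=1)) / (nx + ny - 2)
    return (x.mean() - y.mean()) / np.sqrt(pooled_var)

treatment = [5.1, 5.8, 6.2, 5.5, 6.0, 5.9, 6.4, 5.7]
control   = [4.8, 5.2, 4.9, 5.4, 5.0, 5.3, 4.7, 5.1]

print(f"Cohen's d = {cohens_d(treatment, control):.2f}")
```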
Cohen’s h (for proportions)
$$h = 2 \arcsin(\sqrt{p_1}) - 2 \arcsin(\sqrt{p_2})$$
Correlation (r) as Effect Size
- r ≈ 0.1: Small effect
- r ≈ 0.3: Medium effect
- r ≈ 0.5: Large effect
R² (Coefficient of Determination)
- R² ≈ 0.01: Small effect (1% variance explained)
- R² ≈ 0.06: Medium effect (6% variance explained)
- R² ≈ 0.14: Large effect (14% variance explained)
Why Report Effect Size?
Statistical significance ≠ Practical significance
Example:
- Result 1: “p < 0.001, d = 0.15” → Statistically significant but practically small
- Result 2: “p = 0.08, d = 0.85” → Not quite significant but large practical effect
Always report both p-value (statistical significance) and effect size (practical significance).
Section 4: Parametric Tests
Z-Tests for Means
When to use: Population SD (σ) known, and either the population is normal or the sample is large (n ≥ 30)
Types:
- One-sample z-test: Compare sample mean to population mean
- Two-sample z-test: Compare means of two independent groups
Test Statistic: $$z = \frac{\overline{x} - \mu_0}{\sigma/\sqrt{n}}$$
Distribution: Standard normal (z-distribution)
Use Cases:
- Quality control when population parameters are known
- Large sample testing
- Proportion testing
→ Detailed Guide: Z-Tests Comprehensive Guide
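A one-sample z-test is simple enough to compute by hand. This sketch assumes a hypothetical quality-control scenario where the population SD is known:

```python
import math
from scipy import stats

def one_sample_z(x_bar, mu0, sigma, n, two_tailed=True):
    """One-sample z-test with a known population SD (sigma)."""
    z = (x_bar - mu0) / (sigma / math.sqrt(n))
    p = stats.norm.sf(abs(z))          # upper-tail probability
    return z, 2 * p if two_tailed else p

# Hypothetical QC check: filling machine should average 500 g (sigma = 10 g known)
z, p = one_sample_z(x_bar=503.2, mu0=500, sigma=10, n=50)
print(f"z = {z:.2f}, p = {p:.4f}")
```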
T-Tests for Means
When to use: Population SD unknown (the typical situation); population approximately normal, which matters most for small samples (n < 30)
Types:
- One-sample t-test: Compare sample mean to hypothesized population mean
- Paired t-test: Compare means from related/paired samples
- Two-sample t-test: Compare means from two independent groups
- Standard (equal variances assumed)
- Welch’s (unequal variances allowed)
Test Statistic: $$t = \frac{\overline{x} - \mu_0}{s/\sqrt{n}}, \quad df = n-1$$
Distribution: Student’s t-distribution (df = n - 1 or more complex for two-sample)
Use Cases:
- Most common hypothesis tests in practice
- Small sample studies
- Unknown population variance (typical situation)
- Medical/psychological research
→ Detailed Guide: T-Tests Comprehensive Guide
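All three t-test variants are available in SciPy; the data below are hypothetical:

```python
from scipy import stats

group_a = [23.1, 25.4, 22.8, 26.0, 24.3, 23.7, 25.1, 24.8]
group_b = [21.0, 22.5, 20.8, 23.1, 21.9, 22.2, 20.5, 21.7]

# Independent two-sample t-test (Student's: equal variances assumed)
t, p = stats.ttest_ind(group_a, group_b)
print(f"Student's t: t = {t:.2f}, p = {p:.5f}")

# Welch's t-test (no equal-variance assumption) -- often the safer default
t_w, p_w = stats.ttest_ind(group_a, group_b, equal_var=False)
print(f"Welch's t:   t = {t_w:.2f}, p = {p_w:.5f}")

# Paired t-test: the same 8 subjects measured before and after
before = [80, 85, 78, 92, 88, 76, 84, 90]
after  = [75, 82, 74, 88, 85, 74, 80, 86]
t_p, p_p = stats.ttest_rel(before, after)
print(f"Paired t:    t = {t_p:.2f}, p = {p_p:.5f}")
```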
F-Test for Variances
When to use: Testing equality of variances between two or more groups
Types:
- F-test: Compare two variances
- Levene’s test: More robust; compare 2+ variances
- Bartlett’s test: Sensitive; compare 2+ variances (requires normality)
Test Statistic: $$F = \frac{s_1^2}{s_2^2}$$
Distribution: F-distribution (df₁ = n₁ - 1, df₂ = n₂ - 1)
Use Cases:
- Checking assumption for t-tests and ANOVA
- Comparing process consistency
- Quality control applications
→ Detailed Guide: Variance Tests Comprehensive Guide
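SciPy provides Levene’s and Bartlett’s tests directly, and the F-ratio can be computed by hand. The hypothetical data below have visibly different spreads:

```python
import numpy as np
from scipy import stats

a = [12.1, 11.8, 12.5, 12.0, 11.9, 12.3, 12.2, 11.7]   # tight process
b = [12.4, 10.9, 13.2, 11.5, 12.8, 10.7, 13.5, 11.2]   # noisy process

# F-ratio of sample variances (larger over smaller)
f_ratio = np.var(b, ddof=1) / np.var(a, ddof=1)
print(f"F = {f_ratio:.2f}")

# Levene's test (robust to non-normality)
stat, p = stats.levene(a, b)
print(f"Levene: W = {stat:.2f}, p = {p:.4f}")

# Bartlett's test (assumes normality)
stat_b, p_b = stats.bartlett(a, b)
print(f"Bartlett: p = {p_b:.4f}")
```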
ANOVA (Analysis of Variance)
When to use: Comparing means of 3 or more independent groups
Types:
- One-way ANOVA: Single factor with multiple levels
- Two-way ANOVA: Two factors
- Repeated measures ANOVA: Repeated observations
- Welch’s ANOVA: When variances unequal
Test Statistic: $$F = \frac{\text{Variance between groups}}{\text{Variance within groups}}$$
Distribution: F-distribution
Use Cases:
- Comparing 3+ treatment groups
- Experimental design analysis
- Quality control with multiple factors
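A one-way ANOVA takes one line with SciPy’s `f_oneway`; the scores below are hypothetical results for three teaching methods:

```python
from scipy import stats

method_1 = [78, 82, 85, 79, 88, 81, 84, 80]
method_2 = [72, 70, 75, 68, 74, 71, 69, 73]
method_3 = [85, 88, 82, 90, 86, 84, 89, 87]

f_stat, p = stats.f_oneway(method_1, method_2, method_3)
print(f"F = {f_stat:.2f}, p = {p:.6f}")
if p <= 0.05:
    print("At least one group mean differs; follow up with post-hoc tests.")
```

Note that a significant F only says *some* means differ; identifying which pairs differ requires post-hoc comparisons (e.g. Tukey’s HSD).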
Correlation Testing
When to use: Testing if two continuous variables are associated
Types:
- Pearson correlation: For linear relationships, continuous data
- Spearman correlation: Rank-based; for monotonic relationships, ordinal data
Test Statistic: $$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}$$
Distribution: t-distribution (df = n - 2)
Use Cases:
- Examining relationship strength
- Checking variable independence
- Preliminary data exploration
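Both correlation tests are available in SciPy; the study data below are hypothetical:

```python
from scipy import stats

hours_studied = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
exam_score    = [52, 55, 61, 58, 66, 70, 68, 75, 80, 83]

# Pearson: linear association between continuous variables
r, p = stats.pearsonr(hours_studied, exam_score)
print(f"Pearson:  r = {r:.3f}, p = {p:.5f}")

# Spearman: rank-based, for monotonic relationships or ordinal data
rho, p_s = stats.spearmanr(hours_studied, exam_score)
print(f"Spearman: rho = {rho:.3f}, p = {p_s:.5f}")
```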
Section 5: Tests for Proportions
One-Sample Proportion Test
When to use: Testing if sample proportion differs from hypothesized population proportion
Conditions:
- np₀ ≥ 5 and n(1-p₀) ≥ 5 (for normal approximation)
Test Statistic: $$z = \frac{\hat{p} - p_0}{\sqrt{p_0(1-p_0)/n}}$$
Distribution: Approximately normal for large samples
Example: Testing if 60% of population supports a policy (based on sample survey)
Two-Sample Proportion Test
When to use: Comparing proportions between two groups
Test Statistic: $$z = \frac{(\hat{p}_1 - \hat{p}_2)}{\sqrt{\hat{p}(1-\hat{p})(1/n_1 + 1/n_2)}}$$
Where $\hat{p} = \frac{x_1 + x_2}{n_1 + n_2}$ (pooled proportion)
Example: Comparing success rates between treatment and control groups
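Both proportion tests are available in statsmodels via `proportions_ztest`. One caveat: for the one-sample case statsmodels uses the sample proportion in the standard error by default, so results can differ slightly from the textbook formula above. The counts below are hypothetical:

```python
from statsmodels.stats.proportion import proportions_ztest

# One-sample: did 130 successes out of 200 come from a population with p = 0.60?
z, p = proportions_ztest(count=130, nobs=200, value=0.60)
print(f"One-sample: z = {z:.2f}, p = {p:.4f}")

# Two-sample: treatment 45/150 vs control 30/150 (pooled proportion used)
z2, p2 = proportions_ztest(count=[45, 30], nobs=[150, 150])
print(f"Two-sample: z = {z2:.2f}, p = {p2:.4f}")
```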
Section 6: Chi-Square Tests for Categorical Data
Chi-Square Goodness of Fit Test
When to use: Testing if categorical data follows an expected distribution
Hypotheses:
- H₀: Data follow the hypothesized distribution
- H₁: Data don’t follow the hypothesized distribution
Test Statistic: $$\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}$$
Where O_i = observed, E_i = expected
Assumptions:
- Expected frequency ≥ 5 for each category
- Random sample
- Independent observations
Example: Testing if die rolls show equal probability for each face
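SciPy’s `chisquare` defaults to equal expected frequencies, which matches the fair-die example; the counts below are made up:

```python
from scipy import stats

# 120 die rolls: observed counts per face; expected 20 each under a fair die
observed = [18, 22, 16, 25, 19, 20]
chi2, p = stats.chisquare(observed)  # equal expected frequencies by default
print(f"chi2 = {chi2:.2f}, p = {p:.4f}")
# A large p means the data are consistent with a fair die (fail to reject H0)
```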
Chi-Square Test of Independence
When to use: Testing if two categorical variables are independent
Hypotheses:
- H₀: Variables are independent
- H₁: Variables are associated
Test Statistic: $$\chi^2 = \sum \sum \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$
Example: Testing if smoking status and lung cancer diagnosis are related
Cramér’s V (Effect Size for Chi-Square)
$$V = \sqrt{\frac{\chi^2}{n(k-1)}}$$
Where k = min(rows, columns)
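A test of independence plus Cramér’s V can be run with SciPy on a hypothetical 2×2 table. Note that `chi2_contingency` applies Yates’ continuity correction to 2×2 tables by default:

```python
import numpy as np
from scipy import stats

# Hypothetical 2x2 table: rows = smoker/non-smoker, cols = diagnosis yes/no
table = np.array([[30, 70],
                  [15, 85]])

chi2, p, dof, expected = stats.chi2_contingency(table)

n = table.sum()
k = min(table.shape)  # min(rows, columns)
cramers_v = np.sqrt(chi2 / (n * (k - 1)))
print(f"chi2 = {chi2:.2f}, df = {dof}, p = {p:.4f}, V = {cramers_v:.2f}")
```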
Section 7: Non-Parametric Tests
When to Use Non-Parametric Tests
Use when:
- Assumptions violated (normality, equal variances)
- Ordinal or ranked data
- Small samples
- Outliers or skewed distributions
Mann-Whitney U Test (Wilcoxon Rank-Sum)
Purpose: Compare two independent groups (non-parametric alternative to independent t-test)
Hypotheses:
- H₀: Distributions are equal
- H₁: Distributions differ
Procedure:
- Rank all data points
- Calculate sum of ranks for each group
- Compute U statistic
- Compare to critical value
Use: When normality assumption violated or ordinal data
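The procedure above is implemented by SciPy’s `mannwhitneyu`; the ordinal satisfaction ratings below are hypothetical:

```python
from scipy import stats

# Ordinal satisfaction ratings (1-10) for two independent groups
group_a = [8, 9, 7, 9, 10, 8, 9, 7, 8, 10]
group_b = [5, 6, 4, 7, 5, 6, 3, 5, 6, 4]

u, p = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")
print(f"U = {u:.1f}, p = {p:.4f}")
```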
Kruskal-Wallis Test
Purpose: Compare 3+ groups (non-parametric alternative to ANOVA)
Test Statistic: $$H = \frac{12}{n(n+1)}\sum_i \frac{R_i^2}{n_i} - 3(n+1)$$
Where R_i = sum of ranks for group i
Use: When ANOVA assumptions violated with 3+ groups
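SciPy’s `kruskal` implements the H statistic above; the ratings for three groups are hypothetical:

```python
from scipy import stats

g1 = [7, 8, 6, 9, 7, 8]
g2 = [5, 4, 6, 5, 3, 4]
g3 = [9, 10, 8, 9, 10, 9]

h, p = stats.kruskal(g1, g2, g3)
print(f"H = {h:.2f}, p = {p:.4f}")
```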
Paired Wilcoxon Signed-Rank Test
Purpose: Compare paired observations (non-parametric alternative to paired t-test)
Use: When normality assumption violated for paired data
McNemar Test
Purpose: Test change in binary outcome (paired categorical data)
Test Statistic: $$\chi^2 = \frac{(b-c)^2}{b+c}$$
Where b, c = count of observations with discordant outcomes
Use: Paired binary outcomes (e.g., before-after with yes/no response)
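The Wilcoxon signed-rank test lives in SciPy and McNemar’s test in statsmodels; both examples below use hypothetical paired data:

```python
from scipy import stats
from statsmodels.stats.contingency_tables import mcnemar

# Wilcoxon signed-rank: paired scores where normality is doubtful
before = [12, 15, 11, 18, 14, 16, 13, 17, 12, 15]
after  = [10, 14, 10, 15, 12, 15, 11, 14, 11, 13]
w, p = stats.wilcoxon(before, after)
print(f"Wilcoxon: W = {w:.1f}, p = {p:.4f}")

# McNemar: paired yes/no responses; the off-diagonal (discordant) cells
# b = 10 (yes -> no) and c = 3 (no -> yes) drive the test.
table = [[40, 10],
         [3, 47]]
result = mcnemar(table, exact=True)
print(f"McNemar: p = {result.pvalue:.4f}")
```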
Section 8: Assumptions Checking
Normality Assessment
Visual Methods:
- Q-Q Plot: Points close to diagonal = normal
- Histogram: Bell shape = approximately normal
- Box Plot: Check for skewness and outliers
Formal Tests:
- Shapiro-Wilk Test: p > 0.05 means no evidence against normality
- Kolmogorov-Smirnov Test: p > 0.05 means no evidence against normality
- Anderson-Darling Test: More sensitive to tails
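As a quick sketch, here is the Shapiro-Wilk test applied to simulated normal and skewed data (seeded for reproducibility):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
normal_data = rng.normal(50, 5, size=100)     # should pass
skewed_data = rng.exponential(5, size=100)    # should fail

stat_n, p_norm = stats.shapiro(normal_data)
stat_s, p_skew = stats.shapiro(skewed_data)
print(f"normal: W = {stat_n:.3f}, p = {p_norm:.4f}")
print(f"skewed: W = {stat_s:.3f}, p = {p_skew:.4f}")
```

A failed normality test for the skewed sample suggests transforming the data or switching to a non-parametric test.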
Equal Variance Testing
- Levene’s Test: p > 0.05 suggests equal variances
- Bartlett’s Test: p > 0.05 suggests equal variances (but sensitive to non-normality)
- F-Test: Ratio of variances; F close to 1 suggests equality
Independence Check
- Random sampling/assignment used?
- No repeated measures on same subject?
- Observations not influenced by each other?
- No temporal or spatial correlation?
Remedies for Assumption Violations
| Violation | Remedy |
|---|---|
| Non-normality | Transform data, use non-parametric test, increase sample size |
| Unequal variances | Use Welch’s test, transform data, non-parametric alternative |
| Outliers | Remove if data entry error; use robust methods |
| Dependence | Use appropriate test (paired, repeated measures, mixed) |
Section 9: Decision-Making Framework
Step-by-Step Process
1. Define Research Question
- What are you testing?
- Who/what is the population?
2. State Hypotheses
- H₀: null hypothesis
- H₁: alternative hypothesis
- One-tailed or two-tailed?
3. Plan the Study
- Determine sample size (power analysis)
- Set significance level (α = 0.05)
- Choose appropriate test
4. Collect Data
- Random sampling
- Ensure independence
- Record accurately
5. Check Assumptions
- Normality?
- Equal variances?
- Independence?
6. Calculate Test Statistic
- Use appropriate formula
- Verify calculations
7. Find P-Value
- Compare to critical value or use software
- Determine probability
8. Make Decision
- p ≤ α: Reject H₀
- p > α: Fail to reject H₀
9. Report Results
- Test name, test statistic, df, p-value, effect size
- Practical interpretation
- Limitations and context
10. Conclude & Discuss
- What does the result mean?
- Implications for research question?
- Limitations?
Section 10: Practical Applications & Examples
Business: A/B Testing
Question: Does website redesign increase conversion rate?
Data: 500 visitors; 30 converted on old design, 45 converted on new design
Test: Two-sample proportion test
Result: p = 0.031 < 0.05; New design significantly better (effect size: d = 0.30)
Action: Implement new design
Medicine: Drug Efficacy
Question: Does new medication reduce symptoms more than placebo?
Data: 50 patients per group; measured symptom severity (continuous)
Test: Two-sample t-test (independent samples)
Assumption Check: Levene’s test p = 0.18; Equal variances assumed
Result: t(98) = 3.45, p < 0.001, d = 0.69 (medium to large effect)
Action: Drug is significantly better; clinically meaningful improvement
Education: Teaching Method Comparison
Question: Do three teaching methods produce different test scores?
Data: 30 students per method; measured final exam scores
Test: One-way ANOVA
Assumption Checks:
- Normality (Shapiro-Wilk p > 0.05 for each group) ✓
- Equal variances (Levene’s p = 0.32) ✓
Result: F(2, 87) = 5.23, p = 0.007
Follow-up: Post-hoc tests to compare which methods differ
Action: Methods 1 and 3 significantly better than Method 2
Quality Control: Consistency Check
Question: Are two manufacturing processes equally consistent?
Data: Process A (12 samples, SD = 2.1), Process B (12 samples, SD = 1.8)
Test: F-test for variance equality
Result: F = 1.36, p = 0.62; No significant difference in consistency
Action: Both processes meet consistency standards
Section 11: Common Mistakes & How to Avoid Them
Mistake 1: P-Hacking (Data Dredging)
Problem: Running many tests and reporting only significant results
Example: Testing 20 hypotheses; even with no real effects, about 1 will appear “significant” by chance (5% of 20 tests)
Solution:
- Pre-register hypotheses before data collection
- Correct for multiple comparisons (Bonferroni: divide α by number of tests)
- Use exploratory testing only; confirm with new data
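The Bonferroni correction can be applied with statsmodels’ `multipletests`; the 20 p-values below are made up to show how raw “significant” results shrink after correction:

```python
from statsmodels.stats.multitest import multipletests

# 20 p-values from hypothetical independent tests
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205,
            0.212, 0.216, 0.222, 0.251, 0.269, 0.275, 0.340, 0.341,
            0.384, 0.569, 0.594, 0.696]

# Bonferroni: reject only if p <= alpha / m (here 0.05 / 20 = 0.0025)
reject, p_adj, _, _ = multipletests(p_values, alpha=0.05, method="bonferroni")
print(f"Raw significant:        {sum(p <= 0.05 for p in p_values)}")
print(f"Bonferroni significant: {reject.sum()}")
```

Five tests look significant at face value, but only one survives the correction.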
Mistake 2: Ignoring Effect Size
Problem: Over-emphasizing p-value; ignoring practical significance
Example: “p < 0.001” with d = 0.10 (tiny effect); not practically meaningful
Solution: Always report and interpret effect size alongside p-value
Mistake 3: Wrong Test Selection
Problem: Using z-test with unknown variance; using parametric test on non-normal data
Solution: Check assumptions; refer to decision trees
Mistake 4: Misinterpreting Non-Significance
Problem: Concluding “no difference exists” from p > 0.05
Example: Study has low power; real effect might exist
Solution: State “insufficient evidence” and consider statistical power
Mistake 5: Violating Independence Assumption
Problem: Using independent samples test on paired data
Solution: Identify structure of data; use appropriate test
Mistake 6: Assuming Causation from Correlation
Problem: Finding p < 0.05 for correlation implies causation
Solution: Correlation ≠ causation; need experimental design for causation
Interactive Hypothesis Testing Calculators
Interactive calculators are available for:
Parametric Tests:
- One-sample t-test calculator
- Two-sample t-test calculator (equal/unequal variances)
- Paired t-test calculator
- One-way ANOVA calculator
- F-test for variance equality
Proportion Tests:
- One-sample proportion test calculator
- Two-sample proportion test calculator
- Chi-square goodness of fit calculator
Non-Parametric Tests:
- Mann-Whitney U test calculator
- Kruskal-Wallis test calculator
Support Tools:
- Critical value finder (t, F, χ²)
- P-value calculator
- Effect size calculator (Cohen’s d)
- Statistical power calculator
- Sample size calculator
Test Selection Quick Reference
For Continuous Data
| # Groups | Design | Parametric | Non-Parametric |
|---|---|---|---|
| 1 | - | One-sample t | - |
| 2 | Independent | Two-sample t | Mann-Whitney U |
| 2 | Paired | Paired t | Wilcoxon signed-rank |
| 3+ | Independent | ANOVA | Kruskal-Wallis |
For Categorical Data
| # Groups | Test | Purpose |
|---|---|---|
| 1 | Chi-square GoF | Goodness of fit |
| 2 | Chi-square/Prop | Proportion equality |
| 2x2 table | McNemar | Paired binary |
| 2+ × 2+ | Chi-square | Independence |
Key Formulas Reference
One-Sample Tests
$$t = \frac{\overline{x} - \mu_0}{s/\sqrt{n}}, \quad z = \frac{\hat{p} - p_0}{\sqrt{p_0(1-p_0)/n}}$$
Two-Sample Tests
$$t = \frac{\overline{x}_1 - \overline{x}_2}{s_p\sqrt{1/n_1 + 1/n_2}}, \quad F = \frac{s_1^2}{s_2^2}$$
Chi-Square
$$\chi^2 = \sum \frac{(O - E)^2}{E}$$
Effect Sizes
$$d = \frac{\text{mean diff}}{SD}, \quad r = \frac{\text{cov}}{s_x s_y}, \quad \text{Cramér’s V} = \sqrt{\frac{\chi^2}{n(k-1)}}$$
Summary: When to Use Each Test
| Test | Purpose | Data Type | # Groups | Sample Size | Assumptions |
|---|---|---|---|---|---|
| Z-test | Mean test | Continuous | 1-2 | Large | Normal, σ known |
| T-test | Mean test | Continuous | 1-2 | Any | Normal (robust for large n) |
| ANOVA | Mean comparison | Continuous | 3+ | Any | Normal, equal σ |
| Chi-square | Categorical | Categorical | Any | Any | Expected count ≥ 5 per cell |
| Mann-Whitney | Rank/median comparison | Ordinal | 2 | Any | Independent observations |
| Kruskal-Wallis | Rank/median comparison | Ordinal | 3+ | Any | Independent observations |
| Correlation | Association | Continuous | 2 vars | Any | Bivariate normal |
Related Resources
Foundational Topics:
- Z-Score Utilities: Calculations & Tables
- Z-Tests: Complete Guide
- T-Tests: Complete Guide
- Variance Tests: F-Test, Levene’s, Bartlett’s
Learning Pathways
Path 1: Beginner to Hypothesis Testing
- Start with Hypothesis Testing Fundamentals
- Learn about P-Values & Significance
- Master One-Sample Tests (t-test)
- Progress to Two-Sample Tests (t-test comparisons)
- Explore Effect Sizes & Power
Path 2: Statistical Inference Specialist
- All Parametric Tests: z, t, F, ANOVA
- Proportion Tests
- Chi-Square Tests
- Non-Parametric Alternatives
- Assumptions & Diagnostics
Next Steps in Your Statistics Journey
Immediate Next Steps:
- Choose your first hypothesis test based on your data type
- Check assumptions using diagnostic tools
- Calculate and interpret results with effect size
- Report findings systematically
Build Deeper Understanding:
- Master multiple test types
- Learn advanced topics: ANOVA, regression, multivariate methods
- Develop intuition through practice problems
- Study research literature using these tests
Apply to Real Projects:
- Design and conduct your own study
- Analyze real datasets
- Publish findings with proper statistical inference
- Teach others these concepts