Introduction to Hypothesis Testing

Hypothesis testing is a statistical method for making decisions about populations based on sample data. It’s one of the most important tools in statistics for:

  • Testing scientific claims and theories
  • Evaluating business decisions
  • Determining if treatments are effective
  • Making evidence-based conclusions

Why Hypothesis Testing Matters

In reality, we rarely have access to entire populations. Instead, we collect sample data and use hypothesis testing to make inferences about the population. This systematic approach allows us to:

  1. Quantify uncertainty in our conclusions
  2. Control the probability of making errors
  3. Make objective, data-driven decisions
  4. Report results with statistical confidence

Section 1: Hypothesis Testing Fundamentals

Core Concepts

Null and Alternative Hypotheses

The null hypothesis (H₀) represents the status quo or “no effect” assumption:

  • “There is no difference between groups”
  • “Treatment has no effect”
  • “Mean equals the claimed value”

The alternative hypothesis (H₁) represents what we’re trying to prove:

  • “There is a difference between groups”
  • “Treatment has an effect”
  • “Mean differs from the claimed value”

Types of Alternative Hypotheses

Two-Tailed Test: H₁: μ ≠ μ₀

  • Testing if something is “different” (either direction)
  • More conservative; splits significance level across two tails
  • Most common in practice

Right-Tailed Test: H₁: μ > μ₀

  • Testing if something is “greater than”
  • Use when directional prediction exists
  • p-value = P(T > t_obs)

Left-Tailed Test: H₁: μ < μ₀

  • Testing if something is “less than”
  • p-value = P(T < t_obs)

Type I and Type II Errors

                    H₀ True             H₀ False
Reject H₀           Type I Error (α)    ✓ Correct
Fail to Reject H₀   ✓ Correct           Type II Error (β)

Type I Error (False Positive):

  • Rejecting H₀ when it’s actually true
  • Probability = α (significance level, typically 0.05)
  • “Crying wolf”: claiming an effect exists when it doesn’t

Type II Error (False Negative):

  • Failing to reject H₀ when it’s actually false
  • Probability = β (related to statistical power)
  • Missing a real effect

Significance Level (α)

The significance level is the maximum probability we accept for making a Type I error:

  • α = 0.05: Most common; 5% chance of false positive
  • α = 0.01: More conservative; 1% chance of false positive (used for critical applications)
  • α = 0.10: More liberal; 10% chance (sometimes used in exploratory research)

Section 2: Statistical Significance & P-Values

What is a P-Value?

The p-value is the probability of observing test results as extreme or more extreme than what was actually observed, assuming the null hypothesis is true.

Mathematical Definition: $$\text{p-value} = P(\text{data at least as extreme as observed} \mid H_0 \text{ is true})$$

Interpreting P-Values

P-Value     Interpretation                         Decision
p < 0.001   Extremely strong evidence against H₀   Reject H₀
p < 0.01    Very strong evidence against H₀        Reject H₀
p < 0.05    Strong evidence against H₀             Reject H₀ (α = 0.05)
p = 0.05    Borderline evidence                    Decision depends on context
p > 0.05    Weak evidence against H₀               Fail to reject H₀ (α = 0.05)
p > 0.10    Little to no evidence against H₀       Fail to reject H₀

Common Misconceptions About P-Values

✗ Incorrect: “p = 0.03 means there’s a 3% chance H₀ is true”
✓ Correct: “p = 0.03 means if H₀ is true, there’s a 3% chance of observing data this extreme”

✗ Incorrect: “p-value measures the size of the effect”
✓ Correct: “Effect size measures the magnitude; p-value measures strength of evidence”

✗ Incorrect: “p ≤ 0.05 means ‘proven’”
✓ Correct: “p ≤ 0.05 means ‘sufficient evidence at this significance level’”

Decision Rule

Decision: Compare p-value to significance level (α)

  • If p-value ≤ α: Reject H₀ (Statistically significant)
  • If p-value > α: Fail to reject H₀ (Not statistically significant)
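The decision rule is simple enough to express directly in code; this small helper (names are ours, purely illustrative) makes the boundary case p = α explicit:

```python
def decide(p_value: float, alpha: float = 0.05) -> str:
    """Apply the standard decision rule: reject H0 iff p-value <= alpha."""
    if p_value <= alpha:
        return "Reject H0 (statistically significant)"
    return "Fail to reject H0 (not statistically significant)"

print(decide(0.03))   # below alpha: reject
print(decide(0.08))   # above alpha: fail to reject
print(decide(0.05))   # exactly alpha: reject (p <= alpha)
```

Note that by convention the rule uses p ≤ α, so a p-value exactly at the significance level still counts as significant.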

Section 3: Statistical Power & Effect Size

Statistical Power

Power = Probability of rejecting H₀ when it’s actually false = 1 - β

  • Power = 0.80 (most common) means 80% chance of detecting a real effect
  • Power = 0.90 means 90% chance
  • Power < 0.50 means the test is more likely to miss a real effect than to detect it

Factors Affecting Power

  1. Sample Size: Larger samples = higher power
  2. Effect Size: Larger effects = easier to detect
  3. Significance Level (α): Higher α = higher power (but more Type I errors)
  4. Test Type: One-tailed vs two-tailed affects power
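These factors can be seen interacting in a quick normal-approximation power calculation for a two-sided, two-sample test of means (a sketch, not an exact power analysis; exact t-based power is slightly lower for small n):

```python
from math import sqrt
from scipy.stats import norm

def approx_power(d: float, n_per_group: int, alpha: float = 0.05) -> float:
    """Approximate power of a two-sided, two-sample test of means.

    d           : assumed standardized effect size (Cohen's d)
    n_per_group : sample size in each group
    """
    z_crit = norm.ppf(1 - alpha / 2)           # two-sided critical value
    noncentrality = d * sqrt(n_per_group / 2)  # expected z under H1
    return norm.cdf(noncentrality - z_crit)    # P(reject H0 | H1 true)

# A medium effect (d = 0.5) needs roughly 64 per group for 80% power:
print(round(approx_power(0.5, 64), 3))
```

Increasing n, d, or α each raises the returned power, matching the list above.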

Effect Size Measures

Effect size measures the practical magnitude of a difference, separate from statistical significance.

Cohen’s d (for means)

$$d = \frac{\text{mean difference}}{\text{standard deviation}}$$

Interpretation:

  • |d| ≈ 0.2: Small effect (detectable but small)
  • |d| ≈ 0.5: Medium effect (moderate)
  • |d| ≈ 0.8: Large effect (substantial)

Cohen’s h (for Proportions)

$$h = 2 \arcsin(\sqrt{p_1}) - 2 \arcsin(\sqrt{p_2})$$
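Both effect sizes are easy to compute directly; here is a minimal stdlib-only sketch (the arcsine version for proportions is usually written h, and this pooled-SD form of d assumes two independent samples):

```python
from math import asin, sqrt
from statistics import mean, stdev

def cohens_d(x, y):
    """Cohen's d for two independent samples, using a pooled SD."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * stdev(x) ** 2 + (ny - 1) * stdev(y) ** 2) / (nx + ny - 2)
    return (mean(x) - mean(y)) / sqrt(pooled_var)

def cohens_h(p1, p2):
    """Arcsine-based effect size for two proportions."""
    return 2 * asin(sqrt(p1)) - 2 * asin(sqrt(p2))
```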

Correlation (r) as Effect Size

  • r ≈ 0.1: Small effect
  • r ≈ 0.3: Medium effect
  • r ≈ 0.5: Large effect

R² (Coefficient of Determination)

  • R² ≈ 0.01: Small effect (1% variance explained)
  • R² ≈ 0.06: Medium effect (6% variance explained)
  • R² ≈ 0.14: Large effect (14% variance explained)

Why Report Effect Size?

Statistical significance ≠ Practical significance

Example:

  • Result 1: “p < 0.001, d = 0.15” → Statistically significant but practically small
  • Result 2: “p = 0.08, d = 0.85” → Not quite significant but large practical effect

Always report both p-value (statistical significance) and effect size (practical significance).


Section 4: Parametric Tests

Z-Tests for Means

When to use: Population SD known, large sample (n ≥ 30), or normal population

Types:

  • One-sample z-test: Compare sample mean to population mean
  • Two-sample z-test: Compare means of two independent groups

Test Statistic: $$z = \frac{\overline{x} - \mu_0}{\sigma/\sqrt{n}}$$

Distribution: Standard normal (z-distribution)

Use Cases:

  • Quality control when population parameters are known
  • Large sample testing
  • Proportion testing

→ Detailed Guide: Z-Tests Comprehensive Guide
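The one-sample z statistic above translates directly into code; this sketch (assuming SciPy is available; the measurements are made up) computes both the statistic and its p-value:

```python
from math import sqrt
from scipy.stats import norm

def one_sample_z(xbar, mu0, sigma, n, two_sided=True):
    """z statistic and p-value for a one-sample z-test (sigma known)."""
    z = (xbar - mu0) / (sigma / sqrt(n))
    p = 2 * norm.sf(abs(z)) if two_sided else norm.sf(z)
    return z, p

# Hypothetical QC check: 36 parts average 10.2 mm against a spec of 10.0 mm,
# with known process sigma = 0.6 mm:
z, p = one_sample_z(10.2, 10.0, 0.6, 36)
print(round(z, 2), round(p, 4))
```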

T-Tests for Means

When to use: Population SD unknown (the typical case), especially with small samples (n < 30); population approximately normal

Types:

  • One-sample t-test: Compare sample mean to hypothesized population mean
  • Paired t-test: Compare means from related/paired samples
  • Two-sample t-test: Compare means from two independent groups
    • Standard (equal variances assumed)
    • Welch’s (unequal variances allowed)

Test Statistic: $$t = \frac{\overline{x} - \mu_0}{s/\sqrt{n}}, \quad df = n-1$$

Distribution: Student’s t-distribution (df = n - 1 or more complex for two-sample)

Use Cases:

  • Most common hypothesis tests in practice
  • Small sample studies
  • Unknown population variance (typical situation)
  • Medical/psychological research

→ Detailed Guide: T-Tests Comprehensive Guide
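In practice these tests are run with `scipy.stats` rather than by hand; a sketch with made-up data (Welch’s variant avoids the equal-variance assumption):

```python
from scipy import stats

treatment = [5.1, 4.8, 6.2, 5.9, 5.5, 6.0, 5.3, 5.7]
control   = [4.2, 4.9, 4.4, 5.0, 4.6, 4.3, 4.8, 4.5]

# One-sample: does the treatment mean differ from a reference value of 5.0?
t1, p1 = stats.ttest_1samp(treatment, popmean=5.0)

# Two-sample, Welch's variant (no equal-variance assumption):
t2, p2 = stats.ttest_ind(treatment, control, equal_var=False)

print(f"one-sample: t = {t1:.2f}, p = {p1:.4f}")
print(f"two-sample: t = {t2:.2f}, p = {p2:.4f}")
```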

F-Test for Variances

When to use: Testing equality of variances between two or more groups

Types:

  • F-test: Compare two variances
  • Levene’s test: More robust; compare 2+ variances
  • Bartlett’s test: Sensitive; compare 2+ variances (requires normality)

Test Statistic: $$F = \frac{s_1^2}{s_2^2}$$

Distribution: F-distribution (df₁ = n₁ - 1, df₂ = n₂ - 1)

Use Cases:

  • Checking assumption for t-tests and ANOVA
  • Comparing process consistency
  • Quality control applications

→ Detailed Guide: Variance Tests Comprehensive Guide
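Both the robust and the classical approach are short in code; a sketch with illustrative data (the classical F-ratio is two-sided here, putting the larger variance on top):

```python
from statistics import variance
from scipy import stats

a = [2.1, 2.5, 1.9, 2.3, 2.0, 2.4, 2.2, 1.8]   # tight process
b = [2.0, 3.1, 1.4, 2.9, 1.2, 3.3, 1.6, 2.7]   # noisy process

# Levene's test (robust to non-normality):
w, p_levene = stats.levene(a, b)

# Classical F-ratio of sample variances, with an F-distribution p-value:
F = variance(b) / variance(a)          # larger variance on top
dfn = dfd = len(a) - 1
p_f = 2 * stats.f.sf(F, dfn, dfd)      # two-sided

print(f"Levene: W = {w:.2f}, p = {p_levene:.4f}")
print(f"F-test: F = {F:.2f}, p = {p_f:.4f}")
```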

ANOVA (Analysis of Variance)

When to use: Comparing means of 3 or more independent groups

Types:

  • One-way ANOVA: Single factor with multiple levels
  • Two-way ANOVA: Two factors
  • Repeated measures ANOVA: Repeated observations
  • Welch’s ANOVA: When variances unequal

Test Statistic: $$F = \frac{\text{Variance between groups}}{\text{Variance within groups}}$$

Distribution: F-distribution

Use Cases:

  • Comparing 3+ treatment groups
  • Experimental design analysis
  • Quality control with multiple factors
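A one-way ANOVA takes one line with SciPy; a sketch using three hypothetical treatment groups:

```python
from scipy import stats

method1 = [82, 85, 88, 79, 84, 86]
method2 = [74, 71, 77, 70, 73, 75]
method3 = [83, 87, 81, 86, 84, 88]

# H0: all three group means are equal
F, p = stats.f_oneway(method1, method2, method3)
print(f"F = {F:.2f}, p = {p:.5f}")
```

A significant F only says that at least one mean differs; post-hoc comparisons are needed to say which.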

Correlation Testing

When to use: Testing if two continuous variables are associated

Types:

  • Pearson correlation: For linear relationships, continuous data
  • Spearman correlation: Rank-based; for monotonic relationships, ordinal data

Test Statistic: $$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}$$

Distribution: t-distribution (df = n - 2)

Use Cases:

  • Examining relationship strength
  • Checking variable independence
  • Preliminary data exploration
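SciPy returns the correlation and its p-value together; a sketch with made-up study-time data showing both variants:

```python
from scipy import stats

hours  = [1, 2, 3, 4, 5, 6, 7, 8]
scores = [52, 55, 61, 60, 68, 71, 75, 80]

r, p_r = stats.pearsonr(hours, scores)        # linear association
rho, p_rho = stats.spearmanr(hours, scores)   # monotonic, rank-based

print(f"Pearson  r   = {r:.3f}, p = {p_r:.5f}")
print(f"Spearman rho = {rho:.3f}, p = {p_rho:.5f}")
```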

Section 5: Tests for Proportions

One-Sample Proportion Test

When to use: Testing if sample proportion differs from hypothesized population proportion

Conditions:

  • np₀ ≥ 5 and n(1-p₀) ≥ 5 (for normal approximation)

Test Statistic: $$z = \frac{\hat{p} - p_0}{\sqrt{p_0(1-p_0)/n}}$$

Distribution: Approximately normal for large samples

Example: Testing if 60% of population supports a policy (based on sample survey)
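The normal-approximation version of this test is a few lines; a sketch with hypothetical survey counts:

```python
from math import sqrt
from scipy.stats import norm

def one_sample_prop_test(successes, n, p0, two_sided=True):
    """z-test for a single proportion using the normal approximation."""
    p_hat = successes / n
    se = sqrt(p0 * (1 - p0) / n)   # SE under H0, using p0
    z = (p_hat - p0) / se
    p = 2 * norm.sf(abs(z)) if two_sided else norm.sf(z)
    return z, p

# Hypothetical survey: 330 of 500 respondents support the policy; H0: p = 0.60
z, p = one_sample_prop_test(330, 500, 0.60)
print(round(z, 2), round(p, 4))
```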

Two-Sample Proportion Test

When to use: Comparing proportions between two groups

Test Statistic: $$z = \frac{(\hat{p}_1 - \hat{p}_2)}{\sqrt{\hat{p}(1-\hat{p})(1/n_1 + 1/n_2)}}$$

Where $\hat{p} = \frac{x_1 + x_2}{n_1 + n_2}$ (pooled proportion)

Example: Comparing success rates between treatment and control groups
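The pooled-proportion formula above implements directly; a sketch with made-up treatment/control counts:

```python
from math import sqrt
from scipy.stats import norm

def two_sample_prop_test(x1, n1, x2, n2):
    """Two-sided z-test comparing two proportions (pooled estimate under H0)."""
    p1, p2 = x1 / n1, x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)   # pooled proportion
    se = sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    return z, 2 * norm.sf(abs(z))

# Hypothetical trial: 45/200 treatment successes vs. 28/200 control successes
z, p = two_sample_prop_test(45, 200, 28, 200)
print(round(z, 2), round(p, 4))
```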


Section 6: Chi-Square Tests for Categorical Data

Chi-Square Goodness of Fit Test

When to use: Testing if categorical data follows an expected distribution

Hypotheses:

  • H₀: Data follow the hypothesized distribution
  • H₁: Data don’t follow the hypothesized distribution

Test Statistic: $$\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}$$

Where O_i = observed, E_i = expected

Assumptions:

  • Expected frequency ≥ 5 for each category
  • Random sample
  • Independent observations

Example: Testing if die rolls show equal probability for each face
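The die example can be run directly with `scipy.stats.chisquare`, which assumes a uniform expected distribution by default (the roll counts here are invented):

```python
from scipy import stats

# 120 die rolls; expected 20 per face under a fair die:
observed = [18, 25, 16, 21, 24, 16]
chi2, p = stats.chisquare(observed)   # uniform expected frequencies by default

print(f"chi2 = {chi2:.2f}, p = {p:.3f}")
```

Here the large p-value would mean no evidence the die is unfair.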

Chi-Square Test of Independence

When to use: Testing if two categorical variables are independent

Hypotheses:

  • H₀: Variables are independent
  • H₁: Variables are associated

Test Statistic: $$\chi^2 = \sum \sum \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$

Example: Testing if smoking status and lung cancer diagnosis are related

Cramér’s V (Effect Size for Chi-Square)

$$V = \sqrt{\frac{\chi^2}{n(k-1)}}$$

Where k = min(rows, columns)
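`scipy.stats.chi2_contingency` handles the expected counts and degrees of freedom, and Cramér’s V follows from its output; a sketch with invented 2×2 counts (SciPy applies Yates’ continuity correction to 2×2 tables by default):

```python
import numpy as np
from scipy import stats

# Rows: smoker / non-smoker; columns: diagnosis yes / no (illustrative counts):
table = np.array([[30,  70],
                  [15, 135]])

chi2, p, dof, expected = stats.chi2_contingency(table)

n = table.sum()
k = min(table.shape)                      # min(rows, columns)
cramers_v = np.sqrt(chi2 / (n * (k - 1)))

print(f"chi2 = {chi2:.2f}, p = {p:.4f}, V = {cramers_v:.2f}")
```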


Section 7: Non-Parametric Tests

When to Use Non-Parametric Tests

Use when:

  • Assumptions violated (normality, equal variances)
  • Ordinal or ranked data
  • Small samples
  • Outliers or skewed distributions

Mann-Whitney U Test (Wilcoxon Rank-Sum)

Purpose: Compare two independent groups (non-parametric alternative to independent t-test)

Hypotheses:

  • H₀: Distributions are equal
  • H₁: Distributions differ

Procedure:

  1. Rank all data points
  2. Calculate sum of ranks for each group
  3. Compute U statistic
  4. Compare to critical value

Use: When normality assumption violated or ordinal data
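SciPy performs the ranking and U computation internally; a sketch with two made-up, clearly separated groups:

```python
from scipy import stats

group_a = [12, 15, 11, 18, 14, 13, 16]
group_b = [22, 19, 25, 21, 24, 20, 23]

# H0: the two distributions are equal
u, p = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")
print(f"U = {u}, p = {p:.4f}")
```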

Kruskal-Wallis Test

Purpose: Compare 3+ groups (non-parametric alternative to ANOVA)

Test Statistic: $$H = \frac{12}{n(n+1)}\sum_i \frac{R_i^2}{n_i} - 3(n+1)$$

Where R_i = sum of ranks for group i

Use: When ANOVA assumptions violated with 3+ groups
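The H statistic above (with a tie correction) is what `scipy.stats.kruskal` computes; a sketch with three invented groups:

```python
from scipy import stats

g1 = [7, 9, 6, 8, 7]
g2 = [12, 14, 11, 13, 15]   # clearly shifted upward
g3 = [7, 8, 6, 9, 7]

H, p = stats.kruskal(g1, g2, g3)
print(f"H = {H:.2f}, p = {p:.4f}")
```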

Paired Wilcoxon Signed-Rank Test

Purpose: Compare paired observations (non-parametric alternative to paired t-test)

Use: When normality assumption violated for paired data

McNemar Test

Purpose: Test change in binary outcome (paired categorical data)

Test Statistic: $$\chi^2 = \frac{(b-c)^2}{b+c}$$

Where b, c = count of observations with discordant outcomes

Use: Paired binary outcomes (e.g., before-after with yes/no response)
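The McNemar statistic needs only the two discordant counts; a minimal sketch (the counts are hypothetical, and the p-value comes from the χ² distribution with 1 df):

```python
from scipy.stats import chi2

def mcnemar(b, c):
    """McNemar chi-square from the two discordant cell counts (df = 1)."""
    stat = (b - c) ** 2 / (b + c)
    return stat, chi2.sf(stat, df=1)

# Hypothetical before/after survey: 25 subjects changed no->yes (b),
# 8 changed yes->no (c); concordant cells don't enter the statistic.
stat, p = mcnemar(25, 8)
print(round(stat, 2), round(p, 4))
```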


Section 8: Assumptions Checking

Normality Assessment

Visual Methods:

  • Q-Q Plot: Points close to diagonal = normal
  • Histogram: Bell shape = approximately normal
  • Box Plot: Check for skewness and outliers

Formal Tests:

  • Shapiro-Wilk Test: p > 0.05 means no significant departure from normality detected
  • Kolmogorov-Smirnov Test: p > 0.05 means no significant departure detected
  • Anderson-Darling Test: more sensitive to departures in the tails
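A quick sketch contrasting Shapiro-Wilk on simulated normal versus skewed data (the distributions and seed are arbitrary choices for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
normal_data = rng.normal(loc=50, scale=5, size=200)
skewed_data = rng.exponential(scale=5, size=200)   # strongly right-skewed

for name, data in [("normal", normal_data), ("skewed", skewed_data)]:
    w, p = stats.shapiro(data)
    verdict = "no departure detected" if p > 0.05 else "non-normal"
    print(f"{name}: W = {w:.3f}, p = {p:.4f} -> {verdict}")
```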

Equal Variance Testing

  • Levene’s Test: p > 0.05 suggests equal variances
  • Bartlett’s Test: p > 0.05 suggests equal variances (but sensitive to non-normality)
  • F-Test: ratio of variances; F close to 1 suggests equality

Independence Check

  • Random sampling/assignment used?
  • No repeated measures on same subject?
  • Observations not influenced by each other?
  • No temporal or spatial correlation?

Remedies for Assumption Violations

Violation           Remedy
Non-normality       Transform data, use non-parametric test, increase sample size
Unequal variances   Use Welch’s test, transform data, non-parametric alternative
Outliers            Remove if data entry error; use robust methods
Dependence          Use appropriate test (paired, repeated measures, mixed)

Section 9: Decision-Making Framework

Step-by-Step Process

1. Define Research Question

  • What are you testing?
  • Who/what is the population?

2. State Hypotheses

  • H₀: null hypothesis
  • H₁: alternative hypothesis
  • One-tailed or two-tailed?

3. Plan the Study

  • Determine sample size (power analysis)
  • Set significance level (α = 0.05)
  • Choose appropriate test

4. Collect Data

  • Random sampling
  • Ensure independence
  • Record accurately

5. Check Assumptions

  • Normality?
  • Equal variances?
  • Independence?

6. Calculate Test Statistic

  • Use appropriate formula
  • Verify calculations

7. Find P-Value

  • Compare to critical value or use software
  • Determine probability

8. Make Decision

  • p ≤ α: Reject H₀
  • p > α: Fail to reject H₀

9. Report Results

  • Test name, test statistic, df, p-value, effect size
  • Practical interpretation
  • Limitations and context

10. Conclude & Discuss

  • What does the result mean?
  • Implications for research question?
  • Limitations?

Section 10: Practical Applications & Examples

Business: A/B Testing

Question: Does website redesign increase conversion rate?

Data: 500 visitors per design; 30 converted on the old design, 45 on the new design

Test: Two-sample proportion test

Result: p = 0.031 < 0.05; New design significantly better (effect size: d = 0.30)

Action: Implement new design

Medicine: Drug Efficacy

Question: Does new medication reduce symptoms more than placebo?

Data: 50 patients per group; measured symptom severity (continuous)

Test: Two-sample t-test (independent samples)

Assumption Check: Levene’s test p = 0.18; Equal variances assumed

Result: t(98) = 3.45, p < 0.001, d = 0.69 (medium to large effect)

Action: Drug is significantly better; clinically meaningful improvement

Education: Teaching Method Comparison

Question: Do three teaching methods produce different test scores?

Data: 30 students per method; measured final exam scores

Test: One-way ANOVA

Assumption Checks:

  • Normality (Shapiro-Wilk p > 0.05 for each group) ✓
  • Equal variances (Levene’s p = 0.32) ✓

Result: F(2, 87) = 5.23, p = 0.007

Follow-up: Post-hoc tests to compare which methods differ

Action: Methods 1 and 3 significantly better than Method 2

Quality Control: Consistency Check

Question: Are two manufacturing processes equally consistent?

Data: Process A (12 samples, SD = 2.1), Process B (12 samples, SD = 1.8)

Test: F-test for variance equality

Result: F = 1.36, p = 0.62; No significant difference in consistency

Action: Both processes meet consistency standards


Section 11: Common Mistakes & How to Avoid Them

Mistake 1: P-Hacking (Data Dredging)

Problem: Running many tests and reporting only significant results

Example: Testing 20 hypotheses; even with no real effects, ~1 will be “significant” by chance (5% = 1/20)

Solution:

  • Pre-register hypotheses before data collection
  • Correct for multiple comparisons (Bonferroni: divide α by number of tests)
  • Use exploratory testing only; confirm with new data
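The Bonferroni correction mentioned above is a one-liner; a sketch flagging which of several hypothetical p-values survive after dividing α by the number of tests:

```python
def bonferroni(p_values, alpha=0.05):
    """Return (p, significant?) pairs under a Bonferroni-corrected alpha."""
    adjusted_alpha = alpha / len(p_values)   # divide alpha by number of tests
    return [(p, p <= adjusted_alpha) for p in p_values]

# Five tests: only p-values at or below 0.05 / 5 = 0.01 survive:
results = bonferroni([0.003, 0.02, 0.04, 0.30, 0.77])
for p, significant in results:
    print(f"p = {p:.3f}: {'significant' if significant else 'not significant'}")
```

Note how a raw p = 0.02, nominally “significant”, no longer is once five tests are accounted for.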

Mistake 2: Ignoring Effect Size

Problem: Over-emphasizing p-value; ignoring practical significance

Example: “p < 0.001” with d = 0.10 (tiny effect); not practically meaningful

Solution: Always report and interpret effect size alongside p-value

Mistake 3: Wrong Test Selection

Problem: Using z-test with unknown variance; using parametric test on non-normal data

Solution: Check assumptions; refer to decision trees

Mistake 4: Misinterpreting Non-Significance

Problem: Concluding “no difference exists” from p > 0.05

Example: Study has low power; real effect might exist

Solution: State “insufficient evidence” and consider statistical power

Mistake 5: Violating Independence Assumption

Problem: Using independent samples test on paired data

Solution: Identify structure of data; use appropriate test

Mistake 6: Assuming Causation from Correlation

Problem: Finding p < 0.05 for correlation implies causation

Solution: Correlation ≠ causation; need experimental design for causation


Interactive Hypothesis Testing Calculators

[Multiple calculators would be embedded here providing:]

Parametric Tests:

  • One-sample t-test calculator
  • Two-sample t-test calculator (equal/unequal variances)
  • Paired t-test calculator
  • One-way ANOVA calculator
  • F-test for variance equality

Proportion Tests:

  • One-sample proportion test calculator
  • Two-sample proportion test calculator
  • Chi-square goodness of fit calculator

Non-Parametric Tests:

  • Mann-Whitney U test calculator
  • Kruskal-Wallis test calculator

Support Tools:

  • Critical value finder (t, F, χ²)
  • P-value calculator
  • Effect size calculator (Cohen’s d)
  • Statistical power calculator
  • Sample size calculator

Test Selection Quick Reference

For Continuous Data

# Groups   Independent   Paired   Parametric     Non-Param
1          -             -        One-sample t   -
2          Yes           -        Two-sample t   Mann-Whitney U
2          No            Yes      Paired t       Wilcoxon
3+         -             -        ANOVA          Kruskal-Wallis

For Categorical Data

# Groups    Test              Purpose
1           Chi-square GoF    Goodness of fit
2           Chi-square/Prop   Proportion equality
2×2 table   McNemar           Paired binary
2+ × 2+     Chi-square        Independence

Key Formulas Reference

One-Sample Tests

$$t = \frac{\overline{x} - \mu_0}{s/\sqrt{n}}, \quad z = \frac{\hat{p} - p_0}{\sqrt{p_0(1-p_0)/n}}$$

Two-Sample Tests

$$t = \frac{\overline{x}_1 - \overline{x}_2}{s_p\sqrt{1/n_1 + 1/n_2}}, \quad F = \frac{s_1^2}{s_2^2}$$

Chi-Square

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

Effect Sizes

$$d = \frac{\text{mean diff}}{SD}, \quad r = \frac{\text{cov}}{s_x s_y}, \quad \text{Cramér’s V} = \sqrt{\frac{\chi^2}{n(k-1)}}$$


Summary: When to Use Each Test

Test             Purpose           Data Type     # Groups   Sample Size   Assumptions
Z-test           Mean test         Continuous    1-2        Large         Normal, σ known
T-test           Mean test         Continuous    1-2        Any           Normal (robust for large n)
ANOVA            Mean comparison   Continuous    3+         Any           Normal, equal σ
Chi-square       Categorical       Categorical   Any        Any           Expected count ≥ 5 per cell
Mann-Whitney     Median test       Ordinal       2          Any           None
Kruskal-Wallis   Median test       Ordinal       3+         Any           None
Correlation      Association       Continuous    2 vars     Any           Bivariate normal



Learning Pathways

Path 1: Beginner to Hypothesis Testing

  1. Start with Hypothesis Testing Fundamentals
  2. Learn about P-Values & Significance
  3. Master One-Sample Tests (t-test)
  4. Progress to Two-Sample Tests (t-test comparisons)
  5. Explore Effect Sizes & Power

Path 2: Statistical Inference Specialist

  1. All Parametric Tests: z, t, F, ANOVA
  2. Proportion Tests
  3. Chi-Square Tests
  4. Non-Parametric Alternatives
  5. Assumptions & Diagnostics

Path 3: Research & Experimental Design

  1. Hypothesis Formation
  2. Type I & II Errors
  3. Power Analysis
  4. Test Selection
  5. Reporting Results

Next Steps in Your Statistics Journey

Immediate Next Steps:

  1. Choose your first hypothesis test based on your data type
  2. Check assumptions using diagnostic tools
  3. Calculate and interpret results with effect size
  4. Report findings systematically

Build Deeper Understanding:

  • Master multiple test types
  • Learn advanced topics: ANOVA, regression, multivariate methods
  • Develop intuition through practice problems
  • Study research literature using these tests

Apply to Real Projects:

  • Design and conduct your own study
  • Analyze real datasets
  • Publish findings with proper statistical inference
  • Teach others these concepts