Every statistical study begins with a fundamental question: How do we choose who to study? Sampling is the process of selecting a subset from a population to represent the whole. Good sampling design ensures that your results are reliable and generalizable. Poor sampling leads to biased estimates and false conclusions.

This comprehensive guide covers sampling theory, methods, and practical survey design.

Understanding Sampling

Population vs Sample

Population:

  • Complete set of individuals or items of interest
  • Example: All voters in a country
  • Usually too large to measure entirely

Sample:

  • Subset of population
  • Smaller, more manageable
  • Representative if selected properly

Sample Size (n) vs Population Size (N):

  • n = number of items in sample
  • N = number of items in population
  • Typically n ≪ N

Why Sampling?

Advantages of sampling:

  • ✅ Lower cost
  • ✅ Faster data collection
  • ✅ Less resource-intensive
  • ✅ Can conduct destructive tests (battery life, crash tests)

When to sample the entire population:

  • Very small population
  • Every item critical (aerospace)
  • Administrative data already available

Sampling Error vs Non-Sampling Error

Sampling Error:

  • Variation due to chance (random)
  • Inherent in sampling process
  • Reduced by larger samples
  • Measured by standard error

Non-Sampling Error:

  • Systematic bias (not random)
  • From poor design or execution
  • Not reduced by larger samples
  • Examples: Wrong questions, misclassification, refusals

Total Error = Sampling Error + Non-Sampling Error


Section 1: Probability Sampling Methods

Probability sampling gives each population member a known, non-zero chance of selection.

Simple Random Sampling

Method: Randomly select n items from population of N items.

Process:

  1. Assign each item a unique number (1 to N)
  2. Use random number generator
  3. Select n numbers
  4. Include corresponding items

Advantages:

  • ✅ Unbiased
  • ✅ Simple to understand
  • ✅ Representative
  • ✅ Good for inference

Disadvantages:

  • ❌ Impractical for large populations
  • ❌ Doesn’t account for subgroups
  • ❌ May miss rare characteristics

When to use:

  • Homogeneous population
  • Complete population list available
  • Adequate resources

Example: Randomly select 100 students from university roster of 10,000
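The four-step process above can be sketched in a few lines of Python. This is a minimal illustration, not a production sampler: the roster of student IDs 1 to 10,000 is a hypothetical stand-in for a real sampling frame, and the helper name `simple_random_sample` is an assumption made for this example.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n items without replacement; every subset of size n is equally likely."""
    rng = random.Random(seed)  # a seeded generator makes the draw reproducible
    return rng.sample(population, n)

# Hypothetical roster: student IDs 1..10,000 (stand-in for a real list)
roster = list(range(1, 10_001))
sample = simple_random_sample(roster, 100, seed=42)
print(len(sample), len(set(sample)))  # 100 100 (no student is picked twice)
```

Recording the seed alongside the results makes the selection auditable later.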

Stratified Sampling

Method: Divide population into subgroups (strata), then randomly sample from each stratum.

Process:

  1. Divide population into strata (often by natural groupings)
  2. Determine allocation (proportional or equal)
  3. Randomly sample from each stratum

Stratification Variables:

  • Geographic region
  • Gender
  • Age group
  • Income level
  • Department

Proportional Allocation: Sample each stratum in proportion to population.

Example (total sample of n = 100):

  • Stratum A: 40% of population → 40 of sample
  • Stratum B: 30% of population → 30 of sample
  • Stratum C: 30% of population → 30 of sample

Equal Allocation: Same sample size per stratum regardless of population size.

Example:

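  • With three strata and n = 100, each stratum gets 33-34 items, regardless of how large the stratum is in the population (as long as it holds at least that many members)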
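Both allocation rules reduce to simple arithmetic. The sketch below assumes hypothetical stratum sizes of 4,000, 3,000, and 3,000 (matching the 40/30/30 split) and uses a largest-remainder rule to handle rounding, which is one reasonable choice among several.

```python
def proportional_allocation(stratum_sizes, n):
    """Split a total sample of n across strata in proportion to their sizes."""
    total = sum(stratum_sizes.values())
    exact = {name: n * size / total for name, size in stratum_sizes.items()}
    alloc = {name: int(x) for name, x in exact.items()}  # round down first
    # Hand any leftover units to the strata with the largest fractional parts
    leftover = n - sum(alloc.values())
    by_remainder = sorted(exact, key=lambda name: exact[name] - alloc[name], reverse=True)
    for name in by_remainder[:leftover]:
        alloc[name] += 1
    return alloc

def equal_allocation(stratum_names, n):
    """Give every stratum the same share; early strata absorb the remainder."""
    base, extra = divmod(n, len(stratum_names))
    return {name: base + (1 if i < extra else 0) for i, name in enumerate(stratum_names)}

sizes = {"A": 4000, "B": 3000, "C": 3000}  # hypothetical population counts
print(proportional_allocation(sizes, 100))  # {'A': 40, 'B': 30, 'C': 30}
print(equal_allocation(list(sizes), 100))   # {'A': 34, 'B': 33, 'C': 33}
```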

Advantages:

  • ✅ Ensures representation of subgroups
  • ✅ More precise estimates for subgroups
  • ✅ Identifies differences between strata
  • ✅ Reduces sampling error

Disadvantages:

  • ❌ More complex than simple random
  • ❌ Need population information
  • ❌ More expensive

When to use:

  • Heterogeneous population
  • Subgroups naturally exist
  • Interested in subgroup differences
  • Want reduced variance

Example: Survey student satisfaction by school (Engineering, Business, Arts) with stratified sample
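A stratified draw for this example might look like the following sketch. The roster, the school counts (500/300/200), and the allocation are all hypothetical, and `stratified_sample` is an illustrative helper rather than a library function.

```python
import random
from collections import defaultdict

def stratified_sample(records, stratum_of, allocation, seed=None):
    """Group records by stratum, then draw a simple random sample within each."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for rec in records:
        strata[stratum_of(rec)].append(rec)
    return {name: rng.sample(strata[name], k) for name, k in allocation.items()}

# Hypothetical roster of (student_id, school) pairs
schools = ["Engineering"] * 500 + ["Business"] * 300 + ["Arts"] * 200
students = list(enumerate(schools, start=1))

allocation = {"Engineering": 50, "Business": 30, "Arts": 20}  # proportional, n = 100
picked = stratified_sample(students, lambda s: s[1], allocation, seed=7)
print({name: len(rows) for name, rows in picked.items()})
# {'Engineering': 50, 'Business': 30, 'Arts': 20}
```

If the allocation is not proportional, overall estimates need to weight each stratum by its population share.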

Cluster Sampling

Method: Divide population into clusters, randomly select clusters, then study all or sample within clusters.

Process:

  1. Divide population into clusters
  2. Randomly select clusters
  3. Include all items in selected clusters (or subsample)

Cluster Definition: Natural groupings: Geographic regions, schools, companies, neighborhoods

One-Stage Clustering: Select clusters, include all items in selected clusters.

Two-Stage Clustering: Select clusters, then randomly sample within clusters.

Advantages:

  • ✅ Cost-effective for geographically dispersed populations
  • ✅ No complete population list needed
  • ✅ Practical for field surveys
  • ✅ Reduces travel/administration costs

Disadvantages:

  • ❌ Less efficient (higher sampling error)
  • ❌ Cluster members may be similar (homogeneous within clusters)
  • ❌ Requires more items to achieve same precision as simple random

When to use:

  • Geographically dispersed population
  • Complete list unavailable
  • Cost is primary concern
  • Natural clusters exist

Example: Survey voter preferences by randomly selecting 50 zip codes, then interviewing voters in those areas
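Two-stage clustering can be sketched as two nested random draws. Everything about the frame below is made up for illustration: 200 zip codes, 30 to 80 voters each, and 20 interviews per selected zip code.

```python
import random

def two_stage_cluster_sample(clusters, n_clusters, n_per_cluster, seed=None):
    """clusters: dict mapping cluster id -> list of members."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)  # stage 1: pick clusters
    sample = {}
    for cid in chosen:
        members = clusters[cid]
        k = min(n_per_cluster, len(members))           # small clusters: take everyone
        sample[cid] = rng.sample(members, k)           # stage 2: pick within cluster
    return sample

# Hypothetical frame: 200 zip codes with 30-80 voters each
frame_rng = random.Random(0)
zips = {
    f"zip{z:03d}": [f"voter{z}-{i}" for i in range(frame_rng.randint(30, 80))]
    for z in range(200)
}
picked = two_stage_cluster_sample(zips, n_clusters=50, n_per_cluster=20, seed=1)
print(len(picked), sum(len(v) for v in picked.values()))  # 50 1000
```

With equal-probability cluster selection and a fixed take per cluster, voters in small zip codes are more likely to be chosen than voters in large ones, so the analysis would need design weights.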

Systematic Sampling

Method: Select every kth item from population after random start.

Process:

  1. Calculate k = N/n (population size / desired sample size)
  2. Randomly select starting point (1 to k)
  3. Select every kth item thereafter

Example:

  • N = 1000, n = 100 → k = 10
  • Random start = 7
  • Select items: 7, 17, 27, 37, …
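The same procedure as a short sketch, assuming the population is available as an ordered list. Using integer division for k and trimming to n items is one simple way to cope with N not being an exact multiple of n.

```python
import random

def systematic_sample(items, n, seed=None):
    """Take every k-th item after a random start in positions 1..k."""
    N = len(items)
    if not 0 < n <= N:
        raise ValueError("need 0 < n <= len(items)")
    k = N // n                       # sampling interval
    rng = random.Random(seed)
    start = rng.randint(1, k)        # 1-based random start, inclusive of k
    return [items[i - 1] for i in range(start, N + 1, k)][:n]

population = list(range(1, 1001))    # positions 1..1000
chosen = systematic_sample(population, 100, seed=3)
print(chosen[:4], len(chosen))
```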

Advantages:

  • ✅ Simple to execute
  • ✅ Less training needed
  • ✅ Spread sample across population

Disadvantages:

  • ❌ Can introduce bias if pattern in population
  • ❌ Not truly random

When to use:

  • Sequential or ordered population
  • No patterns in population
  • Easy implementation needed

Example: Quality control: test every 10th item off production line


Section 2: Non-Probability Sampling Methods

Non-probability sampling does not give population members a known chance of selection, so selection bias cannot be quantified or corrected reliably. Use only with caution.

Convenience Sampling

Method: Select readily available subjects.

Disadvantages:

  • ❌ Often biased (sample differs from population)
  • ❌ Not representative
  • ❌ Results not generalizable

When used:

  • Early exploratory research
  • Pilot studies
  • When randomization impossible
  • Budget extremely limited

Example: Survey mall shoppers on weekday afternoon

Purposive (Judgmental) Sampling

Method: Deliberately select subjects based on researcher’s judgment.

Types:

  • Typical case sampling
  • Extreme case sampling
  • Maximum variation sampling
  • Snowball sampling (referrals)

Advantages:

  • ✅ Targets specific types of subjects
  • ✅ Efficient for qualitative research

Disadvantages:

  • ❌ Introduces researcher bias
  • ❌ Results not generalizable to population

When used:

  • Qualitative research
  • Need specific expertise
  • Focused case studies

Section 3: The Sampling Distribution

The sampling distribution is the probability distribution of a sample statistic (such as the sample mean) across all possible samples of the same size.

Key Properties

Central Limit Theorem:

  • For sufficiently large n, sample means approximately follow a normal distribution
  • Holds regardless of the shape of the population distribution, provided its variance is finite
  • As n increases, the approximation improves

Standard Error (SE): Standard deviation of sampling distribution:

SE = σ / √n   (or s / √n when σ is estimated by the sample standard deviation s)

Interpretation:

  • Smaller SE = more precise estimate
  • Larger sample → Smaller SE (more precision)
  • More variable population → Larger SE (less precision)

Example:

  • Population: σ = 10
  • Sample n = 100: SE = 10 / √100 = 1.0
  • Sample n = 400: SE = 10 / √400 = 0.5

Quadrupling sample size halves standard error.
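A small simulation can make the SE formula concrete. The sketch below assumes a deliberately skewed population (exponential with rate 1, so σ = 1) and checks that the spread of simulated sample means tracks σ / √n; the seed and replication count are arbitrary choices.

```python
import math
import random
import statistics

def sample_means(n, reps, rng):
    """Means of `reps` samples of size n from a right-skewed exponential population."""
    return [
        statistics.fmean(rng.expovariate(1.0) for _ in range(n))
        for _ in range(reps)
    ]

rng = random.Random(2024)
sigma = 1.0  # an Exponential(rate=1) population has standard deviation 1
for n in (25, 100):
    means = sample_means(n, reps=2000, rng=rng)
    empirical = statistics.stdev(means)   # spread of the simulated sample means
    theoretical = sigma / math.sqrt(n)    # SE formula
    print(f"n={n}: empirical SE={empirical:.3f}, sigma/sqrt(n)={theoretical:.3f}")
```

The two columns should agree closely, and going from n = 25 to n = 100 should roughly halve both.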

Confidence Interval from Sampling Distribution

Sample means are approximately normally distributed, so:

CI = x̄ ± z* × SE

Where z* depends on confidence level (1.96 for 95%)
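As a sketch, the interval can be computed with the standard library alone. The sample mean of 50 is a hypothetical value; σ = 10 and n = 100 reuse the numbers from the SE example.

```python
import math
from statistics import NormalDist

def mean_confidence_interval(xbar, sd, n, confidence=0.95):
    """Normal-approximation CI for a mean: xbar ± z* × SE."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # about 1.96 for 95%
    se = sd / math.sqrt(n)
    return xbar - z * se, xbar + z * se

low, high = mean_confidence_interval(xbar=50, sd=10, n=100)
print(f"95% CI: ({low:.2f}, {high:.2f})")  # 95% CI: (48.04, 51.96)
```

For small samples with σ estimated from the data, a t critical value would normally replace z*.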


Section 4: Sample Size Determination

How large should the sample be?

Depends on:

  • Desired precision (margin of error)
  • Confidence level (usually 95%)
  • Population variability
  • Population size (rarely critical)

Formula for Estimating Mean

n = (z* × σ / ME)²

where:
z* = critical value (1.96 for 95%)
σ = population standard deviation (estimate)
ME = desired margin of error

Example:

  • Want 95% CI with ME = 2
  • Estimate σ = 10
  • n = (1.96 × 10 / 2)² = 96.04, so round up to 97 subjects
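The formula translates directly into code. This sketch always rounds up, because any fractional result (here 96.04) means the next whole subject is needed to reach the target margin of error.

```python
import math
from statistics import NormalDist

def sample_size_for_mean(sigma, margin, confidence=0.95):
    """n = (z* × σ / ME)², rounded up to the next whole subject."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil((z * sigma / margin) ** 2)

print(sample_size_for_mean(sigma=10, margin=2))  # 97
print(sample_size_for_mean(sigma=10, margin=1))  # halving ME roughly quadruples n
```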

Formula for Estimating Proportion

n = (z* / ME)² × p(1-p)

where:
p = estimated proportion
If p unknown, use p = 0.5 (most conservative)

Example:

  • Want 95% CI with ME = 0.05 (5%)
  • Use p = 0.5 (most conservative)
  • n = (1.96 / 0.05)² × 0.5 × 0.5 = 385 subjects needed
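A corresponding sketch for proportions, with p = 0.5 as the conservative default. The second call uses a hypothetical prior estimate of p = 0.2 to show how much a credible prior guess can reduce the requirement.

```python
import math
from statistics import NormalDist

def sample_size_for_proportion(margin, p=0.5, confidence=0.95):
    """n = (z* / ME)² × p(1 - p), rounded up."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil((z / margin) ** 2 * p * (1 - p))

print(sample_size_for_proportion(0.05))         # 385
print(sample_size_for_proportion(0.05, p=0.2))  # smaller, since p(1 - p) < 0.25
```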

Accounting for Population Size

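When the sample would be a sizable fraction of the population (a common rule of thumb is n/N above about 5%), apply a finite population correction: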

n_adjusted = n / (1 + n/N)

where N = population size

Example:

  • Calculated n = 400
  • Population N = 1000
  • Adjusted n = 400 / (1 + 400/1000) = 286

Population size matters only when N is small relative to n.
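The adjustment is easy to check numerically. The sketch below applies the formula as written above; the one-million population in the second call is a hypothetical value chosen to show the correction vanishing.

```python
import math

def adjust_for_population(n, N):
    """Finite population adjustment: n / (1 + n/N), rounded up."""
    return math.ceil(n / (1 + n / N))

print(adjust_for_population(400, 1_000))      # 286
print(adjust_for_population(400, 1_000_000))  # 400 (correction is negligible)
```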


Section 5: Statistical Power and Effect Size

Power (1 - β) = Probability of detecting real effect if it exists (typically aim for 80%)

Power depends on:

  1. Sample size (larger = more power)
  2. Effect size (larger effect = more power)
  3. Significance level α (5% standard)
  4. Type of test

Effect Size

A standardized effect size expresses the magnitude of an effect in standard-deviation units, so it does not depend on sample size.

Common measures:

  • Cohen’s d (for means): d = (μ₁ - μ₂) / σ
    • d = 0.2: Small effect
    • d = 0.5: Medium effect
    • d = 0.8: Large effect

Sample Size for Hypothesis Test

n = 2 × [(z_α + z_β) / d]²

where:
z_α = critical value for significance level
z_β = critical value for power (1 - β)
d = standardized effect size

Example:

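  • Want a two-sided test at the 5% significance level (α = 0.05, z_α = 1.96)
  • Want 80% power (β = 0.20, z_β = 0.84)
  • Expect medium effect (d = 0.5)
  • n = 2 × [(1.96 + 0.84) / 0.5]² = 62.72, so round up to 63 per group (exact t-based calculations give a slightly larger figure, about 64)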
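The per-group formula can be sketched as below, assuming a two-sided, two-sample comparison of means under the normal approximation. For d = 0.5 the formula evaluates to roughly 62.8, which rounds up to 63; exact t-based calculations typically land one or two higher.

```python
import math
from statistics import NormalDist

def n_per_group(d, alpha=0.05, power=0.80):
    """n = 2 × [(z_α + z_β) / d]² per group, normal approximation, two-sided test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for α = 0.05
    z_beta = NormalDist().inv_cdf(power)           # about 0.84 for 80% power
    return math.ceil(2 * ((z_alpha + z_beta) / d) ** 2)

for d in (0.2, 0.5, 0.8):
    print(f"d={d}: {n_per_group(d)} per group")
# d=0.2: 393 per group
# d=0.5: 63 per group
# d=0.8: 25 per group
```

The steep rise at d = 0.2 shows why small effects are expensive to detect.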

Section 6: Survey Design Principles

Questionnaire Design

Golden Rules:

  1. Keep it short: Minimize non-response
  2. Use clear language: Avoid jargon
  3. Avoid leading questions: “Do you agree that X?” biases responses
  4. Avoid double-barreled questions: “Is X good and helpful?” asks two things
  5. Provide escape options: “Don’t know”, “No opinion”
  6. Use consistent response scales: Don’t change mid-survey

Question Types:

  • Open-ended: Respondent answers freely (hard to analyze)
  • Closed-ended: Choose from options (easier to analyze)
  • Likert scale: Agree-disagree on spectrum

Question Order:

  • Start with easy demographic questions
  • Build to sensitive questions
  • Separate related topics
  • End with open feedback

Sources of Survey Error

Coverage Error:

  • Population not fully represented
  • Example: Landline phone survey missing cell-only households

Sampling Error:

  • Random variation from sampling
  • Reduced by larger sample size

Non-Response Error:

  • Respondents who don’t participate differ from those who do
  • Track response rate
  • Follow up on non-respondents

Measurement Error:

  • Misunderstanding questions
  • Social desirability bias (answering “correctly”)
  • Forgetting (recall error)
  • Interviewer effects (interviewer influence on answers)

Reducing Survey Error

Coverage:

  • Use multiple modes (phone, online, mail)
  • Include all population segments
  • Random digit dialing for phone

Sampling:

  • Random sampling method
  • Adequate sample size
  • Stratification for subgroups

Non-Response:

  • Incentives for participation
  • Multiple follow-ups
  • Compare respondents to non-respondents

Measurement:

  • Pre-test questionnaire
  • Train interviewers
  • Keep questions clear and brief
  • Minimize socially desirable responses

Section 7: Practical Examples

Example 1: Customer Satisfaction Survey

Population: All customers in past year (N = 50,000)

Objectives:

  • Estimate satisfaction percentage within ±3%
  • Identify differences by region
  • 95% confidence

Design:

  1. Sampling method: Stratified random by region

  2. Sample size:

    • Using p = 0.5 (unknown): n = (1.96 / 0.03)² × 0.5 × 0.5 = 1,068
    • Allocate proportionally to regions
  3. Survey design:

    • 5-question satisfaction scale
    • Pre-test with 50 customers
    • Phone or online response
    • Incentive: Entry in prize drawing
  4. Results analysis:

    • Compare satisfaction by region
    • Identify improvement areas

Example 2: Quality Control Sampling

Population: Continuous production line (N unknown)

Objectives:

  • Maintain defect rate below 2%
  • Detect shifts in process

Design:

  1. Sampling method: Systematic sampling (every 50th item)
  2. Sample size: n = 200 items per day
  3. Acceptance rule:
    • 0-2 defects: Accept batch
    • 3-4 defects: Re-sample
    • 5+ defects: Reject batch and investigate
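It can help to check how often each outcome of the acceptance rule would occur at a given true defect rate. The sketch below assumes defects occur independently at a constant rate (a binomial model), which is a simplification of a real production line; the rates in the loop are illustrative.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k defects among n independent items."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def decision_probabilities(n, p):
    """Chances of accept / re-sample / reject under the 0-2 / 3-4 / 5+ rule."""
    accept = sum(binom_pmf(k, n, p) for k in range(0, 3))    # 0-2 defects
    resample = sum(binom_pmf(k, n, p) for k in range(3, 5))  # 3-4 defects
    reject = 1 - accept - resample                           # 5 or more defects
    return accept, resample, reject

for rate in (0.005, 0.01, 0.02, 0.04):
    a, r, x = decision_probabilities(200, rate)
    print(f"defect rate {rate:.1%}: accept={a:.2f}, re-sample={r:.2f}, reject={x:.2f}")
```

At a true defect rate of exactly 2%, the expected count in 200 items is 4, so under this model the rule accepts outright only roughly a quarter of the time. Whether that strictness is intended is a design decision to confirm.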

Example 3: Election Poll

Population: Registered voters (N = millions)

Objectives:

  • Estimate candidate support to within ±2%
  • Track changes over time
  • 95% confidence

Design:

  1. Sampling method: Stratified random by state

  2. Sample size:

    • Using p = 0.5: n = (1.96 / 0.02)² × 0.5 × 0.5 = 2,401
    • Report margin of error explicitly
  3. Survey design:

    • Single preference question
    • Multiple response modes
    • Weighting to match voting patterns
    • Disclose methodology

Section 8: Best Practices

Planning a Study

  1. Define population clearly - Who are we studying?
  2. Choose appropriate method - Probability or non-probability?
  3. Calculate sample size - Use formulas or software
  4. Consider margin of error - What precision needed?
  5. Anticipate non-response - Oversample to account for it
  6. Budget time and resources - What’s feasible?
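Step 5 is simple arithmetic: divide the number of completed responses needed by the expected response rate. The 25% response rate below is a hypothetical figure; the target of 1,068 matches a ±3% margin of error at 95% confidence with p = 0.5.

```python
import math

def invitations_needed(target_completes, expected_response_rate):
    """How many people to contact to end up with the target number of responses."""
    if not 0 < expected_response_rate <= 1:
        raise ValueError("response rate must be in (0, 1]")
    return math.ceil(target_completes / expected_response_rate)

print(invitations_needed(1068, 0.25))  # 4272
```

Oversampling restores the sample size only. It does not remove non-response bias if non-respondents differ from respondents.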

Executing Survey

  1. Pre-test questionnaire - With 20-50 people
  2. Train data collectors - Consistent administration
  3. Track non-response - Document reasons
  4. Monitor data quality - Check for patterns/errors
  5. Secure data - Protect respondent privacy

Reporting Results

  1. Disclose methodology - Sample method, size, response rate
  2. Report margin of error - “X% ± Y%”
  3. Identify confidence level - Usually 95%
  4. Note limitations - What could be biased?
  5. Describe sampling design - Simple, stratified, or cluster?

Common Mistakes

  1. ❌ Using convenience sample then claiming results generalizable
  2. ❌ Ignoring non-response bias
  3. ❌ Assuming a larger sample is always better (it does not reduce non-sampling error)
  4. ❌ Confusing sampling error with non-sampling error
  5. ❌ Not pre-testing survey questions
  6. ❌ Asking leading or biased questions
  7. ❌ Not reporting response rate

Sampling Method Comparison

Method        | Bias   | Cost     | Efficiency | Use Case
Simple Random | Low    | Medium   | Good       | Homogeneous population
Stratified    | Low    | Medium   | Excellent  | Heterogeneous with subgroups
Cluster       | Medium | Low      | Fair       | Geographic dispersal
Systematic    | Low    | Low      | Good       | Ordered list
Convenience   | High   | Very Low | Poor       | Exploratory only
Purposive     | High   | Medium   | Fair       | Qualitative research


Summary

Proper sampling design ensures:

  • Representative samples
  • Reliable generalizations
  • Valid statistical inferences
  • Efficient use of resources

Invest in good sampling design before collecting data; problems are much harder to fix after the fact.