Every statistical study begins with a fundamental question: How do we choose who to study? Sampling is the process of selecting a subset from a population to represent the whole. Good sampling design ensures that your results are reliable and generalizable. Poor sampling leads to biased estimates and false conclusions.
This comprehensive guide covers sampling theory, methods, and practical survey design.
Understanding Sampling
Population vs Sample
Population:
- Complete set of individuals or items of interest
- Example: All voters in a country
- Usually too large to measure entirely
Sample:
- Subset of population
- Smaller, more manageable
- Representative if selected properly
Sample Size (n) vs Population Size (N):
- n = number of items in sample
- N = number of items in population
- Typically n ≪ N
Why Sampling?
Advantages of sampling:
- ✅ Lower cost
- ✅ Faster data collection
- ✅ Less resource-intensive
- ✅ Can conduct destructive tests (battery life, crash tests)
When to sample the entire population:
- Very small population
- Every item critical (aerospace)
- Administrative data already available
Sampling Error vs Non-Sampling Error
Sampling Error:
- Variation due to chance (random)
- Inherent in sampling process
- Reduced by larger samples
- Measured by standard error
Non-Sampling Error:
- Systematic bias (not random)
- From poor design or execution
- Not reduced by larger samples
- Examples: Wrong questions, misclassification, refusals
Total Error = Sampling Error + Non-Sampling Error
Section 1: Probability Sampling Methods
Probability sampling gives each population member a known, non-zero chance of selection.
Simple Random Sampling
Method: Randomly select n items from population of N items.
Process:
- Assign each item a unique number (1 to N)
- Use random number generator
- Select n numbers
- Include corresponding items
Advantages:
- ✅ Unbiased
- ✅ Simple to understand
- ✅ Representative
- ✅ Good for inference
Disadvantages:
- ❌ Impractical for large populations
- ❌ Doesn’t account for subgroups
- ❌ May miss rare characteristics
When to use:
- Homogeneous population
- Complete population list available
- Adequate resources
Example: Randomly select 100 students from university roster of 10,000
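The process above can be sketched in a few lines of Python. This is a minimal illustration, assuming the roster is already available as a list; the `simple_random_sample` helper and the numeric student IDs are hypothetical names chosen for the example.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n items without replacement; every item is equally likely."""
    rng = random.Random(seed)  # a seed makes the draw reproducible
    return rng.sample(population, n)

# Hypothetical roster: student IDs 1..10,000
roster = list(range(1, 10_001))
students = simple_random_sample(roster, 100, seed=42)
print(len(students), "students selected")
```

Fixing the seed is useful for audits, since the same draw can be reproduced later.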
Stratified Sampling
Method: Divide population into subgroups (strata), then randomly sample from each stratum.
Process:
- Divide population into strata (often by natural groupings)
- Determine allocation (proportional or equal)
- Randomly sample from each stratum
Stratification Variables:
- Geographic region
- Gender
- Age group
- Income level
- Department
Proportional Allocation: Sample each stratum in proportion to population.
Example (total sample of 100):
- Stratum A: 40% of population → 40 of sample
- Stratum B: 30% of population → 30 of sample
- Stratum C: 30% of population → 30 of sample
Equal Allocation: Same sample size per stratum regardless of population size.
Example:
- Each stratum: 33-34 items for a total sample of 100 across three strata (whether the stratum holds 100 or 1,000 members)
Advantages:
- ✅ Ensures representation of subgroups
- ✅ More precise estimates for subgroups
- ✅ Identifies differences between strata
- ✅ Reduces sampling error
Disadvantages:
- ❌ More complex than simple random
- ❌ Need population information
- ❌ More expensive
When to use:
- Heterogeneous population
- Subgroups naturally exist
- Interested in subgroup differences
- Want reduced variance
Example: Survey student satisfaction by school (Engineering, Business, Arts) with stratified sample
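A rough sketch of proportional allocation for the student example might look like the following. The roster layout, the school sizes (4,000 / 3,000 / 3,000), and the `stratified_sample` helper are assumptions made for illustration only.

```python
import random
from collections import defaultdict

def stratified_sample(items, stratum_of, n, seed=None):
    """Proportional allocation: each stratum contributes roughly
    n * (stratum size / population size) items, drawn at random."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for item in items:
        strata[stratum_of(item)].append(item)

    sample = {}
    for name, members in strata.items():
        # Rounding means the total can differ from n by a few items.
        k = round(n * len(members) / len(items))
        sample[name] = rng.sample(members, min(k, len(members)))
    return sample

# Hypothetical roster of (school, student_id) pairs.
roster = (
    [("Engineering", i) for i in range(4000)]
    + [("Business", i) for i in range(3000)]
    + [("Arts", i) for i in range(3000)]
)
sample = stratified_sample(roster, stratum_of=lambda s: s[0], n=100, seed=1)
sizes = {school: len(chosen) for school, chosen in sample.items()}
print(sizes)
```

For equal allocation, the per-stratum count `k` would simply be `n` divided by the number of strata instead.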
Cluster Sampling
Method: Divide population into clusters, randomly select clusters, then study all or sample within clusters.
Process:
- Divide population into clusters
- Randomly select clusters
- Include all items in selected clusters (or subsample)
Cluster Definition: Natural groupings such as geographic regions, schools, companies, neighborhoods
One-Stage Clustering: Select clusters, include all items in selected clusters.
Two-Stage Clustering: Select clusters, then randomly sample within clusters.
Advantages:
- ✅ Cost-effective for geographically dispersed populations
- ✅ No complete population list needed
- ✅ Practical for field surveys
- ✅ Reduces travel/administration costs
Disadvantages:
- ❌ Less efficient (higher sampling error)
- ❌ Cluster members may be similar (homogeneous within clusters)
- ❌ Requires more items to achieve same precision as simple random
When to use:
- Geographically dispersed population
- Complete list unavailable
- Cost is primary concern
- Natural clusters exist
Example: Survey voter preferences by randomly selecting 50 zip codes, then interviewing voters in those areas
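Two-stage clustering for the voter example could be sketched as below. The zip-code names, the 500 voters per zip code, and the `two_stage_cluster_sample` helper are all hypothetical; a real survey would load clusters from a sampling frame.

```python
import random

def two_stage_cluster_sample(clusters, n_clusters, n_per_cluster, seed=None):
    """Stage 1: pick clusters at random. Stage 2: sample within each one."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)  # stage 1
    return {
        name: rng.sample(clusters[name], min(n_per_cluster, len(clusters[name])))
        for name in chosen                             # stage 2
    }

# Hypothetical frame: 200 zip codes with 500 voters each.
zip_codes = {
    f"zip-{z:03d}": [f"voter-{z}-{v}" for v in range(500)] for z in range(200)
}
interviews = two_stage_cluster_sample(zip_codes, n_clusters=50, n_per_cluster=20, seed=7)
print(len(interviews), "zip codes,", sum(len(v) for v in interviews.values()), "voters")
```

One-stage clustering is the special case where every member of each chosen cluster is kept.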
Systematic Sampling
Method: Select every kth item from population after random start.
Process:
- Calculate k = N/n (population size / desired sample size)
- Randomly select starting point (1 to k)
- Select every kth item thereafter
Example:
- N = 1000, n = 100 → k = 10
- Random start = 7
- Select items: 7, 17, 27, 37, …
Advantages:
- ✅ Simple to execute
- ✅ Less training needed
- ✅ Spread sample across population
Disadvantages:
- ❌ Can introduce bias if the ordering has a periodic pattern that lines up with k
- ❌ Not truly random
When to use:
- Sequential or ordered population
- No patterns in population
- Easy implementation needed
Example: Quality control: test every 10th item off production line
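The selection rule can be sketched as follows, assuming the population is an ordered list and that N is at least n. The `systematic_sample` helper is a hypothetical name for this illustration.

```python
import random

def systematic_sample(items, n, seed=None):
    """Pick every k-th item after a random start, with k = N // n."""
    rng = random.Random(seed)
    k = len(items) // n
    start = rng.randint(1, k)  # 1-based position of the first pick
    picks = [items[pos - 1] for pos in range(start, len(items) + 1, k)]
    return picks[:n]           # trim extras when N is not a multiple of n

# Items numbered 1..1000, so each item equals its position in the list.
picks = systematic_sample(list(range(1, 1001)), 100, seed=3)
print(picks[:4])
```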
Section 2: Non-Probability Sampling Methods
Non-probability sampling does not give population members a known chance of selection, so the resulting bias cannot be quantified. Use it only with caution.
Convenience Sampling
Method: Select readily available subjects.
Disadvantages:
- ❌ Often biased (sample differs from population)
- ❌ Not representative
- ❌ Results not generalizable
When used:
- Early exploratory research
- Pilot studies
- When randomization impossible
- Budget extremely limited
Example: Survey mall shoppers on weekday afternoon
Purposive (Judgmental) Sampling
Method: Deliberately select subjects based on researcher’s judgment.
Types:
- Typical case sampling
- Extreme case sampling
- Maximum variation sampling
- Snowball sampling (referrals)
Advantages:
- ✅ Targets specific types of subjects
- ✅ Efficient for qualitative research
Disadvantages:
- ❌ Introduces researcher bias
- ❌ Results not generalizable to population
When used:
- Qualitative research
- Need specific expertise
- Focused case studies
Section 3: The Sampling Distribution
The sampling distribution is the probability distribution of a sample statistic (like the sample mean) across repeated samples.
Key Properties
Central Limit Theorem:
- Sample means approximately follow a normal distribution when n is large
- Holds regardless of the population's shape, provided its variance is finite
- As n increases, the approximation improves
Standard Error (SE): Standard deviation of sampling distribution:
SE = σ / √n (or s / √n for sample)
Interpretation:
- Smaller SE = more precise estimate
- Larger sample → Smaller SE (more precision)
- More variable population → Larger SE (less precision)
Example:
- Population: σ = 10
- Sample n = 100: SE = 10 / √100 = 1.0
- Sample n = 400: SE = 10 / √400 = 0.5
Quadrupling the sample size halves the standard error.
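A small simulation can make the σ / √n relationship concrete. The sketch below assumes a skewed Exponential(1) population (mean 1, σ = 1), which is a choice made for this illustration, and compares the spread of simulated sample means with the formula.

```python
import math
import random
import statistics

def simulate_sample_means(n, reps, seed=0):
    """Draw `reps` samples of size n from an Exponential(1) population
    and return the mean of each sample."""
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.expovariate(1.0) for _ in range(n))
        for _ in range(reps)
    ]

means = simulate_sample_means(n=100, reps=2000)
empirical_se = statistics.stdev(means)   # spread of the simulated sample means
theoretical_se = 1.0 / math.sqrt(100)    # sigma / sqrt(n)
print(f"empirical SE = {empirical_se:.3f}, theoretical SE = {theoretical_se:.3f}")
```

The two values should agree closely even though the population itself is far from normal.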
Confidence Interval from Sampling Distribution
Sample means are approximately normally distributed, so:
CI = x̄ ± z* × SE
Where z* depends on confidence level (1.96 for 95%)
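As a minimal sketch, the interval can be computed with the standard library alone. The `confidence_interval` helper and the example inputs (x̄ = 50, s = 10, n = 100) are assumptions for illustration; for small samples a t critical value would be more appropriate than z*.

```python
import math
from statistics import NormalDist

def confidence_interval(xbar, s, n, confidence=0.95):
    """Normal-approximation interval: xbar +/- z* x (s / sqrt(n))."""
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # about 1.96 for 95%
    margin = z_star * s / math.sqrt(n)
    return xbar - margin, xbar + margin

low, high = confidence_interval(xbar=50, s=10, n=100)
print(f"95% CI: ({low:.2f}, {high:.2f})")
```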
Section 4: Sample Size Determination
How large should the sample be?
Depends on:
- Desired precision (margin of error)
- Confidence level (usually 95%)
- Population variability
- Population size (rarely critical)
Formula for Estimating Mean
n = (z* × σ / ME)²
where:
z* = critical value (1.96 for 95%)
σ = population standard deviation (estimate)
ME = desired margin of error
Example:
- Want 95% CI with ME = 2
- Estimate σ = 10
- n = (1.96 × 10 / 2)² = 96.04, rounded up to 97 subjects needed
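The formula translates directly into code. In this sketch the result is rounded up, since rounding down would leave the margin of error slightly larger than requested; `sample_size_mean` is a hypothetical helper name.

```python
import math

def sample_size_mean(sigma, margin, z_star=1.96):
    """n = (z* x sigma / ME)^2, rounded up to a whole subject."""
    return math.ceil((z_star * sigma / margin) ** 2)

n_needed = sample_size_mean(sigma=10, margin=2)
print(n_needed)
```

Halving the margin of error roughly quadruples the required sample size, which is often the deciding factor in budgeting.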
Formula for Estimating Proportion
n = (z* / ME)² × p(1-p)
where:
p = estimated proportion
If p unknown, use p = 0.5 (most conservative)
Example:
- Want 95% CI with ME = 0.05 (5%)
- Use p = 0.5 (most conservative)
- n = (1.96 / 0.05)² × 0.5 × 0.5 = 385 subjects needed
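A sketch of the proportion formula follows. The `sample_size_proportion` helper is a hypothetical name, and the small rounding step before `ceil` is an implementation choice that guards against floating-point noise pushing an exact result up by one.

```python
import math

def sample_size_proportion(margin, p=0.5, z_star=1.96):
    """n = (z* / ME)^2 x p(1 - p), rounded up to a whole subject."""
    raw = (z_star / margin) ** 2 * p * (1 - p)
    # Round away tiny floating-point error before taking the ceiling.
    return math.ceil(round(raw, 8))

for margin in (0.05, 0.03, 0.02):
    print(f"ME = {margin:.0%}: n = {sample_size_proportion(margin)}")
```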
Accounting for Population Size
When the sample is a large fraction of the population (a common rule of thumb is n/N above about 5%), apply the finite population correction:
n_adjusted = n / (1 + n/N)
where N = population size
Example:
- Calculated n = 400
- Population N = 1000
- Adjusted n = 400 / (1 + 400/1000) = 286
Population size matters only when N is small relative to n.
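The adjustment is a one-liner; the sketch below uses a hypothetical `adjust_for_population` helper and also shows how little the correction changes when N is very large.

```python
import math

def adjust_for_population(n, N):
    """Finite population adjustment: n / (1 + n / N), rounded up."""
    return math.ceil(n / (1 + n / N))

print(adjust_for_population(400, 1_000))       # small population: large reduction
print(adjust_for_population(400, 10_000_000))  # huge population: almost no change
```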
Section 5: Statistical Power and Effect Size
Power (1 - β) = Probability of detecting real effect if it exists (typically aim for 80%)
Power depends on:
- Sample size (larger = more power)
- Effect size (larger effect = more power)
- Significance level α (5% standard)
- Type of test
Effect Size
A standardized effect size expresses the magnitude of an effect independently of sample size.
Common measures:
- Cohen’s d (for means): d = (μ₁ - μ₂) / σ
- d = 0.2: Small effect
- d = 0.5: Medium effect
- d = 0.8: Large effect
Sample Size for Hypothesis Test
n = 2 × [(z_α + z_β) / d]² (per group, when comparing two group means)
where:
z_α = critical value for significance level (use z for α/2 in a two-sided test)
z_β = critical value for power (1 - β)
d = standardized effect size
Example:
- Want 5% significance level, two-sided (α = 0.05, z = 1.96)
- Want 80% power (β = 0.20, z = 0.84)
- Expect medium effect (d = 0.5)
- n = 2 × [(1.96 + 0.84) / 0.5]² = 62.72, rounded up to 63 per group (t-based software typically reports 64)
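The calculation can be sketched with the standard library's normal quantile function. This is the normal approximation for a two-sided, two-group comparison of means; the `n_per_group` helper is a hypothetical name, and dedicated power software based on the t distribution usually reports slightly larger numbers.

```python
import math
from statistics import NormalDist

def n_per_group(d, alpha=0.05, power=0.80):
    """Normal-approximation sample size per group for a two-sided test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # about 0.84 for 80% power
    return math.ceil(2 * ((z_alpha + z_beta) / d) ** 2)

for label, d in (("small", 0.2), ("medium", 0.5), ("large", 0.8)):
    print(f"{label} effect (d = {d}): {n_per_group(d)} per group")
```

The steep growth in n as d shrinks is why a realistic effect size estimate matters more than any other planning input.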
Section 6: Survey Design Principles
Questionnaire Design
Golden Rules:
- Keep it short: Minimize non-response
- Use clear language: Avoid jargon
- Avoid leading questions: “Do you agree that X?” biases responses
- Avoid double-barreled questions: “Is X good and helpful?” asks two things
- Provide escape options: “Don’t know”, “No opinion”
- Use consistent response scales: Don’t change mid-survey
Question Types:
- Open-ended: Respondent answers freely (hard to analyze)
- Closed-ended: Choose from options (easier to analyze)
- Likert scale: Agree-disagree on spectrum
Question Order:
- Start with easy demographic questions
- Build to sensitive questions
- Separate related topics
- End with open feedback
Sources of Survey Error
Coverage Error:
- Population not fully represented
- Example: Phone survey missing cell-only households
Sampling Error:
- Random variation from sampling
- Reduced by larger sample size
Non-Response Error:
- Respondents who don’t participate differ from those who do
- Track response rate
- Follow up on non-respondents
Measurement Error:
- Misunderstanding questions
- Social desirability bias (answering “correctly”)
- Forgetting (recall error)
- Interview effects (interviewer influence)
Reducing Survey Error
Coverage:
- Use multiple modes (phone, online, mail)
- Include all population segments
- Random digit dialing for phone
Sampling:
- Random sampling method
- Adequate sample size
- Stratification for subgroups
Non-Response:
- Incentives for participation
- Multiple follow-ups
- Compare respondents to non-respondents
Measurement:
- Pre-test questionnaire
- Train interviewers
- Keep questions clear and brief
- Minimize socially desirable responses
Section 7: Practical Examples
Example 1: Customer Satisfaction Survey
Population: All customers in past year (N = 50,000)
Objectives:
- Estimate satisfaction percentage within ±3%
- Identify differences by region
- 95% confidence
Design:
- Sampling method: Stratified random by region
- Sample size:
- Using p = 0.5 (unknown): n = (1.96 / 0.03)² × 0.5 × 0.5 = 1,068
- Allocate proportionally to regions
- Survey design:
- 5-question satisfaction scale
- Pre-test with 50 customers
- Phone or online response
- Incentive: Entry in prize drawing
- Results analysis:
- Compare satisfaction by region
- Identify improvement areas
Example 2: Quality Control Sampling
Population: Continuous production line (N unknown)
Objectives:
- Maintain defect rate below 2%
- Detect shifts in process
Design:
- Sampling method: Systematic sampling (every 50th item)
- Sample size: n = 200 items per day
- Acceptance rule:
- 0-2 defects: Accept batch
- 3-4 defects: Re-sample
- 5+ defects: Reject batch and investigate
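The acceptance rule can be written as a small function, and a binomial calculation gives a rough sense of how often a batch would pass at a given true defect rate. Both helpers below are hypothetical, and the binomial model assumes defects occur independently, which may not hold on a real line.

```python
import math

def batch_decision(defects):
    """Apply the acceptance rule to the number of defects found in a sample."""
    if defects < 0:
        raise ValueError("defect count cannot be negative")
    if defects <= 2:
        return "accept"
    if defects <= 4:
        return "re-sample"
    return "reject"

def prob_at_most(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), assuming independent defects."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

print(batch_decision(3))
# Chance of outright acceptance when the true defect rate is exactly 2%.
print(f"{prob_at_most(2, 200, 0.02):.3f}")
```

Checking the rule this way before deployment shows how strict or lenient the thresholds are relative to the target defect rate.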
Example 3: Election Poll
Population: Registered voters (N = millions)
Objectives:
- Estimate candidate support to within ±2%
- Track changes over time
- 95% confidence
Design:
- Sampling method: Stratified random by state
- Sample size:
- Using p = 0.5: n = (1.96 / 0.02)² × 0.5 × 0.5 = 2,401
- Report margin of error explicitly
- Survey design:
- Single preference question
- Multiple response modes
- Weighting to match voting patterns
- Disclose methodology
Section 8: Best Practices
Planning a Study
- ✅ Define population clearly - Who are we studying?
- ✅ Choose appropriate method - Probability or non-probability?
- ✅ Calculate sample size - Use formulas or software
- ✅ Consider margin of error - What precision needed?
- ✅ Anticipate non-response - Oversample to account for it
- ✅ Budget time and resources - What’s feasible?
Executing Survey
- ✅ Pre-test questionnaire - With 20-50 people
- ✅ Train data collectors - Consistent administration
- ✅ Track non-response - Document reasons
- ✅ Monitor data quality - Check for patterns/errors
- ✅ Secure data - Protect respondent privacy
Reporting Results
- ✅ Disclose methodology - Sample method, size, response rate
- ✅ Report margin of error - “X% ± Y%”
- ✅ Identify confidence level - Usually 95%
- ✅ Note limitations - What could be biased?
- ✅ Describe sampling design - Simple, stratified, or cluster?
Common Mistakes
- ❌ Using convenience sample then claiming results generalizable
- ❌ Ignoring non-response bias
- ❌ Assuming larger sample always better
- ❌ Confusing sampling error with non-sampling error
- ❌ Not pre-testing survey questions
- ❌ Asking leading or biased questions
- ❌ Not reporting response rate
Sampling Method Comparison
| Method | Bias | Cost | Efficiency | Use Case |
|---|---|---|---|---|
| Simple Random | Low | Medium | Good | Homogeneous population |
| Stratified | Low | Medium | Excellent | Heterogeneous with subgroups |
| Cluster | Medium | Low | Fair | Geographic dispersal |
| Systematic | Low | Low | Good | Ordered list |
| Convenience | High | Very Low | Poor | Exploratory only |
| Purposive | High | Medium | Fair | Qualitative research |
Related Topics
- Hypothesis Testing - Use samples to test hypotheses
- Confidence Intervals - Estimate parameters with sample data
- Central Limit Theorem - Foundation of sampling
- Effect Sizes & Power - Planning adequate samples
Summary
Proper sampling design ensures:
- Representative samples
- Reliable generalizations
- Valid statistical inferences
- Efficient use of resources
Invest in good sampling design before collecting data; it’s harder to fix after the fact.