Outliers Overview

Outliers are data points that deviate significantly from the typical pattern of a dataset. They can result from:

  • Measurement errors: Faulty equipment, recording mistakes
  • Data entry errors: Typing mistakes, transcription errors
  • Legitimate extreme values: Genuine rare events
  • Process changes: System shifts or special causes

Why Detect Outliers?

  1. Data Quality: Identify errors that need correction
  2. Statistical Validity: Remove errors before analysis
  3. Model Performance: Prevent models from being skewed
  4. Pattern Recognition: Discover unusual but genuine events
  5. Risk Assessment: Identify extreme scenarios

Method 1: Interquartile Range (IQR) Method

The IQR method is one of the most commonly used outlier detection techniques. It relies on quartiles, which are themselves resistant to extreme values.

Formula

Outlier Boundaries:

$$\text{Lower Bound} = Q_1 - 1.5 \times IQR$$

$$\text{Upper Bound} = Q_3 + 1.5 \times IQR$$

where $IQR = Q_3 - Q_1$

Any value outside these bounds is considered an outlier.
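As a minimal sketch, the bounds can be computed with Python's standard library. The helper names `quartiles`, `iqr_bounds` and `iqr_outliers` are illustrative, not from any library. The sketch assumes quartiles are taken as the medians of the lower and upper halves of the sorted data, which is one of several common conventions; other conventions give slightly different values.

```python
from statistics import median

def quartiles(data):
    """Return (Q1, Q3) as the medians of the lower and upper halves.

    Assumption: with an odd number of values, the overall median is
    left out of both halves. Other quartile conventions exist.
    """
    s = sorted(data)
    if len(s) < 2:
        raise ValueError("need at least two values")
    half = len(s) // 2
    lower = s[:half]
    upper = s[half:] if len(s) % 2 == 0 else s[half + 1:]
    return median(lower), median(upper)

def iqr_bounds(data, k=1.5):
    """Return (Q1 - k*IQR, Q3 + k*IQR)."""
    q1, q3 = quartiles(data)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def iqr_outliers(data, k=1.5):
    """Return the values that fall outside the IQR bounds."""
    low, high = iqr_bounds(data, k)
    return [x for x in data if x < low or x > high]

# Dataset from the first worked example in this section.
data = [15, 22, 28, 35, 42, 48, 55, 62, 78, 95, 125]
print(iqr_bounds(data))    # (-47.0, 153.0)
print(iqr_outliers(data))  # []
```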

Interpretation

  • Mild Outliers: Between 1.5 and 3 times the IQR below Q₁ or above Q₃
  • Extreme Outliers: More than 3 times the IQR below Q₁ or above Q₃
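The mild/extreme distinction can be sketched as a small classifier. The function name `classify` is hypothetical, and the quartiles are again assumed to be the medians of the lower and upper halves of the sorted data.

```python
from statistics import median

def classify(data):
    """Return (value, label) pairs; label is 'normal', 'mild' or 'extreme'."""
    s = sorted(data)
    half = len(s) // 2
    q1 = median(s[:half])
    q3 = median(s[half:] if len(s) % 2 == 0 else s[half + 1:])
    iqr = q3 - q1
    result = []
    for x in data:
        # How far x lies below Q1 or above Q3 (0 when inside the box).
        excess = max(q1 - x, x - q3, 0)
        if excess > 3 * iqr:
            label = "extreme"
        elif excess > 1.5 * iqr:
            label = "mild"
        else:
            label = "normal"
        result.append((x, label))
    return result

data = [10, 12, 14, 15, 16, 17, 18, 19, 20, 100]
print([pair for pair in classify(data) if pair[1] != "normal"])
# [(100, 'extreme')]
```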

Example: IQR Method

Data: 15, 22, 28, 35, 42, 48, 55, 62, 78, 95, 125

Identify outliers using the IQR method.

Solution:

Step 1: Calculate quartiles

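Arranged: 15, 22, 28, 35, 42, 48, 55, 62, 78, 95, 125 (n=11, median = 48)

  • Q₁ (25th percentile) = 28 (median of the lower half: 15, 22, 28, 35, 42)
  • Q₃ (75th percentile) = 78 (median of the upper half: 55, 62, 78, 95, 125)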

Step 2: Calculate IQR

$$IQR = 78 - 28 = 50$$

Step 3: Calculate outlier boundaries

$$\text{Lower Bound} = 28 - 1.5 \times 50 = 28 - 75 = -47$$

$$\text{Upper Bound} = 78 + 1.5 \times 50 = 78 + 75 = 153$$

Step 4: Identify outliers

Values outside [-47, 153]: None

All values lie within the bounds, so no outliers are detected.

Example 2: With Obvious Outliers

Data: 10, 12, 14, 15, 16, 17, 18, 19, 20, 100

Identify outliers.

Solution:

  • Q₁ = 14 (median of the lower half: 10, 12, 14, 15, 16)
  • Q₃ = 19 (median of the upper half: 17, 18, 19, 20, 100)
  • IQR = 19 - 14 = 5
  • Lower Bound = 14 - 1.5(5) = 6.5
  • Upper Bound = 19 + 1.5(5) = 26.5

Outlier: 100 (exceeds upper bound of 26.5)
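Quartile values depend on the convention used, so software may report slightly different Q₁ and Q₃ than a hand calculation. The sketch below uses `statistics.quantiles` from the Python standard library (Python 3.8 or later) with its two built-in methods; the helper name `flagged` is illustrative. For this dataset the conclusion is the same under both conventions.

```python
from statistics import quantiles

data = [10, 12, 14, 15, 16, 17, 18, 19, 20, 100]

def flagged(data, method):
    """Return ((Q1, Q3), outliers) for the given quantile method."""
    q1, _, q3 = quantiles(data, n=4, method=method)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return (q1, q3), [x for x in data if x < low or x > high]

for method in ("exclusive", "inclusive"):
    (q1, q3), outliers = flagged(data, method)
    print(method, q1, q3, outliers)
# exclusive 13.5 19.25 [100]
# inclusive 14.25 18.75 [100]
```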

Method 2: Z-Score Method

The z-score method identifies outliers based on standard deviations from the mean.

Formula

$$z = \frac{x - \bar{x}}{s}$$

where $\bar{x}$ is the sample mean and $s$ is the sample standard deviation.

Outlier Criteria:

  • Moderate outliers: |z| > 2 (more than 2 standard deviations from the mean)
  • Extreme outliers: |z| > 3 (more than 3 standard deviations from the mean)
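A minimal sketch of these criteria in Python follows. The function names `z_scores` and `z_outliers` are illustrative, and the sketch assumes the sample standard deviation (n - 1 in the denominator).

```python
from statistics import mean, stdev

def z_scores(data):
    """Return the z-score of every value in data."""
    m = mean(data)
    s = stdev(data)  # sample standard deviation (n - 1 in the denominator)
    if s == 0:
        raise ValueError("all values are identical; z-scores are undefined")
    return [(x - m) / s for x in data]

def z_outliers(data, threshold=2.0):
    """Return the values whose |z| exceeds the threshold."""
    return [x for x, z in zip(data, z_scores(data)) if abs(z) > threshold]

data = [50, 55, 60, 65, 70, 75, 80, 85, 90, 150]
print(z_outliers(data))       # [150]
print(z_outliers(data, 3.0))  # []
```

The value 150 passes the |z| > 2 criterion but not |z| > 3, which shows how much the chosen threshold matters in a small sample.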

Example: Z-Score Method

Data: 50, 55, 60, 65, 70, 75, 80, 85, 90, 150

Identify outliers using the z-score method.

Solution:

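$$\bar{x} = 78, \quad s \approx 28.4$$

Here $s$ is the sample standard deviation (n - 1 in the denominator): the squared deviations from the mean sum to 7260, so $s = \sqrt{7260/9} \approx 28.4$.

Calculate z-scores:

| Value | Z-score |
|---|---|
| 50 | (50-78)/28.4 = -0.99 |
| 55 | (55-78)/28.4 = -0.81 |
| 60 | (60-78)/28.4 = -0.63 |
| 65 | (65-78)/28.4 = -0.46 |
| 70 | (70-78)/28.4 = -0.28 |
| 75 | (75-78)/28.4 = -0.11 |
| 80 | (80-78)/28.4 = 0.07 |
| 85 | (85-78)/28.4 = 0.25 |
| 90 | (90-78)/28.4 = 0.42 |
| 150 | (150-78)/28.4 = 2.54 |

Outliers (|z| > 2): 150 (z = 2.54). The value does not reach |z| > 3, because the outlier itself inflates the standard deviation.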

Method 3: Modified Z-Score Method

The modified z-score method uses the median and the median absolute deviation (MAD) instead of the mean and standard deviation, making it more robust to outliers.

Formula

$$\text{Modified Z-score} = \frac{0.6745(x - \text{Median})}{MAD}$$

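where MAD is the median of absolute deviations from the median: $MAD = \text{median}(|x_i - \text{Median}|)$. The constant 0.6745 is the 75th percentile of the standard normal distribution; it puts the score on roughly the same scale as an ordinary z-score when the data are normal.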

Outlier Criterion: |Modified z-score| > 3.5
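The formula and criterion can be sketched as follows. The function names are illustrative, and the sketch simply refuses to compute scores when the MAD is zero (more than half of the values identical), since the formula is undefined in that case.

```python
from statistics import median

def modified_z_scores(data):
    """Return 0.6745 * (x - median) / MAD for every value in data."""
    med = median(data)
    mad = median([abs(x - med) for x in data])
    if mad == 0:
        raise ValueError("MAD is zero; modified z-scores are undefined")
    return [0.6745 * (x - med) / mad for x in data]

def modified_z_outliers(data, threshold=3.5):
    """Return the values whose |modified z-score| exceeds the threshold."""
    scores = modified_z_scores(data)
    return [x for x, m in zip(data, scores) if abs(m) > threshold]

data = [50, 55, 60, 65, 70, 75, 80, 85, 90, 150]
print(modified_z_outliers(data))  # [150]
```

For this dataset the median is 72.5 and the MAD is 12.5, so 150 scores about 4.18 and is flagged, while no other value comes close to 3.5.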

Advantages

  • Less sensitive to extreme outliers
  • Better for skewed distributions
  • More robust for small datasets

Method 4: Box Plot Method

A box plot identifies outliers visually. It is built from the five-number summary, with whiskers limited by the same 1.5×IQR bounds as in Method 1.

Box Plot Structure

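  O      |--------[ Q₁ | Median | Q₃ ]--------|      O

  • Box: spans Q₁ to Q₃, with a line at the median
  • Whiskers: reach to the most extreme data points that still lie between Q₁ - 1.5×IQR and Q₃ + 1.5×IQR
  • O: outlier points beyond the whiskers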

Visual Identification

  • Points beyond whiskers are outliers
  • Easy to compare multiple datasets
  • Useful for presentations
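The numbers behind a box plot can be computed without a plotting library. The sketch below is one possible layout, with a hypothetical function name `box_plot_stats` and quartiles taken as medians of the lower and upper halves; plotting libraries may use a different quartile convention and so draw slightly different boxes.

```python
from statistics import median

def box_plot_stats(data, k=1.5):
    """Return the values a box plot would draw for one dataset."""
    s = sorted(data)
    if len(s) < 2:
        raise ValueError("need at least two values")
    half = len(s) // 2
    q1 = median(s[:half])
    q3 = median(s[half:] if len(s) % 2 == 0 else s[half + 1:])
    iqr = q3 - q1
    low_fence, high_fence = q1 - k * iqr, q3 + k * iqr
    # Whiskers end at the most extreme data points inside the fences.
    inside = [x for x in s if low_fence <= x <= high_fence]
    return {
        "whisker_low": inside[0],
        "q1": q1,
        "median": median(s),
        "q3": q3,
        "whisker_high": inside[-1],
        "outliers": [x for x in s if x < low_fence or x > high_fence],
    }

stats = box_plot_stats([10, 12, 14, 15, 16, 17, 18, 19, 20, 100])
print(stats["whisker_low"], stats["whisker_high"], stats["outliers"])
# 10 20 [100]
```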

Method 5: Isolation Forest Method

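Isolation Forest is a machine learning approach that identifies outliers by how easily they can be isolated from the rest of the data.

Concept: The data are split repeatedly at random. Outliers lie far from other points, so they are separated in fewer splits than typical points.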

Advantages:

  • No assumptions about distribution
  • Works well with multivariate data
  • Handles mixed data types once categorical features are encoded numerically
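The isolation idea can be illustrated with a toy one-dimensional sketch in plain Python. This is not a full Isolation Forest: a real implementation, such as scikit-learn's `IsolationForest`, builds many random trees over subsamples and several features and converts the average depth into an anomaly score. The function names and the fixed seed here are assumptions made for the illustration.

```python
import random

def isolation_depth(values, target, rng, max_depth=50):
    """Count the random splits needed to isolate target from values."""
    current = list(values)
    depth = 0
    while len(current) > 1 and depth < max_depth:
        low, high = min(current), max(current)
        if low == high:  # remaining points are identical; cannot split
            break
        split = rng.uniform(low, high)
        # Keep only the side of the split that contains the target.
        if target < split:
            current = [v for v in current if v < split]
        else:
            current = [v for v in current if v >= split]
        depth += 1
    return depth

def average_depth(values, target, n_trials=500, seed=0):
    """Average isolation depth of target over many random trials."""
    rng = random.Random(seed)
    total = sum(isolation_depth(values, target, rng) for _ in range(n_trials))
    return total / n_trials

data = [10, 12, 14, 15, 16, 17, 18, 19, 20, 100]
for x in (16, 100):
    print(x, average_depth(data, x))
```

The value 100 is usually cut off by the first split, so its average depth is close to 1, while a typical value such as 16 needs several splits.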

Handling Outliers

Options for Dealing with Outliers

| Method | Use Case |
|---|---|
| Delete | Clear data entry errors; value is not legitimate |
| Transform | Use log/square root transformation |
| Cap/Floor | Replace with max/min acceptable values |
| Separate Analysis | Analyze outliers independently |
| Robust Methods | Use methods less sensitive to outliers |
| Investigate | Determine cause (process change, error) |
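Two of these options, capping and transforming, are easy to sketch in code. The example below assumes that the "acceptable" limits for capping are the 1.5×IQR bounds, which is only one possible choice, and that the data are strictly positive for the log transform. The function names are illustrative.

```python
import math
from statistics import median

def iqr_bounds(data, k=1.5):
    """Return (Q1 - k*IQR, Q3 + k*IQR) using medians of the two halves."""
    s = sorted(data)
    half = len(s) // 2
    q1 = median(s[:half])
    q3 = median(s[half:] if len(s) % 2 == 0 else s[half + 1:])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def cap_floor(data, k=1.5):
    """Replace values beyond the IQR bounds with the nearest bound."""
    low, high = iqr_bounds(data, k)
    return [min(max(x, low), high) for x in data]

def log_transform(data):
    """Natural-log transform; only valid for strictly positive values."""
    if any(x <= 0 for x in data):
        raise ValueError("log transform requires positive values")
    return [math.log(x) for x in data]

data = [10, 12, 14, 15, 16, 17, 18, 19, 20, 100]
print(cap_floor(data))
# [10, 12, 14, 15, 16, 17, 18, 19, 20, 26.5]
print([round(v, 2) for v in log_transform(data)][-2:])
```

Capping keeps the observation in the dataset but limits its influence; the log transform keeps the ordering and compresses the gap between 100 and the rest.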

Example: Decision Framework

  1. Identify outliers using IQR or Z-score
  2. Verify data entry: check if it’s a real value
  3. Investigate cause: error, process change, or legitimate extreme?
  4. Decide action:
    • If error: Delete or correct
    • If process change: Analyze separately
    • If legitimate: Keep or use robust methods

Practical Considerations

When Using IQR Method

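  • Works best for roughly symmetric data; it does not require normality
  • Fairly conservative: for normally distributed data, the 1.5×IQR rule flags only about 0.7% of points
  • Robust: the quartiles are barely affected by the extreme values being tested
  • Widely used default; it is the rule behind box plot whiskers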

When Using Z-Score Method

  • Works best with normally distributed data
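  • Uses the mean and standard deviation as the measures of center and spread
  • Sensitive to extreme values: outliers inflate the standard deviation and can mask themselves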
  • Good for statistical analysis
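One consequence of this sensitivity is worth checking numerically: with the sample standard deviation, no value in a sample of size n can have |z| larger than (n - 1)/√n, however extreme it is. The sketch below uses a made-up dataset (nine zeros and one very large value) to show that the bound is reached; the function name `max_abs_z` is illustrative.

```python
from math import sqrt
from statistics import mean, stdev

def max_abs_z(data):
    """Return the largest |z| in data, using the sample standard deviation."""
    m, s = mean(data), stdev(data)
    return max(abs(x - m) / s for x in data)

n = 10
data = [0] * (n - 1) + [1_000_000]  # hypothetical data: one huge value
bound = (n - 1) / sqrt(n)
print(round(max_abs_z(data), 3), round(bound, 3))
# 2.846 2.846
```

With n = 10 the |z| > 3 criterion can therefore never fire, which is one reason to prefer the IQR or modified z-score methods for small samples.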

When Using Modified Z-Score Method

  • Better for skewed data
  • More robust to extreme values
  • Preferred for non-normal distributions
  • Good for data with known outliers

Real-World Examples

Example 1: Sales Data

Daily sales: 1200, 1150, 1300, 1250, 1400, 5000 (unusual sale)

Sorted: 1150, 1200, 1250, 1300, 1400, 5000

Q₁ = 1200, Q₃ = 1400, IQR = 200

Bounds: [1200 - 300, 1400 + 300] = [900, 1700]

5000 is an outlier (likely an exceptional order or a data error)

Example 2: Temperature Data

Daily temperatures: 68, 70, 72, 71, 69, -50

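IQR Method: Q₁ = 68, Q₃ = 71, IQR = 3, so the bounds are [63.5, 75.5]; -50 is clearly an outlier (likely a sensor malfunction)

Action: Check the sensor, then correct or delete the reading once the cause is confirmed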

Summary Statistics with/without Outliers

| Statistic | With Outlier | Without Outlier | Difference |
|---|---|---|---|
| Mean | 78.6 | 71.2 | 7.4 |
| Median | 75 | 74 | 1 |
| Std Dev | 31.6 | 8.2 | 23.4 |

Note: Outliers greatly affect the mean and standard deviation, while the median barely changes.
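The same kind of comparison can be reproduced for any dataset. The dataset behind the table above is not listed, so this sketch uses the data from Example 2 of the IQR method instead; the helper name `summarize` is illustrative.

```python
from statistics import mean, median, stdev

def summarize(values):
    """Return mean, median and sample standard deviation, rounded to 1 decimal."""
    return {
        "mean": round(float(mean(values)), 1),
        "median": round(float(median(values)), 1),
        "stdev": round(stdev(values), 1),
    }

data = [10, 12, 14, 15, 16, 17, 18, 19, 20, 100]
without = [x for x in data if x != 100]

print(summarize(data))     # {'mean': 24.1, 'median': 16.5, 'stdev': 26.8}
print(summarize(without))  # {'mean': 15.7, 'median': 16.0, 'stdev': 3.3}
```

Removing one value changes the mean by more than 8 and the standard deviation by a factor of about 8, while the median moves by only 0.5.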

Best Practices

  1. Always visualize: Use box plots, scatter plots
  2. Use multiple methods: Compare IQR, Z-score, modified Z-score
  3. Document decisions: Note why outliers were kept/removed
  4. Investigate causes: Understand outlier origin
  5. Report findings: Include outlier information in analysis
  6. Consider context: Domain knowledge is crucial
  7. Be conservative: When uncertain, keep outliers
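Comparing methods, as recommended above, can be sketched by running all three detectors on the same data. The dataset here is made up for illustration and contains two large values; the function name `compare_methods` and the thresholds passed as defaults are assumptions taken from the criteria in this section.

```python
from statistics import mean, median, stdev

def compare_methods(data, z_threshold=2.0, mz_threshold=3.5):
    """Return the values flagged by the IQR, z-score and modified z-score methods."""
    s = sorted(data)
    half = len(s) // 2
    q1 = median(s[:half])
    q3 = median(s[half:] if len(s) % 2 == 0 else s[half + 1:])
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

    m, sd = mean(data), stdev(data)
    med = median(data)
    mad = median([abs(x - med) for x in data])  # assumed non-zero here

    return {
        "iqr": [x for x in data if x < low or x > high],
        "z_score": [x for x in data if abs(x - m) / sd > z_threshold],
        "modified_z": [x for x in data
                       if abs(0.6745 * (x - med) / mad) > mz_threshold],
    }

# Hypothetical data with two large values.
data = [10, 12, 14, 15, 16, 17, 18, 19, 90, 100]
print(compare_methods(data))
# {'iqr': [90, 100], 'z_score': [100], 'modified_z': [90, 100]}
```

The two large values inflate the standard deviation enough that the z-score method misses 90, while the two median-based methods flag both. Disagreement of this kind is a signal to look more closely rather than to pick the method with the preferred answer.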

Caution

  • Never automatically delete outliers - they may be important
  • Don’t use just one method - triangulate with multiple approaches
  • Consider subject matter - domain expertise matters
  • Document your process - for reproducibility
  • Check sensitivity - how much do results change with/without outliers?

Tools and Software

  • Python: scikit-learn (IsolationForest), pandas (quantile() for IQR bounds)
  • R: boxplot.stats() in base R; add-on packages such as outliers
  • Excel: Conditional formatting with an IQR formula based on QUARTILE
  • Statistical packages: SPSS, SAS (built-in outlier detection procedures)
