Outliers Overview
Outliers are data points that deviate significantly from the typical pattern of a dataset. They can result from:
- Measurement errors: Faulty equipment, recording mistakes
- Data entry errors: Typing mistakes, transcription errors
- Legitimate extreme values: Genuine rare events
- Process changes: System shifts or special causes
Why Detect Outliers?
- Data Quality: Identify errors that need correction
- Statistical Validity: Correct or remove erroneous values before analysis so they do not distort estimates and tests
- Model Performance: Prevent a few extreme points from dominating model fits
- Pattern Recognition: Discover unusual but genuine events
- Risk Assessment: Identify extreme scenarios
Method 1: Interquartile Range (IQR) Method
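The IQR method is one of the most widely used outlier detection techniques. Because it relies on quartiles rather than the mean and standard deviation, extreme values have little effect on the bounds.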
Formula
Outlier Boundaries:
$$\text{Lower Bound} = Q_1 - 1.5 \times IQR$$ $$\text{Upper Bound} = Q_3 + 1.5 \times IQR$$
where $IQR = Q_3 - Q_1$
Any value outside these bounds is considered an outlier.
Interpretation
- Mild Outliers: Between 1.5×IQR and 3×IQR below Q₁ or above Q₃
- Extreme Outliers: More than 3×IQR below Q₁ or above Q₃
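The mild/extreme distinction can be sketched as a small Python function. This is a minimal illustration, not a library routine: the function name `classify_outlier` and the quartile values used in the demonstration are hypothetical.

```python
def classify_outlier(x, q1, q3):
    """Label x as 'none', 'mild', or 'extreme' using the 1.5*IQR and 3*IQR fences."""
    iqr = q3 - q1
    inner_low, inner_high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outer_low, outer_high = q1 - 3.0 * iqr, q3 + 3.0 * iqr
    if x < outer_low or x > outer_high:
        return "extreme"
    if x < inner_low or x > inner_high:
        return "mild"
    return "none"


# Hypothetical quartiles chosen only for illustration: Q1 = 20, Q3 = 30 (IQR = 10),
# giving inner fences [5, 45] and outer fences [-10, 60].
for value in (25, 48, 75):
    print(value, classify_outlier(value, q1=20, q3=30))
# 25 none
# 48 mild
# 75 extreme
```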
Example: IQR Method
Data: 15, 22, 28, 35, 42, 48, 55, 62, 78, 95, 125
Identify outliers using IQR method.
Solution:
Step 1: Calculate quartiles
Arranged: 15, 22, 28, 35, 42, 48, 55, 62, 78, 95, 125 (n=11)
- Q₁ (25th percentile) = 28 (median of the lower half: 15, 22, 28, 35, 42)
- Q₃ (75th percentile) = 78 (median of the upper half: 55, 62, 78, 95, 125)
Step 2: Calculate IQR
$$IQR = 78 - 28 = 50$$
Step 3: Calculate outlier boundaries
$$\text{Lower Bound} = 28 - 1.5 \times 50 = 28 - 75 = -47$$
$$\text{Upper Bound} = 78 + 1.5 \times 50 = 78 + 75 = 153$$
Step 4: Identify outliers
Values outside [-47, 153]: None
All values fall within the bounds, so no outliers are detected. Even 125, the largest value, is well below the upper bound.
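The four steps above can be sketched in Python. This is a minimal version that assumes quartiles are taken as the medians of the lower and upper halves of the sorted data (the middle value is excluded when n is odd), which is the convention that gives Q₁ = 28 and Q₃ = 78 here. The function names are illustrative, and other software may use a different quartile convention.

```python
from statistics import median


def quartiles(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves of the data."""
    data = sorted(values)
    if len(data) < 2:
        raise ValueError("need at least two values")
    half = len(data) // 2  # the middle value is left out when n is odd
    return median(data[:half]), median(data[-half:])


def iqr_bounds(values, k=1.5):
    """Return (lower bound, upper bound) = (Q1 - k*IQR, Q3 + k*IQR)."""
    q1, q3 = quartiles(values)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def iqr_outliers(values, k=1.5):
    """Return the values that fall outside the IQR bounds."""
    low, high = iqr_bounds(values, k)
    return [x for x in values if x < low or x > high]


data = [15, 22, 28, 35, 42, 48, 55, 62, 78, 95, 125]
print(quartiles(data))     # (28, 78)
print(iqr_bounds(data))    # (-47.0, 153.0)
print(iqr_outliers(data))  # []
```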
Example 2: With Obvious Outliers
Data: 10, 12, 14, 15, 16, 17, 18, 19, 20, 100
Identify outliers.
Solution:
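- Q₁ = 14 (median of the lower half: 10, 12, 14, 15, 16)
- Q₃ = 19 (median of the upper half: 17, 18, 19, 20, 100)
- IQR = 19 - 14 = 5
- Lower Bound = 14 - 1.5(5) = 6.5
- Upper Bound = 19 + 1.5(5) = 26.5
Outlier: 100 (exceeds upper bound of 26.5). Because 100 also exceeds Q₃ + 3 × IQR = 34, it is an extreme outlier.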
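Quartile values depend on the convention used, so different tools can report slightly different bounds for the same data. The sketch below uses the standard library's `statistics.quantiles` with its two built-in methods to compare the resulting bounds for this dataset. The helper name `bounds_for` is illustrative.

```python
from statistics import quantiles

data = [10, 12, 14, 15, 16, 17, 18, 19, 20, 100]


def bounds_for(method):
    """IQR bounds using one of the quartile conventions offered by statistics.quantiles."""
    q1, _, q3 = quantiles(data, n=4, method=method)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


for method in ("exclusive", "inclusive"):
    low, high = bounds_for(method)
    flagged = [x for x in data if x < low or x > high]
    print(method, (low, high), flagged)
```

The bounds shift by a unit or two between conventions, but 100 lies far outside them in every case, so the conclusion does not depend on the quartile method.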
Method 2: Z-Score Method
The z-score method identifies outliers based on standard deviations from the mean.
Formula
$$z = \frac{x - \bar{x}}{s}$$
Outlier Criteria:
- Moderate outliers: |z| > 2 (2 standard deviations)
- Extreme outliers: |z| > 3 (3 standard deviations)
Example: Z-Score Method
Data: 50, 55, 60, 65, 70, 75, 80, 85, 90, 150
Identify outliers using z-score method.
Solution:
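$$\bar{x} = 78, \quad s \approx 28.4$$
Here $s$ is the sample standard deviation (computed with $n - 1$ in the denominator).
Calculate z-scores:
| Value | Z-score |
|---|---|
| 50 | (50-78)/28.4 = -0.99 |
| 55 | (55-78)/28.4 = -0.81 |
| 60 | (60-78)/28.4 = -0.63 |
| 65 | (65-78)/28.4 = -0.46 |
| 70 | (70-78)/28.4 = -0.28 |
| 75 | (75-78)/28.4 = -0.11 |
| 80 | (80-78)/28.4 = 0.07 |
| 85 | (85-78)/28.4 = 0.25 |
| 90 | (90-78)/28.4 = 0.42 |
| 150 | (150-78)/28.4 = 2.54 |
Outliers (|z| > 2): 150 (z = 2.54). Under the stricter |z| > 3 criterion nothing is flagged, because 150 itself inflates both the mean and the standard deviation.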
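The same calculation can be sketched with Python's standard library. This is a minimal version that assumes the sample standard deviation (n - 1 denominator), which is what `statistics.stdev` computes. The function names are illustrative.

```python
from statistics import mean, stdev


def z_scores(values):
    """Return the z-score of each value, using the sample standard deviation."""
    m = mean(values)
    s = stdev(values)  # n - 1 denominator
    if s == 0:
        raise ValueError("all values are identical; z-scores are undefined")
    return [(x - m) / s for x in values]


def z_outliers(values, threshold=2.0):
    """Return the values whose absolute z-score exceeds the threshold."""
    return [x for x, z in zip(values, z_scores(values)) if abs(z) > threshold]


data = [50, 55, 60, 65, 70, 75, 80, 85, 90, 150]
print(z_outliers(data, threshold=2.0))  # [150]
print(z_outliers(data, threshold=3.0))  # []
```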
Method 3: Modified Z-Score Method
Uses the median and the median absolute deviation (MAD) instead of the mean and standard deviation, so the scores are far less affected by the outliers being tested.
Formula
$$\text{Modified Z-score} = \frac{0.6745(x - \text{Median})}{MAD}$$
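where $MAD = \text{median}(|x_i - \text{Median}|)$ is the median of the absolute deviations from the median. The constant 0.6745 rescales the MAD so that, for normally distributed data, the modified z-score is comparable to an ordinary z-score.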
Outlier Criterion: |Modified z-score| > 3.5
Advantages
- Less sensitive to extreme outliers
- Better for skewed distributions
- More robust for small datasets
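The modified z-score formula can be sketched in Python as follows. This is a minimal version with illustrative function names; it is applied to the data from the z-score example (50, 55, ..., 90, 150) to show that the robust score flags 150 even at the 3.5 threshold.

```python
from statistics import median


def modified_z_scores(values):
    """Return 0.6745 * (x - median) / MAD for each value."""
    med = median(values)
    mad = median([abs(x - med) for x in values])
    if mad == 0:
        raise ValueError("MAD is zero; modified z-scores are undefined")
    return [0.6745 * (x - med) / mad for x in values]


def modified_z_outliers(values, threshold=3.5):
    """Return the values whose absolute modified z-score exceeds the threshold."""
    scores = modified_z_scores(values)
    return [x for x, score in zip(values, scores) if abs(score) > threshold]


data = [50, 55, 60, 65, 70, 75, 80, 85, 90, 150]
print(modified_z_outliers(data))  # [150]
```

For this data the median is 72.5 and the MAD is 12.5, so the modified z-score of 150 is 0.6745 × 77.5 / 12.5 ≈ 4.18.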
Method 4: Box Plot Method
Visually identifies outliers using the five-number summary.
Box Plot Structure
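- Box: Spans Q₁ to Q₃, with a line at the median
- Lower whisker: Extends to the smallest observation at or above Q₁ - 1.5×IQR
- Upper whisker: Extends to the largest observation at or below Q₃ + 1.5×IQR
- Outlier points: Observations beyond the whiskers, plotted individually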
Visual Identification
- Points beyond whiskers are outliers
- Easy to compare multiple datasets
- Useful for presentations
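The numbers a box plot draws can be computed directly. The sketch below assumes quartiles taken as medians of the lower and upper halves and whiskers drawn to the most extreme observations inside the 1.5×IQR fences. The function name and the dataset are hypothetical, chosen only for illustration.

```python
from statistics import median


def box_plot_stats(values, k=1.5):
    """Return the whisker ends, quartiles, median, and outliers a box plot would show."""
    data = sorted(values)
    half = len(data) // 2
    q1, q3 = median(data[:half]), median(data[-half:])
    iqr = q3 - q1
    low_fence, high_fence = q1 - k * iqr, q3 + k * iqr
    inside = [x for x in data if low_fence <= x <= high_fence]
    return {
        "whisker_low": inside[0],    # smallest observation inside the fences
        "q1": q1,
        "median": median(data),
        "q3": q3,
        "whisker_high": inside[-1],  # largest observation inside the fences
        "outliers": [x for x in data if x < low_fence or x > high_fence],
    }


# Hypothetical data for illustration only
stats = box_plot_stats([2, 4, 5, 6, 7, 8, 9, 11, 30])
print(stats)
```

Here Q₁ = 4.5 and Q₃ = 10, so the fences are -3.75 and 18.25; the whiskers stop at 2 and 11, and 30 is drawn as a separate outlier point.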
Method 5: Isolation Forest Method
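A machine learning approach that identifies outliers by how easily they can be isolated.
Concept: The data are split repeatedly on a randomly chosen feature at a random threshold. Outliers are few and lie far from the bulk of the data, so they are separated in fewer splits than typical points. A short average path length across many random trees indicates an outlier.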
Advantages:
- No assumptions about distribution
- Works well with multivariate data
- Handles mixed data types once categorical features are encoded numerically
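The isolation idea can be illustrated with a deliberately simplified sketch in plain Python. It is not scikit-learn's implementation: it omits subsampling and score normalization, and it traces a fresh random path for each point instead of building shared trees. All names are illustrative.

```python
import random


def isolation_depth(point, data, rng, max_depth=12):
    """Count the random splits needed to separate `point` from the rest of `data`."""
    depth = 0
    remaining = data
    while len(remaining) > 1 and depth < max_depth:
        dim = rng.randrange(len(point))  # pick a random feature
        lo = min(row[dim] for row in remaining)
        hi = max(row[dim] for row in remaining)
        if lo == hi:
            break  # simplification: stop if this feature cannot split the rows
        split = rng.uniform(lo, hi)      # pick a random threshold
        # Keep only the rows on the same side of the split as `point`
        remaining = [row for row in remaining
                     if (row[dim] < split) == (point[dim] < split)]
        depth += 1
    return depth


def average_depths(data, n_trees=200, seed=0):
    """Average isolation depth per point; smaller values suggest outliers."""
    rng = random.Random(seed)
    return [sum(isolation_depth(p, data, rng) for _ in range(n_trees)) / n_trees
            for p in data]


# One feature per row; the last row is far from the rest
data = [[10], [12], [14], [15], [16], [17], [18], [19], [20], [100]]
depths = average_depths(data)
most_isolated = data[depths.index(min(depths))]
print(most_isolated)
```

The point with the smallest average depth is the one that random splits isolate most quickly. In practice, a maintained implementation such as scikit-learn's `IsolationForest` is preferable to a hand-rolled version.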
Handling Outliers
Options for Dealing with Outliers
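| Method | When to Use |
|---|---|
| Delete | Confirmed errors (such as data entry mistakes) that cannot be corrected |
| Transform | Skewed data; a log or square root transformation compresses large values |
| Cap/Floor | Extreme values should stay in the data but with limited influence; replace them with the maximum/minimum acceptable values |
| Separate Analysis | Outliers are of interest in their own right; analyze them independently |
| Robust Methods | Outliers are legitimate; use methods less sensitive to them (for example, the median) |
| Investigate | The cause is unknown; determine whether it is a process change or an error |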
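Two of these options, capping and transforming, are simple enough to sketch in a few lines of Python. The function names and the cap limits below are assumptions chosen for illustration; appropriate limits depend on the domain.

```python
import math


def cap(values, low, high):
    """Replace values below `low` with `low` and values above `high` with `high`."""
    return [min(max(x, low), high) for x in values]


def log_transform(values):
    """Natural-log transform; only valid for strictly positive values."""
    if any(x <= 0 for x in values):
        raise ValueError("log transform requires positive values")
    return [math.log(x) for x in values]


data = [10, 12, 14, 15, 16, 17, 18, 19, 20, 100]

capped = cap(data, low=5, high=30)  # limits are illustrative assumptions
print(capped)                       # the 100 becomes 30; other values are unchanged

logged = log_transform(data)
print(round(logged[-1] / logged[0], 2))  # 2.0: on the log scale, 100 is only twice 10
```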
Example: Decision Framework
- Identify outliers using IQR or Z-score
- Verify data entry: check if it’s a real value
- Investigate cause: error, process change, or legitimate extreme?
- Decide action:
  - If error: Delete or correct
  - If process change: Analyze separately
  - If legitimate: Keep or use robust methods
Practical Considerations
When Using IQR Method
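- Works best for roughly symmetric data; on strongly skewed data the symmetric bounds flag many legitimate values in the long tail
- Does not assume normality; for normally distributed data the 1.5×IQR bounds flag about 0.7% of points
- Robust: the quartiles are barely affected by the outliers themselves
- The default rule behind box plots in most statistical software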
When Using Z-Score Method
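- Works best with approximately normally distributed data
- Relies on the mean and standard deviation, which the outliers themselves inflate, so a large outlier can partly mask itself or others
- In small samples |z| cannot exceed (n - 1)/√n, so with n = 10 no value can reach |z| > 3
- Good for statistical analysis of large samples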
When Using Modified Z-Score Method
- Better for skewed data
- More robust to extreme values
- Preferred for non-normal distributions
- Good for data with known outliers
Real-World Examples
Example 1: Sales Data
Daily sales: 1200, 1150, 1300, 1250, 1400, 5000 (unusual sale)
Sorted: 1150, 1200, 1250, 1300, 1400, 5000
Q₁ = 1200, Q₃ = 1400, IQR = 200
Bounds: [1200 - 300, 1400 + 300] = [900, 1700]
5000 is an outlier (likely an exceptional order or a data error)
Example 2: Temperature Data
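Daily temperatures: 68, 70, 72, 71, 69, -50
IQR Method: Q₁ = 68, Q₃ = 71, IQR = 3, so the bounds are [63.5, 75.5]; -50 is clearly an outlier (likely a sensor malfunction)
Action: Check the sensor, then correct or delete the reading once the cause is confirmed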
Summary Statistics with/without Outliers
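Computed for the z-score example data (50, 55, 60, 65, 70, 75, 80, 85, 90, 150), where 150 is the outlier:
| Statistic | With Outlier | Without Outlier | Difference |
|---|---|---|---|
| Mean | 78.0 | 70.0 | 8.0 |
| Median | 72.5 | 70.0 | 2.5 |
| Std Dev | 28.4 | 13.7 | 14.7 |
Note: The outlier shifts the mean and roughly doubles the standard deviation, while the median barely moves.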
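A sensitivity check of this kind is easy to script. The sketch below assumes the z-score example data, treats 150 as the outlier, and uses the sample standard deviation; it prints each statistic with and without that value.

```python
from statistics import mean, median, stdev

data = [50, 55, 60, 65, 70, 75, 80, 85, 90, 150]
trimmed = [x for x in data if x != 150]  # same data with the outlier removed

for name, fn in (("Mean", mean), ("Median", median), ("Std Dev", stdev)):
    with_outlier, without_outlier = fn(data), fn(trimmed)
    diff = with_outlier - without_outlier
    print(f"{name}: {with_outlier:.1f} with, {without_outlier:.1f} without, "
          f"difference {diff:.1f}")
```

Comparing the two columns shows which statistics are robust: the median changes little, while the mean and especially the standard deviation move substantially.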
Best Practices
- Always visualize: Use box plots, scatter plots
- Use multiple methods: Compare IQR, Z-score, modified Z-score
- Document decisions: Note why outliers were kept/removed
- Investigate causes: Understand outlier origin
- Report findings: Include outlier information in analysis
- Consider context: Domain knowledge is crucial
- Be conservative: When uncertain, keep outliers
Caution
- Never automatically delete outliers: they may be important
- Don’t use just one method: triangulate with multiple approaches
- Consider subject matter: domain expertise matters
- Document your process: for reproducibility
- Check sensitivity: how much do results change with/without outliers?
Tools and Software
- Python: scikit-learn (IsolationForest), pandas, the standard-library statistics module
- R: boxplot.stats(), outlier-detection packages such as outliers
- Excel: Conditional formatting with an IQR formula
- Statistical packages: SPSS, SAS