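The dist() function calculates pairwise distances between the observations (rows) of a numeric matrix or data frame. It’s fundamental for clustering, multivariate analysis, and exploratory data analysis. I use dist() constantly when preparing data for clustering, especially hierarchical clustering with hclust(), which takes a dist object directly (kmeans() works on the raw data matrix instead).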

When to Use dist()

You’ll use dist() when:

  • Preparing data for clustering algorithms
  • Measuring similarity between observations
  • Hierarchical clustering (hclust)
  • Dimensionality reduction
  • Distance-based analysis
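
To make a couple of these uses concrete, here is a minimal sketch that feeds one dist object into hierarchical clustering and into classical multidimensional scaling. The choice of the built-in USArrests data, the first 10 rows, and k = 3 groups are my own assumptions for illustration only.

```r
# Small built-in dataset; scale first so no single variable dominates
x <- scale(USArrests[1:10, ])

# Pairwise Euclidean distances between the 10 states
d <- dist(x)

# Hierarchical clustering takes the dist object directly
hc <- hclust(d)
groups <- cutree(hc, k = 3)   # k = 3 is an arbitrary choice for this sketch

# Classical MDS: 2-D coordinates that approximately preserve the distances
coords <- cmdscale(d, k = 2)
```

A dist object for n rows stores n * (n - 1) / 2 values, so the 10 states above give 45 distances.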

Distance Metrics

dist(x, method = "euclidean")

Available methods:

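  • euclidean: Straight-line distance, the square root of the sum of squared differences (the default)
  • manhattan: Sum of absolute differences
  • maximum: Largest absolute difference across coordinates
  • canberra: Weighted sum of absolute differences, intended for non-negative values
  • binary: Jaccard distance, treating non-zero values as "on" and zeros as "off"
  • minkowski: Generalized p-norm distance, with the power set by the p argument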
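
A quick way to get a feel for the methods is to run them on two points whose distances are easy to check by hand. This is a small sketch; the points p and q and the choice of p = 3 for minkowski are assumptions made only for the example. It also includes canberra, which dist() supports as well.

```r
# Two points: the origin and (3, 4)
x <- rbind(p = c(0, 0), q = c(3, 4))

d_euclidean <- as.numeric(dist(x, method = "euclidean"))        # sqrt(3^2 + 4^2) = 5
d_manhattan <- as.numeric(dist(x, method = "manhattan"))        # 3 + 4 = 7
d_maximum   <- as.numeric(dist(x, method = "maximum"))          # max(3, 4) = 4
d_canberra  <- as.numeric(dist(x, method = "canberra"))         # 3/3 + 4/4 = 2
d_binary    <- as.numeric(dist(x, method = "binary"))           # p is all "off", q all "on": 1
d_minkowski <- as.numeric(dist(x, method = "minkowski", p = 3)) # (3^3 + 4^3)^(1/3), about 4.498
```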

Basic Syntax

Using dist() Function

Let’s use dist() to calculate a distance matrix in R:

# Define vectors
a <- c(78, 85, 89, 96, 74, 75)
b <- c(65, 66, 64, 69, 70, 61)
c <- c(23, 25, 26, 19, 21, 18)
d <- c(45, 41, 35, 36, 39, 48)

# Combine the vectors into a matrix, one row per vector
matrix1 <- rbind(a, b, c, d)

# Show matrix
print(matrix1)
  [,1] [,2] [,3] [,4] [,5] [,6]
a   78   85   89   96   74   75
b   65   66   64   69   70   61
c   23   25   26   19   21   18
d   45   41   35   36   39   48

# Calculate distance matrix (Euclidean by default)
dist(matrix1)
          a         b         c
b  45.78209                    
c 150.26976 107.88420          
d 107.21474  63.91400  48.31149

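The output is the distance matrix for the rows of the matrix we declared above. Because distance is symmetric, dist() prints only the lower triangle: each entry is the Euclidean distance between the row and the column that label it, so rows a and b are 45.78 apart.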
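
If you need to look up individual distances or pass a full square matrix to another function, as.matrix() converts the dist object. Here is a short sketch using the same four rows; the names m, d_obj, and full are just names I picked for the example.

```r
# Same four rows as in the example above
m <- rbind(a = c(78, 85, 89, 96, 74, 75),
           b = c(65, 66, 64, 69, 70, 61),
           c = c(23, 25, 26, 19, 21, 18),
           d = c(45, 41, 35, 36, 39, 48))

d_obj <- dist(m)

# Full symmetric matrix with zeros on the diagonal
full <- as.matrix(d_obj)

# Look up one pair by row name
full["a", "b"]   # 45.78209, matching the output above

# Alternatively, ask dist() to print the diagonal and upper triangle too
dist(m, diag = TRUE, upper = TRUE)
```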

Common Mistakes to Avoid

Mistake 1: Calculating distance on the wrong scale

# ❌ PROBLEM - Variables on very different scales dominate distance
data <- data.frame(Age = c(25, 30, 35), Income = c(50000, 60000, 70000))
dist(data)  # Income differences dominate!

# ✅ SOLUTION - Standardize first
data_scaled <- scale(data)
dist(data_scaled)  # Now equal weight to both variables
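
To see how strongly Income dominates, it helps to compare the numbers directly. This is a small sketch on the same Age and Income values; the variable names raw, income_only, and scaled are mine.

```r
data <- data.frame(Age = c(25, 30, 35), Income = c(50000, 60000, 70000))

# Unscaled distance between rows 1 and 2: sqrt(5^2 + 10000^2), about 10000.001
raw <- as.matrix(dist(data))[2, 1]

# Distance using Income alone: exactly 10000, so Age contributes almost nothing
income_only <- as.matrix(dist(data["Income"]))[2, 1]

# After scale(), both columns become (-1, 0, 1), so the distance is sqrt(2)
scaled <- as.matrix(dist(scale(data)))[2, 1]
```

The unscaled distance is practically identical to the Income-only distance, while the scaled version weights a 5-year age gap and a 10,000 income gap equally in this particular dataset.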

Mistake 2: Using the wrong distance method for clustering

# ❌ WRONG - Using binary distance for continuous data
dist(data, method="binary")  # Wrong method!

# ✅ CORRECT - Use euclidean for continuous data
dist(data, method="euclidean")  # Standard for numeric
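
Here is a hedged sketch of why binary is the wrong choice for continuous data: it only looks at whether each value is zero or non-zero, so rows made entirely of non-zero numbers all look identical. The flags matrix is a made-up example of the presence/absence data the method is meant for.

```r
data <- data.frame(Age = c(25, 30, 35), Income = c(50000, 60000, 70000))

# Every value is non-zero, so every row is "on, on" and all distances are 0
d_continuous <- as.numeric(dist(data, method = "binary"))

# Presence/absence data is where binary makes sense (hypothetical flags)
flags <- rbind(u1 = c(1, 1, 0, 0),
               u2 = c(1, 0, 1, 0))

# Positions 1-3 have at least one "on"; positions 2 and 3 differ: 2/3
d_flags <- as.numeric(dist(flags, method = "binary"))
```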

Mistake 3: Including non-numeric columns

# ❌ PROBLEM - dist() can't use character data
df <- data.frame(name=c("A","B"), value=c(1,2))
dist(df)  # Typically warns "NAs introduced by coercion"; don't trust the result

# ✅ CORRECT - Extract only numeric columns
dist(df[, 2, drop=FALSE])  # Works!
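
With more than a couple of columns, picking numeric columns by position gets fragile. One possible approach, sketched below, is to select them by type and keep the identifier as row names so the distance matrix stays labeled. The score column and the third row are additions I made up for the example.

```r
df <- data.frame(name  = c("A", "B", "C"),
                 value = c(1, 2, 4),
                 score = c(10, 20, 40))

# Keep only the numeric columns, whatever their position
num <- df[sapply(df, is.numeric)]

# Use the identifier column as row names so the output is labeled A, B, C
rownames(num) <- df$name

d_num <- dist(num)
d_num
```

In practice you would probably also scale() these columns first, since value and score are on different scales.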

Pro Tips

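  1. Visualize with a heatmap: heatmap(as.matrix(dist(df))), where df holds only numeric columns
  2. Compare methods: Different distance methods can give different clustering results
  3. Scale when units differ: Standardize with scale() whenever variables are measured in different units
  4. Check for NA values: dist() skips missing coordinates pair by pair and rescales the result, so deal with NAs yourself before computing distances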
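
On the NA tip, a small sketch shows what happens when a value is missing. As I understand the documented behavior, dist() does not stop: it drops the affected coordinate for that pair and scales the sum up to compensate, which is easy to miss. The two rows here are invented for the example.

```r
x <- rbind(r1 = c(1, 2, NA),
           r2 = c(4, 6, 3))

# Only the first two coordinates are usable: (3^2 + 4^2) = 25,
# scaled up by 3/2 for the dropped column, so sqrt(37.5), about 6.124
d_na <- as.numeric(dist(x))

# Check for missing values before computing distances
has_na <- anyNA(x)

# One simple option: keep only rows with no missing values
complete <- x[complete.cases(x), , drop = FALSE]
```

Dropping rows is only one option; imputing values may suit your data better.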

See Also