Distance and similarity metrics are fundamental tools in machine learning, clustering, and data analysis. They measure how different or similar pairs of observations are, enabling algorithms to group similar data points and identify patterns. Whether you’re performing clustering, classification, or anomaly detection, understanding different distance metrics is essential.

This guide covers the major distance and similarity metrics, with practical R implementations for each.

Understanding Distance Metrics

Distance metrics quantify the dissimilarity between two points or observations. A good distance metric should satisfy:

  • Non-negativity: Distance ≥ 0
  • Identity of indiscernibles: Distance = 0 if and only if objects are identical
  • Symmetry: Distance(A,B) = Distance(B,A)
  • Triangle inequality: Distance(A,C) ≤ Distance(A,B) + Distance(B,C)
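
A quick numeric check makes these properties concrete. The minimal sketch below verifies all four for Euclidean distance on three arbitrary points (any points would do):

# Verify the four metric properties for Euclidean distance
a <- c(0, 0); b <- c(3, 4); cc <- c(6, 8)
d <- function(x, y) sqrt(sum((x - y)^2))

d(a, b) >= 0                      # non-negativity: TRUE
d(a, a) == 0                      # identity of indiscernibles: TRUE
d(a, b) == d(b, a)                # symmetry: TRUE
d(a, cc) <= d(a, b) + d(b, cc)    # triangle inequality: TRUE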

Vector Distance Metrics

Vector distances measure dissimilarity between numerical vectors.

Euclidean Distance

Euclidean distance is the straight-line distance between two points in multidimensional space.

Formula: √(Σ(x_i - y_i)²)

# Calculate Euclidean distance manually
point1 <- c(1, 2, 3)
point2 <- c(4, 5, 6)

euclidean <- sqrt(sum((point1 - point2)^2))
print(euclidean)
# [1] 5.196152

# Using dist() function for vectors
data <- rbind(point1, point2)
distances <- dist(data, method = "euclidean")
print(distances)

# Euclidean distance matrix for multiple points
points <- data.frame(
  x = c(0, 1, 4),
  y = c(0, 1, 5)
)
dist_matrix <- dist(points, method = "euclidean")
print(dist_matrix)

# The proxy package masks stats::dist and offers many additional methods
library(proxy)
d <- dist(data, method = "euclidean")
print(d)

When to use: Continuous numerical data on comparable scales (the most common choice for spatial analysis)

Manhattan Distance

Manhattan distance (taxicab distance) measures distance along grid lines.

Formula: Σ|x_i - y_i|

# Calculate Manhattan distance manually
manhattan <- sum(abs(point1 - point2))
print(manhattan)
# [1] 9

# Using dist() function
distances <- dist(data, method = "manhattan")
print(distances)

# Manhattan distance matrix
dist_matrix <- dist(points, method = "manhattan")
print(dist_matrix)

# Comparison with Euclidean
euclidean_dist <- dist(data, method = "euclidean")[1]
manhattan_dist <- dist(data, method = "manhattan")[1]
print(c(Euclidean = euclidean_dist, Manhattan = manhattan_dist))

When to use: Grid-based data, urban/navigation distances, or when movement is restricted to axes

Minkowski Distance

Minkowski distance is a generalization of Euclidean and Manhattan distances.

Formula: (Σ|x_i - y_i|^p)^(1/p)

# Minkowski distance with p=3
minkowski_p3 <- sum(abs(point1 - point2)^3)^(1/3)
print(minkowski_p3)

# Using dist() function with p parameter
distances <- dist(data, method = "minkowski", p = 2)  # p=2 gives Euclidean
print(distances)

distances <- dist(data, method = "minkowski", p = 1)  # p=1 gives Manhattan
print(distances)

distances <- dist(data, method = "minkowski", p = 3)  # p=3
print(distances)

# Compare the effect of different p values
p_values <- c(1, 2, 3, 5, 10)
distances <- sapply(p_values, function(p) {
  sum(abs(point1 - point2)^p)^(1/p)
})
print(data.frame(p = p_values, distance = distances))

When to use: A tunable generalization of Euclidean and Manhattan; larger p gives the largest coordinate differences more weight
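
As p grows, Minkowski distance converges to the Chebyshev (maximum) distance, so for large p only the single largest coordinate difference matters. A quick check with the same two points:

# For large p, Minkowski approaches the Chebyshev (max coordinate) distance
chebyshev <- max(abs(point1 - point2))
minkowski_p50 <- sum(abs(point1 - point2)^50)^(1/50)
print(c(Chebyshev = chebyshev, Minkowski_p50 = minkowski_p50))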

String Distance Metrics

String distances measure dissimilarity between text or categorical data.

Hamming Distance

Hamming distance counts the number of positions where characters differ (for equal-length strings).

Formula: Number of positions where x_i ≠ y_i

# Calculate Hamming distance manually
string1 <- "karolin"
string2 <- "kathrin"

hamming <- sum(strsplit(string1, "")[[1]] != strsplit(string2, "")[[1]])
print(hamming)
# [1] 3

# Using stringdist package
library(stringdist)
hamming_dist <- stringdist(string1, string2, method = "hamming")
print(hamming_dist)

# DNA sequence example (Hamming requires equal-length strings;
# stringdist returns Inf for strings of different lengths)
dna1 <- "GATTACA"
dna2 <- "GACTATA"
hamming_dna <- stringdist(dna1, dna2, method = "hamming")
print(hamming_dna)
# [1] 2

# Pairwise Hamming distances
strings <- c("cat", "car", "dog", "cot")
hamming_matrix <- stringdistmatrix(strings, method = "hamming")
print(hamming_matrix)

When to use: Fixed-length strings, DNA/protein sequences, error detection in transmission
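
The same idea applies directly to binary vectors, which is how Hamming distance is used for error detection in transmission: the distance between the sent and received codewords is the number of flipped bits. A minimal sketch:

# Count flipped bits between a sent and a received codeword
sent     <- c(1, 0, 1, 1, 0, 0, 1)
received <- c(1, 0, 0, 1, 0, 1, 1)
bit_errors <- sum(sent != received)
print(bit_errors)
# [1] 2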

Levenshtein Distance

Levenshtein distance (edit distance) counts minimum edits (insertions, deletions, substitutions) needed to transform one string to another.

# Calculate Levenshtein distance using stringdist
library(stringdist)

string1 <- "kitten"
string2 <- "sitting"

levenshtein_dist <- stringdist(string1, string2, method = "lv")
print(levenshtein_dist)
# [1] 3  (substitute k→s, substitute e→i, insert g)

# Pairwise distances for spell-checking
words <- c("algorithm", "algoritm", "algorithmm", "algorith")
target <- "algorithm"
distances <- stringdist(target, words, method = "lv")
print(data.frame(word = words, distance = distances))

# Normalized Levenshtein (0-1 scale): divide by the longer string's length
normalized <- distances / pmax(nchar(target), nchar(words))
print(normalized)

# Example: Finding closest match (autocorrect)
closest_idx <- which.min(distances)
print(paste("Closest match:", words[closest_idx]))

When to use: Spell checking, fuzzy matching, text similarity, typo tolerance
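
For dictionary-style fuzzy matching, stringdist also provides amatch(), an approximate analogue of match() that returns the index of the closest entry within a maximum distance (the maxDist value below is an arbitrary tolerance):

# Approximate dictionary lookup, tolerating up to 2 edits
library(stringdist)
dictionary <- c("algorithm", "logarithm", "rhythm")
idx <- amatch("algoritm", dictionary, method = "lv", maxDist = 2)
print(dictionary[idx])
# [1] "algorithm"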

Similarity Measures

Similarity measures quantify how alike two objects are (higher = more similar).

Cosine Similarity

Cosine similarity measures the cosine of the angle between two vectors, ranging from -1 to 1.

Formula: (x · y) / (||x|| × ||y||)

# Calculate cosine similarity manually
vec1 <- c(1, 2, 3)
vec2 <- c(2, 4, 6)

dot_product <- sum(vec1 * vec2)
magnitude1 <- sqrt(sum(vec1^2))
magnitude2 <- sqrt(sum(vec2^2))
cosine_sim <- dot_product / (magnitude1 * magnitude2)
print(cosine_sim)
# [1] 1  (perfect alignment)

# Using lsa package
library(lsa)
cosine_similarity <- cosine(vec1, vec2)
print(cosine_similarity)

# Text similarity example using document vectors
doc1 <- c(1, 0, 1, 1, 0)  # binary term-occurrence vector
doc2 <- c(1, 1, 1, 0, 0)  # binary term-occurrence vector
sim <- cosine(doc1, doc2)
print(sim)

# Cosine distance = 1 - cosine similarity
cosine_dist <- 1 - cosine_sim
print(cosine_dist)

# Pairwise cosine similarity (lsa's cosine() compares the columns of a matrix)
vectors <- matrix(c(1, 2, 3, 2, 4, 6, 0, 0, 1), ncol = 3, byrow = TRUE)
sim_matrix <- cosine(t(vectors))  # transpose so each row-vector becomes a column
print(sim_matrix)

When to use: Text similarity, recommendation systems, document clustering, high-dimensional data

Jaccard Similarity

Jaccard similarity measures overlap between sets.

Formula: |A ∩ B| / |A ∪ B|

# Calculate Jaccard similarity manually
set1 <- c("apple", "banana", "cherry")
set2 <- c("apple", "cherry", "date")

intersection <- length(intersect(set1, set2))
union <- length(union(set1, set2))
jaccard_sim <- intersection / union
print(jaccard_sim)
# [1] 0.5

# Using the stringdist package; note that its "jaccard" method compares the
# strings' sets of q-grams (single characters by default), not the
# comma-separated items above
library(stringdist)
jaccard_dist <- stringdist("apple,banana,cherry", "apple,cherry,date", method = "jaccard")
print(jaccard_dist)

# Binary vector example (common in recommendation systems)
user1_prefs <- c(1, 0, 1, 1, 0, 1)  # Binary preferences
user2_prefs <- c(1, 1, 1, 0, 0, 1)

both_1 <- sum(user1_prefs & user2_prefs)
either_1 <- sum(user1_prefs | user2_prefs)
jaccard_sim_binary <- both_1 / either_1
print(jaccard_sim_binary)

# Jaccard distance = 1 - Jaccard similarity
jaccard_dist <- 1 - jaccard_sim
print(jaccard_dist)

When to use: Set comparison, binary attributes, recommendation systems, document similarity

Statistical Distance Metrics

Mahalanobis Distance

Mahalanobis distance accounts for correlation between variables.

Formula: √((x - y)^T × Σ^(-1) × (x - y)), where Σ is the covariance matrix

# Calculate Mahalanobis distance; mahalanobis() is in the stats package,
# which is attached by default

data <- data.frame(
  x = c(1, 2, 3, 4, 5),
  y = c(2, 4, 5, 4, 6)
)

# Squared Mahalanobis distance of each observation from the centroid
# (note: mahalanobis() returns the *squared* distance)
center <- colMeans(data)
cov_matrix <- cov(data)
mahal_dist <- mahalanobis(data, center, cov_matrix)
print(mahal_dist)

# Flag outliers: for roughly multivariate normal data, squared Mahalanobis
# distances follow a chi-square distribution with df = number of variables
threshold <- qchisq(0.95, df = ncol(data))
outliers <- which(mahal_dist > threshold)
print(paste("Outliers at indices:", paste(outliers, collapse = ", ")))

# Mahalanobis distance between two specific points, using the data's covariance
point1 <- c(1, 2)
point2 <- c(3, 5)
mahal_between <- sqrt(mahalanobis(point1, center = point2, cov = cov_matrix))
print(mahal_between)

When to use: Correlated variables, outlier detection, multivariate analysis where Euclidean distance is inappropriate

Practical Applications

Clustering Using Distance Metrics

# K-means clustering using Euclidean distance
set.seed(42)  # make the random sample reproducible
data <- rbind(
  data.frame(x = rnorm(10, mean = 0, sd = 1), y = rnorm(10, mean = 0, sd = 1)),
  data.frame(x = rnorm(10, mean = 5, sd = 1), y = rnorm(10, mean = 5, sd = 1))
)

kmeans_result <- kmeans(data, centers = 2)
print(kmeans_result$cluster)

# Hierarchical clustering using different distance metrics
distances <- dist(data, method = "euclidean")
hc <- hclust(distances, method = "complete")
plot(hc)
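
Because hclust() accepts any dist object, swapping the metric is a one-line change, and cross-tabulating the resulting cluster assignments shows how much the choice matters for a given dataset:

# Compare 2-cluster assignments under Euclidean vs Manhattan distance
hc_euc <- hclust(dist(data, method = "euclidean"), method = "complete")
hc_man <- hclust(dist(data, method = "manhattan"), method = "complete")
table(Euclidean = cutree(hc_euc, k = 2), Manhattan = cutree(hc_man, k = 2))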

Anomaly Detection

# Detect outliers using Mahalanobis distance
set.seed(42)  # make the random sample reproducible
data <- data.frame(x = c(rnorm(50, 0, 1), 10), y = c(rnorm(50, 0, 1), 10))
center <- colMeans(data)
cov_mat <- cov(data)
distances <- mahalanobis(data, center, cov_mat)

threshold <- qchisq(0.95, df = 2)
outliers <- which(distances > threshold)
print(paste("Anomalies detected at indices:", paste(outliers, collapse = ", ")))

Recommendation Systems

# Movie recommendation using Cosine similarity
user_vectors <- data.frame(
  Action = c(1, 0, 1),
  Comedy = c(0, 1, 1),
  Drama = c(1, 1, 0)
)
rownames(user_vectors) <- c("User1", "User2", "User3")

# Calculate pairwise cosine similarities
library(lsa)
similarity_matrix <- apply(user_vectors, 1, function(u) {
  apply(user_vectors, 1, function(v) cosine(as.numeric(u), as.numeric(v)))
})
print(similarity_matrix)
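
With the similarity matrix in hand, a recommendation is simply each user's nearest neighbour (excluding the user themselves):

# Find the most similar other user for User1
sims <- similarity_matrix["User1", ]
sims["User1"] <- -Inf  # exclude self-similarity
print(names(which.max(sims)))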

Comparison and Selection Guide

# Summary of when to use each metric

comparison <- data.frame(
  Metric = c("Euclidean", "Manhattan", "Minkowski", "Hamming", "Levenshtein",
             "Cosine", "Jaccard", "Mahalanobis"),
  "Data Type" = c("Continuous", "Continuous", "Continuous", "Categorical", "Text",
                  "Vectors", "Sets", "Continuous"),
  "Best For" = c("Euclidean space", "Grid-based", "Flexible", "Equal-length strings",
                 "Variable-length text", "Angle-based similarity", "Set overlap",
                 "Correlated variables"),
  "Range" = c("[0,∞)", "[0,∞)", "[0,∞)", "[0,n]", "[0,max(n,m)]", "[-1,1]", "[0,1]", "[0,∞)"),
  check.names = FALSE  # keep the spaces in column names
)

print(comparison)

Best Practices

  1. Standardize data - Scale variables to have similar ranges, especially before Euclidean distance (see the sketch after this list)
  2. Choose appropriate metric - Match metric to data type and problem domain
  3. Consider computational cost - Some metrics are more expensive for large datasets
  4. Handle missing values - Decide how to treat NAs before calculating distances
  5. Document metric choice - Explain why a specific distance metric was selected
  6. Validate with domain knowledge - Test if distance metric captures meaningful similarity
  7. Use normalized metrics - When comparing across different datasets
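
For the first point, base R's scale() centres and rescales each column; a toy example shows why this matters before computing distances:

# Without standardization, the large-scale variable dominates the distance
raw <- data.frame(height_cm = c(150, 160, 170), income = c(30000, 60000, 90000))
print(dist(raw))         # driven almost entirely by income
print(dist(scale(raw)))  # both variables contribute comparably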

Common Questions

Q: When should I use Euclidean vs Manhattan distance? A: Use Euclidean for continuous data in unrestricted space. Use Manhattan for grid-based movement or when you want a metric that is less sensitive to large differences in any single dimension.

Q: How do I handle text data with numerical algorithms? A: Convert text to numerical vectors using methods like TF-IDF, word embeddings, or character encodings, then apply distance metrics.
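
As a minimal base-R illustration of the idea, here is a raw term-count version (TF-IDF weighting would refine it):

# Build term-count vectors over a shared vocabulary, then compare with cosine
library(lsa)
doc1_words <- c("the", "cat", "sat", "on", "the", "mat")
doc2_words <- c("the", "dog", "sat", "on", "the", "log")
vocab <- union(doc1_words, doc2_words)
v1 <- as.numeric(table(factor(doc1_words, levels = vocab)))
v2 <- as.numeric(table(factor(doc2_words, levels = vocab)))
print(cosine(v1, v2))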

Q: Which metric is best for clustering? A: Euclidean is most common for K-means. For hierarchical clustering, try several metrics and compare results. For text, use Cosine or Jaccard.

Q: How do I detect outliers using distance metrics? A: Calculate distances to center/centroid, then flag points exceeding a threshold (e.g., Mahalanobis > chi-square critical value).

Q: What’s the computational complexity of different metrics? A: Most vector distances are O(n). String distances vary: Hamming is O(n), Levenshtein is O(n×m) where n,m are string lengths.

Q: Can I combine multiple distance metrics? A: Yes, you can weight metrics differently. For example: combined_distance = 0.6×Euclidean + 0.4×Manhattan
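
Since dist objects behave like numeric vectors, such a weighted combination is straightforward (the 0.6/0.4 weights from the example are illustrative, not a recommendation):

# Weighted combination of two distance matrices
points <- data.frame(x = c(0, 1, 4), y = c(0, 1, 5))
d_euc <- dist(points, method = "euclidean")
d_man <- dist(points, method = "manhattan")
combined <- 0.6 * d_euc + 0.4 * d_man
print(combined)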


Download R Script

Get all code examples from this tutorial: distance-metrics-examples.R