Distance and similarity metrics are fundamental tools in machine learning, clustering, and data analysis. They measure how different or similar pairs of observations are, enabling algorithms to group similar data points and identify patterns. Whether you’re performing clustering, classification, or anomaly detection, understanding different distance metrics is essential.
This guide covers the major distance and similarity metrics, with practical R implementations for each.
Understanding Distance Metrics
Distance metrics quantify the dissimilarity between two points or observations. A good distance metric should satisfy:
- Non-negativity: Distance ≥ 0
- Identity of indiscernibles: Distance = 0 if and only if objects are identical
- Symmetry: Distance(A,B) = Distance(B,A)
- Triangle inequality: Distance(A,C) ≤ Distance(A,B) + Distance(B,C)
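Not every measure in this guide satisfies all four properties; cosine distance (1 − cosine similarity), for example, can violate the triangle inequality. As a quick sanity check, the axioms can be verified numerically, here for Euclidean distance (a minimal sketch with random vectors):
# Numerically spot-check the metric axioms for Euclidean distance
set.seed(1)
p1 <- rnorm(5); p2 <- rnorm(5); p3 <- rnorm(5)
d <- function(x, y) sqrt(sum((x - y)^2))
d(p1, p2) >= 0                     # non-negativity: TRUE
d(p1, p1) == 0                     # identity of indiscernibles: TRUE
d(p1, p2) == d(p2, p1)             # symmetry: TRUE
d(p1, p3) <= d(p1, p2) + d(p2, p3) # triangle inequality: TRUE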
Vector Distance Metrics
Vector distances measure dissimilarity between numerical vectors.
Euclidean Distance
Euclidean distance is the straight-line distance between two points in multidimensional space.
Formula: √(Σ(x_i - y_i)²)
# Calculate Euclidean distance manually
point1 <- c(1, 2, 3)
point2 <- c(4, 5, 6)
euclidean <- sqrt(sum((point1 - point2)^2))
print(euclidean)
# [1] 5.196152
# Using dist() function for vectors
data <- rbind(point1, point2)
distances <- dist(data, method = "euclidean")
print(distances)
# Euclidean distance matrix for multiple points
points <- data.frame(
x = c(0, 1, 4),
y = c(0, 1, 5)
)
dist_matrix <- dist(points, method = "euclidean")
print(dist_matrix)
# The proxy package extends dist() with many more methods
library(proxy) # note: proxy masks stats::dist on load
d <- dist(data, method = "Euclidean")
print(d)
# summary(pr_DB) lists all proximity measures proxy supports
When to use: Continuous numerical data where straight-line distance is meaningful; the most common choice for spatial analysis and general-purpose clustering
Manhattan Distance
Manhattan distance (taxicab distance) measures distance along grid lines.
Formula: Σ|x_i - y_i|
# Calculate Manhattan distance manually
manhattan <- sum(abs(point1 - point2))
print(manhattan)
# [1] 9
# Using dist() function
distances <- dist(data, method = "manhattan")
print(distances)
# Manhattan distance matrix
dist_matrix <- dist(points, method = "manhattan")
print(dist_matrix)
# Comparison with Euclidean
euclidean_dist <- dist(data, method = "euclidean")[1]
manhattan_dist <- dist(data, method = "manhattan")[1]
print(c(Euclidean = euclidean_dist, Manhattan = manhattan_dist))
When to use: Grid-based data, urban/navigation distances, or when movement is restricted to axes
Minkowski Distance
Minkowski distance is a generalization of Euclidean and Manhattan distances.
Formula: (Σ|x_i - y_i|^p)^(1/p)
# Minkowski distance with p=3
minkowski_p3 <- sum(abs(point1 - point2)^3)^(1/3)
print(minkowski_p3)
# Using dist() function with p parameter
distances <- dist(data, method = "minkowski", p = 2) # p=2 gives Euclidean
print(distances)
distances <- dist(data, method = "minkowski", p = 1) # p=1 gives Manhattan
print(distances)
distances <- dist(data, method = "minkowski", p = 3) # p=3
print(distances)
# Visualize effect of different p values
p_values <- c(1, 2, 3, 5, 10)
distances <- sapply(p_values, function(p) {
sum(abs(point1 - point2)^p)^(1/p)
})
print(data.frame(p = p_values, distance = distances))
When to use: Generalized distance metric, adjustable sensitivity to outliers via p parameter
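As p grows, the largest single-coordinate difference increasingly dominates; in the limit p → ∞, Minkowski distance becomes the Chebyshev (maximum) distance. A quick sketch:
# Large p emphasizes the biggest coordinate difference
a <- c(0, 0, 0)
b <- c(1, 1, 10) # one dominant difference
sapply(c(1, 2, 10, 100), function(p) sum(abs(a - b)^p)^(1/p))
max(abs(a - b)) # Chebyshev distance, the p -> Inf limit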
String Distance Metrics
String distances measure dissimilarity between text or categorical data.
Hamming Distance
Hamming distance counts the number of positions where characters differ (for equal-length strings).
Formula: Number of positions where x_i ≠ y_i
# Calculate Hamming distance manually
string1 <- "karolin"
string2 <- "kathrin"
hamming <- sum(strsplit(string1, "")[[1]] != strsplit(string2, "")[[1]])
print(hamming)
# [1] 3
# Using stringdist package
library(stringdist)
hamming_dist <- stringdist(string1, string2, method = "hamming")
print(hamming_dist)
# DNA sequence example (Hamming requires equal-length strings;
# unequal lengths return Inf)
dna1 <- "GATTACA"
dna2 <- "GACTATA"
hamming_dna <- stringdist(dna1, dna2, method = "hamming")
print(hamming_dna)
# [1] 2
# Pairwise Hamming distances
strings <- c("cat", "car", "dog", "cot")
hamming_matrix <- stringdistmatrix(strings, method = "hamming")
print(hamming_matrix)
When to use: Fixed-length strings, DNA/protein sequences, error detection in transmission
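The error-detection use case works the same way on bit vectors; a minimal sketch comparing a transmitted and a received codeword:
# Count bit flips between sent and received binary words
sent     <- c(1, 0, 1, 1, 0, 0, 1)
received <- c(1, 0, 0, 1, 0, 1, 1)
bit_errors <- sum(sent != received)
print(bit_errors)
# [1] 2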
Levenshtein Distance
Levenshtein distance (edit distance) counts minimum edits (insertions, deletions, substitutions) needed to transform one string to another.
# Calculate Levenshtein distance using stringdist
library(stringdist)
string1 <- "kitten"
string2 <- "sitting"
levenshtein_dist <- stringdist(string1, string2, method = "lv")
print(levenshtein_dist)
# [1] 3 (substitute k→s, substitute e→i, insert g)
# Pairwise distances for spell-checking
words <- c("algorithm", "algoritm", "algorithmm", "algorith")
target <- "algorithm"
distances <- stringdist(target, words, method = "lv")
print(data.frame(word = words, distance = distances))
# Normalized Levenshtein (0-1 scale): divide by the longer string in each pair
normalized <- distances / pmax(nchar(target), nchar(words))
print(normalized)
# Example: Finding closest match (autocorrect)
closest_idx <- which.min(distances)
print(paste("Closest match:", words[closest_idx]))
When to use: Spell checking, fuzzy matching, text similarity, typo tolerance
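For fuzzy lookup against a dictionary, the stringdist package also provides amatch(), which returns the index of the closest entry within a maximum edit distance (a short sketch):
# Fuzzy dictionary lookup: nearest entry within maxDist edits
library(stringdist)
dictionary <- c("algorithm", "logarithm", "rhythm")
amatch("algoritm", dictionary, method = "lv", maxDist = 2)
# [1] 1
dictionary[amatch("logaritm", dictionary, method = "lv", maxDist = 2)]
# [1] "logarithm"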
Similarity Measures
Similarity measures quantify how alike two objects are (higher = more similar).
Cosine Similarity
Cosine similarity measures the angle between vectors, ranging from -1 to 1.
Formula: (x · y) / (||x|| × ||y||)
# Calculate cosine similarity manually
vec1 <- c(1, 2, 3)
vec2 <- c(2, 4, 6)
dot_product <- sum(vec1 * vec2)
magnitude1 <- sqrt(sum(vec1^2))
magnitude2 <- sqrt(sum(vec2^2))
cosine_sim <- dot_product / (magnitude1 * magnitude2)
print(cosine_sim)
# [1] 1 (perfect alignment)
# Using lsa package
library(lsa)
cosine_similarity <- cosine(vec1, vec2)
print(cosine_similarity)
# Text similarity example using document vectors
doc1 <- c(1, 0, 1, 1, 0) # binary term-presence vector
doc2 <- c(1, 1, 1, 0, 0) # binary term-presence vector
sim <- cosine(doc1, doc2)
print(sim)
# Cosine distance = 1 - cosine similarity
cosine_dist <- 1 - cosine_sim
print(cosine_dist)
# Pairwise cosine similarity (lsa's cosine() works column-wise on a matrix)
library(lsa)
vectors <- matrix(c(1, 2, 3, 2, 4, 6, 0, 0, 1), ncol = 3, byrow = TRUE)
sim_matrix <- cosine(t(vectors)) # transpose so each observation is a column
print(sim_matrix)
When to use: Text similarity, recommendation systems, document clustering, high-dimensional data
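One practical consequence worth noting: cosine similarity ignores vector magnitude, so scaling a vector leaves the similarity unchanged while Euclidean distance grows (a quick sketch):
# Cosine is scale-invariant; Euclidean distance is not
library(lsa)
v <- c(1, 2, 3)
w <- 10 * v # same direction, ten times the magnitude
cosine(v, w)         # 1: identical direction
sqrt(sum((v - w)^2)) # large Euclidean distance (~33.7)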
Jaccard Similarity
Jaccard similarity measures overlap between sets.
Formula: |A ∩ B| / |A ∪ B|
# Calculate Jaccard similarity manually
set1 <- c("apple", "banana", "cherry")
set2 <- c("apple", "cherry", "date")
intersection <- length(intersect(set1, set2))
union <- length(union(set1, set2))
jaccard_sim <- intersection / union
print(jaccard_sim)
# [1] 0.5
# Using stringdist: its "jaccard" method compares sets of character q-grams,
# not comma-separated tokens
library(stringdist)
jaccard_dist <- stringdist("apple", "apples", method = "jaccard", q = 2)
print(jaccard_dist)
# [1] 0.2 (shared 2-grams: ap, pp, pl, le; "apples" adds es)
# Binary vector example (common in recommendation systems)
user1_prefs <- c(1, 0, 1, 1, 0, 1) # Binary preferences
user2_prefs <- c(1, 1, 1, 0, 0, 1)
both_1 <- sum(user1_prefs & user2_prefs)
either_1 <- sum(user1_prefs | user2_prefs)
jaccard_sim_binary <- both_1 / either_1
print(jaccard_sim_binary)
# Jaccard distance = 1 - Jaccard similarity
jaccard_dist <- 1 - jaccard_sim
print(jaccard_dist)
When to use: Set comparison, binary attributes, recommendation systems, document similarity
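The same manual computation extends to word-level document similarity after tokenizing (a minimal sketch using base R):
# Jaccard similarity over the word sets of two sentences
tokens1 <- unique(strsplit(tolower("The cat sat on the mat"), " ")[[1]])
tokens2 <- unique(strsplit(tolower("The dog sat on the log"), " ")[[1]])
length(intersect(tokens1, tokens2)) / length(union(tokens1, tokens2))
# [1] 0.4285714 (shares "the", "sat", "on" out of 7 distinct words)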
Statistical Distance Metrics
Mahalanobis Distance
Mahalanobis distance accounts for correlation between variables.
Formula: √((x - y)^T × Σ^(-1) × (x - y))
# Calculate Mahalanobis distance (mahalanobis() ships with base R's stats package)
data <- data.frame(
x = c(1, 2, 3, 4, 5),
y = c(2, 4, 5, 4, 6)
)
# Calculate Mahalanobis distance from centroid
center <- colMeans(data)
cov_matrix <- cov(data)
mahal_dist <- mahalanobis(data, center, cov_matrix) # returns SQUARED distances
print(mahal_dist)
# Identify outliers using Mahalanobis distance
threshold <- qchisq(0.95, df = ncol(data))
outliers <- which(mahal_dist > threshold)
print(paste("Outliers at indices:", paste(outliers, collapse = ", ")))
# Mahalanobis distance between two specific points (per the formula above)
point1 <- c(1, 2)
point2 <- c(3, 5)
diff <- point1 - point2
mahal_between <- sqrt(as.numeric(t(diff) %*% solve(cov_matrix) %*% diff))
print(mahal_between)
When to use: Correlated variables, outlier detection, multivariate analysis where Euclidean distance is inappropriate
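To see why correlation matters, compare a point lying along the data's correlation axis with one lying off it: both are about equally far from the center in Euclidean terms, but Mahalanobis flags the off-axis point as far more unusual (a sketch with simulated correlated data):
# Mahalanobis vs Euclidean on correlated data
set.seed(123)
x <- rnorm(200)
y <- x + rnorm(200, sd = 0.3) # y strongly correlated with x
dat <- cbind(x, y)
ctr <- colMeans(dat)
S <- cov(dat)
p_along <- c(2, 2)  # along the correlation direction
p_off   <- c(2, -2) # against the correlation direction
sqrt(sum((p_along - ctr)^2)) # Euclidean from center: ~2.8
sqrt(sum((p_off - ctr)^2))   # Euclidean from center: ~2.8
mahalanobis(rbind(p_along, p_off), ctr, S) # squared; far larger for p_off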
Practical Applications
Clustering Using Distance Metrics
# K-means clustering using Euclidean distance
set.seed(42) # make the simulated data reproducible
data <- rbind(
data.frame(x = rnorm(10, mean = 0, sd = 1), y = rnorm(10, mean = 0, sd = 1)),
data.frame(x = rnorm(10, mean = 5, sd = 1), y = rnorm(10, mean = 5, sd = 1))
)
kmeans_result <- kmeans(data, centers = 2)
print(kmeans_result$cluster)
# Hierarchical clustering using different distance metrics
distances <- dist(data, method = "euclidean")
hc <- hclust(distances, method = "complete")
plot(hc)
Anomaly Detection
# Detect outliers using Mahalanobis distance
set.seed(42) # reproducible example
data <- data.frame(x = c(rnorm(50, 0, 1), 10), y = c(rnorm(50, 0, 1), 10))
center <- colMeans(data)
cov_mat <- cov(data)
distances <- mahalanobis(data, center, cov_mat)
threshold <- qchisq(0.95, df = 2)
outliers <- which(distances > threshold)
print(paste("Anomalies detected at indices:", paste(outliers, collapse = ", ")))
Recommendation Systems
# Movie recommendation using Cosine similarity
user_vectors <- data.frame(
Action = c(1, 0, 1),
Comedy = c(0, 1, 1),
Drama = c(1, 1, 0)
)
rownames(user_vectors) <- c("User1", "User2", "User3")
# Calculate pairwise cosine similarities
library(lsa)
similarity_matrix <- apply(user_vectors, 1, function(u) {
apply(user_vectors, 1, function(v) cosine(as.numeric(u), as.numeric(v)))
})
print(similarity_matrix)
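To turn the similarity matrix into an actual recommendation, a common next step is to find each user's most similar neighbour and suggest genres the neighbour likes that the user has not rated (a minimal sketch continuing the example above):
# Recommend for User1 based on the most similar other user
sims <- similarity_matrix["User1", ]
sims["User1"] <- NA # exclude self-similarity
neighbour <- names(which.max(sims))
candidates <- names(user_vectors)[user_vectors[neighbour, ] == 1 &
                                  user_vectors["User1", ] == 0]
print(candidates)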
Comparison and Selection Guide
# Summary of when to use each metric
comparison <- data.frame(
  Metric = c("Euclidean", "Manhattan", "Minkowski", "Hamming", "Levenshtein",
             "Cosine", "Jaccard", "Mahalanobis"),
  "Data Type" = c("Continuous", "Continuous", "Continuous", "Categorical", "Text",
                  "Vectors", "Sets", "Continuous"),
  "Best For" = c("Euclidean space", "Grid-based", "Flexible", "Equal-length strings",
                 "Variable-length text", "Angle-based similarity", "Set overlap",
                 "Correlated variables"),
  "Range" = c("[0,∞)", "[0,∞)", "[0,∞)", "[0,n]", "[0,max(n,m)]", "[-1,1]", "[0,1]", "[0,∞)"),
  check.names = FALSE # keep spaces in the column names
)
print(comparison)
Best Practices
- Standardize data - Scale variables to have similar ranges, especially for Euclidean distance (see the sketch after this list)
- Choose appropriate metric - Match metric to data type and problem domain
- Consider computational cost - Some metrics are more expensive for large datasets
- Handle missing values - Decide how to treat NAs before calculating distances
- Document metric choice - Explain why a specific distance metric was selected
- Validate with domain knowledge - Test if distance metric captures meaningful similarity
- Use normalized metrics - When comparing across different datasets
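Standardization (the first point above) matters because a variable with a large raw range will dominate Euclidean distance; a minimal sketch with hypothetical customer data:
# Scaling before distance calculation: income dominates without it
customers <- data.frame(age = c(25, 45, 35), income = c(30000, 90000, 50000))
dist(customers)        # differences driven almost entirely by income
dist(scale(customers)) # both variables contribute comparably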
Common Questions
Q: When should I use Euclidean vs Manhattan distance? A: Use Euclidean for continuous data in unrestricted space. Use Manhattan for grid-based movement, or when you want each dimension's difference to contribute independently so a single large deviation has less influence.
Q: How do I handle text data with numerical algorithms? A: Convert text to numerical vectors using methods like TF-IDF, word embeddings, or character encodings, then apply distance metrics.
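A minimal term-frequency sketch of that conversion (plain counts rather than TF-IDF, using only base R):
# Build term-count vectors over a shared vocabulary
docs <- c("the cat sat", "the cat ran")
tokens <- strsplit(docs, " ")
vocab <- unique(unlist(tokens))
tf <- t(sapply(tokens, function(tok) table(factor(tok, levels = vocab))))
print(tf) # one row per document; any vector metric now applies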
Q: Which metric is best for clustering? A: Euclidean is most common for K-means. For hierarchical clustering, try several metrics and compare results. For text, use Cosine or Jaccard.
Q: How do I detect outliers using distance metrics? A: Calculate distances to center/centroid, then flag points exceeding a threshold (e.g., Mahalanobis > chi-square critical value).
Q: What’s the computational complexity of different metrics? A: Most vector distances are O(n). String distances vary: Hamming is O(n), Levenshtein is O(n×m) where n,m are string lengths.
Q: Can I combine multiple distance metrics? A: Yes, you can weight metrics differently. For example: combined_distance = 0.6×Euclidean + 0.4×Manhattan
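In R, that weighted blend can be computed directly on dist objects, since both matrices describe the same observations in the same order (a sketch reusing the points data frame from earlier):
# Weighted combination of two distance matrices
d_euc <- dist(points, method = "euclidean")
d_man <- dist(points, method = "manhattan")
combined <- 0.6 * d_euc + 0.4 * d_man
print(combined)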
Related Topics
Build on distance metrics for advanced analysis:
- R Data Transformation - Complete Guide - Prepare data for distance calculations
- R Data Visualization - Complete Guide - Visualize distance relationships
Download R Script
Get all code examples from this tutorial: distance-metrics-examples.R