Regular expressions (regex) are one of those skills that seem intimidating at first, but once they click, they’ll save you hours of tedious string manipulation. I can’t tell you how many times regex has pulled me out of a tight spot when cleaning messy data.
Whether you’re validating email addresses, extracting patterns from text, parsing log files, or transforming data formats, regular expressions are essential. In my experience, most R programmers underutilize regex and end up writing complicated loops when a single regex pattern would do it in one line.
This comprehensive guide covers everything from regex basics to advanced patterns, with practical examples tested in R 4.0+. I’ll show you not just how regex works, but when and why to use each technique. You’ll also see common mistakes I’ve made and learned from.
Regular Expression Basics
A regular expression (regex) is a pattern that describes text. It consists of literal characters and metacharacters that have special meanings.
Common Metacharacters
# Basic metacharacters
# . - Any character except newline
# * - Zero or more of preceding element
# + - One or more of preceding element
# ? - Zero or one of preceding element
# ^ - Start of string
# $ - End of string
# | - OR operator
# [] - Character class
# () - Grouping
# \ - Escape character
# Examples
text <- "Hello123World456"
# . matches any character
grepl("H.llo", text) # [1] TRUE
# * matches zero or more
grepl("Hel*o", text) # [1] TRUE
# + matches one or more
grepl("Hello[0-9]+", text) # [1] TRUE
# ? matches zero or one
grepl("Hel?lo", text) # [1] TRUE
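The examples above cover ., *, + and ?. A minimal sketch of the remaining metacharacters from the list (anchors, alternation, character classes, grouping and escaping) might look like this. The files vector is invented purely for illustration:

```r
# Hypothetical file names, made up for this illustration
files <- c("report.csv", "report_csv", "data.txt", "csv_notes.md")

# ^ anchors the match to the start of the string
starts_with_report <- grepl("^report", files)

# $ anchors to the end; \\. is an escaped (literal) dot
ends_with_csv <- grepl("\\.csv$", files)

# | inside () matches either alternative
csv_or_txt <- grepl("\\.(csv|txt)$", files)

# [] matches any single character listed in the class
has_separator <- grepl("[._]csv", files)

print(starts_with_report)  # [1]  TRUE  TRUE FALSE FALSE
print(ends_with_csv)       # [1]  TRUE FALSE FALSE FALSE
print(csv_or_txt)          # [1]  TRUE FALSE  TRUE FALSE
print(has_separator)       # [1]  TRUE  TRUE FALSE FALSE
```

Note how "report_csv" passes the unanchored character-class test but fails the escaped-dot test: the anchors and the escape are what make the pattern strict.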
Pattern Matching Functions
grep() - Find Matching Strings
# Find indices of matching strings
fruits <- c("apple", "apricot", "banana", "cherry", "avocado")
# Find elements matching pattern
indices <- grep("^a", fruits) # Elements starting with 'a'
print(indices)
# [1] 1 2 5
# Get matching values
matches <- fruits[grep("^a", fruits)]
print(matches)
# [1] "apple" "apricot" "avocado"
# Case insensitive matching
grep("APPLE", fruits, ignore.case = TRUE)
# Pattern with alternation
grep("ap|ch", fruits) # Contains 'ap' or 'ch'
# Invert match (find non-matching)
grep("^a", fruits, invert = TRUE)
# [1] 3 4
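If you want the matching strings rather than their positions, grep() also accepts value = TRUE, which saves the extra subsetting step. A short sketch using the same fruits vector:

```r
fruits <- c("apple", "apricot", "banana", "cherry", "avocado")

# value = TRUE returns the matching elements instead of their indices
a_fruits <- grep("^a", fruits, value = TRUE)
print(a_fruits)
# [1] "apple"   "apricot" "avocado"

# value = TRUE combines with invert = TRUE to return the non-matching elements
not_a <- grep("^a", fruits, value = TRUE, invert = TRUE)
print(not_a)
# [1] "banana" "cherry"
```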
grepl() - Logical Pattern Test
# Test if each element matches pattern (returns logical vector)
emails <- c("[email protected]", "invalid-email", "[email protected]")
# Check if valid email format (simplified)
valid_emails <- grepl("^[a-zA-Z0-9]+@[a-zA-Z0-9]+\\.[a-zA-Z]+$", emails)
print(valid_emails)
# [1] TRUE FALSE TRUE
# Filter valid emails
emails[valid_emails]
# [1] "[email protected]" "[email protected]"
# Test whether each element contains a vowel (case-insensitive)
has_vowel <- grepl("[aeiou]", emails, ignore.case = TRUE)
print(has_vowel)
Text Extraction
sub() and gsub() - Pattern Replacement
# sub() - Replace first occurrence
text <- "apple apple apple"
result_sub <- sub("apple", "orange", text)
print(result_sub)
# [1] "orange apple apple"
# gsub() - Replace all occurrences
result_gsub <- gsub("apple", "orange", text)
print(result_gsub)
# [1] "orange orange orange"
# Remove pattern (replace with empty string)
phone <- "123-456-7890"
clean_phone <- gsub("-", "", phone)
print(clean_phone)
# [1] "1234567890"
# Use capture groups for rearrangement
text <- "2026-02-09"
reformatted <- gsub("(\\d{4})-(\\d{2})-(\\d{2})", "\\3/\\2/\\1", text)
print(reformatted)
# [1] "09/02/2026"
strsplit() - Split by Pattern
# Split string by pattern
text <- "apple,banana,cherry"
parts <- strsplit(text, ",")
print(parts)
# [[1]]
# [1] "apple" "banana" "cherry"
# Extract vector instead of list
parts_vec <- unlist(strsplit(text, ","))
print(parts_vec)
# Split by whitespace
sentence <- "The quick brown fox jumps"
words <- strsplit(sentence, "\\s+")[[1]]
print(words)
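# Split on a literal separator (fixed = TRUE skips regex interpretation)
# Note: base strsplit() has no argument for limiting the number of splits
text_long <- "one,two,three,four,five"
parts_fixed <- strsplit(text_long, ",", fixed = TRUE)[[1]]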
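Base strsplit() always splits at every match, so limiting the number of pieces takes a small workaround. One possible approach, sketched here under the assumption that you only want to split at the first separator, is to locate it with regexpr() and cut the string with substr(). (In stringr, str_split() has an n argument for this.)

```r
text_long <- "one,two,three,four,five"

# Position of the first comma (fixed = TRUE treats "," literally)
first_comma <- as.integer(regexpr(",", text_long, fixed = TRUE))

# Everything before and everything after that first comma
head_part <- substr(text_long, 1, first_comma - 1)
tail_part <- substr(text_long, first_comma + 1, nchar(text_long))

parts_limited <- c(head_part, tail_part)
print(parts_limited)
# [1] "one"                 "two,three,four,five"
```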
regmatches() and gregexpr()
# Extract all matches
text <- "ABC 123 DEF 456 GHI"
# gregexpr finds positions of all matches
positions <- gregexpr("[0-9]+", text)
print(positions)
# regmatches extracts the matched substrings
matches <- regmatches(text, gregexpr("[0-9]+", text))
print(matches)
# [[1]]
# [1] "123" "456"
# Simplify to vector
numbers <- unlist(regmatches(text, gregexpr("[0-9]+", text)))
print(numbers)
# [1] "123" "456"
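gregexpr() returns whole matches only. When you also need the capture groups in base R, regexec() pairs with regmatches() in the same way. A small sketch, with a sample string made up for the example:

```r
# Sample text invented for this illustration
date_text <- "Released on 2026-02-09"

# regexec() records the position of the full match and of each capture group
m <- regexec("([0-9]{4})-([0-9]{2})-([0-9]{2})", date_text)

# Element 1 is the full match, elements 2..4 are the groups
parts <- regmatches(date_text, m)[[1]]
print(parts)
# [1] "2026-02-09" "2026"       "02"         "09"

year <- as.integer(parts[2])
print(year)
# [1] 2026
```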
stringr Package (tidyverse)
The stringr package provides cleaner, more consistent functions for regex operations.
Basic stringr Functions
library(stringr)
text <- "The year is 2024"
# str_detect - Test for pattern presence
str_detect(text, "[0-9]+") # [1] TRUE
# str_extract - Extract first match
str_extract(text, "[0-9]+") # [1] "2024"
# str_extract_all - Extract all matches
str_extract_all(text, "[0-9]")
# [[1]]
# [1] "2" "0" "2" "4"
# str_replace - Replace first occurrence
str_replace(text, "[0-9]+", "XXXX")
# [1] "The year is XXXX"
# str_replace_all - Replace all occurrences
text_repeated <- "2 and 2 and 2"
str_replace_all(text_repeated, "2", "X")
# [1] "X and X and X"
# str_match - Extract with groups
pattern <- "(\\d{4})-(\\d{2})-(\\d{2})"
date_text <- "2026-02-09"
str_match(date_text, pattern)
# [,1] [,2] [,3] [,4]
# [1,] "2026-02-09" "2026" "02" "09"
Advanced stringr Operations
# str_split - Split strings
str_split("apple,banana,cherry", ",")
# str_trim - Remove whitespace
str_trim(" hello world ") # [1] "hello world"
# str_sub - Extract substring
str_sub("hello", 1, 3) # [1] "hel"
# str_to_upper/lower/title
str_to_upper("hello") # [1] "HELLO"
str_to_title("hello world") # [1] "Hello World"
Character Classes and Quantifiers
Character Classes
# Common character classes
# [a-z] - Lowercase letters
# [A-Z] - Uppercase letters
# [0-9] - Digits
# [a-zA-Z0-9] - Alphanumeric
# \d - Digit (equivalent to [0-9])
# \w - Word character (alphanumeric + underscore)
# \s - Whitespace
# [^abc] - NOT a, b, or c
# Examples
grepl("[0-9]", "abc123") # [1] TRUE (contains digit)
grepl("[A-Z]", "hello") # [1] FALSE (no uppercase)
grepl("\\d+", "abc123def") # [1] TRUE (contains digits)
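R also understands the POSIX bracket classes such as [[:digit:]], [[:alpha:]], [[:space:]] and [[:punct:]], which can read more clearly than hand-written ranges. A brief sketch, using sample strings invented for the example, that also shows a negated class in action:

```r
# Sample strings invented for this illustration
samples <- c("abc", "abc123", "hello world", "wow!")

has_digit    <- grepl("[[:digit:]]", samples)     # any digit
has_space    <- grepl("[[:space:]]", samples)     # any whitespace
has_punct    <- grepl("[[:punct:]]", samples)     # any punctuation
only_letters <- grepl("^[[:alpha:]]+$", samples)  # letters from start to end
no_digits    <- grepl("^[^0-9]+$", samples)       # negated class: no digit anywhere

print(has_digit)     # [1] FALSE  TRUE FALSE FALSE
print(has_space)     # [1] FALSE FALSE  TRUE FALSE
print(has_punct)     # [1] FALSE FALSE FALSE  TRUE
print(only_letters)  # [1]  TRUE FALSE FALSE FALSE
print(no_digits)     # [1]  TRUE FALSE  TRUE  TRUE
```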
Quantifiers
# Quantifiers control repetition
# {n} - Exactly n times
# {n,} - At least n times
# {n,m} - Between n and m times
# * - 0 or more (greedy)
# + - 1 or more (greedy)
# ? - 0 or 1 (greedy)
# *? - 0 or more (non-greedy)
# Examples
grepl("a{3}", "aaa") # [1] TRUE
grepl("a{2,4}", "aaa") # [1] TRUE
grepl("a+", "aaa") # [1] TRUE
grepl("a?b", "b") # [1] TRUE
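The difference between greedy and non-greedy quantifiers is easiest to see by extracting what each one actually matches. A minimal sketch, using a made-up log line and perl = TRUE so the lazy quantifier follows Perl semantics:

```r
# Log line invented for this illustration
log_line <- "[INFO] started [WARN] disk low"

# Greedy: .+ grabs as much as possible, up to the LAST closing bracket
greedy <- regmatches(log_line, regexpr("\\[.+\\]", log_line, perl = TRUE))
print(greedy)
# [1] "[INFO] started [WARN]"

# Non-greedy: .+? stops at the FIRST closing bracket
lazy <- regmatches(log_line, regexpr("\\[.+?\\]", log_line, perl = TRUE))
print(lazy)
# [1] "[INFO]"

# With gregexpr(), the lazy version picks up each bracketed tag separately
all_tags <- regmatches(log_line, gregexpr("\\[.+?\\]", log_line, perl = TRUE))[[1]]
print(all_tags)
# [1] "[INFO]" "[WARN]"
```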
Practical Text Processing Examples
Email Validation
# Validate email format
validate_email <- function(email) {
pattern <- "^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$"
grepl(pattern, email)
}
test_emails <- c(
"[email protected]",
"invalid.email@",
"[email protected]",
"no-at-sign.com"
)
validate_email(test_emails) # grepl() is vectorized, so sapply() is not needed
# [1] TRUE FALSE TRUE FALSE
Extract Data
# Extract phone numbers from text
text <- "Call me at 555-123-4567 or 555.987.6543"
# Using gregexpr and regmatches
numbers <- unlist(regmatches(text, gregexpr("[0-9]{3}[-.]?[0-9]{3}[-.]?[0-9]{4}", text)))
print(numbers)
# [1] "555-123-4567" "555.987.6543"
# Clean format
clean_numbers <- gsub("[^0-9]", "", numbers)
print(clean_numbers)
Data Cleaning
# Clean and standardize data
data <- c("$1,234.56", "$2,500.00", "$99.99")
# Remove currency and commas, convert to numeric
cleaned <- as.numeric(gsub("[$,]", "", data))
print(cleaned)
# [1] 1234.56 2500.00 99.99
# Standardize phone format
phones <- c("5551234567", "(555) 123-4567", "555.123.4567")
standardize_phone <- function(phone) {
digits <- gsub("[^0-9]", "", phone)
paste0("(", substr(digits, 1, 3), ") ",
substr(digits, 4, 6), "-",
substr(digits, 7, 10))
}
sapply(phones, standardize_phone)
Text Extraction from Structured Data
# Extract age and name from text
entries <- c(
"John (age 25)",
"Mary (age 30)",
"Robert (age 45)"
)
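library(stringr)
# Extract names (person_names avoids masking base R's names() function)
person_names <- str_trim(gsub("\\(age.*", "", entries))
print(person_names)
# Extract ages: the lookbehind (?<=age ) matches digits preceded by "age "
ages <- as.numeric(str_extract(entries, "(?<=age )\\d+"))
print(ages)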
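The same extraction can be done without stringr. In base R, lookbehind only works with perl = TRUE. A capture group with sub() is an alternative that avoids lookarounds entirely. A rough sketch of both:

```r
entries <- c("John (age 25)", "Mary (age 30)", "Robert (age 45)")

# Lookbehind needs perl = TRUE in base R.
# Note: regmatches() with regexpr() silently drops elements that have no match.
ages <- as.numeric(
  regmatches(entries, regexpr("(?<=age )\\d+", entries, perl = TRUE))
)
print(ages)
# [1] 25 30 45

# Alternative: capture the digits and keep only the group
ages_sub <- as.numeric(sub(".*\\(age ([0-9]+)\\).*", "\\1", entries))
print(ages_sub)
# [1] 25 30 45

# Names: drop everything from "(age" onward, then trim whitespace
person_names <- trimws(sub("\\(age.*", "", entries))
print(person_names)
# [1] "John"   "Mary"   "Robert"
```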
Best Practices
- Use raw strings - Use r"{...}" for complex patterns to avoid escaping issues
- Test patterns incrementally - Start simple and build complexity
- Use anchors - ^ and $ to match start/end of strings
- Escape special characters - Use \\ for literal special characters
- Use non-capturing groups - (?:...) when you don’t need to extract
- Be specific - More specific patterns are usually more efficient
- Use stringr - Cleaner API than base R functions
- Document patterns - Complex patterns benefit from comments explaining intent
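Two of these practices are easy to show together: raw strings (available in R 4.0+) and documenting a pattern piece by piece. The sketch below is only one way to lay it out. The phone_pattern name and the sample numbers are invented for the example:

```r
# Raw string: backslashes are written once instead of doubled
date_raw     <- r"(\d{4}-\d{2}-\d{2})"
date_escaped <- "\\d{4}-\\d{2}-\\d{2}"
same_pattern <- identical(date_raw, date_escaped)
print(same_pattern)
# [1] TRUE

# Build a longer pattern from small commented pieces
phone_pattern <- paste0(
  "^",
  "\\(?[0-9]{3}\\)?",  # area code, parentheses optional
  "[-. ]?",            # optional separator
  "[0-9]{3}",          # exchange
  "[-. ]?",            # optional separator
  "[0-9]{4}",          # line number
  "$"
)

phone_ok <- grepl(phone_pattern, c("(555) 123-4567", "555.123.4567", "55-1234"))
print(phone_ok)
# [1]  TRUE  TRUE FALSE
```

Splitting the pattern this way also makes it easy to test incrementally: you can run grepl() on each piece before pasting them together.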
Troubleshooting Regex Issues
Issue: Pattern not matching when it should
Problem: Your regex pattern looks correct but grep() or grepl() returns no matches.
Causes & Solutions:
- Backslash escaping - In R strings, \d needs to be \\d. Use raw strings to avoid confusion: r"{\d}".
- Special characters - Remember that . matches ANY character. Use \\. to match a literal dot.
- Anchors - Use ^ for start of string and $ for end. Without them, patterns match anywhere.
Example:
# ❌ WRONG - Dot matches any character
grepl("1.0", "120") # [1] TRUE (matches because . matches anything)
# ✅ RIGHT - Escape the dot
grepl("1\\.0", "1.0") # [1] TRUE
Issue: Performance slowdown with complex patterns
Problem: grepl() or gsub() running very slowly on large text.
Solutions:
- Use fixed = TRUE if you’re matching literal strings (no patterns)
- Simplify patterns - Greedy quantifiers (.*) are slower than specific matches
- Use stringr package - Often more optimized than base R
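As a rough sketch of the fixed = TRUE suggestion: it tells R to treat the pattern as plain text, which skips the regex engine and also removes the need to escape metacharacters. The versions vector is made up for the example:

```r
# Sample values invented for this illustration
versions <- c("1.0", "120", "v1.0-beta")

# As a regex, the dot matches any character, so "120" matches too
regex_hits <- grepl("1.0", versions)
print(regex_hits)
# [1] TRUE TRUE TRUE

# As a fixed string, only a literal "1.0" matches
fixed_hits <- grepl("1.0", versions, fixed = TRUE)
print(fixed_hits)
# [1]  TRUE FALSE  TRUE

# The same argument works for replacement
no_dots <- gsub(".", "", "a.b.c", fixed = TRUE)
print(no_dots)
# [1] "abc"
```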
Issue: Understanding replacement references
Problem: Getting the wrong output when using capture groups in gsub().
Troubleshooting:
# Capture groups use \\1, \\2, etc. (not \\0)
text <- "Smith, John"
# ❌ WRONG - \\0 references whole match
gsub("(\\w+), (\\w+)", "\\0 \\2", text)
# ✅ RIGHT - Swap using \\1 and \\2
gsub("(\\w+), (\\w+)", "\\2 \\1", text)
# [1] "John Smith"
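Backreferences can also be combined with case conversion. With perl = TRUE, base R's gsub() accepts \\U and \\L in the replacement to upper- or lower-case what follows. A small sketch, starting from a lowercase name invented for the example:

```r
# Lowercase input invented for this illustration
full_name <- "smith, john"

# Step 1: swap the two captured words
swapped <- gsub("(\\w+), (\\w+)", "\\2 \\1", full_name)
print(swapped)
# [1] "john smith"

# Step 2: upper-case the first letter of each word (requires perl = TRUE)
capitalized <- gsub("\\b(\\w)", "\\U\\1", swapped, perl = TRUE)
print(capitalized)
# [1] "John Smith"
```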
Issue: Case sensitivity issues
Problem: Pattern isn’t matching text that visually looks correct.
Solution: Pass the ignore.case = TRUE argument:
# Default: case sensitive
grepl("HELLO", "hello") # [1] FALSE
# Case insensitive
grepl("HELLO", "hello", ignore.case = TRUE) # [1] TRUE
Common Patterns
# Email
"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$"
# URL
"https?://[a-zA-Z0-9./?=_%:-]*"
# Phone (US format)
"\\(?\\d{3}\\)?[-.]?\\d{3}[-.]?\\d{4}"
# Date (YYYY-MM-DD)
"\\d{4}-\\d{2}-\\d{2}"
# Integer
"^-?\\d+$"
# Floating point
"^-?\\d*\\.?\\d+$"
# Alphanumeric only
"^[a-zA-Z0-9]+$"
# No special characters
"^[^!@#$%^&*(),.?\":{}|<>]+$"
Common Questions
Q: What’s the difference between grep and grepl?
A: grep() returns the indices of matching elements (or the matching values themselves with value = TRUE). grepl() returns TRUE/FALSE for each element.
Q: How do I escape a backslash in a regex?
A: Use \\\\ (four backslashes) or use raw strings r"{\\}".
Q: What’s non-greedy matching?
A: *? and +? match as few characters as possible (vs greedy * and + which match as many as possible).
Q: How do I use capture groups?
A: Use () to capture, then reference with \\1, \\2, etc. in replacement.
Q: Is stringr better than base R regex?
A: stringr is more consistent and readable, but base R functions can be slightly faster in some cases.
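The backslash question above tends to trip people up, so here is a minimal sketch of the three spellings side by side. The Windows-style path is invented for the example:

```r
# The R string below contains single backslashes (each written as \\ in source)
win_path <- "C:\\Users\\me\\data.csv"

# Regex for one literal backslash is \\, which needs four backslashes in a normal R string
unix_style <- gsub("\\\\", "/", win_path)

# Raw string (R 4.0+): write the regex \\ exactly as it is
unix_raw <- gsub(r"(\\)", "/", win_path)

# fixed = TRUE: no regex at all, so a single backslash is enough
unix_fixed <- gsub("\\", "/", win_path, fixed = TRUE)

print(unix_style)
# [1] "C:/Users/me/data.csv"
```

All three calls produce the same result, so pick whichever reads best in your code.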
Related Topics
Build on regular expressions for advanced text processing:
- R Data Transformation - Complete Guide - Use regex for transformation
- R Functions & Control Flow - Complete Guide - Use regex in functions
- R Data Cleaning - Apply regex to data quality