R Regular Expressions - Comprehensive Tutorial: Pattern Matching & Text Processing

Regular expressions (regex) are one of those skills that seem intimidating at first, but once they click, they’ll save you hours of tedious string manipulation. I can’t tell you how many times regex has pulled me out of a tight spot when cleaning messy data.

Whether you’re validating email addresses, extracting patterns from text, parsing log files, or transforming data formats, regular expressions are essential. In my experience, most R programmers underutilize regex and end up writing complicated loops when a single regex pattern would do it in one line.

This comprehensive guide covers everything from regex basics to advanced patterns, with practical examples tested in R 4.0+. I’ll show you not just how regex works, but when and why to use each technique. You’ll also see common mistakes I’ve made and learned from.

Regular Expression Basics

A regular expression (regex) is a pattern that describes text. It consists of literal characters and metacharacters that have special meanings.

Common Metacharacters

# Basic metacharacters
# .       - Any single character (newline handling depends on the regex engine)
# *       - Zero or more of preceding element
# +       - One or more of preceding element
# ?       - Zero or one of preceding element
# ^       - Start of string
# $       - End of string
# |       - OR operator
# []      - Character class
# ()      - Grouping
# \       - Escape character

# Examples
text <- "Hello123World456"

# . matches any character
grepl("H.llo", text)                    # [1] TRUE

# * matches zero or more
grepl("Hel*o", text)                    # [1] TRUE

# + matches one or more
grepl("Hello[0-9]+", text)              # [1] TRUE

# ? matches zero or one
grepl("Hel?lo", text)                   # [1] TRUE
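
The examples above cover ., *, + and ?. The remaining metacharacters from the list can be sketched the same way. This is a minimal illustration using base R's grepl(); the test strings are made up for the example.

```r
text <- "Hello123World456"

# ^ and $ anchor the match to the start and end of the string
starts_hello <- grepl("^Hello", text)      # TRUE: the string starts with "Hello"
ends_world   <- grepl("World$", text)      # FALSE: the string ends with "456"

# | matches either alternative
has_either <- grepl("World|Planet", text)  # TRUE

# [] matches any one character from the set
has_xyz <- grepl("[xyz]", text)            # FALSE: no x, y, or z present

# () groups elements so a quantifier applies to the whole group
repeated_ab <- grepl("^(ab)+$", c("ababab", "aba"))   # TRUE FALSE

# \\ escapes a metacharacter so it is matched literally
literal_dot <- grepl("3\\.14", c("3.14", "3514"))     # TRUE FALSE
```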

Pattern Matching Functions

grep() - Find Matching Strings

# Find indices of matching strings
fruits <- c("apple", "apricot", "banana", "cherry", "avocado")

# Find elements matching pattern
indices <- grep("^a", fruits)           # Elements starting with 'a'
print(indices)
# [1] 1 2 5

# Get matching values
matches <- fruits[grep("^a", fruits)]
print(matches)
# [1] "apple"   "apricot" "avocado"

# Case insensitive matching
grep("APPLE", fruits, ignore.case = TRUE)
# [1] 1

# Pattern with alternation
grep("ap|ch", fruits)                   # Contains 'ap' or 'ch'
# [1] 1 2 4

# Invert match (find non-matching)
grep("^a", fruits, invert = TRUE)
# [1] 3 4
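
Indexing with fruits[grep(...)] works, but grep() can also return the matching values directly through its value argument. A short sketch, reusing the fruits vector from above:

```r
fruits <- c("apple", "apricot", "banana", "cherry", "avocado")

# value = TRUE returns the matching elements instead of their indices
a_fruits <- grep("^a", fruits, value = TRUE)
print(a_fruits)
# [1] "apple"   "apricot" "avocado"

# value and invert can be combined to keep the non-matching elements
other_fruits <- grep("^a", fruits, value = TRUE, invert = TRUE)
print(other_fruits)
# [1] "banana" "cherry"
```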

grepl() - Logical Pattern Test

# Test if each element matches pattern (returns logical vector)
emails <- c("user@example.com", "invalid-email", "admin@example.org")

# Check if valid email format (simplified)
valid_emails <- grepl("^[a-zA-Z0-9]+@[a-zA-Z0-9]+\\.[a-zA-Z]+$", emails)
print(valid_emails)
# [1]  TRUE FALSE  TRUE

# Filter valid emails
emails[valid_emails]
# [1] "user@example.com"  "admin@example.org"

# Test a character class (any vowel) against every element
has_vowel <- grepl("[aeiou]", emails, ignore.case = TRUE)
print(has_vowel)

Text Extraction

sub() and gsub() - Pattern Replacement

# sub() - Replace first occurrence
text <- "apple apple apple"
result_sub <- sub("apple", "orange", text)
print(result_sub)
# [1] "orange apple apple"

# gsub() - Replace all occurrences
result_gsub <- gsub("apple", "orange", text)
print(result_gsub)
# [1] "orange orange orange"

# Remove pattern (replace with empty string)
phone <- "123-456-7890"
clean_phone <- gsub("-", "", phone)
print(clean_phone)
# [1] "1234567890"

# Use capture groups for rearrangement
text <- "2026-02-09"
reformatted <- gsub("(\\d{4})-(\\d{2})-(\\d{2})", "\\3/\\2/\\1", text)
print(reformatted)
# [1] "09/02/2026"
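
Both functions are vectorized, so they clean a whole character vector in one call. The sketch below uses made-up input to show the first-versus-all difference on a vector, plus a common whitespace-trimming pattern (base R's trimws() does the same job; the regex version is shown only to illustrate alternation with anchors).

```r
words <- c("banana", "avocado")

# sub() changes only the first match in each element
first_only <- sub("a", "A", words)     # "bAnana" "Avocado"

# gsub() changes every match in each element
all_matches <- gsub("a", "A", words)   # "bAnAnA" "AvocAdo"

# Strip leading OR trailing whitespace from each element
messy <- c("  apple ", "banana   ", "  cherry")
trimmed <- gsub("^\\s+|\\s+$", "", messy)
print(trimmed)
# [1] "apple"  "banana" "cherry"
```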

strsplit() - Split by Pattern

# Split string by pattern
text <- "apple,banana,cherry"
parts <- strsplit(text, ",")
print(parts)
# [[1]]
# [1] "apple"  "banana" "cherry"

# Extract vector instead of list
parts_vec <- unlist(strsplit(text, ","))
print(parts_vec)

# Split by whitespace
sentence <- "The quick brown fox jumps"
words <- strsplit(sentence, "\\s+")[[1]]
print(words)

# Split on a literal delimiter (fixed = TRUE treats "," as plain text, not a regex)
text_long <- "one,two,three,four,five"
parts_fixed <- strsplit(text_long, ",", fixed = TRUE)[[1]]
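
Base R's strsplit() has no argument that caps the number of pieces. One way to split at only the first delimiter, sketched here with regexpr() and regmatches(), is to ask for the text around the first match; the variable names are illustrative.

```r
text_long <- "one,two,three,four,five"

# regexpr() locates only the first comma; invert = TRUE returns the
# text before and after that match instead of the match itself
first_split <- regmatches(text_long, regexpr(",", text_long), invert = TRUE)[[1]]
print(first_split)
# [1] "one"                 "two,three,four,five"

# The same result with two sub() calls
head_part <- sub(",.*$", "", text_long)     # everything before the first comma
tail_part <- sub("^[^,]*,", "", text_long)  # everything after the first comma
```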

regmatches() and gregexpr()

# Extract all matches
text <- "ABC 123 DEF 456 GHI"

# gregexpr finds positions of all matches
positions <- gregexpr("[0-9]+", text)
print(positions)

# regmatches extracts the matched substrings
matches <- regmatches(text, gregexpr("[0-9]+", text))
print(matches)
# [[1]]
# [1] "123" "456"

# Simplify to vector
numbers <- unlist(regmatches(text, gregexpr("[0-9]+", text)))
print(numbers)
# [1] "123" "456"
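
gregexpr() finds every match. When only the first match is needed, or the capture groups of a match, regexpr() and regexec() pair with regmatches() in the same way. A minimal sketch with illustrative variable names:

```r
text <- "ABC 123 DEF 456 GHI"

# regexpr() reports only the first match
first_number <- regmatches(text, regexpr("[0-9]+", text))
print(first_number)
# [1] "123"

# regexec() also records capture groups: the full match comes first,
# followed by one element per group
date_text <- "2026-02-09"
date_parts <- regmatches(date_text,
                         regexec("(\\d{4})-(\\d{2})-(\\d{2})", date_text))[[1]]
print(date_parts)
# [1] "2026-02-09" "2026"       "02"         "09"

year <- date_parts[2]
```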

stringr Package (tidyverse)

The stringr package provides cleaner, more consistent functions for regex operations.

Basic stringr Functions

library(stringr)

text <- "The year is 2024"

# str_detect - Test for pattern presence
str_detect(text, "[0-9]+")              # [1] TRUE

# str_extract - Extract first match
str_extract(text, "[0-9]+")             # [1] "2024"

# str_extract_all - Extract all matches
str_extract_all(text, "[0-9]")
# [[1]]
# [1] "2" "0" "2" "4"

# str_replace - Replace first occurrence
str_replace(text, "[0-9]+", "XXXX")
# [1] "The year is XXXX"

# str_replace_all - Replace all occurrences
text_repeated <- "2 and 2 and 2"
str_replace_all(text_repeated, "2", "X")
# [1] "X and X and X"

# str_match - Extract with groups
pattern <- "(\\d{4})-(\\d{2})-(\\d{2})"
date_text <- "2026-02-09"
str_match(date_text, pattern)
#      [,1]         [,2]   [,3] [,4]
# [1,] "2026-02-09" "2026" "02" "09"

Advanced stringr Operations

# str_split - Split strings
str_split("apple,banana,cherry", ",")

# str_trim - Remove whitespace
str_trim("  hello world  ")            # [1] "hello world"

# str_sub - Extract substring
str_sub("hello", 1, 3)                 # [1] "hel"

# str_to_upper/lower/title
str_to_upper("hello")                  # [1] "HELLO"
str_to_title("hello world")            # [1] "Hello World"

Character Classes and Quantifiers

Character Classes

# Common character classes
# [a-z]     - Lowercase letters
# [A-Z]     - Uppercase letters
# [0-9]     - Digits
# [a-zA-Z0-9] - Alphanumeric
# \d        - Digit (equivalent to [0-9])
# \w        - Word character (alphanumeric + underscore)
# \s        - Whitespace
# [^abc]    - NOT a, b, or c

# Examples
grepl("[0-9]", "abc123")               # [1] TRUE (contains digit)
grepl("[A-Z]", "hello")                # [1] FALSE (no uppercase)
grepl("\\d+", "abc123def")             # [1] TRUE (contains digits)

Quantifiers

# Quantifiers control repetition
# {n}       - Exactly n times
# {n,}      - At least n times
# {n,m}     - Between n and m times
# *         - 0 or more (greedy)
# +         - 1 or more (greedy)
# ?         - 0 or 1 (greedy)
# *?        - 0 or more (non-greedy)

# Examples
grepl("a{3}", "aaa")                   # [1] TRUE
grepl("a{2,4}", "aaa")                 # [1] TRUE
grepl("a+", "aaa")                     # [1] TRUE
grepl("a?b", "b")                      # [1] TRUE

Practical Text Processing Examples

Email Validation

# Validate email format
validate_email <- function(email) {
  pattern <- "^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$"
  grepl(pattern, email)
}

test_emails <- c(
  "john.doe@example.com",
  "invalid.email@",
  "jane_smith@mail.example.org",
  "no-at-sign.com"
)

# grepl() is vectorized, so the function accepts the whole vector
validate_email(test_emails)
# [1]  TRUE FALSE  TRUE FALSE

Extract Data

# Extract phone numbers from text
text <- "Call me at 555-123-4567 or 555.987.6543"

# Using gregexpr and regmatches
numbers <- unlist(regmatches(text, gregexpr("[0-9]{3}[-.]?[0-9]{3}[-.]?[0-9]{4}", text)))
print(numbers)
# [1] "555-123-4567" "555.987.6543"

# Clean format
clean_numbers <- gsub("[^0-9]", "", numbers)
print(clean_numbers)

Data Cleaning

# Clean and standardize data
data <- c("$1,234.56", "$2,500.00", "$99.99")

# Remove currency and commas, convert to numeric
cleaned <- as.numeric(gsub("[$,]", "", data))
print(cleaned)
# [1] 1234.56 2500.00   99.99

# Standardize phone format
phones <- c("5551234567", "(555) 123-4567", "555.123.4567")

standardize_phone <- function(phone) {
  digits <- gsub("[^0-9]", "", phone)
  paste0("(", substr(digits, 1, 3), ") ",
         substr(digits, 4, 6), "-",
         substr(digits, 7, 10))
}

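# gsub() and substr() are vectorized, so no sapply() is needed
standardize_phone(phones)
# [1] "(555) 123-4567" "(555) 123-4567" "(555) 123-4567"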

Text Extraction from Structured Data

# Extract age and name from text
entries <- c(
  "John (age 25)",
  "Mary (age 30)",
  "Robert (age 45)"
)

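# Load stringr before calling its functions
library(stringr)

# Extract names (person_names avoids masking base::names())
person_names <- str_trim(gsub("\\(age.*", "", entries))
print(person_names)
# [1] "John"   "Mary"   "Robert"

# Extract ages
ages <- as.numeric(str_extract(entries, "(?<=age )\\d+"))
print(ages)
# [1] 25 30 45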

Best Practices

Use raw strings - Use r"{...}" for complex patterns to avoid escaping issues
Test patterns incrementally - Start simple and build complexity
Use anchors - ^ and $ to match start/end of strings
Escape special characters - Use \\ for literal special characters
Use non-capturing groups - (?:...) when you don’t need to extract
Be specific - More specific patterns are usually more efficient
Use stringr - Cleaner API than base R functions
Document patterns - Complex patterns benefit from comments explaining intent
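
Two of these practices are easier to judge with code in front of you. The sketch below assumes R 4.0.0 or later (raw strings do not exist before that) and uses perl = TRUE for the non-capturing group; the sample name is made up.

```r
# A raw string holds the same pattern as the double-escaped version
escaped_pattern <- "\\d{4}-\\d{2}"
raw_pattern     <- r"(\d{4}-\d{2})"
same_pattern <- identical(escaped_pattern, raw_pattern)   # TRUE

matches_month <- grepl(raw_pattern, "2026-02")            # TRUE

# (?:...) groups the alternatives without capturing them,
# so \\1 refers to the surname rather than the title
surname <- sub("(?:Mr|Ms)\\. (\\w+)", "\\1", "Ms. Lovelace", perl = TRUE)
print(surname)
# [1] "Lovelace"
```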

Troubleshooting Regex Issues

Issue: Pattern not matching when it should

Problem: Your regex pattern looks correct but grep() or grepl() returns no matches.

Causes & Solutions:

Backslash escaping - In R strings, \d needs to be \\d. Use raw strings to avoid confusion: r"{\d}".
Special characters - Remember that . matches ANY character. Use \\. to match a literal dot.
Anchors - Use ^ for start of string and $ for end. Without them, patterns match anywhere.

Example:

# ❌ WRONG - Dot matches any character
grepl("1.0", "120")  # [1] TRUE (matches because . matches anything)

# ✅ RIGHT - Escape the dot
grepl("1\\.0", "1.0")  # [1] TRUE

Issue: Performance slowdown with complex patterns

Problem: grepl() or gsub() running very slowly on large text.

Solutions:

Use fixed = TRUE if you’re matching literal strings (no patterns)
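Simplify patterns - Broad wildcards such as .* can force heavy backtracking; a specific class such as [^,]* usually does less work
Benchmark other engines - perl = TRUE in base R or stringr (which uses the ICU engine) can be faster for some patterns, so measure on your own data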
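
Switching engines changes what a pattern means as well as how fast it runs, so it is worth checking the result before timing anything. A minimal sketch with made-up file names:

```r
paths <- c("data.csv", "data_csv", "notes.txt")

# Default engine: "." is a wildcard, so "_csv" also matches
regex_hits <- grepl(".csv", paths)                 # TRUE TRUE FALSE

# fixed = TRUE: the pattern is a literal string, no regex interpretation
fixed_hits <- grepl(".csv", paths, fixed = TRUE)   # TRUE FALSE FALSE

# perl = TRUE: PCRE engine, here with an escaped dot and an end anchor
perl_hits <- grepl("\\.csv$", paths, perl = TRUE)  # TRUE FALSE FALSE
```

On large inputs, wrapping each call in system.time() gives a rough comparison.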

Issue: Understanding replacement references

Problem: Getting the wrong output when using capture groups in gsub().

Troubleshooting:

# Capture groups use \\1, \\2, etc. (not \\0)
text <- "Smith, John"

# ❌ WRONG - \\0 references whole match
gsub("(\\w+), (\\w+)", "\\0 \\2", text)

# ✅ RIGHT - Swap using \\1 and \\2
gsub("(\\w+), (\\w+)", "\\2 \\1", text)
# [1] "John Smith"

Issue: Case sensitivity issues

Problem: Pattern isn’t matching text that visually looks correct.

Solution: Use ignore.case = TRUE parameter:

# Default: case sensitive
grepl("HELLO", "hello")  # [1] FALSE

# Case insensitive
grepl("HELLO", "hello", ignore.case = TRUE)  # [1] TRUE
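
Two alternatives to ignore.case = TRUE, sketched with a made-up vector: an inline (?i) flag, which needs perl = TRUE, and normalizing the text with tolower() before a literal comparison.

```r
greetings <- c("Hello", "HELLO", "help")

# Inline flag: (?i) makes the rest of the pattern case insensitive
inline_flag <- grepl("(?i)hello", greetings, perl = TRUE)          # TRUE TRUE FALSE

# Normalize first, then match a literal string
normalized <- grepl("hello", tolower(greetings), fixed = TRUE)     # TRUE TRUE FALSE
```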

Common Patterns

# Email
"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$"

# URL
"https?://[a-zA-Z0-9./?=_%:-]*"

# Phone (US format)
"\\(?\\d{3}\\)?[-.]?\\d{3}[-.]?\\d{4}"

# Date (YYYY-MM-DD)
"\\d{4}-\\d{2}-\\d{2}"

# Integer
"^-?\\d+$"

# Floating point
"^-?\\d*\\.?\\d+$"

# Alphanumeric only
"^[a-zA-Z0-9]+$"

# No special characters
"^[^!@#$%^&*(),.?\":{}|<>]+$"

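A quick way to check any of these patterns is to run it against a handful of inputs you expect to pass and fail. The sketch below does that for the integer and floating point patterns, then uses the unanchored date pattern for extraction; the inputs are made up.

```r
inputs <- c("42", "-7", "3.14", ".5", "1.", "abc")

is_integer_like <- grepl("^-?\\d+$", inputs)
print(is_integer_like)
# [1]  TRUE  TRUE FALSE FALSE FALSE FALSE

is_float_like <- grepl("^-?\\d*\\.?\\d+$", inputs)
print(is_float_like)
# [1]  TRUE  TRUE  TRUE  TRUE FALSE FALSE

# Unanchored patterns are meant for extraction from longer text
log_line <- "2026-02-09 12:00 ERROR disk full"
log_date <- regmatches(log_line, regexpr("\\d{4}-\\d{2}-\\d{2}", log_line))
print(log_date)
# [1] "2026-02-09"
```

Note that the floating point pattern rejects "1." because it requires at least one digit after an optional dot.
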
Common Questions

Q: What’s the difference between grep and grepl? A: grep() returns the indices of matching elements (or the elements themselves with value = TRUE). grepl() returns TRUE/FALSE for each element.

Q: How do I escape a backslash in a regex? A: Use \\\\ (four backslashes) or use raw strings r"{\\}".

Q: What’s non-greedy matching? A: *? and +? match as few characters as possible (vs greedy * and + which match as many as possible).

Q: How do I use capture groups? A: Use () to capture, then reference with \\1, \\2, etc. in replacement.

Q: Is stringr better than base R regex? A: stringr is more consistent and readable. Relative speed depends on the pattern and the data, so benchmark both if performance matters.
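
The backslash answer above is the one that trips people up most, so here is a small sketch of it. The Windows-style path is made up, and the raw string form assumes R 4.0.0 or later.

```r
# In R source, "\\" is ONE backslash character
win_path <- "C:\\temp\\file.txt"
one_char <- nchar("\\")                            # 1

# The regex needs an escaped backslash, so the R string needs four
has_backslash <- grepl("\\\\", win_path)           # TRUE

# The same pattern as a raw string
has_backslash_raw <- grepl(r"(\\)", win_path)      # TRUE

# Replace backslashes with forward slashes
unix_style <- gsub("\\\\", "/", win_path)
print(unix_style)
# [1] "C:/temp/file.txt"

# With fixed = TRUE the pattern is literal, so a single backslash is enough
unix_style_fixed <- gsub("\\", "/", win_path, fixed = TRUE)
```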

Build on regular expressions for advanced text processing:

R Data Transformation - Complete Guide - Use regex for transformation
R Functions & Control Flow - Complete Guide - Use regex in functions
R Data Cleaning - Apply regex to data quality

