Data frames are absolutely central to R programming. In my work analyzing real datasets, I’d say I spend 80% of my time working with data frames in some form. They’re the bridge between raw data and analysis.
If you’re new to R, here’s what you need to understand: data frames are how R stores tabular data, the kind you’re used to seeing in Excel spreadsheets or SQL databases. Rows are your observations, columns are your variables. Simple concept, but mastering data frame operations will make you significantly more efficient.
This complete guide covers everything you’ll encounter, from creating data frames to advanced operations like reshaping and merging. I’ve tested all examples in R 4.0+ and included real-world scenarios you’ll actually face in your work.
What Are Data Frames?
A data frame is a 2-dimensional table-like structure where:
- Rows represent observations or records
- Columns represent variables or features
- Each column can contain different data types (numeric, character, logical, etc.)
Data frames are R’s primary data structure for tabular data and the foundation for most statistical analysis, data manipulation, and visualization tasks.
Why Use Data Frames?
- Mixed data types - Unlike matrices, columns can have different types
- Named columns - Access columns by name, not just index
- Familiar structure - Similar to databases or Excel spreadsheets
- Rich ecosystem - Hundreds of functions work with data frames
- Flexible operations - Powerful subsetting, merging, and reshaping capabilities
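To make the "mixed data types" point concrete, here is a minimal sketch that builds a matrix and a data frame from the same two vectors. The variable names are made up for illustration only.

```r
# Illustrative vectors (hypothetical names, not used elsewhere in this guide)
ids <- 1:3
labels <- c("a", "b", "c")

# A matrix holds a single type, so the integers are coerced to character
m <- cbind(ids, labels)
typeof(m)          # "character"

# A data frame keeps each column's own type
d <- data.frame(ids = ids, labels = labels, stringsAsFactors = FALSE)
sapply(d, class)   # ids is "integer", labels is "character"
```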
Creating Data Frames
Basic Creation with data.frame()
The simplest way to create a data frame is using the data.frame() function:
# Create a data frame from vectors
employees <- data.frame(
EmployeeID = 1:5,
Name = c("Alice", "Bob", "Charlie", "Diana", "Eve"),
Department = c("Sales", "IT", "HR", "Sales", "IT"),
Salary = c(50000, 65000, 55000, 52000, 70000),
StartDate = as.Date(c("2020-01-15", "2019-06-01", "2021-03-10", "2020-11-22", "2018-09-05"))
)
print(employees)
# EmployeeID Name Department Salary StartDate
# 1 1 Alice Sales 50000 2020-01-15
# 2 2 Bob IT 65000 2019-06-01
# 3 3 Charlie HR 55000 2021-03-10
# 4 4 Diana Sales 52000 2020-11-22
# 5 5 Eve IT 70000 2018-09-05
Creating from Vectors
Each argument to data.frame() becomes a column:
# Creating from existing vectors
product_names <- c("Product_A", "Product_B", "Product_C") # avoid shadowing base::names()
sales <- c(15000, 22000, 18500)
quarters <- c("Q1", "Q2", "Q1")
sales_df <- data.frame(
Product = product_names,
Revenue = sales,
Quarter = quarters
)
print(sales_df)
# Product Revenue Quarter
# 1 Product_A 15000 Q1
# 2 Product_B 22000 Q2
# 3 Product_C 18500 Q1
Creating from Lists
Convert a list to a data frame (useful after importing or processing data):
# Create from list
data_list <- list(
ID = 1:3,
Value = c(10, 20, 30),
Category = c("A", "B", "A")
)
df_from_list <- as.data.frame(data_list)
print(df_from_list)
# ID Value Category
# 1 1 10 A
# 2 2 20 B
# 3 3 30 A
Manual Data Entry
For small datasets, enter data directly:
# Manual entry for small datasets
survey_results <- data.frame(
Response = c("Yes", "No", "Yes", "Maybe", "Yes"),
Score = c(5, 2, 4, 3, 5)
)
# Or use data entry GUI
# df <- edit(data.frame()) # Opens spreadsheet interface
Understanding Data Frame Structure
Check if Object is a Data Frame
# Check data type
df <- data.frame(x = 1:3, y = c("a", "b", "c"))
is.data.frame(df) # [1] TRUE
class(df) # [1] "data.frame"
typeof(df) # [1] "list"
# Data frames are lists with equal-length vectors
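Because a data frame is a list of equal-length columns, list tools apply column by column. The sketch below illustrates that, and also shows that columns of incompatible lengths are rejected when the data frame is built. The error text stored in `bad` is a placeholder chosen for this example, not R's actual message.

```r
df <- data.frame(x = 1:3, y = c("a", "b", "c"), stringsAsFactors = FALSE)

# List-style tools operate on columns
is.list(df)         # TRUE
sapply(df, class)   # class of each column
lengths(df)         # every column has length nrow(df)

# Columns whose lengths cannot be recycled to match are rejected
bad <- tryCatch(
  data.frame(a = 1:3, b = 1:2),
  error = function(e) "error: differing number of rows"
)
bad
```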
Examine Data Frame Properties
df <- data.frame(
ID = 1:100,
Value = rnorm(100),
Category = sample(c("A", "B", "C"), 100, replace = TRUE)
)
# Dimensions and Structure
nrow(df) # [1] 100 (number of rows)
ncol(df) # [1] 3 (number of columns)
dim(df) # [1] 100 3 (returns vector with rows and columns)
# Using length() function
length(df) # [1] 3 (length returns number of columns for data frames)
names(df) # [1] "ID" "Value" "Category" (column names)
# For vectors, length() returns number of elements
x <- c(10, 20, 30, 40, 50)
length(x) # [1] 5
# Length of specific column
length(df$ID) # [1] 100 (a column's length equals nrow(df))
length(df$Category) # [1] 100
# Find number of unique values
length(unique(df$Category)) # [1] 3 (unique categories)
# Structure overview
str(df) # Shows structure of each column
head(df, 10) # First 10 rows
tail(df, 5) # Last 5 rows
summary(df) # Statistical summary
Get Column Names and Row Names
df <- data.frame(
Name = c("Alice", "Bob"),
Age = c(25, 30),
City = c("NY", "LA")
)
# Column names
colnames(df) # [1] "Name" "Age" "City"
names(df) # [1] "Name" "Age" "City"
# Row names (default: 1, 2, 3, ...)
rownames(df) # [1] "1" "2"
# Set custom row names
rownames(df) <- c("Person1", "Person2")
print(df)
# Name Age City
# Person1 Alice 25 NY
# Person2 Bob 30 LA
Working with Columns
Accessing Columns
df <- data.frame(
Name = c("Alice", "Bob", "Charlie"),
Age = c(25, 30, 35),
Salary = c(50000, 60000, 70000)
)
# By name using $
df$Name # [1] "Alice" "Bob" "Charlie"
df$Age # [1] 25 30 35
# By index using brackets
df[[1]] # Same as df$Name
df[,1] # Same as df$Name
df[["Name"]] # Same as df$Name
# By position with numeric index
df[,2] # Age column
df[2] # Data frame with only 2nd column
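The access forms above differ in what they return: some give back a plain vector, others a one-column data frame. A small sketch of the difference, including the `drop = FALSE` argument that keeps the data frame shape with matrix-style indexing:

```r
df <- data.frame(
  Name = c("Alice", "Bob", "Charlie"),
  Age = c(25, 30, 35)
)

class(df[, "Age"])                 # "numeric": dropped to a vector
class(df[, "Age", drop = FALSE])   # "data.frame": still a one-column table
class(df["Age"])                   # "data.frame"
class(df[["Age"]])                 # "numeric"
```

Keeping the data frame shape matters when the result is passed to code that expects named columns.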
Adding Columns
df <- data.frame(
ID = 1:3,
Name = c("Alice", "Bob", "Charlie")
)
# Method 1: Using $
df$Age <- c(25, 30, 35)
# Method 2: Using [[ ]]
df[["Department"]] <- c("Sales", "IT", "HR")
# Method 3: Using cbind()
df <- cbind(df, Status = c("Active", "Active", "Inactive"))
# Method 4: Create new column from existing ones
df$Email <- paste0(tolower(df$Name), "@company.com")
print(df)
# ID Name Age Department Status Email
# 1 1 Alice 25 Sales Active alice@company.com
# 2 2 Bob 30 IT Active bob@company.com
# 3 3 Charlie 35 HR Inactive charlie@company.com
Adding Multiple Columns
df <- data.frame(ID = 1:3, Name = c("A", "B", "C"))
# Using cbind
new_cols <- data.frame(
Age = c(25, 30, 35),
City = c("NY", "LA", "Chicago")
)
df <- cbind(df, new_cols)
print(df)
# Using dplyr mutate()
library(dplyr)
df %>%
mutate(
Salary = c(50000, 60000, 70000),
Department = c("Sales", "IT", "HR")
)
Removing Columns
df <- data.frame(
ID = 1:3,
Name = c("Alice", "Bob", "Charlie"),
Age = c(25, 30, 35),
Salary = c(50000, 60000, 70000)
)
# Each method below starts from the original df, so the results are independent
# Method 1: Set to NULL (on a copy)
df1 <- df
df1$Age <- NULL
# Method 2: Use negative indexing
df2 <- df[, -3] # Remove 3rd column (Age)
# Method 3: Select columns to keep
df3 <- df[, c("ID", "Name")]
# Method 4: Using dplyr
library(dplyr)
df4 <- df %>% select(-Age, -Salary)
Reordering Columns
df <- data.frame(
ID = 1:3,
Salary = c(50000, 60000, 70000),
Name = c("Alice", "Bob", "Charlie"),
Age = c(25, 30, 35)
)
# Each method below starts from the original column order (ID, Salary, Name, Age)
# Method 1: Select columns in desired order
by_name <- df[, c("ID", "Name", "Age", "Salary")]
# Method 2: Using indices
by_index <- df[, c(1, 3, 4, 2)]
# Method 3: Using dplyr
library(dplyr)
by_dplyr <- df %>% select(ID, Name, Age, Salary)
Column Statistics
df <- data.frame(
Product = c("A", "B", "C"),
Sales = c(100, 150, 120),
Profit = c(20, 35, 25)
)
# Mean of column
mean(df$Sales) # [1] 123.3333
# Multiple statistics
colMeans(df[, c("Sales", "Profit")])
# Sales Profit
# 123.33333 26.66667
# Count occurrences
table(df$Product) # Frequency table
Unique and Duplicated Values
df <- data.frame(
ID = c(1, 2, 2, 3, 3, 3),
Value = c("A", "B", "B", "C", "C", "C")
)
# Count unique values in column
length(unique(df$Value)) # [1] 3
# Get unique rows
unique_df <- unique(df)
# Find duplicates
duplicated(df) # Shows which rows are duplicates
df[!duplicated(df),] # Remove duplicates
# Count values with condition
sum(df$ID == 2) # [1] 2
nrow(df[df$Value == "C",]) # [1] 3
Working with Rows
Filtering Rows
df <- data.frame(
ID = 1:5,
Name = c("Alice", "Bob", "Charlie", "Diana", "Eve"),
Salary = c(50000, 65000, 55000, 52000, 70000),
Department = c("Sales", "IT", "HR", "Sales", "IT")
)
# Filter by condition
high_earners <- df[df$Salary > 55000, ]
# Multiple conditions
it_staff <- df[df$Department == "IT" & df$Salary > 60000, ]
# Using %in% for multiple values
sales_or_it <- df[df$Department %in% c("Sales", "IT"), ]
# Using != to exclude a value
non_sales <- df[df$Department != "Sales", ]
# Using which()
df[which(df$Salary > 60000), ]
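Logical subsetting and `which()` behave differently when the filter column contains missing values. The following sketch uses a small made-up data frame (`pay` is a hypothetical name) to show the difference.

```r
pay <- data.frame(
  Name = c("Alice", "Bob", "Charlie"),
  Salary = c(50000, NA, 70000)
)

# Logical subsetting keeps an all-NA row wherever the condition is NA
by_logical <- pay[pay$Salary > 60000, ]
nrow(by_logical)   # 2: one all-NA row plus Charlie

# which() returns only the positions where the condition is TRUE
by_which <- pay[which(pay$Salary > 60000), ]
nrow(by_which)     # 1: Charlie only
```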
Removing Rows
df <- data.frame(
ID = 1:5,
Value = c(10, 20, NA, 40, 50)
)
# Remove rows with NA
df_clean <- df[!is.na(df$Value), ]
# Remove specific rows by index
df <- df[-c(1, 3), ] # Remove rows 1 and 3
# Remove duplicate rows
df_unique <- df[!duplicated(df), ]
# Remove first/last row
df <- df[-1, ] # Remove first row
df <- df[-nrow(df), ] # Remove last row
# Using dplyr
library(dplyr)
df %>% filter(!is.na(Value))
Selecting/Filtering Rows by Condition
df <- data.frame(
Employee = c("Alice", "Bob", "Charlie"),
Salary = c(50000, 75000, 60000),
Years = c(2, 5, 3)
)
# Single condition
senior_employees <- df[df$Years >= 3, ]
# Multiple AND conditions
criteria <- df[df$Salary > 55000 & df$Years >= 2, ]
# Multiple OR conditions
special_group <- df[df$Salary > 70000 | df$Years >= 5, ]
# Complex conditions
df[df$Salary > 50000 & (df$Years < 4 | df$Employee == "Bob"), ]
# Conditional row count
nrow(df[df$Salary > 60000, ]) # Number of rows matching condition
Row Names and Dimensions
df <- data.frame(
ID = 1:5,
Value = c(10, 20, 30, 40, 50)
)
# Get number of rows
nrow(df) # [1] 5
# Get row names (default: 1, 2, 3, ...)
rownames(df)
# Set custom row names
rownames(df) <- c("Row_A", "Row_B", "Row_C", "Row_D", "Row_E")
# Access by row name
df["Row_A", ]
Indexing Data Frames
df <- data.frame(
ID = 1:3,
Name = c("Alice", "Bob", "Charlie"),
Age = c(25, 30, 35)
)
# Single element
df[1, 2] # First row, second column
df[1, "Name"] # First row, "Name" column
# Entire row
df[1, ] # First row, all columns
# Entire column
df[, 2] # All rows, second column
df$Name # All rows, Name column
# Multiple rows/columns
df[1:2, c(1, 3)] # Rows 1-2, columns 1 and 3
df[c(1, 3), c("ID", "Name")]
Merging and Combining Data Frames
Merging by Keys
# Create two data frames to merge
customers <- data.frame(
CustomerID = 1:3,
Name = c("Alice", "Bob", "Charlie")
)
orders <- data.frame(
OrderID = 101:103,
CustomerID = c(1, 2, 1),
Amount = c(500, 750, 600)
)
# Inner join (only matching rows)
merged_inner <- merge(customers, orders, by = "CustomerID")
# Left join (all from customers)
merged_left <- merge(customers, orders, by = "CustomerID", all.x = TRUE)
# Right join (all from orders)
merged_right <- merge(customers, orders, by = "CustomerID", all.y = TRUE)
# Full outer join (all rows from both)
merged_full <- merge(customers, orders, by = "CustomerID", all = TRUE)
print(merged_inner)
# CustomerID Name OrderID Amount
# 1 1 Alice 101 500
# 2 1 Alice 103 600
# 3 2 Bob 102 750
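With a left join, rows from the first data frame that have no match are kept and the columns from the second are filled with NA. A short sketch using the same customers and orders data, with `no_orders` as an illustrative name:

```r
customers <- data.frame(
  CustomerID = 1:3,
  Name = c("Alice", "Bob", "Charlie")
)
orders <- data.frame(
  OrderID = 101:103,
  CustomerID = c(1, 2, 1),
  Amount = c(500, 750, 600)
)

merged_left <- merge(customers, orders, by = "CustomerID", all.x = TRUE)
nrow(merged_left)   # 4: two rows for Alice, one for Bob, one for Charlie

# Customers without any order have NA in the order columns
no_orders <- merged_left[is.na(merged_left$OrderID), ]
no_orders$Name      # "Charlie"
```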
Combining Rows (rbind)
df1 <- data.frame(
ID = 1:2,
Name = c("Alice", "Bob"),
Score = c(85, 90)
)
df2 <- data.frame(
ID = 3:4,
Name = c("Charlie", "Diana"),
Score = c(88, 92)
)
# Combine rows
combined <- rbind(df1, df2)
print(combined)
# ID Name Score
# 1 1 Alice 85
# 2 2 Bob 90
# 3 3 Charlie 88
# 4 4 Diana 92
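rbind() on data frames matches columns by name rather than by position, and it fails when the names differ. The sketch below shows both cases. The data frames and the placeholder string returned on error are assumptions made for this example.

```r
df1 <- data.frame(ID = 1:2, Score = c(85, 90))
df2 <- data.frame(Score = c(88, 92), ID = 3:4)     # same names, different order
df3 <- data.frame(ID = 5:6, Points = c(70, 75))    # different column name

# Columns are matched by name, not position
ok <- rbind(df1, df2)
ok$ID      # 1 2 3 4

# A mismatched column name is an error
res <- tryCatch(
  rbind(df1, df3),
  error = function(e) "names do not match"
)
res
```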
Combining Columns (cbind)
df1 <- data.frame(ID = 1:3, Name = c("A", "B", "C"))
df2 <- data.frame(Score = c(85, 90, 88))
# Combine columns
combined <- cbind(df1, df2)
print(combined)
# ID Name Score
# 1 1 A 85
# 2 2 B 90
# 3 3 C 88
Reshaping Data
Long to Wide Format (pivot_wider)
# Sample long format data
long_df <- data.frame(
Country = c("USA", "USA", "USA", "UK", "UK", "UK"),
Year = c(2020, 2021, 2022, 2020, 2021, 2022),
Revenue = c(100, 120, 150, 80, 95, 110)
)
# Convert to wide format
library(tidyr)
wide_df <- pivot_wider(
long_df,
names_from = Year,
values_from = Revenue
)
print(wide_df)
# Country `2020` `2021` `2022`
# USA 100 120 150
# UK 80 95 110
Wide to Long Format (pivot_longer)
# Sample wide format data
wide_df <- data.frame(
Country = c("USA", "UK"),
`2020` = c(100, 80),
`2021` = c(120, 95),
`2022` = c(150, 110),
check.names = FALSE # keep "2020" etc.; the default would rename them to X2020, X2021, X2022
)
# Convert to long format
library(tidyr)
long_df <- pivot_longer(
wide_df,
cols = -Country,
names_to = "Year",
values_to = "Revenue"
)
print(long_df)
# Country Year Revenue
# USA 2020 100
# USA 2021 120
# ...
Melt Data (wide to long)
# Using melt for reshaping
library(reshape2)
wide_data <- data.frame(
ID = 1:2,
Q1 = c(100, 150),
Q2 = c(120, 160),
Q3 = c(150, 180)
)
melted <- melt(
wide_data,
id.vars = "ID",
variable.name = "Quarter",
value.name = "Revenue"
)
print(melted)
# ID Quarter Revenue
# 1 1 Q1 100
# 2 2 Q1 150
# 3 1 Q2 120
# ...
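If reshape2 is not available, a similar wide-to-long result can be approximated in base R with stack(). This is only a sketch for the simple case of one ID column and a few value columns. The object names `stacked` and `long` are illustrative.

```r
wide_data <- data.frame(
  ID = 1:2,
  Q1 = c(100, 150),
  Q2 = c(120, 160),
  Q3 = c(150, 180)
)

value_cols <- c("Q1", "Q2", "Q3")

# stack() returns two columns: values and ind (the source column name)
stacked <- stack(wide_data[, value_cols])

long <- data.frame(
  ID = rep(wide_data$ID, times = length(value_cols)),
  Quarter = stacked$ind,
  Revenue = stacked$values
)
head(long, 3)
```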
Data Type Conversion
Converting Columns
df <- data.frame(
ID = c("1", "2", "3"), # Character
Score = c("85.5", "90.2", "88"), # Character
Status = c("1", "0", "1") # Character
)
# Convert character to numeric
df$ID <- as.numeric(df$ID)
df$Score <- as.numeric(df$Score)
# Convert character "0"/"1" to logical (via numeric)
df$Status <- as.logical(as.numeric(df$Status))
# Convert to data.table
library(data.table)
dt <- as.data.table(df)
# Convert to matrix
matrix_form <- as.matrix(df)
str(df) # str() prints its own output; wrapping it in print() would add a stray NULL
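One thing to watch with as.numeric() is that values it cannot parse become NA, with only a warning. A minimal sketch for spotting such values after conversion, using a made-up vector `raw`:

```r
raw <- c("85.5", "90.2", "n/a")

# Unparseable strings turn into NA (the warning is silenced here on purpose)
converted <- suppressWarnings(as.numeric(raw))
converted                               # 85.5 90.2 NA

# Values that were present in the input but failed to convert
failed <- raw[is.na(converted) & !is.na(raw)]
failed                                  # "n/a"
```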
Advanced Operations
Aggregation and Grouping
df <- data.frame(
Department = c("Sales", "Sales", "IT", "IT", "HR"),
Employee = c("Alice", "Bob", "Charlie", "Diana", "Eve"),
Salary = c(50000, 55000, 70000, 75000, 48000)
)
# Using aggregate()
dept_summary <- aggregate(
Salary ~ Department,
data = df,
FUN = function(x) c(Mean = mean(x), Count = length(x))
)
# Using dplyr
library(dplyr)
dept_stats <- df %>%
group_by(Department) %>%
summarize(
AvgSalary = mean(Salary),
Count = n(),
MinSalary = min(Salary),
MaxSalary = max(Salary)
)
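When the function passed to aggregate() returns more than one value, the result stores them as a matrix inside a single column rather than as separate columns. The sketch below shows this and one way to flatten it. The name `flat` is illustrative.

```r
df <- data.frame(
  Department = c("Sales", "Sales", "IT", "IT", "HR"),
  Salary = c(50000, 55000, 70000, 75000, 48000)
)

dept_summary <- aggregate(
  Salary ~ Department,
  data = df,
  FUN = function(x) c(Mean = mean(x), Count = length(x))
)

ncol(dept_summary)                # 2: Department plus one matrix column
is.matrix(dept_summary$Salary)    # TRUE

# Rebuild as a regular data frame with one column per statistic
flat <- do.call(data.frame, dept_summary)
names(flat)                       # "Department" "Salary.Mean" "Salary.Count"
```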
Sorting/Ordering
df <- data.frame(
Name = c("Charlie", "Alice", "Bob"),
Score = c(88, 95, 90)
)
# Sort by Name
sorted_name <- df[order(df$Name), ]
# Sort by Score (descending)
sorted_score <- df[order(df$Score, decreasing = TRUE), ]
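# Multiple column sort: Score ascending, then Name descending
# (xtfrm() turns the character column into sortable ranks; as.numeric() would give NA)
sorted_multi <- df[order(df$Score, -xtfrm(df$Name)), ]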
# Using dplyr
library(dplyr)
df %>% arrange(Score, desc(Name))
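For mixed sort directions in base R, order() also accepts a vector for `decreasing` when `method = "radix"` is used. A small sketch with tied scores, so the second key actually matters. The data frame `scores` is made up for this example.

```r
scores <- data.frame(
  Name = c("Charlie", "Alice", "Bob", "Dana"),
  Score = c(88, 95, 88, 95)
)

# Score ascending, Name descending within equal scores
idx <- order(
  scores$Score, scores$Name,
  decreasing = c(FALSE, TRUE),
  method = "radix"
)
scores[idx, ]
```

Note that the radix method compares strings in the C locale, which can differ from locale-aware sorting when names mix upper and lower case.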
Transposing
df <- data.frame(
A = c(1, 2, 3),
B = c(4, 5, 6),
C = c(7, 8, 9)
)
# Transpose
transposed <- t(df)
print(transposed)
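t() returns a matrix, not a data frame, so a data frame with mixed column types is coerced to a single type when transposed. A short sketch with a made-up `mixed` data frame:

```r
mixed <- data.frame(
  Name = c("Alice", "Bob"),
  Age = c(25, 30)
)

tm <- t(mixed)
is.matrix(tm)        # TRUE
typeof(tm)           # "character": the numbers were coerced to strings

# Convert back to a data frame if one is needed downstream
back <- as.data.frame(tm)
is.data.frame(back)  # TRUE
```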
Complete Practical Example
# Load sample data
data(mtcars)
# Overview
head(mtcars)
str(mtcars)
# Create analysis data frame (dplyr provides %>%, mutate(), group_by(), summarize(), arrange())
library(dplyr)
analysis <- mtcars %>%
mutate(
efficiency = mpg / wt,
power_class = cut(hp, breaks = c(0, 100, 150, Inf),
labels = c("Low", "Medium", "High"))
) %>%
group_by(cyl, power_class) %>%
summarize(
count = n(),
avg_mpg = mean(mpg),
avg_hp = mean(hp),
avg_wt = mean(wt),
.groups = 'drop'
) %>%
arrange(desc(avg_mpg))
print(analysis)
Best Practices
- Use descriptive column names - Makes code readable
- Keep consistent data types per column - Avoid mixed types
- Handle NAs explicitly - Check and handle missing values
- Use factors for categorical data - Fixed levels make ordering and grouping explicit in models and plots
- Avoid side effects - Don’t modify original data frames
- Use dplyr for complex operations - More readable than base R
- Index carefully - Remember row/column order in subsetting
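Two of these practices, handling NAs explicitly and not modifying the original data frame, can be sketched together. The `readings` data below is made up for illustration.

```r
readings <- data.frame(ID = 1:4, Value = c(10, NA, 30, 40))

# Check for missing values before summarizing
sum(is.na(readings$Value))            # 1

mean(readings$Value)                  # NA: missing values propagate
mean(readings$Value, na.rm = TRUE)    # mean of 10, 30 and 40

# Store the cleaned result under a new name; the original is unchanged
cleaned <- readings[!is.na(readings$Value), ]
nrow(readings)                        # 4
nrow(cleaned)                         # 3
```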
Common Questions
Q: What’s the difference between df$col and df[["col"]]?
A: Both return the column as a vector. $ needs the literal column name and allows partial matching, while [[ ]] matches names exactly and also works with a name stored in a variable: df[[col_name]].
Q: How do I add a column based on another column?
A: Use $ or mutate(): df$new_col <- df$existing_col * 2
Q: How do I remove rows with NA values?
A: Use df[!is.na(df$column), ] or na.omit(df) to remove any rows with NA.
Q: What’s the fastest way to merge large data frames?
A: For large tables, converting to data.table and calling merge() on the data.table objects, or using dplyr::left_join(), is often faster than base merge() on plain data frames.
Q: How do I rename columns?
A: Use names(df)[1] <- "new_name" or dplyr::rename(df, new_name = old_name)
Q: Can I modify multiple columns at once?
A: Yes, use across() in dplyr: df %>% mutate(across(col1:col3, as.numeric))
Q: What’s the difference between data.frame and data.table?
A: data.table is faster for large datasets, but data.frame is more widely compatible. Use data.table for big data operations.
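As an illustration of the first question above, the following sketch contrasts $ and [[ ]] when the column name is held in a variable or only partially typed. The data frame and the `col_name` variable are made up for this example.

```r
df <- data.frame(Salary = c(50000, 60000))
col_name <- "Salary"

df[[col_name]]   # the Salary column: [[ ]] evaluates the variable
df$col_name      # NULL: $ looks for a column literally named "col_name"

df$Sal           # the Salary column: $ allows partial name matching
df[["Sal"]]      # NULL: [[ ]] requires an exact name by default
```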
Related Topics
Master data frames as the foundation for advanced R data analysis:
- R Descriptive Statistics - Complete Guide - Analyze data frame columns
- R Data Transformation - Complete Guide - Advanced data manipulation
- R Data Visualization - Complete Guide - Visualize data frame data