Data frames are absolutely central to R programming. In my work analyzing real datasets, I’d say I spend 80% of my time working with data frames in some form. They’re the bridge between raw data and analysis.

If you’re new to R, here’s what you need to understand: data frames are how R stores tabular data, the kind you’re used to seeing in Excel spreadsheets or SQL databases. Rows are your observations, columns are your variables. It's a simple concept, but mastering data frame operations will make you significantly more efficient.

This guide covers what you’ll encounter most often, from creating data frames to advanced operations like reshaping and merging. I’ve tested all examples in R 4.0+ and included real-world scenarios you’ll actually face in your work.

What Are Data Frames?

A data frame is a 2-dimensional table-like structure where:

  • Rows represent observations or records
  • Columns represent variables or features
  • Each column holds a single data type, but different columns can hold different types (numeric, character, logical, etc.)

Data frames are R’s primary data structure for tabular data and the foundation for most statistical analysis, data manipulation, and visualization tasks.

Why Use Data Frames?

  1. Mixed data types - Unlike matrices, columns can have different types
  2. Named columns - Access columns by name, not just index
  3. Familiar structure - Similar to databases or Excel spreadsheets
  4. Rich ecosystem - Hundreds of functions work with data frames
  5. Flexible operations - Powerful subsetting, merging, and reshaping capabilities
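
To make the first point concrete, here is a minimal sketch comparing a matrix with a data frame built from the same two vectors. The variable names are made up for illustration, and `stringsAsFactors = FALSE` is spelled out only so the result does not depend on the R version.

```r
# A matrix holds one type, so mixing numbers and text coerces everything
# to character.
m <- cbind(id = 1:3, name = c("a", "b", "c"))
matrix_type <- typeof(m)

# A data frame keeps each column's own type.
d <- data.frame(
  id = 1:3,
  name = c("a", "b", "c"),
  stringsAsFactors = FALSE
)
column_classes <- sapply(d, class)

print(matrix_type)
print(column_classes)
```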

Creating Data Frames

Basic Creation with data.frame()

The simplest way to create a data frame is using the data.frame() function:

# Create a data frame from vectors
employees <- data.frame(
  EmployeeID = 1:5,
  Name = c("Alice", "Bob", "Charlie", "Diana", "Eve"),
  Department = c("Sales", "IT", "HR", "Sales", "IT"),
  Salary = c(50000, 65000, 55000, 52000, 70000),
  StartDate = as.Date(c("2020-01-15", "2019-06-01", "2021-03-10", "2020-11-22", "2018-09-05"))
)

print(employees)
#   EmployeeID    Name Department Salary  StartDate
# 1          1   Alice      Sales  50000 2020-01-15
# 2          2     Bob         IT  65000 2019-06-01
# 3          3 Charlie         HR  55000 2021-03-10
# 4          4   Diana      Sales  52000 2020-11-22
# 5          5     Eve         IT  70000 2018-09-05

Creating from Vectors

Each argument to data.frame() becomes a column:

# Creating from existing vectors
# (named product_names rather than names to avoid masking base R's names())
product_names <- c("Product_A", "Product_B", "Product_C")
sales <- c(15000, 22000, 18500)
quarters <- c("Q1", "Q2", "Q1")

sales_df <- data.frame(
  Product = product_names,
  Revenue = sales,
  Quarter = quarters
)

print(sales_df)
#     Product Revenue Quarter
# 1 Product_A   15000      Q1
# 2 Product_B   22000      Q2
# 3 Product_C   18500      Q1

Creating from Lists

Convert a list to a data frame (useful after importing or processing data):

# Create from list
data_list <- list(
  ID = 1:3,
  Value = c(10, 20, 30),
  Category = c("A", "B", "A")
)

df_from_list <- as.data.frame(data_list)
print(df_from_list)
#   ID Value Category
# 1  1    10        A
# 2  2    20        B
# 3  3    30        A

Manual Data Entry

For small datasets, enter data directly:

# Manual entry for small datasets
survey_results <- data.frame(
  Response = c("Yes", "No", "Yes", "Maybe", "Yes"),
  Score = c(5, 2, 4, 3, 5)
)

# Or use data entry GUI
# df <- edit(data.frame())  # Opens spreadsheet interface

Understanding Data Frame Structure

Check if Object is a Data Frame

# Check data type
df <- data.frame(x = 1:3, y = c("a", "b", "c"))

is.data.frame(df)          # [1] TRUE
class(df)                  # [1] "data.frame"
typeof(df)                 # [1] "list"

# Data frames are lists with equal-length vectors
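
The "list of equal-length vectors" idea can be checked directly. The following is a small sketch, not an exhaustive test: it shows that list tools work column by column and that `data.frame()` rejects columns whose lengths cannot be recycled to a common length.

```r
df <- data.frame(x = 1:3, y = c("a", "b", "c"))

is_list <- is.list(df)        # a data frame is a list of columns
col_lengths <- lengths(df)    # every column has the same length

# List tools therefore operate one column at a time
col_classes <- vapply(df, function(col) class(col)[1], character(1))

# Columns of length 3 and 2 cannot be combined; capture the error message
bad <- tryCatch(
  data.frame(x = 1:3, y = 1:2),
  error = function(e) conditionMessage(e)
)

print(col_lengths)
print(col_classes)
print(bad)
```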

Examine Data Frame Properties

df <- data.frame(
  ID = 1:100,
  Value = rnorm(100),
  Category = sample(c("A", "B", "C"), 100, replace = TRUE)
)

# Dimensions and Structure
nrow(df)         # [1] 100 (number of rows)
ncol(df)         # [1] 3 (number of columns)
dim(df)          # [1] 100 3 (returns vector with rows and columns)

# Using length() function
length(df)       # [1] 3 (length returns number of columns for data frames)
names(df)        # [1] "ID" "Value" "Category" (column names)

# For vectors, length() returns number of elements
x <- c(10, 20, 30, 40, 50)
length(x)        # [1] 5

# Length of specific column
length(df$ID)    # [1] 100 (a column's length equals nrow(df))
length(df$Category)  # [1] 100

# Find number of unique values
length(unique(df$Category))  # [1] 3 (unique categories)

# Structure overview
str(df)          # Shows structure of each column
head(df, 10)     # First 10 rows
tail(df, 5)      # Last 5 rows
summary(df)      # Statistical summary

Get Column Names and Row Names

df <- data.frame(
  Name = c("Alice", "Bob"),
  Age = c(25, 30),
  City = c("NY", "LA")
)

# Column names
colnames(df)     # [1] "Name" "Age" "City"
names(df)        # [1] "Name" "Age" "City"

# Row names (default: 1, 2, 3, ...)
rownames(df)     # [1] "1" "2"

# Set custom row names
rownames(df) <- c("Person1", "Person2")
print(df)
#          Name Age City
# Person1 Alice  25   NY
# Person2   Bob  30   LA

Working with Columns

Accessing Columns

df <- data.frame(
  Name = c("Alice", "Bob", "Charlie"),
  Age = c(25, 30, 35),
  Salary = c(50000, 60000, 70000)
)

# By name using $
df$Name           # [1] "Alice" "Bob" "Charlie"
df$Age            # [1] 25 30 35

# By position or name using brackets
df[[1]]           # Same as df$Name
df[,1]            # Same as df$Name
df[["Name"]]      # Same as df$Name

# By position with numeric index
df[,2]            # Age column
df[2]             # Data frame with only 2nd column
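
The bracket forms above differ in what they return, which matters when you pass the result to another function. A short sketch of the distinction, using the same example columns:

```r
df <- data.frame(
  Name = c("Alice", "Bob", "Charlie"),
  Age = c(25, 30, 35),
  Salary = c(50000, 60000, 70000)
)

one_col_df <- df["Age"]                  # single brackets: still a data frame
age_vector <- df[["Age"]]                # double brackets: the column itself
dropped    <- df[, "Age"]                # matrix-style, one column: a vector
kept       <- df[, "Age", drop = FALSE]  # drop = FALSE keeps the data frame

print(class(one_col_df))
print(class(age_vector))
print(class(kept))
```

If downstream code expects a data frame even when only one column is selected, `drop = FALSE` avoids a silent change of type.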

Adding Columns

df <- data.frame(
  ID = 1:3,
  Name = c("Alice", "Bob", "Charlie")
)

# Method 1: Using $
df$Age <- c(25, 30, 35)

# Method 2: Using [[ ]]
df[["Department"]] <- c("Sales", "IT", "HR")

# Method 3: Using cbind()
df <- cbind(df, Status = c("Active", "Active", "Inactive"))

# Method 4: Create new column from existing ones
df$Email <- paste0(tolower(df$Name), "@company.com")

print(df)
#   ID    Name Age Department   Status               Email
# 1  1   Alice  25      Sales   Active   alice@company.com
# 2  2     Bob  30         IT   Active     bob@company.com
# 3  3 Charlie  35         HR Inactive charlie@company.com

Adding Multiple Columns

df <- data.frame(ID = 1:3, Name = c("A", "B", "C"))

# Using cbind
new_cols <- data.frame(
  Age = c(25, 30, 35),
  City = c("NY", "LA", "Chicago")
)
df <- cbind(df, new_cols)
print(df)

# Using dplyr::mutate()
library(dplyr)
df <- df %>%
  mutate(
    Salary = c(50000, 60000, 70000),
    Department = c("Sales", "IT", "HR")
  )

Removing Columns

df <- data.frame(
  ID = 1:3,
  Name = c("Alice", "Bob", "Charlie"),
  Age = c(25, 30, 35),
  Salary = c(50000, 60000, 70000)
)

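# The methods below are alternatives; each starts from the original df

# Method 1: Set to NULL (on a copy, so df stays intact)
df1 <- df
df1$Age <- NULL

# Method 2: Use negative indexing
df2 <- df[, -3]  # Drop the 3rd column (Age)

# Method 3: Select columns to keep
df3 <- df[, c("ID", "Name")]

# Method 4: Using dplyr
library(dplyr)
df4 <- df %>% select(-Age, -Salary)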

Reordering Columns

df <- data.frame(
  ID = 1:3,
  Salary = c(50000, 60000, 70000),
  Name = c("Alice", "Bob", "Charlie"),
  Age = c(25, 30, 35)
)

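# Each method starts from the original df and yields ID, Name, Age, Salary

# Method 1: Select columns in desired order
by_name <- df[, c("ID", "Name", "Age", "Salary")]

# Method 2: Using indices (original positions: ID = 1, Salary = 2, Name = 3, Age = 4)
by_index <- df[, c(1, 3, 4, 2)]

# Method 3: Using dplyr
library(dplyr)
by_dplyr <- df %>% select(ID, Name, Age, Salary)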

Column Statistics

df <- data.frame(
  Product = c("A", "B", "C"),
  Sales = c(100, 150, 120),
  Profit = c(20, 35, 25)
)

# Mean of column
mean(df$Sales)                    # [1] 123.3333

# Multiple statistics
colMeans(df[, c("Sales", "Profit")])
#     Sales    Profit
# 123.33333  26.66667

# Count occurrences
table(df$Product)                 # Frequency table

Unique and Duplicated Values

df <- data.frame(
  ID = c(1, 2, 2, 3, 3, 3),
  Value = c("A", "B", "B", "C", "C", "C")
)

# Count unique values in column
length(unique(df$Value))          # [1] 3

# Get unique rows
unique_df <- unique(df)

# Find duplicates
duplicated(df)                    # Shows which rows are duplicates
df[!duplicated(df),]              # Remove duplicates

# Count values with condition
sum(df$ID == 2)                   # [1] 2
nrow(df[df$Value == "C",])        # [1] 3

Working with Rows

Filtering Rows

df <- data.frame(
  ID = 1:5,
  Name = c("Alice", "Bob", "Charlie", "Diana", "Eve"),
  Salary = c(50000, 65000, 55000, 52000, 70000),
  Department = c("Sales", "IT", "HR", "Sales", "IT")
)

# Filter by condition
high_earners <- df[df$Salary > 55000, ]

# Multiple conditions
it_staff <- df[df$Department == "IT" & df$Salary > 60000, ]

# Using %in% for multiple values
sales_or_it <- df[df$Department %in% c("Sales", "IT"), ]

# Using != to exclude a value
non_sales <- df[df$Department != "Sales", ]

# Using which()
df[which(df$Salary > 60000), ]
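
The logical-index form and the `which()` form shown above behave differently when the filtered column contains `NA`. Here is a sketch with a made-up missing salary to illustrate:

```r
df <- data.frame(
  Name = c("Alice", "Bob", "Charlie"),
  Salary = c(50000, NA, 70000)
)

# The comparison yields FALSE, NA, TRUE; indexing with NA produces an all-NA row
with_na_row <- df[df$Salary > 60000, ]

# which() keeps only the positions that are TRUE, so the NA is dropped
no_na_which <- df[which(df$Salary > 60000), ]

# subset() also treats NA as "not selected"
no_na_subset <- subset(df, Salary > 60000)

print(with_na_row)
print(no_na_which)
```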

Removing Rows

df <- data.frame(
  ID = 1:5,
  Value = c(10, 20, NA, 40, 50)
)

# Remove rows with NA
df_clean <- df[!is.na(df$Value), ]

# Remove specific rows by index
df <- df[-c(1, 3), ]  # Remove rows 1 and 3

# Remove duplicate rows
df_unique <- df[!duplicated(df), ]

# Remove first/last row
df <- df[-1, ]        # Remove first row
df <- df[-nrow(df), ] # Remove last row

# Using dplyr
library(dplyr)
df %>% filter(!is.na(Value))

Selecting/Filtering Rows by Condition

df <- data.frame(
  Employee = c("Alice", "Bob", "Charlie"),
  Salary = c(50000, 75000, 60000),
  Years = c(2, 5, 3)
)

# Single condition
senior_employees <- df[df$Years >= 3, ]

# Multiple AND conditions
criteria <- df[df$Salary > 55000 & df$Years >= 2, ]

# Multiple OR conditions
special_group <- df[df$Salary > 70000 | df$Years >= 5, ]

# Complex conditions
df[df$Salary > 50000 & (df$Years < 4 | df$Employee == "Bob"), ]

# Conditional row count
nrow(df[df$Salary > 60000, ])     # Number of rows matching condition

Row Names and Dimensions

df <- data.frame(
  ID = 1:5,
  Value = c(10, 20, 30, 40, 50)
)

# Get number of rows
nrow(df)              # [1] 5

# Get row names (default: 1, 2, 3, ...)
rownames(df)

# Set custom row names
rownames(df) <- c("Row_A", "Row_B", "Row_C", "Row_D", "Row_E")

# Access by row name
df["Row_A", ]

Indexing Data Frames

df <- data.frame(
  ID = 1:3,
  Name = c("Alice", "Bob", "Charlie"),
  Age = c(25, 30, 35)
)

# Single element
df[1, 2]              # First row, second column
df[1, "Name"]         # First row, "Name" column

# Entire row
df[1, ]               # First row, all columns

# Entire column
df[, 2]               # All rows, second column
df$Name               # All rows, Name column

# Multiple rows/columns
df[1:2, c(1, 3)]      # Rows 1-2, columns 1 and 3
df[c(1, 3), c("ID", "Name")]

Merging and Combining Data Frames

Merging by Keys

# Create two data frames to merge
customers <- data.frame(
  CustomerID = 1:3,
  Name = c("Alice", "Bob", "Charlie")
)

orders <- data.frame(
  OrderID = 101:103,
  CustomerID = c(1, 2, 1),
  Amount = c(500, 750, 600)
)

# Inner join (only matching rows)
merged_inner <- merge(customers, orders, by = "CustomerID")

# Left join (all from customers)
merged_left <- merge(customers, orders, by = "CustomerID", all.x = TRUE)

# Right join (all from orders)
merged_right <- merge(customers, orders, by = "CustomerID", all.y = TRUE)

# Full outer join (all rows from both)
merged_full <- merge(customers, orders, by = "CustomerID", all = TRUE)

print(merged_inner)
#   CustomerID  Name OrderID Amount
# 1          1 Alice     101    500
# 2          1 Alice     103    600
# 3          2   Bob     102    750
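
To see how the left join differs from the inner join, it helps to look at the unmatched customer. This sketch reuses the same two tables and pulls out the rows where the order columns were filled with `NA`:

```r
customers <- data.frame(
  CustomerID = 1:3,
  Name = c("Alice", "Bob", "Charlie")
)
orders <- data.frame(
  OrderID = 101:103,
  CustomerID = c(1, 2, 1),
  Amount = c(500, 750, 600)
)

merged_left <- merge(customers, orders, by = "CustomerID", all.x = TRUE)

# Customers without orders appear once, with NA in the order columns
no_orders <- merged_left[is.na(merged_left$OrderID), "Name"]

print(merged_left)
print(no_orders)
```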

Combining Rows (rbind)

df1 <- data.frame(
  ID = 1:2,
  Name = c("Alice", "Bob"),
  Score = c(85, 90)
)

df2 <- data.frame(
  ID = 3:4,
  Name = c("Charlie", "Diana"),
  Score = c(88, 92)
)

# Combine rows
combined <- rbind(df1, df2)
print(combined)
#   ID    Name Score
# 1  1   Alice    85
# 2  2     Bob    90
# 3  3 Charlie    88
# 4  4   Diana    92
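
One detail worth knowing: for data frames, `rbind()` matches columns by name rather than by position, and it stops if the names differ. A small sketch, where `df3` and the `Points` column are made up for illustration:

```r
df1 <- data.frame(
  ID = 1:2,
  Name = c("Alice", "Bob"),
  Score = c(85, 90)
)

# Same columns as df1, but in a different order
df3 <- data.frame(
  Score = c(88, 92),
  ID = 3:4,
  Name = c("Charlie", "Diana")
)

combined <- rbind(df1, df3)   # columns are aligned by name

# A differently named column is an error; capture the message
mismatch <- tryCatch(
  rbind(df1, data.frame(ID = 5L, Name = "Eve", Points = 70)),
  error = function(e) conditionMessage(e)
)

print(combined)
print(mismatch)
```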

Combining Columns (cbind)

df1 <- data.frame(ID = 1:3, Name = c("A", "B", "C"))
df2 <- data.frame(Score = c(85, 90, 88))

# Combine columns
combined <- cbind(df1, df2)
print(combined)
#   ID Name Score
# 1  1    A    85
# 2  2    B    90
# 3  3    C    88

Reshaping Data

Long to Wide Format (pivot_wider)

# Sample long format data
long_df <- data.frame(
  Country = c("USA", "USA", "USA", "UK", "UK", "UK"),
  Year = c(2020, 2021, 2022, 2020, 2021, 2022),
  Revenue = c(100, 120, 150, 80, 95, 110)
)

# Convert to wide format
library(tidyr)
wide_df <- pivot_wider(
  long_df,
  names_from = Year,
  values_from = Revenue
)

print(wide_df)
# Country `2020` `2021` `2022`
# USA       100   120   150
# UK         80    95   110

Wide to Long Format (pivot_longer)

# Sample wide format data
# check.names = FALSE keeps the year columns as "2020" rather than "X2020"
wide_df <- data.frame(
  Country = c("USA", "UK"),
  `2020` = c(100, 80),
  `2021` = c(120, 95),
  `2022` = c(150, 110),
  check.names = FALSE
)

# Convert to long format
library(tidyr)
long_df <- pivot_longer(
  wide_df,
  cols = -Country,
  names_to = "Year",
  values_to = "Revenue"
)

print(long_df)
# Country Year Revenue
# USA     2020   100
# USA     2021   120
# ...

Melt Data (wide to long)

# Using melt() for reshaping
# (reshape2 is retired; tidyr::pivot_longer() is the maintained alternative)
library(reshape2)

wide_data <- data.frame(
  ID = 1:2,
  Q1 = c(100, 150),
  Q2 = c(120, 160),
  Q3 = c(150, 180)
)

melted <- melt(
  wide_data,
  id.vars = "ID",
  variable.name = "Quarter",
  value.name = "Revenue"
)

print(melted)
#   ID Quarter Revenue
# 1  1      Q1     100
# 2  2      Q1     150
# 3  1      Q2     120
# ...

Data Type Conversion

Converting Columns

df <- data.frame(
  ID = c("1", "2", "3"),           # Character
  Score = c("85.5", "90.2", "88"),  # Character
  Status = c("1", "0", "1")         # Character
)

# Convert character to numeric
df$ID <- as.numeric(df$ID)
df$Score <- as.numeric(df$Score)

# Convert character "0"/"1" to logical (via numeric)
df$Status <- as.logical(as.numeric(df$Status))

# Convert to data.table
library(data.table)
dt <- as.data.table(df)

# Convert to matrix
matrix_form <- as.matrix(df)

str(df)  # str() prints directly; wrapping it in print() adds a stray NULL
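
Conversions like these fail quietly when a value is not a valid number: `as.numeric()` returns `NA` and only emits a warning. A minimal sketch for spotting such values, using a made-up `"n/a"` entry:

```r
raw <- c("85.5", "n/a", "88")

# as.numeric() warns and returns NA for text it cannot parse
converted <- suppressWarnings(as.numeric(raw))

# Values that were present in the input but became NA after conversion
bad_values <- raw[is.na(converted) & !is.na(raw)]

print(converted)
print(bad_values)
```

Checking `bad_values` before overwriting the column is one way to avoid losing data without noticing.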

Advanced Operations

Aggregation and Grouping

df <- data.frame(
  Department = c("Sales", "Sales", "IT", "IT", "HR"),
  Employee = c("Alice", "Bob", "Charlie", "Diana", "Eve"),
  Salary = c(50000, 55000, 70000, 75000, 48000)
)

# Using aggregate()
dept_summary <- aggregate(
  Salary ~ Department,
  data = df,
  FUN = function(x) c(Mean = mean(x), Count = length(x))
)

# Using dplyr
library(dplyr)
dept_stats <- df %>%
  group_by(Department) %>%
  summarize(
    AvgSalary = mean(Salary),
    Count = n(),
    MinSalary = min(Salary),
    MaxSalary = max(Salary)
  )
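
A caveat on the `aggregate()` call above: when `FUN` returns more than one value, the result stores them in a single matrix column rather than in separate columns. The sketch below shows one common way to flatten it; the `flat` name is just for illustration.

```r
df <- data.frame(
  Department = c("Sales", "Sales", "IT", "IT", "HR"),
  Salary = c(50000, 55000, 70000, 75000, 48000)
)

dept_summary <- aggregate(
  Salary ~ Department,
  data = df,
  FUN = function(x) c(Mean = mean(x), Count = length(x))
)

n_cols <- ncol(dept_summary)                     # 2, not 3
is_matrix_col <- is.matrix(dept_summary$Salary)  # Mean and Count live in a matrix

# Rebuilding the data frame spreads the matrix into ordinary columns
flat <- do.call(data.frame, dept_summary)

print(names(flat))
print(flat)
```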

Sorting/Ordering

df <- data.frame(
  Name = c("Charlie", "Alice", "Bob"),
  Score = c(88, 95, 90)
)

# Sort by Name
sorted_name <- df[order(df$Name), ]

# Sort by Score (descending)
sorted_score <- df[order(df$Score, decreasing = TRUE), ]

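# Multiple column sort: Score ascending, then Name descending
# (xtfrm() gives character values a numeric rank that can be negated)
sorted_multi <- df[order(df$Score, -xtfrm(df$Name)), ]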

# Using dplyr
library(dplyr)
df %>% arrange(Score, desc(Name))
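
For mixed sort directions in base R, another option is to pass a vector to `decreasing` together with `method = "radix"`. This sketch adds a hypothetical fourth row ("Dana") so there is a tie to break:

```r
df <- data.frame(
  Name = c("Charlie", "Alice", "Bob", "Dana"),
  Score = c(88, 95, 90, 90)
)

# Score descending, then Name ascending to break ties
sorted <- df[order(df$Score, df$Name,
                   decreasing = c(TRUE, FALSE),
                   method = "radix"), ]

print(sorted)
```

Note that the radix method compares strings in the C locale, so the order of names with accents or mixed case may differ from locale-aware sorting.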

Transposing

df <- data.frame(
  A = c(1, 2, 3),
  B = c(4, 5, 6),
  C = c(7, 8, 9)
)

# Transpose
transposed <- t(df)
print(transposed)
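
Keep in mind that `t()` returns a matrix, not a data frame, and that a data frame with mixed column types is coerced to character along the way. A brief sketch, where the `mixed` data frame is made up for illustration:

```r
df <- data.frame(
  A = c(1, 2, 3),
  B = c(4, 5, 6),
  C = c(7, 8, 9)
)

transposed <- t(df)
is_mat <- is.matrix(transposed)   # the result is a matrix

# Convert back if a data frame is needed downstream
back <- as.data.frame(transposed)

# With mixed types, every value becomes character
mixed <- data.frame(id = 1:2, label = c("x", "y"))
mixed_type <- typeof(t(mixed))

print(class(back))
print(mixed_type)
```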

Complete Practical Example

# Load dplyr (for %>%, mutate, group_by, summarize, arrange) and the sample data
library(dplyr)
data(mtcars)

# Overview
head(mtcars)
str(mtcars)

# Create analysis data frame
analysis <- mtcars %>%
  mutate(
    efficiency = mpg / wt,
    power_class = cut(hp, breaks = c(0, 100, 150, Inf),
                     labels = c("Low", "Medium", "High"))
  ) %>%
  group_by(cyl, power_class) %>%
  summarize(
    count = n(),
    avg_mpg = mean(mpg),
    avg_hp = mean(hp),
    avg_wt = mean(wt),
    .groups = 'drop'
  ) %>%
  arrange(desc(avg_mpg))

print(analysis)

Best Practices

  1. Use descriptive column names - Makes code readable
  2. Keep consistent data types per column - Avoid mixed types
  3. Handle NAs explicitly - Check and handle missing values
  4. Use factor for categorical data - Fixes the set of valid levels and their order for modeling and plotting
  5. Avoid side effects - Don’t modify original data frames
  6. Use dplyr for complex operations - More readable than base R
  7. Index carefully - Remember row/column order in subsetting
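
As an illustration of point 5, R's copy-on-modify semantics make it straightforward to write functions that return a changed copy and leave the input alone. A minimal sketch, where `add_bonus` and its `rate` argument are hypothetical names invented for this example:

```r
# Hypothetical helper: returns a copy of `data` with a Bonus column added
add_bonus <- function(data, rate = 0.1) {
  data$Bonus <- data$Salary * rate   # modifies the local copy only
  data
}

staff <- data.frame(
  Name = c("Alice", "Bob"),
  Salary = c(50000, 65000)
)

staff_with_bonus <- add_bonus(staff)

print(names(staff))             # the original is unchanged
print(names(staff_with_bonus))  # the copy has the new column
```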

Common Questions

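Q: What’s the difference between df$col and df[["col"]]? A: Both return the column as a vector. $ is shorter to type, but it takes the name literally and allows partial matching. [[ ]] requires an exact name by default and also works with a name stored in a variable: df[[col_name]].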
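
A short sketch of how the two accessors behave in practice, using a made-up two-column data frame:

```r
df <- data.frame(Name = c("Alice", "Bob"), Age = c(25, 30))

col_name <- "Age"
via_variable <- df[[col_name]]   # looks up the column named by the variable
via_dollar   <- df$col_name      # NULL: $ looks for a column called "col_name"

partial_dollar <- df$Na          # $ partially matches "Name"
partial_double <- df[["Na"]]     # NULL: [[ ]] needs the exact name

print(via_variable)
print(is.null(via_dollar))
print(is.null(partial_double))
```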

Q: How do I add a column based on another column? A: Use $ or mutate(): df$new_col <- df$existing_col * 2

Q: How do I remove rows with NA values? A: Use df[!is.na(df$column), ] or na.omit(df) to remove any rows with NA.
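
The two approaches are not equivalent: the first checks one column, the second checks every column. A small sketch with made-up data, which also shows `complete.cases()` as the logical-index equivalent of `na.omit()`:

```r
df <- data.frame(
  ID = 1:4,
  Value = c(10, NA, 30, 40),
  Label = c("a", "b", NA, "d")
)

by_one_column <- df[!is.na(df$Value), ]   # drops only rows where Value is NA
all_complete  <- na.omit(df)              # drops rows with NA in any column
same_rows     <- df[complete.cases(df), ] # same rows as na.omit()

print(nrow(by_one_column))
print(nrow(all_complete))
```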

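Q: What’s the fastest way to merge large data frames? A: For large tables, converting to data.table and calling merge() (which then uses data.table’s own merge method), or using dplyr::left_join(), is often faster than base merge() on plain data frames.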

Q: How do I rename columns? A: Use names(df)[1] <- "new_name" or dplyr::rename(df, new_name = old_name)

Q: Can I modify multiple columns at once? A: Yes, use across() in dplyr: df %>% mutate(across(col1:col3, as.numeric))

Q: What’s the difference between data.frame and data.table? A: data.table is faster for large datasets, but data.frame is more widely compatible. Use data.table for big data operations.

Master data frames as the foundation for advanced R data analysis.
