R ships with built-in datasets, and many packages bundle more, that are invaluable for learning, teaching, and prototyping data analysis techniques. The datasets covered here span biology, automobiles, gemology, and real estate. Knowing these standard datasets lets you try new techniques without first collecting and cleaning data.

This guide covers the most widely used of these datasets: their structure, example code, and practical analyses. Note that iris and mtcars ship with base R (in the datasets package), while diamonds comes from ggplot2 and Boston from MASS.

The Iris Dataset

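The iris dataset is one of the most famous datasets in statistics and machine learning. Ronald Fisher introduced it in a 1936 paper, using measurements collected by Edgar Anderson: sepal and petal length and width (in centimeters) for 50 flowers from each of three iris species.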

Dataset Structure

# Load iris dataset
data(iris)

# View first few rows
head(iris)
#   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1          5.1         3.5          1.4         0.2  setosa
# 2          4.9         3.0          1.4         0.2  setosa

# Dataset dimensions
dim(iris)           # [1] 150   5
nrow(iris)          # [1] 150
ncol(iris)          # [1] 5

# Column names and types
names(iris)
# [1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"

str(iris)
# 'data.frame': 150 obs. of  5 variables:
# $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 ...
# $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 ...
# $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 ...
# $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 ...
# $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 ...

# Species distribution
table(iris$Species)
# setosa versicolor virginica
#     50          50         50

Basic Analyses

# Summary statistics
summary(iris)
#   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width
#  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100
#  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300
#  Median :5.800   Median :3.000   Median :4.350   Median :1.300
#  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
#  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
#  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500
#        Species
#  setosa    :50
#  versicolor:50
#  virginica :50

# Mean measurements by species
by(iris[, 1:4], iris$Species, colMeans)

# Correlation of measurements
cor(iris[, 1:4])
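
The by() call above prints one block per species. As an alternative sketch, aggregate() returns the same per-species means as a single data frame, which is easier to reuse in later steps. The name species_means is only an illustrative choice.

```r
# Mean of every numeric column within each species, as one data frame
species_means <- aggregate(. ~ Species, data = iris, FUN = mean)
print(species_means)

# Look up a single value, e.g. the mean petal length of setosa
species_means[species_means$Species == "setosa", "Petal.Length"]
```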

Visualization

# Scatter plot of sepal dimensions
plot(iris$Sepal.Length, iris$Sepal.Width,
     col = as.numeric(iris$Species),
     main = "Iris Sepal Dimensions",
     xlab = "Sepal Length", ylab = "Sepal Width")
legend("topright", levels(iris$Species), col = 1:3, pch = 1)

# Box plot by species
boxplot(Sepal.Length ~ Species, data = iris,
        main = "Sepal Length by Species")

# Faceted scatter plot using ggplot2 (one panel per species)
library(ggplot2)
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_point() +
  facet_wrap(~Species) +
  theme_minimal()
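
For a scatter-plot matrix of all four measurements at once, base R's pairs() is one option that needs no extra package. This is a minimal sketch; the colors in species_cols are an arbitrary choice.

```r
# One color per flower, looked up by the species factor level
species_cols <- c("black", "red", "green3")[as.integer(iris$Species)]

# Scatter-plot matrix of the four numeric measurements
pairs(iris[, 1:4],
      col = species_cols,
      pch = 19,
      main = "Iris Measurements by Species")
```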

The mtcars Dataset

The mtcars dataset contains fuel consumption and 10 aspects of design and performance for 32 automobiles (1973-74 models), extracted from the 1974 Motor Trend US magazine.

Dataset Structure

# Load mtcars dataset
data(mtcars)

# View first few rows
head(mtcars)
#                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
# Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
# Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
# Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1

# Column meanings
# mpg: Miles per gallon
# cyl: Number of cylinders
# disp: Displacement (cubic inches)
# hp: Gross horsepower
# drat: Rear axle ratio
# wt: Weight (1000 lbs)
# qsec: 1/4 mile time (seconds)
# vs: Engine shape (0=V-shaped, 1=Straight)
# am: Transmission (0=Automatic, 1=Manual)
# gear: Number of forward gears
# carb: Number of carburetors

dim(mtcars)        # [1] 32 11
str(mtcars)
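
Several mtcars columns (cyl, vs, am) are stored as numbers even though they are categorical. A sketch of recoding them as labeled factors follows; it works on a copy (here called cars, a hypothetical name) so the original data frame stays untouched.

```r
cars <- mtcars  # work on a copy

# Attach readable labels to the 0/1 codes described above
cars$am  <- factor(cars$am, levels = c(0, 1), labels = c("Automatic", "Manual"))
cars$vs  <- factor(cars$vs, levels = c(0, 1), labels = c("V-shaped", "Straight"))
cars$cyl <- factor(cars$cyl)

# Counts per transmission type
table(cars$am)
```

With factors in place, functions such as boxplot() and lm() treat these columns as groups rather than as numeric quantities.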

Basic Analyses

# Summary statistics
summary(mtcars)

# Fuel efficiency by transmission
tapply(mtcars$mpg, mtcars$am, mean)
#        0        1
# 17.14737 24.39231

# Correlation with MPG
cor(mtcars)["mpg", ]

# Relationship between weight and MPG
plot(mtcars$wt, mtcars$mpg,
     main = "Weight vs Fuel Efficiency",
     xlab = "Weight (1000 lbs)",
     ylab = "Miles Per Gallon")
abline(lm(mpg ~ wt, data = mtcars), col = "red")

# Linear regression
model <- lm(mpg ~ wt + hp, data = mtcars)
summary(model)
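
Once the model is fitted, its coefficients can be extracted and used for prediction. The sketch below refits the same model and predicts mpg for a hypothetical car weighing 3,000 lbs with 110 hp; fit and new_car are illustrative names, and the car is not a row of the dataset.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Estimated intercept and slopes
coefs <- coef(fit)
print(coefs)

# Predicted fuel efficiency for a hypothetical car (wt is in 1000 lbs)
new_car <- data.frame(wt = 3.0, hp = 110)
predicted_mpg <- predict(fit, newdata = new_car)
print(predicted_mpg)
```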

The diamonds Dataset

The diamonds dataset, which ships with the ggplot2 package, contains the prices and characteristics of nearly 54,000 diamonds.

Dataset Structure

# Load diamonds dataset (from ggplot2)
library(ggplot2)
data(diamonds)

# View first few rows
head(diamonds)
# # A tibble: 6 × 10
#   carat cut       color clarity depth table price     x     y     z
#   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
# 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
# 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
# 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31

# Column meanings
# carat: Weight of diamond (0.2 - 5.01 carats)
# cut: Quality of cut (Fair, Good, Very Good, Premium, Ideal)
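# color: Diamond color, from D (best) to J (worst)
# clarity: Clarity, from I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, to IF (best)
# depth: Total depth percentage = z / mean(x, y) = 2 * z / (x + y)
# table: Width of the top of the diamond relative to its widest point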
# price: Price in US dollars
# x, y, z: Dimensions in millimeters

dim(diamonds)       # [1] 53940    10
str(diamonds)

Basic Analyses

# Summary statistics
summary(diamonds)

# Price distribution
hist(diamonds$price, breaks = 50,
     main = "Diamond Price Distribution",
     xlab = "Price ($)")

# Average price by cut quality
aggregate(price ~ cut, data = diamonds, FUN = mean)
#         cut  price
# 1      Fair 4358.76
# 2      Good 3928.86
# 3 Very Good 3981.76
# 4   Premium 4584.26
# 5     Ideal 3457.54

# Relationship between carat weight and price
plot(diamonds$carat, diamonds$price,
     main = "Diamond Carat vs Price",
     xlab = "Carat Weight",
     ylab = "Price ($)")

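# Linear regression on log(price); cut and color are ordered factors,
# so lm() encodes them with polynomial contrasts by default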
model <- lm(log(price) ~ carat + cut + color, data = diamonds)
summary(model)

# Visualization with ggplot2
ggplot(diamonds, aes(x = carat, y = price, color = cut)) +
  geom_point(alpha = 0.5) +
  facet_wrap(~clarity) +
  scale_y_log10() +
  theme_minimal()
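
The price-by-cut table above can look surprising, since Fair diamonds average a higher price than Ideal ones. One plausible explanation is that cut is confounded with size. A quick check, assuming ggplot2 is installed, compares mean carat by cut; carat_by_cut is an illustrative name.

```r
# diamonds ships with ggplot2; load only the dataset, not the whole package
data("diamonds", package = "ggplot2")

# Mean carat weight within each cut grade
carat_by_cut <- aggregate(carat ~ cut, data = diamonds, FUN = mean)
print(carat_by_cut)

# Mean price per carat is one rough way to adjust price for size
with(diamonds, tapply(price / carat, cut, mean))
```

Price per carat is only a crude adjustment, because price does not grow linearly with carat; a regression such as the one above is the more careful approach.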

The Boston Housing Dataset

The Boston dataset, from the MASS package, describes housing values in the suburbs of Boston together with neighborhood characteristics. It has 506 rows and 14 columns.

Dataset Structure

# Note: This dataset is no longer recommended due to ethical concerns
# but is still included for historical/educational purposes
# Install MASS package if needed
# install.packages("MASS")
library(MASS)
data(Boston)

# View first few rows
head(Boston)
#      crim zn indus chas   nox    rm  age    dis rad tax ptratio
# 1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3
# 2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8
# (columns black, lstat, and medv omitted here for width)

# Column meanings (column names in MASS::Boston are lowercase)
# crim: Per capita crime rate by town
# zn: Proportion of residential land zoned for lots over 25,000 sq.ft
# indus: Proportion of non-retail business acres per town
# chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
# nox: Nitric oxides concentration (parts per 10 million)
# rm: Average number of rooms per dwelling
# age: Proportion of owner-occupied units built prior to 1940
# dis: Weighted distances to five Boston employment centers
# rad: Index of accessibility to radial highways
# tax: Full-value property-tax rate per $10,000
# ptratio: Pupil-teacher ratio by town
# black: 1000(Bk - 0.63)^2, where Bk is the proportion of Black residents by town
#        (the main source of the ethical concerns noted above)
# lstat: Percentage of lower-status population
# medv: Median value of owner-occupied homes in $1000s

dim(Boston)         # [1] 506  14
str(Boston)

Basic Analyses

# Summary statistics
summary(Boston)

# Median home value distribution
hist(Boston$medv, breaks = 30,
     main = "Boston Housing Prices",
     xlab = "Median Home Value ($1000s)")

# Correlation with median value (the |> pipe needs R 4.1 or later)
cor(Boston)["medv", ] |> sort(decreasing = TRUE)

# Multiple regression (column names in MASS::Boston are lowercase)
model <- lm(medv ~ rm + lstat + crim + age, data = Boston)
summary(model)

# Key findings (associations, not causal effects)
# - More rooms (rm) is associated with a higher median value
# - A higher percentage of lower-status population (lstat) is associated with a lower value
# - Higher crime (crim) is associated with a lower value

Other Important Built-in Datasets

ChickWeight Dataset

data(ChickWeight)
# Chick growth data
# Columns: weight (grams), Time (days), Chick (ID), Diet (1-4)
# 578 weight measurements on 50 chicks

head(ChickWeight)

# Draw an empty frame, then one growth curve per chick, colored by diet
plot(ChickWeight$Time, ChickWeight$weight, type = "n",
     main = "Chicken Growth by Diet",
     xlab = "Time (days)",
     ylab = "Weight (grams)")
for (i in unique(ChickWeight$Chick)) {
  chick <- subset(ChickWeight, Chick == i)
  lines(chick$Time, chick$weight, col = as.integer(chick$Diet[1]))
}
legend("topleft", legend = paste("Diet", levels(ChickWeight$Diet)),
       col = 1:4, lty = 1)

PlantGrowth Dataset

data(PlantGrowth)
# Plant growth under different conditions
# 30 observations, weight by group
head(PlantGrowth)
boxplot(weight ~ group, data = PlantGrowth,
        main = "Plant Growth by Treatment Group")

# ANOVA test
summary(aov(weight ~ group, data = PlantGrowth))
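
An ANOVA tests whether at least one group mean differs, but it does not say which groups differ. A common follow-up is Tukey's honest significant difference test, sketched here with TukeyHSD() from the stats package; fit and tukey are illustrative names.

```r
fit <- aov(weight ~ group, data = PlantGrowth)

# Pairwise comparisons of group means with adjusted p-values
tukey <- TukeyHSD(fit)
print(tukey)

# Adjusted p-value for each pair of groups
tukey$group[, "p adj"]
```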

USArrests Dataset

data(USArrests)
# 1973 arrest statistics by US state
# 50 observations: arrests per 100,000 residents for Murder, Assault, and Rape,
# plus UrbanPop (percent of the population living in urban areas)
head(USArrests)

# Correlation analysis
cor(USArrests)

# Standardize and perform PCA (the argument is spelled scale., with a trailing dot)
pca <- prcomp(USArrests, scale. = TRUE)
summary(pca)
plot(pca$x[, 1:2], type = "n")
text(pca$x[, 1], pca$x[, 2], rownames(USArrests))
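
To judge how many components are worth keeping, the share of variance each one explains can be computed directly from the component standard deviations. A minimal sketch, with prop_var as an illustrative name:

```r
pca <- prcomp(USArrests, scale. = TRUE)

# The variance of each component is the square of its standard deviation
variance <- pca$sdev^2
prop_var <- variance / sum(variance)
names(prop_var) <- colnames(pca$x)

round(prop_var, 3)          # share of variance per component
round(cumsum(prop_var), 3)  # cumulative share

# Loadings: how each original variable contributes to each component
pca$rotation
```

These are the same proportions that summary(pca) reports, computed by hand to show where they come from.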

Accessing Dataset Information

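# List datasets in the currently attached packages
data()

# List datasets in all installed packages
data(package = .packages(all.available = TRUE))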

# Load and view documentation
?iris
?mtcars
?diamonds  # (requires ggplot2)

# Get an overview of the datasets package
library(help = "datasets")

# Search for datasets by topic
RSiteSearch("datasets")  # Online search
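
The listing printed by data() can also be captured and filtered in code. The sketch below relies on the results matrix that data() returns invisibly; dataset_index and iris_sets are illustrative names.

```r
# data() returns an object whose $results matrix has one row per dataset
listing <- data(package = "datasets")$results
dataset_index <- as.data.frame(listing[, c("Item", "Title")])

head(dataset_index)

# Keep only entries whose name contains "iris"
iris_sets <- dataset_index[grepl("iris", dataset_index$Item), ]
print(iris_sets)
```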

Best Practices for Using Built-in Datasets

  1. Know your data - Always examine structure with str() and summary()
  2. Check for missing values - Use sum(is.na(data))
  3. Understand variables - Read documentation with ?dataset_name
  4. Document sources - Cite the original source in your work
  5. Be aware of context - Some datasets have limitations or biases
  6. Use for learning - Ideal for prototyping before using production data
  7. Explore fully - Use visualization and summary statistics
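
The first two practices can be combined into a small helper. The function below, inspect_dataset(), is a hypothetical name used only for illustration; it is applied to airquality, a built-in dataset that does contain missing values.

```r
# Hypothetical helper: report dimensions, column classes, and NA counts
inspect_dataset <- function(df) {
  list(
    dimensions = dim(df),
    classes    = vapply(df, function(col) class(col)[1], character(1)),
    na_counts  = colSums(is.na(df))
  )
}

report <- inspect_dataset(airquality)
report$dimensions
report$classes
report$na_counts
```

Counting missing values per column, rather than with a single sum(is.na(data)), shows which variables are affected.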

Common Questions

Q: Where can I find more information about a dataset? A: Use ?dataset_name or help(dataset_name) to view full documentation

Q: Are there datasets beyond these main ones? A: Yes, many packages include datasets. Use data() to list those in attached packages, or data(package = "packagename") to list those in a specific installed package.

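Q: Can I use these datasets in my own projects? A: Generally yes. They are distributed with R or with freely available packages and are widely used for teaching and research. Check each dataset's help page for its original source and cite it; datasets from add-on packages are covered by that package's license.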

Q: How do I load a dataset from a specific package? A: Use library(package) then data(dataset) or directly reference with data(dataset, package = "packagename")

Q: What if I need a larger dataset? A: Consider packages like nycflights13, babynames, fivethirtyeight, or online sources like Kaggle
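
The data(dataset, package = "packagename") form mentioned above can be sketched as follows, together with the :: operator, another common way to reach a dataset in a package that supports lazy loading. The example uses the datasets package so that it runs on any R installation; the environment name scratch is arbitrary.

```r
# Style 1: data() with an explicit package, loaded into a separate environment
scratch <- new.env()
data("iris", package = "datasets", envir = scratch)
ls(scratch)
dim(scratch$iris)

# Style 2: reference the dataset directly with ::,
# which avoids attaching the package at all
head(datasets::mtcars, 3)
```

Loading into a separate environment keeps the global workspace clean, which helps when a dataset name clashes with one of your own objects.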
