R comes with built-in datasets that are invaluable for learning, teaching, and prototyping data analysis techniques. These datasets cover diverse domains including biology, automobiles, gemology, and real estate. Understanding these standard datasets allows you to quickly learn new techniques without spending time on data collection and cleaning.
This comprehensive guide covers the most important built-in datasets with structure, examples, and practical analyses.
The Iris Dataset
The iris dataset is one of the most famous datasets in statistics and machine learning, introduced by Ronald Fisher in 1936.
Dataset Structure
# Load iris dataset
data(iris)
# View first few rows
head(iris)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 5.1 3.5 1.4 0.2 setosa
# 2 4.9 3.0 1.4 0.2 setosa
# Dataset dimensions
dim(iris) # [1] 150 5
nrow(iris) # [1] 150
ncol(iris) # [1] 5
# Column names and types
names(iris)
# [1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
str(iris)
# 'data.frame': 150 obs. of 5 variables:
# $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 ...
# $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 ...
# $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 ...
# $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 ...
# $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 ...
# Species distribution
table(iris$Species)
# setosa versicolor virginica
# 50 50 50
Basic Analyses
# Summary statistics
summary(iris)
# Sepal.Length Sepal.Width Petal.Length Petal.Width
# Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
# 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
# Median :5.800 Median :3.000 Median :4.350 Median :1.300
# Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
# 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
# Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
# Species
# setosa :50
# versicolor:50
# virginica :50
# Mean measurements by species
by(iris[, 1:4], iris$Species, colMeans)
# Correlation of measurements
cor(iris[, 1:4])
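One subtlety worth knowing: pooled across all 150 flowers, sepal length and width are weakly negatively correlated, but within each species the correlation is positive. It is a classic illustration of why grouped data should be analyzed by group. A quick sketch:

```r
# Pooled correlation mixes three species and flips the sign
pooled <- cor(iris$Sepal.Length, iris$Sepal.Width)
pooled  # negative

# Within-species correlations tell the real story
by_species <- sapply(split(iris, iris$Species), function(d) {
  cor(d$Sepal.Length, d$Sepal.Width)
})
by_species  # all positive
```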
Visualization
# Scatter plot of sepal dimensions
plot(iris$Sepal.Length, iris$Sepal.Width,
col = as.numeric(iris$Species),
main = "Iris Sepal Dimensions",
xlab = "Sepal Length", ylab = "Sepal Width")
legend("topright", levels(iris$Species), col = 1:3, pch = 1)
# Box plot by species
boxplot(Sepal.Length ~ Species, data = iris,
main = "Sepal Length by Species")
# Faceted scatter plot using ggplot2
library(ggplot2)
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
geom_point() +
facet_wrap(~Species) +
theme_minimal()
The mtcars Dataset
The mtcars dataset contains fuel consumption and performance figures for 32 automobiles (1973-74 models), extracted from the 1974 Motor Trend US magazine.
Dataset Structure
# Load mtcars dataset
data(mtcars)
# View first few rows
head(mtcars)
# mpg cyl disp hp drat wt qsec vs am gear carb
# Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
# Mazda RX4 Wagon 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
# Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
# Column meanings
# mpg: Miles per gallon
# cyl: Number of cylinders
# disp: Displacement (cubic inches)
# hp: Gross horsepower
# drat: Rear axle ratio
# wt: Weight (1000 lbs)
# qsec: 1/4 mile time (seconds)
# vs: Engine shape (0=V-shaped, 1=straight)
# am: Transmission (0=Automatic, 1=Manual)
# gear: Number of forward gears
# carb: Number of carburetors
dim(mtcars) # [1] 32 11
str(mtcars)
Basic Analyses
# Summary statistics
summary(mtcars)
# Fuel efficiency by transmission
tapply(mtcars$mpg, mtcars$am, mean)
# 0 1
# 17.147 24.392
# Correlation with MPG
cor(mtcars)["mpg", ]
# Relationship between weight and MPG
plot(mtcars$wt, mtcars$mpg,
main = "Weight vs Fuel Efficiency",
xlab = "Weight (1000 lbs)",
ylab = "Miles Per Gallon")
abline(lm(mpg ~ wt, data = mtcars), col = "red")
# Linear regression
model <- lm(mpg ~ wt + hp, data = mtcars)
summary(model)
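Once fitted, the model can also generate predictions for cars not in the data. The weight and horsepower below are invented for illustration:

```r
# Fit the same model as above
model <- lm(mpg ~ wt + hp, data = mtcars)

# Hypothetical car: 3,000 lbs, 150 horsepower
new_car <- data.frame(wt = 3.0, hp = 150)

# Point prediction plus a prediction interval
predict(model, newdata = new_car, interval = "prediction")
```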
The diamonds Dataset
The diamonds dataset contains the prices and attributes of nearly 54,000 diamonds and ships with the ggplot2 package.
Dataset Structure
# Load diamonds dataset (from ggplot2)
library(ggplot2)
data(diamonds)
# View first few rows
head(diamonds)
# # A tibble: 6 × 10
# carat cut color clarity depth table price x y z
# <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
# 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
# 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
# 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
# Column meanings
# carat: Weight of the diamond (0.2 - 5.01 carats)
# cut: Quality of the cut (Fair, Good, Very Good, Premium, Ideal)
# color: Diamond color, from D (best) to J (worst)
# clarity: Clarity of the diamond (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))
# depth: Total depth percentage (2 * z / (x + y), as a percentage)
# table: Width of the top of the diamond relative to its widest point (percentage)
# price: Price in US dollars
# x, y, z: Length, width, and depth in millimeters
dim(diamonds) # [1] 53940 10
str(diamonds)
Basic Analyses
# Summary statistics
summary(diamonds)
# Price distribution
hist(diamonds$price, breaks = 50,
main = "Diamond Price Distribution",
xlab = "Price ($)")
# Average price by cut quality
aggregate(price ~ cut, data = diamonds, FUN = mean)
# cut price
# 1 Fair 4358.76
# 2 Good 3928.86
# 3 Very Good 3981.76
# 4 Premium 4584.26
# 5 Ideal 3457.54
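The table above looks backwards at first glance: Ideal cuts have the lowest average price. The reason is that cut quality is confounded with carat weight; in this dataset, better-cut diamonds tend to be smaller stones. A quick check of the confound:

```r
library(ggplot2)  # provides the diamonds dataset

# Mean carat weight by cut: better cuts tend to be smaller stones,
# which drags their average price down
aggregate(carat ~ cut, data = diamonds, FUN = mean)

# Comparing price per carat partially removes the size effect
tapply(diamonds$price / diamonds$carat, diamonds$cut, mean)
```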
# Relationship between carat weight and price
plot(diamonds$carat, diamonds$price,
main = "Diamond Carat vs Price",
xlab = "Carat Weight",
ylab = "Price ($)")
# Linear regression
model <- lm(log(price) ~ carat + cut + color, data = diamonds)
summary(model)
# Visualization with ggplot2
ggplot(diamonds, aes(x = carat, y = price, color = cut)) +
geom_point(alpha = 0.5) +
facet_wrap(~clarity) +
scale_y_log10() +
theme_minimal()
The Boston Housing Dataset
The Boston housing dataset contains information about Boston suburbs and their median home values; it is available in the MASS package.
Dataset Structure
# Note: This dataset is no longer recommended due to ethical concerns
# but is still included for historical/educational purposes
# Install MASS package if needed
# install.packages("MASS")
library(MASS)
data(Boston)
# View first few rows
head(Boston)
# crim zn indus chas nox rm age dis rad tax ptratio black lstat medv
# 1 0.00632 18.0 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98 24.0
# 2 0.02731 0.0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14 21.6
# Column meanings (note: names are lowercase in MASS)
# crim: Per capita crime rate by town
# zn: Proportion of residential land zoned for lots over 25,000 sq.ft
# indus: Proportion of non-retail business acres per town
# chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
# nox: Nitrogen oxides concentration (parts per 10 million)
# rm: Average number of rooms per dwelling
# age: Proportion of owner-occupied units built prior to 1940
# dis: Weighted distances to five Boston employment centers
# rad: Index of accessibility to radial highways
# tax: Full-value property-tax rate per $10,000
# ptratio: Pupil-teacher ratio by town
# black: 1000 * (Bk - 0.63)^2, where Bk is the proportion of Black residents by town
# lstat: Percentage of lower-status population
# medv: Median value of owner-occupied homes in $1000s
dim(Boston) # [1] 506 14
str(Boston)
Basic Analyses
# Summary statistics
summary(Boston)
# Median home price distribution
hist(Boston$medv, breaks = 30,
main = "Boston Housing Prices",
xlab = "Median Home Value ($1000s)")
# Correlation with median price
cor(Boston)["medv", ] |> sort(decreasing = TRUE)
# Multiple regression
model <- lm(medv ~ rm + lstat + crim + age, data = Boston)
summary(model)
# Key findings
# - More rooms (rm) is associated with higher prices
# - A higher lower-status population percentage (lstat) is associated with lower prices
# - Higher crime (crim) is associated with lower prices
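As with mtcars, the fitted model can be used for prediction. The tract characteristics below are invented for illustration (note that the MASS column names are lowercase):

```r
library(MASS)  # provides the Boston dataset

# Refit the regression from above
model <- lm(medv ~ rm + lstat + crim + age, data = Boston)

# Hypothetical tract: 6 rooms per dwelling, 10% lower-status
# population, low crime, half the units built before 1940
new_tract <- data.frame(rm = 6, lstat = 10, crim = 0.1, age = 50)
predict(model, newdata = new_tract)  # predicted medv, in $1000s
```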
Other Important Built-in Datasets
ChickWeight Dataset
data(ChickWeight)
# Chicken growth data
# Columns: weight, Time, Chick, Diet
# 578 observations of weight measurements for 50 chickens
head(ChickWeight)
plot(ChickWeight$Time, ChickWeight$weight, type = "n",
main = "Chicken Growth by Diet",
xlab = "Time (days)",
ylab = "Weight (grams)")
for (i in unique(ChickWeight$Chick)) {
bird <- subset(ChickWeight, Chick == i)
lines(bird$Time, bird$weight, col = as.numeric(bird$Diet))
}
legend("topleft", legend = paste("Diet", 1:4), col = 1:4, lty = 1)
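To compare the diets numerically rather than visually, one option is to average the final measurements. The last measurement day is 21, and some chicks drop out early, so only birds measured on day 21 are included:

```r
# Mean final weight (day 21) by diet
final <- subset(ChickWeight, Time == 21)
tapply(final$weight, final$Diet, mean)
```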
PlantGrowth Dataset
data(PlantGrowth)
# Plant growth under different conditions
# 30 observations, weight by group
head(PlantGrowth)
boxplot(weight ~ group, data = PlantGrowth,
main = "Plant Growth by Treatment Group")
# ANOVA test
summary(aov(weight ~ group, data = PlantGrowth))
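If the ANOVA is significant, a post-hoc test identifies which pairs of groups actually differ. One standard follow-up is Tukey's Honest Significant Difference:

```r
# Fit the ANOVA, then run pairwise comparisons
fit <- aov(weight ~ group, data = PlantGrowth)
TukeyHSD(fit)  # pairwise differences with adjusted p-values
```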
USArrests Dataset
data(USArrests)
# 1973 arrest statistics by US state
# 50 observations, arrest rates and urbanization
head(USArrests)
# Correlation analysis
cor(USArrests)
# Standardize and perform PCA
pca <- prcomp(USArrests, scale = TRUE)
summary(pca)
plot(pca$x[, 1:2], type = "n")
text(pca$x[, 1], pca$x[, 2], rownames(USArrests))
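summary(pca) already reports variance proportions, but computing them directly from the component standard deviations makes the relationship explicit:

```r
# Proportion of variance explained by each principal component
pca <- prcomp(USArrests, scale = TRUE)
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
round(var_explained, 3)

# Cumulative proportion: how many components are worth keeping?
cumsum(var_explained)
```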
Accessing Dataset Information
# List all available datasets
data()
# Load and view documentation
?iris
?mtcars
?diamonds # (requires ggplot2)
# Get dataset info
help(datasets)
# Search for datasets by topic
RSiteSearch("datasets") # Online search
Best Practices for Using Built-in Datasets
- Know your data - Always examine structure with str() and summary()
- Check for missing values - Use sum(is.na(data))
- Understand variables - Read documentation with ?dataset_name
- Document sources - Cite the original source in your work
- Be aware of context - Some datasets have limitations or biases
- Use for learning - Ideal for prototyping before using production data
- Explore fully - Use visualization and summary statistics
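These first-look checks can be bundled into a small helper. The quick_look function below is a hypothetical convenience wrapper, not part of base R:

```r
# Hypothetical helper that runs the standard first-look checks
quick_look <- function(df) {
  cat("Dimensions:", nrow(df), "rows x", ncol(df), "cols\n")
  cat("Missing values:", sum(is.na(df)), "\n")
  str(df)
  summary(df)
}

quick_look(iris)
</imports>
```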
Common Questions
Q: Where can I find more information about a dataset?
A: Use ?dataset_name or help(dataset_name) to view full documentation
Q: Are there datasets beyond these main ones?
A: Yes, many packages include datasets. Use data() to list all available, or search packages
Q: Can I use these datasets in my own projects?
A: Yes, they ship with R and are freely usable for teaching, research, and prototyping
Q: How do I load a dataset from a specific package?
A: Use library(package) then data(dataset) or directly reference with data(dataset, package = "packagename")
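For example, the Cars93 dataset from MASS can be loaded either way:

```r
# Option 1: attach the package, then load the dataset
library(MASS)
data(Cars93)

# Option 2: load the dataset without attaching the whole package
data(Cars93, package = "MASS")

head(Cars93)
```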
Q: What if I need a larger dataset?
A: Consider packages like nycflights13, babynames, fivethirtyeight, or online sources like Kaggle
Related Topics
Build on these datasets for analysis practice:
- R Descriptive Statistics - Complete Guide - Analyze dataset summaries
- R Data Visualization - Complete Guide - Visualize dataset patterns
- R Hypothesis Testing - Complete Guide - Test hypotheses with built-in data
Download R Script
Get all code examples from this tutorial: builtin-datasets-examples.R