Exploratory data analysis (EDA) is an essential step in understanding a dataset's structure, the relationships between its variables, and potential issues. In R, you can perform EDA with the dplyr, ggplot2, and summarytools packages. Follow the steps below to perform exploratory data analysis in R.

## Step 1: Load packages and dataset

First, load the necessary packages and the dataset:

library(tidyverse)

data(diamonds)


Here we load the diamonds dataset, which ships with ggplot2 (attached as part of the tidyverse).
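Before summarizing, it can help to take a quick look at the shape and column types of the data. The snippet below is a minimal sketch of that; it attaches ggplot2 and dplyr directly (both are part of the tidyverse), and the variable name `dims` is only illustrative.

```r
library(ggplot2)  # provides the diamonds dataset
library(dplyr)    # provides glimpse()

data(diamonds)

# Number of rows and columns
dims <- dim(diamonds)
print(dims)

# Column names, types, and the first few values of each column
glimpse(diamonds)

# First six rows
head(diamonds)
```

`glimpse()` is convenient for wide datasets because it prints one column per line instead of one row per line.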

## Step 2: Summarize the data

Use the summary() function to summarize the data:

# Get statistical values
summary(diamonds)


Output:

     carat               cut        color        clarity          depth           table
 Min.   :0.2000   Fair     : 1610   D: 6775   SI1    :13065   Min.   :43.00   Min.   :43.00
 1st Qu.:0.4000   Good     : 4906   E: 9797   VS2    :12258   1st Qu.:61.00   1st Qu.:56.00
 Median :0.7000   Very Good:12082   F: 9542   SI2    : 9194   Median :61.80   Median :57.00
 Mean   :0.7979   Premium  :13791   G:11292   VS1    : 8171   Mean   :61.75   Mean   :57.46
 3rd Qu.:1.0400   Ideal    :21551   H: 8304   VVS2   : 5066   3rd Qu.:62.50   3rd Qu.:59.00
 Max.   :5.0100                     I: 5422   VVS1   : 3655   Max.   :79.00   Max.   :95.00
                                    J: 2808   (Other): 2531
     price             x                y                z
 Min.   :  326   Min.   : 0.000   Min.   : 0.000   Min.   : 0.000
 1st Qu.:  950   1st Qu.: 4.710   1st Qu.: 4.720   1st Qu.: 2.910
 Median : 2401   Median : 5.700   Median : 5.710   Median : 3.530
 Mean   : 3933   Mean   : 5.731   Mean   : 5.735   Mean   : 3.539
 3rd Qu.: 5324   3rd Qu.: 6.540   3rd Qu.: 6.540   3rd Qu.: 4.040
 Max.   :18823   Max.   :10.740   Max.   :58.900   Max.   :31.800


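The output shows summary statistics for each column of the dataset: the minimum, quartiles, mean, and maximum for numeric columns, and the number of rows in each level for the categorical columns (cut, color, and clarity).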
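The summary above reports a minimum of 0.000 for x, y, and z, which is not a plausible physical dimension and may indicate measurements that were recorded as zero when missing. The following sketch checks for explicit missing values and counts the rows with a zero dimension; treating zeros as suspect is an assumption here, and `na_counts` and `zero_dims` are illustrative names.

```r
library(ggplot2)  # provides the diamonds dataset
library(dplyr)

# Count explicit missing values (NA) in each column
na_counts <- colSums(is.na(diamonds))
print(na_counts)

# Rows where at least one dimension is recorded as zero
zero_dims <- diamonds %>%
  filter(x == 0 | y == 0 | z == 0)

print(nrow(zero_dims))
```

Whether to drop or repair such rows depends on the analysis; the point of this step is only to surface them.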

## Step 3: Visualize the dataset

You can use the ggplot2 package to visualize the dataset:

# Create histogram of values for price
# (bins = 30 matches the default and avoids the stat_bin() message)
ggplot(data = diamonds, aes(x = price)) +
  geom_histogram(bins = 30, fill = "steelblue", color = "black") +
  ggtitle("Histogram of Price Values")


Output:

The snippet above creates a histogram of the price column using the geom_histogram() function.
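In the summary output, the mean price (3933) is well above the median (2401), which suggests a right-skewed distribution. One option, shown here as a sketch rather than a required step, is to plot the same histogram on a log10 x-axis; the choice of 50 bins and the name `p_log` are assumptions for illustration.

```r
library(ggplot2)

# Histogram of price on a log10 x-axis to spread out the long right tail
p_log <- ggplot(data = diamonds, aes(x = price)) +
  geom_histogram(bins = 50, fill = "steelblue", color = "black") +
  scale_x_log10() +
  ggtitle("Histogram of Price Values (log10 scale)")

print(p_log)
```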

# Create scatterplot of carat vs. price, using cut as color variable
ggplot(data = diamonds, aes(x = carat, y = price, color = cut)) +
  geom_point()


Output:

The snippet above creates a scatterplot using the geom_point() function. The scatterplot shows the relationship between the carat and price columns of the dataset: price tends to rise as carat increases.
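To put a number on the relationship seen in the scatterplot, you could compute a correlation coefficient. The sketch below uses base R's cor(); reporting both Pearson and Spearman is a choice made here, since the relationship between carat and price is not necessarily linear.

```r
library(ggplot2)  # provides the diamonds dataset

# Pearson correlation: strength of the linear association
pearson <- cor(diamonds$carat, diamonds$price)

# Spearman rank correlation: strength of the monotonic association
spearman <- cor(diamonds$carat, diamonds$price, method = "spearman")

print(c(pearson = pearson, spearman = spearman))
```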

# Create boxplot of price, grouped by cut
ggplot(data = diamonds, aes(x = cut, y = price)) +
  geom_boxplot(fill = "steelblue")


Output:

The snippet above uses the geom_boxplot() function to draw one boxplot of the price column for each level of the cut column.
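The numbers behind the boxplot can also be tabulated with dplyr, which the introduction lists among the EDA packages. This is a minimal sketch; the column names `n`, `median_price`, `mean_price`, and `iqr_price` are illustrative.

```r
library(ggplot2)  # provides the diamonds dataset
library(dplyr)

# Per-cut row count and price statistics
price_by_cut <- diamonds %>%
  group_by(cut) %>%
  summarise(
    n            = n(),
    median_price = median(price),
    mean_price   = mean(price),
    iqr_price    = IQR(price)
  )

print(price_by_cut)
```

Comparing the medians in this table with the boxplot is a quick check that the plot is being read correctly.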