Objectives

Upon completion of this lesson, you will be able to:

  • build visualizations using ggplot2
  • explain the grammar of graphics used by ggplot2

Overview

ggplot2 is a popular data visualization package for R. It is built upon the principles of the Grammar of Graphics (Wickham, 2010), which is a theoretical framework for describing the structure and components of a computational visualization using a “sentence structure”.

Excel vs ggplot2

ggplot2 is widely used because it offers several key advantages over the basic plotting tools in R and over Excel:

  • Consistency and flexibility: ggplot2 allows users to create a wide range of visually appealing and informative graphics using a consistent and flexible syntax. This makes it easier to learn and apply the package across different data visualization tasks.

  • Layered approach: ggplot2 utilizes a layered approach to create complex plots by combining simple components. This enables users to build customized visualizations by adding, modifying, or removing specific layers as needed.

  • Aesthetic mapping: ggplot2 allows for easy mapping of data variables to visual aesthetics (e.g., color, size, shape) of graphical elements, enabling the creation of more informative visualizations that reveal patterns, trends, and relationships in the data.

  • Theme customization: ggplot2 provides a variety of built-in themes and the ability to create custom themes, giving users control over the appearance of their plots, including font, background, gridlines, and more.

  • Scalability: ggplot2 is designed to handle large datasets efficiently, making it suitable for working with complex data and producing publication-quality graphics.

  • Granularity of control: ggplot2 provides finer control over various aspects of a plot, including customization of scales, axes, legends, and individual graphical elements, which is difficult or impossible to achieve in Excel.

  • Reproducibility: ggplot2 code is easily reproducible, allowing users to recreate and modify visualizations quickly and consistently. This is particularly important in research and data analysis, where reproducibility is crucial for verifying results and sharing work with others.

  • Integration with R: ggplot2 is natively integrated with R, allowing seamless compatibility with other R packages for data manipulation, analysis, and modeling. This enables users to streamline their entire data analysis workflow within R, rather than switching between R and Excel.

  • Extensibility: ggplot2 is part of the wider “tidyverse” ecosystem, which includes other packages designed to work seamlessly together, allowing users to create more advanced visualizations and analyses with relative ease.

In short, ggplot2 offers a more flexible and consistent framework for data visualization that is well-suited for complex, large-scale data analysis tasks. Its advantages over Excel include greater control, reproducibility, integration with R, and extensibility, making it a popular choice for data visualization among researchers, information scientists and data analysts.

Essential ggplot2

To use ggplot2, we need to load the package (after installing it):

library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
## 
##     %+%, alpha

The following examples use the mpg data frame (actually, a tibble) that ships with ggplot2 (ggplot2::mpg); it is not the same as the mtcars data frame found in base R. Recall that a data frame is a tabular collection of variables (in the columns) and observations (in the rows).

The data frame mpg contains observations collected by the US EPA on 38 models of car.

head(ggplot2::mpg)
## # A tibble: 6 × 11
##   manufacturer model displ  year   cyl trans      drv     cty   hwy fl    class  
##   <chr>        <chr> <dbl> <int> <int> <chr>      <chr> <int> <int> <chr> <chr>  
## 1 audi         a4      1.8  1999     4 auto(l5)   f        18    29 p     compact
## 2 audi         a4      1.8  1999     4 manual(m5) f        21    29 p     compact
## 3 audi         a4      2    2008     4 manual(m6) f        20    31 p     compact
## 4 audi         a4      2    2008     4 auto(av)   f        21    30 p     compact
## 5 audi         a4      2.8  1999     6 auto(l5)   f        16    26 p     compact
## 6 audi         a4      2.8  1999     6 manual(m5) f        18    26 p     compact
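
Beyond head(), a few base R calls give a quick overview of the data before plotting. This is only a minimal sketch; it assumes nothing beyond ggplot2 being installed.

```r
library(ggplot2)

# Rows are observations, columns are variables
dim(mpg)

# Number of distinct car models in the data
length(unique(mpg$model))

# Type and first few values of every column
str(mpg)
```

The column types matter later: character columns such as class are treated as discrete when mapped to an aesthetic, while numeric columns such as displ and hwy are treated as continuous.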

Initial Canvas

With ggplot2, you begin a visualization (or plot) with the function ggplot(). ggplot() creates a blank canvas onto which you add layers.

The first argument of ggplot() is the dataset to use in the graph. The code below creates an empty visualization; correct, but not very interesting.

ggplot(data = mpg)
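
One way to see that the canvas really is empty is to store it in a variable and inspect it. The sketch below assumes that the plot object exposes its parts through `$`, which holds for the ggplot2 versions the editor is aware of; the variable name canvas is only illustrative.

```r
library(ggplot2)

# Store the plot object instead of printing it
canvas <- ggplot(data = mpg)

# The canvas already carries the data ...
nrow(canvas$data)

# ... but no layers have been added yet
length(canvas$layers)
```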

You complete your graph by adding one or more layers to ggplot(). The function geom_point() adds a layer of points to your plot, which creates a scatterplot. ggplot2 has numerous “geom_” functions that each add a different type of layer to a plot, such as geom_col() and geom_line().

Each “geom_” function in ggplot2 takes a mapping argument, which determines how variables in the data are mapped to visual properties. The mapping argument is always paired with aes(), and the x and y arguments of aes() specify which variables to map to the x and y axes.

ggplot2 looks for the mapped variables in the data argument, in this case, mpg.

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy))
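
Because a plot is built from layers, more than one geom can be added to the same canvas. The sketch below is one possible variation, not part of the original lesson: it declares the mapping once in ggplot(), so that every layer inherits it, and uses geom_smooth() with a linear model purely as a convenient second layer.

```r
library(ggplot2)

# Mapping declared once in ggplot(); each geom added afterwards inherits it
p <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +                                  # layer 1: the scatterplot
  geom_smooth(method = "lm", formula = y ~ x)     # layer 2: a fitted straight line

p
```

Putting the mapping in ggplot() avoids repeating it in each geom; a mapping given inside a geom applies to that layer only.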

A graphing template

Let’s turn this code into a reusable template for making graphs with ggplot2. To make a graph, replace the bracketed sections in the code below with a dataset, a geom function, or a collection of mappings.

ggplot(data = <DATA>) + 
  <GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))
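
As an illustration of filling in the template, the sketch below substitutes mpg for the data, geom_boxplot for the geom function, and class and hwy for the mappings. The choice of a boxplot is just an example; any geom whose required aesthetics are supplied would work the same way.

```r
library(ggplot2)

# <DATA> = mpg, <GEOM_FUNCTION> = geom_boxplot, <MAPPINGS> = x = class, y = hwy
p <- ggplot(data = mpg) +
  geom_boxplot(mapping = aes(x = class, y = hwy))

p
```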

Aesthetics

You can add a third variable, like class, to a two-dimensional scatterplot by mapping it to an aesthetic. An aesthetic is a visual property of the objects in your plot. Aesthetics include things like the size, the shape, or the color of your points. You can display a point in different ways by changing the values of its aesthetic properties. Since we already use the word “value” to describe data, let’s use the word “level” to describe aesthetic properties. The examples below map year to color, cty to size, and then both at once.

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, color = year))

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, size = cty))

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, size = cty, color = year))

# Note: mapping size to a discrete variable (such as class) is not advised, and ggplot2 warns if you do. Here cty is numeric, so mapping it to size is appropriate.
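
For a discrete variable such as class, color is usually a better choice than size. The sketch below also contrasts mapping an aesthetic to a variable (inside aes()) with setting it to a fixed level (outside aes()); the object names p_mapped and p_fixed are only illustrative.

```r
library(ggplot2)

# Mapping: color varies with class, so it goes inside aes() and gets a legend
p_mapped <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = class))

# Setting: one fixed color for every point, so it goes outside aes() and has no legend
p_fixed <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), color = "blue")

p_mapped
p_fixed
```

A common mistake is to write color = "blue" inside aes(); ggplot2 then treats "blue" as a one-level data value and assigns it a color from the default scale rather than drawing the points in blue.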

Example I: Column Chart

# Read the CSV: the first two columns become factors, the third is numeric
fn <- "AssetQualitybySector-2.csv"
df <- read.csv(fn, header = TRUE,
               colClasses = c("factor", "factor", "numeric"))

# Rescale quality to a proportion; percent_format() multiplies by 100 when labelling
df$quality <- df$quality / 100

ggplot(data = df) +
  # Side-by-side (dodged) columns, one fill color per year
  geom_col(mapping = aes(x = sector, y = quality, fill = year),
           position = position_dodge(0.7), width = 0.6) +
  scale_fill_manual(values = c("#029673", "#43b6e6", "#02289f", "#78be1f", "#77787c")) +
  # Fix the left-to-right order of the sectors
  scale_x_discrete(limits = c("Personal", "Construction", "Wholesale", "Manufacturing", "Overall")) +
  scale_y_continuous(labels = scales::percent_format(accuracy = 0.1)) +
  theme(
    # Plain background with light horizontal gridlines only
    panel.grid.minor = element_blank(),
    panel.grid.major.x = element_blank(),
    panel.background = element_blank(),
    panel.grid.major.y = element_line(color = "light gray"),
    # Legend above the plot, left-aligned, without a title
    legend.position = "top",
    legend.justification = "left",
    legend.title = element_blank(),
    # Title and caption styling
    plot.title = element_text(color = "#7ebee1"),
    plot.caption = element_text(face = "bold", color = "#7ebee1", hjust = 0,
                                margin = margin(t = 10))
  ) +
  xlab(NULL) +
  ylab(NULL) +
  ggtitle("Asset quality will continue to deteriorate across sectors") +
  # labs() only sets label text; caption styling is handled by plot.caption above
  labs(caption = "Source: Central Bank via Moody's")
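
If the CSV file is not at hand, the same technique can be tried on data that ships with ggplot2. The sketch below is an assumption-laden stand-in, not the author's example: it averages hwy by class and year from mpg with base R's aggregate(), then draws dodged columns in the same style. The object name avg, the axis label, and the caption text are illustrative choices.

```r
library(ggplot2)

# Average highway mileage per vehicle class and model year
avg <- aggregate(hwy ~ class + year, data = as.data.frame(mpg), FUN = mean)

# A factor gives one discrete fill color per year instead of a continuous gradient
avg$year <- factor(avg$year)

p <- ggplot(data = avg) +
  geom_col(mapping = aes(x = class, y = hwy, fill = year),
           position = position_dodge(0.7), width = 0.6) +
  scale_fill_manual(values = c("#029673", "#43b6e6")) +
  theme(
    panel.background = element_blank(),
    panel.grid.major.x = element_blank(),
    panel.grid.major.y = element_line(color = "lightgray"),
    legend.position = "top",
    legend.justification = "left",
    legend.title = element_blank()
  ) +
  labs(x = NULL, y = "Average highway miles per gallon",
       caption = "Source: ggplot2::mpg")

p
```

Note that scale_fill_manual() needs at least as many colors as there are levels in the fill variable; here year has two levels, whereas the example above supplies five colors.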

Exercise

Use ggplot2 to build the visualization below as closely as you can. Read the data values off the chart and enter them manually into a CSV file or a data frame.

Summary

ggplot2 is an essential data visualization package for R that offers consistency, flexibility, and customization for creating complex and informative graphics. Its advantages over Excel include greater control over plot aesthetics, reproducibility, and seamless integration with R. By utilizing a layered approach and providing support for large datasets, ggplot2 has become a preferred choice for data visualization among researchers and data analysts.


Files & Resources

All Files for Lesson 54.202

Acknowledgements

Portions of this tutorial were initially generated by ChatGPT-4 and ChatGPT-3.5.

Errata

Let us know.