Customizing data visualizations using colorspace, ggplot2, and patchwork

Learn how to change ggplot’s default choices for color and style and take control of final figure presentation.

Color Scales

Coding exercise 7.1

In this worksheet, we will discuss how to change and customize color scales.

We will be using the R package tidyverse, which includes ggplot() and related functions. We will also be using the R package colorspace for the scale functions it provides.

# load required libraries
library(tidyverse)
library(colorspace)

temperatures <- read_csv("https://wilkelab.org/SDS375/datasets/tempnormals.csv") %>%
  mutate(
    location = factor(
      location, levels = c("Death Valley", "Houston", "San Diego", "Chicago")
    )
  ) %>%
  select(location, day_of_year, month, temperature)

temps_months <- read_csv("https://wilkelab.org/SDS375/datasets/tempnormals.csv") %>%
  group_by(location, month_name) %>%
  summarize(mean = mean(temperature)) %>%
  mutate(
    month = factor(
      month_name,
      levels = c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
    ),
    location = factor(
      location, levels = c("Death Valley", "Houston", "San Diego", "Chicago")
    )
  ) %>%
  select(-month_name)

We will be working with the dataset temperatures that we have used in previous worksheets. This dataset contains the average temperature for each day of the year for four different locations.

temperatures
# A tibble: 1,464 × 4
   location     day_of_year month temperature
   <fct>              <dbl> <chr>       <dbl>
 1 Death Valley           1 01           51  
 2 Death Valley           2 01           51.2
 3 Death Valley           3 01           51.3
 4 Death Valley           4 01           51.4
 5 Death Valley           5 01           51.6
 6 Death Valley           6 01           51.7
 7 Death Valley           7 01           51.9
 8 Death Valley           8 01           52  
 9 Death Valley           9 01           52.2
10 Death Valley          10 01           52.3
# ℹ 1,454 more rows

We will also be working with an aggregated version of this dataset called temps_months, which contains the mean temperature for each month for the same locations.

temps_months
# A tibble: 48 × 3
# Groups:   location [4]
   location  mean month
   <fct>    <dbl> <fct>
 1 Chicago   50.4 Apr  
 2 Chicago   74.1 Aug  
 3 Chicago   29   Dec  
 4 Chicago   28.9 Feb  
 5 Chicago   24.8 Jan  
 6 Chicago   75.8 Jul  
 7 Chicago   71.0 Jun  
 8 Chicago   38.8 Mar  
 9 Chicago   60.9 May  
10 Chicago   41.6 Nov  
# ℹ 38 more rows

As a challenge, try to create the table above yourself using group_by() and summarize(), as we learned on Wednesday. Then make a month column that is a factor with levels going from “Jan” to “Dec”, and make the location column a factor with levels “Death Valley”, “Houston”, “San Diego”, “Chicago”. If you are having trouble, the solution is at the end of this page; make sure you copy it into your code so the rest of the exercise works.

# check solution at the end before moving on!
temps_months <- read_csv("https://wilkelab.org/SDS375/datasets/tempnormals.csv") %>%
  group_by(___) %>%
  summarize(___) %>%
  mutate(
    month = factor(
      month_name,
      ___
    ),
    location = factor(
      location, ___
    )
  ) %>%
  select(-month_name)
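
The group_by(), summarize(), and factor() pattern used in this challenge can be sketched on the built-in mtcars data, so that it does not give away the solution. The column name mean_mpg, the level order, and the use of .groups = "drop" are illustrative choices, not part of the worksheet.

```r
library(dplyr)

# Mean mpg for each combination of cylinder count and gear count
mpg_summary <- mtcars %>%
  group_by(cyl, gear) %>%
  # .groups = "drop" returns an ungrouped tibble and silences the grouping message
  summarize(mean_mpg = mean(mpg), .groups = "drop") %>%
  mutate(
    # explicit levels fix the order in which the categories are drawn later
    cyl = factor(cyl, levels = c(8, 6, 4))
  )

mpg_summary
levels(mpg_summary$cyl)
```

Without .groups = "drop", summarize() keeps the first grouping variable, which is why the temps_months printout above shows "Groups: location [4]".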

Built-in ggplot2 color scales

We will start with built-in ggplot2 color scales, which require no additional packages. The scale functions are always named scale_color_*() or scale_fill_*(), depending on whether they apply to the color or fill aesthetic. The * indicates some other words specifying the type of the scale, for example scale_color_brewer() or scale_color_distiller() for discrete or continuous scales from the ColorBrewer project, respectively. You can find all available built-in scales here: https://ggplot2.tidyverse.org/reference/index.html#section-scales
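
As a rough illustration of the naming scheme, the following sketch uses the built-in mtcars data rather than the worksheet data. It pairs a color aesthetic with a scale_color_*() function and a fill aesthetic with a scale_fill_*() function; the palette "Dark2" is just one arbitrary ColorBrewer palette.

```r
library(ggplot2)

# Points use the color aesthetic, so the matching scale is scale_color_*()
p_color <- ggplot(mtcars, aes(wt, mpg, color = factor(cyl))) +
  geom_point() +
  scale_color_brewer(palette = "Dark2")

# Bars use the fill aesthetic, so the matching scale is scale_fill_*()
p_fill <- ggplot(mtcars, aes(factor(cyl), fill = factor(cyl))) +
  geom_bar() +
  scale_fill_brewer(palette = "Dark2")

p_color
p_fill
```

Both examples map a discrete variable, which is why the brewer variant is used here.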

Now consider the following plot.

ggplot(temps_months, aes(x = month, y = location, fill = mean)) + 
  geom_tile() + 
  coord_fixed(expand = FALSE)

If you wanted to change the color scale to one from the ColorBrewer project, which scale function would you have to add? scale_color_brewer(), scale_color_distiller(), scale_fill_brewer(), scale_fill_distiller()?

 # answer the question above to yourself

Now try this out.

ggplot(temps_months, aes(x = month, y = location, fill = mean)) + 
  geom_tile() + 
  coord_fixed(expand = FALSE) +
  ___

Most color scale functions have additional customizations. How to use them depends on the specific scale function. For the ColorBrewer scales you can set direction = 1 or direction = -1 to set the direction of the scale (light to dark or dark to light). You can also set the palette via a numeric argument, e.g. palette = 1, palette = 2, palette = 3 etc.

Try this out by setting the direction of the scale from light to dark and using palette #4.

 # build all the code for this exercise

A popular set of scales are the viridis scales, which are provided by scale_*_viridis_c() for continuous data and scale_*_viridis_d() for discrete data. Change the above plot to use a viridis scale.

 # build all the code for this exercise

The viridis scales can be customized with direction (as before), option (which can be "A", "B", "C", "D", or "E"), and begin and end which are numerical values between 0 and 1 indicating where in the color scale the data should begin or end. For example, begin = 0.2 means that the lowest data value is mapped to the 20th percentile in the scale.
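
To see what these arguments do to the colors themselves, independent of any plot, here is a small sketch. It assumes the viridisLite package is available (it is normally installed as a dependency of ggplot2) and that its viridis() function accepts the same option, begin, end, and direction arguments as the ggplot2 scale functions.

```r
library(viridisLite)

# Five evenly spaced colors from the complete default scale (option "D")
full <- viridis(5)

# The same number of colors, but skipping the darkest and lightest 20%
inner <- viridis(5, begin = 0.2, end = 0.8)

# Option "C" in reversed order
option_c_rev <- viridis(5, option = "C", direction = -1)

# Show the three palettes as rows of color swatches
colorspace::swatchplot(list(
  "full scale" = full,
  "begin 0.2, end 0.8" = inner,
  "option C, reversed" = option_c_rev
))
```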

Try different choices for option, begin, and end to see how they change the plot.

 # build all the code for this exercise

Customizing scale title and labels

In a previous worksheet, we used arguments such as name, breaks, labels, and limits to customize the axis. For color scales, instead of an axis we have a legend, and we can use the same arguments inside the scale function to customize how the legend looks.

Try this out. Set the scale limits from 10 to 110 and set the name of the scale and the breaks as you wish.

ggplot(temps_months, aes(x = month, y = location, fill = mean)) + 
  geom_tile() + 
  coord_fixed(expand = FALSE) +
  scale_fill_viridis_c(
    name = ___,
    breaks = ___,
    limits = ___
  )

Note: Color scales ignore the expand argument, so you cannot use it to expand the scale beyond the data values as you can for position scales.

Binned scales

Research into human perception has shown that continuous coloring can be difficult to interpret. Therefore, it is often preferable to use a small number of discrete colors to indicate ranges of data values. You can do this in ggplot with binned scales. For example, scale_fill_viridis_b() provides a binned version of the viridis scale. Try this out.

ggplot(temps_months, aes(x = month, y = location, fill = mean)) + 
  geom_tile() + 
  coord_fixed(expand = FALSE) +
  ___

You can change the number of bins by providing the n.breaks argument or alternatively by setting breaks explicitly. Try this out.

 # build all the code for this exercise
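
The idea behind binning can be illustrated with base R's cut(), using a few of the Chicago values from the temps_months printout above. This is only an analogy for what a binned scale does; it is not meant to show how ggplot2 computes bins internally, and the bin edges are an arbitrary choice.

```r
# A few monthly means for Chicago, copied from the temps_months printout
chicago <- c(24.8, 50.4, 75.8, 41.6)   # Jan, Apr, Jul, Nov

# Bin edges every 20 degrees (chosen only for illustration)
edges <- c(20, 40, 60, 80)

# Every value that falls into the same interval would get the same color
bins <- cut(chicago, breaks = edges)
bins
table(bins)
```

Here April and November land in the same bin, so a binned scale with these breaks would draw them in the same color even though their means differ by almost 9 degrees.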

Scales from the colorspace package

The color scales provided by the colorspace package follow a simple naming scheme of the form scale_<aesthetic>_<datatype>_<colorscale>(), where <aesthetic> is the name of the aesthetic (fill, color, colour), <datatype> indicates the type of variable plotted (discrete, continuous, binned), and <colorscale> stands for the type of the color scale (qualitative, sequential, diverging, divergingx).
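
One way to see this naming scheme in practice is to list the functions that colorspace exports. The following sketch only uses base R to filter the exported names; it assumes the colorspace package is installed.

```r
# All functions exported by colorspace whose names start with "scale_fill_"
fill_scales <- sort(grep(
  "^scale_fill_",
  getNamespaceExports("colorspace"),
  value = TRUE
))
fill_scales
```

Replacing "^scale_fill_" with "^scale_color_" lists the corresponding scales for the color aesthetic.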

For the mean temperature plot we have been using throughout this worksheet, which two color scales from the colorspace package are appropriate?

scale_fill_binned_sequential(), scale_fill_discrete_qualitative(), scale_fill_continuous_sequential(), scale_color_continuous_sequential(), scale_color_continuous_diverging()

ggplot(temps_months, aes(x = month, y = location, fill = mean)) + 
  geom_tile() + 
  coord_fixed(expand = FALSE) +
  ___

You can customize the colorspace scales with the palette argument, which takes the name of a palette (e.g., "Inferno", "BluYl", "Lajolla"). Try this out. Also try reversing the scale direction with rev = TRUE or rev = FALSE. (The colorspace scales use rev instead of direction.) You can find the names of all supported scales here (consider specifically single-hue and multi-hue sequential palettes).

 # build all the code for this exercise

You can also use begin and end just like in the viridis scales.
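
Palette names can also be looked up from within R. The sketch below assumes that hcl_palettes() returns a table whose row names are the palette names, and it uses sequential_hcl() to generate colors directly so that the effect of rev can be seen without drawing the full plot.

```r
library(colorspace)

# Table of the built-in HCL palettes; the row names are the palette names
pals <- hcl_palettes()
head(rownames(pals))

# Five colors from the "BluYl" palette, in default and reversed order
blu_yl     <- sequential_hcl(5, palette = "BluYl")
blu_yl_rev <- sequential_hcl(5, palette = "BluYl", rev = TRUE)

# Compare the two orderings side by side
swatchplot(list("rev = FALSE" = blu_yl, "rev = TRUE" = blu_yl_rev))
```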

Manual scales

For discrete data with a small number of categories, it’s usually best to set colors manually. This can be done with the scale functions scale_*_manual(). These functions take an argument values that specifies the color values to use.

To see how this works, let’s go back to this plot of temperatures over time for four locations.

ggplot(temperatures, aes(day_of_year, temperature, color = location)) +
  geom_line(linewidth = 1.5)  # linewidth replaces size for lines in ggplot2 3.4.0 and later

Let’s use the following four colors: "gold2", "firebrick", "blue3", "springgreen4". We can visualize this palette using the function swatchplot() from the colorspace package.

colorspace::swatchplot(c("gold2", "firebrick", "blue3", "springgreen4"))

Now apply this color palette to the temperatures plot, by using the manual color scale. Hint: use the values argument to provide the colors to the manual scale.

ggplot(temperatures, aes(day_of_year, temperature, color = location)) +
  geom_line(linewidth = 1.5) +  # linewidth replaces size for lines in ggplot2 3.4.0 and later
  ___

One problem with this approach is that we can’t easily control which data value gets assigned to which color. What if we wanted San Diego to be shown in green and Chicago in blue? The simplest way to resolve this issue is to use a named vector. A named vector in R is a vector where each value has a name. Named vectors are created by writing c(name1 = value1, name2 = value2, ...). See the following example.

# regular vector
c("cat", "mouse", "house")
[1] "cat"   "mouse" "house"
# named vector
c(A = "cat", B = "mouse", C = "house")
      A       B       C 
  "cat" "mouse" "house" 

The names in the second example are A, B, and C. Notice that the names are not in quotes. However, if you need a name containing a space (such as Death Valley), you need to enclose the name in backticks. Thus, our named vector of colors could be written like so:

c(`Death Valley` = "gold2", Houston = "firebrick", Chicago = "blue3", `San Diego` = "springgreen4")
  Death Valley        Houston        Chicago      San Diego 
       "gold2"    "firebrick"        "blue3" "springgreen4" 

Now try to use this color vector in the figure showing temperatures over time.

 # build all the code for this exercise
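
To get more comfortable with named vectors before or after attempting the exercise, it can help to inspect one directly. This sketch stores the color vector from above in a variable (the name location_colors is an arbitrary choice) and looks up values by name.

```r
location_colors <- c(
  `Death Valley` = "gold2",
  Houston = "firebrick",
  Chicago = "blue3",
  `San Diego` = "springgreen4"
)

# The names, in the order in which they were written
names(location_colors)

# Look up a single color by name
location_colors[["San Diego"]]

# Lookup by name does not depend on the position in the vector
location_colors[c("Chicago", "San Diego")]
```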

Try some other colors also. For example, you could use the Okabe-Ito colors:

# Okabe-Ito colors
colorspace::swatchplot(c("#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7", "#000000"))

Alternatively, you can find a list of all named colors here. You can also run the command colors() in your R console to get a list of all available color names.

Hint: It’s a good idea to avoid the colors "red", "green", "blue", "cyan", "magenta", "yellow". They are extreme points in the RGB color space and tend to look unnatural and garish. Try this by making a swatch plot of these colors, and compare it, for example, to a palette containing the colors "firebrick", "springgreen4", "blue3", "turquoise3", "darkorchid2", "gold2". Do you see the difference?

 # build all the code for this exercise
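
What "extreme points in the RGB color space" means can be checked numerically with the base R function col2rgb(), which returns the red, green, and blue components of a color on a scale from 0 to 255. The choice of colors to compare is only an example.

```r
# RGB components of two "extreme" colors and two softer alternatives
rgb_values <- col2rgb(c("red", "firebrick", "blue", "blue3"))
rgb_values

# "red" and "blue" sit at corners of the RGB cube: one channel is at the
# maximum of 255 and the other two are 0. "firebrick" and "blue3" lie
# inside the cube, away from those corners.
```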

Solution to the challenge to make the summary table of mean temperature by month:

# paste this below the "temperatures" code-chunk
temps_months <- read_csv("https://wilkelab.org/SDS375/datasets/tempnormals.csv") %>%
  group_by(location, month_name) %>%
  summarize(mean = mean(temperature)) %>%
  mutate(
    month = factor(
      month_name,
      levels = c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
    ),
    location = factor(
      location, levels = c("Death Valley", "Houston", "San Diego", "Chicago")
    )
  ) %>%
  select(-month_name)

Color Scales Exercise Solutions

Figure Design

Coding exercise 7.2

In this worksheet, we will discuss how to change and customize themes.

We will be using the R package tidyverse, which includes ggplot() and related functions. We will also be using the package cowplot for themes and the package palmerpenguins for the penguins dataset.

# load required libraries
library(tidyverse)
library(cowplot)
library(palmerpenguins)

We will be working with the dataset penguins, which contains data on individual penguins in Antarctica.

penguins
# A tibble: 344 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           NA            NA                  NA          NA
 5 Adelie  Torgersen           36.7          19.3               193        3450
 6 Adelie  Torgersen           39.3          20.6               190        3650
 7 Adelie  Torgersen           38.9          17.8               181        3625
 8 Adelie  Torgersen           39.2          19.6               195        4675
 9 Adelie  Torgersen           34.1          18.1               193        3475
10 Adelie  Torgersen           42            20.2               190        4250
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>

Ready-made themes

Let’s start with this simple plot with no specific styling.

ggplot(penguins, aes(flipper_length_mm, body_mass_g, color = species)) +
  geom_point(na.rm = TRUE)  # na.rm = TRUE prevents warning about missing values

The default ggplot theme is theme_gray(). Verify that adding this theme to the plot makes no difference in the output. Then change the overall font size by providing the theme function with a numeric font size argument, e.g. theme_gray(16).

ggplot(penguins, aes(flipper_length_mm, body_mass_g, color = species)) +
  geom_point(na.rm = TRUE) +
  ___

The ggplot2 package has many built-in themes, including theme_minimal(), theme_bw(), theme_void(), theme_dark(). Try these different themes on the above plot. Also try again changing the font size. You can see all themes provided by ggplot2 here: https://ggplot2.tidyverse.org/reference/ggtheme.html

 # build all the code for this exercise

Many other packages also provide themes. For example, the cowplot package provides themes theme_half_open(), theme_minimal_grid(), theme_minimal_hgrid(), and theme_minimal_vgrid(). You can see all cowplot themes here: https://wilkelab.org/cowplot/articles/themes.html

 # build all the code for this exercise

Compare the visual appearance of theme_minimal() from ggplot2 to theme_minimal_grid() from cowplot. What similarities and differences do you notice? Which do you prefer? (There is no correct answer here, just be aware of the differences and of your preferences.)

 # build all the code for this exercise

Modifying theme elements

You can modify theme elements by adding a theme() call to the plot. Inside the theme() call you specify which theme element you want to modify (e.g., axis.title, axis.text.x, panel.background, etc.) and what changes you want to make. For example, to make axis titles blue, you would write:

theme(
  axis.title = element_text(color = "blue")
)

There are many theme settings, and for each one you need to know what type of an element it is (element_text(), element_line(), element_rect() for text, lines, or rectangles, respectively). A complete description of the available options is available at the ggplot2 website: https://ggplot2.tidyverse.org/reference/theme.html

Here, we will only try a few simple things. For example, see if you can make the legend title blue and the legend text red.

# make the legend title blue and the legend text red
ggplot(penguins, aes(flipper_length_mm, body_mass_g, color = species)) +
  geom_point(na.rm = TRUE)

Now color the area behind the legend in "aliceblue". Hint: The theme element you need to change is called legend.background. There is also an element legend.box.background but it is only visible if legend.background is not shown, and in the default ggplot2 themes that is not the case.

 # build all the code for this exercise

Margins are another commonly used feature in themes. Many theme elements accept customized margins, which control how much spacing there is between different parts of a plot. Margins are typically specified with the function margin(), which takes four numbers specifying the margins in points, in the order top, right, bottom, left. So, margin(10, 5, 5, 10) would specify a top margin of 10pt, a right margin of 5pt, a bottom margin of 5pt, and a left margin of 10pt.
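
Because the order of the four numbers is easy to mix up, margin() can also be called with the named arguments t, r, b, and l. The following sketch applies a margin to the whole plot (element plot.margin) on the built-in mtcars data; the specific values and the colored plot background are only there to make the effect visible.

```r
library(ggplot2)

# Named arguments make the order explicit; the values are in points by default
wide_sides <- margin(t = 5, r = 30, b = 5, l = 30)

p_margin <- ggplot(mtcars, aes(wt, mpg)) +
  geom_point() +
  theme(
    # a colored plot background makes the outer margin easy to see
    plot.background = element_rect(fill = "aliceblue"),
    plot.margin = wide_sides
  )
p_margin
```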

Try this out by setting the legend margin (element legend.margin) such that there is no top and no bottom margin but 10pt left and right margin.

 # build all the code for this exercise

There are many other things you can do. Try at least some of the following:

  • Change the horizontal or vertical justification of text with hjust and vjust.
  • Change the font family with family (see note 1 below).
  • Change the panel grid. For example, create only horizontal lines, or only vertical lines.
  • Change the overall margin of the plot with plot.margin.
  • Move the position of the legend with legend.position and legend.justification.
  • Turn off some elements by setting them to element_blank().

Note 1: Getting fonts to work well can be tricky in R. Which specific fonts work depends on the graphics device and the operating system. You can try some of the following fonts and see if they work on app.terra.bio: "Palatino", "Times", "Helvetica", "Courier", "ITC Bookman", "ITC Avant Garde Gothic", "ITC Zapf Chancery".

Writing your own theme

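You can write your own theme by starting from an existing theme, adding modifications with theme(), and storing the result in a variable. Run the following code and then add the resulting theme to the penguins plot.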

theme_colorful <-
  theme_bw() +
  theme(
    text = element_text(color = "mediumblue"),
    axis.text = element_text(color = "springgreen4"),
    legend.text = element_text(color = "firebrick4")
  )

Hint: Do you have to add theme_colorful or theme_colorful()? Do you understand which option is correct and why? If you are unsure, try both and see what happens.

 # build all the code for this exercise

Now write your own theme and then add it to the penguins plot.

 # build all the code for this exercise
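
A related approach, shown here only as a sketch, is to wrap the theme in a function so that it can take a font size argument like the ready-made themes do. The name theme_soft and its styling choices are hypothetical; they are not part of ggplot2 or cowplot.

```r
library(ggplot2)

# theme_soft() is a hypothetical example theme, defined here for illustration
theme_soft <- function(base_size = 11) {
  theme_minimal(base_size) +
    theme(
      axis.title = element_text(color = "gray30"),
      panel.grid.minor = element_blank()
    )
}

# Because theme_soft is a function, it has to be called with parentheses
ggplot(mtcars, aes(wt, mpg)) +
  geom_point() +
  theme_soft(14)
```

The trade-off is a little more code in exchange for a theme whose font size can be changed at the point of use.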

Figure Design Exercise Solutions

Compound Figures

Coding exercise 7.3

In this worksheet, we will discuss how to combine several ggplot2 plots into one compound figure.

We will be using the R package tidyverse, which includes ggplot() and related functions. We will also be using the package patchwork for plot composition.

# load required libraries
library(tidyverse)
library(patchwork)

We will be working with the dataset mtcars, which contains fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models).

mtcars
                     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

Combining plots

First we set up four different plots that we will subsequently combine. The plots are stored in variables p1, p2, p3, p4.

p1 <- ggplot(mtcars) + 
  geom_point(aes(mpg, disp))
p1  

p2 <- ggplot(mtcars) + 
  geom_boxplot(aes(gear, disp, group = gear))
p2

p3 <- ggplot(mtcars) + 
  geom_smooth(aes(disp, qsec))
p3

p4 <- ggplot(mtcars) + 
  geom_bar(aes(carb))
p4

To show plots side-by-side, we use the operator |, as in p1 | p2. Try this by making a compound plot of plots p1, p2, p3 side-by-side.

 # build all the code for this exercise

To show plots on top of one another, we use the operator /, as in p1 / p2. Try this by making a compound plot of plots p1, p2, p3 on top of each other.

 # build all the code for this exercise

We can also use parentheses to group plots with respect to the operators | and /. For example, we can place several plots side-by-side and then place this entire row of plots on top of another plot. Try putting p1, p2, p3 on the top row, and p4 on the bottom row.

(___) / ___
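
Parentheses matter because patchwork reuses R's ordinary operators, so R's usual operator precedence applies: / binds more tightly than |. This can be checked without drawing anything by inspecting how R parses an expression. The sketch below uses only base R; p1, p2, and p3 are never evaluated.

```r
# quote() captures the expression without evaluating it
expr <- quote(p1 | p2 / p3)

# The outermost call is `|`, so p2 / p3 is grouped first
expr[[1]]
expr[[3]]

# With parentheses, the outermost call becomes `/`
quote((p1 | p2) / p3)[[1]]
```

In other words, p1 | p2 / p3 places p1 next to a stack of p2 and p3, which is a different layout from a row of two plots above p3.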

Plot annotations

The patchwork package provides a powerful annotation system via the plot_annotation() function that can be added to a plot assembly. For example, we can add plot tags (the labels in the upper left corner identifying the plots) via the tag_levels argument of plot_annotation(). You can set tag_levels = "A" to generate tags A, B, C, etc. Try this out.

(p1 | p2 | p3 ) / p4

Try also tag levels such as "a", "i", or "1".

You can also add elements such as titles, subtitles, and captions, by setting the title, subtitle, or caption argument in plot_annotation(). Try this out by adding an overall title to the figure from the previous exercise.

 # build all the code for this exercise

Also set a subtitle and a caption.

Finally, you can change the theme of all plots in the plot assembly via the & operator, as in (p1 | p2) & theme_bw(). Try this out.

 # build all the code for this exercise

What happens if you write this expression without parentheses? Do you understand why?

(Big) Challenge Problem:

If you have time this morning, or if you want to work on it this afternoon, try analyzing a new dataset to test your R skills. First, learn what the columns mean, what missing values you see, and then start to ask questions about patterns in the data by making plots.
You can browse the datasets at https://github.com/rfordatascience/tidytuesday/tree/master/data (click on a year folder and go to the README to read about the datasets). Or pick from one of the ones below:

# fish consumption in different countries 
consumption <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-10-12/fish-and-seafood-consumption-per-capita.csv')

# Cricket World Cup matches from 1996 to 2005
matches <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-11-30/matches.csv')

# malaria deaths by age across the world and time. 
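# the raw.githubusercontent.com address returns the CSV file itself rather than the GitHub web page
malaria_deaths <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2018/2018-11-13/malaria_deaths_age.csv')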

# meteorites and/or volcanoes:
# note: to plot a map, try the following:
countries_map <- ggplot2::map_data("world")
world_map <- ggplot() + 
  geom_map(
    data = countries_map, map = countries_map,
    aes(x = long, y = lat, map_id = region, group = group),
    fill = "white", color = "black", linewidth = 0.1  # linewidth replaces size in ggplot2 3.4.0 and later
  ) # then add geom_point() to this map to add lat/long points

meteorites <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-06-11/meteorites.csv")
volcano <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-05-12/volcano.csv")

# or pick any dataset from https://github.com/rfordatascience/tidytuesday/tree/master/data , 
# just click on a year folder and go to the README to read about the datasets
# if you have trouble loading a dataset there, ask for help!
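
As a possible first step with any new dataset, the sketch below looks at column types and counts missing values per column. It uses the built-in airquality dataset purely as a stand-in; the same calls should work on any of the data frames loaded above.

```r
# Built-in dataset, used here only as a stand-in for a newly loaded dataset
data(airquality)

# Column names and types
str(airquality)

# Number of missing values (NA) in each column
missing_per_column <- colSums(is.na(airquality))
missing_per_column
```

Columns with many missing values are worth keeping in mind when plotting, since ggplot2 drops those rows with a warning unless na.rm = TRUE is set.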

Compound Figures Exercise Solutions