NYGC/AMNH Workshop on Microbial Ecology - Introduction to R

Claus Wilke remixed by Joseph Elsherbini
2022/08/24

Visualizing Amounts Exercise

Introduction

Whenever you see a code chunk, copy it into your document, and if there are blanks try filling them in to complete the prompt. You can also copy the prose between code chunks, or add your own notes.

In this worksheet, we will discuss how to visualize amounts using bars.

We will be using the R package tidyverse, which loads ggplot2 (providing ggplot() and related functions) along with dplyr and forcats.

Make a new Rmarkdown document called visualizing_amounts.Rmd, and copy the following code into a new code chunk:


```{r library-calls}
# load required libraries and datasets
library(tidyverse)
library(palmerpenguins)

boxoffice <- tibble(
  rank = 1:5,
  title = c("Star Wars", "Jumanji", "Pitch Perfect 3", "Greatest Showman", "Ferdinand"),
  amount = c(71.57, 36.17, 19.93, 8.81, 7.32) # million USD
)
penguins_nomissing <- na.omit(penguins) # remove all rows with any missing values
```

We will be working with two datasets. First, box-office gross results for Dec. 22-24, 2017:


```{r boxoffice}
boxoffice
```
# A tibble: 5 × 3
   rank title            amount
  <int> <chr>             <dbl>
1     1 Star Wars         71.6 
2     2 Jumanji           36.2 
3     3 Pitch Perfect 3   19.9 
4     4 Greatest Showman   8.81
5     5 Ferdinand          7.32

Second, data on individual penguins in Antarctica. Note that rows with missing values have been removed:


```{r penguins}
penguins_nomissing
```
# A tibble: 333 × 8
   species island    bill_length_mm bill_depth_mm flipper_…¹ body_…² sex    year
   <fct>   <fct>              <dbl>         <dbl>      <int>   <int> <fct> <int>
 1 Adelie  Torgersen           39.1          18.7        181    3750 male   2007
 2 Adelie  Torgersen           39.5          17.4        186    3800 fema…  2007
 3 Adelie  Torgersen           40.3          18          195    3250 fema…  2007
 4 Adelie  Torgersen           36.7          19.3        193    3450 fema…  2007
 5 Adelie  Torgersen           39.3          20.6        190    3650 male   2007
 6 Adelie  Torgersen           38.9          17.8        181    3625 fema…  2007
 7 Adelie  Torgersen           39.2          19.6        195    4675 male   2007
 8 Adelie  Torgersen           41.1          17.6        182    3200 fema…  2007
 9 Adelie  Torgersen           38.6          21.2        191    3800 male   2007
10 Adelie  Torgersen           34.6          21.1        198    4400 male   2007
# … with 323 more rows, and abbreviated variable names ¹​flipper_length_mm,
#   ²​body_mass_g

Drawing numerical values as bars

For the boxoffice dataset, we want to draw the amount (Weekend gross, in million USD) for each movie as a bar.


```{r boxoffice2}
boxoffice
```
# A tibble: 5 × 3
   rank title            amount
  <int> <chr>             <dbl>
1     1 Star Wars         71.6 
2     2 Jumanji           36.2 
3     3 Pitch Perfect 3   19.9 
4     4 Greatest Showman   8.81
5     5 Ferdinand          7.32

Somewhat confusingly, the ggplot geom that does this is called geom_col(). (There is also a geom_bar(), but it works differently. We’ll get to that later in this tutorial.) Make a bar plot of amount versus title. This means amount goes on the y axis and title on the x axis.


```{r geom-col}
ggplot(boxoffice, aes(x = ___, y = ___)) +
  ___()
```

Now flip which column you map onto x and which onto y.


```{r geom-col2}
ggplot(boxoffice, aes(x = ___, y = ___)) +
  ___()
```

The x-axis label should specify that the amount is in million USD, and the y axis doesn’t need the word “title”. Use xlab() and ylab() to make these changes to the plot.


```{r geom-col3}
ggplot(boxoffice, aes(x = amount, y = title)) +
  geom_col() +
  ___() +
  ___()
```

Getting bars into the right order

Whenever we are making bar plots, we need to think about the correct order of the bars. By default, ggplot uses alphabetic ordering, but that is rarely appropriate. If there is no inherent ordering (such as, for example, a temporal progression), then it is usually best to order by the magnitude of the values, i.e., sort the bars by length.

We can do this with the fct_reorder() function, which takes two arguments: the categorical variable we want to reorder and the values by which we want to order it. Here, the categorical variable is the column title and the values are in the column amount. We can apply the fct_reorder() function right inside the aes() statement.
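To see what fct_reorder() does on its own, here is a minimal sketch outside of any plot. The vectors fruit and weight are made-up toy data, not part of the worksheet's datasets.

```r
library(forcats)

# Hypothetical toy data, not part of the worksheet's datasets
fruit <- factor(c("apple", "banana", "cherry"))
weight <- c(30, 10, 20)

levels(fruit)                               # default order: alphabetical

# Reorder the factor levels by the values in weight (ascending by default)
fruit_sorted <- fct_reorder(fruit, weight)
levels(fruit_sorted)                        # levels now follow increasing weight
```

Only the order of the factor levels changes; the values themselves stay in place. ggplot uses the level order to decide where each bar goes.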


```{r geom-col-sorted}
ggplot(boxoffice, aes(x = amount, y = ___)) +
  geom_col() +
  xlab("weekend gross (million USD)") +
  ylab(NULL)
```

Try the following additional experiments in the above code:

  • What happens when you run the above code without the ylab(NULL) statement?
  • Can you make the bars blue?
  • Can you color the bars by amount or by title?
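
As a hint for the color experiments, the sketch below contrasts setting a constant fill with mapping fill to a data column. It uses a made-up data frame, toy, and an arbitrary color name, so treat it as an illustration rather than the solution.

```r
library(ggplot2)

# Hypothetical toy data, not part of the worksheet's datasets
toy <- data.frame(group = c("a", "b", "c"), value = c(3, 1, 2))

# Constant color: fill is set outside aes(), so every bar gets the same color
p_constant <- ggplot(toy, aes(x = value, y = group)) +
  geom_col(fill = "steelblue")

# Mapped color: fill is inside aes(), so each group gets its own color and a legend
p_mapped <- ggplot(toy, aes(x = value, y = group, fill = group)) +
  geom_col()
```

The general rule is that values taken from the data go inside aes(), while fixed settings go in the geom as ordinary arguments.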

Drawing bars based on a count

The boxoffice dataset contains individual values, the dollar amounts, that we wanted to visualize with bars. Often, however, we encounter a slightly different scenario: A dataset doesn’t contain the numeric amounts directly, but instead contains observations we want to count. For example, consider the penguins_nomissing dataset:


```{r penguins2}
penguins_nomissing
```
# A tibble: 333 × 8
   species island    bill_length_mm bill_depth_mm flipper_…¹ body_…² sex    year
   <fct>   <fct>              <dbl>         <dbl>      <int>   <int> <fct> <int>
 1 Adelie  Torgersen           39.1          18.7        181    3750 male   2007
 2 Adelie  Torgersen           39.5          17.4        186    3800 fema…  2007
 3 Adelie  Torgersen           40.3          18          195    3250 fema…  2007
 4 Adelie  Torgersen           36.7          19.3        193    3450 fema…  2007
 5 Adelie  Torgersen           39.3          20.6        190    3650 male   2007
 6 Adelie  Torgersen           38.9          17.8        181    3625 fema…  2007
 7 Adelie  Torgersen           39.2          19.6        195    4675 male   2007
 8 Adelie  Torgersen           41.1          17.6        182    3200 fema…  2007
 9 Adelie  Torgersen           38.6          21.2        191    3800 male   2007
10 Adelie  Torgersen           34.6          21.1        198    4400 male   2007
# … with 323 more rows, and abbreviated variable names ¹​flipper_length_mm,
#   ²​body_mass_g

It contains one row per penguin. If we want to make a bar plot of the number of penguins of each species (Adelie, Chinstrap, Gentoo), we cannot use geom_col() as before, because the dataset doesn’t have a column that contains these counts.

The solution here is to use geom_bar(), which performs a count and then displays the result of that count. Because geom_bar() counts automatically, you only have to provide it with a single aesthetic, which specifies the data column in which you are counting.
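
One way to picture the counting step is to do it by hand with dplyr's count() and then draw the result with geom_col(). The sketch below uses a made-up tibble, pets, and should produce the same bars either way.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data: one row per observation
pets <- tibble(kind = c("cat", "dog", "cat", "bird", "dog", "cat"))

# Count by hand: one row per kind, with the count in column n
pet_counts <- count(pets, kind)

# geom_bar() counts the rows for us ...
p_bar <- ggplot(pets, aes(x = kind)) +
  geom_bar()

# ... while geom_col() needs the precomputed counts as the y aesthetic
p_col <- ggplot(pet_counts, aes(x = kind, y = n)) +
  geom_col()
```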

Try this out. Make a bar plot of the number of penguins per species. Map the penguin species onto the x axis.


```{r geom-bar}
ggplot(penguins_nomissing, aes(___)) +
  geom_bar()
```

Try the following additional modifications in the above code:

  • Map penguin species onto the y axis.
  • Remove the axis label that says “species”.
  • Change the order of the bars manually, using fct_relevel() (see slides).
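
For the manual reordering, a minimal sketch of fct_relevel() on a made-up factor, size, is shown here. The level names you pass are moved to the front in the order given.

```r
library(forcats)

# Hypothetical toy data, not part of the worksheet's datasets
size <- factor(c("small", "large", "medium", "small"))
levels(size)                                 # default order: alphabetical

# List the levels in the order you want them to appear
size_ordered <- fct_relevel(size, "small", "medium", "large")
levels(size_ordered)
```

As with fct_reorder(), this can be applied directly inside aes().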

Counting subgroups

geom_bar() automatically counts how many cases there are in each unique combination of different categorical aesthetics. In the previous example, we had only one categorical aesthetic, species. But we can add a second one, for example sex. Then geom_bar() counts the number of cases in each unique combination of species and sex and draws separate bars for each. Try this out by mapping the sex column onto the fill aesthetic.
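
The counting over combinations can be mimicked by passing two columns to dplyr's count(). This sketch uses a made-up tibble, pets_coat, and only illustrates the table that geom_bar() would compute internally.

```r
library(dplyr)

# Hypothetical toy data: two categorical columns, one row per observation
pets_coat <- tibble(
  kind = c("cat", "cat", "dog", "dog", "dog", "cat"),
  coat = c("short", "long", "short", "short", "long", "short")
)

# One row per unique combination of kind and coat, with the count in column n
coat_counts <- count(pets_coat, kind, coat)
coat_counts
```

Each row of coat_counts corresponds to one bar segment in the plot.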


```{r geom-bar2}
ggplot(penguins_nomissing, aes(x = species, fill = ___)) +
  geom_bar()
```

By default, the bars for different fill values but identical x values will be stacked on top of one another. But there are other possibilities, which are controlled by the position argument to geom_bar(). For example, try setting the position to "dodge".


```{r geom-bar-position}
ggplot(penguins_nomissing, aes(x = species, fill = ___)) +
  geom_bar(___)
```
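
Besides stacking and dodging, geom_bar() also accepts position = "fill", which rescales each stack to a total height of 1 so that the bars show proportions rather than counts. A minimal sketch on a made-up data frame, toy_types:

```r
library(ggplot2)

# Hypothetical toy data, not part of the worksheet's datasets
toy_types <- data.frame(
  group = c("a", "a", "a", "b", "b"),
  type  = c("x", "y", "y", "x", "y")
)

# Each bar is rescaled to height 1; segments show the share of each type
p_fill <- ggplot(toy_types, aes(x = group, fill = type)) +
  geom_bar(position = "fill")
```

This is useful when the groups differ a lot in size and you care about relative composition more than absolute counts.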

Challenge Problems

If you still have time, try the following problems. These problems use the dataset txhouse, which has been derived from the txhousing dataset provided by ggplot2. See here for details of the original dataset: https://ggplot2.tidyverse.org/reference/txhousing.html. txhouse contains three columns: city (one of four Texas cities), year (2000, 2005, 2010, or 2015), and total_sales (the total number of sales for that city and year).

```{r txhousing}
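# keep four cities and four years, then total the monthly sales per city and year
txhouse <- txhousing %>%
  filter(city %in% c("Austin", "Houston", "San Antonio", "Dallas")) %>%
  filter(year %in% c(2000, 2005, 2010, 2015)) %>% # year is an integer column
  group_by(city, year) %>%
  summarize(total_sales = sum(sales)) # result stays grouped by city

txhouse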
```
# A tibble: 16 × 3
# Groups:   city [4]
   city         year total_sales
   <chr>       <int>       <dbl>
 1 Austin       2000       18621
 2 Austin       2005       26905
 3 Austin       2010       19872
 4 Austin       2015       18878
 5 Dallas       2000       45446
 6 Dallas       2005       59980
 7 Dallas       2010       42383
 8 Dallas       2015       36735
 9 Houston      2000       52459
10 Houston      2005       72800
11 Houston      2010       56807
12 Houston      2015       48109
13 San Antonio  2000       15590
14 San Antonio  2005       24034
15 San Antonio  2010       18449
16 San Antonio  2015       16455

Problem 1: Make a new code chunk and use ggplot to make a bar plot of the total housing sales (column total_sales) for each city and show one panel per year. Hint: Use facet_wrap().

Problem 2: Use ggplot to make a bar plot of the total housing sales (column total_sales) for each year, color the bar borders “gray34”, and fill the bars by city.

Problem 3: Modify the plot from Problem 2 by placing city bars side-by-side, rather than stacked. Next, reorder the bars for each year by total_sales in descending order.