Introduction
Whenever you see a code chunk, copy it into your document, and if there are blanks try filling them in to complete the prompt. You can also copy the prose between code chunks, or add your own notes.
In this worksheet, we will discuss how to visualize amounts using bars.
We will be using the R package tidyverse, which
includes ggplot()
and related functions.
Make a new Rmarkdown document called visualizing_amounts.Rmd, and copy the following code into a new code chunk:
```{r library-calls}
# load required libraries and datasets
library(tidyverse)
library(palmerpenguins)
boxoffice <- tibble(
rank = 1:5,
title = c("Star Wars", "Jumanji", "Pitch Perfect 3", "Greatest Showman", "Ferdinand"),
amount = c(71.57, 36.17, 19.93, 8.81, 7.32) # million USD
)
penguins_nomissing <- na.omit(penguins) # remove all rows with any missing values
```
We will be working with two datasets. First, box-office gross results for Dec. 22-24, 2017:
```{r boxoffice}
boxoffice
```
# A tibble: 5 × 3
rank title amount
<int> <chr> <dbl>
1 1 Star Wars 71.6
2 2 Jumanji 36.2
3 3 Pitch Perfect 3 19.9
4 4 Greatest Showman 8.81
5 5 Ferdinand 7.32
Second, data on individual penguins on Antarctica. Note that missing values have been removed:
```{r penguins}
penguins_nomissing
```
# A tibble: 333 × 8
species island bill_length_mm bill_depth_mm flipper_…¹ body_…² sex year
<fct> <fct> <dbl> <dbl> <int> <int> <fct> <int>
1 Adelie Torgersen 39.1 18.7 181 3750 male 2007
2 Adelie Torgersen 39.5 17.4 186 3800 fema… 2007
3 Adelie Torgersen 40.3 18 195 3250 fema… 2007
4 Adelie Torgersen 36.7 19.3 193 3450 fema… 2007
5 Adelie Torgersen 39.3 20.6 190 3650 male 2007
6 Adelie Torgersen 38.9 17.8 181 3625 fema… 2007
7 Adelie Torgersen 39.2 19.6 195 4675 male 2007
8 Adelie Torgersen 41.1 17.6 182 3200 fema… 2007
9 Adelie Torgersen 38.6 21.2 191 3800 male 2007
10 Adelie Torgersen 34.6 21.1 198 4400 male 2007
# … with 323 more rows, and abbreviated variable names ¹flipper_length_mm,
# ²body_mass_g
Drawing numerical values as bars
For the boxoffice
dataset, we want to draw the amount
(Weekend gross, in million USD) for each movie as a bar.
```{r boxoffice2}
boxoffice
```
# A tibble: 5 × 3
rank title amount
<int> <chr> <dbl>
1 1 Star Wars 71.6
2 2 Jumanji 36.2
3 3 Pitch Perfect 3 19.9
4 4 Greatest Showman 8.81
5 5 Ferdinand 7.32
Somewhat confusingly, the ggplot geom that does this is called
geom_col()
. (There is also a geom_bar()
, but
it works differently. We’ll get to that later in this tutorial.) Make a
bar plot of amount
versus title
. This means
amount
goes on the y axis and title
on the x
axis.
```{r geom-col}
ggplot(boxoffice, aes(x = ___, y = ___)) +
___()
```
Now flip which column you map onto x and which onto y.
```{r geom-col2}
ggplot(boxoffice, aes(x = ___, y = ___)) +
___()
```
The x-axis label should specify that the amount is in million USD,
and the y axis doesn’t need the word “title”. Use xlab()
and ylab()
to make these changes to the plot.
```{r geom-col3}
ggplot(boxoffice, aes(x = amount, y = title)) +
geom_col() +
___() +
___()
```
Getting bars into the right order
Whenever we are making bar plots, we need to think about the correct order of the bars. By default, ggplot uses alphabetic ordering, but that is rarely appropriate. If there is no inherent ordering (such as, for example, a temporal progression), then it is usually best to order by the magnitude of the values, i.e., sort the bars by length.
We can do this with the fct_reorder()
function, which
takes two arguments: The categorical variable we want to re-order, and
the values by which we want to order. Here, the categorical variable is
the column title
and the values are in the column
amount
. We can apply the fct_reorder()
function right inside the aes()
statement.
```{r geom-col-sorted}
ggplot(boxoffice, aes(x = amount, y = ___)) +
geom_col() +
xlab("weekend gross (million USD)") +
ylab(NULL)
```
Try the following additional experiments in the above code:
- What happens when you run the above code without the
ylab(NULL)
statement? - Can you make the bars blue?
- Can you color the bars by
amount
or bytitle
?
Drawing bars based on a count
The boxoffice
dataset contains individual values, the
dollar amounts, that we wanted to visualize with bars. Often, however,
we encounter a slightly different scenario: A dataset doesn’t contain
the numeric amounts directly, but instead contains observations we want
to count. For example, consider the penguins_nomissing
dataset:
```{r penguins2}
penguins_nomissing
```
# A tibble: 333 × 8
species island bill_length_mm bill_depth_mm flipper_…¹ body_…² sex year
<fct> <fct> <dbl> <dbl> <int> <int> <fct> <int>
1 Adelie Torgersen 39.1 18.7 181 3750 male 2007
2 Adelie Torgersen 39.5 17.4 186 3800 fema… 2007
3 Adelie Torgersen 40.3 18 195 3250 fema… 2007
4 Adelie Torgersen 36.7 19.3 193 3450 fema… 2007
5 Adelie Torgersen 39.3 20.6 190 3650 male 2007
6 Adelie Torgersen 38.9 17.8 181 3625 fema… 2007
7 Adelie Torgersen 39.2 19.6 195 4675 male 2007
8 Adelie Torgersen 41.1 17.6 182 3200 fema… 2007
9 Adelie Torgersen 38.6 21.2 191 3800 male 2007
10 Adelie Torgersen 34.6 21.1 198 4400 male 2007
# … with 323 more rows, and abbreviated variable names ¹flipper_length_mm,
# ²body_mass_g
It contains one row per penguin. If we want to make a bar plot of the
number of penguins of each species (Adelie, Chinstrap, Gentoo), we
cannot use geom_col()
as before, because the dataset
doesn’t have a column that contains these counts.
The solution here is to use geom_bar()
, which performs a
count and then displays the result of that count. Because
geom_bar()
counts automatically, you only have to provide
it with a single aesthetic, which specifies the data column in which you
are counting.
Try this out. Make a bar plot of the number of penguins per species. Map the penguin species onto the x axis.
```{r geom-bar}
ggplot(penguins_nomissing, aes(___)) +
geom_bar()
```
Try the following additional modifications in the above code:
- Map penguin species onto the y axis.
- Remove the axis label that says “species”.
- Change the order of the bars manually, using
fct_relevel()
(see slides).
Counting subgroups
geom_bar()
automatically counts how many cases there are
in each unique combination of different categorical aesthetics. In the
previous example, we had only one categorical aesthetic,
species
. But we can add a second one, for example
sex
. Then geom_bar()
counts the number of
cases in each unique combination of species and sex and draws separate
bars for each. Try this out by mapping the sex
column onto
the fill
aesthetic.
```{r geom-bar2}
ggplot(penguins_nomissing, aes(x = species, fill = ___)) +
geom_bar()
```
By default, the bars for different fill
values but
identical x
values will be drawn on top of one-another. But
there are other possibilities, which are controled by the
position
argument to geom_bar()
. For example,
try to set the position to "dodge"
.
```{r geom-bar-position}
ggplot(penguins_nomissing, aes(x = species, fill = ___)) +
geom_bar(___)
```
Challenge Problems If you still have time, try the following
problems. These problems use the dataset txhouse
that has
been derived from the txhousing
dataset provided by
ggplot2. See here for details of the original dataset:
https://ggplot2.tidyverse.org/reference/txhousing.html.
txhouse
contains three columns: city
(containing four Texas cities), year
(containing four years
between 2000 and 2015) and total_sales
indicating the total
number of sales for the specified year and city.
```{r txhousing}
txhouse <- txhousing %>%
filter(city %in% c('Austin', 'Houston', 'San Antonio', 'Dallas')) %>%
filter(year %in% c('2000', '2005', '2010', '2015')) %>%
group_by(city, year) %>%
summarize(total_sales = sum(sales))
txhouse
```
# A tibble: 16 × 3
# Groups: city [4]
city year total_sales
<chr> <int> <dbl>
1 Austin 2000 18621
2 Austin 2005 26905
3 Austin 2010 19872
4 Austin 2015 18878
5 Dallas 2000 45446
6 Dallas 2005 59980
7 Dallas 2010 42383
8 Dallas 2015 36735
9 Houston 2000 52459
10 Houston 2005 72800
11 Houston 2010 56807
12 Houston 2015 48109
13 San Antonio 2000 15590
14 San Antonio 2005 24034
15 San Antonio 2010 18449
16 San Antonio 2015 16455
Problem 1: Make a new code chunk and use ggplot to
make a bar plot of the total housing sales (column
total_sales
) for each city
and show one panel
per year
. Hint: Use facet_wrap()
.
Problem 2: Use ggplot to make a bar plot of the
total housing sales (column total_sales
) for each
year
, color the bar borders “gray34”, and fill the bars by
city
.
Problem 3 Modify the plot from Problem 2 by placing
city
bars side-by-side, rather than stacked. Next, reorder
the bars for each year
by total_sales
in
descending order.