Load both datasets and the libraries we’re using in this exercise.
```{r library-calls}
# load required libraries for this worksheet
library(tidyverse)
library(ggforce)
library(ggridges)
```
```{r}
# data prep
titanic <- read_csv("https://wilkelab.org/SDS375/datasets/titanic.csv") %>%
  select(age, sex, class, survived) %>%
  arrange(age, sex, class)
lincoln_temps <- readRDS(url("https://wilkelab.org/SDS375/datasets/lincoln_temps.rds"))
```
Introduction
In this worksheet, we will discuss how to display distributions of data values using histograms and density plots.
The first dataset we will be working with contains information about passengers on the Titanic, including their age, sex, the class in which they traveled on the ship, and whether they survived or not:
```{r titanic}
titanic
```
# A tibble: 756 × 4
age sex class survived
<dbl> <chr> <chr> <chr>
1 0.17 female 3rd survived
2 0.33 male 3rd died
3 0.8 male 2nd survived
4 0.83 male 2nd survived
5 0.83 male 3rd survived
6 0.92 male 1st survived
7 1 female 2nd survived
8 1 female 3rd survived
9 1 male 2nd survived
10 1 male 2nd survived
# … with 746 more rows
Histograms
We start by drawing a histogram of the passenger ages (column `age` in the dataset `titanic`). We can do this in ggplot with the geom `geom_histogram()`. Try this for yourself.
```{r titanic-hist}
ggplot(titanic, aes(___)) +
  ___
```
If you don’t specify how many bins you want or how wide you want them to be, `geom_histogram()` will make an automatic choice, but it will also print a message telling you to pick a better value. Make a better choice by setting the `binwidth` and `center` parameters. Try the values 5 and 2.5, respectively.
```{r titanic-hist2}
ggplot(titanic, aes(age)) +
  geom_histogram(___)
```
Try a few more binwidths, e.g. 1 or 10. What are good values for `center` that go with these choices?
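To check your answers, it can help to write out which bin edges a given `binwidth` and `center` imply. The sketch below uses base R only. `bin_edges()` is a hypothetical helper written for this illustration, not a ggplot2 function, and the ages are made up. It assumes that one bin is centered exactly on `center`, so the bin edges sit half a binwidth to either side of it.

```r
# Hypothetical helper (not part of ggplot2): bin edges implied by binwidth and center
bin_edges <- function(x, binwidth, center) {
  boundary <- center - binwidth / 2   # one edge sits half a bin below the center
  lo <- boundary + floor((min(x) - boundary) / binwidth) * binwidth
  hi <- boundary + ceiling((max(x) - boundary) / binwidth) * binwidth
  seq(lo, hi, by = binwidth)
}

ages <- c(0.17, 1, 4, 18, 22, 35, 47, 71)  # made-up ages for illustration

edges_a <- bin_edges(ages, binwidth = 5, center = 2.5)  # edges fall on multiples of 5
edges_b <- bin_edges(ages, binwidth = 5, center = 0)    # edges shifted by half a bin
print(edges_a)
print(edges_b)
```

With `center = 2.5` the first bin starts at age 0, whereas `center = 0` produces a first bin that reaches into negative ages, which is rarely what you want for age data.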
Density plots
Density plots are a good alternative to histograms. We can create them with `geom_density()`. Try this out by drawing a density plot of the passenger ages (column `age` in the dataset `titanic`). Also, by default `geom_density()` does not draw a filled area under the density line. We can change this by setting an explicit fill color, e.g. `"cornsilk"`.
```{r titanic-dens}
ggplot(titanic, aes(___)) +
  ___
```
Just like for histograms, there are options to modify how much detail a density plot shows. A small binwidth in a histogram corresponds to a low bandwidth (`bw`) in a density plot, and similarly a large binwidth corresponds to a high bandwidth. In addition, you can change the kernel, e.g. `kernel = "rectangular"` or `kernel = "triangular"`. Try this out by using a bandwidth of 1 and a triangular kernel.
```{r titanic-dens2}
ggplot(titanic, aes(age)) +
  geom_density(fill = "cornsilk", ___)
```
Try a few more bandwidth and kernel choices, e.g. bandwidths of 0.1 or 10, or rectangular or gaussian kernels. How does the density plot depend on these choices?
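The effect of the bandwidth can also be seen numerically, without drawing anything. The sketch below calls base R's `density()` directly on simulated values (not the Titanic data). It assumes that `geom_density()` passes `bw` and `kernel` on to `density()`, which is my understanding of ggplot2 but is not stated in this worksheet. The `roughness()` function is a hypothetical measure introduced only for this comparison.

```r
set.seed(42)                             # fixed seed so the simulated data are reproducible
ages <- rnorm(200, mean = 30, sd = 14)   # simulated values, not the Titanic data

d_narrow <- density(ages, bw = 0.1, kernel = "triangular")
d_wide   <- density(ages, bw = 10,  kernel = "gaussian")

# Hypothetical wiggliness measure: total up-and-down movement of the curve
roughness <- function(d) sum(abs(diff(d$y)))

roughness_narrow <- roughness(d_narrow)
roughness_wide   <- roughness(d_wide)
print(c(narrow = roughness_narrow, wide = roughness_wide))
```

The low-bandwidth curve moves up and down far more than the high-bandwidth one, which is the numeric counterpart of a spiky versus a smooth density plot.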
Small multiples (facets)
We can also draw separate histograms for passengers meeting different criteria, for example for passengers traveling in the different classes. Whenever we draw multiple plot panels containing the same type of plot but for different subsets of the data, we speak of “small multiples”. In ggplot, we generate small multiples with the function `facet_wrap()`. The function `facet_wrap()` takes as its argument a list of data columns to subdivide the data by. This list is provided via the `vars()` function. For example, `vars(class)` means draw a separate panel for each class, `vars(survived)` means draw a separate panel for each survival status, and `vars(class, survived)` means draw a separate panel for each combination of class and survival status.
As an example, the following code generates small multiple histograms by class:
```{r titanic-hist-class}
ggplot(titanic, aes(age)) +
  geom_histogram(binwidth = 5, center = 2.5) +
  facet_wrap(vars(class))
```
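To anticipate how many panels a given `vars()` specification produces, you can count the distinct groups in the data. The sketch below does this in base R on a toy data frame that reuses the column names `class` and `survived` but contains made-up values. It assumes that `facet_wrap()` draws one panel per combination that actually occurs in the data.

```r
# Toy data with the worksheet's column names (values are made up)
toy <- data.frame(
  class    = c("1st", "1st", "2nd", "3rd", "3rd", "3rd"),
  survived = c("survived", "died", "survived", "died", "died", "survived")
)

# Number of distinct groups, i.e. the number of panels we would expect
n_by_class    <- nrow(unique(toy["class"]))
n_by_survived <- nrow(unique(toy["survived"]))
n_by_both     <- nrow(unique(toy[c("class", "survived")]))
print(c(class = n_by_class, survived = n_by_survived, both = n_by_both))
```

In this toy data no second-class passenger died, so faceting by both columns gives fewer groups than the product of the two individual counts.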
Now use the same principle to draw facets of histograms by survival status.
```{r titanic-hist-surv}
ggplot(titanic, aes(age)) +
  geom_histogram(binwidth = 5, center = 2.5) +
  ___
```
Now make a plot that breaks down the data by both survival status and class.
```{r titanic-hist-survclass}
ggplot(titanic, aes(age)) +
  geom_histogram(binwidth = 5, center = 2.5) +
  facet_wrap(vars(survived, ___))
```
Now, do the same but drawing density plots rather than histograms.
```{r titanic-dens-survclass}
ggplot(titanic, aes(age)) +
  ___ +
  facet_wrap(vars(survived, ___))
```
Multiple distributions at once
Now, let’s display many distributions at once, using boxplots, violin plots, strip charts, sina plots, and ridgeline plots.
The next dataset we will be working with contains information about the mean temperature for every day of the year 2016 in Lincoln, NE:
```{r lincoln_temps}
lincoln_temps
```
# A tibble: 366 × 4
date month month_long mean_temp
<date> <fct> <fct> <int>
1 2016-01-01 Jan January 24
2 2016-01-02 Jan January 23
3 2016-01-03 Jan January 23
4 2016-01-04 Jan January 17
5 2016-01-05 Jan January 29
6 2016-01-06 Jan January 33
7 2016-01-07 Jan January 30
8 2016-01-08 Jan January 25
9 2016-01-09 Jan January 9
10 2016-01-10 Jan January 11
# … with 356 more rows
Boxplots and violins
We start by drawing the distributions of mean temperatures for each month of the year (columns `month` and `mean_temp` in the dataset `lincoln_temps`), using boxplots. We can do this in ggplot with the geom `geom_boxplot()`. Try this for yourself.
```{r lincoln-box}
ggplot(lincoln_temps, aes(___)) +
  ___
```
Next, do the same but now using violins (`geom_violin()`) instead of boxplots.
```{r lincoln-violin}
# build all the code for this exercise
ggplot(___) +
```
Customize the violins by trying some of the following:
- Change the fill or outline color.
- Swap the x and y mappings.
- Change the bandwidth (parameter `bw`) or kernel (parameter `kernel`). These parameters work just like in `geom_density()`, as discussed in the density plots section above.
- Set `trim = FALSE`. What does this do?
Strip charts and jittering
Both boxplots and violin plots have the disadvantage that they don’t show the individual data points. We can show individual data points by using `geom_point()`. Such a plot is called a strip chart.
Make a strip chart for the Lincoln temperature dataset. Hint: Use `size = 0.75` to reduce the size of the individual points.
```{r lincoln-strip}
ggplot(lincoln_temps, aes(x = month, y = mean_temp)) +
  ___
```
Frequently when we make strip charts we want to apply some jitter to separate points away from each other. We can do so by setting the argument `position = position_jitter()` in `geom_point()`.
When using `position_jitter()` we will normally have to specify how much jittering we want in the horizontal and vertical direction, by setting the `width` and `height` arguments: `position_jitter(width = 0.15, height = 0)`. Both `width` and `height` are specified in units representing the resolution of the data points, and indicate jittering in either direction. So, if data points are 1 unit apart, then `width = 0.15` means the jittering covers 0.3 units or 30% of the spacing of the data points.
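The arithmetic behind the jitter width can be checked with base R's `jitter()`, which adds a uniform random offset of at most `amount` in either direction. This is a sketch on made-up positions, assuming that `width` in `position_jitter()` plays the same role as `amount` does here.

```r
set.seed(1)                              # only to make this illustration repeatable
x <- rep(1:3, each = 50)                 # three groups spaced 1 unit apart (made-up positions)
x_jittered <- jitter(x, amount = 0.15)   # uniform offsets between -0.15 and 0.15

offsets <- x_jittered - x
print(range(offsets))                    # stays within -0.15 and 0.15
print(diff(range(offsets)))              # total spread is at most 0.3 units
```

Because each point moves by at most 0.15 units to the left or right, points from neighboring groups that sit 1 unit apart cannot overlap.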
Try this for yourself, by making a strip chart with jittering.
```{r lincoln-strip-jitter}
ggplot(lincoln_temps, aes(x = month, y = mean_temp)) +
  geom_point(
    size = 0.75,
    ___
  )
```
The function `position_jitter()` applies random jittering to the data points, which means the plot looks different each time you make it. (Verify this.) We can force a specific, fixed arrangement of jittering by setting the `seed` parameter. This parameter takes an arbitrary integer value, e.g. `seed = 1234`. Try this out.
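The idea of a seed is the same as in base R: resetting the random number generator to the same seed reproduces the same random offsets. The sketch below illustrates this with `set.seed()` and `jitter()` on made-up positions. It is an analogy for what the `seed` parameter achieves, not a description of how `position_jitter()` is implemented.

```r
x <- rep(1:3, each = 5)   # made-up positions

set.seed(1234)
first <- jitter(x, amount = 0.15)

set.seed(1234)            # same seed again: the random offsets repeat
second <- jitter(x, amount = 0.15)

third <- jitter(x, amount = 0.15)   # no reseeding: the random stream continues

same_with_seed    <- identical(first, second)
same_without_seed <- identical(second, third)
print(c(same_with_seed = same_with_seed, same_without_seed = same_without_seed))
```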
```{r lincoln-strip-jitter2}
# build all the code for this exercise
ggplot(___) +
```
Finally, try to figure out what the parameter `height` does, by setting it to a value other than 0, or by removing it entirely.
Sina plots
We can create a combination of strip charts and violin plots by making sina plots, which jitter points into the shape of a violin. We can do this with `geom_sina()` from the ggforce package. Try this out.
```{r lincoln-sina}
# build all the code for this exercise
ggplot(___) +
```
It often makes sense to draw a sina plot on top of a violin plot. Try this out.
```{r lincoln-sina2}
# build all the code for this exercise
ggplot(___) +
```
Finally, customize the violins by removing the outline and changing the fill color.
```{r lincoln-sina3}
# build all the code for this exercise
ggplot(___) +
```
Ridgeline plots
As the last alternative for visualizing multiple distributions at once, we will make ridgeline plots. These are multiple density plots staggered vertically. In ridgeline plots, we normally map the grouping variable (e.g. here, the month) to the y axis and the dependent variable (e.g. here, the mean temperature) to the x axis.
We can create ridgeline plots using `geom_density_ridges()` from the ggridges package. Try this out. Use the column `month_long` instead of `month` for the name of the month to get a slightly nicer plot. Hint: If you get an error about a missing y aesthetic you need to swap your x and y mappings.
```{r lincoln-ridges}
# build all the code for this exercise
ggplot(___) +
```
What happens when you use `month` instead of `month_long`? Can you explain why?
It is often a good idea to prune the ridgelines once they are close to zero. You can do this with the parameter `rel_min_height`, which takes a numeric value relative to the maximum height of any ridgeline anywhere in the plot. So, `rel_min_height = 0.01` would prune all lines that are less than 1% of the maximum height in the plot.
```{r lincoln-ridges2}
ggplot(lincoln_temps, aes(x = mean_temp, y = month_long)) +
  geom_density_ridges(___)
```
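The pruning rule described for `rel_min_height` can be sketched in base R. The code below is an illustration on two made-up groups of temperatures, not the Lincoln data, and it is not how ggridges is implemented internally. It computes one density curve per group, takes the tallest point of any curve as the reference height, and blanks out everything below 1% of that height.

```r
set.seed(7)
# Two made-up groups of temperatures (not the Lincoln data)
groups <- list(
  cold = rnorm(100, mean = 25, sd = 8),
  warm = rnorm(100, mean = 75, sd = 5)
)

# One density curve per group, evaluated on a shared grid
dens <- lapply(groups, density, from = -20, to = 110, n = 256)

rel_min_height <- 0.01
max_height <- max(sapply(dens, function(d) max(d$y)))  # tallest point of any curve
cutoff <- rel_min_height * max_height

# Keep only the parts of each curve at or above the cutoff; the rest becomes NA
pruned <- lapply(dens, function(d) ifelse(d$y >= cutoff, d$y, NA))
kept_fraction <- sapply(pruned, function(y) mean(!is.na(y)))
print(kept_fraction)
```

Note that the cutoff is shared by all groups, so a group with a low, broad density loses relatively more of its tails than a group with a tall, narrow one.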