NYGC/AMNH Workshop on Microbial Ecology - Introduction to R

Claus Wilke remixed by Joseph Elsherbini
2022/08/24

Visualizing Distributions Exercise

Load the libraries and both datasets used in this exercise.


```{r library-calls}
# load required libraries for this worksheet
library(tidyverse)
library(ggforce)
library(ggridges)
```


```{r}
# data prep
titanic <- read_csv("https://wilkelab.org/SDS375/datasets/titanic.csv") %>%
  select(age, sex, class, survived) %>%
  arrange(age, sex, class)
  
lincoln_temps <- readRDS(url("https://wilkelab.org/SDS375/datasets/lincoln_temps.rds"))
```

Introduction

In this worksheet, we will discuss how to display distributions of data values using histograms and density plots.

The first dataset we will be working with contains information about passengers on the Titanic, including their age, sex, the class in which they traveled on the ship, and whether they survived or not:


```{r titanic}
titanic
```

# A tibble: 756 × 4
     age sex    class survived
   <dbl> <chr>  <chr> <chr>   
 1  0.17 female 3rd   survived
 2  0.33 male   3rd   died    
 3  0.8  male   2nd   survived
 4  0.83 male   2nd   survived
 5  0.83 male   3rd   survived
 6  0.92 male   1st   survived
 7  1    female 2nd   survived
 8  1    female 3rd   survived
 9  1    male   2nd   survived
10  1    male   2nd   survived
# … with 746 more rows

Histograms

We start by drawing a histogram of the passenger ages (column age in the dataset titanic). We can do this in ggplot with the geom geom_histogram(). Try this for yourself.


```{r titanic-hist}
ggplot(titanic, aes(___)) +
  ___
```

If you don’t specify how many bins you want or how wide you want them to be, geom_histogram() will make an automatic choice (30 bins), and it will also print a message telling you to pick a better value. Make a better choice by setting the binwidth and center parameters. Try the values 5 and 2.5, respectively.


```{r titanic-hist2}
ggplot(titanic, aes(age)) +
  geom_histogram(___)
```

Try a few other binwidths, e.g. 1 or 10. Which values for center go well with these choices?
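One way to check a binwidth and center choice is to look at the bin edges that ggplot2 actually computes. The sketch below uses a small made-up data frame (`toy`, not one of the worksheet datasets) and `ggplot_build()` to pull out the bins for `binwidth = 5` and `center = 2.5`.

```r
library(ggplot2)

# Made-up values for illustration only; not one of the worksheet datasets
toy <- data.frame(value = c(1, 2, 3, 4, 6, 7, 8, 9, 11, 14))

p <- ggplot(toy, aes(value)) +
  geom_histogram(binwidth = 5, center = 2.5)

# ggplot_build() returns the data computed for each layer,
# including the lower edge, upper edge, and count of every bin
bins <- ggplot_build(p)$data[[1]][, c("xmin", "xmax", "count")]
print(bins)
```

With these settings the bin edges should fall on multiples of 5. Swapping in another binwidth and center quickly shows whether the edges still land on round numbers.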

Density plots

Density plots are a good alternative to histograms. We can create them with geom_density(). Try this out by drawing a density plot of the passenger ages (column age in the dataset titanic). Also, by default geom_density() does not draw a filled area under the density line. We can change this by setting an explicit fill color, e.g. “cornsilk”.


```{r titanic-dens}
ggplot(titanic, aes(___)) +
  ___
```

Just like for histograms, there are options to modify how much detail a density plot shows. A small binwidth in a histogram corresponds to a low bandwidth (bw) in a density plot and similarly a large binwidth corresponds to a high bandwidth. In addition, you can change the kernel, e.g. kernel = "rectangular" or kernel = "triangular". Try this out by using a bandwidth of 1 and a triangular kernel.


```{r titanic-dens2}
ggplot(titanic, aes(age)) +
  geom_density(fill = "cornsilk", ___)
```

Try a few other bandwidths, e.g. 0.1 or 10, and other kernels, e.g. rectangular or gaussian. How does the density plot depend on these choices?
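To compare bandwidth choices numerically rather than by eye, you can extract the computed density values with `layer_data()`. The sketch below is an illustration on the built-in `faithful` dataset (an assumption made here so the example runs without the Titanic data), and the bandwidth values are arbitrary.

```r
library(ggplot2)

# `faithful` ships with base R; it stands in for the Titanic data here
narrow <- ggplot(faithful, aes(eruptions)) +
  geom_density(fill = "cornsilk", bw = 0.05, kernel = "rectangular")

wide <- ggplot(faithful, aes(eruptions)) +
  geom_density(fill = "cornsilk", bw = 1, kernel = "gaussian")

# layer_data() returns the density values ggplot2 computed for a plot
peak_narrow <- max(layer_data(narrow)$density)
peak_wide   <- max(layer_data(wide)$density)
print(c(narrow = peak_narrow, wide = peak_wide))
```

Comparing the two peak heights is one quick check on how much smoothing a given bandwidth applies. Printing `narrow` and `wide` draws the two plots.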

Small multiples (facets)

We can also draw separate histograms for passengers meeting different criteria, for example for passengers traveling in the different classes. Whenever we draw multiple plot panels containing the same type of plot but for different subsets of the data, we speak of “small multiples”. In ggplot, we generate small multiples with the function facet_wrap(). The function facet_wrap() takes as its argument a list of data columns to subdivide the data by. This list is provided via the vars() function. For example, vars(class) means draw a separate panel for each class, vars(survived) means draw a separate panel for each survival status, and vars(class, survived) means draw a separate panel for each combination of class and survival status.

As an example, the following code generates small multiple histograms by class:


```{r titanic-hist-class}
ggplot(titanic, aes(age)) +
  geom_histogram(binwidth = 5, center = 2.5) +
  facet_wrap(vars(class))
```
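To see how `vars()` with two columns translates into panels, you can inspect the layout of the built plot. The sketch below uses a made-up data frame with hypothetical columns `deck` and `status` (neither is in the Titanic data) and counts the resulting panels.

```r
library(ggplot2)

# Made-up data: 3 decks x 2 statuses, 20 values each (hypothetical columns)
toy <- expand.grid(
  deck   = c("a", "b", "c"),
  status = c("no", "yes"),
  rep    = 1:20
)
toy$value <- (seq_len(nrow(toy)) * 7) %% 30

p <- ggplot(toy, aes(value)) +
  geom_histogram(binwidth = 5, center = 2.5) +
  facet_wrap(vars(deck, status))

# The built plot records one layout row per panel
panels <- ggplot_build(p)$layout$layout
print(panels[, c("PANEL", "deck", "status")])
```

Each row of `panels` corresponds to one combination of the faceting variables that occurs in the data.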

Now use the same principle to draw facets of histograms by survival status.


```{r titanic-hist-surv}
ggplot(titanic, aes(age)) +
  geom_histogram(binwidth = 5, center = 2.5) +
  ___
```

Now make a plot that breaks down the data by both survival status and class.


```{r titanic-hist-survclass}
ggplot(titanic, aes(age)) +
  geom_histogram(binwidth = 5, center = 2.5) +
  facet_wrap(vars(survived, ___))
```

Now, do the same but drawing density plots rather than histograms.


```{r titanic-dens-survclass}
ggplot(titanic, aes(age)) +
  ___ +
  facet_wrap(vars(survived, ___))
```

Multiple distributions at once

Now, let’s display many distributions at once, using boxplots, violin plots, strip charts, sina plots, and ridgeline plots.

The next dataset we will be working with contains information about the mean temperature for every day of the year 2016 in Lincoln, NE:


```{r lincoln_temps}
lincoln_temps
```

# A tibble: 366 × 4
   date       month month_long mean_temp
   <date>     <fct> <fct>          <int>
 1 2016-01-01 Jan   January           24
 2 2016-01-02 Jan   January           23
 3 2016-01-03 Jan   January           23
 4 2016-01-04 Jan   January           17
 5 2016-01-05 Jan   January           29
 6 2016-01-06 Jan   January           33
 7 2016-01-07 Jan   January           30
 8 2016-01-08 Jan   January           25
 9 2016-01-09 Jan   January            9
10 2016-01-10 Jan   January           11
# … with 356 more rows

Boxplots and violins

We start by drawing the distributions of mean temperatures for each month of the year (columns month and mean_temp in the dataset lincoln_temps), using boxplots. We can do this in ggplot with the geom geom_boxplot(). Try this for yourself.


```{r lincoln-box}
ggplot(lincoln_temps, aes(___)) +
  ___
```

Next, do the same but now using violins (geom_violin()) instead of boxplots.


```{r lincoln-violin}
# build all the code for this exercise
ggplot(___) +

```

Customize the violins by trying some of the following:

  • Change the fill or outline color.
  • Swap the x and y mappings.
  • Change the bandwidth (parameter bw) or kernel (parameter kernel). These parameters work just like in geom_density(), as discussed in the Density plots section above.
  • Set trim = FALSE. What does this do?
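As a worked illustration on a different dataset, the sketch below draws a boxplot and a customized violin plot for the built-in `iris` data, with `Species` standing in for the month. The fill color, outline color, and bandwidth are arbitrary choices made for this example.

```r
library(ggplot2)

# `iris` ships with base R; Species plays the role of month here
p_box <- ggplot(iris, aes(x = Species, y = Sepal.Length)) +
  geom_boxplot()

p_violin <- ggplot(iris, aes(x = Species, y = Sepal.Length)) +
  geom_violin(fill = "lightblue", color = "navy", bw = 0.2)

# The line inside each box is the group median ("middle");
# "lower" and "upper" are the first and third quartiles
box_stats <- layer_data(p_box)[, c("x", "lower", "middle", "upper")]
print(box_stats)

# print(p_box) and print(p_violin) draw the two plots
```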

Strip charts and jittering

Both boxplots and violin plots have the disadvantage that they don’t show the individual data points. We can show individual data points by using geom_point(). Such a plot is called a strip chart.

Make a strip chart for the Lincoln temperature data set. Hint: Use size = 0.75 to reduce the size of the individual points.


```{r lincoln-strip}
ggplot(lincoln_temps, aes(x = month, y = mean_temp)) +
  ___
```

Frequently when we make strip charts we want to apply some jitter to separate overlapping points. We can do so by setting the argument position = position_jitter() in geom_point().

When using position_jitter() we will normally have to specify how much jittering we want in the horizontal and vertical directions, by setting the width and height arguments: position_jitter(width = 0.15, height = 0). Both width and height are specified in units representing the resolution of the data points, and the jitter is applied in both the positive and the negative direction. So, if data points are 1 unit apart, then width = 0.15 means the jittering covers 0.3 units, or 30% of the spacing of the data points.
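The arithmetic above can be checked on a plot directly. The sketch below uses a made-up two-group data frame (not the Lincoln data) and measures how far each jittered point ends up from its group position.

```r
library(ggplot2)

# Made-up data: two groups, which ggplot2 places at x = 1 and x = 2
toy <- data.frame(
  group = rep(c("a", "b"), each = 50),
  value = rep(1:50, times = 2)
)

p <- ggplot(toy, aes(x = group, y = value)) +
  geom_point(
    size = 0.75,
    position = position_jitter(width = 0.15, height = 0)
  )

# Horizontal offset of each drawn point from its group position
jittered <- layer_data(p)
offset <- as.numeric(jittered$x) - round(as.numeric(jittered$x))
print(range(offset))
```

The offsets should stay within 0.15 on either side of the group position, so the points of each group span at most 0.3 units horizontally.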

Try this for yourself, by making a strip chart with jittering.


```{r lincoln-strip-jitter}
ggplot(lincoln_temps, aes(x = month, y = mean_temp)) +
  geom_point(
    size = 0.75,
    ___
  )
```

The function position_jitter() applies random jittering to the data points, which means the plot looks different each time you make it. (Verify this.) We can force a specific, fixed arrangement of jittering by setting the seed parameter. This parameter takes an arbitrary integer value, e.g. seed = 1234. Try this out.
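One way to verify the effect of the seed without comparing plots by eye is to compare the jittered x positions themselves. The sketch below uses a made-up data frame and a hypothetical helper, `make_plot()`, neither of which is part of the worksheet.

```r
library(ggplot2)

# Made-up data for illustration only
toy <- data.frame(
  group = rep(c("a", "b"), each = 50),
  value = rep(1:50, times = 2)
)

# Hypothetical helper: builds the same strip chart with a given seed
make_plot <- function(seed) {
  ggplot(toy, aes(x = group, y = value)) +
    geom_point(
      size = 0.75,
      position = position_jitter(width = 0.15, height = 0, seed = seed)
    )
}

x_first  <- layer_data(make_plot(1234))$x
x_second <- layer_data(make_plot(1234))$x
x_other  <- layer_data(make_plot(4321))$x

# Same seed: are the jittered positions identical?
print(identical(x_first, x_second))
# Different seed: are they still identical?
print(identical(x_first, x_other))
```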


```{r lincoln-strip-jitter2}
# build all the code for this exercise
ggplot(___) +

```

Finally, try to figure out what the parameter height does, by setting it to a value other than 0, or by removing it entirely.

Sina plots

We can create a combination of strip charts and violin plots by making sina plots, which jitter points into the shape of a violin. We can do this with geom_sina() from the ggforce package. Try this out.
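As an illustration on a different dataset, the sketch below draws a sina plot for the built-in `iris` data, with `Species` standing in for the month. It assumes the ggforce package is installed, as in the library calls at the top of this worksheet.

```r
library(ggplot2)
library(ggforce)

# `iris` ships with base R; it stands in for the Lincoln data here
p <- ggplot(iris, aes(x = Species, y = Sepal.Length)) +
  geom_sina(size = 0.75)

# A sina plot draws one point per observation
sina_points <- layer_data(p)
print(nrow(sina_points))

# print(p) draws the plot
```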


```{r lincoln-sina}
# build all the code for this exercise
ggplot(___) +

```

It often makes sense to draw a sina plot on top of a violin plot. Try this out.


```{r lincoln-sina2}
# build all the code for this exercise
ggplot(___) +

```

Finally, customize the violins by removing the outline and changing the fill color.


```{r lincoln-sina3}
# build all the code for this exercise
ggplot(___) +

```

Ridgeline plots

As the last alternative for visualizing multiple distributions at once, we will make ridgeline plots. These are multiple density plots staggered vertically. In ridgeline plots, we normally map the grouping variable (e.g. here, the month) to the y axis and the dependent variable (e.g. here, the mean temperature) to the x axis.

We can create ridgeline plots using geom_density_ridges() from the ggridges package. Try this out. Use the column month_long instead of month for the name of the month to get a slightly nicer plot. Hint: If you get an error about a missing y aesthetic you need to swap your x and y mappings.


```{r lincoln-ridges}
# build all the code for this exercise
ggplot(___) +

```

What happens when you use month instead of month_long? Can you explain why?

It is often a good idea to prune the ridgelines once they are close to zero. You can do this with the parameter rel_min_height, which takes a numeric value relative to the maximum height of any ridgeline anywhere in the plot. So, rel_min_height = 0.01 prunes the parts of each ridgeline that fall below 1% of the maximum height in the plot.


```{r lincoln-ridges2}
ggplot(lincoln_temps, aes(x = mean_temp, y = month_long)) +
  geom_density_ridges(___)
```
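For comparison, here is a minimal ridgeline sketch on the built-in `iris` data, with `Species` as the grouping variable on the y axis. It assumes the ggridges package is installed, as in the library calls at the top of this worksheet.

```r
library(ggplot2)
library(ggridges)

# `iris` ships with base R; Species stands in for month_long here
p <- ggplot(iris, aes(x = Sepal.Length, y = Species)) +
  geom_density_ridges(rel_min_height = 0.01)

# Building the plot computes the ridgelines without drawing them
built <- ggplot_build(p)

# print(p) draws the plot
```

Changing `rel_min_height` to a larger value, such as 0.1, makes the pruning of the tails easier to see.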