Intro to data visualization and data wrangling with the tidyverse

remixed from Claus O. Wilke’s SDS375 course






Goals for this session

  1. Get the big picture of data visualization

  2. Learn how to wrangle data and make plots with the tidyverse

data wrangling (n.) - the art of taking data in one format and filtering, reshaping, and deriving values to get the data into the format you need.
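As a rough illustration of the three verbs in that definition, here is a minimal sketch using the first rows of the clinical measurements table shown later in these slides. The object names `visits` and `ph_change` are made up for this example, and the choice to compare baseline with week 7 is only an assumption for demonstration.

```{r}
library(tidyverse)

# First six rows of the clinical measurements table (pH column only)
visits <- tribble(
  ~pid,     ~time_point, ~arm,      ~ph,
  "pid_01", "baseline",  "placebo", 5.7,
  "pid_01", "week_1",    "placebo", 5.2,
  "pid_01", "week_7",    "placebo", 5.4,
  "pid_02", "baseline",  "placebo", 5.2,
  "pid_02", "week_1",    "placebo", 4.8,
  "pid_02", "week_7",    "placebo", 4.2
)

ph_change <- visits %>%
  # filter: keep only the baseline and week 7 visits
  filter(time_point != "week_1") %>%
  # reshape: one row per participant, one column per time point
  pivot_wider(names_from = time_point, values_from = ph) %>%
  # derive: change in pH between baseline and week 7
  mutate(change = week_7 - baseline)

ph_change
```

The result has one row per participant, which is a more convenient shape for a before/after comparison than the original one-row-per-visit layout.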

Discussions: Discord

Ask questions at #workshop-questions on https://discord.gg/UDAsYTzZE.

Screenshot of the Discord server app that serves as the forum for the workshop.

Stickies

Picture of a laptop with a red sticky note stuck to the top.

During an activity, place a yellow sticky on your laptop if you’re good to go and a pink sticky if you want help.

Practicalities

WiFi:

Network: KTB Free Wifi (no password needed)

Network AHRI Password: @hR1W1F1!17

Network CAPRISA-Corp Password: corp@caprisa17

Bathrooms are out the lobby to your left

Group Pen and Paper exercise

10:00
30:00

Get with your group. Go to the activity

  1. For the first 10 minutes think on your own
  2. For 30 minutes discuss with your group and produce at least one plot
  3. Someone post a picture on the #pen-and-paper-activity channel.
  4. Decide on one member of your group to present your plot (3-minute limit per group)

Presentation

Have one member from your group present the plot to everyone! 3-minute limit!

03:00

Aesthetics - the elements of data visualization

Plots map data onto graphical elements.

Table 1: 02_visit_clinical_measurements_UKZN_workshop_2023.csv
pid time_point arm nugent_score crp_blood ph
pid_01 baseline placebo 8 0.44 5.7
pid_01 week_1 placebo 7 1.66 5.2
pid_01 week_7 placebo 7 1.44 5.4
pid_02 baseline placebo 7 1.55 5.2
pid_02 week_1 placebo 7 0.75 4.8
pid_02 week_7 placebo 4 1.17 4.2

pH mapped to y position

pH mapped to color

Commonly used aesthetics

Figure from Claus O. Wilke. Fundamentals of Data Visualization. O’Reilly, 2019

The same data values can be mapped to different aesthetics

Figure from Claus O. Wilke. Fundamentals of Data Visualization. O’Reilly, 2019

We can use many different aesthetics at once

Creating aesthetic mappings in ggplot

We define the mapping with aes()

```{r}
table_02 %>%
  ggplot(mapping = aes(x = time_point, y = ph, color = ph)) +
  geom_jitter()
```

We frequently omit argument names

Long form, all arguments are named:

```{r}
#| eval: false

ggplot(
  data = table_02,
  mapping = aes(x = time_point, y = ph, color = ph)
) +
  geom_jitter()
```

We frequently omit argument names

Abbreviated form, common arguments remain unnamed:

```{r}
#| eval: false

ggplot(table_02, aes(x = time_point, y = ph, color = ph)) +
  geom_jitter()
```

The geom determines how the data is shown

```{r}
ggplot(table_02, aes(x = time_point, y = ph, color = ph)) +
  geom_point()
```

The geom determines how the data is shown

```{r}
ggplot(table_02, aes(x = time_point, y = ph, color = ph)) +
  geom_boxplot()
```

The geom determines how the data is shown

```{r}
ggplot(table_02, aes(x = time_point, y = ph, color = ph)) +
  geom_jitter()
```

Different geoms have parameters for control

```{r}
ggplot(table_02, aes(x = time_point, y = ph, color = ph)) +
  geom_jitter(size = 3)
```

Different geoms have parameters for control

```{r}
ggplot(table_02, aes(x = time_point, y = ph, color = ph)) +
  geom_jitter(size = 3, width = 0.2)
```

Important: color and fill apply to different elements

color
Applies color to points, lines, text, borders

fill
Applies color to any filled areas

Many geoms have both color and fill aesthetics

```{r}
#| output-location: column
ggplot(
  data = table_02,
  mapping = aes(
    x = time_point,
    y = ph,
    color = time_point
  )
) + geom_boxplot()
```

Many geoms have both color and fill aesthetics

```{r}
#| output-location: column
ggplot(
  data = table_02,
  mapping = aes(
    x = time_point,
    y = ph,
    fill = time_point
  )
) + geom_boxplot()
```

Many geoms have both color and fill aesthetics

```{r}
#| output-location: column
ggplot(
  data = table_02,
  mapping = aes(
    x = time_point,
    y = ph,
    fill = time_point,
    color = time_point
  )
) + geom_boxplot()
```

Aesthetics can also be used as parameters in geoms

```{r}
#| output-location: column
ggplot(
  data = table_02,
  mapping = aes(
    x = time_point,
    y = ph
  )
) + geom_boxplot()
```

Aesthetics can also be used as parameters in geoms

```{r}
#| output-location: column
ggplot(
  data = table_02,
  mapping = aes(
    x = time_point,
    y = ph
  )
) + geom_boxplot(fill = "orange")
```

Exercise

30:00

Time to try it yourself. Go to the first coding exercise.

Picture of a laptop with a red sticky note stuck to the top.

During an activity, place a yellow sticky on your laptop if you’re good to go and a pink sticky if you want help.

Visualizing amounts

We often encounter datasets containing simple amounts

Example: Highest grossing movies 2023 to date

rank title amount
1 Barbie 1437.8
2 The Super Mario Bros Movie 1361.9
3 Oppenheimer 939.3
4 Guardians of the Galaxy 3 845.5
5 The Little Mermaid 569.6

Millions USD. Data source: Box Office Mojo

We can visualize amounts with bar plots

Bars can also run horizontally

Avoid rotated axis labels

Avoid rotated axis labels - flip the axes!

Pay attention to the order of the bars

Pay attention to the order of the bars

We can use dots instead of bars

Dots are preferable if we want to truncate the axes

Dots are preferable if we want to truncate the axes

Bar lengths do not accurately represent the data values

Dots are preferable if we want to truncate the axes

Key features of the data are obscured

Dots are preferable if we want to truncate the axes
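A minimal sketch of a dot plot with a truncated axis, using the box office numbers from this section as a stand-in dataset. The axis limits of 500 and 1500 are an arbitrary choice for illustration. Because a dot encodes its value by position rather than by length, an axis that does not start at zero does not distort the reading the way it would for bars.

```{r}
library(tidyverse)

# Box office amounts from the table above (million USD)
boxoffice <- tibble(
  title = c("Barbie", "The Super Mario Bros Movie", "Oppenheimer",
            "Guardians of the Galaxy 3", "The Little Mermaid"),
  amount = c(1437.8, 1361.9, 939.3, 845.5, 569.6)
)

p_dots <- ggplot(boxoffice, aes(amount, fct_reorder(title, amount))) +
  geom_point(size = 3) +
  # truncated axis: starts at 500 instead of 0 (limits chosen for illustration)
  scale_x_continuous(limits = c(500, 1500)) +
  xlab("amount (in million USD)") +
  ylab(NULL)

p_dots
```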

Let’s take a poll

Go to the event on Wooclap

M3. Do you think it makes sense to truncate the axes for the life expectancy data?

Grouped bars

We use grouped bars for higher-dimensional datasets

We are free to choose by which variable to group

We can also use multiple plot panels (facets)
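One way to compare the two layouts is sketched below, using two participants copied from the clinical measurements table (one per arm). The object names `two_pids`, `p_grouped`, and `p_facets` are made up for this example, and picking these two participants is only for illustration.

```{r}
library(tidyverse)

# Two participants from the clinical measurements table, one per arm
two_pids <- tribble(
  ~pid,     ~time_point, ~arm,        ~ph,
  "pid_01", "baseline",  "placebo",   5.7,
  "pid_01", "week_1",    "placebo",   5.2,
  "pid_01", "week_7",    "placebo",   5.4,
  "pid_05", "baseline",  "treatment", 4.9,
  "pid_05", "week_1",    "treatment", 3.2,
  "pid_05", "week_7",    "treatment", 3.5
)

# Grouped bars: the two arms sit side by side within each time point
p_grouped <- ggplot(two_pids, aes(time_point, ph, fill = arm)) +
  geom_col(position = "dodge")

# Facets: one panel per arm, time points on the x axis of each panel
p_facets <- ggplot(two_pids, aes(time_point, ph)) +
  geom_col() +
  facet_wrap(vars(arm))

p_grouped
p_facets
```

Grouped bars make comparisons within a time point easy, while facets make the trend within each arm easier to follow.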

Making bar plots using ggplot2

The simple dataset

```{r}
# Data from Box Office Mojo for 2023.
boxoffice <- tibble(
  rank = 1:5,
  title = c(
    "Barbie", "The Super Mario Bros Movie", "Oppenheimer",
    "Guardians of the Galaxy 3", "The Little Mermaid"
  ),
  amount = c(1437.8, 1361.9, 939.3, 845.5, 569.6) # million USD
)
```

rank title amount
1 Barbie 1437.8
2 The Super Mario Bros Movie 1361.9
3 Oppenheimer 939.3
4 Guardians of the Galaxy 3 845.5
5 The Little Mermaid 569.6

Visualize as a bar plot

```{r}
ggplot(boxoffice, aes(title, amount)) +
  geom_col()  # "col" stands for column
```

Order by data value

```{r}
ggplot(boxoffice, aes(fct_reorder(title, amount), amount)) +
  geom_col()
```
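To see what the reordering does without drawing a plot, we can inspect the factor levels directly. This is a small sketch; `by_amount` is a name made up for this example. ggplot2 draws discrete values in the order of their factor levels, and `fct_reorder()` sorts those levels by a second variable, smallest first.

```{r}
library(tidyverse)

boxoffice <- tibble(
  title = c("Barbie", "The Super Mario Bros Movie", "Oppenheimer",
            "Guardians of the Galaxy 3", "The Little Mermaid"),
  amount = c(1437.8, 1361.9, 939.3, 845.5, 569.6) # million USD
)

# Levels sorted by amount in ascending order:
# "The Little Mermaid" comes first, "Barbie" comes last
by_amount <- levels(fct_reorder(boxoffice$title, boxoffice$amount))
by_amount
```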

Order by data value, descending

```{r}
ggplot(boxoffice, aes(fct_reorder(title, -amount), amount)) +
  geom_col() +
  xlab(NULL) # remove x axis label
```

Flip x and y, set custom x axis label

```{r}
ggplot(boxoffice, aes(amount, fct_reorder(title, amount))) +
  geom_col() +
  xlab("amount (in million USD)") +
  ylab(NULL)
```

Sometimes we need to count before visualization

```{r}
library(here)
library(tidyverse)

table_02 <- read_csv(here("datasets/instructional_dataset/02_visit_clinical_measurements_UKZN_workshop_2023.csv")) %>%
  mutate(nugent_score = as_factor(nugent_score))
```

pid time_point arm nugent_score crp_blood ph
pid_01 baseline placebo 8 0.44 5.7
pid_01 week_1 placebo 7 1.66 5.2
pid_01 week_7 placebo 7 1.44 5.4
pid_02 baseline placebo 7 1.55 5.2
pid_02 week_1 placebo 7 0.75 4.8
pid_02 week_7 placebo 4 1.17 4.2
pid_03 baseline placebo 6 1.78 4.8
pid_03 week_1 placebo 10 0.57 5.3
pid_03 week_7 placebo 7 1.79 5.2
pid_04 baseline placebo 5 1.76 4.8
pid_04 week_1 placebo 9 2.58 5.1
pid_04 week_7 placebo 7 5.68 5.4
pid_05 baseline treatment 8 0.95 4.9
pid_05 week_1 treatment 3 0.19 3.2
pid_05 week_7 treatment 2 0.45 3.5
pid_06 baseline placebo 10 4.03 5.3
pid_06 week_1 placebo 8 1.72 5.6
pid_06 week_7 placebo 8 3.19 5.0
pid_07 baseline placebo 7 0.10 5.2
pid_07 week_1 placebo 7 1.36 4.9
pid_07 week_7 placebo 5 0.38 5.1
pid_08 baseline placebo 9 3.18 5.4
pid_08 week_1 placebo 5 1.55 4.8
pid_08 week_7 placebo 7 1.77 5.0
pid_09 baseline treatment 5 2.13 4.9
pid_09 week_1 treatment 3 0.27 3.6
pid_09 week_7 treatment 4 1.04 4.2
pid_10 baseline treatment 8 0.98 4.9
pid_10 week_1 treatment 0 0.01 3.5
pid_10 week_7 treatment 1 2.87 2.9
pid_11 baseline treatment 7 0.31 5.0
pid_11 week_1 treatment 1 0.10 3.3
pid_11 week_7 treatment 4 1.15 5.1
pid_12 baseline placebo 8 2.42 5.0
pid_12 week_1 placebo 6 0.64 4.5
pid_12 week_7 placebo 9 4.36 5.2
pid_13 baseline placebo 8 2.69 5.1
pid_13 week_1 placebo 7 2.57 5.5
pid_13 week_7 placebo 8 1.98 4.8
pid_14 baseline placebo 7 0.34 5.3
pid_14 week_1 placebo 5 2.07 4.2
pid_14 week_7 placebo 7 5.06 5.1
pid_15 baseline treatment 7 0.29 4.8
pid_15 week_1 treatment 3 0.84 3.4
pid_15 week_7 treatment 3 0.68 3.5
pid_16 baseline treatment 6 1.91 5.7
pid_16 week_1 treatment 0 0.03 3.7
pid_16 week_7 treatment 2 0.50 3.2
pid_17 baseline treatment 5 1.39 4.8
pid_17 week_1 treatment 2 0.00 3.3
pid_17 week_7 treatment 3 0.90 3.7
pid_18 baseline treatment 6 0.45 4.3
pid_18 week_1 treatment 1 1.81 3.6
pid_18 week_7 treatment 6 0.41 3.9
pid_19 baseline placebo 7 1.34 5.3
pid_19 week_1 placebo 5 2.91 4.3
pid_19 week_7 placebo 5 1.27 4.5
pid_20 baseline placebo 4 0.86 4.3
pid_20 week_1 placebo 8 1.45 5.2
pid_20 week_7 placebo 5 3.95 4.9
pid_21 baseline treatment 5 0.50 4.6
pid_21 week_1 treatment 1 1.60 3.4
pid_21 week_7 treatment 4 1.23 4.8
pid_22 baseline treatment 6 1.10 4.0
pid_22 week_1 treatment 3 0.58 4.2
pid_22 week_7 treatment 6 1.67 5.1
pid_23 baseline placebo 8 0.99 5.4
pid_23 week_1 placebo 8 0.80 5.5
pid_23 week_7 placebo 3 3.67 3.1
pid_24 baseline placebo 5 4.91 3.8
pid_24 week_1 placebo 7 0.94 5.1
pid_24 week_7 placebo 4 1.03 4.5
pid_25 baseline treatment 3 2.84 3.9
pid_25 week_1 treatment 4 3.52 4.7
pid_25 week_7 treatment 2 0.49 3.7
pid_26 baseline treatment 7 0.94 5.6
pid_26 week_1 treatment 0 0.11 3.0
pid_26 week_7 treatment 4 0.29 4.8
pid_27 baseline placebo 7 1.17 5.5
pid_27 week_1 placebo 5 1.62 4.7
pid_27 week_7 placebo 8 0.76 4.7
pid_28 baseline treatment 3 0.67 2.9
pid_28 week_1 treatment 1 0.05 3.3
pid_28 week_7 treatment 1 0.22 3.5
pid_29 baseline placebo 7 2.39 5.8
pid_29 week_1 placebo 4 4.09 4.5
pid_29 week_7 placebo 3 3.13 3.5
pid_30 baseline placebo 7 0.85 4.8
pid_30 week_1 placebo 8 2.56 5.1
pid_30 week_7 placebo 7 1.62 5.2
pid_31 baseline treatment 6 1.78 4.4
pid_31 week_1 treatment 2 0.41 3.5
pid_31 week_7 treatment 2 1.36 2.8
pid_32 baseline treatment 5 4.83 4.9
pid_32 week_1 treatment 1 0.03 3.3
pid_32 week_7 treatment 3 0.21 3.8
pid_33 baseline treatment 6 5.26 4.6
pid_33 week_1 treatment 1 0.07 3.6
pid_33 week_7 treatment 2 1.92 3.3
pid_34 baseline placebo 8 3.16 5.4
pid_34 week_1 placebo 4 1.12 4.7
pid_34 week_7 placebo 7 2.34 5.3
pid_35 baseline placebo 8 0.74 5.3
pid_35 week_1 placebo 5 0.16 4.4
pid_35 week_7 placebo 3 1.97 3.9
pid_36 baseline placebo 8 1.21 5.1
pid_36 week_1 placebo 5 2.28 4.3
pid_36 week_7 placebo 8 1.10 4.8
pid_37 baseline treatment 5 1.16 4.8
pid_37 week_1 treatment 1 0.07 3.6
pid_37 week_7 treatment 2 0.70 3.2
pid_38 baseline placebo 8 0.41 5.1
pid_38 week_1 placebo 5 1.55 4.8
pid_38 week_7 placebo 4 3.22 4.5
pid_39 baseline treatment 6 1.61 4.6
pid_39 week_1 treatment 2 0.09 3.6
pid_39 week_7 treatment 5 0.77 4.7
pid_40 baseline treatment 3 1.48 3.1
pid_40 week_1 treatment 2 0.17 3.1
pid_40 week_7 treatment 6 0.21 4.5
pid_41 baseline treatment 4 1.51 4.3
pid_41 week_1 treatment 2 0.64 3.4
pid_41 week_7 treatment 4 0.78 4.4
pid_42 baseline placebo 6 0.91 4.7
pid_42 week_1 placebo 5 0.88 4.3
pid_42 week_7 placebo 7 3.06 5.3
pid_43 baseline placebo 6 1.08 4.7
pid_43 week_1 placebo 6 0.94 4.1
pid_43 week_7 placebo 6 1.79 4.1
pid_44 baseline treatment 6 0.48 4.4
pid_44 week_1 treatment 1 1.67 3.5
pid_44 week_7 treatment 3 0.60 3.4

Goal: Visualize the number of observations (participant visits) with each Nugent score

Use geom_bar() to count before plotting

```{r}
table_02 %>%
  ggplot(aes(y = nugent_score)) +
  geom_bar()
```
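A minimal sketch of what "count before plotting" means: `geom_bar()` counts the rows for us, while `geom_col()` expects the counts to already be a column in the data. The tiny `scores` tibble below holds the Nugent scores from the first six rows of the table above; the object names are made up for this example.

```{r}
library(tidyverse)

# Nugent scores from the first six rows of the clinical measurements table
scores <- tibble(nugent_score = factor(c(8, 7, 7, 7, 7, 4)))

# Option 1: let geom_bar() count the rows per score
p_bar <- ggplot(scores, aes(y = nugent_score)) +
  geom_bar()

# Option 2: count first with count(), then plot the precomputed n with geom_col()
score_counts <- scores %>%
  count(nugent_score)

p_col <- ggplot(score_counts, aes(x = n, y = nugent_score)) +
  geom_col()

score_counts
```

Both plots should show the same bars. The second form is handy when the counts are needed for something else as well, such as labels or a table.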

Getting the bars into the right order

```{r}
table_01 %>%
  ggplot(aes(y = education)) +
  geom_bar()
```

Getting the bars into the right order

```{r}
education_order <- c(
  "less than grade 9",
  "grade 10-12, not matriculated",
  "grade 10-12, matriculated",
  "post-secondary"
)
table_01 %>%
  mutate(education = fct_relevel(education, education_order)) %>%
  ggplot(aes(y = education)) +
  geom_bar()
```

Display counts by smoking and education

```{r}
table_01 %>%
  mutate(education = fct_relevel(education, education_order)) %>%
  ggplot(aes(y = education, fill = smoker)) +
  geom_bar()
```

Positions define how subgroups are shown

position = "dodge": Place bars for subgroups side-by-side

```{r}
table_01 %>%
  mutate(education = fct_relevel(education, education_order)) %>%
  ggplot(aes(y = education, fill = smoker)) +
  geom_bar(position = "dodge")
```

Positions define how subgroups are shown

position = "stack": Place bars for subgroups on top of each other

```{r}
table_01 %>%
  mutate(education = fct_relevel(education, education_order)) %>%
  ggplot(aes(y = education, fill = smoker)) +
  geom_bar(position = "stack")
```

Positions define how subgroups are shown

position = "fill": Like "stack", but scale to 100%

```{r}
table_01 %>%
  mutate(education = fct_relevel(education, education_order)) %>%
  ggplot(aes(y = education, fill = smoker)) +
  geom_bar(position = "fill")
```
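To make "scale to 100%" concrete, the proportions that `position = "fill"` displays can also be computed by hand. The sketch below uses a made-up toy dataset, not the workshop's `table_01`; only the column names `education` and `smoker` are borrowed from it, and the values in `smoker` are assumptions.

```{r}
library(tidyverse)

# Made-up toy data for illustration only (NOT the workshop's table_01)
toy <- tibble(
  education = c("post-secondary", "post-secondary", "post-secondary",
                "post-secondary", "less than grade 9", "less than grade 9"),
  smoker = c("smoker", "non-smoker", "non-smoker",
             "non-smoker", "smoker", "non-smoker")
)

# Count each education/smoker combination, then divide by the education total
toy_props <- toy %>%
  count(education, smoker) %>%
  group_by(education) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup()

# Stacking the precomputed proportions gives bars that all end at 1 (100%)
p_props <- ggplot(toy_props, aes(x = prop, y = education, fill = smoker)) +
  geom_col()

toy_props
```

Within each education level the proportions add up to 1, which is why every bar has the same total length and the group sizes themselves are no longer visible.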

Let’s take a poll

Go to the event on Wooclap

2 questions:

  1. M3. What’s the difference between geom_col and geom_bar?
  2. M3. What patterns did you see in the smoker CRP data (slide 49)?

Exercise

30:00

Time to try it yourself. Go back to the module.

Picture of a laptop with a red sticky note stuck to the top.

During an activity, place a yellow sticky on your laptop if you’re good to go and a pink sticky if you want help.