Intro to data visualization and data wrangling with the `tidyverse`

remixed from Claus O. Wilke’s SDS375 course

Workshop materials are at:

https://elsherbini.github.io/durban-data-science-for-biology/

Goals for this session

Get the big picture of data visualization
Learn how to wrangle data and make plots with the tidyverse

data wrangling (n.) - the art of taking data in one format and filtering, reshaping, and deriving values to make the data format you need.
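As a rough illustration of that definition, the sketch below filters, reshapes, and derives a value from four rows copied from the workshop's clinical measurements table. The object names `visits` and `ph_change` and the derived column `change` are made up for this example.

```{r}
library(tidyverse)

# Four rows copied from the clinical measurements table (two participants)
visits <- tibble(
  pid = c("pid_01", "pid_01", "pid_05", "pid_05"),
  time_point = c("baseline", "week_1", "baseline", "week_1"),
  arm = c("placebo", "placebo", "treatment", "treatment"),
  ph = c(5.7, 5.2, 4.9, 3.2)
)

ph_change <- visits %>%
  # filter: keep only the two visits we want to compare
  filter(time_point %in% c("baseline", "week_1")) %>%
  # reshape: one row per participant, one pH column per visit
  pivot_wider(names_from = time_point, values_from = ph) %>%
  # derive: change in pH between the two visits
  mutate(change = week_1 - baseline)

ph_change
```

The result has one row per participant, which is the shape you would want for comparing the change in pH between arms.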

Discussions: Discord

Ask questions in #workshop-questions on https://discord.gg/UDAsYTzZE.

Screenshot of the Discord server app that serves as the forum for the workshop.

Stickies

Picture of a laptop with a red sticky note stuck to the top.

During an activity, place a yellow sticky on your laptop if you’re good to go and a pink sticky if you want help.

Practicalities

WiFi:

Network: KTB Free Wifi (no password needed)

Network AHRI Password: @hR1W1F1!17

Network CAPRISA-Corp Password: corp@caprisa17

Bathrooms are out the lobby to your left

Group Pen and Paper exercise


Get with your group. Go to the activity

For the first 10 minutes think on your own
For 30 minutes discuss with your group and produce at least one plot
Have someone post a picture of the plot in the #pen-and-paper-activity channel.
Decide on one member of your group to present your plot (3 minute limit per group)

Presentation

Have one member from your group present the plot to everyone! 3 minute limit!


Aesthetics - the elements of data visualization

Plots map data onto graphical elements.

Table 1: 02_visit_clinical_measurements_UKZN_workshop_2023.csv

pid	time_point	arm	nugent_score	crp_blood	ph
pid_01	baseline	placebo	8	0.44	5.7
pid_01	week_1	placebo	7	1.66	5.2
pid_01	week_7	placebo	7	1.44	5.4
pid_02	baseline	placebo	7	1.55	5.2
pid_02	week_1	placebo	7	0.75	4.8
pid_02	week_7	placebo	4	1.17	4.2

pH mapped to y position

pH mapped to color

Commonly used aesthetics

Figure from Claus O. Wilke. Fundamentals of Data Visualization. O’Reilly, 2019

The same data values can be mapped to different aesthetics

Figure from Claus O. Wilke. Fundamentals of Data Visualization. O’Reilly, 2019

We can use many different aesthetics at once
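A minimal sketch of combining several aesthetics in one plot, using six rows copied from the clinical measurements table. The object names `visits` and `p_many` are chosen for this example, and the particular choice of which variable goes to which aesthetic is only one of many possibilities.

```{r}
library(tidyverse)

# Six rows copied from the clinical measurements table (two participants)
visits <- tibble(
  pid = rep(c("pid_01", "pid_05"), each = 3),
  time_point = rep(c("baseline", "week_1", "week_7"), times = 2),
  arm = rep(c("placebo", "treatment"), each = 3),
  crp_blood = c(0.44, 1.66, 1.44, 0.95, 0.19, 0.45),
  ph = c(5.7, 5.2, 5.4, 4.9, 3.2, 3.5)
)

# Five aesthetics at once: x position, y position, color, shape, and size
p_many <- ggplot(
  visits,
  aes(x = time_point, y = ph, color = arm, shape = arm, size = crp_blood)
) +
  geom_point()

p_many
```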

Creating aesthetic mappings in `ggplot`

We define the mapping with `aes()`

```{r}
table_02 %>%
  ggplot(mapping = aes(x = time_point, y = ph, color = ph)) +
  geom_jitter()
```

We frequently omit argument names

Long form, all arguments are named:

```{r}
#| eval: false

ggplot(
  data= table_02,
  mapping = aes(x = time_point, y = ph, color = ph)
) +
  geom_jitter()
```

We frequently omit argument names

Abbreviated form, common arguments remain unnamed:

```{r}
#| eval: false

ggplot(table_02, aes(x = time_point, y = ph, color = ph)) +
  geom_jitter()
```

The geom determines how the data is shown

```{r}
ggplot(table_02, aes(x = time_point, y = ph, color = ph)) +
  geom_point()
```

The geom determines how the data is shown

```{r}
ggplot(table_02, aes(x = time_point, y = ph, color = ph)) +
  geom_boxplot()
```

The geom determines how the data is shown

```{r}
ggplot(table_02, aes(x = time_point, y = ph, color = ph)) +
  geom_jitter()
```

Different geoms have parameters for control

```{r}
ggplot(table_02, aes(x = time_point, y = ph, color = ph)) +
  geom_jitter(size=3)
```

Different geoms have parameters for control

```{r}
ggplot(table_02, aes(x = time_point, y = ph, color = ph)) +
  geom_jitter(size=3, width = 0.2)
```
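Jittering is random, so the points land in slightly different places on every render. A sketch of one way to make the plot reproducible, assuming you are happy to swap `geom_jitter()` for the equivalent `geom_point()` with an explicit `position_jitter()`: the `seed` argument fixes the random offsets, and `height = 0` keeps the pH values exactly where the data put them. The small `visits` tibble stands in for `table_02`.

```{r}
library(tidyverse)

# Six rows copied from the clinical measurements table
visits <- tibble(
  time_point = rep(c("baseline", "week_1", "week_7"), times = 2),
  ph = c(5.7, 5.2, 5.4, 4.9, 3.2, 3.5)
)

p_jitter <- ggplot(visits, aes(x = time_point, y = ph, color = ph)) +
  geom_point(
    size = 3,
    # jitter horizontally only, with a fixed seed for reproducibility
    position = position_jitter(width = 0.2, height = 0, seed = 42)
  )

p_jitter
```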

Important: `color` and `fill` apply to different elements

color
Applies color to points, lines, text, borders

fill
Applies color to any filled areas

Many geoms have both `color` and `fill` aesthetics

```{r}
#| output-location: column
ggplot(
  data = table_02,
  mapping = aes(
    x = time_point,
    y = ph,
    color = time_point
  )
) + geom_boxplot()
```

Many geoms have both `color` and `fill` aesthetics

```{r}
#| output-location: column
ggplot(
  data = table_02,
  mapping = aes(
    x = time_point,
    y = ph,
    fill = time_point
  )
) + geom_boxplot()
```

Many geoms have both `color` and `fill` aesthetics

```{r}
#| output-location: column
ggplot(
  data = table_02,
  mapping = aes(
    x = time_point,
    y = ph,
    fill = time_point,
    color = time_point
  )
) + geom_boxplot()
```

Aesthetics can also be used as parameters in geoms

```{r}
#| output-location: column
ggplot(
  data = table_02,
  mapping = aes(
    x = time_point,
    y = ph
  )
) + geom_boxplot()
```

Aesthetics can also be used as parameters in geoms

```{r}
#| output-location: column
ggplot(
  data = table_02,
  mapping = aes(
    x = time_point,
    y = ph
  )
) + geom_boxplot(fill="orange")
```
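The difference between setting and mapping an aesthetic can be sketched side by side. Outside `aes()`, `fill = "orange"` is a fixed value applied to every box. Inside `aes()`, `fill` is tied to a data column and ggplot2 picks the colors and draws a legend. The small `visits` tibble stands in for `table_02`, and the plot names are made up for this example.

```{r}
library(tidyverse)

# Six rows copied from the clinical measurements table
visits <- tibble(
  time_point = rep(c("baseline", "week_1", "week_7"), times = 2),
  ph = c(5.7, 5.2, 5.4, 4.9, 3.2, 3.5)
)

# Setting: a constant passed to the geom, same color for every box, no legend
p_set <- ggplot(visits, aes(x = time_point, y = ph)) +
  geom_boxplot(fill = "orange")

# Mapping: fill follows the data, one color per time point, with a legend
p_map <- ggplot(visits, aes(x = time_point, y = ph, fill = time_point)) +
  geom_boxplot()

# Common pitfall: aes(fill = "orange") treats "orange" as a data value,
# so the boxes get a default palette color and a legend entry named "orange".

p_set
p_map
```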

Exercise

Time to try it yourself (30 minutes). Go to the first coding exercise.

Picture of a laptop with a red sticky note stuck to the top.

During an activity, place a yellow sticky on your laptop if you’re good to go and a pink sticky if you want help.

Visualizing amounts

We often encounter datasets containing simple amounts

Example: Highest-grossing movies of 2023 to date

rank	title	amount
1	Barbie	1437.8
2	The Super Mario Bros Movie	1361.9
3	Oppenheimer	939.3
4	Guardians of the Galaxy 3	845.5
5	The Little Mermaid	569.6

Millions USD. Data source: Box Office Mojo

We can visualize amounts with bar plots

Bars can also run horizontally

Avoid rotated axis labels

Avoid rotated axis labels - flip the axes!

Pay attention to the order of the bars

We can use dots instead of bars

Dots are preferable if we want to truncate the axes

bar lengths do not accurately represent the data values

Dots are preferable if we want to truncate the axes

key features of the data are obscured

Dots are preferable if we want to truncate the axes
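A minimal sketch of a dot plot with a truncated axis, using the box office amounts from the table above. Dots encode each value by position rather than by length, so the axis does not have to start at zero. The limits 500 and 1500 are chosen for this example only.

```{r}
library(tidyverse)

# Box office amounts in million USD, copied from the table above
boxoffice <- tibble(
  title = c(
    "Barbie", "The Super Mario Bros Movie", "Oppenheimer",
    "Guardians of the Galaxy 3", "The Little Mermaid"
  ),
  amount = c(1437.8, 1361.9, 939.3, 845.5, 569.6)
)

p_dots <- ggplot(boxoffice, aes(amount, fct_reorder(title, amount))) +
  geom_point(size = 3) +
  xlim(500, 1500) + # truncated axis: does not start at zero
  xlab("amount (in million USD)") +
  ylab(NULL)

p_dots
```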

Let’s take a poll

Go to the event on wooclap

M3. Do you think it makes sense to truncate the axes for the life expectancy data?

Grouped bars

We use grouped bars for higher-dimensional datasets

We are free to choose by which variable to group
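A sketch of the two grouping choices, assuming a tiny stand-in dataset: the Nugent scores of one participant from each arm, copied from the clinical measurements table. The same six values are drawn twice, once grouped by time point and once grouped by arm. The object names are made up for this example.

```{r}
library(tidyverse)

# Nugent scores for pid_01 (placebo) and pid_05 (treatment)
visits <- tibble(
  time_point = rep(c("baseline", "week_1", "week_7"), times = 2),
  arm = rep(c("placebo", "treatment"), each = 3),
  nugent_score = c(8, 7, 7, 8, 3, 2)
)

# Group by time point: compare the arms within each visit
p_by_time <- ggplot(visits, aes(x = time_point, y = nugent_score, fill = arm)) +
  geom_col(position = "dodge")

# Group by arm: compare the visits within each arm
p_by_arm <- ggplot(visits, aes(x = arm, y = nugent_score, fill = time_point)) +
  geom_col(position = "dodge")

p_by_time
p_by_arm
```

Which version is better depends on the comparison you want the reader to make: bars that sit next to each other are the easiest to compare.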

Making bar plots using `ggplot2`

The simple dataset

```{r}
library(tidyverse) # provides tibble(), ggplot2, and forcats

# Data from Box Office Mojo for 2023.
boxoffice <- tibble(
  rank = 1:5,
  title = c(
    "Barbie", "The Super Mario Bros Movie", "Oppenheimer",
    "Guardians of the Galaxy 3", "The Little Mermaid"
  ),
  amount = c(1437.8, 1361.9, 939.3, 845.5, 569.6) # million USD
)
```

rank	title	amount
1	Barbie	1437.8
2	The Super Mario Bros Movie	1361.9
3	Oppenheimer	939.3
4	Guardians of the Galaxy 3	845.5
5	The Little Mermaid	569.6

Visualize as a bar plot

```{r}
ggplot(boxoffice, aes(title, amount)) +
  geom_col() # "col" stands for column
```

Order by data value

```{r}
ggplot(boxoffice, aes(fct_reorder(title, amount), amount)) +
  geom_col()
```

Order by data value, descending

```{r}
ggplot(boxoffice, aes(fct_reorder(title, -amount), amount)) +
  geom_col() +
  xlab(NULL) # remove x axis label
```
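Negating the value works for numbers. As an alternative sketch, `fct_reorder()` also has a `.desc` argument that reverses the order without touching the data. The comparison below checks that both approaches produce the same level order for the box office data. The names `by_negation` and `by_desc` are made up for this example.

```{r}
library(tidyverse)

# Box office amounts in million USD, copied from the table above
boxoffice <- tibble(
  title = c(
    "Barbie", "The Super Mario Bros Movie", "Oppenheimer",
    "Guardians of the Galaxy 3", "The Little Mermaid"
  ),
  amount = c(1437.8, 1361.9, 939.3, 845.5, 569.6)
)

# Two ways to order the titles from largest to smallest amount
by_negation <- fct_reorder(boxoffice$title, -boxoffice$amount)
by_desc <- fct_reorder(boxoffice$title, boxoffice$amount, .desc = TRUE)

# Same level order either way
identical(levels(by_negation), levels(by_desc))

ggplot(boxoffice, aes(fct_reorder(title, amount, .desc = TRUE), amount)) +
  geom_col() +
  xlab(NULL)
```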

Flip x and y, set custom x axis label

```{r}
ggplot(boxoffice, aes(amount, fct_reorder(title, amount))) +
  geom_col() +
  xlab("amount (in million USD)") +
  ylab(NULL)
```

Sometimes we need to count before visualization

```{r}
library(here)
library(tidyverse)

# Read the visit-level measurements and treat the Nugent score as a category
table_02 <- read_csv(
  here("datasets/instructional_dataset/02_visit_clinical_measurements_UKZN_workshop_2023.csv")
) %>%
  mutate(nugent_score = as_factor(nugent_score))
```

pid	time_point	arm	nugent_score	crp_blood	ph
pid_01	baseline	placebo	8	0.44	5.7
pid_01	week_1	placebo	7	1.66	5.2
pid_01	week_7	placebo	7	1.44	5.4
pid_02	baseline	placebo	7	1.55	5.2
pid_02	week_1	placebo	7	0.75	4.8
pid_02	week_7	placebo	4	1.17	4.2
pid_03	baseline	placebo	6	1.78	4.8
pid_03	week_1	placebo	10	0.57	5.3
pid_03	week_7	placebo	7	1.79	5.2
pid_04	baseline	placebo	5	1.76	4.8
pid_04	week_1	placebo	9	2.58	5.1
pid_04	week_7	placebo	7	5.68	5.4
pid_05	baseline	treatment	8	0.95	4.9
pid_05	week_1	treatment	3	0.19	3.2
pid_05	week_7	treatment	2	0.45	3.5
pid_06	baseline	placebo	10	4.03	5.3
pid_06	week_1	placebo	8	1.72	5.6
pid_06	week_7	placebo	8	3.19	5.0
pid_07	baseline	placebo	7	0.10	5.2
pid_07	week_1	placebo	7	1.36	4.9
pid_07	week_7	placebo	5	0.38	5.1
pid_08	baseline	placebo	9	3.18	5.4
pid_08	week_1	placebo	5	1.55	4.8
pid_08	week_7	placebo	7	1.77	5.0
pid_09	baseline	treatment	5	2.13	4.9
pid_09	week_1	treatment	3	0.27	3.6
pid_09	week_7	treatment	4	1.04	4.2
pid_10	baseline	treatment	8	0.98	4.9
pid_10	week_1	treatment	0	0.01	3.5
pid_10	week_7	treatment	1	2.87	2.9
pid_11	baseline	treatment	7	0.31	5.0
pid_11	week_1	treatment	1	0.10	3.3
pid_11	week_7	treatment	4	1.15	5.1
pid_12	baseline	placebo	8	2.42	5.0
pid_12	week_1	placebo	6	0.64	4.5
pid_12	week_7	placebo	9	4.36	5.2
pid_13	baseline	placebo	8	2.69	5.1
pid_13	week_1	placebo	7	2.57	5.5
pid_13	week_7	placebo	8	1.98	4.8
pid_14	baseline	placebo	7	0.34	5.3
pid_14	week_1	placebo	5	2.07	4.2
pid_14	week_7	placebo	7	5.06	5.1
pid_15	baseline	treatment	7	0.29	4.8
pid_15	week_1	treatment	3	0.84	3.4
pid_15	week_7	treatment	3	0.68	3.5
pid_16	baseline	treatment	6	1.91	5.7
pid_16	week_1	treatment	0	0.03	3.7
pid_16	week_7	treatment	2	0.50	3.2
pid_17	baseline	treatment	5	1.39	4.8
pid_17	week_1	treatment	2	0.00	3.3
pid_17	week_7	treatment	3	0.90	3.7
pid_18	baseline	treatment	6	0.45	4.3
pid_18	week_1	treatment	1	1.81	3.6
pid_18	week_7	treatment	6	0.41	3.9
pid_19	baseline	placebo	7	1.34	5.3
pid_19	week_1	placebo	5	2.91	4.3
pid_19	week_7	placebo	5	1.27	4.5
pid_20	baseline	placebo	4	0.86	4.3
pid_20	week_1	placebo	8	1.45	5.2
pid_20	week_7	placebo	5	3.95	4.9
pid_21	baseline	treatment	5	0.50	4.6
pid_21	week_1	treatment	1	1.60	3.4
pid_21	week_7	treatment	4	1.23	4.8
pid_22	baseline	treatment	6	1.10	4.0
pid_22	week_1	treatment	3	0.58	4.2
pid_22	week_7	treatment	6	1.67	5.1
pid_23	baseline	placebo	8	0.99	5.4
pid_23	week_1	placebo	8	0.80	5.5
pid_23	week_7	placebo	3	3.67	3.1
pid_24	baseline	placebo	5	4.91	3.8
pid_24	week_1	placebo	7	0.94	5.1
pid_24	week_7	placebo	4	1.03	4.5
pid_25	baseline	treatment	3	2.84	3.9
pid_25	week_1	treatment	4	3.52	4.7
pid_25	week_7	treatment	2	0.49	3.7
pid_26	baseline	treatment	7	0.94	5.6
pid_26	week_1	treatment	0	0.11	3.0
pid_26	week_7	treatment	4	0.29	4.8
pid_27	baseline	placebo	7	1.17	5.5
pid_27	week_1	placebo	5	1.62	4.7
pid_27	week_7	placebo	8	0.76	4.7
pid_28	baseline	treatment	3	0.67	2.9
pid_28	week_1	treatment	1	0.05	3.3
pid_28	week_7	treatment	1	0.22	3.5
pid_29	baseline	placebo	7	2.39	5.8
pid_29	week_1	placebo	4	4.09	4.5
pid_29	week_7	placebo	3	3.13	3.5
pid_30	baseline	placebo	7	0.85	4.8
pid_30	week_1	placebo	8	2.56	5.1
pid_30	week_7	placebo	7	1.62	5.2
pid_31	baseline	treatment	6	1.78	4.4
pid_31	week_1	treatment	2	0.41	3.5
pid_31	week_7	treatment	2	1.36	2.8
pid_32	baseline	treatment	5	4.83	4.9
pid_32	week_1	treatment	1	0.03	3.3
pid_32	week_7	treatment	3	0.21	3.8
pid_33	baseline	treatment	6	5.26	4.6
pid_33	week_1	treatment	1	0.07	3.6
pid_33	week_7	treatment	2	1.92	3.3
pid_34	baseline	placebo	8	3.16	5.4
pid_34	week_1	placebo	4	1.12	4.7
pid_34	week_7	placebo	7	2.34	5.3
pid_35	baseline	placebo	8	0.74	5.3
pid_35	week_1	placebo	5	0.16	4.4
pid_35	week_7	placebo	3	1.97	3.9
pid_36	baseline	placebo	8	1.21	5.1
pid_36	week_1	placebo	5	2.28	4.3
pid_36	week_7	placebo	8	1.10	4.8
pid_37	baseline	treatment	5	1.16	4.8
pid_37	week_1	treatment	1	0.07	3.6
pid_37	week_7	treatment	2	0.70	3.2
pid_38	baseline	placebo	8	0.41	5.1
pid_38	week_1	placebo	5	1.55	4.8
pid_38	week_7	placebo	4	3.22	4.5
pid_39	baseline	treatment	6	1.61	4.6
pid_39	week_1	treatment	2	0.09	3.6
pid_39	week_7	treatment	5	0.77	4.7
pid_40	baseline	treatment	3	1.48	3.1
pid_40	week_1	treatment	2	0.17	3.1
pid_40	week_7	treatment	6	0.21	4.5
pid_41	baseline	treatment	4	1.51	4.3
pid_41	week_1	treatment	2	0.64	3.4
pid_41	week_7	treatment	4	0.78	4.4
pid_42	baseline	placebo	6	0.91	4.7
pid_42	week_1	placebo	5	0.88	4.3
pid_42	week_7	placebo	7	3.06	5.3
pid_43	baseline	placebo	6	1.08	4.7
pid_43	week_1	placebo	6	0.94	4.1
pid_43	week_7	placebo	6	1.79	4.1
pid_44	baseline	treatment	6	0.48	4.4
pid_44	week_1	treatment	1	1.67	3.5
pid_44	week_7	treatment	3	0.60	3.4

Goal: Visualize the number of people with each Nugent score

Use `geom_bar()` to count before plotting

```{r}
table_02 %>%
  ggplot(aes(y = nugent_score)) +
  geom_bar()
```
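To see what the counting step does, here is a sketch that counts by hand with `count()` and then draws the result with `geom_col()`, next to the `geom_bar()` version. Both plots show the same bar lengths. The `scores` tibble holds the first six Nugent scores from the table above, and the object names are made up for this example.

```{r}
library(tidyverse)

# First six Nugent scores from the table above, as a categorical variable
scores <- tibble(nugent_score = factor(c(8, 7, 7, 7, 7, 4)))

# Count by hand: one row per score, with the number of rows in column n
score_counts <- scores %>% count(nugent_score)

# geom_bar() counts the rows for us
p_bar <- ggplot(scores, aes(y = nugent_score)) +
  geom_bar()

# geom_col() draws values that are already in the data
p_col <- ggplot(score_counts, aes(x = n, y = nugent_score)) +
  geom_col()

score_counts
p_bar
p_col
```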

Getting the bars into the right order

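```{r}
# table_01 holds participant-level data (including education);
# it is assumed to be loaded already.
table_01 %>%
  ggplot(aes(y = education)) +
  geom_bar()
```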

Getting the bars into the right order

```{r}
education_order <- c(
  "less than grade 9", "grade 10-12, not matriculated",
  "grade 10-12, matriculated", "post-secondary"
)
table_01 %>%
  mutate(education = fct_relevel(education, education_order)) %>%
  ggplot(aes(y = education)) +
  geom_bar()
```
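A small sketch of what `fct_relevel()` changes. By default a character column is ordered alphabetically, which is why the bars come out in an unhelpful order. After releveling, the levels follow `education_order`. The `education` vector below is a hypothetical set of responses, not the workshop data.

```{r}
library(tidyverse)

education_order <- c(
  "less than grade 9", "grade 10-12, not matriculated",
  "grade 10-12, matriculated", "post-secondary"
)

# Hypothetical responses, made up for this example
education <- c(
  "post-secondary", "less than grade 9", "grade 10-12, matriculated",
  "grade 10-12, not matriculated", "post-secondary"
)

# Default: levels are sorted alphabetically
levels(factor(education))

# After releveling: levels follow the order we specified
releveled <- fct_relevel(education, education_order)
levels(releveled)
```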

Display counts by smoking and education

```{r}
table_01 %>%
  mutate(education = fct_relevel(education, education_order)) %>%
  ggplot(aes(y = education, fill = smoker)) +
  geom_bar()
```

Positions define how subgroups are shown

position = "dodge": Place bars for subgroups side-by-side

```{r}
table_01 %>%
  mutate(education = fct_relevel(education, education_order)) %>%
  ggplot(aes(y = education, fill = smoker)) +
  geom_bar(position = "dodge")
```

Positions define how subgroups are shown

position = "stack": Place bars for subgroups on top of each other

```{r}
table_01 %>%
  mutate(education = fct_relevel(education, education_order)) %>%
  ggplot(aes(y = education, fill = smoker)) +
  geom_bar(position = "stack")
```

Positions define how subgroups are shown

position = "fill": Like "stack", but scale to 100%

```{r}
table_01 %>%
  mutate(education = fct_relevel(education, education_order)) %>%
  ggplot(aes(y = education, fill = smoker)) +
  geom_bar(position = "fill")
```
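To make the "scale to 100%" step concrete, the sketch below computes the same proportions by hand: count each subgroup, then divide by the total within each bar. The `survey` tibble is hypothetical data made up for this example, not the workshop's `table_01`.

```{r}
library(tidyverse)

# Hypothetical responses, made up for this example
survey <- tibble(
  education = c(
    rep("less than grade 9", 4),
    rep("post-secondary", 2)
  ),
  smoker = c("yes", "no", "no", "no", "yes", "no")
)

# What position = "fill" shows: the share of each subgroup within a bar
proportions <- survey %>%
  count(education, smoker) %>%
  group_by(education) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup()

proportions

# The same shares, drawn by ggplot2
p_fill <- ggplot(survey, aes(y = education, fill = smoker)) +
  geom_bar(position = "fill")

p_fill
```

Within each education level the proportions add up to 1, which is why every bar in the plot has the same total length.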

Let’s take a poll

Go to the event on wooclap

Two questions:

M3. What’s the difference between `geom_col()` and `geom_bar()`?
M3. What patterns did you see in the smoker CRP data (slide 49)?

Exercise

Time to try it yourself (30 minutes). Go back to the module.

During an activity, place a yellow sticky on your laptop if you’re good to go and a pink sticky if you want help.
