```{r}
table_01 %>%
count(arm)
```
# A tibble: 2 × 2
arm n
<chr> <int>
1 placebo 23
2 treatment 21
tidyverse
In this exercise, we will continue with basic data manipulations, now moving on to grouping and summarizing, making data tables wider or longer, and joining data tables.
We will be using the R package, tidyverse for the data manipulation functions %>%
, group_by()
, summarize()
, pivot_wider()
, pivot_longer()
, and join functions such as left_join()
Paste the following into the top code chunk of your qmd
file.
Download the data files from the dataset page and place all 5 files into this directory.
Paste the following code chunk into a new qmd file in this project:
library(tidyverse)
table_01 <- read_csv("01_participant_metadata_UKZN_workshop_2023.csv")
table_02 <- read_csv("02_visit_clinical_measurements_UKZN_workshop_2023.csv")
In many data analysis settings, we want to break a dataset into subsets and then perform some summary calculation on each subset. The simplest example is counting, which we have done previously with the count() function.
This function subdivides the penguins dataset into subsets for each species and then calculates the number n for each subset, which is the number of observations in each subset.
The function count()
here does all the work for us, but what if instead of counting we wanted to calculate the mean weight of the penguins for each species, or calculate the mean weight and count at the same time? We need a general framework that allows us to do these kinds of calculations with maximum flexibility.
The tidyverse approach is to first group a dataset with group_by()
and then to calculate grouped summaries with summarize()
.
Let’s first consider just grouping. If we look at the raw R output of just the penguins table or the penguins table after running it through group_by(arm)
, we see that the table is the same, except in the second case there is a line # Groups: arm [2]
which indicates that the table is grouped by arm and there are two groups. (Here, we need to pipe the tables into the print() function to see the raw R output instead of a formatted table that would hide the grouping information.)
# A tibble: 44 × 6
pid arm smoker age education sex
<chr> <chr> <chr> <dbl> <chr> <lgl>
1 pid_01 placebo non-smoker 26 grade 10-12, matriculated FALSE
2 pid_02 placebo smoker 33 grade 10-12, matriculated FALSE
3 pid_03 placebo smoker 30 post-secondary FALSE
4 pid_04 placebo non-smoker 34 grade 10-12, not matriculated FALSE
5 pid_05 treatment non-smoker 29 grade 10-12, matriculated FALSE
6 pid_06 placebo smoker 34 post-secondary FALSE
7 pid_07 placebo non-smoker 31 grade 10-12, not matriculated FALSE
8 pid_08 placebo smoker 30 grade 10-12, not matriculated FALSE
9 pid_09 treatment non-smoker 35 grade 10-12, not matriculated FALSE
10 pid_10 treatment non-smoker 32 less than grade 9 FALSE
# ℹ 34 more rows
# A tibble: 44 × 6
# Groups: arm [2]
pid arm smoker age education sex
<chr> <chr> <chr> <dbl> <chr> <lgl>
1 pid_01 placebo non-smoker 26 grade 10-12, matriculated FALSE
2 pid_02 placebo smoker 33 grade 10-12, matriculated FALSE
3 pid_03 placebo smoker 30 post-secondary FALSE
4 pid_04 placebo non-smoker 34 grade 10-12, not matriculated FALSE
5 pid_05 treatment non-smoker 29 grade 10-12, matriculated FALSE
6 pid_06 placebo smoker 34 post-secondary FALSE
7 pid_07 placebo non-smoker 31 grade 10-12, not matriculated FALSE
8 pid_08 placebo smoker 30 grade 10-12, not matriculated FALSE
9 pid_09 treatment non-smoker 35 grade 10-12, not matriculated FALSE
10 pid_10 treatment non-smoker 32 less than grade 9 FALSE
# ℹ 34 more rows
We can also group by multiple data columns at once, and we can undo any grouping with ungroup()
.
# A tibble: 44 × 6
# Groups: arm, education [8]
pid arm smoker age education sex
<chr> <chr> <chr> <dbl> <chr> <lgl>
1 pid_01 placebo non-smoker 26 grade 10-12, matriculated FALSE
2 pid_02 placebo smoker 33 grade 10-12, matriculated FALSE
3 pid_03 placebo smoker 30 post-secondary FALSE
4 pid_04 placebo non-smoker 34 grade 10-12, not matriculated FALSE
5 pid_05 treatment non-smoker 29 grade 10-12, matriculated FALSE
6 pid_06 placebo smoker 34 post-secondary FALSE
7 pid_07 placebo non-smoker 31 grade 10-12, not matriculated FALSE
8 pid_08 placebo smoker 30 grade 10-12, not matriculated FALSE
9 pid_09 treatment non-smoker 35 grade 10-12, not matriculated FALSE
10 pid_10 treatment non-smoker 32 less than grade 9 FALSE
# ℹ 34 more rows
# A tibble: 44 × 6
pid arm smoker age education sex
<chr> <chr> <chr> <dbl> <chr> <lgl>
1 pid_01 placebo non-smoker 26 grade 10-12, matriculated FALSE
2 pid_02 placebo smoker 33 grade 10-12, matriculated FALSE
3 pid_03 placebo smoker 30 post-secondary FALSE
4 pid_04 placebo non-smoker 34 grade 10-12, not matriculated FALSE
5 pid_05 treatment non-smoker 29 grade 10-12, matriculated FALSE
6 pid_06 placebo smoker 34 post-secondary FALSE
7 pid_07 placebo non-smoker 31 grade 10-12, not matriculated FALSE
8 pid_08 placebo smoker 30 grade 10-12, not matriculated FALSE
9 pid_09 treatment non-smoker 35 grade 10-12, not matriculated FALSE
10 pid_10 treatment non-smoker 32 less than grade 9 FALSE
# ℹ 34 more rows
Now try this yourself. Group the table_01 dataset by education_level and smoker.
Answer for yourself How many distinct groups did this produce?
Now undo the previous grouping.
Also verify what the output looks like when you omit the print()
function at the end.
Once we have set up a grouping for a data table, we can then calculate summary data with the summarize()
function. This function works similarly to mutate()
, in that we provide it with statements of the form <new column name> = <computation>
, where <new column name>
stands for the name of the new column that is being created and <computation>
stands for the computation that is used to generate the values in the new column.
As an example, using table_02
we want to calculate the median ph of participants, by arm
, we could write summarise(median_ph = median(ph))
, and this would create a new column called median_ph
Try this out. First group by arm
and then make the new column:
Now see what it looks like if you instead group by timepoint
Now try grouping by both timepoint
and arm
We can perform multiple summaries at once by adding more statements inside the summarise() function, separated by a ,
. To try this out, calculate the median nugent
in addition to the median ph
When performing summaries, we often want to know how many observations there are in each group (i.e., we want to count). We can do this with the function n()
, which inside summarise() gives us the group size. So, we can count by adding a statement such as count = n()
inside `summarise(). Try this out.
For efficient data processing, we usually want tables in long form, where each columns is one variable and each row is one observation. However, in some applications, for example when making a table easier to read for humans, a wide format can be preferred. In a wide format, some variables are displayed as column names, and other variables are distributed over multiple columns.
First, make a summary table that shows median ph
by arm
and time_point
, just like you did above, and save it to a variable ph_summary_long
Now, try using pivot_wider()
to make a column for each arm. Remember, use ?pivot_wider
if you want help, and try asking google or chatGPT if you get stuck.
What if you wanted to instead make a column for each time point, and have the arms be different rows?
Finally, we sometimes encounter the situation where we have two data sets that contain different pieces of information about the same subjects or objects, and we need to merge these tables for further analysis. In this situation, we need to perform a join, and there are multiple different types of joins available: left_join(), right_join(), inner_join(), full_join(). These joins all differ in how they handle cases where an observation is present in only one of the two tables but missing in the other.
Our instructional dataset has no missing values, so all types of joins are actually equivalent. Try to join table_01
and table_02
using left_join()