Introduction
In this worksheet, we will continue with basic data manipulations, now moving on to grouping and summarizing, making data tables wider or longer, and joining data tables.
We will be using two R packages, tidyverse for the
data manipulation functions %>%
,
group_by()
, summarize()
,
pivot_wider()
, pivot_longer()
, and join
functions such as left_join()
, and
palmerpenguins for the penguins
dataset.
Copy the below code into a chunk at the top. The
penguins_long
, penguins_wide
,
TX_population
, and TX_area
tables are for
later exercises.
```{r library-calls}
# load required libraries
library(tidyverse)
library(palmerpenguins)
# long and wide summary tables to be used later
penguins_long <- penguins %>%
filter(!is.na(sex)) %>%
group_by(species, sex) %>%
summarize(mean_weight = mean(body_mass_g))
penguins_wide <- penguins_long %>%
pivot_wider(names_from = "sex", values_from = "mean_weight")
# for join examples at the end
TX_population <- read_csv(file =
"city,year,population
Houston,2014,2239558
San Antonio,2014,1436697
Austin,2014,912791
Austin,2010,790390")
TX_area <- read_csv(file =
"city,area
Houston,607.5
Dallas,385.6
Austin,307.2")
```
First, we will be working with the dataset penguins
containing data on individual penguins on Antarctica.
```{r}
penguins
```
# A tibble: 344 × 8
species island bill_length_mm bill_depth_mm flipper_…¹ body_…² sex year
<fct> <fct> <dbl> <dbl> <int> <int> <fct> <int>
1 Adelie Torgersen 39.1 18.7 181 3750 male 2007
2 Adelie Torgersen 39.5 17.4 186 3800 fema… 2007
3 Adelie Torgersen 40.3 18 195 3250 fema… 2007
4 Adelie Torgersen NA NA NA NA <NA> 2007
5 Adelie Torgersen 36.7 19.3 193 3450 fema… 2007
6 Adelie Torgersen 39.3 20.6 190 3650 male 2007
7 Adelie Torgersen 38.9 17.8 181 3625 fema… 2007
8 Adelie Torgersen 39.2 19.6 195 4675 male 2007
9 Adelie Torgersen 34.1 18.1 193 3475 <NA> 2007
10 Adelie Torgersen 42 20.2 190 4250 <NA> 2007
# … with 334 more rows, and abbreviated variable names ¹flipper_length_mm,
# ²body_mass_g
Analyzing subsets
In many data analysis settings, we want to break a dataset into
subsets and then perform some summary calculation on each subset. The
simplest example is counting, which we have done previously with the
count()
function.
```{r count-demo, echo=TRUE}
penguins %>%
count(species)
```
# A tibble: 3 × 2
species n
<fct> <int>
1 Adelie 152
2 Chinstrap 68
3 Gentoo 124
This function subdivides the penguins
dataset into
subsets for each species and then calculates the number n
for each subset, which is the number of observations in each subset.
The function count()
here does all the work for us, but
what if instead of counting we wanted to calculate the mean weight of
the penguins for each species, or calculate the mean weight and count at
the same time? We need a general framework that allows us to do these
kinds of calculations with maximum flexibility.
The tidyverse approach is to first group a dataset with
group_by()
and then to calculate grouped summaries with
summarize()
.
Grouping
Let’s first consider just grouping. If we look at the raw R output of
just the penguins
table or the penguins
table
after running it through group_by(species)
, we see that the
table is the same, except in the second case there is a line
# Groups: species [3]
which indicates that the table is
grouped by species and there are three groups. (Here, we need to pipe
the tables into the print()
function to see the raw R
output instead of a formatted table that would hide the grouping
information.)
```{r}
penguins %>%
print()
```
# A tibble: 344 × 8
species island bill_length_mm bill_depth_mm flipper_…¹ body_…² sex year
<fct> <fct> <dbl> <dbl> <int> <int> <fct> <int>
1 Adelie Torgersen 39.1 18.7 181 3750 male 2007
2 Adelie Torgersen 39.5 17.4 186 3800 fema… 2007
3 Adelie Torgersen 40.3 18 195 3250 fema… 2007
4 Adelie Torgersen NA NA NA NA <NA> 2007
5 Adelie Torgersen 36.7 19.3 193 3450 fema… 2007
6 Adelie Torgersen 39.3 20.6 190 3650 male 2007
7 Adelie Torgersen 38.9 17.8 181 3625 fema… 2007
8 Adelie Torgersen 39.2 19.6 195 4675 male 2007
9 Adelie Torgersen 34.1 18.1 193 3475 <NA> 2007
10 Adelie Torgersen 42 20.2 190 4250 <NA> 2007
# … with 334 more rows, and abbreviated variable names ¹flipper_length_mm,
# ²body_mass_g
```{r}
penguins %>%
group_by(species) %>%
print()
```
# A tibble: 344 × 8
# Groups: species [3]
species island bill_length_mm bill_depth_mm flipper_…¹ body_…² sex year
<fct> <fct> <dbl> <dbl> <int> <int> <fct> <int>
1 Adelie Torgersen 39.1 18.7 181 3750 male 2007
2 Adelie Torgersen 39.5 17.4 186 3800 fema… 2007
3 Adelie Torgersen 40.3 18 195 3250 fema… 2007
4 Adelie Torgersen NA NA NA NA <NA> 2007
5 Adelie Torgersen 36.7 19.3 193 3450 fema… 2007
6 Adelie Torgersen 39.3 20.6 190 3650 male 2007
7 Adelie Torgersen 38.9 17.8 181 3625 fema… 2007
8 Adelie Torgersen 39.2 19.6 195 4675 male 2007
9 Adelie Torgersen 34.1 18.1 193 3475 <NA> 2007
10 Adelie Torgersen 42 20.2 190 4250 <NA> 2007
# … with 334 more rows, and abbreviated variable names ¹flipper_length_mm,
# ²body_mass_g
We can also group by multiple data columns at once, and we can undo
any grouping with ungroup()
.
```{r echo = TRUE, eval=TRUE}
penguins %>%
group_by(species, island) %>%
print()
```
# A tibble: 344 × 8
# Groups: species, island [5]
species island bill_length_mm bill_depth_mm flipper_…¹ body_…² sex year
<fct> <fct> <dbl> <dbl> <int> <int> <fct> <int>
1 Adelie Torgersen 39.1 18.7 181 3750 male 2007
2 Adelie Torgersen 39.5 17.4 186 3800 fema… 2007
3 Adelie Torgersen 40.3 18 195 3250 fema… 2007
4 Adelie Torgersen NA NA NA NA <NA> 2007
5 Adelie Torgersen 36.7 19.3 193 3450 fema… 2007
6 Adelie Torgersen 39.3 20.6 190 3650 male 2007
7 Adelie Torgersen 38.9 17.8 181 3625 fema… 2007
8 Adelie Torgersen 39.2 19.6 195 4675 male 2007
9 Adelie Torgersen 34.1 18.1 193 3475 <NA> 2007
10 Adelie Torgersen 42 20.2 190 4250 <NA> 2007
# … with 334 more rows, and abbreviated variable names ¹flipper_length_mm,
# ²body_mass_g
```{r}
penguins %>%
group_by(species) %>%
ungroup() %>%
print()
```
# A tibble: 344 × 8
species island bill_length_mm bill_depth_mm flipper_…¹ body_…² sex year
<fct> <fct> <dbl> <dbl> <int> <int> <fct> <int>
1 Adelie Torgersen 39.1 18.7 181 3750 male 2007
2 Adelie Torgersen 39.5 17.4 186 3800 fema… 2007
3 Adelie Torgersen 40.3 18 195 3250 fema… 2007
4 Adelie Torgersen NA NA NA NA <NA> 2007
5 Adelie Torgersen 36.7 19.3 193 3450 fema… 2007
6 Adelie Torgersen 39.3 20.6 190 3650 male 2007
7 Adelie Torgersen 38.9 17.8 181 3625 fema… 2007
8 Adelie Torgersen 39.2 19.6 195 4675 male 2007
9 Adelie Torgersen 34.1 18.1 193 3475 <NA> 2007
10 Adelie Torgersen 42 20.2 190 4250 <NA> 2007
# … with 334 more rows, and abbreviated variable names ¹flipper_length_mm,
# ²body_mass_g
Now try this yourself. Group the penguins dataset by sex
and island
.
```{r group-basic}
penguins %>%
___ %>%
print()
```
Answer for yourself How many distinct groups did the previous exercise produce?
Now undo the previous grouping.
```{r ungroup}
# build all the code for this exercise
```
Also verify what the output looks like when you omit the
print()
function at the end.
Performing summaries
Once we have set up a grouping for a data table, we can then
calculate summary data with the summarize()
function. This
function works similarly to mutate()
, in that we provide it
with statements of the form
<new column name> = <computation>
, where
<new column name>
stands for the name of the new
column that is being created and <computation>
stands
for the computation that is used to generate the values in the new
column. As an example, if we want to calculate the mean weight (body
mass) of penguins, we could write
summarize(mean_weight = mean(body_mass_g))
, and this would
create a new column called mean_weight
.
Try this out. First group by sex
and then calculate the
mean weight for each sex.
```{r summarize-simple}
penguins %>%
group_by(sex) %>%
___
```
We see that male penguins on average are heavier than female
penguins. We also see that there is a row containing NA
s.
This happens because there are a few entries in the dataset for which we
know neither the penguins’ sex nor their weight.
Next, see if the pattern changes if we subdivide the dataset by species.
```{r summarize-simple2}
penguins %>%
group_by(___) %>%
___
```
When running this code, you see a message that
summarise()
has grouped the output by species
.
This happens because if you group by multiple variables and then perform
a summary it usually makes sense to keep all but the innermost
groupings. Verify this is the case by piping the output from
summarize()
into print()
.
We can perform multiple summaries at once by adding more statements
inside the summarize()
function. To try this out, calculate
the mean bill length in addition to the mean weight.
```{r summarize-multiple}
penguins %>%
group_by(species, sex) %>%
summarize(
___
)
```
When performing summaries, we often want to know how many
observations there are in each group (i.e., we want to count). We can do
this with the function n()
, which inside
summarize()
gives us the group size. So, we can count by
adding a statement such as count = n()
inside
summarize()
. Try this out.
```{r summarize-multiple2}
penguins %>%
group_by(species, sex) %>%
summarize(
___
)
```
Removing missing values
If we try the previous calculation but grouping only by species and
not by sex, we notice a problem: Most values have been replaced by
NA
.
```{r}
penguins %>%
group_by(species) %>%
summarize(
mean_weight = mean(body_mass_g),
mean_bill_length = mean(bill_length_mm)
)
```
# A tibble: 3 × 3
species mean_weight mean_bill_length
<fct> <dbl> <dbl>
1 Adelie NA NA
2 Chinstrap 3733. 48.8
3 Gentoo NA NA
This happens because R does not like to calculate the mean of a set of numbers where at least one is missing. Whenever there are missing values, we need to think carefully whether we can just ignore them or need to do something more sophisticated. In the penguins dataset there are only a handful of missing values, and therefore ignoring them is fine.
We can ignore missing values by setting na.rm = TRUE
inside the mean()
function. Try this out.
```{r summarize-na-rm}
penguins %>%
group_by(species) %>%
summarize(
___
)
```
Alternatively, we could filter out all rows that contain
NA
s in the columns of interest. We test whether a column
contains NA
s with is.na(<column name>)
,
and to keep rows without NA
s we use
!is.na(<column name>)
. Try this out.
```{r summarize-na-rm-filter}
penguins %>%
filter(___) %>%
group_by(species) %>%
summarize(
mean_weight = mean(body_mass_g),
mean_bill_length = mean(bill_length_mm)
)
```
Making tables wider or longer
For efficient data processing, we usually want tables in long form, where each columns is one variable and each row is one observation. However, in some applications, for example when making a table easier to read for humans, a wide format can be preferred. In a wide format, some variables are displayed as column names, and other variables are distributed over multiple columns.
Consider the following two versions of a summary table. The first is in long format, where sex is one column and the mean weight is another.
```{r echo = TRUE}
penguins_long
```
# A tibble: 6 × 3
# Groups: species [3]
species sex mean_weight
<fct> <fct> <dbl>
1 Adelie female 3369.
2 Adelie male 4043.
3 Chinstrap female 3527.
4 Chinstrap male 3939.
5 Gentoo female 4680.
6 Gentoo male 5485.
The second is in wide format, where the values of the sex variable (female or male) are used as column headings, and the mean weight values are distributed over these two columns.
```{r echo = TRUE}
penguins_wide
```
# A tibble: 3 × 3
# Groups: species [3]
species female male
<fct> <dbl> <dbl>
1 Adelie 3369. 4043.
2 Chinstrap 3527. 3939.
3 Gentoo 4680. 5485.
You can turn a long table into a wide table using the function
pivot_wider()
, which takes two arguments:
names_from
specifies the long column from which the new
wide column names should be taken (here e.g., "sex"
), and
values_from
specifies the long column from which the new
wide values should be taken (here e.g., "mean_weight"
). Try
this out.
```{r pivot-wider}
penguins_long %>%
___
```
You can similarly turn a wide table into a long one using the
function pivot_longer()
. It has arguments
cols
, names_to
, and values_to
.
cols
specifies the wide columns on which to operate,
names_to
specifies into which long columns the names of the
wide columns should be written, and values_to
specfies into
which long columns the values of the wide columns should be written.
Note that while names_to
and values_to
are
specified as strings (that is, in quotes, such as
names_to = "sex"
), the cols
argument does not
use quotes (e.g., cols = c(female, male)
).
Try this out.
```{r pivot-longer}
penguins_wide %>%
___
```
Notice how the arguments names_from
and
values_from
in pivot_wider()
are exactly
equivalent to names_to
and values_to
in
pivot_longer()
.
Combining datasets with joins
Finally, we sometimes encounter the situation where we have two data
sets that contain different pieces of information about the same
subjects or objects, and we need to merge these tables for further
analysis. In this situation, we need to perform a join, and there are
multiple different types of joins available: left_join()
,
right_join()
, inner_join()
,
full_join()
. These joins all differ in how they handle
cases where an observation is present in only one of the two tables but
missing in the other.
Two explore joins, consider the following two datasets, which contain the population number of three Texas cities and the city areas, respectively. The cities in the two tables are not the same, on purpose.
```{r echo = TRUE}
TX_population
TX_area
```
# A tibble: 4 × 3
city year population
<chr> <dbl> <dbl>
1 Houston 2014 2239558
2 San Antonio 2014 1436697
3 Austin 2014 912791
4 Austin 2010 790390
# A tibble: 3 × 2
city area
<chr> <dbl>
1 Houston 608.
2 Dallas 386.
3 Austin 307.
Try to merge TX_area
into TX_population
,
using left_join()
.
```{r left-join}
left_join(___)
```
What happens if you reverse the two arguments?
Now try the same with full_join()
.
```{r full-join}
# build all the code for this exercise
```
Try right_join()
and inner_join()
alsoSee
if you can describe what each join function does and how they differ
from one another.