Introduction
In this worksheet, we will discuss how to perform basic data manipulations, such as filtering data rows that meet certain conditions, choosing data columns, and arranging data in ascending or descending order.
We will be using two R packages, tidyverse for the
data manipulation functions %>%
, filter()
,
select()
, arrange()
, count()
, and
mutate()
, and palmerpenguins for the
penguins
dataset.
```{r library-calls}
# load required libraries
library(tidyverse)
library(palmerpenguins)
```
We will be working with the dataset penguins
containing
data on individual penguins on Antarctica.
```{r}
penguins
```
# A tibble: 344 × 8
species island bill_length_mm bill_depth_mm flipper_…¹ body_…² sex year
<fct> <fct> <dbl> <dbl> <int> <int> <fct> <int>
1 Adelie Torgersen 39.1 18.7 181 3750 male 2007
2 Adelie Torgersen 39.5 17.4 186 3800 fema… 2007
3 Adelie Torgersen 40.3 18 195 3250 fema… 2007
4 Adelie Torgersen NA NA NA NA <NA> 2007
5 Adelie Torgersen 36.7 19.3 193 3450 fema… 2007
6 Adelie Torgersen 39.3 20.6 190 3650 male 2007
7 Adelie Torgersen 38.9 17.8 181 3625 fema… 2007
8 Adelie Torgersen 39.2 19.6 195 4675 male 2007
9 Adelie Torgersen 34.1 18.1 193 3475 <NA> 2007
10 Adelie Torgersen 42 20.2 190 4250 <NA> 2007
# … with 334 more rows, and abbreviated variable names ¹flipper_length_mm,
# ²body_mass_g
The pipe (%>%
, read: “and then”)
When writing complex data analysis pipelines, we frequently use the
pipe operator %>%
to move data from one analysis step to
the next. The pipe is pronounced “and then”, and it takes the data on
its left and uses it as the first argument for the function on its
right.
For example, to see the first few lines of a dataset, we often write
head(data)
. Instead, we can write
data %>% head()
.
Try this yourself. Write code that displays the first few lines of
the penguins
dataset, using %>%
and
head()
:
```{r pipe}
# use a pipe to show the first few lines
# of the penguins dataframe
penguins
```
Choosing data rows
The function filter()
allows you to find rows in a
dataset that meet one or more specific conditions. The syntax is
data %>% filter(condition)
, where condition is a logical
condition. For example, filter(x > 5)
would pick all
rows for which the value in column x
is greater than 5.
As an example, the following code picks all penguins from the island Biscoe:
```{r filter-island-demo, echo=TRUE}
penguins %>%
filter(island == "Biscoe")
```
# A tibble: 168 × 8
species island bill_length_mm bill_depth_mm flipper_len…¹ body_…² sex year
<fct> <fct> <dbl> <dbl> <int> <int> <fct> <int>
1 Adelie Biscoe 37.8 18.3 174 3400 fema… 2007
2 Adelie Biscoe 37.7 18.7 180 3600 male 2007
3 Adelie Biscoe 35.9 19.2 189 3800 fema… 2007
4 Adelie Biscoe 38.2 18.1 185 3950 male 2007
5 Adelie Biscoe 38.8 17.2 180 3800 male 2007
6 Adelie Biscoe 35.3 18.9 187 3800 fema… 2007
7 Adelie Biscoe 40.6 18.6 183 3550 male 2007
8 Adelie Biscoe 40.5 17.9 187 3200 fema… 2007
9 Adelie Biscoe 37.9 18.6 172 3150 fema… 2007
10 Adelie Biscoe 40.5 18.9 180 3950 male 2007
# … with 158 more rows, and abbreviated variable names ¹flipper_length_mm,
# ²body_mass_g
Now it’s your turn. Pick all penguins from the island Dream:
```{r filter-island}
penguins %>%
filter(___)
```
Filtering for multiple conditions
You can also state multiple conditions, separated by a comma. For
example, filter(x > 5, y < 2)
would pick all rows for
which the value in the column x
is greater than 5 and the
value in the column y
is less than 2. Note that the
conditions are combined via logical and, both need to be satisfied for
the row to be picked.
To try this out, pick all penguins of species Chinstrap from the island Dream:
```{r filter-species-island}
# build all the code for this exercise
```
Choosing data columns
The function select()
allows you to pick specific data
columns by name. This is frequently useful when a dataset has many more
columns than we are interested in at the time. For example, if we are
only interested in the penguins’ species, island, and sex, we could
select these three columns:
```{r select-example}
penguins %>%
select(species, island, sex)
```
# A tibble: 344 × 3
species island sex
<fct> <fct> <fct>
1 Adelie Torgersen male
2 Adelie Torgersen female
3 Adelie Torgersen female
4 Adelie Torgersen <NA>
5 Adelie Torgersen female
6 Adelie Torgersen male
7 Adelie Torgersen female
8 Adelie Torgersen male
9 Adelie Torgersen <NA>
10 Adelie Torgersen <NA>
# … with 334 more rows
Try this yourself, picking the columns representing the penguin
species (species
), the bill length
(bill_length_mm
), and then flipper length
(flipper_length_mm
).
```{r select-exercise}
# build all the code for this exercise
```
Choosing columns for removal
Another situation that arises frequently is one where we want to
remove specific columns. We can also do this with select()
,
but now write select(-column)
to remove one or more
columns.
Try this. Remove the column species
.
```{r select-remove}
# build all the code for this exercise
```
And now remove both species
and island
.
```{r select-remove2}
# build all the code for this exercise
```
Sorting data
The function arrange()
allows you to sort data by one or
more columns. For example, data %>% arrange(x)
would
sort the data by increasing values of x
, and
data %>% arrange(x, y)
would sort the data first by
x
and then, for ties in x
, by
y
.
As an example, the following code sorts penguins by their flipper length:
```{r arrange-flipper-demo}
penguins %>%
arrange(flipper_length_mm)
```
# A tibble: 344 × 8
species island bill_length_mm bill_depth_mm flippe…¹ body_…² sex year
<fct> <fct> <dbl> <dbl> <int> <int> <fct> <int>
1 Adelie Biscoe 37.9 18.6 172 3150 fema… 2007
2 Adelie Biscoe 37.8 18.3 174 3400 fema… 2007
3 Adelie Torgersen 40.2 17 176 3450 fema… 2009
4 Adelie Dream 39.5 16.7 178 3250 fema… 2007
5 Adelie Dream 37.2 18.1 178 3900 male 2007
6 Adelie Dream 33.1 16.1 178 2900 fema… 2008
7 Chinstrap Dream 46.1 18.2 178 3250 fema… 2007
8 Adelie Dream 37.5 18.9 179 2975 <NA> 2007
9 Adelie Torgersen 37.8 17.3 180 3700 <NA> 2007
10 Adelie Biscoe 37.7 18.7 180 3600 male 2007
# … with 334 more rows, and abbreviated variable names ¹flipper_length_mm,
# ²body_mass_g
Now it’s your turn. Sort the penguins by bill length:
```{r arrange-bill-length}
# build all the code for this exercise
```
Arranging in descending order
To arrange data in descending order, enclose the data column in
desc()
. For example,
data %>% arrange(desc(x))
would sort the data by
decreasing values of x
. (desc
stands for
“descending”.)
Try this out. Sort the penguins by bill length, from largest to smallest:
```{r arrange-bill-length-desc}
# build all the code for this exercise
```
Counting
We frequently want to count how many times a particular value or
combination of values occurs in a dataset. We do this using the
count()
function. For example, the following code counts
how many penguins of the different species there are in the
penguins
dataset.
```{r count-demo}
penguins %>%
count(species)
```
# A tibble: 3 × 2
species n
<fct> <int>
1 Adelie 152
2 Chinstrap 68
3 Gentoo 124
Now try this yourself. Count how many male and female penguins there are.
```{r count-simple}
# build all the code for this exercise
```
We can also perform more fine-grained counts, by providing the
count()
function with more than one column name. See if you
can count how many male and female penguins the dataset contains for
each species.
```{r count-complex}
# build all the code for this exercise
```
Now count how many penguins of each species the dataset contains for each island.
```{r count-complex2}
# build all the code for this exercise
```
Chaining analysis steps into pipelines
We can chain multiple analysis steps into a pipeline by continuing to
add “and then” statements. For example,
data %>% count(...) %>% arrange(...)
would first
count and then sort the data.
Try this out by counting the number of penguins of each species and then sorting by the number of penguins.
```{r analysis-chain}
# build all the code for this exercise
```
You may remember from before that the default sorting is from the smallest to the largest value. Repeat the previous exercise but now arrange the penguin species from the most frequent to the least frequent:
```{r analysis-chain2}
# build all the code for this exercise
```
Creating new data columns
The function mutate()
allows you to add new columns to a
data table. For example, data %>% mutate(sum = x + y)
would create a new column sum
that is the sum of the
columns x
and y
:
```{r simple-mutate-example}
data <- tibble(x = 1:3, y = c(10, 20, 30))
data
```
# A tibble: 3 × 2
x y
<int> <dbl>
1 1 10
2 2 20
3 3 30
```{r simple-mutate-example2}
data %>%
mutate(
sum = x + y
)
```
# A tibble: 3 × 3
x y sum
<int> <dbl> <dbl>
1 1 10 11
2 2 20 22
3 3 30 33
Note that the part to the left of the equals sign (here,
sum
) is the name of the new column, and the part to the
right of the equals sign (here, x + y
) is an R expression
that evaluates to the values in the new column.
Now apply this concept to the penguins
dataset. Add a
new column bill_ratio
that is the ratio of bill length and
bill depth:
```{r mutate-bill-ratio}
penguins %>%
select(species, bill_length_mm, bill_depth_mm) %>%
mutate(
___ = ___
)
```
Counting with custom conditions
It is quite common that we want to count items that meet a specific condition. For example, let’s say we want to count how many penguins of different species have flippers longer than 200mm. To do this efficiently, we first create a new column that indicates whether the condition is met or not, and we then use count with that indicator column.
The easiest way to create indicator columns is via the function
ifelse()
, which takes three arguments: a condition, a
result if the condition is met, and a result if the condition is not
met. The following example shows how to create an indicator column
showing whether a variable is positive or negative:
```{r indicator-example1}
data <- tibble(x = c(-0.5, 2.3, 50, -1.4))
data
```
# A tibble: 4 × 1
x
<dbl>
1 -0.5
2 2.3
3 50
4 -1.4
```{r indicator-example2}
data %>%
mutate(
sign_of_x = ifelse(x >= 0, "positive", "negative")
)
```
# A tibble: 4 × 2
x sign_of_x
<dbl> <chr>
1 -0.5 negative
2 2.3 positive
3 50 positive
4 -1.4 negative
Now try this yourself. Count how many penguins of different species have flippers longer than 200mm. Then sort your results from most frequent to least frequent.
```{r flipper-length-count}
penguins %>%
mutate(
___
) %>%
count(___) %>%
arrange(___)
```