NYGC/AMNH Workshop on Microbial Ecology - Introduction to R

Claus Wilke remixed by Joseph Elsherbini
2022/08/24

Data Wrangling 1 Exercise

Introduction

In this worksheet, we will discuss how to perform basic data manipulations, such as filtering data rows that meet certain conditions, choosing data columns, and arranging data in ascending or descending order.

We will be using two R packages, tidyverse for the data manipulation functions %>%, filter(), select(), arrange(), count(), and mutate(), and palmerpenguins for the penguins dataset.


```{r library-calls}
# load required libraries
library(tidyverse)
library(palmerpenguins)
```
We will be working with the dataset penguins containing data on individual penguins on Antarctica.

```{r}
penguins
```
# A tibble: 344 × 8
   species island    bill_length_mm bill_depth_mm flipper_…¹ body_…² sex    year
   <fct>   <fct>              <dbl>         <dbl>      <int>   <int> <fct> <int>
 1 Adelie  Torgersen           39.1          18.7        181    3750 male   2007
 2 Adelie  Torgersen           39.5          17.4        186    3800 fema…  2007
 3 Adelie  Torgersen           40.3          18          195    3250 fema…  2007
 4 Adelie  Torgersen           NA            NA           NA      NA <NA>   2007
 5 Adelie  Torgersen           36.7          19.3        193    3450 fema…  2007
 6 Adelie  Torgersen           39.3          20.6        190    3650 male   2007
 7 Adelie  Torgersen           38.9          17.8        181    3625 fema…  2007
 8 Adelie  Torgersen           39.2          19.6        195    4675 male   2007
 9 Adelie  Torgersen           34.1          18.1        193    3475 <NA>   2007
10 Adelie  Torgersen           42            20.2        190    4250 <NA>   2007
# … with 334 more rows, and abbreviated variable names ¹​flipper_length_mm,
#   ²​body_mass_g

The pipe (%>%, read: “and then”)

When writing complex data analysis pipelines, we frequently use the pipe operator %>% to move data from one analysis step to the next. The pipe is pronounced “and then”, and it takes the data on its left and uses it as the first argument for the function on its right.

For example, to see the first few lines of a dataset, we often write head(data). Instead, we can write data %>% head().

Try this yourself. Write code that displays the first few lines of the penguins dataset, using %>% and head():


```{r pipe}
# use a pipe to show the first few lines 
# of the penguins dataframe
penguins
```

Choosing data rows

The function filter() allows you to find rows in a dataset that meet one or more specific conditions. The syntax is data %>% filter(condition), where condition is a logical condition. For example, filter(x > 5) would pick all rows for which the value in column x is greater than 5.

As an example, the following code picks all penguins from the island Biscoe:


```{r filter-island-demo, echo=TRUE}
penguins %>%
  filter(island == "Biscoe")
```
# A tibble: 168 × 8
   species island bill_length_mm bill_depth_mm flipper_len…¹ body_…² sex    year
   <fct>   <fct>           <dbl>         <dbl>         <int>   <int> <fct> <int>
 1 Adelie  Biscoe           37.8          18.3           174    3400 fema…  2007
 2 Adelie  Biscoe           37.7          18.7           180    3600 male   2007
 3 Adelie  Biscoe           35.9          19.2           189    3800 fema…  2007
 4 Adelie  Biscoe           38.2          18.1           185    3950 male   2007
 5 Adelie  Biscoe           38.8          17.2           180    3800 male   2007
 6 Adelie  Biscoe           35.3          18.9           187    3800 fema…  2007
 7 Adelie  Biscoe           40.6          18.6           183    3550 male   2007
 8 Adelie  Biscoe           40.5          17.9           187    3200 fema…  2007
 9 Adelie  Biscoe           37.9          18.6           172    3150 fema…  2007
10 Adelie  Biscoe           40.5          18.9           180    3950 male   2007
# … with 158 more rows, and abbreviated variable names ¹​flipper_length_mm,
#   ²​body_mass_g

Now it’s your turn. Pick all penguins from the island Dream:


```{r filter-island}
penguins %>%
  filter(___)
```

Filtering for multiple conditions

You can also state multiple conditions, separated by a comma. For example, filter(x > 5, y < 2) would pick all rows for which the value in the column x is greater than 5 and the value in the column y is less than 2. Note that the conditions are combined via logical and, both need to be satisfied for the row to be picked.

To try this out, pick all penguins of species Chinstrap from the island Dream:


```{r filter-species-island}
 # build all the code for this exercise

```

Choosing data columns

The function select() allows you to pick specific data columns by name. This is frequently useful when a dataset has many more columns than we are interested in at the time. For example, if we are only interested in the penguins’ species, island, and sex, we could select these three columns:


```{r select-example}
penguins %>%
  select(species, island, sex)
```
# A tibble: 344 × 3
   species island    sex   
   <fct>   <fct>     <fct> 
 1 Adelie  Torgersen male  
 2 Adelie  Torgersen female
 3 Adelie  Torgersen female
 4 Adelie  Torgersen <NA>  
 5 Adelie  Torgersen female
 6 Adelie  Torgersen male  
 7 Adelie  Torgersen female
 8 Adelie  Torgersen male  
 9 Adelie  Torgersen <NA>  
10 Adelie  Torgersen <NA>  
# … with 334 more rows

Try this yourself, picking the columns representing the penguin species (species), the bill length (bill_length_mm), and then flipper length (flipper_length_mm).


```{r select-exercise}
 # build all the code for this exercise
```

Choosing columns for removal

Another situation that arises frequently is one where we want to remove specific columns. We can also do this with select(), but now write select(-column) to remove one or more columns.

Try this. Remove the column species.


```{r select-remove}
 # build all the code for this exercise
```

And now remove both species and island.


```{r select-remove2}
# build all the code for this exercise

```

Sorting data

The function arrange() allows you to sort data by one or more columns. For example, data %>% arrange(x) would sort the data by increasing values of x, and data %>% arrange(x, y) would sort the data first by x and then, for ties in x, by y.

As an example, the following code sorts penguins by their flipper length:


```{r arrange-flipper-demo}
penguins %>%
  arrange(flipper_length_mm)
```
# A tibble: 344 × 8
   species   island    bill_length_mm bill_depth_mm flippe…¹ body_…² sex    year
   <fct>     <fct>              <dbl>         <dbl>    <int>   <int> <fct> <int>
 1 Adelie    Biscoe              37.9          18.6      172    3150 fema…  2007
 2 Adelie    Biscoe              37.8          18.3      174    3400 fema…  2007
 3 Adelie    Torgersen           40.2          17        176    3450 fema…  2009
 4 Adelie    Dream               39.5          16.7      178    3250 fema…  2007
 5 Adelie    Dream               37.2          18.1      178    3900 male   2007
 6 Adelie    Dream               33.1          16.1      178    2900 fema…  2008
 7 Chinstrap Dream               46.1          18.2      178    3250 fema…  2007
 8 Adelie    Dream               37.5          18.9      179    2975 <NA>   2007
 9 Adelie    Torgersen           37.8          17.3      180    3700 <NA>   2007
10 Adelie    Biscoe              37.7          18.7      180    3600 male   2007
# … with 334 more rows, and abbreviated variable names ¹​flipper_length_mm,
#   ²​body_mass_g

Now it’s your turn. Sort the penguins by bill length:


```{r arrange-bill-length}
# build all the code for this exercise
```

Arranging in descending order

To arrange data in descending order, enclose the data column in desc(). For example, data %>% arrange(desc(x)) would sort the data by decreasing values of x. (desc stands for “descending”.)

Try this out. Sort the penguins by bill length, from largest to smallest:


```{r arrange-bill-length-desc}
# build all the code for this exercise
```

Counting

We frequently want to count how many times a particular value or combination of values occurs in a dataset. We do this using the count() function. For example, the following code counts how many penguins of the different species there are in the penguins dataset.


```{r count-demo}
penguins %>%
  count(species)
```
# A tibble: 3 × 2
  species       n
  <fct>     <int>
1 Adelie      152
2 Chinstrap    68
3 Gentoo      124

Now try this yourself. Count how many male and female penguins there are.


```{r count-simple}
# build all the code for this exercise
```

We can also perform more fine-grained counts, by providing the count() function with more than one column name. See if you can count how many male and female penguins the dataset contains for each species.


```{r count-complex}
# build all the code for this exercise
```

Now count how many penguins of each species the dataset contains for each island.


```{r count-complex2}
# build all the code for this exercise
```

Chaining analysis steps into pipelines

We can chain multiple analysis steps into a pipeline by continuing to add “and then” statements. For example, data %>% count(...) %>% arrange(...) would first count and then sort the data.

Try this out by counting the number of penguins of each species and then sorting by the number of penguins.


```{r analysis-chain}
# build all the code for this exercise
```

You may remember from before that the default sorting is from the smallest to the largest value. Repeat the previous exercise but now arrange the penguin species from the most frequent to the least frequent:


```{r analysis-chain2}
# build all the code for this exercise
```

Creating new data columns

The function mutate() allows you to add new columns to a data table. For example, data %>% mutate(sum = x + y) would create a new column sum that is the sum of the columns x and y:


```{r simple-mutate-example}
data <- tibble(x = 1:3, y = c(10, 20, 30))
data
```
# A tibble: 3 × 2
      x     y
  <int> <dbl>
1     1    10
2     2    20
3     3    30

```{r simple-mutate-example2}
data %>%
  mutate(
    sum = x + y
  )
```
# A tibble: 3 × 3
      x     y   sum
  <int> <dbl> <dbl>
1     1    10    11
2     2    20    22
3     3    30    33

Note that the part to the left of the equals sign (here, sum) is the name of the new column, and the part to the right of the equals sign (here, x + y) is an R expression that evaluates to the values in the new column.

Now apply this concept to the penguins dataset. Add a new column bill_ratio that is the ratio of bill length and bill depth:


```{r mutate-bill-ratio}
penguins %>%
  select(species, bill_length_mm, bill_depth_mm) %>%
  mutate(
    ___ = ___
  )
```

Counting with custom conditions

It is quite common that we want to count items that meet a specific condition. For example, let’s say we want to count how many penguins of different species have flippers longer than 200mm. To do this efficiently, we first create a new column that indicates whether the condition is met or not, and we then use count with that indicator column.

The easiest way to create indicator columns is via the function ifelse(), which takes three arguments: a condition, a result if the condition is met, and a result if the condition is not met. The following example shows how to create an indicator column showing whether a variable is positive or negative:


```{r indicator-example1}
data <- tibble(x = c(-0.5, 2.3, 50, -1.4))
data
```
# A tibble: 4 × 1
      x
  <dbl>
1  -0.5
2   2.3
3  50  
4  -1.4

```{r indicator-example2}
data %>%
  mutate(
    sign_of_x = ifelse(x >= 0, "positive", "negative")
  )
```
# A tibble: 4 × 2
      x sign_of_x
  <dbl> <chr>    
1  -0.5 negative 
2   2.3 positive 
3  50   positive 
4  -1.4 negative 

Now try this yourself. Count how many penguins of different species have flippers longer than 200mm. Then sort your results from most frequent to least frequent.


```{r flipper-length-count}
penguins %>%
  mutate(
    ___
  ) %>%
  count(___) %>%
  arrange(___)
```