class: center, middle, title-slide .title[ # Data wrangling 1 ] .author[ ### Claus O. Wilke, remixed by Joseph Elsherbini ] .date[ ### 2023-01-10 ] --- ## Elementary data manipulations -- - Pick rows: `filter()` -- - Pick columns: `select()` -- - Sort rows: `arrange()` -- - Count things: `count()` -- - Make new columns: `mutate()` --- class: center, middle # But first: the pipe operator `%>%` --- class: center, middle # `%>%` is pronounced "and then" --- ## The pipe `%>%` feeds data into functions .tiny-font[ ```r head(penguins) ``` ``` # A tibble: 6 × 8 species island bill_length_mm bill_depth_mm flipper_l…¹ body_…² sex year <fct> <fct> <dbl> <dbl> <int> <int> <fct> <int> 1 Adelie Torgersen 39.1 18.7 181 3750 male 2007 2 Adelie Torgersen 39.5 17.4 186 3800 fema… 2007 3 Adelie Torgersen 40.3 18 195 3250 fema… 2007 4 Adelie Torgersen NA NA NA NA <NA> 2007 5 Adelie Torgersen 36.7 19.3 193 3450 fema… 2007 6 Adelie Torgersen 39.3 20.6 190 3650 male 2007 # … with abbreviated variable names ¹flipper_length_mm, ²body_mass_g ``` ] --- ## The pipe `%>%` feeds data into functions .tiny-font[ ```r # head(penguins) penguins %>% head() ``` ``` # A tibble: 6 × 8 species island bill_length_mm bill_depth_mm flipper_l…¹ body_…² sex year <fct> <fct> <dbl> <dbl> <int> <int> <fct> <int> 1 Adelie Torgersen 39.1 18.7 181 3750 male 2007 2 Adelie Torgersen 39.5 17.4 186 3800 fema… 2007 3 Adelie Torgersen 40.3 18 195 3250 fema… 2007 4 Adelie Torgersen NA NA NA NA <NA> 2007 5 Adelie Torgersen 36.7 19.3 193 3450 fema… 2007 6 Adelie Torgersen 39.3 20.6 190 3650 male 2007 # … with abbreviated variable names ¹flipper_length_mm, ²body_mass_g ``` ] --- ## The pipe `%>%` feeds data into functions .tiny-font[ ```r ggplot(penguins, aes(bill_length_mm, bill_depth_mm, color = species)) + geom_point() ``` ] .center[ <!-- --> ] --- ## The pipe `%>%` feeds data into functions .tiny-font[ ```r # ggplot(penguins, aes(bill_length_mm, bill_depth_mm, color = species)) + geom_point() penguins %>% ggplot(aes(bill_length_mm, bill_depth_mm, color = species)) + geom_point() ``` ] .center[ <!-- --> ] --- ## Since R 4.1: Native pipe `|>` .tiny-font[ ```r penguins |> ggplot(aes(bill_length_mm, bill_depth_mm, color = species)) + geom_point() ``` ] .center[ <!-- --> ] --- ## Which to use? Native pipe or old-school pipe? -- - `|>` is the future. If you can, use it. -- - `%>%` works on older installations. It's the safe choice for now. -- We use `%>%` here because many people still run R 3.x or 4.0. [//]: # "segment ends here" --- class: center middle ## Picking rows or columns, and sorting --- ## Pick rows from a table: `filter()` <br> .center[ <img src = "data-wrangling-1_files/filter.svg", width = 75%></img> ] --- ## Filter out penguins of species Gentoo .tiny-font[ ```r penguins %>% filter(species == "Gentoo") ``` ``` # A tibble: 124 × 8 species island bill_length_mm bill_depth_mm flipper_len…¹ body_…² sex year <fct> <fct> <dbl> <dbl> <int> <int> <fct> <int> 1 Gentoo Biscoe 46.1 13.2 211 4500 fema… 2007 2 Gentoo Biscoe 50 16.3 230 5700 male 2007 3 Gentoo Biscoe 48.7 14.1 210 4450 fema… 2007 4 Gentoo Biscoe 50 15.2 218 5700 male 2007 5 Gentoo Biscoe 47.6 14.5 215 5400 male 2007 6 Gentoo Biscoe 46.5 13.5 210 4550 fema… 2007 7 Gentoo Biscoe 45.4 14.6 211 4800 fema… 2007 8 Gentoo Biscoe 46.7 15.3 219 5200 male 2007 9 Gentoo Biscoe 43.3 13.4 209 4400 fema… 2007 10 Gentoo Biscoe 46.8 15.4 215 5150 male 2007 # … with 114 more rows, and abbreviated variable names ¹flipper_length_mm, # ²body_mass_g ``` ] --- ## Filter out penguins with bill length > 50 mm .tiny-font[ ```r penguins %>% filter(bill_length_mm > 50) ``` ``` # A tibble: 52 × 8 species island bill_length_mm bill_depth_mm flipper_len…¹ body_…² sex year <fct> <fct> <dbl> <dbl> <int> <int> <fct> <int> 1 Gentoo Biscoe 50.2 14.3 218 5700 male 2007 2 Gentoo Biscoe 59.6 17 230 6050 male 2007 3 Gentoo Biscoe 50.5 15.9 222 5550 male 2008 4 Gentoo Biscoe 50.5 15.9 225 5400 male 2008 5 Gentoo Biscoe 50.1 15 225 5000 male 2008 6 Gentoo Biscoe 50.4 15.3 224 5550 male 2008 7 Gentoo Biscoe 54.3 15.7 231 5650 male 2008 8 Gentoo Biscoe 50.7 15 223 5550 male 2008 9 Gentoo Biscoe 51.1 16.3 220 6000 male 2008 10 Gentoo Biscoe 52.5 15.6 221 5450 male 2009 # … with 42 more rows, and abbreviated variable names ¹flipper_length_mm, # ²body_mass_g ``` ] --- ## Pick columns from a table: `select()` <br> .center[ <img src = "data-wrangling-1_files/select.svg", width = 75%></img> ] --- ## Pick columns `species`, `island`, and `sex` .tiny-font[ ```r penguins %>% select(species, island, sex) ``` ``` # A tibble: 344 × 3 species island sex <fct> <fct> <fct> 1 Adelie Torgersen male 2 Adelie Torgersen female 3 Adelie Torgersen female 4 Adelie Torgersen <NA> 5 Adelie Torgersen female 6 Adelie Torgersen male 7 Adelie Torgersen female 8 Adelie Torgersen male 9 Adelie Torgersen <NA> 10 Adelie Torgersen <NA> # … with 334 more rows ``` ] --- ## Sort the rows in a table: `arrange()` <br> .center[ <img src = "data-wrangling-1_files/arrange.svg", width = 75%></img> ] --- ## Sort penguins by bill length, ascending .tiny-font[ ```r penguins %>% arrange(bill_length_mm) ``` ``` # A tibble: 344 × 8 species island bill_length_mm bill_depth_mm flipper_…¹ body_…² sex year <fct> <fct> <dbl> <dbl> <int> <int> <fct> <int> 1 Adelie Dream 32.1 15.5 188 3050 fema… 2009 2 Adelie Dream 33.1 16.1 178 2900 fema… 2008 3 Adelie Torgersen 33.5 19 190 3600 fema… 2008 4 Adelie Dream 34 17.1 185 3400 fema… 2008 5 Adelie Torgersen 34.1 18.1 193 3475 <NA> 2007 6 Adelie Torgersen 34.4 18.4 184 3325 fema… 2007 7 Adelie Biscoe 34.5 18.1 187 2900 fema… 2008 8 Adelie Torgersen 34.6 21.1 198 4400 male 2007 9 Adelie Torgersen 34.6 17.2 189 3200 fema… 2008 10 Adelie Biscoe 35 17.9 190 3450 fema… 2008 # … with 334 more rows, and abbreviated variable names ¹flipper_length_mm, # ²body_mass_g ``` ] --- ## Sort penguins by bill length, descending .tiny-font[ ```r penguins %>% arrange(desc(bill_length_mm)) ``` ``` # A tibble: 344 × 8 species island bill_length_mm bill_depth_mm flipper_l…¹ body_…² sex year <fct> <fct> <dbl> <dbl> <int> <int> <fct> <int> 1 Gentoo Biscoe 59.6 17 230 6050 male 2007 2 Chinstrap Dream 58 17.8 181 3700 fema… 2007 3 Gentoo Biscoe 55.9 17 228 5600 male 2009 4 Chinstrap Dream 55.8 19.8 207 4000 male 2009 5 Gentoo Biscoe 55.1 16 230 5850 male 2009 6 Gentoo Biscoe 54.3 15.7 231 5650 male 2008 7 Chinstrap Dream 54.2 20.8 201 4300 male 2008 8 Chinstrap Dream 53.5 19.9 205 4500 male 2008 9 Gentoo Biscoe 53.4 15.8 219 5500 male 2009 10 Chinstrap Dream 52.8 20 205 4550 male 2008 # … with 334 more rows, and abbreviated variable names ¹flipper_length_mm, # ²body_mass_g ``` ] [//]: # "segment ends here" --- class: center middle ## Counting things --- ## Count things .tiny-font[ ```r penguins ``` ``` # A tibble: 344 × 8 species island bill_length_mm bill_depth_mm flipper_…¹ body_…² sex year <fct> <fct> <dbl> <dbl> <int> <int> <fct> <int> 1 Adelie Torgersen 39.1 18.7 181 3750 male 2007 2 Adelie Torgersen 39.5 17.4 186 3800 fema… 2007 3 Adelie Torgersen 40.3 18 195 3250 fema… 2007 4 Adelie Torgersen NA NA NA NA <NA> 2007 5 Adelie Torgersen 36.7 19.3 193 3450 fema… 2007 6 Adelie Torgersen 39.3 20.6 190 3650 male 2007 7 Adelie Torgersen 38.9 17.8 181 3625 fema… 2007 8 Adelie Torgersen 39.2 19.6 195 4675 male 2007 9 Adelie Torgersen 34.1 18.1 193 3475 <NA> 2007 10 Adelie Torgersen 42 20.2 190 4250 <NA> 2007 # … with 334 more rows, and abbreviated variable names ¹flipper_length_mm, # ²body_mass_g ``` ] --- ## Count things .tiny-font[ ```r penguins %>% count(species) ``` ``` # A tibble: 3 × 2 species n <fct> <int> 1 Adelie 152 2 Chinstrap 68 3 Gentoo 124 ``` ] --- ## Count things .tiny-font[ ```r penguins %>% count(species, island) ``` ``` # A tibble: 5 × 3 species island n <fct> <fct> <int> 1 Adelie Biscoe 44 2 Adelie Dream 56 3 Adelie Torgersen 52 4 Chinstrap Dream 68 5 Gentoo Biscoe 124 ``` ] --- ## Use the pipe to build analysis pipelines .tiny-font[ ```r penguins %>% filter(species == "Adelie") ``` ``` # A tibble: 152 × 8 species island bill_length_mm bill_depth_mm flipper_…¹ body_…² sex year <fct> <fct> <dbl> <dbl> <int> <int> <fct> <int> 1 Adelie Torgersen 39.1 18.7 181 3750 male 2007 2 Adelie Torgersen 39.5 17.4 186 3800 fema… 2007 3 Adelie Torgersen 40.3 18 195 3250 fema… 2007 4 Adelie Torgersen NA NA NA NA <NA> 2007 5 Adelie Torgersen 36.7 19.3 193 3450 fema… 2007 6 Adelie Torgersen 39.3 20.6 190 3650 male 2007 7 Adelie Torgersen 38.9 17.8 181 3625 fema… 2007 8 Adelie Torgersen 39.2 19.6 195 4675 male 2007 9 Adelie Torgersen 34.1 18.1 193 3475 <NA> 2007 10 Adelie Torgersen 42 20.2 190 4250 <NA> 2007 # … with 142 more rows, and abbreviated variable names ¹flipper_length_mm, # ²body_mass_g ``` ] --- ## Use the pipe to build analysis pipelines .tiny-font[ ```r penguins %>% filter(species == "Adelie") %>% select(island, sex) ``` ``` # A tibble: 152 × 2 island sex <fct> <fct> 1 Torgersen male 2 Torgersen female 3 Torgersen female 4 Torgersen <NA> 5 Torgersen female 6 Torgersen male 7 Torgersen female 8 Torgersen male 9 Torgersen <NA> 10 Torgersen <NA> # … with 142 more rows ``` ] --- ## Use the pipe to build analysis pipelines .tiny-font[ ```r penguins %>% filter(species == "Adelie") %>% select(island, sex) %>% count(island, sex) ``` ``` # A tibble: 8 × 3 island sex n <fct> <fct> <int> 1 Biscoe female 22 2 Biscoe male 22 3 Dream female 27 4 Dream male 28 5 Dream <NA> 1 6 Torgersen female 24 7 Torgersen male 23 8 Torgersen <NA> 5 ``` ] [//]: # "segment ends here" --- class: center middle ## Adding new columns to a table --- ## Make a new table column: `mutate()` <br> .center[ <img src = "data-wrangling-1_files/mutate.svg", width = 75%></img> ] --- ## Example: flipper length in cm .tiny-font[ ```r penguins %>% select(species, island, flipper_length_mm) ``` ``` # A tibble: 344 × 3 species island flipper_length_mm <fct> <fct> <int> 1 Adelie Torgersen 181 2 Adelie Torgersen 186 3 Adelie Torgersen 195 4 Adelie Torgersen NA 5 Adelie Torgersen 193 6 Adelie Torgersen 190 7 Adelie Torgersen 181 8 Adelie Torgersen 195 9 Adelie Torgersen 193 10 Adelie Torgersen 190 # … with 334 more rows ``` ] --- ## Example: flipper length in cm .tiny-font[ ```r penguins %>% select(species, island, flipper_length_mm) %>% mutate(flipper_length_cm = flipper_length_mm / 10) ``` ``` # A tibble: 344 × 4 species island flipper_length_mm flipper_length_cm <fct> <fct> <int> <dbl> 1 Adelie Torgersen 181 18.1 2 Adelie Torgersen 186 18.6 3 Adelie Torgersen 195 19.5 4 Adelie Torgersen NA NA 5 Adelie Torgersen 193 19.3 6 Adelie Torgersen 190 19 7 Adelie Torgersen 181 18.1 8 Adelie Torgersen 195 19.5 9 Adelie Torgersen 193 19.3 10 Adelie Torgersen 190 19 # … with 334 more rows ``` ] --- ## Make multiple columns at once .tiny-font[ ```r penguins %>% select(species, island, flipper_length_mm) %>% mutate( flipper_length_cm = flipper_length_mm / 10, # <- notice the comma flipper_length_in = flipper_length_mm / 25.4 ) ``` ``` # A tibble: 344 × 5 species island flipper_length_mm flipper_length_cm flipper_length_in <fct> <fct> <int> <dbl> <dbl> 1 Adelie Torgersen 181 18.1 7.13 2 Adelie Torgersen 186 18.6 7.32 3 Adelie Torgersen 195 19.5 7.68 4 Adelie Torgersen NA NA NA 5 Adelie Torgersen 193 19.3 7.60 6 Adelie Torgersen 190 19 7.48 7 Adelie Torgersen 181 18.1 7.13 8 Adelie Torgersen 195 19.5 7.68 9 Adelie Torgersen 193 19.3 7.60 10 Adelie Torgersen 190 19 7.48 # … with 334 more rows ``` ] [//]: # "segment ends here" --- ## Exercise Go to [https://elsherbini.github.io/AMNH_R_Workshop_2023/modules/data-wrangling-module/](https://elsherbini.github.io/AMNH_R_Workshop_2023/modules/data-wrangling-module/) and complete the "Data Wrangling 1" exercise.