Intro to R, RStudio, and Quarto
Let’s learn how to use RStudio to run R code and render documents with Quarto.
Slides: Intro to RStudio, Quarto, and R
Worksheet: Introduction to R and Quarto
1. Operators and Functions in R
In order to get familiar with the R language, we shall start by using simple operators and functions that are intuitive to us in our everyday lives.
1.1 Arithmetic operators
+ addition
- subtraction
* multiplication
/ division
^ or ** power
x %*% y matrix multiplication: c(5, 3) %*% c(2, 4) == 22
x %% y modulo (x mod y): 5 %% 2 == 1
x %/% y whole number division: 5 %/% 2 == 2
Note that while the first half is self-explanatory, the second is more specific to programming and/or R.
Let’s try a few examples in R. First, always have a code chunk at the top that loads the libraries you need:
```{r}
library(tidyverse)
```
Then, copy the following code chunk into your Quarto document.
99 + 1 + -1
64 / 4
64 / (2+2)
64 / 2 + 2
Notice how the parentheses in the last two commands change the order of operations and therefore the output.
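As a small sketch with made-up numbers, the chunk below shows how whole number division and modulo split a division into a quotient and a remainder, and how parentheses change the order of operations.

```r
# Whole number division and modulo split a division into two pieces
17 %/% 5   # 3: how many times 5 fits into 17
17 %% 5    # 2: what is left over

# The two pieces always recombine into the original number
(17 %/% 5) * 5 + (17 %% 5) == 17

# Parentheses change the order of operations
2 + 3 * 4      # 14: multiplication first
(2 + 3) * 4    # 20: addition first

# Both power operators give the same result
2^3 == 2**3
```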
1.2 Logical operators and functions
< less than
<= less than or equal to
> greater than
>= greater than or equal to
== equal
!= not equal
!x not x (negation)
x | y x OR y
x & y x AND y
xor(x, y) exclusive OR (either in x or y, but not in both)
isTRUE(x) truth test for x
Here are a few examples to copy into your Quarto document or run directly in the console:
99 < 1
!(99 < 1)
64 == 8*8
(3 > 2) | (4 > 5)
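The operators in the list above can be combined. The following sketch uses a hypothetical object `a` to show AND, OR, exclusive OR, and `isTRUE()` side by side.

```r
a <- 7   # hypothetical value, chosen only for illustration

# Combine comparisons with AND (&) and OR (|)
(a > 5) & (a < 10)    # TRUE: both conditions hold
(a < 5) | (a == 7)    # TRUE: the second condition holds

# xor() is TRUE only when exactly one side is TRUE
xor(a > 5, a < 10)    # FALSE: both sides are TRUE
xor(a > 5, a > 10)    # TRUE: only the first side is TRUE

# isTRUE() is TRUE only for a single TRUE value
isTRUE(a == 7)           # TRUE
isTRUE(c(TRUE, TRUE))    # FALSE: two values, not a single TRUE
```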
1.3 Numeric functions
abs(x) absolute value
sqrt(x) square root
ceiling(x) round up: ceiling(3.475) is 4
floor(x) round down: floor(3.475) is 3
round(x, digits=n) round: round(3.475, digits=2) is 3.48
cos(x), sin(x), tan(x), acos(x), cosh(x), acosh(x) etc.
log(x) natural logarithm
log(x, base = n) base n logarithm
log2(x) base 2 logarithm
log10(x) base 10 logarithm
exp(x) exponential function: e^x
Here are a few examples to copy into your Quarto document or run directly in the console:
abs(99)
abs(-99)
sqrt(64)
floor(6.789)
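A few more illustrative calls, with arbitrary input values, show how the rounding functions differ and how the logarithm functions relate to each other.

```r
# Rounding in three different ways
ceiling(2.1)               # 3
floor(2.9)                 # 2
round(2.567, digits = 1)   # 2.6

# Two ways to write a base 10 logarithm
log(1000, base = 10)       # 3
log10(1000)                # 3

# exp() undoes the natural logarithm
log(exp(2))                # 2
```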
1.4 Statistical functions
Below is a list of statistical functions. Most of them accept the argument na.rm (na = not available, rm = remove), which controls how missing values are handled. It is set to FALSE by default, so missing values are not removed and a function such as mean() returns NA whenever x contains one. Set na.rm = TRUE to remove them before the calculation.
mean(x, na.rm = FALSE) mean
sd(x) standard deviation
var(x) variance
median(x) median
quantile(x, probs) quantile of x. probs: vector of probabilities
sum(x) sum
min(x) minimal value of x (x_min)
max(x) maximal value of x (x_max)
range(x) x_min and x_max
scale(x, center = TRUE, scale = TRUE) center and standardize. If center = TRUE: subtract mean. If scale = TRUE: divide by sd
sample(x, size, replace = FALSE, prob) sampling with or without replacement. prob: vector of weights, for weighted sampling
To get help with a function, type ?function_name in the console.
For example, try typing ?mean into the console. Check out this guide on how to read the help page.
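To make the na.rm and scale() descriptions concrete, here is a minimal sketch on two made-up vectors, `x` and `y`.

```r
x <- c(2, 4, 6, NA)   # made-up values with one missing entry

# With the default na.rm = FALSE, the missing value propagates
sum(x)   # NA
sd(x)    # NA

# With na.rm = TRUE, the missing value is dropped first
sum(x, na.rm = TRUE)     # 12
range(x, na.rm = TRUE)   # 2 6

# scale() centers (subtracts the mean) and standardizes (divides by the sd)
y <- c(1, 2, 3)
as.numeric(scale(y))     # -1 0 1
(y - mean(y)) / sd(y)    # the same values, computed by hand
```

scale() returns a one-column matrix, which is why the sketch wraps it in as.numeric() to get a plain vector back.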
1.5 Other useful functions
c() combine: used to create a vector
seq(from, to, by) generates a sequence
: colon operator: generates a 'regular' sequence in increments of 1
rep(x, times, each) repeats x. times: sequence is repeated n times. each: each element is repeated n times
head(x, n = 6) first 6 elements of x
tail(x, n = 6) last 6 elements of x
Here are a few examples to copy into your Quarto document or run directly in the console:
c(1, 2, 3, 4, 5, 6)
mean(c(1, 2, 3, 4, 5, 6))
mean(c(1, NA, 3, 4, 5, 6), na.rm = TRUE)
mean(c(1, NA, 3, 4, 5, 6), na.rm = FALSE)
sum(c(1, 2, 3, 4, 5, 6))
seq(from = 1, to = 6, by = 1)
seq(from = 1, to = 6, by = 2)
rep(1:6, times = 2, each = 2)
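The difference between the times and each arguments of rep() is easiest to see on a short sequence. The sketch below uses 1:3 purely as an illustration.

```r
# times repeats the whole sequence
rep(1:3, times = 2)              # 1 2 3 1 2 3

# each repeats every element before moving on to the next
rep(1:3, each = 2)               # 1 1 2 2 3 3

# When both are given, each is applied first, then times
rep(1:3, times = 2, each = 2)    # 1 1 2 2 3 3 1 1 2 2 3 3

# The colon operator gives the same values as seq() with by = 1
all(1:5 == seq(from = 1, to = 5, by = 1))
```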
Ready to put these new skills to use? Here’s an exercise.
Exercise 1
Write R commands to calculate the following:
- The sum of your birth day, month and year
- 250 divided by the product of 4 and 5
- Half of the sum of 37.5, 51.3, and 101.7
- \(\frac{1}{3} * { (1+3+5+7+2) \over (3+5+4)}\)
- \(\sqrt[3]{8}\)
- \(\sin\pi\), \(\cos\pi\), \(\tan\pi\)
- Calculate the 10-based logarithm of 100, and multiply the result with the cosine of pi. Hint: see ?log and ?pi.
- Calculate the mean, sd and range of the vector [1, 3, 4, 7, 11, 16]
- Generate the following output: 4 4 4 5 5 5 6 6 6 7 7 7
- Generate the following output: 2 4 6 8 10 12
Challenge Exercise 1
Write R commands to evaluate the following:
- You have 83 chocolates in a bag. You would like to divide them into smaller bags of 8.
  1.1 How many small bags will you need?
  1.2 After the bags are filled, how many extra chocolates will you have remaining?
- You are planning a research study with the following eligibility criteria:
  - Study participant should be between 18-25 years old (variable: age)
  - Hemoglobin should be over 10 (variable: hgb)
  - Study participant should not weigh over 50 (variable: wgt)
  - Study location should be X or Y (variable: loc)
  Assuming you get all the information in the form of the above variables, write an R command to determine eligibility.
- I have the following observations from a clinical parameter: x <- c(23.1924, 21.4545, 24.6778). However, I would prefer to limit the values to 1 decimal place. How would you do that?
- Look up the rnorm() function in the Help Viewer. What arguments does this function take? Any default values?
- Try to nest a function within another
2. Objects and Vectors in R
2.1 Objects / Variables
So far, we have been happy to run functions and read the results on the screen. What if you’d like to read results later? You will need to save them by creating Objects.
Copy the following code chunk into your Quarto document
number1 <- 9
sqrt(number1)
[1] 3
number2 <- 10
number1 * number2
[1] 90
What do you think happens when you assign another value to the same object name? Try it!
newnumber <- 100
newnumber <- 150
# What does R pick?
newnumber
Object/Variable names
You can choose almost any name you like for an object, as long as the name does not begin with a number or a special character like +, -, *, /, ^, !, @, or &.
# good variable names
x_mean
x_sd
num_people
age
# not so good
p
a
# bad variable names
x.mean
sd.of-x
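Whether R accepts a name is a separate question from whether the name is good style. One way to check the first, sketched here with made-up candidate names, is the base R function make.names(), which rewrites a string into a valid name. A string that comes back unchanged was already valid.

```r
# Made-up candidate names; the last three break R's naming rules
candidates <- c("x_mean", "num_people", "2nd_try", "my var", "x-mean")

# make.names() repairs invalid names, so an unchanged string was already valid
is_valid <- make.names(candidates) == candidates
is_valid   # TRUE TRUE FALSE FALSE FALSE
```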
2.2 Vectors
Vectors are the fundamental data type in R - all other data types are composed of vectors. These can be subdivided into:
- numeric vectors: a further subdivision is into integer (whole numbers) and double (floating point numbers).
- character vectors: these consist of character strings and are surrounded by quotes, either single ' or double ", e.g. 'word' or "word".
- logical vectors: these can take three values: TRUE, FALSE or NA.
Vectors consist of elements of the same type, i.e., we cannot combine logical and character elements in a vector. Vectors have three properties:
- Type: typeof(): what is it?
- Length: length(): how many elements?
- Attribute: attributes(): additional metadata
Vectors are created using the c() function or by using special functions, such as seq() or rep().
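The three properties can be inspected directly. The following sketch builds one small, made-up vector of each type and queries its type, length, and attributes.

```r
int_vec <- c(1L, 2L, 3L)        # the L suffix marks whole numbers as integer
dbl_vec <- c(1, 2.5, 3)
chr_vec <- c("word", 'word')    # single and double quotes both work
lgl_vec <- c(TRUE, FALSE, NA)

typeof(int_vec)   # "integer"
typeof(dbl_vec)   # "double"
typeof(chr_vec)   # "character"
typeof(lgl_vec)   # "logical"

length(lgl_vec)   # 3: NA still counts as an element

# A plain vector has no attributes; naming its elements adds one
attributes(dbl_vec)                   # NULL
names(dbl_vec) <- c("a", "b", "c")
attributes(dbl_vec)                   # a list with one entry, names
```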
# Numeric vectors
num_vec <- c(2.1,2.2,2.3,2.4,2.5)
length(num_vec)
[1] 5
# Character vectors
char_vec <- c("hello","hi")
length(char_vec)
[1] 2
# Logical vectors
log_vec <- c(TRUE,FALSE,TRUE)
typeof(log_vec)
[1] "logical"
Some fun with vectors
num_vec[1] # what's the first element of num_vec?
num_vec[4] # 4th?
num_vec[-1] # what does -1 do?
num_vec[1:3]
num_vec[-c(1, 3)]
Try this on your own with the character vectors. What differences and similarities do you notice?
Exercise time!
Exercise 2
- Calculate the mean and standard deviation of a numeric vector of the first 6 multiples of 5.
- What happens if you attempt an arithmetic operation on a numeric vector? (Example: num_vec + 1)
- What happens if you attempt the same on a character vector?
- Explore the str_length() function on a character vector.
- What happens if you try to make a vector consisting of different data types?
- Metadata hygiene: Give examples of good and bad ways to make variable names for the following columns in the Metadata spreadsheet –
- Participant ID
- Date of Sample Collection
- Type of Sample
- Mean of past 3 weights
- Number of people
Challenge Exercise 2
- Calculate the mean of the sum of the first 6 even numbers - using one R command.
- first_name and last_name are two separate variables. Make one variable with both together, calling it full_name. Hint: it is a function
3. Dataframes in R
For data analysis and statistics, data frames are important objects of data representation. A data frame is a two dimensional structure with rows and columns, like a table. You can think of it as a collection of vectors. Let us try making one.
# creating a vector with an ID
id <- c(1, 2, 3)
# creating another vector with names
name <- c("PersonA", "PersonB", "PersonC")
# creating another vector with year of birth
year_of_birth <- c(1990, 1995, 2000)
# creating another vector with favourite colour
fav_colour <- c("red", "green", "yellow")
# Make a dataframe from the above
# note, make sure you've added the tidyverse library
library(tidyverse)
df <- tibble(id, name, year_of_birth, fav_colour)
df
# A tibble: 3 × 4
id name year_of_birth fav_colour
<dbl> <chr> <dbl> <chr>
1 1 PersonA 1990 red
2 2 PersonB 1995 green
3 3 PersonC 2000 yellow
You just made a dataframe!
Normally you will import a dataframe rather than type it in, and then extract useful information from it, such as the number of rows and columns, details of each column, and a summary of the values.
Here are a few helpful functions:
attributes(df)
$class
[1] "tbl_df" "tbl" "data.frame"
$row.names
[1] 1 2 3
$names
[1] "id" "name" "year_of_birth" "fav_colour"
rownames(df)
[1] "1" "2" "3"
colnames(df)
[1] "id" "name" "year_of_birth" "fav_colour"
# We didn't assign these explicitly: they were taken from the vector names. Let us try to assign new column names
colnames(df) <- c("ID","name","year_of_birth","favourite_colour")
colnames(df)
[1] "ID" "name" "year_of_birth" "favourite_colour"
## Note: When importing an existing data file, depending on its structure, you can ask R to import it with or without row and column names
nrow(df)
[1] 3
ncol(df)
[1] 4
# Data frame subsetting (I want only a few columns of my interest)
select(df, name)
# A tibble: 3 × 1
name
<chr>
1 PersonA
2 PersonB
3 PersonC
select(df, year_of_birth)
# A tibble: 3 × 1
year_of_birth
<dbl>
1 1990
2 1995
3 2000
#extract row 2 only
slice(df, 2)
# A tibble: 1 × 4
ID name year_of_birth favourite_colour
<dbl> <chr> <dbl> <chr>
1 2 PersonB 1995 green
#extract rows 1, and 3
slice(df, c(1,3))
# A tibble: 2 × 4
ID name year_of_birth favourite_colour
<dbl> <chr> <dbl> <chr>
1 1 PersonA 1990 red
2 3 PersonC 2000 yellow
#extract column 3 only
select(df, 3)
# A tibble: 3 × 1
year_of_birth
<dbl>
1 1990
2 1995
3 2000
#extract rows based on one condition
filter(df, year_of_birth > 1995)
# A tibble: 1 × 4
ID name year_of_birth favourite_colour
<dbl> <chr> <dbl> <chr>
1 3 PersonC 2000 yellow
#extract rows based on multiple conditions
filter(df, year_of_birth > 1995 & favourite_colour=="red")
# A tibble: 0 × 4
# ℹ 4 variables: ID <dbl>, name <chr>, year_of_birth <dbl>,
# favourite_colour <chr>
filter(df, year_of_birth > 1995 | favourite_colour=="red")
# A tibble: 2 × 4
ID name year_of_birth favourite_colour
<dbl> <chr> <dbl> <chr>
1 1 PersonA 1990 red
2 3 PersonC 2000 yellow
#Note the difference in output upon use of different logical operators AND and OR
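Because a dataframe is a collection of vectors, each column can be pulled out and used like any other vector. The sketch below shows this on a hypothetical tibble called `pets`, which is not part of the worksheet data.

```r
library(tidyverse)

# Hypothetical example data, not part of the worksheet
pets <- tibble(
  name = c("Ada", "Bo", "Cy"),
  age = c(3, 7, 5)
)

# Each column is a vector, so vector functions work on it directly
pets$age           # 3 7 5
typeof(pets$age)   # "double"
mean(pets$age)     # 5

# dim() gives the number of rows and columns in one call
dim(pets)          # 3 2
```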
Exercise 3
- Write an expression that returns only the rows where the name is PersonC or the favourite colour is green.
- Make a new dataframe with 6 rows and 5 columns (get creative!)
- Add different combinations of data to be able to use the above functions and compare output.
- Try above functions on your new dataframe and note any interesting observations.
- Try some other functions: str(), head(), view().
Challenge Exercise 3
- Can you think of other simple questions you may need to query your dataset?
- Try to look your query up on google and see if you can find a function that addresses your need! (add ‘in R’ at the end for relevant answers!)
Congratulations! You have successfully reached the end of this exercise. You now possess the most important skill: googling your query!
End of session worksheet: Introduction to R and Quarto
Introduction
Now that we have practiced a bit on small mock datasets, let us try a ‘real’ one. In this worksheet, we will look at the Ice Breaker Poll from this morning and learn how to perform basic data manipulations, such as filtering data rows that meet certain conditions, choosing data columns, and arranging data in ascending or descending order.
First, download the Ice Breaker poll dataset here and move it to your project directory.
Use the following command to read in the data from the Ice Breaker poll.
ice_breaker_df <- read_csv("Ice Breaker Survey (Responses) - Form Responses 1.csv") # make sure this file exists in your current directory!
Rows: 5 Columns: 13
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (9): Timestamp, first_thing_in_morning, vanilla_chocolate, superpower, s...
dbl (3): num_languages, num_browser_tabs, height_cm
lgl (1): years_current_country
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
We will be using the R package tidyverse for the data manipulation functions %>%, filter(), select(), arrange(), count(), and mutate().
library(tidyverse)
The pipe (%>%, read: “and then”)
When writing complex data analysis pipelines, we frequently use the pipe operator %>% to move data from one analysis step to the next. The pipe is pronounced “and then”, and it takes the data on its left and uses it as the first argument for the function on its right.
For example, to see the first few lines of a dataset, we often write head(dataframe). Instead, we can write dataframe %>% head().
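To see that a piped call and a nested call are two spellings of the same computation, here is a small sketch on a made-up vector `values`.

```r
library(tidyverse)

values <- c(1, 4, 9)   # made-up numbers for illustration

# Nested calls are read from the inside out
nested <- sum(sqrt(values))

# The pipe passes the left side as the first argument of the next function,
# so the same steps read left to right
piped <- values %>% sqrt() %>% sum()

nested == piped   # TRUE: both are 6

# Further arguments go inside the parentheses as usual:
# head(values, n = 2) becomes
values %>% head(n = 2)
```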
Try this yourself. Write code that displays the first few lines of the ice_breaker_df dataset, using %>% and head():
# build all the code for this exercise
Now get all the column names by using the colnames() function on ice_breaker_df.
# build all the code for this exercise
Choosing data rows
The function filter() allows you to find rows in a dataset that meet one or more specific conditions. The syntax is dataframe %>% filter(condition), where condition is a logical condition. For example, filter(x > 5) would pick all rows for which the value in column x is greater than 5.
As an example, the following code picks all survey responses where people prefer chocolate over vanilla ice cream:
ice_breaker_df %>%
  filter(vanilla_chocolate == "Chocolate")
# A tibble: 2 × 13
Timestamp first_thing_in_morning num_languages vanilla_chocolate superpower
<chr> <chr> <dbl> <chr> <chr>
1 10/5/2023 1… Go back to sleep 3.25 Chocolate Shape shi…
2 10/5/2023 1… Go back to sleep 2.5 Chocolate Flight
# ℹ 8 more variables: social_media <chr>, num_browser_tabs <dbl>,
# height_cm <dbl>, procrastinate <chr>, extreme_sport <chr>,
# r_experience <chr>, travel_to_workshop <chr>, years_current_country <lgl>
Can you tell how many people that is from looking at the size of the tibble?
Now it’s your turn to try one. Pick all responses where people would like to try Skydiving.
ice_breaker_df %>%
  filter(___)
Filtering for multiple conditions
You can also state multiple conditions, separated by a comma. For example, filter(x > 5, y < 2) would pick all rows for which the value in the column x is greater than 5 and the value in the column y is less than 2. Note that the conditions are combined via logical AND: both need to be satisfied for the row to be picked.
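The comma behaves like the & operator introduced earlier. A minimal sketch on a hypothetical tibble `toy` (not the survey data) shows that both spellings pick the same rows.

```r
library(tidyverse)

# Hypothetical example data, not part of the survey
toy <- tibble(
  x = c(3, 6, 8, 10),
  y = c(1, 1, 5, 0)
)

# Conditions separated by a comma must all be TRUE
with_comma <- toy %>% filter(x > 5, y < 2)

# Writing the same conditions with & gives the same rows
with_and <- toy %>% filter(x > 5 & y < 2)

with_comma                        # the rows with x = 6 and x = 10
identical(with_comma, with_and)   # TRUE
```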
To try this out, pick all survey responses where people taller than XXX cm would like to retain their Facebook.
# build all the code for this exercise
Choosing data columns
The function select() allows you to pick specific data columns by name. This is frequently useful when a dataset has many more columns than we are interested in at the time. For example, if we are only interested in the responses regarding what people do first thing in the morning, what superpower they would like, and how they procrastinate, we could select just those three columns:
ice_breaker_df %>%
  select(first_thing_in_morning, superpower, procrastinate)
# A tibble: 5 × 3
first_thing_in_morning superpower procrastinate
<chr> <chr> <chr>
1 Check text messages Flight Watching TV
2 Go back to sleep Shape shifting Eating snacks
3 Go back to sleep Flight Watching TV
4 Turn off the alarm Flight Browsing the internet
5 Turn off the alarm <NA> <NA>
Try this yourself, picking the columns representing responses to how many browser tabs people have open right now and what social media they would like to keep.
# build all the code for this exercise
Choosing columns for removal
Another situation that arises frequently is one where we want to remove specific columns. We can also do this with select(), but now write select(-column) to remove one or more columns.
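Keeping and removing columns are two views of the same operation. The sketch below uses a hypothetical three-column tibble `toy` and checks the remaining column names after each call.

```r
library(tidyverse)

# Hypothetical example data, not part of the survey
toy <- tibble(
  id = 1:3,
  height = c(150, 160, 170),
  note = c("a", "b", "c")
)

# Keep columns by naming them
kept <- toy %>% select(id, height)
colnames(kept)      # "id" "height"

# Remove a column by putting a minus sign in front of its name
removed <- toy %>% select(-note)
colnames(removed)   # "id" "height"
```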
Try this. Remove the column num_browser_tabs.
# build all the code for this exercise
And now try removing both num_browser_tabs and procrastinate.
Sorting data
The function arrange() allows you to sort data by one or more columns. For example, dataframe %>% arrange(x) would sort the data by increasing values of x, and dataframe %>% arrange(x, y) would sort the data first by x and then, for ties in x, by y.
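Sorting by a second column only matters when the first column has ties. The following sketch uses a hypothetical tibble `toy` in which each value of x appears twice.

```r
library(tidyverse)

# Hypothetical example data with ties in x
toy <- tibble(
  x = c(2, 1, 2, 1),
  y = c("b", "d", "a", "c")
)

# Sort by x, then break ties in x by y
sorted <- toy %>% arrange(x, y)
sorted$x   # 1 1 2 2
sorted$y   # "c" "d" "a" "b"
```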
As an example, the following code sorts responses by the person’s height:
ice_breaker_df %>%
  arrange(height_cm)
# A tibble: 5 × 13
Timestamp first_thing_in_morning num_languages vanilla_chocolate superpower
<chr> <chr> <dbl> <chr> <chr>
1 10/5/2023 1… Go back to sleep 2.5 Chocolate Flight
2 10/5/2023 1… Go back to sleep 3.25 Chocolate Shape shi…
3 10/5/2023 1… Turn off the alarm 2 Vanilla Flight
4 10/5/2023 1… Check text messages 1 Vanilla Flight
5 10/15/2023 … Turn off the alarm NA <NA> <NA>
# ℹ 8 more variables: social_media <chr>, num_browser_tabs <dbl>,
# height_cm <dbl>, procrastinate <chr>, extreme_sport <chr>,
# r_experience <chr>, travel_to_workshop <chr>, years_current_country <lgl>
Now it’s your turn. Sort responses by the number of languages people can speak:
# build all the code for this exercise
Arranging in descending order
To arrange data in descending order, enclose the data column in desc(). For example, dataframe %>% arrange(desc(x)) would sort the data by decreasing values of x. (desc stands for “descending”.)
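One detail worth knowing when a column has missing values: arrange() places NA at the end in both directions. A short sketch on a made-up column illustrates this.

```r
library(tidyverse)

# Hypothetical example data with one missing value
toy <- tibble(x = c(2, NA, 3, 1))

ascending <- toy %>% arrange(x) %>% pull(x)
ascending    # 1 2 3 NA

descending <- toy %>% arrange(desc(x)) %>% pull(x)
descending   # 3 2 1 NA
```

Here pull() extracts the column as a plain vector so the order is easy to read.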
Try this out. Sort the responses by height again, this time from largest to smallest:
# build all the code for this exercise
Counting
We frequently want to count how many times a particular value or combination of values occurs in a dataset. We do this using the count() function. For example, the following code counts how many of each number we got for the number of languages people can speak.
ice_breaker_df %>%
  count(num_languages)
# A tibble: 5 × 2
num_languages n
<dbl> <int>
1 1 1
2 2 1
3 2.5 1
4 3.25 1
5 NA 1
Now try this yourself. Count how many prefer vanilla ice cream and how many chocolate.
# build all the code for this exercise
Chaining analysis steps into pipelines
We can chain multiple analysis steps into a pipeline by continuing to add “and then” statements. For example, dataframe %>% count(...) %>% arrange(...) would first count and then sort the data.
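As an illustration of such a pipeline, the sketch below counts values in a hypothetical tibble `toy` and then sorts the counts from most to least frequent. count() stores its result in a column named n.

```r
library(tidyverse)

# Hypothetical example data, not part of the survey
toy <- tibble(fruit = c("apple", "pear", "apple", "kiwi", "apple", "pear"))

# Count each fruit, and then sort by the count, largest first
fruit_counts <- toy %>%
  count(fruit) %>%
  arrange(desc(n))

fruit_counts$fruit   # "apple" "pear" "kiwi"
fruit_counts$n       # 3 2 1
```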
Try this out by counting the number of responses for each number of languages spoken and then sorting by the count.
# build all the code for this exercise
Creating new data columns
The function mutate() allows you to add new columns to a data table. For example, dataframe %>% mutate(sum = x + y) would create a new column sum that is the sum of the columns x and y:
data <- tibble(x = 1:3, y = c(10, 20, 30))
data
# A tibble: 3 × 2
x y
<int> <dbl>
1 1 10
2 2 20
3 3 30
data %>%
  mutate(sum = x + y)
# A tibble: 3 × 3
x y sum
<int> <dbl> <dbl>
1 1 10 11
2 2 20 22
3 3 30 33
Note that the part to the left of the equals sign (here, sum) is the name of the new column, and the part to the right of the equals sign (here, x + y) is an R expression that evaluates to the values in the new column.
Now apply this concept to the ice_breaker_df dataset. Add a new column browsing by language that is the ratio of the number of browser tabs currently open to the number of languages spoken:
# build all the code for this exercise
Counting with custom conditions
It is quite common that we want to count items that meet a specific condition. For example, let’s say we want to count how many people are taller than 155 cm. To do this efficiently, we first create a new column that indicates whether the condition is met or not, and we then use count with that indicator column.
The easiest way to create indicator columns is via the function if_else(), which takes three arguments: a condition, a result if the condition is met, and a result if the condition is not met. The following example shows how to create an indicator column showing whether a variable is positive or negative:
data <- tibble(x = c(-0.5, 2.3, 50, -1.4))
data
# A tibble: 4 × 1
x
<dbl>
1 -0.5
2 2.3
3 50
4 -1.4
data %>%
  mutate(
    sign_of_x = if_else(x >= 0, "positive", "negative")
  ) %>%
  count(sign_of_x)
# A tibble: 2 × 2
sign_of_x n
<chr> <int>
1 negative 2
2 positive 2
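When the column being tested contains missing values, the condition is neither met nor unmet for those rows. A small sketch with made-up data shows that if_else() returns NA there, and that count() reports the NA group as its own row.

```r
library(tidyverse)

# Hypothetical example data with one missing value
toy <- tibble(x = c(-0.5, 2.3, NA))

sign_counts <- toy %>%
  mutate(
    sign_of_x = if_else(x >= 0, "positive", "negative")
  ) %>%
  count(sign_of_x)

sign_counts$sign_of_x   # "negative" "positive" NA
sign_counts$n           # 1 1 1
```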
Now try this yourself. Count how many people are taller than 155 cm. Then sort your results.
Here are a few additional exercises that you can work on to practice and learn more about survey responses from everyone in this room!
Exercise - fun with the survey
Write R commands for the following -
1. How many people took this survey?
2. How many questions did we ask?
3. What questions did we ask?
4. Give a few examples of the data types captured in the questions
5. Look at responses of questions 4-6 from all participants
6. Try to rename a column (question)
7. Make a new dataframe of 5 questions of your choice.
8. Can you get the height of the tallest person in this room?
9. How many people speak more than 2 languages?
10. Select the question about R experience and sort by the kind of R background and experience in this room.
11. What is the ratio of people who took a plane to this workshop vs those who walked?