16S Data Processing

Slides

Load packages of interest

library(tidyverse)
library(dada2)
library(here)
library(phyloseq)
library(reshape2)

Specify path to the files

path <- "../subsampled_reads/Gut" ## CHANGE to the directory containing the fastq files.
#list.files(path) # this lists the names of each file inside the directory 

Separate foward and reverse reads using file extensions

This allows you to assess the quality of the foward and reverse reads appropriately

list.files function is used with specific patterns (“R1.fastq.gz” for forward reads and “R2.fastq.gz” for reverse reads) to separate the forward and reverse reads.

The sort function ensures that the forward and reverse reads are in the same order, which is necessary for the subsequent paired-end read assembly and analysis.

full.names = TRUE: This argument specifies that you want the full path names of the files rather than just their names.

# Sort ensures forward/reverse reads are in same order
fwdR<- sort(list.files(path, pattern = "R1.fastq.gz", full.names = TRUE))
revR <- sort(list.files(path, pattern = "R2.fastq.gz", full.names = TRUE))

Inspect the quality of the reads

plotQualityProfile(fwdR[1:2])
plotQualityProfile(revR[1:4])

Red line: Represents the scaled proportion of reads that extend to at least a certain position in the sequencing read.

In the context of next-generation sequencing (NGS), reads are the sequences generated by the sequencing machine. The length of these reads can vary depending on the technology used and the specific settings of the sequencing run. The “position” in this context refers to the position along the length of the reads. For example, position 1 would be the first base of the reads, position 2 would be the second base, and so on. The “scaled proportion of reads that extend to at least that position” means the proportion of reads that are at least as long as that position, adjusted (or scaled) for the total number of reads. This kind of plot can be used to assess the quality of the sequencing run. For example, if a large proportion of reads do not extend to the expected full length, it may indicate an issue with the sequencing process, such as poor quality of the DNA input or a technical problem with the sequencing machine.

The number after the red line in a sequencing analysis plot typically represents the count or proportion of reads that extend to at least the position indicated by the red line. This can be used to assess the quality and depth of the sequencing data. If the number is high, it means that a large proportion of the reads extend to the full length, which is generally a good sign. It indicates that the sequencing process was successful and that the data is of high quality. Conversely, if the number is low, it suggests that many reads did not extend to the full length. This could be due to various factors such as poor quality of the DNA input, technical issues with the sequencing machine, or errors in the sequencing process. It’s important to note that the specifics can vary depending on the exact context and the software or pipeline being used. Therefore, it’s always a good idea to refer to the documentation or guidelines associated with the specific sequencing analysis pipeline you’re using.

Extract sample names

sample.names <- sapply(strsplit(basename(fwdR), "_R"), `[`, 1)

sapply is a function that applies a given function to each element of a list or vector and returns the results in a simplified format.

strsplit separates each element of the character vector fwdR at the “_R” character.

basename(fwdR) extracts the last part of the file path specified by fwdR. It removes the directory part of the file path and returns only the file name.

The sapply function is used to apply the subsetting function to each element of the list returned by strsplit. The 1 inside indicates that we want to keep only the first element of each split character vector.

Name the filtered reads

To identify the filtered reads, we add a string “_F_filt” to the sample names paste0 allows you to put the file extension “_F_filt.fastq.gz” to each element of the sample.names vector.

filtFs <- file.path(path,"filtered", paste0(sample.names, "_F_filt.fastq.gz"))
filtRs <- file.path(path,"filtered", paste0(sample.names, "_R_filt.fastq.gz"))
names(filtFs) <- sample.names
names(filtRs) <- sample.names

file.path(path, "filtered", paste0(sample.names, "\_F_filt.fastq.gz")): This part constructs the file paths for the filtered reads. file.path() concatenates directory paths and file names. path = base directory where the raw files are located. "filtered" is a subdirectory within path. paste0(sample.names, "\_F_filt.fastq.gz") creates file names by appending each sample name from the vector sample.names with "\_F_filt.fastq.gz".

names(filtFs) <- sample.names and names(filtRs) <- sample.names: These lines assign the sample names to the constructed file paths. This way, each file path is associated with its corresponding sample name.

Filter and trim

out <- filterAndTrim(fwdR,filtFs,revR,filtRs,compress=TRUE,
                     truncLen=c(300,270), #truncation length (fwdR,revR) 
                     maxN=0,#sequences that contain no unknown bases
                     maxEE=c(2,2),
                     truncQ=2 , multithread = TRUE  
                     )

compress = TRUE: Indicates whether or not to compress the output. truncLen: Specifies the truncation length for the forward and reverse reads, respectively. maxN = 0: Maximum number of unknown bases allowed in the sequences. maxEE = c(2, 2): Specifies the maximum expected errors for the forward and reverse reads, respectively. truncQ = 2: Quality score threshold for truncation.

Multithreading is a programming technique used to achieve multitasking within a single process. It allows a program to perform multiple tasks concurrently by breaking them into smaller threads of execution that can run independently.

The multithread parameter is controlling whether the filterAndTrim function will utilize multiple threads (multithreading) to process the data in parallel.

Inspect filtered reads

head(out)
save(out, file="filtering_summary.RData")

Specify the minimum number of reads expected post filtering and trimming

keep <- out[,"reads.out"] > 200

Determine number of samples that are to be kept

length(out[keep])/2 # number samples w/ over 200 reads

#404 samples out of 418

Determine the percentage of remaining samples

(length(out[keep]))/2/(length(out)/2) # percent of samples kept
 #0.9665072 good number 

Determine samples to keep for downstream analysis

filtFs <- file.path(filtFs)[keep]
filtRs <- file.path(filtRs)[keep]

Error rates

The DADA2 algorithm depends on a parametric error model (err) and every amplicon dataset has a different set of error rates. The learnErrors method learns the error model from the data, by alternating estimation of the error rates and inference of sample composition until they converge on a jointly consistent solution. As in many optimization problems, the algorithm must begin with an initial guess, for which the maximum possible error rates in this data are used (the error rates if only the most abundant sequence is correct and all the rest are errors).

errF <- learnErrors(filtFs, multithread = TRUE)
errR <- learnErrors(filtRs, multithread = TRUE)

Visualize the estimated error rates as a sanity check.

plotErrors(errF, nominalQ = TRUE)
plotErrors(errR, nominalQ=TRUE)

plotErrors: Creates plots of errors or quality metrics associated with sequencing data. errF and errR: These are variables representing the errors or quality metrics for the forward and reverse reads, respectively. nominalQ = TRUE: This argument specifies whether to use nominal quality scores. If TRUE, it indicates that the quality scores are nominal (e.g., Phred scores), and the plot should be adjusted accordingly. If FALSE or not specified, the function may interpret the quality scores differently.

The error rates for each possible transition (eg. A->C, A->G, …) are shown. Points are the observed error rates for each consensus quality score. The black line shows the estimated error rates after convergence. The red line shows the error rates expected under the nominal definition of the Q-value. Here the black line (the estimated rates) fits the observed rates well, and the error rates drop with increased quality as expected.

derepFs <- derepFastq(filtFs, verbose=TRUE)
derepRs <- derepFastq(filtRs, verbose=TRUE)

Name dereplicated files

names(derepFs) <- sample.names[keep]
names(derepRs) <- sample.names[keep]

Sample Inference

The core sample inference algorithm is applied to the dereplicated data.

dadaFs <- dada(derepFs, err=errF, pool = FALSE, multithread=TRUE)
dadaRs <- dada(derepRs, err=errR, pool = FALSE, multithread=TRUE)

Merge paired reads

mergers <- mergePairs(dadaFs, filtFs, dadaRs, filtRs, verbose=TRUE)

Construct Sequence Table

Amplicon sequence variant table (ASV) table: a higher-resolution version of the OTU table produced by traditional methods.

seqtab <- makeSequenceTable(mergers)
dim(seqtab) #tells you the (rows, columns) in a matrix
#rows represent samples and columns have unique sequence variants 

Determine how many bases are in each amplicon using nchar #Table summarizes the number of amplicons with a specific length

table(nchar(getSequences(seqtab)))

View length distribution

hist(nchar(getSequences(seqtab)),
     main="Distribution of sequence lengths")

In the event that a lot of reads have much longer or shorter lengths than expected , filter seqtab to keep only the reads that fall within the expected range May be the result of non-specific priming

seqtab.f <- seqtab[,nchar(colnames(seqtab)) %in% seq(430, 480)]
# % of reads with desired length
sum(seqtab.f)/sum(seqtab)

Remove chimeras

seqtab.nochim <- removeBimeraDenovo(seqtab.f, method="consensus", multithread=TRUE, verbose=TRUE)
#nonchimeric % of reads
sum(seqtab.nochim)/sum(seqtab.f)
dim(seqtab.nochim)

save(seqtab.nochim, file="seqtab.nochim120424.RData")

Inspect distribution of sequence lengths:

table(nchar(getSequences(seqtab.nochim)))

View length distribution

hist(nchar(getSequences(seqtab.nochim)),
     main="Distribution of sequence lengths")

Track reads through the pipeline

Inspect the the number of reads that made it through each step in the pipeline to verify everything worked as expected.

getN <- function(x) sum(getUniques(x))
track <- cbind(out, 
               sapply(dadaFs, getN), 
               sapply(dadaRs, getN),
               sapply(mergers, getN), 
               rowSums(seqtab.f), 
               rowSums(seqtab.nochim))

colnames(track) <- c("input",
                     "filtered",
                     "denoisedF",
                     "denoisedR",
                     "merged",
                     "tabled", 
                     "nonchim")

# A .txt file to map the loss of reads across the different DADA2 steps
write.table(track,
            "track_2304424.txt", sep = "\t", col.names = T, row.names = F)
save(track, file="track_2304424.txt")

Assign taxonomy

taxa <- assignTaxonomy(
  seqtab.nochim,
  "~/workshop_2/taxonomic_database/silva_nr99_v138.1_wSpecies_train_set.fa",
  minBoot = 60,
  multithread=TRUE)
unname(head(taxa))

Check assignmnent

taxa.print <- taxa 
rownames(taxa.print) <- NULL # Removing sequence rownames for display only
head(taxa.print)

saveRDS(taxa,
        "taxa_230424.rds")