Metagenomic Data Analysis on T-BioInfo and R (DADA2 and Phyloseq)

Elia Brodsky - www.ebrodsky.site
4 min readSep 17, 2020
DADA2 pipeline with Phyloseq on T-BioInfo and in R Studio

The microbiome has been fascinating to many researchers — it is a new way of thinking about microorganisms that populate almost every area of our body, colonizing our skin, helping maintain the gut function and even influencing how we feel. To study microbial community (microbiota) composition as it exists in its natural environment, metagenomic sequencing using paired-end reads can yield important information about diversity of a microbiota — an interesting metric that has been found to be associated with a variety of diseases in the human body. Here is a portion of the Metagenomics 2 course where you can learn about the microbiome, metagenomic sequencing and analysis of 16s Amplicon rRNA data.

V1 — V9 Hyper variable regions of the 16s rRNA and sequencing methods for Illumina MiSeq

The main idea is that 1) bacterial organisms contain a gene that is unique to eukaryotes that is essential to a vital function: the 16s ribosomal RNA. Since its overall function differs little within the eukaryote domain, we can find smaller regions in this gene that can give us a good sense of how taxonomy of organisms within the bacterial kingdom are related to each other. 2) the 16s rRNA contains 9 “hyper variable regions” that can show taxonomic differences between phyla, genera, species and families. 3) This is extremely useful — the length of these regions ranges between 80–150 nucleotides long, making them perfect for MiSeq Illumina sequencers that produce 300 bp long reads. 4) Thus, one or several of these HVRs can be sequenced to study a whole community of known and mostly obscure organisms and isntead of looking at a noisy compilation of genomic data, we see the diversity, richness of species (variety or biodiversity) and can see variation in relative abundance by various taxonomic levels.

Here is a simple example. There are over 25,000 samples in the American Gut Project (AGP), the biggest crowdsourced microbiome project in the world! And even though the project is called “American Gut”, it has samples beyond the USA and they even store their sequencing files on the EBI (European Bioinformatics Institute site). These are donated samples from skin, fecal and mouth microbial samples that come with a detailed set of phenotypic data points from an associated survey of self-reported conditions. These include exercise, diet, sensitivities and other lifestyle as well as medical information. The amazing part is that anyone can also download and analyze this data to study trends in microbial communities associated with these kinds of phenotypes.

Once you download the data, it has to be processed (using a bioinformatics pipeline, such as the DADA2, UPARSE or QUIIME tools) and then analyzed for diversity, composition and proportions that can be compared between groups. Here is an example pipeline using the DADA2 worflow on the T-BioInfo platform:

These pipelines generate a .biom file which contains information about the sequences found in a sample, grouped into operational taxonomic units (OTUs), including their abundance in each sample and the annotation provided for each sample.

After the file is prepared, we can use a package in R, called “phyloseq” to analyze these data and visualize the results. Let’s see how to do that in R, first, let’s install the phyloseq package:

if (!requireNamespace(“BiocManager”, quietly = TRUE))
install.packages(“BiocManager”)
BiocManager::install(“phyloseq”)

then, we need to load the data:

library("phyloseq") 
library("ggplot2")
library("dplyr")
otu_mat <- read.table("Amplicon_Abundance_table.txt", header = TRUE, sep = "\t")
tax_mat <- read.table("Taxonomy_table.txt", header = TRUE, sep = "\t")
samples_df <- read.table("SupplementaryTable.txt", header = TRUE, sep = "\t")

Now these files have to be transformed and turned into data that phyloseq can use, or phyloseq “objects. This is how you can do it in R:

#Define Row Names for each type of data
row.names(otu_mat) <- otu_mat$OTUs
#Remove the column which is already used as row from the data
otu_mat <- otu_mat %>% select (-OTUs)
row.names(tax_mat) <- tax_mat$class
tax_mat <- tax_mat %>% select (-class)
row.names(samples_df) <- samples_df$Run
samples_df <- samples_df %>% select (-Run)
sampletype <- unique(row.names(samples_df))
#Transform into matrix
otu_mat <- as.matrix(otu_mat)
tax_mat <- as.matrix(tax_mat)
#Transform matrix data as input for Phyloseq
OTU <- otu_table(otu_mat, taxa_are_rows = TRUE)
TAX <- tax_table(tax_mat)
samples <- sample_data(samples_df)
DF <- phyloseq(OTU, TAX, samples)

Once the data is prepared, phyloseq can generate a variety of analyses and visual plots to help explore this dataset:

#Barplot showing proportion of OTUs by phyla
p <- plot_bar(acne, fill = "Phylum")
DF.ord <- ordinate(DF, "NMDS", "bray")
sample_variables(DF)
#Richness plot
pp <- plot_richness(DF, color = "Dermatology_Disord", measures = c("Chao1", "Shannon"))

As a result, we will see the analysis of samples for alpha and beta diversity or study the composition of each sample or group of samples trying to find differences between groups (skin condition, gut sensitivity or disease).

Alpha diversity plot using the Chao1 and Shannon indeces

For a full tutorial, visit code.omicslogic.com

--

--

Elia Brodsky - www.ebrodsky.site

Healthcare, Life Sciences, Data... In the past, startup co-founder @PineBiotech — big data, bioinformatics, healthcare