Basic Analyses of Microbiome Data in R

To highlight several of the concepts discussed in this tutorial document, we will be using an abridged dataset from a previous lab member, Kei Fujimura’s study Neonatal gut microbiota associates with childhood multisensitized atopy and T cell differentiation. The sample data and OTU table can be found in /data/Users/dli2/DOC. Please note that the OTU table has been processed through all of the applicable steps shown in the processing document (very important to do so prior to any analyses!).

All analyses moving forward have been run on Lynchserver2 for the development of this document. However, Wynton can also be used to analyze data. Katie highly encourages you to use one of these two servers, especially as datasets become large.

0.1 Moving files from Wynton to Lynchserver2

Use the scp command as described above, similar to what you needed to do with your mapping file. You will need to run from lynchserver2 and your command will look something like:

cd to where you want your data to go
scp {yourusername}@dt2.wynton.ucsf.edu:/wynton/group/lynch/{location of your data} .

The dot/period means to use the directory that you are currently in.

0.2 Using RStudioServer

Katie has set up a nice interface for being able to run code in R, see plots as they’re generated, and access some files all within the same space. In order to use it on a Mac:

  1. Go to Terminal and type: ssh -N -f -L localhost:8001:localhost:8787 {yourusername}@lynchserver2.ucsf.edu
  2. Replace {yourusername} with your actual username for Lynchserver2
  3. Go into your favorite web browser and type localhost:8001
  4. Enter your username and password (again)
  5. You should be taken to the web version of Rstudio and be able to access files as if you were on the server

A note re: Windows machines, this works on your computer but use PowerShell instead. Discuss with Katie if you run into issues.

0.3 A Note about Finding Scripts

I have developed most of the analysis scripts to be found in three places: Lynchserver2, Wynton and GitHub. All locations are managed by github, so all versions should be similar if not identical.

Lynchserver2: /data/Users/kmccauley/LabCode/

Wynton: /wynton/group/lynch/kmccauley/LabCode/

GitHub: https://github.com/lynchlab-ucsf/lab-code/

1. Importing our data

The process here will include how to generate a phyloseq object from scratch; if you already have a basic phyloseq object to start analyzing, you can jump to the next section that will have some basic cleanup suggestions before moving into true analysis. Anyway, in order to start compiling individual pieces of data into a phyloseq object, we first need to load our data into R. Here, we load our sample data file. We also create some new variables for our data which will be used for grouping later on.

stool_data <- read.table("fujimura_sample_data.txt", header=T, check.names=F, comment="", sep="\t")
## Typically, all four options at the end of read.table are included when reading in most tables and they are described here:
### header=T -> Data already has a header/variable names included
### check.names=F -> Sometimes R will reformat your variable names without your consent, and this keeps R from doing that
### comment="" -> You'll recall that some mapping files require the # symbol in front of SampleID, and this encourages R to not see your variable names as a commented line
### sep="\t" -> confirms that your data values are separated by a tab (which should be the case in almost all txt files). CSVs are comma-separated, but there's a whole read.csv function for those.

#creating a month variable for each participant (ages are in days right now)
stool_data$month <- round(stool_data$ageStool/30, 0)
stool_data$month <- as.factor(stool_data$month)

stool_data$month_group <- ifelse(stool_data$month==1, "A",
                               ifelse(stool_data$month==2, "B",
                                ifelse(stool_data$month==3, "C",
                                  ifelse(stool_data$month==4, "D", "E")
                                    )))

Next, we load in our OTU table. We also do a bit of cleaning to our data (any steps beyond reading in your OTU table are likely not necessary, but is done here for the example dataset in particular).

otu_data <- read.delim("fujimura_rarefied_otutable.txt", check.names=F, sep="\t")
#giving our otus an otu label, rather than just numbers.
otu_data$`#OTUID` <- paste0("OTU_", otu_data$`#OTUID`)
#removing OTUs that are reference genomes as we don't need those (they have "reference" in the name so this line is selecting all rows of the OTU table that don't have "reference" in them)
otu_data <- otu_data[!grepl("Reference", otu_data$`#OTUID`), ]

Now, we will load Phyloseq, a useful package commonly used for microbial analyses.

Note: if these packages are not installed yet, you can run the code as shown here to install them.

source('http://bioconductor.org/biocLite.R')
biocLite('phyloseq')

install.packages("ggplot2")
#loading our installed packages in R.
library(phyloseq)
#for more plotting features later.
library(ggplot2)

Next, we prepare our data to be merged into a Phyloseq object.

#This is for phyloseq. phyloseq doesn't like when your OTU table has the taxonomy info attached so we're separating it for now.
#also doing some other cleaning up to make a phyloseq object
taxonomy_data <- otu_data[, c("#OTUID", "taxonomy")]
rownames(taxonomy_data) <- taxonomy_data$`#OTUID`
rownames(otu_data) <- otu_data$`#OTUID`
rownames(stool_data) <- stool_data$`#SampleID`
otu_data$`#OTUID` <- NULL
otu_data$taxonomy <- NULL

This function is to make the taxonomy table look nice, and for easier use later on. Note that there is no output in this code chunk, this is simply saving our function.

split.tax <- function(tax.dat) {
  taxanames <- strsplit(as.character(tax.dat[,2]),"; ")
  mat <- t(sapply(taxanames,
                  function(x,m) c(x,rep(NA,m-length(x))),
                  max(rapply(taxanames,length))))
  
  newnames <- gsub("_","",mat)
  newnames <- as.matrix(newnames)
  colnames(newnames) <- c("Kingdom","Phylum","Class","Order","Family","Genus","Species")
  row.names(newnames) <- tax.dat[,1]
  newnames[,6][newnames[,6] %in% c("") | is.na(newnames[,6])] <- newnames[,5][newnames[,6] %in% c("") | is.na(newnames[,6])]
  newnames[,6][newnames[,6] %in% c("") | is.na(newnames[,6])] <- newnames[,4][newnames[,6] %in% c("") | is.na(newnames[,6])]
  return(newnames)
}

Now, we run the function on our taxonomy data to clean it up.

#use the function on our taxonomy table
taxonomy_data <- split.tax(taxonomy_data)
taxonomy_data <- sub('.', '', taxonomy_data)
taxonomy_data <- as.data.frame(taxonomy_data)

To view our new formatted taxonomy table:

head(taxonomy_data)
##              Kingdom         Phylum               Class             Order
## OTU_100497  Bacteria Proteobacteria Gammaproteobacteria Enterobacteriales
## OTU_1010113 Bacteria Proteobacteria Gammaproteobacteria Enterobacteriales
## OTU_1012948 Bacteria Actinobacteria      Actinobacteria   Actinomycetales
## OTU_1023075 Bacteria     Firmicutes          Clostridia     Clostridiales
## OTU_103166  Bacteria Proteobacteria Gammaproteobacteria Enterobacteriales
## OTU_1033413 Bacteria     Firmicutes             Bacilli   Lactobacillales
##                         Family              Genus Species
## OTU_100497  Enterobacteriaceae Enterobacteriaceae    <NA>
## OTU_1010113 Enterobacteriaceae Enterobacteriaceae    <NA>
## OTU_1012948 Corynebacteriaceae    Corynebacterium        
## OTU_1023075    Veillonellaceae        Veillonella        
## OTU_103166  Enterobacteriaceae        Citrobacter        
## OTU_1033413    Enterococcaceae       Enterococcus

We see that our classifications have been separated into their respective hierarchies for each OTU in our data.

Finally, we read in our phylogenetic tree.

phytree <- read_tree("rep_set_aligned_pfiltered_041614.tre")
taxa_names(phytree) <- paste0("OTU_", taxa_names(phytree)) ## Needed for this particular dataset
phytree <- ape::root(phytree, 1, resolve.root=TRUE) ## For most analyses, a rooted tree is required. This command roots the tree. If phylogenetic analyses appear "off", you may need to change the `1` value to another number.

Now, we make a Phyloseq object upon which we will conduct our preliminary analyses.

#making a phyloseq object. phyloseq takes a variety of datasets, but here we are giving it our sample data, OTU table, and taxonomy data.
phy <- phyloseq(sample_data(stool_data), otu_table(as.matrix(otu_data), taxa_are_rows = T), tax_table(as.matrix(taxonomy_data)), phy_tree(phytree))

To see the contents of our phyloseq object:

phy
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 858 taxa and 130 samples ]
## sample_data() Sample Data:       [ 130 samples by 6 sample variables ]
## tax_table()   Taxonomy Table:    [ 858 taxa by 7 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 858 tips and 857 internal nodes ]

As you can see, we have 130 samples, 858 taxa, and 6 variables in our sample data.

Using the phyloseq object from the processing pipeline

You can also import the phyloseq object using the readRDS function like so:

phy <- readRDS("phyloseq_noneg.rds") ## load
saveRDS(phy, "saving_my_phyloseq.rds") ## example of saving the file

You can save any object that lives in R as an RDS file, but not all RDS files you encounter will be phyloseq objects. Just a word of warning.

There are a couple of items that are covered above that will also be helpful here. Specifically:

Rooting the Tree

A “rooted” tree is necessary for Weighted and Unweighted UniFrac analyses, but the tree is not naturally rooted, so we run the following to make that happen.

phy_tree(phy) <- ape::root(phy_tree(phy), 1, resolve.root=TRUE)
Renaming Your Taxa

When you get your phyloseq object from the DADA2 pipeline, your sequence variants will be the actual 253-ish bp (if you’ve sequenced the V4 region) sequence, which while nice to have, it makes some figures messy, taxa difficult to discriminate, and actually already exists in another component of the phyloseq object. To make this change, you can use the following line of code:

taxa_names(phy) <- paste0("SV_", 1:length(taxa_names(phy))) ## Paste, without separators, the prefix "SV_" to numbers ranged 1 through the length of the taxa names string from my phyloseq object

2. Alpha diversity

To calculate alpha diversity for each of our samples, we use the following calculations.

library(vegan)
## Loading required package: permute
## Loading required package: lattice
## This is vegan 2.5-7
otu_table <- t(otu_table(phy)) ## This dataset needed a transpose function (making the rows the columns and vice versa); If I don't use it, I get a "replacement has XXX rows, data has YYY" error
sample_data(phy)$equitability <- diversity(otu_table)/log(specnumber(otu_table)) ## Pielou's Evenness
sample_data(phy)$chao1 <- t(estimateR(otu_table))[,2] ## Chao1 richness
sample_data(phy)$PD_whole_tree <- picante::pd(otu_table, phy_tree(phy), include.root = TRUE)[,1] ## Faith's Phylogenetic Diversity

We can then relate alpha diversity to our sample data using linear models. We do not need linear mixed effects models here because we are examining one timepoint.

summary(lm(chao1 ~ ageStool, data=data.frame(sample_data(phy))))
## 
## Call:
## lm(formula = chao1 ~ ageStool, data = data.frame(sample_data(phy)))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -127.059  -41.623   -5.374   29.591  220.499 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 316.9914    12.1544  26.080   <2e-16 ***
## ageStool      0.6421     0.2769   2.319    0.022 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 59.45 on 128 degrees of freedom
## Multiple R-squared:  0.04033,    Adjusted R-squared:  0.03283 
## F-statistic: 5.379 on 1 and 128 DF,  p-value: 0.02196

This output tells us that there is a significant (P=0.022) relationship between richness and the age, in days, of the stool. Specifically, for every day older the child is, there is a 0.64 increase in richness.

If we wished to plot this relationship, we could do so using ggplot2:

library(ggplot2)
ggplot(data.frame(sample_data(phy)), aes(x=ageStool, y=chao1)) +
  geom_point() +
  geom_smooth(method="lm")
## `geom_smooth()` using formula 'y ~ x'

We can also examine categorical relationships in a similar manner:

summary(lm(chao1 ~ month_group, data=data.frame(sample_data(phy))))
## 
## Call:
## lm(formula = chao1 ~ month_group, data = data.frame(sample_data(phy)))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -127.493  -36.169   -2.226   33.273  223.920 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    337.33       6.15  54.847   <2e-16 ***
## month_groupB    13.95      13.13   1.063    0.290    
## month_groupC    32.65      27.64   1.181    0.240    
## month_groupD    28.56      60.57   0.471    0.638    
## month_groupE    97.67      60.57   1.612    0.109    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 60.26 on 125 degrees of freedom
## Multiple R-squared:  0.03718,    Adjusted R-squared:  0.006367 
## F-statistic: 1.207 on 4 and 125 DF,  p-value: 0.3114

Here, months were grouped into alphabetical categories, and each P-value represents the unit-change between Group A and each other group. Here, there is no significant difference between Group A and each of the other groups. We can also test if the groups are different from each other (one p-value) with an analysis of Variance (ANOVA) test as shown here:

anova(lm(chao1 ~ month_group, data=data.frame(sample_data(phy))))
## Analysis of Variance Table
## 
## Response: chao1
##              Df Sum Sq Mean Sq F value Pr(>F)
## month_group   4  17527  4381.8  1.2066 0.3114
## Residuals   125 453924  3631.4

A minimal plot of this relationship:

ggplot(data.frame(sample_data(phy)), aes(x=month_group, y=chao1, fill=month_group)) + 
  geom_boxplot() +
  geom_point(shape=21, size=3)

3. Beta diversity Relationships

Next, we will take a look at differences in composition between samples. To do so, we will create an ordination of all samples in our Phyloseq object and plot them. We can generate Bray Curtis (“bray”) and Canberra (“canberra”) distance matrices. And assuming we have a phylogenetic tree in our phyloseq object, we can also calculate Weighted UniFrac (“wunifrac”) and Unweighted UniFrac (“unifrac”) distance matrices.

age_pcoa <- ordinate(phy, method="PCoA", distance="bray")
plot_ordination(phy, age_pcoa, type="samples", color="month")

From our plot, we do not see any immediate distinctions in composition between months. However, to test this statistically, we use the adonis2 function from the R package vegan.

If not installed:

install.packages("vegan")
library(vegan)

To conduct a PERMANOVA test and determine if beta diversity significantly differs across our groups (here, months of age), we first create a distance matrix with our data:

dm_age <- phyloseq::distance(phy, method = "bray")

To statistically test if our variable is indeed significant:

#for reproducibility purposes (set.seed needs to be run at the exact same time as the adonis function in order to get the exact same result)
set.seed(123)
vegan::adonis2(dm_age ~ month, data = data.frame(phy@sam_data))
## Permutation test for adonis under reduced model
## Terms added sequentially (first to last)
## Permutation: free
## Number of permutations: 999
## 
## vegan::adonis2(formula = dm_age ~ month, data = data.frame(phy@sam_data))
##           Df SumOfSqs      R2      F Pr(>F)
## month      4    1.590 0.03531 1.1438   0.17
## Residual 125   43.441 0.96469              
## Total    129   45.031 1.00000

We look at the line starting with month as this is our variable in question. The R2 (R-squared) value indicates the % variation explained by our variable, and the p-value denotes whether or not this result is significant (in this case, our samples do not significantly differ in composition when grouping by months of age).

If you plan to undertake multivariable analyses, you will need to include the by="margins" option in the adonis2 function. This will calculate R^2 and P-values that account for all other variables in the model (Type III Sum of Squares). The default by="terms" will calculate the R2 and P-values based only on accounting for variables above each variable (Type I Sum of Squares). In other words, if there are three variables in the model, the first variable’s results will be the relationship without considering the other two variables, the second variable’s results will only account for the previous variable, and the last variable will account for the other two variables.

Repeated Measures

If you have repeated measures in your data, you will be unable to use adonis2 to determine if a variable is associated with microbiota composition. Instead, you will need to obtain the axis values and use linear mixed effects models. This study doesn’t have repeated measures, but for the purpose of illustration, we will use this dataset and use simple linear models.

#for reproducibility purposes
set.seed(123)
age_pcoa <- ordinate(phy, method="PCoA", distance="bray")

pcoa_data <- merge(phy@sam_data, age_pcoa$vectors[,1:3], by=0)
# You can see that the Axis values have been added to our sample data:
head(pcoa_data)
##   Row.names X.SampleID ageStool DMM    Description month month_group
## 1       A02        A02       20   4  PPG_feces_A02     1           A
## 2       A05        A05       35   2  PPG_feces_A05     1           A
## 3       A06        A06       38   4  PPG_feces_A06     1           A
## 4       A07        A07       74   2  PPG_feces_A07     2           B
## 5      A101       A101       77   2 PPG_feces_A101     3           C
## 6      A106       A106       52   1 PPG_feces_A106     2           B
##   equitability    chao1 PD_whole_tree     Axis.1        Axis.2       Axis.3
## 1    0.4851637 319.2069      17.48600 -0.2929754 -0.1725481826 -0.174631337
## 2    0.3265574 379.5000      17.56041  0.3627402  0.0771376594 -0.077711683
## 3    0.4285474 329.3913      15.65793 -0.2864202 -0.0669919650 -0.212846851
## 4    0.4021263 364.4762      16.41518  0.3252046  0.0004981516 -0.046632159
## 5    0.3343220 280.3333      13.89997  0.4339586 -0.0252256065  0.005000593
## 6    0.4563690 394.0000      18.81986 -0.2240980  0.1155055941  0.067084269
summary(lm(Axis.1 ~ ageStool, data=pcoa_data))
## 
## Call:
## lm(formula = Axis.1 ~ ageStool, data = pcoa_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.33470 -0.22365 -0.06457  0.24416  0.45250 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)  0.0012927  0.0520799   0.025    0.980
## ageStool    -0.0000326  0.0011864  -0.027    0.978
## 
## Residual standard error: 0.2547 on 128 degrees of freedom
## Multiple R-squared:  5.899e-06,  Adjusted R-squared:  -0.007807 
## F-statistic: 0.0007551 on 1 and 128 DF,  p-value: 0.9781
summary(lm(Axis.2 ~ ageStool, data=pcoa_data))
## 
## Call:
## lm(formula = Axis.2 ~ ageStool, data = pcoa_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.44292 -0.12034  0.00076  0.10104  0.41315 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)  0.0523471  0.0394482   1.327    0.187
## ageStool    -0.0013201  0.0008986  -1.469    0.144
## 
## Residual standard error: 0.193 on 128 degrees of freedom
## Multiple R-squared:  0.01658,    Adjusted R-squared:  0.008898 
## F-statistic: 2.158 on 1 and 128 DF,  p-value: 0.1443
summary(lm(Axis.3 ~ ageStool, data=pcoa_data))
## 
## Call:
## lm(formula = Axis.3 ~ ageStool, data = pcoa_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.33670 -0.11108 -0.00691  0.11805  0.39254 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)  
## (Intercept) -0.0711617  0.0328016  -2.169   0.0319 *
## ageStool     0.0017946  0.0007472   2.402   0.0178 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1604 on 128 degrees of freedom
## Multiple R-squared:  0.04312,    Adjusted R-squared:  0.03565 
## F-statistic: 5.768 on 1 and 128 DF,  p-value: 0.01776

Here, we can see that the age (in days) of the stool relates to PC3 of the Bray Curtis distance matrix. The same analysis above, but using mixed effects models, would look like:

library(lmerTest)
summary(lmer(Axis.1 ~ ageStool + (1|StudyID), data=pcoa_data)) ## Note that StudyID does not exist in the data

Lab Code Relating All Variables to a Distance Matrix

A script is available to relate all sample data to your distance matrices. To use this script, you will need to save distance matrices to a file like below.

write.table(data.frame(as.matrix(phyloseq::distance(phy, method = "bray"))), "bray_curtis_dm.txt", sep="\t")

The script can be found here: /data/Users/kmccauley/LabCode/BetaDiversity/Adonis_All_Variables_20Jan19.R.

To use it, scroll to the bottom and fill in the needed information (distance matrix, sample information, output file, etc). Additional information about how to use this script is provided at the top of the code and in the README file (/data/Users/kmccauley/LabCode/README). Once the function is filled out, you can run the script using: Rscript Adonis_All_Variables_20Jan19.R.

THE FIRST TIME YOU RUN THIS SCRIPT IT WILL BREAK. This is because it prints out a file called Variable_Table.txt. View this file and ensure that R is reading your script as intended, and if you need to make any modifications, do so. This can include removing any variables that you would prefer not to analyze (ie, SampleID). Then, when you re-run the Rscript command, it will list out every variable it is analyzing, confirming the variable type (ie, factor, numeric, etc) as it analyzes them. This script will also print out preliminary Principal Coordinate Analysis (PCoA) plots for all variables.

This script will also identify where you have repeated measures (hence why it asks for Sample ID as well as Study/Participant ID) and if found, will include analysis of Axis 1 and Axis 2 using linear mixed-effects models.

4. Differential Taxon (SV) Abundance Analysis

We can identify taxa that differentiate between groups using either the “Many-Model” approach or DESeq2. Both use count-based methods, but take different approaches.

A. Many Model Script

This method analyzes each OTU/SV using an assortment of several count-based models including: Poissson, Negative Binomial and Zero-Inflated Negative Binomial models. The fit of each of these models is compared, and the model that fits the data best is considered the “winning” model from which the estimate and the p-value are derived. This allows the data to determine which model fits the data best. As written, this script can analyze data with repeated measures, but currently needs rarefied data to make appropriate conclusions.

The script for this analysis can be found in: /data/Users/kmccauley/LabCode/SigTaxa/ManyModelScript.R. This script has been developed to take either an OTU table and data file, or a phyloseq object, so

Katie is working on a follow-up script to this analysis, and more details will be provided when it is closer to being ready.

B. DESeq2 Approach

An alternate method is DESeq2. This method uses a modified Negative Binomial model on non-rarefied data to identify taxa differentially-abundant between groups. It’s preferred input is non-rarefied, non-transformed data, as it implements the Variance Stabilized Transformation as described above. Katie has tried DESeq with both rarefied and non-rarefied data, and has discussed it with other statisticians. Results with rarefied/non-rarefied data were similar, and the statisticians said rarefied data should not impact model-fit based on the statistical methods being implemented. You can also use DESeq for continuous variables. However, DESeq cannot be used for data with repeated measures.

Wrapper functions have been developed by Katie to make using DESeq2 easier.

## Load in the wrapper functions
library(DESeq2)
print(getwd())
## [1] "/Users/kmccauley/Box/processing-docs/assets"
source("../../lab-code/SigTaxa/DESeqFunctions.R") ## You might need to hunt this down in your respective environment. But wherever you find the lab code, you should find a sub-directory called SigTaxa.

## Subsetting to the two groups I want to analyze:
ngm_data_grpAB <- subset_samples(phy, month_group %in% c("A","B"))

## Removing low-prevalence taxa:
ngm_data_grpAB_f <- filter_taxa(ngm_data_grpAB, function(x) sum(x > 0) > (0.2*length(x)), TRUE)
dds <- phyloseq_to_deseq2(ngm_data_grpAB_f, ~month_group)
## Warning in DESeqDataSet(se, design = design, ignoreRank): some variables in
## design formula are characters, converting to factors
dds2 <- DESeq(dds, fitType="local", sfType="poscounts")
## Two wrapper functions to print out the results of DESeq, as well as generate a volcano plot.
print_res(dds2, ngm_data_grpAB_f, alpha=0.01, var="month_group",ref="A",cont="B", sig.only=TRUE)
##         OTUname    baseMean log2FoldChange     lfcSE      stat       pvalue
## 1   OTU_4436046   56.559587      -5.631804 0.8918776 -6.314548 2.709526e-10
## 2   OTU_4416113   84.220369      -5.312729 0.7884558 -6.738144 1.604221e-11
## 3   OTU_4306262   12.229104      -5.234182 1.4373188 -3.641629 2.709186e-04
## 4   OTU_1846390    7.963263      -5.230130 1.3093861 -3.994337 6.487543e-05
## 5    OTU_851733    8.176479      -5.156444 1.2550599 -4.108524 3.981951e-05
## 6    OTU_259772   10.214898      -5.129374 1.1926722 -4.300741 1.702279e-05
## 7    OTU_182874   30.919137      -5.088010 1.2920954 -3.937798 8.223288e-05
## 8   OTU_4433823  126.832435      -4.994828 1.1491945 -4.346373 1.384070e-05
## 9   OTU_4464445   10.405667      -4.992402 1.2907358 -3.867873 1.097888e-04
## 10  OTU_4343627 7001.874400      -4.969519 0.8189864 -6.067890 1.296018e-09
## 11   OTU_192308    6.111008      -4.957759 0.8851167 -5.601248 2.128134e-08
## 12   OTU_191361   36.630689      -4.793991 0.8068438 -5.941659 2.821516e-09
## 13  OTU_3244896   46.213041      -4.711654 0.9895148 -4.761580 1.920833e-06
## 14  OTU_4329112    6.786985      -4.661969 1.3792324 -3.380118 7.245462e-04
## 18   OTU_315982   24.941107      -4.477678 1.1147041 -4.016921 5.896353e-05
## 19   OTU_553611 1069.544229      -4.296237 0.7959401 -5.397689 6.750488e-08
## 20  OTU_4381553  302.471101      -4.245171 0.8558998 -4.959892 7.053244e-07
## 21   OTU_193528   53.139797      -4.217688 0.8915581 -4.730693 2.237544e-06
## 25  OTU_4401580   31.366964      -3.463016 0.9494178 -3.647516 2.647879e-04
## 33   OTU_170652  264.027948      -3.097130 0.8987530 -3.446030 5.688875e-04
## 46  OTU_4335781   38.578220      -2.647693 0.7139778 -3.708368 2.085990e-04
## 48   OTU_137056    7.331480      -2.625679 0.7666072 -3.425064 6.146535e-04
## 406  OTU_170462    1.131276       2.111737 0.6178248  3.418020 6.307851e-04
## 412 OTU_4480189    2.089340       2.858249 0.6546802  4.365870 1.266178e-05
## 416  OTU_289734  735.437681       3.073925 0.8136809  3.777802 1.582188e-04
## 418  OTU_179400 1121.858944       3.655439 1.0122313  3.611268 3.047031e-04
## 420 OTU_3327894   23.622946       3.904229 1.0207794  3.824753 1.309033e-04
## 421  OTU_292735   46.818392       3.987236 1.0557270  3.776768 1.588766e-04
## 422  OTU_544419    4.417782       4.531388 1.2788739  3.543264 3.952067e-04
## 423  OTU_359954   17.863115       5.029689 0.6216430  8.090959 5.919659e-16
## 424 OTU_1100972   39.479231       5.377540 0.8796274  6.113429 9.751281e-10
##             padj comparison ref cont  Kingdom          Phylum
## 1   3.829463e-08     A_vs_B   A    B Bacteria      Firmicutes
## 2   3.400949e-09     A_vs_B   A    B Bacteria  Proteobacteria
## 3   4.594780e-03     A_vs_B   A    B Bacteria Verrucomicrobia
## 4   1.618070e-03     A_vs_B   A    B Bacteria      Firmicutes
## 5   1.125565e-03     A_vs_B   A    B Bacteria      Firmicutes
## 6   5.155475e-04     A_vs_B   A    B Bacteria      Firmicutes
## 7   1.937041e-03     A_vs_B   A    B Bacteria   Bacteroidetes
## 8   4.514198e-04     A_vs_B   A    B Bacteria   Bacteroidetes
## 9   2.450024e-03     A_vs_B   A    B Bacteria      Firmicutes
## 10  1.099023e-07     A_vs_B   A    B Bacteria   Bacteroidetes
## 11  1.289041e-06     A_vs_B   A    B Bacteria      Firmicutes
## 12  1.993871e-07     A_vs_B   A    B Bacteria      Firmicutes
## 13  8.144333e-05     A_vs_B   A    B Bacteria      Firmicutes
## 14  9.909923e-03     A_vs_B   A    B Bacteria   Bacteroidetes
## 18  1.562534e-03     A_vs_B   A    B Bacteria      Firmicutes
## 19  3.577759e-06     A_vs_B   A    B Bacteria  Actinobacteria
## 20  3.322862e-05     A_vs_B   A    B Bacteria   Bacteroidetes
## 21  8.624715e-05     A_vs_B   A    B Bacteria   Bacteroidetes
## 25  4.594780e-03     A_vs_B   A    B Bacteria   Bacteroidetes
## 33  8.614583e-03     A_vs_B   A    B Bacteria      Firmicutes
## 46  3.845477e-03     A_vs_B   A    B Bacteria  Actinobacteria
## 48  8.915096e-03     A_vs_B   A    B Bacteria      Firmicutes
## 406 8.915096e-03     A_vs_B   A    B Bacteria      Firmicutes
## 412 4.473827e-04     A_vs_B   A    B Bacteria      Firmicutes
## 416 3.061986e-03     A_vs_B   A    B Bacteria      Firmicutes
## 418 4.969005e-03     A_vs_B   A    B Bacteria      Firmicutes
## 420 2.775151e-03     A_vs_B   A    B Bacteria   Bacteroidetes
## 421 3.061986e-03     A_vs_B   A    B Bacteria      Firmicutes
## 422 6.206208e-03     A_vs_B   A    B Bacteria      Firmicutes
## 423 2.509935e-13     A_vs_B   A    B Bacteria      Firmicutes
## 424 1.033636e-07     A_vs_B   A    B Bacteria      Firmicutes
##                   Class              Order              Family           Genus
## 1            Clostridia      Clostridiales     Lachnospiraceae           Dorea
## 2   Gammaproteobacteria  Enterobacteriales  Enterobacteriaceae        Serratia
## 3      Verrucomicrobiae Verrucomicrobiales Verrucomicrobiaceae     Akkermansia
## 4            Clostridia      Clostridiales      Clostridiaceae  Clostridiaceae
## 5               Bacilli    Lactobacillales    Lactobacillaceae   Lactobacillus
## 6            Clostridia      Clostridiales     Lachnospiraceae     Coprococcus
## 7           Bacteroidia      Bacteroidales      Bacteroidaceae     Bacteroides
## 8           Bacteroidia      Bacteroidales      Bacteroidaceae     Bacteroides
## 9            Clostridia      Clostridiales     Lachnospiraceae           Dorea
## 10          Bacteroidia      Bacteroidales      Bacteroidaceae     Bacteroides
## 11           Clostridia      Clostridiales     Lachnospiraceae                
## 12           Clostridia      Clostridiales     Lachnospiraceae                
## 13              Bacilli    Lactobacillales    Streptococcaceae   Streptococcus
## 14          Bacteroidia      Bacteroidales      Bacteroidaceae     Bacteroides
## 18           Clostridia      Clostridiales      Clostridiaceae     Clostridium
## 19       Actinobacteria  Bifidobacteriales  Bifidobacteriaceae Bifidobacterium
## 20          Bacteroidia      Bacteroidales      Bacteroidaceae     Bacteroides
## 21          Bacteroidia      Bacteroidales      Bacteroidaceae     Bacteroides
## 25          Bacteroidia      Bacteroidales      Bacteroidaceae     Bacteroides
## 33           Clostridia      Clostridiales      Clostridiaceae     Clostridium
## 46       Actinobacteria  Bifidobacteriales  Bifidobacteriaceae Bifidobacterium
## 48              Bacilli         Bacillales   Staphylococcaceae  Staphylococcus
## 406          Clostridia      Clostridiales     Lachnospiraceae  [Ruminococcus]
## 412             Bacilli    Lactobacillales    Lactobacillaceae   Lactobacillus
## 416          Clostridia      Clostridiales     Lachnospiraceae         Blautia
## 418          Clostridia      Clostridiales     Lachnospiraceae  [Ruminococcus]
## 420         Bacteroidia      Bacteroidales      Bacteroidaceae     Bacteroides
## 421          Clostridia      Clostridiales     Lachnospiraceae         Blautia
## 422             Bacilli    Lactobacillales    Streptococcaceae     Lactococcus
## 423          Clostridia      Clostridiales     Veillonellaceae     Veillonella
## 424             Bacilli    Lactobacillales    Streptococcaceae     Lactococcus
##         Species
## 1              
## 2    marcescens
## 3   muciniphila
## 4          <NA>
## 5       reuteri
## 6              
## 7        ovatus
## 8      fragilis
## 9              
## 10     fragilis
## 11             
## 12             
## 13     luteciae
## 14     fragilis
## 18  perfringens
## 19             
## 20             
## 21       caccae
## 25         <NA>
## 33  perfringens
## 46             
## 48         <NA>
## 406            
## 412        zeae
## 416            
## 418            
## 420   uniformis
## 421            
## 422            
## 423            
## 424
plot_res(dds2, ngm_data_grpAB_f, alpha=0.01, var="month_group",ref="A",cont="B")
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.

These plots may not be perfect for your needs, but will help provide a brief overview of the most-important taxa.

5. Relative abundance plots

Now, we may want to look at how our taxa are distributed across our samples.

plot_bar(phy)

At first glance, this is not entirely informative. We can instead plot by a taxonomic classification, or by treatment group. First, we will group all of our taxa into Family labels.

Note that taxrank will take any classification present in your taxonomy, but with higher classifications comes less resolution of your data.

ngm_abundance <- tax_glom(phy, taxrank = "Family")

Next, we will group our samples by a variable of interest. Here, we combine our microbiome profiles according to age in months, which could also be considered a microbiota group/community state type. We also want our counts in our table to be relative counts, rather than absolute counts, so we will take care of that too.

ngm_abundance <- merge_samples(ngm_abundance, "month_group")

#every count is now transformed into a proportion
ngm_abundance <- transform_sample_counts(ngm_abundance, function(x) x / sum(x))

Now, if we create our abundance plot, grouping by month and coloring by Family classification, we get a more informative result.

stacked_barplot <- plot_bar(ngm_abundance, x = "month", fill = "Family")

stacked_barplot

#cleaning up our plot a bit
stacked_barplot <- stacked_barplot + 
  geom_bar(aes(fill=Family), stat="identity", position="stack") + 
  #changing the x-axis label
  xlab("Age (months)") + 
  #changing the y-axis label
  ylab("Relative Abundance (%)") + 
  #changing our y axis to a percentage, rather than a decimal scale
  scale_y_continuous(labels = scales::percent) + 
  #italicizing our Family labels, and rotating x-axis labels to be properly oriented
  theme(legend.text = element_text(face="italic"), axis.text.x = element_text(angle=0,hjust=0.5))

#view our new and improved plot
stacked_barplot

Now let’s say that you want to only visualize the top 10 families in your data. In order to do this, I loosely follow some ideas in this page, depending on my needs.

ngm_abundance10 <- ngm_abundance
families <- names(sort(taxa_sums(ngm_abundance10), decreasing = T)[1:10]) ## Get top 10 families
tax_table(ngm_abundance10)[,"Family"][!rownames(tax_table(ngm_abundance10)) %in% families] <- "Other" ## When the family is NOT (see: !) part of the top 10 families, call it "Other"
plot_bar(ngm_abundance10, fill="Family")

## If you don't like the order of things (especially with "Other" in the middle of Family names), you'll need to use regular ggplot, which translates nicely with a phyloseq "melting" function (making your dataset into a "long" dataset) for more customizability:
ps <- psmelt(ngm_abundance10)
ps$Family2 <- factor(ps$Family, levels=c(unique(ps$Family)[!unique(ps$Family) %in% "Other"], "Other")) #relevels "Other" so it's at the end. The logic is a little complicated, so feel free to reach out if you have questions. But basically(ish), it would be dictated as "get the levels of the Family variable when those levels *aren't* Other, then ADD Other as a category at the end, and use this order to reorder our family variable into a new variable called Family2".

table(ps$Family, ps$Family2) ## As a sanity check, you can do something like this to confirm that your Bifidobacterium is still called Bifidobacterium with the reordered variable.
##                      
##                       Bifidobacteriaceae Lachnospiraceae Enterobacteriaceae
##   Bacteroidaceae                       0               0                  0
##   Bifidobacteriaceae                   5               0                  0
##   Clostridiaceae                       0               0                  0
##   Coriobacteriaceae                    0               0                  0
##   Enterobacteriaceae                   0               0                  5
##   Erysipelotrichaceae                  0               0                  0
##   Lachnospiraceae                      0               5                  0
##   Other                                0               0                  0
##   Porphyromonadaceae                   0               0                  0
##   Streptococcaceae                     0               0                  0
##   Veillonellaceae                      0               0                  0
##                      
##                       Erysipelotrichaceae Bacteroidaceae Clostridiaceae
##   Bacteroidaceae                        0              5              0
##   Bifidobacteriaceae                    0              0              0
##   Clostridiaceae                        0              0              5
##   Coriobacteriaceae                     0              0              0
##   Enterobacteriaceae                    0              0              0
##   Erysipelotrichaceae                   5              0              0
##   Lachnospiraceae                       0              0              0
##   Other                                 0              0              0
##   Porphyromonadaceae                    0              0              0
##   Streptococcaceae                      0              0              0
##   Veillonellaceae                       0              0              0
##                      
##                       Coriobacteriaceae Veillonellaceae Porphyromonadaceae
##   Bacteroidaceae                      0               0                  0
##   Bifidobacteriaceae                  0               0                  0
##   Clostridiaceae                      0               0                  0
##   Coriobacteriaceae                   5               0                  0
##   Enterobacteriaceae                  0               0                  0
##   Erysipelotrichaceae                 0               0                  0
##   Lachnospiraceae                     0               0                  0
##   Other                               0               0                  0
##   Porphyromonadaceae                  0               0                  5
##   Streptococcaceae                    0               0                  0
##   Veillonellaceae                     0               5                  0
##                      
##                       Streptococcaceae Other
##   Bacteroidaceae                     0     0
##   Bifidobacteriaceae                 0     0
##   Clostridiaceae                     0     0
##   Coriobacteriaceae                  0     0
##   Enterobacteriaceae                 0     0
##   Erysipelotrichaceae                0     0
##   Lachnospiraceae                    0     0
##   Other                              0   160
##   Porphyromonadaceae                 0     0
##   Streptococcaceae                   5     0
##   Veillonellaceae                    0     0
## Since we're here, I'll also show how to set up a custom color palette
library(RColorBrewer)
## Get 10 colors, plus grey for "Other"
custom_colors <- c(brewer.pal(10, "Set3"), "darkgrey") ## To see all default color palettes and how many colors they have, use `display.brewer.all()` in the console.
names(custom_colors) <- levels(ps$Family2) ## Assign the levels of our Family varible to the names of our vector of colors

#Starts with your typical ggplot setup
ggplot(ps, aes(x=month,y=Abundance, fill=Family2)) +
  # Sets up the common pieces of the bar plot
  geom_bar(stat="identity", position="stack", color="black") +
  # Asks the plot to use our color scheme, and changes the name from Family2 to Family
  scale_fill_manual("Family", values=custom_colors)

Now we can quickly see a summary of the profiles associated with different months

Further reading and resources

DADA2 Tutorial

A 16s analysis pipeline tutorial written by Jordan Bisanz

Phyloseq Website

Some fun uses of phyloseq, which might give you some ideas too!

Acknowlegements

Katie would like to give many thanks to Danny Li who helped develop the initial version of the analysis pipeline, as well as the several members of the Lynch Lab and BCMM who provided invaluable constructive feedback along the way.