To highlight several of the concepts discussed in this tutorial document, we will be using an abridged dataset from a previous lab member, Kei Fujimura’s study Neonatal gut microbiota associates with childhood multisensitized atopy and T cell differentiation. The sample data and OTU table can be found in /data/Users/dli2/DOC
. Please note that the OTU table has been processed through all of the applicable steps shown in the processing document (very important to do so prior to any analyses!).
All analyses moving forward have been run on Lynchserver2 for the development of this document. However, Wynton can also be used to analyze data. Katie highly encourages you to use one of these two servers, especially as datasets become large.
Use the scp command as described above, similar to what you needed to do with your mapping file. You will need to run from lynchserver2 and your command will look something like:
cd to where you want your data to go
scp {yourusername}@dt2.wynton.ucsf.edu:/wynton/group/lynch/{location of your data} .
The dot/period means to use the directory that you are currently in.
Katie has set up a nice interface for being able to run code in R, see plots as they’re generated, and access some files all within the same space. In order to use it on a Mac:
ssh -N -f -L localhost:8001:localhost:8787 {yourusername}@lynchserver2.ucsf.edu
localhost:8001
A note re: Windows machines, this works on your computer but use PowerShell instead. Discuss with Katie if you run into issues.
I have developed most of the analysis scripts to be found in three places: Lynchserver2, Wynton and GitHub. All locations are managed by github, so all versions should be similar if not identical.
Lynchserver2: /data/Users/kmccauley/LabCode/
Wynton: /wynton/group/lynch/kmccauley/LabCode/
The process here will include how to generate a phyloseq object from scratch; if you already have a basic phyloseq object to start analyzing, you can jump to the next section that will have some basic cleanup suggestions before moving into true analysis. Anyway, in order to start compiling individual pieces of data into a phyloseq object, we first need to load our data into R. Here, we load our sample data file. We also create some new variables for our data which will be used for grouping later on.
<- read.table("fujimura_sample_data.txt", header=T, check.names=F, comment="", sep="\t")
stool_data ## Typically, all four options at the end of read.table are included when reading in most tables and they are described here:
### header=T -> Data already has a header/variable names included
### check.names=F -> Sometimes R will reformat your variable names without your consent, and this keeps R from doing that
### comment="" -> You'll recall that some mapping files require the # symbol in front of SampleID, and this encourages R to not see your variable names as a commented line
### sep="\t" -> confirms that your data values are separated by a tab (which should be the case in almost all txt files). CSVs are comma-separated, but there's a whole read.csv function for those.
#creating a month variable for each participant (ages are in days right now)
$month <- round(stool_data$ageStool/30, 0)
stool_data$month <- as.factor(stool_data$month)
stool_data
$month_group <- ifelse(stool_data$month==1, "A",
stool_dataifelse(stool_data$month==2, "B",
ifelse(stool_data$month==3, "C",
ifelse(stool_data$month==4, "D", "E")
)))
Next, we load in our OTU table. We also do a bit of cleaning to our data (any steps beyond reading in your OTU table are likely not necessary, but is done here for the example dataset in particular).
<- read.delim("fujimura_rarefied_otutable.txt", check.names=F, sep="\t")
otu_data #giving our otus an otu label, rather than just numbers.
$`#OTUID` <- paste0("OTU_", otu_data$`#OTUID`)
otu_data#removing OTUs that are reference genomes as we don't need those (they have "reference" in the name so this line is selecting all rows of the OTU table that don't have "reference" in them)
<- otu_data[!grepl("Reference", otu_data$`#OTUID`), ] otu_data
Now, we will load Phyloseq, a useful package commonly used for microbial analyses.
Note: if these packages are not installed yet, you can run the code as shown here to install them.
source('http://bioconductor.org/biocLite.R')
biocLite('phyloseq')
install.packages("ggplot2")
#loading our installed packages in R.
library(phyloseq)
#for more plotting features later.
library(ggplot2)
Next, we prepare our data to be merged into a Phyloseq object.
#This is for phyloseq. phyloseq doesn't like when your OTU table has the taxonomy info attached so we're separating it for now.
#also doing some other cleaning up to make a phyloseq object
<- otu_data[, c("#OTUID", "taxonomy")]
taxonomy_data rownames(taxonomy_data) <- taxonomy_data$`#OTUID`
rownames(otu_data) <- otu_data$`#OTUID`
rownames(stool_data) <- stool_data$`#SampleID`
$`#OTUID` <- NULL
otu_data$taxonomy <- NULL otu_data
This function is to make the taxonomy table look nice, and for easier use later on. Note that there is no output in this code chunk, this is simply saving our function.
<- function(tax.dat) {
split.tax <- strsplit(as.character(tax.dat[,2]),"; ")
taxanames <- t(sapply(taxanames,
mat function(x,m) c(x,rep(NA,m-length(x))),
max(rapply(taxanames,length))))
<- gsub("_","",mat)
newnames <- as.matrix(newnames)
newnames colnames(newnames) <- c("Kingdom","Phylum","Class","Order","Family","Genus","Species")
row.names(newnames) <- tax.dat[,1]
6][newnames[,6] %in% c("") | is.na(newnames[,6])] <- newnames[,5][newnames[,6] %in% c("") | is.na(newnames[,6])]
newnames[,6][newnames[,6] %in% c("") | is.na(newnames[,6])] <- newnames[,4][newnames[,6] %in% c("") | is.na(newnames[,6])]
newnames[,return(newnames)
}
Now, we run the function on our taxonomy data to clean it up.
#use the function on our taxonomy table
<- split.tax(taxonomy_data)
taxonomy_data <- sub('.', '', taxonomy_data)
taxonomy_data <- as.data.frame(taxonomy_data) taxonomy_data
To view our new formatted taxonomy table:
head(taxonomy_data)
## Kingdom Phylum Class Order
## OTU_100497 Bacteria Proteobacteria Gammaproteobacteria Enterobacteriales
## OTU_1010113 Bacteria Proteobacteria Gammaproteobacteria Enterobacteriales
## OTU_1012948 Bacteria Actinobacteria Actinobacteria Actinomycetales
## OTU_1023075 Bacteria Firmicutes Clostridia Clostridiales
## OTU_103166 Bacteria Proteobacteria Gammaproteobacteria Enterobacteriales
## OTU_1033413 Bacteria Firmicutes Bacilli Lactobacillales
## Family Genus Species
## OTU_100497 Enterobacteriaceae Enterobacteriaceae <NA>
## OTU_1010113 Enterobacteriaceae Enterobacteriaceae <NA>
## OTU_1012948 Corynebacteriaceae Corynebacterium
## OTU_1023075 Veillonellaceae Veillonella
## OTU_103166 Enterobacteriaceae Citrobacter
## OTU_1033413 Enterococcaceae Enterococcus
We see that our classifications have been separated into their respective hierarchies for each OTU in our data.
Finally, we read in our phylogenetic tree.
<- read_tree("rep_set_aligned_pfiltered_041614.tre")
phytree taxa_names(phytree) <- paste0("OTU_", taxa_names(phytree)) ## Needed for this particular dataset
<- ape::root(phytree, 1, resolve.root=TRUE) ## For most analyses, a rooted tree is required. This command roots the tree. If phylogenetic analyses appear "off", you may need to change the `1` value to another number. phytree
Now, we make a Phyloseq object upon which we will conduct our preliminary analyses.
#making a phyloseq object. phyloseq takes a variety of datasets, but here we are giving it our sample data, OTU table, and taxonomy data.
<- phyloseq(sample_data(stool_data), otu_table(as.matrix(otu_data), taxa_are_rows = T), tax_table(as.matrix(taxonomy_data)), phy_tree(phytree)) phy
To see the contents of our phyloseq object:
phy
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 858 taxa and 130 samples ]
## sample_data() Sample Data: [ 130 samples by 6 sample variables ]
## tax_table() Taxonomy Table: [ 858 taxa by 7 taxonomic ranks ]
## phy_tree() Phylogenetic Tree: [ 858 tips and 857 internal nodes ]
As you can see, we have 130 samples, 858 taxa, and 6 variables in our sample data.
You can also import the phyloseq object using the readRDS
function like so:
phy <- readRDS("phyloseq_noneg.rds") ## load
saveRDS(phy, "saving_my_phyloseq.rds") ## example of saving the file
You can save any object that lives in R as an RDS file, but not all RDS files you encounter will be phyloseq objects. Just a word of warning.
There are a couple of items that are covered above that will also be helpful here. Specifically:
A “rooted” tree is necessary for Weighted and Unweighted UniFrac analyses, but the tree is not naturally rooted, so we run the following to make that happen.
phy_tree(phy) <- ape::root(phy_tree(phy), 1, resolve.root=TRUE)
When you get your phyloseq object from the DADA2 pipeline, your sequence variants will be the actual 253-ish bp (if you’ve sequenced the V4 region) sequence, which while nice to have, it makes some figures messy, taxa difficult to discriminate, and actually already exists in another component of the phyloseq object. To make this change, you can use the following line of code:
taxa_names(phy) <- paste0("SV_", 1:length(taxa_names(phy))) ## Paste, without separators, the prefix "SV_" to numbers ranged 1 through the length of the taxa names string from my phyloseq object
To calculate alpha diversity for each of our samples, we use the following calculations.
library(vegan)
## Loading required package: permute
## Loading required package: lattice
## This is vegan 2.5-7
<- t(otu_table(phy)) ## This dataset needed a transpose function (making the rows the columns and vice versa); If I don't use it, I get a "replacement has XXX rows, data has YYY" error
otu_table sample_data(phy)$equitability <- diversity(otu_table)/log(specnumber(otu_table)) ## Pielou's Evenness
sample_data(phy)$chao1 <- t(estimateR(otu_table))[,2] ## Chao1 richness
sample_data(phy)$PD_whole_tree <- picante::pd(otu_table, phy_tree(phy), include.root = TRUE)[,1] ## Faith's Phylogenetic Diversity
We can then relate alpha diversity to our sample data using linear models. We do not need linear mixed effects models here because we are examining one timepoint.
summary(lm(chao1 ~ ageStool, data=data.frame(sample_data(phy))))
##
## Call:
## lm(formula = chao1 ~ ageStool, data = data.frame(sample_data(phy)))
##
## Residuals:
## Min 1Q Median 3Q Max
## -127.059 -41.623 -5.374 29.591 220.499
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 316.9914 12.1544 26.080 <2e-16 ***
## ageStool 0.6421 0.2769 2.319 0.022 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 59.45 on 128 degrees of freedom
## Multiple R-squared: 0.04033, Adjusted R-squared: 0.03283
## F-statistic: 5.379 on 1 and 128 DF, p-value: 0.02196
This output tells us that there is a significant (P=0.022) relationship between richness and the age, in days, of the stool. Specifically, for every day older the child is, there is a 0.64 increase in richness.
If we wished to plot this relationship, we could do so using ggplot2:
library(ggplot2)
ggplot(data.frame(sample_data(phy)), aes(x=ageStool, y=chao1)) +
geom_point() +
geom_smooth(method="lm")
## `geom_smooth()` using formula 'y ~ x'
We can also examine categorical relationships in a similar manner:
summary(lm(chao1 ~ month_group, data=data.frame(sample_data(phy))))
##
## Call:
## lm(formula = chao1 ~ month_group, data = data.frame(sample_data(phy)))
##
## Residuals:
## Min 1Q Median 3Q Max
## -127.493 -36.169 -2.226 33.273 223.920
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 337.33 6.15 54.847 <2e-16 ***
## month_groupB 13.95 13.13 1.063 0.290
## month_groupC 32.65 27.64 1.181 0.240
## month_groupD 28.56 60.57 0.471 0.638
## month_groupE 97.67 60.57 1.612 0.109
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 60.26 on 125 degrees of freedom
## Multiple R-squared: 0.03718, Adjusted R-squared: 0.006367
## F-statistic: 1.207 on 4 and 125 DF, p-value: 0.3114
Here, months were grouped into alphabetical categories, and each P-value represents the unit-change between Group A and each other group. Here, there is no significant difference between Group A and each of the other groups. We can also test if the groups are different from each other (one p-value) with an analysis of Variance (ANOVA) test as shown here:
anova(lm(chao1 ~ month_group, data=data.frame(sample_data(phy))))
## Analysis of Variance Table
##
## Response: chao1
## Df Sum Sq Mean Sq F value Pr(>F)
## month_group 4 17527 4381.8 1.2066 0.3114
## Residuals 125 453924 3631.4
A minimal plot of this relationship:
ggplot(data.frame(sample_data(phy)), aes(x=month_group, y=chao1, fill=month_group)) +
geom_boxplot() +
geom_point(shape=21, size=3)
Next, we will take a look at differences in composition between samples. To do so, we will create an ordination of all samples in our Phyloseq object and plot them. We can generate Bray Curtis (“bray”) and Canberra (“canberra”) distance matrices. And assuming we have a phylogenetic tree in our phyloseq object, we can also calculate Weighted UniFrac (“wunifrac”) and Unweighted UniFrac (“unifrac”) distance matrices.
<- ordinate(phy, method="PCoA", distance="bray")
age_pcoa plot_ordination(phy, age_pcoa, type="samples", color="month")
From our plot, we do not see any immediate distinctions in composition between months. However, to test this statistically, we use the adonis2
function from the R package vegan
.
If not installed:
install.packages("vegan")
library(vegan)
To conduct a PERMANOVA test and determine if beta diversity significantly differs across our groups (here, months of age), we first create a distance matrix with our data:
<- phyloseq::distance(phy, method = "bray") dm_age
To statistically test if our variable is indeed significant:
#for reproducibility purposes (set.seed needs to be run at the exact same time as the adonis function in order to get the exact same result)
set.seed(123)
::adonis2(dm_age ~ month, data = data.frame(phy@sam_data)) vegan
## Permutation test for adonis under reduced model
## Terms added sequentially (first to last)
## Permutation: free
## Number of permutations: 999
##
## vegan::adonis2(formula = dm_age ~ month, data = data.frame(phy@sam_data))
## Df SumOfSqs R2 F Pr(>F)
## month 4 1.590 0.03531 1.1438 0.17
## Residual 125 43.441 0.96469
## Total 129 45.031 1.00000
We look at the line starting with month
as this is our variable in question. The R2 (R-squared) value indicates the % variation explained by our variable, and the p-value denotes whether or not this result is significant (in this case, our samples do not significantly differ in composition when grouping by months of age).
If you plan to undertake multivariable analyses, you will need to include the by="margins"
option in the adonis2 function. This will calculate R^2 and P-values that account for all other variables in the model (Type III Sum of Squares). The default by="terms"
will calculate the R2 and P-values based only on accounting for variables above each variable (Type I Sum of Squares). In other words, if there are three variables in the model, the first variable’s results will be the relationship without considering the other two variables, the second variable’s results will only account for the previous variable, and the last variable will account for the other two variables.
If you have repeated measures in your data, you will be unable to use adonis2
to determine if a variable is associated with microbiota composition. Instead, you will need to obtain the axis values and use linear mixed effects models. This study doesn’t have repeated measures, but for the purpose of illustration, we will use this dataset and use simple linear models.
#for reproducibility purposes
set.seed(123)
<- ordinate(phy, method="PCoA", distance="bray")
age_pcoa
<- merge(phy@sam_data, age_pcoa$vectors[,1:3], by=0)
pcoa_data # You can see that the Axis values have been added to our sample data:
head(pcoa_data)
## Row.names X.SampleID ageStool DMM Description month month_group
## 1 A02 A02 20 4 PPG_feces_A02 1 A
## 2 A05 A05 35 2 PPG_feces_A05 1 A
## 3 A06 A06 38 4 PPG_feces_A06 1 A
## 4 A07 A07 74 2 PPG_feces_A07 2 B
## 5 A101 A101 77 2 PPG_feces_A101 3 C
## 6 A106 A106 52 1 PPG_feces_A106 2 B
## equitability chao1 PD_whole_tree Axis.1 Axis.2 Axis.3
## 1 0.4851637 319.2069 17.48600 -0.2929754 -0.1725481826 -0.174631337
## 2 0.3265574 379.5000 17.56041 0.3627402 0.0771376594 -0.077711683
## 3 0.4285474 329.3913 15.65793 -0.2864202 -0.0669919650 -0.212846851
## 4 0.4021263 364.4762 16.41518 0.3252046 0.0004981516 -0.046632159
## 5 0.3343220 280.3333 13.89997 0.4339586 -0.0252256065 0.005000593
## 6 0.4563690 394.0000 18.81986 -0.2240980 0.1155055941 0.067084269
summary(lm(Axis.1 ~ ageStool, data=pcoa_data))
##
## Call:
## lm(formula = Axis.1 ~ ageStool, data = pcoa_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.33470 -0.22365 -0.06457 0.24416 0.45250
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.0012927 0.0520799 0.025 0.980
## ageStool -0.0000326 0.0011864 -0.027 0.978
##
## Residual standard error: 0.2547 on 128 degrees of freedom
## Multiple R-squared: 5.899e-06, Adjusted R-squared: -0.007807
## F-statistic: 0.0007551 on 1 and 128 DF, p-value: 0.9781
summary(lm(Axis.2 ~ ageStool, data=pcoa_data))
##
## Call:
## lm(formula = Axis.2 ~ ageStool, data = pcoa_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.44292 -0.12034 0.00076 0.10104 0.41315
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.0523471 0.0394482 1.327 0.187
## ageStool -0.0013201 0.0008986 -1.469 0.144
##
## Residual standard error: 0.193 on 128 degrees of freedom
## Multiple R-squared: 0.01658, Adjusted R-squared: 0.008898
## F-statistic: 2.158 on 1 and 128 DF, p-value: 0.1443
summary(lm(Axis.3 ~ ageStool, data=pcoa_data))
##
## Call:
## lm(formula = Axis.3 ~ ageStool, data = pcoa_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.33670 -0.11108 -0.00691 0.11805 0.39254
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.0711617 0.0328016 -2.169 0.0319 *
## ageStool 0.0017946 0.0007472 2.402 0.0178 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1604 on 128 degrees of freedom
## Multiple R-squared: 0.04312, Adjusted R-squared: 0.03565
## F-statistic: 5.768 on 1 and 128 DF, p-value: 0.01776
Here, we can see that the age (in days) of the stool relates to PC3 of the Bray Curtis distance matrix. The same analysis above, but using mixed effects models, would look like:
library(lmerTest)
summary(lmer(Axis.1 ~ ageStool + (1|StudyID), data=pcoa_data)) ## Note that StudyID does not exist in the data
A script is available to relate all sample data to your distance matrices. To use this script, you will need to save distance matrices to a file like below.
write.table(data.frame(as.matrix(phyloseq::distance(phy, method = "bray"))), "bray_curtis_dm.txt", sep="\t")
The script can be found here: /data/Users/kmccauley/LabCode/BetaDiversity/Adonis_All_Variables_20Jan19.R
.
To use it, scroll to the bottom and fill in the needed information (distance matrix, sample information, output file, etc). Additional information about how to use this script is provided at the top of the code and in the README file (/data/Users/kmccauley/LabCode/README
). Once the function is filled out, you can run the script using: Rscript Adonis_All_Variables_20Jan19.R
.
THE FIRST TIME YOU RUN THIS SCRIPT IT WILL BREAK. This is because it prints out a file called Variable_Table.txt. View this file and ensure that R is reading your script as intended, and if you need to make any modifications, do so. This can include removing any variables that you would prefer not to analyze (ie, SampleID). Then, when you re-run the Rscript command, it will list out every variable it is analyzing, confirming the variable type (ie, factor, numeric, etc) as it analyzes them. This script will also print out preliminary Principal Coordinate Analysis (PCoA) plots for all variables.
This script will also identify where you have repeated measures (hence why it asks for Sample ID as well as Study/Participant ID) and if found, will include analysis of Axis 1 and Axis 2 using linear mixed-effects models.
We can identify taxa that differentiate between groups using either the “Many-Model” approach or DESeq2. Both use count-based methods, but take different approaches.
This method analyzes each OTU/SV using an assortment of several count-based models including: Poissson, Negative Binomial and Zero-Inflated Negative Binomial models. The fit of each of these models is compared, and the model that fits the data best is considered the “winning” model from which the estimate and the p-value are derived. This allows the data to determine which model fits the data best. As written, this script can analyze data with repeated measures, but currently needs rarefied data to make appropriate conclusions.
The script for this analysis can be found in: /data/Users/kmccauley/LabCode/SigTaxa/ManyModelScript.R
. This script has been developed to take either an OTU table and data file, or a phyloseq object, so
Katie is working on a follow-up script to this analysis, and more details will be provided when it is closer to being ready.
An alternate method is DESeq2. This method uses a modified Negative Binomial model on non-rarefied data to identify taxa differentially-abundant between groups. It’s preferred input is non-rarefied, non-transformed data, as it implements the Variance Stabilized Transformation as described above. Katie has tried DESeq with both rarefied and non-rarefied data, and has discussed it with other statisticians. Results with rarefied/non-rarefied data were similar, and the statisticians said rarefied data should not impact model-fit based on the statistical methods being implemented. You can also use DESeq for continuous variables. However, DESeq cannot be used for data with repeated measures.
Wrapper functions have been developed by Katie to make using DESeq2 easier.
## Load in the wrapper functions
library(DESeq2)
print(getwd())
## [1] "/Users/kmccauley/Box/processing-docs/assets"
source("../../lab-code/SigTaxa/DESeqFunctions.R") ## You might need to hunt this down in your respective environment. But wherever you find the lab code, you should find a sub-directory called SigTaxa.
## Subsetting to the two groups I want to analyze:
<- subset_samples(phy, month_group %in% c("A","B"))
ngm_data_grpAB
## Removing low-prevalence taxa:
<- filter_taxa(ngm_data_grpAB, function(x) sum(x > 0) > (0.2*length(x)), TRUE)
ngm_data_grpAB_f <- phyloseq_to_deseq2(ngm_data_grpAB_f, ~month_group) dds
## Warning in DESeqDataSet(se, design = design, ignoreRank): some variables in
## design formula are characters, converting to factors
<- DESeq(dds, fitType="local", sfType="poscounts")
dds2 ## Two wrapper functions to print out the results of DESeq, as well as generate a volcano plot.
print_res(dds2, ngm_data_grpAB_f, alpha=0.01, var="month_group",ref="A",cont="B", sig.only=TRUE)
## OTUname baseMean log2FoldChange lfcSE stat pvalue
## 1 OTU_4436046 56.559587 -5.631804 0.8918776 -6.314548 2.709526e-10
## 2 OTU_4416113 84.220369 -5.312729 0.7884558 -6.738144 1.604221e-11
## 3 OTU_4306262 12.229104 -5.234182 1.4373188 -3.641629 2.709186e-04
## 4 OTU_1846390 7.963263 -5.230130 1.3093861 -3.994337 6.487543e-05
## 5 OTU_851733 8.176479 -5.156444 1.2550599 -4.108524 3.981951e-05
## 6 OTU_259772 10.214898 -5.129374 1.1926722 -4.300741 1.702279e-05
## 7 OTU_182874 30.919137 -5.088010 1.2920954 -3.937798 8.223288e-05
## 8 OTU_4433823 126.832435 -4.994828 1.1491945 -4.346373 1.384070e-05
## 9 OTU_4464445 10.405667 -4.992402 1.2907358 -3.867873 1.097888e-04
## 10 OTU_4343627 7001.874400 -4.969519 0.8189864 -6.067890 1.296018e-09
## 11 OTU_192308 6.111008 -4.957759 0.8851167 -5.601248 2.128134e-08
## 12 OTU_191361 36.630689 -4.793991 0.8068438 -5.941659 2.821516e-09
## 13 OTU_3244896 46.213041 -4.711654 0.9895148 -4.761580 1.920833e-06
## 14 OTU_4329112 6.786985 -4.661969 1.3792324 -3.380118 7.245462e-04
## 18 OTU_315982 24.941107 -4.477678 1.1147041 -4.016921 5.896353e-05
## 19 OTU_553611 1069.544229 -4.296237 0.7959401 -5.397689 6.750488e-08
## 20 OTU_4381553 302.471101 -4.245171 0.8558998 -4.959892 7.053244e-07
## 21 OTU_193528 53.139797 -4.217688 0.8915581 -4.730693 2.237544e-06
## 25 OTU_4401580 31.366964 -3.463016 0.9494178 -3.647516 2.647879e-04
## 33 OTU_170652 264.027948 -3.097130 0.8987530 -3.446030 5.688875e-04
## 46 OTU_4335781 38.578220 -2.647693 0.7139778 -3.708368 2.085990e-04
## 48 OTU_137056 7.331480 -2.625679 0.7666072 -3.425064 6.146535e-04
## 406 OTU_170462 1.131276 2.111737 0.6178248 3.418020 6.307851e-04
## 412 OTU_4480189 2.089340 2.858249 0.6546802 4.365870 1.266178e-05
## 416 OTU_289734 735.437681 3.073925 0.8136809 3.777802 1.582188e-04
## 418 OTU_179400 1121.858944 3.655439 1.0122313 3.611268 3.047031e-04
## 420 OTU_3327894 23.622946 3.904229 1.0207794 3.824753 1.309033e-04
## 421 OTU_292735 46.818392 3.987236 1.0557270 3.776768 1.588766e-04
## 422 OTU_544419 4.417782 4.531388 1.2788739 3.543264 3.952067e-04
## 423 OTU_359954 17.863115 5.029689 0.6216430 8.090959 5.919659e-16
## 424 OTU_1100972 39.479231 5.377540 0.8796274 6.113429 9.751281e-10
## padj comparison ref cont Kingdom Phylum
## 1 3.829463e-08 A_vs_B A B Bacteria Firmicutes
## 2 3.400949e-09 A_vs_B A B Bacteria Proteobacteria
## 3 4.594780e-03 A_vs_B A B Bacteria Verrucomicrobia
## 4 1.618070e-03 A_vs_B A B Bacteria Firmicutes
## 5 1.125565e-03 A_vs_B A B Bacteria Firmicutes
## 6 5.155475e-04 A_vs_B A B Bacteria Firmicutes
## 7 1.937041e-03 A_vs_B A B Bacteria Bacteroidetes
## 8 4.514198e-04 A_vs_B A B Bacteria Bacteroidetes
## 9 2.450024e-03 A_vs_B A B Bacteria Firmicutes
## 10 1.099023e-07 A_vs_B A B Bacteria Bacteroidetes
## 11 1.289041e-06 A_vs_B A B Bacteria Firmicutes
## 12 1.993871e-07 A_vs_B A B Bacteria Firmicutes
## 13 8.144333e-05 A_vs_B A B Bacteria Firmicutes
## 14 9.909923e-03 A_vs_B A B Bacteria Bacteroidetes
## 18 1.562534e-03 A_vs_B A B Bacteria Firmicutes
## 19 3.577759e-06 A_vs_B A B Bacteria Actinobacteria
## 20 3.322862e-05 A_vs_B A B Bacteria Bacteroidetes
## 21 8.624715e-05 A_vs_B A B Bacteria Bacteroidetes
## 25 4.594780e-03 A_vs_B A B Bacteria Bacteroidetes
## 33 8.614583e-03 A_vs_B A B Bacteria Firmicutes
## 46 3.845477e-03 A_vs_B A B Bacteria Actinobacteria
## 48 8.915096e-03 A_vs_B A B Bacteria Firmicutes
## 406 8.915096e-03 A_vs_B A B Bacteria Firmicutes
## 412 4.473827e-04 A_vs_B A B Bacteria Firmicutes
## 416 3.061986e-03 A_vs_B A B Bacteria Firmicutes
## 418 4.969005e-03 A_vs_B A B Bacteria Firmicutes
## 420 2.775151e-03 A_vs_B A B Bacteria Bacteroidetes
## 421 3.061986e-03 A_vs_B A B Bacteria Firmicutes
## 422 6.206208e-03 A_vs_B A B Bacteria Firmicutes
## 423 2.509935e-13 A_vs_B A B Bacteria Firmicutes
## 424 1.033636e-07 A_vs_B A B Bacteria Firmicutes
## Class Order Family Genus
## 1 Clostridia Clostridiales Lachnospiraceae Dorea
## 2 Gammaproteobacteria Enterobacteriales Enterobacteriaceae Serratia
## 3 Verrucomicrobiae Verrucomicrobiales Verrucomicrobiaceae Akkermansia
## 4 Clostridia Clostridiales Clostridiaceae Clostridiaceae
## 5 Bacilli Lactobacillales Lactobacillaceae Lactobacillus
## 6 Clostridia Clostridiales Lachnospiraceae Coprococcus
## 7 Bacteroidia Bacteroidales Bacteroidaceae Bacteroides
## 8 Bacteroidia Bacteroidales Bacteroidaceae Bacteroides
## 9 Clostridia Clostridiales Lachnospiraceae Dorea
## 10 Bacteroidia Bacteroidales Bacteroidaceae Bacteroides
## 11 Clostridia Clostridiales Lachnospiraceae
## 12 Clostridia Clostridiales Lachnospiraceae
## 13 Bacilli Lactobacillales Streptococcaceae Streptococcus
## 14 Bacteroidia Bacteroidales Bacteroidaceae Bacteroides
## 18 Clostridia Clostridiales Clostridiaceae Clostridium
## 19 Actinobacteria Bifidobacteriales Bifidobacteriaceae Bifidobacterium
## 20 Bacteroidia Bacteroidales Bacteroidaceae Bacteroides
## 21 Bacteroidia Bacteroidales Bacteroidaceae Bacteroides
## 25 Bacteroidia Bacteroidales Bacteroidaceae Bacteroides
## 33 Clostridia Clostridiales Clostridiaceae Clostridium
## 46 Actinobacteria Bifidobacteriales Bifidobacteriaceae Bifidobacterium
## 48 Bacilli Bacillales Staphylococcaceae Staphylococcus
## 406 Clostridia Clostridiales Lachnospiraceae [Ruminococcus]
## 412 Bacilli Lactobacillales Lactobacillaceae Lactobacillus
## 416 Clostridia Clostridiales Lachnospiraceae Blautia
## 418 Clostridia Clostridiales Lachnospiraceae [Ruminococcus]
## 420 Bacteroidia Bacteroidales Bacteroidaceae Bacteroides
## 421 Clostridia Clostridiales Lachnospiraceae Blautia
## 422 Bacilli Lactobacillales Streptococcaceae Lactococcus
## 423 Clostridia Clostridiales Veillonellaceae Veillonella
## 424 Bacilli Lactobacillales Streptococcaceae Lactococcus
## Species
## 1
## 2 marcescens
## 3 muciniphila
## 4 <NA>
## 5 reuteri
## 6
## 7 ovatus
## 8 fragilis
## 9
## 10 fragilis
## 11
## 12
## 13 luteciae
## 14 fragilis
## 18 perfringens
## 19
## 20
## 21 caccae
## 25 <NA>
## 33 perfringens
## 46
## 48 <NA>
## 406
## 412 zeae
## 416
## 418
## 420 uniformis
## 421
## 422
## 423
## 424
plot_res(dds2, ngm_data_grpAB_f, alpha=0.01, var="month_group",ref="A",cont="B")
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
These plots may not be perfect for your needs, but will help provide a brief overview of the most-important taxa.
Now, we may want to look at how our taxa are distributed across our samples.
plot_bar(phy)
At first glance, this is not entirely informative. We can instead plot by a taxonomic classification, or by treatment group. First, we will group all of our taxa into Family labels.
Note that taxrank
will take any classification present in your taxonomy, but with higher classifications comes less resolution of your data.
<- tax_glom(phy, taxrank = "Family") ngm_abundance
Next, we will group our samples by a variable of interest. Here, we combine our microbiome profiles according to age in months, which could also be considered a microbiota group/community state type. We also want our counts in our table to be relative counts, rather than absolute counts, so we will take care of that too.
<- merge_samples(ngm_abundance, "month_group")
ngm_abundance
#every count is now transformed into a proportion
<- transform_sample_counts(ngm_abundance, function(x) x / sum(x)) ngm_abundance
Now, if we create our abundance plot, grouping by month and coloring by Family classification, we get a more informative result.
<- plot_bar(ngm_abundance, x = "month", fill = "Family")
stacked_barplot
stacked_barplot
#cleaning up our plot a bit
<- stacked_barplot +
stacked_barplot geom_bar(aes(fill=Family), stat="identity", position="stack") +
#changing the x-axis label
xlab("Age (months)") +
#changing the y-axis label
ylab("Relative Abundance (%)") +
#changing our y axis to a percentage, rather than a decimal scale
scale_y_continuous(labels = scales::percent) +
#italicizing our Family labels, and rotating x-axis labels to be properly oriented
theme(legend.text = element_text(face="italic"), axis.text.x = element_text(angle=0,hjust=0.5))
#view our new and improved plot
stacked_barplot
Now let’s say that you want to only visualize the top 10 families in your data. In order to do this, I loosely follow some ideas in this page, depending on my needs.
<- ngm_abundance
ngm_abundance10 <- names(sort(taxa_sums(ngm_abundance10), decreasing = T)[1:10]) ## Get top 10 families
families tax_table(ngm_abundance10)[,"Family"][!rownames(tax_table(ngm_abundance10)) %in% families] <- "Other" ## When the family is NOT (see: !) part of the top 10 families, call it "Other"
plot_bar(ngm_abundance10, fill="Family")
## If you don't like the order of things (especially with "Other" in the middle of Family names), you'll need to use regular ggplot, which translates nicely with a phyloseq "melting" function (making your dataset into a "long" dataset) for more customizability:
<- psmelt(ngm_abundance10)
ps $Family2 <- factor(ps$Family, levels=c(unique(ps$Family)[!unique(ps$Family) %in% "Other"], "Other")) #relevels "Other" so it's at the end. The logic is a little complicated, so feel free to reach out if you have questions. But basically(ish), it would be dictated as "get the levels of the Family variable when those levels *aren't* Other, then ADD Other as a category at the end, and use this order to reorder our family variable into a new variable called Family2".
ps
table(ps$Family, ps$Family2) ## As a sanity check, you can do something like this to confirm that your Bifidobacterium is still called Bifidobacterium with the reordered variable.
##
## Bifidobacteriaceae Lachnospiraceae Enterobacteriaceae
## Bacteroidaceae 0 0 0
## Bifidobacteriaceae 5 0 0
## Clostridiaceae 0 0 0
## Coriobacteriaceae 0 0 0
## Enterobacteriaceae 0 0 5
## Erysipelotrichaceae 0 0 0
## Lachnospiraceae 0 5 0
## Other 0 0 0
## Porphyromonadaceae 0 0 0
## Streptococcaceae 0 0 0
## Veillonellaceae 0 0 0
##
## Erysipelotrichaceae Bacteroidaceae Clostridiaceae
## Bacteroidaceae 0 5 0
## Bifidobacteriaceae 0 0 0
## Clostridiaceae 0 0 5
## Coriobacteriaceae 0 0 0
## Enterobacteriaceae 0 0 0
## Erysipelotrichaceae 5 0 0
## Lachnospiraceae 0 0 0
## Other 0 0 0
## Porphyromonadaceae 0 0 0
## Streptococcaceae 0 0 0
## Veillonellaceae 0 0 0
##
## Coriobacteriaceae Veillonellaceae Porphyromonadaceae
## Bacteroidaceae 0 0 0
## Bifidobacteriaceae 0 0 0
## Clostridiaceae 0 0 0
## Coriobacteriaceae 5 0 0
## Enterobacteriaceae 0 0 0
## Erysipelotrichaceae 0 0 0
## Lachnospiraceae 0 0 0
## Other 0 0 0
## Porphyromonadaceae 0 0 5
## Streptococcaceae 0 0 0
## Veillonellaceae 0 5 0
##
## Streptococcaceae Other
## Bacteroidaceae 0 0
## Bifidobacteriaceae 0 0
## Clostridiaceae 0 0
## Coriobacteriaceae 0 0
## Enterobacteriaceae 0 0
## Erysipelotrichaceae 0 0
## Lachnospiraceae 0 0
## Other 0 160
## Porphyromonadaceae 0 0
## Streptococcaceae 5 0
## Veillonellaceae 0 0
## Since we're here, I'll also show how to set up a custom color palette
library(RColorBrewer)
## Get 10 colors, plus grey for "Other"
<- c(brewer.pal(10, "Set3"), "darkgrey") ## To see all default color palettes and how many colors they have, use `display.brewer.all()` in the console.
custom_colors names(custom_colors) <- levels(ps$Family2) ## Assign the levels of our Family varible to the names of our vector of colors
#Starts with your typical ggplot setup
ggplot(ps, aes(x=month,y=Abundance, fill=Family2)) +
# Sets up the common pieces of the bar plot
geom_bar(stat="identity", position="stack", color="black") +
# Asks the plot to use our color scheme, and changes the name from Family2 to Family
scale_fill_manual("Family", values=custom_colors)
Now we can quickly see a summary of the profiles associated with different months
A 16s analysis pipeline tutorial written by Jordan Bisanz
Some fun uses of phyloseq, which might give you some ideas too!
Katie would like to give many thanks to Danny Li who helped develop the initial version of the analysis pipeline, as well as the several members of the Lynch Lab and BCMM who provided invaluable constructive feedback along the way.