After your sequencing run is complete, there are a few methods to obtain the sequencing data in its raw format for downstream analysis. (1) Download directly from the sequencer itself with an external drive, or (2) download from BaseSpace, assuming the run has been linked.
Insert an external drive into the sequencer’s USB port. There should be at least 100GB of space on the external drive. Runs typically have around 60GB of data, but it’s always safest to have more space.
Navigate to the location of the raw run data. You are looking for a folder formatted like: 190826_NS500170_0076_AH772CBGXC
.
The first six numbers are the date in YYMMDD format, the second section
is the name of the machine and machine number (NextSeq 500 #170), the
third section is the run number (here the 76th run on our NextSeq), and
the last section is a letter code that also associates with the run.
If you need to use this method, you may need to set up a call with Katie to determine exactly where this directory lives (once Katie recalls, this document will be updated), but once you’ve located it, you click and drag the directory to your external drive and wait for the run to download. This takes approximately 1-2 hours. Once it’s complete, you can disconnect the drive.
Once downloaded, you will upload the data to Wynton where we will do all sequence data processing:
Start by plugging in your drive. If you have a Mac, there are two Terminal commands that can upload data: scp
and rsync
.
There are benefits to both, but for the purpose of uploading sequence
data, scp is better because it gives informative progress text. An
example usage for your needs would be:
scp -r ~/Volumes/MyDrive/190826_NS500170_0076_AH772CBGXC/ {username}@dt2.wynton.ucsf.edu:/wynton/group/lynch/NextSeq_data/
A couple of things to note:
The -r option means that it will upload all of the files within the run directory.
Here we are uploading using the data transfer node on Wynton (hence, dt2), which offers much faster upload speeds than using a login node. We are also uploading to a directory on Wynton where our raw runs live. This is important for downstream processing.
If you are not on a Mac, you will have more success with software like FileZilla and using an SFTP connection (port 22!). Speak with Katie if you need support with this.
You will now need to upload the mapping file using the same method as above, and it will need to be saved in /wynton/group/lynch/NextSeq_Processed/mapping_files
. This will then set you up perfectly for processing the data using the developed script.
In order to download from BaseSpace, the first thing you will need to
do is download the Command Line Interface (CLI) for BaseSpace following
the directions here https://developer.basespace.illumina.com/docs/content/documentation/cli/cli-overview
using the Linux method. You will want to install onto Wynton in your home (~
) directory. Specifically run the following commands on Wynton:
cd ~
wget "https://api.bintray.com/content/basespace/BaseSpaceCLI-EarlyAccess-BIN/latest/\$latest/amd64-linux/bs?bt_package=latest" -O $HOME/bin/bs
chmod u+x $HOME/bin/bs
~/bin/bs auth
The output will be a link that you copy into your browser, and you will sign into basespace, thereby linking the bs command to your account. You will only need to do this once. Once complete, return to the command line.
There are several options available once you have BaseSpace
installed. Review the link above to find the commands you need. Commands
like ~/bin/bs list run
will list out all of the runs that
are available to you, including the project identifier, which will be
helpful when downloading (~/bin/bs download run {id}
). I would suggest going to the /wynton/group/lynch/NextSeq_data/
directory before running the download command so that the run downloads to our raw run directory.
You will now need to upload the mapping file using the scp command (From your computer: scp /my/mapping/file {username}@dt2.wynton.ucsf.edu:/wynton/group/lynch/NextSeq_Processed/mapping_files/
). This will then set you up perfectly for processing the data using the developed script.
The steps needed to process a NextSeq run and develop an ASV table have already been pipelined. Assuming the runs have been set up as above (run directory in /NextSeq_data/ and mapping file in /NextSeq_Processed/mapping_file/), you are almost ready to get started.
ssh devX
, with X being 1, 2, or 3.module load CBI r
Rscript /wynton/group/lynch/kmccauley/mySoftware/install_packages.R
This will install any packages that are missing before starting the
processing steps. The code will tell you what packages are missing and
will install them automatically. If all needed packages are already
installed, you will get a message saying as much. You should always run
this script before processing a run, as the version of R may have
changed.Once complete, start processing the run:
cd /wynton/group/lynch/NextSeq_Processed/
qsub -M {your-email} scripts/complete_16s_pipeline.sh 190826_NS500170_0076_AH772CBGXC Nextseq_190826_mapping.txt
The -M option will send you an e-mail when the process finishes. scripts/complete_16s_pipeline.sh
is the name of the bash file with all of the commands. 190826_NS500170_0076_AH772CBGXC
is the run name. Nextseq_190826_mapping.txt
is the mapping file (which should live in the mapping_files directory).
After several hours (16-24), you will have the following directories in
the directory for this run within NextSeq_Processed:
submission
: This directory contains all of your R1
and R2 files separated by sample ID. This is what gets uploaded to SRA
and also what is used for DADA2.
FASTQC
contains the results from FASTQC. You can scp the files to your computer for viewing in a browser.
dada2_output
contains all of the files generated from DADA2. These include:
The file that you will use for subsequent analysis is phyloseq_noneg.rds
.
These steps provide an extra layer of processing and filtering for your analyses and were completed automatically as part of the script above.
Next, we want to remove any SVs in our table that may be contaminants, or generally present in extremely low counts relative to our total count, cutting down on noise in our data. This is typically 0.0001% of the total read count in your dataset.
Next, we use a script to remove signal deriving from the negative controls from our samples. This form of negative control filtering outright removes any SVs found in more than 15% of your negative controls and less than 15% of samples. Among the SVs that remaine, the average read count within negative controls is subtracted from the read count in samples, and negative numbers are returned to 0.
The script then returns a NTC-filtered phyloseq object. Also output by the script is a figure showing the distribution of taxa before and after running the script (Combined_Cleaning_Figure.pdf). An example output is shown here (this plot coming from running the same example SV table through the NTC script):
We can tell from this figure that taxa like Pseudomonas and Delftia were removed, and while some taxa also present in negative controls remain, they are very likely to be biologically meaningful. Further examination of this data (not shown) found that remaining taxa had a very low total read count within the negative controls. Hence, the lack of movement of those taxa.
All microbiome data needs to be normalized for differences in read depth. This can be done through either Multiply Rarefying or Variance Stabilized Transformations.
Considerations for your own dataset should be driven by the questions you wish to answer. Our preference is for multiply-rarefying, but if you have low-abudance samples or heterogeneity in read counts across samples (indicating the loss of several samples after rarefying) you may want to consider variance-stabilized transformations.
Rarefying means to subsample each sample to an even read depth. We use a method for representative rarefaction in which we subsample each sample 100 times and choose the randomized sample at the center of a distance matrix. However, before we can carry out this process, we need to determine the optimal depth. This depth is that at which there is no additional diversity to gain and you retain an optimal number of samples (samples with a read depth lower than the rarefying depth are removed).
First, confirm R is loaded into your workspace:
module load CBI r
R
library(vegan)
library(phyloseq)
## read in the saved RDS file:
phy <- readRDS("phyloseq_noneg.rds")
## First, you will want to subset your dataset to the sample set for your study
phy.WISC <- subset_samples(phy, Owner %in% "WISC")
#pdf("AlphaRarefactionPlot.pdf")
rarecurve(t(otu_table(phy.WISC)), step=floor(max(taxa_sums(phy))/3000),cex=0.5)
In order to view a plot made on Wynton, you can copy it to your desktop (after using the pdf and dev.off functions above) using either your favorite FTP or the following scp command from your laptop:
scp {username}@dt2.wynton.ucsf.edu:{image_location}/AlphaRarefactionPlot.pdf ~/Downloads/
You want to identify the depth where diversity plateaus, but you don’t end up losing many samples. In this example, we would likely pick something in the 50,000-70,000 read range, since diversity has plateaued and we start losing samples due to read depth soon after.
To initiate the multiply-rarefying step, you can run the following script from the location of your phyloseq object on Wynton.
Rscript /wynton/group/lynch/kmccauley/dada2_files/Multipy_Rarefy_phy.R phyloseq_noneg.rds 50000 phyloseq_mrare.rds
The first object is the name/location of the script. phyloseq_noneg.rds
indicates the name of the input phyloseq object; 50000
is the planned read depth; phyloseq_mrare.rds
is the name of the output object. If you’d rather not specify an output name, one will be chosen for you.
We can also use the variance stabilized transformation, which can easily be applied before downstream analyses using DESeq2 functions:
library(DESeq2)
## read in the saved RDS file:
phy <- readRDS("phyloseq_noneg.rds")
phy.vst <- phy ## Initialize a VST object
deseqdat <- phyloseq_to_deseq2(phy, ~1) ## We can use a place-holder design formula
deseqdat2 = estimateSizeFactors(deseqdat, type="poscounts")
otu_table(phy.vst) <- otu_table(counts(deseqdat2, normalized=TRUE),taxa_are_rows=TRUE) ## Update OTU table
Moving forward with analyses, we will be using an abridged dataset from from Kei’s study Neonatal gut microbiota associates with childhood multisensitized atopy and T cell differentiation. The sample data and OTU table can be found in /data/Users/dli2/DOC
.
Please note that the OTU table has been processed through all of the
applicable steps shown above (very important to do so prior to any
analyses!).
All analyses moving forward are run on Lynchserver2. While you can run some analyses on Wynton, it’s not advised to run interactive analyses on the dev nodes. Any scripts Katie has developed have also been optimized for use on Lynchserver2.
The process here will include how to generate a phyloseq object from
scratch, though you can also import the phyloseq object as shown above,
using the readRDS
function.
First, we need to load our data into R. Here, we load our sample data file. We also create some new variables for our data which will be used for grouping later on.
stool_data <- read.table("/data/Users/dli2/Doc/fujimura_sample_data.txt", header=T, check.names=F, comment="", sep="\t")
#creating a month variable for each participant (ages are in days right now)
stool_data$month <- round(stool_data$ageStool/30, 0)
stool_data$month <- as.factor(stool_data$month)
#age_data$month <- as.factor(age_data$month)
stool_data$month_group <- ifelse(stool_data$month==1, "A",
ifelse(stool_data$month==2, "B",
ifelse(stool_data$month==3, "C",
ifelse(stool_data$month==4, "D", "E")
)))
Next, we load in our OTU table. We also do a bit of cleaning to our data (any steps beyond reading in your OTU table are likely not necessary, but is done here for the example dataset in particular).
otu_data <- read.delim("/data/Users/dli2/Doc/fujimura_rarefied_otutable.txt", check.names=F, sep="\t")
#giving our otus an otu label, rather than just numbers.
otu_data$`#OTUID` <- paste0("OTU_", otu_data$`#OTUID`)
#removing OTUs that are reference genomes as we don't need those (they have "reference" in the name so this line is selecting all rows of the OTU table that don't have "reference" in them)
otu_data <- otu_data[!grepl("Reference", otu_data$`#OTUID`), ]
Now, we will load Phyloseq, a useful package commonly used for microbial analyses.
Note: if these packages are not installed yet, you can run the code as shown here to install them.
source('http://bioconductor.org/biocLite.R')
biocLite('phyloseq')
install.packages("ggplot2")
#loading our installed packages in R.
library(phyloseq)
#for more plotting features later.
library(ggplot2)
Next, we prepare our data to be merged into a Phyloseq object.
#This is for phyloseq. phyloseq doesn't like when your OTU table has the taxonomy info attached so we're separating it for now.
#also doing some other cleaning up to make a phyloseq object
taxonomy_data <- otu_data[, c("#OTUID", "taxonomy")]
rownames(taxonomy_data) <- taxonomy_data$`#OTUID`
rownames(otu_data) <- otu_data$`#OTUID`
rownames(stool_data) <- stool_data$`#SampleID`
otu_data$`#OTUID` <- NULL
otu_data$taxonomy <- NULL
This function is to make the taxonomy table look nice, and for easier use later on. Note that there is no output in this code chunk, this is simply saving our function.
split.tax <- function(tax.dat) {
taxanames <- strsplit(as.character(tax.dat[,2]),"; ")
mat <- t(sapply(taxanames,
function(x,m) c(x,rep(NA,m-length(x))),
max(rapply(taxanames,length))))
newnames <- gsub("_","",mat)
newnames <- as.matrix(newnames)
colnames(newnames) <- c("Kingdom","Phylum","Class","Order","Family","Genus","Species")
row.names(newnames) <- tax.dat[,1]
newnames[,6][newnames[,6] %in% c("") | is.na(newnames[,6])] <- newnames[,5][newnames[,6] %in% c("") | is.na(newnames[,6])]
newnames[,6][newnames[,6] %in% c("") | is.na(newnames[,6])] <- newnames[,4][newnames[,6] %in% c("") | is.na(newnames[,6])]
return(newnames)
}
Now, we run the function on our taxonomy data to clean it up.
#use the function on our taxonomy table
taxonomy_data <- split.tax(taxonomy_data)
taxonomy_data <- sub('.', '', taxonomy_data)
taxonomy_data <- as.data.frame(taxonomy_data)
To view our new formatted taxonomy table:
## Kingdom Phylum Class Order
## OTU_100497 Bacteria Proteobacteria Gammaproteobacteria Enterobacteriales
## OTU_1010113 Bacteria Proteobacteria Gammaproteobacteria Enterobacteriales
## OTU_1012948 Bacteria Actinobacteria Actinobacteria Actinomycetales
## OTU_1023075 Bacteria Firmicutes Clostridia Clostridiales
## OTU_103166 Bacteria Proteobacteria Gammaproteobacteria Enterobacteriales
## OTU_1033413 Bacteria Firmicutes Bacilli Lactobacillales
## Family Genus Species
## OTU_100497 Enterobacteriaceae Enterobacteriaceae <NA>
## OTU_1010113 Enterobacteriaceae Enterobacteriaceae <NA>
## OTU_1012948 Corynebacteriaceae Corynebacterium
## OTU_1023075 Veillonellaceae Veillonella
## OTU_103166 Enterobacteriaceae Citrobacter
## OTU_1033413 Enterococcaceae Enterococcus
We see that our classifications have been separated into their respective hierarchies for each OTU in our data.
Finally, we read in our phylogenetic tree.
phytree <- read_tree("rep_set_aligned_pfiltered_041614.tre")
taxa_names(phytree) <- paste0("OTU_", taxa_names(phytree)) ## Needed for this particular dataset
phytree <- ape::root(phytree, 1, resolve.root=TRUE) ## For most analyses, a rooted tree is required. This command roots the tree. If phylogenetic analyses appear "off", you may need to change the `1` value to another number.
Now, we make a Phyloseq object upon which we will conduct our preliminary analyses.
#making a phyloseq object. phyloseq takes a variety of datasets, but here we are giving it our sample data, OTU table, and taxonomy data.
ngm_data <- phyloseq(sample_data(stool_data), otu_table(as.matrix(otu_data), taxa_are_rows = T), tax_table(as.matrix(taxonomy_data)), phy_tree(phytree))
To see the contents of our phyloseq object:
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 858 taxa and 130 samples ]
## sample_data() Sample Data: [ 130 samples by 6 sample variables ]
## tax_table() Taxonomy Table: [ 858 taxa by 7 taxonomic ranks ]
## phy_tree() Phylogenetic Tree: [ 858 tips and 857 internal nodes ]
As you can seem we have 130 samples, 858 taxa, and 6 variables in our sample data.
To calculate alpha diversity for each of our samples, we use the following calculations:
sample_data(ngm_data)$equitability <- diversity(t(otu_table(ngm_data)))/log(specnumber(t(otu_table(ngm_data)))) ## Pielou's Evenness
sample_data(ngm_data)$chao1 <- t(estimateR(t(otu_table(ngm_data))))[,2] ## Chao1 richness
sample_data(ngm_data)$PD_whole_tree <- picante::pd(t(otu_table(ngm_data)), phy_tree(ngm_data), include.root = TRUE)[,2] ## Faith's Phylogenetic Diversity
We can then relate alpha diversity to our sample data using linear models. We do not need linear mixed effects models here because we are examining one timepoint.
##
## Call:
## lm(formula = chao1 ~ ageStool, data = data.frame(sample_data(ngm_data)))
##
## Residuals:
## Min 1Q Median 3Q Max
## -127.059 -41.623 -5.374 29.591 220.499
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 316.9914 12.1544 26.080 <2e-16 ***
## ageStool 0.6421 0.2769 2.319 0.022 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 59.45 on 128 degrees of freedom
## Multiple R-squared: 0.04033, Adjusted R-squared: 0.03283
## F-statistic: 5.379 on 1 and 128 DF, p-value: 0.02196
This output tells us that there is a significant (P=0.022) relationship between richness and the age, in days, of the stool. Specifically, for every day older the child is, there is a 0.64 increase in richness.
If we wished to plot this relationship, we could do so using ggplot2:
library(ggplot2)
ggplot(data.frame(sample_data(ngm_data)), aes(x=ageStool, y=chao1)) +
geom_point() +
geom_smooth(method="lm")
## `geom_smooth()` using formula 'y ~ x'
We can also examine categorical relationships in a similar manner:
##
## Call:
## lm(formula = chao1 ~ month_group, data = data.frame(sample_data(ngm_data)))
##
## Residuals:
## Min 1Q Median 3Q Max
## -127.493 -36.169 -2.226 33.273 223.920
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 337.33 6.15 54.847 <2e-16 ***
## month_groupB 13.95 13.13 1.063 0.290
## month_groupC 32.65 27.64 1.181 0.240
## month_groupD 28.56 60.57 0.471 0.638
## month_groupE 97.67 60.57 1.612 0.109
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 60.26 on 125 degrees of freedom
## Multiple R-squared: 0.03718, Adjusted R-squared: 0.006367
## F-statistic: 1.207 on 4 and 125 DF, p-value: 0.3114
Here, months were grouped into alphabetical categories, and each P-value represents the unit-change between Group A and each other group. Here, there is no significant difference between Group A and each of the other groups. We can also test if the groups are different from each other (one p-value) with an analysis of Variance (ANOVA) test as shown here:
## Analysis of Variance Table
##
## Response: chao1
## Df Sum Sq Mean Sq F value Pr(>F)
## month_group 4 17527 4381.8 1.2066 0.3114
## Residuals 125 453924 3631.4
A minimal plot of this relationship:
ggplot(data.frame(sample_data(ngm_data)), aes(x=month_group, y=chao1, fill=month_group)) +
geom_boxplot() +
geom_point(shape=21, size=3)
Next, we will take a look at differences in composition between samples. To do so, we will create an ordination of all samples in our Phyloseq object and plot them. We can generate Bray Curtis (“bray”) and Canberra (“canberra”) distance matrices. And assuming we have a phylogenetic tree in our phyloseq object, we can also calculate Weighted UniFrac (“wunifrac”) and Unweighted UniFrac (“unifrac”) distance matrices.
age_pcoa <- ordinate(ngm_data, method="PCoA", distance="bray")
plot_ordination(ngm_data, age_pcoa, type="samples", color="month")
From our plot, we do not see any immediate distinctions in
composition between months. However, to test this statistically, we use
the adonis2
function from the R package vegan
.
If not installed:
install.packages("vegan")
library(vegan)
To conduct a PERMANOVA test and determine if beta diversity significantly differs across our groups (here, months of age), we first create a distance matrix with our data:
To statistically test if our variable is indeed significant:
#for reproducibility purposes (set.seed needs to be run at the exact same time as the adonis function in order to get the exact same result)
set.seed(123)
vegan::adonis2(dm_age ~ month, data = data.frame(ngm_data@sam_data))
## Permutation test for adonis under reduced model
## Terms added sequentially (first to last)
## Permutation: free
## Number of permutations: 999
##
## vegan::adonis2(formula = dm_age ~ month, data = data.frame(ngm_data@sam_data))
## Df SumOfSqs R2 F Pr(>F)
## month 4 1.590 0.03531 1.1438 0.17
## Residual 125 43.441 0.96469
## Total 129 45.031 1.00000
We look at the line starting with month
as this is our
variable in question. The R2 (R-squared) value indicates the % variation
explained by our variable, and the p-value denotes whether or not this
result is significant (in this case, our samples do not significantly
differ in composition when grouping by months of age).
If you plan to undertake multivariable analyses, you will need to include the by="margins"
option in the adonis2 function. This will calculate R^2 and P-values
that account for all other variables in the model (Type III Sum of
Squares). The default by="terms"
will calculate the R2 and
P-values based only on accounting for variables above each variable
(Type I Sum of Squares). In other words, if there are three variables in
the model, the first variable’s results will be the relationship
without considering the other two variables, the second variable’s
results will only account for the previous variable, and the last
variable will account for the other two variables.
If you have repeated measures in your data, you will be unable to use adonis2
to determine if a variable is associated with microbiota composition.
Instead, you will need to obtain the axis values and use linear mixed
effects models. This study doesn’t have repeated measures, but for the
purpose of illustration, we will use this dataset and use simple linear
models.
#for reproducibility purposes
set.seed(123)
age_pcoa <- ordinate(ngm_data, method="PCoA", distance="bray")
pcoa_data <- merge(ngm_data@sam_data, age_pcoa$vectors[,1:3], by=0)
# You can see that the Axis values have been added to our sample data:
head(pcoa_data)
## Row.names X.SampleID ageStool DMM Description month month_group
## 1 A02 A02 20 4 PPG_feces_A02 1 A
## 2 A05 A05 35 2 PPG_feces_A05 1 A
## 3 A06 A06 38 4 PPG_feces_A06 1 A
## 4 A07 A07 74 2 PPG_feces_A07 2 B
## 5 A101 A101 77 2 PPG_feces_A101 3 C
## 6 A106 A106 52 1 PPG_feces_A106 2 B
## equitability chao1 PD_whole_tree Axis.1 Axis.2 Axis.3
## 1 0.4851637 319.2069 268 -0.2929754 -0.1725481826 -0.174631337
## 2 0.3265574 379.5000 243 0.3627402 0.0771376594 -0.077711683
## 3 0.4285474 329.3913 255 -0.2864202 -0.0669919650 -0.212846851
## 4 0.4021263 364.4762 256 0.3252046 0.0004981516 -0.046632159
## 5 0.3343220 280.3333 215 0.4339586 -0.0252256065 0.005000593
## 6 0.4563690 394.0000 304 -0.2240980 0.1155055941 0.067084269
##
## Call:
## lm(formula = Axis.1 ~ ageStool, data = pcoa_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.33470 -0.22365 -0.06457 0.24416 0.45250
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.0012927 0.0520799 0.025 0.980
## ageStool -0.0000326 0.0011864 -0.027 0.978
##
## Residual standard error: 0.2547 on 128 degrees of freedom
## Multiple R-squared: 5.899e-06, Adjusted R-squared: -0.007807
## F-statistic: 0.0007551 on 1 and 128 DF, p-value: 0.9781
##
## Call:
## lm(formula = Axis.2 ~ ageStool, data = pcoa_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.44292 -0.12034 0.00076 0.10104 0.41315
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.0523471 0.0394482 1.327 0.187
## ageStool -0.0013201 0.0008986 -1.469 0.144
##
## Residual standard error: 0.193 on 128 degrees of freedom
## Multiple R-squared: 0.01658, Adjusted R-squared: 0.008898
## F-statistic: 2.158 on 1 and 128 DF, p-value: 0.1443
##
## Call:
## lm(formula = Axis.3 ~ ageStool, data = pcoa_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.33670 -0.11108 -0.00691 0.11805 0.39254
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.0711617 0.0328016 -2.169 0.0319 *
## ageStool 0.0017946 0.0007472 2.402 0.0178 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1604 on 128 degrees of freedom
## Multiple R-squared: 0.04312, Adjusted R-squared: 0.03565
## F-statistic: 5.768 on 1 and 128 DF, p-value: 0.01776
Here, we can see that the age (in days) of the stool relates to PC3 of the Bray Curtis distance matrix. The same analysis above, but using mixed effects models, would look like:
library(lmerTest)
summary(lmer(Axis.1 ~ ageStool + (1|StudyID), data=pcoa_data)) ## Note that StudyID does not exist in the data
A script is available to relate all sample data to your distance matrices. To use this script, you will need to save distance matrices to a file like below.
The script can be found here: /data/Users/kmccauley/LabCode/BetaDiversity/Adonis_All_Variables_20Jan19.R
.
To use it, scroll to the bottom and fill in the needed information
(distance matrix, sample information, output file, etc). Additional
information about how to use this script is provided at the top of the
code and in the README file (/data/Users/kmccauley/LabCode/README
). Once the function is filled out, you can run the script using: Rscript Adonis_All_Variables_20Jan19.R
.
THE FIRST TIME YOU RUN THIS SCRIPT IT WILL BREAK. This is because it prints out a file called Variable_Table.txt. View this file and ensure that R is reading your script as intended, and if you need to make any modifications, do so. This can include removing any variables that you would prefer not to analyze (ie, SampleID). Then, when you re-run the Rscript command, it will list out every variable it is analyzing, confirming the variable type (ie, factor, numeric, etc) as it analyzes them. This script will also print out preliminary Principal Coordinate Analysis (PCoA) plots for all variables.
This script will also identify where you have repeated measures (hence why it asks for Sample ID as well as Study/Participant ID) and if found, will include analysis of Axis 1 and Axis 2 using linear mixed-effects models.
We can identify taxa that differentiate between groups using either the Three-Model approach or DESeq2. Both use count-based methods, but take slightly different approaches.
This method analyzes each OTU/SV using Poissson, Negative Binomial and Zero-Inflated Negative Binomial models. The fit of each of these three models is compared, and the model that fits the data best is considered the “winning” model from which the estimate and the p-value are derived. This allows the data to determine which model fits the data best. As written, this script can analyze data with repeated measures, but currently needs rarefied data to make appropriate conclusions.
The script for this analysis can be found in: /data/Users/kmccauley/LabCode/SigTaxa/ThreeModel_LabCode_29Sept20.R
To use this script, you will need to obtain the OTU table from the phyloseq object.
After copying the Three-Model script to your working directory, you
can open it and enter the necessary inputs into the first several lines.
Then, once it has been filled out, you can again run the script using Rscript ThreeModel_LabCode_29Sept20.R
.
It will take awhile to run, since it’s running three different models
on each Sequence Variant, but soon your analysis will complete. The
output file includes all analyzed OTUs, the results from each of the
three models, as well as a few other statistics. The last three columns
display the winning model for each OTU, as well as the estimate and FDR
p-value for that model.
Katie is working on a follow-up script to this analysis, and more details will be provided when it is closer to being ready.
An alternate method is DESeq2. This method uses a modified Negative Binomial model on non-rarefied data to identify taxa differentially-abundant between groups. It’s preferred input is non-rarefied, non-transformed data, as it implements the Variance Stabilized Transformation as described above. Katie has tried DESeq with both rarefied and non-rarefied data, and has discussed it with other statisticians. Results with rarefied/non-rarefied data were similar, and the statisticians said rarefied data should not impact model-fit based on the statistical methods being implemented. You can also use DESeq for continuous variables. However, DESeq cannot be used for data with repeated measures.
Wrapper functions have been developed by Katie to make using DESeq2 easier.
## Load in the wrapper functions
library(DESeq2)
source("/data/Users/kmccauley/LabCode/SigTaxa/DESeqFunctions.R")
## Subsetting to the two groups I want to analyze:
ngm_data_grpAB <- subset_samples(ngm_data, month_group %in% c("A","B"))
## Removing low-prevalence taxa:
ngm_data_grpAB_f <- filter_taxa(ngm_data_grpAB, function(x) sum(x > 0) > (0.2*length(x)), TRUE)
dds <- phyloseq_to_deseq2(ngm_data_grpAB_f, ~month_group)
## Warning in DESeqDataSet(se, design = design, ignoreRank): some variables in
## design formula are characters, converting to factors
dds2 <- DESeq(dds, fitType="local", sfType="poscounts")
## Two wrapper functions to print out the results of DESeq, as well as generate a volcano plot.
print_res(dds2, ngm_data_grpAB_f, alpha=0.01, var="month_group",ref="A",cont="B", sig.only=TRUE)
## OTUname baseMean log2FoldChange lfcSE stat pvalue
## 1 OTU_4436046 56.559587 -5.631804 0.8918776 -6.314548 2.709526e-10
## 2 OTU_4416113 84.220369 -5.312729 0.7884558 -6.738144 1.604221e-11
## 3 OTU_4306262 12.229104 -5.234182 1.4373188 -3.641629 2.709186e-04
## 4 OTU_1846390 7.963263 -5.230130 1.3093861 -3.994337 6.487543e-05
## 5 OTU_851733 8.176479 -5.156444 1.2550599 -4.108524 3.981951e-05
## 6 OTU_259772 10.214898 -5.129374 1.1926722 -4.300741 1.702279e-05
## 7 OTU_182874 30.919137 -5.088010 1.2920954 -3.937798 8.223288e-05
## 8 OTU_4433823 126.832435 -4.994828 1.1491945 -4.346373 1.384070e-05
## 9 OTU_4464445 10.405667 -4.992402 1.2907358 -3.867873 1.097888e-04
## 10 OTU_4343627 7001.874400 -4.969519 0.8189864 -6.067890 1.296018e-09
## 11 OTU_192308 6.111008 -4.957759 0.8851167 -5.601248 2.128134e-08
## 12 OTU_191361 36.630689 -4.793991 0.8068438 -5.941659 2.821516e-09
## 13 OTU_3244896 46.213041 -4.711654 0.9895148 -4.761580 1.920833e-06
## 14 OTU_4329112 6.786985 -4.661969 1.3792324 -3.380118 7.245462e-04
## 18 OTU_315982 24.941107 -4.477678 1.1147041 -4.016921 5.896353e-05
## 19 OTU_553611 1069.544229 -4.296237 0.7959401 -5.397689 6.750488e-08
## 20 OTU_4381553 302.471101 -4.245171 0.8558998 -4.959892 7.053244e-07
## 21 OTU_193528 53.139797 -4.217688 0.8915581 -4.730693 2.237544e-06
## 25 OTU_4401580 31.366964 -3.463016 0.9494178 -3.647516 2.647879e-04
## 33 OTU_170652 264.027948 -3.097130 0.8987530 -3.446030 5.688875e-04
## 46 OTU_4335781 38.578220 -2.647693 0.7139778 -3.708368 2.085990e-04
## 48 OTU_137056 7.331480 -2.625679 0.7666072 -3.425064 6.146535e-04
## 406 OTU_170462 1.131276 2.111737 0.6178248 3.418020 6.307851e-04
## 412 OTU_4480189 2.089340 2.858249 0.6546802 4.365870 1.266178e-05
## 416 OTU_289734 735.437681 3.073925 0.8136809 3.777802 1.582188e-04
## 418 OTU_179400 1121.858944 3.655439 1.0122313 3.611268 3.047031e-04
## 420 OTU_3327894 23.622946 3.904229 1.0207794 3.824753 1.309033e-04
## 421 OTU_292735 46.818392 3.987236 1.0557270 3.776768 1.588766e-04
## 422 OTU_544419 4.417782 4.531388 1.2788739 3.543264 3.952067e-04
## 423 OTU_359954 17.863115 5.029689 0.6216430 8.090959 5.919659e-16
## 424 OTU_1100972 39.479231 5.377540 0.8796274 6.113429 9.751281e-10
## padj comparison ref cont Kingdom Phylum
## 1 3.829463e-08 A_vs_B A B Bacteria Firmicutes
## 2 3.400949e-09 A_vs_B A B Bacteria Proteobacteria
## 3 4.594780e-03 A_vs_B A B Bacteria Verrucomicrobia
## 4 1.618070e-03 A_vs_B A B Bacteria Firmicutes
## 5 1.125565e-03 A_vs_B A B Bacteria Firmicutes
## 6 5.155475e-04 A_vs_B A B Bacteria Firmicutes
## 7 1.937041e-03 A_vs_B A B Bacteria Bacteroidetes
## 8 4.514198e-04 A_vs_B A B Bacteria Bacteroidetes
## 9 2.450024e-03 A_vs_B A B Bacteria Firmicutes
## 10 1.099023e-07 A_vs_B A B Bacteria Bacteroidetes
## 11 1.289041e-06 A_vs_B A B Bacteria Firmicutes
## 12 1.993871e-07 A_vs_B A B Bacteria Firmicutes
## 13 8.144333e-05 A_vs_B A B Bacteria Firmicutes
## 14 9.909923e-03 A_vs_B A B Bacteria Bacteroidetes
## 18 1.562534e-03 A_vs_B A B Bacteria Firmicutes
## 19 3.577759e-06 A_vs_B A B Bacteria Actinobacteria
## 20 3.322862e-05 A_vs_B A B Bacteria Bacteroidetes
## 21 8.624715e-05 A_vs_B A B Bacteria Bacteroidetes
## 25 4.594780e-03 A_vs_B A B Bacteria Bacteroidetes
## 33 8.614583e-03 A_vs_B A B Bacteria Firmicutes
## 46 3.845477e-03 A_vs_B A B Bacteria Actinobacteria
## 48 8.915096e-03 A_vs_B A B Bacteria Firmicutes
## 406 8.915096e-03 A_vs_B A B Bacteria Firmicutes
## 412 4.473827e-04 A_vs_B A B Bacteria Firmicutes
## 416 3.061986e-03 A_vs_B A B Bacteria Firmicutes
## 418 4.969005e-03 A_vs_B A B Bacteria Firmicutes
## 420 2.775151e-03 A_vs_B A B Bacteria Bacteroidetes
## 421 3.061986e-03 A_vs_B A B Bacteria Firmicutes
## 422 6.206208e-03 A_vs_B A B Bacteria Firmicutes
## 423 2.509935e-13 A_vs_B A B Bacteria Firmicutes
## 424 1.033636e-07 A_vs_B A B Bacteria Firmicutes
## Class Order Family Genus
## 1 Clostridia Clostridiales Lachnospiraceae Dorea
## 2 Gammaproteobacteria Enterobacteriales Enterobacteriaceae Serratia
## 3 Verrucomicrobiae Verrucomicrobiales Verrucomicrobiaceae Akkermansia
## 4 Clostridia Clostridiales Clostridiaceae Clostridiaceae
## 5 Bacilli Lactobacillales Lactobacillaceae Lactobacillus
## 6 Clostridia Clostridiales Lachnospiraceae Coprococcus
## 7 Bacteroidia Bacteroidales Bacteroidaceae Bacteroides
## 8 Bacteroidia Bacteroidales Bacteroidaceae Bacteroides
## 9 Clostridia Clostridiales Lachnospiraceae Dorea
## 10 Bacteroidia Bacteroidales Bacteroidaceae Bacteroides
## 11 Clostridia Clostridiales Lachnospiraceae
## 12 Clostridia Clostridiales Lachnospiraceae
## 13 Bacilli Lactobacillales Streptococcaceae Streptococcus
## 14 Bacteroidia Bacteroidales Bacteroidaceae Bacteroides
## 18 Clostridia Clostridiales Clostridiaceae Clostridium
## 19 Actinobacteria Bifidobacteriales Bifidobacteriaceae Bifidobacterium
## 20 Bacteroidia Bacteroidales Bacteroidaceae Bacteroides
## 21 Bacteroidia Bacteroidales Bacteroidaceae Bacteroides
## 25 Bacteroidia Bacteroidales Bacteroidaceae Bacteroides
## 33 Clostridia Clostridiales Clostridiaceae Clostridium
## 46 Actinobacteria Bifidobacteriales Bifidobacteriaceae Bifidobacterium
## 48 Bacilli Bacillales Staphylococcaceae Staphylococcus
## 406 Clostridia Clostridiales Lachnospiraceae [Ruminococcus]
## 412 Bacilli Lactobacillales Lactobacillaceae Lactobacillus
## 416 Clostridia Clostridiales Lachnospiraceae Blautia
## 418 Clostridia Clostridiales Lachnospiraceae [Ruminococcus]
## 420 Bacteroidia Bacteroidales Bacteroidaceae Bacteroides
## 421 Clostridia Clostridiales Lachnospiraceae Blautia
## 422 Bacilli Lactobacillales Streptococcaceae Lactococcus
## 423 Clostridia Clostridiales Veillonellaceae Veillonella
## 424 Bacilli Lactobacillales Streptococcaceae Lactococcus
## Species
## 1
## 2 marcescens
## 3 muciniphila
## 4 <NA>
## 5 reuteri
## 6
## 7 ovatus
## 8 fragilis
## 9
## 10 fragilis
## 11
## 12
## 13 luteciae
## 14 fragilis
## 18 perfringens
## 19
## 20
## 21 caccae
## 25 <NA>
## 33 perfringens
## 46
## 48 <NA>
## 406
## 412 zeae
## 416
## 418
## 420 uniformis
## 421
## 422
## 423
## 424
These plots may not be perfect for your needs, but will help provide a brief overview of the most-important taxa.
Now, we may want to look at how our taxa are distributed across our samples.
At first glance, this is not entirely informative. We can instead plot by a taxonomic classification, or by treatment group. First, we will group all of our taxa into Family labels.
Note that taxrank
will take any classification present in your taxonomy, but with higher classifications comes less resolution of your data.
Next, we will group our samples by a variable of interest. Here, we group samples according to age in months. We also want our counts in our table to be relative counts, rather than absolute counts, so we will take care of that too.
ngm_abundance <- merge_samples(ngm_abundance, "month_group")
#every count is now transformed into a proportion
ngm_abundance <- transform_sample_counts(ngm_abundance, function(x) x / sum(x))
Now, if we create our abundance plot, grouping by month and coloring by Family classification, we get a more informative result.
#cleaning up our plot a bit
stacked_barplot <- stacked_barplot +
geom_bar(aes(color= Family, fill=Family), stat="identity", position="stack") +
#adding an x-axis label
xlab("Age (months)") +
#adding a y-axis label
ylab("Relative Abundance (%)") +
#changing our y axis to a percentage, rather than a decimal scale
scale_y_continuous(labels = scales::percent) +
#italicizing our Family labels, and rotating x-axis labels to be properly oriented
theme(legend.text = element_text(face="italic"), axis.text.x = element_text(angle=0,hjust=0.5))
#view our new and improved plot
stacked_barplot
Now we can observe if there are any immediate differences in the microbiome composition of our samples.