Initial Draft: September 28, 2022
Version 0 Reviewers: Kevin Magnaye, Claire Schachtschneider
This document describes the protocols and steps required to place and/or store most data types generated in the lab. It is needed to ensure availability and transparency of data across all members of the lab at any given time. The purpose of this protocol is to keep our data within the framework of the FAIR principles, which state that data should be Findable, Accessible, Interoperable, and Reusable. Our goal here is simply to establish a set of rules for raw and intermediate data files that ensures the longevity of the data across the lab for years to come.
The scope of this SOP includes data derived from metabolomics, metagenomics, RNAseq, metatranscriptomics, and participant/sample metadata, and it touches on lab-specific (bench-generated) data as well. Well-established and well-utilized protocols already exist for the storage and downstream processing of 16S rRNA and ITS2 datasets generated from both the NextSeq and MiSeq instruments; these are described elsewhere.
AirTable will perform the linkages necessary across all data modalities generated within studies. If there is no base specific to your project, copy one of the existing bases and rename it to your study before beginning. Login information for AirTable via the Lynch Lab account can be found in the Lynch Lab Etiquette document.
Upon receipt of metabolomics data from Metabolon, the raw Excel sheet, sample submission forms, and metadata files provided both to and by the company are placed in AirTable in the Results worksheet within the base specific to the project for which the data was generated. On the Results worksheet, include the raw metabolomics data in the Attachments column, and fill in all fields.
Samples undergoing metabolomics analysis must have an entry in the Progress Tracker, which should be linked via the Processing ID. The entry in the progress tracker is necessary to ensure that a Benchling link to any bench-based preparation of these samples before shipment to Metabolon is documented. This should include any relevant notations that would be necessary to write precise, reproducible methods for any future paper using these data.
Once the raw (which in this case is really intermediate) data is uploaded to AirTable as described, downstream files may be analyzed from preferred analysis locations [1].
The document does not currently cover data generated by the BCMM Metabolomics Core, as several details regarding the final format and disposition of generated data have not been determined yet. This document will be updated when additional information is available.
Upon receiving an e-mail confirming the completion of metagenomic shotgun sequencing from QB3 or the UCSF Center for Advanced Technologies (CAT), download the data to Wynton. Wynton provides (a) great download speeds, (b) lots of space, and, coming soon, (c) backups. To download data from the CAT, carefully review the e-mail received, which includes the location of the data within the CAT servers in the subject line, as well as passwords, etc.
Steps for download are:
ssh dt2
cd /wynton/group/lynch/Metag_data/
Create a directory named with the study name, “_”, sample type(s), “_”, and date (DDMonYY; the date the data was available/provided). For example, mkdir ITN_stool_17Jun22. If there are multiple sample types, pick the best single-word descriptor possible; the samples actually included within this directory will be described in better detail on AirTable (see Step 10) [2].
rsync -avP hiseq_user@fastq.ucsf.edu:/volume1/SSD/220617_A00351_0725_AWHKX3_PE150/ .
Everything in this command stays the same except the run name after SSD/, which should be the name provided in the subject line of the CAT e-mail.
Scripts for metagenomic data processing are located on GitHub and Wynton (/wynton/group/lynch/kmccauley/Metagenomics_Pipeline_KM/). SOPs for processing are also located on GitHub. As of writing, the SOP on GitHub does not account for this new protocol. Therefore, the following modifications should be made.
The sample_qc_pipeline.sh script should use as input the pertinent files within Metag_data and output to Metag_Processed. This will ensure that everyone has access to downstream files.
Files from internal RNAseq should be placed in
/wynton/group/lynch/RNAseq_data/. The first-level directory should have a descriptive name similar to metagenomics: study name, “_”, sample type, “_”, DDMonYY, again using the best descriptor possible for your samples. Within this directory, place raw sequencing files with the run name as a directory. If more than one run contributes RNAseq data to a project, this will keep everything ordered.
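The naming convention shared by the metagenomics and RNAseq raw-data directories, and the way a CAT run name slots into the rsync source, can be sketched as follows. This is a minimal illustration, not an existing lab script: the function names build_dir_name and cat_rsync_source are hypothetical, and the host and /volume1/SSD/ prefix are simply copied from the example command in this SOP. The example prints the commands for review rather than running them.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical, not part of any lab script.

# Build a directory name of the form Study_sampletype_DDMonYY.
# $3 is the date the data was provided; it defaults to today's date.
build_dir_name() {
    local study="$1" sample_type="$2"
    local stamp="${3:-$(LC_ALL=C date +%d%b%y)}"
    printf '%s_%s_%s\n' "$study" "$sample_type" "$stamp"
}

# Compose the rsync source for a CAT run. Only the run name (taken from the
# subject line of the CAT e-mail) changes between downloads.
cat_rsync_source() {
    local run_name="$1"
    printf 'hiseq_user@fastq.ucsf.edu:/volume1/SSD/%s/\n' "$run_name"
}

# Example: print the commands instead of executing them.
# Assumption: the download is run from inside the new directory, since the
# rsync command in this SOP targets "." as its destination.
dir_name="$(build_dir_name ITN stool 17Jun22)"
echo "mkdir ${dir_name} && cd ${dir_name}"
echo "rsync -avP $(cat_rsync_source 220617_A00351_0725_AWHKX3_PE150) ."
```

Printing the commands first makes it easy to check the run name against the CAT e-mail before starting a large transfer.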
Go to the AirTable base for the relevant project, and under the Results tab, add an entry for your RNAseq results, using the Notes column to state the name of the directory created above. This “result” entry must be associated with a Processing ID (or multiple Processing IDs) from the Progress Tracker, which must include a Benchling link to notes related to bench-based processing of the samples that underwent RNAseq. There should be enough information at the Benchling link for someone else to reproduce and describe the preparation of the samples that went on for sequencing without reaching out to you.
There is currently no standardized pipeline for RNAseq processing of raw fastq files. Processed RNAseq files for each project should be in /wynton/group/lynch/RNAseq_Processed/. Within this directory, create another directory that matches the entry in RNAseq_data. Create a second-level directory for host_expression and (if applicable) microbial_expression. Each of these directories should include FastQC outputs and a gene expression (read counts) matrix, as well as a document containing all code that generated the files present in this directory.
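The processed-data layout described above could be set up with a small helper like the one below. This is a sketch under stated assumptions: the function name make_processed_layout, its yes/no argument, and the project name in the demo are hypothetical. The demo runs against a temporary directory; on Wynton the root would be /wynton/group/lynch/RNAseq_Processed.

```shell
#!/usr/bin/env bash
# Sketch only: function name and arguments are hypothetical.

# Create <root>/<project>/host_expression and, when $3 is "yes",
# <root>/<project>/microbial_expression.
# <project> should match the directory name used under RNAseq_data.
make_processed_layout() {
    local root="$1" project="$2" with_microbial="${3:-no}"
    mkdir -p "${root}/${project}/host_expression" || return 1
    if [ "$with_microbial" = "yes" ]; then
        mkdir -p "${root}/${project}/microbial_expression" || return 1
    fi
}

# Demo against a throwaway directory with a hypothetical project name.
demo_root="$(mktemp -d)"
make_processed_layout "$demo_root" ExampleStudy_blood_01Jan23 yes
ls "${demo_root}/ExampleStudy_blood_01Jan23"
```

The FastQC outputs, read counts matrix, and code document still need to be placed into each directory by hand or by your own processing scripts.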
Participant and sample metadata should ideally be retained in an “always accessible all of the time” location that is also password protected with two-factor authentication (i.e., Wynton/Lynchserver2). Though our data is often “de-identified” (see this for all requirements for de-identification), it still falls within “P3” or “Sensitive” protection levels. See this and this site from UCSF to learn more about data security classifications. Per UCSF policy, all P3 and P4 data needs to be securely maintained. Never send patient data over e-mail; Box can sometimes be a secure alternative.
Within your group-user directory (i.e., /wynton/group/lynch/kmccauley), create a directory for your project, being as descriptive as possible. Within the project-specific directory, create another directory called RawData or something similar and include all files that were provided externally for the project.
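As an illustration of the layout above, the helper below creates the project and RawData directories and copies externally provided files into place. It is a sketch only: the function name stage_raw_metadata, the project name, and the demo file are hypothetical, and the demo uses a temporary directory to stand in for your group-user directory.

```shell
#!/usr/bin/env bash
# Sketch only: function name is hypothetical.

# Create <user_dir>/<project>/RawData and copy externally provided files into it.
# Usage: stage_raw_metadata <user_dir> <project> <file>...
# Prints the path of the RawData directory on success.
stage_raw_metadata() {
    local user_dir="$1" project="$2"
    shift 2
    local raw_dir="${user_dir}/${project}/RawData"
    mkdir -p "$raw_dir" || return 1
    local f
    for f in "$@"; do
        cp -- "$f" "$raw_dir/" || return 1
    done
    printf '%s\n' "$raw_dir"
}

# Demo: a temporary directory stands in for /wynton/group/lynch/<user>,
# and provided_metadata.csv is a made-up placeholder file.
demo_home="$(mktemp -d)"
printf 'sample_id,visit\n' > "${demo_home}/provided_metadata.csv"
stage_raw_metadata "$demo_home" ExampleStudy_participant_metadata "${demo_home}/provided_metadata.csv"
```

Keeping the provided files untouched in RawData, and writing any cleaned versions elsewhere in the project directory, makes it clear which files came from outside the lab.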
The scope here is essentially any data created on the bench: DNA concentrations, TEER values, cell line work, etc.
The automatic location for bench-generated data should be Benchling. If a folder does not exist for your project, create one and use one of the templates designed by Claire (if not applicable, create your own). Once generated data has been finalized in Benchling, download your analytic files and store them in AirTable. If additional statistical analyses need to be performed, you can do this from your preferred analysis location.
[1] Ideally, all work should be accessible by anyone (who is a Lynch Lab member) all of the time. I am not sure yet what this means for analysis performed on laptops. As much as possible, use Wynton (specifically the /wynton/group/lynch space within a personal directory) to perform analysis to ensure accessibility of code and downstream files. Wynton now has an easy-to-use RStudio link, too! Alternatively, saving to Box Drive or utilizing GitHub could assist in transferring all relevant files and code upon a lab member’s departure.
[2] There are likely to be times when your data does not fit this use case (in the case of RNAseq, you may be performing it on cell lines and not a human study). Do the best you can and rely on the space available in AirTable/Benchling to provide as much information as possible to point a future person in the right direction. As mentioned in Background and Scope, this is really the ultimate goal.
[3] See footnote 1.