Background

This document describes the protocols and steps required to place and store most data types generated in the lab. It is needed to ensure that data are available and transparent to all members of the lab at any given time. The purpose of this protocol is to keep our data within the framework of the FAIR principles, which state that data should be Findable, Accessible, Interoperable, and Reusable. Our goal here is simply to establish a set of rules for raw and intermediate data files that ensures the longevity of the data across the lab for years to come.

Scope

The scope of this SOP includes data derived from metabolomics, metagenomics, RNAseq, metatranscriptomics, and participant/sample metadata, and it touches on lab-specific, bench-generated data as well. Well-established and well-utilized protocols already exist for the storage and downstream processing of 16S rRNA and ITS2 datasets generated from both the NextSeq and MiSeq instruments; these are described elsewhere.

Preamble/Protocols Applicable to All Datasets

AirTable will perform the linkages necessary across all data modalities generated within studies. If there is no base specific to your project, copy one of the existing bases and rename it to your study before beginning. Login information for AirTable via the Lynch Lab account can be found in the Lynch Lab Etiquette document.

Data Storage Requirements

Metabolomics

Upon receipt of metabolomics data from Metabolon, the raw Excel sheet, sample submission forms, and metadata files provided both to and by the company are placed in AirTable, in the Results worksheet within the base specific to the project for which the data were generated. On the Results worksheet, include the raw metabolomics data in the Attachments column, and fill in all fields.

Samples undergoing metabolomics analysis must have an entry in the Progress Tracker, which should be linked via the Processing ID. The entry in the Progress Tracker is necessary to ensure that a Benchling link documenting any bench-based preparation of these samples before shipment to Metabolon is recorded. The Benchling notes should include any relevant notations that would be necessary to write precise, reproducible methods for any future paper using these data.

Once raw (which in this case is really intermediate) data is uploaded to AirTable as described, downstream files may be analyzed from preferred analysis locations¹.

The document does not currently cover data generated by the BCMM Metabolomics Core, as several details regarding the final format and disposition of generated data have not been determined yet. This document will be updated when additional information is available.

Metagenomics

Raw Files

Upon receiving an e-mail confirming the completion of metagenomic shotgun sequencing from QB3 or the UCSF Center for Advanced Technologies (CAT), download the data to Wynton. Wynton provides (a) great download speeds, (b) lots of space, and, coming soon, (c) backups. To download data from the CAT, carefully review the e-mail received, which includes the location of the data within the CAT servers in the subject line, as well as passwords, etc.

Steps for download are:

1. Log into Wynton.
2. Move to the data transfer nodes: ssh dt2
3. Change directories to our central metagenomics storage location: cd /wynton/group/lynch/Metag_data/
4. Make a directory formatted as follows: Study name, “_”, sample type(s), “_”, date (DDMonYY; the date the data was available/provided). For example, mkdir ITN_stool_17Jun22. If there are multiple sample types, pick the best single-word descriptor possible; the samples actually included within this directory will be described in better detail on AirTable (see Step 10)².
5. Move into this new directory.
6. Run rsync -avP hiseq_user@fastq.ucsf.edu:/volume1/SSD/220617_A00351_0725_AWHKX3_PE150/ . (including the final dot, which is the current directory). Everything stays the same except the run name after SSD/; use the name provided in the subject line of the CAT e-mail.
7. Type in the provided password when prompted.
8. Watch files download.
9. After downloading, review the files provided, ensuring they match the anticipated samples sent for sequencing. Any files that are not relevant to the study in question should be deleted or, if relevant to another member’s project, moved and labeled as described above.
10. Go to the AirTable base for the relevant project, and under the Results tab, add an entry for your metagenomics results, using the Notes column to state the name of the directory created in Step 4. This “result” entry must be associated with a Processing ID (or multiple Processing IDs) from the Progress Tracker, which must include a Benchling link to notes related to bench-based processing of the samples that underwent shotgun metagenomics.
11. In addition, place any documentation, including a PDF of the e-mail from the CAT sharing the data with you, in the Attachments column.
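The directory naming and download steps above can be sketched as a small helper. This is a minimal sketch rather than a lab-approved script: the function names metag_dir_name and print_download_commands are hypothetical, and the commands are only printed (not run) so they can be reviewed before being pasted on a data transfer node. The run name must still come from the subject line of the CAT e-mail.

```shell
#!/bin/sh
# Hypothetical helpers; the names are illustrative and not part of any lab pipeline.

# Build a directory name of the form Study_SampleType_DDMonYY.
# Rejects dates that do not look like DDMonYY (for example 17Jun22).
metag_dir_name() (
  study=$1; sample_type=$2; run_date=$3
  case $run_date in
    [0-3][0-9][A-Z][a-z][a-z][0-9][0-9]) ;;
    *) echo "date must be DDMonYY, e.g. 17Jun22" >&2; exit 1 ;;
  esac
  printf '%s_%s_%s\n' "$study" "$sample_type" "$run_date"
)

# Print (do not run) the commands for creating the directory and downloading.
# $1 = directory name from metag_dir_name, $2 = run name from the CAT e-mail.
print_download_commands() (
  dir_name=$1; run_name=$2
  echo "cd /wynton/group/lynch/Metag_data/"
  echo "mkdir $dir_name"
  echo "cd $dir_name"
  echo "rsync -avP hiseq_user@fastq.ucsf.edu:/volume1/SSD/$run_name/ ."
)

# Example using the study and run name quoted in the steps above.
dir_name=$(metag_dir_name ITN stool 17Jun22) &&
  print_download_commands "$dir_name" 220617_A00351_0725_AWHKX3_PE150
```

Printing the commands keeps the password prompt and the actual transfer in your hands. Under an English locale, date +%d%b%y prints today's date in the DDMonYY form, which may help if the data arrived the same day.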

Intermediate Files

Scripts for metagenomic data processing are located on GitHub and Wynton (/wynton/group/lynch/kmccauley/Metagenomics_Pipeline_KM/). SOPs for processing are also located on GitHub. As of writing, the SOP on GitHub does not account for this new protocol. Therefore, the following modifications should be made.

The sample_qc_pipeline.sh script should take the pertinent files within Metag_data as input and write its output to Metag_Processed. This will ensure that everyone has access to downstream files.
The directory in Metag_Processed should match the one created in Step 4 for raw data.
All processing using relevant scripts up to and including Humann3 table generation should be saved within Metag_Processed. The resulting table can then be moved to the preferred analysis location³.
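One way to keep the processed directory name in sync with the raw one is to derive it from the raw path. The sketch below assumes Metag_Processed sits next to Metag_data under /wynton/group/lynch/, which this document does not state explicitly; the function name processed_dir_for is hypothetical, and the root can be overridden with a second argument. It does not call sample_qc_pipeline.sh, whose arguments are described in the GitHub SOP.

```shell
#!/bin/sh
# Hypothetical helper: map a raw-data directory to its matching processed directory.
# Assumption: processed data lives in /wynton/group/lynch/Metag_Processed
# unless another root is passed as the second argument.
processed_dir_for() (
  raw_dir=${1%/}                                   # drop a trailing slash, if any
  processed_root=${2:-/wynton/group/lynch/Metag_Processed}
  printf '%s/%s\n' "${processed_root%/}" "$(basename "$raw_dir")"
)

# Example: print the processed location for the raw directory used earlier.
processed_dir_for /wynton/group/lynch/Metag_data/ITN_stool_17Jun22/
```

The printed path can then be used as the output location when running the processing scripts.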

(Dual) RNA-seq

Raw Files

Files from internal RNAseq should be placed in /wynton/group/lynch/RNAseq_data/. The first-level directory should have a descriptive name similar to metagenomics: Study name, “_”, sample type, “_”, DDMonYY, again using the best descriptor possible for your samples. Within this directory, place the raw sequencing files in a subdirectory named after the run. If more than one run contributes RNAseq data to a project, this will keep everything ordered.
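The raw layout described above, with one subdirectory per run, might look like the following. This is only an illustration: the function name make_rnaseq_raw_dir, the study name, and the run names are all made up, and the example builds the tree in a temporary directory rather than in /wynton/group/lynch/RNAseq_data/.

```shell
#!/bin/sh
# Hypothetical helper: create <root>/<Study_SampleType_DDMonYY>/<run name>/
# and print the resulting path.
make_rnaseq_raw_dir() (
  root=$1; project_dir=$2; run_name=$3
  mkdir -p "$root/$project_dir/$run_name" &&
    printf '%s\n' "$root/$project_dir/$run_name"
)

# Demonstration in a throwaway directory; on Wynton the root would be
# /wynton/group/lynch/RNAseq_data. Study and run names here are invented.
demo_root=$(mktemp -d)
make_rnaseq_raw_dir "$demo_root" MyStudy_nasal_03Feb23 run_A
make_rnaseq_raw_dir "$demo_root" MyStudy_nasal_03Feb23 run_B
ls "$demo_root/MyStudy_nasal_03Feb23"
```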

Go to the AirTable base for the relevant project, and under the Results tab, add an entry for your RNAseq results, using the Notes column to state the name of the directory created above. This “result” entry must be associated with a Processing ID (or multiple Processing IDs) from the Progress Tracker, which must include a Benchling link to notes related to bench-based processing of the samples that underwent RNAseq. There should be enough information at the Benchling link for someone else to reproduce and describe the samples that went on for sequencing without reaching out to you.

Intermediate Files

There is currently no standardized pipeline for RNA-seq processing of raw fastq files. Processed RNAseq files for each project should be in /wynton/group/lynch/RNAseq_Processed/. Within this directory, create another directory that matches the entry in RNAseq_data. Create a second-level directory for host_expression and (if applicable) microbial_expression. Each of these directories should include FastQC outputs and a gene expression (read counts) matrix, as well as a document containing all code that generated the files present in this directory.
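The processed layout can be set up the same way. The sketch below is an assumption-laden illustration: make_rnaseq_processed_dirs is a hypothetical name, the third argument (yes or no) is an invented switch for the optional microbial_expression directory, and the demonstration runs in a temporary directory instead of /wynton/group/lynch/RNAseq_Processed/.

```shell
#!/bin/sh
# Hypothetical helper: create host_expression (always) and
# microbial_expression (only when the third argument is "yes")
# under <root>/<project directory matching RNAseq_data>.
make_rnaseq_processed_dirs() (
  root=$1; project_dir=$2; with_microbial=${3:-no}
  mkdir -p "$root/$project_dir/host_expression" || exit 1
  if [ "$with_microbial" = yes ]; then
    mkdir -p "$root/$project_dir/microbial_expression" || exit 1
  fi
)

# Demonstration in a throwaway directory with an invented study name.
demo_processed=$(mktemp -d)
make_rnaseq_processed_dirs "$demo_processed" MyStudy_nasal_03Feb23 yes
ls "$demo_processed/MyStudy_nasal_03Feb23"
```

Each of the created directories would then receive the FastQC outputs, the read-count matrix, and the code document described above.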

Participant/Sample Metadata

Participant and sample metadata should ideally be retained in an “always accessible all of the time” location that is also password protected with two-factor authentication (i.e., Wynton/Lynchserver2). Though our data is often “de-identified” (see this for all requirements for de-identification), it still falls within “P3” or “Sensitive” protection levels. See this and this site from UCSF to learn more about data security classifications. Per UCSF policy, all P3 and P4 data need to be securely maintained. Never send patient data over e-mail; Box can sometimes be a secure alternative.

Within your group-user directory (i.e., /wynton/group/lynch/kmccauley), create a directory for your project with as descriptive a name as possible. Within the project-specific directory, create another directory called RawData or something similar and include all files that were provided externally for the project.

Auxiliary Bench-generated Data

The scope of this is essentially any data created on the bench – DNA concentrations, TEER values, Cell line work, etc.

The automatic location for bench-generated data should be Benchling. If a folder does not exist for your project, create one and use one of the templates designed by Claire (if not applicable, create your own). Once generated data has been finalized in Benchling, download your analytic files and store them in AirTable. If additional statistical analyses need to be performed, you can do this from your preferred analysis location.

¹ Ideally, all work should be accessible by anyone (who is a Lynch Lab member) all of the time. I am not sure yet what this means for analysis performed on laptops. As much as possible, use Wynton (specifically the /wynton/group/lynch space within a personal directory) to perform analysis to ensure accessibility of code and downstream files. Wynton now has an easy-to-use RStudio link, too! Alternatively, saving to Box Drive or utilizing GitHub could assist in transferring all relevant files and code upon a lab member’s departure.
² There are likely to be times when your data does not fit this use case (in the case of RNAseq, you may be performing it on cell lines and not a human study). Do the best you can and rely on the space available in AirTable/Benchling to provide as much information as possible to point a future person in the right direction. As mentioned in Background and Scope, this is really the ultimate goal.
³ See footnote 1.

Centralized Data SOP

Author: Katie McCauley

2022-10-20
