I believe most of these steps can be done either before or after you’ve uploaded your sequence files. I have described the process in the order that makes the most sense to me, but will point out where deviations can be made.
Go to the ENA website to register an account and/or log in. There may be a lab account, but I don’t know the login details. Registering will generate a submission account ID (your “WebinID”, in the format Webin-XXXXX), which you will use throughout the submission process for any studies you submit data for.
After logging in, click the “Register Study” button under “Studies (Projects)”. It should have a yellow + sign.
Fill in all required fields (PubMed IDs can be added after acceptance of a manuscript). All fields can be modified later.
Click Submit.
At this point you will receive an e-mail thanking you for your “recent submission to the European Nucleotide Archive”. A “study accession number” will be included in this e-mail; this is the reference number that would need to be mentioned in the associated manuscript for which you are uploading the data.
Return to the “Welcome” page.
Under the “Samples” header (green), choose “Register Samples”. Then choose “Download spreadsheet to register samples” (since you haven’t made a spreadsheet yet).
Under “Select checklist group”, choose “Environmental Checklists”. This will take you to a screen with sample checklists for different types of samples. Options include host-associated, human-associated, human gut, human oral, human skin, etc.
When you make a selection, you will be provided with the list of mandatory fields that you will need to fill in for each of your samples. Use the “Validation” column to understand which fields require specific drop-down values (for instance, geographic location requires that the United States be formatted as “USA”). At the top of this list, check the “Show Description” check box for additional tidbits of information. In the old system, you could identify fields that you knew would be consistent across all samples and start to fill them in, but this is not possible in the new system, so scroll to the bottom and click “Next”.
This page will ask you to download (green button) the template you just set up so that you can fill it in for each of your samples. “Back” will take you back to the list of mandatory fields and their descriptions. Fill out the spreadsheet.
I can only give general hints and tips about doing this. A lot of this will be trial and error when you’re first starting out with this process. If you have your mapping file handy for your run, you can copy and paste your sample names into the sample_alias field, which can be a helpful start. I don’t remember the maximum number of samples you can upload at one time, but it’s greater than 500 and less than 4,000… Also, the first column (tax_id) should be 256318, which is the NCBI taxonomy ID for “metagenome”, and the second column (scientific_name) should be “metagenome”; these values can be the same for all samples. This, however, assumes that you are uploading amplicon or shotgun data, and if you’re doing something a little different, you’ll need to tread with caution.
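As a rough illustration of the constant columns described above, the following shell sketch turns a list of sample names into rows you could paste into the downloaded template. The function name `make_sample_rows` and the input file `sample_names.txt` are hypothetical, and the real template contains further mandatory columns (and possibly extra header rows) that this sketch does not produce.

```shell
# Sketch: build the three leading columns of the sample sheet.
# Input:  a plain text file with one sample name per line (hypothetical file).
# Output: tab-separated rows of tax_id, scientific_name, sample_alias.
make_sample_rows() {
    names_file="$1"
    printf 'tax_id\tscientific_name\tsample_alias\n'
    # Strip a trailing carriage return (files saved on Windows), skip blank lines.
    awk -v OFS='\t' '{ sub(/\r$/, "") } NF { print "256318", "metagenome", $0 }' "$names_file"
}

# Usage (file names are placeholders):
# make_sample_rows sample_names.txt > sample_rows.tsv
```

The output is only a starting point: the remaining mandatory fields still have to be filled in by hand according to the checklist you selected.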
If you didn’t need to log out of SRA while filling out your spreadsheet then go back to where you left off, click “back”, and click “Upload filled spreadsheet to register samples”. Then browse to where you saved your checklist (probably in your Downloads folder…). Click “Submit Completed Spreadsheet”.
Once your file is done uploading (could take many minutes, depending on the number of samples), it will give you its status via a pop-up window. If you have any errors, it will tell you, and you’ll need to decipher those errors, correct, and resubmit. If not, you will get a page with a blue box at the top that says “The submission was successful”, and a list of all samples that you registered. Click “Close”, and return to the Dashboard using the sandwich icon (three horizontal lines) in the upper left portion of the page.
This step assumes that you have run your samples through the 16S Pipeline used in the Lynch Lab starting in December 2019. If your data has not undergone this process, you will need to ensure that you have your forward and reverse raw reads ready for this next step.
First, sequence files need to be compressed before uploading to ENA. If this needs to be done for your samples, go into a directory with the forward and/or reverse reads that you plan to upload and run gzip *.fastq. This will compress all files ending in .fastq (and therefore append .gz to each file name).
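If you would like a little more feedback than the bare gzip command gives, here is a minimal sketch that compresses each file and then tests every resulting archive with `gzip -t`. The function name `compress_fastq` is made up for this example, and the integrity test is an optional extra step, not something ENA requires.

```shell
# Sketch: gzip every .fastq file in a directory, then test each archive.
compress_fastq() {
    dir="${1:-.}"                      # default to the current directory
    for f in "$dir"/*.fastq; do
        [ -e "$f" ] || continue        # no .fastq files: the glob stays literal
        gzip "$f"                      # replaces sample.fastq with sample.fastq.gz
    done
    for gz in "$dir"/*.fastq.gz; do
        [ -e "$gz" ] || continue
        # gzip -t checks the compressed file's integrity without unpacking it.
        gzip -t "$gz" || echo "Corrupt archive: $gz" >&2
    done
}

# Usage:
# compress_fastq /path/to/reads
```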
Checksums are also needed for this process (see footnote 1 for a quick description). There are a few ways to go about this. You can create a single md5 checksum file with all checksums in the directory and include this list when submitting your files (later) by running md5sum * > checksum_list.txt, OR you can create individual md5 files for each fastq file and upload them with your fastq files. This latter method is actually the easiest. Similar to above, after gzipping the files, run the following command in the same directory: for i in *.fastq.gz ; do md5sum "$i" > "$i.md5" ; done. You’ll get a ton of very small files that have your fastq name, plus an .md5 extension. If you view one of these files, it will contain a string of letters and numbers (the checksum) followed by the name of the file it belongs to.
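Before uploading, it can be reassuring to confirm that each .md5 file still matches its fastq file. The sketch below does this with `md5sum -c`; the function name `verify_md5_files` is hypothetical, and it assumes the .md5 files were created with the loop above (so each one records a bare file name and has to be checked from inside the same directory).

```shell
# Sketch: re-check every <name>.fastq.gz.md5 file against its fastq file.
# The body runs in a subshell, so the cd does not affect the calling shell.
verify_md5_files() (
    cd "${1:-.}" || exit 1
    status=0
    for m in *.fastq.gz.md5; do
        [ -e "$m" ] || continue        # no .md5 files present
        # md5sum -c recomputes the checksum of the file named inside $m.
        md5sum -c "$m" >/dev/null 2>&1 || {
            echo "Checksum mismatch: $m" >&2
            status=1
        }
    done
    exit "$status"
)

# Usage:
# verify_md5_files /path/to/reads && echo "All checksums match"
```

A non-zero exit status means at least one file changed after its checksum was written, in which case regenerating the .md5 file (or re-copying the fastq file) is the usual remedy.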
To upload your sequences (and the resulting md5sum files, if you generated them) to ENA, you can use different command line programs. If you are on Lynchserver2, you will likely want to use Aspera; lftp should be used on Wynton (see here for additional details on how to use each of these command line interfaces). This step can be completed in as many commands as you need/want, but generally does take some time to complete.
Either in piecemeal fashion or at the very end of uploading all of your files, you will need to start “Submitting” your sequence reads. On the dashboard, go to the orange section (“Raw Reads (Experiments and Runs)”), and click “Submit Reads”.
Now choose the option that says “Submit paired reads using two Fastq files”. This will give you a set of the required variables needed for each sample, as well as a sense of what values you can include, very similar to submitting your samples. Click “Next” to download the template like before, and click “back” to have the required formats handy.
However you prefer, fill out the spreadsheet for the samples that you’ve uploaded so far. Typically (your mileage may vary), you can make the names of your forward and reverse files by using an Excel formula, something like =A3&"_R1.fastq.gz", and copying the formula down for the remaining rows; when you save the file as a TSV, the resulting text will be saved, but not the formula. If you opted to upload your md5 files individually, you don’t need to fill out the md5 column. If you chose to add your md5 sums later, this is where you would need the information from that file to help populate this file.
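If you would rather not use Excel, the same file names can be generated on the command line. This sketch assumes, hypothetically, that reverse reads follow the same pattern as forward reads with _R2 in place of _R1, and that checksum_list.txt was produced by md5sum as described earlier. The function name `make_run_rows` and the column headers are placeholders; use the headers from the template you downloaded.

```shell
# Sketch: derive forward/reverse file names and look up their md5 sums.
# $1: text file with one sample name per line (hypothetical)
# $2: md5sum output, one "<checksum>  <file name>" pair per line
make_run_rows() {
    names_file="$1"
    checksum_list="$2"
    # Placeholder headers; replace with the ones in the ENA template.
    printf 'sample_alias\tforward_file_name\tforward_file_md5\treverse_file_name\treverse_file_md5\n'
    awk -v OFS='\t' '
        # Checksum list: remember the checksum for each file name.
        FILENAME == ARGV[1] { name = $2; sub(/^\*/, "", name); md5[name] = $1; next }
        # Sample names: drop a trailing carriage return, skip blank lines.
        { sub(/\r$/, "") }
        NF {
            fwd = $0 "_R1.fastq.gz"
            rev = $0 "_R2.fastq.gz"
            print $0, fwd, md5[fwd], rev, md5[rev]
        }
    ' "$checksum_list" "$names_file"
}

# Usage (file names are placeholders):
# make_run_rows sample_names.txt checksum_list.txt > run_rows.tsv
```

An empty md5 field in the output means the file name was not found in the checksum list, which is worth checking before you submit.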
Continue to “Submit Completed Spreadsheet” until you receive a message that you have successfully uploaded your sequences to ENA.
There are a few tools within ENA that you can use to determine which files have been associated with samples and which have not. There is a page for “Unsubmitted Files”, but it gets updated only once per day, so you may need to visit this page the next day to determine if there are any missing samples. It generates a list of files that haven’t been associated with a “Run” or “Samples”. You used to receive a happy e-mail when everything was successfully uploaded, but that part seems to be a little more cryptic now…
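While waiting for the “Unsubmitted Files” page to refresh, you can do a rough check on your own machine. The sketch below lists local .fastq.gz files whose names do not appear anywhere in a run spreadsheet saved as a TSV. It is only a local approximation and says nothing about what ENA has actually processed; the function name `unlisted_files` and the file names in the usage line are hypothetical.

```shell
# Sketch: print local .fastq.gz files that the run spreadsheet never mentions.
# $1: directory holding the fastq files
# $2: the filled-in run spreadsheet, saved as tab-separated text
unlisted_files() {
    dir="$1"
    sheet="$2"
    for f in "$dir"/*.fastq.gz; do
        [ -e "$f" ] || continue
        base=$(basename "$f")
        # Look for a cell that is exactly the file name; exit 0 if found.
        awk -F'\t' -v name="$base" '
            { sub(/\r$/, "") }
            { for (i = 1; i <= NF; i++) if ($i == name) found = 1 }
            END { exit !found }
        ' "$sheet" || echo "$base"
    done
}

# Usage (paths are placeholders):
# unlisted_files /path/to/reads run_rows.tsv
```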
1. Checksums are combinations of letters and numbers that uniquely identify the contents of your file. You make one on your end, ENA makes one on its end, and they’re compared. If they match, it means that the file was successfully transferred, but if they don’t, it means that the file was incomplete or corrupted since the checksum was made.