a-guide-to-automatically-downloading-ncbi-sra-reads

A Guide to Automatically Downloading NCBI SRA Reads

https://github.com/asadprodhan/a-guide-to-automatically-downloading-ncbi-sra-reads


Keywords

automation download ncbi reads sra

Repository

A Guide to Automatically Downloading NCBI SRA Reads

Basic Info
  • Host: GitHub
  • Owner: asadprodhan
  • License: gpl-3.0
  • Language: Shell
  • Default Branch: main
  • Homepage:
  • Size: 637 KB
Statistics
  • Stars: 0
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 1
Created almost 2 years ago · Last pushed almost 2 years ago

README.md

A Guide to Automatically Downloading NCBI SRA Reads

M. Asaduzzaman Prodhan*

DPIRD Diagnostics and Laboratory Services, Department of Primary Industries and Regional Development
3 Baron-Hay Court, South Perth, WA 6151, Australia. *Correspondence: prodhan82@gmail.com





The National Center for Biotechnology Information (NCBI) is a global public repository that houses a vast collection of genomic information. Within the NCBI, the Sequence Read Archive (SRA) is a dedicated repository for the raw DNA sequencing reads that are submitted to the NCBI. The SRA therefore serves as a critical resource for researchers, providing open access to millions of sequences from diverse organisms and environments. As such, SRA reads fuel research ranging from genome assembly and variation analysis to transcriptomics and metagenomics.


Here, I present a guide on how to automatically download the NCBI SRA reads that you are interested in.


Content

Step 1: Download and Setup the SRA Tool Kit

Step 2: Collect SRA Accession Numbers

Step 3: Download Reads

POTENTIAL ERRORS


Step 1: Download and Setup the SRA Tool Kit


  • Create a directory on your Linux Desktop

mkdir sratoolkits

  • cd to sratoolkits and download the SRA Toolkit using the following command:

sudo wget https://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/current/sratoolkit.current-ubuntu64.tar.gz

Visit the NCBI SRA Toolkit[^SRA] manual (https://github.com/ncbi/sra-tools/wiki/01.-Downloading-SRA-Toolkit)

  • Move the sratoolkits directory into your /usr/bin directory

  • cd to the sratoolkits/sratoolkit.3.1.1-ubuntu64/bin directory

  • Find out the path by running the following command

pwd

  • Now, add this path to your PATH variable as follows:

What is a PATH variable? See an explanation here[^PATH]

export PATH=$PATH:path/to/sratoolkits/sratoolkit.3.1.1-ubuntu64/bin

Note that there should not be any space on either side of the "=" sign in the above command.
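The steps above assume that the downloaded archive has already been unpacked, since the versioned `bin` directory only exists after extraction. The following is a minimal sketch of that step. The helper name `unpack_toolkit` is hypothetical (it is not part of the SRA Toolkit), and the versioned directory name inside the archive depends on the release you downloaded, so the sketch looks it up rather than hard-coding it.

```shell
#!/bin/bash

# Hypothetical helper (not part of the SRA Toolkit): unpack a toolkit
# archive into a destination directory and print the path of the
# bin directory found inside it.
unpack_toolkit() {
    local archive="$1"
    local dest="$2"

    mkdir -p "$dest" || return 1
    # -x extract, -z gunzip, -f archive file, -C target directory
    tar -xzf "$archive" -C "$dest" || return 1

    # The top-level directory name carries the version number, so
    # search for the bin directory instead of assuming its name.
    find "$dest" -maxdepth 2 -type d -name bin | head -n 1
}

# Example usage (uncomment after downloading the archive):
# bin_dir=$(unpack_toolkit sratoolkit.current-ubuntu64.tar.gz "$HOME/sratoolkits")
# export PATH="$PATH:$bin_dir"
```

The printed path plays the same role as the output of `pwd` in the steps above.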


However, once you log out, the path will be dropped from the PATH variable. If you want to add the above (or any) path to the PATH variable permanently, follow these steps:


nano ~/.bashrc

The above command will open the .bashrc file in your home directory.

Now, copy and paste the following line into the .bashrc file, then save and close it.

export PATH=$PATH:path/to/sratoolkits/sratoolkit.3.1.1-ubuntu64/bin
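The same edit can be made from the command line without opening an editor. The following is a minimal sketch, assuming you use Bash and that `~/.bashrc` is your startup file. The helper name `add_path_line` is hypothetical; it appends the export line only if that exact line is not already there, so running it twice does not create duplicates.

```shell
#!/bin/bash

# Hypothetical helper: append an "export PATH" line for a directory to a
# shell startup file, unless that exact line is already present.
add_path_line() {
    local dir="$1"
    local rcfile="${2:-$HOME/.bashrc}"
    local line="export PATH=\"\$PATH:$dir\""

    # -F: fixed string, -x: match the whole line, -q: no output
    if grep -Fxq "$line" "$rcfile" 2>/dev/null; then
        echo "Already present in $rcfile"
    else
        printf '%s\n' "$line" >> "$rcfile"
        echo "Added to $rcfile"
    fi
}

# Example usage (replace the placeholder path with the output of pwd):
# add_path_line "path/to/sratoolkits/sratoolkit.3.1.1-ubuntu64/bin"
# source ~/.bashrc
```

The change takes effect in new terminals, or in the current one after running `source ~/.bashrc`.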


Has the sratoolkit been added to your PATH variable?


To test, change to any other directory and run the following command:

fasterq-dump


fasterq-dump is one of the executables located in the SRA Toolkit bin directory. If the toolkit has been added to your PATH variable, the above command will run from any directory on your Linux computer without you needing to specify the path to the executable, and it will print its usage options on the screen. Then, it's all good!
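A quieter way to run the same check is to ask the shell where it would find each command. The following is a minimal sketch using the shell built-in `command -v`; the helper name `check_tools` is hypothetical.

```shell
#!/bin/bash

# Hypothetical helper: report whether each named command can be found
# through the PATH variable. Returns non-zero if any of them is missing.
check_tools() {
    local missing=0
    local tool
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool -> $(command -v "$tool")"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Example usage:
# check_tools prefetch fasterq-dump
```

If either tool is reported as missing, the bin directory is probably not on the PATH of the current shell session.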


Now, you are ready to download the reads from the NCBI SRA.


Step 2: Collect SRA Accession Numbers


Here, I am going to use BioProject PRJNA340941[^Hakim] for demonstration purposes. This BioProject contains Illumina sequencing of 16S rRNA from a root microbiome study[^Hakim]. You can try downloading reads from any study, such as transcriptomics[^RNAseq] or mitochondrial genomics[^Mitogenomics], that has a BioProject number.


We will use the BioProject from the following publication as an example:

https://www.sciencedirect.com/science/article/pii/S094450131930970X#sec0010

  • Open the NCBI SRA

https://www.ncbi.nlm.nih.gov/sra/docs/

  • Search for the BioProject number as follows


Figure 1: NCBI SRA Search Box.


  • Scroll down to the bottom of the page, select "Run Selector", and press Go. See the screenshot below


Figure 2: NCBI SRA Run Selector.


  • Click on the Accession List and Metadata as marked on the following screenshot


Figure 3: NCBI SRA Accession List and Metadata.


  • The Accession List will look like this


Figure 4: NCBI SRA Accession List.


  • The Metadata will look like this


Figure 5: NCBI SRA Metadata.
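Before downloading, it can help to sanity-check the Accession List, which is a plain text file with one run accession per line. The following is a minimal sketch with a hypothetical helper named `clean_accessions`. It assumes that run accessions start with SRR, ERR, or DRR followed by digits, and it drops blank lines, Windows line endings, and anything that does not match that pattern.

```shell
#!/bin/bash

# Hypothetical helper: print only the lines of a file that look like
# SRA run accessions (assumed pattern: SRR, ERR, or DRR plus digits).
clean_accessions() {
    # 1. tr removes Windows carriage returns.
    # 2. sed trims leading and trailing whitespace.
    # 3. grep keeps lines that match the accession pattern.
    tr -d '\r' < "$1" \
        | sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' \
        | grep -E '^[SED]RR[0-9]+$'
}

# Example usage:
# clean_accessions SRR_Acc_List.txt | wc -l    # number of usable accessions
# wc -l < SRR_Acc_List.txt                     # number of lines in the file
```

If the two counts differ, some lines in the file are blank or are not run accessions.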



Step 3: Download Reads


  • There are two steps to download the reads

    • The first step will download the reads in SRA format using a command called prefetch
    • The second step will convert the SRA format into fastq using a command called fasterq-dump


The following script combines both commands to automate downloading reads from a list of accessions.


```
#!/bin/bash

# Description: This script automatically downloads NCBI SRA reads
# Author: Asad Prodhan PhD
# Email: prodhan82@gmail.com
# Date: 2024-07-01
# Version: 1.0

# File containing the list of SRA accession numbers
SRA_LIST="SRR_Acc_List.txt"

# Loop through each accession number in the list
while IFS= read -r accession; do
    echo "Processing $accession"
    # Download the run in SRA format, then convert it to fastq
    prefetch.3.1.1 "$accession" && fasterq-dump.3.1.1 "$accession" --outdir reads
done < "$SRA_LIST"

# The end
```


Note that the older versions of prefetch cannot locate the reads in the NCBI SRA. You need to use the latest version.


[1] Download the script here and save it as prefetch_fasterq-dump.sh


[2] Put the above script and your Accession List in the same directory


[3] Run the following command to convert both files to Unix format


dos2unix *


[4] Run the following command to give yourself execute permission


chmod +x *


[5] Now, run the script as follows


./prefetch_fasterq-dump.sh


The reads will be automatically downloaded and saved in the reads directory.


Figure 6: Automatic NCBI SRA Read Download.
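For long accession lists, it can be useful to record which downloads failed so that they can be retried later. The following is a sketch of one way to do that, not the author's script. The function name `download_all`, the variables `PREFETCH` and `FASTERQ_DUMP`, and the file name `failed_accessions.txt` are all hypothetical choices; only the `prefetch` and `fasterq-dump` commands and the `--outdir` option come from the script above.

```shell
#!/bin/bash

# Command names are kept in variables so that a versioned executable
# such as prefetch.3.1.1 can be substituted without editing the loop.
PREFETCH="${PREFETCH:-prefetch}"
FASTERQ_DUMP="${FASTERQ_DUMP:-fasterq-dump}"

# Hypothetical helper: download every accession listed in a file and
# append the ones that fail to a separate file for a later retry.
download_all() {
    local list="$1"
    local outdir="${2:-reads}"
    local failed="${3:-failed_accessions.txt}"
    local accession

    mkdir -p "$outdir"
    : > "$failed"   # start with an empty failure list

    # The extra test also handles a last line without a trailing newline.
    while IFS= read -r accession || [ -n "$accession" ]; do
        # Strip Windows carriage returns and skip blank lines.
        accession=$(printf '%s' "$accession" | tr -d '\r')
        [ -z "$accession" ] && continue

        echo "Processing $accession"
        # Redirect stdin so the tools cannot consume the accession list.
        if ! { "$PREFETCH" "$accession" \
               && "$FASTERQ_DUMP" "$accession" --outdir "$outdir"; } < /dev/null; then
            echo "$accession" >> "$failed"
        fi
    done < "$list"
}

# Example usage:
# PREFETCH=prefetch.3.1.1 FASTERQ_DUMP=fasterq-dump.3.1.1
# download_all SRR_Acc_List.txt reads failed_accessions.txt
```

After a run, passing `failed_accessions.txt` back in as the list (with a different failure file name) retries only the accessions that did not complete.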



POTENTIAL ERRORS


  • If you get an error that SRR_Acc_List.txt is a non-kart file, then change the file extension from txt to kart:


mv SRR_Acc_List.txt SRR_Acc_List.kart

Then update the file name in the script to match and run the script again

./prefetch_fasterq-dump.sh


  • If you get the following error, you might need to use the latest version of prefetch and fasterq-dump. You can specify the version in the command name, for example prefetch.3.1.1 and fasterq-dump.3.1.1. See the above script.


Figure 7: Version Conflict.
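To see which versioned executables are available on your system, you can search the directories in your PATH variable. The following is a minimal sketch with a hypothetical helper named `list_tool_versions`. It assumes that versioned copies follow the `name.version` naming seen in the script above, such as prefetch.3.1.1.

```shell
#!/bin/bash

# Hypothetical helper: list every executable on the PATH whose file name
# is the given tool name or starts with it followed by a dot,
# for example prefetch and prefetch.3.1.1.
list_tool_versions() {
    local name="$1"
    local dir file
    local old_ifs="$IFS"

    IFS=:                      # split PATH on colons
    for dir in $PATH; do
        IFS="$old_ifs"
        for file in "$dir/$name" "$dir/$name".*; do
            [ -f "$file" ] && [ -x "$file" ] && echo "$file"
        done
    done
    IFS="$old_ifs"
}

# Example usage:
# list_tool_versions prefetch
# list_tool_versions fasterq-dump
# prefetch --version    # prints the version of the default prefetch
```

If several versions are listed, the highest version number is the one to use in the script.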



REFERENCES

[^SRA]: NCBI SRA Toolkit. [cited 2 Jul 2024]. Available: https://github.com/ncbi/sra-tools/wiki/01.-Downloading-SRA-Toolkit

[^PATH]: Prodhan MA. About the PATH. Zenodo. 2024 [cited 26 Apr 2024]. doi:10.5281/zenodo.11068991

[^Hakim]: Hakim S, Mirza BS, Imran A, Zaheer A, Yasmin S, Mubeen F, et al. Illumina sequencing of 16S rRNA tag shows disparity in rhizobial and non-rhizobial diversity associated with root nodules of mung bean (Vigna radiata L.) growing in different habitats in Pakistan. Microbiol Res. 2020;231: 126356. doi:10.1016/j.micres.2019.126356

[^RNAseq]: Prodhan MA, Pariasca-Tanaka J, Ueda Y, Hayes PE, Wissuwa M. Comparative transcriptome analysis reveals a rapid response to phosphorus deficiency in a phosphorus-efficient rice genotype. Sci Rep. 2022;12: 9460. doi:10.1038/s41598-022-13709-w

[^Mitogenomics]: Prodhan MA, Widmer M, Kinene T, Kehoe M. Whole mitochondrial genomes reveal the relatedness of the browsing ant incursions in Australia. Sci Rep. 2023;13: 10273. doi:10.1038/s41598-023-37425-1

Owner

  • Name: Asad Prodhan
  • Login: asadprodhan
  • Kind: user
  • Location: Perth, Australia
  • Company: Department of Primary Industries and Regional Development

Laboratory Scientist at DPIRD. My work involves Oxford Nanopore Sequencing and Bioinformatics for pest and pathogen diagnosis.

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this article, please cite it as below."
authors:
- family-names: "Prodhan"
  given-names: "M. Asaduzzaman"
  orcid: "https://orcid.org/0000-0002-1320-3486"
title: "A Guide to Automatically Downloading NCBI SRA Reads"
version: 1
doi: 10.5281/zenodo.12622074
date-released: '2024-07-02'
repository-code: "https://github.com/asadprodhan/How-to-automatically-download-reads-from-the-NCBI-SRA/tree/main"
preferred-citation:
  type: article
  authors:
  - family-names: "Prodhan"
    given-names: "M. Asaduzzaman"
    orcid: "https://orcid.org/0000-0002-1320-3486"
  doi: "10.5281/zenodo.12622074"
  journal: "Zenodo"
  title: "A Guide to Automatically Downloading NCBI SRA Reads"
  year: 2024
