A Guide to Automatically Downloading NCBI SRA Reads

https://github.com/asadprodhan/a-guide-to-automatically-downloading-ncbi-sra-reads

Keywords

automation download ncbi reads sra

Repository

A Guide to Automatically Downloading NCBI SRA Reads

Basic Info

Host: GitHub
Owner: asadprodhan
License: gpl-3.0
Language: Shell
Default Branch: main
Homepage:
Size: 637 KB

Statistics

Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Releases: 1

Topics

automation download ncbi reads sra

Created about 2 years ago · Last pushed about 2 years ago

A Guide to Automatically Downloading NCBI SRA Reads

M. Asaduzzaman Prodhan^*

DPIRD Diagnostics and Laboratory Services, Department of Primary Industries and Regional Development

3 Baron-Hay Court, South Perth, WA 6151, Australia. ^*Correspondence: prodhan82@gmail.com

The National Center for Biotechnology Information (NCBI) is a global public repository that houses a vast collection of genomic information. Within the NCBI, the Sequence Read Archive (SRA) is a dedicated repository for the raw DNA sequencing reads that are submitted to the NCBI. The SRA is therefore a critical resource for researchers, providing open access to millions of sequences from diverse organisms and environments. As such, SRA reads fuel research ranging from genome assembly and variation analysis to transcriptomics and metagenomics.

Here, I present a guide on how to automatically download the NCBI SRA reads that you are interested in.

Content

Step 1: Download and Setup the SRA Tool Kit

Step 2: Collect SRA Accession Numbers

Step 3: Download Reads

POTENTIAL ERRORS

Step 1: Download and Setup the SRA Tool Kit

Create a directory on your Linux Desktop

mkdir sratoolkits

cd to sratoolkits and download the SRA Toolkit using the following link (sudo is not needed when downloading into a directory that you own):

wget https://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/current/sratoolkit.current-ubuntu64.tar.gz

Visit the NCBI SRA Toolkit[^SRA] manual (https://github.com/ncbi/sra-tools/wiki/01.-Downloading-SRA-Toolkit)
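The downloaded file is a compressed archive, so it has to be unpacked before the bin directory used in the next step exists. Below is a minimal sketch of that step. The helper name unpack_toolkit is made up for illustration, and the sketch assumes that the archive unpacks into a single directory whose name starts with "sratoolkit." (as in sratoolkit.3.1.1-ubuntu64).

```shell
# Hypothetical helper: unpack a toolkit archive and print the bin directory inside it.
# Usage: unpack_toolkit ARCHIVE [DESTINATION]
unpack_toolkit() {
    archive="$1"
    dest="${2:-.}"
    mkdir -p "$dest"
    tar -xzf "$archive" -C "$dest" || return 1
    # Assumption: the unpacked directory is named sratoolkit.<version>-<platform>
    ls -d "$dest"/sratoolkit.*/bin
}

# Unpack only if the archive from the wget step is present in this directory
if [ -f sratoolkit.current-ubuntu64.tar.gz ]; then
    unpack_toolkit sratoolkit.current-ubuntu64.tar.gz
fi
```

The printed path is the one to add to the PATH variable. The version number in it depends on what the "current" link pointed to on the day of the download, so it may differ from 3.1.1.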

Move the sratoolkits directory into your /usr/bin directory
cd to the sratoolkits/sratoolkit.3.1.1-ubuntu64/bin directory (the version number in the directory name may differ on your computer)
Find out the path by running the following command

pwd

Now, add this path to your PATH variable as follows:

What is a PATH variable? See an explanation here[^PATH]

export PATH=$PATH:path/to/sratoolkits/sratoolkit.3.1.1-ubuntu64/bin

Note that there should not be any space on either side of the "=" sign in the above command
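Running the export command several times in one session adds the same directory to PATH again and again. The sketch below avoids that by checking first. The function name add_to_path and the SRA_BIN location are placeholders chosen for illustration; replace the location with the output of pwd from the previous step.

```shell
# Hypothetical helper: append a directory to PATH only if it is not already there
add_to_path() {
    case ":$PATH:" in
        *":$1:"*) ;;               # already present, nothing to do
        *) PATH="$PATH:$1" ;;      # otherwise append it
    esac
    export PATH
}

# Placeholder location of the toolkit bin directory
SRA_BIN="$HOME/sratoolkits/sratoolkit.3.1.1-ubuntu64/bin"

add_to_path "$SRA_BIN"
add_to_path "$SRA_BIN"   # the second call leaves PATH unchanged
```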

However, when you log out, the path will be dropped from the PATH variable. If you want to add the above (or any) path to the PATH variable permanently, follow these steps:

nano ~/.bashrc

The above command will open the .bashrc file in your home directory.

Now, copy and paste the following command at the end of the .bashrc file, then save and close. The change takes effect in new terminal sessions, or in the current one after running source ~/.bashrc.

export PATH=$PATH:path/to/sratoolkits/sratoolkit.3.1.1-ubuntu64/bin
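The same edit can be made without opening an editor. This is only a sketch: persist_path is a hypothetical helper name, and it assumes that your login shell is bash and reads ~/.bashrc. It appends the export line only when the exact same line is not already in the file.

```shell
# Hypothetical helper: add an export line for DIR to an rc file, once
# Usage: persist_path DIR [RCFILE]
persist_path() {
    dir="$1"
    rcfile="${2:-$HOME/.bashrc}"
    # Single quotes keep $PATH literal so it is expanded at login, not now
    line='export PATH=$PATH:'"$dir"
    if ! grep -qxF "$line" "$rcfile" 2>/dev/null; then
        echo "$line" >> "$rcfile"
    fi
}

# Example call (commented out so that nothing is changed by accident):
# persist_path "$HOME/sratoolkits/sratoolkit.3.1.1-ubuntu64/bin"
```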

Has the sratoolkit been added to your PATH variable?

To test, change to a different directory and run the following command:

fasterq-dump

fasterq-dump is one of the executables located in the SRA Toolkit bin directory. If the SRA Toolkit has been added to your PATH variable, the above command can be run from any directory on your Linux computer without specifying the path of the executable. The command will then print its usage options on the screen. If it does, it's all good!
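A quieter way to run the same check is to ask the shell where it finds the executable. In the sketch below, check_tool is a hypothetical helper name; command -v is a standard shell builtin.

```shell
# Hypothetical helper: report whether a program can be found through PATH
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1 found at: $(command -v "$1")"
    else
        echo "$1 not found in PATH" >&2
        return 1
    fi
}

# Check the two programs used later in this guide
check_tool prefetch || echo "Check the export line in ~/.bashrc"
check_tool fasterq-dump || echo "Check the export line in ~/.bashrc"
```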

Now, you are ready to download the reads from the NCBI SRA

Step 2: Collect SRA Accession Numbers

Here, I am going to use BioProject PRJNA340941[^Hakim] for demonstration purposes. This BioProject contains Illumina sequencing of 16S rRNA from a root microbiome study[^Hakim]. You can try downloading reads from any study that has a BioProject number, such as transcriptomics[^RNAseq] or mitochondrial genomics[^Mitogenomics].

We will use the BioProject from the following publication as an example:

https://www.sciencedirect.com/science/article/pii/S094450131930970X#sec0010

Open the NCBI SRA

https://www.ncbi.nlm.nih.gov/sra/docs/

Search for the BioProject number as follows

Figure 1: NCBI SRA Search Box.

Scroll down to the bottom of the page and select "Run Selector" and press Go. See the screenshot below

Figure 2: NCBI SRA Run Selector.

Click on the Accession List and Metadata as marked on the following screenshot

Figure 3: NCBI SRA Accession List and Metadata.

The Accession List will look like this

Figure 4: NCBI SRA Accession List.

The Metadata will look like this

Figure 5: NCBI SRA Metadata.
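The Accession List is a plain text file with one run accession per line. Before using it, it can help to strip Windows line endings and drop anything that does not look like an accession. The sketch below does that under two stated assumptions: run accessions start with SRR, ERR or DRR followed by digits, and clean_accessions is a hypothetical helper name. The accessions in the demonstration are placeholders, not real runs from PRJNA340941.

```shell
# Hypothetical helper: print only the lines that look like run accessions
clean_accessions() {
    # Remove carriage returns, then keep SRR/ERR/DRR followed by digits
    tr -d '\r' < "$1" | grep -E '^[SED]RR[0-9]+$'
}

# Demonstration with a throwaway file containing placeholder accessions
demo_list=$(mktemp)
printf 'SRR0000001\r\nSRR0000002\r\n\r\nnot-an-accession\n' > "$demo_list"
clean_accessions "$demo_list"
rm -f "$demo_list"
```

For the demonstration file, only the two placeholder accessions are printed. On a real list, the output can be redirected to a new file and used in Step 3.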

Step 3: Download Reads

There are two steps to download the reads
- The first step will download the reads in SRA format using a command called prefetch
- The second step will convert the SRA format into fastq using a command called fasterq-dump

The following script has combined both commands in a single script to automate downloading reads from a list of accessions

```
#!/bin/bash

# Description: This script automatically downloads NCBI SRA reads
# Author: Asad Prodhan PhD
# Email: prodhan82@gmail.com
# Date: 2024-07-01
# Version: 1.0

# File containing the list of SRA accession numbers
SRA_LIST="SRR_Acc_List.txt"

# Loop through each accession number in the list
while IFS= read -r accession; do
    echo "Processing $accession"
    # Download the run in SRA format, then convert it to fastq
    prefetch.3.1.1 "$accession" && fasterq-dump.3.1.1 "$accession" --outdir reads
done < "$SRA_LIST"

# The end
```
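For long accession lists, it can be useful to know which downloads failed without reading the whole log. The following variation is a sketch, not the script distributed with this guide. The function name download_all, the PREFETCH and FASTERQ_DUMP variables, and the failed_accessions.txt file are all made up for illustration. The variables default to the unversioned program names and can be set to prefetch.3.1.1 and fasterq-dump.3.1.1 if needed.

```shell
#!/bin/bash
# Programs to call; override these to pin a version
PREFETCH="${PREFETCH:-prefetch}"
FASTERQ_DUMP="${FASTERQ_DUMP:-fasterq-dump}"

# Hypothetical helper: download every accession in LIST into OUTDIR
# and record the ones that failed in failed_accessions.txt
download_all() {
    list="$1"
    outdir="${2:-reads}"
    mkdir -p "$outdir"
    : > failed_accessions.txt
    # The extra test keeps a last line that has no trailing newline
    while IFS= read -r accession || [ -n "$accession" ]; do
        # Drop carriage returns and stray spaces, skip empty lines
        accession=$(printf '%s' "$accession" | tr -d '[:space:]')
        if [ -z "$accession" ]; then continue; fi
        echo "Processing $accession"
        # </dev/null stops the tools from reading the accession list
        if "$PREFETCH" "$accession" </dev/null &&
           "$FASTERQ_DUMP" "$accession" --outdir "$outdir" </dev/null; then
            echo "Finished $accession"
        else
            echo "$accession" >> failed_accessions.txt
        fi
    done < "$list"
}

# Run only when the accession list is present in this directory
if [ -f SRR_Acc_List.txt ]; then
    download_all SRR_Acc_List.txt reads
fi
```

After a run, failed_accessions.txt can be passed back to download_all to retry only the runs that did not complete.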

Note that older versions of prefetch may fail to locate the reads in the NCBI SRA, so use the latest version.

[1] Download the script from the repository and save it as prefetch_fasterq-dump.sh

[2] Put the above script and your Accession List in the same directory

[3] Run the following command to convert both files to Unix format

dos2unix *

[4] Run the following command to give the files execute permission

chmod +x *

[5] Now, run the script as follows

./prefetch_fasterq-dump.sh

The reads will be automatically downloaded and saved in the reads directory
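Once the script has finished, a quick comparison of the accession list with the contents of the reads directory shows whether anything is missing. The sketch below assumes that fasterq-dump names its output ACCESSION.fastq for single-end runs and ACCESSION_1.fastq, ACCESSION_2.fastq for paired-end runs; if your files are named differently, adjust the patterns. The helper name missing_reads is hypothetical.

```shell
# Hypothetical helper: print the accessions in LIST that have no fastq file in OUTDIR
missing_reads() {
    list="$1"
    outdir="${2:-reads}"
    while IFS= read -r acc || [ -n "$acc" ]; do
        acc=$(printf '%s' "$acc" | tr -d '[:space:]')
        if [ -z "$acc" ]; then continue; fi
        found=no
        # Assumed naming: ACC.fastq (single-end) or ACC_1.fastq, ACC_2.fastq (paired-end)
        for f in "$outdir/$acc.fastq" "$outdir/${acc}_"*.fastq; do
            if [ -e "$f" ]; then found=yes; fi
        done
        if [ "$found" = no ]; then echo "$acc"; fi
    done < "$list"
}

# Run only when the accession list is present in this directory
if [ -f SRR_Acc_List.txt ]; then
    missing_reads SRR_Acc_List.txt reads
fi
```

No output means that every accession in the list has at least one fastq file.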

Figure 6: Automatic NCBI SRA Read Download.

POTENTIAL ERRORS

If you get an error that the SRR_Acc_List.txt is a non-kart file, then change the file extension from txt to kart, and update the file name in the script accordingly

mv SRR_Acc_List.txt SRR_Acc_List.kart

And run the script again

./prefetch_fasterq-dump.sh

If you get the following error, then it suggests that you might need to use the latest version of prefetch and fasterq-dump. You can specify the versions as prefetch.3.1.1, fasterq-dump.3.1.1 and so on. See the above script.

Figure 7: Version Conflict.
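One possible cause of a version conflict is an older copy of the toolkit that sits earlier in the PATH variable, for example one installed through a package manager. The sketch below lists every copy of a program that PATH can reach, in the order the shell would try them. The helper name list_copies is made up for illustration.

```shell
# Hypothetical helper: print every executable called NAME found through PATH, in search order
list_copies() {
    name="$1"
    old_ifs=$IFS
    IFS=:
    # PATH is split on ":" once, when the loop starts
    for dir in $PATH; do
        IFS=$old_ifs
        if [ -x "$dir/$name" ]; then
            echo "$dir/$name"
        fi
    done
    IFS=$old_ifs
}

list_copies prefetch
list_copies fasterq-dump
```

The first path printed for each program is the copy that runs when you type its name. If that is not the copy in the sratoolkit bin directory, calling the versioned name as in the above script is one way around it.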

REFERENCES

[^SRA]: NCBI SRA Toolkit. [cited 2 Jul 2024]. Available: https://github.com/ncbi/sra-tools/wiki/01.-Downloading-SRA-Toolkit

[^PATH]: Prodhan MA. About the PATH. Zenodo. 2024 [cited 26 Apr 2024]. doi:10.5281/zenodo.11068991

[^Hakim]: Hakim S, Mirza BS, Imran A, Zaheer A, Yasmin S, Mubeen F, et al. Illumina sequencing of 16S rRNA tag shows disparity in rhizobial and non-rhizobial diversity associated with root nodules of mung bean (Vigna radiata L.) growing in different habitats in Pakistan. Microbiol Res. 2020;231: 126356. doi:10.1016/j.micres.2019.126356

[^RNAseq]: Prodhan MA, Pariasca-Tanaka J, Ueda Y, Hayes PE, Wissuwa M. Comparative transcriptome analysis reveals a rapid response to phosphorus deficiency in a phosphorus-efficient rice genotype. Sci Rep. 2022;12: 9460. doi:10.1038/s41598-022-13709-w

[^Mitogenomics]: Prodhan MA, Widmer M, Kinene T, Kehoe M. Whole mitochondrial genomes reveal the relatedness of the browsing ant incursions in Australia. Sci Rep. 2023;13: 10273. doi:10.1038/s41598-023-37425-1

Owner

Name: Asad Prodhan
Login: asadprodhan
Kind: user
Location: Perth, Australia
Company: Department of Primary Industries and Regional Development

Website: www.linkedin.com/in/asadprodhan
Twitter: Asad_Prodhan
Repositories: 2
Profile: https://github.com/asadprodhan

Laboratory Scientist at DPIRD. My work involves Oxford Nanopore Sequencing and Bioinformatics for pest and pathogen diagnosis.

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this article, please cite it as below."
authors:
- family-names: "Prodhan"
  given-names: "M. Asaduzzaman"
  orcid: "https://orcid.org/0000-0002-1320-3486"
title: "A Guide to Automatically Downloading NCBI SRA Reads"
version: 1
doi: 10.5281/zenodo.12622074
date-released: '2024-07-02'
repository-code: "https://github.com/asadprodhan/How-to-automatically-download-reads-from-the-NCBI-SRA/tree/main"
preferred-citation:
  type: article
  authors:
  - family-names: "Prodhan"
    given-names: "M. Asaduzzaman"
    orcid: "https://orcid.org/0000-0002-1320-3486"
  doi: "10.5281/zenodo.12622074"
  journal: "Zenodo"
  title: "A Guide to Automatically Downloading NCBI SRA Reads"
  year: 2024
