grabseqs

A utility for easy downloading of reads from next-gen sequencing repositories like NCBI SRA

https://github.com/louiejtaylor/grabseqs

Science Score: 23.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
  • .zenodo.json file
  • DOI references
    Found 4 DOI reference(s) in README
  • Academic publication links
    Links to: ncbi.nlm.nih.gov
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (14.3%) to scientific vocabulary

Keywords

bioinformatics conda metagenomics ncbi-sra ngs python sra
Last synced: 6 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: louiejtaylor
  • License: mit
  • Language: Python
  • Default Branch: master
  • Homepage:
  • Size: 250 KB
Statistics
  • Stars: 106
  • Watchers: 6
  • Forks: 14
  • Open Issues: 8
  • Releases: 20
Created over 7 years ago · Last pushed 7 months ago
Metadata Files
Readme License

README.md

grabseqs

Utility that simplifies bulk downloading of data from next-generation sequencing repositories such as NCBI SRA and MG-RAST.

Note: iMicrobe is currently not supported; we are working to remedy this (2025/08/14).

Install

Install grabseqs and all dependencies via conda:

conda install grabseqs -c louiejtaylor -c bioconda -c conda-forge

Or with pip (and install the non-Python dependencies yourself):

pip install grabseqs

Note: If you're using SRA data, after you've installed sra-tools, run vdb-config -i and turn off local file caching unless you want extra copies of the downloaded sequences taking up space (read more here).

Quick start

Download all samples from a single SRA Project:

grabseqs sra SRP#######

Or any combination of projects (S/ERP), runs (S/ERR), and BioProjects (PRJNA):

grabseqs sra SRR######## ERP####### PRJNA######## ERR########
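
When assembling a mixed list of accessions like the one above, it can help to check what kind each ID is before passing it on. The following is a minimal sketch, not grabseqs' own parsing logic: the helper name `classify_accession`, the regular expressions, and the example IDs are all assumptions made for illustration, based only on the prefixes listed above.

```python
import re

# Prefix patterns based on the accession types named in this README.
# The number of digits is left open because the README only shows placeholders.
ACCESSION_PATTERNS = {
    "project": re.compile(r"^[SE]RP\d+$"),
    "run": re.compile(r"^[SE]RR\d+$"),
    "bioproject": re.compile(r"^PRJNA\d+$"),
}


def classify_accession(accession):
    """Return the accession kind, or 'unknown' if no pattern matches.

    Hypothetical helper for illustration; not part of grabseqs.
    """
    for kind, pattern in ACCESSION_PATTERNS.items():
        if pattern.match(accession):
            return kind
    return "unknown"


# Made-up IDs used only to exercise the helper
for acc in ["SRR0000001", "ERP000002", "PRJNA000003", "mgp00004"]:
    print(acc, classify_accession(acc))
```

An ID reported as unknown here is not necessarily invalid; it only means it does not match one of the three prefixes mentioned in this section.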

If you'd like to do a dry run and just get a list of samples that will be downloaded, pass -l:

grabseqs sra -l SRP########

Similar syntax works for MG-RAST:

grabseqs mgrast mgp##### mgm#######

Detailed usage

See the grabseqs FAQ for detailed troubleshooting tips.

Fun options:

grabseqs sra -t 10 -m metadata.csv -o proj/ -r 3 SRP#######

(translation: use 10 threads, save metadata to proj/metadata.csv, download to the dir proj/, retry failed downloads 3x, get all samples from SRP#######)
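
If you drive grabseqs from a script, the same options can be assembled programmatically. This is a rough sketch under stated assumptions: `build_sra_command` is a hypothetical helper (not part of grabseqs), the accession is a made-up placeholder, and only the flags documented in this README are used.

```python
import shlex


def build_sra_command(ids, threads=None, metadata=None, outdir=None,
                      retries=None, list_only=False):
    """Build an argv list for `grabseqs sra` (hypothetical helper)."""
    cmd = ["grabseqs", "sra"]
    if threads is not None:
        cmd += ["-t", str(threads)]   # threads for fasterq-dump/pigz
    if metadata is not None:
        cmd += ["-m", metadata]       # metadata filename, relative to OUTDIR
    if outdir is not None:
        cmd += ["-o", outdir]         # output directory
    if retries is not None:
        cmd += ["-r", str(retries)]   # number of download retries
    if list_only:
        cmd.append("-l")              # dry run: list samples only
    cmd += list(ids)
    return cmd


# Placeholder accession; substitute a real project ID
cmd = build_sra_command(["SRP0000000"], threads=10,
                        metadata="metadata.csv", outdir="proj/", retries=3)
print(shlex.join(cmd))
```

The resulting list could be handed to `subprocess.run(cmd, check=True)` once grabseqs is installed; building a list rather than a single string avoids shell-quoting problems.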

If you'd like to pass your own arguments to fasterq-dump to get data in a slightly different format, you can do so like this:

grabseqs sra SRP####### -r 0 --custom_fqdump_args="--split-spot --progress"
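
When calling grabseqs from Python, the custom arguments need to reach it as a single value, just as the quotes ensure in the shell command above. Here is a small sketch, assuming a hypothetical helper `custom_fqdump_option` and a placeholder accession; the `=` form is kept so that a value beginning with dashes is not mistaken for a separate option.

```python
import shlex


def custom_fqdump_option(extra_args):
    """Join extra dump arguments into one --custom_fqdump_args=... token.

    Hypothetical helper for illustration; not part of grabseqs.
    """
    return "--custom_fqdump_args=" + " ".join(extra_args)


# Placeholder accession; substitute a real project ID
cmd = ["grabseqs", "sra", "SRP0000000", "-r", "0",
       custom_fqdump_option(["--split-spot", "--progress"])]

# shlex.join shows the equivalent, safely quoted shell command
print(shlex.join(cmd))
```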

Full usage:

grabseqs sra [-h] [-m METADATA] [-o OUTDIR] [-r RETRIES] [-t THREADS]
             [-f] [-l] [--no_parsing] [--parse_run_ids]
             [--use_fastq_dump]
             id [id ...]

positional arguments:
  id                One or more BioProject, ERR/SRR or ERP/SRP number(s)

optional arguments:
  -h, --help        show this help message and exit
  -m METADATA       filename in which to save SRA metadata (.csv format,
                    relative to OUTDIR)
  -o OUTDIR         directory in which to save output. created if it doesn't
                    exist
  -r RETRIES        number of times to retry download
  -t THREADS        threads to use (for fasterq-dump/pigz)
  -f                force re-download of files
  -l                list (but do not download) samples to be grabbed
  --parse_run_ids   parse SRR/ERR identifiers (do not pass straight to
                    fasterq-dump)
  --custom_fqdump_args CUSTOM_FQD_ARGS
                    "string" containing args to pass to fastq-dump
  --use_fastq_dump  use legacy fastq-dump instead of fasterq-dump (no
                    multithreaded downloading)

Downloads .fastq.gz files to OUTDIR (or the working directory if not specified). If the -m flag is passed, saves metadata to OUTDIR with filename METADATA in csv format.
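
After a download finishes, you may want to confirm what landed in OUTDIR. The sketch below is one possible post-download check, not something grabseqs provides: the function names are hypothetical, and no metadata column names are assumed. It decompresses each .fastq.gz file to test its integrity and, if a metadata filename is given, reads that .csv into a list of rows.

```python
import csv
import gzip
import tempfile
import zlib
from pathlib import Path


def is_valid_gzip(path):
    """Return True if the whole file decompresses without error."""
    try:
        with gzip.open(path, "rb") as handle:
            # Read in 1 MiB chunks so large files are not held in memory
            while handle.read(1024 * 1024):
                pass
        return True
    except (OSError, EOFError, zlib.error):
        return False


def summarize_output(outdir, metadata_name=None):
    """Check every *.fastq.gz in outdir and optionally load the metadata CSV."""
    out = Path(outdir)
    report = {p.name: is_valid_gzip(p) for p in sorted(out.glob("*.fastq.gz"))}
    rows = []
    if metadata_name is not None and (out / metadata_name).exists():
        with (out / metadata_name).open(newline="") as handle:
            rows = list(csv.DictReader(handle))
    return report, rows


# Demo on throwaway files instead of real downloads
with tempfile.TemporaryDirectory() as tmp:
    with gzip.open(Path(tmp) / "good.fastq.gz", "wb") as fh:
        fh.write(b"@read1\nACGT\n+\nIIII\n")
    (Path(tmp) / "bad.fastq.gz").write_bytes(b"not a gzip stream")
    print(summarize_output(tmp))
```

For a real run, something like `summarize_output("proj/", "metadata.csv")` would match the example options shown earlier.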

Similar options are available for downloading from MG-RAST:

grabseqs mgrast [-h] [-m METADATA] [-o OUTDIR] [-r RETRIES]
                [-t THREADS] [-f] [-l]
                rastid [rastid ...]

Troubleshooting

See the grabseqs FAQ for detailed troubleshooting tips. If the FAQs don't fix your problem, feel free to open an issue!

Dependencies

  • Python 3 (external packages req'd: requests, requests-html, pandas, fake-useragent)
  • sra-tools>3.2
  • pigz
  • wget

If you use conda (on Linux), these will be installed for you!
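
If you install with pip and manage the non-Python dependencies yourself, a quick check that the command-line tools are on your PATH can save a failed run. This is a minimal sketch, not grabseqs' own dependency handling: `missing_tools` is a hypothetical helper, and `fasterq-dump` is assumed to be the executable that sra-tools provides for downloads.

```python
import shutil

# Executables assumed to be needed, based on the dependency list above
REQUIRED_TOOLS = ["fasterq-dump", "pigz", "wget"]


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = missing_tools()
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("All required tools found.")
```

This only checks that the executables exist; it does not verify their versions.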

Grabseqs runs on Mac or Linux. We've tested on these specific OSes:

Linux (conda or pip):
  • CentOS 6, 7, and 8
  • Debian 9 and 10
  • Ubuntu 16.04, 18.04, and 19.10
  • Red Hat Enterprise 6, 7, and 8
  • SUSE Enterprise 12 and 15

Mac (pip):
  • MacOS 10.14

Grabseqs has been tested and works with the following versions of the Python dependencies (though these are neither minimal nor pinned version numbers):

  • requests 2.22.0
  • pandas>2

Citation

If you use grabseqs in your work, please cite:

Louis J Taylor, Arwa Abbas, Frederic D Bushman. "grabseqs: Simple downloading of reads and metadata from multiple next-generation sequencing data repositories." Bioinformatics, (2020), btaa167, https://doi.org/10.1093/bioinformatics/btaa167

Please also cite the researchers who generated the data (and the repository, if appropriate)!


Changelog

1.0.0 (2025-08-14)
  • Added a walk-through for adding a new repo using template.py
  • Better handling for invalid SRA accession numbers
  • Update endpoint for NCBI for SRA downloads
  • Temporarily remove iMicrobe; needs rewrite to use a different tool

0.7.0 (2020-01-29)
  • Allow users to pass custom args to fast(er)q-dump
  • Minor re-writes of download handling code for easier readability

0.6.1 (2019-12-20)
  • Validate compressed files (fix #8 and #34)

0.6.0 (2019-12-12)
  • Gracefully handle incomplete or missing dependencies
  • Major rewrite of test suite

0.5.2 (2019-12-05)
  • Improvements to work with multiple versions of Python 3

0.5.1 (2019-11-23)
  • Hotfix handling outdated versions of sra-tools

0.5.0 (2019-04-11)
  • Metadata available for all sources in .csv format

History

This project grew out of, and incorporates code from, hisss; many thanks to ArwaAbbas for helping make this work!

Owner

  • Name: Louis J Taylor
  • Login: louiejtaylor
  • Kind: user

Hunting viruses at the University of Pennsylvania.

GitHub Events

Total
  • Create event: 1
  • Release event: 1
  • Issues event: 1
  • Watch event: 1
  • Issue comment event: 2
  • Push event: 15
Last Year
  • Create event: 1
  • Release event: 1
  • Issues event: 1
  • Watch event: 1
  • Issue comment event: 2
  • Push event: 15

Committers

Last synced: about 2 years ago

All Time
  • Total Commits: 260
  • Total Committers: 5
  • Avg Commits per committer: 52.0
  • Development Distribution Score (DDS): 0.354
Past Year
  • Commits: 0
  • Committers: 0
  • Avg Commits per committer: 0.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
louiejtaylor l****r@g****m 168
Louis J Taylor l****r 72
Louis l****r@g****m 15
ArwaAbbas a****a@g****m 3
Meagan Rubel r****l@c****l 2

Issues and Pull Requests

Last synced: about 2 years ago

All Time
  • Total issues: 39
  • Total pull requests: 18
  • Average time to close issues: about 1 month
  • Average time to close pull requests: about 5 hours
  • Total issue authors: 13
  • Total pull request authors: 2
  • Average comments per issue: 1.28
  • Average comments per pull request: 0.0
  • Merged pull requests: 18
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 1
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 1
  • Pull request authors: 0
  • Average comments per issue: 1.0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • louiejtaylor (27)
  • a-h-b (1)
  • guanxiangliang (1)
  • EisenRa (1)
  • ElDeveloper (1)
  • Damianyangyang (1)
  • dutchscientist (1)
  • ressy (1)
  • cdiener (1)
  • xsq2022 (1)
  • nsheff (1)
  • sejsong (1)
  • ArwaAbbas (1)
Pull Request Authors
  • louiejtaylor (17)
  • ArwaAbbas (1)
Top Labels
Issue Labels
enhancement (7) bug (2)
Pull Request Labels

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 101 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 1
  • Total versions: 11
  • Total maintainers: 1
pypi.org: grabseqs

Easily download reads from next-gen sequencing repositories like NCBI SRA

  • Versions: 11
  • Dependent Packages: 0
  • Dependent Repositories: 1
  • Downloads: 101 Last month
Rankings
Stargazers count: 7.1%
Forks count: 9.3%
Dependent packages count: 10.1%
Average: 13.4%
Downloads: 18.9%
Dependent repos count: 21.5%
Maintainers (1)
Last synced: 7 months ago

Dependencies

environment.yml conda
  • pandas
  • pigz
  • python >3
  • requests
  • sra-tools >2.9
  • wget
setup.py pypi
  • argparse *
  • pandas *
  • requests *
  • requests-html *