grabseqs

A utility for easy downloading of reads from next-gen sequencing repositories like NCBI SRA

https://github.com/louiejtaylor/grabseqs

Science Score: 23.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
○ codemeta.json file
○ .zenodo.json file
✓ DOI references: Found 4 DOI reference(s) in README
✓ Academic publication links: Links to ncbi.nlm.nih.gov
○ Committers with academic emails
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (14.3%) to scientific vocabulary

Keywords

bioinformatics conda metagenomics ncbi-sra ngs python sra

Last synced: 9 months ago

Repository

Basic Info

Host: GitHub
Owner: louiejtaylor
License: mit
Language: Python
Default Branch: master
Homepage:
Size: 250 KB

Statistics

Stars: 106
Watchers: 6
Forks: 14
Open Issues: 8
Releases: 20

Topics

bioinformatics conda metagenomics ncbi-sra ngs python sra

Created over 7 years ago · Last pushed 10 months ago

Metadata Files

Readme License

grabseqs

Utility for simplifying bulk downloads of data from next-generation sequencing repositories such as NCBI SRA and MG-RAST.

iMicrobe is currently not supported; we are working to remedy this (2025/08/14).

Install

Install grabseqs and all dependencies via conda:

conda install grabseqs -c louiejtaylor -c bioconda -c conda-forge

Or with pip (and install the non-Python dependencies yourself):

pip install grabseqs

Note: If you're using SRA data, after you've installed sra-tools, run vdb-config -i and turn off local file caching unless you want extra copies of the downloaded sequences taking up space (read more here).

Quick start

Download all samples from a single SRA Project:

grabseqs sra SRP#######

Or any combination of projects (S/ERP), runs (S/ERR), and BioProjects (PRJNA):

grabseqs sra SRR######## ERP####### PRJNA######## ERR########

If you'd like to do a dry run and just get a list of samples that will be downloaded, pass -l:

grabseqs sra -l SRP########

Similar syntax works for MG-RAST:

grabseqs mgrast mgp##### mgm#######
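When a list mixes accessions from both repositories, it can help to sort them by prefix before deciding which subcommand to call. The sketch below is illustrative only: the helper names and the regular expressions are assumptions based on the prefixes shown in the quick start, not grabseqs' own validation logic.

```python
import re

# Prefix patterns inferred from the quick-start examples. These are
# assumptions for illustration, not the checks grabseqs itself performs.
_PATTERNS = [
    ("sra", "project", re.compile(r"^[SE]RP\d+$")),
    ("sra", "run", re.compile(r"^[SE]RR\d+$")),
    ("sra", "bioproject", re.compile(r"^PRJNA\d+$")),
    ("mgrast", "project", re.compile(r"^mgp\d+$")),
    ("mgrast", "metagenome", re.compile(r"^mgm\d+(\.\d+)?$")),
]


def classify_accession(accession):
    """Return (subcommand, kind) for a recognised accession, else None."""
    for subcommand, kind, pattern in _PATTERNS:
        if pattern.match(accession):
            return subcommand, kind
    return None


def group_by_subcommand(accessions):
    """Group accessions under the grabseqs subcommand that would fetch them."""
    groups = {}
    for accession in accessions:
        hit = classify_accession(accession)
        if hit is None:
            raise ValueError(f"unrecognised accession: {accession}")
        groups.setdefault(hit[0], []).append(accession)
    return groups
```

Each resulting group can then be passed to a single `grabseqs sra ...` or `grabseqs mgrast ...` call.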

Detailed usage

See the grabseqs FAQ for detailed troubleshooting tips.

Fun options:

grabseqs sra -t 10 -m metadata.csv -o proj/ -r 3 SRP#######

(translation: use 10 threads, save metadata to proj/metadata.csv, download to the dir proj/, retry failed downloads 3x, get all samples from SRP#######)
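If you drive grabseqs from a Python script, the same options can be assembled as an argument list. This is a minimal sketch: `build_sra_command` is a hypothetical helper (not part of grabseqs), it only uses the flags documented in this README, and `SRP0000000` is a placeholder accession.

```python
def build_sra_command(ids, threads=None, metadata=None, outdir=None,
                      retries=None, list_only=False):
    """Assemble a `grabseqs sra` argument list from the documented flags."""
    cmd = ["grabseqs", "sra"]
    if threads is not None:
        cmd += ["-t", str(threads)]      # threads for fasterq-dump/pigz
    if metadata is not None:
        cmd += ["-m", metadata]          # metadata filename, relative to outdir
    if outdir is not None:
        cmd += ["-o", outdir]            # output directory
    if retries is not None:
        cmd += ["-r", str(retries)]      # number of download retries
    if list_only:
        cmd.append("-l")                 # dry run: list samples only
    cmd += list(ids)
    return cmd


# Mirrors the "fun options" example; SRP0000000 is a placeholder.
example = build_sra_command(["SRP0000000"], threads=10,
                            metadata="metadata.csv", outdir="proj/", retries=3)

# To actually run it (requires grabseqs on PATH):
#   import subprocess
#   subprocess.run(example, check=True)
```

Passing a list to `subprocess.run` avoids shell quoting issues, and `check=True` raises an exception if the command exits with a non-zero status.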

If you'd like to do a dry run and only get a list of samples that will be downloaded, pass -l:

grabseqs sra -l SRP########

If you'd like to pass your own arguments to fasterq-dump to get data in a slightly different format, you can do so like this:

grabseqs sra SRP####### -r 0 --custom_fqdump_args="--split-spot --progress"
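The quotes in the command above matter: the shell hands grabseqs the whole fasterq-dump string as one argument. When calling grabseqs from Python without a shell, the same effect comes from keeping that string in a single list element. The sketch below assumes a hypothetical helper name and a placeholder accession; `shlex.join` (Python 3.8+) is used only to print the equivalent shell command.

```python
import shlex


def custom_args_option(fqdump_args):
    """Pack fasterq-dump flags into one --custom_fqdump_args=... element."""
    return "--custom_fqdump_args=" + " ".join(fqdump_args)


# SRP0000000 is a placeholder accession.
cmd = ["grabseqs", "sra", "SRP0000000", "-r", "0",
       custom_args_option(["--split-spot", "--progress"])]

# Shell-safe rendering of the same command, e.g. for logging.
printable = shlex.join(cmd)
print(printable)
```

The printed command quotes the entire option rather than just its value, which the shell parses into the same single argument.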

Full usage:

grabseqs sra [-h] [-m METADATA] [-o OUTDIR] [-r RETRIES] [-t THREADS]
             [-f] [-l] [--no_parsing] [--parse_run_ids]
             [--use_fastq_dump]
             id [id ...]

positional arguments:
  id                One or more BioProject, ERR/SRR or ERP/SRP number(s)

optional arguments:
  -h, --help        show this help message and exit
  -m METADATA       filename in which to save SRA metadata (.csv format,
                    relative to OUTDIR)
  -o OUTDIR         directory in which to save output. created if it doesn't
                    exist
  -r RETRIES        number of times to retry download
  -t THREADS        threads to use (for fasterq-dump/pigz)
  -f                force re-download of files
  -l                list (but do not download) samples to be grabbed
  --parse_run_ids   parse SRR/ERR identifiers (do not pass straight to
                    fasterq-dump)
  --custom_fqdump_args CUSTOM_FQD_ARGS
                    "string" containing args to pass to fastq-dump
  --use_fastq_dump  use legacy fastq-dump instead of fasterq-dump (no
                    multithreaded downloading)

Downloads .fastq.gz files to OUTDIR (or the working directory if not specified). If the -m flag is passed, saves metadata to OUTDIR with filename METADATA in csv format.
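After a run, you may want to check what landed in OUTDIR. The following sketch lists the `.fastq.gz` files and, if a metadata file was requested with -m, reads it with the standard `csv` module. `collect_outputs` is a hypothetical helper; it makes no assumptions about the metadata column names, which depend on the source repository.

```python
import csv
from pathlib import Path


def collect_outputs(outdir=".", metadata=None):
    """Return (sorted .fastq.gz filenames, metadata rows) for a download dir.

    `metadata` is the filename passed to -m, interpreted relative to
    `outdir`; leave it as None if -m was not used.
    """
    out = Path(outdir)
    fastqs = sorted(p.name for p in out.glob("*.fastq.gz"))
    rows = []
    if metadata is not None:
        # Each row becomes a dict keyed by the CSV header line.
        with open(out / metadata, newline="") as handle:
            rows = list(csv.DictReader(handle))
    return fastqs, rows
```

For the earlier example, `collect_outputs("proj/", "metadata.csv")` would return the downloaded filenames together with one dictionary per metadata row.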

Similar options are available for downloading from MG-RAST:

grabseqs mgrast [-h] [-m METADATA] [-o OUTDIR] [-r RETRIES]
                [-t THREADS] [-f] [-l]
                rastid [rastid ...]

Troubleshooting

See the grabseqs FAQ for detailed troubleshooting tips. If the FAQs don't fix your problem, feel free to open an issue!

Dependencies

Python 3 (external packages required: requests, requests-html, pandas, fake-useragent)
sra-tools>3.2
pigz
wget

If you use conda (on Linux), these will be installed for you!
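If you installed with pip instead, a quick check that the non-Python tools are on your PATH can save a failed download later. This is a standalone sketch, not grabseqs' own dependency check: `missing_tools` is a hypothetical helper, and `fasterq-dump` is used as the representative executable from sra-tools. It does not verify version requirements.

```python
import shutil

# Executables behind the non-Python dependencies listed above.
REQUIRED_TOOLS = ("fasterq-dump", "pigz", "wget")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing from PATH: " + ", ".join(missing))
    else:
        print("All external tools found.")
```

The `which` parameter exists only so the lookup can be swapped out in tests; in normal use the default `shutil.which` is what you want.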

Grabseqs runs on Mac or Linux. We've tested on these specific OSes:

Linux (conda or pip):
- CentOS 6, 7, and 8
- Debian 9 and 10
- Ubuntu 16.04, 18.04, and 19.10
- Red Hat Enterprise 6, 7, and 8
- SUSE Enterprise 12 and 15

Mac (pip):
- MacOS 10.14

Grabseqs has been tested and works with the following versions of the Python dependencies (though these are neither minimal nor pinned version numbers):

requests 2.22.0
pandas>2

Citation

If you use grabseqs in your work, please cite:

Louis J Taylor, Arwa Abbas, Frederic D Bushman. "grabseqs: Simple downloading of reads and metadata from multiple next-generation sequencing data repositories." Bioinformatics, (2020), btaa167, https://doi.org/10.1093/bioinformatics/btaa167

Please also cite the researchers who generated the data (and the repository, if appropriate)!

Changelog

1.0.0 (2025-08-14)
- Added a walk-through for adding a new repo using template.py
- Better handling for invalid SRA accession numbers
- Update endpoint for NCBI for SRA downloads
- Temporarily remove iMicrobe; needs rewrite to use a different tool

0.7.0 (2020-01-29)
- Allow users to pass custom args to fast(er)q-dump
- Minor re-writes of download handling code for easier readability

0.6.1 (2019-12-20)
- Validate compressed files (fix #8 and #34)

0.6.0 (2019-12-12)
- Gracefully handle incomplete or missing dependencies
- Major rewrite of test suite

0.5.2 (2019-12-05)
- Improvements to work with multiple versions of Python 3

0.5.1 (2019-11-23)
- Hotfix handling outdated versions of sra-tools

0.5.0 (2019-04-11)
- Metadata available for all sources in .csv format

History

This project grew out of, and incorporates code from, hisss; many thanks to ArwaAbbas for helping make this work!

Owner

Name: Louis J Taylor
Login: louiejtaylor
Kind: user

Twitter: Louviridae
Repositories: 4
Profile: https://github.com/louiejtaylor

Hunting viruses at the University of Pennsylvania.

GitHub Events

Total

Create event: 1
Release event: 1
Issues event: 1
Watch event: 1
Issue comment event: 2
Push event: 15

Last Year

Create event: 1
Release event: 1
Issues event: 1
Watch event: 1
Issue comment event: 2
Push event: 15

Committers

Last synced: over 2 years ago

All Time

Total Commits: 260
Total Committers: 5
Avg Commits per committer: 52.0
Development Distribution Score (DDS): 0.354

Past Year

Commits: 0
Committers: 0
Avg Commits per committer: 0.0
Development Distribution Score (DDS): 0.0

Top Committers

Name	Email	Commits
louiejtaylor	l**r@g**m	168
Louis J Taylor	l****r	72
Louis	l**r@g**m	15
ArwaAbbas	a**a@g**m	3
Meagan Rubel	r**l@c**l	2

Issues and Pull Requests

Last synced: over 2 years ago

All Time

Total issues: 39
Total pull requests: 18
Average time to close issues: about 1 month
Average time to close pull requests: about 5 hours
Total issue authors: 13
Total pull request authors: 2
Average comments per issue: 1.28
Average comments per pull request: 0.0
Merged pull requests: 18
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 1
Pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 1
Pull request authors: 0
Average comments per issue: 1.0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Top Authors

Issue Authors

louiejtaylor (27)
a-h-b (1)
guanxiangliang (1)
EisenRa (1)
ElDeveloper (1)
Damianyangyang (1)
dutchscientist (1)
ressy (1)
cdiener (1)
xsq2022 (1)
nsheff (1)
sejsong (1)
ArwaAbbas (1)

Pull Request Authors

louiejtaylor (17)
ArwaAbbas (1)

Top Labels

Issue Labels

enhancement (7) bug (2)

Pull Request Labels

Packages

Total packages: 1
Total downloads:
- pypi 101 last-month

Total dependent packages: 0
Total dependent repositories: 1
Total versions: 11
Total maintainers: 1

pypi.org: grabseqs

Easily download reads from next-gen sequencing repositories like NCBI SRA

Homepage: https://github.com/louiejtaylor/grabseqs
Documentation: https://grabseqs.readthedocs.io/
License: MIT License
Latest release: 0.7.0
published over 6 years ago

Versions: 11
Dependent Packages: 0
Dependent Repositories: 1
Downloads: 101 Last month

Rankings

Stargazers count: 7.1%

Forks count: 9.3%

Dependent packages count: 10.1%

Average: 13.4%

Downloads: 18.9%

Dependent repos count: 21.5%

Maintainers (1)

louiejtaylor

Last synced: 10 months ago

Dependencies

environment.yml conda

pandas
pigz
python >3
requests
sra-tools >2.9
wget

setup.py pypi

argparse *
pandas *
requests *
requests-html *
