fastq-dl

Download FASTQ files from SRA or ENA repositories.

https://github.com/rpetit3/fastq-dl

Last synced: 6 months ago

Repository

Basic Info

Host: GitHub
Owner: rpetit3
License: mit
Language: Python
Default Branch: master
Size: 383 KB

Statistics

Stars: 341
Watchers: 4
Forks: 27
Open Issues: 13
Releases: 17

Created over 6 years ago · Last pushed 9 months ago

Metadata Files

Readme Changelog Funding License Citation

fastq-dl

Download FASTQ files from the European Nucleotide Archive or the Sequence Read Archive repositories.

Introduction

fastq-dl takes an ENA/SRA accession (Study, Sample, Experiment, or Run) and queries ENA (via Data Warehouse API) to determine the associated metadata. It then downloads FASTQ files for each Run. For Samples or Experiments with multiple Runs, users can optionally merge the runs.
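As a rough illustration of the metadata lookup described above, the sketch below builds a query URL for the ENA Portal API `filereport` endpoint. The choice of endpoint, the field list, and the helper name `build_filereport_url` are assumptions made for this example; the query fastq-dl actually sends may differ.

```python
from urllib.parse import urlencode

# Assumption: the public ENA Portal API "filereport" endpoint.
ENA_FILEREPORT = "https://www.ebi.ac.uk/ena/portal/api/filereport"


def build_filereport_url(accession, fields=("run_accession", "fastq_ftp", "fastq_md5")):
    """Return a URL that requests per-Run metadata for an accession as TSV.

    Hypothetical helper for illustration only; it builds the URL but does
    not contact ENA.
    """
    params = {
        "accession": accession,
        "result": "read_run",       # one row per Run
        "fields": ",".join(fields),
        "format": "tsv",
    }
    return f"{ENA_FILEREPORT}?{urlencode(params)}"


print(build_filereport_url("SRX477044"))
```

Fetching the resulting URL (for example with `urllib.request.urlopen`) would return one row per Run, which is the information needed to decide which FASTQ files to download.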

Installation

Bioconda

fastq-dl is available from Bioconda, and I highly recommend you go this route for installation.

```{bash}
conda create -n fastq-dl -c conda-forge -c bioconda fastq-dl
conda activate fastq-dl
```

Usage

```{bash}
fastq-dl --help

Usage: fastq-dl [OPTIONS]

Download FASTQ files from ENA or SRA.

╭─ Required Options ──────────────────────────────────────────────────────────────────────────╮
│ *  --accession  -a  TEXT  ENA/SRA accession to query. (Study, Sample, Experiment, Run       │
│                           accession)                                                        │
│                           [required]                                                        │
╰─────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Download Options ──────────────────────────────────────────────────────────────────────────╮
│ --provider                    [ena|sra]  Specify which provider (ENA or SRA) to use.        │
│                                          [default: ena]                                     │
│ --group-by-experiment                    Group Runs by experiment accession.                │
│ --group-by-sample                        Group Runs by sample accession.                    │
│ --max-attempts            -m  INTEGER    Maximum number of download attempts. [default: 10] │
│ --sra-lite                               Set preference to SRA Lite                         │
│ --only-provider                          Only attempt download from specified provider.     │
│ --only-download-metadata                 Skip FASTQ downloads, and retrieve only the        │
│                                          metadata.                                          │
│ --ignore                  -I             Ignore MD5 checksums for downloaded files.         │
╰─────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Additional Options ────────────────────────────────────────────────────────────────────────╮
│ --outdir   -o  TEXT     Directory to output downloads to. [default: ./]                     │
│ --prefix       TEXT     Prefix to use for naming log files. [default: fastq]                │
│ --cpus         INTEGER  Total cpus used for downloading from SRA. [default: 1]              │
│ --force    -F           Overwrite existing files.                                           │
│ --silent                Only critical errors will be printed.                               │
│ --sleep    -s  INTEGER  Minimum amount of time to sleep between retries (API query and      │
│                         download)                                                           │
│                         [default: 10]                                                       │
│ --version  -V           Show the version and exit.                                          │
│ --verbose  -v           Print debug related text.                                           │
│ --help     -h           Show this message and exit.                                         │
╰─────────────────────────────────────────────────────────────────────────────────────────────╯
```

fastq-dl requires a single ENA/SRA Study, Sample, Experiment, or Run accession, and it downloads FASTQs for all Runs that fall under the given accession. For example, if a Study accession is given, all Runs under that Study's umbrella will be downloaded. By default, fastq-dl will try to download from ENA first, then SRA.

--accession

The accession you would like to download associated FASTQs for. Currently, the following types of accessions are accepted.

| Accession Type | Prefixes            | Example                                  |
|----------------|---------------------|------------------------------------------|
| BioProject     | PRJEB, PRJNA, PRJDB | PRJEB42779, PRJNA480016, PRJDB14838      |
| Study          | ERP, DRP, SRP       | ERP126685, DRP009283, SRP158268          |
| BioSample      | SAMD, SAME, SAMN    | SAMD00258402, SAMEA7997453, SAMN06479985 |
| Sample         | ERS, DRS, SRS       | ERS5684710, DRS259711, SRS2024210        |
| Experiment     | ERX, DRX, SRX       | ERX5050800, DRX406443, SRX4563689        |
| Run            | ERR, DRR, SRR       | ERR5260405, DRR421224, SRR7706354        |

The accessions are matched using regular expressions from the ENA Training Modules - Accession Numbers section.
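To make the accession types above concrete, here is a minimal sketch that classifies an accession by pattern. The regular expressions are written from the ENA accession-number documentation as I understand it, and both `ACCESSION_PATTERNS` and `classify_accession` are hypothetical names; this is not fastq-dl's own code.

```python
import re

# Assumption: patterns modelled on the ENA accession-number documentation.
ACCESSION_PATTERNS = {
    "BioProject": re.compile(r"PRJ(E|D|N)[A-Z][0-9]+"),
    "Study": re.compile(r"(E|D|S)RP[0-9]{6,}"),
    "BioSample": re.compile(r"SAM(E|D|N)[A-Z]?[0-9]+"),
    "Sample": re.compile(r"(E|D|S)RS[0-9]{6,}"),
    "Experiment": re.compile(r"(E|D|S)RX[0-9]{6,}"),
    "Run": re.compile(r"(E|D|S)RR[0-9]{6,}"),
}


def classify_accession(accession):
    """Return the accession type name, or None if no pattern matches."""
    for kind, pattern in ACCESSION_PATTERNS.items():
        # fullmatch rejects accessions with trailing or leading junk.
        if pattern.fullmatch(accession):
            return kind
    return None


for example in ("PRJNA480016", "SAMEA7997453", "SRX4563689", "ERR5260405"):
    print(example, classify_accession(example))
```

Using `fullmatch` rather than `search` means a string such as `SRR7706354_1` is rejected instead of being silently treated as a Run.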

--provider

fastq-dl gives you the option to download from ENA or SRA. The --provider option specifies which provider you would like to attempt downloads from first. If a download fails from the first provider, additional attempts will be made using the other provider.

ENA was selected as the default provider because the FASTQs are available directly without the need for conversion.

--only-provider

By default, fastq-dl will fall back on a secondary provider to attempt downloads. There may be cases where you would prefer to disable this feature, and that is exactly the purpose of --only-provider. When provided, if a FASTQ cannot be downloaded from the original provider, no additional attempts will be made.
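The interaction between --provider and --only-provider can be pictured with the sketch below. It is a simplified model under stated assumptions: `download_with_fallback` and the `downloaders` mapping are hypothetical stand-ins, and the retry and sleep behaviour of the real tool is left out.

```python
def download_with_fallback(accession, downloaders, provider="ena", only_provider=False):
    """Try `provider` first, then the other provider unless only_provider is set.

    `downloaders` maps a provider name ("ena" or "sra") to a callable that
    takes an accession and returns True on success. Returns the name of the
    provider that succeeded, or None if every attempt failed.
    """
    other = "sra" if provider == "ena" else "ena"
    order = [provider] if only_provider else [provider, other]
    for name in order:
        if downloaders[name](accession):
            return name
    return None


# Stand-in downloaders: ENA always fails, SRA always succeeds.
fake = {"ena": lambda acc: False, "sra": lambda acc: True}
print(download_with_fallback("SRR1178105", fake))
print(download_with_fallback("SRR1178105", fake, only_provider=True))
```

With these stand-ins the first call falls through to SRA, while the second stops after the failed ENA attempt.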

--group-by-experiment & --group-by-sample

There may be times when you want to group Run accessions by their Experiment or Sample accession. These options merge the FASTQs of Runs that share an Experiment accession (--group-by-experiment) or a Sample accession (--group-by-sample).
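One simple way to merge gzip-compressed FASTQ files is plain concatenation, because several gzip members written back to back still form a valid gzip stream. The sketch below shows that technique; `merge_fastqs` is a hypothetical helper, and whether fastq-dl merges files this way is an assumption.

```python
import shutil


def merge_fastqs(inputs, output):
    """Concatenate gzip-compressed FASTQ files into a single gzip file.

    The inputs are copied byte for byte, so nothing is decompressed or
    recompressed. Records appear in the order the inputs are listed.
    """
    with open(output, "wb") as out_fh:
        for path in inputs:
            with open(path, "rb") as in_fh:
                shutil.copyfileobj(in_fh, out_fh)
```

For paired-end data the same call would be made once per mate, keeping the input order identical for both mates so that read pairs stay in sync.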

--sra-lite

Downloads from SRA are provided in SRA Normalized and SRA Lite formats. SRA Normalized is the original format with full base quality scores, while SRA Lite files are smaller because the quality scores are simplified to a uniform Q30. By default the preference is set to SRA Normalized; if you prefer SRA Lite, use --sra-lite to change the preference.

Output Files

| Extension        | Description                                                                                     |
|------------------|-------------------------------------------------------------------------------------------------|
| -run-info.tsv    | Tab-delimited file containing metadata for each Run downloaded                                  |
| -run-mergers.tsv | Tab-delimited file containing merge information from --group-by-experiment or --group-by-sample |
| .fastq.gz        | FASTQ files downloaded from ENA or SRA                                                          |
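Because the metadata outputs are tab-delimited, they can be read with Python's standard `csv` module. The sketch below parses a small in-memory stand-in; the column names and the placeholder values are assumptions for illustration, not the real header of `-run-info.tsv`.

```python
import csv
import io


def read_run_info(handle):
    """Parse a tab-delimited run-info file into a list of dicts, one per Run."""
    return list(csv.DictReader(handle, delimiter="\t"))


# In-memory stand-in for a run-info file (hypothetical columns and values).
example = io.StringIO(
    "run_accession\tsample_accession\n"
    "RUN_A\tSAMPLE_1\n"
    "RUN_B\tSAMPLE_1\n"
)
for row in read_run_info(example):
    print(row["run_accession"], row["sample_accession"])
```

For a real file, pass an open handle instead, for example `open("fastq-run-info.tsv", newline="")`, assuming the default `fastq` prefix is prepended to the extension shown in the table.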

Example Usage

Download FASTQs associated with a Study

Sometimes you might be reading a paper and they very kindly provided a BioProject of all the samples they sequenced. So, you decide you want to download FASTQs for all the samples associated with the BioProject. fastq-dl can help you with that!

```{bash}
fastq-dl --accession PRJNA248678 --provider SRA
fastq-dl --accession PRJNA248678
```

The above commands will download the 3 Runs that fall under Study accession PRJNA248678 from either SRA (--provider SRA) or ENA (without --provider).

Download FASTQs associated with an Experiment

Let's say instead of the whole BioProject you just want a single Experiment. You can do that as well.

```{bash}
fastq-dl --accession SRX477044
```

The above command would download the Run accessions from ENA that fall under Experiment SRX477044.

The relationship of Experiment to Run is 1-to-many: there can be many Run accessions associated with a single Experiment accession (e.g. re-sequencing the same sample). In most cases it is a 1-to-1 relationship, but when it is not, you can use --group-by-experiment to merge multiple Runs associated with an Experiment accession into a single FASTQ file.
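The 1-to-many relationship can be sketched as a grouping step over per-Run metadata rows. In the example below, `group_runs`, the column names, and the placeholder accessions are all assumptions for illustration; the same function works for Samples by changing the key.

```python
from collections import defaultdict


def group_runs(rows, key):
    """Map each value of `key` to the list of Run accessions that share it."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row["run_accession"])
    return dict(groups)


# Placeholder metadata: two Runs share one Experiment, a third stands alone.
rows = [
    {"run_accession": "RUN_A", "experiment_accession": "EXP_1"},
    {"run_accession": "RUN_B", "experiment_accession": "EXP_1"},
    {"run_accession": "RUN_C", "experiment_accession": "EXP_2"},
]
for experiment, runs in group_runs(rows, "experiment_accession").items():
    print(experiment, runs)
```

Groups with more than one Run are the ones a merge option would combine; groups with a single Run need no merging.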

Download FASTQs associated with a Sample

Ok, this time you just want a single Sample, or BioSample.

```{bash}
fastq-dl --accession SRS1904245 --provider SRA
```

The above command would download the Run accessions from SRA that fall under Sample SRS1904245.

Similar to Experiment accessions, the relationship of Sample to Run is 1-to-many: there can be many Run accessions associated with a single Sample accession. In most cases it is a 1-to-1 relationship, but when it is not, you can use --group-by-sample to merge multiple Runs associated with a Sample accession into a single FASTQ file.

_Warning! For some type strains (e.g. S. aureus USA300) a BioSample accession might be associated with 100s or 1000s of Run accessions. These Runs are likely associated with many different conditions and really should not fall under a single BioSample accession. Please consider this when using --group-by-sample._

Download FASTQs associated with a Run

Let's keep it super simple and just download a Run.

```{bash}
fastq-dl --accession SRR1178105 --provider SRA
```

The above command would download the Run SRR1178105 from SRA. Run accessions are the end of the line (1-to-1 relationship), so you will always get the expected Run.

Motivation

fastq-dl is a spin-off of ena-dl (pre-2017) that has been developed for use with Bactopia. With this in mind, EBI and NCBI provide their own tools (enaBrowserTools and SRA Toolkit) that offer more extensive access to their databases.

Owner

Name: Robert A. Petit III
Login: rpetit3
Kind: user
Location: Cheyenne, WY
Company: Wyoming Public Health Laboratory

Website: https://www.robertpetit.com/
Twitter: rpetit3
Repositories: 147
Profile: https://github.com/rpetit3

Bioinformatician at the Wyoming Public Health Laboratory. Developer of Bactopia and other microbial genomic tools.

Citation (citation.cff)

cff-version: 1.2.0
message: "If you use fastq-dl, please cite it as below."
authors:
- family-names: "Petit III"
  given-names: "Robert A."
  orcid: "https://orcid.org/0000-0002-1350-9426"
- family-names: "Hall"
  given-names: "Michael B."
  orcid: "https://orcid.org/0000-0003-3683-6208"
- family-names: "Tonkin-Hill"
  given-names: "Gerry"
  orcid: "https://orcid.org/0000-0002-1350-9426"
- family-names: "Zhu"
  given-names: "Jie"
- family-names: "Read"
  given-names: "Timothy D."
  orcid: "https://orcid.org/0000-0001-8966-9680"
title: "fastq-dl: efficiently download FASTQ files from SRA or ENA repositories"
url: "https://github.com/rpetit3/fastq-dl"
version: 2.0.2

GitHub Events

Total

Create event: 7
Issues event: 13
Release event: 5
Watch event: 60
Delete event: 3
Issue comment event: 31
Push event: 12
Pull request event: 1
Fork event: 3

Last Year

Create event: 7
Issues event: 13
Release event: 5
Watch event: 60
Delete event: 3
Issue comment event: 31
Push event: 12
Pull request event: 1
Fork event: 3

Committers

Last synced: 9 months ago

All Time

Total Commits: 130
Total Committers: 4
Avg Commits per committer: 32.5
Development Distribution Score (DDS): 0.346

Past Year

Commits: 18
Committers: 2
Avg Commits per committer: 9.0
Development Distribution Score (DDS): 0.333

Top Committers

Name	Email	Commits
Robert A. Petit III	r**t@g**m	85
Michael Hall	m**l@m**h	40
Gerry Tonkin-Hill	g**l@g**m	4
Jie Zhu	a**j@g**m	1

Committer Domains (Top 20 + Academic)

mbh.sh: 1

Issues and Pull Requests

Last synced: 6 months ago

All Time

Total issues: 27
Total pull requests: 20
Average time to close issues: 2 months
Average time to close pull requests: 2 months
Total issue authors: 24
Total pull request authors: 6
Average comments per issue: 2.85
Average comments per pull request: 1.4
Merged pull requests: 12
Bot issues: 0
Bot pull requests: 6

Past Year

Issues: 14
Pull requests: 6
Average time to close issues: 5 days
Average time to close pull requests: 19 days
Issue authors: 12
Pull request authors: 3
Average comments per issue: 0.64
Average comments per pull request: 1.17
Merged pull requests: 2
Bot issues: 0
Bot pull requests: 2

Top Authors

Issue Authors

mbhall88 (3)
freekvh (2)
pooser (1)
Benjamin-Valderrama (1)
Sven-Winter (1)
songmj86 (1)
alienzj (1)
MostafaYA (1)
reetm09 (1)
kapsakcj (1)
ConYel (1)
RZ9082 (1)
theosanderson (1)
gbouras13 (1)
rpetit3 (1)

Pull Request Authors

mbhall88 (6)
dependabot[bot] (6)
rpetit3 (4)
rraadd88 (2)
gtonkinhill (1)
alienzj (1)

Top Labels

Issue Labels

bug (1)

Pull Request Labels

dependencies (6)
python (2)

Packages

Total packages: 1
Total downloads:
- pypi 142 last-month
Total docker downloads: 77

Total dependent packages: 0
Total dependent repositories: 1
Total versions: 7
Total maintainers: 1

pypi.org: fastq-dl

Download FASTQ files from SRA or ENA repositories.

Homepage: https://github.com/rpetit3/fastq-dl
Documentation: https://fastq-dl.readthedocs.io/
License: MIT
Latest release: 3.0.1
published 12 months ago

Versions: 7
Dependent Packages: 0
Dependent Repositories: 1
Downloads: 142 Last month
Docker Downloads: 77

Rankings

Docker downloads count: 3.8%

Stargazers count: 5.0%

Forks count: 8.9%

Dependent packages count: 10.1%

Average: 13.0%

Dependent repos count: 21.6%

Downloads: 28.6%

Maintainers (1)

rpetit3

Last synced: 6 months ago

fastq-dl

Science Score: 67.0%
