fetch_ngs

Workflow to Fetch Public Sequencing Data and Metadata Using iSeq and MrBiomics Module.

https://github.com/epigen/fetch_ngs

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 14 DOI reference(s) in README
  • Academic publication links
    Links to: ncbi.nlm.nih.gov, zenodo.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (12.2%) to scientific vocabulary

Keywords

bam database fastq genomics next-generation-sequencing ngs repository
Last synced: 4 months ago

Repository

Basic Info
Statistics
  • Stars: 10
  • Watchers: 1
  • Forks: 0
  • Open Issues: 1
  • Releases: 6
Created 10 months ago · Last pushed 7 months ago

README.md

Fetch Public Sequencing Data and Metadata Using iSeq

A Snakemake 8 workflow to fetch (download) and process public sequencing data and metadata from GSA, SRA, ENA, GEO and DDBJ databases using iSeq.

> [!NOTE]
> This workflow adheres to the module specifications of MrBiomics, an effort to augment research by modularizing (biomedical) data science. For more details, instructions, and modules check out the project's repository.

⭐️ Star and share modules you find valuable 📤 - help others discover them, and guide our future work!

> [!IMPORTANT]
> If you use this workflow in a publication, please don't forget to give credit to the authors by citing it using this DOI 10.5281/zenodo.15005419.

Workflow Rulegraph

🖋️ Authors

💿 Software

This project wouldn't be possible without the following software and their dependencies.

| Software | Reference (DOI) |
| :---: | :---: |
| iSeq | https://github.com/BioOmics/iSeq |
| pandas | https://doi.org/10.5281/zenodo.3509134 |
| Picard | https://broadinstitute.github.io/picard/ |
| Snakemake | https://doi.org/10.12688/f1000research.29032.2 |

🔬 Methods

This is a template for the Methods section of a scientific publication and is intended to serve as a starting point. Only retain paragraphs relevant to your analysis. References [ref] to the respective publications are curated in the software table above. Versions (ver) have to be read out from the respective conda environment specifications (workflow/envs/*.yaml file) or post-execution in the result directory ({module}/envs/*.yaml). Parameters that have to be adapted depending on the data or workflow configurations are denoted in square brackets, e.g., [X].

Data Acquisition & Processing. Public sequencing data were retrieved from [GSA|SRA|ENA|DDBJ] under the accession(s) [accession_ids] using iSeq (ver) [ref]. The data were downloaded as FASTQ files (and converted to unmapped BAM (uBAM) files using Picard FastqToSam (ver) [ref], preserving sample information and read groups while supporting both single-end and paired-end sequencing data). Metadata for each dataset were collected and merged into a single comprehensive reference file.

The data acquisition and processing described here were performed using a publicly available Snakemake (ver) [ref] workflow (DOI: 10.5281/zenodo.15005419).

🚀 Features

The workflow performs the following steps that produce the outlined results:

  • Data Acquisition
    • Downloads sequencing data from public repositories GSA, SRA, ENA, and DDBJ using various accession ID types
    • Extracts comprehensive metadata for each dataset
    • Supports parallel downloading for improved performance using threads
  • Data Processing
    • Automatic handling of both single-end and paired-end sequencing data
    • Creation of a unified comprehensive metadata file with accession IDs and file paths
    • Optional conversion from FASTQ (as *.fastq.gz) to unmapped BAM (as *.bam) format using Picard's FastqToSam
  • Metadata-only mode for quick exploration without downloading sequence files (metadata_only: 1)
  • Considerations
    • Dependent on iSeq's supported repositories and accession types
    • Requires internet connectivity and sufficient storage space for downloaded data
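The optional FASTQ to unmapped BAM step can be pictured as building one Picard FastqToSam call per sample, with a second FASTQ file added only for paired-end data. The following is a minimal sketch, not the workflow's actual rule: the function name, the `picard` wrapper executable, and the choice to reuse the sample name as the read group name are assumptions made for illustration. It uses Picard's `KEY=VALUE` argument style with the documented FastqToSam arguments `FASTQ`, `FASTQ2`, `OUTPUT`, `SAMPLE_NAME` and `READ_GROUP_NAME`.

```python
from pathlib import Path
from typing import List, Optional


def fastq_to_sam_command(
    sample: str,
    fastq1: Path,
    output_bam: Path,
    fastq2: Optional[Path] = None,
    executable: str = "picard",  # assumption: conda-style wrapper on PATH
) -> List[str]:
    """Build a Picard FastqToSam argument list for single- or paired-end input."""
    command = [
        executable,
        "FastqToSam",
        f"FASTQ={fastq1}",
        f"OUTPUT={output_bam}",
        f"SAMPLE_NAME={sample}",
        # Assumption: the sample name doubles as the read group name.
        f"READ_GROUP_NAME={sample}",
    ]
    if fastq2 is not None:
        # A second FASTQ file marks the library as paired-end.
        command.insert(3, f"FASTQ2={fastq2}")
    return command
```

The returned list can be passed to `subprocess.run(command, check=True)`; returning the arguments instead of running them keeps the sketch easy to inspect and test.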

The workflow produces the following directory structure:

    {result_path}/
    └── fetch_ngs/
        ├── metadata.csv                      # merged metadata for all accessions
        ├── .fastq_to_bam/                    # processing marker files
        │   └── [accession].done
        └── [accession]/                      # one directory per accession
            ├── [accession].metadata.csv      # metadata for this accession
            └── [sample].[bam/fastq.gz]       # sequence files
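To illustrate how the per-accession metadata files in this layout relate to the merged `metadata.csv`, here is a minimal sketch using only the Python standard library. It is not the workflow's own implementation (which lists pandas among its dependencies). The function name and the added `source_accession` and `directory` columns are hypothetical, and the sketch makes no assumption about the columns iSeq writes.

```python
import csv
from pathlib import Path


def merge_metadata(fetch_dir: Path, out_name: str = "metadata.csv") -> Path:
    """Merge every [accession]/[accession].metadata.csv under fetch_dir into one CSV."""
    rows, columns = [], []
    for meta_file in sorted(fetch_dir.glob("*/*.metadata.csv")):
        with meta_file.open(newline="") as handle:
            for row in csv.DictReader(handle):
                # Hypothetical bookkeeping columns, added for illustration only.
                row["source_accession"] = meta_file.parent.name
                row["directory"] = str(meta_file.parent)
                rows.append(row)
                # Keep the union of all columns in first-seen order.
                for key in row:
                    if key not in columns:
                        columns.append(key)
    out_path = fetch_dir / out_name
    with out_path.open("w", newline="") as handle:
        # restval fills columns that a given accession does not provide.
        writer = csv.DictWriter(handle, fieldnames=columns, restval="")
        writer.writeheader()
        writer.writerows(rows)
    return out_path
```

Taking the union of columns means accessions from different repositories, which may report different metadata fields, can still share one table.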

🛠️ Usage

Here are some tips for the usage of this workflow:

  • Run your workflow with snakemake --resources parallel_downloads=3 to restrict concurrent download jobs to three (worked well for me), thereby reducing the risk of triggering IP blacklisting from excessive parallel FTP connections. This can also be achieved by using the workflow's profile. In case of usage as module, put the parameter into the parent workflow's profile.
  • Specify accession IDs in the configuration file as a list to download multiple datasets in one run
  • Use metadata_only: 1 for a quick preview of available data before committing to full downloads
  • Choose between FASTQ or BAM output formats based on your downstream analysis needs
  • For large datasets, consider increasing threads and mem parameters
  • For super series (e.g., GSE) or projects containing many samples, start by running in metadata_only: 1 mode to extract run accession IDs. Then use these IDs in the config to enable maximum parallelization, avoiding sequential download and conversion.
  • The merged metadata file can be used as a basis for sample annotation files downstream
  • BAM output format (output_format: bam) is recommended for direct integration with BAM compatible downstream analysis workflows
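The tip about super series can be sketched as a small helper that reads the merged metadata from a metadata_only: 1 run and prints the run accession IDs as a list to paste into the configuration. This is only an illustration under stated assumptions: the column name `Run` and the config key `accession_ids` are hypothetical, so check the actual metadata header and ./config/README.md for the real names.

```python
import csv
from pathlib import Path
from typing import List


def run_accessions(metadata_csv: Path, column: str = "Run") -> List[str]:
    """Return unique, non-empty values of `column` in first-seen order."""
    seen = {}  # dicts preserve insertion order, so this deduplicates stably
    with metadata_csv.open(newline="") as handle:
        for row in csv.DictReader(handle):
            value = (row.get(column) or "").strip()
            if value:
                seen.setdefault(value, None)
    return list(seen)


def as_yaml_list(key: str, values: List[str]) -> str:
    """Render a flat YAML list by hand; enough for plain accession strings."""
    lines = [f"{key}:"] + [f"  - {value}" for value in values]
    return "\n".join(lines)
```

For example, `print(as_yaml_list("accession_ids", run_accessions(Path("metadata.csv"))))` prints one list entry per run, which lets Snakemake schedule each run as an independent download job.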

⚙️ Configuration

Detailed specifications can be found here ./config/README.md

📖 Examples

Explore detailed examples showcasing module usage in our comprehensive end-to-end MrBiomics Recipes, including data, configuration, annotation and results:

  • ATAC-seq Analysis Recipe
  • RNA-seq Analysis Recipe

🔗 Links

📚 Resources

📑 Publications

The following publications successfully used this module for their analyses.

  • FirstAuthors et al. (202X) Journal Name - Paper Title.
  • ...

Owner

  • Name: Computational Epigenetics
  • Login: epigen
  • Kind: organization
  • Location: Vienna, Austria

Computational Epigenetics Research and Software

Citation (CITATION.cff)

# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!

cff-version: 1.2.0
title: >-
  Workflow to Fetch Public Sequencing Data and Metadata
  Using iSeq
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: software
authors:
  - given-names: Stephan
    family-names: Reichl
    orcid: 'https://orcid.org/0000-0001-8555-7198'
    affiliation: CeMM Research Center for Molecular Medicine
  - given-names: Christoph
    family-names: Bock
    orcid: 'https://orcid.org/0000-0001-6091-3088'
    affiliation: CeMM Research Center for Molecular Medicine
identifiers:
  - type: doi
    value: 10.5281/zenodo.15005419
    description: >-
      This DOI represents all versions, and will always
      resolve to the latest one.
repository-code: 'https://github.com/epigen/fetch_ngs'
url: 'https://epigen.github.io/fetch_ngs/'
abstract: >-
  A Snakemake workflow to fetch (download) and process
  public sequencing data and metadata from GSA, SRA, ENA,
  GEO and DDBJ databases using iSeq.
keywords:
  - Bioinformatics
  - Workflow
  - Databases
  - Metadata
  - NGS
  - Snakemake
license: MIT

GitHub Events

Total
  • Create event: 6
  • Issues event: 9
  • Release event: 6
  • Watch event: 12
  • Member event: 2
  • Public event: 1
  • Push event: 10
Last Year
  • Create event: 6
  • Issues event: 9
  • Release event: 6
  • Watch event: 12
  • Member event: 2
  • Public event: 1
  • Push event: 10

Issues and Pull Requests

Last synced: 9 months ago

All Time
  • Total issues: 6
  • Total pull requests: 0
  • Average time to close issues: about 24 hours
  • Average time to close pull requests: N/A
  • Total issue authors: 1
  • Total pull request authors: 0
  • Average comments per issue: 0.0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 6
  • Pull requests: 0
  • Average time to close issues: about 24 hours
  • Average time to close pull requests: N/A
  • Issue authors: 1
  • Pull request authors: 0
  • Average comments per issue: 0.0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • sreichl (6)