Science Score: 57.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 2 DOI reference(s) in README
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (12.8%) to scientific vocabulary
Last synced: 10 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: CMG-GUTS
  • License: mit
  • Language: Nextflow
  • Default Branch: main
  • Size: 22.3 MB
Statistics
  • Stars: 1
  • Watchers: 0
  • Forks: 0
  • Open Issues: 0
  • Releases: 2
Created 12 months ago · Last pushed 10 months ago
Metadata Files
Readme Changelog License Citation

README.md


Introduction metaBIOMx

The metagenomics microbiomics pipeline (metaBIOMx) is a best-practice suite for the decontamination and annotation of short-read shotgun sequencing data. The pipeline combines nf-core modules with local modules that follow the same format. It can be run with either Docker or Singularity containers.


Pipeline summary

The pipeline performs taxonomic annotation on either reads (single or paired) or contigs. The subworkflows to run are selected via the --bypass_<method> flags; run the pipeline with --help for a full overview. When a database path is provided, the pipeline checks by default whether the expected databases are present there in the right format. If they are not, compatible databases are downloaded automatically.

For both subworkflows, the pipeline trims reads with Trimmomatic and/or AdapterRemoval and then removes human reads with Kneaddata. Read quality is assessed with FastQC before and after each step, and a MultiQC report is produced as output. Taxonomic annotation then proceeds as follows:

Read annotation

- Paired reads are interleaved using BBTools.
- MetaPhlAn3 and HUMAnN3 are used for taxonomic and functional profiling.
- Taxonomy profiles are merged into a single BIOM file using biom-format.

Contig annotation

- Reads are assembled with SPAdes.
- Contig quality is assessed with BUSCO.
- Taxonomy profiles are created using CAT.
- Read abundance on the contigs is estimated using Bowtie2 and BCFtools.
- Contigs are selected if at least one read aligns to them, and a BIOM file is generated using biom-format.

Installation

> [!NOTE]
> Make sure you have installed the latest Nextflow version!

Clone the repository in a directory of your choice:

```bash
git clone https://github.com/CMG-GUTS/metabiomx.git
```

The pipeline is containerised, so it can be run with Docker or Singularity images. The docker profile requires no further action, except that a Docker registry must be configured on your local system (see Docker). When Singularity is used, images are cached automatically within the project directory.

Usage

Since the latest version, metaBIOMx accepts either a samplesheet (CSV) or a path to the input files. A samplesheet is preferred.

```bash
# Samplesheet input
nextflow run main.nf --input <samplesheet.csv> -work-dir work -profile singularity

# Path (glob pattern) input
nextflow run main.nf --input <'*_{1,R1,2,R2}.{fq,fq.gz,fastq,fastq.gz}'> -work-dir work -profile singularity
```

📋 Sample Metadata File Specification

metaBIOMx expects your sample input data to follow a simple, but strict structure to ensure compatibility and allow upfront validation. The input should be provided as a CSV file where each entry = one sample with specified sequencing file paths. Additional properties not mentioned here will be ignored by the validation step.


Minimum requirement

  • sample_id ➡ every entry must have a unique, non-empty sample identifier.
  • No spaces are allowed in sample IDs — use underscores _ or dashes - instead.
  • forward_read ➡ every entry must provide a path to an existing forward read FASTQ file (gzipped).
  • If reverse_read is provided, forward_read must also be present.

Example:

| sample_id | forward_read | reverse_read |
|-----------|--------------------|--------------------|
| sample1 | sample1R1.fastq.gz | sample1R2.fastq.gz |
| sample2 | D0293271.fastq.gz | D0293272.fastq.gz |
| S3 | L9283R1.fastq.gz | L9283R1.fastq.gz |
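For a directory of FASTQ files, a samplesheet in this layout can be generated rather than written by hand. The following is a minimal sketch, not part of metaBIOMx: the function name `make_samplesheet` is hypothetical, and it assumes files are named `<sample_id>_R1.fastq.gz` and `<sample_id>_R2.fastq.gz`. Whether the pipeline accepts an empty reverse_read cell for single-end samples is not confirmed by this README, so check the output before use.

```shell
#!/usr/bin/env bash
# Sketch: build a samplesheet from FASTQ files in one directory.
# Assumption (not from the pipeline docs): forward reads end in
# _R1.fastq.gz and the matching reverse reads end in _R2.fastq.gz.

make_samplesheet() {
    local dir="$1"
    local fwd rev sample_id

    echo "sample_id,forward_read,reverse_read"
    for fwd in "$dir"/*_R1.fastq.gz; do
        [ -e "$fwd" ] || continue                 # glob matched nothing
        sample_id="$(basename "$fwd" _R1.fastq.gz)"
        rev="${fwd%_R1.fastq.gz}_R2.fastq.gz"
        if [ -e "$rev" ]; then
            echo "${sample_id},${fwd},${rev}"
        else
            echo "${sample_id},${fwd},"           # no mate found: reverse_read left empty
        fi
    done
}
```

Calling `make_samplesheet path/to/fastq > samplesheet.csv` writes one row per forward read file, in the glob's sorted order.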


Properties and Validation Rules

🔹 Required properties

| Property | Type | Rules / Description |
|--------------|--------|----------------------------------------------------------------------------------------------------|
| sample_id | string | Unique sample ID with no spaces (^\S+$). Serves as an identifier. |
| forward_read | string | File path to forward sequencing read. Must be non-empty string matching FASTQ gzipped pattern. File must exist. |

🔹 Optional property

| Property | Type | Rules / Description |
|----------------|--------|----------------------------------------------------------------------------------------------------|
| reverse_read | string | File path to reverse sequencing read. Same constraints as forward_read. Required if specified. |
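The rules above can be checked locally before starting a run. The sketch below is an illustration only and is not the pipeline's own validation step: the function names `is_gz_fastq` and `check_samplesheet` are hypothetical, the columns are assumed to appear in the order sample_id, forward_read, reverse_read without quoted fields, and the gzipped FASTQ pattern is assumed to mean a `.fq.gz` or `.fastq.gz` suffix.

```shell
#!/usr/bin/env bash
# Sketch: pre-check a samplesheet against the documented rules.
# Assumptions: comma-separated file, header on line 1, columns in the
# order sample_id,forward_read,reverse_read, no quoted fields.

# True if the path has a gzipped FASTQ suffix (assumed pattern) and exists.
is_gz_fastq() {
    case "$1" in
        *.fq.gz|*.fastq.gz) [ -f "$1" ] ;;
        *) return 1 ;;
    esac
}

check_samplesheet() {
    # Skip the header and drop carriage returns; the subshell's exit
    # status (0 = all rows valid) becomes the function's return value.
    tail -n +2 "$1" | tr -d '\r' | (
        status=0; line=1; seen=","
        while IFS=, read -r sample_id forward_read reverse_read _extra \
                || [ -n "$sample_id" ]; do
            line=$((line + 1))
            # sample_id must be non-empty and free of whitespace (^\S+$).
            case "$sample_id" in
                ""|*[[:space:]]*)
                    echo "line $line: invalid sample_id" >&2; status=1 ;;
            esac
            # sample_id must be unique across the sheet.
            case "$seen" in
                *",$sample_id,"*)
                    echo "line $line: duplicate sample_id '$sample_id'" >&2; status=1 ;;
            esac
            seen="$seen$sample_id,"
            # forward_read is required; reverse_read is checked only if given.
            if ! is_gz_fastq "$forward_read"; then
                echo "line $line: bad forward_read '$forward_read'" >&2; status=1
            fi
            if [ -n "$reverse_read" ] && ! is_gz_fastq "$reverse_read"; then
                echo "line $line: bad reverse_read '$reverse_read'" >&2; status=1
            fi
        done
        exit "$status"
    )
}
```

Running `check_samplesheet samplesheet.csv` prints one message per violation to standard error and returns a non-zero status if any row fails. Relative read paths are resolved against the current working directory.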

Example cases

🔹 Read annotation

```bash
# Optional flags: --bypass_trim, --bypass_decon
nextflow run main.nf \
    --input <samplesheet.csv> \
    --bypass_contig_annotation \
    -work-dir work \
    -profile singularity
```

🔹 Contig annotation

```bash
# Optional flags: --bypass_trim, --bypass_decon
nextflow run main.nf \
    --input <samplesheet.csv> \
    --bypass_read_annotation \
    -work-dir work \
    -profile singularity
```

In case you only have assemblies and wish to perform contig annotation:

```bash
nextflow run main.nf \
    --input <samplesheet.csv> \
    --bypass_assembly \
    --bypass_read_annotation \
    -work-dir work \
    -profile singularity
```

Automatic database setup

The pipeline requires a set of databases that are used by the different tools within this workflow. Specify the locations where the databases should be downloaded. It is also possible to download the databases manually. The configure subworkflow automatically checks the database format and the presence of compatible files.

```bash
nextflow run main.nf \
    --bowtie_db path/to/db/bowtie2 \
    --metaphlan_db path/to/db/metaphlan \
    --humann_db path/to/db/humann \
    --cat_pack_db path/to/db/catpack \
    --busco_db path/to/db/busco_downloads \
    -work-dir <work/dir> \
    -profile <singularity,docker>
```

Manual database setup

### HUMAnN3 and MetaPhlAn3 DB

Make sure that `path/to/db/humann` contains a `chocophlan`, `uniref` and `utility_mapping` directory. These can be obtained with the following commands:

```bash
docker pull biobakery/humann:latest
docker run --rm -v $(pwd):/scripts biobakery/humann:latest \
    humann_databases --download chocophlan full ./path/to/db/humann \
    && humann_databases --download uniref uniref90_diamond ./path/to/db/humann \
    && humann_databases --download utility_mapping full ./path/to/db/humann
```

### MetaPhlAn DB

```bash
wget http://cmprod1.cibio.unitn.it/biobakery4/metaphlan_databases/mpa_vJun23_CHOCOPhlAnSGB_202403.tar \
    && tar -xvf mpa_vJun23_CHOCOPhlAnSGB_202403.tar -C path/to/db/metaphlan \
    && rm mpa_vJun23_CHOCOPhlAnSGB_202403.tar

wget http://cmprod1.cibio.unitn.it/biobakery4/metaphlan_databases/bowtie2_indexes/mpa_vJun23_CHOCOPhlAnSGB_202403_bt2.tar \
    && tar -xvf mpa_vJun23_CHOCOPhlAnSGB_202403_bt2.tar -C path/to/db/metaphlan \
    && rm mpa_vJun23_CHOCOPhlAnSGB_202403_bt2.tar

echo 'mpa_vJun23_CHOCOPhlAnSGB_202403' > path/to/db/metaphlan/mpa_latest
```

### Kneaddata DB

```bash
docker pull agusinac/kneaddata:latest
docker run --rm -v $(pwd):/scripts agusinac/kneaddata:latest \
    kneaddata_database \
    --download human_genome bowtie2 ./path/to/db/bowtie2
```

### CAT_pack DB

A pre-constructed diamond database can be [downloaded](https://tbb.bio.uu.nl/tina/CAT_pack_prepare/) manually or with the following commands:

```bash
docker pull agusinac/catpack:latest
docker run --rm -v $(pwd):/scripts agusinac/catpack:latest \
    CAT_pack download \
    --db nr \
    -o path/to/db/catpack
```

### busco DB

BUSCO expects the directory to be called `busco_downloads`.

```bash
docker pull ezlabgva/busco:v5.8.2_cv1
docker run --rm -v $(pwd):/scripts ezlabgva/busco:v5.8.2_cv1 \
    busco \
    --download bacteria_odb12 \
    --download_path path/to/db/busco_downloads
```
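After a manual download, the directory layout described in this section can be sanity-checked before launching the pipeline. This is a rough sketch under stated assumptions and does not reproduce the configure subworkflow: the function names are hypothetical, and only the layout facts stated above are tested (the three HUMAnN subdirectories, a non-empty `mpa_latest` file for MetaPhlAn, and the `busco_downloads` directory name).

```shell
#!/usr/bin/env bash
# Sketch: check manually downloaded databases against the layout
# described in this README. Function names are illustrative only.

# HUMAnN: the database directory must contain three subdirectories.
check_humann_db() {
    local db="$1" missing=0 sub
    for sub in chocophlan uniref utility_mapping; do
        if [ ! -d "$db/$sub" ]; then
            echo "missing directory: $db/$sub" >&2
            missing=1
        fi
    done
    return "$missing"
}

# MetaPhlAn: mpa_latest must exist and name the index version.
check_metaphlan_db() {
    [ -s "$1/mpa_latest" ]
}

# BUSCO: the directory itself must be called busco_downloads.
check_busco_db() {
    [ -d "$1" ] && [ "$(basename "$1")" = "busco_downloads" ]
}
```

Each function returns zero when the layout matches, so the checks can be chained, for example `check_humann_db path/to/db/humann && check_metaphlan_db path/to/db/metaphlan && check_busco_db path/to/db/busco_downloads`.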

Support

If you are having issues, please create an issue.

Citations

You can cite metaBIOMx using the following DOI: https://doi.org/10.48546/workflowhub.workflow.1787.5

An extensive list of references for the tools used by the pipeline can be found in the CITATIONS.md file.

Owner

  • Name: Cancer Microbiology Group
  • Login: CMG-GUTS
  • Kind: organization
  • Email: Annemarie.Boleij@radboudumc.nl

Citation (CITATIONS.md)

# metabiomx: Citations

## [nf-core](https://pubmed.ncbi.nlm.nih.gov/32055031/)

> Ewels PA, Peltzer A, Fillinger S, Patel H, Alneberg J, Wilm A, Garcia MU, Di Tommaso P, Nahnsen S. The nf-core framework for community-curated bioinformatics pipelines. Nat Biotechnol. 2020 Mar;38(3):276-278. doi: 10.1038/s41587-020-0439-x. PubMed PMID: 32055031.

## [Nextflow](https://pubmed.ncbi.nlm.nih.gov/28398311/)

> Di Tommaso P, Chatzou M, Floden EW, Barja PP, Palumbo E, Notredame C. Nextflow enables reproducible computational workflows. Nat Biotechnol. 2017 Apr 11;35(4):316-319. doi: 10.1038/nbt.3820. PubMed PMID: 28398311.

## Pipeline tools

- [AdapterRemoval2](https://doi.org/10.1186/s13104-016-1900-2)

  > Schubert, M., Lindgreen, S., and Orlando, L. 2016. "AdapterRemoval v2: Rapid Adapter Trimming, Identification, and Read Merging." BMC Research Notes 9 (February): 88. doi: 10.1186/s13104-016-1900-2

- [Trimmomatic](https://doi.org/10.1093/bioinformatics/btu170)

  > Bolger AM, Lohse M, Usadel B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics. 2014 Aug 1;30(15):2114-20. doi: 10.1093/bioinformatics/btu170. Epub 2014 Apr 1. PMID: 24695404; PMCID: PMC4103590.

- [BBTools](https://doi.org/10.1371/journal.pone.0185056)

  > Bushnell B, Rood J, Singer E (2017) BBMerge – Accurate paired shotgun read merging via overlap. PLOS ONE 12(10): e0185056.

- [BCFtools](https://doi.org/10.1093/gigascience/giab008)

  > Danecek, Petr, et al. "Twelve years of SAMtools and BCFtools." Gigascience 10.2 (2021): giab008. doi: 10.1093/gigascience/giab008

- [Bowtie2](https://dx.doi.org/10.1038/nmeth.1923)

  > Langmead, B. and Salzberg, S. L. 2012 Fast gapped-read alignment with Bowtie 2. Nature methods, 9(4), p. 357–359. doi: 10.1038/nmeth.1923.

- [Busco](https://doi.org/10.1007/978-1-4939-9173-0_14)

  > Seppey, M., Manni, M., & Zdobnov, E. M. (2019). BUSCO: assessing genome assembly and annotation completeness. In Gene prediction (pp. 227-245). Humana, New York, NY. doi: 10.1007/978-1-4939-9173-0_14.

- [CAT](https://doi.org/10.1186/s13059-019-1817-x)

  > von Meijenfeldt, F. B., Arkhipova, K., Cambuy, D. D., Coutinho, F. H., & Dutilh, B. E. (2019). Robust taxonomic classification of uncharted microbial sequences and bins with CAT and BAT. Genome biology, 20(1), 1-14. doi: 10.1186/s13059-019-1817-x.

- [FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/)

  > Andrews, S. (2010). FastQC: A Quality Control Tool for High Throughput Sequence Data [Online].

- [MultiQC](https://pubmed.ncbi.nlm.nih.gov/27312411/)

  > Ewels P, Magnusson M, Lundin S, Käller M. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016 Oct 1;32(19):3047-8. doi: 10.1093/bioinformatics/btw354. Epub 2016 Jun 16. PubMed PMID: 27312411; PubMed Central PMCID: PMC5039924.

- [SPAdes](https://doi.org/10.1101/gr.213959.116)

  > Nurk, S., Meleshko, D., Korobeynikov, A., & Pevzner, P. A. (2017). metaSPAdes: a new versatile metagenomic assembler. Genome research, 27(5), 824-834. doi: 10.1101/gr.213959.116.

- [MetaPhlAn3/HUMAnN3](https://doi.org/10.7554/eLife.65088)

  > Francesco Beghini, Lauren J McIver, Aitor Blanco-Míguez, Leonard Dubois, Francesco Asnicar, Sagun Maharjan, Ana Mailyan, Paolo Manghi, Matthias Scholz, Andrew Maltez Thomas, Mireia Valles-Colomer, George Weingart, Yancong Zhang, Moreno Zolfo, Curtis Huttenhower, Eric A Franzosa, Nicola Segata (2021) Integrating taxonomic, functional, and strain-level profiling of diverse microbial communities with bioBakery 3. eLife 10:e65088.

- [biom-format](https://doi.org/10.1186/2047-217X-1-7)

  > Daniel McDonald, Jose C Clemente, Justin Kuczynski, Jai Ram Rideout, Jesse Stombaugh, Doug Wendel, Andreas Wilke, Susan Huse, John Hufnagle, Folker Meyer, Rob Knight, J Gregory Caporaso, The Biological Observation Matrix (BIOM) format or: how I learned to stop worrying and love the ome-ome, GigaScience, Volume 1, Issue 1, December 2012, 2047–217X–1–7, https://doi.org/10.1186/2047-217X-1-7

## Software packaging/containerisation tools
- [Docker](https://dl.acm.org/doi/10.5555/2600239.2600241)

  > Merkel, D. (2014). Docker: lightweight linux containers for consistent development and deployment. Linux Journal, 2014(239), 2. doi: 10.5555/2600239.2600241.

- [Singularity](https://pubmed.ncbi.nlm.nih.gov/28494014/)

  > Kurtzer GM, Sochat V, Bauer MW. Singularity: Scientific containers for mobility of compute. PLoS One. 2017 May 11;12(5):e0177459. doi: 10.1371/journal.pone.0177459. eCollection 2017. PubMed PMID: 28494014; PubMed Central PMCID: PMC5426675.

GitHub Events

Total
  • Create event: 5
  • Issues event: 1
  • Release event: 6
  • Watch event: 1
  • Delete event: 2
  • Member event: 1
  • Push event: 7
Last Year
  • Create event: 5
  • Issues event: 1
  • Release event: 6
  • Watch event: 1
  • Delete event: 2
  • Member event: 1
  • Push event: 7

Issues and Pull Requests

Last synced: 10 months ago

All Time
  • Total issues: 1
  • Total pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Total issue authors: 1
  • Total pull request authors: 0
  • Average comments per issue: 0.0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 1
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 1
  • Pull request authors: 0
  • Average comments per issue: 0.0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • agusinac (1)
Pull Request Authors
Top Labels
Issue Labels
enhancement (1)
Pull Request Labels

Dependencies

containers/docker/adapterremoval/Dockerfile docker
  • ubuntu 20.04 build
containers/docker/anot2biom/Dockerfile docker
  • ubuntu 20.04 build
containers/docker/bbtools/Dockerfile docker
  • ubuntu xenial build
containers/docker/bowtie2_samtools/Dockerfile docker
  • staphb/samtools latest build
containers/docker/cat_pack/Dockerfile docker
  • ubuntu 20.04 build
containers/docker/kneaddata/Dockerfile docker
  • ubuntu 20.04 build