mess

Snakemake pipeline for simulating shotgun metagenomic samples

https://github.com/metagenlab/mess

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 7 DOI reference(s) in README
  • Academic publication links
    Links to: zenodo.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (9.6%) to scientific vocabulary

Keywords

bioinformatics shotgun-metagenomics snakemake-workflow
Last synced: 6 months ago

Repository

Basic Info
Statistics
  • Stars: 23
  • Watchers: 5
  • Forks: 3
  • Open Issues: 6
  • Releases: 11
Created over 6 years ago · Last pushed 6 months ago
Metadata Files
Readme License Citation

README.md

Metagenomic Sequence Simulator (MeSS)

The Metagenomic Sequence Simulator (MeSS) is a Snakemake pipeline, implemented using Snaketool, for simulating Illumina, Oxford Nanopore (ONT) and Pacific Biosciences (PacBio) shotgun metagenomic samples.

:mag: Overview

MeSS takes NCBI taxa or local genome assemblies as input and generates either long (PacBio or ONT) or short (Illumina) reads. In addition to reads, MeSS can optionally generate BAM alignment files and taxonomic and sequence abundances in CAMI format.

```mermaid
%%{init: {'theme':'forest'}}%%
flowchart LR
    input["samples.tsv or samples/*.tsv"] --> taxons

    subgraph genome_download["genome download"]
        dlchoice{download ?}
        taxons["taxa or accessions"] --> dlchoice
        dlchoice -->|yes| assembly_finder
        dlchoice -->|no| fasta
        assembly_finder --> fasta
    end
    style genome_download color:#15161a

    input --> distchoice
    subgraph community_design["**community design**"]
        distchoice{draw distribution ?}
        distchoice -->|yes| dist["distribution (lognormal, even)"]
        dist --> abundances
        distchoice -->|no| reads
        distchoice -->|no| bases
        distchoice -->|no| abundances
        depth["coverage depth"]
        reads --> depth
        bases --> depth
        abundances["abundances (sequence, taxonomic)"] --> depth
    end
    style community_design color:#15161a

    fasta --> simulator
    depth --> simulator

    simulator["read simulator (art_illumina, pbsim3...)"]
    simulator --> bam
    simulator --> fastq
    simulator --> CAMI-profile

    %% subgraph color fills
    classDef red fill:#faeaea,color:#fff,stroke:#333;
    classDef blue fill:#eaecfa,color:#fff,stroke:#333;
    class genome_download blue
    class community_design red
```
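The "draw distribution" branch of the flowchart can be illustrated with a small sketch. This is not MeSS's implementation: the `draw_abundances` helper, its `mu`, `sigma` and `seed` parameters, and their defaults are assumptions made for illustration. It only shows the general idea of drawing lognormal (or even) weights and normalising them into relative abundances.

```python
import random


def draw_abundances(taxa, dist="lognormal", mu=1.0, sigma=1.0, seed=42):
    """Return {taxon: relative abundance} summing to 1 (hypothetical helper)."""
    if not taxa:
        return {}
    rng = random.Random(seed)  # seeded so that a draw can be reproduced
    if dist == "even":
        weights = [1.0] * len(taxa)
    elif dist == "lognormal":
        weights = [rng.lognormvariate(mu, sigma) for _ in taxa]
    else:
        raise ValueError(f"unsupported distribution: {dist}")
    total = sum(weights)
    # Normalise the raw weights into fractions of the community.
    return {taxon: weight / total for taxon, weight in zip(taxa, weights)}


if __name__ == "__main__":
    for taxon, fraction in draw_abundances([487, 727, 729]).items():
        print(taxon, f"{fraction:.3f}")
```

Fixing the seed keeps a simulated community reproducible between runs, which is usually desirable when benchmarking.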

:books: Documentation

More details can be found in the documentation

:zap: Quick start

:gear: Installation

  • Conda

```sh
conda create -n mess mess
```

  • Docker

```sh
docker pull ghcr.io/metagenlab/mess:latest
```

  • From source

```sh
git clone https://github.com/metagenlab/MeSS.git
pip install -e MeSS
```

:page_facing_up: Usage

:arrow_right: Input

Let's simulate two metagenomic samples with the following taxa and read counts in samples.tsv:

| sample | taxon | reads |
| --- | --- | --- |
| sample1 | 487 | 174840 |
| sample1 | 727 | 90679 |
| sample1 | 729 | 13129 |
| sample2 | 28132 | 147863 |
| sample2 | 199 | 147545 |
| sample2 | 729 | 131300 |
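For scripted setups, the same table can be generated programmatically. The following is a minimal sketch using only the Python standard library; `write_samples` and `reads_per_sample` are hypothetical helper names, not part of MeSS, and the sketch assumes the tab-separated `sample`, `taxon`, `reads` layout shown above.

```python
import csv
from collections import defaultdict
from pathlib import Path

# Rows copied from the samples.tsv table: (sample, NCBI taxon ID, read count).
ROWS = [
    ("sample1", 487, 174840),
    ("sample1", 727, 90679),
    ("sample1", 729, 13129),
    ("sample2", 28132, 147863),
    ("sample2", 199, 147545),
    ("sample2", 729, 131300),
]


def write_samples(path, rows=ROWS):
    """Write a tab-separated samples table with a header row."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t")
        writer.writerow(["sample", "taxon", "reads"])
        writer.writerows(rows)
    return Path(path)


def reads_per_sample(path):
    """Sum the requested read counts for each sample in the table."""
    totals = defaultdict(int)
    with open(path, newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            totals[row["sample"]] += int(row["reads"])
    return dict(totals)
```

Calling `write_samples("samples.tsv")` followed by `reads_per_sample("samples.tsv")` gives 278648 requested reads for sample1 and 426708 for sample2.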

:rocket: Command

```sh
mess run -i samples.tsv
```

[!IMPORTANT] Apptainer is the default and recommended dependency deployment method for maximum reproducibility.

To use conda instead, specify `--sdm conda`.
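When launching several simulations from a script, it can help to assemble the command line in one place. The sketch below is a hypothetical wrapper (`build_mess_command` is not part of MeSS) that only uses options appearing in this README (`-i`, `-o`, `--tech`, `--error`, `--rotate`, `--sdm`); it builds the argument list and does not validate the values.

```python
import shlex


def build_mess_command(input_tsv, outdir=None, tech=None, error=None,
                       rotate=None, sdm=None):
    """Assemble an argv list for `mess run` (hypothetical convenience helper)."""
    cmd = ["mess", "run", "-i", str(input_tsv)]
    if outdir is not None:
        cmd += ["-o", str(outdir)]
    if tech is not None:
        cmd += ["--tech", str(tech)]
    if error is not None:
        cmd += ["--error", str(error)]
    if rotate is not None:
        cmd += ["--rotate", str(rotate)]
    if sdm is not None:
        cmd += ["--sdm", str(sdm)]
    return cmd


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch has no side effects.
    print(shlex.join(build_mess_command("samples.tsv", sdm="conda")))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` once `mess` is installed; keeping the arguments as a list avoids shell quoting issues.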

:card_index_dividers: Outputs

  • Downloaded genomes in mess_out/assembly_finder/download

```sh
┣ 📂GCF_000144405.1
┃ ┗ 📜GCF_000144405.1_ASM14440v1_genomic.fna.gz
┣ 📂GCF_001298465.1
┃ ┗ 📜GCF_001298465.1_ASM129846v1_genomic.fna.gz
┣ 📂GCF_016127215.1
┃ ┗ 📜GCF_016127215.1_ASM1612721v1_genomic.fna.gz
┣ 📂GCF_020736045.1
┃ ┗ 📜GCF_020736045.1_ASM2073604v1_genomic.fna.gz
┗ 📂GCF_022869645.1
  ┗ 📜GCF_022869645.1_ASM2286964v1_genomic.fna.gz
```

  • Simulated reads in mess_out/fastq

```sh
┣ 📜sample1_R1.fq.gz
┣ 📜sample1_R2.fq.gz
┣ 📜sample2_R1.fq.gz
┗ 📜sample2_R2.fq.gz
```

[!TIP] By default, mess outputs paired Illumina reads with the Hiseq25k error profile. Other outputs and error profiles are described in the documentation.
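A quick way to check the simulated output is to count the records in each FASTQ file. The sketch below uses only the Python standard library and assumes standard four-line FASTQ records; `count_fastq_reads` is a hypothetical helper, and how the `reads` column maps onto paired files (total reads versus read pairs) is not stated here, so the count is best compared against the documentation.

```python
import gzip


def count_fastq_reads(path):
    """Count records in a gzipped FASTQ file, assuming 4 lines per record."""
    with gzip.open(path, "rt") as handle:
        n_lines = sum(1 for _ in handle)
    if n_lines % 4 != 0:
        raise ValueError(f"{path}: line count {n_lines} is not a multiple of 4")
    return n_lines // 4
```

For paired output, `count_fastq_reads("mess_out/fastq/sample1_R1.fq.gz")` and the matching `_R2` file should return the same number.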

:bar_chart: Resource usage

Using samples.tsv, mess runs in under 2 min and uses around 1.8 GB of physical RAM.

| task_id | hash | native_id | name | status | exit | submit | duration | realtime | %cpu | peak_rss | peak_vmem | rchar | wchar |
| ------- | --------- | --------- | -------- | --------- | ---- | ----------------------- | -------- | -------- | ------ | -------- | --------- | ------ | ------ |
| 1 | fe/03c2bc | 62286 | MESS (1) | COMPLETED | 0 | 2024-09-04 12:41:15.820 | 1m 50s | 1m 50s | 111.5% | 1.8 GB | 9 GB | 3.5 GB | 2.4 GB |
| 1 | ff/0d03b1 | 73355 | MESS (1) | COMPLETED | 0 | 2024-09-04 12:55:12.903 | 1m 52s | 1m 52s | 112.6% | 1.7 GB | 8.8 GB | 3.5 GB | 2.4 GB |
| 1 | 07/d352bf | 83576 | MESS (1) | COMPLETED | 0 | 2024-09-04 12:57:30.600 | 1m 50s | 1m 50s | 113.2% | 1.7 GB | 8.9 GB | 3.5 GB | 2.4 GB |

[!NOTE] Resource usage was measured three times with one CPU (using Nextflow, excluding dependency deployment time).
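The averages behind the summary figures can be recomputed from the three runs in the table. This is a small worked example; the values are transcribed by hand from the `duration` and peak RSS columns, with durations converted to seconds.

```python
from statistics import mean

# One entry per run in the table above (1m 50s = 110 s, 1m 52s = 112 s).
RUNS = [
    {"duration_s": 110, "peak_rss_gb": 1.8},
    {"duration_s": 112, "peak_rss_gb": 1.7},
    {"duration_s": 110, "peak_rss_gb": 1.7},
]

mean_duration = mean(run["duration_s"] for run in RUNS)
mean_peak_rss = mean(run["peak_rss_gb"] for run in RUNS)

print(f"mean duration: {mean_duration:.1f} s")     # mean duration: 110.7 s
print(f"mean peak RSS: {mean_peak_rss:.2f} GB")    # mean peak RSS: 1.73 GB
```

Both averages are consistent with the "under 2 min" and "around 1.8 GB" figures, the latter being the highest of the three peaks.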

More details in the resource usage documentation

:fire: Features

The examples below use phage.tsv:

| sample | taxon | cov_sim |
| :----- | :----- | :------ |
| phage | 347329 | 200 |

:dna: Multi sequencing technology

  • Illumina

```sh
mess run -i phage.tsv --tech illumina -o mess_out/illumina
seqkit stats --all -T -b mess_out/illumina/fastq/*
```

| file | num_seqs | sum_len | avg_len | N50 | Q20(%) | Q30(%) | AvgQual |
| :------------- | :------- | :------ | :------ | :-- | :----- | :----- | :------ |
| phage_R1.fq.gz | 44000 | 6600000 | 150.0 | 150 | 98.01 | 91.67 | 27.81 |
| phage_R2.fq.gz | 44000 | 6600000 | 150.0 | 150 | 97.31 | 89.65 | 26.52 |

  • Nanopore

```sh
mess run -i phage.tsv --tech nanopore -o mess_out/nanopore
seqkit stats --all -T -b mess_out/nanopore/fastq/*
```

| file | num_seqs | sum_len | avg_len | N50 | Q20(%) | Q30(%) | AvgQual |
| :---------- | :------- | :------- | :------ | :---- | :----- | :----- | :------ |
| phage.fq.gz | 1486 | 13203006 | 8884.9 | 12329 | 73.99 | 62.65 | 13.60 |

  • PacBio HiFi

```sh
mess run -i phage.tsv -o mess_out/pacbio --tech pacbio --error hifi
seqkit stats --all -T -b mess_out/pacbio/fastq/*
```

| file | num_seqs | sum_len | avg_len | N50 | Q20(%) | Q30(%) | AvgQual |
| :---------- | :------- | :------- | :------ | :---- | :----- | :----- | :------ |
| phage.fq.gz | 1430 | 12588621 | 8803.2 | 12666 | 99.92 | 99.78 | 40.51 |

[!NOTE] We use pbsim3 to simulate multi-pass CLR reads which are converted to HiFi reads with ccs.

PacBio HiFi read simulations usually take longer than simulations with other error profiles.
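The seqkit tables can be cross-checked against the requested `cov_sim` of 200. The sketch below assumes the conventional definition of coverage depth (total sequenced bases divided by genome length); whether MeSS computes `cov_sim` in exactly this way is not confirmed here. The `implied_genome_length` helper is hypothetical, and the base totals are copied from the `sum_len` columns above (both Illumina files added together).

```python
COV_SIM = 200  # requested coverage depth from phage.tsv

# Total simulated bases per technology, taken from the seqkit tables.
SUM_LEN = {
    "illumina": 2 * 6_600_000,   # R1 + R2
    "nanopore": 13_203_006,
    "pacbio_hifi": 12_588_621,
}


def implied_genome_length(total_bases, coverage):
    """Genome length implied by coverage = total_bases / genome_length."""
    return total_bases / coverage


for tech, bases in SUM_LEN.items():
    print(tech, round(implied_genome_length(bases, COV_SIM)))
# illumina 66000
# nanopore 66015
# pacbio_hifi 62943
```

Under this assumption, all three technologies imply a genome of roughly 63 to 66 kb, so the simulated depth agrees across technologies to within about 5%.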

:o: Circular assemblies

Inspired by readSimulator's approach, mess can shuffle genome start points to get circular genome assemblies.

[!WARNING] All contigs in the fasta will be circularised

  • Linear (default, --rotate 1)

```sh
mess run -i phage.tsv -o mess_out/linear
```

  • Circular (--rotate 3)

```sh
mess run -i phage.tsv --rotate 3 -o mess_out/circular
```

[!NOTE] Assembled using unicycler, visualized using bandage
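The idea of shuffling start points can be sketched in a few lines. This is an illustration of the general technique, not MeSS's code: `rotate_sequence` and `random_rotations` are hypothetical helpers, and the exact semantics of `--rotate N` are not spelled out in this README, so the sketch simply treats `n` as the number of rotated copies to produce.

```python
import random


def rotate_sequence(seq, start):
    """Re-linearise a circular sequence so that index `start` comes first."""
    if not seq:
        return seq
    start %= len(seq)
    return seq[start:] + seq[:start]


def random_rotations(seq, n, seed=0):
    """Return `n` copies of `seq`, each rotated to a random start point."""
    rng = random.Random(seed)  # seeded for reproducibility
    return [rotate_sequence(seq, rng.randrange(len(seq))) for _ in range(n)]


if __name__ == "__main__":
    print(rotate_sequence("ACGTTGCA", 3))  # TTGCAACG
```

Because reads are drawn from linear templates, sampling from several rotations lets some reads span the original start and end coordinates, which is what allows an assembler to close the circle.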

:sos: Help

All command-line options are described in the documentation and are also listed by:

`mess -h`

Citation

Please consider citing MeSS if you use it in your work.

Farid Chaabane, Trestan Pillonel, Claire Bertelli, MeSS and assembly_finder: A toolkit for in silico metagenomic sample generation, Bioinformatics, 2024, btae760, https://doi.org/10.1093/bioinformatics/btae760

BibTeX

```bibtex
@article{chaabane_mess_2024,
  title = {MeSS and assembly_finder: A toolkit for in silico metagenomic sample generation},
  issn = {1367-4811},
  url = {https://doi.org/10.1093/bioinformatics/btae760},
  doi = {10.1093/bioinformatics/btae760},
  journal = {Bioinformatics},
  author = {Chaabane, Farid and Pillonel, Trestan and Bertelli, Claire},
  month = dec,
  year = {2024},
  pages = {btae760},
}
```

Owner

  • Name: metagenlab
  • Login: metagenlab
  • Kind: organization

Citation (CITATION.cff)

cff-version: 1.2.0
title: "MeSS: simulate short and long read metagenomic samples"
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: software
authors:
  - given-names: Farid
    family-names: Chaabane
    email: farid.chaabane@chuv.ch
    orcid: "https://orcid.org/0009-0007-9322-1281"
    affiliation: >-
      Institute of Microbiology, Lausanne University
      Hospital and University of Lausanne, Lausanne,
      Switzerland
  - given-names: Trestan
    family-names: Pillonel
    email: trestan.pillonel@chuv.ch
    orcid: "https://orcid.org/0000-0002-5725-7929"
    affiliation: >-
      Institute of Microbiology, Lausanne University
      Hospital and University of Lausanne, Lausanne,
      Switzerland
  - given-names: Claire
    family-names: Bertelli
    email: claire.bertelli@chuv.ch
    orcid: "https://orcid.org/0000-0003-0550-8981"
    affiliation: >-
      Institute of Microbiology, Lausanne University
      Hospital and University of Lausanne, Lausanne,
      Switzerland
identifiers:
  - type: doi
    value: 10.5281/zenodo.13365501
    description: zenodo software
repository-code: "https://github.com/metagenlab/MeSS"
url: "https://metagenlab.github.io/MeSS/"
abstract: >-
  Snakemake pipeline for simulating shotgun metagenomic samples
license: MIT
preferred-citation:
  type: article
  authors:
    - given-names: Farid
      family-names: Chaabane
    - given-names: Trestan
      family-names: Pillonel
    - given-names: Claire
      family-names: Bertelli
  doi: "10.1093/bioinformatics/btae760"
  journal: "Bioinformatics"
  title: "MeSS and assembly_finder: A toolkit for in silico metagenomic sample generation"
  year: 2024
  url: "https://doi.org/10.1093/bioinformatics/btae760"

GitHub Events

Total
  • Create event: 11
  • Release event: 3
  • Issues event: 30
  • Watch event: 5
  • Delete event: 12
  • Issue comment event: 24
  • Push event: 64
  • Pull request review event: 1
  • Pull request review comment event: 3
  • Pull request event: 29
  • Fork event: 4
Last Year
  • Create event: 11
  • Release event: 3
  • Issues event: 30
  • Watch event: 5
  • Delete event: 12
  • Issue comment event: 24
  • Push event: 64
  • Pull request review event: 1
  • Pull request review comment event: 3
  • Pull request event: 29
  • Fork event: 4

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 15
  • Total pull requests: 21
  • Average time to close issues: 4 months
  • Average time to close pull requests: 3 days
  • Total issue authors: 8
  • Total pull request authors: 3
  • Average comments per issue: 1.87
  • Average comments per pull request: 0.29
  • Merged pull requests: 16
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 13
  • Pull requests: 14
  • Average time to close issues: 13 days
  • Average time to close pull requests: 3 days
  • Issue authors: 7
  • Pull request authors: 2
  • Average comments per issue: 1.85
  • Average comments per pull request: 0.29
  • Merged pull requests: 11
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • teojcryan (4)
  • farchaab (4)
  • matnguyen (2)
  • Rohit-Satyam (1)
  • inspirewind (1)
  • HSecaira (1)
  • seanlu96 (1)
  • baptwr (1)
Pull Request Authors
  • farchaab (19)
  • teojcryan (1)
  • CarraraAlessia (1)
Top Labels
Issue Labels
bug (6) enhancement (5) documentation (2) good first issue (1)
Pull Request Labels
enhancement (6) bug (1)

Dependencies

setup.py pypi