miassembler

https://github.com/ebi-metagenomics/miassembler

Science Score: 65.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓
CITATION.cff file
Found CITATION.cff file
✓
codemeta.json file
Found codemeta.json file
✓
.zenodo.json file
Found .zenodo.json file
✓
DOI references
Found 4 DOI reference(s) in README
○
Academic publication links
○
Academic email domains
✓
Institutional organization owner
Organization ebi-metagenomics has institutional domain (www.ebi.ac.uk)
○
JOSS paper metadata
○
Scientific vocabulary similarity
Low similarity (12.8%) to scientific vocabulary

Scientific Fields

Biology Life Sciences - 40% confidence

Last synced: 4 months ago

Repository

Basic Info

Host: GitHub
Owner: EBI-Metagenomics
License: apache-2.0
Language: Nextflow
Default Branch: main
Size: 34 MB

Statistics

Stars: 4
Watchers: 4
Forks: 0
Open Issues: 1
Releases: 8

Created almost 2 years ago · Last pushed 5 months ago

Metadata Files

Readme, Changelog, Contributing, License, Citation

Introduction

ebi-metagenomics/miassembler is a bioinformatics pipeline for the assembly of long and short metagenomic reads.

This pipeline supports both short and long reads; however, it does not yet support hybrid assemblies.

The pipeline steps for short-read and long-read processing are outlined in the documentation.

This pipeline is mostly a direct port of the mi-automation assembly generation pipeline. Some of the bespoke scripts used to remove contaminated contigs or to calculate the coverage of the assembly were replaced with tools provided by the community (SeqKit and quast respectively).

> [!NOTE]
> This pipeline uses the nf-core template with some tweaks, but it's not part of nf-core.

Usage

Pipeline help:

```text
Typical pipeline command:

  nextflow run main.nf -profile <profile> --samplesheet samplesheet.csv --outdir <outdir>

--help  [boolean, string] Show the help message for all top level parameters. When a parameter is given to --help, the full help message of that parameter will be printed.

Input/output options
  --samplesheet                 [string]  Path to comma-separated file containing information about the raw reads with the prefix (read accession) to be used.
  --study_accession             [string]  The ENA Study secondary accession
  --reads_accession             [string]  The ENA Run primary accession
  --private_study               [boolean] To use if the ENA study is private, this feature only works on EBI infrastructure at the moment
  --reference_genomes_folder    [string]  The folder containing the reference genomes. It must follow a specific structure; see docs/README for details.
  --contaminant_reference       [string]  Name of the subfolder with the reference genome located in <reference_genomes_folder> to be used for host decontamination
  --skip_human_decontamination  [boolean] Scrubbing human contamination from raw reads and assembled contigs is performed by default as standard procedure. Set this flag to true to skip human decontamination. [default: false]
  --human_reference             [string]  Name of the subfolder with the human genome reference located in <reference_genomes_folder> to be used for human decontamination. Option is strongly encouraged as contamination with human DNA during laboratory sequencing is widespread and can impact analysis results.
  --phix_reference              [string]  Name of the subfolder with the PhiX genome reference located in <reference_genomes_folder> to be used for decontamination of Illumina reads
  --diamond_db                  [string]  Path to diamond db (e.g. NCBI-nr) to perform frameshift correction.
  --outdir                      [string]  The output directory where the results will be saved. You have to use absolute paths to storage on Cloud infrastructure. [default: results]
  --email                       [string]  Email address for completion summary.
  --multiqc_title               [string]  MultiQC report title. Printed as page header, used for filename if not otherwise specified.

Input metadata options
  --assembler                   [string]  The short or long reads assembler (accepted: spades, metaspades, megahit, flye)
  --long_reads_assembler_config [string]  Configuration to use flye with. (accepted: nano-raw, nano-corr, nano-hq, pacbio-raw, pacbio-corr, pacbio-hifi)
  --flye_version                [string]  [default: 2.9]
  --spades_version              [string]  [default: 3.15.5]
  --megahit_version             [string]  [default: 1.2.9]
  --assembly_memory             [number]  Default memory allocated for the assembly process. [default: 100]
  --spades_only_assembler       [boolean] Run SPAdes/metaSPAdes without the error correction step. [default: true]

Reads QC options
  --short_reads_filter_ratio_threshold     [number]  The maximum fraction of reads that are allowed to be filtered out. If exceeded, it flags excessive filtering. The default value is 0.1, meaning that if less than 10% of the reads are retained after filtering, the threshold is considered exceeded, and the run is not assembled. [default: 0.1]
  --short_reads_low_reads_count_threshold  [number]  The minimum number of reads required after filtering. If below, it flags a low read count and the run is not assembled. [default: 1000]
  --long_reads_min_read_length             [integer] Minimum read length for pre-assembly quality filtering [default: 200]
  --long_reads_pacbio_quality_threshold    [number]  The Q20 threshold that a pacbio sample needs to exceed to be labelled as HiFi. [default: 0.9]
  --long_reads_ont_quality_threshold       [number]  The Q20 threshold that an ONT sample needs to exceed to be labelled as high-quality. [default: 0.8]

Assembly QC options
  --short_reads_min_contig_length        [integer] Minimum contig length filter for short reads. [default: 500]
  --short_reads_min_contig_length_metat  [integer] Minimum contig length filter for short reads metaT. [default: 200]
  --short_reads_contig_threshold         [integer] Minimum number of contigs in human+phiX+host cleaned assembly. [default: 2]
  --min_qcov                             [number]  Minimum query coverage threshold (0.0-1.0). Specifies the minimum fraction of the contig sequence that must align to the reference genome for the contig to be classified as contamination. [default: 0.3]
  --min_pid                              [number]  Minimum percent identity threshold (0.0-1.0). Specifies the minimum sequence similarity between the contig and reference genome alignment required to classify the contig as contamination. [default: 0.4]

Generic options
  --multiqc_methods_description [string]  Custom MultiQC yaml file containing HTML including a methods description.
```
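The query-coverage and percent-identity thresholds described in the help text above both have to be met before a contig is treated as contamination. A minimal Python sketch of that rule follows. The function name `is_contaminant` and its arguments are hypothetical, and the inclusive comparison (`>=`) is an assumption; the pipeline's own implementation may differ.

```python
def is_contaminant(query_coverage: float, identity: float,
                   min_qcov: float = 0.3, min_pid: float = 0.4) -> bool:
    """Return True when an alignment meets both contamination thresholds.

    Both inputs are fractions on a 0.0-1.0 scale, as in the help text.
    Assumption: thresholds are inclusive (>=).
    """
    for name, value in (("query_coverage", query_coverage), ("identity", identity)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be within 0.0-1.0, got {value}")
    return query_coverage >= min_qcov and identity >= min_pid


print(is_contaminant(0.85, 0.97))  # long, near-identical hit to the reference
print(is_contaminant(0.10, 0.99))  # near-identical but covers too little of the contig
```

Raising either threshold makes the filter more conservative, so fewer contigs are removed.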

You can run this pipeline in two ways: with command-line parameters or with a samplesheet.

Command-line parameters

Example:

```bash
nextflow run ebi-metagenomics/miassembler \
  -profile codon_slurm \
  --assembler metaspades \
  --reference_genome human.fasta \
  --reference_genomes_folder references/ \
  --outdir testing_results \
  --study_accession SRP002480 \
  --reads_accession SRR1631361
```

Samplesheet

```bash
nextflow run ebi-metagenomics/miassembler \
  -profile codon_slurm \
  --samplesheet tests/samplesheet/test.csv
```

The samplesheet is a comma-separated file (.csv) with the following columns:

study_accession: Unique identifier for the study.
reads_accession: Unique identifier for the reads.
fastq_1: Full path to the first FastQ file.
fastq_2: Full path to the second FastQ file (for paired-end reads). Leave empty if single-end.
library_layout: Either single or paired.
library_strategy: One of metagenomic, metatranscriptomic, genomic, transcriptomic, or other.
platform: Relevant for long reads, requiring either ont or pb for nanopore or pacbio, respectively.
assembler: One of megahit, metaspades, or spades for short reads, or flye for long reads.
assembly_memory: Integer value specifying the memory allocated for the assembly process.
assembler_config: Configuration to use flye with. One of "nano-raw", "nano-corr", "nano-hq", "pacbio-raw", "pacbio-corr", "pacbio-hifi".
contaminant_reference: Filename of fasta reference to be used for host decontamination of reads and assembly. BWA-MEM2 index files must exist in <contaminant_reference>/bwa-mem2/ and their names must start with .fna.*.
human_reference: Name of the subfolder with genome reference to be used for human decontamination of reads and assembly. BWA-MEM2 index files must exist in <human_reference>/bwa-mem2/ and their names must start with .fna.*.
phix_reference: Filename of PhiX genome reference to be used for PhiX decontamination of reads and assembly. BWA-MEM2 index files must exist in <phix_reference>/bwa-mem2/ and their names must start with .fna.*.

The header row is mandatory.

Full Samplesheet Example

The pipeline can handle both single-end and paired-end reads. A full samplesheet for different library layouts and strategies might look like this:

```csv
study_accession,reads_accession,fastq_1,fastq_2,library_layout,library_strategy,assembler
PRJ1,ERR1,/path/to/reads/ERR1_1.fq.gz,/path/to/reads/ERR1_2.fq.gz,paired,metagenomic
PRJ2,ERR2,/path/to/reads/ERR2.fq.gz,,single,genomic,metaspades
PRJ3,ERR3,/path/to/reads/ERR3_1.fq.gz,/path/to/reads/ERR3_2.fq.gz,paired,transcriptomic
```

Example with additional columns:

```csv
study_accession,reads_accession,fastq_1,fastq_2,library_layout,library_strategy,assembler,assembly_memory,assembler_config,platform,contaminant_reference,human_reference,phix_reference
PRJ1,ERR1,/path/to/reads/ERR1_1.fq.gz,/path/to/reads/ERR1_2.fq.gz,paired,metagenomic,spades,16,,,,,
PRJ2,ERR2,/path/to/reads/ERR2.fq.gz,,single,genomic,flye,32,nano-hq,ont,chicken.fa,human.fa,phix.fa
```
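Before launching a run, it can help to check a samplesheet against the column rules listed above. The following Python sketch is one way to do that. It is not the pipeline's own validation; the function name `check_samplesheet`, the choice of which columns to treat as required, and the `PRJ9`/`ERR9` row are assumptions made for illustration.

```python
import csv
import io

# Assumption: these three columns are treated as the minimum required set.
REQUIRED_COLUMNS = {"study_accession", "reads_accession", "fastq_1"}
LAYOUTS = {"single", "paired"}
STRATEGIES = {"metagenomic", "metatranscriptomic", "genomic", "transcriptomic", "other"}


def check_samplesheet(text: str) -> list[str]:
    """Return a list of human-readable problems found in samplesheet text."""
    reader = csv.DictReader(io.StringIO(text))
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        return [f"missing header columns: {sorted(missing)}"]
    problems = []
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        layout = row.get("library_layout") or ""
        strategy = row.get("library_strategy") or ""
        if layout and layout not in LAYOUTS:
            problems.append(f"line {line_no}: unknown library_layout {layout!r}")
        if strategy and strategy not in STRATEGIES:
            problems.append(f"line {line_no}: unknown library_strategy {strategy!r}")
        if layout == "paired" and not row.get("fastq_2"):
            problems.append(f"line {line_no}: paired layout but fastq_2 is empty")
        if layout == "single" and row.get("fastq_2"):
            problems.append(f"line {line_no}: single layout but fastq_2 is set")
    return problems


# Two rows from the example above plus one deliberately inconsistent, made-up row.
sheet = """study_accession,reads_accession,fastq_1,fastq_2,library_layout,library_strategy,assembler
PRJ1,ERR1,/path/to/reads/ERR1_1.fq.gz,/path/to/reads/ERR1_2.fq.gz,paired,metagenomic
PRJ2,ERR2,/path/to/reads/ERR2.fq.gz,,single,genomic,metaspades
PRJ9,ERR9,/path/to/reads/ERR9.fq.gz,,paired,metagenomic,megahit
"""
print(check_samplesheet(sheet))
```

Only the last row is reported, because it declares a paired layout without a second FastQ file.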

ENA Private Data

The pipeline includes a module to download private data from ENA using the EMBL-EBI FIRE (File Replication) system. This system is restricted for use within the EMBL-EBI network and will not work unless connected to that network.

If you have private data to assemble, you must provide the full path to the files on a system that Nextflow can access.

Microbiome Informatics Team

To process private data, launch the pipeline with the --private_study flag and include the private FTP (transfer services) paths in the samplesheet. The download_from_fire module then downloads the files.

This module uses Nextflow secrets. Specifically, it requires the FIRE_ACCESS_KEY and FIRE_SECRET_KEY secrets to authenticate and download the files.

Outputs

The outputs of the pipeline are organized as follows:

```
results
├── pipeline_info
├── DRP0076
│   └── DRP007622
│       ├── DRR2807
│       │   └── DRR280712
│       │       ├── assembly
│       │       │   └── megahit
│       │       │       └── 1.2.9
│       │       │           ├── coverage
│       │       │           ├── decontamination
│       │       │           └── qc
│       │       │               ├── multiqc
│       │       │               └── quast
│       │       │                   └── DRR280712
│       │       └── qc
│       │           ├── fastp
│       │           └── fastqc
│       └── multiqc
└── SRP1154
    └── SRP115494
        ├── multiqc
        ├── SRR5949
        │   └── SRR5949318
        │       ├── assembly
        │       │   └── metaspades
        │       │       └── 3.15.5
        │       │           ├── coverage
        │       │           ├── decontamination
        │       │           └── qc
        │       │               ├── multiqc
        │       │               └── quast
        │       │                   └── SRR5949318
        │       └── qc
        │           ├── fastp
        │           └── fastqc
        └── SRR6180
            └── SRR6180434 --> QC Failed (not assembled)
                └── qc
                    ├── fastp
                    └── fastqc
```

The nested structure based on ENA Study and Reads accessions was created to suit the Microbiome Informatics team’s needs. The benefit of this structure is that results from different runs of the same study won’t overwrite any results.
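To locate the results for a given run programmatically, the nested layout can be rebuilt from the two accessions. The sketch below assumes, judging only from the example tree, that each parent folder is the first seven characters of the accession beneath it (for example `SRP1154/SRP115494`). The helper name `run_output_dir` is hypothetical.

```python
from pathlib import PurePosixPath

# Assumption inferred from the example tree, not from the pipeline source.
PREFIX_LENGTH = 7


def run_output_dir(outdir: str, study: str, run: str) -> PurePosixPath:
    """Build <outdir>/<study prefix>/<study>/<run prefix>/<run>."""
    return PurePosixPath(outdir, study[:PREFIX_LENGTH], study, run[:PREFIX_LENGTH], run)


print(run_output_dir("results", "SRP115494", "SRR5949318"))
# results/SRP1154/SRP115494/SRR5949/SRR5949318
```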

Coverage

The pipeline reports coverage values for the assembly using two mechanisms: jgi_summarize_bam_contig_depths and a custom calculation of whole-assembly coverage and coverage depth.

jgi_summarize_bam_contig_depths

This tool summarizes the depth of coverage for each contig from BAM files containing the mapped reads. It quantifies the extent to which contigs in an assembly are covered by these reads. The output is a tabular file, with rows representing contigs and columns displaying the summarized coverage values from the BAM files. This summary is useful for binning contigs or estimating abundance in various metagenomic datasets.

This file is generated per assembly and stored in the following location (e.g., for study SRP115494 and run SRR6180434): SRP1154/SRP115494/multiqc/SRR5949/SRR5949318/assembly/metaspades/3.15.5/coverage/SRR6180434_coverage_depth_summary.tsv.gz

Example output of `jgi_summarize_bam_contig_depths`

| contigName                       | contigLen | totalAvgDepth | SRR6180434_sorted.bam | SRR6180434_sorted.bam-var |
| -------------------------------- | --------- | ------------- | --------------------- | ------------------------- |
| NODE_1_length_539_cov_105.072314 | 539       | 273.694       | 273.694               | 74284.7                   |

Explanation of the Columns:

contigName: The name or identifier of the contig (e.g., NODE_1_length_539_cov_105.072314). This is usually derived from the assembly process and may include information such as the contig length and coverage.
contigLen: The length of the contig in base pairs (e.g., 539).
totalAvgDepth: The average depth of coverage across the entire contig from all BAM files (e.g., 273.694). This represents the total sequencing coverage averaged across the length of the contig. This value will be the same as the sample avg. depth in assemblies of a single sample.
SRR6180434_sorted.bam: The average depth of coverage for the specific sample represented by this BAM file (e.g., 273.694). This shows how well the contig is covered by reads.
SRR6180434_sorted.bam-var: The variance in the depth of coverage for the same BAM file (e.g., 74284.7). This gives a measure of how uniform or uneven the read coverage is across the contig.
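The columns described above can be read with the Python standard library. This is a minimal sketch that assumes the file is a gzipped, tab-separated table with a header row, as the `.tsv.gz` extension suggests. The helper name `read_depths` is hypothetical, and the data is an in-memory stand-in built from the example row rather than a real pipeline output.

```python
import csv
import gzip
import io


def read_depths(handle):
    """Yield (contig name, length, total average depth) from a depth table."""
    for row in csv.DictReader(handle, delimiter="\t"):
        yield row["contigName"], int(row["contigLen"]), float(row["totalAvgDepth"])


# In-memory stand-in for a gzipped depth summary, using the example row above.
table = (
    "contigName\tcontigLen\ttotalAvgDepth\tSRR6180434_sorted.bam\tSRR6180434_sorted.bam-var\n"
    "NODE_1_length_539_cov_105.072314\t539\t273.694\t273.694\t74284.7\n"
)
compressed = gzip.compress(table.encode())

with gzip.open(io.BytesIO(compressed), mode="rt", newline="") as handle:
    depths = list(read_depths(handle))

print(depths)
```

For a file on disk, pass its path to `gzip.open` in place of the `io.BytesIO` object.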

Coverage JSON

The pipeline calculates two key metrics for the entire assembly: coverage and coverage depth. Coverage is the number of assembled base pairs divided by the total number of base pairs before filtering. Coverage depth is the number of assembled base pairs divided by the total length of the assembly, provided the assembly length is greater than zero. Together, these metrics indicate how well the reads cover the assembly and the average depth of coverage across the assembled contigs. The script that calculates these values is calculate_assembly_coverage.py.

The pipeline creates a JSON file with the following content:

```json
{
  "coverage": 0.04760503915318373,
  "coverage_depth": 273.694
}
```

This file is stored in the following location (e.g., for study SRP115494 and run SRR6180434): SRP1154/SRP115494/multiqc/SRR5949/SRR5949318/assembly/metaspades/3.15.5/coverage/SRR6180434_coverage.json
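The two ratios described above can be written out as a short Python sketch. This is an illustration of the stated definitions, not the pipeline's script. The function name `assembly_coverage`, the argument names, the fallback value of 0.0 when a denominator is zero, and the input numbers are all assumptions.

```python
import json


def assembly_coverage(assembled_bp: int, raw_bp: int, assembly_length: int) -> dict:
    """Compute coverage and coverage depth as defined in the text above.

    assembled_bp:    base pairs counted as assembled
    raw_bp:          total base pairs before filtering
    assembly_length: total length of the assembly
    """
    # Assumption: report 0.0 when a denominator is zero.
    coverage = assembled_bp / raw_bp if raw_bp > 0 else 0.0
    depth = assembled_bp / assembly_length if assembly_length > 0 else 0.0
    return {"coverage": coverage, "coverage_depth": depth}


# Hypothetical numbers, chosen only to make the arithmetic easy to follow.
metrics = assembly_coverage(assembled_bp=50_000, raw_bp=1_000_000, assembly_length=250)
print(json.dumps(metrics))
# {"coverage": 0.05, "coverage_depth": 200.0}
```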

Top Level Reports

MultiQC

The pipeline produces two MultiQC reports: one per study and one per run. These reports aggregate statistics related to raw reads, read QC, assembly, and assembly QC.

The run-level MultiQC report is generated for runs that passed QC and were assembled. The study-level MultiQC report includes all runs; however, runs without assemblies will not have assembly stats included.

QC failed runs

Runs that fail QC checks are excluded from the assembly process, because assembling them may cause the pipeline to fail or produce very poor assemblies. These runs are listed in the file qc_failed_runs.csv, along with the corresponding exclusion message.

Example:

```csv
SRR6180434,short_reads_filter_ratio_threshold_exceeded
```

Runs exclusion messages

| Exclusion Message | Description |
| ----------------- | ----------- |
| `short_reads_filter_ratio_threshold_exceeded` | The maximum fraction of reads that are allowed to be filtered out. If exceeded, it flags excessive filtering. The default value is 0.1, meaning that if less than 10% of the reads are retained after filtering, the threshold is considered exceeded, and the run is not assembled. |
| `short_reads_low_reads_count_threshold` | The minimum number of reads required after filtering. If below, it flags a low read count, and the run is not assembled. |
| `short_reads_contig_threshold` | The minimum number of contigs allowed after human+phiX+host cleaning. If below, it flags a low contig count and the cleaned assembly isn't generated. |
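The two read-level checks can be sketched in Python as follows. This is one reading of the descriptions above, not the pipeline's code: it treats the ratio as the fraction of reads retained after filtering, uses strict comparisons, and applies the ratio check first. The function name `qc_exclusion` and its arguments are hypothetical.

```python
from typing import Optional


def qc_exclusion(reads_before: int, reads_after: int,
                 filter_ratio_threshold: float = 0.1,
                 low_reads_count_threshold: int = 1000) -> Optional[str]:
    """Return an exclusion message for a short-read run, or None if it passes."""
    # Assumption: the ratio is reads retained after filtering / reads before.
    retained = reads_after / reads_before if reads_before > 0 else 0.0
    if retained < filter_ratio_threshold:
        return "short_reads_filter_ratio_threshold_exceeded"
    if reads_after < low_reads_count_threshold:
        return "short_reads_low_reads_count_threshold"
    return None


print(qc_exclusion(2_000_000, 1_800_000))  # 90% retained, run passes
print(qc_exclusion(2_000_000, 100_000))    # only 5% retained
print(qc_exclusion(5_000, 900))            # 18% retained, but fewer than 1000 reads
```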

Assembled Runs

Runs that were successfully assembled are listed in a CSV file named assembled_runs.csv. This file contains the run accession, assembler, and assembler version used.

Example:

```csv
DRR280712,megahit,1.2.9
SRR5949318,metaspades,3.15.5
```
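Both top-level report files are small enough to load with the standard `csv` module. The sketch below assumes the files have no header row, as in the examples shown. The helper names `read_assembled_runs` and `read_failed_runs` are hypothetical.

```python
import csv
import io


def read_assembled_runs(text: str) -> dict:
    """Map run accession -> (assembler, version) from headerless CSV text."""
    return {row[0]: (row[1], row[2]) for row in csv.reader(io.StringIO(text)) if row}


def read_failed_runs(text: str) -> dict:
    """Map run accession -> exclusion message from headerless CSV text."""
    return {row[0]: row[1] for row in csv.reader(io.StringIO(text)) if row}


# Contents copied from the examples in this document.
assembled = read_assembled_runs("DRR280712,megahit,1.2.9\nSRR5949318,metaspades,3.15.5\n")
failed = read_failed_runs("SRR6180434,short_reads_filter_ratio_threshold_exceeded\n")

print(assembled["SRR5949318"])
print(failed["SRR6180434"])
```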

Tests

There is a very small test data set ready to use:

```bash
nextflow run main.nf -resume -profile test,docker
```

It's also possible to run the nf-test suite with:

```bash
nf-test test
```

End to end tests

Two end-to-end tests can be launched (with megahit and metaspades) with the following command:

```bash
pytest tests/workflows/ --verbose
```

Owner

Name: MGnify
Login: EBI-Metagenomics
Kind: organization
Email: metagenomics-help@ebi.ac.uk
Location: Genome Campus, UK

Website: https://www.ebi.ac.uk/metagenomics/
Twitter: MGnifyDB
Repositories: 153
Profile: https://github.com/EBI-Metagenomics

MGnify (formerly known as EBI metagenomics) is a free resource for the assembly, analysis, archiving, and browsing of all types of microbiome-derived sequence data.

Citation (CITATIONS.md)

# ebi-metagenomics/miassembler: Citations

## [nf-core](https://pubmed.ncbi.nlm.nih.gov/32055031/)

> Ewels PA, Peltzer A, Fillinger S, Patel H, Alneberg J, Wilm A, Garcia MU, Di Tommaso P, Nahnsen S. The nf-core framework for community-curated bioinformatics pipelines. Nat Biotechnol. 2020 Mar;38(3):276-278. doi: 10.1038/s41587-020-0439-x. PubMed PMID: 32055031.

## [Nextflow](https://pubmed.ncbi.nlm.nih.gov/28398311/)

> Di Tommaso P, Chatzou M, Floden EW, Barja PP, Palumbo E, Notredame C. Nextflow enables reproducible computational workflows. Nat Biotechnol. 2017 Apr 11;35(4):316-319. doi: 10.1038/nbt.3820. PubMed PMID: 28398311.

## Pipeline tools

- [FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/)

  > Andrews, S. (2010). FastQC: A Quality Control Tool for High Throughput Sequence Data [Online].

- [MultiQC](https://pubmed.ncbi.nlm.nih.gov/27312411/)

  > Ewels P, Magnusson M, Lundin S, Käller M. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016 Oct 1;32(19):3047-8. doi: 10.1093/bioinformatics/btw354. Epub 2016 Jun 16. PubMed PMID: 27312411; PubMed Central PMCID: PMC5039924.

## Software packaging/containerisation tools

- [Anaconda](https://anaconda.com)

  > Anaconda Software Distribution. Computer software. Vers. 2-2.4.0. Anaconda, Nov. 2016. Web.

- [Bioconda](https://pubmed.ncbi.nlm.nih.gov/29967506/)

  > Grüning B, Dale R, Sjödin A, Chapman BA, Rowe J, Tomkins-Tinch CH, Valieris R, Köster J; Bioconda Team. Bioconda: sustainable and comprehensive software distribution for the life sciences. Nat Methods. 2018 Jul;15(7):475-476. doi: 10.1038/s41592-018-0046-7. PubMed PMID: 29967506.

- [BioContainers](https://pubmed.ncbi.nlm.nih.gov/28379341/)

  > da Veiga Leprevost F, Grüning B, Aflitos SA, Röst HL, Uszkoreit J, Barsnes H, Vaudel M, Moreno P, Gatto L, Weber J, Bai M, Jimenez RC, Sachsenberg T, Pfeuffer J, Alvarez RV, Griss J, Nesvizhskii AI, Perez-Riverol Y. BioContainers: an open-source and community-driven framework for software standardization. Bioinformatics. 2017 Aug 15;33(16):2580-2582. doi: 10.1093/bioinformatics/btx192. PubMed PMID: 28379341; PubMed Central PMCID: PMC5870671.

- [Docker](https://dl.acm.org/doi/10.5555/2600239.2600241)

  > Merkel, D. (2014). Docker: lightweight linux containers for consistent development and deployment. Linux Journal, 2014(239), 2. doi: 10.5555/2600239.2600241.

- [Singularity](https://pubmed.ncbi.nlm.nih.gov/28494014/)

  > Kurtzer GM, Sochat V, Bauer MW. Singularity: Scientific containers for mobility of compute. PLoS One. 2017 May 11;12(5):e0177459. doi: 10.1371/journal.pone.0177459. eCollection 2017. PubMed PMID: 28494014; PubMed Central PMCID: PMC5426675.

GitHub Events

Total

Release event: 4
Watch event: 1
Delete event: 23
Issue comment event: 12
Push event: 122
Pull request event: 57
Pull request review event: 183
Pull request review comment event: 207
Create event: 25

Last Year

Release event: 4
Watch event: 1
Delete event: 23
Issue comment event: 12
Push event: 122
Pull request event: 57
Pull request review event: 183
Pull request review comment event: 207
Create event: 25

Issues and Pull Requests

Last synced: 4 months ago

All Time

Total issues: 0
Total pull requests: 26
Average time to close issues: N/A
Average time to close pull requests: 3 days
Total issue authors: 0
Total pull request authors: 6
Average comments per issue: 0
Average comments per pull request: 0.31
Merged pull requests: 19
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 0
Pull requests: 26
Average time to close issues: N/A
Average time to close pull requests: 3 days
Issue authors: 0
Pull request authors: 6
Average comments per issue: 0
Average comments per pull request: 0.31
Merged pull requests: 19
Bot issues: 0
Bot pull requests: 0


Top Authors

Issue Authors

mberacochea (1)

Pull Request Authors

mberacochea (21)
Ge94 (15)
KateSakharova (7)
ochkalova (4)
jmattock5 (3)
SandyRogers (3)

Top Labels

Issue Labels

Pull Request Labels

bug (1)

Dependencies

modules/nf-core/blast/blastn/meta.yml cpan

modules/nf-core/bwamem2/index/meta.yml cpan

modules/nf-core/custom/dumpsoftwareversions/meta.yml cpan

modules/nf-core/fastqc/meta.yml cpan

modules/nf-core/megahit/meta.yml cpan

modules/nf-core/metabat2/jgisummarizebamcontigdepths/meta.yml cpan

modules/nf-core/multiqc/meta.yml cpan

modules/nf-core/quast/meta.yml cpan

modules/nf-core/samtools/idxstats/meta.yml cpan

modules/nf-core/seqkit/grep/meta.yml cpan

modules/nf-core/seqkit/seq/meta.yml cpan

modules/nf-core/spades/meta.yml cpan

pyproject.toml pypi

modules/ebi-metagenomics/bwamem2/mem/meta.yml cpan

modules/ebi-metagenomics/samtools/bam2fq/meta.yml cpan

modules/nf-core/fastp/meta.yml cpan

subworkflows/ebi-metagenomics/reads_bwamem2_decontamination/meta.yml cpan

modules/nf-core/blast/blastn/environment.yml conda

blast 2.14.1.*

modules/nf-core/bwamem2/index/environment.yml conda

bwa-mem2 2.2.1.*

modules/nf-core/custom/dumpsoftwareversions/environment.yml conda

multiqc 1.17.*

modules/nf-core/fastqc/environment.yml conda

fastqc 0.12.1.*

modules/nf-core/megahit/environment.yml conda

megahit 1.2.9.*
pigz 2.6.*

modules/nf-core/metabat2/jgisummarizebamcontigdepths/environment.yml conda

metabat2 2.15.*

modules/nf-core/multiqc/environment.yml conda

multiqc 1.18.*

modules/nf-core/quast/environment.yml conda

quast 5.2.0.*

modules/nf-core/samtools/idxstats/environment.yml conda

samtools 1.18.*

modules/nf-core/seqkit/grep/environment.yml conda

seqkit 2.4.0.*

modules/nf-core/seqkit/seq/environment.yml conda

seqkit 2.6.1.*

modules/nf-core/spades/environment.yml conda

spades 3.15.5.*

modules/ebi-metagenomics/bwamem2/mem/environment.yml pypi

modules/ebi-metagenomics/samtools/bam2fq/environment.yml pypi
