biosiftr

BioSIFTR - Biome-specific Shallow-shotgun Inference of Functional Traits through Read-mapping

https://github.com/ebi-metagenomics/biosiftr

Science Score: 52.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: Found CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
○ DOI references
○ Academic publication links
○ Academic email domains
✓ Institutional organization owner: Organization ebi-metagenomics has institutional domain (www.ebi.ac.uk)
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (13.0%) to scientific vocabulary

Last synced: 6 months ago

Repository

Basic Info

Host: GitHub
Owner: EBI-Metagenomics
License: apache-2.0
Language: Python
Default Branch: dev
Homepage:
Size: 20.9 MB

Statistics

Stars: 2
Watchers: 4
Forks: 0
Open Issues: 3
Releases: 4

Created over 2 years ago · Last pushed 7 months ago

Metadata Files

Readme Contributing License Citation

BioSIFTR

Biome-specific Shallow-shotgun Inference of Functional Traits through Read-mapping

ebi-metagenomics/biosiftr is a bioinformatics pipeline that generates taxonomic and functional profiles for low-yield (shallow shotgun: < 10 M reads) short raw-reads using MGnify biome-specific genome catalogues as a reference.

The biome selection includes all the biomes available in the MGnify genome catalogues.

The main sections of the pipeline include the following steps:

Raw-reads quality control (fastp)
HQ reads decontamination versus human, phiX, and host (bwa-mem2)
QC report of decontaminated reads (FastQC)
Integrated quality report of reads before and after decontamination (MultiQC)
Mapping HQ clean reads using Sourmash and, optionally, bwa-mem2
Taxonomic profile generation
Functional profile inference

The final output includes a species relative abundance table, Pfam and KEGG Orthologs (KO) count tables, a KEGG modules completeness table, and DRAM-style visuals (optional). In addition, the shallow-mapping pipeline will integrate the taxonomic and functional tables of all the samples in the input samplesheet.
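To make the integration step concrete, here is a minimal sketch of how per-sample count tables could be combined into one matrix. It is an illustration only, not the pipeline's own code: the function name `integrate_tables` and the in-memory layout (one dictionary of feature counts per sample) are assumptions, and the real output format may differ.

```python
def integrate_tables(per_sample):
    """Combine per-sample count tables into one feature-by-sample matrix.

    per_sample maps a sample name to a dict of {feature_id: count}.
    Returns (header, rows); features missing from a sample get a count of 0.
    """
    samples = sorted(per_sample)
    # Union of every feature seen in any sample, in a stable order.
    features = sorted({f for counts in per_sample.values() for f in counts})
    header = ["feature"] + samples
    rows = [[f] + [per_sample[s].get(f, 0) for s in samples] for f in features]
    return header, rows
```

Calling it with two samples that share only some KO identifiers yields one row per KO, with zeros filled in where a sample has no hits.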

Installation

This workflow was built using Nextflow and follows nf-core good practices. It is containerised, so users can use either Docker or Apptainer/Singularity to run the pipeline. At the moment, it doesn't support Conda environments.

The pipeline requires Nextflow and a container technology such as Apptainer/Singularity or Docker.

Required Reference Databases

The first time you run the pipeline, it will download the required MGnify genomes catalogue reference files and the human_phiX bwa-mem2 index. Indexes for other common hosts, such as mouse, will also be downloaded automatically.

Running the pipeline using bwa-mem2 is optional. To use this option, set `--download_bwa true`. The database will occupy considerable storage on your system, depending on the biome (approximate database sizes):

106 G marine-v2-0
38 G human-gut-v2-0-2
29 G mouse-gut-v1-0
21 G cow-rumen-v1-0-1
16 G sheep-rumen-v1-0
15 G pig-gut-v1-0
10 G chicken-gut-v1-0-1
4.5 G human-oral-v1-0-1
2.0 G human-vaginal-v1-0
2.2 G non-model-fish-gut-v2-0
2.5 G honeybee-gut-v1-0-1
1.5 G zebrafish-fecal-v1-0
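
Because some of these databases are large, it can help to check free disk space before enabling the download. The sketch below is a hypothetical helper, not part of the pipeline: the sizes are copied from the list above, "G" is assumed to mean GiB, and the 10% safety margin is an arbitrary choice.

```python
import shutil

# Approximate bwa-mem2 database sizes, copied from the list above.
# Assumption: "G" is interpreted as GiB.
APPROX_DB_SIZE_GB = {
    "marine-v2-0": 106,
    "human-gut-v2-0-2": 38,
    "mouse-gut-v1-0": 29,
    "cow-rumen-v1-0-1": 21,
    "sheep-rumen-v1-0": 16,
    "pig-gut-v1-0": 15,
    "chicken-gut-v1-0-1": 10,
    "human-oral-v1-0-1": 4.5,
    "human-vaginal-v1-0": 2.0,
    "non-model-fish-gut-v2-0": 2.2,
    "honeybee-gut-v1-0-1": 2.5,
    "zebrafish-fecal-v1-0": 1.5,
}


def has_room_for(catalogue, path=".", margin=1.1):
    """Return True if `path` has enough free space for the catalogue database.

    `margin` adds headroom on top of the approximate size (1.1 = 10% extra).
    """
    needed_bytes = APPROX_DB_SIZE_GB[catalogue] * margin * 1024**3
    return shutil.disk_usage(path).free >= needed_bytes
```

For example, `has_room_for("marine-v2-0", "/path/to/dbs")` reports whether the directory you plan to pass to `--dbs` can hold the largest database.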

In addition, instructions to generate the databases from custom catalogues can be found in the BioSIFTR paper's repository.

Usage

Prepare a samplesheet with your input data that looks as follows:

samplesheet.csv:

```csv
sample,fastq_1,fastq_2
paired_sample,/PATH/test_R1.fq.gz,/PATH/test_R2.fq.gz
single_sample,/PATH/test.fq.gz
```

Each row represents a FASTQ file (single-end) or a pair of FASTQ files (paired-end), where `sample` is a unique identifier for each dataset, `fastq_1` is the path to the first FASTQ file, and `fastq_2` is the path to the second FASTQ file for paired-end data.
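
A quick check of the samplesheet before launching can catch duplicate sample names or missing paths early. The following is a minimal sketch of such a check, written as a hypothetical standalone helper (`read_samplesheet` is not provided by the pipeline, which may apply its own, stricter validation). It assumes single-end rows may omit the trailing comma, as in the example above.

```python
import csv
import io


def read_samplesheet(text):
    """Parse samplesheet text into a list of (sample, fastq_1, fastq_2) tuples.

    fastq_2 is None for single-end rows. Raises ValueError on a bad header,
    a missing sample or fastq_1 value, or a duplicated sample name.
    """
    rows = list(csv.reader(io.StringIO(text)))
    if not rows or rows[0] != ["sample", "fastq_1", "fastq_2"]:
        raise ValueError("header must be: sample,fastq_1,fastq_2")

    samples = []
    seen = set()
    for line_no, row in enumerate(rows[1:], start=2):
        if not row:
            continue  # skip blank lines
        # Pad short rows so single-end entries may omit the trailing comma.
        sample, fastq_1, fastq_2 = (row + ["", ""])[:3]
        if not sample or not fastq_1:
            raise ValueError(f"line {line_no}: sample and fastq_1 are required")
        if sample in seen:
            raise ValueError(f"line {line_no}: duplicate sample '{sample}'")
        seen.add(sample)
        samples.append((sample, fastq_1, fastq_2 or None))
    return samples
```

Passing the contents of `samplesheet.csv` to `read_samplesheet` returns one tuple per dataset, with `None` in the third position for single-end samples.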

Now, you can run the pipeline using the minimum arguments:

```bash
nextflow run ebi-metagenomics/biosiftr \
    --biome <CATALOGUE_ID> \
    --input samplesheet.csv \
    --outdir <PROJECT_NAME> \
    --dbs </path/to/dbs> \
    --decontamination_indexes </path to folder with bwa-mem2 indexes>
```

If `--outdir` is omitted, it defaults to `results`.

The central location for the databases can be set in the config file.

Optional arguments include:

```bash
--run_bwa <boolean>     default = `false`   # To generate results using bwa-mem2 besides sourmash
--core_mode <boolean>   default = `false`   # To use core functions instead of pangenome functions
--run_dram <boolean>    default = `false`   # To generate DRAM results
```

Use `--core_mode true` for large catalogues such as human-gut to avoid over-prediction caused by the large number of accessory genes in the pangenome.

The Nextflow option `-profile` selects a config suited to your computational resources; you can add profile files to the config directory. The Nextflow option `-resume` re-runs the pipeline from the last successfully completed step.
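
When launching many runs from a script, it can be convenient to assemble the command programmatically. The sketch below is a hypothetical wrapper, not part of the pipeline: `build_command` is an illustrative name, and it uses only the options documented above. It returns an argument list suitable for `subprocess.run`.

```python
import shlex


def build_command(biome, samplesheet, dbs, decontamination_indexes,
                  outdir="results", run_bwa=False, core_mode=False,
                  run_dram=False, profile=None, resume=False):
    """Assemble the nextflow command line as a list of arguments."""
    cmd = [
        "nextflow", "run", "ebi-metagenomics/biosiftr",
        "--biome", biome,
        "--input", samplesheet,
        "--outdir", outdir,
        "--dbs", dbs,
        "--decontamination_indexes", decontamination_indexes,
    ]
    # Optional pipeline switches are only added when enabled.
    for flag, enabled in (("--run_bwa", run_bwa),
                          ("--core_mode", core_mode),
                          ("--run_dram", run_dram)):
        if enabled:
            cmd += [flag, "true"]
    # Nextflow's own options use a single dash.
    if profile:
        cmd += ["-profile", profile]
    if resume:
        cmd.append("-resume")
    return cmd


def as_shell_string(cmd):
    """Render the argument list as a copy-pasteable shell command."""
    return shlex.join(cmd)
```

Building the list first and rendering it with `as_shell_string` keeps paths containing spaces correctly quoted.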

Available biomes

The `--biome` value can be any of the MGnify catalogues for which shallow-mapping databases are currently available:

| Biome              | Catalogue Version |
| ------------------ | ----------------- |
| chicken-gut        | v1.0.1            |
| cow-rumen          | v1.0.1            |
| human-gut          | v2.0.2 ⚠️         |
| human-oral         | v1.0.1            |
| human-vaginal      | v1.0              |
| honeybee-gut       | v1.0.1            |
| marine             | v2.0              |
| mouse-gut          | v1.0              |
| non-model-fish-gut | v2.0              |
| pig-gut            | v1.0              |
| sheep-rumen        | v1.0              |
| zebrafish-fecal    | v1.0              |

⚠️ Note for human-gut:

The human-gut shallow-mapping database was created manually by re-running Panaroo to reconstruct the pangenomes. This is likely to have caused discrepancies in the pangenomes, so please bear that in mind.
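
The database names in the storage list (for example `human-gut-v2-0-2`) appear to combine the biome name with the catalogue version, with dots replaced by dashes. Assuming `<CATALOGUE_ID>` follows the same pattern, which this document does not state explicitly, a small hypothetical helper can derive the identifier from a table row:

```python
def catalogue_id(biome, version):
    """Join a biome name and catalogue version into a dash-separated ID.

    Assumption: IDs follow the pattern seen in the database size list,
    e.g. ("human-gut", "v2.0.2") -> "human-gut-v2-0-2".
    """
    return f"{biome}-{version.replace('.', '-')}"
```

Check the resulting value against the names of the downloaded databases before relying on it.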

Test

To test the installed tool with your downloaded databases, you can run the pipeline using the small test dataset. Even if there are no hits with the biome you are interested in, the pipeline should finish successfully. Add -profile if you have set up a config profile for your compute resources.

```bash
cd biosiftr/tests
nextflow run ../main.nf \
    --input test_samplesheet.csv \
    --biome <CATALOGUE_ID> \
    --dbs </path/to/dbs> \
    --decontamination_indexes </path to folder with bwa-mem2 indexes>
```

Credits

The ebi-metagenomics/biosiftr pipeline was originally written by @Ales-ibt.

We thank the following people for their extensive assistance in the development of this pipeline: @mberacochea

Owner

Name: MGnify
Login: EBI-Metagenomics
Kind: organization
Email: metagenomics-help@ebi.ac.uk
Location: Genome Campus, UK

Website: https://www.ebi.ac.uk/metagenomics/
Twitter: MGnifyDB
Repositories: 153
Profile: https://github.com/EBI-Metagenomics

MGnify (formerly known as EBI Metagenomics) is a free resource for the assembly, analysis, archiving and browsing of all types of microbiome-derived sequence data.

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
    - family-names: "Escobar-Zepeda"
      given-names: "Alejandra"
      orcid: "https://orcid.org/0000-0003-3549-9115"
    - family-names: "Beracochea"
      given-names: "Martin"
      orcid: "https://orcid.org/0000-0003-3472-3736"
title: "BioSIFTR"
version: 1.2.0
doi: 10.48546/WORKFLOWHUB.WORKFLOW.1735.1
date-released: 2025-06-17
url: "https://github.com/EBI-Metagenomics/biosiftr_extended_methods"

GitHub Events

Total

Create event: 6
Issues event: 3
Release event: 2
Delete event: 6
Issue comment event: 13
Push event: 29
Pull request review event: 35
Pull request review comment event: 36
Pull request event: 15

Last Year

Create event: 6
Issues event: 3
Release event: 2
Delete event: 6
Issue comment event: 13
Push event: 29
Pull request review event: 35
Pull request review comment event: 36
Pull request event: 15

Issues and Pull Requests

All Time

Total issues: 2
Total pull requests: 12
Average time to close issues: N/A
Average time to close pull requests: 4 days
Total issue authors: 1
Total pull request authors: 3
Average comments per issue: 0.0
Average comments per pull request: 1.17
Merged pull requests: 7
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 2
Pull requests: 12
Average time to close issues: N/A
Average time to close pull requests: 4 days
Issue authors: 1
Pull request authors: 3
Average comments per issue: 0.0
Average comments per pull request: 1.17
Merged pull requests: 7
Bot issues: 0
Bot pull requests: 0

Top Authors

Issue Authors

Ales-ibt (2)

Pull Request Authors

mberacochea (8)
Ales-ibt (3)
jmattock5 (1)

Top Labels

Issue Labels

enhancement (2)

Pull Request Labels

Dependencies

.github/workflows/branch.yml actions

mshick/add-pr-comment v1 composite

.github/workflows/ci.yml actions

actions/checkout v3 composite
nf-core/setup-nextflow v1 composite

.github/workflows/clean-up.yml actions

actions/stale v7 composite

.github/workflows/fix-linting.yml actions

actions/checkout v3 composite
actions/setup-node v3 composite

.github/workflows/linting.yml actions

actions/checkout v3 composite
actions/setup-node v3 composite
actions/setup-python v4 composite
actions/upload-artifact v3 composite
mshick/add-pr-comment v1 composite
nf-core/setup-nextflow v1 composite
psf/black stable composite

.github/workflows/linting_comment.yml actions

dawidd6/action-download-artifact v2 composite
marocchino/sticky-pull-request-comment v2 composite

.github/workflows/release-announcments.yml actions

actions/setup-python v4 composite
rzr/fediverse-action master composite
zentered/bluesky-post-action v0.0.2 composite

modules/nf-core/custom/dumpsoftwareversions/meta.yml cpan

modules/nf-core/fastqc/meta.yml cpan

modules/nf-core/multiqc/meta.yml cpan

pyproject.toml pypi
