taxodactyl

https://github.com/qcif/taxodactyl

Last synced: 10 months ago

Repository

Basic Info

Host: GitHub
Owner: qcif
License: mit
Language: Python
Default Branch: main
Size: 118 MB

Statistics

Stars: 2
Watchers: 3
Forks: 0
Open Issues: 27
Releases: 12

Created over 1 year ago · Last pushed 10 months ago

Metadata Files

Readme License Citation

TAXODACTYL - High-confidence, evidence-based taxonomic assignment of DNA sequences

qcif/taxodactyl is a modular, reproducible Nextflow workflow for conservative taxonomic assignment of DNA sequences, designed for high-confidence, auditable results in biosecurity and biodiversity contexts. The workflow integrates multiple bioinformatics tools and databases, automates best-practice analysis steps, and produces detailed reports with supporting evidence for each taxonomic assignment.

taxodactyl_diagram

Workflow Overview

The pipeline orchestrates a series of analytical steps, each encapsulated in a dedicated module or subworkflow. The main stages are:

Environment Configuration - Sets up environment variables and paths required for downstream processes, ensuring reproducibility and portability.
Input Validation - Checks the integrity and compatibility of input files (FASTA sequences, metadata, databases), preventing downstream errors.
Sequence Search
- BLAST Core Nucleotide Database (BLASTN): Queries input sequences against the NCBI nucleotide database using BLASTN.
- BOLD v4 (API): Queries input sequences against the Barcode of Life Data Systems (BOLD). The results include the taxonomic lineage.
Hit Extraction - Parses BLAST results to extract relevant hits for each query.
Taxonomic ID Extraction - Retrieves taxids for BLAST hit records.
Build Taxonomic Lineage - Maps taxonomic IDs to full lineages, enabling downstream filtering and reporting.
Candidate Evaluation - Identifies candidate species for each query, applying configurable thresholds for identity and coverage.
Supporting Evidence Evaluation
- Supporting Publications: Assesses the diversity of publications supporting each candidate species' reference sequences.
- Database Coverage: Evaluates the representation of candidate species, taxa of interest and the preliminary taxonomic ID in global databases (GBIF, GenBank, BOLD).
Multiple Sequence Alignment (MAFFT) - Aligns candidate and query sequences to prepare for phylogenetic analysis.
Phylogenetic Tree Construction (FastME) - Builds a phylogenetic tree to visualise relatedness of candidate and query sequences.
Workflow Report - Generates detailed HTML and text reports, including sequence alignments, phylogenetic trees, database coverage, and supporting evidence for each assignment.
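
The Candidate Evaluation stage can be illustrated with a minimal sketch. This is not the pipeline's code: the `Hit` fields, the function name, the made-up hits and the default thresholds (98.5% identity, 85% query coverage) are all assumptions chosen for illustration. The real thresholds are configurable and described in the parameter documentation.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One search hit for a query; the field names are hypothetical."""
    accession: str
    species: str
    identity: float        # fraction of identical positions, 0..1
    query_coverage: float  # fraction of the query covered by the hit, 0..1


def select_candidates(hits, min_identity=0.985, min_coverage=0.85):
    """Keep hits that pass both thresholds and group them by species.

    The default thresholds are illustrative assumptions only.
    """
    candidates = {}
    for hit in hits:
        if hit.identity >= min_identity and hit.query_coverage >= min_coverage:
            candidates.setdefault(hit.species, []).append(hit)
    return candidates


# Made-up hits, for illustration only.
hits = [
    Hit("ACC001", "Myzus persicae", 0.995, 0.98),
    Hit("ACC002", "Myzus persicae", 0.990, 0.97),
    Hit("ACC003", "Lygus pratensis", 0.930, 0.99),  # fails the identity threshold
    Hit("ACC004", "Aphis gossypii", 0.990, 0.40),   # fails the coverage threshold
]
candidates = select_candidates(hits)
print(sorted(candidates))
```

Grouping by species rather than by hit reflects that the later evidence steps (publications, database coverage) are described per candidate species.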

Usage

Software

To run the qcif/taxodactyl pipeline, you will need the following software installed:

Nextflow - Tested versions: 24.10.3, 24.10.6
Java - Required by Nextflow. Tested version: 17.0.13
Singularity - Used for containerised execution of all bioinformatics tools, ensuring reproducibility. Tested version: 3.7.0

[!NOTE]
- Instructions on how to set up Nextflow and a compatible version of Java can be found on this page.
- To install Singularity, follow the instructions from this website.
- We provide different profiles as per the default nf-core configuration; however, this pipeline was only tested with Singularity.
- The pipeline was tested only on a Linux-based operating system, specifically Ubuntu 24.04.1 LTS.
- If you have never downloaded or run a Nextflow pipeline, we have some additional tips and bash commands in the step-by-step guide.

NCBI API Key

An API key is used to authenticate with the NCBI Entrez API and obtain a higher rate limit. You can generate one by following the instructions in this article.
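
As a rough illustration of where the key ends up, the sketch below builds an NCBI E-utilities request URL with the `api_key` and `email` query parameters. How the pipeline itself calls Entrez is not shown in this README, so the helper name, the choice of the `esummary` endpoint and the example values are assumptions. NCBI documents a limit of 3 requests per second without a key and 10 per second with one.

```python
from urllib.parse import urlencode

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_esummary_url(uid, db, api_key=None, email=None):
    """Build an esummary request URL (hypothetical helper, no request is sent)."""
    params = {"db": db, "id": str(uid), "retmode": "json"}
    if api_key:
        params["api_key"] = api_key  # raises the allowed request rate
    if email:
        params["email"] = email      # lets NCBI contact you about problems
    return f"{EUTILS_BASE}/esummary.fcgi?{urlencode(params)}"


url = build_esummary_url(9606, "taxonomy", api_key="API_KEY", email="analyst@example.org")
print(url)
```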

TaxonKit

Download the NCBI taxonomy data files and extract them to ~/.taxonkit. Then download the TaxonKit tool and move it into the same folder.
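
Before a run, it can help to confirm that the taxonomy dump was extracted where expected. The sketch below checks a directory for the four dump files that TaxonKit reads (`names.dmp`, `nodes.dmp`, `delnodes.dmp`, `merged.dmp`). The function name is hypothetical, and the demonstration uses a temporary directory rather than your real `~/.taxonkit`.

```python
import tempfile
from pathlib import Path

# Taxonomy dump files that TaxonKit reads from its data directory.
REQUIRED_DUMP_FILES = ("names.dmp", "nodes.dmp", "delnodes.dmp", "merged.dmp")


def missing_taxdump_files(taxdb_dir):
    """Return the required dump files that are absent from taxdb_dir."""
    taxdb = Path(taxdb_dir).expanduser()
    return [name for name in REQUIRED_DUMP_FILES if not (taxdb / name).is_file()]


# Demonstration on a temporary directory holding only two of the four files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("names.dmp", "nodes.dmp"):
        (Path(tmp) / name).write_text("")
    missing = missing_taxdump_files(tmp)

print(missing)
```

For a real installation you would call `missing_taxdump_files("~/.taxonkit")` and expect an empty list.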

BLAST Core Nucleotide Database

To search sequences against the BLAST Core Nucleotide Database, you must download it first. We recommend running the update_blastdb.pl program; follow the instructions from this book. A Perl installation is required. The command should look like this:

    perl ~/ncbi-blast-2.16.0+/bin/update_blastdb.pl --decompress core_nt

Sequences file (`sequences.fasta`)

You will need a FASTA file containing the query sequences (up to 100), e.g.

```
>VE24-1075_COI
TGGATCATCTCTTAGAATTTTAATTCGATTAGAATTAAGACAAATTAATTCTATTATTWATAATAATCAATTATATAATGTAATTGTTCACAATTCATGCTTTTATTATAATTTTTTTTATAACTATACCAATTGTAATTGGTGGATTTGGAAATTGATTAATTCCTATAATAATAGGATGTCCTGATATATCATTTCCACSTTTAAATAATATTAGATTTTGATTATTACCTCCATCATTAATAATAATAATTTGTAGATTTTTAATTAATAATGGAACAGGAACAGGATGAACAATTTAYCCHCCTTTATCAAACAATATTGCACATAATAACATTTCAGTTGATTTAACTATTTTTTCTTTACATTTAGCAGGWATCTCATCAATTTTAGGAGCAATTAACTTTATTTGTACAATTCTTAATATAATAYCAAAYAATATAAAACTAAATCAAATTCCTCTTTTTCCTTGATCAATTTTAATTACAGCTATTTTATTAATTTTATMTTTACCAGTTTTAGCTGGTGCCATTACAATATTATTAACTGATCGTAATTTAAATACATCATTTTTGATCCAGCAGGAGGAGGAGATCC
>VE24-1079_COI
AACTTTATATTTCATTTTTGGAATATGGGCAGGTATATTAGGAACTTCACTAAGATGAATTATTCGAATTGAACTTGGACAACCAGGATCATTTATTGGAGATGATCAAATTTATAATGTAGTAGTTACCGCACACGCATTTATTATAATTTTCTTTATAGTTATACCAATTATAATTGGAGGATTTGGAAATTGATTAGTACCTCTAATAATTGGAGCACCAGATATAGCATTCCCACGGATAAATAATATAAGATTTTGATTATTACCACCCTCAATTACACTTCTTATTATAAGATCTATAGTAGAAAGAGGAGCAGGAACTGGATGAACAGTATATCCCCCACTATCATCAAATATTGCACATAGTGGAGCATCAGTAGACCTAGCAATTTTTTCACTACATTTAGCAGGTGTATCTTCAATTTTAGGAGCAATTAATTTCATCTCAACAATTATTAATATACGACCTGAAGGCATATCTCCAGAACGAATTCCATTATTTGTATGATCAGTAGGTATTACAGCATTACTATTATTATTATCATTACCAGTTCTAGCTGGAGCTATTACAATATTATTAACAGATCGAAACTTTAATACCTCATTCTTTGACCCAGTAGGAGGAGGAGATCCTATCTTATATCAACATTTATTTTGATTTTTT
```

[!NOTE]
- An example can be downloaded from test/query.fasta.
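
A quick pre-flight check of the sequences file can catch problems before the pipeline's own input validation runs. The sketch below is an illustration, not the pipeline's validator: the function names are hypothetical, the duplicate-ID check is an assumption, and only the limit of 100 sequences comes from the text above. The example sequences are truncated to their first bases.

```python
def read_fasta_ids(text):
    """Return the sequence IDs (first word of each '>' header) in file order."""
    ids = []
    for line in text.splitlines():
        if line.startswith(">"):
            header = line[1:].strip()
            ids.append(header.split()[0] if header else "")
    return ids


def check_queries(text, max_sequences=100):
    """Return (ids, problems); an empty problems list means the checks passed."""
    ids = read_fasta_ids(text)
    problems = []
    if not ids:
        problems.append("no sequences found")
    if len(ids) > max_sequences:
        problems.append(f"{len(ids)} sequences exceed the limit of {max_sequences}")
    if any(not seq_id for seq_id in ids):
        problems.append("a header line has no sequence ID")
    duplicates = sorted({seq_id for seq_id in ids if ids.count(seq_id) > 1})
    if duplicates:
        problems.append("duplicate IDs: " + ", ".join(duplicates))
    return ids, problems


example = ">VE24-1075_COI\nTGGATCATCTCTTAGAATT\n>VE24-1079_COI\nAACTTTATATTTCATTTTT\n"
ids, problems = check_queries(example)
print(ids, problems)
```

The returned IDs can then be compared with the `sample_id` column of the metadata file.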

Metadata file (`metadata.csv`)

The metadata file provides essential information about each sequence and must follow the structure below. Each row corresponds to a sample and should include required and, optionally, additional columns.

Required Columns

sample_id - Unique identifier for the sample. Must match the sequence ID in the sequences.fasta file. Cannot contain spaces.
locus - Name of the genetic locus for the sample, which must be in the list of permitted loci. If deliberately providing no locus, the value NA is also accepted.
> [!NOTE]
> - By default, COX1_SPECIES_PUBLIC (all published COI records from BOLD and GenBank with a minimum sequence length of 500bp) is used for BOLD search, so the locus from metadata will be ignored when db_type = bold.
> - You can modify the BOLD database by changing the bold_database_name parameter (see docs/params.md). However, we have not tested other BOLD databases besides COX1_SPECIES_PUBLIC.
> - Loci synonyms will be checked as well (see scripts/config/loci.json).
> - If you need to modify which loci and synonyms are permitted, see the technical documentation.
preliminary_id - Preliminary morphology ID of the sample.

Optional Columns

taxa_of_interest - Taxa of interest for the sample. If multiple, separate them with a | character.
host - Host organism of the sample.
country - Country of origin for the sample.
sequencing_platform - Sequencing platform used for the sample.
sequencing_read_coverage - Sequencing read coverage for the sample.

Example

sample_id	locus	preliminary_id	taxa_of_interest	host	country	sequencing_platform	sequencing_read_coverage
VE24-1075_COI	COI	Aphididae	Myzus persicae|Aphididae	Cut flower Rosa	Ecuador	Nanopore	30x
VE24-1079_COI	COI	Miridae	Lygus pratensis	Cut flower Paenonia	Netherlands	Nanopore	30x

[!NOTE]
- All required columns must be present for every sample.
- Optional columns can be left blank or completely omitted if not applicable.
- Columns 4 and 5 are examples of "arbitrary columns" - add any arbitrary columns you like, and they will be included in the workflow report "Sample metadata".
- For more details on the metadata schema, see assets/schema_input.json.
- An example can be downloaded from test/metadata.csv.
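
The metadata rules above can be sketched as a small loader. This is an illustration only: the authoritative schema is assets/schema_input.json, the function name is hypothetical, and the sketch assumes a comma-delimited file (as the .csv name suggests). It checks the three required columns, rejects sample IDs that contain spaces or are missing from the FASTA file, and splits `taxa_of_interest` on the `|` character.

```python
import csv
import io

REQUIRED_COLUMNS = ("sample_id", "locus", "preliminary_id")


def load_metadata(handle, fasta_ids):
    """Read metadata rows and apply the checks described in this section."""
    reader = csv.DictReader(handle)
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError("missing required columns: " + ", ".join(missing))
    rows = []
    for row in reader:
        sample_id = row["sample_id"]
        if not sample_id or " " in sample_id:
            raise ValueError(f"invalid sample_id: {sample_id!r}")
        if sample_id not in fasta_ids:
            raise ValueError(f"sample_id {sample_id!r} has no matching sequence")
        if not row["locus"] or not row["preliminary_id"]:
            raise ValueError(f"{sample_id}: locus and preliminary_id are required")
        # Multiple taxa of interest are separated with a | character.
        taxa = row.get("taxa_of_interest") or ""
        row["taxa_of_interest"] = [t.strip() for t in taxa.split("|") if t.strip()]
        rows.append(row)
    return rows


example = io.StringIO(
    "sample_id,locus,preliminary_id,taxa_of_interest,country\n"
    "VE24-1075_COI,COI,Aphididae,Myzus persicae|Aphididae,Ecuador\n"
    "VE24-1079_COI,COI,Miridae,Lygus pratensis,Netherlands\n"
)
rows = load_metadata(example, fasta_ids={"VE24-1075_COI", "VE24-1079_COI"})
print([(r["sample_id"], r["taxa_of_interest"]) for r in rows])
```

The sketch does not check `locus` against the permitted loci list, since that list lives in scripts/config/loci.json and is not reproduced here.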

To run the pipeline against a local BLAST Core Nt database:

    nextflow run /path/to/pipeline/taxodactyl/main.nf \
      --metadata /path/to/metadata.csv \
      --sequences /path/to/sequences.fasta \
      --blastdb /path/to/blastdbs/core_nt \
      --outdir /path/to/output \
      -profile singularity \
      --taxdb /path/to/.taxonkit/ \
      --ncbi_api_key API_KEY \
      --ncbi_user_email EMAIL \
      --analyst_name "Magdalena Antczak" \
      --facility_name "QCIF" \
      -resume

To run the pipeline using the BOLD web database:

    nextflow run /path/to/pipeline/taxodactyl/main.nf \
      --metadata /path/to/metadata.csv \
      --sequences /path/to/sequences.fasta \
      --db_type bold \
      --outdir /path/to/output \
      -profile singularity \
      --taxdb /path/to/.taxonkit/ \
      --ncbi_api_key API_KEY \
      --ncbi_user_email EMAIL \
      --analyst_name "Magdalena Antczak" \
      --facility_name "QCIF" \
      -resume
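
The two commands differ only in the database option (`--blastdb` versus `--db_type bold`). If you launch runs from a script, a small helper can assemble either variant. The sketch below is a convenience wrapper, not part of the pipeline: the function name is hypothetical, and it uses only the options shown in the commands above.

```python
import shlex


def build_run_command(main_nf, metadata, sequences, outdir, taxdb,
                      ncbi_api_key, ncbi_user_email, analyst_name,
                      facility_name, blastdb=None, db_type=None):
    """Return the nextflow command as an argument list (hypothetical helper)."""
    command = ["nextflow", "run", main_nf,
               "--metadata", metadata,
               "--sequences", sequences]
    if db_type == "bold":
        command += ["--db_type", "bold"]   # BOLD web database
    elif blastdb:
        command += ["--blastdb", blastdb]  # local BLAST Core Nt database
    else:
        raise ValueError("provide blastdb, or set db_type='bold'")
    command += ["--outdir", outdir,
                "-profile", "singularity",
                "--taxdb", taxdb,
                "--ncbi_api_key", ncbi_api_key,
                "--ncbi_user_email", ncbi_user_email,
                "--analyst_name", analyst_name,
                "--facility_name", facility_name,
                "-resume"]
    return command


common = dict(
    main_nf="/path/to/pipeline/taxodactyl/main.nf",
    metadata="/path/to/metadata.csv",
    sequences="/path/to/sequences.fasta",
    outdir="/path/to/output",
    taxdb="/path/to/.taxonkit/",
    ncbi_api_key="API_KEY",
    ncbi_user_email="EMAIL",
    analyst_name="Magdalena Antczak",
    facility_name="QCIF",
)
blast_command = build_run_command(blastdb="/path/to/blastdbs/core_nt", **common)
bold_command = build_run_command(db_type="bold", **common)
print(shlex.join(bold_command))
```

Returning a list rather than a string means the command can be passed to `subprocess.run` without shell quoting issues; `shlex.join` is used only to print it.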

[!NOTE]
- For a detailed explanation of all pipeline parameters, see parameter documentation.
- We recommend avoiding spaces in file and folder names to prevent issues in command-line operations.
- The error strategy for the workflow is set to ignore. This means that even if a process encounters an error, Nextflow will continue executing subsequent processes rather than terminating the workflow. This avoids interrupting an entire workflow with multiple queries when only one of them fails. Unfortunately, this behaviour prevents more detailed errors from being displayed in the standard output. Instead, you will only see which tasks failed and the hashes assigned to them, which you can use to navigate the work folder to find specific errors. As a workaround, you can run the following script at the end of your run from the directory where the pipeline was executed: bash /path/to/pipeline/taxodactyl/bin/collect_errors.sh. It should display a list of processes together with their work directory paths, the last 10 lines of standard error and the last 10 lines of standard output.
- You can find detailed instructions and practical examples for customising the pipeline configuration in the docs/customise.md file. This guide covers how to set parameters, adjust resources, change error strategies, and modify the Singularity cache directory for your Nextflow runs.
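
To make the idea behind the error-collection workaround concrete, here is a minimal Python sketch of the same kind of report. It does not reproduce bin/collect_errors.sh, whose implementation is not shown here. It assumes the standard Nextflow work directory layout, where each task directory (`work/<xx>/<hash>/`) contains `.exitcode`, `.command.err` and `.command.out`; the function names are hypothetical.

```python
import tempfile
from pathlib import Path


def tail(path, n=10):
    """Return the last n lines of a file, or an empty list if it is missing."""
    if not path.is_file():
        return []
    return path.read_text(errors="replace").splitlines()[-n:]


def collect_failed_tasks(work_dir, n=10):
    """List task directories whose recorded exit code is non-zero."""
    failures = []
    for exit_file in sorted(Path(work_dir).glob("*/*/.exitcode")):
        code = exit_file.read_text().strip()
        if code and code != "0":
            task_dir = exit_file.parent
            failures.append({
                "work_dir": str(task_dir),
                "exit_code": code,
                "stderr": tail(task_dir / ".command.err", n),
                "stdout": tail(task_dir / ".command.out", n),
            })
    return failures


# Demonstration on a fake work directory with one failed and one successful task.
with tempfile.TemporaryDirectory() as tmp:
    work = Path(tmp) / "work"
    failed = work / "ab" / "123456"
    passed = work / "cd" / "789abc"
    failed.mkdir(parents=True)
    passed.mkdir(parents=True)
    (failed / ".exitcode").write_text("1")
    (failed / ".command.err").write_text(
        "\n".join(f"error line {i}" for i in range(1, 13))
    )
    (passed / ".exitcode").write_text("0")
    failures = collect_failed_tasks(work)

for failure in failures:
    print(failure["work_dir"], "exit code", failure["exit_code"])
    print("\n".join(failure["stderr"]))
```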

Pipeline output

After running the pipeline, the output directory will contain a separate folder for each query sequence and a folder with information about the run. Here, we show the results folder structure when using the two databases. For more information, see the output documentation. See this document for a detailed description of the analysis and interpretation of the workflow report.

BLAST Core Nucleotide Database

.
├── blast_result.xml
├── pipeline_info
│   ├── execution_report_2025-06-22_22-53-15.html
│   ├── execution_timeline_2025-06-22_22-53-15.html
│   ├── execution_trace_2025-06-22_22-53-15.txt
│   ├── params_2025-06-22_22-53-29.json
│   └── pipeline_dag_2025-06-22_22-53-15.html
├── query_001_VE24-1075_COI
│   ├── all_hits.fasta
│   ├── candidates.csv
│   ├── candidates.fasta
│   ├── candidates_identity_boxplot.png
│   ├── candidates_phylogeny.fasta
│   ├── candidates_phylogeny.msa
│   ├── candidates_phylogeny.nwk
│   └── report_VE24-1075_COI_20250622_225319.html
└── query_002_VE24-1079_COI
    ├── all_hits.fasta
    ├── candidates.csv
    ├── candidates.fasta
    ├── candidates_phylogeny.fasta
    ├── candidates_phylogeny.msa
    ├── candidates_phylogeny.nwk
    └── report_VE24-1079_COI_20250622_225319.html

BOLD

.
├── pipeline_info
│   ├── execution_report_2025-06-22_22-53-22.html
│   ├── execution_timeline_2025-06-22_22-53-22.html
│   ├── execution_trace_2025-06-22_22-53-22.txt
│   ├── params_2025-06-22_22-53-34.json
│   └── pipeline_dag_2025-06-22_22-53-22.html
├── query_001_VE24-1075_COI
│   ├── all_hits.fasta
│   ├── candidates.csv
│   ├── candidates.fasta
│   ├── candidates_phylogeny.fasta
│   ├── candidates_phylogeny.msa
│   ├── candidates_phylogeny.nwk
│   └── report_BOLD_VE24-1075_COI_20250622_225326.html
└── query_002_VE24-1079_COI
    ├── all_hits.fasta
    ├── candidates.csv
    ├── candidates.fasta
    ├── candidates_identity_boxplot.png
    ├── candidates_phylogeny.fasta
    ├── candidates_phylogeny.msa
    ├── candidates_phylogeny.nwk
    └── report_BOLD_VE24-1079_COI_20250622_225326.html
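
Given the folder layout above, the per-query reports can be located with a short script. This sketch relies only on the naming pattern visible in the listing (`query_*` folders containing `report_*.html`); the function name is hypothetical, and the demonstration builds a throwaway directory instead of reading a real run.

```python
import tempfile
from pathlib import Path


def find_reports(outdir):
    """Map each query folder name to the report files found inside it."""
    reports = {}
    for query_dir in sorted(Path(outdir).glob("query_*")):
        if query_dir.is_dir():
            reports[query_dir.name] = sorted(
                p.name for p in query_dir.glob("report_*.html")
            )
    return reports


# Demonstration on a temporary directory that mimics the layout above.
with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp)
    (out / "pipeline_info").mkdir()
    first = out / "query_001_VE24-1075_COI"
    second = out / "query_002_VE24-1079_COI"
    first.mkdir()
    second.mkdir()
    (first / "candidates.csv").write_text("")
    (first / "report_VE24-1075_COI_20250622_225319.html").write_text("")
    (second / "report_VE24-1079_COI_20250622_225319.html").write_text("")
    reports = find_reports(out)

for query, files in reports.items():
    print(query, files)
```

A query folder with an empty report list would indicate that the run did not produce a report for that sequence, which is worth checking given the ignore error strategy.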

Credits

Department of Agriculture, Fisheries and Forestry QCIF

qcif/taxodactyl was originally written by Magdalena Antczak, Cameron Hyde and Daisy Li from QCIF Ltd. The project was funded by the Department of Agriculture, Fisheries and Forestry and the Australian BioCommons.

The workflow was designed by:
- Cameron Hyde
- Magdalena Antczak
- Lanxi (Daisy) Li
- Valentine Murigneux
- Sarah Williams
- Michael Thang
- Bradley Pease
- Shaun Bochow
- Grace Sun

Citations

If you use qcif/taxodactyl for your analysis, please cite it using the following:

Antczak, M., Hyde, C., Li, Lanxi (Daisy), Murigneux, V., Williams, S., Thang, M., Pease, B., Bochow, S., & Sun, G. (2025). TAXODACTYL - High-confidence, evidence-based taxonomic assignment of DNA sequences. WorkflowHub. https://doi.org/10.48546/WORKFLOWHUB.WORKFLOW.1782.3

An extensive list of references for the tools used by the pipeline can be found in the CITATIONS.md file.

qcif/taxodactyl uses code and infrastructure developed and maintained by the nf-core community, reused here under the MIT license.

The nf-core framework for community-curated bioinformatics pipelines.

Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.

Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.

Owner

Name: QCIF
Login: qcif
Kind: organization
Location: Queensland, Australia

Website: https://www.qcif.edu.au/
Repositories: 45
Profile: https://github.com/qcif

Queensland Cyber Infrastructure Foundation

Citation (CITATIONS.md)

# qcif/taxodactyl: Citations

## [nf-core](https://pubmed.ncbi.nlm.nih.gov/32055031/)

> Ewels PA, Peltzer A, Fillinger S, Patel H, Alneberg J, Wilm A, Garcia MU, Di Tommaso P, Nahnsen S. The nf-core framework for community-curated bioinformatics pipelines. Nat Biotechnol. 2020 Mar;38(3):276-278. doi: 10.1038/s41587-020-0439-x. PubMed PMID: 32055031.

## [Nextflow](https://pubmed.ncbi.nlm.nih.gov/28398311/)

> Di Tommaso P, Chatzou M, Floden EW, Barja PP, Palumbo E, Notredame C. Nextflow enables reproducible computational workflows. Nat Biotechnol. 2017 Apr 11;35(4):316-319. doi: 10.1038/nbt.3820. PubMed PMID: 28398311.

## Pipeline tools

- [BLAST+](https://pubmed.ncbi.nlm.nih.gov/20003500/)
  
  > Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, Madden TL. BLAST+: architecture and applications. BMC Bioinformatics. 2009 Dec 15;10:421. doi: 10.1186/1471-2105-10-421. PMID: 20003500; PMCID: PMC2803857.

- [MAFFT](https://pubmed.ncbi.nlm.nih.gov/23329690/)
  
  > Katoh K, Standley DM. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol Biol Evol. 2013 Apr;30(4):772-80. doi: 10.1093/molbev/mst010. Epub 2013 Jan 16. PMID: 23329690; PMCID: PMC3603318.

- [FastME](https://pmc.ncbi.nlm.nih.gov/articles/PMC4576710/)
  
  > Lefort V, Desper R, Gascuel O. FastME 2.0: A Comprehensive, Accurate, and Fast Distance-Based Phylogeny Inference Program. Mol Biol Evol. 2015 Oct;32(10):2798-800. doi: 10.1093/molbev/msv150. Epub 2015 Jun 30. PMID: 26130081; PMCID: PMC4576710.

- [TaxonKit](https://pubmed.ncbi.nlm.nih.gov/34001434/)
  
  > Shen W, Ren H. TaxonKit: A practical and efficient NCBI taxonomy toolkit. J Genet Genomics. 2021 Sep 20;48(9):844-850. doi: 10.1016/j.jgg.2021.03.006. Epub 2021 Apr 15. PMID: 34001434.

## Software packaging/containerisation tools

- [Anaconda](https://anaconda.com)

  > Anaconda Software Distribution. Computer software. Vers. 2-2.4.0. Anaconda, Nov. 2016. Web.

- [Bioconda](https://pubmed.ncbi.nlm.nih.gov/29967506/)

  > Grüning B, Dale R, Sjödin A, Chapman BA, Rowe J, Tomkins-Tinch CH, Valieris R, Köster J; Bioconda Team. Bioconda: sustainable and comprehensive software distribution for the life sciences. Nat Methods. 2018 Jul;15(7):475-476. doi: 10.1038/s41592-018-0046-7. PubMed PMID: 29967506.

- [BioContainers](https://pubmed.ncbi.nlm.nih.gov/28379341/)

  > da Veiga Leprevost F, Grüning B, Aflitos SA, Röst HL, Uszkoreit J, Barsnes H, Vaudel M, Moreno P, Gatto L, Weber J, Bai M, Jimenez RC, Sachsenberg T, Pfeuffer J, Alvarez RV, Griss J, Nesvizhskii AI, Perez-Riverol Y. BioContainers: an open-source and community-driven framework for software standardization. Bioinformatics. 2017 Aug 15;33(16):2580-2582. doi: 10.1093/bioinformatics/btx192. PubMed PMID: 28379341; PubMed Central PMCID: PMC5870671.

- [Docker](https://dl.acm.org/doi/10.5555/2600239.2600241)

  > Merkel, D. (2014). Docker: lightweight linux containers for consistent development and deployment. Linux Journal, 2014(239), 2. doi: 10.5555/2600239.2600241.

- [Singularity](https://pubmed.ncbi.nlm.nih.gov/28494014/)

  > Kurtzer GM, Sochat V, Bauer MW. Singularity: Scientific containers for mobility of compute. PLoS One. 2017 May 11;12(5):e0177459. doi: 10.1371/journal.pone.0177459. eCollection 2017. PubMed PMID: 28494014; PubMed Central PMCID: PMC5426675.

GitHub Events

Total

Create event: 8
Issues event: 90
Release event: 1
Watch event: 1
Delete event: 5
Member event: 1
Issue comment event: 18
Push event: 137
Pull request event: 4

Last Year

Create event: 8
Issues event: 90
Release event: 1
Watch event: 1
Delete event: 5
Member event: 1
Issue comment event: 18
Push event: 137
Pull request event: 4

Committers

Last synced: 11 months ago

All Time

Total Commits: 134
Total Committers: 3
Avg Commits per committer: 44.667
Development Distribution Score (DDS): 0.06

Past Year

Commits: 134
Committers: 3
Avg Commits per committer: 44.667
Development Distribution Score (DDS): 0.06

Top Committers

Name	Email	Commits
Ubuntu	u**u@c**l	126
Cameron Hyde	c**e@n**m	7
ubuntu	u**u@c**0	1
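
The summary figures above can be reproduced from the commit counts in the table. The sketch assumes that the Development Distribution Score is defined as one minus the top committer's share of all commits; that definition is not stated on this page, but it is consistent with the reported value.

```python
# Commit counts copied from the Top Committers table.
commits = {"Ubuntu": 126, "Cameron Hyde": 7, "ubuntu": 1}

total = sum(commits.values())
average = total / len(commits)
# Assumed definition: DDS = 1 - (commits by top committer / total commits).
dds = 1 - max(commits.values()) / total

print(total, round(average, 3), round(dds, 2))
```

A score this close to zero means that almost all commits are attributed to a single committer identity.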

Committer Domains (Top 20 + Academic)

neoformit.com: 1
cjh-db-admin-20241120.novalocal: 1

Issues and Pull Requests

Last synced: 10 months ago

All Time

Total issues: 79
Total pull requests: 3
Average time to close issues: 11 days
Average time to close pull requests: 7 days
Total issue authors: 2
Total pull request authors: 1
Average comments per issue: 1.62
Average comments per pull request: 0.33
Merged pull requests: 2
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 79
Pull requests: 3
Average time to close issues: 11 days
Average time to close pull requests: 7 days
Issue authors: 2
Pull request authors: 1
Average comments per issue: 1.62
Average comments per pull request: 0.33
Merged pull requests: 2
Bot issues: 0
Bot pull requests: 0

Top Authors

Issue Authors

neoformit (42)
mantczakaus (37)

Pull Request Authors

neoformit (3)

Top Labels

Issue Labels

enhancement (34) bug (21) low priority (6) duplicate (3) python (3) high_priority (2) report (2) blocking (2)

Science Score: 75.0%