taxodactyl
Science Score: 75.0%
This score indicates how likely this project is to be science-related based on various indicators:
-
✓CITATION.cff file
Found CITATION.cff file -
✓codemeta.json file
Found codemeta.json file -
✓.zenodo.json file
Found .zenodo.json file -
✓DOI references
Found 6 DOI reference(s) in README -
✓Academic publication links
Links to: ncbi.nlm.nih.gov -
○Committers with academic emails
-
✓Institutional organization owner
Organization qcif has institutional domain (www.qcif.edu.au) -
○JOSS paper metadata
-
○Scientific vocabulary similarity
Low similarity (11.4%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: qcif
- License: mit
- Language: Python
- Default Branch: main
- Size: 118 MB
Statistics
- Stars: 2
- Watchers: 3
- Forks: 0
- Open Issues: 27
- Releases: 12
Metadata Files
README.md
TAXODACTYL - High-confidence, evidence-based taxonomic assignment of DNA sequences
qcif/taxodactyl is a modular, reproducible Nextflow workflow for conservative taxonomic assignment of DNA sequences, designed for high-confidence, auditable results in biosecurity and biodiversity contexts. The workflow integrates multiple bioinformatics tools and databases, automates best-practice analysis steps, and produces detailed reports with supporting evidence for each taxonomic assignment.
Quick links
- Example workflow report
- Documentation of the analysis
- Python scripts (for developers)
Workflow Overview
The pipeline orchestrates a series of analytical steps, each encapsulated in a dedicated module or subworkflow. The main stages are:
- Environment Configuration: Sets up environment variables and paths required for downstream processes, ensuring reproducibility and portability.
- Input Validation: Checks the integrity and compatibility of input files (FASTA sequences, metadata, databases), preventing downstream errors.
- Sequence Search
  - BLAST Core Nucleotide Database (BLASTN): Queries input sequences against the NCBI nucleotide database using BLASTN.
  - BOLD v4 (API): Queries input sequences against the Barcode of Life Data Systems. Taxonomic lineage is included in the results.
- Hit Extraction: Parses BLAST results to extract relevant hits for each query.
- Taxonomic ID Extraction: Retrieves taxids for BLAST hit records.
- Build Taxonomic Lineage: Maps taxonomic IDs to full lineages, enabling downstream filtering and reporting.
- Candidate Evaluation: Identifies candidate species for each query, applying configurable thresholds for identity and coverage.
- Supporting Evidence Evaluation
- Multiple Sequence Alignment (MAFFT): Aligns candidate and query sequences to prepare for phylogenetic analysis.
- Phylogenetic Tree Construction (FastME): Builds a phylogenetic tree to visualise relatedness of candidate and query sequences.
- Workflow Report: Generates detailed HTML and text reports, including sequence alignments, phylogenetic trees, database coverage, and supporting evidence for each assignment.
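The candidate evaluation stage can be pictured with a minimal sketch like the one below. It is not the pipeline's implementation: the `Hit` class, its field names, the threshold values and the example hits are all hypothetical and chosen only for illustration. See the parameter documentation for the real options.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One database hit for a query (field names are hypothetical)."""
    species: str
    identity: float        # percent identity, 0-100
    query_coverage: float  # percent of the query aligned, 0-100


def select_candidates(hits, min_identity=98.5, min_coverage=85.0):
    """Return the best passing hit per species, highest identity first.

    The default thresholds are placeholders, not the pipeline's defaults.
    """
    best = {}
    for hit in hits:
        # Discard hits that fail either threshold.
        if hit.identity < min_identity or hit.query_coverage < min_coverage:
            continue
        current = best.get(hit.species)
        if current is None or hit.identity > current.identity:
            best[hit.species] = hit
    return sorted(best.values(), key=lambda h: h.identity, reverse=True)


# Made-up hits for illustration only.
hits = [
    Hit("Myzus persicae", 99.4, 97.0),
    Hit("Myzus persicae", 98.9, 99.0),
    Hit("Myzus antirrhinii", 98.6, 96.0),
    Hit("Aphis gossypii", 91.2, 98.0),
    Hit("Myzus cerasi", 99.0, 40.0),
]
for candidate in select_candidates(hits):
    print(candidate.species, candidate.identity)
```

Keeping the thresholds as function arguments mirrors the idea of configurable cut-offs: a stricter identity threshold simply yields fewer (or no) candidates.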
Usage
Software
To run the qcif/taxodactyl pipeline, you will need the following software installed:
Nextflow Tested versions: 24.10.3, 24.10.6
Java Required by Nextflow. Tested version: 17.0.13
Singularity Used for containerised execution of all bioinformatics tools, ensuring reproducibility. Tested version: 3.7.0
[!NOTE]
- Instructions on how to set up Nextflow and a compatible version of Java can be found on this page.
- To install Singularity, follow the instructions from this website.
- We provide different profiles as per the default nf-core configuration; however, this pipeline was only tested with Singularity.
- The pipeline was tested only on a Linux-based operating system, specifically Ubuntu 24.04.1 LTS.
- If you have never downloaded or run a Nextflow pipeline, we have some additional tips and bash commands in the step-by-step guide.
NCBI API Key
An NCBI API key is used to authenticate with the NCBI Entrez API, which grants a higher rate limit. You can generate one by following the instructions in this article.
TaxonKit
Download the NCBI taxonomy data files and extract them to ~/.taxonkit. Similarly, download the taxonkit tool and move it into the same folder.
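Before launching a run, it can help to confirm that the taxonomy files are where they are expected. The sketch below is only a convenience check and is not part of the pipeline. The list of file names is an assumption based on the files TaxonKit is documented to read from its data directory, so confirm it against the TaxonKit documentation for your version.

```python
from pathlib import Path

# Assumption: the dump files TaxonKit reads from its data directory.
REQUIRED_DUMP_FILES = ("names.dmp", "nodes.dmp", "delnodes.dmp", "merged.dmp")


def missing_taxonkit_files(data_dir):
    """Return the required taxonomy dump files that are absent from data_dir."""
    data_dir = Path(data_dir).expanduser()
    return [name for name in REQUIRED_DUMP_FILES if not (data_dir / name).is_file()]


if __name__ == "__main__":
    missing = missing_taxonkit_files("~/.taxonkit")
    if missing:
        print("Missing from ~/.taxonkit:", ", ".join(missing))
    else:
        print("All taxonomy dump files found.")
```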
BLAST Core Nucleotide Database
To search sequences against the BLAST Core Nucleotide Database, you must download it first. We recommend running the update_blastdb.pl program. Follow instructions from this book. Perl installation is required.
The command should look like this:
perl ~/ncbi-blast-2.16.0+/bin/update_blastdb.pl --decompress core_nt
Sequences file (sequences.fasta)
You will need a FASTA file containing the query sequences (up to 100), e.g.
```
>VE24-1075_COI
TGGATCATCTCTTAGAATTTTAATTCGATTAGAATTAAGACAAATTAATTCTATTATTWATAATAATCAATTATATAATGTAATTGTTCACAATTCATGCTTTTATTATAATTTTTTTTATAACTATACCAATTGTAATTGGTGGATTTGGAAATTGATTAATTCCTATAATAATAGGATGTCCTGATATATCATTTCCACSTTTAAATAATATTAGATTTTGATTATTACCTCCATCATTAATAATAATAATTTGTAGATTTTTAATTAATAATGGAACAGGAACAGGATGAACAATTTAYCCHCCTTTATCAAACAATATTGCACATAATAACATTTCAGTTGATTTAACTATTTTTTCTTTACATTTAGCAGGWATCTCATCAATTTTAGGAGCAATTAACTTTATTTGTACAATTCTTAATATAATAYCAAAYAATATAAAACTAAATCAAATTCCTCTTTTTCCTTGATCAATTTTAATTACAGCTATTTTATTAATTTTATMTTTACCAGTTTTAGCTGGTGCCATTACAATATTATTAACTGATCGTAATTTAAATACATCATTTTTGATCCAGCAGGAGGAGGAGATCC
>VE24-1079_COI
AACTTTATATTTCATTTTTGGAATATGGGCAGGTATATTAGGAACTTCACTAAGATGAATTATTCGAATTGAACTTGGACAACCAGGATCATTTATTGGAGATGATCAAATTTATAATGTAGTAGTTACCGCACACGCATTTATTATAATTTTCTTTATAGTTATACCAATTATAATTGGAGGATTTGGAAATTGATTAGTACCTCTAATAATTGGAGCACCAGATATAGCATTCCCACGGATAAATAATATAAGATTTTGATTATTACCACCCTCAATTACACTTCTTATTATAAGATCTATAGTAGAAAGAGGAGCAGGAACTGGATGAACAGTATATCCCCCACTATCATCAAATATTGCACATAGTGGAGCATCAGTAGACCTAGCAATTTTTTCACTACATTTAGCAGGTGTATCTTCAATTTTAGGAGCAATTAATTTCATCTCAACAATTATTAATATACGACCTGAAGGCATATCTCCAGAACGAATTCCATTATTTGTATGATCAGTAGGTATTACAGCATTACTATTATTATTATCATTACCAGTTCTAGCTGGAGCTATTACAATATTATTAACAGATCGAAACTTTAATACCTCATTCTTTGACCCAGTAGGAGGAGGAGATCCTATCTTATATCAACATTTATTTTGATTTTTT
```
[!NOTE]
- Example can be downloaded from test/query.fasta.
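A quick pre-flight check of the query file might look like the following sketch. It is independent of the pipeline's own input validation, and the function names are hypothetical. The 100-sequence limit comes from the text above. Treating IUPAC ambiguity codes as valid and gap characters as invalid is an assumption.

```python
IUPAC_NUCLEOTIDES = set("ACGTURYSWKMBDHVN")  # assumption: ambiguity codes are accepted
MAX_QUERIES = 100  # limit stated in this README


def read_fasta(text):
    """Parse FASTA text into a {sequence_id: sequence} dict."""
    records = {}
    current_id = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            if not header:
                raise ValueError("Header line without a sequence ID")
            current_id = header[0]  # ID is the first word after '>'
            if current_id in records:
                raise ValueError(f"Duplicate sequence ID: {current_id}")
            records[current_id] = ""
        elif current_id is None:
            raise ValueError("Sequence data found before the first '>' header")
        else:
            records[current_id] += line.upper()
    return records


def check_queries(records):
    """Return a list of human-readable problems (empty if none found)."""
    problems = []
    if len(records) > MAX_QUERIES:
        problems.append(f"{len(records)} sequences exceed the limit of {MAX_QUERIES}")
    for seq_id, seq in records.items():
        if not seq:
            problems.append(f"{seq_id}: empty sequence")
        invalid = sorted(set(seq) - IUPAC_NUCLEOTIDES)
        if invalid:
            problems.append(f"{seq_id}: unexpected characters {invalid}")
    return problems


example = ">VE24-1075_COI\nTGGATCATCTCTTAGAATTTTAATTCGATTWATAAT\n>bad_record\nACGT-ACGT\n"
print(check_queries(read_fasta(example)))
```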
Metadata file (metadata.csv)
The metadata file provides essential information about each sequence and must follow the structure below. Each row corresponds to a sample and should include required and, optionally, additional columns.
Required Columns
- sample_id - Unique identifier for the sample. Must match the sequence ID in the sequences.fasta file. Cannot contain spaces.
- locus - Name of the genetic locus for the sample, which must be in the list of permitted loci. If deliberately providing no locus, the value NA is also accepted.
  > [!NOTE]
  > - By default, COX1_SPECIES_PUBLIC (all published COI records from BOLD and GenBank with a minimum sequence length of 500 bp) is used for BOLD search, so the locus from metadata will be ignored when db_type = bold.
  > - You can modify the BOLD database by changing the bold_database_name parameter (see docs/params.md). However, we have not tested other BOLD databases besides COX1_SPECIES_PUBLIC.
  > - Loci synonyms will be checked as well (see scripts/config/loci.json).
  > - If you need to modify which loci and synonyms are permitted, see the technical documentation.
- preliminary_id - Preliminary morphology ID of the sample.
Optional Columns
- taxa_of_interest - Taxa of interest for the sample. If multiple, separate them with a | character.
- host - Host organism of the sample.
- country - Country of origin for the sample.
- sequencing_platform - Sequencing platform used for the sample.
- sequencing_read_coverage - Sequencing read coverage for the sample.
Example
| sample_id | locus | preliminary_id | taxa_of_interest | host | country | sequencing_platform | sequencing_read_coverage |
|---|---|---|---|---|---|---|---|
| VE24-1075_COI | COI | Aphididae | Myzus persicae\|Aphididae | Cut flower Rosa | Ecuador | Nanopore | 30x |
| VE24-1079_COI | COI | Miridae | Lygus pratensis | Cut flower Paenonia | Netherlands | Nanopore | 30x |
[!NOTE]
- All required columns must be present for every sample.
- Optional columns can be left blank or completely omitted if not applicable.
- Columns 4 and 5 are examples of "arbitrary columns" - add any arbitrary columns you like, and they will be included in the workflow report "Sample metadata".
- For more details on the metadata schema, see assets/schema_input.json.
- Example can be downloaded from test/metadata.csv.
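The metadata rules above can be checked with a short script before starting a run. This is a sketch under stated assumptions, not the pipeline's validator (which uses assets/schema_input.json). The function names are hypothetical, and the list of permitted loci is not checked because it lives in the repository's configuration.

```python
import csv
import io

REQUIRED_COLUMNS = ("sample_id", "locus", "preliminary_id")


def validate_metadata(csv_text, fasta_ids):
    """Check metadata rows against the rules listed in this README.

    Returns a list of problems; an empty list means none were found.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    problems = [f"missing required column: {c}" for c in REQUIRED_COLUMNS if c not in header]
    if problems:
        return problems

    seen = set()
    for row_number, row in enumerate(reader, start=2):  # row 1 is the header
        sample_id = (row["sample_id"] or "").strip()
        for column in REQUIRED_COLUMNS:
            if not (row[column] or "").strip():
                problems.append(f"row {row_number}: empty {column}")
        if not sample_id:
            continue
        if " " in sample_id:
            problems.append(f"row {row_number}: sample_id contains spaces")
        if sample_id in seen:
            problems.append(f"row {row_number}: duplicate sample_id {sample_id}")
        seen.add(sample_id)
        if sample_id not in fasta_ids:
            problems.append(f"row {row_number}: {sample_id} not found in FASTA")
    return problems


def split_taxa_of_interest(value):
    """Split the optional taxa_of_interest field on the '|' separator."""
    return [taxon.strip() for taxon in (value or "").split("|") if taxon.strip()]


metadata = (
    "sample_id,locus,preliminary_id,taxa_of_interest\n"
    "VE24-1075_COI,COI,Aphididae,Myzus persicae|Aphididae\n"
)
print(validate_metadata(metadata, {"VE24-1075_COI"}))
print(split_taxa_of_interest("Myzus persicae|Aphididae"))
```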
To run the pipeline against local BLAST Core Nt Database:
```bash
nextflow run /path/to/pipeline/taxodactyl/main.nf \
--metadata /path/to/metadata.csv \
--sequences /path/to/sequences.fasta \
--blastdb /path/to/blastdbs/core_nt \
--outdir /path/to/output \
-profile singularity \
--taxdb /path/to/.taxonkit/ \
--ncbi_api_key API_KEY \
--ncbi_user_email EMAIL \
--analyst_name "Magdalena Antczak" \
--facility_name "QCIF" \
-resume
```
To run the pipeline using the BOLD web database:
```bash
nextflow run /path/to/pipeline/taxodactyl/main.nf \
--metadata /path/to/metadata.csv \
--sequences /path/to/sequences.fasta \
--db_type bold \
--outdir /path/to/output \
-profile singularity \
--taxdb /path/to/.taxonkit/ \
--ncbi_api_key API_KEY \
--ncbi_user_email EMAIL \
--analyst_name "Magdalena Antczak" \
--facility_name "QCIF" \
-resume
```
[!NOTE]
- For a detailed explanation of all pipeline parameters, see parameter documentation.
- We recommend avoiding spaces in file and folder names to prevent issues in command-line operations.
- The error strategy for the workflow is set to ignore. This means that even if a process encounters an error, Nextflow will continue executing subsequent processes rather than terminating the workflow. This avoids interrupting a workflow with multiple queries when only one of them fails. Unfortunately, this behaviour prevents more detailed errors from being displayed in the standard output. Instead, you will only see which tasks failed and the hashes assigned to them, which you can use to navigate the work folder and find specific errors. As a workaround, you can run the following script at the end of your run from the directory where the pipeline was executed: bash /path/to/pipeline/taxodactyl/bin/collect_errors.sh. It displays a list of processes together with their work directory paths, the last 10 lines of standard error and the last 10 lines of standard output.
- You can find detailed instructions and practical examples for customising the pipeline configuration in the docs/customise.md file. This guide covers how to set parameters, adjust resources, change error strategies, and modify the Singularity cache directory for your Nextflow runs.
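For readers who want to see what inspecting failed tasks by hand involves, here is a rough Python sketch of the same idea. It is not the bundled collect_errors.sh and may differ from it in detail. It assumes the default Nextflow work directory layout, where each task directory holds .exitcode, .command.err and .command.out files. The function names are hypothetical.

```python
from pathlib import Path


def tail(path, n=10):
    """Return the last n lines of a text file, or [] if it does not exist."""
    if not path.is_file():
        return []
    return path.read_text(errors="replace").splitlines()[-n:]


def failed_tasks(work_dir="work"):
    """Yield (task_dir, exit_code, stderr_tail, stdout_tail) for failed tasks."""
    # Task directories sit two levels below the work directory (work/ab/cdef...).
    for exit_file in sorted(Path(work_dir).glob("*/*/.exitcode")):
        code = exit_file.read_text().strip()
        if code == "0":
            continue  # task succeeded
        task_dir = exit_file.parent
        yield task_dir, code, tail(task_dir / ".command.err"), tail(task_dir / ".command.out")


if __name__ == "__main__":
    for task_dir, code, err_lines, out_lines in failed_tasks():
        print(f"{task_dir} (exit code {code})")
        print("  stderr:", *err_lines, sep="\n    ")
        print("  stdout:", *out_lines, sep="\n    ")
```

Run it from the directory where the pipeline was launched, so that the relative work path resolves.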
Pipeline output
After running the pipeline, the output directory will contain a separate folder for each query sequence and a folder with information about the run. Here, we show the results folder structure when using the two databases. For more information, see the output documentation. See this document for a detailed description of the analysis and interpretation of the workflow report.
BLAST Core Nucleotide Database
.
├── blast_result.xml
├── pipeline_info
│ ├── execution_report_2025-06-22_22-53-15.html
│ ├── execution_timeline_2025-06-22_22-53-15.html
│ ├── execution_trace_2025-06-22_22-53-15.txt
│ ├── params_2025-06-22_22-53-29.json
│ └── pipeline_dag_2025-06-22_22-53-15.html
├── query_001_VE24-1075_COI
│ ├── all_hits.fasta
│ ├── candidates.csv
│ ├── candidates.fasta
│ ├── candidates_identity_boxplot.png
│ ├── candidates_phylogeny.fasta
│ ├── candidates_phylogeny.msa
│ ├── candidates_phylogeny.nwk
│ └── report_VE24-1075_COI_20250622_225319.html
└── query_002_VE24-1079_COI
    ├── all_hits.fasta
    ├── candidates.csv
    ├── candidates.fasta
    ├── candidates_phylogeny.fasta
    ├── candidates_phylogeny.msa
    ├── candidates_phylogeny.nwk
    └── report_VE24-1079_COI_20250622_225319.html
BOLD
.
├── pipeline_info
│ ├── execution_report_2025-06-22_22-53-22.html
│ ├── execution_timeline_2025-06-22_22-53-22.html
│ ├── execution_trace_2025-06-22_22-53-22.txt
│ ├── params_2025-06-22_22-53-34.json
│ └── pipeline_dag_2025-06-22_22-53-22.html
├── query_001_VE24-1075_COI
│ ├── all_hits.fasta
│ ├── candidates.csv
│ ├── candidates.fasta
│ ├── candidates_phylogeny.fasta
│ ├── candidates_phylogeny.msa
│ ├── candidates_phylogeny.nwk
│ └── report_BOLD_VE24-1075_COI_20250622_225326.html
└── query_002_VE24-1079_COI
    ├── all_hits.fasta
    ├── candidates.csv
    ├── candidates.fasta
    ├── candidates_identity_boxplot.png
    ├── candidates_phylogeny.fasta
    ├── candidates_phylogeny.msa
    ├── candidates_phylogeny.nwk
    └── report_BOLD_VE24-1079_COI_20250622_225326.html
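Given the folder layout above, a small script can list which queries produced a report and which expected files are absent. This is useful because failed tasks are ignored rather than stopping the run. The sketch below relies only on the file names shown in the trees. The function name and the choice of which files count as "expected" are assumptions.

```python
from pathlib import Path

# Files that appear in every query folder in the example trees above.
EXPECTED_FILES = (
    "all_hits.fasta",
    "candidates.csv",
    "candidates.fasta",
    "candidates_phylogeny.nwk",
)


def summarise_output(outdir):
    """Map each query folder name to its report files and missing outputs."""
    summary = {}
    for query_dir in sorted(Path(outdir).glob("query_*")):
        if not query_dir.is_dir():
            continue
        reports = sorted(p.name for p in query_dir.glob("report_*.html"))
        missing = [name for name in EXPECTED_FILES if not (query_dir / name).exists()]
        summary[query_dir.name] = {"reports": reports, "missing": missing}
    return summary


if __name__ == "__main__":
    for name, info in summarise_output("/path/to/output").items():
        status = "ok" if info["reports"] and not info["missing"] else "incomplete"
        print(name, status, info)
```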
Credits
qcif/taxodactyl was originally written by Magdalena Antczak, Cameron Hyde, and Daisy Li from QCIF Ltd. The project was funded by the Department of Agriculture, Fisheries and Forestry and the Australian BioCommons.
The workflow was designed by:
- Cameron Hyde
- Magdalena Antczak
- Lanxi (Daisy) Li
- Valentine Murigneux
- Sarah Williams
- Michael Thang
- Bradley Pease
- Shaun Bochow
- Grace Sun
Citations
If you use qcif/taxodactyl for your analysis, please cite it as follows:
Antczak, M., Hyde, C., Li, Lanxi (Daisy), Murigneux, V., Williams, S., Thang, M., Pease, B., Bochow, S., & Sun, G. (2025). TAXODACTYL - High-confidence, evidence-based taxonomic assignment of DNA sequences. WorkflowHub. https://doi.org/10.48546/WORKFLOWHUB.WORKFLOW.1782.3
An extensive list of references for the tools used by the pipeline can be found in the CITATIONS.md file.
qcif/taxodactyl uses code and infrastructure developed and maintained by the nf-core community, reused here under the MIT license.
The nf-core framework for community-curated bioinformatics pipelines.
Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.
Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.
Owner
- Name: QCIF
- Login: qcif
- Kind: organization
- Location: Queensland, Australia
- Website: https://www.qcif.edu.au/
- Repositories: 45
- Profile: https://github.com/qcif
Queensland Cyber Infrastructure Foundation
Citation (CITATIONS.md)
# qcif/taxodactyl: Citations

## [nf-core](https://pubmed.ncbi.nlm.nih.gov/32055031/)

> Ewels PA, Peltzer A, Fillinger S, Patel H, Alneberg J, Wilm A, Garcia MU, Di Tommaso P, Nahnsen S. The nf-core framework for community-curated bioinformatics pipelines. Nat Biotechnol. 2020 Mar;38(3):276-278. doi: 10.1038/s41587-020-0439-x. PubMed PMID: 32055031.

## [Nextflow](https://pubmed.ncbi.nlm.nih.gov/28398311/)

> Di Tommaso P, Chatzou M, Floden EW, Barja PP, Palumbo E, Notredame C. Nextflow enables reproducible computational workflows. Nat Biotechnol. 2017 Apr 11;35(4):316-319. doi: 10.1038/nbt.3820. PubMed PMID: 28398311.

## Pipeline tools

- [BLAST+](https://pubmed.ncbi.nlm.nih.gov/20003500/)

  > Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, Madden TL. BLAST+: architecture and applications. BMC Bioinformatics. 2009 Dec 15;10:421. doi: 10.1186/1471-2105-10-421. PMID: 20003500; PMCID: PMC2803857.

- [MAFFT](https://pubmed.ncbi.nlm.nih.gov/23329690/)

  > Katoh K, Standley DM. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol Biol Evol. 2013 Apr;30(4):772-80. doi: 10.1093/molbev/mst010. Epub 2013 Jan 16. PMID: 23329690; PMCID: PMC3603318.

- [FastME](https://pmc.ncbi.nlm.nih.gov/articles/PMC4576710/)

  > Lefort V, Desper R, Gascuel O. FastME 2.0: A Comprehensive, Accurate, and Fast Distance-Based Phylogeny Inference Program. Mol Biol Evol. 2015 Oct;32(10):2798-800. doi: 10.1093/molbev/msv150. Epub 2015 Jun 30. PMID: 26130081; PMCID: PMC4576710.

- [TaxonKit](https://pubmed.ncbi.nlm.nih.gov/34001434/)

  > Shen W, Ren H. TaxonKit: A practical and efficient NCBI taxonomy toolkit. J Genet Genomics. 2021 Sep 20;48(9):844-850. doi: 10.1016/j.jgg.2021.03.006. Epub 2021 Apr 15. PMID: 34001434.

## Software packaging/containerisation tools

- [Anaconda](https://anaconda.com)

  > Anaconda Software Distribution. Computer software. Vers. 2-2.4.0. Anaconda, Nov. 2016. Web.

- [Bioconda](https://pubmed.ncbi.nlm.nih.gov/29967506/)

  > Grüning B, Dale R, Sjödin A, Chapman BA, Rowe J, Tomkins-Tinch CH, Valieris R, Köster J; Bioconda Team. Bioconda: sustainable and comprehensive software distribution for the life sciences. Nat Methods. 2018 Jul;15(7):475-476. doi: 10.1038/s41592-018-0046-7. PubMed PMID: 29967506.

- [BioContainers](https://pubmed.ncbi.nlm.nih.gov/28379341/)

  > da Veiga Leprevost F, Grüning B, Aflitos SA, Röst HL, Uszkoreit J, Barsnes H, Vaudel M, Moreno P, Gatto L, Weber J, Bai M, Jimenez RC, Sachsenberg T, Pfeuffer J, Alvarez RV, Griss J, Nesvizhskii AI, Perez-Riverol Y. BioContainers: an open-source and community-driven framework for software standardization. Bioinformatics. 2017 Aug 15;33(16):2580-2582. doi: 10.1093/bioinformatics/btx192. PubMed PMID: 28379341; PubMed Central PMCID: PMC5870671.

- [Docker](https://dl.acm.org/doi/10.5555/2600239.2600241)

  > Merkel, D. (2014). Docker: lightweight linux containers for consistent development and deployment. Linux Journal, 2014(239), 2. doi: 10.5555/2600239.2600241.

- [Singularity](https://pubmed.ncbi.nlm.nih.gov/28494014/)

  > Kurtzer GM, Sochat V, Bauer MW. Singularity: Scientific containers for mobility of compute. PLoS One. 2017 May 11;12(5):e0177459. doi: 10.1371/journal.pone.0177459. eCollection 2017. PubMed PMID: 28494014; PubMed Central PMCID: PMC5426675.
GitHub Events
Total
- Create event: 8
- Issues event: 90
- Release event: 1
- Watch event: 1
- Delete event: 5
- Member event: 1
- Issue comment event: 18
- Push event: 137
- Pull request event: 4
Last Year
- Create event: 8
- Issues event: 90
- Release event: 1
- Watch event: 1
- Delete event: 5
- Member event: 1
- Issue comment event: 18
- Push event: 137
- Pull request event: 4
Committers
Last synced: 8 months ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| Ubuntu | u****u@c****l | 126 |
| Cameron Hyde | c****e@n****m | 7 |
| ubuntu | u****u@c****0 | 1 |
Committer Domains (Top 20 + Academic)
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 79
- Total pull requests: 3
- Average time to close issues: 11 days
- Average time to close pull requests: 7 days
- Total issue authors: 2
- Total pull request authors: 1
- Average comments per issue: 1.62
- Average comments per pull request: 0.33
- Merged pull requests: 2
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 79
- Pull requests: 3
- Average time to close issues: 11 days
- Average time to close pull requests: 7 days
- Issue authors: 2
- Pull request authors: 1
- Average comments per issue: 1.62
- Average comments per pull request: 0.33
- Merged pull requests: 2
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- neoformit (42)
- mantczakaus (37)
Pull Request Authors
- neoformit (3)