https://github.com/bede/deacon
FASTX search and [host] depletion using minimizers
Statistics
- Stars: 80
- Watchers: 3
- Forks: 11
- Open Issues: 7
- Releases: 11
Metadata Files
README.md
Deacon

Search and depletion of FASTA/FASTQ files and streams using accelerated minimizer matching. Default parameters balance sensitivity and specificity for the application of microbial metagenomic host depletion, for which a validated prebuilt index is available. Classification sensitivity, specificity and memory requirements may be tuned by varying k-mer length (-k), window size (-w), and the two match thresholds (-a and -r). Minimizer k and w are chosen at index time, while the match thresholds can be chosen at filter time. To be considered a match, sequences must meet both an absolute threshold (-a, default 2 minimizer hits) and a relative threshold (-r, default 0.01 or 1% of minimizers). Paired sequences are also supported: a match in either mate causes both mates in the pair to be retained or discarded; deacon filter retains only matches by default (search mode) and discards matches in --deplete mode. Deacon reports filtering performance during execution and optionally writes a JSON --summary upon completion. Sequences can optionally be renamed using --rename for privacy and smaller file sizes. Gzip, zst and xz compression formats are natively supported and detected by file extension.
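The two-threshold rule described above can be sketched in a few lines of Python. This is an illustration only, not Deacon's implementation: the function names are hypothetical, the inclusive `>=` comparisons are an assumption, and so is taking the relative threshold as hits divided by the number of minimizers in the query.

```python
def is_match(hits: int, total_minimizers: int,
             abs_threshold: int = 2, rel_threshold: float = 0.01) -> bool:
    """Return True when BOTH thresholds are met (hypothetical helper).

    hits: minimizers of the query that were found in the index.
    total_minimizers: minimizers in the query (assumed denominator for -r).
    """
    fraction = hits / total_minimizers if total_minimizers else 0.0
    return hits >= abs_threshold and fraction >= rel_threshold


def keep_record(hits: int, total_minimizers: int, deplete: bool = False,
                abs_threshold: int = 2, rel_threshold: float = 0.01) -> bool:
    """Search mode keeps matches; --deplete mode keeps non-matches."""
    matched = is_match(hits, total_minimizers, abs_threshold, rel_threshold)
    return not matched if deplete else matched


if __name__ == "__main__":
    print(is_match(2, 100))                   # 2 hits, 2% of minimizers
    print(is_match(2, 1000))                  # 2 hits, but only 0.2%
    print(keep_record(2, 100, deplete=True))  # a matching read is discarded
```

Under these assumptions, a long read with two hits among 1,000 minimizers meets the absolute threshold but fails the relative one, which is why raising `-r` makes matching more specific.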
Deacon is capable of filtering compressed long reads at >500Mbp/s and indexing a human genome in <30s (Apple M1). Filtering at >2Gbp/s is possible with uncompressed input. Peak memory usage during filtering is 5GB for the default panhuman index. Use Zstandard (zst) compression and/or pipe output to an external compressor such as pigz for best performance.
Benchmarks for panhuman host depletion of complex microbial metagenomes are described in a preprint. Among tested approaches, Deacon with the panhuman-1 (k=31, w=15) index exhibited the highest balanced accuracy for both long and short simulated reads. Deacon was however less specific than Hostile for short reads.
[!IMPORTANT] Deacon is actively developed and unstable. Take note of software and index version(s) used in order to guarantee reproducibility of your results. Carefully review the CHANGELOG when updating. Version 0.7.0 introduced a new index container format that is incompatible with prior versions. Please report any problems you encounter by creating an issue or using the email address in my profile.
Install
cargo

```bash
cargo install deacon
```

conda/mamba/pixi

```bash
conda install -c bioconda deacon
```
Usage
Indexing
Use deacon index build to quickly build custom indexes. For human host depletion, the prebuilt validated panhuman index is recommended, available for download below from Zenodo or faster object storage. Object storage is provided by the ModMedMicro research unit at the University of Oxford.
```shell
deacon index build chm13v2.fa > human.k31w15.idx

# Discard very low complexity minimizers
deacon index build -e 0.5 chm13v2.fa > human.k31w15e5.idx
```
Prebuilt indexes
| Name/URL | Composition | Minimizers | Subtracted minimizers | Size | Date |
| :----------------------------------------------------------: | :----------------------------------------------------------: | ----------- | --------------------- | ----- | ------- |
| panhuman-1 (k=31, w=15) Cloud, Zenodo | HPRC Year 1 ∪ CHM13v2.0 ∪ GRCh38.p14 - bacteria (FDA-ARGOS) - viruses (RefSeq) | 409,913,780 | 20,781 (0.0051%) | 3.7GB | 2025-07 |
| panmouse-1a (k=31, w=15, e=0.5) Cloud | GRCm39 ∪ PRJEB47108 - bacteria (FDA-ARGOS) - viruses (RefSeq) | 548,331,948 | 8,246 (0.0015%) | 4.6GB | 2025-08 |
Filtering
The main command deacon filter accepts an index path followed by up to two query FASTA/FASTQ file paths, depending on whether query sequences originate from stdin, a single file, or paired input files. Paired queries are supported as either separate files or interleaved stdin, and written interleaved to either stdout or file, or else to separate paired output files. For paired reads, distinct minimizer hits originating from either mate are counted. By default, query sequences must meet both an absolute threshold of 2 minimizer hits (-a 2) and a relative threshold of 1% of minimizers (-r 0.01) to pass the filter. Filtering can be inverted for e.g. host depletion using the --deplete (-d) flag. Gzip, Zstandard, and xz compression formats are detected automatically by file extension. Use Zstandard compression rather than Gzip where possible for best performance.
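As a toy illustration of pair-level matching, the sketch below pools distinct minimizers from both mates before applying the thresholds. It is not Deacon's code: small integers stand in for minimizer hashes, `pair_matches` is a hypothetical name, and computing the relative threshold over the pooled distinct minimizers of the pair is an assumption.

```python
from typing import Set


def pair_matches(mate1: Set[int], mate2: Set[int], index: Set[int],
                 abs_threshold: int = 2, rel_threshold: float = 0.01) -> bool:
    """Decide once per pair, pooling distinct minimizers from both mates.

    Assumption: the relative threshold is evaluated against the distinct
    minimizers of the whole pair, not of each mate separately.
    """
    query = mate1 | mate2   # distinct minimizers across both mates
    hits = query & index    # those also present in the index
    if not query:
        return False
    return len(hits) >= abs_threshold and len(hits) / len(query) >= rel_threshold


if __name__ == "__main__":
    index = {1, 2, 3}
    # One hit in each mate: together they reach the absolute threshold of 2.
    print(pair_matches({1, 10, 11}, {2, 12}, index))
    # The same minimizer seen in both mates counts as one distinct hit.
    print(pair_matches({1, 10}, {1, 11}, index))
```

The decision is made for the pair as a whole, so both mates are kept or discarded together.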
Examples
```bash
# Keep only human sequences
deacon filter panhuman-1.k31w15.idx reads.fq.gz > filt.fq

# Host depletion using the panhuman-1 index and default thresholds
deacon filter -d panhuman-1.k31w15.idx reads.fq.gz -o filt.fq.gz

# Max sensitivity with absolute threshold of 1 and no relative threshold
deacon filter -d -a 1 -r 0 panhuman-1.k31w15.idx reads.fq.gz -o filt.fq.gz

# More specific 10% relative match threshold
deacon filter -d -r 0.1 panhuman-1.k31w15.idx reads.fq.gz -o filt.fq.gz

# Stdin and stdout
zcat reads.fq.gz | deacon filter -d panhuman-1.k31w15.idx > filt.fq

# Faster Zstandard compression
deacon filter -d panhuman-1.k31w15.idx reads.fq.zst -o filt.fq.zst

# Fast gzip with pigz
deacon filter -d panhuman-1.k31w15.idx reads.fq.gz | pigz > filt.fq.gz

# Paired reads
deacon filter -d panhuman-1.k31w15.idx r1.fq.gz r2.fq.gz > filt12.fq
deacon filter -d panhuman-1.k31w15.idx r1.fq.gz r2.fq.gz -o filt.r1.fq.gz -O filt.r2.fq.gz
zcat r12.fq.gz | deacon filter -d panhuman-1.k31w15.idx - - > filt12.fq

# Save summary JSON
deacon filter -d panhuman-1.k31w15.idx reads.fq.gz -o filt.fq.gz -s summary.json

# Replace read headers with incrementing integers
deacon filter -d -R panhuman-1.k31w15.idx reads.fq.gz > filt.fq

# Only look for minimizer hits inside the first 1000bp per record
deacon filter -d -p 1000 panhuman-1.k31w15.idx reads.fq.gz > filt.fq

# Debug mode: see sequences with minimizer hits in stderr
deacon filter -d --debug panhuman-1.k31w15.idx reads.fq.gz > filt.fq
```
Command line reference
Filtering
```bash
$ deacon filter -h
Keep or discard DNA fastx records with sufficient minimizer hits to the index

Usage: deacon filter [OPTIONS]

Arguments:

Options:
  -o, --output
```
Indexing
```bash
$ deacon index -h
Create and compose minimizer indexes

Usage: deacon index

Commands:
  build  Index minimizers contained within a fastx file
  info   Show index information
  union  Combine multiple minimizer indexes (A ∪ B…)
  diff   Subtract minimizers in one index from another (A - B)
  help   Print this message or the help of the given subcommand(s)

Options:
  -h, --help  Print help
```
```bash
$ deacon index build -h
Index minimizers contained within a fastx file

Usage: deacon index build [OPTIONS]

Arguments:
  Path to input fastx file (supports gz, zst and xz compression)

Options:
  -k
```
Building custom indexes
Building custom Deacon indexes is quite fast. Nevertheless, when indexing many large genomes, it may be worthwhile separately indexing and subsequently combining indexes into one succinct index. Combine distinct minimizers from multiple indexes using deacon index union. Similarly, use deacon index diff to subtract the minimizers contained in one index from another. This can be helpful for e.g. eliminating shared minimizers between the target and host genomes when building custom (non-human) indexes for host depletion.
- Use `deacon index union 1.idx 2.idx 3.idx… > 1+2+3.idx` to succinctly combine two (or more!) deacon indexes.
- Use `deacon index diff 1.idx 2.idx > 1-2.idx` to subtract the minimizers in 2.idx from 1.idx. Useful for masking out shared minimizer content between e.g. target and host genomes.
- In version `0.7.0` and above, `deacon index diff` also supports subtracting minimizers from an index using a fastx file or stream, e.g. `deacon index diff 1.idx 2.fa.gz > 1-2.idx` or `zcat *.fa.gz | deacon index diff 1.idx - > 1-2.idx`.
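The index operations above behave like set operations on minimizers, matching the `A ∪ B` and `A - B` notation in the command help. The Python sketch below models them with small integers standing in for minimizer hashes. The function names are hypothetical, and this says nothing about how Deacon stores indexes on disk.

```python
from typing import Set


def index_union(*indexes: Set[int]) -> Set[int]:
    """Model of `deacon index union`: distinct minimizers from all inputs."""
    return set().union(*indexes)


def index_diff(a: Set[int], b: Set[int]) -> Set[int]:
    """Model of `deacon index diff A B`: minimizers of A that are not in B."""
    return a - b


if __name__ == "__main__":
    host_a = {1, 2, 3, 4}
    host_b = {3, 4, 5}
    target = {4, 5, 99}   # e.g. minimizers from genomes you want to keep

    combined = index_union(host_a, host_b)   # {1, 2, 3, 4, 5}
    masked = index_diff(combined, target)    # {1, 2, 3}
    print(sorted(combined), sorted(masked))
```

Subtracting the target's minimizers from the host index means reads from the target can no longer hit those shared minimizers, which reduces the risk of discarding them during depletion.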
For best performance, set the --capacity argument of deacon index build to a value, in millions of minimizers, greater than the number you expect your index to contain. Setting this too low causes delays during indexing while the hash table is resized.
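One way to choose a capacity is to round an expected minimizer count up to whole millions with some headroom. The helper below is a hypothetical convenience, not part of Deacon, and the 10% headroom factor is an arbitrary assumption.

```python
import math


def capacity_millions(expected_minimizers: int, headroom: float = 1.1) -> int:
    """Suggest a capacity value in millions of minimizers (hypothetical helper).

    headroom: multiplier applied before rounding up; 1.1 is an assumed default.
    """
    return math.ceil(expected_minimizers * headroom / 1_000_000)


if __name__ == "__main__":
    # panhuman-1 is listed above with 409,913,780 minimizers.
    print(capacity_millions(409_913_780))
```

For an index of similar size to panhuman-1, this suggests a capacity of 451 (million minimizers).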
Filtering summary statistics
Use -s summary.json to save detailed filtering statistics:
```json
{
  "version": "deacon 0.9.0",
  "index": "panhuman-1.k31w15.idx",
  "input": "HG02334.1m.fastq.gz",
  "input2": null,
  "output": "-",
  "output2": null,
  "k": 31,
  "w": 15,
  "abs_threshold": 2,
  "rel_threshold": 0.01,
  "prefix_length": 0,
  "deplete": true,
  "rename": false,
  "seqs_in": 1000000,
  "seqs_out": 13452,
  "seqs_removed": 986548,
  "seqs_removed_proportion": 0.986548,
  "bp_in": 5477122928,
  "bp_out": 5710050,
  "bp_removed": 5471412878,
  "bp_removed_proportion": 0.9989574727324798,
  "time": 125.755103875,
  "seqs_per_second": 7951,
  "bp_per_second": 43553881
}
```
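The derived fields in the summary follow from the raw counts, so a downstream script can sanity-check a report. The Python sketch below recomputes them from a subset of the example values above. The `recompute` helper is hypothetical, and the assumption that the per-second rates are truncated to integers is inferred from this one example.

```python
import json

# Subset of the example summary shown above.
SUMMARY_JSON = """
{
  "seqs_in": 1000000, "seqs_out": 13452, "seqs_removed": 986548,
  "seqs_removed_proportion": 0.986548,
  "bp_in": 5477122928, "bp_out": 5710050, "bp_removed": 5471412878,
  "bp_removed_proportion": 0.9989574727324798,
  "time": 125.755103875, "seqs_per_second": 7951, "bp_per_second": 43553881
}
"""


def recompute(summary: dict) -> dict:
    """Recompute derived fields from raw counts (hypothetical helper)."""
    seqs_removed = summary["seqs_in"] - summary["seqs_out"]
    bp_removed = summary["bp_in"] - summary["bp_out"]
    return {
        "seqs_removed": seqs_removed,
        "seqs_removed_proportion": seqs_removed / summary["seqs_in"],
        "bp_removed": bp_removed,
        "bp_removed_proportion": bp_removed / summary["bp_in"],
        # Assumption: rates are truncated to whole numbers.
        "seqs_per_second": int(summary["seqs_in"] / summary["time"]),
        "bp_per_second": int(summary["bp_in"] / summary["time"]),
    }


if __name__ == "__main__":
    reported = json.loads(SUMMARY_JSON)
    for key, value in recompute(reported).items():
        print(f"{key}: reported={reported[key]} recomputed={value}")
```

In a real pipeline the dictionary would come from `json.load` on the file written by `-s summary.json`.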
Citation
Bede Constantinides, John Lees, Derrick W Crook. "Deacon: fast sequence filtering and contaminant depletion" bioRxiv 2025.06.09.658732, https://doi.org/10.1101/2025.06.09.658732
Please also consider citing the SimdMinimizers paper:
Ragnar Groot Koerkamp, Igor Martayan. "SimdMinimizers: Computing random minimizers, fast" bioRxiv 2025.01.27.634998, https://doi.org/10.1101/2025.01.27.634998
Owner
- Name: Bede Constantinides
- Login: bede
- Kind: user
- Company: Oxford Nanopore Technologies
- Website: bede.im
- Twitter: beconsta
- Repositories: 76
- Profile: https://github.com/bede
GitHub Events
Total
- Create event: 18
- Issues event: 32
- Release event: 6
- Watch event: 52
- Delete event: 3
- Issue comment event: 44
- Public event: 1
- Push event: 71
- Pull request review comment event: 1
- Pull request review event: 1
- Pull request event: 10
- Fork event: 7
Last Year
- Create event: 18
- Issues event: 32
- Release event: 6
- Watch event: 52
- Delete event: 3
- Issue comment event: 44
- Public event: 1
- Push event: 71
- Pull request review comment event: 1
- Pull request review event: 1
- Pull request event: 10
- Fork event: 7
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 25
- Total pull requests: 6
- Average time to close issues: 6 days
- Average time to close pull requests: 6 days
- Total issue authors: 10
- Total pull request authors: 4
- Average comments per issue: 1.28
- Average comments per pull request: 0.83
- Merged pull requests: 5
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 25
- Pull requests: 6
- Average time to close issues: 6 days
- Average time to close pull requests: 6 days
- Issue authors: 10
- Pull request authors: 4
- Average comments per issue: 1.28
- Average comments per pull request: 0.83
- Merged pull requests: 5
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- bede (15)
- funnell (2)
- cslamo (1)
- arpit20328 (1)
- dutchscientist (1)
- naturepoker (1)
- Ben7124 (1)
- ASLeonard (1)
- pmenzel (1)
- imartayan (1)
Pull Request Authors
- RagnarGrootKoerkamp (3)
- lmmx (1)
- imartayan (1)
- tmaklin (1)
Packages
- Total packages: 1
- Total downloads:
  - cargo 3,685 total
- Total dependent packages: 0
- Total dependent repositories: 0
- Total versions: 11
- Total maintainers: 1
crates.io: deacon
Fast DNA sequence filtering with minimizers
- Homepage: https://github.com/bede/deacon
- Documentation: https://docs.rs/deacon/
- License: MIT
- Latest release: 0.10.0 (published 6 months ago)